HyperAI

Event Review | AMD, Muxi Integrated Circuits, ByteDance, and Peking University Dissect the Cross-Hardware Unified Compilation Ecosystem


In the era of large models, compilers are once again in the spotlight. On July 5, HyperAI held the 7th Meet AI Compiler technology salon in Zhongguancun, Beijing, focusing on distributed communication, domestic GPU compilation stacks, new programming language design, and open-source ecosystem building. Senior AI compiler experts from AMD, Muxi Integrated Circuits, ByteDance, and Peking University systematically presented the key mechanisms and implementation details of projects on technical paths that are actively being pursued and have already delivered real results.

Follow the WeChat public account "HyperAI Super Neuro" and reply with the keyword "0705 AI Compiler" to obtain the speech slides authorized by the lecturers.

In the roundtable session, Feng Siyuan, assistant professor at Shanghai Chuangzhi College and an Apache TVM PMC member, served as moderator. Around the theme of a "unified compilation ecosystem across hardware," he and the four speakers discussed in depth the collaboration and challenges across different hardware platforms.

The event was not just about the speakers' knowledge sharing on stage; the interaction from community members was equally lively. Whether through in-depth questions on technical details, extended discussions on solution choices, or free exchanges during the tea break, everyone shared their experience and insights openly and talked candidly about the practical problems they had encountered. This kind of two-way exchange is what keeps our technical community warm. With that, the salon came to a successful close.

Event content review

Below is a brief introduction to each talk, with links to the full write-ups.

Topic: Helping the open-source community by analyzing the AMD Triton compiler

Contents: Triton is a programming language proposed by OpenAI, designed to simplify the development of high-performance GPU kernels. It has been widely adopted in mainstream LLM training and inference frameworks. Users can implement GPU kernels by writing Python Triton code without worrying about low-level GPU architecture details, which greatly lowers the barrier to GPU code development.

AMD has implemented the Triton compiler on its GPU platforms and contributed it to the Triton open-source community. Optimizing GPU code performance requires understanding the Triton compiler and its role in kernel performance optimization. This talk discusses the AMD Triton compiler in detail and introduces how the compiler improves Triton's performance on AMD GPU platforms.

From this talk, you will learn:

1. Introduction to AMD GPU architecture

2. AMD's latest work in the Triton open-source community

Click to view the full talk write-up:

AMD AI Architect Zhang Ning: Analyzing AMD Triton Compiler from Multiple Perspectives to Help Build an Open Source Ecosystem

Topic: TVM in practice on Muxi GPUs

Contents: This talk focuses on how to apply TVM on Muxi GPUs: generating high-performance operators with TVM so that mainstream TVM-based AI frameworks can run on Muxi hardware.

From this talk, you will learn:

1. Problems that may arise when adapting TVM to domestic GPGPUs

2. The benefits TVM brings on domestic GPGPUs, and where further breakthroughs are needed

3. The current support status of AI compilers such as TVM on domestic GPGPUs, and how to grow the related ecosystem

Click to view the full talk write-up:

From architectural features to ecosystem building: Muxi's Dong Zhaohua analyzes TVM in practice on domestic GPUs in depth

Topic: Triton-distributed: native Python programming for high-performance communication

Contents: Single-chip scaling is approaching a bottleneck, and a single accelerator can no longer support large language model training and inference, so distributed systems have become a hard requirement. Computation, memory access, and communication run concurrently in distributed systems, yet existing frameworks mostly optimize each one independently, making it difficult to jointly unlock cluster performance.

This talk presents Triton-distributed, a Triton compiler extension that is the first to advocate native overlapping optimization for distributed AI workloads and covers multi-framework optimization. By integrating OpenSHMEM communication primitives and using the compiler to jointly optimize computation, memory access, and communication, it demonstrates overlapping techniques and single-node and multi-node programming methods. The generated code fully utilizes heterogeneous resources in a cluster environment, outperforming hand-optimized code at a development cost significantly lower than CUDA/C++.
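Triton-distributed itself targets GPU clusters, but the core idea of overlapping can be sketched in plain Python. The toy example below (our own illustration, not code from the talk or the framework) prefetches the next data chunk while computing on the current one, so communication latency hides behind computation:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(chunk):
    # Stands in for a communication primitive (e.g., a remote read).
    time.sleep(0.05)          # simulated network latency
    return chunk

def compute(data):
    # Stands in for a local compute kernel.
    time.sleep(0.05)          # simulated compute time
    return data * 2

def run_overlapped(chunks):
    if not chunks:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch, chunks[0])   # prefetch the first chunk
        for i in range(len(chunks)):
            data = future.result()
            if i + 1 < len(chunks):
                # Issue the next transfer before computing on this chunk,
                # so communication and computation proceed concurrently.
                future = pool.submit(fetch, chunks[i + 1])
            results.append(compute(data))
    return results
```

With n chunks, the overlapped pipeline takes roughly one fetch plus n compute steps, instead of n fetch-plus-compute steps when fully serialized; compilers like Triton-distributed apply this same scheduling idea to real communication and compute kernels.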

From this talk, you will learn:

1. The latest Triton-distributed techniques

2. The challenges of programming communication from Python

3. Future directions for distributed compilation

Click to view the full talk write-up:

Training performance significantly improved: ByteDance's Zheng Size explains how the Triton-distributed framework achieves efficient, integrated distributed communication and computation for large models

Topic: TileLang: taking the pain out of operator development while keeping performance strong

Contents: This talk introduces a new operator programming language, TileLang. Through explicit tile-level primitives and an automatic inference mechanism, it enables developers to efficiently implement hardware-aware neural operators, balancing fine-grained control with development efficiency.

From this talk, you will learn:

1. A simpler, more efficient language for high-performance operator development

2. TileLang's core design concepts and technical advantages

Click to view the full talk write-up:

Tile-level primitives combined with automatic inference: the founder of the TileAI community analyzes TileLang's core technology and advantages in depth

2025 Meet AI Compiler · Stay tuned

From 2023 to 2025, we successfully held 7 offline meetups in Beijing, Shanghai, and Shenzhen, bringing together thousands of senior practitioners and enthusiasts and gradually building a rich community ecosystem. In 2025, we will continue to develop the AI Compiler City Map, and we sincerely invite companies and community partners to join in co-creation in whatever form suits them: recommending speakers, providing venues, or sponsoring tea breaks are all warmly welcome~

Let's work together to build the most active AI compiler community in China! Finally, here is a group photo from the event ❤️

Organizers and partners

As a premier global community in artificial intelligence and high-performance computing, HyperAI (hyper.ai) is committed to supporting developers and enthusiasts across the global data science and AI industry with a comprehensive suite of services, including industry news reports, accelerated dataset downloads, online tutorials, benchmarks of leading AI models, curated recommendations of cutting-edge research papers, in-depth interpretations of high-impact results, and integration with top conference calendars. HyperAI empowers developers to explore, understand, and apply AI, driving innovation and shaping the future of artificial intelligence in collaboration with the community.

Visit the official website:https://hyper.ai/

OpenBayes Bayesian Computing is a leading high-performance computing service provider in China. By grafting classic software ecosystems and machine learning models onto new-generation heterogeneous chips, it provides industrial enterprises and university research teams with faster, easier-to-use data science computing products. Its products have been adopted in dozens of large-scale industrial scenarios and by leading research institutes.

Visit the official website:https://openbayes.com/

The MLC.AI community was established in June 2022. Led by Chen Tianqi, the main inventor of Apache TVM and a well-known young scholar in machine learning, the team launched the MLC online course, which systematically introduces the key elements and core concepts of machine learning compilation.

In November 2022, thanks to the joint efforts of MLC.AI community volunteers, the first complete Chinese TVM documentation went online, hosted on the HyperAI official website, giving domestic developers interested in machine learning compilation the foundational resource for accessing and learning the technology: documentation.

MLC Online Courses:https://mlc.ai/

TVM Chinese Documentation:https://tvm.hyper.ai/

Founded in April 2011, Garage Coffee is one of the earliest platforms in China to focus on early-stage Internet startups. Built around the concept of "mass entrepreneurship," it provides early-stage entrepreneurs with a low-cost, convenient, full-element, and open innovation and entrepreneurship service platform.

As the first makerspace on Beijing Zhongguancun's Entrepreneurship Street, Garage Coffee uses the coffee shop as an interactive venue, providing entrepreneurial teams with shared office space and incubation services built on sharing, mutual promotion, integration, and symbiosis. It is the world's first entrepreneurship-themed coffee shop and one of China's most influential national makerspaces and international innovation and entrepreneurship platforms.

Event Support

Get the PPT: Follow the WeChat public account "HyperAI Super Neuro" and reply with the keyword "0705 AI Compiler" to obtain the speech slides authorized by the lecturers.

Scan the QR code to join the event group ⬇️