
Abstract
Modern AI workloads rely heavily on optimized computing kernels for both training and inference. These AI kernels follow well-defined data-flow patterns, such as moving tiles between DRAM and SRAM and performing a sequence of computations on those tiles. However, writing high-performance kernels remains complex despite the clarity of these patterns. Achieving peak performance requires careful, hardware-centric optimizations to fully leverage modern accelerators. While domain-specific compilers attempt to reduce the burden of writing high-performance kernels, they often struggle with usability and expressiveness gaps. In this paper, we present TileLang, a generalized tiled programming model for more efficient AI kernel programming. TileLang decouples the scheduling space (thread binding, layout, tensorization, and pipelining) from the dataflow, and encapsulates it as a set of customization annotations and primitives. This approach allows users to focus on the kernel's dataflow itself, while leaving most other optimizations to the compiler. We conduct comprehensive experiments on commonly used devices; across numerous benchmarks, our evaluation shows that TileLang achieves state-of-the-art performance on key kernels, demonstrating that its unified block-and-thread paradigm and transparent scheduling capabilities deliver both the power and flexibility demanded by modern AI system development.
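To make the tiled data-flow pattern the abstract describes concrete, the following is a minimal plain-NumPy sketch (not TileLang syntax, and not the paper's implementation) of a tiled matrix multiply: the loop body expresses only the dataflow, i.e. staging tiles from "DRAM" (the global arrays) into "SRAM" (the local tile buffers) and accumulating per-tile products, while decisions such as thread binding, layout, and pipelining are left to a compiler. The function name and block-size parameters are illustrative choices, not names from TileLang.

```python
import numpy as np

def tiled_matmul(A, B, block_M=32, block_N=32, block_K=32):
    """Dataflow-only tiled GEMM: C = A @ B, computed tile by tile.

    Each (block_M, block_N) output tile is accumulated from
    (block_M, block_K) x (block_K, block_N) tile products, mirroring
    the DRAM -> SRAM tile movement pattern described in the abstract.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, block_M):
        for j0 in range(0, N, block_N):
            # Local accumulator plays the role of an on-chip (SRAM) tile.
            acc = np.zeros((min(block_M, M - i0), min(block_N, N - j0)),
                           dtype=A.dtype)
            for k0 in range(0, K, block_K):
                # Stage input tiles ("DRAM -> SRAM" copies).
                a_tile = A[i0:i0 + block_M, k0:k0 + block_K]
                b_tile = B[k0:k0 + block_K, j0:j0 + block_N]
                # Per-tile compute on the staged tiles.
                acc += a_tile @ b_tile
            # Write the finished tile back to "DRAM".
            C[i0:i0 + block_M, j0:j0 + block_N] = acc
    return C
```

In a system like TileLang, the schedule (how tiles map to thread blocks, how copies are pipelined with compute) would be attached via annotations rather than hand-written, so this dataflow skeleton stays unchanged as the target hardware varies.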