
Triton-to-TileIR Bridges High-Level GPU Programming with NVIDIA’s CUDA Tile for Enhanced Performance and Portability

NVIDIA's CUDA Tile is a next-generation GPU programming model designed to deliver maximum performance on NVIDIA Tensor Cores by enabling portable, high-level tile-based computation. Introduced in CUDA 13.1, CUDA Tile shifts the focus from individual thread management in the traditional SIMT model to expressing computation at the tile level, where data and operations are grouped into manageable blocks. This abstraction simplifies programming while enabling powerful compiler optimizations and efficient hardware utilization. At the core of this model is CUDA Tile IR, an MLIR-based intermediate representation that defines the semantics, operations, and type system for tile-level computations on NVIDIA GPUs.

OpenAI Triton, an open-source Python domain-specific language (DSL) for writing deep learning kernels, is now being extended to support CUDA Tile IR as a backend. This integration, known as Triton-to-TileIR, allows developers to compile Triton code directly to CUDA Tile IR instead of the traditional PTX output. This is a major advancement because it preserves Triton's high-level, tile-oriented abstractions and enables direct execution on modern NVIDIA GPUs with full support for Tensor Cores and future architectural improvements.

Triton-to-TileIR acts as a bridge between the accessible Python syntax of Triton and the low-level efficiency of CUDA Tile IR. Since Triton is inherently tile-based, with computations expressed over data tiles rather than individual threads, the transition to CUDA Tile IR is natural and efficient. Instead of lowering tile-level code to thread-level SIMT instructions, the new backend compiles directly to CUDA Tile IR, maintaining performance and enabling better optimization opportunities. One of the key benefits is that existing Triton users can access the advantages of CUDA Tile IR without rewriting their code.
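The tile-level programming model described above can be illustrated with a plain-NumPy sketch. This is not Triton or CUDA Tile IR code; it only mimics the shape of a tile-based kernel, where each "program" instance owns a whole block of data (analogous to Triton's `tl.program_id`, `tl.load`, and `tl.store`) rather than a single element as in the SIMT model:

```python
import numpy as np

def tiled_add(x, y, block_size=4):
    """Tile-style vector add: each step processes a whole block,
    mirroring how one Triton program instance owns one data tile."""
    n = x.shape[0]
    out = np.empty_like(x)
    # One "program" per tile, analogous to tl.program_id(0) in Triton.
    for pid in range(0, n, block_size):
        tile_x = x[pid:pid + block_size]  # tile load (cf. tl.load)
        tile_y = y[pid:pid + block_size]
        out[pid:pid + block_size] = tile_x + tile_y  # tile store (cf. tl.store)
    return out

a = np.arange(8, dtype=np.float32)
b = np.ones(8, dtype=np.float32)
result = tiled_add(a, b)  # elementwise sum, computed tile by tile
```

Because the computation is stated in whole-tile operations, a compiler backend such as Triton-to-TileIR can map each tile operation directly onto hardware tile primitives instead of decomposing it into per-thread instructions.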
A simple environment variable switch allows them to choose between the PTX backend and the CUDA Tile IR backend on a per-kernel basis. When using the Tile IR backend, compiled kernels are cached with .tileIR file extensions, indicating the new compilation path.

The project is currently in active development under the triton-lang organization and is considered an incubator-level initiative. Key development areas include building core conversion patterns to map Triton operations to CUDA Tile IR, extensive testing for correctness across complex control flow and memory patterns, performance benchmarking across operations like matrix multiplication and convolutions, and integration with broader open-source ecosystems such as Helion.

To use Triton-to-TileIR, developers must build the project from source, as prebuilt binaries are not yet available. After setup, users can verify the backend by running the tutorials and checking for .tileIR files in the cache.

The project is still in its early stages, with known limitations. Some Triton operations are not yet supported in the Tile IR backend, and performance for certain patterns, especially tensor-of-pointer layouts, can be suboptimal. For workloads using tensor-of-pointer patterns, developers are encouraged to adopt the TMA (Tensor Memory Accelerator) load and store API. When tensors have contiguous, well-defined shapes and strides, it is more efficient to describe their layout via descriptors than to compute individual pointers. For example, using tl.make_tensor_descriptor to define a tensor's shape, strides, and block size enables TMA-based loading and storing, which significantly improves performance on the Tile IR backend.

This integration marks a pivotal moment in GPU programming. It empowers researchers and developers with minimal CUDA experience to write high-performance code that runs efficiently on cutting-edge NVIDIA hardware.
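The descriptor idea behind tl.make_tensor_descriptor can be sketched in plain Python/NumPy. The `TensorDescriptor` class below is an illustrative stand-in, not the real Triton API: the point is that shape, strides, and block size are declared once up front, so block loads and stores become single descriptor operations instead of per-element pointer arithmetic:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TensorDescriptor:
    """Illustrative stand-in for a TMA-style tensor descriptor:
    the layout (here via the backing array) and the block shape
    are fixed once, instead of recomputing pointers per element."""
    base: np.ndarray
    block_shape: tuple

    def load(self, offsets):
        """Load one whole block at the given element offsets."""
        r, c = offsets
        br, bc = self.block_shape
        return self.base[r:r + br, c:c + bc]

    def store(self, offsets, block):
        """Store one whole block at the given element offsets."""
        r, c = offsets
        br, bc = self.block_shape
        self.base[r:r + br, c:c + bc] = block

# Describe a 4x4 tensor once, then move 2x2 blocks through it.
t = np.arange(16, dtype=np.float32).reshape(4, 4)
desc = TensorDescriptor(base=t, block_shape=(2, 2))
blk = desc.load((0, 2))          # top-right 2x2 block
desc.store((2, 0), blk * 10.0)   # write a scaled copy at bottom-left
```

In real Triton code the same pattern would use tl.make_tensor_descriptor plus descriptor load/store inside a @triton.jit kernel; the hardware TMA engine can then transfer whole blocks, which is what makes this path faster than tensor-of-pointer layouts on the Tile IR backend.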
By combining Triton’s developer-friendly design with CUDA Tile IR’s performance and portability, the collaboration between NVIDIA and the Triton community is setting a new standard for accessible, high-performance GPU computing. As the project evolves, it will be crucial to measure success not just by performance gains, but by how effectively it lowers the barrier to entry for advanced GPU programming.
