AI-Generated CUDA Kernels Surpass PyTorch Performance in Key Machine Learning Operations
Researchers at Stanford University's Center for Research on Foundation Models (CRFM) have reported a surprising result in AI-generated kernels for machine learning operations. The kernels, written purely in CUDA-C without relying on libraries or DSLs such as CUTLASS or Triton, rival or outperform the expert-optimized production kernels in PyTorch. The team published the findings earlier than planned because the results were so striking; the project's original goal was to generate synthetic data for training better kernel-generation models.

Key Findings and Performance Benchmarks

The AI-generated kernels were benchmarked on an Nvidia L40S GPU. Performance is defined as the reference time divided by the generated kernel time, so 100% means parity with PyTorch and higher is faster. Highlights:

- Matmul (FP32): 101.3% of torch.matmul's performance for 4096x4096 square matrices.
- Conv2D: 179.9% of torch.nn.Conv2d's performance (roughly 1.8x faster) for a (100, 3, 224, 224) input tensor with the benchmark's convolution parameters.
- Softmax: 111.8% of torch.softmax's performance for a (4096, 65536) input tensor.
- LayerNorm: 484.4% of torch.nn.LayerNorm's performance (roughly 4.8x faster) for a (16, 64, 256, 256) input tensor.
- Conv2D + ReLU + MaxPool: 290.1% of PyTorch's reference implementation and 189.0% of the torch.compile version, using the same convolution parameters with ReLU and MaxPool layers added.

Methodology

The project builds on KernelBench, the benchmark framework the team released in December 2024. The task is to replace PyTorch's default operators with custom CUDA kernels that deliver a speedup while staying within a correctness tolerance of 1e-02 against the reference output (a minimal illustrative check-and-time harness is sketched after the Conv2D discussion below).

Key Innovations

- Natural-language reasoning for optimization ideas: Instead of generating new kernels directly, the models (OpenAI o3 and Gemini 2.5 Pro) first produce optimization ideas in natural language. These ideas are then translated into code, which allows a more structured and diverse exploration of optimization strategies.
- Branching during optimization: Multiple implementation attempts are spawned from each optimization idea, and the best-performing kernels are selected to seed the next round of refinement. This introduces parallelism into the search and helps avoid local minima, leading to better overall performance (a schematic sketch of this loop appears below).

Example Optimization Trajectory: Conv2D Kernel

Optimizing the Conv2D kernel took several rounds of iterative refinement:

- Round 0: Initial attempt, reaching 20.1% of the reference performance (7.02 ms).
- Rounds 1-3: Incremental improvements from better cache utilization, conversion to an FP16 GEMM formulation, and double buffering.
- Round 4: Further gains from precomputing indices in shared memory.
- Round 5: A significant speedup from eliminating redundant arithmetic in the loading loop.
- Rounds 6-13: Additional optimizations, including parallelized output writes, precomputed base input coordinates, and tighter pipeline management, culminating in a final result of 179.9% of the reference performance (0.795 ms).

Final Optimized Conv2D Kernel

The final Conv2D kernel leverages advanced CUDA techniques such as Tensor Core operations, shared-memory optimization, and warp-level parallelism. It is specialized for the benchmark's problem sizes and keeps the computation correct by precomputing and caching indices in shared memory. A simplified sketch of that index-caching technique follows.
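As a rough illustration of what caching index arithmetic in shared memory looks like in practice, here is a minimal, hypothetical CUDA kernel for a single-channel, stride-1, unpadded 3x3 convolution. It is not the generated kernel (which also uses Tensor Cores, warp-level parallelism, and double buffering); the kernel name, shapes, and launch assumptions are invented for this sketch.

```cuda
// Hypothetical sketch only: a single-channel, stride-1, unpadded KxK
// convolution that caches per-tap input offsets and weights in shared
// memory so the inner loop does no index arithmetic. Assumes the block
// has at least K*K threads (e.g. a 16x16 block).
#define K 3

__global__ void conv2d_index_cache_sketch(const float* __restrict__ in,
                                          const float* __restrict__ weight,
                                          float* __restrict__ out,
                                          int in_w, int out_h, int out_w) {
    __shared__ int   tap_offset[K * K];
    __shared__ float tap_weight[K * K];

    // Precompute the flattened input offset of every filter tap once per block.
    int t = threadIdx.y * blockDim.x + threadIdx.x;
    if (t < K * K) {
        int kh = t / K, kw = t % K;
        tap_offset[t] = kh * in_w + kw;
        tap_weight[t] = weight[t];
    }
    __syncthreads();

    int ox = blockIdx.x * blockDim.x + threadIdx.x;
    int oy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ox >= out_w || oy >= out_h) return;

    // Each thread computes one output element, reusing the cached offsets.
    const float* base = in + oy * in_w + ox;
    float acc = 0.0f;
    #pragma unroll
    for (int i = 0; i < K * K; ++i)
        acc += base[tap_offset[i]] * tap_weight[i];
    out[oy * out_w + ox] = acc;
}
```

In a full multi-channel kernel the cached table would simply be larger; the real generated kernel combines this kind of index caching with Tensor Core GEMMs and double buffering, as described above.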
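The benchmark definition used earlier (performance = reference time divided by generated kernel time, with a 1e-02 correctness tolerance) can also be made concrete with a small host-side harness. The sketch below is not the KernelBench harness; it times a hypothetical candidate kernel against a hand-written reference for a deliberately trivial operation (ReLU) using CUDA events and checks the maximum absolute error.

```cuda
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>
#include <vector>

// Stand-ins for a PyTorch reference operator and an AI-generated candidate.
__global__ void reference_relu(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] > 0.0f ? in[i] : 0.0f;
}
__global__ void candidate_relu(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(in[i], 0.0f);
}

int main() {
    const int n = 1 << 24, threads = 256, blocks = (n + threads - 1) / threads;
    std::vector<float> h_in(n), h_ref(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = float(i % 7) - 3.0f;  // mixed signs

    float *d_in, *d_ref, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_ref, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the reference kernel.
    cudaEventRecord(start);
    reference_relu<<<blocks, threads>>>(d_in, d_ref, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ref_ms = 0.0f;
    cudaEventElapsedTime(&ref_ms, start, stop);

    // Time the candidate kernel.
    cudaEventRecord(start);
    candidate_relu<<<blocks, threads>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float cand_ms = 0.0f;
    cudaEventElapsedTime(&cand_ms, start, stop);

    // Correctness: maximum absolute error must stay under the 1e-02 tolerance.
    cudaMemcpy(h_ref.data(), d_ref, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    float max_err = 0.0f;
    for (int i = 0; i < n; ++i)
        max_err = std::max(max_err, std::fabs(h_ref[i] - h_out[i]));

    // Performance as defined above: reference time / candidate time (100% = parity).
    printf("max_err=%g  correct=%s  performance=%.1f%%\n",
           max_err, max_err < 1e-2f ? "yes" : "no", 100.0f * ref_ms / cand_ms);

    cudaFree(d_in);
    cudaFree(d_ref);
    cudaFree(d_out);
    return 0;
}
```

A real harness would warm up the GPU and average over many runs against the actual PyTorch operator; this sketch only shows the shape of the check-and-time loop.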
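Similarly, the idea-then-branch search described under Key Innovations can be pictured as a small beam-search-style loop. Everything in the sketch below is hypothetical: propose_ideas, implement_idea, and evaluate stand in for the LLM calls and the correctness/timing harness, and the data structures and parameters are invented for illustration.

```cuda
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical candidate: CUDA-C source plus its measured speedup.
struct Candidate {
    std::string source;
    float speedup = 0.0f;  // reference_time / candidate_time
};

// Placeholder stubs. In the described workflow these would call the LLMs
// (e.g. OpenAI o3 or Gemini 2.5 Pro) and the benchmarking harness.
std::vector<std::string> propose_ideas(const Candidate&, int n) {
    return std::vector<std::string>(n, "cache index arithmetic in shared memory");
}
Candidate implement_idea(const Candidate& parent, const std::string&) {
    return parent;  // real version: generate new CUDA source from the idea
}
float evaluate(const Candidate&) { return 1.0f; }  // real version: compile, check, time

Candidate optimize(Candidate seed, int rounds, int ideas_per_seed,
                   int branches_per_idea, int beam_width) {
    std::vector<Candidate> frontier = {seed};
    for (int r = 0; r < rounds; ++r) {
        std::vector<Candidate> next;
        for (const Candidate& parent : frontier) {
            // 1. Reason about possible optimizations in natural language first.
            for (const std::string& idea : propose_ideas(parent, ideas_per_seed)) {
                // 2. Branch: several implementation attempts per idea.
                for (int b = 0; b < branches_per_idea; ++b) {
                    Candidate c = implement_idea(parent, idea);
                    c.speedup = evaluate(c);  // 0 if incorrect
                    if (c.speedup > 0.0f) next.push_back(c);
                }
            }
        }
        if (next.empty()) break;
        // 3. Keep only the fastest candidates to seed the next round.
        std::sort(next.begin(), next.end(),
                  [](const Candidate& a, const Candidate& b) {
                      return a.speedup > b.speedup;
                  });
        if ((int)next.size() > beam_width) next.resize(beam_width);
        frontier = next;
    }
    return frontier.front();
}
```

The actual numbers of ideas per round, implementation branches per idea, and surviving candidates were not stated in the summary above, so they are left as arguments here.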
Takeaways

The researchers' findings suggest that combining strong reasoning capabilities with parallel exploration of multiple hypotheses can lead to significant improvements in AI-based kernel generation. The method not only achieves performance on par with or better than existing expert-optimized kernels but also generates better synthetic data for training future models. Despite current limitations, such as the focus on FP32 operations, the progress shows promise for broader applications and continued development.

Industry insiders and experts in AI and machine learning are intrigued by these results. They believe the approach could revolutionize the way kernels are developed, potentially reducing the time and resources needed to optimize critical ML operations. The work is seen as a stepping stone toward creating self-improving AI systems that can autonomously generate high-performance code.

Company Profiles and Support

The project received support from companies such as Standard Kernel Co. and Prime Intellect, highlighting industry interest in advancing this technology. These companies build tools and platforms for efficient AI and machine learning operations, and their collaboration with Stanford CRFM underscores the potential practical applications of this research.

In conclusion, while the current results are encouraging, there is still considerable room for improvement, particularly in generating better optimization ideas and extending the method to more complex kernels. The Stanford CRFM team remains optimistic and is committed to pushing the boundaries of AI-driven kernel generation.