
TurboFNO: GPU-Optimized High-Efficiency Fourier Neural Operator

The Fourier Neural Operator (FNO) has gained widespread recognition for its ability to learn solutions to partial differential equations (PDEs). However, existing FNO implementations are not optimized for specific hardware architectures. The Fourier layer in an FNO typically runs as a sequence of discrete stages: Fast Fourier Transform (FFT), filtering, General Matrix Multiply (GEMM), zero-padding, and inverse Fast Fourier Transform (iFFT). This staged approach incurs multiple kernel launches and substantial global memory traffic, which degrades performance.

To address these issues, we introduce TurboFNO, the first fully fused FFT-GEMM-iFFT GPU kernel with built-in FFT optimizations. Our development began with custom FFT and GEMM kernels written from scratch that match or even outperform the state-of-the-art closed-source cuBLAS and cuFFT libraries. A key innovation in TurboFNO is the integration of high-frequency truncation, input zero-padding, and pruning directly into the FFT kernel, which eliminates the extra memory-copy operations these steps would otherwise require.

To further enhance performance, we designed a variant of the FFT algorithm in which a single thread block iterates over the hidden dimension, mirroring the $k$-loop in GEMM; this alignment allows the FFT and GEMM workloads to be fused seamlessly. We also developed two shared-memory rearrangement patterns that achieve 100% memory bank utilization when passing FFT outputs to GEMM and that let the iFFT read GEMM results directly from shared memory, optimizing the overall computational flow. Our experiments on the NVIDIA A100 GPU demonstrate that TurboFNO achieves up to 150% better performance than PyTorch, cuBLAS, and cuFFT.
This improvement highlights the effectiveness of our fully fused approach in reducing the computational bottlenecks typically found in FNO implementations. TurboFNO not only accelerates the training and inference processes but also opens up new possibilities for more efficient and scalable applications in the field of PDEs.
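To make the staged pipeline that TurboFNO fuses concrete, the sketch below implements a single unfused Fourier layer in numpy. This is a minimal illustrative reconstruction, not TurboFNO's kernel: the function name `fourier_layer`, the tensor layout `(batch, channels, n)`, and the `modes` parameter are assumptions chosen for clarity. Each commented line corresponds to one of the discrete stages the paper describes (FFT, high-frequency truncation/filtering, GEMM over the channel dimension, zero-padding, iFFT); in a standard framework implementation each stage is a separate kernel launch with its own global-memory round trip.

```python
import numpy as np

def fourier_layer(x, weights, modes):
    """One unfused FNO Fourier layer (illustrative sketch).

    x       : (batch, c_in, n) real-valued input
    weights : (c_out, c_in, modes) complex spectral weights
    modes   : number of low-frequency modes kept after truncation
    """
    batch, c_in, n = x.shape
    c_out = weights.shape[0]

    x_ft = np.fft.rfft(x, axis=-1)           # stage 1: FFT
    x_ft = x_ft[:, :, :modes]                # stage 2: high-frequency truncation (filtering)

    # stage 3: GEMM over the channel (hidden) dimension, per retained mode
    out_ft = np.einsum("oim,bim->bom", weights, x_ft)

    # stage 4: zero-pad the truncated spectrum back to full length
    full = np.zeros((batch, c_out, n // 2 + 1), dtype=np.complex128)
    full[:, :, :modes] = out_ft

    return np.fft.irfft(full, n=n, axis=-1)  # stage 5: iFFT

# Example: 2 samples, 3 input channels, 4 output channels, 16 grid points, 5 modes
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 16))
w = rng.normal(size=(4, 3, 5)) + 1j * rng.normal(size=(4, 3, 5))
y = fourier_layer(x, w, modes=5)
print(y.shape)  # (2, 4, 16)
```

Each of the five stages above reads its input from and writes its output to global memory when executed as separate library calls; TurboFNO's contribution is performing stages 1-5 inside one fused kernel, keeping the intermediate spectra in shared memory.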
