Accelerate BEV Pooling Kernels on NVIDIA RTX GPUs
Bird’s-eye-view (BEV) perception has become a foundational architecture for autonomous vehicles, robotics, and spatial AI systems. By projecting multi-camera features into a unified top-down grid, BEV models let downstream modules reason about lanes, obstacles, and trajectories in a shared coordinate system. A critical bottleneck in this pipeline is BEV pooling, which gathers depth-aware image features and scatter-reduces them into grid cells. Its irregular memory access patterns and repeated index loads frequently stall GPU throughput. To address this, NVIDIA engineers have introduced BEVPoolV3, a refined optimization framework designed to accelerate gather- and scatter-heavy workloads across modern NVIDIA RTX architectures.

BEVPoolV3 follows a systematic workflow that begins with memory regime classification. By comparing the operator’s working set against the target GPU’s L2 cache capacity, developers determine whether the workload is DRAM-bound or L2-resident. The framework then eliminates redundant scatter traffic through a depth-outer traversal strategy, a five-array INT32 scatter mapping, and precomputed indices that remove runtime integer division. Each BEV interval is assigned a dedicated processing unit that accumulates feature channels locally and writes the output once. The remaining choices adapt to the memory regime: DRAM-bound systems prioritize byte reduction and cache-preserving output stores, while L2-resident platforms rely on higher occupancy, vectorized loads, and precision specialization.

Benchmarks on NVIDIA RTX A6000 and RTX PRO 6000 Blackwell Max-Q systems demonstrate the framework’s impact. The canonical 49 MB working set exceeds the A6000’s 6 MB L2 cache, so the workload is DRAM-bound on that GPU, whereas the same working set fits within the Blackwell Max-Q’s 128 MB L2 cache and is L2-resident there. On the RTX A6000, the optimized FP16 path achieves a 19.3x latency reduction over prior implementations, reaching 90 microseconds in standard configurations.
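The memory regime classification described above reduces to a size comparison, which can be sketched in a few lines of Python. This is a minimal illustration, not the framework's actual code: the function name `classify_memory_regime` and the strict "fits in L2" rule are assumptions, and a real classifier would likely leave headroom because L2 is shared with other traffic. The sizes are the ones quoted in the text, treated here as MiB.

```python
MIB = 1024 * 1024


def classify_memory_regime(working_set_bytes: int, l2_bytes: int) -> str:
    """Return 'l2-resident' if the working set fits in L2, else 'dram-bound'.

    Hypothetical helper: the simple <= comparison is an assumption and
    ignores any headroom needed for other data competing for the cache.
    """
    if working_set_bytes <= 0 or l2_bytes <= 0:
        raise ValueError("sizes must be positive")
    return "l2-resident" if working_set_bytes <= l2_bytes else "dram-bound"


# Figures quoted in the article: a 49 MB working set, 6 MB and 128 MB L2 caches.
WORKING_SET = 49 * MIB
L2_CAPACITY = {
    "RTX A6000": 6 * MIB,
    "RTX PRO 6000 Blackwell Max-Q": 128 * MIB,
}

regimes = {
    gpu: classify_memory_regime(WORKING_SET, l2_bytes)
    for gpu, l2_bytes in L2_CAPACITY.items()
}

for gpu, regime in regimes.items():
    print(f"{gpu}: {regime}")
```

With these inputs the A6000 is classified as DRAM-bound and the Blackwell Max-Q as L2-resident, matching the regimes stated in the text. The result would then select which group of optimizations to apply.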
On the RTX PRO 6000 Blackwell Max-Q, speedups range from 16x to 42x, depending on channel width and point count. Precision evaluation shows that while newer formats excel in compute-bound operations, FP8 remains the better choice for L2-resident scatter-reduce kernels because of its lower decode overhead and sustained arithmetic throughput.

BEVPoolV3 is deployed as a TensorRT IPluginV3 operator that supports automatic kernel dispatch based on GPU architecture and data type. Validation confirms numerical consistency, with error margins remaining within production tolerances. The core improvements transfer effectively to edge platforms such as the DRIVE AGX Thor, but developers should note that FP8 acceleration requires workload-specific profiling on constrained memory hierarchies, because register pressure and conversion overhead can offset the bandwidth gains.

The BEVPoolV3 methodology establishes a repeatable optimization pipeline for irregular, memory-bound operators. By using NVIDIA Nsight Compute to diagnose bottlenecks and matching implementation strategies to the hardware’s memory hierarchy, engineers can systematically accelerate related operators such as sparse embeddings, voxelization, and segmented reductions. The broader lesson is that as spatial AI workloads grow in complexity, precision selection and memory hierarchy alignment will remain decisive factors in achieving real-time perception and planning performance.
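The interval-based accumulation described earlier is a segmented reduction. A pure-Python CPU reference can make the data flow concrete. This is only a sketch under stated assumptions: the array names (`ranks_depth`, `ranks_feat`, `ranks_bev`, `interval_starts`, `interval_lengths`) follow the layout of the open-source BEVPoolV2 operator and are a guess at what the five INT32 arrays hold, and the functions `build_intervals` and `bev_pool_intervals` are hypothetical. GPU-specific aspects such as depth-outer traversal, vectorized loads, and reduced precision are not modeled.

```python
from typing import List, Sequence, Tuple


def build_intervals(ranks_bev: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Group consecutive points that share a BEV cell (ranks_bev must be sorted)."""
    starts: List[int] = []
    lengths: List[int] = []
    for p, cell in enumerate(ranks_bev):
        if p == 0 or cell != ranks_bev[p - 1]:
            starts.append(p)      # a new interval begins at this point
            lengths.append(1)
        else:
            lengths[-1] += 1      # extend the current interval
    return starts, lengths


def bev_pool_intervals(
    depth: Sequence[float],           # flattened depth weights
    feat: Sequence[Sequence[float]],  # feat[i] is one C-channel image feature
    ranks_depth: Sequence[int],       # per point: index into depth
    ranks_feat: Sequence[int],        # per point: index into feat
    ranks_bev: Sequence[int],         # per point: target BEV cell, sorted
    interval_starts: Sequence[int],
    interval_lengths: Sequence[int],
    num_cells: int,
) -> List[List[float]]:
    channels = len(feat[0])
    out = [[0.0] * channels for _ in range(num_cells)]
    # One loop iteration stands in for the unit that owns one BEV interval.
    for start, length in zip(interval_starts, interval_lengths):
        acc = [0.0] * channels                    # local accumulator
        for p in range(start, start + length):
            w = depth[ranks_depth[p]]             # lookups only, no div/mod
            f = feat[ranks_feat[p]]
            for c in range(channels):
                acc[c] += w * f[c]
        out[ranks_bev[start]] = acc               # single write per cell
    return out


# Tiny example: three points, two of which fall into BEV cell 0.
depth = [0.5, 0.25, 1.0]
feat = [[1.0, 2.0], [4.0, 8.0]]
ranks_depth = [0, 1, 2]
ranks_feat = [0, 1, 1]
ranks_bev = [0, 0, 2]

starts, lengths = build_intervals(ranks_bev)
pooled = bev_pool_intervals(
    depth, feat, ranks_depth, ranks_feat, ranks_bev, starts, lengths, num_cells=3
)
print(pooled)  # [[1.5, 3.0], [0.0, 0.0], [4.0, 8.0]]
```

Because each interval owns exactly one output cell, no atomic updates are needed and every cell is written at most once, which is the property the text relies on to cut scatter traffic. The same pattern applies to other segmented reductions once the inputs are sorted by destination.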
