Scientists clear path to stream 3D volumetric video
Computer scientists at Brown University have developed a new method called PackUV that could revolutionize how 3D volumetric video is stored and streamed. The research, led by graduate student Aashish Rai under Assistant Professor Srinath Sridhar, aims to make 4D video, which spans the three dimensions of space plus time, accessible on standard devices like computers and smart televisions. Volumetric video allows viewers to explore a recorded scene from any perspective, offering potential applications in entertainment, sports broadcasting, and manufacturing.

The primary barrier to widespread adoption of volumetric video has been its massive data requirements. Traditional methods capture scenes using dozens of synchronized cameras and reconstruct them in 3D, resulting in file sizes that can reach terabytes for just a thirty-minute clip. Furthermore, the native formats are incompatible with existing internet infrastructure and with the video codecs used by major platforms like Netflix and YouTube.

PackUV addresses these issues by introducing a novel compression technique. It builds on the state-of-the-art rendering method known as 3D Gaussian splatting, which uses fuzzy mathematical blobs to represent 3D space. The new algorithm maps millions of these blobs into a structured 2D image. This process is comparable to projecting a globe onto a flat map, and it significantly reduces file size while preserving high-quality visual data. The resulting video files are compatible with current media infrastructure, making them easy to store, stream, and share.

In addition to compression, the research tackles the challenge of handling long video sequences. Previous Gaussian splatting approaches often struggled with clips longer than a few minutes, losing track of moving objects when they were temporarily obscured or when new subjects entered the frame. To solve this, PackUV divides long videos into smaller chunks.
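
The chunking idea can be pictured with a short sketch. The Python below is not the PackUV implementation; it only shows one simple way to cut a range of frames into fixed-length segments. The names `Segment` and `split_into_segments`, the 30 frames-per-second rate, and the 300-frame segment length are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A contiguous run of frames, [start, stop), handled as one unit."""
    start: int  # index of the first frame in the segment
    stop: int   # one past the index of the last frame


def split_into_segments(num_frames: int, segment_length: int) -> list[Segment]:
    """Divide frames 0..num_frames-1 into consecutive fixed-length segments.

    The last segment is shorter when num_frames is not a multiple of
    segment_length.
    """
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    return [
        Segment(start, min(start + segment_length, num_frames))
        for start in range(0, num_frames, segment_length)
    ]


# Hypothetical numbers: a 30-minute clip at 30 frames per second,
# cut into 10-second (300-frame) segments.
segments = split_into_segments(num_frames=30 * 60 * 30, segment_length=300)
print(len(segments))  # 180
print(segments[0], segments[-1])
```

The first frame of each segment (`segment.start`) is where any per-segment step would run.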
At the start of each segment, the system re-evaluates the scene to detect movement, occlusions, or newly entered subjects. By resetting the tracking process frequently, the algorithm maintains accuracy over longer durations, successfully rendering complex scenes up to thirty minutes long without degradation.

To validate their approach, the team compiled the largest multi-view video dataset ever assembled. The collection features footage of various activities, including basketball, pickleball, cooking, and woodworking. These scenes were captured using arrays of 50 to 90 synchronized cameras in both specialized laboratories and real-world environments. The researchers have made the dataset publicly available to the broader scientific community to accelerate further development in the field.

Professor Sridhar emphasized that the work is fundamentally about creating digital twins of the real world. Beyond entertainment and sports, the technology holds promise for manufacturing and other industries where detailed spatial understanding is required.

The findings will be presented in June at the IEEE/CVF Conference on Computer Vision and Pattern Recognition. This breakthrough represents a significant step toward bringing immersive 3D video experiences to the mainstream internet.
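
For readers who want a concrete picture of the packing step described earlier, the toy example below stores one number per blob (say, its opacity) as a pixel in a square 2D grid and then reads it back. This is a deliberately naive sketch, not PackUV's actual mapping: the row-by-row layout, the 8-bit quantization, and the function names `pack_to_grid` and `unpack_from_grid` are assumptions chosen for illustration. Real Gaussian splats also carry several attributes each, such as position, shape, and color, which a single grid of scalars does not capture.

```python
import math


def pack_to_grid(values: list[float], lo: float, hi: float) -> tuple[list[list[int]], int]:
    """Quantize values to 8 bits and lay them out row by row in a square grid.

    Returns the grid plus the number of real values, so that padding
    pixels can be dropped again when unpacking.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    # Smallest side length whose square holds every value.
    side = math.isqrt(len(values) - 1) + 1 if values else 1
    # Clamp to [lo, hi], then scale to the 0..255 range of an 8-bit pixel.
    pixels = [round((min(max(v, lo), hi) - lo) / (hi - lo) * 255) for v in values]
    pixels += [0] * (side * side - len(pixels))  # pad unused pixels with 0
    grid = [pixels[r * side:(r + 1) * side] for r in range(side)]
    return grid, len(values)


def unpack_from_grid(grid: list[list[int]], count: int, lo: float, hi: float) -> list[float]:
    """Read pixels back in row order and map them to the original range."""
    flat = [p for row in grid for p in row][:count]
    return [lo + p / 255 * (hi - lo) for p in flat]


# Hypothetical opacities for five blobs, each between 0 and 1.
opacities = [0.0, 0.25, 0.5, 0.75, 1.0]
grid, count = pack_to_grid(opacities, lo=0.0, hi=1.0)
restored = unpack_from_grid(grid, count, lo=0.0, hi=1.0)
print(len(grid), "x", len(grid[0]), "grid holding", count, "values")
print(max(abs(a - b) for a, b in zip(opacities, restored)))
```

Once the data is an ordinary grid of pixel values, it has the shape of an image, which is the kind of input standard image and video tools are built to handle. The cost in this sketch is a small rounding error from the 8-bit quantization.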
