Trained on 13,000 Video Clips: Peking University's Shi Boxin Team and OpenBayes Bayesian Computing Propose PanoWan, a Panoramic Video Generation Framework That Also Supports Zero-Shot Video Editing

Panoramic video is one of the key content forms of virtual reality (VR). Grounded in the real world, its 360° immersive perspective strengthens users' sense of presence and interaction, providing key support for VR content production, industry adoption, and user growth. However, panoramic video production today usually relies on professional equipment, which greatly limits the breadth of content creation.
In recent years, with the rapid development of generative video models, researchers have begun applying them to panoramic video, lowering the threshold for panoramic content creation, promoting the large-scale expansion of VR content, and even helping to build highly immersive, interactive virtual worlds.
However, transferring conventional video generation models to the panoramic domain is not straightforward. The main challenge is that panoramic videos differ fundamentally from ordinary videos in how spatial features are represented: equirectangular projection distorts the image along the latitude direction, and stitching at the longitude boundary introduces visual and semantic discontinuities. As a result, even though current text-to-video generation already achieves excellent results, it struggles to keep the spatial layout of scene elements consistent and coherent when generating panoramic videos.
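To make the latitude distortion concrete, here is a minimal sketch (our own illustration, not code from the paper): in an equirectangular projection every pixel row spans 360° of longitude, but the circumference of a latitude circle shrinks as cos(latitude), so the horizontal stretching factor grows rapidly toward the poles.

import numpy as np

def erp_horizontal_stretch(height: int) -> np.ndarray:
    """Per-row horizontal stretch factor of an ERP image with the given height."""
    # Map row centers to latitudes in (-pi/2, pi/2).
    latitudes = (np.arange(height) + 0.5) / height * np.pi - np.pi / 2
    # Every row stores the same number of pixels, but the latitude circle it
    # represents is only cos(latitude) times as long as the equator.
    return 1.0 / np.cos(latitudes)

stretch = erp_horizontal_stretch(960)
print(f"stretch at the equator ~ {stretch[480]:.2f}, near the pole ~ {stretch[5]:.1f}")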
To address this key challenge, the Peking University Camera Intelligence Laboratory (Shi Boxin's team) and OpenBayes Bayesian Computing jointly launched PanoWan, a text-guided panoramic video generation framework. With a simple and efficient module design, PanoWan smoothly transfers the generative priors of a pre-trained text-to-video model to the panoramic domain. To this end, the method introduces latitude-aware sampling to reduce the image distortion caused by equirectangular projection, and uses rotated semantic denoising together with padded pixel-wise decoding to resolve the visual and semantic incoherence at the longitude boundary.
In addition, to train the model effectively, the research team constructed PanoVid, a high-quality, large-scale panoramic video dataset. It contains more than 13,000 captioned video clips totaling nearly 1,000 hours and covering diverse scenes such as natural scenery, urban street scenes, and human activities.
Experimental results show that PanoWan not only achieves state-of-the-art performance on text-to-panoramic-video generation, but also demonstrates strong zero-shot video editing capabilities: without any additional training, it handles practical tasks such as panoramic video super-resolution, semantic editing, and video outpainting.

The related research paper "PanoWan: Lifting Diffusion Video Generation Models to 360° with Latitude/Longitude-aware Mechanisms" has been published on arXiv.
For more examples, visit the project homepage:
https://panowan.variantconst.com/

Large-scale panoramic video dataset PanoVid
The lack of paired data has long been one of the main obstacles to improving panoramic video generation models. To address this scarcity, the research team built PanoVid, a semantically balanced, scene-diverse, high-quality, large-scale panoramic video dataset. It aggregates multiple existing panoramic video resources, including 360-1M, 360+x, Imagine360, WEB360, Panonut360, the Miraikan 360-degree Video Dataset, and public immersive VR video datasets.
After the initial collection, the team used the Qwen-2.5-VL model to automatically generate high-quality text descriptions for the videos and tagged them by category, retaining only videos in equirectangular projection (ERP) format. To avoid duplicated content, they then applied a deduplication strategy based on caption similarity, and further screened clips by optical-flow smoothness and aesthetic scores, keeping only high-quality clips in each category.
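A rough sketch of that curation logic is shown below; the field names, thresholds, and the caption-similarity helper are illustrative assumptions, not the team's actual pipeline code.

from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    caption: str            # text description generated by a captioning model such as Qwen-2.5-VL
    category: str           # scene tag, e.g. "landscape", "street", "people"
    is_erp: bool            # keep only equirectangular-projection footage
    flow_smoothness: float  # optical-flow smoothness score
    aesthetic_score: float  # aesthetic quality score

def curate(clips, caption_similarity, sim_thresh=0.9, flow_thresh=0.5, aes_thresh=5.0):
    """Keep ERP clips, drop near-duplicate captions, then filter by quality scores."""
    kept = []
    for clip in clips:
        if not clip.is_erp:
            continue
        # Deduplicate against already-kept clips by caption similarity.
        if any(caption_similarity(clip.caption, k.caption) > sim_thresh for k in kept):
            continue
        # Quality gates: motion smoothness and aesthetics.
        if clip.flow_smoothness < flow_thresh or clip.aesthetic_score < aes_thresh:
            continue
        kept.append(clip)
    return kept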
After this processing pipeline, the final PanoVid dataset contains more than 13,000 video clips with a total duration of roughly 944 hours, covering a wide variety of scenes including landscapes, street scenes, and people.

PanoWan Technical Highlights: Focusing on Latitude and Longitude
PanoWan adopts the same video training framework as the Wan 2.1 model. The goal is to migrate the video generation model to the panoramic domain with minimal changes while preserving the generative priors of the original model as much as possible. To address the panoramic distortion introduced by the ERP format, the research team works along two axes: latitude and longitude.
Along the latitude direction, PanoWan uses latitude-aware sampling (LAS) to alleviate the distortion in polar regions. It remaps the noise distribution so that it better matches the actual frequency characteristics of the sphere, effectively reducing stretching and distortion along the latitudinal direction.
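As a rough illustration of the idea (our assumption about the mechanism, not the paper's exact formulation): one simple way to make the initial noise latitude-aware is to synthesize each ERP row at an effective width proportional to cos(latitude) and stretch it back to full width, so the noise's horizontal frequency content roughly follows the sphere's sampling density.

import torch
import torch.nn.functional as F

def latitude_aware_noise(batch, channels, height, width, device="cpu"):
    """Illustrative latitude-aware initial noise for an ERP-shaped latent."""
    lat = (torch.arange(height, device=device) + 0.5) / height * torch.pi - torch.pi / 2
    rows = []
    for i in range(height):
        # Near the poles an ERP row covers far less spherical area per pixel,
        # so draw the noise at a reduced effective width and stretch it back.
        eff_w = max(4, int(width * torch.cos(lat[i]).item()))
        row = torch.randn(batch, channels, 1, eff_w, device=device)
        row = F.interpolate(row, size=(1, width), mode="bilinear", align_corners=False)
        rows.append(row)
    noise = torch.cat(rows, dim=2)  # (B, C, H, W)
    # Renormalize to unit variance so the diffusion noise schedule is preserved.
    return noise / noise.std(dim=(2, 3), keepdim=True)

z0 = latitude_aware_noise(1, 4, 64, 128)
print(z0.shape)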
Along the longitude direction, to resolve the visual and semantic discontinuity at the left and right boundaries of the generated results, PanoWan proposes Rotated Semantic Denoising (RSD) and Padded Pixel-wise Decoding (PPD). The former spreads the seam error evenly across different longitudes through rotation operations in the latent space, significantly reducing inconsistent semantic transitions; the latter expands the context around the seam region so that the decoder can take information beyond the boundary into account during decoding, effectively avoiding pixel-level seams at the boundary.
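Both mechanisms can be sketched in a few lines; the following is a minimal illustration of the described behavior under a generic latent-diffusion setup, not PanoWan's actual implementation.

import torch

def rotated_denoise_step(latent, denoise_fn, step):
    """Rotated Semantic Denoising (sketch): roll the latent along the longitude
    (width) axis before a denoising step and roll it back afterwards, so the
    seam sits at a different longitude each step and its error is spread out."""
    width = latent.shape[-1]
    shift = torch.randint(0, width, (1,)).item()  # could also follow a fixed schedule
    rolled = torch.roll(latent, shifts=shift, dims=-1)
    denoised = denoise_fn(rolled, step)
    return torch.roll(denoised, shifts=-shift, dims=-1)

def padded_decode(latent, decode_fn, pad=8):
    """Padded Pixel-wise Decoding (sketch): circularly pad the latent along the
    width axis so the decoder sees context from across the boundary, then crop
    the decoded frame back to the original field of view."""
    padded = torch.cat([latent[..., -pad:], latent, latent[..., :pad]], dim=-1)
    frame = decode_fn(padded)
    scale = frame.shape[-1] // padded.shape[-1]  # spatial upsampling factor of the decoder
    return frame[..., pad * scale : frame.shape[-1] - pad * scale]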

The figure below uses an ablation study to demonstrate the effectiveness of the proposed latitude and longitude mechanisms. The upper-left panel shows that, with latitude-aware sampling, the ceiling and light-strip lines that previously showed obvious distortion become straight and natural in the perspective view; the full method in the lower-right panel, which combines rotated semantic denoising and padded pixel-wise decoding, eliminates the discontinuity at the image boundary, yielding a smooth and natural transition.

PanoWan results showcase
First up is the most basic task, text-to-panoramic-video generation. Let's look at the results without further ado.
Prompt: Panoramic shot of an active volcano spewing smoky plumes against a fiery sunset sky, majestic mountains shrouded in misty clouds in the foreground, creating a breathtaking contrast. Camera pans slowly, capturing the vastness and awe-inspiring beauty of nature.
Prompt: Panoramic view of a shot of a neon-drenched cyberpunk metropolis, high-octane chase unfolds on a multi-tiered highway. Sleek, matte black hypercar rockets through the urban jungle, skimming past colossal skyscrapers. Glowing screens illuminate the scene with pulsating neon advertisements. Camera captures the action from a dramatic low angle, tracking the car's breakneck speed.
Prompt: Inside a bustling Starbucks, a young woman sits by the window, sipping a grande latte, engrossed in a thick novel. Sunlight filters through, casting warm glows on her focused face. Surrounding her are chic wooden interiors, the aroma of freshly brewed coffee, and the chatter of patrons. Medium shot, capturing the vibrant cafe ambiance.
Without any retraining, PanoWan can also be applied zero-shot to long panoramic video generation, super-resolution, semantic editing, and video outpainting.
Long video generation prompt: Sunset at a beach.
Video Super Resolution Prompt: 360-degree panoramic interior view inside a charming artisan bakery bustling with activity, bakers carefully preparing handcrafted breads, pastries, and desserts. Shelves stocked with warm baked goods, aromatic scents filling the air, creating feelings of warmth, comfort, and culinary delight.
Semantic Editing Prompt: Change the color of the train to red.
Video outpainting prompt: Panoramic shot of colorful hot air balloons gracefully ascend, floating over lush green fields, their vibrant hues contrasting against a vast, cloud-dappled blue sky. Gentle breezes propel them in a serene dance, casting dynamic shadows on the verdant landscape below. Wide shot from ground level, capturing the expansive scene.
Quantitative and qualitative evaluation
The research team compared PanoWan quantitatively and qualitatively with 360DVD (CVPR'24) and DynamicScaler (CVPR'25), which also target text-to-panoramic-video generation.
To evaluate both the generated visual quality and panorama-specific characteristics scientifically, the team adopted an evaluation protocol that combines general video metrics with panorama-specific metrics. The general metrics cover overall video quality (FVD), text-video alignment (VideoCLIP-XL), and image quality, while the panorama-specific metrics measure longitude boundary continuity, motion pattern accuracy, and scene richness. In the quantitative experiments, PanoWan achieved the best performance on all key metrics.
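As a side note, longitude boundary continuity can be probed with a very simple check: roll the frame by half its width so the original seam lands in the image center, then compare the gradient across that column with a reference column far from it. The sketch below is a hypothetical illustration, not the metric used in the paper.

import numpy as np

def seam_discontinuity(frame: np.ndarray) -> float:
    """frame: (H, W, C) ERP image; returns the seam-to-reference gradient ratio
    (values close to 1 suggest a seamless left/right boundary)."""
    w = frame.shape[1]
    centered = np.roll(frame.astype(np.float32), shift=w // 2, axis=1)  # seam moves to column w//2
    seam_grad = np.abs(centered[:, w // 2] - centered[:, w // 2 - 1]).mean()
    ref_grad = np.abs(centered[:, w // 4] - centered[:, w // 4 - 1]).mean()
    return seam_grad / (ref_grad + 1e-8)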

The following shows a visual comparison between PanoWan and existing methods:
About the Research Team
Shi Boxin, director of the Peking University Camera Intelligence Laboratory (http://camera.pku.edu.cn), is deputy director of the Institute of Video and Vision Technology at the School of Computer Science, Peking University, a tenured associate professor (researcher) and doctoral supervisor, a Beijing Zhiyuan Scholar, and director of the Peking University-Zhifang Embodied Intelligence Joint Laboratory. He received his Ph.D. from the University of Tokyo and was a postdoctoral fellow at the MIT Media Lab.
His research focuses on computational photography and computer vision. He has published more than 200 papers, including 30 in TPAMI and more than 100 in the three top computer vision conferences. His papers received the Best Paper Runner-Up award at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024, the Best Paper Runner-Up award at the International Conference on Computational Photography (ICCP) 2015, and a Best Paper Candidate nomination at the International Conference on Computer Vision (ICCV) 2015. He received the Okawa Research Grant (Japan, 2021) and the Young Scientist Award of the Chinese Institute of Electronics (2024). He is the chief scientist of a major artificial intelligence project of the Ministry of Science and Technology, the principal investigator of a key project of the National Natural Science Foundation of China, and a recipient of the national young talent program. He serves on the editorial boards of the top international journals TPAMI and IJCV, and as an area chair for the top conferences CVPR, ICCV, and ECCV. He is an APSIPA Distinguished Lecturer, a CCF Distinguished Member, and a Senior Member of IEEE/CSIG.

The main collaborator, OpenBayes Bayesian Computing, is a leading domestic artificial intelligence service provider focused on industrial research and scientific research support. By bringing classic software ecosystems and machine-learning models onto a new generation of heterogeneous chips, it offers industrial enterprises and university research institutes faster, easier-to-use data science computing products. Its products have been adopted in dozens of large-scale industrial scenarios and by leading research institutes.
Visit the official website: https://openbayes.com/