HyperAIHyperAI
2 days ago

Waver: Wave Your Way to Lifelike Video Generation

Yifu Zhang, Hao Yang, Yuqi Zhang, Yifei Hu, Fengda Zhu, Chuang Lin, Xiaofeng Mei, Yi Jiang, Zehuan Yuan, Bingyue Peng
Waver: Wave Your Way to Lifelike Video Generation
Abstract

We present Waver, a high-performance foundation model for unified image andvideo generation. Waver can directly generate videos with durations rangingfrom 5 to 10 seconds at a native resolution of 720p, which are subsequentlyupscaled to 1080p. The model simultaneously supports text-to-video (T2V),image-to-video (I2V), and text-to-image (T2I) generation within a single,integrated framework. We introduce a Hybrid Stream DiT architecture to enhancemodality alignment and accelerate training convergence. To ensure training dataquality, we establish a comprehensive data curation pipeline and manuallyannotate and train an MLLM-based video quality model to filter for thehighest-quality samples. Furthermore, we provide detailed training andinference recipes to facilitate the generation of high-quality videos. Buildingon these contributions, Waver excels at capturing complex motion, achievingsuperior motion amplitude and temporal consistency in video synthesis. Notably,it ranks among the Top 3 on both the T2V and I2V leaderboards at ArtificialAnalysis (data as of 2025-07-30 10:00 GMT+8), consistently outperformingexisting open-source models and matching or surpassing state-of-the-artcommercial solutions. We hope this technical report will help the communitymore efficiently train high-quality video generation models and accelerateprogress in video generation technologies. Official page:https://github.com/FoundationVision/Waver.