Event Recap | Peking University, Tsinghua University, Zilliz, and MoonBit Discuss Open Source, Covering Video Generation, Visual Understanding, Vector Databases, and AI Native Programming Languages


The AI industry is currently in an unprecedented development cycle. The large-scale deployment of large models, the restructuring of AI-native software systems, and the accelerating evolution of multimodal foundation models are blurring the line between academia and industry. Whether it is the increasingly demanding requirements for audio-visual synchronization in video generation, the efficient inference optimization of on-device visual models, or the emergence of next-generation AI-native programming languages, all point to one clear trend: industry-academia collaboration and open-source ecosystems are becoming the most important innovation paradigms of the AI era.

Over the past few decades, the cycle of research driving industry and industry feeding back into research has been the norm. But at today's stage of exponential growth in models, computing power, and data, single-point innovation can no longer keep up. Open source has evolved from tool sharing into infrastructure-level collaboration, becoming the key link connecting universities, enterprises, communities, and individual developers. Especially in cutting-edge fields such as vision, multimodality, vector databases, and AI programming languages, open source has not only accelerated the spread of technology but also reshaped how R&D is organized, giving rise to more "co-creation" innovation.

Against this backdrop, HyperAI, as a co-producing community of COSCon'25, hosted the "Industry-Research Open Source Collaboration Forum" on December 7. We were honored to invite Shi Baixin, a researcher at Peking University; Li Chenglong, chief open-source evangelist at Zilliz; Chen Hui, an assistant researcher at Tsinghua University; and Lei Zhengyu, a core developer in the MoonBit community, to discuss how cutting-edge research lands in the open-source ecosystem, how open-source projects iterate in industrial practice, and how AI applications can keep expanding their boundaries through the power of the community.

Shi Baixin: Constructing a brand-new dataset to enable a new paradigm for audio-visual synchronization in video generation

Video generation technology has made progress in image quality and short-term temporal coherence: it can produce high-fidelity short clips and achieve a degree of audio-visual synchronization. However, existing methods still suffer from latitude-longitude distortion, discontinuous viewpoint stitching, poor consistency of moving objects, and insufficient long-term temporal stability. Moreover, audio and visual content are highly correlated: for a model to faithfully capture speech, music, ambient sound, and other kinds of information, it needs a generation framework that can understand multi-track signals.

Shi Baixin

Against this backdrop, Professor Shi Baixin's team proposed an interval-flow technique for audio-visual synchronization that lets the model "look at several frames before and after" during learning, establishing attention connections across time. With intra-block modules, the model applies self-attention separately to different audio tracks, handling different types of audio, such as speech and ambient sound, more precisely. Because music is more global in nature, the team injects global features to render emotion, allowing the model to generate visuals that match the musical atmosphere.
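The multi-track idea can be pictured with a small, purely illustrative PyTorch sketch (not the team's code; the module and tensor names are hypothetical): each demixed track gets its own self-attention pass, while a pooled music embedding is injected as a global conditioning vector.

```python
import torch
import torch.nn as nn

class MultiTrackAudioEncoder(nn.Module):
    """Illustrative only: per-track self-attention plus global music injection."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # One self-attention block per demixed track (speech, ambient sound).
        self.speech_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ambient_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Music is treated globally: pooled into one vector and added to every frame.
        self.music_proj = nn.Linear(dim, dim)

    def forward(self, speech, ambient, music):
        # speech / ambient / music: (batch, time, dim) feature sequences per track
        s, _ = self.speech_attn(speech, speech, speech)       # attend within the speech track
        a, _ = self.ambient_attn(ambient, ambient, ambient)   # attend within ambient sound
        g = self.music_proj(music.mean(dim=1, keepdim=True))  # global "atmosphere" vector
        return s + a + g  # fused conditioning signal for the video generator

x = torch.randn(2, 48, 256)
fused = MultiTrackAudioEncoder()(x, x, x)
print(fused.shape)  # torch.Size([2, 48, 256])
```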

Professor Shi Baixin introduced the breakthroughs the team made in this project:

* A versatile audio-synchronized video generation framework is proposed, achieving precise audio-visual mapping and accurate temporal alignment through demixed audio.

* A new dataset for audio-synchronized video generation was constructed, consisting of 5 overlapping subsets with approximately 392,000 audio-video segments totaling about 1,200 hours. Trained on this dataset across multiple rounds, the model learns facial lip-syncing, event timing control, and emotional atmosphere rendering.

* A multi-stream temporal control network is proposed for processing the demixed audio tracks, enabling precise control over lip-syncing, event timing, and emotional atmosphere.

The related findings, titled "Audio-Sync Video Generation with Multi-Stream Temporal Control," have been selected for NeurIPS 2025.

In addition, Professor Shi Baixin's team has achieved panoramic video generation with real moving objects, supporting tasks such as long videos, semantic editing, super-resolution, and viewpoint interpolation. The method uses a latitude-aware sampling technique to reduce the image distortion caused by equirectangular projection, and addresses semantic incoherence at longitude boundaries through rotational semantic denoising and a pixel-wise boundary-padded decoding strategy.

The related findings, titled "PanoWan: Lifting Diffusion Video Generation Models to 360° with Latitude/Longitude-aware Mechanisms", have also been included in NeurIPS 2025.

Li Chenglong: Building Commercial Services Based on Milvus, the First Open-Source Vector Database

Milvus was officially open-sourced in October 2019. As the world's first open-source vector database, it has been adopted in projects at more than 10,000 enterprises and has accumulated 40,000 stars on GitHub. Milvus covers a rich set of data types, supporting Float, Sparse, Binary, and other vector types; it allows dynamic insertion and deletion, makes newly added data instantly searchable, and persists data to disk in real time. It also supports tag + vector filtering and keyword + vector hybrid search.
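As a rough illustration of that tag + vector filtering, a filtered similarity search with the pymilvus client might look like the sketch below (the database file, collection name, and fields are placeholders; check the current Milvus docs for exact parameters):

```python
from pymilvus import MilvusClient

# Milvus Lite stores data in a local file; a server URI such as
# "http://localhost:19530" works the same way.
client = MilvusClient("demo.db")
client.create_collection(collection_name="articles", dimension=4)

client.insert(
    collection_name="articles",
    data=[
        {"id": 1, "vector": [0.1, 0.2, 0.3, 0.4], "tag": "news"},
        {"id": 2, "vector": [0.9, 0.1, 0.4, 0.2], "tag": "blog"},
    ],
)

# "tag + vector" filtering: restrict candidates with a scalar filter,
# then rank the remaining rows by vector similarity.
hits = client.search(
    collection_name="articles",
    data=[[0.1, 0.2, 0.3, 0.4]],
    filter='tag == "news"',
    limit=3,
    output_fields=["tag"],
)
print(hits)
```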

Li Chenglong

Li Chenglong reviewed the architectural evolution of Milvus. In the LTS version released in March 2021, the team did a great deal of engineering work on data persistence, data sharding, and support for heterogeneous hardware. However, that version had a significant drawback: data writing, indexing, and other tasks were all handled in a single component, forming a single-machine architecture. When the data scale is large or QPS is high, its scalability is very limited, making it hard to cope with the data volumes of large enterprises or high-traffic query scenarios such as Double Eleven.

The team has since made numerous optimizations in the architecture of the latest Milvus 2.6, such as adding a StreamingNode to handle incremental data, merging the DataNode and IndexNode, and introducing the self-developed Woodpecker as a message queue built on the object storage layer.

After its success in open source, Zilliz began to consider how to commercialize, and ultimately concluded that there is essentially only one path for commercializing open-source infrastructure: offering SaaS services on the public cloud. Therefore, in addition to open-source Milvus, the company built Zilliz Cloud, a fully managed service based on it. Many of Zilliz's current enterprise customers first learned about the company through the open-source Milvus project, came to trust the product, and then adopted the SaaS service.

Chen Hui: Building a lightweight backbone network to achieve efficient and accurate edge-side visual understanding

Visual understanding is a hot topic in artificial intelligence, with significant value for both academic research and applications. It is now widely used in mobile devices, robots, autonomous driving, and other edge scenarios. However, the limited computing power of domestically produced chips, serious redundancy in traditional model structures, and the demand for high generality in complex scenarios make research on efficient visual models especially urgent.

Chen Hui

To meet the needs of real edge applications, Chen Hui's team focused on both the generality of the foundation model and the efficiency of inference, building lightweight backbone networks to establish an efficient, general-purpose visual foundation model and thus achieve efficient and accurate on-device visual understanding. The work covers three main technical directions:

* Design of asymmetric deep learning structures and lightweight dynamic network structures;

* Real-time end-to-end target detection model YOLOv10;

* Open-domain general visual understanding.

To address the redundancy caused by the symmetric "training-inference" structure of deep models, the team proposed the concept of an "asymmetric deep learning architecture": during training, a more complex structure is used for more effective learning, while at inference time equivalent transformations compress the computational path, enabling lightweight, rapid deployment. Within this framework, the team has released several influential backbone networks, including RepViT (CVPR 2024) and LSNet (CVPR 2025).
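The "train complex, deploy simple" idea can be illustrated with a minimal structural-reparameterization sketch in PyTorch (a generic RepVGG-style branch merge for illustration, not the actual RepViT/LSNet code):

```python
import torch
import torch.nn as nn

class RepBranchBlock(nn.Module):
    """Training form: 3x3 conv + 1x1 conv + identity, summed (no BN, for brevity)."""
    def __init__(self, channels):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        return self.conv3(x) + self.conv1(x) + x

    def reparameterize(self):
        """Inference form: fold all three branches into one equivalent 3x3 conv."""
        c = self.conv3.out_channels
        kernel = self.conv3.weight.data.clone()
        # Place the 1x1 kernel at the centre of the 3x3 kernel.
        kernel[:, :, 1:2, 1:2] += self.conv1.weight.data
        # The identity branch is a 3x3 kernel with 1 at the centre where in == out channel.
        for i in range(c):
            kernel[i, i, 1, 1] += 1.0
        fused = nn.Conv2d(c, c, 3, padding=1, bias=False)
        fused.weight.data = kernel
        return fused

block = RepBranchBlock(8).eval()
fused = block.reparameterize()
x = torch.randn(1, 8, 16, 16)
print(torch.allclose(block(x), fused(x), atol=1e-6))  # True: same output, single conv
```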

In object detection, the team focused on two major pain points of the YOLO series: redundant multi-box predictions that create a dependency on NMS post-processing, and redundancy in the model structure. To address the former, the team proposed a consistent dual label assignment strategy: during training, one-to-one and one-to-many detection heads are optimized jointly, while at inference only the one-to-one head is used, achieving NMS-free detection without loss of accuracy.

Furthermore, efficiency-driven and accuracy-driven model design methods were developed to address the high computational cost caused by structural redundancy. Based on these methods, the team built YOLOv10 (NeurIPS 2024), a new generation of NMS-free, efficient, and accurate object detection model that achieves a state-of-the-art balance between performance and inference efficiency.

* View the paper:

https://hyper.ai/papers/2405.14458
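For readers who want to try the model, YOLOv10 weights are distributed through the open-source ultralytics package; a minimal usage sketch follows (the weight file name and image path are placeholders, and the exact API may change between releases):

```python
from ultralytics import YOLO

# Load pretrained YOLOv10 weights (downloaded on first use).
model = YOLO("yolov10n.pt")

# NMS-free inference: the one-to-one head already outputs deduplicated boxes.
results = model("bus.jpg")
for r in results:
    print(r.boxes.xyxy, r.boxes.cls, r.boxes.conf)
```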

Regarding adaptation to diverse scenarios, traditional object detection models are often limited by predefined label sets and struggle with open real-world scenes. To address this, the team released YOLOE (ICCV 2025), a foundation model for visual understanding in open scenarios. The model provides generalizable cross-modal representations, uses structural reparameterization to reduce inference complexity, and unifies open-vocabulary detection and segmentation in a single model. It supports multimodal open prompts, including text and visual prompts, breaking through the limits of traditional visual understanding models.
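The open-vocabulary idea behind such models can be pictured with a tiny sketch: region features are scored against text embeddings of arbitrary category names instead of a fixed classifier head (purely illustrative tensors and names, not YOLOE's implementation):

```python
import torch
import torch.nn.functional as F

# Hypothetical embeddings: 5 detected regions and 3 free-form text prompts,
# both projected into a shared 512-d space by their respective encoders.
region_feats = F.normalize(torch.randn(5, 512), dim=-1)
text_feats = F.normalize(torch.randn(3, 512), dim=-1)  # e.g. "red backpack", "dog", "traffic cone"

# Open-vocabulary classification = cosine similarity against the prompt set;
# adding a new category only requires encoding a new piece of text.
scores = region_feats @ text_feats.T   # (5 regions, 3 prompts)
labels = scores.argmax(dim=-1)         # best-matching prompt per region
print(scores.shape, labels)
```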

Lei Zhengyu: MoonBit, Open Source Practices for Reconstructing Software Productivity in the AI-Native Era

MoonBit's exploration stems from an increasingly clear industry reality: large models are becoming deeply integrated into the entire software development process, but existing engineering systems cannot fully adapt to this change. As large models are woven into development, software engineering is undergoing a paradigm shift: AI is no longer just a tool but a core participant in code generation, refactoring, and verification. The working model is gradually shifting from the traditional "humans write code, machines assist" to "AI generates, humans review". The MoonBit team at the IDEA Research Institute is a pioneer of this trend.

Dr. Lei Zhengyu

Dr. Lei Zhengyu, a core developer in the MoonBit community, explained that traditional programming languages were not designed with AI interaction in mind, and AI-generated code often suffers from poor readability and is hard to debug and reuse. MoonBit's goal is to rebuild a software production system suited to the intelligent era around an AI-native programming language: making AI-generated code easier for humans to understand and more consistent with engineering practice, improving the overall efficiency of development, refactoring, and debugging, and building a future-oriented AI cloud-native development platform in an open-source way.

In his presentation, Lei Zhengyu mentioned that MoonBit's language design, compiler toolchain, and ecosystem development all emphasize three core goals:

* Extreme compilation speed and compact generated artifacts, with built-in static analysis tooling;

* A smooth learning curve and low complexity;

* Rich expressive power that does not rely on conventions.

Driven by this direction, the MoonBit community has accumulated thousands of open-source packages across areas such as web development, numerical computing, and open-source middleware SDKs, forming a thriving ecosystem. On the industry side, MoonBit is actively building technical bridges to Python, JavaScript, and WebAssembly. Through automated wrapping, cross-language calls, and a unified module-interface toolchain, developers can reuse Python's mature ecosystem directly within MoonBit, call JavaScript code seamlessly, or integrate WASM components, significantly reducing the repetitive development and compatibility costs of cross-language scenarios.
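As one concrete way to picture the WASM side of this interoperability: a MoonBit function compiled to a WebAssembly module could be called from Python with the wasmtime package, roughly as in the sketch below (the module file `add.wasm` and the exported `add` function are hypothetical; component-model integration may differ):

```python
from wasmtime import Store, Module, Instance

# Assume add.wasm was produced by compiling a MoonBit function `add(a, b)`
# to the WebAssembly target.
store = Store()
module = Module.from_file(store.engine, "add.wasm")
instance = Instance(store, module, [])

add = instance.exports(store)["add"]
print(add(store, 2, 3))  # -> 5
```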
