
Seeing Fast and Slow: Learning the Flow of Time in Videos

Yen-Siang Wu Rundong Luo Jingsen Zhu Tao Tu Ali Farhadi Matthew Wallingford Yu-Chiang Frank Wang Steve Marschner Wei-Chiu Ma

Abstract

How can we tell whether a video has been sped up or slowed down? And how can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study "time" as a learnable visual concept and develop models that reason about and manipulate the flow of time in videos. First, we exploit the multimodal cues and temporal structure naturally present in videos to learn to detect speed changes and estimate playback speed in a self-supervised manner. We then show that our learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from unstructured in-the-wild sources. These slow-motion clips, typically captured with high-speed cameras, contain far richer temporal detail than standard videos. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at a specified playback speed, and temporal super-resolution, which converts blurry low-FPS videos into high-FPS sequences with fine temporal detail. Our results highlight time as a perceivable and manipulable dimension of video learning, opening new avenues for time-controllable video generation, temporal forensics detection, and potentially richer world models that understand how events unfold over time.

One-sentence Summary

By treating time as a learnable visual concept, the researchers propose a self-supervised framework to detect playback speed and curate large-scale slow-motion datasets, which enables advanced temporal control through speed-conditioned video generation and temporal super-resolution for enhancing low-frame-rate sequences.

Key Contributions

  • The paper introduces a self-supervised method for detecting temporal speed changes and estimating playback speed by leveraging the natural coupling between visual motion and audio pitch shifts. This approach enables the training of a visual speed-change detector that operates solely on video input during inference.
  • This work presents the SloMo-44K dataset, which is the largest slow-motion video dataset to date, curated from noisy in-the-wild sources using the learned temporal reasoning models. This dataset provides the rich temporal detail necessary for training models to understand and control the flow of time.
  • The research develops models for fine-grained temporal control, including speed-conditioned video generation that produces motion at specific playback speeds and temporal super-resolution that transforms low-FPS videos into high-FPS sequences. These models achieve state-of-the-art performance across both video understanding and generation tasks.

Introduction

Understanding the flow of time is essential for creating realistic video models and conducting temporal forensics. While modern computer vision models excel at spatial understanding, they often lack temporal reasoning because they are trained primarily on videos with standard, fixed frame rates. This limitation causes existing vision-language and generative models to struggle with predicting playback speeds or generating content at specific temporal cadences. The authors address these challenges by treating time as a learnable visual concept. They leverage multimodal cues and self-supervised learning to develop models capable of detecting speed changes and estimating playback speed. This approach allowed them to curate the largest slow-motion video dataset to date, which they used to enable advanced temporal control, including speed-conditioned video generation and high-fidelity temporal super-resolution.

Dataset

The authors introduce SloMo-44K, a large-scale dataset designed for slow-motion video understanding and generation. The dataset details are as follows:

  • Composition and Sources: The dataset consists of 44,632 slow-motion video clips totaling 18 million frames. The raw video material is sourced from YouTube, Vimeo, and Flickr using queries related to high frame rates and slow motion.
  • Data Filtering and Quality Control: The authors implement a multi-stage pipeline to ensure high quality. They use TransNetv2 for shot segmentation and an OCR model to remove clips with excessive text. To maintain content integrity, Qwen2.5-VL is used to filter out CGI and screen recordings, while video quality assessment (VQA) metrics are applied to discard low-quality samples.
  • Slow-Motion Identification: To prevent the dataset from being dominated by standard speed content, the authors use a two-stage filtering process. This combines a VideoLLM (Gemini) to localize slow-motion segments with a fine-tuned ViT-based classifier (VideoMAEv2) trained on human-annotated clips. A clip is only retained if it meets strict thresholds from both models.
  • Processing and Annotation:
    • Temporal Segmentation: A speed change detector segments videos into clips with homogeneous playback speeds.
    • Speed Annotation: The authors use a speed estimator to provide pseudo-speed annotations for each clip.
    • Metadata Construction: Dense captions are generated using InternVL3. These include short and long descriptions along with specific attributes such as background, style, shot type, lighting, and atmosphere to capture both semantic and aesthetic details.
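The multi-stage filtering described above can be sketched as a chain of predicates over per-clip metadata. The field names and threshold values below are illustrative assumptions, not the authors' actual settings; the point is the structure: a clip survives only if every stage accepts it, and the final slow-motion stage requires both models to clear their thresholds.

```python
def passes_pipeline(clip, filters):
    """Keep a clip only if every filtering stage accepts it."""
    return all(f(clip) for f in filters)

# Illustrative stage predicates (hypothetical field names and thresholds).
filters = [
    lambda c: c["ocr_text_ratio"] < 0.05,         # drop text-heavy clips
    lambda c: not c["is_cgi_or_screen_rec"],      # content-integrity check
    lambda c: c["vqa_score"] >= 0.6,              # video-quality threshold
    lambda c: c["videollm_slowmo_score"] >= 0.8   # both slow-motion models
              and c["classifier_score"] >= 0.8,   # must clear thresholds
]

good = {"ocr_text_ratio": 0.01, "is_cgi_or_screen_rec": False,
        "vqa_score": 0.9, "videollm_slowmo_score": 0.95,
        "classifier_score": 0.9}
bad = dict(good, classifier_score=0.4)  # fails the two-model agreement
```

Structuring the stages as independent predicates mirrors the paper's design: cheap checks (OCR, shot segmentation) run before expensive model-based ones, and the conjunctive final stage enforces the "strict thresholds from both models" retention rule.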

Method

The authors leverage a self-supervised framework to train a playback speed estimator that learns to predict temporal speed variations within videos without requiring ground-truth speed annotations. The core idea is to enforce equivariance under temporal resampling: if a video is accelerated by a factor $k$, the predicted speed should scale by the same factor. This principle is formalized through a loss function that compares the logarithm of the predicted speed of the accelerated clip $\mathbf{V}^{k}$ with the logarithm of $k$ times the predicted speed of the original clip $\mathbf{V}$. The training objective is defined as:

$$\mathcal{L} = \left[ \log f_{\theta}(\mathbf{V}^{k}) - \log\left(k \cdot f_{\theta}(\mathbf{V})\right) \right]^{2}.$$

This self-supervised signal is applied during training, where clips are subsampled by a random factor $k \sim \mathcal{N}(1, \frac{T}{2})$. For videos with known frame rates, the model also incorporates a supervised regression objective to directly predict the playback speed. The framework is designed to detect speed changes and estimate absolute speed, as illustrated in the figure below.
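As a minimal sketch (plain Python, not the authors' code), the equivariance objective reduces to a squared log-ratio penalty: it is zero exactly when the estimator's prediction for the accelerated clip equals $k$ times its prediction for the original.

```python
import math

def equivariance_loss(pred_speed_orig, pred_speed_accel, k):
    """Squared difference between log f(V^k) and log(k * f(V)):
    zero exactly when the estimator scales its prediction by k
    under k-times temporal subsampling."""
    return (math.log(pred_speed_accel) - math.log(k * pred_speed_orig)) ** 2

# A perfectly equivariant estimator incurs zero loss ...
perfect = equivariance_loss(1.0, 2.0, k=2.0)  # log(2) - log(2*1) = 0
# ... while a speed-blind one (same prediction either way) is penalized.
blind = equivariance_loss(1.0, 1.0, k=2.0)
```

Working in log space makes the penalty symmetric with respect to over- and under-estimation of speed ratios, which is why the objective compares logarithms rather than raw speeds.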

For speed-conditioned video generation, the model builds upon the Wan2.1-I2V architecture and introduces explicit speed control mechanisms. Given an image, a text prompt, and a target playback speed, the model generates videos with dynamic content that reflects the specified temporal rate. To achieve this, the target speed is first discretized into logarithmically spaced buckets, ranging from 0.01× to 1.0×, and encoded using sinusoidal positional embeddings. This bucket ID is then passed through a multilayer perceptron and added to the timestep embedding, which modulates the denoising schedule to align with the desired temporal speed. The discretization process is defined as:

$$\mathrm{Bucket\_ID} = \left\lfloor \frac{\log(\mathrm{speed}) - \log(0.01)}{\log(1) - \log(0.01)} \cdot N_{\mathrm{buckets}} \right\rfloor,$$

where $N_{\mathrm{buckets}} = 10$ is empirically set. To further enhance speed control, the model applies frame-wise conditioning by modulating the latent features using an MLP that takes a positional embedding of the product of the timestep and the target speed. This conditioning is applied as:

$$\mathrm{latent}[i] \gets \mathrm{latent}[i] + \mathrm{MLP}_{\psi}\left(\phi(i \cdot \mathrm{speed})\right),$$

where $\mathrm{latent}[i]$ denotes the latent feature at temporal index $i$. This mechanism allows the model to generate videos with varying motion dynamics, as demonstrated in the figure below.
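A minimal numpy sketch of the discretization and frame-wise conditioning above; the embedding form, the stand-in MLP, and the latent sizes are illustrative assumptions, as is the clamp that keeps speed = 1.0 inside the last bucket (the floor alone would yield an out-of-range index at the top edge).

```python
import numpy as np

N_BUCKETS = 10
SPEED_MIN, SPEED_MAX = 0.01, 1.0

def bucket_id(speed):
    """Map a playback speed in [0.01x, 1.0x] to one of N_BUCKETS
    logarithmically spaced bucket indices."""
    frac = (np.log(speed) - np.log(SPEED_MIN)) / (np.log(SPEED_MAX) - np.log(SPEED_MIN))
    return min(int(frac * N_BUCKETS), N_BUCKETS - 1)  # clamp the top edge

def sinusoidal_embedding(x, dim=8):
    """Toy sinusoidal embedding phi(x); the paper's exact form is not
    reproduced here."""
    freqs = np.exp(np.linspace(0.0, 4.0, dim // 2))
    return np.concatenate([np.sin(x * freqs), np.cos(x * freqs)])

def condition_latents(latents, speed, mlp):
    """Frame-wise conditioning: add an MLP projection of
    phi(i * speed) to the latent at each temporal index i."""
    out = latents.copy()
    for i in range(latents.shape[0]):
        out[i] = out[i] + mlp(sinusoidal_embedding(i * speed))
    return out

# Stand-in "MLP": a fixed linear projection to the latent width.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16)) * 0.01
mlp = lambda e: e @ W

latents = np.zeros((4, 16))  # 4 frames, latent width 16
conditioned = condition_latents(latents, speed=0.25, mlp=mlp)
```

Because the embedding argument is $i \cdot \mathrm{speed}$, each frame receives a distinct offset whose spacing grows with the target speed, which is how the conditioning encodes temporal rate per frame rather than globally.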

Experiment

The researchers evaluate their approach through a series of experiments designed to validate temporal speed perception and manipulation. They first benchmark a speed-change detector and a playback-speed estimator, demonstrating that their self-supervised methods achieve high accuracy and closely approximate human perception. Using the newly curated SloMo-44K dataset, the study further validates models for speed-conditioned video generation and temporal super-resolution, showing that these models can synthesize lifelike motion at controllable speeds and reconstruct sharp, high-frame-rate sequences even from motion-blurred inputs. Overall, the findings suggest that leveraging high-frame-rate data and cross-modal cues enables superior modeling of real-world physical dynamics compared to existing methods.

The authors introduce a large-scale slow-motion dataset spanning diverse activities and temporal scales, which they use to train models for understanding and manipulating video speed. The dataset is substantially larger than existing slow-motion collections in clips, videos, and frames, enabling more robust temporal modeling. Models trained on it achieve near-human accuracy in playback speed estimation, outperform baselines in speed-controlled video generation and temporal super-resolution, and deliver strong results in video forensics and high-fidelity synthesis of real-world dynamics across a wide range of speeds.

The authors present a method for video understanding and generation that leverages self-supervised signals to infer playback speed and manipulate temporal dynamics. Evaluated on speed estimation and temporal super-resolution, it outperforms baselines on both quantitative and perceptual metrics and aligns closely with human perception, particularly when generating realistic slow-motion videos and enhancing low-frame-rate footage. In video generation, it achieves higher perceptual quality, smoother motion, and more accurate speed controllability than the baseline, and its gains in image quality and flicker reduction indicate superior temporal coherence.

The authors evaluate their temporal super-resolution method against several baselines on both clean and motion-blurred inputs. It achieves the best results across multiple metrics, including FID and FVD, and consistently outperforms existing models, reconstructing sharp, temporally coherent frames with fine-grained motion detail even from heavily blurred inputs.

The authors evaluate their playback speed predictor against several baselines, including human experts and existing methods such as VideoLLM and SpeedNet. The proposed method performs close to human experts, significantly outperforming prior models on correlation and error metrics and narrowing the gap between machine and human performance. It remains robust and accurate across diverse conditions, including motion blur and extreme temporal scales.

The authors evaluate the impact of training data on playback speed prediction, comparing models trained on standard videos with those trained on their large-scale slow-motion dataset, SloMo-44K. Training on SloMo-44K yields significantly better performance on all metrics: higher correlation with ground truth, under both linear and rank-based measures, and lower prediction error.

The authors evaluate their method using a large-scale slow-motion dataset across tasks including speed estimation, video generation, and temporal super-resolution. The experiments demonstrate that the proposed approach achieves near-human accuracy in speed perception and produces high-quality, temporally consistent videos with superior motion smoothness. Furthermore, the model proves highly robust in reconstructing fine-grained details from motion-blurred inputs and benefits significantly from the diverse temporal scales provided by the new dataset.

