Towards Understanding Camera Motions in Any Video

We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.
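The zoom-in versus translate-forward distinction mentioned above can be illustrated with the standard pinhole camera model: a zoom changes the focal length in the intrinsic matrix K, while a dolly changes the camera's extrinsic translation. The following is a minimal sketch, not code from the paper; the point coordinates, focal lengths, and dolly distance are hypothetical values chosen only to show that a zoom magnifies all points uniformly whereas a forward translation magnifies near points more than far ones (parallax).

import numpy as np

def project(points_world, f, cam_z):
    # Pinhole camera looking down +Z with identity rotation.
    # f     : focal length in pixels (intrinsics)
    # cam_z : camera position along Z (extrinsics); point depth is Z - cam_z
    K = np.array([[f, 0, 0],
                  [0, f, 0],
                  [0, 0, 1]], dtype=float)
    depths = points_world[:, 2] - cam_z
    cam_coords = np.column_stack([points_world[:, 0],
                                  points_world[:, 1],
                                  depths])
    pix = (K @ cam_coords.T).T
    return pix[:, :2] / pix[:, 2:3]   # perspective divide

# Two hypothetical scene points at different depths.
pts = np.array([[1.0, 0.5, 4.0],     # near point
                [1.0, 0.5, 20.0]])   # far point

base    = project(pts, f=500.0,  cam_z=0.0)
zoomed  = project(pts, f=1000.0, cam_z=0.0)   # zoom-in: intrinsics change
dollied = project(pts, f=500.0,  cam_z=2.0)   # dolly-in: extrinsics change

print(zoomed / base)    # both points scale by 2.0 -> no parallax
print(dollied / base)   # near point scales by 2.0, far point by ~1.11 -> parallax

Under these assumptions, the zoom leaves the relative layout of the image unchanged, while the dolly changes it, which is exactly the geometric cue a trained annotator (or model) can use to tell the two motions apart.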