Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
Kang Liao Size Wu Zhonghua Wu Linyi Jin Chao Wang Yikai Wang Fei Wang Wei Li Chen Change Loy
Abstract
Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin's superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance. We will release the code, models, dataset pipeline, and benchmark to advance multimodal spatial intelligence research.