Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation

We introduce a diffusion-based framework that performs aligned novel view image and geometry generation via a warping-and-inpainting methodology. Unlike prior methods that require dense posed images or pose-embedded generative models limited to in-domain views, our method leverages off-the-shelf geometry predictors to predict partial geometries viewed from reference images, and formulates novel-view synthesis as an inpainting task for both image and geometry. To ensure accurate alignment between generated images and geometry, we propose cross-modal attention distillation, where attention maps from the image diffusion branch are injected into a parallel geometry diffusion branch during both training and inference. This multi-task approach yields synergistic effects, facilitating geometrically robust image synthesis as well as well-defined geometry prediction. We further introduce proximity-based mesh conditioning to integrate depth and normal cues, interpolating between points of the warped point cloud and filtering out erroneously predicted geometry so that it does not influence the generation process. Empirically, our method achieves high-fidelity extrapolative view synthesis on both image and geometry across a range of unseen scenes, delivers competitive reconstruction quality under interpolation settings, and produces geometrically aligned colored point clouds for comprehensive 3D completion. The project page is available at https://cvlab-kaist.github.io/MoAI.
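To make the attention-injection idea concrete, below is a minimal sketch of cross-modal attention sharing between two parallel diffusion branches, assuming standard scaled-dot-product attention. The function name and tensor layout are illustrative choices, not taken from the paper; the key point is that the attention map is computed once from the image branch's queries and keys and then reused to aggregate the geometry branch's values, so both modalities attend identically and stay spatially aligned.

```python
import torch

def cross_modal_attention(q_img, k_img, v_img, v_geo):
    """Illustrative sketch: compute the attention map in the image
    branch, then inject it into the geometry branch by reusing it to
    aggregate geometry values (shapes: [batch, tokens, dim])."""
    d_head = q_img.shape[-1]
    # Attention map from the image diffusion branch.
    attn = torch.softmax(q_img @ k_img.transpose(-2, -1) / d_head ** 0.5, dim=-1)
    out_img = attn @ v_img  # image branch uses its own values
    out_geo = attn @ v_geo  # geometry branch reuses the injected map
    return out_img, out_geo

# Toy usage with random features.
B, N, d = 1, 64, 32
q, k = torch.randn(B, N, d), torch.randn(B, N, d)
v_img, v_geo = torch.randn(B, N, d), torch.randn(B, N, d)
out_img, out_geo = cross_modal_attention(q, k, v_img, v_geo)
```

Sharing a single attention map is what enforces pixel-level correspondence between the generated image and geometry, rather than relying on each branch to learn alignment independently.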
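The filtering side of proximity-based mesh conditioning can be illustrated with a common depth-triangulation heuristic: drop faces stretched across depth discontinuities so that unreliable geometry does not condition generation. This is a hedged sketch under that assumption; the threshold and the exact rule are hypothetical and not necessarily the paper's formulation.

```python
import numpy as np

def mesh_from_depth(depth, edge_ratio_thresh=0.05):
    """Triangulate a depth map into a mesh, keeping only faces whose
    relative depth spread is small. Illustrative heuristic for filtering
    erroneous geometry; not claimed to be the paper's exact rule."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    verts = np.stack([xs.ravel(), ys.ravel(), depth.ravel()], axis=1)

    faces = []
    for y in range(h - 1):
        for x in range(w - 1):
            i = y * w + x
            quad = [i, i + 1, i + w, i + w + 1]
            d = depth[y:y + 2, x:x + 2].ravel()
            # Skip quads spanning a depth discontinuity: these are the
            # stretched faces that would leak wrong geometry downstream.
            if (d.max() - d.min()) / max(d.min(), 1e-6) < edge_ratio_thresh:
                faces.append([quad[0], quad[2], quad[1]])
                faces.append([quad[1], quad[2], quad[3]])
    return verts, np.asarray(faces)
```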