Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation

We introduce a diffusion-based framework that performs aligned novel view image and geometry generation via a warping-and-inpainting methodology. Unlike prior methods that require dense posed images or pose-embedded generative models limited to in-domain views, our method leverages off-the-shelf geometry predictors to predict partial geometries viewed from reference images, and formulates novel-view synthesis as an inpainting task for both image and geometry. To ensure accurate alignment between generated images and geometry, we propose cross-modal attention distillation, where attention maps from the image diffusion branch are injected into a parallel geometry diffusion branch during both training and inference. This multi-task approach yields synergistic effects, facilitating geometrically robust image synthesis as well as well-defined geometry prediction. We further introduce proximity-based mesh conditioning to integrate depth and normal cues, interpolating between points of the warped point cloud and filtering out erroneously predicted geometry so that it does not influence the generation process. Empirically, our method achieves high-fidelity extrapolative view synthesis on both image and geometry across a range of unseen scenes, delivers competitive reconstruction quality under interpolation settings, and produces geometrically aligned colored point clouds for comprehensive 3D completion. The project page is available at https://cvlab-kaist.github.io/MoAI.
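To make the attention-injection idea concrete, below is a minimal sketch of cross-modal attention sharing between two parallel diffusion branches, assuming standard scaled-dot-product attention. The function name and tensor layout are illustrative choices, not taken from the paper; the key point is that the attention map is computed once from the image branch's queries and keys and then reused to aggregate the geometry branch's values, so both modalities attend identically and stay spatially aligned.

```python
import torch

def cross_modal_attention(q_img, k_img, v_img, v_geo):
    """Illustrative sketch: compute the attention map in the image
    branch, then inject it into the geometry branch by reusing it to
    aggregate geometry values (shapes: [batch, tokens, dim])."""
    d_head = q_img.shape[-1]
    # Attention map from the image diffusion branch.
    attn = torch.softmax(q_img @ k_img.transpose(-2, -1) / d_head ** 0.5, dim=-1)
    out_img = attn @ v_img  # image branch uses its own values
    out_geo = attn @ v_geo  # geometry branch reuses the injected map
    return out_img, out_geo

# Toy usage with random features.
B, N, d = 1, 64, 32
q, k = torch.randn(B, N, d), torch.randn(B, N, d)
v_img, v_geo = torch.randn(B, N, d), torch.randn(B, N, d)
out_img, out_geo = cross_modal_attention(q, k, v_img, v_geo)
```

Sharing a single attention map is what enforces pixel-level correspondence between the generated image and geometry, rather than relying on each branch to learn alignment independently.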
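The filtering side of proximity-based mesh conditioning can be illustrated with a common depth-triangulation heuristic: drop faces stretched across depth discontinuities so that unreliable geometry does not condition generation. This is a hedged sketch under that assumption; the threshold and the exact rule are hypothetical and not necessarily the paper's formulation.

```python
import numpy as np

def mesh_from_depth(depth, edge_ratio_thresh=0.05):
    """Triangulate a depth map into a mesh, keeping only faces whose
    relative depth spread is small. Illustrative heuristic for filtering
    erroneous geometry; not claimed to be the paper's exact rule."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    verts = np.stack([xs.ravel(), ys.ravel(), depth.ravel()], axis=1)

    faces = []
    for y in range(h - 1):
        for x in range(w - 1):
            i = y * w + x
            quad = [i, i + 1, i + w, i + w + 1]
            d = depth[y:y + 2, x:x + 2].ravel()
            # Skip quads spanning a depth discontinuity: these are the
            # stretched faces that would leak wrong geometry downstream.
            if (d.max() - d.min()) / max(d.min(), 1e-6) < edge_ratio_thresh:
                faces.append([quad[0], quad[2], quad[1]])
                faces.append([quad[1], quad[2], quad[3]])
    return verts, np.asarray(faces)
```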