SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass

3D content generation has recently attracted significant research interest due to its applications in VR/AR and embodied AI. In this work, we address the challenging task of synthesizing multiple 3D assets within a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input, simultaneously producing multiple 3D assets with geometry and texture. Notably, SceneGen operates with no need for optimization or asset retrieval; (ii) we introduce a novel feature aggregation module that integrates local and global scene information from visual and geometric encoders within the feature extraction module. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen's direct extensibility to multi-image input scenarios. Despite being trained solely on single-image inputs, our architectural design enables improved generation performance with multi-image inputs; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robust generation abilities of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks. The code and model will be publicly available at: https://mengmouxu.github.io/SceneGen.
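The data flow described in contribution (ii) — per-object local features fused with a global scene feature, then passed through a position head to predict each asset's relative placement in one feedforward pass — can be sketched roughly as below. All names, shapes, and the concatenation-based aggregation are illustrative assumptions; the abstract does not specify the module internals (which would in practice be learned layers, not random matrices):

```python
import numpy as np

rng = np.random.default_rng(0)

def aggregate_features(local_feats, global_feat):
    """Hypothetical aggregation: pair each object's local feature with a
    broadcast copy of the global scene feature (stand-in for the paper's
    learned feature aggregation module)."""
    n = local_feats.shape[0]
    g = np.broadcast_to(global_feat, (n, global_feat.shape[-1]))
    return np.concatenate([local_feats, g], axis=-1)

def position_head(fused, weight, bias):
    """Toy linear head mapping each fused feature to a 3D relative position."""
    return fused @ weight + bias

# Toy dimensions (assumptions, not from the paper).
num_objects, d_local, d_global = 3, 8, 4
local_feats = rng.normal(size=(num_objects, d_local))  # per-object (masked) features
global_feat = rng.normal(size=(d_global,))             # whole-scene context feature

fused = aggregate_features(local_feats, global_feat)   # shape: (3, 12)
weight = rng.normal(size=(fused.shape[-1], 3))
bias = np.zeros(3)
positions = position_head(fused, weight, bias)         # shape: (3, 3), one xyz per asset
```

Because everything is a single forward computation over all objects at once, no per-scene optimization loop or asset-retrieval step is involved, matching contribution (i).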