FlowScene: Style-Consistent Indoor Scene Generation via Multimodal Graph Rectified Flow
Zhifei Yang, Guangyao Zhai, Keyang Lu, YuYang Yin, Chao Zhang, Zhen Xiao, Jieyi Long, Nassir Navab, Yikai Wang
Abstract
Scene generation has broad industrial applications that demand both high realism and precise control over geometry and appearance. Language-driven retrieval methods compose plausible scenes from large object databases, but they tend to overlook object-level control and fail to adequately ensure style consistency at the scene scale. Graph-based formulations offer strong controllability over objects and promote holistic coherence by explicitly modeling relationships, yet existing methods struggle to produce high-fidelity textured results, which limits their practicality. In this work, we propose FlowScene, a tri-branch scene generation model conditioned on multimodal graphs that collaboratively generates scene layouts, object shapes, and object textures. At its core lies a tightly coupled rectified flow model that exchanges object information during generation, enabling collaborative reasoning across the entire graph. This achieves fine-grained control over object shapes, textures, and relationships while ensuring scene-scale style consistency in both structure and appearance. Extensive experiments confirm that FlowScene outperforms both language-conditioned and graph-conditioned baselines in generation realism, style consistency, and alignment with human preferences.
One-sentence Summary
Researchers from Peking University and Technical University of Munich introduce FlowScene, a tri-branch generative model that leverages a tightly coupled rectified flow mechanism to collaboratively synthesize layouts, shapes, and textures from multimodal graphs, achieving superior style coherence and object-level control compared to existing language- or graph-driven baselines.
Key Contributions
- The paper introduces FlowScene, a tri-branch generative model conditioned on multimodal graphs that collaboratively produces scene layouts, object shapes, and object textures to ensure fine-grained control and scene-level style coherence.
- A tightly coupled rectified flow mechanism is presented as the core engine, which exchanges node information during the sampling process to satisfy both individual object conditions and holistic scene constraints while accelerating generation compared to diffusion-based methods.
- Extensive experiments demonstrate that the method outperforms language-conditioned and graph-conditioned baselines in generation realism, style consistency, and alignment with human preferences, supported by a workflow for diverse input sources.
Introduction
Scene generation is critical for industries like interior design, VR/AR, and robotics, where applications demand high realism alongside precise control over geometry and appearance. Prior language-driven methods often fail to enforce scene-level style coherence or provide granular object control, while existing graph-based approaches struggle to produce high-fidelity textured results in an end-to-end manner. The authors introduce FlowScene, a tri-branch generative model that leverages Multimodal Graph Rectified Flow to collaboratively generate scene layouts, object shapes, and textures. By tightly coupling node information exchange during the sampling process, this approach enables fine-grained control over individual objects while ensuring consistent style across the entire scene structure and appearance.
Method
The proposed framework, FlowScene, operates as a tri-branch generator conditioned on a multimodal graph to synthesize indoor scenes. The system accepts diverse inputs, including natural language descriptions and reference images, which are parsed into a structured scene graph. This graph serves as the central conditioning mechanism for the generation process, ensuring consistency across object layouts, shapes, and textures.

The process begins with the construction of a multimodal scene graph G_M = (V_M, E). Nodes in V_M represent objects and can be text-only, image-only, or multimodal, aggregating learnable embeddings with foundation features from CLIP or DINOv2. Edges in E encode spatial and semantic relationships such as "left of" or "bigger than." An LLM or VLM parses user inputs to populate this graph, allowing for fine-grained control over the scene configuration.
To generate the scene, the authors employ a Multimodal Graph Rectified Flow backbone. This module adapts rectified flow models to handle multiple content generation jointly. The core of this mechanism is the InfoExchangeUnit, which utilizes a Triplet-Graph Convolutional Network (Triplet-GCN) to perform message passing and feature aggregation across the graph edges. During the denoising process, this unit incorporates temporal denoising data D_t and projects it alongside node features to produce time-dependent conditions C_t. This ensures that the generation of each object respects the global constraints imposed by the graph structure.
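One round of triplet-style message passing can be sketched as follows. This is a hedged approximation of a generic Triplet-GCN layer, not the paper's InfoExchangeUnit: the shared weight matrix, the tanh nonlinearity, and the residual mean aggregation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def triplet_gcn_step(node_feats, edge_feats, edges, W):
    """One message-passing round over (subject, predicate, object) triplets."""
    d = node_feats.shape[1]
    messages = np.zeros_like(node_feats)
    counts = np.zeros(len(node_feats))
    for (src, dst), e in zip(edges, edge_feats):
        # Transform the full triplet jointly, then split into endpoint updates.
        triplet = np.concatenate([node_feats[src], e, node_feats[dst]])
        m = np.tanh(W @ triplet)
        src_msg, dst_msg = m[:d], m[d:2 * d]
        messages[src] += src_msg; counts[src] += 1
        messages[dst] += dst_msg; counts[dst] += 1
    counts = np.maximum(counts, 1)
    return node_feats + messages / counts[:, None]  # residual mean update

d = 4
nodes = rng.normal(size=(3, d))      # three object nodes
edge_feats = rng.normal(size=(2, d)) # two relation (predicate) features
edges = [(0, 1), (1, 2)]
W = rng.normal(size=(2 * d, 3 * d))  # maps a triplet to two node updates
out = triplet_gcn_step(nodes, edge_feats, edges, W)
```

The point of the triplet formulation is that each message depends on both endpoints and the relation itself, so "left of" and "bigger than" propagate different constraints across the same edge.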

The training objective minimizes the least-squares error between the predicted velocity field vθ and the target velocity derived from linear interpolation between data and noise. The loss function is defined as:
$\mathcal{L}_{\mathrm{GRF}} = \mathbb{E}_{D,C,t}\left[\lVert \Theta_D(D_t, C_t, t) - v \rVert_2^2\right]$, where $\Theta_D$ is the denoiser network. During inference, the model integrates the reverse-time ODE starting from Gaussian noise to recover the target data distribution.
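The objective and the sampling loop can be made concrete with a toy single-branch sketch, assuming the standard rectified-flow interpolation $D_t = (1-t)\,\text{noise} + t\,\text{data}$ with target velocity $v = \text{data} - \text{noise}$. The oracle "denoiser", the Euler integrator, and the step count are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def rf_loss(denoiser, data, noise, cond, t):
    """Least-squares rectified-flow objective at interpolation time t."""
    d_t = (1.0 - t) * noise + t * data   # linear interpolation D_t
    v_target = data - noise              # constant target velocity v
    v_pred = denoiser(d_t, cond, t)
    return np.mean((v_pred - v_target) ** 2)

def euler_sample(denoiser, cond, shape, steps=8):
    """Integrate dx/dt = v_theta(x, c, t) from Gaussian noise over t in [0, 1]."""
    x = rng.normal(size=shape)
    dt = 1.0 / steps
    t = 0.0
    for _ in range(steps):
        x = x + dt * denoiser(x, cond, t)
        t += dt
    return x

# An oracle that outputs the exact target velocity drives the loss to zero.
data = np.full((2, 3), 0.5)
noise = rng.normal(size=(2, 3))
loss = rf_loss(lambda x, c, t: data - noise, data, noise, None, t=0.3)

# For a single data point the exact field is (data - x) / (1 - t);
# Euler integration then transports noise onto the data.
sample = euler_sample(lambda x, c, t: (data - x) / (1.0 - t), None, (2, 3))
```

The straight-line target velocity is what makes rectified flow fast at inference: the nearly straight ODE trajectories tolerate few integration steps, which is the source of the speedup over diffusion samplers noted in the contributions.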

FlowScene decomposes the generation task into three coordinated branches, each backed by the graph rectified flow module. The Layout Branch generates 3D bounding boxes defined by location, size, and rotation, utilizing a specialized LayoutExchangeUnit to enforce spatial constraints. The Shape Branch operates in parallel to generate voxelized object shapes. It employs a Shape VQ-VAE to encode sparse voxel structures into compact latent codes, which are then denoised and decoded. The Texture Branch is subordinate to the shape branch, anchoring Gaussian noise to the geometric structure to generate textures. It uses a Texture VQ-VAE and a TextureExchangeUnit to ensure style consistency across objects, which is particularly crucial for text-only nodes where appearance is inferred from relational context.
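The Shape and Texture branches both denoise in a VQ-VAE latent space. The quantization step at the heart of such a model can be sketched as nearest-codebook lookup; the codebook size, dimensionality, and values below are illustrative assumptions, not the trained Shape/Texture VQ-VAE codebooks.

```python
import numpy as np

def quantize(latents, codebook):
    """Snap each continuous latent row to its nearest codebook entry.

    Returns the chosen indices and the quantized vectors; a decoder would
    consume the quantized vectors, while the flow model denoises over them.
    """
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])   # toy 2-entry codebook
latents = np.array([[0.1, -0.1], [0.9, 1.2]])   # encoder outputs
idx, quantized = quantize(latents, codebook)
```

Operating on these compact discrete-latent grids, rather than raw voxels or pixels, is what lets the three branches exchange information cheaply during sampling.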

Prior to training the generative branches, objects undergo preprocessing to prepare the data for the VQ-VAEs. Objects are voxelized into sparse structures, and multi-view images are rendered to extract DINOv2 features. These features are reprojected onto the voxel grid and averaged to create feature voxels. The Shape VQ-VAE learns to reconstruct the voxelized geometry, while the Texture VQ-VAE learns to reconstruct the object appearance from the feature voxels. This preprocessing ensures that the generative models operate on efficient latent representations rather than raw high-dimensional data.
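The reproject-and-average step that builds feature voxels can be sketched as follows. The toy "projection" below simply supplies each view's per-pixel features with known voxel coordinates; a real pipeline would use camera poses to compute those hits, so this mapping is an assumption.

```python
import numpy as np

def build_feature_voxels(occupied, view_hits, feat_dim):
    """Average multi-view features onto occupied voxels.

    occupied:  iterable of (x, y, z) voxel coordinates of the sparse shape.
    view_hits: list of ((x, y, z), feature) pairs, one per reprojected pixel.
    Returns a dict mapping voxel coordinate -> mean feature vector.
    """
    sums = {tuple(c): np.zeros(feat_dim) for c in occupied}
    counts = {tuple(c): 0 for c in occupied}
    for coord, feat in view_hits:
        key = tuple(coord)
        if key in sums:              # discard rays that miss the object
            sums[key] += feat
            counts[key] += 1
    return {k: sums[k] / max(counts[k], 1) for k in sums}

occupied = [(0, 0, 0), (1, 0, 0)]
hits = [((0, 0, 0), np.array([1.0, 3.0])),   # from rendered view 1
        ((0, 0, 0), np.array([3.0, 1.0])),   # from rendered view 2
        ((1, 0, 0), np.array([2.0, 2.0]))]   # voxel seen in one view only
fv = build_feature_voxels(occupied, hits, feat_dim=2)
```

Averaging across views smooths view-dependent effects in the DINOv2 features, giving the Texture VQ-VAE a single appearance target per voxel.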

The system supports two primary application modes for user interaction. In the language-driven mode, users provide natural language descriptions which are parsed into a scene graph by an LLM. In the interactive GUI mode, users select object candidates and define relations through a visual interface. Both modes feed into the multimodal graph, which drives the FlowScene backend to generate a high-fidelity textured 3D scene consistent with the specified configuration.
Experiment
- Experiments on SG-FRONT and 3D-FRONT datasets validate that FlowScene outperforms training-free language-based methods and graph-conditioned generative models in scene-level realism, object-level geometric fidelity, and style consistency.
- Quantitative and qualitative comparisons demonstrate that FlowScene achieves superior alignment with human preferences, producing scenes with higher visual quality and more accurate layout adherence to textual and graph constraints.
- Ablation studies confirm that the InfoExchangeUnit is critical for spatial coherence and appearance consistency, while the graph flow backbone significantly enhances shape generation quality compared to diffusion baselines.
- Robustness tests show that training with multi-view inputs enables flexible adaptation to varying visual conditions, and the model maintains high performance even with sparse relational information in the input graph.
- Additional evaluations highlight FlowScene's ability to preserve inter-shape consistency for identical objects and effectively propagate local graph updates to maintain holistic scene structure.