GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens

Roni Itkin, Noam Issachar, Yehonatan Keypur, Anpei Chen, Sagie Benaim

Abstract

Efficient spatial allocation of primitives forms the foundation of 3D Gaussian Splatting, as it directly determines the synergy among representation compactness, reconstruction speed, and rendering fidelity. Existing solutions, whether based on iterative optimization or feed-forward inference, suffer significant trade-offs among these goals because they rely on local, heuristic-driven allocation strategies that lack global scene awareness. Specifically, current feed-forward methods are largely pixel-aligned or voxel-aligned. By unprojecting pixels into dense, view-aligned primitives, they bake redundancy into the 3D asset. As more input views are added, the representation size grows and global consistency becomes fragile.

To address this, we introduce GlobalSplat, a framework built on the principle of "align first, decode later". Our approach learns a compact, global, latent scene representation that encodes multi-view input and resolves cross-view correspondences before decoding any explicit 3D geometry. Crucially, this formulation enables compact, globally consistent reconstructions without relying on pretrained pixel-prediction backbones or reusing latent features from dense baselines. By using a coarse-to-fine training curriculum that gradually increases decoded capacity, GlobalSplat inherently prevents representation bloat.

On the RealEstate10K and ACID datasets, our model achieves competitive novel-view synthesis performance using only 16K Gaussians, significantly fewer than dense pipelines require, with a lightweight 4MB footprint. Furthermore, GlobalSplat runs in under 78 ms in a single forward pass, enabling much faster inference than the baselines. The project page is available at https://r-itk.github.io/globalsplat/.

One-sentence Summary

GlobalSplat is a feed-forward framework that utilizes global scene tokens to learn a compact latent representation before decoding explicit geometry, achieving high-fidelity 3D Gaussian Splatting reconstructions on RealEstate10K and ACID with as few as 16K Gaussians, a 4MB footprint, and inference speeds under 78 milliseconds.

Key Contributions

  • The paper introduces GlobalSplat, a feed-forward 3D Gaussian Splatting framework based on an "align first, decode later" principle that aggregates multi-view observations into a compact, fixed-size set of global scene tokens. This approach resolves cross-view correspondences within a global latent representation before decoding explicit 3D geometry to eliminate the redundancy found in dense, view-centric pipelines.
  • The method implements a disentangled dual-branch architecture paired with a coarse-to-fine training curriculum that gradually increases decoded capacity. This design prevents representation bloat and enables a stronger quality-efficiency trade-off when reconstructing large-context scenes.
  • Experiments on the RealEstate10K and ACID datasets demonstrate that the model achieves competitive novel-view synthesis performance using as few as 16K Gaussians and a 4MB footprint. The framework also provides high efficiency, performing inference in under 78 milliseconds in a single forward pass.

Introduction

Feed-forward 3D Gaussian Splatting (3DGS) aims to generate explicit 3D representations from multiple input views in a single network pass, enabling fast novel-view synthesis without per-scene optimization. However, existing methods typically rely on dense, view-aligned intermediates such as pixel-aligned or voxel-aligned predictions. This design introduces significant redundancy and causes the representation size to inflate as more input views are added, making large-context reconstruction difficult to scale. The authors build on an "align first, decode later" principle to introduce GlobalSplat, a framework that aggregates multi-view inputs into a compact, fixed set of global latent scene tokens before decoding any explicit geometry. By using a dual-branch iterative attention architecture and a coarse-to-fine training curriculum, GlobalSplat achieves competitive reconstruction quality while maintaining a compact footprint of only 16K Gaussians.

Dataset

Dataset overview

This section describes the evaluation-time image preprocessing pipeline; dataset composition, sources, and mixture ratios are not detailed here:

  • Image Preprocessing: During evaluation, the authors resize each image to a height of 256 pixels while maintaining the original aspect ratio. The width is rounded to the nearest multiple of 8 to accommodate the patch size.
  • Camera Parameter Adjustment: To maintain spatial accuracy, the intrinsic camera parameters are updated accordingly after the resizing step.
  • Cropping and Final Scaling: The pipeline applies a centered square crop to the resized image. If the dimensions are not already correct, a final resize is performed to produce an exact 256 by 256 pixel image.
  • Consistency: This deterministic preprocessing workflow is applied identically to both the context and target views.
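The preprocessing steps above can be sketched as a small bookkeeping function that tracks image size and pinhole intrinsics through the pipeline. This is a minimal sketch, not the authors' code: the function name `preprocess_geometry`, the `(fx, fy, cx, cy)` intrinsics layout, and the choice to shift the principal point by the crop offset are all assumptions made for illustration. The source only states that intrinsics are updated after the resize.

```python
def preprocess_geometry(width, height, fx, fy, cx, cy, target=256, multiple=8):
    """Track image size and pinhole intrinsics through resize -> center crop -> resize.

    Hypothetical helper: names and the intrinsics layout are assumptions.
    """
    # Step 1: resize to the target height, keep the aspect ratio, and snap the
    # width to the nearest multiple of `multiple` (the patch-size constraint).
    new_h = target
    new_w = max(multiple, round(width * target / height / multiple) * multiple)

    # Step 2: scale the intrinsics by the per-axis resize factors.
    sx, sy = new_w / width, new_h / height
    fx, fy, cx, cy = fx * sx, fy * sy, cx * sx, cy * sy

    # Step 3: centered square crop. Cropping only shifts the principal point
    # (assumption: the source does not spell out this adjustment).
    side = min(new_w, new_h)
    left, top = (new_w - side) // 2, (new_h - side) // 2
    cx, cy = cx - left, cy - top

    # Step 4: final resize, needed only when the crop is not already target x target.
    if side != target:
        s = target / side
        fx, fy, cx, cy = fx * s, fy * s, cx * s, cy * s

    return {
        "resized": (new_w, new_h),
        "crop_offset": (left, top),
        "size": (target, target),
        "intrinsics": (fx, fy, cx, cy),
    }
```

Because the function is deterministic and depends only on the input size and intrinsics, calling it once per view gives context and target views the same treatment.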

Method

The authors leverage a novel architecture named GlobalSplat, which employs a learnable latent representation to efficiently model 3D scenes. The overall framework operates by first extracting features from input views and then iteratively refining a fixed set of latent scene tokens through a dual-branch attention mechanism before decoding them into explicit 3D Gaussians. The model begins with a view encoder that processes input images to generate patchified features. These features are then used to augment per-patch context by combining patchified Plücker-ray embeddings with a per-view camera code, which explicitly encodes the camera's global context, including absolute position and focal parameters. This augmented context is fed into the core processing pipeline.
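To make the Plücker-ray embedding concrete, the sketch below computes the 6-vector for a single pixel under a pinhole camera. It is an illustration under stated assumptions rather than the authors' implementation: the function name `plucker_ray`, the camera-to-world rotation convention, and the `(direction, moment)` ordering are all hypothetical choices, and the patchification and the per-view camera code are not shown.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

def plucker_ray(u, v, fx, fy, cx, cy, cam_to_world_rot, cam_center):
    """Return the Plücker coordinates (d, o x d) of the ray through pixel (u, v).

    Assumed conventions: pinhole intrinsics, a 3x3 camera-to-world rotation
    given as nested lists, and the camera center in world coordinates.
    """
    # Back-project the pixel to a direction in the camera frame.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate into the world frame and normalize to unit length.
    d = tuple(sum(r * x for r, x in zip(row, d_cam)) for row in cam_to_world_rot)
    norm = math.sqrt(sum(x * x for x in d))
    d = tuple(x / norm for x in d)
    # The moment o x d is unchanged when o slides along the ray, so the
    # 6-vector identifies the ray itself rather than a point on it.
    moment = cross(cam_center, d)
    return d + moment
```

A per-patch embedding could then be formed by stacking the rays of all pixels in a patch, though the exact aggregation used by the authors is not described here.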

GlobalSplat Architecture Overview

As shown in the figure below, the model initializes a fixed set of $M = 2048$ learnable latent tokens, which serve as the primary representation of the scene. These tokens are processed through a dual-branch encoder consisting of $B = 4$ iterative blocks. Within each block, the tokens are projected into separate geometry and appearance streams. The geometry stream processes queries $Q_G$ that cross-attend to the multi-view features $K_I, V_I$ and then self-attend to the global context, while the appearance stream performs a similar operation with queries $Q_A$. This architectural disentanglement ensures that geometric structure and appearance are processed independently, preventing texture from masking poor structural predictions. The outputs of the two streams are fused via a mixer MLP to update the latent tokens, which are then passed to the next block.
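The data flow of one dual-branch block can be sketched in plain Python. This is a toy illustration under heavy simplifying assumptions, not the authors' architecture: it uses single-head attention, no normalization layers, unprojected keys and values, and random weights. All names (`dual_branch_block`, `make_params`, `w_geo`, `w_app`, `w_mix1`, `w_mix2`) are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply an n x d matrix by a d x m matrix (nested lists or tuples)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Single-head scaled dot-product attention."""
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, list(zip(*k)))  # q @ k^T
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v)

def random_matrix(rng, rows, cols):
    s = 1.0 / math.sqrt(rows)
    return [[rng.uniform(-s, s) for _ in range(cols)] for _ in range(rows)]

def make_params(rng, dim, hidden):
    return {
        "w_geo": random_matrix(rng, dim, dim),          # geometry query projection
        "w_app": random_matrix(rng, dim, dim),          # appearance query projection
        "w_mix1": random_matrix(rng, 2 * dim, hidden),  # mixer MLP, layer 1
        "w_mix2": random_matrix(rng, hidden, dim),      # mixer MLP, layer 2
    }

def stream(tokens, feats, w_q):
    q = matmul(tokens, w_q)                 # project tokens into this stream
    x = add(q, attention(q, feats, feats))  # cross-attend to multi-view features
    return add(x, attention(x, x, x))       # self-attend across all tokens

def dual_branch_block(tokens, feats, params):
    geo = stream(tokens, feats, params["w_geo"])
    app = stream(tokens, feats, params["w_app"])
    fused = [g + a for g, a in zip(geo, app)]  # concatenate the streams per token
    hidden = [[max(0.0, h) for h in row] for row in matmul(fused, params["w_mix1"])]
    return add(tokens, matmul(hidden, params["w_mix2"]))  # residual token update
```

The point of the sketch is the shape behavior: the output always has as many rows as there are latent tokens, regardless of how many view features are attended to, which is why the representation size stays fixed as input views are added.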

Comparison of GlobalSplat with Dense Splatting Methods

Following the iterative refinement, the final latent tokens are decoded into the explicit 3D Gaussian representation. This is achieved through two specialized decoders: a geometry decoder that predicts the 3D mean, anisotropic scale, rotation (using a continuous 6D parameterization), opacity, and an importance score, and an appearance decoder that predicts the view-dependent color coefficients using spherical harmonics (SH) of degree 3. The model employs a coarse-to-fine training curriculum to manage the complexity of the representation. Initially, each latent token predicts a fixed set of 16 Gaussian candidates, but only a single representative Gaussian ($G = 1$) is exposed to the renderer. As training progresses, the capacity is incrementally increased by reducing the merging of candidates, ultimately revealing $G = 8$ Gaussians per token. This staged approach ensures that the model first establishes a stable global geometry before refining local details, preventing representation bloat and improving training stability.
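The continuous 6D rotation parameterization mentioned above is commonly converted to a rotation matrix with a Gram-Schmidt step. The sketch below shows that standard construction; whether the authors use exactly this variant, and whether the basis vectors are stored as columns or rows, is an assumption, and the function name `rotation_from_6d` is hypothetical.

```python
import math

def _normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

def rotation_from_6d(r6):
    """Map a 6-vector (two raw 3D axes) to a 3x3 rotation matrix.

    Assumed convention: the orthonormal basis vectors form the matrix columns.
    """
    a1, a2 = tuple(r6[:3]), tuple(r6[3:])
    b1 = _normalize(a1)                             # first axis: normalize a1
    dot = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize(tuple(y - dot * x for x, y in zip(b1, a2)))  # remove the b1 component
    b3 = _cross(b1, b2)                             # third axis completes a right-handed frame
    return [[b1[i], b2[i], b3[i]] for i in range(3)]
```

Any 6-vector whose two halves are non-zero and not parallel yields a valid rotation, so a decoder can output unconstrained values without a separate normalization constraint.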

Experiment

The proposed method is evaluated against state-of-the-art feed-forward novel view synthesis baselines using the RealEstate10K dataset for primary performance testing and the ACID dataset to assess zero-shot cross-dataset generalization. Ablation studies further validate the effectiveness of the dual-stream architecture, the coarse-to-fine capacity curriculum, and the inclusion of camera metadata. The results demonstrate that the model achieves a superior trade-off between reconstruction quality and representation compactness, providing sharp, artifact-free renderings with significantly lower memory and computational requirements than existing heavy-weight methods.

The authors evaluate their method against state-of-the-art feed-forward novel view synthesis baselines on the RealEstate10K and ACID datasets. Results show that the proposed method achieves strong reconstruction quality with a compact representation, demonstrating favorable trade-offs between quality and model size, and excels in cross-dataset generalization and computational efficiency.

  • The method achieves competitive reconstruction quality while using significantly fewer Gaussians than the baseline methods.
  • The approach demonstrates robust cross-dataset generalization, maintaining performance on ACID despite being trained only on RealEstate10K.
  • The method is computationally efficient, requiring the lowest peak GPU memory and the fastest inference time among the compared methods.

Quantitative evaluation results

The authors compare their method against state-of-the-art feed-forward novel view synthesis baselines on RealEstate10K, evaluating reconstruction quality, compactness, and efficiency. Results show that their approach achieves competitive image quality with significantly fewer Gaussians than other methods, demonstrating a favorable quality-compactness trade-off. The method also maintains strong performance across different numbers of input views and exhibits high computational efficiency.

  • GlobalSplat achieves competitive reconstruction quality while using a fraction of the Gaussians required by other methods.
  • The method maintains consistent performance across 12, 24, and 36 input views, indicating a representation that is stable with respect to the number of views.
  • GlobalSplat has the lowest peak memory, the fastest inference time, and the smallest disk footprint among the compared methods.

Quantitative comparison on RealEstate10K

The authors compare the efficiency of their method with several baselines, focusing on peak GPU memory, inference time, and disk size. Results show that their method achieves significantly lower memory usage and faster inference while maintaining a small disk footprint. These efficiency gains are achieved without sacrificing reconstruction quality, highlighting the benefit of a compact representation.

  • The proposed method requires substantially less peak GPU memory and inference time than the baselines.
  • The method maintains a minimal disk footprint, significantly smaller than that of all other methods.
  • Efficiency improvements are achieved while preserving high reconstruction quality.
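As a rough sanity check on the disk footprint, the size of a Gaussian set can be estimated from its parameter count. The sketch below assumes an uncompressed float32 layout with a 3D mean, 3D scale, quaternion rotation, opacity, and RGB spherical-harmonic coefficients of degree 3. The source does not state how the reported 4MB figure is stored, so this layout and the function name `gaussian_footprint_bytes` are assumptions.

```python
def gaussian_footprint_bytes(num_gaussians, sh_degree=3, bytes_per_value=4):
    """Estimate uncompressed storage for a set of 3D Gaussians.

    Assumed per-Gaussian layout (not confirmed by the source):
    mean (3) + scale (3) + quaternion rotation (4) + opacity (1) + SH color.
    """
    sh_values = 3 * (sh_degree + 1) ** 2  # (degree + 1)^2 coefficients per RGB channel
    values_per_gaussian = 3 + 3 + 4 + 1 + sh_values
    return num_gaussians * values_per_gaussian * bytes_per_value

# 2048 latent tokens, each decoded into 8 Gaussians at the end of the curriculum.
num_gaussians = 2048 * 8
size_mb = gaussian_footprint_bytes(num_gaussians) / 1e6
print(f"{num_gaussians} Gaussians -> about {size_mb:.2f} MB")
```

Under these assumptions the estimate lands close to the reported 4MB, which suggests that the footprint is dominated by the per-Gaussian parameters rather than by any auxiliary data.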

Efficiency comparison of methods

The authors conduct an ablation study to evaluate the impact of different design choices on model performance. Results show that removing the consistency loss or using a single-stream architecture leads to a drop in reconstruction quality, while predicting the full Gaussian capacity from the start also degrades performance. The full model achieves the best results across all metrics.

  • Removing the consistency loss reduces reconstruction quality and increases artifacts.
  • Using a single-stream architecture instead of a two-stream design degrades performance.
  • Predicting the full Gaussian capacity from the beginning of training leads to worse results than progressive capacity growth.

Ablation study on model variants

The table examines the impact of latent scene representation size and decoder density on reconstruction quality under fixed Gaussian budgets. Results show that increasing the number of latent tokens is more effective than increasing the number of Gaussians per token, with larger latent capacity leading to better performance across metrics.

  • Increasing latent capacity improves reconstruction quality more than increasing the number of Gaussians per token.
  • Larger latent representations achieve higher PSNR and SSIM, and lower LPIPS, under the same Gaussian budget.
  • The trade-off between latent size and decoder density shows diminishing returns for higher decoder density.
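The trade-off under a fixed Gaussian budget amounts to choosing how to factor the budget into latent tokens times Gaussians per token. The sketch below enumerates such factorizations for the 16K budget. The helper name `budget_configs` and the candidate densities are hypothetical, and the specific configurations the authors tested are not listed here.

```python
def budget_configs(budget, densities=(1, 2, 4, 8, 16)):
    """List (num_tokens, gaussians_per_token) pairs whose product equals the budget.

    The density options are illustrative, not the authors' tested settings.
    """
    return [(budget // g, g) for g in densities if budget % g == 0]

for num_tokens, per_token in budget_configs(2048 * 8):
    print(f"{num_tokens:>6} tokens x {per_token:>2} Gaussians/token = {num_tokens * per_token}")
```

In this framing, the finding above says that for the same product it is better to move toward more tokens with fewer Gaussians each than the reverse.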

Compactness-quality trade-off study

The proposed method is evaluated against state-of-the-art feed-forward novel view synthesis baselines on the RealEstate10K and ACID datasets to assess reconstruction quality, efficiency, and generalization. Results demonstrate that the approach achieves competitive image quality with a significantly more compact representation, offering superior computational efficiency and robust performance across different datasets and input view counts. Ablation studies and architectural analyses further confirm that the two-stream design, consistency loss, and progressive capacity growth are essential for maintaining high-quality reconstructions while optimizing the trade-off between latent representation size and decoder density.

