HyperAIHyperAI

Command Palette

Search for a command to run...

무한VGGT: 끝없는 스트림을 위한 시각 기하 기반 트랜스포머

Shuai Yuan Yantai Yang Xiaotian Yang Xupeng Zhang Zhonghao Zhao Lingming Zhang Zhipeng Zhang

초록

지속적이고 대규모 3D 시각 기하학적 이해를 실현하는 장대한 비전은 확장성과 장기적 안정성 간의 조화를 이룰 수 없는 요구 사항에 묶여 있다. VGGT와 같은 오프라인 모델은 놀라운 기하학적 능력을 보여주지만, 배치 기반의 성격으로 실시간 시스템에 적용할 수 없다. 반면, 스트리밍 아키텍처는 실시간 운영을 위한 정의된 솔루션이지만, 기존 방법들은 실질적인 무한한 시간 영역 입력을 지원하지 못하거나 긴 시퀀스에 걸쳐 치명적인 드리프트(이탈) 현상을 겪는다는 점에서 부족함을 드러낸다. 우리는 이 오랜 딜레마를 무너뜨리며, 제한된 크기이지만 적응적이고 지속적으로 표현력이 뛰어난 KV 캐시를 통해 '롤링 메모리(rolling memory)' 개념을 실현하는 인과적 시각 기하학 트랜스포머인 InfiniteVGGT를 제안한다. 이 기반 위에서, 새로운 프레임이 입력될 때마다 과거 정보를 지능적으로 제거하는, 훈련이 필요 없고 어텐션에 무관한 가지치기 전략을 설계함으로써, 메모리를 효과적으로 '전진'시키는 방식을 구현한다. FlashAttention과 완전히 호환되며, InfiniteVGGT는 드디어 확장성과 장기적 안정성 사이의 타협을 해소함으로써, 무한한 시간 영역 스트리밍을 가능하게 하면서도 기존 스트리밍 방법보다 우수한 장기적 안정성을 달성한다. 이러한 시스템의 최종 시험은 진정한 무한한 시간 영역에서의 성능에 있다. 그러나 기존에 매우 장기적이고 지속적인 벤치마크가 부족했기 때문에, 이는 엄격한 검증이 불가능한 상태였다. 이 핵심적 공백을 메우기 위해, 우리는 10,000 프레임에 달하는 시퀀스를 기반으로 지속적인 3D 기하학 추정을 엄격히 평가할 수 있는 Long3D 벤치마크를 도입한다. 이는 향후 장기적 3D 기하학 이해 분야의 연구를 위한 결정적 평가 플랫폼이 될 것이다. 코드는 다음 링크에서 확인할 수 있다: https://github.com/AutoLab-SAI-SJTU/InfiniteVGGT

One-sentence Summary

The authors from AutoLab, Shanghai Jiao Tong University, and Anyverse Dynamics propose InfiniteVGGT, a causal visual geometry transformer with a bounded, adaptive KV cache that enables infinite-horizon streaming 3D geometry understanding by intelligently pruning obsolete information without retraining. Unlike prior methods, it maintains long-term stability through a rolling memory mechanism compatible with FlashAttention, and they introduce the Long3D benchmark—featuring 10,000-frame sequences—to rigorously evaluate continuous 3D estimation, setting a new standard for long-term visual geometry systems.

Key Contributions

  • Existing streaming 3D geometry models face a fundamental trade-off between scalability and long-term stability, either accumulating unbounded memory or suffering from catastrophic drift due to lossy state compression, limiting their applicability in real-time systems.
  • InfiniteVGGT introduces a causal visual geometry transformer with a bounded, adaptive rolling memory via a training-free, attention-agnostic pruning strategy that uses key cosine similarity to discard redundant tokens, enabling infinite-horizon streaming while maintaining compatibility with FlashAttention.
  • To rigorously evaluate long-term performance, the authors introduce Long3D, the first benchmark for continuous 3D geometry estimation on sequences of approximately 10,000 frames, providing a definitive platform for assessing infinite-horizon stability and accuracy.

Introduction

The authors address the challenge of continuous, long-term 3D scene reconstruction in real-time systems, where existing methods face a fundamental trade-off: offline models achieve high geometric accuracy but are incompatible with streaming due to batch processing, while online approaches either suffer from unbounded memory growth or catastrophic drift from aggressive state compression. Prior work struggles to maintain both scalability and stability over infinite horizons, particularly due to the inefficiency of pruning strategies that rely on attention scores—these require materializing large attention matrices, which contradicts the use of optimized kernels like FlashAttention. To overcome this, the authors introduce InfiniteVGGT, a causal visual geometry transformer that implements a bounded, adaptive rolling memory via a training-free, attention-agnostic pruning strategy based on key cosine similarity. This enables efficient, dynamic retention of semantically distinct tokens while discarding redundant ones, preserving long-term consistency without memory overflow. The system is fully compatible with FlashAttention, supports infinite-length input streams, and is evaluated on the newly proposed Long3D benchmark— the first dataset designed for rigorous, continuous evaluation over 10,000+ frames—providing a definitive testbed for future research in long-term 3D geometry understanding.

Dataset

  • The Long3D benchmark consists of 5 long, continuous 3D video sequences captured in diverse indoor and outdoor environments, with each sequence ranging from approximately 2,000 to 10,000 frames—significantly longer than prior benchmarks limited to 1,000 frames or fewer.
  • Data was collected using a handheld 3D spatial scanner equipped with an IMU, a 360° horizontal by 59° vertical FOV LiDAR, and an RGB camera (800 × 600 resolution at 10 Hz, 90° FOV).
  • Each sequence includes a globally aligned ground-truth point cloud and a corresponding uninterrupted stream of RGB images, enabling evaluation of long-term 3D reconstruction from continuous input.
  • The dataset is designed to assess dense-view streaming reconstruction, where models process the full image stream in real time to generate a complete, globally consistent point cloud.
  • For evaluation, predicted point clouds are aligned with ground truth using the Iterative Closest Point (ICP) algorithm, following standard practices in the field.
  • Performance is measured using four established metrics: Accuracy (Acc.), Completion (Comp.), Chamfer Distance (CD), and Normal Consistency (NC).
  • The authors use the Long3D dataset as the primary evaluation benchmark, training models on shorter sequences from existing datasets like 7-Scenes and NRGBD, then testing on Long3D to assess long-term robustness and continuity.
  • No explicit cropping is applied; instead, the full sequences are used as captured, preserving temporal continuity.
  • Metadata includes scene identifiers, capture timestamps, and sensor calibration data, which are used to synchronize RGB and LiDAR streams and support accurate alignment during evaluation.

Method

The authors leverage a transformer-based architecture adapted for online 3D reconstruction, building upon the offline model VGTT and its streaming variant StreamVGGT. The framework processes a sequence of input frames incrementally, maintaining a key-value (KV) cache to retain contextual information across time. At each timestep, the model generates 3D outputs—camera parameters, depth maps, point maps, and tracking features—by combining frame-specific features from a frame attention module with cached context from prior frames via a causal temporal attention module. This incremental processing enables streaming inference but leads to unbounded memory growth as the KV cache accumulates tokens over time.

To address this, the proposed method introduces a rolling memory paradigm that prunes the KV cache to prevent memory overflow while preserving geometric consistency. The framework begins by establishing an immutable anchor set derived from the KV cache of the first input frame. This anchor set serves as a fixed global reference, ensuring that all subsequent 3D predictions remain aligned to the coordinate system of the initial frame. All subsequent frames contribute to a mutable candidate set, which is subject to compression. The pruning strategy operates independently on each decoder layer and attention head, reflecting their heterogeneous information content.

The retention of tokens within the candidate set is guided by a diversity-aware mechanism. Instead of relying on attention weights, the method measures redundancy through the geometric dispersion of key vectors in the feature space. As shown in the figure below, queries and keys from different frames occupy nearly orthogonal subspaces, indicating that key-space similarity is a stable proxy for redundancy. The diversity score for each key is defined as the negative cosine similarity to the mean of all normalized candidate keys in the same layer and head. Keys that are dissimilar to the mean, and thus more geometrically distinct, are assigned higher scores and are more likely to be retained.

To further optimize memory usage, the method employs a layer-wise adaptive budget allocation. This mechanism assigns a non-uniform storage budget to each decoder layer based on its average information diversity. The diversity score for a layer is computed as the mean of the diversity scores of all its attention heads. A softmax function, modulated by a temperature hyperparameter, normalizes these scores to determine the proportion of the total memory budget allocated to each layer. The final compressed cache is formed by retaining the top-K tokens from the candidate set in each layer and head, according to their diversity scores, and combining them with the immutable anchor set. This approach ensures that the most informative tokens are preserved while maintaining bounded memory and computational efficiency for long-horizon streaming.

Experiment

  • Evaluated on 3D reconstruction, video depth estimation, and camera pose estimation tasks using StreamVGGT and the novel Long3D benchmark, demonstrating effectiveness in long-sequence streaming scenarios.
  • On 7-Scenes and NRGBD datasets, InfiniteVGGT maintains state-of-the-art reconstruction accuracy with minimal temporal error accumulation, outperforming CUT3R and TTT3R, while avoiding OOM errors that affect baseline methods.
  • On Long3D with sequences up to nearly 10,000 frames, InfiniteVGGT achieves robust performance across diverse scenes, effectively limiting temporal drift compared to baselines, though it underperforms slightly on the Comp. metric.
  • On the Bonn dataset, InfiniteVGGT shows strong video depth estimation performance on long continuous sequences (200–500 frames), matching or exceeding CUT3R and TTT3R.
  • Ablation studies confirm that cosine similarity-based token selection improves reconstruction accuracy and reduces inference latency, while dynamic layer-wise budget allocation and anchor frame retention significantly enhance point cloud quality and stability.
  • The method achieves comparable performance to StreamVGGT on short sequences (50–100 frames) with negligible metric differences and a slight edge in normal consistency, validating its efficiency and robustness.

Results show that InfiniteVGGT achieves state-of-the-art performance on the Long3D benchmark across multiple scenes and sequence lengths, outperforming baselines like CUT3R and TTT3R on most metrics. While it effectively limits temporal error accumulation compared to other methods, it underperforms on the mean Comp. metric, indicating a key area for future improvement.

Results show that increasing the initial token storage budget improves reconstruction quality, with a significant drop in chamfer distance and improvement in normal consistency when moving from B10000l,hB^{l,h}_{10000}B10000l,h to B25000l,hB^{l,h}_{25000}B25000l,h. However, further increasing the budget to B50000l,hB^{l,h}_{50000}B50000l,h yields no additional gains, indicating that B25000l,hB^{l,h}_{25000}B25000l,h is sufficient for optimal performance.

Results show that InfiniteVGGT achieves the best performance on the Bonn dataset for video depth estimation, outperforming baselines like CUT3R and TTT3R across different input lengths. The method maintains high accuracy and consistency, with the lowest absolute relative error and highest δ < 1.25 metric values, demonstrating its effectiveness in long-term depth estimation.

Results show that InfiniteVGGT maintains stable reconstruction accuracy across increasing input frames, outperforming baselines like CUT3R and TTT3R in both chamfer distance and normal consistency. The method effectively limits temporal error accumulation, achieving consistent performance on long sequences where other approaches degrade or fail due to memory constraints.

Results show that incorporating an anchor frame significantly improves reconstruction accuracy, reducing the mean chamfer distance from 0.047 to 0.040 while maintaining consistent normal consistency. The authors use this ablation to demonstrate that preserving the first frame's tokens prevents error accumulation and enhances robustness in long-sequence 3D reconstruction.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp