5時間前

Junyi Zhang Charles Herrmann Junhwa Hur Chen Sun Ming-Hsuan Yang Forrester Cole Trevor Darrell Deqing Sun

概要

フィードフォワード幾何基底モデルは短い時間窓における再構成において優れた性能を発揮するが、分単位の長尺動画へのスケーリングは、二次的なアテンション計算量または再帰的構造における有効メモリ容量の限界によってボトルネックとなっている。本研究では、事後最適化を必要とせずに、極めて長いシーケンスに対する密な 3 次元再構成をスケーリング可能にする新たなアーキテクチャ「LoGeR（Long-context Geometric Reconstruction）」を提案する。LoGeR はビデオストリームをチャンク単位で処理し、チャンク内での高忠実度推論を可能にする強力な双方向事前知識を活用する。さらに、チャンク境界における一貫性という重要な課題に対処するため、学習ベースのハイブリッドメモリモジュールを提案する。この二重構成システムは、大域座標系を固定しスケールのドリフトを防止するパラメトリックなテスト時学習（Test-Time Training: TTT）メモリと、高精度な隣接整合を維持するために圧縮されていない文脈を保持するノンパラメトリックなスライディングウィンドウアテンション（Sliding Window Attention: SWA）メカニズムを組み合わせて構成される。驚くべきことに、このメモリアーキテクチャにより、LoGeR は 128 フレームのシーケンスで学習されながら、推論時には数千フレームまで汎化することが可能となる。標準的なベンチマークおよび最大 19,000 フレームのシーケンスを備えた新たに再構成された VBR データセットにおける評価において、LoGeR は従来の最先端フィードフォワード手法を大幅に凌駕し、KITTI における絶対軌道誤差（ATE）を 74% 以上削減するとともに、前例のない視野にわたる堅牢かつ大域的一貫性のある再構成を実現した。

One-sentence Summary

Researchers from Google DeepMind and UC Berkeley present LoGeR, a feedforward model that scales 3D reconstruction to long videos by combining Test-Time Training for global consistency with Sliding Window Attention for local precision, eliminating the need for post-optimization while achieving superior accuracy on datasets with thousands of frames.

Key Contributions

Feedforward geometric models currently struggle to scale to minute-long videos due to quadratic attention complexity and limited memory, creating a critical gap between short-window reconstruction and the need for global consistency over long sequences.
LoGeR introduces a novel chunk-wise architecture with a hybrid memory module that combines parametric Test-Time Training to anchor the global coordinate frame and non-parametric Sliding Window Attention to preserve high-precision local alignment.
Trained on sequences of only 128 frames, the model generalizes to thousands of frames and achieves state-of-the-art performance by reducing Absolute Trajectory Error on KITTI by over 74% and improving results by 30.8% on a new 19k-frame VBR benchmark.

Introduction

Large-scale dense 3D reconstruction is essential for applications ranging from autonomous driving to generative world-building, yet current methods struggle to balance computational efficiency with long-range consistency. While classical optimization pipelines can handle city-scale scenes, they rely on slow offline processes and fail on sparse inputs, whereas modern feedforward geometric models offer speed but are limited to short, bounded scenes due to quadratic attention complexity and a lack of training data for long sequences. To bridge this gap, the authors propose LoGeR, a feedforward framework that utilizes a hybrid memory module combining non-parametric sliding window attention for high-fidelity local details and parametric associative memory for global structural integrity. This approach allows the model to process massive sequences of up to 19,000 frames with linear computational cost, effectively overcoming the context and data walls that previously prevented feedforward models from scaling to real-world, long-horizon trajectories.

Dataset

Dataset Composition and Sources: The authors utilize a diverse mixture of 14 large-scale datasets containing both real-world and synthetic scenes across indoor, outdoor, and autonomous driving environments to support long-context geometric reconstruction.
Key Details for Each Subset:
- Navigation and large-scale scene datasets like TartanAirV2 and VKITTI2 are heavily weighted to encourage long-range geometric reasoning.
- DL3DV receives a high sampling weight due to its exceptional real-world scene diversity, which aids model generalization.
- Smaller or object-centric datasets are down-weighted in the mixture.
- The OmniWorld-Game contribution is limited to a subset of 5,000 sequences based on the publicly released data at the time of training.
Data Usage and Mixture Ratios: The training configuration employs relative sampling percentages as summarized in Table 4, where the mixture is specifically tuned to provide sufficient long-horizon signals and diverse scene priors.
Processing and Filtering Strategies:
- All datasets are standardized into multi-view sequences with 48 views (or 128 views for H200 GPU training) at a uniform resolution of 504 × 280.
- The sampling strategy follows the CUT3R approach.
- Rigorous depth filtering is applied to ensure geometric supervision quality.
- Metric-scale datasets like ARKitScenes and ScanNet use a maximum depth threshold (e.g., 80.0 meters), while others like DL3DV and TartanAir use percentile-based clipping (e.g., 90th or 98th percentile) to mask noisy or invalid depth values.

Method

The authors propose LoGeR, a novel architecture designed to scale dense 3D reconstruction to extremely long video sequences without post-optimization. To overcome the quadratic complexity of global attention and the scarcity of long-horizon training data, the method processes video streams sequentially by chunk. This approach tightly bounds computational cost while ensuring that local inferences remain within the distribution of existing short-context training data.

Refer to the framework diagram for an overview of the proposed chunk-wise processing pipeline and its performance on long sequences.

The core innovation lies in a learning-based hybrid memory module that manages coherence across chunk boundaries. This dual-component system combines a parametric Test-Time Training (TTT) memory to anchor the global coordinate frame and prevent scale drift, alongside a non-parametric Sliding Window Attention (SWA) mechanism to preserve uncompressed context for high-precision adjacent alignment.

As shown in the figure below, the architecture processes input sequences in consecutive chunks, utilizing specific mechanisms for the first chunk versus following chunks to propagate information effectively.

Within each residual block of the geometry backbone, the authors introduce the hybrid memory system. The process begins with per-frame attention to extract spatial features independently for each frame. To align adjacent chunks, sparse sliding-window attention layers are inserted at a subset of network depths. These layers attend to tokens from both the previous chunk $\mathcal{C}^{m-1}$ and the current chunk $\mathcal{C}^{m}$ , establishing a lossless information highway for high-fidelity feature propagation.

To integrate global context, the model maintains a set of fast weights $W^{m}$ that summarize information up to chunk $m$ . The TTT layer performs an apply-then-update procedure at the chunk level. In the apply operation, the TTT layers use historical information stored in the weights to modulate the network's processing of the current chunk. In the update operation, the weights are edited to store information from the current chunk, conceptually compressing important but redundant geometric information. The mathematical formulation for the TTT update and apply operations is defined as:

W \gets W - \eta \nabla _ { W } \mathcal { L } ( f _ { W } ( \mathbf { k } ) , \mathbf { v } )

\mathrm { A p p l y ~ o p e r a t i o n : } \quad o = f _ { W } ( \mathbf { q } )

where $\eta$ is the learning rate and $\mathcal{L}$ is a loss function encouraging the function $f_W$ to link keys with corresponding values. Finally, within each chunk, a bidirectional attention module is employed for powerful geometric reasoning under a bounded context window.

For training, the authors employ a progressive curriculum strategy to stabilize the optimization of recurrent TTT layers. The schedule begins with shorter sequences and gradually increases complexity, forcing the model to shift reliance from local Sliding Window Attention to the global TTT hidden state. Additionally, to mitigate prediction errors in very long streams, a variant called LoGeR* incorporates a purely feedforward alignment step. This step aligns raw predictions into a consistent global coordinate system by computing a rigid SE(3) transformation using overlapping frames between chunks.

Experiment

Long-sequence evaluation on KITTI and VBR benchmarks demonstrates that LoGeR effectively mitigates accumulated drift over thousands of frames, maintaining global scale and trajectory consistency where prior methods fail.
Short-sequence tests on 7-Scenes, ScanNet, and TUM-Dynamics confirm that the proposed model and baseline significantly outperform existing feedforward and optimization-based approaches in 3D reconstruction and camera pose estimation.
Ablation studies validate that the hybrid architecture is essential, with the Test-Time Training layer ensuring global consistency and the Sliding Window Attention layer preserving local geometric smoothness.
Experiments on data mixture and curriculum training prove that incorporating large-scale navigation datasets and a progressive training schedule are critical for generalization and stabilizing recurrent layer optimization.

ソースPDF コードを表示

AIでAIを構築

アイデアからローンチまで — 無料のAIコーディング支援、すぐに使える環境、最高のGPU価格でAI開発を加速。

AI コーディング補助

すぐに使える GPU

最適な料金体系

開始する料金を見る

HyperAI Newsletters

最新情報を購読する

北京時間 毎週月曜日の午前9時 に、その週の最新情報をメールでお届けします

メール配信サービスは MailChimp によって提供されています

5時間前

Junyi Zhang Charles Herrmann Junhwa Hur Chen Sun Ming-Hsuan Yang Forrester Cole Trevor Darrell Deqing Sun

概要

One-sentence Summary

Key Contributions

Feedforward geometric models currently struggle to scale to minute-long videos due to quadratic attention complexity and limited memory, creating a critical gap between short-window reconstruction and the need for global consistency over long sequences.
LoGeR introduces a novel chunk-wise architecture with a hybrid memory module that combines parametric Test-Time Training to anchor the global coordinate frame and non-parametric Sliding Window Attention to preserve high-precision local alignment.
Trained on sequences of only 128 frames, the model generalizes to thousands of frames and achieves state-of-the-art performance by reducing Absolute Trajectory Error on KITTI by over 74% and improving results by 30.8% on a new 19k-frame VBR benchmark.

Introduction

Dataset

Dataset Composition and Sources: The authors utilize a diverse mixture of 14 large-scale datasets containing both real-world and synthetic scenes across indoor, outdoor, and autonomous driving environments to support long-context geometric reconstruction.
Key Details for Each Subset:
- Navigation and large-scale scene datasets like TartanAirV2 and VKITTI2 are heavily weighted to encourage long-range geometric reasoning.
- DL3DV receives a high sampling weight due to its exceptional real-world scene diversity, which aids model generalization.
- Smaller or object-centric datasets are down-weighted in the mixture.
- The OmniWorld-Game contribution is limited to a subset of 5,000 sequences based on the publicly released data at the time of training.
Data Usage and Mixture Ratios: The training configuration employs relative sampling percentages as summarized in Table 4, where the mixture is specifically tuned to provide sufficient long-horizon signals and diverse scene priors.
Processing and Filtering Strategies:
- All datasets are standardized into multi-view sequences with 48 views (or 128 views for H200 GPU training) at a uniform resolution of 504 × 280.
- The sampling strategy follows the CUT3R approach.
- Rigorous depth filtering is applied to ensure geometric supervision quality.
- Metric-scale datasets like ARKitScenes and ScanNet use a maximum depth threshold (e.g., 80.0 meters), while others like DL3DV and TartanAir use percentile-based clipping (e.g., 90th or 98th percentile) to mask noisy or invalid depth values.

Method

Refer to the framework diagram for an overview of the proposed chunk-wise processing pipeline and its performance on long sequences.

W \gets W - \eta \nabla _ { W } \mathcal { L } ( f _ { W } ( \mathbf { k } ) , \mathbf { v } )

\mathrm { A p p l y ~ o p e r a t i o n : } \quad o = f _ { W } ( \mathbf { q } )

Experiment

Long-sequence evaluation on KITTI and VBR benchmarks demonstrates that LoGeR effectively mitigates accumulated drift over thousands of frames, maintaining global scale and trajectory consistency where prior methods fail.
Short-sequence tests on 7-Scenes, ScanNet, and TUM-Dynamics confirm that the proposed model and baseline significantly outperform existing feedforward and optimization-based approaches in 3D reconstruction and camera pose estimation.
Ablation studies validate that the hybrid architecture is essential, with the Test-Time Training layer ensuring global consistency and the Sliding Window Attention layer preserving local geometric smoothness.
Experiments on data mixture and curriculum training prove that incorporating large-scale navigation datasets and a progressive training schedule are critical for generalization and stabilizing recurrent layer optimization.

ソースPDF コードを表示

AIでAIを構築

アイデアからローンチまで — 無料のAIコーディング支援、すぐに使える環境、最高のGPU価格でAI開発を加速。

AI コーディング補助

すぐに使える GPU

最適な料金体系

開始する料金を見る

HyperAI Newsletters

最新情報を購読する

北京時間 毎週月曜日の午前9時 に、その週の最新情報をメールでお届けします

メール配信サービスは MailChimp によって提供されています

Command Palette

LoGeR：ハイブリッドメモリによる長文脈幾何学的再構成

Junyi Zhang Charles Herrmann Junhwa Hur Chen Sun Ming-Hsuan Yang Forrester Cole Trevor Darrell Deqing Sun

概要

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AIでAIを構築

HyperAI Newsletters

Command Palette

LoGeR：ハイブリッドメモリによる長文脈幾何学的再構成

Junyi Zhang Charles Herrmann Junhwa Hur Chen Sun Ming-Hsuan Yang Forrester Cole Trevor Darrell Deqing Sun

概要

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AIでAIを構築

HyperAI Newsletters

Command Palette

LoGeR：ハイブリッドメモリによる長文脈幾何学的再構成

Junyi Zhang Charles Herrmann Junhwa Hur Chen Sun Ming-Hsuan Yang Forrester Cole Trevor Darrell Deqing Sun

概要

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AIでAIを構築

HyperAI Newsletters