AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction

Jiewen Chan Zhenjun Zhao Yu-Lun Liu

Abstract

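Reconstructing dynamic 3D scenes from monocular video requires capturing high-frequency visual detail while modeling temporally continuous motion. Existing methods built on single Gaussian primitives are limited by their low-pass filtering behavior, and standard Gabor functions introduce energy instability. In addition, without temporal continuity constraints, interpolation is prone to motion artifacts. This paper proposes AdaGaR, which addresses both frequency adaptivity and temporal continuity in explicit dynamic scene modeling within one framework. We introduce an Adaptive Gabor Representation that extends Gaussians with learnable frequency weights and adaptive energy compensation, balancing detail capture against stability. For temporal continuity, we use Cubic Hermite Splines with temporal curvature regularization to ensure smooth motion evolution. An adaptive initialization scheme that combines depth estimation, point tracking, and foreground masks establishes a stable point cloud distribution early in training. In experiments on the Tap-Vid DAVIS dataset, the method achieves state-of-the-art results (PSNR 35.49, SSIM 0.9433, LPIPS 0.0723) and generalizes well across frame interpolation, depth consistency, video editing, and stereo view synthesis. Project page: https://jiewenchan.github.io/AdaGaR/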

One-sentence Summary

The authors from National Yang Ming Chiao Tung University and University of Zaragoza propose AdaGaR, a unified framework for dynamic 3D scene reconstruction that introduces an Adaptive Gabor Representation with learnable frequency weights and energy compensation to achieve high-frequency detail preservation and stability, while employing Cubic Hermite Splines with temporal curvature regularization to ensure smooth motion, outperforming prior methods in video reconstruction, frame interpolation, and view synthesis.

Key Contributions

  • We introduce Adaptive Gabor Representation, a frequency-adaptive extension of 3D Gaussians that learns dynamic frequency weights and applies adaptive energy compensation to simultaneously preserve high-frequency textures and maintain rendering stability, overcoming the low-pass filtering limitation of standard Gaussians and the energy instability of fixed Gabor functions.

  • We propose Temporal Curvature Regularization with Cubic Hermite Splines to enforce smooth motion trajectories over time, ensuring geometric and temporal continuity in dynamic scene reconstruction and effectively eliminating interpolation artifacts, especially under rapid motion or occlusions.

  • We design an Adaptive Initialization mechanism that integrates monocular depth estimation, point tracking, and foreground masks to establish temporally coherent and stable point cloud distributions early in training, significantly improving convergence and final reconstruction quality on the Tap-Vid DAVIS dataset.

Introduction

Reconstructing dynamic 3D scenes from monocular videos is critical for applications in VR, AR, and film production, where both smooth temporal motion and high-fidelity texture representation are essential. Prior methods using Gaussian-based primitives struggle with high-frequency detail due to inherent low-pass filtering, while frequency-enhancing approaches like Gabor representations often compromise energy stability and rendering quality. Many also lack explicit temporal constraints, leading to motion artifacts under rapid motion or occlusions. The authors introduce AdaGaR, a unified framework that jointly optimizes time and frequency in explicit dynamic representations. It features an Adaptive Gabor Representation that learns frequency response for balanced high- and low-frequency modeling with energy stability, and Temporal Curvature Regularization via Cubic Hermite Splines to enforce smooth motion trajectories. An Adaptive Initialization mechanism leverages depth, motion, and segmentation priors to bootstrap stable, temporally coherent geometry. The approach achieves state-of-the-art results on Tap-Vid, demonstrating strong generalization across video reconstruction, interpolation, depth consistency, editing, and stereo synthesis.

Method

The authors leverage a unified framework, AdaGaR, to address the dual challenges of frequency adaptivity and temporal continuity in explicit dynamic scene modeling from monocular videos. The overall architecture, as illustrated in the framework diagram, operates within an orthographic camera coordinate system, which simplifies the representation by treating camera and object motion as a single dynamic variation, avoiding the need for explicit camera pose estimation. The core of the method consists of two primary components: Adaptive Gabor Representation and Adaptive Motion, which are optimized jointly with a multi-supervision loss function.

The Adaptive Gabor Representation extends the standard 3D Gaussian Splatting primitive to capture high-frequency appearance details. It achieves this by modulating the traditional Gaussian density function with a learnable, periodic sinusoidal component. The Gabor function, defined as $\mathcal{G}_{\text{Gabor}}(\mathbf{x}) = \exp\left(-\frac{1}{2}\|\mathbf{x} - \boldsymbol{\mu}\|_{\Sigma^{-1}}^2\right) \cos(\mathbf{f}^\top \mathbf{x} + \phi)$, introduces a sinusoidal modulation within the Gaussian envelope, enabling the representation of local directional textures. To model richer frequency components, multiple Gabor waves are combined into a weighted superposition, $S(\mathbf{x}) = \sum_{i=1}^{N} \omega_i \cos(f_i \langle \mathbf{d}_i, \mathbf{x} \rangle + \phi_i)$, where the amplitude weights $\omega_i$ are learnable parameters. To ensure energy stability and prevent intensity attenuation, a compensation term $b$ is introduced, resulting in the final adaptive modulation function $S_{\text{adap}}(\mathbf{x}) = b + \frac{1}{N} \sum_{i=1}^{N} \omega_i \cos(f_i \langle \mathbf{d}_i, \mathbf{x} \rangle)$. This formulation allows the representation to adaptively span from a low-frequency Gaussian to a high-frequency Gabor kernel, with the compensation term ensuring a smooth degradation to a standard Gaussian when frequency weights vanish.
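The adaptive modulation can be illustrated with a minimal 2D sketch in plain Python. It is not the authors' implementation. It assumes the primitive's response is the Gaussian envelope multiplied by the modulation, and it takes the compensation term `b` as a given input because the summary does not say how `b` is computed. All function and parameter names are hypothetical.

```python
import math

def gaussian_envelope(x, mu, inv_cov):
    """exp(-0.5 * (x - mu)^T Sigma^{-1} (x - mu)) for a 2D point x."""
    dx0, dx1 = x[0] - mu[0], x[1] - mu[1]
    q = (dx0 * (inv_cov[0][0] * dx0 + inv_cov[0][1] * dx1)
         + dx1 * (inv_cov[1][0] * dx0 + inv_cov[1][1] * dx1))
    return math.exp(-0.5 * q)

def adaptive_modulation(x, b, weights, freqs, dirs):
    """S_adap(x) = b + (1/N) * sum_i w_i * cos(f_i * <d_i, x>)."""
    n = len(weights)
    total = 0.0
    for w, f, d in zip(weights, freqs, dirs):
        total += w * math.cos(f * (d[0] * x[0] + d[1] * x[1]))
    return b + total / n

def adaptive_gabor(x, mu, inv_cov, b, weights, freqs, dirs):
    """Assumed combination: Gaussian envelope times adaptive modulation."""
    return gaussian_envelope(x, mu, inv_cov) * adaptive_modulation(
        x, b, weights, freqs, dirs)
```

In this sketch, setting every weight to zero and `b = 1` makes `adaptive_gabor` return exactly the Gaussian envelope, which mirrors the fallback behavior the text describes.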

The Adaptive Motion component ensures temporally smooth and consistent motion evolution. It models the trajectory of each dynamic primitive using Cubic Hermite Splines, which interpolate the positions and velocities at a set of temporal keyframes. The spline interpolation is defined by the Hermite basis functions, which use control points $\mathbf{y}_k$ and slopes $\mathbf{m}_k$ to generate a smooth curve. To prevent reverse oscillations and ensure visually stable interpolation, an auto-slope mechanism with a monotone gate is employed, which sets the slope to zero if the direction of motion changes between adjacent keyframes. For rotation, the same principle is applied by interpolating in the $\mathfrak{so}(3)$ Lie algebra space and converting to unit quaternions. To enforce smoothness and prevent motion artifacts, a temporal curvature regularization term is introduced, which penalizes the second-order derivative of the trajectory at each keyframe, thereby constraining the motion to be geometrically and temporally consistent.
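The spline, the monotone gate, and the curvature penalty can be sketched for a single scalar coordinate as follows. This is an illustration under assumptions, not the paper's code: the non-gated slope is taken as the average of the two neighboring secant slopes, endpoint slopes use the one-sided secant, and the second derivative is approximated by a finite difference at interior keyframes. The function names are hypothetical.

```python
import bisect

def auto_slopes(ts, ys):
    """Keyframe slopes with a monotone gate (needs at least 2 keyframes)."""
    n = len(ys)
    secants = [(ys[k + 1] - ys[k]) / (ts[k + 1] - ts[k]) for k in range(n - 1)]
    slopes = [0.0] * n
    slopes[0], slopes[-1] = secants[0], secants[-1]
    for k in range(1, n - 1):
        if secants[k - 1] * secants[k] <= 0.0:
            slopes[k] = 0.0  # gate: motion direction changes at keyframe k
        else:
            slopes[k] = 0.5 * (secants[k - 1] + secants[k])  # assumed rule
    return slopes

def hermite_eval(ts, ys, ms, t):
    """Evaluate the cubic Hermite spline at time t."""
    k = max(0, min(len(ts) - 2, bisect.bisect_right(ts, t) - 1))
    h = ts[k + 1] - ts[k]
    s = (t - ts[k]) / h
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return h00 * ys[k] + h10 * h * ms[k] + h01 * ys[k + 1] + h11 * h * ms[k + 1]

def curvature_penalty(ts, ys):
    """Sum of squared finite-difference second derivatives at interior keyframes."""
    total = 0.0
    for k in range(1, len(ys) - 1):
        left = (ys[k] - ys[k - 1]) / (ts[k] - ts[k - 1])
        right = (ys[k + 1] - ys[k]) / (ts[k + 1] - ts[k])
        second = 2.0 * (right - left) / (ts[k + 1] - ts[k - 1])
        total += second * second
    return total
```

For a trajectory that moves at constant velocity, the penalty is zero and the spline reproduces the straight line. Where a keyframe is a local extremum, the gate sets its slope to zero, so the curve does not overshoot past that keyframe's value.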

The optimization process is driven by a multi-objective loss function that combines several supervisory signals. Rendering reconstruction loss, a combination of $\mathcal{L}_1$ and SSIM, ensures appearance fidelity. Optical flow consistency loss, derived from Co-Tracker, aligns the projected positions of the primitives with ground-truth 2D trajectories. Depth loss, using monocular depth estimates from DPT, provides geometric priors. Finally, the curvature regularization loss $\mathcal{L}_{\text{curv}}$ enforces smooth temporal evolution. The total loss is a weighted sum of these components, enabling the model to achieve both high-fidelity rendering and robust temporal consistency. An adaptive initialization mechanism, which fuses multi-modal cues from depth, tracking, and masks, is used to generate a dense and temporally coherent initial point cloud, reducing early-stage flickering and improving convergence.
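Written out, the weighted sum might take the following form. The symbols $\mathcal{L}_{\text{render}}$, $\mathcal{L}_{\text{flow}}$, $\mathcal{L}_{\text{depth}}$ and the weights $\lambda$ are notation assumed here for illustration. The summary names only $\mathcal{L}_1$ and $\mathcal{L}_{\text{curv}}$ and gives no weight values.

```latex
\mathcal{L}_{\text{total}}
  = \lambda_{\text{render}}\,\mathcal{L}_{\text{render}}
  + \lambda_{\text{flow}}\,\mathcal{L}_{\text{flow}}
  + \lambda_{\text{depth}}\,\mathcal{L}_{\text{depth}}
  + \lambda_{\text{curv}}\,\mathcal{L}_{\text{curv}}
```

Here $\mathcal{L}_{\text{render}}$ stands for the combined $\mathcal{L}_1$ and SSIM term, and each $\lambda$ is a non-negative scalar that sets the relative influence of its supervisory signal.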

Experiment

  • AdaGaR achieves state-of-the-art video reconstruction on Tap-Vid DAVIS, attaining 35.49 dB PSNR and 0.9433 SSIM, a 6.86 dB PSNR improvement over the second-best method, while preserving fine details and temporal consistency.
  • Ablation studies validate the effectiveness of Adaptive Gabor Representation, Cubic Hermite Spline with curvature regularization, and adaptive initialization, showing superior performance in high-frequency detail preservation, motion smoothness, and depth consistency.
  • The method enables robust downstream applications: frame interpolation produces smooth, artifact-free intermediate frames with preserved texture details; video editing maintains temporal coherence through shared canonical primitives; stereo view synthesis achieves plausible geometry from monocular input.
  • On Tap-Vid DAVIS, the approach outperforms baselines across PSNR, SSIM, and LPIPS, with training completed in 90 minutes per sequence on an NVIDIA RTX 4090.

Results show that the proposed method achieves state-of-the-art performance on the Tap-Vid DAVIS dataset, outperforming all baselines across PSNR, SSIM, and LPIPS metrics. It achieves a PSNR of 35.49 dB, 6.86 dB higher than the second-best method, Splatter A Video, while also demonstrating superior texture detail and temporal consistency.
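To put the PSNR gap in perspective, the standard definition PSNR = 10 · log10(MAX² / MSE) implies that, for a single image, a gain of 6.86 dB corresponds to roughly 4.85 times lower mean squared error. The sketch below shows that arithmetic. It assumes intensities normalized to [0, 1], and the helper names are hypothetical. Dataset-level averages of PSNR do not convert this cleanly, so the ratio is indicative only.

```python
import math

def psnr(mse, max_val=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a PSNR gain of gain_db decibels."""
    return 10.0 ** (gain_db / 10.0)

ratio = mse_ratio_for_gain(6.86)
print(f"MSE reduction factor for +6.86 dB: {ratio:.2f}")
```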

Results show that the proposed Cubic Hermite Spline achieves the highest PSNR and SSIM scores while minimizing LPIPS, outperforming both B-Spline and Cubic Spline across all metrics. The authors use this method to generate smooth intermediate frames, demonstrating superior temporal coherence and preservation of high-frequency details.

The authors use an ablation study to compare different Gabor representation variants, showing that the Adaptive Gabor method achieves the highest PSNR and SSIM while minimizing LPIPS, indicating superior reconstruction quality and perceptual fidelity. Results demonstrate that the adaptive formulation with the compensation term b outperforms standard Gaussian and naive Gabor configurations across all metrics.

