
LoGoPlanner: Localization-Grounded Navigation Policy with Metric-Aware Visual Geometry

Jiaqi Peng, Wenzhe Cai, Yuqiang Yang, Tai Wang, Yuan Shen, Jiangmiao Pang

Abstract

Trajectory planning in unstructured environments is a fundamental yet challenging capability for mobile robots. Conventional modular pipelines suffer from latency and cascading errors across their sensing, localization, mapping, and planning modules. Recent end-to-end learning methods map raw visual observations directly to control signals or trajectories, and are promising in terms of both performance and efficiency in open environments. However, most prior end-to-end models still rely on separate localization modules, which require precise extrinsic calibration for self-state estimation and therefore limit generalization across robot embodiments and environments. We present LoGoPlanner, a fully end-to-end localization-grounded navigation framework that addresses these limitations by: (1) finetuning a long-horizon visual-geometry foundation backbone to ground its predictions in absolute metric scale, providing implicit state estimation for accurate localization; (2) reconstructing the surrounding scene geometry from historical observations to supply dense, fine-grained environmental awareness for reliable obstacle avoidance; and (3) conditioning the policy on implicit geometry bootstrapped from the aforementioned auxiliary tasks, thereby reducing error propagation. We evaluate LoGoPlanner in both simulation and real-world settings, where the fully end-to-end design reduces accumulated error and the metric-aware geometric memory improves planning consistency and obstacle avoidance, yielding more than a 27.3% improvement over oracle-localization baselines as well as strong generalization across different robot embodiments and environments. Code and models are publicly available at the project page: https://steinate.github.io/logoplanner.github.io/.

One-sentence Summary

Peng et al. introduce LoGoPlanner, a fully end-to-end navigation framework that eliminates external localization dependencies through implicit state estimation and metric-aware geometry reconstruction. By finetuning long-horizon visual-geometry backbones with depth-derived scale priors and conditioning diffusion-based trajectory generation on implicit geometric features, it achieves 27.3% better performance than oracle-localization baselines while enabling robust cross-embodiment navigation in unstructured environments.

Key Contributions

  • LoGoPlanner addresses the limitation of existing end-to-end navigation methods that still require separate localization modules with precise sensor calibration by introducing implicit state estimation through a fine-tuned long-horizon visual-geometry backbone that grounds predictions in absolute metric scale. This eliminates reliance on external localization while providing accurate self-state awareness.
  • The framework reconstructs dense scene geometry from historical visual observations to supply fine-grained environmental context for obstacle avoidance, overcoming the partial or scale-ambiguous geometry reconstruction common in prior single-frame approaches. This enables robust spatial reasoning across occluded and rear-view regions.
  • By conditioning the navigation policy directly on this implicit metric-aware geometry, LoGoPlanner reduces error propagation in trajectory planning, achieving over 27.3% improvement over oracle-localization baselines in both simulation and real-world evaluations while demonstrating strong generalization across robot embodiments and environments.

Introduction

Mobile robots navigating unstructured environments require robust trajectory planning, but traditional modular pipelines suffer from latency and cascading errors across perception, localization, and planning stages. While end-to-end learning methods promise efficiency by mapping raw visuals directly to control signals, they still critically depend on external localization modules that require precise sensor calibration, limiting generalization across robots and environments. Monocular visual odometry approaches further struggle with inherent scale ambiguity and drift, often needing additional sensors or scene priors that reduce real-world applicability. The authors overcome these limitations by introducing LoGoPlanner, an end-to-end framework that integrates metric-scale visual geometry estimation directly into navigation. It leverages a finetuned visual-geometry backbone to implicitly estimate absolute scale and state, reconstructs scene geometry from historical observations for obstacle avoidance, and conditions the policy on this bootstrapped geometry to minimize error propagation without external localization inputs.

Method

The authors leverage a unified end-to-end architecture—LoGoPlanner—that jointly learns metric-aware perception, implicit localization, and trajectory generation without relying on external modules. The framework is built upon a pretrained video geometry backbone, enhanced with depth-derived scale priors to enable metric-scale scene reconstruction. At its core, the system processes causal sequences of RGB-D observations to extract compact, world-aligned point embeddings that encode both fine-grained geometry and long-term ego-motion.
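To make the data flow concrete, the sketch below shows the kind of interface such a pipeline exposes: a causal window of RGB-D frames plus a goal in, a chunk of chassis-frame actions out. The tensor shapes and the `LoGoPlannerPolicy` name are illustrative assumptions, not the authors' released code.

```python
import torch

# Illustrative shapes only (assumptions, not taken from the paper):
#   B = batch, T = history length, H x W = image resolution.
B, T, H, W = 1, 8, 224, 224

rgb_seq   = torch.rand(B, T, 3, H, W)   # causal RGB observations
depth_seq = torch.rand(B, T, 1, H, W)   # aligned metric depth maps
goal_xy   = torch.rand(B, 2)            # navigation goal in the current chassis frame

# A hypothetical end-to-end wrapper around the components described below;
# it would return a chunk of (dx, dy, dtheta) actions in the chassis frame:
# trajectory = LoGoPlannerPolicy()(rgb_seq, depth_seq, goal_xy)   # -> (B, horizon, 3)
```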

Refer to the framework diagram, which illustrates the overall pipeline. The architecture begins with a vision transformer (ViT-L) that processes sequential RGB frames into patch tokens. These tokens are fused at the patch level with geometric tokens derived from depth maps using a lightweight ViT-S encoder. The fused tokens are then processed through a transformer decoder augmented with Rotary Position Embedding (RoPE) to produce metric-aware per-frame features:

$$\mathbf{t}_i^{\mathrm{metric}} = \mathrm{Attention}\big(\mathrm{RoPE}((\mathbf{t}_i^I, \mathbf{t}_i^D), \mathrm{pos})\big)$$

where $\mathrm{pos} \in \mathbb{R}^{K \times 2}$ encodes 2D spatial coordinates to preserve positional relationships. To improve reconstruction fidelity, auxiliary supervision is applied via two task-specific heads: a local point head and a camera pose head. The local point head maps metric tokens to latent features $\mathbf{h}_i^p$, which are decoded into canonical 3D points in the camera frame:

$$\mathbf{h}_i^p = \phi_p(\mathbf{t}_i^{\mathrm{metric}}), \qquad \widehat{P}_i^{\mathrm{local}} = f_p(\mathbf{h}_i^p)$$

These are supervised using the pinhole model:

$$\mathbf{p}_{\mathrm{cam},i}(u,v) = D_i(u,v)\, K^{-1} [u \; v \; 1]^\top$$
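As a concrete reference, a minimal back-projection of a depth map through the pinhole model is sketched below; the intrinsics values and image size are placeholders rather than the paper's calibration.

```python
import torch

def backproject_depth(depth: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Lift a depth map (H, W) to camera-frame points (H, W, 3) via p = D(u, v) * K^-1 [u, v, 1]^T."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()   # homogeneous pixel coordinates
    rays = pix @ torch.linalg.inv(K).T                               # K^-1 [u, v, 1]^T for every pixel
    return depth.unsqueeze(-1) * rays                                # scale rays by metric depth

# Placeholder intrinsics and depth, for illustration only.
K = torch.tensor([[200.0,   0.0, 112.0],
                  [  0.0, 200.0, 112.0],
                  [  0.0,   0.0,   1.0]])
points_cam = backproject_depth(torch.rand(224, 224), K)   # (224, 224, 3)
```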

In parallel, the camera pose head maps the same metric tokens to features $\mathbf{h}_i^c$, which are decoded into camera-to-world transformations $\widehat{T}_{\mathrm{c},i}$, defined relative to the chassis frame of the last time step to ensure planning consistency.
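The sketch below shows one way the patch-level fusion and the two auxiliary heads could be wired together. The layer sizes are guesses, the pretrained ViT-L/ViT-S encoders are elided, and the rotary position embedding is replaced by a plain learned positional embedding to keep the snippet short.

```python
import torch
import torch.nn as nn

class MetricTokenDecoder(nn.Module):
    """Fuse image and depth patch tokens, then decode point and pose latents (illustrative sizes)."""
    def __init__(self, dim_img=1024, dim_depth=384, dim=512, num_tokens=256):
        super().__init__()
        self.fuse = nn.Linear(dim_img + dim_depth, dim)            # patch-level fusion of t_i^I and t_i^D
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, dim))   # stand-in for RoPE in this sketch
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)  # attention over fused tokens
        self.point_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))  # phi_p -> h^p
        self.pose_head  = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))  # phi_c -> h^c

    def forward(self, img_tokens, depth_tokens):
        fused = self.fuse(torch.cat([img_tokens, depth_tokens], dim=-1)) + self.pos
        t_metric = self.decoder(fused)
        return self.point_head(t_metric), self.pose_head(t_metric)

h_p, h_c = MetricTokenDecoder()(torch.rand(1, 256, 1024), torch.rand(1, 256, 384))
```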

To bridge perception and control without explicit calibration, the authors decouple camera and chassis pose estimation. The chassis pose $\widehat{T}_{\mathrm{b},i}$ and relative goal $\widehat{g}_i$ are predicted from $\mathbf{h}_i^c$:

$$\widehat{T}_{\mathrm{b},i} = f_b(\mathbf{h}_i^{\mathrm{c}}), \qquad \widehat{g}_i = f_q(\mathbf{h}_i^{\mathrm{c}}, g)$$

The extrinsic transformation $T_{\mathrm{ext}}$, capturing camera height and pitch, is implicitly learned through training data with varying camera configurations, enabling cross-embodiment generalization.
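A minimal version of the decoupled chassis-pose and goal heads might look like the following; the pooling over tokens and the output parameterizations (an SE(2) pose as (x, y, yaw), a goal as (x, y)) are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ChassisGoalHeads(nn.Module):
    """Decode chassis pose T_b and relative goal g_hat from pose latents h^c (sketch)."""
    def __init__(self, dim=512):
        super().__init__()
        self.f_b = nn.Linear(dim, 3)        # chassis pose as (x, y, yaw) in the last chassis frame
        self.f_q = nn.Linear(dim + 2, 2)    # relative goal, conditioned on the commanded goal g

    def forward(self, h_c, goal):
        pooled = h_c.mean(dim=1)            # pool per-frame pose latents into one vector
        return self.f_b(pooled), self.f_q(torch.cat([pooled, goal], dim=-1))

pose_hat, goal_hat = ChassisGoalHeads()(torch.rand(1, 256, 512), torch.rand(1, 2))
```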

Rather than propagating explicit poses or point clouds, the system employs a query-based design inspired by UniAD. State queries $Q_S$ and geometric queries $Q_G$ extract implicit representations via cross-attention:

$$Q_S = \mathrm{CrossAttn}(Q_s, \mathbf{h}^{\mathrm{c}}), \qquad Q_G = \mathrm{CrossAttn}(Q_d, \mathbf{h}^{\mathrm{p}})$$
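One way to realize this query-based extraction with standard attention layers is sketched below; the number of learned queries and the hidden size are placeholders.

```python
import torch
import torch.nn as nn

class QueryExtractor(nn.Module):
    """Learned state/geometric queries cross-attend into pose and point latents (illustrative)."""
    def __init__(self, dim=512, n_queries=16):
        super().__init__()
        self.q_s = nn.Parameter(torch.randn(1, n_queries, dim))   # state queries Q_s
        self.q_d = nn.Parameter(torch.randn(1, n_queries, dim))   # geometric queries Q_d
        self.attn_s = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.attn_g = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, h_c, h_p):
        B = h_c.size(0)
        Q_S, _ = self.attn_s(self.q_s.expand(B, -1, -1), h_c, h_c)   # Q_S = CrossAttn(Q_s, h^c)
        Q_G, _ = self.attn_g(self.q_d.expand(B, -1, -1), h_p, h_p)   # Q_G = CrossAttn(Q_d, h^p)
        return Q_S, Q_G

Q_S, Q_G = QueryExtractor()(torch.rand(1, 256, 512), torch.rand(1, 256, 512))
```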

The state and geometric queries are then fused with goal embeddings to form a planning context query $Q_P$, which conditions a diffusion policy head. The policy generates trajectory chunks $\boldsymbol{a}_t = (\Delta x_t, \Delta y_t, \Delta\theta_t)$ by iteratively denoising noisy action sequences:

$$\pmb{\alpha}^{k-1} = \alpha\big(\pmb{\alpha}^k - \gamma\,\epsilon_\theta(Q_P, \pmb{\alpha}^k, k) + \mathcal{N}(0, \sigma^2 I)\big)$$

where $\epsilon_\theta$ is the noise prediction network, and $\alpha, \gamma$ are diffusion schedule parameters. This iterative refinement ensures collision-free, feasible trajectories while avoiding error accumulation from explicit intermediate representations.
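The denoising loop itself can be written in a few lines; the schedule values, number of steps, and the toy noise-prediction network below are stand-ins for whatever the actual diffusion policy head uses.

```python
import torch
import torch.nn as nn

def denoise_actions(eps_theta: nn.Module, q_p: torch.Tensor,
                    horizon: int = 8, steps: int = 10) -> torch.Tensor:
    """Iteratively refine a noisy (dx, dy, dtheta) chunk conditioned on the planning query Q_P (sketch)."""
    a = torch.randn(q_p.size(0), horizon, 3)            # start from Gaussian noise
    for k in reversed(range(1, steps + 1)):
        alpha, gamma, sigma = 1.0, 0.1, 0.0              # placeholder schedule, not the paper's
        noise = sigma * torch.randn_like(a)
        a = alpha * (a - gamma * eps_theta(q_p, a, k) + noise)
    return a

class TinyEps(nn.Module):
    """Toy noise-prediction network with the assumed (Q_P, a^k, k) signature."""
    def __init__(self, dim=512, horizon=8):
        super().__init__()
        self.net = nn.Linear(dim + horizon * 3 + 1, horizon * 3)
    def forward(self, q_p, a, k):
        ctx = q_p.mean(dim=1)                            # pool planning context tokens
        x = torch.cat([ctx, a.flatten(1), torch.full((a.size(0), 1), float(k))], dim=-1)
        return self.net(x).view_as(a)

traj = denoise_actions(TinyEps(), torch.rand(1, 16, 512))   # (1, 8, 3) trajectory chunk
```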

Experiment

  • Simulation on 40 InternScenes unseen environments: LoGoPlanner improved Home Success Rate by 27.3 percentage points and Success weighted by Path Length by 21.3% over ViPlanner, validating robust collision-free navigation without external localization.
  • Real-world tests on TurtleBot, Unitree Go2, and G1 platforms: Achieved 90.0% Success Rate and 82.0% Success weighted by Path Length on Unitree Go2 in cluttered home scenes, demonstrating cross-platform generalization without SLAM or visual odometry.
  • Ablation studies: Confirmed Point Cloud supervision is critical for obstacle avoidance, and scale-injected geometric backbone reduces navigation error while improving planning accuracy.

The authors evaluate navigation performance in simulation across home and commercial scenes, measuring success rate (SR) and success weighted by path length (SPL). LoGoPlanner, which performs implicit state estimation without external localization, outperforms all baselines, achieving 57.3 SR and 52.4 SPL in home scenes and 67.1 SR and 63.9 SPL in commercial scenes. Results show LoGoPlanner improves home scene performance by 27.3 percentage points in SR and 21.3% in SPL over ViPlanner, highlighting the benefit of integrating self-localization with geometry-aware planning.

The authors evaluate LoGoPlanner against iPlanner and ViPlanner in real-world settings across three robotic platforms and environment types. LoGoPlanner achieves the highest success rates in all scenarios, notably 85% on TurtleBot in office environments, 70% on Unitree Go2 in home settings, and 50% on Unitree G1 in industrial scenes, outperforming both baselines. Results show LoGoPlanner’s ability to operate without external localization and maintain robust performance despite platform-induced camera jitter and complex obstacle configurations.

The authors evaluate ablation variants of their model by removing auxiliary tasks—Odometry, Goal, and Point Cloud—and measure performance in home and commercial scenes using Success Rate (SR) and Success weighted by Path Length (SPL). Results show that including all three modules yields the highest performance, with SR reaching 57.3 in home scenes and 67.1 in commercial scenes, indicating that joint supervision improves trajectory consistency and spatial perception. Omitting any module degrades performance, confirming that each contributes meaningfully to robust navigation.

The authors evaluate different video geometry backbones for navigation performance, finding that VGGT with scale injection achieves the highest success rate and SPL in both home and commercial scenes while reducing navigation and planning errors. Results show that incorporating metric-scale supervision improves trajectory accuracy and planning consistency compared to single-frame or unscaled multi-frame models.

