EgoPush: Learning End-to-End Egocentric Multi-Object Rearrangement for Mobile Robots

Boyuan An Zhexiong Wang Yipeng Wang Jiaqi Li Sihang Li Jing Zhang Chen Feng

Abstract

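Humans rearrange objects in cluttered environments using egocentric perception, navigating occlusions effectively without global coordinates. Inspired by this ability, we study long-horizon, multi-object, non-prehensile rearrangement for a mobile robot equipped with a single egocentric camera. We propose EgoPush, a policy learning framework that enables egocentric, perception-driven rearrangement without relying on explicit global state estimation, which tends to fail in dynamic environments. EgoPush designs an object-centric latent space that encodes relative spatial relations among objects rather than absolute poses. This design lets a privileged reinforcement learning (RL) teacher jointly learn latent states and mobile actions from sparse keypoints, and the teacher is then distilled into a purely visual student policy. To reduce the supervision gap between the privileged teacher and the partially observable student, we restrict the teacher's observations to visually accessible cues, which induces active perception behaviors that are recoverable from the student's viewpoint. To address credit assignment over long horizons, we decompose rearrangement into stage-level subproblems using temporally decayed, stage-local completion rewards. Extensive simulation experiments show that EgoPush clearly outperforms end-to-end RL baselines in success rate, and ablation studies validate each design choice. We further demonstrate zero-shot sim-to-real transfer on a real mobile platform. Code and videos: https://ai4ce.github.io/EgoPush/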

One-sentence Summary

Researchers from New York University propose EgoPush, a teacher-student framework enabling mobile robots to rearrange multiple objects using only egocentric vision, by learning object-centric latent representations and distilling constrained teacher policies that ensure visually recoverable behaviors, achieving robust sim-to-real transfer on a TurtleBot.

Key Contributions

  • EgoPush enables mobile robots to perform long-horizon multi-object rearrangement using only egocentric vision by learning object-centric latent representations that encode relative spatial relations, eliminating reliance on fragile global state estimation in texture-sparse, occlusion-prone environments.
  • The framework introduces a constrained teacher RL approach that restricts the privileged teacher to egocentric observations, ensuring its behavior is recoverable by the visual student and promoting active perception strategies that align with real-world partial observability.
  • EgoPush improves long-horizon credit assignment through stage-wise training with temporally decayed rewards, achieving higher success rates than end-to-end RL baselines in simulation and demonstrating zero-shot sim-to-real transfer on a TurtleBot platform.

Introduction

The authors leverage egocentric vision to enable mobile robots to perform long-horizon multi-object rearrangement without global state estimation—a capability critical for real-world deployment where occlusions and sparse textures make traditional mapping unreliable. Prior methods either rely on brittle global state estimators or suffer from poor sample efficiency and instability under partial observability when using end-to-end visual RL. EgoPush addresses these gaps by introducing an object-centric latent space that encodes relative spatial relations, a constrained teacher policy that only uses egocentric observations to ensure distillable behavior, and stage-wise training with temporally decayed rewards to improve long-horizon credit assignment. This framework enables robust sim-to-real transfer and outperforms baselines in both success rate and sample efficiency.

Dataset

The authors use a custom dataset collected in a 3m × 3m gray arena with five color-coded boxes (red, green, blue, violet, brown) to train and evaluate their egocentric robot manipulation system. Here’s how the data is composed and processed:

  • RGB Images: Captured from the robot’s egocentric camera. The server applies HSV-based color segmentation to isolate each box from the background, generating binary masks. Specific HSV thresholds for each color are defined in Table VI.

  • Depth Images: Raw depth data from an Intel RealSense D435i is very noisy, especially on box top surfaces. The authors tested four postprocessing methods:

    • Learning-based denoising (CDM): High-quality reconstruction but too slow (~50ms/frame), unsuitable for real-time control.
    • Onboard filtering: Applied hole-filling and temporal/spatial filters on Jetson Nano; causes flickering and dropouts, with ~15ms latency.
    • Median-depth filling (baseline): Replaces masked regions with median depth value; stable and fast (~2ms), but loses geometric detail.
    • Navier-Stokes inpainting (final choice): Inpaints masked regions using fluid dynamics; retains more geometry than median-fill, more stable than onboard filtering, and runs at ~2ms. Used during training with injected noise to simulate real-world conditions.
  • Data Usage: The system uses the RGB masks to filter depth images, and the Navier-Stokes inpainted depth maps are fed into the model during training and real-time control. This strategy balances geometric fidelity, stability, and low latency, enabling effective sim-to-real transfer.
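The median-depth filling baseline from the list above is simple enough to sketch in a few lines. The version below is an illustration only, not the authors' implementation: it assumes depth and mask are same-sized 2D lists, that invalid depth readings are stored as 0.0, and that the fill value is the median of the valid readings inside the mask. The function name `median_fill` is hypothetical.

```python
from statistics import median

def median_fill(depth, mask, invalid=0.0):
    """Return a copy of `depth` in which every masked pixel is set to the
    median of the valid depth readings found inside the mask."""
    valid = [d
             for row_d, row_m in zip(depth, mask)
             for d, m in zip(row_d, row_m)
             if m and d != invalid]
    if not valid:  # nothing usable inside the mask: leave the depth untouched
        return [row[:] for row in depth]
    fill = median(valid)
    return [[fill if m else d for d, m in zip(row_d, row_m)]
            for row_d, row_m in zip(depth, mask)]

# Toy 2x3 depth image (metres) with two dropouts inside the object mask.
depth = [[1.0, 1.2, 0.0],
         [1.1, 0.0, 3.0]]
mask = [[1, 1, 1],
        [1, 1, 0]]
filled = median_fill(depth, mask)
```

Because every masked pixel receives the same value, the object becomes a flat slab. This is consistent with the loss of geometric detail noted above, which inpainting is reported to reduce.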

Method

The authors leverage a two-phase distillation framework called EgoPush, designed for long-horizon, multi-object non-prehensile rearrangement under egocentric visual constraints. The framework decouples learning into a privileged teacher phase and a vision-based student phase, enabling zero-shot sim-to-real transfer. The overall architecture is structured to ensure that the student policy, trained solely on egocentric RGB-D inputs, can replicate the teacher’s behavior while operating under perceptual constraints that mirror real-world sensor limitations.

In Phase 1, the teacher policy is trained via online reinforcement learning using sparse, privileged 3D keypoints that encode object geometry and relative poses. These keypoints are grouped into four semantic categories: active object, anchor, obstacles, and reference target. To ensure the teacher’s behavior remains visually recoverable by the student, the authors impose two critical constraints on the teacher’s observation space: virtual egocentric field-of-view (FOV) masking and center-gated visibility for the reference target. The FOV mask restricts observations to points within a robot-pose-based frustum and within a maximum range, approximating camera visibility. The center-gated visibility condition ensures the reference target keypoints are only revealed when the anchor object lies within the central region of the virtual FOV, preventing the teacher from exploiting the target without attending to the anchor. As shown in the figure below, this constrained observation space forces the teacher to rely on task-relevant, recoverable cues.
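The two observation constraints can be illustrated with a small planar sketch. It is a simplification under stated assumptions, not the paper's code: keypoints are reduced to 2D, the frustum becomes a view cone with a half-angle and a maximum range, the anchor counts as "centered" when any of its keypoints falls in a narrower central cone, and target keypoints are revealed in full once the gate opens. All function names, group names, and parameter values are hypothetical.

```python
import math

def in_fov(point, robot_pose, half_fov_rad, max_range):
    """True if a 2D world point lies inside the robot's virtual view cone."""
    x, y, yaw = robot_pose
    dx, dy = point[0] - x, point[1] - y
    if math.hypot(dx, dy) > max_range:
        return False
    bearing = math.atan2(dy, dx) - yaw
    bearing = math.atan2(math.sin(bearing), math.cos(bearing))  # wrap to [-pi, pi]
    return abs(bearing) <= half_fov_rad

def teacher_observation(groups, robot_pose, half_fov_rad, max_range, center_half_rad):
    """Apply FOV masking to every group and center-gating to the target."""
    obs = {name: [p for p in pts if in_fov(p, robot_pose, half_fov_rad, max_range)]
           for name, pts in groups.items() if name != "target"}
    # Gate: reveal the target only while the anchor sits in the central cone.
    anchor_centered = any(in_fov(p, robot_pose, center_half_rad, max_range)
                          for p in groups["anchor"])
    obs["target"] = list(groups["target"]) if anchor_centered else []
    return obs

groups = {
    "active": [(1.0, 0.0)],
    "anchor": [(2.0, 0.1)],
    "obstacles": [(0.0, 2.0), (1.0, 0.5)],
    "target": [(2.5, 0.0)],
}
fov, rng, center = math.radians(45), 3.0, math.radians(10)
obs_facing = teacher_observation(groups, (0.0, 0.0, 0.0), fov, rng, center)
obs_turned = teacher_observation(groups, (0.0, 0.0, math.radians(30)), fov, rng, center)
```

When the robot faces the anchor, the target keypoints are present in `obs_facing`. After a 30 degree turn the anchor is still inside the view cone but no longer central, so `obs_turned["target"]` is empty. A teacher trained this way has to look at the anchor before it can use the target.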

The teacher’s state estimator is implemented as a PointNet, which processes variable-sized point sets per semantic group and produces group-wise latent embeddings $Z_t^k$ via shared-weight MLPs and symmetric pooling. These latents, concatenated with the previous action $a_{t-1}$, are fed into an MLP policy head to output a 2D continuous action $a_t = [v_t, \omega_t]$, corresponding to the linear and angular velocities of the differential-drive base. The reward function is designed to facilitate long-horizon learning through stage-aligned supervision: it includes time-weighted completion rewards for reaching and placing the active object, progress shaping via phase-gated distance decrease, smoothness penalties to discourage abrupt actions, and slowdown rewards near the target to encourage stable settling. The teacher is trained using Proximal Policy Optimization (PPO) with domain randomization applied to physical parameters.
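The reason a PointNet-style encoder suits FOV-masked keypoints is that shared weights plus symmetric pooling give a fixed-size latent for any number of points, in any order. The sketch below shows that property with a single linear layer, ReLU, and max pooling. The weights are arbitrary placeholders, and the choice of max pooling and of a zero latent for an empty group are assumptions made for illustration.

```python
def shared_mlp(point, weights, biases):
    """One linear layer + ReLU, applied identically to every point."""
    return [max(0.0, sum(w * x for w, x in zip(row, point)) + b)
            for row, b in zip(weights, biases)]

def encode_group(points, weights, biases):
    """Max-pool per-point features into one fixed-size latent for the group."""
    dim = len(biases)
    if not points:  # group fully masked out: assumed zero latent
        return [0.0] * dim
    feats = [shared_mlp(p, weights, biases) for p in points]
    return [max(f[i] for f in feats) for i in range(dim)]

# Placeholder weights mapping 3D keypoints to a 2D latent.
W = [[1.0, 0.0, 0.0],
     [0.0, -1.0, 0.5]]
b = [0.0, 0.1]

pts = [(0.5, 0.2, 0.0), (1.5, -0.3, 0.2), (0.1, 0.0, 1.0)]
z = encode_group(pts, W, b)
z_shuffled = encode_group(pts[::-1], W, b)   # same latent, different point order
z_partial = encode_group(pts[:1], W, b)      # fewer visible points, same latent size
```

Here `z` and `z_shuffled` are identical, and `z_partial` has the same length as `z` even though only one keypoint survived masking.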

In Phase 2, the student policy is distilled from the teacher via supervised learning using only egocentric RGB-D observations. The RGB stream is used solely for instance segmentation to assign objects to semantic groups (active, anchor, obstacle), while the depth map is masked and aggregated per group to produce fixed-dimensional depth layers $\tilde{d}_t^k$. These depth layers serve as input to a CNN-based state estimator, which replaces the teacher’s PointNet. The student’s policy head architecture remains identical to the teacher’s MLP, and its weights are initialized from the teacher’s learned parameters to accelerate convergence. The student is trained using an online DAgger-style procedure: at each iteration, the teacher is queried online to generate action labels for the current state, and the student performs a supervised update. To bridge the representation gap between the teacher’s privileged latent space and the student’s visual inputs, the authors introduce a relational distillation loss. This loss minimizes the Frobenius norm between the pairwise cosine similarity matrices of the shared semantic group latents (active, anchor, obstacle) from teacher and student, encouraging the student to replicate the teacher’s perception of spatial relationships without requiring explicit access to the reference target. As shown in the framework diagram, this relational alignment enables the student to implicitly encode target-seeking behavior.
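The relational distillation loss described above can be written down directly from its definition: build a pairwise cosine similarity matrix over the group latents for the teacher and for the student, then take the Frobenius norm of the difference. The following is a minimal pure-Python sketch; the function names, the small epsilon in the denominator, and the toy latents are assumptions, and a real training loop would use a tensor library with gradients.

```python
import math

def cosine_similarity_matrix(latents, eps=1e-8):
    """K x K matrix of cosine similarities between K group latents."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v + eps)
    return [[cos(u, v) for v in latents] for u in latents]

def relational_distillation_loss(teacher_latents, student_latents):
    """Frobenius norm of the difference between the two similarity matrices."""
    s_teacher = cosine_similarity_matrix(teacher_latents)
    s_student = cosine_similarity_matrix(student_latents)
    return math.sqrt(sum((a - b) ** 2
                         for row_t, row_s in zip(s_teacher, s_student)
                         for a, b in zip(row_t, row_s)))

# Latents for the shared groups (active, anchor, obstacle); values are made up.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Student latents live in a different space (3D, rotated, rescaled) but keep
# the same pairwise angles, so the relational loss is near zero.
student_aligned = [[0.0, 2.0, 0.0], [-2.0, 0.0, 0.0], [-2.0, 2.0, 0.0]]
# A collapsed student that maps every group to the same latent is penalized.
student_collapsed = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

loss_aligned = relational_distillation_loss(teacher, student_aligned)
loss_collapsed = relational_distillation_loss(teacher, student_collapsed)
```

Since only the K x K similarity matrices are compared, the teacher and student latents need not share a dimension or a coordinate frame. That is one plausible reason to prefer a relational loss over direct latent regression when one encoder is a PointNet and the other a CNN.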

For real-world deployment, the student policy operates on RGB-D inputs from a RealSense camera mounted on a TurtleBot equipped with a front pusher. The pusher mitigates the depth camera’s sensing dead zone but introduces dynamical challenges due to an extended moment arm, which the learned policy compensates for through adaptive control. Domain randomization is further applied to camera-pose-related observations during distillation to enhance robustness. The student’s ability to generalize to real-world conditions without fine-tuning is enabled by the structured distillation process and the alignment of perceptual constraints between simulation and reality.

Experiment

  • EgoPush successfully performs multi-object rearrangement with diverse object geometries in both simulation and real-world settings, achieving high precision in target formations.
  • Constraining the RL teacher’s observation space improves student learning by aligning supervision with the student’s partial observability, leading to significantly better performance than unrestricted teacher variants.
  • Decomposing long-horizon tasks into sequential sub-tasks with stage-level time-decayed rewards accelerates RL convergence and enhances credit assignment, enabling stable and efficient learning.
  • Adding a relational distillation loss helps the student inherit the teacher’s spatial reasoning, proving critical for complex, asymmetric tasks where geometric consistency matters.
  • Baseline methods—including end-to-end visual RL and classical mapping approaches—fail to solve the task reliably due to poor long-horizon reasoning, partial observability, and drift-induced state inconsistency.
  • Real-world deployment demonstrates zero-shot transfer, achieving an 80% success rate with minor deviations, though torque limits reduce robustness compared to simulation.
  • The student policy generalizes to novel object shapes (cylinder, prism) in terms of reaching, but struggles with final alignment due to geometry-dependent contact dynamics.
  • Accuracy evaluation on cuboids shows the student achieves ~86.7% normalized accuracy, confirming precise final positioning relative to invisible targets.

The authors use a simplified two-object pushing task to compare EgoPush against classical and end-to-end visual RL baselines, all operating under egocentric RGB-D sensing. Results show that while baselines achieve limited object reach rates, none succeed in completing the full push-and-align task, whereas EgoPush achieves perfect success and reach rates with efficient trajectories. This highlights that structured distillation and spatial reasoning are critical for solving long-horizon rearrangement under partial observability.

The authors use a progressive reward shaping approach to decompose long-horizon tasks, showing that adding stage-wise rewards and temporal decay significantly improves learning efficiency. Their final method, which resets the decay schedule at each stage boundary, achieves near-saturated performance with faster convergence and lower execution time compared to ablated variants. Results confirm that structured credit assignment is critical for solving complex sequential rearrangement under sparse feedback.
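To make the effect of resetting the decay schedule concrete, the sketch below compares a completion bonus whose clock restarts at every stage boundary with one that keeps a single episode-wide clock. The exponential form, the decay factor 0.99, and the completion steps are assumptions chosen for illustration; the source does not give the exact reward formula.

```python
def stage_completion_reward(step, stage_start, base=1.0, decay=0.99):
    """Completion bonus that shrinks with the steps spent since `stage_start`."""
    return base * decay ** (step - stage_start)

def completion_rewards(completion_steps, reset_each_stage=True):
    """Bonus earned at each stage completion, with or without clock resets."""
    rewards, stage_start = [], 0
    for step in completion_steps:
        rewards.append(stage_completion_reward(step, stage_start))
        if reset_each_stage:
            stage_start = step  # the next stage's clock starts here
    return rewards

# Hypothetical episode: three stages finish at steps 50, 120 and 200.
completion_steps = [50, 120, 200]
stage_local = completion_rewards(completion_steps, reset_each_stage=True)
global_clock = completion_rewards(completion_steps, reset_each_stage=False)
```

With the reset, the bonus for the third stage depends only on the 80 steps spent in that stage. Without it, the same completion is discounted by all 200 elapsed steps, so late stages yield a much weaker learning signal regardless of how efficiently they were solved.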

The authors use restricted observation spaces for the RL teacher to align with the student’s partial observability, which significantly improves student performance despite slightly reducing teacher efficiency. Results show that removing center-gated visibility or global FOV constraints leads to poor student success rates, highlighting the importance of observation design for effective distillation. The student trained under the full method achieves 70.7% success, while variants without key constraints fail to generalize or learn meaningful behaviors.

The authors use HSV threshold ranges to segment colored boxes in real-world experiments, with hue values adjusted to account for circular color space boundaries, particularly for red. These ranges are designed to ensure consistent object detection under varying lighting and camera conditions. The selected thresholds reflect empirical tuning to balance precision and recall across multiple object colors.
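The red wraparound mentioned above arises because hue is an angle: red sits at both ends of the hue axis, so its range has a lower bound that is numerically larger than its upper bound. A minimal sketch of a wrap-aware range test follows. It assumes the 8-bit OpenCV convention in which hue spans 0 to 179, and the threshold values are hypothetical placeholders, not the values from Table VI.

```python
def hue_in_range(h, lo, hi):
    """Range test on a circular hue axis; lo > hi means the range wraps."""
    if lo <= hi:
        return lo <= h <= hi
    return h >= lo or h <= hi

# Hypothetical hue ranges on a 0-179 scale (not the paper's thresholds).
HUE_RANGES = {
    "red": (170, 10),    # wraps around the end of the hue axis
    "green": (40, 80),
}

def classify_hue(h):
    """Return the first colour whose hue range contains `h`, else None."""
    for name, (lo, hi) in HUE_RANGES.items():
        if hue_in_range(h, lo, hi):
            return name
    return None

labels = [classify_hue(h) for h in (175, 5, 60, 100)]
```

In a full segmentation pipeline the same test would be combined with saturation and value bounds. Those are ordinary, non-circular ranges and need no special handling.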

The authors evaluate their student policy on objects with non-cuboid geometries and find that while the model reliably reaches the target object, success rates drop significantly for cylinders and prisms due to challenges in maintaining stable contact and precise alignment during pushing. Execution time and trajectory length increase for prisms, indicating that geometric complexity amplifies control errors over long-horizon interactions. Results suggest the policy’s perception and approach skills generalize well, but fine-grained manipulation remains sensitive to object shape.

