Task Tokens: A Flexible Approach to Adapting Behavior Foundation Models

Ron Vainshtein Zohar Rimon Shie Mannor Chen Tessler

Abstract

Recent advances in imitation learning for robotic control have led to transformer-based behavior foundation models (BFMs) that enable multi-modal, human-like control of humanoid agents. Given a high-level goal or prompt, these models generate a corresponding solution; for example, when conditioned on the position of the robot's pelvis, they can generate a motion that walks to a given coordinate. BFMs excel at zero-shot generation of robust behaviors, but performing a specific task often requires meticulous prompt engineering, which can lead to suboptimal results. In this work, we propose "Task Tokens", a method for effectively tailoring a BFM to a specific task while preserving its flexibility. Our approach integrates naturally into the transformer architecture of BFMs. Task Tokens trains a task-specific encoder (tokenizer) while leaving the original BFM unmodified. Compared with standard baselines, the method reduces the number of trainable parameters per task by up to 125 times and converges up to 6 times faster. In addition, because the original BFM remains unchanged, Task Tokens can make use of the pre-existing encoders. This makes it possible to incorporate user-defined priors and to balance reward design against prompt engineering. We demonstrate the efficacy of Task Tokens across a variety of tasks, including out-of-distribution scenarios, and also confirm their compatibility with other prompting modalities.

One-sentence Summary

By training a task-specific encoder while keeping the original transformer architecture frozen, the proposed Task Tokens method adapts behavior foundation models to specific robotic control tasks with up to 125 times fewer trainable parameters and 6 times faster convergence than standard baselines.

Key Contributions

  • The paper introduces Task Tokens, a method designed to adapt Behavior Foundation Models (BFMs) to specific tasks by training a task-specific encoder while keeping the original BFM parameters frozen.
  • This approach integrates directly into the existing transformer architecture, allowing for the use of pre-existing encoders and the incorporation of user-defined priors to balance reward design with prompt engineering.
  • Experimental results demonstrate that Task Tokens reduce trainable parameters per task by up to 125 times and achieve convergence up to 6 times faster than standard baselines across various tasks and out-of-distribution scenarios.

Introduction

Goal Conditioned Behavior Foundation Models (GC-BFMs) are essential for generating diverse, human-like motions in robotics and animation by mapping goals directly to actions. While these models excel at reproducing common motions, they struggle to adapt to out-of-distribution constraints or specialized tasks defined by users. Existing solutions like prompt engineering or fine-tuning are often inefficient or risk degrading the foundational knowledge stored within the model. The authors leverage a Task Tokens approach to bridge this gap, providing a mechanism that incorporates task-specific optimization while preserving the natural behavior of the underlying foundation model.

Method

The authors propose a parameter-efficient method called Task Tokens to adapt a Goal-Conditioned Behavior Foundation Model (GC-BFM), specifically MaskedMimic, to specific downstream tasks without fine-tuning the underlying foundation model. This approach preserves the zero-shot capabilities and general behavioral priors of the BFM while enabling task-specific optimization.

The core architecture relies on the transformer-based nature of MaskedMimic, which processes sequences of tokens. As shown in the framework diagram, the method integrates three distinct input sources to guide the model: Prior Tokens, which allow for user-defined behavioral priors such as textual prompts or joint conditions; State Tokens, which represent the current environment state $s_t^i$ using pre-trained encoders; and Task Tokens, which are generated by a dedicated Task Encoder. The frozen GC-BFM integrates these inputs to produce task-optimized actions $a_t^i$.

The Task Encoder is designed as a lightweight, generic module implemented as a feed-forward multilayer perceptron (MLP) with ReLU activations. It processes task-goal observations $g_t^i$, which are represented in the agent's egocentric reference frame, and predicts a Task Token $\tau_t^i \in \mathbb{R}^{512}$. For example, in a steering task, the input $g_t^i$ might include target direction, facing direction, and desired speed. To ensure alignment with the pre-trained representations of MaskedMimic, the encoder is also provided with proprioceptive information. The resulting Task Token is concatenated with other tokens in the BFM's input space, creating a token "sentence" where the task-specific signal acts as a specialized word guiding the model toward the target behavior.
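To make the data flow concrete, the following is a minimal sketch of such an encoder in plain Python, not the authors' implementation. Only the ReLU MLP structure, the proprioceptive input, and the 512-dimensional output come from the text. The hidden width, the number of layers, the weight initialization, the size of the proprioceptive vector, and the position of the Task Token in the sequence are all assumptions made for illustration.

```python
import random

TOKEN_DIM = 512  # Task Token size stated in the text


def init_linear(n_in, n_out, rng):
    """Return (weights, biases) for one dense layer (assumed He-style init)."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(x, layer):
    """Apply one dense layer to the vector x."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


class TaskEncoder:
    """Feed-forward MLP with ReLU: (goal obs, proprioception) -> Task Token."""

    def __init__(self, goal_dim, proprio_dim, hidden_dim=64, seed=0):
        rng = random.Random(seed)
        # Two hidden layers of width hidden_dim are an assumption.
        self.layers = [
            init_linear(goal_dim + proprio_dim, hidden_dim, rng),
            init_linear(hidden_dim, hidden_dim, rng),
            init_linear(hidden_dim, TOKEN_DIM, rng),
        ]

    def __call__(self, goal_obs, proprio_obs):
        x = list(goal_obs) + list(proprio_obs)
        for layer in self.layers[:-1]:
            x = relu(linear(x, layer))
        return linear(x, self.layers[-1])  # no activation on the output token


def build_token_sequence(prior_tokens, state_tokens, task_token):
    """Concatenate all tokens into the 'sentence' read by the frozen BFM.

    Placing the Task Token last is an assumption; the text only says
    that the tokens are concatenated.
    """
    return list(prior_tokens) + list(state_tokens) + [task_token]


# Steering example: target direction (2), facing direction (2), desired speed (1).
goal = [1.0, 0.0, 0.0, 1.0, 1.5]
proprio = [0.0] * 8  # placeholder proprioceptive features (hypothetical size)
encoder = TaskEncoder(goal_dim=5, proprio_dim=8)
task_token = encoder(goal, proprio)
sequence = build_token_sequence([], [[0.0] * TOKEN_DIM], task_token)
print(len(task_token), len(sequence))  # prints: 512 2
```

Because the encoder only emits one extra token, the foundation model's input interface stays the same: the BFM sees a slightly longer token sequence and needs no architectural change.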

To optimize the Task Encoder, the authors utilize Proximal Policy Optimization (PPO). During the training process, the BFM predicts action probabilities based on the combined token sequence. The PPO objective is computed with respect to the task-specific reward and the BFM's action probabilities. Crucially, the gradients flow through the frozen GC-BFM to update only the Task Encoder parameters. This design prevents the degradation of the foundation model's prior knowledge, which might otherwise occur through full fine-tuning.
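The text does not specify which PPO variant is used. As an illustration, the sketch below computes the widely used clipped surrogate objective, assuming the log-probabilities come from the frozen BFM evaluated on token sequences that include the Task Token. The function name, the clip range of 0.2, and the plain-list inputs are assumptions. In a real setup an automatic differentiation framework would backpropagate this value through the frozen model and update only the Task Encoder weights.

```python
import math


def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective (a quantity to maximize).

    new_log_probs: log-probabilities of the taken actions under the frozen
                   BFM conditioned on the current Task Encoder output.
    old_log_probs: the same quantities recorded when the data was collected.
    advantages:    advantage estimates derived from the task-specific reward.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # probability ratio r_t
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# With identical old and new log-probabilities the ratio is 1,
# so the objective reduces to the mean advantage.
print(ppo_clipped_objective([-1.0, -1.0], [-1.0, -1.0], [1.0, 3.0]))  # prints: 2.0
```

The clipping limits how far a single update can move the action distribution, which fits the goal of steering the frozen model with a small learned input rather than overwriting its behavior.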

For complex tasks that require sequential execution, the authors implement multi-phase prompting. This mechanism uses a finite-state machine (FSM) to switch between different prior tokens while using a single Task Token. As shown in the accompanying figure, this allows the agent to transition between distinct behavioral phases, such as moving from a locomotion phase to a striking phase, based on geometric proximity to a target.
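A minimal sketch of such a finite-state machine is shown below. The phase names follow the locomotion-to-strike example in the text. The class name, the prompt strings, the 1.0 distance threshold, and the one-way transition (the agent never returns to locomotion) are hypothetical choices made for illustration.

```python
import math
from enum import Enum


class Phase(Enum):
    LOCOMOTION = "locomotion"
    STRIKE = "strike"


# Hypothetical prior prompts per phase. In the real system these would be
# prior tokens produced by the BFM's pre-trained encoders.
PHASE_PRIORS = {
    Phase.LOCOMOTION: "walk toward the target",
    Phase.STRIKE: "strike the target",
}


class MultiPhasePrompter:
    """Select the prior for the current phase; the Task Token is shared."""

    def __init__(self, strike_radius=1.0):
        self.strike_radius = strike_radius  # assumed switching distance
        self.phase = Phase.LOCOMOTION

    def step(self, agent_xy, target_xy):
        """Update the phase from geometric proximity and return its prior."""
        distance = math.dist(agent_xy, target_xy)
        if self.phase is Phase.LOCOMOTION and distance <= self.strike_radius:
            self.phase = Phase.STRIKE  # one-way transition (assumption)
        return PHASE_PRIORS[self.phase]


prompter = MultiPhasePrompter(strike_radius=1.0)
trajectory = [(5.0, 0.0), (2.0, 0.0), (0.8, 0.0), (1.2, 0.0)]
phases = []
for position in trajectory:
    prompter.step(position, (0.0, 0.0))
    phases.append(prompter.phase.name)
print(phases)  # prints: ['LOCOMOTION', 'LOCOMOTION', 'STRIKE', 'STRIKE']
```

Only the prior changes between phases, so a single trained Task Encoder can serve the whole sequence.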

Experiment

The researchers evaluated the Task Tokens method across diverse humanoid control tasks, including reaching, steering, and striking, to assess its ability to adapt behavior foundation models to specific objectives. The experiments validate that this hybrid approach achieves rapid convergence and high success rates while maintaining the robustness and multi-modal prompting capabilities of the original model. Qualitative results from human studies and out-of-distribution tests confirm that the method produces natural, human-like motions that generalize well to varying environmental conditions such as changes in gravity and friction.

The authors compare the success rates of various humanoid control methods across five distinct tasks. Task Tokens maintains high success rates in the Direction, Steering, and Reach tasks and performs comparably to baselines such as PPO and AMP in several task categories. Some baselines achieve higher success in the Strike and Long Jump tasks, but Task Tokens remains effective in the majority of evaluated scenarios.

The authors evaluate the performance of Task Tokens across several humanoid control tasks by comparing success rates with various baselines. The results demonstrate that Task Tokens achieves high success rates across most environments, effectively adapting the foundation model to new tasks. Task Tokens performs competitively with or outperforms fine-tuning methods in most evaluated tasks. The method shows significant improvements in success rates for tasks like Reach and Steering compared to the original zero-shot MaskedMimic approach. While fine-tuning achieves high success in the Strike task, Task Tokens maintains strong performance across the majority of other scenarios.

The authors compare the Task Tokens approach against several baselines across five humanoid control tasks. Task Tokens achieves higher success rates than most baselines in the Reach, Direction, and Steering tasks. On the Long Jump task, it matches the performance of state-of-the-art hierarchical approaches. Other methods, such as MaskedMimic Fine-Tune and PULSE, show strong results in the Strike task, while Task Tokens remains competitive across the task suite.

The authors conduct an ablation study to evaluate how different Task Encoder architectures affect performance on the steering task. A bigger MLP encoder tends to improve the success rate, and incorporating current pose information into the encoder leads to better performance when a larger MLP is used. The combination of a larger MLP and current pose information yields the highest success rate among the tested configurations.

The authors evaluate the Task Tokens approach by comparing its success rates across five humanoid control tasks against several established baselines and conducting an ablation study on encoder architectures. The results demonstrate that Task Tokens provides effective task adaptation and achieves highly competitive performance, often outperforming or matching advanced fine-tuning and hierarchical methods in most scenarios. Furthermore, the ablation study indicates that performance is optimized by utilizing larger MLP encoders and incorporating current pose information.

