12 hours ago
Agent
Multimodal

NitroGen: An Open Foundation Model for Generalist Gaming Agents

Abstract

We present NitroGen, a vision-action foundation model for generalist gaming agents, trained on 40,000 hours of gameplay video from more than 1,000 games. Our approach rests on three key components: 1) an internet-scale video-action dataset built by automatically extracting player actions from publicly available gameplay videos, 2) a multi-game benchmark environment that measures generalization across games, and 3) a unified vision-action model trained with large-scale behavior cloning. NitroGen performs strongly across diverse domains, including combat encounters in 3D action games, precise control in 2D platformers, and exploration in procedurally generated worlds. The model transfers effectively to unseen games, achieving up to a 52% relative improvement in task success rates over models trained from scratch. We release the dataset, the evaluation suite, and the model weights to advance research on generalist embodied agents.

One-sentence Summary

NitroGen is an open vision-action foundation model for generalist gaming agents, trained via large-scale behavior cloning on 40,000 hours of gameplay video across more than 1,000 games, that achieves up to a 52% relative improvement in task success rates on unseen games over models trained from scratch; the dataset, evaluation suite, and model weights are released to advance generalist embodied agent research.

Key Contributions

  • An internet-scale video-action dataset is constructed by automatically extracting player actions from publicly available gameplay videos using input overlay software. The resource enables training across hundreds of games without relying on costly data collection or specialized simulators.
  • NitroGen is a unified vision-action foundation model trained with large-scale behavior cloning on 40,000 hours of gameplay across more than 1,000 games. The method discards language conditioning to focus purely on scalable vision-action mapping for generalist gaming agents.
  • A multi-game benchmark environment measures cross-game generalization, where the model achieves up to 52% relative improvement in task success rates over models trained from scratch. The dataset, evaluation suite, and model weights are released to advance research on generalist embodied agents.

Introduction

Building generally capable embodied agents is a major goal in AI, yet progress is hindered by the lack of large, diverse, and labeled action datasets. Existing methods often depend on specialized simulators or hand-crafted APIs that do not scale to arbitrary games, while behavior cloning is constrained by the high cost of collecting human demonstrations. To address this, the authors introduce NitroGen, a vision-action foundation model trained on 40,000 hours of publicly available gameplay videos across over 1,000 titles. They leverage an automated pipeline to extract frame-level actions from input overlays in internet videos, removing the need for costly manual labeling. This unified model demonstrates strong cross-game generalization and improves task success rates by up to 52% compared to models trained from scratch.

Dataset

  1. Dataset Composition and Sources. The authors construct the NitroGen dataset from publicly available gameplay videos that feature input overlay software. These overlays visualize player actions such as gamepad button presses, so action labels can be recovered from internet-scale data without direct access to game inputs.

  2. Key Subset Details. The raw collection contains 71,000 hours of video across 38,739 clips from 818 creators. After filtering, the final dataset includes 40,000 hours spanning more than 1,000 unique games. Action-RPGs represent 34.9% of total hours, followed by Platformers and Action-Adventure titles. The evaluation suite covers 10 games with 30 tasks categorized into combat, navigation, and game-specific mechanics.

  3. Data Usage and Training Strategy. The authors use the data for large-scale behavior cloning pre-training. Segments are filtered to retain only chunks where at least 50% of timesteps contain non-zero actions, which keeps the model from over-predicting the null action. Evaluation uses a universal simulator that wraps commercial games with a Gymnasium API, standardizing observations to single RGB frames and actions to a 20-dimensional vector.

  4. Processing and Extraction Methods. Action extraction employs a three-stage pipeline that starts with template matching using SIFT and XFeat to locate overlays. A fine-tuned SegFormer model then parses controller states from consecutive frames and outputs joystick positions on an 11x11 grid together with binary button states. The on-screen controller is masked during training so the model cannot exploit the overlay as a shortcut, and 8M synthetic frames are used to train the annotation model.
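
The chunk filter described in item 3 can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `min_active_fraction` parameter, and the 4-step toy chunks are assumptions, and "non-zero action" is read here as "at least one non-zero dimension in that timestep".

```python
import numpy as np

def keep_chunk(actions: np.ndarray, min_active_fraction: float = 0.5) -> bool:
    """Return True if enough timesteps of the chunk carry a non-zero action.

    actions: array of shape (T, D), one action vector per timestep.
    """
    # A timestep counts as active if any of its dimensions is non-zero (assumption).
    active = np.any(actions != 0, axis=1)
    return bool(active.mean() >= min_active_fraction)

def filter_chunks(chunks, min_active_fraction: float = 0.5):
    """Keep only the chunks that pass the activity threshold."""
    return [c for c in chunks if keep_chunk(c, min_active_fraction)]

# Toy data: 4-step chunks of 20-dimensional actions (sizes chosen for illustration).
idle = np.zeros((4, 20))   # no input at all
half = np.zeros((4, 20))
half[:2, 0] = 1.0          # input on exactly half of the timesteps
busy = np.ones((4, 20))    # input on every timestep

kept = filter_chunks([idle, half, busy])
print(len(kept))  # 2: the idle chunk is dropped, the other two are kept
```

With "at least 50%" read as an inclusive threshold, a chunk that is active on exactly half of its timesteps is retained.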

Method

The authors propose NitroGen, a multi-game foundation agent designed to generate future action chunks conditioned on visual observations. The overall system integrates a Universal Simulator, the foundation agent itself, and an Internet-Scale Video-Action Dataset. Refer to the framework diagram.

To enable training on diverse gameplay, the system relies on a large-scale dataset constructed by extracting controller inputs from online videos. As shown in the figure below, the data preparation pipeline begins with gamepad localization using template matching on input video frames. Once localized, the gamepad is cropped, and specific actions are extracted through joystick segmentation and button classification.

The NitroGen architecture employs flow matching to generate these action sequences. The model adapts a diffusion transformer (DiT) backbone, removing language and state encoders to focus purely on visual conditioning. RGB inputs at $256 \times 256$ resolution are encoded using a SigLIP 2 vision transformer, which produces 256 image tokens per frame. Noisy action chunks are first encoded by an MLP into one action token per timestep. These tokens are processed through several DiT blocks consisting of alternating self-attention and cross-attention layers, where cross-attention layers condition action generation on the encoded frame tokens. Finally, the action tokens are decoded into continuous action vectors using an MLP applied independently across the time dimension.
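
To make the token flow concrete, here is a shape-level sketch in NumPy. It is only an illustration under loud assumptions: the hidden width of 64, the random weights, the single attention head, and the omission of learned query/key/value projections, self-attention, and timestep conditioning are all simplifications, and the image tokens are random stand-ins for the encoder output. The chunk length of 16 and the action width of 24 follow the chunk shape given with the training objective.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_IMAGE_TOKENS = 256  # from the text: 256 tokens per frame
CHUNK_LEN = 16          # from the text: 16 actions per chunk
ACTION_DIM = 24         # from the chunk shape stated with the training objective
HIDDEN = 64             # hypothetical width, not from the source

def mlp(x, w1, w2):
    """Two-layer ReLU MLP applied independently to each row (timestep) of x."""
    return np.maximum(x @ w1, 0.0) @ w2

def cross_attention(queries, context):
    """Single-head scaled dot-product attention: action tokens attend to image tokens.

    Learned Q/K/V projections are omitted for brevity (simplification).
    """
    scores = queries @ context.T / np.sqrt(queries.shape[-1])  # (16, 256)
    scores = scores - scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over image tokens
    return weights @ context, weights

# Random stand-ins for the encoded frame and a noisy action chunk.
image_tokens = rng.normal(size=(NUM_IMAGE_TOKENS, HIDDEN))
noisy_actions = rng.normal(size=(CHUNK_LEN, ACTION_DIM))

enc_w1 = rng.normal(size=(ACTION_DIM, HIDDEN)) * 0.1
enc_w2 = rng.normal(size=(HIDDEN, HIDDEN)) * 0.1
dec_w1 = rng.normal(size=(HIDDEN, HIDDEN)) * 0.1
dec_w2 = rng.normal(size=(HIDDEN, ACTION_DIM)) * 0.1

action_tokens = mlp(noisy_actions, enc_w1, enc_w2)             # one token per timestep: (16, 64)
attended, attn = cross_attention(action_tokens, image_tokens)  # (16, 64), (16, 256)
action_tokens = action_tokens + attended                       # residual connection
decoded = mlp(action_tokens, dec_w1, dec_w2)                   # back to action space: (16, 24)

print(action_tokens.shape, attn.shape, decoded.shape)
```

The point of the sketch is the asymmetry in sequence lengths: 16 action tokens act as queries over 256 image tokens, so each cross-attention layer costs on the order of 16 × 256 score entries per frame.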

Regarding design choices, the model generates 16-action chunks conditioned on a single context frame. This approach improves temporal consistency compared to single-action generation and leverages the initial state of the game to elicit appropriate behavior.
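
A chunked control loop could look like the sketch below. The source does not say how many actions of each chunk are executed before the model is queried again; this example assumes the full chunk is played out. `ToyEnv`, `toy_policy`, and `run_episode` are hypothetical stand-ins, the environment returns a simplified `(frame, done)` pair rather than a full Gymnasium step tuple, and the 20-dimensional action follows the simulator description in the Dataset section.

```python
import numpy as np

CHUNK_LEN = 16
ACTION_DIM = 20  # simulator action vector size from the Dataset section

class ToyEnv:
    """Hypothetical stand-in for a wrapped game: blank frames, fixed episode length."""

    def __init__(self, episode_len: int = 40):
        self.episode_len = episode_len
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros((256, 256, 3), dtype=np.uint8)

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_len
        return np.zeros((256, 256, 3), dtype=np.uint8), done

def toy_policy(frame):
    """Hypothetical policy: one context frame in, one (16, ACTION_DIM) chunk out."""
    return np.zeros((CHUNK_LEN, ACTION_DIM), dtype=np.float32)

def run_episode(env, policy):
    """Query the policy once per chunk and execute the chunk action by action."""
    frame = env.reset()
    policy_calls, steps, done = 0, 0, False
    while not done:
        chunk = policy(frame)          # conditioned on a single frame
        policy_calls += 1
        for action in chunk:
            frame, done = env.step(action)
            steps += 1
            if done:
                break
    return policy_calls, steps

calls, steps = run_episode(ToyEnv(episode_len=40), toy_policy)
print(calls, steps)  # 3 40
```

Under this assumption a 40-step episode needs only three model queries (16 + 16 + 8 steps), which is one practical reason chunked prediction is attractive when each query involves a multi-step denoising pass.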

The model is trained using the standard conditional flow-matching objective. Given a ground-truth action chunk $a \in \mathbb{R}^{16 \times 24}$, an observation $o \in \mathbb{R}^{256 \times 256}$, a flow-matching timestep $t \in [0, 1]$, and Gaussian noise $\epsilon \sim \mathcal{N}(0, \mathcal{I})$, the noisy action is constructed as:

$$a_t = (1 - t) \cdot \epsilon + t \cdot a$$

The conditional velocity field is defined as:

$$\nu^{\mathrm{cond}}(x, t, a, \epsilon, o) = a - \epsilon$$

The model is trained to predict this velocity field by minimizing the conditional flow-matching loss:

$$\mathcal{L}^{CFM}(\theta, \phi) = \mathbb{E}_{t, a, \epsilon} \left[ \left\| \pi_{\theta}(a_t, \psi_{\phi}(o), t) - (a - \epsilon) \right\|^2 \right]$$

where $\pi_{\theta}$ represents the DiT and $\psi_{\phi}$ represents the image encoder. During training, a shifted beta distribution is used to sample $t$, prioritizing small timesteps.
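
The objective can be checked numerically on a single sample. The sketch below is a toy under stated assumptions: observation conditioning is dropped, the expectation is replaced by one fixed draw of $a$, $\epsilon$, and $t$, and `predict_velocity` is a hypothetical callable standing in for the network. It uses the identity $a - a_t = (1 - t)(a - \epsilon)$, which follows directly from the definition of $a_t$.

```python
import numpy as np

def noisy_action(a, eps, t):
    """Interpolate between noise and data: a_t = (1 - t) * eps + t * a."""
    return (1.0 - t) * eps + t * a

def cfm_loss(predict_velocity, a, eps, t):
    """Squared L2 distance between the predicted velocity and the target a - eps."""
    a_t = noisy_action(a, eps, t)
    target = a - eps
    pred = predict_velocity(a_t, t)
    return float(np.sum((pred - target) ** 2))

rng = np.random.default_rng(0)
a = rng.normal(size=(16, 24))    # ground-truth action chunk
eps = rng.normal(size=(16, 24))  # Gaussian noise of the same shape
t = 0.3                          # fixed timestep for the illustration

# Oracle that knows the clean chunk: (a - a_t) / (1 - t) equals a - eps exactly.
oracle = lambda a_t, t: (a - a_t) / (1.0 - t)
# Baseline that always predicts zero velocity.
zero = lambda a_t, t: np.zeros_like(a_t)

print(cfm_loss(oracle, a, eps, t))  # numerically ~0
print(cfm_loss(zero, a, eps, t))    # equals ||a - eps||^2
```

The oracle drives the loss to zero because the regression target does not depend on $t$: along the straight path from $\epsilon$ to $a$, the velocity is the constant $a - \epsilon$.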

At inference time, the model initializes $a_0 \sim \mathcal{N}(0, \mathcal{I})$ and iteratively denoises for $k = 16$ steps using Euler integration:

$$a_{t + 1/k} = a_t + \frac{1}{k} \pi_{\theta}(a_t, \psi_{\phi}(o), t)$$

Training is performed using the AdamW optimizer with a weight decay of 0.001 and a warmup-stable-decay schedule. An exponential moving average (EMA) of model weights is maintained during training with a decay of 0.9999, and all reported results utilize these EMA weights.
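
The Euler update used at inference can be sketched in a few lines. This is an illustration, not the released sampler: `euler_sample` is a hypothetical helper, observation conditioning is omitted, and the trained network is replaced by a toy velocity field that pulls every point toward one known target chunk.

```python
import numpy as np

def euler_sample(velocity, a0, k: int = 16):
    """Integrate da/dt = velocity(a, t) from t = 0 to t = 1 in k Euler steps."""
    a_t = a0.copy()
    for i in range(k):
        t = i / k                          # t takes the values 0, 1/k, ..., (k-1)/k
        a_t = a_t + velocity(a_t, t) / k   # a_{t+1/k} = a_t + (1/k) * v(a_t, t)
    return a_t

rng = np.random.default_rng(0)
target = rng.normal(size=(16, 24))  # toy "clean" action chunk
a0 = rng.normal(size=(16, 24))      # initial sample from N(0, I)

# Toy velocity field for a single-target distribution (assumption, not the model):
# it points from the current sample to the target, scaled by the remaining time.
toy_velocity = lambda a_t, t: (target - a_t) / (1.0 - t)

sample = euler_sample(toy_velocity, a0, k=16)
print(np.allclose(sample, target))  # True
```

The velocity is never evaluated at $t = 1$, so the division by $1 - t$ stays finite. For this toy field the final step, taken at $t = (k-1)/k$, moves the sample exactly onto the target; a learned field would only approximate this.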

Experiment

The evaluation employs a benchmark dataset to validate action extraction accuracy and assesses model performance across diverse games to test generalization capabilities. Results demonstrate that the system achieves robust extraction and adapts well to unseen scenarios, while pre-training on noisy internet-scale data significantly enhances downstream fine-tuning compared to training from scratch. Additionally, comparative tests confirm that synchronous inference does not adversely affect game physics, validating the reliability of the freezing mechanism during prediction.

