12 hours ago

Loïc Magne Anas Awadalla Guanzhi Wang Yinzhen Xu Joshua Belofsky Fengyuan Hu Joohwan Kim Ludwig Schmidt Georgia Gkioxari Jan Kautz

Abstract

We introduce NitroGen, a vision-action foundation model for generalist gaming agents, trained on 40,000 hours of gameplay video across more than 1,000 games. We combine three key ingredients: 1) an internet-scale video-action dataset built by automatically extracting player actions from publicly available gameplay videos; 2) a multi-game benchmark environment that measures cross-game generalization; and 3) a unified vision-action model trained with large-scale behavior cloning. NitroGen shows strong competence across diverse domains, including combat encounters in 3D action games, high-precision control in 2D platformers, and exploration in procedurally generated worlds. It also transfers effectively to previously unseen games, achieving up to 52% relative improvement in task success rates over models trained from scratch. We release the dataset, evaluation suite, and model weights to advance research on generalist embodied agents.

One-sentence Summary

NitroGen is an open vision-action foundation model for generalist gaming agents, trained via large-scale behavior cloning on 40,000 hours of gameplay video from more than 1,000 games; on unseen games it achieves up to 52% relative improvement in task success rates over models trained from scratch, and its dataset, evaluation suite, and model weights are released to advance research on generalist embodied agents.

Key Contributions

An internet-scale video-action dataset is constructed by automatically extracting player actions from publicly available gameplay videos using input overlay software. The resource enables training across hundreds of games without relying on costly data collection or specialized simulators.
NitroGen is a unified vision-action foundation model trained with large-scale behavior cloning on 40,000 hours of gameplay across more than 1,000 games. The method discards language conditioning to focus purely on scalable vision-action mapping for generalist gaming agents.
A multi-game benchmark environment measures cross-game generalization, where the model achieves up to 52% relative improvement in task success rates over models trained from scratch. The dataset, evaluation suite, and model weights are released to advance research on generalist embodied agents.
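The "52% relative improvement" figure is a ratio against the from-scratch baseline, not a gain of 52 percentage points. The short sketch below shows the arithmetic; the two success rates are hypothetical values chosen only to illustrate the formula and are not results from the source.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Gain of `improved` over `baseline`, expressed as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline success rate must be positive")
    return (improved - baseline) / baseline


# Hypothetical success rates (illustration only, not reported numbers).
scratch_rate = 0.25    # model trained from scratch
finetuned_rate = 0.38  # pre-trained model after fine-tuning

gain = relative_improvement(scratch_rate, finetuned_rate)
absolute_points = (finetuned_rate - scratch_rate) * 100
print(f"relative improvement: {gain:.0%}, absolute gain: {absolute_points:.0f} points")
```

With these illustrative inputs, the absolute gain is 13 percentage points while the relative gain is 52%, which is why the two ways of reporting should not be confused.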

Introduction

Building generally capable embodied agents is a major goal in AI, yet progress is hindered by the lack of large, diverse, and labeled action datasets. Existing methods often depend on specialized simulators or hand-crafted APIs that do not scale to arbitrary games, while behavior cloning is constrained by the high cost of collecting human demonstrations. To address this, the authors introduce NitroGen, a vision-action foundation model trained on 40,000 hours of publicly available gameplay videos across over 1,000 titles. They leverage an automated pipeline to extract frame-level actions from input overlays in internet videos, removing the need for costly manual labeling. This unified model demonstrates strong cross-game generalization and improves task success rates by up to 52% compared to models trained from scratch.

Dataset

Dataset Composition and Sources: The authors construct the NitroGen dataset from publicly available gameplay videos that feature input overlay software. These overlays visualize player actions such as gamepad button presses, so action labels can be recovered from internet-scale data without direct access to game inputs.
Key Subset Details: The raw collection contains 71,000 hours of video across 38,739 clips from 818 creators. After filtering, the final dataset includes 40,000 hours spanning more than 1,000 unique games. Action-RPGs represent 34.9% of total hours, followed by Platformers and Action-Adventure titles. The evaluation suite covers 10 games with 30 tasks categorized into combat, navigation, and game-specific mechanics.
Data Usage and Training Strategy: The authors use the data for large-scale behavior cloning pre-training. Segments are filtered to retain only chunks in which at least 50% of timesteps contain non-zero actions, which keeps the model from over-predicting the null action. Evaluation uses a universal simulator that wraps commercial games with a Gymnasium API, standardizing observations to single RGB frames and actions to a 20-dimensional vector.
Processing and Extraction Methods: Action extraction employs a three-stage pipeline that starts with template matching using SIFT and XFeat to locate overlays. A fine-tuned SegFormer model parses controller states from consecutive frames and outputs joystick positions on an 11x11 grid together with binary button states. The on-screen controller is masked during training so that the model cannot exploit it as a shortcut, and 8M synthetic frames are used to train the annotation model.
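The 50% activity filter described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a "null action" is a vector whose entries are all zero (within a small tolerance), and the function names `is_null`, `active_fraction`, and `filter_chunks` are hypothetical.

```python
from typing import List, Sequence

Action = Sequence[float]


def is_null(action: Action, eps: float = 1e-6) -> bool:
    """Assumption: an action is 'null' when every component is (near) zero."""
    return all(abs(v) <= eps for v in action)


def active_fraction(chunk: Sequence[Action]) -> float:
    """Fraction of timesteps in a chunk that carry a non-zero action."""
    if not chunk:
        return 0.0
    active = sum(1 for action in chunk if not is_null(action))
    return active / len(chunk)


def filter_chunks(chunks: Sequence[Sequence[Action]],
                  min_active: float = 0.5) -> List[Sequence[Action]]:
    """Keep chunks where at least `min_active` of the timesteps are non-null."""
    return [c for c in chunks if active_fraction(c) >= min_active]


# Two toy chunks with 4 timesteps and 2 action dimensions each.
busy = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.5], [0.0, 0.0]]  # 2 of 4 steps active
idle = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 0.0]]  # 1 of 4 steps active
kept = filter_chunks([busy, idle])
print(len(kept))
```

Dropping mostly idle chunks changes the label distribution the policy sees, so a behavior-cloned model is less likely to learn "do nothing" as its default output.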

Method

The authors propose NitroGen, a multi-game foundation agent designed to generate future action chunks conditioned on visual observations. The overall system integrates three components: a Universal Simulator, the foundation agent itself, and an Internet-Scale Video-Action Dataset.

To enable training on diverse gameplay, the system relies on a large-scale dataset constructed by extracting controller inputs from online videos. The data preparation pipeline begins with gamepad localization using template matching on input video frames. Once localized, the gamepad is cropped, and specific actions are extracted through joystick segmentation and button classification.
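Once a joystick position has been read off as a cell on the 11x11 grid mentioned in the Dataset section, it still has to be turned into continuous axis values. The sketch below shows one plausible mapping. It assumes the center cell is the neutral position, that each axis is normalized to [-1, 1], and that "up" is positive; none of these conventions is stated in the source, and `cell_to_axes` is a hypothetical helper.

```python
from typing import Tuple

GRID = 11  # grid resolution reported for joystick positions


def cell_to_axes(row: int, col: int, grid: int = GRID) -> Tuple[float, float]:
    """Map a grid cell to (x, y) joystick axes in [-1, 1].

    Assumptions: the center cell is neutral, and image rows grow downward,
    so the row index is flipped to make 'up' a positive y value.
    """
    if not (0 <= row < grid and 0 <= col < grid):
        raise ValueError("cell is outside the grid")
    half = (grid - 1) / 2
    x = (col - half) / half
    y = (half - row) / half
    return x, y


print(cell_to_axes(5, 5))   # neutral stick
print(cell_to_axes(0, 10))  # pushed fully up and to the right
```

An odd grid size guarantees a single center cell, which makes the neutral stick position exactly representable.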

The NitroGen architecture employs flow matching to generate these action sequences. The model adapts a diffusion transformer (DiT) backbone, removing language and state encoders to focus purely on visual conditioning. RGB inputs at $256 \times 256$ resolution are encoded using a SigLIP 2 vision transformer, which produces 256 image tokens per frame. Noisy action chunks are first encoded by an MLP into one action token per timestep. These tokens are processed through several DiT blocks consisting of alternating self-attention and cross-attention layers, where cross-attention layers condition action generation on the encoded frame tokens. Finally, the action tokens are decoded into continuous action vectors using an MLP applied independently across the time dimension.
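The conditioning step can be illustrated with a bare-bones cross-attention in plain Python, where action tokens act as queries and frame tokens act as keys and values. This is only a sketch of the mechanism: it uses a single head and omits the learned projections, normalization, and residual connections that a real DiT block would contain, and the token width of 8 is an arbitrary choice for the example.

```python
import math
import random
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: List[Vector], keys: List[Vector],
                    values: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product attention: each query mixes the values."""
    scale = math.sqrt(len(keys[0]))
    width = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(width)])
    return outputs


rng = random.Random(0)
dim = 8  # arbitrary token width for illustration
action_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(16)]
frame_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(256)]
conditioned = cross_attention(action_tokens, frame_tokens, frame_tokens)
print(len(conditioned), len(conditioned[0]))  # 16 8
```

The output keeps one token per action timestep, so visual information flows into the action sequence without changing its length.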

Regarding design choices, the model generates 16-action chunks conditioned on a single context frame. This approach improves temporal consistency compared to single-action generation and leverages the initial state of the game to elicit appropriate behavior.
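A control loop built around action chunks might look like the sketch below. The source does not say how many actions of each chunk are executed before the next prediction, so this example assumes all 16 are played open-loop before the policy sees a new frame. `ToyEnv` and `toy_policy` are hypothetical stand-ins; the environment only mimics the Gymnasium-style `reset()` and `step()` return signatures and uses the 20-dimensional action vector mentioned for evaluation.

```python
import random
from typing import List, Tuple

CHUNK = 16       # actions generated per prediction
ACTION_DIM = 20  # standardized action vector size used in evaluation


class ToyEnv:
    """Hypothetical stand-in for a game wrapped in a Gymnasium-style API."""

    def __init__(self, horizon: int = 40) -> None:
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return [self.t], {}  # (observation, info); the list stands in for a frame

    def step(self, action: List[float]):
        assert len(action) == ACTION_DIM
        self.t += 1
        terminated = self.t >= self.horizon
        # (observation, reward, terminated, truncated, info)
        return [self.t], 0.0, terminated, False, {}


def toy_policy(frame: List[int]) -> List[List[float]]:
    """Hypothetical policy: returns one chunk of actions for a single frame."""
    rng = random.Random(frame[0])
    return [[rng.random() for _ in range(ACTION_DIM)] for _ in range(CHUNK)]


def run_episode(env: ToyEnv, policy) -> Tuple[int, int]:
    """Play one episode; return (environment steps, policy queries)."""
    obs, _ = env.reset()
    steps = queries = 0
    done = False
    while not done:
        chunk = policy(obs)  # one forward pass yields 16 actions
        queries += 1
        for action in chunk:
            obs, _, terminated, truncated, _ = env.step(action)
            steps += 1
            done = terminated or truncated
            if done:
                break
    return steps, queries


print(run_episode(ToyEnv(horizon=40), toy_policy))  # (40, 3)
```

Under this assumption, a 40-step episode needs only three model calls, which is one practical appeal of predicting chunks instead of single actions.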

The model is trained using the standard conditional flow-matching objective. Given a ground-truth action chunk $a \in \mathbb{R}^{16 \times 24}$, an observation $o \in \mathbb{R}^{256 \times 256}$, a flow-matching timestep $t \in [0, 1]$, and Gaussian noise $\epsilon \sim \mathcal{N}(0, \mathcal{I})$, the noisy action is constructed as:

$a_t = (1 - t) \cdot \epsilon + t \cdot a$

The conditional velocity field, evaluated at the noisy action $a_t$, is defined as:

$\nu^{\mathrm{cond}}(a_t, t, a, \epsilon, o) = a - \epsilon$

The model is trained to predict this velocity field by minimizing the conditional flow-matching loss:

$\mathcal{L}^{CFM}(\theta, \phi) = \mathbb{E}_{t, a, \epsilon} \left[ \| \pi_{\theta}(a_t, \psi_{\phi}(o), t) - (a - \epsilon) \|^2 \right]$

where $\pi_{\theta}$ represents the DiT and $\psi_{\phi}$ represents the image encoder. During training, a shifted beta distribution is used to sample $t$, prioritizing small timesteps.
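The construction of one training example under this objective can be sketched in plain Python. The sketch builds the noisy action and the regression target from the formulas above and evaluates the squared-error term for a placeholder prediction. It does not include the network itself, and it takes $t$ as an argument because the parameters of the shifted beta distribution are not given in the source. The helper names are hypothetical.

```python
import random
from typing import List, Tuple

Chunk = List[List[float]]  # 16 timesteps x 24 action dimensions


def make_training_pair(action: Chunk, t: float,
                       rng: random.Random) -> Tuple[Chunk, Chunk]:
    """Return (a_t, target) with a_t = (1 - t) * eps + t * a and target = a - eps."""
    eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in action]
    a_t = [[(1 - t) * e + t * a for e, a in zip(e_row, a_row)]
           for e_row, a_row in zip(eps, action)]
    target = [[a - e for e, a in zip(e_row, a_row)]
              for e_row, a_row in zip(eps, action)]
    return a_t, target


def squared_error(pred: Chunk, target: Chunk) -> float:
    """Squared L2 distance between predicted and target velocity."""
    return sum((p - q) ** 2
               for p_row, q_row in zip(pred, target)
               for p, q in zip(p_row, q_row))


rng = random.Random(0)
action = [[rng.uniform(-1, 1) for _ in range(24)] for _ in range(16)]
t = 0.3
a_t, target = make_training_pair(action, t, rng)

# A perfect prediction has zero loss; an all-zero prediction does not.
zero_pred = [[0.0] * 24 for _ in range(16)]
print(squared_error(target, target), squared_error(zero_pred, target) > 0)
```

A useful sanity check on the formulas is that stepping from $a_t$ along the target velocity for the remaining time $1 - t$ recovers the clean action exactly: $a_t + (1 - t)(a - \epsilon) = a$.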

At inference time, the model initializes $a_0 \sim \mathcal{N}(0, \mathcal{I})$ and iteratively denoises for $k = 16$ steps using Euler integration:

$a_{t + 1/k} = a_t + \frac{1}{k} \pi_{\theta}(a_t, \psi_{\phi}(o), t)$

Training is performed using the AdamW optimizer with a weight decay of $0.001$ and a warmup-stable-decay schedule. An exponential moving average (EMA) of model weights is maintained during training with a decay of $0.9999$, and all reported results use these EMA weights.
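The Euler update and the EMA rule can both be written in a few lines. In the sketch below, the trained network is replaced by a hypothetical analytic velocity field that points from the current sample toward a fixed target, which makes the behavior easy to check; the real model would instead call $\pi_{\theta}$ conditioned on the encoded frame. Actions are flattened to a single vector for brevity, and the EMA form shown is the common convention, assumed here because the source gives only the decay value.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 a0: Vector, k: int = 16) -> Vector:
    """Integrate from t = 0 to t = 1 in k Euler steps: a <- a + (1/k) * v(a, t)."""
    a = list(a0)
    for i in range(k):
        t = i / k
        v = velocity(a, t)
        a = [x + vi / k for x, vi in zip(a, v)]
    return a


def ema_update(ema: Vector, weights: Vector, decay: float = 0.9999) -> Vector:
    """Assumed EMA convention: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1 - decay) * w for e, w in zip(ema, weights)]


# Hypothetical stand-in for the network: the exact velocity of the straight
# path that ends at a fixed target action.
target = [0.5, -0.25, 1.0]


def toy_velocity(a: Vector, t: float) -> Vector:
    return [(s - x) / (1 - t) for s, x in zip(target, a)]


rng = random.Random(0)
a0 = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
sample = euler_sample(toy_velocity, a0, k=16)
print([round(x, 6) for x in sample])
```

With a decay of 0.9999, each update moves the averaged weights only 0.01% of the way toward the current weights, so the EMA copy changes slowly and smooths out step-to-step noise in training.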

Experiment

The evaluation employs a benchmark dataset to validate action extraction accuracy and assesses model performance across diverse games to test generalization capabilities. Results demonstrate that the system achieves robust extraction and adapts well to unseen scenarios, while pre-training on noisy internet-scale data significantly enhances downstream fine-tuning compared to training from scratch. Additionally, comparative tests confirm that synchronous inference does not adversely affect game physics, validating the reliability of the freezing mechanism during prediction.


NitroGen: An Open Foundation Model for Generalist Gaming Agents
