NitroGen: An Open Foundation Model for Generalist Gaming Agents
Abstract
We introduce NitroGen, a vision-action foundation model for generalist gaming agents, trained on 40,000 hours of gameplay videos across more than 1,000 games. We combine three key ingredients: 1) an internet-scale video-action dataset built by automatically extracting player actions from publicly available gameplay videos; 2) a multi-game benchmark environment that measures cross-game generalization; and 3) a unified vision-action model trained with large-scale behavior cloning. NitroGen shows strong competence across diverse domains, including combat encounters in 3D action games, high-precision control in 2D platformers, and exploration in procedurally generated worlds. It also transfers successfully to previously unseen games, achieving up to 52% relative improvement in task success rates over models trained from scratch. We release the dataset, evaluation suite, and model weights to advance research on generalist embodied agents.
One-sentence Summary
NitroGen is an open vision-action foundation model for generalist gaming agents trained via large-scale behavior cloning on 40,000 hours of gameplay videos across more than 1,000 games that achieves up to 52% relative improvement in task success rates over models trained from scratch in unseen games, with the release of the dataset, evaluation suite, and model weights to advance generalist embodied agent research.
Key Contributions
- An internet-scale video-action dataset is constructed by automatically extracting player actions from publicly available gameplay videos using input overlay software. The resource enables training across hundreds of games without relying on costly data collection or specialized simulators.
- NitroGen is a unified vision-action foundation model trained with large-scale behavior cloning on 40,000 hours of gameplay across more than 1,000 games. The method discards language conditioning to focus purely on scalable vision-action mapping for generalist gaming agents.
- A multi-game benchmark environment measures cross-game generalization, where the model achieves up to 52% relative improvement in task success rates over models trained from scratch. The dataset, evaluation suite, and model weights are released to advance research on generalist embodied agents.
Introduction
Building generally capable embodied agents is a major goal in AI, yet progress is hindered by the lack of large, diverse, and labeled action datasets. Existing methods often depend on specialized simulators or hand-crafted APIs that do not scale to arbitrary games, while behavior cloning is constrained by the high cost of collecting human demonstrations. To address this, the authors introduce NitroGen, a vision-action foundation model trained on 40,000 hours of publicly available gameplay videos across over 1,000 titles. They leverage an automated pipeline to extract frame-level actions from input overlays in internet videos, removing the need for costly manual labeling. This unified model demonstrates strong cross-game generalization and improves task success rates by up to 52% compared to models trained from scratch.
Dataset
- Dataset Composition and Sources: The authors construct the NitroGen dataset from publicly available gameplay videos that feature input overlay software. These overlays visualize player actions such as gamepad button presses, so action labels can be recovered from internet-scale data without direct access to game inputs.
- Key Subset Details: The raw collection contains 71,000 hours of video across 38,739 clips from 818 creators. After filtering, the final dataset includes 40,000 hours spanning more than 1,000 unique games. Action-RPGs represent 34.9% of total hours, followed by Platformers and Action-Adventure titles. The evaluation suite covers 10 games with 30 tasks categorized into combat, navigation, and game-specific mechanics.
- Data Usage and Training Strategy: The authors use the data for large-scale behavior cloning pre-training. Segments are filtered to retain only chunks where at least 50% of timesteps contain non-zero actions to prevent null action over-prediction. Evaluation utilizes a universal simulator that wraps commercial games with a Gymnasium API, standardizing observations to single RGB frames and actions to a 20-dimensional vector.
- Processing and Extraction Methods: Action extraction employs a three-stage pipeline starting with template matching using SIFT and XFeat to locate overlays. A fine-tuned SegFormer model parses controller states from consecutive frames to output joystick positions on an 11x11 grid and binary button states. The on-screen controller is masked during training to prevent model exploitation, and 8M synthetic frames are used to train the annotation model.
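The chunk filter described under Data Usage and Training Strategy can be illustrated with a minimal sketch. The function names `active_fraction` and `keep_chunk` are hypothetical, and the sketch assumes that a "non-zero action" means any non-zero entry in a timestep's action vector; the authors' exact criterion may differ.

```python
import numpy as np

def active_fraction(actions: np.ndarray) -> float:
    """Fraction of timesteps whose action vector has at least one non-zero entry.

    `actions` has shape (timesteps, action_dim). Treating "any non-zero entry"
    as an active timestep is an assumption made for this sketch.
    """
    return float(np.any(actions != 0, axis=-1).mean())

def keep_chunk(actions: np.ndarray, min_active: float = 0.5) -> bool:
    """Keep a chunk only if at least `min_active` of its timesteps are active."""
    return active_fraction(actions) >= min_active

# Toy chunks: 16 timesteps of a 20-dimensional action vector.
idle = np.zeros((16, 20))      # the player never touches the controller
busy = np.zeros((16, 20))
busy[:8, 3] = 1.0              # one button held for half of the chunk

print(keep_chunk(idle), keep_chunk(busy))
```

Dropping mostly idle chunks shifts the training distribution away from the all-zero action, which is the over-prediction problem the filter is meant to address.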
Method
The authors propose NitroGen, a multi-game foundation agent designed to generate future action chunks conditioned on visual observations. The overall system integrates a Universal Simulator, the foundation agent itself, and an Internet-Scale Video-Action Dataset. Refer to the framework diagram.
To enable training on diverse gameplay, the system relies on a large-scale dataset constructed by extracting controller inputs from online videos. As shown in the figure below, the data preparation pipeline begins with gamepad localization using template matching on input video frames. Once localized, the gamepad is cropped, and specific actions are extracted through joystick segmentation and button classification.
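As an illustration of the joystick stage, the sketch below converts an 11x11 grid of per-cell scores (the grid resolution stated in the Dataset section) into a normalized stick position. The function `grid_to_xy`, the convention that the center cell is the neutral position, the [-1, 1] axis range, and the choice that row 0 is the top of the image are all assumptions for illustration, not details confirmed by the source.

```python
import numpy as np

GRID = 11  # grid resolution stated in the Dataset section

def grid_to_xy(cell_scores) -> tuple[float, float]:
    """Map per-cell scores on an 11x11 grid to a stick position in [-1, 1]^2.

    Assumptions of this sketch: the highest-scoring cell marks the stick,
    the center cell (5, 5) is neutral, and row 0 is the top of the overlay.
    """
    scores = np.asarray(cell_scores, dtype=float)
    if scores.shape != (GRID, GRID):
        raise ValueError(f"expected a {GRID}x{GRID} grid, got {scores.shape}")
    row, col = np.unravel_index(np.argmax(scores), scores.shape)
    half = (GRID - 1) / 2
    x = (col - half) / half   # left = -1, right = +1
    y = (half - row) / half   # bottom = -1, top = +1
    return float(x), float(y)

# Toy usage: a stick pushed fully to the upper right.
scores = np.zeros((GRID, GRID))
scores[0, 10] = 1.0
print(grid_to_xy(scores))
```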
The NitroGen architecture employs flow matching to generate these action sequences. The model adapts a diffusion transformer (DiT) backbone, removing language and state encoders to focus purely on visual conditioning. RGB inputs at 256×256 resolution are encoded using a SigLIP 2 vision transformer, which produces 256 image tokens per frame. Noisy action chunks are first encoded by an MLP into one action token per timestep. These tokens are processed through several DiT blocks consisting of alternating self-attention and cross-attention layers, where cross-attention layers condition action generation on the encoded frame tokens. Finally, the action tokens are decoded into continuous action vectors using an MLP applied independently across the time dimension.
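To make the token shapes concrete, here is a deliberately simplified single-head cross-attention step in NumPy. It keeps only the conditioning pattern described above: 16 action tokens (one per timestep of the chunk) query 256 image tokens from one frame. The hidden width `d = 32`, the random weights, and the omission of multi-head attention, residual connections, normalization, and timestep conditioning are simplifications assumed for this sketch and do not reflect the actual model configuration.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(action_tokens, image_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: action tokens attend to image tokens."""
    q = action_tokens @ w_q                              # (T, d) queries from actions
    k = image_tokens @ w_k                               # (N, d) keys from the frame
    v = image_tokens @ w_v                               # (N, d) values from the frame
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (T, N) attention weights
    return weights @ v, weights                          # (T, d) conditioned tokens

rng = np.random.default_rng(0)
T, N, d = 16, 256, 32   # 16 action tokens, 256 image tokens, hypothetical width d
action_tokens = rng.normal(size=(T, d))
image_tokens = rng.normal(size=(N, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

out, weights = cross_attention(action_tokens, image_tokens, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Each of the 16 action tokens ends up as a weighted mixture of the frame's 256 token values, which is how visual information reaches the action stream in this kind of block.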
Regarding design choices, the model generates 16-action chunks conditioned on a single context frame. This approach improves temporal consistency compared to single-action generation and leverages the initial state of the game to elicit appropriate behavior.
The model is trained using the standard conditional flow-matching objective. Given a ground-truth action chunk $a \in \mathbb{R}^{16 \times 24}$, an observation $o \in \mathbb{R}^{256 \times 256}$, a flow-matching timestep $t \in [0, 1]$, and Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$, the noisy action is constructed as:

$$a_t = (1 - t)\,\epsilon + t\,a$$

The conditional velocity field is defined as:

$$\nu_{\text{cond}}(x, t, a, \epsilon, o) = a - \epsilon$$

The model is trained to predict this velocity field by minimizing the conditional flow-matching loss:

$$\mathcal{L}_{\text{CFM}}(\theta, \phi) = \mathbb{E}_{t, a, \epsilon}\left[\left\lVert \pi_\theta(a_t, \psi_\phi(o), t) - (a - \epsilon) \right\rVert^2\right]$$

where $\pi_\theta$ represents the DiT and $\psi_\phi$ represents the image encoder. During training, a shifted beta distribution is used to sample $t$, prioritizing small timesteps.
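The objective can be sketched for a single sample in NumPy. Here `predict_velocity` is a hypothetical stand-in for the DiT together with the image encoder, and `obs_features` stands in for the encoded observation. The timestep is drawn uniformly because the parameters of the shifted beta distribution are not given in this summary; that substitution is an assumption of the sketch.

```python
import numpy as np

def noisy_action(a: np.ndarray, eps: np.ndarray, t: float) -> np.ndarray:
    """Interpolate between noise (t = 0) and the ground-truth chunk (t = 1)."""
    return (1.0 - t) * eps + t * a

def cfm_loss(predict_velocity, a, eps, t, obs_features) -> float:
    """Squared error between the predicted velocity and the target a - eps."""
    a_t = noisy_action(a, eps, t)
    target = a - eps
    pred = predict_velocity(a_t, obs_features, t)
    return float(np.sum((pred - target) ** 2))

rng = np.random.default_rng(0)
a = rng.uniform(-1.0, 1.0, size=(16, 24))   # toy ground-truth action chunk
eps = rng.normal(size=a.shape)              # Gaussian noise
t = float(rng.uniform(0.0, 1.0))            # stand-in for the shifted beta sampler

# Two toy predictors: an oracle that already knows the target, and a zero model.
oracle = lambda a_t, obs, t: a - eps
zero_model = lambda a_t, obs, t: np.zeros_like(a_t)

print(cfm_loss(oracle, a, eps, t, None), cfm_loss(zero_model, a, eps, t, None))
```

In actual training the expectation is approximated by averaging this quantity over a mini-batch of sampled timesteps, action chunks, and noise draws.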
At inference time, the model initializes $a_0 \sim \mathcal{N}(0, I)$ and iteratively denoises for $k = 16$ steps using Euler integration:

$$a_{t + 1/k} = a_t + \frac{1}{k}\,\pi_\theta(a_t, \psi_\phi(o), t)$$

Training is performed using the AdamW optimizer with a weight decay of 0.001 and a warmup-stable-decay schedule. An exponential moving average (EMA) of model weights is maintained during training with a decay of 0.9999, and all reported results utilize these EMA weights.
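The sampling loop can be sketched as follows, again with a hypothetical `predict_velocity` callable in place of the trained model. The toy velocity field used in the example is not a learned network: it is the exact field for a data distribution concentrated on a single chunk `target`, which follows from the interpolation above because $a - \epsilon = (a - a_t)/(1 - t)$ along the path.

```python
import numpy as np

def euler_sample(predict_velocity, obs_features, shape, k: int = 16, rng=None):
    """Integrate the velocity field from t = 0 to t = 1 in k Euler steps."""
    rng = np.random.default_rng() if rng is None else rng
    a = rng.normal(size=shape)                  # a_0 ~ N(0, I)
    for i in range(k):
        t = i / k                               # current flow-matching time
        a = a + predict_velocity(a, obs_features, t) / k
    return a

# Toy velocity field: all probability mass sits on one fixed action chunk.
target = np.full((16, 24), 0.25)

def toy_velocity(a_t, obs_features, t):
    return (target - a_t) / (1.0 - t)           # t <= 15/16, so no division by zero

sample = euler_sample(toy_velocity, None, target.shape, rng=np.random.default_rng(0))
print(np.abs(sample - target).max())
```

With this idealized field the final step lands on `target` up to floating-point error; a learned model only approximates the field, so its samples depend on both the model quality and the number of steps.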
Experiment
The evaluation employs a benchmark dataset to validate action extraction accuracy and assesses model performance across diverse games to test generalization capabilities. Results demonstrate that the system achieves robust extraction and adapts well to unseen scenarios, while pre-training on noisy internet-scale data significantly enhances downstream fine-tuning compared to training from scratch. Additionally, comparative tests confirm that synchronous inference does not adversely affect game physics, validating the reliability of the freezing mechanism during prediction.