Agent-Based Search for Neural Network Architectures: AIRA-Compose and AIRA-Design
Alberto Pepe Chien-Yu Lin Despoina Magka Bilge Acun Yannan Nellie Wu Anton Protopopov Carole-Jean Wu Yoram Bachrach
Abstract
As a step toward recursive self-improvement, we investigate the ability of large language model (LLM) agents to autonomously design foundation models beyond the standard Transformer paradigm. We present a dual-framework approach: AIRA-Compose for high-level architecture search and AIRA-Design for low-level mechanistic implementation. AIRA-Compose deploys an ensemble of 11 agents to explore the combinatorial design space of computational primitives such as Attention, MLP, and Mamba under a fixed 24-hour compute budget. The agents operate in two stages: they iteratively design and evaluate candidate models at the scale of millions of parameters, then extrapolate the top-performing designs to 350M, 1B, and 3B parameters. This search yields 14 new architectures in two families: the Transformer-based AIRAformers and the Transformer-Mamba-based AIRAhybrids.
When pre-trained at the 1B scale under a fixed token budget, the top agent-discovered architectures consistently outperform Llama 3.2 and the alternatives found by Composer. On downstream tasks, AIRAformer-D and AIRAhybrid-D improve accuracy over Llama 3.2 by 2.4% and 3.8%, respectively. AIRA-Compose also discovers architectures that reach steeper, more efficient scaling frontiers in the compute-optimal regime: AIRAformer-C scales 54% and 71% faster than Llama 3.2 and the best Composer-found Transformer, respectively, while AIRAhybrid-C scales 23% and 37% faster than a modified Nemotron-2 and the best Composer-found hybrid, respectively.
AIRA-Design tasks up to 20 agents with directly writing novel attention mechanisms for handling long-range dependencies and with implementing high-performing training scripts. On the Long Range Arena (LRA) benchmark, the best agent-designed architectures come within 2.3% of the human-designed state of the art on document matching and within 2.6% on text classification. On the Autoresearch benchmark, Greedy Opus 4.5 optimizes training under a fixed time budget and records a validation error of 0.968 bits-per-byte, better than the lowest previously published reference value.
AIRA-Compose and AIRA-Design demonstrate that AI research agents can autonomously discover hybrid architectures and algorithmic optimizations that rival or surpass hand-designed baselines. This establishes a flexible and powerful paradigm for discovering next-generation foundation models and marks an important step toward recursive self-improvement.
One-sentence Summary
This work introduces AIRA-Compose and AIRA-Design, a dual framework in which LLM agents autonomously design foundation models beyond standard Transformers: AIRA-Compose deploys 11 agents to generate AIRAformer and AIRAhybrid architectures that surpass Llama 3.2 by 2.4% and 3.8% in downstream accuracy and scale as much as 54% faster than Llama 3.2, while AIRA-Design employs up to 20 agents to implement novel attention mechanisms and training scripts that come within 2.6% of the human state of the art on the Long Range Arena benchmark and reach 0.968 validation bits-per-byte on the Autoresearch benchmark, establishing a flexible paradigm on the path toward recursive self-improvement.
Key Contributions
- This work introduces a dual-framework approach comprising AIRA-Compose for high-level architecture search and AIRA-Design for low-level mechanistic implementation. The system deploys agent ensembles to navigate combinatorial design spaces of computational primitives under fixed compute budgets.
- Agent-discovered architectures consistently outperform Llama 3.2 and Composer-found alternatives when pre-trained at the 1B scale under a fixed token budget. Specific models improve downstream accuracy by up to 3.8% and scale significantly faster than standard baselines.
- The AIRA-Design component tasks agents with writing novel attention mechanisms and implementing high-performing training scripts for evaluation. On the Long Range Arena benchmark, the agent-designed systems achieve near state-of-the-art accuracy; on the Autoresearch benchmark, they surpass the published minimum reference bits per byte.
Introduction
Current foundation models predominantly rely on Transformer architectures, yet their quadratic complexity creates bottlenecks for long-context processing and inference efficiency. While the community is shifting toward hybrid models that combine diverse computational primitives, manual exploration cannot effectively navigate the vast combinatorial space and traditional search methods remain computationally prohibitive. The authors leverage a dual-framework approach comprising AIRA-Compose and AIRA-Design to enable LLM agents to autonomously discover and implement these next-generation architectures. Their system successfully identifies novel hybrid designs that outperform human-engineered baselines like Llama 3.2 while establishing more efficient compute-optimal scaling frontiers.
Dataset
Dataset Composition and Sources
- The authors introduce 12 recursive self-improvement (RSI) tasks divided into AIRA-Compose and AIRA-Design categories, building on the AIRS-Bench framework, where each task is defined by a (problem, dataset, metric) triplet.
- Two primary benchmarks drive the evaluation: Long Range Arena (LRA) for low-level mechanistic design and Autoresearch for optimizing training scripts.
Key Details for Each Subset
- LRA tasks employ three text-based datasets: IMDB for sentiment classification, ListOps for hierarchical math expressions, and the ACL Anthology Network for document retrieval.
- Autoresearch utilizes pre-tokenized web text from the ClimbMix corpus paired with a pre-trained BPE tokenizer containing approximately 8192 vocabulary items.
- A literature-enhanced version of Autoresearch includes structured summaries from 41 research papers and 14 reference code repositories organized by topic.
Model Usage and Training Splits
- Agents access training and validation splits during the search phase to evaluate hypotheses using greedy or one-shot scaffolds.
- Final performance is assessed on a held-out test split that remains inaccessible during the search process to ensure fairness.
- The Autoresearch metric is validation bits per byte calculated within a fixed 5-minute wall-clock training budget on a single GPU.
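Bits per byte normalizes a language-modeling loss by the number of raw text bytes rather than by tokens, which keeps scores comparable across tokenizers. The sketch below shows the usual conversion, assuming the loss is a summed cross-entropy in nats; the function name and the example numbers are illustrative assumptions and do not come from the Autoresearch scoring script.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    Hypothetical helper: the benchmark's own scoring code is not shown
    in the source, so this only illustrates the standard conversion.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits; dividing by the byte count
    # normalizes by the amount of raw text that was evaluated.
    return total_nll_nats / (math.log(2) * total_bytes)


# Illustrative numbers only (not taken from the paper).
mean_loss_nats_per_token = 2.8
num_tokens = 1_000
num_bytes = 4_200
bpb = bits_per_byte(mean_loss_nats_per_token * num_tokens, num_bytes)
print(f"validation bits per byte: {bpb:.3f}")
```

Lower is better: a model that spends exactly one bit on every byte of validation text scores 1.0.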
Processing and Metadata Construction
- Standardized task directories include preparation scripts for data sanitization and isolated scoring scripts that encapsulate full training pipelines.
- Metadata files define task constraints and evaluation metrics while ensuring test labels are hidden during solution construction.
- The literature variant organizes resources into Architecture Improvements, Training Strategies, and Optimizers within a dedicated pwc/ folder.
Method
The AIRA-Compose pipeline recasts the Composer framework into equivalent AIRS-Bench tasks to automate the discovery of hybrid foundation models. The process follows a four-step methodology: Search, Evaluation, Aggregation, and Extrapolation. Rather than relying on rigid Bayesian Optimization, the system employs agents to freely formulate structural hypotheses and propose novel primitive arrangements.
Refer to the framework diagram for the overall workflow where data, computational primitives, and codebases feed into the AIRS-Bench Task.
The search engine is driven by Large Language Models (LLMs) acting as agents within a harness that includes a scaffold for one-shot or greedy execution. Agents are tasked with assembling 16-layer small-scale architectures from predefined computational primitives such as MLPs (M), multi-head Attention (mA), and Mamba SSM (Mb). Because each of the 16 layers can hold any one primitive, the search space for two-primitive configurations spans $2^{16} = 65{,}536$ possible arrangements, while three-primitive spaces expand to $3^{16} \approx 43$ million combinations.
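The search-space sizes follow from counting one primitive choice per layer: k primitives over 16 layers give k^16 arrangements, so 2^16 = 65,536 and 3^16 = 43,046,721. The sketch below checks those counts and enumerates a tiny 4-layer space. The tuple encoding of an architecture and the assumption that the two-primitive space pairs attention with MLP are illustrative choices, not details confirmed by the source.

```python
from itertools import product

NUM_LAYERS = 16
# Primitive labels follow the abbreviations in the text.
TWO_PRIMITIVES = ("mA", "M")            # assumed pairing for the 2-primitive space
THREE_PRIMITIVES = ("mA", "M", "Mb")


def search_space_size(num_primitives: int, num_layers: int = NUM_LAYERS) -> int:
    """Number of ways to assign one primitive to each layer."""
    return num_primitives ** num_layers


def enumerate_architectures(primitives, num_layers):
    """Lazily yield every layer-by-layer arrangement as a tuple of labels."""
    return product(primitives, repeat=num_layers)


two_primitive_size = search_space_size(len(TWO_PRIMITIVES))      # 65,536
three_primitive_size = search_space_size(len(THREE_PRIMITIVES))  # 43,046,721
print(two_primitive_size, three_primitive_size)

# Exhaustive enumeration is only practical for very shallow toy spaces.
tiny_space = list(enumerate_architectures(TWO_PRIMITIVES, 4))
print(len(tiny_space), tiny_space[0])
```

The 656-fold jump from two to three primitives is one reason exhaustive evaluation is impractical and a guided search is used instead.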
The authors leverage a greedy tree search approach where agents iteratively explore the architecture space. As shown in the figure below, the agent drafts initial solutions and refines them based on validation scores.
At each node of the search tree, the agent articulates design choices, produces a candidate architecture file, and writes an evaluation script. The submitted architecture is trained from scratch on proxy datasets including MAD, BabiStories, and DCLM. The node with the highest validation score is selected for further exploration via improve operations, allowing the agent to leverage domain knowledge to navigate the combinatorial space meaningfully. Red arrows in the diagram indicate debug operations, while blue arrows denote improve operations that propose new architectures informed by the parent's reasoning and score.
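The greedy loop described above can be sketched as follows: draft a few candidates, then repeatedly pick the highest-scoring node and apply an improve operation to it. This is a toy illustration under heavy assumptions. The LLM agent is replaced by a random single-layer mutation, the from-scratch proxy training is replaced by a made-up scoring function, and debug operations are omitted. All names (`Node`, `greedy_search`, `toy_score`) are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Arch = Tuple[str, ...]
PRIMITIVES = ("mA", "M", "Mb")


@dataclass
class Node:
    arch: Arch
    score: float
    parent: Optional["Node"] = None


def greedy_search(
    draft: Callable[[], Arch],
    improve: Callable[[Arch], Arch],
    evaluate: Callable[[Arch], float],
    num_drafts: int,
    num_steps: int,
) -> Tuple[Node, List[Node]]:
    """Greedy tree search: always expand the best-scoring node found so far."""
    if num_drafts < 1:
        raise ValueError("need at least one draft")
    nodes = [Node(arch, evaluate(arch)) for arch in (draft() for _ in range(num_drafts))]
    for _ in range(num_steps):
        best = max(nodes, key=lambda n: n.score)
        child_arch = improve(best.arch)  # stand-in for the agent's improve operation
        nodes.append(Node(child_arch, evaluate(child_arch), parent=best))
    return max(nodes, key=lambda n: n.score), nodes


rng = random.Random(0)


def draft() -> Arch:
    """Random initial 16-layer architecture (stand-in for an agent draft)."""
    return tuple(rng.choice(PRIMITIVES) for _ in range(16))


def improve(arch: Arch) -> Arch:
    """Mutate one layer at random (stand-in for an agent's reasoned proposal)."""
    layers = list(arch)
    layers[rng.randrange(len(layers))] = rng.choice(PRIMITIVES)
    return tuple(layers)


def toy_score(arch: Arch) -> float:
    """Made-up objective rewarding alternation; NOT a real validation score."""
    return sum(a != b for a, b in zip(arch, arch[1:])) / (len(arch) - 1)


best, nodes = greedy_search(draft, improve, toy_score, num_drafts=3, num_steps=20)
print(best.score, best.arch)
```

In the pipeline described in the text, the improve step is informed by the parent's recorded reasoning and score rather than by random mutation, which is what lets the agent bring domain knowledge to bear.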
Once the agentic exploration concludes, the pipeline moves to aggregation and extrapolation. The Aggregator collects submitted architectures and their test scores across all agents. It employs layer-wise clustering techniques, such as k-means, to select the most frequent computational primitives within clusters. This process smooths out noise and overfitting from proxy training to obtain a robust small-scale architecture. Different aggregation strategies are applied, including N0, N1, and N2 aggregation, which weight architectures based on rank or cluster membership.
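A simplified picture of the aggregation step is a layer-wise vote: for each layer position, keep the primitive that appears most often among the collected architectures. The sketch below implements only that vote, with optional per-architecture weights standing in for rank-based weighting. It omits the k-means clustering mentioned above, and it is not the N0, N1, or N2 strategy, whose exact definitions are not given here. The function name is hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

Arch = Tuple[str, ...]


def aggregate_layerwise(
    archs: Sequence[Arch],
    weights: Optional[Sequence[float]] = None,
) -> Arch:
    """Pick the (optionally weighted) most frequent primitive at each layer."""
    if not archs:
        raise ValueError("need at least one architecture")
    depth = len(archs[0])
    if any(len(a) != depth for a in archs):
        raise ValueError("all architectures must have the same depth")
    if weights is None:
        weights = [1.0] * len(archs)
    if len(weights) != len(archs):
        raise ValueError("one weight per architecture is required")

    result = []
    for layer in range(depth):
        votes = Counter()
        for arch, weight in zip(archs, weights):
            votes[arch[layer]] += weight
        # Ties go to the primitive seen first, i.e. the earliest-listed architecture.
        result.append(max(votes, key=votes.get))
    return tuple(result)


# Illustrative inputs only (not architectures from the paper).
candidates = [
    ("mA", "M", "Mb", "M"),
    ("mA", "M", "mA", "M"),
    ("Mb", "M", "Mb", "M"),
]
unweighted = aggregate_layerwise(candidates)
weighted = aggregate_layerwise(candidates, weights=[1.0, 1.0, 3.0])
print(unweighted)
print(weighted)
```

Voting per layer is what smooths out idiosyncratic choices made by a single agent: a primitive survives only if several submissions agree on it at that position.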
Finally, the Extrapolator scales the aggregated small-scale architecture to target parameter counts of 350M, 1B, or 3B. This scaling is achieved through stretching, which proportionally expands contiguous blocks, or stacking, which repeats the entire discovered architecture sequentially. At small scale, all primitives share a model dimension d=128, while large-scale configurations adjust the model dimension, number of attention heads, and hidden dimensions according to IsoFLOP methodologies. The resulting architectures utilize SwiGLU variants for MLPs and grouped-query attention for attention blocks to ensure efficiency at scale.
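The difference between the two scaling operations is easiest to see on a short layer sequence. The sketch below assumes an integer depth multiplier and ignores the accompanying changes to model dimension, head count, and hidden size; the function names and the 4-layer base pattern are hypothetical.

```python
from collections import Counter
from itertools import groupby
from typing import Tuple

Arch = Tuple[str, ...]


def stack(arch: Arch, factor: int) -> Arch:
    """Repeat the entire layer sequence `factor` times, end to end."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return tuple(arch) * factor


def stretch(arch: Arch, factor: int) -> Arch:
    """Lengthen each contiguous run of identical primitives by `factor`."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    out = []
    for primitive, run in groupby(arch):
        out.extend([primitive] * (len(list(run)) * factor))
    return tuple(out)


base = ("mA", "mA", "M", "Mb")  # illustrative pattern, not from the paper
stacked = stack(base, 2)
stretched = stretch(base, 2)
print(stacked)
print(stretched)
print(Counter(stacked) == Counter(stretched))
```

Both variants contain the same number of each primitive and differ only in ordering: stacking preserves the local pattern and repeats it, while stretching preserves the coarse ordering of blocks and makes each block deeper.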
Experiment
This research evaluates AI agents on architecture search and training design tasks using one-shot and greedy scaffolds across benchmarks including MAD, Long Range Arena, and Autoresearch. Experiments in AIRA-Compose demonstrate that agents can discover novel neural architectures that outperform established baselines in validation loss and downstream performance, with balanced designs showing better compute efficiency. In AIRA-Design tasks, greedy agents achieved peak accuracy near leading human performance levels on mechanistic challenges and optimized training loops to surpass reference baselines, though they primarily recombined existing techniques rather than generating fundamental scientific innovations. Overall, the results indicate that agent-driven search is a viable approach for generating competitive foundation model components, highlighting strengths in engineering synthesis while identifying limitations in genuine algorithmic discovery.
The table presents the number of training steps achievable for various hybrid architectures across three parameter scales under five distinct FLOP budgets. As model scale increases, the number of feasible training steps under a fixed compute budget decreases. Architectural composition also significantly affects compute efficiency: designs with more attention layers allow fewer training steps than Mamba-heavy or balanced alternatives.
- Increasing model scale reduces the total number of training steps available under a fixed FLOP budget.
- Architectures with a higher proportion of attention layers allow fewer training iterations than those dominated by Mamba or MLP layers.
- Stacked and Stretched variants of the same base architecture configuration yield identical training step counts across all tested budgets.
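The inverse relation between model size and step count can be illustrated with the common rule of thumb that training costs roughly 6 FLOPs per parameter per token. This is an assumption brought in for illustration. The source does not state how FLOPs were counted, and a per-parameter estimate cannot reproduce the reported differences between attention-heavy and Mamba-heavy designs. The budget and batch size below are likewise made up.

```python
def training_steps(flop_budget: float, num_params: float, tokens_per_step: int) -> int:
    """Steps affordable under a FLOP budget, using the ~6*N*D approximation."""
    if min(flop_budget, num_params, tokens_per_step) <= 0:
        raise ValueError("all arguments must be positive")
    flops_per_step = 6 * num_params * tokens_per_step
    return int(flop_budget // flops_per_step)


BUDGET = 1e20              # illustrative FLOP budget (assumption)
TOKENS_PER_STEP = 2 ** 19  # illustrative batch size in tokens (assumption)

steps_by_scale = {
    name: training_steps(BUDGET, params, TOKENS_PER_STEP)
    for name, params in [("350M", 350e6), ("1B", 1e9), ("3B", 3e9)]
}
for name, steps in steps_by_scale.items():
    print(f"{name}: {steps} steps")
```

Under this approximation the step count falls in proportion to parameter count, matching the trend across scales described above.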
The authors evaluate 3-primitive hybrid architectures at a 1B parameter scale against established baselines including Mamba and Composer. The agent-discovered AIRAhybrid-D (Stretched) variant demonstrates the strongest overall performance, achieving the lowest validation loss and highest average accuracy across downstream tasks. While the Composer baseline secures the highest DCLM Core Score, the AIRAhybrid models generally outperform the Mamba and Nemotron baselines across linguistic and reasoning benchmarks.
- AIRAhybrid-D (Stretched) achieves the lowest validation loss and highest average 0-shot accuracy among all tested architectures.
- The Composer baseline secures the highest DCLM Core Score, outperforming the agent-discovered variants on this specific metric.
- AIRAhybrid models generally maintain superior average accuracy compared to baselines like Mamba and approximated Nemotron-2.
The authors evaluate LLM agents on a 2-primitive architecture search task using the MAD, BabiStories, and DCLM datasets. The results compare One-Shot generation against Greedy search, distinguishing between the final submitted solution and the best solution found during exploration. Across all datasets, Greedy search consistently outperforms One-Shot generation, and the best-found solutions generally surpass the submitted solutions in terms of accuracy or loss.
- Greedy search scaffolds consistently yield better performance than One-Shot generation across all three datasets.
- The best solutions discovered during the search process typically outperform the final solutions submitted by the agents.
- The CWM agent achieved the highest accuracy on the MAD dataset and lowest loss on BabiStories and DCLM among the evaluated models.
The table details the iterative optimization steps taken by different agent variants on the Autoresearch task, tracking improvements in validation loss relative to a baseline. The authors show that agents successfully reduce loss through a sequence of architectural changes and hyperparameter adjustments, with performance varying based on model capability and access to literature.
- The Opus 4.6 variant with literature access achieves the best final validation performance after multiple iterative steps of tuning depth, batch size, and learning rates.
- The literature-enhanced Opus 4.5 agent achieves the most significant single-step gain by introducing focal loss to replace the standard cross-entropy objective.
- Optimization strategies diverge between model families, with Opus 4.5 focusing on architectural widening and value embedding sparsification while Opus 4.6 emphasizes depth changes and increased optimizer steps.
The table categorizes the architectural and hyperparameter modifications that led to performance gains during the Autoresearch task. It indicates that while learning rate and depth adjustments were the most common sources of improvement, changes to attention patterns yielded the highest median gains per step.
- Learning rate modifications were the most frequent driver of improvement, occurring in nearly half of the successful optimization steps.
- Adjustments to attention patterns produced the highest median performance gain, outperforming other categories in typical step-by-step progress.
- Changes to model depth, width, and MLP activations achieved the largest single-step performance jumps, sharing the highest maximum improvement value.
The experiments evaluate hybrid architectures and LLM agents across various parameter scales, FLOP budgets, and search tasks to assess compute efficiency and model performance. Results indicate that while larger models can afford fewer training steps under a fixed FLOP budget, agent-discovered hybrid variants generally outperform established baselines like Mamba and Composer across downstream tasks. Furthermore, greedy search by agents yields better solutions than one-shot generation, and in the Autoresearch task attention pattern modifications provide the highest median performance gain per optimization step.