il y a un mois

Table des matières

Résumé

Dans ce rapport, nous présentons la série IQuest-Coder-V1 (7B/14B/40B/40B-Loop), une nouvelle famille de grands modèles linguistiques dédiés au code (LLM pour Large Language Models). En dépassant les représentations statiques du code, nous proposons un paradigme d’entraînement multicouche basé sur le flux de code, qui capture l’évolution dynamique de la logique logicielle à travers les différentes phases du pipeline. Nos modèles sont développés selon une chaîne évolutionnaire, commençant par un pré-entraînement initial fondé sur des faits de code, des dépôts (repositories) et des données de complétion. Par la suite, nous mettons en œuvre une phase intermédiaire spécialisée, intégrant des trajectoires de raisonnement et d’agenticité dans un contexte de 32 000 tokens, ainsi que des trajectoires à l’échelle du dépôt dans un contexte de 128 000 tokens, afin de consolider des fondations logiques profondes. Les modèles sont ensuite finalisés par une phase de post-entraînement visant à renforcer des compétences spécifiques en programmation, divisée en deux voies spécialisées : la voie de réflexion (utilisant un apprentissage par renforcement piloté par le raisonnement) et la voie d’instruction (optimisée pour une assistance générale). L’IQuest-Coder-V1 atteint des performances de pointe parmi les modèles concurrents sur des dimensions critiques de l’intelligence du code : ingénierie logicielle agente, programmation compétitive et utilisation complexe d’outils. Pour répondre aux contraintes de déploiement, la variante IQuest-Coder-V1-Loop introduit un mécanisme récurrent conçu pour optimiser le compromis entre capacité du modèle et empreinte de déploiement, offrant ainsi une voie architecturale améliorée pour un équilibre efficace entre efficacité et efficience. Nous pensons que la mise à disposition de la série IQuest-Coder-V1, incluant la chaîne complète et transparente (« white-box ») des points de contrôle allant des bases de pré-entraînement aux modèles finaux de réflexion et d’instruction, accélérera la recherche en intelligence autonome du code et en systèmes agents applicables dans le monde réel.

One-sentence Summary

The IQuest-Coder-V1 team introduces the 7B/14B/40B/40B-Loop code LLM family, trained via a code-flow multi-stage paradigm that evolves software logic through pre-training on code facts and repositories, mid-training with reasoning and agentic trajectories in 32k and 128k contexts, and dual-path post-training via reasoning-driven RL and instruction optimization, achieving state-of-the-art performance in agentic software engineering, competitive programming, and complex tool use.

Key Contributions

We introduce the code-flow multi-stage training paradigm, which evolves software logic through pre-training on code facts and repositories, mid-training with reasoning and agentic trajectories in 32k–128k-context, and bifurcated post-training via reasoning-driven RL and instruction-tuning for thinking and instruct paths.
IQuest-Coder-V1 achieves state-of-the-art performance on agentic software engineering, competitive programming, and complex tool use benchmarks, including LiveCodeBench v6 (via the 40B-Loop-Thinking model) and other tasks evaluated on the 40B-Loop-Instruct variant.
The IQuest-Coder-V1-Loop variant incorporates a recurrent mechanism to optimize the capacity-efficiency trade-off for deployment, and the full open-sourced training pipeline—from pre-training bases to final models—enables reproducibility and advances autonomous code intelligence research.

Introduction

The authors leverage a novel code-flow multi-stage training paradigm to build IQuest-Coder-V1, a family of code LLMs designed to model the dynamic evolution of software logic across development phases. Prior models often treat code as static, limiting their ability to handle complex, real-world engineering tasks that require reasoning over repository-scale context and evolving logic. To overcome this, the authors introduce an evolutionary pipeline with specialized mid-training using 32k and 128k context windows, followed by bifurcated post-training paths—reasoning-driven RL for deep analysis and instruction-tuning for general assistance. Their main contribution includes state-of-the-art performance across agentic software engineering, competitive programming, and tool use, plus the IQuest-Coder-V1-Loop variant, which uses recurrent mechanisms to optimize deployment efficiency without sacrificing capability. They also open-source the full training chain to accelerate research in autonomous code systems.

Dataset

Dataset overview

The authors use a multi-stage, highly curated dataset to train IQuest-Coder, combining web-scale corpora with domain-specific code and reasoning data. Here’s how the dataset is structured and used:

General Corpus (Stage 1):
- Sourced primarily from Common Crawl, cleaned via regex filters and hierarchical deduplication (exact + fuzzy matching using embedding models).
- Decontaminated to exclude overlaps with standard benchmarks.
- Code snippets undergo AST analysis to verify syntactic validity, supporting code-flow training.
- Domain-specific proxy classifiers (small models trained to emulate larger ones) filter data by information density, educational value, and toxicity — outperforming FastText baselines.
- CodeSimpleQA-Instruct (66M samples) is integrated to inject factual, objective Q&A pairs generated via LLMs under constraints for time-invariance and single correct answers.
Repository Evolution Data:
- Built as triplets: (R_old, P, R_new), where R_old and R_new are code states from the 40%-80% lifecycle percentile (mature phase), and P is the patch between them.
- Ensures temporal continuity and meaningful code changes, avoiding early instability or late fragmentation.
Code Completion Data (FIM Format):
- Constructed at file and repo levels using Fill-In-the-Middle (FIM) format: ||.
- File-level: single document split via random character or line boundaries.
- Repo-level: adds semantically similar snippets from the same repo as context.
- Syntax-based construction uses ASTs to extract expression-, statement-, and function-level segments, preserving structural integrity.
- Task formats include: <fim_prefix>{pre}|{suf}|{mid}|{im_end}|> (file) and <repo_name>{repo} <|file_sep|>{path1} {content1} <|file_sep|>{path2} {content2} <|file_SEP|>{path3} ... {pre}|{suf} <fim_middle|>{mid}|{im_end}|> (repo).
Mid-Training (Stage 2):
- Uses same core data categories: Reasoning QA (math, coding, logic), Agent trajectories, code commits, and FIM data.
- Reasoning QA teaches structured decomposition and consistency; Agent trajectories simulate closed-loop intelligence via action-observation-revision cycles with feedback (logs, errors, tests).
- Stage 2.1 trains at 32K context; Stage 2.2 extends to 128K to enable repository-scale reasoning.
Evaluation Benchmarks:
- CrossCodeEval (Python, Java, TypeScript, C#) tests cross-file completion.
- Aider’s Polyglot benchmark (C++, Go, Java, JavaScript, Python, Rust) evaluates multilingual editing on 225 hardest Exercism problems.

All data undergoes strict quality filtering and structural validation to support scalable, high-fidelity code modeling.

Method

The authors leverage a structured, multi-phase training pipeline—termed the Code-Flow pipeline—to systematically evolve logical and agentic capabilities in the IQuest-Coder-V1 series. The framework is organized into three primary stages: pre-training and annealing, mid-training, and bifurcated post-training, each designed to progressively specialize the model while preserving general intelligence. The process begins with pre-training and annealing, where the model is first exposed to a mixture of general and code-specific data in Stage 1, followed by an annealing phase focused exclusively on high-quality, curated code corpora. This primes the model’s representations for complex reasoning tasks by reinforcing syntactic and semantic patterns inherent in real-world codebases. Refer to the framework diagram for the flow of this initial phase. Code-Flow Training Pipeline of IQuest-Coder-V1 Mid-training introduces two sequential phases that progressively scale context length and task complexity. In Mid-train Phase 1, the model is trained on 32k-context data encompassing reasoning, agentic, and code tasks, yielding intermediate checkpoints such as IQuest-Coder-Stage1 and LoopCoder-Base-Stage1. This phase is critical for establishing a scaffold for long-horizon reasoning. Mid-train Phase 2 extends context to 128k while maintaining similar data distributions, producing IQuest-Coder-Stage2 and LoopCoder-Base-Stage2, which serve as the foundation for post-training. The authors emphasize that injecting 32k reasoning trajectories after annealing—but before post-training—stabilizes performance under distribution shifts, a finding validated through ablation studies. Post-training bifurcates into two distinct pathways: Thinking and Instruct. The Thinking path begins with supervised fine-tuning (SFT) on datasets containing explicit reasoning traces, followed by reinforcement learning (RL) optimized for reasoning fidelity. This yields LoopCoder-Thinking, which exhibits emergent autonomous error-recovery in long-horizon tasks such as software engineering and competitive programming. The Instruct path, in contrast, applies SFT on general and code instruction-following data, followed by RL tuned for instruction adherence, producing LoopCoder-Instruct. Both paths utilize the LoopCoder architecture, which employs a recurrent transformer design with two fixed iterations: the first processes input embeddings with position-shifted hidden states, while the second computes a gated mixture of global attention (attending to all key-value pairs from iteration 1) and local causal attention (attending only to preceding tokens in iteration 2). This architecture enables iterative refinement over complex code segments without requiring token-shifting mechanisms. The training infrastructure supports this multi-stage pipeline through fused gated attention kernels to reduce memory bandwidth, context parallelism for ultra-long context training, and deterministic re-computation for silent error detection. Data construction is model-centric, relying on frontier LLMs to generate training samples under automated verification, including execution-based validation for objective domains and multi-agent debate for subjective ones. Supervised fine-tuning employs aggressive sequence packing, cosine annealing with extended low-rate phases, and a three-phase curriculum that sequences data by difficulty to ensure stable convergence. Reinforcement learning leverages the GRPO algorithm with clip-Higher strategy, trained on test case pass rates without KL penalties, and is further enhanced by the SWE-RL framework, which formulates software engineering as an interactive RL environment with tool-based actions and rewards based on test suite passage and efficiency regularization. The authors also formalize multilingual code scaling laws by incorporating language proportions $p = (p_1, ldots, p_K)$ into the loss function: $mathcal { L } ( N , D ; p ) = A cdot N ^ { - alpha _ { N } ( p ) } + B cdot D _ { gamma } ^ { - alpha _ { D } ( p ) } + L _ { infty } ( p )$ where $alpha_{N}(p)=sum_{k} p_{k} alpha_{N}^{k}$ , $alpha_{D}(p)=sum_{k} p_{k} alpha_{D}^{k}$ , and $L_{infty}(p)=sum_{p} p L_{infty}^{k}$ are proportion-weighted averages of language-specific parameters. The effective data term captures cross-lingual transfer: $D _ { x } = D _ { a l l } left( 1 + gamma sum _ { L _ { i } eq L _ { j } } p _ { L _ { i } } p _ { L _ { j } } au _ { i j } ight)$ with $au_{ij}$ as the transfer coefficient derived from empirical synergy gains. The final fitted scaling law under optimal multilingual allocation is: $mathcal { L } ^ { * } ( N , D ) = A ^ { * } cdot N ^ { - alpha _ { N } ^ { * } } + B ^ { * } cdot D ^ { - alpha _ { D } ^ { * } } + L _ { infty } ^ { * }$ where $alpha_{D}^{*}=0.6859$ , $alpha_{N}^{*}=0.2186$ , and $L_{infty}^{*}=0.2025$ . This formulation enables efficient allocation of multilingual code data to maximize performance on both generation and translation tasks.

Experiment

Our model is evaluated across diverse code generation, reasoning, efficiency, and agentic tasks against a broad set of state-of-the-art baselines including Claude, GPT, Gemini, Qwen, and StarCoder2. It achieves strong results on EvalPlus, BigCodeBench, FullStackBench, and LiveCodeBench for code generation; excels in forward and inverse reasoning on CRUXEval; and scores competitively on Mercury for runtime efficiency and on Spider and BIRD for Text-to-SQL. In agentic settings, it attains a 76.2 score on SWE-bench Verified and performs robustly on Terminal-Bench and general tool-use benchmarks like Mind2Web and BFCL, while maintaining high safety compliance across Tulu 3 benchmarks with balanced refusal behavior on harmful prompts.

The authors evaluate code generation performance across six programming languages, reporting scores and relative changes compared to baseline models. Results show consistent gains in most language pairs, with notable improvements in Java and TypeScript, while some cases show slight declines in Python and C#. Java-to-JavaScript shows a 12.62% improvement over baseline Python-to-C# and Python-to-Rust show declines of 1.69% and 2.72% TypeScript-to-JavaScript and Go-to-Rust show gains of 4.69% and 2.86%

Cross-language code generation performance

The authors evaluate multiple code generation models across Python, Java, TypeScript, and C#, reporting exact match and edit similarity scores. IQuest-Coder-V1-40B achieves the highest average edit similarity and strong exact match scores, outperforming other 20B+ models in most languages. Results indicate competitive performance in both functional correctness and structural similarity across programming languages. IQuest-Coder-V1-40B leads in average edit similarity (85.7) among 20B+ models It achieves top exact match in Java (57.9) and strong scores in Python (49.0) and C# (63.4) StarCoder2-7B underperforms with lowest average exact match (8.3) and edit similarity (70.8)

Code generation performance by language

The authors evaluate their model across agentic coding and general tool-use tasks, comparing it with open-source and closed-API models. Results show the model achieves competitive scores on Terminal-Bench, SWE-Verified, Mind2Web, and BFCL V3, particularly excelling in SWE-Verified with a 76.2 score. Performance is strong relative to similarly sized open models and approaches some closed-source systems. IQuest-Coder-V1-40B-Loop-Instruct scores 76.2 on SWE-Verified, outperforming most open models. The model achieves 52.5 on Terminal-Bench and 64.3 on Mind2Web, showing broad agentic capability. Performance on BFCL V3 reaches 73.9, indicating strong multi-step tool-use reasoning.

Agentic coding and tool use performance

The authors evaluate multiple code-focused language models on Text to SQL tasks using Bird and Spider benchmarks. Results show that IQuest-Coder-V1-40B-Instruct achieves 70.5 on Bird and 92.2 on Spider, outperforming most open-source models and matching or exceeding several closed-source systems. The the the table highlights performance differences across model sizes and architectures, with closed-API models generally showing strong execution accuracy. IQuest-Coder-V1-40B-Instruct leads open-source models with 92.2 Spider accuracy Closed-API models like Gemini-3-Flash-preview score 87.2 on Spider Qwen3-Coder-480B-A35B-Instruct achieves 81.2 on Spider despite lower Bird score

Text to SQL performance comparison

The authors evaluate multiple code-focused language models across CruxEval and LiveCodeBench benchmarks, grouping them by parameter scale and architecture. Results show that IQuest-Coder-V1 variants, particularly the 40B Loop-Thinking model, achieve top scores in Output-COT and LiveCodeBench V6, outperforming many larger or closed-source models. Closed-API models like Gemini-3 and Claude-Opus-4.5 remain competitive but do not consistently surpass the best open models in all metrics. IQuest-Coder-V1-40B-Loop-Thinking leads in Output-COT with 99.4 and LiveCodeBench V6 with 81.1 Closed-API models like Gemini-3-Flash-preview score high in CruxEval but trail in LiveCodeBench V6 Qwen3-235B-A22B-Thinking-2507 shows strong CruxEval Output-COT at 89.5 but lower LiveCodeBench scores

Code model performance comparison

The authors evaluate IQuest-Coder-V1-40B across cross-language code generation, agentic coding, Text-to-SQL, and code reasoning benchmarks, showing consistent gains in Java and TypeScript translation while noting slight declines in Python and C#. The model achieves top edit similarity (85.7) and exact match scores in Java (57.9), C# (63.4), and Python (49.0), outperforming most 20B+ models, and excels in agentic tasks with a 76.2 on SWE-Verified, 52.5 on Terminal-Bench, and 73.9 on BFCL V3. In Text-to-SQL, it leads open-source models with 92.2 on Spider and 70.5 on Bird, while its Loop-Thinking variant tops Output-COT (99.4) and LiveCodeBench V6 (81.1), rivaling or surpassing several closed-source systems.

PDF source

Table des matières

Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA

GPU prêts à l’emploi

Tarifs les plus avantageux

Commencer Voir les tarifs

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour

Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin

Propulsé par MailChimp

il y a un mois

Traitement Du Langage Naturel

Tâche

Jian Yang Wei Zhang Shawn Guo Zhengmao Ye Lin Jing Shark Liu Yizhi Li Jiajun Wu Cening Liu X. Ma

Table des matières

Résumé

One-sentence Summary

Key Contributions

We introduce the code-flow multi-stage training paradigm, which evolves software logic through pre-training on code facts and repositories, mid-training with reasoning and agentic trajectories in 32k–128k-context, and bifurcated post-training via reasoning-driven RL and instruction-tuning for thinking and instruct paths.
IQuest-Coder-V1 achieves state-of-the-art performance on agentic software engineering, competitive programming, and complex tool use benchmarks, including LiveCodeBench v6 (via the 40B-Loop-Thinking model) and other tasks evaluated on the 40B-Loop-Instruct variant.
The IQuest-Coder-V1-Loop variant incorporates a recurrent mechanism to optimize the capacity-efficiency trade-off for deployment, and the full open-sourced training pipeline—from pre-training bases to final models—enables reproducibility and advances autonomous code intelligence research.

Introduction

Dataset

Dataset overview

The authors use a multi-stage, highly curated dataset to train IQuest-Coder, combining web-scale corpora with domain-specific code and reasoning data. Here’s how the dataset is structured and used:

General Corpus (Stage 1):
- Sourced primarily from Common Crawl, cleaned via regex filters and hierarchical deduplication (exact + fuzzy matching using embedding models).
- Decontaminated to exclude overlaps with standard benchmarks.
- Code snippets undergo AST analysis to verify syntactic validity, supporting code-flow training.
- Domain-specific proxy classifiers (small models trained to emulate larger ones) filter data by information density, educational value, and toxicity — outperforming FastText baselines.
- CodeSimpleQA-Instruct (66M samples) is integrated to inject factual, objective Q&A pairs generated via LLMs under constraints for time-invariance and single correct answers.
Repository Evolution Data:
- Built as triplets: (R_old, P, R_new), where R_old and R_new are code states from the 40%-80% lifecycle percentile (mature phase), and P is the patch between them.
- Ensures temporal continuity and meaningful code changes, avoiding early instability or late fragmentation.
Code Completion Data (FIM Format):
- Constructed at file and repo levels using Fill-In-the-Middle (FIM) format: ||.
- File-level: single document split via random character or line boundaries.
- Repo-level: adds semantically similar snippets from the same repo as context.
- Syntax-based construction uses ASTs to extract expression-, statement-, and function-level segments, preserving structural integrity.
- Task formats include: <fim_prefix>{pre}|{suf}|{mid}|{im_end}|> (file) and <repo_name>{repo} <|file_sep|>{path1} {content1} <|file_sep|>{path2} {content2} <|file_SEP|>{path3} ... {pre}|{suf} <fim_middle|>{mid}|{im_end}|> (repo).
Mid-Training (Stage 2):
- Uses same core data categories: Reasoning QA (math, coding, logic), Agent trajectories, code commits, and FIM data.
- Reasoning QA teaches structured decomposition and consistency; Agent trajectories simulate closed-loop intelligence via action-observation-revision cycles with feedback (logs, errors, tests).
- Stage 2.1 trains at 32K context; Stage 2.2 extends to 128K to enable repository-scale reasoning.
Evaluation Benchmarks:
- CrossCodeEval (Python, Java, TypeScript, C#) tests cross-file completion.
- Aider’s Polyglot benchmark (C++, Go, Java, JavaScript, Python, Rust) evaluates multilingual editing on 225 hardest Exercism problems.

All data undergoes strict quality filtering and structural validation to support scalable, high-fidelity code modeling.

Method

Experiment

Cross-language code generation performance

Code generation performance by language

Agentic coding and tool use performance

Text to SQL performance comparison

Code model performance comparison

PDF source

Table des matières

Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA

GPU prêts à l’emploi

Tarifs les plus avantageux

Commencer Voir les tarifs

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour

Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin

Propulsé par MailChimp

Command Palette

Rapport technique d'IQuest-Coder-V1

Jian Yang Wei Zhang Shawn Guo Zhengmao Ye Lin Jing Shark Liu Yizhi Li Jiajun Wu Cening Liu X. Ma28 more

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters

Command Palette

Rapport technique d'IQuest-Coder-V1

Jian Yang Wei Zhang Shawn Guo Zhengmao Ye Lin Jing Shark Liu Yizhi Li Jiajun Wu Cening Liu X. Ma28 more

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters

Command Palette

Rapport technique d'IQuest-Coder-V1

Jian Yang Wei Zhang Shawn Guo Zhengmao Ye Lin Jing Shark Liu Yizhi Li Jiajun Wu Cening Liu X. Ma28 more

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters

Jian Yang Wei Zhang Shawn Guo Zhengmao Ye Lin Jing Shark Liu Yizhi Li Jiajun Wu Cening Liu X. Ma

Jian Yang Wei Zhang Shawn Guo Zhengmao Ye Lin Jing Shark Liu Yizhi Li Jiajun Wu Cening Liu X. Ma

Jian Yang Wei Zhang Shawn Guo Zhengmao Ye Lin Jing Shark Liu Yizhi Li Jiajun Wu Cening Liu X. Ma