منذ شهر واحد

جدول المحتويات

الملخص

في هذا التقرير، نقدّم سلسلة نماذج اللغة الكبيرة المخصصة للبرمجة IQuest-Coder-V1 (7B/14B/40B/40B-Loop)، وهي عائلة جديدة من النماذج الكبيرة للغة المخصصة للبرمجة (LLMs). وننتقل بeyond التمثيلات الثابتة للبرمجة، ونُقدّم نموذج التدريب متعدد المراحل القائم على "تدفق الكود"، الذي يُمكّن من التقاط التطوّر الديناميكي للمنطق البرمجي عبر مراحل مختلفة من خط أنابيب المعالجة. تم تطوير نماذجنا من خلال خط أنابيب تطوري، يبدأ بمرحلة التدريب المبدئي التي تشمل بيانات حقائق الكود، ومستودعات الكود، وبيانات اكتمال الكود. ثم نُطبّق مرحلة وسطية مخصصة تدمج مسارات التفكير والمسارات العاملة (agentic) في سياق بطول 32K، وتمتد على نطاق مستودعات الكود بسياق بطول 128K، بهدف بناء أسس منطقية عميقة. وبعد ذلك، تُكتمل النماذج بمرحلة ما بعد التدريب المخصصة لتعزيز القدرات البرمجية، والتي تُنقسم إلى طريقتين مخصصتين: المسار التفكيري (الذي يعتمد على التعلم القائم على التفكير - reasoning-driven RL) والمسار التوجيهي (المُحسّن لتقديم المساعدة العامة). تُحقّق سلسلة IQuest-Coder-V1 أداءً متفوّقًا على المستوى العالمي بين النماذج التنافسية في المحاور الحاسمة للذكاء البرمجي: هندسة البرمجيات العاملة ذاتية، والبرمجة التنافسية، واستخدام الأدوات المعقدة. ولمعالجة قيود النشر، تقدّم النسخة المُطوّرة IQuest-Coder-V1-Loop آلية تكرارية مصممة لتحسين التوازن بين قدرة النموذج وحجم النشر، مما يوفّر طريقًا مُحسّنًا بناءً على البنية المعمارية لتحقيق توازن فعّال بين الكفاءة والأداء. نؤمن بأن إطلاق سلسلة IQuest-Coder-V1، بما في ذلك السلسلة الكاملة المفتوحة للنقاط المرجعية (checkpoints) من النماذج الأساسية للتدريب الأولي حتى النماذج النهائية للتفكير والتعليم، سيُسهم في دفع عجلة البحث في مجال الذكاء البرمجي الذاتي والأنظمة العاملة الواقعية.الشكل 1. أداء نموذج IQuest-Coder-V1 على مختلف المعايير. تم احتساب الدرجة الخاصة بـ LiveCodeBench v6 من خلال نموذج IQuest-Coder-V1-40B-Loop-Thinking، بينما تم احتساب باقي الدرجات من نموذج IQuest-Coder-V1-40B-Loop-Instruct. تمثل الخط المستقيم المرسوم باللون البرتقالي المتوسط الحسابي للنماذج المختارة.

One-sentence Summary

The IQuest-Coder-V1 team introduces the 7B/14B/40B/40B-Loop code LLM family, trained via a code-flow multi-stage paradigm that evolves software logic through pre-training on code facts and repositories, mid-training with reasoning and agentic trajectories in 32k and 128k contexts, and dual-path post-training via reasoning-driven RL and instruction optimization, achieving state-of-the-art performance in agentic software engineering, competitive programming, and complex tool use.

Key Contributions

We introduce the code-flow multi-stage training paradigm, which evolves software logic through pre-training on code facts and repositories, mid-training with reasoning and agentic trajectories in 32k–128k-context, and bifurcated post-training via reasoning-driven RL and instruction-tuning for thinking and instruct paths.
IQuest-Coder-V1 achieves state-of-the-art performance on agentic software engineering, competitive programming, and complex tool use benchmarks, including LiveCodeBench v6 (via the 40B-Loop-Thinking model) and other tasks evaluated on the 40B-Loop-Instruct variant.
The IQuest-Coder-V1-Loop variant incorporates a recurrent mechanism to optimize the capacity-efficiency trade-off for deployment, and the full open-sourced training pipeline—from pre-training bases to final models—enables reproducibility and advances autonomous code intelligence research.

Introduction

The authors leverage a novel code-flow multi-stage training paradigm to build IQuest-Coder-V1, a family of code LLMs designed to model the dynamic evolution of software logic across development phases. Prior models often treat code as static, limiting their ability to handle complex, real-world engineering tasks that require reasoning over repository-scale context and evolving logic. To overcome this, the authors introduce an evolutionary pipeline with specialized mid-training using 32k and 128k context windows, followed by bifurcated post-training paths—reasoning-driven RL for deep analysis and instruction-tuning for general assistance. Their main contribution includes state-of-the-art performance across agentic software engineering, competitive programming, and tool use, plus the IQuest-Coder-V1-Loop variant, which uses recurrent mechanisms to optimize deployment efficiency without sacrificing capability. They also open-source the full training chain to accelerate research in autonomous code systems.

Dataset

Dataset overview

The authors use a multi-stage, highly curated dataset to train IQuest-Coder, combining web-scale corpora with domain-specific code and reasoning data. Here’s how the dataset is structured and used:

General Corpus (Stage 1):
- Sourced primarily from Common Crawl, cleaned via regex filters and hierarchical deduplication (exact + fuzzy matching using embedding models).
- Decontaminated to exclude overlaps with standard benchmarks.
- Code snippets undergo AST analysis to verify syntactic validity, supporting code-flow training.
- Domain-specific proxy classifiers (small models trained to emulate larger ones) filter data by information density, educational value, and toxicity — outperforming FastText baselines.
- CodeSimpleQA-Instruct (66M samples) is integrated to inject factual, objective Q&A pairs generated via LLMs under constraints for time-invariance and single correct answers.
Repository Evolution Data:
- Built as triplets: (R_old, P, R_new), where R_old and R_new are code states from the 40%-80% lifecycle percentile (mature phase), and P is the patch between them.
- Ensures temporal continuity and meaningful code changes, avoiding early instability or late fragmentation.
Code Completion Data (FIM Format):
- Constructed at file and repo levels using Fill-In-the-Middle (FIM) format: ||.
- File-level: single document split via random character or line boundaries.
- Repo-level: adds semantically similar snippets from the same repo as context.
- Syntax-based construction uses ASTs to extract expression-, statement-, and function-level segments, preserving structural integrity.
- Task formats include: <fim_prefix>{pre}|{suf}|{mid}|{im_end}|> (file) and <repo_name>{repo} <|file_sep|>{path1} {content1} <|file_sep|>{path2} {content2} <|file_SEP|>{path3} ... {pre}|{suf} <fim_middle|>{mid}|{im_end}|> (repo).
Mid-Training (Stage 2):
- Uses same core data categories: Reasoning QA (math, coding, logic), Agent trajectories, code commits, and FIM data.
- Reasoning QA teaches structured decomposition and consistency; Agent trajectories simulate closed-loop intelligence via action-observation-revision cycles with feedback (logs, errors, tests).
- Stage 2.1 trains at 32K context; Stage 2.2 extends to 128K to enable repository-scale reasoning.
Evaluation Benchmarks:
- CrossCodeEval (Python, Java, TypeScript, C#) tests cross-file completion.
- Aider’s Polyglot benchmark (C++, Go, Java, JavaScript, Python, Rust) evaluates multilingual editing on 225 hardest Exercism problems.

All data undergoes strict quality filtering and structural validation to support scalable, high-fidelity code modeling.

Method

The authors leverage a structured, multi-phase training pipeline—termed the Code-Flow pipeline—to systematically evolve logical and agentic capabilities in the IQuest-Coder-V1 series. The framework is organized into three primary stages: pre-training and annealing, mid-training, and bifurcated post-training, each designed to progressively specialize the model while preserving general intelligence. The process begins with pre-training and annealing, where the model is first exposed to a mixture of general and code-specific data in Stage 1, followed by an annealing phase focused exclusively on high-quality, curated code corpora. This primes the model’s representations for complex reasoning tasks by reinforcing syntactic and semantic patterns inherent in real-world codebases. Refer to the framework diagram for the flow of this initial phase. Code-Flow Training Pipeline of IQuest-Coder-V1 Mid-training introduces two sequential phases that progressively scale context length and task complexity. In Mid-train Phase 1, the model is trained on 32k-context data encompassing reasoning, agentic, and code tasks, yielding intermediate checkpoints such as IQuest-Coder-Stage1 and LoopCoder-Base-Stage1. This phase is critical for establishing a scaffold for long-horizon reasoning. Mid-train Phase 2 extends context to 128k while maintaining similar data distributions, producing IQuest-Coder-Stage2 and LoopCoder-Base-Stage2, which serve as the foundation for post-training. The authors emphasize that injecting 32k reasoning trajectories after annealing—but before post-training—stabilizes performance under distribution shifts, a finding validated through ablation studies. Post-training bifurcates into two distinct pathways: Thinking and Instruct. The Thinking path begins with supervised fine-tuning (SFT) on datasets containing explicit reasoning traces, followed by reinforcement learning (RL) optimized for reasoning fidelity. This yields LoopCoder-Thinking, which exhibits emergent autonomous error-recovery in long-horizon tasks such as software engineering and competitive programming. The Instruct path, in contrast, applies SFT on general and code instruction-following data, followed by RL tuned for instruction adherence, producing LoopCoder-Instruct. Both paths utilize the LoopCoder architecture, which employs a recurrent transformer design with two fixed iterations: the first processes input embeddings with position-shifted hidden states, while the second computes a gated mixture of global attention (attending to all key-value pairs from iteration 1) and local causal attention (attending only to preceding tokens in iteration 2). This architecture enables iterative refinement over complex code segments without requiring token-shifting mechanisms. The training infrastructure supports this multi-stage pipeline through fused gated attention kernels to reduce memory bandwidth, context parallelism for ultra-long context training, and deterministic re-computation for silent error detection. Data construction is model-centric, relying on frontier LLMs to generate training samples under automated verification, including execution-based validation for objective domains and multi-agent debate for subjective ones. Supervised fine-tuning employs aggressive sequence packing, cosine annealing with extended low-rate phases, and a three-phase curriculum that sequences data by difficulty to ensure stable convergence. Reinforcement learning leverages the GRPO algorithm with clip-Higher strategy, trained on test case pass rates without KL penalties, and is further enhanced by the SWE-RL framework, which formulates software engineering as an interactive RL environment with tool-based actions and rewards based on test suite passage and efficiency regularization. The authors also formalize multilingual code scaling laws by incorporating language proportions $p = (p_1, ldots, p_K)$ into the loss function: $mathcal { L } ( N , D ; p ) = A cdot N ^ { - alpha _ { N } ( p ) } + B cdot D _ { gamma } ^ { - alpha _ { D } ( p ) } + L _ { infty } ( p )$ where $alpha_{N}(p)=sum_{k} p_{k} alpha_{N}^{k}$ , $alpha_{D}(p)=sum_{k} p_{k} alpha_{D}^{k}$ , and $L_{infty}(p)=sum_{p} p L_{infty}^{k}$ are proportion-weighted averages of language-specific parameters. The effective data term captures cross-lingual transfer: $D _ { x } = D _ { a l l } left( 1 + gamma sum _ { L _ { i } eq L _ { j } } p _ { L _ { i } } p _ { L _ { j } } au _ { i j } ight)$ with $au_{ij}$ as the transfer coefficient derived from empirical synergy gains. The final fitted scaling law under optimal multilingual allocation is: $mathcal { L } ^ { * } ( N , D ) = A ^ { * } cdot N ^ { - alpha _ { N } ^ { * } } + B ^ { * } cdot D ^ { - alpha _ { D } ^ { * } } + L _ { infty } ^ { * }$ where $alpha_{D}^{*}=0.6859$ , $alpha_{N}^{*}=0.2186$ , and $L_{infty}^{*}=0.2025$ . This formulation enables efficient allocation of multilingual code data to maximize performance on both generation and translation tasks.

Experiment

Our model is evaluated across diverse code generation, reasoning, efficiency, and agentic tasks against a broad set of state-of-the-art baselines including Claude, GPT, Gemini, Qwen, and StarCoder2. It achieves strong results on EvalPlus, BigCodeBench, FullStackBench, and LiveCodeBench for code generation; excels in forward and inverse reasoning on CRUXEval; and scores competitively on Mercury for runtime efficiency and on Spider and BIRD for Text-to-SQL. In agentic settings, it attains a 76.2 score on SWE-bench Verified and performs robustly on Terminal-Bench and general tool-use benchmarks like Mind2Web and BFCL, while maintaining high safety compliance across Tulu 3 benchmarks with balanced refusal behavior on harmful prompts.

The authors evaluate code generation performance across six programming languages, reporting scores and relative changes compared to baseline models. Results show consistent gains in most language pairs, with notable improvements in Java and TypeScript, while some cases show slight declines in Python and C#. Java-to-JavaScript shows a 12.62% improvement over baseline Python-to-C# and Python-to-Rust show declines of 1.69% and 2.72% TypeScript-to-JavaScript and Go-to-Rust show gains of 4.69% and 2.86%

Cross-language code generation performance

The authors evaluate multiple code generation models across Python, Java, TypeScript, and C#, reporting exact match and edit similarity scores. IQuest-Coder-V1-40B achieves the highest average edit similarity and strong exact match scores, outperforming other 20B+ models in most languages. Results indicate competitive performance in both functional correctness and structural similarity across programming languages. IQuest-Coder-V1-40B leads in average edit similarity (85.7) among 20B+ models It achieves top exact match in Java (57.9) and strong scores in Python (49.0) and C# (63.4) StarCoder2-7B underperforms with lowest average exact match (8.3) and edit similarity (70.8)

Code generation performance by language

The authors evaluate their model across agentic coding and general tool-use tasks, comparing it with open-source and closed-API models. Results show the model achieves competitive scores on Terminal-Bench, SWE-Verified, Mind2Web, and BFCL V3, particularly excelling in SWE-Verified with a 76.2 score. Performance is strong relative to similarly sized open models and approaches some closed-source systems. IQuest-Coder-V1-40B-Loop-Instruct scores 76.2 on SWE-Verified, outperforming most open models. The model achieves 52.5 on Terminal-Bench and 64.3 on Mind2Web, showing broad agentic capability. Performance on BFCL V3 reaches 73.9, indicating strong multi-step tool-use reasoning.

Agentic coding and tool use performance

The authors evaluate multiple code-focused language models on Text to SQL tasks using Bird and Spider benchmarks. Results show that IQuest-Coder-V1-40B-Instruct achieves 70.5 on Bird and 92.2 on Spider, outperforming most open-source models and matching or exceeding several closed-source systems. The the the table highlights performance differences across model sizes and architectures, with closed-API models generally showing strong execution accuracy. IQuest-Coder-V1-40B-Instruct leads open-source models with 92.2 Spider accuracy Closed-API models like Gemini-3-Flash-preview score 87.2 on Spider Qwen3-Coder-480B-A35B-Instruct achieves 81.2 on Spider despite lower Bird score

Text to SQL performance comparison

The authors evaluate multiple code-focused language models across CruxEval and LiveCodeBench benchmarks, grouping them by parameter scale and architecture. Results show that IQuest-Coder-V1 variants, particularly the 40B Loop-Thinking model, achieve top scores in Output-COT and LiveCodeBench V6, outperforming many larger or closed-source models. Closed-API models like Gemini-3 and Claude-Opus-4.5 remain competitive but do not consistently surpass the best open models in all metrics. IQuest-Coder-V1-40B-Loop-Thinking leads in Output-COT with 99.4 and LiveCodeBench V6 with 81.1 Closed-API models like Gemini-3-Flash-preview score high in CruxEval but trail in LiveCodeBench V6 Qwen3-235B-A22B-Thinking-2507 shows strong CruxEval Output-COT at 89.5 but lower LiveCodeBench scores

Code model performance comparison

The authors evaluate IQuest-Coder-V1-40B across cross-language code generation, agentic coding, Text-to-SQL, and code reasoning benchmarks, showing consistent gains in Java and TypeScript translation while noting slight declines in Python and C#. The model achieves top edit similarity (85.7) and exact match scores in Java (57.9), C# (63.4), and Python (49.0), outperforming most 20B+ models, and excels in agentic tasks with a 76.2 on SWE-Verified, 52.5 on Terminal-Bench, and 73.9 on BFCL V3. In Text-to-SQL, it leads open-source models with 92.2 on Spider and 70.5 on Bird, while its Loop-Thinking variant tops Output-COT (99.4) and LiveCodeBench V6 (81.1), rivaling or surpassing several closed-source systems.

ملف PDF المصدر

جدول المحتويات

بناء الذكاء الاصطناعي بالذكاء الاصطناعي

من الفكرة إلى الإطلاق — سرّع تطوير الذكاء الاصطناعي الخاص بك مع المساعدة البرمجية المجانية بالذكاء الاصطناعي، وبيئة جاهزة للاستخدام، وأفضل أسعار لوحدات معالجة الرسومات.

البرمجة التعاونية باستخدام الذكاء الاصطناعي

وحدات GPU جاهزة للعمل

أفضل الأسعار

ابدأ عرض الأسعار

HyperAI Newsletters

اشترك في آخر تحديثاتنا

سنرسل لك أحدث التحديثات الأسبوعية إلى بريدك الإلكتروني في الساعة التاسعة من صباح كل يوم اثنين

مدعوم بواسطة MailChimp

منذ شهر واحد

معالجة اللغة الطبيعية

مهمة

Jian Yang Wei Zhang Shawn Guo Zhengmao Ye Lin Jing Shark Liu Yizhi Li Jiajun Wu Cening Liu X. Ma

جدول المحتويات

الملخص

One-sentence Summary

Key Contributions

We introduce the code-flow multi-stage training paradigm, which evolves software logic through pre-training on code facts and repositories, mid-training with reasoning and agentic trajectories in 32k–128k-context, and bifurcated post-training via reasoning-driven RL and instruction-tuning for thinking and instruct paths.
IQuest-Coder-V1 achieves state-of-the-art performance on agentic software engineering, competitive programming, and complex tool use benchmarks, including LiveCodeBench v6 (via the 40B-Loop-Thinking model) and other tasks evaluated on the 40B-Loop-Instruct variant.
The IQuest-Coder-V1-Loop variant incorporates a recurrent mechanism to optimize the capacity-efficiency trade-off for deployment, and the full open-sourced training pipeline—from pre-training bases to final models—enables reproducibility and advances autonomous code intelligence research.

Introduction

Dataset

Dataset overview

The authors use a multi-stage, highly curated dataset to train IQuest-Coder, combining web-scale corpora with domain-specific code and reasoning data. Here’s how the dataset is structured and used:

General Corpus (Stage 1):
- Sourced primarily from Common Crawl, cleaned via regex filters and hierarchical deduplication (exact + fuzzy matching using embedding models).
- Decontaminated to exclude overlaps with standard benchmarks.
- Code snippets undergo AST analysis to verify syntactic validity, supporting code-flow training.
- Domain-specific proxy classifiers (small models trained to emulate larger ones) filter data by information density, educational value, and toxicity — outperforming FastText baselines.
- CodeSimpleQA-Instruct (66M samples) is integrated to inject factual, objective Q&A pairs generated via LLMs under constraints for time-invariance and single correct answers.
Repository Evolution Data:
- Built as triplets: (R_old, P, R_new), where R_old and R_new are code states from the 40%-80% lifecycle percentile (mature phase), and P is the patch between them.
- Ensures temporal continuity and meaningful code changes, avoiding early instability or late fragmentation.
Code Completion Data (FIM Format):
- Constructed at file and repo levels using Fill-In-the-Middle (FIM) format: ||.
- File-level: single document split via random character or line boundaries.
- Repo-level: adds semantically similar snippets from the same repo as context.
- Syntax-based construction uses ASTs to extract expression-, statement-, and function-level segments, preserving structural integrity.
- Task formats include: <fim_prefix>{pre}|{suf}|{mid}|{im_end}|> (file) and <repo_name>{repo} <|file_sep|>{path1} {content1} <|file_sep|>{path2} {content2} <|file_SEP|>{path3} ... {pre}|{suf} <fim_middle|>{mid}|{im_end}|> (repo).
Mid-Training (Stage 2):
- Uses same core data categories: Reasoning QA (math, coding, logic), Agent trajectories, code commits, and FIM data.
- Reasoning QA teaches structured decomposition and consistency; Agent trajectories simulate closed-loop intelligence via action-observation-revision cycles with feedback (logs, errors, tests).
- Stage 2.1 trains at 32K context; Stage 2.2 extends to 128K to enable repository-scale reasoning.
Evaluation Benchmarks:
- CrossCodeEval (Python, Java, TypeScript, C#) tests cross-file completion.
- Aider’s Polyglot benchmark (C++, Go, Java, JavaScript, Python, Rust) evaluates multilingual editing on 225 hardest Exercism problems.

All data undergoes strict quality filtering and structural validation to support scalable, high-fidelity code modeling.

Method

Experiment

Cross-language code generation performance

Code generation performance by language

Agentic coding and tool use performance

Text to SQL performance comparison

Code model performance comparison

ملف PDF المصدر

جدول المحتويات

بناء الذكاء الاصطناعي بالذكاء الاصطناعي

البرمجة التعاونية باستخدام الذكاء الاصطناعي

وحدات GPU جاهزة للعمل

أفضل الأسعار

ابدأ عرض الأسعار

HyperAI Newsletters

اشترك في آخر تحديثاتنا

سنرسل لك أحدث التحديثات الأسبوعية إلى بريدك الإلكتروني في الساعة التاسعة من صباح كل يوم اثنين

مدعوم بواسطة MailChimp

Command Palette

تقرير فني حول IQuest-Coder-V1

Jian Yang Wei Zhang Shawn Guo Zhengmao Ye Lin Jing Shark Liu Yizhi Li Jiajun Wu Cening Liu X. Ma28 more

الملخص

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

بناء الذكاء الاصطناعي بالذكاء الاصطناعي

HyperAI Newsletters

Command Palette

تقرير فني حول IQuest-Coder-V1

Jian Yang Wei Zhang Shawn Guo Zhengmao Ye Lin Jing Shark Liu Yizhi Li Jiajun Wu Cening Liu X. Ma28 more

الملخص

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

بناء الذكاء الاصطناعي بالذكاء الاصطناعي

HyperAI Newsletters

Command Palette

تقرير فني حول IQuest-Coder-V1

Jian Yang Wei Zhang Shawn Guo Zhengmao Ye Lin Jing Shark Liu Yizhi Li Jiajun Wu Cening Liu X. Ma28 more

الملخص

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

بناء الذكاء الاصطناعي بالذكاء الاصطناعي

HyperAI Newsletters

Jian Yang Wei Zhang Shawn Guo Zhengmao Ye Lin Jing Shark Liu Yizhi Li Jiajun Wu Cening Liu X. Ma

Jian Yang Wei Zhang Shawn Guo Zhengmao Ye Lin Jing Shark Liu Yizhi Li Jiajun Wu Cening Liu X. Ma

Jian Yang Wei Zhang Shawn Guo Zhengmao Ye Lin Jing Shark Liu Yizhi Li Jiajun Wu Cening Liu X. Ma