Enhancing AI Creativity: Diversity-Aware Offline Learning for Code and Writing
Since 2025, the AI industry has seen a wave of large-model-powered code-generation tools such as Cursor, Gemini CLI, Qwen CLI, and GPT-codex. These tools mark a new frontier in AI development: large models that interact with various analytical tools, substantially improving automated code generation and accelerating human coding workflows.

Building on this trend, Yu Jiahao, an undergraduate alumnus of Shanghai Jiao Tong University and a PhD student at Northwestern University, led a research team focused on improving large models' performance on complex code-generation tasks. Through extensive research, the team identified two widely adopted techniques in code generation: Test-Time Scaling (TTS), which generates multiple candidate answers and selects the best one through comparison, and offline learning, which pre-generates high-quality training data before model training begins. Compared with online learning, which collects data and trains the model simultaneously, offline learning is more computationally efficient and easier to experiment with. However, the team discovered a critical limitation: offline learning tends to reduce the diversity of model outputs. When the generated candidates are too similar, redundancy increases and TTS loses much of its benefit. This prompted the team to ask how output diversity can be preserved or enhanced in offline learning.

To solve this, they introduced a training method that adds a diversity-promoting term to the offline learning loss function. By explicitly encouraging model outputs to differ from one another during training, the method produces more varied candidates, which in turn significantly improves performance under test-time scaling.
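The article does not give the team's exact loss formulation, but the idea of a diversity-promoting term can be sketched with a toy objective: the usual training loss plus a penalty on how similar the sampled candidates are to one another. The Jaccard similarity measure and the weight `lam` below are illustrative assumptions, not the team's actual choices.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two candidate outputs."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def diversity_penalty(candidates: list[str]) -> float:
    """Mean pairwise similarity among candidates (lower = more diverse)."""
    pairs = list(combinations(candidates, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def combined_loss(nll: float, candidates: list[str], lam: float = 0.1) -> float:
    """Toy offline-learning objective: minimize the base loss (e.g. NLL)
    plus a penalty that grows when the sampled candidates look alike."""
    return nll + lam * diversity_penalty(candidates)
```

Minimizing this objective pushes the model toward correct outputs (the NLL term) while discouraging it from collapsing onto a single phrasing (the penalty term), which is the property TTS depends on.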
The team validated the approach on the open-source SWE-Bench benchmark, placing fourth on SWE-Bench-Verified and first on SWE-Bench-Lite, demonstrating the method's effectiveness. Against models trained with online learning, the method proved strongly competitive, offering a compelling alternative for tasks that rely on test-time scaling. In practice, the technique holds promise for complex, multi-step tasks such as code generation, mathematical problem solving, and Capture The Flag (CTF) cybersecurity competitions, where diverse solution paths can dramatically improve success rates. It may also benefit creative writing, where AI-generated text is often criticized for its "AI flavor" of repetitive phrasing and formulaic structure: reducing these mechanical patterns produces more varied and inspiring outputs that better support human creativity.

A key challenge during the project was collecting data for offline learning. The team initially planned to use Anthropic's Claude Sonnet 4, but even a small-scale pilot cost over $500, and full-scale data acquisition would have run on the order of ten thousand dollars, far beyond their budget. They therefore explored alternative models with strong code-generation capabilities. At the time, Chinese open-source models (referred to as "national models") were advancing rapidly: Kimi-K2, Qwen3-coder-480B, and GLM-4.5 offered performance comparable to Claude Sonnet 4 at a fraction of the cost, and small-scale experiments confirmed they were suitable replacements. The breakthrough came when GLM-4.5's developer launched a promotional offer of 1 trillion free tokens for one month, perfectly aligned with the team's timeline. Using it, they completed the entire data-collection phase for just 50 RMB (about $7 USD), a dramatic reduction from the original estimate of roughly $10,000.
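The best-of-N selection that test-time scaling relies on can be sketched as a simple loop: generate N candidate solutions, score each against a verifier (here, a small test suite), and keep the highest scorer. In this sketch, callables stand in for generated programs; a real harness would sandbox execution, and the names and scoring rule are illustrative assumptions.

```python
def run_candidate(code, tests) -> int:
    """Score one candidate solution by how many test cases it passes.
    `tests` is a list of (args, expected) pairs."""
    passed = 0
    for args, expected in tests:
        try:
            if code(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing candidate simply scores lower
    return passed

def best_of_n(candidates, tests):
    """Test-time scaling: keep the candidate that passes the most tests."""
    return max(candidates, key=lambda c: run_candidate(c, tests))
```

The value of diversity is visible here: if all N candidates share the same bug, the selector has nothing better to pick, no matter how large N is.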
The collected data proved to be high quality, translating directly into strong performance in downstream fine-tuning. The team fine-tuned Qwen3-coder-30B, another national model, reflecting a broader shift in the research community toward Chinese open-source models. "Back in 2023, the open-source landscape was dominated by Llama, with most research built on Llama 2. Today, national models have effectively replaced Llama as the new standard in open-source AI research, and their performance gap with proprietary models continues to shrink," the team noted.

Looking ahead, the researchers plan to investigate the relationship between output diversity and test-time scaling performance in more depth. While anecdotal evidence suggests that using multiple diverse models at test time improves results, no quantitative study has yet examined the optimal number of models or how performance disparities among them affect outcomes. "These questions remain open," the team said. "We aim to provide the first systematic analysis in this space."
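A rough intuition for why diversity should matter to test-time scaling can be written down in a few lines. In a toy model (an illustrative simplification, not the quantitative analysis the team is proposing), fully independent candidates each succeed with probability p, so best-of-N succeeds with probability 1 - (1 - p)^n; as candidates become redundant, the effective number of distinct attempts shrinks toward 1.

```python
def best_of_n_success(p: float, n: int, redundancy: float = 0.0) -> float:
    """Probability that at least one of n candidates succeeds.

    redundancy = 0.0 -> candidates fully independent (diverse outputs)
    redundancy = 1.0 -> candidates identical (no benefit from scaling)
    The linear interpolation of the effective attempt count is a toy
    modeling assumption, chosen only to make the trend visible.
    """
    effective_n = 1 + (n - 1) * (1 - redundancy)
    return 1 - (1 - p) ** effective_n
```

With p = 0.2 and n = 5, fully diverse candidates succeed about 67% of the time, while fully redundant ones stay at 20%, which matches the team's observation that low-diversity offline learning undermines TTS.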
