Command Palette
Search for a command to run...
CL-벤치: 컨텍스트 학습을 위한 벤치마크
CL-벤치: 컨텍스트 학습을 위한 벤치마크
초록
현재의 언어 모델(LM)은 사전 학습된 지식을 활용해 프롬프트에 대한 추론을 매우 잘 수행한다. 그러나 현실 세계의 과제는 훨씬 더 복잡하고 맥락에 의존적이다. 모델은 특정 과제에 맞는 맥락을 바탕으로 학습하고, 사전 학습 과정에서 습득하지 못한 새로운 지식을 활용하여 과제를 추론하고 해결해야 한다. 이러한 능력을 우리는 ‘맥락 학습(context learning)’이라 부르며, 인간이 자연스럽게 지닌 핵심 능력이지만 오랫동안 간과되어 왔다. 이를 위해 우리는 경험 많은 분야 전문가들이 직접 제작한 500개의 복잡한 맥락, 1,899개의 과제, 31,607개의 검증 기준을 포함하는 현실 세계 기반의 벤치마크인 CL-bench를 제안한다. 각 과제는 해결하기 위해 필요한 새로운 정보가 해당 맥락 내에 포함되도록 설계되어 있다. CL-bench의 과제를 해결하기 위해서는 모델이 맥락에서 새로운 도메인 특화 지식, 규칙 체계, 복잡한 절차, 또는 실증 데이터로부터 도출된 법칙 등 다양한 지식을 학습해야 하며, 이러한 지식들은 사전 학습 과정에서 포함되지 않는다. 이는 주로 정보 검색이나 독해 능력을 테스트하는 장문 맥락 과제나, 지시문과 예시를 통해 단순한 과제 패턴을 학습하는 인-컨텍스트 학습 과제를 훨씬 뛰어넘는다. 우리는 10개의 최첨단 언어 모델에 대해 평가한 결과, 평균적으로 모델이 해결하는 과제 비율은 17.2%에 불과했으며, 가장 성능이 뛰어난 모델인 GPT-5.1도 단 23.7%에 그쳤다. 이는 언어 모델이 여전히 효과적인 맥락 학습 능력을 갖추지 못하고 있음을 보여주며, 현실 세계에서 복잡하고 맥락에 의존적인 과제를 해결하는 데 있어 중요한 장벽이 존재함을 시사한다. CL-bench는 언어 모델이 이러한 핵심 능력을 갖추도록 하는 데 한 걸음 전진하는 계기가 되며, 모델의 지능 수준을 높이고 현실 적용 가능성을 크게 향상시킬 수 있을 것으로 기대된다.
One-sentence Summary
Researchers from Hunyuan Team at Tencent and Fudan University propose CL-bench, a real-world benchmark exposing LMs’ poor context learning—acquiring new knowledge from context to solve complex tasks—revealing even GPT-5.1 solves only 23.7%, highlighting a critical gap for deploying LMs in practical, context-dependent scenarios.
Key Contributions
- The paper identifies a critical gap in current language models: their inability to perform “context learning”—learning and applying new, task-specific knowledge from context rather than relying solely on pre-trained knowledge, a capability essential for real-world, context-dependent tasks.
- To evaluate this capability, the authors introduce CL-bench, a rigorously constructed benchmark with 500 complex contexts, 1,899 tasks, and 31,607 verification rubrics, all designed by domain experts to require models to learn from context without external retrieval, covering domains like rule systems, procedures, and empirical discovery.
- Evaluations across ten frontier LMs reveal severe limitations, with models solving only 17.2% of tasks on average and the top model (GPT-5.1) achieving just 23.7%, highlighting that context learning remains a fundamental bottleneck for real-world deployment.
Introduction
The authors leverage the growing gap between current language models’ reliance on pre-trained knowledge and the real-world need for dynamic context learning—where models must absorb and apply new, task-specific information from complex contexts. Prior benchmarks focus on retrieval, reading comprehension, or simple in-context pattern learning, failing to evaluate true knowledge acquisition from rich, domain-specific material. To address this, they introduce CL-bench: a contamination-free, expert-crafted benchmark with 500 contexts, 1,899 tasks, and 31,607 rubrics that force models to learn new rules, procedures, or empirical laws directly from context. Evaluations show even top models like GPT-5.1 solve only 23.7% of tasks, exposing a critical bottleneck in context learning and calling for focused research to enable models to adapt like humans in real-world scenarios.
Dataset
-
The authors use CL-bench, a benchmark designed to evaluate language models’ ability to learn from context and apply that knowledge to solve complex, real-world tasks. The dataset is built around entirely new or niche knowledge not typically found in pre-training corpora.
-
CL-bench contains 18 subcategories across 4 main context categories: Domain Knowledge Reasoning (e.g., legal, financial, scientific), Rule System Application (e.g., game mechanics, programming syntax), Procedural Task Execution (e.g., manuals, workflows), and Empirical Discovery & Simulation (e.g., experimental data, simulations). Each context includes multiple tasks, 51.1% of which are sequential, requiring models to build on prior task solutions.
-
Contexts are constructed in three stages by domain experts: (1) design novel or modified content (fictional, altered real-world, or emerging/niche material), (2) create context-dependent tasks requiring genuine learning, and (3) annotate with detailed rubrics. Each context takes ~20 expert hours to build. The dataset is contamination-free: models achieve <1% task-solving rate without context, confirming reliance on provided knowledge.
-
Tasks are evaluated using multi-dimensional rubrics. Input length varies, and contexts include structured data (e.g., tables, JSON, code), plain text, and mixed formats. No cropping is mentioned; instead, full context is provided. Metadata includes task UID, model name, context category/subcategory, and task description. Processing includes expert validation and ablation testing to ensure context dependence.
Method
The authors leverage a dual-path framework that distinguishes between context-driven learning and prompt-guided reasoning, each tailored to different operational regimes along a complexity axis. As shown in the figure below, the architecture bifurcates into two primary pathways: one anchored in context engineering for learning from complex, novel inputs, and another grounded in prompt engineering for reasoning over pre-trained knowledge under constrained conditions.
In the context engineering pathway, the system ingests heterogeneous sources—such as research papers, operational manuals, and experimental data—to infer underlying physical or mathematical laws. The model, exemplified by GPT-5.1 in the workflow, processes these inputs under a system prompt that frames the task as reverse-engineering from raw data. The output is evaluated against rubrics that enforce physical consistency, such as deriving the entry angle via θ=arctan(v⊥/v∥), and penalize omissions of key assumptions, as seen in the rationale for omitting the linear growth of z with time.
The prompt engineering pathway operates under strict rule systems, where the model executes predefined procedural logic without inferring new functions. This is evident in domains such as logistics or regulatory compliance, where the system must adhere to documented APIs and safety protocols. For instance, when faced with an urgent drone delivery request involving Class-4 Hazmat, the model refuses to invoke non-existent functions like force_launch_override() and instead proposes a compliant alternative using auth_handshake() and Safety_verify_weather_conditions(), as evaluated against rubrics that enforce legal and safety constraints.
The framework’s modularity allows it to span diverse domains—from finance and healthcare to programming syntax and game mechanics—by switching between context-rich inference and rule-bound execution. Each pathway is governed by distinct evaluation criteria: context learning is scored on physical or mathematical fidelity, while prompt-based reasoning is validated against procedural correctness and safety compliance. This design enables the system to scale toward real-world complexity without conflating learning from novel data with execution under fixed constraints.
The training process implicitly enforces this duality by exposing the model to both open-ended context learning tasks and tightly constrained procedural tasks, each with explicit rubrics that shape the output toward domain-specific correctness. No external data or assumptions are introduced; all reasoning is grounded in the provided documentation, ensuring reproducibility and adherence to operational boundaries.
Experiment
- CL-bench introduces task-level rubrics with binary yes/no criteria to reliably evaluate complex, multi-solution tasks, validated by expert design and inter-verifier agreement exceeding 90%.
- GPT-5.1 serves as a reliable verifier, showing high consistency with human annotators and minimal self-evaluation bias when compared to Claude Opus 4.5 and Qwen-3-Max.
- Context learning remains a major challenge: even top models achieve only 17–24% task-solving rates, with no model surpassing 30%, highlighting a gap in real-world deployment readiness.
- Task difficulty varies by category: domain knowledge and procedural tasks are more tractable; empirical discovery and simulation are hardest, with ~11% success and high instability due to inductive reasoning demands.
- Within categories, subcategories reveal fine-grained gaps: e.g., legal & regulatory tasks outperform mathematical formalism by over 15%, and observational data is harder than experimental data.
- Long context and instruction following are necessary but insufficient; models like GPT-5.2 underperform GPT-5.1 despite newer architecture, often losing coherence or violating constraints in extended contexts.
- Increasing reasoning effort yields modest gains for most models (e.g., 2.5% for GPT-5.1), but benefits diminish or reverse for some, and become more critical as context length increases.
- Context length consistently degrades performance across all models; solving rates drop from 25–35% at short lengths to 5–10% beyond 32K tokens, with GPT-5.1 showing relative robustness.
- Models frequently fail by ignoring or misapplying context (context misused >60% of failures), and format errors persist even in top models (e.g., 35–40% for GPT-5.1 and Claude Opus).
- Qualitative cases confirm models struggle with deep context integration: they may detect violations but fail to retrieve or apply relevant content, especially under complex or structured constraints.
- Removing context reduces GPT-5.1’s solving rate to 0.9%, proving CL-bench tasks require in-context learning and cannot be solved via pre-trained knowledge alone.
- Knowledge type and structure significantly affect performance: rule-based tasks (e.g., legal & regulatory) are easier than judgment-based ones (e.g., legal advisory), even within the same domain.
- Models often satisfy surface constraints (e.g., refusal, formatting) but fail on deeper requirements, treating constraints as terminal rather than reformulable, leading to incomplete or non-functional outputs.
Results show that even the strongest language models solve fewer than 24% of context learning tasks on average, with performance varying significantly across task types. Models consistently struggle most with empirical discovery and simulation tasks, which require inductive reasoning, while performing relatively better on domain knowledge and procedural tasks that rely on deductive application of explicit rules. The findings indicate that current models, despite advanced reasoning and long-context capabilities, still face fundamental limitations in reliably learning, retaining, and applying novel contextual knowledge under strict evaluation criteria.

The authors use a detailed rubric-based evaluation framework to assess language models on complex context learning tasks, with each task verified through multiple binary criteria designed by domain experts. Results show that even top models solve fewer than 24% of tasks on average, highlighting that context learning remains a significant challenge despite advances in long-context and reasoning capabilities. Performance varies notably across categories, with empirical discovery and simulation proving hardest, while domain knowledge and procedural tasks are relatively more tractable.

Results show that even the strongest language models solve fewer than 24% of tasks on average, highlighting that context learning remains a significant challenge despite advances in long-context and reasoning capabilities. Performance varies widely across subcategories, with models excelling in rule-based legal and procedural tasks but struggling with empirical discovery and simulation, where inductive reasoning from data proves particularly difficult. Most models frequently fail by misapplying or ignoring contextual information, and even top performers exhibit high rates of format and instruction-following errors, indicating that reliably learning and applying novel knowledge from context is still beyond current model capabilities.

Results show that even top-performing language models frequently fail to fully utilize or correctly apply contextual information, with context misuse and format errors dominating failure modes across all evaluated systems. The data confirms that long-context reasoning and strict instruction following remain significant bottlenecks, as models often overlook critical details or violate structural requirements despite possessing the necessary knowledge.
