
A Controllable Examination for Long-Context Language Models

Yijun Yang, Zeyu Huang, Wenhao Zhu, Zihan Qiu, Fei Yuan, Jeff Z. Pan, Ivan Titov
Published: 6/5/2025

Abstract

Existing frameworks for evaluating long-context language models (LCLMs) can be broadly categorized into real-world and synthetic tasks. Despite their utility, both approaches are accompanied by certain intrinsic limitations. Real-world tasks are too complex to interpret or characterize and are susceptible to data contamination. In contrast, synthetic tasks often adopt the needle-in-the-haystack (NIAH) format, wherein a lack of coherence between the "needle" and the "haystack" compromises their validity as proxies for realistic applications. In response to these challenges, we posit that an ideal long-context evaluation framework should be characterized by three essential features: seamless context, controllable setting, and sound evaluation. This study introduces LongBioBench, a novel benchmark that utilizes artificially generated biographies as a controlled environment for assessing LCLMs across the dimensions of understanding, reasoning, and trustworthiness. Our experimental evaluation, which includes 18 LCLMs in total, demonstrates that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results and become less trustworthy as context length increases. Our further analysis indicates that some design choices employed by existing synthetic benchmarks, such as contextual non-coherence, numerical needles, and the absence of distractors, render them inadequate for testing models' long-context capabilities. Moreover, we also reveal that long-context continual pretraining primarily adjusts RoPE embeddings to accommodate extended context lengths. To sum up, compared to previous synthetic benchmarks, LongBioBench achieves a better trade-off between mirroring authentic language tasks and maintaining controllability, and it is highly interpretable and configurable.
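
To make the idea of a biography-based, controllable long-context probe concrete, here is a minimal illustrative sketch (not the authors' released code; all names, attribute lists, and sizes are hypothetical). It assembles a coherent "haystack" of synthetic biographies and asks a factual question about one of them, so the needle is semantically of a piece with its surrounding context and the gold answer is known by construction.

```python
import random

# Hypothetical attribute pools for generating toy biographies.
FIRST_NAMES = ["Alice", "Boris", "Chen", "Dara", "Elif"]
CITIES = ["Lisbon", "Osaka", "Nairobi", "Quito", "Tallinn"]
JOBS = ["architect", "marine biologist", "violinist", "surgeon", "cartographer"]

def make_bio(rng: random.Random) -> tuple[str, dict]:
    """Generate one short synthetic biography plus its ground-truth facts."""
    facts = {
        "name": f"{rng.choice(FIRST_NAMES)} {rng.randint(100, 999)}",  # unique-ish tag
        "city": rng.choice(CITIES),
        "job": rng.choice(JOBS),
        "birth_year": rng.randint(1940, 2000),
    }
    bio = (f"{facts['name']} was born in {facts['birth_year']} in {facts['city']}. "
           f"After university, {facts['name']} worked as a {facts['job']}.")
    return bio, facts

def build_example(num_bios: int = 200, seed: int = 0) -> tuple[str, str, str]:
    """Build a coherent biography haystack, a question, and its gold answer."""
    rng = random.Random(seed)
    bios, all_facts = zip(*(make_bio(rng) for _ in range(num_bios)))
    target = rng.choice(all_facts)            # the "needle" person
    context = "\n\n".join(bios)               # every passage is in-distribution
    question = f"In which city was {target['name']} born?"
    return context, question, target["city"]

if __name__ == "__main__":
    ctx, question, gold = build_example()
    print(len(ctx.split()), "context words |", question, "| gold:", gold)
```

Because the generator controls every fact, context length, distractor count, and needle position can be varied independently, which is the kind of configurability the abstract attributes to LongBioBench.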