Command Palette
Search for a command to run...
LifeSciBench: تقييم نماذج اللغات (LLMs) على مهام واقعية ومستوى خبراء في علوم الحياة
LifeSciBench: تقييم نماذج اللغات (LLMs) على مهام واقعية ومستوى خبراء في علوم الحياة
الملخص
نُقدّم LifeSciBench، وهو معيار (benchmark) يتألف من 750 مهمة من إعداد خبراء، صُمم لتقييم ما إذا كانت نماذج اللغة (LLMs) قادرة على التعامل مع أبحاث علوم الحياة الواقعية. وفي الوقت الراهن، تفشل الغالبية العظمى من معايير التقييم في مجال العلوم البيولوجية في التقاط تعقيد الأعمال البحثية على المستوى المتقدم؛ حيث تكون الأسئلة عادةً محدودة النطاق ومعتمدة بالكامل على المعرفة، بينما تتميز الأعمال الواقعية بالطابع الغامض وتتطلب العديد من أحكام التقدير. علاوة على ذلك، تقتصر المعايير الحالية، في أحسن الأحوال، على مجموعة صغيرة من المجالات العلمية. ولا يوجد حالياً أي معيار في علوم الحياة يجمع بين الاتساع والعمق اللازمين لقياس الكفاءة بشكل مقنع في البيئات المهنية الواقعية. ويعالج LifeSciBench هذه الفجوة من خلال تغطية سبع سير عمل علمية وسبع مجالات في علوم الحياة، مع ربط كل مهمة بمعايير تقييم (rubric) من إعداد خبراء. وعلى الرغم من الأداء الأفضل الذي حققه نموذج GPT-Rosalind عبر مجموعة من النماذج المتطورة والمتخصصة في المجال، حيث حقق معدل موحد مرجحاً للمشكلات يبلغ 0.576 ومعدل نجاح في المهام يبلغ 36.1%، يبقى المعيار بعيداً عن الشبع؛ إذ لم ينجح أي نموذج في اجتياز 171 مهمة (أي 22.8%)، بينما كان معدل النجاح للمهمة الأعلى أداءً أقل من 20% في 261 مهمة (أي 34.8%). ويؤدي LifeSciBench دور مقياس عالي الدقة للاستدلال العلمي العملي واتخاذ القرارات التشغيلية في مجال البيولوجيا.
One-sentence Summary
LifeSciBench, a benchmark of 750 expert-authored tasks spanning seven scientific workflows and seven life science domains with expert-written rubrics, evaluates realistic research reasoning; among five tested models, GPT-Rosalind achieves the highest problem-weighted normalized score (0.576) and task pass rate (36.1%), yet 22.8% of tasks remain unsolved and 34.8% have a best-model pass rate below 20%, highlighting the benchmark’s difficulty.
Key Contributions
- LifeSciBench is introduced as an expert-authored benchmark of 750 problems spanning seven scientific workflows and seven life science domains, designed to capture the complexity of practical research tasks.
- Expert-written rubrics totaling 19,020 criteria are provided, evaluating response quality beyond final correctness by assessing scientific reasoning, evidence use, and communication.
- Benchmarking five frontier and domain-specialized models reveals that the best system, GPT-Rosalind, achieves a problem-weighted normalized score of 0.576 and a 36.1% task pass rate, with 22.8% of tasks unsolved by any model.
Introduction
The authors address the growing need for AI systems that can operate as true scientific collaborators in the life sciences, where progress depends not only on factual knowledge but also on reasoned judgment amid uncertainty, experimental design, and precise decision-making. Prior benchmarks largely isolate fact retrieval or confine evaluation to computational biology workflows, providing clean reference answers that overlook the open-ended, artifact-rich, and ambiguous nature of real research tasks. They generally fail to assess whether a model can weigh imperfect evidence, justify next steps, and produce actionable outputs at an expert level. To close this gap, the authors introduce LifeSciBench, a benchmark of 750 expert-authored problems spanning multiple biological domains and scientific workflows. Each task is accompanied by detailed rubrics that measure the validity of reasoning and communication, not just final correctness, enabling a more realistic evaluation of frontier models on applied life-science work.
Dataset
LifeSciBench: A Benchmark for Realistic Life Science Research Tasks
The authors introduce LifeSciBench, an expert-authored evaluation benchmark containing 750 free-response tasks that measure whether language models can perform realistic, multi-step life science work. Unlike typical factual knowledge benchmarks, LifeSciBench emphasizes practical scientific reasoning, evidence use, and decision-making under uncertainty.
Dataset Composition and Sources
- All tasks are created by 173 expert scientists who hold a Ph.D. in a relevant discipline and have at least two years of industry experience in biotech or pharma.
- The benchmark is organized along three taxonomies: seven workflow categories, seven biological/scientific domains, and multiple data source/evidence types.
- Each task consists of a natural-language prompt, optional supporting artifacts (SMILES, sequences, tabular data, PDFs, instrument outputs, microscopy images, etc.), and a fine-grained expert-written rubric for scoring.
Key Details for Each Subset
- Workflow categories (from a practitioner-informed taxonomy): e.g., Experimental Design & Troubleshooting, Data Analysis & Interpretation, Evidence Synthesis & Communication, Mechanistic Reasoning, Protocol & Instruction Following, Quantitative & Computational Analysis, Literature & Context Integration. Exact category names are in the paper’s Table 1; the benchmark balances broadly relevant capabilities with domain-specific expertise.
- Biological domains: seven categories such as biochemistry, molecular biology, neuroscience, immunology, pharmacology, computational biology, and translational/clinical domains (see Table 5). Tasks cover computational and experimental contexts.
- Evidence types: artifacts range from molecular representations and sequences to tabular datasets, PDFs, raw instrument outputs, microscopy images, gel images, and experimental figures.
- No separate train/validation/test splits: the full 750-task set is used exclusively for evaluation.
How the Paper Uses the Data
- LifeSciBench is used as an evaluation-only benchmark; there is no training split or mixture ratio.
- Models are tested in a single-turn setting: they receive the prompt and any artifacts once and must produce one final response. No multi-turn clarification or iterative feedback is permitted.
- Scoring uses the task-specific rubrics. Each rubric contains multiple criteria (over 19,000 total, averaging 25 per task) that reward correct facts, explicit reasoning steps, proper use of evidence, scientific caveats, and communication quality. Final scores are computed by summing awarded points and dividing by total rubric points. This allows partial credit for valid intermediate reasoning.
Processing and Quality Control
- Task formulation: Experts write questions as a scientist would pose to a knowledgeable colleague, ranging from single-answer queries to multi-step analytical tasks.
- Artifact handling: Attachments are provided exactly as they would appear in research settings; no additional preprocessing or cropping is described. Models must interpret heterogeneous formats directly.
- Review process: All tasks underwent a multi-stage expert review with no cap on revision rounds (averaged six automated review cycles and at least two rounds of expert reviews). Reviews checked question-rubric consistency, scientific ambition (multi-step reasoning beyond memorization), factual accuracy supported by evidence or expert consensus, and spelling/grammar/formatting. Rubric answers required at least 90% expert agreement.
- Metadata construction: Tasks are stratified by workflow, biological domain, and evidence type for performance analysis. A workflow-by-domain heatmap (Appendix B) visualises coverage.
The benchmark is designed to evaluate models under realistic scientific constraints, not to serve as a training corpus. Performance is reported via problem-weighted normalized scores and task pass rates, with the top model (GPT-Rosalind) achieving 0.576 normalized score and 36.1% pass rate, leaving ample room for progress.
Method
The authors leverage a structured pipeline known as LifeSciBench to evaluate large language models on complex life science tasks. This framework transitions from expert-authored tasks to stratified model analysis through a series of distinct modules.
As shown in the figure below:

The process initiates with expert task authoring, where domain specialists design challenging scientific problems. These tasks are compiled into task instances that comprise a prompt, relevant artifacts, and a detailed scoring rubric. The construction of these rubrics follows strict principles to ensure rigorous evaluation. Each criterion must demonstrate specificity by describing a concrete property of the response and atomicity by evaluating a single claim, calculation, or constraint. Additionally, the criteria are designed for evaluability, meaning they can be assessed as satisfied or not satisfied based solely on the model response. The rubrics also require grounding in the task prompt or expert consensus, non-redundancy to prevent double-counting, and operational usefulness to reward responses that are practically helpful for scientific decision-making.
Following the creation of task instances, an expert review phase validates the quality and accuracy of the materials. Once approved, the task instances are fed to the model to generate a response. For instance, a model might be asked to critically review a DNA methylation analysis, requiring it to identify methodological errors such as inappropriate probe quality control or invalid probe-pair averaging.
The generated model responses then undergo rubric grading, where the predefined criteria are applied to score the output. This systematic grading allows for a quantitative assessment of model performance. Finally, the framework facilitates stratified analysis, enabling researchers to dissect and understand model capabilities across different scientific domains and task complexities based on the graded results.
Experiment
Independent expert validation confirmed that LifeSciBench tasks strongly reflect real-world life science work, with high ratings for scientific reasoning, grounding, and overall usefulness. Evaluation of frontier models, including a domain-specialized system, showed modest overall pass rates, with relative strengths in scientific synthesis and expert-facing judgment but pronounced weaknesses on artifact-heavy tasks and those demanding exact or construct-level outputs. Aggregate rankings obscured complementary task-level strengths among systems, and even the strongest model frequently made partial scientific progress without meeting full task requirements, underscoring that reliability under realistic research constraints remains a key gap. Substantial benchmark headroom, with more than half of tasks having a best-model pass rate below 50%, positions LifeSciBench as a challenging measure for future capability advances.