HyperAIHyperAI

Command Palette

Search for a command to run...

LifeSciBench : Évaluer les Language Models sur des tâches réalistes et de niveau expert en sciences de la vie

Résumé

Nous présentons LifeSciBench, un benchmark composé de 750 tâches rédigées par des experts, conçu pour évaluer la capacité des modèles de langage (LLMs) à traiter des travaux de recherche en sciences de la vie conformes à la pratique réelle. À l’heure actuelle, la grande majorité des benchmarks en biologie ne rendent pas compte de la complexité des travaux de niveau recherche : les questions y sont généralement circonscrites à un champ restreint et purement factuelles, tandis que les travaux en contexte réel font souvent appel à une ambiguïté inhérente et nécessitent de multiples arbitrages décisionnels. Par ailleurs, la plupart des benchmarks existants se limitent, dans le meilleur des cas, à un petit nombre de domaines scientifiques. Il n’existe aucun benchmark en sciences de la vie offrant à la fois l’étendue et la profondeur nécessaires pour mesurer de manière convaincante la compétence dans des contextes professionnels concrets. LifeSciBench comble cette lacune en couvrant sept workflows scientifiques et sept domaines des sciences de la vie, chaque tâche étant accompagnée d’une grille d’évaluation (rubric) rédigée par un expert. Sur cinq modèles de pointe et spécialisés par domaine, GPT-Rosalind obtient les meilleurs résultats, avec un score normalisé pondéré par la difficulté des tâches de 0,576 et un taux de réussite de 36,1 % par tâche. Toutefois, le benchmark reste loin d’être saturé : aucun modèle n’a réussi 171 tâches (22,8 %), et 261 tâches (34,8 %) affichent un taux de réussite du meilleur modèle inférieur à 20 %. LifeSciBench constitue ainsi une évaluation à haute résolution du raisonnement scientifique pratique et de la prise de décision opérationnelle en biologie.

One-sentence Summary

LifeSciBench, a benchmark of 750 expert-authored tasks spanning seven scientific workflows and seven life science domains with expert-written rubrics, evaluates realistic research reasoning; among five tested models, GPT-Rosalind achieves the highest problem-weighted normalized score (0.576) and task pass rate (36.1%), yet 22.8% of tasks remain unsolved and 34.8% have a best-model pass rate below 20%, highlighting the benchmark’s difficulty.

Key Contributions

  • LifeSciBench is introduced as an expert-authored benchmark of 750 problems spanning seven scientific workflows and seven life science domains, designed to capture the complexity of practical research tasks.
  • Expert-written rubrics totaling 19,020 criteria are provided, evaluating response quality beyond final correctness by assessing scientific reasoning, evidence use, and communication.
  • Benchmarking five frontier and domain-specialized models reveals that the best system, GPT-Rosalind, achieves a problem-weighted normalized score of 0.576 and a 36.1% task pass rate, with 22.8% of tasks unsolved by any model.

Introduction

The authors address the growing need for AI systems that can operate as true scientific collaborators in the life sciences, where progress depends not only on factual knowledge but also on reasoned judgment amid uncertainty, experimental design, and precise decision-making. Prior benchmarks largely isolate fact retrieval or confine evaluation to computational biology workflows, providing clean reference answers that overlook the open-ended, artifact-rich, and ambiguous nature of real research tasks. They generally fail to assess whether a model can weigh imperfect evidence, justify next steps, and produce actionable outputs at an expert level. To close this gap, the authors introduce LifeSciBench, a benchmark of 750 expert-authored problems spanning multiple biological domains and scientific workflows. Each task is accompanied by detailed rubrics that measure the validity of reasoning and communication, not just final correctness, enabling a more realistic evaluation of frontier models on applied life-science work.

Dataset

LifeSciBench: A Benchmark for Realistic Life Science Research Tasks

The authors introduce LifeSciBench, an expert-authored evaluation benchmark containing 750 free-response tasks that measure whether language models can perform realistic, multi-step life science work. Unlike typical factual knowledge benchmarks, LifeSciBench emphasizes practical scientific reasoning, evidence use, and decision-making under uncertainty.

Dataset Composition and Sources

  • All tasks are created by 173 expert scientists who hold a Ph.D. in a relevant discipline and have at least two years of industry experience in biotech or pharma.
  • The benchmark is organized along three taxonomies: seven workflow categories, seven biological/scientific domains, and multiple data source/evidence types.
  • Each task consists of a natural-language prompt, optional supporting artifacts (SMILES, sequences, tabular data, PDFs, instrument outputs, microscopy images, etc.), and a fine-grained expert-written rubric for scoring.

Key Details for Each Subset

  • Workflow categories (from a practitioner-informed taxonomy): e.g., Experimental Design & Troubleshooting, Data Analysis & Interpretation, Evidence Synthesis & Communication, Mechanistic Reasoning, Protocol & Instruction Following, Quantitative & Computational Analysis, Literature & Context Integration. Exact category names are in the paper’s Table 1; the benchmark balances broadly relevant capabilities with domain-specific expertise.
  • Biological domains: seven categories such as biochemistry, molecular biology, neuroscience, immunology, pharmacology, computational biology, and translational/clinical domains (see Table 5). Tasks cover computational and experimental contexts.
  • Evidence types: artifacts range from molecular representations and sequences to tabular datasets, PDFs, raw instrument outputs, microscopy images, gel images, and experimental figures.
  • No separate train/validation/test splits: the full 750-task set is used exclusively for evaluation.

How the Paper Uses the Data

  • LifeSciBench is used as an evaluation-only benchmark; there is no training split or mixture ratio.
  • Models are tested in a single-turn setting: they receive the prompt and any artifacts once and must produce one final response. No multi-turn clarification or iterative feedback is permitted.
  • Scoring uses the task-specific rubrics. Each rubric contains multiple criteria (over 19,000 total, averaging 25 per task) that reward correct facts, explicit reasoning steps, proper use of evidence, scientific caveats, and communication quality. Final scores are computed by summing awarded points and dividing by total rubric points. This allows partial credit for valid intermediate reasoning.

Processing and Quality Control

  • Task formulation: Experts write questions as a scientist would pose to a knowledgeable colleague, ranging from single-answer queries to multi-step analytical tasks.
  • Artifact handling: Attachments are provided exactly as they would appear in research settings; no additional preprocessing or cropping is described. Models must interpret heterogeneous formats directly.
  • Review process: All tasks underwent a multi-stage expert review with no cap on revision rounds (averaged six automated review cycles and at least two rounds of expert reviews). Reviews checked question-rubric consistency, scientific ambition (multi-step reasoning beyond memorization), factual accuracy supported by evidence or expert consensus, and spelling/grammar/formatting. Rubric answers required at least 90% expert agreement.
  • Metadata construction: Tasks are stratified by workflow, biological domain, and evidence type for performance analysis. A workflow-by-domain heatmap (Appendix B) visualises coverage.

The benchmark is designed to evaluate models under realistic scientific constraints, not to serve as a training corpus. Performance is reported via problem-weighted normalized scores and task pass rates, with the top model (GPT-Rosalind) achieving 0.576 normalized score and 36.1% pass rate, leaving ample room for progress.

Method

The authors leverage a structured pipeline known as LifeSciBench to evaluate large language models on complex life science tasks. This framework transitions from expert-authored tasks to stratified model analysis through a series of distinct modules.

As shown in the figure below:

The process initiates with expert task authoring, where domain specialists design challenging scientific problems. These tasks are compiled into task instances that comprise a prompt, relevant artifacts, and a detailed scoring rubric. The construction of these rubrics follows strict principles to ensure rigorous evaluation. Each criterion must demonstrate specificity by describing a concrete property of the response and atomicity by evaluating a single claim, calculation, or constraint. Additionally, the criteria are designed for evaluability, meaning they can be assessed as satisfied or not satisfied based solely on the model response. The rubrics also require grounding in the task prompt or expert consensus, non-redundancy to prevent double-counting, and operational usefulness to reward responses that are practically helpful for scientific decision-making.

Following the creation of task instances, an expert review phase validates the quality and accuracy of the materials. Once approved, the task instances are fed to the model to generate a response. For instance, a model might be asked to critically review a DNA methylation analysis, requiring it to identify methodological errors such as inappropriate probe quality control or invalid probe-pair averaging.

The generated model responses then undergo rubric grading, where the predefined criteria are applied to score the output. This systematic grading allows for a quantitative assessment of model performance. Finally, the framework facilitates stratified analysis, enabling researchers to dissect and understand model capabilities across different scientific domains and task complexities based on the graded results.

Experiment

Independent expert validation confirmed that LifeSciBench tasks strongly reflect real-world life science work, with high ratings for scientific reasoning, grounding, and overall usefulness. Evaluation of frontier models, including a domain-specialized system, showed modest overall pass rates, with relative strengths in scientific synthesis and expert-facing judgment but pronounced weaknesses on artifact-heavy tasks and those demanding exact or construct-level outputs. Aggregate rankings obscured complementary task-level strengths among systems, and even the strongest model frequently made partial scientific progress without meeting full task requirements, underscoring that reliability under realistic research constraints remains a key gap. Substantial benchmark headroom, with more than half of tasks having a best-model pass rate below 50%, positions LifeSciBench as a challenging measure for future capability advances.


Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA
GPU prêts à l’emploi
Tarifs les plus avantageux

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour
Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin
Propulsé par MailChimp