AI Models Fail Safety Benchmarks in Lab Experiments, Study Warns
A new study published in Nature Machine Intelligence reveals significant safety risks in using artificial intelligence for laboratory work. Despite AI's growing role in scientific research, such as predicting protein structures, current large language models (LLMs) and vision-language models (VLMs) lack reliable knowledge of lab safety, raising serious concerns about their use in real-world experiments.

To assess AI performance in safety-critical scenarios, the researchers developed LabSafety Bench, a comprehensive benchmarking framework. It includes 765 multiple-choice questions, 404 realistic lab scenarios, and 3,128 open-ended tasks covering hazard identification, risk assessment, and consequence prediction across biology, chemistry, physics, and general laboratory settings.

The study evaluated 19 AI models: eight proprietary models, seven open-weight LLMs, and four open-weight VLMs. While top models such as GPT-4o (86.55% accuracy) and DeepSeek-R1 (84.49%) performed well on structured multiple-choice questions, they struggled with open-ended, scenario-based reasoning. Performance dropped sharply on critical safety topics such as radiation hazards, electrical safety, equipment misuse, and physical risks. Notably, no model achieved over 70% accuracy on hazard identification tasks.

In the Hazard Identification Test (HIT) and Consequence Identification Test (CIT), models performed more strongly in biology and physics but faltered on chemistry, cryogenic liquids, and general lab safety. Several models scored below 50% on identifying improper equipment operation, while even the weakest model managed 66.55% on common hazards, a spread that highlights how inconsistent and unreliable the models' responses are. The Vicuna series performed particularly poorly, with results near random guessing on text-only multiple-choice tasks. In vision-based tasks, InstructBLIP-7B, which is built on Vicuna-7B, likewise showed the weakest performance.

The researchers attempted to improve model safety through fine-tuning and retrieval-augmented generation (RAG), but the gains were limited and inconsistent. Training on specific safety subsets boosted performance by only 5–10%, indicating that current methods are insufficient.

The study identifies key failure modes: poor risk prioritization, hallucination of facts, and overfitting to training data. These flaws mean that even advanced AI models cannot be trusted to make safe decisions in lab environments. The authors stress that larger or newer models do not automatically perform better in safety contexts.

They urge the scientific community to adopt benchmarking tools like LabSafety Bench and to enforce strict human oversight when using AI in labs. While future improvements are expected, the findings underscore that AI should not be used independently in experimental settings involving hazardous materials. Until models demonstrate consistent, reliable safety knowledge, human judgment remains essential to prevent accidents, injuries, and potential disasters.
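To make the benchmark's multiple-choice evaluation more concrete, the sketch below shows how accuracy on a LabSafety Bench-style question set could be computed. The question schema, field names, and prompt format here are illustrative assumptions rather than the benchmark's actual release format, and `model_answer_fn` is a hypothetical stand-in for any model API.

```python
def score_mcq(model_answer_fn, questions):
    """Return the fraction of multiple-choice questions answered correctly.

    model_answer_fn: callable taking a prompt string and returning a letter
    choice such as "A" (hypothetical stand-in for a real model API).
    questions: list of dicts with "question", "choices" (list of option
    strings), and "answer" (the correct letter). This schema is an
    assumption for illustration, not the benchmark's published format.
    """
    correct = 0
    for q in questions:
        letters = "ABCD"[: len(q["choices"])]
        options = "\n".join(
            f"{letter}. {text}" for letter, text in zip(letters, q["choices"])
        )
        prompt = (
            f"{q['question']}\n{options}\n"
            "Answer with the letter of the safest choice only."
        )
        if model_answer_fn(prompt).strip().upper().startswith(q["answer"]):
            correct += 1
    return correct / len(questions)


if __name__ == "__main__":
    # Toy example: a dummy "model" that always answers "A" scores 0.0 here.
    sample = [
        {
            "question": "A solvent fire starts on the bench. What is the first step?",
            "choices": [
                "Pour water on it",
                "Use a CO2 extinguisher",
                "Open a window",
                "Fan the flames away",
            ],
            "answer": "B",
        }
    ]
    print(score_mcq(lambda prompt: "A", sample))  # 0.0
```

Reported figures like GPT-4o's 86.55% correspond to this kind of exact-match accuracy over the full question set, though the paper's precise prompting and answer-parsing details may differ.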

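The retrieval-augmented generation approach the researchers tried can likewise be illustrated with a minimal sketch. The study's actual pipeline is not described in this article, so the toy keyword retriever, the example safety notes, and the prompt template below are all assumptions, intended only to show the general idea of grounding a model's answer in a safety-document corpus.

```python
# Toy RAG sketch: retrieve the most relevant safety notes for a query and
# build a grounded prompt. The notes, retriever, and prompt wording are
# illustrative assumptions, not the study's pipeline.

SAFETY_NOTES = [
    "Cryogenic liquids such as liquid nitrogen can cause severe frostbite; "
    "wear a face shield and insulated gloves when dispensing.",
    "Never store oxidizers next to flammable solvents; keep them in "
    "separate, labeled cabinets.",
    "De-energize and lock out equipment before servicing to avoid "
    "electrical shock.",
]


def retrieve(query: str, notes: list[str], k: int = 2) -> list[str]:
    """Rank notes by naive keyword overlap with the query (toy retriever)."""
    q_words = set(query.lower().split())
    scored = sorted(notes, key=lambda n: -len(q_words & set(n.lower().split())))
    return scored[:k]


def build_prompt(query: str) -> str:
    """Assemble a prompt that grounds the answer in the retrieved notes."""
    context = "\n".join(f"- {note}" for note in retrieve(query, SAFETY_NOTES))
    return (
        "Use only the safety notes below to answer. If they do not cover "
        f"the question, say so.\n\nNotes:\n{context}\n\nQuestion: {query}"
    )


print(build_prompt("What protection do I need when handling liquid nitrogen?"))
```

Even with grounding of this kind, the study found only limited, inconsistent gains, which is consistent with its conclusion that retrieval alone does not fix poor risk prioritization or hallucination.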