P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads

Abstract

The transition of large language models (LLMs) from symbolic manipulation to science-grade reasoning is a critical upcoming frontier, and physics serves as a key testbed for binding abstract logic to physical reality. Physics demands that models remain physically consistent with the laws governing the universe, which inherently requires multimodal perception to ground abstract logic in reality. In Olympiad-level problems, diagrams are not merely auxiliary representations: they often encode essential constraints absent from the text, such as boundary conditions and spatial symmetries, and thereby define the essence of the problem. To bridge this visual-logical gap, we present P1-VL, an open-source family of vision-language models designed for advanced scientific reasoning. Our approach integrates Curriculum Reinforcement Learning, which uses progressive difficulty scaling to stabilize post-training, with Agentic Augmentation, which enables iterative self-verification at inference time. Evaluated on HiPhO, a rigorous benchmark comprising 13 exams from 2024-2025, our flagship model P1-VL-235B-A22B becomes the first open-source vision-language model (VLM) to win 12 gold medals, achieving state-of-the-art performance among open-source models. With agentic augmentation, our system ranks second overall worldwide, surpassed only by Gemini-3-Pro. Beyond physics, P1-VL exhibits remarkable scientific reasoning and generalization, outperforming its base models by a large margin on STEM benchmarks. By open-sourcing P1-VL, we provide a foundational step toward general physical intelligence that better integrates visual perception with abstract physical laws for machine-driven scientific discovery.

One-sentence Summary

Developed by Shanghai AI Laboratory’s P1 Team, P1-VL is the first open-source vision-language model family to achieve 12 gold medals in physics Olympiads by fusing curriculum reinforcement learning with agentic self-verification, uniquely bridging diagrams and physical laws for multimodal scientific reasoning.

Key Contributions

  • P1-VL addresses the critical gap in physics reasoning by integrating visual perception with abstract logic, specifically targeting Olympiad problems where diagrams encode essential constraints absent in text, thus enabling grounded scientific reasoning.
  • The model leverages Curriculum Reinforcement Learning with progressive difficulty scaling and Agentic Augmentation for iterative self-verification at inference, stabilizing training and enhancing reasoning fidelity beyond standard fine-tuning approaches.
  • Evaluated on the HiPhO benchmark of 13 exams from 2024–2025, P1-VL-235B-A22B achieves 12 gold medals and ranks No.2 globally when augmented with the PhysicsMinions agent framework, outperforming all other open-source VLMs and demonstrating strong generalization across STEM tasks.

Introduction

The authors leverage physics Olympiad problems as a high-stakes testbed to push Large Language Models beyond symbolic reasoning into grounded, multimodal scientific understanding. Prior work largely ignores the critical role of diagrams in physics—where visuals encode essential constraints absent in text—limiting models to incomplete, text-only reasoning. To bridge this gap, they introduce P1-VL, an open-source family of vision-language models trained via Curriculum Reinforcement Learning to progressively master complex reasoning, and augmented with an agent framework that enables iterative self-correction at inference. Their flagship model achieves 12 gold medals on the HiPhO benchmark and, with agentic augmentation, ranks second globally among all models; it also demonstrates strong generalization across STEM domains, setting a new standard for open-source physical intelligence.

Dataset

  • The authors use a curated multimodal physics dataset of 8,033 problems, drawn from three sources: 4,126 from physics Olympiads (including IPhO and APhO up to 2023), 2,968 from undergraduate textbooks, and 939 from competition guides. These sources were selected to balance conceptual depth, visual richness, and verifiable solutions.

  • Each subset was processed through a multi-stage pipeline: OCR correction for scanned materials, model-based answer extraction (using Gemini-2.5-Flash, Claude-3.7-Sonnet, and GPT-4o) with majority-vote consensus, filtering of diagram-generation or open-ended tasks, visual consistency checks via Gemini-2.5-Flash, and final expert review. This reduced the initial 13,432 items to 8,033 high-fidelity, bilingual samples.

  • For training, the authors apply curriculum reinforcement learning. They first estimate problem difficulty from Qwen3-VL-30B-A3B’s pass rate across 72 rollouts. They remove trivial samples (pass rate > 0.7) and recover zero-shot failures (pass rate = 0.0) using Gemini-2.5-Flash for verification and refinement. Training proceeds in stages, progressively lowering the difficulty threshold and expanding the group size and generation window to maintain search depth; a minimal sketch of this filtering step follows this list.

  • The dataset is used exclusively for RLVR training, with no test data from the same sources. Evaluation is performed on HiPhO, a separate benchmark of 13 recent Olympiad exams (2024–2025), using Gemini-2.5-Flash as an automated grader that scores both final answers and reasoning steps, mirroring human grading to enable medal-threshold comparisons.
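
A minimal sketch of the difficulty-based curriculum filtering, assuming each problem carries a pass_rate precomputed from the 72 rollouts of Qwen3-VL-30B-A3B; the function name, field names, and stage thresholds below are illustrative, not the authors' code.

    # Minimal sketch of the curriculum filtering step (illustrative names and thresholds).
    def curriculum_pools(problems, stage_thresholds=(0.5, 0.3, 0.0)):
        """Return one training pool per stage plus the zero-shot failures.

        Stage s admits every non-trivial problem whose pass_rate is at least
        stage_thresholds[s]; lowering the threshold stage by stage expands the
        pool toward harder items, mirroring the progressive difficulty scaling
        described above.
        """
        usable = [p for p in problems if 0.0 < p["pass_rate"] <= 0.7]   # drop trivial items
        recovered = [p for p in problems if p["pass_rate"] == 0.0]      # verified and refined externally
        pools = [[p for p in usable if p["pass_rate"] >= t] for t in stage_thresholds]
        return pools, recovered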

Method

The authors leverage a reinforcement learning (RL) framework to train vision-language models for solving complex Physics Olympiad problems, formulating the task as a Markov Decision Process (MDP) where the state space encompasses the problem context and generated reasoning tokens, and the action space corresponds to the discrete vocabulary of output tokens. The policy is optimized to maximize the expected return, computed as the sum of scalar rewards over a trajectory, with the reward signal derived from the correctness of the final answer relative to ground truth. To stabilize training and improve sample efficiency, they adopt Group Sequence Policy Optimization (GSPO), which operates at the sequence level rather than the token level, employing length-normalized importance ratios to reduce variance and a clipped objective to constrain policy updates.
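
Written out, the sequence-level objective sketched above takes the following form; this is the standard GSPO formulation with group-normalized advantages, reconstructed here for reference rather than quoted from the paper:

    s_i(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)} \right)^{1/|y_i|},
    \qquad
    \hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_j\}_{j=1}^{G})}{\mathrm{std}(\{r_j\}_{j=1}^{G})}

    J_{\mathrm{GSPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\Big( s_i(\theta)\, \hat{A}_i,\ \mathrm{clip}\big( s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon \big)\, \hat{A}_i \Big) \right]

where G is the group size, r_i is the scalar reward of the i-th sampled trajectory, and |y_i| is its length in tokens; the exponent 1/|y_i| is the length normalization, and the clip term constrains the policy update as described above.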

Refer to the framework diagram, which illustrates the end-to-end data pipeline from raw problem sources to final training data. The process begins with PDF conversion to Markdown, followed by QA parsing to extract question-answer pairs. Human annotators then perform answer annotation, ensuring the solutions adhere to a structured format—such as LaTeX for symbolic expressions and boxed final answers—as specified in the system prompt. Post-processing includes language transformation and expert review, culminating in multi-modal training data that pairs questions, answers, associated images, and metadata. This structured pipeline ensures that the model learns to generate verifiable, format-compliant solutions while handling both textual and visual inputs.
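
For concreteness, one record of the resulting multi-modal training data might be shaped as follows; the field names are hypothetical and not taken from the released dataset.

    from dataclasses import dataclass, field

    # Hypothetical shape of one multimodal training record produced by the pipeline.
    @dataclass
    class TrainingSample:
        question: str                                      # problem statement (Markdown with LaTeX)
        answer: str                                        # reference solution ending in a boxed final answer
        images: list[str] = field(default_factory=list)    # paths to associated diagrams
        metadata: dict = field(default_factory=dict)       # source, language, difficulty estimate, etc.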

To address the train-inference mismatch inherent in distributed RL training, the authors implement Sequence-level Masked Importance Sampling (Seq-MIS), which rejects entire trajectories whose geometric mean of importance weights exceeds a threshold, thereby enforcing a hard trust region. This mechanism mitigates gradient bias introduced by discrepancies between rollout and training engines. The training dynamics are further stabilized through curriculum learning, where complexity is progressively increased by scaling data difficulty, expanding group sizes, and extending generation windows across stages. The VERL framework is used for implementation, with vision encoders and projection layers frozen during RL training to preserve pre-trained visual representations, while the language model parameters are fine-tuned.
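
A minimal sketch of the Seq-MIS rejection rule, assuming per-token log-probabilities of the sampled tokens are available from both the training and rollout engines; the function name and default threshold are illustrative.

    import math

    def keep_trajectory(logp_train, logp_rollout, threshold=2.0):
        """Sequence-level masked importance sampling: drop a whole trajectory when
        the geometric mean of its per-token importance weights exceeds the threshold.
        """
        n = len(logp_train)
        # log of the geometric mean of w_t = pi_train(t) / pi_rollout(t)
        log_geo_mean = sum(lt - lr for lt, lr in zip(logp_train, logp_rollout)) / n
        ratio = math.exp(log_geo_mean)
        # The paper describes rejection when the ratio exceeds a threshold; a symmetric
        # lower bound (ratio < 1/threshold) could be enforced analogously.
        return ratio <= threshold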

At inference time, the system is augmented with PhysicsMinions, a multi-agent framework comprising Visual, Logic, and Review Studios. The Visual Studio processes diagrams and converts them into symbolic representations, enabling grounded reasoning. The Logic Studio iteratively refines solutions via solver-introspector collaboration, while the Review Studio validates outputs using domain-specific verifiers. This agentic loop supports scalable, robust reasoning on multimodal problems, with domain-adaptive mechanisms routing problems to appropriate solvers and verifiers based on detected scientific discipline.
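
The overall loop can be pictured as follows; this is a conceptual sketch of the described solve-introspect-verify workflow, with all class and method names as hypothetical stand-ins for the Visual, Logic, and Review Studios rather than the PhysicsMinions API.

    # Conceptual sketch of the agentic loop; every interface below is hypothetical.
    def solve_with_studios(problem, visual, logic, review, max_rounds=4):
        scene = visual.describe(problem.images)                  # diagrams -> symbolic scene description
        draft = logic.solve(problem.text, scene, critique=None)  # initial solution attempt
        for _ in range(max_rounds):
            verdict = review.verify(problem, draft)              # domain-specific verification
            if verdict.accepted:
                return draft
            critique = logic.introspect(draft, verdict.feedback) # solver-introspector collaboration
            draft = logic.solve(problem.text, scene, critique=critique)
        return draft                                             # best effort after max_rounds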

Experiment

  • P1-VL models trained via reinforcement learning achieve top-tier performance on physics Olympiads, outperforming many closed-source models and demonstrating superior visual-scientific reasoning without agent augmentation.
  • Agent-augmented P1-VL systems surpass even top closed-source models, setting new benchmarks across multiple Olympiads and validating the “model + system” paradigm for complex scientific tasks.
  • P1-VL models generalize well beyond physics, showing strong transfer to diverse STEM benchmarks including math and multi-modal reasoning, with consistent gains over base models and minimal catastrophic forgetting.
  • Training stability is achieved through Sequence-Level Masked Importance Sampling, which mitigates train-inference mismatch and prevents RL collapse observed with other sampling methods.
  • Mixed training data (text-only + image-text) enhances performance without negative transfer, supporting the use of heterogeneous data for robust multimodal training.
  • Curriculum-based RL training significantly improves reasoning depth and response length, proving essential for developing advanced scientific reasoning capabilities.
  • RL training is broadly effective across model architectures, including InternVL series, confirming its generalizability for unlocking latent scientific reasoning in diverse base models.

The authors use a physics Olympiad benchmark to evaluate models trained with reinforcement learning, showing that their P1-VL series outperforms both open-source and closed-source baselines in scientific reasoning, even without agent augmentation. When combined with multi-agent systems, the models achieve state-of-the-art results across multiple competitions, demonstrating that structured training and system-level collaboration significantly enhance complex problem-solving. The models also generalize well to other STEM domains, improving performance on text-only and multi-modal tasks while maintaining robust visual reasoning capabilities.

The authors use reinforcement learning to train P1-VL models, achieving top-tier performance on physics Olympiads and demonstrating strong generalization to other STEM domains. Results show that even smaller variants outperform larger baselines, and combining the model with agent frameworks further boosts scores, highlighting the value of integrated system design. The models also retain and enhance reasoning capabilities across text-only and multi-modal tasks, indicating effective vision-language alignment without catastrophic forgetting.

The authors use a physics problem involving fluid mechanics and atmospheric pressure to evaluate model reasoning, requiring integration of visual schematics, tabular data, and symbolic equations. Results show the model correctly identifies behavioral regimes and computes precise force values across experiments, demonstrating robust alignment between visual perception and scientific calculation. This case underscores the model’s ability to sustain multi-step reasoning under real-world physical constraints without hallucinating invalid assumptions.

The authors use reinforcement learning to train P1-VL models, achieving top-tier performance on physics Olympiad benchmarks, with the largest variant ranking third among all models and outperforming several closed-source systems even without agent augmentation. When combined with the PhysicsMinions agent framework, the model climbs to second place globally, setting new state-of-the-art scores on multiple competitions and demonstrating that multi-agent collaboration significantly enhances complex scientific reasoning. Results also show strong generalization beyond physics, with consistent gains over base models across diverse STEM and multimodal benchmarks, indicating that domain-specific training does not compromise but rather amplifies broader reasoning capabilities.

