P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads

Abstract

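The transition from symbol manipulation to science-grade reasoning is a key frontier for Large Language Models (LLMs), and physics serves as a decisive testbed for tying abstract logic to physical reality. Physics requires a model to stay consistent with the laws that govern the universe, which means multimodal perception is fundamentally needed to ground abstract logic in reality. At the Olympiad level, diagrams go beyond mere illustration: they carry essential constraints, such as boundary conditions and spatial symmetries, that do not appear in the text. To bridge this visual-logical gap, we introduce P1-VL, an open-source vision-language model for advanced scientific reasoning. Our approach combines Curriculum Reinforcement Learning with Agentic Augmentation, stabilizing post-training and enabling iterative self-verification at inference. Evaluated on HiPhO, a rigorous benchmark of 13 exams from 2024–2025, our flagship model P1-VL-235B-A22B becomes the first open-source Vision-Language Model (VLM) to win 12 gold medals and achieves the best performance among open-source models. The agent-augmented system achieves the second-highest overall rank globally, surpassed by only one model, Gemini-3-Pro. Beyond physics, P1-VL demonstrates strong scientific reasoning and generalization, with substantial gains over its base models on STEM benchmarks. By open-sourcing P1-VL, we offer a foundational step toward general-purpose physical intelligence that better aligns visual perception with abstract physical laws for machine scientific discovery.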

One-sentence Summary

Developed by Shanghai AI Laboratory’s P1 Team, P1-VL is the first open-source vision-language model family to achieve 12 gold medals in physics Olympiads by fusing curriculum reinforcement learning with agentic self-verification, uniquely bridging diagrams and physical laws for multimodal scientific reasoning.

Key Contributions

  • P1-VL addresses the critical gap in physics reasoning by integrating visual perception with abstract logic, specifically targeting Olympiad problems where diagrams encode essential constraints absent in text, thus enabling grounded scientific reasoning.
  • The model leverages Curriculum Reinforcement Learning with progressive difficulty scaling and Agentic Augmentation for iterative self-verification at inference, stabilizing training and enhancing reasoning fidelity beyond standard fine-tuning approaches.
  • Evaluated on HiPhO, a benchmark of 13 exams from 2024–2025, P1-VL-235B-A22B achieves 12 gold medals, the best result among open-source VLMs. When augmented with PhysicsMinions, it ranks No. 2 overall, and the models show strong generalization across STEM tasks.

Introduction

The authors use physics Olympiad problems as a demanding testbed to push Large Language Models beyond symbolic reasoning into grounded, multimodal scientific understanding. Prior work largely ignores the role of diagrams in physics, where visuals encode essential constraints absent from the text, and this limits models to incomplete, text-only reasoning. To bridge this gap, they introduce P1-VL, an open-source family of vision-language models trained via Curriculum Reinforcement Learning to progressively master complex reasoning, and augmented with an agent framework that enables iterative self-correction at inference. Their flagship model achieves 12 gold medals on the HiPhO benchmark and, with agent augmentation, ranks second overall among all evaluated models. It also demonstrates strong generalization across STEM domains.

Dataset

  • The authors use a curated multimodal physics dataset of 8,033 problems, drawn from three sources: 4,126 from physics Olympiads (including IPhO and APhO up to 2023), 2,968 from undergraduate textbooks, and 939 from competition guides. These sources were selected to balance conceptual depth, visual richness, and verifiable solutions.

  • Each subset was processed through a multi-stage pipeline: OCR correction for scanned materials, model-based answer extraction (using Gemini-2.5-Flash, Claude-3.7-Sonnet, and GPT-4o) with majority-vote consensus, filtering of diagram-generation or open-ended tasks, visual consistency checks via Gemini-2.5-Flash, and final expert review. This reduced the initial 13,432 items to 8,033 high-fidelity, bilingual samples.

  • For training, the authors apply curriculum reinforcement learning. They first estimate problem difficulty using Qwen3-VL-30B-A3B’s pass rate across 72 rollouts. They remove trivial samples (pass rate > 0.7) and recover samples that were never solved (pass rate = 0.0) by verifying and refining them with Gemini-2.5-Flash. Training proceeds in stages, progressively lowering the difficulty threshold and expanding group size and generation window to maintain search depth.

  • The dataset is used exclusively for RLVR training, with no test data from the same sources. Evaluation is performed on HiPhO, a separate benchmark of 13 recent Olympiad exams (2024–2025), using Gemini-2.5-Flash as an automated grader that scores both final answers and reasoning steps, mirroring human grading to enable medal-threshold comparisons.
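The consensus and difficulty-filtering steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the strict-majority rule, and the bucket labels are assumptions, and only the 0.7 and 0.0 pass-rate cut-offs and the 72-rollout count come from the text.

```python
from collections import Counter
from typing import Dict, List, Optional


def majority_answer(candidates: List[str]) -> Optional[str]:
    """Return the answer a strict majority of extractors agree on, else None.

    Assumption: answers are compared as whitespace-stripped strings; the
    source does not say how the real pipeline normalizes or compares answers.
    """
    if not candidates:
        return None
    normalized = [c.strip() for c in candidates]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer if votes > len(normalized) / 2 else None


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of rollouts judged correct (the text uses 72 rollouts)."""
    if not outcomes:
        raise ValueError("at least one rollout outcome is required")
    return sum(outcomes) / len(outcomes)


def bucket_by_difficulty(
    pass_rates: Dict[str, float], trivial_above: float = 0.7
) -> Dict[str, List[str]]:
    """Split problem ids into hypothetical buckets using the stated cut-offs."""
    buckets: Dict[str, List[str]] = {
        "drop_trivial": [],   # pass rate > 0.7: removed as too easy
        "needs_recheck": [],  # pass rate == 0.0: sent for verification
        "keep": [],           # everything else stays in the training pool
    }
    for problem_id, rate in pass_rates.items():
        if rate > trivial_above:
            buckets["drop_trivial"].append(problem_id)
        elif rate == 0.0:
            buckets["needs_recheck"].append(problem_id)
        else:
            buckets["keep"].append(problem_id)
    return buckets
```

With three extractors, `majority_answer` accepts an answer only when at least two agree. A problem whose pass rate is exactly 0.7 stays in the pool, since the stated rule removes only rates strictly above 0.7.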

Method

The authors leverage a reinforcement learning (RL) framework to train vision-language models for solving complex Physics Olympiad problems, formulating the task as a Markov Decision Process (MDP) where the state space encompasses the problem context and generated reasoning tokens, and the action space corresponds to the discrete vocabulary of output tokens. The policy is optimized to maximize the expected return, computed as the sum of scalar rewards over a trajectory, with the reward signal derived from the correctness of the final answer relative to ground truth. To stabilize training and improve sample efficiency, they adopt Group Sequence Policy Optimization (GSPO), which operates at the sequence level rather than the token level, employing length-normalized importance ratios to reduce variance and a clipped objective to constrain policy updates.
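This summary does not reproduce the GSPO objective. For reference, the form commonly given for GSPO is sketched below. The notation is assumed here rather than taken from the source: $G$ is the group size, $y_i$ the $i$-th sampled response to prompt $x$, $r$ the scalar reward, and $\varepsilon$ the clip range. The exact hyperparameters used for P1-VL are not stated.

```latex
\begin{aligned}
s_i(\theta) &= \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)}\right)^{1/|y_i|}
  = \exp\!\left(\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}
    \log\frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})}\right) \\
\hat{A}_i &= \frac{r(x, y_i) - \operatorname{mean}\{r(x, y_j)\}_{j=1}^{G}}{\operatorname{std}\{r(x, y_j)\}_{j=1}^{G}} \\
J_{\text{GSPO}}(\theta) &= \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}
  \min\!\Big(s_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_i\Big)\right]
\end{aligned}
```

The exponent $1/|y_i|$ is the length normalization mentioned above: it turns the sequence likelihood ratio into a per-token geometric mean, so long and short responses are clipped on a comparable scale.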

The framework diagram illustrates the end-to-end data pipeline from raw problem sources to final training data. The process begins with PDF conversion to Markdown, followed by QA parsing to extract question-answer pairs. Human annotators then perform answer annotation, ensuring the solutions adhere to a structured format (such as LaTeX for symbolic expressions and boxed final answers) as specified in the system prompt. Post-processing includes language transformation and expert review, culminating in multi-modal training data that pairs questions, answers, associated images, and metadata. This structured pipeline ensures that the model learns to generate verifiable, format-compliant solutions while handling both textual and visual inputs.

To address the train-inference mismatch inherent in distributed RL training, the authors implement Sequence-level Masked Importance Sampling (Seq-MIS), which rejects entire trajectories whose geometric mean of importance weights exceeds a threshold, thereby enforcing a hard trust region. This mechanism mitigates gradient bias introduced by discrepancies between rollout and training engines. The training dynamics are further stabilized through curriculum learning, where complexity is progressively increased by scaling data difficulty, expanding group sizes, and extending generation windows across stages. The VERL framework is used for implementation, with vision encoders and projection layers frozen during RL training to preserve pre-trained visual representations, while the language model parameters are fine-tuned.
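The rejection rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' or VERL's implementation: the function names are hypothetical, the threshold of 2.0 is an arbitrary placeholder, and per-token log-probabilities from the training and rollout engines are assumed to be available as plain lists.

```python
import math
from typing import List, Sequence, Tuple


def geometric_mean_ratio(
    train_logprobs: Sequence[float], rollout_logprobs: Sequence[float]
) -> float:
    """Geometric mean of per-token importance weights pi_train / pi_rollout.

    Computed in log space: exp(mean(log p_train - log p_rollout)).
    """
    if not train_logprobs or len(train_logprobs) != len(rollout_logprobs):
        raise ValueError("log-prob sequences must be non-empty and equal in length")
    log_ratios = [t - r for t, r in zip(train_logprobs, rollout_logprobs)]
    return math.exp(sum(log_ratios) / len(log_ratios))


def seq_mis_mask(
    batch: Sequence[Tuple[Sequence[float], Sequence[float]]],
    threshold: float = 2.0,  # placeholder value; the source gives no number
) -> List[float]:
    """Return 1.0 for trajectories kept and 0.0 for trajectories rejected.

    A trajectory is rejected as a whole when its geometric-mean importance
    weight exceeds the threshold, so it contributes no gradient.
    """
    return [
        1.0 if geometric_mean_ratio(train, rollout) <= threshold else 0.0
        for train, rollout in batch
    ]
```

In a training loop, the mask would multiply each sequence's loss term, so a rejected trajectory is dropped entirely rather than having its weight truncated.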

At inference time, the system is augmented with PhysicsMinions, a multi-agent framework comprising Visual, Logic, and Review Studios. The Visual Studio processes diagrams and converts them into symbolic representations, enabling grounded reasoning. The Logic Studio iteratively refines solutions via solver-introspector collaboration, while the Review Studio validates outputs using domain-specific verifiers. This agentic loop supports scalable, robust reasoning on multimodal problems, with domain-adaptive mechanisms routing problems to appropriate solvers and verifiers based on detected scientific discipline.
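The solve-then-verify loop can be illustrated with a small sketch. The interfaces below are hypothetical and are not the PhysicsMinions API: `Solver`, `Verifier`, `solve_with_review`, the round limit, and the toy functions are all assumptions introduced for illustration only.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces (assumptions, not the PhysicsMinions API).
Solver = Callable[[str, Optional[str]], str]       # (problem, feedback) -> solution
Verifier = Callable[[str, str], Tuple[bool, str]]  # (problem, solution) -> (accepted, feedback)


def solve_with_review(
    problem: str, solver: Solver, verifier: Verifier, max_rounds: int = 3
) -> Tuple[str, bool, int]:
    """Iterate solver and verifier until the solution is accepted or rounds run out.

    Returns (last solution, whether it was accepted, rounds used).
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: Optional[str] = None
    solution = ""
    for round_index in range(1, max_rounds + 1):
        solution = solver(problem, feedback)
        accepted, feedback = verifier(problem, solution)
        if accepted:
            return solution, True, round_index
    return solution, False, max_rounds


def toy_solver(problem: str, feedback: Optional[str]) -> str:
    """Stand-in for a model call: corrects itself once feedback arrives."""
    return "v = 10 m/s" if feedback is None else "v = 20 m/s"


def toy_verifier(problem: str, solution: str) -> Tuple[bool, str]:
    """Stand-in for a domain verifier with a fixed reference answer."""
    if solution == "v = 20 m/s":
        return True, ""
    return False, "check the factor of 2"
```

Calling `solve_with_review("toy", toy_solver, toy_verifier)` runs two rounds: the first answer is rejected with feedback, and the second is accepted. In a real system the solver and verifier would wrap model or tool calls.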

Experiment

  • P1-VL models trained via reinforcement learning achieve top-tier performance on physics Olympiads, outperforming many closed-source models and demonstrating superior visual-scientific reasoning without agent augmentation.
  • Agent-augmented P1-VL systems surpass even top closed-source models, setting new benchmarks across multiple Olympiads and validating the “model + system” paradigm for complex scientific tasks.
  • P1-VL models generalize well beyond physics, showing strong transfer to diverse STEM benchmarks including math and multi-modal reasoning, with consistent gains over base models and minimal catastrophic forgetting.
  • Training stability is achieved through Sequence-Level Masked Importance Sampling, which mitigates train-inference mismatch and prevents RL collapse observed with other sampling methods.
  • Mixed training data (text-only + image-text) enhances performance without negative transfer, supporting the use of heterogeneous data for robust multimodal training.
  • Curriculum-based RL training significantly improves reasoning depth and response length, proving essential for developing advanced scientific reasoning capabilities.
  • RL training is broadly effective across model architectures, including InternVL series, confirming its generalizability for unlocking latent scientific reasoning in diverse base models.

The authors use a physics Olympiad benchmark to evaluate models trained with reinforcement learning, showing that their P1-VL series outperforms open-source baselines and many closed-source ones in scientific reasoning, even without agent augmentation. When combined with multi-agent systems, the models achieve state-of-the-art results across multiple competitions, demonstrating that structured training and system-level collaboration significantly enhance complex problem-solving. The models also generalize well to other STEM domains, improving performance on text-only and multi-modal tasks while maintaining robust visual reasoning capabilities.

The authors use reinforcement learning to train P1-VL models, achieving top-tier performance on physics Olympiads and demonstrating strong generalization to other STEM domains. Results show that even smaller variants outperform larger baselines, and combining the model with agent frameworks further boosts scores, highlighting the value of integrated system design. The models also retain and enhance reasoning capabilities across text-only and multi-modal tasks, indicating effective vision-language alignment without catastrophic forgetting.

The authors use a physics problem involving fluid mechanics and atmospheric pressure to evaluate model reasoning, requiring integration of visual schematics, tabular data, and symbolic equations. Results show the model correctly identifies behavioral regimes and computes precise force values across experiments, demonstrating robust alignment between visual perception and scientific calculation. This case underscores the model’s ability to sustain multi-step reasoning under real-world physical constraints without hallucinating invalid assumptions.

The authors use reinforcement learning to train P1-VL models, achieving top-tier performance on physics Olympiad benchmarks, with the largest variant ranking third among all models and outperforming several closed-source systems even without agent augmentation. When combined with the PhysicsMinions agent framework, the model climbs to second place globally, setting new state-of-the-art scores on multiple competitions, demonstrating that multi-agent collaboration significantly enhances complex scientific reasoning. Results also show strong generalization beyond physics, with consistent gains over base models across diverse STEM and multimodal benchmarks, indicating that domain-specific training does not compromise but rather amplifies broader reasoning capabilities.

