
Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

Chae, Hyungjoo; Kim, Sunghwan; Cho, Junhee; Kim, Seungone; Moon, Seungjun; Hwangbo, Gyeom; Lim, Dongha; Kim, Minjin; Hwang, Yeonjun; Gwak, Minju; Choi, Dongwook; Kang, Minseok; Im, Gwanhoon; Cho, ByeongUng; Kim, Hyojun; Han, Jun Hee; Kwon, Taeyoon; Kim, Minju; Kwak, Beong-woo; Kang, Dongjin; Yeo, Jinyoung
Published: 5/22/2025
Abstract

Web navigation is a unique domain that can automate many repetitive real-life tasks, and it is challenging because it requires long-horizon sequential decision making beyond typical multimodal large language model (MLLM) tasks. Yet, specialized reward models for web navigation that can be utilized during both training and test time have been absent until now. Despite the importance of speed and cost-effectiveness, prior works have utilized MLLMs as reward models, which poses significant constraints for real-world deployment. To address this, in this work we propose Web-Shepherd, the first process reward model (PRM) that can assess web navigation trajectories at the step level. To achieve this, we first construct the WebPRM Collection, a large-scale dataset with 40K step-level preference pairs and annotated checklists spanning diverse domains and difficulty levels. Next, we also introduce WebRewardBench, the first meta-evaluation benchmark for evaluating PRMs. In our experiments, we observe that Web-Shepherd achieves about 30 points better accuracy compared to using GPT-4o on WebRewardBench. Furthermore, when testing on WebArena-lite using GPT-4o-mini as the policy and Web-Shepherd as the verifier, we achieve 10.9 points better performance at 10 times less cost compared to using GPT-4o-mini as the verifier. Our model, dataset, and code are publicly available at LINK.
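The test-time setup the abstract describes, where a policy model proposes actions and Web-Shepherd scores them as a verifier, amounts to best-of-n reranking at each step. The sketch below illustrates that loop under stated assumptions: the `Policy` and `ProcessRewardModel` interfaces and every method name here are hypothetical placeholders for illustration, not the paper's released API.

```python
# Minimal sketch of step-level verification with a PRM (best-of-n reranking).
# All interfaces and method names below are hypothetical placeholders, not
# the actual Web-Shepherd API.

from dataclasses import dataclass
from typing import Protocol


class Policy(Protocol):
    """Any agent that proposes a next web action (hypothetical interface)."""
    def propose(self, instruction: str, observation: str) -> str: ...


class ProcessRewardModel(Protocol):
    """Any PRM that scores a single step (hypothetical interface)."""
    def score_step(self, instruction: str, observation: str, action: str) -> float: ...


@dataclass
class Candidate:
    action: str    # proposed next web action, e.g. "click('Search')"
    reward: float  # step-level score assigned by the PRM


def select_action(policy: Policy, prm: ProcessRewardModel,
                  instruction: str, observation: str, n: int = 8) -> str:
    """Sample n candidate actions from the policy, score each with the
    step-level PRM, and return the highest-scoring one."""
    candidates = []
    for _ in range(n):
        action = policy.propose(instruction, observation)
        reward = prm.score_step(instruction, observation, action)
        candidates.append(Candidate(action, reward))
    return max(candidates, key=lambda c: c.reward).action
```

Scoring each candidate step with a small specialized PRM rather than prompting a large MLLM judge is what makes this kind of verification cheap, which is the cost advantage the abstract reports.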