Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles

The application of rule-based reinforcement learning (RL) to multimodal large language models (MLLMs) introduces unique challenges and potential deviations from findings in text-only domains, particularly for perception-heavy tasks. This paper provides a comprehensive study of rule-based visual RL, using jigsaw puzzles as a structured experimental framework, and reveals several key findings. \textit{Firstly,} we find that MLLMs, which initially perform close to random guessing on simple puzzles, achieve near-perfect accuracy through fine-tuning and generalize to complex, unseen configurations. \textit{Secondly,} training on jigsaw puzzles can induce generalization to other visual tasks, with effectiveness tied to the specific task configuration. \textit{Thirdly,} MLLMs can learn and generalize with or without explicit reasoning, though open-source models often favor direct answering; consequently, even when trained for step-by-step reasoning, they may ignore the thinking process when deriving the final answer. \textit{Fourthly,} we observe that complex reasoning patterns appear to be pre-existing rather than emergent, with their frequency increasing with training and task difficulty. \textit{Finally,} our results demonstrate that RL generalizes more effectively than supervised fine-tuning (SFT), and that an initial SFT cold-start phase can hinder subsequent RL optimization. Although these observations are based on jigsaw puzzles and may vary across other visual tasks, this research contributes a valuable piece to the larger puzzle of collectively understanding rule-based visual RL and its potential in multimodal learning. The code is available at: \href{https://github.com/zifuwanggg/Jigsaw-R1}{https://github.com/zifuwanggg/Jigsaw-R1}.