Proximal Policy Optimization
Proximal Policy Optimization (PPO) is a reinforcement learning algorithm for training the decision-making policy of a computer agent to complete difficult tasks. PPO was developed by John Schulman in 2017 and became the default reinforcement learning algorithm at the American artificial intelligence company OpenAI. In 2018, PPO achieved a range of successes, such as controlling a robotic arm, beating professional players at Dota 2, and performing well on Atari games. Many experts consider PPO state of the art because it strikes a good balance between performance and ease of understanding. Compared with other algorithms, the three main advantages of PPO are simplicity, stability, and sample efficiency.
Advantages of PPO
- Simplicity: PPO approximates what TRPO does without the heavy computation. It constrains the policy update with a first-order method, the clipped surrogate objective, whereas TRPO enforces a KL-divergence constraint outside the objective function, which requires second-order optimization. PPO is therefore easier to implement and takes less computation time than TRPO, making it cheaper and more practical for large-scale problems. A sketch of the clipped objective is given after this list.
- Stability: While other reinforcement learning algorithms often require careful hyperparameter tuning, PPO usually does not; the clipping parameter epsilon = 0.2 works well in most cases. In addition, PPO does not require complex optimization techniques. It can be trained with standard deep learning frameworks and generalizes to a wide range of tasks.
- Sample efficiency: Sample efficiency indicates how much data an algorithm needs to learn a good policy. PPO achieves sample efficiency through its clipped surrogate objective, which keeps new policies from straying too far from the old ones, regularizes policy updates, and allows the same training data to be reused for several update epochs (see the second sketch below). Sample efficiency is particularly valuable for complex, high-dimensional tasks, where data collection and computation are expensive.
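To make the contrast with TRPO concrete, the following is a minimal sketch of the clipped surrogate objective in a PyTorch-style form. The function name and argument layout are illustrative assumptions, not taken from any particular library; only the standard tensor operations are real PyTorch calls.

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_epsilon=0.2):
    """Clipped surrogate objective (negated so it can be minimized).

    new_log_probs:  log pi_theta(a_t | s_t) under the current policy
    old_log_probs:  log pi_theta_old(a_t | s_t) from the policy that collected the data
    advantages:     advantage estimates A_t for the same transitions
    """
    # Probability ratio r_t(theta) = pi_theta / pi_theta_old.
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Unclipped and clipped surrogate terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantages

    # Take the pessimistic (minimum) of the two terms and negate for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```

Because only first-order gradients of this loss are needed, the update can be run with any standard gradient-based optimizer, which is the practical source of PPO's simplicity relative to TRPO's second-order machinery.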
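As a sketch of how the clip enables data reuse, the loop below performs several gradient epochs on a single batch of rollout data, using the `ppo_clipped_loss` function from the sketch above. The `policy.log_prob` helper and the `rollout` dictionary layout are hypothetical stand-ins rather than a specific library's API.

```python
import torch

def ppo_update(policy, optimizer, rollout, num_epochs=4):
    """Reuse one batch of rollout data for several epochs of clipped updates.

    `rollout` is assumed to hold states, actions, the old log-probabilities,
    and advantage estimates collected under the previous policy.
    """
    for _ in range(num_epochs):
        # Recompute log-probabilities under the current (changing) policy.
        new_log_probs = policy.log_prob(rollout["states"], rollout["actions"])

        loss = ppo_clipped_loss(
            new_log_probs,
            rollout["old_log_probs"],   # fixed across epochs
            rollout["advantages"],
        )

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The clipping keeps the repeated updates from pushing the new policy far from the one that generated the data, which is what makes this reuse safe.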