Back to articles
AIOpenAI News

Proximal Policy Optimization

We’re releasing a new class of reinforcement learning algorithms, Proximal Policy Optimization (PPO), which perform comparably or better than state-of-the-art approaches while being much simpler to implement and tune. P...

The RSS feed only provided an excerpt. FlowMarket recovered the public content available from the original page without bypassing restricted content.

July 20, 2017

Proximal Policy Optimization

Openai Baselines Ppo

Illustration: Ben Barry

We’re releasing a new class of reinforcement learning algorithms, Proximal Policy Optimization (PPO), which perform comparably or better than state-of-the-art approaches while being much simpler to implement and tune. PPO has become the default reinforcement learning algorithm at OpenAI because of its ease of use and good performance.

Policy gradient methods ⁠ (opens in a new window)  are fundamental to recent breakthroughs in using deep neural networks for control, from  video games ⁠ (opens in a new window) , to  3D locomotion ⁠ (opens in a new window) , to  Go ⁠ (opens in a new window) . But getting good results via policy gradient methods is challenging because they are sensitive to the choice of stepsize — too small, and progress is hopelessly slow; too large and the signal is overwhelmed by the noise, or one might see catastrophic drops in performance. They also often have very poor sample efficiency, taking millions (or billions) of timesteps to learn simple tasks.

Researchers have sought to eliminate these flaws with approaches like  TRPO ⁠ (opens in a new window)  and  ACER ⁠ (opens in a new window) , by constraining or otherwise optimizing the size of a policy update. These methods have their own trade-offs—ACER is far more complicated than PPO, requiring the addition of code for off-policy corrections and a replay buffer, while only doing marginally better than PPO on the Atari benchmark; TRPO—though useful for continuous control tasks—isn’t easily compatible with algorithms that share parameters between a policy and value function or auxiliary losses, like those used to solve problems in Atari and other domains where the visual input is significant.

PPO

With supervised learning, we can easily implement the cost function, run gradient descent on it, and be very confident that we’ll get excellent results with relatively little hyperparameter tuning. The route to success in reinforcement learning isn’t as obvious—the algorithms have many moving parts that are hard to debug, and they require substantial effort in tuning in order to get good results. PPO strikes a balance between ease of implementation, sample complexity, and ease of tuning, trying to compute an update at each step that minimizes the cost function while ensuring the deviation from the previous policy is relatively small.

We’ve  previously ⁠ (opens in a new window)  detailed a variant of PPO that uses an adaptive  KL ⁠ (opens in a new window)  penalty to control the change of the policy at each iteration. The new variant uses a novel objective function not typically found in other algorithms:

  • θ \theta θ is the policy parameter
  • E ^ t \hat{E}_{t} E ^ t ​ denotes the empirical expectation over timesteps
  • r t r_t r t ​ is the ratio of the probability under the new and old policies, respectively
  • A ^ t \hat{A}_t A ^ t ​ is the estimated advantage at time t t t
  • ε \varepsilon ε is a hyperparameter, usually 0.1 or 0.2

This objective implements a way to do a Trust Region update which is compatible with Stochastic Gradient Descent, and simplifies the algorithm by removing the KL penalty and need to make adaptive updates. In tests, this algorithm has displayed the best performance on continuous control tasks and almost matches ACER’s performance on Atari, despite being far simpler to implement.

Controllable, complicated robots

We’ve created interactive agents based on policies trained by PPO—we can  use the keyboard ⁠ (opens in a new window)  to set new target positions for a robot in an environment within Roboschool; though the input sequences are different from what the agent was trained on, it manages to generalize.

Baselines: PPO, PPO2, ACER, and TRPO

This release of  baselines ⁠ (opens in a new window)  includes scalable, parallel implementations of PPO and TRPO which both use MPI for data passing. Both use Python3 and TensorFlow. We’re also adding pre-trained versions of the policies used to train the above robots to the  Roboschool ⁠   agent zoo ⁠ (opens in a new window) .

Update : We’re also releasing a GPU-enabled implementation of PPO, called PPO2. This runs approximately 3x faster than the current PPO baseline on Atari. In addition, we’re releasing an implementation of Actor Critic with Experience Replay (ACER), a sample-efficient policy gradient algorithm. ACER makes use of a replay buffer, enabling it to perform more than one gradient update using each piece of sampled experience, as well as a Q-Function approximate trained with the Retrace algorithm.

We’re looking for people to help build and optimize our reinforcement learning algorithm codebase. If you’re excited about RL, benchmarking, thorough experimentation, and open source, please  apply ⁠ (opens in a new window) , and mention that you read the baselines PPO post in your application.

  • Exploration & Games
  • Learning Paradigms
  • Software & Engineering
  • Robotics
  • Simulated Environments

Authors

Related articles

Scaling Laws For Reward Model Overoptimization

Publication Oct 19, 2022

Whisper

Release Sep 21, 2022

Screenshot of a scene from Minecraft

Conclusion Jun 23, 2022

Need an n8n workflow or help installing it?

After the briefing, move to execution: find an n8n template or a creator who can adapt it to your tools.

Source

OpenAI News - openai.com

View original publication