SLIDE 1

Trust region policy optimization (TRPO)

SLIDE 2

Value Iteration

SLIDE 3

Value Iteration

  • This is similar to what Q-Learning does, the main difference being that we might not know the actual expected reward and instead explore the world and use discounted rewards to model our value function.

SLIDE 4

Value Iteration


(Diagram on the original slide: model-based vs. model-free)

SLIDE 5

Value Iteration


  • Once we have Q(s,a), we can find the optimal policy π* using: $\pi^*(s) = \arg\max_a Q(s,a)$.
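
As a concrete illustration (a minimal sketch, not from the slides), the code below runs tabular value iteration on a small MDP with known transition probabilities P and rewards R (hypothetical inputs), and then extracts the greedy policy π*(s) = argmax_a Q(s,a):

    import numpy as np

    def value_iteration(P, R, gamma=0.99, tol=1e-8):
        """Tabular value iteration on a known MDP.
        P: transition probabilities, shape (S, A, S); R: expected rewards, shape (S, A).
        Returns the optimal Q-function, shape (S, A)."""
        S, A, _ = P.shape
        V = np.zeros(S)
        while True:
            # Bellman optimality backup: Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
            Q = R + gamma * (P @ V)
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return Q
            V = V_new

    def greedy_policy(Q):
        # pi*(s) = argmax_a Q(s, a)
        return Q.argmax(axis=1)

Q-Learning targets the same Q-function without access to P and R, by sampling transitions and bootstrapping from discounted rewards, which is the model-based vs. model-free distinction the slides point at.
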
SLIDE 6

Policy Iteration

  • We can directly optimize in the policy space.
SLIDE 7

Policy Iteration

  • We can directly optimize in the policy space, which is smaller than the Q-function space.

SLIDE 8

Preliminaries

The following identity expresses the expected return of another policy π̃ in terms of the advantage over π, accumulated over time steps:

$\eta(\tilde\pi) = \eta(\pi) + \mathbb{E}_{\tau\sim\tilde\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t}A_{\pi}(s_t,a_t)\right]$

where $A_\pi$ is the advantage function:

$A_\pi(s,a) = Q_\pi(s,a) - V_\pi(s)$

and $\rho_\pi(s) = P(s_0{=}s) + \gamma P(s_1{=}s) + \gamma^2 P(s_2{=}s) + \cdots$ is the discounted visitation frequency of states under policy $\pi$, which lets the identity be rewritten as a sum over states:

$\eta(\tilde\pi) = \eta(\pi) + \sum_s \rho_{\tilde\pi}(s)\sum_a \tilde\pi(a|s)\,A_\pi(s,a)$
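
A short worked derivation of this identity (the standard telescoping argument, not shown on the slides), using $A_\pi(s_t,a_t) = \mathbb{E}_{s_{t+1}}\big[r(s_t) + \gamma V_\pi(s_{t+1}) - V_\pi(s_t)\big]$:

    \begin{aligned}
    \mathbb{E}_{\tau\sim\tilde\pi}\Big[\sum_{t\ge 0}\gamma^t A_\pi(s_t,a_t)\Big]
      &= \mathbb{E}_{\tau\sim\tilde\pi}\Big[\sum_{t\ge 0}\gamma^t\big(r(s_t) + \gamma V_\pi(s_{t+1}) - V_\pi(s_t)\big)\Big] \\
      &= \mathbb{E}_{\tau\sim\tilde\pi}\Big[-V_\pi(s_0) + \sum_{t\ge 0}\gamma^t r(s_t)\Big]
       = -\mathbb{E}_{s_0}\big[V_\pi(s_0)\big] + \eta(\tilde\pi)
       = \eta(\tilde\pi) - \eta(\pi)
    \end{aligned}

The value terms telescope, so only $-V_\pi(s_0)$ survives, and its expectation is $-\eta(\pi)$.
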

SLIDE 9

Preliminaries

To remove the complexity due to $\rho_{\tilde\pi}$, the following local approximation is introduced:

$L_\pi(\tilde\pi) = \eta(\pi) + \sum_s \rho_{\pi}(s)\sum_a \tilde\pi(a|s)\,A_\pi(s,a)$

(it uses the visitation frequency $\rho_\pi$ of the current policy rather than $\rho_{\tilde\pi}$). If we have a parameterized policy $\pi_\theta$, where $\pi_\theta(a|s)$ is a differentiable function of the parameter vector $\theta$, then $L_{\pi_{\theta_0}}$ matches $\eta$ to first order, i.e.,

$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}), \qquad \nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\big|_{\theta=\theta_0} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta=\theta_0}.$

This implies that a sufficiently small step $\pi_{\theta_0}\to\tilde\pi$ that improves $L_{\pi_{\theta_0}}$ will also improve $\eta$, but it does not give us any guidance on how big a step to take.

SLIDE 10

Preliminaries

  • To address this issue, Kakade & Langford (2002) proposed conservative policy iteration:

$\pi_{\text{new}}(a|s) = (1-\alpha)\,\pi_{\text{old}}(a|s) + \alpha\,\pi'(a|s)$, where $\pi' = \arg\max_{\pi'} L_{\pi_{\text{old}}}(\pi')$.

  • They derived the following lower bound:

$\eta(\pi_{\text{new}}) \ge L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \dfrac{2\epsilon\gamma}{(1-\gamma)^2}\,\alpha^2$, where $\epsilon = \max_s \big|\mathbb{E}_{a\sim\pi'(\cdot|s)}[A_{\pi}(s,a)]\big|$.

SLIDE 11

Preliminaries

  • Computationally, this α-coupling (a joint way of drawing actions from π and π_new so that, in every state, they differ with probability at most α) means that if we randomly choose a seed for our random number generator and then sample from each of π and π_new after setting that seed, the results will agree for at least a fraction 1−α of the seeds.

  • Thus α can be considered a measure of disagreement between π and π_new.
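
As a concrete illustration of the shared-seed interpretation (a minimal sketch, not from the slides), the code below couples two small categorical action distributions through a common random number and measures how often they pick the same action; the probability values are made up for the example:

    import numpy as np

    def sample_with_seed(probs, seed):
        # Inverse-CDF sampling from a fixed seed, so nearby distributions tend to agree.
        rng = np.random.default_rng(seed)
        u = rng.random()
        return int(np.searchsorted(np.cumsum(probs), u))

    # Two nearby action distributions for a single state (hypothetical numbers).
    pi_old = np.array([0.50, 0.30, 0.20])
    pi_new = np.array([0.45, 0.35, 0.20])

    seeds = range(10_000)
    agree = sum(sample_with_seed(pi_old, s) == sample_with_seed(pi_new, s) for s in seeds)
    print(f"fraction of seeds where the policies agree: {agree / len(seeds):.3f}")
    # This coupling disagrees only when u falls in (0.45, 0.5], i.e. with probability
    # 0.05, so the printed agreement fraction should be close to 0.95.
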

SLIDE 12

Theorem 1

  • The previous result is applicable to mixture policies only. Schulman et al. showed that it can be extended to general stochastic policies by using a distance measure called the Total Variation divergence between π and π̃:

$D_{TV}(p \,\|\, q) = \tfrac{1}{2}\sum_i |p_i - q_i|$ for discrete probability distributions $p, q$.

  • Let $D_{TV}^{\max}(\pi, \tilde\pi) = \max_s D_{TV}\big(\pi(\cdot|s) \,\|\, \tilde\pi(\cdot|s)\big)$.

  • They proved that for $\alpha = D_{TV}^{\max}(\pi_{\text{old}}, \pi_{\text{new}})$, the following result holds:

$\eta(\pi_{\text{new}}) \ge L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \dfrac{4\epsilon\gamma}{(1-\gamma)^2}\,\alpha^2$, where $\epsilon = \max_{s,a} |A_\pi(s,a)|$.

SLIDE 13

Theorem 1

  • Note the following relation between Total Variation and Kullback–Leibler divergence:

$D_{TV}(p \,\|\, q)^2 \le D_{KL}(p \,\|\, q)$

  • Thus the bounding condition becomes:

$\eta(\tilde\pi) \ge L_{\pi}(\tilde\pi) - C\,D_{KL}^{\max}(\pi, \tilde\pi)$, where $C = \dfrac{4\epsilon\gamma}{(1-\gamma)^2}$ and $D_{KL}^{\max}(\pi, \tilde\pi) = \max_s D_{KL}\big(\pi(\cdot|s)\,\|\,\tilde\pi(\cdot|s)\big)$.

SLIDE 14

Algorithm 1
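
In the paper, Algorithm 1 is exact policy iteration with a KL penalty: at each iteration, compute all advantages $A_{\pi_i}(s,a)$ and set $\pi_{i+1} = \arg\max_\pi \big[L_{\pi_i}(\pi) - C\,D_{KL}^{\max}(\pi_i,\pi)\big]$, which guarantees a non-decreasing true objective η. A minimal sketch of that loop, assuming the caller supplies the problem-specific pieces as callables (all names below are hypothetical, not from the slides):

    def algorithm_1(pi0, compute_advantages, argmax_penalized_surrogate,
                    n_iters, eps, gamma):
        """Exact policy iteration maximizing L_pi(pi') - C * D_KL_max(pi, pi')."""
        C = 4.0 * eps * gamma / (1.0 - gamma) ** 2   # penalty coefficient from Theorem 1
        pi = pi0
        for _ in range(n_iters):
            adv = compute_advantages(pi)             # A_pi(s, a) for all s, a
            # pi' maximizing L_pi(pi') - C * max_s KL(pi(.|s) || pi'(.|s));
            # by Theorem 1 this step cannot decrease eta.
            pi = argmax_penalized_surrogate(pi, adv, C)
        return pi
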

SLIDE 15

Trust Region Policy Optimization

  • For parameterized policies with parameter vector θ, we are guaranteed to improve the true objective η by performing the following maximization:

$\underset{\theta}{\text{maximize}}\;\big[L_{\theta_{\text{old}}}(\theta) - C\,D_{KL}^{\max}(\theta_{\text{old}}, \theta)\big]$

  • However, using the penalty coefficient C as above results in very small step sizes. One way to take larger steps in a robust way is to use a constraint on the KL divergence between the new policy and the old policy, i.e., a trust region constraint:

$\underset{\theta}{\text{maximize}}\;L_{\theta_{\text{old}}}(\theta) \quad \text{subject to} \quad D_{KL}^{\max}(\theta_{\text{old}}, \theta) \le \delta$

SLIDE 16

Trust Region Policy Optimization

  • The constraint bounds the KL divergence at every point in state space, which is not practical. We can use the following heuristic approximation, an average KL divergence over visited states:

$\bar{D}_{KL}^{\rho}(\theta_1, \theta_2) := \mathbb{E}_{s\sim\rho}\big[D_{KL}\big(\pi_{\theta_1}(\cdot|s)\,\|\,\pi_{\theta_2}(\cdot|s)\big)\big]$

  • Thus, the optimization problem becomes:

$\underset{\theta}{\text{maximize}}\;L_{\theta_{\text{old}}}(\theta) \quad \text{subject to} \quad \bar{D}_{KL}^{\rho_{\theta_{\text{old}}}}(\theta_{\text{old}}, \theta) \le \delta$
SLIDE 17

Trust Region Policy Optimization

  • In terms of expectations, the previous problem can be written as (Equation 14 in the paper):

$\underset{\theta}{\text{maximize}}\;\mathbb{E}_{s\sim\rho_{\theta_{\text{old}}},\,a\sim q}\!\left[\frac{\pi_\theta(a|s)}{q(a|s)}\,Q_{\theta_{\text{old}}}(s,a)\right] \quad \text{subject to} \quad \mathbb{E}_{s\sim\rho_{\theta_{\text{old}}}}\big[D_{KL}\big(\pi_{\theta_{\text{old}}}(\cdot|s)\,\|\,\pi_\theta(\cdot|s)\big)\big] \le \delta$

where q denotes the sampling distribution over actions.

  • The samples for these expectations can be collected in two ways (a sketch of the resulting sample-based estimates follows):

    a) Single Path method
    b) Vine method
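
A minimal sketch (not the paper's code) of the sample-based surrogate objective and average-KL constraint, assuming discrete actions and that rollouts already produced per-sample probabilities of the taken actions, Monte Carlo Q estimates, and full old/new action distributions; all argument names here are hypothetical:

    import numpy as np

    def surrogate_objective(new_probs_of_taken_actions, sampling_probs, q_values):
        # Sample average of pi_theta(a|s) / q(a|s) * Q_old(s, a)
        ratios = new_probs_of_taken_actions / sampling_probs
        return np.mean(ratios * q_values)

    def average_kl(old_dists, new_dists, eps=1e-12):
        # Sample average over states of KL( pi_old(.|s) || pi_theta(.|s) );
        # old_dists and new_dists have shape (N, num_actions).
        kl_per_state = np.sum(
            old_dists * (np.log(old_dists + eps) - np.log(new_dists + eps)), axis=1)
        return np.mean(kl_per_state)

    # A candidate theta is acceptable only if average_kl(...) <= delta (e.g. 0.01)
    # while surrogate_objective(...) improves over the old policy's value.
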

SLIDE 18

Final Algorithm

  • Step 1: Use the single path or vine procedure to collect a set of state-action pairs along with Monte Carlo estimates of their Q-values.

  • Step 2: By averaging over samples, construct the estimated objective and constraint in Equation (14).

  • Step 3: Approximately solve this constrained optimization problem to update the policy's parameter vector θ.
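
A minimal sketch of this overall loop (Steps 1–3), with the environment- and model-specific pieces passed in as callables; every name below is hypothetical and not taken from the slides. In the paper, Step 3 is solved approximately with conjugate gradient followed by a backtracking line search:

    def trpo_training_loop(theta, collect_samples, build_objective_and_constraint,
                           solve_trust_region_step, n_updates, delta=0.01):
        for _ in range(n_updates):
            # Step 1: single-path or vine rollouts -> (state, action, Q-estimate) samples
            samples = collect_samples(theta)
            # Step 2: sample averages of the surrogate objective and average-KL constraint
            objective, kl_constraint = build_objective_and_constraint(theta, samples)
            # Step 3: approximately maximize the objective subject to kl_constraint <= delta
            theta = solve_trust_region_step(theta, objective, kl_constraint, delta)
        return theta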