The Provable Effectiveness of Policy Gradient Methods in Reinforcement Learning
Sham Kakade
University of Washington & Microsoft Research
(with Alekh Agarwal, Jason Lee, and Gaurav Mahajan)
Policy Optimization in RL
[AlphaZero, Silver et al., 2017] [OpenAI Five, 2018] [OpenAI, 2019]
A framework for RL
A policy maps states to actions: π : States → Actions.
Running π from s0 generates a trajectory s0, a0, r0, s1, a1, r1, …
With discount factor γ, the value of π is
$$V^\pi(s_0) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t\right].$$
Goal: find a policy π maximizing V^π(s0).
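To make the objective concrete, here is a minimal Monte Carlo sketch of estimating V^π(s0) in a small tabular MDP; the arrays P, R, the policy, and all names are illustrative placeholders, not anything from the talk.

```python
import numpy as np

def estimate_value(P, R, policy, s0, gamma=0.99, horizon=200, n_rollouts=1000, seed=0):
    """Monte Carlo estimate of V^pi(s0) = E_pi[ sum_t gamma^t r_t ].

    P[s, a] is a probability vector over next states, R[s, a] is the reward,
    and policy[s] is a probability vector over actions (all illustrative).
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):  # truncate the infinite sum; bias is at most gamma^horizon / (1 - gamma)
            a = rng.choice(n_actions, p=policy[s])
            ret += discount * R[s, a]
            discount *= gamma
            s = rng.choice(n_states, p=P[s, a])
        total += ret
    return total / n_rollouts
```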
Challenges in RL
- Exploration (the environment may be unknown)
- Credit assignment (due to delayed rewards)
Example: Dexterous Robotic Hand Manipulation [OpenAI, 2019]
- hand state: joint angles/velocities
- cube state: configuration
- actions: forces applied to actuators
The state-action value of a policy π:
$$Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s,\; a_0 = a\right].$$
Policy improvement: update π(s) ← argmax_a Q^π(s, a).
Use sampling/supervised learning + deep learning to approximate these quantities ("deep RL"?).
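A minimal sketch of this policy-iteration idea in the tabular, known-model case (the arrays P, R and the function name are illustrative): evaluate Q^π exactly by solving a linear system, then improve greedily.

```python
import numpy as np

def policy_iteration_step(P, R, policy, gamma=0.99):
    """One exact policy-iteration step: evaluate Q^pi, then act greedily.

    P has shape (S, A, S), R has shape (S, A), policy has shape (S,) with integer actions.
    """
    S, A = R.shape
    P_pi = P[np.arange(S), policy]                         # (S, S): transitions under pi
    R_pi = R[np.arange(S), policy]                         # (S,):  rewards under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)    # V^pi = (I - gamma P_pi)^{-1} R_pi
    Q = R + gamma * P @ V                                  # Q^pi(s, a) = R(s, a) + gamma E[V^pi(s')]
    return Q.argmax(axis=1), Q                             # greedy improvement: argmax_a Q^pi(s, a)
```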
[Bertsekas & Tsitsiklis ’97] provides first systematic analysis of RL with (worst case) “function approximation”.
Policy gradient methods: directly do gradient ascent on the value,
$$\theta \leftarrow \theta + \eta \nabla_\theta V^{\pi_\theta}(s_0)$$
(the gradient is taken through the neural net parameterization of πθ)
(the expectation is under the states and actions visited under πθ)
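A minimal sketch of how this gradient is estimated from rollouts: a REINFORCE-style estimator, specialized to a tabular softmax policy for simplicity. The `sample_episode` callback and all names are illustrative placeholders.

```python
import numpy as np

def reinforce_gradient(theta, sample_episode, gamma=0.99, n_episodes=100, seed=0):
    """Monte Carlo estimate of grad_theta V^{pi_theta}(s0).

    theta has shape (S, A) and pi_theta(a|s) = softmax(theta[s]).
    sample_episode(pi, rng) should return a list of (s, a, r) tuples obtained
    by running pi from s0 (an illustrative callback, not a library function).
    """
    rng = np.random.default_rng(seed)
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi = z / z.sum(axis=1, keepdims=True)
    grad = np.zeros_like(theta)
    for _ in range(n_episodes):
        episode = sample_episode(pi, rng)
        rewards = np.array([r for _, _, r in episode])
        discounts = gamma ** np.arange(len(rewards))
        for t, (s, a, _) in enumerate(episode):
            weighted_return = np.sum(discounts[t:] * rewards[t:])  # gamma^t * (return-to-go from t)
            grad_logpi = -pi[s].copy()                             # grad_theta[s] log pi_theta(a|s)
            grad_logpi[a] += 1.0                                   #   = e_a - pi(.|s) for the softmax
            grad[s] += weighted_return * grad_logpi
    return grad / n_episodes
```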
Supervised Learning: gradient descent works well in practice (not sensitive to initialization).

Reinforcement Learning: the optimization landscape has "very" flat regions, and gradients can be exponentially small in the "horizon" due to lack of exploration.

Lemma [Higher-order vanishing gradients] (see also Thrun '92): Suppose there are S states in the MDP, with S ≤ 1/(1 − γ). With random initialization, for all k-th higher-order gradients with k < S/log(S), the spectral norm of the gradient is bounded by 2^{−S/2}.

This talk: policy gradient methods are among the most widely used practical tools. Can we get any provable handle on them?
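For a sense of scale (an illustrative calculation, not from the lemma itself): with S = 40 states, the bound on every gradient of order k < S/log(S) is

```latex
2^{-S/2}\Big|_{S=40} \;=\; 2^{-20} \;=\; \tfrac{1}{1{,}048{,}576} \;\approx\; 9.5\times 10^{-7},
```

so at a random initialization the landscape looks essentially flat to any gradient-based method.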
We provide provable global convergence and generalization guarantees, addressing:
- curvature + non-convexity
- generalization and distribution shift
Policy Optimization over the "softmax" policy class (let's start simple!)
πθ(a|s), the probability of action a given state s, is
$$\pi_\theta(a|s) = \frac{\exp(\theta_{s,a})}{\sum_{a'} \exp(\theta_{s,a'})}.$$
The policy optimization problem $\max_\theta V^{\pi_\theta}(s_0)$ is non-convex. Do we have global convergence?
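A minimal sketch of this tabular softmax parameterization (all names illustrative): θ is an S×A table, and each row is mapped through a softmax.

```python
import numpy as np

def softmax_policy(theta):
    """pi_theta(a|s) = exp(theta[s, a]) / sum_{a'} exp(theta[s, a']), computed row-wise.

    theta has shape (S, A); returns an (S, A) array whose rows sum to 1.
    """
    z = theta - theta.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    expz = np.exp(z)
    return expz / expz.sum(axis=1, keepdims=True)

# Example: 2 states, 3 actions.
theta = np.zeros((2, 3))
theta[0, 1] = 2.0                                  # raising theta[0, 1] makes action 1 likelier in state 0
print(softmax_policy(theta))
```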
Theorem [Vanilla PG for the softmax policy class]: Let Vθ(μ) = E_{s∼μ}[Vθ(s)] and run θ ← θ + η∇θVθ(μ). Suppose μ has full support over the state space. Then, for all states s, Vθ(s) → V⋆(s).
Global Convergence: Softmax + Log Barrier regularization
Add a log barrier (related to entropy regularization) so that πθ(a|s) doesn't become too small:
$$L_\lambda(\theta) := V_\theta(\mu) + \frac{\lambda}{SA}\sum_{s,a} \log \pi_\theta(a|s), \qquad \theta \leftarrow \theta + \eta \nabla L_\lambda(\theta).$$
Theorem [PG: Softmax + Log Barrier]: Let S = #states, A = #actions, H = horizon = 1/(1 − γ), and take μ = uniform over states. With appropriate settings of λ and η, after $S^4 A^2 H^6 / \epsilon^2$ iterations we have, for all s, Vθ(s) ≥ V⋆(s) − ϵ.
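A minimal sketch of one log-barrier-regularized gradient step, assuming some estimator `grad_value` of ∇θVθ(μ) is available; that callback and all names are illustrative.

```python
import numpy as np

def log_barrier_pg_step(theta, grad_value, lam, eta):
    """theta <- theta + eta * grad L_lambda(theta), with
    L_lambda(theta) = V_theta(mu) + (lam / (S*A)) * sum_{s,a} log pi_theta(a|s).

    theta has shape (S, A); grad_value(theta) returns an estimate of grad V_theta(mu).
    """
    S, A = theta.shape
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi = z / z.sum(axis=1, keepdims=True)              # pi_theta(a|s)
    # d/d theta[s, a'] of sum_a log pi_theta(a|s) equals 1 - A * pi_theta(a'|s),
    # so the barrier term contributes (1 - A * pi) / (S * A) to the gradient.
    grad_barrier = (1.0 - A * pi) / (S * A)
    return theta + eta * (grad_value(theta) + lam * grad_barrier)
```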
Preconditioning: The Natural Policy Gradient (NPG)
NPG [K. '01]; TRPO [Schulman '15]; PPO [Schulman '17]
Precondition the gradient with the Fisher information metric (to move 'more' near the boundaries of the simplex). The update is:
$$F(\theta) = \mathbb{E}_{s,a\sim\pi_\theta}\left[\nabla \log \pi_\theta(a|s)\,\nabla \log \pi_\theta(a|s)^\top\right], \qquad \theta \leftarrow \theta + \eta F(\theta)^{-1}\nabla V_\theta(s_0).$$
For the softmax class, this is equivalent to a "soft" policy iteration update rule:
$$\pi(a|s) \leftarrow \frac{\pi(a|s)\,\exp(\eta Q^\pi(s,a))}{Z_s}.$$
What happens for this non-convex update rule?
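A minimal sketch of the equivalent "soft" policy iteration update in the tabular case (here Q is an estimate of Q^π; names illustrative):

```python
import numpy as np

def npg_soft_update(pi, Q, eta):
    """pi(a|s) <- pi(a|s) * exp(eta * Q(s, a)) / Z_s for the tabular softmax class.

    pi and Q both have shape (S, A); Q is (an estimate of) Q^pi.
    """
    # Shifting Q by its per-state max only rescales each row, and that rescaling
    # cancels in the normalization, so this is the same update done stably.
    new_pi = pi * np.exp(eta * (Q - Q.max(axis=1, keepdims=True)))
    return new_pi / new_pi.sum(axis=1, keepdims=True)   # divide by Z_s
```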
Theorem [NPG]: Set η = (1 − γ)² log A. For the softmax policy class, after T iterations,
$$V^{(T)}(\rho) \;\ge\; V^\star(\rho) - \frac{2}{(1-\gamma)^2\, T}$$
(note: the rate has no dependence on S, A, or μ).
Analysis idea from [Even-Dar, K., Mansour 2009].
What about approximate/sampled gradients and large state spaces? And what is the role of the "coverage measure" μ?
Brittle policies if we train only from one configuration!
Policies trained from a single starting configuration s0 are not robust.
Training from different starting configurations, sampled as s0 ∼ μ, fixes this.
Trained with "domain randomization": basically, the starting measure μ was diverse.
(this is not an issue in supervised learning)
(μ lets us sidestep exploration…)
Generalization:
- prior guarantees are stated in terms of ℓ∞ errors; some relaxations are possible [Munos, 2005; Antos et al., 2008]
- here: provable guarantees in terms of a 'supervised learning' error + distribution shift relative to μ

Optimization and global convergence:
πθ(a|s), the probability of action a given s, parameterized by θ:
$$\pi_\theta(a|s) \propto \exp(f_\theta(s,a))$$
Log-linear policies: $f_\theta(s,a) = \theta \cdot \phi(s,a)$, where $\phi(s,a) \in \mathbb{R}^d$.
Neural policies: $f_\theta(s,a)$ is a neural network.
NPG with log-linear policies: features ϕ(s, a) ∈ ℝ^d and πθ(a|s) ∝ exp(θ · ϕ_{s,a}).
The update θ ← θ + ηF(θ)^{-1}∇Vθ(s0) is equivalent to the "soft" + approximate policy iteration update.
Fit Qθ with the features:
$$w^\star \in \arg\min_w \; \mathbb{E}_{s,a\sim d(\cdot|\pi,\mu)}\left[(Q^\theta(s,a) - w\cdot\phi_{s,a})^2\right],$$
where d(·|π, μ) is the "on-policy" distribution starting from s0, a0 ∼ μ.
Then update the policy:
$$\pi(a|s) \leftarrow \pi(a|s)\exp(w^\star\cdot\phi_{s,a})/Z_s$$
(Z_s is the normalizing constant).
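A minimal sketch of one such NPG step with log-linear policies. The on-policy sampler, its Q-value estimates, and all names are illustrative placeholders, and the step size η is made explicit here.

```python
import numpy as np

def q_npg_step(theta, phi, sample_onpolicy, eta, n_samples=10_000, seed=0):
    """One NPG step for pi_theta(a|s) proportional to exp(theta . phi[s, a]).

    phi has shape (S, A, d). sample_onpolicy(pi, n, rng) -> (states, actions, q_values),
    with (s, a) drawn from the on-policy distribution d(.|pi, mu) and q_values
    estimating Q^{pi_theta}(s, a); the sampler is an illustrative callback.
    """
    rng = np.random.default_rng(seed)
    logits = phi @ theta                                  # (S, A)
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi = z / z.sum(axis=1, keepdims=True)
    # 1) Regression: w_hat ~ argmin_w E_{s,a ~ d}[ (Q(s,a) - w . phi[s,a])^2 ]
    states, actions, q_values = sample_onpolicy(pi, n_samples, rng)
    X = phi[states, actions]                              # (n, d) feature matrix
    w_hat, *_ = np.linalg.lstsq(X, q_values, rcond=None)
    # 2) The multiplicative update pi(a|s) <- pi(a|s) exp(eta * w_hat . phi[s,a]) / Z_s
    #    is, for this policy class, just a step in parameter space:
    return theta + eta * w_hat
```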
Assumptions:
- Qθ(s, a) is a linear function in ϕ(s, a).
- ŵ_t has bounded regression error (say, due to sampling):
$$\mathbb{E}_{s,a\sim d(\cdot|\pi,\mu)}\left[(Q^\theta(s,a) - \hat{w}_t\cdot\phi_{s,a})^2\right] \le \epsilon_{\mathrm{stat}}.$$
- ‖ϕ_{s,a}‖ ≤ 1, and define $\kappa = 1/\sigma_{\min}\!\left(\mathbb{E}_{s,a\sim\mu}[\phi_{s,a}\phi_{s,a}^\top]\right)$.

Theorem [NPG]: Let A = #actions, H = horizon = 1/(1 − γ), and assume the norm bound ‖ŵ_t‖ ≤ W. After T iterations, the NPG algorithm returns a π such that
$$V^{(T)}(\rho) \;\ge\; V^\star(\rho) - HW\sqrt{\frac{2\log A}{T}} - \sqrt{4AH^3\,\kappa\,\epsilon_{\mathrm{stat}}}.$$
Sample-based version (just notation for the sample-based approach): for a distribution υ over state-action pairs, define
$$L(w;\theta,\upsilon) := \mathbb{E}_{s,a\sim\upsilon}\left[(Q^{\pi_\theta}(s,a) - w\cdot\phi_{s,a})^2\right].$$
Fit Qθ with the features: $\hat{w}_t \approx \arg\min_w L(w;\theta, d(\cdot|\pi,\mu))$, where d(·|π, μ) is the "on-policy" distribution starting from s0, a0 ∼ μ.
Then update π(a|s) ← π(a|s) exp(w⋆ · ϕ_{s,a})/Z_s (Z_s is the normalizing constant).
Assume, for each iteration t:
- excess risk (statistical error): $L(w^{(t)};\theta^{(t)},d^{(t)}) - L(w^{(t)}_\star;\theta^{(t)},d^{(t)}) \le \epsilon_{\mathrm{stat}}$
- approximation error: $L(w^{(t)}_\star;\theta^{(t)},d^{(t)}) \le \epsilon_{\mathrm{approx}}$
- ‖ϕ_{s,a}‖ ≤ 1, and define $\kappa = 1/\sigma_{\min}\!\left(\mathbb{E}_{s,a\sim\mu}[\phi_{s,a}\phi_{s,a}^\top]\right)$.
Theorem [NPG]: Let A = #actions and H = horizon = 1/(1 − γ). After T iterations, the NPG algorithm returns a π such that
$$V^{(T)}(s_0) \;\ge\; V^\star(s_0) - HW\sqrt{\frac{2\log A}{T}} - \sqrt{4AH^3\left(\kappa\,\epsilon_{\mathrm{stat}} + \left\|\frac{d^\star}{\mu}\right\|_\infty \epsilon_{\mathrm{approx}}\right)},$$
where $\left\|\frac{a}{b}\right\|_\infty = \max_i \frac{a_i}{b_i}$.
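As a simple illustrative bound on the distribution-mismatch term (not from the theorem itself): if μ is uniform over N state-action pairs, then since d⋆ is a probability distribution,

```latex
\left\|\frac{d^\star}{\mu}\right\|_\infty
  \;=\; \max_{s,a}\frac{d^\star(s,a)}{\mu(s,a)}
  \;\le\; \frac{1}{1/N} \;=\; N ,
```

so a diverse (e.g., uniform) coverage measure μ keeps this coefficient under control.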
Alekh Agarwal, Jason Lee, Gaurav Mahajan
On constructing a good coverage measure μ: see the PC-PG paper!