SLIDE 1

New Temporal-Difference Methods Based on Gradient Descent

Rich Sutton, Hamid Maei, Doina Precup (McGill), Shalabh Bhatnagar (IISc Bangalore), Csaba Szepesvari, Eric Wiewiora, David Silver

SLIDE 2

Outline

  • The promise and problems of TD learning
  • Value-function approximation
  • Gradient-descent methods - LMS example
  • Objective functions for TD
  • GD derivation of new algorithms
  • Proofs of convergence
  • Empirical results
  • Conclusions
SLIDE 3

What is temporal-difference learning?

  • The most important and distinctive idea in reinforcement learning
  • A way of learning to predict, from changes in your predictions, without waiting for the final outcome
  • A way of taking advantage of state in multi-step prediction problems
  • Learning a guess from a guess
SLIDE 4

Examples of TD learning opportunities

  • Learning to evaluate backgammon positions from changes in evaluation within a game
  • Learning where your tennis opponent will hit the ball from his approach
  • Learning what features of a market indicate that it will have a major decline
  • Learning to recognize your friend’s face
SLIDE 5

Function approximation

  • TD learning is sometimes done in a table-lookup context - where every state is distinct and treated totally separately
  • But really, to be powerful, we must generalize between states
  • The same state never occurs twice

For example, in Computer Go, we use 10^6 parameters to learn about 10^170 positions

SLIDE 6

Advantages of TD methods for prediction

  • 1. Data efficient. Learn much faster on Markov problems
  • 2. Cheap to implement. Require less memory and peak computation
  • 3. Able to learn from incomplete sequences. In particular, able to learn off-policy

SLIDE 7

Off-policy learning

  • Learning about a policy different than the one being used to generate actions
  • Most often used to learn optimal behavior from a given data set, or from more exploratory behavior
  • Key to ambitious theories of knowledge and perception as continual prediction about the outcomes of options
SLIDE 8

Outline

  • The promise and problems of TD learning
  • Value-function approximation
  • Gradient-descent methods - LMS example
  • Objective functions for TD
  • GD derivation of new algorithms
  • Proofs of convergence
  • Empirical results
  • Conclusions
SLIDE 9

Value-function approximation from sample trajectories

[Diagram: sample trajectories through states, each ending in a numeric outcome]

  • True values: V(s) = E[outcome | s]
  • Estimated values: V_θ(s) ≈ V(s), θ ∈ ℝ^n
  • Linear approximation: V_θ(s) = θ⊤φ_s, φ_s ∈ ℝ^n

θ is the modifiable parameter vector; φ_s is the feature vector for state s

SLIDE 10

Value-function approximation from sample trajectories

  • True values: V(s) = E[outcome | s]
  • Estimated values: V_θ(s) ≈ V(s), θ ∈ ℝ^n
  • Linear approximation: V_θ(s) = θ⊤φ_s, φ_s ∈ ℝ^n

θ is the modifiable parameter vector; φ_s is the feature vector for state s

[Diagram: a worked example for one state s. The feature vector φ_s is binary with three active features; multiplying by the parameter vector θ (entries such as 0.1, -2, 0.5, 5, -.4, ...) gives V_θ(s) = -2 + 0 + 5 = 3]
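To make the linear form concrete, here is a minimal sketch (with hypothetical feature and parameter values in the spirit of the slide's worked example) of computing V_θ(s) = θ⊤φ_s for a sparse binary feature vector:

```python
import numpy as np

# Hypothetical parameter vector (one entry per feature).
theta = np.array([0.1, -2.0, 0.5, 5.0, -0.4, 0.0])

# Binary feature vector for some state s: three features are active.
phi_s = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])

# Linear value estimate: V_theta(s) = theta^T phi_s
v_s = theta @ phi_s   # -2.0 + 5.0 + 0.0 = 3.0
print(v_s)
```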

SLIDE 11

From terminal outcomes to per-step rewards

[Diagram: a state trajectory with a per-step reward on each transition; the target values (returns) are the sum of future rewards until the end of the episode, or until the discounting horizon]

  • True values:

V(s) = E[ Σ_{t=0}^∞ γ^t r_t | s_0 = s ]

where γ is the discount rate, 0 ≤ γ ≤ 1
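As a quick illustration of the return, here is a small sketch (made-up reward sequence and discount rate) that computes the discounted sum of future rewards:

```python
# Discounted return G = sum_t gamma^t * r_t for a made-up reward sequence.
gamma = 0.9
rewards = [1.0, 2.0, -1.0, 1.0]   # r_0, r_1, r_2, r_3 (episode ends here)

G = 0.0
for t, r in enumerate(rewards):
    G += (gamma ** t) * r
print(G)   # 1 + 0.9*2 - 0.81*1 + 0.729*1 = 2.719
```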

SLIDE 12

TD methods operate on individual transitions

[Diagram: trajectories are cut up into individual transitions]

Training set is now a bag of transitions. Select from them i.i.d. (independently, identically distributed).

  • d_s - distribution of first state s
  • r_s - expected reward given s
  • P_ss' - prob of next state s' given s

Sample transition: (s, r, s') or (φ, r, φ'). P and d are linked.

TD(0) algorithm:

θ ← θ + αδφ, where δ = r + γθ⊤φ' - θ⊤φ
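A minimal sketch of this linear TD(0) update on a single sampled transition (illustrative step size and made-up feature vectors, not the authors' code):

```python
import numpy as np

def td0_update(theta, phi, r, phi_next, alpha=0.1, gamma=0.99):
    """One linear TD(0) update for a sampled transition (phi, r, phi')."""
    delta = r + gamma * (theta @ phi_next) - (theta @ phi)   # TD error
    return theta + alpha * delta * phi

# Example: a single transition with made-up feature vectors.
theta = np.zeros(3)
theta = td0_update(theta, np.array([1.0, 0.0, 0.0]), 1.0, np.array([0.0, 1.0, 0.0]))
print(theta)
```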

SLIDE 13

Off-policy training

[Diagram: the same bag of transitions, with d_s, r_s, and P_ss' as before]

P and d are no longer linked. TD(0) may diverge!

SLIDE 14

Baird’s counter-example

[Diagram: Baird's counter-example Markov chain with a terminal state and transition probabilities of 100%, 99%, and 1%. Approximate values: V_k(s) = θ(7) + 2θ(i) for states i = 1..5, and V_k(s) = 2θ(7) + θ(6) for the sixth state]

  • P and d are not linked
  • d is all states with equal probability
  • P is according to this Markov chain

α = 0.01, γ = 0.99
θ_0 = (1, 1, 1, 1, 1, 10, 1)
r = 0

SLIDE 15

TD can diverge: Baird’s counter-example

α = 0.01, γ = 0.99
θ_0 = (1, 1, 1, 1, 1, 10, 1)

[Plot: parameter values θ_k(i) versus iterations k (log scale, broken at ±1) under deterministic updates; the curves for θ_k(7), θ_k(1)-θ_k(5), and θ_k(6) all diverge]

SLIDE 16

TD(0) can diverge: A simple example

[Diagram: a two-state example in which the approximate values are θ and 2θ]

TD update:  δ = r + γθ⊤φ' - θ⊤φ = 0 + 2θ - θ = θ,  Δθ = αδφ = αθ

TD fixpoint:  θ* = 0

Diverges!
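A minimal numerical sketch of this divergence (assuming γ = 0.99, r = 0, and repeated sampling of the single θ → 2θ transition):

```python
# The "theta -> 2*theta" example: features are 1 and 2, reward 0.
alpha, gamma = 0.1, 0.99
phi, phi_next, r = 1.0, 2.0, 0.0

theta = 1.0
for k in range(100):
    delta = r + gamma * theta * phi_next - theta * phi   # = (2*gamma - 1) * theta
    theta += alpha * delta * phi
print(theta)   # grows without bound: each update multiplies theta by 1 + alpha*(2*gamma - 1)
```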

SLIDE 17

Previous attempts to solve the off-policy problem

  • Importance sampling
  • With recognizers
  • Least-squares methods, LSTD, LSPI, iLSTD
  • Averagers
  • Residual gradient methods
SLIDE 18

Desiderata: We want a TD algorithm that

  • Bootstraps (genuine TD)
  • Works with linear function approximation (stable, reliably convergent)
  • Is simple, like linear TD — O(n)
  • Learns fast, like linear TD
  • Can learn off-policy (arbitrary P and d)
  • Learns from online causal trajectories (no repeat sampling from the same state)

SLIDE 19

Outline

  • The promise and problems of TD learning
  • Value-function approximation
  • Gradient-descent methods - LMS example
  • Objective functions for TD
  • GD derivation of new algorithms
  • Proofs of convergence
  • Empirical results
  • Conclusions
SLIDE 20

Gradient-descent learning methods - the recipe

  • 1. Pick an objective function J(θ), a parameterized function to be minimized
  • 2. Use calculus to analytically compute the gradient ∇_θ J(θ)
  • 3. Find a “sample gradient” ∇_θ J_t(θ) that you can sample on every time step and whose expected value equals the gradient
  • 4. Take small steps proportional to the sample gradient:

θ ← θ - α ∇_θ J_t(θ)
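One way to instantiate this recipe is the LMS example from the outline: take J(θ) = ½ E[(θ⊤φ - G)²] with a (Monte Carlo) return G as the target, so the sample gradient gives the familiar least-mean-squares update. A minimal sketch with made-up data, not the authors' code:

```python
import numpy as np

# LMS objective J(theta) = 1/2 E[(theta^T phi - G)^2], G = target return for the state.
alpha = 0.1
theta = np.zeros(3)

samples = [(np.array([1.0, 0.0, 1.0]), 2.0),   # (feature vector phi, target return G)
           (np.array([0.0, 1.0, 1.0]), 0.5)]

for phi, G in samples:
    error = (theta @ phi) - G        # sample gradient of J is error * phi
    theta -= alpha * error * phi     # step against the sample gradient
print(theta)
```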

SLIDE 21

Conventional TD is not the gradient of anything

TD(0) algorithm:  Δθ = αδφ,  where δ = r + γθ⊤φ' - θ⊤φ

Assume there is a J such that  ∂J/∂θ_i = δφ_i.  Then look at the second derivatives:

∂²J / ∂θ_j∂θ_i = ∂(δφ_i)/∂θ_j = (γφ'_j - φ_j)φ_i
∂²J / ∂θ_i∂θ_j = ∂(δφ_j)/∂θ_i = (γφ'_i - φ_i)φ_j

Real 2nd derivatives must be symmetric: ∂²J/∂θ_j∂θ_i = ∂²J/∂θ_i∂θ_j. But these two expressions are not equal in general. Contradiction!
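A quick numerical sketch (arbitrary made-up vectors) confirming that the implied "second derivative" matrix is not symmetric, so no such J can exist:

```python
import numpy as np

gamma = 0.9
phi      = np.array([1.0, 0.0, 2.0])   # features of the current state (made up)
phi_next = np.array([0.0, 1.0, 1.0])   # features of the next state (made up)

# (i, j) entry of the would-be Hessian: d(delta*phi_i)/d(theta_j) = (gamma*phi'_j - phi_j)*phi_i
H = np.outer(phi, gamma * phi_next - phi)
print(np.allclose(H, H.T))   # False: not symmetric, so TD(0) is not the gradient of any J
```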

SLIDE 22

Outline

  • The promise and problems of TD learning
  • Value-function approximation
  • Gradient-descent methods - LMS example
  • Objective functions for TD
  • GD derivation of new algorithms
  • Proofs of convergence
  • Empirical results
  • Conclusions
SLIDE 23

Gradient descent for TD:

What should the objective function be?

  • Close to the true values (the true value function)?

Mean-Square Value Error:  MSE(θ) = Σ_s d_s (V_θ(s) - V(s))² = ‖V_θ - V‖²_D

  • Or close to satisfying the Bellman equation?

Mean-Square Bellman Error:  MSBE(θ) = ‖V_θ - TV_θ‖²_D

where T is the Bellman operator defined by  V = r + γPV = TV

SLIDE 24

Value function geometry

[Diagram: value-function geometry. V_θ lies in the space spanned by the feature vectors, weighted by the state visitation distribution (Φ, D). The Bellman operator T takes you outside the space; the projection Π takes you back into it. The distance ‖V_θ - TV_θ‖_D is the RMSBE; the distance ‖V_θ - ΠTV_θ‖_D is the RMSPBE]

D = diag(d)

Previous work on gradient methods for TD minimized the Bellman-error objective fn (Baird 1995, 1999).

V_θ = ΠTV_θ is the TD fix-point. Is the Mean Square Projected Bellman Error (MSPBE) a better objective fn?

SLIDE 25

A-split example (Dayan 1992)

[Diagram: from state A, 50% of trajectories terminate immediately and 50% go to state B; from B, 100% terminate with reward 1]

Clearly, the true values are V(A) = 0.5, V(B) = 1.

But if you minimize the naive objective fn J(θ) = E[δ²], then you get the solution V(A) = 1/3, V(B) = 2/3, even in the tabular case (no FA).

SLIDE 26

Split-A example

[Diagram: two states A1 and A2 that share a single feature; A1 leads (100%) to B, which terminates (100%) with reward 1, while A2 terminates (100%) with reward 0]

The two ‘A’ states look the same: they share a single feature and must be given the same approximate value. The data appear just like the previous example, and the minimum-MSBE solution is again

V(A) = 1/3, V(B) = 2/3

SLIDE 27

Outline

  • The promise and problems of TD learning
  • Value-function approximation
  • Gradient-descent methods - LMS example
  • Objective functions for TD
  • GD derivation of new algorithms
  • Proofs of convergence
  • Empirical results
  • Conclusions
SLIDE 28

Three new algorithms

  • GTD, the original gradient TD algorithm (Sutton, Szepesvari & Maei, 2008)
  • GTD-2, a second-generation GTD
  • TD-C, TD with gradient correction
  • GTD(λ), GQ(λ)
SLIDE 29

First relate the geometry to the iid statistics

[Geometry diagram as on the earlier slide: V_θ, TV_θ, and ΠTV_θ in the space (Φ, D), with RMSBE and RMSPBE]

Key identities:

Φ⊤D(TV_θ - V_θ) = E[δφ]
Φ⊤DΦ = E[φφ⊤]

MSPBE(θ) = ‖V_θ - ΠTV_θ‖²_D
         = ‖Π(V_θ - TV_θ)‖²_D
         = (Π(V_θ - TV_θ))⊤ D (Π(V_θ - TV_θ))
         = (V_θ - TV_θ)⊤ Π⊤DΠ (V_θ - TV_θ)
         = (V_θ - TV_θ)⊤ D⊤Φ (Φ⊤DΦ)⁻¹ Φ⊤D (V_θ - TV_θ)
         = (Φ⊤D(TV_θ - V_θ))⊤ (Φ⊤DΦ)⁻¹ Φ⊤D(TV_θ - V_θ)
         = E[δφ]⊤ E[φφ⊤]⁻¹ E[δφ]
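A minimal sketch (made-up transitions) of estimating the MSPBE from sampled statistics, using the identity MSPBE(θ) = E[δφ]⊤ E[φφ⊤]⁻¹ E[δφ]:

```python
import numpy as np

def mspbe(theta, transitions, gamma=0.99):
    """Estimate MSPBE(theta) = E[delta*phi]^T E[phi phi^T]^-1 E[delta*phi] from samples."""
    E_dphi = np.zeros_like(theta)
    E_pp = np.zeros((len(theta), len(theta)))
    for phi, r, phi_next in transitions:
        delta = r + gamma * (theta @ phi_next) - (theta @ phi)
        E_dphi += delta * phi
        E_pp += np.outer(phi, phi)
    E_dphi /= len(transitions)
    E_pp /= len(transitions)
    return E_dphi @ np.linalg.solve(E_pp, E_dphi)

# Made-up data: two transitions (phi, r, phi').
data = [(np.array([1.0, 0.0]), 0.0, np.array([0.0, 1.0])),
        (np.array([0.0, 1.0]), 1.0, np.array([0.0, 0.0]))]
print(mspbe(np.zeros(2), data))
```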
SLIDE 30

Derivation of the GTD-2 algorithm as gradient descent in the MSPBE

-½ ∇_θ MSPBE(θ) = E[(φ - γφ')φ⊤] E[φφ⊤]⁻¹ E[δφ]
               ≈ E[(φ - γφ')φ⊤] w,

assuming  w ≈ E[φφ⊤]⁻¹ E[δφ].   (This is the main trick!)

Sampling the expectation yields the O(n) update (Gradient TD Algorithm #2):

θ ← θ + α (φ - γφ')(φ⊤w)
w ← w + β (δ - φ⊤w) φ

where  δ = r + γθ⊤φ' - θ⊤φ
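A minimal sketch (Python, illustrative step sizes) of the GTD-2 update on one sampled transition:

```python
import numpy as np

def gtd2_update(theta, w, phi, r, phi_next, alpha=0.05, beta=0.05, gamma=0.99):
    """One GTD-2 update for a sampled transition (phi, r, phi')."""
    delta = r + gamma * (theta @ phi_next) - (theta @ phi)
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
    w = w + beta * (delta - phi @ w) * phi
    return theta, w

theta, w = np.zeros(3), np.zeros(3)
theta, w = gtd2_update(theta, w, np.array([1.0, 0.0, 0.0]), 1.0, np.array([0.0, 1.0, 0.0]))
print(theta, w)
```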

SLIDE 31

Derivation of the original GTD algorithm as gradient descent in the NEU

NEU(θ) = E[δφ]⊤ E[δφ]   (the norm of the expected TD update)

-½ ∇_θ NEU(θ) = E[(φ - γφ')φ⊤] E[δφ] ≈ E[(φ - γφ')φ⊤] w,

assuming  w ≈ E[δφ].

Sampling the expectation yields the same θ update as GTD-2, but with a different w update:

w ← w + β (δφ - w)
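For contrast with GTD-2, a minimal sketch of the original GTD update on one transition (same θ step, but w tracks E[δφ] instead; illustrative step sizes):

```python
import numpy as np

def gtd_update(theta, w, phi, r, phi_next, alpha=0.05, beta=0.05, gamma=0.99):
    """One original-GTD update: same theta step as GTD-2, different w update."""
    delta = r + gamma * (theta @ phi_next) - (theta @ phi)
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
    w = w + beta * (delta * phi - w)   # w estimates E[delta * phi]
    return theta, w
```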

SLIDE 32

Derivation of the TD-C algorithm as gradient descent in the MSPBE

-½ ∇_θ MSPBE(θ) = E[(φ - γφ')φ⊤] E[φφ⊤]⁻¹ E[δφ]
               = (E[φφ⊤] - γE[φ'φ⊤]) E[φφ⊤]⁻¹ E[δφ]
               = E[δφ] - γE[φ'φ⊤] E[φφ⊤]⁻¹ E[δφ]
               ≈ E[δφ] - γE[φ'φ⊤] w,

assuming  w ≈ E[φφ⊤]⁻¹ E[δφ].

Sampling the expectation yields

θ ← θ + αδφ - αγφ'(φ⊤w)

(the conventional TD(0) update plus a gradient correction term), with w updated as in GTD-2.
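A minimal sketch (illustrative step sizes) of the TD-C update on one sampled transition:

```python
import numpy as np

def tdc_update(theta, w, phi, r, phi_next, alpha=0.05, beta=0.1, gamma=0.99):
    """One TD-with-gradient-correction (TD-C) update for a transition (phi, r, phi')."""
    delta = r + gamma * (theta @ phi_next) - (theta @ phi)
    theta = theta + alpha * delta * phi - alpha * gamma * phi_next * (phi @ w)
    w = w + beta * (delta - phi @ w) * phi   # same w update as GTD-2
    return theta, w
```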

SLIDE 33

Outline

  • The promise and problems of TD learning
  • Value-function approximation
  • Gradient-descent methods - LMS example
  • Objective functions for TD
  • GD derivation of new algorithms
  • Proofs of convergence (sketch and remarks)
  • Empirical results
SLIDE 34

Convergence theorems

  • For arbitrary P and d
  • All algorithms converge w.p.1 to the TD fix-point: E[δφ] → 0
  • GTD and GTD-2 converge at one time scale (α = β → 0)
  • TD-C converges in a two-time-scale sense (α, β → 0 with α/β → 0)

SLIDE 35

Outline

  • The promise and problems of TD learning
  • Value-function approximation
  • Gradient-descent methods - LMS example
  • Objective functions for TD
  • GD derivation of new algorithms
  • Proofs of convergence
  • Empirical results
  • Conclusions
SLIDE 36

Random walk problem (on-policy)

[Diagram: a 5-state random walk A-B-C-D-E, starting in the middle, with a reward of 1 on termination at one end]

3 different feature representations:

  • 5 tabular features
  • 5 inverted-tabular features
  • 3 features (genuine FA)
SLIDE 37

Boyan chain problem (on-policy)

[Diagram: the Boyan chain of 13 states, with rewards of -3 per transition (-2 near the end) and 4-component feature vectors that interpolate across the states, e.g. [1, 0, 0, 0], [0.75, 0.25, 0, 0], [0.5, 0.5, 0, 0]]

13 states, 4 features. Exact solution possible.

Boyan 1999

SLIDE 38

Summary of empirical results

[Plots: RMSPBE learning curves (versus episodes) and step-size (α) sensitivity curves for GTD, GTD-2, TD-C, and TD on four problems: Random Walk with tabular features, Random Walk with inverted features, Random Walk with dependent features, and the Boyan chain]

  • On small problems: TD, TD-C > GTD-2 > GTD
  • Sometimes TD > TD-C

SLIDE 39

Computer Go experiment

  • Learn the value function (probability of winning) for 5x5 Go
  • Lots of features, linearly combined, then passed through a logistic non-linearity
  • An established experimental testbed
  • Tried the various algorithms
  • Results are still preliminary
SLIDE 40

Computer Go results

[Plot: NEU versus step size (alpha) for TD, GTD, TD-C, and GTD-2 on 5x5 Go]

TD-C, TD > GTD, GTD-2

SLIDE 41

Off-policy result: Baird’s counter-example

[Plots: learning curves on Baird's counter-example for TD and the gradient algorithms]

Gradient algorithms converge. TD diverges.

SLIDE 42

Conclusions

  • The first O(n) methods to work off-policy (and meet all the other desiderata)
  • New methods (GTD-2 and TD-C) are much faster than original GTD
  • Not clear yet whether or not TD-C is sufficiently close to TD speed on on-policy problems
  • But it is at least a major step closer. And it works off-policy