Temporal Difference Methods
CS60077: Reinforcement Learning
Abir Das, IIT Kharagpur
Oct 12, 13, 19, 2020
Agenda
§ Understand incremental computation of Monte Carlo methods.
§ From incremental Monte Carlo methods, the journey will take us to different Temporal Difference (TD) based methods.
Resources
§ Reinforcement Learning by Udacity [Link]
§ Reinforcement Learning by Balaraman Ravindran [Link]
§ Reinforcement Learning by David Silver [Link]
§ SB: Chapter 6
MRP Evaluation - Model Based
§ Like the previous approaches, here also we are going to first look at the evaluation problem using TD methods, and later we will do TD control.
§ Let us take an MRP. Why an MRP?
𝑇" 𝑇# 𝑇$ 𝑇% 𝑇& 𝑇' +1 +2 +0 +1 +10 0.9 0.1
§ Find V(S3), given γ = 1
§ V(SF) = 0
§ Then V(S4) = 1 + 1 × 0 = 1, V(S5) = 10 + 1 × 0 = 10
§ Then V(S3) = 0 + 1 × (0.9 × 1 + 0.1 × 10) = 1.9
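Since the model here is a small acyclic MRP, the same backward computation can be written in a few lines of code. Below is a minimal sketch, assuming the state names and the transition/reward structure read off the diagram above; everything else in it is illustration-only scaffolding.

# Minimal sketch: model-based evaluation of the small MRP above (gamma = 1).
# Transition model assumed from the diagram: state -> list of (prob, reward, next_state).
gamma = 1.0
model = {
    "S1": [(1.0, 1, "S3")],
    "S2": [(1.0, 2, "S3")],
    "S3": [(0.9, 0, "S4"), (0.1, 0, "S5")],
    "S4": [(1.0, 1, "SF")],
    "S5": [(1.0, 10, "SF")],
    "SF": [],  # terminal state, V(SF) = 0
}

V = {s: 0.0 for s in model}
# Evaluate states in reverse topological order (the chain is acyclic).
for s in ["S5", "S4", "S3", "S2", "S1"]:
    V[s] = sum(p * (r + gamma * V[s_next]) for p, r, s_next in model[s])

print(V)  # expect V(S4) = 1, V(S5) = 10, V(S3) = 1.9, V(S2) = 3.9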
MRP Evaluation - Monte Carlo
§ Now let us think about how to get the values from ‘experience’ without knowing the model.
§ Let us say we have the following samples/episodes.
𝑇" 𝑇# 𝑇$ 𝑇% 𝑇& 𝑇' +1 +2 +0 +1 +10 0.9 0.1
𝑇" 𝑇" 𝑇" 𝑇" 𝑇# 𝑇$ 𝑇$ 𝑇$ 𝑇$ 𝑇$ 𝑇% 𝑇& 𝑇% 𝑇% 𝑇& 𝑇' 𝑇' 𝑇' 𝑇' 𝑇'
+1 +0 +1 +1 +1 +1 +2 +0 +0 +0 +0 +1 +1 +10 +10
§ What is the estimated value of V(S1) - after 3 episodes? after 4 episodes?
§ After 3 episodes: [(1+0+1) + (1+0+10) + (1+0+1)] / 3 = 5.0
§ After 4 episodes: [(1+0+1) + (1+0+10) + (1+0+1) + (1+0+1)] / 4 = 4.25
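As a quick sanity check, here is a minimal first-visit Monte Carlo sketch over the five episodes listed above; the episode encoding (state, reward-on-leaving-that-state) is an assumption made only for this illustration.

# Minimal sketch: Monte Carlo estimate of V(S1) from the sampled episodes above (gamma = 1).
episodes = [
    [("S1", 1), ("S3", 0), ("S4", 1)],
    [("S1", 1), ("S3", 0), ("S5", 10)],
    [("S1", 1), ("S3", 0), ("S4", 1)],
    [("S1", 1), ("S3", 0), ("S4", 1)],
    [("S2", 2), ("S3", 0), ("S5", 10)],
]

def mc_estimate(state, episodes):
    """Average return observed from `state` over the episodes that visit it."""
    returns = []
    for ep in episodes:
        states = [s for s, _ in ep]
        if state in states:
            i = states.index(state)                    # first visit
            returns.append(sum(r for _, r in ep[i:]))  # undiscounted return
    return sum(returns) / len(returns)

print(mc_estimate("S1", episodes[:3]))  # 5.0
print(mc_estimate("S1", episodes[:4]))  # 4.25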
Incremental Monte Carlo
§ Next we are going to see how we can ‘incrementally’ compute an estimate for the value of a state given the previous estimate, i.e., given the estimate after 3 episodes, how do we get the one after 4 episodes, and so on.
§ Let VT−1(S1) be the estimate of the value function at state S1 after the (T − 1)-th episode.
§ Let the return (total discounted reward) of the T-th episode be RT(S1).
§ Then,
    VT(S1) = [VT−1(S1) × (T − 1) + RT(S1)] / T
           = ((T − 1)/T) VT−1(S1) + (1/T) RT(S1)
           = VT−1(S1) + αT (RT(S1) − VT−1(S1)),   where αT = 1/T
VT(S1) = VT−1(S1) + αT (RT(S1) − VT−1(S1)),   αT = 1/T
§ Think of T as time, i.e., you are drawing sample trajectories and getting the (T − 1)-th episode at time (T − 1), the T-th episode at time T, and so on.
§ Then we are looking at a ‘temporal difference’. The update to the value of S1 is the difference between the return RT(S1) at step T and the estimate VT−1(S1) at the previous time step T − 1.
§ As we get more and more episodes, the learning rate αT gets smaller and smaller, so we make smaller and smaller changes.
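A minimal sketch of this incremental update, showing that with αT = 1/T it reproduces the batch average exactly; the returns used are just the ones of S1 from the episodes above.

# Minimal sketch: incremental Monte Carlo update with alpha_T = 1/T.
returns = [2, 11, 2, 2]   # returns of S1 from the four episodes above (gamma = 1)

V = 0.0
for T, R in enumerate(returns, start=1):
    alpha = 1.0 / T
    V = V + alpha * (R - V)               # incremental update
    print(T, V, sum(returns[:T]) / T)     # matches the batch average at every step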
Properties of Learning Rate
§ This learning falls under a general learning rule: the value at time T = the value at time T − 1 + some learning rate × (difference between what you got and what you expected to get).
    VT(S1) = VT−1(S1) + αT (RT(S1) − VT−1(S1))
§ In the limit, the estimate converges to the true value, i.e., lim_{T→∞} VT(S) = V(S), provided the learning rate sequence obeys two conditions:
    I.  Σ_T αT = ∞
    II. Σ_T αT² < ∞
§ Let us see what Σ_{T=1}^{∞} 1/T is.
§ It is 1 + 1/2 + 1/3 + 1/4 + ··· What is it known as? The harmonic series.
§ Does it converge? No.
    1 + 1/2 + 1/3 + 1/4 + 1/5 + 1/6 + 1/7 + 1/8 + 1/9 + ···
    > 1 + 1/2 + (1/4 + 1/4) + (1/8 + 1/8 + 1/8 + 1/8) + (1/16 + ···)
    = 1 + 1/2 + 1/2 + 1/2 + ··· = ∞
§ A generalization of the harmonic series is the p-series (or hyperharmonic series), defined as Σ_{n=1}^{∞} 1/n^p, for any positive real number p.
§ The p-series converges for all p > 1 (in which case it is called the over-harmonic series) and diverges for all p ≤ 1.
§ So, according to these rules, let us see if the following αT's result in a converging algorithm.

    αT          Σ_T αT    Σ_T αT²    Algo converges
    1/T²        < ∞       < ∞        No
    1/T         ∞         < ∞        Yes
    1/T^(2/3)   ∞         < ∞        Yes
    1/T^(1/2)   ∞         ∞          No
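A small numerical illustration of these conditions, as a sketch only; the cutoff N and the output format are arbitrary choices made for this example. The partial sums of the divergent series keep growing as N increases, while the convergent ones level off.

# Minimal sketch: partial sums of alpha_T and alpha_T^2 for the four schedules above.
schedules = {
    "1/T^2":     lambda T: 1.0 / T**2,
    "1/T":       lambda T: 1.0 / T,
    "1/T^(2/3)": lambda T: 1.0 / T**(2.0 / 3.0),
    "1/T^(1/2)": lambda T: 1.0 / T**0.5,
}

N = 100_000
for name, alpha in schedules.items():
    s1 = sum(alpha(T) for T in range(1, N + 1))        # should diverge (grow with N)
    s2 = sum(alpha(T) ** 2 for T in range(1, N + 1))   # should stay bounded
    print(f"{name:>10}:  sum(alpha) ~ {s1:10.2f}   sum(alpha^2) ~ {s2:10.2f}")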
TD(1)
Algorithm 1: TD(1)
initialization: Episode No. T ← 1
repeat
    foreach s ∈ S do
        initialize e(s) = 0              // e(s) is called the ‘eligibility’ of state s
        VT(s) = VT−1(s)                  // same as at the end of the previous episode
    t ← 1
    repeat
        after the state transition st−1 →(Rt) st
        e(st−1) = e(st−1) + 1            // update the state eligibility
        foreach s ∈ S do
            VT(s) ← VT(s) + αT (Rt + γVT−1(st) − VT−1(st−1)) e(s)
            e(s) = γ e(s)
        t ← t + 1
    until this episode terminates
    T ← T + 1
until all episodes are done
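Below is a minimal Python sketch of this pseudocode, written against the small MRP used in these slides; the episode encoding and the fixed learning rate are assumptions made for the illustration, not part of the slide.

# Minimal sketch: TD(1) with eligibility traces over one episode (per the pseudocode above).
def td1_episode(V, episode, alpha=1.0, gamma=1.0):
    """episode: list of (s, r, s_next) transitions; V: dict of value estimates.
    Returns the updated copy of V after replaying the episode."""
    V_prev = dict(V)                       # V_{T-1}, frozen for the whole episode
    V_new = dict(V)                        # V_T, updated in place
    e = {s: 0.0 for s in V}                # eligibilities
    for s, r, s_next in episode:
        e[s] += 1.0                        # bump eligibility of the state just left
        delta = r + gamma * V_prev.get(s_next, 0.0) - V_prev[s]
        for state in V_new:                # every state moves in proportion to its eligibility
            V_new[state] += alpha * delta * e[state]
            e[state] *= gamma              # decay all eligibilities
    return V_new

V = {"S1": 0.0, "S2": 0.0, "S3": 0.0, "S4": 0.0, "S5": 0.0}
episode5 = [("S2", 2, "S3"), ("S3", 0, "S5"), ("S5", 10, "SF")]
print(td1_episode(V, episode5))   # V(S2) becomes 12, matching the worked example later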
TD(1) Example
§ Let us try to walk through the pseudocode with the help of a small example.
𝑇" 𝑇# 𝑇$ 𝑇%
𝑆" 𝑆# 𝑆$
𝑓
§ Now, as a result of the transition from s1 to s2, the eligibilities change as follows.
𝑇" 𝑇# 𝑇$ 𝑇%
𝑆" 𝑆# 𝑆$
𝑓
1
§ Now, we are going to loop through all the states and apply the TD update (R1 + γVT−1(s2) − VT−1(s1)), scaled by the eligibility and the learning rate, to every state.
    ◮ VT(s1) = αT (R1 + γVT−1(s2) − VT−1(s1))
    ◮ VT(s2) = 0
    ◮ VT(s3) = 0
§ Now transition from s2 to s3 happens and the eligibilities become
𝑇" 𝑇# 𝑇$ 𝑇%
𝑆" 𝑆# 𝑆$
𝑓
𝛿 1
§ The temporal difference is (R2 + γVT−1(s3) − VT−1(s2))
    ◮ VT(s1) = αT (R1 + γVT−1(s2) − VT−1(s1)) + γαT (R2 + γVT−1(s3) − VT−1(s2))
             = αT (R1 + γR2 + γ²VT−1(s3) − VT−1(s1))    [the γVT−1(s2) terms cancel]
    ◮ VT(s2) = αT (R2 + γVT−1(s3) − VT−1(s2))
    ◮ VT(s3) = 0
§ Now transition from s3 to sF happens and the eligibilities become
𝑇" 𝑇# 𝑇$ 𝑇%
𝑆" 𝑆# 𝑆$
𝑓
𝛿# 𝛿 1
§ The temporal difference is (R3 + γVT−1(sF) − VT−1(s3))
    ◮ VT(s1) = αT (R1 + γR2 + γ²VT−1(s3) − VT−1(s1)) + αT γ² (R3 + γVT−1(sF) − VT−1(s3))
             = αT (R1 + γR2 + γ²R3 + γ³VT−1(sF) − VT−1(s1))
    ◮ VT(s2) = αT (R2 + γVT−1(s3) − VT−1(s2)) + αT γ (R3 + γVT−1(sF) − VT−1(s3))
             = αT (R2 + γR3 + γ²VT−1(sF) − VT−1(s2))
    ◮ VT(s3) = αT (R3 + γVT−1(sF) − VT−1(s3))
    ◮ So, a pattern is emerging!! Each state's accumulated update over the episode is the full discounted return from that state minus its old estimate.
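The emerging pattern can be checked numerically. The tiny sketch below replays one episode of the chain both ways; the rewards and old values are made up purely for the check.

# Minimal sketch: per-episode TD(1) totals equal the Monte Carlo error on the chain s1->s2->s3->sF.
gamma, alpha = 0.9, 0.5
R1, R2, R3 = 1.0, -2.0, 4.0                                # arbitrary rewards for the check
V_prev = {"s1": 0.3, "s2": -1.0, "s3": 2.0, "sF": 0.0}     # arbitrary old estimates

# Accumulate TD(1) updates with eligibility traces.
upd = {s: 0.0 for s in V_prev}
e = {s: 0.0 for s in V_prev}
for s, r, s_next in [("s1", R1, "s2"), ("s2", R2, "s3"), ("s3", R3, "sF")]:
    e[s] += 1.0
    delta = r + gamma * V_prev[s_next] - V_prev[s]
    for state in upd:
        upd[state] += alpha * delta * e[state]
        e[state] *= gamma

# Compare with the Monte Carlo style error: (discounted return from the state) - old value.
mc_s1 = alpha * (R1 + gamma * R2 + gamma**2 * R3 - V_prev["s1"])
print(upd["s1"], mc_s1)   # the two numbers agree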
§ Let us try to apply TD(1) to our starting MRP.
𝑇" 𝑇# 𝑇$ 𝑇% 𝑇& 𝑇' +1 +2 +0 +1 +10 0.9 0.1
𝑇" 𝑇" 𝑇" 𝑇" 𝑇# 𝑇$ 𝑇$ 𝑇$ 𝑇$ 𝑇$ 𝑇% 𝑇& 𝑇% 𝑇% 𝑇& 𝑇' 𝑇' 𝑇' 𝑇' 𝑇'
+1 +0 +1 +1 +1 +1 +2 +0 +0 +0 +0 +1 +1 +10 +10
§ s2 is seen only once, so V(s2) will be computed from this episode only.
    V(s2) = αT (2 + γ × 0 + γ² × 10 + γ³ × V(sF) − V(s2)) = 1 × 12 = 12
    (here V(sF) = 0, the initial V(s2) = 0, and αT = 1)
§ γ is taken to be 1 for easy computation.
§ What is the maximum likelihood estimate?
𝑇" 𝑇# 𝑇$ 𝑇% 𝑇& 𝑇' +1 +2 +0 +1 +10 0.9 0.1
𝑇" 𝑇" 𝑇" 𝑇" 𝑇# 𝑇$ 𝑇$ 𝑇$ 𝑇$ 𝑇$ 𝑇% 𝑇& 𝑇% 𝑇% 𝑇& 𝑇' 𝑇' 𝑇' 𝑇' 𝑇'
+1 +0 +1 +1 +1 +1 +2 +0 +0 +0 +0 +1 +1 +10 +10
§ Estimated state transition probabilities:
    ◮ s3 → s4 : 3/5 = 0.6
    ◮ s3 → s5 : 2/5 = 0.4
§ So,
    ◮ V(SF) = 0
    ◮ Then V(S4) = 1 + 1 × 0 = 1, V(S5) = 10 + 1 × 0 = 10
    ◮ Then V(S3) = 0 + 1 × (0.6 × 1 + 0.4 × 10) = 4.6
    ◮ and V(S2) = 2 + 1 × 4.6 = 6.6
§ The true value of state s2, which we found earlier when the true transition probabilities were known, is 3.9.
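A minimal sketch of this maximum likelihood (certainty equivalence) computation from the five episodes; the episode encoding is the same assumed one used in the earlier Monte Carlo sketch.

# Minimal sketch: estimate the transition model from the episodes, then evaluate it (gamma = 1).
from collections import defaultdict

episodes = [
    [("S1", 1, "S3"), ("S3", 0, "S4"), ("S4", 1, "SF")],
    [("S1", 1, "S3"), ("S3", 0, "S5"), ("S5", 10, "SF")],
    [("S1", 1, "S3"), ("S3", 0, "S4"), ("S4", 1, "SF")],
    [("S1", 1, "S3"), ("S3", 0, "S4"), ("S4", 1, "SF")],
    [("S2", 2, "S3"), ("S3", 0, "S5"), ("S5", 10, "SF")],
]

counts = defaultdict(lambda: defaultdict(int))       # counts[s][(reward, next_state)]
for ep in episodes:
    for s, r, s_next in ep:
        counts[s][(r, s_next)] += 1

V = defaultdict(float)                               # V(SF) stays 0
for s in ["S5", "S4", "S3", "S2", "S1"]:             # backward order on this acyclic chain
    total = sum(counts[s].values())
    V[s] = sum(n / total * (r + V[s_next]) for (r, s_next), n in counts[s].items())

print(V["S3"], V["S2"])   # 4.6 and 6.6, as on the slide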
TD(1) Analysis
§ One reason why the TD(1) estimate is far off is that we only used one of the five trajectories to propagate information, whereas the maximum likelihood estimate used information from all five trajectories.
§ So, TD(1) suffers when a rare event occurs in a run (here, s3 → s5 → sF); the estimate can then be far off.
§ We will try to shore up some of these issues next.
TD(0)
§ Let us look at the TD(1) update rule more carefully.
    VT(s) ← VT(s) + αT (Rt + γVT−1(st) − VT−1(st−1)) e(s)
§ Let us change only a few terms in the above rule.
    VT(st−1) ← VT(st−1) + αT (Rt + γVT−1(st) − VT−1(st−1))
§ What would we expect this outcome to be on average?
§ The random thing here is the state st. We are in some state st−1 and we make a transition; we don't really know where we are going to end up. There is some probability involved in that.
§ So, ignoring αT for the time being, the expected value of the above target is E_{st}[Rt + γVT−1(st)], which is basically averaging after sampling different possible st values.
§ This is what maximum likelihood is also doing.
TD(1) and TD(0)
Algorithm 2: TD(1) - identical to Algorithm 1 above; repeated in the slides only for side-by-side comparison with TD(0).
Algorithm 3: TD(0)
initialization: Episode No. T ← 1
repeat
    foreach s ∈ S do
        VT(s) = VT−1(s)
    t ← 1
    repeat
        after the state transition st−1 →(Rt) st
        for s = st−1 do
            VT(s) ← VT(s) + αT (Rt + γVT−1(st) − VT−1(st−1))
        t ← t + 1
    until this episode terminates
    T ← T + 1
until all episodes are done
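For completeness, a minimal TD(0) sketch in the same style as the TD(1) sketch earlier; the transition encoding and the fixed α are again illustration-only assumptions.

# Minimal sketch: one TD(0) sweep over an episode; only the state just left is updated.
def td0_episode(V, episode, alpha=0.1, gamma=1.0):
    """episode: list of (s, r, s_next); V: dict of value estimates (updated copy returned)."""
    V_prev = dict(V)                      # V_{T-1}, frozen for the episode, as in the pseudocode
    V_new = dict(V)
    for s, r, s_next in episode:
        delta = r + gamma * V_prev.get(s_next, 0.0) - V_prev[s]
        V_new[s] += alpha * delta         # only s_{t-1} is touched
    return V_new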
TD(λ)
Algorithm 4: TD(λ)
initialization: Episode No. T ← 1
repeat
    foreach s ∈ S do
        initialize e(s) = 0
        VT(s) = VT−1(s)
    t ← 1
    repeat
        after the state transition st−1 →(Rt) st
        e(st−1) = e(st−1) + 1
        foreach s ∈ S do
            VT(s) ← VT(s) + αT (Rt + γVT−1(st) − VT−1(st−1)) e(s)
            e(s) = λγ e(s)
        t ← t + 1
    until this episode terminates
    T ← T + 1
until all episodes are done
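Relative to the TD(1) sketch shown earlier, the only substantive change is the trace decay line; a minimal sketch of the generalized routine, under the same assumptions as before:

# Minimal sketch: TD(lambda) is the earlier TD(1) sketch with a lambda-scaled trace decay.
def td_lambda_episode(V, episode, lam, alpha=1.0, gamma=1.0):
    V_prev, V_new = dict(V), dict(V)
    e = {s: 0.0 for s in V}
    for s, r, s_next in episode:
        e[s] += 1.0
        delta = r + gamma * V_prev.get(s_next, 0.0) - V_prev[s]
        for state in V_new:
            V_new[state] += alpha * delta * e[state]
            e[state] *= lam * gamma        # lam = 1 gives TD(1), lam = 0 gives TD(0)
    return V_new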
K-Step Estimators
§ For some convenience in later analysis, let us shift the time index by adding 1 everywhere. The TD(0) update rule then becomes
    V(st) ← V(st) + αT (Rt+1 + γV(st+1) − V(st))
§ The interpretation remains the same, i.e., we estimate the value of the state (st) we are just leaving by moving a little bit (αT) in the direction of the immediate reward (Rt+1) plus the discounted estimated value of the state (V(st+1)) we just landed in, minus the value of the state (V(st)) we just left.
§ This is a one-step look-ahead, or a one-step estimator. Let us call it E1.
§ Similarly, a two-step estimator (E2) is
    V(st) ← V(st) + αT (Rt+1 + γRt+2 + γ²V(st+2) − V(st))
§ E1 : V(st) ← V(st) + αT (Rt+1 + γV(st+1) − V(st))
  E2 : V(st) ← V(st) + αT (Rt+1 + γRt+2 + γ²V(st+2) − V(st))
  E3 : V(st) ← V(st) + αT (Rt+1 + γRt+2 + γ²Rt+3 + γ³V(st+3) − V(st))
  ...
  Ek : V(st) ← V(st) + αT (Rt+1 + ··· + γ^(k−1)Rt+k + γ^k V(st+k) − V(st))
  E∞ : V(st) ← V(st) + αT (Rt+1 + ··· + γ^(k−1)Rt+k + ··· − V(st))
§ E1 is basically TD(0) and E∞ is TD(1).
§ Next we will relate these estimators to TD(λ), which will be a weighted combination of all these infinitely many estimators.
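A small sketch of the k-step target used by these estimators, computed from a stored trajectory; the trajectory format is an assumption made only for this illustration.

# Minimal sketch: k-step target R_{t+1} + ... + gamma^{k-1} R_{t+k} + gamma^k V(s_{t+k}).
def k_step_target(rewards, states, V, t, k, gamma=1.0):
    """rewards[i] is R_{i+1} earned on leaving states[i]; states has the terminal state appended."""
    k = min(k, len(rewards) - t)                     # truncate at episode end (V of terminal is 0)
    g = sum(gamma**i * rewards[t + i] for i in range(k))
    return g + gamma**k * V.get(states[t + k], 0.0)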
K-Step Estimators and TD(λ)
    Estimator   Weight            λ = 0    λ = 1
    E1          1 − λ             1        0
    E2          λ(1 − λ)          0        0
    E3          λ²(1 − λ)         0        0
    Ek          λ^(k−1)(1 − λ)    0        0
    E∞          λ^∞               0        1

§ The idea is that when we update the value of a state V(s) using any of the TD(λ) methods, all the estimators give their preferences on what the value update should be.
§ Checking that the sum of the weights is 1:
    Σ_{k=1}^{∞} λ^(k−1)(1 − λ) = (1 − λ) Σ_{k=1}^{∞} λ^(k−1) = (1 − λ) · 1/(1 − λ) = 1
Good Value of λ
Unified View: Temporal-Difference Backup
V (st) ← V (st) + αT (Rt+1 + γV (st+1) − V (st))
[Figure: one-step TD backup diagram]
Figure credit: David Silver, DeepMind
§ Use of ‘sample backups’ and ‘bootstrapping’.
Unified View: Dynamic Programming Backup
v(k+1)(s) ← Σ_{a∈A} π(a|s) [ r(s, a) + γ Σ_{s′∈S} p(s′|s, a) v(k)(s′) ]    (iterating this converges to vπ)

Figure credit: David Silver, DeepMind
§ Use of ‘full backups’ and ‘bootstrapping’.
Unified View: Monte-Carlo Backup
V (st) ← V (st) + αT (Gt − V (st))
Figure credit: David Silver, DeepMind
§ Use of ‘sample backups’ and no ‘bootstrapping’.
TD Control
§ We will now see how TD estimation can be used in control.
§ This is mostly like generalized policy iteration (GPI), where one maintains both an approximate policy and an approximate value function.
§ Policy evaluation is done as TD evaluation.
§ Then, we can do greedy policy improvement.
§ What is the problem? Remember the MC lectures!
    π′(s) ≐ arg max_{a∈A} [ r(s, a) + γ Σ_{s′∈S} p(s′|s, a) vπ(s′) ]
§ Greedy policy improvement over v(s) requires a model of the MDP:
    π′(s) ≐ arg max_{a∈A} [ r(s, a) + γ Σ_{s′∈S} p(s′|s, a) vπ(s′) ]
§ Greedy policy improvement over Q(s, a) is model-free:
    π′(s) ≐ arg max_{a∈A} Q(s, a)
§ How can we do TD policy evaluation for Q(s, a)?
§ The TD(0) update rule for V(s) is
    VT(st) ← VT(st) + αT (Rt+1 + γVT−1(st+1) − VT−1(st))
§ The TD(0) update rule for Q(s, a) is similar:
    QT(st, at) ← QT(st, at) + αT (Rt+1 + γQT−1(st+1, at+1) − QT−1(st, at))
§ Let us spend some time on the update equation.
    QT(st, at) ← QT(st, at) + αT (Rt+1 + γQT−1(st+1, at+1) − QT−1(st, at))
§ What we really want in place of QT−1(st+1, at+1) is VT−1(st+1).
§ So, why is using QT−1(st+1, at+1) in place of VT−1(st+1) fine?
§ Remember V(s) = Ea[Q(s, a)] = Σ_{a∈A} π(a|s) Q(s, a).
§ So instead of taking the expectation we are replacing it with one sample. If we take enough samples, this will eventually converge to V(s).
§ But think carefully again - could we not have taken the expectation as well?
§ Like the MC control algorithms, we will use ε-soft policies (e.g., ε-greedy policies) for exploration here.

Algorithm 5: On-policy TD Control
Parameters: learning rate α ∈ (0, 1], small ε > 0
Initialization: Q(s, a), ∀s ∈ S, a ∈ A, arbitrarily, except Q(terminal, ·) = 0
repeat
    t ← 0, choose st (i.e., s0)
    pick at according to Q(st, ·) (e.g., ε-greedy)
    repeat
        apply action at from st, observe Rt+1 and st+1
        pick at+1 according to Q(st+1, ·) (e.g., ε-greedy)
        Q(st, at) ← Q(st, at) + α (Rt+1 + γQ(st+1, at+1) − Q(st, at))
        t ← t + 1
    until this episode terminates
until all episodes are done
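Here is a minimal Python sketch of this on-policy control loop; the environment interface (env.reset(), env.step()) and the ε-greedy helper are assumptions made for the illustration, not something defined in the slides.

import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps):
    """Pick a random action with probability eps, else a greedy one w.r.t. Q(s, .)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def on_policy_td_control(env, actions, episodes, alpha=0.1, gamma=1.0, eps=0.1):
    """Minimal on-policy TD control sketch; env is assumed to expose reset()/step()."""
    Q = defaultdict(float)                 # Q(terminal, .) stays 0 because it is never updated
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, actions, eps)
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
        # on the terminal transition, Q[(s_next, a_next)] is still 0, as required
    return Q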
§ Any guess for the name of this algorithm?
SARSA Example
§ The windy-gridworld example is taken from SB [Chapter 6].
§ It is a standard gridworld with start and goal states, but with an upward wind through the middle of the grid; the strength of the wind is given below each column.
§ The actions are the standard four - left, right, up, down. It is an undiscounted episodic task, with a constant reward of −1 until the goal state is reached.
SARSA Variants
§ Coming back to the question of taking the expectation over Q values: this gives what is called Expected SARSA.
    Q(st, at) ← Q(st, at) + α (Rt+1 + γ Σ_{a∈A} π(a|st+1) Q(st+1, a) − Q(st, at))
§ Can we also think of sample backups but no bootstrapping? That would be more like MC control. The TD error term becomes
    Rt+1 + γRt+2 + γ²Rt+3 + ··· + γ^(k−1)Rt+k + ··· − Q(st, at)
§ Can we, in the same way, think of a spectrum of algorithms like those in between TD(0) and TD(1), a.k.a. MC?
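In the SARSA sketch above, Expected SARSA changes only the target; a hedged sketch of that change, assuming an illustration-only helper policy_probs(s) that returns π(a|s):

# Minimal sketch: Expected SARSA target, replacing Q[(s_next, a_next)] in the SARSA update.
def expected_sarsa_target(Q, s_next, r, actions, policy_probs, gamma=1.0):
    """policy_probs(s) -> {action: pi(a|s)}; returns R_{t+1} + gamma * E_a[Q(s_{t+1}, a)]."""
    probs = policy_probs(s_next)
    expected_q = sum(probs[a] * Q[(s_next, a)] for a in actions)
    return r + gamma * expected_q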
k-step SARSA
§ Let us define the k-step Q-return as
    Q_t^(k) = Rt+1 + γRt+2 + γ²Rt+3 + ··· + γ^(k−1)Rt+k + γ^k Q(st+k, at+k)
§ Consider the following k-step returns for k = 1, 2, ···, ∞:
    k = 1 : Q_t^(1) = Rt+1 + γQ(st+1, at+1)                                (SARSA)
    k = 2 : Q_t^(2) = Rt+1 + γRt+2 + γ²Q(st+2, at+2)
    k = 3 : Q_t^(3) = Rt+1 + γRt+2 + γ²Rt+3 + γ³Q(st+3, at+3)
    ...
    k = k : Q_t^(k) = Rt+1 + γRt+2 + γ²Rt+3 + ··· + γ^(k−1)Rt+k + γ^k Q(st+k, at+k)
    k = ∞ : Q_t^(∞) = Rt+1 + γRt+2 + γ²Rt+3 + ··· + γ^(k−1)Rt+k + ···
§ k-step SARSA updates Q(s, a) towards the k-step Q-return:
    Q(st, at) ← Q(st, at) + α (Q_t^(k) − Q(st, at))
SARSA(λ)
Figure credit: David Silver, DeepMind
§ The Q^λ return combines all k-step Q-returns Q_t^(k).
§ Using weight (1 − λ)λ^(k−1):
    Q_t^λ = (1 − λ) Σ_{k=1}^{∞} λ^(k−1) Q_t^(k)
§ The update equation for SARSA(λ) is
    Q(st, at) ← Q(st, at) + α (Q_t^λ − Q(st, at))
§ Just like TD(λ) evaluation, SARSA(λ) control uses the concept of ‘eligibility’ in its implementation.
§ In TD(λ) evaluation we had an eligibility trace for each state; for SARSA(λ) control we will have an eligibility trace for each state-action pair.
§ Say we get a reward at the end of some step. The eligibility trace says that the credit for that reward should trickle back, in proportion to the eligibilities, all the way to the first state. The credit should be larger for the state-action pairs that were close to the rewarding step, and also for those state-action pairs that were visited frequently along the way.
§ Q(s, a) is updated for every state and action in proportion to the TD error and the eligibility of the state-action pair.
SARSA(λ) Algorithm
Figure credit: David Silver, DeepMind
SARSA(λ) Gridworld Example
Figure credit: David Silver, DeepMind
TD Control
§ The SARSA update rule is
    Q(st, at) ← Q(st, at) + α (Rt+1 + γQ(st+1, at+1) − Q(st, at)),
  where Rt+1 + γQ(st+1, at+1) is the TD target.
§ The TD target gives a one-step estimate of the Q function. The Q function gives the long-term expected reward for taking action at in state st and then behaving optimally thereafter.
§ Going back to the MDP slides:

    [Backup diagram for q∗ via v∗]
    q∗(s, a) = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) v∗(s′)

    [Backup diagram for q∗ via q∗]
    q∗(s, a) = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) max_{a′∈A} q∗(s′, a′)
Revisiting Bellman equations
§ SARSA:
    qπ(s, a) = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) Σ_{a′∈A} π(a′|s′) qπ(s′, a′)
    Q(st, at) ← Q(st, at) + α (Rt+1 + γQ(st+1, at+1) − Q(st, at))
§ Q-learning:
    q∗(s, a) = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) max_{a′∈A} q∗(s′, a′)
    Q(st, at) ← Q(st, at) + α (Rt+1 + γ max_{a′} Q(st+1, a′) − Q(st, at))
CS60077 Oct 12, 13, 19, 2020 40 / 43
Agenda Introduction TD Evaluation TD Control
Q-learning
Algorithm 6: Off-policy TD Control (Q-learning)
Parameters: learning rate α ∈ (0, 1], small ε > 0
Initialization: Q(s, a), ∀s ∈ S, a ∈ A, arbitrarily, except Q(terminal, ·) = 0
repeat
    t ← 0, choose st (i.e., s0)
    repeat
        pick at according to Q(st, ·) (e.g., ε-greedy)
        apply action at from st, observe Rt+1 and st+1
        Q(st, at) ← Q(st, at) + α (Rt+1 + γ max_{a′} Q(st+1, a′) − Q(st, at))
        t ← t + 1
    until this episode terminates
until all episodes are done
§ Note the differences with SARSA. Why is it off-policy?
§ The next action is picked after the update here. In SARSA, the next action was picked before the update.
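A minimal Q-learning sketch mirroring the on-policy sketch above; it reuses the assumed env interface, defaultdict import, and epsilon_greedy helper from that sketch. The only substantive differences are the max over actions in the target and the fact that the executed action is re-picked at every step.

def q_learning(env, actions, episodes, alpha=0.1, gamma=1.0, eps=0.1):
    """Minimal off-policy TD control sketch; env and epsilon_greedy are as assumed earlier."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, eps)              # behaviour policy: epsilon-greedy
            s_next, r, done = env.step(a)
            best_next = max(Q[(s_next, a2)] for a2 in actions)  # target policy: greedy
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q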
§ In essence, SARSA picks actions from old Q's and Q-learning picks actions from new Q's.
§ Since Q-learning updates the Q values by maximizing over all possible actions, getting the states from a trajectory is not necessary.
§ Advantage?? - Asynchronous updates.
§ Disadvantage of arbitrarily choosing states for update?? - As we saw in RTDP, making updates along a trajectory makes sure that the frequently visited, i.e., important, state-action pairs get to their optimal values quickly.
§ Q-learning generally learns faster than SARSA. This may be because Q-learning always bootstraps from the best next move, whereas SARSA uses the estimate of the next action value in its target; that value changes every time an exploratory action is taken.
§ There are some undesirable situations for Q-learning as well.
Figure credit: [SB-Chapter 6]