

  1. 
 Temporal Difference Learning CMPUT 366: Intelligent Systems 
 S&B §6.0-6.2, §6.4-6.5

  2. Lecture Overview 1. Recap 2. TD Prediction 3. On-Policy TD Control (Sarsa) 4. Off-Policy TD Control (Q-Learning)

  3. Recap: Monte Carlo RL
 • Monte Carlo estimation: estimate expected returns to a state or action by averaging actual returns over sampled trajectories
 • Estimating action values requires either exploring starts or a soft policy (e.g., ε-greedy)
 • Off-policy learning is the estimation of value functions for a target policy based on episodes generated by a different behaviour policy
 • Off-policy control is learning the optimal policy (target policy) using episodes from a behaviour policy
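Since ε-greedy (soft) policies come up repeatedly below, here is a minimal sketch of ε-greedy action selection over a tabular Q estimate; the function name, the NumPy usage, and the tie-breaking rule are my own choices for illustration, not anything from the slides.

    import numpy as np

    def epsilon_greedy(Q, state, n_actions, epsilon, rng):
        """With probability epsilon pick a random action; otherwise a greedy action w.r.t. Q."""
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))                    # explore
        q = Q[state]                                               # action values for this state
        return int(rng.choice(np.flatnonzero(q == q.max())))       # exploit, breaking ties randomly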

  4. Learning from Experience
 • Suppose we are playing a blackjack-like game in person, but we don't know the rules.
 • We know the actions we can take, we can see the cards, and we get told when we win or lose.
 • Question: Could we compute an optimal policy using dynamic programming in this scenario?
 • Question: Could we compute an optimal policy using Monte Carlo?
 • What would be the pros and cons of running Monte Carlo?

  5. Bootstrapping

                              No bootstrapping    Bootstrapping
      Learns from experience  MC                  TD
      Requires full dynamics                      DP

 • Dynamic programming bootstraps: each iteration's estimates are based partly on estimates from previous iterations
 • Each Monte Carlo estimate is based only on actual returns

  6. Updates

 Dynamic programming:  V(S_t) ← Σ_a π(a|S_t) Σ_{s′,r} p(s′, r | S_t, a) [r + γ V(s′)]
 Monte Carlo:          V(S_t) ← V(S_t) + α[G_t − V(S_t)]
 TD(0):                V(S_t) ← V(S_t) + α[R_{t+1} + γ V(S_{t+1}) − V(S_t)]

 All three approximate v_π(s) = 𝔼_π[G_t | S_t = s]
                              = 𝔼_π[R_{t+1} + γ G_{t+1} | S_t = s]
                              = 𝔼_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]

 • Monte Carlo: approximate because the expectation 𝔼 is replaced by a sample average
 • Dynamic programming: approximate because v_π is not known (the current estimate V is used instead)
 • TD(0): approximate for both reasons: sampled 𝔼 and unknown v_π
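As a quick worked example of the TD(0) update with made-up numbers (α = 0.1, γ = 1, V(S_t) = 0.5, V(S_{t+1}) = 0.8, R_{t+1} = 1): the TD target is 1 + 1·0.8 = 1.8, the TD error is 1.8 − 0.5 = 1.3, and the updated estimate is V(S_t) ← 0.5 + 0.1·1.3 = 0.63. A Monte Carlo update from the same state would instead wait for the full return G_t and move V(S_t) toward it.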

  7. TD(0) Algorithm
 Tabular TD(0) for estimating v_π
 Input: the policy π to be evaluated
 Algorithm parameter: step size α ∈ (0, 1]
 Initialize V(s), for all s ∈ S⁺, arbitrarily except that V(terminal) = 0
 Loop for each episode:
     Initialize S
     Loop for each step of episode, until S is terminal:
         A ← action given by π for S
         Take action A, observe R, S′
         V(S) ← V(S) + α[R + γ V(S′) − V(S)]
         S ← S′
 Question: What information does this algorithm use?
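The following is a minimal Python sketch of the tabular TD(0) procedure above; the environment interface (env.reset() returning a state, env.step(action) returning (next_state, reward, done)) and the policy(state, rng) signature are assumptions for illustration, not part of the slides.

    import numpy as np

    def td0_prediction(env, policy, n_states, n_episodes=1000, alpha=0.1, gamma=1.0, seed=0):
        """Tabular TD(0): estimate v_pi for a given policy.

        Assumes env.reset() -> state, env.step(action) -> (next_state, reward, done),
        policy(state, rng) -> action, and integer states in [0, n_states).
        """
        rng = np.random.default_rng(seed)
        V = np.zeros(n_states)                                   # V(terminal) is treated as 0
        for _ in range(n_episodes):
            state = env.reset()
            done = False
            while not done:
                action = policy(state, rng)
                next_state, reward, done = env.step(action)
                target = reward + (0.0 if done else gamma * V[next_state])
                V[state] += alpha * (target - V[state])          # TD(0) update
                state = next_state
        return V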

  8. TD for Control
 • We can plug TD prediction into the generalized policy iteration framework
 • Monte Carlo control loop:
     1. Generate an episode using estimated π
     2. Update estimates of π and Q
 • On-policy TD control loop:
     1. Take an action according to π
     2. Update estimates of π and Q

  9. On-Policy TD Control
 Sarsa (on-policy TD control) for estimating Q ≈ q*
 Algorithm parameters: step size α ∈ (0, 1], small ε > 0
 Initialize Q(s, a), for all s ∈ S⁺, a ∈ A(s), arbitrarily except that Q(terminal, ·) = 0
 Loop for each episode:
     Initialize S
     Choose A from S using policy derived from Q (e.g., ε-greedy)
     Loop for each step of episode, until S is terminal:
         Take action A, observe R, S′
         Choose A′ from S′ using policy derived from Q (e.g., ε-greedy)
         Q(S, A) ← Q(S, A) + α[R + γ Q(S′, A′) − Q(S, A)]
         S ← S′; A ← A′
 Question: What information does this algorithm use?
 Question: Will this estimate the Q-values of the optimal policy?
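A corresponding Python sketch of Sarsa, under the same assumed environment interface as the TD(0) sketch; it reuses the hypothetical epsilon_greedy helper defined earlier.

    import numpy as np

    def sarsa(env, n_states, n_actions, n_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
        """On-policy TD control: learn Q for the eps-greedy policy actually being followed."""
        rng = np.random.default_rng(seed)
        Q = np.zeros((n_states, n_actions))
        for _ in range(n_episodes):
            state = env.reset()
            action = epsilon_greedy(Q, state, n_actions, epsilon, rng)
            done = False
            while not done:
                next_state, reward, done = env.step(action)
                if done:
                    target = reward                              # Q(terminal, .) = 0
                else:
                    next_action = epsilon_greedy(Q, next_state, n_actions, epsilon, rng)
                    target = reward + gamma * Q[next_state, next_action]
                Q[state, action] += alpha * (target - Q[state, action])
                if not done:
                    state, action = next_state, next_action
        return Q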

  10. Actual Q-Values vs. Optimal Q-Values
 • Just as with on-policy Monte Carlo control, Sarsa does not converge to the optimal policy, because it always chooses an ε-greedy action
 • And the estimated Q-values are with respect to the actual actions, which are ε-greedy
 • Question: Why is it necessary to choose ε-greedy actions?
 • What if we acted ε-greedy, but learned the Q-values for the optimal policy?

  11. Off-Policy TD Control
 Q-learning (off-policy TD control) for estimating π ≈ π*
 Algorithm parameters: step size α ∈ (0, 1], small ε > 0
 Initialize Q(s, a), for all s ∈ S⁺, a ∈ A(s), arbitrarily except that Q(terminal, ·) = 0
 Loop for each episode:
     Initialize S
     Loop for each step of episode, until S is terminal:
         Choose A from S using policy derived from Q (e.g., ε-greedy)
         Take action A, observe R, S′
         Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]
         S ← S′
 Question: What information does this algorithm use?
 Question: Why aren't we estimating the policy π explicitly?
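A matching Q-learning sketch under the same assumptions; the only substantive change from the Sarsa sketch is the max over next-state action values in the target, while behaviour stays ε-greedy.

    import numpy as np

    def q_learning(env, n_states, n_actions, n_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
        """Off-policy TD control: learn Q for the greedy policy while behaving eps-greedily."""
        rng = np.random.default_rng(seed)
        Q = np.zeros((n_states, n_actions))
        for _ in range(n_episodes):
            state = env.reset()
            done = False
            while not done:
                action = epsilon_greedy(Q, state, n_actions, epsilon, rng)
                next_state, reward, done = env.step(action)
                target = reward + (0.0 if done else gamma * Q[next_state].max())
                Q[state, action] += alpha * (target - Q[state, action])
                state = next_state
        return Q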

  12. Example: The Cliff
 γ = 1 (undiscounted), R = −1 on every step
 [Figure: the cliff gridworld, with start S, goal G, and the Cliff region (R = −100) between them; a longer safer path and a shorter optimal path running along the cliff edge]
 • Agent gets −1 reward until they reach the goal state
 • Step into the Cliff region, get reward −100 and go back to the start
 • Question: How will Q-Learning estimate the value of this state?
 • Question: How will Sarsa estimate the value of this state?
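To experiment with this example, here is a rough cliff-walking environment sketch that fits the env interface assumed in the sketches above; the 4×12 grid matches the standard Sutton & Barto layout, but the state numbering and action coding are my own choices.

    import numpy as np

    class CliffWalk:
        """4x12 gridworld: start bottom-left, goal bottom-right, cliff between them.
        Reward is -1 per step and -100 for stepping into the cliff (which resets to the start)."""
        ROWS, COLS = 4, 12
        MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]               # up, down, left, right

        def reset(self):
            self.pos = (self.ROWS - 1, 0)                        # bottom-left start
            return self._state()

        def step(self, action):
            dr, dc = self.MOVES[action]
            r = min(max(self.pos[0] + dr, 0), self.ROWS - 1)
            c = min(max(self.pos[1] + dc, 0), self.COLS - 1)
            if r == self.ROWS - 1 and 0 < c < self.COLS - 1:     # stepped into the cliff
                self.pos = (self.ROWS - 1, 0)
                return self._state(), -100, False
            self.pos = (r, c)
            done = (r == self.ROWS - 1 and c == self.COLS - 1)   # reached the goal
            return self._state(), -1, done

        def _state(self):
            return self.pos[0] * self.COLS + self.pos[1]

    # Usage (hypothetical): Q = q_learning(CliffWalk(), n_states=4 * 12, n_actions=4)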

  13. Performance on The Cliff
 [Figure: sum of rewards during episode (y-axis, −100 to −25) vs. episodes (x-axis, 0 to 500), one curve for Sarsa and one for Q-learning, with Sarsa's curve settling above Q-learning's]
 Q-Learning estimates the optimal policy, but Sarsa consistently outperforms Q-Learning. (Why?)

  14. Summary
 • Temporal Difference Learning bootstraps and learns from experience
 • Dynamic programming bootstraps, but doesn't learn from experience (requires full dynamics)
 • Monte Carlo learns from experience, but doesn't bootstrap
 • Prediction: TD(0) algorithm
 • Sarsa estimates action-values of the actual ε-greedy policy
 • Q-Learning estimates action-values of the optimal policy while executing an ε-greedy policy
