

  1. Monte Carlo Control
     CMPUT 366: Intelligent Systems
     S&B §5.3-5.5, 5.7

  2. Lecture Outline
  1. Recap
  2. Estimating Action Values
  3. Monte Carlo Control
  4. Importance Sampling
  5. Off-Policy Monte Carlo Control

  3. Recap: Monte Carlo vs. Dynamic Programming
  • Iterative policy evaluation uses the estimates of the next states' values to update the value of the current state
    • Only needs to compute a single transition to update a state's estimate
  • The Monte Carlo estimate of each state's value is independent of the estimates of other states' values
    • Needs the entire episode to compute an update
    • Can focus on evaluating a subset of states if desired

  4. First-visit Monte Carlo Prediction
  First-visit MC prediction, for estimating V ≈ v_π
  Input: a policy π to be evaluated
  Initialize:
      V(s) ∈ ℝ, arbitrarily, for all s ∈ S
      Returns(s) ← an empty list, for all s ∈ S
  Loop forever (for each episode):
      Generate an episode following π: S_0, A_0, R_1, S_1, A_1, R_2, ..., S_{T−1}, A_{T−1}, R_T
      G ← 0
      Loop for each step of episode, t = T−1, T−2, ..., 0:
          G ← γG + R_{t+1}
          Unless S_t appears in S_0, S_1, ..., S_{t−1}:
              Append G to Returns(S_t)
              V(S_t) ← average(Returns(S_t))
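  A minimal Python sketch of this procedure, assuming a hypothetical helper `generate_episode()` that follows π and returns one episode as a list of (S_t, A_t, R_{t+1}) tuples:

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, num_episodes, gamma=1.0):
    """First-visit MC prediction for V ~ v_pi (sketch).

    `generate_episode()` is assumed to return [(S_t, A_t, R_{t+1}), ...]
    for one episode generated by the policy pi being evaluated.
    """
    V = defaultdict(float)
    returns = defaultdict(list)
    for _ in range(num_episodes):
        episode = generate_episode()
        states = [s for (s, _, _) in episode]
        G = 0.0
        # Work backwards through the episode, accumulating the return G.
        for t in range(len(episode) - 1, -1, -1):
            s, _, r = episode[t]
            G = gamma * G + r
            # First-visit check: update only if S_t does not occur earlier.
            if s not in states[:t]:
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```

  Because each update uses a complete return, the estimate for each state is independent of the estimates for other states, as noted in the recap.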

  5. Control vs. Prediction
  • Prediction: estimate the value of states and/or actions given some fixed policy π
  • Control: estimate an optimal policy

  6. Estimating Action Values
  • When we know the dynamics p(s′, r | s, a), an estimate of state values is sufficient to determine a good policy:
    • Choose the action that gives the best combination of reward and next-state value
  • If we don't know the dynamics, state values are not enough
    • To estimate a good policy, we need an explicit estimate of action values (see the sketch below)
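  As a rough illustration of the two cases (not from the slides), the following sketch assumes a hypothetical `dynamics[(s, a)]` table of (probability, next_state, reward) triples and dictionary-like `V` and `Q` estimates:

```python
def greedy_with_dynamics(dynamics, V, state, actions, gamma=1.0):
    """With known dynamics p(s', r | s, a), state values suffice.

    `dynamics[(state, a)]` is assumed to be a list of
    (prob, next_state, reward) triples.
    """
    def expected_value(a):
        # Expected immediate reward plus discounted next-state value.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[(state, a)])
    return max(actions, key=expected_value)

def greedy_without_dynamics(Q, state, actions):
    """Without the dynamics, we need explicit action-value estimates Q(s, a)."""
    return max(actions, key=lambda a: Q[(state, a)])
```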

  7. Exploring Starts
  • We can just run first-visit Monte Carlo and approximate the returns to each state-action pair
  • Question: What do we do about state-action pairs that are never visited?
    • If the current policy π never selects an action a from a state s, then Monte Carlo can't estimate its value
  • Exploring starts assumption:
    • Every episode starts at a state-action pair S_0, A_0
    • Every pair has a positive probability of being selected for a start

  8. Monte Carlo Control
  Monte Carlo control can be used for policy iteration:
  • Evaluation (E): Q → q_π
  • Improvement (I): π → greedy(Q)
  π_0 →(E) q_{π_0} →(I) π_1 →(E) q_{π_1} →(I) π_2 →(E) ··· →(I) π_* →(E) q_*

  9. Monte Carlo Control with Exploring Starts
  Monte Carlo ES (Exploring Starts), for estimating π ≈ π_*
  Initialize:
      π(s) ∈ A(s) (arbitrarily), for all s ∈ S
      Q(s, a) ∈ ℝ (arbitrarily), for all s ∈ S, a ∈ A(s)
      Returns(s, a) ← empty list, for all s ∈ S, a ∈ A(s)
  Loop forever (for each episode):
      Choose S_0 ∈ S, A_0 ∈ A(S_0) randomly such that all pairs have probability > 0
      Generate an episode from S_0, A_0, following π: S_0, A_0, R_1, ..., S_{T−1}, A_{T−1}, R_T
      G ← 0
      Loop for each step of episode, t = T−1, T−2, ..., 0:
          G ← γG + R_{t+1}
          Unless the pair S_t, A_t appears in S_0, A_0, S_1, A_1, ..., S_{t−1}, A_{t−1}:
              Append G to Returns(S_t, A_t)
              Q(S_t, A_t) ← average(Returns(S_t, A_t))
              π(S_t) ← argmax_a Q(S_t, a)
  Question: What unlikely assumptions does this rely upon?
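  A hedged Python sketch of Monte Carlo ES, assuming a hypothetical `generate_episode_es(policy)` helper that picks the start pair (S_0, A_0) so that every pair has positive probability and then follows `policy`:

```python
import random
from collections import defaultdict

def mc_control_exploring_starts(generate_episode_es, states, actions,
                                num_episodes, gamma=1.0):
    """Monte Carlo ES (sketch): first-visit evaluation + greedy improvement."""
    Q = defaultdict(float)
    returns = defaultdict(list)
    policy = {s: random.choice(actions) for s in states}  # arbitrary initial policy

    for _ in range(num_episodes):
        episode = generate_episode_es(policy)   # [(S_t, A_t, R_{t+1}), ...]
        pairs = [(s, a) for (s, a, _) in episode]
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in pairs[:t]:          # first-visit check
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                # Greedy policy improvement at the visited state.
                policy[s] = max(actions, key=lambda a2: Q[(s, a2)])
    return policy, Q
```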

  10. ε-Soft Policies
  • The exploring starts assumption ensures that we see every state-action pair with positive probability
    • Even if π never chooses a from state s
  • Another approach: simply force π to (sometimes) choose a!
  • An ε-soft policy is one for which π(a | s) ≥ ε / |A(s)| for all s, a
  • Example: ε-greedy policy
      π(a | s) = ε / |A(s)|             if a ∉ argmax_a′ Q(s, a′),
                 1 − ε + ε / |A(s)|     otherwise.
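  A small sketch of the ε-greedy probabilities above, assuming a dictionary-like `Q` keyed by (state, action):

```python
def epsilon_greedy_probs(Q, state, actions, epsilon):
    """Action probabilities of an epsilon-greedy (hence epsilon-soft) policy."""
    greedy_action = max(actions, key=lambda a: Q[(state, a)])
    probs = {a: epsilon / len(actions) for a in actions}  # epsilon/|A(s)| floor
    probs[greedy_action] += 1.0 - epsilon                 # extra mass on the greedy action
    return probs
```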

  11. Monte Carlo Control w/out Exploring Starts
  On-policy first-visit MC control (for ε-soft policies), estimates π ≈ π_*
  Algorithm parameter: small ε > 0
  Initialize:
      π ← an arbitrary ε-soft policy
      Q(s, a) ∈ ℝ (arbitrarily), for all s ∈ S, a ∈ A(s)
      Returns(s, a) ← empty list, for all s ∈ S, a ∈ A(s)
  Repeat forever (for each episode):
      Generate an episode following π: S_0, A_0, R_1, ..., S_{T−1}, A_{T−1}, R_T
      G ← 0
      Loop for each step of episode, t = T−1, T−2, ..., 0:
          G ← γG + R_{t+1}
          Unless the pair S_t, A_t appears in S_0, A_0, S_1, A_1, ..., S_{t−1}, A_{t−1}:
              Append G to Returns(S_t, A_t)
              Q(S_t, A_t) ← average(Returns(S_t, A_t))
              A* ← argmax_a Q(S_t, a)   (with ties broken arbitrarily)
              For all a ∈ A(S_t):
                  π(a | S_t) ← 1 − ε + ε/|A(S_t)|   if a = A*
                               ε/|A(S_t)|           if a ≠ A*
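  A Python sketch of on-policy first-visit MC control. It assumes a hypothetical `generate_episode(sample_action)` helper that runs one episode and calls `sample_action(state)` to choose actions; the explicit probability update from the pseudocode is folded into an ε-greedy sampling rule over the current Q, which induces the same ε-soft policy:

```python
import random
from collections import defaultdict

def on_policy_mc_control(generate_episode, actions, num_episodes,
                         epsilon=0.1, gamma=1.0):
    """On-policy first-visit MC control for epsilon-soft policies (sketch)."""
    Q = defaultdict(float)
    returns = defaultdict(list)

    def sample_action(state):
        # Epsilon-greedy with respect to the current Q (epsilon-soft by construction).
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        episode = generate_episode(sample_action)   # [(S_t, A_t, R_{t+1}), ...]
        pairs = [(s, a) for (s, a, _) in episode]
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in pairs[:t]:             # first-visit check
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
    return Q
```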

  12. Monte Carlo Control w/out Exploring Starts
  (Same algorithm as the previous slide.)
  Question: Will this procedure converge to the optimal policy π*? Why or why not?

  13. Importance Sampling
  • Question: What was importance sampling the last time we studied it (in Supervised Learning)?
  • Monte Carlo sampling: use samples from the target distribution to estimate expectations
  • Importance sampling: use samples from a proposal distribution to estimate expectations under the target distribution by reweighting the samples:
      E_f[X] = Σ_x f(x) x = Σ_x g(x) (f(x)/g(x)) x ≈ (1/n) Σ_i (f(x_i)/g(x_i)) x_i,   where x_i ~ g
      and f(x)/g(x) is the importance sampling ratio
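  A minimal sketch of the estimator on the last line, assuming `target_pmf` and `proposal_pmf` are callables for f and g and that `samples` were drawn from g:

```python
def importance_sampling_estimate(target_pmf, proposal_pmf, samples):
    """Estimate E_f[X] = sum_x f(x) * x from samples x_i ~ g (a sketch).

    `target_pmf(x)` is f(x), `proposal_pmf(x)` is g(x); each sample drawn
    from the proposal g is reweighted by the ratio f(x)/g(x).
    """
    n = len(samples)
    return sum((target_pmf(x) / proposal_pmf(x)) * x for x in samples) / n
```

  In the rest of the lecture, f plays the role of the target policy's action distribution and g that of the behaviour policy's.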

  14. Off-Policy Prediction via Importance Sampling
  Definition: Off-policy learning means using data generated by a behaviour policy (the proposal distribution) to learn about a distinct target policy (the target distribution).

  15. Off-Policy Monte Carlo Prediction
  • Generate episodes using behaviour policy b
  • To estimate v_π(s), take a weighted average of the returns to state s over all the episodes containing a visit to s
  • Weighted by the importance sampling ratio of the trajectory starting from S_t = s until the end of the episode:
      ρ_{t:T−1} ≐ Pr[A_t, S_{t+1}, ..., S_T | S_t, A_{t:T−1} ~ π] / Pr[A_t, S_{t+1}, ..., S_T | S_t, A_{t:T−1} ~ b]

  16. Importance Sampling Ratios for Trajectories
  • Probability of a trajectory A_t, S_{t+1}, A_{t+1}, ..., S_T from S_t:
      Pr[A_t, S_{t+1}, ..., S_T | S_t, A_{t:T−1} ~ π] = π(A_t | S_t) p(S_{t+1} | S_t, A_t) π(A_{t+1} | S_{t+1}) ··· p(S_T | S_{T−1}, A_{T−1})
  • Importance sampling ratio for a trajectory A_t, S_{t+1}, A_{t+1}, ..., S_T from S_t:
      ρ_{t:T−1} ≐ [∏_{k=t}^{T−1} π(A_k | S_k) p(S_{k+1} | S_k, A_k)] / [∏_{k=t}^{T−1} b(A_k | S_k) p(S_{k+1} | S_k, A_k)] = ∏_{k=t}^{T−1} π(A_k | S_k) / b(A_k | S_k)
      (the transition probabilities cancel, so the ratio does not require knowing the dynamics)
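  A sketch of the simplified ratio, assuming the episode is stored as (S_k, A_k, R_{k+1}) tuples and that `target_prob(a, s)` and `behaviour_prob(a, s)` return π(a|s) and b(a|s):

```python
def importance_sampling_ratio(episode, t, target_prob, behaviour_prob):
    """Compute rho_{t:T-1} for an episode [(S_k, A_k, R_{k+1}), ...] (sketch)."""
    rho = 1.0
    for s, a, _ in episode[t:]:
        # Transition probabilities cancel, so only action probabilities remain.
        rho *= target_prob(a, s) / behaviour_prob(a, s)
    return rho
```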

  17. Ordinary vs. Weighted Importance Sampling
  • Ordinary importance sampling:
      V(s) ≐ (1/n) Σ_{i=1}^{n} ρ_{t(s,i):T(i)−1} G_{i,t}
  • Weighted importance sampling:
      V(s) ≐ [Σ_{i=1}^{n} ρ_{t(s,i):T(i)−1} G_{i,t}] / [Σ_{i=1}^{n} ρ_{t(s,i):T(i)−1}]
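  A sketch of both estimators, assuming the returns to a state s have been collected as (ρ, G) pairs:

```python
def ordinary_is_estimate(weighted_returns):
    """Ordinary IS: plain average of rho * G over the n returns to s."""
    n = len(weighted_returns)
    return sum(rho * g for rho, g in weighted_returns) / n

def weighted_is_estimate(weighted_returns):
    """Weighted IS: ratio-weighted average of the returns (0 if all rho are 0)."""
    total = sum(rho for rho, _ in weighted_returns)
    if total == 0.0:
        return 0.0
    return sum(rho * g for rho, g in weighted_returns) / total
```

  Ordinary importance sampling is unbiased but can have very high variance; weighted importance sampling is biased but typically has much lower variance, which is what the blackjack comparison on the next slide illustrates.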

  18. Example: Ordinary vs. Weighted Importance Sampling for Blackjack
  [Figure: mean square error (average over 100 runs) vs. episodes (log scale) for ordinary and weighted importance sampling.]
  Figure 5.3: Weighted importance sampling produces lower error estimates of the value of a single blackjack state from off-policy episodes. (Image: Sutton & Barto, 2018)

  19. Off-Policy Monte Carlo Prediction
  Off-policy MC prediction (policy evaluation) for estimating Q ≈ q_π
  Input: an arbitrary target policy π
  Initialize, for all s ∈ S, a ∈ A(s):
      Q(s, a) ∈ ℝ (arbitrarily)
      C(s, a) ← 0
  Loop forever (for each episode):
      b ← any policy with coverage of π
      Generate an episode following b: S_0, A_0, R_1, ..., S_{T−1}, A_{T−1}, R_T
      G ← 0
      W ← 1
      Loop for each step of episode, t = T−1, T−2, ..., 0, while W ≠ 0:
          G ← γG + R_{t+1}
          C(S_t, A_t) ← C(S_t, A_t) + W
          Q(S_t, A_t) ← Q(S_t, A_t) + (W / C(S_t, A_t)) [G − Q(S_t, A_t)]
          W ← W · π(A_t | S_t) / b(A_t | S_t)
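  A Python sketch of this incremental weighted-importance-sampling update, assuming a hypothetical `generate_behaviour_episode()` that returns one episode as (S_t, A_t, R_{t+1}) tuples generated by a behaviour policy b with coverage of π, plus callables for the two action probabilities:

```python
from collections import defaultdict

def off_policy_mc_prediction(generate_behaviour_episode, target_prob,
                             behaviour_prob, num_episodes, gamma=1.0):
    """Off-policy MC prediction with incremental weighted IS (sketch).

    `target_prob(a, s)` = pi(a|s), `behaviour_prob(a, s)` = b(a|s).
    """
    Q = defaultdict(float)
    C = defaultdict(float)   # cumulative sum of the weights W

    for _ in range(num_episodes):
        episode = generate_behaviour_episode()
        G, W = 0.0, 1.0
        for t in range(len(episode) - 1, -1, -1):
            if W == 0.0:
                break                      # remaining updates would have zero weight
            s, a, r = episode[t]
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            W *= target_prob(a, s) / behaviour_prob(a, s)
    return Q
```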

  20. Off-Policy Monte Carlo Control
  Off-policy MC control, for estimating π ≈ π_*
  Initialize, for all s ∈ S, a ∈ A(s):
      Q(s, a) ∈ ℝ (arbitrarily)
      C(s, a) ← 0
      π(s) ← argmax_a Q(s, a)   (with ties broken consistently)
  Loop forever (for each episode):
      b ← any soft policy
      Generate an episode using b: S_0, A_0, R_1, ..., S_{T−1}, A_{T−1}, R_T
      G ← 0
      W ← 1
      Loop for each step of episode, t = T−1, T−2, ..., 0:
          G ← γG + R_{t+1}
          C(S_t, A_t) ← C(S_t, A_t) + W
          Q(S_t, A_t) ← Q(S_t, A_t) + (W / C(S_t, A_t)) [G − Q(S_t, A_t)]
          π(S_t) ← argmax_a Q(S_t, a)   (with ties broken consistently)
          If A_t ≠ π(S_t) then exit inner Loop (proceed to next episode)
          W ← W · 1 / b(A_t | S_t)
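  A Python sketch of off-policy MC control, assuming a hypothetical `generate_behaviour_episode()` that returns (episode, behaviour_prob), where the episode follows some soft behaviour policy b and behaviour_prob(a, s) = b(a|s); the target policy is kept implicitly as the greedy policy with respect to Q:

```python
from collections import defaultdict

def off_policy_mc_control(generate_behaviour_episode, actions,
                          num_episodes, gamma=1.0):
    """Off-policy MC control with weighted importance sampling (sketch)."""
    Q = defaultdict(float)
    C = defaultdict(float)

    def greedy_action(s):
        # Ties broken consistently by the fixed ordering of `actions`.
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        episode, behaviour_prob = generate_behaviour_episode()
        G, W = 0.0, 1.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            if a != greedy_action(s):
                break                    # pi(A_t | S_t) = 0: earlier steps get zero weight
            W *= 1.0 / behaviour_prob(a, s)   # pi(A_t | S_t) = 1 for the greedy action

    states_seen = {s for (s, _) in Q}
    pi = {s: greedy_action(s) for s in states_seen}
    return pi, Q
```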
