  1. Introduction to Reinforcement Learning. Finale Doshi-Velez, Harvard University. Buenos Aires MLSS 2018.

  2. We often must make decisions under uncertainty. How to get to work, walk or bus?

  3. We often must make decisions under uncertainty. What projects to work on? https://imgs.xkcd.com/comics/automation.png

  4. We often must make decisions under uncertainty. How to improvise with a new recipe? https://s-media-cache-ak0.pinimg.com/originals/23/ce/4b/23ce4b2fc9014b26d4b811209550ef5b.jpg

  5. Some Real Applications of RL

  6. Why are these problems hard? ● Must learn from experience (may have prior experience on the same or related task) ● Delayed rewards/actions may have long term effects (delayed credit assignment) ● Explore or exploit? Learn and plan together. ● Generalization (new developments, don’t assume all information has been identified)

  7. Reinforcement learning formalizes this problem. [Diagram: the agent (models, policies, etc.) sends actions to the world (a black box) and receives back an observation and a reward.] Objective: maximize E[ ∑_t γ^t r_t ] (finite or infinite horizon).
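
To make the objective concrete, here is a minimal sketch (in Python, with a made-up reward sequence and discount, not taken from the slides) of the discounted return ∑_t γ^t r_t for a single trajectory; the objective is the expectation of this quantity over trajectories generated by the policy and the world.

    # Discounted return for one hypothetical sequence of rewards r_0, r_1, ...
    gamma = 0.9
    rewards = [0.0, 1.0, 0.0, 2.0]
    discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
    print(discounted_return)   # 0.9*1.0 + 0.9**3 * 2.0 = 2.358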

  8. Concept Check: Reward Adjustment ● If I replace every reward r with r + c, does the policy change? ● If I replace every reward r with c·r, does the policy change?

  9. Key Terms ● Policy π(s,a) or π(s) = a ● State s ● History h = {s_0, a_0, r_0, s_1, a_1, …} Markov Property: p(s_{t+1} ∣ h_t) = p(s_{t+1} ∣ h_{t-1}, s_t, a_t) = p(s_{t+1} ∣ s_t, a_t) … we'll come back to identifying state later!

  10. Markov Decision Process ● T(s' ∣ s, a) = Pr(state s' after taking action a in state s) ● R(s, a, s') = E[ reward after taking action a in state s and transitioning to s' ] … but may depend on less, e.g. R(s, a) or even R(s). [Diagram: a two-state example MDP (State 0, State 1) with labeled transitions, e.g. P = 1, R = 0; P = .25, R = 1; P = .75, R = 0; P = 1, R = 2; P = 1, R = 3.] Notice that given a policy, we have a Markov chain to analyze!
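
To make the T and R notation concrete, here is a minimal sketch of a tabular MDP stored as NumPy arrays; the numbers are hypothetical (not a reconstruction of the slide's diagram), and the later sketches below reuse these arrays.

    # Hypothetical two-state, two-action MDP, indexed [state, action, next_state].
    import numpy as np

    n_states, n_actions = 2, 2
    gamma = 0.9                                 # discount, chosen arbitrarily

    # T[s, a, s'] = Pr(s' | s, a); each T[s, a, :] sums to 1.
    T = np.zeros((n_states, n_actions, n_states))
    T[0, 0] = [1.00, 0.00]
    T[0, 1] = [0.25, 0.75]
    T[1, 0] = [0.00, 1.00]
    T[1, 1] = [1.00, 0.00]

    # R[s, a, s'] = expected reward on that transition.
    R = np.zeros((n_states, n_actions, n_states))
    R[0, 1, 1] = 1.0
    R[1, 0, 1] = 3.0
    R[1, 1, 0] = 2.0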

  11. How to Solve an MDP: Value Functions. Value: V^π(s) = E_π[ ∑_t γ^t r_t ∣ s_0 = s ] … in s, follow π.

  12. How to Solve an MDP: Value Functions. Value: V^π(s) = E_π[ ∑_t γ^t r_t ∣ s_0 = s ] … in s, follow π.

  13. Concept Check: Discounts. [Diagram: from a start state S, three policies A, B, and C yield different reward sequences; the rewards shown include 4, 5, 1, 0, and 5.] (1) As functions of γ, what are the values of policies A, B, and C? (2) When is it better to do B? C?

  14. How to Solve an MDP: Value Functions. Value: V^π(s) = E_π[ ∑_t γ^t r_t ∣ s_0 = s ] … in s, follow π. Action-Value: Q^π(s,a) = E_π[ ∑_t γ^t r_t ∣ s_0 = s, a_0 = a ] … in s, do a, then follow π.

  15. Expanding the expression… V^π(s) = E_π[ ∑_t γ^t r_t ∣ s_0 = s ] = ∑_a π(a∣s) ∑_{s'} T(s'∣s,a) [ r(s,a,s') + γ E_π[ ∑_t γ^t r_t ∣ s_0 = s' ] ]. In order: the next action (you choose), the next state (the world chooses), the next reward, and the discounted future rewards.

  16. Expanding the expression… V^π(s) = E_π[ ∑_t γ^t r_t ∣ s_0 = s ] = ∑_a π(a∣s) ∑_{s'} T(s'∣s,a) [ r(s,a,s') + γ E_π[ ∑_t γ^t r_t ∣ s_0 = s' ] ]. In order: the next action, the next state, the next reward, and the discounted future rewards.

  17. Expanding the expression… V^π(s) = E_π[ ∑_t γ^t r_t ∣ s_0 = s ] = ∑_a π(a∣s) ∑_{s'} T(s'∣s,a) [ r(s,a,s') + γ E_π[ ∑_t γ^t r_t ∣ s_0 = s' ] ]. Recognizing the inner expectation as V^π(s'): V^π(s) = ∑_a π(a∣s) ∑_{s'} T(s'∣s,a) [ r(s,a,s') + γ V^π(s') ].

  18. Expanding the expression… V^π(s) = ∑_a π(a∣s) ∑_{s'} T(s'∣s,a) [ r(s,a,s') + γ V^π(s') ]. Exercise: Rewrite this in the finite-horizon case, making the rewards and transitions depend on time t… notice how thinking about the future is the same as thinking backward from the end!
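
One way to work the exercise in code, as a sketch under assumptions of my own (horizon H, time-independent T and R from the sketch above, and a hypothetical fixed policy pi): evaluate V by backward induction from the end of the horizon.

    # Finite-horizon policy evaluation by backward induction:
    # V_H(s) = 0, V_t(s) = sum_a pi(a|s) sum_s' T(s'|s,a) [ r(s,a,s') + V_{t+1}(s') ].
    import numpy as np

    H = 10
    pi = np.full((n_states, n_actions), 0.5)    # hypothetical uniform policy

    V = np.zeros((H + 1, n_states))             # V[H] is the terminal value, zero
    for t in reversed(range(H)):
        for s in range(n_states):
            V[t, s] = sum(pi[s, a] * T[s, a] @ (R[s, a] + V[t + 1])
                          for a in range(n_actions))

If the rewards and transitions depend on t, the same loop applies with T and R indexed by t as well.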

  19. Optimal Value Functions. Don't average, take the best! V(s) = max_a Q(s,a), i.e. V(s) = max_a ∑_{s'} T(s'∣s,a) [ r(s,a,s') + γ V(s') ]. The Q-table is the set of values Q(s,a). Note: we still have problems – the system must be Markov in s, and the size of {s} might be large.
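
A small sketch of "take the best" with a tabular Q (the Q numbers below are hypothetical placeholders, not from the slides):

    # V(s) = max_a Q(s, a); the greedy policy takes the arg max in each state.
    import numpy as np

    Q = np.array([[0.0, 2.0],     # rows are states,
                  [3.0, 1.0]])    # columns are actions

    V = Q.max(axis=1)                  # best achievable value per state
    greedy_action = Q.argmax(axis=1)   # which action achieves it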

  20. Can we solve this? Policy Evaluation. V^π(s) = ∑_a π(a∣s) ∑_{s'} T(s'∣s,a) [ r(s,a,s') + γ V^π(s') ]. This is a system of linear equations!

  21. Can we solve this? Policy Evaluation. V^π(s) = ∑_a π(a∣s) ∑_{s'} T(s'∣s,a) [ r(s,a,s') + γ V^π(s') ]. This is a system of linear equations! We can also do it iteratively: V^π_0(s) = c, V^π_k(s) = ∑_a π(a∣s) ∑_{s'} T(s'∣s,a) [ r(s,a,s') + γ V^π_{k-1}(s') ]. This will converge because the Bellman operator is a contraction – the initial value V_0(s) is pushed into the past as the “collected data” r(s,a) takes over.

  22. Can we solve this? Policy Evaluation. V^π(s) = ∑_a π(a∣s) ∑_{s'} T(s'∣s,a) [ r(s,a,s') + γ V^π(s') ]. This is a system of linear equations! We can also do it iteratively: V^π_0(s) = c, V^π_k(s) = ∑_a π(a∣s) ∑_{s'} T(s'∣s,a) [ r(s,a,s') + γ V^π_{k-1}(s') ]. This will converge because the Bellman operator is a contraction – the initial value V_0(s) is pushed into the past as the “collected data” r(s,a) takes over. Finally, we can apply Monte Carlo: run many simulations from s and see what V^π(s) is.
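
A sketch of the first two approaches on the hypothetical T, R, and gamma arrays defined above, for an arbitrary stochastic policy pi: solve the linear system exactly, or iterate the Bellman backup to its fixed point.

    import numpy as np

    pi = np.full((n_states, n_actions), 0.5)        # hypothetical policy to evaluate

    # Marginalize out the action to get the Markov chain induced by pi.
    P_pi = np.einsum("sa,sat->st", pi, T)           # P_pi[s, s'] = sum_a pi(a|s) T(s'|s,a)
    r_pi = np.einsum("sa,sat,sat->s", pi, T, R)     # expected one-step reward under pi

    # 1) Exact solve of the linear system V = r_pi + gamma * P_pi V.
    V_exact = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

    # 2) Iterative policy evaluation: start anywhere, apply the backup repeatedly.
    V = np.zeros(n_states)
    for _ in range(1000):
        V = r_pi + gamma * P_pi @ V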

  23. Policy Improvement Theorem. Let π, π' be two policies that are the same except for the action they recommend at state s. If Q^π(s, π'(s)) > Q^π(s, π(s)), then V^{π'}(s) > V^π(s). This gives us a way to improve policies: just be greedy with respect to Q!

  24. Policy Iteration. Select some policy π, solve for V^π (policy evaluation), improve π by acting greedily with respect to Q^π, solve for the new V^π, improve again, and so on. Will converge; each step requires a potentially expensive policy evaluation computation.
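
A compact sketch of the loop on the same hypothetical tabular MDP: evaluate the current policy exactly, then replace it with the policy that is greedy with respect to the resulting Q.

    import numpy as np

    def evaluate(pi):
        """Exact policy evaluation: solve V = r_pi + gamma * P_pi V."""
        P_pi = np.einsum("sa,sat->st", pi, T)
        r_pi = np.einsum("sa,sat,sat->s", pi, T, R)
        return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # start with any policy
    for _ in range(100):
        V = evaluate(pi)                                   # the expensive step
        Q = np.einsum("sat,sat->sa", T, R) + gamma * np.einsum("sat,t->sa", T, V)
        new_pi = np.eye(n_actions)[Q.argmax(axis=1)]       # greedy improvement
        if np.array_equal(new_pi, pi):
            break                                          # policy is stable: done
        pi = new_pi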

  25. Value Iteration. V_k(s) = max_a ∑_{s'} T(s'∣s,a) [ r(s,a,s') + γ V_{k-1}(s') ]. Each update folds the policy improvement step (the max over a) into the policy evaluation backup. Also converges (contraction). Note that in the tabular case, this is a bunch of inexpensive matrix operations!
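
A sketch of value iteration on the same hypothetical tabular MDP; each sweep is a couple of array operations.

    import numpy as np

    expected_r = np.einsum("sat,sat->sa", T, R)     # E[ r | s, a ]

    V = np.zeros(n_states)
    for _ in range(1000):
        Q = expected_r + gamma * T @ V              # backup of V_{k-1} into Q_k(s, a)
        V_new = Q.max(axis=1)                       # V_k(s) = max_a Q_k(s, a)
        if np.max(np.abs(V_new - V)) < 1e-8:        # contraction => converges
            break
        V = V_new

    greedy_policy = Q.argmax(axis=1)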

  26. Linear programming. min_V ∑_s V(s) μ(s) s.t. V(s) ≥ ∑_{s'} T(s'∣s,a) [ r(s,a,s') + γ V(s') ] ∀ a, s. Holds for any μ; at optimality the constraint is an equality for the best action.
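
A sketch of this LP for the hypothetical tabular MDP above, using scipy.optimize.linprog with μ set to all ones (one choice of positive weights):

    import numpy as np
    from scipy.optimize import linprog

    mu = np.ones(n_states)
    A_ub, b_ub = [], []
    for s in range(n_states):
        for a in range(n_actions):
            # V(s) >= E[r|s,a] + gamma * sum_s' T(s'|s,a) V(s'), rewritten as A_ub @ V <= b_ub.
            A_ub.append(-np.eye(n_states)[s] + gamma * T[s, a])
            b_ub.append(-np.dot(T[s, a], R[s, a]))

    res = linprog(c=mu, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n_states)
    V_opt = res.x    # optimal value function for the hypothetical MDP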

  27. Learning from Experience: Reinforcement Learning. Now, instead of the transition model T and reward model R, we assume that we only have histories. Why is this case interesting? ● May not have the model ● Even if we have the model (e.g. the rules of Go, or the Atari simulator code), learning from experience focuses attention on the right place

  28. Taxonomy of Approaches ● Forward Search/Monte Carlo: Simulate the future, pick the best one (with or without a model). ● Value function: Learn V(s) ● Policy Search: parametrize policy π θ (s) and search for the best parameters θ, often good for systems in which the cardinality of θ is small.

  29. Taxonomy of Approaches ● Forward Search/Monte Carlo: Simulate the future, pick the best one (with or without a model). ● Value function: Learn V(s) ● Policy Search: parametrize policy π θ (s) and search for the best parameters θ, often good for systems in which the cardinality of θ is small.

  30. Monte Carlo Policy Evaluation 1) Generate N sequences of length T from state s_0 to estimate V^π(s_0). 2) If π has some randomness, or if we start with s_0, a_0 and then follow π, we can do policy improvement. … might need a lot of data! But okay if we have a black-box simulator.

  31. Monte Carlo Policy Evaluation 1) Generate N sequences of length T from state s_0 to estimate V^π(s_0). 2) If π has some randomness, or if we start with s_0, a_0 and then follow π, we can do policy improvement. … might need a lot of data! But okay if we have a black-box simulator. Add sophistication: UCT, MCTS.
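
A minimal sketch of plain Monte Carlo policy evaluation, treating the hypothetical T and R arrays above as the black-box simulator: roll out N episodes of a fixed length from s_0 under π and average the discounted returns.

    import numpy as np

    rng = np.random.default_rng(0)
    pi = np.full((n_states, n_actions), 0.5)    # hypothetical policy to evaluate

    def rollout(s0, horizon):
        """Simulate one episode from s0 under pi; return its discounted return."""
        s, ret = s0, 0.0
        for t in range(horizon):
            a = rng.choice(n_actions, p=pi[s])
            s_next = rng.choice(n_states, p=T[s, a])
            ret += gamma**t * R[s, a, s_next]
            s = s_next
        return ret

    N, horizon = 5000, 50
    V_mc = np.mean([rollout(0, horizon) for _ in range(N)])   # estimate of V_pi(s_0)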

  32. Taxonomy of Approaches ● Forward Search/Monte Carlo: Simulate the future, pick the best one (with or without a model). ● Value function: Learn V(s) ● Policy Search: parametrize policy π θ (s) and search for the best parameters θ, often good for systems in which the cardinality of θ is small.

  33. Temporal Difference. V^π(s) = E_π[ ∑_t γ^t r_t ∣ s_0 = s ] = E_π[ r_0 + γ V^π(s') ]. (A Monte Carlo estimate targets the first form; dynamic programming uses the second.) TD: Start with some V(s), act with π(s), and update: V^π(s) ← V^π(s) + α_t ( r_0 + γ V^π(s') − V^π(s) ), i.e. the original value plus a step toward the temporal difference: the error between the sampled value of where you went and the stored value. Will converge if ∑_t α_t → ∞ and ∑_t α_t^2 → C.
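
A sketch of the tabular TD(0) update on the hypothetical simulator arrays above; the per-state step sizes α = 1/(visit count) satisfy the stated conditions.

    import numpy as np

    rng = np.random.default_rng(1)
    pi = np.full((n_states, n_actions), 0.5)    # hypothetical policy to evaluate

    V = np.zeros(n_states)                      # start with some V(s)
    visits = np.zeros(n_states)
    s = 0
    for _ in range(100_000):
        a = rng.choice(n_actions, p=pi[s])
        s_next = rng.choice(n_states, p=T[s, a])
        r = R[s, a, s_next]
        visits[s] += 1
        alpha = 1.0 / visits[s]                 # sum alpha diverges, sum alpha^2 is finite
        V[s] += alpha * (r + gamma * V[s_next] - V[s])   # move toward the TD target
        s = s_next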

  34. Monte Carlo (only one trajectory)

  35. Value Iteration (all actions)

  36. Temporal Difference

  37. Example (S&B 6.4, let γ = 1). Two states (A, B). Two rewards (0, 1). Suppose we have seen the histories: A0B0, B1, B1, B1, B1, B1, B1, B0. MC estimate of V(B)? TD estimate of V(B)?

  38. Example (S&B 6.4, let γ = 1). Two states (A, B). Two rewards (0, 1). Suppose we have seen the histories: A0B0, B1, B1, B1, B1, B1, B1, B0. MC estimate of V(B)? V_MC(B) = ¾. TD estimate of V(B)? V_TD(B) = ¾.
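
A sketch of the Monte Carlo computation behind V_MC(B) = ¾: with γ = 1, average the return observed after each first visit to B over the eight episodes.

    # The eight episodes as (state, reward) pairs.
    episodes = [
        [("A", 0), ("B", 0)],
        [("B", 1)], [("B", 1)], [("B", 1)],
        [("B", 1)], [("B", 1)], [("B", 1)],
        [("B", 0)],
    ]

    returns_from_B = []
    for ep in episodes:
        for i, (state, _) in enumerate(ep):
            if state == "B":
                returns_from_B.append(sum(r for _, r in ep[i:]))   # gamma = 1
                break                                              # first visit only

    print(sum(returns_from_B) / len(returns_from_B))   # 6/8 = 0.75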

  39. Example (S&B 6.4, let γ = 1). Two states (A, B). Two rewards (0, 1). Suppose we have seen the histories: A0B0, B1, B1, B1, B1, B1, B1, B0. MC estimate of V(B)? V_MC(B) = ¾. TD estimate of V(B)? V_TD(B) = ¾. MC estimate of V(A)? TD estimate of V(A)?
