Lecture 2: Making Sequences of Good Decisions Given a Model of the World
Emma Brunskill, CS234 Reinforcement Learning, Winter 2020


  1. Lecture 2: Making Sequences of Good Decisions Given a Model of the World
     Emma Brunskill, CS234 Reinforcement Learning, Winter 2020

  2. Refresh Your Knowledge 1: Piazza Poll
     In a Markov decision process, a large discount factor γ means that short-term rewards are much more influential than long-term rewards. [Enter your answer in Piazza: True / False / Don't know]
     Answer: False. A large γ means delayed / long-term rewards are weighted more heavily; γ = 0 values only immediate rewards.

  3. Today's Plan
     - Last time: Introduction; components of an agent: model, value, policy
     - This time: Making good decisions given a Markov decision process
     - Next time: Policy evaluation when we don't have a model of how the world works

  4. Models, Policies, Values
     - Model: mathematical model of the dynamics and reward
     - Policy: function mapping the agent's states to actions
     - Value function: expected future rewards from being in a state and/or taking an action when following a particular policy

  5. Today: Given a Model of the World
     - Markov Processes
     - Markov Reward Processes (MRPs)
     - Markov Decision Processes (MDPs)
     - Evaluation and Control in MDPs

  6. Full Observability: Markov Decision Process (MDP)
     MDPs can model a huge number of interesting problems and settings:
     - Bandits: single-state MDPs
     - Optimal control: mostly about continuous-state MDPs
     - Partially observable MDPs: an MDP in which the state is the history

  7. Recall: Markov Property
     - Information state: a sufficient statistic of the history
     - State s_t is Markov if and only if p(s_{t+1} | s_t, a_t) = p(s_{t+1} | h_t, a_t)
     - The future is independent of the past given the present

  8. Markov Process or Markov Chain
     - Memoryless random process: a sequence of random states with the Markov property
     - Definition of a Markov process:
       - S is a (finite) set of states (s ∈ S)
       - P is the dynamics/transition model that specifies p(s_{t+1} = s' | s_t = s)
     - Note: no rewards, no actions
     - If there is a finite number N of states, P can be expressed as a matrix:

       P = [ P(s_1|s_1)  P(s_2|s_1)  ...  P(s_N|s_1)
             P(s_1|s_2)  P(s_2|s_2)  ...  P(s_N|s_2)
             ...         ...         ...  ...
             P(s_1|s_N)  P(s_2|s_N)  ...  P(s_N|s_N) ]
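As a concrete illustration of this matrix form (a sketch added here, not slide material; the 3-state chain is an arbitrary example), each row of P is simply a probability distribution over next states:

```python
import numpy as np

# Transition matrix for an assumed 3-state Markov chain:
# P[i, j] = p(s_{t+1} = s_j | s_t = s_i), so row i is the
# next-state distribution given the current state s_i.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.2, 0.8],
])

# Every row must be a valid probability distribution.
assert np.all(P >= 0) and np.allclose(P.sum(axis=1), 1.0)

# By the Markov property, row P[i] is all that is needed to sample s_{t+1}.
next_state = np.random.choice(len(P), p=P[1])
```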

  9. Example: Mars Rover Markov Chain Transition Matrix, P
     [Figure: Mars rover Markov chain with states s_1, ..., s_7 in a row; each interior state moves left or right with probability 0.4 and stays put with probability 0.2; the end states s_1 and s_7 stay put with probability 0.6.]

       P = [ 0.6  0.4  0    0    0    0    0
             0.4  0.2  0.4  0    0    0    0
             0    0.4  0.2  0.4  0    0    0
             0    0    0.4  0.2  0.4  0    0
             0    0    0    0.4  0.2  0.4  0
             0    0    0    0    0.4  0.2  0.4
             0    0    0    0    0    0.4  0.6 ]

  10. Example: Mars Rover Markov Chain Episodes
     [Figure: the same Mars rover Markov chain as above.]
     Example: sample episodes starting from s_4:
     - s_4, s_5, s_6, s_7, s_7, s_7, ...
     - s_4, s_4, s_5, s_4, s_5, s_6, ...
     - s_4, s_3, s_2, s_1, ...
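Such episodes can be generated directly from the transition matrix on the previous slide. The sketch below is an illustration added here (the 0-based state indexing and the sample_episode helper are assumptions of this write-up, not slide material):

```python
import numpy as np

# Mars rover transition matrix (states s_1..s_7, stored 0-indexed,
# so P[i] is the next-state distribution given the current state s_{i+1}).
P = np.array([
    [0.6, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.4, 0.2, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.2, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.2, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.2, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.2, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.6],
])

def sample_episode(P, start, length, seed=None):
    """Sample a state sequence of `length` states starting from `start`."""
    rng = np.random.default_rng(seed)
    states = [start]
    for _ in range(length - 1):
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

# Episodes starting from s_4 (index 3), analogous to the samples on the slide.
print(sample_episode(P, start=3, length=6))
```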

  11. Markov Reward Process (MRP)
     - A Markov reward process is a Markov chain + rewards
     - Definition of a Markov reward process (MRP):
       - S is a (finite) set of states (s ∈ S)
       - P is the dynamics/transition model that specifies P(s_{t+1} = s' | s_t = s)
       - R is a reward function, R(s_t = s) = E[r_t | s_t = s]
       - Discount factor γ ∈ [0, 1]
     - Note: no actions
     - If there is a finite number N of states, R can be expressed as a vector

  12. Example: Mars Rover MRP
     [Figure: the same Mars rover Markov chain as above.]
     Reward: +1 in s_1, +10 in s_7, 0 in all other states

  13. Return & Value Function Definition of Horizon Number of time steps in each episode Can be infinite Otherwise called finite Markov reward process Definition of Return, G t (for a MRP) Discounted sum of rewards from time step t to horizon G t = r t + γ r t +1 + γ 2 r t +2 + γ 3 r t +3 + · · · Definition of State Value Function, V ( s ) (for a MRP) Expected return from starting in state s V ( s ) = E [ G t | s t = s ] = E [ r t + γ r t +1 + γ 2 r t +2 + γ 3 r t +3 + · · · | s t = s ] Emma Brunskill (CS234 Reinforcement Learning) Lecture 2: Making Sequences of Good Decisions Given a Model of the World Winter 2020 13 / 62

  14. Discount Factor
     - Mathematically convenient (avoids infinite returns and values)
     - Humans often act as if there is a discount factor < 1
     - γ = 0: only care about immediate reward
     - γ = 1: future reward is as beneficial as immediate reward
     - If episode lengths are always finite, can use γ = 1

  15. Example: Mars Rover MRP
     [Figure: the same Mars rover Markov chain as above.]
     Reward: +1 in s_1, +10 in s_7, 0 in all other states
     Sample return for a sample 4-step episode, γ = 1/2:
     - s_4, s_5, s_6, s_7: 0 + (1/2)×0 + (1/4)×0 + (1/8)×10 = 1.25

  16. Example: Mars Rover MRP
     [Figure: the same Mars rover Markov chain as above.]
     Reward: +1 in s_1, +10 in s_7, 0 in all other states
     Sample returns for sample 4-step episodes, γ = 1/2:
     - s_4, s_5, s_6, s_7: 0 + (1/2)×0 + (1/4)×0 + (1/8)×10 = 1.25
     - s_4, s_4, s_5, s_4: 0 + (1/2)×0 + (1/4)×0 + (1/8)×0 = 0
     - s_4, s_3, s_2, s_1: 0 + (1/2)×0 + (1/4)×0 + (1/8)×1 = 0.125
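As a check on this arithmetic, the small sketch below (added here; the reward-vector encoding and the discounted_return helper are assumptions of this write-up) computes the same three sample returns:

```python
# Rewards for the Mars rover MRP: +1 in s_1, +10 in s_7, 0 elsewhere (0-indexed).
R = [1, 0, 0, 0, 0, 0, 10]

def discounted_return(rewards, gamma):
    """G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

gamma = 0.5
episodes = [
    [3, 4, 5, 6],   # s_4, s_5, s_6, s_7
    [3, 3, 4, 3],   # s_4, s_4, s_5, s_4
    [3, 2, 1, 0],   # s_4, s_3, s_2, s_1
]
for episode in episodes:
    print(episode, discounted_return([R[s] for s in episode], gamma))
# Prints 1.25, 0.0, and 0.125, matching the slide.
```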

  17. Example: Mars Rover MRP
     [Figure: the same Mars rover Markov chain as above.]
     Reward: +1 in s_1, +10 in s_7, 0 in all other states
     Value function: expected return from starting in state s,
       V(s) = E[G_t | s_t = s] = E[r_t + γ r_{t+1} + γ^2 r_{t+2} + γ^3 r_{t+3} + ... | s_t = s]
     Sample returns for sample 4-step episodes, γ = 1/2:
     - s_4, s_5, s_6, s_7: 0 + (1/2)×0 + (1/4)×0 + (1/8)×10 = 1.25
     - s_4, s_4, s_5, s_4: 0 + (1/2)×0 + (1/4)×0 + (1/8)×0 = 0
     - s_4, s_3, s_2, s_1: 0 + (1/2)×0 + (1/4)×0 + (1/8)×1 = 0.125
     V = [1.53, 0.37, 0.13, 0.22, 0.85, 3.59, 15.31]

  18. Computing the Value of a Markov Reward Process
     Could estimate by simulation:
     - Generate a large number of episodes
     - Average the returns
     - Concentration inequalities bound how quickly the average concentrates to the expected value
     - Requires no assumption of Markov structure
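A minimal Monte Carlo sketch of this idea, assuming the Mars rover P and R defined in the earlier sketches; the episode horizon, episode count, and function name are choices of this write-up, not part of the lecture:

```python
import numpy as np

def mc_value_estimate(P, R, gamma, start, n_episodes=10_000, horizon=50, seed=0):
    """Estimate V(start) by averaging discounted returns over simulated episodes."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_episodes):
        s, discount, g = start, 1.0, 0.0
        for _ in range(horizon):          # truncate each simulated episode
            g += discount * R[s]
            discount *= gamma
            s = int(rng.choice(len(P), p=P[s]))
        total += g
    return total / n_episodes

# e.g. mc_value_estimate(P, R, gamma=0.5, start=3) estimates V(s_4) for the Mars rover MRP.
```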

  19. Computing the Value of a Markov Reward Process
     - Could estimate by simulation
     - The Markov property yields additional structure
     - The MRP value function satisfies
       V(s) = R(s) + γ Σ_{s' ∈ S} P(s'|s) V(s')
       i.e. immediate reward plus the discounted sum of future rewards
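One way to exploit this structure (a sketch added here, again assuming the finite Mars rover P and R from the earlier sketches; the tolerance and stopping rule are choices of this write-up) is to iterate the recursion V ← R + γ P V until it stops changing:

```python
import numpy as np

def mrp_value(P, R, gamma, tol=1e-10):
    """Iterate V(s) = R(s) + gamma * sum_{s'} P(s'|s) V(s') to its fixed point."""
    R = np.asarray(R, dtype=float)
    V = np.zeros_like(R)
    while True:
        V_new = R + gamma * P @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# With the Mars rover P and R above and gamma = 0.5, this returns approximately
# [1.53, 0.37, 0.13, 0.22, 0.85, 3.59, 15.31], matching the value vector on slide 17.
```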
