
CMU-Q 15-381 Lecture 18: Reinforcement Learning I. Teacher: Gianni A. Di Caro (PowerPoint presentation)



  1. CMU-Q 15-381 Lecture 18: Reinforcement Learning I. Teacher: Gianni A. Di Caro

  2. How Realistic Are MDPs?
     § Assumption 1: the state is known exactly after performing an action.
       § Do we always have an infinitely powerful "GPS" that tells us where we are in the world? Think of a robot moving in a building: how does it know where it is?
       § Relax the assumption: Partially Observable MDP (POMDP).
     § Assumption 2: a known model of the world's dynamics and rewards, T and R.
       § Do we always know what the effect of our actions will be when chance is playing against us? Where do those numbers come from? Imagine filling in the T matrix for the actions of a wheeled robot on an icy surface…
       § Relax the assumption: Reinforcement Learning problems.

  3. Reinforcement Learning
     [Diagram: agent-environment loop. The agent observes a state and a reward and emits an action; the transition model and reward model of the memoryless stochastic reward process (MRP) are unknown.]
     Goal: maximize the expected sum of future rewards.

  4. MDP Planning vs. Reinforcement Learning
     We don't have a simulator! We have to actually learn what happens if we take an action in a state.
     Drawings by Ketrina Yim

  5. The Reinforcement Learning Problem
     § The agent can "sense" the environment (it knows the state) and has goals.
     § It learns the effects of actions from interaction with the environment: trial-and-error search.
     § (Delayed) rewards: advisory signals ≠ error signals.
     § What actions to take? → The exploration-exploitation dilemma.
     § The agent has to generate the training set by interaction.

  6. Reinforcement Learning
     [Diagram: agent-environment loop. The agent observes a state and a reward and emits an action; the transition model and reward model of the memoryless stochastic reward process (MRP) are unknown.]
     Goal: maximize the expected sum of future rewards.

  7. Passive Reinforcement Learning
     § Before figuring out how to act, let's first just try to figure out how good a given policy π is.
     § Passive learning: the agent's policy is fixed (i.e., in state s it always executes action π(s)) and the task is to estimate the policy's value → learn state values V(s), or state-action values Q(s, a) → policy evaluation.
     § Policy evaluation in MDPs ∼ passive RL: with a known (T, R) model we solve the Bellman equations; without the (T, R) model we have to learn.

  8. Passive Reinforcement Learning: Two Approaches
     1. Build a model: estimate T(s,a,s') and R(s,a,s') (e.g., T(s,a,s')=0.8, R(s,a,s')=4, …), then solve with value iteration.
     [Diagram: agent-environment loop with unknown transition and reward models.]

  9. Passive Reinforcement Learning: Two Approaches
     1. Build a model.
     2. Model-free: directly estimate V^π(s) (e.g., V^π(s1)=1.8, V^π(s2)=2.5, …).
     [Diagram: agent-environment loop with unknown transition and reward models.]

  10. Passive RL: Build a Model
      1. Build a model: estimate T(s,a,s')=0.8, R(s,a,s')=4, …
      [Diagram: agent-environment loop with unknown transition and reward models.]

  11. Grid World Example
      Start at (1,1)

  12. Grid World Example
      Start at (1,1)
      s=(1,1), action = up (try to move up)
      Adaptation of a drawing by Ketrina Yim

  13. Grid World Example
      Start at (1,1)
      s=(1,1), action = up, s'=(1,2), r = -0.01
      Adaptation of a drawing by Ketrina Yim

  14. Grid World Example
      Start at (1,1)
      s=(1,1), action = up, s'=(1,2), r = -0.01
      s=(1,2), action = up
      Adaptation of a drawing by Ketrina Yim

  15. Grid World Example
      Start at (1,1)
      s=(1,1), action = up, s'=(1,2), r = -0.01
      s=(1,2), action = up, s'=(1,2), r = -0.01
      Adaptation of a drawing by Ketrina Yim

  16. Grid World Example
      Start at (1,1)
      s=(1,1), action = up,    s'=(1,2), r = -0.01
      s=(1,2), action = up,    s'=(1,2), r = -0.01
      s=(1,2), action = up,    s'=(1,3), r = -0.01
      s=(1,3), action = right, s'=(2,3), r = -0.01
      s=(2,3), action = right, s'=(3,3), r = -0.01
      s=(3,3), action = right, s'=(3,2), r = -0.01
      s=(3,2), action = up,    s'=(3,3), r = -0.01
      s=(3,3), action = right, s'=(4,3), r = 1
      Adaptation of a drawing by Ketrina Yim

  17. Grid World Example
      Start at (1,1); same episode as on slide 16.
      The gathered experience can be used to estimate the MDP's T and R models.
      Adaptation of a drawing by Ketrina Yim

  18. Grid World Example
      Start at (1,1); same episode as on slide 16.
      The gathered experience can be used to estimate the MDP's T and R models.
      Estimate of T((1,2), up, (1,3)) = 1/2 (a counting sketch follows below).
      Adaptation of a drawing by Ketrina Yim
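A minimal sketch of the count-and-average idea applied to the episode above; the tuple encoding and the names episode, T_hat, R_hat are my own, not the course's code.

from collections import defaultdict

# The episode from slide 16, written as (s, a, s', r) tuples.
episode = [
    ((1, 1), "up",    (1, 2), -0.01),
    ((1, 2), "up",    (1, 2), -0.01),
    ((1, 2), "up",    (1, 3), -0.01),
    ((1, 3), "right", (2, 3), -0.01),
    ((2, 3), "right", (3, 3), -0.01),
    ((3, 3), "right", (3, 2), -0.01),
    ((3, 2), "up",    (3, 3), -0.01),
    ((3, 3), "right", (4, 3), 1.0),
]

counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = times observed
reward_sum = defaultdict(float)                  # summed reward per (s, a, s')
for s, a, s2, r in episode:
    counts[(s, a)][s2] += 1
    reward_sum[(s, a, s2)] += r

# Turn counts into empirical models: T_hat[(s, a)] = {s': probability},
# R_hat[(s, a, s')] = average observed reward.
T_hat, R_hat = {}, {}
for (s, a), succ in counts.items():
    total = sum(succ.values())
    T_hat[(s, a)] = {s2: c / total for s2, c in succ.items()}
    for s2, c in succ.items():
        R_hat[(s, a, s2)] = reward_sum[(s, a, s2)] / c

print(T_hat[((1, 2), "up")].get((1, 3)))   # 0.5, matching the slide
print(T_hat.get(((1, 2), "right")))        # None: this action was never tried (slide 20)

With only this one episode, the pair ((1,2), up) was tried twice and led to (1,3) once, giving the 1/2 above, while the never-tried pair ((1,2), right) has no estimate at all.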

  19. Model-Based Passive Reinforcement Learning
      1. Follow policy π; observe transitions and rewards.
      2. Estimate the MDP model parameters T and R from the observed transitions and rewards.
         § With a finite set of states and actions, we can just make a table, count, and average the counts.
      3. Use the estimated MDP to do policy evaluation of π (using value iteration); see the sketch below.
      Does this give us all the parameters of the MDP?
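A minimal sketch of step 3 under assumed interfaces: the estimated model is taken to be the T_hat and R_hat dictionaries built in the previous sketch, pi maps each state to the action the fixed policy takes, and the name evaluate_policy is mine. For a fixed policy, value iteration reduces to repeatedly applying the Bellman expectation backup.

def evaluate_policy(states, pi, T_hat, R_hat, gamma=0.99, tol=1e-6):
    """
    states : states observed so far
    pi     : dict mapping state -> action (the fixed policy)
    T_hat  : dict mapping (s, a) -> {s': probability}
    R_hat  : dict mapping (s, a, s') -> estimated reward
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = pi[s]
            # Unobserved (s, a) pairs have no recorded successors, so they back up to 0.
            backup = sum(p * (R_hat[(s, a, s2)] + gamma * V.get(s2, 0.0))
                         for s2, p in T_hat.get((s, a), {}).items())
            delta = max(delta, abs(backup - V[s]))
            V[s] = backup
        if delta < tol:
            return V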

  20. Some Parameters Are Missing
      Start at (1,1); same episode as on slide 16.
      Estimate of T((1,2), right, (1,3))? No idea! We never tried this action…
      Adaptation of a drawing by Ketrina Yim

  21. Passive Model-Based RL
      § Does this give us all the parameters of the underlying MDP? No.
      § But does that matter for computing the policy value? No: we don't need to reconstruct the whole MDP to perform policy evaluation. We have all the parameters we need!
      § We have π(s), so we can assign non-zero probabilities to the observed transitions and zero to the unobserved ones.
      § We do need to visit every state s ∈ S at least once in order to solve the Bellman equations for all states:
        $V^\pi(s) = \mathbb{E}_\pi\left[ R(s_{t+1}) + \gamma V^\pi(s_{t+1}) \mid s_t = s \right] = \sum_{s' \in S} p(s' \mid s, \pi(s)) \left[ R(s', s, \pi(s)) + \gamma V^\pi(s') \right], \quad \forall s \in S$
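Because the policy is fixed, these Bellman equations are linear in V^π, so over the visited states they can also be solved directly rather than iteratively. A sketch under the same assumed T_hat/R_hat dictionaries as above (the name solve_bellman is mine):

import numpy as np

def solve_bellman(states, pi, T_hat, R_hat, gamma=0.99):
    """Solve V = r_pi + gamma * P_pi V over the visited, non-terminal states."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    P = np.zeros((n, n))   # P_pi: transition matrix under the fixed policy
    r = np.zeros(n)        # r_pi: expected immediate reward under the fixed policy
    for s in states:
        a = pi[s]
        for s2, p in T_hat.get((s, a), {}).items():
            r[idx[s]] += p * R_hat[(s, a, s2)]
            if s2 in idx:              # terminal/unvisited successors contribute no future value
                P[idx[s], idx[s2]] += p
    V = np.linalg.solve(np.eye(n) - gamma * P, r)
    return dict(zip(states, V))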

  22. Passive Model-Based RL
      Start at (1,1). Two episodes of experience in the MDP; use them to estimate the MDP parameters and evaluate π.
      Episode 1: the episode from slide 16.
      Episode 2:
      s=(1,1), action = up,    s'=(2,1), r = -0.01
      s=(2,1), action = right, s'=(3,1), r = -0.01
      s=(3,1), action = up,    s'=(4,1), r = -0.01
      s=(4,1), action = left,  s'=(3,1), r = -0.01
      s=(3,1), action = up,    s'=(3,2), r = -0.01
      s=(3,2), action = up,    s'=(4,2), r = -1
      Is the computed policy value likely to be correct? (1) Yes  (2) No  (3) Not sure
      Adaptation of a drawing by Ketrina Yim

  23. Passive Reinforcement Learning: Two Approaches
      1. Build a model.
      2. Model-free: directly estimate V^π(s) (e.g., V^π(s1)=1.8, V^π(s2)=2.5, …).
      [Diagram: agent-environment loop with unknown transition and reward models.]

  24. Let's Consider an Episodic Scenario
      Start at (1,1). Two episodes of (MDP) experience: episodes 1 and 2 from slide 22.
      Estimate of V(1,1)? Average the episode returns (undiscounted):
      $V(1,1) = \frac{1}{2}\left[\left(1 + 7 \cdot (-0.01)\right) + \left(-1 + 5 \cdot (-0.01)\right)\right] = \frac{0.93 - 1.05}{2} = -0.06$
      (checked numerically in the sketch below)
      Adaptation of a drawing by Ketrina Yim
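A quick numerical check of this average, assuming undiscounted returns and the two reward sequences read off the episodes above:

# Estimate V(1,1) by averaging the undiscounted returns of the two episodes
# that start in (1,1).
episode1_rewards = [-0.01] * 7 + [1.0]    # episode 1 (slide 16)
episode2_rewards = [-0.01] * 5 + [-1.0]   # episode 2 (slide 22)

returns = [sum(episode1_rewards), sum(episode2_rewards)]   # 0.93 and -1.05
V_11 = sum(returns) / len(returns)
print(V_11)   # about -0.06, matching the slide (up to floating-point rounding)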

  25. Averaging Observed Returns
      § Averaging the returns from $n$ episodes, $G_1, G_2, \dots, G_n$:
      § Arithmetic average:
        $V_n(s) = \frac{1}{n} \sum_{i=1}^{n} G_i$
      § Incremental arithmetic average:
        $V_{n+1}(s) = V_n(s) + \frac{1}{n+1}\left(G_{n+1} - V_n(s)\right)$
      § Incremental weighted arithmetic average:
        § Weight of an episode: $w_n$; sum over $n$ episodes: $W_n = \sum_{i=1}^{n} w_i$
        § $V_{n+1}(s) = V_n(s) + \frac{w_{n+1}}{W_{n+1}}\left(G_{n+1} - V_n(s)\right)$
      (see the sketch below)
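A minimal sketch checking that the incremental forms above reproduce the plain average; the function names are mine, and the returns 0.93 and -1.05 are the two episode returns from slide 24.

def incremental_average(returns):
    V, n = 0.0, 0
    for G in returns:
        n += 1
        V += (G - V) / n          # move V toward G by 1/n, n = returns seen so far
    return V

def weighted_incremental_average(returns, weights):
    V, W = 0.0, 0.0
    for G, w in zip(returns, weights):
        W += w                    # running sum of the episode weights
        V += (w / W) * (G - V)    # move V toward G by w_n / W_n
    return V

rets = [0.93, -1.05]
print(incremental_average(rets))                    # -0.06, same as the plain average
print(weighted_incremental_average(rets, [1, 1]))   # equal weights recover the same value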

  26. Averaging Observed Returns
      § Exponentially-weighted average (moving average), with a constant step size $\alpha$ instead of $\frac{1}{n}$:
        $V_{n+1}(s) = V_n(s) + \alpha\left(G_n - V_n(s)\right) = (1-\alpha)V_n(s) + \alpha G_n$
      § The weights on past returns decrease exponentially:
        $V_1(s) = (1-\alpha)V_0(s) + \alpha G_1$
        $V_2(s) = (1-\alpha)V_1(s) + \alpha G_2 = (1-\alpha)^2 V_0(s) + (1-\alpha)\alpha G_1 + \alpha G_2$
        $V_{n+1}(s) = (1-\alpha)^n V_0(s) + \sum_{i=1}^{n} (1-\alpha)^{n-i}\,\alpha\, G_i$
      (see the sketch below)
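A minimal sketch (my own illustration) showing that the constant-step-size update and the explicit exponentially-decaying-weights expansion compute the same value; the third return 0.5 is made-up data for the demonstration.

def moving_average(returns, alpha, V0=0.0):
    V = V0
    for G in returns:
        V += alpha * (G - V)      # V <- (1 - alpha) * V + alpha * G
    return V

def expanded(returns, alpha, V0=0.0):
    n = len(returns)
    V = (1 - alpha) ** n * V0
    for i, G in enumerate(returns, start=1):
        V += (1 - alpha) ** (n - i) * alpha * G   # weight decays with the age of G_i
    return V

rets = [0.93, -1.05, 0.5]
print(moving_average(rets, alpha=0.1))   # same value from both forms
print(expanded(rets, alpha=0.1))         # (up to floating-point rounding)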
