Slide 1: Class #25: Reinforcement Learning
Machine Learning (COMP 135): M. Allen, 22 Apr. 2020

Slide 2: Learning the Value of a Policy
- The dynamic programming algorithm we have seen works fine if we already know everything about an MDP system, including:
  1. Probabilities of all state-action transitions
  2. Rewards we get in each case
- If we don't have this information, how can we figure out the value of a policy?
- Turns out we can use a sampling method
- "Follow the policy, and see what happens"

Slide 3: Temporal Difference (TD) Updates
- Agent in an MDP takes actions, sees new states and rewards
- Now, we don't base the value-update on a probability distribution
- Instead, it is based on the single state we actually see, over and over again, stopping whenever we hit a terminal condition, for some number of learning episodes

  function TD-Policy-Evaluation(mdp, π) returns a value function
    inputs: mdp, an MDP, and π, a policy to be evaluated
    ∀s ∈ S, U(s) = 0
    repeat for each episode E:
      set start-state s ← s0
      repeat for each time-step t of episode E, until s is terminal:
        take action π(s)
        observe: next state s′, one-step reward r
        U(s) ← U(s) + α[r + γ U(s′) − U(s)]
        s ← s′
    return value function U ≈ U^π

Slide 4: The Basic TD(0) Update

  U(s) = U(s) + α[r + γ U(s′) − U(s)]

- When we make a one-step update, we add the one-step reward that we get, r, plus the difference between where we start, U(s), and where we end up, U(s′), discounted by the factor γ as usual
- If the state where we end up, s′, is better than the original state s after discounting, then the value of s goes up
- If s′ is worse than s, the value of s goes down
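The TD-Policy-Evaluation routine above translates almost directly into code. Below is a minimal Python sketch, assuming a hypothetical mdp object with states, start_state, is_terminal(s), and a sampling step(s, a) method; none of these names come from the slides, they are only illustrative.

    def td_policy_evaluation(mdp, policy, alpha=0.1, gamma=0.9, episodes=1000):
        """TD(0) evaluation of a fixed policy: follow it and learn U from samples."""
        U = {s: 0.0 for s in mdp.states}           # for all s in S, U(s) = 0
        for _ in range(episodes):                  # repeat for each episode E
            s = mdp.start_state                    # set start-state s <- s0
            while not mdp.is_terminal(s):          # until s is terminal
                a = policy[s]                      # take action pi(s)
                s_next, r = mdp.step(s, a)         # observe next state s', one-step reward r
                # U(s) <- U(s) + alpha * [ r + gamma * U(s') - U(s) ]
                U[s] += alpha * (r + gamma * U[s_next] - U[s])
                s = s_next                         # s <- s'
        return U                                   # U is approximately U^pi

As a quick numerical check of a single update: if U(s) = 0, U(s′) = 0.5, r = 1, γ = 0.9, and α = 0.1, the new estimate is U(s) = 0 + 0.1·(1 + 0.45 − 0) = 0.145.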

Slide 5: The Basic TD(0) Update (continued)

  U(s) = U(s) + α[r + γ U(s′) − U(s)]

- We also weight the value-update amount by another constant α (less than 1), called a step-size parameter:
  1. If this value shrinks to 0 over time, the values stop changing
  2. If we do this slowly, the update will eventually converge to the actual value of the state if we follow the policy π
- For example, if we update over episodes e = 1, 2, 3, …, we can set the parameter for each episode to be: α_e = 1/e

Slide 6: Advantages and a Problem
- With TD updates, we only update the states we actually see given the policy we are following
- We don't need to know the MDP dynamics
- We may only have to update very few states, saving much time, to get the values of those we actually reach under our policy
- However, this can be a source of difficulty: we may not be able to find a better policy, since we don't know the values of states that we never happen to visit

Slide 7: Exploration and Exploitation
- If we use the Dynamic Programming method, we calculate the value of every state
- Easy to update the policy (just be greedy)
- This is exploitation: use the best values seen to choose actions
- When we are learning, however, we sometimes don't know what certain states are like, because we've never actually seen them yet
- Our current policy may never get us to things we really want
- Thus, we must use exploration: try out things even if our current best policy doesn't think it's a good idea

Slide 8: Almost-Greedy Policies
- One simple way to add exploration is to use a policy that is mostly greedy, but not always
- An "epsilon-greedy" (ε-greedy) policy sets some probability threshold, ε, and chooses actions by:
  1. Picking a random number R ∈ [0, 1]
  2. If R ≤ ε, choosing the action at random
  3. If R > ε, acting in a greedy fashion (as before)
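The ε-greedy rule just described, together with the α_e = 1/e step-size schedule mentioned on slide 5, is easy to state in code. This is a small sketch rather than anything from the course materials; value_of stands for whatever estimate we are being greedy about (for example r + γ U(s′) under a known model, or a learned Q(s, a)).

    import random

    def epsilon_greedy(actions, value_of, epsilon):
        """Choose randomly with probability epsilon, greedily otherwise."""
        R = random.random()                  # 1. pick a random number R in [0, 1]
        if R <= epsilon:                     # 2. if R <= epsilon, choose at random
            return random.choice(actions)
        return max(actions, key=value_of)    # 3. if R > epsilon, act greedily

    def alpha_for_episode(e):
        """One possible step-size schedule from the slides: alpha_e = 1 / e."""
        return 1.0 / e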

Slide 9: Learning with ε-greedy Policies
- We can add this idea to our sampling update method
- After we take an action and see a state-transition from s to s′, we do the same update as before:

  U(s) = U(s) + α[r + γ U(s′) − U(s)]

- When we choose actions, we do so in an ε-greedy way, sometimes following the policy based on learned values, and sometimes trying random things
- Over enough time, this can converge to the true value function U* of the optimal policy π*

Slide 10: TD-Learning

  function TD-Learning(mdp) returns a policy
    inputs: mdp, an MDP
    ∀s ∈ S, U(s) = 0
    repeat for each episode E:
      set start-state s ← s0
      repeat for each time-step t of episode E, until s is terminal:
        choose action a, using ε-greedy policy based on U(s)
        observe next state s′, one-step reward r
        U(s) ← U(s) + α[r + γ U(s′) − U(s)]
        s ← s′
    return policy π, set greedily for every state s ∈ S, based upon U(s)

- The algorithm is the same, but it explores using sometimes-greedy and sometimes-probabilistic action-choices instead of a fixed policy π
- We reduce the learning parameter α just as before to converge

Slide 11: Randomness and Weighting in Learning
- Our algorithm uses two parameters, α and ε (plus the usual discount factor γ), to control its overall behavior
- Each can be adapted over time to control the algorithm:
  1. ε: the amount of randomness in the policy
     - When we don't know much, set it to a high value, so that we start off with lots of random exploration
     - We reduce this value over time until ε = 0, and we are being purely greedy, just exploiting what we have learned
  2. α: the weight on each learning-update step
     - Reduce this over time as well: when α = 0, the U-values don't change anymore, and we can converge on final policy values

Slide 12: Randomness and Weighting in Learning (continued)
- The control parameters α and ε give us simple ways to control complex learning behavior
- We don't always want to reduce each over time
- In a purely stationary environment, where the system dynamics don't ever change and all probabilities stay the same, we can simply slowly reduce each until we converge upon a stable learned behavior
- In a non-stationary environment, where things may change at some point, learned solutions may quit working
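A Python sketch of the TD-Learning loop above, reusing the hypothetical mdp interface and the epsilon_greedy helper from the earlier sketches. The slides choose actions "ε-greedily based on U(s)" without saying how candidate actions are scored; here a hypothetical mdp.lookahead(s, a, U, gamma), returning the expected r + γ U(s′) for action a, stands in for that step purely for illustration.

    def td_learning(mdp, alpha=0.1, gamma=0.9, epsilon=0.1, episodes=1000):
        """Learn U by acting epsilon-greedily, then return a greedy policy."""
        U = {s: 0.0 for s in mdp.states}                  # for all s in S, U(s) = 0

        def score(s, a):
            # Hypothetical one-step lookahead: expected r + gamma * U(s') for action a.
            return mdp.lookahead(s, a, U, gamma)

        for _ in range(episodes):                         # repeat for each episode E
            s = mdp.start_state
            while not mdp.is_terminal(s):
                a = epsilon_greedy(mdp.actions(s), lambda act: score(s, act), epsilon)
                s_next, r = mdp.step(s, a)                # observe s', one-step reward r
                U[s] += alpha * (r + gamma * U[s_next] - U[s])
                s = s_next
        # Return a policy set greedily for every (non-terminal) state, based on U.
        return {s: max(mdp.actions(s), key=lambda act: score(s, act))
                for s in mdp.states if not mdp.is_terminal(s)}

In practice α (and often ε) would also be reduced across episodes, for example with alpha_for_episode above, rather than held fixed.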

Slide 13: Non-Stationary Environments
- Suppose the environment starts off in one configuration:
  [Figure: grid world with start state s0 and a GOAL cell]
- Over time, we can learn a policy for the shortest path to the goal
- By letting ε and α go to 0, the policy becomes stable

Slide 14: Non-Stationary Environments (continued)
- The environment may change, however:
  [Figure: the changed grid world, again with start state s0 and the GOAL cell]
- If ε and α stay at 0, the policy is sub-optimal from now on

Slide 15: Non-Stationary Environments (continued)
- We may be able to tell that the environment changes, however:
  [Figure: the changed grid world, with start state s0 and the GOAL cell]
- If the value drops off over a long time, we can increase ε and α again, to resume learning and find a new optimal policy

Slide 16: Bellman Equations for Q-values
- Instead of the value of a state, U(s), we can calculate the value of a state-action pair, Q(s, a)
- The value of taking action a in state s, and then following the policy π after that:

  Q^π(s, a) = Σ_{s′} P(s, a, s′) [ R(s, a, s′) + γ Q^π(s′, π(s′)) ]

- Similarly, we calculate optimal values Q*(s, a) of taking a in state s, then following the best possible policy after that:

  Q*(s, a) = Σ_{s′} P(s, a, s′) [ R(s, a, s′) + γ max_{a′} Q*(s′, a′) ]

- We can do learning for Q-values, too…
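When the model is known, the Q^π Bellman equation on the last slide can be evaluated directly. This sketch assumes hypothetical table layouts (P[(s, a)] maps each successor s′ to its probability, R[(s, a, s′)] is the one-step reward, Q[(s, a)] holds current estimates, and policy[s] gives π(s)); it simply computes the right-hand side for one state-action pair.

    def q_pi_backup(s, a, P, R, Q, policy, gamma=0.9):
        """RHS of Q^pi(s,a) = sum_{s'} P(s,a,s') [ R(s,a,s') + gamma * Q^pi(s', pi(s')) ]."""
        total = 0.0
        for s_next, prob in P[(s, a)].items():
            total += prob * (R[(s, a, s_next)] + gamma * Q[(s_next, policy[s_next])])
        return total

The Q* version is identical except that the γ Q(s′, π(s′)) term is replaced by γ max over a′ of Q(s′, a′).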
