U ( s ) = U ( s ) + [ r + U ( s 0 ) U ( s )] inputs : mdp , an - PowerPoint PPT Presentation

Learning the Value of a Policy } The dynamic programming algorithm we have seen works fine if we already know everything about an MDP system, including: Probabilities of all state-action transitions 1. Rewards we get in each case 2. } If we don’t have this information, how can we figure out Class #25: the value of a policy? Reinforcement Learning } Turns out we can use a sampling method } “Follow the policy, and see what happens” Machine Learning (COMP 135): M. Allen, 22 Apr. 20 2 Monday, 20 Apr. 2020 Machine Learning (COMP 135) 1 2 Temporal Difference (TD) Updates The Basic TD(0) Update function TD-Policy-Evaluation ( mdp, π ) returns a value function U ( s ) = U ( s ) + α [ r + γ U ( s 0 ) − U ( s )] inputs : mdp , an MDP, and π , a policy to be evaluated } When we make one-step update, we add one-step reward ∀ s ∈ S, U ( s ) = 0 that we get, r , plus the difference between where we start, repeat for each episode E : set start-state s ← s 0 U ( s ) , and where we end up U ( s ´) , discounted by the repeat for each time-step t of episode E , until s is terminal: factor g as usual take action π ( s ) observe : next state s 0 , one-step reward r } If state where we end up s ´ is better than original state s U ( s ) ← U ( s ) + α [ r + γ U ( s 0 ) − U ( s )] after discounting, then the value of s goes up s ← s 0 } If s ´ is worse than s , the value of s goes down return value function U ≈ U π } Agent in an MDP takes actions, sees new states and rewards } Now, we don’t base value-update on a probability distribution } Instead, based on the single state we actually see, over and over again, stopping whenever we hit a terminal condition, for some number of learning episodes 4 Monday, 20 Apr. 2020 Machine Learning (COMP 135) 3 Monday, 20 Apr. 2020 Machine Learning (COMP 135) 3 4 1

The Basic TD(0) Update Advantages and a Problem U ( s ) = U ( s ) + α [ r + γ U ( s 0 ) − U ( s )] } With TD updates, we only update the states we actually see given the policy we are following } We also weight the value-update amount by another } Don’t need to know MDP dynamics constant a , (less than 1 ), called a step-size parameter } May only have to update very few states, saving much time to get the values of those we actually reach under our policy If this value shrinks to 0 over time, values stop changing 1. If we do this slowly , the update will eventually converge 2. } However, this can be a source of difficulty: we may not be to actual value of state if we follow the policy p able to find a better policy, since we don’t know values of } For example, if we update over episodes, e = 1, 2, 3,… , we states that we never happen to visit can set the parameter for each episode to be: α e = 1 e 6 Monday, 20 Apr. 2020 Machine Learning (COMP 135) 5 Monday, 20 Apr. 2020 Machine Learning (COMP 135) 5 6 Exploration and Exploitation Almost-Greedy Policies } One simple way to add exploration is to use a policy } If we use the Dynamic Programming method, we calculate the value of every state that is mostly greedy, but not always } Easy to update policy (just be greedy) } An “epsilon-greedy” ( e -greedy) policy sets some } This is exploitation: use best values seen to choose actions probability threshold, e , and chooses actions by: 1. Picking a random number R ∈ [0,1] } When we are learning, however, we sometimes don’t 2. If R ≤ e , choosing the action at random know what certain states are like, because we’ve never actually seen them yet 3. If R > e , acting in a greedy fashion (as before) } Our current policy may never get us to things we really want } Thus, we must use exploration: try out things even if our current best policy doesn’t think it’s a good idea 8 Monday, 20 Apr. 2020 Machine Learning (COMP 135) 7 Monday, 20 Apr. 2020 Machine Learning (COMP 135) 7 8 2

Learning with e -greedy policies TD-Learning } We can add this idea to our sampling update method function TD-Learning ( mdp ) returns a policy inputs : mdp , an MDP } After we take an action, and see a state-transition from s to s´ , we do the same updates as before: ∀ s ∈ S, U ( s ) = 0 repeat for each episode E : set start-state s ← s 0 U ( s ) = U ( s ) + α [ r + γ U ( s 0 ) − U ( s )] repeat for each time-step t of episode E , until s is terminal: choose action a , using ✏ -greedy policy based on U ( s ) } When we choose actions, we do so in an e -greedy way, observe next state s 0 , one-step reward r U ( s ) ← U ( s ) + ↵ [ r + � U ( s 0 ) − U ( s )] sometimes following the policy based on learned values , s ← s 0 and sometimes trying random things } Over enough time, this can converge to true value return policy ⇡ , set greedily for every state s ∈ S , based upon U ( s ) } Algorithm is the same, but explores using sometimes-greedy and function U * of the optimal policy p * sometimes-probabilistic action-choices instead of fixed policy p } We reduce learning parameter a just as before to converge 10 Monday, 20 Apr. 2020 Machine Learning (COMP 135) 9 Monday, 20 Apr. 2020 Machine Learning (COMP 135) 9 10 Randomness and Weighting in Learning Randomness and Weighting in Learning } Our algorithm uses two parameters, a and e (plus the } The control parameters a and e give us simple ways to usual discount factor g ), to control its overall behavior control complex learning behavior } We don’t always want to reduce each over time } Each can be adapted over time to control algorithm e : the amount of randomness in the policy } In a purely stationary environment , where system 1. dynamics don’t ever change, and all probabilities stay the When we don’t know much, set it to a high value , so that } we start off with lots of random exploration same, we can simply slowly reduce each until we We reduce this value over time until e = 0 , and we are being converge upon a stable learned behavior } purely greedy, and just exploiting what he have learned } In a non-stationary environment, where things may a : the weight on each learning-update step change at some point, learned solutions may quit working 2. Reduce this over time, as well: when a = 0 , U -values don’t } change anymore, and we can converge on final policy values 12 Monday, 20 Apr. 2020 Machine Learning (COMP 135) 11 Monday, 20 Apr. 2020 Machine Learning (COMP 135) 11 12 3

Non-Stationary Environments Non-Stationary Environments } Suppose environment starts off in one configuration: } The environment may change, however: GOAL GOAL s 0 s 0 } If e and a stay at 0, policy is sub-optimal from now on } Over time, we can learn a policy for shortest path to goal } By letting e and a go to 0, the policy becomes stable 14 Monday, 20 Apr. 2020 Machine Learning (COMP 135) 13 Monday, 20 Apr. 2020 Machine Learning (COMP 135) 13 14 Non-Stationary Environments Bellman Equations for Q -values } We may be able to tell that environment changes, however } Instead of the value of a state U ( s ) , we can calculate the value of a state-action pair Q ( s , a ) } The value of taking action a in state s , and then following GOAL the policy π after that: � P ( s, a, s � ) [ R ( s, a, s � ) + γ Q π ( s � , π ( s � ))] Q π ( s, a ) = s � } Similarly, we calculate optimal values Q *( s,a ) of taking a in state s , then following best possible policy after that: s 0 X X X X X X X X X X X X X X ⇥ ⇤ � Q � ( s, a ) = P ( s, a, s � ) R ( s, a, s � ) + γ max Q � ( s � , a � ) } If value drops off over a long time, we can increase e and a a � s � again, to resume learning and find new optimal policy } We can do learning for Q -values, too… 16 Monday, 20 Apr. 2020 Machine Learning (COMP 135) 15 Monday, 20 Apr. 2020 Machine Learning (COMP 135) 15 16 4

U ( s ) = U ( s ) + [ r + U ( s 0 ) U ( s )] inputs : mdp , an - PowerPoint PPT Presentation

Learning the Value of a Policy } The dynamic programming algorithm we have seen works fine if we already know everything about an MDP system, including: Probabilities of all state-action transitions 1. Rewards we get in each case 2. } If we

FEI Week 3 Valuation Venture Capital Chris Ansell MBA CFA BPP BUSINESS SCHOOL BPP BUSINESS

301AA - Advanced Programming Lecturer: Andrea Corradini andrea@di.unipi.it

E40M Capacitors M. Horowitz, J. Plummer, R. Howe 1 Reading Reader: Chapter 6

NYSE : NOW OUR TEAM Carol Fuss Matthew Yardley OUTLINE ServiceNow Overview Enterprise

while Loops Introducing: while Loops General form of a while loop statement: while [boolean

Adversarial Search R&N 5.15.5 Jacques Fleuriot University of Edinburgh, School of

'()+,)&-.),& 4E&+;.+F,&;.2&6(&)E6,&G+,().=:(&

The term structure of interest rates Financial markets and recession expectations 2018-2019

USING DATA TO FIND THE OPTIMAL MIX OF RETAIL LOCATIONS AND RESOURCES INTRODUCTION Education

| 1 New gTLD Subsequent Procedures PDP Working Group ICANN65 Monday, 24 June 2019 13:30 to

Water ers o s of t the U he U.S. S. Presented By: Anjali Karve , M.S. Certified Ecologist

Law Law of the of the Sea i Sea in n Historical Perspective Historical Perspective R. R. P.

brief you on maritime security and surveillance cooperation in the Baltic. First of all I want to

IIRSM Qatar 23 rd January 2018 Presented by: Mark Stephens Before we start CROWNE PLAZA SAFETY

MICE Experiment Status and Prospects J. Pasternak, Imperial College London / ISIS - RAL- STFC,

4170 Staff Prof. Lydia Chilton Office hours: Tuesdays 3-4 in CEPSR 612 IAs: Tessa

http: (optional) Calculator (just something that can do square roots)

The Humans in the Cloud: Shared Autonomy over the Internet Matei Ciocarlie Willow Garage

FEDERAL POLICY STRATEGY AND ADVOCACY CONSULTANT REQUEST FOR QUALIFICATIONS Informational Webinar

Outline Debating Adjudicating Style Content Strategy Definitions Points of Information

2016

A New Algorithm for Generating Quadrilateral Meshes and Its Application to FE-Based Image

December 3, 2012 Edwall 1 ABCD II (Great Start): 2003-2006 ABCD Screening Academy:

Scaling up What Works: Experimental Evidence on External Validity in Kenyan Education Tessa Bold

U ( s ) = U ( s ) + [ r + U ( s 0 ) U ( s )] inputs : mdp , an - PowerPoint PPT Presentation

Learning the Value of a Policy } The dynamic programming algorithm we have seen works fine if we already know everything about an MDP system, including: Probabilities of all state-action transitions 1. Rewards we get in each case 2. } If we

FEI Week 3 Valuation Venture Capital Chris Ansell MBA CFA BPP BUSINESS SCHOOL BPP BUSINESS

301AA - Advanced Programming Lecturer: Andrea Corradini andrea@di.unipi.it

E40M Capacitors M. Horowitz, J. Plummer, R. Howe 1 Reading Reader: Chapter 6

NYSE : NOW OUR TEAM Carol Fuss Matthew Yardley OUTLINE ServiceNow Overview Enterprise

while Loops Introducing: while Loops General form of a while loop statement: while [boolean

Adversarial Search R&amp;N 5.15.5 Jacques Fleuriot University of Edinburgh, School of

'()*+*,)&amp;-.)*,&amp; 4E*&amp;+*;.+F,&amp;;.2*&amp;6(&amp;)E6,&amp;G+*,*().=:(&amp;

The term structure of interest rates Financial markets and recession expectations 2018-2019

USING DATA TO FIND THE OPTIMAL MIX OF RETAIL LOCATIONS AND RESOURCES INTRODUCTION Education

| 1 New gTLD Subsequent Procedures PDP Working Group ICANN65 Monday, 24 June 2019 13:30 to

Water ers o s of t the U he U.S. S. Presented By: Anjali Karve , M.S. Certified Ecologist

Law Law of the of the Sea i Sea in n Historical Perspective Historical Perspective R. R. P.

brief you on maritime security and surveillance cooperation in the Baltic. First of all I want to

IIRSM Qatar 23 rd January 2018 Presented by: Mark Stephens Before we start CROWNE PLAZA SAFETY

MICE Experiment Status and Prospects J. Pasternak, Imperial College London / ISIS - RAL- STFC,

4170 Staff Prof. Lydia Chilton Office hours: Tuesdays 3-4 in CEPSR 612 IAs: Tessa

http: (optional) Calculator (just something that can do square roots)

The Humans in the Cloud: Shared Autonomy over the Internet Matei Ciocarlie Willow Garage

FEDERAL POLICY STRATEGY AND ADVOCACY CONSULTANT REQUEST FOR QUALIFICATIONS Informational Webinar

Outline Debating Adjudicating Style Content Strategy Definitions Points of Information

2016

A New Algorithm for Generating Quadrilateral Meshes and Its Application to FE-Based Image

December 3, 2012 Edwall 1 ABCD II (Great Start): 2003-2006 ABCD Screening Academy:

Scaling up What Works: Experimental Evidence on External Validity in Kenyan Education Tessa Bold

Adversarial Search R&N 5.15.5 Jacques Fleuriot University of Edinburgh, School of

'()+,)&-.),& 4E&+;.+F,&;.2&6(&)E6,&G+,().=:(&