
13. Reinforcement Learning
Lecture slides for the textbook Machine Learning, T. Mitchell, McGraw-Hill, 1997.



  1. 13. Reinforcement Learning [Read Chapter 13] [Exercises 13.1, 13.2, 13.4]
     - Control learning
     - Control policies that choose optimal actions
     - Q learning
     - Convergence

  2. Control Learning
     Consider learning to choose actions, e.g.,
     - Robot learning to dock on battery charger
     - Learning to choose actions to optimize factory output
     - Learning to play Backgammon
     Note several problem characteristics:
     - Delayed reward
     - Opportunity for active exploration
     - Possibility that the state is only partially observable
     - Possible need to learn multiple tasks with the same sensors/effectors

  3. One Example: TD-Gammon [Tesauro, 1995]
     Learn to play Backgammon.
     Immediate reward:
     - +100 if win
     - -100 if lose
     - 0 for all other states
     Trained by playing 1.5 million games against itself. Now approximately equal to the best human player.

  4. Reinforcement Learning Problem
     [Diagram: the agent observes state s_t, chooses action a_t, and receives reward r_t from the environment, producing the sequence s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, ...]
     Goal: learn to choose actions that maximize
        r_0 + γ r_1 + γ² r_2 + ...,   where 0 ≤ γ < 1
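     As a concrete illustration (not on the slide): with γ = 0.9 and a reward sequence r_0 = 0, r_1 = 0, r_2 = 100, 0, 0, ..., the quantity being maximized is 0 + 0.9·0 + 0.9²·100 = 81.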

  5. Markov Decision Processes
     Assume:
     - a finite set of states S
     - a set of actions A
     - at each discrete time t the agent observes state s_t ∈ S and chooses action a_t ∈ A
     - it then receives immediate reward r_t
     - and the state changes to s_{t+1}
     Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
     - i.e., r_t and s_{t+1} depend only on the current state and action
     - the functions δ and r may be nondeterministic
     - the functions δ and r are not necessarily known to the agent
     A simple representation of such a world is sketched below.
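     A minimal Python sketch of how a deterministic world of this kind might be represented. The state and action names and the reward of 100 for entering the goal are invented for illustration and are not part of the slides.

     # Hypothetical deterministic MDP: states S, actions A, transition function delta, reward r.
     S = {"s1", "s2", "goal"}
     A = {"left", "right"}

     # delta[(s, a)] is the next state reached by taking action a in state s.
     delta = {
         ("s1", "left"): "s1",     ("s1", "right"): "s2",
         ("s2", "left"): "s1",     ("s2", "right"): "goal",
         ("goal", "left"): "goal", ("goal", "right"): "goal",
     }

     # r[(s, a)] is the immediate reward: 100 for entering the goal, 0 otherwise.
     r = {(s, a): (100 if nxt == "goal" and s != "goal" else 0)
          for (s, a), nxt in delta.items()}

     # Note: a learning agent only experiences delta and r through interaction;
     # it is not handed these tables (the slide's last point).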

  6. Agent's Learning Task
     Execute actions in the environment, observe the results, and
     - learn an action policy π : S → A that maximizes
          E[ r_t + γ r_{t+1} + γ² r_{t+2} + ... ]
       from any starting state in S
     - here 0 ≤ γ < 1 is the discount factor for future rewards
     Note something new:
     - the target function is π : S → A
     - but we have no training examples of the form ⟨s, a⟩
     - training examples are of the form ⟨⟨s, a⟩, r⟩

  7. Value Function
     To begin, consider deterministic worlds...
     For each possible policy π the agent might adopt, we can define an evaluation function over states:
        V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + ...
               ≡ Σ_{i=0..∞} γ^i r_{t+i}
     where r_t, r_{t+1}, ... are generated by following policy π starting at state s.
     Restated, the task is to learn the optimal policy π*:
        π* ≡ argmax_π V^π(s),   (∀s)
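     A rough sketch of what V^π means operationally, reusing the hypothetical delta/r tables above: follow the policy from s and sum discounted rewards. The finite horizon cut-off is an approximation introduced for the sketch, not something in the slides.

     def value_of_policy(s, pi, delta, r, gamma=0.9, horizon=100):
         """Approximate V^pi(s) = sum_i gamma^i * r_{t+i} by following pi for `horizon` steps."""
         total, discount = 0.0, 1.0
         for _ in range(horizon):
             a = pi[s]                      # action prescribed by the policy
             total += discount * r[(s, a)]  # discounted immediate reward
             discount *= gamma
             s = delta[(s, a)]              # deterministic next state
         return total

     # Example: pi = {"s1": "right", "s2": "right", "goal": "right"}
     # value_of_policy("s1", pi, delta, r) -> 90.0, i.e. 0 + 0.9 * 100.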

  8. [Figure: a simple grid world with absorbing goal state G. Shown are the r(s, a) immediate reward values (100 for the actions entering G, 0 everywhere else), the corresponding Q(s, a) values (100, 90, 81, 72, ...), the V*(s) values (100, 90, 81), and one optimal policy (arrows leading toward G).]
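     The pictured numbers are consistent with a discount factor of γ = 0.9 (the value used in the update example a few slides ahead). Working backwards from G:
        Q = 100                 for the action that enters G (reward 100, nothing further from the absorbing goal)
        Q = 0 + 0.9·100 = 90    for an action leading to a state with V* = 100
        Q = 0 + 0.9·90  = 81    one step further out
     and the 72 entries appear to be 0.9·81 = 72.9 for moves pointing away from the shortest path. V*(s) is then the largest Q(s, a) at each state, giving the 100, 90, 81 values shown.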

  9. What to Learn
     We might try to have the agent learn the evaluation function V^{π*} (which we write as V*).
     It could then do a lookahead search to choose the best action from any state s, because
        π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
     A problem:
     - This works well if the agent knows δ : S × A → S and r : S × A → ℜ
     - But when it doesn't, it can't choose actions this way

  10. Q Function
      Define a new function very similar to V*:
         Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
      If the agent learns Q, it can choose the optimal action even without knowing δ!
         π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
         π*(s) = argmax_a Q(s, a)
      Q is the evaluation function the agent will learn.
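      A one-line sketch of this point: given only a learned table Q (here assumed to be a dict keyed by (state, action) pairs), the optimal action is a plain argmax, with no reference to δ or r.

      def greedy_action(Q, s, actions):
          """pi*(s) = argmax_a Q(s, a), computed from the learned table alone."""
          return max(actions, key=lambda a: Q[(s, a)])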

  11. Training Rule to Learn Q
      Note that Q and V* are closely related:
         V*(s) = max_{a'} Q(s, a')
      which allows us to write Q recursively as
         Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
                     = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')
      Nice! Let Q̂ denote the learner's current approximation to Q. Consider the training rule
         Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
      where s' is the state resulting from applying action a in state s.

  12. Q Learning for Deterministic Worlds
      For each s, a initialize the table entry Q̂(s, a) ← 0
      Observe the current state s
      Do forever:
      - Select an action a and execute it
      - Receive immediate reward r
      - Observe the new state s'
      - Update the table entry for Q̂(s, a) as follows:
           Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
      - s ← s'
      (A tabular sketch of this loop follows.)
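      A minimal tabular sketch of the loop above, reusing the hypothetical delta/r tables from the earlier MDP sketch. The uniformly random action selection, the step limit, and the reset to the start state after reaching the absorbing goal are choices made for the sketch, not part of the algorithm as stated.

      import random
      from collections import defaultdict

      def q_learning_deterministic(delta, r, actions, gamma=0.9, steps=10000,
                                   start="s1", goal="goal"):
          Q = defaultdict(float)                  # Q_hat(s, a), every entry initialized to 0
          s = start                               # observe the current state s
          for _ in range(steps):                  # "do forever", truncated here
              a = random.choice(sorted(actions))  # select an action a and execute it
              reward = r[(s, a)]                  # receive immediate reward r
              s_next = delta[(s, a)]              # observe the new state s'
              # Q_hat(s, a) <- r + gamma * max_{a'} Q_hat(s', a')
              Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
              s = start if s_next == goal else s_next   # s <- s' (restart after the goal)
          return Q

      # Q = q_learning_deterministic(delta, r, A)
      # With the toy tables above, Q[("s2", "right")] settles at 100 and
      # Q[("s1", "right")] at 90, matching the pattern of the grid-world figure.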

  13. Updating Q̂
      [Figure: the agent moves right from initial state s_1 to next state s_2; the current Q̂ values of the actions available in s_2 are 63, 81, and 100, and the Q̂ value of the right move from s_1 rises from 72 to 90.]
         Q̂(s_1, a_right) ← r + γ max_{a'} Q̂(s_2, a')
                          ← 0 + 0.9 max{63, 81, 100}
                          ← 90
      Notice that if rewards are non-negative, then
         (∀ s, a, n)  Q̂_{n+1}(s, a) ≥ Q̂_n(s, a)
      and
         (∀ s, a, n)  0 ≤ Q̂_n(s, a) ≤ Q(s, a)

  14. Q̂ converges to Q
      Consider the case of a deterministic world where each ⟨s, a⟩ is visited infinitely often.
      Proof: Define a full interval to be an interval during which each ⟨s, a⟩ is visited. During each full interval the largest error in the Q̂ table is reduced by a factor of γ.
      Let Q̂_n be the table after n updates, and Δ_n the maximum error in Q̂_n; that is,
         Δ_n = max_{s,a} | Q̂_n(s, a) − Q(s, a) |
      For any table entry Q̂_n(s, a) updated on iteration n+1, the error in the revised estimate Q̂_{n+1}(s, a) is
         | Q̂_{n+1}(s, a) − Q(s, a) | = | (r + γ max_{a'} Q̂_n(s', a')) − (r + γ max_{a'} Q(s', a')) |
                                      = γ | max_{a'} Q̂_n(s', a') − max_{a'} Q(s', a') |
                                      ≤ γ max_{a'} | Q̂_n(s', a') − Q(s', a') |
                                      ≤ γ max_{s'', a'} | Q̂_n(s'', a') − Q(s'', a') |
         | Q̂_{n+1}(s, a) − Q(s, a) | ≤ γ Δ_n

  15. Note that we used the general fact that
         | max_a f_1(a) − max_a f_2(a) | ≤ max_a | f_1(a) − f_2(a) |
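      A short justification of this fact, since the slide states it without proof: assume without loss of generality that max_a f_1(a) ≥ max_a f_2(a), and let a_1 be a maximizer of f_1. Then
         max_a f_1(a) − max_a f_2(a) = f_1(a_1) − max_a f_2(a) ≤ f_1(a_1) − f_2(a_1) ≤ max_a | f_1(a) − f_2(a) |.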

  16. Nondeterministic Case
      What if the reward and the next state are non-deterministic?
      We redefine V and Q by taking expected values:
         V^π(s) ≡ E[ r_t + γ r_{t+1} + γ² r_{t+2} + ... ]
                ≡ E[ Σ_{i=0..∞} γ^i r_{t+i} ]
         Q(s, a) ≡ E[ r(s, a) + γ V*(δ(s, a)) ]

  17. Nondeterministic Case
      Q learning generalizes to nondeterministic worlds.
      Alter the training rule to
         Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [ r + γ max_{a'} Q̂_{n−1}(s', a') ]
      where
         α_n = 1 / (1 + visits_n(s, a))
      Convergence of Q̂ to Q can still be proven [Watkins and Dayan, 1992].
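      A sketch of this altered rule as a single Python update step; Q and visits are assumed to be defaultdicts keyed by (state, action) pairs, in the style of the deterministic sketch above.

      def q_update_nondeterministic(Q, visits, s, a, reward, s_next, actions, gamma=0.9):
          """One nondeterministic Q-learning update with decaying learning rate alpha_n."""
          alpha = 1.0 / (1.0 + visits[(s, a)])     # alpha_n = 1 / (1 + visits_n(s, a))
          target = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
          # Q_hat_n(s, a) <- (1 - alpha_n) Q_hat_{n-1}(s, a) + alpha_n [r + gamma max_{a'} Q_hat_{n-1}(s', a')]
          Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * target
          visits[(s, a)] += 1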
