
Reinforcement Learning Read Chapter Exercises - PDF document



  1. Reinforcement Learning
     [Read Chapter 13]
     [Exercises]
     • Control learning
     • Control policies that choose optimal actions
     • Q learning
     • Convergence
     lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997

  2. Control Learning
     Consider learning to choose actions, e.g.,
     • Robot learning to dock on battery charger
     • Learning to choose actions to optimize factory output
     • Learning to play Backgammon
     Note several problem characteristics:
     • Delayed reward
     • Opportunity for active exploration
     • Possibility that state only partially observable
     • Possible need to learn multiple tasks with same sensors/effectors

  3. One Example: TD-Gammon (Tesauro, 1995)
     Learn to play Backgammon
     Immediate reward:
     • +100 if win
     • -100 if lose
     • 0 for all other states
     Trained by playing 1.5 million games against itself
     Now approximately equal to best human player

  4. Reinforcement Learning Problem
     [Figure: agent-environment interaction loop. At each step the agent observes
     the state and reward and emits an action; the environment returns the next
     state and reward, producing the sequence s_0 / a_0 / r_0, s_1 / a_1 / r_1,
     s_2 / a_2 / r_2, ...]
     Goal: learn to choose actions that maximize
       r_0 + \gamma r_1 + \gamma^2 r_2 + ...,  where 0 <= \gamma < 1
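
The discounted sum above is easy to compute directly. A minimal sketch, where the reward sequence and the value of gamma are made-up illustration values, not taken from the slide:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**i * r_i over a finite prefix of a reward sequence."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

# Made-up example: a reward of 100 arrives two steps in the future, gamma = 0.9,
# so its contribution to the return is 0.9**2 * 100 = 81.
print(discounted_return([0, 0, 100], 0.9))
```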

  5. Markov Decision Processes
     Assume
     • finite set of states S
     • set of actions A
     • at each discrete time t, agent observes state s_t in S and chooses action a_t in A
     • then receives immediate reward r_t
     • and state changes to s_{t+1}
     Markov assumption:
       s_{t+1} = \delta(s_t, a_t)  and  r_t = r(s_t, a_t)
     • i.e., r_t and s_{t+1} depend only on current state and action
     • functions \delta and r may be nondeterministic
     • functions \delta and r not necessarily known to agent

  6. Agent's Learning Task
     Execute actions in environment, observe results, and
     • learn action policy \pi : S -> A that maximizes
         E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + ...]
       from any starting state in S
     • here 0 <= \gamma < 1 is the discount factor for future rewards
     Note something new:
     • target function is \pi : S -> A
     • but we have no training examples of form <s, a>
     • training examples are of form <<s, a>, r>

  7. Value Function
     To begin, consider deterministic worlds...
     For each possible policy \pi the agent might adopt, we can define an
     evaluation function over states:
       V^\pi(s) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + ...
                \equiv \sum_{i=0}^{\infty} \gamma^i r_{t+i}
     where r_t, r_{t+1}, ... are generated by following policy \pi starting at state s.
     Restated, the task is to learn the optimal policy \pi^*:
       \pi^* \equiv argmax_\pi V^\pi(s),  (for all s)
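
In a small deterministic world, V^pi(s) can be approximated by simply following the policy and accumulating discounted rewards. A sketch with a hypothetical three-state corridor (states 0, 1, and an absorbing goal 'G'; all names and numbers below are invented for illustration):

```python
# Hypothetical deterministic world: delta and r as lookup tables.
delta = {(0, 'right'): 1, (1, 'right'): 'G', ('G', 'right'): 'G'}
r     = {(0, 'right'): 0, (1, 'right'): 100, ('G', 'right'): 0}
policy = {0: 'right', 1: 'right', 'G': 'right'}

def value_of_policy(s, policy, delta, r, gamma, horizon=100):
    """Approximate V^pi(s) by rolling the policy forward `horizon` steps."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy[s]
        total += discount * r[(s, a)]
        discount *= gamma
        s = delta[(s, a)]
    return total

print(value_of_policy(0, policy, delta, r, gamma=0.9))  # 0 + 0.9 * 100 = 90
```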

  8. [Figure: a grid world with absorbing goal state G, for \gamma = 0.9.
     Four panels show the r(s, a) (immediate reward) values, which are 100 on
     actions entering G and 0 elsewhere; the Q(s, a) values (100, 90, 81, 72, ...);
     the V^*(s) values (100, 90, 81); and one optimal policy.]

  9. What to Learn
     We might try to have agent learn the evaluation function V^{\pi^*}
     (which we write as V^*).
     It could then do a lookahead search to choose best action from any state s because
       \pi^*(s) = argmax_a [r(s, a) + \gamma V^*(\delta(s, a))]
     A problem:
     • This works well if agent knows \delta : S x A -> S, and r : S x A -> R
     • But when it doesn't, it can't choose actions this way

  10. Q Function
      Define new function very similar to V^*:
        Q(s, a) \equiv r(s, a) + \gamma V^*(\delta(s, a))
      If agent learns Q, it can choose optimal action even without knowing \delta!
        \pi^*(s) = argmax_a [r(s, a) + \gamma V^*(\delta(s, a))]
                 = argmax_a Q(s, a)
      Q is the evaluation function the agent will learn.
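
The point that a learned Q table suffices for action selection, with no model of delta or r, can be sketched in a few lines (the state name, action names, and Q values below are hypothetical):

```python
# Hypothetical learned table for a single state s1.
Q = {('s1', 'left'): 72.0, ('s1', 'up'): 81.0, ('s1', 'right'): 100.0}

def greedy_action(Q, s, actions):
    """pi(s) = argmax_a Q(s, a): no delta or r needed."""
    return max(actions, key=lambda a: Q[(s, a)])

print(greedy_action(Q, 's1', ['left', 'up', 'right']))  # right
```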

  11. Training Rule to Learn Q
      Note Q and V^* closely related:
        V^*(s) = max_{a'} Q(s, a')
      Which allows us to write Q recursively as
        Q(s_t, a_t) = r(s_t, a_t) + \gamma V^*(\delta(s_t, a_t))
                    = r(s_t, a_t) + \gamma max_{a'} Q(s_{t+1}, a')
      Nice! Let \hat{Q} denote learner's current approximation to Q.
      Consider training rule
        \hat{Q}(s, a) <- r + \gamma max_{a'} \hat{Q}(s', a')
      where s' is the state resulting from applying action a in state s.

  12. Q Learning for Deterministic Worlds
      For each s, a initialize table entry \hat{Q}(s, a) <- 0
      Observe current state s
      Do forever:
      • Select an action a and execute it
      • Receive immediate reward r
      • Observe the new state s'
      • Update the table entry for \hat{Q}(s, a) as follows:
          \hat{Q}(s, a) <- r + \gamma max_{a'} \hat{Q}(s', a')
      • s <- s'
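
The loop above can be sketched as tabular Q-learning on a hypothetical deterministic corridor (states 0..3, absorbing goal 3, reward 100 on the transition entering the goal). The choice of gamma = 0.9, the episode structure, and the random exploration policy are illustrative assumptions, not prescribed by the slide:

```python
import random

GOAL = 3
ACTIONS = (-1, +1)  # move left / move right

def delta(s, a):
    """Deterministic transition: step along the corridor, clipped to [0, GOAL]."""
    return max(0, min(GOAL, s + a))

def reward(s, a):
    """100 for the transition that enters the goal, 0 otherwise."""
    return 100 if s != GOAL and delta(s, a) == GOAL else 0

def q_learning(gamma=0.9, episodes=200, seed=0):
    rng = random.Random(seed)
    # For each s, a: initialize table entry Q-hat(s, a) <- 0.
    Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    for _ in range(episodes):          # "do forever", cut into episodes
        s = 0
        while s != GOAL:
            a = rng.choice(ACTIONS)    # select an action and execute it
            r = reward(s, a)           # receive immediate reward
            s2 = delta(s, a)           # observe the new state s'
            # Q-hat(s, a) <- r + gamma * max_a' Q-hat(s', a')
            Q[(s, a)] = r + gamma * max(Q[(s2, a2)] for a2 in ACTIONS)
            s = s2                     # s <- s'
    return Q

Q = q_learning()
print(Q[(2, +1)], Q[(1, +1)], Q[(0, +1)])  # converges toward 100, 90, 81
```

Extracting the greedy policy afterwards is just argmax_a Q[(s, a)] at each state.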

  13. Updating \hat{Q}
      [Figure: one transition in the grid world. Action a_right takes the agent
      from initial state s_1 to next state s_2; the arrows out of s_2 carry the
      current \hat{Q} values 63, 81, and 100.]
        \hat{Q}(s_1, a_right) <- r + \gamma max_{a'} \hat{Q}(s_2, a')
                               = 0 + 0.9 max{63, 81, 100}
                               = 90
      Notice if rewards non-negative, then
        (for all s, a, n)  \hat{Q}_{n+1}(s, a) >= \hat{Q}_n(s, a)
      and
        (for all s, a, n)  0 <= \hat{Q}_n(s, a) <= Q(s, a)
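
The single update shown on this slide is one line of arithmetic; a quick check using the values from the slide:

```python
gamma = 0.9
r = 0
successor_q_values = [63, 81, 100]   # current Q-hat(s2, a') for each action a'

# Q-hat(s1, a_right) <- r + gamma * max_a' Q-hat(s2, a')
new_q = r + gamma * max(successor_q_values)
print(new_q)  # 0 + 0.9 * 100 = 90
```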

  14. \hat{Q} converges to Q
      Consider case of deterministic world where each <s, a> is visited infinitely often.
      Proof: Define a full interval to be an interval during which each <s, a> is
      visited. During each full interval the largest error in the \hat{Q} table
      is reduced by factor of \gamma.
      Let \hat{Q}_n be the table after n updates, and \Delta_n be the maximum
      error in \hat{Q}_n; that is
        \Delta_n = max_{s,a} |\hat{Q}_n(s, a) - Q(s, a)|
      For any table entry \hat{Q}_n(s, a) updated on iteration n+1, the error in
      the revised estimate \hat{Q}_{n+1}(s, a) is
        |\hat{Q}_{n+1}(s, a) - Q(s, a)|
            = |(r + \gamma max_{a'} \hat{Q}_n(s', a')) - (r + \gamma max_{a'} Q(s', a'))|
            = \gamma |max_{a'} \hat{Q}_n(s', a') - max_{a'} Q(s', a')|
            <= \gamma max_{a'} |\hat{Q}_n(s', a') - Q(s', a')|
            <= \gamma max_{s'', a'} |\hat{Q}_n(s'', a') - Q(s'', a')|
        |\hat{Q}_{n+1}(s, a) - Q(s, a)| <= \gamma \Delta_n
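
The contraction argument can be checked empirically. On a hypothetical deterministic corridor (same style of toy world, gamma = 0.9), a synchronous sweep that updates every non-goal (s, a) pair once plays the role of one "full interval", and the maximum error shrinks by at least a factor of gamma per sweep:

```python
GOAL, GAMMA = 3, 0.9
ACTIONS = (-1, +1)
PAIRS = [(s, a) for s in range(GOAL + 1) for a in ACTIONS if s != GOAL]

def delta(s, a):
    return max(0, min(GOAL, s + a))

def reward(s, a):
    return 100 if s != GOAL and delta(s, a) == GOAL else 0

def sweep(Q):
    """One full interval: update every non-goal (s, a) once, synchronously."""
    old = dict(Q)
    for s, a in PAIRS:
        Q[(s, a)] = reward(s, a) + GAMMA * max(old[(delta(s, a), a2)] for a2 in ACTIONS)

# True Q for this toy world: iterate the recursive definition to convergence.
Q_true = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
for _ in range(100):
    sweep(Q_true)

# Start Q-hat at zero and track the max error Delta_n after each full interval.
Q_hat = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
errors = []
for _ in range(10):
    sweep(Q_hat)
    errors.append(max(abs(Q_hat[p] - Q_true[p]) for p in Q_true))

print(errors[0], errors[-1])
# Each full interval reduces the max error by at least a factor of gamma.
assert all(e2 <= GAMMA * e1 + 1e-9 for e1, e2 in zip(errors, errors[1:]))
```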

  15. Note we used general fact that
        |max_a f_1(a) - max_a f_2(a)| <= max_a |f_1(a) - f_2(a)|
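
A quick numeric illustration of this fact, with two made-up functions over a three-element action set:

```python
f1 = {'a': 3.0, 'b': 7.0, 'c': 1.0}
f2 = {'a': 5.0, 'b': 4.0, 'c': 6.0}

lhs = abs(max(f1.values()) - max(f2.values()))   # |7 - 6| = 1
rhs = max(abs(f1[a] - f2[a]) for a in f1)        # max{2, 3, 5} = 5
print(lhs <= rhs)  # True
```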
