Manuela Veloso - PowerPoint PPT Presentation

Manuela Veloso – see "Machine Learning", Tom Mitchell, Chapter 13 on RL. 15-381, Fall 2009


  1. Reinforcement Learning
     Manuela Veloso – see "Machine Learning", Tom Mitchell, Chapter 13 on RL. 15-381, Fall 2009

  2. Different Types of Learning
     • Supervised learning
       – Classification / concept learning
       – Learning from labeled data
       – Function approximation
     • Unsupervised learning
       – Data is not labeled
       – Data needs to be grouped, clustered
       – We need a distance metric
     • Control and action-model learning
       – Learning to select actions efficiently
       – Feedback: goal achievement, failure, reward
       – Control learning, reinforcement learning

  3. Discounting Future Rewards
     • "Reward" today versus future (promised) reward
     • Future rewards are not worth as much as current ones.
     • $100K + $100K + $100K + ... is an INFINITE sum.
     • Assume reality: a discount factor, say γ.
     • $100K + γ $100K + γ² $100K + ... CONVERGES.
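
A minimal sketch in Python (not part of the slides) of why the discounted sum converges; the discount factor 0.9 and the $100K payment are just illustrative numbers.

    gamma = 0.9                      # hypothetical discount factor, for illustration only
    payment = 100_000

    # Truncated version of the discounted sum 100K + g*100K + g^2*100K + ...
    print(round(sum(payment * gamma ** i for i in range(1_000))))   # ≈ 1,000,000
    print(payment / (1 - gamma))     # closed form of the geometric series: ≈ 1,000,000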

  4. Reinforcement Learning Problem
     Goal: Learn to choose actions that maximize
     r_0 + γ r_1 + γ² r_2 + ..., where 0 ≤ γ < 1
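
A small helper (a sketch, not from the slides) that evaluates this objective for a finite reward sequence; the reward list is made up.

    def discounted_return(rewards, gamma):
        # r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward sequence
        return sum(gamma ** i * r for i, r in enumerate(rewards))

    print(discounted_return([0, 0, 100], gamma=0.9))   # 0.9**2 * 100 ≈ 81.0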

  5. Markov Decision Process
     • Assume the world can be modeled as a Markov Decision Process, with rewards as a function of state and action.
     • Markov assumption: new states and rewards are a function only of the current state and action, i.e.,
       – s_{t+1} = δ(s_t, a_t)
       – r_t = r(s_t, a_t)
     • The functions δ and r may be nondeterministic and are not necessarily known to the learner.
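
A minimal sketch of a deterministic δ and r as lookup tables; the two states s1/s2 and the actions left/right are hypothetical names, not from the slides.

    # Hypothetical 2-state deterministic world; tables keyed by (state, action).
    delta = {("s1", "right"): "s2", ("s1", "left"): "s1",
             ("s2", "right"): "s2", ("s2", "left"): "s1"}
    reward = {("s1", "right"): 0, ("s1", "left"): 0,
              ("s2", "right"): 10, ("s2", "left"): 0}

    s, a = "s1", "right"
    print(delta[(s, a)], reward[(s, a)])    # next state s2, immediate reward 0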

  6. The Agent's Learning Task
     • Execute actions in the world,
     • Observe the state of the world,
     • Learn an action policy π : S → A
     • Maximize the expected reward E[r_t + γ r_{t+1} + γ² r_{t+2} + ...] from any starting state in S.
       – 0 ≤ γ < 1 is the discount factor for future rewards

  7. What to Learn
     • We have a target function to learn: π : S → A
     • We have no training examples of the form ⟨s, a⟩
     • We have training examples of the form ⟨⟨s, a⟩, r⟩ – immediate reward values r(s, a) (rewards can be any real number)

  8. Policies
     • There are many possible policies, of course not necessarily optimal, i.e., with maximum expected reward.
     • There can also be several optimal policies.
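
A sketch (not from the slides) of how many deterministic policies a small world admits: one action per state, so |A|^|S| policies. The state and action names simply mirror the exercise on slide 18.

    from itertools import product

    states = ["HOT", "MILD", "COLD"]
    actions = ["East", "West"]

    # A deterministic policy assigns one action to every state.
    policies = [dict(zip(states, choice)) for choice in product(actions, repeat=len(states))]
    print(len(policies))    # 2**3 = 8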

  9. Value Function
     • For each possible policy π, define an evaluation function over states (deterministic world):
       V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + ... ≡ Σ_{i=0}^{∞} γ^i r_{t+i}
       where r_t, r_{t+1}, ... are generated by following policy π starting at state s
     • Learning task: learn the OPTIMAL policy π* ≡ argmax_π V^π(s), (∀s)
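
A minimal sketch (not from the slides) of evaluating V^π in a deterministic world by rolling the policy forward and summing discounted rewards; the one-state usage example is hypothetical.

    def value_of_policy(s, policy, delta, reward, gamma, horizon=100):
        # Approximate V^pi(s) by summing discounted rewards along the trajectory pi generates;
        # the infinite sum is truncated at `horizon`, which is fine because gamma < 1.
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy[s]
            total += discount * reward[(s, a)]
            discount *= gamma
            s = delta[(s, a)]
        return total

    # Hypothetical one-state world that pays 1 per step: V = 1 / (1 - gamma) ≈ 10
    print(value_of_policy("s", {"s": "stay"}, {("s", "stay"): "s"}, {("s", "stay"): 1.0}, gamma=0.9))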

  10. Learning the Value Function
      • Learn the evaluation function V^{π*} ≡ V*.
      • Select the optimal action from any state s, i.e., have an optimal policy, by using V* with one-step lookahead:
        π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]

  11. A Problem with Using V*
      π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
      A problem:
      • This works well if the agent knows δ : S × A → S, and r : S × A → ℜ
      • When it doesn't, it can't choose actions this way

  12. Q Function
      • Define a new function very similar to V*:
        Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
      • Learn the Q function – Q-learning
      • If the agent learns Q, it can choose the optimal action even without knowing δ or r:
        π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ] = argmax_a Q(s, a)
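
A sketch (not from the slides) of acting greedily from a learned Q-table; the table values and state/action names are hypothetical.

    # Hypothetical learned Q-table; the optimal action in state s is argmax_a Q(s, a).
    Q = {("s1", "up"): 63.0, ("s1", "left"): 81.0, ("s1", "right"): 100.0}

    def greedy_action(Q, s, actions):
        return max(actions, key=lambda a: Q[(s, a)])

    print(greedy_action(Q, "s1", ["up", "left", "right"]))   # right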

  13. Q Learning
      Note that Q and V* are closely related:
        V*(s) = max_{a'} Q(s, a')
      which allows us to write Q recursively as
        Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t)) = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')
      Q-learning actively generates examples. It "processes" examples by updating its Q values. While learning, the Q values are approximations.

  14. Training Rule to Learn Q
      Let Q̂ denote the current approximation to Q. Then Q-learning uses the following training rule:
        Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
      where s' is the state resulting from applying action a in state s, and r is the reward that is returned.
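
A minimal sketch of this training rule as code (not from the slides); the state names s1/s2 and the pre-filled Q values are hypothetical, chosen to match the arithmetic on the next slide.

    from collections import defaultdict

    def q_update(Q, s, a, r, s_next, actions, gamma):
        # Deterministic-world training rule: Q(s, a) <- r + gamma * max_a' Q(s', a')
        Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)

    Q = defaultdict(float, {("s2", "up"): 63.0, ("s2", "left"): 81.0, ("s2", "right"): 100.0})
    q_update(Q, "s1", "right", r=0.0, s_next="s2", actions=["up", "left", "right"], gamma=0.9)
    print(Q[("s1", "right")])    # 0 + 0.9 * max{63, 81, 100} = 90.0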

  15. Example: Updating Q̂
        Q̂(s1, a_right) ← r + γ max_{a'} Q̂(s2, a')
                        ← 0 + 0.9 max{63, 81, 100}
                        ← 90

  16. Q Learning for Deterministic Worlds
      For each s, a initialize the table entry Q̂(s, a) ← 0
      Observe the current state s
      Do forever:
      • Select an action a and execute it
      • Receive immediate reward r
      • Observe the new state s'
      • Update the table entry for Q̂(s, a) as follows:
        Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
      • s ← s'
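
A minimal runnable sketch of the algorithm above (not from the slides) on a hypothetical 4-state chain world; the chain layout, the single reward of 100, and the episode restart are all illustrative assumptions.

    import random
    from collections import defaultdict

    states, actions, gamma = [0, 1, 2, 3], ["left", "right"], 0.8

    def delta(s, a):                 # deterministic transition function (assumed chain world)
        return min(s + 1, 3) if a == "right" else max(s - 1, 0)

    def reward(s, a):                # only stepping from state 2 into state 3 pays a reward
        return 100 if (s, a) == (2, "right") else 0

    Q = defaultdict(float)           # table entries Q(s, a) start at 0
    s = 0
    for _ in range(10_000):          # "do forever", truncated for the sketch
        a = random.choice(actions)   # any action selection that keeps exploring works here
        r, s_next = reward(s, a), delta(s, a)
        Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        s = 0 if s_next == 3 else s_next      # restart an episode after reaching the goal

    print({st: max(actions, key=lambda a: Q[(st, a)]) for st in states[:3]})   # all "right"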

  17. Q-Learning Iterations
      Starts at the bottom-left corner and moves clockwise around the perimeter; initially Q̂(s, a) = 0; γ = 0.8
        Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')

  18. Exercise: Deterministic World
      How many possible policies are there in this 3-state, 2-action deterministic world?
      A robot starts in the state MILD. It moves for 4 steps, choosing the actions West, East, East, West. The initial values of its Q̂-table are 0 and the discount factor is γ = 0.5.

      Initial State: MILD
      Action: West → New State: HOT
      Action: East → New State: MILD
      Action: East → New State: COLD
      Action: West → New State: MILD

      Q̂-table (East / West) after each step:
                   Initial   Step 1    Step 2    Step 3    Step 4
      HOT          0 / 0     0 / 0     5 / 0     5 / 0     5 / 0
      MILD         0 / 0     0 / 10    0 / 10    0 / 10    0 / 10
      COLD         0 / 0     0 / 0     0 / 0     0 / 0     0 / 5
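
A sketch (not from the slides) that replays the four updates above. The slide does not list the reward function explicitly; this code assumes the only nonzero immediate reward is r(MILD, West) = 10, which is consistent with the Q̂ values shown in the table.

    from collections import defaultdict

    delta = {("MILD", "West"): "HOT", ("HOT", "East"): "MILD",
             ("MILD", "East"): "COLD", ("COLD", "West"): "MILD"}
    reward = defaultdict(float, {("MILD", "West"): 10.0})   # assumed reward function

    Q, gamma, s = defaultdict(float), 0.5, "MILD"
    for a in ["West", "East", "East", "West"]:
        s_next = delta[(s, a)]
        Q[(s, a)] = reward[(s, a)] + gamma * max(Q[(s_next, a2)] for a2 in ["East", "West"])
        s = s_next

    # Nonzero entries: Q(MILD, West) = 10.0, Q(HOT, East) = 5.0, Q(COLD, West) = 5.0
    print({k: v for k, v in Q.items() if v})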

  19. Example

  20. Nondeterministic Case
      What if the reward and the next state are non-deterministic?
      We redefine V and Q by taking expected values:
        V^π(s) ≡ E[ r_t + γ r_{t+1} + γ² r_{t+2} + ... ] ≡ E[ Σ_{i=0}^{∞} γ^i r_{t+i} ]
        Q(s, a) ≡ E[ r(s, a) + γ V*(δ(s, a)) ]

  21. Nondeterministic Case
      Q-learning generalizes to nondeterministic worlds. Alter the training rule to
        Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [ r + γ max_{a'} Q̂_{n−1}(s', a') ]
      where α_n = 1 / (1 + visits_n(s, a)), and s' = δ(s, a).
      Q̂ still converges to Q* (Watkins and Dayan, 1992).
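
A minimal sketch (not from the slides) of this rule with the visit-count learning rate; the state/action names, γ = 0.9, and the sample transition are hypothetical.

    from collections import defaultdict

    Q, visits, gamma = defaultdict(float), defaultdict(int), 0.9

    def q_update(s, a, r, s_next, actions):
        # Blend the old estimate with the new sample, using alpha_n = 1 / (1 + visits_n(s, a)).
        visits[(s, a)] += 1
        alpha = 1.0 / (1 + visits[(s, a)])
        target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

    q_update("s1", "right", r=1.0, s_next="s2", actions=["left", "right"])
    print(Q[("s1", "right")])    # first update: alpha = 0.5, so 0.5 * (1 + 0.9 * 0) = 0.5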

  22. Nondeterministic Example
