Reinforcement Learning
Manuela Veloso
See "Machine Learning" – Tom Mitchell, chapter 13 on RL
15-381, Fall 2009
Learning Paradigms
• Supervised learning
– Classification / concept learning
– Learning from labeled data
– Function approximation
• Unsupervised learning
– Data is not labeled
– Data needs to be clustered, grouped
– We need a distance metric
• Control and action model learning
– Learning to select actions efficiently
– Feedback: goal achievement, failure, reward
– Control learning, reinforcement learning
2
Discounted Rewards
• "Reward" today versus future (promised) reward
• Future rewards are not worth as much as current rewards.
• $100K + $100K + $100K + ... is an INFINITE sum
• Assume reality discounts the future: a discount factor, say γ.
• $100K + γ $100K + γ² $100K + ... CONVERGES.
3
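The convergence claim is easy to check numerically: a constant reward discounted by γ is a geometric series with limit r / (1 − γ). A minimal sketch (the $100K and γ = 0.9 figures are just for illustration):

```python
# Discounted sum of a constant reward: sum of gamma**t * r converges
# to r / (1 - gamma) when 0 <= gamma < 1 (geometric series).

def discounted_sum(reward, gamma, horizon):
    """Truncated sum of gamma**t * reward for t = 0 .. horizon-1."""
    return sum(gamma**t * reward for t in range(horizon))

reward, gamma = 100_000, 0.9
approx = discounted_sum(reward, gamma, horizon=1000)
closed_form = reward / (1 - gamma)   # limit of the infinite series
```

With γ = 0.9, the infinite stream of $100K rewards is worth $1M today, not infinity.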
The Reinforcement Learning Problem
Goal: Learn to choose actions that maximize
r_0 + γ r_1 + γ² r_2 + ..., where 0 ≤ γ < 1
4
Markov Decision Processes
• Assume the world can be modeled as a Markov Decision Process, with rewards as a function of state and action.
• Markov assumption: the new state and reward are a function only of the current state and action, i.e.,
– s_{t+1} = δ(s_t, a_t)
– r_t = r(s_t, a_t)
• Unknown, possibly nondeterministic world: the functions δ and r may be nondeterministic and are not necessarily known to the learner.
5
Agent's Learning Task
• Execute actions in the world,
• Observe the state of the world,
• Learn an action policy π : S → A
• Maximize the expected reward E[r_t + γ r_{t+1} + γ² r_{t+2} + ...] from any starting state in S
– 0 ≤ γ < 1 is the discount factor for future rewards
6
What to Learn
• We have a target function to learn: π : S → A
• We have no training examples of the form ⟨s, a⟩
• We have training examples of the form ⟨⟨s, a⟩, r⟩ — immediate reward values r(s, a) (rewards can be any real number)
7
Policies
• There are many possible policies, of course not necessarily optimal, i.e., with maximum expected reward.
• There can also be nondeterministic policies.
8
Value Function
• For each possible policy π, define an evaluation function over states (deterministic world):

V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + ...
       ≡ Σ_{i=0}^∞ γ^i r_{t+i}

where r_t, r_{t+1}, ... are generated by following policy π starting at state s
• Learning task: Learn the OPTIMAL policy π* ≡ argmax_π V^π(s), (∀s)
9
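In a deterministic world, V^π(s) can be approximated by simply rolling the policy forward and truncating the infinite discounted sum. A sketch, assuming a made-up two-state world (states A/B, single action "go") not taken from the lecture:

```python
# Approximating V^pi(s) by following a policy and truncating the
# discounted sum. The toy transition and reward tables are invented.

delta = {("A", "go"): "B", ("B", "go"): "A"}   # deterministic transitions
r     = {("A", "go"): 1.0, ("B", "go"): 0.0}  # immediate rewards

def V(policy, s, gamma=0.9, steps=500):
    """Truncated discounted return from state s under the given policy."""
    total = 0.0
    for i in range(steps):
        a = policy(s)
        total += gamma**i * r[(s, a)]
        s = delta[(s, a)]
    return total

pi = lambda s: "go"   # the only available policy in this toy world
# From A the reward stream is 1, 0, 1, 0, ..., so V(A) = 1 / (1 - gamma**2)
```

Truncating at 500 steps is safe here because γ^500 is vanishingly small.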
Optimal Value Function
• Learn the evaluation function V^{π*} ≡ V*.
• Select the optimal action from any state s, i.e., have an optimal policy, by using V* with one-step lookahead:

π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
10
A Problem with Learning V*
π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
A problem:
• This works well if the agent knows δ : S × A → S, and r : S × A → ℜ
• When it doesn't, it can't choose actions this way
11
Q Function
• Define a new function very similar to V*:

Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

• Learn the Q function – Q-learning
• If the agent learns Q, it can choose the optimal action even without knowing δ or r:

π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
π*(s) = argmax_a Q(s, a)
12
Q and V*
Note that Q and V* are closely related:

V*(s) = max_{a'} Q(s, a')

which allows us to write Q recursively as

Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
            = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')

Q-learning actively generates examples. It "processes" examples by updating its Q values. While learning, the Q values are approximations.
13
Training Rule to Learn Q
Let Q̂ denote the current approximation to Q. Then Q-learning uses the following training rule:

Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')

where s' is the state resulting from applying action a in state s, and r is the reward that is returned.
14
Example: Updating Q̂

Q̂(s_1, a_right) ← r + γ max_{a'} Q̂(s_2, a')
                ← 0 + 0.9 max{63, 81, 100}
                ← 90
15
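The same update is one line of arithmetic. Only the values 0.9, 63, 81, 100 come from the slide; the action names at state s_2 are placeholders:

```python
gamma = 0.9
r = 0
# Current Q-hat estimates at the successor state s2 (action names invented)
q_next = {"a1": 63, "a2": 81, "a3": 100}

# Q-hat(s1, a_right) <- r + gamma * max over a' of Q-hat(s2, a')
q_update = r + gamma * max(q_next.values())
```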
Q-Learning for Deterministic Worlds
For each s, a initialize the table entry Q̂(s, a) ← 0
Observe the current state s
Do forever:
• Select an action a and execute it
• Receive immediate reward r
• Observe the new state s'
• Update the table entry for Q̂(s, a) as follows:
Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
• s ← s'
16
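The algorithm above can be sketched in a few lines of Python. The 3-state corridor (S0 – S1 – GOAL, reward 100 on reaching the goal, γ = 0.9) is an invented toy problem, not one from the lecture:

```python
import random

STATES = ["S0", "S1", "GOAL"]
ACTIONS = ["left", "right"]
GAMMA = 0.9

def step(s, a):
    """Deterministic transition delta(s, a) and reward r(s, a)."""
    if s == "S1" and a == "right":
        return "GOAL", 100.0
    if s == "S0" and a == "right":
        return "S1", 0.0
    if s == "S1" and a == "left":
        return "S0", 0.0
    return "S0", 0.0            # left from S0 bumps into the wall

# Initialize every table entry Q-hat(s, a) to 0
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

random.seed(0)
for _ in range(200):            # episodes, restarting at S0
    s = "S0"
    while s != "GOAL":
        a = random.choice(ACTIONS)          # explore with random actions
        s2, r = step(s, a)
        # Q-hat(s, a) <- r + gamma * max over a' of Q-hat(s', a')
        Q[(s, a)] = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        s = s2
```

After enough episodes the table converges to Q*: 100 for the final step into the goal, and γ-discounted values (90, 81, ...) for the actions further away.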
Q-Learning Iterations
• The agent starts at the bottom-left corner and moves clockwise around the perimeter; initially Q̂(s, a) = 0; γ = 0.8
• Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
[grid-world figure omitted]
17
A Simple Example
How many possible policies are there in this 3-state, 2-action deterministic world? (Each of the 3 states maps to one of 2 actions: 2³ = 8 deterministic policies.)
A robot starts in the state MILD. It moves for 4 steps choosing actions West, East, East, West. The initial values of its Q-table are 0 and the discount factor is γ = 0.5.

Initial state: MILD
Step 1: Action West → new state HOT
Step 2: Action East → new state MILD
Step 3: Action East → new state COLD
Step 4: Action West → new state MILD

Q̂-table (East / West) after each step:

        Initial   Step 1   Step 2   Step 3   Step 4
HOT     0 / 0     0 / 0    5 / 0    5 / 0    5 / 0
MILD    0 / 0     0 / 10   0 / 10   0 / 10   0 / 10
COLD    0 / 0     0 / 0    0 / 0    0 / 0    0 / 5
18
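The four updates can be replayed directly. The reward function is inferred from the table values: the only nonzero immediate reward assumed here is r(MILD, West) = 10, which is what the Q̂ entries imply:

```python
# Replaying the four Q-learning updates with gamma = 0.5.
# Assumption (inferred, not stated on the slide): r(MILD, West) = 10,
# all other immediate rewards are 0; every Q-hat entry starts at 0.

GAMMA = 0.5
STATES, ACTIONS = ("HOT", "MILD", "COLD"), ("East", "West")
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

def update(s, a, r, s2):
    Q[(s, a)] = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)

update("MILD", "West", 10, "HOT")   # Q(MILD, West) = 10 + 0.5 * 0  = 10
update("HOT",  "East",  0, "MILD")  # Q(HOT, East)  =  0 + 0.5 * 10 =  5
update("MILD", "East",  0, "COLD")  # Q(MILD, East) =  0 + 0.5 * 0  =  0
update("COLD", "West",  0, "MILD")  # Q(COLD, West) =  0 + 0.5 * 10 =  5
```

Each update pulls value back one step along the trajectory, which is how the 10 at (MILD, West) propagates as 5 into (HOT, East) and (COLD, West).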
Q-Learning Example
[figure-only slide]
19
Nondeterministic Case
What if the reward and next state are non-deterministic? We redefine V and Q by taking expected values:

V^π(s) ≡ E[r_t + γ r_{t+1} + γ² r_{t+2} + ...]
       ≡ Σ_{i=0}^∞ γ^i E[r_{t+i}]

Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))]
20
Nondeterministic Case
Q-learning generalizes to nondeterministic worlds. Alter the training rule to

Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [r + γ max_{a'} Q̂_{n−1}(s', a')]

where α_n = 1 / (1 + visits_n(s, a)), and s' = δ(s, a).

Q̂ still converges to Q* (Watkins and Dayan, 1992).
21
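A minimal sketch of the altered rule, with α_n = 1 / (1 + visits_n(s, a)); the states, actions, and the two noisy reward samples below are made up for illustration:

```python
# Nondeterministic Q-learning update: blend the old estimate with the
# new sample using a decaying learning rate alpha_n.

GAMMA = 0.9
Q = {}        # (s, a) -> current Q-hat estimate
visits = {}   # (s, a) -> number of times (s, a) has been updated

def nd_update(s, a, r, s2, actions):
    n = visits.get((s, a), 0) + 1          # visits_n(s, a)
    visits[(s, a)] = n
    alpha = 1.0 / (1 + n)
    sample = r + GAMMA * max(Q.get((s2, b), 0.0) for b in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample

# Two noisy reward samples for the same (s, a) pair (invented numbers);
# the decaying alpha averages over the stochastic rewards.
for r in (0.0, 10.0):
    nd_update("s", "a", r, "terminal", ["a"])
```

Because α_n shrinks as (s, a) is revisited, later samples perturb the estimate less and less, which is what makes convergence possible despite the noise.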
Nondeterministic Example
[figure-only slide]
22