Reinforcement Learning
Manuela Veloso
15-381, Fall 2009
see "Machine Learning" – Tom Mitchell, chapter 13 on RL
Supervised learning:
– Classification, concept learning
– Learning from labeled data
– Function approximation
Unsupervised learning:
– Data is not labeled
– Data needs to be clustered; we need a distance metric
Control learning, reinforcement learning:
– Learning to select actions efficiently
– Feedback: goal achievement, failure
Goal: Learn to choose actions that maximize r0 + γr1 + γ²r2 + ..., where 0 ≤ γ < 1
– Because 0 ≤ γ < 1, this INFINITE sum CONVERGES.
Markov Decision Process, with rewards as a function of state and action.
New states and rewards are a function only of the current state and action, i.e.,
– st+1 = δ(st, at)
– rt = r(st, at)
Functions δ and r may be nondeterministic and are not necessarily known to learner.
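To make the setup concrete, here is a minimal sketch of how a deterministic δ and r could be represented in code; the state names, actions, and reward values are invented for illustration and are not the lecture's example world.

```python
# Minimal sketch of a deterministic MDP as lookup tables.
# States, actions, and reward values below are made up for illustration.

# Transition function delta: (state, action) -> next state
delta = {
    ("s1", "right"): "s2",
    ("s2", "right"): "s3",
    ("s2", "left"):  "s1",
    ("s3", "left"):  "s2",
}

# Reward function r: (state, action) -> immediate reward
r = {
    ("s1", "right"): 0.0,
    ("s2", "right"): 100.0,   # e.g. an action that enters a goal state
    ("s2", "left"):  0.0,
    ("s3", "left"):  0.0,
}

def step(state, action):
    """Apply action in state: returns (next_state, reward)."""
    return delta[(state, action)], r[(state, action)]
```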
Learning task: learn a policy π : S → A that maximizes E[rt + γrt+1 + γ²rt+2 + ...] from any starting state in S
– 0 ≤ γ < 1, the discount factor for future rewards
(rewards can be any real number)
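As a small illustration of the quantity being maximized, the following sketch computes the discounted sum for a finite prefix of a reward sequence; the reward values are arbitrary examples.

```python
# Discounted return for a (finite prefix of a) reward sequence:
# sum over i of gamma^i * r_i.

def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** i) * ri for i, ri in enumerate(rewards))

print(discounted_return([0, 0, 100], gamma=0.9))  # prints ~81.0 (0.81 * 100)
```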
Example grid world (figure): immediate reward values r(s, a)
– We want the actions with maximum expected discounted reward, not necessarily the action with maximum immediate reward.
(deterministic world)
Vπ(s) ≡ rt + γrt+1 + γ²rt+2 + ... ≡ Σi=0..∞ γ^i rt+i
where rt, rt+1,... are generated by following policy π starting at state s
π* ≡ argmaxπVπ(s), (∀s)
π*(s) = argmaxa[r(s,a) + γV*(δ(s,a))]
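A rough sketch of what Vπ means operationally: follow π from s and accumulate discounted rewards. The tiny two-state world and the horizon cutoff below are assumptions for illustration only; since γ < 1, the tail that the cutoff ignores is small.

```python
# Sketch: value of a fixed policy pi in a deterministic world, obtained by
# following pi from state s and summing discounted rewards.

delta = {("A", "go"): "B", ("B", "go"): "A"}   # made-up next-state function
r     = {("A", "go"): 1.0, ("B", "go"): 0.0}   # made-up immediate rewards
pi    = {"A": "go", "B": "go"}                 # a fixed policy

def V_pi(s, gamma=0.9, horizon=1000):
    total, discount = 0.0, 1.0
    for _ in range(horizon):                   # truncate the infinite sum
        a = pi[s]
        total += discount * r[(s, a)]
        discount *= gamma
        s = delta[(s, a)]
    return total

print(V_pi("A"))  # approximately 1 / (1 - 0.81) ≈ 5.26
```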
A problem: computing π*(s) = argmaxa[r(s,a) + γV*(δ(s,a))] requires knowing δ : S × A → S and r : S × A → ℜ.
Q function: Q(s,a) ≡ r(s,a) + γV*(δ(s,a))
– Learn the Q function – Q-learning
– If the agent learns Q, it can choose optimal actions without knowing δ or r.
π*(s) = argmaxa[r(s,a) + γV*(δ(s,a))] = argmaxa Q(s,a)
Note that V* and Q are closely related: V*(s) = maxa′ Q(s,a′), which allows us to write Q recursively as
Q(st, at) = r(st, at) + γV*(δ(st, at)) = r(st, at) + γ maxa′ Q(st+1, a′)
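As an illustration of why Q is useful, the sketch below reads the greedy action and V* directly off a Q table, with no reference to δ or r; the Q values shown are made up.

```python
# Sketch: once a Q table is known, the optimal action and V* follow directly.

Q = {
    ("s1", "right"): 90.0, ("s1", "left"): 72.0,   # invented Q values
    ("s2", "right"): 100.0, ("s2", "left"): 81.0,
}
ACTIONS = ["left", "right"]

def greedy_action(s):
    """pi*(s) = argmax_a Q(s, a)"""
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def V_star(s):
    """V*(s) = max_a Q(s, a)"""
    return max(Q[(s, a)] for a in ACTIONS)

print(greedy_action("s1"), V_star("s1"))  # right 90.0
```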
Q-learning actively generates examples and "processes" them by updating its Q values. While learning, the Q values are approximations to the true Q.
Let Q̂ denote the learner's current approximation to Q. Then Q-learning uses the following training rule:
Q̂(s,a) ← r + γ maxa′ Q̂(s′,a′)
where s′ is the state resulting from applying action a in state s, and r is the reward that is returned.
Example update: Q̂(s1, aright) ← r + γ maxa′ Q̂(s2, a′) ← 0.9 · max{63, 81, 100} ← 90
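A quick check of the arithmetic in the update above (assuming the immediate reward r is 0):

```python
# New Q-hat value for the example update: r + gamma * max{63, 81, 100}.
r, gamma = 0, 0.9
print(r + gamma * max(63, 81, 100))  # 90.0
```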
For each s, a initialize the table entry Q̂(s,a) ← 0
Observe the current state s
Do forever:
– Select an action a and execute it
– Receive immediate reward r
– Observe the new state s′
– Update the table entry: Q̂(s,a) ← r + γ maxa′ Q̂(s′,a′)
– s ← s′
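A minimal sketch of this loop in Python, using an invented deterministic toy world with an absorbing goal and purely random action selection; it illustrates the update rule, not the lecture's grid world.

```python
import random

# Minimal sketch of deterministic Q-learning on a made-up two-state world
# with an absorbing goal (the episode restarts after the goal is reached).

ACTIONS = ["left", "right"]

delta = {  # deterministic transition function (invented)
    ("s1", "left"): "s1", ("s1", "right"): "s2",
    ("s2", "left"): "s1", ("s2", "right"): "goal",
}
r = {key: 0.0 for key in delta}
r[("s2", "right")] = 100.0          # only entering the goal is rewarded

gamma = 0.9
Q = {(s, a): 0.0 for s in ["s1", "s2", "goal"] for a in ACTIONS}  # Q-hat <- 0

s = "s1"
for _ in range(10_000):                        # "do forever", truncated
    a = random.choice(ACTIONS)                 # select and execute an action
    s_next, reward = delta[(s, a)], r[(s, a)]  # observe new state and reward
    # training rule: Q^(s,a) <- r + gamma * max_a' Q^(s',a')
    Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
    s = "s1" if s_next == "goal" else s_next   # restart episode at the goal

print(Q[("s1", "right")], Q[("s2", "right")])  # converges to 90.0 and 100.0
```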
Example: the agent starts at the bottom-left corner and moves clockwise around the perimeter; initially Q̂(s,a) = 0; γ = 0.8
Q̂(s,a) ← r + γ maxa′ Q̂(s′,a′)
How many possible … are there in this 3-state, 2-action deterministic world?
A robot starts in the state Mild. It moves for 4 steps, choosing actions West, East, East, West. The initial values of its Q-table are 0 and the discount factor is γ = 0.5.
Initial state: MILD
Action: West → New state: HOT
Action: East → New state: MILD
Action: East → New state: COLD
Action: West → New state: MILD
[Q̂-table with East/West columns for states HOT, MILD, and COLD, to be filled in after each step]
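A sketch for working through this exercise: the state/action trace below comes from the slide, but the immediate rewards are left as placeholders to be replaced with the values from the exercise.

```python
# Sketch of the exercise: apply the deterministic Q-learning rule along the
# given 4-step trace.  The reward values are PLACEHOLDERS (assumptions) --
# substitute the exercise's rewards before reading off the final Q-hat table.

gamma = 0.5
trace = [  # (state, action, reward_placeholder, next_state)
    ("MILD", "West", 0.0, "HOT"),
    ("HOT",  "East", 0.0, "MILD"),
    ("MILD", "East", 0.0, "COLD"),
    ("COLD", "West", 0.0, "MILD"),
]

ACTIONS = ["East", "West"]
Q = {(s, a): 0.0 for s in ["HOT", "MILD", "COLD"] for a in ACTIONS}

for s, a, reward, s_next in trace:
    # Q^(s,a) <- r + gamma * max_a' Q^(s',a')
    Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)

for key, value in sorted(Q.items()):
    print(key, value)
```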
What if reward and next state are nondeterministic? We redefine V, Q by taking expected values
Vπ(s) ≡ E[rt + γrt+1 + γ²rt+2 + ...]
Vπ(s) ≡ E[Σi=0..∞ γ^i rt+i]
Q(s,a) ≡ E[r(s,a) + γV*(δ(s,a))]
Q-learning generalizes to nondeterministic worlds. Alter the training rule to:
Q̂n(s,a) ← (1 − αn) Q̂n−1(s,a) + αn [r + γ maxa′ Q̂n−1(s′,a′)]
where αn = 1 / (1 + visitsn(s,a)) and s′ = δ(s,a).
Q̂ still converges to Q* (Watkins and Dayan, 1992).
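A sketch of this nondeterministic update with the visit-count based learning rate αn; the environment itself is left out, only the update is shown.

```python
from collections import defaultdict

# Sketch of the nondeterministic training rule: a decaying learning rate
# alpha_n = 1 / (1 + visits_n(s, a)) blends the old estimate with the new sample.

gamma = 0.9
Q = defaultdict(float)        # Q-hat table, entries default to 0
visits = defaultdict(int)     # visits_n(s, a)

def q_update(s, a, reward, s_next, actions):
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    target = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```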
π*(s) = D, for any s = S1, S2, S3, and S4; γ = 0.9.
What is the Q-value Q(S2, R)?
How should the agent choose actions while learning the Q̂ values?
– Exploration
– Exploitation
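One common way to balance the two (not necessarily the strategy used in the lecture) is ε-greedy action selection, sketched below.

```python
import random

# Sketch of epsilon-greedy action selection: with small probability epsilon
# pick a random action (explore), otherwise pick the current best (exploit).

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit: argmax_a Q(s,a)
```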
– Function approximator (neural net instead of table) – see the sketch below
– Generalization
– Reuse, use of macros
– Abstraction, learning substructure
– Learning a model δ̂ of the environment
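A rough sketch of the function-approximator idea: replace the Q̂ table with a parametric function, here a linear function of an assumed feature vector φ(s, a), and nudge its weights toward the Q-learning target. The feature map, step size, and function form are assumptions for illustration, not the lecture's method.

```python
import numpy as np

# Sketch: approximate Q(s, a) by a linear function of features phi(s, a)
# and move the weights toward the Q-learning target.

def q_value(w, phi):
    return float(np.dot(w, phi))

def q_approx_update(w, phi_sa, reward, phis_next, gamma=0.9, lr=0.01):
    """One gradient-style weight update.

    phi_sa    : feature vector for the visited (s, a)   [assumed feature map]
    phis_next : feature vectors, one per action a', for the next state s'
    """
    target = reward + gamma * max(q_value(w, p) for p in phis_next)
    error = target - q_value(w, phi_sa)
    return w + lr * error * phi_sa        # move Q(s,a) toward the target

# example usage with 3 made-up features:
w = np.zeros(3)
w = q_approx_update(w, np.array([1.0, 0.0, 0.0]), 1.0, [np.array([0.0, 1.0, 0.0])])
```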
– Q̂ is learned without a model δ̂ of the environment
– Deterministic and nondeterministic update rules