Manuela Veloso – PowerPoint PPT Presentation
SLIDE 1
  • Manuela Veloso


see “Machine Learning” – Tom Mitchell, chapter 13 on RL

15381, Fall 2009

SLIDE 2
  • Supervised learning

– Classification, concept learning
– Learning from labeled data
– Function approximation

  • Unsupervised learning

– Data is not labeled
– Data needs to be clustered
– We need a distance metric

  • Control and action model learning

– Learning to select actions efficiently
– Feedback: goal achievement, failure
– Control learning, reinforcement learning

SLIDE 3
  • “Reward” today versus future (promised) reward
  • Future rewards not worth as much as current.
  • $100K + $100K + $100K + ...

INFINITE sum

  • Assume, more realistically, a discount factor, say γ.
  • $100K + γ·$100K + γ²·$100K + ...

CONVERGES.

SLIDE 4

Goal: Learn to choose actions that maximize r0 + γr1 + γ²r2 + ..., where 0 ≤ γ < 1
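As a quick illustration (not part of the original slides), a minimal Python sketch of this discounted sum; the reward sequence and γ below are hypothetical examples:

```python
# Minimal sketch: discounted return r0 + gamma*r1 + gamma^2*r2 + ...
# The reward stream and gamma are illustrative, not from the slides.

def discounted_return(rewards, gamma):
    """Sum of gamma**i * rewards[i] over a finite reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

gamma = 0.9
rewards = [100_000] * 50                   # fifty "$100K" payments

print(discounted_return(rewards, gamma))   # ~994,846, approaching the limit
print(100_000 / (1 - gamma))               # infinite-horizon limit: 1,000,000
```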

SLIDE 5
  • Assume the world can be modeled as a Markov Decision Process, with rewards as a function of state and action.

New states and rewards are a function only of the current state and action, i.e.,

– st+1 = δ(st, at)
– rt = r(st, at)

Functions δ and r may be nondeterministic and are not necessarily known to learner.

SLIDE 6

  • Execute actions in world,
  • Observe state of world,
  • Learn action policy π : S → A
  • Maximize expected reward

E[rt + γrt+1 + γ²rt+2 + ...] from any starting state in S
– 0 ≤ γ < 1, discount factor for future rewards

SLIDE 7

  • We have a target function to learn π : S → A
  • We have no training examples of the form 〈s, a〉
  • We have training examples of the form 〈〈s, a〉, r〉 – immediate reward values r(s, a)

(rewards can be any real number)

SLIDE 8
  • There are many policies, of course not necessarily optimal ones, i.e., with maximum expected reward
  • There can also be multiple optimal policies

SLIDE 9

  • For each possible policy π, define an evaluation function over states (deterministic world):

Vπ(s) ≡ rt + γrt+1 + γ²rt+2 + ... ≡ Σ∞i=0 γⁱ rt+i

where rt, rt+1,... are generated by following policy π starting at state s

  • Learning task: Learn OPTIMAL policy

π* ≡ argmaxπVπ(s), (∀s)
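As an illustration of Vπ (not from the slides), a minimal Python sketch that approximates Vπ(s) by simply following a fixed policy π and summing discounted rewards; the states, transitions, rewards, and policy below are hypothetical:

```python
# Sketch: approximate V^pi(s) by rolling out policy pi in a deterministic world.
# All states, transitions, rewards, and the policy are made-up examples.

delta  = {("s1", "right"): "s2", ("s2", "right"): "s3", ("s3", "loop"): "s3"}
reward = {("s1", "right"): 0,    ("s2", "right"): 100,  ("s3", "loop"): 0}
pi     = {"s1": "right", "s2": "right", "s3": "loop"}
gamma  = 0.9

def value_of_policy(s, steps=100):
    """V^pi(s) ~ sum of gamma**i * r_i along the trajectory induced by pi."""
    total = 0.0
    for i in range(steps):
        a = pi[s]
        total += (gamma ** i) * reward[(s, a)]
        s = delta[(s, a)]
    return total

print(value_of_policy("s1"))   # 0 + 0.9*100 + 0 + ... = 90.0
```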


SLIDE 10

  • Learn the evaluation function Vπ*, written V*.
  • Select the optimal action from any state s, i.e., have an optimal policy, by using V* with one-step lookahead:

π*(s) = argmaxa[r(s,a) + γV*(δ(s,a))]
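A minimal sketch of this one-step lookahead, assuming the agent already knows δ, r, and V*; the tables below are hypothetical examples:

```python
# Sketch: pick the optimal action via one-step lookahead,
#   pi*(s) = argmax_a [ r(s,a) + gamma * V*(delta(s,a)) ]
# Assumes delta, r, and V* are known; all values are made-up.

gamma   = 0.9
actions = ["left", "right"]
delta   = {("s1", "left"): "s0", ("s1", "right"): "s2"}
reward  = {("s1", "left"): 0,    ("s1", "right"): 0}
v_star  = {"s0": 72.9, "s1": 81.0, "s2": 100.0}   # hypothetical optimal values

def greedy_action(s):
    return max(actions, key=lambda a: reward[(s, a)] + gamma * v_star[delta[(s, a)]])

print(greedy_action("s1"))   # "right": 0 + 0.9*100 beats 0 + 0.9*72.9
```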

SLIDE 11

A problem:

  • This works well if the agent knows δ : S × A → S, and r : S × A → ℜ

π*(s) = argmaxa[r(s,a) + γV*(δ(s,a))]

  • When it doesn’t, it can’t choose actions this way
SLIDE 12

  • Define a new function very similar to V*:

Q(s,a) ≡ r(s,a) + γV*(δ(s,a))

Learn the Q function – Q-learning

  • If the agent learns Q, it can choose the optimal action even without knowing δ or r:

π*(s) = argmaxa[r(s,a) + γV*(δ(s,a))] = argmaxa Q(s,a)
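By contrast with the previous lookahead, once a Q table is available, action selection needs neither δ nor r; a minimal sketch with a hypothetical Q table:

```python
# Sketch: pi*(s) = argmax_a Q(s,a) -- no model delta or reward function needed.
# The Q table below is a made-up example.

Q = {
    ("s1", "left"): 65.6,
    ("s1", "right"): 90.0,
    ("s2", "left"): 81.0,
    ("s2", "right"): 100.0,
}

def greedy_from_q(s, actions=("left", "right")):
    return max(actions, key=lambda a: Q[(s, a)])

print(greedy_from_q("s1"))   # "right"
```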

SLIDE 13
  • Note that Q and V* are closely related:

V*(s) = maxa′ Q(s,a′)

which allows us to write Q recursively as

Q(st,at) = r(st,at) + γV*(δ(st,at)) = r(st,at) + γ maxa′ Q(st+1,a′)

Q-learning actively generates examples. It “processes” examples by updating its Q̂ values. During learning, the Q̂ values are approximations to Q.

SLIDE 14

Let Q̂ denote the current approximation to Q. Then Q-learning uses the following training rule:

Q̂(s,a) ← r + γ maxa′ Q̂(s′,a′)

where s′ is the state resulting from applying action a in state s, and r is the reward that is returned.

SLIDE 15

Q̂(s1, aright) ← r + γ maxa′ Q̂(s2, a′)
            ← 0 + 0.9 · max{63, 81, 100}
            ← 90

SLIDE 16

For each s, a initialize table entry Q̂(s,a) ← 0
Observe current state s
Do forever:

  • Select an action a and execute it
  • Receive immediate reward r
  • Observe the new state s′
  • Update the table entry for Q̂(s,a) as follows:

Q̂(s,a) ← r + γ maxa′ Q̂(s′,a′)

  • s ← s′
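Putting the loop together, a minimal tabular Q-learning sketch for a deterministic world. The toy environment, episode handling, and random action selection are illustrative assumptions; the slides' algorithm runs forever and leaves action selection open:

```python
import random
from collections import defaultdict

# Sketch of the deterministic Q-learning loop above, on a made-up 1-D world:
# states 0..4, actions left/right, reward 100 on reaching state 4 (the goal).
GOAL, GAMMA = 4, 0.9
ACTIONS = ["left", "right"]

def step(s, a):
    """Deterministic delta and r for the toy world (illustrative only)."""
    s2 = max(0, s - 1) if a == "left" else min(GOAL, s + 1)
    return s2, (100 if s2 == GOAL and s != GOAL else 0)

Q = defaultdict(float)                      # Q_hat(s,a), initialized to 0

for episode in range(500):
    s = 0
    while s != GOAL:
        a = random.choice(ACTIONS)          # action selection left open in the slides
        s2, r = step(s, a)
        # Deterministic update: Q_hat(s,a) <- r + gamma * max_a' Q_hat(s',a')
        Q[(s, a)] = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        s = s2

print(round(Q[(3, "right")], 1))            # -> 100.0
print(round(Q[(2, "right")], 1))            # -> 90.0 (= 0.9 * 100)
```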

SLIDE 17

Starts at the bottom-left corner and moves clockwise around the perimeter; initially Q̂(s,a) = 0; γ = 0.8

Q̂(s,a) ← r + γ maxa′ Q̂(s′,a′)

SLIDE 18

How many possible policies are there in this 3-state, 2-action deterministic world?

A robot starts in the state MILD. It moves for 4 steps choosing actions West, East, East, West. The initial values of its Q̂-table are 0 and the discount factor is γ = 0.5.

Initial state: MILD
Action: West → new state: HOT
Action: East → new state: MILD
Action: East → new state: COLD
Action: West → new state: MILD

(Table: states HOT, MILD, COLD × actions East, West.)
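If the count being asked for is the number of deterministic policies (one action chosen per state), then each of the 3 states can independently be mapped to either of the 2 actions, so the count would be

|A|^|S| = 2³ = 8 possible policies.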

SLIDE 19

SLIDE 20

What if reward and next state are nondeterministic? We redefine V, Q by taking expected values

Vπ(s) ≡ E[rt + γrt+1 + γ²rt+2 + ...] ≡ E[Σ∞i=0 γⁱ rt+i]

Q(s,a) ≡ E[r(s,a) + γV*(δ(s,a))]

SLIDE 21

Q-learning generalizes to nondeterministic worlds. Alter the training rule to

Q̂n(s,a) ← (1 − αn) Q̂n−1(s,a) + αn [r + γ maxa′ Q̂n−1(s′,a′)]

where s′ = δ(s,a), and

αn = 1 / (1 + visitsn(s,a))

Q̂ still converges to Q* (Watkins and Dayan, 1992).
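A minimal sketch of this nondeterministic update with the visit-count learning rate αn; the single noisy transition, states, and rewards are made-up assumptions:

```python
import random
from collections import defaultdict

# Sketch of the nondeterministic Q-learning update:
#   Q_n(s,a) <- (1 - alpha_n) * Q_{n-1}(s,a) + alpha_n * [r + gamma * max_a' Q_{n-1}(s',a')]
#   alpha_n = 1 / (1 + visits_n(s,a))
# The noisy transition below is a hypothetical example.

GAMMA = 0.9
Q = defaultdict(float)
visits = defaultdict(int)

def noisy_step(s, a):
    """Hypothetical nondeterministic world: reward is 0 or 100 at random."""
    return "s_next", random.choice([0, 100])

def q_update(s, a):
    s2, r = noisy_step(s, a)
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = r + GAMMA * max(Q[(s2, b)] for b in ("left", "right"))
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

for _ in range(10_000):
    q_update("s0", "right")

print(round(Q[("s0", "right")], 1))   # ~50, the expected immediate reward here
```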

SLIDE 22

SLIDE 23

π*(s) = D, for any s = S1, S2, S3, and S4; γ = 0.9.


SLIDE 24

What is the Q-value, Q(S2, R)?

SLIDE 25

  • How should the learning agent use the Q values?

– Exploration
– Exploitation

  • Scaling up in the size of the state space

– Function approximator (neural net instead of table)
– Generalization
– Reuse, use of macros
– Abstraction, learning substructure
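One common way to balance exploration and exploitation is an ε-greedy rule; a minimal sketch in which ε and the Q values are illustrative assumptions (the slides do not prescribe a specific strategy):

```python
import random

# Sketch: epsilon-greedy action selection over a learned Q table.
# epsilon and the Q values are made-up; this is one common strategy,
# not the one mandated by the slides.

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit

Q = {("s1", "left"): 81.0, ("s1", "right"): 90.0}
print(epsilon_greedy(Q, "s1", ["left", "right"]))     # usually "right"
```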

SLIDE 26

  • Partially observable state
  • Continuous action, state spaces
  • Learn state abstractions
  • Optimal exploration strategies
  • Learn and use δ̂ : S × A → S
  • Multiple learners – multiagent reinforcement learning

SLIDE 27

  • Markov model for state/action transitions.
  • Value, policy iteration
  • Q-learning

– Deterministic, nondeterministic update rule

  • Exploration, exploitation
