

SLIDE 1

Reinforcement Learning III

Dec 03, 2008

SLIDE 2

Large State Spaces

• When a problem has a large state space, we can no longer represent the U or Q functions as explicit tables.
• Even if we had enough memory:
  • Never enough training data!
  • Learning takes too long.
• What to do??

SLIDE 3

Function Approximation

• Never enough training data!
  • Must generalize what is learned from one situation to other "similar" new situations.
• Idea:
  • Instead of using a large table to represent U or Q, use a parameterized function.
    • A small number of parameters (generally exponentially fewer parameters than the number of states).
  • Learn the parameters from experience.
  • When we update the parameters based on observations in one state, the U or Q estimate also changes for other, similar states.
    • This facilitates generalization of experience.

SLIDE 4

Example

• Consider a grid problem with no obstacles and deterministic actions U/D/L/R (49 states).
• Features for state s=(x,y): f1(s)=x, f2(s)=y (just 2 features).

[Figure: 7x7 grid; the goal state in the upper-right corner has reward 10]

SLIDE 5

Linear Function Approximation

• Define a set of state features f1(s), …, fn(s).
  • The features are used as our representation of states.
  • States with similar feature values will be treated similarly.
• A common approximation is to represent U(s) as a weighted sum of the features (i.e. a linear approximation):

    Û_θ(s) = θ1 f1(s) + θ2 f2(s) + … + θn fn(s)

SLIDE 6

Example

• Consider the grid problem with no obstacles and deterministic actions U/D/L/R (49 states).
• Features for state s=(x,y): f1(s)=x, f2(s)=y (just 2 features).
• U(s) = θ0 + θ1 x + θ2 y
• Is there a good linear approximation?
  • Yes: θ0 = 10, θ1 = -1, θ2 = -1 (note the upper right is the origin).
  • U(s) = 10 - x - y subtracts the Manhattan distance from the goal reward.
• Instead of storing a table of 49 entries, we now only need to store 3 parameters (see the sketch below).

[Figure: 7x7 grid; the goal state in the upper-right corner has reward 10]
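To make the example concrete, here is a minimal Python sketch (the function names and the 0..6 grid indexing from the upper-right origin are my assumptions, not from the slides):

    # Sketch of the 49-state grid example with a linear approximator.
    theta = [10.0, -1.0, -1.0]        # theta0, theta1, theta2

    def features(s):
        x, y = s                      # f1(s) = x, f2(s) = y
        return [1.0, x, y]            # the leading 1 pairs with theta0

    def U(s):
        # U(s) = theta0 + theta1*x + theta2*y
        return sum(t * f for t, f in zip(theta, features(s)))

    # The 3-parameter function reproduces the goal reward minus
    # Manhattan distance for all 49 states.
    for x in range(7):
        for y in range(7):
            assert U((x, y)) == 10 - (x + y)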

SLIDE 7

Function Approximation Accuracy

• The approximation accuracy is fundamentally limited by the information provided by the features.
• Can we always define features that allow for a perfect linear approximation?
  • Yes. Assign each state an indicator feature (i.e. the i'th feature is 1 iff the i'th state is present, and θi represents the value of the i'th state), as sketched below.
  • Of course, this requires far too many features and gives no generalization.
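As a quick illustration of the indicator-feature construction (a sketch; the helper names are mine):

    # Sketch: one indicator (one-hot) feature per state turns the
    # "linear" approximator into an exact table, with theta[i] the
    # value of state i.
    n_states = 49
    theta = [0.0] * n_states          # one parameter per state: no savings

    def indicator_features(i):
        f = [0.0] * n_states
        f[i] = 1.0                    # i'th feature is 1 iff state i is present
        return f

    def U_table(i):
        f = indicator_features(i)
        return sum(t * fi for t, fi in zip(theta, f))   # equals theta[i]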

SLIDE 8

Changed Reward: Bad Linear Approximation

• U(s) = θ0 + θ1 x + θ2 y
• Is there a good linear approximation?
  • No.

[Figure: grid with the goal reward 10 moved to the center of the grid, so no linear function of x and y fits the values]

SLIDE 9

But What If…

• U(s) = θ0 + θ1 x + θ2 y + θ3 z
• Include a new feature z:
  • z = |3-x| + |3-y|
  • z is the distance to the goal location.
• Does this allow a good linear approximation?
  • Yes: θ0 = 10, θ1 = θ2 = 0, θ3 = -1 (so U(s) = 10 - z; checked in the sketch below).
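A minimal check of this fix (a sketch; the same assumed grid indexing as before):

    # Sketch: adding the distance feature z makes the changed-reward
    # problem linear again; theta = (10, 0, 0, -1) gives U = 10 - z.
    def U_with_z(s):
        x, y = s
        z = abs(3 - x) + abs(3 - y)   # Manhattan distance to the goal at (3, 3)
        theta = [10.0, 0.0, 0.0, -1.0]
        return sum(t * f for t, f in zip(theta, [1.0, x, y, z]))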

SLIDE 10

Linear Function Approximation

• Define a set of features f1(s), …, fn(s).
  • The features are used as our representation of states.
  • States with similar feature values will be treated similarly.
  • More complex functions require more complex features.

    Û_θ(s) = θ1 f1(s) + θ2 f2(s) + … + θn fn(s)

• Our goal is to learn good parameter values (i.e. feature weights) that approximate the value function well.
  • How can we do this?
  • Use TD-based RL and somehow update the parameters based on each experience.

SLIDE 11

TD-based RL for Linear Approximators

1. Start with initial parameter values.
2. Take action according to an explore/exploit policy (should converge to greedy policy, i.e. GLIE).
3. Update estimated model.
4. Perform TD update for each parameter:

    θi ← ?

5. Goto 2.

What is a "TD update" for a parameter?

SLIDE 12

Aside: Gradient Descent for Squared Error

• Suppose that we have a sequence of states and target values for each state:

    ⟨s1, u(s1)⟩, ⟨s2, u(s2)⟩, …

  • E.g. produced by the TD-based RL loop.
• Our goal is to minimize the sum of squared errors between our estimated function Û_θ(sj) (our estimated value for the j'th state) and each target value u(sj) (the target value for the j'th state):

    Ej = ½ (Û_θ(sj) - u(sj))²     (squared error of example j)

• After seeing the j'th state, the gradient descent rule tells us to update all parameters by:

    θi ← θi - α ∂Ej/∂θi,   where   ∂Ej/∂θi = (∂Ej/∂Û_θ(sj)) (∂Û_θ(sj)/∂θi)

  and α is the learning rate.

SLIDE 13

Aside: Continued

• Expanding the gradient:

    θi ← θi - α ∂Ej/∂θi = θi + α (u(sj) - Û_θ(sj)) ∂Û_θ(sj)/∂θi

  where ∂Û_θ(sj)/∂θi depends on the form of the approximator.
• For a linear approximation function:

    Û_θ(s) = θ1 f1(s) + θ2 f2(s) + … + θn fn(s)

    ∂Û_θ(sj)/∂θi = fi(sj)

• Thus the update becomes (see the sketch below):

    θi ← θi + α (u(sj) - Û_θ(sj)) fi(sj)

• For linear functions, this update is guaranteed to converge to the best approximation for a suitable learning rate schedule.
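A minimal sketch of this per-example update in Python (variable names are my own; `features` is as in the earlier sketch):

    # Sketch: gradient-descent update for a linear approximator.
    def linear_update(theta, s, target, alpha=0.1):
        f = features(s)
        u_hat = sum(t * fi for t, fi in zip(theta, f))   # current estimate
        error = target - u_hat                           # u(s_j) - U_hat(s_j)
        # theta_i <- theta_i + alpha * error * f_i(s_j)
        return [t + alpha * error * fi for t, fi in zip(theta, f)]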

SLIDE 14

TD-based RL for Linear Approximators

1. Start with initial parameter values.
2. Take action according to an explore/exploit policy (should converge to greedy policy, i.e. GLIE). Transition from s to s'.
3. Update estimated model.
4. Perform TD update for each parameter:

    θi ← θi + α (u(s) - Û_θ(s)) fi(s)

5. Goto 2.

What should we use for the "target value" u(s)?
• Use the TD prediction based on the next state s':

    u(s) = R(s) + γ Û_θ(s')

  This is the same as the previous TD method, only with approximation.

SLIDE 15

TD-based RL for Linear Approximators

1. Start with initial parameter values.
2. Take action according to an explore/exploit policy (should converge to greedy policy, i.e. GLIE).
3. Update estimated model.
4. Perform TD update for each parameter:

    θi ← θi + α (R(s) + γ Û_θ(s') - Û_θ(s)) fi(s)

5. Goto 2.

• Note that step 2 still requires the model T to select actions (the greedy policy looks ahead using T). A full loop under this scheme is sketched below.
• To avoid this, we can do the same thing for model-free Q-learning.
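Putting the loop together, here is a minimal sketch (`choose_action`, `step`, and `reward` are hypothetical environment hooks, not from the slides; `features` is as before):

    # Sketch of the TD loop with a linear approximator.
    def td_learn(theta, s0, n_steps, alpha=0.1, gamma=0.9):
        s = s0
        for _ in range(n_steps):
            a = choose_action(s, theta)              # explore/exploit (GLIE)
            s_next = step(s, a)                      # transition from s to s'
            u_hat = lambda st: sum(t * f for t, f in zip(theta, features(st)))
            error = reward(s) + gamma * u_hat(s_next) - u_hat(s)   # TD error
            theta = [t + alpha * error * fi
                     for t, fi in zip(theta, features(s))]
            s = s_next
        return theta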

SLIDE 16

Q-learning with Linear Approximators

    Q̂_θ(s,a) = θ1 f1(s,a) + θ2 f2(s,a) + … + θn fn(s,a)

Features are a function of states and actions.

1. Start with initial parameter values.
2. Take action according to an explore/exploit policy (should converge to greedy policy, i.e. GLIE).
3. Perform TD update for each parameter (see the sketch below):

    θi ← θi + α (R(s) + γ max_a' Q̂_θ(s',a') - Q̂_θ(s,a)) fi(s,a)

4. Goto 2.

• For both Q and U, these algorithms converge to the closest linear approximation to the optimal Q or U.
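A minimal sketch of this Q-learning variant (again with hypothetical helpers; `ACTIONS`, `q_features`, `step`, and `reward` are assumptions):

    # Sketch of Q-learning with a linear approximator over
    # state-action features.
    def q_hat(theta, s, a):
        return sum(t * f for t, f in zip(theta, q_features(s, a)))

    def q_update(theta, s, a, alpha=0.1, gamma=0.9):
        s_next = step(s, a)
        best_next = max(q_hat(theta, s_next, a2) for a2 in ACTIONS)
        error = reward(s) + gamma * best_next - q_hat(theta, s, a)
        # theta_i <- theta_i + alpha * error * f_i(s, a)
        new_theta = [t + alpha * error * fi
                     for t, fi in zip(theta, q_features(s, a))]
        return new_theta, s_next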

SLIDE 17

Summary of RL

• MDP
  • Definition of an MDP (T, R, S).
  • Solving an MDP for the optimal policy: value iteration, policy iteration.
• RL
  • Difference between RL and MDPs.
  • Different methods for passive RL: DUE, ADP, TD.
  • Different methods for active RL: ADP, Q-learning with TD learning.
  • Function approximation for large state/action spaces.

SLIDE 18

Learning Objectives

1) Students are able to apply supervised learning algorithms to prediction problems and evaluate the results.
2) Students are able to apply unsupervised learning algorithms to data analysis problems and evaluate the results.
3) Students are able to apply reinforcement learning algorithms to control problems and evaluate the results.
4) Students are able to take a description of a new problem and decide what kind of problem (supervised, unsupervised, or reinforcement) it is.