

SLIDE 1

A Closer Look at Function Approximation

Robert Platt Northeastern University

SLIDE 2

The problem of large and continuous state spaces

Example of a large state space: the Atari Learning Environment
  • state: the video game screen
  • actions: joystick actions
  • reward: the game score

[Figure: agent-environment loop. The agent takes actions a; it perceives states s and rewards r.]

Why are large state spaces a problem for tabular methods?

  • 1. Many states may never be visited.
  • 2. There is no notion that the agent should behave similarly in "similar" states.

SLIDE 3

Function approximation

Approximate the value function with a function approximator: $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$, where $\hat{v}$ is some kind of function approximator parameterized by the weight vector $\mathbf{w}$.

SLIDE 4

Which Function Approximator?

There are many function approximators, e.g.:
  • Linear combinations of features
  • Neural networks
  • Decision trees
  • Nearest neighbour
  • Fourier / wavelet bases

We will require the function approximator to be differentiable, and it must be able to handle non-stationary, non-iid data.

SLIDE 5

Approximating value function using SGD

Goal: find the parameter vector $\mathbf{w}$ minimizing the mean-squared error between the approximate value function, $\hat{v}(s, \mathbf{w})$, and the true value function, $v_\pi(s)$:

$J(\mathbf{w}) = \mathbb{E}_\pi\big[(v_\pi(S) - \hat{v}(S, \mathbf{w}))^2\big]$

Approach: do gradient descent on this cost function.

For starters, let's focus on policy evaluation, i.e. estimating $v_\pi$.

SLIDE 6

Approximating value function using SGD

Goal: find the parameter vector $\mathbf{w}$ minimizing the mean-squared error between the approximate value function, $\hat{v}(s, \mathbf{w})$, and the true value function, $v_\pi(s)$:

$J(\mathbf{w}) = \mathbb{E}_\pi\big[(v_\pi(S) - \hat{v}(S, \mathbf{w}))^2\big]$

Approach: do gradient descent on this cost function. Here's the gradient:

$\nabla_\mathbf{w} J(\mathbf{w}) = -2\,\mathbb{E}_\pi\big[(v_\pi(S) - \hat{v}(S, \mathbf{w}))\,\nabla_\mathbf{w}\hat{v}(S, \mathbf{w})\big]$

For starters, let's focus on policy evaluation, i.e. estimating $v_\pi$.

SLIDE 7

Linear value function approximation

Let’s approximate as a linear function of features: where x(s) is the feature vector:

SLIDE 8

Think-pair-share

Can you think of some good features for Pacman?

SLIDE 9

Linear value function approx: coarse coding

For example, the elements of $\mathbf{x}(s)$ could correspond to regions of the state space:

Binary features – one feature for each circle (above)

SLIDE 10

Linear value function approx: coarse coding

For example, the elements of $\mathbf{x}(s)$ could correspond to regions of the state space:

Binary features – one feature for each circle (above)

The value of a state is encoded by the combination of all feature regions that the state falls inside.
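As a concrete illustration of coarse coding, here is a minimal sketch (not from the slides) of binary features over circular receptive fields in a 2-D state space. The 5x5 grid of centers, the radius, and the unit-square state space are assumptions made for the example.

```python
import numpy as np

def coarse_code(state, centers, radius):
    """Binary coarse-coding features: feature i is 1 iff the state
    falls inside the i-th circular receptive field."""
    state = np.asarray(state, dtype=float)
    return (np.linalg.norm(centers - state, axis=1) <= radius).astype(float)

# Assumed setup: a 5x5 grid of circle centers over the unit square.
centers = np.array([[i / 4.0, j / 4.0] for i in range(5) for j in range(5)])
w = np.zeros(len(centers))                       # one weight per circle
x = coarse_code([0.3, 0.7], centers, radius=0.3) # feature vector x(s)
v_hat = w @ x                                    # linear value estimate w^T x(s)
```

A state's value estimate is the sum of the weights of every circle it falls in, which is why overlapping circles make nearby states generalize to one another.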

SLIDE 11

The effect of overlapping feature regions

SLIDE 12

Think-pair-share

What type of linear features might be appropriate for this problem? What is the relationship between feature shape and generalization?

[Figure: gridworld with a cliff region and a goal region.]

SLIDE 13

Linear value function approx: tile coding

For example, $\mathbf{x}(s)$ could be constructed using tile coding:
  • Each tiling is a partition of the state space.
  • Each tiling assigns each state to a unique tile.

Binary features: $n$ = (num tiles per tiling) × (num tilings). In this example: $n = 16 \times 4 = 64$.
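Below is a minimal sketch of tile coding for a 2-D state in the unit square, matching the 16 tiles × 4 tilings count above. The particular offsets and the unit-square assumption are mine, not the slide's.

```python
import numpy as np

def tile_code(state, num_tilings=4, tiles_per_dim=4):
    """Binary tile-coding features: each tiling partitions the unit square
    into a tiles_per_dim x tiles_per_dim grid; exactly one tile per tiling
    is active for any given state."""
    state = np.asarray(state, dtype=float)
    n_tiles = tiles_per_dim ** 2
    features = np.zeros(num_tilings * n_tiles)
    for t in range(num_tilings):
        # Offset each tiling by a different fraction of one tile width.
        offset = t / (num_tilings * tiles_per_dim)
        idx = np.floor((state + offset) * tiles_per_dim).astype(int)
        idx = np.clip(idx, 0, tiles_per_dim - 1)
        features[t * n_tiles + idx[0] * tiles_per_dim + idx[1]] = 1.0
    return features

x = tile_code([0.37, 0.62])   # n = 16 tiles x 4 tilings = 64 features, 4 of them active
```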

SLIDE 14

Think-pair-share

The value of a state is encoded by the combination of all tiles that the state intersects (one per tiling). State aggregation is a special case of tile coding.
  • How many tilings are there in that case?
  • What do the weights correspond to in that case?

Binary features: $n$ = (num tiles per tiling) × (num tilings). In this example: $n = 16 \times 4$.

SLIDE 15

Think-pair-share

  • What are the pros/cons of rectangular tiles like this?
  • What are the pros/cons of evenly spacing the tilings vs. placing them at uneven offsets?

Binary features: $n$ = (num tiles per tiling) × (num tilings). In this example: $n = 16 \times 4$.

SLIDE 16

Recall the Monte Carlo policy evaluation algorithm

Let’s think about how to do the same thing using function approximation...

SLIDE 17

Gradient Monte Carlo policy evaluation

Goal: estimate $v_\pi$.

Notice that in MC, the return $G_t$ is an unbiased, noisy sample of the true value, $v_\pi(S_t)$. We can therefore apply supervised learning to the "training data":

$\langle S_1, G_1 \rangle, \langle S_2, G_2 \rangle, \ldots, \langle S_T, G_T \rangle$

The weight update "sampled" from the training data is:

$\mathbf{w} \leftarrow \mathbf{w} + \alpha\,\big(G_t - \hat{v}(S_t, \mathbf{w})\big)\,\nabla_\mathbf{w}\hat{v}(S_t, \mathbf{w})$

SLIDE 18

Gradient Monte Carlo policy evaluation

Goal: estimate $v_\pi$.

Notice that in MC, the return $G_t$ is an unbiased, noisy sample of the true value, $v_\pi(S_t)$. We can therefore apply supervised learning to the "training data":

$\langle S_1, G_1 \rangle, \langle S_2, G_2 \rangle, \ldots, \langle S_T, G_T \rangle$

The weight update "sampled" from the training data is:

$\mathbf{w} \leftarrow \mathbf{w} + \alpha\,\big(G_t - \hat{v}(S_t, \mathbf{w})\big)\,\nabla_\mathbf{w}\hat{v}(S_t, \mathbf{w})$

For a linear function approximator, this is:

$\mathbf{w} \leftarrow \mathbf{w} + \alpha\,\big(G_t - \mathbf{w}^\top \mathbf{x}(S_t)\big)\,\mathbf{x}(S_t)$
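Here is a minimal sketch of the gradient MC update for a linear approximator, applied once per completed episode. The episode format (a list of (state, reward) pairs), the feature function x, and the step size are assumptions for illustration.

```python
import numpy as np

def gradient_mc_update(episode, w, x, alpha=2e-5, gamma=1.0):
    """One backward pass over an episode: for each visited state S_t,
    apply w += alpha * (G_t - w^T x(S_t)) * x(S_t), where G_t is the
    return observed from S_t onward."""
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G                    # return from this state
        features = x(state)
        w += alpha * (G - w @ features) * features
    return w
```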

SLIDE 19

Gradient Monte Carlo policy evaluation

For linear function approximation, gradient MC converges to the weights that minimize MSE wrt the true value function. Even for non-linear function approximation, gradient MC converges to a local optimum. However, since this is MC, the estimates are high-variance.

SLIDE 20

Gradient MC example: 1000-state random walk

SLIDE 21

Gradient MC example: 1000-state random walk

The whole value function over 1000 states will be approximated with 10 numbers!
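A hedged sketch of how that can be done with state aggregation: each of 10 feature groups covers 100 consecutive states, so each weight is the value estimate shared by its whole group. The 1-indexed state numbering is an assumption.

```python
import numpy as np

def aggregate_features(state, num_states=1000, num_groups=10):
    """One-hot feature vector over groups of num_states // num_groups states,
    so the value function is represented by just num_groups weights."""
    x = np.zeros(num_groups)
    x[(state - 1) * num_groups // num_states] = 1.0   # states numbered 1..num_states
    return x
```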

SLIDE 22

Question

The whole value function over 1000 states will be approximated with 10 numbers! How many tilings are used here?

SLIDE 23

Gradient MC example: 1000-state random walk

SLIDE 24

Gradient MC example: 1000-state random walk

Converges to an unbiased value estimate.

SLIDE 25

Question

What is the relationship between the state distribution $\mu$ and the policy? How do you correct for the fact that the policy visits some states more often than others?

SLIDE 26

TD Learning with value function approximation

The TD target, $R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w})$, is a biased estimate of the true value, $v_\pi(S_t)$. But let's ignore that and use the TD target anyway...

Training data: $\langle S_1,\; R_2 + \gamma\,\hat{v}(S_2, \mathbf{w}) \rangle, \langle S_2,\; R_3 + \gamma\,\hat{v}(S_3, \mathbf{w}) \rangle, \ldots$

SLIDE 27

TD Learning with value function approximation

The TD target, $R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w})$, is a biased estimate of the true value, $v_\pi(S_t)$. But let's ignore that and use the TD target anyway...

Training data: $\langle S_1,\; R_2 + \gamma\,\hat{v}(S_2, \mathbf{w}) \rangle, \langle S_2,\; R_3 + \gamma\,\hat{v}(S_3, \mathbf{w}) \rangle, \ldots$

This gives us TD(0) policy evaluation with:

$\mathbf{w} \leftarrow \mathbf{w} + \alpha\,\big(R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\big)\,\nabla_\mathbf{w}\hat{v}(S_t, \mathbf{w})$

SLIDE 28

TD Learning with value function approximation

The TD target, $R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w})$, is a biased estimate of the true value, $v_\pi(S_t)$. But let's ignore that and use the TD target anyway...

Training data: $\langle S_1,\; R_2 + \gamma\,\hat{v}(S_2, \mathbf{w}) \rangle, \langle S_2,\; R_3 + \gamma\,\hat{v}(S_3, \mathbf{w}) \rangle, \ldots$

This gives us TD(0) policy evaluation with:

$\mathbf{w} \leftarrow \mathbf{w} + \alpha\,\big(R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\big)\,\nabla_\mathbf{w}\hat{v}(S_t, \mathbf{w})$

(Here $S_{t+1}$ denotes the next state.)
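A minimal sketch of semi-gradient TD(0) policy evaluation with a linear approximator. The environment interface (reset() and step() returning next state, reward, done) and the feature function x are assumptions for the example.

```python
import numpy as np

def semi_gradient_td0(env, policy, x, num_features,
                      alpha=0.01, gamma=0.99, num_episodes=100):
    """The TD target r + gamma * w^T x(s') is treated as a fixed label:
    only v_hat(s, w) is differentiated, hence "semi-gradient"."""
    w = np.zeros(num_features)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r if done else r + gamma * (w @ x(s_next))
            w += alpha * (target - w @ x(s)) * x(s)
            s = s_next
    return w
```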

SLIDE 29

TD Learning with value function approximation

SLIDE 30

Think-pair-share

Why is this called "semi-gradient"? Here's the update rule we're using:

$\mathbf{w} \leftarrow \mathbf{w} + \alpha\,\big(R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\big)\,\nabla_\mathbf{w}\hat{v}(S_t, \mathbf{w})$

Loss function: $J(\mathbf{w}) = \mathbb{E}\big[\big(R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\big)^2\big]$

Is this really the gradient of the loss? What is the gradient actually?

SLIDE 31

Semi-gradient TD(0) ex: 1000-state random walk

Converges to a biased value estimate.

SLIDE 32

Convergence results summary

  • 1. Gradient MC converges for both linear and non-linear function approximation.
  • 2. Gradient MC converges to optimal value estimates
      – converges to the values that minimize MSE.
  • 3. Semi-gradient TD(0) converges for linear function approximation.
  • 4. Semi-gradient TD(0) converges to a biased estimate
      – converges to a point, $\mathbf{w}_{TD}$, that does not minimize MSE
      – but we have: $\mathrm{MSE}(\mathbf{w}_{TD}) \le \frac{1}{1-\gamma}\,\min_\mathbf{w} \mathrm{MSE}(\mathbf{w})$

Here $\mathbf{w}_{TD}$ is the fixed point for semi-gradient TD, and the minimum on the right is attained at the point that minimizes MSE.

SLIDE 33

TD Learning with value function approximation

For linear function approximation, semi-gradient TD(0) converges to a biased estimate of the weights, $\mathbf{w}_{TD}$, such that:

$\mathrm{MSE}(\mathbf{w}_{TD}) \le \frac{1}{1-\gamma}\,\min_\mathbf{w} \mathrm{MSE}(\mathbf{w})$

where $\mathbf{w}_{TD}$ is the fixed point for semi-gradient TD and the minimum on the right is attained at the point that minimizes MSE.

SLIDE 34

Think-pair-share

Write the semi-gradient weight update equation for the special case of linear function approximation. How would you modify this algorithm for Q-learning?

SLIDE 35

Linear Sarsa with Coarse Coding in Mountain Car

SLIDE 36

Linear Sarsa with Coarse Coding in Mountain Car

SLIDE 37

Least Squares Policy Iteration (LSPI)

Recall that for linear function approximation, $J(\mathbf{w})$ is quadratic in the weights:

$J(\mathbf{w}) = \sum_t \big(G_t - \mathbf{w}^\top \mathbf{x}(s_t)\big)^2$

so we can solve directly for the $\mathbf{w}$ that minimizes $J(\mathbf{w})$. First, let's think about this in the context of batch policy evaluation.

SLIDE 38

Policy evaluation

Given:
  – a dataset $D = \{(s_t, G_t)\}_{t=1}^{T}$ generated using policy $\pi$

Find the $\mathbf{w}$ that minimizes:

$J(\mathbf{w}) = \sum_t \big(G_t - \mathbf{w}^\top \mathbf{x}(s_t)\big)^2$

SLIDE 39

Question

Given:
  – a dataset $D = \{(s_t, G_t)\}_{t=1}^{T}$ generated using policy $\pi$

Find the $\mathbf{w}$ that minimizes:

$J(\mathbf{w}) = \sum_t \big(G_t - \mathbf{w}^\top \mathbf{x}(s_t)\big)^2$

HOW?

SLIDE 40

Think-pair-share

Given a dataset of $(a, b)$ pairs, find the $w$ that minimizes the squared error, where $a$, $b$, and $w$ are scalars. What if $b$ is a vector?

SLIDE 41

Policy evaluation

Given:
  – a dataset $D = \{(s_t, G_t)\}_{t=1}^{T}$ generated using policy $\pi$

Find the $\mathbf{w}$ that minimizes $J(\mathbf{w}) = \sum_t \big(G_t - \mathbf{w}^\top \mathbf{x}(s_t)\big)^2$.

  • 1. Set the derivative to zero: $\nabla_\mathbf{w} J(\mathbf{w}) = -2 \sum_t \mathbf{x}(s_t)\big(G_t - \mathbf{w}^\top \mathbf{x}(s_t)\big) = 0$
SLIDE 42

Policy evaluation

Given:
  – a dataset $D = \{(s_t, G_t)\}_{t=1}^{T}$ generated using policy $\pi$

Find the $\mathbf{w}$ that minimizes $J(\mathbf{w}) = \sum_t \big(G_t - \mathbf{w}^\top \mathbf{x}(s_t)\big)^2$.

  • 1. Set the derivative to zero: $\nabla_\mathbf{w} J(\mathbf{w}) = -2 \sum_t \mathbf{x}(s_t)\big(G_t - \mathbf{w}^\top \mathbf{x}(s_t)\big) = 0$
  • 2. Solve for $\mathbf{w}$: $\mathbf{w} = \Big(\sum_t \mathbf{x}(s_t)\,\mathbf{x}(s_t)^\top\Big)^{-1} \sum_t \mathbf{x}(s_t)\,G_t$ (a NumPy sketch follows below)
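A minimal NumPy sketch of this closed-form solve. The small ridge term lam is an addition of mine (motivated by the conditioning question on a later slide), not part of the formula above.

```python
import numpy as np

def lsmc_weights(features, returns, lam=1e-3):
    """Solve (sum_t x_t x_t^T + lam*I) w = sum_t x_t G_t.
    features: (T, n) matrix with rows x(s_t); returns: (T,) vector of G_t."""
    X = np.asarray(features, dtype=float)
    G = np.asarray(returns, dtype=float)
    A = X.T @ X + lam * np.eye(X.shape[1])   # sum_t x_t x_t^T (+ ridge term)
    b = X.T @ G                              # sum_t x_t G_t
    return np.linalg.solve(A, b)
```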
SLIDE 43

LSMC policy evaluation

  • 1. Collect a batch of experience $\{(s_t, G_t)\}_{t=1}^{T}$ under policy $\pi$.
  • 2. Calculate the weights using: $\mathbf{w} = \Big(\sum_t \mathbf{x}(s_t)\,\mathbf{x}(s_t)^\top\Big)^{-1} \sum_t \mathbf{x}(s_t)\,G_t$
SLIDE 44

LSMC policy evaluation

  • 1. Collect a batch of experience $\{(s_t, G_t)\}_{t=1}^{T}$ under policy $\pi$.
  • 2. Calculate the weights using: $\mathbf{w} = \Big(\sum_t \mathbf{x}(s_t)\,\mathbf{x}(s_t)^\top\Big)^{-1} \sum_t \mathbf{x}(s_t)\,G_t$

How do we ensure this matrix is well conditioned?

SLIDE 45

Question

  • 1. Collect a batch of experience $\{(s_t, G_t)\}_{t=1}^{T}$ under policy $\pi$.
  • 2. Calculate the weights using: $\mathbf{w} = \Big(\sum_t \mathbf{x}(s_t)\,\mathbf{x}(s_t)^\top + \lambda I\Big)^{-1} \sum_t \mathbf{x}(s_t)\,G_t$

What effect does the $\lambda I$ term have? What cost function is being minimized now?

SLIDE 46

LSMC policy iteration

  • 1. Take an action according to the current policy, $\pi$.
  • 2. Add the experience to the buffer.
  • 3. Calculate new least-squares weights using: $\mathbf{w} = \Big(\sum_t \mathbf{x}(s_t)\,\mathbf{x}(s_t)^\top + \lambda I\Big)^{-1} \sum_t \mathbf{x}(s_t)\,G_t$
  • 4. Go to step 1.
SLIDE 47

Is there a TD version of this?

  • 1. Take an action according to the current policy, $\pi$.
  • 2. Add the experience to the buffer.
  • 3. Calculate new least-squares weights using: $\mathbf{w} = \Big(\sum_t \mathbf{x}(s_t)\,\mathbf{x}(s_t)^\top + \lambda I\Big)^{-1} \sum_t \mathbf{x}(s_t)\,G_t$
  • 4. Go to step 1.

Here $G_t$ is the MC target.

SLIDE 48

LSTD policy evaluation

In TD learning, the target is: $r_t + \gamma\,\mathbf{w}^\top \mathbf{x}(s_{t+1})$

Substituting into the gradient of $J(\mathbf{w})$ and setting it to zero:

$\sum_t \mathbf{x}(s_t)\big(r_t + \gamma\,\mathbf{w}^\top \mathbf{x}(s_{t+1}) - \mathbf{w}^\top \mathbf{x}(s_t)\big) = 0$

Solving for $\mathbf{w}$:

$\mathbf{w} = \Big(\sum_t \mathbf{x}(s_t)\big(\mathbf{x}(s_t) - \gamma\,\mathbf{x}(s_{t+1})\big)^\top\Big)^{-1} \sum_t \mathbf{x}(s_t)\,r_t$

SLIDE 49

LSTD policy evaluation

In TD learning, the target is: $r_t + \gamma\,\mathbf{w}^\top \mathbf{x}(s_{t+1})$

Substituting into the gradient of $J(\mathbf{w})$ and setting it to zero:

$\sum_t \mathbf{x}(s_t)\big(r_t + \gamma\,\mathbf{w}^\top \mathbf{x}(s_{t+1}) - \mathbf{w}^\top \mathbf{x}(s_t)\big) = 0$

Solving for $\mathbf{w}$ (and adding a regularization term):

$\mathbf{w} = \Big(\sum_t \mathbf{x}(s_t)\big(\mathbf{x}(s_t) - \gamma\,\mathbf{x}(s_{t+1})\big)^\top + \lambda I\Big)^{-1} \sum_t \mathbf{x}(s_t)\,r_t$

SLIDE 50

LSTD policy evaluation

In TD learning, the target is: $r_t + \gamma\,\mathbf{w}^\top \mathbf{x}(s_{t+1})$

Substituting into the gradient of $J(\mathbf{w})$ and setting it to zero:

$\sum_t \mathbf{x}(s_t)\big(r_t + \gamma\,\mathbf{w}^\top \mathbf{x}(s_{t+1}) - \mathbf{w}^\top \mathbf{x}(s_t)\big) = 0$

Solving for $\mathbf{w}$ (and adding a regularization term):

$\mathbf{w} = \Big(\sum_t \mathbf{x}(s_t)\big(\mathbf{x}(s_t) - \gamma\,\mathbf{x}(s_{t+1})\big)^\top + \lambda I\Big)^{-1} \sum_t \mathbf{x}(s_t)\,r_t$

Notice this is slightly different from the solution used for LSMC.

SLIDE 51

LSTD policy evaluation

  • 1. Collect a batch of experience $\{(s_t, r_t, s_{t+1})\}_{t=1}^{T}$ under policy $\pi$.
  • 2. Calculate the weights using: $\mathbf{w} = \Big(\sum_t \mathbf{x}(s_t)\big(\mathbf{x}(s_t) - \gamma\,\mathbf{x}(s_{t+1})\big)^\top + \lambda I\Big)^{-1} \sum_t \mathbf{x}(s_t)\,r_t$ (a NumPy sketch follows below)
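A minimal NumPy sketch of the LSTD solve, under the same data-layout assumptions as the LSMC sketch earlier (one row of features per time step, a zero feature row for terminal next states).

```python
import numpy as np

def lstd_weights(features, next_features, rewards, gamma=0.99, lam=1e-3):
    """Solve (sum_t x_t (x_t - gamma x_{t+1})^T + lam*I) w = sum_t x_t r_t.
    features, next_features: (T, n) matrices with rows x(s_t) and x(s_{t+1});
    rewards: (T,) vector of r_t."""
    X = np.asarray(features, dtype=float)
    Xn = np.asarray(next_features, dtype=float)
    r = np.asarray(rewards, dtype=float)
    A = X.T @ (X - gamma * Xn) + lam * np.eye(X.shape[1])
    b = X.T @ r
    return np.linalg.solve(A, b)
```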
SLIDE 52

LSTDQ

Approximate the Q function as: $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a)$

Now the update is:

$\mathbf{w} = \Big(\sum_t \mathbf{x}(s_t, a_t)\big(\mathbf{x}(s_t, a_t) - \gamma\,\mathbf{x}(s_{t+1}, \pi(s_{t+1}))\big)^\top + \lambda I\Big)^{-1} \sum_t \mathbf{x}(s_t, a_t)\,r_t$

SLIDE 53

LSPI-TD

Policy improvement: $\pi'(s) = \arg\max_a \hat{q}(s, a, \mathbf{w})$. LSPI is guaranteed to converge to a near-optimal policy (with linear function approximation).
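To make the loop concrete, here is a hedged sketch of LSPI with LSTDQ: solve for the Q-weights of the current greedy policy from a fixed batch, then improve greedily. The state-action feature function phi, the action set, and the batch format are assumptions for the example, not the slides' exact setup.

```python
import numpy as np

def lstdq_weights(batch, policy, phi, n, gamma=0.99, lam=1e-3):
    """LSTDQ: least-squares weights for q_hat(s, a, w) = w^T phi(s, a),
    where the next action is chosen by the policy being evaluated."""
    A = lam * np.eye(n)
    b = np.zeros(n)
    for s, a, r, s_next, done in batch:
        x = phi(s, a)
        x_next = np.zeros(n) if done else phi(s_next, policy(s_next))
        A += np.outer(x, x - gamma * x_next)
        b += x * r
    return np.linalg.solve(A, b)

def lspi(batch, actions, phi, n, num_iters=10, gamma=0.99):
    """Alternate LSTDQ policy evaluation with greedy policy improvement."""
    w = np.zeros(n)
    greedy = lambda s: max(actions, key=lambda a: w @ phi(s, a))
    for _ in range(num_iters):
        # Evaluate the greedy policy induced by the previous weights;
        # `greedy` closes over w, so reassigning w improves the policy.
        w = lstdq_weights(batch, greedy, phi, n, gamma)
    return w, greedy
```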

SLIDE 54

Chain Walk Example

SLIDE 55

LSPI in Chain Walk: Action-Value Function

Notice that the policy is optimal after iteration 4.