Class 4: On-Policy Prediction With Approximation (Chapter 9, Sutton) - PowerPoint PPT Presentation



SLIDE 1

Class 4: On Policy Prediction With Approximation Chapter 9

295, class 4 1

Sutton slides / Silver slides

SLIDE 2

Forms of approximation functions:

  • A linear approximation,
  • a neural network,
  • a decision tree
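As a concrete instance of the simplest of these forms, a linear approximator represents the value as a dot product of a weight vector with a feature vector. A minimal sketch (the one-hot feature map and all names here are illustrative assumptions, not from the slides):

```python
import numpy as np

# Minimal linear value approximator: v_hat(s, w) = w . x(s).
# The one-hot feature map is the simplest choice; tile coding,
# polynomials, etc. all fit the same linear form.
def one_hot(state, num_states):
    x = np.zeros(num_states)
    x[state] = 1.0
    return x

def v_hat(state, w):
    return float(w @ one_hot(state, len(w)))

w = np.array([0.0, 0.5, 1.0])
print(v_hat(1, w))  # -> 0.5
```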
SLIDE 3

The Prediction Objective

We must specify a state weighting or distribution mu(s) representing how much we care about the error in each state s. The objective is to minimize the Mean Squared Value Error, denoted VE(w). With approximation we can no longer hope to converge to the exact value of each state. mu(s) is the fraction of time spent in s, called the "on-policy distribution"; it is defined differently in the continuing and episodic cases.

  • It is not obvious that this is the right objective for RL (we ultimately want the value function in order to generate a good policy), but it is what we use.
  • For a general function form there is no guarantee of convergence to an optimal w*.
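Reconstructing the objective from the book (the slide's formula did not survive extraction), the Mean Squared Value Error is:

```latex
\overline{\mathrm{VE}}(\mathbf{w}) \doteq \sum_{s \in \mathcal{S}} \mu(s)\,
  \bigl[ v_\pi(s) - \hat{v}(s, \mathbf{w}) \bigr]^2
```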
SLIDE 4

Stochastic-gradient and Semi-gradient Methods

SLIDE 5

General Stochastic Gradient Descent
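The general SGD update referred to here (equation 9.5 in the book) descends the squared error on each sampled state:

```latex
\mathbf{w}_{t+1} \doteq \mathbf{w}_t
  - \tfrac{1}{2}\alpha \nabla \bigl[ v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}_t) \bigr]^2
  = \mathbf{w}_t
  + \alpha \bigl[ v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}_t) \bigr]
    \nabla \hat{v}(S_t, \mathbf{w}_t)
```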

SLIDE 6

SLIDE 7

Gradient Monte Carlo Algorithm for estimating v

We cannot perform the exact update (9.5) because v_pi(S_t) is unknown, but we can approximate it by substituting a target U_t in place of v_pi(S_t). This yields a general SGD method for state-value prediction. If U_t is an unbiased estimate, such as the Monte Carlo return G_t, SGD converges to a local optimum of the approximation.
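A minimal sketch of gradient Monte Carlo using the return G_t as the target, on a toy 5-state random walk (the environment, its size, and all names here are illustrative assumptions, not from the slides; with one-hot features the gradient of v_hat is just the indicator of the visited state):

```python
import numpy as np

# Gradient Monte Carlo prediction on a toy random walk:
# states 0..4, start in the middle, step left/right uniformly,
# reward +1 only on exiting to the right (true values are (i+1)/6).
def gradient_mc(num_episodes=2000, num_states=5, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(num_states)  # one weight per state (one-hot features)
    for _ in range(num_episodes):
        s, trajectory = num_states // 2, []
        while 0 <= s < num_states:
            trajectory.append(s)
            s += rng.choice([-1, 1])
        G = 1.0 if s >= num_states else 0.0  # return of this episode
        for s_t in trajectory:
            # SGD update w += alpha * [G_t - v_hat(S_t,w)] * grad v_hat;
            # with one-hot features this touches only the weight of S_t
            w[s_t] += alpha * (G - w[s_t])
    return w

print(np.round(gradient_mc(), 2))  # approaches (i+1)/6 per state
```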

SLIDE 8

Semi-gradient Methods

Replacing G_t with a bootstrapping target such as the TD(0) target or G_{t:t+n} does not guarantee convergence (except for linear function approximation). Still, semi-gradient (bootstrapping) methods offer important advantages: they typically enable significantly faster learning without waiting for the end of an episode, which allows them to be used on continuing problems and provides computational advantages. A prototypical semi-gradient method is semi-gradient TD(0).
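A minimal sketch of semi-gradient TD(0) on the same toy 5-state random walk as above (environment and names are illustrative assumptions). The defining "semi-gradient" detail is that the bootstrapped target is treated as a constant: only the value estimate, not the target, is differentiated.

```python
import numpy as np

# Semi-gradient TD(0) on a toy random walk: states 0..4, start in the
# middle, reward +1 on exiting right, 0 otherwise (true values (i+1)/6).
def semi_gradient_td0(num_episodes=3000, num_states=5,
                      alpha=0.05, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(num_states)  # one weight per state (one-hot features)
    for _ in range(num_episodes):
        s = num_states // 2
        while True:
            s_next = s + rng.choice([-1, 1])
            if s_next < 0:                # terminated on the left
                r, done = 0.0, True
            elif s_next >= num_states:    # terminated on the right
                r, done = 1.0, True
            else:
                r, done = 0.0, False
            # bootstrapped target; it is NOT differentiated w.r.t. w,
            # which is what makes this a semi-gradient method
            target = r if done else r + gamma * w[s_next]
            w[s] += alpha * (target - w[s])
            if done:
                break
            s = s_next
    return w

print(np.round(semi_gradient_td0(), 2))  # approaches (i+1)/6 per state
```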

SLIDE 9

State Aggregation


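State aggregation is the simplest linear scheme: states are partitioned into groups, and each group shares a single weight. A small sketch mirroring the book's 1000-state random walk with 10 groups (function names are illustrative):

```python
import numpy as np

# State aggregation as linear approximation: the feature vector is a
# one-hot over groups, so every state in a group shares one weight.
def group_feature(state, num_states=1000, num_groups=10):
    x = np.zeros(num_groups)
    x[state * num_groups // num_states] = 1.0
    return x

# states 0..99 map to group 0, 100..199 to group 1, and so on
print(group_feature(250).argmax())  # -> 2
```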
SLIDE 10

SLIDE 11

SLIDE 12

SLIDE 13

Linear Methods

x(s) is a feature vector with the same dimensionality as w. In the linear case there is only one optimum (every local optimum is global), so semi-gradient TD is guaranteed to converge to, or near, that optimum. Full-gradient SGD converges to the global optimum if alpha satisfies the usual conditions of decreasing over time.
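In symbols, the linear form and its gradient (reconstructed from the chapter) are

```latex
\hat{v}(s, \mathbf{w}) \doteq \mathbf{w}^{\top} \mathbf{x}(s)
  = \sum_{i=1}^{d} w_i\, x_i(s),
\qquad
\nabla \hat{v}(s, \mathbf{w}) = \mathbf{x}(s),
```

and the TD fixed point w_TD that linear semi-gradient TD(0) reaches satisfies the bound

```latex
\overline{\mathrm{VE}}(\mathbf{w}_{\mathrm{TD}})
  \le \frac{1}{1-\gamma}\, \min_{\mathbf{w}} \overline{\mathrm{VE}}(\mathbf{w}).
```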

SLIDE 14

TD(0) Convergence

SLIDE 15

Bootstrapping on the 1000-state random walk

SLIDE 16

SLIDE 17

n-Step Semi-gradient TD for v
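The n-step method generalizes TD(0) by bootstrapping n steps ahead. Reconstructing the target and update from the book (equations 9.16 and 9.15):

```latex
G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots
  + \gamma^{n-1} R_{t+n} + \gamma^{n}\, \hat{v}(S_{t+n}, \mathbf{w}_{t+n-1}),
\qquad 0 \le t \le T - n,
```

```latex
\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1}
  + \alpha \bigl[ G_{t:t+n} - \hat{v}(S_t, \mathbf{w}_{t+n-1}) \bigr]
    \nabla \hat{v}(S_t, \mathbf{w}_{t+n-1}).
```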
