Direct Gradient-Based Reinforcement Learning


SLIDE 1

Direct Gradient-Based Reinforcement Learning

Jonathan Baxter
Research School of Information Sciences and Engineering
Australian National University
http://csl.anu.edu.au/∼jon
Joint work with Peter Bartlett and Lex Weaver
December 5, 1999

SLIDE 2

Reinforcement Learning

Models an agent interacting with its environment.

  1. Agent receives information about its state.
  2. Agent chooses an action or control based on that state information.
  3. Agent receives a reward.
  4. State is updated.
  5. Go to 1.
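This loop translates almost line for line into code. Below is a minimal sketch, assuming a hypothetical environment object with observe() and step() methods and a hypothetical agent with act() and update() methods (these names are illustrative, not part of the talk):

    # Minimal sketch of the agent/environment interaction loop (Python).
    def run(agent, env, num_steps):
        total_reward = 0.0
        for _ in range(num_steps):
            obs = env.observe()          # 1. agent receives state information
            action = agent.act(obs)      # 2. agent chooses an action/control
            reward = env.step(action)    # 3. agent receives a reward; 4. state is updated
            agent.update(reward)         # agent adjusts its behaviour
            total_reward += reward       # 5. go to 1.
        return total_reward / num_steps  # long-term average reward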
SLIDE 3

Reinforcement Learning

  • Goal: Adjust agent’s behaviour to maximize long-term average reward.

  • Key Assumption: state transitions are Markov.
SLIDE 4

Chess

  • State: Board position.
  • Control: Move pieces.
  • State Transitions: My move, followed by opponent’s move.

  • Reward: Win, draw, or lose.
SLIDE 5

Call Admission Control

Telecomms carrier selling bandwidth: queueing problem.

  • State: Mix of call types on channel.
  • Control: Accept calls of certain type.
  • State Transitions: Calls finish. New calls arrive.
  • Reward: Revenue from calls accepted.
SLIDE 6

Cleaning Robot

  • State: Robot and environment (position, velocity, dust levels, . . . ).
  • Control: Actions available to robot.
  • State Transitions: Depend on dynamics of robot and statistics of environment.
  • Reward: Pick up rubbish, don’t damage the furniture.
SLIDE 7

Summary

Previous approaches:

  • Dynamic Programming can find optimal policies in small state spaces.
  • Approximate value-function based approaches are currently the method of choice in large state spaces.

  • Numerous practical successes, BUT
  • Policy performance can degrade at each step.
SLIDE 8

Summary

Alternative Approach:

  • Policy parameters θ ∈ RK, Performance: η(θ).
  • Compute ∇η(θ) and step uphill (gradient ascent).
  • Previous algorithms relied on an accurate reward baseline or recurrent states.
SLIDE 9

Summary

Our Contribution:

  • Approximation ∇_β η(θ) to ∇η(θ).
  • Parameter β ∈ [0, 1) related to mixing time of the problem.
  • Algorithm to approximate ∇_β η(θ) via simulation (POMDPG).
  • Line search in the presence of noise.
SLIDE 10

Partially Observable Markov Decision Processes (POMDPs)

States: S = {1, 2, . . . , n}, Xt ∈ S
Observations: Y = {1, 2, . . . , M}, Yt ∈ Y
Actions or Controls: U = {1, 2, . . . , N}, Ut ∈ U
Observation Process ν: Pr(Yt = y | Xt = i) = νy(i)
Stochastic Policy µ: Pr(Ut = u | Yt = y) = µu(θ, y)
Rewards: r : S → R
Adjustable parameters: θ ∈ R^K
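As a concrete picture of these ingredients, here is a small sketch in Python; the field names and the softmax form of the policy are illustrative choices, not part of the slides:

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class POMDP:
        p: np.ndarray    # p[u, i, j] = Pr(X_{t+1} = j | X_t = i, U_t = u), shape (N, n, n)
        nu: np.ndarray   # nu[i, y]  = Pr(Y_t = y | X_t = i),                shape (n, M)
        r: np.ndarray    # r[i]      = reward in state i,                    shape (n,)

    def mu(theta, y):
        """Stochastic policy: distribution over the N controls given observation y.
        Illustrative parameterization: a softmax over row y of theta (shape (M, N))."""
        logits = theta[y]
        e = np.exp(logits - logits.max())
        return e / e.sum()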

SLIDE 11

POMDP

Transition Probabilities: Pr(Xt+1 = j|Xt = i, Ut = u) = pij(u)

SLIDE 12

POMDP

[Diagram: the environment in state Xt emits reward r(Xt) and observation Yt (via ν); the agent’s policy µ maps Yt to control Ut, which feeds back into the environment.]

SLIDE 13

The Induced Markov Chain

  • Transition Probabilities:

    pij(θ) = Pr(Xt+1 = j | Xt = i) = E_{y∼ν(i)} E_{u∼µ(θ,y)} pij(u)

  • Transition Matrix:

    P(θ) = [pij(θ)]
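For a chain small enough to enumerate, the induced transition matrix can be computed directly from this formula. A sketch, using the p[u, i, j] and nu[i, y] arrays from the earlier sketch plus an explicit policy table mu_table[y, u] (assumed, for illustration):

    import numpy as np

    def induced_transition_matrix(p, nu, mu_table):
        """P(theta)[i, j] = sum_y nu[i, y] * sum_u mu_table[y, u] * p[u, i, j]."""
        N, n, _ = p.shape          # N controls, n states
        M = nu.shape[1]            # M observations
        P = np.zeros((n, n))
        for i in range(n):
            for y in range(M):
                for u in range(N):
                    P[i] += nu[i, y] * mu_table[y, u] * p[u, i]
        return P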

SLIDE 14

Stationary Distributions

q = [q1 · · · qn]′ ∈ R^n is a distribution over states.

Xt ∼ q ⇒ Xt+1 ∼ q′P(θ)

Definition: A probability distribution π ∈ R^n is a stationary distribution of the Markov chain if π′P(θ) = π′.

SLIDE 15

Stationary Distributions

Convenient Assumption: For all values of the parameters θ, there is a unique stationary distribution π(θ). Implies the Markov chain mixes: for all X0, the distribution of Xt approaches π(θ).

Inconvenient Assumption: Number of states n is “essentially infinite”. Meaning: forget about storing a number for each state, or inverting n × n matrices.

SLIDE 16

Measuring Performance

  • Average Reward:

    η(θ) = Σ_{i=1}^{n} πi(θ) r(i)

  • Goal: Find θ maximizing η(θ).
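When the chain is small, π(θ) and hence η(θ) can be computed exactly; this is useful later for checking gradient estimates. A minimal sketch using power iteration, which relies only on the assumed uniqueness of the stationary distribution:

    import numpy as np

    def stationary_distribution(P, iters=100_000, tol=1e-12):
        """Power iteration: q' <- q' P(theta) converges to pi(theta) when it is unique."""
        n = P.shape[0]
        q = np.full(n, 1.0 / n)
        for _ in range(iters):
            q_next = q @ P
            if np.abs(q_next - q).sum() < tol:
                return q_next
            q = q_next
        return q

    def average_reward(P, r):
        """eta(theta) = sum_i pi_i(theta) * r(i)."""
        return stationary_distribution(P) @ r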
SLIDE 17

Summary

  • Partially Observable Markov Decision Processes.
  • Previous approaches: value function methods.
  • Direct gradient ascent
  • Approximating the gradient of the average reward.
  • Estimating the approximate gradient: POMDPG.
  • Line search in the presence of noise.
  • Experimental results.
SLIDE 18

Approximate Value Functions

  • Discount Factor β ∈ [0, 1). Discounted value of state i under policy µ:

    J^µ_β(i) = E_µ [ r(X0) + βr(X1) + β²r(X2) + · · · | X0 = i ]

  • Idea: Choose a restricted class of value functions J̃(θ, i), θ ∈ R^K, i ∈ S (e.g. a neural network with parameters θ).
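The discounted value has a straightforward sample-path estimate, which is the quantity a fitted J̃(θ, ·) is trying to approximate. A small sketch (the infinite sum is truncated at the end of the observed rewards):

    def discounted_return(rewards, beta):
        """Estimate of J_beta(i) from one trajectory started in state i:
        r(X0) + beta*r(X1) + beta^2*r(X2) + ...  (truncated)."""
        total, discount = 0.0, 1.0
        for r in rewards:
            total += discount * r
            discount *= beta
        return total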

SLIDE 19

Policy Iteration

Iterate:

  • Given policy µ, find approximation J̃(θ, ·) to J^µ_β.
  • Many algorithms for finding θ: TD(λ), Q-learning, Bellman residuals, . . . . Simulation and non-simulation based.
  • Generate new policy µ′ using J̃(θ, ·):

    µ′_{u∗}(θ, i) = 1  ⇔  u∗ = argmax_{u∈U} Σ_{j∈S} pij(u) J̃(θ, j)
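The policy-improvement step is a one-step lookahead through the model. A sketch, assuming access to the transition probabilities p[u, i, j] and an array J_values[j] holding the fitted values J̃(θ, j) (both assumed for illustration):

    import numpy as np

    def greedy_policy(p, J_values):
        """mu'_{u*}(theta, i) = 1  <=>  u* = argmax_u sum_j p[u, i, j] * J_tilde(theta, j)."""
        N, n, _ = p.shape
        policy = np.zeros(n, dtype=int)
        for i in range(n):
            backed_up = [p[u, i] @ J_values for u in range(N)]
            policy[i] = int(np.argmax(backed_up))
        return policy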

SLIDE 20

Approximate Value Functions

  • The Good:
    ⋆ Backgammon (world-champion), chess (International Master), job-shop scheduling, elevator control, . . .
    ⋆ Notion of “backing-up” state values can be efficient.
  • The Bad:
    ⋆ Unless |J̃(θ, i) − J^µ_β(i)| = 0 for all states i, the new policy µ′ can be a lot worse than the old one.
    ⋆ “Essentially infinite” state spaces means we are likely to have very bad approximation error for some states.

SLIDE 21

Summary

  • Partially Observable Markov Decision Processes.
  • Previous approaches: value function methods.
  • Direct gradient ascent.
  • Approximating the gradient of the average reward.
  • Estimating the approximate gradient: POMDPG.
  • Line search in the presence of noise.
  • Experimental results.
SLIDE 22

Direct Gradient Ascent

  • Desideratum: Adjusting the agent’s parameters θ should improve its performance.
  • Implies. . . Adjust the parameters in the direction of the gradient of the average reward:

    θ := θ + γ∇η(θ)

SLIDE 23

Direct Gradient Ascent: Main Results

  1. Algorithm to estimate the approximate gradient (∇_β η) from a sample path.
  2. Accuracy of approximation depends on a parameter of the algorithm (β); bias/variance trade-off.
  3. Line search algorithm using only gradient estimates.
SLIDE 24

Related Work

Machine Learning: Williams’ REINFORCE algorithm (1992).

  • Gradient ascent algorithm for a restricted class of MDPs.
  • Requires accurate reward baseline, i.i.d. transitions.

Kimura et al., 1998: extension to infinite horizon.

Discrete Event Systems: Algorithms that rely on recurrent states. MDPs: (Cao and Chen, 1997), POMDPs: (Marbach and Tsitsiklis, 1998).

Control Theory: Direct adaptive control using derivatives (Hjalmarsson, Gunnarsson, Gevers, 1994), (Kammer, Bitmead, Bartlett, 1997), (DeBruyne, Anderson, Gevers, Linard, 1997).

SLIDE 25

Summary

  • Partially Observable Markov Decision Processes.
  • Previous approaches: value function methods.
  • Direct gradient ascent.
  • Approximating the gradient of the average reward.
  • Estimating the approximate gradient: POMDPG.
  • Line search in the presence of noise.
  • Experimental results.
SLIDE 26

Approximating the gradient

Recall: For β ∈ [0, 1), the discounted value of state i is

    Jβ(i) = E [ r(X0) + βr(X1) + β²r(X2) + · · · | X0 = i ].

Vector notation: Jβ = (Jβ(1), . . . , Jβ(n)).

Theorem: For all β ∈ [0, 1),

    ∇η(θ) = βπ′(θ)∇P(θ)Jβ + (1 − β)∇π′(θ)Jβ
          = β∇_β η(θ) + (1 − β)∇π′(θ)Jβ,

where the first term is the quantity we estimate and the second term → 0 as β → 1.

SLIDE 27

Mixing Times of Markov Chains

  • ℓ1-distance: If p, q are distributions on the states,

    ‖p − q‖_1 := Σ_{i=1}^{n} |p(i) − q(i)|

  • d(t)-distance: Let pt(i) be the distribution over states at time t, starting from state i.

    d(t) := max_{i,j} ‖pt(i) − pt(j)‖_1

  • Unique stationary distribution ⇒ d(t) → 0.
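For an explicit small chain, d(t) can be evaluated directly from powers of the transition matrix. A sketch:

    import numpy as np

    def d(P, t):
        """d(t) = max_{i,j} || p_t(i) - p_t(j) ||_1, where p_t(i) is row i of P^t."""
        Pt = np.linalg.matrix_power(P, t)
        n = P.shape[0]
        return max(np.abs(Pt[i] - Pt[j]).sum() for i in range(n) for j in range(n))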
SLIDE 28

Approximating the gradient

Mixing time: τ∗ := min { t : d(t) ≤ e⁻¹ }

Theorem: For all β ∈ [0, 1), θ ∈ R^K,

    ‖∇η(θ) − ∇_β η(θ)‖ ≤ constant × τ∗(θ)(1 − β).

That is, if 1/(1 − β) is large compared with the mixing time τ∗(θ), ∇_β η(θ) accurately approximates the gradient direction ∇η(θ).

SLIDE 29

Summary

  • Partially Observable Markov Decision Processes.
  • Previous approaches: value function methods.
  • Direct gradient ascent.
  • Approximating the gradient of the average reward.
  • Estimating the approximate gradient: POMDPG.
  • Line search in the presence of noise.
  • Experimental results.
SLIDE 30

Estimating ∇_β η(θ): POMDPG

Given: parameterized policies µu(θ, y), β ∈ [0, 1):

  1. Set z0 = ∆0 = 0 ∈ R^K.
  2. for each observation yt, control ut, reward r(it+1) do
  3.   Set zt+1 = βzt + ∇µut(θ, yt) / µut(θ, yt)    (eligibility trace)
  4.   Set ∆t+1 = ∆t + (1/(t+1)) [ r(it+1) zt+1 − ∆t ]
  5. end for
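The algorithm above is a few lines of code once the simulation supplies (yt, ut, r(it+1)) triples. A sketch; grad_log_mu(theta, y, u), returning ∇µut(θ, yt)/µut(θ, yt) (that is, ∇ log µut(θ, yt)), is an assumed helper supplied by the policy parameterization:

    import numpy as np

    def pomdpg(trajectory, grad_log_mu, theta, beta, K):
        """Estimate grad_beta eta(theta) from one sample path.

        trajectory: iterable of (y_t, u_t, r_next) with r_next = r(i_{t+1}).
        grad_log_mu(theta, y, u): grad mu_u(theta, y) / mu_u(theta, y), array of shape (K,).
        """
        z = np.zeros(K)       # eligibility trace z_t
        delta = np.zeros(K)   # running average Delta_t
        for t, (y, u, r_next) in enumerate(trajectory):
            z = beta * z + grad_log_mu(theta, y, u)          # step 3
            delta = delta + (r_next * z - delta) / (t + 1)   # step 4
        return delta          # converges to grad_beta eta(theta)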
SLIDE 31

Convergence of POMDPG

Theorem: For all β ∈ [0, 1), θ ∈ R^K, ∆t → ∇_β η(θ).

SLIDE 32

Explanation of POMDPG

Algorithm computes:

    ∆T = (1/T) Σ_{t=0}^{T−1} [ ∇µut(θ, yt) / µut(θ, yt) ] [ r(it+1) + βr(it+2) + · · · + β^{T−t−1} r(iT) ]

The bracketed reward sum is an estimate of the discounted value ‘due to’ action ut.

  • ∇µut(θ, yt) is the direction to increase the probability of the action ut.
  • It is weighted by something involving subsequent rewards, and
  • divided by µut: ensures “popular” actions don’t dominate.
SLIDE 33

POMDPG: Bias/Variance trade-off

∆t → ∇_β η(θ) as t → ∞,  and  ∇_β η(θ) → ∇η(θ) as β → 1.

  • Bias/Variance Tradeoff: β ≈ 1 gives:
    ⋆ Accurate gradient approximation (∇_β η close to ∇η), but
    ⋆ Large variance in estimates ∆t of ∇_β η for small t.

SLIDE 34

POMDPG: Bias/Variance trade-off

∆t → ∇_β η(θ) as t → ∞,  and  ∇_β η(θ) → ∇η(θ) as β → 1.

  • Recall: 1/(1 − β) ≈ τ∗(θ) (mixing time).
    ⋆ Small mixing time ⇒ small β ⇒ accurate gradient estimate from short POMDPG simulation.
    ⋆ Large mixing time ⇒ large β ⇒ accurate gradient estimate only from long POMDPG simulation.
  • Conjecture: Mixing time is an intrinsic constraint on any simulation-based algorithm.

SLIDE 35

Example: 3-state Markov Chain

Transition Probabilities:
    P(u1) = [ 4/5 1/5; 4/5 1/5; 4/5 1/5 ]    P(u2) = [ 1/5 4/5; 1/5 4/5; 1/5 4/5 ]
Observations (φ1(i), φ2(i)):
    State 1: (2/3, 1/3)    State 2: (1/3, 2/3)    State 3: (5/18, 5/18)
Parameterized Policy: θ ∈ R²
    µu1(θ, i) = e^{θ1φ1(i)+θ2φ2(i)} / (1 + e^{θ1φ1(i)+θ2φ2(i)})
    µu2(θ, i) = 1 − µu1(θ, i)
Rewards: (r(1), r(2), r(3)) = (0, 0, 1)
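The policy part of this example, and the ∇µ/µ term POMDPG needs from it, can be written out directly from the features above. A sketch of just the policy and its score function (states indexed from 0 for convenience); feeding it into a simulation of the chain is then straightforward:

    import numpy as np

    PHI = np.array([[2/3,  1/3 ],    # (phi1, phi2) for state 1
                    [1/3,  2/3 ],    # state 2
                    [5/18, 5/18]])   # state 3
    REWARDS = np.array([0.0, 0.0, 1.0])

    def mu_u1(theta, i):
        """mu_{u1}(theta, i) = e^s / (1 + e^s) with s = theta1*phi1(i) + theta2*phi2(i)."""
        s = theta @ PHI[i]
        return 1.0 / (1.0 + np.exp(-s))

    def grad_log_mu(theta, i, u):
        """grad mu_u(theta, i) / mu_u(theta, i) for this two-action policy (u = 0 or 1)."""
        p1 = mu_u1(theta, i)
        return (1.0 - p1) * PHI[i] if u == 0 else -p1 * PHI[i]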

SLIDE 36

Bias/Variance Trade-off

Relative norm difference = ‖∇η − ∆T‖ / ‖∇η‖.

[Plots: relative norm difference versus T, for τ = 1 and τ = 20.]

SLIDE 37

Bias/Variance Trade-off

[Plot: relative norm difference versus T, for τ = 1, τ = 5, and τ = 20.]

SLIDE 38

Summary

  • Partially Observable Markov Decision Processes.
  • Previous approaches: value function methods.
  • Direct gradient ascent.
  • Approximating the gradient of the average reward.
  • Estimating the approximate gradient: POMDPG.
  • Line search in the presence of noise.
  • Experimental results.
SLIDE 39

Line-search in the presence of noise

  • Want to find maximum of η(θ) in direction ∇_β η(θ).
  • Usual method: find 3 points θi = θ + γi ∇_β η(θ), i = 1, 2, 3, with γ1 < γ2 < γ3 such that η(θ2) > η(θ1), η(θ2) > η(θ3), and interpolate.
  • Problem: η(θ) only available by simulation (e.g. ηT(θ)), so noisy:

    lim_{θ1→θ2} var [ sign (ηT(θ2) − ηT(θ1)) ] = 1

SLIDE 40

Line-search in the presence of noise

  • Solution: Use gradients to bracket (POMDPG):

    ∇_β η(θ1) · ∇_β η(θ) > 0,    ∇_β η(θ2) · ∇_β η(θ) < 0

  • Variance independent of θ2 − θ1.
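A sketch of this bracketing line search: expand the step until the directional derivative (estimated by POMDPG) changes sign, then bisect on that sign. grad_estimate(theta), standing in for a POMDPG run at θ, is an assumed helper:

    import numpy as np

    def bracketing_line_search(theta, direction, grad_estimate,
                               step=1.0, growth=2.0, expansions=50, bisections=20):
        """Line search along `direction` using only gradient estimates."""
        def slope(gamma):
            # Directional derivative of eta at theta + gamma * direction.
            return grad_estimate(theta + gamma * direction) @ direction

        lo, hi = 0.0, step
        for _ in range(expansions):          # find a step with negative slope
            if slope(hi) < 0:
                break
            lo, hi = hi, hi * growth
        for _ in range(bisections):          # shrink the bracket [lo, hi]
            mid = 0.5 * (lo + hi)
            if slope(mid) > 0:
                lo = mid
            else:
                hi = mid
        return theta + 0.5 * (lo + hi) * direction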

SLIDE 41

Example: Call Admission Control

Telecommunications carrier selling bandwidth: queueing problem. From (Marbach and Tsitsiklis, 1998).

  • Three call types, with differing arrival rates (Poisson), bandwidth requirements, rewards, holding times (exponential).
  • State = observation = mix of calls.
  • Policy = (squashed) linear controller.
SLIDE 42

Direct Reinforcement Learning: Call Admission Control

[Plot: final reward versus total queue iterations (CONJGRAD); curves labelled “class optimal” and “τ = 1”.]

SLIDE 43

Direct Reinforcement Learning: Puck World

  • Puck moving around mountainous terrain.
  • Aim is to get out of a valley and on to a plateau.
  • Reward = 0 everywhere except plateau (= 100).
  • Observation = relative location, absolute location, velocity.
  • Neural-network controller.
  • Insufficient thrust to climb directly out of valley; must learn to “oscillate”.

SLIDE 44

Direct Reinforcement Learning: Puck World

[Plot: average reward versus iterations.]

SLIDE 45

Direct Reinforcement Learning

  • Philosophy:
    ⋆ Adjusting policy should improve performance.
    ⋆ View average reward as function of policy parameters: η(θ).
    ⋆ For suitably smooth policies, ∇η(θ) exists.
    ⋆ Compute ∇η(θ) and step uphill.

SLIDE 46

Direct Reinforcement Learning

  • Main results:
    ⋆ Approximation ∇_β η(θ) to ∇η(θ).
    ⋆ Algorithm to accurately estimate ∇_β η from a single sample path (POMDPG).
    ⋆ Accuracy of approximation depends on a parameter of the algorithm (β ∈ [0, 1)); bias/variance trade-off.
    ⋆ 1/(1 − β) relates to mixing time of the underlying Markov chain.
    ⋆ Line search using only gradient estimates.
    ⋆ Many successful applications.

SLIDE 47

Advertisement

  • Papers available from http://csl.anu.edu.au.
  • Two research positions available in the Machine Learning Group at the Australian National University.