Steps to understanding Policy-gradient methods

SLIDE 1

Steps to understanding Policy-gradient methods

  • Policy approximation
  • The average-reward (reward rate) objective
  • Stochastic gradient ascent/descent
  • The policy-gradient theorem and its proof
  • Approximating the gradient
  • Eligibility functions for a few cases
  • A final algorithm

$$\pi(a|s, \theta) \qquad\qquad \Delta\theta_t \approx \alpha\,\frac{\partial \bar r(\theta)}{\partial \theta} \qquad\qquad \bar r(\theta)$$

SLIDE 2

Policy Approximation

  • Policy = a function from state to action
  • How does the agent select actions?
  • In such a way that it can be affected by learning?
  • In such a way as to assure exploration?
  • Approximation: there are too many states and/or actions to represent all policies

  • To handle large/continuous action spaces
SLIDE 3

What is learned and stored?

  • 1. Action-value methods: learn the value of each action; pick the max (usually)
  • 2. Policy-gradient methods: learn the parameters u of a stochastic policy, update by ∇u Performance
  • including actor-critic methods, which learn both value and policy parameters
  • 3. Dynamic Policy Programming
  • 4. Drift-diffusion models (Psychology)

SLIDE 4

Actor-critic architecture

(Diagram: actor and critic components interacting with the world)

SLIDE 5

Action-value methods

  • The value of an action in a state, given a policy, is the expected future reward starting from that state, taking that first action, and then following the policy thereafter
  • Policy: pick the max most of the time, but sometimes pick at random (ε-greedy)

$$q_\pi(s, a) = \mathbb{E}\!\left[\sum_{t=1}^{\infty} \gamma^{t-1} R_t \;\middle|\; S_0 = s,\, A_0 = a\right]$$

$$A_t = \arg\max_a \hat Q_t(S_t, a) \quad \text{(most of the time; otherwise a random action)}$$
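A minimal Python sketch of this kind of ε-greedy selection over estimated action values (the array of estimates and the value of ε are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_estimates, epsilon=0.1):
    """Pick argmax_a Q_t(S_t, a) most of the time; a uniformly random action with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))   # explore: random action
    return int(np.argmax(q_estimates))               # exploit: the max

q_estimates = np.array([0.2, 1.5, -0.3, 0.9])        # hypothetical Q_t(S_t, .) for the current state
action = epsilon_greedy(q_estimates)
```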

SLIDE 6

Why approximate policies rather than values?

  • In many problems, the policy is simpler to approximate than the value function
  • In many problems, the optimal policy is stochastic

  • e.g., bluffing, POMDPs
  • To enable smoother change in policies
  • To avoid a search on every step (the max)
  • To better relate to biology
SLIDE 7

Gradient-bandit algorithm

  • Store action preferences H_t(a) rather than action-value estimates Q_t(a)
  • Instead of ε-greedy, pick actions by an exponential soft-max:
  • Also store the sample average of rewards as R̄_t
  • Then update:

$$\Pr\{A_t = a\} \doteq \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}} \doteq \pi_t(a)$$

$$H_{t+1}(a) = H_t(a) + \alpha\,\big(R_t - \bar R_t\big)\big(\mathbb{1}_{a=A_t} - \pi_t(a)\big)$$

where $\mathbb{1}_{a=A_t}$ is 1 or 0, depending on whether the predicate (subscript) is true.
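A minimal Python sketch of this bandit update; the bandit environment, horizon, and step size below are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

k = 10                                   # number of arms
q_star = rng.normal(4.0, 1.0, size=k)    # true action values (near +4, as in Figure 2.6)
H = np.zeros(k)                          # action preferences H_t(a)
avg_R = 0.0                              # sample average of rewards, R-bar
alpha = 0.1

for t in range(1, 1001):
    pi = np.exp(H - H.max())             # exponential soft-max (shifted for stability)
    pi /= pi.sum()
    a = rng.choice(k, p=pi)              # A_t ~ pi_t
    R = rng.normal(q_star[a], 1.0)       # observe reward
    avg_R += (R - avg_R) / t             # incremental sample average R-bar_t
    one_hot = np.zeros(k)
    one_hot[a] = 1.0
    # H_{t+1}(a) = H_t(a) + alpha * (R_t - R-bar_t) * (1[a=A_t] - pi_t(a)), for all a at once
    H += alpha * (R - avg_R) * (one_hot - pi)
```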

SLIDE 8

Gradient-bandit algorithms

  • On the 10-armed testbed

(Plot: % Optimal action vs. steps, for α = 0.1 and α = 0.4, with and without a reward baseline)

Figure 2.6: Average performance of the gradient-bandit algorithm with and without a reward baseline on the 10-armed testbed when the q∗(a) are chosen to be near +4 rather than near zero.

SLIDE 9

∂ ∂x f(x) g(x)

  • =

∂f(x) ∂x g(x) − f(x) ∂g(x) ∂x

g(x)2 .

∂πt(b) ∂Ht(a) = ∂ ∂Ht(a) πt(b) = ∂ ∂Ht(a) " eHt(b) Pk

c=1 eHt(c)

# =

∂eHt(b) ∂Ht(a)

Pk

c=1 eHt(c) − eHt(b) ∂ Pk

c=1 eHt(c)

∂Ht(a)

⇣Pk

c=1 eHt(c)

⌘2 (by the quotient rule) = 1a=beHt(a) Pk

c=1 eHt(c) − eHt(b)eHt(a)

⇣Pk

c=1 eHt(c)

⌘2 (because ∂ex

∂x = ex)

= 1a=beHt(b) Pk

c=1 eHt(c) −

eHt(b)eHt(a) ⇣Pk

c=1 eHt(c)

⌘2 = 1a=bπt(b) − πt(b)πt(a) = πt(b)

  • 1a=b − πt(a)
  • .

Q.E.D.
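As a quick numerical sanity check of the identity just derived (not part of the original slides), the closed form can be compared against a finite-difference approximation:

```python
import numpy as np

def softmax(h):
    """pi_t(a) = e^{H_t(a)} / sum_b e^{H_t(b)}."""
    z = np.exp(h - h.max())          # shift by max for numerical stability
    return z / z.sum()

h = np.array([0.5, -1.0, 2.0, 0.0])  # arbitrary preferences H_t
pi = softmax(h)
a, b = 1, 2                          # check d pi_t(b) / d H_t(a)

analytic = pi[b] * ((1.0 if a == b else 0.0) - pi[a])

eps = 1e-6                           # finite-difference step
h_bumped = h.copy()
h_bumped[a] += eps
numeric = (softmax(h_bumped)[b] - pi[b]) / eps

print(analytic, numeric)             # the two values agree to about 6 decimal places
```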

SLIDE 10

Steps to understanding Policy-gradient methods

  • Policy approximation
  • The average-reward (reward rate) objective
  • Stochastic gradient ascent/descent
  • The policy-gradient theorem and its proof
  • Approximating the gradient
  • Eligibility functions for a few cases
  • A complete algorithm

$$\pi(a|s, \theta) \qquad\qquad \Delta\theta_t \approx \alpha\,\frac{\partial \bar r(\theta)}{\partial \theta} \qquad\qquad \bar r(\theta)$$

SLIDE 11

e.g., linear-exponential policies (discrete actions)

  • The "preference" for action a in state s is linear in θ and a state-action feature vector φ(s, a)
  • The probability of action a in state s is exponential in its preference
  • Corresponding eligibility function:

$$\pi(a|s, \theta) \doteq \frac{\exp\!\big(\theta^\top \phi(s, a)\big)}{\sum_b \exp\!\big(\theta^\top \phi(s, b)\big)}$$

$$\frac{\nabla_\theta\, \pi(a|s, \theta)}{\pi(a|s, \theta)} = \phi(s, a) - \sum_b \pi(b|s, \theta)\,\phi(s, b)$$
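A minimal Python sketch of this policy and its eligibility function, assuming phi is a (num_actions × d) array whose rows are the feature vectors φ(s, a) for the current state (names and shapes are illustrative):

```python
import numpy as np

def softmax_policy(theta, phi):
    """pi(a|s,theta) proportional to exp(theta . phi(s,a)); phi[a] holds phi(s, a)."""
    prefs = phi @ theta            # preferences theta . phi(s, a)
    prefs = prefs - prefs.max()    # shift for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def eligibility(theta, phi, a):
    """grad_theta pi(a|s,theta) / pi(a|s,theta) = phi(s,a) - sum_b pi(b|s,theta) phi(s,b)."""
    pi = softmax_policy(theta, phi)
    return phi[a] - pi @ phi
```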

SLIDE 12

Policy-gradient setup

$$\pi(a|s, \theta) \doteq \Pr\{A_t = a \mid S_t = s\} \qquad \text{(parameterized policies)}$$

$$r(\pi) \doteq \lim_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n} \mathbb{E}_\pi[R_t] = \sum_s d_\pi(s) \sum_a \pi(a|s) \sum_{s', r} p(s', r|s, a)\, r \qquad \text{(average-reward objective)}$$

$$d_\pi(s) \doteq \lim_{t\to\infty} \Pr\{S_t = s\} \qquad \text{(steady-state distribution)}$$

$$\tilde v_\pi(s) \doteq \sum_{k=1}^{\infty} \mathbb{E}_\pi\big[R_{t+k} - r(\pi) \mid S_t = s\big] \qquad \text{(differential state-value fn)}$$

$$\tilde q_\pi(s, a) \doteq \sum_{k=1}^{\infty} \mathbb{E}_\pi\big[R_{t+k} - r(\pi) \mid S_t = s,\, A_t = a\big] \qquad \text{(differential action-value fn)}$$

$$\Delta\theta_t \approx \alpha\,\frac{\partial r(\pi)}{\partial \theta} \doteq \alpha\,\nabla r(\pi) \qquad \text{(stochastic gradient ascent)}$$

$$\nabla r(\pi) = \sum_s d_\pi(s) \sum_a \tilde q_\pi(s, a)\,\nabla \pi(a|s, \theta) \qquad \text{(the policy-gradient theorem)}$$
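To make these definitions concrete, here is a minimal sketch that computes the steady-state distribution d_π and the average reward r(π) of a fixed policy on a small two-state MDP; the MDP and policy are hypothetical examples, not from the slides:

```python
import numpy as np

# Illustrative 2-state, 2-action MDP: P[s, a, s'] = p(s'|s, a), R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5], [0.5, 0.5]])    # a fixed stochastic policy pi(a|s)

# State-to-state transition matrix under pi, then its stationary distribution d_pi.
P_pi = np.einsum('sa,sat->st', pi, P)
evals, evecs = np.linalg.eig(P_pi.T)
d = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
d_pi = d / d.sum()                          # d_pi(s) = lim_t Pr{S_t = s}

# Average reward r(pi) = sum_s d_pi(s) sum_a pi(a|s) r(s, a).
r_pi = np.einsum('s,sa,sa->', d_pi, pi, R)
print(d_pi, r_pi)
```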

SLIDE 13

$$\tilde q_\pi(s, a) \doteq \sum_{k=1}^{\infty} \mathbb{E}_\pi\big[R_{t+k} - r(\pi) \mid S_t = s,\, A_t = a\big] \qquad \text{(differential action-value fn)}$$

$$\Delta\theta_t \approx \alpha\,\frac{\partial r(\pi)}{\partial \theta} \doteq \alpha\,\nabla r(\pi) \qquad \text{(stochastic gradient ascent)}$$

$$\begin{aligned}
\nabla r(\pi) &= \sum_s d_\pi(s) \sum_a \tilde q_\pi(s, a)\,\nabla\pi(a|s, \theta) && \text{(the policy-gradient theorem)} \\
&= \mathbb{E}\!\left[\big(\tilde q_\pi(S_t, A_t) - v(S_t)\big)\,\frac{\nabla\pi(A_t|S_t, \theta)}{\pi(A_t|S_t)}\right] && S_t \sim d_\pi,\ A_t \sim \pi(\cdot|S_t, \theta) \\
&= \mathbb{E}\!\left[\big(\tilde G^\lambda_t - \hat v(S_t, w)\big)\,\frac{\nabla\pi(A_t|S_t, \theta)}{\pi(A_t|S_t)}\right] && S_t \sim d_\pi,\ A_{t:\infty} \sim \pi \\
&\approx \big(\tilde G^\lambda_t - \hat v(S_t, w)\big)\,\frac{\nabla\pi(A_t|S_t, \theta)}{\pi(A_t|S_t)} && \text{(by sampling under } \pi\text{)}
\end{aligned}$$

$$\theta_{t+1} \doteq \theta_t + \alpha\,\big(\tilde G^\lambda_t - \hat v(S_t, w)\big)\,\frac{\nabla\pi(A_t|S_t, \theta)}{\pi(A_t|S_t)}$$

e.g., in the one-step linear case:

$$\theta_{t+1} = \theta_t + \alpha\,\big(R_{t+1} - \bar R_t + w_t^\top\phi_{t+1} - w_t^\top\phi_t\big)\,\frac{\nabla\pi(A_t|S_t, \theta)}{\pi(A_t|S_t)} \doteq \theta_t + \alpha\,\delta_t\, e(A_t, S_t)$$
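A minimal sketch of the one-step linear update above, with every quantity (features, weights, reward, eligibility vector) filled in with hypothetical values just to make the shapes and the update explicit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and per-step quantities (illustrative only).
w        = rng.normal(size=4)        # critic weights
theta    = rng.normal(size=6)        # actor (policy) weights
phi_t    = rng.normal(size=4)        # critic features phi_t for S_t
phi_next = rng.normal(size=4)        # critic features phi_{t+1} for S_{t+1}
elig     = rng.normal(size=6)        # grad pi(A_t|S_t,theta) / pi(A_t|S_t), e.g. the soft-max eligibility
R, avg_R, alpha = 1.0, 0.2, 0.1      # reward R_{t+1}, average-reward estimate, step size

# delta_t = R_{t+1} - Rbar_t + w.phi_{t+1} - w.phi_t, then theta_{t+1} = theta_t + alpha * delta_t * e(A_t, S_t)
delta = R - avg_R + w @ phi_next - w @ phi_t
theta = theta + alpha * delta * elig
```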

SLIDE 14

Deriving the policy-gradient theorem, $\nabla r(\pi) = \sum_s d_\pi(s) \sum_a \tilde q_\pi(s, a)\,\nabla\pi(a|s, \theta)$:

$$\begin{aligned}
\nabla \tilde v_\pi(s) &= \nabla \sum_a \pi(a|s, \theta)\,\tilde q_\pi(s, a) \\
&= \sum_a \Big[ \nabla\pi(a|s, \theta)\,\tilde q_\pi(s, a) + \pi(a|s, \theta)\,\nabla\tilde q_\pi(s, a) \Big] \\
&= \sum_a \Big[ \nabla\pi(a|s, \theta)\,\tilde q_\pi(s, a) + \pi(a|s, \theta)\,\nabla \sum_{s', r} p(s', r|s, a)\big[r - r(\pi) + \tilde v_\pi(s')\big] \Big] \\
&= \sum_a \Big[ \nabla\pi(a|s, \theta)\,\tilde q_\pi(s, a) + \pi(a|s, \theta)\Big[ -\nabla r(\pi) + \sum_{s'} p(s'|s, a)\,\nabla\tilde v_\pi(s') \Big] \Big]
\end{aligned}$$

$$\therefore\quad \nabla r(\pi) = \sum_a \Big[ \nabla\pi(a|s, \theta)\,\tilde q_\pi(s, a) + \pi(a|s, \theta) \sum_{s'} p(s'|s, a)\,\nabla\tilde v_\pi(s') \Big] - \nabla\tilde v_\pi(s)$$

$$\therefore\quad \sum_s d_\pi(s)\,\nabla r(\pi) = \sum_s d_\pi(s) \sum_a \nabla\pi(a|s, \theta)\,\tilde q_\pi(s, a) + \sum_s d_\pi(s) \sum_a \pi(a|s, \theta) \sum_{s'} p(s'|s, a)\,\nabla\tilde v_\pi(s') - \sum_s d_\pi(s)\,\nabla\tilde v_\pi(s)$$

SLIDE 15

Continuing:

$$\begin{aligned}
\sum_s d_\pi(s)\,\nabla r(\pi)
&= \sum_s d_\pi(s) \sum_a \nabla\pi(a|s, \theta)\,\tilde q_\pi(s, a)
 + \sum_s d_\pi(s) \sum_a \pi(a|s, \theta) \sum_{s'} p(s'|s, a)\,\nabla\tilde v_\pi(s')
 - \sum_s d_\pi(s)\,\nabla\tilde v_\pi(s) \\
&= \sum_s d_\pi(s) \sum_a \nabla\pi(a|s, \theta)\,\tilde q_\pi(s, a)
 + \sum_{s'} \underbrace{\sum_s d_\pi(s) \sum_a \pi(a|s, \theta)\, p(s'|s, a)}_{d_\pi(s')}\,\nabla\tilde v_\pi(s')
 - \sum_s d_\pi(s)\,\nabla\tilde v_\pi(s)
\end{aligned}$$

The last two terms cancel, and since $\sum_s d_\pi(s) = 1$:

$$\nabla r(\pi) = \sum_s d_\pi(s) \sum_a \nabla\pi(a|s, \theta)\,\tilde q_\pi(s, a). \qquad \text{Q.E.D.}$$

SLIDE 16

Complete PG algorithm

Initialize parameters of the policy, θ ∈ ℝⁿ, and of the state-value function, w ∈ ℝᵐ
Initialize eligibility traces e_θ ∈ ℝⁿ and e_w ∈ ℝᵐ to 0
Initialize R̄ = 0

On each step, in state S:
    Choose A according to π(·|S, θ)
    Take action A, observe S′, R
    δ ← R − R̄ + v̂(S′, w) − v̂(S, w)              (form TD error from critic)
    R̄ ← R̄ + α_θ δ                                (update average-reward estimate)
    e_w ← λ e_w + ∇_w v̂(S, w)                     (update eligibility trace for critic)
    w ← w + α_w δ e_w                             (update critic parameters)
    e_θ ← λ e_θ + ∇_θ π(A|S, θ) / π(A|S, θ)       (update eligibility trace for actor)
    θ ← θ + α_θ δ e_θ                             (update actor parameters)
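A minimal Python sketch of this complete algorithm on a small synthetic MDP, using one-hot state features for the critic and a linear-exponential (soft-max) policy for the actor; the environment, features, and step sizes are assumptions made for the sketch, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy continuing MDP (hypothetical): 3 states, 2 actions.
n_states, n_actions = 3, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                        # expected reward r(s,a)

def x(s):                      # state features for the critic (one-hot)
    v = np.zeros(n_states); v[s] = 1.0; return v

def phi(s, a):                 # state-action features for the actor (one-hot)
    v = np.zeros(n_states * n_actions); v[s * n_actions + a] = 1.0; return v

def pi(s, theta):              # linear-exponential (soft-max) policy pi(.|s,theta)
    prefs = np.array([theta @ phi(s, a) for a in range(n_actions)])
    prefs -= prefs.max()
    e = np.exp(prefs)
    return e / e.sum()

theta   = np.zeros(n_states * n_actions)   # actor parameters
w       = np.zeros(n_states)               # critic parameters
e_theta = np.zeros_like(theta)             # actor eligibility trace
e_w     = np.zeros_like(w)                 # critic eligibility trace
avg_R   = 0.0                              # average-reward estimate R-bar
alpha_theta, alpha_w, lam = 0.01, 0.05, 0.8

s = 0
for step in range(20_000):
    probs = pi(s, theta)
    a = rng.choice(n_actions, p=probs)            # choose A ~ pi(.|S,theta)
    r = R[s, a] + rng.normal(scale=0.1)           # take A, observe R
    s_next = rng.choice(n_states, p=P[s, a])      # and S'

    delta = r - avg_R + w @ x(s_next) - w @ x(s)  # form TD error from critic
    avg_R += alpha_theta * delta                  # update average-reward estimate
    e_w = lam * e_w + x(s)                        # update eligibility trace for critic
    w += alpha_w * delta * e_w                    # update critic parameters
    elig = phi(s, a) - sum(probs[b] * phi(s, b) for b in range(n_actions))
    e_theta = lam * e_theta + elig                # update eligibility trace for actor
    theta += alpha_theta * delta * e_theta        # update actor parameters
    s = s_next
```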

SLIDE 17

The generality of the policy-gradient strategy

  • Can be applied whenever we can compute the effect of parameter changes on the action probabilities
  • E.g., has been applied to spiking neuron models
  • There are many possibilities other than linear-exponential and linear-gaussian
  • e.g., mixture of random, argmax, and fixed-width gaussian; learn the mixing weights; drift/diffusion models


SLIDE 18

e.g., linear-gaussian policies (continuous actions)

(Plot: action probability density over the action, with μ and σ linear in the state)

SLIDE 19

e.g., linear-gaussian policies (continuous actions)

  • The mean and std. dev. for the action taken in state s are linear and linear-exponential in the state-feature vector φ(s)
  • The probability density function for the action taken in state s is gaussian

$$\mu(s) \doteq \theta_\mu^\top \phi(s) \qquad\quad \sigma(s) \doteq \exp\!\big(\theta_\sigma^\top \phi(s)\big) \qquad\quad \theta \doteq \big(\theta_\mu^\top;\ \theta_\sigma^\top\big)^\top$$

$$\pi(a|s, \theta) \doteq \frac{1}{\sigma(s)\sqrt{2\pi}}\,\exp\!\left(-\frac{(a - \mu(s))^2}{2\sigma(s)^2}\right)$$
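A minimal sketch of sampling an action from this linear-gaussian policy; phi_s stands for the state-feature vector φ(s), and the parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gaussian_action(theta_mu, theta_sigma, phi_s):
    """Draw a ~ N(mu(s), sigma(s)^2) with mu(s) = theta_mu . phi(s), sigma(s) = exp(theta_sigma . phi(s))."""
    mu = theta_mu @ phi_s
    sigma = np.exp(theta_sigma @ phi_s)
    return rng.normal(mu, sigma)

phi_s = np.array([1.0, 0.5, -0.2])                      # hypothetical phi(s)
a = sample_gaussian_action(np.array([0.3, 0.1, 0.0]),   # hypothetical theta_mu
                           np.array([-0.5, 0.0, 0.2]),  # hypothetical theta_sigma
                           phi_s)
```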

SLIDE 20

Gaussian eligibility functions

$$\frac{\nabla_{\theta_\mu}\, \pi(a|s, \theta)}{\pi(a|s, \theta)} = \frac{1}{\sigma(s)^2}\big(a - \mu(s)\big)\,\phi_\mu(s)$$

$$\frac{\nabla_{\theta_\sigma}\, \pi(a|s, \theta)}{\pi(a|s, \theta)} = \left(\frac{(a - \mu(s))^2}{\sigma(s)^2} - 1\right)\phi_\sigma(s)$$
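A matching sketch of the two eligibility vectors above, assuming (as a simplification) that the same feature vector φ(s) is used for both the μ and σ parts:

```python
import numpy as np

def gaussian_eligibilities(a, theta_mu, theta_sigma, phi_s):
    """Return grad_{theta_mu} pi / pi and grad_{theta_sigma} pi / pi for the linear-gaussian policy."""
    mu = theta_mu @ phi_s
    sigma = np.exp(theta_sigma @ phi_s)
    elig_mu = (a - mu) / sigma**2 * phi_s                  # (1/sigma(s)^2)(a - mu(s)) phi(s)
    elig_sigma = ((a - mu)**2 / sigma**2 - 1.0) * phi_s    # ((a - mu(s))^2/sigma(s)^2 - 1) phi(s)
    return elig_mu, elig_sigma
```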

SLIDE 21

The generality of the policy-gradient strategy (2)

  • Can be applied whenever we can compute the effect of parameter changes on the action probabilities
  • Can we apply PG when outcomes are viewed as actions?
  • e.g., the action of "turning on the light" or the action of "going to the bank"
  • Is this an alternate strategy for temporal abstraction?
  • We would need to learn, not compute, the gradient of these states w.r.t. the policy parameter


SLIDE 22

Have we eliminated action?

  • If any state can be an action, then what is still special about actions?
  • The parameters/weights are what we can really, directly control
  • We have always, in effect, "sensed" our actions (even in the ε-greedy case)
  • Perhaps actions are just sensory signals that we can usually control easily
  • Perhaps there is no longer any need for a special concept of action in the RL framework