A Bayesian Approach to Finding Compact Representations for Reinforcement Learning

SLIDE 1
A Bayesian Approach to Finding Compact Representations for Reinforcement Learning

Special thanks to Joelle Pineau for presenting our paper - July, 2012

SLIDE 2

Authors

Alborz Geramifard, Stefanie Tellex, David Wingate, Nicholas Roy, Jonathan How

SLIDE 3

Vision

Solving large sequential decision-making problems formulated as MDPs.

SLIDE 4

Reinforcement Learning

Policy: π(s) : S → A

At each step the agent takes action at and observes state st and reward rt.

Qπ(s, a) = Eπ[ Σ_{t=1}^∞ γ^(t−1) rt | s0 = s, a0 = a ]
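To make this expectation concrete, here is a minimal Monte Carlo sketch of estimating Qπ(s0, a0) from rollouts. The environment interface (reset_to / step returning (s, r, done)) and the policy callable are illustrative assumptions, not from the paper:

```python
import numpy as np

def mc_q_estimate(env, policy, s0, a0, gamma=0.9, n_rollouts=50, horizon=200):
    """Monte Carlo estimate of Q^pi(s0, a0) = E_pi[sum_t gamma^(t-1) r_t]:
    start in s0, take a0, then follow the policy and average the
    discounted returns. The env/policy interface is hypothetical."""
    returns = []
    for _ in range(n_rollouts):
        s = env.reset_to(s0)   # assumed: reset the simulator to state s0
        a = a0
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            s, r, done = env.step(a)
            ret += discount * r
            discount *= gamma
            if done:
                break
            a = policy(s)
        returns.append(ret)
    return float(np.mean(returns))
```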

SLIDE 5

Linear Function Approximation

A weight vector θ = (θ1, θ2, …, θn) is paired with features φ1, φ2, …, φn of the state s:

Qπ(s, a) ≈ φ(s, a)⊤θ
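A minimal sketch of the linear architecture, assuming a per-action stacking of state features (one common construction; the paper's exact φ(s, a) may differ):

```python
import numpy as np

def phi_sa(phi_s, a, n_actions):
    """Build phi(s, a) by copying the state features phi(s) into the
    block corresponding to action a; all other blocks stay zero."""
    n = len(phi_s)
    out = np.zeros(n * n_actions)
    out[a * n:(a + 1) * n] = phi_s
    return out

def q_approx(phi_s, a, theta, n_actions):
    """Q^pi(s, a) ~ phi(s, a)^T theta."""
    return phi_sa(phi_s, a, n_actions) @ theta
```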

SLIDE 6

Challenge

Good Representation (φ) → Good Value Function (Q) → Good Policy (π)

Our focus: the representation φ.

SLIDE 7–8

Approach

Pipeline: D (observed data samples) → φ (representation) → Q (value function) → π (policy) → G ∈ {0, 1} (is the policy good?).

Ideally:

φ∗ = argmax_φ P(φ | G, D)

(Using G instead of G = 1 for brevity.)
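Operationally, the argmax can be read as scoring candidate representations by their (log-)posterior and keeping the best; a tiny sketch, where log_post is a hypothetical callable:

```python
def map_representation(candidates, log_post):
    """phi* = argmax_phi P(phi | G, D): keep the highest-scoring
    representation among the candidates (in practice, the MCMC
    samples drawn by the inference procedure introduced later)."""
    return max(candidates, key=log_post)
```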

SLIDE 9–10

Approach

φ∗ = argmax_φ P(φ | G, D)

Problem: the space of representations φ is big!

Features are logical combinations (∧, ∨) of primitive features: e.g., extending primitive features 1–7 with f8 = f4 ∧ f6.

Insight:

P(φ | G, D) ∝ P(G | φ, D) P(φ)
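A minimal sketch of this feature language, assuming binary primitive features and representing each feature as a Python callable (the encoding is an illustration, not the paper's implementation):

```python
# Primitive features f1..f7: indicators on coordinates of the state s.
primitives = [lambda s, i=i: bool(s[i]) for i in range(7)]

def conjoin(f, g):
    """Extended feature via logical AND, e.g. f8 = f4 AND f6."""
    return lambda s: f(s) and g(s)

def disjoin(f, g):
    """Extended feature via logical OR."""
    return lambda s: f(s) or g(s)

f8 = conjoin(primitives[3], primitives[5])  # f8 = f4 AND f6 (0-indexed list)

def phi(features, s):
    """Binary feature vector for state s under the current representation,
    e.g. phi(primitives + [f8], s)."""
    return [1.0 if f(s) else 0.0 for f in features]
```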

SLIDE 11–14

Approach

P(φ | G, D) ∝ P(G | φ, D) P(φ)
(posterior ∝ likelihood × prior)

Likelihood:

  • π ← the best policy found given φ and D (we used LSPI) [Lagoudakis et al. 2003]
  • P(G | φ, D) ∝ exp(η Vπ(s0))
  • A well-performing policy is more likely to be a good policy!
  • Simulate trajectories to estimate Vπ(s0).

Prior:

  • Representations with fewer features are more likely.
  • Representations with simpler features are more likely. [Goodman et al. 2008]
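A hedged sketch of both terms. The solve_policy and rollout_return callables stand in for LSPI and trajectory simulation, the frozenset encoding of features and the penalty weights are assumptions for illustration:

```python
import numpy as np

def log_likelihood(phi, data, solve_policy, rollout_return,
                   eta=1.0, n_rollouts=30):
    """log P(G | phi, D) = eta * V^pi(s0) + const.

    solve_policy(phi, data) -> pi   (e.g. LSPI, as in the paper)
    rollout_return(pi)      -> discounted return of one simulated
                               trajectory from s0 under pi"""
    pi = solve_policy(phi, data)
    v_s0 = np.mean([rollout_return(pi) for _ in range(n_rollouts)])
    return eta * v_s0

def log_prior(phi, lam_count=0.5, lam_size=0.1):
    """Prior favoring concise representations: penalize the number of
    features and each feature's formula size. Here phi is a list of
    frozensets of primitive indices (conjunctions); the exact penalty
    form is an assumption in the spirit of a description-length prior
    [Goodman et al. 2008]."""
    return -lam_count * len(phi) - lam_size * sum(len(f) for f in phi)
```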

SLIDE 15–17

Approach

P(φ | G, D) ∝ P(G | φ, D) P(φ)

Inference: use Metropolis-Hastings (MH) to sample from the posterior.

Markov chain Monte Carlo: propose φ → φ′, then accept probabilistically based on the posterior.

Propose function: Add, Mutate, or Remove an extended feature.

Figure 2: Representation of primitive and extended features and the Add / Mutate / Remove proposals.

MH + LSPI = MHPI
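A minimal sketch of the MH loop and the proposal, under the same frozenset encoding of conjunctive features as above; the move mix and the assumption of a symmetric proposal are simplifications for illustration:

```python
import math
import random

def propose(phi, n_primitives=8):
    """Add / Mutate / Remove proposal over extended features."""
    phi = list(phi)
    move = random.choice(["add", "mutate", "remove"]) if phi else "add"
    if move == "add":
        # Add: conjoin two distinct random primitives into a new feature.
        phi.append(frozenset(random.sample(range(n_primitives), 2)))
    elif move == "mutate":
        # Mutate: swap one primitive inside a randomly chosen feature.
        i = random.randrange(len(phi))
        f = set(phi[i])
        f.remove(random.choice(sorted(f)))
        f.add(random.randrange(n_primitives))
        phi[i] = frozenset(f)
    else:
        # Remove: drop a randomly chosen extended feature.
        phi.pop(random.randrange(len(phi)))
    return phi

def mhpi(phi0, log_post, propose_fn, n_iters=100):
    """Metropolis-Hastings over representations (MH + LSPI = MHPI).
    log_post(phi) should combine the likelihood and prior sketched
    above; with a symmetric proposal the acceptance ratio reduces to
    the posterior ratio."""
    phi, lp = phi0, log_post(phi0)
    samples = [phi]
    for _ in range(n_iters):
        phi_new = propose_fn(phi)
        lp_new = log_post(phi_new)
        if random.random() < math.exp(min(0.0, lp_new - lp)):
            phi, lp = phi_new, lp_new   # accept
        samples.append(phi)
    return samples
```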

SLIDE 18

Maze

200 initial samples; initial features: row and column indicators; noiseless actions: →, ←, ↓, ↑.

Figure 3: Maze domain empirical results. (a) Domain; (b) posterior distribution over the # of extended features vs. # of samples; (c) sampled performance (# of steps to the goal per MHPI iteration); (d) resulting policy.
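A sketch of the maze's primitive features; the 11×11 grid size is read off the axes of Figure 3 and is an assumption:

```python
def maze_primitives(row, col, n_rows=11, n_cols=11):
    """Row and column indicator primitives for the maze: one indicator
    per row plus one per column, so n_rows + n_cols binary features."""
    f = [0.0] * (n_rows + n_cols)
    f[row] = 1.0            # row indicator
    f[n_rows + col] = 1.0   # column indicator
    return f
```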

SLIDE 19–21

BlocksWorld

1000 initial samples; initial features: on(A,B); 20% chance of dropping the block; start and goal tower configurations shown in the figure.

Figure 4: BlocksWorld. (a) Domain; (b) posterior distribution over the # of extended features vs. # of samples; (c) sampled performance (# of steps to make the tower per MHPI iteration); (d) performance distribution (# of steps to make the tower vs. # of extended features).

SLIDE 22–24

Inverted Pendulum

1000 initial samples; initial features: θ and θ̇ discretized into 21 buckets separately; Gaussian noise was added to the torque values τ.

Figure 5: Inverted Pendulum. (a) Domain; (b) posterior distribution over the # of extended features vs. # of samples; (c) performance (# of balancing steps per MHPI iteration); (d) performance distribution (# of balancing steps for 1 vs. >1 extended features).

Many proposed representations were rejected initially.

Key feature: although extended features generally hurt the performance, the extended feature (−π/21 ≤ θ < 0) ∧ (0.4 ≤ θ̇ < 0.6) allowed the agent to complete the task successfully.
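A sketch of the pendulum's primitive features. The 21 buckets per variable come from the slide; the value ranges are assumptions (the θ range matches the π/21 bucket width visible in the key feature), and the paper's exact ranges may differ:

```python
import math

N_BUCKETS = 21

def bucket_index(x, lo, hi, n=N_BUCKETS):
    """Clamp x into [lo, hi) and return its bucket in {0, ..., n-1}."""
    frac = (x - lo) / (hi - lo)
    return min(n - 1, max(0, int(frac * n)))

def pendulum_primitives(theta, theta_dot):
    """Indicator primitives: theta and theta_dot discretized into 21
    buckets each, giving 42 binary features. Ranges are assumptions."""
    f = [0.0] * (2 * N_BUCKETS)
    f[bucket_index(theta, -math.pi / 2, math.pi / 2)] = 1.0
    f[N_BUCKETS + bucket_index(theta_dot, -2.0, 2.0)] = 1.0
    return f
```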

SLIDE 25

Contributions

  • Introduced a Bayesian approach for finding concise yet expressive representations for solving MDPs.
  • Introduced MHPI as a new RL technique that expands the representation using limited samples.
  • Empirically demonstrated the effectiveness of our approach in 3 domains.

Future Work:

  • Reuse the data for estimating Vπ(s0) for policy iteration.
  • Relax the need for a simulator to generate trajectories: importance sampling [Sutton and Barto, 1998]; model-free Monte Carlo [Fonteneau et al., 2010].