A Bayesian Approach to Finding Compact Representations for Reinforcement Learning
Alborz Geramifard, Stefanie Tellex, David Wingate, Nicholas Roy, Jonathan How
Special thanks to Joelle Pineau for presenting our paper - July, 2012
Policy: π(s) : S → A
Action-value function: Qπ(s, a) = Eπ[ Σ_{t=1}^∞ γ^(t−1) r_t ]
Linear function approximation (our focus): Qπ(s, a) ≈ φ(s, a)ᵀθ
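As a minimal sketch of this linear form (the feature and weight values below are made up, not from the paper): the Q estimate is the dot product of a binary feature vector φ(s, a) with a weight vector θ.

```python
import numpy as np

# Minimal sketch of the linear form Q(s, a) ≈ φ(s, a)·θ with binary
# features; the feature and weight values below are made up.

def q_value(phi_sa, theta):
    """Linear Q estimate: dot product of a binary feature vector and weights."""
    return float(phi_sa @ theta)

phi = np.array([1, 0, 1, 0, 0], dtype=float)   # φ(s, a): which features fire
theta = np.array([0.5, -1.0, 2.0, 0.0, 0.3])   # learned weights
print(q_value(phi, theta))  # 0.5 + 2.0 = 2.5
```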
Generative view: observed data samples → representation φ → value function → policy → "is the policy good?" indicator G, where φ and G are binary (∈ {0, 1}).
(Using G instead of G = 1 for brevity.)
Representation φ: primitive features, extended by logical combinations (∧, ∨) of primitive features, such as f8 = f4 ∧ f6.
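A small sketch of how such a conjunctive feature could be built (the helper name and the example bit values are my own):

```python
# Sketch: extend a list of 0/1 primitive features with a conjunction,
# e.g. f8 = f4 ∧ f6 (1-based indices as on the slide). Helper name is mine.

def add_conjunction(features, i, j):
    """Append f_new = f_i AND f_j (1-based indices) to a 0/1 feature list."""
    return features + [features[i - 1] & features[j - 1]]

primitives = [1, 0, 1, 1, 0, 1, 0]            # f1..f7 (example values)
extended = add_conjunction(primitives, 4, 6)  # f8 = f4 ∧ f6
print(extended[-1])  # f4 = 1 and f6 = 1, so f8 = 1
```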
Scoring a candidate representation:
π ← LSPI(φ, data) [Lagoudakis et al. 2003]
Simulate trajectories to estimate Vπ(s0)
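The trajectory-based value estimate can be sketched as a plain Monte Carlo rollout; the `step`/`policy` interfaces and the toy chain MDP below are assumptions for illustration, not the paper's simulator.

```python
import random

# Monte Carlo sketch of estimating V^π(s0) by simulating trajectories;
# the step/policy interfaces and the toy chain MDP are assumptions.

def estimate_value(step, policy, s0, gamma=0.95, n_traj=100, horizon=200, seed=0):
    """Average discounted return of n_traj simulated trajectories from s0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_traj):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            s, r, done = step(s, policy(s), rng)
            ret += discount * r
            discount *= gamma
            if done:
                break
        total += ret
    return total / n_traj

# Toy deterministic 3-step chain: reward 1 per step, episode ends at state 3.
def step(s, a, rng):
    return s + 1, 1.0, s + 1 >= 3

print(estimate_value(step, lambda s: 0, 0, gamma=0.5, n_traj=1))  # 1 + 0.5 + 0.25 = 1.75
```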
Inference [Goodman et al. 2008]: use Metropolis-Hastings (MH) to sample representations from the posterior (MH + LSPI = MHPI).
Markov chain Monte Carlo: propose a candidate representation, then accept it probabilistically based on the posterior.
Propose function: Add, Mutate, or Remove an extended feature (Figure 2: primitive and extended features and the proposal moves).
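A schematic of the MH loop over representations, assuming a symmetric proposal; `log_posterior` and `propose` are placeholders here (in MHPI the posterior score comes from LSPI plus the simulated-trajectory value estimate, and the proposal is Add/Mutate/Remove on extended features).

```python
import math
import random

# Schematic Metropolis-Hastings loop over representations (symmetric
# proposal assumed); log_posterior and propose are placeholders. In MHPI
# the posterior score would come from LSPI plus simulated trajectories.

def mh_sample(init, log_posterior, propose, n_iter=1000, seed=0):
    rng = random.Random(seed)
    current, cur_lp = init, log_posterior(init)
    samples = []
    for _ in range(n_iter):
        cand = propose(current, rng)      # e.g. Add / Mutate / Remove a feature
        cand_lp = log_posterior(cand)
        # Accept with probability min(1, p(cand) / p(current))
        if math.log(rng.random()) < cand_lp - cur_lp:
            current, cur_lp = cand, cand_lp
        samples.append(current)
    return samples

# Toy check: posterior over representation sizes 0..4, peaked at 2.
chain = mh_sample(0,
                  log_posterior=lambda r: -abs(r - 2),
                  propose=lambda r, rng: min(4, max(0, r + rng.choice([-1, 1]))))
print(len(chain))  # 1000
```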
Maze domain: 200 initial samples; initial features: row and column indicators; noiseless actions (→, ←, ↓, ↑).
[Figure 3: Maze domain empirical results. (a) domain; (b) posterior distribution over the number of extended features; (c) sampled performance (steps to the goal vs. MHPI iteration); (d) resulting policy.]
BlocksWorld: 1000 initial samples; initial features: on(A, B); 20% chance of dropping the block.
[Figure 4: BlocksWorld. (a) domain (start and goal configurations); (b) posterior distribution over the number of extended features; (c) sampled performance (steps to make the tower vs. MHPI iteration); (d) performance distribution.]
Inverted pendulum (state: θ, θ̇; action: torque τ): 1000 initial samples; initial features: θ and θ̇ discretized into 21 buckets separately; Gaussian noise added to torque values.
Many proposed representations were rejected initially.
Key feature: (−0.21 ≤ θ < 0) ∧ (0.4 ≤ θ̇ …)
[Figure 5: Inverted pendulum. (a) domain; (b) posterior distribution over the number of extended features; (c) performance (number of balancing steps vs. MHPI iteration); (d) performance distribution (1 vs. >1 extended features).]
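A sketch of the bucket-indicator featurization used for the continuous pendulum state: discretize each of θ and θ̇ into 21 buckets and concatenate one-hot indicators. The value ranges below are assumptions; only the 21-bucket count comes from the slide.

```python
import numpy as np

# Sketch of the bucket-indicator features: discretize θ and θ̇ into 21
# buckets each and concatenate one-hot indicators. The value ranges
# below are assumptions; only the 21-bucket count is from the slide.

def bucket_indicators(x, lo, hi, n_buckets=21):
    """One-hot vector marking which equal-width bucket of [lo, hi] holds x."""
    idx = min(int((x - lo) / (hi - lo) * n_buckets), n_buckets - 1)
    onehot = np.zeros(n_buckets)
    onehot[max(idx, 0)] = 1.0
    return onehot

phi = np.concatenate([bucket_indicators(0.1, -1.5, 1.5),    # θ
                      bucket_indicators(-0.4, -3.0, 3.0)])  # θ̇
print(int(phi.sum()))  # exactly one active bucket per dimension: 2
```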