 
A Bayesian Approach to Finding Compact Representations for Reinforcement Learning
Special thanks to Joelle Pineau for presenting our paper (July 2012)
Authors: Alborz Geramifard, Stefanie Tellex, David Wingate, Nicholas Roy, Jonathan How
Vision: Solving large sequential decision-making problems formulated as MDPs.
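Reinforcement Learning
Policy: $\pi(s) : S \rightarrow A$
Agent-environment loop (diagram): action $a_t$; state and reward $s_t, r_t$.
$Q^\pi(s, a) = E_\pi\left[ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_0 = s, a_0 = a \right]$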
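The discounted sum inside the expectation above can be illustrated for a single finite trajectory. This is a minimal sketch, assuming a truncated reward sequence; the function name `discounted_return` is hypothetical and not from the slides.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^(t-1) * r_t over a finite reward sequence r_1..r_T.

    Assumption: the trajectory is truncated at T steps, so this only
    approximates the infinite sum in the definition of Q.
    """
    total = 0.0
    for t, reward in enumerate(rewards, start=1):
        total += gamma ** (t - 1) * reward
    return total


if __name__ == "__main__":
    # Three steps with reward 1 and gamma = 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

Averaging this quantity over many trajectories that start from the same state-action pair would give a Monte Carlo estimate of $Q^\pi(s, a)$.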
Linear Function Approximation
$Q^\pi(s, a) \approx \phi(s, a)^\top \theta$, with features $\phi_1, \dots, \phi_n$ and weights $\theta_1, \dots, \theta_n$.
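The linear approximation is just a dot product between the feature vector and the weight vector. A minimal sketch, assuming features and weights are plain Python lists of equal length (the name `q_value` is hypothetical):

```python
def q_value(phi, theta):
    """Approximate Q(s, a) as the dot product phi(s, a)^T theta."""
    if len(phi) != len(theta):
        raise ValueError("feature and weight vectors must have equal length")
    return sum(f * w for f, w in zip(phi, theta))


if __name__ == "__main__":
    # Binary features: only active features contribute their weight.
    print(q_value([1, 0, 1], [0.5, 2.0, -1.0]))
```

With binary features, the estimate is simply the sum of the weights of the active features, which is why adding a well-chosen conjunction feature can change what the approximation is able to express.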
Challenge
Good representation ($\phi$) → good value function ($Q$) → good policy ($\pi$)
Our focus: finding a good representation $\phi$.
Approach
Diagram nodes, in order: representation $\phi$ and observed data samples $D$; value function $Q$; policy $\pi$; $G \in \{0, 1\}$ (is the policy good?).
Ideally: $\phi^* = \operatorname{argmax}_{\phi} P(\phi \mid G, D)$
For brevity, $G$ is written in place of $G = 1$.
Approach
$\phi^* = \operatorname{argmax}_{\phi} P(\phi \mid G, D)$
Problem: $\phi \in$ {logical combinations of primitive features}, such as $f_8 = f_4 \wedge f_6$. This space is big!
Diagram: primitive features 1-6 at the bottom; extended features 7 ($\vee$) and 8 ($\wedge$) built on top of them.
Insight: $P(\phi \mid G, D) \propto P(G \mid \phi, D) \, P(\phi)$
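To make the notion of extended features concrete, the sketch below evaluates logical combinations of primitive features. The definition $f_8 = f_4 \wedge f_6$ comes from the slide; the operands chosen for $f_7$ are an assumption made only for illustration, as are the function name and the data layout.

```python
def evaluate_features(primitive, extended):
    """Evaluate extended features defined over primitive ones.

    primitive: dict mapping feature id -> bool
    extended:  dict mapping feature id -> (op, left_id, right_id),
               where op is "and" or "or".
    Assumption: every extended feature refers only to features with a
    smaller id, so evaluating in increasing id order is valid.
    """
    values = dict(primitive)
    for fid in sorted(extended):
        op, left, right = extended[fid]
        if op == "and":
            values[fid] = values[left] and values[right]
        elif op == "or":
            values[fid] = values[left] or values[right]
        else:
            raise ValueError(f"unknown operator: {op}")
    return values


if __name__ == "__main__":
    primitive = {1: False, 2: True, 3: False, 4: True, 5: False, 6: True}
    extended = {
        7: ("or", 1, 2),   # hypothetical operands, not given in the slides
        8: ("and", 4, 6),  # f8 = f4 AND f6, as on the slide
    }
    print(evaluate_features(primitive, extended))
```

Because extended features can themselves be combined, the number of candidate representations grows very quickly, which is the motivation for sampling rather than enumerating them.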
Approach
$P(\phi \mid G, D) \propto P(G \mid \phi, D) \, P(\phi)$ (likelihood × prior)
Likelihood:
- Find the best policy $\pi$ given $\phi, D$ (we used LSPI [Lagoudakis et al. 2003]).
- $P(G \mid \phi, D) \propto e^{\eta V^\pi(s_0)}$; simulate trajectories to estimate $V^\pi(s_0)$.
A well-performing policy is more likely to be a good policy!
Prior [Goodman et al. 2008]:
- Representations with fewer features are more likely.
- Representations with simpler features are more likely.
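Taking logs of the proportionality above gives $\log P(\phi \mid G, D) = \eta V^\pi(s_0) + \log P(\phi) + \text{const}$. The sketch below scores a representation this way. The likelihood term follows the slide; the prior is a stand-in chosen only to penalize the number and size of features, and its functional form and the parameters `alpha` and `beta` are assumptions, not the prior used by the authors.

```python
def log_posterior(value_estimate, feature_sizes, eta=1.0, alpha=1.0, beta=0.5):
    """Unnormalized log posterior of a representation.

    value_estimate: estimate of V^pi(s_0) for the policy found with this
                    representation (e.g. from simulated trajectories).
    feature_sizes:  one entry per extended feature, giving the number of
                    primitive features it combines.
    Assumption: log P(phi) = -alpha * (#features) - beta * (total size).
    This is an illustrative prior only.
    """
    log_likelihood = eta * value_estimate
    log_prior = -alpha * len(feature_sizes) - beta * sum(feature_sizes)
    return log_likelihood + log_prior


if __name__ == "__main__":
    # Same policy value, but the second representation is larger,
    # so it receives a lower score.
    print(log_posterior(10.0, [2]))
    print(log_posterior(10.0, [2, 3]))
```

Working in log space avoids overflow when $\eta V^\pi(s_0)$ is large and turns the product of likelihood and prior into a sum.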
Approach
$P(\phi \mid G, D) \propto P(G \mid \phi, D) \, P(\phi)$
Inference: use Metropolis-Hastings (MH) to sample from the posterior. MH + LSPI = MHPI.
Markov chain Monte Carlo: from the current representation $\phi$, propose $\phi'$, then accept it probabilistically based on the posterior.
Propose function: Add, Mutate, or Remove an extended feature (diagram: each move applied to a tree of primitive features 1-6 and extended features 7-9).
Figure 2: Representation of primitive and extended features and …
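The propose-then-accept loop can be sketched generically. This is a simplified illustration, not the MHPI implementation: it assumes a symmetric proposal (so the acceptance ratio reduces to the posterior ratio), represents a representation as a set of feature ids, and uses a toggle move in place of the Add/Mutate/Remove moves. All names are hypothetical.

```python
import math
import random


def toggle_proposal(candidate_ids):
    """Build a symmetric proposal: pick one candidate id and toggle it."""
    def propose(phi, rng):
        fid = rng.choice(candidate_ids)
        return phi ^ frozenset({fid})  # adds fid if absent, removes it if present
    return propose


def mh_sample(log_post, propose, phi0, n_iters, rng):
    """Metropolis sampler over representations (symmetric proposal assumed).

    log_post: function returning the unnormalized log posterior of phi.
    Returns the list of chain states, one per iteration.
    """
    phi, lp = phi0, log_post(phi0)
    samples = []
    for _ in range(n_iters):
        cand = propose(phi, rng)
        lp_cand = log_post(cand)
        # Accept with probability min(1, exp(lp_cand - lp)).
        if lp_cand >= lp or rng.random() < math.exp(lp_cand - lp):
            phi, lp = cand, lp_cand
        samples.append(phi)
    return samples


if __name__ == "__main__":
    # Toy posterior: feature 8 is very useful, every feature costs 1.
    toy = lambda phi: (5.0 if 8 in phi else 0.0) - len(phi)
    chain = mh_sample(toy, toggle_proposal([7, 8, 9]), frozenset(), 500,
                      random.Random(0))
    print(sum(8 in s for s in chain) / len(chain))
```

In the setting of the slides, `log_post` would run LSPI on the candidate representation and estimate $V^\pi(s_0)$ from simulated trajectories, which is the expensive step of each iteration. A non-symmetric proposal would also need the proposal ratio in the acceptance test.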
Maze
200 initial samples. Initial features: row and column indicators. Noiseless actions: →, ←, ↓, ↑.
Figure 3: Maze domain empirical results. Panels: (a) Domain, (b) Posterior Distribution, (c) Sampled Performance, (d) Resulting Policy. Axis labels: # of Extended Features, # of Samples, MH Iteration, # of Steps to the Goal.
BlocksWorld
1000 initial samples. Initial features: on(A,B). 20% noise of dropping the block.
Figure 4: BlocksWorld. Panels: (a) Domain (start and goal configurations), (b) Posterior Distribution, (c) Sampled Performance, (d) Performance Dist. Axis labels: # of Extended Features, # of Samples, MH Iteration, # of Steps to Make the Tower.
Inverted Pendulum
1000 initial samples. Initial features: discretize $\theta$ and $\dot{\theta}$ into 21 buckets separately. Gaussian noise was added to torque values.
Many proposed representations were rejected initially.
Figure 5: Panels: (a) Domain (angle $\theta$, angular velocity $\dot{\theta}$, torque $\tau$), (b) Posterior Distribution, (c) Performance, (d) Performance Dist. Axis labels: # of Extended Features, # of Samples, MH Iteration, # of Balancing Steps.
Key feature: $(-\frac{\pi}{21} \le \theta < 0) \wedge (0.4 \le \dot{\theta} < 0.6)$
Contributions
- Introduced a Bayesian approach for finding concise yet expressive representations for solving MDPs.
- Introduced MHPI as a new RL technique that expands the representation using limited samples.
- Empirically demonstrated the effectiveness of our approach in 3 domains.
Future Work:
- Reuse the data for estimating $V^\pi(s_0)$ for policy iteration.
- Relax the need for a simulator to generate trajectories: importance sampling [Sutton and Barto, 1998]; model-free Monte Carlo [Fonteneau et al., 2010].