A Bayesian Approach to Finding Compact Representations for Reinforcement Learning


  1. A Bayesian Approach to Finding Compact Representations for Reinforcement Learning. Special thanks to Joelle Pineau for presenting our paper (July 2012).

  2. Authors: Alborz Geramifard, Stefanie Tellex, David Wingate, Nicholas Roy, Jonathan How.

  3. Vision: solving large sequential decision-making problems formulated as MDPs.

  4. Reinforcement Learning: a policy maps states to actions, π(s): S → A. At each step t the agent observes state s_t, takes an action a_t, and receives reward r_t. The action-value function is the expected discounted return: Q^π(s, a) = E_π[ Σ_{t=1}^∞ γ^{t−1} r_t | s_0 = s, a_0 = a ].
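To make the definition concrete, here is a minimal sketch (not from the paper) of estimating Q^π(s, a) by Monte Carlo rollouts; the simulator interface (reset_to, step) and the policy callable are hypothetical.

```python
def mc_q_estimate(env, policy, s, a, gamma=0.95, n_rollouts=100, horizon=200):
    """Monte Carlo estimate of Q^pi(s, a): average discounted return over
    rollouts that start in state s, take action a, then follow policy pi."""
    total = 0.0
    for _ in range(n_rollouts):
        env.reset_to(s)                          # hypothetical: set simulator state to s
        ret, discount, action = 0.0, 1.0, a
        for _ in range(horizon):
            s_next, r, done = env.step(action)   # hypothetical simulator API
            ret += discount * r
            discount *= gamma
            if done:
                break
            action = policy(s_next)
        total += ret
    return total / n_rollouts
```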

  5. Linear Function Approximation: the action-value function is approximated as a weighted sum of features, Q^π(s, a) ≈ φ(s, a)^⊤ θ, where φ = (φ_1, ..., φ_n) is the feature vector of a state-action pair and θ = (θ_1, ..., θ_n) are the learned weights.
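A minimal sketch of linear value-function approximation with binary features; the toy primitive features below are illustrative placeholders, not the representations used in the paper.

```python
import numpy as np

# Illustrative primitive binary features for a toy grid world (hypothetical).
primitive_features = [
    lambda s, a: s[0] == 0,        # agent is in the top row
    lambda s, a: s[1] == 0,        # agent is in the left column
    lambda s, a: a == "right",     # the chosen action is "right"
]

def phi(s, a):
    """Binary feature vector for a state-action pair."""
    return np.array([1.0 if f(s, a) else 0.0 for f in primitive_features])

def q_value(theta, s, a):
    """Linear approximation: Q(s, a) = phi(s, a)^T theta."""
    return float(phi(s, a) @ theta)

theta = np.zeros(len(primitive_features))   # weights, learned by e.g. LSPI
print(q_value(theta, (0, 3), "right"))      # 0.0 before any learning
```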

  6. Challenge: a good representation (φ) yields a good value function (Q), which yields a good policy (π). Our focus is the representation φ.

  7. Approach: from observed data samples D and a representation φ, compute a value function Q and a policy π; a binary variable G ∈ {0, 1} indicates whether the resulting policy is good.

  8. Approach (cont.): ideally we would pick φ* = argmax_φ P(φ | G, D), the representation most likely to yield a good policy given the data (writing G instead of G = 1 for brevity).

  9. Approach (cont.): the problem is that the search space is big. φ ranges over extended features, i.e., logical combinations (∧, ∨) of primitive features, such as f_8 = f_4 ∧ f_6 built from primitive features f_1, ..., f_6.

  10. Approach (cont.): the insight is to apply Bayes' rule, P(φ | G, D) ∝ P(G | φ, D) P(φ).
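A minimal sketch of how extended features can be built as logical combinations of primitive binary features; the specific primitives and indices are illustrative, not taken from the paper.

```python
# Two illustrative primitive binary features (hypothetical).
f4 = lambda s, a: s[0] == 0          # e.g. "agent is in the top row"
f6 = lambda s, a: a == "right"       # e.g. "the action is right"

def conjunction(f_i, f_j):
    """Extended feature that fires only when both parents fire."""
    return lambda s, a: f_i(s, a) and f_j(s, a)

def disjunction(f_i, f_j):
    """Extended feature that fires when either parent fires."""
    return lambda s, a: f_i(s, a) or f_j(s, a)

f8 = conjunction(f4, f6)             # f_8 = f_4 AND f_6, as in the slide
print(f8((0, 3), "right"))           # True: both parent features fire
```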

  11. Approach (cont.): P(φ | G, D) ∝ P(G | φ, D) P(φ), i.e., likelihood times prior.

  12. Approach (cont.), likelihood: find the best policy π given φ and D (we used LSPI [Lagoudakis et al., 2003]); then P(G | φ, D) ∝ e^{η V^π(s_0)}.
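The paper uses LSPI for this step; the following is only a minimal sketch of the LSTD-Q evaluation at the core of LSPI, with an illustrative transition format and regularization term (assumptions, not the paper's exact setup).

```python
import numpy as np

def lstdq(samples, phi, policy, gamma=0.95, reg=1e-3):
    """One LSTD-Q step: solve A theta = b for the weights of Q^pi.
    samples: list of (s, a, r, s_next) transitions (illustrative format);
    policy: maps a state to the action the current policy would take."""
    s0, a0, _, _ = samples[0]
    k = len(phi(s0, a0))
    A = reg * np.eye(k)                          # small ridge term for invertibility
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        x = phi(s, a)
        x_next = phi(s_next, policy(s_next))
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    return np.linalg.solve(A, b)
```

In full LSPI, this evaluation step alternates with greedy policy improvement until the policy stops changing.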

  13. Approach (cont.): V^π(s_0) is estimated by simulating trajectories; a well-performing policy is more likely to be a good policy.

  14. Approach (cont.), prior [Goodman et al., 2008]: representations with fewer features are more likely, and representations with simpler features are more likely.
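Combining the two terms, here is a minimal sketch of the unnormalized log-posterior over representations; η, the simplicity penalty lam, and the helper callables are illustrative assumptions rather than the paper's exact choices.

```python
def log_posterior(phi, data, lspi_fn, value_fn, eta=1.0, lam=0.1):
    """Unnormalized log P(phi | G, D) = log P(G | phi, D) + log P(phi) + const.

    lspi_fn(phi, data) -> policy      hypothetical: policy iteration (e.g. LSPI)
    value_fn(policy)   -> V^pi(s_0)   hypothetical: estimated from simulated trajectories
    """
    policy = lspi_fn(phi, data)
    v_s0 = value_fn(policy)
    log_likelihood = eta * v_s0      # P(G | phi, D) proportional to exp(eta * V^pi(s_0))
    # f.size is an assumed attribute giving the size of each feature expression,
    # so representations with fewer and simpler features score higher.
    log_prior = -lam * sum(f.size for f in phi)
    return log_likelihood + log_prior
```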

  15. Approach (cont.), posterior inference: use Metropolis-Hastings (MH) to sample from the posterior P(φ | G, D); combining MH with LSPI gives MHPI.

  16. Approach (cont.), Markov chain Monte Carlo: propose a new representation φ' from the current φ, then accept or reject it probabilistically based on the posterior.

  17. Approach (cont.), proposal function: φ' is generated from the current φ by adding, mutating, or removing extended features, i.e., logical combinations (∧, ∨) of the primitive features. (Figure 2: representation of primitive and extended features.)
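A minimal sketch of the resulting Metropolis-Hastings loop over representations (MHPI); the propose callable and a scoring callable log_posterior(phi, data) (for example, the earlier sketch with its helper arguments bound) are hypothetical, and a symmetric proposal is assumed so the Hastings correction cancels.

```python
import math
import random

def mhpi(phi0, data, propose, log_posterior, n_iters=1000):
    """Metropolis-Hastings over feature representations (MH + LSPI = MHPI)."""
    phi, score = phi0, log_posterior(phi0, data)
    samples = []
    for _ in range(n_iters):
        phi_new = propose(phi)                   # add / mutate / remove an extended feature
        score_new = log_posterior(phi_new, data)
        # Accept with probability min(1, P(phi_new | G, D) / P(phi | G, D)).
        if random.random() < math.exp(min(0.0, score_new - score)):
            phi, score = phi_new, score_new
        samples.append(phi)
    return samples
```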

  18. Maze: 200 initial samples; initial features are row and column indicators; actions (→, ←, ↓, ↑) are noiseless. (Figure 3: Maze domain empirical results; (a) domain, (b) posterior distribution over the number of extended features, (c) sampled performance in steps to the goal per MHPI iteration, (d) resulting policy.)

  19. BlocksWorld: 1000 initial samples; initial features are on(A, B) predicates; 20% chance of dropping the block (noise). (Figure 4: BlocksWorld; (a) domain, (b) posterior distribution over the number of extended features, (c) sampled performance in steps to make the tower per MHPI iteration, (d) performance distribution.)


  22. Inverted Pendulum: 1000 initial samples; initial features discretize θ and θ̇ separately into 21 buckets each; Gaussian noise was added to the torque values. (Figure 5: inverted pendulum; (a) domain, (b) posterior distribution over the number of extended features, (c) performance in balancing steps per MHPI iteration, (d) performance distribution.)

  23. Inverted Pendulum (cont.): many proposed representations were rejected initially.

  24. Inverted Pendulum (cont.): while some extended features hurt performance, the key feature (−π/21 ≤ θ < 0) ∧ (0.4 ≤ θ̇ < 0.6) allowed the agent to complete the task successfully.

  25. Contributions: introduced a Bayesian approach for finding concise yet expressive representations for solving MDPs; introduced MHPI, a new RL technique that expands the representation using limited samples; empirically demonstrated the effectiveness of the approach in three domains. Future work: reuse the data for estimating V^π(s_0) for policy iteration; relax the need for a simulator to generate trajectories, e.g., via importance sampling [Sutton and Barto, 1998] or model-free Monte Carlo [Fonteneau et al., 2010].
