SLIDE 1

The 24th International Conference on Automated Planning and Scheduling, ICAPS 2014

Thompson Sampling based Monte-Carlo Planning in POMDPs

Aijun Bai¹, Feng Wu², Zongzhang Zhang³, Xiaoping Chen¹

¹University of Science & Technology of China   ²University of Southampton   ³National University of Singapore

June 24, 2014

SLIDE 2

Table of Contents

1. Introduction
2. The approach
3. Empirical results
4. Conclusion and future work


SLIDE 3

Monte-Carlo tree search

◮ Online planning method
◮ Finds near-optimal policies for MDPs and POMDPs
◮ Builds a best-first search tree using Monte-Carlo sampling
◮ Without explicitly knowing the underlying models in advance


SLIDE 4

MCTS procedure

Figure 1: Outline of Monte-Carlo tree search [Chaslot et al., 2008].


SLIDE 5

Resulting asymmetric search tree

Figure 2: An example of the resulting asymmetric search tree [Coquelin and Munos, 2007].


SLIDE 6

The exploration vs. exploitation dilemma

◮ A fundamental problem in MCTS:

  • 1. Must not only exploit by selecting the action that currently seems best
  • 2. Should also keep exploring actions that might yield higher outcomes in the future

◮ Can be seen as a multi-armed bandit problem (MAB)

  • 1. A set of actions: A
  • 2. An unknown stochastic reward function R(a) := X_a

◮ Cumulative regret (CR):

$$R_T = \mathbb{E}\left[\sum_{t=1}^{T} \left(X_{a^*} - X_{a_t}\right)\right] \qquad (1)$$

◮ Minimize CR by trading off between exploration and exploitation

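As a small worked illustration of Eq. (1) above (not from the slides): the expected cumulative regret of a deliberately naive, uniformly random policy on a made-up two-armed Bernoulli bandit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-armed Bernoulli bandit; the arm means below are made up.
means = np.array([0.7, 0.4])      # E[X_a] for each arm
best = means.max()                # E[X_{a*}]

T = 1000
regret = 0.0
for t in range(T):
    a = rng.integers(len(means))  # uniformly random action selection
    regret += best - means[a]     # expected per-step regret, as in Eq. (1)

print(f"expected cumulative regret after {T} steps: {regret:.1f}")
```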

SLIDE 7

UCB1 heuristics

◮ POMCP algorithm [Silver and Veness, 2010]:

$$\mathrm{UCB1}(h, a) = \bar{Q}(h, a) + c \sqrt{\frac{\log N(h)}{N(h, a)}} \qquad (2)$$

◮ Q̄(h, a) is the mean outcome of applying action a in history h
◮ N(h, a) is the visitation count of action a following h
◮ N(h) = Σ_{a∈A} N(h, a) is the overall count
◮ c is the exploration constant

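A minimal Python sketch of the UCB1 rule above, assuming per-action visit counts and mean outcomes are tracked at each node; this is illustrative, not POMCP's actual implementation.

```python
import math

def ucb1_select(counts, means, c=1.0):
    """Pick an action index by the UCB1 rule in Eq. (2).

    counts[a] plays the role of N(h, a) and means[a] of Q-bar(h, a);
    unvisited actions are tried first so the log/division is well defined.
    """
    n_total = sum(counts)
    best_a, best_score = None, float("-inf")
    for a, (n, q) in enumerate(zip(counts, means)):
        if n == 0:
            return a
        score = q + c * math.sqrt(math.log(n_total) / n)
        if score > best_score:
            best_a, best_score = a, score
    return best_a

# Example: three actions with running statistics (hypothetical numbers).
print(ucb1_select(counts=[10, 5, 1], means=[0.4, 0.5, 0.2], c=1.0))
```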

SLIDE 8

Balancing between CR and SR in MCTS

◮ Simple regret (SR):

$$r_n = \mathbb{E}\left[X_{a^*} - X_{\bar{a}}\right] \qquad (3)$$

where ā = argmax_{a∈A} X̄_a

◮ Makes more sense for pure exploration
◮ A recently growing understanding: balance between CR and SR [Feldman and Domshlak, 2012]

  • 1. No real reward is collected while searching the tree
  • 2. It is good to grow the tree more accurately by exploiting the current tree

SLIDE 9

Thompson sampling

◮ Select an action based on its posterior probability of being optimal:

$$P(a) = \int \mathbb{1}\!\left[a = \operatorname*{argmax}_{a'} \mathbb{E}[X_{a'} \mid \theta_{a'}]\right] \prod_{a'} P_{a'}(\theta_{a'} \mid Z)\, d\theta \qquad (4)$$

  • 1. θ_a specifies the unknown distribution of X_a
  • 2. θ = (θ_{a_1}, θ_{a_2}, …) is the vector of all hidden parameters

◮ Can be approached efficiently by sampling:

  • 1. Sample a set of hidden parameters θ_a from their posteriors
  • 2. Select the action with the highest expectation E[X_a | θ_a]
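A generic sketch of the sampling view above (illustrative; the posterior-sampling interface is an assumption, not the authors' API):

```python
import numpy as np

def thompson_select(sample_mean_fns, rng):
    """One Thompson-sampling decision: draw hidden parameters from each
    action's posterior and act greedily w.r.t. the sampled expectations.

    sample_mean_fns[a] is assumed to return one draw of E[X_a | theta_a]
    with theta_a sampled from its posterior P(theta_a | Z).
    """
    sampled_means = [draw(rng) for draw in sample_mean_fns]
    return int(np.argmax(sampled_means))
```

The Beta-Bernoulli example on the next slide is exactly this procedure with Beta posteriors.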

SLIDE 10

An example of Thompson sampling

◮ 2-armed bandit: a and b ◮ Bernoulli reward distributions ◮ Hidden parameters pa and pb ◮ Prior distributions:

◮ pa ∼ Uniform(0, 1) ◮ pb ∼ Uniform(0, 1)

◮ History: a, 1, b, 0, a, 0 ◮ Posterior distributions:

◮ pa ∼ Beta(2, 2) ◮ pb ∼ Beta(1, 2)

◮ Sample pa and pb ◮ Compare E[Xa | pa] and E[Xb | pb]

(a) Beta(2, 2). (b) Beta(1, 2). Figure 3 : Posterior distributions.

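The example above in a few lines of Python (a sketch; NumPy's Beta sampler stands in for whatever implementation the authors used):

```python
import numpy as np

rng = np.random.default_rng(1)

# Posteriors after the history (a,1), (b,0), (a,0) from the slide, with
# Uniform(0,1) = Beta(1,1) priors: p_a ~ Beta(2,2), p_b ~ Beta(1,2).
alpha = {"a": 2, "b": 1}
beta = {"a": 2, "b": 2}

# One Thompson-sampling step: sample p_a and p_b, then compare the sampled
# means (for a Bernoulli arm, E[X | p] = p).
sampled = {arm: rng.beta(alpha[arm], beta[arm]) for arm in ("a", "b")}
chosen = max(sampled, key=sampled.get)
print(sampled, "->", chosen)
```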

SLIDE 11

Motivation

◮ Thompson sampling

  • 1. Theoretically achieves asymptotic optimality for MABs in terms of CR
  • 2. Empirically performs competitively with, and sometimes better than, the state of the art in terms of CR and SR

◮ Seems to be a promising approach for the challenge of balancing CR and SR


SLIDE 12

Contribution

◮ A complete Bayesian approach for online Monte-Carlo planning in POMDPs

  • 1. Maintain the posterior reward distribution of applying an action
  • 2. Use Thompson sampling to guide the action selection

SLIDE 13

Bayesian modeling and inference

◮ X_{b,a}: the immediate reward of performing action a in belief b
◮ A finite set of possible immediate rewards: I = {r_1, r_2, …, r_k}
◮ X_{b,a} ∼ Multinomial(p_1, p_2, …, p_k)

  • 1. $p_i = \sum_{s \in S} \mathbb{1}[R(s, a) = r_i]\, b(s)$
  • 2. $\sum_{i=1}^{k} p_i = 1$

◮ (p_1, p_2, …, p_k) ∼ Dirichlet(ψ_{b,a}), where ψ_{b,a} = (ψ_{b,a,r_1}, ψ_{b,a,r_2}, …, ψ_{b,a,r_k})
◮ Observing r:

$$\psi_{b,a,r} \leftarrow \psi_{b,a,r} + 1 \qquad (5)$$

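A sketch of the Dirichlet bookkeeping above, assuming each (b, a) node stores a count vector over the finite reward support; the class and method names are hypothetical, not the authors' data structure.

```python
import numpy as np

class RewardPosterior:
    """Dirichlet posterior over the multinomial immediate reward X_{b,a}.

    `support` is the finite reward set I = {r_1, ..., r_k}; `psi` plays the
    role of the Dirichlet parameter psi_{b,a} and is incremented as in Eq. (5).
    """

    def __init__(self, support, prior=1.0):
        self.support = np.asarray(support, dtype=float)
        self.psi = np.full(len(support), prior)

    def update(self, r):
        """Observe an immediate reward r: psi_{b,a,r} <- psi_{b,a,r} + 1."""
        self.psi[int(np.argmin(np.abs(self.support - r)))] += 1.0

    def sample_mean(self, rng):
        """Sample (p_1, ..., p_k) ~ Dirichlet(psi) and return the sampled E[X_{b,a}]."""
        return float(rng.dirichlet(self.psi) @ self.support)

# Example usage with a hypothetical reward support.
rng = np.random.default_rng(0)
post = RewardPosterior(support=[-1.0, 0.0, 10.0])
post.update(10.0)
print(post.sample_mean(rng))
```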

SLIDE 14

Bayesian modeling and inference

◮ X_{s,b,π}: the cumulative reward of following policy π from the joint state ⟨s, b⟩
◮ X_{s,b,π} ∼ N(μ_{s,b}, 1/τ_{s,b}) (by the CLT for Markov chains)
◮ (μ_{s,b}, τ_{s,b}) ∼ NormalGamma(μ_{s,b,0}, λ_{s,b}, α_{s,b}, β_{s,b})
◮ Observing v (all right-hand sides use the pre-update values):

$$\mu_{s,b,0} \leftarrow \frac{\lambda_{s,b}\,\mu_{s,b,0} + v}{\lambda_{s,b} + 1} \qquad (6)$$
$$\lambda_{s,b} \leftarrow \lambda_{s,b} + 1 \qquad (7)$$
$$\alpha_{s,b} \leftarrow \alpha_{s,b} + \tfrac{1}{2} \qquad (8)$$
$$\beta_{s,b} \leftarrow \beta_{s,b} + \frac{\lambda_{s,b}\,(v - \mu_{s,b,0})^2}{2\,(\lambda_{s,b} + 1)} \qquad (9)$$
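A direct transcription of Eqs. (6)-(9) as a small helper (a sketch; the hyperparameter values in the usage line are made up):

```python
def normal_gamma_update(mu0, lam, alpha, beta, v):
    """Conjugate NormalGamma update for one observed return v, Eqs. (6)-(9).

    All right-hand sides use the parameter values from before the update.
    """
    beta_new = beta + 0.5 * lam * (v - mu0) ** 2 / (lam + 1.0)
    mu0_new = (lam * mu0 + v) / (lam + 1.0)
    lam_new = lam + 1.0
    alpha_new = alpha + 0.5
    return mu0_new, lam_new, alpha_new, beta_new

# Example: start from a weak prior and fold in one simulated return.
print(normal_gamma_update(mu0=0.0, lam=1.0, alpha=1.0, beta=1.0, v=5.0))
```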

SLIDE 15

Bayesian modeling and inference

◮ X_{b,π}: the cumulative reward of following policy π in belief b
◮ X_{b,π} follows a mixture of Normal distributions:

$$f_{X_{b,\pi}}(x) = \sum_{s \in S} b(s)\, f_{X_{s,b,\pi}}(x) \qquad (10)$$

◮ X_{b,a,π}: the cumulative reward of applying a in belief b and then following policy π:

$$X_{b,a,\pi} = X_{b,a} + \gamma X_{b',\pi} \qquad (11)$$

◮ Expectation of X_{b,a,π}:

$$\mathbb{E}[X_{b,a,\pi}] = \mathbb{E}[X_{b,a}] + \gamma \sum_{o \in O} \mathbb{1}[b' = \zeta(b, a, o)]\, \Omega(o \mid b, a)\, \mathbb{E}[X_{b',\pi}] \qquad (12)$$


SLIDE 16

Bayesian modeling and inference

◮ Ω(· | b, a) ∼ Dirichlet(ρ_{b,a})
◮ ρ_{b,a} = (ρ_{b,a,o_1}, ρ_{b,a,o_2}, …)
◮ Observing a transition (b, a) → o:

$$\rho_{b,a,o} \leftarrow \rho_{b,a,o} + 1 \qquad (13)$$


SLIDE 17

Thompson sampling based action selection

◮ At a decision node with belief b
◮ Sample a set of parameters:

  • 1. {w_{b,a,o}} ∼ Dirichlet(ρ_{b,a})
  • 2. {w_{b,a,r}} ∼ Dirichlet(ψ_{b,a})
  • 3. {μ_{s′,b′}} ∼ NormalGamma(μ_{s′,b′,0}, λ_{s′,b′}, α_{s′,b′}, β_{s′,b′}), where b′ = ζ(b, a, o)

◮ Select the action with the highest sampled Q̃ value (sketched below):

$$\tilde{Q}(b, a) = \sum_{r \in I} w_{b,a,r}\, r + \gamma \sum_{o \in O} \mathbb{1}[b' = \zeta(b, a, o)]\, w_{b,a,o} \sum_{s' \in S} \mu_{s',b'}\, b'(s') \qquad (14)$$

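Putting the pieces together, a sketch of one Thompson-sampling action selection at a decision node. The node fields (rewards, psi, rho, children, sampled_mu, actions) are assumed names rather than the authors' implementation, and the NormalGamma draws of the state-wise means μ_{s′,b′} are assumed to have been taken already.

```python
import numpy as np

def sampled_q(node, a, gamma, rng):
    """Draw one sampled Q(b, a) as in Eq. (14).

    Assumed node layout (illustrative only):
      node.rewards     -- reward support I
      node.psi[a]      -- Dirichlet counts over immediate rewards
      node.rho[a]      -- Dirichlet counts over observations, in the same
                          order as node.children[a]
      node.children[a] -- successor belief nodes b' = zeta(b, a, o), each with
                          .belief (dict s' -> b'(s')) and .sampled_mu
                          (dict s' -> NormalGamma draw of mu_{s',b'})
    """
    w_r = rng.dirichlet(node.psi[a])            # sampled reward probabilities
    q = float(w_r @ node.rewards)               # sum_r w_{b,a,r} * r
    w_o = rng.dirichlet(node.rho[a])            # sampled observation probabilities
    for w, child in zip(w_o, node.children[a]):
        # sum_{s'} mu_{s',b'} * b'(s'), using the sampled posterior means
        v_child = sum(child.sampled_mu[s] * p for s, p in child.belief.items())
        q += gamma * w * v_child
    return q

def ts_select_action(node, gamma, rng):
    """Thompson-sampling action selection: argmax over sampled Q values."""
    return max(node.actions, key=lambda a: sampled_q(node, a, gamma, rng))
```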

SLIDE 18

Experiments

◮ D2NG-POMCP: Dirichlet-Dirichlet-NormalGamma partially observable Monte-Carlo planning
◮ RockSample and PocMan domains
◮ Evaluation (see the sketch below):

  • 1. Run the algorithms for a number of iterations on the current belief
  • 2. Apply the best action based on the resulting action-values
  • 3. Repeat until a terminating condition is met (goal state or maximal number of steps)
  • 4. Report the total discounted reward and the average time usage per action
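The evaluation loop above, written out roughly (a sketch; the env and planner method names are hypothetical, not the authors' code):

```python
def run_episode(env, planner, n_iterations, max_steps, gamma):
    """Evaluation protocol from the slide: plan for the current belief,
    act greedily, and repeat until a terminal state or the step limit."""
    total_return, discount = 0.0, 1.0
    belief = env.initial_belief()
    for _ in range(max_steps):
        for _ in range(n_iterations):
            planner.simulate(belief)                 # grow the search tree
        action = planner.best_action(belief)         # greedy w.r.t. estimated action-values
        reward, observation, done = env.step(action)
        total_return += discount * reward
        discount *= gamma
        belief = planner.update_belief(belief, action, observation)
        if done:
            break
    return total_return
```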

SLIDE 19

Experimental results

Figure 4: Performance of D2NG-POMCP vs. POMCP in RockSample and PocMan. Panels (a)-(b): RS[7, 8]; (c)-(d): RS[11, 11]; (e)-(f): RS[15, 15]; (g)-(h): PocMan. Each pair plots the average discounted return against the number of iterations and against the average time per action (seconds).


SLIDE 20

Discussion

◮ The total computation time is linear in the total number of simulations
◮ Requires more computation time than POMCP, due to the time-consuming operations of sampling from various distributions
◮ Can obtain better performance in terms of overall computation if the simulations themselves are expensive
◮ Expected to have lower sample complexity


SLIDE 21

Conclusion and future work

◮ A Bayesian MCTS algorithm: D2NG-POMCP

  • 1. Maintains posterior distributions of the cumulative reward
  • 2. Selects actions using Thompson sampling
  • 3. Balances between CR and SR

◮ Future work

  • 1. Our distributional assumptions in principle hold only in the limit
  • 2. Extend these assumptions to more realistic distributions
  • 3. Test our algorithm on real-world applications

SLIDE 22

References I

Chaslot, G., Bakkes, S., Szita, I., and Spronck, P. (2008). Monte-Carlo tree search: A new framework for game AI. In Proceedings of the Fourth Artificial Intelligence and Interactive Digital Entertainment Conference (AIIDE 2008), Stanford, California. AAAI Press.

Coquelin, P.-A. and Munos, R. (2007). Bandit algorithms for tree search. In Uncertainty in Artificial Intelligence (UAI).

Feldman, Z. and Domshlak, C. (2012). Simple regret optimization in online planning for Markov decision processes. In AAAI Conference on Artificial Intelligence.

Silver, D. and Veness, J. (2010). Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems, pp. 2164-2172.
