Scale-free adaptive planning for deterministic dynamics & discounted rewards

  1. Scale-free adaptive planning for deterministic dynamics & discounted rewards. Peter Bartlett, Victor Gabillon, Jennifer Healey, Michal Valko. ICML, June 13th, 2019.

  2. An MCTS setting. MDP with starting state x_0 ∈ X, action space A, and a budget of n interactions. At time t, playing a_t in x_t leads to: deterministic dynamics g, with x_{t+1} = g(x_t, a_t); a noisy reward observation r(x_t, a_t) + ε_t, where ε_t is the noise. Objective: recommend an action a(n) that minimizes the simple regret r_n ≜ max_{a ∈ A} Q*(x_0, a) − Q*(x_0, a(n)), where Q*(x, a) ≜ r(x, a) + sup_π Σ_{t ≥ 1} γ^t r(x_t, π(x_t)). Assumption: r_t ∈ [0, R_max] and |ε_t| ≤ b. Approach: explore without knowing the parameters R_max and b.
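
A minimal sketch of this setting on a toy problem, to make the quantities concrete. The dynamics g, the reward r, the discount GAMMA, the noise level NOISE, and the state space are illustrative stand-ins, not taken from the paper.

```python
import random
from functools import lru_cache

GAMMA, NOISE = 0.9, 0.1            # discount gamma and noise range b (illustrative)
ACTIONS = (0, 1)                   # finite action space A

def g(x, a):                       # deterministic dynamics: x_{t+1} = g(x_t, a_t)
    return (x + a + 1) % 5

def r(x, a):                       # mean reward r(x, a), here in [0, 1]
    return ((x + a) % 3) / 2.0

def observe(x, a):                 # noisy observation r(x_t, a_t) + eps_t, |eps_t| <= b
    return r(x, a) + random.uniform(-NOISE, NOISE)

@lru_cache(maxsize=None)
def q_star(x, a, horizon=50):      # Q*(x, a) = r(x, a) + sup_pi sum_{t>=1} gamma^t r(x_t, pi(x_t))
    if horizon == 0:
        return 0.0
    nxt = g(x, a)
    return r(x, a) + GAMMA * max(q_star(nxt, b, horizon - 1) for b in ACTIONS)

def simple_regret(x0, recommended):  # r_n = max_a Q*(x0, a) - Q*(x0, a(n))
    return max(q_star(x0, a) for a in ACTIONS) - q_star(x0, recommended)

print(simple_regret(x0=0, recommended=0))
```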

  3. OLOP (Bubeck and Munos, 2010). OLOP implements optimistic planning using an Upper Confidence Bound (UCB) on the Q value of a sequence of q actions a_1, ..., a_q: Q_UCB(a_{1:q}) ≜ Σ_{h=1}^{q} γ^h ( r̂_h(t) + b √(1 / T_{a_h}(t)) ) + R_max γ^{q+1} / (1 − γ), where the sum is an optimistic estimate of the observed rewards along the sequence and the last term bounds the unseen reward beyond depth q. In optimization under a fixed budget n, excellent strategies allocate samples to actions without knowing R_max or b.
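
A minimal sketch of computing such an optimistic value for one action sequence, assuming the formula above with a 1/√T confidence width; GAMMA, R_MAX, B and the example inputs are illustrative, and this is not the full OLOP algorithm.

```python
import math

GAMMA, R_MAX, B = 0.9, 1.0, 0.1    # discount, reward range, noise range (illustrative)

def q_ucb(reward_means, pull_counts, gamma=GAMMA, r_max=R_MAX, b=B):
    """Optimistic value of a sequence a_1..a_q.

    reward_means[h] and pull_counts[h] are the empirical mean reward and the
    number of pulls of the h-th action of the sequence (h = 1..q)."""
    q = len(reward_means)
    observed = sum(
        gamma ** (h + 1) * (reward_means[h] + b * math.sqrt(1.0 / pull_counts[h]))
        for h in range(q)
    )
    unseen = r_max * gamma ** (q + 1) / (1.0 - gamma)  # optimistic bound beyond depth q
    return observed + unseen

print(q_ucb(reward_means=[0.6, 0.4, 0.7], pull_counts=[10, 5, 2]))
```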

  4. Tree search. [Figure: a search tree rooted at x_0 (depth h = 0), with children x_2, x_3, x_4 at h = 1 and deeper nodes x_5, x_6, x_7; following one path, with edge rewards r_03, r_35 and r_56, gives the discounted value Q(x_6) = r_03 + γ r_35 + γ² r_56.] This is a zero-order optimization!
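
A quick numeric check of the identity on this slide, with made-up edge rewards; GAMMA and the reward values are illustrative.

```python
GAMMA = 0.9                                   # illustrative discount

# Rewards collected along the path x_0 -> ... -> x_6, in order of depth.
path_rewards = [0.8, 0.5, 0.3]                # r_03, r_35, r_56 (made-up values)

# Q(x_6) = r_03 + GAMMA * r_35 + GAMMA**2 * r_56
q_x6 = sum(GAMMA ** t * r for t, r in enumerate(path_rewards))
print(q_x6)                                   # 0.8 + 0.9*0.5 + 0.81*0.3 = 1.493
```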

  5. Black-box optimization: use the partitioning to explore f (uniformly). [Figure: a function f on its domain, with the hierarchical partition refined uniformly over depths h = 0, 1, 2.]
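
A minimal sketch of this uniform partition-based exploration on [0, 1]; the objective f, the depth, and the binary splitting are illustrative choices.

```python
def f(x):                                     # illustrative black-box objective on [0, 1]
    return 1.0 - abs(x - 0.37)

def uniform_exploration(f, max_depth):
    """Evaluate the centre of every cell at every depth (uniform exploration)."""
    best_x, best_val = None, float("-inf")
    for h in range(max_depth + 1):
        n_cells = 2 ** h                      # depth h splits [0, 1] into 2**h cells
        for i in range(n_cells):
            centre = (i + 0.5) / n_cells
            value = f(centre)
            if value > best_val:
                best_x, best_val = centre, value
    return best_x, best_val

print(uniform_exploration(f, max_depth=5))
```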

  6. Zipf exploration: open the best n_h cells at depth h. [Figure: the partition tree, with the number of opened cells n_h shrinking as the depth h grows.]
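
A minimal sketch of this idea on [0, 1]: only the most promising cells are opened at each depth, and fewer of them as the depth grows. The objective f, the schedule n_h = n // (h + 1), and the binary splitting are illustrative assumptions, not the paper's exact choices.

```python
def f(x):                                     # illustrative black-box objective on [0, 1]
    return 1.0 - abs(x - 0.37)

def zipf_exploration(f, n=32, max_depth=6):
    """Open only the best n_h cells at each depth, with n_h shrinking in h."""
    opened = [(0.0, 1.0)]                     # cells opened at the previous depth
    best_val, best_x = f(0.5), 0.5
    for h in range(1, max_depth + 1):
        n_h = max(1, n // (h + 1))            # fewer cells opened as h grows (illustrative)
        children = []
        for left, right in opened:            # split every previously opened cell
            mid = (left + right) / 2.0
            children += [(left, mid), (mid, right)]
        # rank children by the value at their centre and keep the best n_h
        children.sort(key=lambda c: f((c[0] + c[1]) / 2.0), reverse=True)
        opened = children[:n_h]
        top_centre = (opened[0][0] + opened[0][1]) / 2.0
        if f(top_centre) > best_val:
            best_val, best_x = f(top_centre), top_centre
    return best_x, best_val

print(zipf_exploration(f))
```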

  7. Noisy case. We need to pull each point x more times to limit the uncertainty. Tradeoff: the more you pull each x, the shallower you can explore.
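
A small illustration of that tradeoff under an assumed binary partition: with m noisy pulls per cell and a total budget n, uniform exploration can only afford depths whose cumulative cell count, times m, fits in n.

```python
def max_uniform_depth(n, m):
    """Deepest depth h reachable when every cell up to depth h gets m pulls
    and the partition is binary (2**h cells at depth h)."""
    h = 0
    while m * (2 ** (h + 2) - 1) <= n:        # cells up to depth h+1 = 2**(h+2) - 1
        h += 1
    return h

for m in (1, 4, 16):                          # more pulls per cell => shallower exploration
    print(m, max_uniform_depth(n=1024, m=m))  # prints 9, 7, 5
```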

  8. Noisy case: StroquOOL (Bartlett et al., 2019). At depth h: order the cells by decreasing value, and open the i-th best cell with m = n_{h,i} estimations, a budget that shrinks with both the depth h and the rank i.
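
A minimal sketch of such a rank- and depth-dependent allocation; the exact schedule used here, m = n // (h * i), is an illustrative Zipf-style choice and not necessarily StroquOOL's precise constants.

```python
def zipf_allocation(n, depth, n_cells):
    """Pulls given to the i-th best cell (i = 1..n_cells) at a given depth:
    an illustrative m = n // (depth * i), decreasing in both depth and rank."""
    return [max(1, n // (depth * i)) for i in range(1, n_cells + 1)]

for h in (1, 2, 4):
    print(h, zipf_allocation(n=128, depth=h, n_cells=5))
# depth 1: [128, 64, 42, 32, 25]; deeper cells and lower ranks get fewer pulls
```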

  9. Black-box optimization vs. planning: reuse of samples and the discount γ. [Figure: the same tree viewed as a pure optimization problem (each sequence evaluated on its own, e.g. f_105, f_134) and as a planning problem (the per-step rewards r_1, ..., r_4 observed along a trajectory are reused by every sequence sharing the prefix).] K^H samples near the root: how many samples near the root are actually needed? Lower regret for planning! (Bubeck & Munos, 2010)
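
A minimal sketch of the sample-reuse idea: a single noisy rollout updates the reward estimate of every edge on its path, so edges near the root accumulate observations from every rollout passing through them. Edge names, reward values and the noise level are made up for illustration.

```python
import random
from collections import defaultdict

NOISE = 0.1                                    # illustrative noise range b
true_reward = defaultdict(lambda: 0.5)         # hypothetical per-edge mean rewards
sums, counts = defaultdict(float), defaultdict(int)

def rollout(path):
    """One trajectory: its noisy per-step rewards are credited to every edge
    on the path, hence to every sequence sharing a prefix with it."""
    for edge in path:
        observation = true_reward[edge] + random.uniform(-NOISE, NOISE)
        sums[edge] += observation
        counts[edge] += 1

for _ in range(100):                           # 100 rollouts through the same root edge
    rollout(["r03", random.choice(["r35", "r34"]), random.choice(["r56", "r57"])])

# The root edge r03 was observed by all 100 rollouts; deeper edges split the rest.
print({edge: counts[edge] for edge in sorted(counts)})
```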

  10. Black-box optimization vs. planning: reuse samples and take advantage of γ. [Figure: uniform exploration (not sharing information), where the near-root reward r_04 is re-estimated over and over, vs. Zipf exploration (sharing information), where only the best n_h cells are opened at each depth h.] Bubeck & Munos handled this only for uniform strategies... We figured out the amount of samples needed!

  11. PlaTγPOOS. The power of PlaTγPOOS:
  • implements Zipf exploration (as in StroquOOL) for MCTS,
  • explicitly pulls an action at depth h + 1 γ times less than an action at depth h, since Q*(x, a) = r(x, a) + sup_π Σ_t γ^t r(x_t, π(x_t)) discounts deeper rewards (see the sketch after this list),
  • does not use UCB, and makes no use of R_max and b,
  • improves over OLOP, with adaptation to low noise and to additional unknown smoothness,
  • gets exponential speedups when no noise is present!
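
A minimal sketch of that γ-scaled pull schedule: the per-action budget shrinks geometrically with depth. The base budget n0 and the rounding are illustrative; this is not the full PlaTγPOOS algorithm.

```python
GAMMA = 0.9                                    # illustrative discount

def pulls_per_depth(n0, max_depth, gamma=GAMMA):
    """Pulls given to an action at each depth: gamma times fewer per extra depth."""
    return [max(1, round(n0 * gamma ** h)) for h in range(max_depth + 1)]

print(pulls_per_depth(n0=64, max_depth=8))     # [64, 58, 52, 47, 42, 38, 34, 31, 28]
```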
