
Planning and Optimization
G8. Monte-Carlo Tree Search Algorithms (Part II)

Malte Helmert and Thomas Keller
Universität Basel

December 16, 2019


G8.1 ε-greedy
G8.2 Softmax
G8.3 UCB1
G8.4 Summary


Content of this Course

• Planning
  • Classical: Foundations, Logic, Heuristics, Constraints
  • Probabilistic: Explicit MDPs, Factored MDPs


Content of this Course: Factored MDPs

• Factored MDPs: Foundations, Heuristic Search, Monte-Carlo Methods
  • Monte-Carlo Methods: Suboptimal Algorithms, MCTS



G8.1 ε-greedy



ε-greedy: Idea

• tree policy parametrized with constant parameter ε
• with probability 1 − ε, pick one of the greedy actions uniformly at random
• otherwise, pick one of the non-greedy actions uniformly at random

ε-greedy Tree Policy

$$\pi(a \mid d) = \begin{cases} \dfrac{1-\varepsilon}{|L_\star^k(d)|} & \text{if } a \in L_\star^k(d) \\[6pt] \dfrac{\varepsilon}{|L(s(d)) \setminus L_\star^k(d)|} & \text{otherwise,} \end{cases}$$

with $L_\star^k(d) = \{ a(c) \in L(s(d)) \mid c \in \arg\min_{c' \in \mathrm{children}(d)} \hat{Q}^k(c') \}$.



ε-greedy: Example

Decision node d with children c1, . . . , c4 and action-value estimates
Q̂(c1) = 6, Q̂(c2) = 12, Q̂(c3) = 6, Q̂(c4) = 9.

Assuming a(ci) = ai and ε = 0.2, we get:

• π(a1 | d) = 0.4
• π(a2 | d) = 0.1
• π(a3 | d) = 0.4
• π(a4 | d) = 0.1
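To make the selection rule concrete, here is a minimal Python sketch of the ε-greedy tree policy (function and variable names are illustrative, not from the lecture); it reproduces the probabilities of the example above.

```python
import random

def epsilon_greedy_probs(q_values, epsilon):
    """Selection probabilities of the epsilon-greedy tree policy.

    q_values: action-value estimates Q^k(c) for the children of d.
    Since we minimize cost, the greedy actions are those with minimal
    estimate; they share probability mass 1 - epsilon, and all
    remaining actions share epsilon.
    """
    q_min = min(q_values)
    greedy = [i for i, q in enumerate(q_values) if q == q_min]
    non_greedy = [i for i, q in enumerate(q_values) if q != q_min]
    if not non_greedy:  # all actions greedy: pick uniformly at random
        return [1.0 / len(q_values)] * len(q_values)
    probs = [0.0] * len(q_values)
    for i in greedy:
        probs[i] = (1.0 - epsilon) / len(greedy)
    for i in non_greedy:
        probs[i] = epsilon / len(non_greedy)
    return probs

def epsilon_greedy_select(q_values, epsilon):
    """Sample a child index according to the policy."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return random.choices(range(len(q_values)), weights=probs)[0]

# Example from above: Q(c1) = 6, Q(c2) = 12, Q(c3) = 6, Q(c4) = 9
print(epsilon_greedy_probs([6, 12, 6, 9], epsilon=0.2))
# -> [0.4, 0.1, 0.4, 0.1]
```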



ε-greedy: Asymptotic Optimality

Asymptotic Optimality of ε-greedy

• explores forever
• not greedy in the limit
⇒ not asymptotically optimal

Asymptotically optimal variant uses decaying ε, e.g., ε = 1/k.



ε-greedy: Weakness

Problem: when ε-greedy explores, all non-greedy actions are treated equally.

Decision node d with children c1, . . . , cℓ+2 and action-value estimates
Q̂(c1) = 8, Q̂(c2) = 9 and Q̂(c3) = · · · = Q̂(cℓ+2) = 50
(i.e., ℓ children with estimate 50).

Assuming a(ci) = ai, ε = 0.2 and ℓ = 9, we get:

• π(a1 | d) = 0.8
• π(a2 | d) = π(a3 | d) = · · · = π(a11 | d) = 0.02



G8.2 Softmax



Softmax: Idea

• tree policy with constant parameter τ
• select actions proportionally to their action-value estimate
• most popular softmax tree policy uses Boltzmann exploration
• ⇒ selects actions proportionally to $e^{-\hat{Q}^k(c)/\tau}$

Tree Policy based on Boltzmann Exploration

$$\pi(a(c) \mid d) = \frac{e^{-\hat{Q}^k(c)/\tau}}{\sum_{c' \in \mathrm{children}(d)} e^{-\hat{Q}^k(c')/\tau}}$$



Softmax: Example

Same decision node d as before: children c1, . . . , cℓ+2 with
Q̂(c1) = 8, Q̂(c2) = 9 and Q̂(c3) = · · · = Q̂(cℓ+2) = 50 (ℓ children with estimate 50).

Assuming a(ci) = ai, τ = 10 and ℓ = 9, we get:

• π(a1 | d) ≈ 0.49
• π(a2 | d) ≈ 0.45
• π(a3 | d) = · · · = π(a11 | d) ≈ 0.007
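A minimal Python sketch of Boltzmann exploration (names are again illustrative); with the values above it reproduces the probabilities up to rounding.

```python
import math

def boltzmann_probs(q_values, tau):
    """Boltzmann exploration: probability of child c is proportional
    to exp(-Q^k(c) / tau); costs are minimized, hence the minus sign."""
    # Subtract the minimum estimate first for numerical stability;
    # this does not change the resulting distribution.
    q_min = min(q_values)
    weights = [math.exp(-(q - q_min) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

# Example from above: Q(c1) = 8, Q(c2) = 9, nine children with Q = 50
q = [8, 9] + [50] * 9
print([round(p, 3) for p in boltzmann_probs(q, tau=10)])
# -> [0.49, 0.444, 0.007, ..., 0.007]
```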



Boltzmann Exploration: Asymptotic Optimality

Asymptotic Optimality of Boltzmann Exploration

• explores forever
• not greedy in the limit:
  • state- and action-value estimates converge to finite values
  • therefore, selection probabilities also converge to positive, finite values
⇒ not asymptotically optimal

Asymptotically optimal variant uses decaying τ, e.g., τ = 1/log k.

Careful: τ must not decay faster than logarithmically
(i.e., must have τ ≥ const/log k) to explore infinitely.
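As a concrete illustration, a small sketch of the two decaying schedules mentioned here and in G8.1 (function names and the constant are illustrative):

```python
import math

def epsilon_k(k):
    """Decaying epsilon (G8.1): explores forever and is greedy in the
    limit, making epsilon-greedy asymptotically optimal."""
    return 1.0 / k

def tau_k(k, const=1.0):
    """Decaying temperature for Boltzmann exploration; decaying faster
    than const / log k would break infinite exploration."""
    return const / math.log(k) if k > 1 else float("inf")
```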



Boltzmann Exploration: Weakness

[Figure: two scenarios, each plotting the cost distribution P of actions a1, a2, a3; the action-value estimates are equal, but the variances differ]

• Boltzmann exploration and ε-greedy only consider the mean of the sampled action-values
• as we sample the same node many times, we can also gather information about the variance (i.e., how reliable the information is)
• Boltzmann exploration ignores the variance, treating the two scenarios equally



G8.3 UCB1



Upper Confidence Bounds: Idea

Balance exploration and exploitation by preferring actions that

• have been successful in earlier iterations (exploit)
• have been selected rarely (explore)



Upper Confidence Bounds: Idea

• select successor c of d that minimizes Q̂^k(c) − E^k(d) · B^k(c), based on
  • action-value estimate Q̂^k(c),
  • exploration factor E^k(d) and
  • bonus term B^k(c)
• select B^k(c) such that Q⋆(s(c), a(c)) ≥ Q̂^k(c) − E^k(d) · B^k(c) with high probability
• idea: Q̂^k(c) − E^k(d) · B^k(c) is a lower confidence bound on Q⋆(s(c), a(c)) under the collected information


Bonus Term of UCB1

• use as bonus term:

$$B^k(c) = \sqrt{\frac{2 \cdot \ln N^k(d)}{N^k(c)}}$$

• bonus term is derived from the Chernoff-Hoeffding bound:
  • it bounds the probability that a sampled value (here: Q̂^k(c)) is far from its true expected value (here: Q⋆(s(c), a(c)))
  • in dependence of the number of samples (here: N^k(c))
• UCB1 picks the optimal action exponentially more often than any other action
• the concrete MCTS algorithm that uses UCB1 is called UCT
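A minimal Python sketch of UCB1 child selection for cost-minimizing MCTS (names are illustrative; the exploration factor E^k(d) is left as a parameter, as discussed on the next two slides):

```python
import math

def ucb1_select(children, n_visits_d, exploration_factor):
    """UCB1 child selection for a decision node d.

    children: list of (q_estimate, n_visits) pairs, one per child,
    each visited at least once. Returns the index of the child
    minimizing Q^k(c) - E^k(d) * B^k(c), where the bonus term
    B^k(c) = sqrt(2 * ln N^k(d) / N^k(c)) stems from the
    Chernoff-Hoeffding bound.
    """
    def lower_confidence_bound(child):
        q, n = child
        bonus = math.sqrt(2.0 * math.log(n_visits_d) / n)
        return q - exploration_factor * bonus

    return min(range(len(children)),
               key=lambda i: lower_confidence_bound(children[i]))

# Hypothetical example: three children with estimates and visit counts.
children = [(8.0, 10), (9.0, 5), (12.0, 1)]
print(ucb1_select(children, n_visits_d=16, exploration_factor=8.0))
# -> 2: the rarely visited child wins despite its poor estimate
```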



Exploration Factor (1)

Exploration factor E^k(d) serves two roles in SSPs:

• UCB1 is designed for MABs with rewards in [0, 1]
  ⇒ Q̂^k(c) ∈ [0, 1] for all k and c
• bonus term B^k(c) = √(2 · ln N^k(d) / N^k(c)) is always ≥ 0
• when d is visited:
  • B^{k+1}(c) > B^k(c) if a(c) is not selected
  • B^{k+1}(c) < B^k(c) if a(c) is selected
• if B^k(c) ≥ 2 for some c, UCB1 must explore
• hence, Q̂^k(c) and B^k(c) are always of similar size
⇒ in SSPs, where Q̂^k(c) is not confined to [0, 1], set E^k(d) to a value that depends on V̂^k(d)



Exploration Factor (2)

Exploration factor E^k(d) serves two roles in SSPs:

• E^k(d) allows adjusting the balance between exploration and exploitation
• search with E^k(d) = V̂^k(d) is very greedy
• in practice, E^k(d) is therefore often multiplied with a constant > 1 (see the sketch below)
• UCB1 often requires a hand-tailored E^k(d) to work well
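Continuing the UCB1 sketch from above, one plausible (not prescribed) instantiation of the exploration factor; the constant and the value estimate are hypothetical:

```python
# Hypothetical values: v_hat_d is the current state-value estimate of d.
c = 2.0        # hand-tuned constant > 1
v_hat_d = 8.5  # assumed estimate V^k(d)
best_child = ucb1_select(children, n_visits_d=16,
                         exploration_factor=c * v_hat_d)
```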



Asymptotic Optimality

Asymptotic Optimality of UCB1

• explores forever
• greedy in the limit
⇒ asymptotically optimal

However:

• there is no theoretical justification to use UCB1 for SSPs/MDPs (the MAB proof requires stationary rewards)
• the development of tree policies is an active research topic



Symmetric Search Tree up to depth 4

[Figure: full symmetric search tree up to depth 4]



Asymmetric Search Tree of UCB1

[Figure: asymmetric search tree grown by UCB1, with an equal number of search nodes as the symmetric tree above]



G8.4 Summary



Summary

• ε-greedy, Boltzmann exploration and UCB1 balance exploration and exploitation
• ε-greedy selects a greedy action with probability 1 − ε and another action uniformly at random otherwise
• ε-greedy selects all non-greedy actions with the same probability
• Boltzmann exploration selects each action proportionally to its action-value estimate
• Boltzmann exploration does not take the confidence of the estimate into account
• UCB1 selects actions greedily w.r.t. an upper confidence bound on the action-value estimate