 
BayesOpt: hot topics and current challenges
Javier González
Masterclass, 7 February 2017 @Lancaster University
Agenda of the day
◮ 9:00-11:00, Introduction to Bayesian Optimization:
  ◮ What is BayesOpt and why does it work?
  ◮ Relevant things to know.
◮ 11:30-13:00, Connections, extensions and applications:
  ◮ Extensions to multi-task problems, constrained domains, early-stopping, high dimensions.
  ◮ Connections to multi-armed bandits and ABC.
  ◮ An application in genetics.
◮ 14:00-16:00, GPyOpt LAB!: Bring your own problem!
◮ 16:30-15:30, Hot topics and current challenges:
  ◮ Parallelization.
  ◮ Non-myopic methods.
  ◮ Interactive Bayesian Optimization.
Section III: Hot topics and challenges
◮ Parallel Bayesian Optimization.
◮ Non-myopic methods.
◮ Interactive Bayesian Optimization.
Scalable BO: Parallel/batch BO
Avoiding the bottleneck of evaluating f
◮ Cost of $f(\mathbf{x}_n)$ = cost of $\{f(\mathbf{x}_{n,1}), \dots, f(\mathbf{x}_{n,n_b})\}$.
◮ Many cores available, simultaneous lab experiments, etc.
Considerations when designing a batch
◮ Available pairs $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ are augmented with the evaluations of f on $B_t^{n_b} = \{\mathbf{x}_{t,1}, \dots, \mathbf{x}_{t,n_b}\}$.
◮ Goal: design $B_1^{n_b}, \dots, B_m^{n_b}$.
Notation:
◮ $\mathcal{I}_n$: represents the available data set $\mathcal{D}_n$ and the GP structure when n data points are available ($\mathcal{I}_{t,k}$ in the batch context).
◮ $\alpha(\mathbf{x}; \mathcal{I}_n)$: generic acquisition function given $\mathcal{I}_n$.
Optimal greedy batch design
Sequential policy: Maximize $\alpha(\mathbf{x}; \mathcal{I}_{t,0})$
Greedy batch policy, 1st element of the t-th batch: Maximize $\alpha(\mathbf{x}; \mathcal{I}_{t,0})$
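Optimal greedy batch design
Sequential policy: Maximize $\alpha(\mathbf{x}; \mathcal{I}_{t,0})$
Greedy batch policy, 2nd element of the t-th batch: Maximize
$\int \alpha(\mathbf{x}; \mathcal{I}_{t,1}) \, p(y_{t,1} \mid \mathbf{x}_{t,1}, \mathcal{I}_{t,0}) \, p(\mathbf{x}_{t,1} \mid \mathcal{I}_{t,0}) \, d\mathbf{x}_{t,1} \, dy_{t,1}$
◮ $p(y_{t,1} \mid \mathbf{x}_{t,1}, \mathcal{I}_{t,0})$: predictive distribution of the GP.
◮ $p(\mathbf{x}_{t,1} \mid \mathcal{I}_{t,0}) = \delta(\mathbf{x}_{t,1} - \arg\max_{\mathbf{x} \in \mathcal{X}} \alpha(\mathbf{x}; \mathcal{I}_{t,0}))$.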
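Optimal greedy batch design
Sequential policy: Maximize $\alpha(\mathbf{x}; \mathcal{I}_{t,k-1})$
Greedy batch policy, k-th element of the t-th batch: Maximize
$\int \alpha(\mathbf{x}; \mathcal{I}_{t,k-1}) \prod_{j=1}^{k-1} p(y_{t,j} \mid \mathbf{x}_{t,j}, \mathcal{I}_{t,j-1}) \, p(\mathbf{x}_{t,j} \mid \mathcal{I}_{t,j-1}) \, d\mathbf{x}_{t,j} \, dy_{t,j}$
◮ $p(y_{t,j} \mid \mathbf{x}_{t,j}, \mathcal{I}_{t,j-1})$: predictive distribution of the GP.
◮ $p(\mathbf{x}_{t,j} \mid \mathcal{I}_{t,j-1}) = \delta(\mathbf{x}_{t,j} - \arg\max_{\mathbf{x} \in \mathcal{X}} \alpha(\mathbf{x}; \mathcal{I}_{t,j-1}))$.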
Available approaches
[Azimi et al., 2010; Desautels et al., 2012; Chevalier et al., 2013; Contal et al., 2013]
◮ Exploratory approaches, reduction in system uncertainty.
◮ Generate 'fake' observations of f using $p(y_{t,j} \mid \mathbf{x}_{t,j}, \mathcal{I}_{t,j-1})$.
◮ Simultaneously optimize the elements of the batch using the joint distribution of $y_{t,1}, \dots, y_{t,n_b}$.
Bottleneck: all these methods require iteratively updating $p(y_{t,j} \mid \mathbf{x}_{t,j}, \mathcal{I}_{t,j-1})$ to model the interaction between the elements in the batch, at cost $O(n^3)$.
How to design batches while reducing this cost? Local penalization.
Goal: eliminate the marginalization step
"To develop a heuristic approximating the 'optimal batch design strategy' at lower computational cost, while incorporating information about global properties of f from the GP model into the batch design"
Lipschitz continuity: $|f(\mathbf{x}_1) - f(\mathbf{x}_2)| \le L \, \|\mathbf{x}_1 - \mathbf{x}_2\|_p$.
Interpretation of the Lipschitz continuity of f
$M = \max_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x})$ and $B_{r_{x_j}}(\mathbf{x}_j) = \{\mathbf{x} \in \mathcal{X} : \|\mathbf{x} - \mathbf{x}_j\| \le r_{x_j}\}$, where $r_{x_j} = \frac{M - f(\mathbf{x}_j)}{L}$.
[Figure: f(x) against x, showing the true function, samples, exclusion cones and active regions.]
$\mathbf{x}_M \notin B_{r_{x_j}}(\mathbf{x}_j)$; otherwise, the Lipschitz condition is violated.
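Probabilistic version of $B_{r_x}(\mathbf{x})$
We can do this because $f(\mathbf{x}) \sim \mathcal{GP}(\mu(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$.
◮ $r_{x_j}$ is Gaussian with $\mu(r_{x_j}) = \frac{M - \mu(\mathbf{x}_j)}{L}$ and $\sigma^2(r_{x_j}) = \frac{\sigma^2(\mathbf{x}_j)}{L^2}$.
Local penalizers: $\varphi(\mathbf{x}; \mathbf{x}_j) = p(\mathbf{x} \notin B_{r_{x_j}}(\mathbf{x}_j))$
$\varphi(\mathbf{x}; \mathbf{x}_j) = p(r_{x_j} < \|\mathbf{x} - \mathbf{x}_j\|) = 0.5 \, \mathrm{erfc}(-z)$
where $z = \frac{1}{\sqrt{2 \sigma_n^2(\mathbf{x}_j)}} \left( L \|\mathbf{x}_j - \mathbf{x}\| - M + \mu_n(\mathbf{x}_j) \right)$.
◮ Reflects the size of the 'Lipschitz' exclusion areas.
◮ Approaches 1 when x is far from $\mathbf{x}_j$ and decreases otherwise.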
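The penalizer formula above can be evaluated directly. The following is a minimal sketch, assuming the Euclidean norm and scalar posterior mean and variance at $\mathbf{x}_j$; the function name `local_penalizer` and the numbers in the example are illustrative, not taken from the slides.

```python
import math

def local_penalizer(x, x_j, mu_j, var_j, L, M):
    """phi(x; x_j) = 0.5 * erfc(-z), the probability that x lies outside
    the Lipschitz exclusion ball centred at x_j."""
    dist = math.dist(x, x_j)  # Euclidean norm ||x_j - x|| (assumption)
    z = (L * dist - M + mu_j) / math.sqrt(2.0 * var_j)
    return 0.5 * math.erfc(-z)

# Hypothetical values: posterior mean 0.0 and variance 0.25 at x_j, M = 1, L = 2.
x_j = (0.0, 0.0)
near = local_penalizer((0.1, 0.0), x_j, mu_j=0.0, var_j=0.25, L=2.0, M=1.0)
edge = local_penalizer((0.5, 0.0), x_j, mu_j=0.0, var_j=0.25, L=2.0, M=1.0)
far = local_penalizer((3.0, 0.0), x_j, mu_j=0.0, var_j=0.25, L=2.0, M=1.0)
print(near, edge, far)
```

With these values the mean exclusion radius is (M - mu_j) / L = 0.5, so the penalizer is exactly 0.5 at that distance, close to 0 inside the ball and close to 1 far away.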
Idea to collect the batches
Without explicitly using the model.
Optimal batch: maximization-marginalization
$\int \alpha(\mathbf{x}; \mathcal{I}_{t,k-1}) \prod_{j=1}^{k-1} p(y_{t,j} \mid \mathbf{x}_{t,j}, \mathcal{I}_{t,j-1}) \, p(\mathbf{x}_{t,j} \mid \mathcal{I}_{t,j-1}) \, d\mathbf{x}_{t,j} \, dy_{t,j}$
Proposal: maximization-penalization.
Use the $\varphi(\mathbf{x}; \mathbf{x}_j)$ to penalize the acquisition and predict the expected change in $\alpha(\mathbf{x}; \mathcal{I}_{t,k-1})$.
Local penalization strategy [González, Dai, Hennig, Lawrence, 2016]
[Figure: three panels (1st, 2nd and 3rd batch element) plotting value against x. Panel 1 shows $\alpha(x)$; panel 2 shows $\alpha(x)$, $\varphi_1(x)$ and $\alpha(x)\varphi_1(x)$; panel 3 shows $\alpha(x)\varphi_1(x)$, $\varphi_2(x)$ and $\alpha(x)\varphi_1(x)\varphi_2(x)$.]
The maximization-penalization strategy selects $\mathbf{x}_{t,k}$ as
$\mathbf{x}_{t,k} = \arg\max_{\mathbf{x} \in \mathcal{X}} \left\{ g(\alpha(\mathbf{x}; \mathcal{I}_{t,0})) \prod_{j=1}^{k-1} \varphi(\mathbf{x}; \mathbf{x}_{t,j}) \right\}$,
where g is a transformation of $\alpha(\mathbf{x}; \mathcal{I}_{t,0})$ that makes it always positive.
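The maximization-penalization rule can be sketched on a one-dimensional candidate grid. This is an illustration under stated assumptions, not the authors' implementation: g is taken to be the softplus, the arg max is a grid search, and the acquisition `toy_alpha`, the constant posterior mean and variance, and the values of L and M are all hypothetical.

```python
import math

def softplus(a):
    """One possible choice of g: maps any acquisition value to a positive number."""
    return math.log1p(math.exp(-abs(a))) + max(a, 0.0)

def penalizer(x, x_j, mu_j, var_j, L, M):
    """Local penalizer phi(x; x_j) = 0.5 * erfc(-z) in one dimension."""
    z = (L * abs(x - x_j) - M + mu_j) / math.sqrt(2.0 * var_j)
    return 0.5 * math.erfc(-z)

def select_batch(grid, alpha, mu, var, L, M, batch_size):
    """Greedy maximization-penalization: alpha is fixed at I_{t,0} and each
    new element is penalized around the elements already in the batch."""
    batch = []
    for _ in range(batch_size):
        def penalized(x):
            value = softplus(alpha(x))
            for x_j in batch:
                value *= penalizer(x, x_j, mu(x_j), var(x_j), L, M)
            return value
        batch.append(max(grid, key=penalized))
    return batch

# Hypothetical acquisition with a main bump at x = 2 and a smaller one at x = -3.
def toy_alpha(x):
    return math.exp(-(x - 2.0) ** 2) + 0.8 * math.exp(-(x + 3.0) ** 2)

grid = [-10.0 + 0.1 * i for i in range(201)]
batch = select_batch(grid, toy_alpha, mu=lambda x: 0.0, var=lambda x: 0.25,
                     L=1.0, M=1.0, batch_size=3)
print(batch)
```

The first element sits at the global maximum of the acquisition; the penalizer then suppresses that neighbourhood, so the second element moves to the other bump without any update of the GP.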
Examples for L = 50, 100, 150 and 250
L controls the exploration-exploitation balance within the batch.
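Finding a unique Lipschitz constant
Let $f : \mathcal{X} \to \mathbb{R}$ be an L-Lipschitz continuous function defined on a compact subset $\mathcal{X} \subseteq \mathbb{R}^D$. Then
$L_p = \max_{\mathbf{x} \in \mathcal{X}} \|\nabla f(\mathbf{x})\|_p$
is a valid Lipschitz constant.
The gradient of f at $\mathbf{x}_*$ is distributed as a multivariate Gaussian:
$\nabla f(\mathbf{x}_*) \mid \mathbf{X}, \mathbf{y}, \mathbf{x}_* \sim \mathcal{N}(\mu_\nabla(\mathbf{x}_*), \Sigma^2_\nabla(\mathbf{x}_*))$
We choose: $\hat{L} = \max_{\mathcal{X}} \|\mu_\nabla(\mathbf{x}_*)\|$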
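The choice of $\hat{L}$ amounts to maximizing the norm of the posterior-mean gradient over the domain. A minimal sketch follows, assuming the Euclidean norm and a grid search in place of a continuous optimizer; `toy_grad_mean` is a hypothetical stand-in for $\mu_\nabla$, which in practice would come from the fitted GP.

```python
import math

def estimate_lipschitz(grad_mean, candidates):
    """L_hat = max over candidates of the Euclidean norm of grad_mean(x)."""
    return max(math.hypot(*grad_mean(x)) for x in candidates)

# Stand-in for mu_grad: gradient of mu(x) = sin(3 * x1) + x2**2 / 2 (assumption).
def toy_grad_mean(x):
    return (3.0 * math.cos(3.0 * x[0]), x[1])

# Candidate grid on [-1, 1] x [-1, 1] with step 0.1.
grid = [(i / 10.0, j / 10.0) for i in range(-10, 11) for j in range(-10, 11)]
L_hat = estimate_lipschitz(toy_grad_mean, grid)
print(L_hat)
```

A grid maximum can only underestimate the true maximum over the domain, so a finer grid or a gradient-based optimizer started from the best grid point would be a natural refinement.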
Experiments: Sobol function
Best (average) result for a given time budget.
2D experiment with 'large domain'
Comparison in terms of wall-clock time.
[Figure: best found value against time (seconds) for EI, UCB, Rand-EI, Rand-UCB, SM-UCB, B-UCB, PE-UCB, Pred-EI, Pred-UCB, qEI, LP-EI and LP-UCB.]
Myopia of optimisation techniques
◮ Most global optimisation techniques are myopic, in considering no more than a single step into the future.
◮ Relieving this myopia requires solving the multi-step lookahead problem.
Figure: Two evaluations; if the first evaluation is made myopically, the second must be sub-optimal.
Non-myopic thinking
Thinking non-myopically matters: it lets our decisions account for the (limited) resources available to solve a given problem.
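Acquisition function: expected loss [Osborne, 2010]
Loss of evaluating f at $\mathbf{x}_*$, assuming it returns $y_*$:
$\lambda(y_*) \triangleq \begin{cases} y_*, & \text{if } y_* \le \eta \\ \eta, & \text{if } y_* > \eta \end{cases}$
where $\eta = \min\{\mathbf{y}_0\}$, the current best found value.
The loss expectation is:
$\Lambda_1(\mathbf{x}_* \mid \mathcal{I}_0) \triangleq \mathbb{E}[\min(y_*, \eta)] = \int \lambda(y_*) \, p(y_* \mid \mathbf{x}_*, \mathcal{I}_0) \, dy_*$
$\mathcal{I}_0$ is the current information: $\mathcal{D}$, $\theta$ and likelihood type.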
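When the predictive distribution $p(y_* \mid \mathbf{x}_*, \mathcal{I}_0)$ is Gaussian with mean $\mu$ and standard deviation $\sigma$, the one-step expected loss has the standard closed form $\eta + (\mu - \eta)\Phi(u) - \sigma\phi(u)$ with $u = (\eta - \mu)/\sigma$. The sketch below assumes that Gaussian form and checks it against a plain Monte Carlo average; the function names and the test values are illustrative.

```python
import math
import random

def expected_loss(mu, sigma, eta):
    """Closed form of E[min(y, eta)] for y ~ N(mu, sigma^2)."""
    u = (eta - mu) / sigma
    cdf = 0.5 * math.erfc(-u / math.sqrt(2.0))              # Phi(u)
    pdf = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)  # phi(u)
    return eta + (mu - eta) * cdf - sigma * pdf

def expected_loss_mc(mu, sigma, eta, n_samples=200_000, seed=0):
    """Monte Carlo estimate of the same expectation, used as a sanity check."""
    rng = random.Random(seed)
    total = sum(min(rng.gauss(mu, sigma), eta) for _ in range(n_samples))
    return total / n_samples

# Hypothetical predictive mean 0.3, standard deviation 0.5, current best 0.0.
exact = expected_loss(0.3, 0.5, 0.0)
approx = expected_loss_mc(0.3, 0.5, 0.0)
print(exact, approx)
```

The expected loss never exceeds $\min(\mu, \eta)$, which is why minimizing it favours points that are either promising in the mean or uncertain enough to beat the current best.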
The expected loss (improvement) is myopic
◮ Selects the next evaluation as if it were the last one.
◮ The remaining available budget is not taken into account when deciding where to evaluate.
How can the effect of future evaluations be taken into account in the decision?
Expected loss with n steps ahead
Intractable even for a handful of steps ahead:
$\Lambda_n(\mathbf{x}_* \mid \mathcal{I}_0) = \int \lambda(y_n) \prod_{j=1}^{n} p(y_j \mid \mathbf{x}_j, \mathcal{I}_{j-1}) \, p(\mathbf{x}_j \mid \mathcal{I}_{j-1}) \, dy_* \dots dy_n \, d\mathbf{x}_2 \dots d\mathbf{x}_n$
◮ $p(y_j \mid \mathbf{x}_j, \mathcal{I}_{j-1})$: predictive distribution of the GP at $\mathbf{x}_j$.
◮ $p(\mathbf{x}_j \mid \mathcal{I}_{j-1})$: optimisation step.
Relieving the myopia of Bayesian optimisation
We present... GLASSES!
Global optimisation with Look-Ahead through Stochastic Simulation and Expected-loss Search
GLASSES: Rendering the approximation sparse
Idea: jointly model the epistemic uncertainty about the steps ahead using some point process.
$\Gamma_n(\mathbf{x}_* \mid \mathcal{I}_0) = \int \lambda(y_n) \, p(\mathbf{y} \mid \mathbf{X}, \mathcal{I}_0, \mathbf{x}_*) \, p(\mathbf{X} \mid \mathcal{I}_0, \mathbf{x}_*) \, d\mathbf{y} \, d\mathbf{X}$
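GLASSES: Technical details
Selecting a good $p(\mathbf{X} \mid \mathcal{I}_0, \mathbf{x}_*)$ is complicated.
◮ Replace integrating over $p(\mathbf{X} \mid \mathcal{I}_0, \mathbf{x}_*)$ by conditioning on an oracle predictor $\mathcal{F}_n(\mathbf{x}_*)$ of the n future locations.
◮ $\mathbf{y} = (y_*, \dots, y_n)^T$: Gaussian outputs of f at $\mathcal{F}_n(\mathbf{x}_*)$.
◮ $\Lambda_n(\mathbf{x}_* \mid \mathcal{I}_0, \mathcal{F}_n(\mathbf{x}_*)) = \Gamma_n(\mathbf{x}_* \mid \mathcal{I}_0, \mathcal{F}_n(\mathbf{x}_*)) = \mathbb{E}[\min(\mathbf{y}, \eta)]$.
◮ $\mathbb{E}[\min(\mathbf{y}, \eta)]$ is computed using Expectation Propagation.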
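To make the quantity $\mathbb{E}[\min(\mathbf{y}, \eta)]$ concrete, the sketch below estimates it by plain Monte Carlo sampling. This is a swapped-in technique for illustration only: the slides compute the expectation with Expectation Propagation, not sampling. It also assumes that the minimum is taken over all elements of $\mathbf{y}$ together with $\eta$, and the mean vector and covariance used in the example are hypothetical.

```python
import math
import random

def cholesky(K):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(K)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(C[i][k] * C[j][k] for k in range(j))
            C[i][j] = math.sqrt(s) if i == j else s / C[j][j]
    return C

def lookahead_loss_mc(mean, cov, eta, n_samples=50_000, seed=0):
    """Monte Carlo estimate of E[min(y_*, ..., y_n, eta)] for y ~ N(mean, cov)."""
    rng = random.Random(seed)
    C = cholesky(cov)
    n = len(mean)
    total = 0.0
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # y = mean + C @ eps is one joint draw at the predicted future locations.
        y = [mean[i] + sum(C[i][k] * eps[k] for k in range(i + 1)) for i in range(n)]
        total += min(min(y), eta)
    return total / n_samples

# Hypothetical joint predictive over three future locations, current best eta = 0.
one_step = lookahead_loss_mc([0.3], [[0.25]], eta=0.0)
three_steps = lookahead_loss_mc(
    [0.3, 0.3, 0.3],
    [[0.25, 0.05, 0.0], [0.05, 0.25, 0.05], [0.0, 0.05, 0.25]],
    eta=0.0,
)
print(one_step, three_steps)
```

With more steps ahead the expected minimum can only decrease, which is how the remaining budget enters the acquisition.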