Bayesian Optimization and Automated Machine Learning




  1. Bayesian Optimization and Automated Machine Learning. Jungtaek Kim (jtkim@postech.ac.kr), Machine Learning Group, Department of Computer Science and Engineering, POSTECH, 77 Cheongam-ro, Nam-gu, Pohang 37673, Gyeongsangbuk-do, Republic of Korea. June 12, 2018.

  2. Table of Contents
◮ Bayesian Optimization: Global Optimization, Bayesian Optimization, Background: Gaussian Process Regression, Acquisition Function, Synthetic Examples, bayeso
◮ Automated Machine Learning: Automated Machine Learning, Previous Works, AutoML Challenge 2018, Automated Machine Learning for Soft Voting in an Ensemble of Tree-based Classifiers, AutoML Challenge 2018 Result
◮ References

  3. Bayesian Optimization

  4. Global Optimization
From Wikipedia ( https://en.wikipedia.org/wiki/Local_optimum )
◮ A method to find the global minimum or maximum of a given target function:
x* = arg min_x L(x), or x* = arg max_x L(x).

  5. Target Functions in Bayesian Optimization
◮ Usually an expensive black-box function.
◮ Unknown functional form and unknown local geometric features such as saddle points, global optima, and local optima.
◮ Uncertain function continuity.
◮ High-dimensional and mixed-variable domain space.

  6. Bayesian Approach
◮ In Bayesian inference, given prior knowledge of the parameters, p(θ | λ), and a likelihood over the dataset conditioned on the parameters, p(D | θ, λ), the posterior distribution is
p(θ | D, λ) = p(D | θ, λ) p(θ | λ) / p(D | λ) = p(D | θ, λ) p(θ | λ) / ∫ p(D | θ, λ) p(θ | λ) dθ,
where θ is a vector of parameters, D is an observed dataset, and λ is a vector of hyperparameters.
◮ Produces an uncertainty estimate as well as a prediction.

  7. Bayesian Optimization
◮ A powerful strategy for finding the extrema of objective functions that are expensive to evaluate,
◮ where one does not have a closed-form expression for the objective function,
◮ but where one can obtain observations at sampled values.
◮ Since the target function is unknown, we optimize an acquisition function instead of the target function itself.
◮ The acquisition function is computed from the outputs of a Bayesian regression model.

  8. Bayesian Optimization
Algorithm 1 Bayesian Optimization
Input: initial data D_{1:I} = {(x_i, y_i)}_{i=1}^{I}.
1: for t = 1, 2, . . . do
2:   Predict a function f*(x | D_{1:I+t−1}) considered as the objective function.
3:   Find x_{I+t} that maximizes an acquisition function, x_{I+t} = arg max_x a(x | D_{1:I+t−1}).
4:   Sample the true objective function, y_{I+t} = f(x_{I+t}) + ε_{I+t}.
5:   Update D_{1:I+t} = D_{1:I+t−1} ∪ {(x_{I+t}, y_{I+t})}.
6: end for
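To make the loop concrete, below is a minimal Python sketch of Algorithm 1. It assumes a hypothetical surrogate-fitting helper fit_gp and an acquisition function acq; the candidate-grid maximization of the acquisition and the 1-D box bounds are illustrative simplifications, not the implementation used in bayeso.

```python
import numpy as np

def bayesian_optimization(f, bounds, X_init, y_init, fit_gp, acq, num_iter=10):
    """Minimal Bayesian optimization loop (sketch of Algorithm 1).

    f       : true objective, evaluated only once per iteration
    bounds  : (low, high) box constraints for a 1-D domain
    fit_gp  : assumed helper returning posterior mean/std functions given data
    acq     : acquisition function a(mu, sigma, y_best) -> scores
    """
    X, y = list(X_init), list(y_init)
    for t in range(num_iter):
        # Step 2: fit the surrogate model on all observations so far.
        mu_fn, sigma_fn = fit_gp(np.array(X), np.array(y))
        # Step 3: maximize the acquisition function over a dense candidate grid
        # (a crude stand-in for a proper inner optimizer).
        candidates = np.linspace(bounds[0], bounds[1], 1000)
        scores = acq(mu_fn(candidates), sigma_fn(candidates), np.min(y))
        x_next = candidates[np.argmax(scores)]
        # Step 4: evaluate the expensive objective at the chosen point.
        y_next = f(x_next)
        # Step 5: augment the dataset and repeat.
        X.append(x_next)
        y.append(y_next)
    best = int(np.argmin(y))
    return X[best], y[best]
```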

  9. Background: Gaussian Process
◮ A collection of random variables, any finite number of which have a joint Gaussian distribution [Rasmussen and Williams, 2006].
◮ Generally, a Gaussian process (GP) is written as f ∼ GP(m(x), k(x, x′)), where
m(x) = E[f(x)],
k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))].

  10. Background: Gaussian Process Regression
[Figure: a Gaussian process regression example plotted over x ∈ [−3, 3], y ∈ [−1, 1].]

  11. Background: Gaussian Process Regression
◮ One of the basic covariance functions, the squared-exponential covariance function in one dimension:
k(x, x′) = σ_f² exp(−(x − x′)² / (2 l²)) + σ_n² δ_{x x′},
where σ_f is the signal standard deviation, l is the length scale, and σ_n is the noise standard deviation [Rasmussen and Williams, 2006].
◮ Posterior mean function and covariance function:
µ* = K(X*, X) (K(X, X) + σ_n² I)⁻¹ y,
Σ* = K(X*, X*) − K(X*, X) (K(X, X) + σ_n² I)⁻¹ K(X, X*).

  12. Background: Gaussian Process Regression
◮ If a non-zero mean prior is given, the posterior mean and covariance functions are
µ* = µ(X*) + K(X*, X) (K(X, X) + σ_n² I)⁻¹ (y − µ(X)),
Σ* = K(X*, X*) − K(X*, X) (K(X, X) + σ_n² I)⁻¹ K(X, X*).
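As a sanity check on these formulas, here is a small NumPy sketch that computes the posterior mean and covariance under a squared-exponential kernel with an optional non-zero prior mean. The function names and the toy data are assumptions for illustration, not code from the slides or from bayeso.

```python
import numpy as np

def sq_exp_kernel(X1, X2, sigma_f=1.0, length=1.0):
    """Squared-exponential covariance: sigma_f^2 exp(-(x - x')^2 / (2 l^2)) for 1-D inputs."""
    sq_dists = (X1[:, None] - X2[None, :]) ** 2
    return sigma_f ** 2 * np.exp(-0.5 * sq_dists / length ** 2)

def gp_posterior(X, y, X_star, sigma_n=0.1, prior_mean=lambda x: np.zeros_like(x)):
    """Posterior mean and covariance of a GP with noise variance sigma_n^2."""
    K = sq_exp_kernel(X, X) + sigma_n ** 2 * np.eye(len(X))
    K_star = sq_exp_kernel(X_star, X)
    K_star_star = sq_exp_kernel(X_star, X_star)
    K_inv = np.linalg.inv(K)  # a Cholesky solve is preferred in practice
    mu_star = prior_mean(X_star) + K_star @ K_inv @ (y - prior_mean(X))
    Sigma_star = K_star_star - K_star @ K_inv @ K_star.T
    return mu_star, Sigma_star

# Toy usage on the function from Figure 1, restricted to a few observed points.
X = np.array([-2.0, 0.0, 1.5])
y = 4.0 * np.cos(X) + 0.1 * X + 2.0 * np.sin(X) + 0.4 * (X - 0.5) ** 2
mu, Sigma = gp_posterior(X, y, np.linspace(-3, 3, 50))
```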

  13. Acquisition Functions
◮ A function that determines the next point at which to evaluate the expensive black-box function.
◮ Traditionally, the probability of improvement (PI) [Kushner, 1964], the expected improvement (EI) [Mockus et al., 1978], and the GP upper confidence bound (GP-UCB) [Srinivas et al., 2010] are used.
◮ Several functions such as entropy search [Hennig and Schuler, 2012] and a combination of existing functions [Kim and Choi, 2018b] have recently been proposed.

  14. Traditional Acquisition Functions (Minimization Case)
◮ PI [Kushner, 1964]: a_PI(x | D, λ) = Φ(Z),
◮ EI [Mockus et al., 1978]: a_EI(x | D, λ) = (f(x⁺) − µ(x)) Φ(Z) + σ(x) φ(Z) if σ(x) > 0, and 0 if σ(x) = 0,
◮ GP-UCB [Srinivas et al., 2010]: a_UCB(x | D, λ) = −µ(x) + β σ(x),
where
Z = (f(x⁺) − µ(x)) / σ(x) if σ(x) > 0, and 0 if σ(x) = 0,
µ(x) := µ(x | D, λ), σ(x) := σ(x | D, λ),
and f(x⁺) is the best (smallest) observation so far.
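Below is a minimal NumPy/SciPy sketch of these three acquisition functions for the minimization case; y_best plays the role of f(x⁺), and the function names are mine rather than those used in bayeso.

```python
import numpy as np
from scipy.stats import norm

def _z(mu, sigma, y_best):
    # Z = (f(x+) - mu(x)) / sigma(x) where sigma(x) > 0, and 0 otherwise.
    return np.where(sigma > 0, (y_best - mu) / np.maximum(sigma, 1e-12), 0.0)

def acq_pi(mu, sigma, y_best):
    """Probability of improvement (minimization)."""
    return norm.cdf(_z(mu, sigma, y_best))

def acq_ei(mu, sigma, y_best):
    """Expected improvement (minimization); zero wherever sigma(x) = 0."""
    z = _z(mu, sigma, y_best)
    ei = (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    return np.where(sigma > 0, ei, 0.0)

def acq_ucb(mu, sigma, beta=2.0):
    """GP-UCB written for minimization: maximize -mu(x) + beta * sigma(x)."""
    return -mu + beta * sigma

# Any of these can be plugged in as `acq` in the loop sketched earlier,
# e.g. acq = lambda mu, sigma, y_best: acq_ei(mu, sigma, y_best).
```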

  15. Synthetic Examples
Figure 1: y = 4.0 cos(x) + 0.1 x + 2.0 sin(x) + 0.4 (x − 0.5)². EI is used to optimize. [Six panels show the surrogate and the acquisition function over x ∈ [−5, 5] at iterations 1–6.]

  16. bayeso
◮ Simple, but essential Bayesian optimization package.
◮ Written in Python.
◮ Licensed under the MIT license.
◮ https://github.com/jungtaekkim/bayeso

  17. Automated Machine Learning

  18. Automated Machine Learning
◮ Attempts to automatically find the optimal machine learning model without human intervention.
◮ Usually includes feature transformation, algorithm selection, and hyperparameter optimization.
◮ Given a training dataset D_train and a validation dataset D_val, the optimal hyperparameter vector λ* for an automated machine learning system is
λ* = AutoML(D_train, D_val, Λ),
where AutoML is an automated machine learning system and λ ∈ Λ.

  19. Previous Works
◮ Bayesian optimization and hyperparameter optimization:
  ◮ GPyOpt [The GPyOpt authors, 2016]
  ◮ SMAC [Hutter et al., 2011]
  ◮ BayesOpt [Martinez-Cantin, 2014]
  ◮ bayeso
  ◮ SigOpt API [Martinez-Cantin et al., 2018]
◮ Automated machine learning frameworks:
  ◮ auto-sklearn [Feurer et al., 2015]
  ◮ Auto-WEKA [Thornton et al., 2013]
  ◮ Our previous work [Kim et al., 2016]

  20. AutoML Challenge 2018
◮ Two phases: the feedback phase and the AutoML challenge phase.
◮ The feedback phase provides five datasets for binary classification.
◮ Given training/validation/test datasets, after a code or prediction file is submitted, a validation measure is posted on the leaderboard.
◮ The AutoML challenge phase determines the challenge winners by comparing a normalized area under the ROC curve (AUC) metric on blind datasets:
Normalized AUC = 2 · AUC − 1.
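For reference, a one-line computation of the normalized AUC using scikit-learn; the labels and scores below are placeholder toy values.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def normalized_auc(y_true, y_score):
    """Normalized AUC = 2 * AUC - 1, which maps chance-level performance (0.5) to 0."""
    return 2.0 * roc_auc_score(y_true, y_score) - 1.0

# Toy example with placeholder labels and predicted class-1 probabilities.
print(normalized_auc(np.array([0, 1, 1, 0]), np.array([0.2, 0.9, 0.6, 0.4])))
```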

  21. AutoML Challenge 2018
Figure 2: Datasets of the feedback phase in AutoML Challenge 2018. Train. #, Valid. #, Test #, Feature #, Chrono., and Budget stand for training dataset size, validation dataset size, test dataset size, the number of features, chronological order, and time budget, respectively. The time budget is given in seconds.

  22. Background: Soft Majority Voting
◮ An ensemble method to construct a classifier using a majority vote of k base classifiers.
◮ Class assignment of the soft majority voting classifier:
c_i = arg max ( Σ_{j=1}^{k} w_j p_i^{(j)} ) for 1 ≤ i ≤ n,
where n is the number of instances, arg max returns the index of the maximum value in the given vector, w_j ∈ R_{≥0} is the weight of base classifier j, and p_i^{(j)} is the class probability vector of base classifier j for instance i.
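A small NumPy sketch of this weighted soft vote over per-classifier class-probability matrices; the array shapes, weights, and toy numbers are illustrative assumptions.

```python
import numpy as np

def soft_vote(prob_list, weights):
    """Weighted soft majority voting.

    prob_list : list of k arrays, each of shape (n, num_classes); row i of array j
                is p_i^{(j)}, the class probability vector of base classifier j.
    weights   : length-k array of nonnegative classifier weights w_j.
    returns   : length-n array of class indices c_i = argmax_c sum_j w_j p_i^{(j)}[c].
    """
    stacked = np.stack(prob_list, axis=0)              # shape (k, n, num_classes)
    weighted = np.tensordot(weights, stacked, axes=1)  # shape (n, num_classes)
    return np.argmax(weighted, axis=1)

# Toy example: three base classifiers, two instances, two classes.
probs = [np.array([[0.7, 0.3], [0.4, 0.6]]),
         np.array([[0.6, 0.4], [0.5, 0.5]]),
         np.array([[0.2, 0.8], [0.3, 0.7]])]
print(soft_vote(probs, np.array([0.5, 0.3, 0.2])))
```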

  23. Our AutoML System [Kim and Choi, 2018a]
Figure 3: Our automated machine learning system. A voting classifier constructed from three tree-based classifiers (gradient boosting, extra-trees, and random forests) produces predictions; the voting classifier and the tree-based classifiers are iteratively optimized by Bayesian optimization within the given time budget.

  24. Our AutoML System [Kim and Choi, 2018a]
◮ Written in Python.
◮ Uses scikit-learn and our own Bayesian optimization package.
◮ Splits the training dataset into training (0.6) and validation (0.4) sets for Bayesian optimization.
◮ Optimizes six hyperparameters (see the sketch after this slide):
1. extra-trees classifier weight relative to the gradient boosting classifier weight in the voting classifier,
2. random forests classifier weight relative to the gradient boosting classifier weight in the voting classifier,
3. the number of estimators for the gradient boosting classifier,
4. the number of estimators for the extra-trees classifier,
5. the number of estimators for the random forests classifier,
6. maximum depth of the gradient boosting classifier.
◮ Uses GP-UCB.
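A minimal scikit-learn sketch of the model being tuned: a soft voting classifier over the three tree-based classifiers, exposing the six hyperparameters listed above. The helper name build_voting_classifier, the toy dataset, and the example hyperparameter values are assumptions for illustration; in the actual system these six values would be proposed by Bayesian optimization with GP-UCB within the time budget.

```python
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.datasets import make_classification

def build_voting_classifier(w_et, w_rf, n_gb, n_et, n_rf, max_depth_gb):
    """Soft voting over three tree-based classifiers.

    The gradient boosting weight is fixed to 1.0, so w_et and w_rf play the role of
    the extra-trees and random forests weights relative to gradient boosting.
    """
    return VotingClassifier(
        estimators=[
            ('gb', GradientBoostingClassifier(n_estimators=n_gb, max_depth=max_depth_gb)),
            ('et', ExtraTreesClassifier(n_estimators=n_et)),
            ('rf', RandomForestClassifier(n_estimators=n_rf)),
        ],
        voting='soft',
        weights=[1.0, w_et, w_rf],
    )

# Toy data standing in for a challenge dataset; split 0.6 / 0.4 as on the slide.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, train_size=0.6, random_state=0)

# One candidate hyperparameter vector (illustrative values only).
clf = build_voting_classifier(w_et=0.8, w_rf=1.2, n_gb=100, n_et=200, n_rf=200, max_depth_gb=3)
clf.fit(X_tr, y_tr)
print(2.0 * roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1]) - 1.0)  # normalized AUC
```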
