SLIDE 1

Scalable Hyperparameter Transfer Learning

Valerio Perrone†, Rodolphe Jenatton†, Cédric Archambeau*, Matthias Seeger*
AWS AI† / Amazon Research*, Berlin

SLIDE 2

Co-authors

  • R. Jenatton
  • C. Archambeau
  • M. Seeger

Most of the material is from:

  • V. Perrone, R. Jenatton, M. Seeger, C. Archambeau. Scalable Hyperparameter Transfer Learning. NeurIPS 2018.

SLIDE 3

Tuning deep neural nets for optimal performance

(Figure: LeNet5 [LBBH98])

The search space X is large and diverse:

  • Architecture: # hidden layers, activation functions, ...
  • Model complexity: regularization, dropout, ...
  • Optimisation parameters: learning rates, momentum, batch size, ...

SLIDE 4

Two straightforward approaches

(Figure by Bergstra and Bengio, 2012)

  • Exhaustive search on a regular or random grid
  • Complexity is exponential in p
  • Wasteful of resources, but easy to parallelise
  • Memoryless (a minimal random-search sketch follows below)
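To make the memoryless baseline concrete, here is a tiny random-search sketch in plain NumPy; the function and parameter names are illustrative, not taken from the talk.

```python
import numpy as np

def random_search(f, space, n_iter=50, seed=0):
    """Memoryless random search: sample n_iter configurations uniformly from a box `space`
    (dict of name -> (low, high)) and keep the best one. Illustrative sketch."""
    rng = np.random.RandomState(seed)
    configs = [{k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()} for _ in range(n_iter)]
    losses = [f(c) for c in configs]
    best = int(np.argmin(losses))
    return configs[best], losses[best]

# example: two hyperparameters; each draw is independent, so runs parallelise trivially
space = {"learning_rate": (1e-4, 1e-1), "dropout": (0.0, 0.5)}
best_cfg, best_loss = random_search(lambda c: (c["learning_rate"] - 0.01) ** 2 + c["dropout"], space)
```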

SLIDE 5

Hyperparameter transfer learning


SLIDE 9

Motivation

Transfer learning: Exploit evaluations of related past tasks

    • A given ML algorithm tuned over different datasets
    • Can we do it in the absence of meta-data?

Scalability: Both with respect to

    • #evaluations: ∑_{t=1}^{T} N_t
    • #tasks: T

SLIDE 10

Black-box global optimisation

  • The function f to optimise can be non-convex.
  • The number of hyperparameters p is moderate (typically < 20).
  • Our goal is to solve the following optimisation problem:

    x* = argmin_{x ∈ X} f(x).

  • Evaluating f(x) is expensive. No analytical form or gradient. Evaluations may be noisy.


SLIDE 12

Example: tuning deep neural nets [SLA12, SRS+15, KFB+16]

(Figure: LeNet5 [LBBH98])

  • f(x) is the validation loss of the neural net as a function of its hyperparameters x.
  • Evaluating f(x) is very costly: up to weeks!

SLIDE 13

Bayesian (black-box) optimisation [MTZ78, SSW+16]

x* = argmin_{x ∈ X} f(x)

Canonical algorithm (a minimal sketch follows below):

  • Surrogate model M of f  # cheaper to evaluate
  • Set of evaluated candidates C = {}
  • While some BUDGET available:
      • Select candidate x_new ∈ X using M and C  # exploration/exploitation
      • Collect evaluation y_new of f at x_new  # time-consuming
      • Update C = C ∪ {(x_new, y_new)}
      • Update M with C  # update surrogate model
      • Update BUDGET
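A minimal Python sketch of this loop, assuming a generic `surrogate` object with a `fit` method and an `acquisition` callable supplied by the caller; the names are illustrative, not the authors' code.

```python
import numpy as np

def bayesian_optimisation(f, candidate_pool, surrogate, acquisition, budget=50):
    """Minimal sketch of the canonical BO loop; `surrogate` and `acquisition` are caller-supplied."""
    C_x, C_y = [], []                                   # evaluated candidates C
    x0 = candidate_pool[np.random.randint(len(candidate_pool))]
    C_x.append(x0); C_y.append(f(x0))                   # initialise with one random evaluation
    for _ in range(budget):
        surrogate.fit(np.array(C_x), np.array(C_y))     # update surrogate model M with C
        scores = [acquisition(surrogate, x, best=min(C_y)) for x in candidate_pool]
        x_new = candidate_pool[int(np.argmax(scores))]  # exploration/exploitation trade-off
        y_new = f(x_new)                                # time-consuming evaluation
        C_x.append(x_new); C_y.append(y_new)            # C = C ∪ {(x_new, y_new)}
    i = int(np.argmin(C_y))
    return C_x[i], C_y[i]
```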

SLIDE 19

Bayesian (black-box) optimisation with Gaussian processes

1. Learn a probabilistic model of f, which is cheap to evaluate:

   y_i | f(x_i) ~ N(f(x_i), σ²),   f(x) ~ GP(0, K).

2. Given the observations y = (y_1, ..., y_n), compute the predictive mean and the predictive standard deviation.

3. Repeatedly query f by balancing exploitation against exploration (an illustrative sketch follows below).
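For illustration, the three steps with an off-the-shelf GP and an expected-improvement acquisition; scikit-learn and SciPy are my choice here for a runnable sketch, not the tooling used in the talk.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(gp, X_cand, y_best):
    """EI acquisition for minimisation: large where the GP predicts low mean or high uncertainty."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-12)
    gamma = (y_best - mu) / sigma
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))

# toy 1-d example of steps 1-3
f = lambda x: np.sin(3 * x) + 0.1 * np.random.randn(*x.shape)     # noisy black box
X = np.random.uniform(0, 2, size=(5, 1)); y = f(X).ravel()
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-2).fit(X, y)   # step 1
X_cand = np.linspace(0, 2, 200).reshape(-1, 1)
ei = expected_improvement(gp, X_cand, y.min())                     # step 2
x_next = X_cand[np.argmax(ei)]                                     # step 3: next query point
```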

SLIDE 21

Where is the minimum of f(x)?

SLIDE 22

Bayesian optimisation in practice

(Image credit: Javier González)

SLIDE 23

Bayesian optimization with transfer learning

Problem statement:

  • T functions {f_t(x)}_{t=1}^{T} with observations D_t = {(x_t^n, y_t^n)}_{n=1}^{N_t}
  • May or may not have meta-data (or contextual features) for {f_t(x)}_{t=1}^{T}
  • Goal: optimize some fixed f_{t0}(x) while exploiting {D_t}_{t=1}^{T} (this is not multi-objective!)

Previous work:

  • Multitask GP (Swersky et al. 2013, Poloczek et al. 2016)
  • GP + filter evaluations by task similarity (Feurer et al. 2015)
  • Various ensemble-based approaches:
      • GPs (Feurer et al. 2018)
      • Feedforward NNs (Schilling et al. 2015)


SLIDE 25

What is wrong with the Gaussian process surrogate? Scaling is O(N³).

SLIDE 26

Adaptive Bayesian linear regression (ABLR) [Bis06]

The model:

   P(y | w, z, β) = ∏_n N(φ_z(x_n)ᵀ w, β⁻¹),    P(w | α) = N(0, α⁻¹ I_D).

The predictive distribution:

   P(y* | x*, D) = ∫ P(y* | x*, w) P(w | D) dw = N(μ_t(x*), σ²_t(x*)).

SLIDE 27

Multi-task ABLR for transfer learning

1. Multi-task extension of the model:

   P(y_t | w_t, z, β_t) = ∏_{n_t} N(φ_z(x_{n_t})ᵀ w_t, β_t⁻¹),    P(w_t | α_t) = N(0, α_t⁻¹ I_D).

2. Shared features φ_z(x):
     • Explicit feature set (e.g., RBF)
     • Random kitchen sinks [RR+07]
     • Learned by a feedforward neural net

3. Multi-task objective (a sketch of its structure follows below):

   ρ(z, {α_t, β_t}_{t=1}^{T}) = -∑_{t=1}^{T} log P(y_t | z, α_t, β_t)
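The structure of that objective, shared feature parameters z with per-task precisions (α_t, β_t), can be sketched as follows; `features` and `neg_log_marginal_likelihood` are placeholder callables (the per-task closed form is given on a later slide).

```python
def multi_task_objective(z, alphas, betas, tasks, features, neg_log_marginal_likelihood):
    """rho(z, {alpha_t, beta_t}) = sum over tasks of the per-task negative log marginal likelihood.
    `features(z, X_t)` maps hyperparameter configurations to the shared basis phi_z(x)."""
    total = 0.0
    for (X_t, y_t), alpha_t, beta_t in zip(tasks, alphas, betas):
        Phi_t = features(z, X_t)          # shared feature map: parameters z tied across all tasks
        total += neg_log_marginal_likelihood(Phi_t, y_t, alpha_t, beta_t)
    return total
```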

SLIDE 28

Examples of φz

Feedforward neural networks:

   φ_z(x) = a_L(Z_L a_{L-1}(... Z_2 a_1(Z_1 x) ...)),   where z consists of all {Z_l}_{l=1}^{L}.

Random Fourier features:

   φ_z(x) = √(2/D) cos(σ⁻¹ U x + b),   with U ~ N(0, I) and b ~ U([0, 2π]); z only consists of 1/σ.
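A small NumPy sketch of the random-kitchen-sinks map; illustrative only: U and b are drawn once and kept fixed, and only 1/σ would be learned.

```python
import numpy as np

def random_fourier_features(X, D=64, inv_sigma=1.0, seed=0):
    """Random kitchen sinks feature map: phi_z(x) = sqrt(2/D) * cos(U x / sigma + b).
    U ~ N(0, I) and b ~ U([0, 2*pi]) are fixed; inv_sigma (= 1/sigma) is the learnable z."""
    rng = np.random.RandomState(seed)
    U = rng.normal(size=(X.shape[1], D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(inv_sigma * (X @ U) + b)

# example: map 100 random 3-d hyperparameter configurations to a 64-d feature space
Phi = random_fourier_features(np.random.rand(100, 3))
```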

SLIDE 29

Pictorial summary of ABLR

SLIDE 30

Posterior inference

Hyperparameters:

  • {α_t, β_t}_{t=1}^{T} for each task t
  • z for the shared basis function

Empirical Bayesian approach:

  • Marginalize out the Bayesian linear regression parameters {w_t}_{t=1}^{T}
  • Jointly learn the hyper-parameters of the model, {α_t, β_t}_{t=1}^{T} and z
  • Minimize

    ρ(z, {α_t, β_t}_{t=1}^{T}) = -∑_{t=1}^{T} log P(y_t | X_t, α_t, β_t, z)

SLIDE 31

Posterior inference (cont’d)

We have closed forms for the posterior mean and variance:

   μ_t(x*_t; D_t, α_t, β_t, z) = (β_t / α_t) φ_z(x*_t)ᵀ K_t⁻¹ Φ_tᵀ y_t

   σ²_t(x*_t; D_t, α_t, β_t, z) = (1 / α_t) φ_z(x*_t)ᵀ K_t⁻¹ φ_z(x*_t) + 1 / β_t

and for the marginal likelihood:

   ρ(z, {α_t, β_t}_{t=1}^{T}) = ∑_{t=1}^{T} [ -(N_t / 2) log β_t + (β_t / 2)(‖y_t‖² - (β_t / α_t) ‖c_t‖²) + ∑_{i=1}^{D} log([L_t]_ii) ]

with the Cholesky factorisation K_t = (β_t / α_t) Φ_tᵀ Φ_t + I_D = L_t L_tᵀ and c_t = L_t⁻¹ Φ_tᵀ y_t (a NumPy sketch follows below).
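These closed forms translate almost line by line into NumPy. The sketch below is illustrative (constant terms of ρ omitted), not the authors' implementation.

```python
import numpy as np

def ablr_task_quantities(Phi_t, y_t, alpha_t, beta_t):
    """Per-task quantities of this slide: K_t, its Cholesky factor L_t, c_t, and the
    negative log marginal likelihood contribution (up to additive constants)."""
    N_t, D = Phi_t.shape
    K_t = (beta_t / alpha_t) * Phi_t.T @ Phi_t + np.eye(D)
    L_t = np.linalg.cholesky(K_t)                    # K_t = L_t L_t^T
    c_t = np.linalg.solve(L_t, Phi_t.T @ y_t)        # c_t = L_t^{-1} Phi_t^T y_t
    rho_t = (-0.5 * N_t * np.log(beta_t)
             + 0.5 * beta_t * (y_t @ y_t - (beta_t / alpha_t) * (c_t @ c_t))
             + np.sum(np.log(np.diag(L_t))))
    return K_t, L_t, c_t, rho_t

def ablr_predict(Phi_t, y_t, phi_star, alpha_t, beta_t):
    """Posterior mean and variance at a test configuration with features phi_star (length D)."""
    _, L_t, c_t, _ = ablr_task_quantities(Phi_t, y_t, alpha_t, beta_t)
    v = np.linalg.solve(L_t, phi_star)               # so that phi*^T K_t^{-1} phi* = ||v||^2
    mu = (beta_t / alpha_t) * v @ c_t                # (beta_t/alpha_t) phi*^T K_t^{-1} Phi_t^T y_t
    var = (1.0 / alpha_t) * (v @ v) + 1.0 / beta_t
    return mu, var
```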

SLIDE 32

Leveraging MXNet

In Bayesian optimization, derivatives are needed for:

  • Posterior inference: (z, {α_t, β_t}_{t=1}^{T}) ↦ ρ(z, {α_t, β_t}_{t=1}^{T})
  • Acquisition functions A, typically of the form (e.g., EI, PI, UCB, ...): x* ↦ A(μ_t(x*; D_t, α_t, β_t, z), σ²_t(x*; D_t, α_t, β_t, z))

Leverage MXNet (Seeger et al. 2017):

  • Auto-differentiation
  • Backward operator for Cholesky
  • Can use any φ_z
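A minimal MXNet autograd sketch of the idea, assuming MXNet 1.x with its differentiable linear-algebra operators (`linalg.potrf`, `linalg.trsm`); the toy parameter and loss are placeholders, not the actual ABLR objective.

```python
import mxnet as mx
from mxnet import nd, autograd

Phi = nd.random.uniform(shape=(20, 5))          # stand-in feature matrix Phi_t
y = nd.random.uniform(shape=(20, 1))
theta = nd.ones((1, 1))                         # toy scalar parameter entering K
theta.attach_grad()

with autograd.record():
    K = nd.broadcast_mul(theta, nd.dot(Phi, Phi, transpose_a=True)) + nd.eye(5)
    L = nd.linalg.potrf(K)                      # Cholesky factor, with a backward operator
    c = nd.linalg.trsm(L, nd.dot(Phi, y, transpose_a=True))   # c = L^{-1} Phi^T y
    loss = nd.sum(nd.log(nd.diag(L))) + 0.5 * nd.sum(c * c)   # typical marginal-likelihood terms
loss.backward()
print(theta.grad)                               # gradient flowed through potrf/trsm via autograd
```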

SLIDE 33

Optimization of the marginal likelihood

Optimization properties:

  • Number of tasks: T ≈ a few tens
  • Number of points per task: N_t ≫ 1
  • Not the standard SGD regime
  • We apply L-BFGS jointly over all parameters z and {α_t, β_t}_{t=1}^{T}
  • Warm-start parameters: re-convergence in very few steps
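A sketch of this regime with SciPy's L-BFGS, warm-starting each refit from the previous optimum; the flattened parameter vector and the toy objective are placeholders.

```python
import numpy as np
from scipy.optimize import minimize

def fit_marginal_likelihood(objective_and_grad, theta0):
    """Jointly optimise all parameters (z and per-task alpha_t, beta_t, flattened into theta)
    with L-BFGS. `objective_and_grad` returns (rho, d rho / d theta)."""
    res = minimize(objective_and_grad, theta0, jac=True, method="L-BFGS-B")
    return res.x

# warm start: after each new BO evaluation, restart L-BFGS from the previous optimum,
# so re-convergence only takes a few steps
theta = np.zeros(10)                      # placeholder initial parameters
for _ in range(3):                        # pretend three BO iterations with a toy quadratic objective
    f = lambda th: (np.sum((th - 1.0) ** 2), 2.0 * (th - 1.0))
    theta = fit_marginal_likelihood(f, theta)   # warm-started from the previous solution
```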

SLIDE 34

Surrogate models used in Bayesian optimization

Various types of models are used:

  • Gaussian processes (Jones et al. 1998, Snoek et al. 2012, ...)
  • Sparse Gaussian processes (McIntire et al. 2016)
  • Variants of Gaussian processes, e.g. DKL/KISS-GP (Pleiss et al. 2018)
  • Random forests (Hutter et al. 2011)
  • (Bayesian) NNs (Snoek et al. 2015, Springenberg et al. 2016)

SLIDE 35

ABLR

Contributions:

  • Simplicity
  • Scalability
  • Transfer learning in the absence of meta-data
  • Extends DNGO (Snoek et al. 2015) with:
      • Joint inference
      • Transfer learning and handling of heterogeneous tasks

SLIDE 36

Warm-start procedure for hyperparameter optimisation (HPO)

Leave-one-task-out: hold out one task, warm-start its HPO from the evaluations of all remaining tasks (sketched below).
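A sketch of the protocol; `fit_transfer_model` and `run_hpo` are hypothetical helpers standing in for multi-task ABLR fitting and the warm-started BO loop.

```python
def leave_one_task_out(tasks, fit_transfer_model, run_hpo):
    """For each task: fit the transfer model on all *other* tasks' evaluations,
    then run (warm-started) HPO on the held-out task. Illustrative names only."""
    results = {}
    for t, held_out in enumerate(tasks):
        others = tasks[:t] + tasks[t + 1:]
        surrogate = fit_transfer_model(others)      # multi-task surrogate fit on past tasks
        results[t] = run_hpo(held_out, surrogate)   # warm-started optimisation of the new task
    return results
```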

SLIDE 37

Pictorial view of different transfer learning approaches …

(Figure: design matrices X_1, X_2, ..., X_T of hyper-parameter configurations, each with a context column block)

1. Single marginal likelihood, stack across tasks:

   [X_1, context_1; ... ; X_T, context_T] ∈ R^{(∑_{t=1}^{T} N_t) × (P + |context|)}

2. One marginal likelihood per X_t (no context!)

3. One marginal likelihood per [X_t, context_t]
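Option 1 amounts to building one stacked design matrix; a small NumPy sketch with illustrative shapes.

```python
import numpy as np

def stack_tasks_with_context(task_X, task_context):
    """Stack all tasks' configurations, appending each task's context features,
    into one (sum_t N_t) x (P + |context|) design matrix. Illustrative sketch."""
    blocks = []
    for X_t, ctx_t in zip(task_X, task_context):
        ctx_cols = np.tile(ctx_t, (X_t.shape[0], 1))   # repeat the task's context for each row
        blocks.append(np.hstack([X_t, ctx_cols]))
    return np.vstack(blocks)

# example: 3 tasks, P = 4 hyperparameters, 2 context features per task
task_X = [np.random.rand(n, 4) for n in (10, 20, 15)]
task_context = [np.random.rand(2) for _ in task_X]
X_stacked = stack_tasks_with_context(task_X, task_context)   # shape (45, 6)
```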

SLIDE 38

Small-scale synthetic example: Transfer learning across quadratic functions

3-dimensional parameterized quadratic functions:

   f_t(x) = (1/2) a_t ‖x‖² + b_t 1ᵀx + c_t

  • One task = one function f_t
  • (a_t, b_t, c_t) ∈ [0.1, 10]³, used as contextual information
  • T = 30 tasks
  • "Leave-one-task-out" evaluation (task generation sketched below)
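A sketch of how such a task family can be generated; illustrative, not the authors' exact setup.

```python
import numpy as np

def make_quadratic_tasks(T=30, dim=3, seed=0):
    """Synthetic transfer-learning tasks: f_t(x) = 0.5 * a_t * ||x||^2 + b_t * sum(x) + c_t
    with (a_t, b_t, c_t) drawn uniformly from [0.1, 10]^3 and kept as per-task context."""
    rng = np.random.RandomState(seed)
    tasks = []
    for _ in range(T):
        a, b, c = rng.uniform(0.1, 10.0, size=3)
        f = lambda x, a=a, b=b, c=c: 0.5 * a * np.sum(x**2) + b * np.sum(x) + c
        tasks.append({"f": f, "context": np.array([a, b, c])})
    return tasks

tasks = make_quadratic_tasks()
y = tasks[0]["f"](np.zeros(3))   # evaluate the first task at the origin
```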

SLIDE 39

Experimental protocol

Comparisons with:

  • Random search (Bergstra et al. 2012)
  • Gaussian process (based on the GPyOpt implementation)
  • Gaussian process + "L1 heuristic" (Feurer et al. 2015)
  • DNGO¹ (Snoek et al. 2015)
  • BOHAMIANN¹ (Springenberg et al. 2016)

Other considerations:

  • Results aggregated over 30 replicates.
  • Expected improvement used for all model-based approaches.
  • Architecture of ABLR is (50, 50, 50) (following Snoek et al. 2015).

¹ Implementation from https://github.com/automl/RoBO

SLIDE 40

Transfer learning across quadratic functions

Transfer learning with baselines [KO11]. Transfer learning with neural nets [SRS+15, SKFH16].

SLIDE 41

Scalability: GP vs ABLR

SLIDE 42

Transfer learning - OpenML data (Vanschoren et al. 2014)

  • One task = one dataset
  • Collect {(X_t, y_t)}_{t=1}^{T} from OpenML (Vanschoren et al. 2014)
  • SVM: 4 HPs; XGBoost: 10 HPs
  • Take T = 30 datasets (flow ids)
      • ∑_t N_t up to 7.5 × 10⁵ evaluations

SLIDE 43

Transfer learning across OpenML data sets

Transfer learning in SVM. Transfer learning in XGBoost.

SLIDE 44

Transfer learning vs. exploiting side signals

                              transfer learning        side signals
  # active task(s)            1                        T
  # optimized task            1                        1
  N_t of non-active tasks     fixed                    growing (N_t = N)
  one marg. likelihood per    a tuning experiment      a signal

Typical use cases:

  • Transfer learning: reuse the data of previous tuning experiments
  • Side signals: the training of ML models generates multiple signals

SLIDE 45

Leveraging multiple signals

  • Goal: tune feedforward NNs for binary classification
  • Main signal: validation accuracy
  • Side signals: training accuracy and CPU time ("come for free")
  • Idea: side signals can help learn φ_z

SLIDE 46

Leveraging multiple signals

Transfer learning across LIBSVM data sets.

SLIDE 47

Conclusion

Bayesian optimisation is a model-based approach that automates machine learning:

  • Algorithm tuning
  • Model tuning

ABLR [PJSA17]:

  • Scalable
  • Fully leverages MXNet
  • Transfers knowledge across tasks and signals

SLIDE 48

Thank you!

SLIDE 49

References I

  • James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.
  • C. M. Bishop. Pattern Recognition and Machine Learning. Springer New York, 2006.
  • Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast Bayesian optimization of machine learning hyperparameters on large datasets. Technical report, preprint arXiv:1605.07079, 2016.
  • Andreas Krause and Cheng S. Ong. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems (NIPS), pages 2447–2455, 2011.
  • Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

SLIDE 50

References II

  • Jonas Mockus, Vytautas Tiesis, and Antanas Zilinskas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2(117-129):2, 1978.
  • V. Perrone, R. Jenatton, M. Seeger, and C. Archambeau. Multiple Adaptive Bayesian Linear Regression for Scalable Bayesian Optimization with Warm Start. arXiv e-prints, December 2017.
  • Ali Rahimi, Benjamin Recht, et al. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS) 20, pages 1177–1184, 2007.
  • Matthias Seeger, Asmus Hetzel, Zhenwen Dai, and Neil D. Lawrence. Auto-differentiating linear algebra. Technical report, preprint arXiv:1710.08717, 2017.
  • Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter. Bayesian optimization with robust Bayesian neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 4134–4142, 2016.

SLIDE 51

References III

  • Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (NIPS), pages 2960–2968, 2012.
  • Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. Scalable Bayesian optimization using deep neural networks. In Proceedings of the International Conference on Machine Learning (ICML), pages 2171–2180, 2015.
  • Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
  • Joaquin Vanschoren, Jan N. Van Rijn, Bernd Bischl, and Luis Torgo. OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014.