Towards efficient automatic end-to-end learning - Frank Hutter - PowerPoint PPT Presentation


SLIDE 1

Towards efficient automatic end-to-end learning

Frank Hutter

University of Freiburg, Germany

Based on joint work with great students and collaborators (named throughout)

SLIDE 2

What will this partial learning curve converge to?

[Figure: a partial learning curve whose continuation is unknown ("?"); axes: epoch vs. validation set accuracy]

SLIDE 3

One Problem of Deep Learning

Performance is very sensitive to many hyperparameters:

• Architectural choices (# convolutional layers, # fully connected layers, units per layer, kernel size)
• Optimization: algorithm, learning rates, momentum, batch normalization, batch sizes, dropout rates, weight decay, ...
• Data augmentation & preprocessing

[Figure: a convolutional network classifying images, e.g. "dog" vs. "cat"]

SLIDE 4

Towards True End-to-end Learning

• Current deep learning practice: an expert chooses the architecture & hyperparameters
• Deep learning: "end-to-end" learning of the task itself
• AutoML: true end-to-end learning, where meta-level learning & optimization configure the learning box automatically

SLIDE 5

The standard approach: blackbox optimization

DNN hyperparameter setting λ → train the DNN and validate it → validation performance f(λ). The blackbox optimizer sees only these (λ, f(λ)) pairs and solves max_λ f(λ).

Blackbox optimizers: grid search, random search, population-based & evolutionary methods, ..., Bayesian optimization. (A minimal version of this loop is sketched below.)

SLIDE 6

The standard approach: blackbox optimization

DNN hyperparameter setting λ → train the DNN and validate it → validation performance f(λ); the blackbox optimizer solves max_λ f(λ).

Two limitations of this blackbox view:

• Too slow for tuning DNNs: every query of f(λ) is a full training run
• Need more fine-grained actions than just evaluating f(λ)

SLIDE 7

Ways in which we can go beyond blackbox optimization

SLIDE 8

1. We can use transfer learning

Transfer learning from other datasets D: model f(λ, D); this needs a scalable model. (A warm-starting sketch follows after the references below.)

Using Gaussian process models:

• Bardenet et al., ICML 2013: Collaborative hyperparameter tuning
• Swersky et al., NIPS 2013: Multi-task Bayesian optimization
• Yogatama & Mann, AISTATS 2014: Efficient transfer learning method for automatic hyperparameter tuning

Using other models:

• Feurer et al., AAAI 2015: Initializing Bayesian hyperparameter optimization via meta-learning
• Springenberg et al., NIPS 2016: Bayesian optimization with robust Bayesian neural networks
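A minimal warm-starting sketch in the spirit of Feurer et al. (AAAI 2015): seed the optimizer with configurations that worked best on the most similar previous datasets. The `past_runs` structure, the meta-features, and the Euclidean distance are illustrative assumptions.

```python
import numpy as np

def warmstart_configs(new_meta_features, past_runs, k=3):
    """Pick the best-known configurations from the k most similar
    previous datasets (similarity = Euclidean distance between
    dataset meta-features) and use them as the optimizer's initial
    design. `past_runs` is a list of (meta_features, best_config)."""
    d = [np.linalg.norm(np.asarray(mf) - np.asarray(new_meta_features))
         for mf, _ in past_runs]
    return [past_runs[i][1] for i in np.argsort(d)[:k]]

# Toy usage: meta-features = (log #samples, log #features, class entropy).
past = [((9.2, 4.1, 0.9), {"lr": 1e-3}),
        ((5.0, 2.0, 0.3), {"lr": 1e-1}),
        ((9.0, 4.0, 1.0), {"lr": 3e-4})]
print(warmstart_configs((9.1, 4.2, 0.95), past, k=2))
```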

SLIDE 9

Example: BO with robust Bayesian neural nets

• Well-calibrated uncertainty estimates
• Scalable for multitask problems

[Springenberg, Klein, Falkner, Hutter; NIPS 2016]

https://github.com/automl/RoBO

SLIDE 10

2. We can reason over cheaper subsets of data

Large datasets: start from small subsets of size s, i.e. model f(λ, s); this needs a model that extrapolates well from small s to smax.

[Figure: SVM error surfaces over log(C) and log(γ), trained on data subsets of size smax/128, smax/16, smax/4, and smax]

SLIDE 11

Example: Fast Bayesian optimization on large datasets

• Automatically choose the dataset size for each evaluation
  – Trading off information gain about the global optimum vs. cost
• Entropy Search: based on a probability distribution p_min over where the minimum lies
• Strategy: pick the configuration & data size pair (x, s) that maximally decreases the entropy of p_min per time spent (a cost-aware selection sketch follows below)

[Klein, Falkner, Bartels, Hennig, Hutter; arXiv 2016]
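A minimal sketch of the cost-aware selection step. The actual method scores candidates by the expected reduction in the entropy of p_min; here `information_gain` and `predicted_cost` are hypothetical model callbacks (e.g. backed by a GP over f(λ, s) and a regression model of training time), so only the "gain per second" ratio from the slide is shown.

```python
import numpy as np

def select_next(candidates, information_gain, predicted_cost):
    """Among (config, subset_size) pairs, pick the one with the best
    information gain about the optimum *per second of training*,
    the criterion behind the entropy-search strategy above."""
    scores = [information_gain(x, s) / predicted_cost(x, s)
              for (x, s) in candidates]
    return candidates[int(np.argmax(scores))]

# Toy usage with made-up callbacks: small subsets are cheap but less
# informative; the ratio decides which trade-off wins.
cands = [((0.1,), 1000), ((0.1,), 8000), ((0.9,), 1000)]
gain = lambda x, s: np.log(s) * (1.0 - x[0])
cost = lambda x, s: s / 100.0
print(select_next(cands, gain, cost))
```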

SLIDE 12

Example: Fast Bayesian optimization on large datasets

Result: 10x-1000x speedups for SVMs, 5x-10x for DNNs

[Figure: error vs. budget of the optimizer [s]]

[Klein, Falkner, Bartels, Hennig, Hutter; under review at AISTATS 2016]

https://github.com/automl/RoBO

SLIDE 13

3. We can model the online performance of DNNs

Graybox optimization: model f(λ, t) as a function of training time t.

[Figure: DNN learning curves over time t for different hyperparameter settings]

• Swersky et al., arXiv 2014: Freeze-thaw Bayesian optimization
• Domhan et al., AutoML 2014 & IJCAI 2015: Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves

SLIDE 14

Example: extrapolating learning curves

Parametric model: a convex combination of K = 11 parametric models,

$$y_t = \sum_{k=1}^{K} w_k \, f_k(t \mid \theta_k) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2).$$

Maximum-likelihood fit of the parameters; MCMC to quantify model uncertainty. (A fitting sketch follows below.)

[Domhan, Springenberg, Hutter; AutoML 2014 & IJCAI 2015]
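A least-squares sketch of fitting such a convex combination to a partial learning curve, using just two illustrative parametric families (the paper combines K = 11 and uses MCMC over all parameters to quantify uncertainty); the specific basis functions and starting point are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def pow_curve(t, a, alpha):         # power-law family: rises towards a
    return a * (1.0 - t ** (-alpha))

def exp_curve(t, a, b):             # saturating-exponential family
    return a * (1.0 - np.exp(-b * t))

def fit_combination(t, y):
    """Least-squares fit of y_t ~ w*f1(t) + (1-w)*f2(t). The full
    method is a maximum-likelihood fit including sigma^2, with MCMC
    sampling of the posterior over (w, theta, sigma^2)."""
    def loss(p):
        w, a1, alpha, a2, b = p
        w = np.clip(w, 0.0, 1.0)     # keep the combination convex
        pred = w * pow_curve(t, a1, alpha) + (1 - w) * exp_curve(t, a2, b)
        return np.sum((y - pred) ** 2)
    p0 = np.array([0.5, y[-1], 1.0, y[-1], 0.1])
    return minimize(loss, p0, method="Nelder-Mead").x

# Toy usage: fit the first 20 epochs of a noisy synthetic curve.
t = np.arange(1, 21, dtype=float)
y = 0.8 * (1 - np.exp(-0.2 * t)) + 0.01 * np.random.randn(20)
print(fit_combination(t, y))
```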

SLIDE 15

Example: extrapolating learning curves

If $P(y_m > y_{\text{best}} \mid y_{1:n}) \geq 5\%$: continue training...

[Figure: observed partial curve y_{1:n}, predicted accuracy y_m at the final epoch m, and best accuracy so far y_best; axes: epoch vs. validation set accuracy]

SLIDE 16

Example: extrapolating learning curves

If $P(y_m > y_{\text{best}} \mid y_{1:n}) < 5\%$: terminate! (This rule is sketched below.)

[Figure: the run is stopped once the extrapolated curve is unlikely to beat y_best; axes: epoch vs. validation set accuracy]
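A sketch of the resulting termination rule, assuming we already have MCMC samples of extrapolated curves (one row per posterior sample) from a fit like the one above; the array layout and names are illustrative.

```python
import numpy as np

def should_terminate(curve_samples, m, y_best, threshold=0.05):
    """Estimate P(y_m > y_best | y_1:n) as the fraction of sampled
    extrapolations that beat the best accuracy seen so far at the
    final epoch m; stop this training run if that probability falls
    below the 5% threshold, otherwise keep training."""
    p_improve = float(np.mean(curve_samples[:, m] > y_best))
    return p_improve < threshold

# Toy usage: 1000 sampled curves over 100 epochs, none very promising.
samples = 0.7 + 0.02 * np.random.randn(1000, 101)
print(should_terminate(samples, m=100, y_best=0.78))   # True: terminate
```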

SLIDE 17

Example: extrapolating learning curves

SLIDE 18

Easy to include in a Bayesian neural network

[Klein, Falkner, Springenberg, Hutter; Bayesian Deep Learning Workshop 2016]

SLIDE 19

4. We can change hyperparameters on the fly

• Common practice: change SGD learning rates over time
• Can automate & improve with reinforcement learning (a hand-coded stand-in is sketched below):
  – Daniel et al., AAAI 2016: Learning step size controllers for robust neural network training
  – Hansen, arXiv 2016: Using deep Q-learning to control optimization hyperparameters
  – Andrychowicz et al., arXiv 2016: Learning to learn by gradient descent by gradient descent
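To make "hyperparameter control" concrete, here is a hand-coded stand-in for a step-size controller: halve the learning rate whenever validation loss plateaus. This is my illustrative rule, not the method of the cited works, which replace such fixed rules with policies learned by reinforcement learning from features of the training state.

```python
class LearningRateController:
    """Fixed-rule controller: halve the LR after `patience` epochs
    without improvement in validation loss. An RL-based controller
    would learn when and by how much to change the LR instead."""
    def __init__(self, lr=0.1, patience=3, factor=0.5):
        self.lr, self.patience, self.factor = lr, patience, factor
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor       # act on the fly
                self.bad_epochs = 0
        return self.lr

# Toy usage: the loss stalls after epoch 3, so the LR drops at epoch 6.
ctrl = LearningRateController()
for loss in [1.0, 0.8, 0.7, 0.7, 0.71, 0.7, 0.7]:
    print(ctrl.step(loss))
```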

SLIDE 20

5. We can automatically gain scientific insights

One way to inspect the model: functional ANOVA, which explains the performance variation due to each subset of hyperparameters [Hutter, Hoos, Leyton-Brown; ICML 2014]. (A Monte-Carlo sketch follows below.)

[Figure: marginal loss as a function of hyperparameter 1, 2, and 3]

Possible future insights:

1. How stable are good hyperparameter settings across datasets?
2. Which hyperparameters need to change as the dataset grows?
3. Which factors affect empirical convergence rates of SGD?
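A Monte-Carlo sketch of the quantity functional ANOVA computes for a single hyperparameter: the variance of its marginal performance (all other hyperparameters averaged out) as a share of the total variance. In practice `f` would be a random-forest surrogate fitted to the optimizer's evaluation history; all names and the binned estimator here are illustrative assumptions.

```python
import numpy as np

def importance(f, dim, low, high, n=20_000, n_bins=10, seed=0):
    """Share of total performance variance explained by hyperparameter
    `dim` alone: bin samples along that dimension, average the other
    dimensions out within each bin (the marginal), and compare the
    marginal's variance to the total variance."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(low, high, size=(n, len(low)))
    y = np.apply_along_axis(f, 1, X)
    edges = np.linspace(low[dim], high[dim], n_bins + 1)
    which = np.digitize(X[:, dim], edges[1:-1])
    marginal = np.array([y[which == b].mean() for b in range(n_bins)])
    return marginal.var() / y.var()

# Toy usage: f depends strongly on dim 0, weakly on dim 1.
f = lambda x: (x[0] - 0.3) ** 2 + 0.01 * x[1]
print(importance(f, 0, low=[0, 0], high=[1, 1]))  # close to 1
print(importance(f, 1, low=[0, 0], high=[1, 1]))  # close to 0
```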

SLIDE 21

Learning to optimize, to plan, etc.

• Algorithm configuration often speeds up solvers a lot
  – 500x for software verification [Hutter, Babic, Hoos, Hu; FMCAD 2007]
  – 50x for MIP [Hutter, Hoos, Leyton-Brown; CPAIOR 2011]
  – 100x for finding better domain encodings in AI planning [Vallati, Hutter, Chrpa, McCluskey; IJCAI 2015]
• Algorithm portfolios won many competitions
  – E.g., SATzilla won the SAT competitions 2007, 2009, and 2012 (every time we entered) [Xu, Hutter, Hoos, Leyton-Brown; JAIR 2008]
  – E.g., Cedalion won the IPC 2014 Planning & Learning Track [Seipp, Sievers, Helmert, Hutter; AAAI 2015]

SLIDE 22

Conclusion: moving beyond hand-designed learners

Some ways of making this efficient:

• Transfer learning: exploit strong priors
• Exploit cheaper, approximate blackboxes
• Graybox: partial feedback during evaluation
• Whitebox: hyperparameter control (RL)

https://github.com/automl/RoBO