SLIDE 1

Neural Nets in Practice

Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2020f/

Many slides attributable to: Erik Sudderth (UCI), Emily Fox (UW), Finale Doshi-Velez (Harvard), James, Witten, Hastie, Tibshirani (ISL/ESL books)

  • Prof. Mike Hughes
SLIDE 2

Objectives Today: (day 13) NNs in Practice

  • Multi-class classification with NNs
  • Pros and cons of NNs
  • Avoiding overfitting with NNs
  • Hyperparameter selection
  • Data augmentation
  • Early stopping

SLIDE 3

What will we learn?


[Overview diagram: Supervised Learning, Unsupervised Learning, Reinforcement Learning. Supervised learning starts from data/label pairs {x_n, y_n}, n = 1, ..., N, a task (predict label y from data x), and a performance measure, and proceeds through Training, Prediction, and Evaluation.]
SLIDE 4


Task: Binary Classification

[Figure: within the supervised learning family, the binary classification task — a scatter plot of examples over two features x1 and x2, where the label y is a binary variable (red or blue).]
SLIDE 5

Multi-class Classification


How to do this?

SLIDE 6

>>> yhat_N = model.predict(x_NF)
>>> yhat_N[:5]
[0, 0, 1, 0, 1]

Binary Prediction

Goal: Predict label (0 or 1) given features x

  • Input: feature vector x_i = [x_i1, x_i2, ..., x_if, ..., x_iF]
    ("features", "covariates", "attributes")
    Entries can be real-valued, or other numeric types (e.g. integer, binary)
  • Output: binary label y_i ∈ {0, 1}
    ("responses" or "labels")

SLIDE 7

>>> yproba_N2 = model.predict_proba(x_NF)
>>> yproba1_N = model.predict_proba(x_NF)[:,1]
>>> yproba1_N[:5]
[0.143, 0.432, 0.523, 0.003, 0.994]

Binary Proba. Prediction

Goal: Predict probability of label given features

  • Input: feature vector x_i = [x_i1, x_i2, ..., x_if, ..., x_iF]
    ("features", "covariates", "attributes")
    Entries can be real-valued, or other numeric types (e.g. integer, binary)
  • Output: probability p̂_i = p(Y_i = 1 | x_i)
    A value between 0 and 1, e.g. 0.001, 0.513, 0.987

SLIDE 8

>>> yhat_N = model.predict(x_NF)
>>> yhat_N[:6]
[0, 3, 1, 0, 0, 2]

Multi-class Prediction

Goal: Predict one of C classes given features x

  • Input: feature vector x_i = [x_i1, x_i2, ..., x_if, ..., x_iF]
    ("features", "covariates", "attributes")
    Entries can be real-valued, or other numeric types (e.g. integer, binary)
  • Output: integer label y_i ∈ {0, 1, 2, ..., C − 1}
    ("responses" or "labels")

SLIDE 9

>>> yproba_NC = model.predict_proba(x_NF)
>>> yproba_c_N = model.predict_proba(x_NF)[:,c]
>>> np.sum(yproba_NC, axis=1)
[1.0, 1.0, 1.0, 1.0]

Multi-class Proba. Prediction

Goal: Predict probability of each class given features x

  • Input: feature vector x_i = [x_i1, x_i2, ..., x_if, ..., x_iF]
    ("features", "covariates", "attributes")
    Entries can be real-valued, or other numeric types (e.g. integer, binary)
  • Output: probability vector p̂_i = [p(Y_i = 0 | x_i)  p(Y_i = 1 | x_i)  ...  p(Y_i = C − 1 | x_i)]
    A vector of C non-negative values that sums to one

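A minimal end-to-end sketch of how these two interfaces relate (assuming scikit-learn's MLPClassifier and a synthetic dataset, neither shown on the slides): each row of predict_proba sums to one, and predict returns the class with the largest probability.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Small synthetic 4-class problem; names and sizes are arbitrary choices
x_NF, y_N = make_classification(n_samples=200, n_features=10,
                                n_informative=4, n_classes=4, random_state=0)
model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
model.fit(x_NF, y_N)

yproba_NC = model.predict_proba(x_NF)
yhat_N = model.predict(x_NF)
print(np.allclose(yproba_NC.sum(axis=1), 1.0))           # each row sums to one
print(np.array_equal(yhat_N, yproba_NC.argmax(axis=1)))  # predict = argmax of proba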
SLIDE 10

From Real Value to Probability


sigmoid(z) = 1 / (1 + e^(−z))

[Plot: the sigmoid squashes any real value z to a probability between 0 and 1.]

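A quick numerical check of this formula (a plain NumPy snippet, not from the slides): large positive z maps near 1, large negative z near 0, and z = 0 maps to exactly 0.5.

import numpy as np

def sigmoid(z):
    # 1 / (1 + e^{-z}), applied elementwise
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # ~[0.018, 0.5, 0.982]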
SLIDE 11

From Vector of Reals to Vector of Probabilities


z_i = [z_i1, z_i2, ..., z_ic, ..., z_iC]

p̂_i = [ e^(z_i1) / Σ_{c=1..C} e^(z_ic),   e^(z_i2) / Σ_{c=1..C} e^(z_ic),   ...,   e^(z_iC) / Σ_{c=1..C} e^(z_ic) ]

called the “softmax” function

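A minimal NumPy sketch of this function (my own code, not from the slides). Subtracting the max entry before exponentiating is a common trick to keep e^z from overflowing; it does not change the result because the constant cancels in the ratio.

import numpy as np

def softmax(z_C):
    # Shift by the max for numerical stability, then normalize.
    shifted = z_C - np.max(z_C)
    expz = np.exp(shifted)
    return expz / expz.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p, p.sum())  # C non-negative values that sum to 1.0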
SLIDE 12

Representing multi-class labels

Encode as a length-C one-hot binary vector


Examples (assume C=4 labels):
class 0: [1 0 0 0]
class 1: [0 1 0 0]
class 2: [0 0 1 0]
class 3: [0 0 0 1]

y_n ∈ {0, 1, 2, ..., C − 1}  is encoded as  ȳ_n = [ȳ_n1, ȳ_n2, ..., ȳ_nc, ..., ȳ_nC], with each ȳ_nc ∈ {0, 1}

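A small sketch of this encoding in plain NumPy (the variable names are mine, not the slides'):

import numpy as np

def one_hot(y_N, C):
    # y_N: integer labels in {0, ..., C-1}; returns an (N, C) binary matrix
    ybar_NC = np.zeros((len(y_N), C), dtype=int)
    ybar_NC[np.arange(len(y_N)), y_N] = 1
    return ybar_NC

print(one_hot(np.array([0, 3, 1]), C=4))
# [[1 0 0 0]
#  [0 0 0 1]
#  [0 1 0 0]]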
SLIDE 13

“Neuron” for Binary Prediction


[Diagram: a single "neuron" applies a linear function with weights w to the input features, then a logistic sigmoid activation function, outputting the probability of class 1.]

Credit: Emily Fox (UW)
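A minimal sketch of this single-neuron computation in NumPy (the weights, bias, and feature values below are made-up numbers, purely illustrative):

import numpy as np

def neuron_predict_proba(x_F, w_F, b):
    # Linear function of the features, then logistic sigmoid activation
    z = np.dot(w_F, x_F) + b
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.5, -0.3])   # example feature vector
w = np.array([0.8, -1.2])   # illustrative weights
print(neuron_predict_proba(x, w, b=-0.5))  # probability of class 1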
SLIDE 14

Recall: Binary log loss


log loss(y, p̂) = −y log(p̂) − (1 − y) log(1 − p̂)

error(y, ŷ) = 1 if y ≠ ŷ,  0 if y = ŷ

Plot assumes:

  • True label is 1
  • Threshold is 0.5
  • Log base 2
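A quick worked check of this formula under the plot's assumptions (true label 1, log base 2), written as a throwaway NumPy snippet:

import numpy as np

def binary_log_loss(y, p_hat, base=2):
    # -y*log(p_hat) - (1-y)*log(1-p_hat), converted to the requested log base
    loss = -y * np.log(p_hat) - (1 - y) * np.log(1 - p_hat)
    return loss / np.log(base)

for p in [0.9, 0.5, 0.1]:
    print(p, binary_log_loss(1, p))
# p_hat=0.9 -> ~0.152, p_hat=0.5 -> 1.0, p_hat=0.1 -> ~3.32 bits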
SLIDE 15

Multi-class log loss

log loss(ȳ_n, p̂_n) = − Σ_{c=1..C} ȳ_nc log(p̂_nc)

Input: two vectors of length C
Output: scalar value (strictly non-negative)

Justifications carry over from the binary case:

  • Interpret as upper bound on the error rate
  • Interpret as cross entropy of multi-class discrete random variable
  • Interpret as log likelihood of multi-class discrete random variable
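A minimal NumPy sketch of this loss for one example (names are my own). Because ȳ_n is one-hot, only the true class's term survives the sum:

import numpy as np

def multiclass_log_loss(ybar_C, phat_C):
    # -sum_c ybar_c * log(phat_c); only the true class contributes
    return -np.sum(ybar_C * np.log(phat_C))

ybar = np.array([0, 0, 1, 0])           # true class is 2
phat = np.array([0.1, 0.2, 0.6, 0.1])   # predicted probabilities
print(multiclass_log_loss(ybar, phat))  # -log(0.6) ~= 0.511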
SLIDE 16

Each Layer Extracts “Higher Level” Features


SLIDE 17

Deep Neural Nets: PROs, CONs?

PROs:
  • Flexible models
  • State-of-the-art success in many applications
    • Object recognition
    • Speech recognition
    • Language models
  • Open-source software

SLIDE 18

Two kinds of optimization problem


  • Convex: only one global minimum. If GD converges, the solution is the best possible.
  • Non-convex: one or more local minima. The GD solution might be much worse than the global minimum.

SLIDE 19


Deep Neural Nets: Optimization is not convex

  • Convex (only one global minimum; if GD converges, the solution is the best possible): Linear regression, Logistic regression
  • Non-convex (one or more local minima; the GD solution might be much worse than the global minimum): MLPs with 1+ hidden layers, Deep NNs in general

SLIDE 20

Deep Neural Nets: PROs and CONs

PROs:
  • Flexible models
  • State-of-the-art success in many applications
    • Object recognition
    • Speech recognition
    • Language models
  • Open-source software

CONs:
  • Require lots of data
  • Each run of SGD can take hours/days
  • Optimization not easy
    • Will it converge?
    • Is a local minimum good enough?
  • Hard to extrapolate
  • Many hyperparameters
  • Will it overfit?

SLIDE 21

Many hyperparameters for a Deep Neural Network (MLP)

  • Num. layers
  • Num. units / layer
  • Activation function
  • L2 penalty strength
  • Learning rate
  • Batch size

The first four control model complexity; the learning rate and batch size control optimization quality and speed.

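As one concrete mapping (a sketch assuming scikit-learn's MLPClassifier; other frameworks name these differently), each hyperparameter on this slide corresponds to a constructor argument:

from sklearn.neural_network import MLPClassifier

model = MLPClassifier(
    hidden_layer_sizes=(64, 64),  # num. layers and num. units per layer
    activation='relu',            # activation function
    alpha=1e-4,                   # L2 penalty strength
    learning_rate_init=1e-3,      # learning rate
    batch_size=128,               # batch size
    max_iter=200,
    random_state=0,
)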
SLIDE 22

Guidelines: complexity params

  • Num. units / layer
    • Start with something close to the num. features
    • Add more (log spaced) until serious overfitting
  • Num. layers
    • Start with 1
    • Add more (+1 at a time) until serious overfitting
  • L2 penalty strength scalar
    • Try a range of values (log spaced)
  • Activation function
    • ReLU is reasonable for most problems

SLIDE 23

Grid Search

1) Choose candidate values of each hyperparameter
2) For each combination, assess its heldout score

  • We need to choose in advance:
    • Performance metric (e.g. AUROC, log loss, TPR at PPV > 0.98, etc.)
      • What is most important for your task?
    • Source of heldout data
      • Fixed validation set: faster, simpler
      • Cross validation with K folds: less noise, better use of all available data

3) Select the single combination with the best score

[Illustration: a 2-D grid of candidate hyperparameter combinations over step size/learning rate and number of hidden units.]

SLIDE 24

Grid Search


Each trial can be parallelized. Works for numeric or discrete hyperparameters. But the number of combinations to try can quickly become infeasible.
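A sketch of this recipe with scikit-learn's GridSearchCV (the candidate values, metric, and model below are my own illustrative choices, not the slides'):

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

param_grid = {
    'hidden_layer_sizes': [(16,), (64,), (256,)],   # number of hidden units
    'learning_rate_init': [1e-3, 1e-2, 1e-1],       # step size / learning rate
}
search = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid,
    scoring='neg_log_loss',  # chosen performance metric
    cv=5,                    # 5-fold cross validation as the heldout source
    n_jobs=-1,               # trials run in parallel
)
# search.fit(x_NF, y_N)      # then search.best_params_ holds the best combination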

SLIDE 25

Random Search

1) Choose candidate distributions for each hyperparameter
2) For each of T samples from those distributions, assess its heldout score

  • We need to choose in advance:
    • Performance metric (e.g. AUROC, log loss, TPR at PPV > 0.98, etc.)
      • What is most important for your task?
    • Source of heldout data
      • Fixed validation set: faster, simpler
      • Cross validation with K folds: less noise, better use of all available data

3) Select the single combination with the best score


Usually, for convenience, the hyperparameter distributions are assumed independent of each other.

SLIDE 26

Random Search


Each trial can be parallelized. Best for numeric hyperparameters. Gives better coverage of the space than grid search.
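The matching scikit-learn sketch uses RandomizedSearchCV with distributions instead of fixed grids (again, the distributions and the number of samples T are my own illustrative choices):

from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

param_distributions = {
    'hidden_layer_sizes': [(16,), (32,), (64,), (128,), (256,)],
    'learning_rate_init': loguniform(1e-4, 1e-1),   # log-spaced distribution
    'alpha': loguniform(1e-6, 1e-1),                # L2 penalty strength
}
search = RandomizedSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_distributions,
    n_iter=25,               # T = 25 sampled configurations
    scoring='neg_log_loss',
    cv=5,
    n_jobs=-1,
    random_state=0,
)
# search.fit(x_NF, y_N)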

SLIDE 27


Random Search covers more of the parameter space

Credit: Bergstra & Bengio JMLR 2012

SLIDE 28

8 random trials beats 100 grid search trials on MNIST digits


Grid search over 100 configs

Credit: Bergstra & Bengio JMLR 2012

SLIDE 29

Hyperopt Toolbox


https://www.youtube.com/watch?v=Mp1xnPfE4PY
https://github.com/hyperopt/hyperopt/wiki/FMin
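A minimal sketch of the hyperopt FMin interface linked above; the objective function here is a made-up stand-in for "train the model with these hyperparameters and return its heldout loss":

import numpy as np
from hyperopt import fmin, tpe, hp

space = {
    'learning_rate': hp.loguniform('learning_rate', np.log(1e-4), np.log(1e-1)),
    'n_hidden': hp.choice('n_hidden', [16, 64, 256]),
}

def objective(params):
    # Placeholder objective: pretend the best learning rate is ~3e-3
    # and return a loss that is smallest there (lower is better).
    return (np.log10(params['learning_rate']) + 2.5) ** 2

best = fmin(objective, space, algo=tpe.suggest, max_evals=50)
print(best)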

SLIDE 30


2012 ImageNet Challenge Winner

ImageNet challenge: 1000 categories, 1.2 million images in the training set

How to learn 60 million parameters from 1 million examples?

SLIDE 31

NN Tricks to avoid overfitting

  • Gather more data
    • Data augmentation
  • Modify optimization
    • Early stopping


SLIDE 32

Data Augmentation: Gather more (artificial) data

Credit: Bharath Raj (medium.com post)

SLIDE 33

Data Augmentation

Data Augmentation: increase the effective size of the training dataset by applying perturbations to existing features x to create new (x′, y) pairs. Choose perturbations which do not change the label.

Images:
  • Flip left-to-right
  • Slight rotations or crops
  • Recolor or brighten

Text:
  • Add slight misspellings
  • Replace a word with a similar word

from AlexNet paper (Krizhevsky et al. NIPS 2012)
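A toy sketch of the "flip left-to-right" perturbation on an image stored as a NumPy array (not the AlexNet pipeline itself, just the idea of creating a new (x′, y) pair with the label unchanged):

import numpy as np

def augment_flip_lr(image_HWC, label):
    # Mirror the image horizontally; the class label stays the same.
    flipped = image_HWC[:, ::-1, :]
    return flipped, label

x = np.random.rand(32, 32, 3)    # stand-in for a training image
x_aug, y_aug = augment_flip_lr(x, label=2)
print(x_aug.shape, y_aug)        # (32, 32, 3) 2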

SLIDE 34

Early Stopping

Big idea: stop training once performance on your heldout validation set stops improving
  • Avoid overfitting
  • Save time / compute resources

[Plot: a performance metric (assume higher is better; could be accuracy, area under ROC, recall, precision, whatever you care about) over training, comparing performance on the training set with performance on the validation set.]

Credit: https://deeplearning4j.org/docs/latest/deeplearning4j-nn-early-stopping

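One way to get this behavior without writing the loop yourself (a sketch assuming scikit-learn's MLPClassifier; other frameworks expose similar options) is the built-in early_stopping flag, which holds out a validation fraction and stops once its score stops improving:

from sklearn.neural_network import MLPClassifier

model = MLPClassifier(
    hidden_layer_sizes=(64,),
    early_stopping=True,       # monitor a heldout validation split
    validation_fraction=0.1,   # 10% of the training data held out
    n_iter_no_change=10,       # stop after 10 epochs without improvement
    tol=1e-4,                  # minimum improvement that counts
    max_iter=1000,
    random_state=0,
)
# model.fit(x_NF, y_N)  # training stops early once the validation score plateaus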
SLIDE 35

Objectives Today: (day 13) NNs in Practice

  • Multi-class classification with NNs
  • Pros and cons of NNs
  • Avoiding overfitting with NNs
  • Hyperparameter selection
  • Data augmentation
  • Early stopping
