ECE 5984: Introduction to Machine Learning

Topics: Neural Networks (Backprop)


SLIDE 1

ECE 5984: Introduction to Machine Learning

Dhruv Batra, Virginia Tech

Topics:

– Neural Networks
– Backprop

Readings: Murphy 16.5

SLIDE 2

Administrativia

  • HW3

– Due: in 2 weeks
– You will implement primal & dual SVMs
– Kaggle competition: Higgs Boson Signal vs Background classification
– https://inclass.kaggle.com/c/2015-Spring-vt-ece-machine-learning-hw3
– https://www.kaggle.com/c/higgs-boson

SLIDE 3

Administrativia

  • Project Mid-Sem Spotlight Presentations

– Friday: 5-7pm, 3-5pm Whittemore 654
– 5 slides (recommended)
– 4 minute time (STRICT) + 1-2 min Q&A
– Tell the class what you're working on
– Any results yet?
– Problems faced?
– Upload slides on Scholar

SLIDE 4

Recap of Last Time

SLIDE 5

Not linearly separable data

  • Some datasets are not linearly separable!

– http://www.eee.metu.edu.tr/~alatan/Courses/Demo/AppletSVM.html

SLIDE 6

Addressing non-linearly separable data – Option 1, non-linear features

Slide Credit: Carlos Guestrin

  • Choose non-linear features, e.g.,

– Typical linear features: w_0 + ∑_i w_i x_i
– Example of non-linear features:

  • Degree 2 polynomials: w_0 + ∑_i w_i x_i + ∑_{ij} w_{ij} x_i x_j
  • The classifier h_w(x) is still linear in the parameters w

– As easy to learn
– Data is linearly separable in higher-dimensional spaces
– Express via kernels
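Not from the slides, but a minimal NumPy sketch of this option (data and names are illustrative): points labeled by a circle are not linearly separable in (x1, x2), yet a fixed degree-2 feature map makes them separable by a linear rule.

```python
import numpy as np

def degree2_features(X):
    """Map (x1, x2) -> (x1, x2, x1^2, x1*x2, x2^2); the bias w0 stays with the classifier."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x1 * x2, x2**2])

# Label +1 inside the unit circle, -1 outside: not linearly separable in 2D.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where((X**2).sum(axis=1) < 1.0, 1, -1)

# In the lifted space the fixed weights w = (0, 0, -1, 0, -1), w0 = 1 separate
# the classes perfectly, since sign(1 - x1^2 - x2^2) is exactly the circle rule.
Phi = degree2_features(X)
w, w0 = np.array([0.0, 0.0, -1.0, 0.0, -1.0]), 1.0
assert np.all(np.sign(Phi @ w + w0) == y)
```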

SLIDE 7

Addressing non-linearly separable data – Option 2, non-linear classifier

Slide Credit: Carlos Guestrin

  • Choose a classifier h_w(x) that is non-linear in the parameters w, e.g.,

– Decision trees, neural networks, …

  • More general than linear classifiers
  • But can often be harder to learn (non-convex optimization required)
  • Often very useful (outperforms linear classifiers)
  • In a way, both ideas are related

SLIDE 8

Biological Neuron

SLIDE 9

Recall: The Neuron Metaphor

  • Neurons

– accept information from multiple inputs,
– transmit information to other neurons.

  • Multiply inputs by weights along edges
  • Apply some function to the set of inputs at each node

Slide Credit: HKUST

SLIDE 10

Types of Neurons

[Diagram, repeated for each neuron type: inputs x_1 … x_D with weights θ_1 … θ_D, plus a bias weight θ_0 on a constant input 1, producing the output f(x⃗, θ)]

  • Linear Neuron
  • Logistic Neuron
  • Perceptron
  • Potentially more; gradient-descent training requires a convex loss function.

Slide Credit: HKUST
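Read as code, the three diagrams differ only in the output nonlinearity; a sketch (function names are mine):

```python
import numpy as np

def pre_activation(x, theta, theta0):
    """Shared structure of every neuron type: theta0 + sum_i theta_i * x_i."""
    return theta0 + x @ theta

def linear_neuron(x, theta, theta0):
    return pre_activation(x, theta, theta0)                         # identity output

def logistic_neuron(x, theta, theta0):
    return 1.0 / (1.0 + np.exp(-pre_activation(x, theta, theta0)))  # squashes to (0, 1)

def perceptron(x, theta, theta0):
    return 1 if pre_activation(x, theta, theta0) >= 0 else 0        # hard threshold
```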

SLIDE 11

Limitation

  • A single “neuron” is still a linear decision boundary
  • What to do?
  • Idea: Stack a bunch of them together!

SLIDE 12

Multilayer Networks

  • Cascade Neurons together
  • The output from one layer is the input to the next
  • Each layer has its own set of weights

[Diagram: inputs x0, x1, x2, …, xP feed two hidden layers with weight vectors θ⃗0,0, θ⃗0,1, θ⃗0,2 and θ⃗1,0, θ⃗1,1, θ⃗1,2; output weights θ2,0, θ2,1, θ2,2 produce f(x, θ)]

Slide Credit: HKUST

SLIDE 13

Universal Function Approximators

  • Theorem

– A 3-layer network with linear outputs can uniformly approximate any continuous function to arbitrary accuracy, given enough hidden units [Funahashi '89]

SLIDE 14

Plan for Today

  • Neural Networks

– Parameter learning
– Backpropagation

SLIDE 15

Forward Propagation

  • On board

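The derivation itself was done on the board; below is a sketch of the standard forward pass for the cascaded network of the previous slide, assuming logistic activations at every layer (shapes are illustrative).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Cascade the layers: the output of one layer is the input to the next.

    `layers` is a list of (Theta, theta0) pairs, one per layer; all
    intermediate activations are kept so backprop can reuse them later.
    """
    activations = [x]
    for Theta, theta0 in layers:
        x = sigmoid(Theta @ x + theta0)   # each unit: f(theta0 + sum_i theta_i * x_i)
        activations.append(x)
    return activations

# Example: 3 inputs -> 3 hidden units -> 1 output, random weights.
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(3, 3)), rng.normal(size=3)),
          (rng.normal(size=(1, 3)), rng.normal(size=1))]
print(forward(np.array([0.5, -1.0, 2.0]), layers)[-1])
```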

SLIDES 16-21

Feed-Forward Networks

  • Predictions are fed forward through the network to classify (the deck repeats this slide across several frames as an animation of the forward pass)

[Diagram: the same two-hidden-layer network as before; inputs x0 … xP, hidden-layer weight vectors θ⃗0,· and θ⃗1,·, output weights θ2,·]

Slide Credit: HKUST


SLIDE 22

Gradient Computation

  • First let’s try:

– Single Neuron for Linear Regression
– Single Neuron for Logistic Regression

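These derivations were done on the board. For reference, the standard result for the first case (a single linear neuron trained with squared loss, i.e. linear regression) is:

```latex
\ell(w) = \frac{1}{2} \sum_j \Big( y^{(j)} - w^\top x^{(j)} \Big)^2
\qquad\Rightarrow\qquad
\frac{\partial \ell}{\partial w_i} = -\sum_j \Big( y^{(j)} - w^\top x^{(j)} \Big)\, x_i^{(j)}
```

The gradient is (prediction error) times (input); the logistic case on the next slide has exactly the same shape, with the sigmoid prediction in place of w^T x.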

SLIDE 23

Logistic regression

  • Learning rule – MLE:

Slide Credit: Carlos Guestrin
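The update rule shown on this slide did not survive extraction; the standard MLE gradient-ascent rule for logistic regression, which is presumably what was pictured, is:

```latex
w_i^{(t+1)} \;\leftarrow\; w_i^{(t)}
  + \eta \sum_j x_i^{(j)} \Big( y^{(j)} - \hat{P}\big(Y = 1 \mid x^{(j)}, w^{(t)}\big) \Big),
\qquad
\hat{P}(Y = 1 \mid x, w) = \frac{1}{1 + e^{-w_0 - \sum_i w_i x_i}}
```

where η is the step size; note the same (error times input) form as the linear case.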

SLIDE 24

Gradient Computation

  • First let’s try:

– Single Neuron for Linear Regression
– Single Neuron for Logistic Regression

  • Now let’s try the general case
  • Backpropagation!

– Really efficient

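To make "really efficient" concrete, here is a sketch (mine; the shapes and the squared loss are illustrative) of backprop on a one-hidden-layer sigmoid network: one forward pass caches the activations, one backward pass reuses them, so the cost per example is linear in the number of weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, b1, W2, b2, lr=0.5):
    """One gradient-descent step on 0.5 * (y_hat - y)^2 for a 1-hidden-layer net."""
    # Forward pass: cache every activation.
    h = sigmoid(W1 @ x + b1)                      # hidden layer
    y_hat = sigmoid(W2 @ h + b2)                  # output layer

    # Backward pass: chain rule, reusing the cached values.
    delta2 = (y_hat - y) * y_hat * (1 - y_hat)    # error signal at the output
    delta1 = (W2.T @ delta2) * h * (1 - h)        # error pushed back through W2

    W2 -= lr * np.outer(delta2, h);  b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x);  b1 -= lr * delta1
    return 0.5 * float(np.sum((y_hat - y) ** 2))

# Fit a single training point just to show the loss decreasing.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
for _ in range(500):
    loss = backprop_step(np.array([1.0, 0.0]), np.array([1.0]), W1, b1, W2, b2)
print(loss)   # small after 500 steps
```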

SLIDE 25

Neural Nets

  • Best performers on OCR

– http://yann.lecun.com/exdb/lenet/index.html

  • NetTalk

– Text-to-speech system from 1987
– http://youtu.be/tXMaFhO6dIY?t=45m15s

  • Rick Rashid speaks Mandarin

– http://youtu.be/Nu-nlQqFCKg?t=7m30s

SLIDE 26

Neural Networks

  • Demo

– http://neuron.eng.wayne.edu/bpFunctionApprox/bpFunctionApprox.html

SLIDE 27

Historical Perspective

SLIDE 28

Convergence of backprop

  • Perceptron leads to convex optimization

– Gradient descent reaches the global minimum

  • Multilayer neural nets not convex

– Gradient descent gets stuck in local minima
– Hard to set the learning rate
– Selecting the number of hidden units and layers is a fuzzy process
– NNs had fallen out of fashion in the 90s and early 2000s
– Back with a new name and significantly improved performance!

  • Deep networks

– Dropout, and training on much larger corpora

Slide Credit: Carlos Guestrin

SLIDE 29

Overfitting

  • Many many many parameters
  • Avoiding overfitting?

– More training data
– Regularization
– Early stopping
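A sketch of how the last two bullets are typically combined in one training loop (all names here, e.g. grad_fn and loss_fn, are hypothetical placeholders): L2 regularization adds a weight-decay term to each gradient step, and early stopping keeps the weights that did best on held-out data.

```python
import numpy as np

def train(grad_fn, loss_fn, w, train_data, val_data,
          lam=1e-3, lr=0.1, patience=10, max_steps=10000):
    """Gradient descent with L2 regularization (lam) and early stopping (patience)."""
    best_w, best_val, bad = w.copy(), np.inf, 0
    for _ in range(max_steps):
        w -= lr * (grad_fn(w, train_data) + lam * w)  # lam * w is the L2 penalty term
        val = loss_fn(w, val_data)
        if val < best_val:
            best_w, best_val, bad = w.copy(), val, 0  # new best on held-out data
        else:
            bad += 1
            if bad >= patience:                       # validation loss stopped improving
                break
    return best_w
```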

SLIDE 30

A quick note

Image Credit: LeCun et al. '98

SLIDE 31

Rectified Linear Units (ReLU)

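This slide is image-only; the function it plots is the standard rectifier:

```latex
\mathrm{ReLU}(z) = \max(0, z),
\qquad
\frac{d}{dz}\,\mathrm{ReLU}(z) =
\begin{cases}
  1 & z > 0 \\
  0 & z < 0
\end{cases}
```

Unlike the sigmoid, the gradient does not saturate for z > 0, which helps gradient-based training of deep networks.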

SLIDE 32

Convolutional Nets

  • Basic Idea

– On board
– Assumptions:

  • Local Receptive Fields
  • Weight Sharing / Translational Invariance / Stationarity

– Each layer is just a convolution!


[Figure: input image → convolutional layer → sub-sampling layer]

Image Credit: Chris Bishop
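A sketch (mine) of the two assumptions in code: each output value depends only on a small patch of the input (local receptive field), and the very same kernel weights are applied at every location (weight sharing).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2D 'valid' convolution: one shared kernel slid over the image.

    (Strictly this is cross-correlation; conv nets conventionally skip the
    kernel flip.)
    """
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + k, j:j + k]      # local receptive field
            out[i, j] = np.sum(patch * kernel)   # same weights at every location
    return out

# A 3x3 averaging kernel over a 6x6 image gives a 4x4 feature map.
print(conv2d_valid(np.arange(36.0).reshape(6, 6), np.ones((3, 3)) / 9).shape)
```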

SLIDES 33-41

[Image-only slides]

Slide Credit: Marc'Aurelio Ranzato

SLIDE 42

Convolutional Nets

  • Example:

– http://yann.lecun.com/exdb/lenet/index.html


[Figure: LeNet-5 architecture. INPUT 32x32 → (convolutions) C1: feature maps 6@28x28 → (subsampling) S2: f. maps 6@14x14 → (convolutions) C3: f. maps 16@10x10 → (subsampling) S4: f. maps 16@5x5 → (full connection) C5: layer 120 → (full connection) F6: layer 84 → (Gaussian connections) OUTPUT 10]

Image Credit: Yann LeCun, Kevin Murphy
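The stage sizes in the figure follow from the usual rule out = (in - kernel) / stride + 1; a quick check, assuming 5x5 convolution kernels at stride 1 and 2x2 subsampling (my reading of the figure):

```python
# LeNet-5 stage sizes under the assumptions above:
size = 32            # INPUT: 32x32
size = size - 5 + 1  # C1: (32 - 5) + 1 = 28  -> 6@28x28
size = size // 2     # S2: 28 / 2 = 14        -> 6@14x14
size = size - 5 + 1  # C3: (14 - 5) + 1 = 10  -> 16@10x10
size = size // 2     # S4: 10 / 2 = 5         -> 16@5x5
size = size - 5 + 1  # C5: a 5x5 kernel covers the whole map -> 120 1x1 units
print(size)          # 1
```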

SLIDES 43-45

[Image-only slides]

Slide Credit: Marc'Aurelio Ranzato

SLIDES 46-48

Visualizing Learned Filters

Figure Credit: [Zeiler & Fergus ECCV14]

SLIDE 49

Autoencoders

  • Goal

– Compression: Output tries to predict input

Image Credit: http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders
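A sketch of that goal (mine; the layer sizes are illustrative): the encoder maps the input down to a narrow code, the decoder maps the code back, and training minimizes the reconstruction error between output and input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencode(x, W_enc, b_enc, W_dec, b_dec):
    """Compress x to a low-dim code, then try to reproduce x from the code."""
    code = sigmoid(W_enc @ x + b_enc)      # 10 -> 3 dims: the learned low-dim "basis"
    x_hat = sigmoid(W_dec @ code + b_dec)  # 3 -> 10 dims: the reconstruction
    return code, x_hat

rng = np.random.default_rng(0)
W_enc, b_enc = rng.normal(size=(3, 10)), np.zeros(3)
W_dec, b_dec = rng.normal(size=(10, 3)), np.zeros(10)
x = rng.uniform(size=10)
code, x_hat = autoencode(x, W_enc, b_enc, W_dec, b_dec)
loss = 0.5 * np.sum((x_hat - x) ** 2)      # training would minimize this
```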

SLIDE 50

Autoencoders

  • Goal

– Learns a low-dimensional “basis” for the data

Image Credit: Andrew Ng

SLIDE 51

Stacked Autoencoders

  • How about we compress the low-dim features more?

Image Credit: http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders

SLIDE 52


Sparse DBNs [Lee et al. ICML ‘09] Figure courtesy: Quoc Le

SLIDE 53

Stacked Autoencoders

  • Finally, perform classification with these low-dim features.

Image Credit: http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders

SLIDE 54

What you need to know about neural networks

  • Perceptron:

– Representation
– Derivation

  • Multilayer neural nets

– Representation
– Derivation of backprop
– Learning rule
– Expressive power