SLIDE 1

CS 559: Machine Learning Fundamentals and Applications
12th Set of Notes

Instructor: Philippos Mordohai
Webpage: www.cs.stevens.edu/~mordohai
E-mail: Philippos.Mordohai@stevens.edu
Office: Lieb 215

SLIDE 2

Overview

  • Deep Learning
    – Based on slides by M. Ranzato (mainly), S. Lazebnik, R. Fergus and Q. Zhang
SLIDE 3

Natural Neurons

  • Human recognition of digits
    – visual cortices
    – neuron interaction

SLIDE 4

Recognizing Handwritten Digits

  • How to describe a digit to a computer
    – "a 9 has a loop at the top, and a vertical stroke in the bottom right"
    – Algorithmically difficult to describe various 9s

SLIDE 5

Perceptrons

  • 1950s-1960s: Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts
  • Standard model of artificial neurons

SLIDE 6

Binary Perceptrons

  • Inputs
    – Multiple binary inputs
  • Parameters
    – Thresholds & weights
  • Outputs
    – Thresholded weighted linear combination
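As a concrete illustration (a minimal sketch, not from the slides; the weights and threshold are arbitrary example values), a binary perceptron is simply a thresholded weighted sum of its binary inputs:

```python
import numpy as np

def perceptron(x, w, threshold):
    """Binary perceptron: output 1 if the weighted sum of the
    binary inputs exceeds the threshold, else 0."""
    return int(np.dot(w, x) > threshold)

# arbitrary example values
x = np.array([1, 0])          # binary inputs
w = np.array([0.6, 0.4])      # weights
print(perceptron(x, w, 0.5))  # -> 1, since 0.6*1 + 0.4*0 = 0.6 > 0.5
```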

SLIDE 7

Layered Perceptrons

  • Layered, complex model
    – 1st layer, 2nd layer of perceptrons
  • Perceptron rule
    – Weights, thresholds
  • Similarity to logical functions (NAND)
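For example (a standard textbook illustration, not taken from the slides), a single perceptron with weights -2, -2 and bias 3 computes the NAND function:

```python
def nand_perceptron(x1, x2):
    # fires 0 only when both binary inputs are 1
    return int(-2 * x1 - 2 * x2 + 3 > 0)

print([nand_perceptron(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# -> [1, 1, 1, 0]
```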

SLIDE 8

Sigmoid Neurons

  • Sigmoid neurons
  • Stability
    – Small perturbation, small output change
  • Continuous inputs
  • Continuous outputs
  • Soft thresholds
SLIDE 9

Output Functions

  • Sigmoid neurons
  • Output
  • Sigmoid vs. conventional thresholds
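For reference, the standard sigmoid neuron computes

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = \sum_j w_j x_j + b,$$

a soft version of the hard threshold: small changes in the weights and bias produce small changes in the output.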

SLIDE 10

Smoothness & Differentiability

  • Perturbations and derivatives
  • Continuous function
  • Differentiable
  • Layers
    – Input layers, output layers, hidden layers

SLIDE 11

Layer Structure Design

  • Design of hidden layer
    – Heuristic rules
  • Number of hidden layers vs. computational resources
  • Feedforward network
    – No loops involved

SLIDE 12

Cost Function & Optimization

  • Learning with gradient descent
  • Cost function
    – Euclidean loss
    – Non-negative, smooth, differentiable
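For reference, the Euclidean (quadratic) cost over a training set of $n$ examples is conventionally written

$$C(w, b) = \frac{1}{2n} \sum_{x} \lVert y(x) - a(x) \rVert^2,$$

where $y(x)$ is the target and $a(x)$ the network output; it is non-negative, smooth, and differentiable in the weights and biases.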

SLIDE 13

Cost Function & Optimization

  • Gradient Descent
  • Gradient vector
SLIDE 14

Cost Function & Optimization

  • Extension to multiple dimensions
    – m variables
  • Small change in variable
  • Small change in cost
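For reference, with $m$ variables $v = (v_1, \ldots, v_m)$, a small change in the variables produces a change in the cost of approximately

$$\Delta C \approx \nabla C \cdot \Delta v, \qquad \nabla C = \left( \frac{\partial C}{\partial v_1}, \ldots, \frac{\partial C}{\partial v_m} \right)^{T},$$

so choosing $\Delta v = -\eta \nabla C$ (gradient descent) guarantees $\Delta C \le 0$ for a small enough learning rate $\eta$.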
SLIDE 15

Neural Nets for Computer Vision

Based on tutorials at CVPR 2012 and 2014 by Marc’Aurelio Ranzato

SLIDE 16

Building an Object Recognition System

IDEA: Use data to optimize features for the given task

SLIDE 17

Building an Object Recognition System

What we want: use a parameterized function such that
a) features are computed efficiently
b) features can be trained efficiently

SLIDE 18

Building an Object Recognition System

  • Everything becomes adaptive
  • No distinction between feature extractor and classifier
  • Big non-linear system trained from raw pixels to labels
SLIDE 19

Building an Object Recognition System

Q: How can we build such a highly non-linear system?
A: By combining simple building blocks we can make more and more complex systems.

SLIDE 20

Building a Complicated Function

  • Function composition is at the core of deep learning methods
  • Each "simple function" will have parameters subject to training
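A minimal sketch (my own illustration, with assumed function names) of building a complicated function by composing simple parameterized blocks:

```python
import numpy as np

def relu(v):
    return np.maximum(0, v)

def simple_block(x, W, b):
    # one "simple function": affine map followed by a non-linearity
    return relu(W @ x + b)

def deep_function(x, params):
    """Compose the simple blocks: f(x) = f_L(... f_2(f_1(x)))."""
    h = x
    for W, b in params:      # each block has trainable parameters (W, b)
        h = simple_block(h, W, b)
    return h
```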

SLIDE 21

Implementing a Complicated Function

SLIDE 22

Intuition Behind Deep Neural Nets

SLIDE 23

Intuition Behind Deep Neural Nets

Each black box can have trainable parameters. Their composition makes a highly non-linear system.

SLIDE 24

Intuition Behind Deep Neural Nets

The system produces a hierarchy of features.

SLIDE 25

Intuition Behind Deep Neural Nets

SLIDE 26

Intuition Behind Deep Neural Nets

SLIDE 27

Intuition Behind Deep Neural Nets

SLIDE 28

Key Ideas of Neural Nets

IDEA #1: Learn features from data
IDEA #2: Use differentiable functions that produce features efficiently
IDEA #3: End-to-end learning: no distinction between feature extractor and classifier
IDEA #4: "Deep" architectures: cascade of simpler non-linear modules

SLIDE 29

Key Questions

  • What is the input-output mapping?
  • How are the parameters trained?
  • How computationally expensive is it?
  • How well does it work?
SLIDE 30

Supervised Deep Learning

Marc’Aurelio Ranzato

SLIDE 31

Supervised Learning

  • Training set: {(x_i, y_i), i = 1...P}
    – x_i: i-th input training example
    – y_i: i-th target label
    – P: number of training examples
  • Goal: predict the target label of unseen inputs
SLIDE 32

Supervised Learning Examples

SLIDE 33

Supervised Deep Learning

SLIDE 34

Neural Networks

Assumptions (for the next few slides):

  • The input image is vectorized (disregard the spatial layout of pixels)
  • The target label is discrete (classification)

Question: what class of functions shall we consider to map the input into the output?
Answer: composition of simpler functions.

Follow-up questions: Why not a linear combination? What are the "simpler" functions? What is the interpretation?
Answer: later...

SLIDE 35

Neural Networks: Example

  • x: input
  • h1: 1st layer hidden units
  • h2: 2nd layer hidden units
  • output

Example of a 2 hidden layer neural network (or 4 layer network, counting also input and output)

SLIDE 36

Forward Propagation

Forward propagation is the process of computing the output of the network given its input.

SLIDE 37

Forward Propagation

  • W1: 1st layer weight matrix (weights)
  • b1: 1st layer biases
  • The non-linearity u = max(0, v) is called ReLU in the DL literature.
  • Each output hidden unit takes as input all the units at the previous layer: each such layer is called "fully connected".

SLIDE 38

Rectified Linear Unit (ReLU)

SLIDE 39

Forward Propagation

  • W2: 2nd layer weight matrix (weights)
  • b2: 2nd layer biases

SLIDE 40

Forward Propagation

  • W3: 3rd layer weight matrix (weights)
  • b3: 3rd layer biases
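Putting slides 37-40 together, a minimal numpy sketch (my own illustration) of forward propagation through this 2-hidden-layer, fully connected ReLU network, leaving the output as raw scores:

```python
import numpy as np

def relu(v):
    # ReLU non-linearity u = max(0, v), applied element-wise
    return np.maximum(0, v)

def forward(x, W1, b1, W2, b2, W3, b3):
    """Forward propagation: compute the network output given its input."""
    h1 = relu(W1 @ x + b1)   # 1st hidden layer (fully connected)
    h2 = relu(W2 @ h1 + b2)  # 2nd hidden layer (fully connected)
    o = W3 @ h2 + b3         # output layer scores
    return o
```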

SLIDE 41

Alternative Graphical Representations

SLIDE 42

Interpretation

  • Question: Why can't the mapping between layers be linear?
  • Answer: Because a composition of linear functions is a linear function. The neural network would reduce to (1-layer) logistic regression.
  • Question: What do ReLU layers accomplish?
  • Answer: Piece-wise linear tiling: the mapping is locally linear.
SLIDE 43

Interpretation

  • Question: Why do we need many layers?
  • Answer: When the input has hierarchical structure, the use of a hierarchical architecture is potentially more efficient because intermediate computations can be re-used. DL architectures are also efficient because they use distributed representations which are shared across classes.

SLIDE 44

Interpretation

SLIDE 45

Interpretation

  • Distributed representations
  • Feature sharing
  • Compositionality

SLIDE 46

Interpretation

Question: What does a hidden unit do?
Answer: It can be thought of as a classifier or feature detector.

Question: How many layers? How many hidden units?
Answer: Cross-validation or hyper-parameter search methods are the answer. In general, the wider and the deeper the network, the more complicated the mapping.

Question: How do I set the weight matrices?
Answer: Weight matrices and biases are learned. First, we need to define a measure of quality of the current mapping. Then, we need to define a procedure to adjust the parameters.

SLIDE 47

How Good is a Network

  • Probability of class k given input (softmax)
  • (Per-sample) loss; e.g., negative log-likelihood (good for classification of a small number of classes)
SLIDE 48

Training

  • Learning consists of minimizing the loss (plus some regularization term) w.r.t. the parameters over the whole training set.

Question: How to minimize a complicated function of the parameters?
Answer: Chain rule, a.k.a. backpropagation! That is the procedure to compute gradients of the loss w.r.t. the parameters in a multi-layer neural network.

SLIDE 49

Key Idea: Wiggle to Decrease Loss

  • Let's say we want to decrease the loss by adjusting the entry W1_{i,j}.
  • We could consider a very small ε = 1e-6 and compute:
  • Then update:
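For reference, the standard finite-difference version of this "wiggle" estimates the gradient and takes a small step against it:

$$\frac{\partial L}{\partial W^1_{i,j}} \approx \frac{L(W^1_{i,j} + \epsilon) - L(W^1_{i,j})}{\epsilon}, \qquad W^1_{i,j} \leftarrow W^1_{i,j} - \eta \, \frac{\partial L}{\partial W^1_{i,j}}.$$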
SLIDE 50

Backward Propagation

SLIDE 51

Backward Propagation

SLIDE 52

Backward Propagation
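A minimal numpy sketch (my own illustration, not the slides' exact notation) of the backward pass for the 2-hidden-layer ReLU network above, assuming a softmax output with negative log-likelihood loss and a one-hot target `y`:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def forward_backward(x, y, W1, b1, W2, b2, W3, b3):
    """One forward + backward pass; returns the loss and all gradients."""
    # forward
    h1 = np.maximum(0, W1 @ x + b1)
    h2 = np.maximum(0, W2 @ h1 + b2)
    p = softmax(W3 @ h2 + b3)
    loss = -np.sum(y * np.log(p))
    # backward: apply the chain rule layer by layer
    do = p - y                            # dL/do for softmax + NLL
    dW3, db3 = np.outer(do, h2), do
    dh2 = W3.T @ do
    dv2 = dh2 * (h2 > 0)                  # gradient through ReLU
    dW2, db2 = np.outer(dv2, h1), dv2
    dh1 = W2.T @ dv2
    dv1 = dh1 * (h1 > 0)
    dW1, db1 = np.outer(dv1, x), dv1
    return loss, (dW1, db1, dW2, db2, dW3, db3)
```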

SLIDE 53

Optimization

Stochastic Gradient Descent, or one of its many variants
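A minimal sketch of the SGD loop (illustrative only; `minibatches` and `grad_loss` are hypothetical placeholders standing in for the data pipeline and the backward pass above):

```python
def sgd(params, data, lr=0.01, epochs=10, batch_size=128):
    for _ in range(epochs):
        for x_batch, y_batch in minibatches(data, batch_size):  # hypothetical helper
            grads = grad_loss(params, x_batch, y_batch)         # backprop on the minibatch
            # update: theta <- theta - lr * dL/dtheta
            params = [p - lr * g for p, g in zip(params, grads)]
    return params
```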

SLIDE 54

Convolutional Neural Networks

Marc’Aurelio Ranzato

SLIDE 55

Fully Connected Layer

SLIDE 56

Locally Connected Layer

SLIDE 57

Convolutional Layer

SLIDE 58

Convolutional Layer

SLIDE 59

Convolutional Layer

SLIDE 60

Convolutional Layer

SLIDE 61

Convolutional Layer

SLIDE 62

Convolutional Layer

SLIDE 63

Convolutional Layer

SLIDE 64

Convolutional Layer

SLIDE 65

Convolutional Layer

SLIDE 66

Convolutional Layer

SLIDE 67

Convolutional Layer

Question: What is the size of the output? What's the computational cost?
Answer: It is proportional to the number of filters and depends on the stride. If kernels have size K×K, the input has size D×D, the stride is 1, and there are M input feature maps and N output feature maps, then:

  • the input has size M×D×D
  • the output has size N×(D-K+1)×(D-K+1)
  • the kernels have M×N×K×K coefficients (which have to be learned)
  • cost: M×K×K×N×(D-K+1)×(D-K+1)

Question: How many feature maps? What's the size of the filters?
Answer: Usually, there are more output feature maps than input feature maps. Convolutional layers can increase the number of hidden units by big factors (and are expensive to compute). The size of the filters has to match the size/scale of the patterns we want to detect (task dependent).
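The same bookkeeping in a small sketch (my own illustration; the example numbers at the bottom are mine, not from the slide):

```python
def conv_layer_stats(D, K, M, N, stride=1):
    """Output size, kernel coefficients, and multiply count for a
    convolutional layer (valid convolution, as in the formulas above)."""
    out = (D - K) // stride + 1
    coeffs = M * N * K * K                # learnable kernel coefficients
    cost = M * K * K * N * out * out      # multiply-accumulates
    return (N, out, out), coeffs, cost

# e.g. a 32x32 RGB input (M=3), 5x5 kernels, N=16 output feature maps
print(conv_layer_stats(D=32, K=5, M=3, N=16))
# -> ((16, 28, 28), 1200, 940800)
```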

SLIDE 68

Key Ideas

  • A standard neural net applied to images:
    – scales quadratically with the size of the input
    – does not leverage stationarity
  • Solution:
    – connect each hidden unit to a small patch of the input
    – share the weights across space
  • This is called a convolutional layer
  • A network with convolutional layers is called a convolutional network

SLIDE 69

Pooling Layer

SLIDE 70

Pooling Layer

SLIDE 71

Pooling Layer

Question: What is the size of the output? What's the computational cost?
Answer: The size of the output depends on the stride between the pools. For instance, if pools do not overlap and have size K×K, and the input has size D×D with M input feature maps, then:

  • the output is M×(D/K)×(D/K)
  • the computational cost is proportional to the size of the input (negligible compared to a convolutional layer)

Question: How should I set the size of the pools?
Answer: It depends on how much "invariant" or robust to distortions we want the representation to be. It is best to pool slowly (via a few stacks of conv-pooling layers).
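A minimal sketch of non-overlapping K×K max pooling (my own illustration, assuming the spatial size divides evenly by K):

```python
import numpy as np

def max_pool(x, K):
    """Non-overlapping K x K max pooling over an (M, D, D) stack of feature maps."""
    M, D, _ = x.shape
    # group each K x K pool and take its maximum -> shape (M, D//K, D//K)
    return x.reshape(M, D // K, K, D // K, K).max(axis=(2, 4))
```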

SLIDE 72

Local Contrast Normalization

SLIDE 73

Local Contrast Normalization

SLIDE 74

Local Contrast Normalization

SLIDE 75

Local Contrast Normalization

SLIDE 76

ConvNets: Typical Stage

SLIDE 77

ConvNets: Typical Architecture

SLIDE 78

ConvNets: Typical Architecture

Conceptually similar to: SIFT → k-means → Pyramid Pooling → SVM
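As one possible rendering of such a typical architecture (my own sketch in PyTorch; the layer sizes are arbitrary example values, and local contrast normalization is omitted):

```python
import torch.nn as nn

def stage(c_in, c_out):
    # one typical ConvNet stage: convolution -> non-linearity -> pooling
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=5),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2),
    )

# typical architecture: a few conv stages, then a fully connected classifier
net = nn.Sequential(
    stage(3, 16),               # 3x32x32 input -> 16x14x14
    stage(16, 32),              # -> 32x5x5
    nn.Flatten(),
    nn.Linear(32 * 5 * 5, 10),  # 10-way classifier (softmax applied in the loss)
)
```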

SLIDE 79

Engineered vs. Learned Features

[Figure: two pipelines compared. Engineered: Image → Feature extraction → Pooling → Classifier → Label. Learned: Image → several Convolution/pool layers → Dense layers → Label.]

Convolutional filters are trained in a supervised manner by back-propagating classification error.

slide credit: S. Lazebnik

SLIDE 80

SIFT Descriptor

[Figure: Image pixels → apply gradient filters → spatial pool (sum) → normalize to unit length → feature vector.]

slide credit: R. Fergus

SLIDE 81

AlexNet

  • Similar framework to LeCun'98 but:
    – Bigger model (7 hidden layers, 650,000 units, 60,000,000 params)
    – More data (10^6 vs. 10^3 images)
    – GPU implementation (50x speedup over CPU)
    – Trained on two GPUs for a week
  • A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," NIPS 2012

SLIDE 82

Input

SLIDE 83

Conv Nets: Examples

  • Pedestrian detection

SLIDE 84

Conv Nets: Examples

  • Scene Parsing

SLIDE 85

Conv Nets: Examples

  • Denoising

SLIDE 86

Conv Nets: Examples

  • Object Detection

SLIDE 87

Conv Nets: Examples

  • Face Verification and Identification (DeepFace)

SLIDE 88

Conv Nets: Examples

  • Regression (DeepPose)