SLIDE 1

Deep Learning for Perception

Robert Platt, Northeastern University

SLIDE 2

Perception problems

We will focus on these applications... but deep learning has been applied in lots of other ways. We will ignore these applications:
– image segmentation
– speech-to-text
– natural language processing
– ...

SLIDE 3

Supervised learning problem

Given:
– A pattern exists
– We don't know what it is, but we have a bunch of examples
Machine learning problem: find a rule for making predictions from the data.
Classification vs. regression:
– if the labels are discrete, we have a classification problem
– if the labels are real-valued, we have a regression problem

SLIDE 4

Problem we want to solve

Input: $x \in \mathbb{R}^n$. Label: $y$. Data: $\{(x_1, y_1), \dots, (x_m, y_m)\}$.
Given the data, find a rule for predicting $y$ given $x$.

SLIDE 5

Problem we want to solve

Input: $x \in \mathbb{R}^n$. Label: $y$. Data: $\{(x_1, y_1), \dots, (x_m, y_m)\}$.
Given the data, find a rule for predicting $y$ given $x$.

Discrete $y$: classification. Continuous $y$: regression.

SLIDE 6

The multi-layer perceptron

A single "neuron" (i.e. unit) computes $y = f(w^\top x + b)$, where $f$ is the activation function and $w^\top x + b$ is the summation of weighted inputs.
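To make the computation concrete, here is a minimal numpy sketch of a single unit; the input, weight, and bias values are made up purely for illustration.

```python
import numpy as np

def unit(x, w, b, f):
    """A single neuron: weighted summation of inputs, then an activation."""
    return f(np.dot(w, x) + b)

# Made-up numbers, just to show the computation y = f(w.x + b):
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x = np.array([0.5, -1.0])   # input
w = np.array([2.0, 1.0])    # weights
b = 0.1                     # bias
print(unit(x, w, b, sigmoid))
```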

SLIDE 7

The multi-layer perceptron

Different activation functions:
– sigmoid: $f(z) = 1 / (1 + e^{-z})$
– tanh: $f(z) = \tanh(z)$
– rectified linear unit (ReLU): $f(z) = \max(0, z)$
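As numpy one-liners (these are the standard definitions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                  # squashes to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)          # zero for negative z, identity otherwise
```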

SLIDE 8

A single unit neural network

A one-layer neural network has a simple interpretation: linear classification.
– X_1 == symmetry
– X_2 == avg intensity
– Y == class label (binary)

SLIDE 9

Think-pair-share

– X_1 == symmetry
– X_2 == avg intensity
– Y == class label (binary)
What do w and b correspond to in this picture?

SLIDE 10

Training

Given a dataset: $\{(x_i, y_i)\}_{i=1}^m$
Define a loss function: $\ell(w, b; x_i, y_i) = (f(w^\top x_i + b) - y_i)^2$

SLIDE 11

Training

Given a dataset: $\{(x_i, y_i)\}_{i=1}^m$
Define a loss function: $\ell(w, b; x_i, y_i) = (f(w^\top x_i + b) - y_i)^2$
The loss function tells us how well the network classified a given data point.

SLIDE 12

Training

Given a dataset: $\{(x_i, y_i)\}_{i=1}^m$
Define a loss function: $\ell(w, b; x_i, y_i) = (f(w^\top x_i + b) - y_i)^2$
The loss function tells us how well the network classified the data; the closer to zero, the better the classification.
Method of training: adjust $w, b$ so as to minimize the net loss over the dataset, i.e., adjust $w, b$ so as to minimize $L(w, b) = \sum_{i=1}^m \ell(w, b; x_i, y_i)$.

SLIDE 13

Training

Method of training: adjust $w, b$ so as to minimize the net loss over the dataset, i.e., adjust $w, b$ so as to minimize $L(w, b) = \sum_{i=1}^m \ell(w, b; x_i, y_i)$.

How?

SLIDE 14

Training

Method of training: adjust $w, b$ so as to minimize the net loss over the dataset, i.e., adjust $w, b$ so as to minimize $L(w, b) = \sum_{i=1}^m \ell(w, b; x_i, y_i)$.

How? Gradient descent.

SLIDE 15

Time out for gradient descent

Suppose someone gives you an unknown function $F(x)$:
– you want to find a minimum of $F$
– but you do not have an analytical description of $F(x)$
Use gradient descent! All you need is the ability to evaluate $F(x)$ and its gradient at any point $x$:
1. pick $x_0$ at random
2. $x_1 = x_0 - \alpha \nabla F(x_0)$
3. $x_2 = x_1 - \alpha \nabla F(x_1)$
4. $x_3 = x_2 - \alpha \nabla F(x_2)$
5. ... (a code sketch follows below)
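A minimal sketch under exactly the slide's assumptions: all we can do is evaluate the gradient at a point. The example function and step size are made up; $F(x) = (x - 3)^2$ has a known minimum at $x = 3$.

```python
import numpy as np

def gradient_descent(gradF, x0, alpha=0.1, steps=100):
    """Repeatedly step downhill: x <- x - alpha * gradF(x)."""
    x = x0
    for _ in range(steps):
        x = x - alpha * gradF(x)
    return x

# Example: F(x) = (x - 3)^2, so gradF(x) = 2(x - 3); minimum at x = 3.
gradF = lambda x: 2.0 * (x - 3.0)
print(gradient_descent(gradF, x0=np.random.randn()))  # ~3.0
```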
SLIDE 17

Think-pair-share

  • 1. Label all the points that gradient descent could converge to.
  • 2. Which path does gradient descent take?
SLIDE 18

Training

Method of training: adjust $w, b$ so as to minimize the net loss over the dataset, i.e., adjust $w, b$ so as to minimize $L(w, b) = \sum_{i=1}^m \ell(w, b; x_i, y_i)$.

Do gradient descent on the dataset:
1. repeat
2. $w \leftarrow w - \alpha \frac{\partial L}{\partial w}$
3. $b \leftarrow b - \alpha \frac{\partial L}{\partial b}$
4. until converged

SLIDE 19

Training

Method of training: adjust $w, b$ so as to minimize the net loss over the dataset, i.e., adjust $w, b$ so as to minimize $L(w, b) = \sum_{i=1}^m \ell(w, b; x_i, y_i)$.

Do gradient descent on the dataset:
1. repeat
2. $w \leftarrow w - \alpha \frac{\partial L}{\partial w}$
3. $b \leftarrow b - \alpha \frac{\partial L}{\partial b}$
4. until converged

This is similar to logistic regression:
– logistic regression uses a cross-entropy loss
– we are using a quadratic loss

SLIDE 20

Training a one-unit neural network
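A minimal numpy sketch of the whole recipe for a one-unit network, with a sigmoid activation and the quadratic loss from the previous slides; the toy dataset, learning rate, and step count are placeholders, not the deck's demo.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dataset: m examples with 2 features (think symmetry and avg intensity),
# binary labels in {0, 1}. Values are random placeholders.
m = 100
X = np.random.randn(m, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b, alpha = np.zeros(2), 0.0, 0.5
for step in range(1000):
    p = sigmoid(X @ w + b)               # forward pass on all m examples
    dz = 2.0 * (p - y) * p * (1.0 - p)   # d(quadratic loss)/d(summation)
    w -= alpha * (X.T @ dz) / m          # gradient step on the weights
    b -= alpha * dz.mean()               # gradient step on the bias

print(((sigmoid(X @ w + b) > 0.5) == (y > 0.5)).mean())  # training accuracy
```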

SLIDE 21

Going deeper: a one layer network

Input layer → hidden layer → output layer. Each hidden node is connected to every input.

SLIDE 22

Multi-layer evaluation works similarly

Single activation: $a_j = f(w_j^\top x + b_j)$, for hidden units $a_1, \dots, a_4$.
Vector of hidden layer activations: $a = f(Wx + b)$.

SLIDE 23

Multi-layer evaluation works similarly

Single activation: $a_j = f(w_j^\top x + b_j)$, for hidden units $a_1, \dots, a_4$.
Vector of hidden layer activations: $a = f(Wx + b)$.
This is called "forward propagation" because the activations are propagated forward through the network.
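A minimal numpy sketch of forward propagation; the layer sizes and random weights are arbitrary placeholders.

```python
import numpy as np

def forward(x, Ws, bs, f):
    """Propagate forward: at each layer, a = f(W a_prev + b)."""
    a = x
    for W, b in zip(Ws, bs):
        a = f(W @ a + b)
    return a

# Three inputs -> one hidden layer of 4 units (a1..a4) -> one output.
f = np.tanh
Ws = [np.random.randn(4, 3), np.random.randn(1, 4)]
bs = [np.random.randn(4), np.random.randn(1)]
print(forward(np.random.randn(3), Ws, bs, f))
```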

SLIDE 24

Think-pair-share

Single activation: $a_j = f(w_j^\top x + b_j)$, for hidden units $a_1, \dots, a_4$.

Write a matrix expression for y in terms of x, f, and the weights (assume f can act over vectors as well as scalars...).

SLIDE 25

Can create networks of arbitrary depth...

Input layer → hidden layer 1 → hidden layer 2 → hidden layer 3 → output layer.

– Forward propagation works the same for a network of any depth.
– Whereas a single output node corresponds to linear classification, adding hidden nodes makes the classification non-linear.

SLIDE 26

Can create networks of arbitrary depth...

SLIDE 27

How do we train multi-layer networks?

Do gradient descent on the dataset:
1. repeat
2. $w \leftarrow w - \alpha \frac{\partial L}{\partial w}$
3. $b \leftarrow b - \alpha \frac{\partial L}{\partial b}$
4. until converged

Almost the same as in the single-node case, except now we do gradient descent on all weights/biases in the network:
– not just a single layer
– this is called backpropagation

SLIDE 28

Backpropagation

Goal: calculate $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$ for every weight and bias in the network.

SLIDE 29

Backpropagation

http://ufldl.stanford.edu/tutorial/supervised/MultiLayerNeuralNetworks/
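Following the structure of that tutorial, here is a minimal sketch of backpropagation for one hidden layer with sigmoid activations and the quadratic loss used earlier; the variable names and shapes are mine, not the tutorial's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, W1, b1, W2, b2):
    """Gradients of the quadratic loss for a single (x, y) example."""
    # Forward pass, keeping the intermediate activations.
    a1 = sigmoid(W1 @ x + b1)
    y_hat = sigmoid(W2 @ a1 + b2)
    # Backward pass: propagate the error from the output back toward the input.
    d2 = (y_hat - y) * y_hat * (1 - y_hat)   # error signal at the output layer
    d1 = (W2.T @ d2) * a1 * (1 - a1)         # error signal at the hidden layer
    # dL/dW2, dL/db2, dL/dW1, dL/db1:
    return np.outer(d2, a1), d2, np.outer(d1, x), d1
```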

SLIDE 30

Stochastic gradient descent: mini-batches

1. repeat
2. randomly sample a mini-batch $B \subset \{1, \dots, m\}$
3. $w \leftarrow w - \alpha \frac{\partial L_B}{\partial w}$
4. $b \leftarrow b - \alpha \frac{\partial L_B}{\partial b}$
5. until converged

Training in mini-batches helps because:
– you don't have to load the entire dataset into memory
– training is still relatively stable
– random sampling of batches helps avoid local minima

A batch is typically between 32 and 128 samples.
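A minimal sketch of the mini-batch loop; grad_fn is a hypothetical placeholder for whatever computes the gradients (e.g., backpropagation over the batch).

```python
import numpy as np

def sgd(X, y, params, grad_fn, batch_size=64, alpha=0.01, epochs=10):
    """Mini-batch SGD: each update uses a small random subset of the data."""
    m = len(X)
    for _ in range(epochs):
        order = np.random.permutation(m)               # random batch sampling
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            grads = grad_fn(params, X[idx], y[idx])    # hypothetical gradient fn
            params = [p - alpha * g for p, g in zip(params, grads)]
    return params
```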

SLIDE 31

Convolutional layers

Deep multi-layer perceptron networks:
– general purpose
– involve huge numbers of weights
We want:
– a special-purpose network for image and NLP data
– fewer parameters
– fewer local minima
Answer: convolutional layers!

SLIDE 32

Convolutional layers

[Figure: a filter of a given size sliding across the image pixels with a given stride]

SLIDE 33

Convolutional layers

[Figure: a filter of a given size sliding across the image pixels with a given stride]
All of these weight groupings are tied to each other.

SLIDE 34

Convolutional layers

[Figure: a filter of a given size sliding across the image pixels with a given stride]
All of these weight groupings are tied to each other. Because of the way the weights are tied together, a convolutional layer:
– reduces the number of parameters (dramatically)
– encodes a prior on the structure of the data
In practice, convolutional layers are essential to computer vision...

SLIDE 35

Convolutional layers

Two-dimensional example: why do you think they call this "convolution"?
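A minimal numpy sketch of the two-dimensional sliding-window computation; the example image and kernel are made up. (Strictly speaking, conv layers skip the kernel flip of true convolution, so they compute cross-correlation.)

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image; each output is the weighted sum
    of the patch under the kernel (no flip, as in conv layers)."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# 5x5 image, 3x3 averaging kernel, stride 1 -> 3x3 feature map.
image = np.arange(25.0).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0
print(conv2d(image, kernel))
```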

SLIDE 36

Think-pair-share

What would the convolved feature map be for this kernel?

SLIDE 37

Convolutional layers

SLIDE 38

Example: MNIST digit classification with LeNet

MNIST dataset: 70,000 images of handwritten digits (60,000 for training, 10,000 for testing). Objective: classify each image as the corresponding digit.

SLIDE 39

Example: MNIST digit classification with LeNet

LeNet:
– two convolutional layers (conv, ReLU, pooling)
– two fully connected layers (ReLU; the last layer has a logistic activation function)

SLIDE 40

Example: MNIST digit classification with LeNet

Load dataset, create train/test splits
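The deck's loading code isn't reproduced here; a hypothetical torchvision version might look like this.

```python
import torch
from torchvision import datasets, transforms

transform = transforms.ToTensor()
train_set = datasets.MNIST("data", train=True, download=True, transform=transform)
test_set = datasets.MNIST("data", train=False, download=True, transform=transform)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=64)
```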

SLIDE 41

Example: MNIST digit classification with LeNet

Define the neural network structure: Input → Conv1 → Conv2 → FC1 → FC2
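The deck's own code isn't shown, so here is a hypothetical PyTorch sketch matching the Input → Conv1 → Conv2 → FC1 → FC2 diagram; the channel and unit counts (20, 50, 500) are one common LeNet configuration, not necessarily the deck's.

```python
import torch.nn as nn

class LeNet(nn.Module):
    """Two conv layers (conv, ReLU, pooling), then two fully connected layers."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, kernel_size=5)    # Conv1
        self.conv2 = nn.Conv2d(20, 50, kernel_size=5)   # Conv2
        self.pool = nn.MaxPool2d(2)
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(50 * 4 * 4, 500)           # FC1
        self.fc2 = nn.Linear(500, 10)                   # FC2: one output per digit

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))   # 28x28 -> 24x24 -> 12x12
        x = self.pool(self.relu(self.conv2(x)))   # 12x12 -> 8x8 -> 4x4
        x = x.flatten(1)
        return self.fc2(self.relu(self.fc1(x)))
```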

SLIDE 42

Example: MNIST digit classification with LeNet

Train the network, classify the test set, and measure accuracy.
– Notice we test on a different set (a holdout set) than we trained on.
Using the GPU makes a huge difference...
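Continuing the hypothetical PyTorch sketch (using the loaders and LeNet defined above): train, then measure accuracy on the held-out test set. Note this sketch uses the standard cross-entropy setup rather than the deck's logistic output with quadratic loss.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"   # GPU if available
model = LeNet().to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(5):
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        loss_fn(model(x), y).backward()   # backpropagation
        opt.step()                        # gradient step

correct = total = 0
with torch.no_grad():                     # evaluate on the holdout set
    for x, y in test_loader:
        pred = model(x.to(device)).argmax(dim=1).cpu()
        correct += (pred == y).sum().item()
        total += y.numel()
print(correct / total)
```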

SLIDE 43

Deep learning packages

SLIDE 44

Another example: image classification w/ AlexNet

ImageNet dataset: millions of images of objects. Objective: classify each image as the corresponding object (1,000 categories in ILSVRC).

SLIDE 45

Another example: image classification w/ AlexNet

AlexNet has 8 layers: five conv followed by three fully connected

SLIDE 46

Another example: image classification w/ AlexNet

AlexNet has 8 layers: five conv followed by three fully connected

SLIDE 47

Another example: image classification w/ AlexNet

AlexNet won the 2012 ILSVRC challenge, which sparked the deep learning craze.

SLIDE 48

Object detection

SLIDE 49

Proposal generation

Exhaustive: sliding window.
Hand-coded proposal generation: selective search.

SLIDE 50

Fully convolutional object detection

SLIDE 51

What exactly are deep conv networks learning?

SLIDE 52

What exactly are deep conv networks learning?

SLIDE 53

What exactly are deep conv networks learning?

SLIDE 54

What exactly are deep conv networks learning?

SLIDE 55

What exactly are deep conv networks learning?

SLIDE 56

What exactly are deep conv networks learning?

FC layer 6

SLIDE 57

What exactly are deep conv networks learning?

FC layer 7

SLIDE 58

What exactly are deep conv networks learning?

Output layer

SLIDE 59

Finetuning

AlexNet has 60M parameters; therefore, you need a very large training set (like ImageNet).
Suppose we want to train on our own images, but we only have a few hundred?
– AlexNet will drastically overfit such a small dataset... (it won't generalize at all)

SLIDE 60

Finetuning

AlexNet has 60M parameters; therefore, you need a very large training set (like ImageNet).
Suppose we want to train on our own images, but we only have a few hundred?
– AlexNet will drastically overfit such a small dataset... (it won't generalize at all)

Idea:
1. pretrain on ImageNet
2. finetune on your own dataset
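A hypothetical PyTorch/torchvision sketch of that recipe: load an ImageNet-pretrained AlexNet, freeze the convolutional features, and replace the last layer for your own task (num_classes is a placeholder).

```python
import torch.nn as nn
from torchvision import models

# 1. Pretrained on ImageNet (weights argument per torchvision >= 0.13).
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

# Freeze the pretrained convolutional feature extractor.
for p in model.features.parameters():
    p.requires_grad = False

# 2. Swap in a fresh final layer sized for our own (small) dataset...
num_classes = 10   # placeholder for your own task
model.classifier[6] = nn.Linear(model.classifier[6].in_features, num_classes)

# ...then finetune as usual; only the new layer (and any layers you
# leave unfrozen) will be updated by gradient descent.
```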