SLIDE 1

Lecture 8: Convolutional Neural Networks 1
CS109B Data Science 2
Pavlos Protopapas and Mark Glickman

SLIDE 2

Outline


SLIDE 3

Main drawbacks of MLPs

  • MLPs use one perceptron for each input (e.g., one per pixel in an image, multiplied by 3 in the RGB case). The number of weights rapidly becomes unmanageable for large images.
  • Training difficulties arise, and overfitting can appear.
  • MLPs react differently to an input (image) and its shifted version: they are not translation invariant.

SLIDE 4

Latest advances in Image Recognition

You Only Look Once (YOLO) - 2016


SLIDE 5

Latest advances in Image Recognition

Mask R-CNN - 2017


SLIDE 6

Latest advances in Image Recognition

NVIDIA Video to Video Synthesis - 2018


SLIDE 7

Image analysis

Imagine that we want to recognize swans in an image:

  • Round, elongated oval with orange protuberance
  • Long white rectangular shape (neck)
  • Oval-shaped white blob (body)

SLIDE 8

Cases can be a bit more complex…

  • Round, elongated head with orange or black beak
  • Long white neck, square shape
  • Oval-shaped white body with or without large white symmetric blobs (wings)

SLIDE 9

Now what?

  • Round, elongated head with orange or black beak, can be turned backwards
  • Long white neck, can bend around, not necessarily straight
  • White tail, generally far from the head, looks feathery
  • White, oval-shaped body, with or without wings visible
  • Black feet, under body, can have different shapes
  • Small black circles, can be facing the camera, sometimes can see both
  • Black triangular shaped form, on the head, can have different sizes
  • White elongated piece, can be squared or more triangular, can be obstructed sometimes

Luckily, the color is consistent…

SLIDE 10


SLIDE 11

We need to be able to deal with these cases.


SLIDE 12

Image features

  • We’ve basically been talking about detecting features in images, in a very naïve way.
  • Researchers built multiple computer vision techniques to deal with these issues: SIFT, FAST, SURF, BRIEF, etc.
  • However, similar problems arose: the detectors were either too general or too over-engineered. Humans were designing these feature detectors, and that made them either too simple or hard to generalize.

[Figures: FAST corner detection algorithm; SIFT feature descriptor]

SLIDE 13

Image features (cont)

  • What if we learned the features to detect?
  • We need a system that can do Representation Learning (or Feature Learning).
  • Representation Learning: a technique that allows a system to automatically find the relevant features for a given task. It replaces manual feature engineering.
  • Multiple techniques exist for this:
    • Unsupervised (K-means, PCA, …)
    • Supervised (supervised dictionary learning, Neural Networks!)

SLIDE 14

Drawbacks

Imagine we want to build a cat detector with an MLP. If the cat appears in one part of the image, one set of weights (red in the figure) will be modified to better recognize cats; if it appears elsewhere, a different set (green) will be modified. We are learning redundant features, and the approach is not robust, as cats could appear in yet another position.

SLIDE 15

Drawbacks

Example: CIFAR-10. Simple 32x32 color images (3 channels). Each pixel is a feature: an MLP would have 32x32x3 + 1 = 3,073 weights per neuron!


SLIDE 16

Drawbacks

Example: ImageNet Images are usually 224x224x3: an MLP would have 150129 weights per neuron. If the first layer of the MLP is around 128 nodes, which is small, this already becomes very heavy to calculate. Model complexity is extremely high:

  • verfitting.

16

SLIDE 17

Images are Local and Hierarchical

SLIDE 18

Images are Invariant

SLIDE 19

“Convolution” Operation

SLIDE 20

“Convolution” Operation

Edge detection:

$$\begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{bmatrix} * \text{image}$$

Sharpen:

$$\begin{bmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0 \end{bmatrix} * \text{image}$$

The 3x3 matrix is the kernel. (Kernels from wikipedia.org)
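To make the operation concrete, here is a minimal sketch (assuming NumPy and SciPy are available; the 5x5 toy image is made up for illustration) that applies the edge-detection kernel above via 2D convolution:

```python
import numpy as np
from scipy.signal import convolve2d

# Edge-detection kernel from the slide
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])

# A toy 5x5 grayscale "image": a bright square on a dark background
image = np.zeros((5, 5))
image[1:4, 1:4] = 1.0

# 'valid' keeps only positions where the kernel fully overlaps the image
feature_map = convolve2d(image, kernel, mode='valid')
print(feature_map)  # large values mark the edges of the bright square
```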

SLIDE 21

A Convolutional Network

[Figure: alternating convolution + ReLU stages]

SLIDE 22

Basics of CNNs

We know that MLPs:

  • Do not scale well for images
  • Ignore the information brought by pixel position and correlation with neighbors
  • Cannot handle translations

The general idea of CNNs is to intelligently adapt to the properties of images:

  • Pixel position and neighborhood have semantic meaning.
  • Elements of interest can appear anywhere in the image.

SLIDE 23

Basics of CNNs

[Figure: MLP vs. CNN connectivity]

CNNs are also composed of layers, but those layers are not fully connected: they have filters, sets of cube-shaped weights that are applied throughout the image. Each 2D slice of a filter is called a kernel. These filters introduce translation invariance and parameter sharing. How are they applied? Convolutions!

SLIDE 24

Convolution and cross-correlation

  • A convolution of g and h (written g ∗ h) is defined as the integral of the product, having one of the functions inverted and shifted:

$$(g * h)(u) = \int_{-\infty}^{\infty} g(b)\, h(u - b)\, db$$

  • Discrete convolution:

$$(g * h)(u) = \sum_{b=-\infty}^{\infty} g(b)\, h(u - b)$$

  • Discrete cross-correlation:

$$(g \star h)(u) = \sum_{b=-\infty}^{\infty} g(b)\, h(u + b)$$

Note: in the convolution, the second function is inverted and shifted; cross-correlation only shifts it.
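A quick numeric check of the two discrete formulas, as a sketch using NumPy (np.convolve flips one sequence before sliding; np.correlate does not):

```python
import numpy as np

g = np.array([1, 2, 3])
h = np.array([0, 1, 0.5])

# Convolution: h is flipped before sliding over g
conv = np.convolve(g, h)              # [0. , 1. , 2.5, 4. , 1.5]

# Cross-correlation: h slides over g without flipping
corr = np.correlate(g, h, mode='full')

print(conv)
print(corr)  # differs from conv unless h is symmetric
```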

SLIDE 25

Convolutions – step by step

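The step-by-step operation shown on this slide can also be written out directly. A minimal sketch (the 3x3 input and 2x2 kernel here are made-up examples):

```python
import numpy as np

image = np.array([[1, 2, 0],
                  [0, 1, 3],
                  [4, 0, 1]])
kernel = np.array([[1, 0],
                   [0, 1]])   # a made-up 2x2 kernel

out_h = image.shape[0] - kernel.shape[0] + 1  # 2
out_w = image.shape[1] - kernel.shape[1] + 1  # 2
feature_map = np.zeros((out_h, out_w))

# Slide the kernel over every valid position and take the elementwise
# product-sum (this is cross-correlation, which is what deep learning
# frameworks actually implement under the name "convolution")
for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + kernel.shape[0], j:j + kernel.shape[1]]
        feature_map[i, j] = np.sum(patch * kernel)

print(feature_map)  # [[2. 5.]
                    #  [0. 2.]]
```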

SLIDE 26

Convolutions – another example


SLIDE 27

Convolutions – 3D input


SLIDE 28

Convolutions – what happens at the edges?

If we apply convolutions on a normal image, the result will be down-sampled by an amount depending on the size of the filter. We can avoid this by padding the edges in different ways.


SLIDE 29

Padding

Full padding: introduces zeros such that all pixels are visited the same number of times by the filter. Increases the size of the output.

Same padding: ensures that the output has the same size as the input.
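To see the effect on output size, a minimal Keras sketch (assuming TensorFlow/Keras) contrasting 'valid' (no padding) with 'same' padding:

```python
import tensorflow as tf

x = tf.random.normal((1, 32, 32, 3))  # one 32x32 RGB image

# No padding: a 3x3 filter shrinks each spatial dimension by 2
valid = tf.keras.layers.Conv2D(8, 3, padding='valid')(x)
print(valid.shape)  # (1, 30, 30, 8)

# Same padding: zeros added at the edges, spatial size preserved
same = tf.keras.layers.Conv2D(8, 3, padding='same')(x)
print(same.shape)   # (1, 32, 32, 8)
```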

SLIDE 30

Convolutional layers

A convolutional layer with four 3x3 filters on a black and white image (just one channel), versus a convolutional layer with four 3x3 filters on an RGB image. As you can see, in the RGB case the filters are now cubes, and they are applied on the full depth of the image.

SLIDE 31

Convolutional layers (cont)

  • To be clear: each filter is convolved with the entirety of the 3D input cube, but generates a 2D feature map.
  • Because we have multiple filters, we end up with a 3D output: one 2D feature map per filter.
  • The feature-map dimension can change drastically from one conv layer to the next: we can enter a layer with a 32x32x16 input and exit with a 32x32x128 output if that layer has 128 filters.

SLIDE 32

Why does this make sense?

An image is just a matrix of pixels. Convolving the image with a filter produces a feature map that highlights the presence of a given feature in the image.

SLIDE 33


SLIDE 34

Learning CNN

In a convolutional layer, we are basically applying multiple filters over the image to extract different features. But most importantly, we are learning those filters! One thing we’re missing: non-linearity.

SLIDE 35

Introducing ReLU

The most successful non-linearity for CNNs is the Rectified Linear Unit (ReLU). It combats the vanishing gradient problem that occurs with sigmoids, is easier to compute, and generates sparsity (not always beneficial).
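In formula form (the gradient being exactly 1 for positive inputs is what counters the vanishing gradient of saturating sigmoids):

$$\mathrm{ReLU}(x) = \max(0, x), \qquad \mathrm{ReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \end{cases}$$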


SLIDE 36

Convolutional layers so far

  • A convolutional layer convolves each of its filters with the input.
  • Input: a 3D tensor, where the dimensions are Width, Height and Channels (or Feature Maps).
  • Output: a 3D tensor, with dimensions Width, Height and Feature Maps (one for each filter).
  • Applies a non-linear activation function (usually ReLU) over each value of the output.
  • Multiple parameters to define: number of filters, size of filters, stride, padding, activation function to use, regularization (see the sketch below).
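As a concrete illustration, a hedged Keras sketch (assuming TensorFlow/Keras; the specific values are arbitrary) that sets every option listed above:

```python
import tensorflow as tf

conv = tf.keras.layers.Conv2D(
    filters=32,                # number of filters
    kernel_size=(3, 3),        # size of filters (W and H; depth follows the input)
    strides=(1, 1),            # stride
    padding='same',            # padding
    activation='relu',         # activation function
    kernel_regularizer=tf.keras.regularizers.l2(1e-4),  # regularization
)

# Applied to a batch of 32x32 RGB images, the output keeps the
# spatial size ('same' padding) and has one feature map per filter.
x = tf.random.normal((8, 32, 32, 3))
print(conv(x).shape)  # (8, 32, 32, 32)
```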

SLIDE 37

Building a CNN

A convolutional neural network is built by stacking layers, typically of 3 types:

  • Convolutional Layers
  • Pooling Layers
  • Fully connected Layers

SLIDE 38

Building a CNN: Convolutional Layers

I/O
  • Input: 3D cube, the previous set of feature maps
  • Output: 3D cube, one 2D map per filter

Action
  • Apply filters to extract features
  • Filters are composed of small kernels, which are learned
  • One bias per filter
  • Apply the activation function on every value of the feature map

Parameters
  • Number of kernels
  • Size of kernels (W and H only; D is defined by the input cube)
  • Activation function
  • Stride
  • Padding
  • Regularization type and value


SLIDE 40

Building a CNN: Pooling Layers

I/O
  • Input: 3D cube, the previous set of feature maps
  • Output: 3D cube, one 2D map per filter, with reduced spatial dimensions

Action
  • Reduce dimensionality
  • Extract the maximum or average of a region
  • Sliding-window approach

Parameters
  • Stride
  • Size of window
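For instance, a minimal Keras sketch (assuming TensorFlow/Keras) of max pooling with a 2x2 window and stride 2:

```python
import tensorflow as tf

x = tf.random.normal((1, 32, 32, 8))
pooled = tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)(x)
print(pooled.shape)  # (1, 16, 16, 8): spatial size halved, depth unchanged
```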

SLIDE 42

Building a CNN: Fully Connected Layers

I/O
  • Input: FLATTENED 3D cube, the previous set of feature maps
  • Output: 1D vector of node activations (for the last layer, the classification scores)

Action
  • Aggregate information from the final feature maps
  • Generate the final classification

Parameters
  • Number of nodes
  • Activation function: usually changes depending on the role of the layer. If aggregating info, use ReLU; if producing the final classification, use Softmax.

SLIDE 43

Fully built CNN (VGG)


SLIDE 44

What do CNN layers learn?

  • Each CNN layer learns filters of increasing complexity.
  • The first layers learn basic feature-detection filters: edges, corners, etc.
  • The middle layers learn filters that detect parts of objects. For faces, they might learn to respond to eyes, noses, etc.
  • The last layers have higher representations: they learn to recognize full objects, in different shapes and positions.

SLIDE 45


SLIDE 46

Examples

  • I have a convolutional layer with 16 3x3 filters that takes an RGB image as input.
  • What else can we define about this layer?
    • Activation function
    • Stride
    • Padding type
  • How many parameters does the layer have?

16 x 3 x 3 x 3 + 16 = 448
(number of filters x filter width x filter height x number of channels of the previous layer, plus one bias per filter)
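A quick way to verify this count, as a sketch assuming TensorFlow/Keras:

```python
import tensorflow as tf

# 16 filters of size 3x3 over a 3-channel (RGB) input
layer = tf.keras.layers.Conv2D(16, (3, 3))
layer.build((None, 32, 32, 3))   # any spatial size; only the 3 channels matter
print(layer.count_params())      # 448 = 16*3*3*3 + 16
```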

SLIDE 47

Examples

  • Let C be a CNN with the following disposition:
    • Input: 32x32x3 images
    • Conv1: 8 3x3 filters, stride 1, padding=same
    • Conv2: 16 5x5 filters, stride 2, padding=same
    • Flatten layer
    • Dense1: 512 nodes
    • Dense2: 4 nodes
  • How many parameters does this network have?

(8 x 3 x 3 x 3 + 8) + (16 x 5 x 5 x 8 + 16) + (16 x 16 x 16 x 512 + 512) + (512 x 4 + 4)
Conv1: 224. Conv2: 3,216. Dense1: 2,097,664 (the flattened 16x16x16 = 4,096 values feed 512 nodes). Dense2: 2,052.
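The same network written as a Keras model (a sketch assuming TensorFlow/Keras); model.summary() reproduces the counts above:

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input((32, 32, 3)),
    layers.Conv2D(8, 3, strides=1, padding='same'),   # 224 params
    layers.Conv2D(16, 5, strides=2, padding='same'),  # 3,216 params
    layers.Flatten(),                                 # 16*16*16 = 4,096 values
    layers.Dense(512),                                # 2,097,664 params
    layers.Dense(4),                                  # 2,052 params
])
model.summary()  # total: 2,103,156 parameters
```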

SLIDE 48

3D visualization of networks in action:
http://scs.ryerson.ca/~aharley/vis/conv/
https://www.youtube.com/watch?v=3JQ3hYko51Y


SLIDE 49

EVOLUTION OF CNNS

A bit of history


SLIDE 50

Initial ideas

  • The first piece of research proposing something similar to a Convolutional Neural Network was authored by Kunihiko Fukushima in 1980, and was called the Neocognitron1.
  • Inspired by discoveries on the visual cortex of mammals.
  • Fukushima applied the Neocognitron to hand-written character recognition.
  • End of the 80’s: several papers advanced the field:
    • Backpropagation published in French by Yann LeCun in 1985 (independently discovered by other researchers as well)
    • TDNN by Waibel et al., 1989: a convolutional-like network trained with backprop.
    • Backpropagation applied to handwritten zip code recognition by LeCun et al., 1989

1 K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4): 193-202, 1980.

SLIDE 51

LeNet

  • November 1998: LeCun publishes one of his most recognized papers, describing a “modern” CNN architecture for document recognition, called LeNet1.
  • It was not his first iteration (this was in fact LeNet-5), but this paper is the commonly cited publication when talking about LeNet.

1 LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.

SLIDE 52

AlexNet

  • Developed by Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton at the University of Toronto in 2012. More than 25,000 citations.
  • Destroyed the competition in the 2012 ImageNet Large Scale Visual Recognition Challenge. Showed the benefits of CNNs and kickstarted the AI revolution.
  • Top-5 error of 15.3%, more than 10.8 percentage points lower than the runner-up.
  • Main contributions:
    • Trained on ImageNet with data augmentation
    • Increased depth of model, GPU training (five to six days)
    • Smart optimizer and Dropout layers
    • ReLU activation!
SLIDE 53

ZFNet

  • Introduced by Matthew Zeiler and Rob Fergus from NYU; won ILSVRC 2013 with an 11.2% error rate. Decreased the sizes of filters.
  • Trained for 12 days.
  • The paper presented a visualization technique named Deconvolutional Network, which helps to examine different feature activations and their relation to the input space.

SLIDE 54

VGG

  • Introduced by Simonyan and Zisserman (Oxford) in 2014.
  • Simplicity and depth as main points. Used 3x3 filters exclusively and 2x2 MaxPool layers with stride 2.
  • Showed that two stacked 3x3 filters have an effective receptive field of 5x5, with fewer parameters than a single 5x5 filter.
  • As spatial size decreases, depth increases.
  • Trained for two to three weeks.
  • Still used today.

SLIDE 55

GoogLeNet (Inception-v1)

  • Introduced by Szegedy et al. (Google), 2014. Winners of ILSVRC 2014.
  • Introduced the inception module: parallel conv. layers with different filter sizes. Motivation: we don’t know which filter size is best, so let the network decide. A key idea for future architectures.
  • No fully connected layer at the end; AvgPool instead. 12x fewer parameters than AlexNet.
  • Uses 1x1 convolutions to reduce the number of parameters.

[Figure: proto inception module vs. inception module]

SLIDE 56

ResNet

  • Presented by He et al. (Microsoft), 2015. Won ILSVRC 2015 in multiple categories.
  • Main idea: the Residual Block. Allows for extremely deep networks.
  • The authors believe that it is easier to optimize the residual mapping than the original one. Furthermore, a residual block can decide to “shut itself down” if needed.

[Figure: Residual Block]
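A minimal sketch of an identity residual block in Keras (the exact layer arrangement is an assumption; ResNet variants differ in details such as batch normalization placement):

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Identity residual block: output = F(x) + x."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.Add()([y, shortcut])   # the skip connection
    return layers.Activation('relu')(y)

inputs = tf.keras.Input((32, 32, 16))
outputs = residual_block(inputs, 16)
model = tf.keras.Model(inputs, outputs)
```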


SLIDE 58

DenseNet

  • Proposed by Huang et al., 2016. A radical extension of the ResNet idea.
  • Each block uses every previous feature map as input.
  • Idea: no computation of redundant features; all the previous information is available at each point.
  • Counter-intuitively, it reduces the number of parameters needed.


SLIDE 60

MobileNet

  • Published by Howard et al., 2017.
  • Extremely efficient network with decent accuracy.
  • Main concept: depthwise-separable convolutions. Convolve each feature map with its own kernel, then use a 1x1 convolution to aggregate the result.
  • This approximates vanilla convolutions without having to convolve large kernels through channels.
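A rough sketch of the parameter savings, ignoring biases (plain Python; the layer sizes are made-up examples):

```python
# Standard convolution: every filter spans all input channels
def standard_conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

# Depthwise-separable: one k x k kernel per input channel,
# then a 1x1 convolution to mix channels
def separable_conv_params(k, c_in, c_out):
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 64, 128                   # made-up example sizes
print(standard_conv_params(k, c_in, c_out))   # 73728
print(separable_conv_params(k, c_in, c_out))  # 8768, roughly 8x fewer
```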

SLIDE 61

Beyond

  • MobileNetV2 (https://arxiv.org/abs/1801.04381)
  • Inception-ResNet, v1 and v2 (https://arxiv.org/abs/1602.07261)
  • Wide-ResNet (https://arxiv.org/abs/1605.07146)
  • Xception (https://arxiv.org/abs/1610.02357)
  • ResNeXt (https://arxiv.org/pdf/1611.05431)
  • ShuffleNet, v1 and v2 (https://arxiv.org/abs/1707.01083)
  • Squeeze-and-Excitation Nets (https://arxiv.org/abs/1709.01507)