Neural Network Part 3: Convolutional Neural Networks


SLIDE 1

Neural Network Part 3: Convolutional Neural Networks

Yingyu Liang Computer Sciences 760 Fall 2017

http://pages.cs.wisc.edu/~yliang/cs760/

Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Matt Gormley, Elad Hazan, Tom Dietterich, Pedro Domingos, and Kaiming He.

SLIDE 2

Goals for the lecture

You should understand the following concepts:

  • convolutional neural networks (CNN)
  • convolution and its advantage
  • pooling and its advantage


SLIDE 3

Convolutional neural networks

  • Strong empirical application performance
  • Convolutional networks: neural networks that use convolution in place of general matrix multiplication in at least one of their layers, i.e., $h = \sigma(W^\top x + b)$ for a specific kind of weight matrix $W$

SLIDE 4

Convolution

SLIDE 5

Convolution: math formula

  • Given functions $u(t)$ and $w(t)$, their convolution is a function $s(t)$
  • Written as

$$s(t) = \int u(a)\, w(t - a)\, da, \qquad s = u * w$$

  • or

$$s(t) = (u * w)(t)$$

SLIDE 6

Convolution: discrete version

  • Given arrays $u_t$ and $w_t$, their convolution is a function $s_t$
  • When $u_t$ or $w_t$ is not defined, it is assumed to be 0
  • Written as

$$s_t = \sum_{a=-\infty}^{+\infty} u_a w_{t-a}, \qquad s = u * w$$

  • or

$$s_t = (u * w)_t$$
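To make the discrete formula concrete, here is a minimal NumPy sketch (array values made up) that implements the sum directly and checks it against np.convolve:

```python
import numpy as np

def conv(u, w):
    """Discrete convolution: s_t = sum_a u_a * w_{t-a}, out-of-range terms are 0."""
    n, m = len(u), len(w)
    s = np.zeros(n + m - 1)
    for t in range(n + m - 1):
        for a in range(n):
            if 0 <= t - a < m:
                s[t] += u[a] * w[t - a]
    return s

u = np.array([1.0, 2.0, 3.0])
w = np.array([0.0, 1.0, 0.5])
assert np.allclose(conv(u, w), np.convolve(u, w))  # same result
```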

SLIDE 7

Illustration 1

[Figure: the kernel $w = [z, y, x]$ slides along the input $u = [a, b, c, d, e, f]$; at this position the output entry is $xb + yc + zd$.]

SLIDE 8

Illustration 1

[Figure: the kernel moves one step right; the output entry is $xc + yd + ze$.]

SLIDE 9

Illustration 1

[Figure: the kernel moves one more step; the output entry is $xd + ye + zf$.]

SLIDE 10

Illustration 1: boundary case

[Figure: at the boundary only $x$ and $y$ overlap the input, giving $xe + yf$ (missing entries treated as 0).]

SLIDE 11

Illustration 1 as matrix multiplication

The same convolution written as multiplication by a banded matrix (a code check follows below):

$$\begin{bmatrix}
y & z &   &   &   &   \\
x & y & z &   &   &   \\
  & x & y & z &   &   \\
  &   & x & y & z &   \\
  &   &   & x & y & z \\
  &   &   &   & x & y
\end{bmatrix}
\begin{bmatrix} a \\ b \\ c \\ d \\ e \\ f \end{bmatrix}$$
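A NumPy sketch (kernel and input values made up) confirming the banded-matrix view, here checked against np.convolve in 'same' mode:

```python
import numpy as np

x, y, z = 2.0, 3.0, 5.0                    # kernel weights as on the slide
u = np.array([1., 2., 3., 4., 5., 6.])     # input, playing [a, b, c, d, e, f]

# Banded (Toeplitz-like) matrix: row t has x, y, z at columns t-1, t, t+1.
W = np.zeros((6, 6))
for t in range(6):
    if t - 1 >= 0:
        W[t, t - 1] = x
    W[t, t] = y
    if t + 1 < 6:
        W[t, t + 1] = z

# The matrix-vector product equals convolving u with the kernel [z, y, x].
assert np.allclose(W @ u, np.convolve(u, [z, y, x], mode='same'))
```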

SLIDE 12

Illustration 2: two dimensional case

[Figure: a 2x2 kernel $(w, x; y, z)$ over a 3x4 input $(a, b, c, d; e, f, g, h; i, j, k, l)$; the first output entry is $wa + bx + ey + fz$.]

SLIDE 13

Illustration 2

[Figure: the kernel slides one column right; next to the first entry $wa + bx + ey + fz$, the second output entry is $bw + cx + fy + gz$.]

SLIDE 14

Illustration 2

[Figure: the same sliding operation with the parts labeled: the 2x2 array $(w, x; y, z)$ is the kernel (or filter), the 3x4 array is the input, and the output entries such as $wa + bx + ey + fz$ form the feature map.]
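The sliding computation on these slides is what deep-learning libraries implement (strictly, cross-correlation: the kernel is not flipped, which matches the products shown). A minimal NumPy sketch with made-up values:

```python
import numpy as np

def conv2d_valid(inp, kernel):
    """Slide the kernel over the input without padding; sum elementwise products."""
    kh, kw = kernel.shape
    oh, ow = inp.shape[0] - kh + 1, inp.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(inp[i:i + kh, j:j + kw] * kernel)
    return out

inp = np.arange(12.0).reshape(3, 4)        # plays the 3x4 input a..l
kernel = np.array([[1., 2.],
                   [3., 4.]])              # plays the 2x2 kernel w, x, y, z
print(conv2d_valid(inp, kernel))           # the 2x3 feature map
```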

SLIDE 15

Advantage: sparse interaction

Figure from Deep Learning, by Goodfellow, Bengio, and Courville

Fully connected layer: $m \times n$ edges ($m$ output nodes, $n$ input nodes)

SLIDE 16

Advantage: sparse interaction

Figure from Deep Learning, by Goodfellow, Bengio, and Courville

Convolutional layer: $\le m \times k$ edges ($m$ output nodes, $n$ input nodes, $k$ kernel size). E.g., with $m = n = 1000$ and $k = 3$, that is at most $3000$ edges instead of $10^6$ (a parameter-count sketch follows below).
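A rough check of the gap in PyTorch (layer sizes made up; the edge counts above exclude biases):

```python
import torch.nn as nn

dense = nn.Linear(1000, 1000)              # ~10^6 weights: one per edge
conv = nn.Conv1d(1, 1, kernel_size=3)      # 3 weights, reused at every position

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense), count(conv))           # 1001000 vs 4 (biases included)
```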

SLIDE 17

Advantage: sparse interaction

Figure from Deep Learning, by Goodfellow, Bengio, and Courville

Multiple convolutional layers: larger receptive field

SLIDE 18

Advantage: parameter sharing/weight tying

Figure from Deep Learning, by Goodfellow, Bengio, and Courville

The same kernel is used repeatedly. E.g., each black edge in the figure carries the same weight of the kernel.

SLIDE 19

Advantage: equivariant representations

  • Equivariant: transforming the input = transforming the output
  • Example: input is an image, transformation is shifting
  • Convolution(shift(input)) = shift(Convolution(input))
  • Useful when we care only about the existence of a pattern, rather than its location (a numerical check follows below)
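A quick check of this property (signal, kernel, and shift amount made up; a circular shift is used for simplicity, so equality holds away from the wrap-around):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)                   # input signal
k = np.array([1.0, -2.0, 1.0])             # convolution kernel

shift = lambda v: np.roll(v, 5)            # shift by 5 positions (circular)

lhs = np.convolve(shift(x), k, mode='same')   # Convolution(shift(input))
rhs = shift(np.convolve(x, k, mode='same'))   # shift(Convolution(input))
print(np.allclose(lhs[6:-6], rhs[6:-6]))      # True away from the boundary
```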

SLIDE 20

Pooling

SLIDE 21

Terminology

Figure from Deep Learning, by Goodfellow, Bengio, and Courville

SLIDE 22

Pooling

  • Summarizing the input (e.g., outputting the max of the input), as sketched below

Figure from Deep Learning, by Goodfellow, Bengio, and Courville
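A minimal max-pooling sketch in NumPy (values made up): non-overlapping 2x2 windows are summarized by their maximum, and a one-pixel shift of the input can leave the output unchanged, previewing the invariance on the next slide:

```python
import numpy as np

def max_pool_2x2(x):
    """Max over non-overlapping 2x2 windows (stride 2)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[0., 0., 0., 0.],
              [9., 0., 0., 0.],
              [0., 0., 0., 0.],
              [0., 0., 5., 0.]])
print(max_pool_2x2(x))                      # [[9. 0.] [0. 5.]]
print(max_pool_2x2(np.roll(x, 1, axis=1)))  # unchanged: each max stays in its window
```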

SLIDE 23

Advantage

Induces invariance

Figure from Deep Learning, by Goodfellow, Bengio, and Courville

SLIDE 24

Motivation from neuroscience

  • David Hubel and Torsten Wiesel studied the early visual system in the human brain (V1, or primary visual cortex), and won a Nobel Prize for this work
  • V1 properties:
  • 2D spatial arrangement
  • Simple cells: inspire convolution layers
  • Complex cells: inspire pooling layers

SLIDE 25

Example: LeNet

SLIDE 26

LeNet-5

  • Proposed in “Gradient-based learning applied to document recognition”, by Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner, in Proceedings of the IEEE, 1998

SLIDE 27

LeNet-5

  • Proposed in “Gradient-based learning applied to document recognition”, by Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner, in Proceedings of the IEEE, 1998

  • Apply convolution on 2D images (MNIST) and use backpropagation

SLIDE 28

LeNet-5

  • Proposed in “Gradient-based learning applied to document recognition”, by Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner, in Proceedings of the IEEE, 1998

  • Apply convolution on 2D images (MNIST) and use backpropagation
  • Structure: 2 convolutional layers (with pooling) + 3 fully connected layers (see the sketch after this list)
  • Input size: 32x32x1
  • Convolution kernel size: 5x5
  • Pooling: 2x2
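A sketch of this structure in modern PyTorch (the original LeNet-5 used sigmoid/tanh units and average-like subsampling; ReLU and max pooling here are common modern substitutions):

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # 32x32x1 -> 28x28x6
            nn.ReLU(),
            nn.MaxPool2d(2),                   # -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5),   # -> 10x10x16
            nn.ReLU(),
            nn.MaxPool2d(2),                   # -> 5x5x16 = 400
        )
        self.classifier = nn.Sequential(
            nn.Linear(400, 120), nn.ReLU(),    # weight matrix 400x120
            nn.Linear(120, 84), nn.ReLU(),     # weight matrix 120x84
            nn.Linear(84, num_classes),        # weight matrix 84x10
        )

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```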

SLIDE 29

LeNet-5

Figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner

SLIDE 30

LeNet-5

Figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner

SLIDE 31

LeNet-5

Figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner

Filter: 5x5, stride: 1x1, #filters: 6

SLIDE 32

LeNet-5

Figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner

Pooling: 2x2, stride: 2

SLIDE 33

LeNet-5

Figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner

Filter: 5x5x6, stride: 1x1, #filters: 16

SLIDE 34

LeNet-5

Figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner

Pooling: 2x2, stride: 2

SLIDE 35

LeNet-5

Figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner

Weight matrix: 400x120

SLIDE 36

LeNet-5

Figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner

Weight matrix: 120x84
Weight matrix: 84x10

SLIDE 37

Example: ResNet

SLIDE 38

Plain Network

  • “Overly deep” plain nets have higher training error
  • A general phenomenon, observed in many datasets

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.

SLIDE 39

Residual Network

  • Naïve solution
  • If the extra layers are an identity mapping, then training error does not increase

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.

SLIDE 40

Residual Network

  • Deeper networks maintain the tendency of these results
  • Features at the same level are almost the same
  • The total amount of change is roughly fixed, so adding layers makes the per-layer differences smaller
  • The optimal mappings are therefore closer to the identity

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.

SLIDE 41

Residual Network

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.

  • Plain block
  • Difficult to make an identity mapping because of the multiple non-linear layers

SLIDE 42

Residual Network

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.

  • Residual block
  • If the identity were optimal, it is easy to set the weights to 0
  • If the optimal mapping is closer to the identity, it is easier to find the small fluctuations
  • => Appropriate for treating the mapping as a perturbation on top of kept base information (a PyTorch sketch of such a block follows below)
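A PyTorch sketch of a basic residual block (class name and channel count made up, following the paper's design): the block computes F(x) + x, so zeroing the convolution weights recovers the identity mapping:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(out + x)              # skip connection: F(x) + x

x = torch.randn(1, 64, 8, 8)
print(BasicBlock(64)(x).shape)                 # same shape as the input
```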

SLIDE 43

Network Design

  • Basic design (VGG-style)
  • All 3x3 conv (almost)
  • Spatial size /2 => #filters x2 (sketched below)
  • Batch normalization
  • Simple design, just deep
  • Other remarks
  • No max pooling (almost)
  • No hidden fully connected layers
  • No dropout

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.
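A sketch of the "spatial size /2 => #filters x2" transition (channel counts made up): a stride-2 3x3 convolution with batch normalization halves the resolution and doubles the filters, with no max pooling:

```python
import torch
import torch.nn as nn

downsample = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(128),
    nn.ReLU(),
)
print(downsample(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 128, 16, 16])
```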

SLIDE 44

Results

  • Deep ResNets can be trained without difficulties
  • Deeper ResNets have lower training error, and also lower test error

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.

SLIDE 45

Results

  • 1st places in all five main tracks in “ILSVRC & COCO 2015 Competitions”
  • ImageNet Classification
  • ImageNet Detection
  • ImageNet Localization
  • COCO Detection
  • COCO Segmentation

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.

SLIDE 46

Quantitative Results

  • ImageNet Classification

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.

SLIDE 47

Qualitative Results

  • Object detection
  • Faster R-CNN + ResNet

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.
Jifeng Dai, Kaiming He, & Jian Sun. “Instance-aware Semantic Segmentation via Multi-task Network Cascades”. arXiv 2015.

SLIDE 48

Qualitative Results

  • Instance Segmentation

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.