Convolutional Neural Networks Kaitlin Palmer San Diego State - - PDF document




Convolutional Neural Networks

Kaitlin Palmer San Diego State University

Outline

  • What are Convolutional Neural Networks (CNN)
  • Why use a CNN
  • Typical Layout

– Kernel Size – Stride Size/Padding – Pooling

  • Keras Implementation


What are CNNs?

  • Neural networks that use convolution (or cross-correlation) of a weight (kernel) and bias, rather than general matrix multiplication, in at least one layer

What are CNNs?

$s(t) = \int w(a)\, x(t - a)\, da$

  • Spaceship example: smoothing noisy radar measurements
  • s(t) – smoothed estimate of the position
  • x(t) – raw radar position measurement
  • a – age of a measurement
  • w(t − a) – weighting function
  • w is a valid probability density function
  • w is 0 for all negative arguments

ISS tracking Data: https://www.nasa.gov/pdf/686319main_AP_ED_Stats_RadarData.pdf
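The smoothing above can be sketched in NumPy. The radar readings and the weighting function below are made-up illustrative values, not the NASA data linked above; note that `np.convolve` applies the kernel flipped, which matches the $x(t-a)\,w(a)$ form directly.

```python
import numpy as np

# Noisy 1D position measurements x(t) (hypothetical values).
x = np.array([5.0, 5.5, 4.8, 6.1, 5.9, 6.4, 6.2, 7.0])

# Weighting function w(a): recent measurements (small age a) weigh more.
# A valid density (non-negative, sums to 1), zero for negative ages.
w = np.array([0.5, 0.3, 0.2])

# Smoothed estimate s(t) = sum_a x(t - a) w(a): a convolution.
s = np.convolve(x, w, mode="valid")
print(s)  # first value: 4.8*0.5 + 5.5*0.3 + 5.0*0.2 = 5.05
```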


What are CNNs?

$s(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a)$

  • s(t) – feature map
  • x(a) – input (multi-dimensional array)
  • w(t − a) – kernel (multi-dimensional array)
  • This is the discretized form of the convolution above

What are CNNs?

  • What is convolution?
  • Practice example 1D (summation of the products)

Input: 1 1 2 5 3 1
Kernel: 1 1
Output (sum of products at each position): 2 3 7 8 4
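The 1D practice example can be checked with NumPy (`np.convolve` flips the kernel, which makes no difference here because [1, 1] is symmetric):

```python
import numpy as np

x = np.array([1, 1, 2, 5, 3, 1])  # input
w = np.array([1, 1])              # kernel

# Each output element is the sum of products of the kernel
# with one window of the input.
s = np.convolve(x, w, mode="valid")
print(s)  # [2 3 7 8 4]
```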


What are CNNs?

Multi-dimensional Array

Cross-correlation: $S(i, j) = \sum_{m} \sum_{n} I(i + m,\, j + n)\, K(m, n)$

Beware matrix flipping: convolution vs. cross-correlation

Convolution: $S(i, j) = \sum_{m} \sum_{n} I(i - m,\, j - n)\, K(m, n)$
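A quick NumPy sketch of the distinction: flipping the kernel in both axes turns cross-correlation into convolution, so the two agree only for kernels symmetric under a 180° rotation. The helper names here are mine.

```python
import numpy as np

def cross_correlate2d(I, K):
    """S(i, j) = sum_{m,n} I(i+m, j+n) K(m, n), valid region only."""
    kh, kw = K.shape
    H = I.shape[0] - kh + 1
    W = I.shape[1] - kw + 1
    S = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return S

def convolve2d(I, K):
    # Convolution = cross-correlation with the kernel flipped in both axes.
    return cross_correlate2d(I, np.flip(K))

I = np.arange(16.0).reshape(4, 4)
K = np.array([[1.0, 2.0], [3.0, 4.0]])
print(cross_correlate2d(I, K)[0, 0])  # 34.0
print(convolve2d(I, K)[0, 0])         # 16.0 -- differs because K is asymmetric
```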

Why CNNs?

  • Sparse Interactions
  • Parameter sharing
  • Equivariant Representations


Why CNNs

  • Sparse Interactions

– AKA sparse connectivity or sparse weights
– Fewer parameters: the kernels stored are smaller than the input
– Tens or hundreds of parameters to learn vs. millions


  • Fig. 9.2 Goodfellow et al.

Why CNNs

  • Parameter Sharing

– Same parameter used for more than one function in a model
– Weights applied to one input are applied elsewhere
– Each member of the kernel is used at every position
– One set of parameters is learned, regardless of location


Why CNNs

  • Sparse Interactions

– Receptive field
– Few direct connections, but units in deeper layers are indirectly connected to most of the input image


  • Fig. 9.4 Goodfellow et al.

Why CNNs

  • Parameter Sharing

– Each kernel value is used at every position of the input
– Convolution example (320 × 280 input, 319 × 280 output, 2-element kernel):

  • Convolution: 319 × 280 × 3 = 267,960 operations (two multiplications and one addition per output pixel)
  • Matrix multiplication: 320 × 280 × 319 × 280 > 8 billion entries, making convolution about 4 billion times more efficient


  • Fig. 9.5 Goodfellow et al.
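The arithmetic on this slide checks out directly:

```python
# Edge-detection example (Goodfellow et al., Fig. 9.5): 320 x 280 input,
# 319 x 280 output, 2-element kernel.
conv_ops = 319 * 280 * 3                # two multiplies + one add per output pixel
dense_entries = 320 * 280 * 319 * 280   # weight-matrix entries for a dense layer

print(conv_ops)            # 267960
print(dense_entries)       # 8003072000, i.e. > 8 billion
print(dense_entries // 2)  # ~4 billion: ratio vs. the kernel's 2 stored parameters
```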


Why CNNs

  • Equivariance to translation

– If the input shifts, the output shifts by the same amount
– An event that moves later in time (or location) in the input shifts the same way in the output
– Not naturally invariant to rotation or scale


Why CNNs


  • Edge Detection Example
  • Fig. 9.6 Goodfellow et al.


Vertical edge detection (Andrew Ng 2017):

A 6 × 6 image whose left three columns are 8 and right three columns are 0, convolved with the 3 × 3 kernel

1 0 -1
1 0 -1
1 0 -1

gives the 4 × 4 output

0 24 24 0
0 24 24 0
0 24 24 0
0 24 24 0
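This output can be reproduced with a small cross-correlation loop (which is how deep-learning libraries actually implement "convolution"):

```python
import numpy as np

# 6x6 image: left half bright (8), right half dark (0).
img = np.zeros((6, 6))
img[:, :3] = 8.0

# Vertical edge detector.
K = np.array([[1.0, 0.0, -1.0]] * 3)

out = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(img[i:i + 3, j:j + 3] * K)
print(out)  # every row is [0. 24. 24. 0.]: the edge lights up
```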

CNN Layout

  • Kernel Size (typically odd)
  • Stride/ padding
  • Pooling


CNN Layout

  • Padding


CNN Layout

  • Why padding?

– Input shrinks at each layer
– Edge effects

  • Padding types and terminology

– Valid: no padding
– Same: output size equals input size
– Full: enough padding that every pixel is visited k times in each direction
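`np.convolve` exposes the same three conventions by name, which makes the size rules easy to see in 1D:

```python
import numpy as np

x = np.ones(6)  # input of length n = 6
w = np.ones(3)  # kernel of length k = 3

print(len(np.convolve(x, w, "valid")))  # n - k + 1 = 4  (no padding)
print(len(np.convolve(x, w, "same")))   # n = 6          (output matches input)
print(len(np.convolve(x, w, "full")))   # n + k - 1 = 8  (every pixel visited k times)
```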


Layout - Step Size

  • AKA stride
  • Equivalent to hop size: how far the kernel advances between applications
  • Downsamples within the network
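Kernel size, padding, and stride combine in the usual output-size rule, floor((n + 2p − k) / s) + 1. A small helper (the function name is mine, not from the slides):

```python
def conv_output_size(n, k, p=0, s=1):
    """Output length of a 1D convolution over n inputs with kernel size k,
    padding p on each side, and stride s."""
    return (n + 2 * p - k) // s + 1

print(conv_output_size(6, 3))       # valid, stride 1   -> 4
print(conv_output_size(6, 3, p=1))  # 'same' padding    -> 6
print(conv_output_size(6, 3, s=2))  # stride 2 downsamples -> 2
```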


Convolutions on RGB

[Figure: a 6 × 6 × 3 RGB input convolved with a 3 × 3 × 3 kernel gives a 4 × 4 output. Andrew Ng]


CNN Layout - Pooling

  • Pooling layers
  • Invariant to small translations of the input
  • Replace net output with a summary statistic:

– Max pooling
– Neighborhood average
– L2 norm
– Weighted average by distance from the central pixel
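A minimal 2×2 max-pooling sketch in NumPy, using a reshape to form the non-overlapping pooling windows:

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling; assumes even height and width."""
    h, w = x.shape
    # Split into (row-pair, row-in-pair, col-pair, col-in-pair), then
    # take the max within each 2x2 block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 1, 2, 3],
              [4, 5, 6, 7]])
print(max_pool_2x2(x))  # [[4 8] [9 7]]
```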


Pooling

  • Pooling

– Equivalent to an infinitely strong prior
– Max pooling example


  • Fig. 9.8 Goodfellow et al.


Pooling

  • Down sampling

– Computational efficiency
– Possible to use fewer pooling units than detector units
– Pool over regions spaced k pixels apart


  • Fig. 9.10 Goodfellow et al.

Pooling

  • Invariance to translation


  • Fig. 9.9 Goodfellow et al.


Pooling Invariance

Yann LeCun: http://yann.lecun.com/exdb/lenet/stroke-width.html

CNN Layout


  • Fig. 9.9 Goodfellow et al.


Keras Implementation

LeCun et al. 1998, Gradient-Based Learning Applied to Document Recognition

Keras Implementation LeNet-5


from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(6, kernel_size=(5, 5), activation='tanh', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(16, (5, 5), activation='tanh'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(120, (5, 5), activation='tanh'))
model.add(Flatten())  # flatten feature maps before the dense layers
model.add(Dense(84, activation='sigmoid'))
model.add(Dense(num_classes, activation='softmax'))


Convolution Backpropagation

  • Convolution, backpropagation from output to weights, and backpropagation from output to input

  • Kernel stack K
  • Multidimensional input (e.g. image) V
  • Stride s
  • Convolution output (feature map) Z
  • Loss function J


Backpropagation

Convolution: $Z = c(K, V, s)$
Loss function: $J(V, K)$

Tensor of derivatives of the loss with respect to the feature map: $G_{i,j,k} = \dfrac{\partial}{\partial Z_{i,j,k}} J(V, K)$

Backpropagation from output to kernel (derivatives with respect to the kernel):
$g(G, V, s)_{i,j,k,l} = \dfrac{\partial}{\partial K_{i,j,k,l}} J(V, K) = \sum_{m,n} G_{i,m,n}\, V_{j,\,(m-1)s+k,\,(n-1)s+l}$

Backpropagation through the hidden layer (derivatives with respect to the input):
$h(K, G, s)_{i,j,k} = \dfrac{\partial}{\partial V_{i,j,k}} J(V, K) = \sum_{\substack{l,m \ \mathrm{s.t.}\ (l-1)s+m=j}} \; \sum_{\substack{n,p \ \mathrm{s.t.}\ (n-1)s+p=k}} \; \sum_{q} K_{q,i,m,p}\, G_{q,l,n}$
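The kernel-gradient formula can be sanity-checked numerically in 1D: for Z = conv(V, K) and the loss J = sum(Z²)/2, the analytic gradient ∂J/∂K should match finite differences. This check is my own, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.standard_normal(8)  # input
K = rng.standard_normal(3)  # kernel

def forward(V, K):
    # Valid cross-correlation: Z[j] = sum_m V[j + m] K[m]
    n = len(V) - len(K) + 1
    return np.array([np.dot(V[j:j + len(K)], K) for j in range(n)])

Z = forward(V, K)
G = Z.copy()  # dJ/dZ for J = 0.5 * sum(Z**2)

# Analytic gradient: dJ/dK[m] = sum_j G[j] V[j + m]
grad_K = np.array([np.dot(G, V[m:m + len(Z)]) for m in range(len(K))])

# Finite-difference check of each kernel entry.
eps = 1e-6
num = np.zeros_like(K)
for m in range(len(K)):
    Kp = K.copy(); Kp[m] += eps
    Km = K.copy(); Km[m] -= eps
    num[m] = (0.5 * np.sum(forward(V, Kp)**2)
              - 0.5 * np.sum(forward(V, Km)**2)) / (2 * eps)

print(np.allclose(grad_K, num, atol=1e-5))  # True
```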


Structured Output


  • For pixel-wise labeling of images, pooling is not always necessary

  • Fig. 9.17 Goodfellow et al.

https://sthalles.github.io/deep_segmentation_network/

Locally Connected Layers


  • AKA unshared convolution
  • Features occupy a small portion of space, but not all of space
  • e.g., look for a chin only in the bottom half of an image

$Z_{i,j,k} = \sum_{l,m,n} \left[ V_{l,\, j+m-1,\, k+n-1}\, w_{i,j,k,l,m,n} \right]$

  • Fig. 9.14 Goodfellow et al.: fully connected layer, locally connected layer (patch size 2), convolutional layer
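The parameter-count contrast among the three layer types is easy to tabulate; the sizes below are illustrative assumptions, not from the slides.

```python
# Illustrative sizes: 32x32 single-channel input, 3x3 receptive field,
# one 30x30 output map.
in_h = in_w = 32
k = 3
out_h = out_w = 30

dense = (in_h * in_w) * (out_h * out_w)  # every output sees every input
local = (out_h * out_w) * (k * k)        # small patches, but unshared weights
conv = k * k                             # one kernel shared across all positions

print(dense, local, conv)  # 921600 8100 9
```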


Tiled Convolution

  • Midway between locally connected layers and convolutional layers
  • Learn a set of kernels to rotate through as we move across space
  • Immediate neighbors get different filters, but memory increases only by a factor of the size of the kernel set

Traditional convolution ~ tiled convolution with t = 1; locally connected layer (patch size 2); tiled convolution (t = 2)

  • Fig. 9.16 Goodfellow et al.

$Z_{i,j,k} = \sum_{l,m,n} V_{l,\, j+m-1,\, k+n-1}\, K_{i,l,m,n,\, j\%t+1,\, k\%t+1}$

Data Types

  • Flexibility in CNNs
  • Multiple input sizes


Data Types

  • 1D


Single channel and multi-channel 1D examples (e.g., position, rotation, and scale over time; image: www.riotgames.com)

Data Types

  • 2D


Single Channel / Multi-channel


Data Types

  • 3D


Single Channel Multi Channel

Random or Unsupervised Features

  • Learning features is expensive

– Every gradient step requires full forward/back prop

  • Use features not trained in a supervised fashion


Random or Unsupervised Features

  • Random kernel initialization
  • Design kernels by hand
  • Learn kernels with an unsupervised criterion


Random or Unsupervised Features

  • Random kernel initialization

– As before, random weights often perform surprisingly well
– Need to test multiple architectures

  • Good approach:

– Build multiple architectures
– Set random weights
– Train only the last layer; pick the best architecture, then train it with full backpropagation


Random or Unsupervised Features

  • Learn kernels (k) using an unsupervised criterion

– Allows features to be determined separately from the classifier at the end of the architecture
– What unsupervised tools have we used so far?
– Apply k-means clustering to image patches; use each centroid as a convolution kernel
– Extract k-means features for the entire training set and use them as the last layer before classification
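The k-means idea can be sketched end-to-end in NumPy: sample image patches, cluster them with a tiny k-means loop, and treat each centroid as a convolution kernel. Everything here (patch size, k, the random image) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))

# 1. Sample all 5x5 patches from the image, flattened to vectors.
patches = np.array([image[i:i + 5, j:j + 5].ravel()
                    for i in range(12) for j in range(12)])

# 2. Tiny k-means: the centroids become unsupervised kernels.
k = 4
centroids = patches[rng.choice(len(patches), k, replace=False)]
for _ in range(10):
    # Assign each patch to its nearest centroid.
    labels = np.argmin(((patches[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    # Move each centroid to the mean of its assigned patches.
    for c in range(k):
        if np.any(labels == c):
            centroids[c] = patches[labels == c].mean(axis=0)

# 3. Reshape centroids back into 5x5 convolution kernels.
kernels = centroids.reshape(k, 5, 5)
print(kernels.shape)  # (4, 5, 5)
```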


Random or Unsupervised Features

  • Hand designed features


Neurobiologically Inspired Networks

  • Hubel and Wiesel, 1959, 1962, 1968

Image: utdallas.edu

https://www.youtube.com/watch?v=IOHayh06LJ4

Neurobiological Basis

  • Simple cells

– Roughly linear – Feature selection

  • Complex cells

– Nonlinear – Invariant to some transformations of simple cell features


Neurobiological Basis

  • fMRI scans of human V1
  • Visual stimulus roughly represented in V1


Tootell et al. Proceedings of the National Academy of Sciences, Feb 1998, 95 (3) 811-817; DOI: 10.1073/pnas.95.3.811


Neurological Basis


  • Visual cortex (V1)
  • What CNNs capture:

– The 2D structure of V1 and the retina (light in the lower half of the visual field ~ activation in the corresponding part of V1); neural networks likewise operate on 2-dimensional maps
– 'Simple cells': a linear function of the image in a small, spatially localized receptive field (cf. detector units)
– 'Complex cells': invariant to small shifts in position (cf. pooling layers over spatial locations; maxout units as well)
– Grandmother cells: the concept of cells that respond to your grandmother regardless of location and scale
– The 'Halle Berry neuron' (Quiroga et al. 2005) fires for an image, a drawing, or the written name


Neurobiological Basis

  • How CNNs differ

– Fovea: high-resolution detection in a small region, surrounded by low resolution
– Quick eye movements ('saccades') glimpse relevant parts of the scene (cf. the Hermann grid optical illusion)
– Neural networks receive high resolution everywhere

  • The visual system is part of an integrated system, both physical (hearing, smelling, etc.) and nebulous (mood, thoughts)


Neurological Basis

  • Understands scenes

– Objects, relationships between objects, 3D geometric information

  • Feedback throughout the brain
  • Activation and pooling functions in the brain? Probably quite different; there is no sharp distinction between 'simple' and 'complex' cells (the same cell with different 'parameters')


Neurological Basis

  • Training? Neurobiology isn't much help.
  • Time-delay neural networks

– Biologically implausible
– A 1D CNN applied to a time series

  • Understanding neural networks vs. the mammalian visual cortex

– In the neural network: plot the image of a convolutional kernel. Deep layers?
– In biology: 'reverse correlation'

  • Drill into the skull, inject electrodes into the brain, restrain the animal, show images of white noise, and record the output


Neurological Basis

Reverse correlation approximates a neuron's response as a linear function of the image:

$s(I) = \sum_{x} \sum_{y} w(x, y)\, I(x, y)$


Gabor Function

$w(x, y;\ \alpha, \beta_x, \beta_y, f, \phi, x_0, y_0, \tau) = \alpha \exp\!\left(-\beta_x x'^2 - \beta_y y'^2\right) \cos\!\left(f x' + \phi\right)$

$x' = (x - x_0)\cos(\tau) + (y - y_0)\sin(\tau)$
$y' = -(x - x_0)\sin(\tau) + (y - y_0)\cos(\tau)$

  • Gating term: $\alpha \exp(-\beta_x x'^2 - \beta_y y'^2)$
  • $\alpha$ – magnitude of the response
  • $\beta_x, \beta_y$ – how quickly the receptive field falls off
  • $f, \phi$ – how the cell responds to light along the x' axis
  • $x_0, y_0, \tau$ – location terms: translation and rotation of x and y, with $\tau$ in radians from the horizontal

Gabor Functions

  • Fig. 9.18 Goodfellow et al.

Location parameters: $x_0, y_0, \tau$. Gaussian scale parameters: $\beta_x, \beta_y$. Sinusoid parameters: $f, \phi$
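A direct NumPy transcription of the Gabor function; the parameter values below are arbitrary examples.

```python
import numpy as np

def gabor(x, y, alpha, beta_x, beta_y, f, phi, x0, y0, tau):
    """Gabor function: a Gaussian gating term times a sinusoid along x'."""
    xp = (x - x0) * np.cos(tau) + (y - y0) * np.sin(tau)
    yp = -(x - x0) * np.sin(tau) + (y - y0) * np.cos(tau)
    return alpha * np.exp(-beta_x * xp**2 - beta_y * yp**2) * np.cos(f * xp + phi)

# Evaluate on a grid to get a kernel-like weight map.
xs, ys = np.meshgrid(np.linspace(-3, 3, 9), np.linspace(-3, 3, 9))
w = gabor(xs, ys, alpha=1.0, beta_x=0.5, beta_y=0.5, f=2.0, phi=0.0,
          x0=0.0, y0=0.0, tau=0.0)
print(w.shape)  # (9, 9)
print(w[4, 4])  # 1.0 at the center when phi = 0
```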