DEEP LEARNING, FFR135 Artificial Neural Networks, Olof Mogren



SLIDE 1

DEEP LEARNING

FFR135, Artificial Neural Networks Olof Mogren

Chalmers University of Technology

October 2016

SLIDE 2

DEEP LEARNING

  • Artificial neural networks
  • Many layers of abstractions
  • Outperforms traditional methods in:
  • Image classification
  • Natural language processing
  • Machine translation
  • Sentiment analysis
  • Speech recognition
  • Reinforcement learning
SLIDE 3

SEMI-RECENT PROGRESS

  • 2006: Depth breakthrough:

layerwise pretrained Restricted Boltzmann Machines

  • GPUs
  • Practical use

Real applications from Google, Facebook, Tesla, Microsoft, Apple, and others!

A fast learning algorithm for deep belief nets; Hinton, Osindero, Teh; Neural Computation; 2006

SLIDE 4

PERCEPTRON

  • 1943, McCulloch & Pitts (neuron model)
  • 1958, Rosenblatt (perceptron)
  • Linear (binary) classification of inputs
  • Cannot learn non-linearly separable functions

(e.g. XOR)

[Diagram: inputs x1, x2, x3, x4 with weights w1, w2, w3, w4 feeding a single output unit y]
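As a minimal sketch of the unit in the diagram (numpy, with hand-picked rather than learned weights): a perceptron thresholds a weighted sum of its inputs. It can represent a linearly separable function such as AND, but no setting of w and b makes a single perceptron compute XOR.

```python
import numpy as np

def perceptron(x, w, b):
    # Threshold unit: fire iff the weighted input sum plus bias exceeds zero
    return 1 if np.dot(w, x) + b > 0 else 0

# Hand-picked (not learned) weights implementing AND, which is linearly separable
w_and, b_and = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), w_and, b_and))  # only (1, 1) maps to 1
```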

SLIDE 5

MODELLING XOR

[Plot: the four XOR inputs over axes x0 and x1; the two classes cannot be separated by a single line]

SLIDE 6

MODELLING XOR

[Plots: XOR decomposed into two linearly separable problems, x0 ∧ ¬x1 and ¬x0 ∧ x1; XOR(x0, x1) = (x0 ∧ ¬x1) ∨ (¬x0 ∧ x1)]
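The decomposition above can be sketched as a two-layer network. The weights below are hand-picked for illustration (not learned): hidden unit 1 computes x0 ∧ ¬x1, hidden unit 2 computes ¬x0 ∧ x1, and the output unit ORs them, giving XOR.

```python
import numpy as np

def step(a):
    # Heaviside step activation, applied element-wise
    return (a > 0).astype(int)

# Hand-picked weights: row 1 of W1 computes x0 AND NOT x1,
# row 2 computes NOT x0 AND x1; the output unit ORs the hidden units.
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
b1 = np.array([-0.5, -0.5])
w2 = np.array([1.0, 1.0])
b2 = -0.5

def xor_net(x):
    h = step(W1 @ x + b1)        # two linearly separable subproblems
    return int(w2 @ h + b2 > 0)  # OR of the two hidden units

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x, dtype=float)))  # XOR truth table: 0, 1, 1, 0
```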

SLIDE 7

MULTI-LAYER PERCEPTRON

  • Combining layers lets us

represent non-linear functions

  • Each layer:
  • Linear transformation:

a = Wx + b

  • Non-linear (element-wise)

activation: h = g(a)

[Diagram: inputs → hidden layer → outputs]
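The per-layer recipe above (linear transformation a = Wx + b, then an element-wise non-linearity h = g(a)) can be sketched in a few lines of numpy. The sizes and the random weights are illustrative only; in practice the weights are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W, b, g):
    """One MLP layer: linear transformation a = Wx + b, then element-wise g."""
    return g(W @ x + b)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Illustrative sizes: 4 inputs, 3 hidden units, 2 outputs
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)

x = rng.normal(size=4)
h = layer(x, W1, b1, sigmoid)   # hidden representation
y = layer(h, W2, b2, sigmoid)   # output
print(y.shape)  # (2,)
```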
SLIDE 8

MODELLING FUNCTIONS

  • Universal function approximation
  • Stacking layers: function composition
  • Apply error/loss function to output
  • Continuously differentiable; chain rule
  • Propagating errors (backpropagation)
  • (Mini-batch) Stochastic gradient descent

(SGD)

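The training steps listed above (forward pass, loss, gradients via the chain rule, mini-batch SGD updates) can be sketched on a toy problem. The task below, a linear model with squared-error loss, is an assumption for illustration; the mechanics are the same for deeper networks, where backpropagation supplies the gradients layer by layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task (illustrative): y = true_w . x + small noise
true_w = np.array([2.0, -3.0])
X = rng.normal(size=(200, 2))
y = X @ true_w + 0.01 * rng.normal(size=200)

w = np.zeros(2)
lr, batch_size = 0.1, 20

# Mini-batch SGD on the squared-error loss L = mean((Xw - y)^2)
for epoch in range(50):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        err = X[b] @ w - y[b]              # forward pass and error
        grad = 2 * X[b].T @ err / len(b)   # gradient of the loss (chain rule)
        w -= lr * grad                     # parameter update

print(w)  # close to [2.0, -3.0]
```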

SLIDE 9

MOTIVATION OF DEPTH

  • More compact representation (exponentially)
  • There are Boolean functions that require:
  • a polynomial number of units in a deep architecture, but
  • an exponential number of units in a shallow architecture
  • E.g., the parity function (on n input bits):
  • efficiently represented with depth O(log n)
  • but requires O(2^n) gates in a depth-two circuit (Yao, 1985)

Exploring Strategies for Training Deep Neural Networks; Larochelle, Bengio, Louradour, Lamblin; JMLR 2009

SLIDE 10

LEARNING LEVELS OF REPRESENTATION

  • Each layer:

non-linear transformation of inputs: h = sigmoid(Wx + b)

  • Learning representations; abstractions
  • No feature engineering!
SLIDE 11

DISTRIBUTED REPRESENTATIONS

  • E.g.: big, yellow, Volkswagen
  • Non-distributed representations:

n binary parameters → n values

  • E.g.: Clustering, n-grams, decision trees, etc.
  • NNs learn distributed representations
  • Distributed representations:

n binary parameters → 2^n possible values

SLIDE 12

EXAMPLE: WORD EMBEDDINGS

  • Distributed representations for words
  • word2vec, GloVe, etc.
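A common way to use such embeddings is to compare words by cosine similarity. The tiny 4-dimensional vectors below are made up for illustration; real word2vec/GloVe vectors typically have 100-300 dimensions and are learned from large corpora.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy embeddings (made up for illustration, not learned)
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.7, 0.2, 0.1]),
    "apple": np.array([0.0, 0.1, 0.9, 0.8]),
}

print(cosine(emb["king"], emb["queen"]))  # high: related words
print(cosine(emb["king"], emb["apple"]))  # low: unrelated words
```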
SLIDE 13

DEEP LEARNING IN JAVASCRIPT

cs231n.stanford.edu playground.tensorflow.org

SLIDE 14

LEVELS OF ABSTRACTIONS

SLIDE 15

Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 7, 27 Jan 2016

Convolution Layer

32x32x3 image (width 32, height 32, depth 3)

SLIDE 16

Convolution Layer

32x32x3 image, 5x5x3 filter

Convolve the filter with the image i.e. “slide over the image spatially, computing dot products”

SLIDE 17

Convolution Layer

32x32x3 image, 5x5x3 filter

Convolve the filter with the image i.e. “slide over the image spatially, computing dot products” Filters always extend the full depth of the input volume

SLIDE 18

Convolution Layer

32x32x3 image, 5x5x3 filter

1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias)
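That one-number computation, and the sliding of the filter over all valid positions, can be sketched directly in numpy (random image and filter for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

image = rng.normal(size=(32, 32, 3))   # 32x32x3 input volume
filt = rng.normal(size=(5, 5, 3))      # 5x5x3 filter (full input depth)
bias = 0.1

# One output number: dot product between the filter and a 5x5x3 chunk
# of the image (a 75-dimensional dot product), plus the bias.
chunk = image[0:5, 0:5, :]
value = float(np.sum(chunk * filt) + bias)

# Sliding the filter over all 28x28 valid positions gives the activation map.
amap = np.empty((28, 28))
for i in range(28):
    for j in range(28):
        amap[i, j] = np.sum(image[i:i+5, j:j+5, :] * filt) + bias
print(amap.shape)  # (28, 28)
```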

SLIDE 19

Convolution Layer

32x32x3 image, 5x5x3 filter

convolve (slide) over all spatial locations => activation map of size 28x28x1

SLIDE 20

Convolution Layer

32x32x3 image, 5x5x3 filter

convolve (slide) over all spatial locations => activation maps

consider a second, green filter

SLIDE 21

Convolution Layer

For example, if we had 6 5x5 filters, we'll get 6 separate activation maps. We stack these up to get a "new image" of size 28x28x6!

SLIDE 22

Preview: a ConvNet is a sequence of convolution layers, interspersed with activation functions.

[Diagram: 32x32x3 input → CONV, ReLU (e.g. 6 5x5x3 filters) → 28x28x6]

SLIDE 23

Preview: a ConvNet is a sequence of convolutional layers, interspersed with activation functions.

[Diagram: 32x32x3 input → CONV, ReLU (e.g. 6 5x5x3 filters) → 28x28x6 → CONV, ReLU (e.g. 10 5x5x6 filters) → 24x24x10 → ...]

SLIDE 24

[Figure: example 5x5 filters (32 total)]

We call the layer convolutional because it is related to convolution of two signals: element-wise multiplication and sum of a filter and the signal (image).

  • one filter => one activation map
SLIDE 25

[Figure: preview of a full ConvNet pipeline]

SLIDE 26

A closer look at spatial dimensions:

32x32x3 image, 5x5x3 filter

convolve (slide) over all spatial locations => activation map of size 28x28x1

SLIDE 27

A closer look at spatial dimensions:

7x7 input (spatially), assume 3x3 filter


SLIDE 31

7x7 input (spatially), assume 3x3 filter => 5x5 output

SLIDE 32

7x7 input (spatially), 3x3 filter applied with stride 2


SLIDE 34

7x7 input (spatially), 3x3 filter applied with stride 2 => 3x3 output!

SLIDE 35

7x7 input (spatially), 3x3 filter applied with stride 3?

SLIDE 36

7x7 input (spatially), 3x3 filter applied with stride 3?

doesn't fit! cannot apply a 3x3 filter on a 7x7 input with stride 3.

SLIDE 37

Output size: (N - F) / stride + 1

e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\
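A small helper makes the formula concrete. The `pad` parameter anticipates the zero-padding used on the following slides (general form (N + 2P - F) / stride + 1); with `pad=0` it reduces to the formula above, and a non-fitting stride raises an error, like 3x3 on 7x7 with stride 3.

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size of a convolution: (N + 2P - F) / stride + 1."""
    span = n + 2 * pad - f
    if span % stride != 0:
        raise ValueError(
            f"{f}x{f} filter with stride {stride} does not fit {n}x{n} input (pad {pad})"
        )
    return span // stride + 1

print(conv_output_size(7, 3, stride=1))  # 5
print(conv_output_size(7, 3, stride=2))  # 3
```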

SLIDE 38

In practice: common to zero-pad the border

e.g. 7x7 input, 3x3 filter applied with stride 1, pad with a 1 pixel border => what is the output?

(recall:) (N - F) / stride + 1

SLIDE 39

In practice: common to zero-pad the border

e.g. 7x7 input, 3x3 filter applied with stride 1, pad with a 1 pixel border => 7x7 output!

SLIDE 40

In practice: common to zero-pad the border

e.g. 7x7 input, 3x3 filter applied with stride 1, pad with a 1 pixel border => 7x7 output!

In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding of (F-1)/2 (this preserves the spatial size):
e.g. F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3

SLIDE 41

Remember: e.g. a 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially (32 -> 28 -> 24 -> ...). Shrinking too fast is not good and doesn't work well.

[Diagram: 32x32x3 input → CONV, ReLU (e.g. 6 5x5x3 filters) → 28x28x6 → CONV, ReLU (e.g. 10 5x5x6 filters) → 24x24x10 → ...]

SLIDE 42

Examples time. Input volume: 32x32x3; 10 5x5 filters with stride 1, pad 2. Output volume size: ?

SLIDE 43

Examples time. Input volume: 32x32x3; 10 5x5 filters with stride 1, pad 2. Output volume size: (32 + 2*2 - 5)/1 + 1 = 32 spatially, so 32x32x10

SLIDE 44

Examples time. Input volume: 32x32x3; 10 5x5 filters with stride 1, pad 2. Number of parameters in this layer?

SLIDE 45

Examples time. Input volume: 32x32x3; 10 5x5 filters with stride 1, pad 2. Number of parameters in this layer? Each filter has 5*5*3 + 1 = 76 params (+1 for the bias) => 76*10 = 760
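The parameter count generalizes to any conv layer: weights per filter times the number of filters, plus one bias each. A short helper reproducing the arithmetic above:

```python
def conv_layer_params(filter_h, filter_w, in_depth, num_filters):
    """Learnable parameters in a conv layer: weights plus one bias per filter."""
    per_filter = filter_h * filter_w * in_depth + 1  # +1 for the bias
    return per_filter * num_filters

print(conv_layer_params(5, 5, 3, 10))  # 760
```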

SLIDE 46

(btw, 1x1 convolution layers make perfect sense)

[Diagram: 56x56x64 input → 1x1 CONV with 32 filters → 56x56x32] (each filter has size 1x1x64, and performs a 64-dimensional dot product)

SLIDE 47

Pooling layer

  • makes the representations smaller and more manageable
  • operates over each activation map independently:
SLIDE 48

MAX POOLING

Single depth slice (4x4), x along width, y along height:

1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4

max pool with 2x2 filters and stride 2 =>

6 8
3 4
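A compact numpy sketch of 2x2 max pooling with stride 2, run on the same 4x4 depth slice as the example above:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on one depth slice (H and W must be even)."""
    h, w = x.shape
    # Group into non-overlapping 2x2 blocks, take the max of each block
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool_2x2(x))  # [[6 8], [3 4]]
```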

SLIDE 49

Fully Connected Layer (FC layer)

  • Contains neurons that connect to the entire input volume, as in ordinary neural networks

SLIDE 50

DROPOUT

  • During training:
  • For each post-activation hi, with probability p set hi = 0
  • Redundancy
  • Equivalent to learning an ensemble of networks

Improving neural networks by preventing co-adaptation of feature detectors; Hinton, Srivastava, Krizhevsky, Sutskever, Salakhutdinov; 2012; arXiv:1207.0580
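A minimal dropout sketch. The 1/(1-p) rescaling during training ("inverted dropout") is a common implementation detail not stated on the slide; it keeps the expected activation the same at training and test time, so test-time inference needs no change.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p, training=True):
    """Inverted dropout: zero each activation with probability p during
    training, scaling survivors by 1/(1-p); identity at test time."""
    if not training:
        return h
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

h = np.ones(10)
print(dropout(h, p=0.5))                   # roughly half the entries zeroed
print(dropout(h, p=0.5, training=False))   # unchanged at test time
```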

SLIDE 51

BATCH NORMALIZATION

  • For each mini-batch:
  • Normalize the inputs to every layer to zero mean, unit variance
  • Helps with internal covariate shift

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift; Ioffe, Szegedy; arXiv:1502.03167
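A sketch of the per-batch normalization step, assuming inputs of shape (batch, features). The scale gamma and shift beta are learned parameters in the full method; they are fixed constants here for illustration.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch to zero mean and unit variance,
    then apply scale (gamma) and shift (beta); eps avoids division by zero."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # skewed inputs
out = batch_norm(x)
print(out.mean(axis=0).round(6))  # ~0 per feature
print(out.std(axis=0).round(3))   # ~1 per feature
```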

SLIDE 52

RESIDUAL CONNECTIONS

[Diagram: input x passes through weight layer, ReLU, weight layer to produce F(x); an identity shortcut adds x, giving F(x) + x, followed by a final ReLU]

Deep Residual Learning for Image Recognition; He, Zhang, Ren, Sun; arXiv:1512.03385
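The block in the diagram can be sketched as follows (fully connected weight layers of equal width, standing in for the conv layers of the paper; random weights for illustration). The layers learn the residual F(x), and the shortcut adds x back unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda a: np.maximum(a, 0.0)

def residual_block(x, W1, W2):
    """Residual block: output relu(F(x) + x), where F is weight layer,
    ReLU, weight layer, and x flows through an identity shortcut."""
    f = W2 @ relu(W1 @ x)   # F(x)
    return relu(f + x)      # shortcut addition, then final ReLU

d = 8  # illustrative width; input and output dims must match for the shortcut
W1, W2 = 0.1 * rng.normal(size=(d, d)), 0.1 * rng.normal(size=(d, d))
x = rng.normal(size=d)
print(residual_block(x, W1, W2).shape)  # (8,)
```

Note that with all-zero weights the block reduces to relu(x): the identity is easy to represent, which is part of why very deep residual nets remain trainable.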

SLIDE 53

DEEPER AND DEEPER

  • 1998: LeNet-5; 3 layers
  • 2012: AlexNet; 8 layers
  • 2014: GoogLeNet; 22 layers (illustration)
  • 2015: Residual Nets; 152 layers
  • “Surpassed” human performance in 2015
SLIDE 54

DEPTH DEVELOPMENT

ImageNet classification top-5 error (%):

ILSVRC'10: 28.2 (shallow)
ILSVRC'11: 25.8 (shallow)
ILSVRC'12, AlexNet: 16.4 (8 layers)
ILSVRC'13: 11.7 (8 layers)
ILSVRC'14, VGG: 7.3 (19 layers)
ILSVRC'14, GoogLeNet: 6.7 (22 layers)
ILSVRC'15, ResNet: 3.57 (152 layers)

Kaiming He, Xiangyu Zhang, Shaoqing Ren & Jian Sun. "Deep Residual Learning for Image Recognition". arXiv 2015.

SLIDE 55

NON-CONVEX OPTIMIZATION

  • Loss function non-convex
  • Low-D: local minima dominate
  • High-D: saddle points dominate
  • Local minima are close to the global minimum

  • Convexity not needed

The loss surfaces of multilayer networks; Choromanska et al.; AISTATS 2015
Identifying and attacking the saddle point problem in high-dimensional non-convex optimization; Dauphin et al.; NIPS 2014

Yoshua Bengio

SLIDE 56

SEQUENCE MODELLING

(Illustration: Andrej Karpathy)

SLIDE 57

SENTIMENT ANALYSIS

[Diagram: inputs x(t-2), x(t-1), x(t) fed through a chain of LSTM cells; the final state feeds a classification layer]

  • Binary sequence classification
SLIDE 58

NEURAL MACHINE TRANSLATION, NMT

[Diagram: an encoder reads the input sequence x1, x2, x3; a decoder emits the output sequence y1, y2, y3]

Sequence to sequence learning with neural networks; Sutskever, Vinyals, Le; NIPS 2014
Neural machine translation by jointly learning to align and translate; Bahdanau, Cho, Bengio; ICLR 2015

SLIDE 59

RECENT ADVANCES IN NMT

  • Subwords (BPE) (Sennrich et al., ACL 2016)
  • 8-layer deep LSTM model
  • Quantized weights ∈ {−1, 0, +1}
  • Downpour SGD: parallel training
  • 8 GPUs, one host
  • Human evaluation:

results comparable to human translators!

Google’s neural machine translation system: Bridging the gap between human and machine translation; Yonghui Wu et al.; arXiv 1609.08144

SLIDE 60

CAPTION GENERATION


SLIDE 61

http://mogren.one/