Neural Network Training: Old & New Tricks

SLIDE 1

Neural Network Training: Old & New Tricks
  • Old (1980s): Stochastic Gradient Descent, Momentum, “weight decay”
  • New (last 5-6 years): Dropout, ReLUs, Batch Normalization

SLIDE 2

Reminder: Overfitting, in images
  • Figures: classification and regression fits, from underfitting through “just right” to overfitting

SLIDES 3-4
Dropout
  • Each sample is processed by a ‘decimated’ neural net
  • Decimated nets: distinct classifiers
  • But: they should all do the same job
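To make the ‘decimation’ concrete, here is a minimal inverted-dropout sketch (plain NumPy; keep_prob=0.5 and the helper name are illustrative choices, not from the slides):

    import numpy as np

    def dropout(h, keep_prob=0.5, train=True):
        # Randomly 'decimate' a layer: each unit survives with probability keep_prob
        if not train:
            return h                        # at test time the full net is used
        mask = np.random.rand(*h.shape) < keep_prob
        return h * mask / keep_prob         # rescale so the expected activation is unchanged

    h = np.ones(10)
    print(dropout(h))  # e.g. [2. 0. 2. 0. 0. 2. 2. 0. 2. 0.]

Each minibatch sample sees a different mask, i.e. a different decimated net, and all of them are trained to do the same job.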

SLIDE 5

Dropout Performance

SLIDE 6
Neural Network Training: Old & New Tricks (agenda recap; next up: ReLUs)

SLIDE 7
‘Neuron’: Cascade of Linear and Nonlinear Function
  • Nonlinearity: sigmoidal (“logistic”) or Rectified Linear Unit (ReLU)

SLIDES 8-12
Reminder: a network in backward mode
  • Outputs sit at the top; during backpropagation the gradient signal arrives from above
  • Each sigmoid unit scales the gradient by a factor < 1 (in fact at most 0.25, the maximum of the sigmoid’s derivative)

SLIDES 13-14
Vanishing Gradients Problem
  • The gradient signal from above is scaled by < 1 (at most 0.25) at every sigmoid unit
  • Do this 10 times: the updates in the first layers become minimal
  • The top layer knows what to do, but the lower layers “don’t get it”
  • Sigmoidal unit: the signal is not getting through!
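A minimal numerical sketch of the effect (plain NumPy; the 10-layer chain with unit weights is an illustrative assumption): each sigmoid multiplies the backward signal by its derivative, which is at most 0.25, so ten layers shrink it by up to 0.25^10 ≈ 1e-6.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x, grad = 0.5, 1.0                 # activation and gradient "from above"
    for layer in range(10):
        a = x                          # pre-activation (all weights = 1 here)
        x = sigmoid(a)
        grad *= sigmoid(a) * (1.0 - sigmoid(a))   # sigmoid derivative <= 0.25
        print(f"layer {layer + 1}: gradient factor {grad:.2e}")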

SLIDE 15
Vanishing Gradients Problem: ReLU Solves It
  • Gradient signal from above
  • Scaling is in {0, 1}: a ReLU either passes the gradient through unchanged or blocks it entirely

SLIDES 16-17
Activation Functions: ReLU & Co
  • Great! But… no gradient for the negative half-space
  • Lots of follow-up work: LeakyReLU, ELU, etc.
  • These can improve results, but typically only as fine-tuning
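For reference, a small sketch of the gradients of these activations (plain NumPy; the slopes a=0.01 for LeakyReLU and a=1.0 for ELU are the usual defaults, chosen here for illustration):

    import numpy as np

    def relu_grad(x):
        return (x > 0).astype(float)                 # gradient in {0, 1}

    def leaky_relu_grad(x, a=0.01):
        return np.where(x > 0, 1.0, a)               # small gradient for x < 0

    def elu_grad(x, a=1.0):
        return np.where(x > 0, 1.0, a * np.exp(x))   # smooth negative half-space

    x = np.array([-2.0, -0.5, 0.5, 2.0])
    print(relu_grad(x))        # [0. 0. 1. 1.] -> no signal for negative inputs
    print(leaky_relu_grad(x))  # [0.01 0.01 1. 1.]
    print(elu_grad(x))         # [0.14 0.61 1. 1.] (rounded)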

SLIDE 18
Neural Network Training: Old & New Tricks (agenda recap; next up: Batch Normalization)

SLIDE 19
External Covariate Shift: Your Input Changes
  • Example: the same scene photographed at 10 am, 2 pm, and 7 pm (figures)

SLIDES 20-24
“Whitening”: Set Mean = 0, Variance = 1
  • Photometric transformation: I → a·I + b
  • Make each patch have zero mean: x ← x − mean(x)
  • Then make it have unit variance: x ← x / std(x)
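A sketch of the two steps (NumPy; the eps guard is an implementation detail, not from the slides). Whitening exactly removes a photometric transformation I → a·I + b (for a > 0):

    import numpy as np

    def whiten(patch, eps=1e-8):
        patch = patch - patch.mean()           # make the patch have zero mean
        return patch / (patch.std() + eps)     # then make it have unit variance

    rng = np.random.default_rng(0)
    I = rng.random((10, 10))
    assert np.allclose(whiten(I), whiten(2.0 * I + 0.3), atol=1e-6)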

SLIDES 25-26
Batch Normalization
  • Whiten-as-you-go: normalize each layer’s activations over the current mini-batch
  • Used in all current systems
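A minimal forward-pass sketch of “whiten-as-you-go” (NumPy; the learned scale gamma, shift beta, and eps follow the standard batch-norm formulation of Ioffe & Szegedy, not these slides):

    import numpy as np

    def batch_norm_forward(x, gamma, beta, eps=1e-5):
        # x: (batch, features); whiten over the batch dimension
        mu, var = x.mean(axis=0), x.var(axis=0)
        x_hat = (x - mu) / np.sqrt(var + eps)   # mean 0, variance 1
        return gamma * x_hat + beta             # learned scale/shift

    x = np.random.randn(32, 4) * 5.0 + 2.0      # badly scaled activations
    y = batch_norm_forward(x, np.ones(4), np.zeros(4))
    print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # ~0 and ~1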

slide-27
SLIDE 27

Convolutional Neural Networks

16

SLIDE 28
Fully-connected Layer
  • Example: 200x200 image, 40K hidden units → ~1.6B parameters!!!
  • Spatial correlation is local
  • Waste of resources
  • We don’t have enough training samples anyway…

SLIDES 29-30
Locally-connected Layer
  • Example: 200x200 image, 40K hidden units, filter size 10x10 → 4M parameters
  • Note: this parameterization is good when the input image is registered (e.g., face recognition)

SLIDE 31
Convolutional Layer
  • Share the same parameters across different locations (assuming the input is stationary): convolutions with learned kernels

SLIDES 32-47
Convolutional Layer (animation frames: the learned kernel slides over the input)

SLIDE 48
Fully-connected layer
  • # of parameters: K² (every output unit connects to every input unit)

SLIDE 49
Convolutional layer
  • # of parameters: the size of the filter window

SLIDE 50
Convolutional layer
  • Example: convolving the image with a 3x3 vertical-edge kernel:

        1  0  −1
        1  0  −1
        1  0  −1

    image * kernel = edge map

SLIDE 51
Learning an edge filter

SLIDE 52
Convolutional layer
  • Learn multiple filters
  • E.g.: 200x200 image, 100 filters, filter size 10x10 → 10K parameters
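A quick check of these parameter counts (PyTorch for illustration; a single-channel input and bias-free layers are assumed):

    import torch.nn as nn

    # 100 filters of size 10x10 over a 1-channel 200x200 image
    conv = nn.Conv2d(in_channels=1, out_channels=100, kernel_size=10, bias=False)
    print(sum(p.numel() for p in conv.parameters()))  # 100 * 1 * 10 * 10 = 10000

    # A fully-connected layer from 200x200 inputs to 40K hidden units would need:
    print(200 * 200 * 40_000)                         # 1,600,000,000 (~1.6B) weights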

SLIDES 53-56
Convolutional layer with ReLU activation
  • Each output feature map is computed from all input feature maps:

    h_j^n = \max\left(0,\; \sum_k h_k^{n-1} * w_{jk}^n\right)

  • h_k^{n−1}: input feature maps; h_j^n: output feature maps; w_{jk}^n: learned kernels; *: convolution; max(0, ·): the ReLU activation
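A direct transcription of the formula (NumPy/SciPy; correlate2d plays the role of the learned convolution, and all sizes are illustrative):

    import numpy as np
    from scipy.signal import correlate2d

    def conv_relu_layer(h_prev, w):
        # h_prev: (K, H, W) input maps; w: (J, K, kh, kw) kernels -> (J, H', W')
        J, K = w.shape[0], w.shape[1]
        out = [np.maximum(0.0, sum(correlate2d(h_prev[k], w[j, k], mode="valid")
                                   for k in range(K)))     # sum over input maps, then ReLU
               for j in range(J)]
        return np.stack(out)

    h = np.random.randn(3, 8, 8)        # 3 input feature maps
    w = np.random.randn(2, 3, 3, 3)     # 2 output maps, 3x3 kernels
    print(conv_relu_layer(h, w).shape)  # (2, 6, 6)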

SLIDES 57-60
De-convolutional layer with ReLU activation
  • There is no real inverse, but convolutions can easily go the other way
  • “De-convolution” or “transposed convolution”: also a convolution, with the transposed weight tensor
  • The layer formula above still holds, same structure
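A small shape experiment (PyTorch; the 2x2/stride-2 sizes are illustrative) showing that the transposed convolution “goes the other way” spatially:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 3, 8, 8)                             # (batch, C, H, W)
    down = nn.Conv2d(3, 8, kernel_size=2, stride=2)         # 8x8 -> 4x4
    up = nn.ConvTranspose2d(8, 3, kernel_size=2, stride=2)  # 4x4 -> 8x8

    y = down(x)
    print(y.shape)      # torch.Size([1, 8, 4, 4])
    print(up(y).shape)  # torch.Size([1, 3, 8, 8]): spatial size restored, values not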

SLIDES 61-64
Pooling layer
  • Pooling layer: receptive field size (figures)

SLIDES 65-73
Receptive field
  • Animation over layers 1-8: the receptive field grows with every layer

SLIDE 74
Modern Architectures

SLIDES 75-77
CNNs, late 1980s: LeNet
  • INPUT 32x32 → C1: feature maps 6@28x28 (convolutions) → S2: f. maps 6@14x14 (subsampling) → C3: f. maps 16@10x10 (convolutions) → S4: f. maps 16@5x5 (subsampling) → C5: layer 120 (full connection) → F6: layer 84 (full connection) → OUTPUT 10 (Gaussian connections)
  • Gradient-based learning applied to document recognition, Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, 1998.
  • https://www.youtube.com/watch?v=FwFduRA_L6Q

SLIDES 78-80
What happened in between?
  • deep learning = neural networks (+ big data + GPUs) + a few more recent tricks!

SLIDE 81
CNNs, 2012: AlexNet
  • Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton: ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6): 84-90 (2017)

SLIDE 82
CNNs, 2014: VGG
  • Karen Simonyan, Andrew Zisserman (= Visual Geometry Group): Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv, 2014.

SLIDE 83
CNNs, 2015: ResNet
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun: Deep Residual Learning for Image Recognition, CVPR 2016.

SLIDE 84
Going Deeper: The Deeper, the Better
  • Deeper networks can cover more complex problems
  • Increasingly large receptive field size & rich patterns

SLIDE 85
Going Deeper
  • From ~20 layers to 100/1000
  • Residual networks

SLIDE 86
Residual Network
  • Naïve solution: if the extra layers are an identity mapping, the training error cannot increase

SLIDE 87
Residual Modelling: Basic idea in image processing
  • Goal: estimate the update between an original image and a changed image
  • Some network estimates only the residual; by preserving the base information, it can treat just the perturbation

SLIDE 88
Residual Network
  • Plain block: difficult to make an identity mapping because of the multiple non-linear layers

SLIDE 89
Residual Network
  • Residual block: y = F(x) + x
  • If identity were optimal, it is easy to set the weights to 0
  • If the optimal mapping is close to identity, it is easier to find the small fluctuations
  • Appropriate for treating a perturbation while keeping the base information
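A minimal residual block sketch (PyTorch; the two-conv body and channel count follow the common ResNet pattern, details are illustrative): if the weights of the body go to 0, the block reduces to (roughly) the identity.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        # y = x + F(x): the block only has to learn the residual F
        def __init__(self, channels):
            super().__init__()
            self.f = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=1),
            )

        def forward(self, x):
            return torch.relu(x + self.f(x))   # identity shortcut + residual

    x = torch.randn(1, 16, 32, 32)
    print(ResidualBlock(16)(x).shape)          # torch.Size([1, 16, 32, 32])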

SLIDES 90-91
Residual Network: Deeper is better
  • Deeper ResNets have lower training error

SLIDE 92
CNNs, 2017: DenseNet
  • Densely Connected Convolutional Networks, CVPR 2017. Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger
  • Recently proposed; better performance/parameter ratio

SLIDE 93
Image-to-Image

SLIDE 94
Graphics: Multiresolution

SLIDE 95
Image-to-image
  • So far we mapped an image to a number or label
  • In graphics, the output is often “richer”:
    • An image
    • A volume
    • A 3D mesh
  • Note: “image” is just a placeholder name here for any Eulerian data
  • Architectures:
    • Fully convolutional
    • Encoder-decoder
    • Skip connections

SLIDES 96-101
FCNN: Fully-convolutional Neural Networks
  • Flexible: works with varying input sizes
  • Typically reduces the input by a fixed factor
  • In practice: e.g. 32-fold decimation, 224x224 down to 7x7

SLIDE 102
Encoder-Decoder
  • Space → Features → Space (figure)

SLIDE 103
Interpretation
  • Encoder: turns a data set (e.g. an image) into a vector
  • This vector is a very compact and abstract “code”
  • It lives in the “latent space” of the neural network
  • Decoder: turns the code back into an image

SLIDE 104
Encoder-decoder + Skip connections
  • 1st: reduce resolution as before
  • 2nd: increase resolution again (transposed convolutions)
  • Skip connections preserve information
  • But the network cannot be split into encoder and decoder anymore
  • U-Net: Convolutional Networks for Biomedical Image Segmentation, Ronneberger et al., 2015 (a toy sketch follows below)
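A toy encoder-decoder with one skip connection (PyTorch; far smaller than the real U-Net, all channel counts illustrative):

    import torch
    import torch.nn as nn

    class TinyUNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.enc = nn.Conv2d(1, 8, 3, stride=2, padding=1)  # halve resolution
            self.dec = nn.ConvTranspose2d(8, 8, 2, stride=2)    # double it again
            self.out = nn.Conv2d(8 + 1, 1, 3, padding=1)        # after the skip

        def forward(self, x):
            z = torch.relu(self.enc(x))                  # encoder: space -> features
            up = torch.relu(self.dec(z))                 # decoder: features -> space
            return self.out(torch.cat([up, x], dim=1))   # skip connection

    x = torch.randn(1, 1, 64, 64)
    print(TinyUNet()(x).shape)  # torch.Size([1, 1, 64, 64])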

SLIDE 105
SIGGRAPH Asia Course CreativeAI: Deep Learning for Graphics
http://geometry.cs.ucl.ac.uk/creativeai/
Thank you!

SLIDE 106
Recurrent Neural Networks

SLIDE 107
Recurrent Neural Networks
  • Time-dependent problems: repeated evaluations with an internal “state”
  • The state x_t at time t depends on the previous times
  • Recurrent Neural Networks (RNNs)
  • Specialized back-prop is possible: back-propagation through time (BPTT)
  • Unrolled: the recurrence expands into a feed-forward chain, one copy per time step
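A minimal unrolled evaluation of a vanilla RNN cell (NumPy; the tanh cell and all sizes are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_state = 4, 8
    W_x = 0.1 * rng.normal(size=(n_state, n_in))     # input -> state
    W_h = 0.1 * rng.normal(size=(n_state, n_state))  # state -> state (the recurrence)

    h = np.zeros(n_state)                            # internal state
    for x_t in rng.normal(size=(10, n_in)):          # 10 time steps, unrolled
        h = np.tanh(W_x @ x_t + W_h @ h)             # state depends on previous times
    print(h.shape)  # (8,)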

SLIDES 108-114
Common Building Block: LSTM Units
  • Long short-term memory (LSTM) networks
  • Three gates: input, output, forget
  • Input and candidate-state paths: standard neural-network layers
  • Cell: the history, i.e. the stored data
  • Input gate: weighs new vs. stored data
  • Forget gate: forgets stored data
  • Output gate: controls the amount of data output

SLIDE 115
Common Building Block: LSTM Units
  • Long short-term memory (LSTM) networks
  • In equation form (see below)
  • [Sutskever et al., “Sequence to Sequence Learning with Neural Networks”, 2014]
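The equations themselves were a figure on the slide; the standard formulation (as in e.g. Sutskever et al. 2014, with \sigma the logistic sigmoid and \odot elementwise multiplication) reads:

    i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)    % input gate: weigh new vs. stored
    f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)    % forget gate: forget stored data
    o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)    % output gate: control output
    c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
    h_t = o_t \odot \tanh(c_t)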

SLIDE 116
Recurrent Neural Networks
  • LSTM networks are a powerful tool for sequences over time
  • Alternatives:
    • Gated Recurrent Units (GRUs) [Chung et al., “Empirical evaluation of gated recurrent neural networks on sequence modeling”, 2014]
    • Temporal convolutional networks (TCNs) [Bai et al., “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling”, 2018]

SLIDE 117
Deep Learning Frameworks

SLIDE 118
  • Main frameworks (logos on the slide): PyTorch (Python), Tensorflow (Python, C++, Java), Caffe (C++, Python, Matlab), Keras (Python; backends support other languages)
  • Currently less frequently used: CNTK (Python, C++, C#), MXNet (Python, C++, and others), MatConvNet (Matlab), Deeplearning4j (Python, Java, Scala), Chainer (Python), Caffe2 (Python, C++), Theano (Python)

SLIDE 119
Popularity
  • Google Trends charts for the search terms “[name] tutorial” and “[name] github”

SLIDE 120
Typical Training Steps

    for i = 1 .. max_iterations
        input, ground_truth = load_minibatch(data, i)
        output = network_evaluate(input, parameters)
        loss = compute_loss(output, ground_truth)
        # gradients of loss with respect to parameters
        gradients = network_backpropagate(loss, parameters)
        parameters = optimizer_step(parameters, gradients)
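The same loop in runnable form (PyTorch; the toy linear model and synthetic minibatch stand in for the real network and load_minibatch):

    import torch
    import torch.nn as nn

    model = nn.Linear(4, 1)                      # stand-in for the network
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.MSELoss()

    for i in range(100):                         # max_iterations
        inp = torch.randn(8, 4)                  # load_minibatch(data, i)
        ground_truth = inp.sum(dim=1, keepdim=True)
        output = model(inp)                      # network_evaluate
        loss = loss_fn(output, ground_truth)     # compute_loss
        optimizer.zero_grad()
        loss.backward()                          # network_backpropagate
        optimizer.step()                         # optimizer_step
    print(loss.item())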

SLIDE 121
Tensors
  • Frameworks typically represent data as tensors
  • Examples:
    • 4D input data: B x C x H x W (batches B, feature channels C, spatial height H, spatial width W)
    • 4D convolution kernel: OC x IC x KH x KW (output channels OC, input channels IC, kernel height KH, kernel width KW)
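These layouts can be checked directly (PyTorch, which uses exactly this B x C x H x W convention; the sizes are arbitrary):

    import torch
    import torch.nn as nn

    x = torch.randn(16, 3, 32, 32)  # B x C x H x W
    conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5)
    print(conv.weight.shape)        # OC x IC x KH x KW: torch.Size([8, 3, 5, 5])
    print(conv(x).shape)            # torch.Size([16, 8, 28, 28])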

SLIDE 122
What Does a Deep Learning Framework Do?
  • Tensor math
  • Common network operations/layers
  • Gradients of common operations
  • Backpropagation
  • Optimizers
  • GPU implementations of the above
  • Usually: data loading, network parameter saving/loading
  • Sometimes: distributed computing
SLIDE 123
Automatic Differentiation & the Computation Graph

    parameters = (weight, bias)
    output = σ(weight * input + bias)
    loss = (output - ground_truth)^2
    # gradients of loss with respect to parameters
    gradients = backpropagate(loss, parameters)

  • Forward pass (figure): weight and input feed *, bias feeds +, σ gives output; output and ground_truth feed the squared difference, yielding loss
  • Backward pass (figure): ∂loss/∂output flows back through the same graph to ∂loss/∂weight and ∂loss/∂bias
  • Since loss is a scalar, the gradients are the same size as the parameters
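The same example with a real autodiff engine (PyTorch autograd; torch.sigmoid plays the role of σ):

    import torch

    weight = torch.tensor(0.5, requires_grad=True)
    bias = torch.tensor(0.1, requires_grad=True)
    inp, ground_truth = torch.tensor(2.0), torch.tensor(1.0)

    output = torch.sigmoid(weight * inp + bias)  # forward pass builds the graph
    loss = (output - ground_truth) ** 2
    loss.backward()                              # backward pass fills .grad

    print(weight.grad, bias.grad)  # same size as the parameters (scalars here)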

SLIDE 124
Automatic Differentiation & the Computation Graph
  • Each node g of the graph provides a forward and a backward function:

    outputs = forward(inputs, parameters)
    ∂loss/∂inputs, ∂loss/∂parameters = backward(∂loss/∂outputs)

SLIDE 125
Static vs Dynamic Computation Graphs
  • Static analysis allows optimizations and distributing the workload
  • Dynamic graphs make data-driven control flow easier
  • In static graphs, the graph is usually defined in a separate ‘language’
  • Static graphs have less support for debugging

  • Static: define the graph once, evaluate it during training

    x = Variable()
    loss = if_node(x < parameter[0], x + parameter[0], x - parameter[1])
    for i = 1 .. max_iterations
        x = data()
        run(loss)
        backpropagate(loss, parameters)

  • Dynamic: the graph is defined implicitly by running operations; a new graph is created in each evaluation

    for i = 1 .. max_iterations
        x = data()
        if x < parameter[0]
            loss = x + parameter[0]
        else
            loss = x - parameter[1]
        backpropagate(loss, parameters)

SLIDE 126
Tensorflow
  • Currently the largest community
  • Static graphs (dynamic graphs are in development: Eager Execution)
  • Good support for deployment
  • Good support for distributed computing
  • Typically slower than the other three main frameworks on a single GPU

SLIDE 127
PyTorch
  • Fast-growing community
  • Dynamic graphs
  • Distributed computing is in development (some support is already available)
  • Intuitive code; easy to debug, and good for experimenting with less traditional architectures thanks to dynamic graphs
  • Very fast
SLIDE 128
Keras
  • A high-level interface for various backends (Tensorflow, CNTK, Theano)
  • Intuitive high-level code
  • Focus on optimizing the time from idea to code
  • Static graphs
SLIDE 129
Caffe
  • Created earlier than Tensorflow, PyTorch, or Keras
  • Less flexible and less general than the other three frameworks
  • Static graphs
  • Legacy: to be replaced by Caffe2, whose focus is on performance and deployment
  • Facebook’s platform for Detectron (Mask-RCNN, DensePose, …)
SLIDE 130
Converting Between Frameworks
  • Example: develop in one framework, deploy in another
  • Currently: a large range of converters, but no clear standard
  • Standardized model formats are in development
  • (Table: pairwise converters between tensorflow, pytorch, keras, caffe, caffe2, CNTK, chainer, and mxnet, using tools such as MMdnn, ONNX, nn_tools, pytorch2keras, and caffe-tensorflow; many pairs have no converter)
  • From https://github.com/ysh329/deep-learning-model-convertor
SLIDE 131
MMdnn
  • Standard format for models
  • Native support in development for Pytorch, Caffe2, Chainer, CNTK, and MxNet
  • Converter in development for Tensorflow
  • Converters available for several frameworks
  • Common intermediate representation, but no clear standard

SLIDE 132
SIGGRAPH Asia Course CreativeAI: Deep Learning for Graphics
http://geometry.cs.ucl.ac.uk/creativeai/
Thank you!