Introduction to High-Performance Machine Learning: Convolutional Neural Networks - PowerPoint PPT Presentation


SLIDE 1

Introduction to High-Performance Machine Learning: Convolutional Neural Networks www.prace-ri.eu 1

Introduction to High-Performance Machine Learning: Convolutional Neural Networks

SURFsara

Valeriu Codreanu

SLIDE 2

History:

  • 1971: Founded by the VU, UvA, and CWI
  • 2013: SARA (Stichting Academisch Rekencentrum A’dam) becomes part of SURF

Cartesius (Bull supercomputer):

  • 40,960 Ivy Bridge / Haswell cores: 1,327 TFLOPS
  • 56 Gbit/s InfiniBand
  • 64 nodes with 2 K40m GPUs each: 210 TFLOPS
  • Broadwell & KNL extension (Nov 2016): 177 BDW and 18 KNL nodes: 284 TFLOPS
  • 7.7 PB Lustre parallel file system
  • Top500 position: #45 (2014/11), #142 (2017/11)

Increasing number of deep learning projects!

SURFsara

SLIDE 3

GPU Programming www.prace-ri.eu 3

Today’s lecture:

  • NNs for computer vision
  • High-performance CNN vision models (AlexNet, GoogLeNet, ResNet, …)
  • High-performance hardware: GPUs, CPUs, special-purpose accelerators
  • Scaling deep learning vision models: bottlenecks, efficiency, successful results

Today’s hands-on:

  • Deep learning frameworks
  • Example: TensorFlow/Horovod and MXNet on Intel and NVIDIA hardware
  • Example: Keras

Outline

SLIDE 4

Classical Computer Vision Pipeline

CV experts:

  • 1. Select / develop features: SURF, HoG, SIFT, RIFT, …
  • 2. Add machine learning on top for multi-class recognition and train a classifier

Feature extraction (SIFT, HoG, …) → Detection, Classification, Recognition

Classical CV feature definition is domain-specific and time-consuming

SLIDE 5

Deep Learning–based Vision Pipeline

Deep Learning:

  • Build features automatically based on training data
  • Combine feature extraction and classification

DL experts: define the NN topology and train the NN

Deep NN → Detection, Classification, Recognition

Deep Learning promise: learn good features automatically, using the same method across different domains

SLIDE 6

Neural networks

SLIDE 7

SLIDE 8

Object segmentation

Figure credit: Dai, He, and Sun, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, CVPR 2016

SLIDE 9

Pose estimation

Figure credit: Cao et al., “Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields”, arXiv 2016

SLIDE 10

Image Super-resolution

Figure credit: Ledig et al., “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network”, arXiv 2016

SLIDE 11

Art generation

Gatys, Ecker, and Bethge, “Image Style Transfer using Convolutional Neural Networks”, CVPR 2016 (left) Mordvintsev, Olah, and Tyka, “Inceptionism: Going Deeper into Neural Networks” (upper right) Johnson, Alahi, and Fei-Fei: “Perceptual Losses for Real-Time Style Transfer and Super-Resolution”, ECCV 2016 (bottom left)

SLIDE 12

Neural networks

SLIDE 13

Based on slide from Andrew Ng

The neuron model

SLIDE 14

Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2017

SLIDE 15

Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2017

SLIDE 16

4 easy steps:

  • 1. Sample a batch of data
  • 2. Forward prop it through the graph (network), get the loss
  • 3. Backprop to calculate the gradients
  • 4. Update the parameters using the gradients
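The four steps above can be sketched as a single SGD step on a toy logistic-regression "network" (a minimal numpy sketch; the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def train_step(w, x_batch, y_batch, lr=0.1):
    """One SGD step: forward prop, loss, backprop, parameter update."""
    # Step 1 (sampling the batch) is done by the caller.
    # Step 2: forward prop through the "graph", get the loss
    z = x_batch @ w
    p = 1.0 / (1.0 + np.exp(-z))               # sigmoid activation
    loss = -np.mean(y_batch * np.log(p) + (1 - y_batch) * np.log(1 - p))
    # Step 3: backprop to calculate the gradient of the loss w.r.t. w
    grad = x_batch.T @ (p - y_batch) / len(y_batch)
    # Step 4: update the parameters using the gradient
    w -= lr * grad
    return w, loss
```

Repeating this step over freshly sampled batches is the whole training loop, regardless of how deep the network is.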
SLIDE 17

Fully-connected networks

x [C1] → Matrix Multiply (w1 [C1×C2]) → s [C2] → Nonlinearity → a [C2] → Matrix Multiply (w2 [C2×C3]) → ŷ [C3]
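In code, this two-layer fully-connected forward pass is just two matrix multiplies with a nonlinearity in between (a minimal numpy sketch following the slide's dimensions; ReLU is chosen here as the nonlinearity for illustration):

```python
import numpy as np

def fc_forward(x, w1, w2):
    """Two-layer fully-connected forward pass.
    x: [C1], w1: [C1, C2], w2: [C2, C3] -> y_hat: [C3]."""
    s = x @ w1               # matrix multiply: [C2]
    a = np.maximum(s, 0.0)   # nonlinearity (ReLU)
    y_hat = a @ w2           # matrix multiply: [C3]
    return y_hat
```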

SLIDE 18

x [C1×H×W] → Convolutions (w1 [C2×C1×k×k]) → s [C2×H×W] → Nonlinearity → a [C2×H×W] → Pooling → p [C2×H/2×W/2] → Fully Connected (w2 [C2HW/4×C3]) → ŷ [C3]

Convolutional neural networks

SLIDE 19

Sobel operator:

The convolution operator
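Applying the Sobel operator is just a small 2D convolution (a minimal numpy sketch; `conv2d_valid` is an illustrative helper, not a library function):

```python
import numpy as np

# Sobel kernel for horizontal gradients (responds to vertical edges)
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

def conv2d_valid(image, kernel):
    """Direct 2D convolution with 'valid' padding (output shrinks by k-1)."""
    k = kernel.shape[0]
    kern = kernel[::-1, ::-1]  # flip the kernel for true convolution
    h, w = image.shape
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+k, j:j+k] * kern)
    return out
```

Sliding the same small kernel over the whole image is exactly what each filter in a convolutional layer does; the only difference is that a CNN learns the kernel weights instead of fixing them by hand.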

SLIDE 20

Stanford CS231n class CS231n: Convolutional Neural Networks for Visual Recognition

SLIDE 21

Stanford CS231n class CS231n: Convolutional Neural Networks for Visual Recognition

SLIDE 22

Stanford CS231n class CS231n: Convolutional Neural Networks for Visual Recognition

SLIDE 23

Stanford CS231n class CS231n: Convolutional Neural Networks for Visual Recognition

SLIDE 24

Stanford CS231n class CS231n: Convolutional Neural Networks for Visual Recognition

SLIDE 25

Stanford CS231n class CS231n: Convolutional Neural Networks for Visual Recognition

SLIDE 26

Stanford CS231n class CS231n: Convolutional Neural Networks for Visual Recognition

SLIDE 27

Convolution as matrix-matrix multiplication

Very important factor motivating early GPU usage for neural network training!
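The standard trick behind this, commonly called im2col, unrolls every input patch into a row so the convolution becomes a single matrix-matrix multiply (GEMM). A minimal numpy sketch (computing cross-correlation, as deep learning frameworks do; function names are illustrative):

```python
import numpy as np

def im2col(x, k):
    """x: [C, H, W] -> patch matrix [(H-k+1)*(W-k+1), C*k*k]."""
    C, H, W = x.shape
    rows = []
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            rows.append(x[:, i:i+k, j:j+k].ravel())
    return np.stack(rows)

def conv_as_gemm(x, weights):
    """weights: [C_out, C_in, k, k]; convolution layer as one GEMM."""
    c_out, c_in, k, _ = weights.shape
    cols = im2col(x, k)                 # [P, C_in*k*k]
    w_mat = weights.reshape(c_out, -1)  # [C_out, C_in*k*k]
    out = cols @ w_mat.T                # one big matrix multiply: [P, C_out]
    h_out = x.shape[1] - k + 1
    w_out = x.shape[2] - k + 1
    return out.T.reshape(c_out, h_out, w_out)
```

The GEMM form trades extra memory (patches are duplicated) for the ability to use highly tuned dense linear-algebra kernels, which is why early GPU deep learning libraries leaned on it so heavily.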

SLIDE 28

Stanford CS231n class CS231n: Convolutional Neural Networks for Visual Recognition

SLIDE 29

Stanford CS231n class CS231n: Convolutional Neural Networks for Visual Recognition

SLIDE 30

Stanford CS231n class CS231n: Convolutional Neural Networks for Visual Recognition

Subsampling

Subsampling pixels will not change the object: a subsampled bird is still a bird.
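A typical subsampling operation is 2×2 max pooling with stride 2, which halves each spatial dimension while keeping the strongest activation in every window (a minimal numpy sketch; the helper name is illustrative):

```python
import numpy as np

def max_pool2x2(x):
    """2x2 max pooling, stride 2: [C, H, W] -> [C, H//2, W//2]."""
    C, H, W = x.shape
    # Crop odd trailing rows/columns, then take the max over each 2x2 block.
    x = x[:, :H // 2 * 2, :W // 2 * 2]
    return x.reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))
```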

SLIDE 31

Stanford CS231n class CS231n: Convolutional Neural Networks for Visual Recognition

SLIDE 32

Stanford CS231n class CS231n: Convolutional Neural Networks for Visual Recognition

LeNet-5

SLIDE 33

Neural networks

SLIDE 34

5 convolutional layers, 3 fully connected layers + soft-max; 650K neurons, 60M weights

AlexNet

SLIDE 35

SLIDE 36

Rob Fergus, NIPS 2013

Depth reduction experiment on AlexNet

SLIDE 37

Depth reduction experiment on AlexNet

SLIDE 38

Depth reduction experiment on AlexNet

SLIDE 39

Depth reduction experiment on AlexNet

SLIDE 40

Depth reduction experiment on AlexNet

SLIDE 41

Stanford CS231n class CS231n: Convolutional Neural Networks for Visual Recognition

Feature visualization

SLIDE 42

Per-image forward-pass cost of popular CNNs spans roughly 1.5 GFLOP to 19.6 GFLOP (with many models in the 3.6–11.6 GFLOP range). Training VGG for 50 epochs on ImageNet uses more than 1 ExaFLOP: true HPC-style distributed training is needed.
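The ExaFLOP claim checks out with back-of-the-envelope arithmetic (assumptions, not slide content: ~19.6 GFLOP per VGG forward pass, a ~3x factor for forward plus backward, and the ~1.28M-image ImageNet-1k training set; exact figures vary by VGG variant):

```python
flops_per_image_fwd = 19.6e9   # approx. FLOPs for one VGG forward pass
fwd_bwd_factor = 3             # backward pass costs roughly 2x the forward
images = 1.28e6                # approx. ImageNet-1k training images
epochs = 50

total_flops = flops_per_image_fwd * fwd_bwd_factor * images * epochs
print(f"{total_flops:.1e}")    # on the order of 1e18, i.e. ExaFLOPs
```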

SLIDE 43

SLIDE 44

SLIDE 45

SLIDE 46

SLIDE 47

Accuracy scales with data and model size

Network capacity is crucial!

SLIDE 48

Accuracy scales with data and model size

But computation also scales!! Network capacity is crucial!

SLIDE 49

Accuracy scales with data and model size

But computation also scales!! Network capacity is crucial! Hardware to the rescue

SLIDE 50

Intel Knights Mill & Beyond

SLIDE 51

SLIDE 52

NVIDIA Volta Architecture

SLIDE 53

Good libraries and framework integration are key!

NVIDIA V100 performance

SLIDE 54

SLIDE 55

SLIDE 56

Available as of 12/02/2018: https://cloud.google.com/tpu/

SLIDE 57

SLIDE 58

Going forward: Distributed deep learning training

SLIDE 59

Single-server training

Training time in days, measured on an NVIDIA M40 server with 8 GPUs: ResNet-50: 2, ResNet-101: 4, ResNeXt-101-32x4d: 7, IN5k-ResNeXt-101-64x4d: 14. For Internet-scale data / videos: months?

SLIDE 60

Parallel neural network training

SLIDE 61

Synchronous vs. Asynchronous SGD

Image courtesy of Jim Dowling. Chen et al.: Revisiting Distributed Synchronous SGD

SLIDE 62

Data-parallel training
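In code, one synchronous data-parallel step amounts to: shard the global batch across workers, compute per-worker gradients, average them (the all-reduce), and apply one identical update everywhere. A toy in-process numpy sketch with a linear least-squares model (all names are illustrative):

```python
import numpy as np

def grad(w, x, y):
    """Gradient of mean squared error for a linear model y ~ x @ w."""
    return 2 * x.T @ (x @ w - y) / len(y)

def data_parallel_step(w, x, y, n_workers, lr=0.01):
    """Synchronous data-parallel SGD on one global batch: shard the
    batch, compute per-worker gradients, average them (this average is
    exactly what all-reduce computes), apply one identical update."""
    xs = np.array_split(x, n_workers)
    ys = np.array_split(y, n_workers)
    grads = [grad(w, xi, yi) for xi, yi in zip(xs, ys)]
    avg_grad = np.mean(grads, axis=0)   # the "all-reduce"
    return w - lr * avg_grad
```

With equal-sized shards this produces exactly the same update as a single worker processing the whole batch, which is why synchronous data parallelism preserves the single-node training semantics.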

SLIDE 63

Courtesy NVIDIA

SLIDE 64

Courtesy NVIDIA

SLIDE 65

Courtesy NVIDIA

SLIDE 66

Courtesy NVIDIA

SLIDE 67

Courtesy NVIDIA

SLIDE 68


SLIDE 69

SLIDE 70

Naïve Scaling

Training-error curves (error % vs. epochs): the kn = 256 baseline with LR 0.1 reaches 23.60% error, while naïvely using kn = 8k with the same LR 0.1 stalls at 41.78%. Here 8192 = 256 GPUs × a per-GPU batch of 32.

SLIDE 71

Linear LR scaling

  • Linearly scale Learning Rate (LR) with batch size

Optimization difficulty
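The linear scaling rule is a one-liner (a sketch; 256 here is the reference batch size used in Goyal et al.'s large-minibatch work, not a value from this slide):

```python
def scaled_lr(base_lr, batch_size, base_batch=256):
    """Linear scaling rule: if the batch grows by a factor k,
    multiply the learning rate by the same factor k."""
    return base_lr * batch_size / base_batch
```

For example, moving from batch 256 at LR 0.1 to batch 8192 gives LR 3.2, a 32x increase, which is exactly why warmup (next slide) becomes necessary.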

SLIDE 72

Gradual LR warmup

  • start from an LR of η and increase it by a constant amount at each iteration so that it reaches η̂ = kη after 5 epochs
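This warmup schedule can be sketched as follows (an illustrative function, not from the slides; `warmup_steps` would be the number of iterations in 5 epochs):

```python
def lr_schedule(step, base_lr, target_lr, warmup_steps):
    """Gradual warmup: ramp the LR linearly from base_lr to target_lr
    over warmup_steps iterations, then hold target_lr."""
    if step >= warmup_steps:
        return target_lr
    return base_lr + (target_lr - base_lr) * step / warmup_steps
```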
SLIDE 73

Histogram of gradient buffers by parameter size (power of 2): N = 214 buffers to synchronize in total. Small buffers are latency-bound; large buffers are bandwidth-bound.

SLIDE 74

Collective Execution with Gloo

Backpropagation produces per-layer gradients (gradient 1, gradient 2, gradient 3, … for layers N down to 1), and each is reduced with an AllReduce: every node contributes its local gradient (G1, G2, …, G8) and receives the sum G1 + G2 + … + G8. The operation combines intra-machine and inter-machine communication across Node-1 … Node-K.

AllReduce is important to scale efficiently
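Gloo implements such collectives natively; as an illustration of why AllReduce scales well, here is a toy in-process simulation of ring all-reduce (reduce-scatter followed by all-gather), not the actual Gloo API:

```python
import numpy as np

def ring_allreduce(buffers):
    """Simulated ring all-reduce: every worker ends with the elementwise
    sum of all buffers, while only 1/k of the data moves per step."""
    k = len(buffers)
    # Each worker splits its gradient buffer into k chunks.
    chunks = [list(np.array_split(b.astype(float), k)) for b in buffers]
    # Reduce-scatter: after k-1 steps, worker (c-1) % k holds the
    # fully summed chunk c.
    for step in range(k - 1):
        for r in range(k):
            c = (r - step) % k       # chunk that worker r forwards
            dst = (r + 1) % k
            chunks[dst][c] = chunks[dst][c] + chunks[r][c]
    # All-gather: circulate the completed chunks around the ring.
    for step in range(k - 1):
        for r in range(k):
            c = (r + 1 - step) % k   # completed chunk worker r forwards
            dst = (r + 1) % k
            chunks[dst][c] = chunks[r][c].copy()
    return [np.concatenate(c) for c in chunks]
```

Each worker sends and receives 2(k-1)/k of one buffer in total, independent of the number of workers, which is what makes the ring variant bandwidth-optimal for large gradient buffers.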

SLIDE 75

IPCC@SURFsara: Scaling up Deep Learning on the KNL platform

SLIDE 76

Deep Learning @ SURFsara

SLIDE 77

IPCC@SURFsara: Improving large-batch convergence

SLIDE 78

Curves shown: CIFAR100 training from a pretrained model and from scratch; ICDAR training from a pretrained model and from scratch.

Dataset | Model type | #GPUs | Accuracy [%] | Convergence time [min]
CIFAR10 | Large | 4 | 95.65 | 51
CIFAR100 | Large | 2 | 80.33 | 124
Bangla | Small | 2 | 99.15 | 29
Bangla | Large | 2 | 99.47 | 7
Bangla-aug | Large | 4 | 99.73 | 160
MNIST | Small | 2 | 99.44 | 168
ICDAR | Medium | 2 | 95.28 | 2344

Faster convergence and better accuracy when fine-tuning!

Top-5 poster GTC2016

Transfer learning / Fine-tuning experiments

SLIDE 79

Deep Learning frameworks overview

SLIDE 80

  • Convolutional Neural Networks are good at many problems in computer vision: classification, segmentation, detection, art generation, etc.
  • High-performance CNNs are expensive to train:
  • Large memory requirements
  • Large compute requirements
  • Strict synchronisation requirements (not embarrassingly parallel), needing fast, low-latency networks
  • High-performance hardware helps
  • Large-batch training is still an open research problem, but its effects are alleviated through better hyperparameters
  • Through higher-level abstractions, CNNs are applicable in many fields today (also in science), and also at scale

Conclusions

SLIDE 81

THANK YOU FOR YOUR ATTENTION

www.prace-ri.eu
