

SLIDE 1

Computation

Rachel Hu and Zhi Zhang

SLIDE 2

Outline

  • Performance
    • Hybridization
    • Asynchronous computation
    • Multi-GPU/machine training
  • Computer Vision
    • Image augmentation
    • Fine tuning
SLIDE 3

A Hybrid of Imperative and Symbolic Programming

SLIDE 4

Imperative Programming

  • The common way to program in Python, Java, C/C++, …
  • Straightforward, easy to debug
  • Requires a (Python) interpreter
  • Hard to deploy models (smart phones, browsers, embedded devices)
  • Performance problems

a = 1
b = 2
c = a + b

The interpreter compiles each statement into bytecode and executes it on a virtual machine: 3 calls in total.

SLIDE 5

Symbolic Programming

  • Define the program first; feed in data to execute it later
  • Math, SQL, …
  • Easy to optimize, less frontend overhead, portable
  • Hard to use

expr = "c = a + b"
exec = compile(expr)
exec(a=1, b=2)

Knowing the whole program makes it easy to optimize. It may be used without a Python interpreter, and it runs in a single call.
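The pseudocode above can be made concrete with Python's built-in compile and exec; a minimal runnable sketch of the symbolic style: build the whole program as a string, compile it once, then execute it in a single call.

prog = '''
a = 1
b = 2
c = a + b
print(c)
'''
compiled = compile(prog, '', 'exec')  # compile the full program to bytecode
exec(compiled)                        # one call runs everything, prints 3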

SLIDE 6

Hybridization in Gluon

  • Define a model through nn.HybridSequential or nn.HybridBlock
  • Call .hybridize() to switch from imperative execution to symbolic execution

from mxnet.gluon import nn

net = nn.HybridSequential()
net.add(nn.Dense(256, activation='relu'),
        nn.Dense(10))
net.hybridize()
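A hedged usage sketch: after hybridize(), the first forward pass traces the network into a symbolic graph, and later calls run the optimized version. The input shape below is an arbitrary assumption (Dense layers infer their input dimension on first use).

from mxnet import nd

net.initialize()
x = nd.random.normal(shape=(1, 512))
net(x)  # first call records the symbolic graph; subsequent calls reuse it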

SLIDE 7

Hybridize Notebook

SLIDE 8

Asynchronous Computing

SLIDE 9

Asynchronous Execution

a = 1
b = 2
c = a + b
print(c)

  • Execute one-by-one: the frontend thread runs each statement and waits for its result, paying system overhead per statement
  • With a backend thread: the frontend only pushes operations to the backend and waits when a value is actually needed (print(c)), so frontend and backend work is overlapped

(Diagram: the frontend thread pushes operations; the backend thread executes them; the only wait happens at print(c))
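A minimal timing sketch of this behavior in MXNet (nd.dot and wait_to_read are real NDArray APIs): the call returns as soon as the operation is queued, and the frontend only blocks when the result is needed.

import time
from mxnet import nd

a = nd.ones((1000, 1000))
start = time.time()
b = nd.dot(a, a)       # returns immediately: the op is only pushed to the backend
print('pushed after %.4f sec' % (time.time() - start))
b.wait_to_read()       # the frontend blocks here until the backend finishes
print('done after %.4f sec' % (time.time() - start))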

SLIDE 10

Automatic Parallelism

SLIDE 11

Writing Parallel Programs is Painful

data = next_batch()

# GPU 0 (examples 0:50)
data[gpu0].copyfrom(data[0:50])
fc1[gpu0] = FullcForward(data[gpu0], fc1_weight[gpu0])
fc2[gpu0] = FullcForward(fc1[gpu0], fc2_weight[gpu0])
fc2_ograd[gpu0] = LossGrad(fc2[gpu0], label[0:50])
fc1_ograd[gpu0], fc2_wgrad[gpu0] = FullcBackward(fc2_ograd[gpu0], fc2_weight[gpu0])
_, fc1_wgrad[gpu0] = FullcBackward(fc1_ograd[gpu0], fc1_weight[gpu0])

# GPU 1 (examples 51:100)
data[gpu1].copyfrom(data[51:100])
fc1[gpu1] = FullcForward(data[gpu1], fc1_weight[gpu1])
fc2[gpu1] = FullcForward(fc1[gpu1], fc2_weight[gpu1])
fc2_ograd[gpu1] = LossGrad(fc2[gpu1], label[51:100])
fc1_ograd[gpu1], fc2_wgrad[gpu1] = FullcBackward(fc2_ograd[gpu1], fc2_weight[gpu1])
_, fc1_wgrad[gpu1] = FullcBackward(fc1_ograd[gpu1], fc1_weight[gpu1])

# aggregate gradients on the CPU, update, and broadcast back
fc1_wgrad[cpu] = fc1_wgrad[gpu0] + fc1_wgrad[gpu1]
fc2_wgrad[cpu] = fc2_wgrad[gpu0] + fc2_wgrad[gpu1]
fc1_weight[cpu] -= lr * fc1_wgrad[cpu]
fc2_weight[cpu] -= lr * fc2_wgrad[cpu]
fc1_weight[cpu].copyto(fc1_weight[gpu0], fc1_weight[gpu1])
fc2_weight[cpu].copyto(fc2_weight[gpu0], fc2_weight[gpu1])

  • Single hidden-layer MLP with 2 GPUs
  • Scales to hundreds of layers and tens of GPUs
SLIDE 12

Auto Parallelization

Write serial programs:

A = nd.ones((2,2)) * 2
C = A + 2
B = A + 1
D = B * C

The backend builds the dependency graph and runs independent operations (here B and C) in parallel.
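A hedged sketch of the same idea across devices (assumes a machine with at least one GPU): the two matrix products are independent, so MXNet's backend schedules them concurrently on separate streams.

import mxnet as mx
from mxnet import nd

x_cpu = nd.random.uniform(shape=(2000, 2000), ctx=mx.cpu())
x_gpu = nd.random.uniform(shape=(2000, 2000), ctx=mx.gpu(0))
y_cpu = nd.dot(x_cpu, x_cpu)   # queued on the CPU stream
y_gpu = nd.dot(x_gpu, x_gpu)   # queued on the GPU stream, overlaps with the CPU op
nd.waitall()                   # block until both finish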

SLIDE 13

Multi-GPU Training

(Lunar new year, 2014)

SLIDE 14

Data Parallelism

  1. Read a data partition
  2. Pull the parameters
  3. Compute the gradient
  4. Push the gradient
  5. Update the parameters

(Diagram: workers hold examples; parameters live in a key-value store)
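The push/pull steps can be sketched with MXNet's key-value store (mx.kv.create, init, push, and pull are real calls); a minimal single-process example:

import mxnet as mx

kv = mx.kv.create('local')          # single-machine key-value store
kv.init(0, mx.nd.zeros((2, 3)))     # register parameter 0 on the store
kv.push(0, mx.nd.ones((2, 3)))      # step 4: push a (pretend) gradient
out = mx.nd.zeros((2, 3))
kv.pull(0, out=out)                 # step 2: pull the current value back
print(out)

In a real deployment an optimizer is registered on the store so that pushed gradients update the stored parameters (step 5).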

SLIDE 15

Distributed Training

(Alex’s frugal GPU cluster at CMU, 2015)

SLIDE 16

Distributed Computing

  • Multiple worker machines: examples are stored in a distributed filesystem and read over the network
  • Multiple server machines: hold the key-value store; workers push gradients and pull parameters over the network

SLIDE 17

GPU Machine Hierarchy

(Diagram: 4 GPUs attach to a PCIe switch, which connects to the CPU; machines connect through a network switch)

  • GPUs to PCIe switch: 63 GB/s aggregate (4x PCIe 3.0 16x)
  • PCIe switch to CPU: 15.75 GB/s (PCIe 3.0 16x)
  • Machine to network switch: 1.25 GB/s (10 Gbit Ethernet)

Hierarchical parameter server:

  • Workers: the GPUs
  • Level-1 servers: the CPUs within each machine
  • Level-2 servers: aggregate across machines over the network

SLIDE 18

Iterating a Batch

  • Each worker machine reads a part of the data batch

SLIDE 19

Iterating a Batch

  • Further split the partition and move a piece to each GPU

SLIDE 20

Iterating a Batch

  • Each server maintains a part of the parameters
  • Each worker pulls the whole set of parameters from the servers

SLIDE 21

Iterating a Batch

  • Copy the parameters to each GPU

SLIDE 22

Iterating a Batch

  • Each GPU computes gradients

SLIDE 23

Iterating a Batch

  • Sum the gradients over all GPUs (see the sketch below)
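This step is the classic allreduce pattern; a minimal sketch in the style of the d2l multi-GPU notebook, assuming data is a list of same-shaped NDArrays, one per GPU:

from mxnet import nd

def allreduce(data):
    # sum all per-GPU copies on the first device...
    for i in range(1, len(data)):
        data[0][:] += data[i].copyto(data[0].context)
    # ...then broadcast the sum back to every device
    for i in range(1, len(data)):
        data[0].copyto(data[i])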
SLIDE 24

Iterating a Batch

  • Push the gradients to the servers

SLIDE 25

Iterating a Batch

  • Each server sums the gradients from all workers, then updates its parameters

SLIDE 26

Synchronized SGD

  • Each worker runs synchronously
  • If there are n GPUs and each GPU processes b examples at a time, synchronized SGD is equivalent to mini-batch SGD on a single GPU with batch size nb
  • In the ideal case, training with n GPUs gives an n-times speedup over training with a single GPU

SLIDE 27

Performance

  • T1 = O(b): time to compute gradients for b examples on a GPU
  • T2 = O(m): time to send and receive m parameters/gradients for a worker
  • Wall-clock time per batch is max(T1, T2), since communication can overlap with computation (e.g. if T1 = 100 ms and T2 = 60 ms, a batch takes 100 ms)
  • Ideal case: T1 > T2, i.e. b is large enough that communication is hidden
  • But a too-large b needs more data epochs to reach the desired model quality

SLIDE 28

Performance Trade-off

(Figure: as the batch size per GPU grows, system performance (wall time per epoch) improves while training efficiency (#epochs to stop) degrades; the optimal batch size balances the two)

SLIDE 29

Practical Suggestions

  • A large dataset
  • Good GPU-GPU and machine-machine bandwidth
  • Efficient data loading/preprocessing
  • A model with a good computation (FLOP) vs. communication (model size) ratio
    • ResNet > AlexNet
  • A large enough batch size for good system performance
  • Tricks for efficient optimization with a large batch size
SLIDE 30

Multi-GPU Notebooks

SLIDE 31

Image Augmentation

SLIDE 32

Real Story from CES’19

  • Startup with a smart vending machine demo that identifies purchases via a camera
  • Demo at CES failed
    • Different light temperature
    • Light reflection from the table
  • The fix
    • Collect new data
    • Buy a tablecloth
    • Retrain all night
SLIDE 33

Data Augmentation

  • Use prior knowledge about invariances to augment the data
  • Add background noise to speech
  • Transform / augment images by altering colors, adding noise, cropping, and distorting

SLIDE 34

Training with Augmented Data

(Diagram: original dataset → augmentations generated on the fly → model)

SLIDE 35

Flip

(Images: vertical and horizontal flips of an example image)

SLIDE 36

Crop

  • Crop an area from the image and resize it (see the sketch below)
  • Random aspect ratio (e.g. [3:4, 4:3])
  • Random area size (e.g. [8%, 100%])
  • Random position
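These parameters map directly onto Gluon's RandomResizedCrop transform; a minimal sketch (the 200x200 output size is an arbitrary choice):

from mxnet.gluon.data.vision import transforms

aug = transforms.RandomResizedCrop(
    size=200,                      # resize the cropped region to 200x200
    scale=(0.08, 1.0),             # random area: 8% to 100% of the image
    ratio=(3.0 / 4.0, 4.0 / 3.0))  # random aspect ratio between 3:4 and 4:3
# apply as new_img = aug(img), where img is an HxWxC uint8 image NDArray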
SLIDE 37

Color

Scale hue, saturation, and brightness by a random factor (e.g. in [0.5, 1.5])

(Images: brightness and hue variations)
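A minimal Gluon sketch; RandomColorJitter with a jitter of 0.5 scales each property by a random factor in roughly [0.5, 1.5]:

from mxnet.gluon.data.vision import transforms

aug = transforms.RandomColorJitter(brightness=0.5, saturation=0.5, hue=0.5)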

SLIDE 38

Many Other Augmentations

https://github.com/aleju/imgaug

SLIDE 39

Fine Tuning

SLIDE 40

Labelling a Dataset is Expensive

(Table: existing datasets with 1.2M examples / 1,000 classes, 50K / 100, and 60K / 10, next to "my dataset")

Can we reuse this?

SLIDE 41

Network Structure

Two components in a deep network:

  • Feature extractor (layer 1 … layer L-1): maps raw pixels into linearly separable features
  • Linear classifier (the output layer, a softmax classifier): makes the decision

SLIDE 42

Fine Tuning

  • Layers 1 … L-1 trained on the source dataset are likely a good feature extractor for the target
  • Don't reuse the last layer, since the classification problem is different

(Diagram: feature layers carried over from the source dataset to the target dataset; output layer replaced)

SLIDE 43

Weight Initialization for Fine Tuning

(Diagram: copy the source model's weights to initialize the target model)
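A minimal sketch following the d2l fine-tuning notebook: copy the pre-trained feature extractor and re-initialize only the output layer (the 2-class target is an assumption for illustration):

from mxnet import init
from mxnet.gluon.model_zoo import vision

pretrained_net = vision.resnet18_v2(pretrained=True)  # source model with downloaded weights
finetune_net = vision.resnet18_v2(classes=2)          # target model (2 classes, for illustration)
finetune_net.features = pretrained_net.features       # reuse the feature extractor
finetune_net.output.initialize(init.Xavier())         # fresh, randomly initialized output layer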

SLIDE 44

Fix Lower Layers

  • Neural networks learn hierarchical feature representations
  • Low-level features are universal
  • High-level features are more related to the objects in the dataset
  • Fix the bottom layers' parameters during fine tuning (useful for regularization); see the sketch below
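One way to fix layers in Gluon is to disable gradient computation for their parameters; a sketch continuing the finetune_net example above (for simplicity it freezes the whole feature extractor, but freezing only the lower blocks works the same way):

# setattr on the collected ParameterDict turns off gradients for those layers,
# so only the new output layer is trained
finetune_net.features.collect_params().setattr('grad_req', 'null')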

SLIDE 45

Re-use Classifier Parameters

Lucky break:

  • The source dataset may contain some of the target categories
  • Use the corresponding weight vectors from the pre-trained model during initialization

SLIDE 46

Fine-tuning Training Recipe

  • Train on the target dataset as normal, but with strong regularization (see the sketch below)
    • Small learning rate
    • Fewer epochs
  • If the source dataset is more complex than the target dataset, fine-tuning usually leads to a better model (the source model is a good prior)
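A hedged sketch of the recipe's optimizer settings, continuing the finetune_net example (the exact values are illustrative, not prescriptive):

from mxnet import gluon

# small learning rate plus weight decay act as the strong regularizer
trainer = gluon.Trainer(finetune_net.collect_params(), 'sgd',
                        {'learning_rate': 0.01, 'wd': 0.001})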

SLIDE 47

Fine-tuning Notebook

SLIDE 48

Summary

  • To get good performance:
    • Optimize code through hybridization
    • Use multiple GPUs/machines
  • Augment image data with transformations
  • Train starting from pre-trained models (fine tuning)