@MagnusHyttsten - Meet Robin - PowerPoint PPT Presentation



SLIDE 1

@MagnusHyttsten

SLIDE 2

Meet Robin

SLIDE 3

Guinea Pig

Meet Robin

SLIDE 4

SLIDE 5

An Awkward Social Experiment (that I'm afraid you need to be part of...)

SLIDE 6

SLIDE 7

ROCKS!

SLIDE 8

[Diagram: Input Data → Model (Your Brain) → Output, trained on Examples (Train & Test Data)]

Input: "GTC"  →  Output: <Awkward Silence>

SLIDE 9

[Diagram: Input Data → Model (Your Brain) → Output; the Output and the Labels (Correct Answers) feed a Loss function, which drives an Optimizer that adjusts the Model. Examples (Train & Test Data) supply the inputs]

Input: "GTC"  →  Label: "Rocks"

SLIDE 10

[Diagram: the same training loop - Input Data → Model (Your Brain) → Output, with Loss function, Labels (Correct Answers), and Optimizer]

Input: "GTC"  →  Label: "Rocks"  →  Output: "Rocks"
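The loop sketched on slides 8-10 (model, loss function, labels, optimizer) can be written as a few lines of plain Python. This toy fits a single weight by gradient descent; every name and number in it is illustrative rather than from the talk.

```python
# Toy version of the training loop on these slides: a model with one tunable
# knob w, a squared-error loss against the label, and plain gradient descent
# as the optimizer. The target relationship y = 2 * x is made up.

def model(w, x):           # the "model": one knob w
    return w * x

def grad(w, x, y):         # d/dw of the loss (model(w, x) - y)**2
    return 2 * (model(w, x) - y) * x

w = 0.0                                              # untrained knob
examples = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]   # (input, label) pairs
for _ in range(100):                                 # optimizer: gradient descent
    for x, y in examples:
        w -= 0.01 * grad(w, x, y)

print(round(w, 3))  # w converges toward 2.0
```

After training, the knob w reproduces the labels: given the input, the model generates the correct output, which is exactly the loop the slides describe.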

SLIDE 11

SLIDE 12

SLIDE 13

Agenda

  • Intro to Machine Learning
  • Creating a TensorFlow Model
  • Why are GPUs Great for Machine Learning Workloads
  • Distributed TensorFlow Training

SLIDE 14

Agenda

  • Intro to Machine Learning
  • Creating a TensorFlow Model
  • Why are GPUs Great for Machine Learning Workloads
  • Distributed TensorFlow Training

SLIDE 15

Agenda

  • Intro to Machine Learning
  • Creating a TensorFlow Model
  • Why are GPUs Great for Machine Learning Workloads
  • Distributed TensorFlow Training

SLIDE 16

[Diagram: the TensorFlow stack - a Distributed Execution Engine runs on CPU, GPU, Android, iOS, ...; frontends in Python, Java, C++; layered on top: tf.keras.layers, tf.keras, Estimator, Premade Estimators, and Datasets]

SLIDE 17

TensorFlow Estimator Architecture

[Diagram: Estimator (tf.estimator) calls input_fn (Datasets, tf.data)]

SLIDE 18

Premade Estimators

[Diagram: Estimator (tf.estimator) calls input_fn (Datasets, tf.data); the Premade Estimators subclass Estimator]

  • DNNClassifier / DNNRegressor
  • LinearClassifier / LinearRegressor
  • DNNLinearCombinedClassifier / DNNLinearCombinedRegressor
  • BaselineClassifier / BaselineRegressor

SLIDE 19

Premade Estimators

# Pick a premade Estimator (each reads its input through Datasets):
# LinearRegressor, LinearClassifier, DNNRegressor, DNNClassifier,
# DNNLinearCombinedRegressor, DNNLinearCombinedClassifier,
# BaselineRegressor, BaselineClassifier
estimator = tf.estimator.LinearRegressor(...)

# Train locally
estimator.train(input_fn=..., ...)
estimator.evaluate(input_fn=..., ...)
estimator.predict(input_fn=..., ...)

SLIDE 20

Custom Models #1 - model_fn

[Diagram: Estimator (tf.estimator) calls input_fn (Datasets, tf.data); the Premade Estimators (DNNClassifier, DNNRegressor, LinearClassifier, LinearRegressor, DNNLinearCombinedClassifier, DNNLinearCombinedRegressor, BaselineClassifier, BaselineRegressor) subclass Estimator; Estimator calls your model_fn, which uses Keras Layers (tf.keras.layers)]

SLIDE 21

Custom Models #2 - Keras Model

[Diagram: as on the previous slide, plus a Keras model (tf.keras) converted into an Estimator via model_to_estimator]

SLIDE 22

Custom Models (tf.keras, tf.keras.layers)

# Imports yada yada ...
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
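The slide never states the model's input shape; assuming a hypothetical 28x28 single-channel image (MNIST-style), the shapes and parameter counts of this stack can be checked with plain arithmetic, no TensorFlow required.

```python
# Shape/parameter walkthrough for the Conv2D -> MaxPooling2D -> Flatten ->
# Dense stack on this slide, assuming a hypothetical 28x28x1 input.
h, w, c = 28, 28, 1

# Conv2D(32, kernel_size=(3, 3)): 3x3 kernel over c channels, plus a bias
# per filter; default 'valid' padding shrinks each spatial side by 2.
conv_params = (3 * 3 * c + 1) * 32    # -> 320 parameters
h, w, c = h - 2, w - 2, 32            # -> 26 x 26 x 32

# MaxPooling2D(pool_size=(2, 2)): halves height and width, no parameters.
h, w = h // 2, w // 2                 # -> 13 x 13 x 32

# Flatten: one long vector.
flat = h * w * c                      # -> 5408

# Dense(128) and Dense(10): weights + biases.
dense1_params = flat * 128 + 128      # -> 692352
dense2_params = 128 * 10 + 10         # -> 1290

print(flat, conv_params, dense1_params, dense2_params)
```

Almost all of the parameters sit in the first Dense layer, which is the kind of big matrix multiplication the GPU section of this talk is about.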

SLIDE 23

Train/Evaluate Model

# Convert a Keras model to tf.estimator.Estimator ...
estimator = tf.keras.estimator.model_to_estimator(model, ...)

# Train locally
estimator.train(input_fn=..., ...)
estimator.evaluate(input_fn=..., ...)
estimator.predict(input_fn=..., ...)

[Diagram: the Estimator reads its input through Datasets]

SLIDE 24

Summary - Use Estimators, Datasets, and Keras

  • Premade Estimators (tf.estimator): when possible
  • Custom Models:
      a. model_fn in Estimator & tf.keras.layers
      b. Keras Models (tf.keras): estimator = tf.keras.estimator.model_to_estimator(...)
  • Datasets (tf.data) for the input pipeline
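As a mental model of what an input pipeline does (this mimics the shape of a tf.data pipeline, not its actual API), shuffling and batching are just lazy transformations over a stream of examples:

```python
# Mental model of an input pipeline: lazily shuffle and batch a stream of
# examples. This imitates what tf.data does conceptually; it is NOT its API.
import random

def batch(stream, size):
    """Group a stream of examples into lists of `size` (last may be short)."""
    buf = []
    for example in stream:
        buf.append(example)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:                      # final partial batch
        yield buf

def shuffled(examples, seed=0):
    """Return the examples in a (seeded) random order."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    return iter(examples)

batches = list(batch(shuffled(range(10)), 4))
print([len(b) for b in batches])  # -> [4, 4, 2]
```

The real tf.data API chains the same kinds of steps (shuffle, batch, repeat, prefetch) and hands the result to the Estimator's input_fn.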
SLIDE 25

Agenda

  • Intro to Machine Learning
  • Creating a TensorFlow Model
  • Why are GPUs Great for Machine Learning Workloads
  • Distributed TensorFlow Training

SLIDE 26

Disclaimer...

  • High-Level - we look at only part of the power of GPUs
  • Simple Overview - more optimal designs exist
  • Reduced Scope - only considering fully-connected layers, etc.
SLIDE 27

Strengths of V100 GPU

  • Built for Massively Parallel Computations
  • Hardware & software suited to managing Deep Learning Workloads (Tensor Cores, mixed-precision execution, etc.)

SLIDE 28

Strengths of V100 GPU

  • Built for Massively Parallel Computations
  • Specific hardware / software to manage Deep Learning Workloads (Tensor Cores, mixed-precision execution, etc.)

Tesla SXM V100

  • 5376 cores (FP32)
SLIDE 29

Strengths of V100 GPU

What are we going to do with 5376 FP32 cores?

SLIDE 30

Strengths of V100 GPU

What are we going to do with 5376 FP32 cores?
"Execute things in parallel"!

SLIDE 31

Strengths of V100 GPU

What are we going to do with 5376 FP32 cores?
"Execute things in parallel"!
Yes, but how exactly can we do that for ML Workloads?

SLIDE 32

Strengths of V100 GPU

What are we going to do with 5376 FP32 cores?
"Execute things in parallel"!
Yes, but how exactly can we do that for ML Workloads?
"Hey, that's your job - that's why we're here listening!"

SLIDE 33

Strengths of V100 GPU

What are we going to do with 5376 FP32 cores?
"Execute things in parallel"!
Yes, but how exactly can we do that for ML Workloads?
"Hey, that's your job - that's why we're here listening!"

Alright, let's talk about that then.

SLIDE 34

SLIDE 35

SLIDE 36

  • We may have a huge number of layers
  • Each layer can have a huge number of neurons
  • → There may be hundreds of millions or even billions of * and + ops

All knobs are W values that we need to tune, so that given a certain input, they generate the correct output.
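To see why the op count explodes: one fully-connected layer with n inputs and m neurons costs roughly n*m multiplies and as many adds. A quick sanity check (the layer widths below are made up for illustration):

```python
# Multiply-add count for a stack of fully-connected layers.
# A layer taking n inputs to m neurons costs n * m multiplies (and about as
# many additions). Layer widths here are illustrative, not from the talk.
def mul_add_ops(widths):
    return sum(n * m for n, m in zip(widths, widths[1:]))

print(mul_add_ops([256, 256]))              # one 256 -> 256 layer: 65,536
print(mul_add_ops([1000, 4096, 4096, 10]))  # a deeper stack: 20,914,176
```

A few wide layers already reach tens of millions of multiply-adds per example; run that over millions of training examples for many epochs and you land in the billions the slide mentions.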

SLIDE 37

"Matrix Multiplication is EATING (the computing resources of) THE WORLD"

h_i,j = [X0, X1, X2, ...] · [W0, W1, W2, ...]
h_i,j = X0*W0 + X1*W1 + X2*W2 + ...

SLIDE 38

Matmul

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0,0 = X * W  # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6
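The snippet above is pseudocode; written out in plain Python, the dot product it describes evaluates to 3289.6 (the sum 1 + 2 + ... + 256 = 32896, times 0.1):

```python
# The h0,0 dot product from the slide, written out in plain Python.
X = [float(i) for i in range(1, 257)]   # 256 input values: 1.0 .. 256.0
W = [0.1] * 256                         # 256 weight values
h00 = sum(x * w for x, w in zip(X, W))  # 1*0.1 + 2*0.1 + ... + 256*0.1
print(round(h00, 1))                    # -> 3289.6
```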

SLIDE 39

Single-threaded Execution

SLIDE 40

Single-threaded Execution

[Diagram: X = [1, 2, ..., 256] and W = [0.1, 0.1, ..., 0.1] as column vectors, combined one element at a time]

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0,0 = X * W  # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

SLIDE 41

Single-threaded Execution - step 1: 1*0.1 = 0.1

SLIDE 42

Single-threaded Execution - step 2, carrying the previous sum: 0.1 + 2*0.1 = 0.3

SLIDE 43

Single-threaded Execution - the running sum continues, one multiply-and-add per step

SLIDE 44

Single-threaded Execution - final steps: ... 3238.5 + 255*0.1 = 3264.0; 3264.0 + 256*0.1 = 3289.6

SLIDE 45

Single-threaded Execution - 256 steps at one multiply-and-add per time step t: 256 * t

SLIDE 46

GPU Execution

SLIDE 47

GPU - #1 Multiplication Step

[Diagram: X = [1, 2, ..., 256] and W = [0.1, 0.1, ..., 0.1] as column vectors]

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0,0 = X * W  # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

SLIDE 48

GPU - #1 Multiplication Step

Tesla SXM V100 - 5376 cores (FP32)

SLIDE 49

GPU - #1 Multiplication Step

SLIDE 50

GPU - #1 Multiplication Step - all products computed at once:
X1_mul_vector = [1*0.1 = 0.1, 2*0.1 = 0.2, ..., 256*0.1 = 25.6]

SLIDE 51

GPU - #1 Multiplication Step - Multi-threaded Execution (256 Threads): the whole step takes t

SLIDE 52

GPU - #1 What about Summation?

SLIDE 53

GPU - #2 Summation Step

[Diagram: the products in X1_mul_vector are added pairwise (+ + +) until a single value, h0,0, remains]

SLIDE 54

GPU - #2 Summation Step - Multi-threaded Execution (256 Threads): log2 128 = 7, so the summation takes 7 * t
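The summation step on these slides is a pairwise (tree) reduction: each round adds neighbouring pairs in parallel and halves the list, so the 256 products collapse in a handful of rounds (log2(256) = 8 halvings) instead of 255 sequential additions. A sketch, assuming the input length is a power of two:

```python
# Pairwise (tree) reduction: each round sums neighbouring pairs "in
# parallel", halving the list, until one value - h0,0 - remains.
# Assumes len(values) is a power of two, as on the slides (256).
def tree_reduce(values):
    rounds = 0
    while len(values) > 1:
        # one parallel round: every pair (values[2i], values[2i+1]) is summed
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        rounds += 1
    return values[0], rounds

products = [i * 0.1 for i in range(1, 257)]  # the 256 products from step #1
h00, rounds = tree_reduce(products)
print(round(h00, 1), rounds)                 # -> 3289.6 8
```

Each round is one batch of independent additions that a GPU can issue across its threads, which is why the summation cost grows with the logarithm of the vector length rather than the length itself.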

SLIDE 55

Comparing - Order of Magnitude (sequences)

Single-Threaded Execution: 256 * t
GPU Multi-Threaded Execution: 1 * t + 7 * t = 8 * t

SLIDE 56

Many Knobs to Tune - but the type of calculation we perform is very well suited to GPUs

SLIDE 57

Summary

  • GPUs == Many Threads == Great for ML Workloads
  • And now you know how this works
  • Fortunately, you don't need to worry about implementation details
SLIDE 58

Agenda

  • Intro to Machine Learning
  • Creating a TensorFlow Model
  • Why are GPUs Great for Machine Learning Workloads
  • Distributed TensorFlow Training

SLIDE 59

Different Services

"Between-Graph & Asynchronous Training"

Services, typically distributed to different hardware nodes:

  • Workers: do ALL the processing (1 or more)
  • Parameter Servers: store all the weights (1 or more - why more?)
  • Chief: makes sure everything runs in a coordinated way (1)
SLIDE 60

Distributed Training - Between-Graph/Async

[Diagram: Chief & Worker #1 (CPU & GPU), Worker #2 (CPU & GPU), and Worker #3 (CPU & GPU) do Operation Execution; Parameter Server #1 (CPU) and Parameter Server #2 (CPU) hold Variable Storage. The PSs send variables to the Workers; the Workers send gradient updates back to the PSs. Shared Storage holds Checkpoint Storage (written & read by the PSs and the Chief) and Training Data Storage (read by the Workers)]

SLIDE 61

Distributed Training - Between-Graph/Async

[Same diagram as the previous slide]

SLIDE 62

Train/Evaluate Model

# Convert a Keras model to tf.estimator.Estimator ...
estimator = tf.keras.estimator.model_to_estimator(model, ...)

# Train locally
estimator.train(input_fn=..., ...)
estimator.evaluate(input_fn=..., ...)
estimator.predict(input_fn=..., ...)

[Diagram: the Estimator reads its input through Datasets]

SLIDE 63

Train/Evaluate Model

# Convert a Keras model to tf.estimator.Estimator ...
estimator = tf.keras.estimator.model_to_estimator(model, ...)

# Train locally & distributed
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

[Diagram: the Estimator reads its input through Datasets]

SLIDE 64

Train/Evaluate Model

# Convert a Keras model to tf.estimator.Estimator ...
estimator = tf.keras.estimator.model_to_estimator(model, ...)

# Train locally & distributed
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

train_spec / eval_spec specify: the input data, how long to train & evaluate, and the evaluation metrics

SLIDE 65

Train/Evaluate Model

# Convert a Keras model to tf.estimator.Estimator ...
estimator = tf.keras.estimator.model_to_estimator(model, ...)

# Train locally & distributed
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

train_spec / eval_spec specify: the input data, how long to train & evaluate, and the evaluation metrics

TF_CONFIG Environment Variable

SLIDE 66

Starting Parameter Server #1

TF_CONFIG='{
  "cluster": {
    "chief":  ["host1:2222"],
    "worker": ["host1:2222", "host2:2222", "host3:2222"],
    "ps":     ["host4:2222", "host5:2222"]
  },
  # To start Parameter Server #1
  "task": {"type": "ps", "index": 0}
}'

SLIDE 67

Starting Parameter Server #1

TF_CONFIG='{
  "cluster": {
    "chief":  ["host1:2222"],
    "worker": ["host1:2222", "host2:2222", "host3:2222"],
    "ps":     ["host4:2222", "host5:2222"]
  },
  # To start Parameter Server #1
  "task": {"type": "ps", "index": 0}
}'

$(host4) python <YourProgram.py>
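TF_CONFIG is ordinary JSON once the slide's "#" annotation is stripped, so a process can sanity-check its role before launch. A small sketch, using the slide's example hostnames:

```python
# TF_CONFIG is plain JSON: one cluster spec shared by every process, plus a
# per-process "task" saying which role this process plays. The hostnames are
# the examples from the slide.
import json

tf_config = json.loads('''{
  "cluster": {
    "chief":  ["host1:2222"],
    "worker": ["host1:2222", "host2:2222", "host3:2222"],
    "ps":     ["host4:2222", "host5:2222"]
  },
  "task": {"type": "ps", "index": 0}
}''')

# Look up this process's own address from its task type and index.
task = tf_config["task"]
address = tf_config["cluster"][task["type"]][task["index"]]
print(task["type"], address)  # -> ps host4:2222
```

Every node in the cluster gets the same "cluster" block and a different "task" block, which is how the same program can come up as chief, worker, or parameter server.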

SLIDE 68

Different Modes of Distribution

Work Distribution Mode

  • Between-graph Replication - individual processes: each Worker runs its own process (of the same program)

Variable Update Mode

  • Asynchronous: updates are applied across Workers in parallel
SLIDE 69

Different Modes of Distribution

Work Distribution Mode

  • Between-graph Replication - individual processes: each Worker runs its own process (of the same program)
  • In-graph Replication - single program/process: Ops or variables are distributed to different nodes (or GPU cards)

Variable Update Mode

  • Asynchronous: updates are applied across Workers in parallel
  • Synchronous: update values are coordinated across Ops
SLIDE 70

Summary

  • Workers are Stateless - easy to add new ones
  • >=1 Parameter Servers - supports large variables (e.g. embeddings)
  • Shared Storage makes the Cloud a great place to run (GCS, S3, ...)

SLIDE 71

HEY! Stop there, not so fast!

SLIDE 72

HEY! Stop there, not so fast! What about

with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
    c = tf.matmul(a, b)
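Whichever device executes it, the matmul above has a fixed answer; checking it by hand in plain Python (no TensorFlow needed):

```python
# The 2x3 @ 3x2 matmul from the slide, computed in plain Python.
a = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]            # shape [2, 3]
b = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]                 # shape [3, 2]

# c[i][j] = sum over k of a[i][k] * b[k][j]
c = [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(2)]
     for i in range(2)]
print(c)  # -> [[22.0, 28.0], [49.0, 64.0]]
```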

SLIDE 73

Getting Started With TensorFlow & GPUs == void

SLIDE 74

Getting Started With TensorFlow & GPUs == void

(at least in the future)

SLIDE 75

Kubeflow

https://goo.gl/2vfHcm

SLIDE 76

Summary

  • Use tf.estimator, tf.data, tf.keras to define & train your models
SLIDE 77

Summary

  • Use tf.estimator, tf.data, tf.keras to define & train your models
  • GPUs are great for ML Workloads
SLIDE 78

Summary

  • Use tf.estimator, tf.data, tf.keras to define & train your models
  • GPUs are great for ML Workloads
  • Estimators support Between-graph, Asynchronous Training

■ Chief ■ Parameter Server ■ Workers

SLIDE 79

Summary

  • Use tf.estimator, tf.data, tf.keras to define & train your models
  • GPUs are great for ML Workloads
  • Estimators support Between-graph, Asynchronous Training

■ Chief ■ Parameter Server ■ Workers

<<<STAY TUNED>>>

SLIDE 80

SLIDE 81

@MagnusHyttsten

Thank You