Getting Started with TensorFlow on GPUs - Magnus Hyttsten - PowerPoint PPT Presentation



SLIDE 1

Getting Started with TensorFlow on GPUs


Magnus Hyttsten

@MagnusHyttsten

SLIDE 2

Agenda

SLIDE 3

An Awkward Social Experiment

(that I'm afraid you will be part of...)

SLIDE 4

SLIDE 5

ROCKS!

SLIDE 6

[Diagram: Input Data -> Model (Your Brain) -> Output; Examples (Train & Test Data)]

"GTC"

<Awkward Silence>

SLIDE 7

[Diagram: Input Data -> Model (Your Brain) -> Output; Loss function compares Output against Labels (Correct Answers); Optimizer updates the Model; Examples (Train & Test Data)]

"GTC" "Rocks"

SLIDE 8

[Diagram: Input Data -> Model (Your Brain) -> Output; Loss function compares Output against Labels (Correct Answers); Optimizer updates the Model; Examples (Train & Test Data)]

"GTC" "Rocks" "Rocks"

SLIDE 9

"Classical" Programming Machine Learning Input Data + Code Input Data + Output Data Output Data Code

SLIDE 10

SLIDE 11

SLIDE 12

SLIDE 13

SLIDE 14

SLIDE 15

Scalable

Tested at Google-scale. Deploy everywhere

Easy

Simplified APIs. Focused on Keras and eager execution

Powerful

Flexibility and performance. Power to do cutting-edge research and scale to > 1 exaflops

TensorFlow 2.0 Alpha is out

SLIDE 16

SLIDE 17

SLIDE 18

[Diagram: tf.data (Dataset), tf.feature_column (Transfer Learning), High-level APIs, Perform Distributed Training (talk @1pm), e.g. V100]

SLIDE 19

[Diagram: tf.data (Dataset), tf.feature_column (Transfer Learning), High-level APIs, Perform Distributed Training (talk @1pm), e.g. V100]

SLIDE 20

Premade Estimators

DNNClassifier, DNNRegressor, LinearClassifier, LinearRegressor,
DNNLinearCombinedClassifier, DNNLinearCombinedRegressor,
BaselineClassifier, BaselineRegressor,
BoostedTreeClassifier, BoostedTreeRegressor

The estimator calls input_fn (Datasets, tf.data)

Built to Distribute and Scale

SLIDE 21

estimator = ...  # one of the premade estimators below

# Train locally
estimator.train(input_fn=..., ...)
estimator.evaluate(input_fn=..., ...)
estimator.predict(input_fn=..., ...)

Premade Estimators (fed by Datasets):

LinearRegressor(...), LinearClassifier(...),
DNNRegressor(...), DNNClassifier(...),
DNNLinearCombinedRegressor(...), DNNLinearCombinedClassifier(...),
BaselineRegressor(...), BaselineClassifier(...),
BoostedTreeRegressor(...), BoostedTreeClassifier(...)
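
A minimal runnable version of this pattern (my own toy example, not from the slides; the feature name 'x' and the data are placeholders):

import tensorflow as tf

# A toy input_fn built on tf.data; real code would read actual data.
def input_fn():
    ds = tf.data.Dataset.from_tensor_slices(
        ({'x': [[1.0], [2.0], [3.0], [4.0]]}, [0, 0, 1, 1]))
    return ds.shuffle(4).batch(2).repeat()

feature_columns = [tf.feature_column.numeric_column('x')]
estimator = tf.estimator.LinearClassifier(feature_columns=feature_columns)

estimator.train(input_fn=input_fn, steps=100)     # train locally
metrics = estimator.evaluate(input_fn=input_fn, steps=10)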

SLIDE 22

wide_columns = [
    tf.feature_column.bucketized_column(
        tf.feature_column.numeric_column('age'),
        boundaries=[18, 27, 40, 65])]

deep_columns = [
    tf.feature_column.numeric_column('visits'),
    tf.feature_column.numeric_column('clicks')]

tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[100, 75, 50, 25])

Premade Estimator - Wide & Deep

SLIDE 23

[Diagram: tf.data (Dataset), tf.feature_column (Transfer Learning), Perform Distributed Training, e.g. V100]

SLIDE 24

tf.keras, tf.keras.layers

Custom Models

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(dataset, epochs=5)
model.evaluate(dataset)
model.predict(dataset)

Datasets
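
The dataset passed to fit/evaluate/predict above can be any tf.data pipeline. A minimal sketch, assuming MNIST-shaped inputs (random tensors stand in for real data):

import tensorflow as tf

# Random stand-ins shaped like MNIST; a real pipeline would load real data.
images = tf.random.uniform([1000, 28, 28])
labels = tf.random.uniform([1000], maxval=10, dtype=tf.int32)

dataset = (tf.data.Dataset.from_tensor_slices((images, labels))
           .shuffle(1000)
           .batch(32))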

SLIDE 25

TensorFlow Datasets

  • audio

○ "nsynth"

  • image

○ "celeb_a" ○ "cifar10" ○ "coco2014" ○ "diabetic_retinopathy_detection" ○ "imagenet2012" ○ "mnist" ○ "open_images_v4"

  • structured

○ "titanic"

  • text

○ "imdb_reviews" ○ "lm1b" ○ "squad" import tensorflow_datasets as tfds train_ds = tfds.load("imdb_reviews", split="train", as_supervised=True)

  • translate

○ "wmt_translate_ende" ○ "wmt_translate_enfr"

  • video

○ "bair_robot_pushing_small" ○ "moving_mnist" ○ "starcrafu_video"

  • 30+ available
  • Add your own
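
The tfds.load call shown above returns an ordinary tf.data.Dataset, so it can be batched and iterated directly; a small sketch (eager execution assumed):

import tensorflow_datasets as tfds

# as_supervised=True yields (text, label) pairs instead of feature dicts.
train_ds = tfds.load("imdb_reviews", split="train", as_supervised=True)

for text, label in train_ds.batch(2).take(1):
    print(text.shape, label.numpy())  # two reviews and their 0/1 labels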
SLIDE 26
  • Datasets (tf.data) for the input pipeline

a. TensorFlow Datasets is great
b. tf.feature_columns are cool too

  • Premade Estimators
  • Keras Models (tf.keras)

TensorFlow Summary

SLIDE 27

The V-100

And why is it so good @ Machine Learning???

SLIDE 28
  • High-Level - We look at only part of the power of GPUs
  • Simple Overview - More optimal designs exist
  • Reduced Scope - Only considering fully-connected layers, etc

Disclaimer

SLIDE 29

Strengths of V100

  • Built for Massively Parallel Computations
  • Specific hardware / software to manage Deep Learning Workloads (Tensor Cores, mixed-precision execution, etc.)

SLIDE 30

Strengths of V100

  • Built for Massively Parallel Computations
  • Specific hardware / software to manage Deep Learning Workloads (Tensor Cores, mixed-precision execution, etc.; see the sketch below)

Tesla SXM V100

  • 5376 cores (FP32)
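
One way to exercise the Tensor Cores and mixed-precision execution mentioned above is Keras's mixed-precision policy. A minimal sketch (an assumption on my part, not from the slides; set_global_policy is available in TF 2.4+, earlier 2.x releases used an experimental variant):

import tensorflow as tf

# Run compute in float16 (Tensor Cores) while keeping variables in float32.
tf.keras.mixed_precision.set_global_policy('mixed_float16')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation='relu'),
    # Keep the final softmax in float32 for numerical stability.
    tf.keras.layers.Dense(10, activation='softmax', dtype='float32'),
])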
SLIDE 31

What are we going to do with 5376 FP32 cores?

My Questions Around the GPU

SLIDE 32

What are we going to do with 5376 FP32 cores? "Execute things in parallel"!

The Unsatisfactory Answer

SLIDE 33

What are we going to do with 5376 FP32 cores? "Execute things in parallel"! Yes, but how exactly can we do that for ML Workloads?

SLIDE 34

What are we going to do with 5376 FP32 cores? "Execute things in parallel"! Yes, but how exactly can we do that for ML Workloads? "Hey, that's your job - That's why we're here listening"!

SLIDE 35

What are we going to do with 5376 FP32 cores? "Execute things in parallel"! Yes, but how exactly can we do that for ML Workloads? "Hey, that's your job - That's why we're here listening"!

Alright, let me try to talk about that then

SLIDE 36

SLIDE 37

SLIDE 38
  • We may have a huge number of layers
  • Each layer can have a huge number of neurons
  • -> There may be hundreds of millions or even billions of * and + ops

All knobs are W values that we need to tune, so that given a certain input, they generate the correct output.

SLIDE 39

"Matrix Multiplication is EATING (the computing resources of) THE WORLD"

h_i,j = [X0, X1, X2, ...] · [W0, W1, W2, ...]
h_i,j = X0*W0 + X1*W1 + X2*W2 + ...

SLIDE 40

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0,0 = X * W                # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

Matmul
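
A quick NumPy check of the arithmetic above (my sketch; the slide's pseudocode elides the actual dot-product call):

import numpy as np

X = np.arange(1.0, 257.0)   # [1.0, 2.0, ..., 256.0]
W = np.full(256, 0.1)       # [0.1, 0.1, ..., 0.1]

h = np.dot(X, W)            # 1*0.1 + 2*0.1 + ... + 256*0.1
print(h)                    # ~3289.6, i.e. 0.1 * (256*257/2)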

SLIDE 41

Single-threaded Execution

SLIDE 42

[Diagram: column vector X = [1, 2, ..., 256] multiplied by column vector W = [0.1, 0.1, ..., 0.1]]

Single-threaded Execution

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0,0 = X * W                # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

SLIDE 43

[Diagram: X * W, first multiply: 1*0.1 = 0.1]

Single-threaded Execution

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0,0 = X * W                # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

SLIDE 44

[Diagram: X * W, first step 1*0.1 = 0.1, running total (Prev) = 0.1]

Single-threaded Execution

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0,0 = X * W                # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

SLIDE 45

[Diagram: X * W, running total: 0.1 + 2*0.1 = 0.3]

Single-threaded Execution

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0,0 = X * W                # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

SLIDE 46

[Diagram: X * W, running total: 0.1, 0.3, ..., 3238.5 + 255*0.1 = 3264, 3264 + 256*0.1 = 3289.6]

Single-threaded Execution

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0,0 = X * W                # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

SLIDE 47

[Diagram: X * W accumulated one step at a time, ending at 3264 + 256*0.1 = 3289.6]

Single-threaded Execution: 256 * t

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0,0 = X * W                # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6
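
In plain Python, the single-threaded picture above is just an accumulation loop: one multiply-add per step, 256 steps in sequence (hence 256 * t):

X = [float(i) for i in range(1, 257)]
W = [0.1] * 256

# One core: 256 sequential multiply-add steps.
acc = 0.0
for x, w in zip(X, W):
    acc += x * w
print(acc)  # ~3289.6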

SLIDE 48

GPU Execution

SLIDE 49

[Diagram: column vector X = [1, 2, ..., 256] multiplied by column vector W = [0.1, 0.1, ..., 0.1]]

GPU - #1 Multiplication Step

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0,0 = X * W                # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

SLIDE 50

[Diagram: column vector X = [1, 2, ..., 256] multiplied by column vector W = [0.1, 0.1, ..., 0.1]]

GPU - #1 Multiplication Step

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0,0 = X * W                # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

Tesla SXM V100

5376 cores (FP32)

SLIDE 51

[Diagram: column vector X = [1, 2, ..., 256] multiplied by column vector W = [0.1, 0.1, ..., 0.1]]

GPU - #1 Multiplication Step

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0,0 = X * W                # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

SLIDE 52

[Diagram: X1_mul_vector = [1*0.1 = 0.1, 2*0.1 = 0.2, ..., 256*0.1 = 25.6], one multiply per element]

GPU - #1 Multiplication Step

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0,0 = X * W                # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

SLIDE 53

[Diagram: X1_mul_vector = [0.1, 0.2, ..., 25.6], all 256 products computed at once]

Multi-threaded Execution (256 Threads): t

GPU - #1 Multiplication Step

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0,0 = X * W                # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

SLIDE 54

[Diagram: X1_mul_vector = [0.1, 0.2, ..., 25.6], all 256 products computed at once]

Multi-threaded Execution (256 Threads): t

GPU - #1 What about Summation?

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0,0 = X * W                # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

SLIDE 55

[Diagram: the 256 products X1_mul_vector = [0.1, 0.2, ..., 25.6] summed pairwise (+ + + ...) = h0,0]

GPU - #2 Summation Step

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0,0 = X * W                # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

SLIDE 56

[Diagram: pairwise tree summation of the 256 products in X1_mul_vector = h0,0]

Multi-threaded Execution (256 Threads): log2(128) = 7 -> 7 * t

GPU - #2 Summation Step

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0,0 = X * W                # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

SLIDE 57

Comparing - Order of Magnitude (sequences)

Single-threaded execution: 256 * t
GPU multi-threaded execution: 1 * t + 7 * t = 8 * t
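
A sketch of the two GPU phases in plain Python: phase 1 would run all 256 multiplies at once (one thread per element), phase 2 sums the products pairwise in a tree, halving the list each pass. Note that a plain pairwise reduction of 256 products takes log2(256) = 8 passes; the slide's count of 7 presumably folds the first addition into the multiply step:

X = [float(i) for i in range(1, 257)]
W = [0.1] * 256

# Phase 1: conceptually one thread per element, a single parallel step.
products = [x * w for x, w in zip(X, W)]

# Phase 2: pairwise tree reduction; each pass is one parallel step.
passes = 0
while len(products) > 1:
    products = [products[i] + products[i + 1]
                for i in range(0, len(products), 2)]
    passes += 1
print(products[0], passes)  # ~3289.6 after 8 passes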

SLIDE 58

Many Knobs to Tune - but the type of calculation we perform is very well suited for GPUs

SLIDE 59

Summary

  • GPUs == Many Threads == Great for ML Workloads
  • And now you know how this works
  • Fortunately, you don't need to worry about implementation details
SLIDE 60
SLIDE 61

multi-core CPU

SLIDE 62

multi-core CPU GPU

SLIDE 63

multi-core CPU GPU

Work needed: NONE

(just use a GPU build)
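
A quick way to confirm this in recent TF 2.x (a sketch; tf.config.list_physical_devices is assumed, TF 2.0 used tf.config.experimental.list_physical_devices):

import tensorflow as tf

# Lists the GPUs this TensorFlow build can see.
print(tf.config.list_physical_devices('GPU'))

# Supported ops land on the GPU automatically when one is present;
# explicit placement is also possible (assumes a GPU exists):
with tf.device('/GPU:0'):
    c = tf.matmul(tf.random.uniform([2, 2]), tf.random.uniform([2, 2]))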

SLIDE 64

Beyond That

Use Distribution Strategy API

There's a talk for that (@ 1pm)
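
A minimal sketch of that API, assuming MirroredStrategy for several GPUs on one machine; the same Keras calls as before run inside the strategy scope:

import tensorflow as tf

# Replicates the model across all local GPUs and aggregates gradients.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10, activation='softmax')])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy')

# model.fit(dataset, epochs=5)  # unchanged Keras training call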

SLIDE 65

You Can...

tensorflow.org/learn

TensorFlow Courses:
coursera.org/learn/introduction-tensorflow
udacity.com/tensorflow

Distribution Strategies

tensorflow.org/alpha/guide/distribute_strategy

@MagnusHyttsten