Meet Robin the Guinea Pig: An Awkward Social Experiment (@MagnusHyttsten) - PowerPoint PPT Presentation



SLIDE 1

@MagnusHyttsten

SLIDE 2

Meet Robin

SLIDE 3

Guinea Pig

Meet Robin

SLIDE 4

SLIDE 5

An Awkward Social Experiment (that I'm afraid you need to be part of...)

SLIDE 6

SLIDE 7

Super ROCKS!

SLIDE 8

Input Data → Model (Your Brain) → Output, trained on Examples (Train & Test Data)

"QCon"

<Awkward Silence>

SLIDE 9

Input Data → Model (Your Brain) → Output, trained on Examples (Train & Test Data) with Labels (Correct Answers), a Loss function, and an Optimizer

"QCon" "Super Rocks"

SLIDE 10

Input Data → Model (Your Brain) → Output, trained on Examples (Train & Test Data) with Labels (Correct Answers), a Loss function, and an Optimizer

"QCon" "Super Rocks" "Super Rocks"

SLIDE 11

SLIDE 12

SLIDE 13

Agenda

  • Intro to Machine Learning
  • Frontiers of Machine Learning
  • Creating a TensorFlow Model
  • Why are TPUs Great for Machine Learning Workloads
  • Distributed TensorFlow Training

SLIDE 14

Agenda

  • Intro to Machine Learning
  • Frontiers of Machine Learning
  • Creating a TensorFlow Model
  • Why are TPUs Great for Machine Learning Workloads
  • Distributed TensorFlow Training

SLIDE 15

Agenda

  • Intro to Machine Learning
  • Frontiers of Machine Learning
  • Creating a TensorFlow Model
  • Why are TPUs Great for Machine Learning Workloads
  • Distributed TensorFlow Training

SLIDE 16

“The network performed similarly to senior orthopedic surgeons when presented with images at the same resolution as the network.”

www.tandfonline.com/doi/full/10.1080/17453674.2017.1344459

Radiology / Ophthalmology

Algorithm: 0.95
Ophthalmologist (median): 0.91

SLIDE 17

Pathology

https://research.googleblog.com/2017/03/assisting-pathologists-in-detecting.html

SLIDE 18

ImageNet

Alaskan Malamute Siberian Husky

SLIDE 19

http://news.stanford.edu/2017/01/25/artificial-intelligence-used-identify-skin-cancer

SLIDE 20

Figure panels: Input, Saturation, Defocus

SLIDE 21

Data, Data, Data Compute, Compute, Compute

SLIDE 22

Data, Data, Data Compute, Compute, Compute Humans, Humans, Humans

SLIDE 23

Improving Inception and Image Classification in TensorFlow
research.googleblog.com/2016/08/improving-inception-and-image.html

How long did it take for a Human to construct this?

SLIDE 24


SLIDE 25

Current: Solution = ML expertise + data + computation

SLIDE 26

Current: Solution = ML expertise + data + computation Can we turn this into: Solution = data + 100X computation

SLIDE 27

Current: Solution = ML expertise + data + computation Can we turn this into: Solution = data + 100X computation

Can We Learn How To Teach Machines To Learn???

SLIDE 28

CIFAR-10

SLIDE 29

Learning Transferable Architectures for Scalable Image Recognition, Barret Zoph, Vijay Vasudevan, Jonathon Shlens and Quoc Le, https://arxiv.org/abs/1707.07012

ImageNet

SLIDE 30

Agenda

  • Intro to Machine Learning
  • Frontiers of Machine Learning
  • Creating a TensorFlow Model
  • Why are TPUs Great for Machine Learning Workloads
  • Distributed TensorFlow Training

SLIDE 31

The TensorFlow stack:
  • Premade Estimators, Estimator, tf.keras, tf.keras.layers, Datasets
  • Frontends: Python, Java, C++, ...
  • TensorFlow Distributed Execution Engine
  • CPU, GPU, Android, iOS, ...

SLIDE 32

Estimator (tf.estimator) calls input_fn (Datasets, tf.data)

TensorFlow Estimator Architecture

SLIDE 33

Estimator (tf.estimator) subclass calls input_fn (Datasets, tf.data)

Premade Estimators

DNNClassifier, DNNRegressor, LinearClassifier, LinearRegressor, DNNLinearCombinedClassifier, DNNLinearCombinedRegressor

Premade Estimators

BaselineClassifier, BaselineRegressor

SLIDE 34

estimator = LinearRegressor(...)  # any premade Estimator, see below

# Train locally
estimator.train(input_fn=..., ...)
estimator.evaluate(input_fn=..., ...)
estimator.predict(input_fn=..., ...)

Premade Estimators:
LinearRegressor(...), LinearClassifier(...), DNNRegressor(...), DNNClassifier(...), DNNLinearCombinedRegressor(...), DNNLinearCombinedClassifier(...), BaselineRegressor(...), BaselineClassifier(...)

Datasets

SLIDE 35

Estimator (tf.estimator) subclass calls input_fn (Datasets, tf.data)

Custom Models #1 - model_fn

DNNClassifier, DNNRegressor, LinearClassifier, LinearRegressor, DNNLinearCombinedClassifier, DNNLinearCombinedRegressor

Premade Estimators

BaselineClassifier, BaselineRegressor

model_fn calls Keras Layers (tf.keras.layers)

SLIDE 36

# Imports yada yada ...
def model_fn(input, ...):
    l1 = Conv2D(32, kernel_size=(3, 3), activation='relu')(input)
    l2 = MaxPooling2D(pool_size=(2, 2))(l1)
    l3 = Flatten()(l2)
    l4 = Dense(128, activation='relu')(l3)
    l5 = Dropout(0.2)(l4)
    output = Dense(10, activation='softmax')(l5)
    ...

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

tf.keras.layers tf.Estimator

Custom Models

tf.keras.layers

SLIDE 37

# Convert a Keras model to tf.estimator.Estimator ...

estimator = tf.keras.estimator.model_to_estimator(model, ...)

# Train locally

estimator.train(input_fn=..., ...)
estimator.evaluate(input_fn=..., ...)
estimator.predict(input_fn=..., ...)

Train/Evaluate Model

Estimator Datasets Datasets

SLIDE 38

Summary - Use Estimators, Datasets, and Keras

  • Premade Estimators (tf.estimator): When possible
  • Custom Models

a. model_fn in Estimator & tf.keras.layers

  • Datasets (tf.data) for the input pipeline
SLIDE 39

Agenda

  • Intro to Machine Learning
  • Frontiers of Machine Learning
  • Creating a TensorFlow Model
  • Why are TPUs Great for Machine Learning Workloads
  • Distributed TensorFlow Training

SLIDE 40

SLIDE 41

SLIDE 42
  • We may have a huge number of layers
  • Each layer can have huge number of neurons
  • -> There may be hundreds of millions or even billions of * and + ops

All knobs are W values that we need to tune, so that given a certain input they generate the correct output.

SLIDE 43

"Matrix Multiplication is EATING (the computing resources of) THE WORLD"

hi_j = [X0, X1, X2, ...] * [W0, W1, W2, ...]
hi_j = X0*W0 + X1*W1 + X2*W2 + ...
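The dot product above can be sketched in plain Python (the three values here are made up for illustration):

```python
# A neuron's pre-activation is a dot product of inputs X and weights W.
X = [1.0, 2.0, 3.0]
W = [0.5, 0.5, 0.5]

h = sum(x * w for x, w in zip(X, W))  # X0*W0 + X1*W1 + X2*W2
print(h)  # 3.0
```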

SLIDE 44

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0_0 = X * W  # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

Matmul

SLIDE 45

Single-threaded Execution

SLIDE 46

X = [1, 2, ..., 256] * W = [0.1, 0.1, ..., 0.1]

Single-threaded Execution

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0_0 = X * W  # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

SLIDE 47

X = [1, 2, ..., 256] * W = [0.1, 0.1, ..., 0.1]

Running total: 1*0.1 = 0.1

Single-threaded Execution

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0_0 = X * W  # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

SLIDE 48

X = [1, 2, ..., 256] * W = [0.1, 0.1, ..., 0.1]

Running total: 1*0.1 = 0.1; prev = 0.1

Single-threaded Execution

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0_0 = X * W  # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

SLIDE 49

X = [1, 2, ..., 256] * W = [0.1, 0.1, ..., 0.1]

Running total: 1*0.1 = 0.1; 0.1 + 2*0.1 = 0.3

Single-threaded Execution

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0_0 = X * W  # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

SLIDE 50

X = [1, 2, ..., 256] * W = [0.1, 0.1, ..., 0.1]

Running total: 1*0.1 = 0.1; 0.1 + 2*0.1 = 0.3; ...; 3238.5 + 255*0.1 = 3264.0; 3264.0 + 256*0.1 = 3289.6

Single-threaded Execution

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0_0 = X * W  # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

SLIDE 51

X = [1, 2, ..., 256] * W = [0.1, 0.1, ..., 0.1]

Running total: 1*0.1 = 0.1; 0.1 + 2*0.1 = 0.3; ...; 3238.5 + 255*0.1 = 3264.0; 3264.0 + 256*0.1 = 3289.6

Single-threaded Execution: total time = 256 * t (256 sequential multiply-add steps)

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0_0 = X * W  # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6
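The single-threaded accumulation above can be sketched in plain Python; the loop makes the "256 * t" cost explicit, one multiply-add per step:

```python
# Single-threaded dot product: 256 sequential multiply-add steps.
X = [float(i) for i in range(1, 257)]  # 256 input values: 1.0 .. 256.0
W = [0.1] * 256                        # 256 weight values

acc = 0.0
for x, w in zip(X, W):
    acc += x * w  # one multiply-add per step -> total time = 256 * t

print(round(acc, 1))  # 3289.6
```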

SLIDE 52

Tensor Processing Unit (TPU) v2

SLIDE 53

Matrix Unit (MXU) - Systolic Array

Computing y = Wx: 3x3 systolic array, W = 3x3 matrix, batch-size(x) = 3

[Diagram: weights W11..W33 held in the array; inputs X11..X33 queued to flow in; accumulation follows]

SLIDE 54

Matrix Unit (MXU) - Systolic Array

Computing y = Wx, with W = 3x3, batch-size(x) = 3

[Diagram, step 1: input X11 enters the array at weight W11]

SLIDE 55

Matrix Unit (MXU) - Systolic Array

Computing y = Wx, with W = 3x3, batch-size(x) = 3

[Diagram, step 2: X21 enters at W11, X11 advances to W21; partial sum W12X12 + W11X11 accumulates]

SLIDE 56

Matrix Unit (MXU) - Systolic Array

Computing y = Wx, with W = 3x3, batch-size(x) = 3

[Diagram, step 3: X31 enters at W11; partial sums W12X22 + W11X21, W13X13 + ..., W22X12 + W21X11 accumulate]

SLIDE 57

Matrix Unit (MXU) - Systolic Array

Computing y = Wx, with W = 3x3, batch-size(x) = 3

[Diagram, step 4: partial sums W12X32 + W11X31, W13X23 + ..., W22X22 + W21X21, W23X13 + ..., W32X12 + W31X11]

Outputs:
Y11 = W11X11 + W12X12 + W13X13

SLIDE 58

Matrix Unit (MXU) - Systolic Array

Computing y = Wx, with W = 3x3, batch-size(x) = 3

[Diagram, step 5: partial sums W13X33 + ..., W22X32 + W21X31, W23X23 + ..., W32X22 + W31X21, W33X13 + ...]

Outputs:
Y11 = W11X11 + W12X12 + W13X13
Y12 = W21X11 + W22X12 + W23X13
Y21 = W11X21 + W12X22 + W13X23

SLIDE 59

Matrix Unit (MXU) - Systolic Array

Computing y = Wx, with W = 3x3, batch-size(x) = 3

[Diagram, step 6: partial sums W23X33 + ..., W32X32 + W31X31, W33X23 + ...]

Outputs:
Y11 = W11X11 + W12X12 + W13X13
Y12 = W21X11 + W22X12 + W23X13
Y13 = W31X11 + W32X12 + W33X13
Y21 = W11X21 + W12X22 + W13X23
Y22 = W21X21 + W22X22 + W23X23
Y31 = W11X31 + W12X32 + W13X33

SLIDE 60

Matrix Unit (MXU) - Systolic Array

Computing y = Wx, with W = 3x3, batch-size(x) = 3

[Diagram, step 7: partial sum W33X33 + ...]

Outputs:
Y12 = W21X11 + W22X12 + W23X13
Y13 = W31X11 + W32X12 + W33X13
Y21 = W11X21 + W12X22 + W13X23
Y22 = W21X21 + W22X22 + W23X23
Y23 = W31X21 + W32X22 + W33X23
Y31 = W11X31 + W12X32 + W13X33
Y32 = W21X31 + W22X32 + W23X33

SLIDE 61

Matrix Unit (MXU) - Systolic Array

Computing y = Wx, with W = 3x3, batch-size(x) = 3

[Diagram: all inputs have flowed through; weights W11..W33 remain in the array]

Outputs:
Y13 = W31X11 + W32X12 + W33X13
Y22 = W21X21 + W22X22 + W23X23
Y23 = W31X21 + W32X22 + W33X23
Y31 = W11X31 + W12X32 + W13X33
Y32 = W21X31 + W22X32 + W23X33
Y33 = W31X31 + W32X32 + W33X33
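Reading off the output formulas above, each Y_ij is a dot product of one row of W with one input vector of the batch, i.e. Y[i][j] = sum_k W[j][k] * X[i][k]. A quick plain-Python check (the 3x3 numbers here are made up for the check):

```python
# Verify the systolic-array output formulas: Y[i][j] = sum_k W[j][k] * X[i][k],
# i.e. row i of Y holds the products of input vector i with every row of W.
W = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
X = [[1, 0, 0],
     [0, 1, 0],
     [1, 1, 1]]

Y = [[sum(W[j][k] * X[i][k] for k in range(3)) for j in range(3)]
     for i in range(3)]

print(Y)  # [[1, 4, 7], [2, 5, 8], [6, 15, 24]]
```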

SLIDE 62

TPU v1

  • 256 X 256 Systolic Array @ 700MHz
  • 64K MACs * 2 ops * 700 MHz ≈ 92 TOPS

Inference Only

SLIDE 63

Tensor Processing Unit (aka TPU) v2

Designed for neural net training and inference

  • 180 teraflops of computation
  • Designed to be connected together
SLIDE 64

TPU Pod - 11.5 petaflops

SLIDE 65

Many Knobs to Tune

TPUs Were Built to Perform these Operations (and No Others) Optimally

SLIDE 66

SLIDE 67

Agenda

  • Intro to Machine Learning
  • Frontiers of Machine Learning
  • Creating a TensorFlow Model
  • Why are TPUs Great for Machine Learning Workloads
  • Distributed TensorFlow Training

SLIDE 68

Different Services

"Between-Graph & Asynchronous Training"

Services, typically distributed to different hardware nodes

  • Workers: ALL the processing (1 or more)
  • Parameter Servers: Stores all weights (1 or more, why more?)
  • Chief: Makes sure everything runs in a coordinated way (exactly 1)
SLIDE 69

Distributed Training - Between-Graph/A-sync

Chief & Worker #1 (CPU & TPU): operation execution
Worker #2, Worker #3 (CPU & TPU): operation execution
Parameter Server #1, #2 (CPU): variable storage
PSs send variables to workers; workers send gradient updates to PSs
Shared Storage: checkpoint storage + training data storage

PSs & Chief write & read checkpoint data

Workers read training data

SLIDE 70

Distributed Training - Between-Graph/A-sync

Chief & Worker #1 (CPU & TPU): operation execution
Worker #2, Worker #3 (CPU & TPU): operation execution
Parameter Server #1, #2 (CPU): variable storage
PSs send variables to workers; workers send gradient updates to PSs
Shared Storage: checkpoint storage + training data storage

PSs & Chief write & read checkpoint data

Workers read training data

SLIDE 71

# Convert a Keras model to tf.estimator.Estimator ...

estimator = tf.keras.estimator.model_to_estimator(model, ...)

# Train locally

estimator.train(input_fn=..., ...)
estimator.evaluate(input_fn=..., ...)
estimator.predict(input_fn=..., ...)

Train/Evaluate Model

Estimator Datasets Datasets

SLIDE 72

# Convert a Keras model to tf.estimator.Estimator ...

estimator = tf.keras.estimator.model_to_estimator(model, ...)

# Train locally & distributed

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

Train/Evaluate Model

Estimator Datasets

SLIDE 73

# Convert a Keras model to tf.estimator.Estimator ...

estimator = tf.keras.estimator.model_to_estimator(model, ...)

# Train locally & distributed

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

Train/Evaluate Model

Estimator Datasets

  • Specifying input data
  • How long to train & evaluate
  • Evaluation metrics

SLIDE 74

# Convert a Keras model to tf.estimator.Estimator ...

estimator = tf.keras.estimator.model_to_estimator(model, ...)

# Train locally & distributed

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

Train/Evaluate Model

Estimator Datasets

TF_CONFIG

Environment Variable

  • Specifying input data
  • How long to train & evaluate
  • Evaluation metrics

SLIDE 75

TF_CONFIG='{
  "cluster": {
    "chief":  ["host1:2222"],
    "worker": ["host1:2222", "host2:2222", "host3:2222"],
    "ps":     ["host4:2222", "host5:2222"]
  },
  "task": {"type": "ps", "index": 0}  # To start Parameter Server #1
}'

Starting Parameter Server #1

Estimator

SLIDE 76

TF_CONFIG='{
  "cluster": {
    "chief":  ["host1:2222"],
    "worker": ["host1:2222", "host2:2222", "host3:2222"],
    "ps":     ["host4:2222", "host5:2222"]
  },
  "task": {"type": "ps", "index": 0}  # To start Parameter Server #1
}'

$(host4) python <YourProgram.py>

Starting Parameter Server #1

Estimator
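Each process reads TF_CONFIG from its environment to learn its own role and address; a minimal sketch of that lookup in plain Python (no TensorFlow needed, the hosts are the slide's example values):

```python
import json
import os

# Same cluster spec as on the slide; the "task" entry starts Parameter Server #1.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief":  ["host1:2222"],
        "worker": ["host1:2222", "host2:2222", "host3:2222"],
        "ps":     ["host4:2222", "host5:2222"],
    },
    "task": {"type": "ps", "index": 0},
})

conf = json.loads(os.environ["TF_CONFIG"])
task = conf["task"]
my_addr = conf["cluster"][task["type"]][task["index"]]
print(task["type"], my_addr)  # ps host4:2222
```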

SLIDE 77
  • Workers are Stateless - Easy to add new ones
  • >=1 Parameter Servers - Supports large variables (e.g. embeddings)
  • Shared Storage: Makes Cloud a great place (GCS, S3, ...)

Summary

SLIDE 78

HEY! Stop there, not so fast!

SLIDE 79

HEY! Stop there, not so fast! What about

with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
    c = tf.matmul(a, b)

SLIDE 80

Getting Started With TensorFlow & TPUs == void

SLIDE 81

Getting Started With TensorFlow & TPUs == void

(at least in the future)

SLIDE 82

Kubeflow

https://goo.gl/2vfHcm

SLIDE 83

Summary

  • Use tf.estimator, tf.data, tf.keras to define & train your models
SLIDE 84

Summary

  • Use tf.estimator, tf.data, tf.keras to define & train your models
  • TPUs are great for ML Workloads
SLIDE 85

Summary

  • Use tf.estimator, tf.data, tf.keras to define & train your models
  • TPUs are great for ML Workloads
  • Estimators support Between-graph, Asynchronous Training

■ Chief ■ Parameter Server ■ Workers

SLIDE 86

Summary

  • Use tf.estimator, tf.data, tf.keras to define & train your models
  • TPUs are great for ML Workloads
  • Estimators support Between-graph, Asynchronous Training

■ Chief ■ Parameter Server ■ Workers

<<<STAY TUNED>>>

SLIDE 87

SLIDE 88

@MagnusHyttsten

Thank You