@MagnusHyttsten
Meet Robin
Guinea Pig
An Awkward Social Experiment (that I'm afraid you need to be part of...)
Super ROCKS!
Input Data → Model (Your Brain) → Output
Examples (Train & Test Data)

Input: "QCon" → Output: <Awkward Silence>

Now add Labels (Correct Answers), a Loss Function, and an Optimizer:

Input Data → Model (Your Brain) → Output → Loss Function ← Labels (Correct Answers) → Optimizer

Input: "QCon" → Label: "Super Rocks" → Output: "Super Rocks"
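The pieces of this analogy map one-to-one onto a real training loop: the model turns input into output, the loss function scores the output against the label, and the optimizer nudges the model's knobs to shrink the loss. A minimal numeric sketch in plain Python, fitting y = 2x by gradient descent (the numbers and names here are illustrative, not from the talk):

```python
# Examples (train data) with labels (correct answers): y = 2x
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0                 # the model's single "knob" (weight)
learning_rate = 0.05    # how hard the optimizer turns the knob

for step in range(200):
    for x, label in examples:
        output = w * x                       # model: input -> output
        loss_grad = 2 * (output - label) * x # d/dw of (output - label)^2
        w -= learning_rate * loss_grad       # optimizer: reduce the loss

print(round(w, 3))      # → 2.0
```

After a few hundred passes the knob settles at w ≈ 2, i.e. the model has learned to answer "Super Rocks" when shown "QCon".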
Agenda
- Intro to Machine Learning
- Frontiers of Machine Learning
- Creating a TensorFlow Model
- Why are TPUs Great for Machine Learning Workloads
- Distributed TensorFlow Training
Frontiers of Machine Learning
“The network performed similarly to senior orthopedic surgeons when presented with images at the same resolution as the network.”
www.tandfonline.com/doi/full/10.1080/17453674.2017.1344459
Radiology

Ophthalmology
Algorithm: 0.95 vs. Ophthalmologist (median): 0.91
Pathology
https://research.googleblog.com/2017/03/assisting-pathologists-in-detecting.html
ImageNet
Alaskan Malamute Siberian Husky
http://news.stanford.edu/2017/01/25/artificial-intelligence-used-identify-skin-cancer
(Image panels: Input, Saturation, Defocus)
Data, Data, Data Compute, Compute, Compute Humans, Humans, Humans
How long did it take for a Human to construct this?
Current: Solution = ML expertise + data + computation

Can we turn this into: Solution = data + 100X computation
Can We Learn How To Teach Machines To Learn???
CIFAR-10
Learning Transferable Architectures for Scalable Image Recognition, Barret Zoph, Vijay Vasudevan, Jonathon Shlens and Quoc Le, https://arxiv.org/abs/1707.07012
ImageNet
Creating a TensorFlow Model
TensorFlow stack (top to bottom):
- Premade Estimators
- Estimator / tf.keras (tf.keras.layers) / Datasets
- Frontends: Python, Java, C++, ...
- TensorFlow Distributed Execution Engine
- Platforms: CPU, GPU, Android, iOS, ...
TensorFlow Estimator Architecture

Estimator (tf.estimator) calls input_fn (Datasets, tf.data)
Premade Estimators
- DNNClassifier, DNNRegressor
- LinearClassifier, LinearRegressor
- DNNLinearCombinedClassifier, DNNLinearCombinedRegressor
- BaselineClassifier, BaselineRegressor
estimator = ...  # one of the premade Estimators

# Train, evaluate, predict locally
estimator.train(input_fn=..., ...)
estimator.evaluate(input_fn=..., ...)
estimator.predict(input_fn=..., ...)
Premade Estimators work together with Datasets:
LinearRegressor(...), LinearClassifier(...), DNNRegressor(...), DNNClassifier(...), DNNLinearCombinedRegressor(...), DNNLinearCombinedClassifier(...), BaselineRegressor(...), BaselineClassifier(...)
Estimator (tf.estimator) calls input_fn (Datasets, tf.data)
Custom Models #1 - model_fn

A custom model_fn calls Keras Layers (tf.keras.layers)
# Imports yada yada ...
def model_fn(input, ...):
    l1 = Conv2D(32, kernel_size=(3, 3), activation='relu')(input)
    l2 = MaxPooling2D(pool_size=(2, 2))(l1)
    l3 = Flatten()(l2)
    l4 = Dense(128, activation='relu')(l3)
    l5 = Dropout(0.2)(l4)
    output = Dense(10, activation='softmax')(l5)
    ...

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
Custom Models #2 - tf.keras.layers → tf.estimator
# Convert a Keras model to tf.estimator.Estimator ...
estimator = tf.keras.estimator.model_to_estimator(model, ...)

# Train locally
estimator.train(input_fn=..., ...)
estimator.evaluate(input_fn=..., ...)
estimator.predict(input_fn=..., ...)
Train/Evaluate Model: Estimator + Datasets
Summary - Use Estimators, Datasets, and Keras
- Premade Estimators (tf.estimator): when possible
- Custom Models: model_fn in Estimator & tf.keras.layers
- Datasets (tf.data) for the input pipeline
Why are TPUs Great for Machine Learning Workloads
- We may have a huge number of layers
- Each layer can have a huge number of neurons
- → There may be hundreds of millions or even billions of * and + ops
All knobs are W values that we need to tune, so that given a certain input, they generate the correct output.
"Matrix Multiplication is EATING (the computing resources of) THE WORLD"
h_i,j = [X0, X1, X2, ...] · [W0, W1, W2, ...]
h_i,j = X0*W0 + X1*W1 + X2*W2 + ...
X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need 256 weight values
h0,0 = X · W  # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6
Matmul
Single-threaded Execution

One multiply-add at a time, accumulating into a running sum:

1*0.1 = 0.1
0.1 + 2*0.1 = 0.3
...
3238.5 + 255*0.1 = 3264.0
3264.0 + 256*0.1 = 3289.6

256 sequential multiply-adds → total time = 256 * t
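The single-threaded execution above can be sketched in plain Python; one multiply-add per loop iteration, exactly the 256 sequential steps the slides walk through:

```python
# Single-threaded dot product: one multiply-add per step.
X = [float(i) for i in range(1, 257)]   # [1.0, 2.0, ..., 256.0]
W = [0.1] * 256                          # 256 weights

acc = 0.0
for x, w in zip(X, W):                   # 256 sequential steps
    acc += x * w

print(round(acc, 1))                     # → 3289.6
```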
Tensor Processing Unit (TPU) v2
Matrix Unit (MXU) - Systolic Array

Computing y = Wx, with W a 3x3 weight matrix and batch-size(x) = 3.

The weights Wjk stay resident in the array; the inputs Xik flow in, staggered one step per row, and partial sums accumulate as they move through the cells (e.g. W11*X11, then W12*X12 + W11*X11, ...). Once the pipeline is full, results stream out every cycle:

Y11 = W11*X11 + W12*X12 + W13*X13
Y12 = W21*X11 + W22*X12 + W23*X13
Y13 = W31*X11 + W32*X12 + W33*X13
Y21 = W11*X21 + W12*X22 + W13*X23
Y22 = W21*X21 + W22*X22 + W23*X23
Y23 = W31*X21 + W32*X22 + W33*X23
Y31 = W11*X31 + W12*X32 + W13*X33
Y32 = W21*X31 + W22*X32 + W23*X33
Y33 = W31*X31 + W32*X32 + W33*X33

(Yij = output j for batch sample i.)
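The 3x3 flow above can be simulated in plain Python. This is a sketch of the accumulation schedule only (one multiply-add per cell per cycle), not of TPU hardware; the function name `systolic_matmul` and the staggering formula are illustrative assumptions:

```python
def systolic_matmul(W, X):
    """Simulate a weight-stationary systolic array computing
    Y[i][j] = sum_k W[j][k] * X[i][k], one multiply-add per cell per cycle."""
    n = len(W)
    Y = [[0.0] * n for _ in range(n)]
    # Cell (j, k) holds weight W[j][k]. With inputs staggered one step
    # per row, sample i's element X[i][k] reaches that cell at cycle
    # i + j + k, so we replay that schedule and accumulate partial sums.
    for cycle in range(3 * n - 2):
        for j in range(n):
            for k in range(n):
                i = cycle - j - k
                if 0 <= i < n:
                    Y[i][j] += W[j][k] * X[i][k]
    return Y

W = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # three batch samples
print(systolic_matmul(W, X))
# → [[1.0, 4.0, 7.0], [2.0, 5.0, 8.0], [3.0, 6.0, 9.0]]
```

With the identity batch, row i of the result is column i of W, matching the Yij formulas above. The point of the schedule is that all n*n cells work in parallel, so the whole product finishes in O(n) cycles instead of the O(n²)–O(n³) sequential multiply-adds of the single-threaded version.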
TPU v1
- 256 × 256 Systolic Array @ 700 MHz
- 64K MACs * 2 ops * 700M cycles/s ≈ 92 TOPS
- Inference Only
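The throughput arithmetic checks out: a 256×256 array holds 65,536 multiply-accumulate units, and each MAC counts as two operations (one multiply, one add) per cycle:

```python
macs_per_cycle = 256 * 256          # 65,536 multiply-accumulate units
ops_per_mac = 2                     # one multiply + one add
clock_hz = 700e6                    # 700 MHz
tops = macs_per_cycle * ops_per_mac * clock_hz / 1e12
print(round(tops, 1))               # → 91.8 (quoted as ~92 TOPS)
```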
Tensor Processing Unit (aka TPU) v2
Designed for neural net training and inference
- 180 teraflops of computation
- Designed to be connected together
TPU Pod - 11.5 petaflops
Many Knobs to Tune

TPUs Were Built to Perform these Operations (and no other) Optimally
Distributed TensorFlow Training
Different Services: "Between-Graph & Asynchronous Training"

Services, typically distributed to different hardware nodes:
- Workers: do ALL the processing (1 or more)
- Parameter Servers: store all weights (1 or more - why more?)
- Chief: makes sure everything runs in a coordinated way (exactly 1)
Distributed Training - Between-Graph/Async

- Operation execution: Chief & Worker #1, Worker #2, Worker #3 (CPU & TPU)
- Variable storage: Parameter Server #1, Parameter Server #2 (CPU)
- PSs send variables to workers; workers send gradient updates to PSs
- Shared Storage:
  - Checkpoint Storage: PSs & Chief write & read checkpoint data
  - Training Data Storage: workers read training data
# Convert a Keras model to tf.estimator.Estimator ...
estimator = tf.keras.estimator.model_to_estimator(model, ...)

# Train locally
estimator.train(input_fn=..., ...)
estimator.evaluate(input_fn=..., ...)
estimator.predict(input_fn=..., ...)

Train/Evaluate Model: Estimator + Datasets
# Convert a Keras model to tf.estimator.Estimator ...
estimator = tf.keras.estimator.model_to_estimator(model, ...)

# Train locally & distributed
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

Train/Evaluate Model: Estimator + Datasets
train_spec & eval_spec specify the input data, how long to train & evaluate, and the evaluation metrics.

The cluster layout is supplied through the TF_CONFIG environment variable.
Starting Parameter Server #1:

TF_CONFIG='{
  "cluster": {
    "chief":  ["host1:2222"],
    "worker": ["host1:2222", "host2:2222", "host3:2222"],
    "ps":     ["host4:2222", "host5:2222"]
  },
  # To start Parameter Server #1
  "task": {"type": "ps", "index": 0}
}'

$(host4) python <YourProgram.py>
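Each process reads TF_CONFIG to learn its role in the cluster. A minimal sketch of that lookup in plain Python (the cluster spec mirrors the slide; setting the variable in-process is just for illustration, and no TensorFlow is needed to show the mechanics):

```python
import json
import os

# Same cluster spec as on the slide, set here for illustration.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief":  ["host1:2222"],
        "worker": ["host1:2222", "host2:2222", "host3:2222"],
        "ps":     ["host4:2222", "host5:2222"],
    },
    "task": {"type": "ps", "index": 0},
})

# What tf.estimator.train_and_evaluate does conceptually at startup:
tf_config = json.loads(os.environ["TF_CONFIG"])
task = tf_config["task"]
my_address = tf_config["cluster"][task["type"]][task["index"]]
print(task["type"], my_address)        # → ps host4:2222
```

Every node in the cluster runs the same program with the same "cluster" block; only the "task" block differs, which is what makes workers easy to add.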
- Workers are Stateless - Easy to add new ones
- >=1 Parameter Servers - Support large variables (e.g. embeddings)
- Shared Storage: Makes Cloud a great place (GCS, S3, ...)
Summary
HEY! Stop there, not so fast! What about
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
    c = tf.matmul(a, b)
Getting Started With TensorFlow & TPUs == void (at least in the future)
Kubeflow
https://goo.gl/2vfHcm
Summary
- Use tf.estimator, tf.data, tf.keras to define & train your models
- TPUs are great for ML Workloads
- Estimators support Between-graph, Asynchronous Training
■ Chief ■ Parameter Server ■ Workers
<<<STAY TUNED>>>
@MagnusHyttsten
Thank You