@MagnusHyttsten
Meet Robin (Guinea Pig)
An Awkward Social Experiment (that I'm afraid you need to be part of...)
ROCKS!
Input Data: "GTC"  ->  Model (Your Brain)  ->  Output: <Awkward Silence>
(Examples: Train & Test Data)
Input Data: "GTC"  ->  Model (Your Brain)  ->  Output: "Rocks"
Labels (Correct Answers): "Rocks"  ->  Loss Function  ->  Optimizer  ->  Model
(Examples: Train & Test Data)
Agenda
- Intro to Machine Learning
- Creating a TensorFlow Model
- Why are GPUs Great for Machine Learning Workloads
- Distributed TensorFlow Training
TensorFlow Architecture
- Distributed Execution Engine, running on: CPU, GPU, Android, iOS, ...
- Language frontends: Python, Java, C++, ...
- Higher-level APIs: tf.keras.layers, tf.keras, Estimator, Premade Estimators, Datasets
TensorFlow Estimator Architecture

The Estimator (tf.estimator) calls an input_fn (Datasets, tf.data) to feed the model.
Premade Estimators
- DNNClassifier / DNNRegressor
- LinearClassifier / LinearRegressor
- DNNLinearCombinedClassifier / DNNLinearCombinedRegressor
- BaselineClassifier / BaselineRegressor
estimator = ...  # any of the premade Estimators below

# Train locally
estimator.train(input_fn=..., ...)
estimator.evaluate(input_fn=..., ...)
estimator.predict(input_fn=..., ...)
Premade Estimators + Datasets
- LinearRegressor(...) / LinearClassifier(...)
- DNNRegressor(...) / DNNClassifier(...)
- DNNLinearCombinedRegressor(...) / DNNLinearCombinedClassifier(...)
- BaselineRegressor(...) / BaselineClassifier(...)
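As a minimal sketch of constructing one of these (the feature column and layer sizes below are illustrative assumptions, not from the talk):

import tensorflow as tf

# Describe the input features: a single 1-dimensional numeric feature 'x'.
feature_columns = [tf.feature_column.numeric_column('x', shape=[1])]

# Any of the premade Estimators above is constructed the same way.
estimator = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[32, 16])   # two hidden layers; binary classes by default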
Custom Models #1 - model_fn

Instead of a premade Estimator, you provide your own model_fn to Estimator (tf.estimator). The model_fn builds the model with Keras Layers (tf.keras.layers), and the Estimator still calls your input_fn (Datasets, tf.data).
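A minimal model_fn sketch, assuming a feature dict with key 'x' and a 10-class problem (both illustrative):

import tensorflow as tf

def model_fn(features, labels, mode):
    # Build the network from Keras layers.
    net = tf.keras.layers.Dense(128, activation='relu')(features['x'])
    logits = tf.keras.layers.Dense(10)(net)

    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode, predictions={'logits': logits})

    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.AdamOptimizer().minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

estimator = tf.estimator.Estimator(model_fn=model_fn)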
Custom Models #2 - Keras Model

Build a Keras model (tf.keras) directly and convert it with model_to_estimator. The resulting Estimator still calls your input_fn (Datasets, tf.data).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential()
# input_shape added so the model is fully defined; 28x28x1 (e.g. MNIST-sized
# grayscale images) is an assumption for illustration.
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu',
                 input_shape=(28, 28, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
Custom Models
# Convert a Keras model to tf.estimator.Estimator
estimator = tf.keras.estimator.model_to_estimator(keras_model=model, ...)

# Train locally
estimator.train(input_fn=..., ...)
estimator.evaluate(input_fn=..., ...)
estimator.predict(input_fn=..., ...)
Train/Evaluate Model: Estimator + Datasets
Summary - Use Estimators, Datasets, and Keras
- Premade Estimators (tf.estimator): use when possible
- Custom Models:
  a. model_fn in Estimator + tf.keras.layers
  b. Keras Models (tf.keras): estimator = tf.keras.estimator.model_to_estimator(...)
- Datasets (tf.data) for the input pipeline
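For the input pipeline, a minimal input_fn sketch using tf.data (the toy in-memory data and step count are illustrative; it plugs into an Estimator like the ones sketched above):

import tensorflow as tf

def train_input_fn():
    # Toy in-memory data; real pipelines would read TFRecords, CSVs, etc.
    features = {'x': [[1.0], [2.0], [3.0], [4.0]]}
    labels = [0, 1, 0, 1]
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    return dataset.shuffle(4).repeat().batch(2)

estimator.train(input_fn=train_input_fn, steps=1000)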
Agenda
- Intro to Machine Learning
- Creating a TensorFlow Model
- Why are GPUs Great for Machine Learning Workloads
- Distributed TensorFlow Training
Disclaimer...
- High-level: we look at only part of the power of GPUs
- Simple overview: more optimal designs exist
- Reduced scope: only considering fully-connected layers, etc.
Strengths of the V100 GPU
- Built for massively parallel computations
- Specific hardware & software to manage deep learning workloads (Tensor Cores, mixed-precision execution, etc.)
Tesla SXM V100
- 5376 cores (FP32)
Strengths of the V100 GPU

What are we going to do with 5376 FP32 cores?
"Execute things in parallel!"
Yes, but how exactly can we do that for ML workloads?
"Hey, that's your job - that's why we're here listening!"
Alright, let's talk about that then
- We may have a huge number of layers
- Each layer can have a huge number of neurons
- -> There may be hundreds of millions, or even billions, of multiply and add ops

All the knobs are W (weight) values that we need to tune, so that given a certain input, they generate the correct output.
"Matrix Multiplication is EATING (the computing resources of) THE WORLD"
h_i,j = [X0, X1, X2, ...] * [W0, W1, W2, ...]
      = X0*W0 + X1*W1 + X2*W2 + ...
X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0,0 = X * W                # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6
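A quick sanity check of that arithmetic, using NumPy for illustration:

import numpy as np

X = np.arange(1.0, 257.0)   # [1.0, 2.0, ..., 256.0]
W = np.full(256, 0.1)       # [0.1, 0.1, ..., 0.1]
print(np.dot(X, W))         # -> 3289.6 (sum(1..256) == 32896, times 0.1)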
Matmul

Single-threaded Execution

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0,0 = X * W                # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

One thread walks X and W element by element, accumulating one product per step:

1*0.1            = 0.1
0.1 + 2*0.1      = 0.3
...
3238.5 + 255*0.1 = 3264.0
3264.0 + 256*0.1 = 3289.6

Total: 256 * t (one multiply-accumulate per time step t)
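A sketch of the single-threaded version in plain Python, mirroring the step-by-step trace above:

# One multiply-accumulate at a time: 256 sequential steps
X = [float(i) for i in range(1, 257)]   # [1.0, 2.0, ..., 256.0]
W = [0.1] * 256                         # [0.1, 0.1, ..., 0.1]

acc = 0.0
for x, w in zip(X, W):
    acc += x * w
print(acc)   # -> ~3289.6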
GPU Execution - #1 Multiplication Step

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0,0 = X * W                # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

With 5376 FP32 cores (Tesla SXM V100), every product gets its own thread, and all 256 multiplications execute at once:

X1_mul_vector:
1*0.1   = 0.1
2*0.1   = 0.2
...
256*0.1 = 25.6

Multi-threaded execution (256 threads), total: 1 * t
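A sketch of the multiplication step with TensorFlow; the '/gpu:0' device and the soft-placement fallback are assumptions about the machine it runs on:

import tensorflow as tf

with tf.device('/gpu:0'):            # assumes a GPU is available
    X = tf.range(1.0, 257.0)         # [1.0, 2.0, ..., 256.0]
    W = tf.fill([256], 0.1)          # [0.1, 0.1, ..., 0.1]
    products = X * W                 # 256 independent multiplies

# allow_soft_placement falls back to CPU if no GPU is present
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run(products)[:3])    # -> [0.1  0.2  0.3] (approx.)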
GPU - What about Summation?

The multiplication step leaves the 256 products in X1_mul_vector (0.1, 0.2, ..., 25.6); they still have to be added together to produce h0,0.
GPU - #2 Summation Step

X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
h0,0 = X * W                # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

Multi-threaded execution (256 threads): the products are added pairwise in parallel, halving the number of partial sums at each level until only h0,0 remains.
Tree depth: log2 128 = 7, so the summation costs 7 * t.
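A sketch of the pairwise tree reduction in plain Python; each pass of the loop corresponds to one parallel level on the GPU:

# The 256 products from step #1 (0.1, 0.2, ..., 25.6)
products = [(i + 1) * 0.1 for i in range(256)]

while len(products) > 1:
    # On the GPU, every addition in this pass runs at the same time,
    # so each pass costs one time step and halves the value count.
    products = [products[i] + products[i + 1]
                for i in range(0, len(products), 2)]

print(products[0])   # -> ~3289.6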
Comparing - Order of Magnitude (sequences)

Single-threaded execution:     256 * t
GPU multi-threaded execution:  1 * t + 7 * t = 8 * t

Many knobs to tune, but the type of calculation we perform is very well suited to GPUs.
Summary
- GPUs == Many Threads == Great for ML Workloads
- And now you know how this works
- Fortunately, you don't need to worry about implementation details
Agenda
- Intro to Machine Learning
- Creating a TensorFlow Model
- Why are GPUs Great for Machine Learning Workloads
- Distributed TensorFlow Training
Different Services

"Between-Graph & Asynchronous Training"

Services, typically distributed to different hardware nodes:
- Workers: do ALL the processing (1 or more)
- Parameter Servers: store all the weights (1 or more - why more?)
- Chief: coordinates the whole job (exactly 1)
Distributed Training - Between-Graph / Async

- Chief & Worker #1, Worker #2, Worker #3 (CPU & GPU): operation execution
- Parameter Server #1, Parameter Server #2 (CPU): variable storage
- Parameter Servers send variables to Workers; Workers send gradient updates back to Parameter Servers
- Shared storage:
  - Checkpoint storage: Parameter Servers & Chief write & read checkpoint data
  - Training data storage: Workers read training data
# Convert a Keras model to tf.estimator.Estimator
estimator = tf.keras.estimator.model_to_estimator(keras_model=model, ...)

# Train locally
estimator.train(input_fn=..., ...)
estimator.evaluate(input_fn=..., ...)
estimator.predict(input_fn=..., ...)

Train/Evaluate Model: Estimator + Datasets
# Convert a Keras model to tf.estimator.Estimator
estimator = tf.keras.estimator.model_to_estimator(keras_model=model, ...)

# Train locally & distributed
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

Train/Evaluate Model: Estimator + Datasets
train_spec and eval_spec specify the input data, how long to train & evaluate, and the evaluation metrics.
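A minimal sketch of the two specs (the input_fns, max_steps, and steps values are illustrative):

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=10000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn, steps=100)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)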
The TF_CONFIG environment variable tells each process its role in the cluster.
Starting Parameter Server #1

# To start Parameter Server #1 (run on host4):
TF_CONFIG='{
  "cluster": {
    "chief":  ["host1:2222"],
    "worker": ["host1:2222", "host2:2222", "host3:2222"],
    "ps":     ["host4:2222", "host5:2222"]
  },
  "task": {"type": "ps", "index": 0}
}'
$(host4) python <YourProgram.py>
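Every process gets the same "cluster" spec; only "task" differs. Following the same pattern (same hosts as above), Worker #2 on host2 would be started with:

# To start Worker #2 (run on host2):
TF_CONFIG='{
  "cluster": {
    "chief":  ["host1:2222"],
    "worker": ["host1:2222", "host2:2222", "host3:2222"],
    "ps":     ["host4:2222", "host5:2222"]
  },
  "task": {"type": "worker", "index": 1}
}'
$(host2) python <YourProgram.py>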
Different Modes of Distribution

Work Distribution Mode
- Between-graph replication - individual processes: each Worker runs its own process (of the same program)
- In-graph replication - single program/process: ops or variables are distributed to different nodes (or GPU cards)

Variable Update Mode
- Asynchronous: updates are applied across Workers in parallel
- Synchronous: update values are coordinated across Workers
Summary
- Workers are stateless - easy to add new ones
- >= 1 Parameter Servers - supports large variables (e.g. embeddings)
- Shared storage makes Cloud a great place to run (GCS, S3, ...)
HEY! Stop there, not so fast! What about explicit device placement:

with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
    c = tf.matmul(a, b)
Getting Started With TensorFlow & GPUs == void (at least in the future)
Kubeflow
https://goo.gl/2vfHcm
Summary
- Use tf.estimator, tf.data, tf.keras to define & train your models
- GPUs are great for ML workloads
- Estimators support between-graph, asynchronous training
  ■ Chief ■ Parameter Servers ■ Workers
<<<STAY TUNED>>>
@MagnusHyttsten
Thank You