Distributed TensorFlow
CSE545 - Spring 2020 Stony Brook University
Big Data Analytics, The Class
Goal: Generalizations. A model or summarization of the data.

Data Frameworks: Hadoop File System, MapReduce, Spark, TensorFlow
Algorithms and Analyses: Similarity Search, Recommendation Systems, Graph Analysis, Deep Learning, Streaming, Hypothesis Testing
Spark Overview

Spark is fast for being so flexible. However:
- it needs the working data to fit in memory across the cluster, and
- it is not optimized for tasks that require heavy numeric computation.

[Figure: a spectrum from IO Bound (large files: TBs or PBs) to Compute Bound (many numeric computations). MapReduce sits at the IO-bound end, Spark in the middle (1s of TBs, 100s of GBs), and TensorFlow toward the compute-bound end.*]

* this is the subjective approximation of the instructor as of February 2020. A lot of factors at play.
TensorFlow
○ Know the key components of TensorFlow.
○ Understand the key concepts of distributed TensorFlow.
TensorFlow Overview

A workflow system catered to numerical computation. One view: like Spark, but uses tensors instead of RDDs.

What is a tensor? A multi-dimensional matrix (i.stack.imgur.com). A 2-d tensor is just a matrix; a 1-d tensor is a vector; a 0-d tensor is a constant / scalar. Note the linguistic ambiguity: the dimensions of a tensor are not the same as the dimensions of a matrix.

Examples of > 2-d tensors:
- Image definitions in terms of RGB per pixel: Image[row][column][rgb]
- Subject, Verb, Object representation of language: Counts[verb][subject][object]
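As a small illustration, a minimal sketch (assuming the TensorFlow 1.x Python API used in this course) of tensors of different dimensions; the shapes are arbitrary examples:

```python
import tensorflow as tf

scalar = tf.constant(3.0)                       # 0-d tensor: a constant / scalar
vector = tf.constant([1.0, 2.0, 3.0])           # 1-d tensor: a vector
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])  # 2-d tensor: a matrix
image  = tf.zeros([480, 640, 3])                # 3-d tensor: Image[row][column][rgb]

print(scalar.shape, vector.shape, matrix.shape, image.shape)
# () (3,) (2, 2) (480, 640, 3)
```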
TensorFlow Overview

Technically, a tensor is less abstract than an RDD, which could hold tensors as well as many other data structures (dictionaries/HashMaps, trees, etc.). Then why TensorFlow?

Efficient, high-level, built-in linear algebra and machine learning optimization operations (i.e., transformations), which enable complex models like deep learning.

(Bakshi, 2016, "What is Deep Learning? Getting Started With Deep Learning")
(Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., ... & Ghemawat, S. (2016). TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.)
TensorFlow Overview

Operations on tensors are often conceptualized as graphs (Abadi et al., 2016).

A simple example: c = tensorflow.matmul(a, b) is a graph in which input nodes a and b feed the operation node c = mm(a, b).

A slightly larger example (Adventures in Machine Learning, Python TensorFlow Tutorial, 2017):
d = b + c
e = c + 2
a = d * e
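A minimal sketch of that example graph, assuming the TensorFlow 1.x graph-and-session API; nothing is computed until the graph is run inside a session:

```python
import tensorflow as tf

b = tf.constant(2.0, name="b")
c = tf.constant(3.0, name="c")

d = tf.add(b, c, name="d")       # d = b + c
e = tf.add(c, 2.0, name="e")     # e = c + 2
a = tf.multiply(d, e, name="a")  # a = d * e

with tf.Session() as sess:       # the session executes the graph on its devices
    print(sess.run(a))           # 25.0
```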
TensorFlow Overview

graph: the computation, expressed as operations over tensors.
session: defines the environment in which operations run (like a Spark context).
devices: the specific devices (CPUs or GPUs) on which to run the session.
operations: an abstract computation (e.g., matrix multiply, add), executed by device kernels.
tensors*:
○ variables - persistent, mutable tensors: tf.Variable(initial_value, name)
○ constants - constant tensors: tf.constant(value, type, name)
○ placeholders - filled from data: tf.placeholder(type, shape, name)

* technically, operations that work with tensors.
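A minimal sketch tying these components together (assuming TensorFlow 1.x; the shapes and values are illustrative):

```python
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 3], name="x")  # placeholder: filled from data
W = tf.Variable(tf.ones([3, 1]), name="W")                  # variable: persistent, mutable
b = tf.constant(0.5, name="b")                              # constant

y_hat = tf.matmul(x, W) + b                                 # operations on tensors

with tf.Session() as sess:                                  # session runs the graph
    sess.run(tf.global_variables_initializer())
    data = np.array([[1.0, 2.0, 3.0]], dtype=np.float32)
    print(sess.run(y_hat, feed_dict={x: data}))             # [[6.5]]
```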
TensorFlow Overview

Typical use-case: supervised machine learning.

Determine the weights, W, of a function, f (typically very complex!), such that |ε| is minimized:

    f(X|W) = Ŷ
    f(X|W) = Y + ε, i.e. Ŷ = Y + ε, so ε = Ŷ - Y

[Figure: a training set of N examples, each a row of m features X1, X2, ..., Xm with a target Y, i.e. (X(1), Y(1)) through (X(N), Y(N)), fed into f given weights w1, w2, ..., wp (typically, p >= m).]

W is determined through gradient descent: back-propagating error across the network that defines f, minimizing ε over the N training examples. TensorFlow has a built-in ability to derive gradients given a cost function.

(rasbt, http://rasbt.github.io/mlxtend/user_guide/general_concepts/gradient-optimization/)
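A minimal sketch of TensorFlow deriving a gradient from a cost expression (assuming TensorFlow 1.x; the one-weight cost function here is a toy example, not from the slides):

```python
import tensorflow as tf

w = tf.Variable(3.0, name="w")
cost = tf.square(w - 5.0)          # cost(w) = (w - 5)^2

grad = tf.gradients(cost, [w])[0]  # TensorFlow derives d(cost)/dw = 2 * (w - 5)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad))          # -4.0, i.e. 2 * (3 - 5)
```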
TensorFlow Overview

Linear Regression: trying to find "betas" (the weights β) that minimize the squared error over the training examples:

    cost(β) = Σᵢ (ŷ⁽ⁱ⁾ - y⁽ⁱ⁾)²

Thus, the prediction is just a matrix multiply:

    Ŷ = matmul(X, β)

In the standard linear equation y = mx + b: if we add a column of 1s to X, mx + b is just matmul(X, β), with the intercept absorbed as one more beta.

How to update β (for gradient descent)? Step against the gradient of the cost, scaled by a "learning rate" η:

    β := β - η ∂cost/∂β
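A minimal sketch of linear regression trained by gradient descent in TensorFlow (assuming TensorFlow 1.x; the toy data, learning rate, and step count are illustrative):

```python
import numpy as np
import tensorflow as tf

# Toy data generated from y = 2*x1 + 3*x2 + 1; the column of 1s absorbs the intercept.
X_data = np.random.rand(100, 2).astype(np.float32)
y_data = X_data @ np.array([[2.0], [3.0]], dtype=np.float32) + 1.0
X_data = np.hstack([X_data, np.ones((100, 1), dtype=np.float32)])

X = tf.placeholder(tf.float32, [None, 3])
y = tf.placeholder(tf.float32, [None, 1])
beta = tf.Variable(tf.zeros([3, 1]))

y_hat = tf.matmul(X, beta)                   # Ŷ = matmul(X, β)
cost = tf.reduce_mean(tf.square(y_hat - y))  # mean squared error

# The optimizer derives the gradient and applies β := β - η * ∂cost/∂β.
train = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(cost)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(2000):
        sess.run(train, feed_dict={X: X_data, y: y_data})
    print(sess.run(beta).ravel())            # approaches [2, 3, 1]
```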
TensorFlow Overview

Ridge Regression (L2-penalized linear regression): minimize the squared error plus a penalty, λ, on the size of the betas:

    cost(β) = Σᵢ (ŷ⁽ⁱ⁾ - y⁽ⁱ⁾)² + λ Σⱼ βⱼ²

1. Matrix solution:

    β = (XᵀX + λI)⁻¹ XᵀY

(Mirrors many parameter optimization problems.)

2. Gradient descent: only the gradient of the cost is needed to solve it. TensorFlow has a built-in ability to derive gradients given a cost function.
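A minimal sketch of the ridge cost in TensorFlow (assuming TensorFlow 1.x; the shapes and penalty strength lam are illustrative):

```python
import tensorflow as tf

X = tf.placeholder(tf.float32, [None, 3])
y = tf.placeholder(tf.float32, [None, 1])
beta = tf.Variable(tf.zeros([3, 1]))
lam = 0.1                                     # L2 penalty strength (lambda)

squared_error = tf.reduce_mean(tf.square(tf.matmul(X, beta) - y))
ridge_cost = squared_error + lam * tf.reduce_sum(tf.square(beta))

# No closed-form solution is needed: TensorFlow derives the gradient of the
# penalized cost and the optimizer applies the update.
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(ridge_cost)
```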
Options for Distributing ML

1. Distribute copies of the entire dataset
   a. Train over all of it with different hyperparameters per worker node.
   b. Train different folds per worker node.
   Pro: Easy; good for compute-bound work. Con: Requires the data to fit in each worker's memory.
   Done often in practice. Not talked about much because it's mostly as easy as it sounds.

2. Distribute the data ("data parallelism")
   a. Each node finds parameters for a subset of the data.
   b. Needs a mechanism for updating parameters:
      i. Centralized parameter server
      ii. Distributed all-reduce
   Pro: Flexible to all situations. Con: Optimizing for a subset is suboptimal.
   Preferred method for big data or very complex models (i.e., models with many internal parameters).

3. Distribute the model or individual operations, e.g., matrix multiply ("model parallelism")
   Pro: Parameters can be localized. Con: High communication cost for transferring intermediate data.
Data parallelism in practice (Geron, 2017):

[Figure: multiple devices on multiple machines (Machine A: CPU:0, CPU:1; Machine B: GPU:0) act as worker:0, worker:1, worker:2 and transfer tensors between them. The N training examples (X, y) are split into batches of size batch_size (batch0, batch1, batch2, ...); each worker learns parameters (i.e., weights) on its batch, given a graph with a cost function and optimizer; the parameters are then combined, the parameters of each node are updated, and the process repeats.]
Batch Gradient Descent: all examples at a time.
Stochastic Gradient Descent: one example at a time.
Mini-batch Gradient Descent: k examples at a time.
(Geron, 2017)
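A minimal sketch of a mini-batch training loop (assuming TensorFlow 1.x and the X, y, train, X_data, y_data, and sess names from the linear-regression sketch above; batch_size and the epoch count are illustrative):

```python
import numpy as np

batch_size = 32
for epoch in range(10):
    idx = np.random.permutation(len(X_data))       # reshuffle the examples each epoch
    for start in range(0, len(X_data), batch_size):
        batch = idx[start:start + batch_size]       # k examples at a time
        sess.run(train, feed_dict={X: X_data[batch], y: y_data[batch]})
```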
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., ... & Kudlur, M. (2016, November). TensorFlow: A System for Large-Scale Machine Learning. In OSDI (Vol. 16, pp. 265-283).
Distributed TensorFlow

○ Distributed: multiple devices on a single machine, or across a cluster of machines.
○ Parallelisms: model parallelism vs. data parallelism.
○ Model Updates: parameter server or all-reduce (discussed previously).
Distributed:

[Figure: multiple devices on a single machine (CPU:0, CPU:1, GPU:0), possibly shared by separate programs (Program 1, Program 2); and multiple devices on multiple machines (Machine A: CPU:0, CPU:1; Machine B: GPU:0), transferring tensors between them.]
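A minimal sketch of explicit device placement on a single machine (assuming TensorFlow 1.x; whether "/gpu:0" exists depends on the hardware, so soft placement is enabled as a fallback):

```python
import tensorflow as tf

with tf.device("/cpu:0"):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0], [1.0]])

with tf.device("/gpu:0"):                # this op's kernel runs on the GPU, if present
    c = tf.matmul(a, b)

config = tf.ConfigProto(allow_soft_placement=True,   # fall back to CPU if no GPU
                        log_device_placement=True)   # log where each op runs
with tf.Session(config=config) as sess:
    print(sess.run(c))                   # [[3.], [7.]]
```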
Parallelisms:

[Figure: model parallelism splits the graph's operations across devices (CPU:0, CPU:1, GPU:0), each computing part of the model; data parallelism runs the full graph on each device over a different subset of the data.]
Model Updates: parameter server (Geron, 2017: HOML: p.324)

[Figure: a cluster spanning Machine A (CPU:0, CPU:1, GPU:0) and Machine B (CPU:0). Each machine runs a TF Server exposing a Master and a Worker service; the servers are grouped into jobs, here a "ps" job (task 0) and a "worker" job (task 0, task 1).]

Parameter server: its job is just to maintain the values of the variables being optimized.
Workers: do all the numerical "work" and send updates to the parameter server.
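A minimal sketch of such a cluster in TensorFlow 1.x, with one "ps" task and two "worker" tasks; the hostnames and ports are illustrative placeholders:

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["machineA:2221"],                   # parameter server: holds the variables
    "worker": ["machineA:2222", "machineB:2222"],  # workers: do the numerical work
})

# Each process starts the server for its own job/task; this one is worker task 0.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Variables are pinned to the ps job; compute operations to this worker.
with tf.device("/job:ps/task:0"):
    W = tf.Variable(tf.zeros([3, 1]))
with tf.device("/job:worker/task:0"):
    x = tf.placeholder(tf.float32, [None, 3])
    y_hat = tf.matmul(x, W)
```

In practice, tf.train.replica_device_setter is often used instead of manual pinning, so that variables are spread round-robin across the ps tasks.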
Model Updates: all-reduce (Geron, 2017: HOML: p.324)

[Figure: the same cluster, but every TF Server belongs to a single "worker" job; there is no parameter server.]

Workers do the computation, send parameter updates to the other workers, and store the parameter updates they receive from other workers. Requires low-latency communication.
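For contrast, newer TensorFlow versions expose synchronous all-reduce training through tf.distribute; a hedged sketch of the single-machine, multi-GPU variant (the course material above uses the 1.x ps/worker approach instead):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()    # all-reduce across the local GPUs
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(3,))])
    model.compile(optimizer="sgd", loss="mse")  # gradients are all-reduced each step
```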
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., ... & Kudlur, M. (2016, November). TensorFlow: A System for Large-Scale Machine Learning. In OSDI (Vol. 16, pp. 265-283).
Summary
○ A graph of operations over tensors (variables, constants, or placeholders)
○ automatically finds gradients given a cost function
○ custom kernels for given devices
○ Distributed TensorFlow runs:
  ○ within a single machine (local: many devices)
  ○ across a cluster (many machines and devices)
  ○ jobs broken up as parameter servers / workers make coordination of data efficient