Distributed TensorFlow (CSE545 - Spring 2020, Stony Brook University)



slide-1
SLIDE 1

Distributed TensorFlow

CSE545 - Spring 2020 Stony Brook University

slide-2
SLIDE 2

Big Data Analytics, The Class

Goal: Generalizations. A model or summarization of the data.

Data Frameworks: Hadoop File System, MapReduce, Spark, TensorFlow
Algorithms and Analyses: Similarity Search, Recommendation Systems, Graph Analysis, Deep Learning, Streaming, Hypothesis Testing

slide-3
SLIDE 3

Spark Overview

Spark is fast for being so flexible

  • Fast: RDDs in memory + Lazy evaluation: optimized chain of operations.
  • Flexible: Many transformations -- can contain any custom code.
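
For concreteness, a minimal PySpark sketch (not from the slides; the file path and key logic are made up) of what lazy evaluation buys: transformations only describe a chain of operations, and nothing runs until an action is called.

```python
# Hypothetical sketch: transformations (filter, map, reduceByKey) are lazy;
# Spark optimizes the whole chain and only executes it at the action (take).
from pyspark import SparkContext

sc = SparkContext(appName="lazy_eval_demo")

lines = sc.textFile("hdfs:///data/events.txt")          # assumed path
errors = lines.filter(lambda l: "ERROR" in l)            # transformation: nothing runs yet
counts = (errors.map(lambda l: (l.split()[0], 1))        # still just building the plan
                .reduceByKey(lambda a, b: a + b))
print(counts.take(5))                                     # action: triggers execution
```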

Limitations of Spark

slide-4
SLIDE 4

Spark Overview

Spark is fast for being so flexible

  • Fast: RDDs in memory + Lazy evaluation: optimized chain of operations.
  • Flexible: Many transformations -- can contain any custom code.

However:

  • Hadoop MapReduce can still be better for extreme IO: data that will not fit in memory across the cluster.

Limitations of Spark

slide-5
SLIDE 5

Spark Overview

Spark is fast for being so flexible

  • Fast: RDDs in memory + Lazy evaluation: optimized chain of operations.
  • Flexible: Many transformations -- can contain any custom code.

However:

  • Hadoop MapReduce can still be better for extreme IO: data that will not fit in memory across the cluster.

Limitations of Spark

[Diagram: a spectrum from IO Bound (large files: TBs or PBs) to Compute Bound (many numeric computations); MapReduce sits toward the IO-bound end, with Spark (1s of TBs, 100s of GBs) between the two.]

slide-6
SLIDE 6

Spark Overview

Spark is fast for being so flexible

  • Fast: RDDs in memory + Lazy evaluation: optimized chain of operations.
  • Flexible: Many transformations -- can contain any custom code.

However:

  • Hadoop MapReduce can still be better for extreme IO: data that will not fit in memory across the cluster.
  • Modern machine learning (esp. deep learning), a common big data task, requires heavy numeric computation.

Limitations of Spark

[Diagram: a spectrum from IO Bound (large files: TBs or PBs) to Compute Bound (many numeric computations); MapReduce sits toward the IO-bound end, with Spark (1s of TBs, 100s of GBs) between the two.]

* this is the subjective approximation of the instructor as of February 2020. A lot of factors at play.

slide-7
SLIDE 7

Spark Overview

Spark is fast for being so flexible

  • Fast: RDDs in memory + Lazy evaluation: optimized chain of operations.
  • Flexible: Many transformations -- can contain any custom code.

However:

  • Hadoop MapReduce can still be better for extreme IO: data that will not fit in memory across the cluster.
  • Modern machine learning (esp. deep learning), a common big data task, requires heavy numeric computation.

Limitations of Spark

[Diagram: a spectrum from IO Bound (large files: TBs or PBs) to Compute Bound (many numeric computations); MapReduce sits toward the IO-bound end, Spark (1s of TBs, 100s of GBs) in between, and TensorFlow toward the compute-bound end.]

* this is the subjective approximation of the instructor as of February 2020. A lot of factors at play.

slide-8
SLIDE 8
  • Understand TensorFlow as a data workflow system.
    ○ Know the key components of TensorFlow.
    ○ Understand the key concepts of distributed TensorFlow.
  • Execute a basic distributed TensorFlow program.
  • Establish a foundation to distribute deep learning models:
    ○ Convolutional Neural Networks
    ○ Recurrent Neural Networks (or LSTM, GRU)

Spark Overview

Learning Objectives

slide-9
SLIDE 9

A workflow system catering to numerical computation. One view: Like Spark, but uses tensors instead of RDDs.

Spark Overview

What is TensorFlow?

slide-10
SLIDE 10

A workflow system catering to numerical computation. One view: Like Spark, but uses tensors instead of RDDs.

(i.stack.imgur.com)

A multi-dimensional matrix

Spark Overview

What is TensorFlow?

slide-11
SLIDE 11

A workflow system catering to numerical computation. One view: Like Spark, but uses tensors instead of RDDs.

(i.stack.imgur.com)

A 2-d tensor is just a matrix; 1-d: a vector; 0-d: a constant / scalar. Note the linguistic ambiguity: dimensions of a tensor ≠ dimensions of a matrix.

Spark Overview

What is TensorFlow?

slide-12
SLIDE 12

A workflow system catering to numerical computation. One view: Like Spark, but uses tensors instead of RDDs.

Examples with > 2 dimensions:
  • Image defined in terms of RGB per pixel: Image[row][column][rgb]
  • Subject, Verb, Object representation of language: Counts[verb][subject][object]
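
As a small illustration (not from the slides) of those shapes, in the TF 1.x style used in this course:

```python
# A 3-d tensor indexed as image[row][column][rgb], plus lower-dimensional tensors.
import numpy as np
import tensorflow as tf   # TF 1.x style, as used in this course

image = tf.constant(np.zeros((480, 640, 3), dtype=np.float32))  # 480x640 RGB image
print(image.shape)                     # (480, 640, 3): a 3-d tensor
scalar = tf.constant(3.0)              # 0-d tensor, shape ()
vector = tf.constant([1.0, 2.0])       # 1-d tensor, shape (2,)
matrix = tf.constant([[1.0], [2.0]])   # 2-d tensor, shape (2, 1)
```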

Spark Overview

What is TensorFlow?

slide-13
SLIDE 13

A workflow system catering to numerical computation. One view: Like Spark, but uses tensors instead of RDDs.

Technically, tensors are less abstract than RDDs, which can hold tensors as well as many other data structures (dictionaries/HashMaps, trees, etc.). Then why TensorFlow?

Spark Overview

What is TensorFlow?

slide-14
SLIDE 14

Technically, tensors are less abstract than RDDs, which can hold tensors as well as many other data structures (dictionaries/HashMaps, trees, etc.). Then why TensorFlow?

Efficient, high-level built-in linear algebra and machine learning optimization operations (i.e. transformations) enable complex models, like deep learning.

Spark Overview

What is TensorFlow?

slide-15
SLIDE 15

Efficient, high-level built-in linear algebra and machine learning optimization operations enable complex models, like deep learning.

(Bakshi, 2016, “What is Deep Learning? Getting Started With Deep Learning”)

Spark Overview

What is TensorFlow?

slide-16
SLIDE 16

Efficient, high-level built-in linear algebra and machine learning operations.

(Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., ... & Ghemawat, S. (2016). Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.)

Spark Overview

What is TensorFlow?

slide-17
SLIDE 17

Operations on tensors are often conceptualized as graphs:

(Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., ... & Ghemawat, S. (2016). Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.)

Spark Overview

TensorFlow

slide-18
SLIDE 18

Operations on tensors are often conceptualized as graphs:

A simpler example: c = tensorflow.matmul(a, b)

[Graph: nodes a and b feed into c = matmul(a, b)]
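
A minimal runnable version of that example, in TF 1.x graph style (the values of a and b are assumed; TF 2.x would use tf.compat.v1 or eager execution):

```python
import tensorflow as tf

a = tf.constant([[1.0, 2.0], [3.0, 4.0]], name="a")
b = tf.constant([[5.0], [6.0]], name="b")
c = tf.matmul(a, b, name="c")     # adds a node to the graph; nothing is computed yet

with tf.Session() as sess:        # execution happens only when the graph is run
    print(sess.run(c))            # [[17.], [39.]]
```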

Spark Overview

TensorFlow

slide-19
SLIDE 19

Operations on tensors are often conceptualized as graphs:

(Adventures in Machine Learning. Python TensorFlow Tutorial, 2017)

example: d = b + c;   e = c + 2;   a = d ∗ e
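
A minimal TF 1.x sketch of that graph; the values of b and c (2 and 3) are assumed just to make it runnable:

```python
import tensorflow as tf

b = tf.constant(2.0, name="b")
c = tf.constant(3.0, name="c")
d = tf.add(b, c, name="d")         # d = b + c
e = tf.add(c, 2.0, name="e")       # e = c + 2
a = tf.multiply(d, e, name="a")    # a = d * e

with tf.Session() as sess:
    print(sess.run(a))             # 25.0  (d = 5, e = 5)
```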

Spark Overview

TensorFlow

slide-20
SLIDE 20

session: defines the environment in which operations run (like a Spark context)
devices: the specific devices (CPUs or GPUs) on which to run the session
tensors*: variables (persistent, mutable tensors); constants; placeholders (filled from data)
operations: an abstract computation (e.g. matrix multiply, add) executed by device kernels
graph

* technically, still operations

Spark Overview

Ingredients of a TensorFlow

slide-21
SLIDE 21

session: defines the environment in which operations run (like a Spark context)
devices: the specific devices (CPUs or GPUs) on which to run the session
tensors*: variables (persistent, mutable tensors); constants; placeholders (filled from data)
  ○ tf.Variable(initial_value, name)
  ○ tf.constant(value, type, name)
  ○ tf.placeholder(type, shape, name)
operations: an abstract computation (e.g. matrix multiply, add) executed by device kernels
graph

* technically, operations that work with tensors.
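
A minimal sketch (assumed shapes and values) tying these ingredients together in TF 1.x:

```python
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=(None, 3), name="x")   # placeholder: fed from data at run time
W = tf.Variable(tf.zeros((3, 1)), name="W")                  # variable: persistent, mutable tensor
b = tf.constant(1.0, name="b")                                # constant
y_hat = tf.matmul(x, W) + b                                   # operations: build the graph

with tf.Session() as sess:                                    # session: environment for execution
    sess.run(tf.global_variables_initializer())               # gives W its initial value
    print(sess.run(y_hat, feed_dict={x: np.ones((2, 3))}))    # run() executes on a device
```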

Spark Overview

Ingredients of a TensorFlow


slide-22
SLIDE 22

Operations

operations: an abstract computation (e.g. matrix multiply, add) executed by device kernels
tensors*: variables (persistent, mutable tensors); constants; placeholders (filled from data)

Spark Overview

Ingredients of a TensorFlow

slide-23
SLIDE 23

session: defines the environment in which operations run (like a Spark context)
  • Places operations on devices
  • Stores the values of variables (when not distributed)
  • Carries out execution: eval() or run()
devices: the specific devices (CPUs or GPUs) on which to run the session
tensors*: variables (persistent, mutable tensors); constants; placeholders (filled from data)
operations: an abstract computation (e.g. matrix multiply, add) executed by device kernels
graph

Spark Overview

Ingredients of a TensorFlow

slide-24
SLIDE 24

session: defines the environment in which operations run (like a Spark context)
devices: the specific devices (CPUs or GPUs) on which to run the session
tensors*: variables (persistent, mutable tensors); constants; placeholders (filled from data)
operations: an abstract computation (e.g. matrix multiply, add) executed by device kernels
graph

* technically, operations that work with tensors.

Spark Overview

Ingredients of a TensorFlow

slide-25
SLIDE 25

Typical use-case: (Supervised Machine Learning)

Determine weights, W, of a function, f , such that ε is minimized: f(X|W) = Y + ε

Spark Overview

Distributed TensorFlow

slide-26
SLIDE 26

Typical use-case:

Determine weights, W, of a function, f , such that ε is minimized: f(X|W) = Y + ε

Spark Overview

Distributed TensorFlow

[Figure: example data with features X1, X2, X3 and outcome Y]

slide-27
SLIDE 27

Typical use-case:

Determine weights, W, of a function, f , such that ε is minimized: f(X|W) = Y + ε

Spark Overview

Distributed TensorFlow

[Figure: training data with features X1 ... Xm and outcome Y]

slide-28
SLIDE 28

Typical use-case:

Determine weights, W, of a function, f , such that ε is minimized: f(X|W) = Y + ε

Spark Overview

Distributed TensorFlow

[Figure: training data with features X1 ... Xm and outcome Y]

f given w1, w2, ..., wp (typically, p >= m)

slide-29
SLIDE 29

Typical use-case:

Determine weights, W, of a function, f , such that |ε| is minimized: f(X|W) = Y + ε

Spark Overview

Distributed TensorFlow

[Figure: training data with features X1 ... Xm and outcome Y]

f given w1, w2, ..., wp (typically, p >= m)

f(X|W) = Ŷ;   Y = f(X|W) + ε;   Y = Ŷ + ε;   ε = Ŷ - Y

slide-30
SLIDE 30

Typical use-case:

Determine weights, W, of a function, f , such that ε is minimized: f(X|W) = Y + ε

Spark Overview

Distributed TensorFlow

[Figure: training data with features X1 ... Xm and outcome Y]

f given w1, w2, ..., wp (typically, p >= m). Typically, f is very complex!

f(X|W) = Ŷ;   Y = f(X|W) + ε;   Y = Ŷ + ε;   ε = Ŷ - Y

slide-31
SLIDE 31

[Figure: training example (1) with features X1 ... Xm and outcome Y(1)]

Typical use-case:

Determine weights, W, of a function, f , such that ε is minimized: f(X|W) = Y + ε

W determined through gradient descent: back propagating error across the network that defines f.

Spark Overview

Distributed TensorFlow

f given w1, w2, ..., wp (typically, p >= m)

f(X|W) = Ŷ;   Y = f(X|W) + ε;   Y = Ŷ + ε;   ε = Ŷ - Y

slide-32
SLIDE 32

[Figure: training example (1) with features X1 ... Xm and outcome Y(1)]

Typical use-case:

Determine weights, W, of a function, f , such that ε is minimized: f(X|W) = Y + ε

W determined through gradient descent: back propagating error across the network that defines f.

Spark Overview

Distributed TensorFlow

f given w1, w2, ..., wp (typically, p >= m)

[Figure: N training examples (1) ... (N), each with features X1 ... Xm and outcomes Y(1) ... Y(N)]

minimizes ε on N training examples

f(X|W) = Ŷ;   Y = f(X|W) + ε;   Y = Ŷ + ε;   ε = Ŷ - Y

slide-33
SLIDE 33

TensorFlow has built-in ability to derive gradients given a cost function.
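
A minimal TF 1.x sketch of that ability (the cost function here is an assumed toy example):

```python
import tensorflow as tf

w = tf.Variable(3.0, name="w")
cost = tf.square(w - 5.0)              # cost = (w - 5)^2
grad = tf.gradients(cost, [w])[0]      # TensorFlow derives d(cost)/dw = 2(w - 5)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad))              # -4.0 at w = 3
```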

(rasbt, http://rasbt.github.io/mlxtend/user_guide/general_concepts/gradient-optimization/)

Spark Overview

Weights Derived from Gradients

slide-34
SLIDE 34

Linear Regression: Trying to find “betas” that minimize:

Spark Overview

Weights Derived from Gradients

slide-35
SLIDE 35

Linear Regression: Trying to find “betas” that minimize: Thus:

matrix multiply

Spark Overview

Weights Derived from Gradients

slide-36
SLIDE 36

Linear Regression: Trying to find “betas” that minimize: Thus:

In the standard linear equation y = mx + b: if we append a column of 1s to x (and fold b into the weights), then mx + b becomes a single matrix multiply.

Spark Overview

Weights Derived from Gradients

slide-37
SLIDE 37

Linear Regression: Trying to find “betas” that minimize: Thus:

matrix multiply

Spark Overview

Weights Derived from Gradients

slide-38
SLIDE 38

Linear Regression: Trying to find “betas” that minimize: Thus:

How to update?

Spark Overview

Weights Derived from Gradients

slide-39
SLIDE 39

Linear Regression: Trying to find “betas” that minimize: Thus:

How to update (for gradient descent)? Step the weights against the gradient, scaled by a "learning rate".
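
The equation images on these slides did not survive extraction; a standard statement of the quantities they refer to (the ordinary least squares cost, its gradient, and the gradient-descent update with learning rate η), in the slides' X, y, β notation, is:

```latex
% Assumed reconstruction of the missing equations (standard OLS gradient descent).
\mathrm{cost}(\beta) = \lVert y - X\beta \rVert^{2} = (y - X\beta)^{\top}(y - X\beta)
\qquad
\nabla_{\beta}\,\mathrm{cost} = -2\,X^{\top}(y - X\beta)
\qquad
\beta \leftarrow \beta - \eta\,\nabla_{\beta}\,\mathrm{cost}
```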

Spark Overview

Weights Derived from Gradients
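
A minimal sketch (synthetic data, assumed names) of this update loop in TF 1.x, where TensorFlow derives the gradient of the cost and the optimizer applies the learning-rate step:

```python
import numpy as np
import tensorflow as tf

X_data = np.random.rand(100, 3).astype(np.float32)
y_data = X_data @ np.array([[1.0], [2.0], [3.0]], dtype=np.float32)   # true betas: 1, 2, 3

X = tf.placeholder(tf.float32, shape=(None, 3))
y = tf.placeholder(tf.float32, shape=(None, 1))
beta = tf.Variable(tf.zeros((3, 1)))

cost = tf.reduce_mean(tf.square(tf.matmul(X, beta) - y))        # mean squared error
train = tf.train.GradientDescentOptimizer(0.1).minimize(cost)    # beta <- beta - 0.1 * gradient

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(500):
        sess.run(train, feed_dict={X: X_data, y: y_data})
    print(sess.run(beta).ravel())    # approaches [1, 2, 3]
```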

slide-40
SLIDE 40

Ridge Regression

(L2-penalized linear regression)

1. Matrix Solution:

Spark Overview

Weights Derived from Gradients

slide-41
SLIDE 41

Ridge Regression

(L2-penalized linear regression)

1. Matrix Solution:

2. Gradient descent solution

(Mirrors many parameter optimization problems.)

Spark Overview

Weights Derived from Gradients

slide-42
SLIDE 42

Ridge Regression

(L2-penalized linear regression)

Gradient descent needs the gradient of this cost in order to solve it. (Mirrors many parameter optimization problems.) TensorFlow has a built-in ability to derive gradients given a cost function.
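
A hedged sketch of how that looks for ridge regression in TF 1.x: write the L2-penalized cost as graph operations and let TensorFlow derive the gradient (lam is an assumed name for the penalty strength; the slide's symbol did not survive extraction):

```python
import tensorflow as tf

X = tf.placeholder(tf.float32, shape=(None, 3))
y = tf.placeholder(tf.float32, shape=(None, 1))
beta = tf.Variable(tf.zeros((3, 1)))
lam = 0.1                                                          # assumed penalty strength

ridge_cost = (tf.reduce_sum(tf.square(y - tf.matmul(X, beta)))     # ||y - X beta||^2
              + lam * tf.reduce_sum(tf.square(beta)))               # + lam * ||beta||^2
train = tf.train.GradientDescentOptimizer(0.01).minimize(ridge_cost)
# Training then proceeds exactly like the plain linear-regression loop above.
```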

Spark Overview

Weights Derived from Gradients

slide-43
SLIDE 43

Ridge Regression

(L2-penalized linear regression)

Gradient descent needs the gradient of this cost in order to solve it. (Mirrors many parameter optimization problems.) TensorFlow has a built-in ability to derive gradients given a cost function.

Spark Overview

Weights Derived from Gradients

slide-44
SLIDE 44

TensorFlow has built-in ability to derive gradients given a cost function.

Spark Overview

Weights Derived from Gradients

slide-45
SLIDE 45

Options for Distributing ML

1. Distribute copies of the entire dataset
   a. Train over all with different hyperparameters
   b. Train different folds per worker node.

Spark Overview

Options for distribution

slide-46
SLIDE 46

1. Distribute copies of the entire dataset
   a. Train over all with different parameters
   b. Train different folds per worker node.
   Pro: Easy; good for compute-bound. Con: Requires data to fit in worker memories.
2. Distribute data
   a. Each node finds parameters for a subset of the data
   b. Needs a mechanism for updating parameters
      i. Centralized parameter server
      ii. Distributed All-Reduce
   Pro: Flexible to all situations. Con: Optimizing for a subset is suboptimal.

Options for Distributing ML Spark Overview

Options for distribution

slide-47
SLIDE 47

1. Distribute copies of the entire dataset
   a. Train over all with different parameters
   b. Train different folds per worker node.
   Pro: Easy; good for compute-bound. Con: Requires data to fit in worker memories.
2. Distribute data
   a. Each node finds parameters for a subset of the data
   b. Needs a mechanism for updating parameters
      i. Centralized parameter server
      ii. Distributed All-Reduce
   Pro: Flexible to all situations. Con: Optimizing for a subset is suboptimal.
3. Distribute model or individual operations (e.g. matrix multiply)
   Pro: Parameters can be localized. Con: High communication for transferring intermediate data.

Options for Distributing ML Spark Overview

Options for distribution

slide-48
SLIDE 48

1. Distribute copies of the entire dataset
   a. Train over all with different parameters
   b. Train different folds per worker node.
   Pro: Easy; good for compute-bound. Con: Requires data to fit in worker memories.
2. Distribute data
   a. Each node finds parameters for a subset of the data
   b. Needs a mechanism for updating parameters
      i. Centralized parameter server
      ii. Distributed All-Reduce
   Pro: Flexible to all situations. Con: Optimizing for a subset is suboptimal.
3. Distribute model or individual operations (e.g. matrix multiply)
   Pro: Parameters can be localized. Con: High communication for transferring intermediate data.

Options for Distributing ML Spark Overview

Options for distribution

slide-49
SLIDE 49

1. Distribute copies of the entire dataset
   a. Train over all with different parameters
   b. Train different folds per worker node.
   Pro: Easy; good for compute-bound. Con: Requires data to fit in worker memories.
2. Distribute data
   a. Each node finds parameters for a subset of the data
   b. Needs a mechanism for updating parameters
      i. Centralized parameter server
      ii. Distributed All-Reduce
   Pro: Flexible to all situations. Con: Optimizing for a subset is suboptimal.
3. Distribute model or individual operations (e.g. matrix multiply)
   Pro: Parameters can be localized. Con: High communication for transferring intermediate data.

Options for Distributing ML Spark Overview

Options for distribution

Done often in practice. Not talked about much because it’s mostly as easy as it sounds.

slide-50
SLIDE 50

Options for Distributing ML Spark Overview

Options for distribution

1. Distribute copies of the entire dataset
   a. Train over all with different parameters
   b. Train different folds per worker node.
   Pro: Easy; good for compute-bound. Con: Requires data to fit in worker memories.
2. Distribute data
   a. Each node finds parameters for a subset of the data
   b. Needs a mechanism for updating parameters
      i. Centralized parameter server
      ii. Distributed All-Reduce
   Pro: Flexible to all situations. Con: Optimizing for a subset is suboptimal.
3. Distribute model or individual operations (e.g. matrix multiply)
   Pro: Parameters can be localized. Con: High communication for transferring intermediate data.

Done often in practice. Not talked about much because it's mostly as easy as it sounds.
Preferred method for big data or very complex models (i.e. models with many internal parameters).

slide-51
SLIDE 51

Options for Distributing ML Spark Overview

Options for distribution

1. Distribute copies of the entire dataset
   a. Train over all with different parameters
   b. Train different folds per worker node.
   Pro: Easy; good for compute-bound. Con: Requires data to fit in worker memories.
2. Distribute data
   a. Each node finds parameters for a subset of the data
   b. Needs a mechanism for updating parameters
      i. Centralized parameter server
      ii. Distributed All-Reduce
   Pro: Flexible to all situations. Con: Optimizing for a subset is suboptimal.
3. Distribute model or individual operations (e.g. matrix multiply)
   Pro: Parameters can be localized. Con: High communication for transferring intermediate data.

Done often in practice. Not talked about much because it's mostly as easy as it sounds.
Preferred method for big data or very complex models (i.e. models with many internal parameters).

Data Parallelism (option 2)   Model Parallelism (option 3)

slide-52
SLIDE 52

Model Parallelism

Multiple devices on multiple machines

[Diagram: Machine A (CPU:0, CPU:1) and Machine B (GPU:0); tensors are transferred between devices]
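
A minimal TF 1.x sketch of model/graph parallelism: pin different parts of the graph to different devices and let TensorFlow transfer the tensors between them (the device strings are local examples; in a cluster they become job/task device names like "/job:worker/task:1/gpu:0"):

```python
import tensorflow as tf

with tf.device("/cpu:0"):                        # e.g. Machine A, CPU:0
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.matmul(a, a)

with tf.device("/gpu:0"):                        # e.g. Machine B, GPU:0
    c = b + 1.0                                   # b's tensor is transferred between devices

# allow_soft_placement falls back to an available device if a GPU is missing.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run(c))
```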

slide-53
SLIDE 53

Data Parallelism

[Diagram: data parallelism across devices CPU:0, CPU:1, GPU:0]

slide-54
SLIDE 54

[Diagram: data parallelism across worker:0, worker:1, worker:2]

Data Parallelism

slide-55
SLIDE 55

[Figure: feature matrix X and outcomes y, with N rows]

Distributing Data

slide-56
SLIDE 56

[Figure: X and y split into mini-batches of batch_size rows: 0 .. batch_size-1, ..., N-batch_size .. N]

Distributing Data

slide-57
SLIDE 57

Distributing Data

[Figure: X and y split into mini-batches of batch_size rows]

For each batch (batch0, batch1, batch2, ...): learn parameters (i.e. weights), given a graph with a cost function and optimizer.

slide-58
SLIDE 58

[Figure: X and y split into mini-batches of batch_size rows]

Parameters learned on batch0, batch1, ... are combined.

Distributing Data

slide-59
SLIDE 59

Distributing Data

[Figure: X and y split into mini-batches of batch_size rows]

Parameters learned on batch0, batch1, ... are combined; the combined parameters update each node, and the process repeats.

slide-60
SLIDE 60

(Geron, 2017)

Gradient Descent for Linear Regression

slide-61
SLIDE 61

Batch Gradient Descent: all examples at a time.
Stochastic Gradient Descent: one example at a time.
Mini-batch Gradient Descent: k examples at a time.
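
A minimal sketch of mini-batch gradient descent (k examples at a time) using the simple linear-regression graph sketched earlier; the data, batch size, and epoch count are all assumed:

```python
import numpy as np
import tensorflow as tf

X_data = np.random.rand(1000, 3).astype(np.float32)
y_data = X_data @ np.array([[1.0], [2.0], [3.0]], dtype=np.float32)

X = tf.placeholder(tf.float32, shape=(None, 3))
y = tf.placeholder(tf.float32, shape=(None, 1))
beta = tf.Variable(tf.zeros((3, 1)))
cost = tf.reduce_mean(tf.square(tf.matmul(X, beta) - y))
train = tf.train.GradientDescentOptimizer(0.1).minimize(cost)

batch_size, n = 32, X_data.shape[0]
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(10):                          # repeat over the whole data set
        for start in range(0, n, batch_size):        # one parameter update per mini-batch
            sess.run(train, feed_dict={X: X_data[start:start + batch_size],
                                       y: y_data[start:start + batch_size]})
```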

(Geron, 2017)

Gradient Descent for Linear Regression

slide-62
SLIDE 62

Batch Gradient Descent: all examples at a time.
Stochastic Gradient Descent: one example at a time.
Mini-batch Gradient Descent: k examples at a time.

(Geron, 2017)

Gradient Descent for Linear Regression

slide-63
SLIDE 63

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., ... & Kudlur, M. (2016, November). TensorFlow: A System for Large-Scale Machine Learning. In OSDI (Vol. 16, pp. 265-283).

Spark Overview

Distributed TensorFlow

slide-64
SLIDE 64

Distributed:

  • Locally: Across processors (cpus, gpus, tpus)
  • Across a Cluster: Multiple machines with multiple processors

Distributed TensorFlow

slide-65
SLIDE 65

Distributed:

  • Locally: Across processors (cpus, gpus, tpus)
  • Across a Cluster: Multiple machines with multiple processors

Parallelisms:

  • Data Parallelism: All nodes doing same thing on different subsets of data
  • Graph/Model Parallelism: Different portions of model on different devices

Distributed TensorFlow

slide-66
SLIDE 66

Distributed:

  • Locally: Across processors (cpus, gpus, tpus)
  • Across a Cluster: Multiple machines with multiple processors

Parallelisms:

  • Data Parallelism: All nodes doing same thing on different subsets of data
  • Graph/Model Parallelism: Different portions of model on different devices

Model Updates:

  • Asynchronous Parameter Server
  • Synchronous AllReduce (doesn’t work with Model Parallelism)

Distributed TensorFlow

slide-67
SLIDE 67

Distributed:

  • Locally: Across processors (cpus, gpus, tpus)
  • Across a Cluster: Multiple machines with multiple processors

Parallelisms:

  • Data Parallelism: All nodes doing same thing on different subsets of data
  • Graph/Model Parallelism: Different portions of model on different devices

Model Updates:

  • Asynchronous Parameter Server
  • Synchronous AllReduce (doesn’t work with Model Parallelism)

Distributed TensorFlow

discussed previously

slide-68
SLIDE 68

Multiple devices on single machine

CPU:0 CPU:1 GPU:0

Program 1 Program 2

Local Distribution

slide-69
SLIDE 69

Multiple devices on single machine

CPU:0 CPU:1 GPU:0

Local Distribution

slide-70
SLIDE 70

Multiple devices on multiple machines

Machine A CPU:0 CPU:1 Machine B GPU:0

Local Distribution

slide-71
SLIDE 71

Distributed:

  • Locally: Across processors (cpus, gpus, tpus)
  • Across a Cluster: Multiple machines with multiple processors

Parallelisms:

  • Data Parallelism: All nodes doing same thing on different subsets of data
  • Graph/Model Parallelism: Different portions of model on different devices

Model Updates:

  • Asynchronous Parameter Server
  • Synchronous AllReduce (doesn’t work with Model Parallelism)

Distributed TensorFlow

slide-72
SLIDE 72

Distributed:

  • Locally: Across processors (cpus, gpus, tpus)
  • Across a Cluster: Multiple machines with multiple processors

Parallelisms:

  • Data Parallelism: All nodes doing same thing on different subsets of data
  • Graph/Model Parallelism: Different portions of model on different devices

Model Updates:

  • Asynchronous Parameter Server
  • Synchronous AllReduce (doesn’t work with Model Parallelism)

Distributed TensorFlow

slide-73
SLIDE 73

Multiple devices on multiple machines

Machine A CPU:0 CPU:1 Machine B GPU:0

Transfer Tensors

Parallelisms

Model Parallelism

slide-74
SLIDE 74

CPU:0 CPU:1 GPU:0

Parallelisms

Data Parallelism

slide-75
SLIDE 75

Distributed:

  • Locally: Across processors (cpus, gpus, tpus)
  • Across a Cluster: Multiple machines with multiple processors

Parallelisms:

  • Data Parallelism: All nodes doing same thing on different subsets of data
  • Graph/Model Parallelism: Different portions of model on different devices

Model Updates:

  • Asynchronous Parameter Server
  • Synchronous AllReduce (doesn’t work with Model Parallelism)

Distributed TensorFlow

slide-76
SLIDE 76

Distributed:

  • Locally: Across processors (cpus, gpus, tpus)
  • Across a Cluster: Multiple machines with multiple processors

Parallelisms:

  • Data Parallelism: All nodes doing same thing on different subsets of data
  • Graph/Model Parallelism: Different portions of model on different devices

Model Updates:

  • Asynchronous Parameter Server
  • Synchronous AllReduce (doesn’t work with Model Parallelism)

Distributed TensorFlow

slide-77
SLIDE 77

[Diagram (Geron, 2017: HOML, p. 324): a cluster of three TF Servers spread over Machine A and Machine B (devices CPU:0, CPU:1, GPU:0, CPU:0): job "ps" task 0 and job "worker" tasks 0 and 1, each server running a Master and a Worker service]

Asynchronous Parameter Server

slide-78
SLIDE 78

[Diagram (Geron, 2017: HOML, p. 324): a cluster of three TF Servers spread over Machine A and Machine B (devices CPU:0, CPU:1, GPU:0, CPU:0): job "ps" task 0 and job "worker" tasks 0 and 1, each server running a Master and a Worker service]

Parameter Server: Job is just to maintain values of variables being optimized. Workers: do all the numerical “work” and send updates to the parameter server.
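
A minimal sketch (hostnames, ports, and sizes are assumed) of the TF 1.x parameter-server setup: a ClusterSpec names the "ps" and "worker" jobs, each process starts a tf.train.Server for its task, and replica_device_setter pins the variables to the ps job:

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["machineA:2221"],                    # maintains the variables
    "worker": ["machineA:2222", "machineB:2222"],   # do the numerical work
})

# Each process is started with its own job_name / task_index; e.g. worker task 0:
server = tf.train.Server(cluster, job_name="worker", task_index=0)

with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    beta = tf.Variable(tf.zeros((3, 1)))            # placed on the ps task
    # ... build the cost and optimizer here; each worker computes gradients on its
    # data and sends updates to the parameter server asynchronously.

# A "ps" process would instead block with server.join() and just serve variables.
```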

Asynchronous Parameter Server

slide-79
SLIDE 79

[Diagram (Geron, 2017: HOML, p. 324): three TF Servers spread over Machine A and Machine B, all in the "worker" job, each running a Master and a Worker service]

Workers do computation, send parameter updates to other workers, and store parameter updates from other workers. Requires low latency communication.
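
One built-in way to get synchronous all-reduce in recent TF versions is tf.distribute.MirroredStrategy, which does synchronous data-parallel training across the GPUs of one machine; a hedged sketch (the Keras model here is an assumed placeholder, not from the slides):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()          # one replica per visible GPU
with strategy.scope():                               # variables are mirrored on every replica
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(3,))])
    model.compile(optimizer="sgd", loss="mse")
# model.fit(...) then runs each batch on all replicas and all-reduces the gradients.
```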

Synchronous All Reduce

slide-80
SLIDE 80

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., ... & Kudlur, M. (2016, November). TensorFlow: A System for Large-Scale Machine Learning. In OSDI (Vol. 16, pp. 265-283).

Distributed TF: Full Pipeline

slide-81
SLIDE 81
  • TF is a workflow system, where records are always tensors.
  • Operations are applied to tensors (as either Variables, constants, or placeholders).
  • Optimized for numerical / linear algebra:
    ○ automatically finds gradients
    ○ custom kernels for given devices
  • "Easily" distributes:
    ○ within a single machine (locally: many devices)
    ○ across a cluster (many machines and devices)
    ○ jobs broken up as parameter servers / workers make coordination of data efficient

Spark Overview

Summary