Fast, Distributed Computations in the Cloud


slide-1
SLIDE 1

Fast, Distributed Computations in the Cloud

Omid Mashayekhi Advisor: Philip Levis

April 7, 2017

slide-2
SLIDE 2


slide-3
SLIDE 3

Cloud Frameworks

[Figure: applications (SQL, Streaming, Machine Learning, Graph) running on top of a cloud framework.]

Cloud frameworks abstract away the complexities of the cloud infrastructure from application developers:

1. Automatic distribution
2. Elastic scalability
3. Multitenant applications
4. Load balancing
5. Fault tolerance

slide-4
SLIDE 4

Cloud Frameworks

[Figure: an SQL job is submitted to the control plane, which dispatches tasks to the workers.]

  • A job is an instance of the application running in the framework.
  • A task is the unit of computation.
  • The control plane makes the magic happen:
    – Partitioning the job into tasks
    – Scheduling tasks
    – Load balancing
    – Fault recovery
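The control-plane duties listed above can be sketched in a few lines. This is only an illustration: the `Task` record, `partition_job`, `schedule`, and the round-robin policy are assumptions made for the example and are not taken from any framework named in the talk.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: int
    partition: range  # slice of the input records this task processes


def partition_job(num_records: int, num_tasks: int) -> list[Task]:
    """Split a job over num_records records into num_tasks contiguous tasks."""
    base, extra = divmod(num_records, num_tasks)
    tasks, start = [], 0
    for i in range(num_tasks):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        tasks.append(Task(i, range(start, start + size)))
        start += size
    return tasks


def schedule(tasks: list[Task], workers: list[str]) -> dict[str, list[int]]:
    """Assign tasks to workers round-robin (a stand-in for real load balancing)."""
    plan = {w: [] for w in workers}
    for t in tasks:
        plan[workers[t.task_id % len(workers)]].append(t.task_id)
    return plan


tasks = partition_job(10, 4)
plan = schedule(tasks, ["w0", "w1"])
print([len(t.partition) for t in tasks], plan)
```

In a centralized design, every one of these decisions passes through one controller, which is why the per-task cost matters later in the talk.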

slide-5
SLIDE 5

Evolution of Cloud Frameworks

[Figure: timeline of task length on an axis from 10s down to 1ms. 2004: I/O-bound data analytics (MapReduce, Hadoop).]

slide-6
SLIDE 6

Evolution of Cloud Frameworks

[Figure: the same timeline, adding 2012: in-memory data analytics (Spark, Naiad).]

slide-7
SLIDE 7

Evolution of Cloud Frameworks

[Figure: the same timeline, adding 2016: optimized data analytics (Spark 2.0, Common IL, C++).]

slide-8
SLIDE 8

[Figure: the complete timeline of task length, 2004 to 2016.]

slide-9
SLIDE 9

  • One iteration of logistic regression over a data set of size 64MB.
  • Implemented efficiently, tasks could run 50x faster.

[Figure: bar chart of task execution time (ms) for the C++, Java, Spark RDD, and Spark DataFrame implementations; the bars read 1071, 178 (6x), 67 (16x), and 21 (51x). Shown next to the task-length timeline from the previous slides.]
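The speedup factors on this slide follow from dividing the slowest execution time by each faster one. The short check below reproduces them; the variable names are only for illustration, and the values are the ones printed on the slide.

```python
baseline_ms = 1071           # slowest task execution time on the slide
faster_ms = [178, 67, 21]    # the three faster implementations

# Speedup relative to the slowest implementation, rounded as on the slide.
speedups = [round(baseline_ms / t) for t in faster_ms]
print(speedups)  # [6, 16, 51]
```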

slide-10
SLIDE 10

Individual tasks are getting faster. But does that necessarily mean that job completion time is getting shorter?

slide-11
SLIDE 11

Cloud Frameworks

[Figure: an SQL job is submitted to the control plane, which dispatches tasks to the workers.]

Tasks are getting orders of magnitude faster. How about the job?

slide-12
SLIDE 12

Control Plane

The New Bottleneck

  • Logistic regression over a data set of size 100GB.
  • Classic Spark used to be CPU-bound.
slide-13
SLIDE 13

Control Plane

The New Bottleneck

  • Logistic regression over a data set of size 100GB.
  • Spark 2.0 is already control-bound.
slide-14
SLIDE 14

Control Plane

The New Bottleneck

  • Logistic regression over a data set of size 100GB.
  • Spark-opt: a hypothetical case where Spark runs tasks as fast as C++.
slide-15
SLIDE 15

The control plane is the emerging bottleneck for cloud computing frameworks.

slide-16
SLIDE 16

Control Plane

The New Bottleneck

  • Logistic regression over a data set of size 100GB.
  • Nimbus with execution templates scales almost linearly.
slide-17
SLIDE 17

Contributions

  • Demonstrating how the control plane is the emerging bottleneck for data analytics frameworks.
  • Execution templates as an abstraction for the control plane of cloud computing frameworks that enables orders of magnitude higher task throughput while keeping fine-grained, flexible scheduling.
  • The design, implementation, and evaluation of Nimbus, a distributed cloud computing framework that embeds execution templates.
  • A demonstration of a single-core graphical simulation that Nimbus automatically distributes in the cloud, showing execution templates in practice for complex applications.

slide-18
SLIDE 18

This talk

  • Control Plane: the Emerging Bottleneck
  • Design Scope of the Control Plane
  • Execution Templates
  • Nimbus: a Framework with Templates
  • Evaluation

slide-19
SLIDE 19

This talk

  • Control Plane: the Emerging Bottleneck
  • Design Scope of the Control Plane
  • Execution Templates
  • Nimbus: a Framework with Templates
  • Evaluation

slide-20
SLIDE 20

Cloud Frameworks Design

  • Currently, there are two approaches:
    1. Centralized control model.
       – The controller generates tasks and assigns them to the workers.
       – Limited task throughput, but reactive scheduling.
    2. Distributed data flow model.
       – Nodes generate and spawn tasks locally.
       – Great scalability, but static scheduling.

slide-21
SLIDE 21

Design Spectrum

[Figure: the design spectrum, from centralized controllers to distributed controllers.]

slide-22
SLIDE 22

Design Spectrum

[Figure: centralized end of the spectrum; one controller holds the task graph and the loop and drives four workers.]

  • The controller centrally schedules and spawns tasks.
    – MapReduce
    – Hadoop
    – Spark
slide-23
SLIDE 23

Design Spectrum

[Figure: centralized end of the spectrum; one controller holds the task graph and the loop and drives four workers.]

  • The controller centrally schedules and spawns tasks.
slide-24
SLIDE 24

Design Spectrum

[Figure: centralized end of the spectrum; one controller holds the task graph and the loop and drives four workers.]

  • The controller centrally schedules and spawns tasks.
slide-25
SLIDE 25

Design Spectrum

[Figure: centralized end of the spectrum; one controller holds the task graph and the loop and drives four workers.]

  • The controller can reactively and dynamically change the schedule.
slide-26
SLIDE 26

Design Spectrum

[Figure: centralized end of the spectrum; one controller holds the task graph and the loop and drives four workers.]

  • The controller can reactively and dynamically change the schedule.
slide-27
SLIDE 27

Design Spectrum

[Figure: centralized end of the spectrum; one controller holds the task graph and the loop and drives four workers.]

  • The controller can reactively and dynamically change the schedule.
slide-28
SLIDE 28

Design Spectrum

[Figure: centralized end of the spectrum; one controller holds the task graph and the loop and drives four workers.]

  • But the controller bottlenecks at scale.
slide-29
SLIDE 29

Design Spectrum

[Figure: centralized end of the spectrum; one controller holds the task graph and the loop and drives four workers.]

  • But the controller bottlenecks at scale.
slide-30
SLIDE 30

Design Spectrum

[Figure: centralized end of the spectrum; one controller holds the task graph and the loop and drives four workers.]

  • But the controller bottlenecks at scale.
slide-31
SLIDE 31

Design Spectrum

[Figure: centralized end of the spectrum; one controller holds the task graph and the loop and drives four workers.]

  • But the controller bottlenecks at scale.
slide-32
SLIDE 32

Design Spectrum

[Figure: centralized end of the spectrum; one controller holds the task graph and the loop and drives four workers.]

  • But the controller bottlenecks at scale.
slide-33
SLIDE 33

Design Spectrum

[Figure: centralized end of the spectrum; workers fall idle.]

  • Logistic regression over a data set of size 100GB in Spark 2.0 MLlib.
  • The control plane bottlenecks at scale while generating and spawning tasks.
slide-34
SLIDE 34

Design Spectrum

[Figure: distributed end of the spectrum; four nodes, each with its own controller, worker, and loop, coordinate through synchronization.]

  • Each node generates and executes tasks locally.
    – Naiad
    – TensorFlow
slide-35
SLIDE 35

Design Spectrum

[Figure: distributed end of the spectrum; eight nodes, each with its own controller, worker, and loop, coordinate through synchronization.]

  • The design scales well because there is no single bottleneck.
slide-36
SLIDE 36

Design Spectrum

[Figure: distributed end of the spectrum; four nodes, each with its own controller, worker, and loop, coordinate through synchronization. One node is straggling.]

  • But the scheduling is static.
  • The speed of progress is bound to the speed of the slowest node.
  • Any change requires stopping all nodes and installing a new data flow.
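The straggler bound above can be made concrete with a small sketch. It assumes every node must reach the synchronization point before the next iteration starts; the per-node times are invented for illustration.

```python
def iteration_time(node_times_ms):
    """With a synchronization barrier, an iteration ends when the slowest node ends."""
    return max(node_times_ms)


healthy = [100, 100, 100, 100]     # hypothetical per-node times in ms
straggling = [100, 100, 100, 400]  # one node is 4x slower

print(iteration_time(healthy), iteration_time(straggling))  # 100 400
```

A single slow node quadruples the iteration time here even though three of the four nodes are unaffected, which is why a static schedule is costly when stragglers appear.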

slide-37
SLIDE 37

Design Spectrum

[Figure: distributed end of the spectrum; four controller/loop nodes synchronize, each paired with a backup worker.]

  • In practice, straggler mitigation is only proactive:
    – Avoiding stragglers through meticulous engineering work.
    – Launching backup workers (at least doubling the resources).

slide-38
SLIDE 38

Design Spectrum

[Figure: the same four nodes with backup workers; one node is straggling.]

  • In practice, straggler mitigation is only proactive:
    – Avoiding stragglers through meticulous engineering work.
    – Launching backup workers (at least doubling the resources).

slide-39
SLIDE 39

Design Spectrum

[Figure: the same four nodes with backup workers.]

  • In practice, straggler mitigation is only proactive:
    – Avoiding stragglers through meticulous engineering work.
    – Launching backup workers (at least doubling the resources).

slide-40
SLIDE 40

Design Space

Summary

Control Plane Design   Example Frameworks         Task Throughput   Task Scheduling
Centralized            MapReduce, Hadoop, Spark   Low               Dynamic
Distributed            Naiad, TensorFlow          High              Static

slide-41
SLIDE 41

Design Space

Summary

Control Plane Design   Example Frameworks         Task Throughput   Task Scheduling
Centralized            MapReduce, Hadoop, Spark   Low               Dynamic
Distributed            Naiad, TensorFlow          High              Static

We would like to have the best of both worlds:

  • High task throughput for fast computations.
  • Dynamic, fine-grained scheduling decisions.
slide-42
SLIDE 42

Repetitive Patterns

  • Advanced data analytics are iterative in nature.
    – Machine learning, graph processing, image recognition, etc.
  • This results in repetitive patterns in the control plane.
    – Similar tasks execute with minor differences.

slide-43
SLIDE 43

Repetitive Patterns

  • Advanced data analytics are iterative in nature.
    – Machine learning, graph processing, image recognition, etc.
  • This results in repetitive patterns in the control plane.
    – Similar tasks execute with minor differences.

while (error > threshold_e) {
  while (gradient > threshold_g) {
    // Optimization code block
    gradient = Gradient(tdata, coeff, param)
    coeff += gradient
  }
  // Estimation code block
  error = Estimate(edata, coeff, param)
  param = update_model(param, error)
}

[Figure: data-flow diagram with the nodes Training Data, Estimation Data, Parameters, Error Estimation, Iterative Optimizer, and Coefficients.]

slide-44
SLIDE 44

This talk

  • Control Plane: the Emerging Bottleneck
  • Design Scope of the Control Plane
  • Execution Templates
  • Nimbus: a Framework with Templates
  • Evaluation

slide-45
SLIDE 45

Execution Templates

  • Tasks are cached as parameterizable blocks on nodes.
  • Instead of assigning the tasks from scratch, templates are instantiated by filling in only the changing parameters.

[Figure: three cached task entries, each with the fields task id, data list, dependency list, function, and parameter.]

slide-46
SLIDE 46

Execution Templates

  • Tasks are cached as parameterizable blocks on nodes.
  • Instead of assigning the tasks from scratch, templates are instantiated by filling in only the changing parameters.

[Figure: the same three cached task entries, loaded with new task ids and parameters (T1 P1, T2 P2, T3 P3).]

slide-47
SLIDE 47

Execution Templates

Mechanisms Summary

  • Instantiation: spawns a block of tasks without processing each task individually from scratch. It helps increase task throughput.
  • Edits: modify the content of a template at the granularity of tasks. They enable fine-grained, dynamic scheduling.
  • Patches: fix up the worker state when it does not match the preconditions of the template. They enable dynamic control flow.
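Instantiation can be sketched as follows. This is a minimal illustration, not Nimbus code: `CachedTask`, `Template`, and the dictionary layout of a concrete task are assumptions chosen to mirror the fields shown on the slides (task id, data list, dependency list, function, parameter).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedTask:
    slot: int        # position of the task inside the template
    function: str    # cached: does not change between instantiations
    data: tuple      # cached: data objects the task accesses
    deps: tuple      # cached: slots of tasks that must finish first


@dataclass
class Template:
    tasks: list

    def instantiate(self, task_ids, params):
        """Create concrete tasks by filling in only the new ids and parameters."""
        assert len(task_ids) == len(params) == len(self.tasks)
        return [
            {
                "id": task_ids[t.slot],
                "function": t.function,
                "data": t.data,
                "deps": [task_ids[d] for d in t.deps],  # remap slots to fresh ids
                "param": params[t.slot],
            }
            for t in self.tasks
        ]


template = Template([
    CachedTask(0, "gradient", ("tdata_0", "coeff"), ()),
    CachedTask(1, "gradient", ("tdata_1", "coeff"), ()),
    CachedTask(2, "reduce", ("coeff",), (0, 1)),
])
first = template.instantiate([10, 11, 12], [0.5, 0.5, None])
second = template.instantiate([20, 21, 22], [0.25, 0.25, None])
```

Only the id list and the parameter list travel with each instantiation; functions, data lists, and the dependency structure stay cached.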

slide-48
SLIDE 48

Execution Model

[Figure: a controller runs the driver program, a data flow of Data, Map, and Reduce stages; two workers are attached.]

slide-49
SLIDE 49

Execution Model

[Figure: the same setup; the controller holds the task graph of the driver program.]

slide-50
SLIDE 50

Execution Model

[Figure: the same setup; each worker holds data objects.]

slide-51
SLIDE 51

Execution Model

[Figure: the same setup, with an item labeled C next to the task graph.]

slide-52
SLIDE 52

Execution Model

[Figure: the same setup; C carries a task entry with the fields task id, data list, dependency list, function, and parameter.]

slide-53
SLIDE 53

Execution Model

[Figure: the same setup; the workers also exchange data with each other.]

slide-54
SLIDE 54

Repetitive Patterns

[Figure: the controller with the task graph, and two workers with their data objects.]

Driver Program:

while (error > threshold_e) {
  while (gradient > threshold_g) {
    // Optimization code block
    gradient = Gradient(tdata, coeff, param)
    coeff += gradient
  }
  // Estimation code block
  error = Estimate(edata, coeff, param)
  param = update_model(param, error)
}

slide-55
SLIDE 55

Repetitive Patterns

[Figure: the same setup; C carries a task entry with the fields task id, data list, dependency list, function, and parameter.]

Driver Program:

while (error > threshold_e) {
  while (gradient > threshold_g) {
    // Optimization code block
    gradient = Gradient(tdata, coeff, param)
    coeff += gradient
  }
  // Estimation code block
  error = Estimate(edata, coeff, param)
  param = update_model(param, error)
}

slide-56
SLIDE 56

Repetitive Patterns

[Figure: the same setup; the workers exchange data while the task entry is delivered.]

Driver Program:

while (error > threshold_e) {
  while (gradient > threshold_g) {
    // Optimization code block
    gradient = Gradient(tdata, coeff, param)
    coeff += gradient
  }
  // Estimation code block
  error = Estimate(edata, coeff, param)
  param = update_model(param, error)
}

slide-57
SLIDE 57

Repetitive Patterns

[Figure: the same setup; C carries a task entry with the fields task id, data list, dependency list, function, and parameter.]

Driver Program:

while (error > threshold_e) {
  while (gradient > threshold_g) {
    // Optimization code block
    gradient = Gradient(tdata, coeff, param)
    coeff += gradient
  }
  // Estimation code block
  error = Estimate(edata, coeff, param)
  param = update_model(param, error)
}

slide-58
SLIDE 58

Repetitive Patterns

[Figure: the same setup; the workers exchange data while the task entry is delivered.]

Driver Program:

while (error > threshold_e) {
  while (gradient > threshold_g) {
    // Optimization code block
    gradient = Gradient(tdata, coeff, param)
    coeff += gradient
  }
  // Estimation code block
  error = Estimate(edata, coeff, param)
  param = update_model(param, error)
}

slide-59
SLIDE 59

Execution Templates

Abstraction

[Figure: the controller with the task graph, two workers with data objects, and an item labeled C.]

slide-60
SLIDE 60

Execution Templates

Abstraction

[Figure: the same setup; a template is now cached on each worker.]

slide-61
SLIDE 61

Execution Templates

Abstraction

[Figure: the same setup, with a template cached on each worker.]

slide-62
SLIDE 62

Execution Templates

Abstraction

[Figure: the same setup; the controller sends Instantiate<params> to the template on each worker.]

slide-63
SLIDE 63

Execution Templates

Abstraction

[Figure: the same setup, with a template cached on each worker.]

slide-64
SLIDE 64

Execution Templates

Abstraction

[Figure: the same setup, with a template cached on each worker.]

slide-65
SLIDE 65

Execution Templates

The Devil is in the details.

  • Caching tasks implies static behavior:
    – Templates and dynamic scheduling?
        • Reactive scheduling changes for load balancing.
        • Scheduling changes at task granularity.
    – Templates and dynamic control flow?
        • Need to support nested loops.
        • Need to support data-dependent branches.

slide-66
SLIDE 66

Execution Templates

The Devil is in the details.

  • Caching tasks implies static behavior:
    – Templates and dynamic scheduling?
        • Reactive scheduling changes for load balancing.
        • Scheduling changes at task granularity.
    – Templates and dynamic control flow?
        • Need to support nested loops.
        • Need to support data-dependent branches.

slide-67
SLIDE 67

Execution Templates

Edits

  • If the schedule changes, even slightly, the templates become obsolete.
    – For example, when tasks migrate among workers.
  • Instead of paying the substantial cost of installing new templates for every change, templates support edits that change their structure.
  • Edits add tasks to or remove tasks from a template and modify its content in place.
  • The controller has the general view of the task graph, so it can properly update the dependencies that the edits affect.
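A task migration expressed as two edits might look like the sketch below. The per-worker lists and the `migrate` helper are hypothetical; a real edit would also update the dependencies between tasks, which this example leaves out.

```python
def migrate(templates, task, src, dst):
    """Move one cached task between per-worker templates, editing both in place."""
    templates[src].remove(task)   # corresponds to Edit<remove> on the source worker
    templates[dst].append(task)   # corresponds to Edit<add> on the destination worker


# Hypothetical templates: each worker caches the tasks it runs per iteration.
templates = {"worker_a": ["grad_0", "grad_1", "grad_2"], "worker_b": ["grad_3"]}
migrate(templates, "grad_2", "worker_a", "worker_b")
print(templates)
```

The rest of each template is untouched, so the cost is proportional to the number of migrated tasks rather than to the template size.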

slide-68
SLIDE 68

Execution Templates

Edits

[Figure: the controller with the task graph and two workers, each with data objects and a template. One task is to be migrated from one worker to the other.]

slide-69
SLIDE 69

Execution Templates

Edits

[Figure: the same setup; the controller sends Edit<remove> to one worker and Edit<add> to the other.]

slide-70
SLIDE 70

Execution Templates

Edits

[Figure: the same setup, with the edited templates on both workers.]

slide-71
SLIDE 71

Execution Templates

Edits

[Figure: the same setup; the controller sends Instantiate<params> to the template on each worker.]

slide-72
SLIDE 72

Execution Templates

The Devil is in the details.

  • Caching tasks implies static behavior:
    – Templates and dynamic scheduling?
        • Reactive scheduling changes for load balancing.
        • Scheduling changes at task granularity.
    – Templates and dynamic control flow?
        • Need to support nested loops.
        • Need to support data-dependent branches.

slide-73
SLIDE 73

Execution Templates

Granularity

[Figure: data-flow diagram with the nodes Training Data, Estimation Data, Parameters, Error Estimation, Iterative Optimizer, and Coefficients.]

slide-74
SLIDE 74

Execution Templates

Granularity

[Figure: the same data-flow diagram.]

  • The more tasks cached in the template, the better.
    – The cost of template instantiation is amortized over a greater number of tasks.
    – But loop unrolling only works for static control flow.

slide-75
SLIDE 75

Execution Templates

Granularity

[Figure: the same data-flow diagram, with a region marked as one template.]

slide-76
SLIDE 76

Execution Templates

Granularity

[Figure: the same data-flow diagram, with a region marked as one template.]

slide-77
SLIDE 77

Execution Templates

Granularity

[Figure: the same data-flow diagram, with a region marked as one template.]

  • Cannot reuse the template (only two iterations of the inner loop).
slide-78
SLIDE 78

Execution Templates

Granularity

[Figure: data-flow diagram with the nodes Training Data, Estimation Data, Parameters, Error Estimation, Iterative Optimizer, and Coefficients.]

  • Templates cannot go beyond a branch in the driver program.
  • Execution templates operate at the granularity of basic blocks:
    – A basic block is a code block with a single entry and no branches except at the end.
    – It is the biggest block that does not sacrifice dynamic control flow.
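The basic-block rule can be illustrated with the nested loop from the driver program: the optimization block and the estimation block each become one template, and the driver decides at run time how often to instantiate them. In this sketch the function name, the log format, and the fixed inner iteration counts are assumptions; in the real program the counts depend on the data.

```python
def run_driver(inner_iterations_per_outer):
    """Log what the control plane does for each basic block the driver executes."""
    log = []
    installed = set()

    def run_block(name):
        if name not in installed:
            installed.add(name)          # first execution: install the template
            log.append(f"install {name}")
        else:
            log.append(f"instantiate {name}")

    for inner_iterations in inner_iterations_per_outer:
        for _ in range(inner_iterations):
            run_block("optimization")    # basic block of the inner loop
        run_block("estimation")          # basic block after the inner loop
    return log


log = run_driver([3, 2])
print(log)
```

Because each template ends at a branch, the inner loop can run three times in one outer iteration and twice in the next without invalidating either template.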

slide-79
SLIDE 79

Execution Templates

Granularity

[Figure: the same data-flow diagram, with the inner-loop block marked as Template 1.]

slide-80
SLIDE 80

Execution Templates

Granularity

[Figure: the same data-flow diagram; Template 1 is installed once and then instantiated four times.]

slide-81
SLIDE 81

Execution Templates

Granularity

[Figure: the same data-flow diagram, with a second block marked as Template 2.]

slide-82
SLIDE 82

Execution Templates

Granularity

[Figure: the same data-flow diagram; Template 2 is installed once and then instantiated.]

slide-83
SLIDE 83

Execution Templates

Granularity

[Figure: the controller with the task graph and two workers with data objects; the stream of tasks is delimited by StartTemplate and EndTemplate markers.]

slide-84
SLIDE 84

Execution Templates

Granularity

[Figure: the same setup, with a template cached on each worker.]

slide-85
SLIDE 85

Execution Templates

Patching

[Figure: data-flow diagram with the nodes Training Data, Estimation Data, Parameters, Error Estimation, Iterative Optimizer, and Coefficients.]

  • With dynamic control flow, a basic block can have different entries.
  • The execution state is not the same in all circumstances.
slide-86
SLIDE 86

Execution Templates

Patching

[Figure: the same data-flow diagram, with Template 1 instantiated twice.]

slide-87
SLIDE 87

Execution Templates

Patching

[Figure: the same data-flow diagram, with Template 1 instantiated twice.]

slide-88
SLIDE 88

Execution Templates

Patching

[Figure: the same data-flow diagram, with Template 1 instantiated twice.]

slide-89
SLIDE 89

Execution Templates

Patching

[Figure: the same data-flow diagram, with Template 1 instantiated twice.]

Updated model parameters only on the reducer.
slide-90
SLIDE 90

Execution Templates

Patching

  • Each template has a set of preconditions that need to be satisfied before it can be instantiated.
    – For example, the set of data objects in memory that are accessed by the tasks cached in the template.

slide-91
SLIDE 91

Execution Templates

Patching

[Figure: the controller with the task graph and two workers, each with data objects and a template.]

slide-92
SLIDE 92

Execution Templates

Patching

[Figure: the same setup; each template carries its preconditions.]

slide-93
SLIDE 93

Execution Templates

Patching

  • Each template has a set of preconditions that need to be satisfied before it can be instantiated.
    – For example, the set of data objects in memory that are accessed by the tasks cached in the template.
  • The worker state might not match the preconditions of the template in all circumstances.
  • The controller patches the worker state before template instantiation to satisfy the preconditions.
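The precondition check and patch can be sketched with plain sets. The function name `patch_worker`, the object names, and the `fetch` callback are assumptions for illustration; the sketch models only the example precondition from the slide, namely which data objects are present in the worker's memory.

```python
def patch_worker(worker_objects, preconditions, fetch):
    """Load any data object the template expects but the worker does not hold."""
    missing = sorted(preconditions - worker_objects)
    for obj in missing:
        fetch(obj)                # corresponds to Patch<load> for one object
        worker_objects.add(obj)
    return missing                # patches applied before instantiation


fetched = []
worker = {"tdata_0", "coeff"}                     # objects currently in memory
preconditions = {"tdata_0", "coeff", "param"}     # objects the template reads

first = patch_worker(worker, preconditions, fetched.append)
second = patch_worker(worker, preconditions, fetched.append)
print(first, second)  # ['param'] []
```

When the state already matches, as in the second call, no patch is issued and the template can be instantiated directly.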

slide-94
SLIDE 94

Execution Templates

Patching

[Figure: the controller with the task graph and two workers, each with data objects, a template, and its preconditions.]

slide-95
SLIDE 95

Execution Templates

Patching

[Figure: the same setup; the controller sends Patch<load> to a worker.]

slide-96
SLIDE 96

Execution Templates

Patching

[Figure: the same setup, with the preconditions satisfied on both workers.]

slide-97
SLIDE 97

Execution Templates

Patching

[Figure: the same setup; the controller sends Instantiate<params> to the template on each worker.]

slide-98
SLIDE 98

Execution Templates

Patching

[Figure: the same setup, with a template and its preconditions on each worker.]

slide-99
SLIDE 99

Execution Templates

Mechanisms Summary

  • Instantiation: spawns a block of tasks without processing each task individually from scratch. It helps increase task throughput.
  • Edits: modify the content of a template at the granularity of tasks. They enable fine-grained, dynamic scheduling.
  • Patches: fix up the worker state when it does not match the preconditions of the template. They enable dynamic control flow.

slide-100
SLIDE 100

This talk

  • Control Plane: the Emerging Bottleneck
  • Design Scope of the Control Plane
  • Execution Templates
  • Nimbus: a Framework with Templates
  • Evaluation

slide-101
SLIDE 101

Nimbus

  • Nimbus is designed for low-latency, fast computations in the cloud.
    – Implemented in C++ (the core library is ~35,000 semicolons).
    – Mutable data model to allow in-place operations.
  • Nimbus embeds execution templates in its control plane.
    – The centralized controller allows dynamic scheduling and resource allocation.
    – Execution templates help deliver high task throughput at scale.
  • Nimbus supports traditional data analytics as well as Eulerian and hybrid graphical simulations, for the first time in a cloud framework.
    – Supervised/unsupervised learning algorithms, graph library.
    – PhysBAM library (water, smoke, etc.).

slide-102
SLIDE 102

Nimbus

Control Flow

[Figure: four snapshots of a controller with two workers.]

  • Tasks spawn other tasks for execution (similar to Legion).
  • The driver program is a lineage of tasks executing on the workers.
  • A more flexible DAG for the task graph.
    – Not just narrow and wide dependencies.
    – Needed for graphical simulations.
slide-103
SLIDE 103

Nimbus

Controller and Worker Templates

[Figure: a controller with four workers; a controller template is instantiated at the controller.]

slide-104
SLIDE 104

Nimbus

Controller and Worker Templates

[Figure: a controller with four workers and controller templates.]

slide-105
SLIDE 105

Nimbus

Controller and Worker Templates

[Figure: a controller with four workers; the controller sends an instantiation message to the worker template on each worker.]

slide-106
SLIDE 106

Nimbus

Controller and Worker Templates

[Figure: a controller with four workers and worker templates; items labeled C pass between the workers.]

slide-107
SLIDE 107

Nimbus

Graphical Simulations

Driver Program:

Partition prt = {2, 1, 1};
Create(velocity, prt);
Op(exec: advect, data: velocity, read: core/ghost, write: ghost);
...

[Figure: Nimbus architecture for graphical simulations. A launcher submits the driver program to the controller, which keeps the mapping between logical data (A, B, C, D) and physical data (1 to 6) and inserts copy tasks. Each computing node runs a translator, a manager, and app.so on top of application data. The advect operation appears at three levels: GeometricTask(advect, left_reg) and GeometricTask(advect, right_reg); LogicalTask(advect, {A,B,C}) and LogicalTask(advect, {B,C,D}); PhysicalTask(advect, {1,2,3}) and PhysicalTask(advect, {4,5,6}).]

  • The goal is to automatically distribute sequential library kernels.
  • Four-layer data abstraction (geometric, logical, physical, application).
  • Automatic translation and caching between the data layers.
slide-108
SLIDE 108

  • For more information, visit the Nimbus website: nimbus.stanford.edu
slide-109
SLIDE 109

This talk

  • Control Plane: the Emerging Bottleneck
  • Design Scope of the Control Plane
  • Execution Templates
  • Nimbus: a Framework with Templates
  • Evaluation

slide-110
SLIDE 110
Evaluation

Results Summary

  • Control plane task throughput:
    – Execution templates match the strong scaling performance of frameworks with a distributed control plane design.
  • Dynamic scheduling:
    – Execution templates allow low-cost, reactive scheduling and dynamic resource allocation, similar to centralized frameworks.
  • Dynamic control flow:
    – Execution templates can handle applications with nested loops and data-dependent branches with low overhead.

slide-111
SLIDE 111

Evaluation

Strong Scalability with Templates

  • Logistic regression over a data set of size 100GB.
  • Spark-opt and Naiad-opt run tasks as fast as the C++ implementation.
  • The Nimbus centralized controller with execution templates matches the performance of Naiad with a distributed control plane.

slide-112
SLIDE 112

Evaluation

Reactive, Fine-Grained Scheduling with Templates

[Figure annotation: migrating 5% of the tasks.]

  • Logistic regression over a data set of size 100GB, on 100 workers.
  • The Naiad-opt curve is simulated (migrations every 5 iterations).
  • Execution templates allow low-cost, reactive scheduling changes through edits at task granularity.
  • A single edit costs only 41μs on average.
slide-113
SLIDE 113

Evaluation

Dynamic Resource Allocation with Templates

  • Logistic regression over 100GB of data, on 50/100 workers.
  • One-time template installation cost is ~40% of direct task scheduling.
  • Nimbus allows dynamic resource allocation.
  • Nimbus installs multiple versions of a template depending on resources.

[Figure: phases annotated x100, x100, and x50, with the callouts "Validating preconditions for template reuse" and "Installing new templates".]

slide-114
SLIDE 114

Evaluation

High Task Throughput with Templates

  • Spark and Nimbus both have a centralized controller.
  • Nimbus task throughput scales superlinearly with more workers.
    – O(N²): more tasks and shorter tasks, simultaneously.
  • For a task graph with a single stage:
    – Instantiation cost is <2μs per task (500,000 tasks per second).
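The two numbers above can be checked with a little arithmetic. The sketch assumes one reading of the O(N²) bullet: under strong scaling, N workers mean N times more tasks per iteration, and each task is N times shorter, so the task rate the controller must sustain grows with N². The function names and the default values are illustrative.

```python
import math


def tasks_per_second(cost_per_task_us):
    """Task throughput of a controller that spends cost_per_task_us on each task."""
    return 1_000_000 / cost_per_task_us


def required_task_rate(workers, tasks_per_worker=1, base_task_s=1.0):
    """Strong scaling: N workers means N times more tasks, each N times shorter."""
    tasks_per_iteration = workers * tasks_per_worker
    task_length_s = base_task_s / workers
    return tasks_per_iteration / task_length_s


print(tasks_per_second(2))                               # 500000.0
print(required_task_rate(20) / required_task_rate(10))   # roughly 4: doubling N quadruples the rate
```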
slide-115
SLIDE 115

Evaluation

Graphical Simulations Distributed in Nimbus

slide-116
SLIDE 116

Evaluation

Complexities of Graphical Simulations

[Figure: simulation variables: levelset, velocity, positive particles, negative particles, positive removed particles, and negative removed particles.]

  • 40 different variables: scalar, vector, particle.
  • Triply nested loop with data-dependent branches.
  • 9 different templates (basic blocks).
  • 3 branches that need patching.
slide-117
SLIDE 117

Evaluation

Speedup with Templates

  • Canonical water simulations under Nimbus and MPI.
  • Without templates, Nimbus is almost 6x slower than MPI.
  • A slowdown means either lower resolution or more time/money.
slide-118
SLIDE 118

Evaluation

Speedup with Templates

  • Canonical water simulations under Nimbus and MPI.
  • Without templates, Nimbus is almost 6x slower than MPI.
  • A slowdown means either lower resolution or more time/money.

[Figure annotations: $180 and $30.]

slide-119
SLIDE 119

Evaluation

Comparison with Hand-Tuned MPI

  • Canonical water simulations under Nimbus and MPI.
  • Nimbus performance is within 3-15% of the hand-tuned MPI.
  • At 512 cores, there are more than 1 million distinct data objects, and task throughput peaks at 460,000 tasks per second.

slide-120
SLIDE 120

Evaluation

Load Balancing and Fault Recovery with Templates

[Figure: iteration number (up to 400) versus time in minutes (up to 60), with curves labeled Enabled and Disabled. Annotated events: two checkpoints, one node fails, rewind from checkpoint, one node straggles.]

  • The Nimbus controller adapts to stragglers and worker failures.
  • Templates are seamlessly installed as the schedule changes.
slide-121
SLIDE 121

Contributions

  • Demonstrating how the control plane is the emerging bottleneck for data analytics frameworks.
  • Execution templates as an abstraction for the control plane of cloud computing frameworks that enables orders of magnitude higher task throughput while keeping fine-grained, flexible scheduling.
  • The design, implementation, and evaluation of Nimbus, a distributed cloud computing framework that embeds execution templates.
  • A demonstration of a single-core graphical simulation that Nimbus automatically distributes in the cloud, showing execution templates in practice for complex applications.

slide-122
SLIDE 122

Conclusion

Control Plane Design                  Example Frameworks         Task Throughput   Task Scheduling
Centralized                           MapReduce, Hadoop, Spark   Low               Dynamic
Distributed                           Naiad, TensorFlow          High              Static
Centralized w/ Execution Templates    Nimbus                     High              Dynamic

slide-123
SLIDE 123


Thank You!