

SLIDE 1

Targeted Resource Management in Multi-tenant Distributed Systems

Jonathan Mace (Brown University), Peter Bodik (MSR Redmond), Rodrigo Fonseca (Brown University), Madanlal Musuvathi (MSR Redmond)

SLIDE 2

Resource Management in Multi-Tenant Systems

SLIDES 3-10

Resource management failures in multi-tenant systems are common and costly:

  • April 2011 – Amazon EBS Failure: a failure in one availability zone cascades to the shared control plane, causing thread pool starvation for all zones
  • Aug. 2012 – Azure Storage Outage: an aggressive background task responds to increased hardware capacity with a deluge of warnings and logging
  • Nov. 2014 – Visual Studio Online outage: a code change increases database usage, shifting the bottleneck to an unmanaged application-level lock
  • 2014 – Communication with Cloudera: shared storage layer bottlenecks circumvent the resource management layer

The consequences: degraded performance, violated SLOs, and system outages.

SLIDES 11-12

[Diagram: the OS and hypervisor isolate tenants into containers / VMs, but shared systems (storage, database, queueing, etc.) sit outside that isolation boundary.]

SLIDES 13-17

The resource manager monitors the resource usage of each tenant in near real-time and actively schedules tenants and activities. High-level, centralized policies:

  • encapsulate resource management logic;
  • are built on abstractions not specific to a resource type or system;
  • achieve different goals: guarantee average latencies, fair-share a resource, etc.

SLIDES 18-23

Example: the Hadoop Distributed File System (HDFS). The HDFS NameNode serves filesystem metadata operations such as Rename; HDFS DataNodes provide replicated block storage and serve operations such as Read.

SLIDES 24-34

[Animation: requests from multiple tenants flow through the HDFS NameNode and DataNodes; the per-tenant request counts shown (e.g., 500 requests versus 12) illustrate how a heavy workload contends with a light one.]

SLIDES 35-41

[Diagram: a single machine hosts an HBase RegionServer, an HDFS DataNode, a MapReduce Shuffler, and a Hadoop YARN NodeManager running MapReduce tasks in YARN containers, all sharing the local storage; the same stack repeats on every machine.]

SLIDES 42-47

Goals

  • Coordinated control across processes, machines, and services
  • Handle both system-level and application-level resources
  • Principals: tenants and background tasks
  • Real-time and reactive
  • Efficient: only control what is needed

SLIDE 48

Architecture

SLIDES 49-50

[Diagram: tenant requests entering the system.]

SLIDES 51-53

Workflows

Purpose: identify requests from different users and background activities (e.g., all requests from a tenant over time, or data balancing in HDFS). A workflow is the unit of resource measurement, attribution, and enforcement; it tracks a request across varying levels of granularity and is orthogonal to threads, processes, network flows, etc.

SLIDES 54-60

Resources

Purpose: cope with the diversity of resources. What we need:

  • 1. Identify overloaded resources. Slowdown: the ratio of how slow the resource is now compared to its baseline performance with no contention.
  • 2. Identify culprit workflows. Load: the fraction of current utilization that we can attribute to each workflow.

SLIDES 61-64

Concretely, for a queue-like resource:

  • Slowdown = (queue time + execute time) / execute time. E.g., 100 ms of queueing plus 10 ms of execution gives a slowdown of (100 + 10) / 10 = 11.
  • Load = time spent executing. E.g., 10 ms of execution gives a load of 10.
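To make the accounting concrete, here is a minimal Java sketch of per-resource slowdown and load tracking under these definitions. It is an illustration only; the class and method names are hypothetical, not Retro's actual API.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical per-resource accounting. Slowdown aggregates
// (queue time + execute time) / execute time over an interval;
// load attributes execute time to each workflow.
class ResourceStats {
    private final LongAdder queueNanos = new LongAdder();
    private final LongAdder executeNanos = new LongAdder();
    private final Map<String, LongAdder> loadByWorkflow = new ConcurrentHashMap<>();

    // Called by resource instrumentation when an operation completes.
    void record(String workflowId, long queuedNanos, long executedNanos) {
        queueNanos.add(queuedNanos);
        executeNanos.add(executedNanos);
        loadByWorkflow.computeIfAbsent(workflowId, id -> new LongAdder())
                      .add(executedNanos);
    }

    // Slowdown over the current interval; 1.0 means no contention.
    double slowdown() {
        long exec = executeNanos.sum();
        return exec == 0 ? 1.0 : (queueNanos.sum() + exec) / (double) exec;
    }

    // Fraction of this resource's utilization attributed to one workflow.
    double loadFraction(String workflowId) {
        LongAdder mine = loadByWorkflow.get(workflowId);
        long total = executeNanos.sum();
        return (total == 0 || mine == null) ? 0.0 : mine.sum() / (double) total;
    }
}

With the example above, record("w1", 100_000_000, 10_000_000) followed by slowdown() returns 11.0.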

SLIDES 65-74

Control Points

Goal: enforce resource management decisions. Control points are decoupled from resources: they rate-limit workflows and are agnostic to the underlying implementation, e.g., a token bucket or a priority queue.
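For instance, a control point could be backed by a token bucket whose rate the controller adjusts. The following is a minimal, self-contained Java sketch under that assumption, not the paper's implementation:

// Minimal per-workflow token bucket for a control point (illustrative).
// rate is tokens (requests) per second; capacity bounds bursts.
class TokenBucket {
    private final double capacity;
    private double rate;
    private double tokens;
    private long lastRefillNanos = System.nanoTime();

    TokenBucket(double rate, double capacity) {
        this.rate = rate;
        this.capacity = capacity;
        this.tokens = capacity;
    }

    // Called when the central controller applies a new rate.
    synchronized void setRate(double newRate) { rate = newRate; }

    // True if the request may proceed; otherwise the caller queues or delays it.
    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + rate * (now - lastRefillNanos) / 1e9);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}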

SLIDES 75-79

Retro's architecture has three parts:

1. Pervasive measurement: resource usage is aggregated locally, then reported centrally once per second.
2. Centralized controller: policies run in a continuous control loop over a global, abstracted view of the system, exposed through the Retro controller API.
3. Distributed enforcement: enforcement is coordinated across control points using a distributed token bucket.

Together, these form a "control plane" for resource management. Because policies see a global, abstracted view of the system, they are easier to write and reusable.
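One plausible coordination step for the distributed token bucket: the controller periodically splits each workflow's global rate across control points in proportion to their recently observed demand. A minimal Java sketch under that assumption (the class name and the proportional-share rule are illustrative, not necessarily Retro's exact mechanism):

import java.util.HashMap;
import java.util.Map;

// Splits a workflow's global rate among control points proportionally
// to each point's recent demand (requests/sec). Illustrative only.
class DistributedRateAllocator {
    static Map<String, Double> allocate(double globalRate, Map<String, Double> demand) {
        double total = demand.values().stream().mapToDouble(Double::doubleValue).sum();
        Map<String, Double> rates = new HashMap<>();
        for (Map.Entry<String, Double> e : demand.entrySet()) {
            double share = (total == 0)
                ? globalRate / demand.size()   // no demand observed: split evenly
                : globalRate * e.getValue() / total;
            rates.put(e.getKey(), share);
        }
        return rates;
    }
}

Each control point would then call setRate on its local token bucket with its assigned share.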

SLIDES 80-85

Example: LatencySLO policy

  • H: high-priority workflows, each with a latency guarantee such as "200 ms average request latency"
  • L: low-priority workflows, which may use spare capacity

The policy monitors latencies, attributes interference, and throttles the interfering low-priority workflows.

SLIDES 86-94

The LatencySLO policy in pseudocode. It selects the high-priority workflow W with the worst performance, weights low-priority workflows by their interference with W, and throttles the low-priority workflows proportionally to their weight:

foreach candidate in H
    miss[candidate] = latency(candidate) / guarantee[candidate]
W = candidate in H with max miss[candidate]
foreach rsrc in resources()               // importance of each resource to W
    importance[rsrc] = latency(W, rsrc) * log(slowdown(rsrc))
foreach lopri in L                        // low-priority workflow interference
    interference[lopri] = Σ_rsrc importance[rsrc] * load(lopri, rsrc) / load(rsrc)
foreach lopri in L                        // normalize interference
    interference[lopri] /= Σ_k interference[k]
foreach lopri in L
    if miss[W] > 1                        // throttle
        scalefactor = 1 - α * (miss[W] - 1) * interference[lopri]
    else                                  // release
        scalefactor = 1 + β
    foreach cpoint in controlpoints()     // apply new rates
        set_rate(cpoint, lopri, scalefactor * get_rate(cpoint, lopri))
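As a worked example of the throttle step (with illustrative numbers, not from the paper): if α = 1, the worst high-priority workflow misses its guarantee by 50% (miss[W] = 1.5), and a low-priority workflow accounts for half of the interference (interference = 0.5), then scalefactor = 1 - 1 × (1.5 - 1) × 0.5 = 0.75, so that workflow's rate at every control point is cut by 25%.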

SLIDES 95-98

Other types of policy:

  • Bottleneck Fairness: detect the most overloaded resource and fair-share it between the tenants using it
  • Dominant Resource Fairness: estimate demands and capacities from measurements

These policies are concise and not system-specific; any resource can be the bottleneck (the policy doesn't care which).

SLIDE 99

Evaluation

SLIDES 100-106

Instrumentation

Retro is implemented in Java as an instrumentation library plus a central controller. To enable Retro in a system:

  • Propagate the workflow ID within the application (like X-Trace or Dapper)
  • Instrument resources with wrapper classes

Overheads: resource instrumentation is automatic using AspectJ; overall, 50-200 lines per system are needed to modify RPCs; Retro's overhead on throughput and latency is at most 1-2%.
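A minimal sketch of what workflow-ID propagation can look like within a process (thread-local context plus explicit transfer to worker threads); the class is hypothetical and stands in for the X-Trace/Dapper-style metadata propagation the slide refers to:

// Hypothetical workflow-ID propagation; not Retro's actual API.
// Across processes, the ID would additionally be serialized into RPC headers.
final class WorkflowContext {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    static void set(String workflowId) { CURRENT.set(workflowId); }
    static String get() { return CURRENT.get(); }
    static void clear() { CURRENT.remove(); }

    // Wrap a task so the submitting thread's workflow ID is restored
    // in the worker thread that eventually runs it.
    static Runnable wrap(Runnable task) {
        final String id = get();
        return () -> {
            set(id);
            try {
                task.run();
            } finally {
                clear();
            }
        };
    }
}

Instrumented resources (the wrapper classes above) would read WorkflowContext.get() to attribute queue and execute time to the current workflow.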

SLIDES 107-114

Experiments

Policies were evaluated for a mixture of systems, workflows, and resources, with results on clusters of up to 200 nodes.

  • Systems: HDFS, HBase, YARN MapReduce, ZooKeeper
  • Workflows: MapReduce jobs (HiBench), HBase (YCSB), HDFS clients, background data replication, background heartbeats
  • Resources: CPU, disk, network (all systems); locks, queues (HDFS, HBase)
  • Policies: LatencySLO, Bottleneck Fairness, Dominant Resource Fairness

See the paper for full experiment results; this talk covers the LatencySLO policy results.

SLIDES 115-127

[Figure: SLO-normalized latency (log scale, 0.1-1000) over 30 minutes for six workflows (HDFS read 8k, HBase read 1 cached row, HBase read 1 row, HBase cached table scan, HDFS mkdir, HBase table scan) relative to the SLO target, alongside the slowdown over time of the contended resources: disk, CPU, HDFS NameNode lock, HDFS NameNode queue, and HBase queue.]

SLIDES 128-139

[Figure: the same experiment with the SLO policy enabled. SLO-normalized latencies now stay close to the SLO target, with values around 0.2-1 for the HBase table scan, HDFS mkdir, and HBase cached table scan workflows.]

SLIDES 140-145

Conclusion

Retro provides centralized resource management for shared distributed systems. It is comprehensive, covering resources, processes, tenants, and background tasks, and it offers abstractions for writing concise, general-purpose policies:

  • Workflows
  • Resources (slowdown, load)
  • Control points

http://cs.brown.edu/~jcmace