Henge: Intent-Driven Multi-Tenant Stream Processing

Faria Kalim, Le Xu, Sharanya Bathey, Richa Meherwal, Indranil Gupta

Distributed Protocols Research Group, Department of Computer Science, University of Illinois at Urbana-Champaign


SLIDE 1

Henge: Intent-Driven Multi-Tenant Stream Processing

Faria Kalim, Le Xu, Sharanya Bathey, Richa Meherwal, Indranil Gupta

Distributed Protocols Research Group, Department of Computer Science, University of Illinois at Urbana-Champaign

http://dprg.cs.uiuc.edu/

SLIDE 2

Henge allows stream processing jobs to satisfy user-specified performance requirements while reducing costs by performing online resource reconfigurations in a multi-tenant environment.

SLIDE 3

A Typical Deployment

Figure: Job 1, Job 2, Job 3, Job 4

Per-job clusters

Overprovisioning

SLIDE 6

A Typical Deployment

Figure: Job 1, Job 2, Job 3, Job 4

Low-level metrics (e.g., queue sizes, CPU load) as performance indicators

Manual scaling

SLIDE 10

Intent-Driven Multi-Tenancy

Efficient resource usage across multiple users ➔ Multi-tenancy

Application-aware adaptation to user requirements ➔ Intent-driven multi-tenancy

Job   Description                     Service Level Objective (SLO)
1     Finding ride price              Latency < 5 s
2     Analyzing earnings over time    Throughput > 10K/hr

CPU Load, Queue Sizes, …
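
The two example jobs above pair each job with a user-facing SLO rather than a low-level metric. A minimal sketch of how such intents could be written down as data follows. The `JobIntent` record, its field names, and the metric strings are hypothetical and chosen for illustration only; they are not Henge's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobIntent:
    """Hypothetical record for a user-specified SLO (not Henge's actual API)."""
    job: str
    metric: str       # assumed values: "latency_s" or "throughput_per_hr"
    threshold: float

    def is_satisfied(self, measured: float) -> bool:
        # A latency SLO is an upper bound; a throughput SLO is a lower bound.
        if self.metric == "latency_s":
            return measured < self.threshold
        if self.metric == "throughput_per_hr":
            return measured > self.threshold
        raise ValueError(f"unknown metric: {self.metric}")


# The two jobs from the slide's table.
ride_price = JobIntent("Finding ride price", "latency_s", 5.0)
earnings = JobIntent("Analyzing earnings over time", "throughput_per_hr", 10_000)
```

With this representation, `ride_price.is_satisfied(3.2)` checks a measured latency of 3.2 s against the 5 s bound.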

SLIDE 12

Problem

How can we achieve user-facing service level objectives for stream processing jobs on multi-tenant clusters?

User-facing SLOs: Latency, Throughput

SLIDE 17

Absolute Throughput SLOs are not Useful

Figure: Workload variability. Input and output rate (tuples/s) over Day 1 and Day 2, with a candidate absolute threshold marked "SLO?"

SLIDE 19

Absolute Throughput SLOs are not Useful

Figure: Job operations (Filter, …)

Juice: fraction* of the input data processed by the job per unit time.
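
To make the juice definition concrete, here is a minimal sketch. It assumes the caller already knows how many of the arriving input tuples were fully processed in an interval, so operators that drop tuples (such as a filter) do not lower the value. The function name, its parameters, and the example numbers are illustrative assumptions, not the authors' implementation.

```python
def juice(input_tuples: float, processed_input_tuples: float) -> float:
    """Fraction of the arriving input that the job processed in one interval.

    Simplifying assumption: `processed_input_tuples` counts *input* tuples
    whose processing finished, regardless of how many output tuples resulted.
    """
    if input_tuples <= 0:
        return 1.0  # nothing arrived, so nothing was left unprocessed
    return min(1.0, processed_input_tuples / input_tuples)


# Hypothetical numbers: a busy day and a quiet day.
day1 = juice(input_tuples=1000, processed_input_tuples=900)
day2 = juice(input_tuples=200, processed_input_tuples=200)
```

On the quiet day the job processes fewer tuples in absolute terms but keeps up with everything that arrives, which is why a relative measure is more useful as an SLO than an absolute throughput.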

SLIDE 22

Job Utility Functions

Jobs benefit even below SLO threshold

Figure: Utility function for a single job. Labels: Latency, SLO Threshold, Expected Utility, Current Utility

Henge’s goal: Maximize the total utility of the cluster
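
A utility function of this kind can be sketched as follows. The knee shape used here (full utility at or under the latency SLO, then decay proportional to SLO/latency) is an assumption made for illustration; the slide only states that a job keeps some utility when it misses its SLO. All names and numbers are hypothetical.

```python
def latency_utility(expected_utility: float, slo_latency_s: float, latency_s: float) -> float:
    """Current utility of one job with a latency SLO (assumed knee shape)."""
    if latency_s <= 0:
        raise ValueError("latency must be positive")
    # Full utility when the SLO is met, a shrinking fraction when it is missed.
    return expected_utility * min(1.0, slo_latency_s / latency_s)


def total_utility(jobs) -> float:
    """Sum of current utilities; `jobs` holds (expected, slo, latency) tuples."""
    return sum(latency_utility(*job) for job in jobs)


# Two hypothetical jobs: one meets its 5 s SLO, one runs at 10 s.
jobs = [(35.0, 5.0, 4.0), (35.0, 5.0, 10.0)]
cluster_utility = total_utility(jobs)
```

The second job still contributes half of its expected utility, so a scheduler that maximizes the sum has a reason to improve it even before it reaches the threshold.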

SLIDE 23

Background: Stream Processing Topologies (Jobs)

Figure: Logical DAG for a Word Count Job. Operators: Spout, Splitter, Count

SLIDE 24

Figure: Diamond Topology and Star Topology, each drawn with spouts, bolts, and sinks

SLIDE 26

Background: Stream Processing Jobs

Figure: Word Count job (Spout, Splitter, Count) and its executors (threads). Tuples shown: [“So it goes…”], [“So”], [“it”], [“goes”], …

Executors (Threads)

Parallelism 2
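
The word count job above can be imitated in a few lines. This is a sequential simulation, not a threaded Storm topology. It assumes the parallelism applies to the Count operator and that words are routed to Count executors by a hash of the word; both are assumptions for illustration, and all function names are hypothetical.

```python
from collections import Counter
from zlib import crc32


def spout():
    """Source operator: emits sentences as tuples."""
    yield "So it goes"
    yield "it goes on"


def splitter(sentences):
    """Splits each sentence tuple into one tuple per word."""
    for sentence in sentences:
        for word in sentence.split():
            yield word


def run_word_count(parallelism: int = 2):
    """Route words to `parallelism` Count executors and return their counters."""
    executors = [Counter() for _ in range(parallelism)]
    for word in splitter(spout()):
        # Assumed routing: the same word always reaches the same executor.
        index = crc32(word.encode("utf-8")) % parallelism
        executors[index][word] += 1
    return executors


counts = run_word_count(parallelism=2)
merged = sum(counts, Counter())
```

Because routing is keyed on the word, each word is counted by exactly one executor, and merging the per-executor counters gives the overall word counts.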

SLIDE 28

Background: A Physical Deployment

Figure: Spout, Splitter, and two Count executors placed on workers

SLIDE 30

Henge’s Cluster-Wide State Machine

Figure: State machine with states Converged and Not Converged. Condition label: Total System Utility < Total Expected Utility. Transition labels: Reconfiguration; Reversion or Reconfiguration; Reduction

SLIDE 32

Reconfiguration

De-congest an operator by increasing its parallelism level (number of executors)

State machine annotations (states: Converged, Not Converged):
1) Reconfiguration
2) Reconfiguration
3) Black-list topologies that show less than Δ% improvement
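
The reconfiguration step and the black-listing rule can be sketched as two small functions. The step size of one executor, the way improvement is measured (relative change in utility), and the example value of Δ are assumptions for illustration; the slide gives none of them.

```python
from __future__ import annotations


def scale_up(parallelism: dict[str, int], congested: set[str], step: int = 1) -> dict[str, int]:
    """Return a new parallelism map with congested operators given more executors."""
    return {
        op: count + step if op in congested else count
        for op, count in parallelism.items()
    }


def should_blacklist(utility_before: float, utility_after: float, delta_pct: float) -> bool:
    """True if a reconfiguration improved utility by less than delta_pct percent."""
    if utility_before <= 0:
        return utility_after <= 0  # no baseline: any gain counts as improvement
    improvement_pct = 100.0 * (utility_after - utility_before) / utility_before
    return improvement_pct < delta_pct


# Hypothetical topology in which the Splitter is the congested operator.
before = {"spout": 1, "splitter": 1, "count": 2}
after = scale_up(before, congested={"splitter"})
```

A topology whose utility moves from 10.0 to 10.2 after such a step has improved by 2%, so with an assumed Δ of 5 it would be black-listed from further reconfiguration.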

SLIDE 33

Figure: Spout, Splitter, and two Count executors placed on workers. Label: Bottlenecks

SLIDE 34

Reconfiguration

Figure: Spout, three Splitter executors, and two Count executors placed on workers. Labels: Bottlenecks, Reconfigs.

SLIDE 36

Figure: Spout, three Splitter executors, and two Count executors placed on workers, next to an SLO-Satisfying Job. Labels: Bottlenecks, Reconfigs., High Load

SLIDE 38

Reduction

Figure labels: Bottlenecks, Reconfigs., High Load, Reduction

SLIDE 40

Reduction

Reconfigurations cause a drop in utility

If there is high CPU load on a majority of machines, reduce parallelism for operators that
a) are in topologies that satisfy their SLO
b) are not congested

Figure: State machine with Not Converged and Reduction
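
The reduction rule can be sketched as a filter over operators. The `Operator` record, the 0.8 CPU load cutoff for "high load", and the guard that keeps at least one executor per operator are assumptions for illustration; the slide only states the majority condition and criteria a) and b).

```python
from dataclasses import dataclass


@dataclass
class Operator:
    """Hypothetical view of one operator as seen by the scheduler."""
    name: str
    parallelism: int
    congested: bool
    topology_meets_slo: bool


def reduction_candidates(cpu_loads, operators, high_load=0.8):
    """Return names of operators whose parallelism may be reduced."""
    overloaded = sum(1 for load in cpu_loads if load > high_load)
    if 2 * overloaded <= len(cpu_loads):
        return []  # no majority of machines is under high CPU load
    return [
        op.name
        for op in operators
        # a) topology satisfies its SLO, b) operator is not congested;
        # the parallelism > 1 guard is an added assumption.
        if op.topology_meets_slo and not op.congested and op.parallelism > 1
    ]


ops = [
    Operator("a.count", 3, congested=False, topology_meets_slo=True),
    Operator("a.splitter", 2, congested=True, topology_meets_slo=True),
    Operator("b.count", 4, congested=False, topology_meets_slo=False),
]
```

Here only `a.count` qualifies when two of three machines are overloaded: `a.splitter` is congested, and `b.count` belongs to a topology that misses its SLO.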

SLIDE 41

Reversion

Reconfigurations cause a drop in utility and reduction is not possible

Revert to a past configuration that provided the best utility

Figure: State machine with Converged, Not Converged, and Reversion
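
One way to read the state machine slides is as a small decision function plus a lookup of the best past configuration. This is an interpretation under stated assumptions: the cluster is treated as converged when total utility reaches total expected utility, and the history is assumed to be a list of (configuration, utility) pairs. All names are hypothetical.

```python
def next_action(total_utility, total_expected_utility, utility_dropped, reduction_possible):
    """Pick the scheduler's next step (an assumed reading of the state machine)."""
    if total_utility >= total_expected_utility:
        return "converged"
    if not utility_dropped:
        return "reconfiguration"
    # Reconfigurations caused a drop in utility: reduce if possible, else revert.
    return "reduction" if reduction_possible else "reversion"


def best_past_configuration(history):
    """Return the configuration with the highest recorded utility.

    `history` is a list of (configuration, total_utility) pairs, oldest first.
    """
    if not history:
        raise ValueError("no past configuration to revert to")
    return max(history, key=lambda entry: entry[1])[0]


# Hypothetical history of Count parallelism settings and the utility each gave.
history = [({"count": 2}, 80.0), ({"count": 3}, 93.5), ({"count": 4}, 90.0)]
target = best_past_configuration(history)
```

In this example the scheduler would revert to the configuration with three Count executors, since it recorded the highest utility.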

SLIDE 42

Evaluation

Real-world workloads: Yahoo!, Twitter, Web log traces

Experimental setup: 10-40 node Emulab cluster

SLIDE 45

Reducing cost and achieving high utilities

93.5% utility at 40% resources

100% utility at 60% resources

SLIDE 46

Adapting to a Diurnal Pattern

SLIDE 49

Figure: Reconfigurations and Max. Utility over Day 1 and Day 2

Fewer reconfigurations are required once a job has adjusted to max load

SLIDE 51

Can Henge do better than manual configuration?

Henge does better in the 15th to 45th percentile, and is comparable later.

SLIDE 53

Scaling Cluster Size

Limited resources entail more reconfigurations to reach max. utility

SLIDE 54

More Results

Henge can:

  • handle dynamic workloads
      • abrupt, e.g., spikes & natural fluctuations
      • gradual, e.g., diurnal patterns
  • satisfy hybrid SLOs
  • scale with number of jobs & cluster size
  • gracefully handle failures

SLIDE 55

Summary

  • Henge allows users to specify performance intents for their jobs
  • Henge’s goal is to maximize cluster-wide utility
  • The scheduler performs fine-grained reconfigurations to allow stream processing jobs to meet user-specified intents