Rhythm: Component-distinguishable Workload Deployment in Datacenters



SLIDE 1

Rhythm: Component-distinguishable Workload Deployment in Datacenters

Laiping Zhao1, Yanan Yang1, Kaixuan Zhang1, Xiaobo Zhou1, Tie Qiu1, Keqiu Li1, Yungang Bao2

1Tianjin University, 2Institute of Computing Technology, CAS

College of Intelligence and Computing

SLIDE 2

Outline

• Background
• Interference on LC components
• Rhythm Controller
• Experimental Evaluation
• Conclusion

SLIDE 3

Background

• Low resource utilization in datacenters:
  • Aliyun: the average CPU utilization of the co-located cluster approaches 40% [Guo, 2019].
  • Improved, but utilization remains low.

SLIDE 4

Background

• Co-location improves resource utilization, but interference causes unpredictable latency.
• Existing approaches:
  • Profile the workload, then schedule jobs in a cross-complementing way.
  • Monitor in real time and passively adjust resource allocation.

SLIDE 5

Background

• Many-component services:
  • SocialNetwork service: 31 microservices. Source: [DeathStarBench, ASPLOS'19]
  • A single transaction crosses ~40 racks of ~60 servers each; each arc is a client-server RPC. Source: [Google, "Datacenter Computers: modern challenges in CPU design", 2015]

SLIDE 6

Problem

• How can we apply feedback control when a request is served by multiple components collaboratively?

• Latency: M_overall = M_comp1 + M_comp2 + ⋯
• Tail latency: TM_overall = g(TM_comp1, TM_comp2, ⋯)

• Given an overall tail latency (TL) target, how do we derive a sub-TL for each component?
• Or: how does component-level control affect the overall TL?
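The two formulas above capture the difficulty: per-request latencies add up across the components on a request's path, but tail latencies combine through some unknown function g rather than by simple addition. A small numpy sketch (the two component latency distributions are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical per-request service times (ms) of two components
# on the same request path.
comp1 = rng.exponential(scale=2.0, size=n)
comp2 = rng.exponential(scale=5.0, size=n)

overall = comp1 + comp2            # latency adds per request

p99_overall = np.percentile(overall, 99)
p99_sum = np.percentile(comp1, 99) + np.percentile(comp2, 99)

# Means compose linearly; 99th percentiles do not.
assert np.isclose(overall.mean(), comp1.mean() + comp2.mean())
assert p99_overall < p99_sum       # adding per-component p99s overshoots here
```

For these distributions g(⋯) sits well below the naive sum of per-component p99s, which is why splitting an overall TL into per-component sub-TLs is non-trivial.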

SLIDE 7

Inconsistent Interference Tolerance

• Components' performance differs significantly (by up to ~435%) under the same source of interference.

[Figures: Redis architecture; E-commerce architecture]

SLIDE 8

Rhythm Design

• Rhythm insight:
  • Components with smaller contributions to the tail latency can be co-located with BE jobs more aggressively.
• Challenges:
  • How do we quantify the contribution of a component?
  • How do we control the BE deployment aggressively?
    • When should we co-locate?
    • How many BEs can we co-locate with the LC service?

SLIDE 9

Rhythm

• Inconsistent interference tolerance;
• Tracking user requests.

SLIDE 10

Request Tracer

• Causal path graph:
  • Send/receive events: ACCEPT, RECV, SEND, CLOSE
  • Event: <type, timestamp, context identifier, message identifier>
  • Context: <hostIP, programName, processID, threadID>
  • Message: <senderIP, senderPort, receiverIP, receiverPort, messageSize>
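The event and identifier tuples above are enough to stitch per-request arcs together: a SEND and a RECV that share a message identifier form one edge of the causal path graph. A minimal sketch of that stitching (the grouping logic is an illustrative assumption, not Rhythm's actual implementation):

```python
from dataclasses import dataclass
from collections import defaultdict

# Field layout follows the slide's event schema; the matching logic
# below is only an illustration.
@dataclass
class Event:
    type: str          # one of ACCEPT, RECV, SEND, CLOSE
    timestamp: float
    context: tuple     # (hostIP, programName, processID, threadID)
    message: tuple     # (senderIP, senderPort, receiverIP, receiverPort, messageSize)

def causal_paths(events):
    """Group events by message identifier and order each group by time,
    yielding one send->receive arc per message."""
    by_msg = defaultdict(list)
    for ev in events:
        by_msg[ev.message].append(ev)
    return {msg: sorted(evs, key=lambda e: e.timestamp)
            for msg, evs in by_msg.items()}
```

Given a SEND recorded on one host and a RECV on another with the same message identifier, the time-sorted pair is one client-server arc of the graph.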


SLIDE 11

Rhythm

• Inconsistent interference tolerance;
• Tracking user requests;
• Servpod abstraction:
  • A collection of service components from one LC service that are deployed together on the same physical machine.
  • Used to derive the sojourn time of each request in each server.
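Given the traced events, a servpod's per-request sojourn time can be derived as the span between the request entering and leaving that machine. A sketch under that assumption; the flat (type, timestamp, request_id) event shape here is simplified for illustration:

```python
def sojourn_times(events):
    """Per-request sojourn time on one servpod's machine, taken as the
    last SEND timestamp minus the first RECV timestamp of each request.
    Events are (type, timestamp, request_id) tuples (simplified shape)."""
    first_recv, last_send = {}, {}
    for etype, ts, req in events:
        if etype == "RECV":
            first_recv.setdefault(req, ts)   # keep earliest arrival
        elif etype == "SEND":
            last_send[req] = ts              # keep latest departure
    return {req: last_send[req] - first_recv[req]
            for req in first_recv if req in last_send}
```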

SLIDE 12

Rhythm

• Inconsistent interference tolerance;
• Tracking user requests;
• Contribution analyzing;
• Servpod abstraction.

SLIDE 13

Contribution Analyzer

• Servpods with a higher average sojourn time contribute more to the tail latency (TL).
• Servpods with a higher sojourn-time variance contribute more to the TL.
• Servpods whose sojourn time correlates strongly with the tail latency contribute more to it.
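The three signals above (mean, variance, and correlation with the overall latency) can be folded into one per-servpod ranking. Combining them as a normalized product, as below, is an assumption made for illustration; the slide does not give Rhythm's exact formula:

```python
import numpy as np

def contribution_scores(sojourn, overall):
    """Rank servpods by their contribution to the tail latency.

    sojourn: dict servpod -> per-request sojourn times (aligned arrays)
    overall: per-request end-to-end latency

    Each servpod's score multiplies its sojourn-time mean, variance,
    and (non-negative) correlation with the overall latency, then the
    scores are normalized to sum to 1. This weighting is illustrative.
    """
    overall = np.asarray(overall)
    scores = {}
    for pod, t in sojourn.items():
        t = np.asarray(t)
        corr = np.corrcoef(t, overall)[0, 1]
        scores[pod] = t.mean() * t.var() * max(corr, 0.0)
    total = sum(scores.values()) or 1.0
    return {pod: s / total for pod, s in scores.items()}
```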


SLIDE 14

Contribution Analyzer

• Is this definition effective?
  • Sensitivity vs. contributions.
• The increase in 99th-percentile latency when a single servpod is interfered with by different BEs:
  • Mixed BEs: wordcount, imageClassify, lstm, CPU-stress, stream-dram, and stream-llc.
  • DRAM-intensive: stream-dram
  • CPU-intensive: CPU-stress
  • LLC-intensive: stream-llc

SLIDE 15

Rhythm

• Inconsistent interference tolerance;
• Tracking user requests;
• Contribution analyzing;
• Servpod abstraction;
• Controller:
  • Loadlimit: allow co-location when load < loadlimit;
    • set at the "knee point" of the performance-load curve.
  • Slacklimit: the lower bound of slack for allowing the growth of BEs.
    • Slack = SLA − currentTL;
    • Smaller contribution → larger slacklimit.

[Figure: LC servpods co-located with BE jobs across agents]

SLIDE 16

Controller

• When can we co-locate workloads? → Loadlimit.
• Loadlimit per servpod:
  • the upper bound on request load for allowing co-location with BE jobs;
  • knee point: 76% of the maximum load for MySQL; 87% for Tomcat.
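The knee point itself can be located with a standard curve heuristic; the slide does not say how Rhythm finds it, so the farthest-from-chord method below is just one plausible choice:

```python
import numpy as np

def knee_point(load, latency):
    """Find the knee of a latency-load curve as the point farthest from
    the straight line joining the curve's endpoints (a common heuristic,
    assumed here for illustration)."""
    x = np.asarray(load, dtype=float)
    y = np.asarray(latency, dtype=float)
    # Normalize both axes so distances are comparable.
    xn = (x - x[0]) / (x[-1] - x[0])
    yn = (y - y[0]) / (y[-1] - y[0])
    # Distance of each point from the chord yn = xn.
    dist = np.abs(yn - xn)
    return x[int(np.argmax(dist))]
```

On a hockey-stick curve (flat latency, then a sharp rise), this picks a load near where latency takes off, which is the load limit the slide describes.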


SLIDE 17

Controller

• How many BEs can we co-locate?
  • Slacklimit: the lower bound of slack for allowing the growth of BE jobs.
  • Slack = SLA − currentTL.
• Co-locating decisions (two-servpod example, contribution1 < contribution2):
  • Initially, slacklimit1 = slacklimit2 = 1.
  • Each round then scales slacklimit_i by (1 − contribution_i), so the two limits drift apart at rates set by the servpods' contributions.
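The per-round scaling shown in the diagram gives each servpod a slacklimit schedule. The function below models only that schedule; how Rhythm compares the resulting limit against the current slack is not spelled out on the slide, so no gating logic is included:

```python
def slacklimit(contribution, rounds, init=1.0):
    """Slacklimit of a servpod after a number of BE-growth rounds.

    Per the slide's diagram: every servpod starts at a normalized limit
    of 1, and each round multiplies the limit by (1 - contribution_i).
    The comparison against current slack is left out (not shown here).
    """
    return init * (1.0 - contribution) ** rounds
```

For example, after three rounds a servpod with contribution 0.4 is down to 0.6³ ≈ 0.216, while one with contribution 0.1 is still at 0.9³ = 0.729; the limits diverge at a rate set by each servpod's contribution.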

SLIDE 18

Experimental Evaluation

• Benchmarks:
  • LC services:
    • Apache Solr: Solr engine + ZooKeeper
    • Elasticsearch: Index + Kibana
    • Elgg: Web server + Memcached + MySQL
    • Redis: Master + Slave
    • E-commerce: HAProxy + Tomcat + Amoeba + MySQL
  • BE tasks:
    • CPU-stress; stream-LLC; stream-DRAM
    • iperf: network
    • LSTM: mixed
    • Wordcount
    • ImageClassify: deep learning
• Testbed:
  • 16 sockets, 64 GB of DRAM per socket; each socket shares 20 MB of L3 cache.
  • Intel Xeon E7-4820 v4 @ 2.0 GHz: 32 KB L1 cache and 256 KB L2 cache per core.
  • OS: Ubuntu 14.04, kernel 4.4.0-31.

SLIDE 19

Overall Analysis

• Compared to Heracles [ISCA 2015], Rhythm:
  • improves EMU (= LC throughput + BE throughput) by 11.6%–24.6%;
  • improves CPU utilization by 19.1%–35.3%;
  • improves memory-bandwidth utilization by 16.8%–33.4%.

[Figures: EMU; CPU utilization; memory-bandwidth utilization]

SLIDE 20

Timeline Analysis

• Timeline:
  • Time 3.3: suspendBE();
  • Time 5.6: allowBEGrowth();
  • Time 7.7: cutBE();
  • Time 9.3: suspendBE().

SLIDE 21

Conclusion

• Rhythm: a deployment controller that maximizes resource utilization while guaranteeing the LC service's tail-latency requirement.
  • Request tracer
  • Contribution analyzer
  • Controller
• Experiments demonstrate improvements in system throughput and resource utilization.

SLIDE 22

Thank you! Questions?
