SLIDE 1

Work Stealing for Interactive Services to Meet Target Latency

Jing Li∗, Kunal Agrawal∗, Sameh Elnikety†, Yuxiong He†, I-Ting Angelina Lee∗, Chenyang Lu∗, Kathryn S. McKinley†
∗Washington University in St. Louis  †Microsoft Research

*This work was initiated and partly done during Jing Li’s internship at Microsoft Research in summer 2014.

SLIDE 2

Interactive services must meet a target latency

Interactive services

• Search, ads, games, finance
• Users demand responsiveness

SLIDE 3

Interactive services must meet a target latency

Interactive services

• Search, ads, games, finance
• Users demand responsiveness

Problem setting

• Multiple requests arrive over time
• Each request is parallelizable
• Latency = completion time – arrival time
• Each request’s latency should be less than a target latency T

Goal: maximize the number of requests that meet a target latency T

SLIDE 4

Latency in Internet search

• In industrial interactive services, thousands of servers together serve a single user query.
• End-to-end latency ≥ latency of the slowest server

[Diagram: a search query is parsed, fanned out to many parallel “Doc lookup & ranking” servers, then results are aggregated and snippets generated; the end-to-end response time is ~100 ms for the user to find the service responsive, which sets the target latency.]

SLIDE 5

Goal: Meet the Target Latency in a Single Server

• Goal: design a scheduler to maximize the number of requests that can be completed within the target latency (in a single server)

[Diagram: the same search pipeline, zoomed in on a single “Doc lookup & ranking” server that must meet the target latency.]

SLIDE 6

Large requests must execute in parallel to meet the target latency

[Figure: requests plotted by sequential execution time (work, in ms) against the target latency; a request whose work exceeds the target latency cannot meet it on one core.]

Sequential execution is insufficient

SLIDE 7

Full parallelism does not always work well

Target latency: 90ms

[Figure: a large request with 270 ms of work and a small request with 60 ms of work.]

SLIDE 8

Full parallelism does not always work well

Target latency: 90ms
Case 1: 1 large request + 3 small requests

[Figure: the large request (270 ms of work) arrives at time 0 and must finish by time 90; three small requests (60 ms each) arrive at time 20 and must finish by time 110.]

SLIDE 9

Full parallelism does not always work well

Target latency: 90ms
Case 1: 1 large request + 3 small requests

[Figure: under full parallelism on 3 cores, the large request occupies all cores from time 0 to 90 and meets its deadline; the small requests wait behind it and finish at times 110, 130, and 150.]

Miss 2 requests: two of the three small requests miss their deadline of 110 because they were waiting while the large request ran.

SLIDE 10

Full parallelism does not always work well

Target latency: 90ms
Case 1: 1 large request + 3 small requests

[Figure: two schedules compared. Full parallelism: the large request holds all 3 cores until time 90 and the small requests finish at 110, 130, and 150; miss 2 requests. Serializing the large request on core 1: it runs until time 270 and misses its own deadline, but the small requests finish at 50, 80, and 110 on the remaining cores; miss 1 request.]
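The arithmetic behind these two schedules, assuming 3 cores and perfectly divisible work (consistent with the slide’s numbers):

```latex
% Full parallelism: the large request takes all 3 cores; smalls run after it.
\begin{align*}
t_{\mathrm{large}}   &= \tfrac{270}{3} = 90 \le 90 \quad\text{(met)}\\
t_{\mathrm{small},i} &= 90 + i \cdot \tfrac{60}{3} = 110,\ 130,\ 150 \quad\text{(deadline 110: two miss)}
\intertext{Serializing the large request on one core; smalls share the other two:}
t_{\mathrm{large}}   &= 270 > 90 \quad\text{(miss)}\\
t_{\mathrm{small},i} &= 20 + i \cdot \tfrac{60}{2} = 50,\ 80,\ 110 \quad\text{(deadline 110: all met)}
\end{align*}
```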

SLIDE 11

Some large requests require parallelism

Target latency: 90ms
Case 2: 1 large request + 1 small request

[Figure: the large request (270 ms) arrives at time 0 with deadline 90; the small request (60 ms) arrives at time 20 with deadline 110.]

SLIDE 12

Some large requests require parallelism

Target latency: 90ms
Case 2: 1 large request + 1 small request

[Figure: two schedules compared. Serializing the large request: it runs on one core until time 270 and misses, while the small request finishes at 80; miss 1 request. Full parallelism: the large request uses all 3 cores until time 90 and meets its deadline, and the small request then finishes at 110; miss 0 requests.]
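The same arithmetic for Case 2, under the same assumptions:

```latex
\begin{align*}
\text{Full parallelism: } & t_{\mathrm{large}} = \tfrac{270}{3} = 90 \le 90, &
  t_{\mathrm{small}} &= 90 + \tfrac{60}{3} = 110 \le 110 \quad\text{(miss 0)}\\
\text{Serialize large: }  & t_{\mathrm{large}} = 270 > 90 \ \text{(miss)}, &
  t_{\mathrm{small}} &= 20 + 60 = 80 \le 110 \quad\text{(miss 1)}
\end{align*}
```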

SLIDE 13

Strategy: adapt scheduling to load

Case 1: cannot afford to run all large requests in parallel
Case 2: do need to run some large requests in parallel

[Figure: the two winning schedules. Running the large request in parallel (Case 2) misses 0 requests; serializing it (Case 1) misses only 1 request.]

SLIDE 14

Strategy: adapt scheduling to load

High load → run large requests sequentially (cannot afford to run all large requests in parallel)
Low load → run all requests in parallel (some large requests do need parallelism to finish in time)

[Figure: the same two schedules as on the previous slide.]

SLIDE 15

Why does the adaptive strategy work?

Latency = processing time + waiting time

At low load, processing time dominates latency
• Parallel execution reduces request processing time
• All requests run in parallel

At high load, waiting time dominates latency
• Executing a large request in parallel increases the waiting time of many more later-arriving requests
• Each large request that is sacrificed helps to reduce the waiting time of many more later-arriving requests
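To see this tradeoff end to end, here is a minimal, clairvoyant simulation sketch (our own illustration, not the paper’s scheduler): requests whose work exceeds a threshold run serialized on a reserved core, and everything else runs fully parallel in FIFO order. The 100 ms threshold is an arbitrary choice that classifies the 270 ms request as large; the miss counts reproduce the slides’ Case 1 and Case 2.

```cpp
#include <algorithm>
#include <cstdio>
#include <limits>
#include <vector>

struct Req { double arrival, work, deadline; };  // times in ms

// Deadline misses when requests with work > threshold each run serialized on
// a reserved core and all other requests share the remaining cores, FIFO.
// threshold = infinity means "parallelize everything on all cores".
static int misses(std::vector<Req> reqs, int cores, double threshold) {
    std::sort(reqs.begin(), reqs.end(),
              [](const Req& a, const Req& b) { return a.arrival < b.arrival; });
    const bool serializeLarge =
        threshold < std::numeric_limits<double>::infinity();
    const int smallCores = serializeLarge ? cores - 1 : cores;
    double largeFree = 0.0, smallFree = 0.0;  // when each core pool is free
    int missed = 0;
    for (const Req& r : reqs) {
        const bool large = r.work > threshold;
        double& freeAt = large ? largeFree : smallFree;
        const double start  = std::max(freeAt, r.arrival);
        const double finish = start + (large ? r.work : r.work / smallCores);
        freeAt = finish;
        if (finish > r.deadline) ++missed;
    }
    return missed;
}

int main() {
    const double kParallelAll = std::numeric_limits<double>::infinity();
    // Case 1: one 270 ms request plus three 60 ms requests.
    std::vector<Req> case1 = {{0, 270, 90}, {20, 60, 110}, {20, 60, 110}, {20, 60, 110}};
    // Case 2: one 270 ms request plus one 60 ms request.
    std::vector<Req> case2 = {{0, 270, 90}, {20, 60, 110}};
    std::printf("case 1: parallel-all misses %d, serialize-large misses %d\n",
                misses(case1, 3, kParallelAll), misses(case1, 3, 100));  // 2 vs 1
    std::printf("case 2: parallel-all misses %d, serialize-large misses %d\n",
                misses(case2, 3, kParallelAll), misses(case2, 3, 100));  // 0 vs 1
    return 0;
}
```

Neither fixed policy wins both cases, which is exactly why the strategy must adapt to load.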

SLIDE 16

Strategy: when load is low, run all requests in parallel; when load is high, run large requests sequentially

Challenge: which request to sacrifice?

SLIDE 17

Strategy: when load is low, run all requests in parallel; when load is high, run large requests sequentially

Challenge 1: non-clairvoyant

• We do not know the work of a request when it arrives

Challenge 2: no accurate definition of large requests

• Large is relative to instantaneous load

Challenge: which request to sacrifice?

SLIDE 18

Strategy: when load is low, run all requests in parallel; when load is high, run large requests sequentially

Challenge 1: non-clairvoyant

• We do not know the work of a request when it arrives

Challenge 2: no accurate definition of large requests

• Large is relative to instantaneous load
  – load = 10: large request > 180 ms
  – load = 20: large request > 80 ms
  – load = 30: large request > 20 ms

Challenge: which request to sacrifice?

SLIDE 19

Contributions

[Diagram: the tail-control scheduler combines an offline threshold calculation with an online runtime.]

SLIDE 20

Contributions

[Diagram: inputs to the offline threshold calculation: the target latency T, the request work distribution, and the requests per second (RPS), all readily available in highly engineered interactive services.]

SLIDE 21

Contributions

[Diagram: from these inputs, the offline component computes a large request threshold for each load value, producing a large request threshold table.]
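As a sketch of the interface only: the table below hard-codes the thresholds from the earlier slide as hypothetical values, and the fallback-to-nearest-entry lookup is one plausible policy of our own; the paper computes real thresholds offline from the work distribution, target latency, and RPS.

```cpp
#include <limits>
#include <map>

// Hypothetical large-request threshold table: instantaneous load (number of
// active requests) -> large-request threshold in ms, one entry per load value.
const std::map<int, double> kThresholdMs = {{10, 180.0}, {20, 80.0}, {30, 20.0}};

// Look up the threshold for the current load, using the nearest entry at or
// below it; below the smallest entry, never serialize (infinite threshold).
double thresholdFor(int load) {
    auto it = kThresholdMs.upper_bound(load);
    if (it == kThresholdMs.begin())
        return std::numeric_limits<double>::infinity();
    return std::prev(it)->second;  // nearest entry <= load
}
```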

SLIDE 22

Contributions

[Diagram: at run time, the online component uses the threshold table to decide which requests to serialize.]

SLIDE 23

Contributions

We modify work stealing to implement tail-control scheduling using Intel Threading Building Blocks (TBB).

[Figure: results preview; the arrow marks the direction of better performance.]

SLIDE 24

Contributions

[Diagram: the complete tail-control scheduler; implementation details are in the paper.]

SLIDE 25

Tail-control scheduler

[Diagram: the offline threshold calculation feeds the threshold table to the online runtime.]

Runtime functionalities:
• Execute all requests in parallel to begin with
• Record the total amount of computation time spent on each request thus far
• Detect large requests based on the current threshold and the current processing time
• Serialize large requests to limit their impact on other waiting requests
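A sketch of the detection step under the assumptions above (hypothetical per-request bookkeeping plus the `thresholdFor` lookup sketched earlier; the real TBB integration is in the paper):

```cpp
#include <atomic>
#include <cstdint>

double thresholdFor(int load);  // the table lookup sketched earlier

// Hypothetical per-request bookkeeping for the online runtime.
struct Request {
    std::atomic<std::int64_t> processedUs{0};  // computation time consumed so far
    std::atomic<bool> serialized{false};       // once set, workers stop stealing it
};

// Called by workers as they finish a chunk of a request's work: charge the
// elapsed time to the request, then compare the total against the threshold
// for the current load to detect (and serialize) large requests.
void accountAndCheck(Request& r, std::int64_t deltaUs, int activeRequests) {
    const std::int64_t totalUs = r.processedUs.fetch_add(deltaUs) + deltaUs;
    if (totalUs / 1000.0 > thresholdFor(activeRequests))   // threshold is in ms
        r.serialized.store(true);  // large: stop parallelizing this request
}
```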

SLIDE 26

Work Stealing for Single Request

• Workers’ local queues
  – Execute work, if there is any in the local queue
  – Steal: take work from another worker’s queue, further parallelizing the request

[Diagram: workers 1, 2, and 3 with local queues holding pieces of request A; arrows labeled “execute” and “parallelize” show a worker running its own work and stealing A’s work from a peer.]
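A minimal sketch of the classic single-request work-stealing loop; this is our own simplification with mutex-protected deques, whereas production runtimes such as TBB use lock-free deques:

```cpp
#include <atomic>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <random>
#include <vector>

using Task = std::function<void()>;

// One deque per worker. The owner pushes and pops at the back (LIFO, good
// locality); idle thieves steal from the front (the oldest, typically largest
// piece of work).
struct WorkerDeque {
    std::mutex m;
    std::deque<Task> q;

    void push(Task t) {
        std::lock_guard<std::mutex> g(m);
        q.push_back(std::move(t));
    }
    std::optional<Task> pop() {    // owner end
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return std::nullopt;
        Task t = std::move(q.back());
        q.pop_back();
        return t;
    }
    std::optional<Task> steal() {  // thief end
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return std::nullopt;
        Task t = std::move(q.front());
        q.pop_front();
        return t;
    }
};

// Each worker drains its own deque ("execute"); when empty, it steals from a
// random victim, which spreads one request across cores ("parallelize").
void workerLoop(int self, std::vector<WorkerDeque>& deques,
                const std::atomic<bool>& done) {
    std::mt19937 rng(self);
    std::uniform_int_distribution<int> pick(0, (int)deques.size() - 1);
    while (!done.load()) {
        if (auto t = deques[self].pop()) { (*t)(); continue; }  // execute
        const int victim = pick(rng);
        if (victim == self) continue;
        if (auto t = deques[victim].steal()) (*t)();            // parallelize
    }
}
```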

SLIDE 27

Generalize Work Stealing to Multiple Requests

• Workers’ local queues + a global queue
  – Execute work, if there is any in the local queue
  – Steal – further parallelize a request
  – Admit – start executing a new request

[Diagram: parallelizable requests B and C wait in the global queue; workers 1, 2, and 3 execute, parallelize (steal), and admit requests.]

SLIDE 28

Implement Tail-Control in TBB

• Workers’ local queues + a global queue
  – Execute work, if there is any in the local queue
  – Steal – further parallelize a request
  – Admit – start executing a new request
• Steal-first (try to reduce processing time)
• Admit-first (try to reduce waiting time)
• Tail-control
  – Steal-first + large request detection & serialization

[Diagram: the same setup as before, with requests B and C in the global queue; the policy decides whether an idle worker steals or admits, as sketched below.]
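A sketch of how an idle worker’s decision differs across the three policies, reconstructed from the slide’s description. The helper functions are stubs of our own standing in for the operations in the earlier sketches; a real implementation would scan victims’ deques and the global queue.

```cpp
#include <functional>
#include <optional>

using Task = std::function<void()>;

enum class Policy { StealFirst, AdmitFirst, TailControl };

// Stubs for the real operations (see the earlier sketches).
std::optional<Task> tryStealAny()           { return std::nullopt; }
// Like tryStealAny, but skips requests whose `serialized` flag is set.
std::optional<Task> tryStealNotSerialized() { return std::nullopt; }
// Takes a new request from the global queue.
std::optional<Task> tryAdmit()              { return std::nullopt; }

// What an idle worker (empty local deque) does under each policy.
std::optional<Task> findWork(Policy p) {
    switch (p) {
    case Policy::StealFirst:   // parallelize first, to cut processing time
        if (auto t = tryStealAny()) return t;
        return tryAdmit();
    case Policy::AdmitFirst:   // admit first, to cut waiting time
        if (auto t = tryAdmit()) return t;
        return tryStealAny();
    case Policy::TailControl:  // steal-first, but never give a serialized
                               // (large) request more cores
        if (auto t = tryStealNotSerialized()) return t;
        return tryAdmit();
    }
    return std::nullopt;
}
```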

SLIDE 29

Evaluation

• Various request work distributions
  – Bing search
  – Finance server
  – Log-normal
• Different request arrivals
  – Poisson
  – Log-normal
• Each setting: 100,000 requests; plot the target latency miss ratio (a workload-generation sketch follows this list)
• Two baselines (generalized from work stealing for a single job)
  – Steal-first: tries to parallelize requests and reduce processing time
  – Admit-first: tries to admit requests and reduce waiting time
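A sketch of how such a workload could be generated with Poisson arrivals and log-normal work; the rate and distribution parameters below are illustrative placeholders, not the values used in the paper:

```cpp
#include <cstdio>
#include <random>
#include <vector>

struct Req { double arrivalMs, workMs; };

// Poisson arrivals = exponential inter-arrival gaps; work is log-normal.
std::vector<Req> makeWorkload(int n, double reqPerSec, unsigned seed) {
    std::mt19937 rng(seed);
    std::exponential_distribution<double> gapMs(reqPerSec / 1000.0);
    std::lognormal_distribution<double> workMs(3.0, 1.0);  // median ~20 ms
    std::vector<Req> reqs;
    double t = 0.0;
    for (int i = 0; i < n; ++i) {
        t += gapMs(rng);                 // next Poisson arrival
        reqs.push_back({t, workMs(rng)});
    }
    return reqs;
}

int main() {
    auto w = makeWorkload(100000, 1000.0, 42);  // 100,000 requests at 1000 RPS
    std::printf("first request: t=%.2f ms, work=%.2f ms\n",
                w[0].arrivalMs, w[0].workMs);
    return 0;
}
```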

SLIDE 30

Improvement in target latency miss ratio

[Figure: miss ratio versus target latency, from hard to easy to meet; the arrow marks the direction of better performance.]

SLIDE 31

Improvement in target latency miss ratio

[Figure: the same plot annotated with relative load from high to low; admit-first wins at high load and steal-first wins at low load.]

SLIDE 32

Improvement in target latency miss ratio

[Figure: further miss-ratio results; the arrow marks the direction of better performance.]

SLIDE 33

The inner workings of tail-control

[Figure: request latency results with the target latency marked.]

SLIDE 34

The inner workings of tail-control

Tail-control sacrifices a few large requests and reduces the latency of many more small requests to meet the target latency.

[Figure: request latency results with the target latency marked.]

SLIDE 35

The inner workings of tail-control

Tail-control sacrifices a few large requests and reduces the latency of many more small requests to meet the target latency.

[Figure: request latency results with the target latency marked.]

SLIDE 36

The inner workings of tail-control

Tail-control sacrifices a few large requests and reduces the latency of many more small requests to meet the target latency.

[Figure: request latency results with the target latency marked.]

SLIDE 37

Tail-control performs well with inaccurate input

SLIDE 38

Tail-control performs well with inaccurate input

Slightly inaccurate input work distributions are still useful.

[Figure: miss ratio as the input work distribution goes from less to more inaccurate.]

SLIDE 39

Related work

Parallelizing a single job to reduce latency

• [Blumofe et al. 1995], [Arora et al. 2001], [Jung et al. 2005], [Ko et al. 2002], [Wang and O’Boyle 2009], …

Interactive server parallelism optimizing for mean response time and tail latency

• [Raman et al. 2011], [Jeon et al. 2013], [Kim et al. 2015], [Haque et al. 2015]

Theoretical results of server scheduling for sequential and parallel jobs optimizing for mean and maximum response time

• [Chekuri et al. 2004], [Torng and McCullough 2008], [Fox and Moseley 2011], [Becchetti et al. 2006], [Kalyanasundaram and Pruhs 1995], [Edmonds and Pruhs 2012], [Agrawal et al. 2016]

SLIDE 40

Take home

• For non-clairvoyant interactive services, the request work distribution is helpful for designing schedulers
• Given the work distribution, we can devise an offline threshold calculation algorithm to compute the large request threshold for every value of instantaneous load
• We have developed an adaptive scheduler that serializes requests according to the threshold table and demonstrated that it works well in practice

[Diagram: the tail-control architecture once more: input feeds the offline threshold calculation, whose threshold table drives the online runtime.]