SLIDE 1 Work Stealing for Interactive Services to Meet Target Latency
Jing Li∗, Kunal Agrawal∗, Sameh Elnikety†, Yuxiong He†, I-Ting Angelina Lee∗, Chenyang Lu∗, Kathryn S. McKinley† — ∗Washington University in St. Louis, †Microsoft Research
*This work was initiated and partly done during Jing Li's internship at Microsoft Research in summer 2014.
SLIDE 2
Interactive services must meet a target latency
Interactive services
• Search, ads, games, finance
• Users demand responsiveness
SLIDE 3
Interactive services must meet a target latency
Interactive services
• Search, ads, games, finance
• Users demand responsiveness
Problem setting
• Multiple requests arrive over time
• Each request is parallelizable
• Latency = completion time – arrival time
• Each request's latency should be less than a target latency T
Goal: maximize the number of requests that meet a target latency T
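To pin down the objective, here is a minimal sketch of the latency definition and the metric the scheduler maximizes; the Request fields and function names are illustrative, not from the paper's implementation.

```cpp
#include <vector>

// Hypothetical request record; field names are illustrative.
struct Request {
    double arrival;     // time the request arrived
    double completion;  // time the request finished
};

// Latency = completion time - arrival time.
double latency(const Request& r) { return r.completion - r.arrival; }

// Fraction of requests that meet the target latency T
// (the scheduler's goal is to maximize this).
double hitRatio(const std::vector<Request>& reqs, double T) {
    int met = 0;
    for (const Request& r : reqs)
        if (latency(r) <= T) ++met;
    return reqs.empty() ? 1.0 : static_cast<double>(met) / reqs.size();
}
```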
SLIDE 4 Latency in Internet search
• In industrial interactive services, thousands of servers together serve a single user query.
• End-to-end latency ≥ latency of the slowest server
[Figure: query pipeline — parsing a search query, then many parallel "doc lookup & ranking" servers, then result aggregation & snippet generation; the end-to-end response time (~100 ms for users to perceive the service as responsive) sets the target latency.]
SLIDE 5 Goal — Meet Target Latency in Single Server
• Goal: design a scheduler that maximizes the number of requests completed within the target latency, on a single server.
[Figure: the same query pipeline, highlighting one "doc lookup & ranking" server and its share of the target latency.]
SLIDE 6 Large requests must execute in parallel to meet the target latency
[Figure: distribution of request sequential execution time (work, in ms); requests whose work exceeds the target latency cannot finish in time on one core.]
Sequential execution is insufficient.
SLIDE 7 Full parallelism does not always work well
Target latency: 90ms
[Figure: two request types — a large request with 270 ms of work and small requests with 60 ms of work each.]
SLIDE 8 Full parallelism does not always work well
Target latency: 90ms Case 1: 1 large request + 3 small requests
[Figure: the large request (270 ms of work) arrives at time 0 and must finish by time 90; three small requests (60 ms of work each) arrive at time 20 and must finish by time 110.]
SLIDE 9 Full parallelism does not always work well
Target latency: 90ms Case 1: 1 large request + 3 small requests
[Figure: fully parallel schedule on 3 cores — the large request runs on all cores and finishes at time 90; the small requests wait behind it, then finish at times 110, 130, and 150.]
✖ Miss 2 requests — the small requests waited too long behind the large one.
SLIDE 10 Full parallelism does not always work well
Target latency: 90ms Case 1: 1 large request + 3 small requests
[Figure: two schedules for Case 1 on 3 cores. Serializing the large request: it runs on one core until time 270 while the small requests share the other cores and finish at times 50, 80, and 110 — all three meet their deadline. Full parallelism: as before, the small requests finish at times 110, 130, and 150.]
✔ Serializing the large request: miss 1 request (only the large one)
✖ Full parallelism: miss 2 requests
SLIDE 11 Some large requests require parallelism
Target latency: 90ms Case 2: 1 large request + 1 small request
[Figure: the large request (270 ms of work) arrives at time 0 and must finish by time 90; one small request (60 ms of work) arrives at time 20 and must finish by time 110.]
SLIDE 12 Some large requests require parallelism
Target latency: 90ms Case 2: 1 large request + 1 small request
[Figure: two schedules for Case 2 on 3 cores. Full parallelism: the large request runs on all cores and finishes at time 90; the small request then finishes at time 110 — both meet their deadlines. Serializing the large request: the small request finishes by time 80, but the large request runs on one core until time 270 and misses.]
✖ Serializing the large request: miss 1 request
✔ Full parallelism: miss 0 requests
SLIDE 13 Strategy: adapt scheduling to load
Case 1: we cannot afford to run all large requests in parallel.
Case 2: we do need to run some large requests in parallel.
[Figure: the winning schedule for each case — full parallelism misses 0 requests in Case 2; serializing the large request misses only 1 in Case 1.]
SLIDE 14 Strategy: adapt scheduling to load
High load (Case 1): run large requests sequentially — we cannot afford to run all large requests in parallel.
Low load (Case 2): run all requests in parallel — we do need to run some large requests in parallel.
[Figure: the same two winning schedules, now labeled by load level.]
SLIDE 15 Why does the adaptive strategy work?
Latency = processing time + waiting time
At low load, processing time dominates latency:
– Parallel execution reduces request processing time
– So all requests run in parallel
At high load, waiting time dominates latency:
– Executing a large request in parallel increases the waiting time of many more later-arriving requests
– Each large request that is sacrificed reduces the waiting time of many more later-arriving requests
SLIDE 16
Strategy: when load is low, run all requests in parallel; when load is high, run large requests sequentially
Challenge: which request to sacrifice?
SLIDE 17 Strategy: when load is low, run all requests in parallel; when load is high, run large requests sequentially
Challenge 1: non-clairvoyance
– We do not know the work of a request when it arrives
Challenge 2: no fixed definition of a large request
– Large is relative to the instantaneous load
Challenge: which request to sacrifice?
SLIDE 18 Strategy: when load is low, run all requests in parallel; when load is high, run large requests sequentially
Challenge 1: non-clairvoyance
– We do not know the work of a request when it arrives
Challenge 2: no fixed definition of a large request
– Large is relative to the instantaneous load; e.g. (see the lookup sketch below):
  load = 10 → large request > 180 ms
  load = 20 → large request > 80 ms
  load = 30 → large request > 20 ms
Challenge: which request to sacrifice?
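One way to picture the load-dependent threshold is as a load-indexed lookup. This sketch uses the example values from the slide; the data layout and the fallback rule for untabulated loads are assumptions, not the paper's implementation.

```cpp
#include <iterator>
#include <map>

// Illustrative threshold table keyed by instantaneous load (number of
// requests in the system). Values are the slide's examples, in ms.
const std::map<int, double> kLargeRequestThreshold = {
    {10, 180.0},  // load = 10 -> requests over 180 ms count as "large"
    {20,  80.0},
    {30,  20.0},
};

// Look up the threshold for the current load, falling back to the
// entry for the nearest smaller load (a simplifying assumption).
double thresholdForLoad(int load) {
    auto it = kLargeRequestThreshold.upper_bound(load);
    if (it == kLargeRequestThreshold.begin())
        return 1e9;  // below the smallest tabulated load: serialize nothing
    return std::prev(it)->second;
}
```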
SLIDE 19 Contributions
[Diagram: tail-control has two components — an offline threshold calculation and the tail-control scheduler.]
SLIDE 20 Contributions
[Diagram: inputs to the offline threshold calculation — the target latency T, the request work distribution, and the load in requests per second (RPS); all are readily available in highly engineered interactive services.]
SLIDE 21 Contributions
[Diagram: from these inputs, the offline calculation computes a large-request threshold for each load value, producing a large-request threshold table (search structure sketched below).]
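The structure of that offline search can be sketched as follows. The cost model — the expected miss count for a candidate threshold at a given load, which the paper derives from the work distribution and target latency — is the substance of the algorithm and is deliberately abstracted into a parameter here; all names are illustrative.

```cpp
#include <functional>
#include <map>
#include <vector>

// Cost model: expected number of missed requests if `threshold` is used
// at `load`. The paper computes this from the work distribution and the
// target latency T; this sketch only shows the shape of the search.
using MissModel = std::function<double(double threshold, int load)>;

// For each load value, pick the candidate threshold that minimizes
// expected misses. Assumes at least one candidate threshold.
std::map<int, double>
buildThresholdTable(const std::vector<double>& candidateThresholds,
                    int maxLoad, const MissModel& expectedMisses) {
    std::map<int, double> table;
    for (int load = 1; load <= maxLoad; ++load) {
        double best = candidateThresholds.front();
        for (double t : candidateThresholds)
            if (expectedMisses(t, load) < expectedMisses(best, load))
                best = t;
        table[load] = best;  // serialize requests exceeding `best` at this load
    }
    return table;
}
```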
SLIDE 22 Contributions
[Diagram: at runtime, the tail-control scheduler uses the threshold table to decide which requests to serialize.]
SLIDE 23 Contributions
We modify work stealing to implement tail-control scheduling in Intel Threading Building Blocks (TBB).
[Figure: preview of the results — tail-control improves performance over the baselines.]
SLIDE 24 Contributions
[Diagram: the full tail-control pipeline again; implementation details are in the paper.]
SLIDE 25 Tail-control scheduler
[Diagram: input → offline threshold calculation → threshold table → tail-control runtime.]
Runtime functionality (a sketch follows this list):
– Execute all requests in parallel to begin with
– Record the total computation time spent on each request so far
– Detect large requests by comparing their processing time so far against the current threshold
– Serialize large requests to limit their impact on other waiting requests
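A minimal sketch of the accounting and detection steps, assuming a per-request record and a periodic charge from each worker; this is not the paper's TBB code, and the names are illustrative.

```cpp
#include <atomic>

double thresholdForLoad(int load);  // table lookup, as sketched earlier

// Hypothetical per-request bookkeeping.
struct RequestState {
    std::atomic<long> processedMs{0};     // total CPU time consumed so far
    std::atomic<bool> serialized{false};  // demoted to a single worker?
};

// Called by a worker after executing a quantum of a request's work:
// charge the elapsed time, then compare the request's total processing
// time against the threshold for the *current* load.
void chargeAndCheck(RequestState& r, long quantumMs, int currentLoad) {
    long total = r.processedMs.fetch_add(quantumMs) + quantumMs;
    if (!r.serialized.load() && total > thresholdForLoad(currentLoad)) {
        // Large request detected: mark it so workers stop stealing its
        // tasks, confining its remaining work to one worker and freeing
        // the other cores for waiting requests.
        r.serialized.store(true);
    }
}
```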
SLIDE 26 Work Stealing for Single Request
• Workers have local queues
– Execute work, if there is any in the local queue
– Steal, to further parallelize the request (a skeleton follows)
[Figure: three workers; the local queues hold pieces of request A; an idle worker steals from a busy one to parallelize A.]
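For reference, here is a minimal work-stealing skeleton for a single parallel request: each worker pops tasks from its own deque and steals from a random victim when its deque is empty. The Task type and the locking scheme are simplified stand-ins for TBB's lock-free deques.

```cpp
#include <cstdlib>
#include <deque>
#include <mutex>
#include <vector>

struct Task { void (*run)(); };

struct Worker {
    std::deque<Task> deque;
    std::mutex lock;
};

// Loops forever; a real scheduler would also handle shutdown.
void workerLoop(std::vector<Worker>& workers, int self) {
    for (;;) {
        Task t;
        bool found = false;
        {   // 1) Execute from the local queue if it is non-empty.
            std::lock_guard<std::mutex> g(workers[self].lock);
            if (!workers[self].deque.empty()) {
                t = workers[self].deque.back();  // LIFO end, for locality
                workers[self].deque.pop_back();
                found = true;
            }
        }
        if (!found) {  // 2) Otherwise steal from a random victim's FIFO end.
            int victim = std::rand() % static_cast<int>(workers.size());
            std::lock_guard<std::mutex> g(workers[victim].lock);
            if (!workers[victim].deque.empty()) {
                t = workers[victim].deque.front();
                workers[victim].deque.pop_front();
                found = true;
            }
        }
        if (found) t.run();
    }
}
```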
SLIDE 27 Generalize Work Stealing to Multiple Requests
• Workers have local queues, plus a global queue
– Execute work, if there is any in the local queue
– Steal: further parallelize a request
– Admit: start executing a new request
[Figure: parallelizable requests B and C arrive at the global queue; workers either execute local work, steal to parallelize request A, or admit a new request.]
SLIDE 28 Implement Tail-Control in TBB
• Workers have local queues, plus a global queue
– Execute work, if there is any in the local queue
– Steal: further parallelize a request
– Admit: start executing a new request
• Steal-first (tries to reduce processing time)
• Admit-first (tries to reduce waiting time)
• Tail-control: steal-first, plus large-request detection and serialization (policy sketch below)
[Figure: the same worker/global-queue diagram as before.]
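The three policies differ only in what an idle worker tries first. A sketch, with the steal/admit helpers as hypothetical placeholders stubbed to always fail:

```cpp
enum class Policy { StealFirst, AdmitFirst, TailControl };

// Hypothetical stand-ins for the TBB mechanics.
bool stealFromVictim() { return false; }       // try to parallelize a running request
bool admitFromGlobalQueue() { return false; }  // try to start a newly arrived request

// What an idle worker (empty local deque) does under each policy.
void onIdle(Policy p) {
    switch (p) {
    case Policy::StealFirst:   // favor reducing processing time
        if (!stealFromVictim()) admitFromGlobalQueue();
        break;
    case Policy::AdmitFirst:   // favor reducing waiting time
        if (!admitFromGlobalQueue()) stealFromVictim();
        break;
    case Policy::TailControl:
        // Steal-first, except that requests already marked `serialized`
        // (see the detection sketch on slide 25) are skipped as steal
        // victims, so a large request keeps only one worker.
        if (!stealFromVictim()) admitFromGlobalQueue();
        break;
    }
}
```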
SLIDE 29 Evaluation
• Various request work distributions
– Bing search
– Finance server
– Log-normal
• Different request arrival processes
– Poisson
– Log-normal
• Each setting: 100,000 requests; we plot the target-latency miss ratio
• Two baselines, generalized from work stealing for a single job (trace-generation sketch below)
– Steal-first: tries to parallelize requests and reduce processing time
– Admit-first: tries to admit requests and reduce waiting time
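Traces like this setup's can be generated with standard-library distributions; in this sketch the arrival rate and log-normal parameters are arbitrary examples, not the paper's settings.

```cpp
#include <random>
#include <vector>

struct Arrival { double time; double workMs; };

// Synthetic trace: Poisson arrivals (exponential inter-arrival gaps)
// and log-normally distributed work. Parameters are illustrative only.
std::vector<Arrival> makeTrace(int n, double requestsPerSec) {
    std::mt19937 rng(42);
    std::exponential_distribution<double> gap(requestsPerSec);  // seconds
    std::lognormal_distribution<double> work(3.5, 1.0);         // median ~33 ms
    std::vector<Arrival> trace;
    double t = 0.0;
    for (int i = 0; i < n; ++i) {
        t += gap(rng);
        trace.push_back({t, work(rng)});
    }
    return trace;
}
```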
SLIDE 30 Improvement in target-latency miss ratio
[Figure: miss-ratio results across target latencies, from hard to easy to meet; lower miss ratio is better.]
SLIDE 31 Improvement in target-latency miss ratio
[Figure: the same axes, annotated with relative load from high to low — admit-first wins at high load, steal-first wins at low load.]
SLIDE 32 Improvement in target-latency miss ratio
[Figure: the same results, highlighting tail-control's improvement in miss ratio over both baselines.]
SLIDE 33
The inner workings of tail-control
[Figure: request latency distribution with the target latency marked.]
SLIDES 34–36
The inner workings of tail-control
Tail-control sacrifices a few large requests and reduces the latency of many more small requests so that they meet the target latency.
[Figure: the same distribution, annotated progressively across these three build slides; target latency marked.]
SLIDE 37
Tail-control performs well with inaccurate input
SLIDE 38
Tail-control performs well with inaccurate input
Slightly inaccurate input work distributions are still useful.
[Figure: miss ratio as the input work distribution varies from less to more inaccurate.]
SLIDE 39 Related work
Parallelizing a single job to reduce latency
– [Blumofe et al. 1995], [Arora et al. 2001], [Jung et al. 2005], [Ko et al. 2002], [Wang and O'Boyle 2009], ...
Interactive server parallelism optimizing for mean response time and tail latency
– [Raman et al. 2011], [Jeon et al. 2013], [Kim et al. 2015], [Haque et al. 2015]
Theoretical results on server scheduling for sequential and parallel jobs, optimizing for mean and maximum response time
– [Chekuri et al. 2004], [Torng and McCullough 2008], [Fox and Moseley 2011], [Becchetti et al. 2006], [Kalyanasundaram and Pruhs 1995], [Edmonds and Pruhs 2012], [Agrawal et al. 2016]
SLIDE 40 Take home
– For non-clairvoyant interactive services, the work distribution is helpful for designing schedulers.
– Given the work distribution, we can devise an offline algorithm that computes a large-request threshold for every value of instantaneous load.
– We have developed an adaptive scheduler that serializes requests according to the threshold table, and demonstrated that it works well in practice.
[Diagram: the tail-control pipeline — input → offline threshold calculation → threshold table → tail-control runtime.]