Efficient Workstealing for Multicore Event-Driven Systems



SLIDE 1

Efficient Workstealing for Multicore Event-Driven Systems

Fabien Gaud (1), Sylvain Genevès (1), Renaud Lachaize (1), Baptiste Lepers (2), Fabien Mottet (2), Gilles Muller (2), Vivien Quéma (3)

(1) University of Grenoble, (2) INRIA, (3) CNRS

International Conference on Distributed Computing Systems 2010

1 / 27 Efficient Workstealing for Multicore Event-Driven Systems

SLIDE 2

Outline

1. Context
2. Evaluation of Libasync-SMP workstealing
3. Contributions
4. Performance evaluation
5. Conclusion

SLIDE 3

Objectives

- Application domain: data servers
- Focus on event-driven programming
- Multicore architectures are mainstream
- Exploiting the available hardware parallelism becomes crucial for data server performance

⇒ Our goal is to provide an efficient multicore runtime for event-driven programming

SLIDE 4

Event-driven runtime basics

- The application is structured as a set of handlers processing events
- An event can be triggered by I/O or produced internally
- The runtime engine repeatedly processes events from its queue:
  - Get an event from the runtime's queue
  - Call the associated handler, which may produce new events
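
The loop described above can be sketched in a few lines of Python. This is only an illustration, not the runtime's actual code: the `Runtime` class, `post`, and the two handlers are hypothetical names, and a real runtime would block waiting for I/O instead of returning when the queue is empty.

```python
from collections import deque


class Runtime:
    """Single-queue event loop: pop an event, run its handler, repeat."""

    def __init__(self):
        self.queue = deque()

    def post(self, handler, *args):
        # Register an event: a handler plus the arguments it will receive.
        self.queue.append((handler, args))

    def run(self):
        # Process events until the queue is empty; handlers may post new ones.
        processed = 0
        while self.queue:
            handler, args = self.queue.popleft()
            handler(self, *args)
            processed += 1
        return processed


log = []


def on_request(rt, n):
    log.append(("request", n))
    rt.post(on_reply, n * 2)  # a handler producing a new event


def on_reply(rt, n):
    log.append(("reply", n))


rt = Runtime()
rt.post(on_request, 21)
count = rt.run()
```

Here `on_request` produces a second event from inside a handler, which is the "may produce new events" case.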

SLIDE 5

Multicore event-driven runtime

Challenges

- Helping programmers deal with concurrency
  - Locks
  - STM
  - Annotations
- Efficiently dispatching events on cores
  - Static placement
  - Load balancing through workgiving
  - Load balancing through workstealing

⇒ Libasync-SMP is an annotation-based multicore event-driven runtime

SLIDE 6

Libasync-SMP [Zeldovich03]

- One event queue per core
- Mutual exclusion ensured by annotations on events (colors)
- Event dispatching on cores
  - Colors are initially dispatched in a round-robin manner
  - Load balancing is readjusted through workstealing

[Figure: Cores 1 to 4, each with one thread and one event queue; events are tagged Color 0 to Color 3]

- Evaluation on two network servers
  - Workstealing is only evaluated on micro-benchmarks
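
To make the color mechanism concrete, here is a minimal Python sketch of per-core queues with an initial round-robin placement of colors. The mapping `color % n_cores` and the names `home_core` and `post` are assumptions made for illustration; the slides only state that colors are dispatched round-robin. Because every event of a given color lands in the same queue, and each queue is drained by a single thread, two events of the same color never run concurrently.

```python
from collections import deque

N_CORES = 4


def home_core(color, n_cores=N_CORES):
    # Assumed round-robin placement: color k initially belongs to core k mod n.
    return color % n_cores


# One event queue per core.
queues = [deque() for _ in range(N_CORES)]


def post(color, name):
    # All events of one color go to the same core, hence they are serialized.
    queues[home_core(color)].append((color, name))


for color, name in [(0, "a"), (1, "b"), (4, "c"), (1, "d"), (6, "e")]:
    post(color, name)
```

With this placement core 3 stays idle while cores 0 and 1 hold two events each, which is the kind of imbalance workstealing is meant to correct.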

SLIDE 7

Outline

1. Context
2. Evaluation of Libasync-SMP workstealing
3. Contributions
   - Improving the workstealing algorithm
   - Making runtime internals workstealing friendly
4. Performance evaluation
5. Conclusion

SLIDE 8

Expected behavior : the SFS case

- Many expensive cryptographic operations
- Good case for the workstealing algorithm
- Example: clients accessing a 200MB file

[Figure: throughput (MB/sec) of Libasync-smp and Libasync-smp - WS]

⇒ 35% throughput increase thanks to workstealing

SLIDE 9

Unwanted behavior : the Web server case

- Web server serving static content
- Workstealing costs are noticeable
- Example: clients accessing 1KB files

[Figure: throughput (KRequests/s) versus number of clients (200 to 2000) for Libasync-smp and Libasync-smp - WS]

⇒ 33% throughput decrease due to the workstealing mechanism

SLIDE 10

Unwanted behavior : the Web server case (2)

Web server configuration            Stealing time   Stolen time   Cache misses / event
Libasync-SMP without workstealing   -               -             9
Libasync-SMP with workstealing      197 Kcycles     20 Kcycles    21

- Very high stealing costs ≫ stolen computing time
- Very low cache efficiency: +146% L2 cache misses over Libasync-smp without workstealing

SLIDE 11

Problem statement

- Naive workstealing can hurt system performance
- This paper improves workstealing performance for multicore event-driven runtimes
- Major differences with workstealing for thread-based runtimes:
  - Tasks are more fine-grained ⇒ sensitivity to stealing costs
  - One core can post tasks to another core ⇒ cannot use efficient deque structures [Chase05]
  - Stealing is constrained by colors ⇒ O(n) workstealing algorithm

SLIDE 12

Workstealing main steps

core_set = construct_core_set();                    (1)
foreach (core c in core_set) {
    LOCK(c);
    if (can_be_stolen(c)) {                         (2)
        color = choose_colors_to_steal(c);          (3)
        event_set = construct_event_set(c, color);
    }
    UNLOCK(c);
    if (!is_empty(event_set)) {
        LOCK(myself);
        migrate(event_set);
        UNLOCK(myself);
        exit;
    }
}
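
The pseudocode above can be turned into a small runnable Python sketch. The control flow follows the slide; the bodies of `can_be_stolen`, `choose_color_to_steal` and `construct_event_set` are placeholder policies invented for this example (the slides do not give them), as are the `Core` class and its `running_color` field. Note that building the event set scans the victim's whole queue, which is the O(n) cost mentioned earlier.

```python
import threading
from collections import deque


class Core:
    def __init__(self, cid):
        self.cid = cid
        self.lock = threading.Lock()
        self.queue = deque()        # (color, event_name) pairs
        self.running_color = None   # color of the handler being executed


def can_be_stolen(core):
    # Placeholder rule: the victim must hold at least two distinct colors.
    return len({color for color, _ in core.queue}) >= 2


def choose_color_to_steal(core):
    # Placeholder rule: first queued color the victim is not running.
    for color, _ in core.queue:
        if color != core.running_color:
            return color
    return None


def construct_event_set(core, color):
    # O(n) scan: pull every event of the chosen color out of the queue.
    stolen = [e for e in core.queue if e[0] == color]
    core.queue = deque(e for e in core.queue if e[0] != color)
    return stolen


def steal(thief, cores):
    for victim in cores:
        if victim is thief:
            continue
        event_set = []
        with victim.lock:
            if can_be_stolen(victim):
                color = choose_color_to_steal(victim)
                if color is not None:
                    event_set = construct_event_set(victim, color)
        if event_set:
            with thief.lock:        # the victim's lock is already released
                thief.queue.extend(event_set)
            return len(event_set)
    return 0


cores = [Core(i) for i in range(3)]
cores[1].queue.extend([(1, "a"), (5, "b"), (1, "c")])
cores[1].running_color = 1
first = steal(cores[0], cores)
second = steal(cores[0], cores)
```

As in the slide, the thief never holds two locks at once: the victim's lock is released before the thief locks its own queue.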

SLIDE 13

Outline

1. Context
2. Evaluation of Libasync-SMP workstealing
3. Contributions
   - Improving the workstealing algorithm
   - Making runtime internals workstealing friendly
4. Performance evaluation
5. Conclusion

SLIDE 14

Idea #1 : Taking hardware topology into account

core_set = construct_core_set();   (1)

- In a multicore system, some cores usually share caches
- Accessing cached data is significantly faster than accessing data in main memory
- Idea: take the cache hierarchy into consideration when stealing
- Locality-aware stealing ⇒ give priority to a neighbor when stealing
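
One possible reading of locality-aware stealing is that `construct_core_set` returns the candidate victims ordered so that cache neighbors come first. The Python sketch below assumes a hypothetical `cache_group` map from core id to the id of the cache it shares; how the real runtime discovers the topology is not described in the slides.

```python
def construct_core_set(thief, cache_group):
    """Order victims so that cores sharing a cache with the thief come first."""
    others = [c for c in sorted(cache_group) if c != thief]
    # sorted() is stable: neighbors (key False) precede remote cores (key True),
    # and each group keeps its ascending core-id order.
    return sorted(others, key=lambda c: cache_group[c] != cache_group[thief])


# Hypothetical topology: cores 0 and 1 share one cache, cores 2 and 3 another.
cache_group = {0: 0, 1: 0, 2: 1, 3: 1}
order_for_core_2 = construct_core_set(2, cache_group)
order_for_core_0 = construct_core_set(0, cache_group)
```

Core 2 therefore tries core 3 before crossing to the other cache, so stolen events are more likely to find their data already cached.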

SLIDE 15

Idea #2 : Taking into account computation length

if (can_be_stolen(c)) {   (2)

- Many event handlers are relatively fine-grained
- In our context, workstealing may have a significant cost
- Idea: stealing some types of events is not beneficial
- Time-left stealing: know at any time which colors are worth stealing
- Handler execution time is currently set by the programmer but could be discovered at runtime
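
A hedged sketch of the time-left idea in Python: a color is considered worth stealing only if the estimated time needed to process its queued events exceeds the cost of the steal itself. The threshold value, the `handler_cost` table and the function names are all illustrative assumptions; only the principle comes from the slide.

```python
# Illustrative values only, in cycles.
STEAL_COST_CYCLES = 6_000
handler_cost = {"crypto": 50_000, "parse": 800}  # programmer-provided estimates


def time_left(events):
    # Estimated time to process all queued events of one color.
    return sum(handler_cost[h] for h in events)


def worthy_colors(color_queues, steal_cost=STEAL_COST_CYCLES):
    # Keep only colors whose remaining work outweighs the cost of stealing.
    return [color for color, events in color_queues.items()
            if time_left(events) > steal_cost]


color_queues = {
    1: ["parse", "parse"],   # 1,600 cycles: not worth a steal
    2: ["crypto"],           # 50,000 cycles
    3: ["parse"] * 10,       # 8,000 cycles
}
worthy = worthy_colors(color_queues)
```

A victim whose list of worthy colors is empty would simply fail the `can_be_stolen` test.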

SLIDE 16

Idea #3 : Taking cache footprint into consideration

color = choose_colors_to_steal(c);   (3)

- Sometimes events can be stolen but are not the best candidates
  - For example, event handlers accessing large, long-lived data sets
- Penalty-aware stealing: assign penalties to event handlers based on their behavior
- Penalties are set by the programmer based on preliminary profiling and/or knowledge of the application's behavior
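
As an illustration of how a penalty could influence the choice of color, the Python sketch below ranks candidate colors by remaining work divided by penalty. This scoring formula is an assumption made for the example; the slides only say that penalties are attached to event handlers, not how they are combined with other criteria.

```python
def choose_color_to_steal(candidates):
    """Pick the color with the best remaining-work-to-penalty ratio.

    candidates maps a color to (time_left_cycles, penalty), with penalty >= 1.
    The ratio used as a score is an assumption for this sketch.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda c: candidates[c][0] / candidates[c][1])


candidates = {
    1: (40_000, 8),   # lots of work, but touches a large long-lived data set
    2: (12_000, 1),
    3: (9_000, 1),
}
chosen = choose_color_to_steal(candidates)
```

Color 1 has the most work left, but its penalty of 8 lowers its score to 5,000, so color 2 is preferred.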

SLIDE 17

Outline

1. Context
2. Evaluation of Libasync-SMP workstealing
3. Contributions
   - Improving the workstealing algorithm
   - Making runtime internals workstealing friendly
4. Performance evaluation
5. Conclusion

SLIDE 18

The Mely runtime

[Figure: a core (Core X) with its thread, its core-queue linking the color-queues of Colors 0 to 3, and its stealing-queue]

- Backward compatible with Libasync-SMP
- One thread per core
- One color-queue per color
- One core-queue per core that links color-queues
- One stealing-queue per core that allows efficient implementation of the Time-left and Penalty-aware stealing strategies
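
The queue layout listed above might be modeled as follows in Python. This is a simplified sketch under several assumptions: the class and method names are hypothetical, the stealing-queue is taken to hold the colors whose remaining work exceeds a steal-cost threshold, and event execution (which would decrease `time_left`) is not modeled. The point it illustrates is that a thief can detach a whole color-queue without scanning individual events.

```python
from collections import OrderedDict, deque


class ColorQueue:
    def __init__(self, color):
        self.color = color
        self.events = deque()
        self.time_left = 0  # estimated cycles needed for the queued events


class MelyCore:
    def __init__(self, steal_cost):
        self.steal_cost = steal_cost
        self.core_queue = OrderedDict()      # color -> ColorQueue
        self.stealing_queue = OrderedDict()  # colors currently worth stealing

    def post(self, color, handler, cost):
        cq = self.core_queue.get(color)
        if cq is None:
            cq = self.core_queue[color] = ColorQueue(color)
        cq.events.append(handler)
        cq.time_left += cost
        if cq.time_left > self.steal_cost:
            self.stealing_queue[color] = cq

    def steal_one_color(self):
        # Detach a whole color-queue without walking through its events.
        if not self.stealing_queue:
            return None
        color, cq = self.stealing_queue.popitem(last=False)
        del self.core_queue[color]
        return cq


core = MelyCore(steal_cost=6_000)
core.post(1, "parse", 800)
core.post(2, "crypto", 50_000)
core.post(1, "parse", 800)
stolen = core.steal_one_color()
```

Compared with a single flat event queue per core, grouping events by color removes the per-event scan from the stealing path.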

SLIDE 19

Outline

1. Context
2. Evaluation of Libasync-SMP workstealing
3. Contributions
   - Improving the workstealing algorithm
   - Making runtime internals workstealing friendly
4. Performance evaluation
5. Conclusion

SLIDE 20

SFS

- 15 clients repeatedly request a 200MB file
- 60% of the time is spent in cryptographic operations ⇒ only cryptographic operations are colored

[Figure: throughput (MB/sec) of Libasync-smp, Libasync-smp - WS and Mely - WS]

⇒ As expected, same throughput as the legacy workstealing mechanism

SLIDE 21

Web server

- Returns static page content (1KB files requested)
- Closed-loop injection
- 5 load injectors simulating between 200 and 2000 clients
- Architecture is based on a legacy design
  - Per-connection coloring

[Figure: handler graph of the Web server, with handlers Epoll, Accept, RegisterFdInEpoll, Read Request, Parse Request, GetFromCache, Write Response, Close and DecAcceptedClients]

SLIDE 22

Web server evaluation

[Figure: throughput (KRequests/s) versus number of clients (200 to 2000) for Mely - WS, Libasync-smp, Mely and Libasync-smp - WS]

⇒ Up to 73% improvement over the Libasync-SMP workstealing mechanism

SLIDE 23

Web server evaluation (2)

[Figure: throughput (KRequests/s) versus number of clients (200 to 2000) for Mely - WS, Userver and Apache]

⇒ Better performance than other real-world Web servers

SLIDE 24

Web server profiling

Web server configuration            Stealing time   Stolen time   Cache misses / event
Libasync-SMP without workstealing   -               -             9
Libasync-SMP with workstealing      197 Kcycles     20 Kcycles    21
Mely with workstealing              6 Kcycles       23 Kcycles    9

- Low stealing overhead: 6 Kcycles < stolen computing time
- Much more cache-efficient than Libasync-SMP
  - Locality-aware and penalty-aware heuristics decrease the number of L2 cache misses by 24%
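
As a quick arithmetic check of the profiling figures, the short Python snippet below computes how many cycles are spent stealing per cycle of stolen work. The numbers are the ones reported in the table; the variable names are only for this example.

```python
# (stealing time, stolen time) in Kcycles, as reported in the profiling table.
profiles = {
    "Libasync-SMP with workstealing": (197, 20),
    "Mely with workstealing": (6, 23),
}

# Cycles spent stealing for each cycle of work obtained by the steal.
overhead = {name: steal / stolen for name, (steal, stolen) in profiles.items()}
```

A ratio above 1 means a steal costs more than the work it brings in: Libasync-SMP spends almost 10 cycles stealing per stolen cycle, whereas Mely spends well under 1.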

SLIDE 25

Outline

1. Context
2. Evaluation of Libasync-SMP workstealing
3. Contributions
4. Performance evaluation
5. Conclusion

SLIDE 26

Conclusion

- Context
  - Event-driven programming for system services on multicore architectures
  - Workstealing sometimes degrades performance in such systems
- Contributions
  - New heuristics to improve workstealing efficiency
  - Revised runtime internals to reduce workstealing costs
  ⇒ Improved Web server performance by up to 73% compared to the legacy workstealing mechanism
- Future work: automating runtime profiling and decision

SLIDE 27

Thank You !

Questions ?
