

slide-1
SLIDE 1

Improving the performance of data servers on multicore architectures

Fabien Gaud

Grenoble University

Advisors: Jean-Bernard Stefani, Renaud Lachaize and Vivien Quéma

Sardes (INRIA/LIG)

December 2, 2010

1 / 50 Improving the performance of data servers on multicore architectures

SLIDE 2

Processor evolution

Before ∼2006:
- One core
- Regular increase of clock frequency

Since then:
- Almost no increase of clock frequency
- Increasing number of cores:
  - Multicore architectures
  - NUMA architectures
  - Manycore architectures

SLIDE 3

Multicore is a hot topic

Legacy applications do not efficiently leverage multicore hardware.

Research topics:
- Programming models/languages
- Operating system abstractions/internals
- Runtimes/libraries
- Applications

Active research field:
- Corey (OSDI'08)
- Barrelfish (SOSP'09), Helios (SOSP'09)
- PK (OSDI'10)

SLIDE 4

This thesis

Application domain: data servers, a.k.a. networked services
Goal: improve the performance of data servers on multicore architectures
Contributions:
- Efficient multicore event-driven programming
- Scaling the Apache Web server on NUMA multicore systems

SLIDE 5

#1: Efficient multicore event-driven programming

CFSE 2009 (best paper award)
ICDCS 2010

SLIDE 6

Event-driven programming

The application is structured as a set of handlers processing events.

An event can be:
- Triggered by an I/O operation
- Produced internally by the application

Events are stored in a queue and processed by a single thread.

[Diagram: a control loop dequeues events and dispatches them to handlers 1-4]
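The structure above can be sketched as a minimal single-threaded event loop. This is a generic illustration, not Libasync code; the handler and event names are made up:

```python
from collections import deque

# A queue of (handler, payload) pairs processed by one control loop.
event_queue = deque()
responses = []

def post(handler, payload):
    """Events can be posted by I/O code or internally by handlers."""
    event_queue.append((handler, payload))

def parse_request(payload):
    # A handler may post follow-up events, chaining the processing steps.
    post(send_response, payload.upper())

def send_response(payload):
    responses.append(payload)

post(parse_request, "get /index.html")

# The control loop: dequeue one event at a time and run its handler.
while event_queue:
    handler, payload = event_queue.popleft()
    handler(payload)
```

Because a single thread drains the queue, no two handlers ever run concurrently, which is what makes the model simple and also what limits it on multicore hardware.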

SLIDE 7

Multicore event-driven programming

Goal: concurrently execute multiple handlers

Challenges:
- Concurrency management
- Balancing load on cores

Solutions:
- N-Copy
- 1-Copy with synchronization

SLIDE 8

N-Copy

Principle: running one instance of the application per core

[Diagram: four cores, each running its own application instance (App1-App4) with a private event queue and control loop]

SLIDE 9

N-Copy (2)

Advantages:
- No concurrency management needed
- No application modification needed

Drawbacks:
- Not applicable to all applications
- Multiple copies of data
- Requires external load balancing

SLIDE 10

1-copy with synchronization

Principle: one instance running on multiple cores.

Concurrency can be managed using:
- Locks
- Software transactional memory (STM)
- Annotations

Load balancing can be achieved with:
- Static placement
- Workgiving
- Workstealing

The chosen approach is implemented in Libasync-SMP (USENIX '03).

SLIDE 11

Libasync-SMP – Concurrency management

Annotations (colors) are set on events: events with the same color are executed sequentially, while events with different colors may run concurrently.

[Diagram: one application instance spanning four cores; each core's queue holds the events of the colors assigned to it (colors 0-3)]
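The color mechanism can be sketched as a mapping from colors to cores: all events of one color land in the same core's queue, so they execute sequentially, while different colors may run in parallel. The modulo dispatch below is an illustrative assumption, not the library's actual placement policy:

```python
NUM_CORES = 4

def core_for_color(color):
    # Events with the same color always map to the same core's queue,
    # so they execute sequentially; different colors can run in parallel.
    return color % NUM_CORES

queues = [[] for _ in range(NUM_CORES)]

def post(color, event):
    queues[core_for_color(color)].append((color, event))

# E.g. color = connection id: all events of one connection are serialized.
post(0, "read")
post(0, "parse")
post(1, "read")
```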

SLIDE 12

Libasync-SMP – Load balancing

Load balancing is done through workstealing

[Diagram: an idle core steals events of a given color from another core's queue]

SLIDE 13

1-Copy with synchronization

Advantages:
- Allows sharing between cores
- Allows load balancing between cores

Drawbacks:
- The application must be modified
- Efficient load balancing is difficult

SLIDE 14

Workstealing performance: SFS

[Bar chart: throughput (MB/sec), Libasync-smp vs. Libasync-smp with workstealing (WS)]

35% throughput increase

SLIDE 15

Workstealing performance: Web server

[Bar chart: throughput (MB/sec), Libasync-smp vs. Libasync-smp with workstealing (WS)]

33% throughput decrease

SLIDE 16

What is the problem?

Fine-grain events:
- Stealing time (197 Kcycles) ≫ stolen processing time (20 Kcycles)

Inefficient cache usage:
- +146% L2 cache misses

Inefficient workstealing implementation:
- O(n) complexity

SLIDE 17

Contributions

New:
- Workstealing algorithm
- Runtime implementation

Fine-grain events:
- Algorithm: steal events with high execution time

Inefficient cache usage:
- Algorithm: steal cache-friendly events
- Algorithm: take the cache hierarchy into account

Inefficient workstealing implementation:
- Runtime: mitigate stealing costs

SLIDE 18

Idea #1: Take into account execution time

Problem: the stealing cost is not always amortized.
- Many event handlers are relatively fine grain
- Workstealing may have a significant cost

Solution: time-left stealing
- Know at any time which colors are worth stealing
- Handler execution times are set by the programmer
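The time-left heuristic can be sketched as follows: each color's queue tracks the summed execution time of its pending events (per-handler times supplied by the programmer), and a steal is attempted only when the time left exceeds the stealing overhead. The constant and the handler names are illustrative, not Mely's actual implementation:

```python
STEAL_COST = 6_000  # cycles; illustrative stealing overhead

def worth_stealing(pending_events, handler_time):
    """handler_time: programmer-supplied per-event execution time (cycles)."""
    time_left = sum(handler_time[h] for h in pending_events)
    # Only steal if the stolen work amortizes the cost of stealing it.
    return time_left > STEAL_COST

# Hypothetical handler costs: a coarse parser vs. a cheap epoll handler.
handler_time = {"parse": 20_000, "epoll": 1_000}
```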

SLIDE 19

Idea #2: Take into account caches

Problem: workstealing can reduce cache efficiency.
- Stealing events increases cache misses
- Example: event handlers accessing large, long-lived data sets

Solution 1: penalty-aware stealing
- Set penalties on handlers based on their cache access patterns
- Penalties are set manually, based on preliminary profiling

Solution 2: locality-aware stealing
- Give priority to a neighbor when stealing
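Locality-aware stealing can be sketched as ordering victims by distance in the cache hierarchy: an idle core first tries cores on its own die (sharing a cache) before going off-die. The two-die, four-core topology below is a hypothetical example, not the evaluation machine:

```python
# dies[i] lists the cores on die i (hypothetical 2-die, 4-core topology).
dies = [[0, 1], [2, 3]]

def neighbors_first(core):
    """Victim order for `core`: same-die cores first, then remote cores."""
    local = [c for d in dies if core in d for c in d if c != core]
    remote = [c for d in dies if core not in d for c in d]
    return local + remote
```

A penalty-aware variant would additionally skip colors whose handlers carry a high (manually assigned) cache penalty.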

SLIDE 20

Runtime implementation

[Diagram: per-core control loop with a core-queue, a stealing-queue, and color-queues (colors 0-3)]

- One color-queue per color
- One core-queue per core, linking that core's color-queues
- One stealing-queue per core
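The queue layout can be sketched as follows (illustrative Python, not Mely's C structures): events are enqueued per color, and the core-queue only tracks colors that have pending work, so the control loop never scans empty colors. A separate stealing-queue (left empty here) would expose stealable colors to thieves without contending on the hot core-queue:

```python
from collections import deque

class Core:
    def __init__(self):
        self.color_queues = {}         # color -> deque of events
        self.core_queue = deque()      # colors with pending events
        self.stealing_queue = deque()  # colors exposed to thieves

    def post(self, color, event):
        q = self.color_queues.setdefault(color, deque())
        if not q:                         # first pending event for color:
            self.core_queue.append(color) # make the color runnable
        q.append(event)

core = Core()
core.post(2, "read")
core.post(2, "parse")
core.post(0, "accept")
```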

SLIDE 21

Performance evaluation: SFS

[Bar chart: throughput (MB/sec), Libasync-smp, Libasync-smp - WS, and Mely - WS]

No throughput degradation

SLIDE 22

Performance evaluation: Web server

[Bar chart: throughput (MB/sec), Libasync-smp, Libasync-smp - WS, and Mely - WS]

73% throughput improvement

SLIDE 23

Web server profiling

Web server configuration   Stealing time   Stolen time   Cache misses/event
Libasync-SMP - WS          197 Kcycles     20 Kcycles    21
Mely - WS                  6 Kcycles       23 Kcycles    9

Stealing time (6 Kcycles) < stolen processing time (23 Kcycles) Improved cache efficiency: -57% L2 cache misses

SLIDE 24

Summary

Goal: an efficient runtime for multicore event-driven systems
Problem: workstealing sometimes degrades performance
Contributions:
- New workstealing algorithm
- New runtime implementation

Results: throughput improved by up to 73%

SLIDE 25

#2: Scaling the Apache Web server on NUMA multicore systems

Under submission

SLIDE 26

Problem

[Plot: # of clients per die vs. # of dies (1-4); Apache is 26% below ideal scalability at 4 dies]

The Apache web server does not scale on NUMA architectures.

SLIDE 27

What can we do?

Address scalability issues at the OS level:
- Corey (OSDI '08)
- Barrelfish (SOSP '09)
- PK (OSDI '10)

SLIDE 28

Apache on PK

[Plot: # of clients per die vs. # of dies (1-4); Apache on PK remains 22% below ideal scalability]

Does not solve the scalability issues.

SLIDE 29

What do we propose?

Addressing scalability issues at the OS level is not sufficient:
- Application-level issues remain
- Some issues are difficult to handle at the OS level (e.g. scheduling)

Approach: address scalability issues at the application level.

SLIDE 30

Methodology

Consider both hardware and software bottlenecks.

Hardware bottlenecks:
- Processor interconnect
- Distant memory accesses

Software bottlenecks:
- Synchronization primitives

SLIDE 31

Hardware testbed

[Diagram: 4 dies (0-3), each with four cores and local DRAM; per-core L1 and L2 caches and a shared L3 per die; dies connected by HyperTransport links (24 Gb/s and 61 Gb/s shown), with I/O attached to two of the dies]

4 processors / 16 cores

SLIDE 33

Hardware bottlenecks

Memory efficiency (IPC)

Configuration   Average IPC
1 die           0.38
4 dies          0.30

21% IPC decrease

SLIDE 34

Hardware bottlenecks (2)

IPC decrease:
- Reduced cache efficiency

Configuration   L3 cache miss ratio (%)
1 die           14
4 dies          14

SLIDE 35

Hardware bottlenecks (2)

IPC decrease:
- Reduced cache efficiency
- HyperTransport link saturation

Configuration   Max HT usage (%)
1 die           25
4 dies          75

SLIDE 36

Hardware bottlenecks (2)

IPC decrease:
- Reduced cache efficiency
- HyperTransport link saturation
- Increased number of distant memory accesses

Configuration   Distant accesses/kB
1 die           4
4 dies          14

SLIDE 37

Request processing

[Diagram: 4-die NUMA machine (dies 0-3, each with local DRAM); the NIC is attached to die 0, while cores C0, C10 and C5 sit on different dies]

With stock Apache, the steps of a single request may each run on a different die:
1. Receiving the TCP request
2. HTTP request processing
3. PHP processing
4. Sending the response

SLIDE 42

Proposal #1

Solution: co-localize TCP, Apache and PHP processing.
Implementation: use one instance of the Apache/PHP stack per die (N-Copy).
- One node manages 5 network interfaces

SLIDE 43

N-Copy: request processing

[Diagram: with N-Copy, the request arrives on a NIC attached to die 1 and is processed entirely by die 1's cores (C2, C10, C6)]

All steps (receiving the TCP request, HTTP request processing, PHP processing, and sending the response) are handled on the same die.

SLIDE 48

N-Copy: performance

[Plot: # of clients per die vs. # of dies (1-4); Apache N-Copy is 19% below ideal scalability]

9.1% performance improvement compared to stock Apache

SLIDE 49

N-Copy: performance (2)

Configuration           Average IPC   Distant accesses/kB
1 die                   0.38          4
4 dies (stock Apache)   0.30          14
4 dies (N-Copy)         0.36          5

Memory efficiency improved by 20%

SLIDE 50

N-Copy: can we do better?

Die     Average CPU usage (%)
Die 0   100
Die 1   85
Die 2   85
Die 3   100

Problem:
- Dies are not equally efficient
- Load is not properly balanced on dies

SLIDE 51

N-Copy: load balancing

Solution: balance load on dies proportionally to their efficiency.
Implementation: use an external load-balancing mechanism.
- Currently implemented on the client side
- Could be integrated into a more global solution
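Proportional balancing can be sketched as splitting the client population according to per-die weights. The capacity values below are hypothetical relative weights (a die that saturates early gets a lower weight), not measured numbers from the thesis:

```python
def weighted_shares(capacity, total_clients):
    """Split clients across dies proportionally to per-die capacity."""
    total = sum(capacity)
    shares = [total_clients * c // total for c in capacity]
    # Hand rounding leftovers to the first die so the shares sum exactly.
    shares[0] += total_clients - sum(shares)
    return shares

# Hypothetical weights: dies 0 and 3 saturate first, so they get less load.
capacity = [85, 100, 100, 85]
```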

SLIDE 52

N-Copy: final performance

[Plot: # of clients per die vs. # of dies (1-4); Apache N-Copy with load balancing is 10% below ideal scalability]

21.2% performance improvement compared to stock Apache

SLIDE 53

Software bottlenecks

Goal: find functions that
- Do not scale
- Account for a significant share of execution time

Example: function f accounts for
- 1 cycle/byte at 1 die
- 10 cycles/byte at 4 dies
- 20% of the total execution time

⇒ 18% potential performance gain
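The 18% figure follows from scaling f back to its 1-die cost: if f takes 10 cycles/byte at 4 dies but only 1 cycle/byte at 1 die, making it scale would eliminate 9/10 of the 20% of total time spent in f. A short check of that arithmetic:

```python
def potential_gain(cost_1die, cost_4dies, share_of_total):
    """Fraction of total execution time recoverable by making the
    function scale, i.e. run at its 1-die per-byte cost."""
    wasted_fraction = (cost_4dies - cost_1die) / cost_4dies
    return share_of_total * wasted_fraction

# Function f: 1 cycle/byte at 1 die, 10 at 4 dies, 20% of total time.
gain = potential_gain(1, 10, 0.20)  # 0.20 * 9/10 = 0.18, i.e. 18%
```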

SLIDE 54

Software bottlenecks (2)

Function                   Potential performance gain (%)
d_lookup                   2.49
atomic_dec_and_lock        2.32
lookup_mnt                 1.41
copy_user_generic_string   0.83
memcpy                     0.76

Problem: the VFS layer does not scale.
- Aggregated potential performance gain: 6%
- Most of the calls are issued by the stat function

SLIDE 55

Proposal #2

Solution: use an application-level cache to reduce the number of calls to stat.
Implementation:
- Modified Apache's ap_directory_walk function
- Using inotify to track file updates
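The idea can be sketched as memoizing stat results by path and dropping an entry when the file changes. Here a hypothetical invalidate() hook stands in for the inotify notification; the real patch modifies ap_directory_walk in C:

```python
import os

stat_cache = {}  # path -> cached os.stat_result

def cached_stat(path):
    # Serve repeated lookups from the cache, avoiding the contended
    # kernel VFS path (d_lookup and friends) on every request.
    if path not in stat_cache:
        stat_cache[path] = os.stat(path)
    return stat_cache[path]

def invalidate(path):
    """Called when inotify reports a change on `path` (hypothetical hook)."""
    stat_cache.pop(path, None)
```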

SLIDE 56

Stat cache: performance

[Plot: # of clients per die vs. # of dies (1-4); Apache with all optimizations is 5% below ideal scalability]

33% performance improvement compared to stock Apache

SLIDE 57

Summary

Problem: Apache does not scale on NUMA architectures
Contribution: application-level optimizations that take NUMA characteristics and Linux scalability issues into account
Results: 33% performance improvement

SLIDE 58

Conclusion

SLIDE 59

Conclusion

Application domain: data servers
Goal: improve the performance of data servers on multicore architectures
Contributions:
- Efficient multicore event-driven programming
- Scaling the Apache Web server on NUMA multicore systems

SLIDE 60

Future work

Short term:
- Workstealing: automate profiling and decisions
- Apache: study other workloads

Long term:
- Study the impact of distant memory accesses on other servers
- Study the impact of programming models on multicore performance
- Study the scalability of the Java virtual machine

SLIDE 61

Questions?

SLIDE 62

Backup Slides

SLIDE 63

Web server

- Returns static page content (1 KB files requested)
- Closed-loop injection: 5 load injectors simulating between 200 and 2000 clients
- Architecture based on a legacy design
- Per-connection coloring

[Diagram: event handlers: Accept, RegisterFdInEpoll, Epoll, ReadRequest, ParseRequest, GetFromCache, WriteResponse, Close, DecAcceptedClients]

SLIDE 64

Web server evaluation

[Plot: throughput (KRequests/s) vs. number of clients (200-2000), for Mely - WS, Libasync-smp, Mely, and Libasync-smp - WS]

⇒ Up to 73% improvement over the Libasync-SMP workstealing mechanism

SLIDE 65

Mely - Other web server evaluation (2)

[Plot: throughput (KRequests/s) vs. number of clients (200-2000), for Mely - WS, Userver, and Apache]

⇒ Better performance than other real-world Web servers

SLIDE 66

Apache – Workload description

SPECweb2005 Support benchmark:
- Vendor site
- Mostly static content / PHP for dynamic pages
- Back-end simulator (BeSim)

Closed-loop injection with think times.

Defined QoS:
- 99% of clients served within 5 s
- 95% of clients served within 3 s
- Throughput constraints

Modified to fit in main memory: 12 GB

SLIDE 67

Software configuration

Apache 2.2.14:
- Worker MPM, using both threads and processes
- Sendfile enabled to improve performance

PHP 5.2.12:
- Tuned number of PHP processes
- With eAccelerator

Linux 2.6.32:
- NUMA support
- IRQ processing manually balanced
- Responsible for dispatching threads and processes
