Bryan Veal, Annie Foong (Intel R&D): Performance Scalability of a Multi-core Web Server



SLIDE 1

Performance Scalability of a Multi-core Web Server

Bryan Veal, Annie Foong
Intel R&D

SLIDE 2

12/03/2007, ANCS 2007 -- Performance Scalability of a Multi-Core Web Server

Overview

  • The number of CPU cores on modern servers is increasing rapidly
  • Premise: for highly parallel workloads, performance should scale with the number of cores
  • We tested this premise for web servers
  • Our results show that web servers scale poorly
  • We tested for common problems caused by poor parallel programming
  • We found few parallelism problems in the TCP/IP stack and the web software
  • Instead, we found problems inherent to server hardware design

SLIDE 3

Why Performance Should Scale

  • Typical networked servers
    – Have multiple cores
    – Have NICs mapped onto cores
    – Support many clients
    – Each client has its own flow
  • Independence between flows
    – Parallelism in the TCP/IP stack
    – Parallelism in the application
  • Because of flow-level parallelism, performance should scale

[Figure: a server with four cores, four NICs, and shared memory, connected to many clients]

SLIDE 4

How Performance Scales on Web Servers

Example web server benchmark: SPECweb2005
    – Official results from HP
    – Similar scaling for Intel and AMD CPUs
    – Performance metric is throughput

  • Ideal performance scales linearly
  • Actual performance scales poorly
    – 2x the cores
    – 1.5x the performance

[Figure: SPECweb2005 Performance Scaling. Speedup versus number of cores (4 to 16), ideal performance versus actual performance; scaling falls short.]

Performance does not scale with the number of cores!
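
The gap between the two curves can be put into numbers. The sketch below assumes, purely for illustration, that the "2x the cores, 1.5x the performance" ratio holds at every doubling; the function names and the choice of a 4-core baseline are hypothetical and are not taken from the SPECweb2005 results themselves.

```python
import math

def ideal_speedup(cores: int, base_cores: int = 1) -> float:
    """Linear scaling: throughput grows in proportion to the core count."""
    return cores / base_cores

def modeled_speedup(cores: int, base_cores: int = 1,
                    gain_per_doubling: float = 1.5) -> float:
    """Assumed model: every doubling of cores multiplies throughput by 1.5."""
    doublings = math.log2(cores / base_cores)
    return gain_per_doubling ** doublings

if __name__ == "__main__":
    for cores in (4, 8, 16):
        ideal = ideal_speedup(cores, base_cores=4)
        actual = modeled_speedup(cores, base_cores=4)
        print(f"{cores:2d} cores: ideal {ideal:.2f}x, "
              f"modeled {actual:.2f}x, efficiency {actual / ideal:.0%}")
```

Under this assumption the shortfall compounds: each doubling loses another quarter of the ideal throughput.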

SLIDE 5

Determining Why Performance Scales Poorly

  • Reproduced the published results on our own server
  • Tested common causes of poor scaling
  • System
    – 8-core Intel Xeon server
    – Four 1GbE NICs
  • Software
    – Apache 2, Linux 2.6, PHP 5 web server
    – SPECweb2005 Support workload (highest throughput of the 3 SPECweb2005 workloads)
  • Performance metrics
    – Compare throughput when increasing from 1 to 8 cores
    – Compare cycles executed per byte transmitted when increasing from 1 to 8 cores
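
The second metric, cycles per byte, can be derived from quantities a profiler and a traffic counter report. This is a minimal sketch of one plausible way to compute it; the function name, its parameters, and the sample numbers are assumptions for illustration, not the authors' measurement tooling or data.

```python
def cycles_per_byte(cores: int, clock_hz: float, cpu_utilization: float,
                    throughput_bits_per_s: float) -> float:
    """Busy CPU cycles spent per byte transmitted.

    cpu_utilization is the average busy fraction (0.0 to 1.0) across cores.
    """
    busy_cycles_per_s = cores * clock_hz * cpu_utilization
    bytes_per_s = throughput_bits_per_s / 8
    return busy_cycles_per_s / bytes_per_s

if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the shape of the comparison.
    one_core = cycles_per_byte(1, 2.0e9, 1.0, 0.8e9)
    eight_cores = cycles_per_byte(8, 2.0e9, 1.0, 3.2e9)
    print(f"1 core:  {one_core:.1f} cycles/byte")
    print(f"8 cores: {eight_cores:.1f} cycles/byte")
```

With perfect scaling, cycles per byte would stay constant as cores are added; any rise means each byte costs more work on the larger machine.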

SLIDE 6

Our Performance Scaling Results

[Figure: Web Server Throughput. Throughput (Gb/s) and speedup versus number of cores, server versus ideal; scaling falls short.]
[Figure: Web Server Cycles/Byte. Cycles/byte versus number of cores, server versus ideal; the server spends more cycles to send data.]

Like the published results, our server scales poorly.

SLIDE 7

Where Do Cycles per Byte Increase?

[Figure: Web Server Cycles/Byte. Cycles/byte versus number of cores, server versus ideal; both OS and application contribute.]
[Figure: Ratio of OS (TCP/IP Stack) to Application (Web Server). CPU utilization ratio (0.0 to 2.0) versus number of cores; the OS:application ratio is steady.]

Either both OS and application are poorly parallelized, or something else is affecting them both.

SLIDE 8

Possible Causes of Poor Scaling

  • We investigated many other causes (details in the paper)
  • Potential parallelism problems in software
    – Bad parallelism in the TCP/IP stack
    – Longer code path per flow
    – Stalls due to cache and TLB misses
  • Potential scaling problems in hardware
    – Stalls due to system bus saturation

SLIDE 9

Scaling in the TCP/IP Stack

  • Removed web server (TCP only)
  • Bulk transmit
  • 6 NICs at line rate
  • 128 flows per core

[Figure: TCP/IP Stack Throughput. Throughput (Gb/s) versus number of cores (2 to 6).]
[Figure: TCP/IP Stack CPU Utilization. CPU utilization (0% to 100%) versus number of cores (2 to 6); per-core CPU utilization remains flat.]

The TCP/IP stack is parallelized well.

SLIDE 10

Possible Causes of Poor Scaling

  • Potential parallelism problems in software
    – Bad parallelism in the TCP/IP stack
    – Longer code path per flow
    – Stalls due to cache and TLB misses
  • Potential scaling problems in hardware
    – Stalls due to system bus saturation

SLIDE 11

Code Path Scaling

  • Length of code path may increase with number of cores
  • Examples
    – Waiting longer for spin locks
    – Traversing larger data structures
  • A longer code path would increase instructions per cycle (IPC)
  • In fact, IPC is decreasing

[Figure: instructions per cycle (0.0 to 0.8) versus number of cores (2 to 8); IPC is decreasing.]

Code path does not increase significantly. Decreasing IPC suggests instruction pipeline stalls.
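
The reasoning on this slide rests on the identity cycles/byte = (instructions/byte) / IPC. The sketch below splits a rise in cycles per byte into a code-path factor and a stall factor; the function names and the sample figures are hypothetical and are not the measured values behind the chart.

```python
def instructions_per_byte(cycles_per_byte: float, ipc: float) -> float:
    """Code path length per byte, recovered from cycles/byte and IPC."""
    return cycles_per_byte * ipc

def split_growth(cpb_before: float, ipc_before: float,
                 cpb_after: float, ipc_after: float) -> tuple[float, float]:
    """Return (code_path_growth, stall_growth).

    Their product equals cpb_after / cpb_before.
    """
    path_growth = (instructions_per_byte(cpb_after, ipc_after)
                   / instructions_per_byte(cpb_before, ipc_before))
    stall_growth = ipc_before / ipc_after
    return path_growth, stall_growth

if __name__ == "__main__":
    # Hypothetical: cycles/byte doubles while IPC halves.
    path, stall = split_growth(20.0, 0.70, 40.0, 0.35)
    print(f"code path grew {path:.2f}x, stalls account for {stall:.2f}x")
```

If the code path factor stays near 1 while IPC falls, the extra cycles are spent waiting rather than executing more instructions, which is the conclusion drawn above.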

SLIDE 12

Finding Pipeline Stalls

[Figure: Top Third Poorest Scaling Functions. Function cycles/byte increase between 1 and 8 cores (0.0 to 0.5) for zend_hash_find, tcp_sendpage, skb_clone, ap_merge_per_dir_configs, __d_lookup, _zend_hash_quick_add_or_update, kmem_cache_free, _zend_mm_alloc_int, __alloc_skb, dev_hard_start_xmit, copy_user_generic_string, memset_c, free_block, memcpy_c, tcp_ack, tcp_init_tso_segs, memcpy; dominated by memory load/store stalls.]

Scaling is harmed the most by stalls for memory loads and stores.
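
A ranking like the one in the chart can be produced from two per-function profiles. This is a minimal sketch assuming each profile is a mapping from function name to cycles per byte; the helper name is hypothetical, and the numbers are invented for illustration (only the function names come from the slide).

```python
def poorest_scaling(profile_1core: dict[str, float],
                    profile_8core: dict[str, float]) -> list[tuple[str, float]]:
    """Return the third of functions whose cycles/byte grew the most."""
    increases = {
        fn: profile_8core[fn] - profile_1core.get(fn, 0.0)
        for fn in profile_8core
    }
    ranked = sorted(increases.items(), key=lambda item: item[1], reverse=True)
    keep = max(1, len(ranked) // 3)
    return ranked[:keep]

if __name__ == "__main__":
    # Invented cycles/byte values; not the measured data.
    one_core = {"zend_hash_find": 0.20, "tcp_sendpage": 0.30, "skb_clone": 0.10,
                "memcpy": 0.50, "tcp_ack": 0.15, "__d_lookup": 0.12}
    eight_cores = {"zend_hash_find": 0.65, "tcp_sendpage": 0.70, "skb_clone": 0.40,
                   "memcpy": 0.55, "tcp_ack": 0.20, "__d_lookup": 0.30}
    for name, increase in poorest_scaling(one_core, eight_cores):
        print(f"{name}: +{increase:.2f} cycles/byte")
```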

SLIDE 13

Possible Causes of Poor Scaling

  • Potential parallelism problems in software
    – Bad parallelism in the TCP/IP stack
    – Longer code path per flow
    – Stalls due to cache and TLB misses
  • Potential scaling problems in hardware
    – Stalls due to system bus saturation

SLIDE 14

Stalls Caused by Cache and TLB Misses

  • More cache and TLB misses can increase memory accesses
  • Can be caused by increased data sharing between cores
  • In fact, cache and TLB misses are decreasing per cycle

[Figure: misses per cycle (0.000 to 0.005) versus number of cores (2 to 8) for the last-level cache and the data TLB; both are decreasing.]

Cache and TLB misses do not cause memory load/store stalls. Something else does.

SLIDE 15

Possible Causes of Poor Scaling

  • Potential parallelism problems in software
    – Bad parallelism in the TCP/IP stack
    – Longer code path per flow
    – Stalls due to cache and TLB misses
  • Potential scaling problems in hardware
    – Stalls due to system bus saturation

SLIDE 16

Possible Cause of Bus Saturation

  • The system bus (front-side bus) has two main components
    – Address bus carries requests and responses for data, called snoops
    – Data bus carries the data itself
  • Bus transaction example
    – A cache miss generates a snoop on the address bus
    – Snoop is broadcast to memory and all remote caches to find the most current data
    – Current copy of data is in memory
    – All remote caches and memory respond
  • More caches mean more sources and more destinations for snoops
  • Snoops grow O(n²) with the number of caches!

[Figure: two groups of cores with their caches, each group on its own bus to the memory controller and main memory; arrows show the snoop, the snoop responses, and the data response.]
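
The O(n²) growth follows from counting sources times destinations. The sketch below is a simplified model, assuming every cache misses at the same fixed rate and every miss is snooped by all remote caches plus memory; the function name and the per-cache miss count are hypothetical.

```python
def snoop_traffic(caches: int, misses_per_cache: int = 1) -> tuple[int, int]:
    """Return (snoop_requests, snoop_responses) on the address bus."""
    # Each miss reaches every remote cache and also memory.
    destinations = (caches - 1) + 1
    requests = caches * misses_per_cache * destinations
    # Every destination answers the snoop.
    responses = requests
    return requests, responses

if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        requests, responses = snoop_traffic(n)
        print(f"{n} caches: {requests} snoops, {responses} responses")
```

Doubling the number of caches quadruples the address bus traffic in this model, even when no data is shared between cores.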

SLIDE 17

The Effect of Snoops on Scaling

  • Snoops may increase bus utilization
  • Bus utilization above 2/3 is considered saturated
  • Data bus utilization increases, but is not saturated
  • Confirms data sharing between cores is minimal
  • Address bus utilization increases faster
  • Becomes saturated on 8 cores
  • Address bus saturation causes poor scaling!

[Figure: system bus utilization (0% to 100%) versus number of cores (2 to 8) for the address bus and the data bus; the bus is saturated above 2/3 utilization.]
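
The 2/3 rule can be applied mechanically to a utilization series. This sketch finds the first core count at which a bus crosses the threshold; the helper name is hypothetical and the utilization values are made up to mirror the qualitative shape described on the slide, not read from the chart.

```python
from typing import Optional

SATURATION_THRESHOLD = 2 / 3  # threshold stated on the slide

def first_saturated(utilization_by_cores: dict[int, float],
                    threshold: float = SATURATION_THRESHOLD) -> Optional[int]:
    """Smallest core count whose bus utilization exceeds the threshold."""
    for cores in sorted(utilization_by_cores):
        if utilization_by_cores[cores] > threshold:
            return cores
    return None

if __name__ == "__main__":
    # Made-up utilization fractions for illustration only.
    address_bus = {2: 0.30, 4: 0.50, 6: 0.62, 8: 0.72}
    data_bus = {2: 0.15, 4: 0.28, 6: 0.38, 8: 0.45}
    print("address bus saturates at:", first_saturated(address_bus))
    print("data bus saturates at:", first_saturated(data_bus))
```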

SLIDE 18

Insights

  • Although web servers are highly parallelized and share little data…
  • Systems are designed for shared memory applications
  • Snoops are broadcast regardless of good parallelism

SLIDE 19

Conclusions

  • Our web server scales poorly with the number of cores
  • The OS and application exploit flow-level parallelism and scale well
  • Address bus saturation due to broadcast snoops causes poor scaling

SLIDE 20

Reducing the Effect of Broadcast Snoops

  • More or faster links between processors (e.g. Intel QuickPath Interconnect, HyperTransport)
    – Because of O(n²) growth in broadcast snoops
    – Need O(n²) growth in number or speed of links to keep up
    – Not a scalable solution
  • Introduce directories and directory caches
    – Replaces broadcast snooping entirely
    – But comes with more latency and cost
  • There is no ideal solution yet; this is a current area of our research.
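
The two options can be compared by counting coherence requests. This is a deliberately simplified model, assuming a fixed miss rate per cache and a directory that forwards a request only to caches that actually hold the line; the function names and the sharers_per_line parameter are hypothetical, and the extra latency and cost of a directory are not modeled.

```python
def broadcast_messages(caches: int, misses_per_cache: int = 1) -> int:
    """Requests when every miss is broadcast to all remote caches and memory."""
    destinations = (caches - 1) + 1
    return caches * misses_per_cache * destinations

def directory_messages(caches: int, misses_per_cache: int = 1,
                       sharers_per_line: int = 0) -> int:
    """Requests when each miss asks one directory, which contacts only sharers."""
    per_miss = 1 + sharers_per_line
    return caches * misses_per_cache * per_miss

if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(f"{n:2d} caches: broadcast {broadcast_messages(n):3d}, "
              f"directory {directory_messages(n):2d}")
```

For a workload that shares little data, the directory count grows linearly with the number of caches while the broadcast count grows quadratically, which is why links alone cannot keep up.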

SLIDE 21

Thanks!

Questions?