

  1. Bryan Veal, Annie Foong, Intel R&D. Performance Scalability of a Multi-core Web Server

  2. Overview
     • The number of CPU cores on modern servers is increasing rapidly
     • Premise: for highly parallel workloads, performance should scale with the number of cores
     • We tested this premise for web servers
     • Our results show that web servers do not scale
     • We tested for common parallel-programming problems
     • We found few parallelism problems in the TCP/IP stack and the web software
     • Instead, we found problems inherent to server hardware design
     (12/03/2007, ANCS 2007: Performance Scalability of a Multi-Core Web Server)

  3. Why Performance Should Scale
     • Typical networked servers
       – Have multiple cores
       – Have NICs mapped onto cores
       – Support many clients
       – Each client has its own flow
     • Independence between flows
       – Parallelism in the TCP/IP stack
       – Parallelism in the application
     • Because of flow-level parallelism, performance should scale
     [Diagram: clients connect to the server through NICs, each NIC mapped to its own core; cores share memory]
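The flow-level parallelism described above can be sketched in a few lines. This is a minimal illustration, not the server's actual code: `handle_flow` and `serve` are hypothetical names, and worker threads stand in for the per-core NIC mappings of a real server.

```python
from concurrent.futures import ThreadPoolExecutor

def handle_flow(flow):
    # Hypothetical per-flow work: in a real server this would be
    # TCP/IP processing plus application logic for one client.
    client_id, payload = flow
    return client_id, payload.upper()

def serve(flows, workers=4):
    # Flows are independent, so each can be handled by any worker
    # with no shared state: the essence of flow-level parallelism.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(handle_flow, flows))
```

Because no state is shared between flows, adding workers (or cores) should, in principle, add throughput linearly.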

  4. How Performance Scales on Web Servers
     • Example web server benchmark: SPECweb2005
       – Official results from HP
       – Similar scaling for Intel and AMD CPUs
       – Performance metric is throughput
     • Ideal performance scales linearly
     • Actual performance scales poorly
       – 2x the cores, 1.5x the performance
     • Performance does not scale with the number of cores!
     [Chart: SPECweb2005 speedup vs. number of cores (up to 16); actual scaling falls short of ideal]
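The shortfall on this slide (2x the cores, only 1.5x the performance) can be expressed as a scaling efficiency, the fraction of ideal linear speedup actually achieved. The helper below is an illustration, not part of the benchmark.

```python
def scaling_efficiency(base_cores, base_perf, cores, perf):
    # Ideal speedup is linear in core count; efficiency is the
    # fraction of that ideal actually achieved.
    ideal_speedup = cores / base_cores
    actual_speedup = perf / base_perf
    return actual_speedup / ideal_speedup

# Doubling cores but gaining only 1.5x throughput, as observed:
print(scaling_efficiency(1, 1.0, 2, 1.5))  # 0.75
```

Compounded over three doublings (1 to 8 cores), 75% efficiency per doubling leaves well under half of the ideal 8x speedup.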

  5. Determining Why Performance Scales Poorly
     • Reproduced the published results on our own server
     • Tested common causes of poor scaling
     • System
       – 8-core Intel Xeon server
       – 4 1GbE NICs
     • Software
       – Apache 2, Linux 2.6, PHP 5 web server
       – SPECweb2005 Support workload (highest throughput of the 3 SPECweb2005 workloads)
     • Performance metrics
       – Compare throughput when increasing from 1 to 8 cores
       – Compare cycles executed per byte transmitted when increasing from 1 to 8 cores
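The cycles-per-byte metric follows directly from clock rate, utilization, and throughput. The function below is a sketch of that definition; any particular clock rate or throughput plugged into it is a hypothetical value, not a measurement from the paper.

```python
def cycles_per_byte(cores, clock_ghz, utilization, throughput_gbps):
    # Cycles executed per second across all busy cores, divided by
    # bytes transmitted per second.
    cycles_per_sec = cores * clock_ghz * 1e9 * utilization
    bytes_per_sec = throughput_gbps * 1e9 / 8
    return cycles_per_sec / bytes_per_sec

# Hypothetical example: one fully busy 2 GHz core pushing 1 Gb/s
# spends 16 cycles on every byte it sends.
print(cycles_per_byte(1, 2.0, 1.0, 1.0))  # 16.0
```

If the server scaled ideally, this ratio would stay flat as cores are added; a rising ratio means each byte is getting more expensive.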

  6. Our Performance Scaling Results
     • Like the published results, our server scales poorly.
     [Charts: web server throughput speedup vs. ideal for 1 to 8 cores (scaling falls short), and cycles/byte vs. number of cores (more cycles to send each byte as cores are added)]

  7. Where Do Cycles per Byte Increase?
     • Cycles/byte rises for both the OS (TCP/IP stack) and the application (web server)
     • The OS-to-application CPU utilization ratio stays steady from 1 to 8 cores
     • Either both the OS and the application are poorly parallelized, or something else is affecting them both.
     [Charts: cycles/byte contributions of OS and application vs. number of cores; OS:application ratio is flat]

  8. Possible Causes of Poor Scaling
     • We investigated many other causes (details in the paper)
     • Potential parallelism problems in software
       – Bad parallelism in the TCP/IP stack
       – Longer code path per flow
       – Stalls due to cache and TLB misses
     • Potential scaling problems in hardware
       – Stalls due to system bus saturation

  9. Scaling in the TCP/IP Stack
     • Removed the web server: TCP only
     • Bulk transmit
     • 6 NICs at line rate
     • 128 flows per core
     • Per-core CPU utilization remains flat as cores are added
     • The TCP/IP stack is parallelized well.
     [Charts: TCP/IP stack throughput (Gb/s) and CPU utilization for 1 to 6 cores]
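The bulk-transmit idea can be approximated in miniature with a single loopback flow. This toy sender/sink pair (all names hypothetical) only mirrors the shape of the test, not its 6-NIC, 128-flows-per-core setup: the sender streams bytes with no application logic, so only the TCP/IP stack does work.

```python
import socket
import threading

def run_sink(listener, totals):
    # Accept one connection and count bytes until the sender closes.
    conn, _ = listener.accept()
    with conn:
        total = 0
        while True:
            data = conn.recv(65536)
            if not data:
                break
            total += len(data)
    totals.append(total)

def bulk_transmit(total_bytes, chunk=65536):
    # One TCP flow over loopback: the sender streams zero bytes and
    # the sink counts them, exercising only the TCP/IP stack.
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    totals = []
    sink = threading.Thread(target=run_sink, args=(listener, totals))
    sink.start()
    port = listener.getsockname()[1]
    sent = 0
    with socket.create_connection(("127.0.0.1", port)) as s:
        buf = b"\0" * chunk
        while sent < total_bytes:
            sent += s.send(buf[: total_bytes - sent])
    sink.join()
    listener.close()
    return totals[0]
```

Running many such flows pinned across cores, as the authors did, isolates the stack's own scaling from the web server's.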

  10. Possible Causes of Poor Scaling
     • Potential parallelism problems in software
       – Bad parallelism in the TCP/IP stack
       – Longer code path per flow
       – Stalls due to cache and TLB misses
     • Potential scaling problems in hardware
       – Stalls due to system bus saturation

  11. Code Path Scaling
     • The length of the code path may increase with the number of cores
     • Examples
       – Waiting longer for spin locks
       – Traversing larger data structures
     • Such extra instructions retire quickly, so a longer code path would increase instructions per cycle (IPC)
     • In fact, IPC is decreasing
     • The code path does not increase significantly.
     • Decreasing IPC suggests instruction pipeline stalls.
     [Chart: instructions per cycle vs. number of cores (1 to 8), decreasing]
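The inference on this slide can be checked with simple arithmetic: cycles/byte equals instructions/byte divided by IPC, so a flat instruction path combined with falling IPC fully accounts for a rising cycle cost. The numbers below are hypothetical, chosen only to show the relationship.

```python
def cycles_per_byte_from_ipc(instructions_per_byte, ipc):
    # cycles/byte = (instructions/byte) / (instructions/cycle).
    # With a flat instruction count per byte, a falling IPC alone
    # explains the extra cycles spent per byte.
    return instructions_per_byte / ipc

# Hypothetical: same code path, but IPC halves from 0.7 to 0.35.
one_core = cycles_per_byte_from_ipc(20, 0.7)
eight_cores = cycles_per_byte_from_ipc(20, 0.35)
print(eight_cores / one_core)  # ~2.0: cycle cost doubles
```

Stalled cycles retire no instructions, which is exactly how IPC can fall while the code path stays the same length.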

  12. Finding Pipeline Stalls
     • The top third of the poorest-scaling functions is dominated by memory load/store stalls: memcpy, tcp_init_tso_segs, tcp_ack, memcpy_c, free_block, memset_c, copy_user_generic_string, dev_hard_start_xmit, __alloc_skb, _zend_mm_alloc_int, kmem_cache_free, _zend_hash_quick_add_or_update, __d_lookup, ap_merge_per_dir_configs, skb_clone, tcp_sendpage, zend_hash_find
     • Scaling is harmed the most by stalls for memory loads and stores.
     [Chart: cycles/byte increase between 1 and 8 cores, per function]

  13. Possible Causes of Poor Scaling
     • Potential parallelism problems in software
       – Bad parallelism in the TCP/IP stack
       – Longer code path per flow
       – Stalls due to cache and TLB misses
     • Potential scaling problems in hardware
       – Stalls due to system bus saturation

  14. Stalls Caused by Cache and TLB Misses
     • More cache and TLB misses can increase memory accesses
     • Misses can be caused by increased data sharing between cores
     • In fact, last-level cache and data TLB misses per cycle are decreasing
     • Cache and TLB misses do not cause the memory load/store stalls. Something else does.
     [Chart: last-level cache and data TLB misses per cycle vs. number of cores (1 to 8), decreasing]

  15. Possible Causes of Poor Scaling
     • Potential parallelism problems in software
       – Bad parallelism in the TCP/IP stack
       – Longer code path per flow
       – Stalls due to cache and TLB misses
     • Potential scaling problems in hardware
       – Stalls due to system bus saturation

  16. Possible Cause of Bus Saturation
     • The system bus (front-side bus) has two main components
       – Address bus: carries requests and responses for data, called snoops
       – Data bus: carries the data itself
     • Bus transaction example
       – A cache miss generates a snoop on the address bus
       – The snoop is broadcast to memory and all remote caches to find the most current data
       – Here, the current copy of the data is in memory
       – All remote caches and memory respond
     • More caches mean more sources and more destinations for snoops
     • Snoops grow O(n²) with the number of caches!
     [Diagram: cores and caches on a shared bus; a snoop travels to the memory controller and remote caches, and responses and data return]
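The O(n²) growth can be illustrated with a minimal counting model. This is an assumption-level sketch, not a model of any real bus protocol: each miss is broadcast to every remote cache plus memory, and each target sends a response.

```python
def address_bus_messages(n_caches, misses_per_cache=1):
    # Per miss: one snoop to each of the (n - 1) remote caches and
    # one to memory, then a response back from each target. With
    # every cache generating misses, total address-bus traffic
    # grows O(n^2) in the number of caches.
    targets = (n_caches - 1) + 1   # remote caches + memory
    per_miss = targets * 2         # snoops out + responses back
    return n_caches * misses_per_cache * per_miss

# Quadrupling caches from 2 to 8 multiplies snoop traffic by 16:
print(address_bus_messages(8) // address_bus_messages(2))  # 16
```

Under this model the snoop load per miss grows linearly with caches, and the number of miss sources grows linearly too, giving the quadratic total the slide warns about.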

  17. The Effect of Snoops on Scaling
     • Snoops may increase bus utilization
     • Bus utilization above 2/3 is considered saturated
     • Data bus utilization increases but is not saturated
       – Confirms that data sharing between cores is minimal
     • Address bus utilization increases faster
       – Becomes saturated at 8 cores
     • Address bus saturation causes the poor scaling!
     [Chart: address and data bus utilization vs. number of cores; the address bus crosses the 2/3 saturation threshold at 8 cores]

  18. Insights
     • Although web servers are highly parallelized and share little data...
     • Systems are designed for shared-memory applications
     • Snoops are broadcast regardless of good parallelism

  19. Conclusions
     • Our web server scales poorly with the number of cores
     • The OS and application exploit flow-level parallelism and scale well
     • Address bus saturation due to broadcast snoops causes the poor scaling
