Performance Scalability of a Multi-core Web Server
Bryan Veal, Annie Foong
Intel R&D

Overview
• The number of CPU cores on modern servers is increasing rapidly
• Premise: for highly parallel workloads, performance should scale with the number of cores
• We tested this premise for web servers
• Our results show that web servers do not scale
• We tested for common problems caused by poor parallel programming
• We found few parallelism problems in the TCP/IP stack and the web software
• Instead, we found problems inherent to server hardware design

12/03/2007 ANCS 2007 -- Performance Scalability of a Multi-Core Web Server

Why Performance Should Scale
• Typical networked servers
– Have multiple cores
– Have NICs mapped onto cores
– Support many clients
– Each client has its own flow
• Independence between flows
– Parallelism in the TCP/IP stack
– Parallelism in the application
• Because of flow-level parallelism, performance should scale
[Figure: Clients and Server; each NIC is paired with a Core, and all cores share Memory]

How Performance Scales on Web Servers
• Example web server benchmark: SPECweb2005
– Official results from HP
– Similar scaling for Intel and AMD CPUs
– Performance metric is throughput
• Ideal performance scales linearly
• Actual performance scales poorly
– 2x the cores
– 1.5x the performance
[Figure: SPECweb2005 Scaling. Speedup vs. Number of Cores (0 to 16), Ideal Performance and Actual Performance curves; annotation: scaling falls short]
Performance does not scale with the number of cores!
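The "2x the cores, 1.5x the performance" observation can be restated as a scaling efficiency. The following is a minimal sketch, not something taken from the slides: the function name `scaling_efficiency` and the definition (performance ratio divided by core ratio) are assumptions chosen for illustration.

```python
def scaling_efficiency(core_ratio: float, perf_ratio: float) -> float:
    """Fraction of the ideal (linear) gain that was actually achieved.

    core_ratio: how many times more cores were used (e.g. 2.0 for doubling).
    perf_ratio: how many times higher the throughput became.
    Ideal linear scaling gives 1.0; lower values mean scaling falls short.
    """
    if core_ratio <= 0:
        raise ValueError("core_ratio must be positive")
    return perf_ratio / core_ratio


# Doubling the cores for 1.5x the performance (the figures quoted on the slide).
efficiency = scaling_efficiency(2.0, 1.5)
print(f"scaling efficiency: {efficiency:.2f}")
```

Under this definition, the quoted step keeps three quarters of the gain that linear scaling would give.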

Determining Why Performance Scales Poorly
• Reproduced the published results on our own server
• Tested common causes of poor scaling
• System
– 8-core Intel Xeon server
– Four 1 GbE NICs
• Software
– Apache 2, Linux 2.6, PHP 5 Web Server
– SPECweb2005 Support Workload
• Highest throughput of 3 SPECweb2005 workloads
• Performance Metrics
– Compare throughput when increasing from 1 to 8 cores
– Compare cycles executed per byte transmitted when increasing from 1 to 8 cores
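The slides do not say how cycles per byte was measured. One plausible way to derive it, shown here only as a sketch under that assumption, is to divide the busy CPU cycles per second by the bytes transmitted per second. The function name, its parameters, and the sample numbers below are all hypothetical.

```python
def cycles_per_byte(cores: int, clock_hz: float, utilization: float,
                    throughput_bits_per_s: float) -> float:
    """Estimate CPU cycles spent per byte transmitted.

    Assumed derivation (not from the slides):
      busy cycles/s = cores * clock rate * average per-core utilization
      bytes/s       = throughput in bits/s divided by 8
    """
    if throughput_bits_per_s <= 0:
        raise ValueError("throughput must be positive")
    busy_cycles_per_s = cores * clock_hz * utilization
    bytes_per_s = throughput_bits_per_s / 8
    return busy_cycles_per_s / bytes_per_s


# Hypothetical inputs, chosen only to make the arithmetic easy to follow:
# one fully busy 2 GHz core sending 1 Gb/s.
example = cycles_per_byte(cores=1, clock_hz=2e9, utilization=1.0,
                          throughput_bits_per_s=1e9)
print(example)
```

With this metric, ideal scaling keeps cycles per byte constant as cores are added; any increase means each byte costs more CPU work.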

Our Performance Scaling Results
[Figure: Web Server Throughput. Throughput (Gb/s) and Speedup vs. Number of Cores (0 to 8), Server and Ideal curves; annotation: scaling falls short]
[Figure: Web Server Cycles/Byte. Cycles/Byte vs. Number of Cores (0 to 8), Server and Ideal curves; annotation: more cycles to send data]
Like the published results, our server scales poorly.

Where Do Cycles per Byte Increase?
[Figure: Ratio of OS (TCP/IP Stack) to Application (Web Server). CPU Utilization Ratio vs. Number of Cores (0 to 8); annotation: OS:Application ratio is steady]
[Figure: Web Server Cycles/Byte. Cycles/Byte vs. Number of Cores (0 to 8), with Ideal curve; annotation: both OS and application contribute]
Either both OS and application are poorly parallelized or something else is affecting them both.

Possible Causes of Poor Scaling
• We investigated many other causes (details in the paper)
• Potential parallelism problems in software
– Bad parallelism in the TCP/IP stack
– Longer code path per flow
– Stalls due to cache and TLB misses
• Potential scaling problems in hardware
– Stalls due to system bus saturation

Scaling in the TCP/IP Stack
• Removed web server (TCP only)
• Bulk transmit
• 6 NICs at line rate
• 128 flows per core
[Figure: TCP/IP Stack Throughput. Throughput (Gb/s) vs. Number of Cores (0 to 6)]
[Figure: TCP/IP Stack CPU Utilization. CPU Utilization (%) vs. Number of Cores (0 to 6); annotation: per-core CPU utilization remains flat]
The TCP/IP stack is parallelized well.

Possible Causes of Poor Scaling
• Potential parallelism problems in software
– Bad parallelism in the TCP/IP stack
– Longer code path per flow
– Stalls due to cache and TLB misses
• Potential scaling problems in hardware
– Stalls due to system bus saturation

Code Path Scaling
• Length of code path may increase with number of cores
• Examples
– Waiting longer for spin locks
– Traversing larger data structures
• A longer code path would increase instructions per cycle (IPC)
• In fact, IPC is decreasing
[Figure: Instructions per Cycle vs. Number of Cores (0 to 8); annotation: decreasing]
Code path does not increase significantly. Decreasing IPC suggests instruction pipeline stalls.

Finding Pipeline Stalls
[Figure: Top Third Poorest Scaling Functions. Cycles/Byte Increase between 1 and 8 Cores (0.0 to 0.5), by Function; annotation: dominated by memory load/store stalls]
Functions shown: memcpy, tcp_init_tso_segs, tcp_ack, memcpy_c, free_block, memset_c, copy_user_generic_string, dev_hard_start_xmit, __alloc_skb, _zend_mm_alloc_int, kmem_cache_free, _zend_hash_quick_add_or_update, __d_lookup, ap_merge_per_dir_configs, skb_clone, tcp_sendpage, zend_hash_find
Scaling is harmed the most by stalls for memory loads and stores.

Stalls Caused by Cache and TLB Misses
• More cache and TLB misses can increase memory accesses
• Can be caused by increased data sharing between cores
• In fact, cache and TLB misses are decreasing per cycle
[Figure: Misses per Cycle vs. Number of Cores (0 to 8), Last-level Cache and Data TLB curves; annotation: decreasing]
Cache and TLB misses do not cause memory load/store stalls. Something else does.

Possible Cause of Bus Saturation
• The system bus (front-side bus) has two main components
– Address Bus carries requests and responses for data, called snoops
– Data Bus carries the data itself
• Bus Transaction Example
– A cache miss generates a snoop on the address bus
– Snoop is broadcast to memory and all remote caches to find the most current data
– Current copy of data is in memory
– All remote caches and memory respond
• More caches mean more sources and more destinations for snoops
• Snoops grow O(n²) with the number of caches!
[Figure: Cores and Caches attached to a Bus with the Memory Controller and Main Memory; arrows labeled Snoop, Response, Data]
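The O(n²) growth follows from counting sources times destinations. The sketch below is a simplified model, not the authors' measurement: it assumes every cache generates the same number of misses per unit of time and that each miss is broadcast to every remote cache. The function name `snoop_deliveries` is hypothetical.

```python
def snoop_deliveries(num_caches: int, misses_per_cache: int = 1) -> int:
    """Count cache-to-cache snoop deliveries under a simple broadcast model.

    Assumptions (illustrative only):
      - each cache is a source of `misses_per_cache` snoops per time unit
      - each snoop is delivered to every remote cache (num_caches - 1)
    Sources grow with n and destinations grow with n - 1, so the total
    grows as n * (n - 1), which is O(n^2).
    """
    if num_caches < 1:
        raise ValueError("need at least one cache")
    sources = num_caches * misses_per_cache
    destinations_per_snoop = num_caches - 1
    return sources * destinations_per_snoop


# Print the delivery count for a few cache counts.
for n in (1, 2, 4, 8):
    print(n, snoop_deliveries(n))
```

In this model, doubling the caches from 4 to 8 raises deliveries from 12 to 56, more than four times as many, even though the workload per cache is unchanged.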

The Effect of Snoops on Scaling
• Snoops may increase bus utilization
• Bus utilization above 2/3 is considered saturated
• Data bus utilization increases, but is not saturated
• Confirms data sharing between cores is minimal
• Address bus utilization increases faster
• Becomes saturated on 8 cores
• Address bus saturation causes poor scaling!
[Figure: System Bus Utilization (%) vs. Number of Cores (0 to 8), Address Bus and Data Bus curves; annotation: bus is saturated above 2/3 utilization]

Insights
• Although web servers are highly parallelized and share little data…
• Systems are designed for shared memory applications
• Snoops are broadcast regardless of good parallelism

Conclusions
• Our web server scales poorly with the number of cores
• The OS and application exploit flow-level parallelism and scale well
• Address bus saturation due to broadcast snoops causes poor scaling
