 
Tanima Dey, Wei Wang, Jack W. Davidson, Mary L. Soffa
Department of Computer Science
University of Virginia
ISPASS 2011

Motivation
• The number of cores doubles every 18 months
• Expected: performance scales with the number of cores
• One of the bottlenecks is shared-resource contention
• For multi-threaded workloads, contention is unavoidable
• To reduce contention, it is necessary to understand where and how the contention is created

Shared Resource Contention in Chip-Multiprocessors
[Figure: Intel Quad Core Q9550 with cores C0-C3, one L1 cache per core, two L2 caches, a front-side bus, and memory; threads of Application 1 and Application 2 run on the cores.]

Scenario 1
Multi-threaded applications
• With co-runner
[Figure: threads of Application 1 and Application 2 on cores C0-C3, above the L1 caches, L2 caches, and memory.]

Scenario 2
Multi-threaded applications
• Without co-runner
[Figure: threads of a single application on cores C0-C3, above the L1 caches, L2 caches, and memory.]

Shared-Resource Contention
• Intra-application contention
  • Contention among threads from the same application (no co-runners)
• Inter-application contention
  • Contention among threads from the co-running application

Contributions
• A general methodology to evaluate a multi-threaded application's performance
  • Intra-application contention
  • Inter-application contention
  • Contention in the memory-hierarchy shared resources
• Characterizing applications facilitates better understanding of the application's resource sensitivity
• Thorough performance analyses and characterization of multi-threaded PARSEC benchmarks

Outline
• Motivation
• Contributions
• Methodology
  • Measuring intra-application contention
  • Measuring inter-application contention
• Related Work
• Summary

Methodology
• Designed to measure both intra- and inter-application contention for a targeted shared resource
  • L1-cache, L2-cache
  • Front Side Bus (FSB)
• Each application is run in two configurations
  • Baseline: threads do not share the targeted resource
  • Contention: threads share the targeted resource
  • Multiple instances of the targeted resource
• Determine contention by comparing performance (gathered from hardware performance counter values)

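The two configurations can be thought of as two choices of which cores the threads run on. The sketch below is only an illustration of that idea, not the tooling used in this work: the topology map `L2_OF_CORE` (a quad core where cores 0/1 and 2/3 each share one L2 cache), the function name `pick_cores`, and the two-thread example are all assumptions.

```python
from collections import defaultdict

# Hypothetical topology: core id -> id of the L2 cache that core uses.
# Assumes a quad core where cores (0, 1) and (2, 3) each share one L2.
L2_OF_CORE = {0: 0, 1: 0, 2: 1, 3: 1}


def pick_cores(resource_of_core, n_threads, share):
    """Return the core ids on which to place n_threads threads.

    share=True  -> contention configuration: all threads run on cores
                   that use the same instance of the targeted resource.
    share=False -> baseline configuration: each thread runs on a core
                   that uses a different instance of the resource.
    """
    # Group cores by the resource instance they are attached to.
    groups = defaultdict(list)
    for core, resource in sorted(resource_of_core.items()):
        groups[resource].append(core)

    if share:
        for cores in groups.values():
            if len(cores) >= n_threads:
                return cores[:n_threads]
        raise ValueError("no resource instance has enough cores")

    if len(groups) < n_threads:
        raise ValueError("not enough resource instances for a baseline")
    return [cores[0] for cores in list(groups.values())[:n_threads]]


baseline = pick_cores(L2_OF_CORE, 2, share=False)   # threads on separate L2s
contention = pick_cores(L2_OF_CORE, 2, share=True)  # threads on one L2
print(baseline, contention)
```

The baseline branch is why the machine needs more than one instance of the targeted resource. On Linux, core lists like these could be handed to a pinning mechanism such as `taskset -c` or `os.sched_setaffinity`; how threads were actually pinned in the experiments is not stated on the slide.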
Outline
• Motivation
• Contributions
• Methodology
  • Measuring intra-application contention (see paper)
  • Measuring inter-application contention
• Related Work
• Summary

Measuring inter-application contention
• L1-cache
[Figure: baseline configuration and contention configuration; threads of Application 1 and Application 2 placed on cores C0-C3, each core with its own L1 cache, above the L2 caches and memory.]

Measuring inter-application contention
• L2-cache
[Figure: baseline configuration and contention configuration; threads of Application 1 and Application 2 placed on cores C0-C3, above the L1 caches, L2 caches, and memory.]

Measuring inter-application contention
• FSB
[Figure: baseline configuration; threads of Application 1 and Application 2 placed on cores C0, C2, C4, C6, C1, C3, C5, C7, above the L1 caches, L2 caches, and memory.]

Measuring inter-application contention
• FSB
[Figure: contention configuration; threads of Application 1 and Application 2 placed on cores C0, C2, C4, C6, C1, C3, C5, C7, above the L1 caches, L2 caches, and memory.]

PARSEC Benchmarks

Application Domain    Benchmark(s)
Financial Analysis    Blackscholes (BS), Swaptions (SW)
Computer Vision       Bodytrack (BT)
Engineering           Canneal (CN)
Enterprise Storage    Dedup (DD)
Animation             Facesim (FA), Fluidanimate (FL)
Similarity Search     Ferret (FE)
Rendering             Raytrace (RT)
Data Mining           Streamcluster (SC)
Media Processing      Vips (VP), X264 (X2)

Experimental platform
• Platform 1: Yorkfield
  • Intel Quad core Q9550
  • 32 KB L1-D and L1-I cache
  • 6 MB L2-cache
  • 2 GB memory
  • Common FSB
[Figure: cores C0-C3, each with an L1 cache and L1 hardware prefetcher (HW-PF); L2 caches with L2 HW-PF; FSB interfaces; FSB; Memory Controller Hub (Northbridge); memory.]

Experimental platform
• Platform 2: Harpertown
[Figure: cores C0, C2, C4, C6, C1, C3, C5, C7, each with an L1 cache and L1 hardware prefetcher (HW-PF); L2 caches with L2 HW-PF; FSB interfaces; FSBs; Memory Controller Hub (Northbridge); memory.]

Performance Analysis
• Inter-application contention
  • For the i-th co-runner:

\[
\text{PercentPerformanceDifference}_i = \frac{(\text{PerformanceBase}_i - \text{PerformanceContend}_i) \times 100}{\text{PerformanceBase}_i}
\]

• Absolute performance difference sum:

\[
\text{APDS} = \sum_i \left| \text{PercentPerformanceDifference}_i \right|
\]

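As a worked illustration of the two formulas, the following sketch computes the percent performance difference per co-runner and then the APDS. The function names, the co-runner labels, and the measurement values are made up for the example; they are not results from this work.

```python
def percent_performance_difference(perf_base, perf_contend):
    """(PerformanceBase_i - PerformanceContend_i) * 100 / PerformanceBase_i."""
    return (perf_base - perf_contend) * 100.0 / perf_base


def apds(differences):
    """Absolute performance difference sum over all co-runners."""
    return sum(abs(d) for d in differences)


# Made-up (baseline, contention) performance values per co-runner.
measurements = {
    "co_runner_a": (200.0, 190.0),
    "co_runner_b": (100.0, 104.0),
}

diffs = {
    name: percent_performance_difference(base, contend)
    for name, (base, contend) in measurements.items()
}
total = apds(diffs.values())
print(diffs, total)
```

With these made-up numbers the per-co-runner differences are 5.0 and -4.0, so the APDS is 9.0. Taking absolute values keeps differences of opposite sign from cancelling each other in the sum.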
Inter-application contention
• L1-cache, for Streamcluster
[Figure: "Inter-application L1-cache Contention"; y-axis: performance difference (%), from -8 to 8; x-axis: co-running benchmarks.]

Inter-application L1-cache contention
[Figure: "Inter-application L1-cache Contention", panel "Streamcluster"; y-axis: performance difference (%), from -8 to 8; x-axis: co-running benchmarks.]

Inter-application contention
• L1-cache

Inter-application contention
• L2-cache

Inter-application contention
• FSB

Characterization

Benchmarks       L1-cache    L2-cache        FSB
Blackscholes     none        none            none
Bodytrack        inter       inter           intra
Canneal          intra       inter           intra
Dedup            inter       intra, inter    intra, inter
Facesim          inter       inter           intra
Ferret           intra       intra, inter    intra
Fluidanimate     inter       inter           intra
Raytrace         none        none            intra
Streamcluster    inter       inter           intra
Swaptions        none        none            none
Vips             intra       inter           inter
X264             inter       intra, inter    intra

Summary
• The methodology generalizes contention analysis of multi-threaded applications
• New approach to characterize applications
• Useful for performance analysis of existing and future architecture or benchmarks
• Helpful for creating new workloads of diverse properties
• Provides insights for designing improved contention-aware scheduling methods

Related Work
• Cache contention
  • Knauerhase et al., IEEE Micro 2008
  • Zhuravlev et al., ASPLOS 2010
  • Xie et al., CMP-MSI 2008
  • Mars et al., HiPEAC 2011
• Characterizing parallel workload
  • Jin et al., NASA Technical Report 2009
• PARSEC benchmark suite
  • Bienia et al., PACT 2008
  • Bhadauria et al., IISWC 2009

Thank you!