Tanima Dey
Wei Wang, Jack W. Davidson, Mary L. Soffa
Department of Computer Science, University of Virginia
ISPASS 2011
Motivation
The number of cores doubles every 18 months
Expected: Performance ∝ number of cores
One of the bottlenecks is shared resource contention
For multi-threaded workloads, contention is
To reduce contention, it is necessary to understand
Intel Quad Core Q9550
[Figure: cores C0-C3 with four L1 caches, two L2 caches, the Front-Side Bus, and memory; threads of Application 1 and Application 2 are mapped onto the cores.]
With co-runner
[Figure: cores C0-C3 with L1 caches, two L2 caches, and memory, running threads of Application 1 and Application 2.]
Without co-runner
[Figure: cores C0-C3 with L1 caches, two L2 caches, and memory, running threads of a single application.]
Intra-application contention
Contention among threads from the same application
(No co-runners)
Inter-application contention
Contention among threads from the co-running application
A general methodology to evaluate a multi-threaded
Intra-application contention
Inter-application contention
Contention in the memory-hierarchy shared resources
Characterizing applications facilitates better
Thorough performance analyses and characterization
Motivation
Contributions
Methodology
Measuring intra-application contention
Measuring inter-application contention
Related Work
Summary
Designed to measure both intra- and inter-application contention
L1-cache, L2-cache, Front Side Bus (FSB)
Each application is run in two configurations
Baseline: threads do not share the targeted resource
Contention: threads share the targeted resource
Multiple instances of the targeted resource
Determine contention by comparing performance
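Choosing cores for the two configurations can be sketched as follows. This is a minimal illustration, not the authors' tool: the topology map (core number to the L2 instance it uses) is an assumption read off the quad-core diagram, and `pick_cores` is a hypothetical helper.

```python
def pick_cores(resource_of, n_threads, share):
    """Pick cores for n_threads threads.

    resource_of maps a core number to the instance of the targeted
    resource that core uses (assumed topology, not measured).
    share=True  -> contention: all chosen cores use the same instance.
    share=False -> baseline: every chosen core uses a different instance.
    """
    groups = {}
    for core, instance in sorted(resource_of.items()):
        groups.setdefault(instance, []).append(core)

    if share:
        for cores in groups.values():
            if len(cores) >= n_threads:
                return cores[:n_threads]
        raise ValueError("no single instance is shared by enough cores")

    if len(groups) < n_threads:
        raise ValueError("not enough instances of the targeted resource")
    return [cores[0] for cores in list(groups.values())[:n_threads]]


# Assumed L2 topology for the quad-core platform: C0/C1 and C2/C3 each share one L2.
yorkfield_l2 = {0: 0, 1: 0, 2: 1, 3: 1}

contention_cores = pick_cores(yorkfield_l2, 2, share=True)
baseline_cores = pick_cores(yorkfield_l2, 2, share=False)
```

The baseline branch fails when there are fewer instances of the resource than threads, which is why the methodology needs more than one instance of the targeted resource.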
Motivation
Contributions
Methodology
Measuring intra-application contention (See paper)
Measuring inter-application contention
Related Work
Summary
L1-cache
[Figure: Baseline Configuration and Contention Configuration. In each, threads of Application 1 and Application 2 are mapped onto cores C0-C3, with four L1 caches, L2 caches, and memory.]
L2-cache
[Figure: Baseline Configuration and Contention Configuration. In each, threads of Application 1 and Application 2 are mapped onto cores C0-C3, with four L1 caches, L2 caches, and memory.]
FSB
[Figure: Baseline Configuration. Threads of Application 1 and Application 2 mapped onto cores C0, C2, C4, C6 and C1, C3, C5, C7, each with an L1 cache, above four L2 caches and memory.]
[Figure: Contention Configuration. Same eight-core layout with threads of Application 1 and Application 2 placed so that they share the targeted resource.]
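One way to realize configurations like these on Linux is to restrict each application to a chosen set of cores. The sketch below assumes the `taskset` utility from util-linux is installed; the slides do not say which mechanism the authors used, and the helper names `pinned_command` and `time_run` as well as the core numbers are illustrative.

```python
import subprocess
import time


def pinned_command(cores, argv):
    """Prefix argv with taskset so the process may only run on the given cores."""
    cpu_list = ",".join(str(core) for core in cores)
    return ["taskset", "-c", cpu_list] + list(argv)


def time_run(cores, argv):
    """Run argv restricted to the given cores and return wall-clock seconds.

    Note: taskset limits the whole process (and the threads it creates) to
    the core set; it does not place one specific thread on one specific core.
    """
    start = time.perf_counter()
    subprocess.run(pinned_command(cores, argv), check=True)
    return time.perf_counter() - start


# Hypothetical application binary and arguments, for illustration only.
example = pinned_command([0, 2], ["./app", "-n", "2"])
```

Timing the same application once with a baseline core set and once with a contention core set gives the two numbers that the methodology compares.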
Application Domain    Benchmark(s)
Financial Analysis    Blackscholes (BS), Swaptions (SW)
Computer Vision       Bodytrack (BT)
Engineering           Canneal (CN)
Enterprise Storage    Dedup (DD)
Animation             Facesim (FA), Fluidanimate (FL)
Similarity Search     Ferret (FE)
Rendering             Raytrace (RT)
Data Mining           Streamcluster (SC)
Media Processing      Vips (VP), X264 (X2)
Platform 1: Yorkfield
Intel Quad core Q9550
32 KB L1-D and L1-I cache
6MB L2-cache
2GB Memory
Common FSB
[Figure: cores C0-C3, each with an L1 cache and L1 HW-PF; two L2 caches, each with an L2 HW-PF and an FSB interface; FSB, Memory Controller Hub (Northbridge), MB, and memory.]
Platform 2: Harpertown
[Figure: cores C0, C2, C4, C6 and C1, C3, C5, C7, each with an L1 cache and L1 HW-PF; four L2 caches, each with an L2 HW-PF and an FSB interface; two FSBs, Memory Controller Hub (Northbridge), MB, and memory.]
Inter-application contention
For the i-th co-runner

\[ \mathrm{PercentPerformanceDifference}_i = \frac{(\mathrm{PerformanceBase}_i - \mathrm{PerformanceContend}_i) \times 100}{\mathrm{PerformanceBase}_i} \]

Absolute performance difference sum

\[ \mathrm{APDS} = \sum_i \left| \mathrm{PercentPerformanceDifference}_i \right| \]
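As a worked illustration of the two metrics, the sketch below computes them in Python. The function names and the sample numbers are made up for illustration; what "performance" measures is left to the caller.

```python
def percent_performance_difference(base, contend):
    """Percent difference between baseline and contention performance."""
    return (base - contend) * 100.0 / base


def apds(pairs):
    """Absolute performance difference sum over (base, contend) pairs,
    one pair per co-runner."""
    return sum(abs(percent_performance_difference(b, c)) for b, c in pairs)


# Two hypothetical co-runners: (PerformanceBase_i, PerformanceContend_i).
samples = [(100.0, 90.0), (50.0, 55.0)]

differences = [percent_performance_difference(b, c) for b, c in samples]
total = apds(samples)
```

Because APDS sums absolute values, differences of opposite sign do not cancel: the two sample co-runners contribute 10 each rather than summing to zero.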
L1-cache – for Streamcluster
[Figure: Inter-application L1-cache Contention. x-axis: co-running benchmarks (Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Raytrace, Swaptions, Vips, X264); y-axis: Performance Difference (%).]
Streamcluster
[Figure: Inter-application L1-cache Contention. x-axis: co-running benchmarks (Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Raytrace, Streamcluster, Swaptions, Vips, X264); y-axis: Performance Difference (%).]
L1-cache
L2-cache
FSB
Benchmarks       L1-cache    L2-cache        FSB
Blackscholes     none        none            none
Bodytrack        inter       inter           intra
Canneal          intra       inter           intra
Dedup            inter       intra, inter    intra, inter
Facesim          inter       inter           intra
Ferret           intra       intra, inter    intra
Fluidanimate     inter       inter           intra
Raytrace         none        none            intra
Streamcluster    inter       inter           intra
Swaptions        none        none            none
Vips             intra       inter           inter
X264             inter       intra, inter    intra
The methodology generalizes contention analysis of
New approach to characterize applications
Useful for performance analysis of existing and future architecture or benchmarks
Helpful for creating new workloads of diverse properties
Provides insights for designing improved contention-
Cache contention
Knauerhase et al., IEEE Micro 2008
Zhuravlev et al., ASPLOS 2010
Xie et al., CMP-MSI 2008
Mars et al., HiPEAC 2011
Characterizing parallel workload
Jin et al., NASA Technical Report 2009
PARSEC benchmark suite
Bienia et al., PACT 2008
Bhadauria et al., IISWC 2009