Does Cache Sharing on Modern CMP Matter to the Performance of Contemporary Multithreaded Programs?
Eddy Zheng Zhang Yunlian Jiang Xipeng Shen (presenter) Computer Science Department The College of William and Mary, VA, USA
The College of William and Mary
Non-Uniformity
Moursy+:06, Fedorova+:07, Jiang+:08, Zhou+:09]
Program        Description              Parallelism    Working Set
Blackscholes   Black-Scholes equation   data           2MB
Bodytrack      body tracking            data           8MB
Canneal                                 unstructured   256MB
Facesim        face simulation          data           256MB
Fluidanimate   fluid dynamics           data           64MB
Streamcluster                           data           16MB
Swaptions      portfolio pricing        data           0.5MB
X264           video encoding           pipeline       16MB
Dedup          stream compression       pipeline       256MB
Ferret         image search             pipeline       64MB
Dimension         Variations   Description
benchmarks        10           from PARSEC
parallelism       3            data, pipeline, unstructured
inputs            4            simsmall, simmedium, simlarge, native
# of threads      4            1, 2, 4, 8
assignment        3            assignment of threads to cores
binding           2            yes, no
subset of cores   7            the cores a program uses
platforms         2            Intel Xeon & AMD Opteron
Program level OS level
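The "assignment" and "binding" dimensions can be illustrated with a minimal sketch. It assumes a hypothetical topology given as lists of core IDs that share a cache (`cache_groups`; real core numbering varies by machine), and it uses Python only as notation. The helper names `plan_affinity` and `bind_current_thread` are made up for this example and are not from the study.

```python
import os

def plan_affinity(n_threads, cache_groups, share):
    """Pick one core per thread from groups of cores that share a cache.

    share=True packs threads into the same cache group first;
    share=False spreads them across groups round-robin.
    """
    if share:
        # Fill one cache group before moving on to the next.
        order = [core for group in cache_groups for core in group]
    else:
        # Take the i-th core of every group in turn.
        width = max(len(group) for group in cache_groups)
        order = [group[i] for i in range(width)
                 for group in cache_groups if i < len(group)]
    if n_threads > len(order):
        raise ValueError("more threads than cores")
    return order[:n_threads]

def bind_current_thread(core):
    """Pin the caller to one core. Linux-only; returns False elsewhere."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {core})
        return True
    return False

# Hypothetical topology: cores 0-1 share one cache, cores 2-3 another.
groups = [[0, 1], [2, 3]]
sharing = plan_affinity(2, groups, share=True)
non_sharing = plan_affinity(2, groups, share=False)
```

With this assumed topology, two threads land on cores 0 and 1 in the sharing case and on cores 0 and 2 in the non-sharing case. Leaving threads unbound corresponds to the "binding: no" setting, where the OS scheduler decides placement.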
Intel (Xeon 5310)
  32K x 8, 4MB L2 x 4, 8GB DRAM
AMD (Opteron 2352)
  64K x 8, 512K x 8, 2MB L3 x 2, 4GB DRAM x 2
[Figure: thread placement diagrams with two threads (T1, T2) and four threads (T1 to T4)]
[Figure: bar chart per benchmark (blackscholes, bodytrack, canneal, facesim, fluidanimate, streamcluster, swaptions, x264); series: 2t simlarge, 2t native, 4t simlarge, 4t native; y-axis ticks 0.2 to 1.4]
[Figure: bar chart per benchmark (blackscholes, bodytrack, canneal, facesim, fluidanimate, streamcluster, swaptions, x264); series: 2t simlarge, 2t native, 4t simlarge, 4t native; y-axis ticks 0.2 to 1.4]
[Figure: bar chart per benchmark (blackscholes, bodytrack, canneal, facesim, fluidanimate, streamcluster, swaptions, x264); y-axis ticks 1.75 to 7]
Thread placements: (T1, T2 | T3, T4), (T1, T3 | T2, T4), (T1, T4 | T2, T3)
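The three placements above are exactly the ways to split four threads into two unordered pairs. A minimal sketch that enumerates them follows. It assumes each pair stands for two threads placed on cores that share a cache, which is an interpretation of the slide. The function name `pairings` is hypothetical.

```python
def pairings(threads):
    """Yield every way to split `threads` into unordered pairs.

    Assumes an even number of threads; an odd count yields nothing.
    """
    if not threads:
        yield []
        return
    first, rest = threads[0], threads[1:]
    for i, partner in enumerate(rest):
        # Pair the first thread with each candidate, then recurse on the rest.
        remaining = rest[:i] + rest[i + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail

placements = list(pairings(["T1", "T2", "T3", "T4"]))
for placement in placements:
    print(placement)
```

For four threads this produces three placements, in the same order as the list above.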
[Figure: bar chart per benchmark (blackscholes, bodytrack, canneal, facesim, fluidanimate, streamcluster, swaptions, x264); series: 2t simlarge, 2t native, 4t simlarge, 4t native; y-axis ticks 2 to 16]
Statistically insignificant: large fluctuations across runs of the same configuration.
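The remark about statistical insignificance can be made concrete with a small sketch: a difference between two configurations only counts if it is large relative to the run-to-run fluctuation within each configuration. The example uses Welch's t statistic as one possible check. The slides do not say which test was used, and the run times below are made up for illustration only.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    std_err = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / std_err

# Hypothetical run times in seconds, NOT measurements from the study.
shared = [10.0, 12.0, 9.0, 11.0, 13.0]
separate = [11.0, 9.5, 12.0, 10.0, 13.5]

t = welch_t(shared, separate)
# A small |t| means the gap between the means is within run-to-run noise.
noisy = abs(t) < 2.0
```

Here the means differ by 0.2 s while each configuration fluctuates by more than a second across runs, so the statistic stays far below a conventional threshold of about 2.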
[Figures: two diagrams, each showing thread 1 and thread 2]
[Figure: L2 Cache Miss and Mem Bus Trans; x-axis: 4t, 8t; y-axis ticks 0.25 to 1]
[Figure: Normalized L2 Misses (on Intel); panels: Blackscholes, Bodytrack; x-axis: 4t, 8t; y-axis ticks 0.25 to 1]
* A contrast to the large body of work in OS and architecture.
+:09, Tian+:09
First systematic examination of the influence of cache sharing in modern CMP on the performance of contemporary multithreaded applications.
Insignificant influence from cache sharing regardless of inputs, architecture, number of threads, thread placement, parallelism, phases, etc.
Mismatch between SW & HW causing the
Large potential of cache-share-aware code