Compile-Time Detection of False Sharing via Loop Cost Modeling
Munara Tolubaeva, Yonghong Yan and Barbara Chapman High Performance Computing and Tools Group (HPCTools) Computer Science Department University of Houston
OUTLINE
- Introduction
Loop Unrolling (unroll factor = 3):

for (i = 0; i < N; i++) {
    A[i] = B[i] + C[i];
}

for (i = 0; i < N; i += 3) {
    A[i+0] = B[i+0] + C[i+0];
    A[i+1] = B[i+1] + C[i+1];
    A[i+2] = B[i+2] + C[i+2];
}
- Estimates the time needed to execute a specific section of code on a given architecture
- Considers performance-impacting architectural features (processor, cache, parallelism)
- Open64 cost models: the most sophisticated models among open-source compilers
Open64 cost models
- Processor model: computational resource cost, dependency latency cost, register spilling cost, operation cost, issue cost, mem_ref cost
- Cache model: cache cost, TLB cost
- Parallel model: loop overhead, parallel overhead, machine cost, cache cost, reduction cost
For loops
- Optimize single-CPU performance, without considering shared-memory parallel execution
- Limited use of the models for compiler optimizations and performance analysis
- All false sharing detection techniques are implemented at runtime
- Processors maintain data consistency via cache coherency protocols
- Data sharing is at cache line granularity
- A store to a single data element invalidates every other copy of the cache line
- A successive read then suffers a cache miss and must reload the entire cache line
Execution time (sec) for linear_regression:

Code version   Sequential   2 threads   4 threads   8 threads
Unoptimized    0.503        4.563       3.961       4.432
Optimized      0.503        0.263       0.137       0.078
(Wicaksono et al., "Detecting false sharing in OpenMP applications using the DARWIN framework", LCPC 2011)
- False Sharing Detection Methods
  - Cache simulation and memory tracing (Gunther and Weidendorfer, WBIA'09)
  - Hardware performance counters (Marathe et al., Tech. rep. '06; Wicaksono et al., LCPC'11)
  - Memory protection (Liu and Berger, OOPSLA'11)
  - Memory shadowing (Zhao et al., VEE'11)
- False Sharing Elimination Methods
  - Tuning scheduling parameters, i.e. chunk size and chunk stride (Chow and Sarkar, ICPP'97)
  - Compiler transformations such as array padding and memory alignment (Jeremiassen and Eggers, PPoPP'95)
- All FS detection methods are applied at runtime and incur overhead
- Estimates the performance impact of FS on OpenMP parallel loops at compile time
- Outputs the total number of FS cases that will occur in a loop
- Analyzes the performance impact of FS on a parallel loop
- Introduces a linear regression model to reduce the modeling cost
- Number of threads executing the loop
- Loop boundaries
- Step sizes
- Index variables
- Chunk size (if specified for the OpenMP loop)
- Obtain array references made in the innermost loop of a loop nest
- Generate a cache line ownership list for each thread
- Apply stack distance analysis to the cache state of each thread
- Detect false sharing
- Array base name
- Array indices
- Memory offsets for arrays with structured data types
Thread0 Iteration_1: cacheline_a_1_w, cacheline_b_1_r
Thread1 Iteration_1: cacheline_a_1_w, cacheline_b_2_r
Thread7 Iteration_1: cacheline_a_1_w, cacheline_b_8_r
- At compile time it is impossible to know the corresponding cache set of an access, so each thread's cache state is modeled as a fully associative cache with LRU replacement
- Cache lines from each new iteration are inserted at the top of the thread's cache state (e.g., Thread0: cacheline_a_1_w, cacheline_b_1_r, cacheline_a_2_w, cacheline_b_2_r)
- LRU cache lines are evicted if the stack overflows
(Stack distance analysis of memory accesses, SC 2010.)
- Perform a one-to-all comparison: do the cache states of the other threads contain my cache line?
FS_i = Σ_{j=1, j≠i}^{n} Σ_{k=1}^{n_iter} contains(cs_j, cl_{i,k})

where cl_{i,k} is the k-th cache line accessed by thread i, cs_j is the cache state of thread j, and n is the number of threads.
- Predict the total false sharing cases by evaluating a much smaller number of loop iterations and extrapolating with a linear model
[Figure: false sharing cases (y-axis, up to 200) vs. number of iterations (x-axis, 20 to 140); the relationship is linear.]
The regression model fits T(x) = a + b*x over n sampled points (x_i, y_i):

b = (n Σ x_i y_i − Σ x_i Σ y_i) / (n Σ x_i² − (Σ x_i)²)
a = (Σ y_i − b Σ x_i) / n

The total false sharing count is then extrapolated as T(x_max), where x_max is the full iteration count.
- Compare the number of false sharing cases estimated by the full false sharing analysis with the number predicted by the regression model
[Equation relating the measured and modeled FS and non-FS (nfs) quantities used to compute the false sharing effect.]
- Four 2.2 GHz 12-core processors (48 cores in total)
- Dedicated L1 and L2 caches, 64KB and 512KB per core
- L3 cache of 10240KB shared among 12 cores
- Cache line size for all caches: 64 bytes
[Figure: false sharing effect (%) (5 to 30%) vs. number of threads (2 to 48); measured, predicted, and modeled values.]
[Figure: false sharing effect (%) (10 to 50%) vs. number of threads (2 to 48); measured, predicted, and modeled values.]
- Assist the compiler in optimizing code during high-level loop transformation phases:
  - Guide traditional loop transformations
  - Assist in generating efficient code