Footprint-based Locality Analysis
Xiaoya Xiang, Bin Bao, Chen Ding
University of Rochester
2011-11-10
Memory Performance
On modern computer systems, memory performance depends on the active data usage,
a primary factor affecting the ... and the demand for memory bandwidth.
Reuse Distance
Reuse distance: the number of distinct data elements accessed between two consecutive accesses to the same data
The reuse distance distribution gives the miss rate of a fully associative LRU cache of all sizes
trace:           a b c a a c b
reuse distance:  ∞ ∞ ∞ 2 0 1 2
[Figure: plots for the trace above; axis ticks 25 to 100 and 1 to 3]
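The definition can be checked by direct counting. The sketch below is illustrative only: the function names `reuse_distances` and `lru_miss_ratio` are assumptions, not names from the talk. It uses the convention of this slide, where the distance counts only the elements strictly between the two accesses, and it assumes LRU replacement when turning distances into miss ratios.

```python
import math

def reuse_distances(trace):
    """Reuse distance of each access: the number of distinct elements
    strictly between this access and the previous access to the same
    element, or infinity for a first access."""
    last_seen = {}
    distances = []
    for i, x in enumerate(trace):
        if x in last_seen:
            between = trace[last_seen[x] + 1 : i]
            distances.append(len(set(between)))
        else:
            distances.append(math.inf)
        last_seen[x] = i
    return distances

def lru_miss_ratio(distances, cache_size):
    """Miss ratio of a fully associative LRU cache holding cache_size
    elements: an access hits only if its distance is below cache_size."""
    misses = sum(1 for d in distances if d >= cache_size)
    return misses / len(distances)

trace = list("abcaacb")
rd = reuse_distances(trace)
# One pass over the distances yields the miss ratio for every cache size.
curve = {c: lru_miss_ratio(rd, c) for c in (1, 2, 3)}
```

For the slide's trace this gives the distances ∞ ∞ ∞ 2 0 1 2 and miss ratios of 6/7, 5/7 and 3/7 for caches of 1, 2 and 3 elements.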
Reuse Distance Measurement
Measurement algorithms since 1970                                                  Time            Space
Naive counting                                                                     O(N^2)          O(N)
Trace as a stack [IBM’70]                                                          O(NM)           O(M)
Trace as a vector [IBM’75, Illinois’02]                                            O(N log N)      O(N)
Trace as a tree [LBNL’81], splay tree [Michigan’93], interval tree [Illinois’02]   O(N log M)      O(M)
Fixed cache sizes [Wisconsin’91]                                                   O(N)            O(C)
Approximation tree [Rochester’03]                                                  O(N log log M)  O(log M)
                                                                                   O(N)            O(1)
N is the length of the trace. M is the size of data. C is the size of cache.
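As an illustration of the "trace as a stack" row, here is a minimal sketch of an LRU stack kept as a Python list. The name `stack_distances` is an assumption, and this is not claimed to be the code of any cited work. Each access searches a stack of at most M elements, which is where the O(NM) time and O(M) space come from.

```python
import math

def stack_distances(trace):
    """Reuse distances via an LRU stack: stack[0] is the most recently
    used element, so the depth of an element equals the number of
    distinct elements accessed since its previous use."""
    stack = []
    distances = []
    for x in trace:
        if x in stack:
            depth = stack.index(x)   # linear search, O(M) per access
            distances.append(depth)
            stack.pop(depth)
        else:
            distances.append(math.inf)
        stack.insert(0, x)           # x becomes the most recently used
    return distances

result = stack_distances(list("abcaacb"))
```

The tree-based rows of the table replace the linear search with a balanced structure, which is how the per-access cost drops to O(log M).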
Footprint
Footprint of a window: number of distinct elements accessed in the window
trace: k m m n n n
window size = 2, footprint = 2
window size = 3, footprint = 2
window size = 4, footprint = 2
windows in an N-long trace
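A small sketch of the footprint definition, applied to every window of the example trace. The helper names are assumptions. The slides do not say which windows were highlighted, so the start positions used below are illustrative choices that reproduce the footprints shown.

```python
def footprint(trace, start, length):
    """Number of distinct elements in the window trace[start:start+length]."""
    return len(set(trace[start:start + length]))

def all_window_footprints(trace):
    """Footprint of every contiguous non-empty window, keyed by
    (start, length). Brute force, for illustration only."""
    n = len(trace)
    return {(s, w): footprint(trace, s, w)
            for w in range(1, n + 1)
            for s in range(n - w + 1)}

trace = list("kmmnnn")
fps = all_window_footprints(trace)

# Example windows (start positions are assumptions):
fp_km = footprint(trace, 0, 2)     # "k m"
fp_mmn = footprint(trace, 1, 3)    # "m m n"
fp_mmnn = footprint(trace, 1, 4)   # "m m n n"
```

Enumerating contiguous windows this way visits N(N+1)/2 of them (21 for this 6-element trace), which is why measuring the footprint of all windows directly does not scale to long traces.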
All-footprint CKlogM Alg. [Xiang+ PPoPP’11]
measurement of all-footprint.
Average Footprint O(N) Algo. [Xiang+ PACT’11]
average over all windows of length t.
trace: a b b b
when window size equals 2, footprint = 2, 1, 1
[Figure: average footprint vs. window size for the trace above (window sizes 1 to 4)]
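The average can be obtained without enumerating windows by counting, for each element, the windows that do not contain it: those that end before its first access, those that start after its last access, and those that fit strictly between two consecutive accesses. The sketch below follows that counting argument and checks it against brute force. It is an illustration under these assumptions, not necessarily the exact formulation of the cited paper, and all function names are hypothetical.

```python
from collections import Counter

def average_footprint_bruteforce(trace, w):
    """Average number of distinct elements over all windows of length w."""
    n = len(trace)
    total = sum(len(set(trace[s:s + w])) for s in range(n - w + 1))
    return total / (n - w + 1)

def average_footprint_counting(trace, w):
    """Same average, from first-access times, last-access times and
    reuse times only (times are 1-based). Assumes 1 <= w <= len(trace)."""
    n = len(trace)
    first, last = {}, {}
    reuse_times = Counter()          # reuse time -> number of reuses
    for t, x in enumerate(trace, start=1):
        if x in last:
            reuse_times[t - last[x]] += 1
        else:
            first[x] = t
        last[x] = t
    m = len(first)                   # number of distinct elements
    absent = 0                       # (element, window) pairs where the element is missing
    for x in first:
        absent += max(0, first[x] - w)            # windows ending before the first access
        absent += max(0, (n + 1 - last[x]) - w)   # windows starting after the last access
    for gap, count in reuse_times.items():
        absent += max(0, gap - w) * count         # windows between two consecutive accesses
    return m - absent / (n - w + 1)

trace = list("abbb")
avg2 = average_footprint_counting(trace, 2)
```

For the trace a b b b and window size 2, both functions return (2 + 1 + 1) / 3 = 4/3. After one pass over the trace, the counting version only needs the per-element first and last access times and the reuse-time histogram.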
[Figure: 403.gcc, miss rate vs. cache size in bytes]
[Figure: 403.gcc, average footprint vs. window size]
Footprint Model Reuse Distance Model
counters
perturbation (deterministic results)
relation, more intuitive
Footprint Analysis is Faster [PACT 11]
Footprint to Reuse Distance Conversion
reuse windows
trace: a b b a c a c
reuse distance (rd):                           2    1     2    2
reuse window size (w):                         4    2     3    3
average footprint of windows of size w:        2.5  1.83  2.2  2.2
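The numbers on this slide can be reproduced with a short sketch. Two assumptions are made here: the reuse window spans both accesses to the element, and the reuse distance counts the reused element itself, which is what the slide's values (rd = 1 for "b b") imply. The function names are hypothetical, and the rows come out in trace order rather than in the order shown on the slide.

```python
def average_footprint(trace, w):
    """Average number of distinct elements over all windows of length w."""
    n = len(trace)
    return sum(len(set(trace[s:s + w])) for s in range(n - w + 1)) / (n - w + 1)

def reuse_windows(trace):
    """For each reuse, return (window size, distinct elements in the window).
    The window runs from the previous access to the current one, inclusive."""
    last_seen = {}
    out = []
    for i, x in enumerate(trace):
        if x in last_seen:
            window = trace[last_seen[x] : i + 1]
            out.append((len(window), len(set(window))))
        last_seen[x] = i
    return out

trace = list("abbacac")
# (reuse window size, actual reuse distance, average footprint at that size)
rows = [(w, rd, round(average_footprint(trace, w), 2))
        for w, rd in reuse_windows(trace)]
```

Each row pairs the actual reuse distance of one reuse with the average footprint of all windows of the same length, so the last column plays the role of the estimate when only the average footprint is known.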
Footprint Sampling
window has known boundaries.
parallel.
Evaluation: Analysis Speed
        (sec)                  rd slowdown        fp slowdown         fp-sampling slowdown
max     1302.82 (436.cactus)   688x (456.hmmer)   40x (464.h264ref)   47% (416.gamess)
min     30.57 (403.gcc)        104x (429.mcf)     10x (429.mcf)       6% (456.hmmer)
mean    434.1                  300x               21x                 17%
Evaluation: Accuracy of Miss Rate Prediction
[Figures: miss rate per benchmark for three caches: 8-way 32KB, 8-way 256KB, 16-way 4MB]
Series: Simulation; Full-trace footprint; Sampling, miss-rate aggregation; Sampling, footprint aggregation
Benchmarks: 401.bzip2, 403.gcc, 410.bwaves, 416.gamess, 429.mcf, 433.milc, 434.zeusmp, 436.cactusADM, 437.leslie3d, 444.namd, 445.gobmk, 450.soplex, 453.povray, 454.calculix, 456.hmmer, 458.sjeng, 459.GemsFDTD, 462.libquantum, 464.h264ref, 470.lbm, 473.astar, 482.sphinx3, Average
Evaluation: Corun Slowdown Prediction
[Figure: slowdown vs. ranked program triples (from least interference to most interference)]
Series: footprint and reuse distance; lifetime gradient; footprint only; exhaustive testing
Summary
and the traditional locality statistics.
through parallel sampling at a marginal cost, on average 17% for SPEC2006 benchmarks.