A Cache-conscious Profitability Model for Empirical Tuning of Loop Fusion
Apan Qasem, Ken Kennedy
Rice University, Houston, TX
LCPC 2005 Rice University 2
Outline
– Motivation
– Related Work
– Profitability Model
- Using hierarchical classification of reuse
- Accounting for conflict misses
- Enforcing resource constraints
- Tuning fusion parameters
– Preliminary Experiments
– Conclusions and Future Work
Motivation
– Making the right fusion choices is a nontrivial task
- Optimal fusion is known to be NP-complete
- Profitability depends on the underlying architecture
  – Conflict misses
  – Resource constraints
- Exploiting inter-loop-nest locality is not enough
    L1: do j = 1, N
          do i = 1, M
            b(i,j) = a(i,j) + a(i,j-1) + a(i,j-2)
          enddo
        enddo

    L2: do j = 1, N
          do i = 1, M
            c(i,j) = b(i,j) + d(i,j)
          enddo
        enddo

Before fusion: outer loop reuse in a(), loop-crossing reuse in b()

    L12: do j = 1, N
           do i = 1, M
             b(i,j) = a(i,j) + a(i,j-1) + a(i,j-2)
             c(i,j) = b(i,j) + d(i,j)
           enddo
         enddo

After fusion: lost reuse in a(), saved loads for b()
Fused loop nest from weather modeling application
Related Work
– Heuristic algorithms to find good fusion solutions
- Gao et al. [92], Kennedy [00], Lim and Lam [01]
– Approaches that aim to reduce bandwidth
- Ding and Kennedy [01], Song et al. [01]
– Main distinction from previous work
- Use of architecture-specific information
- Empirical tuning of fusion parameters
Hierarchical Reuse
– Use the concept of reuse level to quantify reuse at each level of the memory hierarchy
– Associate with each reference a value that expresses the level at which the reuse is exploited
Reuse Level = smallest k such that Reuse Distance ≤ Capacity(Lk)
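The reuse-level definition can be sketched in a few lines of Python. This is only an illustration: the function name, the capacity values, and the convention that a value of "number of levels + 1" stands for main memory are assumptions, not part of the slides.

```python
def reuse_level(reuse_distance, capacities):
    """Return the smallest 1-based k with reuse_distance <= Capacity(Lk).

    capacities lists the cache capacities from L1 outward, in the same
    units as reuse_distance. If no cache level is large enough, return
    len(capacities) + 1 to stand for main memory (an assumed convention).
    """
    for k, capacity in enumerate(capacities, start=1):
        if reuse_distance <= capacity:
            return k
    return len(capacities) + 1


# Hypothetical two-level hierarchy, capacities counted in cache lines.
caps = [1024, 32768]
for distance in (500, 5000, 50000):
    print(distance, reuse_level(distance, caps))
```

A reference whose reuse distance fits in L1 gets level 1; larger distances push the reuse to outer levels.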
Hierarchical Reuse
– Reuse of r yields a benefit only if
Reuse Level(r)_pre > Reuse Level(r)_post
– Perform this check for every reused reference
– Account for the miss access cost at each level of memory
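One way to read the check above is as a sum of per-reference savings. The sketch below is an assumption-laden illustration: the cost table is hypothetical, and taking the saving as the cost difference between the pre-fusion and post-fusion levels is one plausible interpretation of "account for miss access cost", not a formula given in the slides.

```python
def fusion_benefit(levels, access_cost):
    """Sum the savings over reused references whose reuse level drops.

    levels: list of (pre, post) reuse levels, one pair per reused reference.
    access_cost: dict mapping a reuse level to its access cost in cycles
                 (hypothetical numbers, supplied by the caller).
    """
    total = 0
    for pre, post in levels:
        if pre > post:  # benefit only when reuse moves closer to the CPU
            total += access_cost[pre] - access_cost[post]
    return total


# Hypothetical costs: L1 hit, L2 hit, main memory.
cost = {1: 2, 2: 10, 3: 100}
# Three references: improved, unchanged, and worsened by fusion.
print(fusion_benefit([(3, 1), (2, 2), (1, 2)], cost))
```

References whose level gets worse contribute nothing here; a fuller model would presumably charge a penalty for them, which this sketch leaves out.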
Conflict Miss Model
– Use a probabilistic model to predict when a conflict miss might occur
- Derived from the Hill & Smith model for associativity [HS:IEEE89]
– Ask the question:
If m distinct cache lines are accessed between references to the same cache line r, what is the probability that n of them are going to land in the line occupied by r?
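$$P = 1 - \sum_{i=0}^{a-1} \binom{m}{i} \left[\frac{1}{s}\right]^{i} \left[\frac{s-1}{s}\right]^{m-i}$$
Set P to be ≤ T, which gives m ≤ E(a, s, T), the Effective Cache Capacity
[Figure: memory accesses between two references to r1 mapping onto the sets (Set 0, Set 1, ..., Set n) of a 2-way cache]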
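The probability and the resulting capacity bound can be evaluated numerically. The sketch below reads a as the associativity and s as the number of cache sets, which is an interpretation of the binomial form rather than something the slide states, and it finds E(a, s, T) by a simple linear scan over m. The function names are hypothetical.

```python
from math import comb


def conflict_probability(m, a, s):
    """P = 1 - sum_{i=0}^{a-1} C(m, i) * (1/s)^i * ((s-1)/s)^(m-i).

    Probability that at least a of the m intervening lines map to the
    same set as r, assuming each line picks a set uniformly at random.
    """
    p = 1.0 / s
    fewer_than_a = sum(
        comb(m, i) * p**i * (1.0 - p) ** (m - i) for i in range(a)
    )
    return 1.0 - fewer_than_a


def effective_capacity(a, s, T):
    """Largest m such that conflict_probability(m, a, s) <= T."""
    m = 0
    while conflict_probability(m + 1, a, s) <= T:
        m += 1
    return m


# Hypothetical cache with 512 sets and a 5% tolerance.
print(effective_capacity(1, 512, 0.05))  # direct-mapped
print(effective_capacity(2, 512, 0.05))  # 2-way set associative
```

Because P grows with m, the scan stops at the first m that exceeds the tolerance; higher associativity or a larger tolerance T both enlarge the effective capacity.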
Effective Cache Capacity
– Effective cache capacity is the maximum reuse distance for which we can expect a reused value to still be in cache
– We adjust the definition of reuse level based on the definition of effective cache capacity
Reuse Level = smallest k such that Reuse Distance ≤ ECC(Lk)
Evaluation of Conflict Miss Model: erlebacher
[Figure: bar chart, vertical axis 0% to 16%; horizontal-axis points (10, 76), (52, 182), (107, 272), (166, 349); series: Predicted, Measured (direct), Measured (2-way)]
Evaluation of Conflict Miss Model: arraysweep
[Figure: bar chart, vertical axis 0% to 16%; horizontal-axis points (10, 76), (52, 182), (107, 272), (166, 349); series: Predicted, Measured (direct), Measured (2-way)]
Resource Constraints
– Need to constrain resource demands of the fused loop
Register Pressure(Lfused) < Register Set Size
Instructions(Lfused) < I-Cache Capacity
– Easy to incorporate into a constrained weighted fusion algorithm
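The two constraints amount to a simple feasibility test on a fusion candidate. In this minimal sketch the function name and all numbers are hypothetical, the estimates for the fused loop are assumed to come from elsewhere in the compiler, and the instruction count and I-cache capacity are assumed to share the same units.

```python
def within_resource_limits(register_pressure, instruction_count,
                           register_set_size, icache_capacity):
    """Check both constraints from the slide, using strict inequalities."""
    return (register_pressure < register_set_size
            and instruction_count < icache_capacity)


# Hypothetical estimates for two fused-loop candidates.
print(within_resource_limits(28, 900, register_set_size=32, icache_capacity=8192))
print(within_resource_limits(32, 900, register_set_size=32, icache_capacity=8192))
```

A fusion algorithm could call such a predicate before merging two loops and skip any merge that fails it.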
Parameterizing the Model
– Parameters amenable to tuning
- Effective Cache Capacity
- Register Set Size
- I-Cache Capacity
Parameterizing the Model
– Use a tolerance factor to determine how much of a resource we can use at each tuning step
Effective Registers = T × Register Set Size [0 < T ≤ 1]
Effective Cache Capacity = E(a, s, T) [0.01 ≤ T ≤ 0.20]
Tuning Fusion Parameters
– Start off conservatively with a low tolerance value and increase tolerance at each step
– Each tuning parameter constitutes a single search dimension
– Search is sequential and orthogonal
- Stop when performance starts to worsen
- Use reference values for the other dimensions when searching a particular dimension
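The search described above can be sketched as one pass per dimension. Everything concrete here is an assumption made for illustration: the `measure` callback (which in practice would compile and time the program), the tolerance grids, the reference values, and the synthetic cost function used in the example.

```python
def tune(measure, dims, reference):
    """Sequential, orthogonal search over tolerance values.

    measure:   callable taking a config dict and returning a run time
               (lower is better); hypothetical stand-in for a timed run.
    dims:      dict mapping a parameter name to its tolerance values,
               listed from low (conservative) to high.
    reference: dict of reference values used for the dimensions that
               are not currently being searched.
    """
    best = dict(reference)
    for name, values in dims.items():
        config = dict(reference)  # other dimensions stay at reference values
        best_time = None
        for t in values:  # start low, raise the tolerance each step
            config[name] = t
            elapsed = measure(config)
            if best_time is not None and elapsed > best_time:
                break  # performance started to worsen: stop this dimension
            best_time, best[name] = elapsed, t
    return best


# Synthetic cost with its minimum at registers=0.75, cache=0.10.
def fake_runtime(cfg):
    return (cfg["registers"] - 0.75) ** 2 + (cfg["cache"] - 0.10) ** 2


grids = {
    "registers": [0.25, 0.50, 0.75, 1.00],     # within 0 < T <= 1
    "cache": [0.01, 0.05, 0.10, 0.15, 0.20],   # within 0.01 <= T <= 0.20
}
print(tune(fake_runtime, grids, {"registers": 0.50, "cache": 0.05}))
```

Searching one dimension at a time keeps the number of timed runs linear in the number of grid points, at the cost of missing interactions between parameters.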
Experimental Setup
– Four different strategies
- ccfm, simple, mips-pro, no-fuse
– Four benchmarks
- advect3d, erlebacher, livermore18, mgrid
– Platform
- SGI R12K
- 2-level cache hierarchy
- Primary L1 I-Cache, Unified L2
Performance Improvement Summary
[Figure: bar chart of speedup over no-fuse (axis 0.7 to 1.5) for benchmarks advect3d, erlebacher, liv18, mgrid; series: ccfm, simple, mips-pro, nofuse]
Conclusions
– Detailed cache effect analysis combined with empirical search can lead to better fusion choices
– Overall memory performance can be further improved by considering fusion and tiling interactions
Extra Slides Begin Here
Memory Performance Comparison: advect3d
[Figure: bar chart (axis 0.8 to 1.3) of Cycles, L1D Misses, L2D Misses, Graduated lds; series: ccfm, simple, mips, no-fuse]
Memory Performance Comparison: erlebacher
[Figure: bar chart (axis 0.7 to 1.3) of Cycles, L1D Misses, L2D Misses, Graduated lds; series: ccfm, simple, mips-pro, nofuse]
Memory Performance Comparison: livermore18
[Figure: bar chart (axis 0.5 to 1.5) of Cycles, L1D Misses, L2D Misses, Graduated lds; series: ccfm, simple, mips-pro, nofuse]
Memory Performance Comparison: mgrid
[Figure: bar chart (axis 0.8 to 1.2) of Cycles, L1D Misses, L2D Misses, Graduated lds; series: ccfm, simple, mips-pro, nofuse]
Experimental Results on advect3d
Fusion Strategy   Cycle Count   L1D Misses   L2D Misses   Graduated Loads   Speedup
nofuse            9.88E+04      3.76E+04     9.19E+04     3.06E+05          1.00
mips-pro          9.88E+04      3.76E+04     9.19E+04     3.06E+05          1.00
simple            1.23E+05      3.78E+04     5.08E+04     4.26E+05          0.80
ccfm              8.41E+04      4.48E+04     5.13E+04     3.66E+05          1.17
Experimental Results on erlebacher
Fusion Strategy   Cycle Count   L1D Misses   L2D Misses   Graduated Loads   Speedup
nofuse            5.65E+09      2.34E+08     2.92E+07     4.34E+08          1.00
mips-pro          5.23E+09      1.70E+08     2.74E+07     4.52E+08          1.08
simple            5.68E+09      1.85E+08     3.09E+07     3.90E+08          0.99
ccfm              5.23E+09      2.00E+08     2.72E+07     4.02E+08          1.08
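A quick consistency check on the two results tables: assuming the Speedup column is the nofuse cycle count divided by each strategy's cycle count (the slides do not state the formula), the reported values can be recomputed from the Cycle Count column. The dictionary layout and function name below are illustrative only; the numbers are the cycle counts from the tables.

```python
# Cycle counts copied from the advect3d and erlebacher tables.
cycles = {
    "advect3d": {"nofuse": 9.88e4, "mips-pro": 9.88e4,
                 "simple": 1.23e5, "ccfm": 8.41e4},
    "erlebacher": {"nofuse": 5.65e9, "mips-pro": 5.23e9,
                   "simple": 5.68e9, "ccfm": 5.23e9},
}


def speedups(per_strategy):
    """Speedup over no-fuse, rounded to two decimals as in the tables."""
    base = per_strategy["nofuse"]
    return {name: round(base / c, 2) for name, c in per_strategy.items()}


for benchmark, per_strategy in cycles.items():
    print(benchmark, speedups(per_strategy))
```

Under that assumption the recomputed ratios agree with the Speedup column in both tables.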
Evaluation of Conflict Miss Model: randaccess
[Figure: bar chart, vertical axis 0% to 18%; horizontal-axis points (10, 76), (52, 182), (107, 272), (166, 349); series: Predicted, Measured (direct), Measured (2-way)]