A Cache-conscious Profitability Model for Empirical Tuning of Loop Fusion

Apan Qasem, Ken Kennedy
Rice University, Houston, TX


SLIDE 1

A Cache-conscious Profitability Model for Empirical Tuning of Loop Fusion

Apan Qasem, Ken Kennedy
Rice University, Houston, TX

SLIDE 2

LCPC 2005 Rice University 2

Outline

– Motivation
– Related Work
– Profitability Model

  • Using hierarchical classification of reuse
  • Accounting for conflict misses
  • Enforcing resource constraints
  • Tuning fusion parameters

– Preliminary Experiments
– Conclusions and Future Work

SLIDE 3

Motivation

– Making the right fusion choices is a non-trivial task

  • Optimal fusion is known to be NP-complete
  • Profitability depends on the underlying architecture

    – Conflict misses
    – Resource constraints

  • Exploiting inter-loop-nest locality is not enough

SLIDE 4

Before fusion:

    L1: do j = 1, N
          do i = 1, M
            b(i, j) = a(i, j) + a(i, j-1) + a(i, j-2)
          enddo
        enddo
    L2: do j = 1, N
          do i = 1, M
            c(i, j) = b(i, j) + d(i, j)
          enddo
        enddo

  • outer loop reuse in a()
  • loop-crossing reuse in b()

After fusion:

    L12: do j = 1, N
           do i = 1, M
             b(i, j) = a(i, j) + a(i, j-1) + a(i, j-2)
             c(i, j) = b(i, j) + d(i, j)
           enddo
         enddo

  • lost reuse in a()
  • saved loads for b()
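The fusion on this slide can be checked for equivalence with a small sketch. The Python below is only an illustration, not the authors' code: it assumes the stencil reads a(i, j), a(i, j-1) and a(i, j-2), pads `a` with two extra columns so those reads stay in bounds, and uses made-up array contents.

```python
def unfused(a, d, M, N):
    """Run L1 and then L2 as two separate loop nests."""
    b = [[0] * N for _ in range(M)]
    c = [[0] * N for _ in range(M)]
    for j in range(N):                      # L1
        for i in range(M):
            # a has two padding columns, so a[i][j + 2] plays the role of a(i, j)
            b[i][j] = a[i][j + 2] + a[i][j + 1] + a[i][j]
    for j in range(N):                      # L2
        for i in range(M):
            c[i][j] = b[i][j] + d[i][j]
    return b, c


def fused(a, d, M, N):
    """Run the fused nest L12: b(i, j) is consumed right after it is produced."""
    b = [[0] * N for _ in range(M)]
    c = [[0] * N for _ in range(M)]
    for j in range(N):                      # L12
        for i in range(M):
            b[i][j] = a[i][j + 2] + a[i][j + 1] + a[i][j]
            c[i][j] = b[i][j] + d[i][j]
    return b, c


# Hypothetical problem size and input data, chosen only for the demonstration.
M, N = 4, 5
a = [[(3 * i + 7 * j) % 11 for j in range(N + 2)] for i in range(M)]
d = [[(5 * i + 2 * j) % 13 for j in range(N)] for i in range(M)]
print(unfused(a, d, M, N) == fused(a, d, M, N))
```

Both versions compute the same b and c. What changes is the order of memory accesses: the fused nest uses b(i, j) immediately, while the extra arrays touched per iteration can push the reuse of a() further apart.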

SLIDE 5

Fused loop nest from weather modeling application

SLIDE 6

Related Work

– Heuristic algorithms to find good fusion solutions

  • Gao et al. [92], Kennedy [00], Lim and Lam [01]

– Approaches that aim to reduce memory bandwidth demand

  • Ding and Kennedy [01], Song et al. [01]

– Main distinction from previous work

  • Use of architecture-specific information
  • Empirical tuning of fusion parameters

SLIDE 7

Outline

– Motivation
– Related Work
– Profitability Model

  • Using hierarchical classification of reuse
  • Accounting for conflict misses
  • Enforcing resource constraints
  • Tuning fusion parameters

– Preliminary Experiments
– Conclusions and Future Work

SLIDE 8

Hierarchical Reuse

– Use the concept of reuse level as a way to quantify reuse at each level of the memory hierarchy
– Associate with each reference a value that expresses the level at which the reuse is exploited

Reuse Level = smallest k such that Reuse Distance ≤ Capacity(Lk)
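The definition above can be sketched in a few lines of Python. This is an illustration only: the function name, the capacities, and the convention of returning one level past the last cache (meaning main memory) are assumptions, and reuse distance and capacity are both taken to be measured in cache lines.

```python
def reuse_level(reuse_distance, capacities):
    """Return the smallest k (1-based) with reuse_distance <= Capacity(Lk).

    capacities[k - 1] is the capacity of cache level k, in the same unit as
    reuse_distance. If no cache level is large enough, return
    len(capacities) + 1, which this sketch treats as main memory.
    """
    for k, capacity in enumerate(capacities, start=1):
        if reuse_distance <= capacity:
            return k
    return len(capacities) + 1


# Hypothetical two-level hierarchy: 1024-line L1, 65536-line L2.
CAPACITIES = [1024, 65536]
for distance in (100, 5000, 1_000_000):
    print(distance, reuse_level(distance, CAPACITIES))
```

A reference whose reuse distance shrinks from 5000 to 100 lines would move from level 2 to level 1 under these assumed capacities.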

SLIDE 9

Hierarchical Reuse

– Obtain benefit from reuse of r only if

Reuse Level(r)_pre > Reuse Level(r)_post

– Perform this check for every reused reference
– Account for miss access cost for each level of memory
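A minimal sketch of this check, assuming each reused reference has already been assigned a reuse level before and after fusion. The function name and the per-level access costs are hypothetical, and the sketch only counts gains; how the model treats a reference whose level gets worse is not stated on the slide.

```python
def reuse_benefit(refs, access_cost):
    """Sum the cost saved by references whose reuse level improves.

    refs is a list of (level_pre, level_post) pairs, one per reused
    reference. access_cost[k] is the assumed cost of an access served
    from level k. A reference contributes only if level_pre > level_post.
    """
    gain = 0
    for level_pre, level_post in refs:
        if level_pre > level_post:
            gain += access_cost[level_pre] - access_cost[level_post]
    return gain


# Hypothetical costs in cycles: L1, L2, and main memory (level 3).
ACCESS_COST = {1: 1, 2: 10, 3: 100}
refs = [(2, 1), (3, 3), (1, 2)]
print(reuse_benefit(refs, ACCESS_COST))  # prints 9
```

Only the first reference moves to a faster level, so the total is the L2 cost minus the L1 cost under these assumed numbers.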

SLIDE 10

Conflict Miss Model

– Use a probabilistic model to predict when a conflict miss might occur

  • Derived from the Hill & Smith model for associativity [HS:IEEE89]

– Ask the question:

If m distinct cache lines are accessed between references to the same cache line r, what is the probability that n of them are going to land in the line occupied by r?

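SLIDE 11

$$P = 1 - \sum_{i=0}^{a-1} \binom{m}{i} \left[\frac{1}{s}\right]^{i} \left[\frac{s-1}{s}\right]^{m-i}$$

Set P to be ≤ T:

$$m \le E(a, s, T)$$

E(a, s, T) is the Effective Cache Capacity.

[Figure: m memory accesses between two references to r1 in a 2-way cache with sets Set 0, Set 1, ..., Set n]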
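The formula on this slide can be evaluated directly. The Python sketch below assumes that a is the associativity and s the number of cache sets (the slide does not define them), and it assumes E(a, s, T) is the largest m for which P stays at or below T. The function names and the 1024-line cache geometry are illustrative only.

```python
from math import comb


def conflict_probability(m, a, s):
    """P = 1 - sum_{i=0}^{a-1} C(m, i) (1/s)^i ((s-1)/s)^(m-i)."""
    survive = sum(
        comb(m, i) * (1 / s) ** i * ((s - 1) / s) ** (m - i)
        for i in range(a)
    )
    return 1.0 - survive


def effective_capacity(a, s, T):
    """Largest m with conflict_probability(m, a, s) <= T (assumed meaning of E)."""
    if not 0 < T < 1:
        raise ValueError("T must lie strictly between 0 and 1")
    m = 0
    # P grows with m, so a linear scan finds the last m that stays within T.
    while conflict_probability(m + 1, a, s) <= T:
        m += 1
    return m


# Hypothetical 1024-line cache: direct-mapped (1024 sets) or 2-way (512 sets).
for T in (0.01, 0.05):
    print(T, effective_capacity(1, 1024, T), effective_capacity(2, 512, T))
```

Under these assumed geometries the sketch yields 10 lines for the direct-mapped cache and 76 lines for the 2-way cache at T = 0.01, which shows how strongly associativity affects the usable fraction of a cache at a given tolerance.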

SLIDE 12

Effective Cache Capacity

– Effective cache capacity is the maximum reuse distance for which we can expect a reused value to still be in cache
– We adjust the definition of reuse level based on the definition of effective cache capacity

Reuse Level = smallest k such that Reuse Distance ≤ ECC(Lk)

SLIDE 13

Evaluation of Conflict Miss Model: erlebacher

[Chart: series Predicted, Measured (direct), Measured (2-way); y-axis 0% to 16%; x-axis points (10, 76), (52, 182), (107, 272), (166, 349)]

SLIDE 14

Evaluation of Conflict Miss Model: arraysweep

[Chart: series Predicted, Measured (direct), Measured (2-way); y-axis 0% to 16%; x-axis points (10, 76), (52, 182), (107, 272), (166, 349)]

SLIDE 15

Resource Constraints

– Need to constrain resource demands of fused loop

Register Pressure(Lfused) < Register Set Size
Instructions(Lfused) < I-Cache Capacity

– Easy to incorporate into a constrained weighted fusion algorithm
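The two constraints amount to a feasibility test on a candidate fused loop. The sketch below is a hypothetical illustration: the `LoopStats` record, its field names, and the example limits are assumptions, and the estimates of register pressure and instruction count are taken as given rather than computed.

```python
from dataclasses import dataclass


@dataclass
class LoopStats:
    """Estimated resource demands of a (fused) loop nest; hypothetical record."""
    register_pressure: int   # registers needed by the loop body
    instructions: int        # size of the loop body in instructions


def fits_resources(fused, register_set_size, icache_capacity):
    """Both inequalities from the slide must hold, and both are strict."""
    return (fused.register_pressure < register_set_size
            and fused.instructions < icache_capacity)


# Hypothetical limits: 32 registers, room for 8192 instructions in the I-cache.
print(fits_resources(LoopStats(20, 400), 32, 8192))   # prints True
print(fits_resources(LoopStats(32, 400), 32, 8192))   # prints False
```

A fusion algorithm could call such a test before merging two loops and reject any candidate that fails it.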

SLIDE 16

Parameterizing the Model

– Parameters amenable to tuning

  • Effective Cache Capacity
  • Register Set Size
  • I-Cache Capacity

SLIDE 17

Parameterizing the Model

– Use a tolerance factor to determine how much of a resource we can use at each tuning step

Effective Registers = T × Register Set Size [0 < T ≤ 1]
Effective Cache Capacity = E(a, s, T) [0.01 ≤ T ≤ 0.20]

SLIDE 18

Tuning Fusion Parameters

– Start off conservatively with a low tolerance value and increase tolerance at each step
– Each tuning parameter constitutes a single search dimension
– Search is sequential and orthogonal

  • Stop when performance starts to worsen
  • Use reference values for the other dimensions when searching a particular dimension
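One way to read this search strategy is sketched below. It is an interpretation, not the authors' tuner: `measure` is a hypothetical stand-in for compiling and timing the program with a given parameter setting, the parameter names are made up, and the candidate tolerance values are arbitrary points inside the ranges given for T.

```python
def tune(measure, dims, reference):
    """Sequential, orthogonal search over tolerance parameters.

    measure(params) returns a runtime (lower is better).
    dims maps each parameter name to candidate values in increasing order.
    reference maps each parameter name to its reference value.
    """
    chosen = {}
    for name, values in dims.items():        # one dimension at a time
        best_value, best_time = None, float("inf")
        for value in values:                 # low tolerance first
            trial = dict(reference)          # other dimensions stay at reference
            trial[name] = value
            elapsed = measure(trial)
            if elapsed > best_time:
                break                        # performance started to worsen
            best_value, best_time = value, elapsed
        chosen[name] = best_value
    return chosen


def fake_measure(params):
    """Synthetic runtime with its minimum at cache_T = 0.10, reg_T = 0.75."""
    return 1 + (params["cache_T"] - 0.10) ** 2 + (params["reg_T"] - 0.75) ** 2


dims = {"cache_T": [0.01, 0.05, 0.10, 0.15, 0.20],
        "reg_T": [0.25, 0.50, 0.75, 1.00]}
reference = {"cache_T": 0.01, "reg_T": 0.25}
print(tune(fake_measure, dims, reference))
```

Because each dimension is searched independently, the number of runs grows with the sum of the candidate counts rather than their product, at the cost of missing interactions between parameters.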

SLIDE 19

Experimental Setup

– Four different strategies

  • ccfm, simple, mips-pro, nofuse

– Four benchmarks

  • advect3d, erlebacher, livermore18, mgrid

– Platform

  • SGI R12K
  • 2-level cache hierarchy
  • Primary L1 I-Cache, Unified L2

SLIDE 20

Performance Improvement Summary

[Bar chart: speedup over no-fuse (y-axis 0.7 to 1.5) for benchmarks advect3d, erlebacher, liv18, mgrid; series ccfm, simple, mips-pro, nofuse]

SLIDE 21

Conclusions

– Detailed cache effect analysis combined with empirical search can lead to better fusion choices
– Overall memory performance can be further improved by considering fusion and tiling interactions

SLIDE 22

Extra Slides Begin Here

SLIDE 23

Memory Performance Comparison: advect3d

[Bar chart: Cycles, L1D Misses, L2D Misses, Graduated lds (y-axis 0.8 to 1.3); series ccfm, simple, mips-pro, nofuse]

SLIDE 24

Memory Performance Comparison: erlebacher

[Bar chart: Cycles, L1D Misses, L2D Misses, Graduated lds (y-axis 0.7 to 1.3); series ccfm, simple, mips-pro, nofuse]

SLIDE 25

Memory Performance Comparison: livermore18

[Bar chart: Cycles, L1D Misses, L2D Misses, Graduated lds (y-axis 0.5 to 1.5); series ccfm, simple, mips-pro, nofuse]

SLIDE 26

Memory Performance Comparison: mgrid

[Bar chart: Cycles, L1D Misses, L2D Misses, Graduated lds (y-axis 0.8 to 1.2); series ccfm, simple, mips-pro, nofuse]

SLIDE 27

Experimental Results on advect3d

Fusion Strategy   Cycle Count   L1D Misses   L2D Misses   Graduated Loads   Speedup
nofuse            9.88E+04      3.76E+04     9.19E+04     3.06E+05          1.00
mips-pro          9.88E+04      3.76E+04     9.19E+04     3.06E+05          1.00
simple            1.23E+05      3.78E+04     5.08E+04     4.26E+05          0.80
ccfm              8.41E+04      4.48E+04     5.13E+04     3.66E+05          1.17

SLIDE 28

Experimental Results on erlebacher

Fusion Strategy   Cycle Count   L1D Misses   L2D Misses   Graduated Loads   Speedup
nofuse            5.65E+09      2.34E+08     2.92E+07     4.34E+08          1.00
mips-pro          5.23E+09      1.70E+08     2.74E+07     4.52E+08          1.08
simple            5.68E+09      1.85E+08     3.09E+07     3.90E+08          0.99
ccfm              5.23E+09      2.00E+08     2.72E+07     4.02E+08          1.08

SLIDE 29

Evaluation of Conflict Miss Model: randaccess

[Chart: series Predicted, Measured (direct), Measured (2-way); y-axis 0% to 18%; x-axis points (10, 76), (52, 182), (107, 272), (166, 349)]

SLIDE 30

Putting It All Together

– Use hierarchical reuse analysis and conflict miss model to assign weights between fusible loops
– Use weights to drive a resource constraint-based fusion algorithm
– Empirically tune for effective cache capacity and other parameters