Improving Memory Hierarchy Performance For Irregular Applications -- PowerPoint PPT Presentation




SLIDE 1

Improving Memory Hierarchy Performance For Irregular Applications

John Mellor-Crummey*   David Whalley†   Ken Kennedy*

*Dept. of Computer Science, Rice University
†Dept. of Computer Science, Florida State University

SLIDE 2

Motivation

  • Gap between processor and memory speeds is widening
  • Modern machines use multi-level memory hierarchies
  • High performance requires tailoring programs to match memory hierarchy characteristics

SLIDE 3

Exploiting Deep Memory Hierarchies

  • Principal strategies
    — loop transformations to improve data reuse
      – register and cache blocking, loop fusion
    — data prefetching
  • Limitations
    — fail to deal with irregular codes
      – loop transformations depend on predictable subscripts
      – prefetching can help, but at higher overhead
    — primarily focused on latency reduction
      – but bandwidth is critical on modern machines

SLIDE 4

Irregular Codes

Indirect references have poor temporal and spatial locality
  — poor spatial locality ⇒ low utilization of the bandwidth consumed
  — poor temporal locality ⇒ more bandwidth needed

[Figure: cache-line utilization for indirect references: Memory to L2 Cache (128-byte lines, 6.25% utilization), L2 to L1 Cache (32-byte lines, 25% utilization), L1 to Register (8 bytes, 100% utilization)]

SLIDE 5

A Recipe for High Performance

  • Don't squander memory bandwidth
    — use as much of each cache line as possible
  • Maximize temporal reuse
    — reuse reduces bandwidth needs

SLIDE 6

Challenges

Irregular and adaptive problems

  • Structure of data and computation unknown until runtime
  • Structure may change during execution
SLIDE 7

Our Approach

Coordinated dynamic reorderings

  • Dynamic data reordering to improve spatial locality
  • Dynamic computation reordering to exploit spatial locality and improve temporal reuse

SLIDE 8

Contributions

  • Introduce multi-level blocking for irregular computations
  • Evaluate two new strategies for coordinated dynamic reordering of data and computation for irregular applications

SLIDE 9

Outline

  • Introduction
  • Running example
  • Improving memory hierarchy performance
    — dynamic data reordering
    — dynamic computation reorderings
  • Experimental results: 2 case studies
  • Related work
  • Conclusions
SLIDE 10

Running Example

Moldyn molecular dynamics benchmark

  • Modeled after the non-bonded force calculation in CHARMM
  • Interaction list for all pairs of atoms within a cutoff radius

FOR step = 1 to timesteps DO
  if (MOD(step, 20) = 1)
    compute interaction pairs
  FOR each interaction pair (i,j) DO
    compute forces between part[i] and part[j]
  FOR each particle j
    update position of part[j] based on force

SLIDE 11

Dynamic Data Reordering

Problem:
  — lack of spatial locality in data for irregular problems

Approach:
  — reorder data elements used together to be nearby in memory, using space-filling curves to increase the spatial locality available [Al-Furaih and Ranka, IPPS '98]

SLIDE 12

Space-Filling Curves

  • Continuous, non-smooth curves through n-D space
  • Mapping between points in space and those along the curve
  • Recursive structure preserves locality

[Figure: fifth-order Hilbert curve in 2 dimensions]
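The point-to-curve mapping the slide describes can be sketched with the standard bit-manipulation algorithm for the 2-D Hilbert index; this is a minimal illustration, not the implementation used in the paper.

```python
def xy2d(n, x, y):
    """Return the distance of grid point (x, y) along a Hilbert curve
    filling an n-by-n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate/flip the quadrant into canonical orientation
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d
```

Because the curve is recursive, points with nearby indices lie in the same sub-quadrant at every refinement level, which is the locality property the reorderings rely on.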

SLIDE 13

Space-Filling Curve Data Reordering

  • Points nearby in space are nearby (on average) on the curve
    — ordering data along the curve co-locates neighborhoods

SLIDE 14

Space-Filling Curve Data Reordering

Advantages
  — increases spatial locality (on average)
  — data reordering is independent of computation order
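A data reordering along a space-filling curve might be sketched as follows. For brevity this uses a Morton (Z-order) key as a stand-in for the Hilbert index; the function names and quantization grid are illustrative assumptions, not the paper's code.

```python
def morton_key(ix, iy, bits=16):
    """Interleave coordinate bits to get a Z-order (Morton) index,
    a simpler stand-in here for the Hilbert index.
    Assumes non-negative quantized coordinates."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b)       # x bits in even positions
        key |= ((iy >> b) & 1) << (2 * b + 1)   # y bits in odd positions
    return key

def sfc_data_order(coords, cell=1.0):
    """Return a permutation of particle indices in SFC order:
    result[new_position] = old_index."""
    keys = [morton_key(int(x / cell), int(y / cell)) for (x, y) in coords]
    return sorted(range(len(coords)), key=lambda i: keys[i])
```

Copying each attribute array through this permutation places spatial neighbors in adjacent memory locations, independent of how the computation is later ordered.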

SLIDE 15

Computation Reordering

Problems:
  — lack of temporal locality in data accesses
    – values may be evicted before extensive reuse
    – premature eviction results in extra misses later
  — failure to exploit spatial locality effectively

[Figure: trace of L1 misses over 100K particle interactions (Moldyn)]

SLIDE 16

Computation Reordering Approaches

  • Space-filling curve based reordering of computations
  • Multi-level blocking of irregular computations
SLIDE 17

Space-Filling Curve Computation Order

Example: Moldyn molecular dynamics benchmark
  — sort the interaction list based on SFC particle positions

Advantage
  — improves temporal locality by ordered traversal of space

[Figure: interaction sorting key formed from SFC(P1) and SFC(P2)]
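Sorting the interaction list by the endpoints' curve positions, as the slide's figure suggests, might look like this sketch (names are illustrative; `sfc_index` maps a particle to its position along the curve):

```python
def sfc_computation_order(pairs, sfc_index):
    """Sort interaction pairs by the SFC positions of their two
    particles, so the force loop sweeps space in curve order."""
    return sorted(pairs, key=lambda pq: (sfc_index[pq[0]], sfc_index[pq[1]]))
```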

SLIDE 18

Blocking for Irregular Codes

Consider blocks of data at a time; thoroughly process a block before moving to the next.

Unblocked code:

FOR each particle p1
  FOR p2 in interacts_with(p1)
    F(p1) = F(p1) + ƒ(A(p1), A(p2))
    F(p2) = F(p2) + ƒ(A(p2), A(p1))

Blocked (1 level):

FOR b1 = 1, Nblocks
  FOR b2 = b1, Nblocks
    FOR p1 in block b1
      FOR p2 in block b2 ∩ interacts_with(p1)
        F(p1) = F(p1) + ƒ(A(p1), A(p2))
        F(p2) = F(p2) + ƒ(A(p2), A(p1))
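A runnable sketch of the one-level blocked traversal, under the assumption that particles have already been renumbered so that indices within a block are contiguous (block = index // block_size):

```python
def blocked_pairs(pairs, block_size):
    """Group interaction pairs by the (unordered) pair of blocks they
    touch, then emit all pairs for one block pair before the next."""
    buckets = {}
    for (p1, p2) in pairs:
        b1, b2 = sorted((p1 // block_size, p2 // block_size))
        buckets.setdefault((b1, b2), []).append((p1, p2))
    ordered = []
    for key in sorted(buckets):  # visit block pairs in order
        ordered.extend(buckets[key])
    return ordered
```

While one block pair is being processed, at most two blocks of particle data are live, so the block size can be chosen to fit the targeted cache level.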

SLIDE 19

Dynamic Multilevel Blocking

  • Associate a tuple of block numbers with each particle
    — one block number per level of the memory hierarchy
      – block number = selected bits of the particle address
  • For an interaction pair, interleave the particles' block numbers
  • Sort by the composite block number ⇒ multi-level blocking

[Figure: particle address divided into block-number fields A | B | C, with field sizes chosen from the L1, TLB, and L2 capacities; the composite key is built from the fields of both particles]
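One way to sketch the composite key: split each particle index into per-level block-number fields and interleave the two particles' fields, coarsest level first. The field widths here are illustrative assumptions; per the slide, the real widths come from the L2, TLB, and L1 capacities.

```python
def composite_key(p1, p2, widths=(2, 3, 4)):
    """Interleave the per-level block numbers of two particle indices
    into one integer sort key, coarsest level first."""
    def fields(p):
        out, shift = [], sum(widths)
        for w in widths:          # high-order bits = coarsest blocks
            shift -= w
            out.append((p >> shift) & ((1 << w) - 1))
        return out
    f1, f2 = fields(p1), fields(p2)
    key = 0
    for w, a, b in zip(widths, f1, f2):
        key = (key << (2 * w)) | (a << w) | b
    return key

# Sorting the interaction list by composite_key(p1, p2) yields the
# multi-level blocked computation order.
```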

SLIDE 20

Effects of Multi-Level Blocking

[Figure: L1 miss patterns for Moldyn using dynamic multi-level blocking, shown after 10K, 100K, and 1M L1 misses]

SLIDE 21

Coordinated Approaches

[Figure: L1 misses over 100K interactions: original data order with original computation order vs. Hilbert data order with blocked computation order]

SLIDE 22

Programs

  • Moldyn: a synthetic molecular dynamics benchmark
    — 256K atoms, 27 million interactions, 20 timesteps
  • MAGI: Air Force particle hydrodynamics code
    — 28K particles, 253 timesteps (DoD test case)

FOR N timesteps DO
  FOR each particle p DO
    create an interaction list for particle p
    FOR each particle j in interaction_list(p)
      update information for particle j

SLIDE 23

Experimental Platform

SGI O2: R10K hardware performance-monitoring support

Cache Type   Cache Size   Associativity   Block Size
L1 Cache     32KB         2-way           32B
L2 Cache     1MB          2-way           128B
TLB          512KB        64-way          8KB

SLIDE 24

Moldyn Results

[Figure: normalized L1 misses, L2 misses, TLB misses, and cycles (scale 0.2 to 1.4) for FD, HD, HC, BC, FD+HC, HD+HC, and HD+BC]

FD = first-touch data order, HD = Hilbert data order, HC = Hilbert computation order, BC = blocked computation order

SLIDE 25

MAGI Results

[Figure: normalized L1 misses, L2 misses, TLB misses, and cycles (scale 0.1 to 0.6) for FD+FC, HD+HC, and HD/FD+HC/FT]

FD = first-touch data order, HD = Hilbert data order, FC = first-touch computation order, HC = Hilbert computation order

SLIDE 26

Related Work

  • Blocking/tiling of regular codes
    — paging, (mostly 1-level) cache, registers
  • Loop interchange, fusion
  • Software-driven data prefetching
  • Space-filling curves
    — domain partitioning, AMR
    — improving locality through SFC data order
      – divide-and-conquer algorithms, PIC codes
  • Breadth-first traversals for ordering data for iterative graph algorithms

SLIDE 27

Conclusions

  • Matching data and computation order improves performance
    — data reordering: improves spatial locality
    — computation reordering: boosts spatial and temporal reuse
    — big improvements with coordinated approaches
      – factor of 4 reduction in cycles for Moldyn
      – factor of 2.3 reduction in cycles for MAGI
  • Implications for other codes
    — space-filling curve reorderings for "neighborhood-based" computations
    — dynamic multi-level blocking can regularize memory hierarchy use of any explicitly-specified computation order
SLIDE 28

Extra Slides

SLIDE 29

MAGI Results

Results on SGI O2; relative change (baseline result = 1.0)

Data Order         Comp Order         L1 Misses   L2 Misses   TLB Misses   Cycles
First Touch        First Touch        .43         .27         .49          .56
Hilbert            Hilbert            .28         .12         .16          .44
Hilbert/First T.   Hilbert/First T.   .32         .12         .14          .44

SLIDE 30

Moldyn Results

Results on SGI O2; relative change (baseline result = 1.0)

Data Order   Comp Order   L1 Misses   L2 Misses   TLB Misses   Cycles
First T.     None         .87         .77         .31          .79
Hilbert      None         .88         .78         .26          .81
None         Hilbert      .45         .12         .74          .38
None         Blocked      1.3         .46         .21          .63
First T.     Hilbert      .34         .14         .0080        .39
Hilbert      Hilbert      .26         .10         .0062        .27
Hilbert      Blocked      .25         .11         .0063        .30

Baseline program miss ratios: L1 = .23, L2 = .62, TLB = .10

SLIDE 31

The Bandwidth Bottleneck

Machine balance: average number of bytes a machine can transfer per floating-point operation
Program balance: average number of bytes a program transfers per floating-point operation

Machine      L1–Reg   L2–L1   Mem–L2
SGI Origin   4        4       0.8

Benchmark     L1–Reg   L2–L1   Mem–L2
Sweep3D       15.0     9.1     7.8
Convolution   6.4      5.1     5.2
Dmxpy         8.3      8.3     8.4
FFT           8.3      3.0     2.7
NAS SP        10.8     6.4     4.9

Source: Ding and Kennedy, PLDI '99.
SLIDE 32

Strategies for Irregular Applications

  • Static transformations
    — data regrouping: arrays of attributes ⇒ structures
  • Dynamic transformations
    — reorder at the beginning of major computational phases
      – dynamic data reordering
      – computation reordering
      – integrated approaches
    — amortize the cost of reordering over a phase's computation

SLIDE 33

Blocking Illustration

[Figure only]

SLIDE 34

Dynamic Data Reordering

Original program:

DO I = 1, Npairs
  ! calculate forces
  F(P(1,I)) = F(P(1,I)) + ƒ(A(P(1,I)), A(P(2,I)))
  F(P(2,I)) = F(P(2,I)) + ƒ(A(P(2,I)), A(P(1,I)))
ENDDO
DO I = 1, Nparticles
  ! update particle positions
  A(I) = g(A(I), F(I))
ENDDO

SLIDE 35

Dynamic Data Reordering

DO I = 1, Npairs
  F(P(1,I)) = F(P(1,I)) + ƒ(A(P(1,I)), A(P(2,I)))
  F(P(2,I)) = F(P(2,I)) + ƒ(A(P(2,I)), A(P(1,I)))
ENDDO
DO I = 1, Nparticles
  A(I) = g(A(I), F(I))
ENDDO

After data reordering with permutation L:

DO I = 1, Npairs
  F(L(P(1,I))) = F(L(P(1,I))) + ƒ(A(L(P(1,I))), A(L(P(2,I))))
  F(L(P(2,I))) = F(L(P(2,I))) + ƒ(A(L(P(2,I))), A(L(P(1,I))))
ENDDO
DO I = 1, Nparticles
  A(L(I)) = g(A(L(I)), F(L(I)))
ENDDO

Extra level of indirection … but L and P can be composed!
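The composition the slide mentions can be sketched in Python (the slides use Fortran-style pseudocode). Here L[old_index] = new_index, and folding L into P removes the extra indirection from the force loop:

```python
def apply_permutation(a, L):
    """Build the reordered array: element a[i] moves to position L[i]."""
    out = [None] * len(a)
    for i, x in enumerate(a):
        out[L[i]] = x
    return out

def compose_into_pairs(pairs, L):
    """Fold L into the interaction list so the force loop can index
    the reordered arrays directly, with no extra indirection."""
    return [(L[p1], L[p2]) for (p1, p2) in pairs]
```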

SLIDE 36

Dynamic Data Reordering

Redefine P (fold L into the interaction list):

DO I = 1, Npairs
  P(1,I) = L(P(1,I))
  P(2,I) = L(P(2,I))
ENDDO

And reorder the position updates:

DO I = 1, Npairs
  F(P(1,I)) = F(P(1,I)) + ƒ(A(P(1,I)), A(P(2,I)))
  F(P(2,I)) = F(P(2,I)) + ƒ(A(P(2,I)), A(P(1,I)))
ENDDO
DO I = 1, Nparticles
  A(I) = g(A(I), F(I))
ENDDO

SLIDE 37

Space-Filling Curve Computation Order

Moldyn molecular dynamics example: computation ordered by sorting the pairs in SFC order

Original force calculation:

FOR each interaction pair (p1,p2)
  F(p1) = F(p1) + ƒ(A(p1), A(p2))
  F(p2) = F(p2) + ƒ(A(p2), A(p1))

Abstract view of the reordered computation:

FOR each particle p1 (in SFC order)
  FOR p2 in interacts_with(p1) (in SFC order)
    F(p1) = F(p1) + ƒ(A(p1), A(p2))
    F(p2) = F(p2) + ƒ(A(p2), A(p1))

SLIDE 38

First-Touch Data Reordering

Assign elements to cache lines in order of "first touch" by the interaction pairs, following the computation order

[Figure: original particle order P1..P5, the interaction pairs in computation order, and the resulting first-touch particle order]
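First-touch reordering as described here admits a simple linear-time sketch (an illustration of the idea from Ding and Kennedy, not their code):

```python
def first_touch_order(n_particles, pairs):
    """Assign each particle its new position in the order it is first
    touched while scanning the interaction pairs.
    Returns L with L[old_index] = new_index."""
    L = [None] * n_particles
    next_slot = 0
    for pair in pairs:
        for p in pair:
            if L[p] is None:      # first touch of particle p
                L[p] = next_slot
                next_slot += 1
    for p in range(n_particles):  # untouched particles go last
        if L[p] is None:
            L[p] = next_slot
            next_slot += 1
    return L
```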

SLIDE 39

First-Touch Data Reordering

  • Advantages
    — greedily increases spatial locality of data accesses
    — simple, efficient, linear time
  • Disadvantages
    — computation order (e.g., the interaction list) must be known before data reordering can be performed
    — its greedy locality improvements may have diminishing benefits for the latter part of the interaction list

Ding and Kennedy, PLDI '99.

SLIDE 40

Data Regrouping

Assume no sequence and storage association

DO I = 1, N
  A(I) = B(I) + C(I) * D(I)
ENDDO

Cache line after transformation: A(I) B(I) C(I) D(I)

Advantages: items used together are on the same line, fewer conflict misses
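The regrouping can be sketched as interleaving the parallel attribute arrays so that all four attributes of element I land in one contiguous, cache-line-sized chunk; a minimal Python illustration:

```python
def regroup(A, B, C, D):
    """Interleave four parallel attribute arrays so the attributes of
    element I are adjacent in memory: [A0, B0, C0, D0, A1, B1, ...]."""
    return [v for abcd in zip(A, B, C, D) for v in abcd]

# After regrouping, element I's attributes occupy the contiguous
# slice packed[4*I : 4*I + 4].
```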