

SLIDE 1

Architectural Support for Parallel Reduction in Scalable Shared-Memory Multiprocessors

María Jesús Garzarán, M. Prvulovic§, Y. Zhang§, A. Jula*, H. Yu*, L. Rauchwerger*, and J. Torrellas§

U. of Zaragoza, Spain
* Texas A&M University
§ University of Illinois

SLIDE 2

Motivation

Reductions are important in scientific codes

for (...) {
  ...
  x = x op expression;
  ...
}

Reduction parallelization algorithms are not scalable

parallel_for (...) {
  ...
  lock(w[x[i]]);
  w[x[i]] = w[x[i]] op expression;
  unlock(w[x[i]]);
  ...
}

SLIDE 3

Contribution

New architectural support for parallel reductions in CC-NUMA:

  • Speeds up parallel reduction
  • Makes parallel reduction scalable

Increases 16-processor speedup from 2.7 to 7.6

[Chart: speedup for 1, 4, 8, and 16 processors, Opt vs. NoOpt]

SLIDE 4

Outline

  • Background on Reduction
  • Parallelizing Reduction in Software
  • Our contribution: Private Cache-Line Reduction (PCLR)
  • Evaluation
  • Related Work
  • Conclusions

SLIDE 5

Background on Reduction

Reduction operation:

for (...) {
  ...
  x = x op expression;
  ...
}

– op: associative and commutative operator
– x: does not occur in expression or anywhere else in the loop

There may be complex flow dependences across iterations

– Parallelization of reductions needs special transformations

for (i = 0; i < Nodes; i++)
  w[x[i]] += expression;
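As a concrete, self-contained illustration of such an irregular reduction (the array sizes and contribution values below are made up for the example):

#include <stdio.h>

#define NODES 8
#define BINS  4

int main(void) {
    double w[BINS] = {0.0, 0.0, 0.0, 0.0};       /* reduction array */
    int    x[NODES] = {0, 1, 1, 3, 2, 1, 0, 3};  /* index array: values only known at run time */
    double contrib[NODES] = {1, 2, 3, 4, 5, 6, 7, 8};

    /* Irregular reduction: different iterations may update the same w[] element,
       so there can be flow dependences across iterations. */
    for (int i = 0; i < NODES; i++)
        w[x[i]] += contrib[i];

    for (int b = 0; b < BINS; b++)
        printf("w[%d] = %g\n", b, w[b]);
    return 0;
}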

SLIDE 6

Outline

  • Background on Reduction
  • Parallelizing Reduction in Software
  • Our contribution: Private Cache-Line Reduction (PCLR)
  • Evaluation
  • Related Work
  • Conclusions

SLIDE 7

Parallelizing Reduction in Software (I)

Enclose the access in an unordered critical section:

parallel_for (...) {
  lock(w[x[i]]);
  w[x[i]] += expression;
  unlock(w[x[i]]);
}
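A minimal sketch of this lock-based scheme in C with OpenMP (illustrative only: the sizes, the per-element lock array, and the constant contribution stand in for details the slide leaves abstract):

#include <omp.h>
#include <stdio.h>

#define NODES 1000
#define BINS  16

double     w[BINS];
int        x[NODES];
omp_lock_t lock[BINS];              /* one lock per reduction element */

int main(void) {
    for (int b = 0; b < BINS; b++) { w[b] = 0.0; omp_init_lock(&lock[b]); }
    for (int i = 0; i < NODES; i++) x[i] = i % BINS;

    /* Each update is protected by the lock of the element it touches. */
    #pragma omp parallel for
    for (int i = 0; i < NODES; i++) {
        omp_set_lock(&lock[x[i]]);
        w[x[i]] += 1.0;             /* stand-in for "expression" */
        omp_unset_lock(&lock[x[i]]);
    }

    printf("w[0] = %g\n", w[0]);
    for (int b = 0; b < BINS; b++) omp_destroy_lock(&lock[b]);
    return 0;
}

Every update pays the lock acquire/release, and processors contend whenever they hit the same element, which is exactly the overhead and contention listed below.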

Drawbacks:


– Overhead
– Contention increases with the number of processors

SLIDE 8

Parallelizing Reduction in Software (II)

Each processor accumulates on a private array

Clear Private Array:
for (i = 0; i < array_size; i++)
  w_priv[pid][i] = 0;

parallel_for (...)
  w_priv[pid][x[i]] += expression;
barrier();

Merge:
for (i = MyBegin; i < MyEnd; i++)
  for (p = 0; p < NumProcs; p++)
    w[i] += w_priv[p][i];
barrier();
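A sketch of the privatization transformation in C with OpenMP (again illustrative; thread count, sizes, and the contribution value are assumptions):

#include <omp.h>
#include <stdio.h>
#include <string.h>

#define NODES 1000
#define BINS  16
#define MAXP  64    /* assumes the thread count never exceeds MAXP */

double w[BINS];
double w_priv[MAXP][BINS];
int    x[NODES];

int main(void) {
    for (int i = 0; i < NODES; i++) x[i] = i % BINS;

    #pragma omp parallel
    {
        int pid = omp_get_thread_num();

        /* Phase 1: clear this thread's private copy (this is what sweeps the cache). */
        memset(w_priv[pid], 0, sizeof(w_priv[pid]));

        /* Phase 2: accumulate on the private copy; no locks needed. */
        #pragma omp for
        for (int i = 0; i < NODES; i++)
            w_priv[pid][x[i]] += 1.0;   /* stand-in for "expression" */

        /* Phase 3: merge; each thread owns a slice of w[], but must read every
           thread's private copy, so the merge work grows with the array size. */
        int nthreads = omp_get_num_threads();
        #pragma omp for
        for (int i = 0; i < BINS; i++)
            for (int p = 0; p < nthreads; p++)
                w[i] += w_priv[p][i];
    }

    printf("w[0] = %g\n", w[0]);
    return 0;
}

The implicit barriers at the end of each "omp for" play the role of the barrier() calls in the slide's pseudocode.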

SLIDE 9

Drawbacks of the Privatization Method

Initialization phase

– Sweeps the cache before the parallel loop starts

Clear Private Array:
for (i = 0; i < array_size; i++)
  w_priv[pid][i] = 0;

SLIDE 10

Drawbacks of the Privatization Method

Merging phase:

– Work proportional to the array size

[Diagram: private arrays of P1 and P2 are merged into the shared array]

SLIDE 11

Drawbacks of the Privatization Method

Merging phase:

– Work proportional to the array size

[Diagram: with four processors, private arrays of P1, P2, P3, and P4 are merged into the shared array]

This method is not scalable

SLIDE 12

Outline

  • Background on Reduction
  • Parallelizing Reduction in Software
  • Our contribution: Private Cache-Line Reduction (PCLR)
  • Evaluation
  • Related Work
  • Conclusions

SLIDE 13

Main Idea of PCLR

Use non-coherent cache lines in the different processors as the temporary private arrays:

  • Remove initialization phase
  • Accumulate on cache lines

  • Remove the merging phase

SLIDE 14

Removing Initialization

Cache lines are initialized on demand on cache misses

Clear Private Array (eliminated: lines are initialized on demand):
for (i = 0; i < array_size; i++)
  w_priv[pid][i] = 0;

parallel_for (...)
  w_priv[pid][x[i]] += expression;
barrier();

Merge:
for (i = MyBegin; i < MyEnd; i++)
  for (p = 0; p < NumProcs; p++)
    w[i] += w_priv[p][i];
barrier();

SLIDE 15

Removing Initialization

[Diagram: on a miss to a reduction line, the directory returns a line of neutral elements to the CPU's cache instead of fetching data from memory]

SLIDE 16

Accumulating on Cache Lines

No need to allocate a private array

parallel_for (...)
  w[x[i]] += expression;      /* replaces w_priv[pid][x[i]] += expression */
barrier();

Merge:
for (i = MyBegin; i < MyEnd; i++)
  for (p = 0; p < NumProcs; p++)
    w[i] += w_priv[p][i];
barrier();

SLIDE 17

Removing Initialization

[Diagram: the CPU loads, adds, and stores, accumulating on private cache lines without involving the directory or memory]

SLIDE 18

Removing the Merge

Lines are accumulated at the home on displacements

parallel_for (...)
  w[x[i]] += expression;
barrier();
CacheFlush();
barrier();

Merge (eliminated: each line is added into the home memory when it is displaced or flushed)
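As an illustration of how a reduction line behaves under PCLR, here is a small single-threaded C model (not from the talk; names such as fill_on_miss and displace are invented for this sketch): a miss fills the line with the neutral element, each processor accumulates into its own copy, and a displacement or the final CacheFlush() adds the line's contents into home memory instead of overwriting it.

#include <stdio.h>

#define LINE_WORDS 4

/* Toy model of one processor's private copy of a reduction line. */
typedef struct {
    int    valid;
    double data[LINE_WORDS];
} Line;

static double home[LINE_WORDS] = {2.0, 0.0, 0.0, 0.0};  /* home memory contents */

/* Miss: the directory returns neutral elements (0.0 for +) instead of memory data. */
static void fill_on_miss(Line *l) {
    l->valid = 1;
    for (int i = 0; i < LINE_WORDS; i++) l->data[i] = 0.0;
}

/* Displacement or flush: the directory's ALU adds the line into home memory. */
static void displace(Line *l) {
    for (int i = 0; i < LINE_WORDS; i++) home[i] += l->data[i];
    l->valid = 0;
}

/* One reduction update, as issued by a processor. */
static void reduce_add(Line *l, int word, double v) {
    if (!l->valid) fill_on_miss(l);
    l->data[word] += v;
}

int main(void) {
    Line p0 = {0}, p1 = {0};      /* two processors' private copies of the same line */
    reduce_add(&p0, 0, 5.0);      /* P0 accumulates 5 */
    reduce_add(&p1, 0, 1.0);      /* P1 accumulates 1 in its own copy: no coherence traffic */
    displace(&p0);                /* e.g., a capacity displacement */
    displace(&p1);                /* the final CacheFlush() at the end of the loop */
    printf("home[0] = %g (2 + 5 + 1)\n", home[0]);
    return 0;
}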

SLIDE 19

Removing the Merge

[Diagram: on a displacement, the line travels over the network to the directory, which merges it into memory]

SLIDE 20

Recognizing Reduction Data

Naïve approach

– new load & store instructions

Advanced Mechanism

– shadow addresses [like Impulse]


SLIDE 21

Shadow Addresses

The directory recognizes shadow addresses and translates them into the original ones

[Diagram of physical memory: the original reduction array sits in normal memory; a shadow reduction array is placed in otherwise uninstalled memory; the directory recognizes shadow addresses and translates them to the original reduction array]

SLIDE 22

Atomicity Issues

A reduction op is composed of a load-store instruction pair. A problem appears if the cache line is displaced between the reduction load and the store.

load  r1, addr
add   r1, r1, r3
store r1, addr

SLIDE 23

Atomicity Issues

A reduction op is composed of a load-store instruction pair. A problem appears if the cache line is displaced between the reduction load and the store.

load  r1, addr
add   r1, r1, r3
store r1, addr

Example: the cache line at addr holds a partial sum of 5, r3 = 1, and the home memory at addr holds 2. The final result should be 8.

SLIDE 24

Atomicity Issues

A reduction op is composed of a load-store instruction pair. A problem appears if the cache line is displaced between the reduction load and the store.

The load reads the partial sum 5 into r1 (r3 = 1). The line is then displaced, so its value, 5, is added at the home (addr = 2 + 5). The final result should be 8.

SLIDE 25

Atomicity Issues

A reduction op is composed of a load-store instruction pair. A problem appears if the cache line is displaced between the reduction load and the store.

The add produces r1 = 6 (5 + 1). The home memory has already absorbed the displaced 5 (addr = 2 + 5). The final result should be 8.

SLIDE 26

Atomicity Issues

A reduction op is composed of a load-store instruction pair. A problem appears if the cache line is displaced between the reduction load and the store.

The store writes r1 = 6 into the re-fetched cache line. The home memory still holds 2 + 5. The final result should be 8.

SLIDE 27

Atomicity Issues

A reduction op is composed of a load-store instruction pair. A problem appears if the cache line is displaced between the reduction load and the store.

When the line holding 6 is later displaced, 6 is added at the home: 2 + 5 + 6 = 13 instead of the correct 8. The partial sum 5 has been counted twice.

SLIDE 28

Solution to the Atomicity Problem

Atomically exchange the neutral element with the memory contents:

Original sequence:
  load  r1, addr
  add   r1, r1, r3
  store r1, addr

With atomic exchange:
  load  r1, neutral
  swap  r1, addr
  add   r1, r1, r3
  store r1, addr

The line can now safely be displaced between the swap and the store: after the swap it holds only the neutral element, so a displacement merges nothing extra at the home, and the store later installs the full partial sum into a fresh line.
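To check the arithmetic of the fix on the slide's 2 + 5 + 1 example, here is a tiny single-threaded C model (purely illustrative; displace() stands in for a hardware displacement, which merges the line at the home and leaves a neutral-element line behind):

#include <stdio.h>

static double home, line;

/* Displacement: the directory merges the line into home memory and the
   line is re-filled with the neutral element (0 for +) on the next access. */
static void displace(void) { home += line; line = 0.0; }

static void reset(void) { home = 2.0; line = 5.0; }   /* the 2 + 5 + 1 example */

int main(void) {
    double r1, r3 = 1.0;

    /* Broken sequence: load / add / store, displaced between load and store. */
    reset();
    r1 = line;            /* load r1, addr  -> 5 */
    displace();           /* line (5) merged at home: home = 7 */
    r1 = r1 + r3;         /* add  r1, r1, r3 -> 6 */
    line = r1;            /* store r1, addr */
    displace();           /* final flush */
    printf("broken: home = %g (should be 8)\n", home);   /* 13: the 5 counted twice */

    /* Fixed sequence: load neutral / swap / add / store. */
    reset();
    r1 = 0.0;                                  /* load r1, neutral */
    { double t = line; line = r1; r1 = t; }    /* swap r1, addr: r1 = 5, line = 0 */
    displace();           /* a displacement here now merges only the neutral 0 */
    r1 = r1 + r3;         /* add  r1, r1, r3 -> 6 */
    line = r1;            /* store r1, addr */
    displace();           /* final flush */
    printf("fixed:  home = %g\n", home);                 /* 8 */
    return 0;
}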


SLIDE 29

Summary of Architectural Support

  • Special support in the directory/network controller
  • Intercept a reduction cache miss and return neutral elements
  • Use an ALU to merge data at displacements or at the loop's end
  • Reduction lines can be dirty in multiple caches

SLIDE 30

Advantages of PCLR

Remove initialization phase:
– Avoid cache sweeping
No need to allocate private arrays
Remove merging phase:
– Work at the end: proportional to cache size instead of array size

SLIDE 31

Outline

  • Background on Reduction
  • Parallelizing Reduction in Software
  • Our contribution: Private Cache-Line Reduction (PCLR)
  • Evaluation
  • Related Work
  • Conclusions

SLIDE 32

Evaluation Methodology

  • Execution-driven simulator
  • Scalable multiprocessor: 4-16 processors
  • Detailed superscalar processor model
  • 32 KB 2-way L1, 512 KB 4-way L2
  • Round-trip latencies (no contention): L1 (2 cyc), L2 (10 cyc), local memory (104 cyc), 2-hop (297 cyc)
  • Floating-point unit in the directory controller:
    – fully pipelined
    – latency: 6 processor cycles

SLIDE 33

Applications

Fortran and C codes; loops with reduction ops identified by the compiler:

– Euler [HPF-2 suite]
– Equake [SPECfp2000 suite]
– Vml* [Sparse BLAS suite]
– Charmm* [CHARMM appl]
– Nbf* [GROMOS appl]

Reduction loops account for an average of 81.2% of Tsequential

* Kernels

SLIDE 34

Mechanisms Evaluated

Software implementation

– Sw : Privatized arrays and merge at loop’s end

Two implementations of PCLR:

– Hw: hardwired directory controller
– Flex: programmable directory controller, like MAGIC [FLASH]

  • Implement PCLR with no HW changes in directory controller
  • Contention and latency increase


SLIDE 35

Execution Time for 16 Processors

[Chart: normalized execution time on 16 processors for Euler, Equake, Vml, Charmm, and Nbf; bars for Sw, Hw, and Flex, each broken into Loop, Merge, and Init components and annotated with its speedup]

Average speedups: Sw (2.7), Hw (7.6), Flex (6.4)

SLIDE 36

Scalability

[Chart: speedup for 1, 4, 8, and 16 processors with Hw, Flex, and Sw]

Sw scales poorly

Merging phase limits speedups (Amdahl's Law)

PCLR truly scalable

SLIDE 37

Outline

  • Background on Reduction
  • Parallelizing Reduction in Software
  • Our contribution: Private Cache-Line Reduction (PCLR)
  • Evaluation
  • Related Work
  • Conclusions

SLIDE 38

Related Work

Larus et al. [ASPLOS94]

– Reconcilable Shared Memory

Zhang et al. [Illinois TR]

– Modified architecture for speculative parallelization

Hardware support for synchronization

– Fetch&Add (NYU Ultracomputer)
– Fetch&Op (IBM RP3)
– Support for combining trees
– Memory-based synchronization primitives (Cedar)
– Set of synchronization primitives (Goodman et al.)

SLIDE 39

Outline

  • Background on Reduction
  • Parallelizing Reduction in Software
  • Our contribution: Private Cache-Line Reduction (PCLR)
  • Evaluation
  • Related Work
  • Conclusions

SLIDE 40

Conclusions

  • Proposed novel architectural support for scalable parallel reduction
  • Architectural modifications concentrated in the directory controller
  • Average speedup for 16 processors increases from 2.7 to 7.6

SLIDE 41

Architectural Support for Parallel Reduction in Scalable Shared-Memory Multiprocessors

María Jesús Garzarán, M. Prvulovic, Y. Zhang, A. Jula, H. Yu, L. Rauchwerger, and J. Torrellas

garzaran@posta.unizar.es
http://iacoma.cs.uiuc.edu

SLIDE 42

Parallelizing Reductions

Two steps

– Recognizing the reduction variable
  • syntactic pattern matching
  • verify that the operator is commutative & associative
  • verify that the reduction variable is not used anywhere else
– Apply a parallelization transformation (see the sketch below)
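For illustration (not from the slides), two loops a recognizer might examine: the first statement matches the x = x op expression pattern, while the second does not because the candidate variable is also read elsewhere in the loop:

#include <stdio.h>

#define N 8

int main(void) {
    double a[N] = {1, 2, 3, 4, 5, 6, 7, 8}, b[N] = {1, 1, 1, 1, 1, 1, 1, 1}, c[N];

    /* Recognized as a reduction: sum appears only as "sum = sum + ...". */
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum = sum + a[i] * b[i];

    /* Not a reduction: acc is also read by another statement in the loop,
       so its updates cannot be reordered or privatized freely. */
    double acc = 0.0;
    for (int i = 0; i < N; i++) {
        acc = acc + a[i];
        c[i] = acc;               /* running prefix sum uses acc's intermediate values */
    }

    printf("sum = %g, c[%d] = %g\n", sum, N - 1, c[N - 1]);
    return 0;
}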


SLIDE 43

Reduction

Reduction operation:

for (...)
  x = x op expression

– op: associative and commutative operator
– x: does not occur in expression or anywhere else

Parallelization of reductions needs special transformations

– There may be flow dependences between iterations

for (i = 0; i < Nodes; i++)
  w[x[i]] += expression;

SLIDE 44

Recognizing Reduction Data (I)

Naïve approach

Special load and store for reduction accesses (load&addint, load&addfloat …)

  • Special messages on cache-miss


  • Bring the data into the cache in a special state
  • Special displacement message


SLIDE 45

Recognizing Reduction Data (II)

Advanced Mechanism

Shadow addresses [Impulse]

  • Use a shadow array during the reduction
  • Shadow array is mapped to shadow physical addresses
  • Directory controller

– Recognizes shadow physical addresses
– Translates them into the physical address corresponding to the original reduction array


SLIDE 47

Reduction

Reduction operation

for (i = 0; i < Nodes; i++)
  w[x[i]] += expression;

– +: associative and commutative operator
– w: does not occur in expression or anywhere else

Parallelization of reductions needs special transformations

– There may be flow dependences between iterations


SLIDE 48

Additional Use of PCLR

Dynamic Last Value assignment

– Loop parallelized through privatization
– The privatized variable is used after the loop
– The compiler cannot determine the last writing iteration

for (i = 0; i < N; i++) {
  if (f(i)) {
    A[g[i]] = …;
    …
    … = A[g[i]];
  }
}

Dynamic last value assignment:
  • Identify the private array copy that holds the last value
  • Copy the value from the private variable to the shared variable

SLIDE 49

Drawbacks of the Privatization Method

Merge phase

– As the number of processors increases, the portion of the array each processor merges decreases, but the number of private arrays to merge increases
– The work of merging is always proportional to the array size

Merge:
for (i = MyBegin; i < MyEnd; i++)
  for (p = 0; p < NumProcs; p++)
    w[i] += w_priv[p][i];
barrier();

This method is not scalable

SLIDE 50

Removing Initialization

[Diagram: on a miss to a reduction line, the directory returns a line of neutral elements to the cache instead of fetching data from memory]

SLIDE 51

Accumulating on Cache Lines

[Diagram: the CPU loads, adds, and stores, accumulating on private cache lines without directory or memory traffic]

SLIDE 52

Removing the Merge

[Diagram: on a displacement, the line is sent over the network to the directory, which merges it into memory]

SLIDE 53

Atomicity Concerns

Solution:

– Special load and store instructions

  • load&pin and store&unpin

– Small number of Cache Pin Registers (CPR). Each has:

  • tag of the pinned line
  • counter

Operation

– Load&pin: allocate a CPR; set its tag; set counter = 1
– Store&unpin: decrease the counter; deallocate the CPR if the counter reaches 0
– Displacement: prevented for lines whose tag matches any CPR
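As a rough software model of this bookkeeping (an illustration only, not the hardware design; the register count is arbitrary, and bumping the counter when a line is already pinned is an assumption beyond the slide's "set counter = 1"):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_CPR 4   /* assumed small number of Cache Pin Registers */

typedef struct {
    bool     in_use;
    uint64_t tag;      /* tag of the pinned cache line */
    unsigned counter;  /* outstanding load&pin operations on this line */
} CPR;

static CPR cpr[NUM_CPR];

/* load&pin: pin the line, allocating a CPR (or bumping its counter, assumed). */
static bool load_and_pin(uint64_t line_tag) {
    for (int i = 0; i < NUM_CPR; i++)
        if (cpr[i].in_use && cpr[i].tag == line_tag) { cpr[i].counter++; return true; }
    for (int i = 0; i < NUM_CPR; i++)
        if (!cpr[i].in_use) { cpr[i] = (CPR){true, line_tag, 1}; return true; }
    return false;      /* no free CPR: the access would have to wait or fall back */
}

/* store&unpin: decrease the counter; free the CPR when it reaches zero. */
static void store_and_unpin(uint64_t line_tag) {
    for (int i = 0; i < NUM_CPR; i++)
        if (cpr[i].in_use && cpr[i].tag == line_tag && --cpr[i].counter == 0)
            cpr[i].in_use = false;
}

/* Displacement is refused while the line's tag matches any active CPR. */
static bool may_displace(uint64_t line_tag) {
    for (int i = 0; i < NUM_CPR; i++)
        if (cpr[i].in_use && cpr[i].tag == line_tag) return false;
    return true;
}

int main(void) {
    load_and_pin(0x40);
    printf("displace 0x40 while pinned? %s\n", may_displace(0x40) ? "yes" : "no");
    store_and_unpin(0x40);
    printf("displace 0x40 after unpin?  %s\n", may_displace(0x40) ? "yes" : "no");
    return 0;
}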
