Jaewook Shin, Jacqueline Chame and Mary Hall, PACT'02, September 23, 2002 (PowerPoint presentation)



SLIDE 1

Jaewook Shin, Jacqueline Chame and Mary Hall

University of Southern California (USC)

PACT'02, September 23, 2002

SLIDE 2

Motivation

Multimedia applications are becoming increasingly important.

Multimedia extension architectures

– Intel SSE, Motorola AltiVec, …

New compiler technology for new optimization goals

– Exploit fine-grain parallelism supported by the architecture
– Exploit reuse of data in the large register files

SLIDE 3

Overview

1. Motivation

2. Background

  • Unroll-and-jam
  • Scalar replacement

3. Algorithm

  • Unroll amount selection for unroll-and-jam
  • Register requirement analysis
  • Superword replacement
  • Packing in registers

4. Experiments

  • Reduction in dynamic memory accesses
  • Speedup

5. Conclusion

SLIDE 4

Superword-Level Parallelism (SLP)

Definition: Fine-grain parallelism in aggregate data objects larger than a machine word

Architectural features include:

– Variable-sized data fields
– Support to rearrange data fields
– Superword register file

Example: AltiVec

– 32 superword registers (SR0 to SR31), each 128 bits wide
– Each register holds sixteen 8-bit operands, eight 16-bit operands, or four 32-bit operands
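The three operand layouts of a 128-bit superword can be sketched in C as a union. This is only an illustration: the type name `superword_t` and the helper `operands_per_superword` are hypothetical and are not part of AltiVec or of the presented compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 128-bit superword register. */
typedef union {
    uint8_t  b[16]; /* sixteen 8-bit operands */
    uint16_t h[8];  /* eight 16-bit operands  */
    uint32_t w[4];  /* four 32-bit operands   */
} superword_t;

/* Number of operands of the given width (in bits) that fit in one superword. */
static int operands_per_superword(int operand_bits)
{
    assert(operand_bits == 8 || operand_bits == 16 || operand_bits == 32);
    return (int)(sizeof(superword_t) * 8) / operand_bits;
}
```

All three views alias the same 16 bytes, which is why the same register file can serve different data widths.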

SLIDE 5

Superword-Level Locality (SLL)

Definition: Exploit data reuse in superword registers

A large-capacity register file is used as a compiler-controlled cache.

Differences from data reuse in caches

– Eliminates memory access cycles completely
– Storage has to be named explicitly

Differences from data reuse in scalar registers

– Spatial reuse in superword registers

Architecture   Registers   Register width
Pentium 4      8           128 bits
AltiVec        32          128 bits
DIVA           32          256 bits
…

SLIDE 6

Unroll-and-jam

Unrolls outer loops and fuses the resulting inner loops together

Shortens the distance between reuse

Original loop nest (reuse distance: 32 iterations)

    for (i = 1; i <= 32; i++)
      for (j = 0; j < 32; j++)
        A[i][j] = A[i-1][j] + B[j];

Outer loop is unrolled (reuse distance: 32 iterations)

    for (i = 1; i <= 32; i += 2) {
      for (j = 0; j < 32; j++)
        A[i][j] = A[i-1][j] + B[j];
      for (j = 0; j < 32; j++)
        A[i+1][j] = A[i][j] + B[j];
    }

Inner loops are fused together

    for (i = 1; i <= 32; i += 2)
      for (j = 0; j < 32; j++) {
        A[i][j] = A[i-1][j] + B[j];
        A[i+1][j] = A[i][j] + B[j];
      }
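A quick way to check that unroll-and-jam preserves the result for this loop nest is to run both versions on the same input. The sketch below assumes `A` has 33 rows (so that row `i-1` exists for `i = 1`) and uses arbitrary test data; the function names are illustrative.

```c
#include <assert.h>
#include <string.h>

#define N 32

/* Original loop nest from the slide. */
static void original(float A[N + 1][N], const float B[N])
{
    for (int i = 1; i <= N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = A[i - 1][j] + B[j];
}

/* Outer loop unrolled by 2, inner loops fused (unroll-and-jam). */
static void unroll_and_jam(float A[N + 1][N], const float B[N])
{
    for (int i = 1; i <= N; i += 2)
        for (int j = 0; j < N; j++) {
            A[i][j]     = A[i - 1][j] + B[j];
            A[i + 1][j] = A[i][j] + B[j];   /* reuses the value just written */
        }
}

/* Returns 1 if both versions produce identical arrays from the same input. */
static int same_result(void)
{
    static float A1[N + 1][N], A2[N + 1][N], B[N];
    for (int j = 0; j < N; j++) {
        B[j] = (float)(j + 1);
        A1[0][j] = A2[0][j] = (float)(2 * j);
    }
    original(A1, B);
    unroll_and_jam(A2, B);
    return memcmp(A1, A2, sizeof A1) == 0;
}
```

In the fused version, the write to `A[i][j]` and its read sit in the same inner-loop iteration, which is the shortened reuse distance the slide refers to.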

SLIDE 7

Scalar vs. Superword Replacement

Identifies array references to the same memory address

Replaces array references with scalar/superword variables

Original loop nest

    for (i = 1; i <= 32; i += 2)
      for (j = 0; j < 32; j++) {
        A[i][j] = A[i-1][j] + B[j];
        A[i+1][j] = A[i][j] + B[j];
      }

Scalar replacement

    for (i = 1; i <= 32; i += 2)
      for (j = 0; j < 32; j++) {
        T1 = B[j];
        T2 = A[i-1][j] + T1;
        A[i+1][j] = T2 + T1;
        A[i][j] = T2;
      }

Superword-level parallelization

    for (i = 1; i <= 32; i += 2)
      for (j = 0; j < 32; j += 4) {
        A[i][j:j+3] = A[i-1][j:j+3] + B[j:j+3];
        A[i+1][j:j+3] = A[i][j:j+3] + B[j:j+3];
      }

Superword replacement

    for (i = 1; i <= 32; i += 2)
      for (j = 0; j < 32; j += 4) {
        SV1 = B[j:j+3];
        SV2 = A[i-1][j:j+3] + SV1;
        A[i+1][j:j+3] = SV2 + SV1;
        A[i][j:j+3] = SV2;
      }

Factors annotated in the figure: 1.5X, 4X, 1.5X, 6X
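The factors 1.5X, 4X, 1.5X and 6X on this slide are consistent with counting memory accesses per pair of rows, although the slide text itself does not say what they measure. The sketch below works through that arithmetic under this assumption; the function name and the load/store counts per loop body are read off the four code versions and are not taken from the authors' tool.

```c
#include <assert.h>

enum { COLS = 32 };   /* trip count of the j loop in the slide's example */

/* Memory accesses for one iteration of the i loop (one pair of rows),
   given the loads and stores in the loop body and the j-loop step. */
static int accesses_per_row_pair(int loads, int stores, int j_step)
{
    assert(j_step > 0 && COLS % j_step == 0);
    return (loads + stores) * (COLS / j_step);
}
```

With 4 loads and 2 stores in the unrolled body this gives 192 accesses. Scalar replacement leaves 2 loads and 2 stores (128 accesses, 1.5X fewer), superword-level parallelization keeps 6 accesses but steps by 4 (48 accesses, 4X fewer), and superword replacement combines both (32 accesses, 6X fewer than the original and 1.5X fewer than SLP alone).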

SLIDE 8

Putting it all together

Original loop nest

    for (i = 1; i <= 32; i++)
      for (j = 0; j < 32; j++)
        A[i][j] = A[i-1][j] + B[j];

Superword-level parallelization

    for (i = 1; i <= 32; i++)
      for (j = 0; j < 32; j += 4)
        A[i][j:j+3] = A[i-1][j:j+3] + B[j:j+3];

Unroll-and-jam

    for (i = 1; i <= 32; i += 2)
      for (j = 0; j < 32; j += 4) {
        A[i][j:j+3] = A[i-1][j:j+3] + B[j:j+3];
        A[i+1][j:j+3] = A[i][j:j+3] + B[j:j+3];
      }

Superword replacement

    for (i = 1; i <= 32; i += 2)
      for (j = 0; j < 32; j += 4) {
        SV1 = B[j:j+3];
        SV2 = A[i-1][j:j+3] + SV1;
        A[i+1][j:j+3] = SV2 + SV1;
        A[i][j:j+3] = SV2;
      }
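The final superword-replaced loop can be emulated in portable C to confirm that it computes the same values as the original loop nest. This is a sketch only: `sw_t`, `sw_load`, `sw_store` and `sw_add` are hypothetical stand-ins for 4 x float superword operations, not AltiVec intrinsics.

```c
#include <assert.h>
#include <string.h>

#define N 32
#define SWS 4   /* floats per superword, assuming 128-bit registers */

typedef struct { float e[SWS]; } sw_t;   /* hypothetical superword type */

static sw_t sw_load(const float *p)    { sw_t v; memcpy(v.e, p, sizeof v.e); return v; }
static void sw_store(float *p, sw_t v) { memcpy(p, v.e, sizeof v.e); }
static sw_t sw_add(sw_t a, sw_t b)
{
    sw_t r;
    for (int k = 0; k < SWS; k++) r.e[k] = a.e[k] + b.e[k];
    return r;
}

/* Original scalar loop nest. */
static void original(float A[N + 1][N], const float B[N])
{
    for (int i = 1; i <= N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = A[i - 1][j] + B[j];
}

/* After SLP, unroll-and-jam by 2, and superword replacement. */
static void superword_replaced(float A[N + 1][N], const float B[N])
{
    for (int i = 1; i <= N; i += 2)
        for (int j = 0; j < N; j += SWS) {
            sw_t sv1 = sw_load(&B[j]);                      /* SV1 = B[j:j+3] */
            sw_t sv2 = sw_add(sw_load(&A[i - 1][j]), sv1);  /* SV2 = A[i-1][j:j+3] + SV1 */
            sw_store(&A[i + 1][j], sw_add(sv2, sv1));       /* A[i+1][j:j+3] = SV2 + SV1 */
            sw_store(&A[i][j], sv2);                        /* A[i][j:j+3] = SV2 */
        }
}

/* Returns 1 if the transformed loop matches the original on test data. */
static int matches_original(void)
{
    static float A1[N + 1][N], A2[N + 1][N], B[N];
    for (int j = 0; j < N; j++) {
        B[j] = (float)(j + 1);
        A1[0][j] = A2[0][j] = (float)(3 * j);
    }
    original(A1, B);
    superword_replaced(A2, B);
    return memcmp(A1, A2, sizeof A1) == 0;
}
```

Each loop body issues two superword loads and two superword stores; `B[j:j+3]` and the freshly computed `A[i][j:j+3]` are kept in `sv1` and `sv2` instead of being reloaded.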

SLIDE 9

What is required?

Unroll amount selection

Code generation

SLIDE 10

Assumptions

Array subscript expressions are linear functions of loop index variables

No reuse of registers within an iteration of the transformed loop

– Registers allocated for caching data are live throughout the loop body

No data reuse across iterations of the transformed loop

– Only loop-independent reuse opportunities are exploited

SLIDE 11

Unroll Amount Selection: Optimization Goal

Find unroll factors <X1, X2, …, Xn> for loops 1 to n

Maximize data reuse in superword registers exposed by unroll-and-jam

Constraint: The number of superword registers required does not exceed the number available.

SLIDE 12

Reuse in Scalar vs. Superword Register

Reuse type      Example                            Scalar   Superword
Self spatial    for(i=0; i<N; i++) A[i]            No       Yes, as for(i=0; i<N; i+=4) A[i:i+3]
Group spatial   for(i=0; i<N; i++) A[i], A[i+2]    No       Yes

Figure: layout of A[i], A[i+1], A[i+2] and A[i+3] within a superword for the two cases.
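The group-spatial case can be made concrete by counting loads. The sketch below assumes aligned superwords, that each superword is loaded once and then reused, and that the two references overlap (offset no larger than the number of iterations considered); all function names are illustrative.

```c
#include <assert.h>

/* Aligned superwords touched by elements first..last of an array (inclusive). */
static int superwords_touched(int first, int last, int sws)
{
    assert(sws > 0 && 0 <= first && first <= last);
    return last / sws - first / sws + 1;
}

/* Superword loads needed for A[i] and A[i+offset] over `iters` consecutive
   iterations starting at i0, when every superword is loaded only once. */
static int superword_loads(int i0, int iters, int offset, int sws)
{
    assert(iters >= 1 && offset >= 0 && offset <= iters);
    return superwords_touched(i0, i0 + iters - 1 + offset, sws);
}

/* Scalar loads for the same two references with no reuse: two per iteration. */
static int scalar_loads(int iters)
{
    return 2 * iters;
}
```

For A[i] and A[i+2] over four iterations starting at an aligned i, elements i to i+5 span two superwords, so two superword loads replace eight scalar loads.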

SLIDE 13

Register Requirement Analysis

Derives the number of superword registers required for a particular unroll amount and array references.

Example: A[i] when the i loop is unrolled by X

⌈X/4⌉ superword registers are required!

Figure: A[i+0], A[i+1], A[i+2], A[i+3], …, A[i+(X-2)], A[i+(X-1)] laid out in consecutive superwords from low address to high address.

SLIDE 14

Register Requirement Analysis (cont.)

For A[ai+b] and an unroll amount X

Coefficient   Number of registers
a = 0         1
a < SWS       ⌈aX/SWS⌉
a ≥ SWS       X

SWS (SuperWord Size): Number of data elements that fit in a superword register

The current implementation can also deal with

Array references             Example
Multiple index variables     A[ai+bj+c]
Multi-dimensional arrays     A[ai+b][cj+d]
Group of array references    A[ai+b1][cj+d1], A[ai+b2][cj+d2], …
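The register count for a single reference A[ai+b] can be written as a small function. This is a minimal sketch, assuming one register when a = 0, ⌈aX/SWS⌉ registers when 0 < a < SWS, and X registers when a ≥ SWS; the function name is hypothetical and the multi-dimensional and grouped cases are not covered.

```c
#include <assert.h>

/* Superword registers needed for A[a*i + b] when the i loop is unrolled by x.
   sws is the number of data elements that fit in one superword register. */
static int registers_required(int a, int x, int sws)
{
    assert(a >= 0 && x >= 1 && sws >= 1);
    if (a == 0)
        return 1;                    /* same element in every unrolled copy */
    if (a >= sws)
        return x;                    /* every copy lands in its own superword */
    return (a * x + sws - 1) / sws;  /* ceil(a*x / sws) using integer division */
}
```

For example, A[i] with four elements per superword and an unroll amount of 8 needs `registers_required(1, 8, 4)`, that is 2 registers, matching ⌈X/4⌉ for X = 8.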

SLIDE 15

Unroll Amount Selection

Search for unroll amounts that maximize reuse in superword registers

Prune search space

– Exploit monotonicity at each dimension
– Avoid register pressure

Figure: Search space for FIR. Axes: unroll amount of the i-loop and unroll amount of the j-loop (each from 1 to 31), and number of memory accesses (0.0E+00 to 3.5E+09).
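A pruned search of this kind can be sketched for a single unrolled loop. The cost model below is an assumption made for illustration, not the authors' model: for the running example A[i][j] = A[i-1][j] + B[j] after SLP, unrolling the i loop by x is taken to keep x + 2 superwords live (rows i-1 to i+x-1 of A, plus B) and to issue x + 2 memory accesses per loop body. All names are hypothetical, and the real search covers several loops at once.

```c
#include <assert.h>

enum { N = 32, SWS = 4 };

/* Assumed register need: x + 1 superwords of A plus one of B. */
static int registers_needed(int x)
{
    return x + 2;
}

/* Assumed memory accesses for the whole nest: 2 loads and x stores per body. */
static int memory_accesses(int x)
{
    return (N / x) * (N / SWS) * (x + 2);
}

/* Pick the unroll amount with the fewest memory accesses that fits the
   register file. Register need grows monotonically with x, so the search
   stops at the first x that no longer fits. */
static int select_unroll(int available_registers)
{
    assert(available_registers >= 3);
    int best = 1;
    for (int x = 1; x <= N; x++) {
        if (registers_needed(x) > available_registers)
            break;      /* prune: larger x only needs more registers */
        if (N % x != 0)
            continue;   /* keep the sketch simple: no remainder loop */
        if (memory_accesses(x) < memory_accesses(best))
            best = x;
    }
    return best;
}
```

Under this model, a 32-entry register file selects an unroll amount of 16, while an 8-entry file selects 4.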

SLIDE 16

Code Generation Optimizations

Superword Replacement

– Exploit reuse opportunities
    – Temporal reuse: similar to scalar replacement
    – Spatial reuse: sliding windows such as FIR
– Unaligned memory accesses

Packing in registers

– Replaces packing through memory
– Reduces scalar memory accesses

SLIDE 17

Packing in Registers

In some cases, data must be packed into a superword register.

– Alignment, non-unit stride array references

Packing through memory is expensive.

Packing in superword registers

Packing through memory

    w = *((float *)&a + 0);
    x = *((float *)&b + 0);
    y = *((float *)&c + 0);
    z = *((float *)&d + 0);
    *((float *)&p + 0) = w;
    *((float *)&p + 1) = x;
    *((float *)&p + 2) = y;
    *((float *)&p + 3) = z;

Packing in registers

    temp1 = replicate(a, 0);
    temp2 = replicate(b, 0);
    temp3 = replicate(c, 0);
    temp4 = replicate(d, 0);
    p = shift_and_load(p, temp1);
    p = shift_and_load(p, temp2);
    p = shift_and_load(p, temp3);
    p = shift_and_load(p, temp4);

Figure (step 1): replicate(a, 0) turns a = (a[0], a[1], a[2], a[3]) into temp1 = (a[0], a[0], a[0], a[0]); after p = shift_and_load(p, temp1), p contains a[0].

SLIDE 18

Packing in Registers (cont.)

Figure (step 2): temp2 = (b[0], b[0], b[0], b[0]); after p = shift_and_load(p, temp2), p contains a[0], b[0].

SLIDE 19

Packing in Registers (cont.)

Figure (step 3): temp3 = (c[0], c[0], c[0], c[0]); after p = shift_and_load(p, temp3), p contains a[0], b[0], c[0].

SLIDE 20

Packing in Registers (cont.)

Figure (step 4): temp4 = (d[0], d[0], d[0], d[0]); after p = shift_and_load(p, temp4), p contains a[0], b[0], c[0], d[0].
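The four packing steps can be emulated in plain C. The semantics of `replicate` and `shift_and_load` below are assumptions inferred from the step-by-step figures (p shifts by one field toward index 0 and its last field is filled from the replicated temporary); they are not the definitions of any real instruction, and `sw_t` and `pack4` are hypothetical names.

```c
#include <assert.h>

enum { SWS = 4 };
typedef struct { float e[SWS]; } sw_t;   /* hypothetical 4 x float superword */

/* Assumed semantics: copy field idx of v into every field of the result. */
static sw_t replicate(sw_t v, int idx)
{
    assert(idx >= 0 && idx < SWS);
    sw_t r;
    for (int k = 0; k < SWS; k++) r.e[k] = v.e[idx];
    return r;
}

/* Assumed semantics: shift p by one field toward index 0 and fill the
   last field with a field of t (all fields of t are equal after replicate). */
static sw_t shift_and_load(sw_t p, sw_t t)
{
    sw_t r;
    for (int k = 0; k + 1 < SWS; k++) r.e[k] = p.e[k + 1];
    r.e[SWS - 1] = t.e[0];
    return r;
}

/* Pack field 0 of a, b, c and d into one superword without using memory. */
static sw_t pack4(sw_t a, sw_t b, sw_t c, sw_t d)
{
    sw_t p = {{0}};
    p = shift_and_load(p, replicate(a, 0));
    p = shift_and_load(p, replicate(b, 0));
    p = shift_and_load(p, replicate(c, 0));
    p = shift_and_load(p, replicate(d, 0));
    return p;
}
```

After the fourth step p holds (a[0], b[0], c[0], d[0]), the same layout the memory-based version produces by writing w, x, y and z into fields 0 to 3 of p.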

SLIDE 21

Experimental Flow

Figure: A C/Fortran program is transformed into a superword instruction extended C program, compiled by a superword extended gcc into a G4 executable, and run on a Macintosh G4 to obtain performance results.

Transformation stages shown: Unroll-and-Jam, MIT-SLP, Superword Replacement, Packing in registers

Analyses shown: Data dependence analysis, Identify data reuse, Register requirement analysis, Select unroll amounts

SLIDE 22

Reduction in Dynamic Memory Accesses

Figure: Two bar charts over the benchmarks VMM, FIR, YUV, MMM, SWIM and TOMCATV, showing Vector Mem. Acc. Removed (%) and Scalar Mem. Acc. Removed (%), each on an axis running up to 100.

SLIDE 23

Speedup Breakdown

Figure: Speedup (axis from 0.5 to 3) for VMM, FIR, YUV, MMM, SWIM and TOMCATV. Series: SLP, Unroll-and-Jam + SLP, Superword Replacement, Packing in Registers.

SLIDE 24

Related Research

Locality in caches

Wolfe(89), Ferrante et al(91), Lam et al(91), Wolf(92), Esseghir(93), Temam et al(93,95), Carr et al(94), Coleman and McKinley(95), Ghosh et al(97,98), Chame and Moon(99), Rivera and Tseng(99), Sarkar and Megiddo(00), Chatterjee(01), ...

  • Conflict misses
  • Simpler storage management

Superword-Level Parallelism

Cheong and Lam(97), Larsen and Amarasinghe(00), Sreraman and Govindarajan(00), Commercial products

  • Code Warrior 7
  • Vast/AltiVec
  • Intel C compiler

No data locality at superword register level

Locality in Scalar Registers

Wolf(92), Carr and Kennedy(94), Jimenez(99)

Scalar registers that do not have spatial locality

SLIDE 25

Conclusion

An algorithm for compiler-controlled caching in a superword register file

Optimizations

– Superword Replacement, Packing in Registers

Speedups over SLP from 1.3 to 2.8X

Compatible and complementary with locality optimizations for cache