Jaewook Shin, Jacqueline Chame and Mary Hall, PACT'02, September 23, 2002


  1. Jaewook Shin, Jacqueline Chame and Mary Hall
     PACT'02, September 23, 2002
     University of Southern California (USC)

  2. Motivation
     • Multimedia applications are becoming increasingly important.
     • Multimedia extension architectures
       – Intel SSE, Motorola AltiVec, …
     • New compiler technology for new optimization goals
       – Exploit fine-grain parallelism supported by the architecture
       – Exploit reuse of data in the large register files

  3. Overview
     1. Motivation
     2. Background
        • Unroll-and-jam
        • Scalar replacement
     3. Algorithm
        • Unroll amount selection for unroll-and-jam
        • Register requirement analysis
        • Superword replacement
        • Packing in registers
     4. Experiments
        • Reduction in dynamic memory accesses
        • Speedup
     5. Conclusion

  4. Superword-Level Parallelism (SLP)
     • Definition: fine-grain parallelism in aggregate data objects larger than a machine word
     • Architectural features include:
       – Variable-sized data fields
       – Support to rearrange data fields
       – Superword register file
     [Figure: Example: AltiVec. Superword registers SR0 to SR31, each 128 bits wide, holding sixteen 8-bit operands, eight 16-bit operands, or four 32-bit operands.]
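
The three operand layouts in the AltiVec figure can be modeled in a few lines of C. This is only a sketch: the type `superword128` and the helper `lanes_for_width` are hypothetical names chosen for illustration, not AltiVec types or intrinsics.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 128-bit superword register.
 * The same 16 bytes can be viewed with different field widths. */
typedef union {
    uint8_t  b[16];  /* sixteen 8-bit operands */
    uint16_t h[8];   /* eight 16-bit operands  */
    uint32_t w[4];   /* four 32-bit operands   */
} superword128;

/* Number of operands of width elem_bits that fit in a register of
 * reg_bits. The element width must divide the register width. */
int lanes_for_width(int reg_bits, int elem_bits) {
    assert(elem_bits > 0 && reg_bits % elem_bits == 0);
    return reg_bits / elem_bits;
}
```

For a 128-bit register, `lanes_for_width(128, 32)` gives the "four 32-bit operands" case from the figure.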

  5. Superword-Level Locality (SLL)
     • Definition: exploit data reuse in superword registers
     • The large-capacity register file is used as a compiler-controlled cache.
     • Differences from data reuse in caches
       – Eliminates memory access cycles completely
       – Storage has to be named explicitly
     • Differences from data reuse in scalar registers
       – Spatial reuse in superword registers
     [Figure: superword register files. Pentium 4: 8 registers of 128 bits; AltiVec: 32 registers of 128 bits; DIVA: 32 registers of 256 bits.]

  6. Unroll-and-jam
     • Unrolls outer loops and fuses the resulting inner loops together
     • Shortens the distance between reuses

     Original loop nest (reuse distance: 32 iterations)
         for (i = 1; i <= 32; i++)
           for (j = 0; j < 32; j++)
             A[i][j] = A[i-1][j] + B[j];

     Outer loop is unrolled (reuse distance: 32 iterations)
         for (i = 1; i <= 32; i += 2) {
           for (j = 0; j < 32; j++)
             A[i][j] = A[i-1][j] + B[j];
           for (j = 0; j < 32; j++)
             A[i+1][j] = A[i][j] + B[j];
         }

     Inner loops are fused together (reuse distance: 0 iterations)
         for (i = 1; i <= 32; i += 2)
           for (j = 0; j < 32; j++) {
             A[i][j]   = A[i-1][j] + B[j];
             A[i+1][j] = A[i][j] + B[j];
           }
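
The transformation can be checked by running both versions on the same data. The sketch below wraps the slide's two loop nests in functions; the function names, the array sizes as macros, and the checker `same_result` are additions for illustration.

```c
#include <assert.h>
#include <string.h>

#define ROWS 33   /* rows 0..32; row 0 holds the initial values */
#define COLS 32

/* Original loop nest from the slide. */
void original_nest(float A[ROWS][COLS], const float B[COLS]) {
    for (int i = 1; i <= 32; i++)
        for (int j = 0; j < COLS; j++)
            A[i][j] = A[i - 1][j] + B[j];
}

/* Outer loop unrolled by 2, then the two inner loops fused. */
void unroll_and_jam(float A[ROWS][COLS], const float B[COLS]) {
    assert((ROWS - 1) % 2 == 0);  /* unroll factor 2 must divide the trip count */
    for (int i = 1; i <= 32; i += 2)
        for (int j = 0; j < COLS; j++) {
            A[i][j]     = A[i - 1][j] + B[j];
            A[i + 1][j] = A[i][j] + B[j];  /* A[i][j] is reused immediately */
        }
}

/* Hypothetical checker: returns 1 if both versions produce identical arrays. */
int same_result(void) {
    static float A1[ROWS][COLS], A2[ROWS][COLS];
    float B[COLS];
    for (int j = 0; j < COLS; j++) {
        B[j] = (float)(j + 1);
        A1[0][j] = A2[0][j] = 0.5f * (float)j;
    }
    original_nest(A1, B);
    unroll_and_jam(A2, B);
    return memcmp(A1, A2, sizeof A1) == 0;
}
```

Fusing is legal here because `A[i+1][j]` depends only on `A[i][j]` in the same column, which the fused body computes one statement earlier.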

  7. Scalar vs. Superword Replacement
     • Identifies array references to the same memory address
     • Replaces array references with scalar/superword variables

     Original loop nest (after unroll-and-jam)
         for (i = 1; i <= 32; i += 2)
           for (j = 0; j < 32; j++) {
             A[i][j]   = A[i-1][j] + B[j];
             A[i+1][j] = A[i][j] + B[j];
           }

     Superword-level parallelization
         for (i = 1; i <= 32; i += 2)
           for (j = 0; j < 32; j += 4) {
             A[i][j:j+3]   = A[i-1][j:j+3] + B[j:j+3];
             A[i+1][j:j+3] = A[i][j:j+3] + B[j:j+3];
           }

     Scalar replacement
         for (i = 1; i <= 32; i += 2)
           for (j = 0; j < 32; j++) {
             T1 = B[j];
             T2 = A[i-1][j] + T1;
             A[i+1][j] = T2 + T1;
             A[i][j] = T2;
           }

     Superword replacement
         for (i = 1; i <= 32; i += 2)
           for (j = 0; j < 32; j += 4) {
             SV1 = B[j:j+3];
             SV2 = A[i-1][j:j+3] + SV1;
             A[i+1][j:j+3] = SV2 + SV1;
             A[i][j:j+3] = SV2;
           }

     Diagram labels: 4X (next to superword-level parallelization); 1.5X, 1.5X and 6X (between the top and bottom rows).
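
The scalar-replacement panel can be written as compilable C and compared against the unroll-and-jammed loop. This is a sketch: the function names and the checker are illustrative, and it assumes `A` and `B` do not overlap in memory. Counting array references in the inner loop body gives 6 before replacement and 4 after.

```c
#include <assert.h>
#include <string.h>

#define ROWS 33   /* rows 0..32; row 0 holds the initial values */
#define COLS 32

/* Unroll-and-jammed loop: 4 loads + 2 stores per inner iteration. */
void jammed(float A[ROWS][COLS], const float B[COLS]) {
    for (int i = 1; i <= 32; i += 2)
        for (int j = 0; j < COLS; j++) {
            A[i][j]     = A[i - 1][j] + B[j];
            A[i + 1][j] = A[i][j] + B[j];
        }
}

/* After scalar replacement: 2 loads + 2 stores per inner iteration.
 * Assumes A and B do not alias. */
void scalar_replaced(float A[ROWS][COLS], const float B[COLS]) {
    assert((ROWS - 1) % 2 == 0);       /* loop was unrolled by 2 */
    for (int i = 1; i <= 32; i += 2)
        for (int j = 0; j < COLS; j++) {
            float t1 = B[j];               /* B[j] loaded once, used twice */
            float t2 = A[i - 1][j] + t1;   /* value of A[i][j] kept in a scalar */
            A[i + 1][j] = t2 + t1;
            A[i][j] = t2;
        }
}

/* Hypothetical checker: returns 1 if both versions produce identical arrays. */
int scalar_replacement_matches(void) {
    static float A1[ROWS][COLS], A2[ROWS][COLS];
    float B[COLS];
    for (int j = 0; j < COLS; j++) {
        B[j] = (float)(j + 1);
        A1[0][j] = A2[0][j] = 0.5f * (float)j;
    }
    jammed(A1, B);
    scalar_replaced(A2, B);
    return memcmp(A1, A2, sizeof A1) == 0;
}
```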

  8. Putting it all together

     Original loop nest
         for (i = 1; i <= 32; i++)
           for (j = 0; j < 32; j++)
             A[i][j] = A[i-1][j] + B[j];

     Superword-level parallelization
         for (i = 1; i <= 32; i++)
           for (j = 0; j < 32; j += 4)
             A[i][j:j+3] = A[i-1][j:j+3] + B[j:j+3];

     Unroll-and-jam
         for (i = 1; i <= 32; i += 2)
           for (j = 0; j < 32; j += 4) {
             A[i][j:j+3]   = A[i-1][j:j+3] + B[j:j+3];
             A[i+1][j:j+3] = A[i][j:j+3] + B[j:j+3];
           }

     Superword replacement
         for (i = 1; i <= 32; i += 2)
           for (j = 0; j < 32; j += 4) {
             SV1 = B[j:j+3];
             SV2 = A[i-1][j:j+3] + SV1;
             A[i+1][j:j+3] = SV2 + SV1;
             A[i][j:j+3] = SV2;
           }
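
The final superword-replaced loop can be emulated in portable C by standing in a four-float struct for a superword register. Everything named `sw4`, `sw_load`, `sw_store` and `sw_add` is a hypothetical helper written for this sketch, not a real vector API; an actual compiler would emit AltiVec or SSE instructions instead.

```c
#include <assert.h>
#include <string.h>

#define ROWS 33
#define COLS 32
#define SWS  4   /* superword size: four floats per register */

typedef struct { float lane[SWS]; } sw4;  /* hypothetical superword register */

sw4 sw_load(const float *p) { sw4 v; memcpy(v.lane, p, sizeof v.lane); return v; }
void sw_store(float *p, sw4 v) { memcpy(p, v.lane, sizeof v.lane); }
sw4 sw_add(sw4 a, sw4 b) {
    sw4 r;
    for (int k = 0; k < SWS; k++) r.lane[k] = a.lane[k] + b.lane[k];
    return r;
}

/* Original loop nest, used as the reference. */
void original_nest(float A[ROWS][COLS], const float B[COLS]) {
    for (int i = 1; i <= 32; i++)
        for (int j = 0; j < COLS; j++)
            A[i][j] = A[i - 1][j] + B[j];
}

/* SLP + unroll-and-jam + superword replacement, as in the last panel. */
void superword_replaced(float A[ROWS][COLS], const float B[COLS]) {
    assert(COLS % SWS == 0);
    for (int i = 1; i <= 32; i += 2)
        for (int j = 0; j < COLS; j += SWS) {
            sw4 sv1 = sw_load(&B[j]);                      /* SV1 = B[j:j+3] */
            sw4 sv2 = sw_add(sw_load(&A[i - 1][j]), sv1);  /* SV2 = A[i-1][j:j+3] + SV1 */
            sw_store(&A[i + 1][j], sw_add(sv2, sv1));      /* A[i+1][j:j+3] = SV2 + SV1 */
            sw_store(&A[i][j], sv2);                       /* A[i][j:j+3] = SV2 */
        }
}

/* Hypothetical checker: returns 1 if the transformed loop matches the original. */
int superword_matches_original(void) {
    static float A1[ROWS][COLS], A2[ROWS][COLS];
    float B[COLS];
    for (int j = 0; j < COLS; j++) {
        B[j] = (float)(j + 1);
        A1[0][j] = A2[0][j] = 0.5f * (float)j;
    }
    original_nest(A1, B);
    superword_replaced(A2, B);
    return memcmp(A1, A2, sizeof A1) == 0;
}
```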

  9. What is required?
     • Unroll amount selection
     • Code generation

  10. Assumptions
      • Array subscript expressions are linear functions of loop index variables
      • No reuse of registers within an iteration of the transformed loop
        – Registers allocated for caching data are live throughout the loop body
      • No data reuse across iterations of the transformed loop
        – Only loop-independent reuse opportunities are exploited

  11. Unroll Amount Selection: Optimization Goal
      • Find unroll factors <X1, X2, …, Xn> for loops 1 to n
      • Maximize data reuse in superword registers exposed by unroll-and-jam
      • Constraint: the number of superword registers required does not exceed the number available.

  12. Reuse in Scalar vs. Superword Register

      Reuse           Scalar   Superword
      Self spatial    No       Yes
      Group spatial   No       Yes

      Self spatial
        Scalar:    for(i=0; i<N; i++) … A[i] …         register holds A[i]
        Superword: for(i=0; i<N; i+=4) … A[i:i+3] …    register holds A[i] A[i+1] A[i+2] A[i+3]
      Group spatial
        Scalar:    for(i=0; i<N; i++) … A[i], A[i+2] …   A[i] and A[i+2] occupy separate registers
        Superword: for(i=0; i<N; i++) … A[i], A[i+2] …   A[i] and A[i+2] share one superword register

  13. Register Requirement Analysis
      • Derives the number of superword registers required for a particular unroll amount and set of array references.
      • Example: A[i] when the i loop is unrolled by X
        – The unrolled references A[i+0], A[i+1], A[i+2], A[i+3], …, A[i+(X-2)], A[i+(X-1)] occupy consecutive addresses from low to high, four per superword.
        – ⌈X/4⌉ superword registers are required.

  14. Register Requirement Analysis (cont.)
      • For A[ai+b] and an unroll amount X

        Coefficient     Number of registers
        a = 0           1
        0 < a < SWS     ⌈aX/SWS⌉
        a ≥ SWS         X

      • SWS (SuperWord Size): number of data elements that fit in a superword register
      • The current implementation can also deal with

        Array references             Example
        Multiple index variables     A[ai+bj+c]
        Multi-dimensional arrays     A[ai+b][cj+d]
        Group of array references    A[ai+b1][cj+d1], A[ai+b2][cj+d2], …
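
The single-reference table translates directly into a small function. This is a minimal sketch for one reference A[ai+b] only: the name `regs_needed` is hypothetical, it assumes a non-negative coefficient, and it ignores alignment and the multi-index and grouped cases listed on the slide.

```c
#include <assert.h>

/* Superword registers needed for A[a*i + b] when the i loop is
 * unrolled by x, with sws elements per superword register.
 * Assumes a >= 0; alignment effects are not modeled. */
int regs_needed(int a, int x, int sws) {
    assert(a >= 0 && x >= 1 && sws >= 1);
    if (a == 0)
        return 1;                        /* same element in every unrolled copy */
    if (a < sws)
        return (a * x + sws - 1) / sws;  /* ceil(a*X / SWS) */
    return x;                            /* each copy lands in a different superword */
}
```

With `a = 1` and `sws = 4` this reduces to the ⌈X/4⌉ of the A[i] example, so `regs_needed(1, 8, 4)` is 2.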

  15. Unroll Amount Selection
      • Search for unroll amounts that maximize reuse in superword registers
      • Prune the search space
        – Exploit monotonicity in each dimension
        – Avoid register pressure
      [Figure: search space for FIR. Number of memory accesses (0.0E+00 to 3.5E+09) as a function of the i-loop and j-loop unroll amounts (1 to 31).]
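
A pruned search over two unroll factors might look like the sketch below. It is not the authors' implementation: `select_unroll`, the callback types, and the two toy models are all hypothetical. The only property it relies on is the one the slide names, that the register requirement never decreases as an unroll factor grows, so a scan can stop at the first factor that exceeds the register budget.

```c
#include <assert.h>

typedef struct { int xi, xj; } unroll_t;
typedef long (*cost_fn)(int xi, int xj);  /* estimated memory accesses */
typedef int  (*regs_fn)(int xi, int xj);  /* superword registers required */

/* Returns the feasible (xi, xj) with the fewest estimated memory accesses.
 * Falls back to (1, 1), i.e. no unrolling, if nothing better is feasible. */
unroll_t select_unroll(int max_i, int max_j, int avail, cost_fn cost, regs_fn regs) {
    assert(max_i >= 1 && max_j >= 1);
    unroll_t best = {1, 1};
    long best_cost = cost(1, 1);
    for (int xi = 1; xi <= max_i; xi++) {
        if (regs(xi, 1) > avail) break;       /* larger xi only needs more registers */
        for (int xj = 1; xj <= max_j; xj++) {
            if (regs(xi, xj) > avail) break;  /* larger xj only needs more registers */
            long c = cost(xi, xj);
            if (c < best_cost) { best_cost = c; best.xi = xi; best.xj = xj; }
        }
    }
    return best;
}

/* Toy models, made up for illustration only. */
int  toy_regs(int xi, int xj) { return xi * ((xj + 3) / 4) + 1; }
long toy_cost(int xi, int xj) { return 1024L / xi + 1024L / xj; }
```

With the toy models, unroll factors up to 8 and 8 available registers, `select_unroll(8, 8, 8, toy_cost, toy_regs)` settles on the pair (7, 4).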

  16. Code Generation Optimizations
      • Superword replacement
        – Exploit reuse opportunities
        – Temporal reuse: similar to scalar replacement
        – Spatial reuse: sliding windows such as FIR
        – Unaligned memory accesses
      • Packing in registers
        – Replaces packing through memory
        – Reduces scalar memory accesses
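
One plausible reading of the sliding-window and unaligned-access bullets is that an unaligned superword can be assembled from two aligned superwords already held in registers, rather than loaded again from memory. The sketch below illustrates that idea only; `sw4`, `sw_make` and `sw_window` are hypothetical helpers, and the slides do not show how the compiler actually generates this code.

```c
#include <assert.h>

#define SWS 4
typedef struct { float lane[SWS]; } sw4;  /* hypothetical superword register */

/* Builds a register from four values (helper for the example). */
sw4 sw_make(float a, float b, float c, float d) {
    sw4 v = {{a, b, c, d}};
    return v;
}

/* Returns lanes k..k+3 of the 8-lane sequence (lo, hi).
 * k = 0 yields lo, k = SWS yields hi, anything between is an
 * unaligned window spanning both registers. */
sw4 sw_window(sw4 lo, sw4 hi, int k) {
    assert(k >= 0 && k <= SWS);
    sw4 r;
    for (int n = 0; n < SWS; n++) {
        int idx = k + n;
        r.lane[n] = (idx < SWS) ? lo.lane[idx] : hi.lane[idx - SWS];
    }
    return r;
}
```

In an FIR-style loop, consecutive windows at offsets 0, 1, 2 and 3 could then all be formed from the same two registers.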

  17. Packing in Registers
      • In some cases, data must be packed into a superword register.
        – Alignment, non-unit stride array references
      • Packing through memory is expensive.
      • Packing in superword registers

      Packing through memory
          w = *((float *)&a + 0);
          x = *((float *)&b + 0);
          y = *((float *)&c + 0);
          z = *((float *)&d + 0);
          *((float *)&p + 0) = w;
          *((float *)&p + 1) = x;
          *((float *)&p + 2) = y;
          *((float *)&p + 3) = z;

      Packing in registers
          temp1 = replicate(a, 0);
          temp2 = replicate(b, 0);
          temp3 = replicate(c, 0);
          temp4 = replicate(d, 0);
          p = shift_and_load(p, temp1);
          p = shift_and_load(p, temp2);
          p = shift_and_load(p, temp3);
          p = shift_and_load(p, temp4);

      [Figure: replicate(a, 0) turns a[0] a[1] a[2] a[3] into a[0] a[0] a[0] a[0]. After p = shift_and_load(p, temp1), with temp1 = a[0] a[0] a[0] a[0], p holds a[0].]

  18. Packing in Registers (animation step 2)
      • Repeats the text and code of slide 17.
      [Figure: after p = shift_and_load(p, temp2), with temp2 = b[0] b[0] b[0] b[0], p holds a[0] b[0].]

  19. Packing in Registers (animation step 3)
      • Repeats the text and code of slide 17.
      [Figure: after p = shift_and_load(p, temp3), with temp3 = c[0] c[0] c[0] c[0], p holds a[0] b[0] c[0].]
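
The register-packing sequence can be emulated in plain C to see what ends up in `p`. The slides name `replicate` and `shift_and_load` but do not define them, so the semantics below are assumptions inferred from the diagrams: `replicate(v, k)` copies lane k into every lane, and `shift_and_load(p, t)` shifts `p` by one lane and fills the vacated lane from `t`. The type `sw4` and the helpers `sw_make` and `pack_first_lanes` are hypothetical.

```c
#include <assert.h>

#define SWS 4
typedef struct { float lane[SWS]; } sw4;  /* hypothetical superword register */

/* Builds a register from four values (helper for the example). */
sw4 sw_make(float a, float b, float c, float d) {
    sw4 v = {{a, b, c, d}};
    return v;
}

/* Assumed semantics: copy lane k of v into all lanes. */
sw4 replicate(sw4 v, int k) {
    assert(k >= 0 && k < SWS);
    sw4 r;
    for (int n = 0; n < SWS; n++) r.lane[n] = v.lane[k];
    return r;
}

/* Assumed semantics: shift p toward lane 0 by one lane and
 * fill the last lane with one element of t. */
sw4 shift_and_load(sw4 p, sw4 t) {
    sw4 r;
    for (int n = 0; n + 1 < SWS; n++) r.lane[n] = p.lane[n + 1];
    r.lane[SWS - 1] = t.lane[0];
    return r;
}

/* Packs lane 0 of a, b, c and d into one register without touching memory. */
sw4 pack_first_lanes(sw4 a, sw4 b, sw4 c, sw4 d) {
    sw4 p = {{0}};
    p = shift_and_load(p, replicate(a, 0));
    p = shift_and_load(p, replicate(b, 0));
    p = shift_and_load(p, replicate(c, 0));
    p = shift_and_load(p, replicate(d, 0));
    return p;  /* lanes: a[0], b[0], c[0], d[0] */
}
```

After the four shifts every original lane of `p` has been pushed out, so the initial contents of `p` do not matter.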
