Jaewook Shin, Jacqueline Chame and Mary Hall
PACT'02, September 23, 2002
University of Southern California (USC)

Motivation
• Multimedia applications are becoming increasingly important.
• Multimedia extension architectures
  – Intel SSE, Motorola AltiVec, …
• New compiler technology for new optimization goals
  – Exploit fine-grain parallelism supported by the architecture
  – Exploit reuse of data in the large register files

Overview
1. Motivation
2. Background
   • Unroll-and-jam
   • Scalar replacement
3. Algorithm
   • Unroll amount selection for unroll-and-jam
   • Register requirement analysis
   • Superword replacement
   • Packing in registers
4. Experiments
   • Reduction in dynamic memory accesses
   • Speedup
5. Conclusion

Superword-Level Parallelism (SLP)
• Definition: fine-grain parallelism in aggregate data objects larger than a machine word
• Architectural features include:
  – Variable-sized data fields
  – Support to rearrange data fields
  – Superword register file
[Figure: example AltiVec superword register file, SR0 to SR31, each register 128 bits wide and holding sixteen 8-bit, eight 16-bit, or four 32-bit operands]

Superword-Level Locality (SLL)
• Definition: exploit data reuse in superword registers
• The large-capacity register file is used as a compiler-controlled cache.
• Differences from data reuse in caches
  – Eliminates memory access cycles completely
  – Storage has to be named explicitly
• Differences from data reuse in scalar registers
  – Spatial reuse in superword registers
[Figure: superword register files: Pentium 4 (8 registers of 128 bits), AltiVec (32 registers of 128 bits), DIVA (32 registers of 256 bits)]

Unroll-and-jam
• Unrolls outer loops and fuses the resulting inner loops together
• Shortens the distance between reuse
Original loop nest (reuse distance: 32 iterations)
    for(i=1;i<=32;i++)
      for(j=0;j<32;j++)
        A[i][j] = A[i-1][j] + B[j];
Outer loop is unrolled (reuse distance: 32 iterations)
    for(i=1;i<=32;i+=2) {
      for(j=0;j<32;j++)
        A[i][j] = A[i-1][j] + B[j];
      for(j=0;j<32;j++)
        A[i+1][j] = A[i][j] + B[j];
    }
Inner loops are fused together (reuse distance: 0 iterations)
    for(i=1;i<=32;i+=2)
      for(j=0;j<32;j++) {
        A[i][j] = A[i-1][j] + B[j];
        A[i+1][j] = A[i][j] + B[j];
      }
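The transformation on this slide can be checked with a small C sketch. It assumes a 33-row array so that row 0 is a valid input row; the helper name `unroll_and_jam_matches` and the integer-valued test data are illustrative and not part of the talk.

```c
#include <assert.h>
#include <string.h>

#define N 32

/* Original loop nest from the slide: row i is computed from row i-1. */
static void original(float A[N + 1][N], const float B[N]) {
    for (int i = 1; i <= N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = A[i - 1][j] + B[j];
}

/* Outer loop unrolled by 2 and the two inner loops fused (jammed). */
static void unroll_and_jam(float A[N + 1][N], const float B[N]) {
    assert(N % 2 == 0); /* an unroll factor of 2 needs an even trip count */
    for (int i = 1; i <= N; i += 2)
        for (int j = 0; j < N; j++) {
            A[i][j]     = A[i - 1][j] + B[j];
            A[i + 1][j] = A[i][j] + B[j]; /* reuses the value just written */
        }
}

/* Hypothetical helper: runs both versions on the same input and
 * returns 1 if every element of the result is identical. */
int unroll_and_jam_matches(void) {
    static float A1[N + 1][N], A2[N + 1][N];
    float B[N];
    memset(A1, 0, sizeof A1);
    memset(A2, 0, sizeof A2);
    for (int j = 0; j < N; j++) {
        B[j] = (float)(j + 1);
        A1[0][j] = A2[0][j] = (float)j; /* small integers stay exact in float */
    }
    original(A1, B);
    unroll_and_jam(A2, B);
    return memcmp(A1, A2, sizeof A1) == 0;
}
```

Fusing is legal here because iteration (i+1, j) only needs the value produced at (i, j), which the jammed body computes one statement earlier.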

Scalar vs. Superword Replacement
• Identifies array references to the same memory address
• Replaces array references with scalar/superword variables
Original loop nest
    for(i=1; i<=32; i+=2)
      for(j=0; j<32; j++) {
        A[i][j] = A[i-1][j] + B[j];
        A[i+1][j] = A[i][j] + B[j];
      }
Superword-level parallelization
    for(i=1; i<=32; i+=2)
      for(j=0; j<32; j+=4) {
        A[i][j:j+3] = A[i-1][j:j+3] + B[j:j+3];
        A[i+1][j:j+3] = A[i][j:j+3] + B[j:j+3];
      }
Scalar replacement
    for(i=1; i<=32; i+=2)
      for(j=0; j<32; j++) {
        T1 = B[j];
        T2 = A[i-1][j] + T1;
        A[i+1][j] = T2 + T1;
        A[i][j] = T2;
      }
Superword replacement
    for(i=1; i<=32; i+=2)
      for(j=0; j<32; j+=4) {
        SV1 = B[j:j+3];
        SV2 = A[i-1][j:j+3] + SV1;
        A[i+1][j:j+3] = SV2 + SV1;
        A[i][j:j+3] = SV2;
      }
[Figure: arrows between the four versions are labeled 4X (original to superword-level parallelization), 1.5X (each replacement step), and 6X (original to superword replacement)]
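A minimal C sketch of the scalar replacement step follows. The counters tally array references as written in the source, which is an assumption about how accesses are counted and not what any particular compiler would emit; `counts_t` and `scalar_replacement_check` are hypothetical names.

```c
#include <assert.h>
#include <string.h>

#define N 32

typedef struct { long loads, stores; } counts_t; /* source-level reference counts */

/* Unroll-and-jammed loop nest before replacement. */
static counts_t jammed(float A[N + 1][N], const float B[N]) {
    counts_t c = {0, 0};
    assert(N % 2 == 0);
    for (int i = 1; i <= N; i += 2)
        for (int j = 0; j < N; j++) {
            A[i][j]     = A[i - 1][j] + B[j]; c.loads += 2; c.stores += 1;
            A[i + 1][j] = A[i][j] + B[j];     c.loads += 2; c.stores += 1;
        }
    return c;
}

/* Same loop nest after scalar replacement: B[j] and A[i][j] live in t1, t2. */
static counts_t scalar_replaced(float A[N + 1][N], const float B[N]) {
    counts_t c = {0, 0};
    assert(N % 2 == 0);
    for (int i = 1; i <= N; i += 2)
        for (int j = 0; j < N; j++) {
            float t1 = B[j];                  c.loads += 1;
            float t2 = A[i - 1][j] + t1;      c.loads += 1;
            A[i + 1][j] = t2 + t1;            c.stores += 1;
            A[i][j] = t2;                     c.stores += 1;
        }
    return c;
}

/* Hypothetical helper: returns 1 if both versions give identical arrays,
 * and reports the total reference counts of each version. */
int scalar_replacement_check(long *before, long *after) {
    static float A1[N + 1][N], A2[N + 1][N];
    float B[N];
    memset(A1, 0, sizeof A1);
    memset(A2, 0, sizeof A2);
    for (int j = 0; j < N; j++) {
        B[j] = (float)(j + 1);
        A1[0][j] = A2[0][j] = (float)j;
    }
    counts_t c1 = jammed(A1, B);
    counts_t c2 = scalar_replaced(A2, B);
    *before = c1.loads + c1.stores;
    *after  = c2.loads + c2.stores;
    return memcmp(A1, A2, sizeof A1) == 0;
}
```

Under this counting, the loop body drops from 6 references to 4 (3072 to 2048 over the whole nest), a ratio of 1.5, which matches the 1.5X label on the slide if that label counts memory accesses.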

Putting it all together
Original loop nest
    for(i=1;i<=32;i++)
      for(j=0;j<32;j++)
        A[i][j] = A[i-1][j] + B[j];
Superword-level parallelization
    for(i=1; i<=32; i++)
      for(j=0; j<32; j+=4)
        A[i][j:j+3] = A[i-1][j:j+3] + B[j:j+3];
Unroll-and-jam
    for(i=1; i<=32; i+=2)
      for(j=0; j<32; j+=4) {
        A[i][j:j+3] = A[i-1][j:j+3] + B[j:j+3];
        A[i+1][j:j+3] = A[i][j:j+3] + B[j:j+3];
      }
Superword replacement
    for(i=1; i<=32; i+=2)
      for(j=0; j<32; j+=4) {
        SV1 = B[j:j+3];
        SV2 = A[i-1][j:j+3] + SV1;
        A[i+1][j:j+3] = SV2 + SV1;
        A[i][j:j+3] = SV2;
      }
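The final version on this slide can be emulated in portable C. In the sketch below, `sw_t` is a hypothetical four-float struct standing in for a superword register, and `sw_load`, `sw_store`, and `sw_add` are illustrative helpers, not real SSE or AltiVec intrinsics.

```c
#include <assert.h>
#include <string.h>

#define N 32
#define SWS 4 /* elements per superword, assumed: four 32-bit floats */

typedef struct { float e[SWS]; } sw_t; /* stand-in for a superword register */

static sw_t sw_load(const float *p) { sw_t v; memcpy(v.e, p, sizeof v.e); return v; }
static void sw_store(float *p, sw_t v) { memcpy(p, v.e, sizeof v.e); }
static sw_t sw_add(sw_t a, sw_t b) {
    for (int k = 0; k < SWS; k++) a.e[k] += b.e[k];
    return a;
}

/* Original scalar loop nest. */
static void original(float A[N + 1][N], const float B[N]) {
    for (int i = 1; i <= N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = A[i - 1][j] + B[j];
}

/* SLP on j, unroll-and-jam on i by 2, then superword replacement. */
static void superword_replaced(float A[N + 1][N], const float B[N]) {
    assert(N % 2 == 0 && N % SWS == 0);
    for (int i = 1; i <= N; i += 2)
        for (int j = 0; j < N; j += SWS) {
            sw_t sv1 = sw_load(&B[j]);                       /* B[j:j+3]      */
            sw_t sv2 = sw_add(sw_load(&A[i - 1][j]), sv1);   /* row i values  */
            sw_store(&A[i + 1][j], sw_add(sv2, sv1));        /* row i+1       */
            sw_store(&A[i][j], sv2);                         /* row i         */
        }
}

/* Hypothetical helper: returns 1 if both versions give identical arrays. */
int superword_replacement_matches(void) {
    static float A1[N + 1][N], A2[N + 1][N];
    float B[N];
    memset(A1, 0, sizeof A1);
    memset(A2, 0, sizeof A2);
    for (int j = 0; j < N; j++) {
        B[j] = (float)(j + 1);
        A1[0][j] = A2[0][j] = (float)j;
    }
    original(A1, B);
    superword_replaced(A2, B);
    return memcmp(A1, A2, sizeof A1) == 0;
}
```

Each inner iteration now issues two superword loads and two superword stores for eight scalar results, where the original nest referenced memory three times per result.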

What is required?
• Unroll amount selection
• Code generation

Assumptions
• Array subscript expressions are linear functions of loop index variables
• No reuse of registers within an iteration of the transformed loop
  – Registers allocated for caching data are live throughout the loop body
• No data reuse across iterations of the transformed loop
  – Only loop-independent reuse opportunities are exploited

Unroll Amount Selection: Optimization Goal
• Find unroll factors <X1, X2, …, Xn> for loops 1 to n
• Maximize data reuse in superword registers exposed by unroll-and-jam
• Constraint: the number of superword registers required does not exceed the number available.

Reuse in Scalar vs. Superword Register
Reuse         | Scalar                                   | Superword
Self spatial  | No:  for(i=0; i<N; i++)  A[i]            | Yes: for(i=0; i<N; i+=4)  A[i:i+3]
Group spatial | No:  for(i=0; i<N; i++)  A[i], A[i+2]    | Yes: for(i=0; i<N; i++)  A[i], A[i+2]
[Figure: a scalar register holds only A[i]; a superword register holds A[i], A[i+1], A[i+2], A[i+3], so A[i] and A[i+2] can share one superword]

Register Requirement Analysis
• Derives the number of superword registers required for a particular unroll amount and set of array references.
• Example: A[i] when the i loop is unrolled by X
[Figure: references A[i+0], A[i+1], A[i+2], A[i+3], …, A[i+(X-2)], A[i+(X-1)] laid out from low address to high address and grouped into superwords]
⌈X/4⌉ superword registers are required!

Register Requirement Analysis (cont.)
• For A[ai+b] and an unroll amount X
Coefficient | Number of registers
a = 0       | 1
a < SWS     | ⌈aX/SWS⌉
a ≥ SWS     | X
• SWS (SuperWord Size): number of data elements that fit in a superword register
• The current implementation can also deal with
Array references          | Example
Multiple index variables  | A[ai+bj+c]
Multi-dimensional arrays  | A[ai+b][cj+d]
Group of array references | A[ai+b1][cj+d1], A[ai+b2][cj+d2], …
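The three-case table for a single reference A[ai+b] can be written as a short C function. This is a sketch that reads the bracketed aX/SWS entry as a ceiling and assumes a non-negative coefficient; the name `regs_required` is illustrative.

```c
#include <assert.h>

/* Superword registers needed to cache A[a*i + b] when the i loop is
 * unrolled by x, following the three cases in the slide's table.
 *   a   : coefficient of the unrolled loop index (assumed >= 0)
 *   x   : unroll amount
 *   sws : number of data elements per superword register */
int regs_required(int a, int x, int sws) {
    assert(a >= 0 && x >= 1 && sws >= 1);
    if (a == 0)
        return 1;                   /* same superword in every unrolled copy */
    if (a >= sws)
        return x;                   /* every copy touches a different superword */
    return (a * x + sws - 1) / sws; /* ceil(a*x / sws) */
}
```

For example, with SWS = 4, A[i] unrolled by 8 needs 2 registers, while A[4i] unrolled by 3 needs 3 because consecutive copies are a full superword apart.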

Unroll Amount Selection
• Search for unroll amounts that maximize reuse in superword registers
• Prune the search space
  – Exploit monotonicity at each dimension
  – Avoid register pressure
[Figure: search space for FIR: number of memory accesses (0.0E+00 to 3.5E+09) as a function of the i-loop and j-loop unroll amounts (1 to 31)]
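A pruned search of this kind might look like the following C sketch. The cost model is an assumption made for illustration, not the one from the talk: it supposes an FIR kernel of the form out[i] += in[i+j] * coef[j], ignores alignment, and charges one load per cached superword plus one store per output superword. All names are hypothetical.

```c
#include <assert.h>

typedef struct {
    int xi, xj;   /* chosen unroll amounts, (0, 0) if nothing fits */
    int regs;     /* superword registers needed by the choice */
    int accesses; /* modeled superword memory accesses per unrolled body */
} choice_t;

static int ceil_div(int n, int d) { return (n + d - 1) / d; }

/* Assumed model: out[i] spans xi elements, coef[j] spans xj elements,
 * and in[i+j] spans xi + xj - 1 elements. */
static int regs_needed(int xi, int xj, int sws) {
    return ceil_div(xi, sws) + ceil_div(xj, sws) + ceil_div(xi + xj - 1, sws);
}

static int accesses(int xi, int xj, int sws) {
    return regs_needed(xi, xj, sws) + ceil_div(xi, sws); /* loads + out stores */
}

/* Picks the (xi, xj) with the fewest modeled accesses per original
 * iteration, subject to regs_needed <= budget. */
choice_t select_unroll(int max_unroll, int budget, int sws) {
    assert(max_unroll >= 1 && sws >= 1);
    choice_t best = {0, 0, 0, 0};
    for (int xi = 1; xi <= max_unroll; xi++) {
        /* Register need never shrinks as xi grows, so stop at the first miss. */
        if (regs_needed(xi, 1, sws) > budget) break;
        for (int xj = 1; xj <= max_unroll; xj++) {
            int r = regs_needed(xi, xj, sws);
            if (r > budget) break; /* same monotonicity argument for xj */
            int acc = accesses(xi, xj, sws);
            /* Compare acc/(xi*xj) with best.accesses/(best.xi*best.xj)
             * by cross-multiplying to stay in integers. */
            if (best.xi == 0 ||
                (long)acc * best.xi * best.xj < (long)best.accesses * xi * xj)
                best = (choice_t){xi, xj, r, acc};
        }
    }
    return best;
}
```

With a budget of 3 registers and SWS = 4, this model settles on Xi = 2 and Xj = 3: four superword accesses cover six original iterations, and any larger unroll amount would need a fourth register.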

Code Generation Optimizations
• Superword replacement
  – Exploit reuse opportunities
  – Temporal reuse: similar to scalar replacement
  – Spatial reuse: sliding windows such as FIR
  – Unaligned memory accesses
• Packing in registers
  – Replaces packing through memory
  – Reduces scalar memory accesses

Packing in Registers
• In some cases, data must be packed into a superword register.
  – Alignment, non-unit stride array references
• Packing through memory is expensive.
• Packing in superword registers
Packing through memory
    w = *((float *)&a + 0);
    x = *((float *)&b + 0);
    y = *((float *)&c + 0);
    z = *((float *)&d + 0);
    *((float *)&p + 0) = w;
    *((float *)&p + 1) = x;
    *((float *)&p + 2) = y;
    *((float *)&p + 3) = z;
Packing in registers
    temp1 = replicate(a, 0);
    temp2 = replicate(b, 0);
    temp3 = replicate(c, 0);
    temp4 = replicate(d, 0);
    p = shift_and_load(p, temp1);
    p = shift_and_load(p, temp2);
    p = shift_and_load(p, temp3);
    p = shift_and_load(p, temp4);
[Figure: replicate(a, 0) turns a[0] a[1] a[2] a[3] into a[0] a[0] a[0] a[0]; p = shift_and_load(p, temp1) then places a[0] into p]

Packing in Registers (cont.)
[Figure: p = shift_and_load(p, temp2), where temp2 holds b[0] b[0] b[0] b[0]; p now holds a[0] b[0]]

Packing in Registers (cont.)
[Figure: p = shift_and_load(p, temp3), where temp3 holds c[0] c[0] c[0] c[0]; p now holds a[0] b[0] c[0]]
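The register-packing sequence can be emulated in plain C to see what ends up in `p`. The slides do not spell out the exact semantics of `replicate` and `shift_and_load`, so the sketch below assumes that `shift_and_load` shifts `p` by one field and fills the vacated field from its second operand; `sw_t` and the helper names are hypothetical stand-ins for a superword register and its instructions.

```c
#include <assert.h>

#define SWS 4 /* assumed: four 32-bit floats per superword */

typedef struct { float e[SWS]; } sw_t; /* stand-in for a superword register */

/* replicate(v, k): copy field k of v into every field of the result. */
static sw_t replicate(sw_t v, int k) {
    assert(k >= 0 && k < SWS);
    sw_t r;
    for (int n = 0; n < SWS; n++) r.e[n] = v.e[k];
    return r;
}

/* Assumed semantics: shift p down by one field and take the new
 * highest field from t. No memory traffic is involved. */
static sw_t shift_and_load(sw_t p, sw_t t) {
    for (int n = 0; n + 1 < SWS; n++) p.e[n] = p.e[n + 1];
    p.e[SWS - 1] = t.e[SWS - 1];
    return p;
}

/* Packs a[0], b[0], c[0], d[0] into one superword, as on the slide. */
static sw_t pack_first_elements(sw_t a, sw_t b, sw_t c, sw_t d) {
    sw_t p = {{0}};
    p = shift_and_load(p, replicate(a, 0));
    p = shift_and_load(p, replicate(b, 0));
    p = shift_and_load(p, replicate(c, 0));
    p = shift_and_load(p, replicate(d, 0));
    return p;
}

/* Hypothetical helper: returns 1 if the packed result is {a0, b0, c0, d0}. */
int packing_demo_ok(void) {
    sw_t a = {{1, 2, 3, 4}}, b = {{5, 6, 7, 8}};
    sw_t c = {{9, 10, 11, 12}}, d = {{13, 14, 15, 16}};
    sw_t p = pack_first_elements(a, b, c, d);
    return p.e[0] == 1 && p.e[1] == 5 && p.e[2] == 9 && p.e[3] == 13;
}
```

After four steps the first field of each source register sits in consecutive fields of `p`, the same result as the eight scalar loads and stores of the through-memory version.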
