A Note on the Performance Distribution of Affine Schedules



  1. A Note on the Performance Distribution of Affine Schedules
     Louis-Noël Pouchet (1), Cédric Bastoul (1), John Cavazos (2) and Albert Cohen (1)
     (1) ALCHEMY, INRIA Futurs / University of Paris-Sud XI, France
     (2) Computer and Information Sciences, University of Delaware, USA
     January 27, 2008
     2nd Workshop on Statistical and Machine learning approaches to ARchitectures and compilaTion, Göteborg, Sweden

  2. Outline (SMART’08)
     Motivation
     ◮ Automatic performance portability: iterative compilation
     ◮ Search space expressiveness → bring the iterative optimization problem into the polyhedral model
     ◮ Trade-off between expressiveness and ease of traversal
     ◮ Improve static characterization of the search space
     ◮ Highlight dynamic properties
     ◮ Validate a dedicated heuristic to traverse the space

  3. Building the Search Space: The Model
     Original Schedule

     Original code (the original schedule regenerates the same loop nest):

         for (i = 0; i < n; ++i)
           for (j = 0; j < n; ++j) {
             S1: C[i][j] = 0;
             for (k = 0; k < n; ++k)
               S2: C[i][j] += A[i][k] * B[k][j];
           }

     Schedules:
     $$\Theta^{S_1} \cdot \vec{x}_{S_1} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix}$$
     $$\Theta^{S_2} \cdot \vec{x}_{S_2} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix}$$

     ◮ Represent Static Control Parts (control flow and dependences must be statically computable)
     ◮ Use a code generator (e.g. CLooG) to generate C code from the polyhedral representation (provided iteration domains + schedules)
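As a minimal sketch of what the schedule matrices on this slide compute, the C fragment below multiplies a schedule matrix by the extended iteration vector to obtain the logical timestamp of one statement instance. The helper names (`apply_schedule`, `timestamp_s1`, `timestamp_s2`) and the row-major storage are assumptions made for illustration; they do not come from the slides or from CLooG.

```c
#include <assert.h>

/* Compute the timestamp t = theta * x, where theta is a rows x cols
   schedule matrix stored row-major and x is the iteration vector
   extended with the parameter n and the constant 1.
   (Hypothetical helper, not part of any polyhedral library.) */
static void apply_schedule(const int *theta, int rows, int cols,
                           const int *x, int *t)
{
    assert(rows > 0 && cols > 0);
    for (int r = 0; r < rows; ++r) {
        t[r] = 0;
        for (int c = 0; c < cols; ++c)
            t[r] += theta[r * cols + c] * x[c];
    }
}

/* Original schedules as written on the slide. */
static const int THETA_S1[2 * 4] = { 1, 0, 0, 0,
                                     0, 1, 0, 0 };
static const int THETA_S2[3 * 5] = { 1, 0, 0, 0, 0,
                                     0, 1, 0, 0, 0,
                                     0, 0, 1, 0, 0 };

/* Timestamp of the S1 instance (i, j): x = (i, j, n, 1). */
static void timestamp_s1(int i, int j, int n, int t[2])
{
    const int x[4] = { i, j, n, 1 };
    apply_schedule(THETA_S1, 2, 4, x, t);
}

/* Timestamp of the S2 instance (i, j, k): x = (i, j, k, n, 1). */
static void timestamp_s2(int i, int j, int k, int n, int t[3])
{
    const int x[5] = { i, j, k, n, 1 };
    apply_schedule(THETA_S2, 3, 5, x, t);
}
```

Under the original schedule each instance is simply stamped with its own loop counters, so `timestamp_s2(2, 3, 4, 10, t)` fills `t` with (2, 3, 4). Changing the matrix entries changes the timestamps, and therefore the execution order that a code generator would reproduce.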


  6. Building the Search Space: The Model
     Distribute loops

     Transformed code:

         for (i = 0; i < n; ++i)
           for (j = 0; j < n; ++j)
             C[i][j] = 0;
         for (i = n; i < 2*n; ++i)
           for (j = 0; j < n; ++j)
             for (k = 0; k < n; ++k)
               C[i-n][j] += A[i-n][k] * B[k][j];

     Schedules:
     $$\Theta^{S_1} \cdot \vec{x}_{S_1} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix}$$
     $$\Theta^{S_2} \cdot \vec{x}_{S_2} = \begin{pmatrix} 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix}$$

     ◮ All instances of S1 are executed before the first S2 instance

  7. Building the Search Space: The Model
     Distribute loops + Interchange loops for S2

     Transformed code:

         for (i = 0; i < n; ++i)
           for (j = 0; j < n; ++j)
             C[i][j] = 0;
         for (k = n; k < 2*n; ++k)
           for (j = 0; j < n; ++j)
             for (i = 0; i < n; ++i)
               C[i][j] += A[i][k-n] * B[k-n][j];

     Schedules:
     $$\Theta^{S_1} \cdot \vec{x}_{S_1} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix}$$
     $$\Theta^{S_2} \cdot \vec{x}_{S_2} = \begin{pmatrix} 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix}$$

     ◮ The outer-most loop for S2 becomes k

  8. Building the Search Space: The Model
     Illegal schedule

     Transformed code:

         for (k = 0; k < n; ++k)
           for (j = 0; j < n; ++j)
             for (i = 0; i < n; ++i)
               C[i][j] += A[i][k] * B[k][j];
         for (i = n; i < 2*n; ++i)
           for (j = 0; j < n; ++j)
             C[i-n][j] = 0;

     Schedules:
     $$\Theta^{S_1} \cdot \vec{x}_{S_1} = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix}$$
     $$\Theta^{S_2} \cdot \vec{x}_{S_2} = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix}$$

     ◮ All instances of S1 are executed after the last S2 instance
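The difference between the "Distribute loops" schedule and the illegal one can be checked by brute force for a small, fixed n. The sketch below is an illustration only, not the method used by the authors: it checks just one dependence (the initialization S1(i, j) must run before every update S2(i, j, k) of the same C[i][j]), it compares timestamps on the two dimensions both statements share, and it treats a tie on those dimensions as "not ordered". All function and array names are hypothetical.

```c
#include <assert.h>

/* t = theta * x for a rows x cols row-major schedule matrix. */
static void apply_schedule(const int *theta, int rows, int cols,
                           const int *x, int *t)
{
    assert(rows > 0 && cols > 0);
    for (int r = 0; r < rows; ++r) {
        t[r] = 0;
        for (int c = 0; c < cols; ++c)
            t[r] += theta[r * cols + c] * x[c];
    }
}

/* Schedules from the "Distribute loops" slide. */
static const int DIST_S1[2 * 4] = { 1, 0, 0, 0,   0, 1, 0, 0 };
static const int DIST_S2[3 * 5] = { 1, 0, 0, 1, 0,
                                    0, 1, 0, 0, 0,
                                    0, 0, 1, 0, 0 };

/* Schedules from the "Illegal schedule" slide. */
static const int BAD_S1[2 * 4] = { 1, 0, 1, 0,   0, 1, 0, 0 };
static const int BAD_S2[3 * 5] = { 0, 0, 1, 0, 0,
                                   0, 1, 0, 0, 0,
                                   1, 0, 0, 0, 0 };

/* Return 1 if, for this n, every S1(i, j) is stamped strictly before
   every S2(i, j, k) on the two shared schedule dimensions, else 0.
   Assumption: only the S1 -> S2 dependence on C[i][j] is checked. */
static int s1_precedes_s2(const int *theta_s1, const int *theta_s2, int n)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k) {
                const int x1[4] = { i, j, n, 1 };
                const int x2[5] = { i, j, k, n, 1 };
                int t1[2], t2[3];
                apply_schedule(theta_s1, 2, 4, x1, t1);
                apply_schedule(theta_s2, 3, 5, x2, t2);
                /* Lexicographic comparison on the first two dimensions. */
                int before = t1[0] < t2[0] ||
                             (t1[0] == t2[0] && t1[1] < t2[1]);
                if (!before)
                    return 0;
            }
    return 1;
}
```

With n = 4, `s1_precedes_s2(DIST_S1, DIST_S2, 4)` returns 1 and `s1_precedes_s2(BAD_S1, BAD_S2, 4)` returns 0. A real polyhedral tool would establish this for all n symbolically rather than by enumeration.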

  9. Building the Search Space: The Model
     A legal schedule

     Transformed code:

         for (i = n; i < 2*n; ++i)
           for (j = 0; j < n; ++j)
             C[i-n][j] = 0;
         for (k = n+1; k <= 2*n; ++k)
           for (j = 0; j < n; ++j)
             for (i = 0; i < n; ++i)
               C[i][j] += A[i][k-n-1] * B[k-n-1][j];

     Schedules:
     $$\Theta^{S_1} \cdot \vec{x}_{S_1} = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix}$$
     $$\Theta^{S_2} \cdot \vec{x}_{S_2} = \begin{pmatrix} 0 & 0 & 1 & 1 & 1 \\ 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix}$$

     ◮ Delay the S2 instances
     ◮ Constraints must be expressed between $\Theta^{S_1}$ and $\Theta^{S_2}$

  10. Building the Search Space: The Model
      Implicit fine-grain parallelism

      Transformed code:

          for (i = 0; i < n; ++i)
            pfor (j = 0; j < n; ++j)
              C[i][j] = 0;
          for (k = n; k < 2*n; ++k)
            pfor (j = 0; j < n; ++j)
              pfor (i = 0; i < n; ++i)
                C[i][j] += A[i][k-n] * B[k-n][j];

      Schedules:
      $$\Theta^{S_1} \cdot \vec{x}_{S_1} = \begin{pmatrix} 1 & 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix}$$
      $$\Theta^{S_2} \cdot \vec{x}_{S_2} = \begin{pmatrix} 0 & 0 & 1 & 1 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix}$$

      ◮ Number of rows of Θ ↔ number of outer-most sequential loops
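One way to read the `pfor` loops on this slide is as loops whose iterations may run in any order. The sketch below maps them onto OpenMP's `parallel for`; this mapping is an assumption made for illustration, since the slides do not say how `pfor` is implemented. The `PFOR` macro and the function name `matmul_pfor` are hypothetical, and matrices are stored row-major in flat arrays. Without OpenMP enabled, the macro expands to nothing and the code runs sequentially.

```c
#include <assert.h>

/* Expand to an OpenMP work-sharing directive when available,
   and to nothing otherwise (assumed stand-in for the slide's pfor). */
#ifdef _OPENMP
#define PFOR _Pragma("omp parallel for")
#else
#define PFOR
#endif

/* C = A * B for n x n row-major matrices, following the loop
   structure of the one-row schedules on the slide. */
static void matmul_pfor(int n, double *C, const double *A, const double *B)
{
    assert(n > 0);

    /* S1: the single schedule row is i, so only i is sequential. */
    for (int i = 0; i < n; ++i) {
        PFOR
        for (int j = 0; j < n; ++j)
            C[i * n + j] = 0.0;
    }

    /* S2: the single schedule row is k + n, so only k is sequential.
       For a fixed k, every (i, j) pair touches a distinct C[i][j]. */
    for (int k = n; k < 2 * n; ++k) {
        PFOR
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i)
                C[i * n + j] += A[i * n + (k - n)] * B[(k - n) * n + j];
    }
}
```

Only the `j` loop carries the directive in the second nest; the inner `i` loop is also free of dependences, so a runtime could parallelize it as well.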

  12. Building the Search Space: The Model
      Representing a schedule

      Per-statement schedules:
      $$\Theta^{S_1} \cdot \vec{x}_{S_1} = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix}$$
      $$\Theta^{S_2} \cdot \vec{x}_{S_2} = \begin{pmatrix} 0 & 0 & 1 & 1 & 1 \\ 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix}$$

      Both schedules as a single matrix, with columns grouped into iterator ($\vec{\imath}$), parameter ($\vec{p}$) and constant ($c$) coefficients:
      $$\Theta \cdot \vec{x} = \left(\begin{array}{ccccc|cc|cc} 1 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \end{array}\right) \cdot \begin{pmatrix} i & j & i & j & k & n & n & 1 & 1 \end{pmatrix}^T$$
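The packing of the two per-statement schedules into one matrix can be sketched as follows. The column layout (S1 iterators, S2 iterators, then the n coefficient of each statement, then the constant of each statement) is read off the vector (i j i j k n n 1 1) on the slide; the function name `pack_schedules` and the fixed array sizes are assumptions that only fit this two-statement example.

```c
#include <assert.h>

/* Per-statement schedules from the slide: columns are (i, j, n, 1)
   for S1 and (i, j, k, n, 1) for S2. */
static const int LEGAL_S1[2][4] = { { 1, 0, 1, 0 },
                                    { 0, 1, 0, 0 } };
static const int LEGAL_S2[3][5] = { { 0, 0, 1, 1, 1 },
                                    { 0, 1, 0, 0, 0 },
                                    { 1, 0, 0, 0, 0 } };

/* Pack both schedules into one 3 x 9 matrix with columns
   [ i_S1 j_S1 | i_S2 j_S2 k_S2 | n_S1 n_S2 | 1_S1 1_S2 ].
   S1 has only two rows, so its third row is filled with zeros. */
static void pack_schedules(const int s1[2][4], const int s2[3][5],
                           int theta[3][9])
{
    for (int r = 0; r < 3; ++r) {
        int has_s1 = r < 2;
        theta[r][0] = has_s1 ? s1[r][0] : 0;  /* iterator i of S1 */
        theta[r][1] = has_s1 ? s1[r][1] : 0;  /* iterator j of S1 */
        theta[r][2] = s2[r][0];               /* iterator i of S2 */
        theta[r][3] = s2[r][1];               /* iterator j of S2 */
        theta[r][4] = s2[r][2];               /* iterator k of S2 */
        theta[r][5] = has_s1 ? s1[r][2] : 0;  /* parameter n, S1 */
        theta[r][6] = s2[r][3];               /* parameter n, S2 */
        theta[r][7] = has_s1 ? s1[r][3] : 0;  /* constant, S1 */
        theta[r][8] = s2[r][4];               /* constant, S2 */
    }
}
```

Calling `pack_schedules(LEGAL_S1, LEGAL_S2, theta)` yields the rows (1 0 0 0 1 1 1 0 1), (0 1 0 1 0 0 0 0 0) and (0 0 1 0 0 0 0 0 0), matching the combined matrix on the slide.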

  13. Building the Search Space: The Model
      Representing a schedule

      Transformations expressed by each group of schedule coefficients:

      | Part of Θ | Transformation | Description |
      |---|---|---|
      | $\vec{\imath}$ | reversal | Changes the direction in which a loop traverses its iteration range |
      | $\vec{\imath}$ | skewing | Makes the bounds of a given loop depend on an outer loop counter |
      | $\vec{\imath}$ | interchange | Exchanges two loops in a perfectly nested loop, a.k.a. permutation |
      | $\vec{p}$ | fusion | Fuses two loops, a.k.a. jamming |
      | $\vec{p}$ | distribution | Splits a single loop nest into many, a.k.a. fission or splitting |
      | $c$ | peeling | Extracts one iteration of a given loop |
      | $c$ | shifting | Allows loops to be reordered |

  14. Building the Search Space: The Search Space
      Challenges
      ◮ Completeness (combinatorial problem)
      ◮ Scalability (large integer polyhedra computation)
      Proposed solution
      ◮ Philosophically close to Feautrier’s maximal fine-grain parallelism
      ◮ One point in the space ⇔ one distinct legal program version
      ◮ Bound schedule coefficients in [−1, 1] to limit control overhead
      ◮ No completeness, but decent scalability
      ◮ Deliver a mechanism to automatically complete / correct schedules
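To give a feel for why bounding the coefficients keeps the space tractable, the sketch below enumerates every coefficient vector with entries in {−1, 0, 1}. It counts raw candidates only, before any legality constraint is applied, so it is an upper bound rather than the size of the legal space described on the slide. The count is plain arithmetic and does not appear in the slides; the names `next_candidate` and `count_candidates` are hypothetical.

```c
#include <assert.h>

/* Advance coef (length k, entries in {-1, 0, 1}) to the next vector,
   odometer style. Returns 0 once all combinations have been visited. */
static int next_candidate(int *coef, int k)
{
    for (int p = 0; p < k; ++p) {
        if (coef[p] < 1) {
            ++coef[p];
            return 1;
        }
        coef[p] = -1;  /* wrap this digit and carry to the next one */
    }
    return 0;
}

/* Count all coefficient vectors of length k with entries in [-1, 1].
   The bound on k only keeps this illustrative enumeration short. */
static long count_candidates(int k)
{
    int coef[16];
    assert(k > 0 && k <= 16);
    for (int p = 0; p < k; ++p)
        coef[p] = -1;
    long count = 1;  /* the starting all -1 vector */
    while (next_candidate(coef, k))
        ++count;
    return count;
}
```

For the matrix-multiply example, a one-row schedule has 4 coefficients for S1 and 5 for S2, so `count_candidates(9)` returns 3^9 = 19683 candidates before legality filtering.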
