AlphaZ: A System for Design Space Exploration in the Polyhedral Model



SLIDE 1

AlphaZ: A System for Design Space Exploration in the Polyhedral Model

Tomofumi Yuki, Gautam Gupta, DaeGon Kim, Tanveer Pathan, and Sanjay Rajopadhye

SLIDE 2

Polyhedral Compilation

n The Polyhedral Model

n Now a well established approach for automatic

parallelization

n Based on mathematical formalism n Works well for regular/dense computation

n Many tools and compilers:

n PIPS, PLuTo, MMAlpha, RStream, GRAPHITE(gcc),

Polly (LLVM), ...

SLIDE 3

Design Space (still a subset)

n Space Time + Tiling: schedule + parallel loops

n Primary focus of existing tools

n Memory Allocation

n Most tools for general purpose processors do

not modify the original allocation

n Complex interaction with space time

n Higher-level Optimizations

n Reduction detection n Simplifying Reduction (complexity reduction)

SLIDE 4

AlphaZ

n Tool for Exploration

n Provides a collection of analyses,

transformations, and code generators

n Unique Features

n Memory Allocation n Reductions

n Can be used as a push-button system

n e.g., Parallelization à la PLuTo is possible n Not our current focus

SLIDE 5

This Paper: Case Studies

n adi.c from PolyBench

n Re-considering memory allocation allows the

program to be fully tiled

n Outperforms PLuTo that only tiles inner loops

n UNAfold (RNA folding application)

n Complexity reduction from O(n4) to O(n3) n Application of the transformations is fully

automatic

SLIDE 6

This Talk: Focus on Memory

n Tiling requires more memory n e.g., Smith-Waterman dependence

6

Sequential Tiled

SLIDE 7

ADI-like Computation

n Updates 2D grid with outer time loop n PLuTo only tiles inner two dimensions

n Due to a memory based dependence n With an extra scalar, becomes tilable in all

three dimensions

n PolyBench implementation has a bug

n It does not correctly implement ADI n ADI is not tilable in all dimensions

SLIDE 8

adi.c: Original Allocation

for (t = 0; t < tsteps; t++) {
  for (i1 = 0; i1 < n; i1++)
    for (i2 = 0; i2 < n; i2++)
      X[i1][i2] = foo(X[i1][i2], X[i1][i2-1], …)
  …
  for (i1 = 0; i1 < n; i1++)
    for (i2 = n-1; i2 >= 1; i2--)
      X[i1][i2] = bar(X[i1][i2], X[i1][i2-1], …)
  …
}

n Not tilable because of the reverse loop

n Memory based dependence: (i1,i2 -> i1,i2+1) n Require all dependences to be non-negative

SLIDE 9

adi.c: Original Allocation

[Figure: S1 and S2 both operating on row X[i1]]

for (i2 = 0; i2 < n; i2++)
  S1: X[i1][i2] = foo(X[i1][i2], X[i1][i2-1], …)
…
for (i2 = n-1; i2 >= 1; i2--)
  S2: X[i1][i2] = bar(X[i1][i2], X[i1][i2-1], …)
…

SLIDE 10

n Once the two loops are fused:

n Value of X only needs to be preserved for one

iteration of i2

n We don’t need a full array X’, just a scalar

adi.c: With Extra Memory

X[i1] X’[i1]

  • for (i2 = 0; i2 < n; i2++)

S1: X[i1][i2] = foo(X[i1][i2], X[i1][i2-1], …) …

  • for (i2 = 1; i2 < n; i2++)

S2: X’[i1][i2] = bar(X[i1][i2], X[i1][i2-1], …) …
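A minimal sketch of the fused, scalar-only version for a single row, compared against the original two-sweep form. The bodies of `foo` and `bar` are hypothetical stand-ins (the slides elide the real functions and their remaining arguments), the sweeps start at i2 = 1 to keep the X[i2-1] read in bounds, and the function names `row_original` and `row_fused` are made up for this illustration. It is not AlphaZ output.

```c
/* Hypothetical stand-ins for the elided foo and bar. */
static long foo(long a, long b) { return 3 * a + b; }
static long bar(long a, long b) { return a - 2 * b; }

/* Original allocation: forward sweep (S1), then reverse sweep (S2), in place. */
void row_original(long *x, int n) {
    for (int i2 = 1; i2 < n; i2++)
        x[i2] = foo(x[i2], x[i2 - 1]);
    for (int i2 = n - 1; i2 >= 1; i2--)
        x[i2] = bar(x[i2], x[i2 - 1]);
}

/* Fused sweep: one scalar keeps the S1 value of the previous element,
   which is the only S1 value that S2 still needs. */
void row_fused(long *x, int n) {
    if (n < 1)
        return;
    long prev = x[0];                 /* S1 leaves x[0] unchanged */
    for (int i2 = 1; i2 < n; i2++) {
        long cur = foo(x[i2], prev);  /* S1 result for i2 */
        x[i2] = bar(cur, prev);       /* S2 reads S1 results for i2 and i2-1 */
        prev = cur;
    }
}
```

Both functions leave the same values in the row; the fused form runs only in the forward direction, so the reverse-loop dependence no longer appears.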

SLIDE 11

n PLuTo does not scale because the outer loop is

not tiled

adi.c: Performance

Speedup of Optimized Code on Xeon Number of Threads (Cores) Speed up compared to original code AlphaZ PLuTo 1 2 4 8 1 2 4 8

Speedup of Optimized Code on Cray XT6m

Number of Threads (Cores) Speed up compared to original code AlphaZ PLuTo 4 8 12 16 20 24 4 8 12 16 20 24

SLIDE 12

UNAfold

n UNAfold [Markham and Zuker 2008]

n RNA secondary structure prediction algorithm n O(n3) algorithm was known [Lyngso et al. 1999]

n too complicated to implement n “good enough” workaround exists

n AlphaZ

n Systematically transform O(n4) to O(n3) n Most of the process can be automated

SLIDE 13

UNAfold: Optimization

• Key: Simplifying Reductions [POPL 2006]
  • Finds “hidden scans” in reductions
  • Rare case: compiler can reduce complexity
• Almost automatic:
  • The O(n^4) section must be separated
    • many boundary cases
  • Requires the function to be inlined to expose reuse
  • Transformations to perform the above are available; no manual modification of code

SLIDE 14

n Complexity reduction is empirically confirmed

UNAfold: Performance

200 400 600 800 1000 1400 500 1000 1500 2000 2500

Execution Time of UNAfold

Sequence Length (N) Execution Time in Seconds

  • riginal

simplified

2.0 2.2 2.4 2.6 2.8 3.0 3.2 1 2 3 4 5 6 7 8

Log plot of Execution Time

Log of Sequence Length Log of Execution Time

  • riginal

simplified y = 4x + b1 y = 3x + b2
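The reference lines y = 4x + b1 and y = 3x + b2 follow from a power-law model of running time. A short derivation, assuming execution time behaves as T(N) ≈ c N^k for some constant c (the constant and the base of the logarithm are not given on the slide):

```latex
T(N) \approx c\,N^{k}
\quad\Longrightarrow\quad
\log T(N) \approx k \log N + \log c
```

With y = log T(N) and x = log N this is a line y = kx + b, where b = log c. Under that model an O(n^4) computation lies along a line of slope 4 and an O(n^3) computation along a line of slope 3.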

SLIDE 15

AlphaZ System Overview

n Target Mapping:

n Specifies schedule,

memory allocation, etc.

15

C Alpha Polyhedral Representation Target Mapping Analyses Transformations Code Gen C+OpenMP C+CUDA C+MPI

SLIDE 16

Human-in-the-Loop

n Automatic parallelization—“holy grail” goal

n Current automatic tools are restrictive

n A strategy that works well is “hard-coded” n difficult to pass domain specific knowledge

n Human-in-the-Loop

n Provide full control to the user

n Help finding new “good” strategies n Guide the transformation with domain specific

knowledge

SLIDE 17

Conclusions

n There are more strategies worth exploring

n some may currently be difficult to automate

n Case Studies

n adi.c: memory n UNAfold: reductions

n AlphaZ: Tool for trying out new ideas

SLIDE 18

Acknowledgements

n AlphaZ Developers/Users

n Members of Mélange at CSU n Members of CAIRN at IRISA, Rennes n Dave Wonnacott at Haverford University and

his students

SLIDE 19

Key: Simplifying Reductions

n Simplifying Reductions [POPL 2006]

n Finds “hidden scans” in reductions n Rare case: compiler can reduce complexity

n Main idea:

n can be written

19

X[i] = A[i]

k=0 i

X[i] = i = 0 : A[i] i > 0 : X[i −1]+A[i] # $ %

O(n2) O(n)
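A minimal C sketch of this rewrite, assuming the reduction on the slide is the running sum X[i] = A[0] + ... + A[i]. The function names are made up for this illustration, and the second function is a hand-written equivalent, not code generated by AlphaZ.

```c
/* Direct evaluation of the reduction: each X[i] re-sums A[0..i],
   for O(n^2) additions in total. */
void prefix_sums_direct(const long *a, long *x, int n) {
    for (int i = 0; i < n; i++) {
        long s = 0;
        for (int k = 0; k <= i; k++)
            s += a[k];
        x[i] = s;
    }
}

/* Scan form: X[0] = A[0], X[i] = X[i-1] + A[i], for O(n) additions. */
void prefix_sums_scan(const long *a, long *x, int n) {
    for (int i = 0; i < n; i++)
        x[i] = (i == 0) ? a[i] : x[i - 1] + a[i];
}
```

The scan form reuses X[i-1], which already holds the sum of A[0..i-1]; that reuse is the “hidden scan” in the reduction.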