Pluto Scheduling Algorithm By Athanasios Konstantinidis Supervisor - - PowerPoint PPT Presentation



slide-1
SLIDE 1

More Definite Results From the Pluto Scheduling Algorithm

By

Athanasios Konstantinidis

Supervisor

Paul H. J. Kelly

slide-2
SLIDE 2

About Me

  • PhD student at Imperial College London supervised by Paul H. J. Kelly.
  • Compiler and language support for heterogeneous parallel architectures (e.g. GPGPUs, Cell BE, multicore, etc.).

  • Developing our own source-to-source polyhedral compiler (CUDA back-end).
  • Sponsored by EPSRC and Codeplay Software Ltd.
slide-3
SLIDE 3

Our Polyhedral Framework

[Diagram: compiler pipeline. Boxes and labels shown: ROSE AST, Control Graph Extraction, Main graph, Polyhedral Model extraction, Polyhedral framework, poly graph, Dependence Analysis, Constraints, PLuTo Scheduling, Affine Transformations, Polyhedral scanning algorithm (CLooG), CLooG IR, CLooG graph extraction, CLooG graph, CUDA graph extraction, CUDA graph, Graph to AST, ROSE AST.]

slide-4
SLIDE 4

Our Polyhedral Framework

[Diagram: the framework pipeline of Slide 3, repeated with the annotation below.]

Does not require file I/O for syntactic post-processing

slide-5
SLIDE 5

The layout of the constraints can affect the scheduling solutions

Our Polyhedral Framework

[Diagram: the framework pipeline of Slide 3, repeated with the annotation above.]

slide-6
SLIDE 6

PLuTo Scheduling Algorithm 1

  • Iteratively looks for a maximal set of linearly independent affine transforms of the original iteration space.
  • An affine transform is a hyperplane representing a loop in the transformed iteration space.
  • Each hyperplane needs to respect a set of constraints that guarantee legality and minimum communication between hyperplane instances (i.e. between different loop iterations).

[Figure: three iteration-space diagrams with axes labelled "space" and "time".]

slide-7
SLIDE 7

PLuTo Scheduling Algorithm 2

MAX + scalar dimensions

  • Solve(M): uses a Parametric Integer Programming (PIP) library to find the lexicographic minimum solution.

slide-8
SLIDE 8

Global Constraint Matrix M ← Empty
M ← Legality
M ← Communication Bounding
M ← Non-Trivial solution
While ( Solve(M) ) {
    M ← Linear Independence
}

PLuTo Scheduling Algorithm 3

  • Iteratively find as many linearly independent solutions as possible.
slide-9
SLIDE 9

Global Constraint Matrix M ← Empty
M ← Legality
M ← Communication Bounding
M ← Non-Trivial solution
While ( Solve(M) ) {
    M ← Linear Independence
}
If NO solution is found: Cut in SCC
Remove Killed dependences

PLuTo Scheduling Algorithm 4

  • If no more solutions can be found, remove any killed dependences.
  • If no solution was found at all, cut the dependence graph into Strongly Connected Components (SCC), i.e. loop distribution, and remove the killed dependences.

slide-10
SLIDE 10

do {
    Global Constraint Matrix M ← Empty
    M ← Legality
    M ← Communication Bounding
    M ← Non-Trivial solution
    While ( Solve(M) ) {
        M ← Linear Independence
    }
    If NO solution is found: Cut in SCC
    Remove Killed dependences
} While ( (total_sols < MAX) OR (deps ≠ 0) )

PLuTo Scheduling Algorithm 5

  • Iteratively find bands of fully permutable loop nests
slide-11
SLIDE 11

Communication Bounding Constraints

Affine Form

  • n: Structure Parameters
  • For every dependence edge e: [equation not preserved in the transcript]

[Equation annotations: Farkas Lemma Parameters; Unknown schedule coefficients; Constant Identification; h-transformation]
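
The formulas themselves are missing from the slide text. As a hedged reconstruction, the communication bounding constraints in the published PLuTo formulation usually take the form below. The symbols are assumptions carried over from that formulation rather than read off this slide: φ is the affine transformation of a statement, P_e the dependence polyhedron of edge e, (u, w) the coefficients of the affine bound, and λ the Farkas multipliers.

```latex
\begin{align}
\delta_e(\vec{s}, \vec{t}) &= \phi_{S_j}(\vec{t}) - \phi_{S_i}(\vec{s}),
  & \langle \vec{s}, \vec{t} \rangle &\in P_e \\
v(\vec{n}) &= \vec{u} \cdot \vec{n} + w \\
v(\vec{n}) - \delta_e(\vec{s}, \vec{t}) &\ge 0,
  & \langle \vec{s}, \vec{t} \rangle &\in P_e \\
v(\vec{n}) - \delta_e(\vec{s}, \vec{t}) &\equiv \lambda_{e0} + \sum_{k} \lambda_{ek} P_e^{k},
  & \lambda_{ek} &\ge 0
\end{align}
```

Equating the coefficients of the loop indices and structure parameters on both sides of the last identity eliminates the Farkas multipliers and leaves linear constraints on the unknown schedule coefficients, u and w.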

slide-12
SLIDE 12

Ordering Sensitivity 1

Cost

  • For the same Cost, the solution returned by the PIP solver depends on the ordering of the transformation coefficients.

Ordering of Transformation Coefficients

slide-13
SLIDE 13

Ordering Sensitivity (example)

Cost = 1

  • Minimum Cost is 1.
  • No outer parallel loop.

for i = 0, N
  for j = 0, N
    A[i][j] = A[i-1][j]*A[i-1][j-1]

[Figure: iteration space on i/j axes.]

slide-14
SLIDE 14

Ordering Sensitivity (example)

  • By changing the order of the transformation coefficients we get two different solutions, both having Cost = 1.

[Figure: two panels on i/j axes, labelled "Order 1: Cost = 1" and "Order 2: Cost = 1".]
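
The ordering effect can be reproduced on the Slide 13 loop nest with a brute-force stand-in for the PIP solver. This is a minimal sketch, not the PLuTo implementation: the helper name `lexmin_hyperplane`, the small search range, the constant-only cost bound and the non-trivial constraint c_i + c_j >= 1 are all assumptions made for illustration.

```python
from itertools import product

# Dependence distance vectors (di, dj) of A[i][j] = A[i-1][j] * A[i-1][j-1].
DEPS = [(1, 0), (1, 1)]

def lexmin_hyperplane(order, max_coeff=2):
    """Return (hyperplane, cost) minimising (cost, coefficients in `order`).

    `order` lists coefficient positions (0 = c_i, 1 = c_j) from the first
    minimised to the last. Hypothetical helper: brute force over small
    non-negative integers instead of a real PIP solver.
    """
    best_key, best = None, None
    for c in product(range(max_coeff + 1), repeat=2):
        deltas = [c[0] * di + c[1] * dj for di, dj in DEPS]
        if min(deltas) < 0:   # legality: no dependence may go backwards
            continue
        if sum(c) < 1:        # non-trivial solution: reject (0, 0)
            continue
        cost = max(deltas)    # smallest constant bound on all dependences
        key = (cost,) + tuple(c[k] for k in order)
        if best_key is None or key < best_key:
            best_key, best = key, (c, cost)
    return best

print(lexmin_hyperplane(order=(0, 1)))  # c_i minimised before c_j
print(lexmin_hyperplane(order=(1, 0)))  # c_j minimised before c_i
```

Both calls report a cost of 1 but return different hyperplanes, (0, 1) and (1, 0). Which of the two corresponds to "Order 1" on the slide cannot be determined from the text.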

slide-15
SLIDE 15

Ordering Sensitivity (example)

  • By adding the linear independence constraints we get a second solution.
  • Order 2 yields an inner loop that is fully parallel.
  • Which solution/order is better?

[Figure: two panels on i/j axes, labelled "Order 1: Cost = 1" and "Order 2: Cost = 1", with captions "Fully Parallel Inner Loop" and "Pipeline/Wavefront".]
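
A toy version of the iterative search shows how the coefficient ordering decides between the two outcomes. It is a sketch under stated assumptions: `find_band` and `parallel_levels` are hypothetical helpers, the solver is a brute-force search over small non-negative coefficients, and linear independence is checked with a 2-D determinant, which only works for this two-loop example.

```python
from itertools import product

DEPS = [(1, 0), (1, 1)]  # distance vectors of the Slide 13 loop nest

def delta(h, d):
    """Dependence distance along hyperplane h."""
    return h[0] * d[0] + h[1] * d[1]

def find_band(order, max_sols=2, max_coeff=2):
    """Collect linearly independent hyperplanes, lexmin in the given order."""
    found = []
    while len(found) < max_sols:
        best_key, best = None, None
        for c in product(range(max_coeff + 1), repeat=2):
            if sum(c) < 1 or any(delta(c, d) < 0 for d in DEPS):
                continue  # trivial or illegal
            if any(c[0] * h[1] - c[1] * h[0] == 0 for h in found):
                continue  # linearly dependent on an earlier solution
            cost = max(delta(c, d) for d in DEPS)
            key = (cost,) + tuple(c[k] for k in order)
            if best_key is None or key < best_key:
                best_key, best = key, c
        if best is None:
            break
        found.append(best)
    return found

def parallel_levels(hyperplanes):
    """True for each loop that carries none of the still-unsatisfied deps."""
    remaining, flags = list(DEPS), []
    for h in hyperplanes:
        flags.append(all(delta(h, d) == 0 for d in remaining))
        remaining = [d for d in remaining if delta(h, d) == 0]
    return flags

for order in [(0, 1), (1, 0)]:
    band = find_band(order)
    print(order, band, parallel_levels(band))
```

Minimising c_i first gives the band [(0, 1), (1, 0)], in which both loops carry a dependence (the pipeline/wavefront case). Minimising c_j first gives [(1, 0), (0, 1)], whose inner loop carries nothing and is therefore fully parallel.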

slide-16
SLIDE 16

Pipeline Degrees of Parallelism

  • N non-parallel loops can be transformed into a wavefront/pipeline consisting of one sequential and N-1 parallel loops, i.e. N-1 degrees of parallelism.

[Figure: two panels on i/j axes, labelled "Wavefront/pipeline" and "Non-parallel loops".]
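
For N = 2 the transformation can be sketched as follows. The example assumes a loop nest with dependence distances (1, 0) and (0, 1), so neither original loop is parallel, and uses the wavefront t = i + j. Both choices are illustrative and not taken from the slide, and the helper names are hypothetical.

```python
def wavefronts(n):
    """Group the points of an n x n iteration space by wavefront t = i + j."""
    fronts = [[] for _ in range(2 * n - 1)]
    for i in range(n):
        for j in range(n):
            fronts[i + j].append((i, j))
    return fronts

def respects_dependences(fronts, deps):
    """Check that every dependence source runs in a strictly earlier wavefront."""
    step = {point: t for t, front in enumerate(fronts) for point in front}
    for (i, j), t in step.items():
        for di, dj in deps:
            src = (i - di, j - dj)
            if src in step and step[src] >= t:
                return False
    return True

fronts = wavefronts(4)
# One sequential loop over t; the points inside each front can run in parallel.
print(len(fronts), [len(front) for front in fronts])
print(respects_dependences(fronts, [(1, 0), (0, 1)]))
```

The front sizes grow from 1 up to n and shrink back to 1, which is where the start-up and drain costs of a pipeline come from.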

slide-17
SLIDE 17

Pipeline Degrees of Parallelism

[Figure: wavefront execution on i/j axes, annotated "Start-up Cost" and "Drain Cost".]

  • Better spatial/temporal locality along a wavefront

slide-18
SLIDE 18

Pipeline Degrees of Parallelism

[Figure: wavefront execution on i/j axes, annotated "Start-up Cost" and "Drain Cost".]

  • Better spatial/temporal locality along a wavefront
  • Depend on structure parameters

slide-19
SLIDE 19

Pipeline Degrees of Parallelism

[Figure: wavefront execution on i/j axes, annotated "Start-up Cost" and "Drain Cost".]

  • Better spatial/temporal locality along a wavefront
  • Depend on structure parameters
  • Number of Read-after-Read dependences that lie within the wavefront

slide-20
SLIDE 20

Fully Parallel vs Pipeline Degrees of Parallelism 1

  • We propose a way of distinguishing between fully parallel and pipeline degrees of parallelism.
  • We use dependence direction vectors in order to expose inner fully parallel degrees of parallelism.

Direction Information (for each dependence edge e):

  • Bit vector: one bit per dimension i, distinguishing "e extends along i" from "e does not extend along i".
  • Boolean: distinguishes "e extends in only 1 dimension" from "e extends in more than 1 dimension".
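
A minimal sketch of this direction information, computed from constant dependence distance vectors. The encoding is an assumption, since the values used on the slide are missing from the text: here a bit is 1 when the edge extends along that dimension, and the Boolean is True when it extends in exactly one dimension. `direction_info` is a hypothetical helper name.

```python
def direction_info(distance):
    """Return (bit vector, single-dimension flag) for one dependence edge."""
    bits = tuple(1 if d != 0 else 0 for d in distance)
    single_dim = sum(bits) == 1
    return bits, single_dim

# The two dependences of the Slide 13 example, as (i, j) distances.
for name, dist in [("e1", (1, 0)), ("e2", (1, 1))]:
    print(name, direction_info(dist))
```

For the example loop nest, the first edge extends along i only, while the second extends along both i and j.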

slide-21
SLIDE 21

Fully Parallel vs Pipeline Degrees of Parallelism 2

[Figure: dependence edges e1 and e2 drawn on i/j axes.]

slide-22
SLIDE 22

Fully Parallel vs Pipeline Degrees of Parallelism 3

[Figure: two panels on i/j axes, with one dimension marked "Fully parallel dimension".]

  • By placing the coefficients of fully parallel dimensions in leading minimization positions we are effectively pushing them towards inner nest levels.

  • As a result fully parallel degrees of parallelism can be recovered.
slide-23
SLIDE 23

Conclusions

  • The PLuTo scheduling algorithm iteratively finds affine transformations that minimize communication.
  • For the same minimum communication, the solution might be sensitive to the ordering of the affine transformation coefficients in the global constraint matrix.
  • We might have to choose between fully parallel and pipeline degrees of parallelism.
  • We propose a method for distinguishing between fully parallel and pipeline degrees of parallelism.
  • We use dependence direction information in order to expose inner fully parallel loops.

slide-24
SLIDE 24

Thank You !

Any Questions ?