Register Pressure in Software-Pipelined Loop Nests: Fast Computation and Impact on Architecture Design

SLIDE 1

Register Pressure in Software-Pipelined Loop Nests

Fast Computation and Impact on Architecture Design

Alban Douillet, Guang R. Gao
{douillet,ggao}@capsl.udel.edu
Department of Electrical & Computer Engineering, University of Delaware

18th International Workshop on Languages and Compilers for Parallel Computing
Hawthorne, New York, October 20th-22nd, 2005

  • A. Douillet, G.R. Gao (Univ. of Delaware)

Register Pressure in SWP’ed Loop Nests LCPC’05 1 / 43

SLIDE 2

Introduction

Scientific Applications
  • loop nests dominant

Single-dimension Software Pipelining (SSP)
  • software-pipelines the most profitable loop in a loop nest
  • high register pressure
  • register allocation is time-consuming

Need for a fast method to evaluate register pressure
  • detect infeasible schedules before calling the register allocator
  • measure quality of the register allocation solution
  • give an estimate of register needs for future architecture designs

SLIDE 3

Outline

1. Loop Nest Software-Pipelining
2. Problem Statement
3. Definitions & Issues
4. Fast Register Pressure Computation
5. Experiments
6. Conclusion

SLIDE 4

Outline

1. Loop Nest Software-Pipelining
2. Problem Statement
3. Definitions & Issues
4. Fast Register Pressure Computation
5. Experiments
6. Conclusion

SLIDE 5

Modulo Scheduling

  • most popular SWP technique
  • well studied and understood
  • full array of loop optimizations
  • single loop, parallel execution of iterations
  • new iteration issued every T cycles (initiation interval)

[Figure: modulo schedule of the loop FOR J=0,4 with body operations b, c, d, showing the prolog, kernel, and epilog]
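The prolog, kernel, and epilog structure described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not in the slides: each operation b, c, d is treated as one stage lasting exactly T cycles, and the function names `modulo_schedule` and `phase` are made up for the example.

```python
def modulo_schedule(stages, n_iter):
    """Return, for each stage-long time slot, the (iteration, stage) pairs
    that execute in it. Iteration `it` is issued at slot `it`, i.e. one new
    iteration every T cycles if a slot lasts T cycles (an assumption)."""
    n_stages = len(stages)
    slots = []
    for slot in range(n_iter + n_stages - 1):
        active = [(it, stages[slot - it])
                  for it in range(n_iter)
                  if 0 <= slot - it < n_stages]
        slots.append(active)
    return slots


def phase(slot, n_stages, n_iter):
    """Classify a slot, assuming n_iter >= n_stages."""
    if slot < n_stages - 1:
        return "prolog"   # pipeline still filling
    if slot < n_iter:
        return "kernel"   # every stage is busy
    return "epilog"       # pipeline draining


if __name__ == "__main__":
    stages = ["b", "c", "d"]   # loop body of FOR J=0,4 (5 iterations)
    for slot, active in enumerate(modulo_schedule(stages, 5)):
        ops = " ".join(f"{stage}{it}" for it, stage in active)
        print(f"slot {slot} [{phase(slot, len(stages), 5)}]: {ops}")
```

With three stages and five iterations, the sketch yields two prolog slots, three kernel slots in which b, c, and d all execute for different iterations, and two epilog slots.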

SLIDE 6

Modulo Scheduling

But...
  • limited to innermost loop
  • loop transformations to bring ILP or data cache reuse potential to the innermost loop are not always possible

SLIDE 7

Single-Dimension Software-Pipelining (SSP)

  • proposed by Rong et al. (CGO’04, PLDI’05)
  • software-pipelines the most profitable loop level in a loop nest
  • equivalent to MS if the innermost level is selected
  • can be seen as a generalization of MS to loop nests
  • proven performance boost vs. MS
  • can take advantage of loop optimizations used for MS
  • "single-dimension" because it simplifies the multi-dimensional DDG into a uni-dimensional DDG

SLIDE 8

SSP Kernel

SSP generates a kernel
  • similar to MS
  • enclosed stages
  • single initiation interval T

L1 is the outermost loop and Ln the innermost
Si: number of stages at level i

[Figure: SSP kernel spanning T cycles, with the innermost stages enclosed in the middle-loop stages]

SLIDE 9

SSP Ideal Schedule

Generated using kernel as a template
  • new outermost iteration issued every T cycles
  • outermost iterations executed in parallel
  • inner iterations executed sequentially within one outermost iteration
  • resource conflicts!

SLIDE 10

SSP Ideal Schedule: Example

[Figure: ideal schedule of the example loop nest (outer loop FOR I=1,4 with operations a and e, inner loop FOR J=1,3 with operations b, c, d), with T=1, S1=5, S2=3; issuing a new outermost iteration every T cycles leads to resource conflicts]

SLIDE 11

SSP Final Schedule

  • delays some outermost iterations to avoid resource conflicts
  • outermost iterations executed in groups of Sn
  • resource conflict-free schedule

[Figure: final schedule of the example; after a group of Sn outermost iterations, the following iterations are delayed]
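The resource conflicts of the ideal schedule, and the effect of delaying iterations, can be illustrated with a toy model. The sketch below is not the SSP scheduling algorithm: it assumes one operation per stage, one stage per time slot (T=1), at most one outermost iteration per stage per slot, and a greedy "earliest conflict-free slot" policy. All function names are hypothetical.

```python
def stage_sequence(pre, inner, post, inner_trips):
    """Stages run by one outermost iteration, one stage per time slot."""
    return pre + inner * inner_trips + post


def reservations(seq, start):
    """(slot, stage) pairs used by an outermost iteration issued at `start`."""
    return {(start + k, stage) for k, stage in enumerate(seq)}


def has_conflict(seq, starts):
    """True if two iterations need the same stage in the same slot."""
    used = set()
    for start in starts:
        needed = reservations(seq, start)
        if used & needed:
            return True
        used |= needed
    return False


def delayed_starts(seq, n_outer):
    """Greedy toy policy: issue each iteration at the earliest slot that does
    not collide with the iterations already placed."""
    used, starts, start = set(), [], 0
    for _ in range(n_outer):
        while used & reservations(seq, start):
            start += 1
        starts.append(start)
        used |= reservations(seq, start)
        start += 1
    return starts


if __name__ == "__main__":
    # Example of the slides: a; FOR J=1,3 { b c d }; e with 4 outer iterations.
    seq = stage_sequence(["a"], ["b", "c", "d"], ["e"], inner_trips=3)
    print("ideal schedule conflicts:", has_conflict(seq, [0, 1, 2, 3]))
    print("delayed issue slots:", delayed_starts(seq, 4))
```

In this toy model the first three outermost iterations (as many as there are innermost stages) can be issued back to back, while the fourth has to wait, which mirrors the grouping by Sn described on the slide.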

SLIDE 12

SSP Loop Patterns

Patterns:
  • Outermost Loop Pattern (OLP)
  • Inner Loop Execution Segment (ILES)
    • Innermost Loop Pattern (ILP)
    • Draining & Filling Pattern (DFP)

Composition:
  • OLP: all S kernel stages
  • ILES: cyclic combination of Sn consecutive stages

[Figure: kernel (stages a, b, c, d, e) and the final schedule built from it, annotated with the OLP, ILES, ILP, and DFP regions and the delay between groups of iterations]

SLIDE 13

SSP Implementation

[Figure: SSP compilation flow: Loop Selection → Dependence Simplification → Schedule Construction → Register Allocation → Code Generation. Intermediate products labelled in the figure: loop nest, 1-D DDG, kernel, register-allocated kernel, final schedule.]

SLIDE 14

Outline

1. Loop Nest Software-Pipelining
2. Problem Statement
3. Definitions & Issues
4. Fast Register Pressure Computation
5. Experiments
6. Conclusion

SLIDE 15

Motivation

need to determine feasibility of schedules
  • register allocation is time-consuming
  • infeasible schedules because of high register pressure are not uncommon

need to evaluate quality of register allocator
  • how far from the optimal solution?

need to evaluate actual register needs for architectural designs
  • are register files big enough?

SLIDE 16

Problem Statement

Given a loop nest and an SSP schedule for it, evaluate the register pressure MaxLive of the final schedule.
  • only rotating registers
  • MaxLive = maximum number of live variables at any given cycle in the final schedule
  • MaxLive definition similar to the one for MS
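The MaxLive definition can be made concrete with a brute-force computation over explicit lifetimes. This sketch only restates the definition; it is not the fast method of the talk. It assumes each lifetime is given as a (start, end) pair of cycles and that a value is live on cycles start <= cycle < end; both the representation and the name `max_live` are assumptions made for the example.

```python
def max_live(lifetimes):
    """Maximum number of simultaneously live values.

    `lifetimes` is an iterable of (start, end) cycle pairs; a value is taken
    to be live on cycles start <= cycle < end (half-open; an assumption).
    """
    events = []
    for start, end in lifetimes:
        events.append((start, 1))    # value becomes live
        events.append((end, -1))     # value is killed
    live = best = 0
    # Tuples sort by cycle first; at equal cycles -1 sorts before +1, so a
    # value killed at cycle t frees its register before a definition at t.
    for _, delta in sorted(events):
        live += delta
        best = max(best, live)
    return best


if __name__ == "__main__":
    print(max_live([(0, 4), (1, 3), (2, 6), (4, 5)]))
```

Enumerating every lifetime of the final schedule this way grows with the number of iterations, which is one reason a closed-form count over the kernel is attractive.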

SLIDE 17

Updated SSP Implementation

[Figure: updated SSP compilation flow: Loop Selection → Dependence Simplification → Schedule Construction → Register Pressure Evaluation → Register Allocation → Code Generation. If the evaluated pressure is too high (yes), choose a different loop level or increase the initiation interval; otherwise (no), continue with register allocation.]
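The feedback loop of the updated flow can be sketched as a small driver. Everything here is hypothetical: `build_schedule` and `estimate_max_live` stand in for the scheduler and the register pressure evaluation, and the policy of trying all initiation intervals before moving to another loop level is an arbitrary choice for the example (the slide only says "choose different loop level OR increase initiation interval").

```python
def find_feasible_schedule(levels, ii_range, build_schedule,
                           estimate_max_live, n_registers):
    """Return (level, ii, schedule) for the first schedule whose estimated
    register pressure fits in `n_registers`, or None if there is none.

    `levels` and `ii_range` must be re-iterable (e.g. a list or a range).
    `build_schedule(level, ii)` returns a schedule or None on failure.
    `estimate_max_live(schedule)` returns the estimated MaxLive.
    """
    for level in levels:
        for ii in ii_range:
            schedule = build_schedule(level, ii)
            if schedule is None:
                continue                      # no schedule at this (level, ii)
            if estimate_max_live(schedule) <= n_registers:
                return level, ii, schedule    # cheap check passed
    return None


if __name__ == "__main__":
    # Stand-in scheduler and estimator, for illustration only.
    def fake_build(level, ii):
        return (level, ii)

    def fake_estimate(schedule):
        level, ii = schedule
        return 100 // ii + 10 * level

    print(find_feasible_schedule([1, 2], range(1, 5),
                                 fake_build, fake_estimate, 40))
```

The point of the design is that the expensive register allocator is only called on schedules that pass the cheap pressure check.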

SLIDE 18

Outline

1. Loop Nest Software-Pipelining
2. Problem Statement
3. Definitions & Issues
4. Fast Register Pressure Computation
5. Experiments
6. Conclusion

SLIDE 19

Definitions

scalar lifetime:
  • start: definition cycle of the value
  • end: kill cycle of the value
  • omega: number of outermost iterations spanned

classification:
  • global: constant values, ignored
  • input & output: prolog and epilog, ignored
  • local: within same outermost iteration
  • cross-iteration: between outermost iterations

[Figure: a local lifetime and a cross-iteration lifetime, annotated with start, end, and omega]
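The lifetime attributes and classification listed above could be represented as follows. This is a minimal sketch: the `kind` field and the rule "omega == 0 means local" are assumptions introduced for the example, not definitions taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lifetime:
    name: str
    start: int          # definition cycle of the value
    end: int            # kill cycle of the value
    omega: int = 0      # number of outermost iterations spanned
    kind: str = "loop"  # hypothetical tag: "global", "input", "output", "loop"


def classify(lifetime):
    """Map a lifetime to the category used when counting register pressure."""
    if lifetime.kind in ("global", "input", "output"):
        return "ignored"            # constants, prolog and epilog values
    if lifetime.omega == 0:         # assumption: stays in one outer iteration
        return "local"
    return "cross-iteration"


if __name__ == "__main__":
    values = [Lifetime("x", 0, 3), Lifetime("y", 2, 1, omega=3),
              Lifetime("k", 0, 0, kind="global")]
    for value in values:
        print(value.name, classify(value))
```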
SLIDE 20

Issues

more complex lifetime patterns than MS
  • non-constant initiation rate
  • stretched lifetimes

same stage may have different lifetime patterns
  • a stage is not always followed by the same stages
  • difference between first and last instance of the same stage

SLIDE 21

Lifetimes Example

[Figure: lifetimes drawn on the final schedule of stages a, b, c, d, e]

SLIDE 22

Outline

1. Loop Nest Software-Pipelining
2. Problem Statement
3. Definitions & Issues
4. Fast Register Pressure Computation
5. Experiments
6. Conclusion

SLIDE 23

Method Overview

Method Keys:
  • separate OLP from ILES instances of stages
  • separate first from last instances of stages
  • separate local from cross-iteration lifetimes

Steps:
  1. count number of local lifetimes in first instance of stages
  2. count number of local lifetimes in last instance of stages
  3. count number of cross-iteration lifetimes in each stage
  4. list all possible combinations of stages in schedule
  5. add number of lifetimes for each combination in OLP and ILES
  6. MaxLive is the highest value

SLIDE 24

Local Lifetimes

traditional liveness analysis
computes for both first and last instances of stage s:
  • each cycle c in stage between 0 and T − 1
  • live-out set of stage (c = T)

stage of level i visited i times

\[ LT_{local}(s, c, first/last) \]

[Figure: first and last instances of stages at levels 1, 2, and 3]

SLIDE 25

Cross-Iteration Lifetimes

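need stage and cycle of definition and kill
direct formula
  • for each cycle c in OLP between 0 and T − 1
  • not stage-specific

[Figure: example of a cross-iteration lifetime with omega = 3, defined at cycle c = 2 of stage S = 1 and killed at cycle c = 0 of stage S = 0; the LTcross row of the figure reads 2, 1, 2]

\[
LT_{cross}(c) = \sum_{v \in civs} \Big( \big( S_{kill}(v) - S_{def}(v) + 1 \big) + \delta_{def}(c, v) + \delta_{kill}(c, v) \Big)
\]

where

\[
\delta_{def}(c, v) = \begin{cases} -1 & \text{if } c < c_{def}(v) \\ 0 & \text{otherwise} \end{cases}
\qquad
\delta_{kill}(c, v) = \begin{cases} -1 & \text{if } c > c_{kill}(v) \\ 0 & \text{otherwise} \end{cases}
\]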
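The LTcross formula translates directly into code. The sketch below is a transcription under one labelled assumption: the kill stage passed in already includes the omega outermost iterations spanned (kill stage 0 plus omega = 3 gives 3 for the slide's example). That reading is an interpretation, chosen because it reproduces the 2, 1, 2 row of the example; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrossLifetime:
    s_def: int    # stage of the definition
    c_def: int    # cycle of the definition within its stage
    s_kill: int   # stage of the kill (assumed to be offset by omega)
    c_kill: int   # cycle of the kill within its stage


def lt_cross(c, lifetimes):
    """Number of live cross-iteration values at cycle c of the OLP."""
    total = 0
    for v in lifetimes:
        delta_def = -1 if c < v.c_def else 0     # not yet defined at cycle c
        delta_kill = -1 if c > v.c_kill else 0   # already killed at cycle c
        total += (v.s_kill - v.s_def + 1) + delta_def + delta_kill
    return total


if __name__ == "__main__":
    # Slide example: def at S=1, c=2; kill at S=0, c=0; omega=3; T=3.
    v = CrossLifetime(s_def=1, c_def=2, s_kill=0 + 3, c_kill=0)
    print([lt_cross(c, [v]) for c in range(3)])
```

Because the sum has one term per cross-iteration variable and no dependence on the stage, it can be evaluated once per cycle c and reused for every OLP instance.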

SLIDE 26

OLP: last stages local lifetimes

[Figure: final schedule of stages a, b, c, d, e]

Combination of first and last stages not always the same

\[
\sum_{s = l_n - i}^{l_1} LT_{local}(s, c, last)
\]

SLIDE 27

OLP: first stages local lifetimes

[Figure: final schedule of stages a, b, c, d, e]

Combination of first and last stages not always the same

\[
\sum_{s = f_1}^{l_n - 1 - i} LT_{local}(s, c, first)
\]

SLIDE 28

OLP: cross-iteration lifetimes

[Figure: final schedule of stages a, b, c, d, e]

Number of cross-iteration lifetimes identical between instances of OLP

\[ LT_{cross}(c) \]

SLIDE 29

OLP count

\[
LT_{olp}(c) = LT_{cross}(c) + \max_{i \in [1, S_n]} \left( \sum_{s = l_n - i}^{l_1} LT_{local}(s, c, last) + \sum_{s = f_1}^{l_n - 1 - i} LT_{local}(s, c, first) \right)
\]
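The OLP count can be transcribed as below. This is a sketch under assumptions the slides do not spell out: stages are numbered consecutively, f1 and l1 are the first and last stage of the outermost level, ln is the last stage of the innermost level, and `lt_local` is a caller-supplied function standing for LTlocal. Stage indices below f1 are clamped, which is a choice made for the example.

```python
def lt_olp(c, lt_cross_c, lt_local, f1, l1, ln, sn):
    """Live values at cycle c of the outermost loop pattern.

    lt_cross_c : value of LTcross(c)
    lt_local   : function (stage, cycle, "first" | "last") -> int
    f1, l1     : first and last stage of level 1 (assumed meaning)
    ln         : last stage of the innermost level (assumed meaning)
    sn         : number of innermost stages, Sn
    """
    def combination(i):
        split = ln - i   # stages from `split` on are in their last instance
        last_part = sum(lt_local(s, c, "last")
                        for s in range(max(split, f1), l1 + 1))
        first_part = sum(lt_local(s, c, "first")
                         for s in range(f1, split))   # f1 .. ln - 1 - i
        return last_part + first_part

    return lt_cross_c + max(combination(i) for i in range(1, sn + 1))


if __name__ == "__main__":
    # Made-up counts for five stages a..e (indices 0..4), independent of c.
    table = {"first": [1, 2, 1, 1, 0], "last": [1, 1, 3, 1, 2]}
    local = lambda s, c, which: table[which][s]
    print(lt_olp(0, 2, local, f1=0, l1=4, ln=3, sn=3))
```

The maximum over i covers the Sn possible ways in which first and last stage instances can be combined in an OLP instance.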

SLIDE 30

ILES: last stretched local lifetimes

[Figure: final schedule of stages a, b, c, d, e]

Live-out of last stages

\[
\sum_{s = l_n}^{l_1} LT_{local}(s, T, last)
\]

SLIDE 31

ILES: first stretched local lifetimes

[Figure: final schedule of stages a, b, c, d, e]

Live-out of first stages

\[
\sum_{s = f_1}^{f_n - 2} LT_{local}(s, T, first)
\]

SLIDE 32

ILES: stretched cross-iteration lifetimes

[Figure: final schedule of stages a, b, c, d, e]

live-out cross-iteration lifetimes from OLP

\[ LT_{cross}(T) \]

SLIDE 33

ILES: local lifetimes

[Figure: final schedule of stages a, b, c, d, e]

Sn consecutive stages (cyclic)

\[
\max_{l \in [2, n]} \left( \max_{i_0 \in [0, S_l - 1]} \sum_{i = 0}^{S_n - 1} LT_{local}\big(f_l + (i_0 + i) \bmod S_l,\; c,\; first\big) \right)
\]

SLIDE 34

ILES count

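\[
LT_{iles}(c) = LT_{cross}(T) + \sum_{s = l_n}^{l_1} LT_{local}(s, T, last) + \sum_{s = f_1}^{f_n - 2} LT_{local}(s, T, first) + \max_{l \in [2, n]} \left( \max_{i_0 \in [0, S_l - 1]} \sum_{i = 0}^{S_n - 1} LT_{local}\big(f_l + (i_0 + i) \bmod S_l,\; c,\; first\big) \right)
\]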
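The ILES count can be transcribed in the same spirit. The sketch assumes that f, l, and S map each loop level (1 = outermost, n = innermost) to its first stage, last stage, and number of stages, that stages of a level are numbered consecutively, and that the upper bound of the third term reads as f_n − 2; none of these conventions is stated explicitly on the slides. `lt_cross` and `lt_local` are caller-supplied stand-ins.

```python
def lt_iles(c, T, lt_cross, lt_local, f, l, S):
    """Live values at cycle c of an inner loop execution segment.

    lt_cross : function cycle -> int (LTcross)
    lt_local : function (stage, cycle, "first" | "last") -> int (LTlocal)
    f, l, S  : dicts level -> first stage, last stage, number of stages
    """
    n = max(S)   # innermost loop level
    # Stretched lifetimes: live-out values (cycle T) carried through the ILES.
    stretched = (
        lt_cross(T)
        + sum(lt_local(s, T, "last") for s in range(l[n], l[1] + 1))
        # Upper bound read as f_n - 2 (an assumption about the notation).
        + sum(lt_local(s, T, "first") for s in range(f[1], f[n] - 1))
    )
    # Local lifetimes: worst cyclic window of Sn consecutive stages.
    local = max(
        sum(lt_local(f[lvl] + (i0 + i) % S[lvl], c, "first")
            for i in range(S[n]))
        for lvl in range(2, n + 1)
        for i0 in range(S[lvl])
    )
    return stretched + local


if __name__ == "__main__":
    # Made-up three-level example with seven stages (indices 0..6) and T = 2.
    f, l, S = {1: 0, 2: 1, 3: 2}, {1: 6, 2: 5, 3: 4}, {1: 7, 2: 5, 3: 3}
    live_first = [1, 2, 3, 1, 2, 4, 0]
    out_first = [1, 1, 1, 0, 0, 0, 0]
    out_last = [0, 0, 0, 0, 1, 2, 1]

    def local(s, c, which):
        if c == 2:   # c == T: live-out counts
            return out_first[s] if which == "first" else out_last[s]
        return live_first[s] if which == "first" else 0

    print(lt_iles(0, 2, lambda c: 2, local, f, l, S))
```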

SLIDE 35

MaxLive

\[
FatCover_{olp} = \max_{c \in [0, T-1]} LT_{olp}(c)
\]

\[
FatCover_{iles} = \max_{c \in [0, T-1]} LT_{iles}(c)
\]

\[
MaxLive = \max(FatCover_{iles}, FatCover_{olp})
\]
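The final step only combines the two per-cycle counts. A minimal sketch, assuming LTolp and LTiles are available as functions of the cycle c (the function name and argument names are hypothetical):

```python
def max_live_from_patterns(lt_olp, lt_iles, T):
    """MaxLive as the larger of the OLP and ILES fat covers.

    lt_olp, lt_iles : functions cycle -> int, defined for c in 0 .. T-1
    T               : initiation interval of the kernel
    """
    fat_cover_olp = max(lt_olp(c) for c in range(T))
    fat_cover_iles = max(lt_iles(c) for c in range(T))
    return max(fat_cover_iles, fat_cover_olp)


if __name__ == "__main__":
    # Made-up per-cycle counts for a kernel with T = 3.
    olp_counts = [5, 7, 6]
    iles_counts = [4, 9, 3]
    print(max_live_from_patterns(olp_counts.__getitem__,
                                 iles_counts.__getitem__, 3))
```

Only T values of each count are inspected, so this step is cheap once the per-cycle counts are known.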

SLIDE 36

Outline

1. Loop Nest Software-Pipelining
2. Problem Statement
3. Definitions & Issues
4. Fast Register Pressure Computation
5. Experiments
6. Conclusion

SLIDE 37

Experimental Framework

  • ORC 2.1 compiler
  • 1.4 GHz Itanium workstation, 1 GB RAM
  • Livermore, SPEC2000 FP, NPB 2.2 benchmarks
  • 127 loop nests
  • 328 test cases

SLIDE 38

Running Time

  • on the order of 1/1000 sec
  • quadratic running time
  • 3 orders of magnitude faster than the register allocator
  • speedup increases as the loop nest gets deeper
  ⇒ fast enough to use in the SSP framework

[Figure: speedup per benchmark, log scale from 1 to 100000]

SLIDE 39

MaxLive

  • Average: INT=42, FP=15
  • register pressure too high for 43% of loop nests of depth 4

[Figure: amenability ratio (0 to 1) versus loop nest depth (1 to 5)]

SLIDE 40

FP/INT Comparison

FP register pressure stable as loop nest gets deeper
  • pressure never exceeds 64 registers

INT register pressure increases as loop nest gets deeper
  • loop overheads
  • array indexes
  • longer live ranges

[Figure: extra FP register pressure per level, for each benchmark; series: Level 1, Level 2, Level 3 or higher]

SLIDE 41

Register File Size

  • max FP register file size: 64
  • ideal INT/FP ratio: 2
  • 77% and 67% of loop nests of depth 4 and 5 would become feasible

[Figure: total register pressure (FP+INT) and FP/INT ratio for each benchmark]

SLIDE 42

Outline

1. Loop Nest Software-Pipelining
2. Problem Statement
3. Definitions & Issues
4. Fast Register Pressure Computation
5. Experiments
6. Conclusion

SLIDE 43

Conclusion

SSP
  • software-pipelines loop nests at the most profitable level
  • register pressure is however very high

need for a fast method to evaluate register pressure
  • detect infeasible schedules early
  • measure quality of register allocator
  • give an estimate of the actual register needs for future architecture designs

proposed solution
  • deals with issues specific to loop nest SWP and SSP
  • is very fast and can be used before register allocation

future work
  • incremental solution to be integrated into the scheduler
  • different architectures: clustered VLIW, ...
