SLIDE 1

idempotent (ī-dəm-pō-tənt) adj. 1 of, relating to, or being a mathematical quantity which when applied to itself equals itself; 2 of, relating to, or being an operation under which a mathematical quantity is idempotent.

idempotent processing (ī-dəm-pō-tənt prə-ses-iŋ) n. the application of only idempotent operations in sequence; said of the execution of computer programs in units of only idempotent computations, typically to achieve restartable behavior.

SLIDE 2

Static Analysis and Compiler Design for Idempotent Processing

Marc de Kruijf Karthikeyan Sankaralingam Somesh Jha

PLDI 2012, Beijing

SLIDE 3

Example

int sum(int *array, int len) {
  int x = 0;
  for (int i = 0; i < len; ++i)
    x += array[i];
  return x;
}

source code

SLIDE 4

Example

R2 = load [R1]
R3 = 0
LOOP: R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R3

assembly code

(figure: faults, exceptions, and load mis-speculations can interrupt execution at any point)

SLIDE 5

Example

R2 = load [R1]
R3 = 0
LOOP: R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R3

assembly code

SLIDE 6

R0 and R1 are unmodified:

R2 = load [R1]
R3 = 0
LOOP: R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R3

Example


assembly code

just re-execute!

convention: use checkpoints/buffers

SLIDE 7

It’s Idempotent!


idempoh… what…?

int sum(int *data, int len) {
  int x = 0;
  for (int i = 0; i < len; ++i)
    x += data[i];
  return x;
}


SLIDE 8

Idempotent Processing


idempotent regions ALL THE TIME

SLIDE 9

Idempotent Processing

executive summary

(figure: normal compiler vs. custom compiler producing idempotent regions)

idempotence is inhibited by clobber antidependences
how? cut semantic clobber antidependences
low runtime overhead (typically 2-12%)

SLIDE 10


Presentation Overview

❶ Idempotence ❷ Algorithm ❸ Results


SLIDE 11

What is Idempotence?

is this idempotent? Yes

SLIDE 12

What is Idempotence?

how about this? No

SLIDE 13

What is Idempotence?

maybe this? Yes

SLIDE 14

What is Idempotence?

(operation sequences and their dependence chains)

dependence chain      idempotent?
write                 Yes
read, write           No
write, read, write    Yes

it's all about the data dependences

SLIDE 15

What is Idempotence?

(operation sequences and their dependence chains)

dependence chain      idempotent?
write, read           Yes
read, write           No
write, read, write    Yes

it's all about the data dependences

CLOBBER ANTIDEPENDENCE: an antidependence with an exposed read

SLIDE 16

Semantic Idempotence

two types of program state:

(1) local ("pseudoregister") state: can be renamed to remove clobber antidependences; does not semantically constrain idempotence

(2) non-local ("memory") state: cannot be "renamed" to avoid clobber antidependences; semantically constrains idempotence

semantic idempotence = no non-local clobber antidependences
preserve local state by renaming and careful allocation

SLIDE 17


Presentation Overview

❶ Idempotence ❷ Algorithm ❸ Results


SLIDE 18

Region Construction Algorithm

steps one, two, and three

Step 1: transform function (remove artificial dependences, remove non-clobbers)
Step 2: construct regions around antidependences (cut all non-local antidependences in the CFG)
Step 3: refine for correctness & performance (account for loops, optimize for dynamic behavior)

slide-19
SLIDE 19

Step 1: Transform

18

Transformation 1: SSA for pseudoregister antidependences

not one, but two transformations

clobber antidependences region boundaries region identification

But we still have a problem:

depends on

SLIDE 20

Step 1: Transform


Transformation 1: SSA for pseudoregister antidependences

not one, but two transformations

Transformation 2: Scalar replacement of memory variables

before: [x] = a;  b = [x];  [x] = c;
after:  [x] = a;  b = a;    [x] = c;

non-clobber antidependences… GONE!

SLIDE 21

Step 1: Transform


Transformation 1: SSA for pseudoregister antidependences

not one, but two transformations

Transformation 2: Scalar replacement of memory variables

(figure: clobber antidependences, region boundaries, and region identification, with the "depends on" relationships between them)

SLIDE 22

Region Construction Algorithm

steps one, two, and three

Step 1: transform function (remove artificial dependences, remove non-clobbers)
Step 2: construct regions around antidependences (cut all non-local antidependences in the CFG)
Step 3: refine for correctness & performance (account for loops, optimize for dynamic behavior)

SLIDE 23

Step 2: Cut the CFG

cut, cut, cut…

construct regions by "cutting" non-local antidependences

SLIDE 24

Step 2: Cut the CFG

but where to cut…?

(rough sketch: overhead vs. region size; sources of overhead; optimal region size?)

larger is (generally) better: large regions amortize the cost of input preservation

SLIDE 25

Step 2: Cut the CFG

but where to cut…?

goal: the minimum set of cuts that cuts all antidependence paths
intuition: minimum cuts → fewest regions → large regions
approach: a series of reductions:
  minimum vertex multi-cut (NP-complete)
  → minimum hitting set among paths
  → minimum hitting set among "dominating nodes"
details in paper…

SLIDE 26

Region Construction Algorithm

steps one, two, and three

Step 1: transform function (remove artificial dependences, remove non-clobbers)
Step 2: construct regions around antidependences (cut all non-local antidependences in the CFG)
Step 3: refine for correctness & performance (account for loops, optimize for dynamic behavior)

SLIDE 27

Step 3: Loop-Related Refinements

loops affect correctness and performance

correctness: not all local antidependences are removed by SSA…
loop-carried antidependences may clobber, depending on boundary placement; handled as a post-pass

performance: loops tend to execute multiple times…
to maximize region size, place cuts outside of loops; the algorithm is modified to prefer cuts outside of loops
details in paper…

SLIDE 28


Presentation Overview

❶ Idempotence ❷ Algorithm ❸ Results


SLIDE 29

Results

compiler implementation

– paper's compiler implementation in LLVM v2.9
– LLVM v3.1 source code release in the July timeframe

experimental data

(1) runtime overhead (2) region size (3) use case


SLIDE 30

Runtime Overhead

(bar chart: percent overhead per benchmark suite, for instruction count and execution time; gmeans 7.6 and 7.7)

SLIDE 31

Region Size

(log-scale chart: dynamic region size, as the average number of instructions per compiler-generated region, per benchmark suite; gmean 28)

SLIDE 32

Use Case

(bar chart: percent overhead for hardware fault recovery per benchmark suite; gmeans: idempotence 8.2, checkpoint/log 24.0, instruction TMR 30.5)

SLIDE 33

Presentation Overview

❶ Idempotence ❷ Algorithm ❸ Results

SLIDE 34

Summary & Conclusions


idempotent processing

– large (low-overhead) idempotent regions all the time

static analysis, compiler algorithm

– (a) remove artifacts (b) partition (c) compile

summary

low overhead

– 2-12% runtime overhead typical

SLIDE 35

Summary & Conclusions


several applications already demonstrated

– CPU hardware simplification (MICRO ’11) – GPU exceptions and speculation (ISCA ’12) – hardware fault recovery (this paper)

conclusions

future work

– more applications, hybrid techniques – optimal region size? – enabling even larger region sizes

SLIDE 36

Back-up Slides


SLIDE 37

Error recovery


mis-speculation (e.g. branch misprediction)

– compiler handles for pseudoregister state – for non-local memory, store buffer assumed

arbitrary failure (e.g. hardware fault)

– ECC and other verification assumed – variety of existing techniques; details in paper

exceptions

– generally no side-effects beyond out-of-order-ness – fairly easy to handle

dealing with side-effects

SLIDE 38

Optimal Region Size?

it depends… (rough sketch, not to scale: overhead vs. region size, trading off detection latency, register pressure, and re-execution time)

SLIDE 39

Prior Work


relating to idempotence

Technique                  Year  Domain
Sentinel Scheduling        1992  Speculative memory re-ordering
Fast Mutual Exclusion      1992  Uniprocessor mutual exclusion
Multi-Instruction Retry    1995  Branch and hardware fault recovery
Atomic Heap Transactions   1999  Atomic memory allocation
Reference Idempotency      2006  Reducing speculative storage
Restart Markers            2006  Virtual memory in vector machines
Data-Triggered Threads     2011  Data-triggered multi-threading
Idempotent Processors      2011  Hardware simplification for exceptions
Encore                     2011  Hardware fault recovery
iGPU                       2012  GPU exception/speculation support

SLIDE 40

Detailed Runtime Overhead

(bar chart: percent overhead per benchmark suite, instruction count and execution time, gmeans 7.6 and 7.7; outliers explained by non-idempotent inner loops + high register pressure)

SLIDE 41

Detailed Region Size

(log-scale chart: average dynamic region size per benchmark suite; gmeans: compiler-generated 28, ideal 116, ideal w/o outliers 45; outlier regions exceed 1,000,000 instructions owing to limited aliasing information)