

slide-1
SLIDE 1

October 1, 2013 Manuel Mohr, Artjom Grudnitsky, Tobias Modschiedler, Lars Bauer, Sebastian Hack, Jörg Henkel – Hardware Acceleration for Programs in SSA Form IPD, ITEC

Institute for Program Structures and Data Organization, Karlsruhe Institute of Technology (KIT)

Hardware Acceleration for Programs in SSA Form

Manuel Mohr, Artjom Grudnitsky, Tobias Modschiedler, Lars Bauer, Sebastian Hack, Jörg Henkel

KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association

www.kit.edu

Saarland University, Computer Science

slide-2
SLIDE 2

SSA-Based Register Allocation

Compiler stages: Front end (Parsing), Middle end (Optimizations), Back end (Register Allocation)
Each stage is marked as working on a program that is either not in Static Single Assignment (SSA) form or in SSA form

slide-5
SLIDE 5

SSA-Based Register Allocation

Compiler stages: Front end (Parsing), Middle end (Optimizations), Back end (SSA-Based Register Allocation)
Each stage is marked as working on a program that is either not in Static Single Assignment (SSA) form or in SSA form

Fewer spills but more shuffle code

slide-7
SLIDE 7

Register Transfer Graphs

Shuffle code = parallel copy operations between registers

Register Transfer Graph (RTG), example over r1 ... r5
  Nodes: Registers
  Directed edge (r1, r2): After copies, value of r1 must be in r2
  At most one incoming edge per node
  No incoming edge: Register value is irrelevant after copies
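The parallel-copy semantics of an RTG can be made concrete with a small simulation. The following Python sketch is illustrative only: the function name `apply_rtg`, the edge-list representation and the register values are assumptions and do not come from the slides.

```python
def apply_rtg(regs, edges):
    """Execute all copies of an RTG at once.

    regs  -- dict mapping register name to its current value
    edges -- list of (src, dst) pairs: afterwards dst holds the old value of src
    """
    targets = [dst for _, dst in edges]
    if len(targets) != len(set(targets)):
        raise ValueError("RTG nodes may have at most one incoming edge")
    old = dict(regs)   # snapshot, so every copy reads the value from before the RTG
    new = dict(regs)
    for src, dst in edges:
        new[dst] = old[src]
    return new


regs = {"r1": 10, "r2": 20, "r3": 30}
# r1 and r2 are swapped; the value of r1 is additionally duplicated into r3.
print(apply_rtg(regs, [("r1", "r2"), ("r2", "r1"), ("r1", "r3")]))
```

Reading every source from the snapshot is what distinguishes a parallel copy from a sequence of ordinary moves, where an earlier move could overwrite a value that a later one still needs.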

slide-9
SLIDE 9

Motivation

Number and size of RTGs depend on quality of allocation
Reduction is an NP-complete problem
Example RTG over r1 ... r8
⇒ On standard hardware, implementation may be expensive: 5% to 20% of all generated instructions (SPEC)

Implementation of the example RTG (3 copies followed by 4 swaps of 3 xors each, 15 instructions):
  mov r2, r1
  mov r3, r2
  mov r7, r8
  xor r6, r7
  xor r7, r6
  xor r6, r7
  xor r6, r5
  xor r5, r6
  xor r6, r5
  xor r5, r4
  xor r4, r5
  xor r5, r4
  xor r4, r3
  xor r3, r4
  xor r4, r3
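The xor triples in the instruction listing are the usual way to swap two registers without a temporary register. A minimal Python sketch of that idiom follows; the helper name `xor_swap` and the values are assumptions chosen for illustration.

```python
def xor_swap(regs, a, b):
    """Swap regs[a] and regs[b] with three xors and no temporary register.

    a and b must name two different registers; with a == b the value would
    be cleared to zero instead of being kept.
    """
    regs[a] ^= regs[b]
    regs[b] ^= regs[a]
    regs[a] ^= regs[b]


regs = {"r6": 42, "r7": 23}
xor_swap(regs, "r6", "r7")
print(regs)
```

Each swap costs three instructions on standard hardware, which is why a cycle in an RTG is considerably more expensive than a plain chain of copies.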

slide-11
SLIDE 11

Motivation

Question 1: Is it possible to create an instruction set extension that allows implementing an RTG in one processor cycle?
Question 2: Is it worth it?

slide-15
SLIDE 15

Fundamental Hardware Constraints

Changing contents of multiple registers in one cycle very costly
Idea: Modify access to register file instead of contents
  Swap r1 and r2: Exchange the access to r1 and r2
  Example register file: entries 42 and 23, accessed as r1 and r2
⇒ Restriction to permutations of registers

slide-17
SLIDE 17

ISA Extension

Add permutation instructions to SPARC V8 ISA
  32 registers ⇒ 5 bits to identify one register
  7 bits for opcode ⇒ 25 bits left for encoding 5 register numbers

Instruction word layout (figure): fields 0001, 000, a1, a2, b, c, d, e; bit-position labels 31, 27, 24, 21, 19, 14, 9, 4

Two new instructions:
  permi5: Implement cyclic RTG with up to 5 elements
  permi23: Implement two independent cycles with 2 and up to 3 elements

slide-20
SLIDE 20

Examples

Cycle over r1 ... r5: permi5 r1, r2, r3, r4, r5
Cycle over r1, r2: permi5 r1, r2
Two cycles over r1, r2 and r3, r4: permi23 r1, r2, r3, r4

slide-23
SLIDE 23

Code Generation

Goal: Generate efficient code using permi instructions for all RTGs
Question: Which RTGs can be implemented using only permi?

RTGs in permutation form (example over r1 ... r5)
  Permutation can be written as a product of cycles
  Cycles can be implemented with permis

In general: RTGs can duplicate values (example over r1 ... r5)
  Permutations are injective
  Value duplication impossible
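Both observations can be checked mechanically. The sketch below is a hedged illustration in Python: the names `is_permutation_form` and `cycles_of` and the (src, dst) edge-list representation are assumptions. It tests whether an RTG duplicates a value and, if it does not, splits it into cycles. Open chains are returned as cycles too, on the reasoning that the head register of a chain has no incoming edge and so its final value is irrelevant.

```python
def is_permutation_form(edges):
    """True if no register is the source or the target of more than one edge."""
    sources = [src for src, _ in edges]
    targets = [dst for _, dst in edges]
    return len(set(sources)) == len(sources) and len(set(targets)) == len(targets)


def cycles_of(edges):
    """Split an RTG in permutation form into lists of registers in copy order.

    A chain r1 -> r2 -> r3 is returned as [r1, r2, r3] and can be treated
    like a cycle, because the final value of its head register is irrelevant.
    """
    if not is_permutation_form(edges):
        raise ValueError("RTG duplicates a value or writes a register twice")
    succ = dict(edges)
    has_pred = set(succ.values())
    # Visit chain heads first so that every chain is walked from its start.
    starts = [n for n in succ if n not in has_pred] + list(succ)
    seen, cycles = set(), []
    for start in starts:
        node, cycle = start, []
        while node is not None and node not in seen:
            seen.add(node)
            cycle.append(node)
            node = succ.get(node)
        if cycle:
            cycles.append(cycle)
    return cycles


print(cycles_of([("r1", "r2"), ("r2", "r3"), ("r3", "r1"), ("r4", "r5")]))
print(is_permutation_form([("r1", "r2"), ("r1", "r3")]))  # r1 is duplicated
```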

slide-26
SLIDE 26

Two-Phase Approach

Arbitrary RTG over r1 ... r9

Phase 1: Conversion
  RTG in permutation form over r1, r3, r4, r6, r9 and r5, r8, plus copies:
  mov r3, r2
  mov r6, r7
  mov r4, r5

Phase 2: Decomposition
  permi5 r1, r3, r4, r6, r9
  permi5 r5, r8

slide-34
SLIDE 34

Conversion into Permutation Form

Example: RTG over r1 ... r6 → RTG in permutation form over r1, r3, r4, r5, r6 + mov r3, r2
At each node with > 1 outgoing edge: keep edge that is part of longest path starting at node
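The longest-path rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' implementation: the function name `to_permutation_form` and the edge list in the usage example are hypothetical, since the edges of the RTG in the figure are not recoverable from the slide text. The sketch only decides which edges are kept and which become copies. It does not schedule the resulting mov instructions relative to the permutation.

```python
def to_permutation_form(edges):
    """Split RTG edges into (kept, copies) following the longest-path rule."""
    succs = {}
    for src, dst in edges:
        succs.setdefault(src, []).append(dst)

    def longest(node, on_path):
        # Length (in edges) of the longest simple path starting at node.
        best = 0
        for nxt in succs.get(node, []):
            if nxt not in on_path:
                best = max(best, 1 + longest(nxt, on_path | {nxt}))
        return best

    kept, copies = [], []
    for src, dsts in succs.items():
        # Keep the edge that starts the longest path; the others become movs.
        keep = max(dsts, key=lambda d: longest(d, {src, d}))
        for dst in dsts:
            (kept if dst == keep else copies).append((src, dst))
    return kept, copies


# Hypothetical RTG over r1 ... r6 in which r3 is copied to both r2 and r4.
edges = [("r1", "r3"), ("r3", "r2"), ("r3", "r4"), ("r4", "r5"), ("r5", "r6")]
kept, copies = to_permutation_form(edges)
print(kept)    # chain r1 -> r3 -> r4 -> r5 -> r6
print(copies)  # the edge r3 -> r2, to be implemented with a mov
```

Keeping the longest path leaves as much of the RTG as possible to the permutation instructions, so only the short side branches cost an extra copy.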

slide-38
SLIDE 38

Decomposition into Cycles

After conversion: Implement RTG in permutation form with as few permis as possible
Need to combine multiple cycles to exploit permi23
Example: RTG over r1 ... r9 implemented with one permi5 and one permi23

slide-45
SLIDE 45

Decomposition into Cycles

Greedy decomposition algorithm with linear runtime

Phase 1
  While there is a cycle of size 4 or more: use permi5 to implement it

Phase 2: Only cycles of size ≤ 3 left
  If 2-cycle and 3-cycle available: combine using permi23
  If only 2-cycles available: combine in pairs using permi23
  If only 3-cycles available: combine in groups of three using permi23
  Example: three 3-cycles over r1 ... r9 implemented with two permi23
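To see how the two phases interact, the following Python sketch counts the instructions the greedy strategy would emit for a given list of cycle sizes. It is a simplified model under stated assumptions: the function name `count_permis` is hypothetical, a permi5 applied to a larger cycle is taken to shorten it by 4 elements (a k-cycle is the product of a 5-cycle and a (k-4)-cycle sharing one element), and leftover short cycles that find no partner are assumed to be handled by one permi5 each, which the slides do not state.

```python
def count_permis(cycle_sizes):
    """Count the permi5/permi23 instructions of the greedy decomposition."""
    permi5 = permi23 = 0
    twos = threes = 0
    # Phase 1: one permi5 covers up to 5 elements and shortens a cycle by 4.
    for size in cycle_sizes:
        while size >= 4:
            permi5 += 1
            size -= 4
        if size == 2:
            twos += 1
        elif size == 3:
            threes += 1
    # Phase 2: pair each 2-cycle with a 3-cycle in one permi23.
    pairs = min(twos, threes)
    permi23 += pairs
    twos -= pairs
    threes -= pairs
    # Only 2-cycles left: two of them per permi23.
    permi23 += twos // 2
    # Only 3-cycles left: three of them fit into two permi23.
    permi23 += 2 * (threes // 3)
    # Leftovers (assumption): one permi5 per remaining short cycle.
    permi5 += twos % 2 + threes % 3
    return {"permi5": permi5, "permi23": permi23}


print(count_permis([4, 2, 3]))     # one permi5, one permi23
print(count_permis([2, 2, 2, 2]))  # two permi23
print(count_permis([3, 3, 3]))     # two permi23
```

The loop touches every cycle a constant number of times per four elements, which is consistent with the linear runtime claimed on the slide.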

slide-46
SLIDE 46

Base Architecture

Underlying architecture: Gaisler LEON3, 7-stage pipeline
Pipeline stages: Fetch, Decode, Register, Execute, Memory, Exception, Writeback

Example: add r9 r5 r7 (figure: pipeline walk-through)
  Instruction word (add, 9, 5, 7) fetched from I$
  Operand registers r5 and r7 read from register file (r5 = 1233, r7 = 3105, r9 = 7404)
  ALU computes result data 1233 + 3105 = 4338
  Operation, result register r9 and result data 4338 are passed on from stage to stage
  Result written to register file (r5 = 1233, r7 = 3105, r9 = 4338)

For permi support: modifications of Decode and Exception stages

slide-47
SLIDE 47

Permutation Support

Key component: permutation table in Decode stage
  Contains mapping logical → physical register address
  Physical address from permutation table used when accessing register file
  Initialized with identity at system reset

Figure (Fetch, Decode, Register stages) for instruction word add, 9, 5, 7:
  Permutation table (log → phys): r5 → r8, r7 → r6, r9 → r7
  Lookup of physical register addresses: operand registers r8 and r6, result register r7
  Register file: r6 = 6410, r7 = 3105, r8 = 7404; operand data 7404 and 6410
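The lookup can be mimicked in software. The Python sketch below is a simplified model, not the hardware design: the function name `execute_add` and the dictionary-based register file are assumptions. The table entries r5 → r8, r7 → r6, r9 → r7 and the register contents are the ones shown in the figure; the two remaining entries, r6 → r9 and r8 → r5, are taken from the permi5 example of the "Applying new Permutations" slide so that the table stays a bijection.

```python
def execute_add(perm_table, reg_file, rd, rs1, rs2):
    """Run 'add rd, rs1, rs2' given logical register numbers.

    perm_table maps logical to physical register numbers; reg_file is
    indexed by physical register number. Returns the physical result register.
    """
    phys_rs1 = perm_table[rs1]
    phys_rs2 = perm_table[rs2]
    phys_rd = perm_table[rd]
    reg_file[phys_rd] = reg_file[phys_rs1] + reg_file[phys_rs2]
    return phys_rd


perm_table = {r: r for r in range(32)}               # identity after reset
perm_table.update({5: 8, 6: 9, 7: 6, 8: 5, 9: 7})   # mapping from the slides
reg_file = {r: 0 for r in range(32)}
reg_file.update({6: 6410, 7: 3105, 8: 7404})

# add r9 r5 r7 reads physical r8 and r6 and writes physical r7.
print(execute_add(perm_table, reg_file, rd=9, rs1=5, rs2=7), reg_file[7])
```

The instruction itself is unchanged; only the register addresses pass through one extra table lookup before the register file is accessed.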

slide-48
SLIDE 48

Applying new Permutations

Applying permutation permi5 r5 r9 r7 r6 r8

Figure (Fetch, Decode stages), instruction word: permi, 5, 9, 7, 6, 8
  Retrieve old permutation (log → phys): r5 → r5, r6 → r6, r7 → r7, r8 → r9, r9 → r8
  Generate new permutation: r5 → r8, r6 → r9, r7 → r6, r8 → r5, r9 → r7
  Write new permutation

Permutation applied in Decode stage (early committing)
  No changes to forwarding logic required
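The table update in this example follows a simple pattern: each logical register in the cycle takes over the physical register that its successor in the operand list had before. The Python sketch below reproduces the numbers of the slide under that reading; the function name `apply_cycle` is an assumption, and the rule is inferred from this one example rather than from a specification.

```python
def apply_cycle(table, cycle):
    """Return the permutation table after a permi with the given cycle.

    table maps logical to physical register numbers. Every register in the
    cycle receives the old physical register of the next one in the list.
    """
    new = dict(table)
    for i, reg in enumerate(cycle):
        new[reg] = table[cycle[(i + 1) % len(cycle)]]
    return new


old = {5: 5, 6: 6, 7: 7, 8: 9, 9: 8}          # old permutation from the slide
new = apply_cycle(old, [5, 9, 7, 6, 8])       # permi5 r5 r9 r7 r6 r8
print(new)
```

Only table entries change; no register contents are moved, which is what makes the operation cheap enough for a single cycle.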

slide-49
SLIDE 49

Results

Experimental evaluation
  Implemented code generation strategy in libFIRM
  Used SPEC CPU2000 benchmark suite as input programs
  Modified SPARC emulator to support permi instructions
    Ability to get precise dynamic instruction counts
  Validation by measurements on FPGA prototype implementation
    By running Linux on FPGA prototype, ability to reuse executables

slide-50
SLIDE 50

Compile Time

                   Default [ms]   Our code gen. [ms]   Relative
Backend (total)    63 598.0       63 927.0             +0.5%

Code generation does not cause significant overhead

slide-51
SLIDE 51

Code Quality

Four different register allocator configurations: ILP, Recoloring, Biased, Naive
From ILP to Naive:
  Increasing RTG size
  Increasing number of RTGs
  Decreasing compilation time

slide-53
SLIDE 53

Code Quality

Relative change of number of executed instructions

Benchmark      ILP      Recoloring   Biased   Naive
164.gzip       −0.7%    −1.0%        −1.9%    −16.4%
175.vpr        −0.3%    −0.3%        −1.0%    −3.4%
176.gcc        −0.4%    −0.5%        −2.7%    −11.4%
181.mcf        −1.9%    −1.9%        −2.8%    −7.8%
186.crafty     −1.0%    −0.8%        −3.9%    −15.2%
197.parser     −0.9%    −1.0%        −2.7%    −12.6%
253.perlbmk    −0.6%    −0.1%        −1.8%    −9.9%
254.gap        −0.3%    −0.9%        −2.0%    −7.1%
255.vortex     −0.5%    −0.8%        −5.1%    −15.1%
256.bzip2      −0.3%    −0.6%        −3.1%    −11.3%
300.twolf      −0.3%    −0.3%        −0.8%    −1.9%

Universal reduction, up to 5.1% for realistic scenarios
The worse the register allocation, the higher the benefit using permis
Confirmation by FPGA measurements, speedup up to 1.07

slide-54
SLIDE 54

Area Overhead

              Base system   Our system   Overhead
Frequency     80 MHz        80 MHz       0%
BlockRAMs     28            28           0%
Flip-flops    7 607         8 851        16%
LUTs          15 024        21 630       44%
Slices        7 249         9 507        31%

Frequency unaffected
Logical-physical mapping ⇒ increase in FF usage
Large multiplexers ⇒ increase in LUT usage
  Considerably smaller overhead for ASIC implementation

slide-55
SLIDE 55

Conclusion

Summary
  Novel approach to accelerate shuffle code by hardware extension
  New instructions added to standard instruction set
  Code generation approach producing efficient code fast
  Extensive evaluation including FPGA prototype implementation
  Universal speedup, instruction count reduction up to 5.1%

slide-56
SLIDE 56

Backup Slides

slide-58
SLIDE 58

RTG Semantics

Figure: RTG over registers r1 ... r6

slide-61
SLIDE 61

Exception Handling

Early committing can cause problems due to traps
  Timer interrupts to invoke OS scheduler
  SPARC window overflows/underflows caused by nested function calls

Trap handling in LEON3 (figure: instruction sequence mov, call, permi and the trap handler in the pipeline stages Fetch, Decode, Register, Execute, Memory, Exception, Writeback)
  permi executed twice, so the permutation is applied twice → program crash

Instructions that commit after exception stage can be annulled
permi: revert effect of permutations executed before trap

slide-62
SLIDE 62

Reverting Permutations

Permutation history buffer tracks last 4 instructions
Exception Stage: if a trap occurs, check permutation history buffer for permi instructions
If any occur, go through history buffer in reverse order
  For each permi: apply inverse permutation to permutation table

Figure: the history buffer holds one permutation cycle entry for each of the stages Register, Execute, Memory, Exception. In the Decode stage, a cycle select chooses between the user-defined cycle and an inverted cycle from the history buffer; the old permutation is retrieved, the new permutation is generated (r5 → r5, r6 → r6, r7 → r7, r8 → r9, r9 → r8) and written to the permutation table.

permis will be re-executed after trap handler ⇒ Register File in expected state
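The revert mechanism can be modelled in a few lines. The Python sketch below is a simplified software model, not the hardware design: the class `PermutationUnit`, its method names and the use of a bounded deque as history buffer are assumptions. Non-permi instructions are recorded as `None`, and the inverse of a permutation cycle is obtained by applying the cycle in reversed order.

```python
from collections import deque


def apply_cycle(table, cycle):
    """Each register in the cycle takes the old mapping of the next one."""
    new = dict(table)
    for i, reg in enumerate(cycle):
        new[reg] = table[cycle[(i + 1) % len(cycle)]]
    return new


class PermutationUnit:
    def __init__(self, num_regs=32, depth=4):
        self.table = {r: r for r in range(num_regs)}   # identity at reset
        self.history = deque(maxlen=depth)             # last `depth` instructions

    def decode(self, cycle=None):
        """Decode one instruction; cycle is None for anything but a permi."""
        self.history.append(cycle)
        if cycle is not None:
            self.table = apply_cycle(self.table, cycle)   # early commit

    def trap(self):
        """Undo all permis still in the history buffer, newest first."""
        while self.history:
            cycle = self.history.pop()
            if cycle is not None:
                self.table = apply_cycle(self.table, cycle[::-1])


unit = PermutationUnit()
unit.decode([8, 9])                 # swap r8 and r9
for _ in range(4):
    unit.decode()                   # four other instructions: the swap leaves the buffer
unit.decode([5, 9, 7, 6, 8])        # permi5 r5 r9 r7 r6 r8
unit.decode()
unit.trap()                         # only the permi5 is reverted
print({r: unit.table[r] for r in range(5, 10)})
```

In this model the table returns to r5 → r5, r6 → r6, r7 → r7, r8 → r9, r9 → r8, the state shown on the slide, so the permi5 can safely be executed again once the trap handler returns.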

slide-63
SLIDE 63

Reversion Effects

Figure: revert time / total time per benchmark (164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf); logarithmic axis from 10^-8 to 10^-3

slide-64
SLIDE 64

φ-functions

Source program:
  x = ...; y = ...;
  if (...) { t = x; x = y; y = t; }
  a = x; b = y;

SSA form:
  x = ...
  y = ...
  condjump
  a = φ(x, y)
  b = φ(y, x)

slide-65
SLIDE 65

φ-functions

SSA form with assigned registers (register in brackets):
  x[r1] = ...
  y[r2] = ...
  condjump
  a[r1] = φ(x[r1], y[r2])
  b[r2] = φ(y[r2], x[r1])

slide-66
SLIDE 66

φ-functions

SSA form with assigned registers (register in brackets) and RTG over r1, r2:
  x[r1] = ...
  y[r2] = ...
  condjump
  a[r1] = φ(x[r1], y[r1])
  b[r2] = φ(y[r2], x[r2])
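How φ-functions give rise to an RTG can be sketched as follows. This Python example is an illustration under assumptions: the function name `rtg_for_predecessor` and the data layout are hypothetical, and the register assignment (x and a in r1, y and b in r2) is the one from the preceding slide. For each predecessor of the join block, every φ-function whose argument and result live in different registers contributes one edge.

```python
def rtg_for_predecessor(phis, reg_of, pred):
    """Collect the register copies that the phi-functions need on one edge.

    phis   -- list of (result, args); args[i] is the value from predecessor i
    reg_of -- dict mapping each variable to its assigned register
    pred   -- index of the predecessor block
    """
    edges = []
    for result, args in phis:
        src, dst = reg_of[args[pred]], reg_of[result]
        if src != dst:            # same register: nothing to copy
            edges.append((src, dst))
    return edges


phis = [("a", ("x", "y")), ("b", ("y", "x"))]
reg_of = {"x": "r1", "y": "r2", "a": "r1", "b": "r2"}
print(rtg_for_predecessor(phis, reg_of, 0))  # no copies needed
print(rtg_for_predecessor(phis, reg_of, 1))  # r1 and r2 must be swapped
```

The second edge yields a 2-cycle between r1 and r2, which is exactly the kind of RTG a single permi5 r1, r2 can implement.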