Dealing with Register Hierarchies Matthias Braun (MatzeB) / LLVM - - PowerPoint PPT Presentation

SLIDE 1

Dealing with Register Hierarchies

[Figure: FP Register: Q0 overlaps D0 and D1; D0 overlaps S0 and S1; D1 overlaps S2 and S3]

Matthias Braun (MatzeB) / LLVM Developers' Meeting 2016

[Figure: 4 Tuple Class: r0,r1,r2,r3  r1,r2,r3,r4  r2,r3,r4,r5  r3,r4,r5,r6  r4,r5,r6,r7  r5,r6,r7,r8  r6,r7,r8,r9  ...]

[Figure: registers and tuples containing r3: r0;r1;r2;r3  r1;r2;r3;r4  r1;r2;r3  r2;r3  r3  r3;r4  r2;r3;r4  r3;r4;r5]

SLIDE 2

Register Allocation

  • Rewrite program with unlimited number of virtual registers to use actual registers
  • Techniques: Interference Checks, Assignment, Spilling, Splitting, Rematerialization

Before (virtual registers):

    %0 = const 5
    %1 = const 7
    %2 = add %0, %1
    return %2

After (actual registers):

    r0 = const 5
    r1 = const 7
    r0 = add r0, r1
    return r0

SLIDE 3

Register Allocation for GPUs

  • Hundreds of registers available, but using fewer increases parallelism
  • Mix of Scalar (single value) and Vector (multiple values) operations
  • Load/Store instructions work on multiple registers (high latency, high throughput)

    r[0:3] = load_x4    # Load r0, r1, r2, r3
    r4 = add r0, 1
    r5 = add r1, 2
    r6 = add r2, 3
    r7 = add r3, 4
    store_x4 r[4:7]     # Store r4, r5, r6, r7

SLIDE 6

Liveness Tracking

  • Linearize program
  • Number instructions consecutively (SlotIndexes)
  • Liveness as sorted list of intervals (segments)

        %0 = def
        cmp ...
        jeq b2
    b1:
        %1 = const 5
        jmp b3
    b2:
        store %0
        %1 = def
    b3:
        %2 = add %1, 1

[Figure: live ranges of %0, %1, %2 drawn against SlotIdx 1 2 3 4 5 6 7 8 9 10]

    %1: [4:6)[8:9)[9:10) …
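
A minimal sketch of why a sorted list of half-open segments is convenient for interference checks. The `%1` segments are the ones listed on the slide; the function name `overlaps` and the two other ranges are assumptions made up for this illustration and are not LLVM API.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # half-open [start, end) in slot indexes

def overlaps(a: List[Segment], b: List[Segment]) -> bool:
    """Return True if two sorted segment lists share any slot index."""
    i = j = 0
    while i < len(a) and j < len(b):
        a_start, a_end = a[i]
        b_start, b_end = b[j]
        if a_start < b_end and b_start < a_end:
            return True
        # Advance whichever segment ends first.
        if a_end <= b_end:
            i += 1
        else:
            j += 1
    return False

live_1 = [(4, 6), (8, 9), (9, 10)]   # %1, as listed on the slide
in_hole = [(6, 8)]                   # hypothetical range inside the hole of %1
clashing = [(5, 7)]                  # hypothetical range that crosses [4:6)

print(overlaps(live_1, in_hole))     # False
print(overlaps(live_1, clashing))    # True
```

Because both lists are sorted, the check walks each list once instead of comparing every pair of segments.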

SLIDE 7

Modeling Register Hierarchies

SLIDE 8

Tuple Registers

    r[0:3] = load_x4
    r4 = add r0, 1
    r5 = add r1, 2
    r6 = add r2, 3
    r7 = add r3, 4
    store_x4 r[4:7]

SLIDE 10

Tuple Registers

    %0,%1,%2,%3 = load_x4
    %4 = add %0, 1
    %5 = add %1, 2
    %6 = add %2, 3
    %7 = add %3, 4
    store_x4 %4,%5,%6,%7

❌ No relation between the virtual registers, but they need to be consecutive

SLIDE 12

Tuple Registers

    %0 = load_x4
    %1.sub0 = add %0.sub0, 1
    %1.sub1 = add %0.sub1, 2
    %1.sub2 = add %0.sub2, 3
    %1.sub3 = add %0.sub3, 4
    store_x4 %1

4 Tuple Class: r0,r1,r2,r3  r1,r2,r3,r4  r2,r3,r4,r5  r3,r4,r5,r6  r4,r5,r6,r7  r5,r6,r7,r8  r6,r7,r8,r9  ...

  • Register class contains tuples
  • Allocator picks a single (tuple) register
  • Parts called subregisters or lanes
  • Select parts with subregister index (.xxx Syntax)
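
A small sketch of the idea that a register class can contain tuples and that a subregister index selects one part. The helper names `make_tuple_class` and `subreg` are invented for this illustration; in LLVM the tuples are generated by TableGen from the target description.

```python
def make_tuple_class(base_regs, width):
    """All tuples of `width` consecutive base registers."""
    return [tuple(base_regs[i:i + width])
            for i in range(len(base_regs) - width + 1)]

def subreg(tuple_reg, index):
    """Select one lane of a tuple register, like the `.subN` syntax."""
    return tuple_reg[index]

base = [f"r{i}" for i in range(10)]     # r0 .. r9 (assumed register file size)
tuple4 = make_tuple_class(base, 4)

print(tuple4[0])              # ('r0', 'r1', 'r2', 'r3')
print(subreg(tuple4[2], 1))   # r3, i.e. sub1 of r2,r3,r4,r5
```

The allocator then only has to pick one element of `tuple4` for a virtual register; the consecutive-register constraint is built into the class.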
SLIDE 15

Construction

  • reg_sequence defines multiple subregisters (for SSA)
    (there is also insert_subreg, extract_subreg)

    %0 = load
    %1 = const 42
    %2 = reg_sequence %0, sub0, %1, sub1
    store_x2 %2

  • TwoAddressInstruction pass translates to copy sequence

    %0 = load
    %1 = const 42
    %2.sub0<undef> = copy %0
    %2.sub1 = copy %1
    store_x2 %2

  • RegisterCoalescing pass eliminates copies

    %2.sub0<undef> = load
    %2.sub1 = const 42
    store_x2 %2
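
The translation into a copy sequence can be sketched on a toy textual IR. This is an illustration under assumptions, not the TwoAddressInstruction pass itself: `lower_reg_sequence` is a hypothetical helper, and its operands are given as (source, subregister index) pairs.

```python
def lower_reg_sequence(dst, pairs):
    """Expand a reg_sequence into one subregister copy per operand pair.

    `pairs` is a list of (source_vreg, subregister_index) tuples.
    The first copy is flagged <undef>: the other lanes of `dst` hold no
    value yet, so this partial write does not read them.
    """
    copies = []
    for n, (src, subidx) in enumerate(pairs):
        flag = "<undef>" if n == 0 else ""
        copies.append(f"{dst}.{subidx}{flag} = copy {src}")
    return copies

for line in lower_reg_sequence("%2", [("%0", "sub0"), ("%1", "sub1")]):
    print(line)
# %2.sub0<undef> = copy %0
# %2.sub1 = copy %1
```

Once the copies exist as ordinary instructions, a coalescer can try to merge each source with the matching lane of `%2`, which is what removes them in the last step on the slide.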

SLIDE 16

Improving Register Allocation

SLIDE 19

Subregister Liveness

    %0 = load_x4
    %1.sub0 = add %0.sub0, 1
    %1.sub1 = add %0.sub1, 2
    %1.sub2 = add %0.sub2, 3
    %1.sub3 = add %0.sub3, 4
    store_x4 %1

[Figure: liveness of %0 and %1 tracked separately for sub0, sub1, sub2, sub3]

Can allocate %0 and %1 to the same register tuple
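
A rough sketch of why per-lane liveness permits this. The instruction numbers and the interval model are assumptions for illustration: instruction 0 is the load, 1 to 4 are the adds, 5 is the store, and a read at an instruction ends before the write at the same instruction begins (half-open intervals).

```python
def overlap(a, b):
    """Do two half-open intervals [start, end) intersect?"""
    return a[0] < b[1] and b[0] < a[1]

# Lane k of %0 is defined by the load and last read by add number k + 1.
# Lane k of %1 is defined by add number k + 1 and read by the store.
lanes_v0 = {k: (0, k + 1) for k in range(4)}
lanes_v1 = {k: (k + 1, 5) for k in range(4)}

# Whole-register view: each register is live while any of its lanes is.
whole_v0 = (0, 4)
whole_v1 = (1, 5)

whole_conflict = overlap(whole_v0, whole_v1)
lane_conflict = any(overlap(lanes_v0[k], lanes_v1[k]) for k in range(4))

print(whole_conflict)  # True: as whole registers they interfere
print(lane_conflict)   # False: lane by lane they never do
```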

SLIDE 23

Subregister Liveness: Lane Masks

  • Lane Mask: 1 bit per subregister
  • Annotate subregister liveness parts with lane mask
  • Start with whole virtual register; Split and refine as necessary

Lane Masks:

    sub0:      0b0001
    sub1:      0b0010
    sub2:      0b0100
    sub1_sub2: 0b0110
    sub3:      0b1000
    all:       0b1111

    %0 = load_x4
    store_x4 %0
    %1 = load_x4
    %1.sub0 = const 13
    %1.sub3 = const 42
    store_x4 %1

Lane Mask: %0: 1111
Lane Mask: %1: 0101 0001 1000
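
Lane masks are plain bit sets, so the operations needed on them are bitwise. The following sketch uses the mask values from the slide; the helper names `lanes_overlap` and `remaining_lanes` are made up here and do not correspond to a specific LLVM function.

```python
LANE_MASK = {
    "sub0": 0b0001,
    "sub1": 0b0010,
    "sub2": 0b0100,
    "sub3": 0b1000,
}
# A combined index covers the union of the lanes it is made of.
LANE_MASK["sub1_sub2"] = LANE_MASK["sub1"] | LANE_MASK["sub2"]
ALL_LANES = 0b1111

def lanes_overlap(a, b):
    """Two subregister accesses touch a common lane iff their masks intersect."""
    return (a & b) != 0

def remaining_lanes(mask, written):
    """Lanes of `mask` that a write to `written` leaves untouched."""
    return mask & ~written & ALL_LANES

untouched = remaining_lanes(ALL_LANES, LANE_MASK["sub0"] | LANE_MASK["sub3"])
print(bin(LANE_MASK["sub1_sub2"]))   # 0b110
print(bin(untouched))                # 0b110
```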

SLIDE 32

Assignment Heuristics

[Figure: pieces "To Assign" being placed into registers r0 r1 r2 r3 r4 r5]

  • Default: Assign in program order
  • Wide pieces may not fit in holes left by small ones
  • Tweak: Prioritize bigger classes
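
The effect of the assignment order can be reproduced with a toy first-fit allocator. Everything in this sketch is an assumption made for illustration (three registers, the three live ranges, the function names); it is not the greedy allocator, only the ordering idea.

```python
def overlaps(segments, start, end):
    """True if [start, end) intersects any half-open segment in the list."""
    return any(s < end and start < e for s, e in segments)

def assign(vregs, num_regs, order):
    """First-fit assignment of consecutive registers, visiting vregs in `order`."""
    busy = [[] for _ in range(num_regs)]      # live segments per register
    result = {}
    for name in order:
        width, start, end = vregs[name]
        result[name] = None                   # None: would need a split or spill
        for base in range(num_regs - width + 1):
            units = range(base, base + width)
            if all(not overlaps(busy[u], start, end) for u in units):
                for u in units:
                    busy[u].append((start, end))
                result[name] = base
                break
    return result

# name: (width, live from, live until); values invented for this example
vregs = {"a": (1, 0, 4), "b": (1, 1, 10), "c": (2, 5, 10)}

program_order = assign(vregs, 3, ["a", "b", "c"])
widest_first = assign(vregs, 3, sorted(vregs, key=lambda n: -vregs[n][0]))
print(program_order)   # {'a': 0, 'b': 1, 'c': None}
print(widest_first)    # {'c': 0, 'a': 0, 'b': 2}
```

In program order the two narrow registers leave no pair of adjacent free registers for `c`; placing the wide register first lets the narrow ones fill in around it.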
SLIDE 33

Interference Checks: Register Units

  • Tuples multiply number of registers
  • Interference check of single register in target with 1-10 tuples: 45 aliases!

[Figure: registers aliasing r3: r0,r1,r2,r3  r1,r2,r3,r4  r2,r3,r4,r5  r3,r4,r5,r6  r1,r2,r3  r2,r3  r3  r3,r4  r2,r3,r4  r3,r4,r5  ...]

SLIDE 34

Interference Checks: Register Units

  • Each register mapped to one or more units: Registers alias iff they share a unit
  • Liveness/Interference checks of actual registers use register units

[Figure: register units u0 u1 u2 u3 u4 u5 beneath the registers r0;r1;r2;r3  r1;r2;r3;r4  r1;r2;r3  r2;r3  r3  r3;r4  r2;r3;r4  r3;r4;r5]
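
A sketch of the register unit idea, assuming the simplest possible mapping in which base register rN owns exactly one unit uN. The names `units_of`, `alias` and `UnitLiveness` are hypothetical; in LLVM the units are computed by TableGen.

```python
def units_of(reg):
    """Map a register or tuple such as 'r1;r2;r3' to its set of units.

    Assumption: base register rN owns exactly one unit, uN.
    """
    return {"u" + name[1:] for name in reg.split(";")}

def alias(a, b):
    """Registers alias iff they share a unit."""
    return bool(units_of(a) & units_of(b))

class UnitLiveness:
    """Track occupied units instead of occupied registers."""

    def __init__(self):
        self.occupied = set()

    def is_free(self, reg):
        return not (units_of(reg) & self.occupied)

    def assign(self, reg):
        self.occupied |= units_of(reg)

live = UnitLiveness()
live.assign("r2;r3")
print(alias("r0;r1;r2;r3", "r3;r4;r5"))   # True, both contain u3
print(live.is_free("r3;r4;r5"))           # False, u3 is taken
print(live.is_free("r4;r5"))              # True
```

Checking a candidate register costs one lookup per unit it covers, independent of how many tuples alias it.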

SLIDE 35

Usage, Results, Future Work

SLIDE 36

Use in LLVM

  • Declare Subregister Indexes + Subregisters in XXXRegisterInfo.td
  • TableGen computes register units and combined subregister indexes/classes
  • Enable fine-grained liveness tracking by overriding TargetSubtargetInfo::enableSubRegLiveness()
  • AllocationPriority part of register class specification
SLIDE 37

Results: Apple GPU Compiler

  • Compared various benchmarks and captured application shaders
  • Average 20% reduction in register usage (-6% up to 50%)!
  • Speedup 2-3% (-4% up to 70%)
SLIDE 38

Results: AMDGPU Target

SLIDE 40

Future Work

  • Support partially dead/undef operands
  • Early splitting and rematerialization (before register limit)
  • Partial register spilling
  • Consider partial liveness in register pressure tracking
  • Missed optimizations (no obvious use/def relation for lanes)
SLIDE 41

Thank You for Your Attention!

SLIDE 42

Backup Slides

SLIDE 43

Register Hierarchies

  • CPU registers can overlap. Partial register accessible by subregister. Also called lanes (Vector Regs)

[Figure: X86 GP Register: RAX contains EAX, EAX contains AX, AX consists of AH and AL]

[Figure: ARM FP Register: Q0 overlaps D0 and D1; D0 overlaps S0 and S1; D1 overlaps S2 and S3]

    movw $0xABCD, %ax   # Put 16 bits into %ax
    movb %al, x         # Uses lower 8 bits: 0xCD
    movb %ah, y         # Uses upper 8 bits: 0xAB
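
The overlap of %ax, %al and %ah can be checked with ordinary bit arithmetic. This is only a model of the register contents, not of the instructions themselves.

```python
ax = 0xABCD              # the 16-bit value held in %ax
al = ax & 0xFF           # %al: lower 8 bits of %ax
ah = (ax >> 8) & 0xFF    # %ah: upper 8 bits of %ax
print(hex(al), hex(ah))  # 0xcd 0xab

# Writing a subregister changes only its part of the wider register.
ax_after_al_write = (ax & 0xFF00) | 0x12
print(hex(ax_after_al_write))  # 0xab12
```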

SLIDE 44

Register Allocation Pipeline

PHIElimination TwoAddressInstruction RegisterCoalescer ProcessImplicitDefs DetectDeadLanes RenameIndependentSubregs MachineScheduler RegAllocGreedy VirtRegRewriter StackSlotColoring

SLIDE 45

Subregister Indexes

  • Subregister indexes relate wide/small registers on virtual registers
  • Writes may be marked undef if other parts of register do not matter
  • LLVM synthesizes combined indexes (`sub0_low16bits`)

    %0 = load_x4
    %1.sub0<undef> = add %0.sub2, 13
    %1.sub1 = const 42
    store_x2 %1

Register Allocation:

    r4_r5_r6_r7 = load_x4
    r0 = add r6, 13
    r1 = const 42
    store_x2 r0_r1
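
How a subregister index on a virtual register turns into a concrete register after allocation can be sketched as a lookup. The function `rewrite_operand` and the operand strings are assumptions for this example; it handles only plain `subN` indexes, not combined ones.

```python
def rewrite_operand(operand, assignment):
    """Replace a virtual register operand with its assigned physical register."""
    if not operand.startswith("%"):
        return operand                          # immediates stay as they are
    operand = operand.split("<", 1)[0]          # drop flags such as <undef>
    vreg, _, subidx = operand.partition(".")
    regs = assignment[vreg]
    if subidx:
        return regs[int(subidx[len("sub"):])]   # pick a single lane
    return "_".join(regs)                       # whole tuple

# Assignment chosen by the allocator in the slide's example.
assignment = {"%0": ("r4", "r5", "r6", "r7"), "%1": ("r0", "r1")}

print(rewrite_operand("%0.sub2", assignment))         # r6
print(rewrite_operand("%1.sub0<undef>", assignment))  # r0
print(rewrite_operand("%1", assignment))              # r0_r1
```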

SLIDE 46

Slot Indexes

  • Position in a program; Each instruction is assigned a number (incremented by 4 so we need to renumber less often when inserting instructions)
  • Slots describe position in the instruction:
      • Block/Base (Block begin/end, PHI-defs)
      • EarlyClobber (early point to force interference with normal def/use)
      • Register (normal def/uses use this)
      • Dead (liveness of dead definitions ends here)
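
A simplified model of these positions, assuming an index is an (instruction number, slot) pair that compares lexicographically. The class `Slot`, the helpers and the pair representation are assumptions for illustration; LLVM's `SlotIndex` packs this information differently.

```python
from enum import IntEnum

class Slot(IntEnum):
    BLOCK = 0          # Block/Base: block begin/end, PHI-defs
    EARLY_CLOBBER = 1  # early point, before normal defs and uses
    REGISTER = 2       # normal defs and uses
    DEAD = 3           # liveness of dead definitions ends here

INSTR_GAP = 4  # spacing between instruction numbers, as stated on the slide

def number_instructions(instrs):
    """Number instructions with gaps so later insertions rarely renumber."""
    return {instr: n * INSTR_GAP for n, instr in enumerate(instrs)}

def slot_index(number, slot):
    """A position is an (instruction number, slot) pair; tuples compare in order."""
    return (number, int(slot))

numbers = number_instructions(["load", "add", "store"])
# A new instruction between "load" and "add" fits into the gap.
inserted = (numbers["load"] + numbers["add"]) // 2

print(numbers)    # {'load': 0, 'add': 4, 'store': 8}
print(inserted)   # 2
print(slot_index(4, Slot.EARLY_CLOBBER) < slot_index(4, Slot.REGISTER))  # True
```

Since the EarlyClobber slot sorts before the Register slot of the same instruction, an early-clobber definition starts while the instruction's normal uses are still live, which forces the interference.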
SLIDE 47

Constraints & Classes

  • A register class is a set of registers; Models register constraints
  • Class defined for each register operand of LLVM MI Instruction (MCInstrDesc)
  • Each virtual register has a class

[Figure: GR32 class: EAX EBX ECX EDX ESI ESP EBP EDI R8 ...]