Dynamic Binary Optimization Introduction Application profiling - - PowerPoint PPT Presentation


SLIDE 1

EECS 768 Virtual Machines

Dynamic Binary Optimization

  • Introduction
  • Application profiling
  • Optimizing translation blocks
  • Compatibility
  • Code reordering
  • Other code optimizations

SLIDE 2

Optimization Overview

  • Identify frequently executed hot code regions
  • basic blocks
  • paths – indicate control flow
  • edges – approximation to paths
  • Dynamic profiling
  • count execution frequencies
  • software or hardware implemented
  • Form large translation blocks
  • traces and superblocks
  • Schedule and optimize large blocks

SLIDE 3

Optimization Based On Profiling

[Figure: moving an instruction based on profile information]

Before:
    Basic Block A:      ...
                        R3 ← ...
                        R7 ← ...
                        R1 ← R2 + R3
                        BEQ L1 if R3 == 0
    Basic Block B:      ...
                        R6 ← R1 + R6
                        ...
    Basic Block C:      L1: R1 ← ...
                        ...

After:
    Basic Block A:      ...
                        R3 ← ...
                        R7 ← ...
                        BEQ L1 if R3 == 0
    Compensation code:  R1 ← R2 + R3
    Basic Block B:      ...
                        R6 ← R1 + R6
                        ...
    Basic Block C:      L1: R1 ← ...
                        ...

SLIDE 4

Optimization Based On Profiling (2)

[Figure: forming a superblock from blocks A and C]

Before:
    Basic Block A:      ...
                        R3 ← ...
                        R7 ← ...
                        R1 ← R2 + R3
                        BEQ L1 if R3 == 0
    Basic Block B:      ...
                        R6 ← R1 + R6
                        ...
    Basic Block C:      L1: R1 ← ...
                        ...

After:
    Superblock:         ...
                        R3 ← ...
                        R7 ← ...
                        BNE L2 if R3 != 0
                        R1 ← ...
                        ...
    Compensation code:  L2: R1 ← R2 + R3
    Basic Block B:      ...
                        R6 ← R1 + R6
                        ...

SLIDE 5

Program Behavior

  • Many aspects of a program's behavior are predictable

  • branches, data values
  • Backward branch primarily taken
  • Forward branch mostly not taken

        R3 ← 100
loop:   R1 ← mem(R2)            ; load from memory
        Br found if R1 == -1    ; look for -1
        R2 ← R2 + 4
        R3 ← R3 - 1
        Br loop if R3 != 0      ; loop closing branch
        ...
found:
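The loop above can be simulated to see why the backward branch is mostly taken and the forward branch mostly not taken. This is a minimal sketch, not part of the original slides: the function name `run_search_loop`, the counter keys, and the test data are assumptions, and memory is modeled as a Python list with one word per slot (so the `+ 4` address step becomes `+ 1`).

```python
def run_search_loop(memory, start=0, limit=100):
    """Run the search loop over `memory` and count outcomes for each branch."""
    counts = {
        "found_fwd": {"taken": 0, "not_taken": 0},   # Br found if R1 == -1
        "loop_back": {"taken": 0, "not_taken": 0},   # Br loop if R3 != 0
    }
    r2, r3 = start, limit
    while True:
        r1 = memory[r2]                  # R1 <- mem(R2)
        if r1 == -1:                     # forward branch
            counts["found_fwd"]["taken"] += 1
            break
        counts["found_fwd"]["not_taken"] += 1
        r2 += 1                          # R2 <- R2 + 4 (one list slot per word)
        r3 -= 1                          # R3 <- R3 - 1
        if r3 != 0:                      # backward, loop-closing branch
            counts["loop_back"]["taken"] += 1
        else:
            counts["loop_back"]["not_taken"] += 1
            break
    return counts


# Hypothetical data: the value -1 sits at index 40 of a 100-word array.
counts = run_search_loop([7] * 40 + [-1] + [7] * 59)
print(counts)
```

With this data the forward branch falls through 40 times before it is taken once, while the backward branch is taken on every execution.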

SLIDE 6

Branch Behavior

  • Conditional branch predominantly decided one way
  • either taken or not taken

[Figure: histogram of the fraction of static conditional branches (0% to 50%) by percent taken, in bins from 0-10% to >90%]

SLIDE 7

Branch Behavior (2)

  • Most branches decided the same way as on previous execution

  • backward conditional branches are mostly taken
  • forward conditional branches taken less often

[Figure: percent of dynamic branches decided the same as the previous time (0% to 100%) for the benchmarks 176.gcc, 181.mcf, 197.parser, 252.eon, 256.bzip2, 171.swim, 173.applu, 177.mesa, 187.facerec, 189.lucas]
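The metric plotted on this slide can be computed from a branch trace as sketched below. The trace format (pairs of branch PC and outcome) and the function name are assumptions made for illustration.

```python
def same_as_previous_fraction(trace):
    """Fraction of dynamic branches decided the same way as on their previous execution.

    trace: iterable of (branch_pc, taken) pairs in execution order.
    The first execution of each branch has no previous outcome and is not counted.
    """
    last_outcome = {}
    same = total = 0
    for pc, taken in trace:
        if pc in last_outcome:
            total += 1
            if last_outcome[pc] == taken:
                same += 1
        last_outcome[pc] = taken
    return same / total if total else 0.0


# Hypothetical loop-closing branch: taken nine times, then falls through once.
trace = [(0x40, True)] * 9 + [(0x40, False)]
fraction = same_as_previous_fraction(trace)
print(fraction)
```

Only the final loop exit differs from the previous outcome, so eight of the nine counted executions match.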

SLIDE 8

Other Program Behavior

  • Some indirect jumps have a single target
  • others have several targets (e.g. returns)
  • Predictability extends to data values
  • many instructions always produce the same result

[Figure: fraction of instructions with constant value (0.1 to 0.7), static and dynamic, by instruction type: All, Add/Sub, Load, Logic, Shift, Set]

SLIDE 9

Profiling

  • Collect statistics about a program as it runs
  • branches (taken, not taken)
  • jump targets
  • data values
  • Predictability allows these statistics to be used for optimizations in the future
  • Profiling in a VM differs from traditional profiling used for compiler feedback

SLIDE 10

Conventional (Offline) Profiling

  • Multiple passes through compiler
  • Done at program development time
  • profile overhead is a small issue
  • Can be based on global analysis

[Figure: offline profiling flow with a control-flow graph of nodes A-F. Components shown: HLL program, compiler front-end, intermediate form, instrumented code, test data, program execution, program statistics, optimizing compiler, compiler back-end, optimized binary]

SLIDE 11

VM-Based (Online) Profiling

  • Profile overhead is very important
  • profile time part of total execution time
  • Limited view of program (no a priori global view)
  • profile probes cannot be carefully placed

[Figure: online profiling flow with partially "discovered" code (nodes A, B, D, E). Components shown: program binary, program data, interpreter, translator/optimizer, partial program statistics]

SLIDE 12

Types of Profiles

  • Block or node profiles
  • identify hot code blocks; fewer nodes than edges
  • Edge profiles
  • more precise idea of program flow
  • block profile can be derived from edge profile

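[Figure: control-flow graph with nodes A-F shown twice, once annotated with a block profile and once with an edge profile. Counts appearing in the figure: 50, 48, 38, 15, 2, 13, 10, 17, 15, 12 and 65, 50, 48, 17, 25, 15]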
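The point that a block profile can be derived from an edge profile can be sketched as follows. The graph and counts here are hypothetical (they are not the values from the slide's figure), and program entry is modeled as an explicit `("ENTRY", block)` edge, which is an assumption of this sketch.

```python
from collections import defaultdict


def block_profile_from_edges(edge_counts):
    """Derive block execution counts from edge counts.

    edge_counts maps (src, dst) -> number of times the edge was followed.
    A block executes once for every traversal of one of its incoming edges.
    """
    blocks = defaultdict(int)
    for (_src, dst), count in edge_counts.items():
        blocks[dst] += count
    return dict(blocks)


# Hypothetical diamond-shaped control-flow graph.
edges = {
    ("ENTRY", "A"): 10,
    ("A", "B"): 7,
    ("A", "C"): 3,
    ("B", "D"): 7,
    ("C", "D"): 3,
}
profile = block_profile_from_edges(edges)
print(profile)
```

The reverse derivation is not possible in general: the block counts A=10, B=7, C=3, D=10 do not say which predecessor led to D on any particular execution.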

SLIDE 13

Collecting Profiles

  • Instrumentation-based
  • software probes
      – slows down program more
      – requires less total time than sampling
  • hardware probes
      – less overhead than software
      – less well-supported in processors
      – typically event counters
  • Sampling based
  • interrupt at random intervals and take sample
      – slows down program less
      – requires longer time to get same amount of data
  • not useful during interpretation

SLIDE 14

Profiling During Interpretation

Instruction function list
    ...
branch_conditional(inst) {
    BO = extract(inst, 25, 5);
    BI = extract(inst, 20, 5);
    displacement = extract(inst, 15, 14) * 4;
    ...
    // code to compute whether branch should be taken
    ...
    profile_addr = lookup(PC);    // hash the branch PC to find its profile entry
    if (branch_taken) {
        profile_cnt(profile_addr, taken)++;
        PC = PC + displacement;
    } else {
        profile_cnt(profile_addr, nottaken)++;
        PC = PC + 4;
    }
}

[Figure: profile table indexed by a hash of the branch PC; each entry holds the branch PC, a taken count, and a not-taken count]
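A runnable analogue of the interpreter routine above is sketched here. It assumes the branch decision and displacement have already been decoded, and it uses a Python dictionary in place of the hashed profile table; the function signature and the sample PC values are illustrative only.

```python
from collections import defaultdict

# Profile table: branch PC -> [taken count, not-taken count].
# The dictionary lookup stands in for the hash of the PC.
profile = defaultdict(lambda: [0, 0])


def branch_conditional(pc, displacement, branch_taken):
    """Interpret one conditional branch, update its profile, return the next PC."""
    entry = profile[pc]
    if branch_taken:
        entry[0] += 1
        return pc + displacement
    entry[1] += 1
    return pc + 4


# Hypothetical branch at 0x100 that jumps back 16 bytes when taken.
next_pc = branch_conditional(0x100, -16, True)
branch_conditional(0x100, -16, True)
fall_through_pc = branch_conditional(0x100, -16, False)
print(hex(next_pc), hex(fall_through_pc), profile[0x100])
```

The profiling cost is one table lookup and one increment per interpreted branch, which is small relative to the cost of interpreting the instruction.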

SLIDE 15

Profiling Translated Code

  • Software instrumentation in stub code

[Figure: a translated basic block followed by a fall-thru stub and a branch target stub]

Fall-thru stub:
    Increment edge counter (i)
    If (counter (i) > trigger) then invoke optimizer
    Else branch to fall-thru basic block

Branch target stub:
    Increment edge counter (j)
    If (counter (j) > trigger) then invoke optimizer
    Else branch to target basic block
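The stub logic can be sketched as a small class. The names `Stub`, `TRIGGER`, and `optimizer`, and the use of strings for code addresses, are assumptions for illustration; a real system would patch the stub out once the optimized code exists, which this sketch does not model.

```python
TRIGGER = 3  # hypothetical hot threshold


class Stub:
    """Exit stub of a translated basic block with an edge counter."""

    def __init__(self, target, optimizer):
        self.target = target        # basic block this edge leads to
        self.optimizer = optimizer  # callback invoked when the edge becomes hot
        self.counter = 0

    def enter(self):
        """Count one traversal of the edge and return the code to run next."""
        self.counter += 1
        if self.counter > TRIGGER:
            return self.optimizer(self.target)
        return self.target


optimized = []


def optimizer(target):
    """Stand-in optimizer: records the request and returns the new code's name."""
    optimized.append(target)
    return target + "_opt"


stub = Stub("block_B", optimizer)
results = [stub.enter() for _ in range(5)]
print(results)
```

The first three traversals branch to the ordinary block; from the fourth traversal on, the counter exceeds the trigger and the optimizer is invoked.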

SLIDE 16

Sampling

  • Set interval counter
  • Interrupt when counter hits zero
  • Sample PC at that point
  • Gives block profile
  • Could be modified to give edge profile

[Figure: sampling hardware. An interval counter is initialized and decremented for each instruction; when zero is detected, a trap is raised and the program counter (instruction address) is loaded as the sample PC]
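The sampling scheme can be simulated in a few lines. This sketch assumes a fixed interval for simplicity (a fixed interval can alias with loop lengths, which is one reason to randomize it) and represents the executed instruction stream as a list of block names; both are assumptions of the example.

```python
from collections import Counter


def sample_block_profile(instruction_blocks, interval):
    """Sample the 'PC' every `interval` instructions and build a block profile.

    instruction_blocks: for each executed instruction, the block it belongs to.
    """
    samples = Counter()
    counter = interval                 # initialize counter
    for block in instruction_blocks:
        counter -= 1                   # decrement for each instruction
        if counter == 0:               # zero detect -> trap
            samples[block] += 1        # sample PC
            counter = interval         # re-initialize counter
    return samples


# Hypothetical loop: 2 instructions in block A, 6 in block B, run 50 times.
trace = (["A"] * 2 + ["B"] * 6) * 50
samples = sample_block_profile(trace, interval=7)
print(samples)
```

Block B executes three times as many instructions as block A, and the samples reflect roughly that ratio using 57 samples instead of 400 counter updates.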

SLIDE 17

Improving Code Locality

[Figure: code blocks A, B, E]

  • Provide more optimization opportunities.
  • Spatial locality
  • consecutive memory accesses are adjacent
  • Temporal locality
  • same memory access is repeated in near future
  • Reasons for spatial and temporal locality
  • loops and sequential program flow

SLIDE 18

Improving Locality: Example

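[Figure: code layout of blocks A B C D E F G, whose ending branches include Br cond1 == true, Br cond2 == false, Br uncond, Br cond3 == true, Br uncond, Br cond4 == true, next to the control-flow graph of blocks A-G annotated with edge counts. Counts appearing in the figure: 30, 70, 68, 2, 97, 15, 29, 1, 1, 29, 68, 1, 3]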

SLIDE 19

Improving Locality: Example (2)

  • Little locality (spatial or temporal) in cache line that spans blocks E and F
  • F seldom used
  • wasted I-cache space and I-fetch bandwidth
  • Heavily used discontiguous code blocks
  • e.g., C and D
  • still wastes I-fetch bandwidth

[Figure: cache lines holding blocks E and F; a Br uncond is shown]

SLIDE 20

Improving Locality: Rearrange Code

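[Figure: original layout (A B C D E F G, with branches Br cond1 == true, Br cond2 == false, Br uncond, Br cond3 == true, Br uncond, Br cond4 == true) and rearranged layout. In the rearranged layout the branch ending block A is reversed to Br cond1 == false. Blocks as they appear in the rearranged layout: A, D, E, B, C, G, F; branches as they appear: Br cond1 == false, Br cond3 == true, Br cond2 == false, Br uncond, Br cond4 == true, Br uncond, Br uncond]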

SLIDE 21

Improving Locality: Procedure Inlining

[Figure: procedure xyz (blocks X, Y, Z, ending in Return) called from two call sites among blocks A, B, K, L, shown before and after inlining]

  • Inlining – duplicate procedure body at call-site
  • Partial inlining
  • follow dominant flow of control
  • not practical to find full procedure during dynamic incremental code discovery

  • Disadvantages
  • increases code size
  • increases register pressure

SLIDE 22

Improving Locality: Traces

  • Divide program into chunks
  • may contain multiple blocks
  • Greedy Method
  • suitable for on-the-fly translation
  • start at hottest block not in trace
  • follow hottest edges
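  • stop when trace reaches a certain size
  • stop when a block already in a trace is reached

[Figure: the control-flow graph of blocks A-G with edge counts, partitioned into Trace 1, Trace 2, and Trace 3. Counts appearing in the figure: 30, 70, 68, 97, 15, 29, 1, 1, 29, 68, 1, 3, 2]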
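The greedy method listed above can be sketched as follows. The block and edge counts are hypothetical (the extracted figure does not say which count belongs to which edge), and the function name, the `max_len` size limit, and the tie-breaking order are assumptions of this sketch.

```python
def form_traces(block_counts, edge_counts, max_len=4):
    """Greedy trace formation; returns a list of traces (lists of block names)."""
    successors = {}
    for (src, dst), count in edge_counts.items():
        successors.setdefault(src, []).append((count, dst))

    placed, traces = set(), []
    # Start at the hottest block that is not yet in a trace.
    for seed in sorted(block_counts, key=block_counts.get, reverse=True):
        if seed in placed:
            continue
        trace, current = [seed], seed
        placed.add(seed)
        # Stop when the trace reaches a certain size.
        while len(trace) < max_len and successors.get(current):
            _count, nxt = max(successors[current])   # follow the hottest edge
            if nxt in placed:                        # block already in a trace
                break
            trace.append(nxt)
            placed.add(nxt)
            current = nxt
        traces.append(trace)
    return traces


# Hypothetical profile for a graph with blocks A-G.
block_counts = {"A": 100, "B": 30, "C": 30, "D": 70, "E": 68, "F": 2, "G": 100}
edge_counts = {
    ("A", "B"): 30, ("A", "D"): 70, ("B", "C"): 30, ("C", "G"): 30,
    ("D", "E"): 68, ("D", "F"): 2, ("E", "G"): 68, ("F", "G"): 2,
    ("G", "A"): 99,
}
traces = form_traces(block_counts, edge_counts)
print(traces)
```

Every block lands in exactly one trace, so no code is duplicated; the cold blocks end up in short traces of their own.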

SLIDE 23

Improving Locality: Traces (2)

  • No redundancy
  • may reduce I-cache pressure
  • good for spatial locality
  • Join points sometimes inhibit optimizations.
  • Typically not used in optimizing VMs.

SLIDE 24

Improving Locality: Superblocks

  • Superblock – One entry, multiple exits
  • May contain redundant blocks (tail duplication)
  • More commonly used by dynamic optimizers
  • better branch prediction
  • fewer constraints on optimizations

[Figure: control-flow graph of blocks A-G before and after superblock formation; block G is duplicated in the result]

SLIDE 25

Superblocks: Example

[Figure: the example code layout (A B C D E F G) and control-flow graph with edge counts, next to the superblock layout. In the superblock layout the branch ending block A is reversed to Br cond1 == false, and block G, ending in Br cond4 == true, is duplicated so that three copies of G appear (tail duplication)]

SLIDE 26

Optimization Strategy

[Figure: blocks A, B, C carried from original source code, through intermediate form, to optimized target code with compensation code (comp)]

  • Collect basic blocks using profile information
  • Convert to intermediate form; place in buffer
  • Schedule and optimize
  • Generate target code
  • Add compensation code; place in code cache

SLIDE 27

Optimization and Compatibility

  • Requirements for compatibility
  • isomorphism of user/privilege mode control transfer points
  • isomorphism of guest state at the control transfer points
  • Optimizations can affect the visibility of traps
  • reordering instructions may affect where traps occur
  • adding/eliminating instructions may affect if traps occur
  • Trap compatibility
  • trap during native execution of source instruction also occurs during emulation of corresponding target instruction
  • trap observed during emulation should also occur in the corresponding source instruction

SLIDE 28

Optimization and Compatibility (2)

  • Trap compatibility

    Source                        Target
    ...                           ...
    r4 ← r6 + 1                   R4 ← R6 + 1
    r1 ← r2 + r3   → trap?        (removed: dead assignment)
    r1 ← r4 + r5                  R1 ← R4 + R5
    r6 ← r1 * r7                  R6 ← R1 * R7

  • Memory and register state compatibility
  • consistent program state on guest and native platform at each control transfer point

    Source             Target (rescheduled)        Target with saved reg. state
    ...                ...                         ...
    r1 ← r2 + r3       R1 ← R2 + R3                R1 ← R2 + R3
    r9 ← r1 + r5       R6 ← R1 * R7                S1 ← R1 * R7
    r6 ← r1 * r7       R9 ← R1 + R5   → trap?      R9 ← R1 + R5
    r3 ← r6 + 1        R3 ← R6 + 1                 R6 ← S1
    ...                ...                         R3 ← S1 + 1
                                                   ...
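The trap-compatibility issue with dead-assignment removal can be sketched as an optimizer pass that has a conservative mode. The instruction encoding (tuples of destination, operation, sources), the `MAY_TRAP` set, and the `preserve_traps` flag are assumptions made for this example; which operations can trap depends on the guest ISA.

```python
MAY_TRAP = {"add", "div", "load"}  # assumption: operations that can trap on the guest ISA


def remove_dead_assignments(block, preserve_traps=True):
    """Remove assignments that are overwritten before being read.

    block: list of (dest, op, srcs) tuples in program order.
    With preserve_traps=True, a dead assignment is kept if its operation
    could trap, because removing it would also remove the trap.
    """
    kept = []
    for i, (dest, op, srcs) in enumerate(block):
        dead = False
        for later_dest, _later_op, later_srcs in block[i + 1:]:
            if dest in later_srcs:      # value is read first: live
                break
            if later_dest == dest:      # overwritten before any read: dead
                dead = True
                break
        if dead and not (preserve_traps and op in MAY_TRAP):
            continue
        kept.append((dest, op, srcs))
    return kept


# The source sequence from the trap-compatibility example.
source = [
    ("r4", "add", ("r6",)),
    ("r1", "add", ("r2", "r3")),   # dead, but may trap
    ("r1", "add", ("r4", "r5")),
    ("r6", "mul", ("r1", "r7")),
]
aggressive = remove_dead_assignments(source, preserve_traps=False)
conservative = remove_dead_assignments(source, preserve_traps=True)
print(len(aggressive), len(conservative))
```

The aggressive mode drops the first write to r1, as on the slide; the conservative mode keeps it so that a trap raised by that instruction is still observed.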

SLIDE 29

Code Reordering

  • Important aspect of several optimizations
  • especially for pipelined RISC and VLIW processors
  • reduce pipeline stalls and functional unit latencies
  • Primitive instruction reordering issues
  • consider reordering pairs of instructions
  • divide instructions into basic categories

SLIDE 30

Instruction Categories

  • reg updates – instructions updating registers
  • memory updates – instructions updating memory
  • branch instructions – transfer of control instructions
  • join point – points where jump/branch enter code sequence (only for traces)

    ...
    R1 ← mem(R6)            reg
    R2 ← mem(R6 + 4)        reg
    R3 ← R1 + 1             reg
    R4 ← R1 << 2            reg
    Br exit if R7 == 0      br
    R7 ← R7 + 1             reg
    mem(R6) ← R3            mem

SLIDE 31

Moving Instructions Below Branches

  • Duplicate compensation code at the exit point.
  • Pretty straightforward.
  • Works for registers as well as memory state.

[Figure: a reg or mem instruction that precedes a branch (Br) is moved below it, and a compensation copy is placed at the branch exit]

Before:
    ...
    R1 ← mem(R6)
    R2 ← mem(R6+4)
    R3 ← R1 + 1
    R4 ← R1 << 2
    Br exit if R7 == 0
    R7 ← R7 + 1
    mem(R6) ← R3

After:
    ...
    R1 ← mem(R6)
    R2 ← mem(R6+4)
    R3 ← R1 + 1
    Br exit if R7 == 0
    R4 ← R1 << 2
    R7 ← R7 + 1
    mem(R6) ← R3

Compensation code at the exit:
    R4 ← R1 << 2
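The transformation on this slide can be sketched as a function that swaps an instruction with the branch below it and returns the compensation code for the exit path. Instructions are plain strings here, and the sketch assumes the caller has already checked that the branch condition does not read the moved instruction's result; the function name is hypothetical.

```python
def sink_below_branch(block, index):
    """Move block[index] below the branch that immediately follows it.

    Returns (new_block, compensation), where compensation is the code to place
    at the branch's exit target so the taken path still sees the update.
    Assumes the branch condition does not depend on the moved instruction.
    """
    instr, branch = block[index], block[index + 1]
    if not branch.startswith("Br "):
        raise ValueError("instruction after block[index] must be a branch")
    new_block = block[:index] + [branch, instr] + block[index + 2:]
    compensation = [instr]   # duplicate of the moved instruction for the exit path
    return new_block, compensation


block = [
    "R1 <- mem(R6)",
    "R2 <- mem(R6+4)",
    "R3 <- R1 + 1",
    "R4 <- R1 << 2",
    "Br exit if R7 == 0",
    "R7 <- R7 + 1",
    "mem(R6) <- R3",
]
new_block, compensation = sink_below_branch(block, 3)
print(new_block)
print(compensation)
```

The same duplication works for a store, which is why moving instructions below branches is safe for memory state as well as register state.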

SLIDE 32

Moving Instructions Above Branches

  • Use checkpoint for moving reg instructions
  • calculate reg update in a temporary register
  • if branch taken, real register is unmodified
  • if instruction traps, all register state unmodified

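[Figure: a reg instruction that writes register R below a branch (Br) is hoisted above it as a write to temporary T, with a copy R ← T left below the branch]

Original:
    ...
    R2 ← R1 << 2
    Br exit if R8 == 0
    R6 ← R7 * R2
    mem(R6) ← R3
    R6 ← R2 + 2

Hoisted using temporary T1:
    ...
    R2 ← R1 << 2
    T1 ← R7 * R2
    Br exit if R8 == 0
    R6 ← T1
    mem(T1) ← R3
    R6 ← R2 + 2

After removing the dead copy R6 ← T1:
    ...
    R2 ← R1 << 2
    T1 ← R7 * R2
    Br exit if R8 == 0
    mem(T1) ← R3
    R6 ← R2 + 2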
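The checkpointing step can be sketched as below. Instructions are modeled as (destination, expression) pairs with the branch as `("br", condition)`; this encoding and the function name are assumptions. The sketch performs only the hoist and the copy; forwarding T1 into later uses and removing a dead copy, as in the slide's later versions, are separate steps not shown here.

```python
def hoist_above_branch(block, index, temp="T1"):
    """Hoist the register update at block[index] above the branch at block[index - 1].

    The hoisted instruction writes a temporary, so the real register is
    unmodified if the branch is taken or if the hoisted instruction traps.
    A copy below the branch commits the value on the fall-through path.
    """
    branch = block[index - 1]
    if branch[0] != "br":
        raise ValueError("block[index - 1] must be a branch")
    dest, expr = block[index]
    hoisted = (temp, expr)     # compute into the temporary above the branch
    commit = (dest, temp)      # copy into the real register below the branch
    return block[:index - 1] + [hoisted, branch, commit] + block[index + 1:]


block = [
    ("R2", "R1 << 2"),
    ("br", "exit if R8 == 0"),
    ("R6", "R7 * R2"),
    ("mem(R6)", "R3"),
    ("R6", "R2 + 2"),
]
hoisted_block = hoist_above_branch(block, 2)
for dest, expr in hoisted_block:
    print(dest, "<-", expr)
```

The cost is one extra register (a longer live range for T1) and, until it is optimized away, one extra copy instruction.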

SLIDE 33

Moving Instructions Above Branches

  • Moving stores above branches breaks memory state compatibility
  • what if exit branch is taken?
  • difficult to replicate memory state!

[Figure: moving a mem instruction above a branch (Br) is marked as not allowed (X)]

    ...
    R2 ← R1 << 2
    T1 ← R7 * R2
    Br exit if R8 == 0
    mem(T1) ← R3
    R6 ← R2 + 2

SLIDE 34

Moving Code Above Join Points

  • Similar to previous case of branches
  • Straightforward, compensation is via duplication

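[Figure: a reg or mem instruction that follows a join point is moved above it, and a compensation copy is placed on the joining path]

Before:
    ...
    R1 ← R1 + 1
    (join point)
    R7 ← mem(R6)
    R7 ← R7 + 1
    ...

After:
    ...
    R1 ← R1 + 1
    R7 ← mem(R6)
    (join point)
    R7 ← R7 + 1
    ...

Compensation code on the joining path:
    R7 ← mem(R6)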

SLIDE 35

Moving Code Below Join Point

  • Should not be done in most cases.
  • No way to compensate if the join is taken.

[Figure: a reg or mem instruction above a join point being moved below the join point]

SLIDE 36

Movement in Straight Line Code

  • Can be done via checkpointing registers

reg(R) reg R T reg reg(T) reg(R) mem R T mem reg(T)

… R1 ← R1 * 3 mem(R6) ← R1 R7 ← R7 << 3 R9 ← R7 + R2 ... … R1 ← R1 * 3 T1 ← R7 << 3 mem(R6) ← R1 R7 ← T1 R9 ← T1 + R2 ...

SLIDE 37

Movement in Straight Line Code

  • Hoisting stores breaks memory state compatibility
  • unless there is a way to back up store instructions
  • expensive

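[Figure: moving a mem instruction above a reg instruction or above another mem instruction is marked as not allowed (X)]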

SLIDE 38

Instruction Reordering – Summary

Rows: the instruction that is originally second (the one moved up). Columns: the instruction that is originally first.

second \ first | reg | mem | br | join
reg  | extend live range of reg instruction | extend live range of reg instruction | extend live range of reg instruction | add compensation code at entrance
mem  | not allowed | not allowed | not allowed | add compensation code at entrance
br   | add compensation code at branch exit | add compensation code at branch exit | not allowed (changes control flow) | not allowed (changes control flow)
join | not allowed (can only be done in rare cases) | not allowed (can only be done in rare cases) | not allowed (changes control flow) | no effect
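The summary can be encoded as a lookup table that a scheduler consults before swapping two adjacent instructions. The orientation used here (the key is the instruction being moved up, the index is the instruction it moves above) is this sketch's reading of the table, chosen because it agrees with the preceding slides; the names are hypothetical.

```python
EXTEND = "extend live range of reg instruction"
ENTRANCE = "add compensation code at entrance"
EXIT = "add compensation code at branch exit"
NOT_ALLOWED = "not allowed"
NO_EFFECT = "no effect"

KINDS = ("reg", "mem", "br", "join")

# TABLE[moved_up][i] is the action when `moved_up` is moved above KINDS[i].
TABLE = {
    "reg":  (EXTEND, EXTEND, EXTEND, ENTRANCE),
    "mem":  (NOT_ALLOWED, NOT_ALLOWED, NOT_ALLOWED, ENTRANCE),
    "br":   (EXIT, EXIT, NOT_ALLOWED, NOT_ALLOWED),
    "join": (NOT_ALLOWED, NOT_ALLOWED, NOT_ALLOWED, NO_EFFECT),
}


def reorder_action(moved_up, over):
    """Return what is required to move a `moved_up` instruction above an `over` one."""
    return TABLE[moved_up][KINDS.index(over)]


print(reorder_action("reg", "br"))   # hoisting a register update above a branch
print(reorder_action("mem", "br"))   # hoisting a store above a branch
print(reorder_action("br", "reg"))   # sinking a register update below a branch
```

Moving a branch above a register update is the same swap as moving that update below the branch, which is why the `br` row describes compensation code at the branch exit.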

SLIDE 39

Optimizations

  • Basic local optimizations
  • applied within translation blocks
  • can even optimize statically optimized code further
  • constant propagation, constant folding, strength reduction, dead-assignment elimination, common subexpression elimination (CSE), register assignment, etc.

  • compatibility issues verified on a case-by-case basis
  • Inter-superblock optimizations
  • go across basic blocks
  • ISA-specific optimizations
  • if-conversion, instruction alignment
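Two of the local optimizations named above, constant propagation and constant folding, can be sketched over a straight-line translation block. The instruction encoding (destination, operator, two operands that are register names or integers) and the `"const"` marker are assumptions of this example, which ignores traps and overflow.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def propagate_and_fold(block):
    """Constant propagation and folding over a straight-line block.

    block: list of (dest, op, a, b); operands are register names or ints.
    Returns a new list in which folded instructions become (dest, "const", value, None).
    """
    known = {}   # register name -> constant value currently held
    out = []
    for dest, op, a, b in block:
        a, b = known.get(a, a), known.get(b, b)      # constant propagation
        if isinstance(a, int) and isinstance(b, int):
            known[dest] = OPS[op](a, b)              # constant folding
            out.append((dest, "const", known[dest], None))
        else:
            known.pop(dest, None)                    # dest is no longer a known constant
            out.append((dest, op, a, b))
    return out


block = [
    ("R1", "+", 2, 3),
    ("R2", "*", "R1", 4),
    ("R3", "+", "R2", "R7"),
    ("R1", "+", "R3", 1),
]
optimized_block = propagate_and_fold(block)
for instr in optimized_block:
    print(instr)
```

Because the pass looks only at one block and makes a single forward scan, it fits the low-overhead, local style of optimization that a dynamic optimizer can afford.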

SLIDE 40

Static Vs Dynamic Optimizations

  • Advantages of dynamic optimizations
  • availability of runtime profile information (specialization)
  • ability to see the whole program post-link-time
  • ability to detect and optimize program phases
  • Disadvantages
  • compilation time adds to total execution time
      – apply low-overhead conservative optimizations
      – only apply local optimizations
  • high level semantic information may not be available
      – exception: HLL (Java) VMs