Dynamic Binary Optimization
● Introduction
● Application profiling
● Optimizing translation blocks
● Compatibility
● Code reordering
● Other code optimizations
EECS 768 Virtual Machines

Optimization Overview
● Identify frequently executed hot code regions
  • basic blocks
  • paths – indicate control flow
  • edges – approximation to paths
● Dynamic profiling
  • count execution frequencies
  • software or hardware implemented
● Form large translation blocks
  • traces and superblocks
● Schedule and optimize large blocks

Optimization Based On Profiling
[Figure: basic blocks A, B, and C before and after optimization.]
Before:
    Basic Block A:  ...; R3 ← ...; R7 ← ...; R1 ← R2 + R3; BEQ L1 if R3 == 0
    Basic Block B:  R6 ← R1 + R6; ...
    Basic Block C:  L1: R1 ← 0; ...
After:
    Basic Block A:  ...; R3 ← ...; R7 ← ...; BEQ L1 if R3 == 0
    Basic Block B:  R1 ← R2 + R3 (compensation code); R6 ← R1 + R6; ...
    Basic Block C:  L1: R1 ← 0; ...

Optimization Based On Profiling (2)
[Figure: basic blocks A, B, and C before optimization, and the superblock formed from them.]
Before:
    Basic Block A:  ...; R3 ← ...; R7 ← ...; R1 ← R2 + R3; BEQ L1 if R3 == 0
    Basic Block B:  R6 ← R1 + R6; ...
    Basic Block C:  L1: R1 ← 0; ...
After:
    Superblock:         ...; R3 ← ...; R7 ← ...; BNE L2 if R3 != 0; R1 ← 0; ...
    Compensation code:  L2: R1 ← R2 + R3
    Basic Block B:      R6 ← R1 + R6; ...

Program Behavior
● Many aspects of a program's behavior are predictable
  • branches, data values
          R3 ← 100
    loop: R1 ← mem(R2)           ; load from memory
          Br found if R1 == -1   ; look for -1
          R2 ← R2 + 4
          R3 ← R3 - 1
          Br loop if R3 != 0     ; loop closing branch
          . .
    found:
● Backward branch primarily taken
● Forward branch mostly not taken

Branch Behavior
● Conditional branch predominantly decided one way
  • either taken or not taken
[Figure: histogram of the fraction of static conditional branches (y-axis, 0% to 50%) by percent taken (x-axis, bins 0-10% through >90%).]

Branch Behavior (2)
● Most branches decided the same way as on previous execution
  • backward conditional branches are mostly taken
  • forward conditional branches taken less often
[Figure: bar chart of the percent of dynamic branches decided the same as the previous time (y-axis, 0% to 100%), per benchmark.]
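The "same as previous time" measure can be made concrete with a small sketch. It assumes the outcomes of one static branch are already recorded as an array of 1 (taken) and 0 (not taken); the function name and the array encoding are hypothetical, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of executions of ONE static branch whose outcome
 * (1 = taken, 0 = not taken) matches the previous execution.
 * Hypothetical helper for illustration only. */
double same_as_previous_fraction(const int *outcomes, size_t n)
{
    assert(outcomes != NULL || n == 0);
    if (n < 2)
        return 0.0;              /* no previous execution to compare with */

    size_t same = 0;
    for (size_t i = 1; i < n; i++)
        if (outcomes[i] == outcomes[i - 1])
            same++;
    return (double)same / (double)(n - 1);
}
```

A loop-closing branch that is taken four times and then falls through once matches its previous outcome on three of four comparisons, while a strictly alternating branch never does.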

Other Program Behavior
● Some indirect jumps have a single target
  • others have several targets (e.g. returns)
● Predictability extends to data values
  • many instructions always produce the same result
[Figure: fraction of instructions with a constant value (y-axis, 0 to 0.7) by instruction type, with static and dynamic series.]

Profiling
● Collect statistics about a program as it runs
  • branches (taken, not taken)
  • jump targets
  • data values
● Predictability allows these statistics to be used for optimizations in the future
● Profiling in a VM differs from traditional profiling used for compiler feedback

Conventional (Offline) Profiling
● Multiple passes through compiler
● Done at program development time
  • profile overhead is a small issue
● Can be based on global analysis
[Figure: the HLL program passes through the compiler front-end to an intermediate form, and the compiler back-end emits instrumented code; running the instrumented code on test data yields program execution statistics, which the optimizing compiler uses to produce the optimized binary.]

VM-Based (Online) Profiling
● Profile overhead is very important
  • profiling time is part of total execution time
● Limited view of program (no a priori global view)
  • profile probes cannot be carefully placed
[Figure: the interpreter and the translator/optimizer work from the program binary and program data on partially "discovered" code, collecting program statistics as execution proceeds.]

Types of Profiles
● Block or node profiles
  • identify hot code blocks; fewer nodes than edges
● Edge profiles
  • more precise idea of program flow
  • block profile can be derived from edge profile
[Figure: one control flow graph over blocks A to F shown twice, once annotated with block (node) counts and once with edge counts.]
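The claim that a block profile can be derived from an edge profile can be sketched as follows. This is a minimal illustration under stated assumptions: blocks are numbered from 0, the entry block's count is supplied separately because it has no incoming edge from inside the region, and `struct edge`, `MAX_BLOCKS`, and the function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BLOCKS 16            /* illustrative limit, not from the slides */

struct edge {
    int src, dst;                /* block ids */
    unsigned long count;         /* times the edge was followed */
};

/* A block's count is the number of times control entered it: the sum of
 * its incoming edge counts, plus entry_count for the entry block. */
void block_profile_from_edges(const struct edge *edges, size_t n_edges,
                              int entry_block, unsigned long entry_count,
                              unsigned long block_count[MAX_BLOCKS])
{
    for (int b = 0; b < MAX_BLOCKS; b++)
        block_count[b] = 0;

    assert(entry_block >= 0 && entry_block < MAX_BLOCKS);
    block_count[entry_block] = entry_count;

    for (size_t i = 0; i < n_edges; i++) {
        assert(edges[i].dst >= 0 && edges[i].dst < MAX_BLOCKS);
        block_count[edges[i].dst] += edges[i].count;
    }
}
```

The reverse derivation is not possible in general: two different edge profiles can sum to the same block counts, which is why the edge profile is the more precise of the two.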

Collecting Profiles
● Instrumentation-based
  • software probes
    - slows down program more
    - requires less total time than sampling
  • hardware probes
    - less overhead than software
    - less well-supported in processors
    - typically event counters
● Sampling based
  • interrupt at random intervals and take sample
    - slows down program less
    - requires longer time to get same amount of data
  • not useful during interpretation

Profiling During Interpretation
[Figure: the branch PC is hashed to index a profile table; each entry holds the PC, a taken count, and a not-taken count. The routine below is one entry in the interpreter's instruction function list.]
branch_conditional(inst) {
    BO = extract(inst,25,5);
    BI = extract(inst,20,5);
    displacement = extract(inst,15,14) * 4;
    . .
    // code to compute whether branch should be taken
    . .
    profile_addr = lookup(PC);    // hash the branch PC to its profile entry
    if (branch_taken) {
        profile_cnt(profile_addr, taken)++;
        PC = PC + displacement;
    } else {
        profile_cnt(profile_addr, nottaken)++;
        PC = PC + 4;
    }
}
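The profile table behind `lookup(PC)` might be implemented along the following lines. This is a sketch under assumptions the slide does not state: a fixed-size table, a hash that drops the two low PC bits, and linear probing on collisions. All names (`PROFILE_SLOTS`, `profile_lookup`, `profile_branch`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROFILE_SLOTS 1024       /* assumed table size; must be a power of two */

struct branch_profile {
    uint32_t pc;                 /* address of the profiled branch */
    int      used;
    uint64_t taken;
    uint64_t not_taken;
};

static struct branch_profile profile_table[PROFILE_SLOTS];

/* Hash the branch PC to a slot; linear probing resolves collisions.
 * Returns NULL only if the table is full. */
struct branch_profile *profile_lookup(uint32_t pc)
{
    uint32_t slot = (pc >> 2) & (PROFILE_SLOTS - 1);
    for (uint32_t probes = 0; probes < PROFILE_SLOTS; probes++) {
        struct branch_profile *e = &profile_table[slot];
        if (!e->used) {          /* first sighting of this branch */
            e->used = 1;
            e->pc = pc;
            return e;
        }
        if (e->pc == pc)
            return e;
        slot = (slot + 1) & (PROFILE_SLOTS - 1);
    }
    return NULL;
}

/* Record one outcome of the branch at pc and return the next PC. */
uint32_t profile_branch(uint32_t pc, int32_t displacement, int branch_taken)
{
    struct branch_profile *e = profile_lookup(pc);
    assert(e != NULL);
    if (branch_taken) {
        e->taken++;
        return pc + (uint32_t)displacement;   /* wraps correctly for negative values */
    }
    e->not_taken++;
    return pc + 4;
}
```

Because the interpreter already decodes every branch, the added cost per branch is one hash lookup and one increment, which is why profiling fits naturally into interpretation.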

Profiling Translated Code
● Software instrumentation in stub code
[Figure: a translated basic block followed by a fall-thru stub and a branch target stub.]
Fall-thru stub:
    Increment edge counter (i)
    If (counter (i) > trigger) then invoke optimizer
    Else branch to fall-thru basic block
Branch target stub:
    Increment edge counter (j)
    If (counter (j) > trigger) then invoke optimizer
    Else branch to target basic block
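The stub logic can be sketched in C as a function that returns the action the stub would take. The threshold value, the counter array, and the names `TRIGGER`, `stub_execute`, and `enum stub_action` are assumptions for illustration; a real translator would emit this logic as native code in each stub.

```c
#include <assert.h>

#define TRIGGER   50             /* assumed hot threshold */
#define MAX_EDGES 64             /* assumed number of instrumented edges */

enum stub_action { BRANCH_TO_SUCCESSOR, INVOKE_OPTIMIZER };

static unsigned edge_counter[MAX_EDGES];

/* Runs each time a translated block exits through edge i:
 * count the edge, then hand off to the optimizer once it is hot. */
enum stub_action stub_execute(int i)
{
    assert(i >= 0 && i < MAX_EDGES);
    edge_counter[i]++;
    if (edge_counter[i] > TRIGGER)
        return INVOKE_OPTIMIZER;
    return BRANCH_TO_SUCCESSOR;
}
```

With this threshold, the first 50 executions of an edge branch on to the successor block and the 51st invokes the optimizer. Each edge keeps its own counter, so a cold fall-thru edge does not trigger optimization of a hot branch-target edge or vice versa.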

Sampling
● Set interval counter
● Interrupt when counter hits zero
● Sample PC at that point
● Gives block profile
● Could be modified to give edge profile
[Figure: an interval counter is initialized and decremented for each instruction; zero detect raises a trap, which loads the program counter as the sampled instruction address.]
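The sampling steps above can be simulated in software. The sketch below assumes the executed instruction stream is available as an array that maps each executed instruction to its basic block, and it uses a fixed interval for simplicity; the function name and `MAX_BLOCKS` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BLOCKS 16            /* illustrative limit */

/* block_of[k] is the basic block containing the k-th executed instruction.
 * Every `interval` instructions, the block at the current PC gets one sample. */
void sample_block_profile(const int *block_of, size_t n_instructions,
                          unsigned interval,
                          unsigned long samples[MAX_BLOCKS])
{
    assert(interval > 0);
    for (int b = 0; b < MAX_BLOCKS; b++)
        samples[b] = 0;

    unsigned counter = interval;             /* set interval counter */
    for (size_t k = 0; k < n_instructions; k++) {
        if (--counter == 0) {                /* zero detect: trap */
            assert(block_of[k] >= 0 && block_of[k] < MAX_BLOCKS);
            samples[block_of[k]]++;          /* sample the PC's block */
            counter = interval;              /* re-arm for the next sample */
        }
    }
}
```

A fixed interval can alias with the program's own periodicity. For the stream 0,0,0,1,0,0,0,1 an interval of 4 samples only block 1, although block 0 executes three times as many instructions, while an interval of 3 samples only block 0. Randomizing the interval avoids this kind of bias.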

Improving Code Locality
● Provide more optimization opportunities.
● Spatial locality
  • consecutive memory accesses are adjacent
● Temporal locality
  • same memory access is repeated in near future
● Reasons for spatial and temporal locality
  • loops and sequential program flow
[Figure: code blocks A, B, and E.]

Improving Locality: Example
[Figure: control flow graph of blocks A to G annotated with edge counts, beside the original code layout.]
Original code layout:
    A   Br cond1 == true
    B   Br cond2 == false
    C   Br uncond
    D   Br cond3 == true
    E   Br uncond
    F
    G   Br cond4 == true

Improving Locality: Example (2)
● Little locality (spatial or temporal) in cache line that spans blocks E and F
● F seldom used
  • wasted I-cache space and I-fetch bandwidth
● Heavily used discontiguous code blocks
  • e.g., C and D
  • still wastes I-fetch bandwidth
[Figure: a cache line spanning the end of block E (Br uncond) and the start of block F.]

Improving Locality: Rearrange Code
Original layout:
    A   Br cond1 == true
    B   Br cond2 == false
    C   Br uncond
    D   Br cond3 == true
    E   Br uncond
    F
    G   Br cond4 == true
Rearranged layout:
    A   Br cond1 == false
    D   Br cond3 == true
    E
    G   Br cond4 == true
        Br uncond
    B   Br cond2 == false
    C   Br uncond
    F   Br uncond

Improving Locality: Procedure Inlining
● Inlining – duplicate procedure body at call-site
● Partial inlining
  • follow dominant flow of control
  • not practical to find full procedure during dynamic incremental code discovery
● Disadvantages
  • increases code size
  • increases register pressure
[Figure: code blocks A, B and K, L each call procedure xyz (blocks X, Y, Z, ending in Return); after inlining, blocks of xyz are duplicated at each call site.]

Improving Locality: Traces
● Divide program into chunks
  • may contain multiple blocks
● Greedy Method
  • suitable for on-the-fly translation
  • start at hottest block not in trace
  • follow hottest edges
  • stop when trace reaches a certain size
  • stop when a block already in a trace is reached
[Figure: the example control flow graph of blocks A to G with edge counts, partitioned into Trace 1, Trace 2, and Trace 3.]
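The greedy method listed above can be sketched as follows. The sketch assumes a seven-block graph stored as an edge-count matrix and a size limit of four blocks per trace; `N_BLOCKS`, `MAX_TRACE_LEN`, and `form_traces` are hypothetical names, and ties are broken in favor of the lower block number.

```c
#include <assert.h>

#define N_BLOCKS      7          /* blocks A..G numbered 0..6 */
#define MAX_TRACE_LEN 4          /* assumed size limit */

/* Assigns each block a trace id (1, 2, ...) in trace_of[] and
 * returns the number of traces formed. */
int form_traces(const unsigned long block_count[N_BLOCKS],
                unsigned long edge_count[N_BLOCKS][N_BLOCKS],
                int trace_of[N_BLOCKS])
{
    int n_traces = 0;
    for (int b = 0; b < N_BLOCKS; b++)
        trace_of[b] = 0;                     /* 0 = not yet in a trace */

    for (;;) {
        /* Start at the hottest block not yet in a trace. */
        int cur = -1;
        for (int b = 0; b < N_BLOCKS; b++)
            if (trace_of[b] == 0 &&
                (cur < 0 || block_count[b] > block_count[cur]))
                cur = b;
        if (cur < 0)
            return n_traces;                 /* every block is placed */

        n_traces++;
        int len = 0;
        /* Stop at the size limit or at a block already in a trace. */
        while (cur >= 0 && trace_of[cur] == 0 && len < MAX_TRACE_LEN) {
            trace_of[cur] = n_traces;
            len++;
            /* Follow the hottest outgoing edge, if any. */
            int next = -1;
            for (int s = 0; s < N_BLOCKS; s++)
                if (edge_count[cur][s] > 0 &&
                    (next < 0 || edge_count[cur][s] > edge_count[cur][next]))
                    next = s;
            cur = next;
        }
        assert(len > 0);                     /* a trace holds at least its seed */
    }
}
```

With illustrative counts (not read from the figure) in which A goes to D more often than to B and D goes to E more often than to F, the sketch forms three traces: A, D, E, G, then B, C, then F alone. No block appears in more than one trace.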

Improving Locality: Traces (2)
● No redundancy
  • may reduce I-cache pressure
  • good for spatial locality
● Join points sometimes inhibit optimizations.
● Typically not used in optimizing VMs.

Improving Locality: Superblocks
● Superblock – One entry, multiple exits
● May contain redundant blocks (tail duplication)
● More commonly used by dynamic optimizers
  • better branch prediction
  • fewer constraints on optimizations
[Figure: the control flow graph of blocks A to G before and after superblock formation; block G is duplicated.]
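Superblock formation with tail duplication can be sketched under simplifying assumptions: each superblock follows the hottest edge from its head, every side-exit target becomes the head of its own superblock, and growth stops at an existing head or at a size limit. The names `form_superblocks`, `struct superblock`, and `MAX_SB_LEN` are hypothetical, and real dynamic optimizers use additional stopping rules.

```c
#include <assert.h>

#define N_BLOCKS   7             /* blocks A..G numbered 0..6 */
#define MAX_SB_LEN 8             /* assumed size limit */

struct superblock {
    int len;
    int blocks[MAX_SB_LEN];      /* block ids in layout order */
};

/* Hottest successor of block b, or -1 if b has no outgoing edge. */
static int hottest_successor(unsigned long edge_count[N_BLOCKS][N_BLOCKS], int b)
{
    int best = -1;
    for (int s = 0; s < N_BLOCKS; s++)
        if (edge_count[b][s] > 0 &&
            (best < 0 || edge_count[b][s] > edge_count[b][best]))
            best = s;
    return best;
}

/* Fills sbs[] (capacity N_BLOCKS) and returns the number of superblocks. */
int form_superblocks(unsigned long edge_count[N_BLOCKS][N_BLOCKS],
                     int entry, struct superblock sbs[N_BLOCKS])
{
    int is_head[N_BLOCKS] = {0}; /* block starts some superblock */
    int worklist[N_BLOCKS];
    int head = 0, tail = 0, n_sbs = 0;

    assert(entry >= 0 && entry < N_BLOCKS);
    is_head[entry] = 1;
    worklist[tail++] = entry;

    while (head < tail) {
        struct superblock *sb = &sbs[n_sbs++];
        int cur = worklist[head++];
        sb->len = 0;
        do {
            sb->blocks[sb->len++] = cur;     /* may duplicate a block placed elsewhere */
            int next = hottest_successor(edge_count, cur);
            /* Any other successor is a side exit; its target heads a new superblock. */
            for (int s = 0; s < N_BLOCKS; s++)
                if (edge_count[cur][s] > 0 && s != next && !is_head[s]) {
                    is_head[s] = 1;
                    worklist[tail++] = s;
                }
            cur = next;
        } while (cur >= 0 && !is_head[cur] && sb->len < MAX_SB_LEN);

        /* Stopped on the size limit: the fall-out target becomes a head too. */
        if (cur >= 0 && !is_head[cur]) {
            is_head[cur] = 1;
            worklist[tail++] = cur;
        }
    }
    return n_sbs;
}
```

Control only ever enters a superblock at its head, because every exit target is itself a head. With illustrative edge counts (not read from the figure) the sketch produces the superblocks A, D, E, G; B, C, G; and F, G, so block G is duplicated at the tail of each one.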

Superblocks: Example
[Figure: the example control flow graph of blocks A to G with edge counts, shown between the original code layout and the code laid out as superblocks.]
Original layout:
    A   Br cond1 == true
    B   Br cond2 == false
    C   Br uncond
    D   Br cond3 == true
    E   Br uncond
    F
    G   Br cond4 == true
Superblock layout:
    A   Br cond1 == false
    D   Br cond3 == true
    E
    G   Br cond4 == true
        Br uncond

    B   Br cond2 == false
    C
    G   Br cond4 == true
        Br uncond

    F
    G   Br cond4 == true
        Br uncond
