263-2810: Advanced Compiler Design
Compilation with dynamic information
Thomas R. Gross
Computer Science Department
ETH Zurich, Switzerland
Outline
§ Dynamic information: Information obtained at runtime
§ During program execution
§ Why use dynamic information
  § Guidance for optimizations
  § Starting point for speculation
Outline
7.1 Review of compiler models
§ Structure to obtain and use dynamic information
7.2 Techniques to obtain dynamic information
7.3 Classification of dynamic information
§ What is a “profile”?
7.4 Types of profiles
7.5 Practical issues
7.6 Optimization based on dynamic information
7.1 Compiler model
§ So far we used a simple model
§ “Ahead of time” compiler
[Figure: Src → Frontend → IR → Backend → a.out → Execution Platform (OS/Core)]
§ Use of dynamic information requires re-compilation
Re-compilation
§ Compiler can (re)use dynamic information from earlier executions
§ Many questions
§ Off-line model preserved: compiler reads source plus historical information
[Figure: Src → Frontend → IR → Backend → a.out → Execution Platform (OS/Core)]
Continuous compilation
§ No reason to limit compiler to a single recompilation
§ Continuous compilation: compiler works in parallel with program execution
§ Some issues remain
[Figure: Src → Frontend → IR → Backend → a.out → Execution Platform (OS/Core), with a continuous compiler]
Just-in-time compilation
§ Delay compilation to first execution
§ Compilation time matters
§ Combine interpretation and compilation
[Figure: Src → Frontend → IR → Backend → Execution Platform (OS/Core), with a just-in-time compiler]
Issues
§ Unit of compilation
  § Basic block
  § Trace
  § Method
  § Package
§ Delay compilation to first execution
  § Or to N-th execution
  § Interpret first (N-1) executions
§ Use multi-tier compilation model
  § More than one “just-in-time” compiler
  § Optimizer: later stages of multi-tier compilation system
§ How to switch between interpreted and compiled code
§ Is there ever a need to de-optimize?
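A minimal sketch of "interpret the first N-1 executions, compile on the N-th". The class name `TieredFunction`, the use of an expression over a variable `x` as the "program", and Python's built-in `compile()` standing in for a JIT back end are all assumptions made for illustration, not part of the lecture.

```python
class TieredFunction:
    """Interpret the first N-1 calls, compile on the N-th (hypothetical model)."""

    def __init__(self, source, threshold):
        self.source = source        # expression over the variable x
        self.threshold = threshold  # N: the call on which compilation happens
        self.calls = 0
        self.compiled = None

    def __call__(self, x):
        self.calls += 1
        if self.compiled is None and self.calls >= self.threshold:
            # "Compile" once; compile() stands in for a real JIT back end.
            self.compiled = compile(self.source, "<jit>", "eval")
        if self.compiled is not None:
            return eval(self.compiled, {}, {"x": x})
        # Slow path: the source text is parsed again on every call.
        return eval(self.source, {}, {"x": x})


square_plus_one = TieredFunction("x * x + 1", threshold=3)
results = [square_plus_one(i) for i in range(5)]
```

The invocation counter is itself a small piece of dynamic information: it decides when the compilation cost is likely to pay off.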
7.2 Techniques to obtain dynamic information
§ Get compiler to collect information
  § Produces code to collect information
  § Instrument program
  § Precise (collect only what is requested)
§ Get platform (hardware, OS) to collect information
  § Modern processors contain monitoring/measurement units
Program instrumentation
§ Compiler adds operations to program
§ Can be done “early” during compilation
  § Add IR operations
  § Subject to optimization
  § But no guarantee that added operations can be removed
§ Issues
  § Overhead: extra operations take time to execute
  § Extra operations may perturb measurements
    § Cache hit rate
    § Register spill traffic
Hardware-based monitoring
§ Program (processor) monitoring unit (PMU) collects data on the fly
  § Usually no noticeable overhead for collection
  § May have to store collected data
    § Overhead
  § Can be configured to monitor various aspects
§ PMUs use sampling
  § Select “event” to monitor
    § Execution of instruction (any instruction, control flow transfers, method invocation…)
    § Cache miss
    § ….
  § Select frequency of monitoring
§ Sampling allows software to tune the overhead
  § Overhead too high: increase sampling interval
  § Information not precise enough: shorten sampling interval
  § No guarantee that sampling provides meaningful data but it almost always does
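The trade-off can be illustrated with a toy model of interval sampling. The event stream, the method names `hot`/`warm`/`cold`, and the function `sample_profile` are hypothetical; a real PMU samples hardware events, not list entries. The last case shows why there is no guarantee: an interval that matches the period of the program yields a misleading profile.

```python
from collections import Counter


def sample_profile(events, interval):
    """Record every interval-th event and scale the counts back up."""
    sampled = Counter(events[interval - 1::interval])
    return {name: count * interval for name, count in sampled.items()}


# Hypothetical stream: which method is executing at each tick (period 10).
events = (["hot"] * 7 + ["warm"] * 2 + ["cold"]) * 1000

exact = Counter(events)               # full instrumentation
fine = sample_profile(events, 7)      # short interval, more samples
aliased = sample_profile(events, 10)  # interval equals the program's period

ranking = sorted(fine, key=fine.get, reverse=True)
```

With interval 7 the ranking matches the exact counts; with interval 10 every sample falls on the same position of the period and only `cold` is ever observed.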
§ Issues
  § PMU may be processor-dependent
  § Different implementations of same architecture use different PMUs
  § Often many restrictions on what can be observed at the same time
    § May have to execute program multiple times (for same input) to get all the data you need
  § Documentation is often incomplete (or incorrect)
    § PMUs don’t sell processors; performance and energy consumption do
  § Use may require special privileges
    § Do not influence other jobs on the same system
    § No information leakage
Instrumentation vs. PMU usage
§ PMU wins if it can get the information you need
  § Usually faster
  § Usually more precise
  § Often accurate enough
7.3 Classification of dynamic information
§ Brief discussion, not a complete discussion of all aspects
§ An attempt to sort out different dimensions
§ Independent of how the information is collected
  § Measurement by instrumentation
  § Measurement by processor/operating system/runtime system
Deterministic execution
§ Assume deterministic program execution
  § May need to capture/replay input and output (incl. network-based communication or message passing)
  § Design harness to capture environment
    § Packets arrive in the same order, same contents
    § Interrupts/signals arrive in the same order
    § I/O
§ Non-deterministic programs can be made deterministic for monitoring
§ Record random numbers generated and always replay this list
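A minimal sketch of the record/replay idea for random numbers. The classes `RecordingRandom` and `ReplayRandom` and the toy program `monte_carlo` are hypothetical names introduced here; a real harness would also have to capture I/O, packets, and signals.

```python
import random


class RecordingRandom:
    """Wraps a random source and logs every value it hands out."""

    def __init__(self):
        self._rng = random.Random()
        self.log = []

    def random(self):
        value = self._rng.random()
        self.log.append(value)
        return value


class ReplayRandom:
    """Replays a previously recorded list of values in the same order."""

    def __init__(self, log):
        self._values = iter(log)

    def random(self):
        return next(self._values)


def monte_carlo(rng, n):
    """Hypothetical non-deterministic program: counts draws below 0.5."""
    return sum(rng.random() < 0.5 for _ in range(n))


recorder = RecordingRandom()
first_run = monte_carlo(recorder, 100)
replayed_run = monte_carlo(ReplayRandom(recorder.log), 100)
```

Every monitored run that replays `recorder.log` takes the same branches as the recorded run, so profiles from repeated runs can be compared.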
7.3.1 What is measured/observed?
§ Program properties vs. platform (hardware) properties
§ Program properties
  § Do not change if program is executed on a different platform (for the same input)
  § Examples
    § # of times a method is invoked
    § # of times a branch is taken
    § Loop trip count
§ “Profile”: record of program properties
§ Platform (hardware) properties
  § Depend on specific system used for program execution
  § Results on platform X may be different from results on platform Y
  § Examples
    § Cache hit rate (may depend on cache size, replacement policy, mapping strategy)
    § TLB hit rate
    § Prefetching effectiveness
§ Hardware properties require use of PMU or simulator
  § Simulator may be slow
7.3.2 Granularity of information
§ “Fine-grained” vs. “coarse-grained”
§ Fine-grained: Information about single resource/effect of individual operations
  § Examples:
    § Basic blocks
    § Instructions/operations
    § Resources like registers, individual variables, branch outcomes
§ Coarse-grained: Summary information
  § For aggregation/collection
  § Examples:
    § Method properties
    § Package/application/library information
    § Summary information on cache behavior, …
    § Wall-clock timing
7.3.3 Discussion
4 questions/issues to think about
1. What kind of information do you need to capture?
§ Depends on the relationship between dynamic information (observed at runtime) and the compiler’s optimization/transformation
§ Connection not always obvious
§ Dynamic information must provide compiler guidance
§ Profiling or reporting of summary information tells you what happened
  § Time spent
  § Memory region accessed
§ Not clear what compiler can do to improve program
  § Some operations take time
  § … even if implementation is efficient
2. What accuracy is needed?
§ What cost is acceptable?
§ How much perturbation can be tolerated?
  § See Question 1
§ Are observations repeatable? To what extent?
§ All measurements (“live” observations) must deal with cost of collection, overhead, measurement errors, perturbation, ….
§ Devise measurement strategy
3. How can we bridge the gap between what is collected and what is needed by the compiler?
§ Compiler works with data structures (and/or abstractions), instrumentation or PMUs work with addresses
  § Example: Map virtual (or physical) addresses to symbol table that is used by the compiler
  § Objects may be moved by garbage collector
  § Must constantly update map of addresses and (application program) objects
4. Is the information obtained stable?
§ Stability: similar setups provide similar results
§ Variations on this theme: how much does the information depend on the input data?
  § Record decisions on dynamic method resolution for input A
  § Same information for input B
  § Is there (any) overlap?
§ Variation: how does scaling the size of the input set influence the information obtained?
  § Example: Execution time for quadratic vs. exponential algorithms
  § Small input sets may mislead due to constant factors
§ “Algorithmic profiling”
§ Profiles most interesting to compiler
§ Hardware designers should deal with processor properties
  § Helps if the hardware designers understand compilers
Return on investment
§ The cost of obtaining and processing dynamic information must be recovered
  § Speedup of execution
  § Otherwise why bother?
§ May be difficult if
  § Program executed rarely
  § Dynamic information not stable
  § Information gathered may be unconnected to compiler optimization
7.4 Types of profiles
§ A wide variety of events can be observed
7.4.1 Method invocation
§ Frequency of method invocation
§ Which methods are invoked / functions called?
§ Breakdown as function of all methods invoked/functions called
§ Absolute counts
§ Sampling usually works well
  § Some inaccuracy can be tolerated
  § Compiler often needs a ranking (from often-invoked to never invoked)
§ Good tool support
  § gprof – ancient but still relevant on Unix/Linux systems
  § VTune – powerful tool for IA32, connects to managed runtimes
§ Variations
§ Time spent in a method
§ Time spent in the body of method f()
§ Time spent in a method plus (including) time spent in called method
§ f() → g() → h()
§ Time in g and h is included in time for f
§ Time spent in a method in a given context
§ Time spent in method when invoked at call site 1, call site 2, ….
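The difference between time spent in the body of a method and time including callees can be made concrete with a small worked example. The call chain f() → g() → h() is from the slide; the self times (in milliseconds) and the names `CALLS`, `SELF_TIME`, and `inclusive_time` are made-up values and helpers for illustration.

```python
# Hypothetical call tree f() -> g() -> h().
CALLS = {"f": ["g"], "g": ["h"], "h": []}

# Hypothetical time spent in each method body only (exclusive time, in ms).
SELF_TIME = {"f": 5, "g": 20, "h": 75}


def inclusive_time(method):
    """Time in the method body plus time spent in everything it calls."""
    return SELF_TIME[method] + sum(inclusive_time(c) for c in CALLS[method])


inclusive = {m: inclusive_time(m) for m in CALLS}
```

Under these numbers f() looks expensive by inclusive time (100 ms) although its own body accounts for only 5 ms; which view is useful depends on the optimization being considered.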
§ Method information useful but often compiler needs finer-grained information
§ Profiles that look inside a method
7.4.2 Basic blocks
§ Counts
  § How often is a basic block executed
  § Can be obtained by sampling, or
  § Can be obtained through counters (completely accurate)
§ Contribution
  § What percentage of execution time is spent in a given basic block
§ “Block profile”
§ Easy to obtain
  § Good example of optimized instrumentation code
7.4.3 Transitions
§ Frequency of transition from basic block Bi to block Bj
§ “Edge profile”
§ Weight for edges in control flow graph
§ Can be obtained by inserting counters
  § Good target for optimization
  § Don’t need to instrument all edges
7.4.4 Paths taken
§ How often is a path executed (taken)?
§ Path: sequence of basic blocks
§ Example
§ “Path profile”
[Figure: control flow graph with ENTRY, EXIT, and blocks B0 to B9]
P1: B1 B3 B5 B6 B8 B9 – 90
P2: B1 B4 B5 B7 B8 B9 – 10
P3: B2 B4 B5 B6 B8 B9 – 10
P4: B2 B4 B5 B7 B8 B9 – 90
§ Path profile cannot be constructed from edge/block profile alone
  § Additional counters are needed
  § Potentially expensive
§ Often, information on the top k paths is sufficient
  § Limit attention to the top paths (say top 10)
  § Reduce need to instrument
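Why an edge profile does not determine a path profile can be checked on a small case. The CFG below (two consecutive branches: A → B or C → D → E or F → G) and the two runs are hypothetical and deliberately simpler than the figure on the slide; paths are written as strings of single-letter block names.

```python
from collections import Counter


def edge_profile(path_profile):
    """Derive edge counts from a path profile."""
    edges = Counter()
    for path, count in path_profile.items():
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += count
    return edges


# Two different (hypothetical) path profiles over the same CFG.
run_1 = {"ABDEG": 90, "ABDFG": 10, "ACDEG": 10, "ACDFG": 90}
run_2 = {"ABDEG": 50, "ABDFG": 50, "ACDEG": 50, "ACDFG": 50}

same_edges = edge_profile(run_1) == edge_profile(run_2)
top_paths = Counter(run_1).most_common(2)
```

Both runs execute every edge 100 times, so the edge profile cannot tell them apart, although in `run_1` the two branch decisions are strongly correlated. `top_paths` shows the "top k" cut-off for k = 2.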
7.4.5 Discussion
§ Often combine approaches
§ Use inexpensive profiling (method profiling) to determine where more detailed info is needed
§ Properties of data (objects) can be interesting too
§ Previous example: actual type of reference at call site
  § “Call context profile”
§ More general: values assumed by variables, parameters, fields, etc
7.4.6 Value profile
§ Given a point P in the program.
§ The “value profile” at P captures the actual values of a resource at P
  § Can be (min, max)
  § Can be histogram
  § Can be top k values (by count)
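The three forms of a value profile listed above can all be derived from one histogram. A minimal sketch, assuming the instrumented program calls a hypothetical `record()` hook each time point P is reached; the class name `ValueProfile` and the observed values are illustrative.

```python
from collections import Counter


class ValueProfile:
    """Histogram of the values a resource takes at one program point P."""

    def __init__(self):
        self.histogram = Counter()

    def record(self, value):
        self.histogram[value] += 1

    def value_range(self):
        """The (min, max) summary."""
        return min(self.histogram), max(self.histogram)

    def top(self, k):
        """The k most frequent values with their counts."""
        return self.histogram.most_common(k)


profile = ValueProfile()
for observed in [0, 0, 0, 4, 0, 8, 4, 0]:  # hypothetical values seen at P
    profile.record(observed)
```

A full histogram is the most expensive form to collect; (min, max) needs only two words of storage per program point.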
movl (%eax, %edx), %ebx
§ load Memory[%eax+%edx] → %ebx
§ Processor must compute sum before sending out address
  § Adding 0 takes no time …
§ Could record that a register has value 0
§ Processor could send out %eax if %edx contains 0
§ Does this happen often enough to warrant design/implementation effort?
  § Answer: get value profile
§ Compare instructions (for < 0, ≠ 0, etc) can use similar technique
void foo(int i) { … j = j + i; …}
§ a.foo(x)
§ x is (often | always) = 0
§ Create two versions of foo
  § foo0 // x == 0
    § Can optimize stmts like “j = j + i”
  § foo≠0 // default
§ if (x==0) { foo0(); } else { foo≠0(x); }
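The specialization above, sketched in Python rather than the Java-like notation of the slide. The function names `foo_zero` and `foo_general` and the explicit parameter `j` are assumptions; the slide keeps `j` as state of the receiver object.

```python
def foo_general(j, i):
    """Default version: correct for every value of i."""
    return j + i


def foo_zero(j):
    """Specialized for i == 0: the statement 'j = j + i' folds away."""
    return j


def foo(j, i):
    # The guard keeps the program correct for inputs that the value
    # profile never saw; only the common case takes the fast version.
    return foo_zero(j) if i == 0 else foo_general(j, i)


agrees = all(
    foo(j, i) == foo_general(j, i)
    for j in range(-3, 4)
    for i in range(-3, 4)
)
```

The guard costs one comparison per call, so the specialization pays off only if the value profile shows that `i == 0` is frequent and the specialized body is noticeably cheaper.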
§ Value profile obtained from program executions
  § Need (usually) input for program
1. How relevant are value profiles?
2. Program execution must be correct for other inputs
§ Balance measurement effort and specialization effort vs. potential gain
7.5 Practical issues
§ 7.5.1 Obtaining edge/block profiles
§ 7.5.2 Obtaining path profiles
7.5.1 Edge/block profiles
§ Edge profile ⟹ block profile
§ Block profile ⇏ edge profile (in general)
§ Edge profile more powerful but also more expensive to obtain
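Both directions can be checked on a small case. The CFG (A branches to B or C, each of which branches to D or E), the counts, and the helper `block_profile` are hypothetical; the entry block's count is passed in separately because it has no incoming edge.

```python
from collections import Counter


def block_profile(edge_counts, entry, entry_count):
    """Block count = sum of the counts on the block's incoming edges."""
    blocks = Counter({entry: entry_count})
    for (_, dst), count in edge_counts.items():
        blocks[dst] += count
    return blocks


# Two different edge profiles for the same hypothetical CFG.
edges_1 = {("A", "B"): 50, ("A", "C"): 50,
           ("B", "D"): 50, ("B", "E"): 0, ("C", "D"): 0, ("C", "E"): 50}
edges_2 = {("A", "B"): 50, ("A", "C"): 50,
           ("B", "D"): 0, ("B", "E"): 50, ("C", "D"): 50, ("C", "E"): 0}

blocks_1 = block_profile(edges_1, "A", 100)
blocks_2 = block_profile(edges_2, "A", 100)
```

The block profile follows directly from either edge profile, but `blocks_1` and `blocks_2` are identical while the edge profiles differ, so the block counts alone cannot recover the edge counts here.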
§ Consider the sets
  Bcount: set of blocks Bi that are instrumented (contain a counter)
  Ecount: set of edges Ej that are instrumented
§ Goal: to find Bcount and Ecount so that the expected cost is as low as possible
Adding a block to Bcount
§ Add increment operation to update counter CB for B
[Figure: block B before and after inserting “incr CB”]
Adding an edge to Ecount
§ “Instrument” edge
§ Insert new basic block B, with increment operation to update counter CE for E
[Figure: edge E from B0 to B1, before and after inserting new block B with “incr CE”]
§ Adding a basic block to CFG more expensive than adding an increment operation to an existing basic block
§ In general ….
§ But not all edges E in CFG must be in Ecount to obtain frequency information for edge profile
Kirchhoff’s law
§ Kirchhoff’s (first) law: conservation of electric charge
  § Kirchhoff’s current law
  § Sum of currents flowing into a node = sum of currents leaving node
§ Freq(B) = ∑ Freq(E), E incoming edges
§ Freq(B) = ∑ Freq(E’), E’ outgoing edges
[Figure: block B with incoming and outgoing edges]
§ Given a block B with k incoming and outgoing edges
§ Need k-1 counters
§ Idea: If there is a choice, do not instrument high-frequency edges
§ High-frequency edges: either based on guess or on dynamic information from previous executions
§ Need to find out which edges need instrumentation (i.e., which are good or bad candidates)
Spanning tree (review)
§ Given a CFG G with nodes B and edges E
§ Spanning tree (ST(G)) defined by
  § T – set of edges, T ⊆ E, edges in T undirected
  § Any pair of blocks B’, B’’ ∊ B is connected by a unique (cycle free) path in ST
§ Direction of edges in G not considered for ST(G)
Example
void foo() {
  while (P) do {
    if (Q) { A } else { B }
    if (R) { Exit() }
    C
  } // end while
}
[Figure: CFG for foo() with nodes ENTRY, P, Q, A, B, R, C, EXIT]
Add edge EXIT → ENTRY so that Kirchhoff’s law is valid for all nodes
[Figure: CFG for foo() with the added edge EXIT → ENTRY]
§ Given weights on edges, a maximum spanning tree is a spanning tree for which the sum of the weights of the edges in T is maximal
  § Weights can be obtained from previous executions, guesses, …
Maximum spanning tree
[Figure: CFG for foo() with edge weights 1.0, 1.0, 10.5, 5.25, 5.25, 0.5, 10, 5.25, 5.25, 0.5, 10]
§ Maximum spanning tree
  § All nodes are reachable
  § Examples
Instrumentation
§ Ecount = E provides solution but wastes cycles
§ Minimize overhead: edges that are not part of the maximum spanning tree are candidates
Option 1
[Figure: Option 1, CFG for foo() with edge weights]
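A sketch of the whole scheme for the CFG of foo(): build a maximum spanning tree, place counters only on the edges outside the tree, and recover the remaining counts with the flow equations Freq(B) = ∑ incoming = ∑ outgoing. The assignment of the figure's weights to individual edges is an assumption (chosen so that flow is conserved at every node), Kruskal's algorithm is one standard way to build the tree and not necessarily the one used in the lecture, and the same numbers stand in for both the estimated weights and the measured counts.

```python
# Edge frequencies for the CFG of foo(), including the edge EXIT -> ENTRY.
# Which number belongs to which edge is an assumption.
FREQ = {
    ("ENTRY", "P"): 1.0, ("EXIT", "ENTRY"): 1.0,
    ("P", "Q"): 10.5, ("P", "EXIT"): 0.5,
    ("Q", "A"): 5.25, ("Q", "B"): 5.25,
    ("A", "R"): 5.25, ("B", "R"): 5.25,
    ("R", "EXIT"): 0.5, ("R", "C"): 10.0, ("C", "P"): 10.0,
}


def max_spanning_tree(weights):
    """Kruskal on the undirected view of the CFG, heaviest edges first."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            node = parent[node]
        return node

    tree = set()
    for edge in sorted(weights, key=weights.get, reverse=True):
        root_a, root_b = find(edge[0]), find(edge[1])
        if root_a != root_b:  # edge joins two components: keep it in the tree
            parent[root_a] = root_b
            tree.add(edge)
    return tree


def recover(measured, all_edges):
    """Fill in the counts of uninstrumented edges by flow conservation."""
    known = dict(measured)
    unknown = set(all_edges) - set(known)
    while unknown:
        progress = False
        for node in {n for edge in unknown for n in edge}:
            touching = [e for e in unknown if node in e]
            if len(touching) != 1:
                continue  # more than one unknown edge here: try again later
            edge = touching[0]
            inflow = sum(c for (_, d), c in known.items() if d == node)
            outflow = sum(c for (s, _), c in known.items() if s == node)
            known[edge] = outflow - inflow if edge[1] == node else inflow - outflow
            unknown.remove(edge)
            progress = True
        if not progress:
            raise ValueError("not enough counters to recover all edges")
    return known


tree = max_spanning_tree(FREQ)
counted = {e: FREQ[e] for e in FREQ if e not in tree}  # Ecount
recovered = recover(counted, FREQ)
```

With 8 nodes and 11 edges the tree has 7 edges, so only 4 edges carry a counter, and they are the low-frequency ones. Recovery always makes progress because the uninstrumented edges form a tree, which has a leaf node with exactly one unknown edge.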
7.5.2 Path profiles
§ Selection of edges for profiling (7.5.1) can be extended to obtain path profiles
§ Idea: log sequence of edges traversed
  § Instead of incrementing a counter, write a “witness” to log file when edge is traversed
  § More overhead – must record witness marker
§ After program terminates, process witness log file
  § Reconstruct paths taken
  § Cut-off can be implemented easily
§ Path profiles useful but there exists a better alternative
  § Traces – topic of future lecture
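A minimal sketch of processing a witness log after the program terminates. The log format (one `(source, destination)` pair per traversed edge, with a path ending at `EXIT`) and the function `paths_from_log` are assumptions; for simplicity every edge is logged here, whereas the scheme above would log only the edges selected in 7.5.1.

```python
from collections import Counter


def paths_from_log(witness_log, exit_block="EXIT"):
    """Split the witness log into ENTRY-to-EXIT paths and count each one."""
    counts, current = Counter(), []
    for src, dst in witness_log:
        current.append((src, dst))
        if dst == exit_block:  # path complete: count it, start the next one
            counts[tuple(current)] += 1
            current = []
    return counts


# Hypothetical log of three executions over two distinct paths.
run_a = [("ENTRY", "A"), ("A", "EXIT")]
run_b = [("ENTRY", "B"), ("B", "EXIT")]
log = run_a + run_b + run_a

path_counts = paths_from_log(log)
top_1 = path_counts.most_common(1)  # cut-off: keep only the top k paths
```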