263-2810: Advanced Compiler Design: Compilation with dynamic information



slide-1
SLIDE 1

263-2810: Advanced Compiler Design Compilation with dynamic information

Thomas R. Gross Computer Science Department ETH Zurich, Switzerland

slide-2
SLIDE 2

Outline

§ Dynamic information: Information obtained at runtime

§ During program execution

§ Why use dynamic information

§ Guidance for optimizations § Starting point for speculation


slide-3
SLIDE 3

Outline

7.1 Review of compiler models

§ Structure to obtain and use dynamic information

7.2 Techniques to obtain dynamic information

7.3 Classification of dynamic information

§ What is a “profile”?

7.4 Types of profiles

7.5 Practical issues

7.6 Optimization based on dynamic information


slide-4
SLIDE 4

7.1 Compiler model

§ So far we used a simple model

§ “Ahead of time” compiler

[Figure: Src → Frontend → IR → Backend → a.out → Execution Platform (OS/Core)]

slide-5
SLIDE 5

7.1 Compiler model


§ Use of dynamic information requires re-compilation

slide-6
SLIDE 6

Re-compilation

§ Compiler can (re)use dynamic information from earlier executions

§ Many questions § Off-line model preserved: compiler reads source plus historical information

[Figure: Src → Frontend → IR → Backend → a.out → Execution Platform (OS/Core)]

slide-7
SLIDE 7

Continuous compilation

§ No reason to limit compiler to a single recompilation § Continuous compilation: compiler works in parallel with program execution

§ Some issues remain

[Figure: Src → Frontend → IR → Backend → a.out → Execution Platform (OS/Core), with a continuous compiler]

slide-8
SLIDE 8

Just-in-time compilation

§ Delay compilation to first execution § Compilation time matters § Combine interpretation and compilation

[Figure: Src → Frontend → IR → Backend → Execution Platform (OS/Core), with a just-in-time compiler]

slide-9
SLIDE 9

Issues

§ Unit of compilation

§ Basic block § Trace § Method § Package

§ Delay compilation to first execution

§ Or to N-th execution § Interpret first (N-1) executions

§ Use multi-tier compilation model

§ More than one “just-in-time” compiler § Optimizer: later stages of multi-tier compilation system

§ How to switch between interpreted and compiled code § Is there ever a need to de-optimize?
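The "delay compilation to the N-th execution" policy can be sketched as a per-method invocation counter. This is a minimal illustration, not how any particular virtual machine works: the class `TieredRuntime`, the threshold value, and the two stand-in functions are all hypothetical.

```python
from collections import defaultdict


class TieredRuntime:
    """Toy runtime: interpret a method until its N-th call, then compile it."""

    def __init__(self, threshold):
        self.threshold = threshold      # N: call count that triggers compilation
        self.calls = defaultdict(int)   # per-method invocation counters
        self.compiled = {}              # method name -> compiled callable

    def invoke(self, name, interpret, compile_fn, *args):
        """Run `name`; return (mode, result) with mode 'interpreted' or 'compiled'."""
        self.calls[name] += 1
        if name not in self.compiled and self.calls[name] >= self.threshold:
            self.compiled[name] = compile_fn()  # pay the compilation cost once
        if name in self.compiled:
            return "compiled", self.compiled[name](*args)
        return "interpreted", interpret(*args)


def interpret_square(x):
    return x * x            # stands in for slow, interpreted execution


def compile_square():
    return lambda x: x * x  # stands in for generated machine code


runtime = TieredRuntime(threshold=3)
modes = [runtime.invoke("square", interpret_square, compile_square, 4)[0]
         for _ in range(4)]
print(modes)
```

With a threshold of 3, the first two calls are interpreted and the third call triggers compilation. A multi-tier system would keep several thresholds, one per compiler tier.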


slide-10
SLIDE 10

7.2 Techniques to obtain dynamic information

§ Get compiler to collect information

§ Produces code to collect information § Instrument program § Precise (collect only what is requested)

§ Get platform (hardware, OS) to collect information

§ Modern processors contain monitoring/measurement units


slide-11
SLIDE 11

Program instrumentation

§ Compiler adds operations to program § Can be done “early” during compilation

§ Add IR operations § Subject to optimization § But no guarantee that added operations can be removed

§ Issues

§ Overhead: extra operations take time to execute § Extra operations may perturb measurements § Cache hit rate § Register spill traffic


slide-12
SLIDE 12

Hardware-based monitoring

§ Performance monitoring unit (PMU) of the processor collects data on the fly

§ Usually no noticeable overhead for collection
§ Collected data may have to be stored, which adds overhead
§ Can be configured to monitor various aspects

§ PMUs use sampling

§ Select “event” to monitor

§ Execution of instruction (any instruction, control flow transfers, method invocation…) § Cache miss § ….

§ Select frequency of monitoring


slide-13
SLIDE 13

§ Sampling allows software to tune the overhead

§ Overhead too high: Increase sampling interval § Information not precise enough: shorten sampling interval § No guarantee that sampling provides meaningful data but it almost always does

§ Issues

§ PMU may be processor-dependent § Different implementations of same architecture use different PMUs § Often many restrictions on what can be observed at the same time

§ May have to execute program multiple times (for same input) to get all the data you need

§ Documentation is often incomplete (or incorrect)

§ PMUs don’t sell processors; performance and energy consumption do

§ Use may require special privileges

§ Do not influence other jobs on the same system § No information leakage
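The trade-off between sampling interval, overhead, and precision described on this slide can be sketched with a synthetic event stream. Everything here is an assumption made for illustration: the stream is strictly periodic (real programs are not), and `make_stream` and `sample` are hypothetical helpers, not a PMU interface. The second run shows the "no guarantee" case: an interval that happens to match the period of the program sees only one kind of event.

```python
from collections import Counter


def make_stream(length):
    """Synthetic, strictly periodic event stream: 60% f, 30% g, 10% h."""
    pattern = ["f"] * 6 + ["g"] * 3 + ["h"]
    return [pattern[i % 10] for i in range(length)]


def sample(stream, interval):
    """Record every `interval`-th event and scale the counts back up.

    Returns (number of samples taken, estimated count per event).
    The number of samples stands in for the collection overhead.
    """
    taken = stream[::interval]
    estimate = {name: n * interval for name, n in Counter(taken).items()}
    return len(taken), estimate


stream = make_stream(7000)
exact = dict(Counter(stream))

cost_7, est_7 = sample(stream, 7)     # interval does not line up with the period
cost_10, est_10 = sample(stream, 10)  # interval equals the period: aliasing

print(exact, cost_7, est_7)
print(cost_10, est_10)
```

A longer interval takes fewer samples (lower overhead), but the interval of 10 attributes all events to `f`. Varying the interval from sample to sample is one way to reduce the risk of such aliasing.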


slide-14
SLIDE 14

Instrumentation vs. PMU usage

§ PMU wins if it can get the information you need

§ Usually faster § Usually more precise § Often accurate enough


slide-15
SLIDE 15

7.3 Classification of dynamic information

§ Brief discussion, not a complete discussion of all aspects § An attempt to sort out different dimensions § Independent of how the information is collected

§ Measurement by instrumentation § Measurement by processor/operating system/runtime system


slide-16
SLIDE 16

Deterministic execution

§ Assume deterministic program execution

§ May need to capture/replay input and output (incl. network-based communication or message passing) § Design harness to capture environment

§ Packets arrive in the same order, same contents § Interrupts/signals arrive in the same order § I/O

§ Non-deterministic programs can be made deterministic for monitoring

§ Record random numbers generated and always replay this list
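Recording random numbers once and replaying the list afterwards can be sketched as two small wrappers. The names `RecordingRandom`, `ReplayRandom`, and `program` are hypothetical and exist only for this illustration; a real harness would also have to capture I/O, signals, and network traffic.

```python
import random


class RecordingRandom:
    """Hands out fresh random numbers and logs every value it returns."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self.log = []

    def next(self):
        value = self._rng.random()
        self.log.append(value)
        return value


class ReplayRandom:
    """Returns the previously recorded values in the same order."""

    def __init__(self, log):
        self._values = iter(log)

    def next(self):
        return next(self._values)


def program(rng, steps=5):
    """Hypothetical program whose branch decisions depend on random numbers."""
    return ["taken" if rng.next() < 0.5 else "not-taken" for _ in range(steps)]


recorder = RecordingRandom()
first_run = program(recorder)
second_run = program(ReplayRandom(recorder.log))
print(first_run == second_run)
```

The monitored runs use `ReplayRandom`, so every run takes the same branches and the measurements become repeatable.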


slide-17
SLIDE 17

7.3.1 What is measured/observed?

§ Program properties vs. platform (hardware) properties § Program properties

§ Do not change if program is executed on a different platform (for the same input) § Examples § # of times a method is invoked § # of times a branch is taken § Loop trip count

§ “Profile” : record of program properties


slide-18
SLIDE 18

§ Platform (hardware) properties

§ Depend on specific system used for program execution § Results on platform X may be different from results on platform Y § Examples § Cache hit rate (may depend on cache size, replacement policy, mapping strategy) § TLB hit rate § Prefetching effectiveness

§ Hardware properties require use of PMU or simulator

§ Simulator may be slow


slide-19
SLIDE 19

7.3.2 Granularity of information

§ “Fine-grained” vs. “coarse-grained” § Fine-grained: Information about single resource/effect of individual operations

§ Examples:

§ Basic blocks § Instructions/operations § Resources like registers, individual variables, branch outcomes

§ Coarse-grained: Summary information

§ For aggregation/collection § Examples:

§ Method properties § Package/application/library information § Summary information on cache behavior, … § Wall-clock timing


slide-20
SLIDE 20

7.3.3 Discussion

4 questions/issues to think about

1. What kind of information do you need to capture?

§ Depends on the relationship between dynamic information (observed at runtime) and the compiler’s optimization/transformation

§ Connection not always obvious § Dynamic information must provide compiler guidance


slide-21
SLIDE 21

§ Profiling or reporting of summary information tells you what happened

§ Time spent § Memory region accessed

§ Not clear what compiler can do to improve program

§ Some operations take time § … even if implementation is efficient


slide-22
SLIDE 22
2. What accuracy is needed?

§ What cost is acceptable?

§ How much perturbation can be tolerated?

§ See Question 1

§ Are observations repeatable? To what extent?

§ All measurements (“live” observations) must deal with cost of collection, overhead, measurement errors, perturbation, …

§ Devise measurement strategy


slide-23
SLIDE 23
3. How can we bridge the gap between what is collected and what is needed by the compiler?

§ Compiler works with data structures (and/or abstractions); instrumentation or PMUs work with addresses

§ Example: Map virtual (or physical) addresses to the symbol table that is used by the compiler
§ Objects may be moved by the garbage collector
§ Must constantly update the map between addresses and (application program) objects
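Mapping a sampled address back to a compiler-level name can be sketched with a sorted list of start addresses and a binary search. The class `AddressMap`, its methods, and the addresses used are hypothetical; the `move` method stands in for the update that is needed whenever the garbage collector relocates an object.

```python
import bisect


class AddressMap:
    """Maps address ranges to symbol names (methods or heap objects)."""

    def __init__(self):
        self._starts = []    # sorted start addresses
        self._entries = {}   # start address -> (size, name)

    def add(self, start, size, name):
        bisect.insort(self._starts, start)
        self._entries[start] = (size, name)

    def move(self, old_start, new_start):
        """Update the map after an object was relocated (e.g., by the GC)."""
        size, name = self._entries.pop(old_start)
        self._starts.remove(old_start)
        self.add(new_start, size, name)

    def lookup(self, addr):
        """Return the name whose range contains `addr`, or None."""
        i = bisect.bisect_right(self._starts, addr) - 1
        if i < 0:
            return None
        start = self._starts[i]
        size, name = self._entries[start]
        return name if addr < start + size else None


amap = AddressMap()
amap.add(0x1000, 0x40, "foo")      # code of method foo
amap.add(0x2000, 0x10, "obj_a")    # a heap object
print(amap.lookup(0x1008))
amap.move(0x2000, 0x3000)          # collector moved obj_a
print(amap.lookup(0x2004), amap.lookup(0x3004))
```

A sample taken before the move and resolved after it would be attributed to the wrong object or to none, which is why the map has to be kept current.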


slide-24
SLIDE 24
4. Is the information obtained stable?

§ Stability: similar setups provide similar results

§ Variations on this theme: how much does the information depend on the input data?

§ Record decisions on dynamic method resolution for input A § Same information for input B § Is there (any) overlap?

§ Variation: how does scaling the size of the input set influence the information obtained?

§ Example: Execution time of quadratic vs. exponential algorithms § Small input sets may mislead due to constant factors

§ “Algorithmic profiling”


slide-25
SLIDE 25

§ Profiles most interesting to compiler § Hardware designers should deal with processor properties

§ Helps if the hardware designers understand compilers


slide-26
SLIDE 26

Return on investment

§ The cost of obtaining and processing dynamic information must be recovered

§ Speedup of execution § Otherwise why bother?

§ May be difficult if

§ Program executed rarely § Dynamic information not stable § Information gathered may be unconnected to compiler optimization


slide-27
SLIDE 27

7.4 Types of profiles

§ A wide variety of events can be observed


slide-28
SLIDE 28

7.4.1 Method invocation

§ Frequency of method invocation

§ Which methods are invoked / functions called?

§ Breakdown as function of all methods invoked/functions called

§ Absolute counts

§ Sampling usually works well

§ Some inaccuracy can be tolerated § Compiler often needs a ranking (from often-invoked to never invoked)

§ Good tool support

§ gprof – ancient but still relevant on Unix/Linux systems § vtune – powerful tool for IA32, connects to managed runtimes


slide-29
SLIDE 29

§ Variations

§ Time spent in a method

§ Time spent in the body of method f()

§ Time spent in a method plus (including) time spent in called methods

§ f() → g() → h() § Time in g and h is included in time for f

§ Time spent in a method in a given context

§ Time spent in method when invoked at call site 1, call site 2, ….
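The three variations (time in the body only, time including callees, time per calling context) can be sketched on a small call tree. The tree, the time values, and the call-site labels are made up for illustration.

```python
from collections import defaultdict


def inclusive_time(node):
    """Time in the method body plus time in everything it calls."""
    return node["self"] + sum(inclusive_time(c) for c in node["calls"])


def time_by_context(records):
    """Sum time per (method, call site) pair."""
    totals = defaultdict(int)
    for call_site, method, elapsed in records:
        totals[(method, call_site)] += elapsed
    return dict(totals)


# Call chain f() -> g() -> h(); "self" is the time spent in the body alone.
h = {"name": "h", "self": 2, "calls": []}
g = {"name": "g", "self": 3, "calls": [h]}
f = {"name": "f", "self": 5, "calls": [g]}

print(f["self"], inclusive_time(f))   # body time vs. inclusive time of f

# Hypothetical records: (call site, method, time) for a method m
records = [("site1", "m", 4), ("site2", "m", 1), ("site1", "m", 6)]
print(time_by_context(records))
```

Here the time in `g` and `h` is included in the inclusive time of `f`, and the context profile shows that `m` is expensive mainly when invoked from `site1`.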


slide-30
SLIDE 30

§ Method information useful but often compiler needs finer-grained information § Profiles that look inside a method


slide-31
SLIDE 31

7.4.2 Basic blocks

§ Counts

§ How often is a basic block executed § Can be obtained by sampling, or § Can be obtained through counters (completely accurate)

§ Contribution

§ What percentage of execution time is spent in a given basic block

§ “Block profile” § Easy to obtain

§ Good example of optimized instrumentation code


slide-32
SLIDE 32


slide-33
SLIDE 33

7.4.3 Transitions

§ Frequency of transition from basic block Bi to block Bj § “Edge profile” § Weight for edges in control flow graph § Can be obtained by inserting counters

§ Good target for optimization § Don’t need to instrument all edges


slide-34
SLIDE 34


slide-35
SLIDE 35

7.4.4 Paths taken

§ How often is a path executed (taken)? § Path: sequence of basic blocks

§ Example

§ “Path profile”


slide-36
SLIDE 36

P1: B1 B3 B5 B6 B8 B9 – 90
P2: B1 B4 B5 B7 B8 B9 – 10
P3: B2 B4 B5 B6 B8 B9 – 10
P4: B2 B4 B5 B7 B8 B9 – 90

slide-37
SLIDE 37

[Figure: control flow graph with nodes ENTRY, B0 to B9, and EXIT, shown with the same path counts P1 to P4]

slide-38
SLIDE 38

§ Path profile cannot be constructed from edge/block profile alone

§ Additional counters are needed § Potentially expensive

§ Often, information on the top k paths is sufficient

§ Limit attention to the top paths (say top 10) § Reduce need to instrument
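That a path profile cannot be reconstructed from an edge profile can be checked on the path counts P1 to P4 from the example. The sketch below derives the edge profile from those counts and then from a second, made-up path profile (`other_profile`, an assumption introduced only for this illustration) that yields exactly the same edge counts.

```python
from collections import Counter


def edge_profile(path_profile):
    """Derive edge counts from a path profile (path tuple -> count)."""
    edges = Counter()
    for path, count in path_profile.items():
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += count
    return edges


# Path counts P1..P4 as given in the example
slide_profile = {
    ("B1", "B3", "B5", "B6", "B8", "B9"): 90,
    ("B1", "B4", "B5", "B7", "B8", "B9"): 10,
    ("B2", "B4", "B5", "B6", "B8", "B9"): 10,
    ("B2", "B4", "B5", "B7", "B8", "B9"): 90,
}

# Hypothetical alternative with different path counts
other_profile = {
    ("B1", "B3", "B5", "B6", "B8", "B9"): 50,
    ("B1", "B3", "B5", "B7", "B8", "B9"): 40,
    ("B1", "B4", "B5", "B7", "B8", "B9"): 10,
    ("B2", "B4", "B5", "B6", "B8", "B9"): 50,
    ("B2", "B4", "B5", "B7", "B8", "B9"): 50,
}

same_edges = edge_profile(slide_profile) == edge_profile(other_profile)
print(same_edges, slide_profile == other_profile)
```

Both profiles produce identical edge counts (for example, 100 traversals of B5 → B6) although the hot paths differ, so additional counters are needed to tell them apart.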


slide-39
SLIDE 39

7.4.5 Discussion

§ Often combine approaches § Use inexpensive profiling (method profiling) to determine where more detailed info is needed


slide-40
SLIDE 40

§ Properties of data (objects) can be interesting too § Previous example: actual type of reference at call site

§ “Call context profile”

§ More general: values assumed by variables, parameters, fields, etc


slide-41
SLIDE 41

7.4.6 Value profile

§ Given a point P in the program, the “value profile” at P captures the actual values of a resource at P

§ Can be (min, max) § Can be histogram § Can be top k values (by count)
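The three representations of a value profile (range, histogram, top k values) can be sketched with a counter per program point. The class `ValueProfile` and the observed values are hypothetical.

```python
from collections import Counter


class ValueProfile:
    """Records the values observed for one resource at one program point."""

    def __init__(self):
        self.histogram = Counter()   # value -> number of observations

    def record(self, value):
        self.histogram[value] += 1

    def min_max(self):
        return min(self.histogram), max(self.histogram)

    def top_k(self, k):
        """The k most frequent values with their counts."""
        return self.histogram.most_common(k)


profile = ValueProfile()
for observed in [0, 0, 0, 4, 0, 7, 4, 0]:   # made-up values of a parameter
    profile.record(observed)

print(profile.min_max())
print(profile.top_k(2))
```

The full histogram is the most informative and the most expensive form; (min, max) needs only two words of storage per program point.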


slide-42
SLIDE 42

movl (%eax, %edx), %ebx

§ load Memory[%eax+%edx] → %ebx § Processor must compute sum before sending out address

§ Adding 0 takes no time …

§ Could record that a register has value 0 § Processor could send out %eax if %edx contains 0 § Does this happen often enough to warrant design/implementation effort? § Answer: get value profile

§ Compare instructions (for < 0, ≠ 0, etc) can use similar technique


slide-43
SLIDE 43

void foo(int i) { … j = j + i; … }

§ Call site: a.foo(x)
§ x is (often | always) = 0
§ Create two versions of foo
§ foo0 // x == 0
§ Can optimize statements like “j = j + i”
§ foo≠0 // default
§ Dispatch at the call site: if (x == 0) { foo0(); } else { foo≠0(x); }
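The specialization of foo for the frequent value 0 can be sketched as a guarded dispatch. Only the statement `j = j + i` from the example is modeled; the class `A` and the method names `foo_zero` and `foo_general` are hypothetical stand-ins for the two versions of foo.

```python
class A:
    def __init__(self):
        self.j = 0

    def foo_general(self, i):
        """Default version, correct for every value of i."""
        self.j = self.j + i

    def foo_zero(self):
        """Specialized for i == 0: the statement j = j + 0 folds away."""
        pass

    def foo(self, i):
        # Guard inserted by the compiler: take the fast version only if the
        # profiled value actually occurs, otherwise fall back to the default.
        if i == 0:
            self.foo_zero()
        else:
            self.foo_general(i)


a = A()
for x in [0, 0, 5, 0, -2]:   # made-up argument values, mostly 0
    a.foo(x)
print(a.j)
```

The guard keeps the program correct for inputs that do not match the profile; the specialization only pays off if the test is cheaper than the work it saves.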


slide-44
SLIDE 44

§ Value profile obtained from program executions

§ Need (usually) input for program
1. How relevant are value profiles?
2. Program execution must be correct for other inputs

§ Balance measurement effort and specialization effort vs potential gain


slide-45
SLIDE 45

7.5 Practical issues

§ 7.5.1 Obtaining edge/block profiles § 7.5.2 Obtaining path profiles


slide-46
SLIDE 46

7.5.1 Edge/block profiles

§ Edge profile ⟹ block profile § Block profile does not, in general, determine the edge profile

§ In general

§ Edge profile more powerful but also more expensive to obtain


slide-47
SLIDE 47

§ Consider the sets
Bcount: set of blocks Bi that are instrumented (contain a counter)
Ecount: set of edges Ej that are instrumented
§ Goal: find Bcount and Ecount so that the expected cost is as low as possible


slide-48
SLIDE 48

Adding a block to Bcount

§ Add increment operation to update counter CB for B

[Figure: block B before and after instrumentation; the instrumented block contains the operation incr CB]

slide-49
SLIDE 49

Adding an edge to Ecount

§ “Instrument” edge § Insert new basic block B, with increment operation to update counter CE for E

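[Figure: edge E from B0 to B1 before instrumentation; afterwards a new block B containing incr CE sits on the edge between B0 and B1]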

slide-50
SLIDE 50

§ Adding a basic block to CFG more expensive than adding an increment operation to an existing basic block

§ In general ….

§ But not all edges E in CFG must be in Ecount to obtain frequency information for edge profile


slide-51
SLIDE 51

Kirchhoff’s law

§ Kirchhoff’s (first) law: conservation of electric charge

§ Kirchhoff’s current law § Sum of currents flowing into a node = sum of currents leaving node

§ Freq(B) = ∑ Freq (E), E incoming edges § Freq(B) = ∑ Freq (E’), E’ outgoing edges
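The two sums above make it possible to derive the frequency of one edge at a block from the counters on all its other edges. A minimal sketch, with made-up counts and a hypothetical helper `solve_missing`; the un-instrumented edge is marked with `None`.

```python
def solve_missing(incoming, outgoing):
    """Derive the one unknown edge count at a block from flow conservation.

    `incoming` and `outgoing` map edge names to counts; exactly one entry
    across both dictionaries must be None (the un-instrumented edge).
    """
    unknown = [(side, edge)
               for side, edges in (("in", incoming), ("out", outgoing))
               for edge, count in edges.items() if count is None]
    if len(unknown) != 1:
        raise ValueError("need exactly one unknown edge")
    side, edge = unknown[0]
    sum_in = sum(c for c in incoming.values() if c is not None)
    sum_out = sum(c for c in outgoing.values() if c is not None)
    # Sum of incoming frequencies equals sum of outgoing frequencies.
    value = sum_out - sum_in if side == "in" else sum_in - sum_out
    return edge, value


# Hypothetical block B5 with two incoming and two outgoing edges
incoming = {"B3->B5": 90, "B4->B5": 110}
outgoing = {"B5->B6": 100, "B5->B7": None}   # no counter on B5->B7
print(solve_missing(incoming, outgoing))
```

With four incident edges, three counters are enough; the fourth count follows from the equation.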


slide-52
SLIDE 52

§ Given a block B with k incident (incoming plus outgoing) edges

§ Need k-1 counters; the frequency of the remaining edge follows from Kirchhoff’s law

§ Idea: If there is a choice, do not instrument high-frequency edges

§ High-frequency edges: either based on guess or on dynamic information from previous executions

§ Need to find out which edges need instrumentation (i.e., which are good or bad candidates)


slide-53
SLIDE 53

Spanning tree (review)

§ Given a CFG G with nodes B and edges E § Spanning tree (ST(G)) defined by

§ T – set of edges, T ⊆ E, edges in T undirected § Any pair of blocks B’, B’’ ∊ B is connected by a unique (cycle free) path in ST

§ Direction of edges in G not considered for ST(G)


slide-54
SLIDE 54

Example

void foo() {
  while (P) do {
    if (Q) { A } else { B }
    if (R) { Exit() }
    C
  } // end while
}


slide-55
SLIDE 55

Example

[Figure: CFG of foo() with nodes ENTRY, P, Q, A, B, R, C, EXIT]

Add edge EXIT → ENTRY so that Kirchhoff’s law is valid for all nodes

slide-56
SLIDE 56

[Figure: two copies of the CFG of foo() with nodes ENTRY, P, Q, A, B, R, C, EXIT]

slide-57
SLIDE 57

§ Given weights on edges, a maximum spanning tree is a spanning tree for which the sum of the weights of the edges in T is maximal

§ Weights can be obtained from previous executions, guesses, …


slide-58
SLIDE 58

Maximum spanning tree

[Figure: CFG of foo() annotated with edge weights (values shown: 1.0, 1.0, 10.5, 5.25, 5.25, 0.5, 10, 5.25, 5.25, 0.5, 10)]

slide-59
SLIDE 59

§ Maximum spanning tree

§ All nodes are reachable § Examples


slide-60
SLIDE 60

Instrumentation

§ Ecount = E provides solution but wastes cycles § Minimize overhead: edges that are not part of the maximum spanning tree are candidates
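The selection of edges to instrument can be sketched as follows: build a maximum spanning tree over the undirected CFG, put counters only on the edges outside the tree, and recover the remaining counts from flow conservation. The edge list is a reconstruction of the CFG of foo() from the example code, and the assignment of weights to edges is an assumption (chosen so that inflow equals outflow at every node). The counts in `run_counts` are hypothetical (the weights scaled by 4), and the function names are made up for this sketch.

```python
def maximum_spanning_tree(nodes, weights):
    """Kruskal's algorithm on the undirected CFG, heaviest edges first."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n

    tree = set()
    for edge in sorted(weights, key=weights.get, reverse=True):
        a, b = find(edge[0]), find(edge[1])
        if a != b:            # edge joins two components: keep it in the tree
            parent[a] = b
            tree.add(edge)
    return tree


def recover_counts(nodes, edges, measured):
    """Fill in un-instrumented edge counts using inflow == outflow per node."""
    counts = dict(measured)
    unknown = [e for e in edges if e not in counts]
    while unknown:
        progress = False
        for node in nodes:
            missing = [e for e in unknown if node in e]
            if len(missing) != 1:
                continue      # zero or several unknown edges at this node
            edge = missing[0]
            inflow = sum(c for (s, d), c in counts.items() if d == node)
            outflow = sum(c for (s, d), c in counts.items() if s == node)
            counts[edge] = outflow - inflow if edge[1] == node else inflow - outflow
            unknown.remove(edge)
            progress = True
        if not progress:
            raise ValueError("counts are not determined by the measurements")
    return counts


nodes = ["ENTRY", "P", "Q", "A", "B", "R", "C", "EXIT"]
weights = {  # assumed estimates, e.g., from a previous execution
    ("ENTRY", "P"): 1.0, ("P", "Q"): 10.5, ("P", "EXIT"): 0.5,
    ("Q", "A"): 5.25, ("Q", "B"): 5.25, ("A", "R"): 5.25, ("B", "R"): 5.25,
    ("R", "EXIT"): 0.5, ("R", "C"): 10.0, ("C", "P"): 10.0,
    ("EXIT", "ENTRY"): 1.0,   # added edge so conservation holds everywhere
}
run_counts = {e: int(w * 4) for e, w in weights.items()}  # hypothetical run

tree = maximum_spanning_tree(nodes, weights)
instrumented = set(weights) - tree               # only these get counters
measured = {e: run_counts[e] for e in instrumented}
recovered = recover_counts(nodes, list(weights), measured)
print(sorted(instrumented), recovered == run_counts)
```

In this sketch 4 of the 11 edges carry counters, and the hot loop edges (P → Q, R → C, C → P) stay free of instrumentation. Ties between equal weights are broken by the order of the edge list, so a different order can select a different, equally good, tree.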


slide-61
SLIDE 61

Option 1

[Figure: CFG of foo() with edge weights, showing one choice of instrumented edges (values shown: 1.0, 1.0, 10.5, 5.25, 5.25, 0.5, 10, 5.25, 5.25, 0.5)]

slide-62
SLIDE 62


slide-63
SLIDE 63

7.5.2 Path profiles

§ Selection of edges for profiling (7.5.1) can be extended to obtain path profiles

§ Idea: log sequence of edges traversed

§ Instead of incrementing a counter, write a “witness” to log file when edge is traversed § More overhead – must record witness marker

§ After program terminates, process witness log file

§ Reconstruct paths taken § Cut-off can be implemented easily

§ Path profiles useful but there exists a better alternative

§ Traces – topic of future lecture
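Post-processing a witness log can be sketched as below. This is a simplified illustration under a strong assumption: every edge writes a witness, whereas the scheme above would log only selected edges and reconstruct the rest. The function names, the block names, and the choice of B9 as the block that ends a path are hypothetical.

```python
from collections import Counter


def edges_of(path):
    """Witnesses (src, dst) written while one path is executed."""
    return list(zip(path, path[1:]))


def paths_from_log(log, exit_block):
    """Split the witness log into paths and count each distinct path."""
    paths = Counter()
    current = []
    for src, dst in log:
        if not current:
            current.append(src)      # first witness of a new path
        current.append(dst)
        if dst == exit_block:        # path is complete
            paths[tuple(current)] += 1
            current = []
    return paths


P1 = ("B1", "B3", "B5", "B6", "B8", "B9")
P2 = ("B1", "B4", "B5", "B7", "B8", "B9")

# Hypothetical execution: P1, P2, P1, P1
log = []
for executed in (P1, P2, P1, P1):
    log.extend(edges_of(executed))

profile = paths_from_log(log, exit_block="B9")
print(profile.most_common(1))        # cut-off: keep only the top path
```

The cut-off mentioned above corresponds to `most_common(k)`: only the k most frequent paths are kept for the compiler.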
