Configurable and Efficient Memory Access Tracing via Selective Expression-based x86 Binary Instrumentation

Simone Economo, Davide Cingolani, Alessandro Pellegrini and Francesco Quaglia

DIAG - Sapienza University of Rome


SLIDE 1

Configurable and Efficient Memory Access Tracing via Selective Expression-based x86 Binary Instrumentation

Simone Economo, Davide Cingolani, Alessandro Pellegrini and Francesco Quaglia

DIAG - Sapienza University of Rome
{economo,cingolani,pellegrini,quaglia}@diag.uniroma1.it

SLIDE 2

Memory access tracing

  • Interception of memory accesses issued by a program
  • Off-line and on-line applications

– Performance evaluation of architectures

  • e.g., Trace-driven simulation

– Detection of security vulnerabilities

  • e.g., Buffer overflows

– Detection of memory inefficiencies

  • e.g., Memory leaks

– Runtime optimization of programs

  • e.g., CC-NUMA systems
SLIDE 3

Tracing challenges

  • Memory access tracing is challenging because

– Intercepting all accesses may lead to excessive runtime overhead

  • e.g., profilers and debuggers

– Intercepting some accesses may lead to inaccurate tracing results

  • e.g., trace-driven simulation, run-time optimization

– Users may want a trade-off between accuracy and overhead

  • e.g., "I'm willing to sacrifice some accuracy for less overhead"

– Users may be interested in tracing accesses to larger chunks

  • e.g., OS pages, cache lines, malloc chunks
SLIDE 4

Tracing techniques

  • Hardware-based

– Performance Monitoring Units (PMUs)

  • Tracing performed implicitly by the hardware running the program
  • Software-based

– Kernel-level

  • Usually limited to OS-page granularity (e.g., 4KB or 2MB)

– Library-level

  • Usually limited to very specific application domains (e.g., MPI applications)

– Binary Code Instrumentation

  • Performed explicitly and transparently by injecting additional code in the program
  • Our approach!
SLIDE 5

Our goals

  • 1. Instrument a subset of the accesses

– rather than the entire stream
– directly affects the tracing overhead

  • 2. Make this subset representative

– using a smart selection algorithm
– directly affects the tracing accuracy

  • 3. Add flexibility to tracing

– in terms of subset size and tracing granularity
– should affect both overhead and accuracy

Efficient – Configurable – Accurate

SLIDE 6

Instrumentation issues

  • Memory addresses are encoded as expressions

– linear combinations of registers and constants
– evaluated to actual addresses at run-time
– e.g., x86 SIB expressions (Scale-Index-Base)

  • evaluated to Base + Index * Scale + Displacement
  • Memory address expressions are subject to some issues

– Address multiplexing

  • A single expression can encode different addresses over time

– Address aliasing

  • Different expressions can encode the same address at the same time

– False chunk sharing

  • Constants in expressions don't carry memory-alignment information
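These issues are easy to see in a small sketch (illustrative Python, not part of the paper's toolchain): the same SIB expression evaluates to different addresses as register contents change, and constants alone reveal nothing about chunk alignment.

```python
# Illustrative sketch of SIB evaluation (Base + Index * Scale + Displacement);
# register names and values here are made up for the example.

def eval_sib(regs, base=None, index=None, scale=1, disp=0):
    """Evaluate an x86 SIB expression against a snapshot of the register file."""
    b = regs[base] if base is not None else 0
    i = regs[index] if index is not None else 0
    return b + i * scale + disp

# Address multiplexing: -0x4(%rbp) encodes different addresses over time,
# because %rbp changes across stack frames.
frame_a = {"rbp": 0x7FFD00001000}
frame_b = {"rbp": 0x7FFD00000F00}
assert eval_sib(frame_a, base="rbp", disp=-0x4) != eval_sib(frame_b, base="rbp", disp=-0x4)

# False chunk sharing: two nearby displacements may or may not share a
# C-byte chunk at runtime; only the actual addresses can tell.
a1 = eval_sib(frame_a, base="rbp", disp=-0x4)
a2 = eval_sib(frame_a, base="rbp", disp=-0x8)
print(a1 // 16 == a2 // 16)  # depends on the runtime value of %rbp
```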
SLIDE 7

Instrumentation issues on x86/GCC/Linux

  • x86 SIB addressing (Scale-Index-Base) is complex

– The same structure is used for addressing different types of memory

  • e.g.,

The base address of a static object can be specified through an immediate:

mov 0x601120(,%rax,4),%edi

  • e.g.,

The base address of a dynamic object must be specified through a register:

mov -0x4(%rbp),%edx

– An address can be computed in more convoluted ways

  • e.g.,

A register in a SIB expression can be the result of another SIB expression:

lea 0x0(,%rax,4),%rdx
add %rdx,%rax
mov (%rax),%esi

SLIDE 8

Our contributions

  • An abstract addressing model

– Formalizes the structure and complexity of SIB expressions

  • A selection algorithm

– Deals with the intrinsic issues of tracing via instrumentation
– Satisfies the efficiency, accuracy and flexibility goals

SLIDE 9

Base-Index-Displacement (BID) model

  • A BID address field is a placeholder for a value

– either a register identifier or an immediate

  • A BID address expression is a tuple of fields <b,i,d>

– evaluates to the address b + i + d

  • A BID template is a family of expressions

– sharing the same type (register or immediate) for each field

➡ x86 SIB expressions fall into two BID templates:

1. RRI, when the base address is a register

(e.g., dynamic memory or convoluted accesses to all kinds of memory)

2. IRR, when the base address is an immediate

(e.g., static memory)
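A minimal encoding of the model might look as follows (the field representation and class names are my own, not the paper's):

```python
# Sketch of the BID model: an expression is a tuple of typed fields, and its
# template follows from the type of the base field, as described above.
from collections import namedtuple

Reg = namedtuple("Reg", "name")    # register-identifier field
Imm = namedtuple("Imm", "value")   # immediate field

class BIDExpr(namedtuple("BIDExpr", "b i d")):
    def template(self):
        # RRI when the base is a register, IRR when it is an immediate
        return "RRI" if isinstance(self.b, Reg) else "IRR"

# mov -0x4(%rbp),%edx  ->  register base %rbp, displacement -0x4
e1 = BIDExpr(Reg("rbp"), Imm(0), Imm(-0x4))
# mov 0x601120(,%rax,4),%edi  ->  immediate base, scaled register index
e2 = BIDExpr(Imm(0x601120), Reg("rax"), Imm(0))
assert e1.template() == "RRI"
assert e2.template() == "IRR"
```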

SLIDE 10

Selection algorithm

  • It relies on two user-defined parameters:

1. Instrumentation factor = ω

  • Determines the percentage of traced accesses at runtime
  • Affects overhead and accuracy

2. Chunk size = C

  • Determines the granularity of tracing
  • Partially affects accuracy
  • It avoids the address multiplexing problem

– Register values coming from multiple control-flow paths are ignored

  • The internal state is discarded at basic-block boundaries

– Updates to the contents of registers are tracked

  • Including possible updates coming from conditional data-flow instructions
SLIDE 11

Expression equality

  • Two BID expressions are equal if and only if

– they share the same fields
– they share the same values for each field

  • Pointer aliasing can still occur

– because the contents of registers are unpredictable
– ...but there are no false positives

SLIDE 12

Expression representatives

  • Equal expressions form a cluster led by a representative

– so that further analysis doesn't have to consider the whole cluster
– its access count is the size of the cluster that it represents

➡ Tracing a representative means tracing the cluster

– a single instrumentation coin buys tracing of the whole cluster
– reduces the overhead without affecting the accuracy

SLIDE 13

Expression distance

  • The distance between two representatives is

– evaluated on a field-by-field basis

  • by comparing register identifiers against equality (e.g., rax ≠ rbx)
  • by comparing immediates against their absolute difference (e.g., |0x10 - 0x18|)

– zero if they are likely to fall into the same C-byte chunk
– greater if they are likely to produce more distant addresses

  • False chunk sharing is still possible

– because only runtime addresses carry memory-alignment information
– ...but the probability of false positives decreases as C grows
– ...and as the gaps between immediates shrink

SLIDE 14

Distance function for RRI expressions

The distance between two RRI representatives <b1,i1,d1> and <b2,i2,d2> is given by a decision tree over their fields:

– b1 ≠ b2 → distance 5
– b1 = b2, i1 ≠ i2 → distance 4
– b1 = b2, i1 = i2, |d1 - d2| ≥ C → distance 3
– b1 = b2, i1 = i2, |d1 - d2| < C → distance 1 (likely the same chunk)

SLIDE 15

Distance function for IRR expressions

The distance between two IRR representatives is computed analogously, with the immediate base compared first:

– |b1 - b2| ≥ C → distance 5
– |b1 - b2| < C, i1 ≠ i2 → distance 4
– |b1 - b2| < C, i1 = i2, d1 ≠ d2 → distance 3
– |b1 - b2| < C, i1 = i2, d1 = d2 → distance 1 (likely the same chunk)
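The two decision trees can be sketched in code. This is a reconstruction from the slides: the (b, i, d) tuple encoding is my own, not the paper's implementation.

```python
# Sketch of the two distance functions reconstructed from the decision trees above.

def dist_rri(e1, e2, C):
    """Distance between RRI representatives: b, i are registers, d is an immediate."""
    b1, i1, d1 = e1
    b2, i2, d2 = e2
    if b1 != b2:
        return 5           # different base registers: likely far apart
    if i1 != i2:
        return 4           # same base, different index registers
    if abs(d1 - d2) >= C:
        return 3           # same registers, displacements at least one chunk apart
    return 1               # likely the same C-byte chunk

def dist_irr(e1, e2, C):
    """Distance between IRR representatives: b is an immediate base."""
    b1, i1, d1 = e1
    b2, i2, d2 = e2
    if abs(b1 - b2) >= C:
        return 5           # immediate bases at least one chunk apart
    if i1 != i2:
        return 4
    if d1 != d2:
        return 3
    return 1               # likely the same C-byte chunk

# Running example, C = 16: expressions 1 and 2 likely share a chunk...
assert dist_rri(("rbp", None, -0x4), ("rbp", None, -0x8), 16) == 1
# ...while expressions 1 and 3 are |-0x4 - (-0x18)| = 20 >= 16 bytes apart
assert dist_rri(("rbp", None, -0x4), ("rbp", None, -0x18), 16) == 3
# The two IRR accesses: |0x601120 - 0x601060| = 0xC0 >= 16
assert dist_irr((0x601120, "rax", 0), (0x601060, "rax", 0), 16) == 5
```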

SLIDE 16

Example of false chunk sharing

(Figure: two expressions e1 and e2 whose immediates differ by less than C, yet whose runtime addresses fall into different chunks due to alignment.)

SLIDE 17

Expression scores

  • The score of a representative is a tuple composed of

1. Access count = how many other accesses are traced for free
2. Average distance ≃ how well the access "samples" the address space

➡ The higher the score, the more valuable the access

– tells where an instrumentation coin is best spent
– improves the accuracy without affecting the overhead

SLIDE 18

Selecting expressions

  • Reduced to a (0,1)-knapsack problem, solved iteratively

– Items are representatives
– Values are scores
– Weights are all equal
– The knapsack size is ω% of all representatives
– Iteration i sees the residual space left by iteration i - 1

➡ Maximize sum of values, for all representatives, such that

– items in the knapsack don't exceed the residual space

SLIDE 19

The iterative (0,1)-knapsack

  • Base step

– Choose representatives and compute scores

  • Iterative step

1. Solve a residual (0,1)-knapsack instance

   1. Select the next most-valuable representative (ignoring frozen ones)
   2. Place it in the knapsack
   3. Freeze all zero-distance representatives

2. If there is residual space in the knapsack

   1. Unfreeze all representatives
   2. Start a new iterative step
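The loop above can be sketched as follows. This is a simplified reconstruction: the record layout, the greedy tie-breaking and the function name are my own choices, not the paper's API.

```python
# Simplified reconstruction of the iterative selection loop.

def select(reps, budget):
    """Greedy 0/1 knapsack with unit weights and zero-distance freezing."""
    chosen, frozen = [], set()
    while len(chosen) < budget:
        pool = [r for r in reps
                if r["id"] not in frozen and r["id"] not in chosen]
        if not pool:
            if not frozen:
                break            # nothing left to select at all
            frozen.clear()       # residual space left: start a new iteration
            continue
        best = max(pool, key=lambda r: r["score"])   # lexicographic tuple order
        chosen.append(best["id"])
        frozen.update(best["peers"])   # freeze zero-distance representatives
    return chosen

# Representatives of the worked example that follows (score = <count, avg distance>)
reps = [
    {"id": 1, "score": (5, 3.2), "peers": {2}},
    {"id": 2, "score": (4, 2.2), "peers": {1}},
    {"id": 3, "score": (3, 3.2), "peers": {4}},
    {"id": 4, "score": (2, 2.2), "peers": {3}},
    {"id": 5, "score": (1, 5.0), "peers": set()},
    {"id": 6, "score": (1, 5.0), "peers": set()},
    {"id": 7, "score": (1, 3.0), "peers": set()},
    {"id": 8, "score": (1, 3.0), "peers": set()},
]
assert select(reps, 4) == [1, 3, 5, 6]              # omega = 50% of m = 8
assert select(reps, 8) == [1, 3, 5, 6, 7, 8, 2, 4]  # omega = 100%, two iterations
```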

SLIDE 20

Example

1. RRI mov -0x4(%rbp),%edx
2. RRI mov -0x8(%rbp),%eax
3. RRI mov -0x18(%rbp),%rax
4. RRI mov -0x4(%rbp),%edx
5. RRI mov -0x8(%rbp),%eax
6. RRI mov -0x18(%rbp),%rax
7. RRI mov (%rax),%esi
8. RRI mov -0x4(%rbp),%edx
9. RRI mov -0xc(%rbp),%eax
10. IRR mov 0x601120(,%rax,4),%edi
11. RRI mov -0xc(%rbp),%edx
12. RRI mov -0x8(%rbp),%eax
13. IRR mov 0x601060(,%rax,4),%eax
14. RRI mov -0x4(%rbp),%edx
15. RRI mov -0x8(%rbp),%eax
16. RRI mov -0x18(%rbp),%rax
17. RRI mov -0x4(%rbp),%edx
18. RRI mov (%rax),%esi

ω = 50%, C = 16B, n = 18, m = ?
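The clustering that resolves m can be sketched as below. This is a toy reconstruction: I tag the two indirect (%rax) accesses with distinct hypothetical versions, since %rax is redefined between them and the algorithm discards its state across basic blocks.

```python
# Toy clustering of the instruction stream above into representatives.
# Tuples are (base, index, displacement); "rax#1"/"rax#2" are hypothetical
# version tags standing in for the tracked register state.
from collections import Counter

stream = [
    ("rbp", None, -0x04), ("rbp", None, -0x08), ("rbp", None, -0x18),
    ("rbp", None, -0x04), ("rbp", None, -0x08), ("rbp", None, -0x18),
    ("rax#1", None, 0x0), ("rbp", None, -0x04), ("rbp", None, -0x0c),
    (0x601120, "rax", 0), ("rbp", None, -0x0c), ("rbp", None, -0x08),
    (0x601060, "rax", 0), ("rbp", None, -0x04), ("rbp", None, -0x08),
    ("rbp", None, -0x18), ("rbp", None, -0x04), ("rax#2", None, 0x0),
]
clusters = Counter(stream)        # representative -> access count
assert len(clusters) == 8         # m = 8, as resolved on the next slides
assert clusters[("rbp", None, -0x04)] == 5
assert clusters[("rbp", None, -0x08)] == 4
```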

SLIDE 21

Example

1. RRI 5 mov -0x4(%rbp),%edx
2. RRI 4 mov -0x8(%rbp),%eax
3. RRI 3 mov -0x18(%rbp),%rax
4. RRI mov -0x4(%rbp),%edx
5. RRI mov -0x8(%rbp),%eax
6. RRI mov -0x18(%rbp),%rax
7. RRI 1 mov (%rax),%esi
8. RRI mov -0x4(%rbp),%edx
9. RRI 2 mov -0xc(%rbp),%eax
10. IRR 1 mov 0x601120(,%rax,4),%edi
11. RRI mov -0xc(%rbp),%edx
12. RRI mov -0x8(%rbp),%eax
13. IRR 1 mov 0x601060(,%rax,4),%eax
14. RRI mov -0x4(%rbp),%edx
15. RRI mov -0x8(%rbp),%eax
16. RRI mov -0x18(%rbp),%rax
17. RRI mov -0x4(%rbp),%edx
18. RRI 1 mov (%rax),%esi


SLIDE 22

Example

1. RRI mov -0x4(%rbp),%edx score = <5, ?>
2. RRI mov -0x8(%rbp),%eax score = <4, ?>
3. RRI mov -0x18(%rbp),%rax score = <3, ?>
4. RRI mov (%rax),%esi score = <1, ?>
5. RRI mov -0xc(%rbp),%eax score = <2, ?>
6. IRR mov 0x601120(,%rax,4),%edi score = <1, ?>
7. IRR mov 0x601060(,%rax,4),%eax score = <1, ?>
8. RRI mov (%rax),%esi score = <1, ?>

ω = 50%, C = 16B, n = 18, m = 8

SLIDE 23

Example

(Figure: pairwise distance matrix between the representatives of the example, computed per template.)

SLIDE 24

Example

1. RRI mov -0x4(%rbp),%edx score = <5, 3.2>, same chunk as 2
2. RRI mov -0x8(%rbp),%eax score = <4, 2.2>, same chunk as 1
3. RRI mov -0x18(%rbp),%rax score = <3, 3.2>, same chunk as 4
4. RRI mov -0xc(%rbp),%eax score = <2, 2.2>, same chunk as 3
5. IRR mov 0x601120(,%rax,4),%edi score = <1, 5.0>
6. IRR mov 0x601060(,%rax,4),%eax score = <1, 5.0>
7. RRI mov (%rax),%esi score = <1, 3.0>
8. RRI mov (%rax),%esi score = <1, 3.0>


  • Compute average distances

– Over all representatives sharing the same template

  • Reorder expressions by decreasing scores

– Using lexicographic order

SLIDE 25

Example


  • Pick representative no. 1

– Freeze representative no. 2, as they might fall into the same chunk
– K = {1}
– Residual W = 3

SLIDE 26

Example


  • Pick representative no. 3

– Freeze representative no. 4, as they might fall into the same chunk
– K = {1,3}
– Residual W = 2

SLIDE 27

Example


  • Pick representative no. 5

– K = {1,3,5}
– Residual W = 1

SLIDE 28

Example


  • Pick representative no. 6

– K = {1,3,5,6}
– Residual W = 0

SLIDE 29

Example


  • No more space left in the knapsack
  • End of algorithm

– K = {1,3,5,6}
– 4 out of 8 coins (50%)
– 10 out of 18 accesses (~56%) in pessimistic mode (1.1 payoff factor)
– 16 out of 18 accesses (~89%) in optimistic mode (1.78 payoff factor)

SLIDE 30

Example


  • Assume now that ω = 100% and a conservative approach is used
  • Pick representative no. 7

– K = {1,3,5,6,7}

SLIDE 31

Example


  • Pick representative no. 8

– K = {1,3,5,6,7,8}

SLIDE 32

Example


  • End of first iteration

– Unfreeze all frozen representatives
– Start a new iteration

SLIDE 33

Example


  • Pick representative no. 2

– K = {1,3,5,6,7,8,2}

SLIDE 34

Example


  • Pick representative no. 4

– K = {1,3,5,6,7,8,2,4}

SLIDE 35

Example


  • End of second iteration
  • End of algorithm
SLIDE 36

Experimental environment

  • HP ProLiant

– 32 cores, 64 GB of RAM
– GNU/Linux (kernel 2.6) with GCC/G++ 4.9.2

  • Hijacker

– A static binary rewriting and instrumentation tool
– Developed by our group over ~8 years of research
– Works on relocatable binary files

  • PARSEC 2.1

– blackscholes, fluidanimate, canneal, freqmine and swaptions
– Using simlarge and simmedium as input sets

SLIDE 37

Experimental parameters

  • Instrumentation factor ω

– 10%, 25%, 50%, 75%, 100%
– Plus a non-selective (NS) run that blindly traces all accesses

  • Chunk size C

– 16B (malloc chunk)
– 64B (cache line)
– 4KB (OS page)

SLIDE 38

Accuracy measures

  • Minimal Accuracy (MA)

– A pessimistic measure
– Ratio of traced representatives over all representatives

  • weighted by their access counts
  • Alignment-Independent Accuracy (AIA)

– An optimistic measure
– Defined like minimal accuracy, but requires additional guarantees

  • A selected representative also represents its zero-distance group
  • Its access count is the aggregated access count of the entire group
  • No mistakes are expected due to false chunk sharing
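Using the numbers of the earlier worked example (traced K = {1,3,5,6}, n = 18 accesses), the two measures can be sketched as follows; the function and variable names are mine, not the paper's.

```python
# Sketch of the two accuracy measures over the worked example's representatives.

def minimal_accuracy(counts, traced):
    """Pessimistic: traced representatives' access counts over all accesses."""
    return sum(counts[r] for r in traced) / sum(counts.values())

def alignment_independent_accuracy(counts, traced, peers):
    """Optimistic: a traced representative also covers its zero-distance group."""
    covered = set(traced)
    for r in traced:
        covered |= peers.get(r, set())
    return sum(counts[r] for r in covered) / sum(counts.values())

counts = {1: 5, 2: 4, 3: 3, 4: 2, 5: 1, 6: 1, 7: 1, 8: 1}   # access counts
peers = {1: {2}, 2: {1}, 3: {4}, 4: {3}}                    # zero-distance groups
K = [1, 3, 5, 6]                                            # traced representatives
assert round(minimal_accuracy(counts, K), 2) == 0.56                       # 10/18
assert round(alignment_independent_accuracy(counts, K, peers), 2) == 0.89  # 16/18
```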
SLIDE 39

Accuracy results for freqmine

SLIDE 40

Accuracy results for fluidanimate

SLIDE 41

Slowdown results for freqmine

SLIDE 42

Slowdown results for fluidanimate

SLIDE 43

Comments

  • Increasing ω worsens the runtime overhead
  • Increasing ω increases the actual accuracy

– because more representatives will be instrumented

  • Increasing C may improve the accuracy

– when there are no mistakes due to false cache sharing
– also because instrumentation coins can be spent elsewhere

  • ...or it can improve the runtime overhead

– if we decide to save those coins

SLIDE 44

Comments (2)

  • Increasing C has little effect on the actual accuracy

– because frozen unselected representatives weren't so valuable
– could be worse on other applications

  • Increasing C has little effect on the runtime overhead

– caching and micro-architectural effects?

  • ...but cannot improve it

– we choose ω% of the representatives regardless of C

  • ω=100% is already an improvement over NS

– roughly 50% to 100% faster in baseline units
– because we exploit equality between expressions

SLIDE 45

Future work

  • Test BID against more architectures

– Requires the modelling of additional addressing modes
– May require changes in the formal definition of templates

  • Improve the selection algorithm

– By extending its scope to regions larger than basic blocks
– By refining the notions of equality, distance and priority

  • Perform additional experiments

– To quantify mistakes due to, e.g., false cache sharing
– To derive additional metrics for benchmarks

SLIDE 46

Thank you! Questions?