Configurable and Efficient Memory Access Tracing via Selective Expression-based x86 Binary Instrumentation

Simone Economo, Davide Cingolani, Alessandro Pellegrini and Francesco Quaglia

DIAG - Sapienza University of Rome


SLIDE 1

Configurable and Efficient Memory Access Tracing via Selective Expression-based x86 Binary Instrumentation

Simone Economo, Davide Cingolani, Alessandro Pellegrini and Francesco Quaglia

DIAG - Sapienza University of Rome
{economo,cingolani,pellegrini,quaglia}@diag.uniroma1.it

SLIDE 2

Memory access tracing

  • Interception of memory accesses issued by a program
  • Off-line and on-line applications

– Performance evaluation of architectures

  • e.g., Trace-driven simulation

– Detection of security vulnerabilities

  • e.g., Buffer overflows

– Detection of memory inefficiencies

  • e.g., Memory leaks

– Runtime optimization of programs

  • e.g., CC-NUMA systems
SLIDE 3

Tracing challenges

  • Memory access tracing is challenging because

– Intercepting all accesses may lead to excessive runtime overhead

  • e.g., profilers and debuggers

– Intercepting some accesses may lead to inaccurate tracing results

  • e.g., trace-driven simulation, run-time optimization

– Users may want a trade-off between accuracy and overhead

  • e.g., "I'm willing to sacrifice some accuracy for less overhead"

– Users may be interested in tracing accesses to larger chunks

  • e.g., OS pages, cache lines, malloc chunks
SLIDE 4

Tracing techniques

  • Hardware-based

– Performance Monitoring Units (PMUs)

  • Tracing performed implicitly by the hardware running the program
  • Software-based

– Kernel-level

  • Usually limited to OS-page granularity (e.g., 4KB or 2MB)

– Library-level

  • Usually limited to very specific application domains (e.g., MPI applications)

– Binary Code Instrumentation

  • Performed explicitly and transparently by injecting additional code in the program
  • Our approach!
SLIDE 5

Our goals

  • 1. Instrument a subset of the accesses

– rather than the entire stream
– directly affects the tracing overhead

  • 2. Make this subset representative

– using a smart selection algorithm
– directly affects the tracing accuracy

  • 3. Add flexibility to tracing

– in terms of subset size and tracing granularity
– should affect both overhead and accuracy

Efficient – Configurable – Accurate

SLIDE 6

Instrumentation issues

  • Memory addresses are encoded as expressions

– linear combinations of registers and constants
– evaluated to actual addresses at run-time
– e.g., x86 SIB expressions (Scale-Index-Base)

  • evaluated to Base + Index * Scale + Displacement
  • Memory address expressions are subject to some issues

– Address multiplexing

  • A single expression can encode different addresses over time

– Address aliasing

  • Different expressions can encode the same address at the same time

– False chunk sharing

  • Constants in expressions don't carry memory-alignment information
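These issues are easy to see in a small sketch (illustrative Python, not part of the paper's toolchain): the same SIB expression evaluates to different addresses as register contents change, and constants alone reveal nothing about chunk alignment.

```python
# Illustrative sketch of SIB evaluation (Base + Index * Scale + Displacement);
# register names and values here are made up for the example.

def eval_sib(regs, base=None, index=None, scale=1, disp=0):
    """Evaluate an x86 SIB expression against a snapshot of the register file."""
    b = regs[base] if base is not None else 0
    i = regs[index] if index is not None else 0
    return b + i * scale + disp

# Address multiplexing: -0x4(%rbp) encodes different addresses over time,
# because %rbp changes across stack frames.
frame_a = {"rbp": 0x7FFD00001000}
frame_b = {"rbp": 0x7FFD00000F00}
assert eval_sib(frame_a, base="rbp", disp=-0x4) != eval_sib(frame_b, base="rbp", disp=-0x4)

# False chunk sharing: two nearby displacements may or may not share a
# C-byte chunk at runtime; only the actual addresses can tell.
a1 = eval_sib(frame_a, base="rbp", disp=-0x4)
a2 = eval_sib(frame_a, base="rbp", disp=-0x8)
print(a1 // 16 == a2 // 16)  # depends on the runtime value of %rbp
```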
SLIDE 7

Instrumentation issues on x86/GCC/Linux

  • x86 SIB addressing (Scale-Index-Base) is complex

– The same structure is used for addressing different types of memory

  • e.g.,

The base address of a static object can be specified through an immediate:

mov 0x601120(,%rax,4),%edi

  • e.g.,

The base address of a dynamic object must be specified through a register:

mov -0x4(%rbp),%edx

– An address can be computed in more convoluted ways

  • e.g.,

A register in a SIB expression can be the result of another SIB expression:

lea 0x0(,%rax,4),%rdx
add %rdx,%rax
mov (%rax),%esi

SLIDE 8

Our contributions

  • An abstract addressing model

– Formalizes the structure and complexity of SIB expressions

  • A selection algorithm

– Deals with the intrinsic issues of tracing via instrumentation
– Satisfies the efficiency, accuracy and flexibility goals

SLIDE 9

Base-Index-Displacement (BID) model

  • A BID address field is a placeholder for a value

– either a register identifier or an immediate

  • A BID address expression is a tuple of fields <b,i,d>

– evaluates to the address b + i + d

  • A BID template is a family of expressions

– sharing the same type (register or immediate) for each field

➡ x86 SIB expressions fall into two BID templates:

1. RRI, when the base address is a register

(e.g., dynamic memory or convoluted accesses to all kinds of memory)

2. IRR, when the base address is an immediate

(e.g., static memory)
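A minimal encoding of the model might look as follows (the field representation and class names are my own, not the paper's):

```python
# Sketch of the BID model: an expression is a tuple of typed fields, and its
# template follows from the type of the base field, as described above.
from collections import namedtuple

Reg = namedtuple("Reg", "name")    # register-identifier field
Imm = namedtuple("Imm", "value")   # immediate field

class BIDExpr(namedtuple("BIDExpr", "b i d")):
    def template(self):
        # RRI when the base is a register, IRR when it is an immediate
        return "RRI" if isinstance(self.b, Reg) else "IRR"

# mov -0x4(%rbp),%edx  ->  register base %rbp, displacement -0x4
e1 = BIDExpr(Reg("rbp"), Imm(0), Imm(-0x4))
# mov 0x601120(,%rax,4),%edi  ->  immediate base, scaled register index
e2 = BIDExpr(Imm(0x601120), Reg("rax"), Imm(0))
assert e1.template() == "RRI"
assert e2.template() == "IRR"
```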

SLIDE 10

Selection algorithm

  • It relies on two user-defined parameters:

1. Instrumentation factor = ω

  • Determines the percentage of traced accesses at runtime
  • Affects overhead and accuracy

2. Chunk size = C

  • Determines the granularity of tracing
  • Partially affects accuracy
  • It avoids the address multiplexing problem

– Register values coming from multiple control-flow paths are ignored

  • The internal state is discarded at basic-block boundaries

– Updates to the contents of registers are tracked

  • Including possible updates coming from conditional data-flow instructions
SLIDE 11

Expression equality

  • Two BID expressions are equal if and only if

– they share the same fields
– they share the same values for each field

  • Pointer aliasing can still occur

– because the contents of registers are unpredictable
– ...but there are no false positives

SLIDE 12

Expression representatives

  • Equal expressions form a cluster led by a representative

– so that further analysis doesn't have to consider the whole cluster
– its access count is the size of the cluster that it represents

➡ Tracing a representative means tracing the cluster

– a single instrumentation coin buys tracing of the whole cluster
– reduces the overhead without affecting the accuracy

SLIDE 13

Expression distance

  • The distance between two representatives is

– evaluated on a field-by-field basis

  • by comparing register identifiers against equality (e.g., rax ≠ rbx)
  • by comparing immediates against their absolute difference (e.g., |0x10 - 0x18|)

– zero if they are likely to fall into the same C-byte chunk
– greater if they are likely to produce more distant addresses

  • False chunk sharing is still possible

– because only runtime addresses carry memory-alignment information
– ...but the probability of false positives decreases as C grows
– ...and as the gaps between immediates shrink

SLIDE 14

Distance function for RRI expressions

The distance between two RRI representatives <b1,i1,d1> and <b2,i2,d2> is given by a decision tree over their fields:

– b1 ≠ b2 → distance 5
– b1 = b2, i1 ≠ i2 → distance 4
– b1 = b2, i1 = i2, |d1 - d2| ≥ C → distance 3
– b1 = b2, i1 = i2, |d1 - d2| < C → distance 1 (likely the same chunk)

SLIDE 15

Distance function for IRR expressions

The distance between two IRR representatives is computed analogously, with the immediate base compared first:

– |b1 - b2| ≥ C → distance 5
– |b1 - b2| < C, i1 ≠ i2 → distance 4
– |b1 - b2| < C, i1 = i2, d1 ≠ d2 → distance 3
– |b1 - b2| < C, i1 = i2, d1 = d2 → distance 1 (likely the same chunk)
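The two decision trees can be sketched in code. This is a reconstruction from the slides: the (b, i, d) tuple encoding is my own, not the paper's implementation.

```python
# Sketch of the two distance functions reconstructed from the decision trees above.

def dist_rri(e1, e2, C):
    """Distance between RRI representatives: b, i are registers, d is an immediate."""
    b1, i1, d1 = e1
    b2, i2, d2 = e2
    if b1 != b2:
        return 5           # different base registers: likely far apart
    if i1 != i2:
        return 4           # same base, different index registers
    if abs(d1 - d2) >= C:
        return 3           # same registers, displacements at least one chunk apart
    return 1               # likely the same C-byte chunk

def dist_irr(e1, e2, C):
    """Distance between IRR representatives: b is an immediate base."""
    b1, i1, d1 = e1
    b2, i2, d2 = e2
    if abs(b1 - b2) >= C:
        return 5           # immediate bases at least one chunk apart
    if i1 != i2:
        return 4
    if d1 != d2:
        return 3
    return 1               # likely the same C-byte chunk

# Running example, C = 16: expressions 1 and 2 likely share a chunk...
assert dist_rri(("rbp", None, -0x4), ("rbp", None, -0x8), 16) == 1
# ...while expressions 1 and 3 are |-0x4 - (-0x18)| = 20 >= 16 bytes apart
assert dist_rri(("rbp", None, -0x4), ("rbp", None, -0x18), 16) == 3
# The two IRR accesses: |0x601120 - 0x601060| = 0xC0 >= 16
assert dist_irr((0x601120, "rax", 0), (0x601060, "rax", 0), 16) == 5
```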

SLIDE 16

Example of false chunk sharing

(Figure: two expressions e1 and e2 whose immediates differ by less than C, yet whose runtime addresses fall into different chunks due to alignment.)

SLIDE 17

Expression scores

  • The score of a representative is a tuple composed of

1. Access count = how many other accesses are traced for free
2. Average distance ≃ how well the access "samples" the address space

➡ The higher the score, the more valuable the access

– tells where an instrumentation coin is best spent
– improves the accuracy without affecting the overhead

SLIDE 18

Selecting expressions

  • Reduced to a (0,1)-knapsack problem, solved iteratively

– Items are representatives
– Values are scores
– Weights are all equal
– The knapsack size is ω% of all representatives
– Iteration i sees the residual space left by iteration i - 1

➡ Maximize sum of values, for all representatives, such that

– items in the knapsack don't exceed the residual space

SLIDE 19

The iterative (0,1)-knapsack

  • Base step

– Choose representatives and compute scores

  • Iterative step

1. Solve a residual (0,1)-knapsack instance

   1. Select the next most-valuable representative (ignoring frozen ones)
   2. Place it in the knapsack
   3. Freeze all zero-distance representatives

2. If there is residual space in the knapsack

   1. Unfreeze all representatives
   2. Start a new iterative step
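The loop above can be sketched as follows. This is a simplified reconstruction: the record layout, the greedy tie-breaking and the function name are my own choices, not the paper's API.

```python
# Simplified reconstruction of the iterative selection loop.

def select(reps, budget):
    """Greedy 0/1 knapsack with unit weights and zero-distance freezing."""
    chosen, frozen = [], set()
    while len(chosen) < budget:
        pool = [r for r in reps
                if r["id"] not in frozen and r["id"] not in chosen]
        if not pool:
            if not frozen:
                break            # nothing left to select at all
            frozen.clear()       # residual space left: start a new iteration
            continue
        best = max(pool, key=lambda r: r["score"])   # lexicographic tuple order
        chosen.append(best["id"])
        frozen.update(best["peers"])   # freeze zero-distance representatives
    return chosen

# Representatives of the worked example that follows (score = <count, avg distance>)
reps = [
    {"id": 1, "score": (5, 3.2), "peers": {2}},
    {"id": 2, "score": (4, 2.2), "peers": {1}},
    {"id": 3, "score": (3, 3.2), "peers": {4}},
    {"id": 4, "score": (2, 2.2), "peers": {3}},
    {"id": 5, "score": (1, 5.0), "peers": set()},
    {"id": 6, "score": (1, 5.0), "peers": set()},
    {"id": 7, "score": (1, 3.0), "peers": set()},
    {"id": 8, "score": (1, 3.0), "peers": set()},
]
assert select(reps, 4) == [1, 3, 5, 6]              # omega = 50% of m = 8
assert select(reps, 8) == [1, 3, 5, 6, 7, 8, 2, 4]  # omega = 100%, two iterations
```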

SLIDE 20

Example

1. RRI mov -0x4(%rbp),%edx
2. RRI mov -0x8(%rbp),%eax
3. RRI mov -0x18(%rbp),%rax
4. RRI mov -0x4(%rbp),%edx
5. RRI mov -0x8(%rbp),%eax
6. RRI mov -0x18(%rbp),%rax
7. RRI mov (%rax),%esi
8. RRI mov -0x4(%rbp),%edx
9. RRI mov -0xc(%rbp),%eax
10. IRR mov 0x601120(,%rax,4),%edi
11. RRI mov -0xc(%rbp),%edx
12. RRI mov -0x8(%rbp),%eax
13. IRR mov 0x601060(,%rax,4),%eax
14. RRI mov -0x4(%rbp),%edx
15. RRI mov -0x8(%rbp),%eax
16. RRI mov -0x18(%rbp),%rax
17. RRI mov -0x4(%rbp),%edx
18. RRI mov (%rax),%esi

ω = 50%, C = 16B, n = 18, m = ?
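The clustering that resolves m can be sketched as below. This is a toy reconstruction: I tag the two indirect (%rax) accesses with distinct hypothetical versions, since %rax is redefined between them and the algorithm discards its state across basic blocks.

```python
# Toy clustering of the instruction stream above into representatives.
# Tuples are (base, index, displacement); "rax#1"/"rax#2" are hypothetical
# version tags standing in for the tracked register state.
from collections import Counter

stream = [
    ("rbp", None, -0x04), ("rbp", None, -0x08), ("rbp", None, -0x18),
    ("rbp", None, -0x04), ("rbp", None, -0x08), ("rbp", None, -0x18),
    ("rax#1", None, 0x0), ("rbp", None, -0x04), ("rbp", None, -0x0c),
    (0x601120, "rax", 0), ("rbp", None, -0x0c), ("rbp", None, -0x08),
    (0x601060, "rax", 0), ("rbp", None, -0x04), ("rbp", None, -0x08),
    ("rbp", None, -0x18), ("rbp", None, -0x04), ("rax#2", None, 0x0),
]
clusters = Counter(stream)        # representative -> access count
assert len(clusters) == 8         # m = 8, as resolved on the next slides
assert clusters[("rbp", None, -0x04)] == 5
assert clusters[("rbp", None, -0x08)] == 4
```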

SLIDE 21

Example

1. RRI 5 mov -0x4(%rbp),%edx
2. RRI 4 mov -0x8(%rbp),%eax
3. RRI 3 mov -0x18(%rbp),%rax
4. RRI mov -0x4(%rbp),%edx
5. RRI mov -0x8(%rbp),%eax
6. RRI mov -0x18(%rbp),%rax
7. RRI 1 mov (%rax),%esi
8. RRI mov -0x4(%rbp),%edx
9. RRI 2 mov -0xc(%rbp),%eax
10. IRR 1 mov 0x601120(,%rax,4),%edi
11. RRI mov -0xc(%rbp),%edx
12. RRI mov -0x8(%rbp),%eax
13. IRR 1 mov 0x601060(,%rax,4),%eax
14. RRI mov -0x4(%rbp),%edx
15. RRI mov -0x8(%rbp),%eax
16. RRI mov -0x18(%rbp),%rax
17. RRI mov -0x4(%rbp),%edx
18. RRI 1 mov (%rax),%esi


SLIDE 22

Example

1. RRI mov -0x4(%rbp),%edx score = <5, ?>
2. RRI mov -0x8(%rbp),%eax score = <4, ?>
3. RRI mov -0x18(%rbp),%rax score = <3, ?>
4. RRI mov (%rax),%esi score = <1, ?>
5. RRI mov -0xc(%rbp),%eax score = <2, ?>
6. IRR mov 0x601120(,%rax,4),%edi score = <1, ?>
7. IRR mov 0x601060(,%rax,4),%eax score = <1, ?>
8. RRI mov (%rax),%esi score = <1, ?>

ω = 50%, C = 16B, n = 18, m = 8

SLIDE 23

Example

(Figure: pairwise distance matrix between the representatives of the example, computed per template.)

SLIDE 24

Example

1. RRI mov -0x4(%rbp),%edx score = <5, 3.2>, same chunk as 2
2. RRI mov -0x8(%rbp),%eax score = <4, 2.2>, same chunk as 1
3. RRI mov -0x18(%rbp),%rax score = <3, 3.2>, same chunk as 4
4. RRI mov -0xc(%rbp),%eax score = <2, 2.2>, same chunk as 3
5. IRR mov 0x601120(,%rax,4),%edi score = <1, 5.0>
6. IRR mov 0x601060(,%rax,4),%eax score = <1, 5.0>
7. RRI mov (%rax),%esi score = <1, 3.0>
8. RRI mov (%rax),%esi score = <1, 3.0>


  • Compute average distances

– Over all representatives sharing the same template

  • Reorder expressions by decreasing scores

– Using lexicographic order

SLIDE 25

Example


  • Pick representative no. 1

– Freeze representative no. 2, as they might fall into the same chunk
– K = {1}
– Residual W = 3

SLIDE 26

Example


  • Pick representative no. 3

– Freeze representative no. 4, as they might fall into the same chunk
– K = {1,3}
– Residual W = 2

SLIDE 27

Example


  • Pick representative no. 5

– K = {1,3,5}
– Residual W = 1

SLIDE 28

Example


  • Pick representative no. 6

– K = {1,3,5,6}
– Residual W = 0

SLIDE 29

Example


  • No more space left in the knapsack
  • End of algorithm

– K = {1,3,5,6}
– 4 out of 8 coins (50%)
– 10 out of 18 accesses (~56%) in pessimistic mode (1.1 payoff factor)
– 16 out of 18 accesses (~89%) in optimistic mode (1.78 payoff factor)

SLIDE 30

Example


  • Assume now that ω = 100% and a conservative approach is used
  • Pick representative no. 7

– K = {1,3,5,6,7}

SLIDE 31

Example


  • Pick representative no. 8

– K = {1,3,5,6,7,8}

SLIDE 32

Example


  • End of first iteration

– Unfreeze all frozen representatives
– Start a new iteration

SLIDE 33

Example


  • Pick representative no. 2

– K = {1,3,5,6,7,8,2}

SLIDE 34

Example


  • Pick representative no. 4

– K = {1,3,5,6,7,8,2,4}

SLIDE 35

Example


  • End of second iteration
  • End of algorithm
SLIDE 36

Experimental environment

  • HP ProLiant

– 32 cores, 64 GB of RAM
– GNU/Linux (kernel 2.6) with GCC/G++ 4.9.2

  • Hijacker

– A static binary rewriting and instrumentation tool
– Developed by our group over ~8 years of research
– Works on relocatable binary files

  • PARSEC 2.1

– blackscholes, fluidanimate, canneal, freqmine and swaptions
– Using simlarge and simmedium as input sets

SLIDE 37

Experimental parameters

  • Instrumentation factor ω

– 10%, 25%, 50%, 75%, 100%
– Plus a non-selective (NS) run that blindly traces all accesses

  • Chunk size C

– 16B (malloc chunk)
– 64B (cache line)
– 4KB (OS page)

SLIDE 38

Accuracy measures

  • Minimal Accuracy (MA)

– A pessimistic measure
– Ratio of traced representatives over all representatives

  • weighted by their access counts
  • Alignment-Independent Accuracy (AIA)

– An optimistic measure
– Defined like minimal accuracy, but requires additional guarantees

  • A selected representative also represents its zero-distance group
  • Its access count is the aggregated access count of the entire group
  • No mistakes are expected due to false chunk sharing
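Using the numbers of the earlier worked example (traced K = {1,3,5,6}, n = 18 accesses), the two measures can be sketched as follows; the function and variable names are mine, not the paper's.

```python
# Sketch of the two accuracy measures over the worked example's representatives.

def minimal_accuracy(counts, traced):
    """Pessimistic: traced representatives' access counts over all accesses."""
    return sum(counts[r] for r in traced) / sum(counts.values())

def alignment_independent_accuracy(counts, traced, peers):
    """Optimistic: a traced representative also covers its zero-distance group."""
    covered = set(traced)
    for r in traced:
        covered |= peers.get(r, set())
    return sum(counts[r] for r in covered) / sum(counts.values())

counts = {1: 5, 2: 4, 3: 3, 4: 2, 5: 1, 6: 1, 7: 1, 8: 1}   # access counts
peers = {1: {2}, 2: {1}, 3: {4}, 4: {3}}                    # zero-distance groups
K = [1, 3, 5, 6]                                            # traced representatives
assert round(minimal_accuracy(counts, K), 2) == 0.56                       # 10/18
assert round(alignment_independent_accuracy(counts, K, peers), 2) == 0.89  # 16/18
```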
SLIDE 39

Accuracy results for freqmine

SLIDE 40

Accuracy results for fluidanimate

SLIDE 41

Slowdown results for freqmine

SLIDE 42

Slowdown results for fluidanimate

SLIDE 43

Comments

  • Increasing ω worsens the runtime overhead
  • Increasing ω increases the actual accuracy

– because more representatives will be instrumented

  • Increasing C may improve the accuracy

– when there are no mistakes due to false cache sharing
– also because instrumentation coins can be spent elsewhere

  • ...or it can improve the runtime overhead

– if we decide to save those coins

SLIDE 44

Comments (2)

  • Increasing C has little effect on the actual accuracy

– because frozen unselected representatives weren't so valuable
– could be worse on other applications

  • Increasing C has little effect on the runtime overhead

– caching and micro-architectural effects?

  • ...but cannot improve it

– we choose ω% of the representatives regardless of C

  • ω=100% is already an improvement over NS

– roughly 50% to 100% faster in baseline units
– because we exploit equality between expressions

SLIDE 45

Future work

  • Test BID against more architectures

– Requires the modelling of additional addressing modes
– May require changes in the formal definition of templates

  • Improve the selection algorithm

– By extending its scope to regions larger than basic blocks
– By refining the notions of equality, distance and priority

  • Perform additional experiments

– To quantify mistakes due to, e.g., false cache sharing
– To derive additional metrics for benchmarks

SLIDE 46

Thank you! Questions?