Optimizing Binary Translation of Dynamically Generated Code

slide-1
SLIDE 1

Optimizing Binary Translation of Dynamically Generated Code

Byron Hawkins Brian Demsky University of California, Irvine Derek Bruening Qin Zhao Google, Inc.

slide-2
SLIDE 2
  • Profiling
  • Bug detection
  • Program analysis
  • Security
slide-3
SLIDE 3

SPEC CPU 2006

  • 12% overhead*
  • 21% overhead*

*geometric mean

slide-5
SLIDE 5

Octane JavaScript Benchmark

  • 15x overhead on Chrome V8

4.4x overhead on Mozilla Ion

  • 18x overhead on Chrome V8

8x overhead on Mozilla Ion

slide-7
SLIDE 7

New Era of Dynamic Code

  • Back in 2003...

– Browsers: one single-phase JIT engine
– Microsoft Office: negligible dynamic code

  • A decade later...

– Browsers: at least 2 multi-phase JIT engines
– Microsoft Office: one multi-phase JIT

  • Active at startup of all applications
slide-9
SLIDE 9

Goals

  • Optimize binary translation of dynamic code
  • Maintain performance for static code

Evaluation Platform

  • DynamoRIO on 64-bit Linux for x86

slide-11
SLIDE 11

Outline

  • Background on binary translation

– Current optimizations for statically compiled code
– Dynamic code → wasting translation overhead

  • Coarse-grained detection of code changes
  • New optimizations

– Manual annotations
– Automated inference

  • Performance results
  • Related Work
slide-14
SLIDE 14

[Figure, built up over slides 14 to 18: a SPEC benchmark app with basic blocks A to F in foo() and bar(), beside the DynamoRIO code cache (BB cache and trace cache). The BB cache fills with A, then C, D and E as they execute; an indirect branch lookup appears in the last step.]

Translate application into code cache as it runs

slide-19
SLIDE 19

[Figure: same layout. The BB cache now holds A, C, D, E and F, with the indirect branch lookup connecting to the code cache.]

Correlate indirect branch targets via hashtable
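
The lookup on this slide can be pictured as a map from an application address to the address of its translation in the code cache. The following is only a minimal sketch of that idea: the class and method names are hypothetical and it does not reflect DynamoRIO's actual data structures, which the slides do not show.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical model of an indirect branch lookup table:
// application address of a branch target -> address of its translation.
class IndirectBranchTable {
public:
    // Record that the block at app_pc has been translated to cache_pc.
    void add(std::uintptr_t app_pc, std::uintptr_t cache_pc) {
        table_[app_pc] = cache_pc;
    }

    // Drop the entry when the translation is deleted.
    void remove(std::uintptr_t app_pc) { table_.erase(app_pc); }

    // Resolve an indirect branch target. Returns false when the target has
    // no translation yet, in which case control goes back to the translator.
    bool lookup(std::uintptr_t app_pc, std::uintptr_t* cache_pc) const {
        auto it = table_.find(app_pc);
        if (it == table_.end()) return false;
        *cache_pc = it->second;
        return true;
    }

private:
    std::unordered_map<std::uintptr_t, std::uintptr_t> table_;
};
```

A miss is the expensive case: the target must be translated before execution can continue, which is why deleting translations of frequently changing code is costly.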

slide-20
SLIDE 20

[Figure: same layout. The hot path A C D E F is copied into the trace cache.]

Hot paths are compiled into traces (10% speedup)

slide-21
SLIDE 21

Cost

  • Translate code
  • Build traces

Benefit

  • Repeated execution of translated code
  • Optimized traces

– Can beat native performance on SPEC benchmarks

slide-23
SLIDE 23

[Figure: same code cache layout, but blocks A to F now belong to a JIT-compiled function called from foo() and bar(). The BB cache holds A, C, D, E and F.]

What if the target code is dynamically generated?

slide-24
SLIDE 24

[Figure: same layout. In the JIT-compiled function, blocks C and D have been rewritten as C' and D'.]

The code may be changed frequently at runtime

slide-25
SLIDE 25

[Figure: same layout. The code cache still holds the translations of the old C and D.]

Corresponding translations become invalid

slide-26
SLIDE 26

[Figure: same layout as slide 25.]

Stale translations must be deleted for retranslation

→ “cache consistency”
→ How to detect code changes?

slide-36
SLIDE 36

Detecting Code Changes on x86

  • Monitor all memory writes

– Too much overhead!

  • Instrument traces to check freshness

– DynamoRIO supports standalone basic blocks

→ too much overhead!

  • Leverage page permissions and faults

– Make code pages artificially read-only
– Intercept page faults and invalidate translations

→ Acceptable overhead (for rare occurrence)
→ How does this work?

slide-38
SLIDE 38

[Figure: Chrome V8 (compile_js() and the JIT-compiled foo() and bar()) beside the DynamoRIO code cache, which holds translations of compile_js(), foo() and bar(). Page permissions shown: r-x, r-x, rwx.]

slide-39
SLIDE 39

[Figure: same layout. A write to the read-only page is blocked (X).]

Page fault

slide-41
SLIDE 41

[Figure: V8 now holds foo_2() and the translation of foo() is gone from the code cache. Page permissions shown: rwx, r-x, rwx.]

Allow write

slide-42
SLIDE 42

[Figure: two threads, A and B. While the page is writable, compile_more_js() writes bar_2(); the code cache still holds the translation of bar().]

Allow write!

slide-43
SLIDE 43

[Figure: same layout as slide 42.]

Concurrent Writer Problem

All translations from the modified page must be removed
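
The page-protection sequence on slides 38 to 43 can be sketched on Linux with `mprotect` and a `SIGSEGV` handler. This is a single-page illustration under stated assumptions, not DynamoRIO's implementation: every function and variable name is hypothetical, the page is protected as read-only rather than r-x to keep the example portable, a counter stands in for the actual flush, and calling `mprotect` inside a signal handler is common practice on Linux but is not on the POSIX async-signal-safe list.

```cpp
#include <cstddef>
#include <cstdint>
#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical single-page monitor; all names are illustrative.
static unsigned char* g_page = nullptr;      // monitored "JIT code" page
static std::size_t g_page_size = 0;
static volatile sig_atomic_t g_flushes = 0;  // times translations were dropped

// Fault handler: a write hit the artificially read-only page.
static void on_write_fault(int, siginfo_t* info, void*) {
    auto addr = reinterpret_cast<std::uintptr_t>(info->si_addr);
    auto base = reinterpret_cast<std::uintptr_t>(g_page);
    if (g_page == nullptr || addr < base || addr >= base + g_page_size) {
        signal(SIGSEGV, SIG_DFL);  // not our page: let a real crash happen
        return;
    }
    g_flushes = g_flushes + 1;  // stand-in for "remove all translations from this page"
    mprotect(g_page, g_page_size, PROT_READ | PROT_WRITE);  // allow the write to retry
}

// Allocate one page and install the handler. Returns nullptr on failure.
unsigned char* monitor_new_page() {
    g_page_size = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    void* p = mmap(nullptr, g_page_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    g_page = static_cast<unsigned char*>(p);

    struct sigaction sa = {};
    sa.sa_sigaction = on_write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, nullptr) != 0) return nullptr;
    return g_page;
}

// Re-arm the page once code has been translated from it.
bool protect_page() { return mprotect(g_page, g_page_size, PROT_READ) == 0; }

int flush_count() { return g_flushes; }
```

Once the handler has made the page writable, further writes from any thread go unnoticed until the page is protected again. That window is the concurrent writer problem, and it is why every translation from the page has to go, not just the one that was written.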

slide-44
SLIDE 44

Cache Consistency Overhead

  • For non-JIT modules:

– System call hooks (program startup only)
– Self-modifying code (very rare)

  • For JIT engines:

– Code generation
– Code optimization
– Code adjustment for reuse

slide-46
SLIDE 46

Cache Consistency Overhead

[Figure: Chrome V8 (compile_js() and a JIT code page holding foo()) beside the DynamoRIO code cache, which holds translations of compile_js() and foo(). Page permissions shown: r-x, r-x, rwx. bar() is being written to the JIT code page.]

JIT writes a second function to unused space in the page

slide-47
SLIDE 47

Cache Consistency Overhead

[Figure: same layout.]

DynamoRIO must invalidate all translations from the page

slide-48
SLIDE 48

Cache Consistency Overhead

[Figure: same layout, with bar() now also translated in the code cache.]

Trivial code changes require flushing all translations

slide-49
SLIDE 49

Cache Consistency Overhead

[Figure: same layout as slide 48.]

Even a data change requires flushing all translations

slide-51
SLIDE 51

Cache Consistency Overhead

  • For non-JIT modules:

– System call hooks (program startup only)
– Self-modifying code (very rare)

  • For JIT engines:

– Code generation
– Code optimization
– Code adjustment for reuse

  • Disable traces to reduce translation overhead?

25% slowdown on Octane!

slide-52
SLIDE 52

Conserving DGC Translations

  • Add annotation framework to DynamoRIO

– Macros compile into DynamoRIO hooks
– Native execution skips hook (2 direct jumps)

  • Annotate the target application

– Specify which memory allocations contain code
– Notify DynamoRIO of all JIT code writes

slide-54
SLIDE 54

Conserving DGC Translations

  • Add annotation framework to DynamoRIO

– Macros compile to DynamoRIO hooks
– Native execution skips hook (2 direct jumps)

  • Annotate the target application

– Specify which memory allocations contain code
– Notify DynamoRIO of all JIT code writes

void* OS::Allocate(const size_t size, int prot) {
  // Anonymous private mapping: no backing file, so fd is -1 and offset is 0.
  void* mbase = mmap(0, size, prot, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (mbase == MAP_FAILED) return NULL;
  // Tell DynamoRIO that this allocation will hold JIT code.
  if (IS_EXECUTABLE(prot))
    DYNAMORIO_MANAGE_CODE_AREA(mbase, size);
  return mbase;
}

slide-56
SLIDE 56

Conserving DGC Translations

  • Add annotation framework to DynamoRIO

– Macros compile to DynamoRIO hooks
– Native execution skips hook (2 direct jumps)

  • Annotate the target application

– Specify which memory allocations contain code
– Notify DynamoRIO of all JIT code writes

void CpuFeatures::FlushICache(void* start, size_t size) {
  /* no native action for Intel x86 */
  // Notify DynamoRIO that JIT code in [start, start + size) was written.
  DYNAMORIO_FLUSH_FRAGMENTS(start, size);
}

→ How does this work?

slide-58
SLIDE 58

Annotated JIT Writes

[Figure: Chrome V8 (compile_js() and a JIT code page holding foo()) beside the DynamoRIO code cache, which holds translations of compile_js() and foo(). Page permissions shown: rwx, r-x, rwx. bar() is being written to the JIT code page.]

Annotation handler flushes only the written region

slide-59
SLIDE 59

Annotated JIT Writes

[Figure: same layout, with bar() now also translated in the code cache.]

Annotation handler flushes only the written basic block

slide-60
SLIDE 60

Annotated JIT Writes

[Figure: same layout as slide 59.]

Data writes can be safely ignored
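
The difference between the default page-granular policy and the annotation handler's fine-grained flush can be sketched with a small fragment table. This is an illustrative model only: `Fragment`, `FragmentTable` and both flush methods are hypothetical names, and the 4096-byte page size is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record of one translation: the application byte range
// [start, start + size) it was translated from.
struct Fragment {
    std::uintptr_t start;
    std::size_t size;
};

class FragmentTable {
public:
    void add(std::uintptr_t start, std::size_t size) {
        frags_.push_back(Fragment{start, size});
    }
    std::size_t count() const { return frags_.size(); }

    // Fine-grained policy: drop only fragments overlapping the written
    // range. Returns the number of fragments removed.
    std::size_t flush_range(std::uintptr_t start, std::size_t size) {
        std::size_t before = frags_.size();
        std::vector<Fragment> kept;
        for (const Fragment& f : frags_) {
            bool overlaps = f.start < start + size && start < f.start + f.size;
            if (!overlaps) kept.push_back(f);
        }
        frags_.swap(kept);
        return before - frags_.size();
    }

    // Page-granular policy: drop every fragment on the pages touched by
    // the write (size must be non-zero).
    std::size_t flush_pages(std::uintptr_t start, std::size_t size,
                            std::size_t page_size = 4096) {
        std::uintptr_t mask = static_cast<std::uintptr_t>(page_size) - 1;
        std::uintptr_t first = start & ~mask;            // first byte of first page
        std::uintptr_t last = (start + size - 1) | mask; // last byte of last page
        return flush_range(first, last - first + 1);
    }

private:
    std::vector<Fragment> frags_;
};
```

With foo() and bar() on the same page, `flush_range` over bar() leaves foo()'s translation in place, and a write that overlaps no fragment (a data write) removes nothing, whereas `flush_pages` removes both translations for any write to the page.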

slide-61
SLIDE 61

Problems with Annotations

  • Source code may not be available
  • COTS binaries may be preferred
  • Application may be difficult to annotate

– Chrome V8 requires 4 annotations
    • Trivial to place the annotations correctly
– Mozilla Ion requires 17 annotations
    • Complex analysis required to correctly place annotations
slide-64
SLIDE 64

Inference Approach

  • Infer which store instructions are writing code
  • Instrument those store instructions

– Fine-grained cache consistency policy
– Avoid page faults

  • Other writers → default cache consistency
  • Incorrect inference → negligible overhead
slide-65
SLIDE 65

Infer Code-Writing Instructions

[Figure: Chrome V8's compile_js() and its translation in the DynamoRIO code cache. Code block A sits on an r-x page and has a translation in the cache.]

Handle up to ~10 page faults (heuristic)

slide-66
SLIDE 66

Infer Code-Writing Instructions

[Figure: same layout. The r-x page holding A is now labeled JIT.]

Flag the faulting page as a JIT code page
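
One way to picture the heuristic on slides 65 and 66 is a per-page fault counter that flags the page once a threshold is reached and remembers which store instruction was writing. The sketch below is an assumption-laden illustration: the slides only say "up to ~10 page faults", so the exact threshold, counting per page rather than per store, and all names here are hypothetical.

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

// Hypothetical bookkeeping for inferring JIT code pages from write faults.
class JitPageInference {
public:
    explicit JitPageInference(unsigned threshold = 10) : threshold_(threshold) {}

    // Call on each write fault to a code page. Returns true once the page is
    // flagged as a JIT code page; the faulting store is then remembered so it
    // can be instrumented instead of faulting again.
    bool on_write_fault(std::uintptr_t page, std::uintptr_t store_pc) {
        if (jit_pages_.count(page) == 0) {
            unsigned n = ++faults_[page];
            if (n < threshold_) return false;  // keep the default policy for now
            jit_pages_.insert(page);
        }
        writers_.insert(store_pc);
        return true;
    }

    bool is_jit_page(std::uintptr_t page) const {
        return jit_pages_.count(page) != 0;
    }
    bool is_code_writer(std::uintptr_t store_pc) const {
        return writers_.count(store_pc) != 0;
    }

private:
    unsigned threshold_;
    std::unordered_map<std::uintptr_t, unsigned> faults_;
    std::unordered_set<std::uintptr_t> jit_pages_;
    std::unordered_set<std::uintptr_t> writers_;
};
```

Below the threshold the page keeps the default page-protection policy, so a page that is written only occasionally never pays for instrumentation.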

slide-67
SLIDE 67

Parallel Memory Mapping

[Figure: the physical memory page holding A is mapped into Chrome V8 twice: the original r-x page (A, flagged JIT) and a second rwx page (A').]

Map a second writable page to the same location

slide-68
SLIDE 68

Conserving DGC Translations

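[Figure: Chrome V8 with the r-x JIT page A and its rwx mapping A', beside the DynamoRIO code cache holding compile_js() and the translation of A. The store in compile_js() writes through A'.]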

Redirect the store's target to the writable mapping
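
The parallel mapping and the store redirection can be sketched on Linux by mapping one shared memory object twice. This is a minimal illustration, not the authors' implementation: it creates a fresh region with `memfd_create` (Linux-specific) instead of remapping an existing JIT page, it uses a read-only view where a real code page would be r-x, and `DualMapping`, `map_dual` and `redirect_store` are hypothetical names.

```cpp
#include <cstddef>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Two views of the same physical memory.
struct DualMapping {
    unsigned char* app_view;  // read-only view, where the application's code lives
    unsigned char* writable;  // parallel writable view of the same bytes
    std::size_t size;
};

// Create a shared region and map it twice. Returns false on failure.
bool map_dual(std::size_t size, DualMapping* out) {
    int fd = memfd_create("jit-page", 0);  // anonymous in-memory file
    if (fd < 0) return false;
    if (ftruncate(fd, static_cast<off_t>(size)) != 0) {
        close(fd);
        return false;
    }
    void* ro = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
    void* rw = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mappings keep the memory alive
    if (ro == MAP_FAILED || rw == MAP_FAILED) {
        if (ro != MAP_FAILED) munmap(ro, size);
        if (rw != MAP_FAILED) munmap(rw, size);
        return false;
    }
    out->app_view = static_cast<unsigned char*>(ro);
    out->writable = static_cast<unsigned char*>(rw);
    out->size = size;
    return true;
}

// Redirect a store aimed at the read-only view to the same offset in the
// writable view, so it completes without a page fault.
unsigned char* redirect_store(const DualMapping& m, unsigned char* target) {
    return m.writable + (target - m.app_view);
}
```

Because both views share the same physical memory, a byte written through the writable view is immediately visible at the original address, and the instrumented store knows exactly which bytes changed.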

slide-69
SLIDE 69

Conserving DGC Translations

[Figure: same layout as slide 68. The translation of A in the code cache is now stale.]

Locate and remove stale translations

slide-70
SLIDE 70

Comparing Approaches

Annotations

+ Lower virtual memory usage (for 32-bit)
+ Simple to implement

Inference

+ Requires no source code changes

slide-71
SLIDE 71

Speedup – Octane on V8

                      Score    Overhead   Speedup
Original DynamoRIO     2,271   15.80x
Annotation DynamoRIO  14,532    2.47x     6.40x
Inference DynamoRIO   14,257    2.52x     6.28x
Native                35,889
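
The V8 figures are consistent with overhead being the native score divided by the measured score, and speedup being the measured score divided by the original DynamoRIO score (Octane scores are higher-is-better). The slides do not state these formulas, so the short sketch below is an inference from the numbers, with hypothetical function names.

```cpp
// Overhead relative to native execution, from Octane scores.
double overhead(double native_score, double score) {
    return native_score / score;
}

// Speedup relative to the original (unoptimized) DynamoRIO run.
double speedup(double original_score, double score) {
    return score / original_score;
}
```

For example, `overhead(35889, 2271)` is about 15.80 and `speedup(2271, 14532)` is about 6.40, matching the Original and Annotation rows.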

slide-74
SLIDE 74

Speedup – Octane on Ion

                      Score    Overhead   Speedup
Original DynamoRIO     7,185    4.36x
Annotation DynamoRIO  11,914    2.27x     1.92x
Inference DynamoRIO   13,797    2.15x     2.03x
Native                31,340

slide-77
SLIDE 77

Speedup – Octane on V8

slide-78
SLIDE 78

Benchmark Overhead

slide-79
SLIDE 79

SPEC CPU 2006

                     SPEC CPU   SPEC Int   SPEC fp
Original DynamoRIO   12.27%     17.73%     8.60%
Inference DynamoRIO  12.35%     17.88%     8.60%

Performance is maintained for programs that do not dynamically generate code

slide-80
SLIDE 80

Related Work

Platform    Cache Consistency Policy
DynamoRIO   Page protection → flush all
Pin         Instrument trace heads → flush trace
QEMU        Software TLB hook → flush basic block
Valgrind    Instrument basic block → flush
Transmeta   Page protection → flush region (HW support)
Librando    Page protection → hash compare and flush

slide-81
SLIDE 81

Conclusion

  • New trend towards dynamic code generation
  • We explore two optimization approaches

– Annotation-driven code change notification
– Code change inference

  • We improve performance of binary translation on the Octane benchmark by more than 6x
slide-82
SLIDE 82

DynamoRIO Annotations

Compiled annotation in x64 Linux

slide-83
SLIDE 83

DynamoRIO Annotations

Native execution jumps over the annotation

slide-84
SLIDE 84

DynamoRIO Annotations

DynamoRIO transforms the annotation into a handler call
