An Analysis of Call-site Patching Without Strong Hardware Support for Self-Modifying-Code




SLIDE 1

first.last@manchester.ac.uk

An Analysis of Call-site Patching Without Strong Hardware Support for Self-Modifying-Code

Tim Hartley, Foivos Zakkak, Christos Kotselidis, Mikel Lujan

MPLR’19 2019-10-22

SLIDE 2

Call-Sites

Direct branching: Method A reaches Method B or Method C with call/jmp <offset>; the target is encoded in the instruction itself.
Indirect branching: Method A first loads the target from memory (ld target, 0xabcd) and then executes call/jmp target.

2019-10-22 MPLR’19 @foivoszakkak 2
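A minimal sketch of the two call-site kinds, to make explicit what a patch has to rewrite in each case. The class names and fields are hypothetical, and the offset is taken relative to the call instruction's own address, which is a simplification (real ISAs differ in the base they use).

```python
from dataclasses import dataclass


@dataclass
class DirectCallSite:
    """Call-site whose target is encoded as an offset inside the instruction."""
    address: int  # address of the call/jmp instruction (hypothetical)
    offset: int   # offset embedded in the instruction

    def target(self) -> int:
        return self.address + self.offset

    def patch(self, new_target: int) -> None:
        # Re-targeting rewrites the instruction itself (a code write).
        self.offset = new_target - self.address


@dataclass
class IndirectCallSite:
    """Call-site that loads its target from a memory slot before branching."""
    slot: int  # address of the word that holds the target

    def target(self, memory: dict) -> int:
        return memory[self.slot]

    def patch(self, memory: dict, new_target: int) -> None:
        # Re-targeting rewrites a data word; the instructions stay unchanged.
        memory[self.slot] = new_target
```

In this model a direct call-site is patched by a write to the code-stream, while an indirect one is patched by a write to the data-stream.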

SLIDE 3

Call-Site Patching

§ Tiered compilation
§ De-optimization
§ Etc.

SLIDE 4

JIT compilation and Caches

Code-stream vs Data-stream

1. Code gets fetched to I-Cache
2. Data get fetched to D-Cache
3. CPU executes code from I-Cache
4. CPU writes data to D-Cache
5. D-Cache writes back to memory
6. D-Cache fetches code to be edited
7. CPU writes code to D-Cache
8. D-Cache writes back code

[Figure: CPU, I-CACHE, D-CACHE and Main Memory, with arrows numbered 1 to 8 matching the steps above.]
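The eight steps can be replayed on a toy model to see why a patched instruction may stay invisible to the CPU. This is only a sketch: `SplitCacheModel` and its method names are hypothetical, caches are tracked per address rather than per line, and the model assumes no hardware coherence between the two caches.

```python
class SplitCacheModel:
    """Toy model: separate I-cache and D-cache with no automatic coherence."""

    def __init__(self, memory):
        self.memory = dict(memory)  # main memory: address -> instruction text
        self.icache = {}
        self.dcache = {}

    def fetch_instruction(self, addr):
        # Steps 1 and 3: fill the I-cache on a miss, then execute from it.
        if addr not in self.icache:
            self.icache[addr] = self.memory[addr]
        return self.icache[addr]

    def write_code(self, addr, value):
        # Steps 6 and 7: the store goes through the D-cache only.
        self.dcache[addr] = value

    def clean_dcache(self, addr):
        # Step 8: write the modified word back to main memory.
        if addr in self.dcache:
            self.memory[addr] = self.dcache[addr]

    def invalidate_icache(self, addr):
        # Extra step the patcher needs: drop the stale I-cache copy.
        self.icache.pop(addr, None)
```

In this model a fetch keeps returning the old instruction after `write_code` and even after `clean_dcache`; only once `invalidate_icache` has run does the next fetch observe the patched code.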

SLIDE 5

Low-power architectures and call-site patching

§ Fixed size instructions

– Limit the range of direct branches/calls

  • ±128MiB on AArch64
  • ±1MiB on RISC-V

– Require multiple instructions to perform long-range calls

[Figure labels: AArch64, x86-64, 128MiB, 240MiB]
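The two ranges follow from the width of the immediate field in a fixed-size instruction. The sketch below assumes the AArch64 `B`/`BL` immediate is 26 bits counted in 4-byte units and the RISC-V `JAL` immediate is 20 bits counted in 2-byte units; the helper names are hypothetical.

```python
MIB = 1 << 20


def direct_branch_range(imm_bits, scale):
    """Return (lowest, highest) byte offset of a signed, scaled immediate."""
    lowest = -(1 << (imm_bits - 1)) * scale
    highest = ((1 << (imm_bits - 1)) - 1) * scale
    return lowest, highest


AARCH64_B_RANGE = direct_branch_range(26, 4)  # about +-128MiB
RISCV_JAL_RANGE = direct_branch_range(20, 2)  # about +-1MiB


def encode_aarch64_b(site, target):
    """Encode `B target` placed at address `site`, or fail if out of range."""
    offset = target - site
    lowest, highest = AARCH64_B_RANGE
    if offset % 4 != 0 or not lowest <= offset <= highest:
        raise ValueError("target not reachable with a direct branch")
    # Opcode bits 000101 followed by the 26-bit word offset.
    return 0x14000000 | ((offset >> 2) & 0x03FFFFFF)
```

When `encode_aarch64_b` raises, the call-site has to fall back to one of the multi-instruction long-range sequences.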

SLIDE 6

Low-power architectures and call-site patching (cont.)

§ Weak memory models and self-modifying-code (SMC) support

– SW explicitly issues memory barriers
– Code-stream handled separately from data-stream (need to sync them)

§ Not all instructions are safe to patch

– ARM (ARMv7 and ARMv8) and IBM (Power) limit the instructions that are safe to be patched while executing

  • Even if using atomic writes
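As an illustration of the "safe to patch" restriction, the sketch below classifies a 32-bit A64 instruction word. The list of mnemonics (B, BL, BRK, HVC, ISB, NOP, SMC, SVC) is taken from the Arm Architecture Reference Manual's rules on concurrent modification and execution of instructions as we understand them; treat both the list and the bit masks as assumptions to check against the manual revision you target. The function names are hypothetical.

```python
# (mask, value, mnemonic): fixed bits of the A64 encodings assumed safe.
SAFE_TO_PATCH = [
    (0xFC000000, 0x14000000, "B"),
    (0xFC000000, 0x94000000, "BL"),
    (0xFFE0001F, 0xD4200000, "BRK"),
    (0xFFE0001F, 0xD4000001, "SVC"),
    (0xFFE0001F, 0xD4000002, "HVC"),
    (0xFFE0001F, 0xD4000003, "SMC"),
    (0xFFFFF0FF, 0xD50330DF, "ISB"),
    (0xFFFFFFFF, 0xD503201F, "NOP"),
]


def classify(insn):
    """Return the mnemonic if `insn` is in the safe set, else None."""
    for mask, value, name in SAFE_TO_PATCH:
        if (insn & mask) == value:
            return name
    return None


def can_patch_while_executing(old_insn, new_insn):
    # Both the instruction being replaced and its replacement must be safe.
    return classify(old_insn) is not None and classify(new_insn) is not None
```

Under these assumptions, replacing one `B` with another `B` is acceptable, whereas overwriting a literal `LDR` or a `BLR` while another core may be executing it is not.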

SLIDE 7

Patchable call-site implementations in AArch64

Direct Branching (short-range only)

    B TARGET

Absolute-Load Indirect Branching

    MOVZ X16, #0xABCD           ; Craft the address
    MOVK X16, #0xEF89, lsl #16  ; holding
    MOVK X16, #0x7654, lsl #32  ; the
    MOVK X16, #0x0213, lsl #48  ; target
    LDR  X16, [X16]
    BLR  X16

Relative-Load Indirect Branching

    CALLEE_1: .quad 0x0123456789ABCDEF
    ...
    CALLEE_N: .quad 0x01234ABCDEF56789
    START:    ...
              LDR X16, CALLEE_1
              BLR X16

Trampolines (OpenJDK approach)

    L:      LDR X16, CALLEE
            BR  X16            ; Don't link
    CALLEE: .quad 0x0123456789ABCDEF
    START:  ...
            BL  SHORT_TARGET   ; or L
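The MOVZ/MOVK sequence of the absolute-load variant splits a 64-bit address into four 16-bit chunks. The sketch below encodes and decodes such a sequence; the function names are hypothetical, and the opcode constants assume the 64-bit forms of MOVZ (0xD2800000) and MOVK (0xF2800000).

```python
def movz_movk_sequence(address, rd=16):
    """Encode MOVZ + 3x MOVK that materialise a 64-bit address in X<rd>."""
    words = []
    for hw in range(4):
        imm16 = (address >> (16 * hw)) & 0xFFFF
        # MOVZ clears the register for the first chunk; MOVK keeps the rest.
        opcode = 0xD2800000 if hw == 0 else 0xF2800000
        words.append(opcode | (hw << 21) | (imm16 << 5) | rd)
    return words


def decode_address(words):
    """Rebuild the address from the imm16 and hw fields of each word."""
    address = 0
    for word in words:
        hw = (word >> 21) & 0x3
        imm16 = (word >> 5) & 0xFFFF
        address |= imm16 << (16 * hw)
    return address
```

For example, `movz_movk_sequence(0x02137654EF89ABCD)` corresponds to the four immediates used in the listing above (0xABCD, 0xEF89, 0x7654, 0x0213).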

SLIDE 8

Comparison of call-site implementation approaches

SLIDE 9

Evaluation Setup

§ Odroid-C2

– Quad-core Cortex-A53 @ 1.54GHz (pinned)

  • 8-stage pipelined processor with 2-way superscalar, in-order pipeline

– 2 GB DDR3 RAM
– Ubuntu 18.04.02 LTS
– Kernel: Odroid 3.16.68-41
– GCC 8.3.0
– MaxineVM 2.8.0
– OpenJDK 8u212

SLIDE 10

Microbenchmark

§ Generates inline call-sites
§ Callees are ret-only methods
§ To patch we call a patcher method instead of a ret-only method
§ Patcher always patches the next call-site (allows us to control the number of patches)
§ Patcher performs the necessary barriers as it would in a real system

SLIDE 11

Microbenchmark results

SLIDE 12

DaCapo and MaxineVM

§ We take the two best-performing approaches (Direct and Relative-Load Indirect) and evaluate them with DaCapo using MaxineVM
§ We had to tweak Relative-Load Indirect to make it work with MaxineVM

– Due to its metacircular nature, MaxineVM can only operate with offsets (relative branches), since at boot image creation the absolute targets are not known yet

Indirect-Maxine

    ADR X17, CALL       ; Get address of BLR
    LDR X16, OFFSET     ; Load offset
    ADD X16, X16, X17   ; Add them
    B   #8              ; Jump over inline offset
    OFFSET: .int CALL - CALLEE_1
    CALL:   BLR X16
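A small sketch of why an inline offset suits a boot image whose final load address is unknown. It assumes the inline word is a signed 32-bit little-endian value holding the callee address minus the address of the BLR, so that the ADD reconstructs the callee; the sign convention, the width and the function names are assumptions of this sketch and not taken from MaxineVM.

```python
import struct


def encode_inline_offset(call_addr, callee_addr):
    """Pack the callee's distance from the call instruction as a 32-bit int."""
    # struct.pack raises struct.error if the distance does not fit in 32 bits.
    return struct.pack("<i", callee_addr - call_addr)


def resolve_target(call_addr, inline_word):
    """Mirror ADR + LDR + ADD: address of the BLR plus the stored offset."""
    (offset,) = struct.unpack("<i", inline_word)
    return call_addr + offset
```

Because only the distance is stored, relocating caller and callee together by the same amount leaves the inline word valid, and patching the call-site amounts to rewriting that single data word.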

SLIDE 13

Indirect-Maxine in Microbenchmark results

SLIDE 14

DaCapo Results

SLIDE 15

Conclusions

§ OpenJDK’s method seems the best for AArch64 since it penalizes only long-range branches and avoids explicit instruction cache invalidations on callers.

– If you have a higher (#long-range calls / #short-range calls) ratio then maybe Relative-Load is better

§ The most promising approach in theory would be combining the following gadgets

    Direct (short-range only)

        B TARGET

    Indirect (long-range)

        ADRP X16, CALLEE
        ADD  X16, X16, :lo12:CALLEE
        BLR  X16

– On AArch64 this is not possible though since ADRP and ADD cannot be safely overwritten if they are being executed concurrently with the modifications.
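To show what a patch of the ADRP + ADD gadget would have to rewrite, the sketch below computes the two operands for a given call-site and callee. It assumes ADRP takes a signed 21-bit count of 4KiB pages relative to the page of the instruction (roughly ±4GiB) and that `:lo12:` denotes the low 12 bits of the target; the function names are hypothetical.

```python
def adrp_add_operands(pc, target):
    """Return (page_delta, lo12) for ADRP/ADD executed at address `pc`."""
    page_delta = (target >> 12) - (pc >> 12)
    if not -(1 << 20) <= page_delta < (1 << 20):
        raise ValueError("target outside the ADRP range")
    return page_delta, target & 0xFFF


def resolve(pc, page_delta, lo12):
    """Mirror ADRP (page of pc plus delta) followed by ADD (low 12 bits)."""
    return (((pc >> 12) + page_delta) << 12) + lo12
```

Re-targeting such a call-site changes the immediates of both ADRP and ADD, which is why the restriction stated in the last bullet above rules the combination out on AArch64.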