Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters



SLIDE 1

Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters

Marc Berndl, Benjamin Vitale, Mathew Zaleski, Angela Demke Brown

Research supported by IBM CAS, NSERC, CITO

SLIDE 2

Interpreter performance

  • Why not just-in-time (JIT) compile?
  • High-performance JVMs still interpret
  • People use interpreted languages that don’t yet have JITs
  • They still want performance!
  • 30-40% of execution time is due to stalls caused by branch misprediction.
  • Our technique eliminates 95% of branch mispredictions

SLIDE 3

Overview

  ✔ Motivation
  • Background: The Context Problem
  • Existing Solutions
  • Our Approach
  • Inlining
  • Results

SLIDE 4

A Tale of Two Machines

[Diagram labels. Virtual Machine (Interpreter): Virtual Program, load, Loaded Program, Bytecode Bodies, Execution Cycle. Real Machine (CPU): Pipeline, Execution Cycle, Predictors: Return Address, Wayness (Conditional), Target Address (Indirect).]

SLIDE 5

Interpreter

[Diagram labels: Loaded Program (Internal Representation), Bytecode bodies, Execution Cycle: fetch, dispatch, load parms, execute.]

SLIDE 6

Running Java Example

Java Source:

    void foo(){
      int i=1;
      do{ i+=i; } while(i<64);
    }

Java Bytecode (produced by the Javac compiler):

     0: iconst_1
     1: istore_1
     2: iload_1
     3: iload_1
     4: iadd
     5: istore_1
     6: iload_1
     7: bipush 64
     9: if_icmplt 2
    12: return

SLIDE 7

Switched Interpreter

    while(1){
      opcode = *vPC++;
      switch(opcode){
        case iload_1: .. break;
        case iadd: .. break;
        //and many more..
      }
    };

Slow: burdened by switch and loop overhead.
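
A minimal runnable sketch of the switched interpreter above, specialized to the running foo() example. The opcode numbering, the name `run_switched`, the operand-stack layout, and the use of an absolute index as the branch operand are assumptions made for illustration; they are not SableVM's actual representation.

```c
#include <stdint.h>

/* Hypothetical opcode numbering for the few instructions foo() needs. */
enum { ICONST_1, ISTORE_1, ILOAD_1, IADD, BIPUSH, IF_ICMPLT, RETURN };

/* Virtual program for foo(); the branch operand is an absolute index. */
static const int32_t foo_code[] = {
    ICONST_1, ISTORE_1,
    ILOAD_1, ILOAD_1, IADD, ISTORE_1,      /* index 2: top of the loop */
    ILOAD_1, BIPUSH, 64, IF_ICMPLT, 2,
    RETURN
};

/* Runs the program and returns the final value of local variable 1. */
int32_t run_switched(const int32_t *code) {
    const int32_t *vPC = code;
    int32_t stack[8], *sp = stack;
    int32_t local1 = 0;

    while (1) {
        int32_t opcode = *vPC++;           /* fetch */
        switch (opcode) {                  /* dispatch: one shared branch */
        case ICONST_1: *sp++ = 1; break;
        case ISTORE_1: local1 = *--sp; break;
        case ILOAD_1:  *sp++ = local1; break;
        case IADD:     sp--; sp[-1] += sp[0]; break;
        case BIPUSH:   *sp++ = *vPC++; break;
        case IF_ICMPLT: {
            int32_t target = *vPC++;
            sp -= 2;                       /* pop both comparison operands */
            if (sp[0] < sp[1]) vPC = code + target;
            break;
        }
        case RETURN:   return local1;
        default:       return -1;          /* unknown opcode */
        }
    }
}
```

Calling `run_switched(foo_code)` doubles local 1 from 1 until it reaches 64. Every virtual instruction passes through the same `switch`, which is the overhead the slide refers to.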

SLIDE 8

“Threading” Dispatch

  • No switch overhead. Data driven indirect branch.

Execution of the virtual program “threads” through the bodies (as in needle & thread).

Bodies:

    iload_1: .. goto *vPC++;
    iadd: .. goto *vPC++;
    istore: .. goto *vPC++;

Virtual program:

     0: iconst_1
     1: istore_1
     2: iload_1
     3: iload_1
     4: iadd
     5: istore_1
     6: iload_1
     7: bipush 64
     9: if_icmplt 2
    12: return

SLIDE 9

Context Problem

  • Data driven indirect branches are hard to predict

Virtual program:

     0: iconst_1
     1: istore_1
     2: iload_1
     3: iload_1
     4: iadd
     5: istore_1
     6: iload_1
     7: bipush 64
     9: if_icmplt 2
    12: return

Bodies:

    iload_1: .. goto *vPC++;
    iadd: .. goto *vPC++;
    istore: .. goto *vPC++;

[Diagram label: indirect branch predictor (micro-arch)]

SLIDE 10

Direct Threaded Interpreter

Virtual Program:

    … iload_1 iload_1 iadd istore_1 iload_1 bipush 64 if_icmplt 2 …

DTT - Direct Threading Table (vPC points into it):

    … &&iload_1 &&iload_1 &&iadd &&istore_1 &&iload_1 &&bipush 64 &&if_icmplt -7 …

C implementation of each body:

    iload_1: .. goto *vPC++;
    iadd: .. goto *vPC++;
    istore: .. goto *vPC++;

Target of computed goto is data-driven
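
A hedged sketch of direct threading for the same foo() example. It relies on the GNU C "labels as values" extension (`&&label` and computed `goto`), so it assumes GCC or Clang. The function name, the stack layout, and the choice to store operands inline in the table as pointer-sized values (with the branch operand as a table index) are illustrative assumptions. The sketch spells the dispatch as `goto **vPC++` (load the body address, then jump), which the slides abbreviate to `goto *vPC++`.

```c
#include <stdint.h>

/* Runs foo() with direct threading; returns the final value of local 1.
   Requires GCC or Clang: &&label and goto *expr are GNU C extensions. */
int32_t run_direct_threaded(void) {
    /* DTT: body addresses with operands stored inline (assumed encoding).
       The branch operand here is an index into the DTT. */
    void *dtt[] = {
        &&iconst_1, &&istore_1,
        &&iload_1, &&iload_1, &&iadd, &&istore_1,   /* index 2: loop top */
        &&iload_1, &&bipush, (void *)(intptr_t)64,
        &&if_icmplt, (void *)(intptr_t)2,
        &&vreturn
    };
    void **vPC = dtt;
    int32_t stack[8], *sp = stack;
    int32_t local1 = 0;

    goto **vPC++;                      /* dispatch the first instruction */

iconst_1:  *sp++ = 1;                           goto **vPC++;
istore_1:  local1 = *--sp;                      goto **vPC++;
iload_1:   *sp++ = local1;                      goto **vPC++;
iadd:      sp--; sp[-1] += sp[0];               goto **vPC++;
bipush:    *sp++ = (int32_t)(intptr_t)*vPC++;   goto **vPC++;
if_icmplt: {
        intptr_t target = (intptr_t)*vPC++;
        sp -= 2;                       /* pop both comparison operands */
        if (sp[0] < sp[1]) vPC = dtt + target;
        goto **vPC++;                  /* still an indirect branch */
    }
vreturn:
    return local1;
}
```

Each body ends in its own indirect jump whose target comes from the table, so the hardware sees only "an indirect branch from this body" with no hint of which virtual instruction follows.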

SLIDE 11

Existing Solutions

  • Piumarta & Riccardi: bodies replicated (super instructions)
  • Ertl & Gregg: bodies and dispatch replicated

[Diagram labels: five bodies sharing one "GOTO *PC" with an unknown target (????); Super Instruction; Replicate: iload_1 with "goto *pc" (copy 1), iload_1 with "goto *pc" (copy 2).]

Limited to relocatable virtual instructions

SLIDE 12

Overview

  ✔ Motivation
  ✔ Background: The Context Problem
  ✔ Existing Solutions
  • Our Approach
  • Inlining
  • Results

SLIDE 13

Key Observation

  • Virtual and native control flow similar
  • Linear or straight-line code
  • Conditional branches
  • Calls and Returns
  • Indirect branches
  • Hardware has predictors for each type
  • Direct threading uses an indirect branch for everything!
  • Solution: Leverage hardware predictors

SLIDE 14

Essence of our Solution

Virtual program:

    … iload_1 iload_1 iadd istore_1 iload_1 bipush 64 if_icmplt 2 …

CTT - Context Threading Table (generated code):

    ..
    call iload_1
    call iload_1
    call iadd
    call istore_1
    call iload_1

Bytecode bodies (ret terminated):

    iload_1: .. ret;
    iadd: .. ret;

[Diagram label: Return Branch Predictor Stack]

Package bodies as subroutines and call them
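
The idea can be approximated in portable C, with the caveat that a real implementation emits native call instructions at load time. In the sketch below a hand-written function stands in for the generated CTT fragment of the loop body `i += i`. All names (`body_*`, `ctt_loop_body`, `double_once`) and the global state layout are hypothetical, and an optimizing compiler may inline the calls, so this shows the structure only, not the performance effect.

```c
#include <stdint.h>

/* Interpreter state shared by all bodies (hypothetical layout). */
static int32_t stack[8], *sp = stack;
static int32_t local1;

/* Bytecode bodies packaged as subroutines: each ends in a native return. */
static void body_iload_1(void)  { *sp++ = local1; }
static void body_iadd(void)     { sp--; sp[-1] += sp[0]; }
static void body_istore_1(void) { local1 = *--sp; }

/* Hand-written stand-in for the CTT of the straight-line code "i += i".
   The real table would be generated as native calls at load time. */
static void ctt_loop_body(void) {
    body_iload_1();
    body_iload_1();
    body_iadd();
    body_istore_1();
}

/* Doubles `start` once by running the straight-line CTT fragment. */
int32_t double_once(int32_t start) {
    sp = stack;
    local1 = start;
    ctt_loop_body();
    return local1;
}
```

Because every body is entered by a direct call and left by a return, each return goes back to a distinct call site, which is what lets a return-address predictor follow the virtual program.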

SLIDE 15

Subroutine Threading

Virtual program:

    … iload_1 iload_1 iadd istore_1 iload_1 bipush 64 if_icmplt 2 …

CTT (load time generated code):

    call iload_1
    call iload_1
    call iadd
    call istore_1
    call iload_1
    call bipush
    call if_icmplt

Bytecode bodies (ret terminated):

    iload_1: … ret;
    iadd: … ret;

Virtual branch instructions as before:

    if_icmplt: … goto *vPC++;

DTT contains addresses in CTT (plus the operands 64 and -7); vPC points into the DTT.

SLIDE 16

The Context Threading Table

  • A sequence of generated call instructions
  • Good alignment of virtual and hardware control flow for straight-line code.
  • Can virtual branches go into the CTT?

SLIDE 17

Specialized Branch Inlining

Branch inlined into the CTT:

    target:
      …
      call …
      call iload_1
      …
      if(icmplt) goto target;

[Diagram labels: DTT, vPC, 5]

Conditional Branch Predictor now mobilized

Inlining conditional branches provides context
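
A sketch of what the CTT for foo() might look like once the virtual conditional branch is inlined, again written by hand in C as a stand-in for load-time generated native code. The `Frame` struct, the `op_*` bodies, and `run_ctt_foo` are hypothetical names introduced for illustration.

```c
#include <stdint.h>

/* Hypothetical interpreter state: operand stack plus local variable 1. */
typedef struct {
    int32_t stack[8];
    int32_t *sp;
    int32_t local1;
} Frame;

/* Non-branching bodies remain subroutines. */
static void op_iconst_1(Frame *f) { *f->sp++ = 1; }
static void op_istore_1(Frame *f) { f->local1 = *--f->sp; }
static void op_iload_1(Frame *f)  { *f->sp++ = f->local1; }
static void op_iadd(Frame *f)     { f->sp--; f->sp[-1] += f->sp[0]; }
static void op_bipush(Frame *f, int32_t v) { *f->sp++ = v; }

/* Stand-in for the CTT generated for foo(); returns the final local 1. */
int32_t run_ctt_foo(void) {
    Frame f;
    f.sp = f.stack;
    f.local1 = 0;

    op_iconst_1(&f);
    op_istore_1(&f);
target:
    op_iload_1(&f);
    op_iload_1(&f);
    op_iadd(&f);
    op_istore_1(&f);
    op_iload_1(&f);
    op_bipush(&f, 64);
    /* if_icmplt inlined: pop the operands, then a native conditional
       branch whose address is unique to this virtual branch. */
    f.sp -= 2;
    if (f.sp[0] < f.sp[1]) goto target;
    return f.local1;
}
```

The loop's back edge is now an ordinary native conditional branch at its own address, so the conditional branch predictor can learn this particular virtual branch instead of sharing one indirect jump with every other instruction.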

SLIDE 18

Tiny Inlining

  • Context Threading is a dispatch technique
  • But, we inline branches
  • Some non-branching bodies are very small
  • Why not inline those?

Inline all tiny linear bodies into the CTT
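
To illustrate, the hand-written CTT stand-in for foo() can have its tiny bodies expanded in place. Which bodies count as "tiny" is an assumption here: the sketch inlines the loads, stores, and constant pushes, and keeps `iadd` as a call purely for contrast. `TinyVM`, `called_iadd`, and `run_tiny_inlined` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical interpreter state. */
typedef struct {
    int32_t stack[8];
    int32_t *sp;
    int32_t local1;
} TinyVM;

/* Kept as a called body in this sketch (an arbitrary choice for contrast). */
static void called_iadd(TinyVM *vm) { vm->sp--; vm->sp[-1] += vm->sp[0]; }

/* Stand-in for a CTT with tiny linear bodies and the branch inlined. */
int32_t run_tiny_inlined(void) {
    TinyVM vm;
    vm.sp = vm.stack;
    vm.local1 = 0;

    *vm.sp++ = 1;                    /* iconst_1, inlined */
    vm.local1 = *--vm.sp;            /* istore_1, inlined */
loop_top:
    *vm.sp++ = vm.local1;            /* iload_1, inlined */
    *vm.sp++ = vm.local1;            /* iload_1, inlined */
    called_iadd(&vm);                /* iadd: still dispatched by a call */
    vm.local1 = *--vm.sp;            /* istore_1, inlined */
    *vm.sp++ = vm.local1;            /* iload_1, inlined */
    *vm.sp++ = 64;                   /* bipush 64, inlined */
    vm.sp -= 2;                      /* if_icmplt, inlined */
    if (vm.sp[0] < vm.sp[1]) goto loop_top;
    return vm.local1;
}
```

An inlined body needs no dispatch at all, so the call and return it would otherwise cost disappear along with any chance of mispredicting them.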

SLIDE 19

Overview

  ✔ Motivation
  ✔ Background: The Context Problem
  ✔ Existing Solutions
  ✔ Our Approach
  ✔ Inlining
  • Results

SLIDE 20

Experimental Setup

  • Two Virtual Machines on two hardware architectures.
  • VM: Java/SableVM, OCaml interpreter
  • Compare against direct threaded SableVM
  • SableVM distro uses selective inlining
  • Arch: P4 (Pentium 4), PPC (PowerPC)
  • Branch Misprediction
  • Execution Time

Is our technique effective and general?

SLIDE 21

Mispredicted Taken Branches

[Bar chart, SableVM/Java on Pentium 4: mispredicted taken branches normalized to direct threading (axis ticks 0.25 to 1.00) for compress, db, jack, javac, jess, mpeg, mtrt, ray, scimark, soot. Series: Subroutine, Branch Inlining, Tiny Inlining.]

95% mispredictions eliminated on average

SLIDE 22

Execution time

[Bar chart, Pentium 4: execution time normalized to direct threading (axis ticks 0.25 to 1.00) for compress, db, jack, javac, jess, mpeg, mtrt, ray, scimark, soot. Series: Subroutine, Branch Inlining, Tiny Inlining.]

27% average reduction in execution time

SLIDE 23

Execution Time (geomean)

[Bar chart: geometric mean execution time normalized to direct threading (axis ticks 0.25 to 1.00) for java/p4, java/ppc, ocaml/p4, ocaml/ppc. Series: Subroutine, Branch Inlining, Tiny Inlining.]

Our technique is effective and general

SLIDE 24

Conclusions

  • Context Problem: branch mispredictions due to mismatch between native and virtual control flow
  • Solution: Generate control flow code into the Context Threading Table
  • Results
  • Eliminate 95% of branch mispredictions
  • Reduce execution time by 30-40%
  • Recent, post-CGO 2005, work follows

SLIDE 25

What about Scripting Languages?

  • Recently ported context threading to TCL.
  • 10x cycles executed per bytecode dispatched.
  • Much lower dispatch overhead.
  • Speedup due to subroutine threading, approx. 5%.
  • TCL conference 2005

[Chart: cycles per dispatch (cycles per virtual instruction) for each Tcl or OCaml benchmark, log scale from 10^0 to 10^5. Series: Tcl, OCaml.]