Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters
Marc Berndl, Benjamin Vitale, Mathew Zaleski, Angela Demke Brown
Research supported by IBM CAS, NSERC, CITO

Interpreter performance
• Why not just-in-time (JIT) compile?
• High-performance JVMs still interpret
• People use interpreted languages that don’t yet have JITs
• They still want performance!
• 30-40% of execution time is due to stalls caused by branch misprediction.
• Our technique eliminates 95% of branch mispredictions

Overview
✔ Motivation
• Background: The Context Problem
• Existing Solutions
• Our Approach
• Inlining
• Results

A Tale of Two Machines
[Diagram. Virtual machine: the virtual program is loaded to form the loaded program, which the interpreter runs through its bytecode bodies in an execution cycle. Real machine: the CPU pipeline execution cycle, with predictors for target address (indirect), return address, and wayness (conditional).]

Interpreter
[Diagram labels. Load: the bytecode is translated into the loaded program (internal representation). Execution cycle: fetch, dispatch, load parms, execute, using the bytecode bodies.]

Running Java Example
Java source:
    void foo(){
        int i=1;
        do{
            i+=i;
        } while(i<64);
    }
Java bytecode (produced by the javac compiler):
     0: iconst_1
     1: istore_1
     2: iload_1
     3: iload_1
     4: iadd
     5: istore_1
     6: iload_1
     7: bipush 64
     9: if_icmplt 2
    12: return

Switched Interpreter
    while(1){
        opcode = *vPC++;
        switch(opcode){
            case iload_1:
                ..
                break;
            case iadd:
                ..
                break;
            //and many more..
        }
    };
‣ Slow: burdened by switch and loop overhead
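A minimal, runnable sketch of a switched interpreter for the running example, in C. The opcode numbering, the encoding of the program (one array slot per opcode or operand, branch offset relative to the branch opcode's slot), and the names `foo_code` and `run_switched` are assumptions made for this illustration; they are not taken from SableVM. The function returns local variable 1 so the effect of `foo()` can be checked.

```c
#include <assert.h>

/* Hypothetical opcode numbering; real JVM opcode values differ. */
enum { ICONST_1, ISTORE_1, ILOAD_1, IADD, BIPUSH, IF_ICMPLT, RETURN };

/* foo(): i = 1; do { i += i; } while (i < 64);
   One slot per opcode or operand, so slot numbers are not byte offsets. */
static const int foo_code[] = {
    ICONST_1,       /* slot 0 */
    ISTORE_1,       /* slot 1 */
    ILOAD_1,        /* slot 2: loop head */
    ILOAD_1,
    IADD,
    ISTORE_1,
    ILOAD_1,
    BIPUSH, 64,     /* slots 7, 8 */
    IF_ICMPLT, -7,  /* slot 9: branch to slot 9 + (-7) = 2 */
    RETURN
};

int run_switched(const int *code) {
    int stack[8], *sp = stack;  /* operand stack; sp points at the next free slot */
    int local1 = 0;             /* local variable 1 (i) */
    const int *vPC = code;      /* virtual program counter */

    for (;;) {
        const int *insn = vPC;  /* slot of the opcode being executed */
        int opcode = *vPC++;
        switch (opcode) {       /* one shared, hard-to-predict dispatch point */
        case ICONST_1: *sp++ = 1;           break;
        case ISTORE_1: local1 = *--sp;      break;
        case ILOAD_1:  *sp++ = local1;      break;
        case IADD:     sp--; sp[-1] += sp[0]; break;
        case BIPUSH:   *sp++ = *vPC++;      break;
        case IF_ICMPLT: {
            int offset = *vPC++;
            int b = *--sp, a = *--sp;
            if (a < b) vPC = insn + offset; /* taken virtual branch */
            break;
        }
        case RETURN:   return local1;
        default:       assert(!"unknown opcode"); return -1;
        }
    }
}
```

Calling `run_switched(foo_code)` doubles `i` from 1 until it is no longer below 64. Every virtual instruction passes through the same `switch`, which is the overhead the slide refers to.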

“Threading” Dispatch
Bytecode bodies:
    iload_1:
        ..
        goto *vPC++;
    iadd:
        ..
        goto *vPC++;
    istore:
        ..
        goto *vPC++;
Virtual program:
     0: iconst_1
     1: istore_1
     2: iload_1
     3: iload_1
     4: iadd
     5: istore_1
     6: iload_1
     7: bipush 64
     9: if_icmplt 2
    12: return
Execution of the virtual program “threads” through the bodies (as in needle & thread).
‣ No switch overhead. Data-driven indirect branch.

Context Problem
[Diagram. The same bodies (iload_1, iadd, istore) and virtual program as on the previous slide; every body ends in goto *vPC++, and all of these indirect branches rely on the hardware indirect branch predictor (micro-architecture).]
‣ Data-driven indirect branches are hard to predict

Direct Threaded Interpreter
Virtual program        DTT - Direct Threading Table (vPC points here)
    …                      …
    iload_1                &&iload_1
    iload_1                &&iload_1
    iadd                   &&iadd
    istore_1               &&istore_1
    iload_1                &&iload_1
    bipush 64              &&bipush
                           64
    if_icmplt 2            &&if_icmplt
                           -7
    …                      …
C implementation of each body:
    iload_1:
        ..
        goto *vPC++;
    iadd:
        ..
        goto *vPC++;
    istore:
        ..
        goto *vPC++;
‣ Target of computed goto is data-driven
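The table above can be sketched in runnable form with the GNU C labels-as-values extension (`&&label` and `goto *expr`), which GCC and Clang accept. This is an illustration under assumptions, not SableVM's code: the opcode numbering, the slot-based program encoding, the fixed 64-entry table, and the names `foo_code`, `has_operand`, and `run_direct_threaded` are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcode numbering; real JVM opcode values differ. */
enum { ICONST_1, ISTORE_1, ILOAD_1, IADD, BIPUSH, IF_ICMPLT, RETURN, NUM_OPS };

/* 1 when the opcode is followed by one operand slot. */
static const int has_operand[NUM_OPS] = { [BIPUSH] = 1, [IF_ICMPLT] = 1 };

/* foo(): i = 1; do { i += i; } while (i < 64); one slot per opcode/operand. */
static const int foo_code[] = {
    ICONST_1, ISTORE_1,
    ILOAD_1, ILOAD_1, IADD, ISTORE_1,   /* loop head is slot 2 */
    ILOAD_1, BIPUSH, 64,
    IF_ICMPLT, -7,                      /* slot 9: branch to slot 2 */
    RETURN
};

int run_direct_threaded(const int *code, int len) {
    /* Address of each body, indexed by opcode (GNU C extension). */
    void *body[NUM_OPS] = {
        [ICONST_1] = &&iconst_1, [ISTORE_1] = &&istore_1,
        [ILOAD_1] = &&iload_1,   [IADD] = &&iadd,
        [BIPUSH] = &&bipush,     [IF_ICMPLT] = &&if_icmplt,
        [RETURN] = &&ret
    };
    intptr_t dtt[64];                   /* the Direct Threading Table */
    assert(len <= 64);

    /* Load time: replace each opcode by the address of its body;
       operands are copied through unchanged. */
    for (int i = 0; i < len; i++) {
        int op = code[i];
        assert(op >= 0 && op < NUM_OPS);
        dtt[i] = (intptr_t)body[op];
        if (has_operand[op]) { i++; dtt[i] = code[i]; }
    }

    int stack[8], *sp = stack;          /* operand stack */
    int local1 = 0;                     /* local variable 1 (i) */
    intptr_t *vPC = dtt;

/* Each body ends in its own data-driven indirect branch. */
#define DISPATCH() goto *(void *)(*vPC++)

    DISPATCH();
iconst_1:  *sp++ = 1;               DISPATCH();
istore_1:  local1 = *--sp;          DISPATCH();
iload_1:   *sp++ = local1;          DISPATCH();
iadd:      sp--; sp[-1] += sp[0];   DISPATCH();
bipush:    *sp++ = (int)*vPC++;     DISPATCH();
if_icmplt: {
        intptr_t offset = *vPC++;   /* relative to the opcode's slot */
        int b = *--sp, a = *--sp;
        if (a < b) vPC += offset - 2; /* vPC is already 2 slots past the opcode */
        DISPATCH();
    }
ret:       return local1;
#undef DISPATCH
}
```

`run_direct_threaded(foo_code, 12)` executes the same program as the switched version without a central `switch`. The destination of every `goto` still depends on data loaded from the DTT, which is what makes these branches hard to predict.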

Existing Solutions
• Replicate (Ertl & Gregg: bodies and dispatch replicated): each copy of a body, e.g. iload_1, ends in its own dispatch (goto *pc 1, goto *pc 2)
• Super Instruction (Piumarta & Riccardi: bodies replicated): Body 1 and Body 2 are joined and followed by a single goto *pc
‣ Limited to relocatable virtual instructions

Overview
✔ Motivation
✔ Background: The Context Problem
✔ Existing Solutions
• Our Approach
• Inlining
• Results

Key Observation
• Virtual and native control flow are similar
  • Linear or straight-line code
  • Conditional branches
  • Calls and returns
  • Indirect branches
• Hardware has predictors for each type
• Direct threading uses an indirect branch for everything!
‣ Solution: Leverage hardware predictors

Essence of our Solution
Virtual program:
    …
    iload_1
    iload_1
    iadd
    istore_1
    iload_1
    bipush 64
    if_icmplt 2
    …
CTT - Context Threading Table (generated code):
    call iload_1
    call iload_1
    call iadd
    call istore_1
    call iload_1
    ..
Bytecode bodies (ret terminated):
    iload_1:
        ..
        ret;
    iadd:
        ..
        ret;
Predictor used: Return Branch Predictor Stack
‣ Package bodies as subroutines and call them

Subroutine Threading
Virtual program:
    …
    iload_1
    iload_1
    iadd
    istore_1
    iload_1
    bipush 64
    if_icmplt 2
    …
CTT (code generated at load time):
    call iload_1
    call iload_1
    call iadd
    call istore_1
    call iload_1
    call bipush
    call if_icmplt
DTT (vPC points here; contains addresses in CTT and operands):
    …
    64
    …
    -7
Bytecode bodies (ret terminated):
    iload_1:
        …
        ret;
    iadd:
        …
        ret;
    if_icmplt:
        …
        goto *vPC++;
‣ Virtual branch instructions as before

The Context Threading Table
• A sequence of generated call instructions
• Good alignment of virtual and hardware control flow for straight-line code.
‣ Can virtual branches go into the CTT?

Specialized Branch Inlining
[Diagram. The CTT holds the generated calls (call iload_1, call …) followed by an inlined branch, if(icmplt) goto target, with the target: label placed in the CTT itself; vPC still points into the DTT. Captions: branch inlined into the CTT; conditional branch predictor now mobilized.]
‣ Inlining conditional branches provides context
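As a rough illustration of what the CTT for the running example amounts to, the sketch below writes it by hand as a C function: straight-line virtual code becomes a sequence of direct calls to ret-terminated bodies, and the virtual conditional branch becomes a native conditional branch. In the actual technique this sequence is native code generated at load time; the hand-written function, the static operand stack, the operand passed to `bipush` as an argument (rather than fetched through vPC), and the name `run_ctt_foo` are simplifying assumptions.

```c
#include <assert.h>

static int stack[8], *sp = stack;  /* operand stack; sp points at the next free slot */
static int local1;                 /* local variable 1 (i) */

/* Bytecode bodies packaged as subroutines: each one ends in a return. */
static void iconst_1(void) { *sp++ = 1; }
static void istore_1(void) { local1 = *--sp; }
static void iload_1(void)  { *sp++ = local1; }
static void iadd(void)     { sp--; sp[-1] += sp[0]; }
static void bipush(int v)  { *sp++ = v; }

/* Hand-written stand-in for the generated CTT of foo(). */
int run_ctt_foo(void) {
    sp = stack;
    iconst_1();
    istore_1();
target:                            /* destination of the virtual branch */
    iload_1();
    iload_1();
    iadd();
    istore_1();
    iload_1();
    bipush(64);
    {
        /* if_icmplt inlined: a native conditional branch with its own
           address, so the conditional predictor sees this branch alone. */
        int b = *--sp, a = *--sp;
        if (a < b) goto target;
    }
    return local1;
}
```

An optimizing C compiler may inline the small bodies here; in generated code each `call`/`ret` pair would instead be predicted by the return address stack. The point of the sketch is only the shape: no indirect branch remains on the path through this loop.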

Tiny Inlining
• Context Threading is a dispatch technique
• But, we inline branches
• Some non-branching bodies are very small
• Why not inline those?
‣ Inline all tiny linear bodies into the CTT

Overview
✔ Motivation
✔ Background: The Context Problem
✔ Existing Solutions
✔ Our Approach
✔ Inlining
• Results

Experimental Setup
• Two virtual machines on two hardware architectures.
• VM: Java/SableVM, OCaml interpreter
• Compare against direct threaded SableVM
• SableVM distro uses selective inlining
• Arch: P4, PPC
• Branch Misprediction
• Execution Time
‣ Is our technique effective and general?

Mispredicted Taken Branches
[Bar chart: SableVM/Java on Pentium 4. Mispredicted taken branches, normalized to Direct Threading (0 to 1.00), for Subroutine, Branch Inlining, and Tiny Inlining. Benchmarks: compress, db, jack, javac, jess, mpeg, mtrt, ray, scimark, soot.]
‣ 95% mispredictions eliminated on average

Execution time
[Bar chart: Pentium 4. Execution time per benchmark, normalized to Direct Threading (0 to 1.00), for Subroutine, Branch Inlining, and Tiny Inlining.]
‣ 27% average reduction in execution time

Execution Time (geomean)
[Bar chart: geometric mean execution time, normalized to Direct Threading (0 to 1.00), for Subroutine, Branch Inlining, and Tiny Inlining on java/p4, java/ppc, ocaml/p4, and ocaml/ppc.]
‣ Our technique is effective and general

Conclusions
• Context Problem: branch mispredictions due to mismatch between native and virtual control flow
• Solution: Generate control flow code into the Context Threading Table
• Results
  • Eliminate 95% of branch mispredictions
  • Reduce execution time by 30-40%
‣ Recent work, after CGO 2005, follows

What about Scripting Languages?
• Recently ported context threading to Tcl.
• 10x cycles executed per bytecode dispatched.
• Much lower dispatch overhead.
• Speedup due to subroutine threading, approx. 5%.
• Tcl conference 2005
[Chart: cycles per dispatch (cycles per virtual instruction), log scale from 10^0 to 10^5, by Tcl or OCaml benchmark.]
