

  1. Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters
     Marc Berndl, Benjamin Vitale, Mathew Zaleski, Angela Demke Brown
     Research supported by IBM CAS, NSERC, CITO

  2. Interpreter performance
     • Why not just-in-time (JIT) compile?
       • High-performance JVMs still interpret
       • People use interpreted languages that don't yet have JITs
       • They still want performance!
     • 30-40% of execution time is due to stalls caused by branch misprediction
     • Our technique eliminates 95% of branch mispredictions

  3. Overview
     ✔ Motivation
     • Background: The Context Problem
     • Existing Solutions
     • Our Approach
     • Inlining
     • Results

  4. A Tale of Two Machines
     [Diagram: the virtual machine (an interpreter whose execution cycle runs the loaded program through its bytecode bodies) next to the real machine (a pipelined CPU whose predictors track target addresses for indirect branches, return addresses for calls and returns, and direction/"wayness" for conditional branches).]

  5. Interpreter
     [Diagram: the loader turns the loaded program into an internal bytecode representation; the execution cycle then repeatedly fetches a virtual instruction and its parameters, dispatches to the matching body, and executes it.]

  6. Running Java Example
     Java source (compiled by javac):
         void foo() {
             int i = 1;
             do {
                 i += i;
             } while (i < 64);
         }
     Java bytecode:
         0: iconst_0
         1: istore_1
         2: iload_1
         3: iload_1
         4: iadd
         5: istore_1
         6: iload_1
         7: bipush 64
         9: if_icmplt 2
        12: return

  7. Switched Interpreter
         while (1) {
             opcode = *vPC++;
             switch (opcode) {
                 case iload_1: .. break;
                 case iadd:    .. break;
                 // and many more..
             }
         }
     ‣ Slow: burdened by switch and loop overhead.
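
  To make the switch-dispatch loop concrete, here is a minimal, self-contained sketch of a switched interpreter running the example above. The opcode names, operand-stack layout, and instruction encoding are hypothetical simplifications for illustration (they are not SableVM's internal representation), and it starts from i = 1, as in the Java source, so the loop terminates.

      /* Minimal switch-dispatched interpreter (illustrative encoding only). */
      enum { ICONST_1, ISTORE_1, ILOAD_1, IADD, BIPUSH, IF_ICMPLT, RETURN };

      static void interp_switch(const int *program)
      {
          const int *vPC = program;       /* virtual program counter */
          int stack[16], *sp = stack;     /* operand stack */
          int locals[4] = {0};

          while (1) {                     /* loop + switch overhead on every instruction */
              int opcode = *vPC++;
              switch (opcode) {
              case ICONST_1:  *sp++ = 1;              break;
              case ISTORE_1:  locals[1] = *--sp;      break;
              case ILOAD_1:   *sp++ = locals[1];      break;
              case IADD:      sp--; sp[-1] += sp[0];  break;
              case BIPUSH:    *sp++ = *vPC++;         break;
              case IF_ICMPLT: sp -= 2;                /* operand is a program index */
                              if (sp[0] < sp[1]) vPC = program + *vPC;
                              else               vPC++;
                              break;
              case RETURN:    return;
              }
          }
      }

      int main(void)
      {
          /* foo(): i = 1; do { i += i; } while (i < 64); */
          const int program[] = { ICONST_1, ISTORE_1,
                                  ILOAD_1, ILOAD_1, IADD, ISTORE_1,
                                  ILOAD_1, BIPUSH, 64, IF_ICMPLT, 2,
                                  RETURN };
          interp_switch(program);
          return 0;
      }

  Every virtual instruction pays for the loop back-edge, any range check the compiler emits for the switch, and one hard-to-predict indirect jump through the switch's jump table; threaded dispatch (next slides) removes the loop and switch overhead but keeps an indirect branch.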

  8. “Threading” Dispatch
     Each body ends by dispatching the next virtual instruction itself:
         iload_1:
             ..
             goto *vPC++;
         iadd:
             ..
             goto *vPC++;
         istore:
             ..
             goto *vPC++;
     Execution of the virtual program (the bytecode of the running example) “threads” through the bodies, as in needle & thread.
     ‣ No switch overhead. Data-driven indirect branch.

  9. Context Problem
     The bodies above are shared by every occurrence of their opcode in the virtual program, so the indirect branch (goto *vPC++) at the end of a body jumps somewhere different each time it is reached. The micro-architectural indirect branch predictor assumes each branch goes where it went last time, and here that guess is usually wrong.
     ‣ Data-driven indirect branches are hard to predict.

  10. Direct Threaded Interpreter
      The DTT (Direct Threading Table) holds one entry per virtual instruction: the address of that instruction's body, with immediate operands interleaved. vPC walks the DTT.
          DTT              Virtual program
          ...              ...
          &&iload_1        iload_1
          &&iload_1        iload_1
          &&iadd           iadd
          &&istore_1       istore_1
          &&iload_1        iload_1
          &&bipush, 64     bipush 64
          &&if_icmplt, -7  if_icmplt 2
          ...              ...
      Direct C implementation of each body:
          iload_1:
              ..
              goto *vPC++;
          iadd:
              ..
              goto *vPC++;
          istore:
              ..
              goto *vPC++;
      ‣ Target of the computed goto is data-driven.
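
  The &&label and goto *vPC++ notation above is the GNU C "labels as values" extension. Below is a minimal, self-contained direct-threaded sketch of the running example; the DTT layout, operand encoding, and body names are illustrative rather than SableVM's actual code, and it requires GCC or Clang.

      #include <stdint.h>

      #define NEXT goto *(*vPC++)   /* fetch the next body address from the DTT and jump to it */

      static void interp_direct(void)
      {
          int stack[16], *sp = stack;   /* operand stack */
          int locals[4] = {0};

          /* DTT: one slot per virtual instruction holding the address of its
           * body; immediate operands (64, the branch target index) are data. */
          void *dtt[] = {
              &&iconst_1, &&istore_1,                      /* int i = 1;          */
              &&iload_1, &&iload_1, &&iadd, &&istore_1,    /* i += i;             */
              &&iload_1, &&bipush, (void *)(intptr_t)64,   /* push i, push 64     */
              &&if_icmplt, (void *)(intptr_t)2,            /* loop back if i < 64 */
              &&vreturn
          };
          void **vPC = dtt;

          NEXT;                                     /* dispatch the first instruction */

      iconst_1:  *sp++ = 1;                         NEXT;
      istore_1:  locals[1] = *--sp;                 NEXT;
      iload_1:   *sp++ = locals[1];                 NEXT;
      iadd:      sp--; sp[-1] += sp[0];             NEXT;
      bipush:    *sp++ = (int)(intptr_t)*vPC++;     NEXT;
      if_icmplt: sp -= 2;
                 if (sp[0] < sp[1]) vPC = dtt + (intptr_t)*vPC;   /* operand = DTT index */
                 else               vPC++;
                 NEXT;            /* every body ends in its own data-driven indirect branch */
      vreturn:   return;
      }

      int main(void) { interp_direct(); return 0; }

  The switch and loop are gone, but each body still ends in an indirect jump whose target depends on the virtual program: exactly the data-driven branch that the context problem (previous slide) makes hard to predict.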

  11. Existing Solutions
      Replicate (Ertl & Gregg): give each occurrence of a virtual instruction its own copy of the body, each ending in its own goto *pc (bodies and dispatch replicated).
      Superinstruction (Piumarta & Riccardi): concatenate the bodies of a sequence into one superinstruction with a single goto *pc at the end (bodies replicated).
      ‣ Limited to relocatable virtual instructions.

  12. Overview
      ✔ Motivation
      ✔ Background: The Context Problem
      ✔ Existing Solutions
      • Our Approach
      • Inlining
      • Results

  13. Key Observation
      • Virtual and native control flow are similar:
        • Linear (straight-line) code
        • Conditional branches
        • Calls and returns
        • Indirect branches
      • Hardware has a predictor for each type
      • Direct threading uses an indirect branch for everything!
      ‣ Solution: leverage the hardware predictors

  14. Essence of our Solution
      CTT (Context Threading Table), generated code:
          call iload_1
          call iload_1
          call iadd
          call istore_1
          call iload_1
          ...
      Bytecode bodies (ret-terminated):
          iload_1:
              ..
              ret;
          iadd:
              ..
              ret;
      The calls and returns are predicted by the hardware's return-address stack branch predictor.
      ‣ Package bodies as subroutines and call them.

  15. Subroutine Threading
      CTT, generated at load time with one native call per virtual instruction:
          call iload_1
          call iload_1
          call iadd
          call istore_1
          call iload_1
          call bipush
          call if_icmplt
      Bytecode bodies (ret-terminated):
          iload_1:
              ..
              ret;
          iadd:
              ..
              ret;
          if_icmplt:
              ..
              goto *vPC++;
      The DTT now contains addresses into the CTT along with the immediate operands (64, the branch offset -7); vPC still walks it.
      ‣ Virtual branch instructions dispatch as before.
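
  As a concrete picture of the load-time code generation, here is a sketch that fills a CTT with one native x86-64 call per straight-line virtual instruction. It is an illustrative sketch, not the paper's SableVM/OCaml generator: it assumes each body is an ordinary ret-terminated routine reachable within a rel32 (±2 GB) displacement, and it omits error handling, W^X protection changes, and instruction-cache maintenance.

      #include <stdint.h>
      #include <string.h>
      #include <sys/mman.h>

      typedef void (*body_fn)(void);          /* a ret-terminated bytecode body */

      /* Emit "call rel32" (opcode 0xE8) at p, targeting body; the rel32 is
       * measured from the end of the 5-byte instruction. */
      uint8_t *emit_call(uint8_t *p, body_fn body)
      {
          int32_t rel = (int32_t)((uint8_t *)body - (p + 5));  /* fn-to-data cast: a common POSIX-ism */
          *p = 0xE8;
          memcpy(p + 1, &rel, sizeof rel);
          return p + 5;
      }

      /* Build a CTT for a straight-line sequence: one native call per
       * virtual instruction, then a ret back to whoever entered the CTT. */
      void *build_ctt(body_fn *bodies, int n)
      {
          uint8_t *ctt = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          uint8_t *p = ctt;
          for (int i = 0; i < n; i++)
              p = emit_call(p, bodies[i]);    /* body returns to the next call in the CTT */
          *p = 0xC3;                          /* ret: leave the CTT */
          return ctt;                         /* enter it with ((void (*)(void))ctt)() */
      }

  Because each body now returns to the call that follows it in the CTT, the hardware's return-address stack predicts every body-to-body transfer; virtual branch bodies still end in goto *vPC++, as the slide notes.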

  16. The Context Threading Table
      • A sequence of generated call instructions
      • Good alignment of virtual and hardware control flow for straight-line code
      ‣ Can virtual branches go into the CTT?

  17. Specialized Branch Inlining
      The virtual conditional branch is compiled directly into the CTT, next to the generated calls:
          ...
          if (icmplt) goto target;
          call iload_1
          ...
          target:
              ...
      The branch target is itself an address in the CTT (recorded via the DTT), so the hardware's conditional branch predictor is now mobilized.
      ‣ Inlining conditional branches provides context.
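
  One way to picture the inlined branch is as a few extra bytes emitted into the CTT in place of a plain call: a call to a compare helper followed by a native conditional jump whose target is the CTT address of the virtual branch target. The sketch below reuses the hypothetical emit_call and body_fn helpers from the subroutine-threading sketch above; the convention that the helper returns the comparison outcome in eax is an assumption for illustration, not the paper's actual generated code.

      /* Emit CTT code for if_icmplt: compare, then a native conditional
       * branch that the hardware's conditional (taken/not-taken) predictor
       * can track, instead of an indirect dispatch. */
      uint8_t *emit_if_icmplt(uint8_t *p, body_fn icmplt_helper, uint8_t *ctt_target)
      {
          p = emit_call(p, icmplt_helper);   /* helper pops two operands, returns (a < b) in eax */
          *p++ = 0x85; *p++ = 0xC0;          /* test eax, eax */
          *p++ = 0x0F; *p++ = 0x85;          /* jnz rel32     */
          int32_t rel = (int32_t)(ctt_target - (p + 4));   /* taken: jump to the target's CTT code */
          memcpy(p, &rel, sizeof rel);
          return p + 4;                      /* not taken: fall through to the next call */
      }

  A backward branch such as the example's if_icmplt 2 can be emitted directly because its target's CTT address is already known; a forward branch would need its rel32 patched once the target is generated, which is omitted here.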

  18. Tiny Inlining
      • Context threading is a dispatch technique
      • But we already inline branches
      • Some non-branching bodies are very small
      • Why not inline those too?
      ‣ Inline all tiny linear bodies into the CTT.
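
  A minimal sketch of the tiny-inlining decision, again building on the hypothetical emit_call helper above: if a body is small and relocatable (position-independent, safe to copy), copy its code into the CTT instead of emitting a call to it. The assumptions that each body ends in a one-byte ret, that its start address and length are known (for example from markers placed around each body when the interpreter is built), and the 16-byte threshold are all illustrative.

      #define TINY_MAX 16   /* bytes: only the smallest straight-line bodies qualify */

      /* Emit CTT code for one non-branching virtual instruction. */
      uint8_t *emit_linear(uint8_t *p, uint8_t *body_start, size_t body_len, int relocatable)
      {
          if (relocatable && body_len <= TINY_MAX) {
              memcpy(p, body_start, body_len - 1);      /* inline the body, dropping its final ret */
              return p + (body_len - 1);
          }
          return emit_call(p, (body_fn)body_start);     /* otherwise keep subroutine threading */
      }

  This removes the call/return pair exactly where dispatch cost is largest relative to useful work, while larger or non-relocatable bodies stay on the subroutine-threading path.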

  19. Overview
      ✔ Motivation
      ✔ Background: The Context Problem
      ✔ Existing Solutions
      ✔ Our Approach
      ✔ Inlining
      • Results

  20. Experimental Setup
      • Two virtual machines on two hardware architectures
      • VMs: Java (SableVM) and the OCaml interpreter
      • Baseline: direct-threaded SableVM (the SableVM distribution uses selective inlining)
      • Architectures: Pentium 4 (P4) and PowerPC (PPC)
      • Metrics: branch mispredictions and execution time
      ‣ Is our technique effective and general?

  21. Mispredicted Taken Branches (SableVM/Java, Pentium 4)
      [Chart: mispredicted taken branches, normalized to direct threading, for subroutine threading, branch inlining, and tiny inlining on compress, db, jack, javac, jess, mpeg, mtrt, ray, scimark, and soot.]
      ‣ 95% of mispredictions eliminated on average.

  22. Execution Time (SableVM/Java, Pentium 4)
      [Chart: execution time, normalized to direct threading, for subroutine threading, branch inlining, and tiny inlining on the same benchmarks.]
      ‣ 27% average reduction in execution time.

  23. Execution Time (geometric mean)
      [Chart: geometric-mean execution time, normalized to direct threading, for subroutine threading, branch inlining, and tiny inlining on P4/Java, P4/OCaml, PPC/Java, and PPC/OCaml.]
      ‣ Our technique is effective and general.

  24. Conclusions
      • Context problem: branch mispredictions caused by the mismatch between native and virtual control flow
      • Solution: generate control-flow code into the Context Threading Table
      • Results:
        • Eliminate 95% of branch mispredictions
        • Reduce execution time by 30-40%
      ‣ More recent work (post-CGO 2005) follows.

  25. What about Scripting Languages?
      • Recently ported context threading to Tcl
      • 10x the cycles executed per bytecode dispatched
      • Much lower relative dispatch overhead
      • Speedup due to subroutine threading: approx. 5%
      • Tcl conference, 2005
      [Chart: cycles per dispatched virtual instruction, log scale from 10^0 to 10^5, for the Tcl and OCaml benchmarks.]
