
Cross-Layer Workload Characterization of Meta-Tracing JIT VMs
Berkin Ilbeyi (1), Carl Friedrich Bolz-Tereick (2), and Christopher Batten (1)
(1) Cornell University, (2) Heinrich-Heine-Universität Düsseldorf

Dynamic languages are popular S. Cass. The


  1. Python-based interpreter

     ▪ Application: FooLang is compiled to Application: Bytecode
     ▪ Application: Bytecode is interpreted by Interpreter: Python
     ▪ Interpreter: Python is compiled to Interpreter: Bytecode
     ▪ Interpreter: Bytecode is interpreted by the Interpreter Interpreter

     Application: FooLang

         ...
         b += a
         ...

     Application: Bytecode

         ...
         21 LOAD_FAST 1 (b)
         24 LOAD_FAST 0 (a)
         27 INPLACE_ADD
         28 STORE_FAST 1 (b)
         ...

     Interpreter: Python

         while True:
             bc = bcs[bci]
             bci += bc.length
             if bc.type == INPLACE_ADD:
                 v1 = stack.pop()
                 v2 = stack.pop()
                 if (type(v1) == int and type(v2) == int):
                     stack.push(v1 + v2)
                 elif ...
             elif bc.type == LOAD_FAST:
                 stack.push(local[bc.varnum])
             ...

     Outline: Motivation • Meta-tracing • PyPy >> CPython • PyPy << C
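
The dispatch loop on this slide can be made runnable with a few additions. The following is only a minimal sketch: the `Bytecode` record, the `interpret` function, the `STORE_FAST` handler, and the use of a list index instead of `bci += bc.length` are assumptions made for illustration and do not appear on the slide.

```python
from collections import namedtuple

# Hypothetical bytecode record (not from the slide): an opcode name plus
# an optional local-variable index.
Bytecode = namedtuple("Bytecode", ["type", "varnum"])

LOAD_FAST, STORE_FAST, INPLACE_ADD = "LOAD_FAST", "STORE_FAST", "INPLACE_ADD"


def interpret(bcs, local):
    """Run a list of bytecodes against the list of local variables."""
    stack = []
    index = 0
    while index < len(bcs):
        bc = bcs[index]
        index += 1
        if bc.type == INPLACE_ADD:
            v1 = stack.pop()
            v2 = stack.pop()
            if type(v1) == int and type(v2) == int:
                stack.append(v1 + v2)
            else:
                raise TypeError("this sketch only handles int + int")
        elif bc.type == LOAD_FAST:
            stack.append(local[bc.varnum])
        elif bc.type == STORE_FAST:
            local[bc.varnum] = stack.pop()
        else:
            raise ValueError("unknown bytecode: %r" % (bc.type,))
    return local


# b += a, with a in local slot 0 and b in local slot 1.
program = [
    Bytecode(LOAD_FAST, 1),
    Bytecode(LOAD_FAST, 0),
    Bytecode(INPLACE_ADD, None),
    Bytecode(STORE_FAST, 1),
]
result = interpret(program, [2, 40])
```

Every bytecode pays for the dispatch chain and the type checks, which is the overhead the later slides attack with tracing.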

  3. RPython Framework

     ▪ Application: Python → (compile) → Application: Bytecode
     ▪ Interpreter: RPython and Framework: RPython → (translate) → Interpreter + Framework: C → (compile) → PyPy: Binary
     ▪ PyPy: Binary interprets Application: Bytecode
     ▪ PyPy: Binary → (trace and optimize) → Meta-trace: JIT IR → (assemble) → JIT-ed code: Binary

  9. Meta-trace

     Application bytecode

         ...
         21 LOAD_FAST 1 (b)
         24 LOAD_FAST 0 (a)
         27 INPLACE_ADD
         28 STORE_FAST 1 (b)
         ...

     Interpreter (run by the meta-interpreter)

         while True:
             bc = bcs[bci]
             bci += bc.length
             if bc.type == INPLACE_ADD:
                 v1 = stack.pop()
                 v2 = stack.pop()
                 if (type(v1) == int and type(v2) == int):
                     stack.push(v1 + v2)
                 elif ...
             elif bc.type == LOAD_FAST:
                 stack.push(local[bc.varnum])
             ...

     Meta-trace

         ...
         p1 = getarrayitem(p0, 1)
         p2 = getarrayitem(p0, 0)
         guard_class(p1, int)
         guard_class(p2, int)
         i3 = getfield(p1, intval)
         i4 = getfield(p2, intval)
         i5 = int_add_ovf(i3, i4)
         guard_no_overflow()
         ...

     Deoptimization back to interpreter on guard failure
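
To make the role of the guards concrete, the trace for `b += a` can be written as straight-line Python that either runs to completion or bails out. This is a sketch under stated assumptions, not PyPy's implementation: `W_Int`, `W_Float`, `GuardFailed`, `run_trace`, `slow_path`, `execute`, the 64-bit overflow bounds, and the final store back into the locals are all hypothetical names and details added for illustration.

```python
MAXINT = 2**63 - 1   # assumed machine-word bounds for the overflow guard
MININT = -2**63


class GuardFailed(Exception):
    """Raised when an assumption recorded in the trace no longer holds."""


class W_Int:
    """Hypothetical boxed integer."""
    def __init__(self, intval):
        self.intval = intval


class W_Float:
    """Hypothetical boxed float, used only to make a guard fail."""
    def __init__(self, floatval):
        self.floatval = floatval


def guard_class(obj, cls):
    if type(obj) is not cls:
        raise GuardFailed("expected " + cls.__name__)


def run_trace(p0):
    """Straight-line version of the trace; p0 is the array of locals."""
    p1 = p0[1]                        # p1 = getarrayitem(p0, 1)
    p2 = p0[0]                        # p2 = getarrayitem(p0, 0)
    guard_class(p1, W_Int)
    guard_class(p2, W_Int)
    i3 = p1.intval                    # i3 = getfield(p1, intval)
    i4 = p2.intval                    # i4 = getfield(p2, intval)
    i5 = i3 + i4                      # i5 = int_add_ovf(i3, i4)
    if not (MININT <= i5 <= MAXINT):  # guard_no_overflow()
        raise GuardFailed("integer overflow")
    p0[1] = W_Int(i5)                 # store result (assumed; not on the slide)


def slow_path(p0):
    """Stand-in for the interpreter's generic handling of b += a."""
    def unwrap(w):
        return w.intval if isinstance(w, W_Int) else w.floatval
    total = unwrap(p0[1]) + unwrap(p0[0])
    p0[1] = W_Int(total) if isinstance(total, int) else W_Float(total)


def execute(p0):
    """Try the trace first; deoptimize to the slow path on guard failure."""
    try:
        run_trace(p0)
        return "jit"
    except GuardFailed:
        slow_path(p0)
        return "deopt"
```

All guards sit before the only side effect, so a failing guard leaves the locals untouched and the fallback can redo the operation safely.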

  22. Cross-layer annotations

      ▪ Application: Python (application annotations) → (compile) → Application: Bytecode
      ▪ Interpreter: RPython (interpreter annotations) and Framework: RPython (framework annotations) → (translate) → Interpreter + Framework: C → (compile) → PyPy: Binary
      ▪ PyPy: Binary interprets Application: Bytecode
      ▪ PyPy: Binary → (trace and optimize) → Meta-trace: JIT IR (IR node of interest) → (assemble) → JIT-ed code: Binary (asm of interest)
      ▪ Perf counters using PAPI
      ▪ Dynamic Binary Instrumentation: phase counters, IR node counters
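
The idea behind phase counters is that annotations mark where a VM phase begins and ends, and everything executed in between is attributed to that phase. The sketch below models only this bookkeeping in plain Python. In the talk the annotations are observed by dynamic binary instrumentation; the `PhaseCounter` class, its `phase` and `retire` methods, and the phase names used here are hypothetical.

```python
from collections import Counter
from contextlib import contextmanager


class PhaseCounter:
    """Attribute instruction counts to the innermost active phase."""

    def __init__(self, default_phase="interpreter"):
        self.counts = Counter()
        self._stack = [default_phase]

    @contextmanager
    def phase(self, name):
        # Entering the block plays the role of a phase-start annotation,
        # leaving it the role of the matching phase-end annotation.
        self._stack.append(name)
        try:
            yield
        finally:
            self._stack.pop()

    def retire(self, n_instructions):
        """Record n_instructions as executed in the current phase."""
        self.counts[self._stack[-1]] += n_instructions


pc = PhaseCounter()
pc.retire(100)                 # interpreter
with pc.phase("tracing"):
    pc.retire(30)
with pc.phase("jit"):
    pc.retire(500)
    with pc.phase("gc"):       # nested phase: GC triggered from JIT code
        pc.retire(20)
    pc.retire(50)
```

A stack is used so that a nested phase, such as garbage collection triggered from JIT-compiled code, is counted separately and the outer phase resumes afterwards.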

  30. Cross-layer workload characterization of meta-tracing JIT VMs

      PyPy >> CPython
      ▪ How can meta-tracing JITs significantly improve the performance of multiple dynamic languages?
      PyPy << C
      ▪ Why are meta-tracing JITs for dynamic languages still slower than C?

  31. Meta-tracing JIT improves the performance significantly

      [Figure: PyPy with meta-tracing JIT speedup over CPython, one bar per benchmark; y-axis from 0 to 30; two bars carry explicit value labels, 51.2 and 30.2.]

      Benchmarks, in plot order: richards, crypto_pyaes, chaos, telco, spectral-norm, django, twisted_iteration, spitfire_cstringio, raytrace-simple, hexiom2, float, ai, nbody_modified, twisted_pb, fannkuch, genshi_text, pyflate-fast, bm_mako, twisted_names, json_bench, genshi_xml, bm_chameleon, pypy_interp, twisted_tcp, html5lib, meteor-contest, sympy_sum, spitfire, spambayes, rietveld, deltablue, eparse, sympy_expand, slowspitfire, sympy_integrate, pidigits, bm_mdp, sympy_str

  32. PyPy speedup over CPython and Pycket speedup over Racket: Meta-tracing JIT improves performance significantly across multiple languages

      [Figure, two panels.
      Pycket speedup (y-axis from 0 to 2): binarytrees, fannkuchredux, fasta, mandelbrot, meteor, nbody, pidigits, revcomp, spectralnorm.
      PyPy speedup (y-axis from 0 to 12): binarytrees, chameneosredux, fannkuchredux, fasta, knucleotide, mandelbrot, meteor, nbody, pidigits, regexdna, revcomp, spectralnorm, threadring.]

  34. Meta-tracing JIT VM phases

      [Figure: phase timelines for richards and sympy_str. x-axis: instructions, from 0 to 10B. Rows: calls to AOT funs, JIT, GC, deoptimization, tracing & opt, interpreter.]

  36. Meta-tracing JIT VM phases

      [Figure: phase breakdown per benchmark, ordered from fastest on PyPy to slowest on PyPy. Phases: JIT calls, JIT, GC, deopt, tracing, interp.]

  37. The JIT phase: The fastest benchmarks tend to execute JIT-compiled code the most

      [Figure: JIT + JIT call to AOT per benchmark; y-axis from 0 to 1; benchmarks ordered from fastest on PyPy to slowest on PyPy.]

  38. Meta-tracing inlines all loops and can hurt performance

      Interpreter (run by the meta-interpreter)

          while True:
              ...
              memcpy(d, s, n)
              ...

          def memcpy(dest, src, n):
              i = 0
              while i < n:
                  dest[i] = src[i]
                  i += 1

      Meta-trace

          ...
          guard_gt(i0, 0)
          i3 = getarrayitem(p1, 0)
          setarrayitem(p2, 0, i3)
          guard_gt(i0, 1)
          i4 = getarrayitem(p1, 1)
          setarrayitem(p2, 1, i4)
          guard_gt(i0, 2)
          i5 = getarrayitem(p1, 2)
          setarrayitem(p2, 2, i5)
          ...
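
The growth of the trace is easy to reproduce by running the `memcpy` loop while logging one line per operation, which is roughly what a tracer does when it follows a loop inside the interpreter. This is a sketch only: `memcpy_recording` is a hypothetical helper, and the operation strings and variable numbering simply mimic the slide.

```python
def memcpy_recording(dest, src, n):
    """Copy src into dest and return the operations a tracer would record."""
    trace = []
    i = 0
    while i < n:
        # The loop condition held, so the tracer records a guard for it
        # instead of a branch: the loop is unrolled into the trace.
        trace.append("guard_gt(i0, %d)" % i)
        trace.append("i%d = getarrayitem(p1, %d)" % (i + 3, i))
        dest[i] = src[i]
        trace.append("setarrayitem(p2, %d, i%d)" % (i, i + 3))
        i += 1
    return trace


src = [7, 8, 9]
dest = [0, 0, 0]
trace = memcpy_recording(dest, src, len(src))
long_trace = memcpy_recording([0] * 1000, list(range(1000)), 1000)
```

The trace has three operations per copied element, so its length grows linearly with `n`, and each recorded guard ties the trace to the particular `n` seen while tracing.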

  43. Examples of significant AOT-compiled functions

      Benchmark        %      Source          Function
      ai               19.4   interpreter     setobject.get_storage_from_list
      bm_chameleon     17.9   RPython types   rordereddict.ll_call_lookup_function
      bm_mako          26.1   RPython lib     runicode.unicode_encode_ucs1_helper
      json_bench       18.5   PyPy module     _pypyjson.raw_encode_basestring_ascii
      nbody_modified   44.6   external lib    pow

  44. JIT calls to AOT-compiled functions: AOT-compiled functions can improve performance by avoiding long traces

      [Figure: JIT and JIT call to AOT functions per benchmark; y-axis from 0 to 1; benchmarks ordered from fastest on PyPy to slowest on PyPy.]

  45. PyPy bytecode execution rate compared to CPython: Benchmarks that perform the best also warm up the fastest

      Breakeven point: the performance of the two VMs at this point is equal

      [Figure, three panels: richards (y-axis ticks 10, 30, 50), html5lib (y-axis ticks 1, 2, 3), sympy_str (y-axis ticks 1, 2). x-axis: instructions, from 0 to 10B. Markers: PyPy w/o JIT breakeven point, CPython breakeven point.]
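
One plausible way to locate a breakeven point is to compare cumulative bytecode counts of the two VMs over equal-sized windows of executed instructions. The slide only says that the performance of the two VMs is equal at that point, so this cumulative reading is an assumption, and `breakeven_index` and the sample numbers below are hypothetical.

```python
def breakeven_index(jit_bytecodes, baseline_bytecodes):
    """Return the first window at which the JIT VM's cumulative bytecode
    count reaches the baseline's, or None if that never happens."""
    jit_total = 0
    baseline_total = 0
    for index, (j, b) in enumerate(zip(jit_bytecodes, baseline_bytecodes)):
        jit_total += j
        baseline_total += b
        if jit_total >= baseline_total:
            return index
    return None


# Bytecodes executed per instruction window (made-up numbers): the JIT VM
# starts slow while it warms up and then overtakes the baseline.
jit_vm = [1, 1, 5, 10]
baseline_vm = [3, 3, 3, 3]
caught_up = breakeven_index(jit_vm, baseline_vm)
never = breakeven_index([1, 1], [3, 3])
```

A benchmark that never reaches the baseline within the measured window returns `None`, which corresponds to a panel without that marker.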

  51. Cross-layer workload characterization of meta-tracing JIT VMs

      PyPy >> CPython
      ▪ How can meta-tracing JITs significantly improve the performance of multiple dynamic languages?
        ▪ Meta-tracing JIT compilation significantly improves the performance
        ▪ AOT-compiled functions are good for breaking pathological traces
        ▪ Easier-to-JIT programs perform the best and warm up the fastest
      PyPy << C
      ▪ Why are meta-tracing JITs for dynamic languages still slower than C?

  55. PyPy and Pycket slowdown over C/C++: Meta-tracing JIT has a big performance gap relative to static languages

      [Figure, two panels.
      Pycket slowdown (y-axis from 0 to 12): binarytrees, fannkuchredux, fasta, mandelbrot, meteor, nbody, pidigits, revcomp, spectralnorm.
      PyPy slowdown (y-axis from 0 to 30; two bars carry explicit value labels, 31 and 1374): binarytrees, chameneosredux, fannkuchredux, fasta, knucleotide, mandelbrot, meteor, nbody, pidigits, regexdna, revcomp, spectralnorm, threadring.]

  57. Meta-tracing JIT phases

      [Figure: JIT and JIT call to AOT functions per benchmark; y-axis from 0 to 1; benchmarks ordered from fastest on PyPy to slowest on PyPy.]

  58. Meta-tracing JIT IR node breakdown: A large part of JIT-compiled code is likely overhead

      [Figure: IR node breakdown per benchmark, ordered from fastest on PyPy to slowest on PyPy.]


  62. Interpreter phase

      [Figure: Interpreter per benchmark; y-axis from 0 to 1; benchmarks ordered from fastest on PyPy to slowest on PyPy.]

  63. RPython-to-C translation has overheads

      [Figure: PyPy without meta-tracing JIT speedup over CPython, one bar per benchmark; y-axis from 0 to 1.2.]

      Benchmarks, in plot order: richards, crypto_pyaes, chaos, telco, spectral-norm, django, twisted_iteration, spitfire_cstringio, raytrace-simple, hexiom2, float, ai, nbody_modified, twisted_pb, fannkuch, genshi_text, pyflate-fast, bm_mako, twisted_names, json_bench, genshi_xml, bm_chameleon, pypy_interp, twisted_tcp, html5lib, meteor-contest, sympy_sum, spitfire, spambayes, rietveld, deltablue, eparse, sympy_expand, slowspitfire, sympy_integrate, pidigits, bm_mdp, sympy_str

  64. Tracing and optimization phase

      [Figure: Tracing & optimization per benchmark; y-axis from 0 to 1; benchmarks ordered from fastest on PyPy to slowest on PyPy.]

  65. Deoptimization phase

      [Figure: Deoptimization per benchmark; y-axis from 0 to 1; benchmarks ordered from fastest on PyPy to slowest on PyPy.]

  66. Garbage collection phase

      [Figure: Garbage collection per benchmark; y-axis from 0 to 1; benchmarks ordered from fastest on PyPy to slowest on PyPy.]

  67. Meta-tracing JIT VM overheads: Overheads are diverse and can add up to a significant portion of execution

      [Figure: Interpreter, Tracing & optimization, Deoptimization, and Garbage collection per benchmark; y-axis from 0 to 1; benchmarks ordered from fastest on PyPy to slowest on PyPy.]

  68. Iron law of processor performance: Does meta-tracing VM code execute poorly in addition to executing more instructions?

      \[
      \frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}
      \]
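
A small worked example shows how the iron law separates the two possible causes of a slowdown: more instructions, or fewer instructions per cycle. The function name `execution_time` and all numbers below are made up for illustration and are not measurements from the talk.

```python
def execution_time(instructions, ipc, clock_hz):
    """Iron law: time = instructions * (cycles / instruction) * (time / cycle)."""
    cycles_per_instruction = 1.0 / ipc
    seconds_per_cycle = 1.0 / clock_hz
    return instructions * cycles_per_instruction * seconds_per_cycle


# Hypothetical numbers: same clock and same IPC, but the VM executes
# five times as many instructions as the C version of the program.
c_time = execution_time(instructions=1e9, ipc=1.5, clock_hz=3e9)
vm_time = execution_time(instructions=5e9, ipc=1.5, clock_hz=3e9)
slowdown = vm_time / c_time
```

With IPC and clock held equal, the slowdown equals the ratio of instruction counts, so a similar IPC points to instruction count as the dominant term.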

  69. Comparing meta-tracing JIT IPC to C/C++: Meta-tracing has a similar IPC for most benchmarks

      [Figure: C/C++ IPC, PyPy IPC, and Pycket IPC per benchmark; y-axis from 0 to 2.25.]

  71. IPC measurements can be accurately matched against VM phases

      [Figure: IPC over time aligned with VM phases: JIT, GC, deopt, trace, interp.]

  72. Microarchitectural characterization by the VM phase: Meta-tracing-JIT-compiled code has a similar IPC, fewer branches and mispredictions

      [Figure, two panels: IPC (y-axis from 0 to 1.8) and Branch per instruction (y-axis from 0 to 0.2). Categories on both x-axes: Interp, Trace, Deopt, GC, JIT, C/C++.]
