Designing an adaptive VM that combines vectorized and JIT execution on heterogeneous hardware


  1. Designing an adaptive VM that combines vectorized and JIT execution on heterogeneous hardware. Tim Gubner, ICDE PhD Symposium, 2018

  2. Modern hardware vs. data processing systems: CPU, GPU, FPGA, ASIC, Dark Silicon

  3. State of the art
     System       CPU   GPU   FPGA   ASIC
     MonetDB       ✓
     doppioDB      ✓            ✓
     Ocelot        ✓     ✓
     HyPer         ✓
     MapD          ✓     ✓
     CoGaDB        ✓     ✓
     TensorFlow    ✓     ✓     ?      ✓

  4. Goal: One system to rule them all, one system to bring them, and in Dark Silicon bind them

  5. Idea: domain-specific language → adaptive virtual machine → CPU / GPU / FPGA / ASIC

  6. Virtual Machine

  7. Compile or not to compile
     • Compilation is time consuming (≥ 20 ms [1])
     • Also noticeable in HyPer [2]
     • Compilers make assumptions. The resulting code is either:
       • static and concise, or
       • dynamic and bulky (code explosion)
     [1] Using the LLVM C++ API and optimization passes
     [2] Kohn et al., "Adaptive Execution of Compiled Queries", ICDE 2018

  8. Compile or not to compile (cont.): Why would we ALWAYS want to compile EVERYTHING?
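
The slide's question suggests deciding at run time whether compilation pays off for the input at hand. A minimal, hypothetical C++ sketch of that trade-off; the struct, field names and the use of the 20 ms figure as a default threshold are illustrative assumptions, not part of the presented system:

    #include <cstdint>

    struct PipelineStats {
        double   per_tuple_interp_ns;    // measured cost of interpreting one tuple
        double   per_tuple_compiled_ns;  // estimated cost of the same tuple once compiled
        uint64_t remaining_tuples;       // cardinality estimate for the remaining input
    };

    // Compile only if the interpretation time saved on the remaining input
    // exceeds the (large, >= 20 ms) up-front compilation cost.
    bool should_compile(const PipelineStats& s, double compile_cost_ns = 20e6) {
        double saved_ns =
            (s.per_tuple_interp_ns - s.per_tuple_compiled_ns) * s.remaining_tuples;
        return saved_ns > compile_cost_ns;
    }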

  9. (Real) JIT compilation: an adaptive cycle
     Interpret → Profile (collect runtime information & traces) → Select worthy sub-program(s) → Optimize → Compile (create specialised program & guards) → Install new kernel (& guards) → Interpret → ...
     • Adaptive by design
     • Low compilation effort
     • Ability to exploit multiple hardware architectures
     • Aggressive workload-driven optimizations
     • Mixed execution
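
A minimal C++ sketch of this cycle, assuming a per-chunk vectorized interpreter and a background JIT backend. Chunk, Profile, compile_specialized and the 16-chunk threshold are illustrative placeholders, not the API of the system described here:

    #include <atomic>
    #include <chrono>
    #include <future>

    struct Chunk { /* one vector of tuples: columns, selection vector, ... */ };
    using Kernel = void (*)(const Chunk&);

    struct Profile {
        long chunks_seen = 0;   // runtime information collected while interpreting
    };

    // Placeholder for a real JIT backend (e.g. LLVM); returns a specialized kernel.
    Kernel compile_specialized(Profile) {
        return [](const Chunk&) { /* specialized code would run here */ };
    }

    void run_pipeline(Chunk* chunks, int n, Kernel interpret) {
        Profile prof;
        std::atomic<Kernel> hot{nullptr};   // specialized kernel, once installed
        std::future<Kernel> pending;

        for (int i = 0; i < n; ++i) {
            if (Kernel k = hot.load()) {
                k(chunks[i]);               // mixed execution: take the compiled path
            } else {
                interpret(chunks[i]);       // vectorized interpretation
                ++prof.chunks_seen;         // profile: collect runtime information
                if (!pending.valid() && prof.chunks_seen >= 16)   // select a worthy sub-program
                    pending = std::async(std::launch::async, compile_specialized, prof);
            }
            if (pending.valid() &&
                pending.wait_for(std::chrono::seconds(0)) == std::future_status::ready)
                hot.store(pending.get());   // install the new kernel (guards omitted)
        }
    }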

  10. Domain-Specific Language

  11. The search for the right level of abstraction
      Low enough for:
      • Micro-adaptivity [3]
      • Efficient interpretation
      • JIT / incremental compilation
      [3] Răducanu et al., "Micro Adaptivity in Vectorwise", SIGMOD 2013

  12. The search for the right level of abstraction (cont.)
      High enough for:
      • Efficient execution on multiple devices
      • Macro-adaptivity, e.g. reordering operations

  13. The search for the right level of abstraction (cont.)
      Goal: Relational algebra → ? → Assembly, OpenCL, ...
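
Micro-adaptivity (footnote [3] on slide 11) means choosing, per vector, among several interchangeable implementations ("flavors") of the same primitive based on measured run times. A minimal sketch under assumed names, with two hypothetical selection flavors and a simple explore/exploit rule:

    #include <chrono>
    #include <cstdlib>

    // A "flavor" is one of several interchangeable implementations of a primitive.
    using Flavor = int (*)(const int* in, int* out, int n);

    // Two hypothetical flavors of "copy the values greater than zero":
    int sel_branching(const int* in, int* out, int n) {
        int k = 0;
        for (int i = 0; i < n; ++i)
            if (in[i] > 0) out[k++] = in[i];
        return k;
    }

    int sel_branch_free(const int* in, int* out, int n) {
        int k = 0;
        for (int i = 0; i < n; ++i) {
            out[k] = in[i];
            k += in[i] > 0;   // predication instead of a branch
        }
        return k;
    }

    // Per vector: mostly run the flavor that has been fastest so far,
    // occasionally explore the other one to keep the measurements fresh.
    struct MicroAdaptiveSelect {
        Flavor flavors[2] = {sel_branching, sel_branch_free};
        double avg_ns[2]  = {0.0, 0.0};
        long   calls[2]   = {0, 0};

        int run(const int* in, int* out, int n) {
            int pick;
            if (calls[0] == 0 || calls[1] == 0 || std::rand() % 100 < 5)
                pick = std::rand() % 2;                   // explore
            else
                pick = avg_ns[0] <= avg_ns[1] ? 0 : 1;    // exploit the current winner
            auto t0 = std::chrono::steady_clock::now();
            int k = flavors[pick](in, out, n);
            auto t1 = std::chrono::steady_clock::now();
            double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
            avg_ns[pick] = (avg_ns[pick] * calls[pick] + ns) / (calls[pick] + 1);
            ++calls[pick];
            return k;
        }
    };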

  14. Why (another) DSL?
      • Relational algebra: too high-level
      • (Scalar) monad/monoid comprehensions (Weld [4], MRQL [5]): high-level, but per-tuple transformations lose information
      [4] S. Palkar et al., "Weld: A Common Runtime for High Performance Data Analytics", CIDR 2017
      [5] L. Fegaras, "An Algebra for Distributed Big Data Analytics", 2016

  15. Why (another) DSL? (cont.)
      • C-alikes (OpenCL, CUDA, Intel SPMD Program Compiler (ispc), ...): too low-level
      • MonetDB assembly language: heavily data-parallel, too low-level

  16. Our vision: data-parallelism as a first-class citizen
      • Data-parallel skeletons/patterns: specialized operations on chunks of data, for example map, filter, scatter, gather, ...
      • Lambda functions
      • Immutable variables for intermediates (static single assignment form)
      • Mutable variables for remaining state
      • Partially typed (a ∈ DECIMAL(6,2) instead of a ∈ int64_t)
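
To make these skeletons concrete, here is a minimal scalar C++ sketch of map, filter, scatter and gather over fixed-size chunks; the signatures and the selection-vector representation of filter results are assumptions for illustration, not the DSL's definition:

    #include <cstddef>
    #include <cstdint>

    // map: apply a lambda element-wise.
    template <typename F>
    void map(const int64_t* in, int64_t* out, size_t n, F f) {
        for (size_t i = 0; i < n; ++i) out[i] = f(in[i]);
    }

    // filter: produce a selection vector of qualifying positions, return its length.
    template <typename P>
    size_t filter(const int64_t* in, uint32_t* sel, size_t n, P pred) {
        size_t k = 0;
        for (size_t i = 0; i < n; ++i)
            if (pred(in[i])) sel[k++] = static_cast<uint32_t>(i);
        return k;
    }

    // gather: fetch values at the given positions (e.g. to condense a filtered chunk).
    void gather(const int64_t* in, const uint32_t* sel, int64_t* out, size_t k) {
        for (size_t i = 0; i < k; ++i) out[i] = in[sel[i]];
    }

    // scatter: write values to the given positions (e.g. into hash-table slots).
    void scatter(const int64_t* in, const uint32_t* pos, int64_t* out, size_t k) {
        for (size_t i = 0; i < k; ++i) out[pos[i]] = in[i];
    }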

  17. Skeletons
      (Table mapping relational operators onto skeletons: columns map, filter, scatter, gather, ht_ins, merge; rows π, σ, ⋈ (hash), G (hash), ∪ (hash), ⋈ (merge), sort.)
      Skeletons themselves do not need to be implemented data-parallel (e.g. ht_ins)...
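
To illustrate the note that a skeleton such as ht_ins need not itself be data-parallel: in this hypothetical sketch the hash computation is a data-parallel map over the chunk, while ht_ins inserts the chunk into a tiny open-addressing table one key at a time (the table layout and hash constant are assumptions):

    #include <cstdint>
    #include <vector>

    struct HashTable {
        std::vector<int64_t> keys;
        std::vector<bool>    used;
        explicit HashTable(size_t slots) : keys(slots), used(slots, false) {}
    };

    // map skeleton: hash every key in the chunk (data-parallel, vectorizable).
    void map_hash(const int64_t* keys, uint64_t* hashes, size_t n) {
        for (size_t i = 0; i < n; ++i)
            hashes[i] = static_cast<uint64_t>(keys[i]) * 0x9E3779B97F4A7C15ull;
    }

    // ht_ins skeleton: insert the whole chunk, one key at a time (sequential).
    void ht_ins(HashTable& ht, const int64_t* keys, const uint64_t* hashes, size_t n) {
        const size_t mask = ht.keys.size() - 1;          // table size is a power of two
        for (size_t i = 0; i < n; ++i) {
            size_t slot = hashes[i] & mask;
            while (ht.used[slot] && ht.keys[slot] != keys[i])
                slot = (slot + 1) & mask;                // linear probing
            ht.keys[slot] = keys[i];
            ht.used[slot] = true;
        }
    }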

  18. Example
      mut i
      mut k
      i := 0
      k := 0
      loop
        let input = read i some_data in
        let a = map (\x -> 2*x) input in
        let t = filter (\x -> x > 0) a in
        let b = condense t
        write x i a
        write y k b
        i := i + len(a)
        k := k + len(b)
        if i >= 4096 then break
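
A plain C++ reading of what this program does, assuming a chunk size of 1024 and buffers of at least 4096 elements (both assumptions): double every value of some_data into x, and keep only the positive doubled values, densely packed, in y:

    #include <cstddef>
    #include <cstdint>

    constexpr size_t VEC = 1024;                           // chunk size (assumed)

    void example(const int64_t* some_data, int64_t* x, int64_t* y) {
        size_t i = 0, k = 0;                               // mut i, mut k
        while (true) {                                     // loop
            const int64_t* input = some_data + i;          // read i some_data
            int64_t a[VEC];
            for (size_t j = 0; j < VEC; ++j)               // map (\x -> 2*x) input
                a[j] = 2 * input[j];
            int64_t b[VEC];
            size_t nb = 0;
            for (size_t j = 0; j < VEC; ++j)               // filter (\x -> x > 0), condense
                if (a[j] > 0) b[nb++] = a[j];
            for (size_t j = 0; j < VEC; ++j) x[i + j] = a[j];   // write x i a
            for (size_t j = 0; j < nb;  ++j) y[k + j] = b[j];   // write y k b
            i += VEC;                                      // i := i + len(a)
            k += nb;                                       // k := k + len(b)
            if (i >= 4096) break;
        }
    }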

  19. Plan

  20. Plan
      • Base framework: DSL, vectorized interpreter
      • Dynamic VM: workload-specific optimizations
      • Multiple target architectures: GPUs, potentially FPGAs

  21. Takeaways
      DSL (domain-specific language)
      • Abstract enough for efficient portability, adaptive optimizations and efficient interpretation
      • State of the art does not fit!
      • Data parallelism as a first-class citizen
      VM (adaptive virtual machine targeting CPU, GPU, FPGA, ASIC)
      • Interpret first, maybe compile later
      • Cost models are hard to get right!
      • Adaptive by design
      • Aggressive workload-driven optimizations
      • Mixed execution
