Designing an adaptive VM that combines vectorized and JIT execution



slide-1
SLIDE 1

Designing an adaptive VM that combines vectorized and JIT execution on heterogeneous hardware

Tim Gubner ICDE PhD Symposium, 2018


slide-2
SLIDE 2

Modern hardware vs. data processing systems

CPU GPU ASIC FPGA

Dark Silicon


slide-3
SLIDE 3

State of the art

System      CPU  GPU  FPGA  ASIC
MonetDB     ✓
doppioDB    ✓         ✓
Ocelot      ✓    ✓
HyPer       ✓
MapD        ✓    ✓
CoGaDB      ✓    ✓
TensorFlow  ✓    ✓    ?     ✓

slide-4
SLIDE 4

Goal

One system to rule them all

One system to bring them, and in Dark Silicon bind them


slide-5
SLIDE 5

Idea

Domain-specific language
Adaptive virtual machine
CPU GPU FPGA ASIC

slide-6
SLIDE 6

Virtual Machine

slide-8
SLIDE 8

To compile or not to compile

  • Compilation is time-consuming (≥ 20 ms [1])
  • Also noticeable in HyPer [2]
  • Compilers make assumptions.

Resulting code is either:

  • Static and concise
  • Dynamic and bulky (code explosion)

Why would we ALWAYS want to compile EVERYTHING?

[1] Using the LLVM C++ API and optimization passes
[2] Kohn et al. “Adaptive Execution of Compiled Queries”, ICDE 2018
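The trade-off on this slide can be illustrated with a back-of-the-envelope break-even calculation. This is only a sketch: the 20 ms compilation cost comes from the slide, while the per-tuple costs (2 ns interpreted, 1 ns compiled) and the function names are hypothetical values chosen for illustration.

```python
def break_even_tuples(compile_ms: float,
                      interp_ns_per_tuple: float,
                      compiled_ns_per_tuple: float) -> float:
    """Number of tuples after which compilation has paid for itself."""
    saving_ns = interp_ns_per_tuple - compiled_ns_per_tuple
    if saving_ns <= 0:
        # Compiled code is not faster per tuple: compiling never pays off.
        return float("inf")
    return compile_ms * 1_000_000 / saving_ns  # ms -> ns


def should_compile(n_tuples: int,
                   compile_ms: float = 20.0,            # from the slide
                   interp_ns_per_tuple: float = 2.0,    # assumed
                   compiled_ns_per_tuple: float = 1.0   # assumed
                   ) -> bool:
    """Compile only if the query processes more tuples than the break-even point."""
    return n_tuples > break_even_tuples(
        compile_ms, interp_ns_per_tuple, compiled_ns_per_tuple)
```

Under these assumed costs, a query touching one million tuples would finish sooner interpreted, while one touching a hundred million tuples would benefit from compilation.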


slide-9
SLIDE 9

(Real) JIT-compilation

Interpret → Profile → Optimize → Compile

  • Collect runtime information & traces
  • Select worthy sub-program(s)
  • Create specialised program (& guards)
  • Install new kernel (& guards)

  • Adaptive by design
  • Low compilation effort
  • Ability to exploit multiple hardware architectures
  • Aggressive workload-driven optimizations
  • Mixed execution
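The cycle above can be sketched as a toy adaptive executor. Everything here is an assumption made for illustration: the program representation, the `HOT_THRESHOLD` value, the integer-only guard, and the use of Python's `eval` as a stand-in for real code generation. It is not the VM described in the talk.

```python
from typing import Callable, List, Optional, Tuple

Program = List[Tuple[str, int]]  # hypothetical form, e.g. [("mul", 2), ("add", 1)]
OPS = {"mul": lambda x, c: x * c, "add": lambda x, c: x + c}
SYMBOLS = {"mul": "*", "add": "+"}
HOT_THRESHOLD = 3  # assumed: interpreted runs before a program counts as "worthy"


def interpret(program: Program, chunk: list) -> list:
    """Vectorized interpretation: apply one primitive at a time to a whole chunk."""
    for op, const in program:
        f = OPS[op]
        chunk = [f(x, const) for x in chunk]
    return chunk


def compile_fused(program: Program) -> Callable[[list], list]:
    """'Compile' by fusing all steps into one per-element expression."""
    expr = "x"
    for op, const in program:
        expr = f"({expr} {SYMBOLS[op]} {const})"
    per_elem = eval(f"lambda x: {expr}")  # stands in for real code generation
    return lambda chunk: [per_elem(x) for x in chunk]


class AdaptiveVM:
    def __init__(self, program: Program):
        self.program = program
        self.runs = 0                      # profile: number of interpreted runs
        self.kernel: Optional[Callable[[list], list]] = None
        self.trace: List[str] = []         # which path handled each chunk

    def guard(self, chunk: list) -> bool:
        """Assumption the specialised kernel relies on (integers only)."""
        return all(type(x) is int for x in chunk)

    def run(self, chunk: list) -> list:
        # Mixed execution: use the installed kernel only if its guard holds.
        if self.kernel is not None and self.guard(chunk):
            self.trace.append("compiled")
            return self.kernel(chunk)
        # Otherwise interpret and collect runtime information.
        self.trace.append("interpreted")
        self.runs += 1
        result = interpret(self.program, chunk)
        # Select the program once it is hot, create and install a kernel.
        if self.kernel is None and self.runs >= HOT_THRESHOLD:
            self.kernel = compile_fused(self.program)
        return result
```

In this sketch the first three chunks are interpreted, later integer chunks run through the fused kernel, and a chunk that violates the guard falls back to the interpreter.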

slide-10
SLIDE 10

Domain-Specific Language

slide-13
SLIDE 13

The search for the right level of abstraction

Low enough

  • Micro-adaptivity [3]
  • Efficient interpretation
  • JIT / incremental compilation

High enough

  • Efficient execution on multiple devices
  • Macro-adaptivity: e.g. reorder operations

Goal: Relational algebra → ? → Assembly, OpenCL ...

[3] Răducanu et al. “Micro adaptivity in Vectorwise”, SIGMOD 2013

slide-14
SLIDE 14

Why (another) DSL?

Relational algebra

  • Too high-level

(Scalar) Monad/Monoid comprehension

  • Weld [4], MRQL [5]
  • High-level, but per-tuple transformations lose information

[4] S. Palkar et al. “Weld: A Common Runtime for High Performance Data Analytics”, CIDR 2017
[5] L. Fegaras. “An Algebra for Distributed Big Data Analytics”, 2016

slide-15
SLIDE 15

Why (another) DSL?

C-alikes

  • OpenCL, CUDA, Intel SPMD (ispc) ...
  • Too low-level

MonetDB assembly language

  • Heavily data-parallel, too low-level

slide-16
SLIDE 16

Our vision

Data-parallelism as first-class citizen

  • Data-parallel skeletons/patterns
      Specialized operations on chunks of data
      For example: map, filter, scatter, gather ...
  • Lambda functions
  • Immutable variables for intermediates (static single assignment form)
  • Mutable variables for remaining state
  • Partially typed (a ∈ DECIMAL(6,2) instead of a ∈ int64_t)
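As a rough illustration of what chunk-level skeletons could look like, here is a minimal Python sketch of map, filter, gather and scatter. The function names and signatures are hypothetical, and the choice to let `filter_` return a selection vector of positions (rather than the surviving values) is an assumption, not something the slides specify.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def map_(f: Callable[[T], U], chunk: Sequence[T]) -> List[U]:
    """Apply f to every element of the chunk."""
    return [f(x) for x in chunk]


def filter_(pred: Callable[[T], bool], chunk: Sequence[T]) -> List[int]:
    """Return the positions of qualifying elements (a selection vector)."""
    return [i for i, x in enumerate(chunk) if pred(x)]


def gather(chunk: Sequence[T], positions: Sequence[int]) -> List[T]:
    """Read the elements at the given positions."""
    return [chunk[i] for i in positions]


def scatter(dest: Sequence[T], positions: Sequence[int],
            values: Sequence[T]) -> List[T]:
    """Return a copy of dest with values written at the given positions."""
    out = list(dest)
    for i, v in zip(positions, values):
        out[i] = v
    return out


# Usage: a projection (2*x) followed by a selection (x > 0) on one chunk.
chunk = [3, -1, 4, -5]
doubled = map_(lambda x: 2 * x, chunk)
selected = filter_(lambda x: x > 0, doubled)
result = gather(doubled, selected)
```

Each function takes and returns whole chunks, so an interpreter pays its dispatch overhead once per chunk rather than once per tuple.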


slide-17
SLIDE 17

Skeletons

Op. map filter scatter gather ht_ins merge
π ✓
σ ✓ ✓
⋈Hash ✓ ✓ ✓ ✓ ✓
GHash ✓ ✓ ✓ ✓ ✓
∪Hash ✓ ✓ ✓ ✓ ✓
⋈Merge ✓ ✓ ✓
Sort ✓ (✓) ✓

Skeletons themselves do not need to be implemented data-parallel (e.g. ht_ins)...

slide-18
SLIDE 18

Example

mut i
mut k
i := 0
k := 0
loop
  let input = read i some_data in
  let a = map (\x -> 2*x) input in
  let t = filter (\x -> x>0) a in
  let b = condense t
  write x i a
  write y k b
  i := i + len(a)
  k := k + len(b)
  if i >= 4096 then
    break
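For readers unfamiliar with the notation, the example can be read roughly as the following Python loop. This is an interpretation, not the DSL's defined semantics: the chunk size of 1024, the reading of `filter` as producing a predicate mask that `condense` then materialises, and the early exit on exhausted input are all assumptions added here.

```python
CHUNK = 1024  # hypothetical chunk size; the slide does not state one


def run_example(some_data: list, limit: int = 4096, chunk: int = CHUNK):
    x: list = []   # output column written at position i
    y: list = []   # output column written at position k
    i = 0          # mutable state: read position and write position in x
    k = 0          # mutable state: write position in y
    while True:
        inp = some_data[i:i + chunk]               # read i some_data
        if not inp:                                # added: stop on exhausted input
            break
        a = [2 * v for v in inp]                   # map (\x -> 2*x) input
        t = [v > 0 for v in a]                     # filter (\x -> x>0) a, as a mask
        b = [v for v, keep in zip(a, t) if keep]   # condense t
        x[i:i + len(a)] = a                        # write x i a
        y[k:k + len(b)] = b                        # write y k b
        i += len(a)
        k += len(b)
        if i >= limit:
            break
    return x, y
```

The immutable `let` bindings correspond to the per-chunk intermediates `inp`, `a`, `t` and `b`, while `i` and `k` are the only state carried across iterations.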


slide-19
SLIDE 19

Plan

slide-20
SLIDE 20

Plan

Base framework

  • DSL, vectorized interpreter

Dynamic VM

  • Workload-specific optimizations

Multiple target architectures

  • GPUs, potentially FPGAs

slide-21
SLIDE 21

Takeaways

Domain-specific language
Adaptive virtual machine
CPU GPU FPGA ASIC

DSL

  • Abstract enough for:
      • Efficient portability
      • Adaptive optimizations
      • Efficient interpretation
  • State of the art does not fit!
  • Data parallelism as first-class citizen

VM

  • Interpret first, maybe compile later
  • Cost-models are hard to get right!
  • Adaptive by design
  • Aggressive workload-driven optimizations
  • Mixed execution