The Case for Heterogeneous HTAP Raja Appuswamy, Manos - - PowerPoint PPT Presentation



SLIDE 1

The Case for Heterogeneous HTAP

Raja Appuswamy, Manos Karpathiotakis, Danica Porobic, and Anastasia Ailamaki

Data-Intensive Applications and Systems Lab

EPFL

SLIDE 2

HTAP – the contract with the hardware

Hybrid OLTP & OLAP Processing (HTAP)

HTAP DBMS:

  • High-throughput OLTP
  • Low-latency OLAP
  • Fresh data

[Figure: database running on two multicore sockets, each with cores, an LLC, and DRAM]

HTAP on multicores:

  • Massive parallelism => high concurrency
  • Global shared memory => data sharing
  • System-wide coherence => synchronization

Necessary for current systems

SLIDE 3

Shifting hardware landscape (1): Specialization of CPUs

[Figure: two server diagrams, each showing two sockets with cores, an LLC, DRAM, and PCIe]

  • 1 coherence domain: multisocket multicores
  • Multiple coherence domains: Intel SCC, ARM v8, Cell SPE

CPUs: general-purpose → customizable features

SLIDE 4

Shifting hardware landscape (2): Generalization of GPUs

[Figure: normalized SGEMM/Watt across GPU generations (Tesla, Fermi, Kepler, Maxwell, Pascal), 2008 to 2016, annotated with programmability features (UVA, Dynamic Parallelism, UM, UM Paging) and interfaces (PCIe 3.0 at 16 GB/s, NVLink at 80-200 GB/s)]

GPUs: Niche accelerators → general-purpose processors

SLIDE 5

Emerging hardware: Revisiting the contract

Current → Emerging hardware

  • Homogeneous → Heterogeneous parallelism
  • Task-parallel CPUs
  • Data-parallel GPUs
  • System-wide → Relaxed cache coherence
  • OS (FOS), FS (Hare)
  • Runtimes (Cosh)
  • Global shared memory
  • Unified address space

HTAP software

  • Shared-everything OLTP: N/A
  • No synchronization without coherence
  • Cannot exploit heterogeneity
  • HTAP across processors
  • Server as distributed system
  • Fails to exploit shared memory

A clean-slate redesign is in order

SLIDE 6

Heterogeneous HTAP (H2TAP): Caldera

  • Store data in shared memory
  • Run OLTP workloads on task-parallel archipelago
  • Run OLAP workloads on data-parallel archipelago

[Figure: in-memory data store in DRAM, shared by a task-parallel archipelago (OLTP) of cores and a data-parallel archipelago (OLAP) of cores and GPUs]

Loose job-to-core assignment exploits heterogeneity

SLIDE 7

H2TAP Challenges

  • Store data in shared memory
  • Choose optimal data layout
  • OLTP on task-parallel archipelago
  • Make up for (lack of) cache coherence
  • OLAP on data-parallel archipelago
  • Share transactionally-consistent snapshots across processors

[Figure: in-memory data store in DRAM, shared by a task-parallel archipelago (OLTP) of cores and a data-parallel archipelago (OLAP) of cores and GPUs]

SLIDE 8

Data layout

  • Need to minimize PCIe data transfer to GPU
  • Data access on GPU should be sequential to enable “coalescing”
  • Caldera implements NSM, DSM, and PAX

[Figure: page layouts
  NSM page: C1 C2 C3 C4 | C1 C2 C3 C4 | C1 C2 C3 C4
  DSM page: C1 C1 C1 C2 C2 C2 C3 C3 C3
  PAX page: one minipage per column: (C1 C1 C1), (C2 C2 C2), (C3 C3 C3)]

PAX fits GPUs best (PCIe & coalesced accesses)
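
The three layouts named above can be sketched in a few lines. This is a minimal illustration, not Caldera's storage code: the function names, the list-based "pages", and the `values_shipped` cost model (whole contiguous runs are copied over PCIe) are all assumptions made for the example.

```python
from typing import List, Sequence

Row = Sequence[int]

def nsm(rows: List[Row]) -> List[int]:
    """Row-major (NSM): all attributes of one record are adjacent."""
    return [value for row in rows for value in row]

def dsm(rows: List[Row]) -> List[int]:
    """Column-major (DSM): each column is one contiguous run over the whole table."""
    ncols = len(rows[0])
    return [row[c] for c in range(ncols) for row in rows]

def pax(rows: List[Row], rows_per_page: int) -> List[List[List[int]]]:
    """PAX: a page holds whole records, split into one minipage per column."""
    pages = []
    for start in range(0, len(rows), rows_per_page):
        chunk = rows[start:start + rows_per_page]
        ncols = len(chunk[0])
        pages.append([[row[c] for row in chunk] for c in range(ncols)])
    return pages

def values_shipped(layout: str, nrows: int, ncols: int, needed_cols: int) -> int:
    """Assumed cost model: only whole contiguous runs are transferred.

    NSM must ship full records; DSM and PAX can ship just the needed
    columns or minipages.
    """
    if layout == "NSM":
        return nrows * ncols
    return nrows * needed_cols

rows = [(11, 12, 13), (21, 22, 23), (31, 32, 33), (41, 42, 43)]
print(nsm(rows))      # records back to back
print(dsm(rows))      # column 1, then column 2, then column 3
print(pax(rows, 2))   # two pages, three minipages each
```

Under this cost model a query that reads one of three columns ships a third of the data with DSM or PAX, while PAX still keeps every record inside a single page.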

SLIDE 9

OLTP without cache coherence

  • Use Data-Oriented Transaction Execution principles
  • Thread-to-data assignment leads to partitioned data, metadata (2PL, index)

SLIDE 10

OLTP without cache coherence

  • Use explicit messaging instead of implicit latching
  • Exploit shared memory by exchanging pointers instead of data

Message exchange between Thd A and Thd B:

  • 1. Msg(lookup, k)
  • 2. Reply(&k)
  • 3. Access *k
  • 4. Release(k)

Enforce coherence in software
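
The four-step exchange (Msg, Reply, Access, Release) can be sketched with one thread that owns a partition and serves requests from an inbox. This is an illustration of the protocol only: `PartitionOwner`, `increment`, and the message tuples are hypothetical names, and Python's `queue.Queue` (which uses locks internally) merely stands in for a message channel.

```python
import queue
import threading

class PartitionOwner(threading.Thread):
    """Owns one data partition; only this thread touches its metadata."""

    def __init__(self, records):
        super().__init__(daemon=True)
        self.inbox = queue.Queue()
        self.records = records   # key -> mutable record
        self.busy = set()        # keys currently handed out (logical locks)
        self.waiting = {}        # key -> reply queues of blocked requesters

    def run(self):
        while True:
            op, key, reply = self.inbox.get()
            if op == "stop":
                return
            if op == "lookup":
                if key in self.busy:
                    self.waiting.setdefault(key, []).append(reply)
                else:
                    self.busy.add(key)
                    reply.put(self.records[key])   # 2. Reply(&k): a reference, not a copy
            elif op == "release":
                pending = self.waiting.get(key)
                if pending:
                    # Hand the record straight to the next waiter; key stays busy.
                    pending.pop(0).put(self.records[key])
                else:
                    self.busy.discard(key)

def increment(owner, key):
    reply = queue.Queue(maxsize=1)
    owner.inbox.put(("lookup", key, reply))    # 1. Msg(lookup, k)
    record = reply.get()                       # wait for 2. Reply(&k)
    record["value"] += 1                       # 3. Access *k through the pointer
    owner.inbox.put(("release", key, None))    # 4. Release(k)

owner = PartitionOwner({"k": {"value": 0}})
owner.start()
workers = [
    threading.Thread(target=lambda: [increment(owner, "k") for _ in range(100)])
    for _ in range(4)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
owner.inbox.put(("stop", None, None))
owner.join()
print(owner.records["k"]["value"])
```

Because the owner grants each key to one requester at a time, the lock table needs no latch, and only a reference crosses between threads rather than the record itself.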

SLIDE 11

Transactionally-consistent data sharing

  • Data sharing across workloads
  • Use Unified Virtual Addressing (UVA) for CPU-GPU sharing
  • Consistent data sharing via hardware snapshotting (e.g., HyPer)
  • CUDA runtime restricts use in H2TAP context
  • Caldera supports lightweight software snapshotting
  • OLAP queries run on immutable snapshot
  • Copy-on-write performed by update transactions


Snapshots across GPU-CPU archipelagos
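
A minimal sketch of software copy-on-write snapshotting at page granularity, assuming a simple page table of Python lists. `SnapshotStore`, its methods, and the `copies` counter are hypothetical and only illustrate the idea that OLAP reads an immutable snapshot while update transactions pay for the shadow copy.

```python
class SnapshotStore:
    def __init__(self, pages):
        self.pages = [list(p) for p in pages]   # current page table
        self.shared = set()                     # pages still shared with a snapshot
        self.copies = 0                         # shadow copies made so far

    def snapshot(self):
        """OLAP side: capture the page table, not the page contents."""
        self.shared = set(range(len(self.pages)))
        return tuple(self.pages)

    def write(self, page_no, slot, value):
        """OLTP side: copy a page on its first write after a snapshot."""
        if page_no in self.shared:
            self.pages[page_no] = list(self.pages[page_no])   # shadow copy
            self.shared.discard(page_no)
            self.copies += 1
        self.pages[page_no][slot] = value

store = SnapshotStore([[1, 2], [3, 4]])
snap = store.snapshot()      # an OLAP query would scan `snap`
store.write(0, 1, 20)        # first write to page 0: copied
store.write(0, 0, 10)        # second write to page 0: no further copy
print(snap[0], store.pages[0], store.copies)
```

The number of copies grows with the number of distinct pages that updates touch after a snapshot, which is where the overhead of this approach comes from.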

SLIDE 12

Caldera blueprint

[Figure: Caldera components: query parser & optimizer, query compiler, query runtime, scheduler, task-parallel archipelago (cores), data-parallel archipelago (GPU), in-memory data store (DRAM)]

  • Determine ideal processor for query
  • Elastic core to workload assignment
  • OLAP on database snapshot
  • OLTP without cache coherence
  • Compile query to x86 or PTX code
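
One way to picture the scheduler's two decisions is the sketch below. The slides do not describe the actual policy, so everything here is an assumption: `route` sends OLTP to the task-parallel archipelago and OLAP to the GPU when one is present, and `split_cores` divides CPU cores in proportion to pending work.

```python
def route(kind: str, gpu_present: bool = True) -> tuple:
    """Pick an archipelago and a compilation target for a job (assumed policy)."""
    if kind == "OLTP":
        return ("task-parallel", "x86")
    if kind == "OLAP":
        return ("data-parallel", "PTX") if gpu_present else ("task-parallel", "x86")
    raise ValueError(f"unknown job kind: {kind}")

def split_cores(total: int, oltp_pending: int, olap_pending: int) -> tuple:
    """Elastic assignment: cores proportional to demand, at least one per side."""
    if total < 2:
        raise ValueError("need at least two cores to serve both workloads")
    demand = oltp_pending + olap_pending
    if demand == 0:
        oltp = total // 2
    else:
        oltp = round(total * oltp_pending / demand)
    oltp = max(1, min(total - 1, oltp))
    return oltp, total - oltp

print(route("OLAP"))
print(split_cores(24, oltp_pending=30, olap_pending=10))
```

A real scheduler would presumably also weigh data location and query shape; this sketch only shows where such a decision would plug in.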

SLIDE 13

Experiments

Setup

  • Two 12-core Intel Xeon E5-2650L v3 CPUs, 256GB RAM
  • GeForce GTX 980 GPU (PCIe 3.0) with 4GB memory
  • Benchmarks: TPC-C, TPC-H, YCSB at various scale factors
  • Baselines: Silo, MonetDB, DBMS-C

Goals

  • Message passing and software snapshotting overhead
  • PAX performance compared to NSM and DSM on GPUs
  • Caldera performance compared to state-of-the-art

SLIDE 14

OLTP throughput

[Figure: throughput (MTps) vs. number of cores (1 to 24) running TPC-C NewOrder (1 WH/core), Caldera vs. Silo]

  • Message passing-based design scales well
  • Better code & data locality (partitioning), no synchronization overhead

SLIDE 15

OLAP response time (incl. data movement)

[Figure: execution time (sec) for TPC-H SF 300, Query 6: Caldera, DBMS-C, MonetDB]

  • Exploits GPU parallelism
  • Saturates PCIe b/w
  • Bounded by PCIe bandwidth (12 GB/s)
  • Emerging interconnects (NVLink): 80-200 GB/s
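
A back-of-the-envelope estimate shows why the interconnect bounds this query. TPC-H Query 6 reads four `lineitem` columns, and `lineitem` has roughly 6 million rows per scale factor. The 4-byte value width is an assumption (the slides do not give Caldera's column encodings), so the result is an order-of-magnitude estimate, not a reproduction of the measurement.

```python
LINEITEM_ROWS_PER_SF = 6_000_000   # approximate TPC-H lineitem cardinality per scale factor

def transfer_seconds(scale_factor, columns, bytes_per_value, gb_per_s):
    """Time to move the referenced columns across the interconnect once."""
    total_bytes = scale_factor * LINEITEM_ROWS_PER_SF * columns * bytes_per_value
    return total_bytes / (gb_per_s * 1e9)

# Assumed: 4 columns of 4-byte values at SF 300.
for name, bandwidth in [("PCIe, 12 GB/s", 12), ("NVLink, 80 GB/s", 80), ("NVLink, 200 GB/s", 200)]:
    print(f"{name}: {transfer_seconds(300, 4, 4, bandwidth):.2f} s")
```

Under these assumptions about 28.8 GB must cross the bus, which takes about 2.4 s at 12 GB/s and well under half a second at NVLink speeds.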

SLIDE 16

Impact of snapshotting

[Figure, two panels, both vs. % of records touched by OLTP (1 to 100): OLTP throughput (KTps) with series q1, q1-10, and Ideal, annotated 9x and 3.5x; OLAP response time (secs) with Ideal, annotated 2x]

  • Limitation: Software shadow copying imposes a high overhead
  • Possible fix: data classification, snapshot sharing, h/w acceleration

SLIDE 17

Impact of data layout

1 table (i1 integer, i2 integer, …, i16 integer)

SELECT SUM(colA + colB) FROM table

[Figure: execution time of DSM, PAX, and NSM in two panels, one in msec. and one in sec.]

Data (16GB) in host memory:

  • PAX, DSM saturate PCIe
  • NSM 14x worse (non-coalesced accesses)

Data (1GB) in GPU memory:

  • PAX exploits GPU memory BW
  • NSM only 2x worse (GPUs have reduced the access “tax”)

Hybrid layouts like PAX are a good fit for H2TAP
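
Coalescing can be made concrete by counting how many memory segments one warp touches when each thread reads one 4-byte value. The sketch assumes a 32-thread warp, 128-byte segments, and an aligned base address; it does not model caches or PCIe, so it is not a prediction of either measured ratio above.

```python
def segments_per_warp(stride_bytes, value_bytes=4, warp_size=32, segment_bytes=128):
    """Count distinct memory segments touched when lane i reads at i * stride_bytes."""
    touched = set()
    for lane in range(warp_size):
        first = lane * stride_bytes
        last = first + value_bytes - 1
        touched.update(range(first // segment_bytes, last // segment_bytes + 1))
    return len(touched)

row_bytes = 16 * 4   # the 16-integer table from this slide, stored row by row

print("NSM     :", segments_per_warp(stride_bytes=row_bytes))   # one column, strided by a full row
print("DSM/PAX :", segments_per_warp(stride_bytes=4))           # one column, values adjacent
```

With values of one column adjacent (DSM, PAX minipages) the whole warp is served from a single segment, while the row layout spreads the same 32 reads over 16 segments.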

SLIDE 18

Conclusion

  • Hardware architecture is changing
  • New opportunities: massive parallelism, fast interconnects
  • New challenges: heterogeneity, relaxed coherence
  • Databases can and should exploit hardware trends
  • Exploit hardware heterogeneity in their core architecture design
  • Decouple system-wide coherence from shared memory
  • Time to move from HTAP to H2TAP
  • H2TAP architecture: revisit age-old h/w—s/w contract
  • Caldera: Preliminary prototype to prove that H2TAP is possible
