The Case for Heterogeneous HTAP
Raja Appuswamy, Manos Karpathiotakis, Danica Porobic, and Anastasia Ailamaki
Data-Intensive Applications and Systems Lab
EPFL
HTAP: the contract with the hardware
Hybrid OLTP & OLAP Processing (HTAP)
HTAP DBMS
High-throughput OLTP
Low-latency OLAP
Fresh data
[Figure: database on a two-socket multicore server; each socket has eight cores, an LLC, and DRAM]
Massive parallelism => high concurrency
Global shared memory => data sharing
System-wide coherence => synchronization
[Figure: multicore sockets, each with eight cores, an LLC, DRAM, and a PCIe link]
[Figure: GPU trends, 2008 to 2016. Normalized SGEMM/Watt across the Tesla, Fermi, Kepler, Maxwell, and Pascal generations. Programmability: UVA, Dynamic Parallelism, UM, Paging UM. Interface: PCIe 3.0 (16 GB/s), NVLink (80-200 GB/s)]
Current HTAP software vs. emerging hardware
Data-parallel archipelago (OLAP)
Task-parallel archipelago (OLTP)
[Figure: CPU cores and GPUs, each with its own DRAM, over a shared in-memory data store]
DSM page: C1 C1 C1 | C2 C2 C2 | C3 C3 C3
NSM page: C1 C2 C3 C4 | C1 C2 C3 C4 | C1 C2 C3 C4
PAX page: minipage (C1 C1 C1) | minipage (C2 C2 C2) | minipage (C3 C3 C3)
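The three page layouts above can be illustrated with a minimal sketch. All function names here are hypothetical, and the code only models the order in which values are laid out, not any real page format. The two offset helpers show why scanning one column is strided under NSM but contiguous under DSM (and inside a PAX minipage).

```python
from typing import List, Sequence

Row = Sequence[str]

def nsm_page(rows: List[Row]) -> List[str]:
    """Row-major (NSM): whole tuples stored back to back."""
    return [value for row in rows for value in row]

def dsm_page(rows: List[Row]) -> List[str]:
    """Column-major (DSM): each column stored contiguously."""
    return [value for column in zip(*rows) for value in column]

def pax_pages(rows: List[Row], rows_per_page: int) -> List[List[List[str]]]:
    """PAX: rows are grouped into pages; each page holds one minipage per column."""
    pages = []
    for start in range(0, len(rows), rows_per_page):
        chunk = rows[start:start + rows_per_page]
        pages.append([list(column) for column in zip(*chunk)])
    return pages

def column_offsets_nsm(num_rows: int, num_cols: int, col: int) -> List[int]:
    """Positions touched when scanning one column of an NSM page (strided)."""
    return [r * num_cols + col for r in range(num_rows)]

def column_offsets_dsm(num_rows: int, col: int) -> List[int]:
    """Positions touched when scanning one column of a DSM page (contiguous)."""
    return [col * num_rows + r for r in range(num_rows)]

# Illustrative 3x3 table: letters are columns, digits are row numbers.
rows = [["a1", "b1", "c1"], ["a2", "b2", "c2"], ["a3", "b3", "c3"]]
print(nsm_page(rows))
print(dsm_page(rows))
print(pax_pages(rows, rows_per_page=2))
```

A scan of the middle column touches positions 1, 4, 7 in the NSM page but 3, 4, 5 in the DSM page; the strided pattern is the kind of access the later slide calls non-coalesced.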
[Figure: system architecture. Query parser & optimizer, query compiler, scheduler, and query runtime on top of a task-parallel archipelago (CPU cores) and a data-parallel archipelago (GPU), each with DRAM, sharing an in-memory data store]
Determine ideal processor for query
Elastic core to workload assignment
OLAP on database snapshot
OLTP without cache coherence
Compile query to x86 or PTX code
[Figure: throughput (MTps) vs. number of cores (1 to 24) running TPC-C NewOrder (1 WH/core); series: Caldera, Silo]
[Figure: execution time (sec) for TPC-H SF 300, Query 6; systems: Caldera, DBMS-C, MonetDB]
Exploits GPU parallelism
Saturates PCIe bandwidth
[Figure: OLTP throughput (KTps) vs. % records touched by OLTP (1 to 100); series: q1, q1-10, Ideal; annotations: 3.5x, 9x]
[Figure: OLAP response time (secs) vs. % records touched by OLTP (1 to 100); reference line: Ideal; annotation: 2x]
Limitation: software shadow copying imposes a high overhead
Possible fix: data classification, snapshot sharing, h/w acceleration
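To make the shadow-copying overhead concrete, here is a minimal sketch of page-granular copy-on-write snapshots. The `ShadowStore` class, its page size, and its counters are hypothetical and are not taken from the system described in the slides. The sketch only shows the general idea: the first OLTP write to a page after a snapshot pays for a full page copy, so the copying cost grows with the share of data that OLTP touches.

```python
class ShadowStore:
    """Hypothetical page-granular store with copy-on-write snapshots."""

    def __init__(self, num_pages: int, records_per_page: int) -> None:
        self.records_per_page = records_per_page
        self.pages = [[0] * records_per_page for _ in range(num_pages)]
        self._snapshot = None      # page list frozen for OLAP, if any
        self._shadowed = set()     # pages already copied since the snapshot
        self.pages_copied = 0      # counts the shadow copies paid by OLTP

    def snapshot(self):
        """Freeze the current state for OLAP; shares page references, copies nothing."""
        self._snapshot = list(self.pages)
        self._shadowed = set()
        return self._snapshot

    def write(self, record_id: int, value: int) -> None:
        """OLTP update; the first write to a page after a snapshot copies that page."""
        page_no, slot = divmod(record_id, self.records_per_page)
        if self._snapshot is not None and page_no not in self._shadowed:
            self.pages[page_no] = list(self.pages[page_no])  # shadow copy
            self._shadowed.add(page_no)
            self.pages_copied += 1
        self.pages[page_no][slot] = value


store = ShadowStore(num_pages=4, records_per_page=4)
olap_view = store.snapshot()
for record_id, value in [(0, 7), (1, 8), (5, 9)]:
    store.write(record_id, value)

# OLAP still sees the pre-update state; OLTP sees its own writes.
print(sum(sum(page) for page in olap_view), store.pages[0][:2], store.pages_copied)
```

In this toy run, three record updates land on two pages, so two whole pages are copied; the snapshot taken for OLAP is unaffected by the updates.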
1 table (i1 integer, i2 integer, …, i16 integer)
SELECT SUM(colA + colB) FROM table
Data (16 GB) in host memory
[Figure: execution time (sec.) for DSM, PAX, NSM]
PAX, DSM saturate PCIe
NSM 14x worse (non-coalesced accesses)
Data (1 GB) in GPU memory
[Figure: execution time (msec.) for DSM, PAX, NSM]
PAX exploits GPU memory BW
NSM only 2x worse (GPUs have reduced the access "tax")
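A back-of-the-envelope bound helps put the host-memory case in context. The sketch below is an assumption-laden estimate, not a reproduction of the measurements: it assumes 16 equal-width columns, a PCIe 3.0 link at the 16 GB/s quoted earlier in the deck, that DSM and PAX let the GPU fetch only the two columns the query reads, and that NSM effectively moves whole rows. The helper names are hypothetical.

```python
GB = 10**9
PCIE3_BYTES_PER_SEC = 16 * GB   # PCIe 3.0 figure quoted earlier in the deck

def bytes_needed(table_bytes: int, total_cols: int, cols_read: int, layout: str) -> int:
    """Bytes that must cross the link, under the assumptions stated above."""
    if layout in ("DSM", "PAX"):
        # Columnar layouts: only the columns the query reads are fetched.
        return table_bytes * cols_read // total_cols
    if layout == "NSM":
        # Row layout: assumed to move whole rows, i.e. the whole table.
        return table_bytes
    raise ValueError(f"unknown layout: {layout}")

def min_transfer_seconds(num_bytes: int, link_bytes_per_sec: int) -> float:
    """Lower bound on transfer time at full link bandwidth."""
    return num_bytes / link_bytes_per_sec

for layout in ("DSM", "PAX", "NSM"):
    needed = bytes_needed(16 * GB, total_cols=16, cols_read=2, layout=layout)
    print(layout, needed / GB, "GB", min_transfer_seconds(needed, PCIE3_BYTES_PER_SEC), "s")
```

Under these assumptions the columnar layouts need at least 0.125 s and NSM at least 1 s, an 8x gap from data volume alone. The slide reports a larger 14x gap and attributes it to non-coalesced accesses, which this bound does not model.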