The Case for Heterogeneous HTAP Raja Appuswamy, Manos - PowerPoint PPT Presentation

The Case for Heterogeneous HTAP
Raja Appuswamy, Manos Karpathiotakis, Danica Porobic, and Anastasia Ailamaki
Data-Intensive Applications and Systems Lab, EPFL

HTAP – the contract with the hardware
Hybrid OLTP & OLAP Processing (HTAP) on multicores
• Massive parallelism => high concurrency
• Global shared memory => data sharing
• System-wide coherence => synchronization
[Figure: HTAP DBMS on multicore sockets (cores, LLC, DRAM); labels: High-throughput OLTP, Low-latency OLAP, Database, Fresh data]
Necessary for current systems

Shifting hardware landscape (1): Specialization of CPUs
• Multisocket multicores: 1 coherence domain
• Intel SCC, ARM v8, Cell SPE: multiple coherence domains
[Figure: cores, LLCs, DRAM, and PCIe links for both designs]
CPUs: general-purpose → customizable features

Shifting hardware landscape (2): Generalization of GPUs
[Figure: normalized SGEMM/Watt by GPU generation, 2008-2016 (Tesla, Fermi, Kepler, Maxwell, Pascal)]
• Programmability: UVA, Dynamic Parallelism, UM, paging UM
• Interface: PCIe 3.0 (16 GB/s), NVLink (80-200 GB/s)
GPUs: niche accelerators → general-purpose processors

Emerging hardware: Revisiting the contract
Current → emerging hardware, with the consequence for HTAP software:
• Homogeneous → heterogeneous parallelism (task-parallel CPUs, data-parallel GPUs): cannot exploit heterogeneity; HTAP across processors
• System-wide → relaxed cache coherence: shared-everything OLTP: N/A; no synch. sans coherence; OS (FOS), FS (Hare), runtimes (Cosh); server as distributed system
• Global shared memory → unified address space: fails to exploit shared memory
Clean slate redesign in order

Heterogeneous HTAP (H²TAP): Caldera
• Store data in shared memory
• Run OLTP workloads on task-parallel archipelago
• Run OLAP workloads on data-parallel archipelago
[Figure: task-parallel archipelago (OLTP, CPU cores) and data-parallel archipelago (OLAP, GPUs) over an in-memory data store in DRAM]
Loose job-to-core assignment exploits heterogeneity

H²TAP Challenges
• Store data in shared memory
  • Choose optimal data layout
• OLTP on task-parallel archipelago
  • Make up for (lack of) cache coherence
• OLAP on data-parallel archipelago
  • Share transactionally-consistent snapshots across processors
[Figure: task-parallel archipelago (OLTP) and data-parallel archipelago (OLAP) over an in-memory data store]

Data layout
• Need to minimize PCIe data transfer to GPU
• Data access on GPU should be sequential to enable "coalescing"
• Caldera implements NSM, DSM, and PAX
[Figure: placement of columns C1-C4 in an NSM page, a DSM page, and a PAX page made of per-column minipages]
PAX fits GPUs best (PCIe & coalesced accesses)
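The difference between the three layouts can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions (fixed-width integer columns, a page capacity counted in rows); the function names are hypothetical and do not reflect Caldera's actual code.

```python
from typing import List

Row = List[int]


def nsm_pages(rows: List[Row], rows_per_page: int) -> List[List[int]]:
    """NSM: a page stores whole rows back to back (row-major)."""
    pages = []
    for start in range(0, len(rows), rows_per_page):
        chunk = rows[start:start + rows_per_page]
        pages.append([value for row in chunk for value in row])
    return pages


def dsm_columns(rows: List[Row]) -> List[List[int]]:
    """DSM: every column of the table is stored on its own."""
    return [list(column) for column in zip(*rows)]


def pax_pages(rows: List[Row], rows_per_page: int) -> List[List[List[int]]]:
    """PAX: a page holds the same rows as an NSM page, but regrouped
    into one minipage per column."""
    pages = []
    for start in range(0, len(rows), rows_per_page):
        chunk = rows[start:start + rows_per_page]
        pages.append([list(column) for column in zip(*chunk)])
    return pages


# Four rows of four columns; the value 23 means "row 2, column 3".
rows = [[r * 10 + c for c in range(1, 5)] for r in range(1, 5)]
```

With two rows per page, the first NSM page is `[11, 12, 13, 14, 21, 22, 23, 24]`, while the first PAX page is `[[11, 21], [12, 22], [13, 23], [14, 24]]`. A scan of one column therefore reads one contiguous minipage per PAX page, whereas in NSM the wanted values are separated by the other columns of each row.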

OLTP without cache coherence
• Use Data-Oriented Transaction Execution principles
• Thread-to-data assignment leads to partitioned data, metadata (2PL, index)
[Figure: Thd A and Thd B]

OLTP without cache coherence
• Use explicit messaging instead of implicit latching
• Exploit shared memory by exchanging pointers instead of data
Message sequence between Thd A and Thd B:
1. Msg (lookup, k)
2. Reply(&k)
3. Access *k
4. Release(k)
Enforce coherence in software
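The four-step exchange (lookup, reply with a pointer, access, release) can be mimicked with ordinary threads and queues. The sketch below is a loose Python analogy, not Caldera's implementation: `PartitionOwner`, `increment`, and `demo` are hypothetical names, a Python object reference stands in for the pointer `&k`, and every requested key is assumed to exist.

```python
import queue
import threading


class PartitionOwner(threading.Thread):
    """Thread that exclusively owns one partition and serves it via messages."""

    def __init__(self, records):
        super().__init__(daemon=True)
        self.records = records      # key -> mutable record (a dict)
        self.inbox = queue.Queue()  # the only way other threads reach the data
        self.held = set()           # keys currently lent to a requester
        self.waiting = {}           # key -> reply queues of blocked requesters

    def run(self):
        while True:
            msg = self.inbox.get()
            if msg[0] == "stop":
                return
            if msg[0] == "lookup":
                _, key, reply = msg
                if key in self.held:
                    # Someone else holds the record: park the request.
                    self.waiting.setdefault(key, []).append(reply)
                else:
                    self.held.add(key)
                    reply.put(self.records[key])  # a reference, not a copy
            elif msg[0] == "release":
                _, key = msg
                waiters = self.waiting.get(key)
                if waiters:
                    # Hand the record straight to the next requester.
                    waiters.pop(0).put(self.records[key])
                else:
                    self.held.discard(key)


def increment(owner, key, field, amount):
    reply = queue.Queue(maxsize=1)
    owner.inbox.put(("lookup", key, reply))  # 1. Msg(lookup, k)
    record = reply.get()                     # 2. Reply(&k)
    record[field] += amount                  # 3. Access *k in place
    owner.inbox.put(("release", key))        # 4. Release(k)


def demo(clients=4, rounds=500):
    owner = PartitionOwner({"k": {"balance": 0}})
    owner.start()
    workers = [
        threading.Thread(
            target=lambda: [increment(owner, "k", "balance", 1) for _ in range(rounds)]
        )
        for _ in range(clients)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    owner.inbox.put(("stop",))
    owner.join()
    return owner.records["k"]["balance"]
```

No latch protects the record: mutual exclusion follows from the owner thread lending each key to one requester at a time, and only a reference crosses the queue, never the record's contents.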

Transactionally-consistent data sharing
• Data sharing across workloads
  • Use Unified Virtual Addressing (UVA) for CPU-GPU sharing
• Consistent data sharing via hardware snapshotting (ex: HyPer)
  • CUDA runtime restricts use in H²TAP context
• Caldera supports lightweight software snapshotting
  • OLAP queries run on immutable snapshot
  • Copy-on-write performed by update transactions
Snapshots across GPU-CPU archipelagos
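A minimal, single-threaded sketch of software snapshotting with copy-on-write follows. It assumes page-granularity shadow copies and omits all concurrency control and snapshot reclamation; `CowStore` and `scan_sum` are hypothetical names rather than Caldera's interfaces.

```python
class CowStore:
    """Paged in-memory store whose updates copy a page before writing
    if a snapshot may still reference it."""

    def __init__(self, values, page_size):
        self.page_size = page_size
        self.pages = [
            list(values[i:i + page_size])
            for i in range(0, len(values), page_size)
        ]
        # shared[p] is True when a snapshot may reference page p.
        self.shared = [False] * len(self.pages)
        self.copies = 0  # number of shadow copies made so far

    def snapshot(self):
        """Return an immutable view of the current pages; no data is copied."""
        self.shared = [True] * len(self.pages)
        return tuple(self.pages)

    def update(self, index, value):
        """Update transaction: shadow-copy the page on first write after
        a snapshot, then modify the private copy."""
        page_no, offset = divmod(index, self.page_size)
        if self.shared[page_no]:
            self.pages[page_no] = list(self.pages[page_no])
            self.shared[page_no] = False
            self.copies += 1
        self.pages[page_no][offset] = value


def scan_sum(snapshot):
    """OLAP-style scan over a snapshot taken earlier."""
    return sum(sum(page) for page in snapshot)


store = CowStore(list(range(8)), page_size=4)
snap = store.snapshot()
store.update(0, 100)   # first write to page 0 copies it
store.update(1, 100)   # second write reuses the private copy
```

After the two updates, `scan_sum(snap)` still sees the original values, while the live pages reflect the writes at the cost of a single page copy. The more pages update transactions touch, the more copies they pay for.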

Caldera blueprint
• Query parser & optimizer: determine ideal processor for query
• Query compiler: compile query to x86 or PTX code
• Query runtime
• Scheduler: elastic core to workload assignment
• Task-parallel archipelago (cores): OLTP without cache coherence
• Data-parallel archipelago (GPU): OLAP on database snapshot
• In-memory data store (DRAM)

Experiments
Setup
• Two 12-core Intel Xeon E5-2650L v3 CPUs, 256GB RAM
• GeForce GTX 980 GPU (PCIe 3.0) with 4GB memory
• TPC-C, TPC-H, YCSB in various scale factors
• Silo, MonetDB, DBMS-C
Goals
• Message passing and software snapshotting overhead
• PAX performance compared to NSM and DSM on GPUs
• Caldera performance compared to state-of-the-art

OLTP throughput
[Figure: throughput (MTps) of Caldera and Silo vs. number of cores (1-24) running TPC-C NewOrder (1 WH/core)]
• Message passing-based design scales well
• Better code & data locality (partitioning), no synchronization overhead

OLAP response time (incl. data movement)
[Figure: execution time (sec) of Caldera, DBMS-C, and MonetDB on TPC-H SF 300, Query 6; annotations: "Exploits GPU parallelism", "Saturates PCIe b/w"]
• Bounded by PCIe bandwidth (12GB/s)
• Emerging interconnects (NVLink): 80-200 GB/s
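When a query is bound by the interconnect, its response time cannot drop below the data volume divided by the link bandwidth. The arithmetic can be sketched as follows; the 24 GB volume is a made-up figure chosen only for illustration (it is not a measurement from these experiments), and 1 GB is taken as 10^9 bytes.

```python
def transfer_seconds(bytes_moved: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on response time when every byte crosses the link once."""
    return bytes_moved / (bandwidth_gb_per_s * 1e9)


HYPOTHETICAL_BYTES = 24e9  # assumed scan volume, for illustration only

pcie = transfer_seconds(HYPOTHETICAL_BYTES, 12)          # PCIe at 12 GB/s
nvlink_low = transfer_seconds(HYPOTHETICAL_BYTES, 80)    # NVLink, low end
nvlink_high = transfer_seconds(HYPOTHETICAL_BYTES, 200)  # NVLink, high end
```

For this assumed volume the bound is 2 seconds over PCIe and roughly 0.3 to 0.12 seconds over an 80-200 GB/s link, which is why a faster interconnect shifts the bottleneck back toward the processor.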

Impact of snapshotting
[Figure: OLAP response time (secs) and OLTP throughput (KTps) vs. % records touched by OLTP (1-100); series q1 and q1-10 against the ideal; annotated gaps of 2x, 3.5x, and 9x]
• Limitation: software shadow copying imposes a high overhead
• Possible fix: data classification, snapshot sharing, h/w acceleration

Impact of data layout
1 table (i1 integer, i2 integer, ..., i16 integer)
SELECT SUM(colA + colB) FROM table
[Figure: execution time for DSM, PAX, and NSM with data (1GB) in GPU memory (msec.) and data (16GB) in host memory (sec.)]
Chart annotations:
• NSM only 2x worse (GPUs have reduced the access "tax")
• NSM 14x worse (non-coalesced accesses)
• PAX exploits GPU memory BW
• PAX, DSM saturate PCIe
Hybrid layouts like PAX a good fit for H²TAP

Conclusion
• Hardware architecture is changing
  • New opportunities: massive parallelism, fast interconnects
  • New challenges: heterogeneity, relaxed coherence
• Databases can and should exploit hardware trends
  • Exploit hardware heterogeneity in their core architecture design
  • Decouple system-wide coherence from shared memory
• Time to move from HTAP to H²TAP
  • H²TAP architecture: revisit age-old h/w-s/w contract
  • Caldera: preliminary prototype to prove that H²TAP is possible
