GPU ACCELERATION FOR OLAP Tim Kaldewey, Jiri Kraus, Nikolay - PowerPoint PPT Presentation

GPU ACCELERATION FOR OLAP Tim Kaldewey, Jiri Kraus, Nikolay Sakharnykh 03/26/2018

A TYPICAL ANALYTICS QUERY From a business question to SQL Business question (TPC-H query 4) SQL Determines how well the order priority select o_orderpriority, system is working and gives an assessment count(o_orderkey) as order_count, of customer satisfaction from orders Counts the number of orders ordered in a where given quarter of a given year in which at o_orderdate >= date '[DATE]' and o_orderdate < date '[DATE]' + interval '3' month and least one lineitem was received by the exists ( select * from lineitem customer later than its committed date. where l_orderkey = o_orderkey and The query lists the count of such orders for l_commitdate < l_receiptdate) each order priority sorted in ascending group by o_orderpriority, priority order order by o_orderpriority; Examples: DATE = 1/1/1993, 1/4/1993, … 2

A TYPICAL ANALYTICS QUERY From SQL to Database Operators Database Operators SQL select o_orderpriority, aggregate count(o_orderkey) as order_count, from orders where o_orderdate >= date '[DATE]' and predicate (filter) o_orderdate < date '[DATE]' + interval '3' month and exists ( select * from lineitem join where l_orderkey = o_orderkey and predicate (filter) l_commitdate < l_receiptdate) group by aggregate o_orderpriority, order by sort o_orderpriority; 3

A TYPICAL ANALYTICS QUERY Joins are implicit in a business question SQL Database Operators Business question Counts the number of select orders ordered in a o_orderpriority, aggregate given quarter of a count(o_orderkey) as order_count, from given year in which at orders least one lineitem was where received by the predicate (filter) o_orderdate >= date '[DATE]' and customer later than its o_orderdate < date '[DATE]' + interval '3' month and exists ( select * from lineitem committed date. The join where l_orderkey = o_orderkey and query lists the count of predicate (filter) l_commitdate < l_receiptdate) such orders for each aggregate group by order priority sorted in o_orderpriority, order by ascending priority sort o_orderpriority; order 4

TPC-H SCHEMA part (p_) PARTKEY customer (c_) NAME order (o_) lineitem (l_) CUSTKEY MFGR ORDERKEY ORDERKEY NAME CATEGORY LINENUMBER CUSTKEY ADDRESS BRAND PARTKEY ORDERDATE CITY … SUPPKEY ORDPRIORITY … COMMITDATE ORDERSTATUS supplier (s_) … RECEIPTDATE SUPPKEY … NAME … ADDRESS nation (n_) CITY NATIONKEY NATIONKEY NAME … … 5

WHERE DOES TIME GO? TPC-H query 4 @SF10 = 10GB data warehouse CPU execution breakdown select o_orderpriority, join group-by count(o_orderkey) as order_count, 1% from orders where o_orderdate >= date '[DATE]’ and o_orderdate < date '[DATE]' + interval '3' month and exists ( select * from lineitem where l_orderkey = o_orderkey and l_commitdate < l_receiptdate) group by o_orderpriority, order by o_orderpriority; 99% 18/22 TPC-H Queries involve Join & are the longest running ones 6

RELATIONAL JOIN Join Results Lineitem 1 Order 2 l_orderkey o_orderkey o_orderpriority o_orderkey o_orderpriority 23 11 1 23 5 = 14 23 5 11 1 56 27 2 27 2 11 29 4 23 5 39 27 Payload Primary Key 23 Foreign Key 1 after applying predicate “ l_commitdate < l_receiptdate ” 7 2 after applying predicates “ o_orderdate >= date '[DATE]’ and o_orderdate < date '[DATE]' + interval '3' month ”

HASH JOIN Join Results Lineitem 1 Order 2 l_orderkey o_orderkey o_orderpriority o_orderkey o_orderpriority 23 11 1 23 5 = 14 23 5 11 1 56 27 2 27 2 11 29 4 23 5 39 27 Payload Primary Key 23 Build hash table Foreign Key = Probe inputs 1 after applying predicate “ l_commitdate < l_receiptdate ” 8 2 after applying predicates “ o_orderdate >= date '[DATE]’ and o_orderdate < date '[DATE]' + interval '3' month ”

HASH JOIN General approach – including payload(s) 1. Build a hash table Build table Hash table k 1 ,p 1 k 1 ,p 1 Scan k 2, p 2 k 2 ,p 2 1. Compute h(k) – k is a primary (unique) key 2. Insert k with payload p into hashed location 3. If occupied, insert into next free one 9

HASH JOIN General approach – including payload(s) 1. Build a hash table 2. Probe the hash table Build table Hash table Probe table Hash table k 1 ,p 1 k 1 ,p 1 k 1 ,p 1 fk 1 Scan Scan k 2, p 2 fk 2 k 2 ,p 2 k 3 ,p 3 k 2 ,p 2 k 4 ,p 4 k 5 ,p 5 k 6 ,p 6 k 7 ,p 7 1. Compute h(k) – k is a primary (unique) key 1. Compute h(fk) – fk is foreign key 2. Insert k with payload p into hashed location 2. Compare fk and key k in the hash table 3. If occupied, insert into next free one 3. If there is a match store the result 10

HASH JOIN General approach – including payload(s) 1. Build a hash table 2. Probe the hash table Build table Hash table Probe table Hash table k 1 ,p 1 k 1 ,p 1 k 1 ,p 1 fk 1 Scan Scan k 2, p 2 fk 2 k 2 ,p 2 k 3 ,p 3 k 2 ,p 2 k 4 ,p 4 k 5 ,p 5 k 6 ,p 6 k 7 ,p 7 1. Compute h(k) – k is a primary (unique) key 1. Compute h(fk) – fk is foreign key 2. Insert k with payload p into hashed location 2. Compare fk and key k in the hash table 3. If occupied, insert into next free one 3. If there is a match store the result Build and Probe both produce a random data access pattern! 11

RANDOM ACCESS PERFORMANCE Intel Xeon Gold 6140 (Skylake) DDR4: 6 channels, 64-bit per channel, 2666MT/s, tFAW=30ns* Peak memory bandwidth 120GB/s Random 8B access 6GB/s NVIDIA Tesla V100 (Volta) Peak memory bandwidth 900GB/s Random 8B access 60GB/s 12 *DDR4 specs: https://www.micron.com/~/media/documents/products/data-sheet/dram/ddr4/4gb_ddr4_sdram.pdf

HETEROGENEOUS SYSTEM Faster access 16GB 16GB HBM HBM … Multiple TBs DDR Larger capacity 13

HETEROGENEOUS SYSTEM Data Placement & Movement Faster access Hash Table(s) Key Payload 16GB 16GB 23 5 HBM HBM 27 2 Build & Probe … Multiple TBs DDR DB Larger capacity 14

INTERCONNECT SPEEDS PCIe3: 16GB/s • • NVLINK1: 20GB/s per link, up to 2 links between CPU/GPU = 40GB/s NVLINK2: 25GB/s per link, up to 3 links = 75GB/s • • Further increase throughput via compression: see S8417 tomorrow at 2pm* GPU joins (HT probing): PCIe/NVLINK1: GPU random access bandwidth >> CPU-GPU bandwidth • • NVLINK2 could be limited by GPU random access performance 15 *S8417 - Breaking the Speed of Interconnect with Compression for Database Applications – Tuesday, Mar 27, 2:00pm – Room 210F

JOIN OPERATOR ACCELERATION 16

IMPLEMENTING HASH JOIN concurrent_unordered_map Hash table in Unified Memory can be accessed on the host and template < typename Key , typename Element , device Key unused_key , Element unused_element , typename Hasher = default_hash < Key >, typename Equality = equal_to < Key >, typename Allocator = managed_allocator< thrust :: pair < Key , Element > > > class concurrent_unordered_map : public managed { Features: public : //omitting typedef* Construction on CPU explicit concurrent_unordered_map ( size_type n , • const Hasher & hf , const Equality & eql , • Works on CPU and GPU const allocator_type & a ); Concurrent insert’s • __host__ __device__ iterator begin (); • Concurrent find __host__ __device__ iterator end (); Concurrent insert and find* • __host__ __device__ iterator insert ( const value_type & x ); No concurrent CPU and GPU • __host__ __device__ const_iterator find ( const key_type & k ) const ; insert (currently) void prefetch ( const int dev_id ); }; *No visibility guarantees 17

IMPLEMENTING HASH JOIN 1. Build a hash table with a concurrent_unordered_map template < typename map_type > __global__ void build_hash_tbl ( map_type * const map , const typename map_type :: key_type * const build_tbl , const typename map_type :: size_type build_tbl_size ) { const typename map_type :: mapped_type i = tIdx . x + bIdx . x * bDim . x ; if ( i < build_tbl_size ) { map -> insert ( thrust :: make_pair ( build_tbl [ i ], i ) ); } } 18

IMPLEMENTING HASH JOIN Concurrent Inserts with atomicCAS iterator insert ( const value_type & x ) { size_type key_hash = hf ( x . first ); Key Key Value Value size_type ht_idx = key_hash % ht_size ; unused_key unused_key unused_element unused_element value_type * it = 0 ; while ( 0 == it ) { unused_key unused_key unused_element unused_element value_type * tmp_it = ht_values + ht_idx ; unused_key unused_key unused_element unused_element const key_type old_key = atomicCAS ( &( tmp_it -> first ), unused_key , x . first ); 7287984 7287984 9024 9024 if ( equal ( unused_key , old_key ) || unused_key x.first=75 x.second=875 unused_element equal ( x.first , old_key ) ) { ( ht_values + ht_idx )-> second = x . second ; 283408 283408 2309480 2309480 it = tmp_it ; } unused_key unused_key unused_element unused_element ht_idx = ( ht_idx + 1 )% ht_size ; unused_key unused_key unused_element unused_element } return iterator ( ht_values , ht_values + ht_size , it ); } 19

GPU ACCELERATION FOR OLAP Tim Kaldewey, Jiri Kraus, Nikolay - PowerPoint PPT Presentation

GPU ACCELERATION FOR OLAP Tim Kaldewey, Jiri Kraus, Nikolay Sakharnykh 03/26/2018 A TYPICAL ANALYTICS QUERY From a business question to SQL Business question (TPC-H query 4) SQL Determines how well the order priority select

2 Workloa d? 3 OLTP 4 OLAP OLTP 4 OLAP OLTP Streaming 4 Scan- OLAP OLTP Streaming

Data Warehouse and OLAP II Data Warehouse and OLAP II Week 6 1 Team Homework Assignment #8

OLAP & Data Mining OLAP & Data Mining Agenda Agenda SQL Server Features (in short) SQL

A GPU-Inspired Soft Processor for High- Throughput Acceleration Throughput Acceleration Jeffrey

Status of GPU offloading on Wayland Axel Davy FOSDEM 2014 Status of GPU offloading on Wayland

Motivation to Learn GPGPU Julius Parulek Why to Learn About GPU? Computational power of GPU vs.

OLAP and Data Mining Chapter 17 OLTP Compared With OLAP On Line Transaction Processing

Data Warehouse and OLAP Data Warehouse and OLAP Week 5 1 Midterm I Midterm I Friday, March

Saiku Saiku taking taking OLAP OLAP databases databases into into 21 21st st

Fjord Bridging the between Gap OLAP SQL and Jens-Peter Dittrich 1, *

GPU ACCELERATION OF CHOLMOD: BATCHING, HYBRID AND MULTI-GPU Steve Rennich, Darko Stosic, Tim

UNIFIED MEMORY ON PASCAL AND VOLTA Nikolay Sakharnykh - May 10, 2017 1 HETEROGENEOUS

Advancements in V-Ray RT GPU Vlado Koylazov, CTO & Co-founder Blagovest Taskov, RT GPU Team

Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS Jorge

GPYTORCH : BLACKBOX MATRIX- MATRIX GAUSSIAN PROCESS INFERENCE WITH GPU ACCELERATION Jacob R.

Acceleration at North Allegheny Mathematics Acceleration (Elementary) Students may qualify for

Beyond GIS: Spatial On-Line Analytical Processing and Big Data Professor Yvan Bedard, PhD,

Real-Time On-line Analytical Processing (OLAP) On Multi-Core and Cloud Architectures Frank Dehne

Chapter 4 , Data W arehouse and OLAP Operations Young-Rae Cho Associate Professor Department of

Introduction to ORSA OIC Risk Management Seminar 30 June 2016 William Song Part 1 ORSA

Alpha Presentation Modern Online Analytical Processing Cube The Capstone Experience Team Urban

OLTP + OLAP ftw! QCon San Francisco November 8, 2016 ABOUT ME Developed web apps for 5 years

Id ParentId Data 1 0 Parent 2 1 Child1 3 1 Child2 Id ParentId0 ParentId1 ParentId2

Web Performance Optimization: Analytics Wim Leers Promotor: Prof. dr. Jan Van den Bussche Web

GPU ACCELERATION FOR OLAP Tim Kaldewey, Jiri Kraus, Nikolay - PowerPoint PPT Presentation

GPU ACCELERATION FOR OLAP Tim Kaldewey, Jiri Kraus, Nikolay Sakharnykh 03/26/2018 A TYPICAL ANALYTICS QUERY From a business question to SQL Business question (TPC-H query 4) SQL Determines how well the order priority select

2 Workloa d? 3 OLTP 4 OLAP OLTP 4 OLAP OLTP Streaming 4 Scan- OLAP OLTP Streaming

Data Warehouse and OLAP II Data Warehouse and OLAP II Week 6 1 Team Homework Assignment #8

OLAP &amp; Data Mining OLAP &amp; Data Mining Agenda Agenda SQL Server Features (in short) SQL

A GPU-Inspired Soft Processor for High- Throughput Acceleration Throughput Acceleration Jeffrey

Status of GPU offloading on Wayland Axel Davy FOSDEM 2014 Status of GPU offloading on Wayland

Motivation to Learn GPGPU Julius Parulek Why to Learn About GPU? Computational power of GPU vs.

OLAP and Data Mining Chapter 17 OLTP Compared With OLAP On Line Transaction Processing

Data Warehouse and OLAP Data Warehouse and OLAP Week 5 1 Midterm I Midterm I Friday, March

Saiku Saiku taking taking OLAP OLAP databases databases into into 21 21st st

Fjord Bridging the between Gap OLAP SQL and Jens-Peter Dittrich 1, *

GPU ACCELERATION OF CHOLMOD: BATCHING, HYBRID AND MULTI-GPU Steve Rennich, Darko Stosic, Tim

UNIFIED MEMORY ON PASCAL AND VOLTA Nikolay Sakharnykh - May 10, 2017 1 HETEROGENEOUS

Advancements in V-Ray RT GPU Vlado Koylazov, CTO &amp; Co-founder Blagovest Taskov, RT GPU Team

Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS Jorge

GPYTORCH : BLACKBOX MATRIX- MATRIX GAUSSIAN PROCESS INFERENCE WITH GPU ACCELERATION Jacob R.

Acceleration at North Allegheny Mathematics Acceleration (Elementary) Students may qualify for

Beyond GIS: Spatial On-Line Analytical Processing and Big Data Professor Yvan Bedard, PhD,

Real-Time On-line Analytical Processing (OLAP) On Multi-Core and Cloud Architectures Frank Dehne

Chapter 4 , Data W arehouse and OLAP Operations Young-Rae Cho Associate Professor Department of

Introduction to ORSA OIC Risk Management Seminar 30 June 2016 William Song Part 1 ORSA

Alpha Presentation Modern Online Analytical Processing Cube The Capstone Experience Team Urban

OLTP + OLAP ftw! QCon San Francisco November 8, 2016 ABOUT ME Developed web apps for 5 years

Id ParentId Data 1 0 Parent 2 1 Child1 3 1 Child2 Id ParentId0 ParentId1 ParentId2

Web Performance Optimization: Analytics Wim Leers Promotor: Prof. dr. Jan Van den Bussche Web

OLAP & Data Mining OLAP & Data Mining Agenda Agenda SQL Server Features (in short) SQL

Advancements in V-Ray RT GPU Vlado Koylazov, CTO & Co-founder Blagovest Taskov, RT GPU Team