gpu acceleration for olap
play

GPU ACCELERATION FOR OLAP Tim Kaldewey, Jiri Kraus, Nikolay - PowerPoint PPT Presentation

GPU ACCELERATION FOR OLAP Tim Kaldewey, Jiri Kraus, Nikolay Sakharnykh 03/26/2018 A TYPICAL ANALYTICS QUERY From a business question to SQL Business question (TPC-H query 4) SQL Determines how well the order priority select


  1. GPU ACCELERATION FOR OLAP Tim Kaldewey, Jiri Kraus, Nikolay Sakharnykh 03/26/2018

  2. A TYPICAL ANALYTICS QUERY From a business question to SQL Business question (TPC-H query 4) SQL Determines how well the order priority select o_orderpriority, system is working and gives an assessment count(o_orderkey) as order_count, of customer satisfaction from orders Counts the number of orders ordered in a where given quarter of a given year in which at o_orderdate >= date '[DATE]' and o_orderdate < date '[DATE]' + interval '3' month and least one lineitem was received by the exists ( select * from lineitem customer later than its committed date. where l_orderkey = o_orderkey and The query lists the count of such orders for l_commitdate < l_receiptdate) each order priority sorted in ascending group by o_orderpriority, priority order order by o_orderpriority; Examples: DATE = 1/1/1993, 1/4/1993, … 2

  3. A TYPICAL ANALYTICS QUERY From SQL to Database Operators Database Operators SQL select o_orderpriority, aggregate count(o_orderkey) as order_count, from orders where o_orderdate >= date '[DATE]' and predicate (filter) o_orderdate < date '[DATE]' + interval '3' month and exists ( select * from lineitem join where l_orderkey = o_orderkey and predicate (filter) l_commitdate < l_receiptdate) group by aggregate o_orderpriority, order by sort o_orderpriority; 3

  4. A TYPICAL ANALYTICS QUERY Joins are implicit in a business question SQL Database Operators Business question Counts the number of select orders ordered in a o_orderpriority, aggregate given quarter of a count(o_orderkey) as order_count, from given year in which at orders least one lineitem was where received by the predicate (filter) o_orderdate >= date '[DATE]' and customer later than its o_orderdate < date '[DATE]' + interval '3' month and exists ( select * from lineitem committed date. The join where l_orderkey = o_orderkey and query lists the count of predicate (filter) l_commitdate < l_receiptdate) such orders for each aggregate group by order priority sorted in o_orderpriority, order by ascending priority sort o_orderpriority; order 4

  5. TPC-H SCHEMA part (p_) PARTKEY customer (c_) NAME order (o_) lineitem (l_) CUSTKEY MFGR ORDERKEY ORDERKEY NAME CATEGORY LINENUMBER CUSTKEY ADDRESS BRAND PARTKEY ORDERDATE CITY … SUPPKEY ORDPRIORITY … COMMITDATE ORDERSTATUS supplier (s_) … RECEIPTDATE SUPPKEY … NAME … ADDRESS nation (n_) CITY NATIONKEY NATIONKEY NAME … … 5

  6. WHERE DOES TIME GO? TPC-H query 4 @SF10 = 10GB data warehouse CPU execution breakdown select o_orderpriority, join group-by count(o_orderkey) as order_count, 1% from orders where o_orderdate >= date '[DATE]’ and o_orderdate < date '[DATE]' + interval '3' month and exists ( select * from lineitem where l_orderkey = o_orderkey and l_commitdate < l_receiptdate) group by o_orderpriority, order by o_orderpriority; 99% 18/22 TPC-H Queries involve Join & are the longest running ones 6

  7. RELATIONAL JOIN Join Results Lineitem 1 Order 2 l_orderkey o_orderkey o_orderpriority o_orderkey o_orderpriority 23 11 1 23 5 = 14 23 5 11 1 56 27 2 27 2 11 29 4 23 5 39 27 Payload Primary Key 23 Foreign Key 1 after applying predicate “ l_commitdate < l_receiptdate ” 7 2 after applying predicates “ o_orderdate >= date '[DATE]’ and o_orderdate < date '[DATE]' + interval '3' month ”

  8. HASH JOIN Join Results Lineitem 1 Order 2 l_orderkey o_orderkey o_orderpriority o_orderkey o_orderpriority 23 11 1 23 5 = 14 23 5 11 1 56 27 2 27 2 11 29 4 23 5 39 27 Payload Primary Key 23 Build hash table Foreign Key = Probe inputs 1 after applying predicate “ l_commitdate < l_receiptdate ” 8 2 after applying predicates “ o_orderdate >= date '[DATE]’ and o_orderdate < date '[DATE]' + interval '3' month ”

  9. HASH JOIN General approach – including payload(s) 1. Build a hash table Build table Hash table k 1 ,p 1 k 1 ,p 1 Scan k 2, p 2 k 2 ,p 2 1. Compute h(k) – k is a primary (unique) key 2. Insert k with payload p into hashed location 3. If occupied, insert into next free one 9

  10. HASH JOIN General approach – including payload(s) 1. Build a hash table 2. Probe the hash table Build table Hash table Probe table Hash table k 1 ,p 1 k 1 ,p 1 k 1 ,p 1 fk 1 Scan Scan k 2, p 2 fk 2 k 2 ,p 2 k 3 ,p 3 k 2 ,p 2 k 4 ,p 4 k 5 ,p 5 k 6 ,p 6 k 7 ,p 7 1. Compute h(k) – k is a primary (unique) key 1. Compute h(fk) – fk is foreign key 2. Insert k with payload p into hashed location 2. Compare fk and key k in the hash table 3. If occupied, insert into next free one 3. If there is a match store the result 10

  11. HASH JOIN General approach – including payload(s) 1. Build a hash table 2. Probe the hash table Build table Hash table Probe table Hash table k 1 ,p 1 k 1 ,p 1 k 1 ,p 1 fk 1 Scan Scan k 2, p 2 fk 2 k 2 ,p 2 k 3 ,p 3 k 2 ,p 2 k 4 ,p 4 k 5 ,p 5 k 6 ,p 6 k 7 ,p 7 1. Compute h(k) – k is a primary (unique) key 1. Compute h(fk) – fk is foreign key 2. Insert k with payload p into hashed location 2. Compare fk and key k in the hash table 3. If occupied, insert into next free one 3. If there is a match store the result Build and Probe both produce a random data access pattern! 11

  12. RANDOM ACCESS PERFORMANCE Intel Xeon Gold 6140 (Skylake) DDR4: 6 channels, 64-bit per channel, 2666MT/s, tFAW=30ns* Peak memory bandwidth 120GB/s Random 8B access 6GB/s NVIDIA Tesla V100 (Volta) Peak memory bandwidth 900GB/s Random 8B access 60GB/s 12 *DDR4 specs: https://www.micron.com/~/media/documents/products/data-sheet/dram/ddr4/4gb_ddr4_sdram.pdf

  13. HETEROGENEOUS SYSTEM Faster access 16GB 16GB HBM HBM … Multiple TBs DDR Larger capacity 13

  14. HETEROGENEOUS SYSTEM Data Placement & Movement Faster access Hash Table(s) Key Payload 16GB 16GB 23 5 HBM HBM 27 2 Build & Probe … Multiple TBs DDR DB Larger capacity 14

  15. INTERCONNECT SPEEDS PCIe3: 16GB/s • • NVLINK1: 20GB/s per link, up to 2 links between CPU/GPU = 40GB/s NVLINK2: 25GB/s per link, up to 3 links = 75GB/s • • Further increase throughput via compression: see S8417 tomorrow at 2pm* GPU joins (HT probing): PCIe/NVLINK1: GPU random access bandwidth >> CPU-GPU bandwidth • • NVLINK2 could be limited by GPU random access performance 15 *S8417 - Breaking the Speed of Interconnect with Compression for Database Applications – Tuesday, Mar 27, 2:00pm – Room 210F

  16. JOIN OPERATOR ACCELERATION 16

  17. IMPLEMENTING HASH JOIN concurrent_unordered_map Hash table in Unified Memory can be accessed on the host and template < typename Key , typename Element , device Key unused_key , Element unused_element , typename Hasher = default_hash < Key >, typename Equality = equal_to < Key >, typename Allocator = managed_allocator< thrust :: pair < Key , Element > > > class concurrent_unordered_map : public managed { Features: public : //omitting typedef* Construction on CPU explicit concurrent_unordered_map ( size_type n , • const Hasher & hf , const Equality & eql , • Works on CPU and GPU const allocator_type & a ); Concurrent insert’s • __host__ __device__ iterator begin (); • Concurrent find __host__ __device__ iterator end (); Concurrent insert and find* • __host__ __device__ iterator insert ( const value_type & x ); No concurrent CPU and GPU • __host__ __device__ const_iterator find ( const key_type & k ) const ; insert (currently) void prefetch ( const int dev_id ); }; *No visibility guarantees 17

  18. IMPLEMENTING HASH JOIN 1. Build a hash table with a concurrent_unordered_map template < typename map_type > __global__ void build_hash_tbl ( map_type * const map , const typename map_type :: key_type * const build_tbl , const typename map_type :: size_type build_tbl_size ) { const typename map_type :: mapped_type i = tIdx . x + bIdx . x * bDim . x ; if ( i < build_tbl_size ) { map -> insert ( thrust :: make_pair ( build_tbl [ i ], i ) ); } } 18

  19. IMPLEMENTING HASH JOIN Concurrent Inserts with atomicCAS iterator insert ( const value_type & x ) { size_type key_hash = hf ( x . first ); Key Key Value Value size_type ht_idx = key_hash % ht_size ; unused_key unused_key unused_element unused_element value_type * it = 0 ; while ( 0 == it ) { unused_key unused_key unused_element unused_element value_type * tmp_it = ht_values + ht_idx ; unused_key unused_key unused_element unused_element const key_type old_key = atomicCAS ( &( tmp_it -> first ), unused_key , x . first ); 7287984 7287984 9024 9024 if ( equal ( unused_key , old_key ) || unused_key x.first=75 x.second=875 unused_element equal ( x.first , old_key ) ) { ( ht_values + ht_idx )-> second = x . second ; 283408 283408 2309480 2309480 it = tmp_it ; } unused_key unused_key unused_element unused_element ht_idx = ( ht_idx + 1 )% ht_size ; unused_key unused_key unused_element unused_element } return iterator ( ht_values , ht_values + ht_size , it ); } 19

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend