How to run SQL queries on TBs of data using GPUs (Jake Wheat, Lead Architect, SQream Technologies)



Slide 1

@sqreamtech SIL7138

How to run SQL queries on TBs of data using GPUs

Jake Wheat Lead Architect, SQream Technologies

Slide 2

How to run SQL queries on TBs of data using GPUs

1. A toy SQL query engine 2. Support wide range of SQL queries 3. Support larger data 4. Optimise

Slide 3

Hypothetical set of physical operator primitives

Example SQL                                          | Physical operator                     | Implementation
select a+b, c * 5 from t                             | select (a.k.a. project/extend/rename) | thrust::transform
select a, count(*), sum(b), avg(b) from t group by a | stream aggregate                      | thrust::reduce_by_key
select a, b from t where a > 0.5                     | filter                                | thrust::remove_if
select distinct a from t                             | stream distinct                       | thrust::unique
select a, b, c, d from t order by a, b               | sort                                  | thrust::sort
select * from t union all select * from u            | union all                             |
select * from t inner join u using (a)               | sort merge join (smj)                 | simple implementation: thrust::upper_bound, thrust::lower_bound, unnest, gather

Slide 4

A more complete engine

  • non indexed nested loop join (ninlj)
  • not matching
  • outer join
  • stream union distinct
  • distinct aggregates
  • grouping sets
  • window functions
Slide 5

Running on bigger data

Slide 6

Queries on data which doesn't fit in the GPU memory

Chunk the data: a 1 TB table doesn't fit on the GPU, so stream it through device memory 4 MB at a time, producing a 4 MB result chunk for each input chunk and assembling the full 1 TB result on the host.

[Diagram: 1 TB input ("didn't fit") split into 4 MB chunks fed through the GPU; 4 MB result chunks collected into the 1 TB result]

Slide 7

Queries on data which doesn't fit in the GPU memory

Use an external sorting algorithm. Extend the idea to other operators too, for instance:

  • NINLJ
  • SMJ
  • window functions
  • distinct aggregates (select a, count(distinct b), count(b) from t group by a)
  • outer joins
Slide 8

Queries on data which doesn't fit in the GPU memory

Use external sorting algorithms and a variety of spools.

[Diagram: a 1 TB input sorted 4 MB at a time on the GPU, with the sorted runs written to a GPU/CPU spool and merged by an external sort]

Slide 9

Optimising: some non-GPU specific ideas

  • Use good benchmarking and profiling tools
  • Take advantage of columnar - don't read columns which aren't needed
  • Compressed data – helps with both disk I/O and PCI bus I/O
  • Make sure we use good I/O patterns: big reads and writes, work with the Linux FS cache effectively
  • Use large host memory buffers
  • Use AST optimisations (e.g. push predicates closer to the tablescans)
  • Use a cost based optimizer
Slide 10

Some GPU-specific ideas

  • Use CUDA-specific profiling tools in addition to your regular C++ ones (or at least have a way to enable cudaDeviceSynchronize calls all over your code)

  • GPU task/queue
  • Combine tasks on the GPU
  • Use large chunks
  • Rechunking
  • Reduce or avoid host ↔ device transfers
  • Optimise GPU code
Slide 11

GPU task/queue

1. Allocate all the GPU memory we will need
2. Upload the data to the GPU
3. Run some kernels on this data
4. Download the results to host
5. Release the allocated GPU memory

Slide 12

GPU task/queue

Benefits

  • Can help support concurrency in a single query (to hide PCI transfer latency)
  • Can help support concurrent queries on a single GPU with some basic fairness property (a long-running query won't starve out a short-running query)

  • GPU memory usage is very predictable
  • No inter-task co-ordination needed
  • No chance of deadlock on GPU resources
  • Can be simply extended to distribute tasks over multiple GPUs
Slide 13

Short GPU task concept, concurrency

[Diagram: several host workers submit to a shared GPU task queue, which is drained by per-GPU pools of GPU workers (GPU 1, GPU 2, ...)]

Slide 14

Combine tasks on the GPU where possible

select a, b+c as d from t where b > 5 order by a

Logical plan:
  TableScan a,b,c → Transform d:=b+c → RemoveIf b>5 → Sort by a → Sort Merge

Direct (one host round trip per operator):
  TableScan a,b,c → to device → Transform d:=b+c → to host → to device → RemoveIf b>5 → to host → to device → Sort by a → to host → Sort Merge

Combined (one round trip for the whole pipeline):
  TableScan a,b,c → to device → Transform d:=b+c → RemoveIf b>5 → Sort by a → to host → Sort Merge

Slide 15

Use larger chunks of data

1. transfer 10,000 rows to device
2. transform d := b + c
3. remove_if b > 5
4. sort by a
5. transfer results to host

versus:

1. transfer 10,000,000 rows to device
2. transform d := b + c
3. remove_if b > 5
4. sort by a
5. transfer results to host

Slide 16

Rechunking

Table scans: optimised disk I/O sizes are often much bigger than the GPU can handle.

Chunk size       | IO reads | GPU process
Small            | slow     | good
Big              | fast     | too big to fit
Big, then split  | fast     | good

Chunk size

Slide 17

Rechunking

Output of remove_if, join, reduce/reduce_by_key, etc.: if these produce output chunks with very few rows, collecting those rows back into big chunks can improve the performance of later operations (e.g. sort).

Slide 18

Rechunking

High selectivity: input → remove_if → sort: fast
Low selectivity:  input → remove_if → sort: slow
                  input → remove_if → rechunk → sort: fast

Slide 19

Rechunking

Reducing PCI transfer amounts in NINLJ:

for each chunk in table A:
    for each chunk in table B:
        load A, B to the GPU, join them

Table A size: 10,000,000,000 rows
Table B size: 10,000,000 rows

At a chunk size of 1 million rows you will upload 110 billion rows to the device. If you increase it to 20 million rows, you will upload only 15 billion rows to the device.

110B / 15B: potentially more than 7x faster

Slide 20

Rechunking

Reducing PCI transfer amounts in NINLJ

[Diagram: with small chunks the table splits into 9 chunks, looped and loaded 9 times; with bigger chunks it splits into 3 chunks, looped and loaded only 3 times]

Slide 21

Reduce / avoid HOST ↔ DEVICE transfers

  • NINLJ: keep the small table on the GPU for the whole join
  • Chunk skipping: data is inserted into the database ordered by a record timestamp, and the storage layer preserves this order (or inserts the data in sorted order). A typical use case is 24 months of data in the table, but queries that only want to summarize a particular day, week or month.

Slide 22

Optimising Kernels

Three example bottlenecks from SQream:

  • reduce_by_key
  • multikey sort
  • internals of the join
Slide 23

Strings are very difficult to deal with

Big-data specific issues:

  • compression
  • can be large per record

Non big-data specific GPU issues:

  • variable-length data
  • collations
  • Unicode: ICU doesn't run on GPUs

Good solutions: we're still looking

Slide 24

Find out more about SQream’s high performance GPU-driven database software

sqream.com