

SLIDE 1

Efficient Join Processing across Heterogeneous Processors

Henning Funke, Sebastian Breß, Stefan Noll, Jens Teubner
December 15, 2015

SLIDE 2

GPUs IN DATABASES ARE LIKE A MUSCLE CAR IN A TRAFFIC JAM

SLIDE 3–5

Bottleneck

really?

SLIDE 6

GPU – Hash Join Algorithm

Cuckoo hashing implementation¹ → strict limit on the number of probes per query key

◮ Pipeline the join probe and result compaction in shared memory

¹Based on: Alcantara, Dan Anthony Feliciano. Efficient Hash Tables on the GPU. University of California at Davis, 2011.
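The probe bound is the key property of cuckoo hashing here: every key has exactly two candidate slots, so a lookup never degenerates into a long chain. A minimal CPU-side sketch of that idea follows; it is illustrative only, not the authors' CUDA implementation (which, following Alcantara 2011, additionally pipelines probing with result compaction in GPU shared memory), and the hash functions are deliberately simple.

```python
# CPU-side sketch of cuckoo hashing with a strict probe bound:
# a lookup touches at most two table slots.

class CuckooTable:
    MAX_EVICTIONS = 32  # bound on the insert displacement chain

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity  # holds (key, payload) pairs

    # Simple deterministic hash functions for illustration; a real
    # implementation would use stronger, independent hashes.
    def _h1(self, key):
        return key % self.capacity

    def _h2(self, key):
        return (key * 7 + 1) % self.capacity

    def insert(self, key, payload):
        entry = (key, payload)
        slot = self._h1(key)
        for _ in range(self.MAX_EVICTIONS):
            if self.slots[slot] is None:
                self.slots[slot] = entry
                return True
            # evict the occupant and move it to its alternate slot
            entry, self.slots[slot] = self.slots[slot], entry
            k = entry[0]
            slot = self._h2(k) if slot == self._h1(k) else self._h1(k)
        return False  # eviction chain too long; caller would rehash

    def probe(self, key):
        # Strict limit: at most two probes per query key.
        for slot in (self._h1(key), self._h2(key)):
            e = self.slots[slot]
            if e is not None and e[0] == key:
                return e[1]
        return None
```

The bounded eviction loop in `insert` is what buys the bounded `probe`: work is shifted from lookups (the hot path of a join) to build time.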

SLIDE 7

GPU – Hash Join Algorithm

Performance: Build (GTX970)

[Figure: build throughput (GB/s, y-axis 5–10) vs. hash-table fill factor (x-axis 0.5–0.9), with PCIe bandwidth marked for reference]

SLIDE 8

Performance: GPU Join Probe (GTX970)

[Figure: probe throughput (GB/s, y-axis 5–20) vs. build size (KB, x-axis 10⁰–10⁵, log scale), with PCIe bandwidth marked for reference]

SLIDE 9

Performance: GPU Join Probe (GTX970)

[Figure: same plot, annotated to show that real-world data typically has build sizes of more than 100K elements]

SLIDE 10

Joins on Multiple Heterogeneous Processors

Challenges

◮ Scalability to large data
◮ Communication
◮ Local and remote resources

Related work

Figure: P. Frey, R. Goncalves, M. L. Kersten, and J. Teubner. Spinning Relations: High-Speed Networks for Distributed Join Processing. In DaMoN, 2009.

SLIDE 11

Star Join on Heterogeneous Processors

Processing Strategy

→ Allocate the smaller tables on all devices
→ Asynchronous data transfers at computation speed
→ Merge results into contiguous arrays
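The three steps above might be sketched as follows, with Python threads standing in for the CPU workers and the GPU. Everything here (function names, chunking, the row layout where field j of a fact row is the key into dimension table j) is my illustration under those assumptions, not the authors' code.

```python
# Sketch of the star-join strategy: the small dimension tables are
# replicated on every "device", the large fact table is streamed in
# chunks to whichever worker is free, and per-chunk results are merged
# into one contiguous output array.
import queue
import threading

def star_join(fact_rows, dim_tables, n_workers=2, chunk_size=1024):
    chunks = queue.Queue()
    for i in range(0, len(fact_rows), chunk_size):
        chunks.put((i, fact_rows[i:i + chunk_size]))
    results = {}  # chunk offset -> matched rows

    def worker():
        while True:
            try:
                offset, chunk = chunks.get_nowait()
            except queue.Empty:
                return
            matched = []
            for row in chunk:
                # probe every (replicated) dimension table;
                # field j of the row is the key into dimension j
                payloads = [dim.get(row[j]) for j, dim in enumerate(dim_tables)]
                if all(p is not None for p in payloads):
                    matched.append((row, tuple(payloads)))
            results[offset] = matched

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # merge per-chunk results into one contiguous array, in fact order
    out = []
    for offset in sorted(results):
        out.extend(results[offset])
    return out
```

Merging by chunk offset keeps the output deterministic no matter which worker finished which chunk first, which is the point of the "merge into contiguous arrays" step.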

SLIDE 12

Performance: Join Probe Across Heterogeneous Processors

[Figure: probe throughput (GB/s, y-axis 1–5) vs. CPU worker threads (x-axis 2–8) on an Intel Xeon E5-1607 v2 with an NVIDIA GeForce GTX970; series: CPU alone vs. CPU + GPU]

SLIDE 13

Coprocessor Control Thread Scheduling

Figure: Profiling coprocessor kernel invocations

→ Steer control flow from the coprocessor itself
→ Increase block size
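A toy cost model shows why both remedies help: every host-driven kernel invocation pays a fixed launch overhead, so many small blocks spend a large share of their time launching rather than working. All numbers below are made up for illustration; they are not measurements from the talk.

```python
# Toy model: total cost of processing n_items when the host launches
# one kernel per block. Larger blocks (or a coprocessor-steered loop
# that pulls work itself) amortize the per-launch overhead.

def host_driven_cost(n_items, block_size, launch_overhead=10, work_per_item=1):
    """Total cost (arbitrary units) of processing n_items in blocks."""
    launches = -(-n_items // block_size)  # ceil division
    return launches * launch_overhead + n_items * work_per_item

small_blocks = host_driven_cost(100_000, block_size=100)     # 1000 launches
large_blocks = host_driven_cost(100_000, block_size=25_000)  # 4 launches
```

With these made-up constants, the useful work is identical in both cases; only the launch count, and hence the overhead term, changes.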

SLIDE 14

Hardware Schema – Memory Bandwidth Utilization

[Diagram: CPU and GPU, each behind a memory controller (MC) with its own memory (MEM), connected over the PCIe bus; annotated link bandwidths: 149 GB/s, 31 GB/s, 12 GB/s]

Measured rates:
◮ CPU, per core (4 cores): scan 15.8 GB/s, gather 4 GB/s, even share 7.8 GB/s
◮ GPU, local: prefix scan 84 GB/s, gather 33.2 GB/s


SLIDE 16

Hardware Schema – Memory Bandwidth Utilization

◮ The PCI Express bus and main memory can become a bottleneck
◮ Take bandwidth footprint and throughput into account

→ Instead of bulk processing, apply a dataflow perspective
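The dataflow perspective can be illustrated with generator pipelines: instead of materializing each bulk intermediate result in main memory, tuples flow through the operators in small chunks, so only one chunk is in flight between operators at a time. This is an illustrative Python analogy, not the system's actual execution model.

```python
# Generator-based operator pipeline: scan -> probe -> materialize.
# Only the final result is built in bulk; intermediates stay chunk-sized.

def scan(table, chunk_size=4):
    for i in range(0, len(table), chunk_size):
        yield table[i:i + chunk_size]

def probe(chunks, hash_table):
    for chunk in chunks:  # one chunk in flight at a time
        yield [(k, hash_table[k]) for k in chunk if k in hash_table]

def materialize(chunks):
    out = []
    for chunk in chunks:
        out.extend(chunk)  # only the final result is bulk-built
    return out

ht = {2: "b", 4: "d", 6: "f"}
result = materialize(probe(scan(range(8)), ht))
```

Keeping intermediates chunk-sized is exactly what shrinks the bandwidth footprint on main memory and PCIe that the bullets above warn about.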

SLIDE 17

Improving Resource Utilization

Cache awareness

◮ Materialize tuples in the hash table
  → Useful payload in the cacheline
◮ Order probe data by hash function

GPU Optimizations

◮ Pipeline data between GPU kernels
◮ Concurrent kernel execution
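The two cache-awareness ideas might be sketched as follows: the payload is stored next to the key inside the table (so one cacheline fetch delivers useful data instead of a row id that costs another random access), and probe keys are sorted by their hash bucket so that consecutive probes touch neighbouring buckets. Illustrative Python under those assumptions, not the real memory layout.

```python
# Cache-awareness sketch: (1) payload materialized inline with the key,
# (2) probe stream ordered by hash bucket for locality.
N_BUCKETS = 8

def bucket(key):
    return key % N_BUCKETS

def build(rows):
    # table[b] holds (key, payload) pairs: payload materialized inline
    table = [[] for _ in range(N_BUCKETS)]
    for key, payload in rows:
        table[bucket(key)].append((key, payload))
    return table

def probe_ordered(table, keys):
    out = []
    for key in sorted(keys, key=bucket):  # hash-ordered probe stream
        for k, payload in table[bucket(key)]:
            if k == key:
                out.append((key, payload))
    return out
```

Note that hash-ordering the probe input reorders the output as well; a real system either tolerates that or carries row ids along to restore order.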

SLIDE 18

Key Insights

◮ PCIe is not the dominating bottleneck for GPU joins
◮ Dataflow-oriented processing → join arbitrary outer relation sizes
◮ Move part of the probes to the coprocessor → save memory bandwidth, gain throughput

Future Work

◮ Join processing → query processing
◮ Compile pipelined operator sequences
◮ Stream arbitrary columns

Figure: Operator pipelines. From Leis et al., Morsel-Driven Parallelism, SIGMOD 2014.

SLIDE 19

Thank you!