Efficient Join Processing across Heterogeneous Processors

  1. Efficient Join Processing across Heterogeneous Processors
     Henning Funke, Sebastian Breß, Stefan Noll, Jens Teubner
     December 15, 2015

  2. GPUs IN DATABASES ARE LIKE A MUSCLE CAR IN A TRAFFIC JAM

  4. Bottleneck

  5. really?

  6. GPU – Hashjoin Algorithm
     Cuckoo Hashing Implementation [1]
     → Strict limit on the number of probes per query key
     ◮ Pipeline join probe and result compaction in shared memory
     [1] Based on: Alcantara, Dan Anthony Feliciano. Efficient hash tables on the GPU. University of California at Davis, 2011.

  7. GPU – Hashjoin Algorithm
     Performance: Build (GTX970)
     [Figure: build throughput (GB/s, axis 0 to 10) versus fill factor (0.5 to 0.9), with PCIe labeled in the plot]
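
The "strict limit on the number of probes per query key" follows from how cuckoo hashing places keys: every key can only live in one of a fixed set of candidate slots. The following is a minimal CPU-side sketch of that idea, not the authors' GPU implementation. The class `CuckooTable`, the function `hash_join_probe`, the two toy hash functions, and the eviction limit are all assumptions chosen to keep the example easy to trace.

```python
class CuckooTable:
    """Toy cuckoo hash table (illustrative only, not the GPU version)."""

    NUM_HASHES = 2       # assumption: two candidate slots per key
    MAX_EVICTIONS = 32   # assumption: after this many evictions a rebuild is needed

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity  # each slot holds (key, payload) or None

    def _slot(self, key, i):
        # Toy hash functions for non-negative integer keys (hypothetical choice).
        if i == 0:
            return key % self.capacity
        return (key // self.capacity) % self.capacity

    def insert(self, key, payload):
        # Overwrite the payload if the key is already stored.
        for i in range(self.NUM_HASHES):
            pos = self._slot(key, i)
            if self.slots[pos] is not None and self.slots[pos][0] == key:
                self.slots[pos] = (key, payload)
                return True
        entry = (key, payload)
        pos = self._slot(key, 0)
        for _ in range(self.MAX_EVICTIONS):
            # Place the entry and pick up whatever occupied the slot before.
            entry, self.slots[pos] = self.slots[pos], entry
            if entry is None:
                return True
            # Send the evicted entry to its other candidate slot.
            a, b = self._slot(entry[0], 0), self._slot(entry[0], 1)
            pos = b if pos == a else a
        return False  # the table would have to be rebuilt with new hash functions

    def lookup(self, key):
        # At most NUM_HASHES probes, whether or not the key is present.
        for i in range(self.NUM_HASHES):
            entry = self.slots[self._slot(key, i)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None


def hash_join_probe(table, probe_keys):
    """Return (key, payload) for every probe key found in the build table."""
    matches = []
    for key in probe_keys:
        payload = table.lookup(key)
        if payload is not None:
            matches.append((key, payload))
    return matches


build = CuckooTable(capacity=8)
for k, v in [(1, "a"), (9, "b"), (17, "c")]:
    build.insert(k, v)
print(hash_join_probe(build, [9, 25, 17, 1]))
```

Because a lookup touches at most two slots, the probe cost per key is bounded and independent of the fill factor; only the build phase gets more expensive as the table fills up.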

  8. Performance: GPU Join Probe (GTX970)
     [Figure: probe throughput (GB/s, axis 0 to 20) versus build size (KB, log scale from 10^0 to 10^5), with PCIe labeled in the plot]

  9. Performance: GPU Join Probe (GTX970)
     [Figure: probe throughput (GB/s, axis 0 to 20) versus build size (KB, log scale from 10^0 to 10^5), with PCIe labeled in the plot; annotations: "> 100 K elements" and "Real world data"]

  10. Joins on Multiple Heterogeneous Processors
      Challenges
      ◮ Scalability to large data
      ◮ Communication
      ◮ Local and remote resources
      Related work
      Figure: P. Frey, R. Goncalves, M. L. Kersten, and J. Teubner. Spinning Relations: High-Speed Networks for Distributed Join Processing. In DaMoN, 2009.

  11. Star Join on Heterogeneous Processors
      Processing Strategy
      → Allocate smaller tables on all devices
      → Asynchronous data transfers at computation speed
      → Merge results into continuous arrays
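
The three strategy steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the presented system: the function names `probe_chunk` and `star_join`, the round-robin chunk assignment (a sequential stand-in for asynchronous transfers), and the use of Python dictionaries as the replicated small tables are all hypothetical.

```python
from itertools import accumulate


def probe_chunk(chunk, dim_tables):
    """Probe one chunk of fact rows against every (small) dimension table."""
    out = []
    for row in chunk:
        payloads = []
        for foreign_key, table in zip(row, dim_tables):
            if foreign_key not in table:
                break  # no join partner in this dimension: drop the row
            payloads.append(table[foreign_key])
        else:
            out.append(tuple(payloads))
    return out


def star_join(fact_rows, dim_tables, num_devices=2, chunk_size=2):
    # Step 1: every device holds its own copy of the smaller tables.
    device_tables = [[dict(t) for t in dim_tables] for _ in range(num_devices)]

    # Step 2: stream the large table in chunks; round-robin assignment is a
    # sequential stand-in for asynchronous transfers to the devices.
    chunks = [fact_rows[i:i + chunk_size]
              for i in range(0, len(fact_rows), chunk_size)]
    partials = [probe_chunk(chunk, device_tables[i % num_devices])
                for i, chunk in enumerate(chunks)]

    # Step 3: an exclusive prefix sum over the partial result sizes gives each
    # chunk its write offset in one contiguous output array.
    sizes = [len(p) for p in partials]
    offsets = [0] + list(accumulate(sizes))[:-1]
    result = [None] * sum(sizes)
    for offset, partial in zip(offsets, partials):
        result[offset:offset + len(partial)] = partial
    return result


customers = {1: "alice", 2: "bob"}
products = {10: "pen", 20: "ink"}
facts = [(1, 10), (2, 20), (3, 10), (1, 20), (2, 30)]
print(star_join(facts, [customers, products]))
```

Since each chunk's offset is known before any result is copied, the partial results could be written into the output array independently of one another, which is what makes the merge step cheap.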

  12. Performance: Join Probe Across Heterogeneous Processors
      Intel Xeon E5-1607 v2 and NVIDIA GeForce GTX970
      [Figure: probe throughput (GB/s, axis 0 to 5) versus number of CPU worker threads (0 to 8), comparing "CPU alone" with "CPU + GPU"]

  13. Coprocessor Control Thread Scheduling
      Figure: Profiling coprocessor kernel invocations
      → Steer control flow from coprocessor itself
      → Increase block size

  14. Hardware Schema – Memory Bandwidth Utilization
      [Diagram: memory bandwidth per component. CPU, per core (4 cores): scan 15.8 GB/s, gather 4 GB/s, even share 7.8 GB/s. GPU: local 149 GB/s, prefix scan 84 GB/s, gather 33.2 GB/s. PCIe: 12 GB/s. MEM: 31 GB/s. MC.]

  16. Hardware Schema – Memory Bandwidth Utilization
      [Diagram: same bandwidth figures as on the preceding slides]
      ◮ PCI Express bus and main memory can become a bottleneck
      ◮ Take bandwidth footprint and throughput into account
      → Instead of bulk processing, apply dataflow perspective
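
A short calculation shows why main memory can become a bottleneck with the figures on these slides. The interpretation is an assumption: it reads "even share" as the main-memory bandwidth divided evenly among the four cores, and it treats the per-core scan rate as what each core could consume on its own. The variable names are made up for the example.

```python
# Figures taken from the hardware schema slides (GB/s).
MEM_BANDWIDTH = 31.0    # main memory
CPU_CORES = 4
SCAN_PER_CORE = 15.8    # per-core scan rate
PCIE_BANDWIDTH = 12.0

# Assumed reading: "even share" = main-memory bandwidth split across all cores.
even_share = MEM_BANDWIDTH / CPU_CORES

# If every core scanned at its full rate, the combined demand would be:
scan_demand = CPU_CORES * SCAN_PER_CORE

# Main memory is the limiting resource when demand exceeds what it can supply.
memory_is_bottleneck = scan_demand > MEM_BANDWIDTH

print(f"even share per core: {even_share:.2f} GB/s")
print(f"combined scan demand: {scan_demand:.1f} GB/s vs {MEM_BANDWIDTH:.0f} GB/s available")
print(f"main memory saturated by scans alone: {memory_is_bottleneck}")
```

Under this reading, the 7.8 GB/s on the slide matches 31 GB/s divided by four cores, and four cores scanning at full speed would ask for roughly twice what main memory delivers.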

  17. Improving Resource Utilization
      Cache awareness
      ◮ Materialize tuples in hash table → Useful payload in cacheline
      ◮ Order probe data by hash function
      GPU Optimizations
      ◮ Pipeline data between GPU kernels
      ◮ Concurrent kernel execution
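
The two cache-awareness points can be illustrated together. The sketch below is an assumption about what they might look like in code, not the authors' implementation: `bucket_of` uses a hypothetical multiplicative hash, the table stores full `(key, payload)` tuples in its buckets instead of row references, and probe keys are sorted by bucket index so that consecutive probes touch neighbouring buckets.

```python
def bucket_of(key, num_buckets):
    # Hypothetical multiplicative hash; any hash function would do here.
    return (key * 2654435761) % num_buckets


def build_table(rows, num_buckets):
    """Materialize (key, payload) tuples directly in the buckets."""
    buckets = [[] for _ in range(num_buckets)]
    for key, payload in rows:
        buckets[bucket_of(key, num_buckets)].append((key, payload))
    return buckets


def order_probes_by_hash(probe_keys, num_buckets):
    """Sort probe keys by target bucket so accesses walk the table in order."""
    return sorted(probe_keys, key=lambda k: bucket_of(k, num_buckets))


def probe(buckets, probe_keys):
    num_buckets = len(buckets)
    matches = []
    for key in order_probes_by_hash(probe_keys, num_buckets):
        # The payload sits next to the key, so a match needs no second lookup.
        for stored_key, payload in buckets[bucket_of(key, num_buckets)]:
            if stored_key == key:
                matches.append((key, payload))
    return matches


table = build_table([(3, "x"), (11, "y"), (42, "z")], num_buckets=16)
print(probe(table, [42, 7, 3, 11]))
```

Sorting the probe side costs extra work up front, so whether it pays off depends on how much larger the hash table is than the cache.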

  18. Key Insights
      ◮ PCIe is not the dominating bottleneck for GPU joins
      ◮ Dataflow oriented processing → Join arbitrary outer relation sizes
      ◮ Move part of probes to coprocessor → Save memory bandwidth → Gain throughput
      Future Work
      ◮ Join processing → query processing
      ◮ Compile pipelined operator sequences
      ◮ Stream arbitrary columns
      Figure: Operator pipelines. From Leis et al., Morsel-driven parallelism, SIGMOD 2014.

  19. Thank you!
