Toward GPUs Being Mainstream in Analytic Processing


1. Toward GPUs being mainstream in analytic processing: An initial argument using simple scan-aggregate queries. Jason Power || Yinan Li || Mark D. Hill || Jignesh M. Patel || David A. Wood <powerjg@cs.wisc.edu>. DaMoN 2015.

2. Summary ▪ GPUs are energy efficient ▪ Discrete GPUs unpopular for DBMS ▪ New integrated GPUs solve the problems ▪ Scan-aggregate GPU implementation ▪ Wide bit-parallel scan ▪ Fine-grained aggregate GPU offload ▪ Up to 70% energy savings over multicore CPU ▪ Even more in the future

3. Analytic Data is Growing ▪ Data is growing rapidly ▪ Analytic DBs increasingly important (Source: IDC's Digital Universe Study, 2012) ▪ Want: High performance ▪ Need: Low energy

4. GPUs to the Rescue? ▪ GPUs are becoming more general ▪ Easier to program ▪ Integrated GPUs are everywhere ▪ GPUs show great promise [Govindaraju ’04, He ’14, He ’14, Kaldewey ‘12, Satish ’10, and many others] ▪ Higher performance than CPUs ▪ Better energy efficiency ▪ Analytic DBs look like GPU workloads

5. GPU Microarchitecture [Diagram: a GPU is built from many compute units (CUs) sharing an L2 cache; each compute unit contains an instruction fetch/scheduler, SIMD lanes (SPs), a register file, a scratchpad, and an L1 cache.]

6. Discrete GPUs [Diagram: a CPU chip (cores with their own memory bus) connected over the PCIe bus to a discrete GPU with its own memory bus.]

7. Discrete GPUs [Same diagram, with callout ➊ on the data transfer across the PCIe bus and callout ➋ on the discrete GPU's local memory.]

8. Discrete GPUs [Same diagram, with callout ➌ on the CPU-side call into the OS kernel and callout ➍ noting that the whole sequence repeats.]

9. Discrete GPUs ▪ Copy data over PCIe ➊ ▪ Low bandwidth ▪ High latency ▪ Small working memory ➋ ▪ High latency user → kernel calls ➌ ▪ Repeated many times ➍ ▪ 98% of time spent not computing

10. Integrated GPUs [Diagram: a single heterogeneous chip containing CPU cores and GPU compute units (CUs), sharing one memory bus.]

11. Heterogeneous System Architecture (HSA) ▪ API for tightly-integrated accelerators ▪ Industry support ▪ Initial hardware support today ▪ HSA Foundation (AMD, ARM, Qualcomm, others) ▪ No need for data copies ➊ ➋ ▪ Cache coherence and shared address space ➍ ▪ No OS kernel interaction ➌ ▪ User-mode queues

12. Outline ▪ Background ▪ Algorithms ▪ Scan ▪ Aggregate ▪ Results

13. Analytic DBs ▪ Resident in main memory ▪ Column-based layout ▪ WideTable & BitWeaving [Li and Patel ’13 & ’14] ▪ Convert queries to mostly scans by pre-joining tables ▪ Fast scan using sub-word parallelism ▪ Similar to industry proposals [SAP HANA, Oracle Exalytics, IBM DB2 BLU] ▪ Scan-aggregate queries

14. Running Example

Color dictionary:

    Color    Code
    Red      0
    Blue     1
    Green    2
    Yellow   3

Shirt table (Color column is dictionary-encoded):

    Shirt Color   Shirt Amount
    2 (Green)     1
    2 (Green)     3
    1 (Blue)      1
    2 (Green)     5
    3 (Yellow)    7
    0 (Red)       2
    3 (Yellow)    1
    1 (Blue)      4
    3 (Yellow)    2

15. Running Example ▪ Query: count the number of green shirts in the inventory ▪ ➊ Scan the Shirt Color column for green (code 2) ▪ ➋ Aggregate the Shirt Amount column where there is a match
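
A minimal C++ sketch of this running example (the data comes from the slides; the code structure is an illustrative assumption, not the authors' implementation): the Color column is stored as dictionary codes (Red = 0, Blue = 1, Green = 2, Yellow = 3), and the query becomes a scan for code 2 followed by an aggregate over the matching rows.

    // Running example: "how many green shirts are in the inventory?"
    #include <cstdint>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<uint8_t> color  = {2, 2, 1, 2, 3, 0, 3, 1, 3};  // 2-bit dictionary codes
        std::vector<int>     amount = {1, 3, 1, 5, 7, 2, 1, 4, 2};
        const uint8_t green = 2;

        long sum = 0;
        for (std::size_t i = 0; i < color.size(); ++i)  // (1) scan the color column
            if (color[i] == green)                      // (2) aggregate amount on a match
                sum += amount[i];

        std::cout << sum << "\n";                       // prints 9 (= 1 + 3 + 5)
        return 0;
    }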

16. Traditional Scan Algorithm [Diagram: the Shirt Color column data (codes 2, 2, 1, 2, 3, 0, 3, 1, 3, ...) is compared code by code against the query code 2 (10, Green), producing the result BitVector 11010000 0000...]
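
A hedged sketch (assumed structure, not the paper's code) of this traditional code-at-a-time scan: every code is compared against the query code and the outcomes are packed into a result BitVector, one bit per row, most-significant bit first as on the slide.

    #include <cstdint>
    #include <vector>

    // Compare each dictionary code with `key` and pack the results into 64-bit words.
    std::vector<uint64_t> scan_equal(const std::vector<uint8_t>& codes, uint8_t key) {
        std::vector<uint64_t> bitvector((codes.size() + 63) / 64, 0);
        for (std::size_t i = 0; i < codes.size(); ++i)
            if (codes[i] == key)
                bitvector[i / 64] |= 1ull << (63 - i % 64);   // row 0 maps to the MSB
        return bitvector;
    }
    // For the example column {2,2,1,2,3,0,3,1,3} and key 2 (Green), the leading
    // bits of the result are 110100000..., as shown on the slide.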

17. Vertical Layout [Diagram: the 2-bit color codes c0-c9 are stored bit-sliced across full machine words: one word (w0) holds the most-significant bit of a group of codes, the next word (w1) holds their least-significant bit, and later groups of codes continue in w2, w3, and so on.]
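
A sketch of how such a vertical (bit-sliced) layout can be built for 2-bit codes; the struct and function names here are illustrative assumptions, not the paper's code. One 64-bit word holds the high bit of 64 consecutive codes and another holds their low bit, which is what lets a scan compare many codes with a handful of bitwise operations.

    #include <cstdint>
    #include <vector>

    struct VerticalColumn {
        std::vector<uint64_t> hi;  // high (most-significant) bit of codes 64*w .. 64*w+63
        std::vector<uint64_t> lo;  // low (least-significant) bit of the same codes
    };

    VerticalColumn transpose(const std::vector<uint8_t>& codes) {  // 2-bit codes
        VerticalColumn v;
        v.hi.assign((codes.size() + 63) / 64, 0);
        v.lo.assign((codes.size() + 63) / 64, 0);
        for (std::size_t i = 0; i < codes.size(); ++i) {
            const uint64_t bit = 1ull << (63 - i % 64);
            if (codes[i] & 0x2) v.hi[i / 64] |= bit;
            if (codes[i] & 0x1) v.lo[i / 64] |= bit;
        }
        return v;
    }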

18. CPU BitWeaving Scan [Diagram: column data in bit-sliced words: 11011011 00101011 10000000 ...; compare code broadcast per bit: 11111111 00000000; result BitVector: 11010000 0000... CPU width: 64 bits, up to 256-bit SIMD.]
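
A minimal C++ sketch of the bit-parallel equality scan over that vertical layout (illustrative, building on the hypothetical transpose() above): each bit of the 2-bit query code is broadcast to an all-ones or all-zeros mask, so a few XOR/AND operations compare 64 codes per machine word; a 256-bit SIMD register would compare 256 codes per instruction.

    #include <cstdint>
    #include <vector>

    std::vector<uint64_t> scan_equal_vertical(const std::vector<uint64_t>& hi,
                                              const std::vector<uint64_t>& lo,
                                              uint8_t key) {
        const uint64_t key_hi = (key & 0x2) ? ~0ull : 0ull;    // broadcast bit 1 of the code
        const uint64_t key_lo = (key & 0x1) ? ~0ull : 0ull;    // broadcast bit 0 of the code
        std::vector<uint64_t> result(hi.size());
        for (std::size_t w = 0; w < hi.size(); ++w)
            result[w] = ~(hi[w] ^ key_hi) & ~(lo[w] ^ key_lo); // 64 equality tests at once
        return result;  // padding bits past the last row should be masked off by the caller
    }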

19. GPU BitWeaving Scan [Diagram: column data: 11011011 00101011 10000000 ...; compare code broadcast per bit: 11111111 11111111 11111111 ...; result BitVector: 11010000 0000... GPU width: 16,384-bit SIMD.]

20. GPU Scan Algorithm ▪ GPU uses very wide “words” ▪ CPU: 64 bits, or 256 bits with SIMD ▪ GPU: 16,384 bits (256 lanes × 64 bits) ▪ Memory and caches optimized for bandwidth ▪ HSA programming model ▪ No data copies ▪ Low CPU-GPU interaction overhead
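
To make the mapping concrete, here is an assumed per-word kernel (not the authors' HSA code): each GPU work-item evaluates one 64-bit word, so 256 lanes cover 16,384 column bits per step, and under HSA the kernel reads the column and writes the BitVector in place, with no copy to a separate GPU memory.

    #include <cstdint>
    #include <cstddef>

    // Body of the scan, factored per word. On the CPU it runs in a loop over w;
    // on the integrated GPU, w would be the work-item index of a kernel launch.
    void scan_word(const uint64_t* hi, const uint64_t* lo, uint64_t* result,
                   uint64_t key_hi, uint64_t key_lo, std::size_t w) {
        result[w] = ~(hi[w] ^ key_hi) & ~(lo[w] ^ key_lo);
    }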

21. CPU Aggregate Algorithm [Diagram: the result BitVector 11010000 0000... selects rows of the Shirt Amount column (1, 3, 1, 5, 7, 2, 1, 4, 2); the selected values are accumulated as 1 + 3, then 1 + 3 + 5 + ..., yielding the result.]
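
A sketch of this CPU-side aggregate (assumed code, using the running example's data): walk the BitVector produced by the scan and sum the Amount of every row whose bit is set.

    #include <cstdint>
    #include <vector>

    long sum_matches(const std::vector<uint64_t>& bitvector,
                     const std::vector<int>& amount) {
        long sum = 0;
        for (std::size_t i = 0; i < amount.size(); ++i)
            if ((bitvector[i / 64] >> (63 - i % 64)) & 1)   // row i matched the scan
                sum += amount[i];
        return sum;   // 1 + 3 + 5 = 9 for the running example
    }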

22. GPU Aggregate Algorithm [Diagram, phase 1 on the CPU: the result BitVector 11010000 0000... is converted into column offsets 0, 1, then 0, 1, 3, ...]

23. GPU Aggregate Algorithm [Diagram, phase 2 on the GPU: the column offsets 0, 1, 3, ... gather values from the Shirt Amount column (1, 3, 1, 5, 7, 2, 1, 4, 2), which are summed as 1 + 3 + 5 + ... to produce the result.]

24. Aggregate Algorithm ▪ Two phases (see the sketch below) ▪ Convert from BitVector to offsets (on CPU) ▪ Materialize data and compute (offload to GPU) ▪ Two group-by algorithms (see paper) ▪ HSA programming model ▪ Fine-grained sharing ▪ Can offload a subset of the computation
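
A hedged C++ sketch of the two phases (the structure follows slides 22-24; names and details are assumptions): phase 1 on the CPU turns the BitVector into a list of matching row offsets, and phase 2, the part offloaded to the integrated GPU, gathers those Amount values and reduces them. Because CPU and GPU share one address space under HSA, the offsets and the column are read in place.

    #include <cstdint>
    #include <vector>

    // Phase 1 (CPU): BitVector -> row offsets, e.g. 110100000... -> {0, 1, 3}.
    std::vector<uint32_t> to_offsets(const std::vector<uint64_t>& bitvector,
                                     std::size_t rows) {
        std::vector<uint32_t> offsets;
        for (std::size_t i = 0; i < rows; ++i)
            if ((bitvector[i / 64] >> (63 - i % 64)) & 1)
                offsets.push_back(static_cast<uint32_t>(i));
        return offsets;
    }

    // Phase 2 (offloaded): gather the matching amounts and reduce them. On the GPU,
    // each work-item would gather one offset and a parallel reduction forms the sum.
    long gather_sum(const std::vector<uint32_t>& offsets, const std::vector<int>& amount) {
        long sum = 0;
        for (uint32_t o : offsets) sum += amount[o];
        return sum;
    }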

25. Outline ▪ Background ▪ Algorithms ▪ Results

26. Experimental Methods ▪ AMD A10-7850 ▪ 4-core CPU ▪ 8-compute-unit GPU ▪ 16 GB capacity, 21 GB/s DDR3 memory ▪ Separate discrete GPU ▪ Watts-Up meter for full-system power ▪ TPC-H @ scale factor 10

27. Scan Performance & Energy

28. Scan Performance & Energy ▪ Takeaway: Integrated GPU most efficient for scans

29. TPC-H Queries: Query 12 Performance

30. TPC-H Queries: Query 12 Performance, Query 12 Energy ▪ Integrated GPU faster for both aggregate and scan computation

31. TPC-H Queries: Query 12 Performance, Query 12 Energy

32. TPC-H Queries: Query 12 Performance, Query 12 Energy ▪ More energy: the decrease in latency does not offset the power increase ▪ Less energy: a decrease in latency AND a decrease in power

33. Future Die-Stacked GPUs ▪ 3D die stacking ▪ Same physical & logical integration ▪ Increased compute ▪ Increased bandwidth [Diagram: DRAM stacked on the GPU, stacked on the CPU, mounted on the board.] Power et al. Implications of 3D GPUs on the Scan Primitive. SIGMOD Record, Volume 44, Issue 1, March 2015.

34. Conclusions

                        Discrete GPUs   Integrated GPUs   3D Stacked GPUs
    Performance         High ☺          Moderate          High ☺
    Memory Bandwidth    High ☺          Low ☹             High ☺
    Overhead            High ☹          Low ☺             Low ☺
    Memory Capacity     Low ☹           High ☺            Moderate

35. ?

36. HSA vs CUDA/OpenCL ▪ HSA defines a heterogeneous architecture ▪ Cache coherence ▪ Shared virtual addresses ▪ Architected queuing ▪ Intermediate language ▪ CUDA/OpenCL are a level above HSA ▪ Come with baggage ▪ Not as flexible ▪ May not be able to take advantage of all features

37. Scan Performance & Energy

38. Group-by Algorithms

39. All TPC-H Results

40. Average TPC-H Results: Average Performance, Average Energy

41. What’s Next? ▪ Developing a cost model for the GPU ▪ Using the GPU is just another algorithm to choose ▪ Evaluate exactly when the GPU is more efficient ▪ Future “database machines” ▪ GPUs are a good tradeoff between specialization and commodity

42. Conclusions ▪ Integrated GPUs viable for DBMS? ▪ Solve the problems of discrete GPUs ▪ (Somewhat) better performance and energy ▪ Looking toward the future... ▪ CPUs cannot keep up with bandwidth ▪ GPUs perfectly designed for these workloads
