  1. GPU-Acceleration of In-Memory Data Analytics. Evangelia Sitaridi, AWS Redshift

  2. GPUs for Telcos
  • Fast query-time
    • Quickly identify network problems: no time to index data
  • Respond fast to customers
  • Geospatial visualization
    • Take advantage of GPU visualization capabilities
  (Figure: SMS Hub traffic. Picture taken from: http://www.vizualytics.com/Solutions/Telecom/Telecom.html)

  3. GPUs for Social Media Analytics
  • Search terms: debate
  • Match regexp: /\B#\w*[a-zA-Z]+\w*/
  • Filter location
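
A minimal, hypothetical C++ sketch of applying the slide's hashtag regexp with std::regex; the tweet text and variable names are made up for illustration, the talk itself shows no code here:

    #include <iostream>
    #include <regex>
    #include <string>

    int main() {
        // The hashtag pattern from the slide; ECMAScript grammar supports \B and \w.
        std::regex hashtag(R"(\B#\w*[a-zA-Z]+\w*)");
        std::string tweet = "Watching the #debate tonight #Election2016";
        // Print every hashtag found in the tweet.
        for (auto it = std::sregex_iterator(tweet.begin(), tweet.end(), hashtag);
             it != std::sregex_iterator(); ++it)
            std::cout << it->str() << '\n';
        return 0;
    }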

  4. Challenges for GPU Databases
  • Special threading model → increased programming complexity
    • Which algorithms are more efficient for GPUs?
    • How much do multiple code paths increase the cost of code maintenance?
  • Special memory architecture
    • How to adapt the data layout?
    • Limited memory capacity
  • Data transfer cost between CPUs and GPUs
    a) Through the PCIe link to the GPU
    b) From the storage system to the GPU
  • Fair comparison against software-based solutions

  6. Outline
  o CPU vs. GPU introduction
  o Accelerated wildcard string search
    o Insight: Change the layout of the strings in the GPU main memory
    o 3X speed-up & 2X energy savings against parallel state-of-the-art CPU libraries
  o Gompresso: Massively parallel decompression
    o Insight: Trade off compression ratio for increased parallelism
    o 2X speed-up & 1.2X energy savings against multi-core state-of-the-art CPU libraries
  o GPUs on the cloud

  7. CPU-GPU Analogies
                       CPU                                            GPU
  Goal                 Low latency (overlapping different             High throughput
                       instructions)
  Execution unit       CPU thread                                     GPU warp
  Threads              Tens of threads                                Thousands of threads
  Memory               RAM                                            Global memory
  Memory capacity      Hundreds of GBs                                A few tens of GBs

  8. GPU Architecture (slides 8-12 build up one diagram)
  • K40: 15 Stream Multiprocessors (SM1 ... SM15); each SM has warp schedulers, a register file, and shared memory, and all SMs access a common global memory
  • Warp: the unit of execution
  • Example CUDA kernel on the diagram: if (condition) a++; else b++;
  • The build-up illustrates branch divergence: the threads of a warp whose condition holds execute a++ while the rest are masked off, then the remaining threads execute b++, and the warp reconverges once the branch completes ("Branch complete")
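
To make the divergence in the diagram concrete, here is a minimal CUDA sketch; the kernel name and arguments are assumptions for illustration, not code from the talk:

    __global__ void divergent_kernel(const int *condition, int *a, int *b, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;

        if (condition[tid]) {
            a[tid]++;   // runs first, for the threads whose condition is true;
        } else {        // the other threads of the warp are masked off meanwhile
            b[tid]++;   // then the false-side threads run while the rest wait
        }
        // "Branch complete": the warp reconverges and runs in lockstep again
    }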

  13. Outline
  o CPU vs. GPU introduction
  o Accelerated wildcard string search
    o Insight: Change the layout of the strings in the GPU main memory
    o 3X speed-up & 2X energy savings against parallel state-of-the-art CPU libraries
  o Gompresso: Massively parallel decompression
    o Insight: Trade off compression ratio for increased parallelism
    o 2X speed-up & 1.2X energy savings against multi-core state-of-the-art CPU libraries
  o GPUs on the cloud

  14. Text Query Applications
  • Genomic data: input ACGTACCTGATCGTAGGATCCCAAGTACATCATTTC, search pattern ACC
  • Wildcard searches on database columns, e.g. search pattern "*3rd Ave*New York*" over an address column:
      Id   Address
      3    "9 Front St, Washington DC, 20001"
      8    "3338 A Lockport Pl #6, Margate City, NJ, 8402"
      9    "18 3rd Ave, New York, NY, 10016"
      15   "88 Sw 28th Ter, Harrison, NJ, 7029"
      16   "87895 Concord Rd, La Mesa, CA, 91142"
  • Q2, 9, 13, 14, 16, 20 of TPC-H contain expensive LIKE predicates

  15. Wildcard Search Challenges
  • Approaches that simplify search cannot be applied
    • String indexes, e.g. suffix trees
      • For the query '%customer%complaints' multiple queries need to be issued: '%customer%' AND '%complaints%', and the results must then be confirmed (the matches have to occur in order; see the sketch after this slide)
    • Dictionary compression
      • Wildcard searches are not simplified by dictionaries
      • String data needs to be decompressed
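
A minimal C++ sketch of that confirmation step (hypothetical helper, not the talk's code): a pattern of the form %seg1%seg2%...% holds only if every segment occurs after the end of the previous segment's match.

    #include <string>
    #include <vector>

    // Sketch (hypothetical helper): evaluate a LIKE pattern of the form
    // %seg1%seg2%...% by locating each segment after the previous match.
    bool like_multi_wildcard(const std::string &text,
                             const std::vector<std::string> &segments)
    {
        std::string::size_type pos = 0;
        for (const std::string &seg : segments) {
            pos = text.find(seg, pos);      // each segment must occur ...
            if (pos == std::string::npos)
                return false;
            pos += seg.size();              // ... after the end of the previous one
        }
        return true;
    }

    // Example: like_multi_wildcard(s_comment, {"Customer", "Complaints"})
    // answers LIKE '%Customer%Complaints%'.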

  16. Background: How to Search Text Fast? The Knuth-Morris-Pratt Algorithm
  • Step 6: input ACACATACCTACTTTACGTACGT, pattern ACACACG, i=5, j=5: character mismatch (input 'T' vs. pattern 'C')
  • Shift-pattern table for ACACACG: -1 0 0 1 2 3 4
  • Advance to the next input character:
    a) if the input matches the pattern
    b) while there is a mismatch, shift within the pattern to the left; stop when the beginning of the pattern has been reached

  17. Background: How to Search Text Fast? The Knuth-Morris-Pratt Algorithm
  • Step 6: i=5, j=5: character mismatch (input 'T' vs. pattern 'C')
  • Step 7: the pattern is shifted using the shift-pattern table and matching continues from i=5
  • Shift-pattern table for ACACACG: -1 0 0 1 2 3 4
  • Advance to the next input character: a) if the input matches the pattern, b) while there is a mismatch shift within the pattern; stop when the beginning of the pattern has been reached
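
For reference, a compact C++ sketch of the KMP algorithm the slides walk through, with illustrative names; the first seven shift-table entries built for "ACACACG" are -1 0 0 1 2 3 4, matching the table on the slide:

    #include <string>
    #include <vector>

    // Build the KMP shift table: shift[j] is where to resume in the pattern
    // after a mismatch at position j, with shift[0] = -1.
    std::vector<int> build_shift_table(const std::string &pat)
    {
        std::vector<int> shift(pat.size() + 1);
        shift[0] = -1;
        int k = -1;
        for (std::size_t j = 0; j < pat.size(); ++j) {
            while (k >= 0 && pat[k] != pat[j])
                k = shift[k];
            shift[j + 1] = ++k;
        }
        return shift;
    }

    // Return the index of the first occurrence of pat in text, or -1.
    int kmp_search(const std::string &text, const std::string &pat)
    {
        std::vector<int> shift = build_shift_table(pat);
        int n = (int)text.size(), m = (int)pat.size();
        int i = 0, j = 0;
        while (i < n) {
            if (text[i] == pat[j]) {        // (a) input matches: advance both
                ++i; ++j;
                if (j == m) return i - m;   // full pattern matched
            } else {                        // (b) mismatch: shift the pattern
                j = shift[j];
                if (j < 0) { ++i; j = 0; }  // reached the beginning of the pattern
            }
        }
        return -1;
    }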

  18. GPU Limiting Factor: Cache Pressure
  • Threads match different strings
  • Warp size 32 × 15 Stream Multiprocessors × 64 warps per SM → cache footprint of 30720 cache lines
  • Tesla K40 L2 capacity: 12288 cache lines, far smaller than the footprint
  • Smaller cache size per thread than on CPUs: need improved locality! (A quick check of the arithmetic follows.)
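
A back-of-the-envelope check of these numbers, assuming 128-byte cache lines, the K40's 1.5 MB L2, and one distinct string (hence one distinct cache line) per resident thread:

    #include <cstdio>

    int main() {
        const int sms = 15;                 // stream multiprocessors on a Tesla K40
        const int warps_per_sm = 64;        // maximum resident warps per SM (Kepler)
        const int threads_per_warp = 32;
        const int footprint_lines = sms * warps_per_sm * threads_per_warp;  // 30720
        const int l2_lines = 1536 * 1024 / 128;  // 1.5 MB L2 / 128-byte lines = 12288
        std::printf("cache footprint: %d lines, L2 capacity: %d lines\n",
                    footprint_lines, l2_lines);
        return 0;
    }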

  19. Adapt Memory Layout: Pivoting Strings (slides 19-20 build up one example)
  • Baseline (contiguous) layout: String 1, String 2, String 3 stored one after the other
    CTAACCGAGTAAAGAACGTAAACTCATTCGACTAAACCGAGTAAAGA…
  • Pivoted layout: split the strings into equally sized pieces and interleave the pieces in memory → improved locality
    CTAAACGTCTAA… CCGAAAACACCG… GTAATCATAGTA… AAGATCGAAAGA…
  • Initially each warp loads one cache line (128 bytes); threads T0, T1, T2, … work on neighbouring pieces
  • In the presence of partial matches some threads may fall "behind": memory divergence!
  • Only a partial solution: threads might progress at different rates (see the sketch below)
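
A small host-side C++ sketch of one way to build such a pivoted layout (the helper name and the exact addressing scheme are assumptions for illustration): piece k of string s is stored at offset (k * num_strings + s) * piece_size, so the threads of a warp that all work on piece k of neighbouring strings read adjacent memory.

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <vector>

    // Sketch (hypothetical helper): interleave a batch of strings in pieces of
    // `piece` bytes.  Piece k of string s lands at (k * num_strings + s) * piece.
    std::vector<char> pivot_strings(const std::vector<std::string> &strs,
                                    std::size_t piece)
    {
        std::size_t max_len = 0;
        for (const auto &s : strs)
            max_len = std::max(max_len, s.size());
        std::size_t pieces = (max_len + piece - 1) / piece;   // pieces per string
        std::vector<char> out(pieces * strs.size() * piece, '\0');

        for (std::size_t s = 0; s < strs.size(); ++s)
            for (std::size_t i = 0; i < strs[s].size(); ++i) {
                std::size_t k = i / piece;                    // which piece of string s
                out[(k * strs.size() + s) * piece + i % piece] = strs[s][i];
            }
        return out;
    }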

  21. Transform Control Flow of KMP
  • Shift-pattern table for ACACACG: -1 0 0 1 2 3 4
  • Step 6: i=5, j=5: character mismatch (input 'T' vs. pattern 'C')
  • While loop on a mismatch: the pattern keeps shifting while i stays at 5 (i=5, j=3: mismatch → shift; i=5, j=1: mismatch → shift; …)
  • Step 7: i=6, j=0 after the shifts
  • KMP hybrid: advance the input in steps of the pivoted piece size (a sketch of a uniform-advance loop follows)
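
One way to approximate the control-flow transformation this slide describes is to restructure KMP so that the outer loop consumes exactly one input character per iteration; the sketch below shows that restructuring. It is not necessarily the exact hybrid from the talk, which advances the input in pivoted-piece-sized steps, and its inner shift loop is still data-dependent.

    #include <string>
    #include <vector>

    // Sketch: KMP restructured so the outer loop consumes exactly one input
    // character per iteration.  `shift` is the table from build_shift_table above.
    int kmp_uniform_advance(const std::string &text, const std::string &pat,
                            const std::vector<int> &shift)
    {
        int m = (int)pat.size();
        int j = 0;                                   // pattern characters matched so far
        for (int i = 0; i < (int)text.size(); ++i) { // exactly one input char per step
            while (j >= 0 && text[i] != pat[j])
                j = shift[j];                        // mismatch: shift the pattern
            ++j;                                     // consume text[i]
            if (j == m)
                return i - m + 1;                    // full match ends at position i
        }
        return -1;
    }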

  22. GPU vs. CPU Comparison
  select s_suppkey from supplier where s_comment like '%Customer%Complaints%'
  • Performance metrics: price ($), performance (GB/s), performance per $, estimated energy consumption
  • Evaluate three systems: a CPU-only system, a GPU-only system, and a combined CPU+GPU system

  23. GPU vs. CPU Comparison
                           GPU      CPU (Boost BM)   CPU (CMPISTRI)   CPU+GPU
  Price ($)                3100     952              952              4052
  Performance (GB/s)       98.7     40.75            43.1             138.7
  Energy consumed (J)      1.27     2.49             2.35             1.78
  Performance/$            31.89    42.8             45.28            34.25
  (The best value in each row is circled on the slide.)
  • CPU: dual-socket E5-2620, memory bandwidth 102.4 GB/s; GPU: Tesla K40, memory bandwidth 288 GB/s
  • Design the system by choosing the desired trade-offs

  24. Outline
  o CPU vs. GPU introduction
  o Accelerating wildcard string search
    o Insight: Change the layout of the strings in the GPU main memory
    o 3X speed-up & 2X energy savings against parallel state-of-the-art CPU libraries
  o Gompresso: Massively parallel decompression
    o Insight: Trade off compression ratio for increased parallelism
    o 2X speed-up & 1.2X energy savings against multi-core state-of-the-art CPU libraries
  o GPUs on the cloud

  25. Example: Why Use Compression?
  A) Reduce basic S3 costs: data lakes and databases are stored compressed on Amazon S3 and queried by a cloud warehouse / query engine
  B) Reduce query costs in the database
  • Decompression speed is more important than compression speed

  26. Background: LZ77 Compression
  • Input characters (positions 0, 1, 2, 3, …): ATTACTAGAATGT TACTAATCTGAT CGGGCCGGGCCTG
  • Output: ATTACTAGAATGT (2,5) … (the repeated substring TACTA is replaced by a back-reference to position 2, length 5)
  • Back-references: (position, length) pairs that point into previously encoded text
  • Literals: unmatched characters emitted as-is
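
To make the decoding side concrete (the side Gompresso targets), here is a minimal sequential C++ sketch of LZ77-style decoding with illustrative type and function names; back-references copy from output that has already been produced, which is exactly the dependency that makes massively parallel decompression hard.

    #include <cstddef>
    #include <string>
    #include <vector>

    // One token of an LZ77-style stream: either a literal character or a
    // back-reference (position in the decoded output, copy length).
    struct Token {
        bool        is_literal;
        char        literal;    // used when is_literal is true
        std::size_t position;   // back-reference: where the match starts in the output
        std::size_t length;     //                 how many characters to copy
    };

    // Sequential LZ77 decoding: each back-reference copies from output that was
    // decoded earlier, so every token depends on everything decoded before it.
    std::string lz77_decode(const std::vector<Token> &tokens)
    {
        std::string out;
        for (const Token &t : tokens) {
            if (t.is_literal) {
                out.push_back(t.literal);
            } else {
                for (std::size_t i = 0; i < t.length; ++i)  // copy byte by byte: the
                    out.push_back(out[t.position + i]);     // match may overlap what
            }                                               // was just written
        }
        return out;
    }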
