GPU-Acceleration of In-Memory Data Analytics
Evangelia Sitaridi, AWS Redshift
GPUs for Telcos
- Fast query-time
- Quickly identify network problems
- Respond fast to customers
- Geospatial visualization
- Take advantage of GPU visualization capabilities
- No time to index data
SMS Hub traffic
*Picture taken from: http://www.vizualytics.com/Solutions/Telecom/Telecom.html
GPUs for Social Media Analytics
Search term: “debate”
Match regexp: /\B#\w*[a-zA-Z]+\w*/
Filter by location
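The slide's hashtag regexp can be exercised directly. A minimal sketch in Python (the sample posts are made up for illustration; only the regexp itself comes from the slide):

```python
import re

# Hashtag pattern from the slide: a '#' not at a word boundary,
# followed by word characters containing at least one letter
# (so "#2016" alone does not match, but "#debate2016" does).
HASHTAG = re.compile(r"\B#\w*[a-zA-Z]+\w*")

posts = [
    "Watching the #debate2016 tonight!",
    "No tags here, just plain text.",
    "Two tags: #GPU and #analytics",
]
matches = [HASHTAG.findall(p) for p in posts]
print(matches)  # [['#debate2016'], [], ['#GPU', '#analytics']]
```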
Challenges for GPU Databases
- Special threading model → increased programming complexity
- Which algorithms are more efficient for GPUs?
- How much do multiple code paths increase the cost of code maintenance?
- Special memory architecture
- How to adapt the data layout?
- Limited memory capacity
- Data transfer cost between CPUs and GPUs
a) Through the PCIe link to the GPU b) From the storage system to the GPU
- Fair comparison against software-based solutions
Outline
- CPU vs GPU introduction
- Accelerated wildcard string search
- Insight: Change the layout of the strings in the GPU main memory
- 3X speed-up & 2X energy savings against parallel state-of-the-art CPU libraries
- Gompresso: Massively parallel decompression
- Insight: Trade-off compression ratio for increased parallelism
- 2X speed-ups & 1.2X energy savings against multi-core state-of-the-art CPU libraries
- GPUs on the cloud
CPU-GPU Analogies
CPU thread ↔ GPU warp
RAM ↔ GPU global memory
CPU: tens of threads; GPU: thousands of threads
CPU RAM: hundreds of GBs; GPU global memory: a few tens of GBs
CPU goal: low latency; GPU goal: high throughput (overlapping different instructions)
GPU Architecture

K40: 15 streaming multiprocessors (SMs)
Each SM: register file, shared memory, warp schedulers
All SMs share the global memory
A CUDA kernel is executed by many warps of GPU threads
Warp: unit of execution
On a branch (if(condition) a++; else b++;), threads of the same warp taking different paths are serialized until the branch completes
Outline
- CPU vs GPU introduction
- Accelerated wildcard string search
- Insight: Change the layout of the strings in the GPU main memory
- 3X speed-up & 2X energy savings against parallel state-of-the-art CPU libraries
- Gompresso: Massively parallel decompression
- Insight: Trade-off compression ratio for increased parallelism
- 2X speed-ups & 1.2X energy savings against multi-core state-of-the-art CPU libraries
- GPUs on the cloud
Text Query Applications
ACGTACCTGATCGTAGGATCCCAAGTACATCATTTC
Input
ACC
Search Pattern
GENOMIC DATA
Id  Address
3   “9 Front St, Washington DC, 20001”
8   “3338 A Lockport Pl #6, Margate City, NJ, 8402”
9   “18 3rd Ave, New York, NY, 10016”
15  “88 Sw 28th Ter, Harrison, NJ, 7029”
16  “87895 Concord Rd, La Mesa, CA, 91142”
DATABASE COLUMNS
Search Pattern
“*3rd Ave*New York*”
TPC-H queries Q2, Q9, Q13, Q14, Q16, and Q20 contain expensive LIKE predicates (wildcard searches)
Wildcard Search Challenges
- Approaches that simplify search cannot be applied
- String indexes, e.g. suffix trees
- For the query ‘%customer%complaints%’ multiple queries need to be issued
- ‘%customer%’ AND ‘%complaints%’
- Confirm results (the sub-patterns must appear in order)
- Dictionary compression
- Wildcard searches are not simplified by dictionaries
- String data need to be decompressed
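To make the multi-query idea concrete, here is a minimal sketch (the function name `like_match` is ours, not from the talk) that evaluates a `%p1%p2%`-style predicate by searching for each sub-pattern in order, which also performs the "confirm results" step; it assumes leading/trailing `%` and does not handle the `_` wildcard:

```python
def like_match(text, pattern):
    """Match a SQL-LIKE pattern of the form %p1%p2%...% by finding
    each sub-pattern after the end of the previous one (sketch)."""
    pos = 0
    for part in pattern.split("%"):
        if not part:          # empty pieces come from leading/trailing '%'
            continue
        idx = text.find(part, pos)
        if idx < 0:
            return False
        pos = idx + len(part)  # next sub-pattern must start after this one
    return True

print(like_match("the customer filed complaints", "%customer%complaints%"))  # True
print(like_match("complaints from a customer", "%customer%complaints%"))     # False
```

The second call is False because the sub-patterns appear out of order, which is exactly why results of the separate `%customer%` and `%complaints%` searches must be confirmed.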
Background: How to Search Text Fast?
Knuth-Morris-Pratt Algorithm
Input:   ACACATACCTACTTTACGTACGT
Pattern: ACACACG
Character mismatch
Step 6
i=5 j=5
Advance to the next character:
a) if the input character matches the pattern
b) on a mismatch, shift the pattern to the left; stop when the beginning of the pattern has been reached
Shift pattern table: -1 0 0 1 2 3 4
Background: How to Search Text Fast?
Knuth-Morris-Pratt Algorithm
Input:   ACACATACCTACTTTACGTACGT
Pattern: ACACACG
Character mismatch
ACACATACCTACTTTACGTACGT
Shift pattern
Step 6: i=5, j=5
Step 7: i=5, j=1
ACACACG
Shift pattern table: -1 0 0 1 2 3 4
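As a cross-check of the shift table on the slide (-1 0 0 1 2 3 4 for pattern ACACACG), here is a sequential KMP sketch in Python; the GPU version parallelizes across strings, so this is only the per-thread matching logic:

```python
def build_shift_table(p):
    """KMP shift table: table[j] is the pattern position to resume at
    after a mismatch at position j; -1 means advance the input."""
    pi = [0] * len(p)              # classic prefix function
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = pi[k - 1]
        if p[i] == p[k]:
            k += 1
        pi[i] = k
    return [-1] + pi               # length len(p) + 1

def kmp_search(text, p):
    table = build_shift_table(p)
    hits, j = [], 0
    for i, c in enumerate(text):
        while j >= 0 and c != p[j]:
            j = table[j]           # mismatch: shift the pattern
        j += 1
        if j == len(p):            # full match ending at position i
            hits.append(i - len(p) + 1)
            j = table[j]
    return hits

print(build_shift_table("ACACACG")[:7])  # [-1, 0, 0, 1, 2, 3, 4]
```

The first seven entries reproduce the slide's table; the slide's input ACACATACCTACTTTACGTACGT contains no full occurrence of ACACACG, so `kmp_search` returns an empty list for it.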
GPU Limiting factor: Cache Pressure
Warp size: 32
Streaming multiprocessors: 15
Warps per SM: 64
Cache footprint: 32 × 15 × 64 = 30720 cache lines (one line per thread)
L2 capacity: 12288 cache lines
Smaller cache size per thread than CPUs: Need improved locality!
Threads matching different strings
Tesla K40 architecture
Adapt Memory Layout: Pivoting Strings
Baseline (contiguous) layout
CTAACCGAGTAAAGAACGTAAACTCATTCGACTAAACCGAGTAAAGA…
Pivoted layout
CTAAACGTCTAA… CCGAAAACACCG… GTAATCATAGTA… AAGATCGAAAGA…
- Split strings into equally sized pieces
- Interleave the pieces in memory → improved locality
Partial solution: threads might progress at different rates
String 1 String 2 String 3
T0 T1 T2
Initially: Each warp loads a cache line (128 bytes)
String 1 String 2 String 3
Memory divergence!
T1 T2 T0
In the presence of partial matches some threads might fall “behind”
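A host-side sketch of the pivoting transformation (the piece size and sample strings here are illustrative, not the slide's data):

```python
def pivot_strings(strings, piece):
    """Interleave fixed-size pieces of equally sized strings so that
    warp threads, each scanning one string, read adjacent memory
    (sketch; strings are padded up to a multiple of the piece size)."""
    n = max(len(s) for s in strings)
    n += (-n) % piece                      # round up to a piece multiple
    padded = [s.ljust(n, "_") for s in strings]
    pieces = []
    for off in range(0, n, piece):         # for each piece index...
        for s in padded:                   # ...emit that piece of every string
            pieces.append(s[off:off + piece])
    return "".join(pieces)

print(pivot_strings(["AAAABBBB", "CCCCDDDD", "EEEEFFFF"], 4))
# AAAACCCCEEEEBBBBDDDDFFFF
```

With this layout, threads 0..2 reading piece 0 of their strings touch one contiguous region, which is what improves cache locality on the GPU.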
Adapt Memory Layout: Pivoting Strings
Partial solution: threads might progress at different rates
Knuth-Morris-Pratt Algorithm Input: Pattern:
ACACATACCTACTTTACGTACGT ACACACG
Character mismatch
ACACATACCTACTTTACGTACGT
Shift pattern
Step 6: i=5, j=5 → Step 7: i=6, j=0
ACACACG ACACATACCTACTTTACGTACGT
Mismatch → shift pattern
i=5 j=3
ACACACG
KMP hybrid: advance the input by the pivoted piece size
Transform Control Flow of KMP
Shift pattern table
ACACATACCTACTTTACGTACGT
i=5 j=1
ACACACG
While Loop
…
Mismatch → shift pattern
Shift pattern table: -1 0 0 1 2 3 4
GPU vs. CPU Comparison
select s_suppkey from supplier where s_comment like ’%Customer%Complaints%’
Performance metrics:
- Price ($)
- Performance (GB/s)
- Performance per $
- Estimated energy consumption
Evaluate three systems:
- CPU-only system
- GPU-only system
- CPU+GPU combined system
                     GPU     CPU (Boost BM)  CPU (CMPISTRI)  CPU+GPU
Price ($)            3100    952             952             4052
Performance (GB/s)   98.7    40.75           43.1            138.7
Energy consumed (J)  1.27    2.49            2.35            1.78
Performance/$        31.89   42.8            45.28           34.25
GPU vs. CPU Comparison
CPU: Dual-socket E5-2620 – Band. 102.4 GB/s GPU: Tesla K40 – Band. 288 GB/s
Design system by choosing the desired trade-offs
Outline
- CPU vs GPU introduction
- Accelerating wildcard string search
- Insight: Change the layout of the strings in the GPU main memory
- 3X speed-up & 2X energy savings against parallel state-of-the-art CPU libraries
- Gompresso: Massively parallel decompression
- Insight: Trade-off compression ratio for increased parallelism
- 2X speed-ups & 1.2X energy savings against multi-core state-of-the-art CPU libraries
- GPUs on the cloud
Example: Why Use Compression?
Databases, cloud warehouses, and data lakes (e.g. on Amazon S3):
A) Reduce basic S3 storage costs
B) Reduce query costs
Decompression speed is more important than compression speed
Background: LZ77 Compression
Input characters (positions 0, 1, 2, 3, …):
ATTACTAGAATGT TACTAATCTGAT CGGGCCGGGCCTG
Sliding window buffer Unencoded lookahead characters
Find longest match Output
ATTACTAGAATGT(2,5)…
Back-references: (position, length); literals: unmatched characters
Background: LZ77 Decompression
Input data block (tokens): (0,4)M(5,4)COMM…
Window buffer contents: WIKIPEDIA.CO
Output data block: WIKIMEDIACOMM
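Sequential replay of the slide's decompression example (positions are taken to index from the start of the window buffer, which matches the slide's tokens; sketch):

```python
def lz77_decompress(tokens, window=""):
    """Replay LZ77 tokens: literals are appended; a (pos, length)
    back-reference copies from the window/output built so far."""
    out = list(window)
    for tok in tokens:
        if isinstance(tok, tuple):
            pos, length = tok
            for k in range(length):
                out.append(out[pos + k])   # may overlap its own output
        else:
            out.append(tok)
    return "".join(out[len(window):])

tokens = [(0, 4), "M", (5, 4), "C", "O", "M", "M"]
print(lz77_decompress(tokens, window="WIKIPEDIA.CO"))  # WIKIMEDIACOMM
```

(0,4) copies WIKI and (5,4) copies EDIA out of the window, so the output is WIKI + M + EDIA + COMM.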
How to Parallelize Decompression?
Thread 1 Thread 2 Data block 1 Data block 2
…
Naïve approach performance: 200 MB/s, far below the 250 GB/s memory bandwidth of a K20X
Input file
Split input file in independent blocks
>1000 threads available!
…
GPU LZ77 Decompression

Compressed input (tokens): CCGA(0,2)CGG(4,3)AGTT(12,4)
Improve utilization: group each run of literals with the following back-reference
1) Read tokens (parallel)
2) Write literals (parallel, offsets via a prefix sum)
3.1) Compute the uncompressed output
3.2) Write the uncompressed output: CCGACCCGGCCCAGTTCCGA
Problem: back-references processed in parallel might be dependent → use the __ballot voting function to detect dependencies
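The literal-write step works because each token is a run of literals plus one back-reference of known length, so every thread can compute its output offset with an exclusive prefix sum over the token output sizes. A sketch with the slide's tokens (the exact token framing is our assumption):

```python
from itertools import accumulate

# Tokens from the slide: literals grouped with the following
# back-reference, e.g. CCGA(0,2)  CGG(4,3)  AGTT(12,4).
tokens = [("CCGA", (0, 2)), ("CGG", (4, 3)), ("AGTT", (12, 4))]

# Output size of each token = number of literals + back-reference length.
sizes = [len(lit) + length for lit, (_pos, length) in tokens]

# Exclusive prefix sum gives each token's write offset in the output.
offsets = [0] + list(accumulate(sizes))[:-1]
print(offsets)  # [0, 6, 12]
```

On the GPU the prefix sum itself is computed in parallel; once the offsets are known, all literal writes are independent.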
How to Handle Thread Dependencies?
1) Write literals (parallel)
2) While not all back-references are written:
a) check which dependencies are satisfied (parallel)
b) copy the back-references without pending dependencies
Second loop: dependencies satisfied
Uncompressed input: …CCGACGTTCCGT…
Tokens: CCGA(0,3)T(4,4)
A) Compression: only search for matches without dependencies
Tokens: CCGA(0,3)T(0,3)T
Bandwidth: 7 GB/s
B) Decompression: copy back-references (fully parallel)
Bandwidth: 16 GB/s
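The round-based loop can be mimicked sequentially; this sketch replaces the warp-wide __ballot vote with an explicit "all source bytes written" check (the data layout and function name are ours, for illustration only):

```python
def resolve_backrefs(literals, backrefs, out_len):
    """Round-based back-reference resolution (sketch): each round
    copies every back-reference whose source region is fully
    written, repeating until none are pending."""
    out = [None] * out_len
    for pos, ch in literals.items():       # phase 1: literals (parallel)
        out[pos] = ch
    pending = list(backrefs)               # (dst, src, length) triples
    while pending:                         # phase 2: dependency rounds
        ready = [b for b in pending
                 if all(out[b[1] + k] is not None for k in range(b[2]))]
        if not ready:
            raise ValueError("unresolvable dependency")
        for dst, src, length in ready:     # copies in a round are independent
            for k in range(length):
                out[dst + k] = out[src + k]
        pending = [b for b in pending if b not in ready]
    return "".join(out)

# (4,2,2) depends on (2,0,2), so this takes two rounds.
print(resolve_backrefs({0: "A", 1: "B"}, [(2, 0, 2), (4, 2, 2)], 6))  # ABABAB
```

On the GPU, the per-round readiness check is exactly what the __ballot vote computes across a warp in one instruction.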
Decompression Skyline
[Plot: decompression speed vs. compression ratio, byte-level vs. bit-level encoding]
CPU: Dual-socket E5-2620 – Band. 102.4 GB/s GPU: Tesla K40 – Band. 288 GB/s
GPUs on the Cloud
- Cloud offerings
- AWS
- Google Cloud
- Microsoft Azure
- IBM Softlayer
- Nimbix
- Opportunity
- Evaluate the usefulness of GPUs/FPGAs without the high investment
- Special considerations
- Charging model
- Scaling capabilities
- Software licensing
Summary
- Accelerated wildcard string search
- Insight: Change the layout of the strings in the GPU main memory
- 3X speed-up & 2X energy savings against parallel state-of-the-art CPU libraries
- Gompresso: Massively parallel decompression
- Insight: Trade-off compression ratio for increased parallelism
- 2X speed-ups & 1.2X energy savings against multi-core state-of-the-art CPU libraries
- GPUs on the cloud: Open questions