GPU-Acceleration of In-Memory Data Analytics - Evangelia Sitaridi - PowerPoint PPT Presentation



SLIDE 1

GPU-Acceleration of In-Memory Data Analytics

Evangelia Sitaridi, AWS Redshift

SLIDE 2

GPUs for Telcos

  • Fast query time
  • Quickly identify network problems
  • No time to index data
  • Respond fast to customers
  • Geospatial visualization
  • Take advantage of GPU visualization capabilities

[Figure: SMS Hub traffic visualization]

*Picture taken from: http://www.vizualytics.com/Solutions/Telecom/Telecom.html

SLIDE 3

GPUs for Social Media Analytics

Search term: “debate”
Match regexp: /\B#\w*[a-zA-Z]+\w*/ (hashtags)
Filter by location
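The slide's hashtag regular expression can be exercised directly. Below is a small Python sketch; the tweet strings are invented for illustration:

```python
import re

# Hashtag pattern from the slide: a '#' that does not sit at a word
# boundary (so "abc#tag" is skipped), followed by word characters that
# include at least one letter (so "#123" is skipped).
HASHTAG = re.compile(r"\B#\w*[a-zA-Z]+\w*")

print(HASHTAG.findall("Watching the #debate tonight"))  # ['#debate']
print(HASHTAG.findall("Numbers only: #123"))            # []
```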

SLIDE 4

Challenges for GPU Databases

  • Special threading model → increased programming complexity
    • Which algorithms are more efficient for GPUs?
    • How much do multiple code paths increase the cost of code maintenance?
  • Special memory architecture
    • How to adapt the data layout?
    • Limited memory capacity
  • Data transfer cost between CPUs and GPUs
    a) Through the PCIe link to the GPU
    b) From the storage system to the GPU
  • Fair comparison against software-based solutions

SLIDE 6

Outline

  • CPU vs GPU introduction
  • Accelerated wildcard string search
    • Insight: change the layout of the strings in GPU main memory
    • 3X speed-up & 2X energy savings against parallel state-of-the-art CPU libraries
  • Gompresso: massively parallel decompression
    • Insight: trade off compression ratio for increased parallelism
    • 2X speed-up & 1.2X energy savings against multi-core state-of-the-art CPU libraries
  • GPUs on the cloud

SLIDE 7

CPU-GPU Analogies

CPU                       | GPU
Thread                    | Warp
RAM                       | Global memory
Tens of threads           | Thousands of threads
Hundreds of GBs capacity  | A few tens of GBs
Goal: low latency         | Goal: high throughput (overlapping different instructions)

SLIDE 8

GPU Architecture

[Figure: a CUDA kernel launches thousands of GPU threads, grouped for execution across the streaming multiprocessors; branch example: if(condition) a++; else b++;]

K40: 15 streaming multiprocessors

SLIDE 9

GPU Architecture

[Figure: each streaming multiprocessor (SM1 … SM15) has its own shared memory, register file, and warp schedulers; all SMs share global memory. On a branch, a warp executes both paths one after the other until the branch completes. Branch example: if(condition) a++; else b++;]

Warp: the unit of execution

K40: 15 streaming multiprocessors


SLIDE 13

Outline

  • CPU vs GPU introduction
  • Accelerated wildcard string search
    • Insight: change the layout of the strings in GPU main memory
    • 3X speed-up & 2X energy savings against parallel state-of-the-art CPU libraries
  • Gompresso: massively parallel decompression
    • Insight: trade off compression ratio for increased parallelism
    • 2X speed-up & 1.2X energy savings against multi-core state-of-the-art CPU libraries
  • GPUs on the cloud

SLIDE 14

Text Query Applications

GENOMIC DATA
Input:          ACGTACCTGATCGTAGGATCCCAAGTACATCATTTC
Search pattern: ACC

DATABASE COLUMNS
Id  Address
3   “9 Front St, Washington DC, 20001”
8   “3338 A Lockport Pl #6, Margate City, NJ, 8402”
9   “18 3rd Ave, New York, NY, 10016”
15  “88 Sw 28th Ter, Harrison, NJ, 7029”
16  “87895 Concord Rd, La Mesa, CA, 91142”
Search pattern: “*3rdAve*New York*”

Queries 2, 9, 13, 14, 16, and 20 of TPC-H contain expensive LIKE predicates (wildcard searches).

SLIDE 15

Wildcard Search Challenges

  • Approaches that simplify search cannot be applied
    • String indexes, e.g. suffix trees
      • For the query ‘%customer%complaints’ multiple queries need to be issued (’%customer%’ AND ‘%complaints%’) and the results must then be confirmed
    • Dictionary compression
      • Wildcard searches are not simplified by dictionaries
      • String data needs to be decompressed

SLIDE 16

Background: How to Search Text Fast?

Knuth-Morris-Pratt algorithm
Input:   ACACATACCTACTTTACGTACGT
Pattern: ACACACG

Step 6: i=5, j=5 (character mismatch)

Advance to the next character:
a) if the input matches the pattern;
b) while there is a mismatch, shift the pattern to the left, stopping when the beginning of the pattern has been reached.

Shift pattern table: 0 0 1 2 3 4 0
SLIDE 17

Background: How to Search Text Fast?

Knuth-Morris-Pratt algorithm
Input:   ACACATACCTACTTTACGTACGT
Pattern: ACACACG

Step 6: i=5, j=5 (character mismatch)
Step 7: i=5, j=1 (shift pattern)

Advance to the next character:
a) if the input matches the pattern;
b) while there is a mismatch, shift the pattern to the left, stopping when the beginning of the pattern has been reached.

Shift pattern table: 0 0 1 2 3 4 0
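As a reference for the shift ("failure") table and the stepping rule on this slide, here is a minimal sequential KMP in Python (a CPU-side sketch, not the GPU kernel from the talk):

```python
def kmp_table(pattern: str) -> list[int]:
    """Failure function: table[j] is the length of the longest proper
    prefix of pattern[:j+1] that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = table[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        table[j] = k
    return table

def kmp_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    table = kmp_table(pattern)
    j = 0
    for i, c in enumerate(text):
        while j > 0 and c != pattern[j]:
            j = table[j - 1]  # shift the pattern; never move i backwards
        if c == pattern[j]:
            j += 1
        if j == len(pattern):
            return i - j + 1
    return -1

print(kmp_table("ACACACG"))  # [0, 0, 1, 2, 3, 4, 0]
```

The printed table is exactly the shift pattern table shown on the slide for the pattern ACACACG.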
SLIDE 18

GPU Limiting factor: Cache Pressure

Tesla K40 architecture:
  • Warp size: 32
  • Streaming multiprocessors: 15
  • Warps in each SM: 64
  • Cache footprint with threads matching different strings: 32 × 15 × 64 = 30720 cache lines
  • L2 capacity: 12288 cache lines

Smaller cache size per thread than on CPUs: need improved locality!

SLIDE 19

Adapt Memory Layout: Pivoting Strings

Baseline (contiguous) layout:
CTAACCGAGTAAAGAACGTAAACTCATTCGACTAAACCGAGTAAAGA…

Pivoted layout:
CTAAACGTCTAA…CCGAAAACACCG…GTAATCATAGTA…AAGATCGAAAGA…

  – Split the strings into equally sized pieces
  – Interleave the pieces in memory → improved locality

Initially, each warp loads one cache line (128 bytes); threads T0, T1, T2 scan String 1, String 2, String 3.

Partial solution: threads might progress at different rates.
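The pivoting transformation can be sketched sequentially. This Python toy uses made-up 12-character strings and a piece size of 4 (instead of the 128-byte cache line), and shows both the interleaving and how one thread's string is recovered from the pivoted buffer:

```python
def pivot(strings, piece):
    """Interleave fixed-size pieces of equally long strings, so that one
    contiguous load serves all threads of a warp (CPU-side sketch)."""
    n = len(strings[0])
    assert all(len(s) == n for s in strings) and n % piece == 0
    out = []
    for off in range(0, n, piece):       # one "row" of pieces per offset
        for s in strings:
            out.append(s[off:off + piece])
    return "".join(out)

def read_pivoted(buf, t, num_strings, piece):
    """Thread t reads its string as pieces at stride num_strings * piece."""
    stride = num_strings * piece
    start = t * piece
    return "".join(buf[p:p + piece] for p in range(start, len(buf), stride))

layout = pivot(["CTAACCGAGTAA", "ACGTAAACTCAT", "TCGACTAAACCG"], 4)
print(layout)                       # CTAAACGTTCGACCGAAAACCTAAGTAATCATACCG
print(read_pivoted(layout, 0, 3, 4))  # CTAACCGAGTAA
```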

SLIDE 20

Adapt Memory Layout: Pivoting Strings

Baseline (contiguous) layout:
CTAACCGAGTAAAGAACGTAAACTCATTCGACTAAACCGAGTAAAGA…

Pivoted layout:
CTAAACGTCTAA…CCGAAAACACCG…GTAATCATAGTA…AAGATCGAAAGA…

Memory divergence! In the presence of partial matches some threads might fall “behind” on their strings (String 1, String 2, String 3 for T0, T1, T2), so pivoting is only a partial solution: threads might progress at different rates.

SLIDE 21

Transform Control Flow of KMP

KMP hybrid: advance the input in steps of the pivoted piece size.

Input:   ACACATACCTACTTTACGTACGT
Pattern: ACACACG

Step 6: i=5, j=5 (character mismatch)
While loop (mismatch → shift pattern): i=5, j=3; then i=5, j=1
Step 7: i=6, j=0 (advance input)

Shift pattern table: 0 0 1 2 3 4 0
SLIDE 22

GPU vs. CPU Comparison

select s_suppkey from supplier where s_comment like ’%Customer%Complaints%’

Performance metrics:
  – Price ($)
  – Performance (GB/s)
  – Performance per $
  – Estimated energy consumption

Three systems evaluated:
  – CPU-only system
  – GPU-only system
  – CPU+GPU combined system
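The LIKE predicate in the query above reduces to a wildcard pattern match. Here is a hedged Python sketch (the function name like_to_regex is my own, not from the talk) of how '%' and '_' wildcards map onto a regular expression:

```python
import re

def like_to_regex(pattern: str):
    """Translate a SQL LIKE pattern ('%' = any run of characters,
    '_' = any single character) into an anchored regular expression.
    Simplified sketch: no ESCAPE clause handling."""
    out = []
    for piece in re.split(r"([%_])", pattern):
        if piece == "%":
            out.append(".*")
        elif piece == "_":
            out.append(".")
        else:
            out.append(re.escape(piece))  # literal text, escaped
    return re.compile("^" + "".join(out) + "$", re.DOTALL)

rx = like_to_regex("%Customer%Complaints%")
print(bool(rx.match("Angry Customer filed Complaints twice")))  # True
```

Note how '%Customer%Complaints%' still forces a full scan of every comment string, which is why these predicates are expensive and worth accelerating.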

SLIDE 23

GPU vs. CPU Comparison

                     GPU     CPU (Boost BM)  CPU (CMPISTRI)  CPU+GPU
Price ($)            3100    952             952             4052
Performance (GB/s)   98.7    40.75           43.1            138.7
Energy consumed (J)  1.27    2.49            2.35            1.78
Performance/$        31.89   42.8            45.28           34.25

CPU: dual-socket E5-2620 (bandwidth 102.4 GB/s)
GPU: Tesla K40 (bandwidth 288 GB/s)

Design the system by choosing the desired trade-offs.

SLIDE 24

Outline

  • CPU vs GPU introduction
  • Accelerated wildcard string search
    • Insight: change the layout of the strings in GPU main memory
    • 3X speed-up & 2X energy savings against parallel state-of-the-art CPU libraries
  • Gompresso: massively parallel decompression
    • Insight: trade off compression ratio for increased parallelism
    • 2X speed-up & 1.2X energy savings against multi-core state-of-the-art CPU libraries
  • GPUs on the cloud

SLIDE 25

Example: Why Use Compression?

[Figure: a cloud-warehouse query engine and data lakes both operate on compressed data stored in Amazon S3]

A) Reduce basic S3 storage costs
B) Reduce query costs

Decompression speed is more important than compression speed.

SLIDE 26

Background: LZ77 Compression

Input characters:
ATTACTAGAATGT TACTAATCTGAT CGGGCCGGGCCTG

Output:
ATTACTAGAATGT(2,5)…

Back-references: (position, length)
Literals: unmatched characters

SLIDE 27

Background: LZ77 Compression

Input characters:
ATTACTAGAATGT TACTAATCTGAT CGGGCCGGGCCTG
(sliding window buffer | unencoded lookahead characters)

Find the longest match in the window.

Output:
ATTACTAGAATGT(2,5)…

Back-references: (position, length)
Literals: unmatched characters

SLIDE 28

Background: LZ77 Decompression

Window buffer contents: W I K I P E D I A . C O
Input data block (tokens): (0,4)M(5,4)COMM…
Output data block: WIKIMEDIACOMM
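The decoding rule can be captured in a few lines of Python. This sketch treats a back-reference as an (absolute position, length) pair into the window-plus-output buffer, which reproduces the slide's example:

```python
def lz77_decompress(tokens, window=""):
    """Sequential LZ77 decoder. A token is either a literal string or a
    (position, length) back-reference into window + output so far."""
    out = list(window)
    for tok in tokens:
        if isinstance(tok, tuple):
            pos, length = tok
            for i in range(length):      # byte-by-byte: copies may overlap
                out.append(out[pos + i])
        else:
            out.extend(tok)
    return "".join(out[len(window):])

# Slide example: window "WIKIPEDIA.CO", tokens (0,4) M (5,4) COMM
print(lz77_decompress([(0, 4), "M", (5, 4), "COMM"], window="WIKIPEDIA.CO"))
# WIKIMEDIACOMM
```

The byte-by-byte copy matters: a back-reference may overlap the bytes it is itself producing, which is exactly the dependency problem the parallel GPU decoder has to handle.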

SLIDE 29

How to Parallelize Decompression?

Split the input file into independent blocks: thread 1 decompresses data block 1, thread 2 decompresses data block 2, and so on. More than 1000 threads are available!

Naïve approach performance: 200 MB/s, far below the 250 GB/s memory bandwidth of a K20x.

SLIDE 30

GPU LZ77 Decompression

Compressed input (tokens): CCGA(0,2)CGG(4,3)AGTT(12,4)

Improve utilization: group each string of literals with the following back-reference.

SLIDE 31

GPU LZ77 Decompression

Compressed input (tokens): CCGA(0,2)CGG(4,3)AGTT(12,4)

1) Read tokens (parallel)

SLIDE 32

GPU LZ77 Decompression

Compressed input (tokens): CCGA(0,2)CGG(4,3)AGTT(12,4)
Literals written to the output: CCGA … CGG … AGTT …

2) Write literals (parallel prefix sum)
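Step 2's prefix sum can be illustrated sequentially: an exclusive prefix sum over each token's output size yields the byte offset at which every token's data lands, which is what lets all literals be written independently. A Python sketch (the token list mirrors the slide's example; the function name is my own):

```python
from itertools import accumulate

def output_offsets(tokens):
    """Exclusive prefix sum over per-token output sizes: tokens[i]
    writes its bytes starting at offsets[i], independently of the
    other tokens."""
    sizes = [len(t) if isinstance(t, str) else t[1] for t in tokens]
    return list(accumulate([0] + sizes[:-1]))

tokens = ["CCGA", (0, 2), "CGG", (4, 3), "AGTT", (12, 4)]
print(output_offsets(tokens))  # [0, 4, 6, 9, 12, 16]
```

On the GPU the same scan runs in parallel with a library scan primitive; the offsets it produces are identical.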

SLIDE 33

GPU LZ77 Decompression

Compressed input (tokens): CCGA(0,2)CGG(4,3)AGTT(12,4)

3.1) Compute the uncompressed output
3.2) Write the uncompressed output: CCGACCCGGCCCAGTTCCGA

Problem: back-references processed in parallel might be dependent! → Use the voting function __ballot to detect dependencies.

SLIDE 34

How to Handle Thread Dependencies?

1) Write literals (parallel)
2) While not all back-references have been written:
   a) check which dependencies are satisfied (parallel);
   b) copy the back-references without pending dependencies.

A) Compression (DE): only search for matches without dependencies.
   Tokens: CCGA(0,3)T(0,3)T

B) Decompression (MRR): copy back-references (fully parallel); in the second loop the dependencies are satisfied.
   Uncompressed input: …CCGACGTTCCGT…
   Tokens: CCGA(0,3)T(4,4)

Bandwidth: 7 GB/s vs. 16 GB/s

SLIDE 35

Decompression Skyline

[Figure: decompression-speed vs. compression-ratio skyline for byte-level and bit-level encoding schemes]

CPU: dual-socket E5-2620 (bandwidth 102.4 GB/s)
GPU: Tesla K40 (bandwidth 288 GB/s)

SLIDE 36

GPUs on the Cloud

  • Cloud offerings
    • AWS
    • Google Cloud
    • Microsoft Azure
    • IBM Softlayer
    • Nimbix
  • Opportunity
    • Evaluate the usefulness of GPUs/FPGAs without the high up-front investment
  • Special considerations
    • Charging model
    • Scaling capabilities
    • Software licensing

SLIDE 37

Summary

  • Accelerated wildcard string search
    • Insight: change the layout of the strings in GPU main memory
    • 3X speed-up & 2X energy savings against parallel state-of-the-art CPU libraries
  • Gompresso: massively parallel decompression
    • Insight: trade off compression ratio for increased parallelism
    • 2X speed-up & 1.2X energy savings against multi-core state-of-the-art CPU libraries
  • GPUs on the cloud: open questions