Toward GPUs being mainstream in analytic processing: An initial argument using simple scan-aggregate queries


SLIDE 1

Toward GPUs being mainstream in analytic processing

An initial argument using simple scan-aggregate queries

Jason Power || Yinan Li || Mark D. Hill || Jignesh M. Patel || David A. Wood <powerjg@cs.wisc.edu>

DaMoN 2015

SLIDE 2

Summary

▪ GPUs are energy efficient

▪ Discrete GPUs unpopular for DBMS
▪ New integrated GPUs solve the problems

▪ Scan-aggregate GPU implementation

▪ Wide bit-parallel scan
▪ Fine-grained aggregate GPU offload

▪ Up to 70% energy savings over multicore CPU

▪ Even more in the future

SLIDE 3

Analytic Data is Growing

▪ Data is growing rapidly
▪ Analytic DBs increasingly important

Want: High performance
Need: Low energy

Source: IDC’s Digital Universe Study. 2012.

SLIDE 4

GPUs to the Rescue?

▪ GPUs are becoming more general

▪ Easier to program
▪ Integrated GPUs are everywhere

[Govindaraju ’04, He ’14, He ’14, Kaldewey ‘12, Satish ’10, and many others]

▪ GPUs show great promise

▪ Higher performance than CPUs
▪ Better energy efficiency

▪ Analytic DBs look like GPU workloads

SLIDE 5

GPU Microarchitecture

[Figure: GPU microarchitecture. The Graphics Processing Unit is built from Compute Units (CUs), each containing instruction fetch/scheduling logic, SIMD lanes (SPs), a register file, a scratchpad, and an L1 cache; the CUs share an L2 cache.]

SLIDE 6

Discrete GPUs

[Figure: discrete GPU system. The CPU chip and the discrete GPU each sit on their own memory bus and communicate over the PCIe bus.]

SLIDE 7

Discrete GPUs

[Figure: the same discrete GPU system, annotated with steps ➊ and ➋.]

SLIDE 8

Discrete GPUs

[Figure: the same discrete GPU system, annotated with steps ➌ and ➍, and repeat.]

SLIDE 9

Discrete GPUs

▪ Copy data over PCIe
▪ Low bandwidth
▪ High latency
▪ Small working memory
▪ High latency user→kernel calls
▪ Repeated many times

98% of time spent not computing

SLIDE 10

Integrated GPUs

[Figure: integrated GPU. CPU cores and GPU CUs share a single heterogeneous chip and a single memory bus.]

SLIDE 11

Heterogeneous System Arch.

▪ API for tightly-integrated accelerators
▪ No need for data copies
▪ Cache coherence and shared address space
▪ No OS kernel interaction
▪ User-mode queues
▪ Industry support
▪ Initial hardware support today
▪ HSA Foundation (AMD, ARM, Qualcomm, others)

SLIDE 12

Outline

▪ Background
▪ Algorithms
▪ Scan
▪ Aggregate
▪ Results

SLIDE 13

Analytic DBs

▪ Resident in main-memory
▪ Column-based layout
▪ WideTable & BitWeaving [Li and Patel ‘13 & ‘14]
▪ Convert queries to mostly scans by pre-joining tables
▪ Fast scan by using sub-word parallelism
▪ Similar to industry proposals [SAP Hana, Oracle Exalytics, IBM DB2 BLU]
▪ Scan-aggregate queries

SLIDE 14

Running Example

Shirt Color | Shirt Amount
Green       | 1
Green       | 3
Blue        | 1
Green       | 5
Yellow      | 7
Red         | 2
Yellow      | 1
Blue        | 4
Yellow      | 2

Color → Code dictionary: Red = 0, Blue = 1, Green = 2, Yellow = 3

Encoded Shirt Color column: 2, 2, 1, 2, 3, 0, 3, 1, 3

SLIDE 15

Running Example

Shirt Color / Shirt Amount table from the previous slide, with the encoded Shirt Color column 2, 2, 1, 2, 3, 0, 3, 1, 3.

Count the number of green shirts in the inventory:

➊ Scan the color column for green (2)
➋ Aggregate amount where there is a match
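The two steps map directly onto code. Below is a minimal, self-contained C++ sketch of the query on the running example; the array names and the hard-coded code for green (2) are illustrative assumptions, not code from the paper.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // Dictionary-encoded color column (Red=0, Blue=1, Green=2, Yellow=3)
    // and the corresponding amount column from the running example.
    std::vector<uint8_t>  color  = {2, 2, 1, 2, 3, 0, 3, 1, 3};
    std::vector<uint32_t> amount = {1, 3, 1, 5, 7, 2, 1, 4, 2};

    const uint8_t green = 2;

    // Step 1: scan the color column, producing one match bit per row.
    std::vector<bool> match(color.size());
    for (size_t i = 0; i < color.size(); ++i)
        match[i] = (color[i] == green);

    // Step 2: aggregate the amount column where the scan matched.
    uint64_t total = 0;
    for (size_t i = 0; i < amount.size(); ++i)
        if (match[i]) total += amount[i];

    std::printf("green shirts in inventory: %llu\n",
                (unsigned long long)total);   // 1 + 3 + 5 = 9
    return 0;
}
```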

SLIDE 16

Traditional Scan Algorithm

[Figure: each code of the Shirt Color column (2, 2, 1, 2, 3, 0, 3, 1, 3) is compared in turn against the Compare Code for Green (10 = 2), and each match sets one bit of the Result BitVector: 11010000 0000...]

SLIDE 17

Vertical Layout

[Figure: the color codes c0–c9 (2, 2, 1, 2, 3, 0, 3, 1, 3, 0) stored in a vertical layout: word w0 holds the high bits of c0–c7, w1 their low bits, and w2/w3 the bits of c8–c9, giving the bit-plane words 11011010 00101011 10000000 10000000 ...]
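A minimal C++ sketch of building such a vertical layout for 2-bit codes, using 8-bit words so the output matches the example above. BitWeaving itself packs codes into full processor words; this simplified packer is my own illustration, not the paper's code.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Pack 2-bit codes into a vertical (bit-sliced) layout: for every group of
// 8 codes, emit one word holding their high bits and one word holding their
// low bits (8-bit words here to match the example; BitWeaving uses full
// processor words).
std::vector<uint8_t> pack_vertical(const std::vector<uint8_t>& codes) {
    std::vector<uint8_t> words;
    for (size_t base = 0; base < codes.size(); base += 8) {
        uint8_t hi = 0, lo = 0;
        for (size_t j = 0; j < 8 && base + j < codes.size(); ++j) {
            hi |= ((codes[base + j] >> 1) & 1) << (7 - j);  // bit 1 of the code
            lo |= ( codes[base + j]       & 1) << (7 - j);  // bit 0 of the code
        }
        words.push_back(hi);
        words.push_back(lo);
    }
    return words;
}

int main() {
    // Color codes from the running example (c0..c9).
    std::vector<uint8_t> color = {2, 2, 1, 2, 3, 0, 3, 1, 3, 0};
    for (uint8_t w : pack_vertical(color))
        std::printf("%02x ", w);   // prints: da 2b 80 80
    std::printf("\n");
    return 0;
}
```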

SLIDE 18

CPU BitWeaving Scan

[Figure: the vertical-layout Column Data words are compared a word at a time against the Compare Code for Green broadcast across full words (11111111 for the high-bit plane, 00000000 for the low-bit plane), producing the Result BitVector 11010000 0000... CPU width: 64 bits, up to 256-bit SIMD.]
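A hedged C++ sketch of the bit-parallel equality test this slide relies on: each bit of the probe code is broadcast across a whole word, then an XNOR per bit plane followed by an AND yields one match bit per code. Function and variable names are mine; real BitWeaving also handles wider codes and other comparison operators.

```cpp
#include <cstdint>
#include <vector>

// Bit-parallel equality scan over a 2-bit vertical layout. words[2*g] holds
// the high bits and words[2*g+1] the low bits of the g-th group of codes;
// one word of match bits is produced per group. With 64-bit words a single
// iteration tests 64 codes; with 256-bit SIMD a CPU tests 256 at a time.
std::vector<uint64_t> scan_equal(const std::vector<uint64_t>& words,
                                 uint8_t code) {
    // Broadcast each bit of the 2-bit probe code across a full word.
    const uint64_t q_hi = ((code >> 1) & 1) ? ~0ULL : 0ULL;
    const uint64_t q_lo = ( code       & 1) ? ~0ULL : 0ULL;

    std::vector<uint64_t> result;
    for (size_t g = 0; g + 1 < words.size(); g += 2) {
        // A code matches iff its high bit equals the probe's high bit AND its
        // low bit equals the probe's low bit: XNOR per plane, then AND.
        uint64_t match = ~(words[g] ^ q_hi) & ~(words[g + 1] ^ q_lo);
        result.push_back(match);
    }
    return result;
}
```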

SLIDE 19

GPU BitWeaving Scan

[Figure: the same comparison, but the GPU processes many column words per step, with the Compare Code broadcast across all lanes (11111111 11111111 11111111 ...), again producing the Result BitVector 11010000 0000... GPU width: 16,384-bit SIMD.]

SLIDE 20

GPU Scan Algorithm

▪ GPU uses very wide “words”
▪ CPU: 64 bits, or 256 bits with SIMD
▪ GPU: 16,384 bits (256 lanes × 64 bits); see the sketch below
▪ Memory and caches optimized for bandwidth
▪ HSA programming model
▪ No data copies
▪ Low CPU-GPU interaction overhead
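As a rough illustration (not the paper's kernel), the GPU version runs the same word-level test in every SIMD lane. The function below is the per-work-item body written as plain C++; in an HSA or OpenCL kernel, each work-item would execute it once with tid set to its global id, so 256 lanes together cover a 16,384-bit slice of the column per step.

```cpp
#include <cstdint>
#include <cstddef>

// Per-work-item body of a GPU scan kernel (illustrative sketch). Every
// work-item applies the same XNOR/AND bit-parallel test to one pair of
// 64-bit bit-plane words. Because the integrated GPU shares memory with the
// CPU, hi_words, lo_words, and out point at the same in-memory column the
// CPU built: no copies are needed.
inline void scan_equal_workitem(size_t tid,
                                const uint64_t* hi_words,
                                const uint64_t* lo_words,
                                uint64_t q_hi, uint64_t q_lo,
                                uint64_t* out) {
    out[tid] = ~(hi_words[tid] ^ q_hi) & ~(lo_words[tid] ^ q_lo);
}
```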


SLIDE 21

CPU Aggregate Algorithm

[Figure: the Result BitVector (11010000 0000...) selects rows of the Shirt Amount column (1, 3, 1, 5, 7, 2, 1, 4, 2); the matching amounts are accumulated (1+3, then 1+3+5, ...) into the Result.]

SLIDE 22

GPU Aggregate Algorithm

[Figure: on the CPU, the Result BitVector (11010000 0000...) is converted into a list of Column Offsets (0, 1, then 0, 1, 3, ...).]

SLIDE 23

GPU Aggregate Algorithm

[Figure: on the GPU, the Column Offsets (0, 1, 3, ...) gather values from the Shirt Amount column (1, 3, 1, 5, 7, 2, 1, 4, 2), which are summed (1+3+5+...) into the Result.]

SLIDE 24

Aggregate Algorithm

▪ Two phases

▪ Convert from BitVector to offsets (on CPU)
▪ Materialize data and compute (offload to GPU); see the sketch below
▪ Two group-by algorithms (see paper)
▪ HSA programming model
▪ Fine-grained sharing
▪ Can offload subset of computation
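A minimal C++ sketch of the two phases, with an ordinary loop standing in for the offloaded GPU kernel; the names, bit ordering, and structure are illustrative assumptions rather than the paper's implementation.

```cpp
#include <cstdint>
#include <vector>

// Phase 1 (CPU): turn the scan's Result BitVector into a list of matching
// row offsets. bv holds 64-bit words with row 0 at bit 63 of word 0.
std::vector<uint32_t> bitvector_to_offsets(const std::vector<uint64_t>& bv,
                                           uint32_t num_rows) {
    std::vector<uint32_t> offsets;
    for (uint32_t row = 0; row < num_rows; ++row) {
        uint64_t word = bv[row / 64];
        if ((word >> (63 - row % 64)) & 1)
            offsets.push_back(row);
    }
    return offsets;
}

// Phase 2 (offloaded to the GPU in the paper; a plain loop stands in for
// the kernel here): gather the amount column at the offsets and reduce.
uint64_t sum_at_offsets(const std::vector<uint32_t>& offsets,
                        const std::vector<uint32_t>& amounts) {
    uint64_t total = 0;
    for (uint32_t off : offsets)   // each GPU work-item would take a chunk
        total += amounts[off];
    return total;
}
```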


SLIDE 25

Outline

▪ Background
▪ Algorithms
▪ Results

SLIDE 26

Experimental Methods

▪ AMD A10-7850

▪ 4-core CPU
▪ 8-compute-unit GPU
▪ 16 GB capacity, 21 GB/s DDR3 memory
▪ Separate discrete GPU
▪ Watts-Up meter for full-system power
▪ TPC-H @ scale-factor 10

SLIDE 27

Scan Performance & Energy


SLIDE 28

Scan Performance & Energy


Takeaway: Integrated GPU most efficient for scans

SLIDE 29

TPC-H Queries

[Chart: Query 12 Performance]

SLIDE 30

TPC-H Queries

[Charts: Query 12 Performance and Query 12 Energy]

Integrated GPU faster for both aggregate and scan computation

SLIDE 31

TPC-H Queries

[Charts: Query 12 Performance and Query 12 Energy]

SLIDE 32

TPC-H Queries

[Charts: Query 12 Performance and Query 12 Energy]

More energy: the decrease in latency does not offset the power increase
Less energy: decrease in latency AND decrease in power
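A hedged illustration of the arithmetic behind these two outcomes, with hypothetical numbers rather than measurements from the paper:

```
Energy = average system power × runtime

More energy:  baseline 60 W × 10 s = 600 J  →  offload 110 W × 6 s = 660 J  (faster, but costs energy)
Less energy:  baseline 60 W × 10 s = 600 J  →  offload  80 W × 6 s = 480 J  (faster AND saves energy)
```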

SLIDE 33

Future Die Stacked GPUs

▪ 3D die stacking
▪ Same physical & logical integration
▪ Increased compute
▪ Increased bandwidth

[Figure: DRAM die-stacked on top of a combined CPU+GPU chip on the board.]

Power et al. Implications of 3D GPUs on the Scan Primitive. SIGMOD Record, Volume 44, Issue 1, March 2015.

SLIDE 34

Conclusions

                    Discrete GPUs   Integrated GPUs   3D Stacked GPUs
Performance         High ☺          Moderate          High ☺
Memory Bandwidth    High ☺          Low ☹             High ☺
Overhead            High ☹          Low ☺             Low ☺
Memory Capacity     Low ☹           High ☺            Moderate

SLIDE 35

?


SLIDE 36

HSA vs CUDA/OpenCL

▪ HSA defines a heterogeneous architecture

▪ Cache coherence
▪ Shared virtual addresses
▪ Architected queuing
▪ Intermediate language

▪ CUDA/OpenCL are a level above HSA

▪ Come with baggage
▪ Not as flexible
▪ May not be able to take advantage of all features

SLIDE 37

Scan Performance & Energy


SLIDE 38

Group-by Algorithms


SLIDE 39

All TPC-H Results


SLIDE 40

Average TPC-H Results

[Charts: Average Performance and Average Energy]

SLIDE 41

What’s Next?

▪ Developing a cost model for the GPU
▪ Using the GPU is just another algorithm to choose
▪ Evaluate exactly when the GPU is more efficient
▪ Future “database machines”
▪ GPUs are a good tradeoff between specialization and commodity

SLIDE 42

Conclusions

▪ Integrated GPUs viable for DBMS?

▪ Solve the problems of discrete GPUs
▪ (Somewhat) better performance and energy

▪ Looking toward the future...

▪ CPUs cannot keep up with bandwidth
▪ GPUs perfectly designed for these workloads