Toward GPUs being mainstream in analytic processing: An initial argument using simple scan-aggregate queries


SLIDE 1

Toward GPUs being mainstream in analytic processing

An initial argument using simple scan-aggregate queries

Jason Power || Yinan Li || Mark D. Hill || Jignesh M. Patel || David A. Wood <powerjg@cs.wisc.edu>

DaMoN 2015

SLIDE 2

Summary

▪ GPUs are energy efficient

▪ Discrete GPUs unpopular for DBMS
▪ New integrated GPUs solve the problems

▪ Scan-aggregate GPU implementation

▪ Wide bit-parallel scan
▪ Fine-grained aggregate GPU offload

▪ Up to 70% energy savings over multicore CPU

▪ Even more in the future

SLIDE 3

Analytic Data is Growing

▪ Data is growing rapidly
▪ Analytic DBs increasingly important

Want: High performance
Need: Low energy

Source: IDC’s Digital Universe Study. 2012.

SLIDE 4

GPUs to the Rescue?

▪ GPUs are becoming more general

▪ Easier to program
▪ Integrated GPUs are everywhere

[Govindaraju ’04, He ’14, He ’14, Kaldewey ‘12, Satish ’10, and many others]

▪ GPUs show great promise

▪ Higher performance than CPUs
▪ Better energy efficiency

▪ Analytic DBs look like GPU workloads

SLIDE 5

GPU Microarchitecture

[Figure: GPU microarchitecture. The Graphics Processing Unit is built from Compute Units (CUs), each containing instruction fetch/scheduling logic, SIMD lanes (SPs), a register file, a scratchpad, and an L1 cache; the CUs share an L2 cache.]

SLIDE 6

Discrete GPUs

[Figure: discrete GPU system. The CPU chip and the discrete GPU each sit on their own memory bus and communicate over the PCIe bus.]

SLIDE 7

Discrete GPUs

[Figure: the same discrete GPU system, annotated with steps ➊ and ➋.]

SLIDE 8

Discrete GPUs

[Figure: the same discrete GPU system, annotated with steps ➌ and ➍, and repeat.]

SLIDE 9

Discrete GPUs

▪ Copy data over PCIe
▪ Low bandwidth
▪ High latency
▪ Small working memory
▪ High latency user→kernel calls
▪ Repeated many times

98% of time spent not computing

SLIDE 10

Integrated GPUs

[Figure: integrated GPU. CPU cores and GPU CUs share a single heterogeneous chip and a single memory bus.]

SLIDE 11

Heterogeneous System Arch.

▪ API for tightly-integrated accelerators
▪ No need for data copies
▪ Cache coherence and shared address space
▪ No OS kernel interaction
▪ User-mode queues
▪ Industry support
▪ Initial hardware support today
▪ HSA Foundation (AMD, ARM, Qualcomm, others)

SLIDE 12

Outline

▪ Background
▪ Algorithms
▪ Scan
▪ Aggregate
▪ Results

SLIDE 13

Analytic DBs

▪ Resident in main-memory
▪ Column-based layout
▪ WideTable & BitWeaving [Li and Patel ‘13 & ‘14]
▪ Convert queries to mostly scans by pre-joining tables
▪ Fast scan by using sub-word parallelism
▪ Similar to industry proposals [SAP Hana, Oracle Exalytics, IBM DB2 BLU]
▪ Scan-aggregate queries

SLIDE 14

Running Example

Shirt Color | Shirt Amount
Green       | 1
Green       | 3
Blue        | 1
Green       | 5
Yellow      | 7
Red         | 2
Yellow      | 1
Blue        | 4
Yellow      | 2

Color → Code dictionary: Red = 0, Blue = 1, Green = 2, Yellow = 3

Encoded Shirt Color column: 2, 2, 1, 2, 3, 0, 3, 1, 3

SLIDE 15

Running Example

Shirt Color / Shirt Amount table from the previous slide, with the encoded Shirt Color column 2, 2, 1, 2, 3, 0, 3, 1, 3.

Count the number of green shirts in the inventory:

➊ Scan the color column for green (2)
➋ Aggregate amount where there is a match
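The two steps map directly onto code. Below is a minimal, self-contained C++ sketch of the query on the running example; the array names and the hard-coded code for green (2) are illustrative assumptions, not code from the paper.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // Dictionary-encoded color column (Red=0, Blue=1, Green=2, Yellow=3)
    // and the corresponding amount column from the running example.
    std::vector<uint8_t>  color  = {2, 2, 1, 2, 3, 0, 3, 1, 3};
    std::vector<uint32_t> amount = {1, 3, 1, 5, 7, 2, 1, 4, 2};

    const uint8_t green = 2;

    // Step 1: scan the color column, producing one match bit per row.
    std::vector<bool> match(color.size());
    for (size_t i = 0; i < color.size(); ++i)
        match[i] = (color[i] == green);

    // Step 2: aggregate the amount column where the scan matched.
    uint64_t total = 0;
    for (size_t i = 0; i < amount.size(); ++i)
        if (match[i]) total += amount[i];

    std::printf("green shirts in inventory: %llu\n",
                (unsigned long long)total);   // 1 + 3 + 5 = 9
    return 0;
}
```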

SLIDE 16

Traditional Scan Algorithm

[Figure: each code of the Shirt Color column (2, 2, 1, 2, 3, 0, 3, 1, 3) is compared in turn against the Compare Code for Green (10 = 2), and each match sets one bit of the Result BitVector: 11010000 0000...]

SLIDE 17

Vertical Layout

[Figure: the color codes c0–c9 (2, 2, 1, 2, 3, 0, 3, 1, 3, 0) stored in a vertical layout: word w0 holds the high bits of c0–c7, w1 their low bits, and w2/w3 the bits of c8–c9, giving the bit-plane words 11011010 00101011 10000000 10000000 ...]
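A minimal C++ sketch of building such a vertical layout for 2-bit codes, using 8-bit words so the output matches the example above. BitWeaving itself packs codes into full processor words; this simplified packer is my own illustration, not the paper's code.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Pack 2-bit codes into a vertical (bit-sliced) layout: for every group of
// 8 codes, emit one word holding their high bits and one word holding their
// low bits (8-bit words here to match the example; BitWeaving uses full
// processor words).
std::vector<uint8_t> pack_vertical(const std::vector<uint8_t>& codes) {
    std::vector<uint8_t> words;
    for (size_t base = 0; base < codes.size(); base += 8) {
        uint8_t hi = 0, lo = 0;
        for (size_t j = 0; j < 8 && base + j < codes.size(); ++j) {
            hi |= ((codes[base + j] >> 1) & 1) << (7 - j);  // bit 1 of the code
            lo |= ( codes[base + j]       & 1) << (7 - j);  // bit 0 of the code
        }
        words.push_back(hi);
        words.push_back(lo);
    }
    return words;
}

int main() {
    // Color codes from the running example (c0..c9).
    std::vector<uint8_t> color = {2, 2, 1, 2, 3, 0, 3, 1, 3, 0};
    for (uint8_t w : pack_vertical(color))
        std::printf("%02x ", w);   // prints: da 2b 80 80
    std::printf("\n");
    return 0;
}
```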

SLIDE 18

CPU BitWeaving Scan

[Figure: the vertical-layout Column Data words are compared a word at a time against the Compare Code for Green broadcast across full words (11111111 for the high-bit plane, 00000000 for the low-bit plane), producing the Result BitVector 11010000 0000... CPU width: 64 bits, up to 256-bit SIMD.]
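A hedged C++ sketch of the bit-parallel equality test this slide relies on: each bit of the probe code is broadcast across a whole word, then an XNOR per bit plane followed by an AND yields one match bit per code. Function and variable names are mine; real BitWeaving also handles wider codes and other comparison operators.

```cpp
#include <cstdint>
#include <vector>

// Bit-parallel equality scan over a 2-bit vertical layout. words[2*g] holds
// the high bits and words[2*g+1] the low bits of the g-th group of codes;
// one word of match bits is produced per group. With 64-bit words a single
// iteration tests 64 codes; with 256-bit SIMD a CPU tests 256 at a time.
std::vector<uint64_t> scan_equal(const std::vector<uint64_t>& words,
                                 uint8_t code) {
    // Broadcast each bit of the 2-bit probe code across a full word.
    const uint64_t q_hi = ((code >> 1) & 1) ? ~0ULL : 0ULL;
    const uint64_t q_lo = ( code       & 1) ? ~0ULL : 0ULL;

    std::vector<uint64_t> result;
    for (size_t g = 0; g + 1 < words.size(); g += 2) {
        // A code matches iff its high bit equals the probe's high bit AND its
        // low bit equals the probe's low bit: XNOR per plane, then AND.
        uint64_t match = ~(words[g] ^ q_hi) & ~(words[g + 1] ^ q_lo);
        result.push_back(match);
    }
    return result;
}
```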

SLIDE 19

GPU BitWeaving Scan

[Figure: the same comparison, but the GPU processes many column words per step, with the Compare Code broadcast across all lanes (11111111 11111111 11111111 ...), again producing the Result BitVector 11010000 0000... GPU width: 16,384-bit SIMD.]

SLIDE 20

GPU Scan Algorithm

▪ GPU uses very wide “words”
▪ CPU: 64 bits, or 256 bits with SIMD
▪ GPU: 16,384 bits (256 lanes × 64 bits); see the sketch below
▪ Memory and caches optimized for bandwidth
▪ HSA programming model
▪ No data copies
▪ Low CPU-GPU interaction overhead
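As a rough illustration (not the paper's kernel), the GPU version runs the same word-level test in every SIMD lane. The function below is the per-work-item body written as plain C++; in an HSA or OpenCL kernel, each work-item would execute it once with tid set to its global id, so 256 lanes together cover a 16,384-bit slice of the column per step.

```cpp
#include <cstdint>
#include <cstddef>

// Per-work-item body of a GPU scan kernel (illustrative sketch). Every
// work-item applies the same XNOR/AND bit-parallel test to one pair of
// 64-bit bit-plane words. Because the integrated GPU shares memory with the
// CPU, hi_words, lo_words, and out point at the same in-memory column the
// CPU built: no copies are needed.
inline void scan_equal_workitem(size_t tid,
                                const uint64_t* hi_words,
                                const uint64_t* lo_words,
                                uint64_t q_hi, uint64_t q_lo,
                                uint64_t* out) {
    out[tid] = ~(hi_words[tid] ^ q_hi) & ~(lo_words[tid] ^ q_lo);
}
```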


SLIDE 21

CPU Aggregate Algorithm

[Figure: the Result BitVector (11010000 0000...) selects rows of the Shirt Amount column (1, 3, 1, 5, 7, 2, 1, 4, 2); the matching amounts are accumulated (1+3, then 1+3+5, ...) into the Result.]

SLIDE 22

GPU Aggregate Algorithm

[Figure: on the CPU, the Result BitVector (11010000 0000...) is converted into a list of Column Offsets (0, 1, then 0, 1, 3, ...).]

SLIDE 23

GPU Aggregate Algorithm

[Figure: on the GPU, the Column Offsets (0, 1, 3, ...) gather values from the Shirt Amount column (1, 3, 1, 5, 7, 2, 1, 4, 2), which are summed (1+3+5+...) into the Result.]

SLIDE 24

Aggregate Algorithm

▪ Two phases

▪ Convert from BitVector to offsets (on CPU)
▪ Materialize data and compute (offload to GPU); see the sketch below
▪ Two group-by algorithms (see paper)
▪ HSA programming model
▪ Fine-grained sharing
▪ Can offload subset of computation
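A minimal C++ sketch of the two phases, with an ordinary loop standing in for the offloaded GPU kernel; the names, bit ordering, and structure are illustrative assumptions rather than the paper's implementation.

```cpp
#include <cstdint>
#include <vector>

// Phase 1 (CPU): turn the scan's Result BitVector into a list of matching
// row offsets. bv holds 64-bit words with row 0 at bit 63 of word 0.
std::vector<uint32_t> bitvector_to_offsets(const std::vector<uint64_t>& bv,
                                           uint32_t num_rows) {
    std::vector<uint32_t> offsets;
    for (uint32_t row = 0; row < num_rows; ++row) {
        uint64_t word = bv[row / 64];
        if ((word >> (63 - row % 64)) & 1)
            offsets.push_back(row);
    }
    return offsets;
}

// Phase 2 (offloaded to the GPU in the paper; a plain loop stands in for
// the kernel here): gather the amount column at the offsets and reduce.
uint64_t sum_at_offsets(const std::vector<uint32_t>& offsets,
                        const std::vector<uint32_t>& amounts) {
    uint64_t total = 0;
    for (uint32_t off : offsets)   // each GPU work-item would take a chunk
        total += amounts[off];
    return total;
}
```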


SLIDE 25

Outline

▪ Background
▪ Algorithms
▪ Results

SLIDE 26

Experimental Methods

▪ AMD A10-7850

▪ 4-core CPU
▪ 8-compute-unit GPU
▪ 16 GB capacity, 21 GB/s DDR3 memory
▪ Separate discrete GPU
▪ Watts-Up meter for full-system power
▪ TPC-H @ scale-factor 10

SLIDE 27

Scan Performance & Energy


SLIDE 28

Scan Performance & Energy


Takeaway: Integrated GPU most efficient for scans

SLIDE 29

TPC-H Queries

[Chart: Query 12 Performance]

SLIDE 30

TPC-H Queries

[Charts: Query 12 Performance and Query 12 Energy]

Integrated GPU faster for both aggregate and scan computation

SLIDE 31

TPC-H Queries

[Charts: Query 12 Performance and Query 12 Energy]

SLIDE 32

TPC-H Queries

[Charts: Query 12 Performance and Query 12 Energy]

More energy: the decrease in latency does not offset the power increase
Less energy: decrease in latency AND decrease in power
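A hedged illustration of the arithmetic behind these two outcomes, with hypothetical numbers rather than measurements from the paper:

```
Energy = average system power × runtime

More energy:  baseline 60 W × 10 s = 600 J  →  offload 110 W × 6 s = 660 J  (faster, but costs energy)
Less energy:  baseline 60 W × 10 s = 600 J  →  offload  80 W × 6 s = 480 J  (faster AND saves energy)
```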

SLIDE 33

Future Die Stacked GPUs

▪ 3D die stacking
▪ Same physical & logical integration
▪ Increased compute
▪ Increased bandwidth

[Figure: DRAM die-stacked on top of a combined CPU+GPU chip on the board.]

Power et al. Implications of 3D GPUs on the Scan Primitive. SIGMOD Record, Volume 44, Issue 1, March 2015.

SLIDE 34

Conclusions

                    Discrete GPUs   Integrated GPUs   3D Stacked GPUs
Performance         High ☺          Moderate          High ☺
Memory Bandwidth    High ☺          Low ☹             High ☺
Overhead            High ☹          Low ☺             Low ☺
Memory Capacity     Low ☹           High ☺            Moderate

SLIDE 35

?


SLIDE 36

HSA vs CUDA/OpenCL

▪ HSA defines a heterogeneous architecture

▪ Cache coherence
▪ Shared virtual addresses
▪ Architected queuing
▪ Intermediate language

▪ CUDA/OpenCL are a level above HSA

▪ Come with baggage
▪ Not as flexible
▪ May not be able to take advantage of all features

SLIDE 37

Scan Performance & Energy


SLIDE 38

Group-by Algorithms


SLIDE 39

All TPC-H Results


SLIDE 40

Average TPC-H Results

[Charts: Average Performance and Average Energy]

SLIDE 41

What’s Next?

▪ Developing a cost model for the GPU
▪ Using the GPU is just another algorithm to choose
▪ Evaluate exactly when the GPU is more efficient
▪ Future “database machines”
▪ GPUs are a good tradeoff between specialization and commodity

SLIDE 42

Conclusions

▪ Integrated GPUs viable for DBMS?

▪ Solve the problems of discrete GPUs
▪ (Somewhat) better performance and energy

▪ Looking toward the future...

▪ CPUs cannot keep up with bandwidth
▪ GPUs perfectly designed for these workloads