Unleashing the Hidden Power of Integrated-GPUs for Database Co-Processing. Edward Ching, Norbert Egi, Masood Mortazavi, Vincent Cheung, Guangyu Shi. BigSys’14, September 25th 2014. IT Research Department.


SLIDE 1

Unleashing the Hidden Power of Integrated-GPUs for Database Co-Processing

Edward Ching, Norbert Egi, Masood Mortazavi, Vincent Cheung, Guangyu Shi

BigSys’14, September 25th 2014

IT Research Department

SLIDE 2

Overview

  • Wide variety of compute resources available (multi-core (MC) CPUs, GPUs, FPGAs, DSPs, etc.)
  • Discrete GPUs (d-GPUs) are probably the best known
  • Integrated GPUs (i-GPUs) have become available for general-purpose parallel computation
  • Architectural and performance comparison of d-GPUs and i-GPUs

SLIDE 3

Introduction

  • Discrete GPUs (d-GPUs) have long been used for application acceleration
  • CPU+GPU co-processing for data analytics is being widely adopted: it requires less computation than HPC, but on much more data
  • PCIe has become the performance “bottleneck”
  • Recent CPUs with integrated GPUs (i-GPUs) look like a viable alternative
  • Our focus is on modern i-GPUs for parallel data processing
  • Goal: help system designers select the right architectural option

SLIDE 4

Architecture: d-GPU

  • Large local memory and cache hierarchy
  • High-throughput GDDR5 access
  • Connection over PCIe
    – Low throughput
    – High latency

SLIDE 5

Architecture: i-GPU

  • Connection over internal bus
    – High throughput
    – Low latency
  • Shared LLC
  • True zero-copy

Figure: Haswell Microarchitecture

SLIDE 6

Hardware parameters

Parameter                  Nvidia GTX780   Intel HD4600
Cores                      12              20
Threads / Core             6               7
Data lanes / Thread        32              8 / 16 / 32
Max. Physical Occupancy    2304            4480
Clock (GHz)                1.0             1.25
Power consumption (W)      250             <30
GFLOPS                     3977            432
Memory / Cache             3GB GDDR5       8MB L3 cache
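
The "Max. Physical Occupancy" row appears to be the product of the three rows above it (cores × threads per core × data lanes per thread, using the widest 32-lane mode for the HD4600). The short Python sketch below checks that arithmetic and also derives a rough GFLOPS-per-watt figure. The dictionary layout and function names are illustrative assumptions, and because the HD4600 power is only given as "<30 W", its efficiency value is a lower bound.

```python
# Values copied from the hardware-parameters table above.
# Assumption: "lanes" uses the widest SIMD mode (32) for the HD4600.
GPUS = {
    "Nvidia GTX780": {"cores": 12, "threads_per_core": 6, "lanes": 32,
                      "gflops": 3977, "watts": 250},
    "Intel HD4600": {"cores": 20, "threads_per_core": 7, "lanes": 32,
                     "gflops": 432, "watts": 30},  # table says "<30 W"
}


def max_occupancy(gpu):
    """Hardware data lanes that can be resident at once."""
    return gpu["cores"] * gpu["threads_per_core"] * gpu["lanes"]


def gflops_per_watt(gpu):
    """Peak compute per watt; a lower bound when watts is an upper bound."""
    return gpu["gflops"] / gpu["watts"]


for name, gpu in GPUS.items():
    print(name, max_occupancy(gpu), round(gflops_per_watt(gpu), 1))
# Nvidia GTX780 2304 15.9
# Intel HD4600 4480 14.4
```

On these peak numbers the d-GPU has roughly nine times the raw GFLOPS, while the per-watt figures are much closer.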

SLIDE 7

Data Transfer Mechanisms

Columns: d-GPU (DMA, Memory Mapping), i-GPU (Zero-Copy)

d-GPU: DMA
  • HW supported
  • No CPU or GPU intervention
  • Directly to GDDR5
  • Goes over the relatively slow PCIe
  • Can add to the total execution time

SLIDE 8

Data Transfer Mechanisms

Columns: d-GPU (DMA, Memory Mapping), i-GPU (Zero-Copy)

d-GPU: DMA
  • HW supported
  • No CPU or GPU intervention
  • Directly to GDDR5
  • Goes over the relatively slow PCIe
  • Can add to the total execution time

d-GPU: Memory Mapping
  • DDR3 can directly be referenced via GPU MMU
  • Programming the MMU is faster than a DMA transfer
  • Only data that is needed is moved
  • GPU is stalled during data transfer
  • Goes over the relatively slow PCIe

SLIDE 9

Data Transfer Mechanisms

d-GPU: DMA
  • HW supported
  • No CPU or GPU intervention
  • Directly to GDDR5
  • Goes over the relatively slow PCIe
  • Can add to the total execution time

d-GPU: Memory Mapping
  • DDR3 can directly be referenced via GPU MMU
  • Programming the MMU is faster than a DMA transfer
  • Only data that is needed is moved
  • GPU is stalled during data transfer
  • Goes over the relatively slow PCIe

i-GPU: Zero-Copy
  • Shared cache and main memory
  • GTT mapping (similar to MMU)
  • Data goes over the fast internal bus of the CPU
  • Data can even be retrieved from shared LLC
  • Works “vice versa” (CPU ↔ GPU)
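
One way to reason about "can add to the total execution time" is a simple cost model: end-to-end time is the explicit transfer time plus the kernel time, and zero-copy drops the explicit transfer term. The Python sketch below is only an illustration under that assumption. The function name, the bandwidth, and the kernel times are hypothetical placeholders, not measurements from the talk, and any memory-access cost during a zero-copy kernel is assumed to be folded into its kernel time.

```python
def total_time_ms(data_bytes, kernel_ms, link_gb_per_s=None):
    """End-to-end time: optional explicit copy over a link, then the kernel.

    link_gb_per_s=None models zero-copy (no separate transfer step).
    """
    if link_gb_per_s is None:
        return kernel_ms
    transfer_ms = data_bytes / (link_gb_per_s * 1e9) * 1e3
    return transfer_ms + kernel_ms


# Hypothetical inputs, chosen only to illustrate the trade-off.
data_bytes = 1_000_000_000                      # 1 GB of column data
dgpu_ms = total_time_ms(data_bytes, kernel_ms=5.0, link_gb_per_s=10.0)
igpu_ms = total_time_ms(data_bytes, kernel_ms=40.0)  # slower kernel, no copy

print(dgpu_ms, igpu_ms)
```

With these made-up inputs, the device with the faster kernel still finishes later because the copy dominates, which is the situation the slides describe for data-heavy, compute-light analytics.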

SLIDE 10

Performance Analysis

  • (1) Compilation Environment
  • (2) Raw data transfer
  • (3) Micro-benchmarks
  • (4) Database queries

SLIDE 11

Compilation Environment

Figure: software stack for OpenCL query co-processing on Haswell.

Components:
  • OpenCL Query Co-Processing Functions
  • Linux OpenCL API for Haswell
  • Linux OpenCL Compiler for Haswell
  • Linux OpenCL driver for Haswell
  • Haswell Processor Graphics Hardware (Gen 7 - HD4600)

Labels in the diagram: OpenCL kernel source and OpenCL API; OpenCL kernel source; Haswell GPU instruction-set binary; GPU (EU, GTT, etc.) configuration; memory management; GPU binary; execution control; device access.

SLIDE 12

Raw Data Transfer

SLIDE 13

Micro-benchmarks

  • Simple, optimal memory access patterns: Map, Reduce, Scan
  • Randomized memory access patterns: Gather, Scatter
  • Combination of several sorting operations: Split, Bitonic Sort and Radix Sort
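
For readers unfamiliar with these primitives, the sequential Python sketch below shows their reference semantics and how Split can be composed from Scan and Scatter. It is not the OpenCL code used in the benchmarks, and the function names and signatures are illustrative assumptions.

```python
from functools import reduce as fold


def map_(f, xs):
    """Map: apply f to every element independently."""
    return [f(x) for x in xs]


def reduce_(op, xs, init):
    """Reduce: combine all elements into one value."""
    return fold(op, xs, init)


def scan(xs):
    """Exclusive prefix sum: out[i] is the sum of xs[0..i-1]."""
    out, total = [], 0
    for x in xs:
        out.append(total)
        total += x
    return out


def gather(src, idx):
    """Gather: out[i] = src[idx[i]] (random reads)."""
    return [src[i] for i in idx]


def scatter(src, idx, size):
    """Scatter: out[idx[i]] = src[i] (random writes)."""
    out = [None] * size
    for value, i in zip(src, idx):
        out[i] = value
    return out


def split(xs, flags):
    """Stable partition: elements with flag 0 first, then flag 1."""
    zeros = [1 - f for f in flags]
    zero_pos, one_pos, n_zeros = scan(zeros), scan(flags), sum(zeros)
    idx = [zero_pos[i] if flags[i] == 0 else n_zeros + one_pos[i]
           for i in range(len(xs))]
    return scatter(xs, idx, len(xs))


print(scan([3, 1, 4]))                        # [0, 3, 4]
print(split([5, 2, 7, 4], [1, 0, 1, 0]))      # [2, 4, 5, 7]
```

Applying split once per key bit, from least to most significant, gives a radix sort, which is one reading of "combination" in the last bullet.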

13

slide-14
SLIDE 14

Micro-benchmarks

SLIDE 15

Micro-benchmarks

SLIDE 16

Database query

TPC-H Q1

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
    sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from lineitem
where l_shipdate <= date '1998-12-01' - interval '[DELTA]' day (3)
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;

TPC-H Q9

select
    nation,
    o_year,
    sum(amount) as sum_profit
from (
    select
        n_name as nation,
        extract(year from o_orderdate) as o_year,
        l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
    from part, supplier, lineitem, partsupp, orders, nation
    where s_suppkey = l_suppkey
      and ps_suppkey = l_suppkey
      and ps_partkey = l_partkey
      and p_partkey = l_partkey
      and o_orderkey = l_orderkey
      and s_nationkey = n_nationkey
      and p_name like '%[COLOR]%'
) as profit
group by nation, o_year
order by nation, o_year desc;
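
To make the shape of Q1 concrete (a filter on l_shipdate followed by a grouped aggregation), here is a minimal row-at-a-time Python sketch of the same computation. It is not the UDF implementation benchmarked in the talk. The function name, the dict-based row layout, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def tpch_q1(lineitem, delta_days):
    """Filter by ship date, then aggregate per (returnflag, linestatus)."""
    cutoff = date(1998, 12, 1) - timedelta(days=delta_days)
    sums = defaultdict(lambda: defaultdict(float))
    for r in lineitem:
        if r["l_shipdate"] > cutoff:
            continue  # row fails the WHERE predicate
        g = sums[(r["l_returnflag"], r["l_linestatus"])]
        disc_price = r["l_extendedprice"] * (1 - r["l_discount"])
        g["sum_qty"] += r["l_quantity"]
        g["sum_base_price"] += r["l_extendedprice"]
        g["sum_disc_price"] += disc_price
        g["sum_charge"] += disc_price * (1 + r["l_tax"])
        g["sum_disc"] += r["l_discount"]
        g["count_order"] += 1
    result = []
    for key in sorted(sums):  # ORDER BY l_returnflag, l_linestatus
        g, n = sums[key], sums[key]["count_order"]
        result.append({
            "l_returnflag": key[0], "l_linestatus": key[1],
            "sum_qty": g["sum_qty"], "sum_base_price": g["sum_base_price"],
            "sum_disc_price": g["sum_disc_price"],
            "sum_charge": g["sum_charge"],
            "avg_qty": g["sum_qty"] / n,
            "avg_price": g["sum_base_price"] / n,
            "avg_disc": g["sum_disc"] / n, "count_order": int(n),
        })
    return result


# Hypothetical sample rows; the last one is filtered out when delta_days=90.
sample = [
    {"l_returnflag": "A", "l_linestatus": "F", "l_quantity": 10,
     "l_extendedprice": 100.0, "l_discount": 0.5, "l_tax": 0.5,
     "l_shipdate": date(1998, 1, 1)},
    {"l_returnflag": "A", "l_linestatus": "F", "l_quantity": 30,
     "l_extendedprice": 200.0, "l_discount": 0.0, "l_tax": 0.0,
     "l_shipdate": date(1998, 2, 1)},
    {"l_returnflag": "N", "l_linestatus": "O", "l_quantity": 5,
     "l_extendedprice": 50.0, "l_discount": 0.0, "l_tax": 0.0,
     "l_shipdate": date(1998, 12, 1)},
]
rows = tpch_q1(sample, delta_days=90)
```

The per-row arithmetic is light and every row of lineitem is touched once, so the cost of getting the data to the processor matters at least as much as raw compute.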
SLIDE 17

Database query (TPC-H Q1)

Figure: TPC-H Q1 UDF Benchmark Test: iGPU vs dGPUs vs CPU. Plotted quantities: Time (ms) and Throughput (GB/s/W).

SLIDE 18

Database query (TPC-H Q9)

SLIDE 19

Conclusion

  • Examined query processing and primitive operations
  • Used micro-benchmarks and more realistic data-analytics queries
  • Found that i-GPU compute resources are weaker
  • But they excel in the speed of data access
  • Behave as “free” resources
  • Consume far less power

SLIDE 20

Q & A

SLIDE 21

R&D Openings in Munich

  • Huawei’s European Research Center (ERC)
  • 10+ openings
  • Database System Architects and Software Engineers
  • recruitment.erc@huawei.com
  • http://career.huawei.com/career/en/i18n/toJobDetail.do?callMethod=toJobDetail&jobID=43263
