Unleashing the Hidden Power
of Integrated-GPUs for
Database Co-Processing
Edward Ching, Norbert Egi, Masood Mortazavi, Vincent Cheung, Guangyu Shi
BigSys’14, September 25th 2014
IT Research Department
Overview
Wide variety of compute
(MC CPUs, GPUs, FPGAs, DSPs, etc.)
well-known
general-purpose parallel computation
acceleration
adopted: requires less computation compared to HPC, but on much more data
alternative
cache hierarchy
access
– Low throughput
– High latency
– High throughput
– Low latency
Haswell Microarchitecture
                          Nvidia GTX780   Intel HD4600
Cores                     12              20
Threads / Core            6               7
Data lane / Thread        32              8 / 16 / 32
Cores × Threads × Lanes   2304            4480
Clock (GHz)               1.0             1.25
Power consumption (W)     250             <30
GFLOPS                    3977            432
Memory / Cache            3GB GDDR5       8MB L3 cache
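The figures in the comparison above can be cross-checked with a few lines of arithmetic. The sketch below assumes that the row reading 2304 and 4480 is the product of cores, threads per core, and data lanes per thread (using the widest 32-lane mode for the HD4600), and it treats the "<30 W" figure as an upper bound of 30 W. The dictionary layout and function names are illustrative only.

```python
# Figures copied from the GTX780 / HD4600 comparison; names are illustrative.
specs = {
    "Nvidia GTX780": {"cores": 12, "threads_per_core": 6, "lanes_per_thread": 32,
                      "power_w": 250, "gflops": 3977},
    # The slide gives "<30 W"; 30 is used here as an upper bound (assumption).
    "Intel HD4600": {"cores": 20, "threads_per_core": 7, "lanes_per_thread": 32,
                     "power_w": 30, "gflops": 432},
}

def total_lanes(s):
    """Cores x threads per core x data lanes per thread."""
    return s["cores"] * s["threads_per_core"] * s["lanes_per_thread"]

def gflops_per_watt(s):
    """Peak GFLOPS divided by the stated power figure."""
    return s["gflops"] / s["power_w"]

for name, s in specs.items():
    print(f"{name}: {total_lanes(s)} lanes, {gflops_per_watt(s):.1f} GFLOPS/W")
```

Because the HD4600 power figure is an upper bound, its GFLOPS/W value here is a lower bound, which puts the two devices in a similar range per watt despite the large gap in raw GFLOPS.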
d-GPU: DMA
– HW supported
– No CPU or GPU intervention
– Directly to GDDR5
– Goes over the relatively slow PCIe
– Can add to the total execution time

d-GPU: Memory Mapping
– DDR3 can directly be referenced via GPU MMU
– Programming MMU is faster than the DMA transfer
– Only data that is needed is moved
– GPU is stalled during data transfer
– Goes over the relatively slow PCIe

i-GPU: Zero-Copy
– Shared cache and main memory
– GTT mapping (similar to MMU)
– Data goes over the fast internal bus of the CPU
– Data can even be retrieved from shared LLC
– Works “vice versa” (CPU↔GPU)
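The point that a DMA copy "can add to the total execution time" while zero-copy avoids a separate transfer stage can be illustrated with a deliberately simple cost model. This is a sketch under stated assumptions, not a measurement: the function names, the 8 GB/s link speed, the 1 GB input, and the 50 ms kernel time are all hypothetical, and the model ignores the fact that kernel time itself differs between a d-GPU and an i-GPU.

```python
def dma_total_ms(data_bytes, link_gb_per_s, kernel_ms):
    """Copy-then-compute: transfer time over the link adds to kernel time."""
    transfer_ms = data_bytes / (link_gb_per_s * 1e9) * 1e3
    return transfer_ms + kernel_ms

def zero_copy_total_ms(kernel_ms):
    """Shared memory: no separate copy stage in this simplified model."""
    return kernel_ms

# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
data_bytes = 1_000_000_000   # 1 GB of column data
link_gb_per_s = 8.0          # assumed effective PCIe bandwidth
kernel_ms = 50.0             # assumed kernel time, same for both cases

print(dma_total_ms(data_bytes, link_gb_per_s, kernel_ms))  # copy + compute
print(zero_copy_total_ms(kernel_ms))                       # compute only
```

With these made-up inputs the copy stage dominates the total, which is the situation the slide describes for data-heavy, compute-light database workloads.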
[Figure: OpenCL software stack on Haswell]
Components:
– OpenCL Query Co-Processing Functions
– Linux OpenCL API for Haswell
– Linux OpenCL Compiler for Haswell
– Linux OpenCL driver for Haswell
– Haswell Processor Graphics Hardware (Gen 7 - HD4600)
Labels on the connections:
– OpenCL Kernel src, OpenCL API
– OpenCL Kernel src
– Haswell GPU Instr Set binary
– GPU (EU, GTT, etc) config
– Memory Mgmt
– GPU binary
– Execution Control
– Device access
Reduce, Scan;
Scatter;
Bitonic and Radix Sort;
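For readers unfamiliar with these primitives, the following sequential Python sketch shows what reduce, scan, and scatter compute and how they can be combined into a selection (filter) step. It is a reference for the semantics only, assuming the common scan-then-scatter compaction pattern; it is not the OpenCL kernel code from the talk, and all function names are illustrative.

```python
from functools import reduce as fold
from operator import add

def reduce_sum(values):
    """Reduce: combine all elements into a single value (here, a sum)."""
    return fold(add, values, 0)

def exclusive_scan(values):
    """Scan: out[i] is the sum of all elements before position i."""
    out, running = [], 0
    for v in values:
        out.append(running)
        running += v
    return out

def scatter(values, positions, size):
    """Scatter: write values[i] to out[positions[i]]."""
    out = [None] * size
    for v, p in zip(values, positions):
        out[p] = v
    return out

def select(values, predicate):
    """Filter built from the primitives: flag, scan, then scatter."""
    flags = [1 if predicate(v) else 0 for v in values]
    offsets = exclusive_scan(flags)   # output slot for each kept element
    count = reduce_sum(flags)         # size of the output
    kept = [v for v, f in zip(values, flags) if f]
    kept_pos = [o for o, f in zip(offsets, flags) if f]
    return scatter(kept, kept_pos, count)

print(select([5, 12, 7, 30, 1], lambda v: v > 6))
```

On a GPU each of these steps runs in parallel across data lanes; the scan is what lets every lane know where to write its result without coordinating with its neighbours.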
TPC-H Q1

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
    sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from lineitem
where l_shipdate <= date '1998-12-01' - interval '[DELTA]' day (3)
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;

TPC-H Q9

select nation, o_year, sum(amount) as sum_profit
from (
    select
        n_name as nation,
        extract(year from o_orderdate) as o_year,
        l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
    from part, supplier, lineitem, partsupp, orders, nation
    where s_suppkey = l_suppkey
      and ps_suppkey = l_suppkey
      and ps_partkey = l_partkey
      and p_partkey = l_partkey
      and o_orderkey = l_orderkey
      and s_nationkey = n_nationkey
      and p_name like '%[COLOR]%'
) as profit
group by nation, o_year
order by nation, o_year desc;
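To make the shape of the Q1 workload concrete, here is a small Python sketch of its filter, group-by, and aggregation steps over in-memory rows. It is a plain sequential reference, not the GPU implementation from the talk; the function name `tpch_q1`, the sample rows, and the choice of 90 days for `[DELTA]` are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta

def tpch_q1(rows, delta_days):
    """Filter, group, and aggregate lineitem rows the way Q1 specifies."""
    cutoff = date(1998, 12, 1) - timedelta(days=delta_days)
    sums = defaultdict(lambda: defaultdict(float))
    for r in rows:
        if r["l_shipdate"] > cutoff:               # where l_shipdate <= cutoff
            continue
        g = sums[(r["l_returnflag"], r["l_linestatus"])]   # group by
        disc_price = r["l_extendedprice"] * (1 - r["l_discount"])
        g["sum_qty"] += r["l_quantity"]
        g["sum_base_price"] += r["l_extendedprice"]
        g["sum_disc_price"] += disc_price
        g["sum_charge"] += disc_price * (1 + r["l_tax"])
        g["sum_disc"] += r["l_discount"]
        g["count_order"] += 1
    result = []
    for flag, status in sorted(sums):              # order by
        g = sums[(flag, status)]
        n = g["count_order"]
        result.append({
            "l_returnflag": flag, "l_linestatus": status,
            "sum_qty": g["sum_qty"],
            "sum_base_price": g["sum_base_price"],
            "sum_disc_price": g["sum_disc_price"],
            "sum_charge": g["sum_charge"],
            "avg_qty": g["sum_qty"] / n,
            "avg_price": g["sum_base_price"] / n,
            "avg_disc": g["sum_disc"] / n,
            "count_order": int(n),
        })
    return result

def row(flag, status, qty, price, disc, tax, ship):
    """Build one hypothetical lineitem row."""
    return {"l_returnflag": flag, "l_linestatus": status, "l_quantity": qty,
            "l_extendedprice": price, "l_discount": disc, "l_tax": tax,
            "l_shipdate": ship}

# Hypothetical sample data; the last row falls after the cutoff and is dropped.
sample = [
    row("A", "F", 10, 100.0, 0.5, 0.25, date(1998, 1, 1)),
    row("A", "F", 20, 200.0, 0.0, 0.0, date(1998, 6, 1)),
    row("N", "O", 5, 40.0, 0.25, 0.5, date(1998, 8, 1)),
    row("N", "O", 7, 10.0, 0.0, 0.0, date(1998, 11, 30)),
]

for out in tpch_q1(sample, delta_days=90):
    print(out)
```

The per-row work is a handful of multiplications and additions while every row of `lineitem` must be read, which is why data movement rather than raw compute tends to dominate a query of this kind.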
[Figure: TPC-H Q1 UDF Benchmark Test: iGPU vs dGPUs vs CPU. Axes: Time (ms) and Throughput (GB/s/W).]
analytics queries
il.do?callMethod=toJobDetail&jobID=43263