GPU-Accelerated Analytics on your Data Lake (@blazingdb presentation)

SLIDE 1

GPU-Accelerated Analytics on your Data Lake.

SLIDE 2

Data Lake

@blazingdb

SLIDE 3

Data Swamp

SLIDE 4

ETL Hell

[Figure: DATA LAKE diagram with streams of binary data]

SLIDE 5

COMMON DATA LAYER

SLIDE 6

Simplify Data Storage

SCHEMA METADATA DATA

SLIDE 7

SQL Warehouse on Data Lake

SLIDE 8

BlazingDB – How it works

  • Compression/Decompression
  • Filtering (Predicate Pushdown)
  • Aggregations
  • Transformations
  • Joins
  • Sorting/Ordering
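
A minimal sketch of the filtering (predicate pushdown) item in the list above. It assumes data is stored in chunks that carry min/max statistics, so a scan can skip chunks that cannot match the filter. The slide does not say how BlazingDB implements pushdown; `Chunk` and `scan_with_pushdown` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Chunk:
    """Hypothetical storage chunk holding values of one column."""
    values: List[int]
    # Statistics recorded once, when the chunk is written.
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self) -> None:
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_with_pushdown(chunks: List[Chunk], upper_bound: int) -> Tuple[List[int], int]:
    """Evaluate `value <= upper_bound`, skipping chunks that cannot match.

    Returns the matching values and the number of chunks actually read.
    """
    matches: List[int] = []
    chunks_read = 0
    for chunk in chunks:
        if chunk.min_value > upper_bound:
            continue  # pruned using the statistics alone
        chunks_read += 1
        matches.extend(v for v in chunk.values if v <= upper_bound)
    return matches, chunks_read


chunks = [Chunk([1, 3, 5]), Chunk([10, 12, 14]), Chunk([4, 20, 30])]
matches, chunks_read = scan_with_pushdown(chunks, upper_bound=5)
print(matches, chunks_read)
```

In this toy run the middle chunk is never read, because its minimum already exceeds the bound.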

  • RAM Cache (Hot)
  • Disk Cache (Medium)
  • HDD
  • SSD

DATA LAKE: Local Disk, HDFS, AWS S3
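
The cache tiers above can be sketched as a read-through lookup: RAM first (hot), then the local disk cache (medium), then the data lake itself (cold). This is only an illustration; the class name `TieredCache`, the dictionaries standing in for each tier, and the policy of promoting data on every miss are assumptions, not BlazingDB's implementation.

```python
class TieredCache:
    """Read-through cache: RAM (hot) -> disk (medium) -> data lake (cold)."""

    def __init__(self, data_lake):
        self.ram = {}               # stand-in for the RAM cache
        self.disk = {}              # stand-in for the HDD/SSD disk cache
        self.data_lake = data_lake  # stand-in for local disk, HDFS or AWS S3

    def read(self, key):
        """Return (value, tier), where tier says where the value was found."""
        if key in self.ram:
            return self.ram[key], "hot"
        if key in self.disk:
            value = self.disk[key]
            self.ram[key] = value   # promote to RAM for the next read
            return value, "medium"
        value = self.data_lake[key]  # slowest path: fetch from the data lake
        self.disk[key] = value
        self.ram[key] = value
        return value, "cold"

    def evict_ram(self):
        """Drop the RAM tier, leaving only the disk cache populated."""
        self.ram.clear()


cache = TieredCache({"lineitem/part-0": b"0001010100001001011010110"})
tiers = [cache.read("lineitem/part-0")[1]]      # nothing cached yet
tiers.append(cache.read("lineitem/part-0")[1])  # served from RAM
cache.evict_ram()
tiers.append(cache.read("lineitem/part-0")[1])  # served from the disk cache
print(tiers)
```

The three reads correspond to the Cold, Hot, and Medium (disk cache only) cases that the demo slides measure.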

SLIDE 9

BlazingDB Multi-nodal Cluster

SLIDE 10

Shared Data Architecture

[Figure: DATA LAKE diagram]

SLIDE 11

The Nays

  • No Vendor Lock-in
  • No Consistency Management
  • No BlazingDB Specific ETL
  • No Duplication
  • No Ingest

SLIDE 12

The Yays

  • High Concurrency
  • Data Sharing (Across Clusters And Other Tools)
  • Multi-Terabyte Queries
  • Scalable, On Demand Data Warehouse
  • Incredibly Fast SQL

SLIDE 13

DEMO

SLIDE 14

Demo - Architecture

  • HDFS on Azure
  • Azure GPU Servers NC24 V1
  • 4 Servers

SLIDE 15

Queries: BlazingDB 4 Node Query times in seconds (Lower is better)

Query      Cold     Medium (Disk cache only)   Hot
Query 1    142.1    73.6                       72
Query 2    281.1    154.1                      63.1
Query 3    380.5    251.8                      14
Query 4    135.5    73.8                       12.2
Query 5    46       46.3                       14.9
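
To make the chart figures easier to compare, the sketch below computes the cold-to-hot ratio per query. It assumes the fifteen values on the slide are listed series by series (five Cold values, then five Medium, then five Hot); the slide text does not label them individually, so this ordering is an assumption.

```python
# Values copied from the slide in the order they appear.
raw = [142.1, 281.1, 380.5, 135.5, 46,
       73.6, 154.1, 251.8, 73.8, 46.3,
       72, 63.1, 14, 12.2, 14.9]

queries = [f"Query {i}" for i in range(1, 6)]

# Assumption: five Cold values, then five Medium, then five Hot.
cold, medium, hot = raw[0:5], raw[5:10], raw[10:15]

speedups = {}
for name, c, m, h in zip(queries, cold, medium, hot):
    speedups[name] = round(c / h, 1)  # how many times faster a hot run is
    print(f"{name}: cold {c}s, medium {m}s, hot {h}s, "
          f"cold/hot = {speedups[name]}x")
```

Under this reading, Query 3 shows the largest gap between cold and hot runs, and Query 1 the smallest.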

SLIDE 16

Query 1

[Chart: Query 1 time in seconds for Cold, Medium (Disk cache only), and Hot runs]

select l_returnflag, l_linestatus,
  sum(l_quantity) as sum_qty,
  sum(l_extendedprice) as sum_base_price,
  sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
  sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
  avg(l_quantity) as avg_qty,
  avg(l_extendedprice) as avg_price,
  avg(l_discount) as avg_disc,
  count(l_quantity) as count_order
from lineitem
where l_shipdate <= '1995-06-01'
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;

Query 1 Data Points

  • 6 billion row table
  • Many aggregations/transformations
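
To see the shape of Query 1's result, a trimmed version of it can be run on a toy table. This sketch uses Python's built-in `sqlite3` module as a stand-in engine; it is not BlazingDB, the four rows are made up, and it says nothing about performance on the 6 billion row table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """create table lineitem (
           l_returnflag text, l_linestatus text, l_quantity real,
           l_extendedprice real, l_discount real, l_tax real,
           l_shipdate text)"""
)

# Made-up rows for illustration only.
rows = [
    ("A", "F", 10, 100.0, 0.10, 0.05, "1994-01-15"),
    ("A", "F", 20, 200.0, 0.00, 0.05, "1995-05-30"),
    ("N", "O", 5, 50.0, 0.05, 0.00, "1995-06-01"),
    ("N", "O", 7, 70.0, 0.05, 0.00, "1996-01-01"),  # excluded by the date filter
]
conn.executemany("insert into lineitem values (?, ?, ?, ?, ?, ?, ?)", rows)

# Trimmed Query 1: ISO-formatted date strings compare correctly as text.
result = conn.execute(
    """select l_returnflag, l_linestatus,
              sum(l_quantity) as sum_qty,
              sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
              avg(l_discount) as avg_disc,
              count(l_quantity) as count_order
       from lineitem
       where l_shipdate <= '1995-06-01'
       group by l_returnflag, l_linestatus
       order by l_returnflag, l_linestatus"""
).fetchall()

for row in result:
    print(row)
```

The query returns one row per (l_returnflag, l_linestatus) pair, so the output stays tiny no matter how many rows are scanned.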

SLIDE 17

Query 2

[Chart: Query 2 time in seconds for Cold, Medium (Disk cache only), and Hot runs]

select lineitem.l_orderkey,
  sum(lineitem.l_extendedprice*(1-lineitem.l_discount)) as revenue,
  orders.o_orderdate, orders.o_shippriority
from customer
inner join orders on customer.c_custkey = orders.o_custkey
inner join lineitem on lineitem.l_orderkey = orders.o_orderkey
where customer.c_mktsegment = 'BUILDING'
  and orders.o_orderdate < '1995-03-15'
  and lineitem.l_shipdate > '1995-03-15'
group by lineitem.l_orderkey, orders.o_orderdate, orders.o_shippriority
order by revenue desc, orders.o_orderdate;

Query 2 Data Points

  • Join 6B rows to 1.5B rows to 150M rows
  • Many aggregations/transformations
  • Order (sorting)

SLIDE 18

Query 3

[Chart: Query 3 time in seconds for Cold, Medium (Disk cache only), and Hot runs]

select nation.name,
  sum(lineitem.l_extendedprice * (1 - lineitem.l_discount)) as revenue
from customer
inner join orders on customer.c_custkey = orders.o_custkey
inner join lineitem on lineitem.l_orderkey = orders.o_orderkey
inner join supplier on lineitem.l_suppkey = supplier.s_suppkey
inner join nation on supplier.s_nationkey = nation.nation_key
inner join region on nation.region_key = region.r_regionkey
where supplier.s_nationkey = nation.nation_key
  and region.r_name = 'ASIA'
  and orders.o_orderdate >= '19940101'
  and orders.o_orderdate < '19950101'
group by nation.name
order by revenue desc

Query 3 Data Points

  • Join 6B rows to 1.5B rows to 150M rows (and many small joins)
  • Multiple aggregations/transformations
  • Order (sorting)

SLIDE 19

Query 4

[Chart: Query 4 time in seconds for Cold, Medium (Disk cache only), and Hot runs]

select sum(l_extendedprice) as sum_exprice,
  sum(l_discount) as sum_discount
from lineitem
where l_shipdate >= '19940101'
  and l_shipdate < '19950101'
  and l_discount >= 0.05
  and l_discount <= 0.07
  and l_quantity < 24

Query 4 Data Points

  • 6B row table
  • Multiple aggregations/transformations

SLIDE 20

Query 5

[Chart: Query 5 time in seconds for Cold, Medium (Disk cache only), and Hot runs]

select supplier.s_acctbal, supplier.s_suppkey, nation.name,
  part.p_partkey, part.p_mfgr, supplier.s_address, supplier.s_phone,
  supplier.s_comment
from supplier
inner join partsupp on supplier.s_suppkey = partsupp.ps_suppkey
inner join nation on supplier.s_nationkey = nation.nation_key
inner join region on nation.region_key = region.r_regionkey
inner join part on part.p_partkey = partsupp.ps_partkey
where part.p_size = 15
  and part.p_type in (
    'ECONOMY ANODIZED BRASS', 'ECONOMY BRUSHED BRASS',
    'ECONOMY BURNISHED BRASS', 'ECONOMY PLATED BRASS',
    'ECONOMY POLISHED BRASS', 'LARGE ANODIZED BRASS',
    'LARGE BRUSHED BRASS', 'LARGE BURNISHED BRASS',
    'LARGE PLATED BRASS', 'LARGE POLISHED BRASS',
    'SMALL ANODIZED BRASS', 'SMALL BRUSHED BRASS',
    'SMALL BURNISHED BRASS', 'SMALL PLATED BRASS',
    'SMALL POLISHED BRASS', 'STANDARD ANODIZED BRASS',
    'STANDARD BRUSHED BRASS', 'STANDARD BURNISHED BRASS',
    'STANDARD PLATED BRASS', 'STANDARD POLISHED BRASS')
  and region.r_name = 'EUROPE'
order by supplier.s_acctbal desc, supplier.s_suppkey, nation.name,
  part.p_partkey

Query 5 Data Points

  • Join multiple tables
  • Many aggregations/transformations
  • String comparisons

SLIDE 21

Data Pipeline

  • GPU Data Frame
  • Apache Arrow
  • Common Data Layer
  • INGEST
  • STORAGE (Data Lake)

Coming Soon

SLIDE 22

Questions?

SLIDE 23