GPU-Accelerated Analytics on your Data Lake.
Data Lake
@blazingdb
Data Swamp
ETL Hell
DATA LAKE
[Diagram: binary data streams]
COMMON DATA LAYER
Simplify Data Storage
SCHEMA METADATA DATA
SQL Warehouse on Data Lake
BlazingDB – How it works
- Compression/Decompression
- Filtering (Predicate Pushdown)
- Aggregations
- Transformations
- Joins
- Sorting/Ordering
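Filtering with predicate pushdown, listed above, can be illustrated with a minimal Python sketch. It assumes the data is stored in chunks that expose min/max statistics for a column, so chunks that cannot match the filter are skipped before any rows are read. The `Chunk` class and `scan_with_pushdown` function are hypothetical names for illustration only; they are not BlazingDB's API, and the slides do not describe its actual mechanism.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    """Hypothetical storage chunk of l_shipdate values (ISO date strings)."""
    shipdates: List[str]

    @property
    def min_date(self) -> str:
        # Computed on the fly here; a real file format would typically
        # keep such statistics in metadata (assumption).
        return min(self.shipdates)

    @property
    def max_date(self) -> str:
        return max(self.shipdates)


def scan_with_pushdown(chunks: List[Chunk], cutoff: str) -> Tuple[List[str], int]:
    """Return rows with shipdate <= cutoff and the number of chunks read."""
    rows: List[str] = []
    chunks_read = 0
    for chunk in chunks:
        # Pushdown step: if even the smallest value fails the predicate,
        # no row in this chunk can match, so skip it entirely.
        if chunk.min_date > cutoff:
            continue
        chunks_read += 1
        # ISO dates compare correctly as plain strings.
        rows.extend(d for d in chunk.shipdates if d <= cutoff)
    return rows, chunks_read


chunks = [
    Chunk(["1994-01-10", "1994-07-02"]),
    Chunk(["1995-05-30", "1995-06-15"]),
    Chunk(["1996-02-01", "1996-09-09"]),
]
matched, read = scan_with_pushdown(chunks, "1995-06-01")
print(matched, read)
```

In this toy run the third chunk is never scanned, which is the whole point of pushing the filter down to the storage layer.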
DATA LAKE
- RAM Cache (Hot)
- Disk Cache (Medium)
- HDD
- SSD
Local Disk, HDFS, AWS S3
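The hot, medium, and cold paths on this slide can be pictured with a small Python sketch. It assumes a read-through design in which a miss in RAM falls back to a disk cache, and a miss there falls back to the data lake. `TieredCache` and `fake_lake` are hypothetical names; the slides do not say how BlazingDB implements or evicts its caches, and this sketch has no eviction at all.

```python
from typing import Callable, Dict, Tuple


class TieredCache:
    """Hypothetical two-tier cache in front of a data lake reader."""

    def __init__(self, read_from_lake: Callable[[str], bytes]) -> None:
        self.ram: Dict[str, bytes] = {}   # hot tier
        self.disk: Dict[str, bytes] = {}  # medium tier (a dict stands in for HDD/SSD)
        self.read_from_lake = read_from_lake  # cold path: local disk, HDFS, or S3

    def get(self, key: str) -> Tuple[bytes, str]:
        """Return the data and the name of the tier that served it."""
        if key in self.ram:
            return self.ram[key], "hot"
        if key in self.disk:
            data = self.disk[key]
            self.ram[key] = data  # promote to the hot tier
            return data, "medium"
        # Cold read: fetch from the lake and populate both cache tiers.
        data = self.read_from_lake(key)
        self.disk[key] = data
        self.ram[key] = data
        return data, "cold"

    def drop_ram(self) -> None:
        """Simulate losing the RAM cache while the disk cache survives."""
        self.ram.clear()


def fake_lake(key: str) -> bytes:
    # Stand-in for a remote read; real code would call a storage client.
    return f"contents of {key}".encode()


cache = TieredCache(fake_lake)
tiers = [cache.get("lineitem/part-0")[1]]      # first read goes to the lake
tiers.append(cache.get("lineitem/part-0")[1])  # second read is served from RAM
cache.drop_ram()
tiers.append(cache.get("lineitem/part-0")[1])  # RAM is empty, disk cache answers
print(tiers)
```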
BlazingDB Multi-nodal Cluster
Shared Data Architecture
DATA LAKE
The Nays
- No Vendor Lock-in
- No Consistency Management
- No BlazingDB-Specific ETL
- No Duplication
- No Ingest
The Yays
- High Concurrency
- Data Sharing (Across Clusters and Other Tools)
- Multi-Terabyte Queries
- Scalable, On-Demand Data Warehouse
- Incredibly Fast SQL
DEMO
Demo - Architecture
- HDFS on Azure
- Azure GPU Servers NC24 V1
- 4 Servers
Queries: BlazingDB 4-Node Query Times in Seconds (Lower Is Better)
Query     Cold    Medium (Disk cache only)   Hot
Query 1   142.1   73.6                       72
Query 2   281.1   154.1                      63.1
Query 3   380.5   251.8                      14
Query 4   135.5   73.8                       12.2
Query 5   46      46.3                       14.9
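To make the chart easier to read, the short Python sketch below computes how much faster each query runs hot than cold. It assumes the fifteen values in the chart are listed series by series (all Cold values, then all Medium, then all Hot), which matches the legend order but is an inference from the extracted slide text.

```python
# Timings in seconds as (cold, medium, hot).
# Assumption: the chart lists its values one series at a time, in legend order.
timings = {
    "Query 1": (142.1, 73.6, 72.0),
    "Query 2": (281.1, 154.1, 63.1),
    "Query 3": (380.5, 251.8, 14.0),
    "Query 4": (135.5, 73.8, 12.2),
    "Query 5": (46.0, 46.3, 14.9),
}

# Speedup of the hot (RAM cache) run relative to the cold run.
speedups = {
    query: round(cold / hot, 1)
    for query, (cold, _medium, hot) in timings.items()
}

for query, factor in speedups.items():
    print(f"{query}: {factor}x faster hot than cold")

best = max(speedups, key=speedups.get)
print("Largest cold-to-hot gain:", best)
```

Under that reading, the queries with multi-table joins gain the most from caching, while Query 1 gains little beyond what the disk cache already provides.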
Query 1
select
  l_returnflag,
  l_linestatus,
  sum(l_quantity) as sum_qty,
  sum(l_extendedprice) as sum_base_price,
  sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
  sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
  avg(l_quantity) as avg_qty,
  avg(l_extendedprice) as avg_price,
  avg(l_discount) as avg_disc,
  count(l_quantity) as count_order
from lineitem
where l_shipdate <= '1995-06-01'
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;
Query 1
Data Points
- 6 billion row table
- Many aggregations/transformations
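For readers who want to see what this kind of filter-then-aggregate query returns, here is a toy run using Python's built-in `sqlite3` module. It only illustrates the query shape on four made-up rows; it says nothing about BlazingDB's engine or performance, and the `sum_discounted` alias is a hypothetical name chosen for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    create table lineitem (
        l_returnflag text, l_linestatus text, l_quantity real,
        l_extendedprice real, l_discount real, l_tax real, l_shipdate text
    )
    """
)

# Made-up rows for illustration; the last one falls outside the date filter.
rows = [
    ("A", "F", 10, 100.0, 0.10, 0.05, "1994-03-01"),
    ("A", "F", 20, 200.0, 0.00, 0.05, "1995-01-15"),
    ("N", "O", 5, 50.0, 0.20, 0.10, "1995-05-20"),
    ("N", "O", 7, 70.0, 0.00, 0.10, "1996-01-01"),
]
conn.executemany("insert into lineitem values (?, ?, ?, ?, ?, ?, ?)", rows)

# A cut-down version of the query above: filter, group, aggregate, order.
result = conn.execute(
    """
    select l_returnflag, l_linestatus,
           sum(l_quantity) as sum_qty,
           sum(l_extendedprice * (1 - l_discount)) as sum_discounted,
           avg(l_quantity) as avg_qty,
           count(l_quantity) as count_order
    from lineitem
    where l_shipdate <= '1995-06-01'
    group by l_returnflag, l_linestatus
    order by l_returnflag, l_linestatus
    """
).fetchall()

for row in result:
    print(row)
```

Each output row is one (return flag, line status) group; on the 6 billion row table the same plan has to scan and aggregate every qualifying row.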
Query 2
select
  lineitem.l_orderkey,
  sum(lineitem.l_extendedprice*(1-lineitem.l_discount)) as revenue,
  orders.o_orderdate,
  orders.o_shippriority
from customer
inner join orders on customer.c_custkey = orders.o_custkey
inner join lineitem on lineitem.l_orderkey = orders.o_orderkey
where customer.c_mktsegment = 'BUILDING'
  and orders.o_orderdate < '1995-03-15'
  and lineitem.l_shipdate > '1995-03-15'
group by lineitem.l_orderkey, orders.o_orderdate, orders.o_shippriority
order by revenue desc, orders.o_orderdate;
Query 2
Data Points
- Join 6B rows to 1.5B rows to 150M rows
- Many aggregations/transformations
- Order (sorting)
Query 3
select
  nation.name,
  sum(lineitem.l_extendedprice * (1 - lineitem.l_discount)) as revenue
from customer
inner join orders on customer.cust_key = orders.o_custkey
inner join lineitem on lineitem.l_orderkey = orders.o_orderkey
inner join supplier on lineitem.l_suppkey = supplier.s_suppkey
inner join nation on supplier.s_nationkey = nation.nation_key
inner join region on nation.region_key = region.r_regionkey
where supplier.s_nationkey = nation.nation_key
  and region.r_name = 'ASIA'
  and orders.o_orderdate >= '19940101'
  and orders.o_orderdate < '19950101'
group by nation.name
order by revenue desc
Query 3
Data Points
- Join 6B rows to 1.5B rows to 150M rows (and many small joins)
- Multiple aggregations/transformations
- Order (sorting)
Query 4
select
  sum(l_extendedprice) as sum_exprice,
  sum(l_discount) as sum_discount
from lineitem
where l_shipdate >= '19940101'
  and l_shipdate < '19950101'
  and l_discount >= 0.05
  and l_discount <= 0.07
  and l_quantity < 24
Query 4
Data Points
- 6B row table
- Multiple aggregations/transformations
Query 5
select
  supplier.s_acctbal,
  supplier.s_suppkey,
  nation.name,
  part.p_partkey,
  part.p_mfgr,
  supplier.s_address,
  supplier.s_phone,
  supplier.s_comment
from supplier
inner join partsupp on supplier.s_suppkey = partsupp.ps_suppkey
inner join nation on supplier.s_nationkey = nation.nation_key
inner join region on nation.region_key = region.r_regionkey
inner join part on part.p_partkey = partsupp.ps_partkey
where part.p_size = 15
  and part.p_type in (
    'ECONOMY ANODIZED BRASS', 'ECONOMY BRUSHED BRASS', 'ECONOMY BURNISHED BRASS',
    'ECONOMY PLATED BRASS', 'ECONOMY POLISHED BRASS',
    'LARGE ANODIZED BRASS', 'LARGE BRUSHED BRASS', 'LARGE BURNISHED BRASS',
    'LARGE PLATED BRASS', 'LARGE POLISHED BRASS',
    'SMALL ANODIZED BRASS', 'SMALL BRUSHED BRASS', 'SMALL BURNISHED BRASS',
    'SMALL PLATED BRASS', 'SMALL POLISHED BRASS',
    'STANDARD ANODIZED BRASS', 'STANDARD BRUSHED BRASS', 'STANDARD BURNISHED BRASS',
    'STANDARD PLATED BRASS', 'STANDARD POLISHED BRASS')
  and region.r_name = 'EUROPE'
order by supplier.s_acctbal desc, supplier.s_suppkey, nation.name,
Query 5
Data Points
- Join multiple tables
- Many aggregations/transformations
- String comparisons
Data Pipeline
GPU Data Frame
Apache Arrow
Common Data Layer
INGEST
STORAGE (Data Lake)
Coming Soon
Questions?