
SLIDE 1

Apache Ignite as MPP Accelerator

Alexander Ermakov, CTO

SLIDE 2

Agenda

About us
Why does a traditional DWH need an in-memory grid?
Real Time Analytics for Telco Cases
Integrating Apache Ignite with Arenadata DB
Using the power of in-memory computing with MPP (Example)

SLIDE 3

<About us>

SLIDE 4

Who are we?

Arenadata unites a keen team of developers & engineers working on building an enterprise data platform.

We contribute to open source projects:
Greenplum
Apache PXF
Apache Bigtop

Members of ODPi (Linux Foundation) since 2015

SLIDE 5

ODPi Compliant Platforms

SLIDE 6

Arenadata Enterprise Data Platform

Platform Extension Framework

SLIDE 7

Arenadata - Open Source

store.arenadata.io

SLIDE 8

Our Partners

SLIDE 9

Why does a DWH need an in-memory grid?

SLIDE 10

New Generation of Business Cases

Gene Sequencing: cost to sequence one genome has fallen from $100M in 2001 to $10K in 2011 to $1K in 2014
Smart Grids: reading smart meters every 15 minutes is 3000x more data intensive
Social Media: Facebook uploads 250 million photos each day
Oil Exploration: oil rigs generate 25000 data points per second
Also shown: Stock Market, Video Surveillance, Medical Imaging, Mobile Sensors

SLIDE 11

Data Value Chain

Time scale: ms, seconds, hours, weeks, months, year, years+

SLIDE 12

Data Warehouse

Stages: Sources, Transport, Store, Analyze, Transform
Components: OLTP, ES, API, Batch, CDC, ELT & DQ, ODS, DDS, Data Mart, DWH, SP, Table, BI

SLIDE 13

Data Lake

Stages: Sources, Transport, Store, Analyze, Transform
Components: OLTP, ES, API, …, Batch, CDC, API, Queue, ELT & DQ, ODS, DDS, Data Mart, DWH, SP, Table, Hadoop, HDFS, SQL On Hadoop, BI

SLIDE 14

Lambda Architecture

Stages: Sources, Transport, Store, Analyze, Transform
Components: …, ES, API, CDC, Batch, Queue, STG, Batch App, Real Time App, Hadoop, HDFS, SQL On Hadoop, ELT & DQ, ODS, DDS, Data Mart, DWH, BI

SLIDE 15

Kappa Architecture

Stages: Sources, Transport, Store, Analyze, Transform
Components: …, Queue, STG, Batch App, Real Time App, BI

SLIDE 16

Real Time Analytics for Telco Cases

SLIDE 17

Customer Retention / Connection Breakdowns

SLIDE 18

Geo Marketing

SLIDE 19

Migrating from a Reactive, Static and Constrained Model…

Data Lake (HDFS)

Ingest, Store, Analytics

Hard to change
Labor intensive
Inefficient
Coding based
No real-time information
Based on expensive ETL

SLIDE 20

To Pro-Active, Self-Improving, Machine Learning Systems

Data Lake (HDFS)

Expert System / Machine Learning
In-Memory Real-Time Data
Continuous Learning, Continuous Improvement, Continuous Adapting

Data Stream Pipeline
Multiple Data Sources, Real-Time Processing, Store Everything

SLIDE 21

Sandboxes

Data Feeds, Stream Processing, Expert Systems, Machine Learning, Historical Data, Business Value, Smart Decisions

Data Lake (HDFS)

SLIDE 22

Data Streaming Reference Architecture

Data Feeds, Transactional Apps, Analytic Apps, Data Stream Pipeline, Real Time Data & Distributed Computing, Expert Systems & Machine Learning, Advanced Analytics, Data Lake

SLIDE 23

Data Streaming Reference Architecture

Data Feeds, Transactional Apps, Analytic Apps, Data Lake

SLIDE 24

Integrate Apache Ignite with Arenadata DB

SLIDE 25

Arenadata Grid

SLIDE 26

Arenadata Grid Use Cases

SLIDE 27

Arenadata DB Architecture

Flexible framework for processing large datasets

Master Host and Standby Master Host: the Master coordinates work with Segment Hosts
Segment Host with one or more Segment Instances: Segment Instances process queries in parallel
Segment Hosts have their own CPU, disk and memory (shared nothing)
High speed interconnect for continuous pipelining of data processing

Diagram labels: SQL, Master Host, Standby Master, Segment (x5), …

SLIDE 28

Greenplum Core Development

Zstandard support (will be added to stable at 6.0.0 due to naming convention)
PXF development, where we bet a lot: Ignite integration, pushdown feature, JDBC & Ignite stable release
A few bugs and a lot of issues

SLIDE 29

Parallel Query Optimizer

Cost-based optimization looks for the most efficient plan
Physical plan contains scans, joins, sorts, aggregations, etc.
Global planning avoids sub-optimal ‘SQL pushing’ to segments
Directly inserts ‘motion’ nodes for inter-segment communication

PHYSICAL EXECUTION PLAN FROM SQL OR MAPREDUCE

Gather Motion 4:1 (Slice 3)
Sort
HashAggregate
HashJoin
Redistribute Motion 4:4 (Slice 1)
HashJoin
Hash
Hash
HashJoin
Hash
Broadcast Motion 4:4 (Slice 2)
Seq Scan on motion
Seq Scan on customer
Seq Scan on line item
Seq Scan on orders
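
The three motion node types named in the plan (Redistribute, Broadcast, Gather) can be illustrated with a small sketch. This is not Greenplum code: the segment count of 4 is read off the "4:4" and "4:1" labels, `crc32` is only a stand-in for whatever hash the database uses, and the function names and sample rows are hypothetical.

```python
from zlib import crc32

SEGMENTS = 4  # taken from the "4:4" / "4:1" labels in the plan


def redistribute(rows, key):
    """Redistribute Motion N:N: route each row to a segment by hashing its key."""
    out = [[] for _ in range(SEGMENTS)]
    for row in rows:
        # crc32 is an illustrative hash, not the one Greenplum uses.
        target = crc32(str(row[key]).encode()) % SEGMENTS
        out[target].append(row)
    return out


def broadcast(rows):
    """Broadcast Motion N:N: every segment receives a full copy of the rows."""
    return [list(rows) for _ in range(SEGMENTS)]


def gather(per_segment):
    """Gather Motion N:1: the master concatenates the output of all segments."""
    return [row for segment in per_segment for row in segment]


# Hypothetical sample data: ten orders spread over three customers.
orders = [{"o_id": i, "cust": i % 3} for i in range(10)]
by_cust = redistribute(orders, "cust")
copies = broadcast(orders)
result = gather(by_cust)
print([len(seg) for seg in by_cust], len(copies), len(result))
```

After redistribution all rows with the same join key sit on the same segment, which is what lets each segment run its HashJoin locally. Broadcasting is the alternative when one side of the join is small enough to copy everywhere.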

SLIDE 30

MADlib: Toolkit for Advanced Big Data Analytics

Better Parallelism

– Algorithms designed to leverage MPP or Hadoop architecture

Better Scalability

– Algorithms scale as your data set scales
– No data movement

Better Predictive Accuracy

– Using all data, not a sample, may improve accuracy

Open Source

– Available for customization and optimization by user

SLIDE 31

MADlib In-Database Functions

Predictive Modeling Library

Linear Systems
– Sparse and Dense Solvers

Matrix Factorization
– Singular Value Decomposition (SVD)

Generalized Linear Models
– Linear Regression
– Logistic Regression
– Multinomial Logistic Regression
– Cox Proportional Hazards Regression
– Elastic Net Regularization
– Sandwich Estimators (Huber-White, clustered, marginal effects)

Machine Learning Algorithms
– ARIMA
– Principal Component Analysis (PCA)
– Association Rules (Affinity Analysis, Market Basket)
– Topic Modeling (Parallel LDA)
– Decision Trees
– Ensemble Learners (Random Forests)
– Support Vector Machines
– Conditional Random Field (CRF)
– Clustering (K-means)
– Cross Validation

Descriptive Statistics

Sketch-based Estimators
– CountMin (Cormode-Muthukrishnan)
– FM (Flajolet-Martin)
– MFV (Most Frequent Values)

Correlation
Summary

Support Modules
– Array Operations
– Sparse Vectors
– Random Sampling
– Probability Functions
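
To make the CountMin entry concrete, here is a minimal sketch of a Count-Min frequency estimator. It is not MADlib's implementation: the class name, the default width and depth, and the use of salted BLAKE2 hashes are all assumptions chosen for illustration.

```python
import hashlib


class CountMin:
    """Approximate frequency counter; estimates never undercount."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One counter per row, picked by a hash salted with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(str(item).encode(), salt=bytes([row])).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._cells(item))


# Hypothetical stream of user ids.
sketch = CountMin()
for user_id in [16, 16, 16, 16, 16, 500, 500]:
    sketch.add(user_id)
print(sketch.estimate(16), sketch.estimate(500))
```

The memory used is fixed by `width * depth` regardless of how many distinct items arrive, which is why such estimators suit in-database statistics over large tables.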

SLIDE 32

Polymorphic Table Storage

Single table
– Historical data (years): slow HDD
– Actual data (months): regular HDD
– Now data (hours): SSD

Provide the choice of processing model for any table or any individual partition
– Enable Information Lifecycle Management (ILM)

Storage types can be mixed within a table or database
– Four table types: heap, row-oriented AO, column-oriented, external
– Block compression: Gzip (levels 1-9), Zstd
– Columnar compression: RLE
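
As an illustration of why RLE suits column-oriented storage, the sketch below run-length encodes a single column. It shows the idea only, not Greenplum's on-disk format, and the sample column values are hypothetical.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse each run of equal adjacent values into (value, run_length)."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical column with long runs, as is typical for sorted or low-cardinality data.
delivery_state = ["MSK"] * 4 + ["SPB"] * 2 + ["MSK"]
encoded = rle_encode(delivery_state)
print(encoded)
```

Values stored column by column tend to repeat in long runs, so seven values collapse into three pairs here. Row-oriented storage interleaves columns and rarely produces such runs.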

SLIDE 33

Platform eXtension Framework (PXF)

An advanced version of Greenplum external tables

Supports connectors for HDFS, HBase, Hive, JDBC and Ignite (Arenadata DB)

Provides an extensible framework API to enable custom connectors

SLIDE 34

PXF Profiles

HDFS Files
Ignite
JDBC
Avro
HBase
Hive

– Text based
– SequenceFile
– RCFile
– ORCFile

CREATE EXTERNAL TABLE pxf_sales_part (
    item_name      TEXT,
    item_type      TEXT,
    supplier_key   INTEGER,
    item_price     DOUBLE PRECISION,
    delivery_state TEXT,
    delivery_city  TEXT
)
LOCATION ('pxf://grid_host?Profile=Ignite&IGNITE_CACHE=test&BUFFER_SIZE=10000')
-- The FORMAT clause is not on the slide; Greenplum requires one, and PXF tables normally use this formatter.
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
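
The LOCATION string packs the target and the profile options into one URI. The sketch below splits such a string into its parts; it only illustrates the URI layout and is not part of PXF. The function name is hypothetical, and treating option names as case-insensitive is an assumption.

```python
from urllib.parse import urlsplit, parse_qs


def parse_pxf_location(uri):
    """Split a pxf:// LOCATION string into its target and an option map."""
    parts = urlsplit(uri)
    if parts.scheme != "pxf":
        raise ValueError(f"not a PXF location: {uri!r}")
    # Assumption: option names are case-insensitive, so normalise to upper case.
    options = {name.upper(): values[0] for name, values in parse_qs(parts.query).items()}
    return {"target": parts.netloc + parts.path, "options": options}


loc = parse_pxf_location(
    "pxf://grid_host?Profile=Ignite&IGNITE_CACHE=test&BUFFER_SIZE=10000"
)
print(loc)
```

Here `Profile` selects the connector, while `IGNITE_CACHE` and `BUFFER_SIZE` are options passed through to it.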

SLIDE 35

PXF Profiles

<profile>
    <name>Ignite</name>
    <plugins>
        <fragmenter>IgniteFragmenter</fragmenter>
        <accessor>IgniteAccessor</accessor>
        <resolver>IgniteResolver</resolver>
        <analyzer>IgniteAnalyzer</analyzer>
    </plugins>
</profile>

SLIDE 36

PXF Classes

Fragmenter – returns a list of source data fragments and their locations
Accessor – accesses a given list of fragments, reads them and returns records
Resolver – deserializes each record according to a given schema or technique
Analyzer – returns statistics about the source data
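
The division of labour between the four plugin roles can be sketched as a small pipeline. The real PXF plugins are Java classes with their own interfaces; everything below (class shapes, method names, the in-memory `SOURCE`, the schema) is a hypothetical stand-in that only mirrors the roles described above.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List

# Hypothetical external system: two hosts, each holding raw comma-separated lines.
SOURCE: Dict[str, List[str]] = {
    "host1": ["21-01-2018,16,hello", "01-11-2018,500,world"],
    "host2": ["17-09-2017,15,again"],
}


@dataclass
class Fragment:
    location: str  # where the fragment lives
    index: int     # position of the fragment at that location


class Fragmenter:
    def fragments(self) -> List[Fragment]:
        # One fragment per stored line keeps the sketch simple.
        return [Fragment(host, i) for host, lines in SOURCE.items() for i in range(len(lines))]


class Accessor:
    def read(self, fragments: List[Fragment]) -> Iterator[str]:
        for fragment in fragments:
            yield SOURCE[fragment.location][fragment.index]


class Resolver:
    def __init__(self, schema):
        self.schema = schema  # list of (column name, cast function)

    def resolve(self, raw: str) -> dict:
        return {name: cast(value) for (name, cast), value in zip(self.schema, raw.split(","))}


class Analyzer:
    def stats(self) -> dict:
        return {"locations": len(SOURCE), "rows": sum(len(lines) for lines in SOURCE.values())}


schema = [("date", str), ("user_id", int), ("message", str)]
resolver = Resolver(schema)
rows = [resolver.resolve(raw) for raw in Accessor().read(Fragmenter().fragments())]
print(rows, Analyzer().stats())
```

Because the fragmenter reports locations, the database can hand different fragments to different segments and read them in parallel.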

SLIDE 37

PXF Pushdown Feature

Date        User_id  Message
21-01-2018  16       <message>
01-11-2018  500      <message>
15-05-2018  2042     <message>
17-09-2017  15       <message>
15-06-2016  55       <message>
24-12-2015  3510     <message>
01-01-2012  19       <message>
26-04-2013  42       <message>
23-05-2010  17       <message>

Storage                Latency          Cost per GB
Grid external table    milliseconds     $$$
Hadoop external table  tens of seconds  $
Regular ADB table      seconds          $$

… partition by Date (
    partition1: Date >= 01-01-2018
    partition2: Date < 01-01-2018 and Date >= 01-01-2015
    partition3: Date < 01-01-2015 )

Example filters:
… where Date > 20-01-2018
… where Date < 18-09-2017
… where Date > 16-06-2017 AND User_id < 400

Diagram labels: Pushdown, Partition filter, Pushdown filter, Executed in external system
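
The effect of the partition filter on the three example queries can be sketched as follows. This is an illustration of partition pruning, not the planner's actual logic. The mapping of partition1 to the grid, partition2 to a regular ADB table and partition3 to Hadoop is an assumption suggested by the slide's layout, and the function names are hypothetical.

```python
from datetime import date

# (name, assumed storage tier, inclusive lower bound, exclusive upper bound)
PARTITIONS = [
    ("partition1", "grid external table", date(2018, 1, 1), date.max),
    ("partition2", "regular ADB table", date(2015, 1, 1), date(2018, 1, 1)),
    ("partition3", "hadoop external table", date.min, date(2015, 1, 1)),
]


def prune(after=None, before=None):
    """Names of partitions that may hold rows with after < Date < before."""
    hits = []
    for name, _tier, low, high in PARTITIONS:
        if after is not None and high <= after:
            continue  # every date in this partition is too old
        if before is not None and low >= before:
            continue  # every date in this partition is too new
        hits.append(name)
    return hits


def plan(after=None, before=None):
    """For each surviving partition, say where the remaining filter is evaluated."""
    keep = prune(after, before)
    steps = []
    for name, tier, _low, _high in PARTITIONS:
        if name in keep:
            where = "ADB segments" if tier == "regular ADB table" else "external system"
            steps.append((name, where))
    return steps


print(prune(after=date(2018, 1, 20)))   # where Date > 20-01-2018
print(prune(before=date(2017, 9, 18)))  # where Date < 18-09-2017
print(plan(after=date(2017, 6, 16)))    # where Date > 16-06-2017 AND User_id < 400
```

Under these assumptions the first query touches only the in-memory grid, the second never reaches it, and the third sends its remaining filter to the grid for partition1 while partition2 is filtered on the ADB segments.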

SLIDE 38

PXF Pushdown Feature

SLIDE 39

Using the power of in-memory computing with MPP

SLIDE 40

Test Bench

Apache Ignite: Ignite1, Ignite2
Greenplum: Master, Seg1, Seg2
Hadoop: Namenode, Datanode1, Datanode2

Internal Affinity Functions
PXF interaction
Arenadata Unified Data Platform

SLIDE 41

Creating Table in MPP

SLIDE 42

Creating External Table for Apache Ignite & Load Data

SLIDE 43

Creating External Table in Hive & Load Data

SLIDE 44

Exchange Partitions with External Tables

SLIDE 45

Target Table

SLIDE 46

Execution Plan

prt1: Ignite Cache Partition
prt2: Greenplum Heap Partition

SLIDE 47

Thank you! Questions?