Apache Ignite as MPP Accelerator Alexander Ermakov, CTO Agenda - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Apache Ignite as MPP Accelerator

Alexander Ermakov, CTO

slide-2
SLIDE 2

Agenda

  • About us
  • Why does a traditional DWH need an in-memory grid?
  • Real Time Analytics for Telco Cases
  • Integrating Apache Ignite with Arenadata DB
  • Using the power of in-memory computing with MPP (Example)

slide-3
SLIDE 3

<About us>

slide-4
SLIDE 4

Who are we?

  • Arenadata unites a keen team of developers & engineers working on building an enterprise data platform.
  • We contribute to open source projects:
    – Greenplum
    – Apache PXF
    – Apache Bigtop
  • Members of ODPi (Linux Foundation) since 2015

slide-5
SLIDE 5

ODPi Compliant Platforms

slide-6
SLIDE 6

Arenadata Enterprise Data Platform

Platform Extension Framework

slide-7
SLIDE 7

Arenadata - Open Source

store.arenadata.io

slide-8
SLIDE 8

Our Partners

slide-9
SLIDE 9

Why does a DWH need an in-memory grid?

slide-10
SLIDE 10

New Generation of Business Cases

  • Gene Sequencing: cost to sequence one genome has fallen from $100M in 2001 to $10K in 2011 to $1K in 2014
  • Smart Grids: reading smart meters every 15 minutes is 3000x more data intensive
  • Stock Market
  • Social Media: Facebook uploads 250 million photos each day
  • Oil Exploration: oil rigs generate 25000 data points per second
  • Video Surveillance
  • Medical Imaging
  • Mobile Sensors

slide-11
SLIDE 11

Data Value Chain

Time scale: ms, seconds, hours, weeks, months, year, years+

slide-12
SLIDE 12

Data Warehouse

Diagram stages: Sources, Transport, Store, Analyze, Transform
Diagram components: OLTP, SP Table, ES, API, Batch, CDC, ELT & DQ, ODS, DDS, Data Mart, DWH, BI

slide-13
SLIDE 13

Data Lake

Diagram stages: Sources, Transport, Store, Analyze, Transform
Diagram components: OLTP, SP Table, ES, API, Batch, CDC, API, Queue, Hadoop (HDFS, SQL On Hadoop), ELT & DQ, ODS, DDS, Data Mart, DWH, BI

slide-14
SLIDE 14

Lambda Architecture

Diagram stages: Sources, Transport, Store, Analyze, Transform
Diagram components: ES, API, CDC, Batch, Queue, STG, Batch App, Real Time App, Hadoop (HDFS, SQL On Hadoop), ELT & DQ, ODS, DDS, Data Mart, DWH, BI

slide-15
SLIDE 15

Kappa Architecture

Diagram stages: Sources, Transport, Store, Analyze, Transform
Diagram components: Queue, STG, Batch App, Real Time App, BI

slide-16
SLIDE 16

Real Time Analytics for Telco Cases

slide-17
SLIDE 17

Customer Retention / Connection Breakdowns

slide-18
SLIDE 18

Geo Marketing

slide-19
SLIDE 19

Migrating from a Reactive, Static and Constrained Model…

Diagram stages: Ingest, Store, Analytics
Diagram storage: HDFS, Data Lake

  • Hard to change
  • Labor intensive
  • Inefficient
  • Coding based
  • No real-time information
  • Based on expensive ETL

slide-20
SLIDE 20

To Pro-Active, Self-Improving, Machine Learning Systems

Diagram components: Data Stream Pipeline, In-Memory Real-Time Data, Expert System / Machine Learning, HDFS, Data Lake

  • Multiple Data Sources
  • Real-Time Processing
  • Store Everything
  • Continuous Learning
  • Continuous Improvement
  • Continuous Adapting

slide-21
SLIDE 21

Diagram labels: Sandboxes, Data Feeds, Stream Processing, Expert Systems, Machine Learning, Historical Data, Business Value, Smart Decisions, HDFS, Data Lake

slide-22
SLIDE 22

Data Streaming Reference Architecture

Diagram components: Data Feeds, Transactional Apps, Analytic Apps, Data Stream Pipeline, Real Time Data & Distributed Computing, Expert Systems & Machine Learning, Advanced Analytics, Data Lake

slide-23
SLIDE 23

Data Streaming Reference Architecture

Diagram components: Data Feeds, Transactional Apps, Analytic Apps, Data Lake

slide-24
SLIDE 24

Integrate Apache Ignite with Arenadata DB

slide-25
SLIDE 25

Arenadata Grid

slide-26
SLIDE 26

Arenadata Grid Use Cases

slide-27
SLIDE 27

Arenadata DB Architecture

Flexible framework for processing large datasets

  • Master Host and Standby Master Host: the Master coordinates work with the Segment Hosts
  • Segment Host with one or more Segment Instances: Segment Instances process queries in parallel
  • Segment Hosts have their own CPU, disk and memory (shared nothing)
  • High speed interconnect for continuous pipelining of data processing

Diagram labels: SQL, Master Host, Standby Master, Segment (x5)
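
The shared-nothing idea on this slide can be sketched in a few lines of Python. This is a toy model only, not Arenadata DB or Greenplum code: the segment count, the CRC32 hash, the function names and the sample rows are all assumptions made for illustration. Each row lives on exactly one segment, each segment aggregates its own rows, and the master only merges the partial results.

```python
from collections import defaultdict
from zlib import crc32

NUM_SEGMENTS = 4  # assumed cluster size, not taken from the slides


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Pick a segment by hashing the distribution key (CRC32 is a stand-in)."""
    return crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, key_field, num_segments=NUM_SEGMENTS):
    """Place every row on exactly one segment, as a shared-nothing layout does."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(row[key_field], num_segments)].append(row)
    return segments


def segment_partial_sum(segment_rows, group_field, value_field):
    """Work done locally on one segment: a partial aggregate over its own rows."""
    partial = defaultdict(float)
    for row in segment_rows:
        partial[row[group_field]] += row[value_field]
    return dict(partial)


def master_gather(partials):
    """Work done on the master: merge the partial results from all segments."""
    total = defaultdict(float)
    for partial in partials:
        for group, value in partial.items():
            total[group] += value
    return dict(total)


# Made-up sample rows, for illustration only.
rows = [
    {"user_id": 16, "city": "Moscow", "amount": 10.0},
    {"user_id": 500, "city": "Kazan", "amount": 5.0},
    {"user_id": 2042, "city": "Moscow", "amount": 2.5},
    {"user_id": 15, "city": "Kazan", "amount": 1.0},
]
segments = distribute(rows, "user_id")
partials = [segment_partial_sum(s, "city", "amount") for s in segments]
result = master_gather(partials)
```

The result does not depend on which segment a row lands on, which is what lets the segments work without sharing CPU, disk or memory.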

slide-28
SLIDE 28

Greenplum Core Development

  • Zstandard support (will be added to stable at 6.0.0 due to naming convention)
  • PXF development, where we bet a lot: Ignite integration, pushdown feature, JDBC & Ignite stable release
  • A few bugs and a lot of issues
slide-29
SLIDE 29

Parallel Query Optimizer

  • Cost-based optimization looks for the most efficient plan
  • Physical plan contains scans, joins, sorts, aggregations, etc.
  • Global planning avoids sub-optimal ‘SQL pushing’ to segments
  • Directly inserts ‘motion’ nodes for inter-segment communication

PHYSICAL EXECUTION PLAN FROM SQL OR MAPREDUCE

Plan nodes shown in the diagram: Gather Motion 4:1 (Slice 3), Sort, HashAggregate, HashJoin, Redistribute Motion 4:4 (Slice 1), HashJoin, Hash, Hash, HashJoin, Hash, Broadcast Motion 4:4 (Slice 2), Seq Scan on motion, Seq Scan on customer, Seq Scan on line item, Seq Scan on orders
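
What the three motion node types do can be sketched with plain Python lists standing in for segments. This is a simplified model and not the Greenplum executor: the function names, the table contents and the modulo-on-integer-key hashing are assumptions. A redistribute motion re-hashes rows so that equal join keys meet on one segment, a broadcast motion copies a small table everywhere, and a gather motion collects the per-segment results on the master.

```python
NUM_SEGMENTS = 4  # matches the "4:4" and "4:1" motions named on the slide


def round_robin(rows, n):
    """Initial placement that ignores the join key."""
    return [rows[i::n] for i in range(n)]


def redistribute_motion(segments, key, num_segments):
    """N:N motion: re-hash rows so equal join keys land on the same segment."""
    out = [[] for _ in range(num_segments)]
    for seg in segments:
        for row in seg:
            out[row[key] % num_segments].append(row)  # integer keys assumed
    return out


def broadcast_motion(segments):
    """N:N motion: every segment receives a full copy of the table."""
    everything = [row for seg in segments for row in seg]
    return [list(everything) for _ in segments]


def local_hash_join(probe_rows, build_rows, key):
    """Build/probe hash join executed inside a single segment."""
    build = {}
    for row in build_rows:
        build.setdefault(row[key], []).append(row)
    return [{**match, **row} for row in probe_rows for match in build.get(row[key], [])]


def gather_motion(segments):
    """N:1 motion: the master collects the results of all segments."""
    return [row for seg in segments for row in seg]


# Made-up tables; names echo the scan nodes on the slide.
customer = [{"cust_id": i, "name": f"c{i}"} for i in range(6)]
orders = [{"order_id": 100 + i, "cust_id": i % 6} for i in range(10)]

cust_segs = redistribute_motion(round_robin(customer, NUM_SEGMENTS), "cust_id", NUM_SEGMENTS)
order_segs = redistribute_motion(round_robin(orders, NUM_SEGMENTS), "cust_id", NUM_SEGMENTS)
joined = gather_motion(
    [local_hash_join(o, c, "cust_id") for o, c in zip(order_segs, cust_segs)]
)

# Alternative plan: leave orders where they are and broadcast the small table.
bcast = broadcast_motion(round_robin(customer, NUM_SEGMENTS))
joined_bcast = gather_motion(
    [local_hash_join(o, c, "cust_id") for o, c in zip(round_robin(orders, NUM_SEGMENTS), bcast)]
)
```

Both plans return the same rows. Choosing between them is the kind of decision a cost-based planner makes, since broadcasting only pays off when one side of the join is small.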

slide-30
SLIDE 30

MADlib: Toolkit for Advanced Big Data Analytics

  • Better Parallelism
    – Algorithms designed to leverage MPP or Hadoop architecture
  • Better Scalability
    – Algorithms scale as your data set scales
    – No data movement
  • Better Predictive Accuracy
    – Using all data, not a sample, may improve accuracy
  • Open Source
    – Available for customization and optimization by user
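
One way to picture "no data movement" is a regression fitted from small per-segment summaries rather than from rows shipped to a central node. The sketch below is an assumption-laden illustration in plain Python, not MADlib code: the class and function names are hypothetical, and it covers only simple one-variable least squares.

```python
from dataclasses import dataclass


@dataclass
class RegressionState:
    """Sufficient statistics for simple least squares; tiny compared to the data."""
    n: int = 0
    sum_x: float = 0.0
    sum_y: float = 0.0
    sum_xx: float = 0.0
    sum_xy: float = 0.0

    def update(self, x, y):
        """Fold one local row into the state (runs on the segment)."""
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xx += x * x
        self.sum_xy += x * y

    def merge(self, other):
        """Combine two states; only these five numbers cross the network."""
        return RegressionState(
            self.n + other.n,
            self.sum_x + other.sum_x,
            self.sum_y + other.sum_y,
            self.sum_xx + other.sum_xx,
            self.sum_xy + other.sum_xy,
        )


def summarize_segment(rows):
    state = RegressionState()
    for x, y in rows:
        state.update(x, y)
    return state


def fit(state):
    """Closed-form slope and intercept from the merged statistics."""
    denom = state.n * state.sum_xx - state.sum_x ** 2
    slope = (state.n * state.sum_xy - state.sum_x * state.sum_y) / denom
    intercept = (state.sum_y - slope * state.sum_x) / state.n
    return slope, intercept


# Made-up data on the line y = 2x + 1, split across three pretend segments.
points = [(float(x), 2.0 * x + 1.0) for x in range(10)]
segments = [points[0:3], points[3:7], points[7:10]]

merged = RegressionState()
for seg in segments:
    merged = merged.merge(summarize_segment(seg))
slope, intercept = fit(merged)
```

Because the summaries are additive, the fit uses every row on every segment, which is the "all data, not a sample" point above.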

slide-31
SLIDE 31

MADlib In-Database Functions

Predictive Modeling Library

Linear Systems
  • Sparse and Dense Solvers

Matrix Factorization
  • Singular Value Decomposition (SVD)

Generalized Linear Models
  • Linear Regression
  • Logistic Regression
  • Multinomial Logistic Regression
  • Cox Proportional Hazards Regression
  • Elastic Net Regularization
  • Sandwich Estimators (Huber white, clustered, marginal effects)

Machine Learning Algorithms
  • ARIMA
  • Principal Component Analysis (PCA)
  • Association Rules (Affinity Analysis, Market Basket)
  • Topic Modeling (Parallel LDA)
  • Decision Trees
  • Ensemble Learners (Random Forests)
  • Support Vector Machines
  • Conditional Random Field (CRF)
  • Clustering (K-means)
  • Cross Validation

Descriptive Statistics
  • Sketch-based Estimators
    – CountMin (Cormode-Muthukrishnan)
    – FM (Flajolet-Martin)
    – MFV (Most Frequent Values)
  • Correlation
  • Summary

Support Modules
  • Array Operations
  • Sparse Vectors
  • Random Sampling
  • Probability Functions
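
As an illustration of the sketch-based estimators listed here, a Count-Min sketch can be written in plain Python. This is a generic textbook version, not MADlib's implementation: the width, depth and BLAKE2 hashing are assumptions chosen for the example. The estimate never undercounts, and sketches built on different segments can be merged cell by cell.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter: estimates are never below the true count."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent-looking hash per row, derived by salting BLAKE2b.
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate cells, so the minimum is the tightest bound.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

    def merge(self, other):
        """Cell-wise sum, so each segment can sketch its own rows independently."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share the same shape")
        merged = CountMinSketch(self.width, self.depth)
        for r in range(self.depth):
            merged.table[r] = [a + b for a, b in zip(self.table[r], other.table[r])]
        return merged


# Two pretend segments count their own events, then the sketches are merged.
seg1, seg2 = CountMinSketch(), CountMinSketch()
for word in ["call", "sms", "call", "data"]:
    seg1.add(word)
for word in ["call", "data", "data"]:
    seg2.add(word)
total = seg1.merge(seg2)
```

Memory use is fixed by width and depth regardless of how many distinct items are seen, which is the trade made for an approximate answer.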

slide-32
SLIDE 32

Polymorphic Table Storage

Single table, tiered by data age:

  • Now data (hours): SSD
  • Actual data (months): regular HDD
  • Historical data (years): slow HDD

  • Provide the choice of processing model for any table or any individual partition
    – Enable Information Lifecycle Management (ILM)
  • Storage types can be mixed within a table or database
    – Four table types: heap, row-oriented AO, column-oriented, external
    – Block compression: Gzip (levels 1-9), Zstd
    – Columnar compression: RLE
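
Run-length encoding, the columnar compression named above, is easy to show on a single column. The sketch below is a generic illustration in Python with made-up values, not the on-disk format used by the database: it stores each run of equal neighbouring values as a (value, count) pair.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of equal neighbouring values into (value, run_length)."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


# A made-up column; sorted or low-cardinality columns compress best.
delivery_state = ["RU", "RU", "RU", "KZ", "KZ", "RU"]
encoded = rle_encode(delivery_state)
decoded = rle_decode(encoded)
```

The scheme works well in column-oriented storage because values of one column sit next to each other, so long runs are common.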

slide-33
SLIDE 33

Platform eXtension Framework (PXF)

  • An advanced version of Greenplum external tables
  • Supports connectors for HDFS, HBase, Hive, JDBC and Ignite (Arenadata DB)
  • Provides an extensible framework API to enable custom connectors

slide-34
SLIDE 34

PXF Profiles

  • HDFS Files
  • Ignite
  • JDBC
  • Avro
  • HBase
  • Hive
    – Text based
    – SequenceFile
    – RCFile
    – ORCFile

CREATE EXTERNAL TABLE pxf_sales_part(
    item_name TEXT,
    item_type TEXT,
    supplier_key INTEGER,
    item_price DOUBLE PRECISION,
    delivery_state TEXT,
    delivery_city TEXT
)
LOCATION ('pxf://grid_host?Profile=Ignite&IGNITE_CACHE=test&BUFFER_SIZE=10000');
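
The LOCATION string packs the target and the connector options into one URI. A small Python sketch, using only the standard library, shows how such a string breaks down; the helper name is hypothetical and upper-casing the option names is an assumption made here for convenience, not documented PXF behaviour.

```python
from urllib.parse import parse_qsl, urlsplit


def parse_pxf_location(location):
    """Split a pxf:// URI into its target and a dict of options."""
    parts = urlsplit(location)
    if parts.scheme != "pxf":
        raise ValueError("expected a pxf:// location, got %r" % location)
    # Assumption: option names are treated case-insensitively, so normalise them.
    options = {name.upper(): value for name, value in parse_qsl(parts.query)}
    return parts.netloc + parts.path, options


target, options = parse_pxf_location(
    "pxf://grid_host?Profile=Ignite&IGNITE_CACHE=test&BUFFER_SIZE=10000"
)
buffer_size = int(options["BUFFER_SIZE"])
```

Here `Profile` selects the connector, while `IGNITE_CACHE` and `BUFFER_SIZE` are passed through to it as shown on the slide.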

slide-35
SLIDE 35

PXF Profiles

<profile>
    <name>Ignite</name>
    <plugins>
        <fragmenter>IgniteFragmenter</fragmenter>
        <accessor>IgniteAccessor</accessor>
        <resolver>IgniteResolver</resolver>
        <analyzer>IgniteAnalyzer</analyzer>
    </plugins>
</profile>

slide-36
SLIDE 36

PXF Classes

  • Fragmenter – returns a list of source data fragments and their locations
  • Accessor – accesses a given list of fragments, reads them and returns records
  • Resolver – deserializes each record according to a given schema or technique
  • Analyzer – returns statistics about the source data
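
The division of labour between the four plugin roles can be sketched as a small pipeline. The real PXF plugins are Java classes; the Python below is only a toy model, and every class name, method name and sample record in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    location: str   # where the fragment lives (host, file, cache partition, ...)
    payload: tuple  # raw serialized records; stands in for remote data


class ToyFragmenter:
    """Lists the fragments of a source together with their locations."""

    def __init__(self, source):
        self.source = source  # dict: location -> list of raw records

    def get_fragments(self):
        return [Fragment(loc, tuple(recs)) for loc, recs in sorted(self.source.items())]


class ToyAccessor:
    """Reads one fragment and yields its raw records."""

    def read(self, fragment):
        yield from fragment.payload


class ToyResolver:
    """Deserializes a raw record into typed fields according to a schema."""

    def __init__(self, schema):
        self.schema = schema  # list of (column_name, cast_function)

    def resolve(self, raw):
        fields = raw.split(",", len(self.schema) - 1)
        return {name: cast(value) for (name, cast), value in zip(self.schema, fields)}


class ToyAnalyzer:
    """Returns statistics about the source, here just a row count."""

    def estimate_rows(self, fragments):
        return sum(len(f.payload) for f in fragments)


def scan(fragmenter, accessor, resolver):
    """Fragmenter -> Accessor -> Resolver, one record at a time."""
    for fragment in fragmenter.get_fragments():
        for raw in accessor.read(fragment):
            yield resolver.resolve(raw)


# Made-up source: two pretend grid nodes holding comma-separated records.
source = {"ignite1": ["16,hello", "500,hi there"], "ignite2": ["2042,hey"]}
fragmenter = ToyFragmenter(source)
records = list(scan(fragmenter, ToyAccessor(), ToyResolver([("user_id", int), ("message", str)])))
row_estimate = ToyAnalyzer().estimate_rows(fragmenter.get_fragments())
```

Keeping the roles separate means a new data source needs its own Fragmenter and Accessor, while a new record format needs only a new Resolver.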

slide-37
SLIDE 37

PXF Pushdown Feature

Date        User_id  Message
21-01-2018  16       <message>
01-11-2018  500      <message>
15-05-2018  2042     <message>
17-09-2017  15       <message>
15-06-2016  55       <message>
24-12-2015  3510     <message>
01-01-2012  19       <message>
26-04-2013  42       <message>
23-05-2010  17       <message>

Storage tiers:

  • Grid external table – latency: milliseconds, cost per GB: $$$
  • Regular ADB table – latency: seconds, cost per GB: $$
  • Hadoop external table – latency: tens of seconds, cost per GB: $

… partition by Date (
    partition1: Date >= 01-01-2018
    partition2: Date < 01-01-2018 and Date >= 01-01-2015
    partition3: Date < 01-01-2015 )

Example filters:

  • … where Date > 20-01-2018
  • … where Date < 18-09-2017
  • … where Date > 16-06-2017 AND User_id < 400

Diagram labels: Pushdown Partition filter, Pushdown filter, Executed in external system
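
The two kinds of pushdown on this slide can be modelled with the sample rows and partition bounds shown above. The Python below is a sketch under stated assumptions, not PXF code: the mapping of partition1 to the grid, partition2 to a regular ADB table and partition3 to Hadoop is inferred from the latency tiers, and all function names are hypothetical. The partition filter decides which storage tiers are touched at all, and the pushed-down predicate then runs where the data lives.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional, Tuple

Row = Tuple[date, int, str]  # (Date, User_id, Message)


@dataclass
class Partition:
    name: str
    tier: str                # assumed tier label, see the note above
    lower: Optional[date]    # inclusive lower bound, None = unbounded
    upper: Optional[date]    # exclusive upper bound, None = unbounded
    rows: List[Row] = field(default_factory=list)


def prune(partitions, after=None, before=None):
    """Partition filter: keep partitions that can hold a date d with after < d < before."""
    kept = []
    for p in partitions:
        too_old = after is not None and p.upper is not None and p.upper - timedelta(days=1) <= after
        too_new = before is not None and p.lower is not None and p.lower >= before
        if not (too_old or too_new):
            kept.append(p)
    return kept


def scan(partition, predicate):
    """Pushdown filter: the predicate runs at the source, so only matches travel."""
    return [row for row in partition.rows if predicate(row)]


def query(partitions, after=None, before=None, max_user_id=None):
    def predicate(row):
        d, user_id, _ = row
        return ((after is None or d > after)
                and (before is None or d < before)
                and (max_user_id is None or user_id < max_user_id))

    touched = prune(partitions, after, before)
    return [p.name for p in touched], [r for p in touched for r in scan(p, predicate)]


M = "<message>"
partitions = [
    Partition("partition1", "grid", date(2018, 1, 1), None,
              [(date(2018, 1, 21), 16, M), (date(2018, 11, 1), 500, M), (date(2018, 5, 15), 2042, M)]),
    Partition("partition2", "adb", date(2015, 1, 1), date(2018, 1, 1),
              [(date(2017, 9, 17), 15, M), (date(2016, 6, 15), 55, M), (date(2015, 12, 24), 3510, M)]),
    Partition("partition3", "hadoop", None, date(2015, 1, 1),
              [(date(2012, 1, 1), 19, M), (date(2013, 4, 26), 42, M), (date(2010, 5, 23), 17, M)]),
]

q1 = query(partitions, after=date(2018, 1, 20))
q2 = query(partitions, before=date(2017, 9, 18))
q3 = query(partitions, after=date(2017, 6, 16), max_user_id=400)
```

In this model the first filter touches only the in-memory partition, while the third touches two tiers and ships back only the rows that also satisfy the `User_id` condition.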

slide-38
SLIDE 38

PXF Pushdown Feature

slide-39
SLIDE 39

Using the power of in-memory computing with MPP

slide-40
SLIDE 40

Test Bench

Arenadata Unified Data Platform:

  • Ignite1, Ignite2
  • Greenplum Master, Greenplum Seg1, Greenplum Seg2
  • Hadoop Namenode, Hadoop Datanode1, Hadoop Datanode2

Diagram labels: Internal Affinity Functions, PXF interaction

slide-41
SLIDE 41

Creating Table in MPP

slide-42
SLIDE 42

Creating External Table for Apache Ignite & Load Data

slide-43
SLIDE 43

Creating External Table in Hive & Load Data

slide-44
SLIDE 44

Exchange Partitions with External Tables

slide-45
SLIDE 45

Target Table

slide-46
SLIDE 46

Execution Plan

prt1: Ignite Cache Partition
prt2: Greenplum Heap Partition

slide-47
SLIDE 47

Thank you! Questions?