SnappyData: Apache Spark Meets Embedded In-Memory Database
Masaki Yamakawa UL Systems, Inc.
About me
UL Systems, Inc. Managing Consultant
{ "Sector": "Financial",
  "Skills": ["Distributed system", "In-memory computing"],
  "Hobbies": "Marathon running" }
PART 1
Are you satisfied with real-time analytics solutions?
What are the common demands on a data processing platform?
Traditional data processing: transactions in an RDBMS, analytics in a DWH
Big data processing: NoSQL, SQL on Hadoop
Streaming: stream data processing
The system tends to become complex when it integrates multiple products.
- Store enterprise data: enterprise systems write to an RDBMS
- ETL processing feeds a DWH for data visualization and analysis in BI/analytics apps
- Store and process big data: Web/B2C services etc.
- Process real-time data and store streaming data: IoT / sensor / real-time data flows through stream data processing to real-time apps for notifications and alerts
The result: increased TCO, analysis that takes too long, inefficiency, and difficulty maintaining data consistency.
But did it become simple enough after Spark was released…?
Even then, the same actors and responsibilities surround Spark: enterprise systems, BI/analytics apps, Web/B2C services, IoT / sensor / real-time data, and real-time apps; storing enterprise data, storing and processing big data, processing real-time data, data visualization and analysis, and notifications and alerts.
SnappyData lets you build simpler real-time analytics solutions!
With SnappyData in the center, a single platform stores enterprise data, stores and processes big data, and processes real-time data, while serving data visualization and analysis plus notifications and alerts to the same enterprise systems, BI/analytics apps, Web/B2C services, IoT / sensor feeds, and real-time apps.
PART 2
SnappyData is the Spark Database for Spark users
Apache Spark + a distributed in-memory DB + its own features
SnappyData combines:
- a distributed computing framework (Spark): batch processing, analytics, stream processing
- a distributed in-memory database: columnar tables, row tables, transactions
- SnappyData's own features: the Synopsis Data Engine
What are SnappyData's core components?
- Spark: Spark Core, Spark SQL with Catalyst, micro-batch streaming
- SnappyData's additional features: OLTP queries, OLAP queries, the Synopsis Data Engine, sample/TopK tables, stream tables, continuous queries
- GemFire XD: in-memory database with transactions, P2P cluster management, replication/partitioning, row tables, column tables, indexes
- Distributed file system: HDFS
Keys to accelerating Spark programs
Key#1: Data exists in the in-memory database
In the case of Spark, a Spark program runs in memory but its data lives on disk in HDFS and must be loaded first. In the case of SnappyData, Spark programs run against a distributed in-memory database, so the data is already in memory.
Key#1: Data access code example
In case of Spark:

// load data from HDFS
val df = spark.sqlContext.read.
  format("com.databricks.spark.csv").
  …
df.createOrReplaceTempView("SparkTable")
// create a new DataFrame using Spark SQL
val filteredDf = spark.sql("SELECT * FROM SparkTable WHERE ...")
val newDf = filteredDf. ....
// save processing results
newDf.write.
  format("com.databricks.spark.csv").
  …

In case of SnappyData (no need to load data):

// create a SnappySession from the SparkContext
val snappy = new org.apache.spark.sql.SnappySession(spark.sparkContext)
// create a new DataFrame using Spark SQL, directly against the in-memory table
val filteredDf = snappy.sql("SELECT * FROM SnappyTable WHERE …")
val newDf = filteredDf. ....
// save processing results
newDf.write.insertInto("NewSnappyTable")
Key#2: SnappyData uses the same data format as Spark
In the case of Spark, reading and writing data between a DataFrame and HDFS/data storage (e.g., a CSV file) goes through serialization/deserialization and O/R mapping. In the case of SnappyData, the GemFire XD in-memory database stores data in the same format as a Spark DataFrame, so reads and writes involve no serialization/deserialization and no O/R mapping.
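As a quick illustration (a sketch, assuming a SnappySession named snappy as created in the Key#1 example, and the MachineSensor table used later in this deck):

import org.apache.spark.sql.functions.avg
// the DataFrame is backed directly by the in-memory table: no load, no conversion
val df = snappy.table("MachineSensor")
// ordinary DataFrame operations work on it with no serialization or O/R mapping step
df.groupBy("VIN").agg(avg("Value")).show()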
Key#3: The Spark and GemFire XD clusters can be integrated
Unified cluster mode: a SnappyData Locator and a SnappyData Leader (the Spark driver, holding the SparkContext) coordinate the SnappyData DataServers. Each DataServer JVM runs both a Spark executor (with its DataFrames) and the in-memory database, so compute and data share the same process.
Key#3: Another cluster mode (for your reference)
Split cluster mode: the Spark cluster (the SnappyData Leader acting as Spark driver with the SparkContext, plus Spark executor JVMs) runs separately from the GemFire XD cluster (a SnappyData Locator plus DataServer JVMs hosting the in-memory database). Executors exchange DataFrames with the DataServers across JVMs.
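For reference, a minimal sketch of how an external Spark application attaches to the cluster in split mode; the spark.snappydata.connection property follows the SnappyData docs, while the host, port, and master URL here are illustrative:

import org.apache.spark.sql.{SnappySession, SparkSession}
// point the external Spark application at the SnappyData locator's client port
val spark = SparkSession.builder()
  .appName("SplitModeApp")
  .master("local[*]")  // or the external Spark cluster's master URL
  .config("spark.snappydata.connection", "localhost:1527")
  .getOrCreate()
// the SnappySession then reads and writes SnappyData tables from the separate Spark cluster
val snappy = new SnappySession(spark.sparkContext)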
Key#4: SparkSQL Acceleration
SnappyData accelerates parts of the Spark SQL workload by modifying the physical plan. For a query such as:

SELECT A.CardNumber, SUM(A.TxAmount)
FROM CreditCardTx1 A, CreditCardComm B
WHERE A.CardNumber = B.CardNumber
  AND A.TxAmount + B.Comm < 1000
GROUP BY A.CardNumber
ORDER BY A.CardNumber

plain Spark plans Sort, SortMergeJoin, and HashAggregate steps, while SnappyData generates a unique DAG using SnappyHashJoin and SnappyHashAggregate, with fewer shuffles, and runs faster.
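One way to observe the substitution (a sketch, assuming a SnappySession named snappy and the two tables from the query above) is to print the physical plan with the standard DataFrame explain API:

// run the query through the SnappySession and inspect its physical plan
val df = snappy.sql("""
  SELECT A.CardNumber, SUM(A.TxAmount)
  FROM CreditCardTx1 A, CreditCardComm B
  WHERE A.CardNumber = B.CardNumber AND A.TxAmount + B.Comm < 1000
  GROUP BY A.CardNumber
  ORDER BY A.CardNumber""")
// plain Spark plans Sort/SortMergeJoin/HashAggregate here; SnappyData shows
// SnappyHashJoin and SnappyHashAggregate with fewer shuffle stages
df.explain()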
PART 3
Example of use: Production plan simulation system
Production results, BOM data, and machine sensor data arrive as streams through messaging middleware; simulation parameters come from the app. Everything lands in the in-memory database (production results table, BOM table, machine sensor table, simulation parameters table), which the app and a BI tool query, and which pushes real-time notifications.
Architecture with SnappyData
SnappyData handles both transactions and analytics on one in-memory database. The app and messaging middleware drive three workloads, all through SQL: A) stream data processing, B) transactions, C) analytics.
A) Stream Data Processing
Stream data flows from the app through messaging middleware into SnappyData, where stream data processing runs as SQL against the in-memory database.
Difference from plain Spark
SnappyData implements stream data processing using SQL
Stream table MachineSensorStream:

SensorId | VIN  | MachineNo | Point | Value   | Timestamp
1        | 11AA | 111       | 1     | 28.076  | 2017/11/05 10:10:01
2        | 22BB | 222       | 37    | 60.069  | 2017/11/05 10:10:20
3        | 11AA | 111       | 2     | 37.528  | 2017/11/05 10:10:21
4        | 33CC | 333       | 25    | 1.740   | 2017/11/05 10:11:05
5        | 11AA | 111       | 3     | 88.654  | 2017/11/05 10:11:15
6        | 11AA | 111       | 4     | 394.390 | 2017/11/05 10:11:16

Processing (continuous query):

SELECT * FROM MachineSensorStream WINDOW (DURATION 10 SECONDS, SLIDE 2 SECONDS) WHERE Point=1;
Only the stream data source information needs to be specified in the table definition.
CREATE STREAM TABLE MachineSensorStream
  (SensorId long, VIN string, MachineNo int, Point long, Value double, Timestamp timestamp)
  USING KAFKA_STREAM                                           -- streaming data source
  OPTIONS (storagelevel 'MEMORY_AND_DISK_SER_2',               -- storage level (Spark setting)
           rowConverter 'uls.snappy.KafkaToRowsConverter',     -- stream data row converter class
           kafkaParams 'zookeeper.connect->localhost:2181;xx', -- settings for each streaming data source
           topics 'MachineSensorStream');

Streaming data sources other than Kafka:
- TWITTER_STREAM
- DIRECTKAFKA_STREAM
- RABBITMQ_STREAM
- SOCKET_STREAM
- FILE_STREAM
Implement a StreamToRowsConverter that converts each message to the table format.
class KafkaToRowsConverter extends StreamToRowsConverter with Serializable {
  override def toRows(message: Any): Seq[Row] = {
    val sensor: MachineSensorStream = message.asInstanceOf[MachineSensorStream]
    // data for one row
    Seq(Row.fromSeq(Seq(sensor.getSensorId, sensor.getVin, sensor.getMachineNo,
      sensor.getPoint, sensor.getValue, sensor.getTimestamp)))
  }
}
Stream data processing using SQL
SELECT * FROM MachineSensorStream WINDOW (DURATION 10 SECONDS, SLIDE 2 SECONDS) WHERE Point=1;
This acquires rows where Point = 1 from a 10-second window (DURATION 10 secs) that slides every 2 seconds (SLIDE 2 secs).
In a continuous query, the window contents shift as time passes:

Window at first evaluation:
SensorId | VIN  | MachineNo | Point | Value   | Timestamp
1        | 11AA | 111       | 1     | 28.076  | 2017/11/05 10:10:01
2        | 22BB | 222       | 37    | 60.069  | 2017/11/05 10:10:20
3        | 11AA | 111       | 2     | 37.528  | 2017/11/05 10:10:21

After 2 seconds:
SensorId | VIN  | MachineNo | Point | Value   | Timestamp
3        | 11AA | 111       | 2     | 37.528  | 2017/11/05 10:10:21
4        | 33CC | 333       | 25    | 1.740   | 2017/11/05 10:11:05

After 2 more seconds:
SensorId | VIN  | MachineNo | Point | Value   | Timestamp
4        | 33CC | 333       | 25    | 1.740   | 2017/11/05 10:11:05
5        | 11AA | 111       | 3     | 88.654  | 2017/11/05 10:11:15
6        | 11AA | 111       | 4     | 394.390 | 2017/11/05 10:11:16
Stream data processing code example
import org.apache.spark.streaming.{Seconds, SnappyStreamingContext}

// create a SnappyStreamingContext from the SparkContext
val snappy = new SnappyStreamingContext(sc, Seconds(10))
// register a continuous query
val machineSensorStream: SchemaDStream = snappy.registerCQ("""
  SELECT SensorId, VIN, MachineNo, Point, Value, Timestamp
  FROM MachineSensorStream WINDOW (DURATION 10 SECONDS, SLIDE 2 SECONDS)
  WHERE Point=1
""")
// process the stream data
machineSensorStream.foreachDataFrame(df => {
  …
  df.write.insertInto("MachineSensorHistory")
  …
})
B) Transaction
The app issues transactional SQL directly against SnappyData's in-memory database.
Difference from plain Spark
Data insert, update, delete code example
import snappy.implicits._  // for rdd.toDS()

bomStream.foreachRDD(rdd => {
  val streamDS = rdd.toDS()
  // delete from the BOM table
  streamDS.where("ACTION = 'DELETE'").write.deleteFrom("BOM")
  // insert/update into the BOM table
  streamDS.where("ACTION = 'INSERT'").write.putInto("BOM")
})

machineSensorStream.foreachRDD(rdd => {
  val streamDS = rdd.toDS()
  // create a DataFrame over the BOM table
  val bom = snappy.table("BOM")
  // register the join result in the faulty-parts table
  val faultyParts = streamDS.join(bom, streamDS("PartsNo") === bom("PartsNo"), "leftsemi")
  val faulty = faultyParts.select("SensorId", "VIN", "MachineNo", "PartsNo", "Timestamp")
  faulty.write.insertInto("FaultyParts")
})
Inserts, updates, and deletes are also possible in standard SQL using a SnappySession.
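For instance (a sketch against the BOM table used above; the column values are illustrative):

// standard SQL DML through the SnappySession
snappy.sql("INSERT INTO BOM (PartsNo, PartsType, EffectiveDate) " +
  "VALUES ('999999999', '1', '2020-12-01')")
snappy.sql("UPDATE BOM SET PartsType = '2' WHERE PartsNo = '999999999'")
snappy.sql("DELETE FROM BOM WHERE PartsNo = '999999999'")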
Different table formats can be used depending on the data's characteristics.
Row table (for master data / transaction data):
- frequently inserted or updated
- looked up by key

Column table (for aggregate / analysis data):
- aggregated or grouped by specified columns
Column table example (MachineSensor):
SensorId | VIN  | MachineNo | Point | Value   | …
1        | 11AA | 111       | 1     | 28.076  | …
2        | 22BB | 222       | 37    | 60.069  | …
3        | 11AA | 111       | 2     | 37.528  | …
4        | 33CC | 333       | 25    | 1.740   | …
…
6        | 11AA | 111       | 4     | 394.390 | …

Row table example (BOM):
PartsNo   | PartsType | EffectiveDate | …
999999999 | 1         | 2020/12/01    | …
876543210 | 1         | 2019/04/02    | …
213757211 | 2         | 2020/02/02    | …
555444777 | 1         | 2018/08/13    | …
…
987654321 | 2         | 2022/09/30    | …
CREATE TABLE BOM
  (PartsNo CHAR(16) NOT NULL PRIMARY KEY,
   PartsType CHAR(1) NOT NULL,
   EffectiveDate DATE NOT NULL,
   … CHAR(3), … CHAR(1), … DATE, … DECIMAL(9,2))
  USING ROW
  OPTIONS (PARTITION_BY 'PartsNo', COLOCATE_WITH 'PartsType', REDUNDANCY '1',
           EVICTION_BY 'LRUMEMSIZE 10240', OVERFLOW 'true', DISKSTORE 'LOCALSTORE',
           PERSISTENCE 'ASYNC', EXPIRE '86400');

CREATE TABLE MachineSensor
  (SensorId BIGINT,
   VIN CHAR(20),
   MachineNo CHAR(16),
   Value DECIMAL(15,2),
   Point CHAR(2),
   … DATE, … DECIMAL(9,2))
  USING COLUMN
  OPTIONS (PARTITION_BY 'SensorID', COLOCATE_WITH 'PartsType', REDUNDANCY '1',
           EVICTION_BY 'LRUMEMSIZE 10240', OVERFLOW 'true', DISKSTORE 'LOCALSTORE',
           PERSISTENCE 'ASYNC');
You can create tables with DDL just like an RDBMS.
In both the row table and column table DDLs, USING selects the database engine (ROW or COLUMN); PARTITION_BY, COLOCATE_WITH, and REDUNDANCY are data distribution settings; EVICTION_BY, OVERFLOW, DISKSTORE, and PERSISTENCE are persistence settings; and EXPIRE (row table only) is the expire option.
Replication and partitioning need to be used appropriately.
Replication (for master data): nodes A-D each keep the same full copy of data A-D.
Partition (for transaction data): data A-D is distributed across nodes A-D, each partition having a primary copy on one node and a redundant copy on another.
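In DDL terms this is just the presence or absence of PARTITION_BY; per the SnappyData docs, a row table without it is replicated. A sketch with illustrative table names:

// replicated row table: every node keeps a full copy (suits master data)
snappy.sql("CREATE TABLE PartsMaster (PartsNo CHAR(16) PRIMARY KEY, PartsName VARCHAR(64)) USING ROW")
// partitioned row table: rows are distributed by key, with one redundant copy per partition
snappy.sql("CREATE TABLE ProductionResults (ResultId BIGINT, PartsNo CHAR(16)) " +
  "USING ROW OPTIONS (PARTITION_BY 'ResultId', REDUNDANCY '1')")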
Transactions can be used just like in an RDBMS.
Supported transaction isolation levels:
- READ UNCOMMITTED: not supported
- READ COMMITTED: supported
- REPEATABLE READ: supported
- SERIALIZABLE: not supported

Other transaction behavior:
- SELECT FOR UPDATE: a conflict exception occurs at COMMIT when the data was updated during the transaction
- COMMIT/ROLLBACK: if a cluster member goes down during a transaction, an exception is raised saying that the COMMIT failed
- LOCK TABLE: not supported
- CASCADE DELETE: not supported
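A minimal sketch of an explicit transaction over JDBC (the connection URL matches the application example later in this deck; the table and values are illustrative):

import java.sql.{Connection, DriverManager, SQLException}

val conn = DriverManager.getConnection("jdbc:snappydata://localhost:1527/APP")
conn.setAutoCommit(false)
conn.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED)
try {
  val ps = conn.prepareStatement("UPDATE BOM SET PartsType = ? WHERE PartsNo = ?")
  ps.setString(1, "2")
  ps.setString(2, "999999999")
  ps.executeUpdate()
  conn.commit()   // a concurrent update surfaces as a conflict exception here
} catch {
  case e: SQLException => conn.rollback()
} finally {
  conn.close()
}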
P2P architecture
Instead of a master/slave topology, where clients connect through a master node to slave nodes, SnappyData processes form a peer-to-peer cluster: a client can connect to any node.
C) Analytics
Analytics runs as SQL over the same SnappyData in-memory database that serves the app and messaging middleware.
Difference from plain Spark
SnappyData is 10-20x faster than Spark SQL.
SnappyData implements unique features that enable real-time analysis.
- Query against sampled data (stratified sampling, random sampling)
- Use min/max to detect top values (the TopK tables shown later)
Stratified sampling can sample based on the cardinality of specific columns.
CREATE SAMPLE TABLE MachineSensorSample ON MachineSensor
  OPTIONS (qcs 'PartsNo, Year, Month', fraction '0.03')
  AS (SELECT * FROM MachineSensor);
Creating the sample table populates it from the base table:

Base table MachineSensor:
SrId | VIN | Year | Month | Value   | …
1    | 111 | 2017 | 11    | 10,000  | …
2    | 222 | 2017 | 11    | 3,980   | …
3    | 111 | 2017 | 11    | 5,130   | …
4    | 222 | 2017 | 11    | 323,456 | …
5    | 111 | 2017 | 11    | 1,980   | …
6    | 111 | 2017 | 11    | 23,456  | …

Sample table MachineSensorSample:
SrId | VIN | Year | Month | Value   | …
1    | 111 | 2017 | 11    | 10,000  | …
2    | 222 | 2017 | 11    | 3,980   | …
5    | 111 | 2017 | 11    | 1,980   | …

Rows are sampled based on the cardinality of the specified columns (the qcs, or query column set).
For random sampling, there is no need to create a sample table.
SELECT VIN, AVG(Value)
  FROM MachineSensor
  GROUP BY VIN ORDER BY VIN
  WITH ERROR 0.10 CONFIDENCE 0.95;
Base table MachineSensor:
SrId | VIN | Year | Month | TxAmount | …
1    | 111 | 2017 | 11    | ¥10,000  | …
2    | 222 | 2017 | 11    | ¥3,980   | …
3    | 111 | 2017 | 11    | ¥5,130   | …
4    | 222 | 2017 | 11    | ¥323,456 | …
5    | 111 | 2017 | 11    | ¥1,980   | …
6    | 111 | 2017 | 11    | ¥23,456  | …

The rows are randomly sampled at query time, and the SQL returns approximate query results.
Approximate Query
Queries are faster thanks to the sampling technique.
Average usage values per VIN, exact vs. sampled:

VIN | Exact (processing time 10 sec) | With sampling (processing time 1.2 sec)
111 | 182.531                        | 182.294
222 | 132.712                        | 132.801
333 | 79.521                         | 79.582
444 | 12.903                         | 12.912
Comparing confidence of query results and query performance...
Case    | Error | Confidence | Query time
Std SQL | -     | -          | 10.0 sec
Case 1  | 0.20  | 0.70       | 1.2 sec
Case 2  | 0.05  | 0.95       | 1.4 sec
Case 3  | 0.20  | 0.80       | 1.3 sec
Case 4  | 0.30  | 0.80       | 1.2 sec
Case 5  | 0.20  | 0.90       | 1.1 sec
TOP K table

CREATE TOPK TABLE HighestSensorValue
  OPTIONS (key 'MachineNo',
           timeInterval '60000ms',
           size '50',
           frequencyCol 'Value',
           timeSeriesColumn 'Timestamp');

This collects the top 50 MachineNos with the highest Value at 1-minute intervals:

MachineNo | Value      | …
666666666 | 12,345,678 | …
197654532 | 10,000,000 | …
197654532 | 5,048,600  | …
…
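A sketch of reading it back: SnappyData's AQP module exposes queryApproxTSTopK on the SnappySession (method name per the SnappyData AQP docs; the time bounds below are illustrative):

// fetch the top MachineNos by Value within a time range
val topK = snappy.queryApproxTSTopK("HighestSensorValue",
  "2017-11-05 10:10:00", "2017-11-05 10:11:00")
topK.show()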
Connect to SnappyData from BI tool
Code example: Connect to SnappyData from application
// connect to SnappyData
val conn = DriverManager.getConnection("jdbc:snappydata://localhost:1527/APP")

// insert data
val psInsert = conn.prepareStatement("INSERT INTO MachineSensor VALUES(?, ?, ?, ?, …)")
psInsert.setString(1, "1000200030004000")
psInsert.setBigDecimal(2, java.math.BigDecimal.valueOf(100.2))
…
psInsert.execute()

// select data
val psSelect = conn.prepareStatement("SELECT * FROM MachineSensor WHERE PartsNo=?")
psSelect.setString(1, "1000200030004000")
val rs = psSelect.executeQuery()

// disconnect from SnappyData
conn.close()
PART 4
Thanks!