SLIDE 1

Apache Ignite - Using a Memory Grid for Heterogeneous Computation Frameworks

A Use Case Guided Explanation

Chris Herrera, Hashmap

SLIDE 2

Topics

  • Who - Key Hashmap Team Members
  • The Use Case - Our Need for a Memory Grid
  • Requirements
  • Approach V1
  • Approach V1.5
  • Approach V2
  • Lessons Learned
  • What’s Next
  • Questions
SLIDE 3

Who - Hashmap

WHO

  • Big Data, IIoT/IoT, AI/ML Services since 2012
  • HQ Atlanta area with offices in Houston, Toronto, and Pune

  • Consulting Services and Managed Services

REACH

  • 125 Customers across 25 Industries

PARTNERS

  • Cloud and technology platform providers
SLIDE 4

Who - Hashmap Team Members

  • Chris Herrera - Chief Architect/Innovation Officer, Hashmap, Houston, TX
  • Akshay Mhetre - Team Lead, Hashmap, Pune, India
  • Jay Kapadnis - Lead Architect, Hashmap, Pune, India

SLIDE 5

The Use Case

Oilfield Drilling Data Processing

SLIDE 6

Why - Oilfield Drilling Data Processing

[Diagram: "The Process" - Plan, Execute, Store, Optimize - built around a WITSML Server]

SLIDE 7

Why - Oilfield Drilling Data Processing

The Plan

[Diagram: vendor, financial, and homegrown systems (TDM, EDM, WellView) feeding a Data Analyst]

  • How to match the data
  • Deduplication
  • Missing information
  • Various formats
  • Various ingest paths

SLIDE 8

Why - Oilfield Drilling Data Processing

Rig Site Data Flow

[Diagram: Mud Logger, Cement, Wireline, and MWD tools emit CSV and DLIS files into WITSML Servers, then "Magic" happens before the data reaches a Data Analyst]

  • Operational data
  • Missing classification
  • Unknown quality
  • Various formats
  • Various ingest paths
  • Unknown completeness

SLIDE 9

Why - Oilfield Drilling Data Processing

Oilfield Drilling Data Processing - Office

[Diagram: vendor, financial, and homegrown systems (TDM, EDM, WellView) feeding a Data Analyst]

  • Impossible to generate insights without huge data cleansing operations
  • Extracting value is a very expensive operation that has to be done with a combination of experts
  • Generating reports requires a huge number of man-hours

SLIDE 10

Why - Oilfield Drilling Data Processing

BUT WAIT…

SLIDE 11

Why - Oilfield Drilling Data Processing

[Pipeline diagram, stages listed in processing order:]

  • Parse - Parse the data from CSV, WITSML, DLIS, etc.
  • Identify & Enrich - Understand where the data came from and what its global key should be
  • Load - Load the data into a staging area to start understanding what to do with it
  • Clean - Deduplicate, interpolate, pivot, split, aggregate
  • Feature Engineering - Generate additional features that are required to get useful insights into the data
  • Persist & Report - Land the data into a store that allows for BI reports and interactive queries

We still have all the compute to deal with, some of which is very legacy code.

SLIDE 12

Requirements

What do we have to do?

SLIDE 13

Functional Requirements

Cleaning and Feature Engineering (the legacy code I referred to)

  • Parse WITSML / DLIS
  • Attribute Mapping
  • Unit Conversions
  • Null Value Handling
  • Rig Operation Enrichment
  • Rig State Detection
  • Invisible Lost Time Analysis
  • Anomaly Detection
SLIDE 14

Non-Functional Requirements

  1. Heterogeneous Data Ingest - Very flexible ingest; flexible simple transformations
  2. Robust Data Pipeline - Easy to debug; trusted
  3. Extensible Feature Engineering - Able to support existing computational frameworks / runtimes
  4. Scalable - Scales up; scales down
  5. Reliable - If a data processing workflow fails at a step, it does not continue with erroneous data

SLIDE 15

Approach V1

How Then?

SLIDE 16

Solution V1

[Architecture diagram: TDM, EDM, WellView, and homegrown sources, plus CSV files from a WITSML server, land in an HDFS staging area; Hive serves the reporting marts; Spark, Zeppelin, and BI tools sit on top]

  • Heterogeneous ingest implemented through a combination of NiFi processors/flows and Spark jobs
  • Avro files loaded as external tables
  • BI connected via ODBC (Tableau)
  • Zeppelin Hive interpreter was used to access the data in Hive

SLIDE 17

Issues with the Solution

  • Very slow BI
  • Tough to debug cleansing
  • Tough to debug feature extractions
  • A lot of overhead for limited benefit
  • Painful data loading process
  • Incremental refresh was challenging
  • Chaining the jobs together in a workflow was very hard (mostly achieved via Jupyter Notebooks)
  • In order to achieve the functional requirements, all of the computations were implemented in Spark, even if there was little benefit

SLIDE 18

V1 Achieved Requirements

  1. Heterogeneous Data Ingest - Very flexible ingest; flexible simple transformations
  2. Robust Data Pipeline - Hard to debug; hard to modify
  3. Extensible Feature Engineering - Hard to support other frameworks; hard to modify current computations
  4. Scalable - Scales up but not down
  5. Robust - Hard to debug

SLIDE 19

Approach V1.5

An Architectural Midstep

SLIDE 20

A Quick Architectural Midstep (V1.5)

[Architecture diagram: the V1 stack with HDFS fronted by IGFS and Hive accelerated by Ignite in-memory MapReduce; Spark, Jupyter, and BI tools on top]

  • Complicated an already complex system
  • Did not solve all of the problems
  • Needed a simpler way to solve all of the issues
  • Ignite persistence was released while we were investigating this

SLIDE 21

Approach V2

How Now?

SLIDE 22

Approach V2

[Architecture diagram: Ignite Service Grid and Memory Grid running in Docker on Kubernetes alongside HDFS, Spark, and Zeppelin; a Workflow API, Scheduler API, and Flink Functions API drive functions over per-workflow caches, with configurable persistent storage]

  • Allows for very interactive workflows
  • Workflows can be scheduled
  • Each workflow is made up of functions (microservices)
  • Each instance of a workflow contains its own cache (a minimal sketch of this follows)
  • Zeppelin connects via the Ignite interpreter
  • Workflows loaded data and also processed data
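A minimal sketch of the per-workflow cache idea, using the stock Ignite cache API; the cache and workflow names are illustrative assumptions, not Hashmap's actual code:

    // Scala, Apache Ignite: one cache per workflow instance.
    import org.apache.ignite.{Ignite, Ignition}
    import org.apache.ignite.configuration.CacheConfiguration

    object WorkflowCacheSketch {
      def main(args: Array[String]): Unit = {
        val ignite: Ignite = Ignition.start() // join the cluster with default config

        val workflowInstanceId = "wf-rigstate-001" // hypothetical instance id
        val cacheCfg =
          new CacheConfiguration[String, Array[Byte]](s"workflow-$workflowInstanceId")
        cacheCfg.setBackups(1) // keep one backup copy so a single node loss is survivable

        // Intermediate results are isolated per workflow run.
        val cache = ignite.getOrCreateCache(cacheCfg)
        cache.put("function-1/output", Array[Byte](1, 2, 3))
      }
    }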

SLIDE 23

Approach V2 - The Workflow

[Diagram: Source → Function 1 → Function 2 → Function 3, each function running as an Apache Ignite service and writing key/value and SQL/DataFrame output to a cache]

  • Source is the location the data is coming from
  • The workflow is the data that goes from function to function
  • Data stored as data frames can be queried by an API or another function (a read-back sketch follows)
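A minimal sketch of reading such a table back as a Spark DataFrame via the ignite-spark integration; the table name and Ignite config path are illustrative assumptions:

    // Scala, ignite-spark: read an Ignite SQL table into a DataFrame.
    import org.apache.ignite.spark.IgniteDataFrameSettings._
    import org.apache.spark.sql.SparkSession

    object ReadBackSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("read-back").getOrCreate()

        val df = spark.read
          .format(FORMAT_IGNITE)
          .option(OPTION_CONFIG_FILE, "/opt/ignite/config/ignite-config.xml") // hypothetical path
          .option(OPTION_TABLE, "rig_state") // hypothetical table written by a prior function
          .load()

        df.show(20) // inspect intermediate results between functions
      }
    }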
SLIDE 24

Approach - The Workflow

  • Each function runs as a service using Service Grid (a deployment sketch follows)
  • The function receives input from any source:
      • Kafka*
      • JDBC
      • Ignite Cache
  • Once the function is applied, the result is stored into the Ignite cache store
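A minimal Service Grid deployment sketch; the service body and the function name are assumptions, since the deck only states that each function runs as a service:

    // Scala, Apache Ignite Service Grid: deploy a "function" as a managed service.
    import org.apache.ignite.Ignition
    import org.apache.ignite.services.{Service, ServiceContext}

    class UnitConversionFunction extends Service { // hypothetical function
      @volatile private var stopped = false

      override def init(ctx: ServiceContext): Unit = { stopped = false }

      override def execute(ctx: ServiceContext): Unit = {
        // Poll an input source (Kafka, JDBC, or an Ignite cache), apply the
        // function, and write the result to the workflow's cache.
        while (!stopped && !ctx.isCancelled) Thread.sleep(1000)
      }

      override def cancel(ctx: ServiceContext): Unit = { stopped = true }
    }

    object DeployFunction {
      def main(args: Array[String]): Unit = {
        val ignite = Ignition.start()
        // One instance cluster-wide; Service Grid redeploys it on node failure.
        ignite.services().deployClusterSingleton("unit-conversion", new UnitConversionFunction)
      }
    }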

SLIDE 25

Workflow Capabilities

  • Start / Stop / Restart a workflow
  • Execute single functions within a workflow
  • Pause execution to validate intermediate steps (a control-surface sketch follows)
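A purely illustrative sketch of a control surface with these capabilities; the deck does not show Hashmap's actual API, so every name below is an assumption:

    // Scala: hypothetical trait capturing the listed workflow capabilities.
    trait WorkflowControl {
      def start(workflowId: String): Unit
      def stop(workflowId: String): Unit
      def restart(workflowId: String): Unit

      // Run a single function of a workflow, e.g. to re-test one step.
      def executeFunction(workflowId: String, functionName: String): Unit

      // Pause after the current function so intermediate caches can be
      // queried, then resume.
      def pause(workflowId: String): Unit
      def resume(workflowId: String): Unit
    }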
SLIDE 26

Approach - Spark Based Functions - Persistence

  • After each function has completed its computation, the Spark DataFrame is stored via distributed storage
  • Table name is stored as SQL_PUBLIC_<tableName>

    // Scala, ignite-spark: persist a function's output DataFrame to Ignite.
    // OPTION_CONFIG_FILE and its (hypothetical) path are added here so the
    // snippet runs standalone; df and tableName come from the function itself.
    import org.apache.ignite.spark.IgniteDataFrameSettings._

    df.write
      .format(FORMAT_IGNITE)
      .option(OPTION_CONFIG_FILE, "/opt/ignite/config/ignite-config.xml")
      .option(OPTION_TABLE, tableName) // table name to store data
      .option(OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS, "id")
      .save()

[Diagram: a Spark function running as an Apache Ignite service, writing key/value and DataFrame output]

SLIDE 27

Approach - Intermediate Querying

  • Once the data is in the cache, it can optionally be persisted using the Ignite persistence module (a persistence configuration sketch appears below)
  • The data can be queried using the Ignite SQL grid module as well
  • Allows for intermediate validation of the data as it proceeds through the workflow

    // Scala: query the cache mid-workflow to validate intermediate results.
    // ignite, cacheConfig, and tableName are defined by the surrounding workflow.
    import org.apache.ignite.cache.query.SqlFieldsQuery

    val cache = ignite.getOrCreateCache(cacheConfig)
    val cursor = cache.query(new SqlFieldsQuery(s"SELECT * FROM $tableName LIMIT 20"))
    val data = cursor.getAll

[Diagram: an API querying the Spark function's key/value and DataFrame output inside the Apache Ignite service]
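A minimal sketch of turning on Ignite native persistence, using the stock configuration API (applying it to the default data region is an illustrative choice):

    // Scala, Apache Ignite: enable native persistence on the default data region.
    import org.apache.ignite.Ignition
    import org.apache.ignite.configuration.{DataStorageConfiguration, IgniteConfiguration}

    object PersistenceSketch {
      def main(args: Array[String]): Unit = {
        val storageCfg = new DataStorageConfiguration
        storageCfg.getDefaultDataRegionConfiguration.setPersistenceEnabled(true)

        val cfg = new IgniteConfiguration
        cfg.setDataStorageConfiguration(storageCfg)

        val ignite = Ignition.start(cfg)
        ignite.cluster().active(true) // persistent clusters start inactive; activate first
      }
    }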

SLIDE 28

Approach - Applied to the Use Case

[Diagram: WITSML Server → Java WITSML Client (Docker) → Channel Mapping / Unit Conversion (Docker) → Rig State Detection / Enrichment / Pivot (Spark); each step is an Apache Ignite service with key/value and SQL caches, driven by the Workflow API and Scheduler API]

SLIDE 29

V2 Achieved Requirements

  1. Heterogeneous Data Ingest - Very flexible ingest; flexible transformations
  2. Robust Data Pipeline - Easy to debug; easy to modify
  3. Extensible Feature Engineering - Easy to add; easy to experiment
  4. Scalable - Scales up; scales down
  5. Robust - Easy to debug; reliable

SLIDE 30

Solution Benchmark Setup

  • Dimension tables already loaded
  • 8 functions (6 wells of data - 5.7 billion points):
      • Ingest / Parse WITSML
      • Null Value Handling
      • Interpolation
      • Depth Adjustments
      • Drill State Detection
      • Rig State Detection
      • Anomaly Detection
      • Pivot Dataset
  • For V1, everything was implemented as a Spark application
  • For V2, the computations remained close to their original format
SLIDE 31

Solution Comparison

V1 execute time: 9 hours (7 hours without the WITSML download)

V2 execute time: 2 hours (22 minutes without the WITSML download)

19x improvement from V1 to V2 (7 hours down to 22 minutes, excluding the WITSML download)

SLIDE 32

Lessons Learned

How Now?

SLIDE 33

Lessons Learned

  • Apache Ignite is a great tool to speed up data processing without a wholesale replacement of technology
  • Apache Ignite does have a learning curve; it is definitely worth doing an analysis beforehand to understand what it means to operationalize it
  • Accelerating Hive via Ignite was not straightforward and, at times, made it very difficult to debug the actual issues that we were facing
  • Spatial querying, while great, is LGPL-licensed, so be aware of that before your specific implementation
  • Understanding data locality in Ignite is crucial for larger data sets
  • Ignite works very well inside of Kubernetes due to its peer-to-peer clustering mechanism
  • The thin client JDBC driver does not have affinity awareness, so in multi-node configurations the thick client is preferred (a comparison sketch follows)
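A minimal sketch contrasting the two JDBC drivers; the host, port, and config path are illustrative:

    // Scala, plain JDBC against Ignite.
    import java.sql.DriverManager

    object JdbcClientsSketch {
      def main(args: Array[String]): Unit = {
        // Thin client: lightweight, but without affinity awareness every request
        // goes through whichever server node the client connected to.
        val thin = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1:10800")

        // Thick client: joins the topology as a client node, learns the partition
        // map, and can route work to the nodes that own the data.
        Class.forName("org.apache.ignite.IgniteJdbcDriver")
        val thick = DriverManager.getConnection(
          "jdbc:ignite:cfg://file:///opt/ignite/config/ignite-config.xml") // hypothetical path

        thin.close(); thick.close()
      }
    }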

SLIDE 34

What’s Next

How Now?

SLIDE 35

What’s Next

  • Implementation of a UI on top of the computational framework
  • Implementation of a standard set of “functions” that can be leveraged on top of the memory grid
  • Implementation of streaming sources via the Kafka Ignite Sink (a configuration sketch follows)
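A sketch of a Kafka Connect config for the Ignite sink (the ignite-kafka module's IgniteSinkConnector); the topic, cache, and path names are illustrative assumptions:

    # ignite-sink.properties (Kafka Connect)
    name=witsml-ignite-sink
    connector.class=org.apache.ignite.stream.kafka.connect.IgniteSinkConnector
    tasks.max=2
    topics=witsml-channels
    # Ignite cache that receives the streamed records
    cacheName=workflow-ingest
    cacheAllowOverwrite=true
    igniteCfg=/opt/ignite/config/ignite-config.xml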
SLIDE 36

Questions

Chris Herrera, Hashmap

Apache Ignite - Using a Memory Grid for Heterogeneous Computation Frameworks

A Use Case Guided Explanation