HOW TO ACHIEVE REAL-TIME ANALYTICS ON A DATA LAKE USING GPUS Mark - PowerPoint PPT Presentation

HOW TO ACHIEVE REAL-TIME ANALYTICS ON A DATA LAKE USING GPUS Mark Brooks - Principal System Engineer @ Kinetica May 09, 2017

The Challenge: How to maintain analytic performance while dealing with: • Larger data volumes • Streaming data with minimal end-to-end latency • Ad-hoc drill down (you can’t pre-aggregate everything) 2

Architectural and Design Approaches 1. One database to rule them all 2. SQL on Hadoop (or directly on the Data Lake) 3. Data Lake + NoSQL + Spark + Search + Cache +… 4. Lambda Architecture 5. Kappa Architecture 6. Next generation hardware acceleration 3

One Database To Rule Them All 4

SQL on a Data Lake 5 Credit: https://www.slideshare.net/Bigdatapump/sql-on-hadoop-49494494

Hadoop + NoSQL + Search + Memory Cache +… 6 Credit: Matt Turck - https://www.slideshare.net/mjft01/big-data-landscape-matt-turck-may-2014

Lambda Architecture Credit: Nathan Marz http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html James Kinley http://jameskinley.tumblr.com/tagged/Lambda 7

Lambda Architecture Credit: James Kinley http://jameskinley.tumblr.com/tagged/Lambda 7

Kappa Architecture Credit: Jay Kreps https://www.oreilly.com/ideas/questioning-the-lambda-architecture 8

Kappa Architecture Stream processing systems already have a notion of parallelism; why not just handle reprocessing by increasing the parallelism and replaying history very, very fast? Credit: Jay Kreps https://www.oreilly.com/ideas/questioning-the-lambda-architecture 8

Next Generation Hardware Acceleration Consider a system with these characteristics: • Horizontally Scalable • Low end-to-end latency • Powerful enough to not require pre-aggregation This is now possible… Credit: Jay Kreps https://www.oreilly.com/ideas/questioning-the-lambda-architecture 8

GPU Accelerated Compute 1990 - 2000’s 2005… 2010… 2017… AT SCALE PROCESSING BECOMES THE BOTTLENECK DATA WAREHOUSE AFFORDABLE MEMORY GPU ACCELERATED COMPUTE DISTRIBUTED STORAGE RDBMS & Data Warehouse Hadoop and MapReduce Affordable memory allows for GPU cores bulk process tasks in parallel - far more efficient for many technologies enable enables distributed storage and faster data read and write. data-intensive tasks than CPUs organizations to store and processing across multiple HANA, MemSQL, & Exadata which process those tasks linearly. analyze growing volumes of data machines. provide faster analytics. on high performance machines, but at high cost. Storing massive volumes of data becomes more affordable, but performance is slow 12

Kinetica: Core HTTP Head Node GPU Accelerated ANALYTICS DATABASE ACCELERATED BY GPUs Columnar In-memory Database A1 B1 C1 A2 B2 C2 Columnar in-memory database A3 B3 C3 A4 B4 C4 Data available much like a traditional RDBMS… rows, columns Disk Commodity Hardware w/ GPUs Data held in-memory; persisted to disk KINETICA Interact with Kinetica through its native REST API, Java, Python, JavaScript, NodeJS, C++, SQL, etc… as well as with various connectors Native GIS & IP address object support VERY FAST: Ideal for OLAP workloads Typical hardware setup: 256GB - 1TB memory with 2-4 GPUs per node. 13

Multi-Head Ingest and Scale-Out Architecture ON-DEMAND SCALE OUT HTTP Head Node HTTP Head Node HTTP Head Node Columnar Columnar Columnar In-memory In-memory In-memory A1 C1 A1 C1 A1 C1 B1 B1 B1 + A2 B2 C2 A2 B2 C2 A2 B2 C2 A3 C3 A3 C3 A3 C3 B3 B3 B3 A4 C4 A4 C4 A4 C4 B4 B4 B4 Disk Disk Disk Commodity Hardware Commodity Hardware Commodity Hardware w/ GPUs w/ GPUs w/ GPUs MULTI-HEAD INGEST 19

Real-Time Data Handlers for Structured & Unstructured Data APIs VISUALIZATION via ODBC/JDBC GEOSPATIAL CAPABILITIES Geometric Java API C++ API WMS Objects Tracks WKT JavaScript API Node.js API Geospatial REST API Python API Endpoints OPEN SOURCE INTEGRATION HTTP Head Node HTTP Head Node HTTP Head Node HTTP Head Node Apache NiFi Columnar Columnar Columnar Columnar In-memory In-memory In-memory In-memory Apache Kafka A1 B1 C1 A1 B1 C1 A1 C1 A1 C1 B1 B1 Apache Spark A2 B2 C2 A2 B2 C2 A2 B2 C2 A2 B2 C2 A3 C3 A3 B3 C3 B3 A3 B3 C3 A3 B3 C3 Apache Storm A4 C4 A4 B4 C4 B4 A4 B4 C4 A4 B4 C4 OTHER INTEGRATION Disk Disk Disk Disk Message Queues Commodity Hardware Commodity Hardware Commodity Hardware Commodity Hardware w/ GPUs ETL Tools w/ GPUs w/ GPUs w/ GPUs KINETICA CLUSTER Streaming Tools On-Demand Scale 20

Parallel Ingest Provides High Performance Streaming 1 NODE (1TB/2GPU) PARALLEL INGEST 1 NODE (1TB/2GPU) 1 NODE (1TB/2GPU) Each node of the system can share the task of data ingest, provides more and faster throughput. It can be made faster simply by adding more nodes. No compute is used on ingest ! 16

Speed Layer for the Data Lake Parallel ingestion of events Kinetica is speed layer with real- ANALYSTS Put, get, scan time analytic capabilities Amazon Kinesis HDFS for archival store MOBILE ALERTING Much looser coupling than USERS SYSTEMS Kinetica traditional lambda architecture EVENTS Connectors Execute complex analytics on the fly Batch mode Spark or MR jobs DASHBOARDS & APPLICATIONS can push data to Kinetica as MESSAGE STREAM BROKERS PROCESSING needed for fast query on data loaded from the data lake Parallel Ingestion HDFS / AWS S3 / GCS / Azure Data Lake ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° 17

Real-Time, Advanced Analytics, Speed Layer for Teradata or Oracle Parallel ingestion of events Lambda-type architecture for ANALYSTS Teradata or Oracle Amazon Kinesis Fast GPU Kinetica is speed layer with accelerated, in- near-real-time analytic Memory Database MOBILE ALERTING Converge ML, AI, capabilities USERS SYSTEMS Streaming DATA IN Kinetica MOTION Converge Machine Learning, AND REST Connectors streaming and location DASHBOARDS & APPLICATIONS analytics and fast Query and STREAM / ETL PROCESSING Analytics with Kinetica and RDBMS DATA WAREHOUSE / TRANSACTIONAL 18

Advanced In-Database Analytics ORCHESTRATION LAYER WITH USER-DEFINED FUNCTIONS (UDFs) 1. User-defined functions (UDFs) can receive table data, PHYSICAL / VIRTUAL SERVER do arbitrary computations, and save output to a separate table in a distributed manner. Table A Data returned to Table n output table for Table B 2. UDFs have direct access to CUDA APIs – enables further analysis Table C compute-to-grid analytics for logic deployed within Kinetica. Proc Server /exec/proc/UDF_A/ UDF_B UDF_n UDF_A 3. Works with custom code, or packaged code. Opens UDFs exposed from RESTful endpoint the way for machine learning/artificial intelligence CUDA Libraries libraries such as TensorFlow, BIDMach, Caffe and Torch to work on data directly within Kinetica. GPU 4. Available now with C++ & Java bindings. n number of Kinetica servers 19

Kinetica Architecture ETL / STREAM Native BI / GIS / APPS PROCESSING APIs STREAMING DATA KINETICA ‘REVEAL’ SQL PARALLEL INGEST ON DEMAND SCALE OUT + Geospatial WMS BI DASHBOARDS Custom 1TB MEM / 2 GPU CARDS Connectors UDFs TRANSACTIONAL DATA In-Database Processing CUSTOM APPS & GEOSPATIAL ERP / CRM / ML Libs CUSTOM LOGIC BIDMach 20

AI & BI on One GPU-Accelerated Database CUSTOM APPLICATIONS HIGH FIDELITY GEOSPATIAL PIPELINE BUSINESS INTELLIGENCE ODBC Native / JDBC REST API WMS SQL BUSINESS USERS HIGH PERFORMANCE ANALYTICS DATABASE UDF UDF UDF BIDMach DATA SCIENTISTS / DEVELOPERS MACHINE LEARNING PREDICTIVE MODELS e.g. Risk Management, & DEEP LEARNING GPU-ACCELERATED Sales Volume, Fraud. DATA SCIENCE 21

50-100x Faster on Queries with Large Datasets WHEN COMPARED TO LEADING IN-MEMORY ALTERNATIVES • Large retailer tested complex SQL queries on 3 years of retail data (150bn rows) SELECT (Q10) • 10 node Kinetica cluster against 30TB+ cluster from next best alternative GROUP BY (Q5) • GPU is able to perform many instructions in parallel. Huge performance gains on aggregations, group bys, joins, etc. SUM (Q1) • Kinetica sustained ingest of 1.3bn 0 5 10 15 20 25 30 35 40 45 50 objects/minute with 70 attributes per row Kinetica Leading In-Memory DB More Details 22

Distributed Geospatial Pipeline • NATIVE VISUALIZATION IS DESIGNED FOR FAST MOVING, LOCATION-BASED DATA Native Geospatial Object Types • Points, Shapes, Tracks, Labels Native Geospatial Functions Filters (by area, by series, by geometry, etc.) • Aggregation (histograms) • • Geofencing - triggers Video generation (based on dates/times) • Generate Map Overlay Imagery (via WMS) • Rasterize points Style based on attributes (class-break) • Heat maps • 23 23

Full-Text Search Kinetica includes powerful text search functionality, “Rain Tire” ~5 including : "Union Tranquility"~10 • Exact Phrases [100 TO 200] • Boolean – AND / OR • Wildcards • Grouping • Fuzzy Search (Damerau-Levenshtein optimal string alignment algorithm) • N-Gram Term Proximity Search • Term Boosting Relevance Prioritization 22

HOW TO ACHIEVE REAL-TIME ANALYTICS ON A DATA LAKE USING GPUS Mark - PowerPoint PPT Presentation

HOW TO ACHIEVE REAL-TIME ANALYTICS ON A DATA LAKE USING GPUS Mark Brooks - Principal System Engineer @ Kinetica May 09, 2017 The Challenge: How to maintain analytic performance while dealing with: Larger data volumes Streaming data

Analytics and Data Summit 2020 Analytics and Data Summit 2020 Analytics and Data Summit 2020

Laurance Lake Reservoir Course Reservoir Course Project j Laurance Lake Laurance Lake

Hidden Lake Vista -Montana Tufa Towers Mono Lake California Sylvan Lake ,Black Hills-South Dakota

Real-Time in the Real World: Building a State of the Art Real-Time Analytics Platform INFORMS

Real- Real -Time Systems Time Systems Real- -Time Systems Time Systems Real

Real Real- -Time Systems Time Systems Designing a real- Designing a real -time system time

Real- Real -time systems time systems Real- Real -time programming time programming

Real graduates, Real graduates, real transitions, real transitions, real stories: real

Temporary Seasonal Lake Lowering Overview Pertinent facts and details about Lake Conroe, Lake

JJC - Campus Lake Lake pictures taken by Virginia Piekarski, 2006 JJC Lake Rehabilitation

Undergraduate Business Analytics Minor Spreadsheet Analytics BANA-2081 Business Analytics

Real Real Real Time Real-Time Time Time Model Checking Model Model Checking Model

Real time Predictive Fraud Analytics using Databricks & Tableau Prasad Kona Partner Solution

GPU-Accelerated Analytics on your Data Lake. Data Lake @blazingdb Data Swamp @blazingdb ETL

SCENARIO TOOL Lake Champlain Basin TMDL Lake Champlain Phosphorus Water ershed ed In-lake ke

Architecture 3.0 Landscape Analytics Jrgen Dllner Hasso-Plattner-Institut Jrgen

Description For your topic, you and your partner will be the lecturers. You will have two steps:

MongoDB By Bharath Subramanyam Relational Databases started to become popular in the 80s and

Office of Public Affairs Court Website Toolbox NCBC/FCCA Conference August 2016 Agenda

-: U ser M an ual :- 1. First the office user after receiving their user id and password from

Contact information ef@math.wvu.edu mays@math.wvu.edu http://math.wvu.edu

Current Status of a Measurement of Hadronic Parity Violation in the Capture of Cold Neutrons on

L A T EX: Making Math Accessible for the Blind or Visually Impaired Anthony Janolino Simon

On the Properties of Stored Electromagnetic Energy Miloslav Capek Lukas Jelinek Department of

HOW TO ACHIEVE REAL-TIME ANALYTICS ON A DATA LAKE USING GPUS Mark - PowerPoint PPT Presentation

HOW TO ACHIEVE REAL-TIME ANALYTICS ON A DATA LAKE USING GPUS Mark Brooks - Principal System Engineer @ Kinetica May 09, 2017 The Challenge: How to maintain analytic performance while dealing with: Larger data volumes Streaming data

Analytics and Data Summit 2020 Analytics and Data Summit 2020 Analytics and Data Summit 2020

Laurance Lake Reservoir Course Reservoir Course Project j Laurance Lake Laurance Lake

Hidden Lake Vista -Montana Tufa Towers Mono Lake California Sylvan Lake ,Black Hills-South Dakota

Real-Time in the Real World: Building a State of the Art Real-Time Analytics Platform INFORMS

Real- Real -Time Systems Time Systems Real- -Time Systems Time Systems Real

Real Real- -Time Systems Time Systems Designing a real- Designing a real -time system time

Real- Real -time systems time systems Real- Real -time programming time programming

Real graduates, Real graduates, real transitions, real transitions, real stories: real

Temporary Seasonal Lake Lowering Overview Pertinent facts and details about Lake Conroe, Lake

JJC - Campus Lake Lake pictures taken by Virginia Piekarski, 2006 JJC Lake Rehabilitation

Undergraduate Business Analytics Minor Spreadsheet Analytics BANA-2081 Business Analytics

Real Real Real Time Real-Time Time Time Model Checking Model Model Checking Model

Real time Predictive Fraud Analytics using Databricks &amp; Tableau Prasad Kona Partner Solution

GPU-Accelerated Analytics on your Data Lake. Data Lake @blazingdb Data Swamp @blazingdb ETL

SCENARIO TOOL Lake Champlain Basin TMDL Lake Champlain Phosphorus Water ershed ed In-lake ke

Architecture 3.0 Landscape Analytics Jrgen Dllner Hasso-Plattner-Institut Jrgen

Description For your topic, you and your partner will be the lecturers. You will have two steps:

MongoDB By Bharath Subramanyam Relational Databases started to become popular in the 80s and

Office of Public Affairs Court Website Toolbox NCBC/FCCA Conference August 2016 Agenda

-: U ser M an ual :- 1. First the office user after receiving their user id and password from

Contact information ef@math.wvu.edu mays@math.wvu.edu http://math.wvu.edu

Current Status of a Measurement of Hadronic Parity Violation in the Capture of Cold Neutrons on

L A T EX: Making Math Accessible for the Blind or Visually Impaired Anthony Janolino Simon

On the Properties of Stored Electromagnetic Energy Miloslav Capek Lukas Jelinek Department of

Real time Predictive Fraud Analytics using Databricks & Tableau Prasad Kona Partner Solution