

slide-1
SLIDE 1

HOW TO ACHIEVE REAL-TIME ANALYTICS ON A DATA LAKE USING GPUS

Mark Brooks - Principal System Engineer @ Kinetica May 09, 2017

slide-2
SLIDE 2

The Challenge:

How to maintain analytic performance while dealing with:

  • Larger data volumes
  • Streaming data with minimal end-to-end latency
  • Ad-hoc drill down (you can’t pre-aggregate everything)


slide-3
SLIDE 3

Architectural and Design Approaches

  • 1. One database to rule them all
  • 2. SQL on Hadoop (or directly on the Data Lake)
  • 3. Data Lake + NoSQL + Spark + Search + Cache +…
  • 4. Lambda Architecture
  • 5. Kappa Architecture
  • 6. Next generation hardware acceleration


slide-4
SLIDE 4

One Database To Rule Them All


slide-5
SLIDE 5

SQL on a Data Lake

Credit: https://www.slideshare.net/Bigdatapump/sql-on-hadoop-49494494


slide-6
SLIDE 6

Hadoop + NoSQL + Search + Memory Cache +…

Credit: Matt Turck - https://www.slideshare.net/mjft01/big-data-landscape-matt-turck-may-2014


slide-7
SLIDE 7

Lambda Architecture

Credit: Nathan Marz http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html; James Kinley http://jameskinley.tumblr.com/tagged/Lambda


slide-8
SLIDE 8

Lambda Architecture

Credit: James Kinley http://jameskinley.tumblr.com/tagged/Lambda


slide-9
SLIDE 9

Kappa Architecture

Credit: Jay Kreps https://www.oreilly.com/ideas/questioning-the-lambda-architecture


slide-10
SLIDE 10

Kappa Architecture

Credit: Jay Kreps https://www.oreilly.com/ideas/questioning-the-lambda-architecture

"Stream processing systems already have a notion of parallelism; why not just handle reprocessing by increasing the parallelism and replaying history very, very fast?"
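The replay idea in the quote can be sketched in a few lines. This is only an illustration, assuming an in-memory list stands in for a retained event log; `LOG`, `job_v1`, `job_v2` and `replay` are hypothetical names, not the API of any stream processor.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical retained log: (partition, user, amount) events in arrival order.
LOG = [(i % 4, f"user{i % 3}", i) for i in range(12)]


def job_v1(events):
    """Original logic: count events per user."""
    out = defaultdict(int)
    for _, user, _ in events:
        out[user] += 1
    return dict(out)


def job_v2(events):
    """Revised logic: sum amounts per user."""
    out = defaultdict(int)
    for _, user, amount in events:
        out[user] += amount
    return dict(out)


def replay(log, job, parallelism):
    """Reprocess the whole log from offset 0, splitting partitions across workers."""
    groups = [[e for e in log if e[0] % parallelism == w] for w in range(parallelism)]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        partials = list(pool.map(job, groups))
    # Both example jobs produce additive results, so partial views merge by summing.
    merged = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)


old_view = replay(LOG, job_v1, parallelism=1)
new_view = replay(LOG, job_v2, parallelism=4)  # same log, new code, more workers
```

The point of the sketch is that there is no separate batch code path: changing the logic means running the new job over the same log with more workers, then switching readers to the new view.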

slide-11
SLIDE 11

Next Generation Hardware Acceleration

Credit: Jay Kreps https://www.oreilly.com/ideas/questioning-the-lambda-architecture


Consider a system with these characteristics:

  • Horizontally Scalable
  • Low end-to-end latency
  • Powerful enough to not require pre-aggregation

This is now possible…

slide-12
SLIDE 12

GPU Accelerated Compute

1990 - 2000’s: DATA WAREHOUSE

RDBMS & Data Warehouse technologies enable organizations to store and analyze growing volumes of data on high performance machines, but at high cost.

2005…: DISTRIBUTED STORAGE

Hadoop and MapReduce enable distributed storage and processing across multiple machines. Storing massive volumes of data becomes more affordable, but performance is slow.

2010…: AFFORDABLE MEMORY

Affordable memory allows for faster data read and write. HANA, MemSQL, & Exadata provide faster analytics.

2017…: GPU ACCELERATED COMPUTE

AT SCALE PROCESSING BECOMES THE BOTTLENECK

GPU cores bulk process tasks in parallel, which is far more efficient for many data-intensive tasks than CPUs, which process those tasks serially.

slide-13
SLIDE 13

Kinetica: Core

ANALYTICS DATABASE ACCELERATED BY GPUs

[Diagram: KINETICA on commodity hardware w/ GPUs: an HTTP head node in front of a GPU-accelerated columnar in-memory database (data cells A1 B1 C1 … A4 B4 C4), backed by disk]

  • Columnar in-memory database
  • Data available much like a traditional RDBMS… rows, columns
  • Data held in-memory; persisted to disk
  • Interact with Kinetica through its native REST API, Java, Python, JavaScript, NodeJS, C++, SQL, etc… as well as with various connectors
  • Native GIS & IP address object support
  • VERY FAST: Ideal for OLAP workloads

Typical hardware setup: 256GB - 1TB memory with 2-4 GPUs per node.
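Why a columnar layout suits OLAP can be shown with a small sketch. It uses plain Python lists and made-up retail rows; it illustrates the general technique only and says nothing about Kinetica's internal storage format.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"store": "A", "item": "tire", "price": 80.0},
    {"store": "B", "item": "wiper", "price": 15.0},
    {"store": "A", "item": "tire", "price": 95.0},
]


def to_columns(records):
    """Pivot row records into one contiguous list per column."""
    columns = {name: [] for name in records[0]}
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return columns


columns = to_columns(rows)

# An aggregate touches only the column it needs, not every field of every row.
total_price = sum(columns["price"])

# A filter builds a row mask from one column and applies it to another.
mask = [store == "A" for store in columns["store"]]
price_in_a = sum(p for p, keep in zip(columns["price"], mask) if keep)
```

Each column is a dense, uniformly typed array, which is the shape of data that parallel hardware can scan in bulk.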

slide-14
SLIDE 14

Multi-Head Ingest and Scale-Out Architecture

[Diagram: ON-DEMAND SCALE OUT and MULTI-HEAD INGEST. Three identical nodes joined by "+", each on commodity hardware w/ GPUs with its own HTTP head node, columnar in-memory store (data cells A1 B1 C1 … A4 B4 C4), and disk]

slide-15
SLIDE 15

Real-Time Data Handlers for Structured & Unstructured Data

VISUALIZATION via ODBC/JDBC

APIs

  • Java API
  • JavaScript API
  • REST API
  • C++ API
  • Node.js API
  • Python API

OPEN SOURCE INTEGRATION

  • Apache NiFi
  • Apache Kafka
  • Apache Spark
  • Apache Storm

GEOSPATIAL CAPABILITIES

  • Geometric Objects
  • Tracks
  • Geospatial Endpoints
  • WMS
  • WKT

OTHER INTEGRATION

  • Message Queues
  • ETL Tools
  • Streaming Tools

[Diagram: KINETICA CLUSTER with on-demand scale. Four identical nodes, each on commodity hardware w/ GPUs with an HTTP head node, columnar in-memory store (data cells A1 B1 C1 … A4 B4 C4), and disk]

slide-16
SLIDE 16

Parallel Ingest Provides High Performance Streaming

[Diagram: PARALLEL INGEST into three nodes, each labeled 1 NODE (1TB/2GPU)]

Each node of the system can share the task of data ingest, which provides higher throughput. Ingest can be made faster simply by adding more nodes. No compute is used on ingest!
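One common way to let every node share ingest is for the client to hash a shard key and send each batch straight to the node that owns it. The sketch below assumes that scheme; the node names, the CRC32 hash and the `owner` / `route` helpers are hypothetical, not Kinetica's actual protocol.

```python
import zlib
from collections import defaultdict

NODES = ["node-0", "node-1", "node-2"]  # hypothetical node names


def owner(shard_key, nodes):
    """Map a shard key to a node with a stable hash (CRC32 is an assumption)."""
    return nodes[zlib.crc32(shard_key.encode("utf-8")) % len(nodes)]


def route(records, nodes, key_field):
    """Group records by owning node so each batch can go straight to that node."""
    batches = defaultdict(list)
    for record in records:
        batches[owner(record[key_field], nodes)].append(record)
    return dict(batches)


records = [{"device": f"dev-{i}", "lat": 38.0 + i, "lon": -77.0} for i in range(1000)]
batches = route(records, NODES, "device")
```

Because no single head node sits in the path of every record, adding a node adds another ingest endpoint. A production system would also need to rebalance keys when the node count changes, which this sketch ignores.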

slide-17
SLIDE 17

Speed Layer for the Data Lake

  • Parallel ingestion of events
  • Kinetica is speed layer with real-time analytic capabilities
  • HDFS for archival store
  • Much looser coupling than traditional lambda architecture
  • Batch mode Spark or MR jobs can push data to Kinetica as needed for fast query on data loaded from the data lake

[Diagram labels: EVENTS; MESSAGE BROKERS (Amazon Kinesis); STREAM PROCESSING; Parallel Ingestion; Kinetica Connectors; Kinetica (put, get, scan; execute complex analytics on the fly); HDFS / AWS S3 / GCS / Azure Data Lake; ANALYSTS; MOBILE USERS; DASHBOARDS & APPLICATIONS; ALERTING SYSTEMS]

slide-18
SLIDE 18

Real-Time, Advanced Analytics, Speed Layer for Teradata or Oracle

  • Parallel ingestion of events
  • Lambda-type architecture for Teradata or Oracle
  • Kinetica is speed layer with near-real-time analytic capabilities
  • Converge Machine Learning, streaming and location analytics and fast Query and Analytics with Kinetica and RDBMS

[Diagram labels: DATA IN MOTION AND REST; DATA WAREHOUSE / TRANSACTIONAL; Amazon Kinesis; STREAM / ETL PROCESSING; Kinetica Connectors; Fast GPU accelerated, in-memory database; Converge ML, AI, Streaming; ANALYSTS; MOBILE USERS; DASHBOARDS & APPLICATIONS; ALERTING SYSTEMS]

slide-19
SLIDE 19

Advanced In-Database Analytics

  • 1. User-defined functions (UDFs) can receive table data, do arbitrary computations, and save output to a separate table in a distributed manner.
  • 2. UDFs have direct access to CUDA APIs, which enables compute-to-grid analytics for logic deployed within Kinetica.
  • 3. Works with custom code, or packaged code. Opens the way for machine learning/artificial intelligence libraries such as TensorFlow, BIDMach, Caffe and Torch to work on data directly within Kinetica.
  • 4. Available now with C++ & Java bindings.

ORCHESTRATION LAYER WITH USER-DEFINED FUNCTIONS (UDFs)

[Diagram: on each of n Kinetica servers (physical / virtual), a proc server runs UDF_A, UDF_B, … UDF_n against Table A, Table B, Table C, … Table n, using the GPU and CUDA libraries. UDFs are exposed from a RESTful endpoint (/exec/proc/UDF_A/). Data is returned to an output table for further analysis.]
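The UDF flow described on this slide (table data in, arbitrary computation, results saved to a separate table, executed in a distributed manner) can be sketched as follows. This is a plain-Python stand-in with threads playing the role of servers; `exec_proc`, `udf_square` and the shard layout are hypothetical and do not reproduce Kinetica's C++ or Java UDF bindings.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input table, already split into shards (one per server).
TABLE_A_SHARDS = [
    [{"id": 1, "x": 2.0}, {"id": 2, "x": 3.0}],
    [{"id": 3, "x": 4.0}],
    [],
]


def udf_square(shard):
    """Example UDF body: receives one shard's rows and returns output rows."""
    return [{"id": row["id"], "x_squared": row["x"] ** 2} for row in shard]


def exec_proc(udf, input_shards):
    """Run the UDF once per shard in parallel; each result is an output-table shard."""
    with ThreadPoolExecutor(max_workers=len(input_shards)) as pool:
        return list(pool.map(udf, input_shards))


output_shards = exec_proc(udf_square, TABLE_A_SHARDS)
output_table = [row for shard in output_shards for row in shard]
```

The UDF only ever sees the rows local to its shard, so the computation moves to the data rather than the data moving to a central process.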

slide-20
SLIDE 20

Kinetica Architecture

[Architecture diagram labels: STREAMING DATA; ERP / CRM / TRANSACTIONAL DATA; ETL / STREAM PROCESSING; PARALLEL INGEST; Custom Connectors; ON DEMAND SCALE OUT (+ 1TB MEM / 2 GPU CARDS); In-Database Processing (UDFs, CUSTOM LOGIC, BIDMach, ML Libs); SQL; Native APIs; Geospatial WMS; BI DASHBOARDS; BI / GIS / APPS; CUSTOM APPS & GEOSPATIAL; KINETICA ‘REVEAL’]

slide-21
SLIDE 21


AI & BI on One GPU-Accelerated Database

[Diagram labels: HIGH PERFORMANCE ANALYTICS DATABASE with UDFs; ODBC / JDBC; SQL; Native REST API; WMS; BUSINESS INTELLIGENCE; CUSTOM APPLICATIONS; HIGH FIDELITY GEOSPATIAL PIPELINE; MACHINE LEARNING & DEEP LEARNING (BIDMach); GPU-ACCELERATED DATA SCIENCE; PREDICTIVE MODELS, e.g. Risk Management, Sales Volume, Fraud; DATA SCIENTISTS / DEVELOPERS; BUSINESS USERS]

slide-22
SLIDE 22

50-100x Faster on Queries with Large Datasets

  • Large retailer tested complex SQL queries on 3 years of retail data (150bn rows)
  • 10 node Kinetica cluster against 30TB+ cluster from next best alternative
  • GPU is able to perform many instructions in parallel. Huge performance gains on aggregations, group bys, joins, etc.
  • Kinetica sustained ingest of 1.3bn objects/minute with 70 attributes per row

[Chart: WHEN COMPARED TO LEADING IN-MEMORY ALTERNATIVES. Queries: SUM (Q1), GROUP BY (Q5), SELECT (Q10). Series: Kinetica, Leading In-Memory DB. Axis ticks from 5 to 50.]

slide-23
SLIDE 23


Distributed Geospatial Pipeline

NATIVE VISUALIZATION IS DESIGNED FOR FAST MOVING, LOCATION-BASED DATA

Native Geospatial Object Types

  • Points, Shapes, Tracks, Labels

Native Geospatial Functions

  • Filters (by area, by series, by geometry, etc.)
  • Aggregation (histograms)
  • Geofencing - triggers
  • Video generation (based on dates/times)

Generate Map Overlay Imagery (via WMS)

  • Rasterize points
  • Style based on attributes (class-break)
  • Heat maps
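A filter by geometry and a geofence trigger, as listed above, both reduce to a point-in-polygon test. The sketch below uses the standard ray-casting method on planar coordinates; the fence, the track and the function names are hypothetical, and a real geospatial engine would also handle projections and boundary cases.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: count edge crossings of a ray cast in the +lon direction."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def geofence_events(track, polygon):
    """Yield (index, 'enter' or 'exit') whenever a track crosses the fence boundary."""
    was_inside = False
    for i, (lon, lat) in enumerate(track):
        now_inside = point_in_polygon(lon, lat, polygon)
        if now_inside != was_inside:
            yield i, "enter" if now_inside else "exit"
        was_inside = now_inside


FENCE = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]  # hypothetical square fence
TRACK = [(-1.0, 2.0), (1.0, 2.0), (3.0, 2.0), (5.0, 2.0)]  # hypothetical device track
events = list(geofence_events(TRACK, FENCE))
```

Every point is tested independently of the others, which is why this kind of filter parallelizes well across many points at once.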
slide-24
SLIDE 24

Full-Text Search

“Rain Tire” ~5

Kinetica includes powerful text search functionality, including:

  • Exact Phrases
  • Boolean – AND / OR
  • Wildcards
  • Grouping
  • Fuzzy Search (Damerau-Levenshtein optimal string alignment algorithm)
  • N-Gram Term Proximity Search
  • Term Boosting Relevance Prioritization
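The fuzzy search bullet names the Damerau-Levenshtein optimal string alignment (OSA) distance. A minimal sketch of that distance is shown here; `osa_distance` and `fuzzy_match` are illustrative names, and how Kinetica indexes or thresholds matches is not stated in the slides.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: Levenshtein plus adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]


def fuzzy_match(term, candidates, max_distance):
    """Return the candidates within max_distance edits of term, ignoring case."""
    return [c for c in candidates if osa_distance(term.lower(), c.lower()) <= max_distance]


matches = fuzzy_match("tire", ["tier", "tyre", "tree", "wheel"], max_distance=1)
```

OSA treats a swap of two adjacent characters as a single edit, so "tire" and "tier" are one edit apart, but it never edits the same substring twice, which is what distinguishes it from the unrestricted Damerau-Levenshtein distance.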

"Union Tranquility"~10 [100 TO 200]


slide-25
SLIDE 25

INTELLIGENCE: US Army - INSCOM

The US Army’s in-memory computational engine for any data with a geospatial or temporal attribute, for a major joint cloud initiative within the Intelligence Community (IC ITE). Intel analysts are able to conduct near real-time analytics, fuse SIGINT, ISR, and GEOINT streaming big data feeds, and visualize the results in a web browser. For the first time in history, military analysts are able to query and visualize billions to trillions of near real-time objects in a production environment. Major executive military and congressional visibility.

U.S. Army INSCOM shift from Oracle to GPUdb:

  • Oracle Spatial (92 minutes) vs. GPUdb (20ms)
  • 1 GPUdb server vs. 42 servers with Oracle 10gR2 (2011)
  • 42x lower space
  • 28x lower cost
  • 38x lower power cost

CASE STUDY: LOCATION BASED ANALYTICS

slide-26
SLIDE 26

LOGISTICS: Workforce optimization

DISTRIBUTED ANALYSIS

USPS’ parallel cluster is able to serve up to 15,000 simultaneous sessions, providing the service’s managers and analysts with the capability to instantly analyze their areas of responsibility via dashboards.

AT SCALE

With 200,000 USPS devices emitting location once every minute, that amounts to more than a quarter billion events captured and analyzed daily… tracked on 10 nodes.
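The "quarter billion" figure follows from simple arithmetic. The check below assumes every device reports once a minute for all 24 hours, which the slide does not state explicitly.

```python
DEVICES = 200_000          # from the slide
MINUTES_PER_DAY = 24 * 60  # assumes one report per device per minute, all day
NODES = 10                 # from the slide

events_per_day = DEVICES * MINUTES_PER_DAY
events_per_second = events_per_day / (24 * 60 * 60)
events_per_node_per_second = events_per_second / NODES
```

That works out to 288 million events per day, or roughly 3,300 events per second across the cluster.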

USPS is the single largest logistics entity in the country, moving more individual items in four hours than UPS, FedEx, and DHL combined move all year.

CASE STUDY: LOCATION BASED ANALYTICS

slide-27
SLIDE 27

LOGISTICS & FLEET MANAGEMENT

LARGE RETAILER

Kinetica enables agile tracking of shipments, helping store managers track inventory and arrival times.

  • Visibility and tracking of deliveries & trucks for store managers
  • ETA & Notifications: provide estimated time of delivery, notifications and custom location based alerting
  • Route Optimization based on truck size, and on whether cargo is perishable or contains hazardous materials.

CASE STUDY: LOCATION BASED ANALYTICS

slide-28
SLIDE 28

RISK MANAGEMENT

MULTINATIONAL BANK

Large financial institution moves counterparty risk analysis from overnight to real-time.

  • Data collected by xVA library, which computes risk metrics for each trade
  • Risk computations are becoming more complex and computationally heavy. xVA analysis needs to project years into the future.
  • Kinetica enables banks to move from batch/overnight analysis to a streaming/real-time system for flexible real-time monitoring by traders, auditors and management.

CASE STUDY: ADVANCED IN-DATABASE ANALYTICS

slide-29
SLIDE 29

Scale Out on Industry Standard Hardware

Kinetica typically results in 1/10 the hardware costs of standard in-memory databases.

IN THE CLOUD WITH:

CERTIFIED ON PREMISE WITH:

Runs on industry standard servers, 512GB memory with GPUs (e.g. NVIDIA K80)

COMING SOON:

slide-30
SLIDE 30

Stop by Booth #431 and Get Your Free T-shirt

www.kinetica.com