Graph Analytics on Massively Parallel Processing Databases
Frank McQuillan, Feb 2017

MPP databases are effective for graph analytics at scale in the enterprise

Database Engine Popularity
Source: http://db-engines.com/en/ranking

Graph Engine Trends
21. Neo4j
47. Titan
112. Giraph
180. GraphDB
Source: http://db-engines.com/en/ranking

Introduction to Graphs
• Graphs can be small...

Introduction to Graphs
• ...but many real-world graphs are very large
[Figure: sample LinkedIn social graph (Person X)]

Why Graph Analytics on MPP Databases?
• MPP is built for very large data sets
• Many enterprise use cases combine graph analytics with other techniques
• SQL
  – Most common workload in the enterprise
  – Widely used by analysts and data scientists
  – Ecosystem of business intelligence applications

Why Graph Analytics on MPP Databases?
• Data locality
  – The cost of replicating, moving and transforming data to an external system can be high
• Policy
  – Adding a new execution engine raises cost, deployment, oversight and support issues
  – The CIO must be convinced to use a specialized system in production

But...
Can graph analytic processing be efficiently performed on relational data in an MPP database?

Yes!
• Graph analytic processing on Greenplum Database using Apache MADlib can address a wide range of real-world use cases

Apache MADlib (incubating)

Scalable, In-Database Machine Learning
• Open source: https://github.com/apache/incubator-madlib
• Downloads and docs: http://madlib.incubator.apache.org/
• Wiki: https://cwiki.apache.org/confluence/display/MADLIB/

History
The MADlib project was initiated in 2011 by EMC/Greenplum architects and Joe Hellerstein from the University of California, Berkeley.
UrbanDictionary.com: mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills.

Functions (as of Jan 2017)

Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Ordinal Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Robust Variance (Huber-White), Clustered Variance, Marginal Effects

Matrix Factorization
• Singular Value Decomposition (SVD)
• Low Rank

Linear Systems
• Sparse and Dense Solvers
• Linear Algebra

Graph (new in v1.10, more to come)
• Single Source Shortest Path

Other Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Apriori)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Random Forest
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation
• Naïve Bayes
• Support Vector Machines (SVM)
• Prediction Metrics
• K-Nearest Neighbors

Time Series
• ARIMA

Path Functions
• Operations on Pattern Matches

Descriptive Statistics
• Sketch-Based Estimators
  – CountMin (Cormode-Muth.)
  – FM (Flajolet-Martin)
  – MFV (Most Frequent Values)
• Correlation and Covariance
• Summary

Inferential Statistics
• Hypothesis Tests

Utility Modules
• Array and Matrix Operations
• Sparse Vectors
• Random Sampling
• Probability Functions
• Data Preparation
• PMML Export
• Conjugate Gradient
• Stemming
• Sessionization
• Pivot

Example Usage
• Train a model
• Predict for new data

Greenplum Database
[Architecture diagram: Massively Parallel Processing. Labeled components:]
• SQL
• Master servers: query planning & dispatch
• Network interconnect
• Segment servers: query processing & data storage
• External sources: load, streaming, etc.

MADlib on Greenplum
[Architecture diagram: Massively Parallel Processing. Labeled components:]
• SQL: input validation & pre-processing
• In-database functions: machine learning & statistics & math & utilities
• Master servers: query planning & dispatch
• Network interconnect
• Segment servers: query processing & data storage
• External sources: load, streaming, etc.

Graph Representation in MADlib
[Figure: directed graph (digraph), with labels:]
• Vertex or node
• Edge
• Edge weight (can be negative)

Graph Representation in MADlib

Vertex Table

Vertex   Vertex Params
0        ...
1        ...
2        ...
3        ...
.
.
.

Edge Table

Source Vertex   Dest Vertex   Edge Weight   Edge Params
0               3             1.0           ...
1               0             5.0           ...
1               2             3.0           ...
2               3             8.0           ...
3               0             3.0           ...
3               1             2.0           ...
.
.
.

Single Source Shortest Path
• Given a graph and a source vertex, find a path from the source to every other vertex such that the sum of the weights of its constituent edges is minimized
[Figure: shortest path (A, C, E, D, F) between vertices A and F in a weighted directed graph. Image from https://en.wikipedia.org/wiki/Shortest_path_problem]

Single Source Shortest Path
• Use cases
  – Vehicle routing/navigation
  – Degrees of separation in a social network
  – Min-delay path in a telecommunications network
  – Plant and facility layout
  – VLSI design

SSSP Performance on Greenplum Database
[Chart: SSSP performance; labeled data point: 50M edges]
Greenplum cluster:
● 1 master
● 4 segment hosts with 6 segments per host
Bellman-Ford algorithm
● O(VE) worst case, but this is not common
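As a rough illustration of the Bellman-Ford approach named on this slide, here is a minimal in-memory sketch in Python. It is not MADlib's implementation, which runs as SQL inside the database; the function name `bellman_ford` and the tuple layout are assumptions made for this example. The edge list reuses the sample edge table from the "Graph Representation in MADlib" slide. The early exit when a pass changes nothing is one reason the O(VE) worst case is rarely reached in practice.

```python
import math

# Sample graph from the deck's edge table: (source, dest, weight).
VERTICES = [0, 1, 2, 3]
EDGES = [
    (0, 3, 1.0),
    (1, 0, 5.0),
    (1, 2, 3.0),
    (2, 3, 8.0),
    (3, 0, 3.0),
    (3, 1, 2.0),
]


def bellman_ford(vertices, edges, source):
    """Return (dist, parent) dicts for shortest paths from source.

    Hypothetical helper for illustration only; negative edge weights
    are allowed, negative-weight cycles raise ValueError.
    """
    dist = {v: math.inf for v in vertices}
    parent = {v: None for v in vertices}
    dist[source] = 0.0

    # At most V - 1 passes over all edges are ever needed.
    for _ in range(len(vertices) - 1):
        changed = False
        for src, dest, weight in edges:
            if dist[src] + weight < dist[dest]:
                dist[dest] = dist[src] + weight
                parent[dest] = src
                changed = True
        if not changed:
            break  # converged early, the common case

    # One more pass: any further improvement means a negative cycle.
    for src, dest, weight in edges:
        if dist[src] + weight < dist[dest]:
            raise ValueError("graph contains a negative-weight cycle")
    return dist, parent


dist, parent = bellman_ford(VERTICES, EDGES, source=0)
print(dist)
print(parent)
```

Each pass over the edge list corresponds loosely to one set-based relaxation step, which is why the algorithm maps more naturally onto table scans than a priority-queue method such as Dijkstra's.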

Single Source Shortest Path in MADlib

SSSP

    graph_sssp(
        vertex_table,   -- vertex table
        vertex_id,      -- column in vertex table containing vertex IDs
        edge_table,     -- edge table
        edge_args,      -- source, dest and edge weight columns in the edge table
        source_vertex,  -- source vertex for the algorithm to start
        sssp_table      -- output table of SSSP for all dest vertices
    );

Path retrieval

    graph_sssp_get_path(
        sssp_table,     -- sssp table
        dest_vertex     -- dest of the path of interest
    );
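Path retrieval can be pictured as walking parent pointers back from the destination to the source. The Python sketch below assumes the SSSP output holds one row per vertex of the form (vertex, weight, parent). That row layout, the `None` parent for the source, and the helper name `get_path` are assumptions for illustration, not the confirmed layout of MADlib's output table.

```python
# Hypothetical SSSP output rows for source vertex 0: (vertex, weight, parent).
SSSP_ROWS = [
    (0, 0.0, None),
    (1, 3.0, 3),
    (2, 6.0, 1),
    (3, 1.0, 0),
]


def get_path(sssp_rows, source_vertex, dest_vertex):
    """Rebuild the source-to-dest path by following parent pointers."""
    parent = {vertex: par for vertex, _weight, par in sssp_rows}
    if dest_vertex not in parent:
        raise KeyError(f"vertex {dest_vertex} is not in the SSSP output")

    path = [dest_vertex]
    while path[-1] != source_vertex:
        prev = parent[path[-1]]
        if prev is None:
            raise ValueError(f"vertex {dest_vertex} is unreachable")
        if len(path) > len(parent):
            raise ValueError("parent pointers form a cycle")
        path.append(prev)

    path.reverse()  # pointers were followed dest -> source
    return path


path = get_path(SSSP_ROWS, source_vertex=0, dest_vertex=2)
print(path)
```

Storing only one parent per vertex keeps the output at one row per destination, and a full path is materialized only for the destinations of interest.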

Implementation Considerations
• Relationships
  – Not a first-class citizen in relational databases (unlike certain graph databases)
  – JOIN operations are compute and memory intensive, so they should be minimized
• Table scans
  – Depth-first search involves more table scans (expensive) than breadth-first search
  – Greedy algorithms that do not take advantage of the query optimizer will be slower

Implementation Considerations
• Database limits
  – PostgreSQL limits maximum field size to 1 GB

MADlib Graph Roadmap (Near Term)*

Algorithm: All pairs shortest path (APSP)
Uses:
● O(V^3) Floyd-Warshall
● Betweenness and closeness centrality measures to identify influencers
● Graph diameter

Algorithm: Page rank
Uses:
● Identify importance of vertices

Algorithm: Connected components
Uses:
● Clustering common components

Algorithm: Graph cut
Uses:
● Measure of resilience in network flow problems
● Partition a graph into two disjoint subsets

*Subject to community interest and contribution, and subject to change at any time without notice.

Cybersecurity Example
Lateral Movement Detection

Perimeter Defense Inadequate
● Defending the perimeter is no longer enough
● There is no 100% fool-proof way to keep bad actors out
● Some threats come from within
● The idea of a perimeter is becoming obsolete with mobile, cloud and IoT
● Better methods are needed for threat detection inside the network

APT Kill Chain
Advanced Persistent Threat (APT)
1. Phishing and Zero Day Attack: A handful of users are targeted by two phishing attacks: one user opens a zero-day payload (CVE-2011-0609)
2. Back Door: The user machine is accessed remotely by the Poison Ivy tool
3. Lateral Movement: Attacker elevates access to important user, service and admin accounts, and specific systems
4. Data Gathering: Data is acquired from target servers and staged for exfiltration
5. Ex-filtrate: Data is exfiltrated via encrypted files over ftp to an external, compromised machine at a hosting provider

Lateral Movement Detection
What: Identify anomalous user-level access to hosts
How: Look at people & machines
• Users (user behavior models)
• Network, servers (user peer models)
Scenarios:
• Network reconnaissance from a remote adversary on a hijacked device
• Ill-intentioned activities by a legitimate employee
• Access policy abuse
Business values:
• Immediate security alert generation
• Enhanced SIEM alert queue prioritization
• Focused monitoring
• Future integration with other analytic models for a 360° attack view

Lateral Movement Detection (LMD) – Flow Diagram
[Flow diagram. Labeled components:]
• Logs: Active Directory Activity (semi-structured), Active Directory Metadata, Server Information, LDAP Activity (structured)
• Greenplum Data Store: External Tables
• Models: Regression Model, Recommendation System, User Cluster Model, Behavioral Model, Graph Model
• Output: Anomalous Users

Regression-Based Model
Model to identify users with unusual variation in the number of servers accessed over time
• Build a regression model for each user (Y = aX + b): No. of servers accessed each week (Y) ~ Week Index (X)
• Find the slope of the regression line for each user (a)
• Identify users who have a high positive or negative slope to find users with unusual activity
[Figure: regression plot of number of servers for a user. X axis: week of the year; Y axis: number of servers]
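The per-user regression described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the deck's in-database implementation: the weekly counts, the user names, the helper name `ols_slope`, and the slope threshold of 1.0 are all made-up assumptions.

```python
# Hypothetical data: user -> number of servers accessed in weeks 1..5.
WEEKLY_SERVERS = {
    "alice": [3, 3, 3, 3, 3],
    "bob": [2, 4, 6, 8, 10],
    "carol": [9, 7, 5, 3, 1],
}
SLOPE_THRESHOLD = 1.0  # assumed cutoff for "high" positive or negative slope


def ols_slope(xs, ys):
    """Least-squares slope a of the line Y = aX + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Fit one regression line per user against the week index.
slopes = {}
for user, counts in WEEKLY_SERVERS.items():
    weeks = list(range(1, len(counts) + 1))
    slopes[user] = ols_slope(weeks, counts)

# Flag users whose slope magnitude exceeds the assumed threshold.
flagged = sorted(u for u, a in slopes.items() if abs(a) > SLOPE_THRESHOLD)
print(slopes)
print(flagged)
```

Because each user's fit is independent of every other user's, this kind of model suits an MPP database, where the per-user groups can be processed in parallel across segments.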
