Graph Analytics on Massively Parallel Processing Databases



  1. Graph Analytics on Massively Parallel Processing Databases Frank McQuillan Feb 2017

  2. MPP databases are effective for graph analytics at scale in the enterprise

  3. Database Engine Popularity (chart; source: http://db-engines.com/en/ranking)

  4. Graph Engine Trends • 21. Neo4j • 47. Titan • 112. Giraph • 180. GraphDB (source: http://db-engines.com/en/ranking)

  5. Introduction to Graphs • Graphs can be small...

  6. Introduction to Graphs • ...but many real-world graphs are very large (figure: sample LinkedIn social graph, Person X)

  7. Why Graph Analytics on MPP Databases? • MPP is built for very large data sets • Many enterprise use cases combine graph analytics with other techniques • SQL – Most common workload in the enterprise – Widely used by analysts and data scientists – Ecosystem of business intelligence applications

  8. Why Graph Analytics on MPP Databases? • Data locality – Cost of replicating, moving and transforming data to an external system can be high • Policy – Cost, deployment, oversight and support issues when adding a new execution engine – Need to convince the CIO to use a specialized system in production

  9. But... can graph analytic processing be performed efficiently on relational data in an MPP database?

  10. Yes! • Graph analytic processing on Greenplum Database using Apache MADlib can solve a wide range of real-world use cases

  11. Apache MADlib (incubating)

  12. Scalable, In-Database Machine Learning • Open source: https://github.com/apache/incubator-madlib • Downloads and docs: http://madlib.incubator.apache.org/ • Wiki: https://cwiki.apache.org/confluence/display/MADLIB/

  13. History • The MADlib project was initiated in 2011 by EMC/Greenplum architects and Joe Hellerstein of the University of California, Berkeley • UrbanDictionary.com: mad (adj.): an adjective used to enhance a noun. 1. "Dude, you got skills." 2. "Dude, you got mad skills."

  14. Functions (Jan 2017)
      Generalized Linear Models • Linear Regression • Logistic Regression • Multinomial Logistic Regression • Ordinal Regression • Cox Proportional Hazards Regression • Elastic Net Regularization • Robust Variance (Huber-White), Clustered Variance, Marginal Effects
      Matrix Factorization • Singular Value Decomposition (SVD) • Low Rank
      Linear Systems • Sparse and Dense Solvers • Linear Algebra
      Graph (new in v1.10, more to come) • Single Source Shortest Path
      Other Machine Learning Algorithms • Principal Component Analysis (PCA) • Association Rules (Apriori) • Topic Modeling (Parallel LDA) • Decision Trees • Random Forest • Conditional Random Field (CRF) • Clustering (K-means) • Cross Validation • Naïve Bayes • Support Vector Machines (SVM) • Prediction Metrics • K-Nearest Neighbors
      Time Series • ARIMA
      Path Functions • Operations on Pattern Matches
      Descriptive Statistics • Sketch-Based Estimators – CountMin (Cormode-Muth.) – FM (Flajolet-Martin) – MFV (Most Frequent Values) • Correlation and Covariance • Summary
      Inferential Statistics • Hypothesis Tests
      Utility Modules • Array and Matrix Operations • Sparse Vectors • Random Sampling • Probability Functions • Data Preparation • PMML Export • Conjugate Gradient • Stemming • Sessionization • Pivot

  15. Example Usage • Train a model • Predict for new data

  16. Greenplum Database (architecture diagram) • SQL • Massively Parallel Processing • Master Servers: query planning & dispatch • Network Interconnect • Segment Servers: query processing & data storage • External Sources: load, streaming, etc.

  17. MADlib on Greenplum (architecture diagram) • SQL • Input validation & pre-processing • Massively Parallel Processing • Master Servers: query planning & dispatch • Network Interconnect • In-Database Functions: machine learning & statistics & math & utilities • Segment Servers: query processing & data storage • External Sources: load, streaming, etc.

  18. Graph Representation in MADlib • Directed graph (digraph) • Vertex or node • Edge • Edge weight (can be negative)

  19. Graph Representation in MADlib
      Vertex Table
        Vertex   Vertex Params
        0        ...
        1        ...
        2        ...
        3        ...
        ...
      Edge Table
        Source Vertex   Dest Vertex   Edge Weight   Edge Params
        0               3             1.0           ...
        1               0             5.0           ...
        1               2             3.0           ...
        2               3             8.0           ...
        3               0             3.0           ...
        3               1             2.0           ...
        ...
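The two-table layout above can be tried out locally. What follows is a minimal sketch using Python's built-in sqlite3 module rather than Greenplum. The table names (vertex, edge) and column names (id, src, dest, weight) are assumptions made for illustration, and only the rows visible on the slide are loaded.

```python
import sqlite3

# Assumed identifiers: the slide shows the table layout, not the real names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vertex (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE edge (src INTEGER, dest INTEGER, weight REAL)")

# Rows copied from the slide; the extra "params" columns are omitted.
conn.executemany("INSERT INTO vertex (id) VALUES (?)", [(v,) for v in range(4)])
EDGES = [
    (0, 3, 1.0),
    (1, 0, 5.0),
    (1, 2, 3.0),
    (2, 3, 8.0),
    (3, 0, 3.0),
    (3, 1, 2.0),
]
conn.executemany("INSERT INTO edge (src, dest, weight) VALUES (?, ?, ?)", EDGES)

# Outgoing edges of vertex 3, fetched with an ordinary SQL query.
out_edges = conn.execute(
    "SELECT dest, weight FROM edge WHERE src = ? ORDER BY dest", (3,)
).fetchall()
print(out_edges)  # [(0, 3.0), (1, 2.0)]
```

Storing edges as plain rows means that graph data can be filtered and joined with the same SQL used for any other table.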

  20. Single Source Shortest Path • Given a graph and a source vertex, find a path to every vertex such that the sum of the weights of its constituent edges is minimized • Figure: shortest path (A, C, E, D, F) between vertices A and F in a weighted directed graph (image from https://en.wikipedia.org/wiki/Shortest_path_problem)

  21. Single Source Shortest Path • Use cases – Vehicle routing/navigation – Degrees of separation in a social network – Min-delay path in a telecommunications network – Plant and facility layout – VLSI design

  22. SSSP Performance on Greenplum Database • 50M edges • Greenplum cluster: 1 master, 4 segment hosts with 6 segments per host • Bellman-Ford algorithm: O(VE) worst case, but not common
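For reference, Bellman-Ford itself can be sketched in a few lines. This is a plain in-memory Python version, not the MADlib implementation. The function name bellman_ford is illustrative, and the edge list is the sample edge table from the graph representation slide. The early exit when a pass changes nothing is one plausible reason the O(VE) worst case is often not reached in practice.

```python
import math

def bellman_ford(num_vertices, edges, source):
    """Return (dist, parent) lists for shortest paths from source.

    edges is a list of (src, dest, weight) tuples; weights may be negative.
    Raises ValueError if a negative-weight cycle is reachable from source.
    """
    dist = [math.inf] * num_vertices
    parent = [None] * num_vertices
    dist[source] = 0.0
    # At most V-1 passes over all edges; stop early once nothing improves.
    for _ in range(num_vertices - 1):
        changed = False
        for src, dest, weight in edges:
            if dist[src] + weight < dist[dest]:
                dist[dest] = dist[src] + weight
                parent[dest] = src
                changed = True
        if not changed:
            break
    # Any edge that can still be relaxed lies on a negative-weight cycle.
    for src, dest, weight in edges:
        if dist[src] + weight < dist[dest]:
            raise ValueError("negative-weight cycle reachable from source")
    return dist, parent

# Sample edge table from the graph representation slide.
EDGES = [(0, 3, 1.0), (1, 0, 5.0), (1, 2, 3.0), (2, 3, 8.0), (3, 0, 3.0), (3, 1, 2.0)]
dist, parent = bellman_ford(4, EDGES, source=0)
print(dist)    # [0.0, 3.0, 6.0, 1.0]
print(parent)  # [None, 3, 1, 0]
```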

  23. Single Source Shortest Path in MADlib
      SSSP:
        graph_sssp(
            vertex_table,    -- vertex table
            vertex_id,       -- col in vertex table containing vertex IDs
            edge_table,      -- edge table
            edge_args,       -- source, dest and edge weights col in the edge table
            source_vertex,   -- source vertex for the algorithm to start
            sssp_table       -- output table of SSSP for all dest vertices
        );
      Path retrieval:
        graph_sssp_get_path(
            sssp_table,      -- sssp table
            dest_vertex      -- dest of the path of interest
        );
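To make the idea of an SSSP output table and path retrieval concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not MADlib's implementation and does not call the functions above. The helper names (sssp_table, get_path) and the output columns (vertex, weight, parent) are assumptions chosen for illustration. Each round uses one JOIN to find every edge that improves a known distance, in the style of Bellman-Ford relaxation.

```python
import sqlite3

# Sample edge table from the graph representation slide.
EDGES = [(0, 3, 1.0), (1, 0, 5.0), (1, 2, 3.0), (2, 3, 8.0), (3, 0, 3.0), (3, 1, 2.0)]

def sssp_table(conn, source, max_rounds):
    """Fill a table sssp(vertex, weight, parent) by repeated edge relaxation."""
    conn.execute(
        "CREATE TABLE sssp (vertex INTEGER PRIMARY KEY, weight REAL, parent INTEGER)"
    )
    # The source is recorded as its own parent with distance 0.
    conn.execute("INSERT INTO sssp VALUES (?, 0.0, ?)", (source, source))
    for _ in range(max_rounds):
        better = conn.execute(
            """
            SELECT e.dest, s.weight + e.weight AS cost, e.src
            FROM sssp AS s
            JOIN edge AS e ON e.src = s.vertex
            LEFT JOIN sssp AS d ON d.vertex = e.dest
            WHERE d.vertex IS NULL OR s.weight + e.weight < d.weight
            ORDER BY cost DESC
            """
        ).fetchall()
        if not better:
            break  # converged: no edge improves any distance
        # Sorted by descending cost, so the cheapest row per vertex is written last.
        conn.executemany("INSERT OR REPLACE INTO sssp VALUES (?, ?, ?)", better)

def get_path(conn, dest):
    """Walk parent pointers from dest back to the source; [] if unreachable."""
    parent = dict(conn.execute("SELECT vertex, parent FROM sssp"))
    if dest not in parent:
        return []
    path = [dest]
    # The length bound guards against malformed parent pointers.
    while parent[path[-1]] != path[-1] and len(path) <= len(parent):
        path.append(parent[path[-1]])
    return path[::-1]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src INTEGER, dest INTEGER, weight REAL)")
conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", EDGES)
sssp_table(conn, source=0, max_rounds=4)
print(get_path(conn, 2))  # [0, 3, 1, 2]
```

The sketch assumes there is no negative-weight cycle. It pulls candidate rows into Python between rounds for readability, whereas an in-database implementation would keep that step in SQL.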

  24. Implementation Considerations • Relationships – Not a first-class citizen in relational databases (unlike certain graph databases) – JOIN operations are compute and memory intensive, so they should be minimized • Table scans – Depth-first search involves more table scans (expensive) than breadth-first search – Greedy algorithms that do not take advantage of the query optimizer will be slower

  25. Implementation Considerations • Database limits – PostgreSQL limits maximum field size to 1GB

  26. MADlib Graph Roadmap (Near Term)*
      Algorithm: All pairs shortest path (APSP), O(V^3) Floyd-Warshall. Uses: • Betweenness and closeness centrality measures to identify influencers • Graph diameter
      Algorithm: Page rank. Uses: • Identify importance of vertices
      Algorithm: Connected components. Uses: • Clustering common components
      Algorithm: Graph cut. Uses: • Measure of resilience in network flow problems • Partition a graph into two disjoint subsets
      *Subject to community interest and contribution, and subject to change at any time without notice.
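As an illustration of the O(V^3) Floyd-Warshall approach named in the roadmap, here is a minimal in-memory Python sketch. It is not a MADlib API, and the function name floyd_warshall is hypothetical. It runs on the sample edge table from the graph representation slide and derives the graph diameter, taken here as the longest finite shortest-path distance.

```python
import math

def floyd_warshall(num_vertices, edges):
    """Return a V x V matrix of shortest-path distances (math.inf if no path)."""
    dist = [[math.inf] * num_vertices for _ in range(num_vertices)]
    for v in range(num_vertices):
        dist[v][v] = 0.0
    for src, dest, weight in edges:
        dist[src][dest] = min(dist[src][dest], weight)
    # Allow each vertex k in turn as an intermediate stop: three nested loops, O(V^3).
    for k in range(num_vertices):
        for i in range(num_vertices):
            for j in range(num_vertices):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Sample edge table from the graph representation slide.
EDGES = [(0, 3, 1.0), (1, 0, 5.0), (1, 2, 3.0), (2, 3, 8.0), (3, 0, 3.0), (3, 1, 2.0)]
apsp = floyd_warshall(4, EDGES)
diameter = max(d for row in apsp for d in row if d != math.inf)
print(apsp[2])   # distances from vertex 2
print(diameter)
```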

  27. Cybersecurity Example: Lateral Movement Detection

  28. Perimeter Defense Inadequate ● Defending the perimeter is no longer enough ● There is no 100% foolproof way to keep bad actors out ● Some threats come from within ● The idea of a perimeter is becoming obsolete with mobile, cloud, IoT ● Need better methods for threat detection inside the network

  29. APT Kill Chain: Advanced Persistent Threat (APT)
      1. Phishing and Zero Day Attack: A handful of users are targeted by two phishing attacks; one user opens the zero-day payload (CVE-2011-0609)
      2. Back Door: The user machine is accessed remotely by the Poison Ivy tool
      3. Lateral Movement: Attacker elevates access to important user, service and admin accounts, and specific systems
      4. Data Gathering: Data is acquired from target servers and staged for exfiltration
      5. Ex-filtrate: Data is exfiltrated via encrypted files over ftp to an external, compromised machine at a hosting provider

  30. Lateral Movement Detection • What: Identify anomalous user-level access to hosts • How: Look at people & machines – Users (user behavior models) – Network, servers (user peer models) • Scenarios: – Network reconnaissance from remote adversary on hijacked device – Ill-intentioned activities by legitimate employee – Access policy abuse • Business values: – Immediate security alert generation – Enhanced SIEM alert queue prioritization – Focused monitoring – Future integration with other analytic models for 360° attack view

  31. Lateral Movement Detection (LMD) – Flow Diagram (figure; labels: Greenplum Data Store • Logs • Active Directory Activity • Active Directory Metadata • Server Information • LDAP Activity • Semi-structured • Structured • External Tables • Regression Model • Recommendation System • User Cluster Model • Behavioral Model • Graph Model • Anomalous Users)

  32. Regression-Based Model • Model to identify users with unusual variation in the number of servers accessed over time • Build a regression model for each user (Y = aX + b): no. of servers accessed each week (Y) ~ week index (X) • Find the slope of the regression line for each user (a) • Identify users who have a high positive or negative slope to find users with unusual activity • Figure: regression plot of number of servers for a user (x-axis: week of the year; y-axis: number of servers)
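The per-user slope calculation described above can be sketched as follows. This is a plain Python illustration, not the model from the talk. The user names, weekly counts and slope threshold are invented toy values, and the function name slope is hypothetical. It fits Y = aX + b by ordinary least squares and flags users whose slope a is large in either direction.

```python
def slope(xs, ys):
    """Least-squares slope a of the line Y = aX + b fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Invented toy data: number of servers accessed per week by two made-up users.
weeks = [1, 2, 3, 4, 5]
weekly_servers = {
    "user_a": [3, 3, 4, 3, 3],   # roughly flat
    "user_b": [2, 4, 6, 8, 10],  # steadily rising
}
SLOPE_THRESHOLD = 1.0  # hypothetical cut-off for "unusual" variation

slopes = {user: slope(weeks, counts) for user, counts in weekly_servers.items()}
# Flag users with a high positive or negative slope.
flagged = sorted(user for user, a in slopes.items() if abs(a) > SLOPE_THRESHOLD)
print(flagged)  # ['user_b']
```

In a database setting the same quantity could be computed per user with a grouped aggregate, so that all users are scored in one pass.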
