

SLIDE 1

GABB: May 23, 2016

Graph Analytics: Complexity, Scalability, and Architectures

Peter M. Kogge
McCourtney Prof. of CSE, Univ. of Notre Dame
IBM Fellow (retired)

Please Sir, I want more

SLIDE 2

Thesis

  • Graph computation is increasing
  • To date: most benchmarks are batch
  • Streaming becoming more important
  • This talk: Combine batch and streaming
  • Emerging architectures have real promise
SLIDE 3

Graph Kernels and Benchmarks

SLIDE 4

Graphs

  • Graph:
    – Set of objects called vertices
    – Set of links called edges between vertices
    – May have “properties”
  • Graph computing of increasing importance
    – Social networks
    – Communication & power networks
    – Recommendation systems
    – Genomics
    – Cyber-security


SLIDE 5

Classes of Graph Computation

  • Characteristics of individual vertices
    – E.g. “properties” such as degree
  • Characteristics of graph as a whole
    – E.g. diameter, max distance, covering
  • Characteristics of pairs of vertices
    – E.g. shortest paths
  • Characteristics of subgraphs
    – E.g. connected components, spanning tree
    – Similarities of subgraphs, …

SLIDE 6

Classes of Application Computations

  • Batch: function applied to entire graph or major subgraph as it exists at some time
  • Streaming:
    – Incoming sequence of small-scale updates
      • New vertices or edges
      • Modification of a property of a specific vertex or edge
      • Deletions
    – Sequence of localized queries

SLIDE 7

Current Benchmark Suites

Legend:
  • Kernel Class: what class of computing the kernel performs
  • Benchmarking Efforts: which suites include the kernel
    – S => Streaming
    – B => Batch
    – B/S => Both
  • Outputs: what is the size or structure of the result of kernel execution?

[Table: kernels vs. kernel class, benchmarking efforts, and outputs. Kernel Class columns: Connectedness, Path Analysis, Centrality, Clustering, Subgraph Isomorphism, Other. Benchmarking Efforts columns: Standalone, Firehose, Graph500, GraphBLAS, Graph Challenge, Graph Algorithm Platform, HPC Graph Analysis, Kepner & Gilbert, Stinger, VAST. Outputs columns: Graph Modification, Compute Vertex Property, Output Global Value, Output O(1) Events, Output O(|V|) List, Output O(|V|^k) List (k>1). Kernels: Anomaly (fixed key, unbounded key, two-level key); BC: Betweenness Centrality; BFS: Breadth-First Search; Search for “Largest”; CCW: Weakly Connected Components; CCS: Strongly Connected Components; CCO: Clustering Coefficients; CD: Community Detection; GC: Graph Contraction; GP: Graph Partitioning; GTC: Global Triangle Counting; Insert/Delete; Jaccard; MIS: Maximally Independent Set; PR: PageRank; SSSP: Single-Source Shortest Path; APSP: All-Pairs Shortest Path; SI: General Subgraph Isomorphism; TL: Triangle Listing; Geo & Temporal Correlation.]

SLIDE 8

A Real World App

SLIDE 9

Real World vs. Benchmarks

  • Processing more than single kernel
  • Many different classes of vertices
  • Many different classes of edges
  • Vertices may have 1000’s of properties
  • Edges may have timestamps
  • Both batch & streaming are integrated
    – Batch to clean/process existing data sets, add properties
    – Streaming (today) to query graph
    – Streaming (tomorrow) to update graph in real time
  • “Neighborhoods” more important than full graph connectivity

SLIDE 10

Sample Real-World Batch Analytic

(From Lexis Nexis)

  • 2012: 40+ TB of raw data
  • Periodically clean up & combine to 4-7 TB
  • Weekly “Boil the Ocean” to precompute answers to all standard queries
    – Does X have financial difficulties?
    – Does X have legal problems?
    – Has X had significant driving problems?
    – Who has shared addresses with X?
    – Who has shared property ownership with X?

Example: an auto insurance co. asks “Tell me about giving an auto policy to Jane Doe” and gets an answer in < 0.1 sec: “Jane Doe has no indicators, but she has shared multiple addresses with Joe Scofflaw, who has the following negative indicators ….” The system looks up the answers to precomputed queries for “Jane Doe” and combines them.

SLIDE 11

Sample Analytic Details

  • Given: 14.2+ billion records from
    – 800+ million entities (people, businesses)
    – 100+ million addresses
    – records on who has resided at what address
  • Goal: for each entity ID, find all other IDs such that they
    – share at least 2 addresses in common,
    – or have one address in common and a “close” last name
    – Matching last names requires processing to check for typos (“Levenshtein distance”)
  • Akin to a join based on common address, with grouping and thresholding on # of join results
  • Dozens of similar analytics computed once a week on a 400-node cluster
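The analytic described above can be sketched as a grouped self-join on address. This is only an illustration with invented names (`linked_entities`, the record layout) and toy thresholds, not the production ECL implementation:

```python
from collections import defaultdict
from itertools import combinations

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def linked_entities(records, max_name_dist=1):
    """records: iterable of (entity_id, last_name, address).
    Returns pairs (id1, id2) that share >= 2 addresses, or share one
    address and have 'close' last names (edit distance <= max_name_dist)."""
    by_address = defaultdict(set)
    name_of = {}
    for eid, name, addr in records:
        by_address[addr].add(eid)
        name_of[eid] = name
    shared = defaultdict(int)  # (id1, id2) -> # of common addresses
    for eids in by_address.values():
        for a, b in combinations(sorted(eids), 2):
            shared[(a, b)] += 1
    return {(a, b) for (a, b), n in shared.items()
            if n >= 2 or (n == 1 and levenshtein(name_of[a], name_of[b]) <= max_name_dist)}
```

The grouping by address is the join key; the per-pair count is the thresholded join result the slide mentions.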

SLIDE 12

Sample Batch Implementation Platform: Lexis Nexis

  • Entity data kept in huge persistent tables
    – Often with 1,000s of columns
  • Programming in declarative ECL
  • THOR: runs “offline” on 400+ node systems
    – Batch analytic processing over large data sets
    – Large distributed parallel file system
    – Leaves all data sets for queries in indexed files
  • ROXIE: runs “online” on smaller system
    – User queries using output files from THOR
    – Dynamically interrogate indexed files
    – Can perform localized ECL on data subsets
  • No dynamic data updates

[Figure: HPCC software architecture]

SLIDE 13

Execution on Today’s Architectures

  • Model built to estimate usage of the following:
    – Bandwidth: network, disk, memory
    – Processing capability
  • Baseline: cluster of 400 dual-Xeon nodes
  • Menu of improvement options investigated
  • “Conventional” improvements
    – No one option >45% increase in performance
    – Significant gains only when all applied at once
  • “Unconventional” improvements even better
    – ARMs for Xeons
    – 2-level memory
    – Computing in “3D memory”

SLIDE 14

A Model Based on Contemporary Architecture

  • Optimal code streams data through multiple kernels till a barrier
  • No one resource is a consistent bottleneck
  • Inter-node comm: dynamically random small messages

[Chart: resources used per node (sec) vs. step # for Disk, CPU, Memory, and Network. Baseline: 1026 s on 10 racks.]

SLIDE 15

The Core of This Computation as a Benchmark Kernel

SLIDE 16

Sample Analytic Details

  • Given: 14.2+ billion records from
    – 800+ million entities (people, businesses)
    – 100+ million addresses
    – records on who has resided at what address
  • Goal: for each entity ID, find all other IDs such that they
    – share at least 2 addresses in common,
    – or have one address in common and a “close” last name
    – Matching last names requires processing to check for typos (“Levenshtein distance”)
  • Akin to a join based on common address, but with grouping and thresholding on # of join results
  • Dozens of similar analytics computed once a week on a 400-node cluster

SLIDE 17

Neighborhoods & Jaccard Coefficients: The Essence of NORA problems

N(u) = set of neighbors of u
Γ(u,v) = fraction of the neighbors of u and v that are in common:
Γ(u,v) = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|

Alternative: with d(u) = # of neighbors of u and γ(u,v) = # of common neighbors,
Γ(u,v) = γ(u,v) / (d(u) + d(v) - γ(u,v))

[Figure: vertices u and v; green and purple edges lead to common neighbors, blue edges lead to non-common neighbors.]

The LexisNexis shared-address NORA problem is an extension of this.
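As a quick check, the set form and the degree/count form of Γ can be computed side by side (a small sketch; `N` maps each vertex to its neighbor set):

```python
def jaccard_setform(N, u, v):
    # Γ(u,v) = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|
    union = len(N[u] | N[v])
    return len(N[u] & N[v]) / union if union else 0.0

def jaccard_countform(N, u, v):
    # Alternative: Γ(u,v) = γ(u,v) / (d(u) + d(v) - γ(u,v))
    g = len(N[u] & N[v])           # γ(u,v): # of common neighbors
    denom = len(N[u]) + len(N[v]) - g
    return g / denom if denom else 0.0
```

By inclusion-exclusion, |N(u) ∪ N(v)| = d(u) + d(v) - γ(u,v), so the two forms agree on every input.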

SLIDE 18

Results of a Map-Reduce Batch Implementation

[Charts: measured vs. modeled time (sec) and # of coefficients vs. # of vertices, over 1E+04 to 1E+08 vertices.]

JACS (Jaccard coefficients/sec) = 1.6E6 * V^0.26; the entire LN analytic is approx. 10X faster.

Burkhardt, “Asking Hard Graph Questions,” Beyond Watson Workshop, Feb. 2014. RMAT matrices, average d(i) = 16, on a 1000-node system, each node with 12 cores & 64 GB. The # of coefficients grows more rapidly than the # of vertices; time also grows more than linearly.

SLIDE 19

A Streaming Form: Better Match to Future Real-Time NORA

  • Assume edges arrive in a stream {(u,v)}
    – Graph keeps just the “largest Γ” for each vertex
  • Question: does any individual edge addition significantly change any vertex’s largest Γ?
    – Especially, does the change cross some threshold?
  • Implementation involves looking at
    – Neighbors of neighbors of u (to update u’s max Γ)
    – Neighbors of v (to update their peak Γ, given u now shares neighbors)
  • Bloom filter-like heuristic can bound thresholds early
  • Lots of interesting variations
  • P. Kogge, “Jaccard Coefficients as a Potential Graph Benchmark,” GABB, IPDPS, 2016
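A brute-force sketch of this streaming form, assuming in-memory adjacency sets; the class name and candidate scan are invented for illustration, and the Bloom-filter-style early-exit heuristic the slide mentions is omitted:

```python
from collections import defaultdict

def jaccard(N, a, b):
    union = len(N[a] | N[b])
    return len(N[a] & N[b]) / union if union else 0.0

class StreamingMaxJaccard:
    """Keep, per vertex, its largest Jaccard coefficient Γ, updated as
    edges (u, v) arrive; report vertices whose max Γ crosses a threshold."""
    def __init__(self, threshold=0.5):
        self.N = defaultdict(set)
        self.max_gamma = defaultdict(float)
        self.threshold = threshold

    def _refresh(self, x):
        # Candidates for x's peak Γ are its neighbors' neighbors.
        cands = {w for n in self.N[x] for w in self.N[n]} - {x}
        best = max((jaccard(self.N, x, w) for w in cands), default=0.0)
        crossed = best >= self.threshold > self.max_gamma[x]
        self.max_gamma[x] = best
        return crossed

    def add_edge(self, u, v):
        self.N[u].add(v)
        self.N[v].add(u)
        # u's and v's own peaks may change, and their neighbors now share u or v.
        affected = {u, v} | set(self.N[u]) | set(self.N[v])
        return {x for x in affected if self._refresh(x)}
```

`add_edge` returns the set of vertices whose largest Γ just crossed the threshold, i.e. the “significant change” events the slide asks about.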

SLIDE 20

Looking Forward: Converting Such Problem into Large Graphs

  • Create graph of name/address records
  • Query: given a specific ID1, find all ID2s that meet the requirements
    – Start with ID1 and find all ID2 reachable via a shared address
    – Score each path
    – Sum all path scores & pass (ID1, ID2) if > threshold

[Figure: record graph linking Entity ID, Last Name, and Address nodes, with a path from ID1 to ID2.]
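A minimal sketch of this query flow, with invented names and a constant per-path score (the actual scoring function is not specified on the slide):

```python
from collections import defaultdict

def query_linked_ids(entity_addrs, id1, path_score=1.0, threshold=2.0):
    """entity_addrs: dict entity_id -> set of addresses.
    Finds every ID2 reachable from id1 through a shared address,
    sums one score per connecting path, and passes the pairs whose
    total meets the threshold."""
    by_addr = defaultdict(set)
    for eid, addrs in entity_addrs.items():
        for a in addrs:
            by_addr[a].add(eid)
    totals = defaultdict(float)
    for addr in entity_addrs.get(id1, ()):   # each shared address = one path
        for id2 in by_addr[addr]:
            if id2 != id1:
                totals[id2] += path_score
    return {id2 for id2, s in totals.items() if s >= threshold}
```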

SLIDE 21

Ability to do On-Demand Graph Queries Will Change Business Model

[Figure: three-layer graph of IDs (96 GB), Last Names (18 GB), and Addresses (300 GB). Query: (1) start with an ID, (2) follow last names to all matching addresses, (3) then on to other IDs.]

SLIDE 22

Canonical Graph Processing

SLIDE 23

Canonical Graph Processing

[Diagram: a persistent big data/graph data set, loaded by batch input. A real-time stream of events drives property updates and seed identification; selection criteria plus seeds drive subgraph extraction; extracted subgraphs feed batch analytics, which perform local updates of graph properties back to the persistent data set.]

SLIDE 24

Streaming Characteristics

  • Two kinds
    – Streams of queries
    – Updates to persistent data
  • Both typically localized to start with
  • Streaming updates often multi-step:
    – Perform update (use atomics)
    – Perform some local computations
    – Compare to threshold
    – If threshold passed, extract some larger subset
    – And perform a bigger analytic
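The multi-step update pattern above can be sketched as follows; `VertexCounter` and its hooks are invented names, and a Python lock stands in for the hardware atomics the slide mentions:

```python
import threading

class VertexCounter:
    """Sketch of the multi-step streaming-update pattern: atomically apply
    the update, do a local computation, test a threshold, and only then
    trigger a heavier analytic on a larger extracted subgraph."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = {}
        self.lock = threading.Lock()          # stands in for hardware atomics

    def on_event(self, vertex, extract_subgraph, run_analytic):
        with self.lock:                       # 1. perform update
            self.counts[vertex] = self.counts.get(vertex, 0) + 1
            local = self.counts[vertex]       # 2. local computation
        if local >= self.threshold:           # 3. compare to threshold
            sub = extract_subgraph(vertex)    # 4. extract larger subset
            return run_analytic(sub)          # 5. bigger analytic
        return None
```

Most events take only the cheap path (steps 1-3); the expensive analytic runs only on threshold crossings.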

SLIDE 25

Observations

  • Data sets live in persistent memory
  • Streaming updates trigger threshold tests
  • Streaming queries result in local graph traversals
  • Batch analytics used primarily for analysis & new property computation

[Diagram: the canonical graph-processing flow from Slide 23, repeated.]

SLIDE 26

Emerging Architectures to Accelerate Graph Processing

SLIDE 27

Knight’s Landing: 2-Level Memory

  • 2-level memory
    – High capacity for “persistent graphs”
    – HBM memory for “subgraph” working memory
    – Lots of multi-threaded cores that can see all memory

SLIDE 28

A Novel Sparse Graph Processor

Song, et al. “Novel Graph Processor Architecture, Prototype System, and Results,” IEEE HPEC 2016

[Block diagram: Matrix Reader, Matrix Memory, Sorter, ALU, Matrix Writer, Control, Network Interface]

  • Express graph as adjacency matrix
  • Graph ops as matrix-vector or matrix-matrix products
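To illustrate the adjacency-matrix formulation, here is a hedged sketch of BFS as repeated matrix-vector products (dense NumPy stands in for the processor’s sparse-matrix hardware; this is not the Song et al. implementation):

```python
import numpy as np

def bfs_levels(adj, source):
    """BFS expressed as repeated matrix-vector products over a Boolean
    semiring: each product expands the frontier by one hop.
    adj: 0/1 adjacency matrix; returns per-vertex BFS level (-1 = unreached)."""
    n = adj.shape[0]
    frontier = np.zeros(n, dtype=bool)
    frontier[source] = True
    visited = frontier.copy()
    level = np.full(n, -1)
    level[source] = 0
    depth = 0
    while frontier.any():
        depth += 1
        # y = A^T x over (OR, AND): vertices reachable from the frontier
        reached = (adj.T.astype(int) @ frontier.astype(int)) > 0
        frontier = reached & ~visited          # keep only unvisited vertices
        level[frontier] = depth
        visited |= frontier
    return level
```

The masked matrix-vector product per iteration is exactly the kernel shape such an architecture accelerates.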

SLIDE 29

Migrating Threads for Streaming

  • Single system-wide address space
  • Parallelism via memory channels
  • Gossamer Cores (GCs) execute Gossamer Threads at Nodelets
    – Perform local computations & memory references (inc. atomics)
    – Migrate to other Nodelets w/o software involvement
    – Spawn new Threads
    – Call system services on SCs
  • Stationary Cores (SCs): conventional cores
    – Execute operating system
    – Manage IO / file system
    – Call or spawn Gossamer Threads
  • Programmed in Cilk

[Diagram: nodes on an internal network; each node holds nodelets (memory controller, memory, four GCores apiece) plus a migration engine & network interface.]

If the mountain won’t come to you, you must go to the mountain - ancient proverb

SLIDE 30

Emu 1

  • Emu Chick
    – 8 nodes; 64 nodelets
    – Copy-room environment
  • Emu1 Memory Server
    – 256 nodes; 2048 nodelets
    – Server-room environment
    – FCS 2017
  • The node boards are the same!
  • Tool chain: Cilk C today; C++ in progress
  • T. Dysart, et al. “Highly Scalable Near Memory Processing with Migrating Threads on the Emu System Architecture,” SC16

SLIDE 31

Where Useful

[Diagram: the canonical graph-processing flow from Slide 23, annotated with where each emerging architecture fits: 2-level memory, sparse graph engines, and migrating threads with rich atomics.]

SLIDE 32

Projection for LexisNexis Problem

  • Still “batch mode” computations

[Chart: speedup over the 2012 baseline vs. # of racks (1 to 10) for HeavyWeight, Lightweight, Next-Gen Compute (ARM servers, KNL-like), and Emu (Emu1, Emu2, Emu3). Emu1 assumes 400 MHz GCs and 2400 MT/s DRAM channels.]

SLIDE 33

Conclusions

  • Most graph benchmarks today
    – Batch oriented
    – Assume simple graphs
    – Focus on total graph properties
  • Real-world apps today radically different
    – Many vertex/edge classes with many properties
    – Real interest: localized “neighborhoods”
  • Key queries not computable in real time today
  • “Two-level” graph processing flow meshes both streaming and batch computations
  • Architectures are emerging that will support such graph processing directly

SLIDE 34

Acknowledgements: Funding

  • In part by NSF CCF-1642280
  • In part by the Univ. of Notre Dame