Graph Databases, Marco Serafini, COMPSCI 532 Lecture 10



SLIDE 1

Graph Databases

Marco Serafini

COMPSCI 532 Lecture 10

SLIDE 2

Graph DB Use cases

  • Social network queries
    • E.g., Facebook stores its entire metadata in a social graph
  • Network security
    • Find the sequence of steps that led to an intrusion
  • Fraud detection
    • Find fraud rings
  • Knowledge bases
    • Question answering, language models
SLIDE 3

Resource Description Framework

  • World Wide Web Consortium (W3C) specification
  • Used for the Semantic Web
    • Web pages define human-readable content
    • Goal: add machine-readable metadata describing how pages relate
  • Format to reuse and share data across the Web
  • Examples
    • Wikipedia, census, life sciences, DBpedia
  • Data model: directed labeled multigraph
SLIDE 4

RDF Format

  • A graph is a set of triples (Subject, Predicate, Object)
  • Subject and predicate are resources
    • Identified by Uniform Resource Identifiers (URIs)
  • Object can be a resource or a literal (e.g., a string)

From S. Decker et al., “Framework for the Semantic Web: An RDF Tutorial”
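
The triple model can be sketched in a few lines of Python. This is only an illustration under assumptions: the `user1`/`user2` resources, the `Literal` wrapper, and the `objects` helper are hypothetical names, not part of RDF or of any RDF library.

```python
from typing import NamedTuple

# Hypothetical namespace and resources, used only for illustration.
SN = "http://socialnetwork.com/ontology/"

class Literal(NamedTuple):
    """Wraps a literal value so it cannot be confused with a resource URI."""
    value: str

# The graph is a set of (subject, predicate, object) triples.
# Subjects and predicates are URIs; objects are URIs or literals.
triples = {
    (SN + "user1", SN + "hasName", Literal("alice01")),
    (SN + "user1", SN + "isFriendOf", SN + "user2"),
    (SN + "user2", SN + "livesIn", SN + "Paris"),
}

def objects(graph, subject, predicate):
    """Return every object linked to `subject` by an edge labeled `predicate`."""
    return {o for s, p, o in graph if s == subject and p == predicate}
```

For example, `objects(triples, SN + "user1", SN + "isFriendOf")` returns the set containing the URI of `user2`.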

SLIDE 5

Query Language: SPARQL

  • Declarative
  • Defines a query graph
    • The RDF store must find all instances of it in the data graph
  • Example
    • “Return friends of user alice01 who live in Paris”

PREFIX sn: <http://socialnetwork.com/ontology/>
SELECT ?friend
WHERE {
  ?user sn:hasName "alice01" ;
        sn:isFriendOf ?friend .
  ?friend sn:livesIn sn:Paris .
}
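
One way to picture how a store finds all instances of the query graph is to match the triple patterns one at a time, carrying variable bindings forward. The sketch below is a naive nested-loop evaluator written for this note, not how a real SPARQL engine works; the data, the short resource names, and the helpers `match_pattern` and `evaluate` are all assumptions. Terms starting with `?` are variables, and the literal is written with embedded quotes to keep it distinct from resources.

```python
def is_var(term):
    """Variables are strings that start with '?', as in SPARQL."""
    return isinstance(term, str) and term.startswith("?")

def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            # Bind the variable, or check it against its earlier binding.
            if extended.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return extended

def evaluate(patterns, graph):
    """Return every variable binding that satisfies all triple patterns."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in graph
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings

# Hypothetical data graph with shortened resource names.
graph = {
    ("user1", "hasName", '"alice01"'),
    ("user1", "isFriendOf", "user2"),
    ("user1", "isFriendOf", "user3"),
    ("user2", "livesIn", "Paris"),
    ("user3", "livesIn", "Rome"),
}
query = [
    ("?user", "hasName", '"alice01"'),
    ("?user", "isFriendOf", "?friend"),
    ("?friend", "livesIn", "Paris"),
]
friends = sorted(b["?friend"] for b in evaluate(query, graph))
```

Each pattern scans the whole graph here; a real store would use indexes on subject, predicate, and object instead.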

SLIDE 6

Property Graph Format

  • Vertices and edges can have associated properties
    • Key-value pairs
  • Vertices can be grouped by label
    • Similar to tables, e.g., employees
    • Properties are similar to columns of a table
  • Not a “global” format: no URIs required
    • Typically more compact than RDF
  • Common in NoSQL graph databases
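
A minimal in-memory sketch of the property graph model, assuming plain Python dataclasses. The `Vertex` and `Edge` classes, the sample data, and `by_label` are hypothetical and not tied to any particular graph database.

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    vid: int
    label: str                                  # groups vertices, like a table name
    props: dict = field(default_factory=dict)   # key-value pairs, like columns

@dataclass
class Edge:
    src: int
    dst: int
    label: str
    props: dict = field(default_factory=dict)   # edges carry properties too

# Local integer ids are enough: no globally unique URIs are required.
vertices = {
    1: Vertex(1, "User", {"name": "Alice"}),
    2: Vertex(2, "User", {"name": "Bob"}),
    3: Vertex(3, "City", {"name": "Paris"}),
}
edges = [
    Edge(1, 2, "isFriend", {"since": 2019}),
    Edge(2, 3, "livesIn"),
]

def by_label(label):
    """Return all vertices with the given label, like scanning one table."""
    return [v for v in vertices.values() if v.label == label]
```
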
SLIDE 7

Query Languages

  • Cypher
    • Originally used by Neo4j
    • Linear queries
  • Previous example in Cypher

MATCH (u:User)-[:isFriend]->(f:User)-[:livesIn]->(:City {name: 'Paris'})
WHERE u.name = 'Alice'
RETURN f.name
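
Because the pattern is linear, it can be read as a chain of adjacency expansions: start from the matching users, follow `isFriend`, then follow `livesIn`. The Python sketch below mimics that reading on made-up data. The vertex ids, the names `expand` and `friends_in`, and the data are assumptions, and this is not how Neo4j executes the query.

```python
from collections import defaultdict

# Hypothetical data: vertex id -> (label, properties).
vertices = {
    1: ("User", {"name": "Alice"}),
    2: ("User", {"name": "Bob"}),
    3: ("User", {"name": "Carol"}),
    4: ("City", {"name": "Paris"}),
    5: ("City", {"name": "Rome"}),
}
out_edges = defaultdict(list)
for src, etype, dst in [(1, "isFriend", 2), (1, "isFriend", 3),
                        (2, "livesIn", 4), (3, "livesIn", 5)]:
    out_edges[src].append((etype, dst))

def expand(vid, etype, label):
    """Follow outgoing edges of type `etype` to vertices labeled `label`."""
    return [d for t, d in out_edges[vid] if t == etype and vertices[d][0] == label]

def friends_in(user_name, city_name):
    """Names of friends of `user_name` who live in `city_name`."""
    result = []
    for u, (label, props) in vertices.items():
        if label != "User" or props["name"] != user_name:
            continue
        for f in expand(u, "isFriend", "User"):
            for c in expand(f, "livesIn", "City"):
                if vertices[c][1]["name"] == city_name:
                    result.append(vertices[f][1]["name"])
    return result
```
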

SLIDE 8

Relational Representation of Graphs

  • Graphs in a relational DBMS
    • Vertex table, edge table
    • Sometimes edges are stored as triples
  • Pattern matching
    • Maintain a set of partial matches
    • Extend by one edge: self-join on the edge table
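
The "extend by one edge" step can be tried directly with SQLite from Python's standard library. The schema and data below are assumptions made for illustration. The rows of `edge` are the partial matches of length one, and the self-join extends each of them by one more edge.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one vertex table and one edge table.
conn.execute("CREATE TABLE vertex (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
conn.executemany("INSERT INTO vertex VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c"), (4, "d")])
conn.executemany("INSERT INTO edge VALUES (?, ?)",
                 [(1, 2), (2, 3), (2, 4), (3, 4)])

# Self-join on the edge table: every match (e1.src, e1.dst) is extended
# with each edge that starts where it ends.
two_hop = conn.execute(
    """
    SELECT e1.src, e1.dst, e2.dst
    FROM edge AS e1
    JOIN edge AS e2 ON e1.dst = e2.src
    ORDER BY e1.src, e1.dst, e2.dst
    """
).fetchall()
```

Matching a longer path repeats the same join once per additional edge, which is why graph queries tend to contain many joins.
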
SLIDE 9

Why are Graph Workloads Hard?

  • Many joins: difficult to estimate cardinality
    • Joins require random access
    • Cardinality estimation gets harder at every join
  • Skew: a few vertices have very high degree
  • Indexing
    • Adjacency list scans are very frequent
    • Graph-aware databases optimize these
  • Some queries have very low selectivity
    • E.g., triangle closure (potential friends)
SLIDE 10

Worst-Case Optimal Joins

  • Worst-case optimality
    • Size of intermediate results is bounded by the worst-case size of the final result: O(intermediate results) <= O(final results)
  • Edge-at-a-time approach is not worst-case optimal
    • Number of triangles: $O(|E|^{3/2})$
    • Number of wedges: $O(|E|^2)$
  • Vertex-at-a-time approaches (multi-way joins) are worst-case optimal (WCO)
    • (v1, v2), (v1, v2, v3), (v1, v2, v3, v4), …
    • Will not materialize all wedges
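
The gap between wedges and triangles is easy to see on a star graph, which has many wedges and no triangles. The sketch below contrasts the two strategies on an undirected graph. It is meant to illustrate why avoiding wedge materialization matters, not to be a tuned worst-case optimal join, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def adjacency(edges):
    """Undirected adjacency sets built from an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def triangles_edge_at_a_time(edges):
    """Materialize every wedge v-u-w, then keep those with a closing edge.

    Returns (number of wedges, number of distinct triangles).
    """
    adj = adjacency(edges)
    wedges = [(v, u, w) for u in adj for v, w in combinations(sorted(adj[u]), 2)]
    closed = {frozenset(wedge) for wedge in wedges if wedge[2] in adj[wedge[0]]}
    return len(wedges), len(closed)

def triangles_vertex_at_a_time(edges):
    """Extend (v1, v2) to (v1, v2, v3) by intersecting adjacency sets."""
    adj = adjacency(edges)
    found = set()
    for v1 in adj:
        for v2 in adj[v1]:
            if v1 < v2:
                # Only common neighbors are produced; open wedges never are.
                for v3 in adj[v1] & adj[v2]:
                    if v2 < v3:
                        found.add((v1, v2, v3))
    return len(found)

star = [(0, i) for i in range(1, 101)]   # 100 edges, no triangles
```

On `star`, the edge-at-a-time version builds 4950 wedges and then discards all of them, while the vertex-at-a-time version never produces an intermediate result that is not a triangle.
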
SLIDE 11

Subgraph Isomorphism (TurboISO)

  • SubTask 1: match a spanning tree from one starting vertex
  • SubTask 2: match cross-edges

[Figure: a single starting vertex vs. multiple matching vertices. Heavyweight: 100 * 100 = 10^4 subgraphs * 2 edge lookups. Lightweight: 10 + 10 + 10*10 + 10*10 = 220 edge lookups.]

SLIDE 12

TurboISO: Flexible Join Order


SLIDE 13

Hard to Parallelize

[Figure: running time (ms)]

SLIDE 14

Subgraph Enumeration

  • Count all instances of an unlabeled pattern
    • E.g., triangles, squares, cliques
  • Important to rule out permutations of the same match, so each instance is counted once
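
A small brute-force sketch shows why permutations matter: without a symmetry-breaking rule, every triangle is reported once per ordering of its three vertices. The function name and the `a < b < c` rule are illustrative assumptions, and the cubic loop is only suitable for tiny graphs.

```python
from itertools import permutations

def count_triangle_matches(edges, break_symmetry):
    """Count ordered vertex triples (a, b, c) that form a triangle."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    count = 0
    for a, b, c in permutations(adj, 3):
        # Symmetry breaking: accept only one canonical ordering per triangle.
        if break_symmetry and not (a < b < c):
            continue
        if b in adj[a] and c in adj[b] and a in adj[c]:
            count += 1
    return count

k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]   # complete graph on 4 vertices
```

On `k4`, the unrestricted count is 24 (4 triangles times 3! orderings), and the symmetry-broken count is 4.
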
SLIDE 15

Reachability Queries

  • Given two vertices v and u
    • Find (and/or rank) paths connecting them
  • Simplest approach: parallel BFS from both vertices
    • Expensive
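
The BFS-from-both-vertices idea can be sketched as a sequential, level-by-level bidirectional search. The version below assumes an undirected graph and returns only the length of a shortest connecting path; the function names are hypothetical, and running the two searches in parallel or ranking paths is left out.

```python
from collections import defaultdict

def build_undirected(edges):
    """Undirected adjacency sets built from an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def bidirectional_distance(adj, source, target):
    """Length of a shortest path between source and target, or None."""
    if source == target:
        return 0
    dist_s, dist_t = {source: 0}, {target: 0}
    frontier_s, frontier_t = {source}, {target}
    while frontier_s and frontier_t:
        # Expand the smaller frontier by one full level.
        if len(frontier_s) > len(frontier_t):
            frontier_s, frontier_t = frontier_t, frontier_s
            dist_s, dist_t = dist_t, dist_s
        best = None
        next_frontier = set()
        for u in frontier_s:
            for v in adj.get(u, ()):
                if v in dist_t:
                    # The two searches meet on the edge (u, v).
                    cand = dist_s[u] + 1 + dist_t[v]
                    best = cand if best is None else min(best, cand)
                elif v not in dist_s:
                    dist_s[v] = dist_s[u] + 1
                    next_frontier.add(v)
        if best is not None:
            return best
        frontier_s = next_frontier
    return None
```

Even with two frontiers, a high-degree vertex can make a single level touch a large part of the graph, which is one reason the approach is expensive.
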
SLIDE 16

Dynamic Graphs

  • Temporal analysis → deal with multiple snapshots
  • Real-time analytics → work on live graph data
  • Storage implications

[Figure: updates go to a transactional system (dynamic data structure + transactions, e.g., B-Tree, LSMT); data is loaded into an analytical system (read-only data structure, no transactions, e.g., CSR), which produces the results.]

SLIDE 17

Graph Storage for RT Analytics

  • Sequential adjacency list scan is important
  • CSR (Compressed Sparse Row): sequential scan, but read-only
  • TEL (Transactional Edge Log): log-based adjacency list
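
To see why CSR gives sequential scans but is awkward to update, here is a minimal sketch that builds the two CSR arrays from a directed edge list. The function names are assumptions, and vertices are assumed to be numbered from 0 to `num_vertices - 1`.

```python
def build_csr(num_vertices, edges):
    """Return (offsets, targets) for a directed edge list.

    The neighbors of v are targets[offsets[v]:offsets[v + 1]].
    """
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1            # out-degree of each vertex
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]     # prefix sum gives start positions
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()         # next free slot per vertex
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets

def neighbors(offsets, targets, v):
    """Adjacency list scan: one contiguous slice of `targets`."""
    return targets[offsets[v]:offsets[v + 1]]
```

Inserting one edge would shift every later entry of `targets` and update the offsets that follow, which is why CSR is usually treated as a read-only structure.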

[Figure: three panels comparing TEL, LSMT, B+Tree, and Linked List as the graph scale |V| grows from 2^20 to 2^26. Cache misses (cache miss/edge), seek time (µs/vertex), and edge scan (ns/edge).]

SLIDE 18

Open Issues

  • Graph analytics algorithms are diverse
    • Still looking for good APIs
    • There is no “SQL for graphs”
  • Hard to leverage hardware characteristics
    • Scale-out to distributed systems: hard because of edge cuts
    • SIMD: hard because of skew and random access
    • Caching: hard because of random access