graph databases

Graph Databases Marco Serafini COMPSCI 532 Lecture 10



  1. Graph Databases Marco Serafini COMPSCI 532 Lecture 10

  2. Graph DB Use Cases
     • Social network queries
       • E.g., Facebook stores its entire metadata in a social graph
     • Network security
       • Find the sequence of steps that leads to an intrusion
     • Fraud detection
       • Find fraud rings
     • Knowledge bases
       • Answer questions, language models

  3. Resource Description Framework
     • World Wide Web Consortium specification
     • Used for the Semantic Web
       • Web pages define human-readable content
       • Goal: add machine-readable metadata describing how pages relate
     • Format to reuse and share data across the Web
     • Examples
       • Wikipedia, census, life sciences, DBPedia
     • Directed labeled multi-graph

  4. RDF Format
     • Graph is a set of triples (Subject, Predicate, Object)
     • Subject and predicate are resources
       • Associated with Uniform Resource Identifiers (URIs)
     • Object can be a resource or a literal (string)
     From S. Decker et al., "Framework for the Semantic Web: An RDF Tutorial"
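The triple model above can be sketched as a plain set of tuples. This is only an illustration: the URIs, user identifiers, and the helper `objects` are hypothetical names chosen for the example, not part of any RDF library.

```python
# Hypothetical namespace and data, used only for illustration.
SN = "http://socialnetwork.com/ontology/"

# An RDF graph is a set of (subject, predicate, object) triples.
triples = {
    (SN + "user1", SN + "hasName", "alice01"),        # object is a literal
    (SN + "user1", SN + "isFriendOf", SN + "user2"),  # object is a resource
    (SN + "user2", SN + "livesIn", SN + "Paris"),
}


def objects(graph, subject, predicate):
    """Return every object o such that (subject, predicate, o) is in the graph."""
    return {o for (s, p, o) in graph if s == subject and p == predicate}


friends = objects(triples, SN + "user1", SN + "isFriendOf")
```

Because several triples may share the same subject and object with different predicates, the same structure also represents a directed labeled multi-graph.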

  5. Query Language: SPARQL
     • Declarative
     • Defines a query graph
     • RDF store must find all instances in the data graph
     • Example: "Return friends of user alice01 who live in Paris"

         PREFIX sn: <http://socialnetwork.com/ontology/>
         SELECT ?friend
         WHERE {
           ?user sn:hasName "alice01" ;
                 sn:isFriendOf ?friend .
           ?friend sn:livesIn sn:Paris .
         }
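To make "find all instances of the query graph in the data graph" concrete, here is a minimal sketch of a naive matcher for a list of triple patterns. It is not how an RDF store evaluates SPARQL (real engines use indexes and join ordering); the data, the shortened identifiers, and the function names `is_var`, `match_pattern`, and `query` are all assumptions made for this example.

```python
def is_var(term):
    """Terms starting with '?' are treated as query variables (a simplification)."""
    return isinstance(term, str) and term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; return None if impossible."""
    new = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # Bind the variable, or check consistency with an earlier binding.
            if new.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new


def query(graph, patterns):
    """Return all variable bindings that satisfy every triple pattern."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for b in bindings
            for triple in graph
            if (extended := match_pattern(pattern, triple, b)) is not None
        ]
    return bindings


# Hypothetical data graph with shortened identifiers.
graph = {
    ("user1", "hasName", "alice01"),
    ("user1", "isFriendOf", "user2"),
    ("user1", "isFriendOf", "user3"),
    ("user2", "livesIn", "Paris"),
    ("user3", "livesIn", "Rome"),
}
patterns = [
    ("?user", "hasName", "alice01"),
    ("?user", "isFriendOf", "?friend"),
    ("?friend", "livesIn", "Paris"),
]
friends = sorted(b["?friend"] for b in query(graph, patterns))
```

Each pattern in the list plays the role of one edge of the query graph, and the shared variables are what join the patterns together.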

  6. Property Graph Format
     • Vertices and edges can have associated properties
       • Key-value pairs
     • Vertices can be grouped by label
       • Similar to tables, e.g., employees
       • Properties are similar to columns of a table
     • Not a "global" format: no URIs required
     • Typically more compact than RDF
     • Common in NoSQL graph databases

  7. Query Languages
     • Cypher
       • Originally used by Neo4j
       • Linear queries
     • Previous example in Cypher

         MATCH (u:User)-[:isFriend]->(f:User)-[:livesIn]->(:City {name: 'Paris'})
         WHERE (u.name = 'Alice')
         RETURN f.name

  8. Relational Representation of Graphs
     • Graphs in a relational DBMS
       • Vertex table, edge table
       • Sometimes edges are stored as triples
     • Pattern matching
       • Maintain a set of partial matches
       • Extend by edge: self-join on the edge table
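The "extend by edge" step can be sketched with Python's built-in `sqlite3` module. The table name `edge`, its columns, and the sample edges are assumptions for this illustration; each additional `JOIN` on the edge table extends the partial matches by one edge.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
conn.executemany(
    "INSERT INTO edge VALUES (?, ?)",
    [(1, 2), (2, 3), (1, 3), (3, 4)],  # hypothetical directed edges
)

# Partial matches start as single edges (a)->(b).
# One self-join extends them by an edge (b)->(c), giving all 2-hop paths.
two_hop = conn.execute(
    "SELECT e1.src, e1.dst, e2.dst "
    "FROM edge e1 JOIN edge e2 ON e1.dst = e2.src "
    "ORDER BY 1, 2, 3"
).fetchall()

# A second self-join closes the pattern with the edge (a)->(c): a triangle.
triangles = conn.execute(
    "SELECT e1.src, e1.dst, e2.dst "
    "FROM edge e1 "
    "JOIN edge e2 ON e1.dst = e2.src "
    "JOIN edge e3 ON e3.src = e1.src AND e3.dst = e2.dst "
    "ORDER BY 1, 2, 3"
).fetchall()
```

On this toy data the 2-hop query returns three partial matches, and only one of them survives the closing join.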

  9. Why Are Graph Workloads Hard?
     • Many joins: difficult to estimate cardinality
       • Joins require random access
       • Cardinality estimation gets harder at every join
     • Skew: few vertices have very high degree
     • Indexing
       • Adjacency list scans are very frequent
       • Graph-aware databases optimize these
     • Some queries have very low selectivity
       • E.g., triangle closure (potential friends)

  10. Worst-Case Optimal Joins
      • Worst-case optimality
        • Intermediate results are bounded by the worst-case size of the final result: O(intermediate results) <= O(final results)
      • Edge-at-a-time approach is not worst-case optimal
        • Number of triangles: $O(|E|^{3/2})$
        • Number of wedges: $O(|E|^2)$
      • Vertex-at-a-time (multi-way) joins are worst-case optimal (WCO)
        • Extend matches as $(v_1, v_2)$, $(v_1, v_2, v_3)$, $(v_1, v_2, v_3, v_4)$, …
        • Will not materialize all wedges
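The difference between the two strategies can be illustrated on triangle matching. The sketch below is a simplified illustration, not a tuned worst-case optimal join: the function names and the sample graph are assumptions, and edges are oriented from the smaller to the larger vertex id so that each triangle is found once.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Adjacency sets with each undirected edge oriented from smaller to larger id."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            a, b = (u, v) if u < v else (v, u)
            adj[a].add(b)
    return adj


def triangles_edge_at_a_time(adj):
    """Materialize every wedge (a, b, c), then keep those closed by the edge (a, c)."""
    wedges = [(a, b, c) for a in list(adj) for b in adj[a] for c in adj.get(b, ())]
    triangles = [(a, b, c) for (a, b, c) in wedges if c in adj[a]]
    return triangles, len(wedges)


def triangles_vertex_at_a_time(adj):
    """Extend (a, b) to (a, b, c) by intersecting adjacency sets; no wedge list is built."""
    triangles = []
    for a in list(adj):
        for b in adj[a]:
            for c in adj[a] & adj.get(b, set()):
                triangles.append((a, b, c))
    return triangles


# Hypothetical graph: three layers {0,1,2} -> {3,4,5} -> {6,7,8}, plus one edge (0, 6).
edges = (
    [(a, b) for a in (0, 1, 2) for b in (3, 4, 5)]
    + [(b, c) for b in (3, 4, 5) for c in (6, 7, 8)]
    + [(0, 6)]
)
adj = build_adjacency(edges)
tri_edge, n_wedges = triangles_edge_at_a_time(adj)
tri_vertex = triangles_vertex_at_a_time(adj)
```

Here the edge-at-a-time plan builds 27 wedges to find 3 triangles, while the vertex-at-a-time plan only ever produces candidates for the third vertex that are adjacent to both matched vertices.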

  11. Subgraph Isomorphism (TurboISO)
      • SubTask 1: match spanning tree
      • SubTask 2: match cross-edges
      [Figure: the two subtasks illustrated from a single starting vertex v, contrasting lightweight and heavyweight matching by number of subgraphs and edge lookups]

  12. TurboISO: Flexible Join Order

  13. Hard to Parallelize
      [Figure: running time (ms)]

  14. Subgraph Enumeration
      • Count all instances of an unlabeled pattern
        • E.g., triangles, squares, cliques
      • Important to rule out permutations
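Why permutations matter can be seen on triangles: a naive matcher reports each triangle once per ordering of its three vertices. The brute-force sketch below, with hypothetical function names and a complete graph on four vertices as sample data, contrasts that with counting under the ordering constraint u < v < w.

```python
from itertools import combinations, permutations


def undirected_adjacency(edges):
    """Symmetric adjacency sets for an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def count_ordered_matches(adj):
    """Count every ordered assignment (u, v, w) that forms a triangle."""
    return sum(
        1
        for u, v, w in permutations(adj, 3)
        if v in adj[u] and w in adj[v] and u in adj[w]
    )


def count_instances(adj):
    """Count each triangle once by only considering u < v < w."""
    return sum(
        1
        for u, v, w in combinations(sorted(adj), 3)
        if v in adj[u] and w in adj[v] and u in adj[w]
    )


# Hypothetical input: the complete graph on vertices 0..3.
k4 = undirected_adjacency(combinations(range(4), 2))
ordered = count_ordered_matches(k4)
instances = count_instances(k4)
```

The ordered count is 3! = 6 times the number of instances, which is the redundancy that ruling out permutations removes.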

  15. Reachability Queries
      • Given two vertices v and u
      • Find (and/or rank) paths connecting them
      • Simplest approach: parallel BFS from both vertices
        • Expensive
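A minimal sequential sketch of the BFS-from-both-vertices idea follows. It only answers whether a path exists in an undirected graph and alternates between the two searches instead of running them in parallel; the function name and the sample adjacency lists are assumptions for the example.

```python
from collections import deque


def reachable_bidirectional(adj, v, u):
    """Return True if the undirected graph `adj` contains a path between v and u."""
    if v == u:
        return True
    seen_v, seen_u = {v}, {u}
    frontier_v, frontier_u = deque([v]), deque([u])
    while frontier_v and frontier_u:
        # Expand the smaller frontier by one full BFS level.
        if len(frontier_v) <= len(frontier_u):
            frontier, seen, other = frontier_v, seen_v, seen_u
        else:
            frontier, seen, other = frontier_u, seen_u, seen_v
        for _ in range(len(frontier)):
            x = frontier.popleft()
            for y in adj.get(x, ()):
                if y in other:  # the two searches met
                    return True
                if y not in seen:
                    seen.add(y)
                    frontier.append(y)
    # One side exhausted its component without meeting the other.
    return False


# Hypothetical graph with two components: {1, 2, 3} and {4, 5}.
adj = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
```

For example, `reachable_bidirectional(adj, 1, 3)` succeeds, while `reachable_bidirectional(adj, 1, 5)` fails once the search from vertex 1 has exhausted its component. Even in this form, each query may touch a large part of the graph, which is why the approach is expensive.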

  16. Dynamic Graphs
      • Temporal analysis → deal with multiple snapshots
      • Real-time analytics → work on live graph data
      • Storage implications
        • Transactional system: receives updates; dynamic data structure + transactions (e.g., B-Tree, LSMT)
        • Analytical system: loads the data and produces results; read-only data structure, no transactions (e.g., CSR)

  17. Graph Storage for RT Analytics
      • Sequential adjacency list scan is important
      • CSR: sequential scan but read-only
      • TEL: log-based adjacency list
      [Figure: three panels comparing TEL, B+Tree, LSMT, and Linked List as graph scale |V| grows from 2^20 to 2^26: cache misses (cache miss/edge), seek time (µs/vertex), edge scan (ns/edge)]
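The CSR (compressed sparse row) layout mentioned above can be sketched in a few lines. The function names and the sample edges are assumptions for illustration. All neighbors of a vertex sit in one contiguous slice of `targets`, which is what makes scans sequential; inserting an edge would require shifting the arrays, which is why the structure is treated as read-only.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays (offsets, targets) from a list of directed (src, dst) edges."""
    # Count the out-degree of each vertex, shifted by one position.
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    # Prefix sum: offsets[v] becomes the start of v's neighbor slice.
    for i in range(num_vertices):
        offsets[i + 1] += offsets[i]
    # Place each destination in the next free slot of its source's slice.
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def scan_neighbors(offsets, targets, v):
    """Return the neighbors of v as one contiguous slice of `targets`."""
    return targets[offsets[v]:offsets[v + 1]]


# Hypothetical graph with 4 vertices.
offsets, targets = build_csr(4, [(0, 1), (0, 2), (1, 2), (3, 0)])
```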

  18. Open Issues
      • Graph analytics algorithms are diverse
        • Still looking for good APIs
        • There is no "SQL for graphs"
      • Hard to leverage hardware characteristics
        • Scale-out to distributed systems: hard because of edge cuts
        • SIMD: hard because of skew and random access
        • Caching: hard because of random access
