Introduction to Graph Data Management. Claudio Gutierrez, Center for Semantic Web Research (CIWS), Department of Computer Science, Universidad de Chile. EDBT Summer School, Palamos 2015. Joint work with Renzo Angles, Universidad de Talca, Chile.


slide-1
SLIDE 1

Introduction to Graph Data Management

Claudio Gutierrez

Center for Semantic Web Research (CIWS) Department of Computer Science Universidad de Chile EDBT Summer School – Palamos 2015

slide-2
SLIDE 2

Joint Work With

Renzo Angles, Universidad de Talca, Chile

  • C. Gutierrez – EDBT Summer School - Palamos 2015
slide-3
SLIDE 3
  • 1. Reminder / comment on first lecture
  • 2. Graph query language concepts
  • 3. Querying graphs
  • 4. Graph databases and systems

Agenda for today: querying

slide-4
SLIDE 4

Golden Age of Graph Databases


Jan Hidders Alexandra Poulovassilis

slide-5
SLIDE 5

Reminder I: Property Graph data model

slide-6
SLIDE 6

Reminder II: TOC of Graph-Theory/Networks books

Graph Theory (R. Diestel)

  • 1. Introduction
  • 2. Matching
  • 3. Connectivity
  • 4. Planar Graphs
  • 5. Colouring
  • 6. Flows
  • 7. Substructures in Dense Graph
  • 8. Ramsey Theory for Graphs
  • 9. Hamilton Cycles
  • 10. Random Graphs
  • 11. Minor, Trees, Well Quasi Orders

Networks: An Introduction (M. Newman)

  • 1. Introduction
  • 2. Technological Networks
  • 3. Social Networks
  • 4. Networks of information
  • 5. Biological Networks
  • 6. Mathematics of Networks
  • 7. Measures and Methods
  • 8. The large-scale structure of networks
  • 9. Basic concepts of algorithms
  • 10. Fundamental network algorithms
  • 11. Matrix algorithms and graph partitioning
  • 12. Random Graphs
  • 13. Random Graphs with general degree distribution
  • 14. Models of network formation
  • 15. Other network models
  • 16. Percolation and network resilience
  • 17. Epidemics on networks
  • 18. Dynamical systems on networks
  • 19. Network search
slide-7
SLIDE 7

Q1 (Property Graph data model)

Name one positive feature and one negative feature of the Property Graph data model

Q2 (Graph theory – Data management)

Name one result (theorem, area, topic, algorithm, technique, etc.) from Graph Theory that you consider could be useful for improving Graph Data management.

Quiz / Inquiry

slide-8
SLIDE 8

Graph Query Language Notions

Agenda

slide-9
SLIDE 9

Database Models: Codd’s definition


  • Data structures
  • Query Language
  • Integrity constraints

slide-10
SLIDE 10

Database Models: Codd’s definition


Query Language

Data manipulation is expressed by graph transformations, or by operations whose main primitives are on graph features like paths, neighborhoods, subgraphs, graph patterns, connectivity, and graph statistics.

slide-11
SLIDE 11
A supermarket list of types of queries

  • A. “Basic” Graph Queries
  • 1. Pattern matching
  • 2. Adjacency / neighborhood
  • 3. Reachability / connectivity
      • 1. Regular (and regular++)
      • 2. CRPQ
      • 3. etc.
  • 4. Summarization
  • 5. …
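
To make the first three "basic" query types concrete, here is a minimal Python sketch over a toy edge-labelled graph. The graph, the node names, and the function names are all hypothetical and not part of the lecture; the reachability function covers only the simplest regular path query (one label repeated one or more times), not full regular path queries or CRPQs.

```python
from collections import deque

# Hypothetical toy graph: each edge is a (source, label, target) triple.
EDGES = [
    ("ana", "knows", "bob"),
    ("bob", "knows", "cat"),
    ("cat", "worksAt", "acme"),
    ("ana", "worksAt", "acme"),
]


def neighbors(edges, node, label=None):
    """Adjacency query: targets of outgoing edges, optionally for one label."""
    return {t for s, l, t in edges if s == node and (label is None or l == label)}


def reachable(edges, start, label=None):
    """Reachability query: nodes reached from start by one or more edges.

    With a label, this corresponds to the path expression label+.
    """
    seen, queue = set(), deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbors(edges, current, label):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def match_pattern(edges, label1, label2):
    """Pattern matching: all (x, y, z) such that x -label1-> y -label2-> z."""
    return {
        (x, y, z)
        for x, l1, y in edges if l1 == label1
        for y2, l2, z in edges if l2 == label2 and y2 == y
    }
```

For instance, `reachable(EDGES, "ana", "knows")` follows only `knows` edges, while `match_pattern(EDGES, "knows", "worksAt")` joins two edges on their shared middle node.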

slide-12
SLIDE 12
A supermarket list of types of queries (cont.)

  • B. Analytical Queries
  • 1. Centrality measures
  • 2. Diameter and other global properties
  • 3. Various statistics
  • 4. Graph properties and parameters
  • 5. …
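
As an illustration of two analytical queries from this list, the Python sketch below computes degree centrality and the diameter of a small undirected graph. The graph and function names are hypothetical, degree centrality is only one of many centrality measures, and the diameter routine assumes a connected graph and uses plain breadth-first search from every node, which would not scale to large graphs.

```python
from collections import deque

# Hypothetical undirected toy graph as an adjacency dict:
# a path a-b-c-d plus one extra edge b-e.
GRAPH = {
    "a": {"b"},
    "b": {"a", "c", "e"},
    "c": {"b", "d"},
    "d": {"c"},
    "e": {"b"},
}


def degree_centrality(graph):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    return {node: len(adj) / (n - 1) for node, adj in graph.items()}


def bfs_distances(graph, source):
    """Shortest-path distance (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def diameter(graph):
    """Longest shortest-path distance; assumes the graph is connected."""
    return max(max(bfs_distances(graph, node).values()) for node in graph)
```

Unlike the basic queries, both functions inspect the whole graph rather than the neighborhood of a few nodes, which is one reason analytical queries are often handled by separate systems.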

slide-13
SLIDE 13

Something is going wrong…

Seems like we are in Linnaean times: lots of arbitrary animals collected and discovered, but no way of making sense of this diversity.

Either:

  • we are not understanding graphs, or
  • graphs are not understandable by 21st-century humans, or
  • we do not know what we are looking for …

But one thing is clear: a scientific description cannot be an arbitrary list of properties.

slide-14
SLIDE 14
  • 1. Genericity (independence of coding of data)
  • 2. Good expressive power
  • 3. Low complexity of evaluation
  • 4. Simple syntax and semantics
  • 5. Compositionality
  • 6. Few and simple constructors
  • 7. Hopefully not operational semantics
  • 8. User friendly / low barrier to entry
  • 9. Standard…

Some desirable features of a query language

slide-15
SLIDE 15

Graph Query Language: I/O types

Graphs Relations

slide-16
SLIDE 16

Graph Query Language: basic modules


transform define data sources

slide-17
SLIDE 17

Graph query languages: their basic modules


| Language | Define Source | Extract | Transform | Construct |
| SQL | FROM | WHERE | | SELECT |
| SPARQL | FROM, Service | pattern matching | operators | Select, ASK, Construct, … |
| Datalog | | match facts | rules | head |
| XQuery | | | | |
| XSLT | | | | |
| … | | | | |

Exercise: Fill in the blanks; add your favorite language

slide-18
SLIDE 18

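SPARQL Query

  • Query Form: SELECT, ASK, CONSTRUCT, DESCRIBE
  • Dataset Clause: FROM, FROM NAMED
  • Where Clause (Graph Pattern): triple pattern, FILTER, OPTIONAL, AND, UNION

Figure: structure of a SPARQL query. The outputs shown are tables with columns X, Y, Z and X, Y, and TRUE / FALSE.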
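
The core of the Where clause, matching triple patterns and combining them with AND and FILTER, can be sketched in a few lines of Python. This is an illustrative toy evaluator under stated assumptions, not how any SPARQL engine is implemented: the dataset, the `?`-prefixed variable convention, and the names `match_triple`, `evaluate`, `select`, and `ask` are all hypothetical, and OPTIONAL, UNION, and the Dataset clause are left out.

```python
# Hypothetical RDF-like dataset: a set of (subject, predicate, object) triples.
TRIPLES = {
    ("tom", "knows", "axel"),
    ("axel", "knows", "frank"),
    ("tom", "age", 20),
    ("axel", "age", 31),
}


def is_var(term):
    """Variables are strings starting with '?', as in SPARQL."""
    return isinstance(term, str) and term.startswith("?")


def match_triple(pattern, triple, binding):
    """Return binding extended so that pattern matches triple, or None."""
    result = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            # Bind the variable, or check it against its existing value.
            if result.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return result


def evaluate(patterns, triples, condition=lambda binding: True):
    """AND of triple patterns, followed by a FILTER-like condition."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                candidate = match_triple(pattern, triple, binding)
                if candidate is not None:
                    extended.append(candidate)
        bindings = extended
    return [b for b in bindings if condition(b)]


def select(bindings, variables):
    """SELECT form: project each solution onto the requested variables."""
    return [tuple(b[v] for v in variables) for b in bindings]


def ask(bindings):
    """ASK form: TRUE if at least one solution exists."""
    return len(bindings) > 0
```

The two query forms differ only in what they do with the same set of solutions: `select` returns a table of bindings, `ask` a Boolean.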

slide-19
SLIDE 19

Cypher Query Language: structure

MATCH (p:Person)-[:Knows]->(friend)
WHERE p.age = 20
WITH p, count(friend) as friends
WHERE friends > 0
RETURN p.name, friends

Basic syntax

  • (p:Person) indicates the nodes having label Person
  • [:Knows] indicates a relation of type Knows
  • p.age indicates an attribute
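
To show what the query above computes, here is a rough Python equivalent over an in-memory property graph. The node and edge data and the function name `people_with_friends` are hypothetical, and the sketch mirrors the clause order of the query rather than the way a Cypher engine actually evaluates it.

```python
# Hypothetical toy property graph: nodes carry labels and attributes,
# edges are (source id, type, target id).
NODES = {
    1: {"labels": {"Person"}, "name": "Tom", "age": 20},
    2: {"labels": {"Person"}, "name": "Axel", "age": 20},
    3: {"labels": {"Person"}, "name": "Frank", "age": 35},
}
EDGES = [(1, "Knows", 2), (1, "Knows", 3), (3, "Knows", 1)]


def people_with_friends(nodes, edges, age):
    # MATCH (p:Person)-[:Knows]->(friend) WHERE p.age = <age>
    matches = [
        (source, target)
        for source, edge_type, target in edges
        if edge_type == "Knows"
        and "Person" in nodes[source]["labels"]
        and nodes[source]["age"] == age
    ]
    # WITH p, count(friend) AS friends
    counts = {}
    for p, _friend in matches:
        counts[p] = counts.get(p, 0) + 1
    # WHERE friends > 0 RETURN p.name, friends
    return [(nodes[p]["name"], friends) for p, friends in counts.items() if friends > 0]
```

In this toy data Axel is 20 but has no outgoing Knows edge, so no row is produced for him: the MATCH pattern already requires at least one friend.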
slide-20
SLIDE 20

Cypher Query Language: outputs

get nodes of a given type

  • A node
    MATCH (p:Person {name:"Tom"}) RETURN p
  • A value
    MATCH (p:Person {name:"Tom"}) RETURN p.age
  • A list of values
    MATCH (p:Person) RETURN p.name LIMIT 5
  • An array
    MATCH p=shortestPath((a)-[*]->(b)) WHERE a.name="Axel" AND b.name="Tom" RETURN p
  • A list of arrays
    MATCH p=((a)-[*]->(b)) WHERE a.name="Axel" AND b.name="Frank" RETURN p
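
The last two output shapes, a single path and a list of paths, can be illustrated with a short Python sketch. The graph and function names are hypothetical. As a simplifying assumption, the second function enumerates simple paths (no repeated nodes), which is close to but not the same as the matching semantics of a variable-length Cypher pattern.

```python
from collections import deque

# Hypothetical directed toy graph keyed by node name.
GRAPH = {
    "Axel": ["Bea", "Carl"],
    "Bea": ["Tom"],
    "Carl": ["Bea", "Tom"],
    "Tom": [],
}


def shortest_path(graph, source, target):
    """One shortest path as a list of nodes (the 'array' output), or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the parent links to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in graph[node]:
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


def all_simple_paths(graph, source, target, prefix=None):
    """Every path without repeated nodes (the 'list of arrays' output)."""
    prefix = (prefix or []) + [source]
    if source == target:
        return [prefix]
    paths = []
    for nxt in graph[source]:
        if nxt not in prefix:
            paths.extend(all_simple_paths(graph, nxt, target, prefix))
    return paths
```

The contrast in cost is worth noting: breadth-first search visits each edge at most once, while the number of paths returned by the second function can grow exponentially with the size of the graph.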

slide-21
SLIDE 21

Each social application is a consumer/producer of social networks, producing and/or collecting network data, and consuming data produced by other applications. [SNQL, SanMartin,_,Wood]

The flow of data in SNA: a notion of “data management”


Figure (data-flow diagram), box labels:

  • Data Collection
  • Network Manipulation and Storage
  • Sharing and Porting
  • Social Networks Data Management (Data Model/DBMS)
  • External Consumers/Producers
  • Incremental Data Feed
  • Local Consumer/Producer
  • Application Logic
  • Other Consumers/Producers
  • Social Network Analysis Tools
  • Query & Transformation
  • Interactive Data Set Production
  • Integration of Structural Measures

slide-22
SLIDE 22

Is the Web one more artifact, or is it “the” answer to scalability?


An aside: a different problem or “the” problem?

Figure: The Web (pages 1 to 5 linked by href) and The Web of Data (edges labelled p, q, r, n, t, m), both accessed through Search/Query. In the Web of Data, each urij has a description made of triples:

  • uri1: <uri1,r,uri4> <uri2,p,uri1> ...
  • uri2: <uri2,q,uri3> <uri1,p,uri2> ...
  • uri3: <uri3,m,uri4> <uri1,n,uri3> ...
  • uri4: <uri4,t,uri2> ...
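
One way to read the Web of Data picture is that a client only learns a URI's triples by looking that URI up, and discovers further URIs from the triples it gets back. The Python sketch below illustrates that reading. The triples are copied from the figure (with the "..." entries left out); the dictionary lookup standing in for an HTTP request and the names `describe` and `traverse` are assumptions made for illustration.

```python
# Descriptions copied from the figure; "..." entries are omitted.
# The dictionary stands in for dereferencing a URI, which is an assumption.
DESCRIPTIONS = {
    "uri1": [("uri1", "r", "uri4"), ("uri2", "p", "uri1")],
    "uri2": [("uri2", "q", "uri3"), ("uri1", "p", "uri2")],
    "uri3": [("uri3", "m", "uri4"), ("uri1", "n", "uri3")],
    "uri4": [("uri4", "t", "uri2")],
}


def describe(uri):
    """Hypothetical lookup: the triples published at uri (empty if unknown)."""
    return DESCRIPTIONS.get(uri, [])


def traverse(seed):
    """Follow URIs mentioned in retrieved triples, collecting every triple seen."""
    visited, pending, triples = set(), [seed], set()
    while pending:
        uri = pending.pop()
        if uri in visited:
            continue
        visited.add(uri)
        for subject, predicate, obj in describe(uri):
            triples.add((subject, predicate, obj))
            # Subjects and objects are URIs that may lead to more data.
            pending.extend(x for x in (subject, obj) if x not in visited)
    return visited, triples
```

In this tiny example every URI is reachable from every seed, so the traversal always ends with the full graph; on the real Web the result depends on the seed and on which sources answer.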

slide-23
SLIDE 23

The “use case” that triggered the Web design

slide-24
SLIDE 24
  • 1. Data sources/services are reliable
  • 2. Consumer behaviour can be anticipated
  • 3. Publishers are infallible and play no role
  • 4. You can know what’s out there
  • 5. Universal cost models can be maintained
  • 6. Query execution is always deterministic
  • 7. Standards = interoperability
  • 8. One system can ACE them all

(ACE: alignment, coverage, efficiency)


Eight fallacies when querying … [Umbrich,_,Hogan,Karnstedt, Parreira]

slide-25
SLIDE 25

Graph Databases and Systems

Agenda

slide-26
SLIDE 26

Reminder: Database Technology

Figure (architecture diagram), labels:

  • Applications, Services
  • Query languages (X1, X2, …, Xn), APIs
  • Data Structure: Graphs
  • Native Data Store, RDBMS (MySQL, MSQL, Oracle, DB2, Postgres), Files

slide-27
SLIDE 27

Classification (most influential models)


| Database model | Abstraction level | Data structure | Information focus |
| Network | Physical | Pointers, records | Records |
| Relational | Logical | Relations | Data, attributes |
| Semantic | User | Graph | Schema, relations |
| OO | Physical/logical | Objects | Objects, methods |
| Semi-structured | Logical | Tree | Data, components |
| Graph | Logical/user | Graph | Data, relations |

slide-28
SLIDE 28

(Interactive, BI, Graph analytics)

  • Graph Databases
  • Graph programming frameworks
  • RDF databases
  • Relational databases
  • NoSQL Key-value
  • NoSQL MapReduce
  • Batch processing
  • …


Classification issues (taken from P. Boncz’s lecture)

slide-29
SLIDE 29

  • Offline processing (offline analytics)
  • Online processing (online querying)
  • Optimized for response time or throughput
  • Transactional
  • …


Classification issues (taken from B. Shao’s lecture)

slide-30
SLIDE 30
  • 1. Address need of managing graph data
  • 2. Architecture/goals inspired by classical DBMS
  • 3. Persistent storage of graph data
  • 4. Transactionality
  • 5. Closed world
  • 6. Efficiency (over scalability)
  • 7. (Near future:) Portability (of data)
  • 8. (Near future:) Declarative query languages

Graph Databases

slide-31
SLIDE 31

Graph Databases

Comparison table, column groups:

  • Data model: Simple graph, Property graph, Hypergraph, Nested graph
  • Storage method: Native, Non-native
  • Query facilities: Query language, API, Graph algorithms
  • Computing model: Single-node, Distributed

Graph database (rows):

  • AllegroGraph
  • ArangoDB
  • Bitsy
  • Cayley
  • FlockDB
  • GraphBase
  • Graphd
  • Horton
  • HyperGraphDB
  • IBM System G
  • imGraph
  • InfiniteGraph
  • InfoGrid
  • Neo4j
  • OrientDB
  • Sparksee/DEX
  • Titan
  • Trinity
  • TurboGraph
slide-32
SLIDE 32

(Offline Graph Analytical Systems)

  • 1. Batch processing
  • 2. Analysis of large graphs
  • 3. Facilities for graph analytical algorithms
  • 4. Distributed environment
  • 5. Multiple machines
  • 6. API or programming as user access

Graph Processing Frameworks

slide-33
SLIDE 33
  • Pregel
  • Apache Giraph
  • GraphLab
  • Catch the Wind
  • GPS
  • Mizan
  • PowerGraph
  • GraphX
  • TurboGraph
  • GraphChi

Graph Processing Frameworks / Offline graph analytical systems

slide-34
SLIDE 34
Some conclusions / opinions

  • Graph data management has a bright future: we are living in very interesting times.
  • One size does not fit all: we need different GDBs for small, medium and web scale. Marry one, be aware of your choice, and learn to love it… try not to flirt with others.
  • It is not evident that there will be one standard graph query language: the use cases are too diverse.
  • We need to understand graphs better.
  • What matters in a graph is its topology [if you do not need it, stay in the relational world].
  • We need better interoperation between relational (tables) and graph data.