Popularity and Challenges of Graph Cypher Queries - PowerPoint PPT Presentation


Popularity and Challenges of Graph Cypher Queries

Sheik Shameer, Shivasurya Sankarapandian (CS846 Fall 2019)

Presentation Layout

  • Introduction
  • Motivation
  • Dataset Details
  • Methodology for Data Extraction
  • Results and implications
  • Threats to Validity

Introduction

  • A graph database is a NoSQL database that uses graph structures, with nodes and edges, for semantic queries.
  • Graph databases allow fast retrieval of complex hierarchical structures that are difficult to model in relational database systems.
  • They are commonly used in fraud detection, network and database infrastructure monitoring, recommendation engines, social networks, knowledge graphs, and privacy and risk management.

Social Network Experiment: Finding Friends of Friends

Database of 1,000,000 users, searching for 1,000 users


Cypher Queries

  • Some of the most popular graph database management systems are Neo4j, Microsoft Azure Cosmos DB, OrientDB, ArangoDB, and Virtuoso.
  • This study looks at Neo4j.
  • Some of the popular graph query languages are Cypher, SPARQL, GraphQL, and Gremlin.
  • This study looks at the Cypher query language.

Example query:

    MATCH (:Movie { title: 'Wall Street' })<-[:ACTED_IN|:DIRECTED]-(person)
    RETURN person.name

Result (person.name):

  • Oliver Stone
  • Michael Douglas
  • Charlie Sheen
  • Martin Sheen

Motivation

  • Can version control systems provide relevant information on the problems developers face in open source repositories?
  • Can we use Abstract Syntax Trees to mine Cypher queries from the repositories?
  • Can we start building a corpus of graph Cypher queries that others can use for further analysis?
  • Can the information we gain help others make useful contributions to the open source community?

Research Questions

  • RQ1: "What types of graph Cypher queries are popular among developers now?"
  • RQ2: "What types of graph Cypher queries do developers have trouble with?"

Data set

Language      Repositories Count   Cypher Queries Mined
Java          2579                 4159
JavaScript    1212                 832
Total         3791                 4991

Methodology for data extraction

Dataset: Java and JavaScript GitHub repositories

Extracting graph database queries from source code:

  • Regular expression pattern matching approach
  • Abstract Syntax Tree (AST) parsing approach
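As a rough illustration of the regular expression approach, the sketch below scans source text for string literals that begin with a Cypher clause keyword. The pattern, the keyword list, and the name find_cypher_literals are assumptions made for this example; the slides do not give the patterns the study actually used.

```python
import re

# Hypothetical pattern: a single- or double-quoted string literal whose
# content starts with a common Cypher clause keyword.
CYPHER_LITERAL = re.compile(
    r"""(["'])\s*((?:MATCH|CREATE|MERGE|CALL)\b(?:\\.|(?!\1).)*)\1""",
    re.IGNORECASE | re.DOTALL,
)


def find_cypher_literals(source: str) -> list[str]:
    """Return the contents of string literals that look like Cypher queries."""
    return [m.group(2) for m in CYPHER_LITERAL.finditer(source)]


# Made-up Java line, used only to exercise the pattern.
java_line = 'session.run("MATCH (p:Person) RETURN p.name");'
print(find_cypher_literals(java_line))
```

A pattern like this is cheap to run over many repositories but cannot follow a query that is assembled from several variables, which is one reason to also consider the AST approach.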

AST Approach

  • Follows the Visitor pattern
  • Parses source code and modules into a tree representation
  • Traverses Identifier and CallExpression nodes, looking for calls to official driver methods
  • Extracts parameters and variables within the block
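The steps above can be sketched with a visitor over a syntax tree. The study parsed JavaScript and Java; this example swaps in Python's built-in ast module so that it stays self-contained, and it assumes that a method named run stands in for the official driver methods. The class and function names are hypothetical.

```python
import ast

# Assumption: "run" stands in for the official driver methods the study looked for.
DRIVER_METHODS = {"run"}


class CypherCallVisitor(ast.NodeVisitor):
    """Collect query strings passed to driver-style method calls."""

    def __init__(self) -> None:
        self.string_vars: dict[str, str] = {}  # variable name -> string constant
        self.queries: list[str] = []

    def visit_Assign(self, node: ast.Assign) -> None:
        # Remember simple assignments such as: q = "MATCH (n) RETURN n"
        if isinstance(node.value, ast.Constant) and isinstance(node.value.value, str):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    self.string_vars[target.id] = node.value.value
        self.generic_visit(node)

    def visit_Call(self, node: ast.Call) -> None:
        func = node.func
        if isinstance(func, ast.Attribute) and func.attr in DRIVER_METHODS and node.args:
            first = node.args[0]
            if isinstance(first, ast.Constant) and isinstance(first.value, str):
                self.queries.append(first.value)  # query passed as a literal
            elif isinstance(first, ast.Name) and first.id in self.string_vars:
                self.queries.append(self.string_vars[first.id])  # query passed as a variable
        self.generic_visit(node)


def extract_queries(source: str) -> list[str]:
    visitor = CypherCallVisitor()
    visitor.visit(ast.parse(source))
    return visitor.queries


# Made-up source text; it is parsed, never executed.
sample = '''
q = "MATCH (n:Person) RETURN n.name"
session.run(q)
session.run("CREATE (:Person {name: $name})", name="Ada")
'''
print(extract_queries(sample))
```

The variable lookup here only resolves plain string assignments in the same file, which mirrors the limitation noted for queries generated at runtime.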

Mining and Tools (JavaScript)

  • 1212 JavaScript repositories from GitHub that use Neo4j
  • Verify that neo4j-driver is present in the repository
  • Esprima: parses source code and constructs the AST
  • Esprima-Walk: traverses the AST efficiently and filters queries
  • Node-git: fetches commit logs and code changes for the extracted graph database queries
  • Shell scripts for automation
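One plausible way to verify that a JavaScript repository depends on the driver is to inspect its package.json. The sketch below assumes the dependency is declared under dependencies or devDependencies; the slides do not say how the check was actually performed, and the function name is hypothetical.

```python
import json

DRIVER_PACKAGE = "neo4j-driver"  # npm package name of the official JavaScript driver


def uses_neo4j_driver(package_json_text: str) -> bool:
    """Return True if the manifest declares the driver as a dependency."""
    manifest = json.loads(package_json_text)
    for section in ("dependencies", "devDependencies"):
        if DRIVER_PACKAGE in manifest.get(section, {}):
            return True
    return False


# Made-up manifest used only to exercise the check.
example_manifest = '{"name": "demo", "dependencies": {"neo4j-driver": "^1.7.0"}}'
print(uses_neo4j_driver(example_manifest))
```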

Mining and Tools (Java)

  • 2579 Java repositories mined from GitHub
  • javalang Python module to build the AST
  • JavaParser library: we were able to mine the queries with this library
  • We looked for official driver methods and the variables used in them
  • We were able to mine queries defined in the same file as the call, around 4159 queries in total
  • Commit messages for 836 Java queries were mined using a combination of git log and grep commands

RQ1 - "What types of graph Cypher queries are popular among developers now?"

  • CALL, MATCH, and CREATE queries were the most popular among developers.
  • Cypher queries are therefore predominantly used for creating data, fetching data, and calling procedures.

[Figures: Types of Cypher queries in Java repositories; types of Cypher queries in JavaScript repositories]
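A simple way to bucket mined queries by type is to look at the first clause keyword of each query. The sketch below assumes that the leading keyword defines the type, which may differ from the classification used in the study; the sample queries are made up and are not taken from the data set.

```python
import re
from collections import Counter


def query_type(query: str) -> str:
    """Classify a query by its first keyword, e.g. MATCH, CREATE, or CALL."""
    match = re.match(r"\s*([A-Za-z]+)", query)
    return match.group(1).upper() if match else "UNKNOWN"


def count_query_types(queries: list[str]) -> Counter:
    return Counter(query_type(q) for q in queries)


# Made-up queries for illustration only.
sample_queries = [
    "MATCH (n:Person) RETURN n",
    "match (m:Movie) RETURN m.title",
    "CREATE (:Person {name: 'Ada'})",
    "CALL db.labels()",
]
print(count_query_types(sample_queries))
```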

Inferences RQ1

  • CALL procedures were very popular.
  • We also used tokenization and stemming from NLP to find the most used words in the messages of the commits that created the queries.
  • The word "procedure" was used heavily.

Word Token    Count
Procedure     654
Initial       626
Commit        611
Fixes         388
Neo4j         295
Annotation    259
Change        226
Sparkles      224
Branch        216
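The word counts above come from tokenizing and stemming commit messages. A minimal sketch of that kind of pipeline follows; the tokenizer and the deliberately naive plural-stripping stemmer are assumptions made for this example, not the tools used in the study, and a real stemmer such as NLTK's PorterStemmer could be passed in instead.

```python
import re
from collections import Counter
from typing import Callable


def naive_stem(token: str) -> str:
    """Very rough stand-in for a stemmer: strips a trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def count_tokens(messages: list[str], stem: Callable[[str], str] = naive_stem) -> Counter:
    """Lowercase each message, split it into word tokens, stem, and count."""
    counts: Counter = Counter()
    for message in messages:
        tokens = re.findall(r"[a-z0-9]+", message.lower())
        counts.update(stem(t) for t in tokens)
    return counts


# Made-up commit messages for illustration only.
commit_messages = ["Add procedure", "Fix procedures test", "Initial commit"]
print(count_tokens(commit_messages).most_common(3))
```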


Inferences RQ1 (continued)

  • Neo4j default procedures were used 516 times.
  • Other procedures worth mentioning came from APOC and from machine learning procedure libraries.
  • After tokenizing the CALL queries, we also found that users were writing their own procedures.
  • Neo4jversioner is a repository that deals with network and database infrastructure; its procedures can also be used by others working in related domains.

Procedures                                    Count
org.neo4j.procedure.simpleArgument            42
org.neo4j.procedure.writingProcedure          40
org.neo4j.procedure.defaultValues             30
org.neo4j.procedure.node                      28
org.neo4j.procedure.integrationTestMe         24
org.neo4j.procedure.schemaProcedure           20
org.neo4j.procedure.genericListWithDefault    18
org.neo4j.procedure.recursiveSum              18
org.neo4j.procedure.sideEffect                16
org.neo4j.procedure.createNode                12

Procedures                                    Count
graph.versioner.diff                          4
graph.versioner.diff.from.current             3
graph.versioner.diff.from.previous            3
graph.versioner.get.all                       2
graph.versioner.get.by.date                   1
graph.versioner.get.by.label                  2
graph.versioner.get.current.path              1
graph.versioner.get.current.state             1
graph.versioner.get.nth.state                 2
graph.versioner.init                          6
graph.versioner.patch                         6
graph.versioner.patch.from                    4
graph.versioner.rollback                      4
graph.versioner.rollback.nth                  2
graph.versioner.rollback.to                   4
graph.versioner.update                        4

Procedures                                    Count
regression.linear.addM                        2
regression.linear.create                      8
regression.linear.delete                      2
regression.linear.info                        3
regression.linear.load                        3
regression.linear.test                        1
regression.linear.train                       3
regression.logistic.add                       1
regression.logistic.delete                    1
regression.linear.add                         2
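Tables like these can be produced by pulling the procedure name out of each CALL query and counting occurrences. The sketch below is one assumed way to do that with a regular expression; the pattern and function names are hypothetical, and the sample queries are made up rather than taken from the data set.

```python
import re
from collections import Counter

# Matches the dotted procedure name that follows the CALL keyword.
CALL_PATTERN = re.compile(r"\bCALL\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def called_procedures(queries: list[str]) -> Counter:
    """Count how often each procedure name appears after CALL."""
    counts: Counter = Counter()
    for query in queries:
        counts.update(CALL_PATTERN.findall(query))
    return counts


def namespace(procedure: str) -> str:
    """Return everything before the last dot, e.g. 'graph.versioner'."""
    return procedure.rsplit(".", 1)[0] if "." in procedure else ""


# Made-up queries for illustration only.
sample_calls = [
    "CALL graph.versioner.init('Person', {name: 'Ada'})",
    "CALL graph.versioner.init('Movie', {title: 'Wall Street'})",
    "CALL regression.linear.create('model')",
]
print(called_procedures(sample_calls))
print(namespace("graph.versioner.init"))
```

Grouping by namespace makes it easier to separate default procedures from user-written ones.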

RQ2

  • What types of graph Cypher queries do developers have trouble with?
  • The 832 queries extracted from JavaScript and the 4159 from Java were checked for false positives.
  • git log, run with the line numbers and file names of each query, produced the commit information.
  • 100 random queries were sampled from JavaScript and Java.
  • The code changes and commit information were verified manually.
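Git can report the history of a specific line range with git log -L, which is one way to obtain commit information from line numbers and file names. The sketch below builds such a command; whether the study used -L or another combination of options is an assumption, and the file path and function names are hypothetical.

```python
import subprocess


def line_history_command(path: str, start: int, end: int) -> list[str]:
    """Build a 'git log -L' command that traces lines start..end of a file."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return ["git", "log", "-L", f"{start},{end}:{path}"]


def run_line_history(repo_dir: str, path: str, start: int, end: int) -> str:
    """Run the command inside a local clone and return git's raw output."""
    result = subprocess.run(
        line_history_command(path, start, end),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical file and line range where a query was found.
print(line_history_command("src/db/queries.js", 12, 14))
```

The output lists every commit that touched the range together with its diff, which can then be reviewed by hand as described above.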

[Figure: Refactored Neo4j query types, JavaScript repository commits for the sample of 100 queries]

[Figure: Refactored Neo4j query types, Java repository commits for the sample of 100 queries]

RQ2 Results

  • Transaction, MERGE, and MATCH queries were changed most often during refactoring, whereas other query types changed infrequently.
  • Rare and common edits to MATCH, CALL, and CREATE queries include:
      • Adding and renaming aliases
      • Adding and removing attributes
      • Adding and removing conditions
      • Adopting new versions of procedures and libraries

Threats to Validity

  • We collected JavaScript and Java source code from open source repositories, which may not be representative of all projects.
  • Developers may use Object Relational Mapping or runtime query generation, which static tools such as AST parsing can miss.
  • We generalize our results from the Java and JavaScript repositories we mined; repositories in other programming languages, such as Python, may provide further insights.