Tuning Cypher Andrew Bowman Customer Success Engineer NODES 2019 - - PowerPoint PPT Presentation
Tuning Cypher Andrew Bowman Customer Success Engineer NODES 2019 - - PowerPoint PPT Presentation
Tuning Cypher Andrew Bowman Customer Success Engineer NODES 2019 (andrew)-[:FREQUENTS]-> 2 How to get good at Cypher tuning Curiosity, patience, flexibility Sometimes its not the query that must bend... Know your tools
2
(andrew)-[:FREQUENTS]->
- Curiosity, patience, flexibility
- Sometimes it’s not the query that must bend...
- Know your tools and resources
- And knowing is half the battle
- Learn esoteric secrets key principles of Cypher execution
- Row cardinality, cheap vs expensive operations
- Practice!
3
How to get good at Cypher tuning
- Avoid redundant work and operations
- Cut out work that’s going to be filtered out anyway
- Recognize cheaper ways to do what you want
4
Goal: Do less work
Curiosity, patience, and flexibility
5
Patience in the fight against...
6
Cypher!
(or misunderstandings of its behavior)
Sometimes, the planner itself
Sometimes it is not the query that must bend...
7
...maybe you have to change your model… ...maybe you have to use APOC… ...maybe you need to write a stored procedure...
Tuning Tools and Resources
8
Reference documentation
9
- Neo4j Refcard
- Official Cypher manual
- Query tuning section
- Execution plan descriptions
- Cypher knowledge base
Query logs
For finding queries already-being used that need tuning Enable in neo4j.conf, successful queries exceeding threshold are logged. Options for finer grain timing breakdowns: Time spent planning, executing, and waiting on locks
10
2017-11-22 12:38 ... INFO 3 ms: bolt-session bolt johndoe neo4j-javascript/1.4.1 ... 2017-11-22 22:38 ... INFO 61 ms: (planning: 0, cpu: 58, waiting: 0) - 6164496 B - 0 page hits, 1 page faults ... 2017-11-22 12:38 ... INFO 78 ms: (planning: 40, cpu: 74, waiting: 0) - 6347592 B - 0 page hits, 0 page faults ... 2017-11-22 12:38 ... INFO 44 ms: (planning: 9, cpu: 25, waiting: 0) - 1311384 B - 0 page hits, 0 page faults ... 2017-11-22 12:38 ... INFO 6 ms: (planning: 2, cpu: 6, waiting: 0) - 420872 B - 0 page hits, 0 page faults - ...
Graph app in Neo4j Desktop, analyzes a query.log file
11
Query log analyzer
Graph structure CALL db.schema() OR CALL apoc.meta.graph() Indexes and constraints CALL db.indexes OR :schema Counts store CALL db.stats.retrieve('GRAPH COUNTS') OR CALL apoc.meta.stats()
12
Graph, schema, and counts overview
Prefix before the query, produces query plans EXPLAIN - does not execute the query No db hits, only rough estimate row counts PROFILE - executes the query Includes valuable db hit and row counts
13
EXPLAIN and PROFILE
14
How to walk the tree
1. Start node: from result go up always turning left until leaf node 2. Execute down until we reach a branch 3. We have the lhs, now need rhs 4. Go up one on rhs, then repeat from 1
1. Read from bottom-up 2. Start node: from result go down always turning left until leaf node 3. Follow until we reach a branch 4. We have lhs, need rhs 5. Go down one on rhs, repeat from 1
15
Cypher-shell plans - inverted tree
DB hits
Abstract unit of db work Not comparable between operations Not always something that can be
- ptimized
Smoke, let it draw your attention to...
16
Rows
Inputs to operations Multiplies out db hits The more rows, the more work Watch for spikes: 44 -> 200 -> 38
17
Key principles of Cypher execution
18
Row Cardinality
19
- Operators produce result records/rows
- Operators execute for each input record/row
Implications:
- Reduce rows across the query, reduce total work
- MATCHes are not independent
A simple demonstration
20
PROFILE MATCH (m:Movie {title:'The Matrix'}) MATCH (dir:Person)-[:DIRECTED]->(m) MATCH (actor:Person)-[:ACTED_IN]->(m) RETURN m.title as title, collect(DISTINCT actor.name) as actors, collect(DISTINCT dir.name) as directors
Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 141 total db hits in 1 ms.
MATCH (p:Person)-[:ACTED_IN]->(m) MATCH (p:Person)-[:DIRECTED]->(m)
21
MATCHes are not independent
MATCH (m:Movie {title:’The Matrix’}) RETURN m.title as title, collect(DISTINCT p.name) as actors, collect(DISTINCT dir.name) as directors
2 rows 5 rows
MATCH (p:Person)-[:DIRECTED]->(m) MATCH (p:Person)-[:ACTED_IN]->(m)
22
What actually happens (best case)
MATCH (m:Movie {title:’The Matrix’}) RETURN m.title as title, collect(DISTINCT p.name) as actors, collect(DISTINCT dir.name) as directors
2 rows 10 rows (2 * 5) executed 2 times executed 1 time
MATCH (p:Person)-[:DIRECTED]->(m) MATCH (p:Person)-[:ACTED_IN]->(m)
23
It could be worse
MATCH (m:Movie {title:’The Matrix’}) RETURN m.title as title, collect(DISTINCT p.name) as actors, collect(DISTINCT dir.name) as directors
10 rows (5 * 2) 5 rows executed 1 times executed 5 times
Same exact plan
24
PROFILE MATCH (dir:Person)-[:DIRECTED]-> (m:Movie {title:'The Matrix'}) <-[:ACTED_IN]-(actor:Person) RETURN m.title as title, collect(actor.name) as actors, collect(dir.name) as directors
Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 141 total db hits in 1 ms.
- MATCHes and OPTIONAL MATCHes
(watch for back-to-back, even WITH-separated)
- UNWIND operations
- Procedure results
(only if they YIELD something)
25
What increases cardinality?
- Aggregation
(the grouping key will become distinct)
- WITH DISTINCT …
(applies to the entire row, not just a single variable)
- LIMIT
(in special cases) (limits ALL rows, not results per row)
- (WITH on its own does not shrink cardinality)
26
How do we shrink cardinality?
How do we fix this?
27
PROFILE MATCH (m:Movie {title:'The Matrix'}) MATCH (dir:Person)-[:DIRECTED]->(m) MATCH (actor:Person)-[:ACTED_IN]->(m) RETURN m.title as title, collect(actor.name) as actors, collect(dir.name) as directors these are happening too late...
Aggregate earlier
28
PROFILE MATCH (m:Movie {title:'The Matrix'}) MATCH (dir:Person)-[:DIRECTED]->(m) WITH m, collect(dir.name) as directors MATCH (actor:Person)-[:ACTED_IN]->(m) WITH m, directors, collect(actor.name) as actors RETURN m.title as title, directors, actors
Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 101 total db hits in 1 ms. (vs 134)
Or use pattern comprehension
29
PROFILE MATCH (m:Movie {title:'The Matrix'}) RETURN m.title,
[(dir:Person)-[:DIRECTED]->(m) | dir] as directors, [(actor:Person)-[:ACTED_IN]->(m) | actor] as actors It’s like an (optional) MATCH and collect
Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 94 total db hits in 1 ms.
How do you find your starting node in the graph?
- NodeByLabelScan - Visit every node of a label
If there are property predicates, filter A label must be present in the pattern
PROFILE MATCH (m:Movie {title:'The Matrix'}) RETURN m
30
The cost of things...starting node lookup
- AllNodesScan - Visit every node in your graph
If there are property predicates, filter Used when no labels are present
PROFILE MATCH (m {title:'The Matrix'}) RETURN m
31
The most expensive, the one to avoid
- NodeIndexSeek - Lookup via index
Used when a label and at least 1 property present An index must have been created for the label/property combo: CREATE INDEX ON :Movie(title); Variations! WHERE m.title STARTS WITH ‘THE MATRIX’ WHERE m.title IN [‘The Matrix’, ‘Toy Story 4’]
PROFILE MATCH (m:Movie {title:'The Matrix'}) RETURN m
32
The fast, preferred lookup
- NodeByIdSeek - Lookup via graph id
You must know the graph id ahead of time! Not good to cache outside of Neo4j What if the node was deleted? Also ids of deleted things get reused, so it may be different Could be useful in a multi-query transaction, or within a query
PROFILE MATCH (m) WHERE id(m) = 6 RETURN m
33
The fastest lookup (that you can rarely use)
- Procedures for fulltext lookup
Named indexes Can index over multiple labels, multiple properties Can create on relationships too! Case insensitive searching, wildcard searching… Can even use custom lucene analyzers
PROFILE CALL db.index.fulltext.queryNodes("titlesAndDescriptions", "matrix") YIELD node, score RETURN node.title, node.description, score
34
Fulltext schema index
- Properties (most expensive)
WHERE m.released = 1999
- Node labels
WHERE actor:Person
- Relationship types
WHERE type(r) = ‘ACTED_IN’
- Relationship type selection
- [:ACTED_IN | DIRECTED]->
35
The costs of access and filtering
... WITH DISTINCT m WHERE m.released = 1999 …
36
Delay access until the node is distinct
… WITH m, collect(actor) as actors WHERE m.released = 1999 ...
... WITH m, actors WHERE size(actors) > 5 WITH m.released as released, actors ...
37
Wait until filter / LIMIT / fewer rows
… WITH m LIMIT 10 WITH m.released as released ...
… WITH m, m.title as title WHERE title IN $listOfTitles OR title CONTAINS ‘Matrix OR size(title) > 10 ...
38
If need to use multiple times, project out
… WITH m, m.title as title MATCH (m)-[:ACTED_IN]-(actor) RETURN title, actor.name as actor ...
39
It is sometimes more efficient to project before an expansion
… WITH m MATCH (m)-[:ACTED_IN]-(actor:Person) WITH m.id as id, collect(actor) as actors // NO ...
40
If aggregating by a unique property, aggregate by the node instead
… WITH m MATCH (m)-[:ACTED_IN]-(actor:Person) WITH m, collect(actor) as actors WITH m.id as id, actors // YES ...
Move ORDER BY and LIMIT up, when possible
MATCH (m:Movie)<-[:ACTED_IN]-(actor) WITH m, collect(actor) as actors RETURN m.title as title, actors ORDER BY m.released DESC LIMIT 10
297 total db hits 41
MATCH (m:Movie) WITH m ORDER BY m.released DESC LIMIT 10 MATCH (m)<-[:ACTED_IN]-(actor) WITH m, collect(actor) as actors RETURN m.title as title, actors
138 total db hits
MATCH (p:Person) RETURN p ORDER BY p.name
42
Index-backed ORDER BY
MATCH (p:Person) WITH p WHERE p.name > '' RETURN p ORDER BY p.name
requires order by indexed property possibility of lookup by indexed property type hint
MATCH (m:Movie)<-[:ACTED_IN]-(a) RETURN m, count(a) as degree ORDER BY degree DESC LIMIT 3
43
Avoid expansions with degrees
MATCH (m:Movie) RETURN m, size((m)<-[:ACTED_IN]-()) as degree ORDER BY degree DESC LIMIT 3
Aim to be lazy
- Leverage node indexes for starting nodes, when possible
Use EXPLAIN to double-check
- Mind the cardinality
Watch for row spikes, check if it makes sense vs counts Get those rows down! Aggregate early, use pattern comprehensions
- Be frugal with property access!
Strike when cardinality is low
44
45
But what about practice?
Hunger Games Questions for "Tuning Cypher"
1. Easy: Which operation for looking up a starting node is typically fastest?
a. NodeByDirectReference b. NodeIndexSeek c. NodeByPrimaryKey
2. Medium: What is the name of the feature that lets us expand a pattern, apply a WHERE clause, then project the results into a list? 3. Hard: If aggregating with respect to a non-unique node property, you should:
a. Aggregate with respect to the node itself instead of its property, it’s more efficient b. Not aggregate with respect to the node itself, as that may give incorrect results c. Aggregate with respect to the node’s label instead, since label access is cheaper
Answer here: r.neo4j.com/hunger-games