Graph Algorithms for Community Detection & Recommendations
Mark Needham & Amy Hodler, Neo4j
Mark Needham
@markhneedham
Neo4j Labs Engineer
Amy Hodler
@amyhodler
Analytics & AI Programs
Investigating the Graph Community
Query (e.g. Cypher/Python): Local Patterns
Real-time, local decisioning and pattern matching
You know what you’re looking for and are making a decision
Graph Algorithms Libraries: Global Computation
Global analysis and iterations
You’re learning the overall structure of a network, updating data, and predicting
well-organized data
Requires Understanding Relationships and Structures
Flow & Dynamics
Interactions & Resiliency
Propagation Pathways
Explore, Plan, Measure
Find significant patterns and plan for optimal structures
Score outcomes and set a threshold value for a prediction
Machine Learning
Use the measures as features to train an ML model
1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
1 | 2 | 4 | 15 | 1
3 | 4 | 7 | 12 | 1
5 | 6 | 1 | 1 |
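The feature columns in a table like this can be derived from the graph itself. The sketch below is a minimal illustration in plain Python, not the Neo4j library's implementation: the edge list is invented, and the helper names (`build_adjacency`, `common_neighbors`, `preferential_attachment`) are hypothetical. It scores every pair of nodes that is not yet connected.

```python
from itertools import combinations

# Hypothetical undirected edges, invented for illustration only.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]


def build_adjacency(edge_list):
    """Map each node to the set of its direct neighbors."""
    adjacency = {}
    for u, v in edge_list:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def common_neighbors(adjacency, u, v):
    """Number of nodes adjacent to both u and v."""
    return len(adjacency[u] & adjacency[v])


def preferential_attachment(adjacency, u, v):
    """Product of the two node degrees."""
    return len(adjacency[u]) * len(adjacency[v])


adjacency = build_adjacency(edges)

# One feature row per pair of nodes that does not yet share a relationship.
feature_rows = [
    (u, v, common_neighbors(adjacency, u, v), preferential_attachment(adjacency, u, v))
    for u, v in combinations(sorted(adjacency), 2)
    if v not in adjacency[u]
]

for row in feature_rows:
    print(row)
```

Rows like these, joined with a label column saying whether the relationship later appeared, are the kind of input an ML model could be trained on.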
neo4j.com/graph-algorithms-book/
Pathfinding & Search: Finds optimal paths or evaluates route availability and quality
Centrality / Importance: Determines the importance of distinct nodes in the network
Community Detection: Detects group clustering or partition
Similarity: Evaluates how alike nodes are
Link Prediction: Estimates the likelihood of nodes forming a future relationship
DFS
Wasserman & Faust
Multi-Step
Pathfinding & Search, Centrality / Importance, Community Detection, Similarity, Link Prediction
neo4j.com/docs/graph-algorithms/current/
Updated April 2019
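Call as a Cypher procedure, passing the node label, relationship type, and configuration.
The .stream variant returns the results directly:
CALL algo.<name>.stream('Label','TYPE',{conf}) YIELD nodeId, score
The non-stream variant writes results to the graph and returns statistics:
CALL algo.<name>('Label','TYPE',{conf})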
Pass in Cypher statements for the node and relationship lists.
CALL algo.<name>(
  'MATCH ... RETURN id(n) AS id',
  'MATCH (n)-->(m) RETURN id(n) AS source, id(m) AS target',
  {graph:'cypher'})
Russian Twitter Trolls
https://www.nbcnews.com/pages/author/ben-popken
Inferred Relationships AMPLIFIED
CALL algo.pageRank(
  "MATCH (t:Troll) RETURN id(t) AS id",
  "MATCH (r1:Troll)-[:POSTED]->(:Tweet)<-[:RETWEETED]-(:Tweet)<-[:POSTED]-(r2:Troll)
   RETURN id(r2) AS source, id(r1) AS target",
  {graph:'cypher'})
PageRank on Inferred AMPLIFIED Graph
https://www.nbcnews.com/tech/social-media/russian-trolls-went-attack-during-key-election-moments-n827176
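To make the inferred relationship concrete, here is a minimal plain-Python sketch of what the relationship query above computes: troll r2 amplifies troll r1 when r2 posts a tweet that retweets one of r1's tweets. The sample data and the function name `infer_amplified` are hypothetical and invented for illustration; they are not taken from the Russian Twitter Trolls dataset.

```python
# Hypothetical sample data, invented for illustration only.
posted = {  # troll -> tweets that troll POSTED
    "troll_a": ["t1"],
    "troll_b": ["t2", "t3"],
    "troll_c": ["t4"],
}
retweeted = {  # retweet -> the original tweet it RETWEETED
    "t2": "t1",
    "t4": "t1",
    "t3": "t4",
}


def infer_amplified(posted, retweeted):
    """Return sorted (source, target) pairs: source retweeted a tweet posted by target."""
    author_of = {tweet: troll for troll, tweets in posted.items() for tweet in tweets}
    amplified = set()
    for retweet, original in retweeted.items():
        source = author_of.get(retweet)    # r2: the troll who posted the retweet
        target = author_of.get(original)   # r1: the troll who posted the original
        if source is not None and target is not None:
            amplified.add((source, target))
    return sorted(amplified)


amplified_edges = infer_amplified(posted, retweeted)
print(amplified_edges)
```

The resulting pairs play the role of the `source` and `target` columns returned by the Cypher relationship statement, so an algorithm such as PageRank runs on a graph that is never stored in the database.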
Components: Neo4j, Graph Loader, In-Memory Projected Graph, Procedures
1) Read projected graph
2) Load projected graph
3) Execute algorithm
4) Store results
Everything is concurrent
○ Community Edition restricted to 4 cores!
○ Need enough heap to fit projected graph in memory
○ Memory requirements vary per algorithm
○ Do not run graph algos on core members
○ Streaming method only for read replicas
○ Consider snapshot
Determines the importance of distinct nodes in the network
Developed for distinct uses or types
Tip / Caution
This is the simplest of the centrality algorithms. It can measure in-degree, out-degree, or both.
When globally averaged, it can be skewed by supernodes.
Other algorithms are better for determining influence over more than just direct neighbors.
Measures the number of direct relationships
In-Degree
Use When
Understanding immediate connectedness or direct influence
Popularity & Gregariousness
Quick estimation of network densities such as min/max and mean degrees
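As a rough illustration of degree centrality, the sketch below counts in-degree and out-degree from a directed edge list and then derives the min/max/mean summary mentioned above. It is plain Python rather than the Neo4j procedure, and the FOLLOWS data and the function name `degree_centrality` are hypothetical.

```python
from collections import Counter

# Hypothetical directed FOLLOWS edges as (follower, followed) pairs.
follows = [("ann", "bob"), ("cat", "bob"), ("dan", "bob"), ("bob", "ann"), ("cat", "dan")]


def degree_centrality(edge_list):
    """Return in-degree, out-degree and total degree for every node."""
    in_degree, out_degree = Counter(), Counter()
    for source, target in edge_list:
        out_degree[source] += 1
        in_degree[target] += 1
    nodes = sorted(set(in_degree) | set(out_degree))
    return {
        node: {
            "in": in_degree[node],
            "out": out_degree[node],
            "total": in_degree[node] + out_degree[node],
        }
        for node in nodes
    }


scores = degree_centrality(follows)

# Quick density-style summary over total degree.
totals = [entry["total"] for entry in scores.values()]
summary = {"min": min(totals), "max": max(totals), "mean": sum(totals) / len(totals)}

print(scores)
print(summary)
```

In this toy data "bob" has the highest in-degree (popularity) while "cat" has the highest out-degree (gregariousness), which is why the direction being measured matters.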
Likelihood of Flu
Individual probabilities
Tip / Caution
Test your dampening factor, as it will change results (the default works well for power-law distributions).
Spark uses an inverse dampening factor: resetProbability=0.15 is equal to dampingFactor:0.85 in Neo4j and other libraries.
Be careful when mixing node types.
Measures the transitive (directional) influence of nodes and considers the influence of neighbors and their neighbors
Formula terms: the node being ranked (“u”), the nodes linking to “u”, the dampening factor, and the out-degree of each linking node
CALL algo.pageRank('Page', 'LINKS', {iterations:20, dampingFactor:0.85, sourceNodes: [siteA]})
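The iteration behind a call like this can be sketched in a few lines. The version below is a textbook-style power iteration written in plain Python under stated assumptions, not the Neo4j implementation: ranks are normalized to sum to 1, dangling nodes hand their rank back to the restart distribution, and `source_nodes` restricts the restart to a chosen set (the personalized variant). The function name, parameter names, and the `site_links` data are hypothetical.

```python
def pagerank(links, damping_factor=0.85, iterations=20, source_nodes=None):
    """Power-iteration PageRank over {node: [nodes it links to]}.

    Assumption: ranks are kept normalized (they sum to 1). When source_nodes
    is given, the random restart jumps only to those nodes.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    restart_set = list(source_nodes) if source_nodes else nodes
    restart = {n: (1.0 / len(restart_set) if n in restart_set else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        # Restart share; 1 - damping_factor corresponds to Spark's resetProbability.
        new_rank = {n: (1.0 - damping_factor) * restart[n] for n in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Each linking node passes on its rank divided by its out-degree.
                share = damping_factor * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: hand its rank back to the restart distribution.
                for n in nodes:
                    new_rank[n] += damping_factor * rank[node] * restart[n]
        rank = new_rank
    return rank


# Hypothetical site structure, invented for illustration only.
site_links = {
    "home": ["about", "product"],
    "about": ["home"],
    "product": ["home"],
    "blog": ["home"],
}

ranks = pagerank(site_links)
personalized = pagerank(site_links, source_nodes=["about"])

print(sorted(ranks.items(), key=lambda item: -item[1]))
print(sorted(personalized.items(), key=lambda item: -item[1]))
```

In the personalized run, "blog" ends with a rank of zero because nothing links to it and the restart never lands there, which shows how the choice of source nodes changes the ranking.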
Fraud Detection: Feature engineering for machine learning
Use When
Anytime you’re looking for broad influence over a network
Many domain-specific variations for differing analysis, e.g. Personalized PageRank for personalized recommendations
Recommendations: Who To Follow with personalized PR
Tip / Caution
Computationally intensive: use the RA Brandes approximation on large graphs.
Assumes all communication between nodes happens along the shortest path and with the same frequency (not always the case in real life).
The sum of the % shortest paths that pass through a node, calculated by pairs
1. For a node, find the shortest paths that go through it
2. For each shortest path in step one, calculate its percentage of the total possible shortest paths for that pair
3. Add together all the values in step two; this is a node’s Betweenness Centrality score
4. Repeat for each node
Node D Calculation (example graph with nodes A, B, C, D, E)
Pairs with Shortest Paths Through D | Total Possible Shortest Paths for that Pair | % of Total Through D (1/Total)
A,E | 1 | 1
B,E | 1 | 1
C,E | 1 | 1
B,C | 2 (through D & A) | 0.5
Betweenness Score for D: 3.5
Betweenness Score for A: 0.5
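The four steps can be checked with a small brute-force sketch in plain Python. The edge list is an assumption: the slide's diagram did not survive extraction, so the edges below were reconstructed to be consistent with the Node D table (B and C are each adjacent to both A and D, and E hangs off D). The function names are hypothetical, and this is the direct pairwise calculation, not the RA Brandes approximation.

```python
from collections import deque
from itertools import combinations

# Assumed undirected edges, reconstructed to match the Node D table.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "D"), ("C", "D"), ("D", "E")]


def build_adjacency(edge_list):
    """Map each node to the set of its direct neighbors."""
    adjacency = {}
    for u, v in edge_list:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def shortest_path_counts(adjacency, start):
    """BFS from start: distance to, and number of shortest paths to, each node."""
    distance = {start: 0}
    count = {start: 1}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                count[neighbor] = 0
                queue.append(neighbor)
            if distance[neighbor] == distance[node] + 1:
                count[neighbor] += count[node]
    return distance, count


def betweenness(adjacency):
    """For each node, sum the fraction of each pair's shortest paths passing through it."""
    info = {node: shortest_path_counts(adjacency, node) for node in adjacency}
    scores = {node: 0.0 for node in adjacency}
    for s, t in combinations(sorted(adjacency), 2):
        dist_s, count_s = info[s]
        if t not in dist_s:
            continue  # pair is not connected
        for v in adjacency:
            if v in (s, t) or v not in dist_s:
                continue
            dist_v, count_v = info[v]
            # v lies on a shortest s-t path when the two distances add up exactly.
            if dist_s[v] + dist_v[t] == dist_s[t]:
                scores[v] += count_s[v] * count_v[t] / count_s[t]
    return scores


scores = betweenness(build_adjacency(edges))
print(scores)
```

With the assumed edges, D scores 3.5 and A scores 0.5, matching the table; the other nodes sit on no shortest path between other pairs and score 0.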
Use When
Identify bridges
Uncover control points
Find bottlenecks and vulnerabilities
Network Resilience
Key points of cascading failure
Evaluates how a group is clustered
Different approaches to define a community
Continually maximizes the modularity by comparing relationship weights and densities to an estimate/average
Tip / Caution
ALL Modularity algorithms:
○ Can merge smaller communities into larger ones
○ Can plateau with similar modularity on several partitions, forming local maxima & stalling progress
○ Test/validate results
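The quantity these algorithms try to maximize can be computed directly. The sketch below is not Louvain itself; it only evaluates the standard modularity score of a given partition on an unweighted, undirected graph, which is the comparison of actual relationship density inside communities against the expected average. The graph, the partitions, and the function name `modularity` are hypothetical.

```python
def modularity(edge_list, community_of):
    """Modularity Q of a partition of an unweighted, undirected graph.

    Q = sum over communities of (internal edges / m) - (degree sum / 2m)^2,
    where m is the total number of edges.
    """
    m = len(edge_list)
    internal_edges = {}
    degree_sum = {}
    for u, v in edge_list:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal_edges[cu] = internal_edges.get(cu, 0) + 1
    return sum(
        internal_edges.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Hypothetical graph: two triangles joined by a single bridge (2-3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

two_groups = {0: "left", 1: "left", 2: "left", 3: "right", 4: "right", 5: "right"}
one_group = {node: "all" for node in range(6)}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)

print(round(q_two, 4), round(q_one, 4))
```

Splitting at the bridge scores higher than lumping everything together, which is the kind of comparison a modularity-based method repeats as it merges and moves nodes.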
Use When
Community detection in large networks
Uncover hierarchical structures in data
Evaluate different grouping thresholds
Understanding the Brain
Mapping hierarchy of functions
Detecting Fraud Rings
Thresholds for bad apples vs rings
Degree Centrality to get a message directly in front of the most people with a single, immediate offer
PageRank to use an influencer with the broadest reach to all communities for a greater ripple effect
Communities for use of different promoted tweets &
Betweenness Centrality + Louvain to find the person who can best connect to a specific sub-community
Applied Graph Algorithms: https://neo4j.com/graphacademy/online-training/applied-graph-algorithms
Data Science With Neo4j: https://neo4j.com/graphacademy/online-training/data-science
"Graph Algorithms for Community Detection & Recommendations"
1. Easy: Which of these algorithms is a community detection algorithm?
a. PageRank
b. Common Neighbors
c. Louvain Modularity
2. Medium: What is NOT a good use for a Graph Projection?
a. Infer Relationships
b. Shape a graph for your algorithm to run on
c. Save a new subgraph
3. Hard: The default dampening factor for the PageRank algorithm is 0.85. This default works well for graphs that have what distribution of relationships?
a. Power Law
b. Poisson
c. Normal
Answer here: r.neo4j.com/hunger-games