Graph Algorithms for Community Detection & Recommendations



slide-1
SLIDE 1

Graph Algorithms for Community Detection & Recommendations

Mark Needham & Amy Hodler, Neo4j

slide-2
SLIDE 2

Mark Needham

@markhneedham


Neo4j Labs Engineer

Amy Hodler

@amyhodler

Analytics & AI Programs

slide-3
SLIDE 3

Graph Algorithms for Community Detection & Recommendations

Investigating the Graph Community

  • Graph Algorithms
  • Neo4j Social Network
  • Finding Influencers
  • Identifying Communities
slide-4
SLIDE 4

What are Graph Analytics and Algorithms?


slide-5
SLIDE 5

Local Patterns

Query (e.g. Cypher/Python)

Real-time, local decisioning and pattern matching

You know what you’re looking for and making a decision

Global Computation

Graph Algorithms Libraries

Global analysis and iterations

You’re learning the overall structure of a network, updating data, and predicting

slide-6
SLIDE 6

Don’t Need Graph Algorithms to Answer . . .

  • Questions with just a few connections or flat (not nested)
  • Questions solved with specific, well-crafted queries
  • Simple statistical results (sums, averages, ratios)
  • Example:
      ○ Regular reporting based on defined criteria and well-organized data

slide-7
SLIDE 7

What Do People Do with Graph Algorithms?


slide-8
SLIDE 8

Understand & Predict Complex Behavior

Requires Understanding Relationships and Structures

  • Flow & Dynamics
  • Interactions & Resiliency
  • Propagation Pathways

slide-9
SLIDE 9

Using Graph Algorithms

Explore, Plan, Measure

Find significant patterns and plan for optimal structures

Score outcomes and set a threshold value for a prediction

Machine Learning

Use the measures as features to train an ML model

1st Node   2nd Node   Common Neighbors   Preferential Attachment   label
1          2          4                  15                        1
3          4          7                  12                        1
5          6          1                  1
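A minimal sketch of how the two feature columns above could be computed for candidate node pairs. The toy graph, the `future_edges` set used for labels, and all function names are assumptions made for illustration; they do not come from the deck or from the Neo4j library.

```python
from itertools import combinations

# Hypothetical undirected graph stored as adjacency sets.
GRAPH = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
    "e": set(),
}


def common_neighbors(graph, u, v):
    """Number of nodes adjacent to both u and v."""
    return len(graph[u] & graph[v])


def preferential_attachment(graph, u, v):
    """Product of the degrees of u and v."""
    return len(graph[u]) * len(graph[v])


def feature_rows(graph, future_edges):
    """Build one row per currently unlinked pair.

    Each row is (u, v, common neighbors, preferential attachment, label),
    where label is 1 if the pair appears in the assumed future_edges set.
    """
    rows = []
    for u, v in combinations(sorted(graph), 2):
        if v in graph[u]:
            continue  # already linked, so not a prediction candidate
        label = 1 if frozenset((u, v)) in future_edges else 0
        rows.append((u, v, common_neighbors(graph, u, v),
                     preferential_attachment(graph, u, v), label))
    return rows


if __name__ == "__main__":
    later_links = {frozenset(("b", "d"))}
    for row in feature_rows(GRAPH, later_links):
        print(row)
```

Rows like these can be thresholded directly on a score, or passed as features to whichever ML model is being trained.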

slide-10
SLIDE 10


Neo4j Graph Algorithms Library

slide-11
SLIDE 11

Graph & ML Algorithms in Neo4j

+45

neo4j.com/graph-algorithms-book/

Pathfinding & Search: Finds optimal paths or evaluates route availability and quality

Centrality / Importance: Determines the importance of distinct nodes in the network

Community Detection: Detects group clustering or partition options

Similarity: Evaluates how alike nodes are

Link Prediction: Estimates the likelihood of nodes forming a future relationship

slide-12
SLIDE 12

Graph and ML Algorithms in Neo4j

Pathfinding & Search

  • Parallel Breadth First Search & DFS
  • Shortest Path
  • Single-Source Shortest Path
  • All Pairs Shortest Path
  • Minimum Spanning Tree
  • A* Shortest Path
  • Yen’s K Shortest Path
  • K-Spanning Tree (MST)
  • Random Walk

Centrality / Importance

  • Degree Centrality
  • Closeness Centrality
  • CC Variations: Harmonic, Dangalchev, Wasserman & Faust
  • Betweenness Centrality
  • Approximate Betweenness Centrality
  • PageRank
  • Personalized PageRank
  • ArticleRank
  • Eigenvector Centrality

Community Detection

  • Triangle Count
  • Clustering Coefficients
  • Connected Components (Union Find)
  • Strongly Connected Components
  • Label Propagation
  • Louvain Modularity - 1 Step & Multi-Step
  • Balanced Triad (identification)

Similarity

  • Euclidean Distance
  • Cosine Similarity
  • Jaccard Similarity
  • Overlap Similarity
  • Pearson Similarity

Link Prediction

  • Adamic Adar
  • Common Neighbors
  • Preferential Attachment
  • Resource Allocation
  • Same Community
  • Total Neighbors

neo4j.com/docs/graph-algorithms/current/

Updated April 2019
slide-13
SLIDE 13

How To…

  • 1. Call as Cypher procedure
  • 2. Pass in specification (Label, Prop, Query) and configuration
  • 3. stream variant returns (a lot of) results

CALL algo.<name>.stream('Label','TYPE',{conf}) YIELD nodeId, score

  • 4. non-stream variant writes results to the graph and returns statistics

CALL algo.<name>('Label','TYPE',{conf})

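To make the two call shapes concrete, here is a small sketch that assembles the procedure call text for the stream and non-stream variants. `build_algo_call` and `cypher_literal` are hypothetical helpers written for this illustration, not part of Neo4j or its drivers, and the escaping is deliberately simplistic; a real application would normally send values as query parameters through a driver.

```python
def cypher_literal(value):
    """Render a Python value as a Cypher literal (illustrative, not a full escaper)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"
    raise TypeError(f"unsupported config value: {value!r}")


def build_algo_call(name, label, rel_type, config=None, stream=False,
                    yields=("nodeId", "score")):
    """Return the text of a CALL algo.<name> statement.

    stream=True appends '.stream' and a YIELD clause; otherwise the
    procedure writes results back and returns statistics.
    """
    config = config or {}
    conf = "{" + ", ".join(f"{key}:{cypher_literal(val)}"
                           for key, val in config.items()) + "}"
    procedure = f"algo.{name}.stream" if stream else f"algo.{name}"
    call = (f"CALL {procedure}({cypher_literal(label)},"
            f"{cypher_literal(rel_type)},{conf})")
    if stream:
        call += " YIELD " + ", ".join(yields)
    return call


if __name__ == "__main__":
    print(build_algo_call("pageRank", "Page", "LINKS",
                          {"iterations": 20, "dampingFactor": 0.85}, stream=True))
    print(build_algo_call("pageRank", "Page", "LINKS", {"iterations": 20}))
```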

slide-14
SLIDE 14

Cypher Projection

Pass in Cypher statements for node- and relationship-lists.

CALL algo.<name>(
  'MATCH ... RETURN id(n)',
  'MATCH (n)-->(m) RETURN id(n) as source, id(m) as target',
  {graph:'cypher'})

slide-15
SLIDE 15


Cypher Projection Example

Russian Twitter Trolls

https://www.nbcnews.com/pages/author/ben-popken

slide-16
SLIDE 16

Inferred Relationships AMPLIFIED

slide-17
SLIDE 17

PageRank on Inferred AMPLIFIED Graph

CALL algo.pageRank(
  "MATCH (t:Troll) RETURN id(t) AS id",
  "MATCH (r1:Troll)-[:POSTED]->(:Tweet)<-[:RETWEETED]-(:Tweet)<-[:POSTED]-(r2:Troll)
   RETURN id(r2) as source, id(r1) as target",
  {graph:'cypher'})

https://www.nbcnews.com/tech/social-media/russian-trolls-went-attack-during-key-election-moments-n827176
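The relationship query above infers a direct troll-to-troll edge from a two-hop pattern through tweets. As a rough sketch of that inference outside the database, the snippet below derives the same kind of (source, target) pairs from plain Python data. The sample tweets, account names, and the `infer_amplified` function are made up for illustration.

```python
# tweet id -> account that posted it (hypothetical sample data)
POSTED = {
    "tweet1": "alice",
    "tweet2": "bob",
    "tweet3": "carol",
    "tweet4": "bob",
}

# (retweeting tweet, original tweet) pairs, mirroring the RETWEETED relationship
RETWEETED = [
    ("tweet2", "tweet1"),
    ("tweet3", "tweet1"),
    ("tweet4", "tweet3"),
]


def infer_amplified(posted, retweeted):
    """Return (source, target) pairs: source retweeted something target posted.

    One pair is produced per matching retweet, so repeated amplification
    between the same two accounts yields repeated pairs.
    """
    edges = []
    for retweet, original in retweeted:
        source = posted.get(retweet)
        target = posted.get(original)
        if source is not None and target is not None:
            edges.append((source, target))
    return edges


if __name__ == "__main__":
    for source, target in infer_amplified(POSTED, RETWEETED):
        print(f"{source} AMPLIFIED {target}")
```

The resulting edge list is the shape of graph an algorithm such as PageRank would then run on.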

slide-18
SLIDE 18

How does it work?

Diagram labels: Procedures, Neo4j, Graph Loader, In Memory Projected Graph

Everything is concurrent

1) Read projected graph
2) Load projected graph
3) Execute algorithm
4) Store results

slide-19
SLIDE 19

Architecture Considerations

  • Parallelization - everything, leverage lots of CPUs
      ○ Community Edition restricted to 4 cores!
  • Memory
      ○ Need enough heap to fit projected graph in memory
      ○ Memory requirements vary per algorithm
  • Causal Clusters
      ○ Do not run graph algos on core members
      ○ Streaming method only for read replicas
      ○ Consider snapshot

slide-20
SLIDE 20

Enter the NEuler


slide-21
SLIDE 21

install.graphapp.io


slide-22
SLIDE 22

Investigating the Neo4j Social Graph


slide-23
SLIDE 23

Neo4j Twitter Graph

slide-24
SLIDE 24

Twint: Twitter scraping tool

slide-25
SLIDE 25

Neo4j Twitter Graph

slide-26
SLIDE 26

Centrality Algorithms

Determines the importance of distinct nodes in the network

Developed for distinct uses or types of importance.

slide-27
SLIDE 27
slide-28
SLIDE 28

Degree Centrality

slide-29
SLIDE 29

Degree Centrality

Measures the number of direct relationships

In-Degree

Tip / Caution

This is the simplest of the centrality algorithms. Can measure in-degree, out-degree, or both. When globally averaged, it can be skewed by supernodes. Other algorithms are better for determining influence over more than just direct neighbors.
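As a small illustration of in-degree, out-degree, and combined degree, the sketch below counts direct relationships from a directed edge list. The FOLLOWS sample data and the function name are assumptions for this example only.

```python
from collections import Counter

# Hypothetical directed "follows" relationships: (follower, followed)
FOLLOWS = [
    ("ann", "bob"),
    ("cat", "bob"),
    ("dan", "bob"),
    ("bob", "ann"),
    ("dan", "cat"),
]


def degree_centrality(edges):
    """Count incoming, outgoing, and total direct relationships per node."""
    in_degree = Counter()
    out_degree = Counter()
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
    nodes = set(in_degree) | set(out_degree)
    return {
        node: {
            "in": in_degree[node],
            "out": out_degree[node],
            "both": in_degree[node] + out_degree[node],
        }
        for node in nodes
    }


if __name__ == "__main__":
    scores = degree_centrality(FOLLOWS)
    for node in sorted(scores, key=lambda n: scores[n]["in"], reverse=True):
        print(node, scores[node])
```

In this toy data a single account holds three of the five incoming relationships, which is the kind of supernode that can skew a global average.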

slide-30
SLIDE 30

Degree Centrality - Uses

Use When

  • Understanding immediate connectedness or direct influence
  • Popularity & Gregariousness
  • Quick estimation of network densities such as min/max and mean degrees

Likelihood of Flu

Individual probabilities

slide-31
SLIDE 31


Degree Centrality

slide-32
SLIDE 32

PageRank

slide-33
SLIDE 33

PageRank

Measures the transitive (directional) influence of nodes and considers the influence of neighbors and their neighbors

Tip / Caution

  • Test your damping factor, as it will change results (the default works well for power law distributions).
  • Spark uses an inverse damping factor: resetProbability=0.15 is equal to dampingFactor:0.85 in Neo4j and other libraries.
  • Be careful when mixing node types.

slide-34
SLIDE 34

PageRank Calculation

Formula labels: Node Being Ranked; Nodes Linking To -> “u”; Damping Factor; Outdegree of that Node

Personalized

CALL algo.pageRank('Page', 'LINKS', {iterations:20, dampingFactor:0.85, sourceNodes: [siteA]})
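The formula image is not part of this transcript, so the following is only a sketch of one common, normalized formulation of PageRank by power iteration, not necessarily the exact variant or score scaling Neo4j uses. The `source_nodes` argument mimics the idea behind `sourceNodes` by restricting the restart distribution to the given nodes, which are assumed to exist in the graph. The edge list and function name are illustrative.

```python
# Hypothetical directed links: (source page, target page)
EDGES = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]


def pagerank(edges, damping=0.85, iterations=20, source_nodes=None):
    """Power-iteration PageRank whose scores sum to 1.

    damping is the probability of following a link; 1 - damping is the
    restart probability (what Spark calls resetProbability).
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    # Restart distribution: uniform, or only the source nodes (personalized).
    restart = list(source_nodes) if source_nodes else nodes
    teleport = {n: (1.0 / len(restart) if n in restart else 0.0) for n in nodes}

    rank = dict(teleport)
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is redistributed via restart.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1.0 - damping + damping * dangling) * teleport[n]
                    for n in nodes}
        for node in nodes:
            for target in out_links[node]:
                new_rank[target] += damping * rank[node] / len(out_links[node])
        rank = new_rank
    return rank


if __name__ == "__main__":
    print(pagerank(EDGES))
    print(pagerank(EDGES, source_nodes=["d"]))
```

Changing `damping` and comparing the rankings is a quick way to see how sensitive the results are to that setting.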

slide-35
SLIDE 35

PageRank - Uses

Use When

  • Anytime you’re looking for broad influence over a network
  • Many domain specific variations for differing analysis, e.g. Personalized PageRank for personalized recommendations

Recommendations

Who To Follow with personalized PR

Fraud Detection

Feature engineering for machine learning

slide-36
SLIDE 36


PageRank

slide-37
SLIDE 37

Betweenness Centrality

slide-38
SLIDE 38

Betweenness Centrality

The sum of the percentage of shortest paths that pass through a node, calculated by pairs

Tip / Caution

  • Computationally intensive: use the RA-Brandes approximation on large graphs.
  • Assumes all communication between nodes happens along the shortest path and with the same frequency (not always the case in real life).

slide-39
SLIDE 39

Betweenness Centrality

1. For a node, find the shortest paths that go through it

  • B, C, E have no shortest paths and are assigned 0 value

2. For each shortest path in step one, calculate its percentage of the total possible shortest paths for that pair

3. Add together all the values in step two; this is the node's Betweenness Centrality score

4. Repeat for each node

Node D Calculation

Pairs with Shortest Paths Through D   Total Possible Shortest Paths for that Pair   % of Total Through D (1/Total)
A,E                                   1                                             1
B,E                                   1                                             1
C,E                                   1                                             1
B,C                                   2 (through D & A)                             0.5
Betweenness Score                                                                   3.5

Figure: graph of nodes A, B, C, D, E (D scores 3.5, A scores 0.5)
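The four steps can be sketched in a few lines of Python. The slide's figure is not in this transcript, so the edge list below is an assumption reconstructed from the Node D table (it is one graph that reproduces those pair counts). The code is a brute-force pairwise calculation for small graphs, not the Brandes algorithm or the library's implementation.

```python
from collections import deque
from itertools import combinations

# Assumed undirected edges, inferred from the Node D calculation table.
EDGES = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "D"), ("C", "D"), ("D", "E")]


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def shortest_path_counts(adjacency, source):
    """BFS from source: distance and number of shortest paths to each node."""
    distance = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                count[neighbor] = 0
                queue.append(neighbor)
            if distance[neighbor] == distance[node] + 1:
                count[neighbor] += count[node]
    return distance, count


def betweenness(edges):
    """Sum, over node pairs, of the fraction of shortest paths through each node."""
    adjacency = build_adjacency(edges)
    info = {node: shortest_path_counts(adjacency, node) for node in adjacency}
    scores = {node: 0.0 for node in adjacency}
    for s, t in combinations(sorted(adjacency), 2):
        dist_s, count_s = info[s]
        dist_t, count_t = info[t]
        if t not in dist_s:
            continue  # pair is not connected
        for v in adjacency:
            if v in (s, t) or v not in dist_s or v not in dist_t:
                continue
            # v lies on a shortest s-t path when the two legs add up exactly.
            if dist_s[v] + dist_t[v] == dist_s[t]:
                scores[v] += count_s[v] * count_t[v] / count_s[t]
    return scores


if __name__ == "__main__":
    print(betweenness(EDGES))
```

On the assumed graph this yields 3.5 for D and 0.5 for A, matching the table, with B, C, and E at 0.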

slide-40
SLIDE 40

Betweenness Centrality - Uses

Use When

  • Identify bridges
  • Uncover control points
  • Find bottlenecks and vulnerabilities

Network Resilience

Key points of cascading failure

slide-41
SLIDE 41


Betweenness Centrality

slide-42
SLIDE 42

Community Detection Algorithms

Evaluates how a group is clustered or partitioned

Different approaches to define a community

slide-43
SLIDE 43

Louvain Modularity

slide-44
SLIDE 44

Louvain Modularity

Continually maximizes the modularity by comparing relationship weights and densities to an estimate/average

Tip / Caution

ALL Modularity algorithms:

  • Merge smaller communities into larger ones
  • Review intermediates
  • Can plateau with similar modularity on several partitions - forming local maxima & stalling progress
  • Treat as a guide and test/validate results

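To show what "maximizing the modularity" measures, here is a sketch of the modularity score for a given partition of an unweighted, undirected graph. It computes only the score that Louvain-style methods try to increase; it is not the Louvain algorithm itself. The example graph (two triangles joined by one edge) and the names are illustrative assumptions.

```python
# Hypothetical graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]


def modularity(edges, community_of):
    """Modularity Q of a partition of an unweighted, undirected graph.

    For each community: (share of edges inside it) minus
    (its share of all edge endpoints, squared), summed over communities.
    """
    m = len(edges)
    internal = {}      # community -> number of edges with both ends inside
    degree_total = {}  # community -> sum of degrees of its nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_total[cu] = degree_total.get(cu, 0) + 1
        degree_total[cv] = degree_total.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (degree_total[c] / (2 * m)) ** 2
               for c in degree_total)


if __name__ == "__main__":
    two_groups = {0: "left", 1: "left", 2: "left",
                  3: "right", 4: "right", 5: "right"}
    one_group = {node: "all" for node in range(6)}
    print(modularity(EDGES, two_groups))
    print(modularity(EDGES, one_group))
```

Comparing scores for different candidate partitions is also a simple way to validate results, as the caution above suggests.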

slide-45
SLIDE 45

Louvain Modularity - Uses

Use When

  • Community detection in large networks
  • Uncover hierarchical structures in data
  • Evaluate different grouping thresholds

Understanding the Brain

Mapping hierarchy of functions

slide-46
SLIDE 46

Louvain Modularity - Uses

Use When

  • Community detection in large networks
  • Uncover hierarchical structures in data
  • Evaluate different grouping thresholds

Detecting Fraud Rings

Thresholds for bad apples vs rings

slide-47
SLIDE 47


Graph Communities

slide-48
SLIDE 48


Graph Communities

slide-49
SLIDE 49

Targeting Recommendations - EXAMPLES

  • Degree Centrality to get a message directly in front of the most people with a single, immediate offer
  • PageRank to use an influencer with the broadest reach to all communities for a greater ripple effect
  • Communities for use of different promoted tweets & offers per sub-community
  • Betweenness Centrality + Louvain to find the person who can best connect to a specific sub-community

slide-50
SLIDE 50

Learn More


slide-51
SLIDE 51

Online Training!

Applied Graph Algorithms
https://neo4j.com/graphacademy/online-training/applied-graph-algorithms

Data Science With Neo4j
https://neo4j.com/graphacademy/online-training/data-science

slide-52
SLIDE 52

Free O’Reilly Book

neo4j.com/graph-algorithms-book

  • Spark & Neo4j Examples
  • Machine Learning Chapter
slide-53
SLIDE 53

Hunger Games Questions for "Graph Algorithms for Community Detection & Recommendations"

1. Easy: Which of these algorithms is a community detection algorithm?

  a. PageRank
  b. Common Neighbors
  c. Louvain Modularity

2. Medium: What is NOT a good use for a Graph Projection?

  a. Infer Relationships
  b. Shape a graph for your algorithm to run on
  c. Save a new subgraph

3. Hard: The default damping factor for the PageRank algorithm is 0.85. This default works well for graphs that have what distribution of relationships?

  a. Power Law
  b. Poisson
  c. Normal

Answer here: r.neo4j.com/hunger-games