

SLIDE 1

CS224W: Machine Learning with Graphs Jure Leskovec, Stanford University

http://cs224w.stanford.edu

SLIDE 2

Machine Learning

Node classification

SLIDE 3

Machine Learning

SLIDE 4

• (Supervised) Machine Learning Lifecycle: requires feature engineering every single time!

[Pipeline: Raw Data → Structured Data → Learning Algorithm → Model → Downstream task, with hand-crafted Feature Engineering. Goal: automatically learn the features instead.]

SLIDE 5

Goal: Efficient task-independent feature learning for machine learning with graphs!

f : u → ℝᵈ

(feature representation / embedding of node u)

SLIDE 6
• Task: map each node in a network into a low-dimensional space
  – Distributed representations for nodes
  – Similarity of embeddings between nodes indicates their network similarity
  – Encode network information and generate node representations

SLIDE 7

• 2D embeddings of the nodes of Zachary's Karate Club network:

[Figure: Zachary's Karate Club network and its 2D node embeddings]

Image from: Perozzi et al. DeepWalk: Online Learning of Social Representations. KDD 2014.

SLIDE 8

• The modern deep learning toolbox is designed for simple sequences or grids:
  – CNNs for fixed-size images/grids
  – RNNs or word2vec for text/sequences

SLIDE 9

• But networks are far more complex!
  – Complex topological structure (i.e., no spatial locality like grids)
  – No fixed node ordering or reference point (i.e., the isomorphism problem)
  – Often dynamic, with multimodal features

SLIDE 10

SLIDE 11

• Assume we have a graph G:
  – V is the vertex set.
  – A is the adjacency matrix (assume binary).
  – No node features or extra information is used!

SLIDE 12

• Goal: encode nodes so that similarity in the embedding space (e.g., dot product) approximates similarity in the original network.

SLIDE 13

similarity(u, v) ≈ z_vᵀ z_u

The left side is the similarity of u and v in the original network (goal: we need to define it!); the right side is the similarity of the embeddings.

SLIDE 14

1. Define an encoder (i.e., a mapping from nodes to embeddings).

2. Define a node similarity function (i.e., a measure of similarity in the original network).

3. Optimize the parameters of the encoder so that:

   similarity(u, v) ≈ z_vᵀ z_u

   (similarity in the original network ≈ similarity of the embeddings)

SLIDE 15

• Encoder: maps each node to a low-dimensional vector:

  enc(v) = z_v

  where v is a node in the input graph and z_v is its d-dimensional embedding.

• Similarity function: specifies how relationships in vector space map to relationships in the original network:

  similarity(u, v) ≈ z_vᵀ z_u

  i.e., the similarity of u and v in the original network ≈ the dot product between their node embeddings.

SLIDE 16

• Simplest encoding approach: the encoder is just an embedding-lookup:

  enc(v) = Zv,   Z ∈ ℝ^{d×|V|},   v ∈ 𝕀^{|V|}

  where Z is a matrix whose columns are the node embeddings (this is what we learn!) and v is an indicator vector: all zeroes except a one in the column indicating node v.

SLIDE 17

• Simplest encoding approach: the encoder is just an embedding-lookup:

  Z = the embedding matrix: one column per node, where each column is the embedding vector for a specific node and the number of rows is the dimension/size of the embeddings.
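A minimal NumPy sketch of this lookup encoder (an editor's illustration, not from the slides; the names `Z` and `encode` and the toy sizes are assumptions):

```python
import numpy as np

num_nodes, d = 5, 3                 # |V| nodes, d-dimensional embeddings
rng = np.random.default_rng(0)

# Z is what we learn: one d-dimensional column per node.
Z = rng.normal(size=(d, num_nodes))

def encode(v: int) -> np.ndarray:
    """enc(v) = Z @ indicator(v): multiplying by an indicator vector
    just selects column v of Z."""
    indicator = np.zeros(num_nodes)
    indicator[v] = 1.0
    return Z @ indicator            # equivalent to Z[:, v]

z_u, z_v = encode(0), encode(1)
print(float(z_u @ z_v))             # dot-product similarity of nodes 0 and 1
```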

SLIDE 18

• Simplest encoding approach: the encoder is just an embedding-lookup.
• Each node is assigned a unique embedding vector.
• Many methods: DeepWalk, node2vec, TransE.

SLIDE 19

• The key design choice across methods is how they define node similarity.
• E.g., should two nodes have similar embeddings if they…
  – are connected?
  – share neighbors?
  – have similar "structural roles"?
  – …?

SLIDE 20

Material based on:

  • Perozzi et al. 2014. DeepWalk: Online Learning of Social Representations. KDD.
  • Grover and Leskovec. 2016. node2vec: Scalable Feature Learning for Networks. KDD.
SLIDE 21

[Figure: example graph on nodes 1–12, illustrating a random walk]

Given a graph and a starting point, we select a neighbor of it at random, and move to this neighbor; then we select a neighbor of this point at random, and move to it, etc. The (random) sequence of points selected this way is a random walk on the graph.
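To make this concrete: a minimal sketch of such an unbiased random walk (illustration only, not from the slides; the adjacency-dict representation and toy graph are assumptions):

```python
import random

def random_walk(adj: dict, start, length: int) -> list:
    """Unbiased random walk: repeatedly hop to a uniformly chosen neighbor."""
    walk = [start]
    for _ in range(length):
        neighbors = adj[walk[-1]]
        if not neighbors:              # dead end: stop early
            break
        walk.append(random.choice(neighbors))
    return walk

# Hypothetical toy graph as adjacency lists.
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
print(random_walk(adj, start=1, length=5))
```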

SLIDE 22

z_uᵀ z_v ≈ probability that u and v co-occur on a random walk over the network

SLIDE 23

1. Estimate the probability of visiting node w on a random walk starting from node v, using some random walk strategy R.

2. Optimize embeddings to encode these random walk statistics: similarity (here: dot product = cos(θ)) encodes random walk "similarity".

SLIDE 24

1. Expressivity: a flexible stochastic definition of node similarity that incorporates both local and higher-order neighborhood information.

2. Efficiency: no need to consider all node pairs when training; only pairs that co-occur on random walks.

SLIDE 25

• Intuition: find an embedding of nodes into d dimensions that preserves similarity.
• Idea: learn node embeddings such that nodes that are nearby in the network end up close together in the embedding space.
• Given a node v, how do we define nearby nodes?
  – N_R(v) … the neighbourhood of v obtained by some strategy R

SLIDE 26

• Given G = (V, E),
• our goal is to learn a mapping z : v → ℝᵈ.
• Log-likelihood objective:

  max_z ∑_{v∈V} log P(N_R(v) | z_v)

  where N_R(v) is the neighborhood of node v under strategy R.

• Given node v, we want to learn feature representations that are predictive of the nodes in its neighborhood N_R(v).

SLIDE 27

1. Run short fixed-length random walks starting from each node in the graph using some strategy R.

2. For each node v, collect N_R(v), the multiset* of nodes visited on random walks starting from v.

3. Optimize embeddings according to: given node v, predict its neighbors N_R(v):

   max_z ∑_{v∈V} log P(N_R(v) | z_v)

*N_R(v) can have repeat elements, since nodes can be visited multiple times on random walks.

SLIDE 28

• Intuition: optimize embeddings to maximize the likelihood of random walk co-occurrences:

  L = ∑_{u∈V} ∑_{v∈N_R(u)} −log P(v | z_u)

• Parameterize P(v | z_u) using the softmax:

  P(v | z_u) = exp(z_uᵀ z_v) / ∑_{n∈V} exp(z_uᵀ z_n)

Why the softmax? We want node v to be most similar to node u (out of all nodes n). Intuition: ∑_i exp(y_i) ≈ max_i exp(y_i).

SLIDE 29

Putting it all together:

  L = ∑_{u∈V} ∑_{v∈N_R(u)} −log( exp(z_uᵀ z_v) / ∑_{n∈V} exp(z_uᵀ z_n) )

The outer sum runs over all nodes u; the inner sum runs over the nodes v seen on random walks starting from u; the fraction is the predicted probability of u and v co-occurring on a random walk.

Optimizing random walk embeddings = finding the embeddings z_u that minimize L.
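A naive sketch of this loss (illustration only; `walk_loss` and `walk_pairs` are hypothetical names, and the full softmax normalization is exactly what makes this O(|V|) per pair):

```python
import numpy as np

def walk_loss(Z: np.ndarray, walk_pairs) -> float:
    """L = sum over co-occurring pairs (u, v) of -log P(v | z_u), with the
    softmax normalized over all nodes (hence the quadratic overall cost)."""
    loss = 0.0
    for u, v in walk_pairs:
        scores = Z @ Z[u]                    # z_u . z_n for every node n
        log_denom = np.log(np.exp(scores).sum())
        loss += -(Z[u] @ Z[v] - log_denom)   # -log softmax probability
    return loss

Z = np.random.default_rng(1).normal(size=(4, 2))   # toy (|V|, d) embeddings
print(walk_loss(Z, [(0, 1), (1, 2)]))              # pairs with v in N_R(u)
```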

SLIDE 30

But doing this naively is too expensive! The nested sum over nodes gives O(|V|²) complexity:

  L = ∑_{u∈V} ∑_{v∈N_R(u)} −log( exp(z_uᵀ z_v) / ∑_{n∈V} exp(z_uᵀ z_n) )

SLIDE 31

But doing this naively is too expensive! The normalization term from the softmax is the culprit… can we approximate it?

  L = ∑_{u∈V} ∑_{v∈N_R(u)} −log( exp(z_uᵀ z_v) / ∑_{n∈V} exp(z_uᵀ z_n) )

SLIDE 32

• Solution: negative sampling. Instead of normalizing w.r.t. all nodes, just normalize against k random "negative samples" n_i:

  log( exp(z_uᵀ z_v) / ∑_{n∈V} exp(z_uᵀ z_n) )
    ≈ log σ(z_uᵀ z_v) − ∑_{i=1}^{k} log σ(z_uᵀ z_{n_i}),   n_i ∼ P_V

  where σ is the sigmoid function (it makes each term a "probability" between 0 and 1) and P_V is a random distribution over all nodes.

Why is the approximation valid? Technically, this is a different objective. But negative sampling is a form of Noise Contrastive Estimation (NCE), which approximately maximizes the log probability of the softmax. The new formulation corresponds to using logistic regression (the sigmoid) to distinguish the target node v from the nodes n_i sampled from the background distribution P_V.

More at https://arxiv.org/pdf/1402.3722.pdf

SLIDE 33

  log( exp(z_uᵀ z_v) / ∑_{n∈V} exp(z_uᵀ z_n) )
    ≈ log σ(z_uᵀ z_v) − ∑_{i=1}^{k} log σ(z_uᵀ z_{n_i}),   n_i ∼ P_V

  where P_V is a random distribution over all nodes.

• Sample the k negative nodes proportional to degree.
• Two considerations for k (the number of negative samples):
  1. Higher k gives more robust estimates.
  2. Higher k corresponds to a higher bias toward negative events.

In practice, k = 5–20.
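A sketch of the per-pair negative-sampling loss (illustration only). Note it uses the standard skip-gram form −log σ(z_uᵀz_v) − Σᵢ log σ(−z_uᵀz_{n_i}), which is what is usually implemented; the slide writes the approximation slightly differently. Using `deg` as the degree-proportional stand-in for P_V is an assumption:

```python
import numpy as np

def neg_sampling_loss(Z, u, v, deg, k=5, rng=np.random.default_rng()):
    """Approximate -log P(v | z_u) for one pair (u, v) with k negatives."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    p = deg / deg.sum()                          # background distribution P_V
    negatives = rng.choice(len(deg), size=k, p=p)
    loss = -np.log(sigmoid(Z[u] @ Z[v]))         # pull the true pair together
    loss -= np.log(sigmoid(-(Z[negatives] @ Z[u]))).sum()  # push negatives away
    return loss

Z = np.random.default_rng(3).normal(size=(6, 2))       # toy embeddings
deg = np.array([2.0, 3.0, 1.0, 2.0, 2.0, 2.0])         # toy node degrees
print(neg_sampling_loss(Z, u=0, v=1, deg=deg))
```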

SLIDE 34

1. Run short fixed-length random walks starting from each node in the graph using some strategy R.

2. For each node u, collect N_R(u), the multiset of nodes visited on random walks starting from u.

3. Optimize embeddings using stochastic gradient descent:

   L = ∑_{u∈V} ∑_{v∈N_R(u)} −log P(v | z_u)

   We can efficiently approximate this using negative sampling!

SLIDE 35

• So far we have described how to optimize embeddings given random walk statistics.
• What strategies should we use to run these random walks?
  – Simplest idea: just run fixed-length, unbiased random walks starting from each node (i.e., DeepWalk from Perozzi et al., 2014).
  – The issue is that this notion of similarity is too constrained.
  – How can we generalize it?

SLIDE 36

• Goal: embed nodes with similar network neighborhoods close together in the feature space.
• We frame this goal as a maximum-likelihood optimization problem, independent of the downstream prediction task.
• Key observation: a flexible notion of the network neighborhood N_R(v) of node v leads to rich node embeddings.
• Develop a biased 2nd-order random walk R to generate the network neighborhood N_R(v) of node v.

SLIDE 37

Idea: use flexible, biased random walks that can trade off between local and global views of the network (Grover and Leskovec, 2016).

[Figure: node u and nodes s₁–s₉; BFS explores the immediate neighborhood of u, while DFS walks away from it]

SLIDE 38

Two classic strategies define the neighborhood N_R(u) of a given node u. For a walk of length 3 (N_R(u) of size 3):

  N_BFS(u) = {s₁, s₂, s₃} — local, microscopic view
  N_DFS(u) = {s₄, s₅, s₆} — global, macroscopic view

[Figure: node u and nodes s₁–s₉, with BFS and DFS walks]

SLIDE 39

BFS: micro-view of the neighbourhood of u
DFS: macro-view of the neighbourhood of u

SLIDE 40

A biased fixed-length random walk R that, given a node v, generates its neighborhood N_R(v).

• Two parameters:
  – Return parameter q: controls returning back to the previous node.
  – In-out parameter r: moving outwards (DFS) vs. inwards (BFS); intuitively, r is the "ratio" of BFS vs. DFS.

SLIDE 41

Biased 2nd-order random walks explore network neighborhoods:

• The random walk just traversed edge (s₁, w) and is now at w.
• Insight: the neighbors of w can only be back at s₁, at the same distance from s₁, or farther from s₁.
• Idea: remember where the walk came from.

[Figure: walker at w, having arrived from s₁, with neighbors s₂ (same distance to s₁) and s₃ (farther from s₁)]

SLIDE 42

• The walker came over edge (s₁, w) and is at w. Where to go next?
• q and r model the transition probabilities:
  – q … return parameter
  – r … "walk away" parameter

1/q, 1, and 1/r are unnormalized probabilities: 1/q for stepping back to s₁, 1 for s₂ (same distance to s₁ as w), and 1/r each for s₃ and s₄ (farther from s₁).

SLIDE 43

• The walker came over edge (s₁, w) and is at w. Where to go next?
  – BFS-like walk: low value of q
  – DFS-like walk: low value of r
• N_R(v) are the nodes visited by the biased walk.

Unnormalized transition probabilities from w, segmented by distance from s₁:

  Target | Prob. | Dist.(s₁, target)
  s₁     | 1/q   | 0
  s₂     | 1     | 1
  s₃     | 1/r   | 2
  s₄     | 1/r   | 2
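A sketch of one biased walk step under these rules (illustration only; note the node2vec paper itself names the parameters p and q, while these slides use q and r):

```python
import random

def biased_step(adj, prev, curr, q=1.0, r=1.0):
    """One 2nd-order step from `curr`, having arrived via edge (prev, curr).

    Unnormalized weights: 1/q to return to `prev`, 1 for neighbors of `curr`
    that are also neighbors of `prev` (same distance to prev), and 1/r
    otherwise (moving farther away)."""
    candidates = adj[curr]
    weights = []
    for nxt in candidates:
        if nxt == prev:
            weights.append(1.0 / q)      # back to the previous node
        elif nxt in adj[prev]:
            weights.append(1.0)          # same distance from prev
        else:
            weights.append(1.0 / r)      # farther from prev
    return random.choices(candidates, weights=weights, k=1)[0]

# Toy graph matching the figure: walker at w, arrived from s1.
adj = {"s1": ["w", "s2"], "w": ["s1", "s2", "s3", "s4"],
       "s2": ["s1", "w"], "s3": ["w"], "s4": ["w"]}
print(biased_step(adj, prev="s1", curr="w", q=0.5, r=2.0))
```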

SLIDE 44

1. Compute the random walk probabilities.
2. Simulate s random walks of length l starting from each node v.
3. Optimize the node2vec objective using stochastic gradient descent.

Linear-time complexity; all three steps are individually parallelizable.

SLIDE 45

• How to use the embeddings z_i of nodes:
  – Clustering/community detection: cluster the points z_i.
  – Node classification: predict the label f(z_i) of node i based on z_i.
  – Link prediction: predict edge (i, j) based on f(z_i, z_j), where we can concatenate, average, take a product of, or take a difference between the embeddings (see the sketch below):
    – Concatenate: f(z_i, z_j) = g([z_i, z_j])
    – Hadamard: f(z_i, z_j) = g(z_i ∗ z_j) (per-coordinate product)
    – Sum/Avg: f(z_i, z_j) = g(z_i + z_j)
    – Distance: f(z_i, z_j) = g(‖z_i − z_j‖₂)
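As referenced above, a small sketch of these edge-feature operators (illustration only; `edge_ops` and the toy vectors are assumptions, and g would be a downstream classifier trained on the resulting features):

```python
import numpy as np

# Each operator maps a pair of node embeddings (z_i, z_j) to one feature
# vector that a downstream classifier g can consume for link prediction.
edge_ops = {
    "concat":   lambda zi, zj: np.concatenate([zi, zj]),
    "hadamard": lambda zi, zj: zi * zj,          # per-coordinate product
    "average":  lambda zi, zj: (zi + zj) / 2.0,
    "distance": lambda zi, zj: np.array([np.linalg.norm(zi - zj, ord=2)]),
}

zi, zj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for name, op in edge_ops.items():
    print(name, op(zi, zj))
```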

SLIDE 46

• Basic idea: embed nodes so that distances in the embedding space reflect node similarities in the original network.
• Different notions of node similarity:
  – Adjacency-based (i.e., similar if connected)
  – Multi-hop similarity definitions
  – Random walk approaches (covered today)

SLIDE 47

• So which method should I use?
• No one method wins in all cases…
  – E.g., node2vec performs better on node classification, while multi-hop methods perform better on link prediction (Goyal and Ferrara, 2017 survey).
• Random walk approaches are generally more efficient.
• In general: you must choose a definition of node similarity that matches your application!

SLIDE 48

An Application of Embeddings to Knowledge Graphs:

Bordes, Usunier, Garcia-Duran, et al. Translating Embeddings for Modeling Multi-relational Data. NeurIPS 2013.

SLIDE 49

A knowledge graph is composed of facts/statements about inter-related entities. Nodes are referred to as entities, edges as relations. In KGs, edges can be of many types!

(Image: Pierre-Yves Vandenbussche)

SLIDE 50

KG incompleteness (missing relations) can substantially affect the efficiency of systems relying on it!

INTUITION: we want a link prediction model that learns from local and global connectivity patterns in the KG, taking entities and relationships of different types into account at the same time.

DOWNSTREAM TASK: relation predictions are performed by using the learned patterns to generalize observed relationships between an entity of interest and all the other entities.

SLIDE 51

SLIDE 52

• In TransE, relationships between entities are represented as triplets:

  h (head entity), ℓ (relation), t (tail entity) ⇒ (h, ℓ, t)

• Entities are first embedded in an entity space ℝᵏ, similarly to the previous methods.
• Relations are represented as translations:
  – h + ℓ ≈ t if the given fact is true
  – else, h + ℓ ≠ t

SLIDE 53

• Entities and relations are initialized uniformly and then normalized.
• Negative sampling: corrupt a triplet so that it does not appear in the KG.
• Comparative loss: favors lower distance values for valid triplets (positive samples) and higher distance values for corrupted ones (negative samples).
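A sketch of TransE's margin-based ranking loss, [γ + d(h + ℓ, t) − d(h′ + ℓ, t′)]₊, for one positive/corrupted pair (illustration only; the variable names and the choice of L2 distance are assumptions):

```python
import numpy as np

def transe_margin_loss(h, l, t, h_c, t_c, gamma=1.0):
    """Margin loss for a positive triplet (h, l, t) and a corrupted one
    (h_c, l, t_c): valid facts should satisfy h + l ≈ t."""
    d = lambda x, y: np.linalg.norm(x - y)   # L2 distance
    pos = d(h + l, t)        # small for a valid fact
    neg = d(h_c + l, t_c)    # large for a corrupted fact
    return max(0.0, gamma + pos - neg)

rng = np.random.default_rng(4)
h, l, t, h_c, t_c = (rng.normal(size=3) for _ in range(5))
print(transe_margin_loss(h, l, t, h_c, t_c))
```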

SLIDE 54

SLIDE 55

• Goal: we want to embed an entire graph G into a vector z_G.
• Tasks:
  – Classifying toxic vs. non-toxic molecules
  – Identifying anomalous graphs

SLIDE 56

Simple idea:

• Run a standard node embedding technique on the (sub)graph G.
• Then just sum (or average) the node embeddings in the (sub)graph G:

  z_G = ∑_{v∈G} z_v

• Used by Duvenaud et al., 2016 to classify molecules based on their graph structure.
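A small sketch of this sum/average pooling (illustration only; `graph_embedding` is a hypothetical name):

```python
import numpy as np

def graph_embedding(Z: np.ndarray, nodes, average=False) -> np.ndarray:
    """z_G = sum (or mean) of the node embeddings z_v over v in the (sub)graph."""
    z_g = Z[list(nodes)].sum(axis=0)
    return z_g / len(nodes) if average else z_g

Z = np.random.default_rng(2).normal(size=(6, 3))   # toy node embeddings
print(graph_embedding(Z, nodes=[0, 2, 5]))
```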

SLIDE 57

• Idea: introduce a "virtual node" to represent the (sub)graph, and run a standard node embedding technique.
• Proposed by Li et al., 2016 as a general technique for subgraph embedding.

SLIDE 58

States in an anonymous walk correspond to the index of the first time we visited that node during the walk.

Anonymous Walk Embeddings, ICML 2018: https://arxiv.org/pdf/1805.11921.pdf

SLIDE 59

The number of anonymous walks grows exponentially:

• There are 5 anonymous walks a_i of length 3: a₁ = 111, a₂ = 112, a₃ = 121, a₄ = 122, a₅ = 123.

SLIDE 60

• Enumerate all possible anonymous walks a_i of l steps and record their counts.
• Represent the graph as a probability distribution over these walks.
• For example:
  – Set l = 3.
  – Then we can represent the graph as a 5-dimensional vector, since there are 5 anonymous walks a_i of length 3: 111, 112, 121, 122, 123.
  – z_G[i] = the probability of anonymous walk a_i in G.
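A sketch of anonymizing walks and building the empirical distribution over them (illustration only; `anonymize` and `walk_distribution` are hypothetical names):

```python
from collections import Counter

def anonymize(walk):
    """Map a walk to its anonymous walk: each node becomes the index of its
    first appearance, e.g. ['a', 'b', 'a', 'c'] -> (1, 2, 1, 3)."""
    first_seen = {}
    for node in walk:
        first_seen.setdefault(node, len(first_seen) + 1)
    return tuple(first_seen[node] for node in walk)

def walk_distribution(walks):
    """Empirical distribution over anonymous walks from sampled walks."""
    counts = Counter(anonymize(w) for w in walks)
    total = sum(counts.values())
    return {aw: c / total for aw, c in counts.items()}

print(anonymize(["a", "b", "a", "c"]))               # (1, 2, 1, 3)
print(walk_distribution([["a", "b", "a"], ["x", "y", "z"]]))
```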

SLIDE 61

• Complete counting of all anonymous walks in a large graph may be infeasible.
• Sampling approach to approximate the true distribution: independently generate a set of m random walks and compute the corresponding empirical distribution of anonymous walks.
• How many random walks m do we need? We want the distribution to have error of more than ε with probability less than δ:

  m = ⌈(2/ε²)(log(2^η − 2) − log δ)⌉

  where η is the total number of anonymous walks of length l.

For example: there are η = 877 anonymous walks of length l = 7. If we set ε = 0.1 and δ = 0.01, then we need to generate m = 122,500 random walks.
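A quick check of this example against the bound (illustration only; the formula above is reconstructed from the Anonymous Walk Embeddings paper, and the sketch below reproduces the slide's 122,500):

```python
import math

def required_walks(eta: int, eps: float, delta: float) -> int:
    """m = ceil((2/eps^2) * (log(2**eta - 2) - log(delta))).
    Python ints are arbitrary-precision, so 2**eta is exact here."""
    return math.ceil((2 / eps**2) * (math.log(2**eta - 2) - math.log(delta)))

# η = 877 anonymous walks of length l = 7, with ε = 0.1 and δ = 0.01:
print(required_walks(eta=877, eps=0.1, delta=0.01))   # 122500
```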

SLIDE 62

Learn an embedding z_i of every anonymous walk a_i.

• The embedding of a graph G is then the sum/avg/concatenation of the walk embeddings z_i.

How to embed walks?

• Idea: embed walks such that the next walk can be predicted:
  – Set z_i so that we maximize P(w_t | w_{t−Δ}, …, w_{t−1}), where w_t is the t-th random walk starting at node v.

SLIDE 63

• Run T different random walks from v, each of length l:

  N_R(v) = {w₁ᵛ, w₂ᵛ, …, w_Tᵛ}

  Let a_i be the anonymous version of walk w_i.

• Learn to predict the walks that co-occur in a Δ-size window.
• Estimate the embedding z_i of the anonymous walk a_i of w_i:

  max (1/T) ∑_{t=Δ}^{T} log P(w_t | w_{t−Δ}, …, w_{t−1})

  where Δ is the context window size,

  P(w_t | w_{t−Δ}, …, w_{t−1}) = exp(y(w_t)) / ∑_i exp(y(w_i))

  y(w_t) = b + U · (1/Δ) ∑_{i=1}^{Δ} z_i

  and b ∈ ℝ, U ∈ ℝᵈ, with z_i the embedding of a_i (the anonymized version of walk w_i).

Anonymous Walk Embeddings, ICML 2018: https://arxiv.org/pdf/1805.11921.pdf

SLIDE 64

We discussed three ideas for graph embeddings:

• Approach 1: embed the nodes and sum/avg them.
• Approach 2: create a super-node that spans the (sub)graph, and then embed that node.
• Approach 3: anonymous walk embeddings.
  – Idea 1: represent the graph via the distribution over all anonymous walks.
  – Idea 2: sample walks to approximate the distribution.
  – Idea 3: embed the anonymous walks themselves.