

SLIDE 1

Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec

Jiezhong Qiu

Tsinghua University

February 21, 2018

Joint work with Yuxiao Dong (MSR), Hao Ma (MSR), Jian Li (IIIS, Tsinghua), Kuansan Wang (MSR), Jie Tang (DCST, Tsinghua)

slide-2
SLIDE 2

Motivation and Problem Formulation

Problem Formulation

Given a network G = (V, E), the aim is to learn a function f : V → R^p that captures neighborhood similarity and community membership.

Applications:

◮ link prediction
◮ community detection
◮ label classification

Figure 1: A toy example (Figure from DeepWalk).

SLIDE 3

History of Network Embedding

A timeline of spectral graph methods and network embedding:

◮ 1973: Fiedler Vector [Fiedler]; Spectral Partitioning [Donath & Hoffman]
◮ 1996: a large body of spectral-partitioning literature, e.g., [Pothen et al.], [Simon], [Bolla], [Hagen & Kahng], [Hendrickson & Leland], [Van Driessche & Roose], [Barnard et al.], [Spielman & Teng], [Guattery & Miller]
◮ 2000: Image Segmentation [Shi & Malik]
◮ 2002: Spectral Clustering [Ng et al.]
◮ 2005: Spectral Clustering vs. Kernel k-means [Dhillon et al.]
◮ 2009: SocDim [Tang & Liu]
◮ 2013: word2vec (skip-gram) [Mikolov et al.]
◮ 2014: DeepWalk [Perozzi et al.]
◮ 2015: LINE & PTE [Tang et al.]
◮ 2016: node2vec [Grover & Leskovec]
◮ 2017: metapath2vec [Dong et al.]

SLIDE 4

Contents

◮ Preliminaries: Notations
◮ Main Theoretic Results: DeepWalk (KDD’14), LINE (WWW’15), PTE (KDD’15), node2vec (KDD’16)
◮ NetMF: NetMF for a Small Window Size T; NetMF for a Large Window Size T
◮ Experiments

SLIDE 5

Notations

Consider an undirected weighted graph G = (V, E), where |V| = n and |E| = m.

◮ Adjacency matrix A ∈ R_+^{n×n}, with A_{i,j} = a_{i,j} > 0 if (i, j) ∈ E and A_{i,j} = 0 otherwise.

◮ Degree matrix D = diag(d_1, · · · , d_n), where d_i is the generalized degree of vertex i.

◮ Volume of the graph G: vol(G) = Σ_i Σ_j A_{i,j}.

Assumption

G = (V, E) is connected, undirected, and not bipartite, which makes P(w) = d_w / vol(G) the unique stationary distribution of the random walk on G.
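
As a small illustration (not from the slides), a minimal Python sketch of these notations, assuming the graph is given as a weighted edge list; names such as build_graph_matrices are hypothetical.

import numpy as np
import scipy.sparse as sp

def build_graph_matrices(edges, n):
    """Build A, D, and vol(G) for an undirected weighted graph.

    edges: iterable of (i, j, weight) with 0-based vertex ids; n: number of vertices.
    """
    rows, cols, vals = [], [], []
    for i, j, w in edges:
        rows += [i, j]                     # insert both directions: the graph is undirected
        cols += [j, i]
        vals += [w, w]
    A = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))   # adjacency matrix A
    d = np.asarray(A.sum(axis=1)).ravel()                   # generalized degrees d_i
    D = sp.diags(d)                                         # degree matrix D = diag(d_1, ..., d_n)
    vol_G = d.sum()                                         # vol(G) = sum_i sum_j A_ij
    return A, D, vol_G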

SLIDE 6

DeepWalk — Roadmap

Input: G = (V, E) → Random Walk → Skip-gram → Output: Node Embedding

SLIDE 7

DeepWalk — a Two-step Algorithm

Algorithm 1: DeepWalk

1 for n = 1, 2, . . . , N do
2   Pick w_1^n according to a probability distribution P(w_1);
3   Generate a vertex sequence (w_1^n, · · · , w_L^n) of length L by a random walk on network G;
4   for j = 1, 2, . . . , L − T do
5     for r = 1, . . . , T do
6       Add vertex-context pair (w_j^n, w_{j+r}^n) to multiset D;
7       Add vertex-context pair (w_{j+r}^n, w_j^n) to multiset D;
8 Run SGNS on D with b negative samples.
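
A minimal Python sketch of the pair-generation step above (illustrative only, not the reference implementation); num_walks, walk_len, and window stand in for N, L, and T, and the start vertex is drawn uniformly rather than from a general P(w_1).

import random
from collections import Counter

def deepwalk_pairs(adj, num_walks, walk_len, window, seed=0):
    """Build the multiset D of vertex-context pairs from random walks.

    adj: dict mapping each vertex to a list of its neighbors (unweighted sketch).
    """
    rng = random.Random(seed)
    D = Counter()
    vertices = list(adj)
    for _ in range(num_walks):
        walk = [rng.choice(vertices)]                  # step 2: pick w_1 (uniform here)
        while len(walk) < walk_len:
            walk.append(rng.choice(adj[walk[-1]]))     # step 3: one random-walk step
        for j in range(len(walk) - window):            # step 4
            for r in range(1, window + 1):             # step 5
                D[(walk[j], walk[j + r])] += 1         # step 6: (w_j, w_{j+r})
                D[(walk[j + r], walk[j])] += 1         # step 7: (w_{j+r}, w_j)
    return D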

SLIDE 8

DeepWalk — Roadmap

Input: G = (V, E) → Random Walk → Skip-gram → Output: Node Embedding

Levy & Goldberg (NIPS ’14):

◮ #(w, c): co-occurrence count of w and c
◮ #(w): occurrence count of word w
◮ #(c): occurrence count of context c
◮ |D|: total number of word-context pairs
◮ b: number of negative samples

SLIDE 9

Skip-gram with Negative Sampling (SGNS)

◮ SGNS maintains a multiset D which counts the occurrences of each word-context pair (w, c).

◮ Objective:

L = Σ_w Σ_c ( #(w, c) log g(x_w^⊤ y_c) + (b #(w) #(c) / |D|) log g(−x_w^⊤ y_c) ),

where x_w, y_c ∈ R^d, g is the sigmoid function, and b is the number of negative samples for SGNS.

◮ For sufficiently large dimensionality d, SGNS is equivalent to factorizing the (shifted) PMI matrix (Levy & Goldberg, NIPS ’14) with entries

log( #(w, c) · |D| / (b · #(w) · #(c)) ).
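
A dense Python sketch of this shifted-PMI construction from a pair Counter D (illustrative; shifted_pmi_matrix and the zero-masking of undefined entries are assumptions of the sketch):

import numpy as np

def shifted_pmi_matrix(D, vocab, b):
    """Entries log(#(w,c) |D| / (b #(w) #(c))); zero co-occurrences are masked to 0 here."""
    idx = {v: i for i, v in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for (w, c), cnt in D.items():
        C[idx[w], idx[c]] = cnt                        # #(w, c)
    total = C.sum()                                    # |D|
    row = C.sum(axis=1, keepdims=True)                 # #(w)
    col = C.sum(axis=0, keepdims=True)                 # #(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        M = np.log(C * total / (b * row * col))
    M[~np.isfinite(M)] = 0.0
    return M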
SLIDE 10

DeepWalk — Roadmap

Input: G = (V, E) → Random Walk → Skip-gram → Output: Node Embedding

Levy & Goldberg (NIPS ’14):

◮ #(w, c): co-occurrence count of w and c
◮ #(w): occurrence count of word w
◮ #(c): occurrence count of context c
◮ |D|: total number of word-context pairs
◮ b: number of negative samples

SLIDE 11

DeepWalk

Toy random-walk sequence: a b c d e, yielding vertex-context pairs (c, a), (c, e), (c, d).

Question

Suppose the multiset D is constructed from random walks on a graph. Can we interpret log( #(w, c) · |D| / (b · #(w) · #(c)) ) with graph-theoretic terminology?

SLIDE 12

DeepWalk

Toy random-walk sequence: a b c d e, yielding vertex-context pairs (c, a), (c, e), (c, d).

Question

Suppose the multiset D is constructed from random walks on a graph. Can we interpret log( #(w, c) · |D| / (b · #(w) · #(c)) ) with graph-theoretic terminology?

Challenge

The pairs in D mix several things together, namely direction and distance.

SLIDE 13

DeepWalk

Toy random-walk sequence: a b c d e, yielding vertex-context pairs (c, a), (c, e), (c, d).

Question

Suppose the multiset D is constructed from random walks on a graph. Can we interpret log( #(w, c) · |D| / (b · #(w) · #(c)) ) with graph-theoretic terminology?

Challenge

The pairs in D mix several things together, namely direction and distance.

Solution

Let’s distinguish them!

SLIDE 14

DeepWalk

Partition the multiset D into several sub-multisets according to the way in which a vertex and its context appear in a random-walk sequence. More formally, for r = 1, · · · , T, define

D_r^→ = { (w, c) : (w, c) ∈ D, w = w_j^n, c = w_{j+r}^n },
D_r^← = { (w, c) : (w, c) ∈ D, w = w_{j+r}^n, c = w_j^n }.

Toy example: in the walk a b c d e, the pair (c, d) falls in D_1^→, (c, e) in D_2^→, and (c, a) in D_2^←.
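
A sketch of this partitioning on top of the walk corpus (illustrative; partition_pairs is a hypothetical helper, walks is a list of vertex sequences):

from collections import Counter, defaultdict

def partition_pairs(walks, window):
    """Split the vertex-context pairs into sub-multisets keyed by (r, direction)."""
    D = defaultdict(Counter)
    for walk in walks:
        for j in range(len(walk) - window):
            for r in range(1, window + 1):
                D[(r, "->")][(walk[j], walk[j + r])] += 1   # w = w_j, c = w_{j+r}
                D[(r, "<-")][(walk[j + r], walk[j])] += 1   # w = w_{j+r}, c = w_j
    return D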

SLIDE 15

DeepWalk as Implicit Matrix Factorization

Some observations

◮ Observation 1:

log( #(w, c) · |D| / (b · #(w) · #(c)) ) = log( (#(w, c)/|D|) / ( b · (#(w)/|D|) · (#(c)/|D|) ) ).

◮ Observation 2:

#(w, c)/|D| = (1/2T) Σ_{r=1}^{T} ( #(w, c)_r^→ / |D_r^→| + #(w, c)_r^← / |D_r^←| ).

It is therefore sufficient to characterize #(w, c)_r^→ / |D_r^→| and #(w, c)_r^← / |D_r^←|.

SLIDE 16

DeepWalk — Theorems

Theorem

Denote P = D^{-1}A. When the length of the random walk L → ∞,

#(w, c)_r^→ / |D_r^→| →p (d_w / vol(G)) · (P^r)_{w,c}   and   #(w, c)_r^← / |D_r^←| →p (d_c / vol(G)) · (P^r)_{c,w},

where →p denotes convergence in probability.

Theorem

When the length of the random walk L → ∞, we have

#(w, c)/|D| →p (1/2T) Σ_{r=1}^{T} ( (d_w / vol(G)) · (P^r)_{w,c} + (d_c / vol(G)) · (P^r)_{c,w} ).

Theorem

For DeepWalk, when the length of the random walk L → ∞,

#(w, c) · |D| / (#(w) · #(c)) →p (vol(G)/2T) ( (1/d_c) Σ_{r=1}^{T} (P^r)_{w,c} + (1/d_w) Σ_{r=1}^{T} (P^r)_{c,w} ).
SLIDE 17

DeepWalk — Conclusion

Theorem

DeepWalk is asymptotically and implicitly factorizing

log( (vol(G)/b) · (1/T) Σ_{r=1}^{T} (D^{-1}A)^r · D^{-1} ).
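
For completeness, a compact derivation sketch (not spelled out on the slides) connecting the last theorem to this matrix form; it uses the reversibility of the random walk on an undirected graph, d_w (P^r)_{w,c} = d_c (P^r)_{c,w}, so the two sums inside the parentheses coincide:

\begin{align*}
\frac{\#(w,c)\,|\mathcal{D}|}{\#(w)\,\#(c)}
  \;\xrightarrow{p}\;
  \frac{\operatorname{vol}(G)}{2T}\left(\frac{1}{d_c}\sum_{r=1}^{T}(P^r)_{w,c}
    + \frac{1}{d_w}\sum_{r=1}^{T}(P^r)_{c,w}\right)
  = \frac{\operatorname{vol}(G)}{T}\left(\sum_{r=1}^{T}(D^{-1}A)^{r} D^{-1}\right)_{w,c}.
\end{align*}

Taking the element-wise logarithm and dividing by b inside then gives exactly the matrix in the theorem.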
SLIDE 18

DeepWalk — Roadmap

Input: G = (V, E) → Random Walk → Skip-gram → Output: Node Embedding

Levy & Goldberg (NIPS ’14):

◮ Adjacency matrix
◮ Degree matrix
◮ b: number of negative samples

SLIDE 19

LINE

◮ Objective of LINE:

L = Σ_{i=1}^{|V|} Σ_{j=1}^{|V|} ( A_{i,j} log g(x_i^⊤ y_j) + (b d_i d_j / vol(G)) log g(−x_i^⊤ y_j) ).

◮ Align it with the objective of SGNS:

L = Σ_w Σ_c ( #(w, c) log g(x_w^⊤ y_c) + (b #(w) #(c) / |D|) log g(−x_w^⊤ y_c) ).

◮ LINE is actually factorizing

log( (vol(G)/b) · D^{-1} A D^{-1} ).

◮ Recall DeepWalk’s matrix form:

log( (vol(G)/b) · (1/T) Σ_{r=1}^{T} (D^{-1}A)^r · D^{-1} ).

Observation

LINE is a special case of DeepWalk (T = 1).
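
A small numerical sanity check of this observation (a sketch; A is assumed to be a symmetric non-negative adjacency matrix as a dense numpy array):

import numpy as np

def deepwalk_matrix(A, T, b):
    """vol(G)/(bT) * (sum_{r=1..T} (D^{-1}A)^r) * D^{-1}, before the element-wise log."""
    d = A.sum(axis=1)
    vol = d.sum()
    P = A / d[:, None]                                     # D^{-1} A
    S = sum(np.linalg.matrix_power(P, r) for r in range(1, T + 1))
    return vol / (b * T) * S / d[None, :]                  # right-multiply by D^{-1}

def line_matrix(A, b):
    d = A.sum(axis=1)
    return d.sum() / b * A / d[:, None] / d[None, :]       # vol(G)/b * D^{-1} A D^{-1}

# With T = 1 the two matrices coincide (up to floating point):
#   assert np.allclose(deepwalk_matrix(A, 1, b), line_matrix(A, b))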

SLIDE 20

PTE

Figure 2: Heterogeneous Text Network.

◮ word-word network G_ww, with A_ww ∈ R^{#word×#word}
◮ document-word network G_dw, with A_dw ∈ R^{#doc×#word}
◮ label-word network G_lw, with A_lw ∈ R^{#label×#word}

SLIDE 21

PTE as Implicit Matrix Factorization

PTE implicitly factorizes the stacked matrix

log( [ α vol(G_ww) (D_row^ww)^{-1} A_ww (D_col^ww)^{-1} ;
       β vol(G_dw) (D_row^dw)^{-1} A_dw (D_col^dw)^{-1} ;
       γ vol(G_lw) (D_row^lw)^{-1} A_lw (D_col^lw)^{-1} ] ) − log b,

where the three blocks are stacked vertically.

◮ The matrix is of shape (#word + #doc + #label) × #word.
◮ b is the number of negative samples in training.
◮ {α, β, γ} are hyper-parameters that balance the weights of the three networks. In PTE, {α, β, γ} satisfy α vol(G_ww) = β vol(G_dw) = γ vol(G_lw).
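
A dense numpy sketch of this stacked matrix (illustrative; pte_matrix and the small floor before the logarithm are assumptions of the sketch):

import numpy as np

def pte_matrix(A_ww, A_dw, A_lw, b, alpha, beta, gamma):
    """Stack the three normalized bipartite blocks, then take log(.) - log b."""
    def block(A, weight):
        vol = A.sum()                                      # vol of this sub-network
        d_row = A.sum(axis=1)                              # D_row
        d_col = A.sum(axis=0)                              # D_col
        return weight * vol * A / d_row[:, None] / d_col[None, :]
    M = np.vstack([
        block(A_ww, alpha),                                # (#word  x #word)
        block(A_dw, beta),                                 # (#doc   x #word)
        block(A_lw, gamma),                                # (#label x #word)
    ])
    return np.log(np.maximum(M, 1e-12)) - np.log(b)        # floor avoids log(0) in this sketch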

SLIDE 22

node2vec — 2nd Order Random Walk

The (unnormalized) second-order transition tensor, with w the previous vertex, v the current vertex, and u the candidate next vertex, is

T_{u,v,w} = 1/p   if (u, v) ∈ E, (v, w) ∈ E, u = w;
            1     if (u, v) ∈ E, (v, w) ∈ E, u ≠ w, (w, u) ∈ E;
            1/q   if (u, v) ∈ E, (v, w) ∈ E, u ≠ w, (w, u) ∉ E;
            0     otherwise.

P_{u,v,w} = Prob(w_{j+1} = u | w_j = v, w_{j−1} = w) = T_{u,v,w} / Σ_u T_{u,v,w}.

Stationary Distribution

Σ_w P_{u,v,w} X_{v,w} = X_{u,v}.

Existence is guaranteed by the Perron-Frobenius theorem, but the stationary distribution may not be unique.
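
A minimal Python sketch of one step of this second-order walk (illustrative; node2vec_step is a hypothetical helper, the graph is unweighted, and adj maps each vertex to a set of its neighbors):

import random

def node2vec_step(adj, prev, cur, p, q, rng=random):
    """Sample the next vertex u given current vertex v = cur and previous vertex w = prev."""
    neighbors = list(adj[cur])
    weights = []
    for u in neighbors:
        if u == prev:
            weights.append(1.0 / p)            # return to the previous vertex (u = w)
        elif u in adj[prev]:
            weights.append(1.0)                # u is also a neighbor of the previous vertex
        else:
            weights.append(1.0 / q)            # u is two hops away from the previous vertex
    return rng.choices(neighbors, weights=weights, k=1)[0]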

SLIDE 23

node2vec as Implicit Matrix Factorization

Theorem

node2vec is asymptotically and implicitly factorizing a matrix whose entry at the w-th row and c-th column is

log( (1/2T) Σ_{r=1}^{T} ( Σ_u X_{w,u} P^r_{c,w,u} + Σ_u X_{c,u} P^r_{w,c,u} ) / ( b · (Σ_u X_{w,u}) · (Σ_u X_{c,u}) ) ).

SLIDE 24

Contents

◮ Preliminaries: Notations
◮ Main Theoretic Results: DeepWalk (KDD’14), LINE (WWW’15), PTE (KDD’15), node2vec (KDD’16)
◮ NetMF: NetMF for a Small Window Size T; NetMF for a Large Window Size T
◮ Experiments

SLIDE 25

Roadmap

Input: G = (V, E) → Random Walk → Skip-gram → Output: Node Embedding

Levy & Goldberg (NIPS 14)

Matrix Factorization

SLIDE 26

NetMF

◮ Factorize the DeepWalk matrix:

log( (vol(G)/b) · (1/T) Σ_{r=1}^{T} (D^{-1}A)^r · D^{-1} ).

◮ For numerical reasons, we use the truncated logarithm ˜log(x) = log(max(1, x)).

Figure 3: Truncated Logarithm.

SLIDE 27

NetMF for a Small Window Size T

Algorithm 2: NetMF for a Small Window Size T

1 Compute P^1, · · · , P^T;
2 Compute M = (vol(G)/(bT)) ( Σ_{r=1}^{T} P^r ) D^{-1};
3 Compute M′ = max(M, 1);
4 Rank-d approximation by SVD: log M′ = U_d Σ_d V_d^⊤;
5 return U_d √Σ_d as the network embedding.
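
A dense Python sketch of Algorithm 2 (illustrative only; the reference implementation lives at github.com/xptree/NetMF and differs in details such as sparse matrices):

import numpy as np

def netmf_small_T(A, T, b, dim):
    """NetMF for a small window size T on a dense adjacency matrix A (sketch)."""
    d = A.sum(axis=1)
    vol = d.sum()
    P = A / d[:, None]                                 # P = D^{-1} A
    S = np.zeros_like(P)
    P_r = np.eye(A.shape[0])
    for _ in range(T):                                 # accumulate P^1 + ... + P^T
        P_r = P_r @ P
        S += P_r
    M = vol / (b * T) * S / d[None, :]                 # M = vol(G)/(bT) (sum_r P^r) D^{-1}
    log_M = np.log(np.maximum(M, 1.0))                 # M' = max(M, 1), then element-wise log
    U, s, _ = np.linalg.svd(log_M)                     # rank-d approximation by SVD
    return U[:, :dim] * np.sqrt(s[:dim])               # U_d sqrt(Sigma_d)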

SLIDE 28

NetMF for a Large Window Size T — Observations

◮ We want to factorize

˜log( (vol(G)/b) · (1/T) Σ_{r=1}^{T} (D^{-1}A)^r · D^{-1} ).

◮ From the theory of the normalized graph Laplacian, the normalized adjacency matrix admits the eigen-decomposition

D^{-1/2} A D^{-1/2} = U Λ U^⊤,

where Λ = diag(λ_1, · · · , λ_n) and every λ_i ∈ [−1, 1].

◮ Consequently,

(1/T) Σ_{r=1}^{T} (D^{-1}A)^r · D^{-1} = D^{-1/2} ( (1/T) Σ_{r=1}^{T} (D^{-1/2} A D^{-1/2})^r ) D^{-1/2} = D^{-1/2} U ( (1/T) Σ_{r=1}^{T} Λ^r ) U^⊤ D^{-1/2},

where the middle factor (1/T) Σ_{r=1}^{T} Λ^r is a polynomial in the eigenvalues.
SLIDE 29

NetMF for a Large Window Size T — Observations

Figure 4: f(λ) = (1/T) Σ_{r=1}^{T} λ^r for T = 1, 2, 5, 10, comparing the eigenvalues in [−1, 1] before and after filtering.

Idea

This polynomial implicitly filters out negative eigenvalues and small positive eigenvalues, so why not do it explicitly?

SLIDE 30

NetMF for a Large Window Size T — Algorithm

Algorithm 3: NetMF for a Large Window Size T

1 Eigen-decomposition D^{-1/2} A D^{-1/2} ≈ U_h Λ_h U_h^⊤;
2 Approximate M with M̂ = (vol(G)/b) D^{-1/2} U_h ( (1/T) Σ_{r=1}^{T} Λ_h^r ) U_h^⊤ D^{-1/2};
3 Compute M̂′ = max(M̂, 1);
4 Rank-d approximation by SVD: log M̂′ = U_d Σ_d V_d^⊤;
5 return U_d √Σ_d as the network embedding.
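
A Python sketch of Algorithm 3 using scipy's sparse eigensolver (illustrative only, not the reference code); A is assumed to be a scipy.sparse adjacency matrix and h is the number of retained eigenpairs:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def netmf_large_T(A, T, b, h, dim):
    """NetMF for a large window size T via a rank-h eigen-approximation (sketch)."""
    d = np.asarray(A.sum(axis=1)).ravel()
    vol = d.sum()
    D_inv_sqrt = sp.diags(d ** -0.5)
    S = D_inv_sqrt @ A @ D_inv_sqrt                        # D^{-1/2} A D^{-1/2}
    lam, U_h = eigsh(S, k=h, which="LA")                   # top-h eigenpairs (largest eigenvalues)
    filt = np.mean([lam ** r for r in range(1, T + 1)], axis=0)   # (1/T) sum_r Lambda_h^r
    X = (D_inv_sqrt @ U_h) * filt                          # D^{-1/2} U_h (1/T sum_r Lambda_h^r)
    M_hat = vol / b * X @ (D_inv_sqrt @ U_h).T             # ... U_h^T D^{-1/2}, scaled by vol(G)/b
    log_M = np.log(np.maximum(M_hat, 1.0))                 # truncated logarithm
    U, s, _ = np.linalg.svd(log_M)                         # rank-d approximation by SVD
    return U[:, :dim] * np.sqrt(s[:dim])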

SLIDE 31

Setup

Label Classification:

◮ Datasets: BlogCatalog, PPI, Wikipedia, Flickr
◮ Classifier: logistic regression
◮ NetMF (T = 1) vs. LINE
◮ NetMF (T = 10) vs. DeepWalk

Table 1: Statistics of Datasets.

Dataset    BlogCatalog   PPI      Wikipedia   Flickr
|V|        10,312        3,890    4,777       80,513
|E|        333,983       76,584   184,812     5,899,882
#Labels    39            50       40          195
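
A sketch of the evaluation protocol under stated assumptions: X is the learned embedding matrix, Y a binary label-indicator matrix, and scikit-learn's one-vs-rest logistic regression is used; the papers' exact protocol (e.g., predicting the top-k labels per node) is simplified here to plain thresholded prediction.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

def evaluate(X, Y, train_ratio, seed=0):
    """Multi-label node classification; returns (Micro-F1, Macro-F1)."""
    X_tr, X_te, Y_tr, Y_te = train_test_split(
        X, Y, train_size=train_ratio, random_state=seed)
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    clf.fit(X_tr, Y_tr)
    Y_pred = clf.predict(X_te)
    return (f1_score(Y_te, Y_pred, average="micro"),
            f1_score(Y_te, Y_pred, average="macro"))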

SLIDE 32

Experimental Results

Figure 5 (panels: BlogCatalog, PPI, Wikipedia, Flickr; methods: NetMF (T=1), LINE, NetMF (T=10), DeepWalk): predictive performance when varying the ratio of training data. The x-axis is the ratio of labeled data (%); the y-axes in the top and bottom rows are the Micro-F1 and Macro-F1 scores, respectively.

SLIDE 33

Conclusion

Table 2: The matrices that are implicitly approximated and factorized by DeepWalk, LINE, PTE, and node2vec.

◮ DeepWalk: log( vol(G) · (1/T) Σ_{r=1}^{T} (D^{-1}A)^r · D^{-1} ) − log b
◮ LINE: log( vol(G) · D^{-1} A D^{-1} ) − log b
◮ PTE: log( [ α vol(G_ww) (D_row^ww)^{-1} A_ww (D_col^ww)^{-1} ; β vol(G_dw) (D_row^dw)^{-1} A_dw (D_col^dw)^{-1} ; γ vol(G_lw) (D_row^lw)^{-1} A_lw (D_col^lw)^{-1} ] ) − log b, with the three blocks stacked vertically
◮ node2vec: log( (1/2T) Σ_{r=1}^{T} ( Σ_u X_{w,u} P^r_{c,w,u} + Σ_u X_{c,u} P^r_{w,c,u} ) / ( (Σ_u X_{w,u}) (Σ_u X_{c,u}) ) ) − log b
SLIDE 34

Thanks.

Standing on the shoulders of giants. — Isaac Newton

Code available at github.com/xptree/NetMF

Q&A

SLIDE 35

DeepWalk — Sketched Proof

Theorem

Denote P = D^{-1}A. When L → ∞, we have

#(w, c)_r^→ / |D_r^→| →p (d_w / vol(G)) · (P^r)_{w,c}   and   #(w, c)_r^← / |D_r^←| →p (d_c / vol(G)) · (P^r)_{c,w}.

Proof.

Consider the special case N = 1, so there is only one vertex sequence w_1, · · · , w_L generated by the random walk. Let Y_j (j = 1, · · · , L − T) be the indicator of the event that w_j = w and w_{j+r} = c.

SLIDE 36

Proof (cont’d)

Observation

◮ E[Y_j] = Prob(w_j = w, w_{j+r} = c) → (d_w / vol(G)) · (P^r)_{w,c}.
◮ #(w, c)_r^→ / |D_r^→| = (1/(L − T)) Σ_{j=1}^{L−T} Y_j.
◮ Cov(Y_i, Y_j) → 0 as |i − j| → ∞.

Lemma (S. N. Bernstein’s Law of Large Numbers)

Let Y_1, Y_2, · · · be a sequence of random variables with finite expectations E[Y_j], variances Var(Y_j) < K for j ≥ 1, and covariances such that Cov(Y_i, Y_j) → 0 as |i − j| → ∞. Then the law of large numbers (LLN) holds.

Therefore

#(w, c)_r^→ / |D_r^→| = (1/(L − T)) Σ_{j=1}^{L−T} Y_j →p (1/(L − T)) Σ_{j=1}^{L−T} E[Y_j] → (d_w / vol(G)) · (P^r)_{w,c}.

SLIDE 37

Time Complexity

◮ Eigen-Decomposition (Implicitly Restarted Lanczos Method): O(mhI + nh^2 I + h^3 I).
◮ Reconstruction: O(n^2 h).
◮ Element-wise logarithm: O(n^2).
◮ SVD (a naive implementation via eigen-decomposition): O(n^2 dI + nd^2 I + d^3 I).

SLIDE 38

Future Work

◮ Comprehend high-order cases, e.g., node2vec:

log( (1/2T) Σ_{r=1}^{T} ( Σ_u X_{w,u} P^r_{c,w,u} + Σ_u X_{c,u} P^r_{w,c,u} ) / ( b · (Σ_u X_{w,u}) · (Σ_u X_{c,u}) ) ).

◮ Design scalable algorithms (e.g., using spectral sparsification of random-walk polynomials) for

log( (vol(G)/b) · (1/T) Σ_{r=1}^{T} (D^{-1}A)^r · D^{-1} ).

◮ Connection with graph convolutional networks (Kipf & Welling, ICLR’17).