

SLIDE 1

Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec

Jiezhong Qiu

Tsinghua University

February 21, 2018

Joint work with Yuxiao Dong (MSR), Hao Ma (MSR), Jian Li (IIIS, Tsinghua), Kuansan Wang (MSR), Jie Tang (DCST, Tsinghua)

slide-2
SLIDE 2

Motivation and Problem Formulation

Problem Formulation

Given a network G = (V, E), the aim is to learn a function f : V → R^p that captures neighborhood similarity and community membership.

Applications:

◮ link prediction
◮ community detection
◮ label classification

Figure 1: A toy example (Figure from DeepWalk).

SLIDE 3

History of Network Embedding

A timeline of spectral graph methods and network embedding:

◮ 1973: Fiedler Vector [Fiedler]; Spectral Partitioning [Donath & Hoffman]
◮ 1996: a large body of spectral-partitioning literature, e.g., [Pothen et al.], [Simon], [Bolla], [Hagen & Kahng], [Hendrickson & Leland], [Van Driessche & Roose], [Barnard et al.], [Spielman & Teng], [Guattery & Miller]
◮ 2000: Image Segmentation [Shi & Malik]
◮ 2002: Spectral Clustering [Ng et al.]
◮ 2005: Spectral Clustering vs. Kernel k-means [Dhillon et al.]
◮ 2009: SocDim [Tang & Liu]
◮ 2013: word2vec (skip-gram) [Mikolov et al.]
◮ 2014: DeepWalk [Perozzi et al.]
◮ 2015: LINE & PTE [Tang et al.]
◮ 2016: node2vec [Grover & Leskovec]
◮ 2017: metapath2vec [Dong et al.]

SLIDE 4

Contents

◮ Preliminaries: Notations
◮ Main Theoretic Results: DeepWalk (KDD’14), LINE (WWW’15), PTE (KDD’15), node2vec (KDD’16)
◮ NetMF: NetMF for a Small Window Size T; NetMF for a Large Window Size T
◮ Experiments

SLIDE 5

Notations

Consider an undirected weighted graph G = (V, E), where |V| = n and |E| = m.

◮ Adjacency matrix A ∈ R_+^{n×n}, with A_{i,j} = a_{i,j} > 0 if (i, j) ∈ E and A_{i,j} = 0 otherwise.

◮ Degree matrix D = diag(d_1, · · · , d_n), where d_i is the generalized degree of vertex i.

◮ Volume of the graph G: vol(G) = Σ_i Σ_j A_{i,j}.

Assumption

G = (V, E) is connected, undirected, and not bipartite, which makes P(w) = d_w / vol(G) the unique stationary distribution of the random walk on G.
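
As a small illustration (not from the slides), a minimal Python sketch of these notations, assuming the graph is given as a weighted edge list; names such as build_graph_matrices are hypothetical.

import numpy as np
import scipy.sparse as sp

def build_graph_matrices(edges, n):
    """Build A, D, and vol(G) for an undirected weighted graph.

    edges: iterable of (i, j, weight) with 0-based vertex ids; n: number of vertices.
    """
    rows, cols, vals = [], [], []
    for i, j, w in edges:
        rows += [i, j]                     # insert both directions: the graph is undirected
        cols += [j, i]
        vals += [w, w]
    A = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))   # adjacency matrix A
    d = np.asarray(A.sum(axis=1)).ravel()                   # generalized degrees d_i
    D = sp.diags(d)                                         # degree matrix D = diag(d_1, ..., d_n)
    vol_G = d.sum()                                         # vol(G) = sum_i sum_j A_ij
    return A, D, vol_G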

SLIDE 6

DeepWalk — Roadmap

Input: G = (V, E) → Random Walk → Skip-gram → Output: Node Embedding

SLIDE 7

DeepWalk — a Two-step Algorithm

Algorithm 1: DeepWalk

1 for n = 1, 2, . . . , N do
2   Pick w_1^n according to a probability distribution P(w_1);
3   Generate a vertex sequence (w_1^n, · · · , w_L^n) of length L by a random walk on network G;
4   for j = 1, 2, . . . , L − T do
5     for r = 1, . . . , T do
6       Add vertex-context pair (w_j^n, w_{j+r}^n) to multiset D;
7       Add vertex-context pair (w_{j+r}^n, w_j^n) to multiset D;
8 Run SGNS on D with b negative samples.
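
A minimal Python sketch of the pair-generation step above (illustrative only, not the reference implementation); num_walks, walk_len, and window stand in for N, L, and T, and the start vertex is drawn uniformly rather than from a general P(w_1).

import random
from collections import Counter

def deepwalk_pairs(adj, num_walks, walk_len, window, seed=0):
    """Build the multiset D of vertex-context pairs from random walks.

    adj: dict mapping each vertex to a list of its neighbors (unweighted sketch).
    """
    rng = random.Random(seed)
    D = Counter()
    vertices = list(adj)
    for _ in range(num_walks):
        walk = [rng.choice(vertices)]                  # step 2: pick w_1 (uniform here)
        while len(walk) < walk_len:
            walk.append(rng.choice(adj[walk[-1]]))     # step 3: one random-walk step
        for j in range(len(walk) - window):            # step 4
            for r in range(1, window + 1):             # step 5
                D[(walk[j], walk[j + r])] += 1         # step 6: (w_j, w_{j+r})
                D[(walk[j + r], walk[j])] += 1         # step 7: (w_{j+r}, w_j)
    return D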

SLIDE 8

DeepWalk — Roadmap

Input: G = (V, E) → Random Walk → Skip-gram → Output: Node Embedding

Levy & Goldberg (NIPS ’14):

◮ #(w, c): co-occurrence count of w and c
◮ #(w): occurrence count of word w
◮ #(c): occurrence count of context c
◮ |D|: total number of word-context pairs
◮ b: number of negative samples

SLIDE 9

Skip-gram with Negative Sampling (SGNS)

◮ SGNS maintains a multiset D which counts the occurrences of each word-context pair (w, c).

◮ Objective:

L = Σ_w Σ_c ( #(w, c) log g(x_w^⊤ y_c) + (b #(w) #(c) / |D|) log g(−x_w^⊤ y_c) ),

where x_w, y_c ∈ R^d, g is the sigmoid function, and b is the number of negative samples for SGNS.

◮ For sufficiently large dimensionality d, SGNS is equivalent to factorizing the (shifted) PMI matrix (Levy & Goldberg, NIPS ’14) with entries

log( #(w, c) · |D| / (b · #(w) · #(c)) ).
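
A dense Python sketch of this shifted-PMI construction from a pair Counter D (illustrative; shifted_pmi_matrix and the zero-masking of undefined entries are assumptions of the sketch):

import numpy as np

def shifted_pmi_matrix(D, vocab, b):
    """Entries log(#(w,c) |D| / (b #(w) #(c))); zero co-occurrences are masked to 0 here."""
    idx = {v: i for i, v in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for (w, c), cnt in D.items():
        C[idx[w], idx[c]] = cnt                        # #(w, c)
    total = C.sum()                                    # |D|
    row = C.sum(axis=1, keepdims=True)                 # #(w)
    col = C.sum(axis=0, keepdims=True)                 # #(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        M = np.log(C * total / (b * row * col))
    M[~np.isfinite(M)] = 0.0
    return M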
SLIDE 10

DeepWalk — Roadmap

Input: G = (V, E) → Random Walk → Skip-gram → Output: Node Embedding

Levy & Goldberg (NIPS ’14):

◮ #(w, c): co-occurrence count of w and c
◮ #(w): occurrence count of word w
◮ #(c): occurrence count of context c
◮ |D|: total number of word-context pairs
◮ b: number of negative samples

SLIDE 11

DeepWalk

Toy random-walk sequence: a b c d e, yielding vertex-context pairs (c, a), (c, e), (c, d).

Question

Suppose the multiset D is constructed from random walks on a graph. Can we interpret log( #(w, c) · |D| / (b · #(w) · #(c)) ) with graph-theoretic terminology?

SLIDE 12

DeepWalk

Toy random-walk sequence: a b c d e, yielding vertex-context pairs (c, a), (c, e), (c, d).

Question

Suppose the multiset D is constructed from random walks on a graph. Can we interpret log( #(w, c) · |D| / (b · #(w) · #(c)) ) with graph-theoretic terminology?

Challenge

The pairs in D mix several things together, namely direction and distance.

SLIDE 13

DeepWalk

Toy random-walk sequence: a b c d e, yielding vertex-context pairs (c, a), (c, e), (c, d).

Question

Suppose the multiset D is constructed from random walks on a graph. Can we interpret log( #(w, c) · |D| / (b · #(w) · #(c)) ) with graph-theoretic terminology?

Challenge

The pairs in D mix several things together, namely direction and distance.

Solution

Let’s distinguish them!

SLIDE 14

DeepWalk

Partition the multiset D into several sub-multisets according to the way in which a vertex and its context appear in a random-walk sequence. More formally, for r = 1, · · · , T, define

D_r^→ = { (w, c) : (w, c) ∈ D, w = w_j^n, c = w_{j+r}^n },
D_r^← = { (w, c) : (w, c) ∈ D, w = w_{j+r}^n, c = w_j^n }.

Toy example: in the walk a b c d e, the pair (c, d) falls in D_1^→, (c, e) in D_2^→, and (c, a) in D_2^←.
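
A sketch of this partitioning on top of the walk corpus (illustrative; partition_pairs is a hypothetical helper, walks is a list of vertex sequences):

from collections import Counter, defaultdict

def partition_pairs(walks, window):
    """Split the vertex-context pairs into sub-multisets keyed by (r, direction)."""
    D = defaultdict(Counter)
    for walk in walks:
        for j in range(len(walk) - window):
            for r in range(1, window + 1):
                D[(r, "->")][(walk[j], walk[j + r])] += 1   # w = w_j, c = w_{j+r}
                D[(r, "<-")][(walk[j + r], walk[j])] += 1   # w = w_{j+r}, c = w_j
    return D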

SLIDE 15

DeepWalk as Implicit Matrix Factorization

Some observations

◮ Observation 1:

log( #(w, c) · |D| / (b · #(w) · #(c)) ) = log( (#(w, c)/|D|) / ( b · (#(w)/|D|) · (#(c)/|D|) ) ).

◮ Observation 2:

#(w, c)/|D| = (1/2T) Σ_{r=1}^{T} ( #(w, c)_r^→ / |D_r^→| + #(w, c)_r^← / |D_r^←| ).

It is therefore sufficient to characterize #(w, c)_r^→ / |D_r^→| and #(w, c)_r^← / |D_r^←|.

SLIDE 16

DeepWalk — Theorems

Theorem

Denote P = D^{-1}A. When the length of the random walk L → ∞,

#(w, c)_r^→ / |D_r^→| →p (d_w / vol(G)) · (P^r)_{w,c}   and   #(w, c)_r^← / |D_r^←| →p (d_c / vol(G)) · (P^r)_{c,w},

where →p denotes convergence in probability.

Theorem

When the length of the random walk L → ∞, we have

#(w, c)/|D| →p (1/2T) Σ_{r=1}^{T} ( (d_w / vol(G)) · (P^r)_{w,c} + (d_c / vol(G)) · (P^r)_{c,w} ).

Theorem

For DeepWalk, when the length of the random walk L → ∞,

#(w, c) · |D| / (#(w) · #(c)) →p (vol(G)/2T) ( (1/d_c) Σ_{r=1}^{T} (P^r)_{w,c} + (1/d_w) Σ_{r=1}^{T} (P^r)_{c,w} ).
SLIDE 17

DeepWalk — Conclusion

Theorem

DeepWalk is asymptotically and implicitly factorizing

log( (vol(G)/b) · (1/T) Σ_{r=1}^{T} (D^{-1}A)^r · D^{-1} ).
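
For completeness, a compact derivation sketch (not spelled out on the slides) connecting the last theorem to this matrix form; it uses the reversibility of the random walk on an undirected graph, d_w (P^r)_{w,c} = d_c (P^r)_{c,w}, so the two sums inside the parentheses coincide:

\begin{align*}
\frac{\#(w,c)\,|\mathcal{D}|}{\#(w)\,\#(c)}
  \;\xrightarrow{p}\;
  \frac{\operatorname{vol}(G)}{2T}\left(\frac{1}{d_c}\sum_{r=1}^{T}(P^r)_{w,c}
    + \frac{1}{d_w}\sum_{r=1}^{T}(P^r)_{c,w}\right)
  = \frac{\operatorname{vol}(G)}{T}\left(\sum_{r=1}^{T}(D^{-1}A)^{r} D^{-1}\right)_{w,c}.
\end{align*}

Taking the element-wise logarithm and dividing by b inside then gives exactly the matrix in the theorem.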
SLIDE 18

DeepWalk — Roadmap

Input: G = (V, E) → Random Walk → Skip-gram → Output: Node Embedding

Levy & Goldberg (NIPS ’14):

◮ Adjacency matrix
◮ Degree matrix
◮ b: number of negative samples

SLIDE 19

LINE

◮ Objective of LINE:

L = Σ_{i=1}^{|V|} Σ_{j=1}^{|V|} ( A_{i,j} log g(x_i^⊤ y_j) + (b d_i d_j / vol(G)) log g(−x_i^⊤ y_j) ).

◮ Align it with the objective of SGNS:

L = Σ_w Σ_c ( #(w, c) log g(x_w^⊤ y_c) + (b #(w) #(c) / |D|) log g(−x_w^⊤ y_c) ).

◮ LINE is actually factorizing

log( (vol(G)/b) · D^{-1} A D^{-1} ).

◮ Recall DeepWalk’s matrix form:

log( (vol(G)/b) · (1/T) Σ_{r=1}^{T} (D^{-1}A)^r · D^{-1} ).

Observation

LINE is a special case of DeepWalk (T = 1).
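
A small numerical sanity check of this observation (a sketch; A is assumed to be a symmetric non-negative adjacency matrix as a dense numpy array):

import numpy as np

def deepwalk_matrix(A, T, b):
    """vol(G)/(bT) * (sum_{r=1..T} (D^{-1}A)^r) * D^{-1}, before the element-wise log."""
    d = A.sum(axis=1)
    vol = d.sum()
    P = A / d[:, None]                                     # D^{-1} A
    S = sum(np.linalg.matrix_power(P, r) for r in range(1, T + 1))
    return vol / (b * T) * S / d[None, :]                  # right-multiply by D^{-1}

def line_matrix(A, b):
    d = A.sum(axis=1)
    return d.sum() / b * A / d[:, None] / d[None, :]       # vol(G)/b * D^{-1} A D^{-1}

# With T = 1 the two matrices coincide (up to floating point):
#   assert np.allclose(deepwalk_matrix(A, 1, b), line_matrix(A, b))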

SLIDE 20

PTE

Figure 2: Heterogeneous Text Network.

◮ word-word network G_ww, with A_ww ∈ R^{#word×#word}
◮ document-word network G_dw, with A_dw ∈ R^{#doc×#word}
◮ label-word network G_lw, with A_lw ∈ R^{#label×#word}

SLIDE 21

PTE as Implicit Matrix Factorization

PTE implicitly factorizes the stacked matrix

log( [ α vol(G_ww) (D_row^ww)^{-1} A_ww (D_col^ww)^{-1} ;
       β vol(G_dw) (D_row^dw)^{-1} A_dw (D_col^dw)^{-1} ;
       γ vol(G_lw) (D_row^lw)^{-1} A_lw (D_col^lw)^{-1} ] ) − log b,

where the three blocks are stacked vertically.

◮ The matrix is of shape (#word + #doc + #label) × #word.
◮ b is the number of negative samples in training.
◮ {α, β, γ} are hyper-parameters that balance the weights of the three networks. In PTE, {α, β, γ} satisfy α vol(G_ww) = β vol(G_dw) = γ vol(G_lw).
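
A dense numpy sketch of this stacked matrix (illustrative; pte_matrix and the small floor before the logarithm are assumptions of the sketch):

import numpy as np

def pte_matrix(A_ww, A_dw, A_lw, b, alpha, beta, gamma):
    """Stack the three normalized bipartite blocks, then take log(.) - log b."""
    def block(A, weight):
        vol = A.sum()                                      # vol of this sub-network
        d_row = A.sum(axis=1)                              # D_row
        d_col = A.sum(axis=0)                              # D_col
        return weight * vol * A / d_row[:, None] / d_col[None, :]
    M = np.vstack([
        block(A_ww, alpha),                                # (#word  x #word)
        block(A_dw, beta),                                 # (#doc   x #word)
        block(A_lw, gamma),                                # (#label x #word)
    ])
    return np.log(np.maximum(M, 1e-12)) - np.log(b)        # floor avoids log(0) in this sketch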

SLIDE 22

node2vec — 2nd Order Random Walk

The (unnormalized) second-order transition tensor, with w the previous vertex, v the current vertex, and u the candidate next vertex, is

T_{u,v,w} = 1/p   if (u, v) ∈ E, (v, w) ∈ E, u = w;
            1     if (u, v) ∈ E, (v, w) ∈ E, u ≠ w, (w, u) ∈ E;
            1/q   if (u, v) ∈ E, (v, w) ∈ E, u ≠ w, (w, u) ∉ E;
            0     otherwise.

P_{u,v,w} = Prob(w_{j+1} = u | w_j = v, w_{j−1} = w) = T_{u,v,w} / Σ_u T_{u,v,w}.

Stationary Distribution

Σ_w P_{u,v,w} X_{v,w} = X_{u,v}.

Existence is guaranteed by the Perron-Frobenius theorem, but the stationary distribution may not be unique.
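
A minimal Python sketch of one step of this second-order walk (illustrative; node2vec_step is a hypothetical helper, the graph is unweighted, and adj maps each vertex to a set of its neighbors):

import random

def node2vec_step(adj, prev, cur, p, q, rng=random):
    """Sample the next vertex u given current vertex v = cur and previous vertex w = prev."""
    neighbors = list(adj[cur])
    weights = []
    for u in neighbors:
        if u == prev:
            weights.append(1.0 / p)            # return to the previous vertex (u = w)
        elif u in adj[prev]:
            weights.append(1.0)                # u is also a neighbor of the previous vertex
        else:
            weights.append(1.0 / q)            # u is two hops away from the previous vertex
    return rng.choices(neighbors, weights=weights, k=1)[0]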

SLIDE 23

node2vec as Implicit Matrix Factorization

Theorem

node2vec is asymptotically and implicitly factorizing a matrix whose entry at the w-th row and c-th column is

log( (1/2T) Σ_{r=1}^{T} ( Σ_u X_{w,u} P^r_{c,w,u} + Σ_u X_{c,u} P^r_{w,c,u} ) / ( b · (Σ_u X_{w,u}) · (Σ_u X_{c,u}) ) ).

SLIDE 24

Contents

◮ Preliminaries: Notations
◮ Main Theoretic Results: DeepWalk (KDD’14), LINE (WWW’15), PTE (KDD’15), node2vec (KDD’16)
◮ NetMF: NetMF for a Small Window Size T; NetMF for a Large Window Size T
◮ Experiments

SLIDE 25

Roadmap

Input: G = (V, E) → Random Walk → Skip-gram → Output: Node Embedding

Levy & Goldberg (NIPS 14)

Matrix Factorization

SLIDE 26

NetMF

◮ Factorize the DeepWalk matrix:

log( (vol(G)/b) · (1/T) Σ_{r=1}^{T} (D^{-1}A)^r · D^{-1} ).

◮ For numerical reasons, we use the truncated logarithm ˜log(x) = log(max(1, x)).

Figure 3: Truncated Logarithm.

SLIDE 27

NetMF for a Small Window Size T

Algorithm 2: NetMF for a Small Window Size T

1 Compute P^1, · · · , P^T;
2 Compute M = (vol(G)/(bT)) ( Σ_{r=1}^{T} P^r ) D^{-1};
3 Compute M′ = max(M, 1);
4 Rank-d approximation by SVD: log M′ = U_d Σ_d V_d^⊤;
5 return U_d √Σ_d as the network embedding.
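
A dense Python sketch of Algorithm 2 (illustrative only; the reference implementation lives at github.com/xptree/NetMF and differs in details such as sparse matrices):

import numpy as np

def netmf_small_T(A, T, b, dim):
    """NetMF for a small window size T on a dense adjacency matrix A (sketch)."""
    d = A.sum(axis=1)
    vol = d.sum()
    P = A / d[:, None]                                 # P = D^{-1} A
    S = np.zeros_like(P)
    P_r = np.eye(A.shape[0])
    for _ in range(T):                                 # accumulate P^1 + ... + P^T
        P_r = P_r @ P
        S += P_r
    M = vol / (b * T) * S / d[None, :]                 # M = vol(G)/(bT) (sum_r P^r) D^{-1}
    log_M = np.log(np.maximum(M, 1.0))                 # M' = max(M, 1), then element-wise log
    U, s, _ = np.linalg.svd(log_M)                     # rank-d approximation by SVD
    return U[:, :dim] * np.sqrt(s[:dim])               # U_d sqrt(Sigma_d)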

SLIDE 28

NetMF for a Large Window Size T — Observations

◮ We want to factorize

˜log( (vol(G)/b) · (1/T) Σ_{r=1}^{T} (D^{-1}A)^r · D^{-1} ).

◮ From the theory of the normalized graph Laplacian, the normalized adjacency matrix admits the eigen-decomposition

D^{-1/2} A D^{-1/2} = U Λ U^⊤,

where Λ = diag(λ_1, · · · , λ_n) and every λ_i ∈ [−1, 1].

◮ Consequently,

(1/T) Σ_{r=1}^{T} (D^{-1}A)^r · D^{-1} = D^{-1/2} ( (1/T) Σ_{r=1}^{T} (D^{-1/2} A D^{-1/2})^r ) D^{-1/2} = D^{-1/2} U ( (1/T) Σ_{r=1}^{T} Λ^r ) U^⊤ D^{-1/2},

where the middle factor (1/T) Σ_{r=1}^{T} Λ^r is a polynomial in the eigenvalues.
SLIDE 29

NetMF for a Large Window Size T — Observations

Figure 4: f(λ) = (1/T) Σ_{r=1}^{T} λ^r for T = 1, 2, 5, 10, comparing the eigenvalues in [−1, 1] before and after filtering.

Idea

This polynomial implicitly filters out negative eigenvalues and small positive eigenvalues, so why not do it explicitly?

SLIDE 30

NetMF for a Large Window Size T — Algorithm

Algorithm 3: NetMF for a Large Window Size T

1 Eigen-decomposition D^{-1/2} A D^{-1/2} ≈ U_h Λ_h U_h^⊤;
2 Approximate M with M̂ = (vol(G)/b) D^{-1/2} U_h ( (1/T) Σ_{r=1}^{T} Λ_h^r ) U_h^⊤ D^{-1/2};
3 Compute M̂′ = max(M̂, 1);
4 Rank-d approximation by SVD: log M̂′ = U_d Σ_d V_d^⊤;
5 return U_d √Σ_d as the network embedding.
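
A Python sketch of Algorithm 3 using scipy's sparse eigensolver (illustrative only, not the reference code); A is assumed to be a scipy.sparse adjacency matrix and h is the number of retained eigenpairs:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def netmf_large_T(A, T, b, h, dim):
    """NetMF for a large window size T via a rank-h eigen-approximation (sketch)."""
    d = np.asarray(A.sum(axis=1)).ravel()
    vol = d.sum()
    D_inv_sqrt = sp.diags(d ** -0.5)
    S = D_inv_sqrt @ A @ D_inv_sqrt                        # D^{-1/2} A D^{-1/2}
    lam, U_h = eigsh(S, k=h, which="LA")                   # top-h eigenpairs (largest eigenvalues)
    filt = np.mean([lam ** r for r in range(1, T + 1)], axis=0)   # (1/T) sum_r Lambda_h^r
    X = (D_inv_sqrt @ U_h) * filt                          # D^{-1/2} U_h (1/T sum_r Lambda_h^r)
    M_hat = vol / b * X @ (D_inv_sqrt @ U_h).T             # ... U_h^T D^{-1/2}, scaled by vol(G)/b
    log_M = np.log(np.maximum(M_hat, 1.0))                 # truncated logarithm
    U, s, _ = np.linalg.svd(log_M)                         # rank-d approximation by SVD
    return U[:, :dim] * np.sqrt(s[:dim])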

SLIDE 31

Setup

Label Classification:

◮ Datasets: BlogCatalog, PPI, Wikipedia, Flickr
◮ Classifier: logistic regression
◮ NetMF (T = 1) vs. LINE
◮ NetMF (T = 10) vs. DeepWalk

Table 1: Statistics of Datasets.

Dataset    BlogCatalog   PPI      Wikipedia   Flickr
|V|        10,312        3,890    4,777       80,513
|E|        333,983       76,584   184,812     5,899,882
#Labels    39            50       40          195
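
A sketch of the evaluation protocol under stated assumptions: X is the learned embedding matrix, Y a binary label-indicator matrix, and scikit-learn's one-vs-rest logistic regression is used; the papers' exact protocol (e.g., predicting the top-k labels per node) is simplified here to plain thresholded prediction.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

def evaluate(X, Y, train_ratio, seed=0):
    """Multi-label node classification; returns (Micro-F1, Macro-F1)."""
    X_tr, X_te, Y_tr, Y_te = train_test_split(
        X, Y, train_size=train_ratio, random_state=seed)
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    clf.fit(X_tr, Y_tr)
    Y_pred = clf.predict(X_te)
    return (f1_score(Y_te, Y_pred, average="micro"),
            f1_score(Y_te, Y_pred, average="macro"))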

SLIDE 32

Experimental Results

Figure 5 (panels: BlogCatalog, PPI, Wikipedia, Flickr; methods: NetMF (T=1), LINE, NetMF (T=10), DeepWalk): predictive performance when varying the ratio of training data. The x-axis is the ratio of labeled data (%); the y-axes in the top and bottom rows are the Micro-F1 and Macro-F1 scores, respectively.

SLIDE 33

Conclusion

Table 2: The matrices that are implicitly approximated and factorized by DeepWalk, LINE, PTE, and node2vec.

◮ DeepWalk: log( vol(G) · (1/T) Σ_{r=1}^{T} (D^{-1}A)^r · D^{-1} ) − log b
◮ LINE: log( vol(G) · D^{-1} A D^{-1} ) − log b
◮ PTE: log( [ α vol(G_ww) (D_row^ww)^{-1} A_ww (D_col^ww)^{-1} ; β vol(G_dw) (D_row^dw)^{-1} A_dw (D_col^dw)^{-1} ; γ vol(G_lw) (D_row^lw)^{-1} A_lw (D_col^lw)^{-1} ] ) − log b, with the three blocks stacked vertically
◮ node2vec: log( (1/2T) Σ_{r=1}^{T} ( Σ_u X_{w,u} P^r_{c,w,u} + Σ_u X_{c,u} P^r_{w,c,u} ) / ( (Σ_u X_{w,u}) (Σ_u X_{c,u}) ) ) − log b
SLIDE 34

Thanks.

Standing on the shoulders of giants. — Isaac Newton

Code available at github.com/xptree/NetMF

Q&A

SLIDE 35

DeepWalk — Sketched Proof

Theorem

Denote P = D^{-1}A. When L → ∞, we have

#(w, c)_r^→ / |D_r^→| →p (d_w / vol(G)) · (P^r)_{w,c}   and   #(w, c)_r^← / |D_r^←| →p (d_c / vol(G)) · (P^r)_{c,w}.

Proof.

Consider the special case N = 1, so there is only one vertex sequence w_1, · · · , w_L generated by the random walk. Let Y_j (j = 1, · · · , L − T) be the indicator of the event that w_j = w and w_{j+r} = c.

SLIDE 36

Proof (cont’d)

Observation

◮ E[Y_j] = Prob(w_j = w, w_{j+r} = c) → (d_w / vol(G)) · (P^r)_{w,c}.
◮ #(w, c)_r^→ / |D_r^→| = (1/(L − T)) Σ_{j=1}^{L−T} Y_j.
◮ Cov(Y_i, Y_j) → 0 as |i − j| → ∞.

Lemma (S. N. Bernstein’s Law of Large Numbers)

Let Y_1, Y_2, · · · be a sequence of random variables with finite expectations E[Y_j], variances Var(Y_j) < K for j ≥ 1, and covariances such that Cov(Y_i, Y_j) → 0 as |i − j| → ∞. Then the law of large numbers (LLN) holds.

Therefore

#(w, c)_r^→ / |D_r^→| = (1/(L − T)) Σ_{j=1}^{L−T} Y_j →p (1/(L − T)) Σ_{j=1}^{L−T} E[Y_j] → (d_w / vol(G)) · (P^r)_{w,c}.

SLIDE 37

Time Complexity

◮ Eigen-Decomposition (Implicitly Restarted Lanczos Method): O(mhI + nh^2 I + h^3 I).
◮ Reconstruction: O(n^2 h).
◮ Element-wise logarithm: O(n^2).
◮ SVD (a naive implementation via eigen-decomposition): O(n^2 dI + nd^2 I + d^3 I).

SLIDE 38

Future Work

◮ Comprehend high-order cases, e.g., node2vec:

log( (1/2T) Σ_{r=1}^{T} ( Σ_u X_{w,u} P^r_{c,w,u} + Σ_u X_{c,u} P^r_{w,c,u} ) / ( b · (Σ_u X_{w,u}) · (Σ_u X_{c,u}) ) ).

◮ Design scalable algorithms (e.g., using spectral sparsification of random-walk polynomials) for

log( (vol(G)/b) · (1/T) Σ_{r=1}^{T} (D^{-1}A)^r · D^{-1} ).

◮ Connection with graph convolutional networks (Kipf & Welling, ICLR’17).