Graph Representation Learning
William L. Hamilton COMP 551 – Special Topic Lecture
Why graphs? Graphs are a general language for describing and modeling complex systems.
Examples: economic networks, social networks, networks of neurons, information networks (the Web and citation networks), biomedical networks, the Internet.
§ Universal language for describing complex data
§ Networks/graphs from science, nature, and technology are more similar than one would expect
§ Shared vocabulary between fields
§ Computer Science, Social science, Physics, Economics, Statistics, Biology
§ Data availability (+computational challenges)
§ Web/mobile, bio, health, and medical
§ Impact!
§ Social networking, Social media, Drug design
Classical ML tasks in graphs:
§ Node classification
§ Predict the type of a given node
§ Link prediction
§ Predict whether two nodes are linked
§ Community detection
§ Identify densely linked clusters of nodes
§ Network similarity
§ How similar are two (sub)networks?
Classifying the function in the interactome!
Image from: Ganapathiraju et al. 2016. Schizophrenia interactome with 504 novel protein–protein interactions. Nature.
Content recommendation is link prediction!
§ (Supervised) machine learning lifecycle: raw data → structured data → learning algorithm → model → downstream prediction task.
§ The step from raw to structured data is feature engineering: this feature, that feature. Every single time!
§ Instead: automatically learn the features.
Goal: Efficient task-independent feature learning for machine learning in graphs!
Map each node to a low-dimensional vector, $f : u \rightarrow \mathbb{R}^d$, i.e., a feature representation or embedding.
(Figure: input graph and output node embeddings. Image from: Perozzi et al. 2014. DeepWalk: Online Learning of Social Representations. KDD.)
§ Modern deep learning toolbox is designed for simple sequences or grids.
§ CNNs for fixed-size images/grids…
§ RNNs or word2vec for text/sequences…
§ But graphs are far more complex!
§ Complex topological structure (i.e., no spatial locality like grids)
§ No fixed node ordering or reference point (i.e., the isomorphism problem)
§ Often dynamic and with multimodal features.
§ 1) Node embeddings
§ Map nodes to low-dimensional embeddings.
§ 2) Graph neural networks
§ Deep learning architectures for graph-structured data
§ 3) Example applications.
Intuition: find an embedding of nodes into $d$ dimensions so that "similar" nodes in the graph have embeddings that are close together.
§ Assume we have a graph G:
§ V is the vertex set.
§ A is the adjacency matrix (assume binary).
§ No node features or extra information is used!
Goal: encode nodes so that similarity in the embedding space (e.g., dot product) approximates similarity in the original network.
Goal: $\text{similarity}(u, v) \approx \mathbf{z}_v^\top \mathbf{z}_u$, where the similarity function is what we need to define!
1. Define an encoder (i.e., a mapping from nodes to embeddings).
2. Define a node similarity function (i.e., a measure of similarity in the original network).
3. Optimize the parameters of the encoder so that: $\text{similarity}(u, v) \approx \mathbf{z}_v^\top \mathbf{z}_u$
§ The encoder maps each node to a low-dimensional vector: $\mathrm{enc}(v) = \mathbf{z}_v$, where $v$ is a node in the input graph and $\mathbf{z}_v$ is its $d$-dimensional embedding.
§ The similarity function specifies how relationships in vector space map to relationships in the original network: $\text{similarity}(u, v) \approx \mathbf{z}_v^\top \mathbf{z}_u$, i.e., the similarity of $u$ and $v$ in the original network is approximated by the dot product between their node embeddings.
§ Simplest encoding approach: the encoder is just an embedding lookup,
$$\mathrm{enc}(v) = \mathbf{Z}\,\mathbf{v},$$
where $\mathbf{Z}$ is a matrix whose columns are the node embeddings (what we learn!) and $\mathbf{v}$ is an indicator vector, all zeroes except a one in the column indicating node $v$.
§ The embedding matrix $\mathbf{Z}$ has dimension/size $d \times |V|$; each column is the embedding vector for a specific node.
§ Simplest encoding approach: the encoder is just an embedding lookup, i.e., each node is assigned a unique embedding vector (a minimal sketch follows below).
§ E.g., node2vec, DeepWalk, LINE
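As a concrete illustration, here is a minimal NumPy sketch of the lookup (the graph size, dimension, and function names are assumptions for illustration, not from the lecture):

```python
import numpy as np

num_nodes, d = 1000, 64  # hypothetical graph size and embedding dimension

# Embedding matrix Z (d x |V|): every column is one node's embedding,
# and all of its entries are free parameters that we learn.
Z = 0.1 * np.random.randn(d, num_nodes)

def encode(v):
    """Shallow encoder: multiply Z by the indicator vector of node v."""
    one_hot = np.zeros(num_nodes)
    one_hot[v] = 1.0
    return Z @ one_hot  # equivalent to simply Z[:, v]
```

Methods like node2vec, DeepWalk, and LINE differ in how they train $\mathbf{Z}$, not in this lookup step.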
§ The key distinction between "shallow" methods is how they define node similarity.
§ E.g., should two nodes have similar embeddings if they…
§ are connected?
§ share neighbors?
§ have similar "structural roles"?
§ …?
§ The similarity function is just the edge weight between $u$ and $v$ in the original network.
Intuition: Dot products between node embeddings approximate edge existence.
$$\mathcal{L} = \sum_{(u,v) \in V \times V} \left\| \mathbf{z}_u^\top \mathbf{z}_v - A_{u,v} \right\|^2$$
Here $\mathcal{L}$ is the loss (what we want to minimize), the sum runs over all node pairs, $\mathbf{z}_u^\top \mathbf{z}_v$ is the embedding similarity, and $A_{u,v}$ is the corresponding entry of the (weighted) adjacency matrix for the graph.
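A minimal NumPy sketch of this loss (the toy adjacency matrix and all names below are illustrative assumptions, not from the lecture):

```python
import numpy as np

def adjacency_loss(Z, A):
    """L = sum over all node pairs (u, v) of (z_u . z_v - A[u, v])^2.

    Z: d x |V| embedding matrix; A: |V| x |V| (weighted) adjacency matrix.
    """
    S = Z.T @ Z                  # S[u, v] = dot product of the embeddings of u and v
    return np.sum((S - A) ** 2)  # squared reconstruction error over all pairs

# Toy example: a 3-node path graph with random 2-dimensional embeddings.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
Z = 0.1 * np.random.randn(2, 3)
print(adjacency_loss(Z, A))
```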
§ Goal: find the embedding matrix $\mathbf{Z}$ that minimizes the loss $\mathcal{L}$.
§ Stochastic gradient descent can be used as a general optimization method.
§ For this particular loss, matrix-decomposition solvers also apply (e.g., SVD or QR decomposition routines).
§ Drawbacks:
§ O(|V|²) runtime: we must consider all node pairs.
§ This can be reduced to O(|E|) by only summing over non-zero edges and using regularization (e.g., Ahmed et al., 2013).
§ O(|V|) parameters! (One learned vector per node.)
§ Only considers direct, local connections; e.g., the blue node is obviously more similar to the green node than to the red node, despite having no direct connections to either.
1. Estimate the probability of visiting node $v$ on a random walk starting from node $u$ using some random walk strategy $R$.
2. Optimize embeddings to encode these random walk statistics.
1. Expressivity: a flexible, stochastic definition of node similarity that incorporates both local and higher-order neighborhood information.
2. Efficiency: we do not need to consider all node pairs when training; we only need to consider pairs that co-occur on random walks.
1. Run short random walks starting from each node on the graph using some strategy $R$.
2. For each node $u$, collect $N_R(u)$, the multiset* of nodes visited on random walks starting from $u$.
3. Optimize embeddings according to:
$$\mathcal{L} = \sum_{u \in V} \sum_{v \in N_R(u)} -\log\big(P(v \mid \mathbf{z}_u)\big)$$
* $N_R(u)$ can have repeat elements since nodes can be visited multiple times on random walks.
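To make steps 1 and 2 concrete, here is a small Python sketch that uses plain uniform random walks as the strategy $R$ (the adjacency-list representation and all function names are assumptions for illustration, not the lecture's implementation):

```python
import random

def random_walk(adj, start, length):
    """One uniform random walk of `length` steps starting from `start`.

    adj: dict mapping each node to a list of its neighbors.
    """
    walk = [start]
    for _ in range(length):
        neighbors = adj[walk[-1]]
        if not neighbors:          # dead end: stop the walk early
            break
        walk.append(random.choice(neighbors))
    return walk

def collect_neighborhoods(adj, num_walks=10, walk_length=5):
    """Build N_R(u) for every node u: the multiset of nodes visited on
    short random walks from u (repeated visits are kept on purpose)."""
    N_R = {u: [] for u in adj}
    for u in adj:
        for _ in range(num_walks):
            N_R[u].extend(random_walk(adj, u, walk_length)[1:])
    return N_R
```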
Intuition: Optimize embeddings to maximize likelihood of random walk co-occurrences.
Parameterize $P(v \mid \mathbf{z}_u)$ using the softmax:
$$P(v \mid \mathbf{z}_u) = \frac{\exp(\mathbf{z}_u^\top \mathbf{z}_v)}{\sum_{n \in V} \exp(\mathbf{z}_u^\top \mathbf{z}_n)}$$
Putting things together:
$$\mathcal{L} = \sum_{u \in V} \sum_{v \in N_R(u)} -\log\!\left(\frac{\exp(\mathbf{z}_u^\top \mathbf{z}_v)}{\sum_{n \in V} \exp(\mathbf{z}_u^\top \mathbf{z}_n)}\right)$$
The outer sum runs over all nodes $u$, the inner sum runs over the nodes $v$ seen on random walks starting from $u$, and the fraction is the predicted probability of $u$ and $v$ co-occurring on a random walk.
Optimizing random walk embeddings = finding the embeddings $\mathbf{z}_u$ that minimize $\mathcal{L}$.
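A naive Python/NumPy sketch of this objective, evaluating the softmax denominator exactly (all names are illustrative assumptions; `Z` maps each node to its embedding vector and `N_R` is the co-occurrence multiset from before):

```python
import numpy as np

def random_walk_loss(Z, N_R):
    """L = sum_u sum_{v in N_R(u)} -log softmax(z_u . z_v), computed naively.

    Z:   dict mapping node -> embedding vector (a NumPy array).
    N_R: dict mapping node u -> multiset (list) of nodes that co-occur with u
         on random walks; its keys are assumed to cover all nodes in V.
    """
    nodes = list(N_R.keys())
    loss = 0.0
    for u in nodes:
        scores = np.array([Z[u] @ Z[n] for n in nodes])  # z_u . z_n for all n in V
        log_denom = np.log(np.sum(np.exp(scores)))       # log sum_n exp(z_u . z_n)
        for v in N_R[u]:
            loss += -(Z[u] @ Z[v] - log_denom)           # -log P(v | z_u)
    return loss
```

Evaluating the denominator for every node makes this naive version expensive; in practice, methods such as node2vec and DeepWalk approximate it (e.g., with negative sampling) and minimize the loss with stochastic gradient descent.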
Goal: encode nodes so that similarity in the embedding space (e.g., dot product) approximates similarity in the original network.
§ So far we have focused on "shallow" encoders, i.e., embedding lookups: an embedding matrix of dimension/size $d \times |V|$ in which each column is the embedding vector for a specific node.
§ Limitations of shallow encoding:
§ O(|V|) parameters are needed: there is no parameter sharing and every node has its own unique embedding vector.
§ Inherently "transductive": it is impossible to generate embeddings for nodes that were not seen during training.
§ Does not incorporate node features: many graphs have features that we can and should leverage.
§ We will now discuss "deeper" methods based on graph neural networks, where the encoder is a more complex function that depends on graph structure.
§ In general, all of these more complex encoders can be combined with the similarity functions from the previous section.
§ Assume we have a graph G:
§ V is the vertex set.
§ A is the adjacency matrix (assume binary).
§ $\mathbf{X} \in \mathbb{R}^{m \times |V|}$ is a matrix of node features, e.g.:
§ Categorical attributes, text, image data (e.g., profile information in a social network).
§ Node degrees, clustering coefficients, etc.
§ Indicator vectors (i.e., a one-hot encoding of each node).
§ Key idea: generate node embeddings based on local neighborhoods.
§ Intuition: nodes aggregate information from their neighbors using neural networks.
§ Intuition: the network neighborhood defines a computation graph.
Every node defines a unique computation graph!
§ Nodes have embeddings at each layer.
§ The model can be of arbitrary depth.
§ The "layer-0" embedding of node $u$ is its input feature vector, i.e., $\mathbf{x}_u$.
§ Neighborhood aggregation can be viewed as a center-surround filter. § Mathematically related to spectral graph convolutions (see Bronstein et al., 2017)
What's in the box!?
§ Key distinctions are in how different approaches aggregate information across the layers.
§ Basic approach: average neighbor information and apply a neural network: (1) average the messages from neighbors (i.e., the average of the neighbors' previous-layer embeddings), then (2) apply a neural network.
§ Basic approach: average neighbor messages and apply a neural network.
$$\mathbf{h}_v^0 = \mathbf{x}_v$$
$$\mathbf{h}_v^k = \sigma\!\left(\mathbf{W}_k \sum_{u \in N(v)} \frac{\mathbf{h}_u^{k-1}}{|N(v)|} + \mathbf{B}_k \mathbf{h}_v^{k-1}\right), \quad \forall k > 0$$
The initial "layer-0" embeddings are equal to the node features ($\mathbf{h}_v^0 = \mathbf{x}_v$), $\mathbf{h}_v^k$ is the $k$th-layer embedding of $v$, $\sigma$ is a non-linearity (e.g., ReLU or tanh), and the sum is the average of the neighbors' previous-layer embeddings.
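As an illustrative sketch of this update (not the lecture's reference implementation; the dictionary-based graph representation and all names are assumptions), one round of the averaging aggregation can be written as:

```python
import numpy as np

def gnn_layer(h_prev, adj, W, B, sigma=np.tanh):
    """One neighborhood-aggregation layer:
    h_v^k = sigma(W @ mean_{u in N(v)} h_u^{k-1} + B @ h_v^{k-1}).

    h_prev: dict node -> previous-layer embedding h^{k-1}.
    adj:    dict node -> list of neighbors N(v).
    W, B:   trainable weight matrices for this layer.
    """
    h_next = {}
    for v, neighbors in adj.items():
        if neighbors:
            neigh_avg = np.mean([h_prev[u] for u in neighbors], axis=0)
        else:  # isolated node: no neighbor messages to average
            neigh_avg = np.zeros_like(h_prev[v])
        h_next[v] = sigma(W @ neigh_avg + B @ h_prev[v])
    return h_next
```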
§ How do we train the model to generate "high-quality" embeddings?
§ We need to define a loss function on the embeddings, $\mathcal{L}(\mathbf{z}_u)$!
§ After $K$ layers of neighborhood aggregation, we get output embeddings for each node.
§ We can feed these embeddings into any loss function to train the aggregation parameters.
$$\mathbf{h}_v^0 = \mathbf{x}_v$$
$$\mathbf{h}_v^k = \sigma\!\left(\mathbf{W}_k \sum_{u \in N(v)} \frac{\mathbf{h}_u^{k-1}}{|N(v)|} + \mathbf{B}_k \mathbf{h}_v^{k-1}\right), \quad \forall k \in \{1, \ldots, K\}$$
$$\mathbf{z}_v = \mathbf{h}_v^K$$
Here $\mathbf{W}_k$ and $\mathbf{B}_k$ are the trainable matrices (i.e., what we learn).
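Stacking $K$ such layers to produce the final embeddings $\mathbf{z}_v$ could then look like the following sketch, reusing the hypothetical `gnn_layer` function from the previous snippet:

```python
import numpy as np

def gnn_encode(x, adj, weights, sigma=np.tanh):
    """Run K rounds of neighborhood aggregation and return z_v = h_v^K.

    x:       dict node -> input feature vector (the "layer-0" embeddings h^0).
    adj:     dict node -> list of neighbors.
    weights: list of K (W_k, B_k) pairs, one per layer.
    """
    h = dict(x)  # h^0 = x
    for W, B in weights:
        h = gnn_layer(h, adj, W, B, sigma)  # compute h^k from h^{k-1}
    return h     # z_v = h_v^K for every node v
```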
§ Train in an unsupervised manner using only the graph structure.
§ The unsupervised loss function can be anything from the last section, e.g., based on:
§ Random walks (node2vec, DeepWalk)
§ Graph factorization
§ i.e., train the model so that "similar" nodes have similar embeddings.
§ Alternative: directly train the model for a supervised task (e.g., node classification in an online social network: human or bot?).
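For example, a hedged sketch of such a supervised objective is a binary cross-entropy loss on the output embeddings (the classifier weights `theta`, the logistic decoder, and all names are assumptions for illustration, trained jointly with the aggregation parameters):

```python
import numpy as np

def node_classification_loss(z, labels, theta):
    """Binary cross-entropy over the labelled nodes (e.g., human vs. bot).

    z:      dict node -> output embedding z_v from the GNN.
    labels: dict node -> 0/1 label for the labelled training nodes.
    theta:  classification weight vector applied to the embeddings.
    """
    loss = 0.0
    for v, y in labels.items():
        p = 1.0 / (1.0 + np.exp(-(theta @ z[v])))            # predicted P(y = 1)
        loss += -(y * np.log(p) + (1 - y) * np.log(1 - p))   # cross-entropy term
    return loss
```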
1) Define a neighborhood aggregation function.
2) Define a loss function on the embeddings, $\mathcal{L}(\mathbf{z}_u)$.
3) Train on a set of nodes, i.e., a batch of compute graphs.
4) Generate embeddings for nodes as needed, even for nodes we never trained on!
The compute graph for node A and the compute graph for node B use shared parameters $\mathbf{W}_k$ and $\mathbf{B}_k$.
§ The same aggregation parameters are shared for all nodes. § The number of model parameters is sublinear in |V| and we can generalize to unseen nodes!
§ Inductive node embeddings generalize to entirely unseen graphs: e.g., train on the protein interaction graph from model organism A and generate embeddings for newly collected data about organism B. Train on one graph, generalize to a new graph.
§ Many application settings constantly encounter previously unseen nodes (e.g., Reddit, YouTube, Google Scholar, …) and need to generate new embeddings "on the fly": train with a snapshot of the graph, then generate an embedding for each new node as it arrives.
§ Recap: generate node embeddings by aggregating neighborhood information.
§ Allows for parameter sharing in the encoder.
§ Allows for inductive learning.
§ We saw a basic variant of this idea…
§ Key distinctions are in how different approaches aggregate messages across the layers.
§ Pins: visual bookmarks (text, images, links).
§ Boards: collections of pins.
§ ~200 million monthly active users; ~2 billion pins, ~1 billion boards, ~17 billion edges.
[KDD 2018]
Collaboration with Pinterest
Will Hamilton, McGill and Mila 67
Task: Given a query pin, recommend related pins.
(Figure: a successful recommendation vs. a bad recommendation.)
[KDD 2018]
[KDD 2018]
§ Compared with the current production system: 5,000 query images, 20,000 head-to-head comparisons.
§ Users preferred the GNN recommendations 60% of the time.
[Under Review]
[ICML 2018; NeurIPS 2018; AAAI 2019]
§ Growing interest in models that are capable of logical induction and combinatorial generalization. § Learn “rules” from training data that can generalize to unseen types of data instances (e.g., larger, different structures, …).
[NeurIPS 2018]
Questions?