SLIDE 1

Graph Representation Learning

William L. Hamilton, COMP 551 – Special Topic Lecture

SLIDE 2

Why graphs?

Graphs are a general language for describing and modeling complex systems

SLIDE 3

SLIDE 4

Graph!

SLIDE 5

Many Data are Graphs

Economic networks, social networks, networks of neurons, information networks (Web & citations), biomedical networks, the Internet

SLIDE 6

Why Graphs? Why Now?

§ Universal language for describing complex data
  § Networks/graphs from science, nature, and technology are more similar than one would expect
§ Shared vocabulary between fields
  § Computer science, social science, physics, economics, statistics, biology
§ Data availability (+ computational challenges)
  § Web/mobile, bio, health, and medical
§ Impact!
  § Social networking, social media, drug design

SLIDE 7

Machine Learning with Graphs

Classical ML tasks in graphs:

§ Node classification
  § Predict the type of a given node
§ Link prediction
  § Predict whether two nodes are linked
§ Community detection
  § Identify densely linked clusters of nodes
§ Network similarity
  § How similar are two (sub)networks?

SLIDE 8

Example: Node Classification

[Figure: a graph in which several nodes are unlabeled ("?") and machine learning is used to predict their labels]

SLIDE 9

Example: Node Classification

Classifying the function of proteins in the interactome!

Image from: Ganapathiraju et al. 2016. Schizophrenia interactome with 504 novel protein–protein interactions. Nature.

SLIDE 10

Example: Link Prediction

[Figure: machine learning is used to predict missing edges, marked "?", in the graph]

SLIDE 11

Example: Link Prediction

Content recommendation is link prediction!

SLIDE 12

Machine Learning Lifecycle

§ (Supervised) Machine Learning Lifecycle: this feature, that feature. Every single time!

Raw Data → (Feature Engineering) → Structured Data → Learning Algorithm → Model → Downstream prediction task

Automatically learn the features instead.

SLIDE 13

Feature Learning in Graphs

Goal: Efficient task-independent feature learning for machine learning in graphs!

f : u → ℝ^d   (the d-dimensional feature representation, or embedding, of node u)

SLIDE 14

Example

§ Zachary’s Karate Club Network

[Figure: Input graph and Output embedding (panels A and B)]

Image from: Perozzi et al. 2014. DeepWalk: Online Learning of Social Representations. KDD.

SLIDE 15

Why Is It Hard?

§ Modern deep learning toolbox is designed for simple sequences or grids.
  § CNNs for fixed-size images/grids…
  § RNNs or word2vec for text/sequences…

SLIDE 16

Why Is It Hard?

§ But graphs are far more complex!
  § Complex topological structure (i.e., no spatial locality like grids)
  § No fixed node ordering or reference point (i.e., the isomorphism problem)
  § Often dynamic and with multimodal features

SLIDE 17

This talk

§ 1) Node embeddings
  § Map nodes to low-dimensional embeddings.
§ 2) Graph neural networks
  § Deep learning architectures for graph-structured data.
§ 3) Example applications.

SLIDE 18

Part 1: Node Embeddings

SLIDE 19

Embedding Nodes

Intuition: Find an embedding of nodes into d dimensions so that “similar” nodes in the graph have embeddings that are close together.

[Figure: Input graph and Output embedding (panels A and B)]

SLIDE 20

Setup

§ Assume we have a graph G:
  § V is the vertex set.
  § A is the adjacency matrix (assume binary).
  § No node features or extra information is used!

SLIDE 21

Embedding Nodes

§ Goal is to encode nodes so that similarity in the embedding space (e.g., dot product) approximates similarity in the original network.

SLIDE 22

Embedding Nodes

Goal: similarity(u, v) ≈ z_u^T z_v

(“similarity” in the original network still needs to be defined!)

SLIDE 23

Learning Node Embeddings

1. Define an encoder (i.e., a mapping from nodes to embeddings).
2. Define a node similarity function (i.e., a measure of similarity in the original network).
3. Optimize the parameters of the encoder so that:

similarity(u, v) ≈ z_u^T z_v

SLIDE 24

Two Key Components

§ Encoder maps each node to a low-dimensional vector:
  enc(v) = z_v, where v is a node in the input graph and z_v is its d-dimensional embedding.
§ Similarity function specifies how relationships in vector space map to relationships in the original network:
  similarity(u, v) ≈ z_u^T z_v, i.e., the similarity of u and v in the original network is approximated by the dot product between their node embeddings.

SLIDE 25

“Shallow” Encoding

§ Simplest encoding approach: the encoder is just an embedding lookup

enc(v) = Z v, with Z ∈ ℝ^{d×|V|} and v ∈ 𝕀^{|V|}

where Z is a matrix whose columns are the node embeddings (what we learn!) and v is an indicator vector, all zeroes except a one in the column indicating node v.
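To make the lookup concrete, here is a minimal sketch in Python/numpy (the sizes and variable names are illustrative, not from the slides):

```python
import numpy as np

num_nodes, dim = 34, 16                     # toy sizes, e.g. a small social graph
Z = 0.1 * np.random.randn(dim, num_nodes)   # embedding matrix: one column per node (what we learn!)

def encode(v):
    """Shallow encoder: just look up column v of Z.
    Equivalent to Z @ one_hot(v) without materializing the indicator vector."""
    return Z[:, v]

def similarity(u, v):
    """Decoder: dot product between the two node embeddings."""
    return encode(u) @ encode(v)
```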

SLIDE 26

“Shallow” Encoding

§ Simplest encoding approach: the encoder is just an embedding lookup

[Figure: the embedding matrix Z, with one column per node; each column is the embedding vector for a specific node, and the number of rows is the dimension/size of the embeddings]

SLIDE 27

“Shallow” Encoding

§ Simplest encoding approach: the encoder is just an embedding lookup, i.e., each node is assigned a unique embedding vector.
§ E.g., node2vec, DeepWalk, LINE

SLIDE 28

How to Define Node Similarity?

§ Key distinction between “shallow” methods is how they define node similarity.
§ E.g., should two nodes have similar embeddings if they…
  § are connected?
  § share neighbors?
  § have similar “structural roles”?
  § …?

SLIDE 29

Adjacency-based Similarity

§ Similarity function is just the edge weight between u and v in the original network.
§ Intuition: dot products between node embeddings approximate edge existence.

L = Σ_{(u,v) ∈ V×V} ‖z_u^T z_v − A_{u,v}‖²

where L is the loss (what we want to minimize), the sum runs over all node pairs, A is the (weighted) adjacency matrix for the graph, and z_u^T z_v is the embedding similarity.

SLIDE 30

Adjacency-based Similarity

L = Σ_{(u,v) ∈ V×V} ‖z_u^T z_v − A_{u,v}‖²

§ Find the embedding matrix Z ∈ ℝ^{d×|V|} that minimizes the loss L.
§ Option 1: Use stochastic gradient descent (SGD) as a general optimization method.
  § Highly scalable, general approach.
§ Option 2: Use matrix decomposition solvers (e.g., SVD or QR decomposition routines).
  § Only works in limited cases.
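As a rough illustration of Option 1, the sketch below minimizes the loss above with plain SGD over all node pairs (numpy, hand-written gradients; the function name and hyperparameters are made up for this example):

```python
import numpy as np

def train_adjacency_embeddings(A, dim=16, lr=0.01, epochs=100, seed=0):
    """Minimize L = sum_{(u,v)} (z_u^T z_v - A[u,v])^2 by stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Z = 0.1 * rng.standard_normal((n, dim))      # one embedding (row) per node
    for _ in range(epochs):
        for u in range(n):
            for v in range(n):
                err = Z[u] @ Z[v] - A[u, v]      # z_u^T z_v should approximate A[u, v]
                grad_u, grad_v = 2 * err * Z[v], 2 * err * Z[u]
                Z[u] -= lr * grad_u
                Z[v] -= lr * grad_v
    return Z
```

The double loop makes the O(|V|²) cost of this objective explicit, which is exactly the drawback noted on the next slide.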

SLIDE 31

Adjacency-based Similarity

L = Σ_{(u,v) ∈ V×V} ‖z_u^T z_v − A_{u,v}‖²

§ Drawbacks:
  § O(|V|²) runtime. (Must consider all node pairs.)
    § Can be made O(|E|) by only summing over non-zero edges and using regularization (e.g., Ahmed et al., 2013).
  § O(|V|) parameters! (One learned vector per node.)
  § Only considers direct, local connections.
    § E.g., the blue node is obviously more similar to the green node than to the red node, despite having direct connections to neither.

SLIDE 32

Random-walk Embeddings

z_u^T z_v ≈ probability that u and v co-occur on a random walk over the network

SLIDE 33

Random-walk Embeddings

1. Estimate the probability of visiting node v on a random walk starting from node u, using some random walk strategy R.
2. Optimize embeddings to encode these random walk statistics.

SLIDE 34

Why Random Walks?

1. Expressivity: Flexible stochastic definition of node similarity that incorporates both local and higher-order neighborhood information.
2. Efficiency: Do not need to consider all node pairs when training; only need to consider pairs that co-occur on random walks.

SLIDE 35

Random Walk Optimization

1. Run short random walks starting from each node on the graph using some strategy R.
2. For each node u, collect N_R(u), the multiset* of nodes visited on random walks starting from u.
3. Optimize embeddings according to:

L = Σ_{u ∈ V} Σ_{v ∈ N_R(u)} − log P(v | z_u)

* N_R(u) can have repeat elements since nodes can be visited multiple times on random walks.
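A minimal sketch of steps 1 and 2, assuming a uniform random-walk strategy R (as in DeepWalk); the adjacency-list format and helper names are illustrative:

```python
import random

def random_walk(adj_list, start, length, rng):
    """One uniform random walk; a different strategy R (e.g. node2vec's
    biased walks) would change only this function."""
    walk = [start]
    for _ in range(length - 1):
        neighbors = adj_list[walk[-1]]
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk

def collect_neighborhoods(adj_list, walks_per_node=10, walk_length=5, seed=0):
    """Collect N_R(u): the multiset of nodes visited on walks starting from u."""
    rng = random.Random(seed)
    N_R = {u: [] for u in adj_list}
    for u in adj_list:
        for _ in range(walks_per_node):
            N_R[u].extend(random_walk(adj_list, u, walk_length, rng)[1:])  # exclude u itself
    return N_R
```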

SLIDE 36

Random Walk Optimization

§ Intuition: Optimize embeddings to maximize the likelihood of random walk co-occurrences.

L = Σ_{u ∈ V} Σ_{v ∈ N_R(u)} − log P(v | z_u)

§ Parameterize P(v | z_u) using the softmax:

P(v | z_u) = exp(z_u^T z_v) / Σ_{n ∈ V} exp(z_u^T z_n)

SLIDE 37

Random Walk Optimization

Putting things together:

L = Σ_{u ∈ V} Σ_{v ∈ N_R(u)} − log( exp(z_u^T z_v) / Σ_{n ∈ V} exp(z_u^T z_n) )

The outer sum runs over all nodes u, the inner sum over the nodes v seen on random walks starting from u, and the softmax term is the predicted probability of u and v co-occurring on a random walk.

Optimizing random walk embeddings = finding the embeddings z_u that minimize L.
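Putting the objective into code, a minimal sketch that evaluates L for a given embedding matrix (numpy; `Z` is num_nodes × dim and `N_R` comes from the walk-collection sketch above):

```python
import numpy as np

def random_walk_loss(Z, N_R):
    """L = sum_u sum_{v in N_R(u)} -log( exp(z_u^T z_v) / sum_n exp(z_u^T z_n) )."""
    scores = Z @ Z.T                                  # scores[u, n] = z_u^T z_n
    log_denom = np.log(np.exp(scores).sum(axis=1))    # log sum_n exp(z_u^T z_n), one per node u
    loss = 0.0
    for u, visited in N_R.items():
        for v in visited:
            loss -= scores[u, v] - log_denom[u]       # -log P(v | z_u)
    return loss
```

Note that the full softmax makes each evaluation O(|V|²); real implementations such as DeepWalk and node2vec approximate it (e.g., with hierarchical softmax or negative sampling).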

SLIDE 38

Example: DeepWalk

[Figure: Input graph and Output embedding (panels A and B)]

SLIDE 39

Part 2: Graph Neural Networks

SLIDE 40

Embedding Nodes

§ Goal is to encode nodes so that similarity in the embedding space (e.g., dot product) approximates similarity in the original network.

SLIDE 41

From “Shallow” to “Deep”

§ So far we have focused on “shallow” encoders, i.e., embedding lookups:

[Figure: the embedding matrix Z, one column per node; each column is the embedding vector for a specific node, and the number of rows is the dimension/size of the embeddings]

SLIDE 42

From “Shallow” to “Deep”

§ Limitations of shallow encoding:
  § O(|V|) parameters are needed: there is no parameter sharing and every node has its own unique embedding vector.
  § Inherently “transductive”: it is impossible to generate embeddings for nodes that were not seen during training.
  § Does not incorporate node features: many graphs have features that we can and should leverage.

SLIDE 43

From “Shallow” to “Deep”

§ We will now discuss “deeper” methods based on graph neural networks.
§ In general, all of these more complex encoders can be combined with the similarity functions from the previous section.

enc(v) = a complex function that depends on graph structure

SLIDE 44

Setup

§ Assume we have a graph G:
  § V is the vertex set.
  § A is the adjacency matrix (assume binary).
  § X ∈ ℝ^{m×|V|} is a matrix of node features.
    § Categorical attributes, text, image data (e.g., profile information in a social network).
    § Node degrees, clustering coefficients, etc.
    § Indicator vectors (i.e., one-hot encoding of each node).

SLIDE 45

Neighborhood Aggregation

§ Key idea: generate node embeddings based on local neighborhoods.

[Figure: input graph with a target node A and the tree of neighbors (B, C, D, …) that feed into its embedding]

SLIDE 46

Neighborhood Aggregation

§ Intuition: nodes aggregate information from their neighbors using neural networks.

SLIDE 47

Neighborhood Aggregation

§ Intuition: the network neighborhood defines a computation graph.

Every node defines a unique computation graph!

SLIDE 48

Neighborhood Aggregation

§ Nodes have embeddings at each layer.
§ The model can be of arbitrary depth.
§ The “layer-0” embedding of node u is its input feature, i.e., x_u.

[Figure: the computation graph for target node A, with layer-0 inputs x_A, x_B, x_C, x_E, x_F at the leaves and layers 1 and 2 above]

SLIDE 49

Neighborhood “Convolutions”

§ Neighborhood aggregation can be viewed as a center-surround filter.
§ Mathematically related to spectral graph convolutions (see Bronstein et al., 2017).

SLIDE 50

Neighborhood Aggregation

What’s in the box!?

§ Key distinctions are in how different approaches aggregate information across the layers.

SLIDE 51

Neighborhood Aggregation

§ Basic approach: average neighbor information and apply a neural network.
  1) Average messages from neighbors.
  2) Apply a neural network.

SLIDE 52

The Math

§ Basic approach: average neighbor messages and apply a neural network.

h_v^0 = x_v
h_v^k = σ( W_k Σ_{u ∈ N(v)} h_u^{k−1} / |N(v)| + B_k h_v^{k−1} ),  ∀k > 0

Here h_v^k is the k-th layer embedding of v, the initial “layer 0” embeddings are equal to the node features, σ is a non-linearity (e.g., ReLU or tanh), the first term is the average of the neighbors’ previous-layer embeddings, and h_v^{k−1} is the previous-layer embedding of v itself.
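A minimal numpy sketch of this layer and of stacking K of them (the function names, the ReLU choice for σ, and the row-vector weight convention are assumptions for illustration):

```python
import numpy as np

def gnn_layer(H_prev, adj_list, W, B):
    """h_v^k = ReLU( mean_{u in N(v)} h_u^{k-1} @ W + h_v^{k-1} @ B ) for every node v.
    H_prev: (num_nodes x d_in) previous-layer embeddings; W, B: (d_in x d_out) matrices."""
    num_nodes, d_in = H_prev.shape
    H_next = np.zeros((num_nodes, W.shape[1]))
    for v in range(num_nodes):
        neighbors = adj_list[v]
        agg = H_prev[neighbors].mean(axis=0) if neighbors else np.zeros(d_in)
        H_next[v] = np.maximum(agg @ W + H_prev[v] @ B, 0.0)   # sigma = ReLU
    return H_next

def gnn_encode(X, adj_list, weights):
    """Layer-0 embeddings are the node features X; after K layers, z_v = h_v^K."""
    H = X                                  # h^0 = x
    for W_k, B_k in weights:               # weights = [(W_1, B_1), ..., (W_K, B_K)]
        H = gnn_layer(H, adj_list, W_k, B_k)
    return H
```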

SLIDE 53

Training the Model

§ How do we train the model to generate “high-quality” embeddings?
§ Need to define a loss function on the embeddings, L(z_u)!

SLIDE 54

Training the Model

§ After K layers of neighborhood aggregation, we get output embeddings for each node.
§ We can feed these embeddings into any loss function and run stochastic gradient descent to train the aggregation parameters.

h_v^0 = x_v
h_v^k = σ( W_k Σ_{u ∈ N(v)} h_u^{k−1} / |N(v)| + B_k h_v^{k−1} ),  ∀k ∈ {1, …, K}
z_v = h_v^K

W_k and B_k are the trainable matrices (i.e., what we learn).
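For example, for node classification the output embeddings can be fed into a cross-entropy loss over the labeled nodes. A minimal sketch (numpy; `W_cls`, `labels`, and `labeled_nodes` are illustrative names, and in practice the whole pipeline would be trained end-to-end with automatic differentiation, e.g. PyTorch, rather than with hand-written gradients):

```python
import numpy as np

def node_classification_loss(Z_out, W_cls, labels, labeled_nodes):
    """Cross-entropy over the labeled nodes, using the final embeddings z_v = h_v^K.
    Z_out: (num_nodes x d), W_cls: (d x num_classes), labels: array of class ids."""
    logits = Z_out @ W_cls
    logits = logits - logits.max(axis=1, keepdims=True)               # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean([log_probs[v, labels[v]] for v in labeled_nodes])
```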

SLIDE 55

Training the Model

§ Train in an unsupervised manner using only the graph structure.
§ The unsupervised loss function can be anything from the last section, e.g., based on:
  § Random walks (node2vec, DeepWalk)
  § Graph factorization
§ I.e., train the model so that “similar” nodes have similar embeddings.

SLIDE 56

Training the Model

§ Alternative: directly train the model for a supervised task (e.g., node classification).
  § E.g., “Human or bot?” in an online social network.

SLIDE 57

Overview of Model Design

1) Define a neighborhood aggregation function.
2) Define a loss function on the embeddings, L(z_u).

SLIDE 58

Overview of Model Design

3) Train on a set of nodes, i.e., a batch of compute graphs.

SLIDE 59

Overview of Model Design

4) Generate embeddings for nodes as needed, even for nodes we never trained on!

SLIDE 60

Inductive Capability

§ The same aggregation parameters (W_k, B_k) are shared for all nodes.
§ The number of model parameters is sublinear in |V| and we can generalize to unseen nodes!

[Figure: the compute graphs for nodes A and B share the same parameters W_k and B_k]

SLIDE 61

Inductive Capability

Inductive node embeddings generalize to entirely unseen graphs: e.g., train on a protein interaction graph from model organism A and generate embeddings on newly collected data about organism B.

Train on one graph, generalize to a new graph.

SLIDE 62

Inductive Capability

Many application settings constantly encounter previously unseen nodes (e.g., Reddit, YouTube, Google Scholar, …), so we need to generate new embeddings “on the fly”: train with a snapshot of the graph, and when a new node arrives, generate an embedding for it.
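Because the aggregation parameters W_k, B_k are shared, embedding a newly arrived node only requires running the same layers on its neighborhood. A small sketch reusing the `gnn_encode` helper from the earlier GNN example (illustrative; it simply re-encodes the updated graph with the already-trained weights):

```python
import numpy as np

def embed_new_node(X, adj_list, trained_weights, new_features, new_neighbors):
    """Attach a new node to the graph and compute its embedding with the shared,
    already-trained parameters -- no retraining needed."""
    new_id = X.shape[0]
    X_new = np.vstack([X, new_features])               # add the new node's feature vector
    adj_new = {**adj_list, new_id: list(new_neighbors)}
    for nbr in new_neighbors:                           # make the new edges symmetric
        adj_new[nbr] = adj_new[nbr] + [new_id]
    Z = gnn_encode(X_new, adj_new, trained_weights)     # from the earlier GNN layer sketch
    return Z[new_id]
```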

SLIDE 63

Quick Recap

§ Recap: generate node embeddings by aggregating neighborhood information.
  § Allows for parameter sharing in the encoder.
  § Allows for inductive learning.
§ We saw a basic variant of this idea…

SLIDE 64

Neighborhood Aggregation

§ Key distinctions are in how different approaches aggregate messages.

SLIDE 65

Part 3: Applications

SLIDE 66

Social recommendations

[KDD 2018]

Collaboration with Pinterest.

§ Pins: visual bookmarks (text, images, links)
§ Boards: collections of pins
§ ~200 million monthly active users
§ ~2 billion pins, ~1 billion boards, ~17 billion edges

SLIDE 67

Social recommendations

[KDD 2018]

Task: Given a query pin, recommend related pins.

[Figure: example of a successful recommendation vs. a bad recommendation]

SLIDE 68

Social recommendations

[KDD 2018]

§ Compared with the current production system (highly optimized random-walk based recommendations).
§ 5,000 query images, 20,000 head-to-head comparisons.
§ Users preferred the GNN recommendations 60% of the time.

SLIDE 69

“Fair” recommendations

[Under Review]

SLIDE 70

Graph generation and drug design

[ICML 2018; NeurIPS 2018; AAAI 2019]

SLIDE 71

From learning to reasoning

§ Growing interest in models that are capable of logical induction and combinatorial generalization.
§ Learn “rules” from training data that can generalize to unseen types of data instances (e.g., larger, different structures, …).

SLIDE 72

Reasoning with graphs

[NeurIPS 2018]

SLIDE 73

Reasoning with graphs

SLIDE 74

Questions?