slide-1
SLIDE 1

CS224W: Machine Learning with Graphs Jure Leskovec, Stanford University

http://cs224w.stanford.edu

Project Proposal deadline: tonight, 11:59pm
Course Notes: https://snap-stanford.github.io/cs224w-notes/

Help us write the course notes – we will give generous bonuses!

slide-2
SLIDE 2

¡ Intuition: Map nodes to 𝑑-dimensional embeddings such that similar nodes in the graph are embedded close together


[Figure: f(·) maps a node in the input graph to a 2D node embedding]

How to learn the mapping function 𝑓?

slide-3
SLIDE 3

¡ Goal: Map nodes so that similarity in the embedding space (e.g., dot product) approximates similarity (e.g., proximity) in the network


Input network d-dimensional embedding space

slide-4
SLIDE 4 12/6/18 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 4

Goal: $\mathrm{similarity}(u, v) \approx z_v^\top z_u$

The similarity function in the original network still needs to be defined!

Input network d-dimensional embedding space

slide-5
SLIDE 5

¡ Encoder: Map a node to a low-dimensional vector:
¡ Similarity function defines how relationships in the input network map to relationships in the embedding space:


$\mathrm{enc}(v) = z_v$ (a node in the input graph is mapped to its d-dimensional embedding)

$\mathrm{similarity}(u, v) \approx z_v^\top z_u$ (similarity of u and v in the network ≈ dot product between the node embeddings)

slide-6
SLIDE 6

¡ So far we have focused on “shallow” encoders, i.e., embedding lookups:


Z = embedding matrix, with one column per node: each column is the embedding vector for a specific node, and the number of rows is the dimension/size of the embeddings.
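As a concrete illustration of an embedding-lookup encoder, here is a minimal NumPy sketch (illustrative only, not the lecture's code; the sizes and random initialization are assumptions):

```python
import numpy as np

# A "shallow" encoder: encoding a node is just a column lookup in Z.
# Assumptions (not from the slides): 5 nodes, embedding dimension d = 4, random init.
num_nodes, d = 5, 4
Z = np.random.randn(d, num_nodes)          # embedding matrix, one column per node

def enc(v: int) -> np.ndarray:
    """Return z_v, the d-dimensional embedding of node v."""
    return Z[:, v]

def similarity(u: int, v: int) -> float:
    """Embedding-space similarity: the dot product z_v^T z_u."""
    return float(enc(v) @ enc(u))
```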

slide-7
SLIDE 7

Shallow encoders:

§ One layer of data transformation
§ A single hidden layer maps node v to embedding z_v via a function f(·), e.g., $z_v = f(z_w,\ w \in N(v))$

slide-8
SLIDE 8

¡ Limitations of shallow embedding methods:
§ O(|V|) parameters are needed:
§ No sharing of parameters between nodes
§ Every node has its own unique embedding
§ Inherently “transductive”:
§ Cannot generate embeddings for nodes that are not seen during training
§ Do not incorporate node features:
§ Many graphs have features that we can and should leverage

slide-9
SLIDE 9

¡ Today: We will now discuss deep methods based on graph neural networks:
¡ Note: All these deep encoders can be combined with the node similarity functions defined in the last lecture


enc(v) = multiple layers of non-linear transformations of graph structure
slide-10
SLIDE 10 12/6/18 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 10

Output: Node embeddings. We can also embed larger network structures: subgraphs, entire graphs.

slide-11
SLIDE 11 Jure Leskovec, Stanford University

Images, Text/Speech

Modern deep learning toolbox is designed for simple sequences & grids

slide-12
SLIDE 12

But networks are far more complex!

§ Arbitrary size and complex topological structure (i.e., no spatial locality like grids)
§ No fixed node ordering or reference point
§ Often dynamic and have multimodal features


Networks vs. Images, Text
slide-13
SLIDE 13

CNN on an image:


Goal is to generalize convolutions beyond simple lattices. Leverage node features/attributes (e.g., text, images).

slide-14
SLIDE 14

Single CNN layer with 3x3 filter:

(Animation: Vincent Dumoulin)

Image vs. Graph


Transform information at the neighbors and combine it:

§ Transform “messages” h_i from the neighbors: $W_i h_i$
§ Add them up: $\sum_i W_i h_i$

slide-15
SLIDE 15

But what if your graphs look like this?


¡ Examples: Biological networks, Medical networks, Social networks, Information networks, Knowledge graphs, Communication networks, Web graph, …

slide-16
SLIDE 16

¡ Join adjacency matrix and features
¡ Feed them into a deep neural net
¡ Issues with this idea:

§ O(|V|) parameters
§ Not applicable to graphs of different sizes
§ Not invariant to node ordering

[Figure: toy 5-node graph (A–E) with its adjacency matrix concatenated with node features]
slide-17
SLIDE 17
  • 1. Basics of deep learning for graphs
  • 2. Graph Convolutional Networks
  • 3. Graph Attention Networks (GAT)
  • 4. Practical tips and demos
slide-18
SLIDE 18
slide-19
SLIDE 19

¡ Local network neighborhoods:
§ Describe aggregation strategies
§ Define computation graphs
¡ Stacking multiple layers:
§ Describe the model, parameters, training
§ How to fit the model?
§ Simple example for unsupervised and supervised training

slide-20
SLIDE 20

¡ Assume we have a graph G:
§ V is the vertex set
§ A is the adjacency matrix (assume binary)
§ $X \in \mathbb{R}^{m \times |V|}$ is a matrix of node features
§ Node features:
§ Social networks: user profile, user image
§ Biological networks: gene expression profiles, gene functional information
§ No features:
§ Indicator vectors (one-hot encoding of a node)
§ Vector of constant 1: [1, 1, …, 1]
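A minimal sketch of the two featureless options above, assuming NumPy (illustrative only):

```python
import numpy as np

# When a graph has no node attributes, use indicator (one-hot) vectors or a constant feature.
num_nodes = 5
X_onehot = np.eye(num_nodes)           # row v is the one-hot indicator vector of node v
X_const = np.ones((num_nodes, 1))      # a single constant-1 feature per node
```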

slide-21
SLIDE 21

Idea: Node’s neighborhood defines a computation graph


(1) Determine the node's computation graph  (2) Propagate and transform information

Learn how to propagate information across the graph to compute node features

[Kipf and Welling, ICLR 2017]

slide-22
SLIDE 22

¡ Key idea: Generate node embeddings based

  • n local network neighborhoods
slide-23
SLIDE 23

¡ Intuition: Nodes aggregate information from their neighbors using neural networks


Neural networks

slide-24
SLIDE 24

¡ Intuition: Network neighborhood defines a computation graph


Every node defines a computation graph based on its neighborhood!

slide-25
SLIDE 25

¡ Model can be of arbitrary depth:

§ Nodes have embeddings at each layer
§ Layer-0 embedding of node v is its input feature, x_v
§ Layer-K embedding gets information from nodes that are K hops away


[Figure: computation graph for target node A — Layer-0 inputs x_A, x_B, x_C, x_E, x_F feed Layer-1, which feeds Layer-2]

slide-26
SLIDE 26

¡ Neighborhood aggregation: Key distinctions are in how different approaches aggregate information across the layers


? ? ? ? What is in the box?

slide-27
SLIDE 27

¡ Basic approach: Average information from neighbors and apply a neural network


(1) average messages from neighbors (2) apply neural network

slide-28
SLIDE 28

¡ Basic approach: Average neighbor messages and apply a neural network


$h_v^0 = x_v$ (initial layer-0 embeddings are equal to the node features)

$h_v^k = \sigma\left( W_k \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|} + B_k h_v^{k-1} \right), \quad \forall k \in \{1, \ldots, K\}$

$z_v = h_v^K$ (embedding after K layers of neighborhood aggregation)

Here $\sigma$ is a non-linearity (e.g., ReLU), the sum is the average of the neighbors' previous-layer embeddings, and $h_v^{k-1}$ is the previous-layer embedding of v itself.
slide-29
SLIDE 29


How do we train the model to generate embeddings? Need to define a loss function on the embeddings

slide-30
SLIDE 30

We can feed these embeddings into any loss function and run stochastic gradient descent to train the weight parameters


$h_v^0 = x_v$

$h_v^k = \sigma\left( W_k \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|} + B_k h_v^{k-1} \right), \quad \forall k \in \{1, \ldots, K\}$

$z_v = h_v^K$

$W_k$ and $B_k$ are the trainable weight matrices (i.e., what we learn).

Equivalently rewritten in vector form:
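The slide's matrix form is not reproduced in this transcript; one standard way to rewrite the per-node update above in matrix form (stated here as an assumption consistent with that equation, not copied from the slide) is:

$$H^{(k)} = \sigma\!\left( D^{-1} A\, H^{(k-1)} W_k^\top + H^{(k-1)} B_k^\top \right), \qquad H^{(0)} = X,$$

where the rows of $H^{(k)}$ are the embeddings $h_v^{k}$, $A$ is the adjacency matrix, $D$ is the diagonal degree matrix with $D_{vv} = |N(v)|$, and $X$ stacks the node features.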

slide-31
SLIDE 31

¡ Train in an unsupervised manner:
§ Use only the graph structure
§ “Similar” nodes have similar embeddings
¡ The unsupervised loss function can be anything from the last section, e.g., a loss based on:
§ Random walks (node2vec, DeepWalk, struc2vec)
§ Graph factorization
§ Node proximity in the graph

slide-32
SLIDE 32

Directly train the model for a supervised task (e.g., node classification)


Safe or toxic drug?

E.g., a drug-drug interaction network

slide-33
SLIDE 33

Directly train the model for a supervised task (e.g., node classification)


Safe or toxic drug?

$\mathcal{L} = \sum_{v \in V} y_v \log\big(\sigma(z_v^\top \theta)\big) + (1 - y_v) \log\big(1 - \sigma(z_v^\top \theta)\big)$

Here $z_v$ is the encoder output (node embedding), $\theta$ are the classification weights, and $y_v$ is the node class label.
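A minimal NumPy sketch of evaluating the classification objective above (illustrative only; in practice the negated sum is minimized with stochastic gradient descent):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def node_classification_loss(Z, theta, y, eps=1e-12):
    """sum_v  y_v log σ(z_v^T θ) + (1 - y_v) log(1 - σ(z_v^T θ)).

    Z: (num_nodes, d) node embeddings, theta: (d,) classification weights, y: (num_nodes,) 0/1 labels.
    """
    p = sigmoid(Z @ theta)
    return float(np.sum(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps)))
```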

slide-34
SLIDE 34 12/6/18 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 34

(1) Define a neighborhood aggregation function (2) Define a loss function on the embeddings


slide-35
SLIDE 35 12/6/18 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 35

(3) Train on a set of nodes, i.e., a batch of compute graphs

slide-36
SLIDE 36 12/6/18 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 36

(4) Generate embeddings for nodes as needed Even for nodes we never trained on!

slide-37
SLIDE 37

¡ The same aggregation parameters are shared for all nodes:
§ The number of model parameters is sublinear in |V| and we can generalize to unseen nodes!


[Figure: the compute graphs for node A and node B share the same parameters W_k and B_k]

slide-38
SLIDE 38 12/6/18 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 38

Inductive node embedding: Generalize to entirely unseen graphs. E.g., train on a protein interaction graph from model organism A and generate embeddings on newly collected data about organism B.

Train on one graph Generalize to new graph

zu

slide-39
SLIDE 39 12/6/18 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 39

Train with a snapshot → a new node arrives → generate an embedding for the new node

zu

¡ Many application settings constantly encounter previously unseen nodes:
§ E.g., Reddit, YouTube, Google Scholar
¡ Need to generate new embeddings “on the fly”

slide-40
SLIDE 40

¡ Recap: Generate node embeddings by aggregating neighborhood information
§ We saw a basic variant of this idea
§ Key distinctions are in how different approaches aggregate information across the layers
¡ Next: Describe the GraphSAGE graph neural network architecture

slide-41
SLIDE 41
  • 1. Basics of deep learning for graphs
  • 2. Graph Convolutional Networks
  • 3. Graph Attention Networks (GAT)
  • 4. Practical tips and demos
slide-42
SLIDE 42
slide-43
SLIDE 43

So far we have aggregated the neighbor messages by taking their (weighted) average. Can we do better?


? ? ? ?

[Hamilton et al., NIPS 2017]

slide-44
SLIDE 44

$h_v^k = \sigma\Big( \big[\, A_k \cdot \mathrm{agg}\big(\{ h_u^{k-1},\ \forall u \in N(v) \}\big),\ B_k h_v^{k-1} \,\big] \Big)$

agg(·) can be any differentiable function that maps the set of vectors in N(v) to a single vector. Apply L2 normalization to each node embedding at every layer.

slide-45
SLIDE 45

¡ Simple neighborhood aggregation:

$h_v^k = \sigma\left( W_k \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|} + B_k h_v^{k-1} \right)$

¡ GraphSAGE (concatenate the neighbor embedding and the self embedding, with a generalized aggregation):

$h_v^k = \sigma\Big( \big[\, W_k \cdot \mathrm{agg}\big(\{ h_u^{k-1},\ \forall u \in N(v) \}\big),\ B_k h_v^{k-1} \,\big] \Big)$
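A minimal sketch of the GraphSAGE-style update above, using a mean aggregator for simplicity (illustrative only; the helper names and the choice of aggregator are assumptions, not the reference implementation):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def graphsage_layer(H_prev, neighbors, W_k, B_k):
    """Concatenate the aggregated neighbor message with the self message, apply σ, then ℓ2-normalize.

    H_prev: (num_nodes, d_in); neighbors: dict node -> list of neighbor ids; W_k, B_k: (d_in, d_out).
    """
    out = []
    for v in range(H_prev.shape[0]):
        nbrs = neighbors.get(v, [])
        agg = H_prev[nbrs].mean(axis=0) if nbrs else np.zeros(H_prev.shape[1])
        h = relu(np.concatenate([agg @ W_k, H_prev[v] @ B_k]))   # [W_k·agg(...), B_k·h_v^{k-1}]
        out.append(h / (np.linalg.norm(h) + 1e-12))              # L2 normalization per node embedding
    return np.stack(out)
```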

slide-46
SLIDE 46

¡ Mean: Take a weighted average of the neighbors:

$\mathrm{agg} = \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|}$

¡ Pool: Transform neighbor vectors and apply a symmetric vector function $\gamma$ (element-wise mean/max):

$\mathrm{agg} = \gamma\big(\{ Q\, h_u^{k-1},\ \forall u \in N(v) \}\big)$

¡ LSTM: Apply an LSTM to a reshuffled ordering of the neighbors:

$\mathrm{agg} = \mathrm{LSTM}\big(\big[\, h_u^{k-1},\ \forall u \in \pi(N(v)) \,\big]\big)$
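A minimal sketch of the pool aggregator above (Q and γ as in the formula; the max choice for γ is one of the two options listed):

```python
import numpy as np

def pool_aggregator(H_neighbors, Q):
    """agg = γ({Q h_u^{k-1}, ∀u ∈ N(v)}) with γ = element-wise max over the neighbor set.

    H_neighbors: (num_neighbors, d_in) previous-layer embeddings of v's neighbors; Q: (d_in, d_hidden).
    """
    transformed = H_neighbors @ Q          # Q h_u^{k-1} for every neighbor u
    return transformed.max(axis=0)         # symmetric (order-invariant) element-wise max
```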

slide-47
SLIDE 47 12/6/18 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 47


Key idea: Generate node embeddings based on local neighborhoods

§ Nodes aggregate “messages” from their neighbors using neural networks

¡ Graph convolutional networks:

§ Basic variant: Average neighborhood information and stack neural networks

¡ GraphSAGE:

§ Generalized neighborhood aggregation

slide-48
SLIDE 48

¡ Many aggregations can be performed efficiently by (sparse) matrix operations

¡ Let H^(k) be the matrix whose rows are the layer-k node embeddings; neighborhood sums then become sparse products of the adjacency matrix A with H^(k), as sketched below
¡ Another example: GCN (Kipf et al., 2017)
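A minimal scipy.sparse sketch of these two ideas (illustrative only): neighborhood averaging as sparse matrix products, and the renormalized propagation rule used by GCN (Kipf & Welling, 2017).

```python
import numpy as np
import scipy.sparse as sp

A = sp.csr_matrix(np.array([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]]))   # toy adjacency matrix
H = np.random.randn(3, 4)                              # layer-k node embeddings, one row per node

# Neighborhood averaging as a sparse product: D^{-1} A H
deg = np.asarray(A.sum(axis=1)).ravel()
D_inv = sp.diags(1.0 / np.maximum(deg, 1.0))
neigh_avg = D_inv @ (A @ H)

# GCN uses the symmetric renormalization with self-loops: D̃^{-1/2} Ã D̃^{-1/2} H
A_tilde = A + sp.eye(A.shape[0])
d_tilde = np.asarray(A_tilde.sum(axis=1)).ravel()
D_inv_sqrt = sp.diags(1.0 / np.sqrt(d_tilde))
gcn_propagated = D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H   # next: multiply by a weight matrix and apply σ
```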

slide-49
SLIDE 49
Tutorials and overviews:
§ Relational inductive biases and graph networks (Battaglia et al., 2018)
§ Representation learning on graphs: Methods and applications (Hamilton et al., 2017)
Attention-based neighborhood aggregation:
§ Graph attention networks (Hoshen, 2017; Velickovic et al., 2018; Liu et al., 2018)
Embedding entire graphs:
§ Graph neural nets with edge embeddings (Battaglia et al., 2016; Gilmer et al., 2017)
§ Embedding entire graphs (Duvenaud et al., 2015; Dai et al., 2016; Li et al., 2018) and graph pooling (Ying et al., 2018; Zhang et al., 2018)
§ Graph generation and relational inference (You et al., 2018; Kipf et al., 2018)
§ How powerful are graph neural networks (Xu et al., 2017)
Embedding nodes:
§ Varying neighborhood: Jumping knowledge networks (Xu et al., 2018), GeniePath (Liu et al., 2018)
§ Position-aware GNN (You et al., 2019)
Spectral approaches to graph neural networks:
§ Spectral graph CNN & ChebNet (Bruna et al., 2015; Defferrard et al., 2016)
§ Geometric deep learning (Bronstein et al., 2017; Monti et al., 2017)
Other GNN techniques:
§ Pre-training Graph Neural Networks (Hu et al., 2019)
§ GNNExplainer: Generating Explanations for Graph Neural Networks (Ying et al., 2019)
slide-50
SLIDE 50
  • 1. Basics of deep learning for graphs
  • 2. Graph Convolutional Networks
  • 3. Graph Attention Networks (GAT)
  • 4. Practical tips and demos
slide-51
SLIDE 51
slide-52
SLIDE 52

¡ Recap: Simple neighborhood aggregation:

$h_v^k = \sigma\left( W_k \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|} + B_k h_v^{k-1} \right)$

¡ Graph convolutional operator:
§ Aggregates messages across neighborhoods, N(v)
§ α_vu = 1/|N(v)| is the weighting factor (importance) of node u’s message to node v
§ ⟹ α_vu is defined explicitly, based on the structural properties of the graph
§ ⟹ All neighbors u ∈ N(v) are equally important to node v

slide-53
SLIDE 53

Can we do better than simple neighborhood aggregation? Can we let the weighting factors α_vu be implicitly defined?

¡ Goal: Specify arbitrary importances to different neighbors of each node in the graph
¡ Idea: Compute the embedding h_v^k of each node in the graph following an attention strategy:
§ Nodes attend over their neighborhoods’ messages
§ Implicitly specifying different weights to different nodes in a neighborhood


[Velickovic et al., ICLR 2018; Vaswani et al., NIPS 2017]

slide-54
SLIDE 54

¡ Let α_vu be computed as a byproduct of an attention mechanism a:
§ Let a compute attention coefficients e_vu across pairs of nodes u, v based on their messages:

$e_{vu} = a\big(W_k h_u^{k-1},\ W_k h_v^{k-1}\big)$

§ e_vu indicates the importance of node u’s message to node v
§ Normalize the coefficients using the softmax function so that they are comparable across different neighborhoods:

$\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{w \in N(v)} \exp(e_{vw})}$

$h_v^k = \sigma\Big( \sum_{u \in N(v)} \alpha_{vu}\, W_k h_u^{k-1} \Big)$

Next: What is the form of the attention mechanism a?
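A minimal NumPy sketch of the attention-weighted aggregation above for a single node (illustrative only; the concrete attention mechanism used here — a dot product — is an assumption, since the slides leave the form of a open until the next slide):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_node_update(v, H_prev, neighbors, W, att):
    """h_v^k = σ( Σ_{u∈N(v)} α_vu · W h_u^{k-1} ), with α_vu = softmax of e_vu = att(W h_u, W h_v).

    H_prev: (num_nodes, d_in); neighbors: dict node -> list of neighbor ids; W: (d_in, d_out);
    att: callable taking two transformed messages and returning a scalar coefficient e_vu.
    """
    msgs = H_prev[neighbors[v]] @ W                      # W h_u^{k-1} for every neighbor u
    m_v = H_prev[v] @ W
    e = np.array([att(m_u, m_v) for m_u in msgs])        # attention coefficients e_vu
    alpha = softmax(e)                                   # normalized within v's neighborhood
    return np.tanh((alpha[:, None] * msgs).sum(axis=0))  # σ = tanh here, for illustration

dot_attention = lambda m_u, m_v: float(m_u @ m_v)        # one simple (assumed) choice of mechanism a
```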

slide-55
SLIDE 55

¡ Attention mechanism a:
§ The approach is agnostic to the choice of a
§ E.g., use a simple single-layer neural network
§ a can have parameters, which need to be estimated
§ Parameters of a are trained jointly:
§ Learn them together with the weight matrices (i.e., the other parameters of the neural net) in an end-to-end fashion
¡ Multi-head attention: Stabilize the learning process of the attention mechanism [Velickovic et al., ICLR 2018]:
§ Attention operations in a given layer are independently replicated R times (each replica with different parameters)
§ Outputs are aggregated (by concatenating or adding)

slide-56
SLIDE 56

¡ Key benefit: Allows for (implicitly) specifying different importance values (α_vu) to different neighbors
¡ Computationally efficient:
§ Computation of attentional coefficients can be parallelized across all edges of the graph
§ Aggregation may be parallelized across all nodes
¡ Storage efficient:
§ Sparse matrix operations do not require more than O(V+E) entries to be stored
§ Fixed number of parameters, irrespective of graph size
¡ Trivially localized:
§ Only attends over local network neighborhoods
¡ Inductive capability:
§ It is a shared edge-wise mechanism
§ It does not depend on the global graph structure

slide-57
SLIDE 57

¡ t-SNE plot of GAT-based node embeddings:

§ Node color: 7 publication classes
§ Edge thickness: Normalized attention coefficients between nodes i and j, summed across the eight attention heads, $\sum_k \big(\alpha_{ij}^k + \alpha_{ji}^k\big)$

The attention mechanism can be used with many different graph neural network models. In many cases, attention leads to performance gains.
slide-58
SLIDE 58
  • 1. Basics of deep learning for graphs
  • 2. Graph Convolutional Networks (GCN)
  • 3. Graph Attention Networks (GAT)
  • 4. Practical tips and demos
slide-59
SLIDE 59 12/6/18 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 59
slide-60
SLIDE 60

¡ 300M users
¡ 4+B pins, 2+B boards

slide-61
SLIDE 61 12/6/18 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 61

Human curated collection of pins

Pin: A visual bookmark someone has saved from the internet to a board they’ve created.
Pin: Image, text, link
Board: A collection of ideas (pins having something in common)

slide-62
SLIDE 62

Graph: 2B pins, 1B boards, 20B edges

¡ Graph is dynamic: Need to apply to new nodes without model retraining

¡ Rich node features: Content, images

slide-63
SLIDE 63

¡ PinSage graph convolutional network:
§ Goal: Generate embeddings for nodes (e.g., Pins/images) in a web-scale Pinterest graph containing billions of objects
§ Key idea: Borrow information from nearby nodes
§ E.g., a bed rail Pin might look like a garden fence, but gates and beds are rarely adjacent in the graph
§ Pin embeddings are essential to various tasks like recommendation of Pins, classification, clustering, ranking
§ Services like “Related Pins”, “Search”, “Shopping”, “Ads”


[Ying et al., WWW 2018]

slide-64
SLIDE 64

Goal: Map nodes to d-dimensional embeddings such that nodes that are related are embedded close together


[Figure: nodes u and v in the input graph are mapped to embeddings z_u and z_v in the d-dimensional embedding space]

slide-65
SLIDE 65

¡ Challenges:
§ Massive size: 3 billion nodes, 20 billion edges
§ Heterogeneous data: Rich image and text features

Task: Recommend related pins to users

Source pin

Task: Learn node embeddings $z_i$ such that $d(z_{\text{source}}, z_{\text{positive}}) < d(z_{\text{source}}, z_{\text{negative}})$
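A minimal sketch of a max-margin ranking objective consistent with the inequality above (PinSage trains with a max-margin loss; the dot-product similarity and margin value used here are assumptions):

```python
import numpy as np

def max_margin_loss(z_source, z_positive, z_negative, margin=0.1):
    """Encourage sim(source, positive) to exceed sim(source, negative) by at least `margin`."""
    pos = float(z_source @ z_positive)       # similarity to the related (positive) pin
    neg = float(z_source @ z_negative)       # similarity to the negative sample
    return max(0.0, neg - pos + margin)
```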

slide-66
SLIDE 66

Goal: Identify target pin among 3B pins

¡ Issue: Need to learn with resolution of 100 vs. 3B
¡ Idea: Use harder and harder negative samples
¡ Include more and more hard negative samples for each epoch


Source pin | Positive | Hard negative | Easy negative

slide-67
SLIDE 67

¡ How to scale the training as well as inference of node embeddings to graphs with billions of nodes and tens of billions of edges?

§ 10,000X larger dataset than any previous graph neural network application

¡ Key innovations:

§ Sub-sample neighborhoods for efficient GPU batching
§ Producer-consumer CPU-GPU training pipeline
§ Curriculum learning for negative samples
§ MapReduce for efficient inference

slide-68
SLIDE 68

¡ Three key innovations:

  • 1. On-the-fly graph convolutions

§ Sample the neighborhood around a node and dynamically construct a computation graph
§ Perform a localized graph convolution around a particular node
§ Does not need the entire graph during training

slide-69
SLIDE 69

¡ Three key innovations:

  • 1. On-the-fly graph convolutions
  • 2. Constructing convolutions via random walks

§ Performing convolutions on full neighborhoods is infeasible:

§ How to select the set of neighbors of a node to convolve over?

§ Importance pooling: Define importance-based neighborhoods by simulating random walks and selecting the neighbors with the highest visit counts

  • 3. Efficient MapReduce inference

§ Bottom-up aggregation of node embeddings lends itself to MapReduce

§ Decompose each aggregation step across all nodes into three operations in MapReduce, i.e., map, join, and reduce
slide-70
SLIDE 70

¡ Baselines:
§ Visual: Nearest neighbors of CNN visual embeddings for recommendations
§ Annotation: Nearest neighbors in terms of Word2vec embeddings
§ Combined: Concatenate embeddings
§ Uses the exact same data and loss function as PinSage


PinSage gives 150% improvement in hit rate and 60% improvement in MRR over the best baseline

slide-71
SLIDE 71
Pixie is a purely graph-based method that uses biased random walks to generate ranking scores by simulating random walks starting at the query Pin. Items with the top scores are retrieved as recommendations [Eksombatchai et al., 2018]
slide-72
SLIDE 72
[Figure: example recommendations for a query Pin from Pixie, GraphSAGE, and PinSAGE]

slide-73
SLIDE 73
[Figure: another example query Pin with recommendations from Pixie, GraphSAGE, and PinSAGE]

slide-74
SLIDE 74
slide-75
SLIDE 75

¡ Data preprocessing is important:
§ Use renormalization tricks
§ Variance-scaled initialization
§ Network data whitening
¡ ADAM optimizer:
§ ADAM naturally takes care of decaying the learning rate
¡ ReLU activation function often works really well
¡ No activation function at your output layer:
§ Easy mistake if you build layers with a shared function
¡ Include bias term in every layer
¡ GCN layer of size 64 or 128 is already plenty

slide-76
SLIDE 76

¡ Debug?!:

§ Loss/accuracy not converging during training

¡ Important for model development:

§ Overfit on training data:

§ Accuracy should be essentially 100% or error close to 0
§ If the neural network cannot overfit a single data point, something is wrong

§ Scrutinize your loss function!
§ Scrutinize your visualizations!

slide-77
SLIDE 77 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 77 12/6/18
slide-78
SLIDE 78 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 78 12/6/18
slide-79
SLIDE 79
  • 1. Basics of deep learning for graphs
  • 2. Graph Convolutional Networks
  • 3. Graph Attention Networks (GAT)
  • 4. Practical tips and demos