Graph Representation Learning: Embedding, GNNs, and Pre-Training

SLIDE 1

Graph Representation Learning:

Embedding, GNNs, and Pre-Training

Yuxiao Dong

https://ericdongyx.github.io/

Microsoft Research, Redmond

SLIDE 2

Joint Work with

Jiezhong Qiu (Tsinghua, advised by Jie Tang) · Jie Tang (Tsinghua) · Yizhou Sun (UCLA) · Ziniu Hu (UCLA, advised by Yizhou Sun) · Hongxia Yang (Alibaba) · Hao Ma (Facebook AI) · Kuansan Wang (Microsoft Research) · Jing Zhang (Renmin U. of China)

SLIDE 3

Why Graphs?

SLIDE 4

Graphs

Office/Social Graph · Internet · Knowledge Graph · Biological Neural Networks · Transportation · Academic Graph

figure credit: Web

SLIDE 5

The Graph Mining Paradigm

hand-crafted feature matrix X (via feature engineering) → machine learning models

$x_{ik}$: node $v_i$'s $k$-th feature, e.g., $v_i$'s PageRank value

Graph & Network applications

  • Node classification
  • Link prediction
  • Community detection
  • Anomaly detection
  • Social influence
  • Graph evolution
  • … …

Structural Diversity and Homophily: A Study Across More Than One Hundred Big Networks. KDD 2017.

SLIDE 6

Graph Representation Learning

latent feature matrix Z (via feature learning, instead of hand-crafted feature engineering) → machine learning models

  • Input: a network $G = (V, E)$
  • Output: $Z \in \mathbb{R}^{|V| \times d}$, $d \ll |V|$, i.e., a $d$-dimensional vector $\mathbf{z}_v$ for each node $v$.

Graph & Network applications

  • Node classification
  • Link prediction
  • Community detection
  • Anomaly detection
  • Social influence
  • Graph evolution
  • … …
SLIDE 7

Application: Embedding Heterogeneous Academic Graph

Academic Graph

Graph Representation Learning

1. https://academic.microsoft.com/ 2. Kuansan Wang et al. Microsoft Academic Graph: When experts are not enough. Quantitative Science Studies 1 (1), 396-413, 2020 3. Dong et al. metapath2vec: scalable representation learning for heterogeneous networks. In KDD 2017. 4. Code & data for metapath2vec: https://ericdongyx.github.io/metapath2vec/m2v.html

SLIDE 8

Application: Similarity Search & Recommendation

[Figure: institutions retrieved as most similar to Harvard in the embedding space: Stanford, Columbia, Yale, UChicago, Johns Hopkins]

1. https://academic.microsoft.com/ 2. Kuansan Wang et al. Microsoft Academic Graph: When experts are not enough. Quantitative Science Studies 1 (1), 396-413, 2020 3. Dong et al. metapath2vec: scalable representation learning for heterogeneous networks. In KDD 2017. 4. Code & data for metapath2vec: https://ericdongyx.github.io/metapath2vec/m2v.html

SLIDE 9

Application: Reasoning about Diabetes from MAG

[Figure: diabetes-related entities linked by Cause / Symptom / Treatment relations]

SLIDE 10

Application: Reasoning about COVID-19 from MAG

[Figure: COVID-19-related entities (SARS-CoV-2, Coronavirus, MERS, Zika Virus, Ebola Virus), symptoms (rash, wasting, abdominal pain, diarrhea, asymptomatic), and treatments (antiviral drugs, Azithromycin, Lamivudine, Oseltamivir, post-exposure prophylaxis), linked by Cause / Symptom / Treatment relations]

SLIDE 11

Graph Representation Learning

Network Embedding · Matrix Factorization · Pre-Training · GNNs

SLIDE 12

Network Embedding

Feature learning over sequences of objects:

  • Words in text (word2vec)
  • Nodes in graphs (DeepWalk)

Skip-Gram over a window $w_{t-2}\; w_{t-1}\; w_t\; w_{t+1}\; w_{t+2}$

1. Mikolov, et al. Efficient estimation of word representations in vector space. In ICLR 2013. 2. Perozzi et al. DeepWalk: Online learning of social representations. In KDD'14, pp. 701–710.

SLIDE 13

Distributional Hypothesis of Harris

  • Word embedding: words in similar contexts have similar meanings (e.g., skip-gram in word embedding)
  • Node embedding: nodes in similar structural contexts are similar
  • DeepWalk: structural contexts are defined by co-occurrence over random-walk paths

Harris, Z. (1954). Distributional structure. Word, 10(23): 146-162.

SLIDE 14

The Objective

$\mathcal{L}$ → maximize the likelihood of node co-occurrence on random-walk paths; $p(c \mid v)$ → the probability that node $v$ and context $c$ appear on a random-walk path:

$$\mathcal{L} = \sum_{v \in V} \sum_{c \in N_{\mathrm{rw}}(v)} -\log p(c \mid v), \qquad p(c \mid v) = \frac{\exp(\mathbf{z}_c^{\top} \mathbf{z}_v)}{\sum_{u \in V} \exp(\mathbf{z}_u^{\top} \mathbf{z}_v)}$$

where $\mathbf{z}_v$ is node $v$'s embedding vector and $N_{\mathrm{rw}}(v)$ is $v$'s context in the random walks.

SLIDE 15

$w_{t-2}\; w_{t-1}\; w_t\; w_{t+1}\; w_{t+2}$

Network Embedding: Random Walk + Skip-Gram

Random Walk Strategies:

  • DeepWalk (walk length > 1)
  • LINE (walk length = 1)
  • PTE (walk length = 1)
  • node2vec (biased random walk)
  • metapath2vec (heterogeneous random walk)

1. Perozzi et al. DeepWalk: Online learning of social representations. In KDD’ 14. Most Cited Paper in KDD’14. 2. Tang et al. LINE: Large scale information network embedding. In WWW’15. Most Cited Paper in WWW’15. 3. Grover and Leskovec. node2vec: Scalable feature learning for networks. In KDD’16. 2nd Most Cited Paper in KDD’16. 4. Dong et al. metapath2vec: scalable representation learning for heterogeneous networks. In KDD 2017. Most Cited Paper in KDD’17.
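To make the recipe concrete, here is a minimal DeepWalk-style sketch (an illustration under assumed hyperparameters, not any of the cited implementations): uniform random walks generated over a networkx graph, then skip-gram with negative sampling via gensim's Word2Vec.

```python
# A minimal DeepWalk-style sketch (illustrative; hyperparameters are assumptions).
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(G, num_walks=10, walk_length=40):
    """Generate uniform random walks; each node starts num_walks walks."""
    walks = []
    nodes = list(G.nodes())
    for _ in range(num_walks):
        random.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(random.choice(nbrs))
            walks.append([str(n) for n in walk])  # gensim expects string tokens
    return walks

G = nx.karate_club_graph()
walks = random_walks(G)
# Skip-gram (sg=1) with negative sampling approximates the softmax objective above.
model = Word2Vec(walks, vector_size=128, window=10, sg=1, negative=5, min_count=0)
emb = model.wv[str(0)]  # 128-dim embedding of node 0
```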

SLIDE 16

Graph Representation Learning

Network Embedding · Matrix Factorization · Pre-Training · GNNs

  • DeepWalk
  • LINE
  • node2vec
  • PTE
  • …
  • metapath2vec
SLIDE 17

NetMF: Network Embedding as Matrix Factorization

  • 1. Qiu et al. Network embedding as matrix factorization: unifying deepwalk, line, pte, and node2vec. In WSDM’18.
  • DeepWalk
  • LINE
  • PTE
  • node2vec

π‘€π‘π‘š 𝐻 = (

!

(

"

𝐡!" 𝑩 Adjacency matrix 𝑬 Degree matrix b: #negative samples T: context window size

hide

SLIDE 18

π‘₯! π‘₯!"# π‘₯!"$ π‘₯!%$ π‘₯!%#

log(#(𝒙, 𝒅)|𝒠| 𝑐#(π‘₯)#(𝑑))

  • 𝐻: graph
  • 𝑩: adjacency matrix
  • 𝑬:degree matrix
  • π‘€π‘π‘š 𝐻 : volume of 𝐻

Levy and Goldberg. Neural word embeddings as implicit matrix factorization. In NIPS 2014

  • #(w,c): co-occurrence of w & c
  • #(w): occurrence of word w
  • #(c): occurrence of context c
  • 𝒠: wordβˆ’context pair (w, c) multiβˆ’set
  • |𝒠|: number of word-context pairs

Understanding Random Walk + Skip Gram

Graph Language NLP Language

Skip-Gram

SLIDE 19

Understanding Random Walk + Skip Gram

  • Partition the multiset $\mathcal{D}$ into several sub-multisets according to the way in which each node and its context appear in a random-walk node sequence (distinguishing direction and distance).
  • More formally, for $r = 1, 2, \cdots, T$, we define $\mathcal{D}_{\overrightarrow{r}}$ ($\mathcal{D}_{\overleftarrow{r}}$) as the sub-multiset of pairs in which the context $c$ appears $r$ steps after (before) the node $w$.

NLP Language:

  • #(w, c): co-occurrence count of word $w$ and context $c$
  • #(w): occurrence count of word $w$
  • #(c): occurrence count of context $c$
  • $\mathcal{D}$: multiset of word–context pairs $(w, c)$
  • $|\mathcal{D}|$: number of word–context pairs

SLIDE 20

Understanding Random Walk + Skip Gram

as the length of the random walk $L \to \infty$

SLIDE 21

Understanding Random Walk + Skip Gram

π‘€π‘π‘š 𝐻 = (

!

(

"

𝐡!" 𝑩 Adjacency matrix 𝑬 Degree matrix b: #negative samples T: context window size Graph Language

SLIDE 22

π‘₯! π‘₯!"# π‘₯!"$ π‘₯!%$ π‘₯!%#

DeepWalk is asymptotically and implicitly factorizing

1. Qiu et al. Network embedding as matrix factorization: unifying deepwalk, line, pte, and node2vec. In WSDM’18.

Understanding Random Walk + Skip Gram

π‘€π‘π‘š 𝐻 = (

!

(

"

𝐡!" 𝑩 Adjacency matrix 𝑬 Degree matrix b: #negative samples T: context window size

SLIDE 23

Unifying DeepWalk, LINE, PTE, & node2vec as Matrix Factorization

Qiu et al. Network embedding as matrix factorization: unifying deepwalk, line, pte, and node2vec. In WSDM’18. The most cited paper in WSDM’18 as of May 2019

  • DeepWalk
  • LINE
  • PTE
  • node2vec
SLIDE 24

NetMF: Explicitly Factorizing the DeepWalk Matrix

DeepWalk is asymptotically and implicitly factorizing

$$T = \log\left(\frac{\mathrm{vol}(G)}{bT}\left(\sum_{r=1}^{T}(D^{-1}A)^{r}\right)D^{-1}\right)$$

NetMF constructs this matrix explicitly and factorizes it.

1. Qiu et al. Network embedding as matrix factorization: unifying deepwalk, line, pte, and node2vec. In WSDM'18. 2. Code & data for NetMF: https://github.com/xptree/NetMF

SLIDE 25

NetMF

1. Construction: build the DeepWalk matrix $T$ from $A$. 2. Factorization: factorize $\log \max(T, 1)$ (element-wise) with truncated SVD to obtain the embeddings.

1. Qiu et al. Network embedding as matrix factorization: unifying deepwalk, line, pte, and node2vec. In WSDM'18. 2. Code & data for NetMF: https://github.com/xptree/NetMF
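A small dense-NetMF sketch of the two steps above (illustrative assumptions: a tiny undirected graph, window size T=10, b=1, dimension d=32; not the released NetMF code):

```python
# Dense NetMF sketch: build the DeepWalk matrix, truncate, log, then SVD.
import numpy as np
import networkx as nx

def netmf_embed(A, T=10, b=1, d=32):
    n = A.shape[0]
    vol = A.sum()
    Dinv = np.diag(1.0 / A.sum(axis=1))
    P = Dinv @ A                        # random-walk transition matrix D^{-1}A
    S = np.zeros_like(A)
    Pr = np.eye(n)
    for _ in range(T):                  # sum_{r=1..T} (D^{-1}A)^r
        Pr = Pr @ P
        S += Pr
    M = (vol / (b * T)) * S @ Dinv      # the DeepWalk matrix
    logM = np.log(np.maximum(M, 1.0))   # element-wise truncated logarithm
    U, s, _ = np.linalg.svd(logM)       # factorize log max(M, 1)
    return U[:, :d] * np.sqrt(s[:d])    # embedding = U_d diag(sqrt(sigma_d))

A = nx.to_numpy_array(nx.karate_club_graph())
Z = netmf_embed(A)  # one 32-dim vector per node
```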

SLIDE 26

Results

1. Qiu et al. Network embedding as matrix factorization: unifying deepwalk, line, pte, and node2vec. In WSDM'18. 2. Code & data for NetMF: https://github.com/xptree/NetMF

Explicit matrix factorization (NetMF) offers performance gains over implicit matrix factorization (DeepWalk & LINE)

SLIDE 27

Network Embedding

Input: adjacency matrix $A$ → similarity matrix $T = g(A)$ → Output: embedding vectors $Z$

  • DeepWalk, LINE, node2vec, metapath2vec: random walk + skip-gram (implicit factorization)
  • NetMF: (dense) matrix factorization of $T = g(A)$

1. Qiu et al. Network embedding as matrix factorization: unifying deepwalk, line, pte, and node2vec. In WSDM'18. 2. Code & data for NetMF: https://github.com/xptree/NetMF

SLIDE 28

Challenge?

$T$ is dense: $O(n^2)$ non-zeros, and factorizing it takes $O(n^3)$ time.

SLIDE 29

NetMF

How can we solve this issue? (NetMF: 1. dense construction of $T$; 2. dense factorization.)

  • 1. Qiu et al. NetSMF: Network embedding as sparse matrix factorization. In WWW 2019.
  • 2. Code & data for NetSMF: https://github.com/xptree/NetSMF

SLIDE 30

NetSMF: Sparse

How can we solve this issue? NetSMF: 1. sparse construction of $T$; 2. sparse factorization.

  • 1. Qiu et al. NetSMF: Network embedding as sparse matrix factorization. In WWW 2019.
  • 2. Code & data for NetSMF: https://github.com/xptree/NetSMF
SLIDE 31

Sparsify $T$

For a random-walk matrix polynomial $L = \sum_{r=1}^{T} \alpha_r D (D^{-1}A)^r$, where $\sum_{r=1}^{T} \alpha_r = 1$ and the $\alpha_r$ are non-negative, one can construct a $(1+\epsilon)$-spectral sparsifier $\tilde{L}$ with $O(n \log n / \epsilon^2)$ non-zeros, in $O(T^2 m \log n / \epsilon^2)$ time, for undirected graphs.

1. Dehua Cheng, Yu Cheng, Yan Liu, Richard Peng, and Shang-Hua Teng. Efficient Sampling for Gaussian Graphical Models via Spectral Sparsification. COLT 2015. 2. Dehua Cheng, Yu Cheng, Yan Liu, Richard Peng, and Shang-Hua Teng. Spectral sparsification of random-walk matrix polynomials. arXiv:1502.03496.

SLIDE 32

Sparsify $T$

Applying the theorem to the random-walk matrix polynomial inside $T$: one can construct a $(1+\epsilon)$-spectral sparsifier with $O(n \log n / \epsilon^2)$ non-zeros in $O(T^2 m \log n / \epsilon^2)$ time, for undirected graphs.

  • 1. Qiu et al. NetSMF: Network embedding as sparse matrix factorization. In WWW 2019.
  • 2. Code & data for NetSMF: https://github.com/xptree/NetSMF
SLIDE 33

NetSMF: Sparse

Factorize the constructed sparse matrix.

  • 1. Qiu et al. NetSMF: Network embedding as sparse matrix factorization. In WWW 2019.
  • 2. Code & data for NetSMF: https://github.com/xptree/NetSMF
SLIDE 34

NetSMF: Bounded Approximation Error

[The approximation error between the sparsified matrix and the original is bounded; see the paper for the formal statement.]

  • 1. Qiu et al. NetSMF: Network embedding as sparse matrix factorization. In WWW 2019.
  • 2. Code & data for NetSMF: https://github.com/xptree/NetSMF
SLIDE 35
Results

  • #non-zeros: ~4.5 quadrillion → ~45 billion after sparsification

  • 1. Qiu et al. NetSMF: Network embedding as sparse matrix factorization. In WWW 2019.
  • 2. Code & data for NetSMF: https://github.com/xptree/NetSMF

SLIDE 36
  • 1. Qiu et al. NetSMF: Network embedding as sparse matrix factorization. In WWW 2019.
  • 2. Code & data for NetSMF: https://github.com/xptree/NetSMF

Effectiveness: NetSMF (sparse MF) β‰ˆ NetMF (explicit MF) > DeepWalk/LINE (implicit MF). Efficiency: NetSMF (sparse MF) can handle billion-scale network embedding.

Results

  • 30%–100% improvements over LINE on billion-scale graphs

SLIDE 38

Network Embedding

Input: adjacency matrix $A$ → similarity matrix $T = g(A)$ → Output: embedding vectors $Z$

  • DeepWalk, LINE, node2vec, metapath2vec: random walk + skip-gram (implicit factorization)
  • NetMF: (dense) matrix factorization of $T$
  • NetSMF: sparsify $T$, then (sparse) matrix factorization

Incorporate network structures $A$ into the similarity matrix $T$, and then factorize $T$.

  • 1. Qiu et al. NetSMF: Network embedding as sparse matrix factorization. In WWW 2019.
  • 2. Code & data for NetSMF: https://github.com/xptree/NetSMF
SLIDE 39

ProNE: Propagation based Network Embedding

  • 1. Zhang et al. ProNE: Fast and Scalable Network Representation Learning. In IJCAI 2019
  • 2. Code & data for ProNE: https://github.com/THUDM/ProNE
SLIDE 40

Spectral Propagation

$$R_d \leftarrow D^{-1} A \, (I_n - \tilde{L}) \, R_d$$

$\tilde{L}$ is a spectral filter of the Laplacian $L = I_n - D^{-1}A$; $D^{-1}A(I_n - \tilde{L})$ is $D^{-1}A$ modulated by the filter in the spectrum: the idea of Graph Neural Networks.

  • 1. Zhang et al. ProNE: Fast and Scalable Network Representation Learning. In IJCAI 2019
  • 2. Code & data for ProNE: https://github.com/THUDM/ProNE
SLIDE 41

Chebyshev Expansion for Efficiency

  • To avoid explicit eigendecomposition and Fourier transform
  • Approximate the spectral filter with a truncated Chebyshev expansion: $T_0(x) = 1$, $T_1(x) = x$, $T_{k+1}(x) = 2x\,T_k(x) - T_{k-1}(x)$

  • 1. Zhang et al. ProNE: Fast and Scalable Network Representation Learning. In IJCAI 2019
  • 2. Code & data for ProNE: https://github.com/THUDM/ProNE
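A hedged sketch of the trick (illustrative, not the ProNE release): apply a Chebyshev-expanded spectral filter of the Laplacian to an embedding matrix using only sparse products; the filter coefficients `coeffs` are assumptions.

```python
# Chebyshev-filtered propagation sketch: no eigendecomposition needed.
import numpy as np
import scipy.sparse as sp

def chebyshev_propagate(A, R, coeffs=(0.5, 0.3, 0.2)):
    """Apply sum_k coeffs[k] * T_k(L_hat) to embeddings R via the recursion."""
    n = A.shape[0]
    deg = np.asarray(A.sum(axis=1)).ravel()
    deg[deg == 0] = 1.0                      # guard isolated nodes
    L = sp.eye(n) - sp.diags(1.0 / deg) @ A  # random-walk Laplacian I - D^{-1}A
    L_hat = L - sp.eye(n)                    # rescale spectrum from [0,2] to [-1,1]
    T_prev, T_cur = R, L_hat @ R             # T_0(L_hat) R and T_1(L_hat) R
    out = coeffs[0] * T_prev + coeffs[1] * T_cur
    for c in coeffs[2:]:                     # T_{k+1} = 2 L_hat T_k - T_{k-1}
        T_prev, T_cur = T_cur, 2 * (L_hat @ T_cur) - T_prev
        out = out + c * T_cur
    return out

A = sp.random(100, 100, density=0.05, format="csr")
A = A + A.T                                  # symmetrize: undirected toy graph
R = np.random.randn(100, 32)                 # embeddings (e.g., from sparse MF)
R_prop = chebyshev_propagate(A, R)
```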
SLIDE 42

Efficiency

[Figure: runtime on a 1.1M-node network; 19 hours / 98 mins / 10 mins; 1 thread vs. 20 threads]

  • 1. Zhang et al. ProNE: Fast and Scalable Network Representation Learning. In IJCAI 2019
  • 2. Code & data for ProNE: https://github.com/THUDM/ProNE

ProNE offers 10–400Γ— speedups (1 thread vs. 20 threads). Embedding 100,000,000 nodes on 1 thread takes 29 hours, with performance superiority.

SLIDE 43

Scalability

Embedding 100,000,000 nodes on 1 thread takes 29 hours, with performance superiority.

SLIDE 44

ProNE: A General Propagation Framework


  • 1. Zhang et al. ProNE: Fast and Scalable Network Representation Learning. In IJCAI 2019
  • 2. Code & data for ProNE: https://github.com/THUDM/ProNE
SLIDE 45

Network Embedding

Input: adjacency matrix $A$ → Output: embedding vectors $Z$

  • DeepWalk, LINE, node2vec, metapath2vec: random walk + skip-gram (implicit factorization of $T = g(A)$)
  • NetMF: (dense) matrix factorization of $T$
  • NetSMF: sparsify $T$, then (sparse) matrix factorization
  • ProNE: (sparse) matrix factorization of $A$, then spectral propagation $Z = g(Z')$

ProNE: factorize $A$ first, and then incorporate network structures via spectral propagation.

SLIDE 46

Graph Representation Learning

Network Embedding · Matrix Factorization · Pre-Training · GNNs

  • DeepWalk
  • LINE
  • node2vec
  • PTE
  • …
  • metapath2vec
  • NetMF
  • NetSMF
  • …
  • ProNE (Propagation)
SLIDE 47

Connecting NE with Graph Neural Networks

[Figure: node $v$ with neighbors $a, b, c, d, e$]

$$h_v = g(h_v, h_a, h_b, h_c, h_d, h_e)$$

  • 1. Justin Gilmer, et al. Neural message passing for quantum chemistry. In ICML 2017.
  • 2. Zhang et al. ProNE: Fast and Scalable Network Representation Learning. In IJCAI 2019

  • ProNE: propagation-based network embedding, $R_d \leftarrow D^{-1}A(I_n - \tilde{L})R_d$
  • GNN: neighborhood aggregation, i.e., aggregate neighbor information and pass it into a neural network

SLIDE 48

Graph Neural Networks

[Figure: node $v$ with neighbors $a, b, c, d, e$]

Neighborhood Aggregation:

  • Aggregate neighbor information and pass it into a neural network
  • It can be viewed as a center-surround filter in CNNs: graph convolutions!

  • 1. Choose neighborhood
  • 2. Determine the order of selected neighbors
  • 3. Parameter sharing

[Figure: CNN filter vs. graph convolution]

  • 1. Niepert et al. Learning Convolutional Neural Networks for Graphs. In ICML 2016
  • 2. Defferrard et al. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In NIPS 2016

SLIDE 49

Graph Convolutional Networks π’Š!

( = 𝜏(𝑿(

#

)∈* ! βˆͺ!

π’Š)

(,-

|𝑂(𝑣)||𝑂(𝑀)| )

the neighbors of node 𝑀 node 𝑀’s embedding at layer 𝑙 Non-linear activation function (e.g., ReLU) parameters in layer 𝑙 a e v b d c

  • 1. Kipf et al. Semisupervised Classification with Graph Convolutional Networks. ICLR 2017

𝑰( = 𝜏 5 𝑩𝑰 (,- 𝑿 (

normalized Laplacian matrix

Aggregate info from neighborhood via the normalized Laplacian matrix
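A minimal NumPy sketch of one GCN layer as written above (illustrative; production implementations use sparse operations and a framework such as PyTorch):

```python
# One GCN layer: H' = ReLU(D~^{-1/2} (A + I) D~^{-1/2} H W).
import numpy as np

def gcn_layer(A, H, W):
    A_tilde = A + np.eye(A.shape[0])            # add self-loops
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))      # D~^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)       # aggregate, transform, ReLU

rng = np.random.default_rng(0)
A = (rng.random((5, 5)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                  # undirected, no self-loops yet
H = rng.normal(size=(5, 8))                     # input node features H^(0)
W = rng.normal(size=(8, 16))                    # layer parameters W^(1)
H1 = gcn_layer(A, H, W)                         # (5, 16) next-layer embeddings
```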

SLIDE 50

Graph Convolutional Networks π’Š!

( = 𝜏(𝑿(

#

)∈* !

π’Š)

(,-

𝑂 𝑣 𝑂 𝑀 + 𝑿(#

!

π’Š!

(,-

|𝑂(𝑀)||𝑂(𝑀)| )

  • 1. Kipf et al. Semisupervised Classification with Graph Convolutional Networks. ICLR 2017

a e v b d c Aggregate from 𝑀’s neighbors Aggregate from itself

hide

SLIDE 51

Graph Convolutional Networks π’Š!

( = 𝜏(𝑿(

#

)∈* !

π’Š)

(,-

𝑂 𝑣 𝑂 𝑀 + 𝑿(#

!

π’Š!

(,-

|𝑂(𝑀)||𝑂(𝑀)| )

Kipf et al. Semisupervised Classification with Graph Convolutional Networks. ICLR 2017

a e v b d c The same parameters for both its neighbors & itself

hide

SLIDE 52

Graph Convolutional Networks π’Š!

( = 𝜏(𝑿(

#

)∈* !

π’Š)

(,-

𝑂 𝑣 𝑂 𝑀 + 𝑿(#

!

π’Š!

(,-

|𝑂(𝑀)||𝑂(𝑀)| )

Kipf et al. Semisupervised Classification with Graph Convolutional Networks. ICLR 2017

a e v b d c

𝑬,-

.𝑩𝑬,- .𝑰 (,- 𝑿 (

𝑬,-

.𝑱𝑬,- .𝑰 (,- 𝑿 (

hide

SLIDE 53

Graph Convolutional Networks

$$H^{(l)} = \sigma\Big(\tilde{D}^{-\frac{1}{2}} (A + I) \tilde{D}^{-\frac{1}{2}} H^{(l-1)} W^{(l)}\Big)$$

Input: $G = (V, E, A)$ with $H^{(0)} = X$ · Output: $Z = H^{(L)}$

Kipf et al. Semi-supervised Classification with Graph Convolutional Networks. ICLR 2017

SLIDE 54

Graph Convolutional Networks

$$H^{(l)} = \sigma\Big(\tilde{D}^{-\frac{1}{2}} (A + I) \tilde{D}^{-\frac{1}{2}} H^{(l-1)} W^{(l)}\Big)$$

Input: $G = (V, E, A)$ with $H^{(0)} = X$ · Output: $Z = H^{(L)}$

  • Model training
  • The common setting is an end-to-end training framework with a supervised task
  • That is, define a loss function over the output $Z$

Kipf et al. Semi-supervised Classification with Graph Convolutional Networks. ICLR 2017

SLIDE 55

Graph Convolutional Networks

$$H^{(l)} = \sigma\Big(\tilde{D}^{-\frac{1}{2}} (A + I) \tilde{D}^{-\frac{1}{2}} H^{(l-1)} W^{(l)}\Big)$$

Input: $G = (V, E, A)$ with $H^{(0)} = X$ · Output: $Z = H^{(L)}$

  • Benefits: parameter sharing for all nodes
  • #parameters is sublinear in $|V|$
  • Enables inductive learning for new nodes

SLIDE 56

GraphSage

GCN:

$$h_v^{(l)} = \sigma\Big(W^{(l)} \sum_{u \in N(v) \cup \{v\}} \frac{h_u^{(l-1)}}{\sqrt{|N(v)|\,|N(u)|}}\Big)$$

GraphSage:

$$h_v^{(l)} = \sigma\Big(\big[W^{(l)} \cdot \mathrm{AGG}\big(\{h_u^{(l-1)}, \forall u \in N(v)\}\big),\; B^{(l)} h_v^{(l-1)}\big]\Big)$$

  • Instead of summation, it concatenates neighbor & self embeddings
  • Generalized aggregation: any differentiable function that maps a set of vectors to a single vector

Hamilton et al. Inductive Representation Learning on Large Graphs. NIPS 2017
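A mean-aggregator GraphSAGE layer in the same NumPy style (a sketch of the concatenation idea above; the aggregator choice and shapes are assumptions):

```python
# GraphSAGE layer with a mean aggregator: concat(self, mean(neighbors)).
import numpy as np

def sage_layer(adj_lists, H, W_agg, W_self):
    out = []
    for v, nbrs in enumerate(adj_lists):
        agg = H[nbrs].mean(axis=0) if nbrs else np.zeros(H.shape[1])
        h = np.concatenate([H[v] @ W_self, agg @ W_agg])  # [self, neighbors]
        out.append(np.maximum(h, 0.0))                    # ReLU
    return np.stack(out)

adj_lists = [[1, 2], [0], [0, 3], [2], []]  # toy graph as adjacency lists
H = np.random.randn(5, 8)
W_agg = np.random.randn(8, 8)
W_self = np.random.randn(8, 8)
H1 = sage_layer(adj_lists, H, W_agg, W_self)  # shape (5, 16)
```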


SLIDE 57

GraphSage

Hamilton et al. Inductive Representation Learning on Large Graphs. NIPS 2017. Slide adapted from "Hamilton & Tang, AAAI 2019 Tutorial on Graph Representation Learning".

$$h_v^{(l)} = \sigma\Big(\big[W^{(l)} \cdot \mathrm{AGG}\big(\{h_u^{(l-1)}, \forall u \in N(v)\}\big),\; B^{(l)} h_v^{(l-1)}\big]\Big)$$

SLIDE 58

Graph Neural Network: $H^{(l)} = \sigma\big(\hat{A} H^{(l-1)} W^{(l)}\big)$

SLIDE 59

Graph Attention

Velickovic et al. Graph Attention Networks. ICLR 2018

π’Š!

( = 𝜏(𝑿(

#

)∈* ! βˆͺ!

π’Š)

(,-

|𝑂(𝑣)||𝑂(𝑀)| )

GCN Graph Attention

π’Š!

( = 𝜏(

#

)∈* ! βˆͺ!

𝛽!,)𝑿(π’Š)

(,-)

a e v b d c

Aggregate info from neighborhood via the learned attention Aggregate info from neighborhood via the normalized Laplacian matrix

SLIDE 60

Graph Attention

Velickovic et al. Graph Attention Networks. ICLR 2018

[Figure: node $v$ attends over its neighbors $a, b, c, d, e$]

many ways to define attention!
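One common choice is the original GAT scoring: a shared linear map plus a LeakyReLU-scored attention vector. A single-head NumPy sketch (assumed shapes; not the authors' code):

```python
# GAT-style attention: alpha[v,u] = softmax_u(LeakyReLU(a Β· [W h_v, W h_u])).
import numpy as np

def gat_layer(A, H, W, a, slope=0.2):
    Wh = H @ W                                    # (n, d') projected features
    n = A.shape[0]
    mask = A + np.eye(n)                          # attend over N(v) βˆͺ {v}
    d_out = W.shape[1]
    e = (Wh @ a[:d_out])[:, None] + (Wh @ a[d_out:])[None, :]
    e = np.where(e > 0, e, slope * e)             # LeakyReLU
    e = np.where(mask > 0, e, -1e9)               # mask out non-edges
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)         # row-wise softmax
    return np.maximum(att @ Wh, 0.0)              # weighted aggregation + ReLU

rng = np.random.default_rng(0)
A = (rng.random((5, 5)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                    # undirected toy graph
H, W, a = rng.normal(size=(5, 8)), rng.normal(size=(8, 16)), rng.normal(size=32)
H1 = gat_layer(A, H, W, a)                        # (5, 16)
```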

SLIDE 61

Attention over Heterogeneous Graphs?

[Figures: a heterogeneous academic graph and a heterogeneous office graph]

SLIDE 62

Heterogeneous Graph Transformer (HGT)

  • Current graph neural networks are not capable enough to capture graph heterogeneity
  • Heterogeneous Graph Transformer
  • Unique parameters for each type of relationship

1.Ziniu Hu, et al. Heterogeneous Graph Transformer. WWW 2020. 2.Code & Data for HGT: https://github.com/acbull/pyHGT

  • meta relation of an edge $e = (s, t)$: ⟨node type of $s$, edge type of $e$, node type of $t$⟩, e.g., ⟨Author, Write, Paper⟩

SLIDE 63

Heterogeneous Graph Transformer (HGT)

  • meta relation of an edge $e = (s, t)$: ⟨node type of $s$, edge type of $e$, node type of $t$⟩, e.g., ⟨Author, Write, Paper⟩
  • heterogeneous mutual attention: parameterized separately for each meta relation

1.Ziniu Hu, et al. Heterogeneous Graph Transformer. WWW 2020. 2.Code & Data for HGT: https://github.com/acbull/pyHGT
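The core of heterogeneous mutual attention, parameters keyed by node and edge types, can be sketched as follows (hypothetical names and shapes; the full HGT adds multiple heads and a per-meta-relation prior):

```python
# Sketch: HGT-style parameters keyed by node / edge types.
import numpy as np

d = 16
node_types = ["Author", "Paper"]
edge_types = ["Write"]

K_lin = {t: np.random.randn(d, d) for t in node_types}   # key proj per node type
Q_lin = {t: np.random.randn(d, d) for t in node_types}   # query proj per node type
W_att = {e: np.random.randn(d, d) for e in edge_types}   # matrix per edge type

def mutual_attention(h_s, s_type, h_t, t_type, e_type):
    """Unnormalized heterogeneous mutual attention for one edge (s, e, t)."""
    k = h_s @ K_lin[s_type]                 # source key, typed projection
    q = h_t @ Q_lin[t_type]                 # target query, typed projection
    return (k @ W_att[e_type] @ q) / np.sqrt(d)

score = mutual_attention(np.random.randn(d), "Author",
                         np.random.randn(d), "Paper", "Write")
```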

SLIDE 64

Heterogeneous Graph Transformer

  • meta relation of an edge $e = (s, t)$: ⟨node type of $s$, edge type of $e$, node type of $t$⟩
  • heterogeneous message passing: messages parameterized per meta relation

1.Ziniu Hu, et al. Heterogeneous Graph Transformer. WWW 2020. 2.Code & Data for HGT: https://github.com/acbull/pyHGT

SLIDE 65

Heterogeneous Graph Transformer

  • meta relation of an edge $e = (s, t)$: ⟨node type of $s$, edge type of $e$, node type of $t$⟩
  • target-specific aggregation: aggregate messages back to the target node according to its type

1.Ziniu Hu, et al. Heterogeneous Graph Transformer. WWW 2020. 2.Code & Data for HGT: https://github.com/acbull/pyHGT

SLIDE 66

Heterogeneous Graph Transformer (HGT)

1.Ziniu Hu, et al. Heterogeneous Graph Transformer. WWW 2020. 2.Code & Data for HGT: https://github.com/acbull/pyHGT

SLIDE 67

Heterogeneous Graph Transformer (HGT)

  • Relative Temporal Encoding

1.Ziniu Hu, et al. Heterogeneous Graph Transformer. WWW 2020. 2.Code & Data for HGT: https://github.com/acbull/pyHGT

SLIDE 68

Experiments

1. Ziniu Hu, et al. Heterogeneous Graph Transformer. WWW 2020. 2. Difan Zou, et al. Layer-Dependent Importance Sampling for Training Deep and Large Graph Convolutional Networks. In NeurIPS’19.

  • Sampling subgraphs from large-scale graphs
  • From homogeneous graphs → the LADIES algorithm
  • From heterogeneous graphs → the HGSampling algorithm
SLIDE 69

Results

HGT offers ~9–21% improvements over existing (heterogeneous) GNNs

1.Ziniu Hu, et al. Heterogeneous Graph Transformer. WWW 2020. 2.Code & Data for HGT: https://github.com/acbull/pyHGT

SLIDE 70

Case Study

[Case study: research-area mixes learned for example nodes: DB + Networking + IR; DM + Networking + IR + DB; DB + DM; ML + DB + Web + AI + NLP!!!; CV + ML + AI; ML + CV + DL + NLP. Experiments done in 2019!]

1.Ziniu Hu, et al. Heterogeneous Graph Transformer. WWW 2020. 2.Code & Data for HGT: https://github.com/acbull/pyHGT

SLIDE 71

What is the Best Part of HGT?

Learn meta-paths & their weights implicitly!

1.Ziniu Hu, et al. Heterogeneous Graph Transformer. WWW 2020. 2.Code & Data for HGT: https://github.com/acbull/pyHGT

SLIDE 72

Graph Representation Learning

Network Embedding · Matrix Factorization · Pre-Training · GNNs

  • DeepWalk
  • LINE
  • node2vec
  • PTE
  • …
  • metapath2vec
  • NetMF
  • NetSMF
  • …
  • ProNE (Propagation)
  • GCN
  • GAT
  • GraphSage
  • …
  • GRAND
  • HGT
SLIDE 73

Language and Image Pre-Training, Graphs?

– Recent progress of pre-training models in NLP & CV:

  • ELMo, BERT, XLNet, MoCo, etc.
  • Model level: Transformer
  • Pre-training tasks: masked language modeling & next sentence prediction
SLIDE 74

GNN Pre-Training

  • The FIRST graph pre-training setting:
  – To pre-train on one graph
  – To fine-tune for unseen tasks on the same graph or graphs of the same domain

  • How to do this?
  – Model level: GNNs?
  – Pre-training task: self-supervised tasks on graphs?

1.Ziniu Hu et al. GPT-GNN: Generative Pre-Training of Graph Neural Networks. KDD 2020. 2.Code & data for GPT-GNN: https://github.com/acbull/GPT-GNN

SLIDE 75

GNN Pre-Training

[Diagram. Pre-Training: input graph + graph pre-training task → Pre-Trained Model. Fine-Tuning: the pre-trained model is applied to the same input graph, or graphs of the same domain, for node classification, link prediction, recommendation, …]

1.Ziniu Hu et al. GPT-GNN: Generative Pre-Training of Graph Neural Networks. KDD 2020. 2.Code & data for GPT-GNN: https://github.com/acbull/GPT-GNN

SLIDE 76

GPT-GNN: Generative Pre-Training of GNNs

  • Model the graph distribution by learning to reconstruct the attribute- and edge-masked input graph
  – Factorize the graph likelihood into two terms:
  • Attribute Generation
  • Edge Generation

1.Ziniu Hu et al. GPT-GNN: Generative Pre-Training of Graph Neural Networks. KDD 2020. 2.Code & data for GPT-GNN: https://github.com/acbull/GPT-GNN
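A toy sketch of the two-term objective (loose assumptions: a generic `encode` GNN, mean-squared error for attribute generation, and a pairwise logistic score for edge generation; the actual GPT-GNN objective is node-by-node autoregressive):

```python
# Toy GPT-GNN-style pre-training losses: attribute + edge generation.
import numpy as np

def pretrain_losses(X, edges, masked_nodes, masked_edges, encode, W_dec):
    X_masked = X.copy()
    X_masked[masked_nodes] = 0.0                       # mask node attributes
    visible = [e for e in edges if e not in masked_edges]
    H = encode(X_masked, visible)                      # embed the masked graph
    # Attribute generation: reconstruct the masked attributes.
    attr_loss = np.mean((H[masked_nodes] @ W_dec - X[masked_nodes]) ** 2)
    # Edge generation: score each masked edge above a random negative.
    edge_loss = 0.0
    for u, v in masked_edges:
        neg = np.random.randint(X.shape[0])
        edge_loss += np.log1p(np.exp(-(H[u] @ H[v] - H[u] @ H[neg])))
    return attr_loss, edge_loss / max(len(masked_edges), 1)

# Minimal usage with a trivial stand-in "encoder" just to exercise the code.
X = np.random.randn(6, 4)
edges = [(0, 1), (1, 2), (2, 3), (4, 5)]
W_dec = np.random.randn(4, 4)
encode = lambda Xm, es: Xm                             # stand-in for a real GNN
a_loss, e_loss = pretrain_losses(X, edges, [0, 4], [(2, 3)], encode, W_dec)
```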


SLIDE 78

GPT-GNN: Generative Pre-Training of GNNs

[Diagram. Pre-Training: GPT-GNN performs attribute generation and edge generation on the attribute- and edge-masked input graph. Fine-Tuning: the pre-trained GPT-GNN is applied to the same input graph, or graphs of the same domain, for node classification, link prediction, recommendation, …]

1.Ziniu Hu et al. GPT-GNN: Generative Pre-Training of Graph Neural Networks. KDD 2020. 2.Code & data for GPT-GNN: https://github.com/acbull/GPT-GNN

SLIDE 79

GPT-GNN: Generative Pre-Training of GNNs

  • Data 1: Open Academic Graph
  • Pre-Train tasks: Attribute Generation · Edge Generation
  • Fine-Tune tasks: Inferring the topic of each paper · Inferring the venue of each paper · Author name disambiguation
  • Base GNN model: Heterogeneous Graph Transformer (HGT)

1.Ziniu Hu et al. GPT-GNN: Generative Pre-Training of Graph Neural Networks. KDD 2020. 2.Code & data for GPT-GNN: https://github.com/acbull/GPT-GNN

SLIDE 80

GPT-GNN: Generative Pre-Training of GNNs

  • Data 1: Open Academic Graph
  • No Transfer: pre-train and fine-tune on the CS Academic Graph
  • Field Transfer: pre-train on Med, Bio, Physics, …; fine-tune on CS
  • Time Transfer: pre-train on CS before 2014; fine-tune on CS after 2014
  • Time + Field Transfer: pre-train on Med, Bio, Physics, … before 2014; fine-tune on CS after 2014

1.Ziniu Hu et al. GPT-GNN: Generative Pre-Training of Graph Neural Networks. KDD 2020. 2.Code & data for GPT-GNN: https://github.com/acbull/GPT-GNN

SLIDE 81

GPT-GNN: Generative Pre-Training of GNNs

1.Ziniu Hu et al. GPT-GNN: Generative Pre-Training of Graph Neural Networks. KDD 2020. 2.Code & data for GPT-GNN: https://github.com/acbull/GPT-GNN

  • All pre-training frameworks help the performance of GNNs
    • GAE, GraphSage, Graph Infomax
    • GPT-GNN
  • GPT-GNN helps the most, achieving a relative performance gain of 9.1% over the base model without pre-training
  • Both self-supervised tasks in GPT-GNN help the pre-training framework
    • Attribute generation
    • Edge generation

SLIDE 83

The Promise of Graph Pre-Training!

  • During fine-tuning

1.Ziniu Hu et al. GPT-GNN: Generative Pre-Training of Graph Neural Networks. KDD 2020. 2.Code & data for GPT-GNN: https://github.com/acbull/GPT-GNN

The GNN model with pre-training using only 10–20% of the training data matches the GNN model without pre-training using 100% of the training data.

SLIDE 84

GNN Pre-Training on the β€œSame” Networks

[Diagram, as on the GPT-GNN slide above: pre-training via attribute and edge generation on a masked input graph, then fine-tuning on the same graph or graphs of the same domain]

1.Ziniu Hu et al. GPT-GNN: Generative Pre-Training of Graph Neural Networks. KDD 2020. 2.Code & data for GPT-GNN: https://github.com/acbull/GPT-GNN

SLIDE 85

Graphs

Office/Social Graph · Internet · Knowledge Graph · Biological Neural Networks · Transportation · Academic Graph

figure credit: Web

SLIDE 86

GNN Pre-Training

  • The SECOND graph pre-training setting:
  – To pre-train on some graphs
  – To fine-tune for unseen tasks on unseen graphs

  • How to do this?
  – Model level: GNNs?
  – Pre-training task: self-supervised tasks across graphs?

1.Jiezhong Qiu et al. GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training. KDD 2020. 2.Code & Data for GCC: https://github.com/THUDM/GCC

SLIDE 87

GNN Pre-Training across Networks

[Diagram. Pre-Training: a graph pre-training task on graphs such as Facebook, IMDB, DBLP → Pre-Trained GNN. Fine-Tuning: the pre-trained GNN is applied to unseen graphs, e.g., US-Airport (node classification) and Reddit (graph classification), …]

1.Jiezhong Qiu et al. GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training. KDD 2020. 2.Code & Data for GCC: https://github.com/THUDM/GCC

SLIDE 88

GNN Pre-Training across Networks

  • What are the requirements?
  – Structural similarity: it maps vertices with similar local network topologies close to each other in the vector space
  – Transferability: it is compatible with vertices and graphs unseen by the pre-training algorithm

1.Jiezhong Qiu et al. GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training. KDD 2020. 2.Code & Data for GCC: https://github.com/THUDM/GCC

SLIDE 89

GNN Pre-Training across Networks

  • The Idea: Contrastive learning
  • Pre-training task: instance discrimination
  • InfoNCE objective: output instance representations that capture the similarities between instances

  • Contrastive learning for graphs?
  • Q1: How to define instances in graphs?
  • Q2: How to define (dis)similar instance pairs in and across graphs?
  • Q3: What are the proper graph encoders?

  • query instance $x^q$, with query embedding $\mathbf{q} = f(x^q)$
  • dictionary of keys $\mathbf{k}_0, \mathbf{k}_1, \cdots, \mathbf{k}_K$, each key $\mathbf{k} = f(x^k)$

1. Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR ’18. 2. Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In CVPR ’20.
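For reference, the InfoNCE objective in a few lines of NumPy (a generic sketch over one query and a key dictionary; the temperature and shapes are assumptions):

```python
# InfoNCE: classify the positive key k_0 among K negatives for a query q.
import numpy as np

def info_nce(q, keys, tau=0.07):
    """q: (d,) query embedding; keys: (K+1, d), row 0 is the positive key."""
    logits = keys @ q / tau                      # similarity to every key
    logits -= logits.max()                       # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum())
    return -log_softmax[0]                       # -log p(positive | query)

rng = np.random.default_rng(0)
q = rng.normal(size=64)
keys = rng.normal(size=(17, 64))                 # 1 positive + 16 negatives
loss = info_nce(q, keys)
```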

SLIDE 90

Graph Contrastive Coding (GCC)

  • Contrastive learning for graphs
  • Q1: How to define instances in graphs?
  • Q2: How to define (dis)similar instance pairs in and across graphs?
  • Q3: What are the proper graph encoders?

Subgraph instance discrimination

1.Jiezhong Qiu et al. GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training. KDD 2020. 2.Code & Data for GCC: https://github.com/THUDM/GCC
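GCC's answers to Q1/Q2 can be sketched with random walk with restart (RWR) on a node's r-ego network: two RWR-sampled subgraphs of the same ego network form a similar (positive) pair, and subgraphs of other nodes serve as negatives. A sketch with an assumed restart probability and walk budget (the real GCC then encodes each subgraph with a GIN):

```python
# Sketch: two RWR-sampled subgraphs of the same node's r-ego network
# form a positive pair; subgraphs of other nodes are negatives.
import random
import networkx as nx

def rwr_subgraph(G, seed, steps=100, restart=0.5):
    visited, cur = {seed}, seed
    for _ in range(steps):
        if random.random() < restart:
            cur = seed                        # restart at the ego node
        else:
            nbrs = list(G.neighbors(cur))
            if not nbrs:
                cur = seed
                continue
            cur = random.choice(nbrs)
            visited.add(cur)
    return G.subgraph(visited).copy()

G = nx.karate_club_graph()
ego = nx.ego_graph(G, 0, radius=2)            # r-ego network of node 0
view_a = rwr_subgraph(ego, 0)                 # two views of the same instance:
view_b = rwr_subgraph(ego, 0)                 # a (similar) positive pair
```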

SLIDE 91

Graph Contrastive Coding (GCC)

[Diagram. Pre-Training: GCC performs subgraph instance discrimination on Facebook, IMDB, DBLP. Fine-Tuning: the pre-trained GCC is applied to unseen graphs, e.g., US-Airport (node classification) and Reddit (graph classification), …]

1.Jiezhong Qiu et al. GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training. KDD 2020. 2.Code & Data for GCC: https://github.com/THUDM/GCC

SLIDE 92

GCC Pre-Training / Fine-Tuning

  • Pre-train on six graphs
  • Fine-tune on different graphs:
  – US-Airport & AMiner academic graph → node classification
  – COLLAB, RDT-B, RDT-M, IMDB-B, IMDB-M → graph classification
  – AMiner academic graph → similarity search
  • The base GNN: Graph Isomorphism Network (GIN)

1.Jiezhong Qiu et al. GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training. KDD 2020. 2.Code & Data for GCC: https://github.com/THUDM/GCC

SLIDE 93

Results

Node Classification Graph Classification Similarity Search

1.Jiezhong Qiu et al. GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training. KDD 2020. 2.Code & Data for GCC: https://github.com/THUDM/GCC


SLIDE 95

Results

GCC: universal patterns?

[Diagram: the same pre-training/fine-tuning pipeline, i.e., subgraph instance discrimination on Facebook, IMDB, DBLP, then fine-tuning on US-Airport and Reddit]

1.Jiezhong Qiu et al. GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training. KDD 2020. 2.Code & Data for GCC: https://github.com/THUDM/GCC

Does the pre-training of GNNs learn universal structural patterns across networks?

SLIDE 96

Graph Representation Learning

Network Embedding · Matrix Factorization · Pre-Training · GNNs

  • DeepWalk
  • LINE
  • node2vec
  • PTE
  • …
  • metapath2vec
  • NetMF
  • NetSMF
  • …
  • ProNE (Propagation)
  • GCN
  • GAT
  • GraphSage
  • …
  • GRAND
  • HGT
  • Generative
  • GPT-GNN
  • Contrastive
  • GCC

What graph data to use?

SLIDE 97
  • 1. Weihua Hu, et al. Open Graph Benchmark: Datasets for Machine Learning on Graphs. arXiv 2020

https://ogb.stanford.edu/

SLIDE 98

Microsoft Academic Graph & AMiner & OAG

1. Kuansan Wang et al. Microsoft Academic Graph: When experts are not enough. Quantitative Science Studies 1 (1), 2020 2. Fanjin Zhang et al. OAG: Toward Linking Large-scale Heterogeneous Entity Graphs. KDD 2019. 3. Jie Tang et al. Arnetminer: extraction and mining of academic social networks. In KDD 2008.

1800–2019: the number of publications doubles every ~13 years

SLIDE 99

References

1. Ziniu Hu et al. GPT-GNN: Generative Pre-Training of Graph Neural Networks. KDD 2020.
2. Jiezhong Qiu et al. GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training. KDD 2020.
3. Kuansan Wang et al. Microsoft Academic Graph: When experts are not enough. Quantitative Science Studies 1 (1), 396-413, 2020.
4. Weihua Hu et al. Open Graph Benchmark: Datasets for Machine Learning on Graphs. arXiv 2020.
5. Feng et al. Graph Random Neural Networks. arXiv 2020.
6. Ziniu Hu et al. Heterogeneous Graph Transformer. WWW 2020.
7. Yuxiao Dong et al. Heterogeneous Network Representation Learning. IJCAI 2020.
8. Jiezhong Qiu et al. NetSMF: Large-Scale Network Embedding as Sparse Matrix Factorization. WWW 2019.
9. Jie Zhang et al. ProNE: Fast and Scalable Network Representation Learning. IJCAI 2019.
10. Fanjin Zhang et al. OAG: Toward Linking Large-scale Heterogeneous Entity Graphs. KDD 2019.
11. Xian Wu et al. Neural Tensor Decomposition. WSDM 2019.
12. Jiezhong Qiu et al. Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec. WSDM 2018.
13. Yuxiao Dong et al. metapath2vec: Scalable Representation Learning for Heterogeneous Networks. KDD 2017.
14. Perozzi et al. DeepWalk: Online Learning of Social Representations. KDD 2014.
15. Tang et al. LINE: Large-scale Information Network Embedding. WWW 2015.
16. Grover and Leskovec. node2vec: Scalable Feature Learning for Networks. KDD 2016.
17. Harris, Z. Distributional Structure. Word, 10(23): 146-162, 1954.
18. Kipf et al. Semi-supervised Classification with Graph Convolutional Networks. ICLR 2017.
19. Velickovic et al. Graph Attention Networks. ICLR 2018.
20. Hamilton et al. Inductive Representation Learning on Large Graphs. NeurIPS 2017.
21. Defferrard et al. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. NeurIPS 2016.
22. Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019.
23. Justin Gilmer et al. Neural Message Passing for Quantum Chemistry. ICML 2017.
24. Kaiming He et al. Momentum Contrast for Unsupervised Visual Representation Learning. CVPR 2020.
25. Tomas Mikolov et al. Distributed Representations of Words and Phrases and Their Compositionality. NeurIPS 2013.
26. Petar Velickovic et al. Deep Graph Infomax. ICLR 2019.
27. Zhen Yang et al. Understanding Negative Sampling in Graph Representation Learning. KDD 2020.

SLIDE 100

Thank you!

Jiezhong Qiu (Tsinghua, advised by Jie Tang) · Jie Tang (Tsinghua) · Yizhou Sun (UCLA) · Ziniu Hu (UCLA, advised by Yizhou Sun) · Hongxia Yang (Alibaba) · Hao Ma (Facebook AI) · Kuansan Wang (Microsoft Research) · Jing Zhang (Renmin U. of China)

Papers & data & code available at https://ericdongyx.github.io/ · ericdongyx@gmail.com