Graph Representation Learning: Where Probability Theory, Data Mining, and Neural Networks Meet


SLIDE 1

Bruno Ribeiro

Assistant Professor, Department of Computer Science, Purdue University

Graph Representation Learning: Where Probability Theory, Data Mining, and Neural Networks Meet

Joint work with R. Murphy*, B. Srinivasan*, V. Rao

GrAPL Workshop @ IPDPS May 20th, 2019

Sponsors:

Army Research Lab Network Science CTA

SLIDE 2

  • What is the most powerful+ graph model / representation?
  • How can we make model learning$ tractable*?
  • How can we make model learning$ scalable?

+ powerful → expressive    * tractable → works on small graphs    $ learning → learning and inference

SLIDE 3

SLIDE 4

Examples: social graphs, biological graphs, molecules, the Web, ecological graphs.

A graph: G = (V, E)

SLIDE 5

Undirected graph G(V, E), with vertices/nodes 1–8 joined by edges, and adjacency matrix

A = ⎡ 0 1 0 0 0 0 1 0 ⎤
    ⎢ 1 0 1 0 0 0 0 0 ⎥
    ⎢ 0 1 0 1 0 0 0 1 ⎥
    ⎢ 0 0 1 0 1 1 0 1 ⎥
    ⎢ 0 0 0 1 0 1 0 0 ⎥
    ⎢ 0 0 0 1 1 0 1 0 ⎥
    ⎢ 1 0 0 0 0 1 0 1 ⎥
    ⎣ 0 0 1 1 0 0 1 0 ⎦

P(A): the probability of sampling A (this graph).

Node labels are arbitrary (the figure highlights rows A_{1·}, A_{2·} and columns A_{·1}, A_{·2}).

SLIDE 6

SLIDE 7

Consider a sequence of n random variables X_1, …, X_n, X_i ∈ Ω, with joint probability distribution P.

Sequence example: "The quick brown fox jumped over the lazy dog"

P(X_1 = the, X_2 = quick, …, X_9 = dog)

The joint probability is just a function P: Ω^n → [0,1] (with normalization), where Ω is countable.
  • P takes an ordered sequence and outputs a value between zero and one
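To make this concrete, here is a minimal sketch (toy numbers of my own, not from the talk) of a joint probability over ordered sequences, factored by the chain rule into bigram conditionals:

```python
# A minimal sketch: P(x_1, ..., x_n) = P(x_1) * P(x_2 | x_1) * ... * P(x_n | x_{n-1}).
# The vocabulary and probabilities below are made-up toy values.

unigram = {"the": 0.5, "quick": 0.2, "fox": 0.3}   # toy P(X_1)
bigram = {                                         # toy P(X_i | X_{i-1})
    ("the", "quick"): 0.4, ("the", "fox"): 0.6,
    ("quick", "fox"): 0.9, ("quick", "the"): 0.1,
    ("fox", "the"): 0.7, ("fox", "quick"): 0.3,
}

def joint_probability(sequence):
    """P: Omega^n -> [0, 1], defined on ordered sequences."""
    p = unigram[sequence[0]]
    for prev, curr in zip(sequence, sequence[1:]):
        p *= bigram[(prev, curr)]
    return p

# Order matters: the same multiset of words gets different probabilities.
print(joint_probability(["the", "quick", "fox"]))   # 0.5 * 0.4 * 0.9 = 0.18
print(joint_probability(["quick", "the", "fox"]))   # 0.2 * 0.1 * 0.6 = 0.012
```

The order-dependence shown in the last two lines is exactly what the multiset definition on the next slide removes.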

SLIDE 8

Consider a set of n random variables (representing a multiset): X_1, …, X_n. How should we define their joint probability distribution? Recall: the probability function P: Ω^n → [0,1] above is order-dependent.

Definition: For multisets, the probability function P is such that

P(X_1, …, X_n) = P(X_{π(1)}, …, X_{π(n)})

is true for any permutation π of (1,…,n).

Useful references: Diaconis (Synthese, 1977), Finite forms of de Finetti's theorem on exchangeability; Murphy et al. (ICLR 2019), Janossy Pooling: Learning Deep Permutation-Invariant Functions for Variable-Size Inputs.
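A minimal sketch of this definition (the base function below is a made-up stand-in): any order-dependent function can be made permutation-invariant by averaging it over all permutations of its input, the idea behind Janossy pooling (Murphy et al., ICLR 2019):

```python
# Symmetrize a toy order-dependent function by averaging over all
# permutations of the input; invariance then holds by construction.
from itertools import permutations
from math import factorial

def p_ordered(xs):
    # Toy order-dependent stand-in for a probability function on sequences.
    weights = {"the": 0.5, "quick": 0.2, "fox": 0.3}
    p = 1.0
    for i, x in enumerate(xs):
        p *= weights[x] ** (i + 1)   # position-dependent: order-sensitive
    return p

def p_multiset(xs):
    # Average over every permutation pi of (1, ..., n).
    return sum(p_ordered(perm) for perm in permutations(xs)) / factorial(len(xs))

assert p_ordered(("the", "fox")) != p_ordered(("fox", "the"))   # order-sensitive
assert abs(p_multiset(("the", "fox")) - p_multiset(("fox", "the"))) < 1e-12
```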

SLIDE 9

Multiset examples: point clouds (e.g., lidar maps), bags of words, our friends, the neighbors of a node.

An extension: set-of-sets (Meng et al., KDD 2019).

SLIDE 10

Consider an array of n² random variables X_{ij} ∈ Ω and P: Ω^{n×n} → [0,1] such that for any permutation π of (1,…,n)

P(X_{11}, X_{12}, X_{21}, …, X_{nn}) = P(X_{π(1)π(1)}, X_{π(1)π(2)}, X_{π(2)π(1)}, …, X_{π(n)π(n)})

Then P is a model of a graph with n vertices, where X_{ij} ∈ Ω are edge attributes (e.g., weights).
  • For each graph, P assigns a probability
  • Trivial to add node attributes to the definition

If Ω = {0,1}, then P is a probability distribution over adjacency matrices.
  • Most statistical graph models can be represented this way

SLIDE 11

The same undirected graph G(V, E) and adjacency matrix A as on Slide 5, with arbitrary node labels.

Example: π = (2,1,3,4,5,6,7,8)

P(A) = P(A_{ππ})

The graph model is invariant to permutations π.
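A quick numerical check of this invariance, using a toy edge-independent Bernoulli (Erdős–Rényi-style) model as a stand-in for P (an illustrative assumption, not the talk's model):

```python
# Permuting rows and columns of A jointly does not change the probability
# an edge-independent model assigns to the graph.
import numpy as np

def log_p(A, q=0.3):
    # Each edge present independently with probability q (undirected graph,
    # so only the upper triangle of A carries information).
    iu = np.triu_indices(A.shape[0], k=1)
    e = A[iu]
    return float(np.sum(e * np.log(q) + (1 - e) * np.log(1 - q)))

rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(8, 8))
A = np.triu(A, 1)
A = A + A.T                                   # random undirected graph

pi = rng.permutation(8)                       # a relabeling of the nodes
A_perm = A[np.ix_(pi, pi)]                    # A_pipi: permute rows & columns

assert np.isclose(log_p(A), log_p(A_perm))    # invariant to node labels
```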

SLIDE 12

SLIDE 13

Invariances have deep implications in nature.
  • Noether's (first) theorem (1918): invariances ⇒ conservation laws. E.g., time-translation invariance ⇒ energy conservation; space-translation invariance ⇒ momentum conservation.

The study of probabilistic invariances (symmetries) has a long history.
  • Laplace's "rule of succession" dates to 1774 (Kallenberg, 2005)
  • Maxwell's work in statistical mechanics (1875) (Kallenberg, 2005)
  • Permutation invariance for infinite sets:
      – de Finetti's theorem (de Finetti, 1930)
      – A special case of the ergodic decomposition theorem, related to integral decompositions (see Orbanz and Roy (2015) for a good overview)
  • Kallenberg (2005) & (2017): the de facto references on probabilistic invariances
SLIDE 14

Aldous, D. J. Representations for partially exchangeable arrays of random variables. J. Multivar. Anal., 1981.

SLIDE 15

Consider an infinite array of random variables X_{ij} ∈ Ω such that

P(X_{11}, X_{12}, …) = P(X_{π(1)π(1)}, X_{π(1)π(2)}, …)

is true for any permutation π of the positive integers. Then

P(X_{11}, X_{12}, …) ∝ ∫_{U_1 ∈ [0,1]} ⋯ ∫_{U_∞ ∈ [0,1]} ∏_{ij} P(X_{ij} | U_i, U_j)

is a mixture model with U_1, U_2, … ∼ Uniform(0,1).

(The Aldous–Hoover representation is sufficient only for infinite graphs.)
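A minimal sampling sketch of this mixture construction: one latent U_i ∼ Uniform(0,1) per node, then each edge drawn independently given (U_i, U_j). The kernel W below is an arbitrary assumption, chosen only to show the latent uniforms driving edge probabilities:

```python
import numpy as np

def W(u, v):
    # Toy symmetric kernel: nodes with similar latents connect more often.
    return np.exp(-3.0 * abs(u - v))

def sample_graph(n, rng):
    U = rng.uniform(0, 1, size=n)              # latent uniforms, one per node
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            A[i, j] = A[j, i] = int(rng.random() < W(U[i], U[j]))
    return A, U

A, U = sample_graph(8, np.random.default_rng(1))
print(A)
```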

SLIDE 16

SLIDE 17

Relationship between deterministic functions and probability distributions.

Noise outsourcing:
  • A tool from measure theory
  • Any conditional probability P(Z|Y) can be represented* as Z = h(Y, ε), ε ∼ Uniform(0,1), where h is a deterministic function
  • The randomness is entirely outsourced to ε

Representation t(Y):
  • t(Y): a deterministic function that makes Z independent of Y given t(Y)
  • Then ∃ h′ such that (Z, Y) =* (h′(t(Y), ε), Y), ε ∼ Uniform(0,1)

* equality holds almost surely (a.s.)

We call t(Y) a representation of Y. Representations are generalizations of "embeddings".
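A small sketch of noise outsourcing for a concrete conditional (my example, not the slides'): Z | Y ∼ Bernoulli(sigmoid(Y)), rewritten as a deterministic function of Y and a single uniform ε:

```python
# Z = h(Y, eps) with eps ~ Uniform(0, 1); all randomness lives in eps.
import math
import random

def h(y, eps):
    # Deterministic function implementing Z | Y ~ Bernoulli(sigmoid(y)).
    return 1 if eps < 1.0 / (1.0 + math.exp(-y)) else 0

random.seed(0)
samples = [h(0.5, random.random()) for _ in range(100_000)]
print(sum(samples) / len(samples))   # ~ sigmoid(0.5) = 0.6225
```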

SLIDE 18

SLIDE 19

Gaussian Linear Model (each node i represented by a random vector):
  • Node i vector: U_{i·} ∼ Normal(𝟎, σ_U² I)
  • Adjacency matrix: A_{ij} ∼ Normal(U_{i·} U_{j·}ᵀ, σ²)

Q: For a given A, what is the most likely U?
Answer: U⋆ = argmax_U P(A|U), a.k.a. maximum likelihood.

Equivalent optimization, minimizing the negative log-likelihood:

U⋆ = argmin_U ‖A − U Uᵀ‖²₂ + (σ²/σ_U²) ‖U‖²₂

SLIDE 20


That will turn out to be the same

SLIDE 21

Embedding of adjacency matrix A:

A ≈ U Uᵀ = U_{·1} U_{·1}ᵀ + U_{·2} U_{·2}ᵀ + ⋯

where U_{·i} is the i-th column vector of U.

SLIDE 22

Matrix factorization can be used to compute a low-rank representation of A.

A reconstruction problem: find A ≈ U Uᵀ, where U has k columns*, by optimizing

min_U ‖A − U Uᵀ‖²₂ + λ ‖U‖²₂

i.e., the sum of squared errors plus an L2 regularization term with regularization strength λ.

* sometimes we will force orthogonal columns in U
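A minimal numpy sketch of this objective, minimized by plain gradient descent (step size and iteration count are illustrative assumptions):

```python
# min_U ||A - U U^T||_2^2 + lambda ||U||_2^2 via gradient descent.
import numpy as np

def factorize(A, k, lam=0.1, lr=0.01, steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((A.shape[0], k))     # U has k columns
    for _ in range(steps):
        R = A - U @ U.T                                # residual (symmetric)
        grad = -4 * R @ U + 2 * lam * U                # gradient of the objective
        U -= lr * grad
    return U

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
U = factorize(A, k=2)
print(np.round(U @ U.T, 2))                            # low-rank reconstruction
```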

SLIDE 23

SLIDE 24

SLIDE 25

SLIDE 26

SLIDE 27

The Weisfeiler–Lehman (WL) test (Shervashidze et al., 2011): a recursive algorithm to determine if two graphs are isomorphic.

Initialize: h_v is the attribute vector of vertex v ∈ G (if no attribute, assign 1); k = 0

function WL-fingerprints(G):
    while vertex attributes change do:
        k ← k + 1
        for all vertices v ∈ G do:
            h_{k,v} ← hash(h_{k−1,v}, {h_{k−1,u} : ∀u ∈ Neighbors(v)})
    return {h_{k,v} : ∀v ∈ G}

  • A valid isomorphism test for most graphs (Babai and Kucera, 1979)
  • Cai et al. (1992) show examples that cannot be distinguished by it
  • Belongs to the class of color refinement algorithms, which iteratively update vertex "colors" (hash values) until convergence to a unique assignment of hashes to vertices
  • The final hash values encode the structural roles of vertices inside a graph
  • Often fails for graphs with a high degree of symmetry, e.g., chains, complete graphs, tori, and stars
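A compact Python rendering of the loop above; Python's built-in hash over (own label, sorted multiset of neighbor labels) stands in for the hash step, and graphs are given as adjacency lists:

```python
def wl_fingerprints(neighbors, attrs=None):
    # neighbors: dict mapping each vertex to a list of its neighbors
    h = dict(attrs) if attrs else {v: 1 for v in neighbors}  # no attribute -> 1
    while True:
        new_h = {v: hash((h[v], tuple(sorted(h[u] for u in neighbors[v]))))
                 for v in neighbors}
        # Stop once refinement no longer increases the number of colors.
        if len(set(new_h.values())) == len(set(h.values())):
            return new_h
        h = new_h

# A 4-cycle vs. a 4-path: WL distinguishes them via degree refinement.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(sorted(wl_fingerprints(cycle).values()) ==
      sorted(wl_fingerprints(path).values()))   # False: different fingerprints
```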

SLIDE 28

The hardest task for graph representation is:
  • Give different tags to different graphs
  • Isomorphic graphs should have the same tag
  • Task: given adjacency matrix A, predict the tag

Goal: find a representation t(A) such that tag = g(t(A), ε), ε ∼ Uniform(0,1), i.e., noise-outsource P(tag | A).
  • Then t(A) must give:
      – the same representation to isomorphic graphs
      – different representations to non-isomorphic graphs

SLIDE 29

SLIDE 30

Main idea of Graph Neural Networks: use the WL algorithm to compute representations that are related to a task.

Initialize: h_{0,v} = node v's attribute; l = 0

function g⃗(A, W_1, …, W_L, b_1, …, b_L):
    while l < L do:   # L layers
        l ← l + 1
        for all vertices v ∈ V do:
            h_{l,v} = σ(W_l [h_{l−1,v}, A_{v·} h_{l−1}] + b_l)
    return {h_{L,v} : ∀v ∈ V}

The neighborhood aggregation A_{v·} h_{l−1} could be another permutation-invariant function, and Pooling below is a permutation-invariant function (see Murphy et al., ICLR 2019).

Example supervised task: predict label y_j of graph G_j represented by A_j. Optimization for loss L: let θ = (W_1, …, W_L, b_1, …, b_L, W_agg, b_agg), then

θ⋆ = argmin_θ Σ_{j∈Data} L(y_j, W_agg Pooling(g⃗(A_j, W_1, …, W_L, b_1, …, b_L)) + b_agg)
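A minimal numpy sketch of this forward pass, using A @ H as the neighborhood aggregation and ReLU as σ (both illustrative choices; weight shapes are assumptions):

```python
import numpy as np

def gnn_forward(A, H0, Ws, bs):
    """A: (n, n) adjacency; H0: (n, d) node attributes; Ws, bs: layer params."""
    H = H0
    for W, b in zip(Ws, bs):
        M = A @ H                                        # aggregate neighbor states
        H = np.maximum(0.0, np.hstack([H, M]) @ W + b)   # concat, transform, ReLU
    return H                                             # {h_{L,v} : all v}

rng = np.random.default_rng(0)
n, d, L = 5, 4, 2
A = rng.integers(0, 2, (n, n)); A = np.triu(A, 1); A = A + A.T
H0 = rng.standard_normal((n, d))
Ws = [0.1 * rng.standard_normal((2 * d, d)) for _ in range(L)]
bs = [np.zeros(d) for _ in range(L)]
print(gnn_forward(A, H0, Ws, bs).shape)                  # (5, 4)
```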

SLIDE 31

GNN representations can be as expressive as the Weisfeiler–Lehman (WL) isomorphism test (Xu et al., ICLR 2019). But the WL test can sometimes fail, e.g., in a family of circulant graphs.

By construction, the GNN representation g⃗ is guaranteed permutation invariance (equivariance):

g⃗(A) = g⃗(A_{ππ})

SLIDE 32

SLIDE 33

A multilayer perceptron (MLP) is a universal function approximator (Hornik et al., 1989).
  • What about using g⃗_MLP(vec(A))?
  • No! It is permutation-sensitive*:

g⃗_MLP(vec(A)) ≠ g⃗_MLP(vec(A_{ππ})) for some permutation π

* unless the neuron weights are nearly all the same (Maron et al., 2018)

SLIDE 34

Extension: a graph model P over variable-size graphs is an array of n² random variables X_{ij} ∈ Ω, n > 1, and P: Ω^{∪×} → [0,1], where Ω^{∪×} ≡ ∪_{n=2}^∞ Ω^{n×n}, such that for any value of n and any permutation π of (1,…,n)

P(X_{11}, X_{12}, X_{21}, …, X_{nn}) = P(X_{π(1)π(1)}, X_{π(1)π(2)}, X_{π(2)π(1)}, …, X_{π(n)π(n)})

Insight of (Murphy et al., ICML 2019):
  • P is the average of an unconstrained probability function applied over the group defined by the permutation operator π
  • The average is invariant to the group action of a permutation π (see Bloem-Reddy & Teh, 2019)
  • Works for variable-size graphs
SLIDE 35

  • A is a tensor encoding the adjacency matrix & edge attributes
  • X^{(v)} encodes node attributes
  • Π is the set of all permutations of (1,…,|V|), where |V| is the number of vertices
  • g⃗ is any permutation-sensitive function

Theorem 2.1: a necessary and sufficient representation of finite graphs.
  • (details) If g⃗ is a universal approximator (MLPs, RNNs), then ḡ(A) is the most expressive representation of A, where

ḡ(A) = E_π[g⃗(A_{ππ}, X^{(v)}_π)] = (1/|V|!) Σ_{π∈Π} g⃗(A_{ππ}, X^{(v)}_π)

is the average over π ∼ Uniform(Π).
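A brute-force sketch of the construction in Theorem 2.1: average an intentionally permutation-sensitive toy g over all |V|! joint row/column permutations. This is only feasible for tiny graphs, which is exactly why the tractability approximations on the next slide matter. The toy g is an assumption:

```python
import numpy as np
from itertools import permutations
from math import factorial

def g(A):
    # Deliberately permutation-sensitive: weights entries by their position.
    n = A.shape[0]
    pos = np.arange(1, n * n + 1).reshape(n, n)
    return float((A * pos).sum())

def rp(A):
    # g_bar(A) = (1/|V|!) * sum over all pi in Pi of g(A_pipi)
    n = A.shape[0]
    total = sum(g(A[np.ix_(pi, pi)]) for pi in permutations(range(n)))
    return total / factorial(n)

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
pi = [2, 0, 1]
A_perm = A[np.ix_(pi, pi)]
print(g(A), g(A_perm))     # differ: g is permutation-sensitive
print(rp(A), rp(A_perm))   # equal: the average is permutation-invariant
```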

SLIDE 36

Making ḡ(A) ∝ Σ_{π∈Π} g⃗(A_{ππ}, X^{(v)}_π) tractable:

1. Canonical orientation (some order of the vertices), so that canonical(A) = canonical(A_{ππ})

2. k-ary dependencies:
  • Nodes k-by-k independent in g⃗
  • g⃗ considers only the first k nodes of any permutation π

3. Stochastic optimization (proposes π-SGD)

SLIDE 37

Canonical orientation: order nodes with a sort function.
  • E.g., order nodes by PageRank
  • Arrange A with sort(A) (assuming no ties)
  • Note that sort(A) = sort(A_{ππ}) for any permutation π
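A sketch of the canonical-orientation idea, using weighted degree (node strength) as the permutation-invariant sort score in place of PageRank, on a weighted graph chosen so the scores have no ties:

```python
import numpy as np

def canonical(A):
    order = np.argsort(-A.sum(axis=1))       # sort nodes by score, descending
    return A[np.ix_(order, order)]

# Weighted graph chosen so that node strengths (4, 5, 7, 6) are all distinct.
A = np.array([[0, 3, 0, 1],
              [3, 0, 2, 0],
              [0, 2, 0, 5],
              [1, 0, 5, 0]], dtype=float)

rng = np.random.default_rng(2)
pi = rng.permutation(4)
assert (canonical(A) == canonical(A[np.ix_(pi, pi)])).all()
```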

SLIDE 38

k-ary dependencies, illustrated on the undirected graph G(V, E) from Slide 5:
  • Nodes k-by-k independent in g⃗
  • g⃗ considers only the first k nodes of any permutation π, so only the ordered k-node prefixes of the permutations matter

Σ_{π∈Π} g⃗(A_{ππ}) = g⃗(A_{ππ}) for π = (1,2,3,4,5,6,7,8)
                  + g⃗(A_{ππ}) for π = (2,1,3,4,5,6,7,8)
                  + ⋯
                  + g⃗(A_{ππ}) for π = (2,1,3,4,5,6,8,7)
                  + ⋯

SLIDE 39

SGD: standard Stochastic Gradient Descent
1. Sample a batch of n training examples
2. Compute gradients (backpropagation using the chain rule)
3. Update the model following the negative gradient (one gradient descent step)
4. GOTO 1

π-SGD (as fast as SGD per gradient step)
1. Sample a batch of training examples
2. For each example y^{(k)} in the batch, sample one permutation π^{(k)}
3. Perform a forward pass over the examples with the single sampled permutation
4. Compute gradients (backpropagation using the chain rule)
5. Update the model following the negative gradient (one gradient descent step)
6. GOTO 1
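The π-SGD inner loop as a sketch: the model, loss, and grad_fn below are placeholders (assumptions); only the one-permutation-per-example pattern follows the slide:

```python
import numpy as np

def pi_sgd_step(batch, params, grad_fn, lr=1e-2, rng=None):
    rng = rng or np.random.default_rng()
    grads = np.zeros_like(params)
    for A, y in batch:                        # 1-2. for each example in batch,
        pi = rng.permutation(A.shape[0])      #      sample ONE permutation
        A_perm = A[np.ix_(pi, pi)]            # 3. forward pass on A_pipi
        grads += grad_fn(params, A_perm, y)   # 4. backprop on that permutation
    return params - lr * grads / len(batch)   # 5. one gradient descent step
```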

SLIDE 40

Proposition 2.1 (Murphy et al., ICML 2019):
  • π-SGD behaves just like SGD
  • If the loss is MSE, cross-entropy, or negative log-likelihood, then π-SGD minimizes an upper bound of the loss
  • However, the solution π-SGD converges to is not the solution of SGD
  • But it is still a valid graph representation

SLIDE 41

Consider a GNN g⃗, e.g., the GIN of (Xu et al., ICLR 2019).
  • By definition, g⃗ is insensitive to permutations.

Let's make g⃗ sensitive to permutations by adding a node id (label) as a unique node feature,
  • and use RP to make the entire representation insensitive to permutations (learnt approximately via π-SGD).

Tasks: classifying circulant graphs; molecular classification.

SLIDE 42

  • g⃗ can be a logistic model (logistic regression)
  • g⃗ can be a Recurrent Neural Network (RNN)
  • g⃗ can be a Convolutional Neural Network (CNN): treat A as an image

These are all valid graph representations in Relational Pooling (RP).

SLIDE 43

The Relational Pooling (RP) framework gives a new class of graph representations and models:

RP: ḡ(A) ∝ Σ_{π∈Π} g⃗(A_{ππ}, X^{(v)}_π)

  • Until now, g⃗ has been hand-designed to be permutation-invariant: such a g⃗ needs no permutation averaging, is always permutation-invariant, and is learnt exactly.
  • (Murphy et al., ICML 2019): in RP, g⃗ can be permutation-sensitive, which allows more expressive models; "any" g⃗ is made permutation-invariant by RP. ḡ(A) is learnt approximately via π-SGD, and Monte Carlo inference of ḡ(A) is approximately permutation-invariant.
  • Trade-off: RP models can only be learnt approximately.

Thank You!    @brunofmr    ribeiro@cs.purdue.edu

SLIDE 44

1. Murphy, R. L., Srinivasan, B., Rao, V., and Ribeiro, B. Janossy Pooling: Learning Deep Permutation-Invariant Functions for Variable-Size Inputs. ICLR 2019.
2. Murphy, R. L., Srinivasan, B., Rao, V., and Ribeiro, B. Relational Pooling for Graph Representations. ICML 2019.
3. Meng, C., Yang, J., Ribeiro, B., and Neville, J. HATS: A Hierarchical Sequence-Attention Framework for Inductive Set-of-Sets Embeddings. KDD 2019.
4. de Finetti, B. Funzione caratteristica di un fenomeno aleatorio. Mem. R. Acc. Lincei, 1930.
5. Aldous, D. J. Representations for partially exchangeable arrays of random variables. J. Multivar. Anal., 1981.
6. Diaconis, P. and Janson, S. Graph limits and exchangeable random graphs. Rend. di Mat. e delle sue Appl., Ser. VII, 28:33–61, 2008.
7. Kallenberg, O. Probabilistic Symmetries and Invariance Principles. Springer, 2005.
8. Kallenberg, O. Random Measures, Theory and Applications. Springer International Publishing, 2017.
9. Diaconis, P. Finite forms of de Finetti's theorem on exchangeability. Synthese, 1977.
10. Orbanz, P. and Roy, D. M. Bayesian models of graphs, arrays and other exchangeable random structures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):437–461, 2015.
11. Bloem-Reddy, B. and Teh, Y. W. Probabilistic symmetry and invariant neural networks. arXiv:1901.06082, 2019.
12. Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.
13. Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
14. Cai, J.-Y., Fürer, M., and Immerman, N. An optimal lower bound on the number of variables for graph identification. Combinatorica, 12(4):389–410, 1992.
15. Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K., and Borgwardt, K. M. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 2011.
16. Maron, H., Ben-Hamu, H., Shamir, N., and Lipman, Y. Invariant and equivariant graph networks. arXiv:1812.09902, 2018.
17. Maron, H., Fetaya, E., Segol, N., and Lipman, Y. On the universality of invariant networks. arXiv:1901.09342, 2019.