Graph Representation Learning: Where Probability Theory, Data Mining, and Neural Networks Meet


SLIDE 1

Bruno Ribeiro

Assistant Professor, Department of Computer Science, Purdue University

Graph Representation Learning: Where Probability Theory, Data Mining, and Neural Networks Meet

Joint work with R. Murphy*, B. Srinivasan*, V. Rao

GrAPL Workshop @ IPDPS May 20th, 2019

Sponsors:

Army Research Lab Network Science CTA

SLIDE 2

  • What is the most powerful+ graph model / representation?
  • How can we make model learning$ tractable*?
  • How can we make model learning$ scalable?

+ powerful → expressive    * tractable → works on small graphs    $ learning → learning and inference

SLIDE 3

SLIDE 4

Examples: social graphs, biological graphs, molecules, the Web, ecological graphs.

A graph: G = (V, E)

SLIDE 5

Undirected graph G(V, E), with vertices/nodes 1–8 joined by edges, and adjacency matrix

A = ⎡ 0 1 0 0 0 0 1 0 ⎤
    ⎢ 1 0 1 0 0 0 0 0 ⎥
    ⎢ 0 1 0 1 0 0 0 1 ⎥
    ⎢ 0 0 1 0 1 1 0 1 ⎥
    ⎢ 0 0 0 1 0 1 0 0 ⎥
    ⎢ 0 0 0 1 1 0 1 0 ⎥
    ⎢ 1 0 0 0 0 1 0 1 ⎥
    ⎣ 0 0 1 1 0 0 1 0 ⎦

P(A): the probability of sampling A (this graph).

Node labels are arbitrary (the figure highlights rows A_{1·}, A_{2·} and columns A_{·1}, A_{·2}).

SLIDE 6

SLIDE 7

Consider a sequence of n random variables X_1, …, X_n, X_i ∈ Ω, with joint probability distribution P.

Sequence example: "The quick brown fox jumped over the lazy dog"

P(X_1 = the, X_2 = quick, …, X_9 = dog)

The joint probability is just a function P: Ω^n → [0,1] (with normalization), where Ω is countable.
  • P takes an ordered sequence and outputs a value between zero and one
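To make this concrete, here is a minimal sketch (toy numbers of my own, not from the talk) of a joint probability over ordered sequences, factored by the chain rule into bigram conditionals:

```python
# A minimal sketch: P(x_1, ..., x_n) = P(x_1) * P(x_2 | x_1) * ... * P(x_n | x_{n-1}).
# The vocabulary and probabilities below are made-up toy values.

unigram = {"the": 0.5, "quick": 0.2, "fox": 0.3}   # toy P(X_1)
bigram = {                                         # toy P(X_i | X_{i-1})
    ("the", "quick"): 0.4, ("the", "fox"): 0.6,
    ("quick", "fox"): 0.9, ("quick", "the"): 0.1,
    ("fox", "the"): 0.7, ("fox", "quick"): 0.3,
}

def joint_probability(sequence):
    """P: Omega^n -> [0, 1], defined on ordered sequences."""
    p = unigram[sequence[0]]
    for prev, curr in zip(sequence, sequence[1:]):
        p *= bigram[(prev, curr)]
    return p

# Order matters: the same multiset of words gets different probabilities.
print(joint_probability(["the", "quick", "fox"]))   # 0.5 * 0.4 * 0.9 = 0.18
print(joint_probability(["quick", "the", "fox"]))   # 0.2 * 0.1 * 0.6 = 0.012
```

The order-dependence shown in the last two lines is exactly what the multiset definition on the next slide removes.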

SLIDE 8

Consider a set of n random variables (representing a multiset): X_1, …, X_n. How should we define their joint probability distribution? Recall: the probability function P: Ω^n → [0,1] above is order-dependent.

Definition: For multisets, the probability function P is such that

P(X_1, …, X_n) = P(X_{π(1)}, …, X_{π(n)})

is true for any permutation π of (1,…,n).

Useful references: Diaconis (Synthese, 1977), Finite forms of de Finetti's theorem on exchangeability; Murphy et al. (ICLR 2019), Janossy Pooling: Learning Deep Permutation-Invariant Functions for Variable-Size Inputs.
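A minimal sketch of this definition (the base function below is a made-up stand-in): any order-dependent function can be made permutation-invariant by averaging it over all permutations of its input, the idea behind Janossy pooling (Murphy et al., ICLR 2019):

```python
# Symmetrize a toy order-dependent function by averaging over all
# permutations of the input; invariance then holds by construction.
from itertools import permutations
from math import factorial

def p_ordered(xs):
    # Toy order-dependent stand-in for a probability function on sequences.
    weights = {"the": 0.5, "quick": 0.2, "fox": 0.3}
    p = 1.0
    for i, x in enumerate(xs):
        p *= weights[x] ** (i + 1)   # position-dependent: order-sensitive
    return p

def p_multiset(xs):
    # Average over every permutation pi of (1, ..., n).
    return sum(p_ordered(perm) for perm in permutations(xs)) / factorial(len(xs))

assert p_ordered(("the", "fox")) != p_ordered(("fox", "the"))   # order-sensitive
assert abs(p_multiset(("the", "fox")) - p_multiset(("fox", "the"))) < 1e-12
```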

SLIDE 9

Multiset examples: point clouds (e.g., lidar maps), bags of words, our friends, the neighbors of a node.

An extension: set-of-sets (Meng et al., KDD 2019).

SLIDE 10

Consider an array of n² random variables X_{ij} ∈ Ω and P: Ω^{n×n} → [0,1] such that for any permutation π of (1,…,n)

P(X_{11}, X_{12}, X_{21}, …, X_{nn}) = P(X_{π(1)π(1)}, X_{π(1)π(2)}, X_{π(2)π(1)}, …, X_{π(n)π(n)})

Then P is a model of a graph with n vertices, where X_{ij} ∈ Ω are edge attributes (e.g., weights).
  • For each graph, P assigns a probability
  • Trivial to add node attributes to the definition

If Ω = {0,1}, then P is a probability distribution over adjacency matrices.
  • Most statistical graph models can be represented this way

SLIDE 11

The same undirected graph G(V, E) and adjacency matrix A as on Slide 5, with arbitrary node labels.

Example: π = (2,1,3,4,5,6,7,8)

P(A) = P(A_{ππ})

The graph model is invariant to permutations π.
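A quick numerical check of this invariance, using a toy edge-independent Bernoulli (Erdős–Rényi-style) model as a stand-in for P (an illustrative assumption, not the talk's model):

```python
# Permuting rows and columns of A jointly does not change the probability
# an edge-independent model assigns to the graph.
import numpy as np

def log_p(A, q=0.3):
    # Each edge present independently with probability q (undirected graph,
    # so only the upper triangle of A carries information).
    iu = np.triu_indices(A.shape[0], k=1)
    e = A[iu]
    return float(np.sum(e * np.log(q) + (1 - e) * np.log(1 - q)))

rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(8, 8))
A = np.triu(A, 1)
A = A + A.T                                   # random undirected graph

pi = rng.permutation(8)                       # a relabeling of the nodes
A_perm = A[np.ix_(pi, pi)]                    # A_pipi: permute rows & columns

assert np.isclose(log_p(A), log_p(A_perm))    # invariant to node labels
```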

SLIDE 12

SLIDE 13

Invariances have deep implications in nature.
  • Noether's (first) theorem (1918): invariances ⇒ conservation laws. E.g., time-translation invariance ⇒ energy conservation; space-translation invariance ⇒ momentum conservation.

The study of probabilistic invariances (symmetries) has a long history.
  • Laplace's "rule of succession" dates to 1774 (Kallenberg, 2005)
  • Maxwell's work in statistical mechanics (1875) (Kallenberg, 2005)
  • Permutation invariance for infinite sets:
      – de Finetti's theorem (de Finetti, 1930)
      – A special case of the ergodic decomposition theorem, related to integral decompositions (see Orbanz and Roy (2015) for a good overview)
  • Kallenberg (2005) & (2017): the de facto references on probabilistic invariances
SLIDE 14

Aldous, D. J. Representations for partially exchangeable arrays of random variables. J. Multivar. Anal., 1981.

SLIDE 15

Consider an infinite array of random variables X_{ij} ∈ Ω such that

P(X_{11}, X_{12}, …) = P(X_{π(1)π(1)}, X_{π(1)π(2)}, …)

is true for any permutation π of the positive integers. Then

P(X_{11}, X_{12}, …) ∝ ∫_{U_1 ∈ [0,1]} ⋯ ∫_{U_∞ ∈ [0,1]} ∏_{ij} P(X_{ij} | U_i, U_j)

is a mixture model with U_1, U_2, … ∼ Uniform(0,1).

(The Aldous–Hoover representation is sufficient only for infinite graphs.)
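A minimal sampling sketch of this mixture construction: one latent U_i ∼ Uniform(0,1) per node, then each edge drawn independently given (U_i, U_j). The kernel W below is an arbitrary assumption, chosen only to show the latent uniforms driving edge probabilities:

```python
import numpy as np

def W(u, v):
    # Toy symmetric kernel: nodes with similar latents connect more often.
    return np.exp(-3.0 * abs(u - v))

def sample_graph(n, rng):
    U = rng.uniform(0, 1, size=n)              # latent uniforms, one per node
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            A[i, j] = A[j, i] = int(rng.random() < W(U[i], U[j]))
    return A, U

A, U = sample_graph(8, np.random.default_rng(1))
print(A)
```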

SLIDE 16

SLIDE 17

Relationship between deterministic functions and probability distributions.

Noise outsourcing:
  • A tool from measure theory
  • Any conditional probability P(Z|Y) can be represented* as Z = h(Y, ε), ε ∼ Uniform(0,1), where h is a deterministic function
  • The randomness is entirely outsourced to ε

Representation t(Y):
  • t(Y): a deterministic function that makes Z independent of Y given t(Y)
  • Then ∃ h′ such that (Z, Y) =* (h′(t(Y), ε), Y), ε ∼ Uniform(0,1)

* equality holds almost surely (a.s.)

We call t(Y) a representation of Y. Representations are generalizations of "embeddings".
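A small sketch of noise outsourcing for a concrete conditional (my example, not the slides'): Z | Y ∼ Bernoulli(sigmoid(Y)), rewritten as a deterministic function of Y and a single uniform ε:

```python
# Z = h(Y, eps) with eps ~ Uniform(0, 1); all randomness lives in eps.
import math
import random

def h(y, eps):
    # Deterministic function implementing Z | Y ~ Bernoulli(sigmoid(y)).
    return 1 if eps < 1.0 / (1.0 + math.exp(-y)) else 0

random.seed(0)
samples = [h(0.5, random.random()) for _ in range(100_000)]
print(sum(samples) / len(samples))   # ~ sigmoid(0.5) = 0.6225
```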

SLIDE 18

SLIDE 19

Gaussian Linear Model (each node i represented by a random vector):
  • Node i vector: U_{i·} ∼ Normal(𝟎, σ_U² I)
  • Adjacency matrix: A_{ij} ∼ Normal(U_{i·} U_{j·}ᵀ, σ²)

Q: For a given A, what is the most likely U?
Answer: U⋆ = argmax_U P(A|U), a.k.a. maximum likelihood.

Equivalent optimization, minimizing the negative log-likelihood:

U⋆ = argmin_U ‖A − U Uᵀ‖²₂ + (σ²/σ_U²) ‖U‖²₂

SLIDE 20


That will turn out to be the same

SLIDE 21

Embedding of adjacency matrix A:

A ≈ U Uᵀ = U_{·1} U_{·1}ᵀ + U_{·2} U_{·2}ᵀ + ⋯

where U_{·i} is the i-th column vector of U.

SLIDE 22

Matrix factorization can be used to compute a low-rank representation of A.

A reconstruction problem: find A ≈ U Uᵀ, where U has k columns*, by optimizing

min_U ‖A − U Uᵀ‖²₂ + λ ‖U‖²₂

i.e., the sum of squared errors plus an L2 regularization term with regularization strength λ.

* sometimes we will force orthogonal columns in U
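A minimal numpy sketch of this objective, minimized by plain gradient descent (step size and iteration count are illustrative assumptions):

```python
# min_U ||A - U U^T||_2^2 + lambda ||U||_2^2 via gradient descent.
import numpy as np

def factorize(A, k, lam=0.1, lr=0.01, steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((A.shape[0], k))     # U has k columns
    for _ in range(steps):
        R = A - U @ U.T                                # residual (symmetric)
        grad = -4 * R @ U + 2 * lam * U                # gradient of the objective
        U -= lr * grad
    return U

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
U = factorize(A, k=2)
print(np.round(U @ U.T, 2))                            # low-rank reconstruction
```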

SLIDE 23

SLIDE 24

SLIDE 25

SLIDE 26

SLIDE 27

The Weisfeiler–Lehman (WL) test (Shervashidze et al., 2011): a recursive algorithm to determine if two graphs are isomorphic.

Initialize: h_v is the attribute vector of vertex v ∈ G (if no attribute, assign 1); k = 0

function WL-fingerprints(G):
    while vertex attributes change do:
        k ← k + 1
        for all vertices v ∈ G do:
            h_{k,v} ← hash(h_{k−1,v}, {h_{k−1,u} : ∀u ∈ Neighbors(v)})
    return {h_{k,v} : ∀v ∈ G}

  • A valid isomorphism test for most graphs (Babai and Kucera, 1979)
  • Cai et al. (1992) show examples that cannot be distinguished by it
  • Belongs to the class of color refinement algorithms, which iteratively update vertex "colors" (hash values) until convergence to a unique assignment of hashes to vertices
  • The final hash values encode the structural roles of vertices inside a graph
  • Often fails for graphs with a high degree of symmetry, e.g., chains, complete graphs, tori, and stars
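A compact Python rendering of the loop above; Python's built-in hash over (own label, sorted multiset of neighbor labels) stands in for the hash step, and graphs are given as adjacency lists:

```python
def wl_fingerprints(neighbors, attrs=None):
    # neighbors: dict mapping each vertex to a list of its neighbors
    h = dict(attrs) if attrs else {v: 1 for v in neighbors}  # no attribute -> 1
    while True:
        new_h = {v: hash((h[v], tuple(sorted(h[u] for u in neighbors[v]))))
                 for v in neighbors}
        # Stop once refinement no longer increases the number of colors.
        if len(set(new_h.values())) == len(set(h.values())):
            return new_h
        h = new_h

# A 4-cycle vs. a 4-path: WL distinguishes them via degree refinement.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(sorted(wl_fingerprints(cycle).values()) ==
      sorted(wl_fingerprints(path).values()))   # False: different fingerprints
```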

SLIDE 28

The hardest task for graph representation is:
  • Give different tags to different graphs
  • Isomorphic graphs should have the same tag
  • Task: given adjacency matrix A, predict the tag

Goal: find a representation t(A) such that tag = g(t(A), ε), ε ∼ Uniform(0,1), i.e., noise-outsource P(tag | A).
  • Then t(A) must give:
      – the same representation to isomorphic graphs
      – different representations to non-isomorphic graphs

SLIDE 29

SLIDE 30

Main idea of Graph Neural Networks: use the WL algorithm to compute representations that are related to a task.

Initialize: h_{0,v} = node v's attribute; l = 0

function g⃗(A, W_1, …, W_L, b_1, …, b_L):
    while l < L do:   # L layers
        l ← l + 1
        for all vertices v ∈ V do:
            h_{l,v} = σ(W_l [h_{l−1,v}, A_{v·} h_{l−1}] + b_l)
    return {h_{L,v} : ∀v ∈ V}

The neighborhood aggregation A_{v·} h_{l−1} could be another permutation-invariant function, and Pooling below is a permutation-invariant function (see Murphy et al., ICLR 2019).

Example supervised task: predict label y_j of graph G_j represented by A_j. Optimization for loss L: let θ = (W_1, …, W_L, b_1, …, b_L, W_agg, b_agg), then

θ⋆ = argmin_θ Σ_{j∈Data} L(y_j, W_agg Pooling(g⃗(A_j, W_1, …, W_L, b_1, …, b_L)) + b_agg)
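A minimal numpy sketch of this forward pass, using A @ H as the neighborhood aggregation and ReLU as σ (both illustrative choices; weight shapes are assumptions):

```python
import numpy as np

def gnn_forward(A, H0, Ws, bs):
    """A: (n, n) adjacency; H0: (n, d) node attributes; Ws, bs: layer params."""
    H = H0
    for W, b in zip(Ws, bs):
        M = A @ H                                        # aggregate neighbor states
        H = np.maximum(0.0, np.hstack([H, M]) @ W + b)   # concat, transform, ReLU
    return H                                             # {h_{L,v} : all v}

rng = np.random.default_rng(0)
n, d, L = 5, 4, 2
A = rng.integers(0, 2, (n, n)); A = np.triu(A, 1); A = A + A.T
H0 = rng.standard_normal((n, d))
Ws = [0.1 * rng.standard_normal((2 * d, d)) for _ in range(L)]
bs = [np.zeros(d) for _ in range(L)]
print(gnn_forward(A, H0, Ws, bs).shape)                  # (5, 4)
```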

SLIDE 31

GNN representations can be as expressive as the Weisfeiler–Lehman (WL) isomorphism test (Xu et al., ICLR 2019). But the WL test can sometimes fail, e.g., in a family of circulant graphs.

By construction, the GNN representation g⃗ is guaranteed permutation invariance (equivariance):

g⃗(A) = g⃗(A_{ππ})

SLIDE 32

SLIDE 33

A multilayer perceptron (MLP) is a universal function approximator (Hornik et al., 1989).
  • What about using g⃗_MLP(vec(A))?
  • No! It is permutation-sensitive*:

g⃗_MLP(vec(A)) ≠ g⃗_MLP(vec(A_{ππ})) for some permutation π

* unless the neuron weights are nearly all the same (Maron et al., 2018)

SLIDE 34

Extension: a graph model P over variable-size graphs is an array of n² random variables X_{ij} ∈ Ω, n > 1, and P: Ω^{∪×} → [0,1], where Ω^{∪×} ≡ ∪_{n=2}^∞ Ω^{n×n}, such that for any value of n and any permutation π of (1,…,n)

P(X_{11}, X_{12}, X_{21}, …, X_{nn}) = P(X_{π(1)π(1)}, X_{π(1)π(2)}, X_{π(2)π(1)}, …, X_{π(n)π(n)})

Insight of (Murphy et al., ICML 2019):
  • P is the average of an unconstrained probability function applied over the group defined by the permutation operator π
  • The average is invariant to the group action of a permutation π (see Bloem-Reddy & Teh, 2019)
  • Works for variable-size graphs
SLIDE 35

  • A is a tensor encoding the adjacency matrix & edge attributes
  • X^{(v)} encodes node attributes
  • Π is the set of all permutations of (1,…,|V|), where |V| is the number of vertices
  • g⃗ is any permutation-sensitive function

Theorem 2.1: a necessary and sufficient representation of finite graphs.
  • (details) If g⃗ is a universal approximator (MLPs, RNNs), then ḡ(A) is the most expressive representation of A, where

ḡ(A) = E_π[g⃗(A_{ππ}, X^{(v)}_π)] = (1/|V|!) Σ_{π∈Π} g⃗(A_{ππ}, X^{(v)}_π)

is the average over π ∼ Uniform(Π).
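A brute-force sketch of the construction in Theorem 2.1: average an intentionally permutation-sensitive toy g over all |V|! joint row/column permutations. This is only feasible for tiny graphs, which is exactly why the tractability approximations on the next slide matter. The toy g is an assumption:

```python
import numpy as np
from itertools import permutations
from math import factorial

def g(A):
    # Deliberately permutation-sensitive: weights entries by their position.
    n = A.shape[0]
    pos = np.arange(1, n * n + 1).reshape(n, n)
    return float((A * pos).sum())

def rp(A):
    # g_bar(A) = (1/|V|!) * sum over all pi in Pi of g(A_pipi)
    n = A.shape[0]
    total = sum(g(A[np.ix_(pi, pi)]) for pi in permutations(range(n)))
    return total / factorial(n)

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
pi = [2, 0, 1]
A_perm = A[np.ix_(pi, pi)]
print(g(A), g(A_perm))     # differ: g is permutation-sensitive
print(rp(A), rp(A_perm))   # equal: the average is permutation-invariant
```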

SLIDE 36

Making ḡ(A) ∝ Σ_{π∈Π} g⃗(A_{ππ}, X^{(v)}_π) tractable:

1. Canonical orientation (some order of the vertices), so that canonical(A) = canonical(A_{ππ})

2. k-ary dependencies:
  • Nodes k-by-k independent in g⃗
  • g⃗ considers only the first k nodes of any permutation π

3. Stochastic optimization (proposes π-SGD)

SLIDE 37

Canonical orientation: order nodes with a sort function.
  • E.g., order nodes by PageRank
  • Arrange A with sort(A) (assuming no ties)
  • Note that sort(A) = sort(A_{ππ}) for any permutation π
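A sketch of the canonical-orientation idea, using weighted degree (node strength) as the permutation-invariant sort score in place of PageRank, on a weighted graph chosen so the scores have no ties:

```python
import numpy as np

def canonical(A):
    order = np.argsort(-A.sum(axis=1))       # sort nodes by score, descending
    return A[np.ix_(order, order)]

# Weighted graph chosen so that node strengths (4, 5, 7, 6) are all distinct.
A = np.array([[0, 3, 0, 1],
              [3, 0, 2, 0],
              [0, 2, 0, 5],
              [1, 0, 5, 0]], dtype=float)

rng = np.random.default_rng(2)
pi = rng.permutation(4)
assert (canonical(A) == canonical(A[np.ix_(pi, pi)])).all()
```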

SLIDE 38

k-ary dependencies, illustrated on the undirected graph G(V, E) from Slide 5:
  • Nodes k-by-k independent in g⃗
  • g⃗ considers only the first k nodes of any permutation π, so only the ordered k-node prefixes of the permutations matter

Σ_{π∈Π} g⃗(A_{ππ}) = g⃗(A_{ππ}) for π = (1,2,3,4,5,6,7,8)
                  + g⃗(A_{ππ}) for π = (2,1,3,4,5,6,7,8)
                  + ⋯
                  + g⃗(A_{ππ}) for π = (2,1,3,4,5,6,8,7)
                  + ⋯

SLIDE 39

SGD: standard Stochastic Gradient Descent
1. Sample a batch of n training examples
2. Compute gradients (backpropagation using the chain rule)
3. Update the model following the negative gradient (one gradient descent step)
4. GOTO 1

π-SGD (as fast as SGD per gradient step)
1. Sample a batch of training examples
2. For each example y^{(k)} in the batch, sample one permutation π^{(k)}
3. Perform a forward pass over the examples with the single sampled permutation
4. Compute gradients (backpropagation using the chain rule)
5. Update the model following the negative gradient (one gradient descent step)
6. GOTO 1
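The π-SGD inner loop as a sketch: the model, loss, and grad_fn below are placeholders (assumptions); only the one-permutation-per-example pattern follows the slide:

```python
import numpy as np

def pi_sgd_step(batch, params, grad_fn, lr=1e-2, rng=None):
    rng = rng or np.random.default_rng()
    grads = np.zeros_like(params)
    for A, y in batch:                        # 1-2. for each example in batch,
        pi = rng.permutation(A.shape[0])      #      sample ONE permutation
        A_perm = A[np.ix_(pi, pi)]            # 3. forward pass on A_pipi
        grads += grad_fn(params, A_perm, y)   # 4. backprop on that permutation
    return params - lr * grads / len(batch)   # 5. one gradient descent step
```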

SLIDE 40

Proposition 2.1 (Murphy et al., ICML 2019):
  • π-SGD behaves just like SGD
  • If the loss is MSE, cross-entropy, or negative log-likelihood, then π-SGD minimizes an upper bound of the loss
  • However, the solution π-SGD converges to is not the solution of SGD
  • But it is still a valid graph representation

SLIDE 41

Consider a GNN g⃗, e.g., the GIN of (Xu et al., ICLR 2019).
  • By definition, g⃗ is insensitive to permutations.

Let's make g⃗ sensitive to permutations by adding a node id (label) as a unique node feature,
  • and use RP to make the entire representation insensitive to permutations (learnt approximately via π-SGD).

Tasks: classifying circulant graphs; molecular classification.

SLIDE 42

  • g⃗ can be a logistic model (logistic regression)
  • g⃗ can be a Recurrent Neural Network (RNN)
  • g⃗ can be a Convolutional Neural Network (CNN): treat A as an image

These are all valid graph representations in Relational Pooling (RP).

SLIDE 43

The Relational Pooling (RP) framework gives a new class of graph representations and models:

RP: ḡ(A) ∝ Σ_{π∈Π} g⃗(A_{ππ}, X^{(v)}_π)

  • Until now, g⃗ has been hand-designed to be permutation-invariant: such a g⃗ needs no permutation averaging, is always permutation-invariant, and is learnt exactly.
  • (Murphy et al., ICML 2019): in RP, g⃗ can be permutation-sensitive, which allows more expressive models; "any" g⃗ is made permutation-invariant by RP. ḡ(A) is learnt approximately via π-SGD, and Monte Carlo inference of ḡ(A) is approximately permutation-invariant.
  • Trade-off: RP models can only be learnt approximately.

Thank You!    @brunofmr    ribeiro@cs.purdue.edu

SLIDE 44

1. Murphy, R. L., Srinivasan, B., Rao, V., and Ribeiro, B. Janossy Pooling: Learning Deep Permutation-Invariant Functions for Variable-Size Inputs. ICLR 2019.
2. Murphy, R. L., Srinivasan, B., Rao, V., and Ribeiro, B. Relational Pooling for Graph Representations. ICML 2019.
3. Meng, C., Yang, J., Ribeiro, B., and Neville, J. HATS: A Hierarchical Sequence-Attention Framework for Inductive Set-of-Sets Embeddings. KDD 2019.
4. de Finetti, B. Funzione caratteristica di un fenomeno aleatorio. Mem. R. Acc. Lincei, 1930.
5. Aldous, D. J. Representations for partially exchangeable arrays of random variables. J. Multivar. Anal., 1981.
6. Diaconis, P. and Janson, S. Graph limits and exchangeable random graphs. Rend. di Mat. e delle sue Appl., Ser. VII, 28:33–61, 2008.
7. Kallenberg, O. Probabilistic Symmetries and Invariance Principles. Springer, 2005.
8. Kallenberg, O. Random Measures, Theory and Applications. Springer International Publishing, 2017.
9. Diaconis, P. Finite forms of de Finetti's theorem on exchangeability. Synthese, 1977.
10. Orbanz, P. and Roy, D. M. Bayesian models of graphs, arrays and other exchangeable random structures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):437–461, 2015.
11. Bloem-Reddy, B. and Teh, Y. W. Probabilistic symmetry and invariant neural networks. arXiv:1901.06082, 2019.
12. Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.
13. Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
14. Cai, J.-Y., Fürer, M., and Immerman, N. An optimal lower bound on the number of variables for graph identification. Combinatorica, 12(4):389–410, 1992.
15. Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K., and Borgwardt, K. M. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 2011.
16. Maron, H., Ben-Hamu, H., Shamir, N., and Lipman, Y. Invariant and equivariant graph networks. arXiv:1812.09902, 2018.
17. Maron, H., Fetaya, E., Segol, N., and Lipman, Y. On the universality of invariant networks. arXiv:1901.09342, 2019.