Probabilistic symmetry and invariant neural networks
Benjamin Bloem-Reddy, University of Oxford Work with Yee Whye Teh 14 January 2019, UBC Computer Science
Outline
• Symmetry in neural networks
• Symmetry in statistical models
• General invariance and equivariance theorems
2 / 27
Deep learning and statistics
settings.
semi-/unsupervised domains.
3 / 27
Symmetric neural networks
fℓ,i = σ( ∑_{j=1}^n w(ℓ)i,j fℓ−1,j )
For input X and output Y, model Y = h(X), where h ∈ H is a neural network. If X and Y are assumed to satisfy a symmetry property, how is H restricted?
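As a point of reference, here is a minimal numpy sketch of the unconstrained layer above (names, shapes, and the choice σ = tanh are illustrative, not from the talk); nothing about W respects any symmetry of the input:

```python
import numpy as np

def dense_layer(f_prev, W, sigma=np.tanh):
    # f_{l,i} = sigma( sum_j W[i, j] * f_{l-1, j} ): one fully connected layer
    return sigma(W @ f_prev)

rng = np.random.default_rng(0)
f0 = rng.normal(size=4)        # input features X
W1 = rng.normal(size=(4, 4))   # unconstrained weights: no symmetry encoded
print(dense_layer(f0, W1))
```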
4 / 27
Symmetric neural networks
Convolutional neural networks encode translation invariance:
Illustration from medium.freecodecamp.org
5 / 27
Why symmetry?
Encoding symmetry in network architecture is a Good Thing∗: stabler training and better generalization through weight sharing and a smaller hypothesis class.
Historical note: Interest in invariant neural networks goes back at least to Minsky and Papert [MP88]; extended by Shawe-Taylor and Wood [Sha89; WS96]. More recent work by a host of others.
6 / 27
Neural networks for permutation-invariant data [Zah+17]
Consider a sequence Xn := (X1, . . . , Xn), Xi ∈ X. Permutation invariance: Y = h(Xn) = h(π · Xn) for all π ∈ Sn.
[Diagram: inputs X1, …, X4 all feeding a single output Y.]
Y = h(Xn) → Y = ˜h( ∑_{i=1}^n φ(Xi) )
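A minimal sketch of this sum-pooling architecture of [Zah+17], with toy stand-ins for φ and ˜h; the assert checks permutation invariance numerically:

```python
import numpy as np

def phi(x):
    # per-element embedding; any map X -> R^d would do here
    return np.array([x, x**2])

def invariant_net(xs, h_tilde=np.tanh):
    # Y = h~( sum_i phi(X_i) ): summing over elements erases the ordering
    return h_tilde(sum(phi(x) for x in xs))

rng = np.random.default_rng(1)
xs = rng.normal(size=5)
perm = rng.permutation(5)
assert np.allclose(invariant_net(xs), invariant_net(xs[perm]))
```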
7 / 27
Neural networks for permutation-invariant data [Zah+17]
Equivariance: Yn = h(Xn) such that h(π · Xn) = π · h(Xn) for all π ∈ Sn.
[Diagram: inputs X1, …, X4 mapped to outputs Y1, …, Y4, one output per input.]
[h(Xn)]i = σ( ∑_{j=1}^n wi,j Xj ) → [h(Xn)]i = σ( w0 Xi + w1 ∑_{j=1}^n Xj )
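The same weight-tying trick in numpy (a sketch; w0, w1, and σ are placeholder choices): permuting the input permutes the output in the same way.

```python
import numpy as np

def equivariant_layer(xs, w0, w1, sigma=np.tanh):
    # [h(X_n)]_i = sigma( w0 * X_i + w1 * sum_j X_j ): tying the weights
    # replaces the full matrix w_{i,j} and makes the layer commute with S_n
    return sigma(w0 * xs + w1 * xs.sum())

rng = np.random.default_rng(2)
xs = rng.normal(size=6)
perm = rng.permutation(6)
ys = equivariant_layer(xs, w0=0.5, w1=-0.1)
# permuting the input permutes the output the same way
assert np.allclose(equivariant_layer(xs[perm], 0.5, -0.1), ys[perm])
```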
8 / 27
Neural networks for permutation-invariant data
. . .
9 / 27
⟨⟨Deep learning hat, off; statistics hat, on⟩⟩
Note to students: These were the first Google Image results for “deep learning hat” and “statistics hat”. You could probably make some money making decent hats.
10 / 27
Statistical models and symmetry
Consider a sequence Xn := (X1, . . . , Xn), Xi ∈ X. A statistical model of Xn is a family of probability distributions on X^n: P = {Pθ : θ ∈ Ω}. If Xn is assumed to satisfy a symmetry property, how is P restricted?
11 / 27
Exchangeable sequences
A distribution P on X^n is exchangeable if P(X1, . . . , Xn) = P(Xπ(1), . . . , Xπ(n)) for all π ∈ Sn. XN is infinitely exchangeable if this is true for every prefix Xn ⊂ XN, n ∈ N.
de Finetti’s theorem: XN exchangeable ⇐⇒ Xi | Q ∼iid Q for some random Q.
Implication for Bayesian inference: our models for XN need only consist of i.i.d. distributions on X. Analogous theorems hold for other symmetries. The book by Kallenberg [Kal05] collects many of them. Some other accessible references: [Dia88; OR15].
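A toy simulation of the de Finetti picture (the Gaussian mixing distribution here is an arbitrary choice for illustration):

```python
import numpy as np

def sample_exchangeable(n, rng):
    # de Finetti: draw a random directing measure Q, then X_i | Q iid ~ Q.
    # Here Q = Normal(mu, 1) with a random mean mu, a toy mixing distribution.
    mu = rng.normal()
    return rng.normal(loc=mu, size=n)

rng = np.random.default_rng(3)
xs = sample_exchangeable(5, rng)
# any permutation of xs has the same joint distribution as xs itself
```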
12 / 27
Finite exchangeable sequences
de Finetti’s theorem may fail for finite exchangeable sequences. What else can we say? The empirical measure of Xn is
MXn( • ) := ∑_{i=1}^n δXi( • ) .
13 / 27
Finite exchangeable sequences
The empirical measure is a sufficient statistic: P is exchangeable iff P(Xn ∈ • | MXn = m) = Um( • ), where Um is the uniform distribution on all sequences (x1, . . . , xn) with empirical measure m.
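Concretely (a sketch; representing MXn as a sorted list is just one convenient encoding of the multiset):

```python
import numpy as np

def empirical_measure(xs):
    # M_{X_n} = sum_i delta_{X_i}: the multiset of values, order forgotten
    return sorted(xs.tolist())

def sample_U_m(m, rng):
    # U_m: given M_{X_n} = m, X_n is a uniformly random ordering of m
    return rng.permutation(np.array(m))

rng = np.random.default_rng(4)
xs = rng.normal(size=5)
m = empirical_measure(xs)
xs_new = sample_U_m(m, rng)
assert empirical_measure(xs_new) == m   # same empirical measure, new order
```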
Consider Y such that (π · Xn, Y) =d (Xn, Y). The empirical measure is an adequate statistic for any such Y: P(Y ∈ • | Xn) = P(Y ∈ • | MXn). That is, MXn contains all information in Xn that is relevant for predicting Y.
14 / 27
A useful theorem
Theorem (Invariant representation; B-R, Teh). Suppose Xn is an exchangeable sequence. Then (π · Xn, Y) =d (Xn, Y) for all π ∈ Sn if and only if there is a measurable function ˜h : [0, 1] × M(X) → Y such that (Xn, Y) =a.s. (Xn, ˜h(η, MXn)), where η ∼ Unif[0, 1] and η ⊥⊥ Xn.
Deterministic invariance [Zah+17] → stochastic invariance [B-R, Teh]
[Diagram: left, X1, …, X4 feeding Y; right, the same network with an additional noise input η feeding Y.]
Y = ˜h( ∑_{i=1}^n φ(Xi) ) → Y = ˜h( η, ∑_{i=1}^n δXi )
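A sketch of the stochastic version (the pooled summary and the final map are arbitrary stand-ins for MXn and ˜h): conditioning on the input, the law of Y is unchanged by permutations.

```python
import numpy as np

def stochastic_invariant(xs, rng):
    # Y = h~(eta, M_{X_n}): the function sees only outsourced noise eta and a
    # permutation-invariant summary of X_n, so the conditional law of Y is
    # invariant by construction
    eta = rng.uniform()                             # eta ~ Unif[0,1], independent of X_n
    summary = np.array([xs.sum(), (xs**2).sum()])   # stand-in for M_{X_n}
    return np.tanh(summary @ np.array([0.3, -0.2]) + eta)

rng = np.random.default_rng(10)
print(stochastic_invariant(rng.normal(size=5), rng))
```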
15 / 27
Another useful theorem
Theorem (Equivariant representation; B-R, Teh). Suppose Xn is an exchangeable sequence and Yi ⊥⊥Xn (Yn \ Yi) (i.e., the Yi are conditionally independent given Xn). Then (π · Xn, π · Yn) =d (Xn, Yn) for all π ∈ Sn if and only if there is a measurable function ˜h : [0, 1] × X × M(X) → Y such that (Xn, Yn) =a.s. (Xn, (˜h(ηi, Xi, MXn))i∈[n]), where ηi ∼iid Unif[0, 1] and (ηi)i∈[n] ⊥⊥ Xn.
Deterministic equivariance [Zah+17] → stochastic equivariance [B-R, Teh]
[Diagram: left, X1, …, X4 mapped to Y1, …, Y4; right, the same with per-output noise inputs η1, …, η4.]
Yi = σ( w0 Xi + w1 ∑_{j=1}^n Xj ) → Yi = ˜h( ηi, Xi, ∑_{j=1}^n δXj ), with the deterministic layer recovered as the special case ˜h(η, x, M) = σ( w0 x + w1 ∫ x′ M(dx′) ).
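And a sketch of the stochastic equivariant layer (again with toy choices inside ˜h): each output gets its own noise ηi plus its own input and the pooled summary.

```python
import numpy as np

def stochastic_equivariant(xs, rng, w0=0.5, w1=-0.1):
    # Y_i = h~(eta_i, X_i, M_{X_n}) with eta_i iid Unif[0,1]: permuting X_n
    # permutes the conditional law of (Y_1, ..., Y_n) the same way
    etas = rng.uniform(size=xs.shape)   # one noise variable per element
    return np.tanh(w0 * xs + w1 * xs.sum() + etas)

rng = np.random.default_rng(11)
print(stochastic_equivariant(rng.normal(size=5), rng))
```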
16 / 27
Outline
• Symmetry in neural networks
• Symmetry in statistical models
• General invariance and equivariance theorems
17 / 27
A bit of group theory
For a group G acting on a set X:
• The orbit of x ∈ X is G · x = {g · x : g ∈ G}.
• A maximal invariant is a map M : X → S that (i) is constant on each orbit, i.e., M(g · x) = M(x) for all g ∈ G and x ∈ X; and (ii) takes a different value on each orbit, i.e., M(x1) = M(x2) implies x1 = g · x2 for some g ∈ G.
• A maximal equivariant is a map τ : X → G satisfying τ(g · x) = g · τ(x) for all g ∈ G, x ∈ X.
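For G = Sn acting on R^n by permuting coordinates, sorting gives a maximal invariant; a quick numerical check (a sketch):

```python
import numpy as np

def maximal_invariant(x):
    # sorting is constant on each S_n-orbit and separates distinct orbits
    return np.sort(x)

rng = np.random.default_rng(5)
x = rng.normal(size=5)
g_x = rng.permutation(x)     # another point on the same orbit
assert np.allclose(maximal_invariant(x), maximal_invariant(g_x))
```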
18 / 27
A general invariance theorem
Theorem (B-R, Teh). Let G be a compact group and assume that g · X =d X for all g ∈ G. Let M : X → S be a maximal invariant. Then (g · X, Y) =d (X, Y) for all g ∈ G if and only if there exists a measurable function ˜h : [0, 1] × S → Y such that (X, Y) =a.s. (X, ˜h(η, M(X))), with η ∼ Unif[0, 1] and η ⊥⊥ X.
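Instantiating the theorem for G = Sn (a sketch; the inner function is an arbitrary measurable map): any Y built from (η, M(X)) alone is conditionally invariant.

```python
import numpy as np

def h_tilde(eta, s):
    # any measurable function of (eta, M(X)) works; here M(X) = sort(X)
    return np.tanh(s.sum() + eta)

rng = np.random.default_rng(6)
x = rng.normal(size=5)
y = h_tilde(rng.uniform(), np.sort(x))   # same conditional law if x is permuted first
```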
19 / 27
Proof by picture
20 / 27
A general equivariance theorem
Theorem (Kallenberg; B-R, Teh). Let G be a compact group and assume that g · X =d X for all g ∈ G. Assume that a maximal equivariant τ : X → G exists. Then (g · X, g · Y) =d (X, Y) for all g ∈ G if and only if there exists a measurable function ˜h : [0, 1] × X → Y such that (X, Y) =a.s. (X, ˜h(η, X)), with η ∼ Unif[0, 1] and η ⊥⊥ X, where ˜h is equivariant: ˜h(η, g · X) =a.s. g · ˜h(η, X) for all g ∈ G.
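A sketch for the cyclic group Cn acting on R^n by rotation: τ(x) = the shift moving the (a.s. unique) argmax to position 0 is a maximal equivariant, and an equivariant ˜h can be built by canonicalize, transform, act back.

```python
import numpy as np

def tau(x):
    # maximal equivariant for C_n: the rotation taking the argmax to index 0
    return int(np.argmax(x))

def equivariant_h(x, eta, f=np.tanh):
    # canonicalize with tau(x)^{-1}, apply an arbitrary function of
    # (eta, canonical x), then act by tau(x) again
    k = tau(x)
    canonical = np.roll(x, -k)          # argmax now sits at position 0
    return np.roll(f(canonical + eta), k)

rng = np.random.default_rng(7)
x = rng.normal(size=6)
eta = rng.uniform()
g = 2                                   # an arbitrary rotation in C_6
lhs = equivariant_h(np.roll(x, g), eta)
rhs = np.roll(equivariant_h(x, eta), g)
assert np.allclose(lhs, rhs)            # h~(eta, g . x) = g . h~(eta, x)
```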
21 / 27
Proof by picture
22 / 27
Some answers
special cases.
23 / 27
Many questions
• … symmetry (though they typically have a set of symmetry transformations): what are the analogous results? Are they useful?
• … in this context it amounts to the difference between deterministic invariance and distributional invariance: can we prove anything rigorous in these settings?
• Can we put the claim from earlier (that encoding symmetry in networks is a Good Thing) on rigorous footing?
24 / 27
[Aus13] Tim Austin. “Exchangeable random arrays”. Lecture notes for IISc. 2013. URL: http://www.math.ucla.edu/~tim/ExchnotesforIISc.pdf.
[Coh+18] Taco S. Cohen et al. “Spherical CNNs”. In: International Conference on Learning Representations. 2018.
[CW16] Taco S. Cohen and Max Welling. “Group Equivariant Convolutional Networks”. In: Proceedings of the 33rd International Conference on Machine Learning. Ed. by Maria Florina Balcan and Kilian Q. Weinberger. Vol. 48. Proceedings of Machine Learning Research. New York, New York, USA: PMLR, 2016, pp. 2990–2999. URL: http://proceedings.mlr.press/v48/cohenc16.html.
[Dia88] Persi Diaconis. “Recent progress on de Finetti’s notions of exchangeability”. In: Bayesian Statistics 3. Oxford University Press, 1988, pp. 111–125.
[GD14] Robert Gens and Pedro M. Domingos. “Deep Symmetry Networks”. In: Advances in Neural Information Processing Systems 27. Ed. by Z. Ghahramani et al. Curran Associates, Inc., 2014. URL: http://papers.nips.cc/paper/5424-deep-symmetry-networks.pdf.
[Har+18] Jason Hartford et al. “Deep Models of Interactions Across Sets”. In: Proceedings of the 35th International Conference on Machine Learning. Ed. by Jennifer Dy and Andreas Krause. PMLR, 2018.
[Her+18] Roei Herzig et al. “Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction”. In: Advances in Neural Information Processing Systems 31. Ed. by S. Bengio et al. Curran Associates, Inc., 2018, pp. 7211–7221.
[Kal05] Olav Kallenberg. Probabilistic Symmetries and Invariance Principles. Springer, 2005.
25 / 27
[KT18] Risi Kondor and Shubhendu Trivedi. “On the Generalization of Equivariance and Convolution in Neural Networks to the Action of Compact Groups”. In: Proceedings of the 35th International Conference on Machine Learning. Ed. by Jennifer Dy and Andreas Krause. PMLR, 2018, pp. 2747–2755.
[MP88] Marvin L. Minsky and Seymour A. Papert. Perceptrons: Expanded Edition. Cambridge, MA, USA: MIT Press, 1988.
[OR15] Peter Orbanz and Daniel M. Roy. “Bayesian Models of Graphs, Arrays and Other Exchangeable Random Structures”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 37.2 (Feb. 2015), pp. 437–461.
[RSP17] Siamak Ravanbakhsh, Jeff Schneider, and Barnabás Póczos. “Equivariance Through Parameter-Sharing”. In: Proceedings of the 34th International Conference on Machine Learning. PMLR, 2017.
[Sha89] John Shawe-Taylor. “Building symmetries into feedforward networks”. In: 1989 First IEE International Conference on Artificial Neural Networks (Conf. Publ. No. 313). Oct. 1989.
[WS96] Jeffrey Wood and John Shawe-Taylor. “Representation theory and invariant neural networks”. In: Discrete Applied Mathematics 69.1 (1996), pp. 33–60.
[Zah+17] Manzil Zaheer et al. “Deep Sets”. In: Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 2017.
26 / 27
Symmetric neural networks
Recent work generalizes the idea to other symmetries and data, e.g., [CW16; Coh+18; GD14; RSP17; Har+18; Her+18; KT18].
26 / 27
A useful tool: noise outsourcing (e.g., [Aus13])
If X and Y are random variables in “nice” (e.g., Borel) spaces X and Y, then there are a random variable η ∼ Unif[0, 1] and a measurable function h : [0, 1] × X → Y such that η ⊥⊥ X and (X, Y) = (X, h(η, X)) a.s. Can show that if S(X) is adequate for Y, then (X, Y) = (X, ˜h(η, S(X))) a.s.
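For a concrete one-dimensional case (a sketch; scipy is assumed only for the Gaussian inverse CDF): if Y | X = x ∼ Normal(x, 1), pushing uniform noise through the conditional inverse CDF realizes h.

```python
import numpy as np
from scipy.stats import norm

def h(eta, x):
    # noise outsourcing for Y | X = x ~ Normal(x, 1)
    return x + norm.ppf(eta)

rng = np.random.default_rng(8)
x = rng.normal()
eta = rng.uniform()      # eta ~ Unif[0,1], independent of X
y = h(eta, x)            # (x, y) realizes the desired joint law
```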
27 / 27