

  1. Infinite Models. Zoubin Ghahramani, Center for Automated Learning and Discovery, Carnegie Mellon University, http://www.cs.cmu.edu/~zoubin. Feb 2002. With Carl E. Rasmussen and Matthew J. Beal, Gatsby Computational Neuroscience Unit, University College London, http://www.gatsby.ucl.ac.uk/

  2. Two conflicting Bayesian views? View 1: Occam's Razor. Bayesian learning automatically finds the optimal model complexity given the available amount of data, since Occam's Razor is an integral part of Bayes [Jefferys & Berger; MacKay]. Occam's Razor discourages overcomplex models. View 2: Large models. There is no statistical reason to constrain models; use large models (no matter how much data you have) [Neal] and pursue the infinite limit if you can [Neal; Williams, Rasmussen]. Both views require averaging over all model parameters, yet they seem contradictory. For example, should we use Occam's Razor to find the "best" number of hidden units in a feedforward neural network, or simply use as many hidden units as we can manage computationally?

  3. View 1: Finding the "best" model complexity. Select the model class with the highest probability given the data:
P(M_i | Y) = P(Y | M_i) P(M_i) / P(Y),   where   P(Y | M_i) = ∫ P(Y | θ_i, M_i) P(θ_i | M_i) dθ_i.
Interpretation: the evidence P(Y | M_i) is the probability that randomly selected parameter values from the model class would generate data set Y. Model classes that are too simple are unlikely to generate the data set. Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random.
[Figure: P(Y | M_i) plotted against all possible data sets Y for a model that is too simple, one that is "just right", and one that is too complex.]

  4. Bayesian Model Selection: Occam's Razor at Work. [Figure: panels showing polynomials of order M = 0, ..., 7 fitted to the same small data set, together with a bar plot of the model evidence P(Y|M) for M = 0, ..., 7.]
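The evidence calculation behind a demo like this can be reproduced in a few lines. Below is a minimal sketch (my own illustration, not the original slide code) using Bayesian linear regression with polynomial features, a zero-mean Gaussian prior on the weights, and Gaussian observation noise; the prior scale alpha, the noise level sigma, and the synthetic cubic data are all assumed choices. Marginalising the weights makes P(Y|M) a Gaussian that can be evaluated exactly.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Synthetic data from a cubic with small noise (illustrative choice).
N, sigma = 15, 0.1
x = np.linspace(-1, 1, N)
w_true = np.array([0.5, -1.0, 0.0, 2.0])           # degree-3 polynomial
y = np.vander(x, 4, increasing=True) @ w_true + sigma * rng.standard_normal(N)

def log_evidence(M, alpha=1.0):
    """ln P(Y | M) for y = Phi w + eps, w ~ N(0, alpha^2 I), eps ~ N(0, sigma^2 I).
    Marginalising the weights gives y ~ N(0, alpha^2 Phi Phi^T + sigma^2 I)."""
    Phi = np.vander(x, M + 1, increasing=True)      # columns x^0, ..., x^M
    cov = alpha**2 * Phi @ Phi.T + sigma**2 * np.eye(N)
    return multivariate_normal(mean=np.zeros(N), cov=cov).logpdf(y)

for M in range(8):
    print(f"M = {M}:  ln P(Y|M) = {log_evidence(M):8.2f}")
# Models that are too simple cannot explain the data; models that are too
# complex spread their probability over many possible data sets, so the
# evidence tends to peak near the generating order.
```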

  5. Lower Bounding the Evidence: Variational Bayesian Learning. Let the hidden states be x, the data y and the parameters θ. We can lower bound the evidence (Jensen's inequality):
ln P(y | M) = ln ∫ dx dθ P(y, x, θ | M)
            = ln ∫ dx dθ Q(x, θ) P(y, x, θ) / Q(x, θ)
            ≥ ∫ dx dθ Q(x, θ) ln [ P(y, x, θ) / Q(x, θ) ].
Use a simpler, factorised approximation Q_x(x) Q_θ(θ) to Q(x, θ):
ln P(y) ≥ ∫ dx dθ Q_x(x) Q_θ(θ) ln [ P(y, x, θ) / (Q_x(x) Q_θ(θ)) ] = F(Q_x(x), Q_θ(θ), y).
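As a sanity check on the bound, here is a tiny sketch (my own toy example, not from the slides) for a model with a single Gaussian parameter and no hidden states: θ ~ N(0, 1), y | θ ~ N(θ, 1). The exact evidence is P(y) = N(y; 0, 2), and for any Gaussian Q(θ) = N(m, s²) the bound F has a closed form that stays below ln P(y), with equality at the exact posterior N(y/2, 1/2).

```python
import numpy as np
from scipy.stats import norm

y = 1.3                       # one observed data point (arbitrary)

def F(m, s):
    """Lower bound F(Q) = E_Q[ln P(y|theta) + ln P(theta)] + H[Q]
    for theta ~ N(0,1), y|theta ~ N(theta,1) and Q(theta) = N(m, s^2)."""
    e_loglik   = -0.5 * np.log(2 * np.pi) - 0.5 * ((y - m) ** 2 + s ** 2)
    e_logprior = -0.5 * np.log(2 * np.pi) - 0.5 * (m ** 2 + s ** 2)
    entropy    = 0.5 * np.log(2 * np.pi * np.e * s ** 2)
    return e_loglik + e_logprior + entropy

log_evidence = norm(0, np.sqrt(2)).logpdf(y)      # exact: y ~ N(0, 2)

for m, s in [(0.0, 1.0), (0.5, 0.8), (y / 2, np.sqrt(0.5))]:
    print(f"Q = N({m:.2f}, {s**2:.2f}):  F = {F(m, s):.4f}  <=  ln P(y) = {log_evidence:.4f}")
# F equals ln P(y) exactly when Q is the true posterior N(y/2, 1/2).
```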

  6. Variational Bayesian Learning ... Maximizing this lower bound, F, leads to EM-like updates:
Q*_x(x) ∝ exp ⟨ln P(x, y | θ)⟩_{Q_θ(θ)}        (E-like step)
Q*_θ(θ) ∝ P(θ) exp ⟨ln P(x, y | θ)⟩_{Q_x(x)}   (M-like step)
Maximizing F is equivalent to minimizing the KL divergence between the approximate posterior, Q_θ(θ) Q_x(x), and the true posterior, P(θ, x | y).
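A minimal sketch of these coordinate-wise updates on a toy chain (my own example, not from the slides): θ ~ N(0, 1), hidden x | θ ~ N(θ, 1), observed y | x ~ N(x, 1). With factorised Gaussians Q_x(x) = N(m_x, 1/2) and Q_θ(θ) = N(m_θ, 1/2), the E-like and M-like steps reduce to simple mean updates, and the fixed point recovers the exact posterior means E[x|y] = 2y/3 and E[θ|y] = y/3.

```python
import numpy as np

y = 2.0                      # single observation (arbitrary)
m_x, m_theta = 0.0, 0.0      # initialise the variational means

for it in range(20):
    # E-like step: Q_x(x) ∝ exp <ln P(x, y | theta)>_{Q_theta}
    #   -> N(m_x, 1/2) with m_x = (m_theta + y) / 2
    m_x = (m_theta + y) / 2.0
    # M-like step: Q_theta(theta) ∝ P(theta) exp <ln P(x, y | theta)>_{Q_x}
    #   -> N(m_theta, 1/2) with m_theta = m_x / 2
    m_theta = m_x / 2.0

print(f"VB means:    E_Q[x] = {m_x:.4f},  E_Q[theta] = {m_theta:.4f}")
print(f"Exact means: E[x|y] = {2*y/3:.4f},  E[theta|y] = {y/3:.4f}")
# The factorised posterior recovers the exact means here but underestimates the
# posterior variances (1/2 instead of 2/3), a well-known property of mean-field VB.
```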

  7. Conjugate-Exponential models. Let's focus on conjugate-exponential (CE) models, which satisfy conditions (1) and (2):
Condition (1). The joint probability over variables is in the exponential family:
P(x, y | θ) = f(x, y) g(θ) exp{ φ(θ)ᵀ u(x, y) },
where φ(θ) is the vector of natural parameters and u(x, y) are the sufficient statistics.
Condition (2). The prior over parameters is conjugate to this joint probability:
P(θ | η, ν) = h(η, ν) g(θ)^η exp{ φ(θ)ᵀ ν },
where η and ν are hyperparameters of the prior. Conjugate priors are computationally convenient and have an intuitive interpretation:
• η: number of pseudo-observations
• ν: values of pseudo-observations
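A concrete instance of the pseudo-observation reading (a standard textbook example, not taken from the slides) is the Beta-Bernoulli pair: the Bernoulli likelihood is in the exponential family with natural parameter ln(p/(1−p)), and a Beta(a, b) prior behaves like a record of previously seen heads and tails, so the posterior is obtained by simply adding counts.

```python
import numpy as np

rng = np.random.default_rng(1)

# Bernoulli likelihood with conjugate Beta(a, b) prior; the hyperparameters act
# as pseudo-counts of heads and tails (up to the usual -1 convention).
a, b = 2.0, 2.0                        # prior pseudo-observations
flips = rng.random(50) < 0.7           # 50 coin flips with true p = 0.7
heads, tails = flips.sum(), (~flips).sum()

a_post, b_post = a + heads, b + tails  # conjugate update: just add the counts
print(f"posterior = Beta({a_post:.0f}, {b_post:.0f}), "
      f"mean = {a_post / (a_post + b_post):.3f} (true p = 0.7)")
```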

  8. Conjugate-Exponential examples. In the CE family:
• Gaussian mixtures
• factor analysis, probabilistic PCA
• hidden Markov models and factorial HMMs
• linear dynamical systems and switching models
• discrete-variable belief networks
Other as-yet-undreamt-of models can combine Gaussian, Gamma, Poisson, Dirichlet, Wishart, Multinomial and others.
Not in the CE family:
• Boltzmann machines, MRFs (no conjugacy)
• logistic regression (no conjugacy)
• sigmoid belief networks (not exponential)
• independent components analysis (not exponential)
Note: one can often approximate these models with models in the CE family.

  9. The Variational EM algorithm.
VE Step: compute the expected sufficient statistics Σ_i ⟨u(x_i, y_i)⟩ under the hidden-variable distributions Q_{x_i}(x_i).
VM Step: compute the expected natural parameters ⟨φ(θ)⟩ under the parameter distribution given by the updated hyperparameters η̃ and ν̃.
Properties:
• Reduces to the EM algorithm if Q_θ(θ) = δ(θ − θ*).
• F increases monotonically, and incorporates the model complexity penalty.
• Analytical parameter distributions (but not constrained to be Gaussian).
• The VE step has the same complexity as the corresponding E step.
• We can use the junction tree, belief propagation, Kalman filter, etc. algorithms in the VE step of VEM, but with expected natural parameters.
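Below is a minimal sketch of a VE/VM loop for a deliberately stripped-down model (my own toy, not the slides' general recipe): a two-component Gaussian mixture with unit variances and fixed equal mixing proportions, and a N(0, τ₀) prior on each mean. The VE step computes responsibilities (the expected sufficient statistics) using the current Q over the means; the VM step updates the Gaussian Q over each mean from those expected statistics.

```python
import numpy as np

rng = np.random.default_rng(2)

# Data: two unit-variance Gaussian clusters (true means -2 and +3).
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 150)])
N, K, tau0 = len(x), 2, 10.0          # tau0: prior variance of each mean

m  = rng.normal(0, 1, K)              # variational means  of Q(mu_j) = N(m_j, s2_j)
s2 = np.ones(K)                       # variational variances

for it in range(50):
    # VE step: responsibilities r_ij ∝ exp <ln P(x_i, s_i = j | mu)>_{Q(mu)}
    logits = -0.5 * ((x[:, None] - m[None, :]) ** 2 + s2[None, :])
    r = np.exp(logits - logits.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)

    # VM step: Q(mu_j) from the expected sufficient statistics (r_ij, r_ij * x_i)
    precision = 1.0 / tau0 + r.sum(axis=0)
    s2 = 1.0 / precision
    m  = (r * x[:, None]).sum(axis=0) / precision

print("posterior means of the components:", np.round(np.sort(m), 3))
# Expect values close to the generating means -2 and +3.
```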

  10. View 2: Large models. We ought not to limit the complexity of our model a priori (e.g. the number of hidden states, basis functions, or mixture components) since we don't believe that the real data was actually generated from a statistical model with a small number of parameters. Therefore, regardless of how much training data we have, we should consider models with as many parameters as we can handle computationally. Neal (1994) showed that MLPs with large numbers of hidden units achieved good performance on small data sets; he used MCMC techniques to average over parameters. Here there is no model order selection task:
• No need to evaluate the evidence (which is often difficult).
• We don't need or want to use Occam's razor to limit the number of parameters in our model.
In fact, we may even want to consider doing inference in models with an infinite number of parameters...

  11. Infinite Models 1: Gaussian Processes. Neal (1994) showed that a one-hidden-layer neural network with a bounded activation function and a Gaussian prior over the weights and biases converges, as the number of hidden units grows, to a (nonstationary) Gaussian process prior over functions:
p(y | x) = N(0, C(x)),   where e.g. C_ij ≡ C(x_i, x_j) = g(|x_i − x_j|).
[Figure: a Gaussian process fit with error bars, showing the predictive mean and uncertainty band as a function of the input x.]
Bayesian inference in GPs is conceptually and algorithmically much easier than inference in large neural networks. Williams (1995; 1996) and Rasmussen (1996) evaluated GPs as regression models and showed that they are very good.
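The convergence can be checked empirically. The sketch below (an illustration under assumed weight and bias scales, not Neal's exact construction) samples one-hidden-layer tanh networks with Gaussian weights, scaling the output weights by 1/√H, and shows that the covariance of the function values at a pair of inputs stabilises as the number of hidden units H grows, consistent with a Gaussian process limit.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_networks(H, n_nets=5000, x=np.array([-1.0, 0.5])):
    """Evaluate n_nets random one-hidden-layer tanh networks at the inputs x.
    Weight/bias scales are illustrative; output weights scale as 1/sqrt(H)."""
    a = rng.normal(0, 2.0, (n_nets, H))                # input weights
    b = rng.normal(0, 1.0, (n_nets, H))                # hidden biases
    v = rng.normal(0, 1.0 / np.sqrt(H), (n_nets, H))   # output weights
    hidden = np.tanh(a[:, :, None] * x[None, None, :] + b[:, :, None])
    return np.einsum('nh,nhd->nd', v, hidden)          # shape (n_nets, len(x))

for H in [1, 10, 100, 500]:
    f = sample_networks(H)
    print(f"H = {H:4d}:  empirical cov of (f(x1), f(x2)) =\n{np.cov(f.T).round(3)}")
# As H grows the covariance matrix converges, and the joint distribution of
# (f(x1), f(x2)) approaches a bivariate Gaussian: a Gaussian process prior.
```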

  12. Gaussian Processes: prior over functions. [Figure: two panels, "Samples from the Prior" and "Samples from the Posterior", each plotting draws of the output y(x) against the input x.]
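A figure like this can be regenerated with a few lines of numpy. The sketch below (my own, with an assumed squared-exponential covariance and noise-free observations plus jitter) draws samples from the GP prior and from the posterior conditioned on three data points.

```python
import numpy as np

rng = np.random.default_rng(4)

def k(a, b, lengthscale=1.0):
    """Squared-exponential covariance (an assumed choice for illustration)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

xs = np.linspace(-3, 3, 100)                    # where we evaluate the functions
jitter = 1e-8 * np.eye(len(xs))

# Samples from the prior: f ~ N(0, K).
prior_samples = rng.multivariate_normal(np.zeros(len(xs)), k(xs, xs) + jitter, size=3)

# Condition on three (noise-free) observations.
x_obs = np.array([-2.0, 0.0, 1.5])
y_obs = np.array([0.5, -1.0, 1.0])
K_oo = k(x_obs, x_obs) + 1e-8 * np.eye(3)
K_so = k(xs, x_obs)
mean_post = K_so @ np.linalg.solve(K_oo, y_obs)
cov_post = k(xs, xs) - K_so @ np.linalg.solve(K_oo, K_so.T)
post_samples = rng.multivariate_normal(mean_post, cov_post + jitter, size=3)

print("prior samples shape:", prior_samples.shape)     # (3, 100)
print("posterior mean near the observed inputs:",
      np.round(np.interp(x_obs, xs, mean_post), 2))    # close to y_obs
```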

  13. Linear Regression ⇒ Gaussian Processes in four steps...
1. Linear regression with inputs x_i and outputs y_i:  y_i = Σ_k w_k x_ik + ε_i
2. Kernel linear regression:  y_i = Σ_k w_k φ_k(x_i) + ε_i
3. Bayesian kernel linear regression:  ε_i ∼ N(0, σ²),  w_k ∼ N(0, β_k) [independent of w_ℓ]
4. Now integrate out the weights w_k:
   ⟨y_i⟩ = 0,   ⟨y_i y_j⟩ = Σ_k β_k φ_k(x_i) φ_k(x_j) + δ_ij σ² ≡ C_ij
This is a Gaussian process with covariance function C(x, x′) = Σ_k β_k φ_k(x) φ_k(x′), built here from a finite number of basis functions. Many useful GP covariance functions correspond to infinitely many basis functions.
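The marginalisation in step 4 can be checked numerically. In the sketch below (my own illustration, with a handful of assumed Gaussian-bump basis functions φ_k and prior variances β_k), sampling the weights and noise gives an empirical covariance of y that matches C_ij = Σ_k β_k φ_k(x_i) φ_k(x_j) + δ_ij σ² computed directly.

```python
import numpy as np

rng = np.random.default_rng(5)

# Assumed setup: 5 Gaussian-bump basis functions, prior variances beta_k, noise sigma.
centres = np.linspace(-2, 2, 5)
beta = np.array([1.0, 0.5, 2.0, 0.5, 1.0])
sigma = 0.3
x = np.array([-1.5, 0.0, 0.7, 2.0])                 # a few input locations

Phi = np.exp(-0.5 * (x[:, None] - centres[None, :]) ** 2)   # Phi[i, k] = phi_k(x_i)

# Analytic covariance: C_ij = sum_k beta_k phi_k(x_i) phi_k(x_j) + delta_ij sigma^2
C = Phi @ np.diag(beta) @ Phi.T + sigma**2 * np.eye(len(x))

# Monte Carlo: draw w_k ~ N(0, beta_k), eps_i ~ N(0, sigma^2), and form y = Phi w + eps.
S = 200000
w = rng.normal(0, np.sqrt(beta), (S, len(beta)))
y = w @ Phi.T + rng.normal(0, sigma, (S, len(x)))

print("analytic C:\n", C.round(3))
print("empirical cov of y:\n", np.cov(y.T).round(3))   # agrees up to Monte Carlo error
```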

  14. Infinite Models 2: Infinite Gaussian Mixtures. Following Neal (1991), Rasmussen (2000) showed that it is possible to do inference in countably infinite mixtures of Gaussians.
P(x_1, ..., x_N | π, μ, Σ) = Π_{i=1}^N Σ_{j=1}^K π_j N(x_i | μ_j, Σ_j)
The mixing proportions are given a symmetric Dirichlet prior:
P(π | β) = [Γ(β) / Γ(β/K)^K] Π_{j=1}^K π_j^{β/K − 1}
The joint distribution of the indicators is then multinomial:
P(s_1, ..., s_N | π) = Π_{j=1}^K π_j^{n_j},   where n_j = Σ_{i=1}^N δ(s_i, j).
Integrating out the mixing proportions we obtain
P(s_1, ..., s_N | β) = ∫ dπ P(s_1, ..., s_N | π) P(π | β) = [Γ(β) / Γ(N + β)] Π_{j=1}^K Γ(n_j + β/K) / Γ(β/K).
We have integrated out the mixing proportions; in the limit K → ∞ this yields a Dirichlet process over the indicator variables.
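The integral over π can be checked by Monte Carlo. The sketch below (my own check, with arbitrary small K, N and indicator counts) compares the closed-form expression Γ(β)/Γ(N+β) Π_j Γ(n_j + β/K)/Γ(β/K) with the average of Π_j π_j^{n_j} over draws π ~ Dirichlet(β/K, ..., β/K).

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(6)

K, beta = 3, 1.5
s = np.array([0, 0, 1, 2, 2, 2, 0, 1])          # some indicator settings, N = 8
N = len(s)
n = np.bincount(s, minlength=K)                 # counts n_j per component

# Closed form: P(s|beta) = Gamma(beta)/Gamma(N+beta) * prod_j Gamma(n_j+beta/K)/Gamma(beta/K)
log_p = gammaln(beta) - gammaln(N + beta) + np.sum(gammaln(n + beta / K) - gammaln(beta / K))

# Monte Carlo: average prod_j pi_j^{n_j} over pi ~ Dirichlet(beta/K, ..., beta/K)
pis = rng.dirichlet(np.full(K, beta / K), size=200000)
mc = np.mean(np.prod(pis ** n, axis=1))

print(f"closed form : {np.exp(log_p):.6f}")
print(f"Monte Carlo : {mc:.6f}")
```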

  15. Gibbs sampling in Infinite Gaussian Mixtures.
Conditional probabilities, finite K:
P(s_i = j | s_{−i}, β) = (n_{−i,j} + β/K) / (N − 1 + β),
where s_{−i} denotes all indicators except the i-th, and n_{−i,j} is the total number of observations with indicator j, excluding the i-th.
Conditional probabilities, infinite K: taking the limit as K → ∞ yields
P(s_i = j | s_{−i}, β) = n_{−i,j} / (N − 1 + β)   if component j is represented,
P(s_i = some unrepresented component | s_{−i}, β) = β / (N − 1 + β)   (combined left-over mass).
This left-over mass gives rise to a countably infinite number of possible indicator settings; the infinite limit yields the Dirichlet process. Gibbs sampling is easy!
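These conditionals are easy to simulate. The sketch below (prior simulation only, not the full Gibbs sweep over data) draws indicators sequentially from the infinite-K conditionals, which is exactly the Chinese restaurant process view of the Dirichlet process: a larger β tends to open more components.

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_indicators(N, beta):
    """Draw s_1..s_N sequentially using the infinite-K conditionals (a CRP draw)."""
    s, counts = [], []                        # counts[j] = n_j for represented j
    for i in range(N):
        # P(existing j) ∝ n_j,  P(new component) ∝ beta  (normaliser: i + beta)
        probs = np.array(counts + [beta]) / (i + beta)
        j = rng.choice(len(probs), p=probs)
        if j == len(counts):
            counts.append(1)                  # open a new component
        else:
            counts[j] += 1
        s.append(j)
    return np.array(s), np.array(counts)

for beta in [0.5, 2.0, 10.0]:
    _, counts = sample_indicators(N=500, beta=beta)
    print(f"beta = {beta:4.1f}:  {len(counts)} occupied components, "
          f"largest has {counts.max()} points")
```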
