1. Infinite Models II
Zoubin Ghahramani, Center for Automated Learning and Discovery, Carnegie Mellon University, http://www.cs.cmu.edu/~zoubin
Carl E. Rasmussen, Matthew J. Beal, Gatsby Computational Neuroscience Unit, University College London, http://www.gatsby.ucl.ac.uk/
Mar 2002

2. Two conflicting Bayesian views?
View 1: Occam's Razor. Bayesian learning automatically finds the optimal model complexity given the available amount of data, since Occam's Razor is an integral part of Bayes [Jefferys & Berger; MacKay]. Occam's Razor discourages overcomplex models.
View 2: Large models. There is no statistical reason to constrain models; use large models (no matter how much data you have) [Neal] and pursue the infinite limit if you can [Neal; Williams, Rasmussen].
Both views require averaging over all model parameters, yet they seem contradictory. For example, should we use Occam's Razor to find the "best" number of hidden units in a feedforward neural network, or simply use as many hidden units as we can manage computationally?

3. View 1: Finding the "best" model complexity
Select the model class with the highest probability given the data:
P(M_i | Y) = \frac{P(Y | M_i) P(M_i)}{P(Y)}, \qquad P(Y | M_i) = \int P(Y | \theta_i, M_i) \, P(\theta_i | M_i) \, d\theta_i
Interpretation: the probability that randomly selected parameter values from the model class would generate data set Y. Model classes that are too simple are unlikely to generate the data set. Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random.
[Figure: the evidence P(Y | M_i) plotted over all possible data sets Y for a model that is too simple, one that is "just right", and one that is too complex.]
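A minimal numerical sketch of this quantity (not from the slides; the coin-flip data set and the two model classes are illustrative assumptions): the evidence P(Y | M_i) can be estimated by simple Monte Carlo, averaging the likelihood over parameter values drawn from the prior.

```python
# Sketch (not from the slides): estimate P(Y | M_i) by simple Monte Carlo,
# averaging the likelihood over parameters drawn from the prior. The binary
# data set and the two model classes below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
Y = np.array([1, 1, 1, 0, 1, 1, 0, 1])            # hypothetical coin-flip data
k, n = Y.sum(), Y.size

def likelihood(theta):
    return theta**k * (1.0 - theta)**(n - k)       # P(Y | theta) for i.i.d. Bernoulli flips

def evidence(prior_samples):
    return likelihood(prior_samples).mean()        # (1/S) * sum_s P(Y | theta_s)

S = 200_000
ev_fixed = evidence(np.full(S, 0.5))               # M_0: fair coin, no free parameter
ev_free = evidence(rng.uniform(0.0, 1.0, S))       # M_1: unknown bias, uniform prior

# With equal priors P(M_0) = P(M_1), the posterior odds equal the ratio of evidences.
print(ev_fixed, ev_free, ev_fixed / ev_free)
```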

4. Bayesian Model Selection: Occam's Razor at Work
[Figure: a data set fit with polynomials of order M = 0 through M = 7 (one panel per order), together with a bar plot of the model evidence P(Y|M) for each order.]
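The quantity plotted in the figure can be computed exactly in a few lines. In the sketch below (the data set, prior variance alpha and noise variance sigma2 are illustrative assumptions, not taken from the slide), integrating out the weights of a Bayesian polynomial regression model leaves the data jointly Gaussian, so ln P(Y|M) is just a multivariate-normal log density.

```python
# Sketch (not from the slides): exact log evidence ln P(Y|M) for Bayesian
# polynomial regression of order M = 0..7. After integrating out the weights
# w ~ N(0, alpha*I), the data are jointly Gaussian: Y ~ N(0, C_M).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 20)
y = 1.0 - 4.0 * t + 3.0 * t**2 + rng.normal(0, 0.2, t.size)   # toy quadratic data (assumed)

alpha, sigma2 = 10.0, 0.04
for M in range(8):
    Phi = np.vander(t, M + 1, increasing=True)                 # polynomial basis 1, t, ..., t^M
    C = alpha * Phi @ Phi.T + sigma2 * np.eye(t.size)          # covariance after integrating out w
    log_ev = multivariate_normal(mean=np.zeros(t.size), cov=C).logpdf(y)
    print(f"M = {M}: ln P(Y|M) = {log_ev:.1f}")
```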

5. Lower Bounding the Evidence: Variational Bayesian Learning
Let the hidden states be x, the data y and the parameters \theta. We can lower bound the evidence (Jensen's inequality):
\ln P(y | M) = \ln \int dx \, d\theta \, P(y, x, \theta | M)
             = \ln \int dx \, d\theta \, Q(x, \theta) \frac{P(y, x, \theta | M)}{Q(x, \theta)}
             \geq \int dx \, d\theta \, Q(x, \theta) \ln \frac{P(y, x, \theta | M)}{Q(x, \theta)}.
Use a simpler, factorised approximation Q(x, \theta) \approx Q_x(x) Q_\theta(\theta):
\ln P(y | M) \geq \int dx \, d\theta \, Q_x(x) Q_\theta(\theta) \ln \frac{P(y, x, \theta | M)}{Q_x(x) Q_\theta(\theta)} = F(Q_x(x), Q_\theta(\theta), y).
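A tiny numerical check of the bound (purely illustrative, not from the slides; the joint values and the factorised Q below are arbitrary assumptions): with a single binary hidden state and a single binary "parameter", both sides can be enumerated exactly, and F never exceeds ln P(y).

```python
# Illustrative check that F <= ln P(y|M) for any factorised Q: x and theta are
# each a single binary variable, so all sums can be enumerated exactly.
# The joint values and the choice of Q are arbitrary assumptions.
import numpy as np

P_joint = np.array([[0.10, 0.05],      # P(y, x, theta) for the observed y,
                    [0.02, 0.08]])     # indexed as [x, theta], with x, theta in {0, 1}

log_evidence = np.log(P_joint.sum())   # ln P(y) = ln sum_{x, theta} P(y, x, theta)

Qx = np.array([0.3, 0.7])              # an arbitrary factorised Q_x(x) Q_theta(theta)
Qt = np.array([0.6, 0.4])
Q = np.outer(Qx, Qt)

F = np.sum(Q * (np.log(P_joint) - np.log(Q)))
assert F <= log_evidence               # Jensen's inequality guarantees this
print(F, log_evidence)
```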

6. Variational Bayesian Learning ...
Maximizing this lower bound, F, leads to EM-like updates:
Q_x^*(x) \propto \exp \langle \ln P(x, y | \theta) \rangle_{Q_\theta(\theta)}    (E-like step)
Q_\theta^*(\theta) \propto P(\theta) \exp \langle \ln P(x, y | \theta) \rangle_{Q_x(x)}    (M-like step)
Maximizing F is equivalent to minimizing the KL-divergence between the approximate posterior, Q_\theta(\theta) Q_x(x), and the true posterior, P(\theta, x | y).
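To make the last statement explicit (a standard identity, not spelled out on the slide), write the joint as P(y, x, \theta) = P(x, \theta | y) P(y) and substitute into F:
F(Q_x(x), Q_\theta(\theta), y) = \ln P(y) - \mathrm{KL}\big( Q_x(x) Q_\theta(\theta) \,\|\, P(x, \theta | y) \big),
so for a fixed data set, increasing F is exactly the same as decreasing the KL-divergence from the factorised approximation to the true posterior, and F = \ln P(y) only when the two coincide.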

7. Conjugate-Exponential Models
Let's focus on conjugate-exponential (CE) models, which satisfy conditions (1) and (2):
Condition (1). The joint probability over variables is in the exponential family:
P(x, y | \theta) = f(x, y) \, g(\theta) \exp\{ \phi(\theta)^\top u(x, y) \}
where \phi(\theta) is the vector of natural parameters and u(x, y) are the sufficient statistics.
Condition (2). The prior over parameters is conjugate to this joint probability:
P(\theta | \eta, \nu) = h(\eta, \nu) \, g(\theta)^\eta \exp\{ \phi(\theta)^\top \nu \}
where \eta and \nu are hyperparameters of the prior.
Conjugate priors are computationally convenient and have an intuitive interpretation:
• \eta: number of pseudo-observations
• \nu: values of pseudo-observations
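As a concrete illustration (a standard example, not taken from the slide), the Bernoulli likelihood fits this template. Writing
P(x | \theta) = \theta^x (1 - \theta)^{1 - x} = (1 - \theta) \exp\{ x \ln\tfrac{\theta}{1 - \theta} \},
we can read off g(\theta) = 1 - \theta, \phi(\theta) = \ln\tfrac{\theta}{1 - \theta} and u(x) = x. The conjugate prior of condition (2) is then
P(\theta | \eta, \nu) \propto (1 - \theta)^{\eta} \exp\{ \nu \ln\tfrac{\theta}{1 - \theta} \} = \theta^{\nu} (1 - \theta)^{\eta - \nu},
i.e. a Beta(\nu + 1, \eta - \nu + 1) distribution, in which \eta acts as the number of pseudo-observations and \nu as the sum of their values (the number of pseudo-successes).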

8. Conjugate-Exponential Examples
In the CE family:
• Gaussian mixtures
• factor analysis, probabilistic PCA
• hidden Markov models and factorial HMMs
• linear dynamical systems and switching models
• discrete-variable belief networks
Other as yet undreamt-of models can combine Gaussian, Gamma, Poisson, Dirichlet, Wishart, Multinomial and other distributions.
Not in the CE family:
• Boltzmann machines, MRFs (no conjugacy)
• logistic regression (no conjugacy)
• sigmoid belief networks (not exponential)
• independent components analysis (not exponential)
Note: one can often approximate these models with models in the CE family.

9. The Variational EM Algorithm
VE Step: Compute the expected sufficient statistics \sum_i \langle u(x_i, y_i) \rangle under the hidden-variable distributions Q_{x_i}(x_i).
VM Step: Compute the expected natural parameters \langle \phi(\theta) \rangle under the parameter distribution given by \tilde{\eta} and \tilde{\nu}.
Properties:
• Reduces to the EM algorithm if Q_\theta(\theta) = \delta(\theta - \theta^*).
• F increases monotonically, and incorporates the model complexity penalty.
• Analytical parameter distributions (but not constrained to be Gaussian).
• The VE step has the same complexity as the corresponding E step.
• We can use the junction tree, belief propagation, Kalman filter, etc. algorithms in the VE step of VEM, but using expected natural parameters.
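A toy instantiation of these updates (a heavily simplified sketch, not the general algorithm on the slide: a two-component 1-D Gaussian mixture with known unit variances, known equal mixing proportions, and N(0, v0) priors on the means; all data and hyperparameter values are assumptions):

```python
# Minimal VBEM sketch: hidden variables are the indicators s_i, parameters are
# the two component means mu_k with N(0, v0) priors; variances and mixing
# proportions are treated as known. Everything here is an illustrative toy.
import numpy as np

rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])  # toy data
v0 = 10.0                                   # prior variance of each mean
m = np.array([-1.0, 1.0])                   # Q(mu_k) = N(m_k, v_k)
v = np.array([1.0, 1.0])

for _ in range(50):
    # VE step: responsibilities use the *expected* natural parameters of the
    # means, i.e. E[mu_k] = m_k and E[mu_k^2] = m_k^2 + v_k.
    log_r = y[:, None] * m[None, :] - 0.5 * (m**2 + v)[None, :]
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)

    # VM step: conjugate Gaussian update from the expected sufficient statistics
    # (expected counts N_k and expected weighted sums sum_i r_ik y_i).
    Nk = r.sum(axis=0)
    v = 1.0 / (1.0 / v0 + Nk)
    m = v * (r * y[:, None]).sum(axis=0)

print(m, v)   # the posterior means settle near the two component means
```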

10. View 2: Large models
We ought not to limit the complexity of our model a priori (e.g. number of hidden states, number of basis functions, number of mixture components, etc), since we don't believe that the real data was actually generated from a statistical model with a small number of parameters. Therefore, regardless of how much training data we have, we should consider models with as many parameters as we can handle computationally.
Neal (1994) showed that MLPs with large numbers of hidden units achieved good performance on small data sets. He used MCMC techniques to average over parameters.
Here there is no model order selection task:
• No need to evaluate the evidence (which is often difficult).
• We don't need or want to use Occam's razor to limit the number of parameters in our model.
In fact, we may even want to consider doing inference in models with an infinite number of parameters...

11. Infinite Models 1: Gaussian Processes
Neal (1994) showed that a one-hidden-layer neural network with bounded activation function and a Gaussian prior over the weights and biases converges, as the number of hidden units goes to infinity, to a (nonstationary) Gaussian process prior over functions:
p(y | x) = N(0, C(x)), where e.g. C_{ij} \equiv C(x_i, x_j) = g(|x_i - x_j|).
[Figure: a Gaussian process fit to data, with error bars.]
Bayesian inference in GPs is conceptually and algorithmically much easier than inference in large neural networks. Williams (1995; 1996) and Rasmussen (1996) have evaluated GPs as regression models and shown that they are very good.
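A minimal GP regression sketch (the squared-exponential covariance, the data, and the hyperparameters below are assumptions; the slide only specifies a covariance of the form g(|x_i - x_j|)): it computes the predictive mean and the error bars of the kind shown in the figure.

```python
# GP regression sketch: predictive mean and +/- 2 std error bars at test
# inputs, given noisy observations. Kernel, data and hyperparameters assumed.
import numpy as np

def sq_exp(a, b, lengthscale=1.0, signal=1.0):
    """C(x, x') = signal^2 * exp(-|x - x'|^2 / (2 * lengthscale^2))."""
    d = a[:, None] - b[None, :]
    return signal**2 * np.exp(-0.5 * (d / lengthscale)**2)

rng = np.random.default_rng(4)
x = rng.uniform(-3, 3, 12)
y = np.sin(x) + rng.normal(0, 0.1, x.size)       # toy observations
xs = np.linspace(-3, 4, 200)                     # test inputs
noise = 0.1**2

K = sq_exp(x, x) + noise * np.eye(x.size)
Ks = sq_exp(x, xs)
Kss = sq_exp(xs, xs)

alpha = np.linalg.solve(K, y)
mean = Ks.T @ alpha                              # predictive mean
cov = Kss - Ks.T @ np.linalg.solve(K, Ks)        # predictive covariance
std = np.sqrt(np.diag(cov))                      # mean +/- 2*std gives the error bars
```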

12. Gaussian Processes: prior over functions
[Figure: two panels, "Samples from the Prior" and "Samples from the Posterior", each plotting several functions y(x) against the input x.]
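A sketch of what this figure shows (assumptions: a unit-lengthscale squared-exponential covariance, a handful of noisy sine observations, and a small jitter term for numerical stability): functions drawn from the GP prior, and from the posterior after conditioning on the data.

```python
# Draw sample functions from a GP prior and from the corresponding posterior.
# Kernel, data, noise level and jitter are illustrative assumptions.
import numpy as np

def kern(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2)   # unit lengthscale and amplitude

rng = np.random.default_rng(5)
x = np.array([-2.0, -1.0, 0.5, 2.0])                     # toy observed inputs
y = np.sin(x)                                            # toy observed outputs
xs = np.linspace(-3, 3, 150)
jitter = 1e-6 * np.eye(xs.size)

# Samples from the prior: y(xs) ~ N(0, K(xs, xs)).
prior = np.linalg.cholesky(kern(xs, xs) + jitter) @ rng.standard_normal((xs.size, 3))

# Samples from the posterior: condition on the data (noise variance 0.01 assumed).
K = kern(x, x) + 0.01 * np.eye(x.size)
mean = kern(xs, x) @ np.linalg.solve(K, y)
cov = kern(xs, xs) - kern(xs, x) @ np.linalg.solve(K, kern(x, xs))
posterior = mean[:, None] + np.linalg.cholesky(cov + jitter) @ rng.standard_normal((xs.size, 3))
```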

13. Linear Regression ⇒ Gaussian Processes in four steps...
1. Linear regression with inputs x_i and outputs y_i:  y_i = \sum_k w_k x_{ik} + \epsilon_i
2. Kernel linear regression:  y_i = \sum_k w_k \phi_k(x_i) + \epsilon_i
3. Bayesian kernel linear regression:  w_k \sim N(0, \beta_k)  [independent of w_\ell],  \epsilon_i \sim N(0, \sigma^2)
4. Now, integrate out the weights w_k:
\langle y_i \rangle = 0, \qquad \langle y_i y_j \rangle = \sum_k \beta_k \phi_k(x_i) \phi_k(x_j) + \delta_{ij} \sigma^2 \equiv C_{ij}
This is a Gaussian process with covariance function C(x, x') = \sum_k \beta_k \phi_k(x) \phi_k(x').
This is a Gaussian process with a finite number of basis functions. Many useful GP covariance functions correspond to infinitely many kernels.
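A quick numerical check of step 4 (the Gaussian-bump basis functions, prior variances beta_k and noise level below are assumptions): sampling the weights and noise many times and averaging y_i y_j reproduces the analytic covariance C_ij.

```python
# Verify that integrating out the weights of Bayesian kernel linear regression
# gives C_ij = sum_k beta_k phi_k(x_i) phi_k(x_j) + delta_ij sigma^2, by
# comparing against the empirical covariance of many sampled data sets.
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(-2, 2, 5)
centers = np.array([-1.5, 0.0, 1.5])                       # toy Gaussian-bump basis functions
Phi = np.exp(-0.5 * (x[:, None] - centers[None, :])**2)    # Phi[i, k] = phi_k(x_i)
beta = np.array([1.0, 2.0, 0.5])                           # prior variances of the weights w_k
sigma2 = 0.1

C = (Phi * beta) @ Phi.T + sigma2 * np.eye(x.size)         # analytic covariance

# Monte Carlo: sample w_k ~ N(0, beta_k) and eps ~ N(0, sigma2), form y, average.
S = 200_000
W = rng.standard_normal((S, centers.size)) * np.sqrt(beta)
Y = W @ Phi.T + rng.normal(0, np.sqrt(sigma2), (S, x.size))
C_mc = Y.T @ Y / S                                         # empirical <y_i y_j>

print(np.max(np.abs(C - C_mc)))                            # small, up to Monte Carlo error
```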

14. Infinite Models 2: Infinite Gaussian Mixtures
Following Neal (1991), Rasmussen (2000) showed that it is possible to do inference in countably infinite mixtures of Gaussians.
P(x_1, \ldots, x_N | \pi, \mu, \Sigma) = \prod_{i=1}^{N} \sum_{j=1}^{K} \pi_j \, N(x_i | \mu_j, \Sigma_j)
 = \sum_{s} P(s, x | \pi, \mu, \Sigma) = \sum_{s} \prod_{i=1}^{N} \prod_{j=1}^{K} [\pi_j \, N(x_i | \mu_j, \Sigma_j)]^{\delta(s_i, j)}
The joint distribution of the indicators is multinomial:
P(s_1, \ldots, s_N | \pi) = \prod_{j=1}^{K} \pi_j^{n_j}, \qquad n_j = \sum_{i=1}^{N} \delta(s_i, j).
The mixing proportions are given a symmetric Dirichlet prior:
P(\pi | \beta) = \frac{\Gamma(\beta)}{\Gamma(\beta/K)^K} \prod_{j=1}^{K} \pi_j^{\beta/K - 1}

15. Infinite Gaussian Mixtures (continued)
The joint distribution of the indicators is multinomial:
P(s_1, \ldots, s_N | \pi) = \prod_{j=1}^{K} \pi_j^{n_j}, \qquad n_j = \sum_{i=1}^{N} \delta(s_i, j).
The mixing proportions are given a symmetric, conjugate Dirichlet prior:
P(\pi | \beta) = \frac{\Gamma(\beta)}{\Gamma(\beta/K)^K} \prod_{j=1}^{K} \pi_j^{\beta/K - 1}
Integrating out the mixing proportions, we obtain
P(s_1, \ldots, s_N | \beta) = \int d\pi \, P(s_1, \ldots, s_N | \pi) \, P(\pi | \beta) = \frac{\Gamma(\beta)}{\Gamma(N + \beta)} \prod_{j=1}^{K} \frac{\Gamma(n_j + \beta/K)}{\Gamma(\beta/K)}
This yields a Dirichlet process over the indicator variables.
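This integral is easy to verify numerically (an illustrative check; the small K, the value of beta and the particular indicator vector are assumptions): averaging P(s | pi) over draws pi ~ Dirichlet(beta/K, ..., beta/K) should match the closed form.

```python
# Compare the closed-form P(s_1..s_N | beta), obtained by integrating out pi,
# against a Monte Carlo average of P(s | pi) over Dirichlet draws of pi.
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(7)
K, beta = 3, 1.5
s = np.array([0, 0, 1, 2, 0, 1])                 # a hypothetical assignment of N = 6 points
N = s.size
n = np.bincount(s, minlength=K)                  # class counts n_j

# Closed form: Gamma(beta)/Gamma(N+beta) * prod_j Gamma(n_j + beta/K)/Gamma(beta/K)
log_exact = (gammaln(beta) - gammaln(N + beta)
             + np.sum(gammaln(n + beta / K) - gammaln(beta / K)))

# Monte Carlo over the mixing proportions.
pis = rng.dirichlet(np.full(K, beta / K), size=500_000)
mc = np.mean(np.prod(pis ** n, axis=1))

print(np.exp(log_exact), mc)                     # the two should agree closely
```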

16. Dirichlet Process Conditional Probabilities
Conditional probabilities, finite K:
P(s_i = j | s_{-i}, \beta) = \frac{n_{-i,j} + \beta/K}{N - 1 + \beta}
where s_{-i} denotes all indicators except the i-th, and n_{-i,j} is the total number of observations assigned to class j, excluding the i-th. More populous classes are more likely to be joined.
Conditional probabilities, infinite K: taking the limit as K → ∞ yields the conditionals
P(s_i = j | s_{-i}, \beta) = \frac{n_{-i,j}}{N - 1 + \beta}   for each represented class j
P(s_i = j | s_{-i}, \beta) = \frac{\beta}{N - 1 + \beta}   for all unrepresented classes j combined
The leftover mass \beta is spread over a countably infinite number of unrepresented indicator settings. Gibbs sampling from the posterior over indicators is easy!
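These conditionals can be sampled sequentially, giving the familiar Chinese-restaurant-style generative process for the indicators (a sketch of the prior over indicators only; the Gibbs sampler mentioned on the slide would also take the data likelihood under each class into account, which is omitted here; beta and N are arbitrary choices):

```python
# Sequential draw from the infinite-K conditionals: each point joins an
# existing class with probability proportional to its count n_j, or opens a
# new class with probability proportional to beta.
import numpy as np

def sample_indicators(N, beta, rng):
    s, counts = [0], [1]                        # the first point starts class 0
    for i in range(1, N):
        # existing class j with prob n_j / (i + beta), new class with prob beta / (i + beta)
        probs = np.array(counts + [beta], dtype=float)
        probs /= probs.sum()
        j = rng.choice(len(probs), p=probs)
        if j == len(counts):
            counts.append(1)                    # open a previously unrepresented class
        else:
            counts[j] += 1
        s.append(j)
    return np.array(s)

rng = np.random.default_rng(8)
s = sample_indicators(100, beta=2.0, rng=rng)
print(np.bincount(s))                           # occupancy of the represented classes
```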
