 
A Tutorial on Bayesian Nonparametrics
Fatima Al-Raisi
Carnegie Mellon University
fraisi@cs.cmu.edu
October 25, 2016

1 Introduction
    Bayesian Non-Parametrics
2 Motivation
    Intuitions and Assumptions
    Theoretical Motivation
    Practical Motivation
3 Dirichlet Process
4 Chinese Restaurant Process
    Pitman-Yor Process
5 Discussion and Concluding Remarks
6 List of Tutorials

Development of Interest in Topic Over Time: An interesting “interest over time” pattern!
Interest Over Time: Deep Learning
Interest Over Time: Reinforcement Learning
Interest Over Time: Nonparametric Statistics!
Interest Over Time: Bayesian Inference!

Terminology
What does “Bayesian Nonparametrics” mean?
- Bayesian inference: data and parameters, priors and posteriors
  $P(\text{parameters} \mid \text{data}) \propto P(\text{parameters})\, P(\text{data} \mid \text{parameters})$
- Bayesian inference vs. Bayes rule (Bayesian inference does not mean using Bayes rule!)
- Non-parametric (a misnomer): a large or unbounded number of parameters, a growing number of parameters, an infinite parameter space
  “the number of parameters grow with the amount of training data”
- No (strong) assumption about the underlying distribution of the data
- Terminology note: non-parametric vs. nonparametric

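As an illustration of the proportionality $P(\text{parameters} \mid \text{data}) \propto P(\text{parameters})\, P(\text{data} \mid \text{parameters})$, the following minimal sketch computes a posterior over a coin's bias on a grid. It is not from the slides: the coin example, the uniform prior, and the names `grid_posterior`, `heads`, and `tails` are assumptions made for illustration.

```python
def grid_posterior(heads, tails, grid_size=101):
    """Posterior over a coin's bias on an evenly spaced grid (illustrative setup)."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0 / grid_size] * grid_size                        # P(parameters): uniform
    likelihood = [p ** heads * (1 - p) ** tails for p in grid]   # P(data | parameters)
    unnormalized = [pr * lk for pr, lk in zip(prior, likelihood)]
    total = sum(unnormalized)                                    # plays the role of P(data)
    return grid, [u / total for u in unnormalized]


grid, post = grid_posterior(heads=7, tails=3)
# Grid point with the highest posterior probability.
mode = grid[max(range(len(post)), key=post.__getitem__)]
print(mode)
```

The normalizing constant is obtained by summing over the grid, which is why the relation can be stated with a proportionality sign: the prior times the likelihood already determines the shape of the posterior.
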
Terminology
Formal Definition
A statistical model is a collection of distributions $\{P_\theta : \theta \in \Theta\}$ indexed by a parameter $\theta$.
- Parametric model: the indexing parameter is a finite-dimensional vector: $\Theta \subset \mathbb{R}^k$
- Nonparametric model: $\Theta \subset \mathcal{F}$ for some possibly infinite-dimensional space $\mathcal{F}$
- Semiparametric model: the parameter has both a finite-dimensional component and an infinite-dimensional component: $\Theta \subset \mathbb{R}^k \times \mathcal{F}$, where $\mathcal{F}$ is an infinite-dimensional space

Review
Probabilistic Modeling
- Data: $x_1, x_2, \dots, x_n$
- Latent variables: $z_1, z_2, \dots, z_n$
- Parameter: $\theta$
- A probabilistic model is a parametrized joint distribution over variables:
  $P(x_1, x_2, \dots, x_n, z_1, z_2, \dots, z_n \mid \theta)$
- Typically interpreted as a generative model of data
- Inference of latent variables given observed data:
  $P(z_1, \dots, z_n \mid x_1, \dots, x_n, \theta) = \dfrac{P(x_1, \dots, x_n, z_1, \dots, z_n \mid \theta)}{P(x_1, \dots, x_n \mid \theta)}$

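The inference formula for latent variables can be sketched for a single observation. The example below assumes a two-component one-dimensional Gaussian mixture with known parameters; the mixture, its parameter values, and the names `normal_pdf` and `latent_posterior` are illustrative assumptions, not part of the slides.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def latent_posterior(x, weights, params):
    """P(z = k | x, theta) for one observation under a Gaussian mixture."""
    # Joint P(x, z = k | theta) = P(z = k | theta) * P(x | z = k, theta)
    joint = [w * normal_pdf(x, mu, sigma) for w, (mu, sigma) in zip(weights, params)]
    evidence = sum(joint)  # P(x | theta), obtained by summing out z
    return [j / evidence for j in joint]


resp = latent_posterior(x=0.2, weights=[0.5, 0.5], params=[(0.0, 1.0), (4.0, 1.0)])
print(resp)
```

The denominator is the joint summed over all values of the latent variable, which is exactly the ratio of joint to marginal in the formula.
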
Review
Probabilistic Modeling
- Learning (e.g., by maximum likelihood):
  $\hat{\theta} = \operatorname*{argmax}_{\theta} P(x_1, x_2, \dots, x_n \mid \theta)$
- Prediction:
  $P(x_{n+1} \mid x_1, x_2, \dots, x_n, \theta)$
- Classification:
  $\operatorname*{argmax}_{c} P(x_{n+1} \mid \theta_c)$
- Standard algorithms: EM, VI, MCMC, etc.

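A small sketch of maximum-likelihood learning followed by classification via $\operatorname{argmax}_c P(x_{n+1} \mid \theta_c)$. It assumes one 1-D Gaussian per class; the toy data and the names `fit_gaussian_mle`, `log_density`, and `classify` are made up for illustration and do not come from the slides.

```python
import math


def fit_gaussian_mle(xs):
    """Maximum-likelihood mean and standard deviation of a 1-D Gaussian."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # the MLE divides by n, not n - 1
    return mu, math.sqrt(var)


def log_density(x, mu, sigma):
    """Log of the Gaussian density, log P(x | theta) with theta = (mu, sigma)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def classify(x_new, class_params):
    """Return the class c that maximizes P(x_new | theta_c)."""
    return max(class_params, key=lambda c: log_density(x_new, *class_params[c]))


# Toy training data per class (hypothetical values).
train = {"a": [0.1, -0.3, 0.2, 0.4], "b": [3.9, 4.2, 4.1, 3.7]}
class_params = {c: fit_gaussian_mle(xs) for c, xs in train.items()}
print(classify(0.5, class_params), classify(3.5, class_params))
```
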
Review
Bayesian Modeling
- Prior distribution: $P(\theta)$
- Posterior distribution:
  $P(z_1, \dots, z_n, \theta \mid x_1, \dots, x_n) = \dfrac{P(x_1, \dots, x_n, z_1, \dots, z_n \mid \theta)\, P(\theta)}{P(x_1, \dots, x_n)}$
- The above does both inference and learning

Clustering
Parametric Approach
- Think of the data as generated from a number of sources
- Model each cluster using a parametric model
- A data item $i$ is drawn as follows:
  $z_i \mid \pi \sim \text{Discrete}(\pi)$
  $x_i \mid z_i, \theta^\star_k \sim F(\theta^\star_{z_i})$
  where $F$ is a parametric model (e.g., Gaussian with parameter vector $\theta = (\mu, \sigma)$)
- Mixing proportions:
  $\pi = (\pi_1, \dots, \pi_k) \mid \alpha \sim \text{Dirichlet}(\alpha/k, \dots, \alpha/k)$
- More on the Dirichlet distribution later

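The generative process above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: $F$ is taken to be a 1-D Gaussian, the component parameters are fixed by hand, and the names `sample_dirichlet` and `sample_finite_mixture` are hypothetical helpers rather than anything defined in the slides.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_j, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_finite_mixture(n, k, alpha, component_params, rng):
    """Sample n items from a k-component mixture with a symmetric Dirichlet prior."""
    # pi | alpha ~ Dirichlet(alpha/k, ..., alpha/k)
    pi = sample_dirichlet([alpha / k] * k, rng)
    data = []
    for _ in range(n):
        # z_i | pi ~ Discrete(pi)
        z = rng.choices(range(k), weights=pi)[0]
        # x_i | z_i ~ F(theta*_{z_i}), with F assumed Gaussian here
        mu, sigma = component_params[z]
        data.append((z, rng.gauss(mu, sigma)))
    return pi, data


rng = random.Random(0)
pi, data = sample_finite_mixture(
    n=10, k=3, alpha=3.0,
    component_params=[(-5.0, 1.0), (0.0, 1.0), (5.0, 1.0)],
    rng=rng,
)
for z, x in data:
    print(z, round(x, 2))
```

Note that `k` has to be supplied up front; the question raised on the following slides is precisely where that number should come from.
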
Motivation
Question: What is the number of sources? Is it 5? Or maybe 3?
In practice, an ad hoc approach is followed to decide on k. For example:
- guess the number of clusters, run EM for a Gaussian mixture model, look at the results and the goodness of fit, and, if needed, try again with a different k; or
- run hierarchical agglomerative clustering and cut the tree at a “reasonable looking” level.
But we want a principled approach for discovering k. After all, it is an essential part of the problem to be solved!

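The ad hoc loop described above (pick k, run EM, inspect the fit, try another k) might look like the following sketch. It is an illustration only: the slides do not name a goodness-of-fit criterion, so BIC is used here as one common choice, and the synthetic data, the quantile-based initialization, and the names `fit_gmm_em` and `bic` are assumptions.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def mixture_density(x, weights, mus, sigmas):
    """P(x | theta) for a Gaussian mixture, floored to avoid log(0) and division by zero."""
    dens = sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))
    return max(dens, 1e-300)


def fit_gmm_em(xs, k, iters=100):
    """Run EM for a 1-D Gaussian mixture with k components; return the final log-likelihood."""
    xs = sorted(xs)
    n = len(xs)
    mean = sum(xs) / n
    overall_sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    # Deterministic initialization: means at evenly spaced quantiles of the data.
    mus = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    sigmas = [overall_sd] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibilities P(z_i = j | x_i, theta).
        resp = []
        for x in xs:
            total = mixture_density(x, weights, mus, sigmas)
            resp.append([w * normal_pdf(x, m, s) / total
                         for w, m, s in zip(weights, mus, sigmas)])
        # M-step: re-estimate weights, means, and standard deviations.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)  # floor keeps a component from collapsing
    return sum(math.log(mixture_density(x, weights, mus, sigmas)) for x in xs)


def bic(loglik, k, n):
    """Bayesian information criterion; lower is better."""
    num_params = 3 * k - 1  # k means, k standard deviations, k - 1 free weights
    return num_params * math.log(n) - 2 * loglik


# Synthetic data from two well-separated sources (hypothetical).
rng = random.Random(0)
data = [rng.gauss(-3.0, 1.0) for _ in range(40)] + [rng.gauss(3.0, 1.0) for _ in range(40)]
scores = {k: bic(fit_gmm_em(data, k), k, len(data)) for k in range(1, 5)}
for k, score in scores.items():
    print(k, round(score, 1))
```

Each candidate k requires a separate fit, and the final choice still depends on which criterion is inspected, which is the sense in which the procedure is ad hoc.
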
Motivation
Intuitive and Theoretical Motivation
Natural phenomena:
- Topics:
  ◮ (Wikipedia) dynamic traversal
  ◮ Clustering
- Species discovery
- Annotation and labeling
- Knowledge-base entity types
- . . .
For any fixed k, as we see more data, there is a positive probability that we will encounter a data point that does not fit in the current scheme; i.e., k grows with data.

Motivation
Theoretical Motivation: De Finetti’s Theorem
Infinite exchangeability: a data sequence is infinitely exchangeable if the distribution of any $N$ data points does not change under permutation:
  $p(X_1, \dots, X_N) = p(X_{\sigma(1)}, \dots, X_{\sigma(N)})$
Theorem (De Finetti’s Theorem): a sequence $X_1, X_2, \dots$ is infinitely exchangeable if and only if, for all $N$ and some distribution $P$:
  $p(X_1, \dots, X_N) = \int_{\theta} \prod_{n=1}^{N} p(X_n \mid \theta)\, P(d\theta)$
General proof: Hewitt, Savage 1955; Aldous 1983
Motivates:
- Parameters
- Likelihood
- Priors
- Non-parametric Bayesian priors

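A quick numerical check of the mixture representation in the theorem, under an assumed special case that is not in the slides: binary observations with a Bernoulli likelihood $p(X_n \mid \theta)$ and a Beta$(a, b)$ prior for $P$. In that case the integral has the closed form $B(a + s,\, b + N - s) / B(a, b)$, where $s$ is the number of ones and $B$ is the Beta function. The names `log_beta` and `marginal_prob` are hypothetical.

```python
import math
from itertools import permutations


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def marginal_prob(xs, a=1.0, b=1.0):
    """p(x_1..x_N): integral of prod_n p(x_n | theta) under a Beta(a, b) prior on theta."""
    s = sum(xs)  # the integral depends on the sequence only through the number of ones
    return math.exp(log_beta(a + s, b + len(xs) - s) - log_beta(a, b))


seq = (1, 1, 0, 1)
probs = {perm: marginal_prob(perm) for perm in set(permutations(seq))}
for perm, p in sorted(probs.items()):
    print(perm, p)
```

Every ordering receives the same probability, so the sequence is exchangeable, yet that probability differs from the product of the marginals ($0.5^4$ under the uniform prior): the observations are exchangeable without being independent.
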
Motivation
Theoretical Motivation
What happens under the parametric regime?
Let’s take the example of regression.
When fitting/optimizing, we’re finding the best fit within the chosen (parametric) family of functions; i.e., we’re optimizing to get the closest approximation to the true target function.
But this may not be good enough.

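The regression point can be illustrated with a small sketch in which the chosen family (straight lines) does not contain the target. The quadratic target, the noiseless data, and the names `fit_line` and `mse_of_best_line` are assumptions made for this example only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x, using the closed-form solution."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b


def mse_of_best_line(n):
    """Mean squared error of the best-fitting line on n points of a quadratic target."""
    xs = [-1 + 2 * i / (n - 1) for i in range(n)]  # evenly spaced inputs on [-1, 1]
    ys = [x ** 2 for x in xs]                      # noiseless quadratic target (assumed)
    a, b = fit_line(xs, ys)
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / n


for n in (10, 100, 1000):
    print(n, mse_of_best_line(n))
```

The error of the best line does not shrink toward zero as more data arrive: optimization finds the closest member of the family, but the family itself caps how good the fit can be.
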
Motivation
Theoretical Motivation: Non-parametric Bayesian Approach

Motivation
Practical Problem-solving Motivation
- Human intuitions about high-dimensional problems are often misleading!
  Example: a recent result from random matrix theory proves the proliferation of saddle points in comparison to local minima in high-dimensional problems [Dauphin et al. 2015]
- Assumptions often made when attempting to solve different problems are naturally part of the problem to be solved, e.g.,