

SLIDE 1

Bayesian Nonparametrics

Charlie Frogner

9.520 Class 11

March 14, 2012

  • C. Frogner

Bayesian Nonparametrics

SLIDE 2

About this class

Last time: the Bayesian formulation of RLS, for regression (basically, a normal distribution). This time: a more complicated probability model, the Dirichlet process, and its application to clustering, along with more Bayesian terminology.

SLIDE 3

Plan

• Dirichlet distribution + other basics
• The Dirichlet process
  – Abstract definition
  – Stick breaking
  – Chinese restaurant process
• Clustering
  – Dirichlet process mixture model
  – Hierarchical Dirichlet process mixture model

SLIDE 4

Gamma Function and Beta Distribution

The Gamma function: Γ(z) = ∫_0^∞ x^{z−1} e^{−x} dx. It extends the factorial to R_+: Γ(z + 1) = zΓ(z).

Beta distribution:

P(x|α, β) = [Γ(α + β) / (Γ(α)Γ(β))] x^{α−1} (1 − x)^{β−1}

for x ∈ [0, 1], α > 0, β > 0. (Mean: α/(α + β); variance: αβ/((α + β)^2 (α + β + 1)).)
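The Beta mean and variance formulas above can be checked numerically; a minimal sketch with NumPy, using illustrative parameter values:

```python
import numpy as np

# Monte Carlo check of the Beta mean and variance formulas.
# The parameter values a, b are illustrative, not from the slides.
rng = np.random.default_rng(0)
a, b = 5.0, 3.0
samples = rng.beta(a, b, size=200_000)

mean_formula = a / (a + b)
var_formula = a * b / ((a + b) ** 2 * (a + b + 1))

print(samples.mean(), mean_formula)   # both ≈ 0.625
print(samples.var(), var_formula)     # both ≈ 0.026
```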

SLIDE 5

Beta Distribution

[Figure: Beta densities p(x|α, β) for (α, β) = (1, 1), (2, 2), (5, 3), (4, 9), and for (α, β) = (1.0, 1.0), (1.0, 3.0), (1.0, 0.3), (0.3, 0.3).]

For large parameters the distribution is unimodal. For small parameters it favors biased binomial distributions.

SLIDE 6

Dirichlet Distribution

Generalizes the Beta distribution to the K-dimensional simplex S_K:

S_K = {x ∈ R^K : ∑_{i=1}^K xi = 1, xi ≥ 0 ∀i}.

Dirichlet distribution:

P(x|α) = P(x1, . . . , xK) = [Γ(∑_{i=1}^K αi) / ∏_{i=1}^K Γ(αi)] ∏_{i=1}^K xi^{αi−1}

where α = (α1, . . . , αK), αi > 0 ∀i, and x ∈ S_K. We write x ∼ Dir(α), i.e. x1, . . . , xK ∼ Dir(α1, . . . , αK).
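As a quick sanity check that Dir(α) really is a distribution on the simplex S_K, one can sample it with NumPy (the α below is an illustrative choice):

```python
import numpy as np

# Draw from Dir(alpha) and confirm each sample lies on the simplex S_K.
rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])
x = rng.dirichlet(alpha, size=4)  # four samples, each in S_3

print(x.shape)        # (4, 3)
print(x.sum(axis=1))  # each row sums to 1
```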

SLIDE 7

Dirichlet Distribution

SLIDE 8

Properties of the Dirichlet Distribution

Write α0 = ∑_{j=1}^K αj.

Mean: E[xi] = αi / α0.

Variance: Var[xi] = αi (α0 − αi) / (α0^2 (1 + α0)).

Covariance: Cov(xi, xj) = −αi αj / (α0^2 (1 + α0)) for i ≠ j.

Marginals: xi ∼ Beta(αi, ∑_{j≠i} αj).

Aggregation: (x1 + x2, x3, . . . , xK) ∼ Dir(α1 + α2, α3, . . . , αK).
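The marginal and aggregation properties can be verified by simulation; a sketch, with an illustrative α = (2, 3, 4):

```python
import numpy as np

# Check two Dirichlet properties by simulation: the marginal of x1 is
# Beta(a1, a2 + a3), and (x1 + x2, x3) is Dir(a1 + a2, a3).
rng = np.random.default_rng(0)
a = np.array([2.0, 3.0, 4.0])
x = rng.dirichlet(a, size=300_000)

# Marginal: compare the mean of x1 against the Beta(2, 7) mean.
beta_mean = a[0] / a.sum()
print(x[:, 0].mean(), beta_mean)            # both ≈ 0.222

# Aggregation: x1 + x2 should have the marginal of the first
# coordinate of Dir(5, 4), i.e. mean 5/9.
agg = x[:, 0] + x[:, 1]
agg_mean = (a[0] + a[1]) / a.sum()
print(agg.mean(), agg_mean)                 # both ≈ 0.556
```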

SLIDE 9

Multinomial Distribution

If you throw n balls into K bins, the distribution of balls into bins is given by the multinomial distribution.

Multinomial distribution: let p = (p1, . . . , pK) be probabilities over K categories and C = (C1, . . . , CK) be category counts, where Ci is the number of samples in the i-th category from n independent draws of a categorical variable with category probabilities p. Then

P(C|n, p) = [n! / ∏_{i=1}^K Ci!] ∏_{i=1}^K pi^{Ci}.

For K = 2 this is the binomial distribution.
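The balls-in-bins picture is easy to simulate; a short NumPy sketch with illustrative n and p:

```python
import numpy as np

# Throw n balls into K bins with probabilities p; the counts are
# Multinomial(n, p).
rng = np.random.default_rng(0)
n, p = 10, np.array([0.2, 0.3, 0.5])
C = rng.multinomial(n, p)
print(C, C.sum())  # counts over 3 bins, summing to n = 10

# E[Ci] = n * pi: check the empirical mean over many draws.
draws = rng.multinomial(n, p, size=100_000)
print(draws.mean(axis=0))  # ≈ [2, 3, 5]
```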

SLIDE 10

An idea

Treat the Dirichlet distribution as a distribution on probabilities: each sample θ ∼ Dir(α) defines a K-dimensional multinomial distribution. x ∼ Mult(θ), θ ∼ Dir(α)

SLIDE 11

An idea

As on the previous slide, x ∼ Mult(θ), θ ∼ Dir(α). The posterior on θ is then θ|x ∼ Dir(α + x).

SLIDE 12

Conjugate Priors

Say x ∼ F(θ) (the likelihood) and θ ∼ G(α) (the prior).

Conjugate prior: G is a conjugate prior for F if the posterior P(θ|x, α) is in the same family as G. (E.g. if the prior G is Gaussian, then the posterior P(θ|x, α) should also be Gaussian.)

So the Dirichlet distribution is a conjugate prior for the multinomial.
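Conjugacy makes the Dirichlet/multinomial posterior update a one-liner: add the observed counts to the prior parameters. A sketch with illustrative counts:

```python
import numpy as np

# Dirichlet-multinomial conjugacy: observing counts x turns the prior
# Dir(alpha) into the posterior Dir(alpha + x). Values are illustrative.
alpha = np.array([1.0, 1.0, 1.0])   # symmetric prior over 3 categories
x = np.array([4, 0, 6])             # observed category counts

alpha_post = alpha + x              # posterior parameters: Dir(5, 1, 7)
post_mean = alpha_post / alpha_post.sum()
print(alpha_post)   # [5. 1. 7.]
print(post_mean)    # posterior mean of the category probabilities
```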

SLIDE 13

Plan

• Dirichlet distribution + other basics
• The Dirichlet process
  – Abstract definition
  – Stick breaking
  – Chinese restaurant process
• Clustering
  – Dirichlet process mixture model
  – Hierarchical Dirichlet process mixture model

SLIDE 14

Parametric vs. nonparametric

Parametric: the number of parameters is fixed independently of the data. Nonparametric: the effective number of parameters can grow with the data. E.g., in density estimation: fitting a Gaussian vs. Parzen windows. Kernel methods are nonparametric.

SLIDE 15

Dirichlet Process

Want: a distribution on K-dimensional simplices, for all K.

Informal description: X is a space and F(X) is the set of all possible distributions on X. A Dirichlet process gives a distribution over F(X). A sample path from a DP is an element F ∈ F(X): F can be seen as a (random) probability distribution on X.

SLIDE 16

Dirichlet Process

Want: a distribution on K-dimensional simplices, for all K.

Formal definition: let X be a space and H a base measure on X. F is a sample from the Dirichlet process DP(α, H) on X if its finite-dimensional marginals have the Dirichlet distribution:

(F(B1), . . . , F(BK)) ∼ Dir(αH(B1), . . . , αH(BK))

for all partitions B1, . . . , BK of X (for any K).

SLIDE 17

Stick Breaking Construction

Explicit construction of a DP. Let α > 0 and define weights (πi)_{i=1}^∞ by

πi = βi ∏_{j=1}^{i−1} (1 − βj) = βi (1 − ∑_{j=1}^{i−1} πj),

where βi ∼ Beta(1, α) for all i. Let H be a distribution on X and define

F = ∑_{i=1}^∞ πi δ_{θi},

where θi ∼ H for all i.
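The construction translates directly into code; a truncated sketch (the truncation level T is an implementation convenience, not part of the DP, and H = N(0, 1) is an illustrative choice):

```python
import numpy as np

# Truncated stick-breaking sample of the weights of DP(alpha, H).
def stick_breaking(alpha, T, rng):
    beta = rng.beta(1.0, alpha, size=T)
    # remaining[i] = prod_{j < i} (1 - beta_j), the stick left before step i
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - beta)[:-1]])
    return beta * remaining  # pi_i = beta_i * prod_{j<i} (1 - beta_j)

rng = np.random.default_rng(0)
pi = stick_breaking(alpha=1.0, T=1000, rng=rng)
theta = rng.standard_normal(1000)  # atoms theta_i ~ H = N(0, 1)
print(pi.sum())  # close to 1 for large T (mass beyond T is discarded)
```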

SLIDE 18

Stick Breaking Construction: Interpretation

[Figure: stick-breaking weights π1, . . . , π5 obtained from proportions β1, . . . , β5 (with remainders 1 − β1, . . . , 1 − β4), and sampled weight sequences for α = 1 and α = 5.]

The weights π partition a unit-length stick into infinitely many pieces: the i-th weight is a random proportion βi of the stick remaining after sampling the first i − 1 weights.

SLIDE 19

Stick Breaking Construction (cont.)

It is possible to prove (Sethuraman ’94) that the previous construction returns a DP; moreover, a draw from a Dirichlet process is discrete almost surely.

SLIDE 20

Chinese Restaurant Process

There is an infinite (countable) set of tables. The first customer sits at the first table. Customer i sits at table j with probability nj / (α + i − 1), where nj is the number of customers already at table j, and sits at the first open table with probability α / (α + i − 1).
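The seating scheme can be simulated directly; a minimal sketch (`crp` and its arguments are our own names):

```python
import numpy as np

# Simulate the Chinese restaurant process: customer i joins table j with
# probability n_j / (alpha + i - 1), or a new table with
# probability alpha / (alpha + i - 1).
def crp(n_customers, alpha, rng):
    tables = []  # tables[j] = number of customers at table j
    for i in range(1, n_customers + 1):
        probs = np.array(tables + [alpha]) / (alpha + i - 1)
        j = rng.choice(len(tables) + 1, p=probs)
        if j == len(tables):
            tables.append(1)   # open a new table
        else:
            tables[j] += 1
    return tables

rng = np.random.default_rng(0)
tables = crp(500, alpha=2.0, rng=rng)
print(len(tables), sum(tables))  # occupied tables; customers total 500
```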

SLIDE 21

The Role of the Strength Parameter

Note that E[βi] = 1/(1 + α). For small α, the first few components hold almost all of the mass; for large α, F approaches the base distribution H, assigning nearly uniform weights to the samples θi.

SLIDE 22

Number of Clusters and Strength Parameter

It is possible to prove (Antoniak ’74) that the number of components with positive count grows as α log n as the number of samples n increases.
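The α log n growth can be checked against the exact expectation E[K_n] = ∑_{i=1}^n α/(α + i − 1), which follows from the CRP new-table probabilities; a sketch with an illustrative α:

```python
import numpy as np

# Expected number of occupied tables after n customers in a CRP(alpha):
# E[K_n] = sum_{i=1}^n alpha / (alpha + i - 1), which grows like
# alpha * log(n). alpha = 2 is an illustrative choice.
alpha = 2.0
for n in [100, 1000, 10000]:
    expected = sum(alpha / (alpha + i - 1) for i in range(1, n + 1))
    print(n, round(expected, 1), round(alpha * np.log(n), 1))
```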

SLIDE 23

Another idea

Clustering with the K-dimensional Dirichlet: take each sample θ ∼ Dir(α) to define a K-dimensional categorical (rather than multinomial) distribution: x ∼ G(φ), φ ∼ Cat(θ), θ ∼ Dir(α). (G is a distribution on the observation space X, say, Gaussian.) θi is the probability of x coming from the i-th cluster.

SLIDE 26

Another idea

Clustering with the Dirichlet process: take each sample θ ∼ DP(α, H) to define a categorical (rather than multinomial) distribution: x ∼ G(φ), φ ∼ Cat(θ), θ ∼ DP(α, H). (G is a distribution on the observation space X, say, Gaussian. H can be uniform on {1, . . . , K}.)

SLIDE 28

Another idea

Clustering with the Dirichlet process:

x ∼ G(φ), φ ∼ Cat(θ), θ ∼ DP(α, H)

This is the Dirichlet process mixture model.
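A generative sketch of this mixture model, using a truncated stick-breaking draw for θ and Gaussian clusters (here the base measure is taken over cluster means, and all constants are illustrative choices, not from the slides):

```python
import numpy as np

# Generative sketch of a DP mixture of 1-D Gaussians, via truncated
# stick-breaking. Base measure over cluster means: N(0, 25); unit
# observation noise. alpha, T, n are illustrative.
rng = np.random.default_rng(0)
alpha, T, n = 1.0, 500, 200

beta = rng.beta(1.0, alpha, size=T)
pi = beta * np.concatenate([[1.0], np.cumprod(1.0 - beta)[:-1]])
pi = pi / pi.sum()                       # renormalize the truncation
means = 5.0 * rng.standard_normal(T)     # atoms: cluster means

z = rng.choice(T, size=n, p=pi)          # cluster assignments
x = means[z] + rng.standard_normal(n)    # observations x ~ N(mean_z, 1)
print(len(np.unique(z)), "clusters used for", n, "points")
```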

SLIDE 29

Hierarchical Dirichlet Process

What if we want to model grouped data, with each group corresponding to a different DP mixture model?

Hierarchical Dirichlet process: for each i ∈ {1, . . . , n}, draw xi according to

xi ∼ G(φ), φ ∼ Cat(θ), θ ∼ DP(α, H0), H0 ∼ DP(γ, H).

SLIDE 31

Conclusions

• The Dirichlet distribution gives a distribution over the K-simplex.
• The Dirichlet is conjugate to the multinomial, which makes inference in the Dirichlet/multinomial model easy.
• The Dirichlet process generalizes the Dirichlet distribution to countably infinitely many components.
• Every finite marginal of the DP is Dirichlet distributed.
• The complexity of the DP is controlled by the strength parameter α.
• The posterior distribution cannot be found analytically, so approximate inference is needed.

SLIDE 32

References

This lecture draws heavily (sometimes literally) from the references below, which we suggest as further reading. Figures are taken from either Sudderth's PhD thesis or Teh's tutorial.

Main references/sources:
• Yee Whye Teh, tutorial at the Machine Learning Summer School, and his notes on Dirichlet processes.
• Erik Sudderth, PhD thesis.
• Ghosh and Ramamoorthi, Bayesian Nonparametrics (book).

See also:
• Zoubin Ghahramani, ICML tutorial.
• Michael Jordan, NIPS tutorial.
• Rasmussen and Williams, Gaussian Processes for Machine Learning (book).
• Ferguson, paper in the Annals of Statistics.
• Sethuraman, paper in Statistica Sinica.
• Berlinet and Thomas-Agnan, RKHS in Probability and Statistics (book).

SLIDE 33

APPENDIX

SLIDE 34

Dirichlet Process (cont.)

A partition of X is a collection of subsets B1, . . . , BN such that Bi ∩ Bj = ∅ for all i ≠ j and ∪_{i=1}^N Bi = X.

Definition (existence theorem): let α > 0 and H be a probability distribution on X. One can prove that there exists a unique distribution DP(α, H) on F(X) such that, if F ∼ DP(α, H) and B1, . . . , BN is a partition of X, then

(F(B1), . . . , F(BN)) ∼ Dir(αH(B1), . . . , αH(BN)).

This result is proved (Ferguson ’73) using Kolmogorov’s consistency theorem (Kolmogorov ’33).

SLIDE 35

Dirichlet Processes Illustrated

[Figure: Dirichlet processes illustrated — two partitions of the space, (T1, T2, T3) and (T̃1, . . . , T̃5).]

SLIDE 36

Properties of Dirichlet Processes

Hereafter F ∼ DP(α, H) and A is a measurable set in X.

Expectation: E[F(A)] = H(A).

Variance: Var[F(A)] = H(A)(1 − H(A)) / (α + 1).

SLIDE 37

Properties of Dirichlet Processes (cont.)

Posterior and conjugacy: let x ∼ F and consider a fixed partition B1, . . . , BN. Then

P(F(B1), . . . , F(BN) | x ∈ Bk) = Dir(αH(B1), . . . , αH(Bk) + 1, . . . , αH(BN)).

It is possible to prove that if S = (x1, . . . , xn) ∼ F and F ∼ DP(α, H), then

P(F|S, α, H) = DP(α + n, 1/(n + α) (αH + ∑_{i=1}^n δ_{xi})).
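The posterior base measure is a weighted combination of the prior H and the empirical atoms; the weights α/(α + n) and n/(α + n) are easy to inspect (values below are illustrative):

```python
# Posterior of a DP: F | S ~ DP(alpha + n, (alpha*H + sum_i delta_xi)/(alpha + n)).
# Compute how much mass the posterior base measure keeps on the prior H
# versus the observed atoms. alpha and n are illustrative.
alpha, n = 2.0, 50
w_prior = alpha / (alpha + n)   # mass retained by the prior base measure H
w_data = n / (alpha + n)        # mass moved onto the observed atoms
print(w_prior, w_data, w_prior + w_data)
```

As n grows with α fixed, w_data → 1: the posterior is dominated by the data.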

SLIDE 38

A Qualitative Reasoning

From the form of the posterior we have

E(F(A)|S, α, H) = 1/(n + α) (αH(A) + ∑_{i=1}^n δ_{xi}(A)).

If α < ∞ and n → ∞, one can argue that

E(F(A)|S, α, H) → ∑_{i=1}^∞ πi δ_{xi}(A),

where (πi)_{i=1}^∞ is the sequence of limits lim_{n→∞} Ci/n of the empirical frequencies of the observations (xi)_{i=1}^∞.

If the posterior concentrates about its mean, the above reasoning suggests that the obtained distribution is discrete.
