

  1. Bayesian Nonparametrics: Models Based on the Dirichlet Process
     Alessandro Panella
     Department of Computer Science, University of Illinois at Chicago
     Machine Learning Seminar Series, February 18, 2013

  2. Sources and Inspirations
     Tutorials (slides):
     - P. Orbanz and Y.W. Teh, Modern Bayesian Nonparametrics. NIPS 2011.
     - M. Jordan, Dirichlet Process, Chinese Restaurant Process, and All That. NIPS 2005.
     Articles etc.:
     - E.B. Sudderth, chapter in PhD thesis, 2006.
     - E. Fox, chapter in PhD thesis, 2008.
     - Y.W. Teh, Dirichlet Processes. Encyclopedia of Machine Learning, Springer, 2010.

  3. Outline
     1. Introduction and background: Bayesian learning; nonparametric models
     2. Finite mixture models: Bayesian models; clustering with FMMs; inference
     3. Dirichlet process mixture models: going nonparametric!; the Dirichlet process; DP mixture models; inference
     4. A little more theory...: De Finetti's theorem REDUX; Dirichlet process REDUX
     5. The hierarchical Dirichlet process

  4. Outline (section divider: Introduction and background)
     [Same outline as above, with the current section highlighted.]

  5. Introduction and background / Bayesian learning: The meaning of it all
     BAYESIAN NONPARAMETRICS

  6. Introduction and background / Bayesian learning: Bayesian statistics
     Estimate a parameter $\theta \in \Theta$ after observing data $x$.
     Frequentist: maximum likelihood (ML):
         $\hat{\theta}_{\mathrm{MLE}} = \arg\max_\theta p(x \mid \theta) = \arg\max_\theta L(\theta; x)$
     Bayesian: Bayes' rule:
         $p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$
     Bayesian prediction (using the whole posterior, not just one estimator):
         $p(x_{\mathrm{new}} \mid x) = \int_\Theta p(x_{\mathrm{new}} \mid \theta)\, p(\theta \mid x)\, d\theta$
     Maximum a posteriori (MAP):
         $\hat{\theta}_{\mathrm{MAP}} = \arg\max_\theta p(x \mid \theta)\, p(\theta)$
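To make the contrast concrete, here is a minimal numeric sketch (my own illustration, not from the slides) that computes the MLE, the MAP estimate, and the full Bayesian predictive for a coin on a discretized parameter grid; the Beta(2, 2)-shaped prior, the grid size, and the toy counts are arbitrary choices.

```python
import numpy as np

# Toy data: 7 heads in 10 tosses of a coin with unknown bias theta.
n_heads, n_tosses = 7, 10

# Discretize Theta = [0, 1] so integrals become sums.
theta = np.linspace(0.001, 0.999, 999)

# A Beta(2, 2)-shaped prior (arbitrary choice), normalized over the grid.
prior = theta * (1 - theta)
prior /= prior.sum()

likelihood = theta**n_heads * (1 - theta)**(n_tosses - n_heads)

# Bayes' rule: p(theta | x) is proportional to p(x | theta) p(theta).
posterior = likelihood * prior
posterior /= posterior.sum()

theta_mle = theta[np.argmax(likelihood)]   # argmax_theta L(theta; x)
theta_map = theta[np.argmax(posterior)]    # argmax_theta p(x|theta) p(theta)

# Bayesian prediction: p(x_new = H | x) = sum of theta * p(theta | x).
p_next_heads = np.sum(theta * posterior)

print(f"MLE={theta_mle:.3f}  MAP={theta_map:.3f}  predictive={p_next_heads:.3f}")
```

With little data the three answers differ (here roughly 0.700, 0.667, and 0.643), which is the point of keeping the whole posterior rather than a single estimator.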

  7. Introduction and background / Bayesian learning: De Finetti's theorem
     A premise.
     Definition. An infinite sequence of random variables $(x_1, x_2, \ldots)$ is said to be (infinitely) exchangeable if, for every $N$ and every permutation $\pi$ of $(1, \ldots, N)$,
         $p(x_1, x_2, \ldots, x_N) = p(x_{\pi(1)}, x_{\pi(2)}, \ldots, x_{\pi(N)})$
     Note: exchangeable does not mean i.i.d.!
     Example (Polya urn). An urn contains some red balls and some black balls; an infinite sequence of colors is drawn recursively as follows: draw a ball, mark down its color, then put the ball back in the urn along with an additional ball of the same color. The resulting color sequence is exchangeable but not i.i.d., as the simulation below illustrates.
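The urn's exchangeability can be checked empirically. Here is a small simulation sketch (my own illustration, with an urn started at one red and one black ball): every ordering of two reds and one black comes out equally often, even though the draws are clearly dependent.

```python
import random
from collections import Counter

def polya_draws(n, red=1, black=1):
    """Draw n colors from a Polya urn starting with `red` red and `black` black balls."""
    seq = []
    for _ in range(n):
        color = "R" if random.random() < red / (red + black) else "B"
        if color == "R":
            red += 1    # return the ball plus one more of the same color
        else:
            black += 1
        seq.append(color)
    return tuple(seq)

random.seed(0)
TRIALS = 200_000
counts = Counter(polya_draws(3) for _ in range(TRIALS))

# Exchangeability: all orderings of {R, R, B} should be about equally
# frequent; each has exact probability (1/2)(2/3)(1/4) = 1/12.
for seq in [("R", "R", "B"), ("R", "B", "R"), ("B", "R", "R")]:
    print(seq, counts[seq] / TRIALS)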

  8. Introduction and background / Bayesian learning: De Finetti's theorem (cont'd)
     Theorem (De Finetti, 1935; a.k.a. the representation theorem). A sequence of random variables $(x_1, x_2, \ldots)$ is infinitely exchangeable if and only if, for all $N$, there exists a random variable $\theta$ and a probability measure $p$ on it such that
         $p(x_1, x_2, \ldots, x_N) = \int_\Theta p(\theta) \prod_{i=1}^{N} p(x_i \mid \theta)\, d\theta$
     That is, there exists a parameter space and a measure on it that make the variables conditionally i.i.d.!
     The representation theorem motivates (and encourages!) the use of Bayesian statistics.
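For the Polya urn above, the representation is explicit (this is a standard fact): an urn started with $a$ red and $b$ black balls induces the same sequence distribution as drawing $\theta \sim \mathrm{Beta}(a, b)$ once and then sampling colors i.i.d. Bernoulli($\theta$). A minimal Monte Carlo sketch checking this for $a = b = 1$:

```python
import random
from collections import Counter

random.seed(1)
TRIALS = 200_000

def definetti_draws(n, a=1, b=1):
    """Mixture representation: theta ~ Beta(a, b), then n i.i.d. colors."""
    theta = random.betavariate(a, b)
    return tuple("R" if random.random() < theta else "B" for _ in range(n))

counts = Counter(definetti_draws(3) for _ in range(TRIALS))

# Matches the Polya urn started with one red and one black ball:
# p(R, R, B) = (1/2)(2/3)(1/4) = 1/12 under the urn scheme.
print(counts[("R", "R", "B")] / TRIALS)   # should be close to 1/12 = 0.083
```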

  9. Introduction and background / Bayesian learning: Bayesian learning
     Hypothesis space $H$. Given data $D$, compute
         $p(h \mid D) = \frac{p(D \mid h)\, p(h)}{p(D)}$
     Then, to predict some future data $D'$, we can either:
     - average over $H$, i.e. $p(D' \mid D) = \int_H p(D' \mid h)\, p(h \mid D)\, dh$;
     - choose the MAP $h$ (or compute it directly), i.e. $p(D' \mid D) \approx p(D' \mid h_{\mathrm{MAP}})$;
     - sample from the posterior; ...
     $H$ can be anything! Bayesian learning is a general learning framework. We will consider the case in which $h$ is a probabilistic model itself, i.e. a parameter vector $\theta$. (A toy instance with a two-element $H$ is sketched below.)
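As a toy instance of "$H$ can be anything", here is a sketch of my own making (the hypotheses and data are invented for illustration) with a two-element hypothesis space: it computes $p(h \mid D)$ and contrasts averaging over $H$ with the MAP plug-in.

```python
# Hypothetical hypothesis space: a fair coin vs. a coin with p(H) = 0.8.
hypotheses = {"fair": 0.5, "biased": 0.8}
prior = {"fair": 0.5, "biased": 0.5}

data = "HHTHHHHT"                        # toy observations D
n_h, n_t = data.count("H"), data.count("T")

# Posterior p(h | D) is proportional to p(D | h) p(h).
post = {h: prior[h] * p**n_h * (1 - p)**n_t for h, p in hypotheses.items()}
z = sum(post.values())
post = {h: v / z for h, v in post.items()}

# Predict p(next toss = H | D) two ways:
avg_pred = sum(post[h] * hypotheses[h] for h in hypotheses)  # average over H
h_map = max(post, key=post.get)                              # MAP hypothesis
print(post)
print("averaged:", round(avg_pred, 3), " MAP plug-in:", hypotheses[h_map])
```

With only eight tosses the averaged prediction (about 0.72) sits between the two hypotheses, while the MAP plug-in commits fully to the biased coin.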

  10. Introduction and background / Bayesian learning: A simple example
      Infer the bias $\theta \in [0, 1]$ of a coin after observing $N$ tosses; encode H = 1, T = 0, so $p(H) = \theta$. The hypothesis is $h = \theta$, hence $H = [0, 1]$.
      Sequence of Bernoulli trials:
          $p(x_1, \ldots, x_N \mid \theta) = \theta^{n_H} (1 - \theta)^{N - n_H}$
      where $n_H$ = # heads. With $\theta$ unknown,
          $p(x_1, \ldots, x_N) = \int_0^1 \theta^{n_H} (1 - \theta)^{N - n_H}\, p(\theta)\, d\theta$
      [Graphical model: $\theta$ pointing into a plate over the observations $x_1, \ldots, x_N$.]
      We need to find a "good" prior $p(\theta)$... the Beta distribution!
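Why a "good" prior matters: with few tosses the likelihood alone is overconfident. A tiny sketch (my illustration): the MLE $n_H / N$ after three straight heads declares tails impossible.

```python
def coin_mle(n_heads, n_tosses):
    """Maximum-likelihood estimate of the coin bias: n_H / N."""
    return n_heads / n_tosses

print(coin_mle(3, 3))     # 1.0: after HHH, tails is deemed impossible
print(coin_mle(70, 100))  # 0.7: more sensible once data accumulates
```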

  11. Introduction and background / Bayesian learning: A simple example (cont'd)
      Beta distribution: $\theta \sim \mathrm{Beta}(a, b)$, i.e.
          $p(\theta \mid a, b) = \frac{1}{B(a, b)}\, \theta^{a-1} (1 - \theta)^{b-1}$
      Bayesian learning: $p(h \mid D) \propto p(D \mid h)\, p(h)$; for us:
          $p(\theta \mid x_1, \ldots, x_N) \propto p(x_1, \ldots, x_N \mid \theta)\, p(\theta) = \theta^{n_H} (1 - \theta)^{n_T} \cdot \frac{1}{B(a, b)}\, \theta^{a-1} (1 - \theta)^{b-1} \propto \theta^{n_H + a - 1} (1 - \theta)^{n_T + b - 1}$
      i.e. $\theta \mid x_1, \ldots, x_N \sim \mathrm{Beta}(a + n_H, b + n_T)$.
      We're lucky! The Beta distribution is a conjugate prior for the binomial distribution.
      [Plots: densities of Beta(0.1, 0.1), Beta(1, 1), Beta(2, 3), Beta(10, 10).]
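Conjugacy makes the update one line. A minimal sketch (the hyperparameters $a = b = 2$ and the counts are arbitrary choices): the posterior is $\mathrm{Beta}(a + n_H, b + n_T)$, and its mean is the predictive probability of heads, $(a + n_H)/(a + b + N)$.

```python
a, b = 2.0, 2.0            # Beta(a, b) prior on the coin bias
n_heads, n_tails = 7, 3    # observed counts from N = 10 tosses

# Conjugate update: Beta prior + Bernoulli likelihood -> Beta posterior.
a_post, b_post = a + n_heads, b + n_tails

# The posterior mean is also the predictive probability of heads:
# p(x_new = H | x) = E[theta | x] = a_post / (a_post + b_post).
p_next_heads = a_post / (a_post + b_post)
print(f"theta | x ~ Beta({a_post:.0f}, {b_post:.0f});  p(H) = {p_next_heads:.3f}")
```

No numerical integration is needed here, which is exactly what conjugacy buys compared with the grid approximation sketched earlier.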
