
Statistical Machine Learning, Lecture 06: Probability Density Estimation



  1. Statistical Machine Learning, Lecture 06: Probability Density Estimation. Kristian Kersting, TU Darmstadt, Summer Term 2020. Based on slides from J. Peters.

  2. Today's Objectives: make you understand how to find p(x). Covered topics: density estimation, maximum likelihood estimation, non-parametric methods, mixture models, expectation maximization.

  3. Outline: 1. Probability Density; 2. Parametric Models (Maximum Likelihood Method); 3. Non-Parametric Models (Histograms, Kernel Density Estimation, K-Nearest Neighbors); 4. Mixture Models; 5. Wrap-Up.

  4. 1. Probability Density (section opener; repeats the outline above).

  5. 1. Probability Density: Training Data. [Figure: scatter plot of one-dimensional training points; only the axis ticks survived extraction.] How do we get the probability distributions from this so that we can classify with them?

  6. 1. Probability Density: Probability Density Estimation. So far we have seen Bayes optimal classification, based on the probability distributions p(x | C_k) p(C_k). The prior p(C_k) is easy to deal with: we can just count the number of occurrences of each class in the training data. We still need to estimate (learn) the class-conditional probability density p(x | C_k). Supervised training: we know the input data points and their true labels (classes), so we estimate the density separately for each class C_k. Abbreviation: p(x) = p(x | C_k).

  7. 1. Probability Density: Probability Density Estimation. Given training data x_1, x_2, x_3, ..., estimate p(x). Methods: parametric models, non-parametric models, mixture models.

  8. 2. Parametric Models (section opener; repeats the outline above).

  9. 2. Parametric Models. Simple case: the Gaussian distribution, p(x | µ, σ) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)). It is governed by two parameters, mean and variance; if we know these parameters, we can fully describe p(x).
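The formula translates directly into code. A minimal sketch in Python (not from the slides; NumPy only, the test values are arbitrary):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Evaluate the univariate Gaussian density p(x | mu, sigma)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / np.sqrt(2.0 * np.pi * sigma ** 2)

print(gaussian_pdf(0.0, mu=0.0, sigma=1.0))  # ~0.3989, the standard normal at its mode
```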

  10. 2. Parametric Models. Notation for parametric density models: x ∼ p(x | θ). For the Gaussian distribution, θ = (µ, σ), i.e., x ∼ p(x | µ, σ).

  11. 2. Parametric Models: Maximum Likelihood Method. Learning means estimating the parameters θ given the training data X = {x_1, x_2, ...}. The likelihood of θ is defined as the probability that the data X was generated from the probability density function with parameters θ: L(θ) = p(X | θ).

  12. 2. Parametric Models: Maximum Likelihood Method. Consider a set of points X = {x_1, ..., x_N}. The likelihood of a single datum is p(x_n | θ). Of all data? Assumption: the data is i.i.d. (independent and identically distributed). The random variables x_1 and x_2 are independent if P(x_1 ≤ α, x_2 ≤ β) = P(x_1 ≤ α) P(x_2 ≤ β) for all α, β ∈ R. They are identically distributed if P(x_1 ≤ α) = P(x_2 ≤ α) for all α ∈ R.

  13. 2. Parametric Models: Maximum Likelihood Method. Likelihood: L(θ) = p(X | θ) = p(x_1, ..., x_N | θ) = p(x_1 | θ) · ... · p(x_N | θ) = ∏_{n=1}^{N} p(x_n | θ), using the i.i.d. assumption.

  14. 2. Parametric Models: Maximum Likelihood Method. Maximize the (log-)likelihood w.r.t. θ: log L(θ) = log p(X | θ) = log ∏_{n=1}^{N} p(x_n | θ) = Σ_{n=1}^{N} log p(x_n | θ).
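Under the i.i.d. assumption the log-likelihood is just a sum over data points. A minimal sketch for the Gaussian case (not from the slides; the toy data and query parameters are our own choices):

```python
import numpy as np

def gaussian_log_likelihood(X, mu, sigma):
    """log L(theta) = sum_n log p(x_n | mu, sigma), valid under the i.i.d. assumption."""
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma ** 2)
                  - (X - mu) ** 2 / (2.0 * sigma ** 2))

X = np.array([0.9, 1.1, 1.3])  # toy data
print(gaussian_log_likelihood(X, mu=1.1, sigma=0.2))
```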

  15. 2. Parametric Models: Maximum Likelihood Method. Maximum likelihood estimation of a Gaussian: (µ̂, σ̂) = arg max_{µ,σ} log L(θ) = arg max_{µ,σ} Σ_{n=1}^{N} log p(x_n | µ, σ). Take the partial derivatives and set them to zero, ∂L/∂µ = 0 and ∂L/∂σ = 0. This leads to a closed-form solution: µ̂ = (1/N) Σ_{n=1}^{N} x_n and σ̂² = (1/N) Σ_{n=1}^{N} (x_n − µ̂)².
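The closed-form estimators are one line each. A minimal sketch on synthetic data (the ground-truth parameters loc=2.0, scale=0.5 are our own choice):

```python
import numpy as np

def fit_gaussian_ml(X):
    """Closed-form ML estimates: mu_hat = mean, sigma_hat^2 = (1/N) sum (x_n - mu_hat)^2."""
    mu_hat = np.mean(X)
    var_hat = np.mean((X - mu_hat) ** 2)  # the 1/N (biased) estimator from the slide
    return mu_hat, var_hat

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=0.5, size=1000)  # synthetic data with known ground truth
print(fit_gaussian_ml(X))  # should land near (2.0, 0.25)
```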

  16. 2. Parametric Models: Maximum Likelihood Method. Maximum Likelihood Method, Gaussian. [Figure slide; image not preserved in the transcript.]

  17. 2. Parametric Models: Maximum Likelihood Method. Likelihood: L(θ) = p(X | θ) = ∏_{n=1}^{N} p(x_n | θ).

  18. 2. Parametric Models: Maximum Likelihood Method. Degenerate case: if N = 1, X = {x_1}, the resulting Gaussian collapses, since µ̂ = x_1 and σ̂² = 0, into a zero-variance spike at x_1. [Figure slide; image not preserved in the transcript.]
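The collapse is easy to verify numerically; a short check (the data value 0.7 is arbitrary):

```python
import numpy as np

X = np.array([0.7])                    # a single training point
mu_hat = np.mean(X)                    # mu_hat = x_1 = 0.7
var_hat = np.mean((X - mu_hat) ** 2)   # sigma_hat^2 = 0: the density degenerates to a spike
print(mu_hat, var_hat)
```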

  19. 2. Parametric Models: Maximum Likelihood Method. Degenerate case: what can we do to still get a useful estimate? We can put a prior on the mean!

  20. 2. Parametric Models: Maximum Likelihood Method. Bayesian Estimation. Bayesian estimation (learning) of parametric distributions assumes that the parameters are not fixed but are random variables too. This allows us to use prior knowledge about the parameters. How do we achieve that? What do we want? A density model for x, p(x). What do we have? Data X.

  21. 2. Parametric Models: Maximum Likelihood Method. Bayesian Estimation. Formalize this as a conditional probability p(x | X): p(x | X) = ∫ p(x, θ | X) dθ, where p(x, θ | X) = p(x | θ, X) p(θ | X). Since p(x) can be fully determined with the parameters θ, i.e., θ is a sufficient statistic, we have p(x | θ, X) = p(x | θ), and hence p(x | X) = ∫ p(x | θ) p(θ | X) dθ.

  22. 2. Parametric Models: Maximum Likelihood Method. Bayesian Estimation. p(x | X) = ∫ p(x | θ) p(θ | X) dθ, with p(θ | X) = p(X | θ) p(θ) / p(X) = L(θ) p(θ) / p(X) and p(X) = ∫ p(X | θ) p(θ) dθ = ∫ L(θ) p(θ) dθ. Therefore p(x | X) = (1/p(X)) ∫ p(x | θ) L(θ) p(θ) dθ.
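When θ is low-dimensional, these integrals can be approximated on a grid. A minimal sketch, assuming (our choice, not the slides') a Gaussian likelihood with known σ = 1 and the unknown mean as the single parameter θ; all numeric values are toy choices:

```python
import numpy as np

# Toy setup: Gaussian likelihood with known sigma, unknown mean theta,
# Gaussian prior p(theta) = N(0, 2^2).
X = np.array([1.8, 2.3, 2.1])
sigma = 1.0

theta = np.linspace(-5.0, 5.0, 2001)   # grid over the unknown mean
dtheta = theta[1] - theta[0]

def log_normal(x, mu, s):
    return -0.5 * np.log(2.0 * np.pi * s ** 2) - (x - mu) ** 2 / (2.0 * s ** 2)

log_prior = log_normal(theta, 0.0, 2.0)
log_lik = np.sum(log_normal(X[:, None], theta[None, :], sigma), axis=0)  # log L(theta)
post = np.exp(log_prior + log_lik - (log_prior + log_lik).max())
post /= post.sum() * dtheta            # normalizing divides by the evidence p(X)

# Posterior predictive p(x | X) = integral of p(x | theta) p(theta | X) dtheta
x_query = 2.0
print(np.sum(np.exp(log_normal(x_query, theta, sigma)) * post) * dtheta)
```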

  23. 2. Parametric Models: Maximum Likelihood Method. Bayesian Estimation. p(x | X) = ∫ p(x | θ) p(θ | X) dθ. The probability p(θ | X) makes it explicit how the parameter estimation depends on the training data. If p(θ | X) is small in most places but large for a single θ̂, then we can approximate p(x | X) ≈ p(x | θ̂), sometimes referred to as the Bayes point. The more uncertain we are about estimating θ̂, the more we average.

  24. 2. Parametric Models: Maximum Likelihood Method. Bayesian Estimation. Problem: in general, it is intractable to integrate out the parameters θ (or only possible to do so numerically). Example with a closed-form solution: a Gaussian data distribution whose variance is known and fixed; we estimate the distribution of the mean, p(µ | X) = p(X | µ) p(µ) / p(X), with prior p(µ) = N(µ_0, σ_0²).
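For this conjugate pair the posterior is itself Gaussian, p(µ | X) = N(µ_N, σ_N²) with 1/σ_N² = 1/σ_0² + N/σ² and µ_N = σ_N² (µ_0/σ_0² + N·x̄/σ²). This standard result (see, e.g., Bishop's PRML) is not spelled out on the slide, so treat the following sketch as supplementary; all numeric values are toy choices:

```python
import numpy as np

def posterior_over_mean(X, sigma, mu0, sigma0):
    """Posterior p(mu | X) = N(mu_N, sigma_N^2) for Gaussian data with known
    variance sigma^2 and Gaussian prior p(mu) = N(mu0, sigma0^2)."""
    N = len(X)
    var_N = 1.0 / (1.0 / sigma0 ** 2 + N / sigma ** 2)                # posterior variance
    mu_N = var_N * (mu0 / sigma0 ** 2 + N * np.mean(X) / sigma ** 2)  # posterior mean
    return mu_N, var_N

X = np.array([1.8, 2.3, 2.1])
print(posterior_over_mean(X, sigma=1.0, mu0=0.0, sigma0=2.0))
```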
