15-388/688 - Practical Data Science: Maximum likelihood estimation, naive Bayes


  1. 15-388/688 - Practical Data Science: Maximum likelihood estimation, naïve Bayes. J. Zico Kolter, Carnegie Mellon University, Spring 2018

  2. Outline: Maximum likelihood estimation; Naive Bayes; Machine learning and maximum likelihood

  3. Outline: Maximum likelihood estimation; Naive Bayes; Machine learning and maximum likelihood

  4. Estimating the parameters of distributions

     We're moving now from probability to statistics. The basic question: given some data $x^{(1)}, \ldots, x^{(m)}$, how do I find a distribution that captures this data "well"?

     In general (if we can pick from the space of all distributions), this is a hard question, but if we pick from a particular parameterized family of distributions $p(X; \theta)$, the question is (at least a little bit) easier. The question becomes: how do I find the parameters $\theta$ of this distribution that fit the data?

  5. Maximum likelihood estimation

     Given a distribution $p(X; \theta)$ and a collection of observed (independent) data points $x^{(1)}, \ldots, x^{(m)}$, the probability of observing this data is simply

     $$p(x^{(1)}, \ldots, x^{(m)}; \theta) = \prod_{i=1}^m p(x^{(i)}; \theta)$$

     Basic idea of maximum likelihood estimation (MLE): find the parameters that maximize the probability of the observed data

     $$\operatorname*{maximize}_{\theta} \; \prod_{i=1}^m p(x^{(i)}; \theta) \;\equiv\; \operatorname*{maximize}_{\theta} \; \ell(\theta) = \sum_{i=1}^m \log p(x^{(i)}; \theta)$$

     where $\ell(\theta)$ is called the log likelihood of the data. This seems "obvious", but there are many other ways of fitting parameters.
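The slides derive closed-form MLEs for specific distributions below, but the definition can also be checked numerically. A minimal sketch (not from the slides), assuming NumPy and SciPy with made-up exponential data, that maximizes the log likelihood directly and compares against the known closed-form answer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical data: 1000 draws from an exponential distribution with rate 2.0
rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.0, size=1000)

def neg_log_likelihood(lam):
    # Exponential density p(x; lam) = lam * exp(-lam * x), so the negative
    # log likelihood of the data is -(sum_i log(lam) - lam * x_i)
    return -np.sum(np.log(lam) - lam * x)

# Maximize the likelihood numerically (minimize the negative log likelihood)
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10.0), method="bounded")
print(result.x)       # numerical MLE for the rate
print(1 / x.mean())   # closed-form exponential MLE; the two should agree closely
```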

  6. Parameter estimation for Bernoulli

     Simple example: the Bernoulli distribution, $p(X = 1; \phi) = \phi$, $p(X = 0; \phi) = 1 - \phi$. Given observed data $x^{(1)}, \ldots, x^{(m)}$, the "obvious" answer is

     $$\phi = \frac{\sum_{i=1}^m x^{(i)}}{m} = \frac{\#\,\text{1's}}{\#\,\text{Total}}$$

     But why is this the case? Maybe there are other estimates that are just as good, e.g.

     $$\phi = \frac{\sum_{i=1}^m x^{(i)} + 1}{m + 2}$$

  7. MLE for Bernoulli

     The maximum likelihood solution for the Bernoulli is given by

     $$\operatorname*{maximize}_{\phi} \; \prod_{i=1}^m p(x^{(i)}; \phi) = \operatorname*{maximize}_{\phi} \; \prod_{i=1}^m \phi^{x^{(i)}} (1 - \phi)^{1 - x^{(i)}}$$

     Taking the log of the optimization objective (or the negative log, if we want to be consistent with our usual convention of optimization as minimization), this is equivalent to

     $$\operatorname*{maximize}_{\phi} \; \ell(\phi) = \sum_{i=1}^m \left( x^{(i)} \log \phi + (1 - x^{(i)}) \log(1 - \phi) \right)$$

     The derivative with respect to $\phi$ is given by

     $$\frac{\partial}{\partial \phi} \ell(\phi) = \sum_{i=1}^m \left( \frac{x^{(i)}}{\phi} - \frac{1 - x^{(i)}}{1 - \phi} \right) = \frac{\sum_{i=1}^m x^{(i)}}{\phi} - \frac{\sum_{i=1}^m (1 - x^{(i)})}{1 - \phi}$$

  8. MLE for Bernoulli, continued

     Setting the derivative to zero gives

     $$\frac{\sum_{i=1}^m x^{(i)}}{\phi} - \frac{\sum_{i=1}^m (1 - x^{(i)})}{1 - \phi} \equiv \frac{b}{\phi} - \frac{c}{1 - \phi} = 0$$

     $$\implies (1 - \phi)\, b = \phi\, c \implies \phi = \frac{b}{b + c} = \frac{\sum_{i=1}^m x^{(i)}}{m}$$

     So we have shown that the "natural" estimate of $\phi$ actually corresponds to the maximum likelihood estimate.
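As a quick sanity check (not part of the slides), a grid search over $\phi$ on a small made-up binary dataset shows the log likelihood peaking at the empirical mean:

```python
import numpy as np

# Hypothetical binary observations: 7 ones out of 10
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

def log_likelihood(phi, x):
    # Bernoulli log likelihood: sum_i x_i*log(phi) + (1 - x_i)*log(1 - phi)
    return np.sum(x * np.log(phi) + (1 - x) * np.log(1 - phi))

phis = np.linspace(0.01, 0.99, 99)
lls = [log_likelihood(p, x) for p in phis]
print(phis[np.argmax(lls)])   # grid point closest to the maximizer, ~0.7
print(x.mean())               # closed-form MLE sum(x)/m = 0.7
```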

  9. Poll: Bernoulli maximum likelihood

     Suppose we observe binary data $x^{(1)}, \ldots, x^{(m)}$ with $x^{(i)} \in \{0, 1\}$, with some $x^{(i)} = 0$ and some $x^{(j)} = 1$, and we compute the Bernoulli MLE

     $$\phi = \frac{\sum_{i=1}^m x^{(i)}}{m}$$

     Which of the following statements is necessarily true? (may be more than one)

     1. For any $\phi' \neq \phi$, $p(x^{(i)}; \phi') \leq p(x^{(i)}; \phi)$ for all $i = 1, \ldots, m$
     2. For any $\phi' \neq \phi$, $\prod_{i=1}^m p(x^{(i)}; \phi') \leq \prod_{i=1}^m p(x^{(i)}; \phi)$
     3. We always have $p(x^{(i)}; \phi') \geq p(x^{(i)}; \phi)$ for at least one $i$

  10. MLE for Gaussian, briefly

     For the Gaussian distribution

     $$p(x; \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

     the log likelihood is given by

     $$\ell(\mu, \sigma^2) = -\frac{m}{2} \log(2\pi\sigma^2) - \frac{1}{2} \sum_{i=1}^m \frac{(x^{(i)} - \mu)^2}{\sigma^2}$$

     Derivatives (see if you can derive these fully):

     $$\frac{\partial}{\partial \mu} \ell(\mu, \sigma^2) = \frac{1}{\sigma^2} \sum_{i=1}^m (x^{(i)} - \mu) = 0 \;\implies\; \mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}$$

     $$\frac{\partial}{\partial \sigma^2} \ell(\mu, \sigma^2) = -\frac{m}{2\sigma^2} + \frac{1}{2} \sum_{i=1}^m \frac{(x^{(i)} - \mu)^2}{(\sigma^2)^2} = 0 \;\implies\; \sigma^2 = \frac{1}{m} \sum_{i=1}^m (x^{(i)} - \mu)^2$$
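A brief illustrative sketch (assuming NumPy, with synthetic data) of the closed-form Gaussian estimates above:

```python
import numpy as np

# Hypothetical data drawn from a Gaussian with mean 3.0 and variance 4.0
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=10_000)

# Closed-form maximum likelihood estimates from the derivation above
mu_mle = x.mean()                        # (1/m) * sum_i x^(i)
sigma2_mle = np.mean((x - mu_mle) ** 2)  # (1/m) * sum_i (x^(i) - mu)^2; note: divides by m, not m - 1
print(mu_mle, sigma2_mle)                # should be close to 3.0 and 4.0
```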

  11. Outline: Maximum likelihood estimation; Naive Bayes; Machine learning and maximum likelihood

  12. Naive Bayes modeling

     Naive Bayes is a machine learning algorithm that relies heavily on probabilistic modeling. But it is also interpretable in terms of the three ingredients of a machine learning algorithm (hypothesis function, loss, optimization); more on this later.

     The basic idea is that we model the input and output as random variables $X = (X_1, X_2, \ldots, X_n)$ (several Bernoulli, categorical, or Gaussian random variables) and $Y$ (one Bernoulli or categorical random variable); the goal is to find $p(Y \mid X)$.

  13. Naive Bayes assumptions

     We're going to find $p(Y \mid X)$ via Bayes' rule

     $$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)} = \frac{p(X \mid Y)\, p(Y)}{\sum_{y} p(X \mid y)\, p(y)}$$

     The denominator is just the sum over all values of $Y$ of the distribution specified by the numerator, so we're just going to focus on the $p(X \mid Y)\, p(Y)$ term.

     Modeling the full distribution $p(X \mid Y)$ for high-dimensional $X$ is not practical, so we're going to make the naive Bayes assumption: the elements $X_i$ are conditionally independent given $Y$,

     $$p(X \mid Y) = \prod_{i=1}^n p(X_i \mid Y)$$

  14. Modeling individual distributions

     We're going to explicitly model each distribution $p(X_i \mid Y)$ as well as $p(Y)$. We do this by specifying a distribution for $p(Y)$ and a separate distribution for each $p(X_i \mid Y = y)$.

     Assuming, for instance, that $Y$ and the $X_i$ are binary (Bernoulli random variables), we would represent the distributions

     $$p(Y; \phi_0), \qquad p(X_i \mid Y = 0; \phi_i^0), \qquad p(X_i \mid Y = 1; \phi_i^1)$$

     We then estimate the parameters of these distributions using MLE, i.e.

     $$\phi_0 = \frac{\sum_{j=1}^m y^{(j)}}{m}, \qquad \phi_i^y = \frac{\sum_{j=1}^m x_i^{(j)} \cdot 1\{y^{(j)} = y\}}{\sum_{j=1}^m 1\{y^{(j)} = y\}}$$
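A minimal counting sketch of these estimates (not from the slides; the small X and y arrays are made up for illustration):

```python
import numpy as np

# Hypothetical binary training data: X is (m, n) features, y is (m,) labels in {0, 1}
X = np.array([[1, 0], [1, 1], [0, 0], [1, 0], [0, 1], [1, 1]])
y = np.array([1, 1, 0, 1, 0, 0])

phi_0 = y.mean()  # MLE for p(Y = 1): fraction of positive labels

# phi[i, c] is the MLE for p(X_i = 1 | Y = c): fraction of class-c examples with X_i = 1
phi = np.stack([X[y == c].mean(axis=0) for c in (0, 1)], axis=1)  # shape (n, 2)
print(phi_0)
print(phi)
```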

  15. Making predictions

     Given some new data point $x$, we can now compute the probability of each class

     $$p(Y = y \mid x) \propto p(Y = y) \prod_{i=1}^n p(x_i \mid Y = y) = p(Y = y; \phi_0) \prod_{i=1}^n (\phi_i^y)^{x_i} (1 - \phi_i^y)^{1 - x_i}$$

     After you have computed the right hand side, just normalize (divide by the sum over all $y$) to get the desired probability. Alternatively, if you just want to know the most likely $y$, just compute each right hand side and take the maximum.
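Continuing the hypothetical sketch above, prediction multiplies the class prior by the per-feature Bernoulli likelihoods and then normalizes:

```python
# Predict p(Y | x) for a new binary feature vector, reusing phi_0 and phi from the sketch above
x_new = np.array([1, 0])

prior = np.array([1 - phi_0, phi_0])  # p(Y = 0), p(Y = 1)
# For each feature, take phi_i^c when x_i = 1 and (1 - phi_i^c) when x_i = 0, then multiply
likelihood = np.prod(np.where(x_new[:, None] == 1, phi, 1 - phi), axis=0)
unnormalized = prior * likelihood
posterior = unnormalized / unnormalized.sum()  # normalize over classes
print(posterior)           # [p(Y = 0 | x), p(Y = 1 | x)]
print(posterior.argmax())  # most likely class
```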

  16. Example

     [Slide shows a small table of binary training examples with columns $Y$, $X_1$, $X_2$.] Using MLE on that table, estimate

     $$\phi_0 = p(Y = 1), \quad \phi_1^0 = p(X_1 = 1 \mid Y = 0), \quad \phi_1^1 = p(X_1 = 1 \mid Y = 1), \quad \phi_2^0 = p(X_2 = 1 \mid Y = 0), \quad \phi_2^1 = p(X_2 = 1 \mid Y = 1)$$

     and then compute $p(Y \mid X_1 = 1, X_2 = 0)$.

  17. Potential issues

     Problem #1: when computing the probability, the product $p(y) \prod_{i=1}^n p(x_i \mid y)$ quickly goes to zero to numerical precision.
     Solution: compute the log of the probabilities instead,

     $$\log p(y) + \sum_{i=1}^n \log p(x_i \mid y)$$

     Problem #2: if we have never seen either $X_i = 1$ or $X_i = 0$ for a given class $y$, then the corresponding probabilities computed by MLE will be zero.
     Solution: Laplace smoothing, "hallucinate" one $X_i = 0$ and one $X_i = 1$ example for each class,

     $$\phi_i^y = \frac{\sum_{j=1}^m x_i^{(j)} \cdot 1\{y^{(j)} = y\} + 1}{\sum_{j=1}^m 1\{y^{(j)} = y\} + 2}$$
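A sketch of both fixes together, again with made-up data: Laplace-smoothed counts, and sums of log probabilities instead of products of probabilities:

```python
import numpy as np

# Hypothetical binary training data, as in the earlier sketch
X = np.array([[1, 0], [1, 1], [0, 0], [1, 0], [0, 1], [1, 1]])
y = np.array([1, 1, 0, 1, 0, 0])

phi_0 = y.mean()
# Laplace smoothing: add one "hallucinated" 1 and one 0 per feature and class (+1 / +2)
phi = np.stack(
    [(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2) for c in (0, 1)], axis=1
)

# Work in log space so the product over many features does not underflow to zero
x_new = np.array([1, 0])
log_prior = np.log([1 - phi_0, phi_0])
log_likelihood = np.sum(np.where(x_new[:, None] == 1, np.log(phi), np.log(1 - phi)), axis=0)
log_posterior = log_prior + log_likelihood
posterior = np.exp(log_posterior - log_posterior.max())
posterior /= posterior.sum()  # exponentiate and normalize only at the very end
print(posterior)
```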

  18. Other distributions

     Though naive Bayes is often presented as "just" counting, the value of the maximum likelihood interpretation is that it is clear how to model $p(X_i \mid Y)$ for non-categorical random variables.

     Example: if $x_i$ is real-valued, we can model $p(X_i \mid Y = y)$ as a Gaussian

     $$p(x_i \mid y; \mu_y, \sigma_y^2) = \mathcal{N}(x_i; \mu_y, \sigma_y^2)$$

     with maximum likelihood estimates

     $$\mu_y = \frac{\sum_{j=1}^m x_i^{(j)} \cdot 1\{y^{(j)} = y\}}{\sum_{j=1}^m 1\{y^{(j)} = y\}}, \qquad \sigma_y^2 = \frac{\sum_{j=1}^m (x_i^{(j)} - \mu_y)^2 \cdot 1\{y^{(j)} = y\}}{\sum_{j=1}^m 1\{y^{(j)} = y\}}$$

     All probability computations are exactly the same as before (it doesn't matter that some of the terms are probability densities).
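A short sketch of the Gaussian variant (hypothetical real-valued data; note that np.var's default divisor of m matches the MLE above):

```python
import numpy as np

# Hypothetical real-valued features with binary labels
X = np.array([[1.2, 0.3], [0.9, 0.5], [2.1, 1.8], [2.4, 2.0], [0.7, 0.1], [2.2, 1.5]])
y = np.array([0, 0, 1, 1, 0, 1])

# Per-class, per-feature Gaussian MLEs: mean and variance over the examples in each class
mu = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])  # shape (2, n)
var = np.stack([X[y == c].var(axis=0) for c in (0, 1)])  # divides by the class count, as in the MLE

# Unnormalized log p(Y = c | x): class log prior plus the sum of Gaussian log densities
x_new = np.array([1.0, 0.4])
log_prior = np.log(np.array([(y == 0).mean(), (y == 1).mean()]))
log_density = -0.5 * np.sum(np.log(2 * np.pi * var) + (x_new - mu) ** 2 / var, axis=1)
print(log_prior + log_density)  # the larger entry is the predicted class
```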

  19. Outline: Maximum likelihood estimation; Naive Bayes; Machine learning and maximum likelihood

  20. Machine learning via maximum likelihood

     Many machine learning algorithms (specifically the loss function component) can be interpreted probabilistically, as maximum likelihood estimation. Recall logistic regression:

     $$\operatorname*{minimize}_{\theta} \; \sum_{i=1}^m \ell_{\mathrm{logistic}}(h_\theta(x^{(i)}), y^{(i)})$$

     $$\ell_{\mathrm{logistic}}(h_\theta(x), y) = \log(1 + \exp(-y \cdot h_\theta(x)))$$

  21. Logistic probability model

     Consider the model (where $Y$ is binary, taking on $\{-1, +1\}$ values)

     $$p(y \mid x; \theta) = \operatorname{logistic}(y \cdot h_\theta(x)) = \frac{1}{1 + \exp(-y \cdot h_\theta(x))}$$

     Under this model, the maximum likelihood estimate is

     $$\operatorname*{maximize}_{\theta} \; \sum_{i=1}^m \log p(y^{(i)} \mid x^{(i)}; \theta) \;\equiv\; \operatorname*{minimize}_{\theta} \; \sum_{i=1}^m \ell_{\mathrm{logistic}}(h_\theta(x^{(i)}), y^{(i)})$$
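A tiny numerical check (not from the slides) that the logistic loss is exactly the negative log probability under this model, for a few hypothetical margin/label pairs:

```python
import numpy as np

def logistic_loss(h, y):
    # log(1 + exp(-y * h)), the logistic loss on margin h = h_theta(x) and label y in {-1, +1}
    return np.log(1 + np.exp(-y * h))

def neg_log_prob(h, y):
    # -log p(y | x; theta) under the logistic probability model
    return -np.log(1 / (1 + np.exp(-y * h)))

for h, y in [(2.0, 1), (-0.5, 1), (1.3, -1)]:
    print(logistic_loss(h, y), neg_log_prob(h, y))  # identical values for each pair
```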
