Estimating the parameters of some probability distributions: Exemplifications
1. Estimating the parameter of the Bernoulli distribution: the MLE and MAP approaches
(CMU, 2015 spring, Tom Mitchell, Nina Balcan, HW2, pr. 2)
Suppose we observe the values of n i.i.d. (independent, identically distributed) random variables X1, . . . , Xn drawn from a single Bernoulli distribution with parameter θ. In other words, for each Xi, we know that P(Xi = 1) = θ and P(Xi = 0) = 1 − θ. Our goal is to estimate the value of θ from the observed values of X1, . . . , Xn.
For any hypothetical value $\hat\theta$, we can compute the probability of observing the data X1, . . . , Xn if the true parameter value θ were equal to $\hat\theta$. This probability of the observed data is often called the data likelihood, and the function $L(\hat\theta)$ that maps each $\hat\theta$ to the corresponding likelihood is called the likelihood function. A natural way to estimate the unknown parameter θ is to choose the $\hat\theta$ that maximizes the likelihood function. Formally,
$$\hat\theta_{MLE} = \operatorname*{argmax}_{\hat\theta} L(\hat\theta).$$
a. Write a formula for the likelihood function $L(\hat\theta)$. Your function should depend on the random variables X1, . . . , Xn and the hypothetical parameter $\hat\theta$. Does the likelihood function depend on the order of the random variables?

Solution: Since the Xi are independent, we have
$$L(\hat\theta) = P_{\hat\theta}(X_1, \ldots, X_n) = \prod_{i=1}^{n} P_{\hat\theta}(X_i) = \prod_{i=1}^{n} \hat\theta^{X_i}(1-\hat\theta)^{1-X_i} = \hat\theta^{\#\{X_i=1\}}(1-\hat\theta)^{\#\{X_i=0\}},$$
where #{·} counts the number of Xi for which the condition in braces holds true. Note that in the third equality we used the trick $X_i = I_{\{X_i=1\}}$. The likelihood function does not depend on the order of the data.
b. Suppose that n = 10 and the data set contains six 1s and four 0s. Write a short computer program that plots the likelihood function of this data. For the plot, the x-axis should be $\hat\theta$, and the y-axis $L(\hat\theta)$. Scale your y-axis so that you can see some variation in its value. Estimate $\hat\theta_{MLE}$ by marking on the x-axis the value of $\hat\theta$ that maximizes the likelihood.

Solution:

[Plot: likelihood function and MLE; n = 10, six 1s, four 0s.]
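A minimal Python sketch of such a program, using numpy and matplotlib (the assignment does not prescribe a language; all names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Data from part b: n = 10 with six 1s and four 0s.
n1, n0 = 6, 4

# Evaluate the likelihood L(theta) = theta^n1 * (1 - theta)^n0 on a grid.
theta = np.linspace(0.0, 1.0, 1001)
likelihood = theta**n1 * (1.0 - theta)**n0

# The grid point with the largest likelihood approximates theta_MLE.
theta_mle = theta[np.argmax(likelihood)]
print(f"theta_MLE ~ {theta_mle:.3f}")        # expected: 0.6 = n1 / (n1 + n0)

plt.plot(theta, likelihood)
plt.axvline(theta_mle, linestyle="--")
plt.xlabel(r"$\hat\theta$")
plt.ylabel(r"$L(\hat\theta)$")
plt.title("MLE; n = 10, six 1s, four 0s")
plt.show()
```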
c. Derive a closed form formula for $\hat\theta_{MLE}$, the MLE estimate of θ. Does the closed form agree with the plot?

Solution: Let's consider $l(\hat\theta) = \ln(L(\hat\theta))$. Since the ln function is increasing, the $\hat\theta$ that maximizes the log-likelihood is the same as the $\hat\theta$ that maximizes the likelihood. We can rewrite $l(\hat\theta)$ as follows:
$$l(\hat\theta) = \ln\big(\hat\theta^{n_1}(1-\hat\theta)^{n_0}\big) = n_1 \ln\hat\theta + n_0 \ln(1-\hat\theta),$$
where $n_1 \stackrel{\text{not.}}{=} \#\{X_i = 1\}$ and $n_0 \stackrel{\text{not.}}{=} \#\{X_i = 0\}$. Assuming that $\hat\theta \neq 0$ and $\hat\theta \neq 1$, the first and second derivatives of l are given by
$$l'(\hat\theta) = \frac{n_1}{\hat\theta} - \frac{n_0}{1-\hat\theta} \quad\text{and}\quad l''(\hat\theta) = -\frac{n_1}{\hat\theta^2} - \frac{n_0}{(1-\hat\theta)^2}.$$
Since $l''(\hat\theta)$ is always negative, the function l is concave, and we can find its maximizer by solving the equation $l'(\hat\theta) = 0$. The solution to this equation is given by
$$\hat\theta_{MLE} = \frac{n_1}{n_1 + n_0}.$$
For the data in part b this gives $\hat\theta_{MLE} = 6/10 = 0.6$, which agrees with the plot.
d. Make likelihood plots for three more data sets: one where n = 5 and the data set contains three 1s and two 0s; one where n = 100 and the data set contains sixty 1s and forty 0s; and one where n = 10 and there are five 1s and five 0s.

Solution:

[Plot: likelihood function and MLE; n = 5, three 1s, two 0s.]
[Plot: likelihood function and MLE; n = 100, sixty 1s, forty 0s.]
Solution (to part d., continued):

[Plot: likelihood function and MLE; n = 10, five 1s, five 0s.]
e. Describe how the likelihood functions and maximum likelihood estimates compare for the different data sets.

Solution (to part e.): The MLE is equal to the proportion of 1s observed in the data, so for the first three plots the MLE is always at 0.6, while for the last plot it is at 0.5. As the number of samples n increases, the likelihood function gets more peaked at its maximum value, and the values it takes on decrease.
Reminder: Maximum a Posteriori Probability Estimation
In the maximum likelihood estimate, we treated the true parameter value θ as a fixed (non-random) number. In cases where we have some prior knowledge about θ, it is useful to treat θ itself as a random variable, and express our prior knowledge in the form of a prior probability distribution over θ. For example, suppose that X1, . . . , Xn are generated in the following way:
− First, the value of θ is drawn from a given prior probability distribution.
− Second, X1, . . . , Xn are drawn independently from a Bernoulli distribution using this value for θ.
Since both θ and the sequence X1, . . . , Xn are random, they have a joint probability distribution. Under this interpretation, a natural way to estimate θ is to choose its most probable value given its prior distribution plus the observed data X1, . . . , Xn.
Definition:
$$\hat\theta_{MAP} = \operatorname*{argmax}_{\hat\theta} P(\theta = \hat\theta \mid X_1, \ldots, X_n).$$
This is called the maximum a posteriori probability (MAP) estimate of θ.
Reminder (cont’d)
Using Bayes' rule, we can rewrite the posterior probability as follows:
$$P(\theta = \hat\theta \mid X_1, \ldots, X_n) = \frac{P(X_1, \ldots, X_n \mid \theta = \hat\theta)\, P(\theta = \hat\theta)}{P(X_1, \ldots, X_n)}.$$
Since the probability in the denominator does not depend on $\hat\theta$, the MAP estimate is given by
$$\hat\theta_{MAP} = \operatorname*{argmax}_{\hat\theta} P(X_1, \ldots, X_n \mid \theta = \hat\theta)\, P(\theta = \hat\theta) = \operatorname*{argmax}_{\hat\theta} L(\hat\theta)\, P(\theta = \hat\theta).$$
In words, the MAP estimate for θ is the value $\hat\theta$ that maximizes the likelihood function multiplied by the prior distribution on θ.
We will consider a Beta(3, 3) prior distribution for θ, which has the density function given by
$$p(\hat\theta) = \frac{\hat\theta^2 (1-\hat\theta)^2}{B(3, 3)},$$
where B(α, β) is the beta function and B(3, 3) ≈ 0.0333.

f. Suppose, as in part b, that n = 10 and we observed six 1s and four 0s. Write a short computer program that plots the function $\hat\theta \mapsto L(\hat\theta)\, p(\hat\theta)$ for the same values of $\hat\theta$ as in part b. Estimate $\hat\theta_{MAP}$ by marking on the x-axis the value of $\hat\theta$ that maximizes the function.

Solution:

[Plot: $L(\hat\theta)\, p(\hat\theta)$ and MAP estimate; n = 10, six 1s, four 0s; Beta(3, 3) prior.]
g. Derive a closed form formula for $\hat\theta_{MAP}$, the MAP estimate of θ. Does the closed form agree with the plot?

Solution: As in the case of the MLE, we will apply the ln function before finding the maximizer. We want to maximize the function
$$l(\hat\theta) = \ln\big(L(\hat\theta) \cdot p(\hat\theta)\big) = \ln\big(\hat\theta^{n_1+2} (1-\hat\theta)^{n_0+2}\big) - \ln B(3, 3).$$
The normalizing constant for the prior appears as an additive constant and therefore the first and second derivatives are identical to those in the case of the MLE (except with $n_1 + 2$ and $n_0 + 2$ instead of $n_1$ and $n_0$, respectively). It follows that the closed form formula for the MAP estimate is given by
$$\hat\theta_{MAP} = \frac{n_1 + 2}{n_1 + n_0 + 4}.$$
This formula agrees with the plot obtained in part f.
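As a quick numerical sanity check of this closed form (a sketch in Python; the grid search is mine, not part of the original solution):

```python
import numpy as np

n1, n0 = 6, 4                                  # data from part b
theta = np.linspace(1e-6, 1 - 1e-6, 100_001)

# Posterior-proportional objective: likelihood times the Beta(3,3) prior density.
# The constant 1/B(3,3) does not affect the argmax, so it is omitted here.
objective = theta**n1 * (1 - theta)**n0 * theta**2 * (1 - theta)**2

theta_map_numeric = theta[np.argmax(objective)]
theta_map_closed = (n1 + 2) / (n1 + n0 + 4)
print(theta_map_numeric, theta_map_closed)     # both ~ 0.5714 = 8/14
```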
h. Compare the MAP estimate with the MLE computed from the same data in part b. Briefly explain any significant difference.

Solution: The MAP estimate is equal to the MLE with four additional virtual random variables, two that are equal to 1, and two that are equal to 0. This pulls the value of the MAP estimate closer to the value 0.5, which is why $\hat\theta_{MAP}$ is smaller than $\hat\theta_{MLE}$.

i. Describe what happens to the MAP and MLE estimates as n goes to infinity, while the ratio #{Xi = 1}/#{Xi = 0} remains constant.

Solution: It is obvious that as n goes to infinity, the influence of the 4 virtual random variables diminishes, and the two estimators become equal.
Suppose X is a binary random variable that takes value 0 with probability p and value 1 with probability 1 − p. Let X1, . . . , Xn be i.i.d. samples of X.
a. Derive $\hat p$, the maximum likelihood estimate of p.

Answer: By way of definition,
$$\hat p = \operatorname*{argmax}_{p} P(X_1, \ldots, X_n \mid p) \stackrel{\text{i.i.d.}}{=} \operatorname*{argmax}_{p} \prod_{i=1}^{n} P(X_i \mid p) = \operatorname*{argmax}_{p}\; p^{k} (1-p)^{n-k},$$
where k is the number of 0's in x1, . . . , xn. Furthermore, since ln is a monotonic (strictly increasing) function,
$$\hat p = \operatorname*{argmax}_{p} \ln\big(p^{k} (1-p)^{n-k}\big) = \operatorname*{argmax}_{p} \big(k \ln p + (n-k) \ln(1-p)\big).$$
Computing the first derivative of $k \ln p + (n-k)\ln(1-p)$ w.r.t. p leads to:
$$\frac{\partial}{\partial p}\big(k \ln p + (n-k)\ln(1-p)\big) = \frac{k}{p} - \frac{n-k}{1-p}.$$
Hence,
$$\frac{\partial}{\partial p}\big(k \ln p + (n-k)\ln(1-p)\big) = 0 \;\Leftrightarrow\; \frac{k}{p} = \frac{n-k}{1-p} \;\Leftrightarrow\; \hat p = \frac{k}{n}.$$
b. Is $\hat p$ an unbiased estimate of p? Prove the answer.

Answer: Since k can be seen as a sum of n [independent] Bernoulli variables (indeed, $k = n - \sum_{i=1}^{n} X_i$),
$$E[\hat p] = E\!\left[\frac{k}{n}\right] = \frac{1}{n} E[k] = \frac{1}{n} E\!\left[n - \sum_{i=1}^{n} X_i\right] = \frac{1}{n}\left(n - \sum_{i=1}^{n} E[X_i]\right) = \frac{1}{n}\left(n - \sum_{i=1}^{n} (1-p)\right) = \frac{1}{n}\big(n - n(1-p)\big) = \frac{1}{n}\, np = p.$$
Therefore, $\hat p$ is an unbiased estimator for the parameter p.
c. Compute $E[(\hat p - p)^2]$, the expected square error of $\hat p$, in terms of p.

Answer:
$$E[(\hat p - p)^2] = E[\hat p^2] - 2E[\hat p]\, p + p^2 = \frac{E[k^2]}{n^2} - 2p^2 + p^2 = \frac{\mathrm{Var}(k) + E^2[k]}{n^2} - p^2 = \frac{np(1-p) + (np)^2}{n^2} - p^2 = \frac{p(1-p)}{n}.$$
We used the fact that $\mathrm{Var}(k) = E[k^2] - E^2[k]$, and also $\mathrm{Var}(k) = np(1-p)$ because, as already said, k can be seen as a sum of n independent Bernoulli variables of parameter p. Note that $E[\hat p] = p$ (cf. part b), therefore
$$E[(\hat p - p)^2] = E[(\hat p - E[\hat p])^2] = \mathrm{Var}[\hat p] = \frac{p(1-p)}{n}.$$
This implies that $\mathrm{Var}[\hat p] \to 0$ for $n \to \infty$.
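A small Monte Carlo sketch of parts b and c (the sampling setup and names are mine): it estimates E[p̂] and E[(p̂ − p)²] by simulation and compares them with p and p(1 − p)/n.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, trials = 0.3, 20, 200_000

# X takes value 0 with probability p and value 1 with probability 1 - p,
# so p_hat = k / n, where k is the number of zeros in each sample.
X = (rng.random((trials, n)) >= p).astype(int)   # 1 with probability 1 - p
k = n - X.sum(axis=1)                            # number of zeros per trial
p_hat = k / n

print(p_hat.mean(), p)                           # ~ p            (part b)
print(((p_hat - p)**2).mean(), p * (1 - p) / n)  # ~ p(1 - p) / n (part c)
```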
d. Prove that if you know that p lies in the interval [1/4, 3/4] and you are given only n = 3 samples of X, then $\hat p$ is an inadmissible estimator of p when minimizing the expected square error of estimation.

Note: An estimator δ of a parameter θ is said to be inadmissible when there exists a different estimator δ′ such that R(θ, δ′) ≤ R(θ, δ) for all θ and R(θ, δ′) < R(θ, δ) for some θ, where R(θ, δ) is a risk function; in this problem it is the expected square error of the estimator.

Answer: Consider another estimator, $\tilde p = 1/2$. Then $E[(\tilde p - p)^2] = (1/2 - p)^2$. For p = 1/2 we have $E[(\tilde p - p)^2] = 0 < E[(\hat p - p)^2] = 1/12$. We now need to show that $E[(\tilde p - p)^2] \le E[(\hat p - p)^2]$ over $p \in [1/4, 3/4]$:
$$E[(\tilde p - p)^2] - E[(\hat p - p)^2] = \left(\frac{1}{2} - p\right)^2 - \frac{1}{3}\, p(1-p) = \frac{1}{4} - \frac{4}{3}p + \frac{4}{3}p^2.$$
This is an upward-opening parabola, so showing that it lies below or equal to zero for $p \in [1/4, 3/4]$ is equivalent to showing that it is below or equal to 0 at the boundary points. In fact it is:
$$\frac{1}{4} - \frac{4}{3}p + \frac{4}{3}p^2 = 0 \quad \text{for both } p = 1/4 \text{ and } p = 3/4.$$
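The risk comparison in part d can also be checked numerically (a sketch; it evaluates the exact risk formulas rather than simulating):

```python
import numpy as np

n = 3
p = np.linspace(0.25, 0.75, 101)

risk_mle = p * (1 - p) / n          # E[(p_hat - p)^2]   for p_hat = k/n
risk_const = (0.5 - p)**2           # E[(p_tilde - p)^2] for p_tilde = 1/2

# p_tilde is never worse on [1/4, 3/4] and is strictly better at p = 1/2,
# so p_hat = k/n is inadmissible on this interval.
print(np.all(risk_const <= risk_mle + 1e-12))   # True
print(risk_mle[50] - risk_const[50])            # 1/12 at p = 1/2
```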
In this problem we will derive the MLE for the parameters of a categorical distribution where the variable of interest, X, can take on k values, namely a1, a2, . . . , ak, the probability of seeing an event of type j being θj for j = 1, . . . , k.

a. Given data describing n independent identically distributed observations of X, namely d1, . . . , dn, each of which can be one of k values, express the likelihood of the data using only k − 1 parameters for the distribution over the values of X.
Answer: The likelihood of the data is:
$$L(\theta) = P(d_1, \ldots, d_n \mid \theta) \stackrel{\text{i.i.d.}}{=} \prod_{j=1}^{n} \prod_{i=1}^{k} \theta_i^{\,I_{\{d_j = a_i\}}} \qquad (I \text{ is the indicator function})$$
$$= \prod_{i=1}^{k} \theta_i^{\,n_i} \qquad \left(n_i \stackrel{\text{not.}}{=} \sum_{j=1}^{n} I_{\{d_j = a_i\}}\right)$$
$$= \left(1 - \sum_{i=1}^{k-1} \theta_i\right)^{\!n_k} \prod_{i=1}^{k-1} \theta_i^{\,n_i} \qquad (\text{since the } \theta\text{'s sum to one}).$$
b. Derive an expression for $\hat\theta_j$, the MLE for θj, one of the k − 1 parameters, by setting the partial derivative of the likelihood in part a with respect to θj equal to zero and solving for it. Hint: You may want to start by first taking the log of the likelihood from part a before taking its derivative.

Answer:
$$\ln L(\theta) = n_k \ln\!\left(1 - \sum_{i=1}^{k-1}\theta_i\right) + \sum_{i=1}^{k-1} n_i \ln\theta_i \;\Rightarrow\; \frac{\partial \ln L(\theta)}{\partial \theta_j} = -\frac{n_k}{1 - \sum_{i=1}^{k-1}\theta_i} + \frac{n_j}{\theta_j}$$
$$\frac{\partial \ln L(\theta)}{\partial \theta_j} = 0 \;\Leftrightarrow\; -\frac{n_k}{1 - \hat\theta_j - \sum_{i \neq j,\, i \le k-1}\theta_i} + \frac{n_j}{\hat\theta_j} = 0 \;\Leftrightarrow\; \frac{n_j}{\hat\theta_j} = \frac{n_k}{1 - \hat\theta_j - \sum_{i \neq j,\, i \le k-1}\theta_i}$$
$$\Leftrightarrow\; n_j\left(1 - \sum_{i \neq j,\, i \le k-1}\theta_i\right) = (n_k + n_j)\,\hat\theta_j \;\Leftrightarrow\; \hat\theta_j = \frac{n_j}{n_j + n_k}\left(1 - \sum_{i=1,\, i \neq j}^{k-1}\theta_i\right) \quad \text{for all } j \in \{1, \ldots, k-1\}.$$
c. Show that the MLE for a parameter θj, representing the probability that X takes value aj, is nj / n. Hint: In order to remove the k-th parameter from the likelihood in part a you had to represent it with an equation, θk = f(·). At this point you may find it helpful to replace all occurrences of f(·) with θk. After replacing f(·) with θk you can substitute into the equations from part b. This should allow you to solve for the MLE of θk, which can then be used to simplify all of the other equations.
Answer: At the (unique) maximizer of the likelihood function, the equations derived in part b hold with all θi replaced by their estimates $\hat\theta_i$, so the last equation in part b can be written as:
$$\hat\theta_j = \frac{n_j}{n_j + n_k}\left(1 - \sum_{i=1,\, i \neq j}^{k-1}\hat\theta_i\right) = \frac{n_j}{n_j + n_k}\,(\hat\theta_j + \hat\theta_k) \qquad \left(\text{because } \hat\theta_k = 1 - \sum_{i=1}^{k-1}\hat\theta_i\right)$$
$$\Leftrightarrow\; \hat\theta_j - \frac{n_j}{n_j + n_k}\,\hat\theta_j = \frac{n_j}{n_j + n_k}\,\hat\theta_k \;\Leftrightarrow\; \hat\theta_j\,\frac{n_k}{n_j + n_k} = \frac{n_j}{n_j + n_k}\,\hat\theta_k \;\Leftrightarrow\; \hat\theta_j\, n_k = n_j\, \hat\theta_k \;\Leftrightarrow\; \hat\theta_j = \frac{n_j}{n_k}\,\hat\theta_k$$
for all j ∈ {1, . . . , k − 1}.
Finally,
$$\hat\theta_k = 1 - \hat\theta_1 - \ldots - \hat\theta_{k-1} = 1 - \frac{n_1}{n_k}\hat\theta_k - \ldots - \frac{n_{k-1}}{n_k}\hat\theta_k \;\Rightarrow\; n_k\hat\theta_k = n_k - (n_1 + \ldots + n_{k-1})\hat\theta_k$$
$$\Rightarrow\; \hat\theta_k\,(n_1 + \ldots + n_{k-1} + n_k) = n_k \;\Rightarrow\; \hat\theta_k = \frac{n_k}{n} \;\Rightarrow\; \hat\theta_j = \frac{n_j}{n_k}\cdot\frac{n_k}{n} = \frac{n_j}{n} \quad \text{for all } j \in \{1, \ldots, k-1\}.$$
Note: Even though here we could go from the non-hatted to the hatted form by direct substitution, in general, to solve a maximum likelihood criterion under additional constraints (like the thetas summing to one), a generic and useful method is the method of Lagrange multipliers.
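A minimal numeric sketch of the result $\hat\theta_j = n_j / n$ (the observations are illustrative):

```python
import numpy as np

# Hypothetical observations d_1, ..., d_n over k = 3 values a1, a2, a3.
data = np.array(["a1", "a2", "a1", "a3", "a1", "a2", "a1", "a3", "a2", "a1"])

values, counts = np.unique(data, return_counts=True)   # n_j for each value a_j
theta_mle = counts / counts.sum()                       # theta_hat_j = n_j / n

for a, t in zip(values, theta_mle):
    print(a, t)            # a1 0.5, a2 0.3, a3 0.2
```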
Assume we have n samples, x1, . . . , xn, independently drawn from a normal distribution with known variance σ2 and unknown mean µ.
a. Derive $\mu_{MLE}$, the maximum likelihood estimate of μ.

Solution:
$$P(x_1, \ldots, x_n \mid \mu) = \prod_{i=1}^{n} P(x_i \mid \mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$$
$$\Rightarrow\; \ln P(x_1, \ldots, x_n \mid \mu) = \sum_{i=1}^{n} \left(\ln\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i - \mu)^2}{2\sigma^2}\right) \;\Rightarrow\; \frac{\partial}{\partial\mu} \ln P(x_1, \ldots, x_n \mid \mu) = \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2}$$
$$\frac{\partial}{\partial\mu} \ln P(x_1, \ldots, x_n \mid \mu) = 0 \;\Leftrightarrow\; \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2} = 0 \;\Leftrightarrow\; \sum_{i=1}^{n} (x_i - \mu) = 0 \;\Leftrightarrow\; \sum_{i=1}^{n} x_i = n\mu \;\Rightarrow\; \mu_{MLE} = \frac{\sum_{i=1}^{n} x_i}{n}.$$
Remark: It can be easily shown that $\ln P(x_1, \ldots, x_n \mid \mu)$ indeed reaches its maximum for $\mu = \mu_{MLE}$.
b. Is $\mu_{MLE}$ an unbiased estimator of μ?

Solution: The sample x1, . . . , xn can be seen as the realization of n independent random variables X1, . . . , Xn of Gaussian distribution of mean μ and variance σ2. Then, due to the property of linearity of the expectation of random variables, we get:
$$E[\mu_{MLE}] = E\!\left[\frac{X_1 + \ldots + X_n}{n}\right] = \frac{E[X_1] + \ldots + E[X_n]}{n} = \frac{n\mu}{n} = \mu.$$
Therefore, the $\mu_{MLE}$ estimator is unbiased.
c. Compute the variance of $\mu_{MLE}$.

Solution:
$$\mathrm{Var}[\mu_{MLE}] = \mathrm{Var}\!\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] \stackrel{(\star)}{=} \frac{1}{n^2}\,\mathrm{Var}\!\left[\sum_{i=1}^{n} X_i\right] \stackrel{\text{i.i.d.}}{=} \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}[X_i] = n\,\frac{1}{n^2}\,\mathrm{Var}[X_1] = \frac{\sigma^2}{n}.$$
Therefore, $\mathrm{Var}[\mu_{MLE}] \to 0$ as $n \to \infty$.

(⋆) Remember that Var[aX] = a²Var[X].
d. Now suppose that we treat μ itself as a random variable, and that the prior distribution for the mean is itself a normal distribution with mean ν and variance β2. Derive $\mu_{MAP}$, the MAP estimate of μ.
Solution 1:
$$P(\mu \mid x_1, \ldots, x_n) = \frac{P(x_1, \ldots, x_n \mid \mu)\, P(\mu)}{P(x_1, \ldots, x_n)} \qquad (1)$$
$$= \frac{1}{C}\,\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \cdot \frac{1}{\sqrt{2\pi}\,\beta}\, e^{-\frac{(\mu - \nu)^2}{2\beta^2}} \qquad (2)$$
where $C \stackrel{\text{not.}}{=} P(x_1, \ldots, x_n)$.
$$\Rightarrow\; \ln P(\mu \mid x_1, \ldots, x_n) = -\sum_{i=1}^{n}\left(\ln(\sqrt{2\pi}\,\sigma) + \frac{(x_i - \mu)^2}{2\sigma^2}\right) - \ln(\sqrt{2\pi}\,\beta) - \frac{(\mu - \nu)^2}{2\beta^2} - \ln C$$
$$\Rightarrow\; \frac{\partial}{\partial\mu} \ln P(\mu \mid x_1, \ldots, x_n) = \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2} - \frac{\mu - \nu}{\beta^2}$$
$$\frac{\partial}{\partial\mu} \ln P(\mu \mid x_1, \ldots, x_n) = 0 \;\Leftrightarrow\; \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2} = \frac{\mu - \nu}{\beta^2} \;\Leftrightarrow\; \mu\left(\frac{1}{\beta^2} + \frac{n}{\sigma^2}\right) = \frac{\sum_{i=1}^{n} x_i}{\sigma^2} + \frac{\nu}{\beta^2}$$
$$\Rightarrow\; \mu_{MAP} = \frac{\sigma^2\nu + \beta^2\sum_{i=1}^{n} x_i}{\sigma^2 + n\beta^2}.$$
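A short numerical sketch of this closed form (the simulated data and names are mine), comparing it with a brute-force maximization of the log-posterior on a grid:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma, nu, beta, n = 2.0, 1.5, 0.0, 1.0, 25
x = rng.normal(mu_true, sigma, size=n)

# Closed form: mu_MAP = (sigma^2 * nu + beta^2 * sum(x)) / (sigma^2 + n * beta^2)
mu_map = (sigma**2 * nu + beta**2 * x.sum()) / (sigma**2 + n * beta**2)

# Brute-force check: maximize the log-posterior (up to an additive constant).
grid = np.linspace(x.mean() - 2, x.mean() + 2, 20_001)
log_post = -((x[:, None] - grid)**2).sum(axis=0) / (2 * sigma**2) \
           - (grid - nu)**2 / (2 * beta**2)

print(mu_map, grid[np.argmax(log_post)])   # the two values should agree closely
```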
Solution 2:
Instead of computing the derivative of the posterior distribution P(μ | x1, . . . , xn), we will first show that the right-hand side of (2) is itself a Gaussian, and then we will use the fact that the mean of a Gaussian is where it achieves its maximum value.
$$P(\mu \mid x_1, \ldots, x_n) = \frac{1}{C}\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \cdot \frac{1}{\sqrt{2\pi}\,\beta}\, e^{-\frac{(\mu - \nu)^2}{2\beta^2}}$$
$$= \text{const} \cdot \exp\!\left(-\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{2\sigma^2} - \frac{(\mu - \nu)^2}{2\beta^2}\right) = \text{const} \cdot \exp\!\left(-\frac{\beta^2\sum_{i=1}^{n}(x_i - \mu)^2 + \sigma^2(\mu - \nu)^2}{2\sigma^2\beta^2}\right)$$
$$= \text{const} \cdot \exp\!\left(-\frac{n\beta^2 + \sigma^2}{2\sigma^2\beta^2}\,\mu^2 + \frac{\beta^2\sum_{i=1}^{n} x_i + \nu\sigma^2}{\sigma^2\beta^2}\,\mu - \frac{\beta^2\sum_{i=1}^{n} x_i^2 + \nu^2\sigma^2}{2\sigma^2\beta^2}\right)$$
$$= \text{const} \cdot \exp\!\left(-\frac{\mu^2 - 2\mu\,\dfrac{\beta^2\sum_{i=1}^{n} x_i + \nu\sigma^2}{n\beta^2 + \sigma^2} + \dfrac{\beta^2\sum_{i=1}^{n} x_i^2 + \nu^2\sigma^2}{n\beta^2 + \sigma^2}}{\dfrac{2\sigma^2\beta^2}{n\beta^2 + \sigma^2}}\right)$$
Completing the square in μ:
$$= \text{const} \cdot \exp\!\left(-\frac{\left(\mu - \dfrac{\beta^2\sum_{i} x_i + \nu\sigma^2}{n\beta^2 + \sigma^2}\right)^{\!2} - \left(\dfrac{\beta^2\sum_{i} x_i + \nu\sigma^2}{n\beta^2 + \sigma^2}\right)^{\!2} + \dfrac{\beta^2\sum_{i} x_i^2 + \nu^2\sigma^2}{n\beta^2 + \sigma^2}}{\dfrac{2\sigma^2\beta^2}{n\beta^2 + \sigma^2}}\right)$$
$$= \text{const}' \cdot \exp\!\left(-\frac{\left(\mu - \dfrac{\beta^2\sum_{i} x_i + \nu\sigma^2}{n\beta^2 + \sigma^2}\right)^{\!2}}{\dfrac{2\sigma^2\beta^2}{n\beta^2 + \sigma^2}}\right)$$
The exp term in the last equality being a Gaussian of mean $\dfrac{\beta^2\sum_{i=1}^{n} x_i + \nu\sigma^2}{n\beta^2 + \sigma^2}$ and variance $\dfrac{\sigma^2\beta^2}{n\beta^2 + \sigma^2}$, it follows that its maximum is obtained for
$$\mu = \frac{\beta^2\sum_{i=1}^{n} x_i + \nu\sigma^2}{n\beta^2 + \sigma^2} = \mu_{MAP}.$$
e. Compare the MLE and MAP estimators for the mean μ as the number of samples n goes to infinity.

Solution:
$$\mu_{MLE} = \frac{\sum_{i=1}^{n} x_i}{n}$$
$$\mu_{MAP} = \frac{\sigma^2\nu + \beta^2\sum_{i=1}^{n} x_i}{\sigma^2 + n\beta^2} = \frac{\sigma^2\nu}{\sigma^2 + n\beta^2} + \frac{\beta^2\sum_{i=1}^{n} x_i}{\sigma^2 + n\beta^2} = \frac{\sigma^2\nu}{\sigma^2 + n\beta^2} + \frac{\frac{1}{n}\sum_{i=1}^{n} x_i}{1 + \frac{\sigma^2}{n\beta^2}} = \frac{\sigma^2\nu}{\sigma^2 + n\beta^2} + \frac{\mu_{MLE}}{1 + \frac{\sigma^2}{n\beta^2}}$$
$$n \to \infty \;\Rightarrow\; \frac{\sigma^2\nu}{\sigma^2 + n\beta^2} \to 0 \;\text{ and }\; \frac{\sigma^2}{n\beta^2} \to 0 \;\Rightarrow\; \mu_{MAP} \to \mu_{MLE}.$$
Let X be a random variable distributed according to a Normal distribution with 0 mean, and σ2 variance, i.e. X ∼ N(0, σ2).
a. Derive $\sigma^2_{MLE}$, the maximum likelihood estimate of σ2.
Solution:
Let X1, X2, . . . , Xn be drawn i.i.d. ∼ N(0, σ2). Let f be the density function corresponding to X. Then we can write the likelihood function as:
$$L(X_1, X_2, \ldots, X_n \mid \sigma^2) = \prod_{i=1}^{n} f(X_i;\, \mu = 0,\, \sigma^2) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{\!n} \prod_{i=1}^{n} \exp\!\left(-\frac{X_i^2}{2\sigma^2}\right) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{\!n} \exp\!\left(-\frac{\sum_{i=1}^{n} X_i^2}{2\sigma^2}\right)$$
$$\Rightarrow\; \ln L = \text{constant} - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n} X_i^2 \;\Rightarrow\; \frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n} X_i^2.$$
Therefore,
$$\frac{\partial \ln L}{\partial \sigma^2} = 0 \;\Leftrightarrow\; \sigma^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n} X_i^2.$$
Note: It can be easily shown that $L(X_1, X_2, \ldots, X_n \mid \sigma^2)$ indeed reaches its maximum for $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2$.
b. Is $\sigma^2_{MLE}$ an unbiased estimator of σ2?

Solution: It is unbiased, since:
$$E\!\left[\frac{1}{n}\sum_{i=1}^{n} X_i^2\right] = \frac{n}{n}E[X^2] \;(\text{since i.i.d.}) = \mathrm{Var}[X] + (E[X])^2 = \mathrm{Var}[X] = \sigma^2 \;(\text{since } E[X] = 0).$$
Let $x \stackrel{\text{not.}}{=} (x_1, \ldots, x_n)$ be observed i.i.d. samples from a Gaussian distribution N(x | μ, σ2).

a. Derive $\sigma^2_{MLE}$, the MLE for σ2.
Solution:
The p.d.f. for N(x | μ, σ2) has the form
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}.$$
The log likelihood function of the data x is:
$$\ln L(x \mid \mu, \sigma^2) = \ln \prod_{i=1}^{n} f(x_i) = \sum_{i=1}^{n}\left(-\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2}\right) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.$$
The partial derivative of ln L w.r.t. σ2:
$$\frac{\partial \ln L(x \mid \mu, \sigma^2)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2.$$
Solving the equation $\frac{\partial \ln L(x \mid \mu, \sigma^2)}{\partial \sigma^2} = 0$, we get:
$$\sigma^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu_{MLE})^2.$$
Note that we had to take into account the optimal value of μ (see problem CMU, 2011 fall, T. Mitchell, A. Singh, HW2, pr. 1).
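A minimal numeric sketch of this estimate (names mine): numpy's np.var with its default ddof=0 computes exactly $\frac{1}{n}\sum_i (x_i - \bar x)^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=1000)    # true sigma^2 = 4

mu_mle = x.mean()                                # (1/n) * sum(x_i)
sigma2_mle = ((x - mu_mle)**2).mean()            # (1/n) * sum((x_i - mu_MLE)^2)

print(sigma2_mle, np.var(x))                     # identical values, both close to 4
```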
b. Show that $E[\sigma^2_{MLE}] = \frac{n-1}{n}\,\sigma^2$.

Solution:
$$E[\sigma^2_{MLE}] = E\!\left[\frac{1}{n}\sum_{i=1}^{n}\Big(x_i - \frac{1}{n}\sum_{j=1}^{n} x_j\Big)^2\right] = E\!\left[\Big(x_1 - \frac{1}{n}\sum_{i=1}^{n} x_i\Big)^2\right] \qquad (\text{by symmetry, since the } x_i \text{ are identically distributed})$$
$$= E\!\left[x_1^2 - \frac{2}{n}\,x_1\sum_{i=1}^{n} x_i + \frac{1}{n^2}\Big(\sum_{i=1}^{n} x_i\Big)^2\right] = E\!\left[x_1^2 - \frac{2}{n}\,x_1\sum_{i=1}^{n} x_i + \frac{1}{n^2}\sum_{i=1}^{n} x_i^2 + \frac{2}{n^2}\sum_{i<j} x_i x_j\right]$$
$$= E[x_1^2] + \frac{1}{n^2}\sum_{i=1}^{n} E[x_i^2] - \frac{2}{n}\sum_{i=1}^{n} E[x_1 x_i] + \frac{2}{n^2}\sum_{i<j} E[x_i x_j]$$
$$= E[x_1^2] + \frac{1}{n^2}\,n\,E[x_1^2] - \frac{2}{n}\,E[x_1^2] - \frac{2}{n}(n-1)\,E[x_1 x_2] + \frac{2}{n^2}\cdot\frac{n(n-1)}{2}\,E[x_1 x_2]$$
$$= \frac{n-1}{n}\,E[x_1^2] - \frac{n-1}{n}\,E[x_1 x_2]$$
$$\sigma^2 = \mathrm{Var}(x_1) = E[x_1^2] - (E[x_1])^2 = E[x_1^2] - \mu^2 \;\Rightarrow\; E[x_1^2] = \sigma^2 + \mu^2.$$
Because x1 and x2 are independent, it follows that Cov(x1, x2) = 0. Therefore,
$$0 = \mathrm{Cov}(x_1, x_2) = E[(x_1 - E[x_1])(x_2 - E[x_2])] = E[(x_1 - \mu)(x_2 - \mu)] = E[x_1 x_2] - \mu\,E[x_1 + x_2] + \mu^2 = E[x_1 x_2] - \mu(2\mu) + \mu^2 = E[x_1 x_2] - \mu^2.$$
So, $E[x_1 x_2] = \mu^2$. By substituting $E[x_1^2] = \sigma^2 + \mu^2$ and $E[x_1 x_2] = \mu^2$ into the previously obtained equality $\left(E[\sigma^2_{MLE}] = \frac{n-1}{n}E[x_1^2] - \frac{n-1}{n}E[x_1 x_2]\right)$, we get:
$$E[\sigma^2_{MLE}] = \frac{n-1}{n}(\sigma^2 + \mu^2) - \frac{n-1}{n}\mu^2 = \frac{n-1}{n}\,\sigma^2.$$
c. Indicate an unbiased estimator of σ2.

Solution: It can be immediately proven that $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \mu_{MLE})^2$ is an unbiased estimator of σ2.
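A Monte Carlo sketch of the bias result (parameters are mine): averaging $\sigma^2_{MLE}$ over many samples of size n gives approximately $\frac{n-1}{n}\sigma^2$, while the ddof=1 estimator is approximately unbiased.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2, n, trials = 0.0, 4.0, 5, 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))
sigma2_mle = x.var(axis=1, ddof=0)    # (1/n)     * sum((x_i - x_bar)^2)
sigma2_unb = x.var(axis=1, ddof=1)    # (1/(n-1)) * sum((x_i - x_bar)^2)

print(sigma2_mle.mean())              # ~ (n-1)/n * sigma2 = 3.2
print(sigma2_unb.mean())              # ~ sigma2 = 4.0
```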
The density function of a d-dimensional Gaussian distribution is as follows:
$$\mathcal{N}(x \mid \mu, \Lambda^{-1}) = \frac{\exp\!\left(-\frac{1}{2}(x - \mu)^\top \Lambda (x - \mu)\right)}{\sqrt{(2\pi)^d\,|\Lambda^{-1}|}},$$
where Λ is the inverse of the covariance matrix, or the so-called precision matrix. Let {x1, x2, . . . , xn} be an i.i.d. sample from a d-dimensional Gaussian distribution. Suppose that n ≫ d. Derive the MLE estimates $\hat\mu$ and $\hat\Lambda$.
Hint: You may find useful the following formulas (taken from Matrix Identities, by Sam Roweis, 1999):
(2b) $|A^{-1}| = \frac{1}{|A|}$
(2e) $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$;[a] more generally, $\mathrm{Tr}(ABC\ldots) = \mathrm{Tr}(BC\ldots A) = \mathrm{Tr}(C\ldots AB) = \ldots$
(3b) $\frac{\partial}{\partial X}\,\mathrm{Tr}(XA) = \frac{\partial}{\partial X}\,\mathrm{Tr}(AX) = A^\top$
(4b) $\frac{\partial}{\partial X}\,\ln|X| = (X^{-1})^\top = (X^\top)^{-1}$
(5c) $\frac{\partial}{\partial X}\, a^\top X b = a b^\top$
(5g) $\frac{\partial}{\partial X}\,(Xa + b)^\top C (Xa + b) = (C + C^\top)(Xa + b)\,a^\top$
Tr(A), the trace of an n-by-n square matrix A, is defined as the sum of the elements on the main diagonal (the diagonal from the upper left to the lower right) of A, i.e., $a_{11} + \ldots + a_{nn}$.
[a] See Theorem 1.3.d from Matrix Analysis for Statistics, 2017, James R. Schott.
Given the data x1, . . . , xn, the log-likelihood function is:
$$l(\mu, \Lambda) \stackrel{\text{i.i.d.}}{=} \ln \prod_{i=1}^{n} \mathcal{N}(x_i \mid \mu, \Lambda^{-1}) = \sum_{i=1}^{n} \ln \mathcal{N}(x_i \mid \mu, \Lambda^{-1}) = -\frac{nd}{2}\ln(2\pi) - \frac{n}{2}\ln|\Lambda^{-1}| - \frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^\top \Lambda (x_i - \mu)$$
$$\stackrel{(2b)}{=} -\frac{nd}{2}\ln(2\pi) + \frac{n}{2}\ln|\Lambda| - \frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^\top \Lambda (x_i - \mu). \qquad (3)$$
For any fixed positive definite precision matrix Λ, the log-likelihood is a quadratic function of μ with a negative leading coefficient, hence a strictly concave function of μ. We then solve $\nabla_\mu l(\mu, \Lambda) = 0$:
$$\nabla_\mu l(\mu, \Lambda) = 0 \;\overset{(5g)}{\Longleftrightarrow}\; -\frac{1}{2}(\Lambda + \Lambda^\top)\sum_{i=1}^{n}(\mu - x_i) = 0 \;\Longleftrightarrow\; \Lambda \sum_{i=1}^{n}(x_i - \mu) = 0 \;\Longleftrightarrow\; \Lambda \sum_{i=1}^{n} x_i = n\Lambda\mu.$$
From the last equality we get, by the assumption that Λ is invertible, the following estimate of μ:
$$\hat\mu = \frac{\sum_{i=1}^{n} x_i}{n},$$
which coincides with the sample mean $\bar x$ and is constant w.r.t. Λ.
Now that we have $l(\mu, \Lambda) \le l(\hat\mu, \Lambda)$ for all $\mu \in \mathbb{R}^d$ and Λ positive definite, we continue to consider Λ by first plugging $\hat\mu$ back into the log-likelihood function (3):
$$l(\hat\mu, \Lambda) = -\frac{nd}{2}\ln(2\pi) + \frac{n}{2}\ln|\Lambda| - \frac{1}{2}\sum_{i=1}^{n}(x_i - \bar x)^\top \Lambda (x_i - \bar x) \stackrel{(2e)}{=} -\frac{nd}{2}\ln(2\pi) + \frac{n}{2}\big(\ln|\Lambda| - \mathrm{Tr}(S\Lambda)\big), \qquad (4)$$
where S is the sample covariance matrix:
$$S = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar x)(x_i - \bar x)^\top.$$
Explanation: $(x_i - \bar x)^\top \Lambda (x_i - \bar x)$ is a 1-by-1 matrix, therefore $(x_i - \bar x)^\top \Lambda (x_i - \bar x) = \mathrm{Tr}\big((x_i - \bar x)^\top \Lambda (x_i - \bar x)\big)$, and using the (2e) rule, it can be further written as $\mathrm{Tr}\big((x_i - \bar x)(x_i - \bar x)^\top \Lambda\big)$. Using another simple rule, $\mathrm{Tr}(A + B) = \mathrm{Tr}(A) + \mathrm{Tr}(B)$ (which can be easily proven), it follows that
$$\sum_{i=1}^{n}(x_i - \bar x)^\top \Lambda (x_i - \bar x) = \sum_{i=1}^{n}\mathrm{Tr}\big((x_i - \bar x)(x_i - \bar x)^\top \Lambda\big) = \mathrm{Tr}\!\left(\Big(\sum_{i=1}^{n}(x_i - \bar x)(x_i - \bar x)^\top\Big)\Lambda\right) = \mathrm{Tr}\big((nS)\Lambda\big) = n\,\mathrm{Tr}(S\Lambda).$$
By the fact that ln|Λ| is strictly concave on the domain of positive definite Λ,[a] and that Tr(SΛ) is linear in Λ, we are able to find the maximum of the expression (4) by solving $\nabla_\Lambda l(\hat\mu, \Lambda) = 0$, which can be proven[b] to be equivalent to $\Lambda^{-1} - S = 0$. (Therefore, $\hat\Sigma = \hat\Lambda^{-1} = S$.) Since n ≫ d, we can safely assume that S is invertible and get the following estimate: $\hat\Lambda = S^{-1}$.
[a] See, for example, Section 3.1.5, Convex Optimization: http://www.stanford.edu/~boyd/cvxbook/.
[b] Applying (4b) and (3b) on (4) you'll get $(\Lambda^\top)^{-1} - S^\top$. Then $(\Lambda^\top)^{-1} - S^\top = 0 \Leftrightarrow (\Lambda^{-1})^\top = S^\top \Leftrightarrow \Lambda^{-1} = S$.
Both $\hat\mu$ and $\hat\Lambda$ are in the parameter space and satisfy $l(\mu, \Lambda) \le l(\hat\mu, \Lambda) \le l(\hat\mu, \hat\Lambda)$ for all $\mu \in \mathbb{R}^d$ and Λ positive definite, so they are the MLE estimates.

Computing the gradient of $l(\hat\mu, \Lambda)$ in (3):
$$\nabla_\Lambda l(\hat\mu, \Lambda) \overset{(4b),(5c)}{=} \frac{n}{2}(\Lambda^\top)^{-1} - \frac{1}{2}\sum_{i=1}^{n}(x_i - \hat\mu)(x_i - \hat\mu)^\top = \frac{n}{2}\Lambda^{-1} - \frac{1}{2}\sum_{i=1}^{n}(x_i - \hat\mu)(x_i - \hat\mu)^\top = \frac{n}{2}\Lambda^{-1} - \frac{n}{2}S.$$
So,
$$\nabla_\Lambda l(\hat\mu, \Lambda) = 0 \;\Leftrightarrow\; \hat\Lambda^{-1} \stackrel{\text{not.}}{=} \hat\Sigma = S \stackrel{\text{not.}}{=} \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat\mu)(x_i - \hat\mu)^\top.$$
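A numpy sketch of the resulting estimates $\hat\mu = \bar x$ and $\hat\Lambda = S^{-1}$ (the data generation is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 3, 10_000                                 # n >> d, as assumed above
true_mu = np.array([1.0, -2.0, 0.5])
true_cov = np.array([[2.0, 0.3, 0.0],
                     [0.3, 1.0, 0.2],
                     [0.0, 0.2, 0.5]])
X = rng.multivariate_normal(true_mu, true_cov, size=n)   # rows are the x_i

mu_hat = X.mean(axis=0)                          # sample mean, MLE of mu
centered = X - mu_hat
S = centered.T @ centered / n                    # sample covariance (1/n, not 1/(n-1))
Lambda_hat = np.linalg.inv(S)                    # MLE of the precision matrix

print(mu_hat)                                    # close to true_mu
print(S)                                         # close to true_cov
```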
(Liviu Ciortuz, 2017)
The Gamma distribution of parameters r > 0 and β > 0 has the following density function:
$$\mathrm{Gamma}(x \mid r, \beta) = \frac{1}{\beta^r\,\Gamma(r)}\, x^{r-1} e^{-x/\beta}, \quad \text{for all } x > 0,$$
where the Γ symbol designates Euler's Gamma function.

[Plot: Gamma densities p(x) for (r, β) = (1.0, 2.0), (2.0, 2.0), (3.0, 2.0), (5.0, 1.0), (9.0, 0.5), (7.5, 1.0), (0.5, 1.0).]

Notes:
1. $\frac{1}{\beta^r\,\Gamma(r)}$ is the so-called normalization factor, since it does not depend on x, and $\int_{0}^{+\infty} x^{r-1} e^{-x/\beta}\, dx = \beta^r\, \Gamma(r)$.
2. Euler's Gamma function is defined as follows: $\Gamma(r) = \int_{0}^{+\infty} t^{r-1} e^{-t}\, dt$ for r > 0 (and it extends to all reals except the non-positive integers). Starting from the definition of Γ, it can be easily shown that Γ(r + 1) = rΓ(r) for any r > 0, and also Γ(1) = 1. Therefore, for any positive integer r, Γ(r + 1) = r·Γ(r) = r·(r − 1)·. . .·2·1 = r!, which means that the Γ function generalizes the factorial function.
3. The exponential distribution is a member of the Gamma family of distributions. (Just set r to 1 in Gamma's density function.)
Consider x1, . . . , xn ∈ R+, all of them having been generated by one component of the above family of distributions. Find the maximum likelihood estimates of the parameters r and β.
$$L(r, \beta) \stackrel{\text{def.}}{=} P(x_1, \ldots, x_n \mid r, \beta) \stackrel{\text{i.i.d.}}{=} \prod_{i=1}^{n} P(x_i \mid r, \beta) = \beta^{-rn}\,(\Gamma(r))^{-n} \left(\prod_{i=1}^{n} x_i\right)^{\!r-1} e^{-\frac{1}{\beta}\sum_{i=1}^{n} x_i}$$
$$\ell(r, \beta) \stackrel{\text{def.}}{=} \ln L(r, \beta) = -rn\ln\beta - n\ln\Gamma(r) + (r - 1)\sum_{i=1}^{n}\ln x_i - \frac{1}{\beta}\sum_{i=1}^{n} x_i.$$
Now we will calculate the partial derivative of ℓ(r, β) w.r.t. β, and then equate it to 0:
$$\frac{\partial}{\partial\beta}\,\ell(r, \beta) = -\frac{rn}{\beta} + \frac{1}{\beta^2}\sum_{i=1}^{n} x_i = \frac{1}{\beta^2}\left(\sum_{i=1}^{n} x_i - rn\beta\right)$$
$$\frac{\partial}{\partial\beta}\,\ell(r, \beta) = 0 \;\Leftrightarrow\; \hat\beta = \frac{1}{rn}\sum_{i=1}^{n} x_i > 0.$$
By substituting $\hat\beta$ into ℓ(r, β), we will get:
$$\ell(r, \hat\beta) = -rn\ln\hat\beta - n\ln\Gamma(r) + (r - 1)\sum_{i=1}^{n}\ln x_i - \frac{1}{\hat\beta}\sum_{i=1}^{n} x_i$$
$$= rn\ln(rn) - rn\ln\!\left(\sum_{i=1}^{n} x_i\right) - n\ln\Gamma(r) + (r - 1)\sum_{i=1}^{n}\ln x_i - \frac{rn}{\sum_{i=1}^{n} x_i}\cdot\sum_{i=1}^{n} x_i$$
$$= rn\ln(rn) - rn\ln\!\left(\sum_{i=1}^{n} x_i\right) - n\ln\Gamma(r) + (r - 1)\sum_{i=1}^{n}\ln x_i - rn.$$
Therefore, by computing the partial derivative of ℓ(r, β̂) with respect to r, and then equating this derivative to 0, we will get:
$$\frac{\partial}{\partial r}\,\ell(r, \hat\beta) = 0 \;\Leftrightarrow\; n\ln(rn) + n - n\ln\!\left(\sum_{i=1}^{n} x_i\right) - n\,\frac{\Gamma'(r)}{\Gamma(r)} + \sum_{i=1}^{n}\ln x_i - n = 0$$
$$\Leftrightarrow\; n\big(\ln r - \psi(r)\big) = -n\ln n - \sum_{i=1}^{n}\ln x_i + n\ln\!\left(\sum_{i=1}^{n} x_i\right) \;\Leftrightarrow\; \ln r - \psi(r) = -\ln n - \frac{1}{n}\sum_{i=1}^{n}\ln x_i + \ln\!\left(\sum_{i=1}^{n} x_i\right),$$
where $\psi \stackrel{\text{not.}}{=} \Gamma'/\Gamma$ is the digamma function. The solution of the last equation is $\hat r$, the maximum likelihood estimate of the parameter r; it then yields $\hat\beta = \frac{1}{\hat r\, n}\sum_{i=1}^{n} x_i$.
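The last equation has no closed-form solution, but it is easy to solve numerically. A Python sketch using scipy (ψ is scipy.special.digamma; the data generation and names are mine). Note that the right-hand side above equals ln(mean(x)) − mean(ln x).

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

rng = np.random.default_rng(5)
r_true, beta_true = 3.0, 2.0
x = rng.gamma(shape=r_true, scale=beta_true, size=5000)

# Right-hand side of:  ln r - psi(r) = ln(mean(x)) - mean(ln(x))
c = np.log(x.mean()) - np.log(x).mean()

# ln r - psi(r) is strictly decreasing in r > 0, so a bracketing root-finder works.
r_hat = brentq(lambda r: np.log(r) - digamma(r) - c, 1e-6, 1e6)
beta_hat = x.mean() / r_hat            # beta_hat = (1/(r_hat * n)) * sum(x_i)

print(r_hat, beta_hat)                 # close to r_true = 3.0 and beta_true = 2.0
```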