I 02 - Likelihood STAT 587 (Engineering) Iowa State University - - PowerPoint PPT Presentation
September 10, 2020
Modeling
Statistical modeling
A statistical model is a pair (S, P) where S is the set of possible observations, i.e. the sample space, and P is a set of probability distributions on S. Typically, we assume a parametric model p(y|θ) where y is our data and θ is an unknown parameter vector. The allowable values for θ determine P, and the support of p(y|θ) is the set S.
Modeling Binomial
Binomial model
Suppose we will collect data where we have the number of successes y out of some number of attempts n, where each attempt is independent with a common probability of success θ. Then a reasonable statistical model is Y ∼ Bin(n, θ). Formally, S = {0, 1, 2, . . . , n} and P = {Bin(n, θ) : 0 < θ < 1}.
Modeling Normal
Normal model
Suppose we have one datum, a real number, which has mean µ and variance σ², and whose uncertainty is represented by a bell-shaped curve. Then a reasonable statistical model is Y ∼ N(µ, σ²). Formally, S = {y : y ∈ ℝ} and P = {N(µ, σ²) : −∞ < µ < ∞, 0 < σ² < ∞} where θ = (µ, σ²).
Modeling Normal
Normal model
Suppose our data are n real numbers, each has mean µ and variance σ², a histogram of the data is reasonably approximated by a bell-shaped curve, and each observation is independent of the others. Then a reasonable statistical model is Yi ∼ⁱⁿᵈ N(µ, σ²). Formally, S = {(y1, . . . , yn) : yi ∈ ℝ, i ∈ {1, 2, . . . , n}} and P = {Nn(µ, σ²I) : −∞ < µ < ∞, 0 < σ² < ∞} where θ = (µ, σ²).
Likelihood
Likelihood
The likelihood function, or simply likelihood, is the joint probability mass/density function for fixed data when viewed as a function of the parameter (vector) θ. Generically, let p(y|θ) be the joint probability mass/density function of the data and thus the likelihood is L(θ) = p(y|θ) but where y is fixed and known, i.e. it is your data. The log-likelihood is the (natural) logarithm of the likelihood, i.e. ℓ(θ) = log L(θ). Intuition: The likelihood describes the relative support in the data for different values for your parameter, i.e. the larger the likelihood is the more consistent that parameter value is with the data.
Likelihood Binomial
Binomial likelihood
Suppose Y ∼ Bin(n, θ), then

p(y|θ) = (n choose y) θ^y (1 − θ)^(n−y)

where θ is considered fixed (but often unknown) and the argument to this function is y. Thus the likelihood is

L(θ) = (n choose y) θ^y (1 − θ)^(n−y)

where y is considered fixed and known and the argument to this function is θ. Note: I write L(θ) without any conditioning, e.g. on y, so that you don’t confuse this with a probability mass (or density) function.
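To make this change of perspective concrete, here is a minimal sketch (in Python for illustration; the course code uses R) that evaluates the binomial likelihood for fixed data y = 3, n = 10 at a few values of θ:

```python
from math import comb

def binomial_likelihood(theta, y=3, n=10):
    """L(theta) = (n choose y) * theta^y * (1 - theta)^(n - y), for fixed y and n."""
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

# Larger likelihood values indicate parameter values more consistent with the data:
for theta in [0.1, 0.3, 0.8]:
    print(theta, binomial_likelihood(theta))
```

θ = 0.3 (the observed proportion) gives a larger likelihood than θ = 0.1 or θ = 0.8, matching the intuition that the likelihood measures relative support in the data.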
Likelihood Binomial
Binomial likelihood
[Figure: Binomial likelihoods (n = 10) as functions of θ, for data y = 3 and y = 6; x-axis θ, y-axis L(θ).]
Likelihood Independent observations
Likelihood for independent observations
Suppose Yi are independent with marginal probability mass/density function p(yi|θ). The joint distribution for y = (y1, . . . , yn) is

p(y|θ) = ∏_{i=1}^n p(yi|θ).

The likelihood for θ is

L(θ) = p(y|θ) = ∏_{i=1}^n p(yi|θ)

where we are thinking about this as a function of θ for fixed y.
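As a concrete sketch (Python for illustration; both the data values and the Bernoulli choice here are hypothetical), the likelihood for independent coin flips is the product of the marginal probability mass functions:

```python
def bernoulli_pmf(y, theta):
    """Marginal pmf p(y | theta) for a single Bernoulli observation y in {0, 1}."""
    return theta if y == 1 else 1 - theta

def likelihood(theta, data):
    """L(theta) = product over i of p(y_i | theta), with the data held fixed."""
    L = 1.0
    for y in data:
        L *= bernoulli_pmf(y, theta)
    return L

data = [1, 0, 1, 1, 0]   # hypothetical observations
# The product of marginals here is theta^3 * (1 - theta)^2:
print(likelihood(0.6, data))
```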
Likelihood Normal
Normal model
Suppose Yi ∼ⁱⁿᵈ N(µ, σ²), then

p(yi|µ, σ²) = (1/√(2πσ²)) exp(−(yi − µ)²/(2σ²))

and

p(y|µ, σ²) = ∏_{i=1}^n p(yi|µ, σ²)
= ∏_{i=1}^n (1/√(2πσ²)) exp(−(yi − µ)²/(2σ²))
= (2πσ²)^(−n/2) exp(−(1/(2σ²)) ∑_{i=1}^n (yi − µ)²)

where µ and σ² are fixed (but often unknown) and the argument to this function is y = (y1, . . . , yn).
Likelihood Normal
Normal likelihood
If Yi ∼ⁱⁿᵈ N(µ, σ²), then

p(y|µ, σ²) = (2πσ²)^(−n/2) exp(−(1/(2σ²)) ∑_{i=1}^n (yi − µ)²).

The likelihood is

L(µ, σ²) = p(y|µ, σ²) = (2πσ²)^(−n/2) exp(−(1/(2σ²)) ∑_{i=1}^n (yi − µ)²)

where y is fixed and known and µ and σ² are the arguments to this function.
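The simplified form above can be checked numerically against the product of marginal densities; a Python sketch (the data and parameter values here are made up):

```python
import math

data = [1.2, -0.3, 0.8]   # hypothetical fixed data
mu, sigma2 = 0.5, 1.0     # candidate parameter values
n = len(data)

# Product of marginal N(mu, sigma2) densities
product_form = math.prod(
    math.exp(-(y - mu)**2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
    for y in data
)

# Simplified form: (2 pi sigma2)^(-n/2) * exp(-sum_i (y_i - mu)^2 / (2 sigma2))
ss = sum((y - mu)**2 for y in data)
closed_form = (2 * math.pi * sigma2) ** (-n / 2) * math.exp(-ss / (2 * sigma2))

print(product_form, closed_form)   # the two forms agree
```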
Likelihood Normal
Normal likelihood - example contour plot
[Figure: Contour plot of an example normal likelihood; axes µ and σ.]
Maximum likelihood estimator
Maximum likelihood estimator (MLE)
Definition: The maximum likelihood estimator (MLE), θ̂_MLE, is the parameter value θ that maximizes the likelihood function, i.e.

θ̂_MLE = argmax_θ L(θ).

When the data are discrete, the MLE maximizes the probability of the observed data.
Binomial MLE Derivation
Binomial MLE - derivation
If Y ∼ Bin(n, θ), then L(θ) = (n choose y) θ^y (1 − θ)^(n−y). To find the MLE:

1. Take the derivative of ℓ(θ) with respect to θ.
2. Set it equal to zero and solve for θ.

ℓ(θ) = log (n choose y) + y log(θ) + (n − y) log(1 − θ)

d/dθ ℓ(θ) = y/θ − (n − y)/(1 − θ), which set equal to 0 gives θ̂_MLE = y/n.

Take the second derivative of ℓ(θ) with respect to θ and check to make sure it is negative: here d²/dθ² ℓ(θ) = −y/θ² − (n − y)/(1 − θ)², which is negative for 0 < θ < 1, so θ̂_MLE = y/n is a maximum.
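The closed-form answer θ̂_MLE = y/n can be sanity-checked with a grid search over the log-likelihood (Python sketch; the grid resolution is an arbitrary choice):

```python
import math

y, n = 3, 10

def log_likelihood(theta):
    """ell(theta) = log C(n, y) + y log(theta) + (n - y) log(1 - theta)."""
    return (math.log(math.comb(n, y))
            + y * math.log(theta)
            + (n - y) * math.log(1 - theta))

grid = [i / 1000 for i in range(1, 1000)]   # theta values in (0, 1)
theta_hat = max(grid, key=log_likelihood)
print(theta_hat)   # the grid maximizer equals y/n = 0.3
```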
Binomial MLE Graph
Binomial MLE - graphically
[Figure: Binomial likelihood as a function of θ, with its maximum at the MLE; x-axis theta, y-axis likelihood.]
Binomial MLE Numerical maximization
Binomial MLE - Numerical maximization
log_likelihood <- function(theta) {
  dbinom(3, size = 10, prob = theta, log = TRUE)
}
o <- optim(0.5, log_likelihood,
           method = 'L-BFGS-B',            # this method allows bounds
           lower = 0.001, upper = 0.999,   # cannot use 0 and 1 exactly
           control = list(fnscale = -1))   # maximize

o$convergence  # 0 means convergence was achieved
[1] 0
o$par          # MLE
[1] 0.3000006
o$value        # value of the log-likelihood at the MLE
[1] -1.321151
Normal MLE Derivation
Normal MLE - derivation
If Yi ∼ⁱⁿᵈ N(µ, σ²), then

L(µ, σ²) = (2πσ²)^(−n/2) exp(−(1/(2σ²)) ∑_{i=1}^n (yi − µ)²)
= (2πσ²)^(−n/2) exp(−(1/(2σ²)) ∑_{i=1}^n (yi − ȳ + ȳ − µ)²)
= (2πσ²)^(−n/2) exp(−(1/(2σ²)) ∑_{i=1}^n [(yi − ȳ)² + 2(yi − ȳ)(ȳ − µ) + (ȳ − µ)²])
= (2πσ²)^(−n/2) exp(−(1/(2σ²)) ∑_{i=1}^n (yi − ȳ)² − (n/(2σ²))(ȳ − µ)²)

since ∑_{i=1}^n (yi − ȳ) = 0. Thus

ℓ(µ, σ²) = −(n/2) log(2πσ²) − (1/(2σ²)) ∑_{i=1}^n (yi − ȳ)² − (n/(2σ²))(ȳ − µ)²

∂/∂µ ℓ(µ, σ²) = (n/σ²)(ȳ − µ), which set equal to 0 gives µ̂_MLE = ȳ.

∂/∂σ² ℓ(µ, σ²) = −n/(2σ²) + (1/(2(σ²)²)) ∑_{i=1}^n (yi − ȳ)² (plugging in µ = µ̂_MLE = ȳ so the last term vanishes), which set equal to 0 gives σ̂²_MLE = (1/n) ∑_{i=1}^n (yi − ȳ)² = ((n − 1)/n) S².

Thus, the MLE for a normal model is

µ̂_MLE = ȳ,  σ̂²_MLE = (1/n) ∑_{i=1}^n (yi − ȳ)².
Normal MLE Numerical maximization
Normal MLE - numerical maximization
x
[1] -0.8969145  0.1848492  1.5878453

log_likelihood <- function(theta) {
  sum(dnorm(x, mean = theta[1], sd = exp(theta[2]), log = TRUE))
}
o <- optim(c(0, 0), log_likelihood,
           control = list(fnscale = -1))   # maximize

c(o$par[1], exp(o$par[2])^2)                  # numerical MLE
[1] 0.2918674 1.0344601
n <- length(x); c(mean(x), (n-1)/n*var(x))    # true MLE
[1] 0.2919267 1.0347381
Normal MLE Graph
Normal likelihood - graph
[Figure: Contour plot of the normal likelihood for the data above; axes µ and σ.]
Summary
Summary
For independent observations, the joint probability mass (density) function is the product of the marginal probability mass (density) functions. The likelihood is this joint function viewed as a function of the parameter for fixed data, and the maximum likelihood estimator is the parameter value that maximizes it.