SLIDE 1

Introduction to General and Generalized Linear Models

The Likelihood Principle - part I Henrik Madsen Poul Thyregod

Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

October 2010

SLIDE 2

This lecture

  • The likelihood principle
  • Point estimation theory
  • The likelihood function
  • The score function
  • The information matrix

SLIDE 3

The likelihood principle

The beginning of likelihood theory

  • Fisher (1922) identified the likelihood function as the key inferential quantity, conveying all inferential information in statistical modelling, including the uncertainty.
  • The Fisherian school offers a Bayesian-frequentist compromise.

SLIDE 4

The likelihood principle

A motivating example

Suppose we toss a thumbtack (used to fasten documents to a background) 10 times and observe that 3 times it lands point up. Assuming we know nothing prior to the experiment, what is the probability of landing point up, θ?

This is a binomial experiment with y = 3 and n = 10. For selected values of θ:

  • P(Y = 3; n = 10, θ = 0.2) = 0.2013
  • P(Y = 3; n = 10, θ = 0.3) = 0.2668
  • P(Y = 3; n = 10, θ = 0.4) = 0.2150
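These probabilities can be reproduced directly from the binomial probability mass function. A minimal Python sketch (not part of the original slides; the helper name binom_pmf is illustrative):

```python
from math import comb

def binom_pmf(y, n, theta):
    """P(Y = y) for a binomial experiment with n trials and success probability theta."""
    return comb(n, y) * theta**y * (1 - theta) ** (n - y)

# Thumbtack example: y = 3 point-up landings in n = 10 tosses
for theta in (0.2, 0.3, 0.4):
    print(f"P(Y = 3; n = 10, theta = {theta}) = {binom_pmf(3, 10, theta):.4f}")
# -> 0.2013, 0.2668, 0.2150
```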

SLIDE 5

The likelihood principle

A motivating example

By considering Pθ(Y = 3) to be a function of the unknown parameter we have the likelihood function: L(θ) = Pθ(Y = 3).

In general, in a binomial experiment with n trials and y successes, the likelihood function is:

L(θ) = Pθ(Y = y) = (n choose y) · θ^y · (1 − θ)^(n−y)

SLIDE 6

The likelihood principle

A motivating example


Figure: Likelihood function of the success probability θ in a binomial experiment with n = 10 and y = 3.
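A curve like the one in this figure can be reproduced with a short plotting sketch (assuming numpy and matplotlib are available; not part of the original slides):

```python
import numpy as np
import matplotlib.pyplot as plt
from math import comb

n, y = 10, 3
theta = np.linspace(0.001, 0.999, 400)
L = comb(n, y) * theta**y * (1 - theta) ** (n - y)   # binomial likelihood

plt.plot(theta, L, label="Likelihood func.")
plt.axvline(y / n, linestyle="--", label="MLE = y/n")
plt.xlabel("θ")
plt.ylabel("Likelihood")
plt.legend()
plt.show()
```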

SLIDE 7

The likelihood principle

A motivating example

It is often more convenient to consider the log-likelihood function. The log-likelihood function is:

log L(θ) = y log θ + (n − y) log(1 − θ) + const

where const indicates a term that does not depend on θ. By solving ∂ log L(θ)/∂θ = 0 it is readily seen that the maximum likelihood estimate (MLE) for θ is

θ̂(y) = y/n = 3/10 = 0.3
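The stationarity condition can also be checked symbolically. A small sketch using sympy (an assumption; the slides do the derivation by hand):

```python
import sympy as sp

theta, y, n = sp.symbols("theta y n", positive=True)
logL = y * sp.log(theta) + (n - y) * sp.log(1 - theta)   # log-likelihood up to a constant

score = sp.diff(logL, theta)            # y/theta - (n - y)/(1 - theta)
mle = sp.solve(sp.Eq(score, 0), theta)
print(mle)                              # [y/n]
print(mle[0].subs({y: 3, n: 10}))       # 3/10
```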

SLIDE 8

The likelihood principle

The likelihood principle

  • Not just a method for obtaining a point estimate of parameters.
  • It is the entire likelihood function that captures all the information in the data about a certain parameter.
  • Likelihood based methods are inherently computational. In general numerical methods are needed to find the MLE.
  • Today the likelihood principles play a central role in statistical modelling and inference.
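In line with the remark that numerical methods are usually needed, here is a minimal sketch of numerical maximum likelihood for the thumbtack example (using scipy, which is an assumption; the closed-form answer y/n = 0.3 is of course available here):

```python
from math import comb, log
from scipy.optimize import minimize_scalar

n, y = 10, 3

def neg_log_lik(theta):
    # negative log-likelihood of the binomial experiment
    return -(log(comb(n, y)) + y * log(theta) + (n - y) * log(1 - theta))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)   # approximately 0.3 = y/n
```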

SLIDE 9

The likelihood principle

Some syntax

  • Multivariate random variable: Y = {Y1, Y2, . . . , Yn}T
  • Observation set: y = {y1, y2, . . . , yn}T
  • Joint density: {fY(y1, y2, . . . , yn; θ)}θ∈Θk
  • Estimator (random): θ̂(Y)
  • Estimate (number/vector): θ̂(y)

SLIDE 10

Point estimation theory

Point estimation theory

We will assume that the statistical model for y is given by a parametric family of joint densities:

{fY(y1, y2, . . . , yn; θ)}θ∈Θk

Remember that when the n random variables are independent, the joint probability density equals the product of the corresponding marginal densities:

f(y1, y2, . . . , yn) = f1(y1) · f2(y2) · . . . · fn(yn)
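Equivalently, for independent observations the joint log-density is the sum of the marginal log-densities. A minimal sketch for the normal case (illustrative helper names, not from the slides):

```python
from math import log, pi

def norm_logpdf(y, mu, sigma2):
    """Log of the normal density with mean mu and variance sigma2."""
    return -0.5 * log(2 * pi * sigma2) - (y - mu) ** 2 / (2 * sigma2)

def joint_log_density(ys, mu, sigma2=1.0):
    # independence: log f(y1, ..., yn) = sum of log f_i(y_i)
    return sum(norm_logpdf(y, mu, sigma2) for y in ys)

print(joint_log_density([4.6, 6.3, 5.0], mu=5.3))   # joint log-density at mu = 5.3
```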

SLIDE 11

Point estimation theory

Point estimation theory

Definition (Unbiased estimator)
Any estimator θ̂ = θ̂(Y) is said to be unbiased if

E[θ̂] = θ

for all θ ∈ Θk.

Definition (Minimum mean square error)
An estimator θ̂ = θ̂(Y) is said to be uniformly minimum mean square error if

E[(θ̂(Y) − θ)(θ̂(Y) − θ)T] ≤ E[(θ̃(Y) − θ)(θ̃(Y) − θ)T]

for all θ ∈ Θk and all other estimators θ̃(Y).
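As a concrete illustration (not from the slides), a small simulation suggesting that the sample mean is an unbiased estimator of a normal mean:

```python
import random

random.seed(1)
mu, n, reps = 5.3, 3, 100_000

estimates = []
for _ in range(reps):
    sample = [random.gauss(mu, 1.0) for _ in range(n)]
    estimates.append(sum(sample) / n)          # estimator: the sample mean

print(sum(estimates) / reps)   # close to mu = 5.3
```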

SLIDE 12

Point estimation theory

Point estimation theory

  • By considering the class of unbiased estimators it is most often not possible to establish a suitable estimator.
  • We need to add a criterion on the variance of the estimator.
  • A low variance is desired, and in order to evaluate the variance a suitable lower bound is given by the Cramér-Rao inequality.

SLIDE 13

Point estimation theory

Point estimation theory

Theorem (Cramér-Rao inequality)

Given the parametric density fY(y; θ), θ ∈ Θk, for the observations Y. Subject to certain regularity conditions, the variance of any unbiased estimator θ̂(Y) of θ satisfies the inequality

Var[θ̂(Y)] ≥ i⁻¹(θ)

where i(θ) is the Fisher information matrix defined by

i(θ) = E[ (∂ log fY(Y; θ)/∂θ) (∂ log fY(Y; θ)/∂θ)T ]

and Var[θ̂(Y)] = E[(θ̂(Y) − θ)(θ̂(Y) − θ)T].
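For the binomial experiment the bound can be checked directly: with θ̂ = Y/n one has Var[θ̂] = θ(1 − θ)/n, which equals i⁻¹(θ) since i(θ) = n/(θ(1 − θ)), so this MLE attains the bound. A small simulation sketch (not from the slides) illustrating this:

```python
import random

random.seed(2)
n, theta, reps = 10, 0.3, 200_000

# Simulate Y ~ Binomial(n, theta) and estimate theta by Y/n
est = [sum(random.random() < theta for _ in range(n)) / n for _ in range(reps)]
mean_est = sum(est) / reps
var_est = sum((e - mean_est) ** 2 for e in est) / reps

crlb = theta * (1 - theta) / n     # = 1/i(theta) with i(theta) = n/(theta(1-theta))
print(var_est, crlb)               # both close to 0.021
```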

SLIDE 14

Point estimation theory

Point estimation theory

Definition (Efficient estimator)
An unbiased estimator is said to be efficient if its covariance is equal to the Cramér-Rao lower bound.

Dispersion matrix
The matrix Var[θ̂(Y)] is often called a variance-covariance matrix, since it contains variances on the diagonal and covariances off the diagonal. This important matrix is often termed the dispersion matrix.

SLIDE 15

The likelihood function

The likelihood function

The likelihood function is built on an assumed parameterized statistical model as specified by a parametric family of joint densities for the observations Y = (Y1, Y2, ..., Yn)T . The likelihood of any specific value θ of the parameters in a model is (proportional to) the probability of the actual outcome, Y1 = y1, Y2 = y2, ..., Yn = yn, calculated for the specific value θ. The likelihood function is simply obtained by considering the likelihood as a function of θ ∈ Θk.

SLIDE 16

The likelihood function

The likelihood function

Definition (Likelihood function)
Given the parametric density fY(y; θ), θ ∈ Θk, for the observations y = (y1, y2, . . . , yn), the likelihood function for θ is the function

L(θ; y) = c(y1, y2, . . . , yn) fY(y1, y2, . . . , yn; θ)

where c(y1, y2, . . . , yn) is a constant. The likelihood function is thus (proportional to) the joint probability density for the actual observations considered as a function of θ.

SLIDE 17

The likelihood function

The log-likelihood function

Very often it is more convenient to consider the log-likelihood function defined as l(θ; y) = log(L(θ; y)). Sometimes the likelihood and the log-likelihood function will be written as L(θ) and l(θ), respectively, i.e. the dependency on y is suppressed.

SLIDE 18

The likelihood function

Example: Likelihood function for mean of normal distribution

An automatic production of a bottled liquid is considered to be stable. A sample of three bottles was selected at random from the production and the volume of the content was measured. The deviation from the nominal volume of 700.0 ml was recorded. The deviations (in ml) were 4.6, 6.3, and 5.0.

SLIDE 19

The likelihood function

Example: Likelihood function for mean of normal distribution

First a model is formulated:

i. Model: C+E (center plus error) model, Y = µ + ε
ii. Data: Yi = µ + εi
iii. Assumptions:
  • Y1, Y2, Y3 are independent
  • Yi ∼ N(µ, σ²)
  • σ² is known, σ² = 1

Thus, there is only one unknown model parameter, µY = µ.

SLIDE 20

The likelihood function

Example: Likelihood function for mean of normal distribution

The joint probability density function for Y1, Y2, Y3 is given by

fY1,Y2,Y3(y1, y2, y3; µ) = [1/√(2π)] exp(−(y1 − µ)²/2) × [1/√(2π)] exp(−(y2 − µ)²/2) × [1/√(2π)] exp(−(y3 − µ)²/2)

which for every value of µ is a function of the three variables y1, y2, y3.

Remember that the normal probability density is:

f(y; µ, σ²) = [1/(√(2π) σ)] exp(−(y − µ)²/(2σ²))

SLIDE 21

The likelihood function

Example: Likelihood function for mean of normal distribution

Now, we have the observations, y1 = 4.6, y2 = 6.3 and y3 = 5.0, and establish the likelihood function

L4.6,6.3,5.0(µ) = fY1,Y2,Y3(4.6, 6.3, 5.0; µ)
              = [1/√(2π)] exp(−(4.6 − µ)²/2) × [1/√(2π)] exp(−(6.3 − µ)²/2) × [1/√(2π)] exp(−(5.0 − µ)²/2)

The function depends only on µ.

Note that the likelihood function expresses the infinitesimal probability of obtaining the sample result (4.6, 6.3, 5.0) as a function of the unknown parameter µ.
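A quick numerical sketch (not part of the slides) evaluating this likelihood over a grid of µ values confirms that it peaks at the sample mean ȳ = 5.3:

```python
from math import exp, pi, sqrt

obs = [4.6, 6.3, 5.0]

def likelihood(mu, ys=obs):
    """Joint normal density of the observations (sigma^2 = 1) as a function of mu."""
    L = 1.0
    for y in ys:
        L *= exp(-(y - mu) ** 2 / 2) / sqrt(2 * pi)
    return L

grid = [i / 1000 for i in range(4000, 6501)]        # mu from 4.0 to 6.5
mu_hat = max(grid, key=likelihood)
print(mu_hat, likelihood(mu_hat))                   # about 5.3 and about 0.029
```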

SLIDE 22

The likelihood function

Example: Likelihood function for mean of normal distribution

Reducing the expression one finds

L4.6,6.3,5.0(µ) = [1/(√(2π))³] exp(−1.58/2) exp(−3(5.3 − µ)²/2)
              = [1/(√(2π))³] exp(−1.58/2) exp(−3(ȳ − µ)²/2)

which shows that (except for a factor not depending on µ) the likelihood function depends on the observations (y1, y2, y3) only through the average ȳ = Σ yi/3.
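The reduction rests on the identity Σ(yi − µ)² = Σ(yi − ȳ)² + n(ȳ − µ)², and can be verified numerically. A short sketch (not from the slides) comparing the full product form with the reduced form:

```python
from math import exp, pi, sqrt

obs = [4.6, 6.3, 5.0]
ybar = sum(obs) / len(obs)                      # 5.3
ss = sum((y - ybar) ** 2 for y in obs)          # 1.58

def L_full(mu):
    L = 1.0
    for y in obs:
        L *= exp(-(y - mu) ** 2 / 2) / sqrt(2 * pi)
    return L

def L_reduced(mu):
    return exp(-ss / 2) * exp(-3 * (ybar - mu) ** 2 / 2) / sqrt(2 * pi) ** 3

print(ss)                                       # 1.58
print(L_full(5.0), L_reduced(5.0))              # identical values
```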

SLIDE 23

The likelihood function

Example: Likelihood function for mean of normal distribution

Figure: The likelihood function for µ given the observations y1 = 4.6, y2 = 6.3 and y3 = 5.0, with the sample mean and the observations marked.

SLIDE 24

The likelihood function

Sufficient statistic

  • The primary goal in analysing observations is to characterise the information in the observations by a few numbers.
  • A statistic t(Y1, Y2, . . . , Yn) is a function of the observations.
  • In estimation, a sufficient statistic is a statistic that contains all the information in the observations.

SLIDE 25

The likelihood function

Sufficient statistic

Definition (Sufficient statistic)

A (possibly vector-valued) function t(Y1, Y2, . . . , Yn) is said to be a sufficient statistic for a (possibly vector-valued) parameter θ if the joint probability density function for Y1, . . . , Yn can be factorized into a product

fY1,...,Yn(y1, . . . , yn; θ) = h(y1, . . . , yn) g(t(y1, y2, . . . , yn); θ)

with the factor h(y1, . . . , yn) not depending on the parameter θ, and the factor g(t(y1, y2, . . . , yn); θ) depending on y1, . . . , yn only through the function t(·, ·, . . . , ·). Thus, if we know the value of t(y1, y2, . . . , yn), the individual values y1, . . . , yn do not contain further information about the value of θ.

Roughly speaking, a statistic is sufficient if we are able to calculate the likelihood function (apart from a factor) knowing only t(Y1, Y2, . . . , Yn).
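In the bottle example, ȳ is sufficient for µ: two data sets with the same mean give likelihood functions that differ only by a constant factor. A small sketch (illustrative, not from the slides):

```python
from math import exp, pi, sqrt

def likelihood(mu, ys):
    L = 1.0
    for y in ys:
        L *= exp(-(y - mu) ** 2 / 2) / sqrt(2 * pi)
    return L

a = [4.6, 6.3, 5.0]        # mean 5.3
b = [5.3, 5.3, 5.3]        # same mean, different individual values

# The ratio is constant in mu: the data enter the likelihood only through ybar.
for mu in (4.5, 5.0, 5.3, 6.0):
    print(mu, likelihood(mu, a) / likelihood(mu, b))
```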

SLIDE 26

The Score function

The Score function

Definition (Score function)
Consider θ = (θ1, · · · , θk) ∈ Θk, and assume that Θk is an open subspace of Rk, and that the log-likelihood is continuously differentiable. Then consider the first order partial derivative (gradient) of the log-likelihood function:

l′θ(θ; y) = ∂l(θ; y)/∂θ = ( ∂l(θ; y)/∂θ1 , . . . , ∂l(θ; y)/∂θk )T

The function l′θ(θ; y) is called the score function, often written as S(θ; y).
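As a one-parameter illustration (an assumed example, not from the slides), the score of the binomial log-likelihood and its value at the MLE:

```python
def score(theta, y=3, n=10):
    """Score of the binomial log-likelihood l(theta) = y log(theta) + (n-y) log(1-theta) + const."""
    return y / theta - (n - y) / (1 - theta)

print(score(0.2))   # positive: the log-likelihood is increasing at theta = 0.2
print(score(0.3))   # 0.0 at the MLE y/n
print(score(0.4))   # negative: decreasing
```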

SLIDE 27

The Score function

The Score function

Theorem
Under normal regularity conditions

Eθ[∂l(θ; Y)/∂θ] = 0

This follows by differentiation of ∫ fY(y; θ) µ{dy} = 1.
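A simulation sketch (not from the slides) suggesting that the score has mean zero under the true parameter, again for the binomial experiment:

```python
import random

random.seed(3)
n, theta, reps = 10, 0.3, 200_000

def score(th, y, n):
    return y / th - (n - y) / (1 - th)

# Average score over samples drawn with the true parameter theta
mean_score = sum(
    score(theta, sum(random.random() < theta for _ in range(n)), n) for _ in range(reps)
) / reps
print(mean_score)   # close to 0
```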

SLIDE 28

The information matrix

The information matrix

Definition (Observed information)
The matrix

j(θ; y) = − ∂²l(θ; y)/∂θ∂θT

with the elements

j(θ; y)ij = − ∂²l(θ; y)/∂θi∂θj

is called the observed information corresponding to the observation y, evaluated at θ.

The observed information is thus equal to the Hessian (with opposite sign) of the log-likelihood function evaluated at θ. The Hessian matrix is simply (with opposite sign) the curvature of the log-likelihood function.
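Continuing the binomial illustration (an assumed example, not from the slides), the observed information is the negative second derivative of the log-likelihood:

```python
def observed_information(theta, y=3, n=10):
    """j(theta) = -d^2 l / d theta^2 for l(theta) = y log(theta) + (n-y) log(1-theta) + const."""
    return y / theta**2 + (n - y) / (1 - theta) ** 2

print(observed_information(0.3))   # 3/0.09 + 7/0.49, about 47.6 at the MLE
```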

SLIDE 29

The information matrix

The information matrix

Definition (Expected information)
The expectation of the observed information, i(θ) = E[j(θ; Y)], where the expectation is determined under the distribution corresponding to θ, is called the expected information, or the information matrix corresponding to the parameter θ.

The expected information is also known as the Fisher information matrix.

SLIDE 30

The information matrix

Fisher Information Matrix

The expected information, or Fisher information matrix, is equal to the dispersion matrix for the score function, i.e.

i(θ) = Eθ[− ∂²l(θ; Y)/∂θ∂θT] = Eθ[ (∂l(θ; Y)/∂θ) (∂l(θ; Y)/∂θ)T ] = Dθ[l′θ(θ; Y)]

where D[·] denotes the dispersion matrix.

In estimation the information matrix provides a measure of the accuracy obtained in determining the parameters.

SLIDE 31

The information matrix

Example: Score function, Observed and Expected Information

Consider again the production of a bottled liquid example from slide 18. The log-likelihood function is:

l(µ; 4.6, 6.3, 5.0) = −3(5.3 − µ)²/2 + C(4.6, 6.3, 5.0)

and hence the score function is

l′µ(µ; 4.6, 6.3, 5.0) = 3 · (5.3 − µ),

with the observed information j(µ; 4.6, 6.3, 5.0) = 3.

SLIDE 32

The information matrix

Example: Score function, Observed and Expected Information

In order to determine the expected information it is necessary to perform analogous calculations, substituting the data by the corresponding random variables Y1, Y2, Y3. The likelihood function can be written as

Ly1,y2,y3(µ) = [1/(√(2π))³] exp(−Σ(yi − ȳ)²/2) exp(−3(ȳ − µ)²/2).

SLIDE 33

The information matrix

Example: Score function, Observed and Expected Information

Introducing the random variables (Y1, Y2, Y3) instead of (y1, y2, y3) and taking logarithms one finds

l(µ; Y1, Y2, Y3) = −3(Ȳ − µ)²/2 − 3 ln(√(2π)) − Σ(Yi − Ȳ)²/2,

and hence the score function is

l′µ(µ; Y1, Y2, Y3) = 3(Ȳ − µ),

and the observed information j(µ; Y1, Y2, Y3) = 3.

SLIDE 34

The information matrix

Example: Score function, Observed and Expected Information

It is seen in this (Gaussian) case that the observed information (the curvature of the log-likelihood function) does not depend on the observations Y1, Y2, Y3, and hence the expected information is

i(µ) = E[j(µ; Y1, Y2, Y3)] = 3.
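A small numerical check (illustrative, not from the slides): the curvature of the normal log-likelihood with σ² = 1 equals n for any data set, here n = 3.

```python
def log_lik(mu, ys, sigma2=1.0):
    return sum(-(y - mu) ** 2 / (2 * sigma2) for y in ys)   # up to a constant

def observed_information(mu, ys, h=1e-3):
    # numerical negative second derivative of the log-likelihood in mu
    return -(log_lik(mu + h, ys) - 2 * log_lik(mu, ys) + log_lik(mu - h, ys)) / h**2

print(observed_information(5.3, [4.6, 6.3, 5.0]))     # approximately 3
print(observed_information(4.0, [1.0, 2.0, 9.9]))     # still approximately 3: data-independent
```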

SLIDE 35

The information matrix

Alternative parameterizations of the likelihood

Definition (The likelihood function for alternative parameterizations)
The likelihood function does not depend on the actual parameterization. Let ψ = ψ(θ) denote a one-to-one mapping of Ω ⊂ Rk onto Ψ ⊂ Rk. The parameterization given by ψ is just an alternative parameterization of the model. The likelihood and log-likelihood functions for the parameterization given by ψ are

LΨ(ψ; y) = LΩ(θ(ψ); y)
lΨ(ψ; y) = lΩ(θ(ψ); y)

This gives rise to the very useful invariance property. The likelihood is thus not a joint probability density on Ω, since then the Jacobian would have had to be used. However, the score function and the information matrix do in general depend on the parameterization.
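The invariance property can be illustrated numerically with the thumbtack example. A sketch under the assumption of a log-odds reparameterization ψ = log(θ/(1 − θ)) (not part of the slides): the likelihood value is unchanged, and the MLE simply maps along with the parameter.

```python
from math import comb, log, exp

n, y = 10, 3

def L_theta(theta):
    return comb(n, y) * theta**y * (1 - theta) ** (n - y)

def L_psi(psi):
    # alternative parameterization: psi = log(theta / (1 - theta)), so theta = 1 / (1 + exp(-psi))
    theta = 1 / (1 + exp(-psi))
    return L_theta(theta)

theta_hat = y / n
psi_hat = log(theta_hat / (1 - theta_hat))

print(L_theta(theta_hat), L_psi(psi_hat))   # identical: the likelihood is invariant
```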
