Advanced Section #2 Model Selection & Information Criteria


SLIDE 1

Advanced Section #2 Model Selection & Information Criteria

Akaike Information Criterion

Marios Mattheakis and Pavlos Protopapas

CS109A Introduction to Data Science
Pavlos Protopapas and Kevin Rader

SLIDE 2

CS109A, PROTOPAPAS, RADER

Outline

  • Maximum Likelihood Estimation (MLE): fit a distribution
      • Exponential distribution
      • Normal (Linear Regression Model)
  • Model Selection & Information Criteria
      • KL divergence
      • MLE justification through KL divergence
      • Model Comparison
      • Akaike Information Criterion (AIC)

SLIDE 3

Maximum Likelihood Estimation (MLE) & Parametric Models

SLIDE 4


Maximum Likelihood Estimation (MLE)

Fit your data with a parametric distribution q(y|θ). θ=(θ1, … , θk) is a parameter set to be estimated.

SLIDE 5


Maximum Likelihood Estimation (MLE)

Fit your data with a parametric distribution q(y|θ). θ=(θ1, … , θk) is a parameter set to be estimated.

SLIDE 6


Maximize the Likelihood L


One could scan over all parameter values until the maximum of L is found, but this is far too time-consuming an approach.
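The brute-force scan can be sketched in code. A minimal sketch, assuming (for illustration only) an exponential model with rate λ and synthetic data; neither appears on this slide:

```python
import numpy as np

# Illustrative sketch: brute-force scan of the likelihood over a parameter
# grid, here for an exponential model q(y | lam) = lam * exp(-lam * y).
rng = np.random.default_rng(0)
y = rng.exponential(scale=1 / 2.0, size=500)  # synthetic data, true rate 2.0

def log_likelihood(lam, y):
    # log L(lam) = n log(lam) - lam * sum(y_i)
    return y.size * np.log(lam) - lam * y.sum()

grid = np.linspace(0.1, 5.0, 491)  # candidate rates to scan
best = grid[np.argmax([log_likelihood(lam, y) for lam in grid])]
print(best)  # near the true rate, but the cost grows with grid resolution
```

The scan works here only because there is a single parameter; with k parameters the grid grows exponentially, which is why a formal method is needed.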

SLIDE 7


Maximum Likelihood Estimation (MLE)

A formal and efficient method is given by MLE. Observations: y = (y1, …, yn).


It is easier and numerically more stable to work with the log-likelihood.

SLIDE 8


Maximum Likelihood Estimation (MLE)


It is easier and numerically more stable to work with the log-likelihood.
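Why the log matters numerically can be seen in a small sketch (an illustrative assumption, not from the slides): the raw likelihood, a product of many densities below one, underflows in floating point, while the log-likelihood stays finite.

```python
import numpy as np

# The raw likelihood is a product of many densities < 1 and underflows;
# summing log-densities avoids this.
rng = np.random.default_rng(1)
y = rng.exponential(scale=1.0, size=2000)
lam = 1.0
densities = lam * np.exp(-lam * y)          # q(y_i | lam) for each observation

raw_likelihood = np.prod(densities)         # underflows to 0.0 in float64
log_likelihood = np.sum(np.log(densities))  # finite, roughly -sum(y_i)

print(raw_likelihood, log_likelihood)
```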

SLIDE 9


Exponential distribution: A simple and useful example

A one-parameter distribution: the rate parameter λ.
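The MLE for λ has a closed form; a sketch of the derivation, following the log-likelihood recipe above:

```latex
% Exponential model: q(y \mid \lambda) = \lambda e^{-\lambda y}, \; y \ge 0
\ell(\lambda) = \sum_{i=1}^{n} \log\!\bigl(\lambda e^{-\lambda y_i}\bigr)
             = n\log\lambda - \lambda\sum_{i=1}^{n} y_i
% Setting the derivative to zero:
\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} y_i = 0
\quad\Longrightarrow\quad
\hat{\lambda}_{\mathrm{MLE}} = \frac{1}{\bar{y}}
```

The estimate is simply the reciprocal of the sample mean, which matches the intuition that the rate is inversely proportional to the average waiting time.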

SLIDE 10


Linear Regression Model with Gaussian error

SLIDE 11


Linear Regression Model through MLE


Loss Function

SLIDE 12


Linear Regression Model: Standard Formulas


Minimizing the loss essentially maximizes the likelihood, and we obtain the standard formulas.
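The equivalence can be checked numerically in a short sketch (the synthetic data and coefficient values are illustrative assumptions): maximizing the Gaussian likelihood of y = β0 + β1·x + ε reduces to least squares, and the standard closed-form estimates recover the true coefficients.

```python
import numpy as np

# Gaussian MLE for simple linear regression reduces to least squares;
# the standard closed-form OLS estimates follow.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
y = 1.5 + 0.8 * x + rng.normal(0.0, 0.5, size=200)  # true beta0=1.5, beta1=0.8

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)  # close to the true 1.5 and 0.8
```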

SLIDE 13

Model Selection & Information Theory: Akaike Information Criterion

SLIDE 14


Kullback-Leibler (KL) divergence (or relative entropy)

How well do we fit the data? What additional uncertainty have we introduced?

SLIDE 15


KL divergence

The KL divergence measures the “distance” between two distributions and is a non-negative quantity.


By Jensen’s inequality for a convex function g(z), 𝔼[g(z)] ≥ g(𝔼[z]), the KL divergence is non-negative. Note, however, that the KL divergence is a non-symmetric quantity.
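Both properties can be checked numerically for discrete distributions (the probability vectors below are illustrative assumptions):

```python
import numpy as np

# KL divergence between two discrete distributions, illustrating
# non-negativity and asymmetry.
def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = [0.1, 0.4, 0.5]
q = [0.3, 0.3, 0.4]

print(kl(p, q))  # >= 0
print(kl(q, p))  # >= 0, but generally != kl(p, q): KL is not symmetric
print(kl(p, p))  # exactly 0: no divergence of a distribution from itself
```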

SLIDE 16


MLE justification through KL divergence

Empirical distribution


Minimizing the KL divergence (against the empirical distribution) is equivalent to maximizing the likelihood.

log-likelihood
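In symbols, with the empirical distribution p̂ placing mass 1/n on each observation, a sketch of the argument is:

```latex
D_{\mathrm{KL}}\!\left(\hat{p}\,\|\,q_\theta\right)
= \underbrace{\sum_{i=1}^{n} \hat{p}(y_i)\,\log \hat{p}(y_i)}_{\text{independent of }\theta}
  \;-\; \frac{1}{n}\sum_{i=1}^{n}\log q(y_i \mid \theta)
\quad\Longrightarrow\quad
\arg\min_{\theta} D_{\mathrm{KL}}\!\left(\hat{p}\,\|\,q_\theta\right)
= \arg\max_{\theta} \ell(\theta)
```

The first term does not depend on θ, so minimizing the divergence and maximizing the log-likelihood select the same parameters.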

SLIDE 17


Model Comparison

Consider two model distributions.


By using the empirical distribution, p is eliminated.

SLIDE 18


Akaike Information Criterion (AIC)

AIC is a trade-off between the number of parameters k and the error that is introduced (overfitting). AIC is an asymptotic approximation of the KL divergence. The data are used twice: first for the MLE and second for the KL-divergence estimation. AIC estimates the optimal number of parameters k.
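The criterion itself is one line of code: AIC = 2k − 2 log L̂, where L̂ is the maximized likelihood. A minimal sketch (the log-likelihood values below are made-up illustrations):

```python
# AIC = 2k - 2 log(L_hat); lower is better. k counts the fitted parameters.
def aic(max_log_likelihood, k):
    return 2 * k - 2 * max_log_likelihood

# Two hypothetical fits of the same data:
print(aic(-120.0, k=2))  # 244.0
print(aic(-119.5, k=5))  # 249.0: a tiny likelihood gain does not
                         # justify three extra parameters
```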

SLIDE 19


Polynomial Regression Model Example

Suppose a polynomial regression model


Which k is optimal? For k smaller than the optimal value: underfitting. For k larger than the optimal value: overfitting.
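The degree selection can be sketched numerically (the cubic ground truth and noise level are illustrative assumptions; for Gaussian errors the maximized log-likelihood is −(n/2)[log(2π·RSS/n) + 1]):

```python
import numpy as np

# Select the polynomial degree by AIC. For Gaussian errors the maximized
# log-likelihood depends only on RSS; k counts coefficients plus sigma^2.
rng = np.random.default_rng(3)
x = np.linspace(-3, 3, 120)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(0.0, 1.0, size=x.size)  # cubic truth

def aic_for_degree(d):
    coeffs = np.polyfit(x, y, d)
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    n = x.size
    max_loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    k = d + 2  # d+1 polynomial coefficients plus the noise variance
    return 2 * k - 2 * max_loglik

best_degree = min(range(1, 9), key=aic_for_degree)
print(best_degree)  # small degrees underfit; large degrees pay the 2k penalty
```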

SLIDE 20


Minimizing real and empirical KL-divergence


Suppose there are many models, indexed by j. Work with the j-th model, which has kj parameters.

SLIDE 21



Numerical verification of AIC

SLIDE 22


Akaike Information Criterion (AIC): Proof

Asymptotic expansion around the true (ideal) MLE θ0.

SLIDE 23


Akaike Information Criterion (AIC): Proof

SLIDE 24


Akaike Information Criterion (AIC): Proof


In the limit of a correct model:

SLIDE 25


Review

  • Maximum Likelihood Estimation (MLE)
    1. A powerful method to estimate the ideal fitting parameters of a model.
    2. The exponential distribution, a simple but useful example.
    3. The Linear Regression Model as a special paradigm of MLE implementation.

  • Model Selection & Information Criteria
    1. The KL divergence quantifies the “distance” between the fitting model and the “real” distribution.
    2. The KL divergence justifies MLE and is used for model comparison.
    3. AIC estimates the number of model parameters and protects from overfitting.

SLIDE 26


Thank you

Office hours are: Monday 6-7:30 (Marios) Tuesday 6:30-8 (Trevor)
