

1. Advanced Section #2: Model Selection & Information Criteria. Akaike Information Criterion. Marios Mattheakis and Pavlos Protopapas. CS109A: Introduction to Data Science. Pavlos Protopapas and Kevin Rader.

2. Outline
• Maximum Likelihood Estimation (MLE)
  - Fit a distribution
  - Exponential distribution
  - Normal (linear regression model)
• Model Selection & Information Criteria
  - KL divergence
  - MLE justification through KL divergence
  - Model comparison
  - Akaike Information Criterion (AIC)

3. Maximum Likelihood Estimation (MLE) & Parametric Models

4. Maximum Likelihood Estimation (MLE). Fit your data with a parametric distribution $q(y \mid \theta)$, where $\theta = (\theta_1, \dots, \theta_k)$ is the parameter set to be estimated.


6. Maximize the likelihood $L$. One could scan over all parameter values until the maximum of $L$ is found, but this brute-force approach is too time-consuming. See the sketch below.
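A minimal illustration of the brute-force idea and why the formal approach is preferable (my own sketch, not from the slides; the normal-mean example and all constants are assumptions): scan a grid of candidate means for a normal model with known variance, and compare the grid maximizer with the analytic MLE, the sample mean.

```python
import numpy as np

# Toy data: 200 draws from a normal distribution with unknown mean.
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=200)

def log_likelihood(mu, y, sigma=1.0):
    """Gaussian log-likelihood of the data y at mean mu (sigma assumed known)."""
    n = len(y)
    return -0.5 * np.sum((y - mu) ** 2) / sigma**2 - n * np.log(sigma * np.sqrt(2 * np.pi))

# Brute force: evaluate the log-likelihood on a fine grid and keep the best point.
grid = np.linspace(-5, 5, 10_001)
ll = np.array([log_likelihood(mu, y) for mu in grid])
mu_grid = grid[np.argmax(ll)]

# Analytic MLE for the mean: the sample average, no scanning required.
mu_mle = y.mean()
print(f"grid search: {mu_grid:.4f}, analytic MLE: {mu_mle:.4f}")
```

The grid search needs thousands of likelihood evaluations per parameter and scales exponentially with the number of parameters, which is why the calculus-based MLE of the following slides is the standard route.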

7. Maximum Likelihood Estimation (MLE). A formal and efficient method is given by MLE. For observations $y = (y_1, \dots, y_n)$, the likelihood is $L(\theta) = \prod_{i=1}^{n} q(y_i \mid \theta)$. It is easier and numerically more stable to work with the log-likelihood.

8. Maximum Likelihood Estimation (MLE). It is easier and numerically more stable to work with the log-likelihood: $\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log q(y_i \mid \theta)$, so the MLE is $\hat\theta = \arg\max_\theta \ell(\theta)$.
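The slide's equations did not survive extraction; a standard statement of the step they imply (reconstructed in my notation, not verbatim from the deck) is

$$
\ell(\theta) = \sum_{i=1}^{n} \log q(y_i \mid \theta),
\qquad
\frac{\partial \ell(\theta)}{\partial \theta_j}\bigg|_{\theta = \hat\theta} = 0,
\quad j = 1, \dots, k,
$$

i.e., the MLE $\hat\theta$ solves the score equations obtained by setting the gradient of the log-likelihood to zero.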

9. Exponential distribution: a simple and useful example. A one-parameter distribution with rate parameter $\lambda$: $q(y \mid \lambda) = \lambda e^{-\lambda y}$ for $y \ge 0$.
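The worked derivation on this slide is missing from the transcript; the standard calculation it refers to (reconstructed, not copied) is

$$
\ell(\lambda) = \sum_{i=1}^{n} \log\!\left(\lambda e^{-\lambda y_i}\right)
= n \log \lambda - \lambda \sum_{i=1}^{n} y_i,
\qquad
\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} y_i = 0
\;\Rightarrow\;
\hat\lambda = \frac{1}{\bar{y}},
$$

so the MLE of the rate is simply the reciprocal of the sample mean.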

10. Linear regression model with Gaussian error: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, i.e., $y_i \mid x_i \sim \mathcal{N}(\beta_0 + \beta_1 x_i, \sigma^2)$ with parameters $\theta = (\beta_0, \beta_1, \sigma^2)$.

11. Linear regression model through MLE. Writing out the Gaussian log-likelihood shows that maximizing it is equivalent to minimizing the familiar least-squares loss function.
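The likelihood-to-loss step itself is missing from the transcript; in the notation above (a standard reconstruction, not verbatim from the deck) it reads

$$
\ell(\beta, \sigma^2)
= -\frac{n}{2}\log\!\left(2\pi\sigma^2\right)
- \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right)^2,
$$

so for any fixed $\sigma^2$, maximizing $\ell$ over $\beta$ is the same as minimizing the loss $\mathcal{L}(\beta) = \sum_i (y_i - \beta_0 - \beta_1 x_i)^2$.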

12. Linear regression model: standard formulas. Minimizing the loss essentially maximizes the likelihood, and we recover the standard least-squares estimates: in matrix form, $\hat\beta = (X^\top X)^{-1} X^\top y$ and $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat y_i)^2$.
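A quick numerical sanity check of these formulas (an illustrative sketch, not course code; the simulated data and all constants are made up): compute the closed-form estimates directly and compare with NumPy's least-squares solver.

```python
import numpy as np

# Simulated data from a known line: beta0 = 1.5, beta1 = 0.8, sigma = 2.
rng = np.random.default_rng(1)
n = 500
x = rng.uniform(0, 10, size=n)
y = 1.5 + 0.8 * x + rng.normal(0, 2.0, size=n)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), x])

# Closed-form estimate: beta_hat = (X^T X)^{-1} X^T y (solve, don't invert).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# MLE of the noise variance: mean squared residual (divides by n, not n - 2).
resid = y - X @ beta_hat
sigma2_hat = np.mean(resid**2)

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_lstsq, sigma2_hat)
```

Note that the MLE $\hat\sigma^2$ divides by $n$ and is therefore slightly biased downward; the unbiased variant divides by $n - 2$ in simple linear regression.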

13. Model Selection & Information Theory: Akaike Information Criterion

14. Kullback-Leibler (KL) divergence (or relative entropy). How well do we fit the data? What additional uncertainty have we introduced?

15. KL divergence. The KL divergence, $D_{\mathrm{KL}}(p \,\|\, q) = \int p(y) \log \frac{p(y)}{q(y)}\, dy$, measures the discrepancy between two distributions and is a non-negative quantity: by Jensen's inequality for a convex function $g(z)$, $\mathbb{E}[g(z)] \ge g(\mathbb{E}[z])$, applied with $g = -\log$ to $z = q(y)/p(y)$. The KL divergence is non-symmetric, $D_{\mathrm{KL}}(p \,\|\, q) \neq D_{\mathrm{KL}}(q \,\|\, p)$ in general, so it is not a true distance.
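A small numerical illustration of both properties (my own sketch, assuming discrete distributions; the probability vectors are arbitrary):

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D_KL(p || q) = sum_i p_i * log(p_i / q_i).

    Assumes q_i > 0 wherever p_i > 0; terms with p_i = 0 contribute 0.
    """
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

print(kl(p, q))  # non-negative
print(kl(q, p))  # differs from kl(p, q): KL is not symmetric
print(kl(p, p))  # exactly zero when the distributions coincide
```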

16. MLE justification through KL divergence. Replace the unknown true distribution $p$ with the empirical distribution $\hat p$ of the observed data. Minimizing the KL divergence $D_{\mathrm{KL}}(\hat p \,\|\, q_\theta)$ over $\theta$ is then the same as maximizing the log-likelihood.
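The algebra behind this equivalence is missing from the transcript; a standard reconstruction (my notation, with $\hat p$ placing mass $1/n$ on each observation) is

$$
D_{\mathrm{KL}}(\hat p \,\|\, q_\theta)
= \sum_{i=1}^{n} \frac{1}{n} \log \frac{1/n}{q(y_i \mid \theta)}
= -\log n - \frac{1}{n} \sum_{i=1}^{n} \log q(y_i \mid \theta)
= \text{const} - \frac{1}{n}\,\ell(\theta),
$$

so $\arg\min_\theta D_{\mathrm{KL}}(\hat p \,\|\, q_\theta) = \arg\max_\theta \ell(\theta)$.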

17. Model comparison. Consider two candidate model distributions. Since $D_{\mathrm{KL}}(p \,\|\, q_j) = \mathbb{E}_p[\log p] - \mathbb{E}_p[\log q_j]$, the entropy term $\mathbb{E}_p[\log p]$ is common to every model and cancels when models are compared; by estimating $\mathbb{E}_p[\log q_j]$ with the empirical distribution, the unknown $p$ is eliminated.

18. Akaike Information Criterion (AIC). $\mathrm{AIC} = 2k - 2\,\ell(\hat\theta)$ is a trade-off between goodness of fit and the number of parameters $k$, guarding against overfitting. AIC is an asymptotic approximation of the KL divergence. The data are used twice: first for the MLE and second for the estimation of the KL divergence. AIC estimates the optimal number of parameters $k$.

19. Polynomial regression model example. Suppose a polynomial regression model $y = \beta_0 + \beta_1 x + \dots + \beta_k x^k + \epsilon$. Which $k$ is optimal? For $k$ smaller than the optimal value: underfitting. For $k$ larger than the optimal value: overfitting. A sketch of the selection procedure follows.
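A hedged end-to-end sketch of AIC-based degree selection (my own construction; the true degree, sample size, and noise level are arbitrary choices): fit polynomials of increasing degree by least squares, compute the Gaussian AIC of each fit, and keep the minimizer.

```python
import numpy as np

# Simulated data from a cubic: the true optimal degree is 3.
rng = np.random.default_rng(2)
n = 200
x = rng.uniform(-3, 3, size=n)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(0, 1.0, size=n)

def gaussian_aic(y, y_hat, n_params):
    """AIC = 2k - 2 * (maximized log-likelihood) for Gaussian-error regression.

    Profiling out sigma^2 gives max log-lik = -(m/2) * (log(2*pi*s2) + 1),
    where s2 = RSS / m is the MLE of the noise variance.
    """
    m = len(y)
    s2 = np.mean((y - y_hat) ** 2)
    loglik = -0.5 * m * (np.log(2 * np.pi * s2) + 1)
    return 2 * n_params - 2 * loglik

aics = {}
for degree in range(1, 9):
    coeffs = np.polyfit(x, y, degree)                  # least-squares fit
    y_hat = np.polyval(coeffs, x)
    aics[degree] = gaussian_aic(y, y_hat, degree + 2)  # degree coeffs + intercept + sigma^2

best = min(aics, key=aics.get)
print({d: round(a, 1) for d, a in aics.items()})
print("AIC-selected degree:", best)  # typically recovers the true degree, 3
```

Training error alone always decreases with $k$; the $2k$ penalty is what lets AIC reject the overfit high-degree models.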

20. Minimizing the real and the empirical KL divergence. Suppose there are many candidate models, indexed by $j$; work with the $j$-th model, which has $k_j$ parameters.

21. Numerical verification of AIC.

22. Akaike Information Criterion (AIC): proof. Asymptotic expansion of the log-likelihood around the true, ideal parameter value $\theta_0$.
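The expansion itself did not survive extraction; a standard version of this step (reconstructed, not verbatim from the deck) is the second-order Taylor expansion

$$
\ell(\theta) \approx \ell(\theta_0)
+ (\theta - \theta_0)^{\top} \nabla \ell(\theta_0)
- \tfrac{1}{2} (\theta - \theta_0)^{\top} \hat{J}\, (\theta - \theta_0),
\qquad
\hat{J} = -\nabla^2 \ell(\theta_0).
$$

Evaluated at the MLE $\hat\theta$, the expectation of the quadratic term is what produces the parameter-count penalty $k$ in AIC.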

23. Akaike Information Criterion (AIC): proof.

24. Akaike Information Criterion (AIC): proof. In the limit of a correct model:

25. Review
• Maximum Likelihood Estimation (MLE)
  1. A powerful method to estimate the ideal fitting parameters of a model.
  2. The exponential distribution, a simple but useful example.
  3. The linear regression model as a special paradigm of MLE implementation.
• Model Selection & Information Criteria
  1. The KL divergence quantifies the "distance" between the fitting model and the "real" distribution.
  2. The KL divergence justifies the MLE and is used for model comparison.
  3. AIC estimates the number of model parameters and protects against overfitting.

26. Advanced Section 2: Model Selection & Information Criteria. Thank you! Office hours: Monday 6:00-7:30 (Marios), Tuesday 6:30-8:00 (Trevor).
