  1. STK-IN4300 Statistical Learning Methods in Data Science
     Riccardo De Bin
     debin@math.uio.no

  2. Outline of the lecture

     Kernel Smoothing Methods
     - One dimensional kernel smoothers
     - Selecting the width of a kernel
     - Local linear regression
     - Local polynomial regression
     - Local regression in R^p
     - Structured local regression models in R^p
     - Kernel density estimation
     - Mixture models for density estimation
     - Nonparametric density estimation with a parametric start

  3. One dimensional kernel smoothers: from kNN to kernel smoothers

     When we introduced the kNN algorithm,

         $\hat f(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x)),$

     it was justified as an estimate of E[Y | X = x]. Drawbacks:
     - ugly discontinuities;
     - the same weight is given to all points, regardless of their distance to x.

  4. One dimensional kernel smoothers: definition

     Alternative: weight the effect of each point based on its distance,

         $\hat f(x_0) = \frac{\sum_{i=1}^N K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^N K_\lambda(x_0, x_i)},$

     where

         $K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{\lambda}\right). \qquad (1)$

     Here:
     - D(·) is called the kernel;
     - λ is the bandwidth or smoothing parameter.
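
     In code, the kernel-weighted average above (the Nadaraya-Watson estimator) is a few lines. A minimal sketch, assuming an Epanechnikov kernel; the function name `kernel_smoother` and the toy data are illustrative, not from the slides:

```python
import numpy as np

def epanechnikov(t):
    # D(t) = 3/4 (1 - t^2) on (-1, 1), 0 outside
    return np.where(np.abs(t) < 1, 0.75 * (1 - t**2), 0.0)

def kernel_smoother(x0, x, y, lam, D=epanechnikov):
    # f_hat(x0) = sum_i K_lambda(x0, x_i) y_i / sum_i K_lambda(x0, x_i),
    # with K_lambda(x0, x) = D(|x - x0| / lambda)
    w = D(np.abs(x - x0) / lam)
    return np.sum(w * y) / np.sum(w)

# toy data
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * x) + rng.normal(0, 0.3, 100)
print(kernel_smoother(0.5, x, y, lam=0.2))
```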

  5. One dimensional kernel smoothers: comparison (figure slide)

  6. One dimensional kernel smoothers: typical kernels

     We need to choose D(·):
     - symmetric around x_0;
     - dies off smoothly with the distance.

     Typical choices:

     Kernel         D(t)                      Support
     Normal         (1/√(2π)) exp{-t²/2}      R
     Rectangular    1/2                       (-1, 1)
     Epanechnikov   (3/4)(1 - t²)             (-1, 1)
     Biquadratic    (15/16)(1 - t²)²          (-1, 1)
     Tricubic       (70/81)(1 - |t|³)³        (-1, 1)
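
     The table translates directly into code. A sketch of the five kernels as Python functions; the dictionary name `KERNELS` and the truncation helper are my own:

```python
import numpy as np

def on_support(t, value):
    # all kernels except the Gaussian vanish outside (-1, 1)
    return np.where(np.abs(t) < 1, value, 0.0)

KERNELS = {
    "normal":       lambda t: np.exp(-0.5 * t**2) / np.sqrt(2 * np.pi),
    "rectangular":  lambda t: on_support(t, 0.5 * np.ones_like(t)),
    "epanechnikov": lambda t: on_support(t, 0.75 * (1 - t**2)),
    "biquadratic":  lambda t: on_support(t, 15/16 * (1 - t**2)**2),
    "tricubic":     lambda t: on_support(t, 70/81 * (1 - np.abs(t)**3)**3),
}

t = np.linspace(-1.5, 1.5, 7)
for name, D in KERNELS.items():
    print(name, np.round(D(t), 3))
```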

  7. One dimensional kernel smoothers: comparison (figure slide)

  8. One dimensional kernel smoothers: choice of the smoothing parameter

     Choice of the bandwidth λ:
     - it controls how large an interval around x_0 is considered:
       - for the Epanechnikov, biquadratic and tricubic kernels it is the radius of the support;
       - for the Gaussian kernel it is the standard deviation;
     - large values imply lower variance but higher bias:
       - λ small → f̂(x_0) based on few points (the y_i's whose x_i are closest to x_0);
       - λ large → more points → stronger averaging effect;
     - alternatively, adapt to the local density (see the sketch below):
       - fix k as in kNN;
       - expressed by substituting λ with h_λ(x_0) in (1);
       - keeps the bias constant, while the variance is inversely proportional to the local density.
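
     One way to make the "adapt to the local density" idea concrete: replace the fixed λ in (1) by h_λ(x_0), here taken as the distance from x_0 to its k-th nearest neighbour, so the window widens where observations are sparse. A minimal sketch; the helper names, the tricube kernel and the toy data are illustrative assumptions:

```python
import numpy as np

def tricube(t):
    # D(t) = 70/81 (1 - |t|^3)^3 on (-1, 1)
    return np.where(np.abs(t) < 1, 70/81 * (1 - np.abs(t)**3)**3, 0.0)

def knn_bandwidth(x0, x, k):
    # h_lambda(x0): distance from x0 to its k-th nearest x_i
    return np.sort(np.abs(x - x0))[k - 1]

def adaptive_kernel_smoother(x0, x, y, k, D=tricube):
    # same estimator as before, but with lambda replaced by h_lambda(x0)
    w = D(np.abs(x - x0) / knn_bandwidth(x0, x, k))
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * x) + rng.normal(0, 0.3, 100)
print(adaptive_kernel_smoother(0.5, x, y, k=20))
```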

  9. One dimensional kernel smoothers: effect of the smoothing parameter (figure slide)

  10. Selecting the width of a kernel: bias and variance

      Assume y_i = f(x_i) + ε_i, with the ε_i i.i.d. such that E[ε_i] = 0 and Var[ε_i] = σ². Then

          $E[\hat f(x)] \approx f(x) + \frac{\lambda^2}{2}\, \sigma^2_D\, f''(x)$

      and

          $\mathrm{Var}[\hat f(x)] \approx \frac{\sigma^2 R_D}{N \lambda\, g(x)}$

      for N large and λ sufficiently close to 0 (Azzalini & Scarpa, 2012). Here:
      - $\sigma^2_D = \int t^2 D(t)\, dt$;
      - $R_D = \int D(t)^2\, dt$;
      - g(x) is the density from which the x_i were sampled.
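
      The two kernel constants can be checked numerically for any kernel in the table; for the Epanechnikov kernel the exact values are σ²_D = 1/5 and R_D = 3/5. A quick sketch using a simple Riemann sum (the names are mine):

```python
import numpy as np

def epanechnikov(t):
    return np.where(np.abs(t) < 1, 0.75 * (1 - t**2), 0.0)

t = np.linspace(-1, 1, 200001)
dt = t[1] - t[0]
sigma2_D = np.sum(t**2 * epanechnikov(t)) * dt   # integral of t^2 D(t) dt, exact value 1/5
R_D = np.sum(epanechnikov(t)**2) * dt            # integral of D(t)^2 dt,   exact value 3/5
print(sigma2_D, R_D)                             # approx 0.2, 0.6
```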

  11. Selecting the width of a kernel: bias and variance

      Note:
      - the bias is a multiple of λ²: λ → 0 reduces the bias;
      - the variance is a multiple of 1/(Nλ): λ → ∞ reduces the variance.

      The quantities g(x) and f''(x) are unknown; otherwise we could use

          $\lambda_{opt} = \left( \frac{\sigma^2 R_D}{\sigma^4_D\, f''(x)^2\, g(x)\, N} \right)^{1/5};$

      note that λ must tend to 0 with rate N^{-1/5} (i.e., very slowly).
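
      To see the N^{-1/5} rate in numbers, here is λ_opt evaluated in a toy setting where everything is known: f(x) = sin(4x), so f''(x) = -16 sin(4x), a uniform design density g on (0, 1), the Epanechnikov kernel and σ = 0.3. All of these choices are illustrative, not from the slides:

```python
import numpy as np

# toy plug-in evaluation of lambda_opt at a single point x0
sigma2 = 0.3**2            # noise variance (assumed known here)
sigma2_D, R_D = 1/5, 3/5   # Epanechnikov kernel constants
x0 = 0.5
f2 = -16 * np.sin(4 * x0)  # f''(x0) for f(x) = sin(4x)
g = 1.0                    # uniform design density on (0, 1)

for N in (100, 1000, 10000):
    lam_opt = (sigma2 * R_D / (sigma2_D**2 * f2**2 * g * N)) ** (1/5)
    print(N, round(lam_opt, 3))   # shrinks like N^(-1/5), i.e. very slowly
```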

  12. Selecting the width of a kernel: AIC

      In any case, local smoothers are linear estimators,

          $\hat f = S_\lambda y,$

      since S_λ, the smoothing matrix, does not depend on y. Therefore an Akaike Information Criterion can be implemented,

          $AIC = \log \hat\sigma^2 + 2\, \mathrm{trace}\{S_\lambda\},$

      where trace{S_λ} gives the effective degrees of freedom. Otherwise, it is always possible to use a cross-validation procedure.
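
      Because the fit is linear in y, the smoothing matrix S_λ can be formed explicitly and its trace used as the effective degrees of freedom. A minimal sketch of the criterion for the Nadaraya-Watson smoother; the function names, kernel choice and toy data are my own assumptions:

```python
import numpy as np

def epanechnikov(t):
    return np.where(np.abs(t) < 1, 0.75 * (1 - t**2), 0.0)

def smoother_matrix(x, lam, D=epanechnikov):
    # row i of S_lambda holds the normalised weights K_lambda(x_i, x_j) / sum_j K_lambda(x_i, x_j)
    K = D(np.abs(x[:, None] - x[None, :]) / lam)
    return K / K.sum(axis=1, keepdims=True)

def aic(x, y, lam):
    S = smoother_matrix(x, lam)
    sigma2_hat = np.mean((y - S @ y) ** 2)
    df = np.trace(S)                 # effective degrees of freedom
    # penalty in the form written on the slide; some texts scale trace{S_lambda} by 1/N
    return np.log(sigma2_hat) + 2 * df

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * x) + rng.normal(0, 0.3, 100)
for lam in (0.05, 0.1, 0.2, 0.4):
    print(lam, round(aic(x, y, lam), 2))
```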

  13. One dimensional kernel smoothers: other issues

      Other points to consider:
      - boundary issues:
        - estimates are less accurate close to the boundaries;
        - fewer observations;
        - asymmetry in the kernel;
      - ties in the x_i's:
        - possibly more weight on a single x_i;
        - there can be different y_i's for the same x_i.

  14. Local linear regression: problems at the boundaries (figure slide)

  15. Local linear regression: problems at the boundaries

      By fitting a straight line, we correct the boundary problem to first order.

      ↓ Local linear regression

      Locally weighted linear regression solves, at each target point x_0,

          $\min_{\alpha(x_0),\, \beta(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i)\, [y_i - \alpha(x_0) - \beta(x_0) x_i]^2.$

      The estimate is f̂(x_0) = α̂(x_0) + β̂(x_0) x_0:
      - the model is fit on all data belonging to the support of K_λ(x_0, ·);
      - it is only evaluated at x_0.
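
      The weighted least-squares problem can be solved directly at each target point. A minimal sketch; the tricube kernel, the toy data and the name `local_linear` are illustrative assumptions:

```python
import numpy as np

def tricube(t):
    return np.where(np.abs(t) < 1, 70/81 * (1 - np.abs(t)**3)**3, 0.0)

def local_linear(x0, x, y, lam, D=tricube):
    # minimise sum_i K_lambda(x0, x_i) [y_i - alpha - beta * x_i]^2
    w = D(np.abs(x - x0) / lam)
    B = np.column_stack([np.ones_like(x), x])          # B = (1, X)
    WB = w[:, None] * B
    alpha, beta = np.linalg.solve(B.T @ WB, WB.T @ y)  # weighted normal equations
    return alpha + beta * x0                           # evaluate the fit at x0 only

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * x) + rng.normal(0, 0.3, 100)
print(local_linear(0.0, x, y, lam=0.2))   # x0 = 0: a boundary point
```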

  16. Local linear regression: estimation

      Estimation:

          $\hat f(x_0) = b(x_0)^T (B^T W(x_0) B)^{-1} B^T W(x_0)\, y = \sum_{i=1}^N l_i(x_0)\, y_i,$

      where:
      - b(x_0)^T = (1, x_0);
      - B = (1, X), with a vector of ones as first column;
      - W(x_0) is the N × N diagonal matrix with i-th diagonal element K_λ(x_0, x_i);
      - f̂(x_0) is linear in y (l_i(x_0) does not depend on y_i);
      - the weights l_i(x_0) are sometimes called the equivalent kernel:
        - they combine the weighting kernel K_λ(x_0, ·) and the least-squares operator.
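
      The matrix form also gives the equivalent-kernel weights explicitly, which makes the linearity in y visible. A sketch with the same illustrative tricube kernel and toy data as above:

```python
import numpy as np

def tricube(t):
    return np.where(np.abs(t) < 1, 70/81 * (1 - np.abs(t)**3)**3, 0.0)

def equivalent_kernel(x0, x, lam, D=tricube):
    # l(x0)^T = b(x0)^T (B^T W(x0) B)^{-1} B^T W(x0)
    b0 = np.array([1.0, x0])
    B = np.column_stack([np.ones_like(x), x])
    W = np.diag(D(np.abs(x - x0) / lam))
    return b0 @ np.linalg.solve(B.T @ W @ B, B.T @ W)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * x) + rng.normal(0, 0.3, 100)
l = equivalent_kernel(0.0, x, lam=0.2)
print(l @ y)   # identical to the local_linear fit: f_hat(x0) is linear in y
```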

  17. Local linear regression: bias correction due to asymmetry (figure slide)

  18. Local linear regression: bias

      Using a Taylor expansion of f(x_i) around x_0,

          $E[\hat f(x_0)] = \sum_{i=1}^N l_i(x_0)\, f(x_i)$
          $\quad = f(x_0) \sum_{i=1}^N l_i(x_0) + f'(x_0) \sum_{i=1}^N (x_i - x_0)\, l_i(x_0) + \frac{f''(x_0)}{2} \sum_{i=1}^N (x_i - x_0)^2\, l_i(x_0) + \dots \qquad (2)$

      For local linear regression,
      - $\sum_{i=1}^N l_i(x_0) = 1$;
      - $\sum_{i=1}^N (x_i - x_0)\, l_i(x_0) = 0$.

      Therefore,

          $E[\hat f(x_0)] - f(x_0) = \frac{f''(x_0)}{2} \sum_{i=1}^N (x_i - x_0)^2\, l_i(x_0) + \dots$
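
      The two moment identities that remove the constant and first-order terms in (2) can be verified numerically from the equivalent-kernel weights. The helper repeats the previous sketch so that the snippet runs on its own; the setup remains illustrative:

```python
import numpy as np

def tricube(t):
    return np.where(np.abs(t) < 1, 70/81 * (1 - np.abs(t)**3)**3, 0.0)

def equivalent_kernel(x0, x, lam, D=tricube):
    b0 = np.array([1.0, x0])
    B = np.column_stack([np.ones_like(x), x])
    W = np.diag(D(np.abs(x - x0) / lam))
    return b0 @ np.linalg.solve(B.T @ W @ B, B.T @ W)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 100))
x0 = 0.0
l = equivalent_kernel(x0, x, lam=0.2)
print(np.sum(l))                # sum_i l_i(x0)             -> 1.0
print(np.sum((x - x0) * l))     # sum_i (x_i - x0) l_i(x0)  -> 0.0
print(np.sum((x - x0)**2 * l))  # quadratic term driving the remaining bias
```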

  19. Local polynomial regression: bias

      Why limit ourselves to a linear fit? Consider

          $\min_{\alpha(x_0),\, \beta_1(x_0), \dots, \beta_d(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i) \left[ y_i - \alpha(x_0) - \sum_{j=1}^d \beta_j(x_0)\, x_i^j \right]^2,$

      with solution $\hat f(x_0) = \hat\alpha(x_0) + \sum_{j=1}^d \hat\beta_j(x_0)\, x_0^j$.

      - It can be shown, using (2), that the bias only involves components of degree d + 1;
      - in contrast to local linear regression, the fit tends to be closer to the true function in regions of high curvature:
        - no "trimming the hills and filling the valleys" effect.
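
      Extending the earlier sketch to degree d only changes the basis to (1, x, ..., x^d); everything else is the same weighted least-squares fit. Names, kernel and data are again illustrative:

```python
import numpy as np

def tricube(t):
    return np.where(np.abs(t) < 1, 70/81 * (1 - np.abs(t)**3)**3, 0.0)

def local_poly(x0, x, y, lam, d, D=tricube):
    # weighted least squares on the basis (1, x, ..., x^d), evaluated at x0
    w = D(np.abs(x - x0) / lam)
    B = np.vander(x, N=d + 1, increasing=True)   # columns 1, x, ..., x^d
    WB = w[:, None] * B
    theta = np.linalg.solve(B.T @ WB, WB.T @ y)
    return np.vander(np.array([x0]), N=d + 1, increasing=True)[0] @ theta

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(12 * x) + rng.normal(0, 0.3, 200)     # high-curvature signal
for d in (1, 2, 3):
    print(d, round(local_poly(0.25, x, y, lam=0.15, d=d), 3))
```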

  20. Local polynomial regression: regions with high curvature (figure slide)

  21. Local polynomial regression: bias-variance trade-off

      Not surprisingly, there is a price to pay for the smaller bias. Assuming the model y_i = f(x_i) + ε_i, where the ε_i are i.i.d. with mean 0 and variance σ²,

          $\mathrm{Var}(\hat f(x_0)) = \sigma^2\, \|l(x_0)\|^2.$

      It can be shown that ‖l(x_0)‖ increases with d ⇒ bias-variance trade-off in the choice of d.
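
      The growth of ‖l(x_0)‖² with the degree d can be computed directly from the degree-d equivalent-kernel weights, making the variance side of the trade-off visible. A sketch under the same illustrative setup, evaluated at a point near the boundary where the effect is clearest:

```python
import numpy as np

def tricube(t):
    return np.where(np.abs(t) < 1, 70/81 * (1 - np.abs(t)**3)**3, 0.0)

def l_weights(x0, x, lam, d, D=tricube):
    # l(x0)^T = b(x0)^T (B^T W B)^{-1} B^T W for the degree-d local fit
    B = np.vander(x, N=d + 1, increasing=True)
    W = np.diag(D(np.abs(x - x0) / lam))
    b0 = np.vander(np.array([x0]), N=d + 1, increasing=True)[0]
    return b0 @ np.linalg.solve(B.T @ W @ B, B.T @ W)

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 1, 200))
for d in (0, 1, 2, 3):
    l = l_weights(0.05, x, lam=0.2, d=d)
    print(d, round(np.sum(l**2), 4))   # ||l(x0)||^2 grows with d -> larger variance
```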

  22. Local polynomial regression: variance (figure slide)

  23. Local polynomial regression: final remarks

      Some final remarks:
      - local linear fits help dramatically in alleviating boundary issues;
      - quadratic fits do a little better, but increase the variance;
      - quadratic fits solve issues in regions of high curvature;
      - asymptotic analyses suggest that polynomials of odd degree should be preferred to those of even degree:
        - the MSE is asymptotically dominated by boundary effects;
      - in any case, the choice of d is problem specific.
