

SLIDE 1

STK-IN4300 Statistical Learning Methods in Data Science

Riccardo De Bin

debin@math.uio.no


SLIDE 2

Outline of the lecture

Kernel Smoothing Methods:
- One dimensional kernel smoothers
- Selecting the width of a kernel
- Local linear regression
- Local polynomial regression
- Local regression in R^p
- Structured local regression models in R^p
- Kernel density estimation
- Mixture models for density estimation
- Nonparametric density estimation with a parametric start


SLIDE 3

One dimensional kernel smoothers: from kNN to kernel smoothers

When we introduced the kNN algorithm,

$$\hat{f}(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x)),$$

- justified as an estimate of $E[Y \mid X = x]$.

Drawbacks:
- ugly discontinuities;
- the same weight is given to all points in the neighbourhood, regardless of their distance to $x$.


SLIDE 4

One dimensional kernel smoothers: definition

Alternative: weight the effect of each point based on its distance,

$$\hat{f}(x_0) = \frac{\sum_{i=1}^N K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^N K_\lambda(x_0, x_i)}, \quad \text{where } K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{\lambda}\right). \tag{1}$$

Here:
- $D(\cdot)$ is called the kernel;
- $\lambda$ is the bandwidth or smoothing parameter.

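As an illustration, a minimal sketch of the smoother (1) in Python with a Gaussian kernel; the function names and the toy data are our own, not part of the lecture:

```python
import numpy as np

def gaussian_kernel(t):
    """D(t): standard normal density."""
    return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)

def kernel_smoother(x0, x, y, lam, D=gaussian_kernel):
    """Kernel-weighted average estimate of f(x0), equation (1)."""
    w = D(np.abs(x - x0) / lam)         # weights K_lambda(x0, x_i)
    return np.sum(w * y) / np.sum(w)    # weighted average of the y_i

# toy data: noisy sine curve
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 2.0 * np.pi, 100))
y = np.sin(x) + rng.normal(0.0, 0.3, 100)
f_hat = np.array([kernel_smoother(x0, x, y, lam=0.4) for x0 in x])
```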

SLIDE 5

One dimensional kernel smoothers: comparison


SLIDE 6

One dimensional kernel smoothers: typical kernels

We need to choose $D(\cdot)$:
- symmetric around $x_0$;
- decays smoothly with the distance.

Typical choices:

Kernel         D(t)                   Support
Normal         (1/√(2π)) exp(−t²/2)   ℝ
Rectangular    1/2                    (−1, 1)
Epanechnikov   (3/4)(1 − t²)          (−1, 1)
Biquadratic    (15/16)(1 − t²)²       (−1, 1)
Tricubic       (70/81)(1 − |t|³)³     (−1, 1)

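The compactly supported kernels in the table translate directly into code; a sketch with our own naming, each vanishing outside (−1, 1):

```python
import numpy as np

def rectangular(t):
    return np.where(np.abs(t) < 1, 0.5, 0.0)

def epanechnikov(t):
    return np.where(np.abs(t) < 1, 0.75 * (1.0 - t**2), 0.0)

def biquadratic(t):
    return np.where(np.abs(t) < 1, (15.0 / 16.0) * (1.0 - t**2)**2, 0.0)

def tricubic(t):
    return np.where(np.abs(t) < 1, (70.0 / 81.0) * (1.0 - np.abs(t)**3)**3, 0.0)
```

Any of these can be passed as the argument D of the kernel_smoother sketch above.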

SLIDE 7

One dimensional kernel smoothers: comparison


SLIDE 8

One dimensional kernel smoothers: choice of the smoothing parameter

Choice of the bandwidth $\lambda$:
- it controls how large the interval around $x_0$ is:
  - for the Epanechnikov, biquadratic or tricubic kernels → the radius of the support;
  - for the Gaussian kernel → the standard deviation;
- large values imply lower variance but higher bias:
  - $\lambda$ small → $\hat{f}(x_0)$ based on few points, the $y_i$'s closest to $x_0$;
  - $\lambda$ large → more points → stronger effect of averaging;
- alternatively, adapt to the local density (fix $k$ as in kNN):
  - expressed by substituting $\lambda$ with $h_\lambda(x_0)$ in (1);
  - the bias stays constant, while the variance is inversely proportional to the local density.


SLIDE 9

One dimensional kernel smoothers: effect of the smoothing parameter


SLIDE 10

Selecting the width of a kernel: bias and variance

Assume $y_i = f(x_i) + \epsilon_i$, with the $\epsilon_i$ i.i.d. such that $E[\epsilon_i] = 0$ and $\mathrm{Var}[\epsilon_i] = \sigma^2$. Then

$$E[\hat{f}(x)] \approx f(x) + \frac{\lambda^2}{2}\,\sigma_D^2\, f''(x)$$

and

$$\mathrm{Var}[\hat{f}(x)] \approx \frac{\sigma^2}{N\lambda}\,\frac{R_D}{g(x)}$$

for $N$ large and $\lambda$ sufficiently close to 0 (Azzalini & Scarpa, 2012). Here:
- $\sigma_D^2 = \int t^2 D(t)\,dt$;
- $R_D = \int D(t)^2\,dt$;
- $g(x)$ is the density from which the $x_i$ were sampled.


SLIDE 11

Selecting the width of a kernel: bias and variance

Note:
- the bias is a multiple of $\lambda^2$:
  - $\lambda \to 0$ reduces the bias;
- the variance is a multiple of $\frac{1}{N\lambda}$:
  - $\lambda \to \infty$ reduces the variance.

The quantities $g(x)$ and $f''(x)$ are unknown; otherwise, the optimal bandwidth would be

$$\lambda_{\mathrm{opt}} = \left(\frac{\sigma^2 R_D}{\sigma_D^4\, f''(x)^2\, g(x)\, N}\right)^{1/5};$$

note that $\lambda$ must tend to 0 at rate $N^{-1/5}$ (i.e., very slowly).


SLIDE 12

Selecting the width of a kernel: AIC

In any case, local smoothers are linear estimators,

$$\hat{\mathbf{f}} = S_\lambda y,$$

since $S_\lambda$, the smoothing matrix, does not depend on $y$. Therefore, an Akaike information criterion can be implemented,

$$\mathrm{AIC} = \log\hat{\sigma} + 2\,\mathrm{trace}\{S_\lambda\},$$

where $\mathrm{trace}\{S_\lambda\}$ gives the effective degrees of freedom. Otherwise, it is always possible to implement a cross-validation procedure (a sketch follows below).

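Since the smoother is linear, both the effective degrees of freedom and a leave-one-out cross-validation score come cheaply from the smoothing matrix. A sketch under our own naming, continuing the first example (it reuses gaussian_kernel and the toy x, y); the leave-one-out shortcut for linear smoothers divides each residual by 1 − S_ii:

```python
import numpy as np

def smoother_matrix(x, lam, D=gaussian_kernel):
    """S_lambda: row i holds the weights used to fit the point x_i."""
    W = D(np.abs(x[:, None] - x[None, :]) / lam)   # K_lambda(x_i, x_j)
    return W / W.sum(axis=1, keepdims=True)

def loo_cv(x, y, lam):
    """Leave-one-out CV score via the linear-smoother shortcut."""
    S = smoother_matrix(x, lam)
    resid = y - S @ y
    return np.mean((resid / (1.0 - np.diag(S)))**2)

lams = np.linspace(0.1, 1.5, 30)
lam_best = min(lams, key=lambda lam: loo_cv(x, y, lam))
edf = np.trace(smoother_matrix(x, lam_best))   # effective degrees of freedom
```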

SLIDE 13

One dimensional kernel smoothers: other issues

Other points to consider:
- boundary issues:
  - estimates are less accurate close to the boundaries;
  - fewer observations;
  - asymmetry in the kernel;
- ties in the $x_i$'s:
  - possibly more weight on a single $x_i$;
  - there can be different $y_i$ for the same $x_i$.

SLIDE 14

Local linear regression: problems at the boundaries


SLIDE 15

Local linear regression: problems at the boundaries

By fitting a straight line, we solve the problem to the first order.

↓ Local linear regression

Locally weighted linear regression solves, at each target point $x_0$,

$$\min_{\alpha(x_0),\,\beta(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i)\,\left[y_i - \alpha(x_0) - \beta(x_0)\, x_i\right]^2.$$

The estimate is $\hat{f}(x_0) = \hat\alpha(x_0) + \hat\beta(x_0)\, x_0$:
- the model is fit on all data belonging to the support of $K_\lambda$;
- it is only evaluated at $x_0$.


SLIDE 16

Local linear regression: estimation

Estimation:

$$\hat{f}(x_0) = b(x_0)^T \left(B^T W(x_0)\, B\right)^{-1} B^T W(x_0)\, y = \sum_{i=1}^N l_i(x_0)\, y_i,$$

where:
- $b(x_0)^T = (1, x_0)$;
- $B = (\mathbf{1}, X)$;
- $W(x_0)$ is an $N \times N$ diagonal matrix with $i$-th diagonal element $K_\lambda(x_0, x_i)$;
- $\hat{f}(x_0)$ is linear in $y$ ($l_i(x_0)$ does not depend on $y_i$);
- the weights $l_i(x_0)$ are sometimes called the equivalent kernel:
  - they combine the weighting kernel $K_\lambda(x_0, \cdot)$ and the least-squares operation (a sketch follows below).
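A minimal sketch of this estimator in Python (our own naming, reusing gaussian_kernel from the first example), computing the fit directly from the weighted least-squares formula:

```python
import numpy as np

def local_linear(x0, x, y, lam, D=gaussian_kernel):
    """Local linear fit at x0: b(x0)^T (B^T W B)^{-1} B^T W y."""
    b0 = np.array([1.0, x0])                    # b(x0)
    B = np.column_stack([np.ones_like(x), x])   # B = (1, X)
    w = D(np.abs(x - x0) / lam)                 # diagonal of W(x0)
    BtW = B.T * w                               # B^T W(x0)
    coef = np.linalg.solve(BtW @ B, BtW @ y)    # (alpha_hat, beta_hat)
    return b0 @ coef
```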

SLIDE 17

Local linear regression: bias correction due to asymmetry


SLIDE 18

Local linear regression: bias

Using a Taylor expansion of $f(x_i)$ around $x_0$,

$$E[\hat{f}(x_0)] = \sum_{i=1}^N l_i(x_0)\, f(x_i) = f(x_0) \sum_{i=1}^N l_i(x_0) + f'(x_0) \sum_{i=1}^N (x_i - x_0)\, l_i(x_0) + \frac{f''(x_0)}{2} \sum_{i=1}^N (x_i - x_0)^2\, l_i(x_0) + \dots \tag{2}$$

For local linear regression,
- $\sum_{i=1}^N l_i(x_0) = 1$;
- $\sum_{i=1}^N (x_i - x_0)\, l_i(x_0) = 0$.

Therefore,

$$E[\hat{f}(x_0)] - f(x_0) = \frac{f''(x_0)}{2} \sum_{i=1}^N (x_i - x_0)^2\, l_i(x_0) + \dots,$$

so the bias depends only on the quadratic and higher-order terms of the expansion.


SLIDE 19

Local polynomial regression: bias

Why limit ourselves to a linear fit? Consider

$$\min_{\alpha(x_0),\,\beta_1(x_0),\dots,\beta_d(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i)\left[ y_i - \alpha(x_0) - \sum_{j=1}^d \beta_j(x_0)\, x_i^j \right]^2,$$

with solution $\hat{f}(x_0) = \hat\alpha(x_0) + \sum_{j=1}^d \hat\beta_j(x_0)\, x_0^j$.

- it can be shown, using (2), that the bias only involves components of degree $d+1$ and higher;
- in contrast to local linear regression, the fit tends to be closer to the true function in regions with high curvature:
  - no "trimming the hills and filling the valleys" effect.
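The local linear sketch above generalizes to degree d by extending the design matrix with powers of x; again our own code, not the lecture's:

```python
import numpy as np

def local_poly(x0, x, y, lam, d=2, D=gaussian_kernel):
    """Local polynomial fit of degree d, evaluated at x0."""
    powers = np.arange(d + 1)
    B = x[:, None] ** powers      # columns: 1, x, x^2, ..., x^d
    b0 = x0 ** powers             # b(x0)
    w = D(np.abs(x - x0) / lam)   # kernel weights
    BtW = B.T * w
    coef = np.linalg.solve(BtW @ B, BtW @ y)
    return b0 @ coef
```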

SLIDE 20

Local polynomial regression: regions with high curvature


SLIDE 21

Local polynomial regression: bias-variance trade-off

Not surprisingly, there is a price to pay for less bias. Assuming a model $y_i = f(x_i) + \epsilon_i$, where the $\epsilon_i$ are i.i.d. with mean 0 and variance $\sigma^2$,

$$\mathrm{Var}(\hat{f}(x_0)) = \sigma^2\, \|l(x_0)\|^2.$$

It can be shown that $\|l(x_0)\|$ increases with $d$ ⇒ there is a bias-variance trade-off in the choice of $d$.


SLIDE 22

Local polynomial regression: variance


SLIDE 23

Local polynomial regression: final remarks

Some final remarks:
- local linear fits help dramatically in alleviating boundary issues;
- quadratic fits do a little better, but increase the variance;
- quadratic fits solve issues in high curvature regions;
- asymptotic analyses suggest that polynomials of odd degree should be preferred to those of even degree:
  - the MSE is asymptotically dominated by boundary effects;
- in any case, the choice of $d$ is problem specific.


SLIDE 24

Local regression in R^p: extension

Kernel smoothing and local regression can easily be generalized to more dimensions:
- average weighted by a kernel with support in $\mathbb{R}^p$;
- for local regression, fit a hyperplane locally.

With $d = 1$ and $p = 2$:
- $b(X) = (1, X_1, X_2)$.

With $d = 2$ and $p = 2$:
- $b(X) = (1, X_1, X_2, X_1^2, X_2^2, X_1 X_2)$.

At each $x_0$, solve

$$\min_{\beta(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i)\left[ y_i - b(x_i)^T \beta(x_0) \right]^2,$$

where $K_\lambda(x_0, x_i)$ is a radial function,

$$K_\lambda(x_0, x) = D\!\left(\frac{\|x - x_0\|}{\lambda}\right).$$

Since $\|\cdot\|$ is the Euclidean norm, each predictor $X_j$ should be standardized.


SLIDE 25

Local regression in R^p: example


SLIDE 26

Local regression in R^p: remarks

Some remarks:
- boundary issues are even more dramatic than in one dimension:
  - the fraction of points at the boundary increases to 1 as the dimension grows;
  - curse of dimensionality;
- local polynomials still perform boundary corrections up to the desired order;
- local regression does not really make sense for $p > 3$:
  - it is impossible to maintain both localness (small bias) and a sizeable sample in the neighbourhood (small variance);
  - again, the curse of dimensionality.

SLIDE 27

Structured local regression models in R^p: structured kernels

When the ratio between dimension and sample size is too large:

Structured kernels

$$K_{\lambda, A}(x_0, x) = D\!\left(\frac{(x - x_0)^T A\,(x - x_0)}{\lambda}\right)$$

- $A$ is a positive semidefinite matrix;
- we can add structure through $A$:
  - $A$ diagonal: increase or decrease the importance of the predictor $X_j$ by increasing/decreasing $a_{jj}$;
  - low-rank versions of $A$ → projection pursuit.
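A small sketch (ours) of a structured kernel with diagonal A, applying D to the quadratic form exactly as in the display above; a_diag holds the weights a_jj and is an assumption of this sketch:

```python
import numpy as np

def structured_kernel(x0, x, a_diag, lam, D=gaussian_kernel):
    """K_{lambda,A}(x0, x) with A = diag(a_diag); x has shape (N, p)."""
    d = x - x0
    quad = np.sum(a_diag * d * d, axis=-1)   # (x - x0)^T A (x - x0)
    return D(quad / lam)
```

Setting an entry of a_diag to zero removes the corresponding predictor from the distance, so it no longer influences the local weights.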

SLIDE 28

Structured local regression models in R^p: structured regression functions

Structured regression functions:

$$f(X_1, \dots, X_p) = \alpha + \sum_{j=1}^p g_j(X_j) + \sum_{k < \ell} g_{k\ell}(X_k, X_\ell) + \dots$$

- we can simplify the structure;
- examples:
  - remove all interaction terms, $f(X_1, \dots, X_p) = \alpha + \sum_{j=1}^p g_j(X_j)$;
  - keep only the first-order interactions, $f(X_1, \dots, X_p) = \alpha + \sum_{j=1}^p g_j(X_j) + \sum_{k<\ell} g_{k\ell}(X_k, X_\ell)$;
  - ...

SLIDE 29

Structured local regression models in R^p: varying coefficient models

Varying coefficient models

The varying coefficient models:
- are a special case of structured regression functions;
- consider only $q < p$ predictors; the remaining ones are collected in $Z$;
- assume the conditionally linear model,

$$f(X) = \alpha(Z) + \beta_1(Z)\, X_1 + \dots + \beta_q(Z)\, X_q;$$

- given $Z$, it is a linear model:
  - solution via the least squares estimator (a sketch follows below);
- the coefficients can vary with $Z$.

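A sketch of how such a model can be fit (our own code, reusing gaussian_kernel from the first example): at each target value z0, weight the observations by a kernel in Z and solve a weighted least-squares problem in (X_1, ..., X_q):

```python
import numpy as np

def varying_coef_fit(z0, Z, X, y, lam, D=gaussian_kernel):
    """Estimate (alpha(z0), beta_1(z0), ..., beta_q(z0)); X has shape (N, q)."""
    B = np.column_stack([np.ones(len(y)), X])   # intercept plus X_1..X_q
    w = D(np.abs(Z - z0) / lam)                 # kernel weights in Z
    BtW = B.T * w
    return np.linalg.solve(BtW @ B, BtW @ y)    # coefficients at z0
```

Evaluating this over a grid of z0 values traces out the coefficient curves alpha(z) and beta_j(z).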

SLIDE 30

Structured local regression models in R^p: structured regression functions


SLIDE 31

Kernel density estimation: density estimation

Suppose we have a random sample, $x_i \in \mathbb{R}$, $i = 1, \dots, N$, and want to estimate its density $f_X(x)$. An estimate at each point $x_0$ is

$$\hat{f}_X(x_0) = \frac{\#\{x_i \in N(x_0)\}}{N\lambda},$$

where $N(x_0)$ is a small metric neighbourhood of width $\lambda$ around $x_0$. This estimate is bumpy → the smooth Parzen estimate is preferred,

$$\hat{f}_X(x_0) = \frac{1}{N\lambda} \sum_{i=1}^N K_\lambda(x_0, x_i),$$

in which closer observations contribute more.

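A minimal sketch of the smooth Parzen estimate with a Gaussian kernel (our own naming):

```python
import numpy as np

def parzen_density(x0, x, lam):
    """f_hat_X(x0) = (1 / (N * lam)) * sum_i phi((x0 - x_i) / lam)."""
    t = (x0 - x) / lam
    k = np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)   # phi(t)
    return np.sum(k) / (len(x) * lam)
```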

SLIDE 32

Kernel density estimation: choice of K_λ(x_0, x)

For the smooth Parzen estimate, the Gaussian kernel is often used,

$$K_\lambda(x_0, x) = \phi\!\left(\frac{|x - x_0|}{\lambda}\right),$$

where $\phi$ is the density of a standard normal. Using the density of a normal with mean 0 and standard deviation $\lambda$, denoted $\phi_\lambda$,

$$\hat{f}_X(x) = \frac{1}{N} \sum_{i=1}^N \phi_\lambda(x - x_i) = (\hat{F} \star \phi_\lambda)(x),$$

the convolution of the sample empirical distribution $\hat{F}$ with $\phi_\lambda$:
- this smooths $\hat{F}$ by adding independent Gaussian noise to each $x_i$.


SLIDE 33

Kernel density estimation: example


SLIDE 34

Mixture models for density estimation: density estimation

The density $f(x)$ can be considered a mixture of distributions,

$$f(x) = \sum_{m=1}^M \alpha_m\, g(x; \mu_m, \Sigma_m),$$

where:
- the $\alpha_m$ are the mixing proportions, $\sum_{m=1}^M \alpha_m = 1$;
- each density $g(\cdot)$ has mean $\mu_m$ and covariance $\Sigma_m$;
- almost always $g(x; \mu_m, \Sigma_m) = \phi(x; \mu_m, \Sigma_m)$
  → the Gaussian mixture model.

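As a sketch of how such a mixture can be fit, a compact EM algorithm for the one-dimensional Gaussian case (our own code; the lecture does not prescribe a fitting procedure here):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, M=2, n_iter=100, seed=0):
    """EM for a one-dimensional Gaussian mixture with M components."""
    rng = np.random.default_rng(seed)
    alpha = np.full(M, 1.0 / M)              # mixing proportions
    mu = rng.choice(x, M, replace=False)     # initial means
    sigma = np.full(M, x.std())
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = alpha * norm.pdf(x[:, None], mu, sigma)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted updates of the parameters
        Nm = gamma.sum(axis=0)
        alpha = Nm / len(x)
        mu = (gamma * x[:, None]).sum(axis=0) / Nm
        sigma = np.sqrt((gamma * (x[:, None] - mu)**2).sum(axis=0) / Nm)
    return alpha, mu, sigma
```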

SLIDE 35

Mixture models for density estimation: example


SLIDE 36

Semiparametric density estimation: Hjort & Glad (1995)

Hjort & Glad (1995) proposed a different option:
- start with a parametric density estimate $f_0(x, \hat\theta)$;
- multiply it by a correction term $r(x) = f(x)/f_0(x, \hat\theta)$;
- estimate the correction term with a kernel smoother,

$$\hat{r}(x) = \frac{1}{N} \sum_{i=1}^N \frac{K_\lambda(x, x_i)}{f_0(x_i, \hat\theta)};$$

- the resulting density estimate is

$$\hat{f}_{HG}(x) = f_0(x, \hat\theta)\,\hat{r}(x) = \frac{1}{N} \sum_{i=1}^N K_\lambda(x, x_i)\, \frac{f_0(x, \hat\theta)}{f_0(x_i, \hat\theta)}.$$

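A sketch of the Hjort–Glad estimate with a Gaussian parametric start (our own code; maximum-likelihood estimates of the start's parameters are plugged in):

```python
import numpy as np
from scipy.stats import norm

def hjort_glad(x0, x, lam):
    """f_hat_HG(x0) = f0(x0) * r_hat(x0), with a Gaussian start f0."""
    mu_hat, sd_hat = x.mean(), x.std()         # theta_hat for the start
    f0 = lambda u: norm.pdf(u, mu_hat, sd_hat)
    t = (x0 - x) / lam
    K = np.exp(-0.5 * t**2) / (np.sqrt(2.0 * np.pi) * lam)   # phi_lambda(x0 - x_i)
    return np.mean(K * f0(x0) / f0(x))
```

With f0 replaced by a constant, the estimate reduces to the plain kernel density estimate, as noted on the next slide.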

SLIDE 37

Semiparametric density estimation: Hjort & Glad (1995)

Note that:
- the initial parametric estimate is not necessarily a good approximation to the true density:
  - the method often works well with "bad" parametric starts;
  - the better the approximation, the better the result, though;
- $f_0(x, \hat\theta) = \text{constant}$ → $f_0(x) \sim \mathrm{Unif}$:
  - back to the classic kernel estimator.

SLIDE 38

Semiparametric density estimation: properties

Consider $\hat{f}_{HG}(x)$'s variance,

$$\mathrm{Var}(\hat{f}_{HG}(x)) = \mathrm{Var}(\hat{f}_{\mathrm{kernel}}(x)) + O\!\left(\frac{\lambda}{N} + \frac{1}{N^2}\right):$$

- $\hat{f}_{HG}(x)$ and $\hat{f}_{\mathrm{kernel}}(x)$ have approximately the same variance;

and its bias,

$$E[\hat{f}_{HG}(x)] \approx f(x) + \frac{\lambda^2}{2}\,\sigma_D^2\, f_0(x)\, r''(x):$$

- same order as the bias of $\hat{f}_{\mathrm{kernel}}(x)$, i.e., $O(\lambda^2)$;
- it is proportional to $f_0(x)\, r''(x)$ rather than $f''(x)$;
- it is often smaller: by the product rule, $f''(x) = f_0''(x)\, r(x) + 2 f_0'(x)\, r'(x) + f_0(x)\, r''(x)$, so $f_0(x)\, r''(x)$ is only one part of the full curvature:
  - when $f_0(x)$ is a good guess, better performance!

SLIDE 39

Semiparametric density estimation: derivation example

Example with a Gaussian start:

$$\hat{f}_{HG}(x) = f_0(x, \hat\theta)\,\hat{r}(x) = \frac{1}{N} \sum_{i=1}^N K_\lambda(x, x_i)\, \frac{f_0(x, \hat\theta)}{f_0(x_i, \hat\theta)} = \frac{1}{\hat\sigma}\,\phi\!\left(\frac{x - \hat\mu}{\hat\sigma}\right) \frac{1}{N} \sum_{i=1}^N \frac{K_\lambda(x, x_i)}{\frac{1}{\hat\sigma}\,\phi\!\left(\frac{x_i - \hat\mu}{\hat\sigma}\right)} = \frac{1}{N} \sum_{i=1}^N K_\lambda(x, x_i)\, \frac{\exp\left\{-\frac{1}{2}\left(\frac{x - \hat\mu}{\hat\sigma}\right)^2\right\}}{\exp\left\{-\frac{1}{2}\left(\frac{x_i - \hat\mu}{\hat\sigma}\right)^2\right\}}.$$


SLIDE 40

Semiparametric density estimation: data example

[Figure: density estimates of the concentration of theophylline; N = 132, bandwidth = 0.1849 ('ucv'). Curves: start gamma/kernel gamma, start gamma/kernel Gaussian, start Gaussian/kernel Gaussian, start Gaussian/kernel gamma.]


SLIDE 41

References I

Azzalini, A. & Scarpa, B. (2012). Data Analysis and Data Mining: An Introduction. Oxford University Press, New York.

Hjort, N. L. & Glad, I. K. (1995). Nonparametric density estimation with a parametric start. The Annals of Statistics 23, 882–904.

Terrell, G. R. & Scott, D. W. (1992). Variable kernel density estimation. The Annals of Statistics 20, 1236–1265.
