Lecture 8: Kernel Density Estimation (2)
Applied Statistics 2015


SLIDE 1

Outline:
- Choice of bandwidth by Cross Validation
- Multivariate density estimators
- Assignments

SLIDE 2

Recap

A kernel density estimator is given by
$$\hat f_{n,h}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right).$$
The risk of the estimator is measured locally by the MSE and globally by the integrated MSE (MISE). Both have the following decomposition:
$$\text{Risk} = (\text{Bias})^2 + \text{Variance} = a h^4 + \frac{b}{nh} + \text{remaining term}.$$
Minimizing the risk yields the optimal bandwidth $h_{\mathrm{opt}}$, of order $n^{-1/5}$.
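As an illustration (not part of the slides), here is a minimal Python sketch of this estimator with a Gaussian kernel and Silverman's reference bandwidth of order $n^{-1/5}$; the simulated standard normal sample is an assumption made for the example:

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, a common choice of kernel K."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x, data, h):
    """f_hat_{n,h}(x) = (1/(n h)) * sum_i K((x - X_i)/h)."""
    data = np.asarray(data)
    n = data.size
    u = (np.asarray(x)[..., None] - data) / h  # broadcast over evaluation points
    return gaussian_kernel(u).sum(axis=-1) / (n * h)

rng = np.random.default_rng(0)
sample = rng.normal(size=200)
# Silverman's rule-of-thumb bandwidth, of order n^(-1/5)
h = 1.06 * sample.std() * sample.size ** (-1 / 5)
# estimate at 0; the true standard normal density there is 1/sqrt(2*pi) ~ 0.399
print(kde(0.0, sample, h))
```

Evaluating `kde` on a grid and plotting it against the true density is the usual way to inspect the effect of different choices of `h`.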

SLIDE 3

Recap

The trade-off between bias and variance is a common issue in smoothing problems. The bias increases and the variance decreases with the amount of smoothing, which is determined by the bandwidth $h$ in the kernel density estimator.

SLIDE 4

The general concept of cross validation (CV) was introduced in Stone (1974); it was not first suggested for density estimation. The basic idea of CV is very intuitive: select a part of the data to fit the model, then apply the fitted model to the rest of the data to assess goodness of fit.

For choosing the bandwidth of a density estimator, the procedure works as follows. Fix $h$. For each $j$, obtain the estimator based on the $(n-1)$ observations $\{X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_n\}$:
$$\hat f^{(j)}_{n,h}(x) = \frac{1}{(n-1)h} \sum_{i \neq j} K\!\left(\frac{x - X_i}{h}\right).$$
A CV score, as a measure of goodness of fit, is computed from $\{\hat f^{(j)}_{n,h}(X_j),\ j = 1, \ldots, n\}$.

Varying $h$ yields a function $CV(h)$, which is then maximized (or minimized) to obtain a CV bandwidth.

SLIDE 5

Maximum Likelihood CV

Let
$$\hat f^{(j)}_{n,h}(x) = \frac{1}{(n-1)h} \sum_{i \neq j} K\!\left(\frac{x - X_i}{h}\right)$$
be the estimated density based on all sample values except $X_j$. We apply the estimate $\hat f^{(j)}_{n,h}$ to $x = X_j$ to obtain $\hat f^{(j)}_{n,h}(X_j)$. Since $X_j$ was actually observed, a good choice of $h$ should give a large value of $\hat f^{(j)}_{n,h}(X_j)$. The rationale is similar to that of MLE. Define the CV likelihood as
$$\hat L(h) = \prod_{j=1}^{n} \hat f^{(j)}_{n,h}(X_j).$$
The maximum likelihood CV (MLCV) bandwidth is given by $h_{ML} = \arg\max_h \hat L(h)$.
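A short Python sketch (illustrative, not from the lecture) of MLCV: it evaluates the leave-one-out log-likelihood $\log \hat L(h)$, which has the same maximizer as $\hat L(h)$, on a bandwidth grid. The Gaussian kernel, the simulated data, and the grid are all assumptions made for the example:

```python
import numpy as np

def mlcv_score(data, h):
    """Log of the CV likelihood: sum_j log f_hat^{(j)}_{n,h}(X_j)."""
    data = np.asarray(data)
    n = data.size
    u = (data[:, None] - data[None, :]) / h       # u[j, i] = (X_j - X_i)/h
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # Gaussian kernel values
    np.fill_diagonal(k, 0.0)                      # exclude the i = j term
    loo = k.sum(axis=1) / ((n - 1) * h)           # f_hat^{(j)}(X_j)
    return np.log(loo).sum()

rng = np.random.default_rng(1)
sample = rng.normal(size=100)
grid = np.linspace(0.05, 1.5, 60)
h_ml = grid[int(np.argmax([mlcv_score(sample, h) for h in grid]))]
print(h_ml)  # the MLCV bandwidth on this grid
```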

SLIDE 6

Maximum Likelihood CV

It can be proven that, under some conditions on $f$ and $K$,
$$\int \big|\hat f_{n,h_{ML}}(x) - f(x)\big|\,dx \xrightarrow{a.s.} 0.$$
Remark. There are known examples where $\hat f_{n,h_{ML}}$ is inconsistent if $f$ has unbounded support.

SLIDE 7

Least squares CV

Consider
$$\mathrm{MISE}(\hat f_{n,h}) = E\!\int (\hat f_{n,h}(x) - f(x))^2\,dx = E\!\int \hat f^2_{n,h}(x)\,dx - 2\,E\!\int \hat f_{n,h}(x) f(x)\,dx + \int f(x)^2\,dx.$$
The last term does not depend on $h$. Thus we aim to find a good $h$ that minimizes
$$M(h) = E\!\int \hat f^2_{n,h}(x)\,dx - 2\,E\!\int \hat f_{n,h}(x) f(x)\,dx.$$
However, $M(h)$ depends on the unknown $f$, so we shall find an unbiased estimator of $M(h)$. Since the first term can be estimated without bias by $\int \hat f^2_{n,h}(x)\,dx$ itself, we only need an unbiased estimator of $E\int \hat f_{n,h}(x) f(x)\,dx$.

SLIDE 8

Least squares CV

It turns out that
$$\frac{1}{n}\sum_{j=1}^{n} \hat f^{(j)}_{n,h}(X_j)$$
is an unbiased estimator of $E\int \hat f_{n,h}(x) f(x)\,dx$:
$$E\left[\frac{1}{n}\sum_{j=1}^{n} \hat f^{(j)}_{n,h}(X_j)\right] = \frac{1}{n}\sum_{j=1}^{n} E\,\hat f^{(j)}_{n,h}(X_j) = E\,\hat f^{(1)}_{n,h}(X_1) = E\left[\frac{1}{(n-1)h}\sum_{i=2}^{n} K\!\left(\frac{X_1 - X_i}{h}\right)\right]$$
$$= \frac{1}{h}\,E\,K\!\left(\frac{X_1 - X_2}{h}\right) = \iint \frac{1}{h} K\!\left(\frac{x - y}{h}\right) f(y) f(x)\,dy\,dx = \int \left[\int \frac{1}{h} K\!\left(\frac{x - y}{h}\right) f(y)\,dy\right] f(x)\,dx$$
$$= \int E\,\hat f_{n,h}(x)\, f(x)\,dx = E\!\int \hat f_{n,h}(x) f(x)\,dx.$$
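This identity can be checked numerically: conditional on the data, $\int \hat f_{n,h}(x) f(x)\,dx$ is the expectation of $\hat f_{n,h}$ at a fresh, independent draw from $f$, so both sides can be averaged over simulated samples. A small Monte Carlo sketch (not from the slides; the normal data, Gaussian kernel, and sample sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, h, reps = 30, 0.5, 2000

def K(u):
    """Gaussian kernel."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

loo_vals, plug_vals = [], []
for _ in range(reps):
    x = rng.normal(size=n)
    u = (x[:, None] - x[None, :]) / h
    k = K(u)
    np.fill_diagonal(k, 0.0)
    # leave-one-out statistic (1/n) sum_j f_hat^{(j)}(X_j)
    loo_vals.append((k.sum(axis=1) / ((n - 1) * h)).mean())
    # integral of f_hat * f, estimated by f_hat at a fresh draw X ~ f
    x_new = rng.normal()
    plug_vals.append(K((x_new - x) / h).sum() / (n * h))

# both averages estimate E int f_hat f dx; for N(0,1) data and a Gaussian
# kernel this expectation is the N(0, 2 + h^2) density at 0
print(np.mean(loo_vals), np.mean(plug_vals))
```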

SLIDE 9

Least squares CV

Let
$$\mathrm{LSCV}(h) = \int \hat f^2_{n,h}(x)\,dx - \frac{2}{n}\sum_{j=1}^{n} \hat f^{(j)}_{n,h}(X_j).$$
We have shown that for any $h > 0$, $E[\mathrm{LSCV}(h)] = M(h)$. $\mathrm{LSCV}(h)$ is the least squares cross validation score, and the LSCV bandwidth is defined as $h_{ls} = \arg\min_h \mathrm{LSCV}(h)$. For a given $h$, $\mathrm{LSCV}(h)$ can be computed from the sample. A computational formula:
$$\mathrm{LSCV}(h) = \frac{1}{n^2 h}\sum_{i=1}^{n}\sum_{j=1}^{n} \int K(y)\, K\!\left(\frac{X_i - X_j}{h} - y\right) dy \;-\; \frac{2}{n(n-1)h}\sum_{j=1}^{n}\sum_{i \neq j} K\!\left(\frac{X_i - X_j}{h}\right).$$
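For the Gaussian kernel the convolution term has a closed form, $\int K(y)K(u-y)\,dy$ being the $N(0,2)$ density at $u$, which makes the computational formula straightforward to code. A Python sketch (illustrative; the simulated sample and bandwidth grid are assumptions):

```python
import numpy as np

def lscv(data, h):
    """LSCV(h) for the Gaussian kernel, using (K*K)(u) = N(0,2) density at u."""
    x = np.asarray(data)
    n = x.size
    u = (x[:, None] - x[None, :]) / h
    conv = np.exp(-u**2 / 4) / np.sqrt(4 * np.pi)  # integral K(y) K(u - y) dy
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # kernel at u
    term1 = conv.sum() / (n**2 * h)                # = integral of f_hat^2
    # sum over i != j: subtract the n diagonal terms K(0)
    term2 = 2 * (k.sum() - n * k[0, 0]) / (n * (n - 1) * h)
    return term1 - term2

rng = np.random.default_rng(5)
sample = rng.normal(size=100)
grid = np.linspace(0.05, 1.5, 80)
scores = [lscv(sample, h) for h in grid]
h_ls = grid[int(np.argmin(scores))]
print(h_ls)  # the LSCV bandwidth on this grid
```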
SLIDE 10

Least squares CV

The resulting bandwidth $h_{ls}$, and thus the density estimator $\hat f_{n,h_{ls}}(x)$, is asymptotically optimal.

Theorem (Stone 1984). Assume the following: (a) $f$ is uniformly bounded; (b) $K$ is a kernel (a density symmetric around zero) with a unique mode at zero; (c) $K$ is compactly supported; (d) $K$ is Hölder continuous of order $\beta$, i.e. for $x_1, x_2 \in \mathbb{R}$, $|K(x_1) - K(x_2)| \le C\,|x_1 - x_2|^{\beta}$. Then
$$\frac{\int (\hat f_{n,h_{ls}}(x) - f(x))^2\,dx}{\int (\hat f_{n,h_{opt}}(x) - f(x))^2\,dx} - 1 \xrightarrow{a.s.} 0.$$
Remark. This result is regarded as a landmark in the cross-validation literature. The theorem asserts optimal performance of LSCV without practically any condition on $f$.

SLIDE 11

A few comments

All methods for choosing the smoothing parameter $h$ should be used with common sense. Recommended methods: the reference bandwidth and the cross validation approaches. In practice, always make plots and compare different choices of $h$.

SLIDE 12

Multivariate density estimators

(A bold letter denotes a vector in this section.) On the basis of $n$ i.i.d. random vectors $\mathbf{X}_i = (X_{i1}, \ldots, X_{id})$ from an unknown $F$, we wish to estimate $f$, the density of $F$. We consider $d$-dimensional kernel estimators: for $\mathbf{x} = (x_1, \ldots, x_d) \in \mathbb{R}^d$,
$$\hat f_n(\mathbf{x}) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left(\frac{\mathbf{x} - \mathbf{X}_i}{h}\right),$$
where the kernel $K$ is a $d$-dimensional density. In practice, $K$ is often taken to be a product kernel or an ellipsoidal kernel.

SLIDE 13

Multivariate density estimators

Product kernel: $K(\mathbf{x}) = \prod_{i=1}^{d} K_0(x_i)$, with $K_0$ a univariate kernel.

Ellipsoidal kernels:
- Multivariate normal density: $(2\pi)^{-d/2} \exp\!\left(-\tfrac{1}{2}\mathbf{x}\mathbf{x}'\right)$.
- Multivariate Epanechnikov kernel: $\frac{d+2}{2 c_d}\,(1 - \mathbf{x}\mathbf{x}')\,\mathbf{1}\{\mathbf{x}\mathbf{x}' \le 1\}$, where $c_d$ is the volume of the $d$-dimensional unit ball: $c_1 = 2$, $c_2 = \pi$, $c_3 = 4\pi/3$.

One can also choose different amounts of smoothing along different directions: $\mathbf{h} = (h_1, \ldots, h_d)$.
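A Python sketch of the product-kernel estimator with $K_0$ the Epanechnikov kernel (illustrative, not from the slides; the bivariate normal test data and the rule-of-thumb per-coordinate bandwidths of order $n^{-1/(d+4)}$ are assumptions):

```python
import numpy as np

def epanechnikov(u):
    """Univariate Epanechnikov kernel K0."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def product_kde(x, data, h):
    """f_hat(x) = 1/(n * prod_k h_k) * sum_i prod_k K0((x_k - X_ik)/h_k)."""
    x = np.atleast_2d(x)            # (m, d) evaluation points
    data = np.asarray(data)         # (n, d) sample
    h = np.asarray(h)               # (d,) per-coordinate bandwidths
    u = (x[:, None, :] - data[None, :, :]) / h
    return epanechnikov(u).prod(axis=2).sum(axis=1) / (data.shape[0] * h.prod())

rng = np.random.default_rng(3)
data = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=500)
# d = 2, so bandwidths of order n^(-1/(d+4)) = n^(-1/6)
h = data.std(axis=0) * 500 ** (-1 / 6)
est = product_kde([0.0, 0.0], data, h)[0]
# true density of this N(0, Sigma) at the origin is 1/(2*pi*sqrt(0.75)) ~ 0.184
print(est)
```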

SLIDE 14

Multivariate density estimators

Assume that $h_i \to 0$ for $i = 1, \ldots, d$ and $n \prod_{i=1}^{d} h_i \to \infty$ as $n \to \infty$. Under some smoothness conditions on $f$ and $K$, $\hat f_n(\mathbf{x})$ is a consistent estimator of $f(\mathbf{x})$: $\hat f_n(\mathbf{x}) \xrightarrow{P} f(\mathbf{x})$. The optimal bandwidths are $h_{i,\mathrm{opt}} = c_i\, n^{-1/(d+4)}$, $i = 1, \ldots, d$, and the corresponding risk (MSE or MISE) tends to zero at the rate $n^{-4/(d+4)}$.

SLIDE 15

Curse of dimensionality

The curse of dimensionality refers to the fact that an (estimation) problem gets harder very quickly as the dimension of the data increases. This can be due to computational burden and/or statistical efficiency. We discuss here the statistical curse of dimensionality: to obtain an accurate estimator, an enormous sample size is required. Since $\mathrm{MSE}_{h_{\mathrm{opt}}} \approx c\, n^{-4/(d+4)}$, setting $\mathrm{MSE}_{h_{\mathrm{opt}}} = \delta$ and solving for $n$ gives $n \approx (c/\delta)^{(d+4)/4}$, which grows exponentially with the dimension $d$. Below we illustrate this phenomenon with two examples.

SLIDE 16

Curse of dimensionality

1st Example. Suppose that the data are multivariate Gaussian, $N(0, I_d)$, with $I_d$ the identity matrix. Choose the optimal $h$ and a Gaussian kernel to estimate $f(\mathbf{0})$. To achieve
$$\frac{E\big(\hat f_n(\mathbf{0}) - f(\mathbf{0})\big)^2}{f^2(\mathbf{0})} < 0.1,$$
the required number of observations $n$ is as in the following table (Table 4.2 of Silverman (1986)).

d:   2     4      6       8        10
n:  19   223   2790   43,700   842,000

SLIDE 17

Curse of dimensionality

Why does an accurate estimator require a large sample size in the multivariate case? The reason is that $f(\mathbf{x})$ is estimated using data points in a local neighborhood of $\mathbf{x}$, but in a high-dimensional setting the data are very sparse, so local neighborhoods contain very few points.

2nd Example. Suppose that we have $n$ data points uniformly distributed on the interval $[0, 1]$. How many data points fall in the interval $[0, 0.1]$? About $n/10$. Now suppose $n$ data points are uniformly distributed on the 10-dimensional unit cube $[0, 1]^{10} = [0, 1] \times \cdots \times [0, 1]$. How many data points fall in the cube $[0, 0.1]^{10}$? About $0.1^{10}\, n = n/10{,}000{,}000{,}000$.
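The second example is easy to verify by simulation (illustrative Python, not from the slides; the sample size of 100,000 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 100_000, 10
pts = rng.random((n, d))                      # n points uniform on [0,1]^d

frac_1d = np.mean(pts[:, 0] <= 0.1)           # per coordinate: about 1/10 of the points
frac_10d = np.mean((pts <= 0.1).all(axis=1))  # expected fraction 0.1**10 = 1e-10
# even with 100,000 points, the expected count in [0, 0.1]^10 is 1e-5,
# so typically not a single point lands there
print(frac_1d, frac_10d)
```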

SLIDE 18

Group Presentation (March 30)

Group 14

Data on the salaries of the CEOs of 60 companies are available at http://lib.stat.cmu.edu/DASL/Datafiles/ceodat.html. Investigate the distribution of the salaries using a kernel density estimator:
- use the Epanechnikov kernel;
- implement an R function to compute LSCV(h) and plot LSCV(h) against h; what is the LSCV bandwidth?
- try other bandwidths; what is your final kernel density estimate?

SLIDE 19

Group Presentation (March 30)

Group 15

Consider a bivariate kernel density estimator. Simulate data from the bivariate normal distribution N((0, 0), (1, 0.5; 0.5, 1)). Choose n = 50, 200. Try different bandwidths. Use the bivariate normal kernel and the product kernel with K0 the Epanechnikov kernel. Find ways to visually compare your estimates with the true density; for instance, make 3D plots of the density, or density contour lines. R functions such as kde2d from package MASS, contour, and persp might be useful.

SLIDE 20

Group Presentation (March 30)

Group 23

MLB (Major League Baseball) owners claim that they need limitations on player salaries to maintain competitiveness between richer and poorer teams. This argument assumes that higher salaries attract better players. The data contain the 2002 salaries and career batting averages of 50 randomly selected MLB players. Based on the data, address the following question: is there a relationship between an MLB player's salary and his performance? You might consider the correlation coefficient. Use the bootstrap method.