SLIDE 1

Computational Statistics Lectures 10-13: Smoothing and Nonparametric Inference

Dr Jennifer Rogers, Hilary Term 2017

SLIDE 2

Background

SLIDE 3

Smoothing and nonparametric methods

◮ Approximating function that attempts to capture important patterns in datasets or images
◮ Leaves out noise
◮ Aids data analysis by making it possible to extract more information from the data
◮ Analyses are flexible and robust
◮ So should we always just use a nonparametric estimator?

SLIDE 4

Smoothing and nonparametric methods

No!

◮ There is no miracle!
◮ There is a price to pay for the gain in generality
◮ When we have clear evidence of a good parametric model for the data, we should use it
◮ Nonparametric estimators converge to the true curve more slowly than VALID parametric estimators
◮ But as soon as the parametric model is incorrect, a parametric estimator will never converge to the true curve

So nonparametric methods have their place!

SLIDE 5

The regression problem

◮ Goal of regression → discover the relationship between two variables, X and Y
◮ Wish to find a curve m that passes “in the middle” of the points
◮ Observations (xi, Yi) for i = 1, . . . , n
  ◮ xi is a real-valued variable
  ◮ Yi is a real-valued random response
◮ Yi = m(xi) + εi for i = 1, . . . , n, where
  ◮ E(εi | Xi) = 0
  ◮ Var(εi | Xi) = σ²(Xi)
  ◮ m(x) = E(Y | X = x)
◮ m(·): the regression function
  ◮ Reflects the relationship between X and Y
  ◮ The curve of interest, which “lies in the middle” of all the points
◮ The goal is to infer m(x) from the observations (xi, Yi)

SLIDE 6

Example: Cosmic microwave background data

SLIDE 7

Example: FTSE stock market index

SLIDE 8

Linear smoothers

SLIDE 9

Linear smoothers

◮ In the parametric context, we assume we know the shape of m(·)
◮ Linear model: Y = α + βX + ε
◮ m(x) = E(Y | X = x) = α + βx
◮ We estimate α and β from the data
◮ Least squares estimator:

  (α̂, β̂) = argmin_{(α,β)} Σ_i (Yi − α − βXi)²

◮ Consider m(xi) = 1 + 2xi
◮ We can fit a linear model to the data and obtain α̂ = 0.9905 and β̂ = 2.0025

SLIDE 10

Linear modelling

> x <- seq(from=0, to=1, length.out=1000)
> e <- rnorm(1000, 0, 0.2)
> y1 <- 2*x + 1 + e
> lm(y1 ~ x)

Call:
lm(formula = y1 ~ x)

Coefficients:
(Intercept)            x
     0.9905       2.0025

SLIDE 11

Linear modelling

SLIDE 12

Linear modelling

◮ Linear model: linear in the parameters!
◮ No higher order terms such as αβ or β²
◮ Not necessarily linear in Xi
◮ Examples of linear models:
  ◮ Yi = 8Xi³ − 13.6Xi² + 7.28Xi − 1.176 + ε
  ◮ Yi = cos(20Xi) + ε

SLIDE 13

Linear modelling

> lm(y2 ~ x + I(x^2) + I(x^3))
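The slide does not show how y2 was generated. A minimal sketch, assuming it follows the cubic example from the previous slide (the noise level is carried over from the earlier simulation):

x  <- seq(from = 0, to = 1, length.out = 1000)
e  <- rnorm(1000, 0, 0.2)
y2 <- 8*x^3 - 13.6*x^2 + 7.28*x - 1.176 + e   # cubic from the previous slide
lm(y2 ~ x + I(x^2) + I(x^3))                  # linear in the coefficients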

SLIDE 14

Linear modelling

> lm(y3 ~ I(cos(20*x)))
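Likewise, y3 is undefined on the slide; a sketch under the same assumptions, using the cosine example:

y3 <- cos(20*x) + e      # hypothetical data-generating step
lm(y3 ~ I(cos(20*x)))    # the fitted coefficient should be close to 1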

SLIDE 15

Linear modelling

◮ Can still use linear modelling
◮ Requires knowledge of the functional form of the explanatory variables
◮ May not always be obvious
◮ Consider linear smoothers instead: much more general
◮ Obtain a non-trivial smoothing matrix even for just a single ‘predictor’ variable (p = 1)

SLIDE 16

Linear smoothers

◮ For some n × n matrix S,

  Ŷ = SY

◮ The fitted value Ŷi at design point xi is a linear combination of the measurements:

  Ŷi = Σ_{j=1}^n Sij Yj

◮ Linear regression with p predictor variables:

  θ̂ = argmin_θ Σ_i (Yi − Xiθ)² = argmin_θ (Y − Xθ)^T (Y − Xθ) = argmin_θ (Y^T Y − 2θ^T X^T Y + θ^T X^T Xθ)

◮ Differentiating with respect to θ and setting to zero:

  −2X^T Y + 2X^T Xθ = 0, so θ̂ = (X^T X)^{−1} X^T Y

SLIDE 17

Linear smoothers

θ̂ = (X^T X)^{−1} X^T Y

◮ The estimated (fitted) values Ŷ are Xθ̂:

  Ŷ = X(X^T X)^{−1} X^T Y = SY

◮ S is an n × n matrix
◮ It is the hat matrix, H, from linear regression

SLIDE 18

Linear smoothers

◮ Degrees of freedom for linear regression:

  tr(X(X^T X)^{−1} X^T) = tr(X^T X (X^T X)^{−1}) = tr(I_p) = p

◮ How large is the expected residual sum of squares,

  E(RSS) = E( Σ_{i=1}^n (Ŷi − Yi)² )?

◮ If Ŷ = Xθ, then E(RSS) = E( Σ_{i=1}^n εi² ) = nσ²
◮ If Ŷ = SY, then Ŷ − Y = Sε − ε and E(RSS) = σ²(n − p)

A good estimator of σ² is thus

  σ̂² = RSS / (n − p) = RSS / (n − df)

SLIDE 19

Linear smoothers

Ŷ = m̂(x) = SY

◮ Serum data, taken in connection with diabetes research
◮ Y: log-concentration of a serum
◮ x: age of children in months
◮ There are various ways in which S can be chosen:

SLIDE 20

Linear smoothers

The methods considered fall into two categories:

◮ Local regression, including
  ◮ kernel estimators and
  ◮ local polynomial regression.
◮ Penalized estimators, mainly
  ◮ smoothing splines.

SLIDE 21

Local estimation: Kernel estimators

SLIDE 22

Histogram

◮ X ∼ F
◮ P(x ≤ X ≤ x + ∆x) = ∫_x^{x+∆x} fX(v) dv
◮ Thus, for any u ∈ [x, x + ∆x]:

  P(x ≤ X ≤ x + ∆x) ≈ ∆x · fX(u)

◮ This implies

  fX(u) ≈ P(x ≤ X ≤ x + ∆x) / ∆x,

suggesting the estimator

  f̂X(u) = #{Xi : x ≤ Xi ≤ x + ∆x} / (n∆x)

This is the idea behind histograms.

SLIDE 23

Histogram

◮ Choose an origin t0
◮ Choose a bin width, h
◮ Partition the real line into intervals Ik = [tk, tk+1] of equal length h
◮ Histogram estimator, for x ∈ Ik:

  f̂H(x) = #{Xi : tk ≤ Xi ≤ tk+1} / (nh)

A step function that depends heavily on both the origin, t0, and the bin width, h
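As a check, a minimal sketch of this estimator in R (the helper name hist_estimate is my own; it evaluates f̂H at a point u given an origin t0 and bin width h):

hist_estimate <- function(u, x, t0, h) {
  k <- floor((u - t0) / h)       # index of the bin containing u
  lower <- t0 + k * h
  sum(x >= lower & x < lower + h) / (length(x) * h)
}
hist_estimate(3.5, faithful$eruptions, t0 = 1.5, h = 0.5)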

SLIDE 24

Example: Old Faithful geyser

Duration in minutes of 272 eruptions of the Old Faithful geyser in Yellowstone National Park

> hist(faithful$eruptions,probability = T)

SLIDE 25

Example: Old Faithful geyser

What happens if we change the time origin and the bin width?

> hist(faithful$eruptions, breaks=seq(1.5,5.5,1), probability=T, xlim=c(1,5.5))
> hist(faithful$eruptions, breaks=seq(1.1,5.1,1), probability=T, xlim=c(1,5.5))
> hist(faithful$eruptions, breaks=seq(0.5,5.5,0.5), probability=T, xlim=c(1,5.5))
> hist(faithful$eruptions, breaks=seq(0.75,5.75,0.5), probability=T, xlim=c(1,5.5))

SLIDE 26

Example: Old Faithful geyser

SLIDE 27

Density estimator

Can we do better? We can get rid of the time origin. Note that

  fX(x) = lim_{h→0} [FX(x + h) − FX(x)] / h = lim_{h→0} [FX(x) − FX(x − h)] / h

Combining the two expressions:

  fX(x) = lim_{h→0} [FX(x + h) − FX(x − h)] / (2h) = lim_{h→0} P(x − h < X < x + h) / (2h)

which we can estimate using proportions:

  f̂X(x) = (1/(nh)) Σ_{i=1}^n K((x − Xi)/h),

with K(x) = (1/2) · I{|x| < 1}

SLIDE 28

Density estimator

◮ Similar to the histogram
  ◮ No longer has the origin, t0
  ◮ More flexible
◮ Constructs a box of length 2h around each observation Xi
◮ The estimator at x is then the sum of the boxes at x
◮ A density estimate that depends on the bandwidth, h

SLIDE 29

Kernel estimators

◮ Put a smooth symmetric ‘bump’ of shape K around each observation
◮ The estimator at x is now the sum of the bumps at x
◮ We define

  f̂X(x) = (1/(nh)) Σ_{i=1}^n K((x − Xi)/h),

◮ K: the ‘kernel’ function
◮ The estimator inherits the properties of K
  ◮ If K is continuous and differentiable → so is the estimator
  ◮ The estimator is a density if K is a density
◮ The shape of K does not greatly influence the resulting estimator
◮ The estimator does depend heavily on h
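A minimal sketch of this estimator with a Gaussian bump, checked against base R’s density() (the helper name kde and the bandwidth 0.3 are my own choices):

kde <- function(x, data, h) sapply(x, function(u) mean(dnorm((u - data) / h)) / h)
grid <- seq(1, 6, length.out = 200)
plot(grid, kde(grid, faithful$eruptions, h = 0.3), type = "l")
lines(density(faithful$eruptions, bw = 0.3), col = 2, lty = 2)  # same estimator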

SLIDE 30

Kernel estimators

SLIDE 31

Kernels

A kernel is a real-valued function K(x) such that

◮ K(x) ≥ 0 for all x ∈ R,
◮ ∫ K(x) dx = 1,
◮ ∫ xK(x) dx = 0.

In practice, the choice of K does not influence the results much, but the value of h is crucial

SLIDE 32

Kernels

Commonly used kernels include

◮ Boxcar (uniform): K(x) = I(x)/2
◮ Gaussian: K(x) = (2π)^{−1/2} exp(−x²/2)
◮ Epanechnikov: K(x) = (3/4)(1 − x²) I(x)
◮ Biweight: K(x) = (15/16)(1 − x²)² I(x)
◮ Triweight: K(x) = (35/32)(1 − x²)³ I(x)

where I(x) = 1 if |x| ≤ 1 and I(x) = 0 otherwise
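These kernels are one-liners in R; a sketch (the function names are my own):

boxcar       <- function(x) 0.5 * (abs(x) <= 1)
gaussian     <- function(x) exp(-x^2 / 2) / sqrt(2 * pi)
epanechnikov <- function(x) 0.75 * (1 - x^2) * (abs(x) <= 1)
biweight     <- function(x) (15/16) * (1 - x^2)^2 * (abs(x) <= 1)
triweight    <- function(x) (35/32) * (1 - x^2)^3 * (abs(x) <= 1)
curve(epanechnikov(x), -1.5, 1.5); curve(gaussian(x), add = TRUE, col = 2)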

SLIDE 33

Example: Old Faithful geyser

SLIDE 34

Kernel regression

SLIDE 35

Kernel regression

Y = m(x) + ε

1. We want to estimate E(Y | X = x). Naive estimator:

   m̂(x) = Σ_{i=1}^n Yi / n

   The same for all x.

2. Average the Yi of only those Xi that are close to x (local average):

   m̂(x) = Σ_{i=1}^n Yi · I{|Xi − x| < h} / Σ_{i=1}^n I{|Xi − x| < h}

   h: the bandwidth, which determines the size of the neighbourhood around x

SLIDE 36

Kernel regression

Y = m(x) + ε

3. Give a slowly decreasing weight to Xi as it gets further from x, rather than giving the same weight to all observations close to x:

   m̂(x) = Σ_{i=1}^n Yi W(x − Xi),

   W(·): a weight function that decreases as the distance from x increases, with Σ_{i=1}^n W(x − Xi) = 1

SLIDE 37

Nadaraya-Watson estimator

Take

  W(x − Xi) = K((x − Xi)/h) / Σ_{j=1}^n K((x − Xj)/h)

Hence, the Nadaraya-Watson kernel estimator is

  m̂(x) = Σ_{i=1}^n Yi K((x − Xi)/h) / Σ_{j=1}^n K((x − Xj)/h)

The estimated function values Ŷj = m̂(xj) at the observations are given by

  Ŷj = Σ_i Sji Yi, where Sji = K((xj − xi)/h) / Σ_k K((xj − xk)/h)

The kernel smoother is thus a linear smoother

SLIDE 38

Local least squares

We can rewrite the kernel regression estimator as

  m̂(x) = argmin_{mx ∈ R} Σ_{i=1}^n K((x − xi)/h) (Yi − mx)².

Exercise: this can be verified by solving

  d/dmx Σ_{i=1}^n K((x − xi)/h) (Yi − mx)² = 0

◮ Thus, for every fixed x, we have to search for the best local constant mx such that the localized sum of squares is minimized
◮ Localization is here described by the kernel, which gives a large weight to those observations (xi, Yi) where xi is close to the point x of interest.

The choice of the bandwidth h is crucial

SLIDE 39

Example: FTSE stock market index

SLIDE 40

Example: Cosmic microwave background data

SLIDE 41

Choosing the bandwidth

SLIDE 42

Choosing the bandwidth

◮ Measure the “success” of the fit using the mean squared error on new observations (MSE):

  MSE(h) = E[ (Y − m̂h(x))² ]

◮ Splitting into noise, bias and variance:

  MSE(h) = Noise + Bias² + Variance

◮ The bias decreases as h ↓ 0
◮ The variance increases as h ↓ 0
◮ Choosing the bandwidth is a bias-variance trade-off.

SLIDE 43

Choosing the bandwidth

◮ But... we don’t know MSE(h), as we just have n observations and cannot generate new random observations Y
◮ A first idea is to compute, for various values of the bandwidth h:
  ◮ the estimator m̂h for a training sample
  ◮ the training error

    n^{−1} Σ_{i=1}^n (Yi − m̂h(xi))² = n^{−1} Σ_{i=1}^n (Yi − Ŷi)²

◮ Choose the bandwidth with the smallest training error

SLIDE 44

CMB data

SLIDE 45

Choosing the bandwidth

◮ We would choose a bandwidth close to h = 0, giving near perfect interpolation of the data, that is m̂(xi) ≈ Yi
◮ This is unsurprising
◮ Parametric context:
  ◮ The shape of the model is fixed
  ◮ Minimising the MSE makes the parametric model as close as possible to the data
◮ Nonparametric setting:
  ◮ We don’t have a fixed shape
  ◮ The value of h dictates the model
  ◮ Minimising the training MSE → fitted model as close as possible to the data
  ◮ This leads us to choose h as small as possible
  ◮ Interpolation of the data
◮ Misleading result → only noise is fitted for very small bandwidths

SLIDE 46

Cross-validation

◮ Solution... don’t use Xi to construct m̂(Xi)
◮ This is the idea behind cross-validation
  ◮ Leave-one-out cross-validation
  ◮ Least squares cross-validation
◮ For each value of h:
  ◮ For each i = 1, . . . , n, compute the estimator m̂h^(−i)(x), where m̂h^(−i)(x) is computed without using observation i
  ◮ The estimated MSE is then given by

    M̂SE(h) = n^{−1} Σ_i (Yi − m̂h^(−i)(xi))²

SLIDE 47

CMB data

SLIDE 48

CMB data

A bandwidth of 44 minimises the estimated MSE

SLIDE 49

Cross-validation

◮ A drawback of LOO-CV is that it is expensive to compute
◮ The fit has to be recalculated n times (once for each left-out observation)
◮ We can avoid needing to calculate m̂h^(−i)(x) for all i
◮ For some n × n matrix S, the linear smoother fulfils Ŷ = SY

The risk (MSE) under LOO-CV can subsequently be written as

  M̂SE(h) = n^{−1} Σ_{i=1}^n ( (Yi − m̂h(xi)) / (1 − Sii) )²

SLIDE 50

Cross-validation

◮ We do not need to recompute m̂h while leaving out each of the n observations in turn
◮ The results can be obtained much faster by rescaling the residuals Yi − m̂h(xi) by the factor 1/(1 − Sii)
◮ Sii is the i-th diagonal entry of the smoothing matrix
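A sketch of this shortcut for the Nadaraya-Watson smoother, building the smoothing matrix S explicitly (the helper name loocv and the Gaussian kernel are my own choices):

loocv <- function(x, y, h) {
  K <- outer(x, x, function(a, b) dnorm((a - b) / h))
  S <- K / rowSums(K)                    # smoothing matrix, so yhat = S %*% y
  yhat <- as.vector(S %*% y)
  mean(((y - yhat) / (1 - diag(S)))^2)   # LOO-CV score without any refitting
}
# choose h on a grid hs: hs[which.min(sapply(hs, function(h) loocv(x, y, h)))]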

SLIDE 51

Generalized Cross-Validation

Starting from

  M̂SE(h) = n^{−1} Σ_{i=1}^n ( (Yi − m̂h(xi)) / (1 − Sii) )²,

replace Sii by its average ν/n (where ν = Σ_i Sii).

Choose the bandwidth h that minimizes

  GCV(h) = n^{−1} Σ_{i=1}^n ( (Yi − m̂h(xi)) / (1 − ν/n) )².
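The GCV variant of the sketch above changes only one line (again assuming the Nadaraya-Watson smoother with a Gaussian kernel):

gcv <- function(x, y, h) {
  K <- outer(x, x, function(a, b) dnorm((a - b) / h))
  S <- K / rowSums(K)
  yhat <- as.vector(S %*% y)
  nu <- sum(diag(S))                            # trace of S
  mean(((y - yhat) / (1 - nu / length(y)))^2)   # average leverage instead of S_ii
}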

SLIDE 52

Local polynomial regression

SLIDE 53

Nadaraya-Watson kernel estimator

◮ A major disadvantage of the Nadaraya-Watson kernel estimator → boundary bias
◮ The bias is of large order at the boundaries

SLIDE 54

Local polynomial regression

◮ Even when a curve doesn’t look like a polynomial:
  ◮ Restrict to a small neighbourhood of a given point, x
  ◮ Approximate the curve by a polynomial in that neighbourhood
  ◮ Fit its coefficients using only observations Xi close to x
  ◮ (or rather, putting more weight on observations close to x)
◮ Repeat this procedure at every point x where we want to estimate m(x)

SLIDE 55

Local polynomial regression

◮ The kernel estimator approximates the data by taking local averages within small bandwidths
◮ Instead, use local linear regression to obtain the approximation
SLIDE 56

Kernel regression estimator

Recall that the kernel regression estimator is the solution to

  m̂(x) = argmin_{m(x) ∈ R} Σ_{i=1}^n K((x − xi)/h) (Yi − m(x))²

This is given by

  m̂(x) = Σ_{i=1}^n Yi K((x − Xi)/h) / Σ_{j=1}^n K((x − Xj)/h).

Estimation thus corresponds to the solution of a weighted sum of squares problem

SLIDE 57

Local polynomial regression

Using a Taylor series, we can approximate m(x), for x close to a point x0, by the following polynomial:

  m(x) ≈ m(x0) + m^(1)(x0)(x − x0) + (m^(2)(x0)/2!)(x − x0)² + · · · + (m^(p)(x0)/p!)(x − x0)^p
       = m(x0) + β1(x − x0) + β2(x − x0)² + · · · + βp(x − x0)^p,

where m^(k)(x0) = k! βk, provided that all the required derivatives exist

SLIDE 58

Local polynomial regression

◮ Use the data to estimate the polynomial of degree p that best approximates m in a small neighbourhood around the point x
◮ Minimise, with respect to β0, β1, . . . , βp, the function

  Σ_i {Yi − β0 − β1(xi − x) − . . . − βp(xi − x)^p}² K((x − xi)/h)

◮ A weighted least squares problem, where the weights are given by the kernel values K((x − xi)/h)
◮ As m^(k)(x) = k! βk, we then have m(x) = β0, or m̂(x) = β̂0 (see the sketch below)
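A minimal sketch of a local linear fit (p = 1) at a single point x0, using lm() with kernel weights (the helper name locallin and the Gaussian kernel are my own choices):

locallin <- function(x0, x, y, h) {
  w <- dnorm((x - x0) / h)                # kernel weights
  fit <- lm(y ~ I(x - x0), weights = w)   # centred, so the intercept estimates m(x0)
  unname(coef(fit)[1])
}
# evaluate on a grid: sapply(grid, function(x0) locallin(x0, x, y, h = 0.1))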

SLIDE 59

CMB data

Red: kernel smoother (p = 0); Green: local linear regression (p = 1)

SLIDE 60

Boundary bias: kernel estimator

Let ℓi(x) = ωi(x) / Σ_j ωj(x), so that

  m̂(x) = Σ_i ℓi(x) Yi

For the kernel smoother (p = 0), the bias of the linear smoother is thus

  Bias = E(m̂(x)) − m(x) = m′(x) Σ_i (xi − x) ℓi(x) + (m′′(x)/2) Σ_i (xi − x)² ℓi(x) + R

SLIDE 61

Boundary bias: Kernel estimator

The first term in the expansion is equal to

  m′(x) Σ_i (xi − x) K((x − xi)/h) / Σ_i K((x − xi)/h),

which

◮ vanishes if the design points xi are centred symmetrically around x
◮ does not vanish if x sits at the boundary (all xi − x will have the same sign)

SLIDE 62

Boundary bias: polynomial estimator

Suppose that

◮ m(x) is truly a local polynomial of degree p, and
◮ there are at least p + 1 points with positive weights in the neighbourhood of x

The bias will hence be of order

  Bias = E(m̂(x)) − m(x) = (m^(p+1)(x) / (p + 1)!) Σ_j (xj − x)^{p+1} ℓj(x) + R

Why not choose p = 20?

SLIDE 63

Boundary bias: polynomial estimator

◮ Yi = m(x) + σεi with εi ∼ N(0, 1)
◮ The variance of the linear smoother m̂(x) = Σ_j ℓj(x) Yj is

  Var(m̂(x)) = σ² Σ_j ℓj²(x) = σ² ∥ℓ(x)∥²

◮ ∥ℓ(x)∥² tends to be large if p is large
◮ In practice, p = 1 is a good choice

SLIDE 64

Example: Doppler function

m(x) = √(x(1 − x)) · sin( 2.1π / (x + 0.05) ),  0 ≤ x ≤ 1.

SLIDE 65

Example: Doppler function

> library(KernSmooth)   # provides locpoly() and dpill()
> n <- 1000
> x <- seq(0, 1, length=n)
> m <- sqrt(x*(1-x))*sin(2.1*pi/(x+0.05))
> plot(x, m, type='l')
> y <- m + rnorm(n)*0.075
> plot(x, y)
> fit <- locpoly(x, y, bandwidth=dpill(x,y)*2, degree=1)
> lines(fit, col=2)
> plot(x, y)
> fit2 <- locpoly(x, y, bandwidth=dpill(x,y)/2, degree=1)
> lines(fit2, col=2)
> plot(x, y)
> fit3 <- locpoly(x, y, bandwidth=dpill(x,y)/4, degree=1)
> lines(fit3, col=2)

SLIDE 66

Example: Doppler function

SLIDE 67

Penalised regression

SLIDE 68

Penalised regression

Regression model, for i = 1, . . . , n:

  Yi = m(xi) + εi,  E(εi) = 0

Estimating m by choosing m̂(x) to minimize

  Σ_{i=1}^n (Yi − m̂(xi))²

leads to

◮ the linear regression estimate if minimizing over all linear functions
◮ an interpolation of the data if minimizing over all functions.

SLIDE 69

Penalised regression

Estimate m by choosing m̂(x) to minimize

  Σ_{i=1}^n (Yi − m̂(xi))² + λ J(m),

J(m): roughness penalty. Typically

  J(m) = ∫ (m′′(x))² dx.

The parameter λ controls the trade-off between fit and penalty (see the sketch below):

◮ For λ = 0: interpolation
◮ For λ → ∞: the linear least squares line
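A sketch of the two extremes using smooth.spline(), which fits exactly this penalised criterion with J(m) = ∫(m′′)² dx (the data-generating step reuses the Doppler function from earlier; the λ values are my own illustrative choices):

x <- seq(0, 1, length.out = 500)
y <- sqrt(x*(1-x))*sin(2.1*pi/(x+0.05)) + rnorm(500, 0, 0.075)
plot(x, y)
lines(smooth.spline(x, y, lambda = 1e-9), col = 2)  # tiny lambda: near interpolation
lines(smooth.spline(x, y, lambda = 10),   col = 4)  # huge lambda: almost the LS line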

SLIDE 70

Example: Doppler function

SLIDE 71

Splines

SLIDE 72

Splines

◮ Kernel regression:
  ◮ The researcher isn’t interested in actually calculating m(x) for a single location x
  ◮ m(x) is calculated on a sufficiently small grid of x-values
  ◮ The curve is obtained by interpolation
◮ Local polynomial regression: the unknown mean function was assumed to be locally well approximated by a polynomial
◮ Alternative approach:
  ◮ Represent the fit as a piecewise polynomial
  ◮ The pieces connect at points called knots
  ◮ Once the knots are selected, an estimator can be computed globally
  ◮ In a manner similar to that for a parametrically specified mean function

This is the idea behind splines

SLIDE 73

Splines

IID sample (X1, Y1), (X2, Y2), . . . , (Xn, Yn) coming from the model Yi = m(Xi) + εi

We want to estimate the mean of the variable Y with m(x) = E(Y | X = x). A very naive estimator of E(Y | X = x) would be the sample mean of the Yi:

  m̂(x) = Σ_{i=1}^n Yi / n

Not very good (the same for all x)

SLIDE 74

Splines

Approximate m by piecewise polynomials, each on a small interval:

  m̂(x) = c1,    if x < ξ1
          c2,    if ξ1 ≤ x < ξ2
          . . .
          ck,    if ξk−1 ≤ x < ξk
          ck+1,  if x ≥ ξk

SLIDE 75

Splines

Use more general lines, which join at the ξs:

  m̂(x) = a1 + b1 x,      if x < ξ1
          a2 + b2 x,      if ξ1 ≤ x < ξ2
          . . .
          ak + bk x,      if ξk−1 ≤ x < ξk
          ak+1 + bk+1 x,  if x ≥ ξk

The a’s and b’s are such that the lines join at each ξ

SLIDE 76

Splines

Approximate m(x) by polynomials:

  m̂(x) = Σ_{j=0}^p β1,j x^j,    if x < ξ1
          Σ_{j=0}^p β2,j x^j,    if ξ1 ≤ x < ξ2
          . . .
          Σ_{j=0}^p βk,j x^j,    if ξk−1 ≤ x < ξk
          Σ_{j=0}^p βk+1,j x^j,  if x ≥ ξk

The βj’s are such that the polynomials join at each ξ and the approximation has p − 1 continuous derivatives.

Splines which are piecewise polynomials of degree p are called

◮ splines of order p + 1, or
◮ splines of degree p

ξ: the knots

SLIDE 77

Splines

SLIDE 78

Piecewise constant splines

SLIDE 79

Knots

How many knots should we have?

◮ Choose a lot of knots, well spread over the data range → reduces the bias of the estimator
◮ If we make it too local → the estimator will be too wiggly
◮ To overcome the bias problem without increasing the variance → take a lot of knots, but constrain their influence
◮ We can do this using penalised regression

SLIDE 80

Spline order

What order spline should we use?

◮ Increasing the value of p → makes the estimator m̂p smoother (since it has p − 1 continuous derivatives)
◮ Taking p too large → increases the number of parameters to estimate
◮ In practice it is rarely useful to take p > 3
◮ p = 2: splines of order three, or quadratic splines
◮ p = 3: splines of order 4, or cubic splines
◮ A p-th order spline is a piecewise polynomial of degree p − 1 with p − 2 continuous derivatives at the knots

SLIDE 81

Natural splines

◮ A spline that is linear beyond the boundary knots is called a natural spline
◮ Why this constraint?
  ◮ We usually have very few observations beyond the two extreme knots
  ◮ We still want to obtain an estimator of the regression curve there
  ◮ We cannot reasonably estimate anything complex there
  ◮ Rather use a simplified model (e.g. linear)
  ◮ This often gives more or less reasonable results

SLIDE 82

Natural cubic splines

Let ξ1 < ξ2 < . . . < ξn be a set of ordered points, so-called knots, contained in an interval (a, b). A cubic spline is a continuous function m such that (i) m is a cubic polynomial over (ξ1, ξ2), . . ., and (ii) m has continuous first and second derivatives at the knots.

The solution to

  Σ_{i=1}^n (yi − m̂(xi))² + λ ∫ (m̂′′(x))² dx

is a natural cubic spline with knots at the data points; m̂(x) is called a smoothing spline

SLIDE 83

Natural cubic splines

◮ Given a sequence of values f1, . . . , fn at specified locations x1 < x2 < · · · < xn
◮ Find a smooth curve g(x) that passes through the points (x1, f1), (x2, f2), . . . , (xn, fn)
◮ The natural cubic spline g is an interpolating function that satisfies the following conditions:

  (i) g(xj) = fj, j = 1, . . . , n,
  (ii) g(x) is cubic on each subinterval (xk, xk+1), k = 1, . . . , (n − 1),
  (iii) g(x) is continuous and has continuous first and second derivatives,
  (iv) g′′(x1) = g′′(xn) = 0.
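Base R constructs exactly this interpolant via splinefun(); a quick sketch (the data values here are made up for illustration):

xk <- c(0, 1, 2, 3, 4)
fk <- c(0, 2, 1, 3, 2)
g  <- splinefun(xk, fk, method = "natural")
g(xk)                  # condition (i): reproduces f_k at the knots
g(c(0, 4), deriv = 2)  # condition (iv): second derivative zero at the endpoints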

SLIDE 84

B-splines

◮ We need a basis for natural polynomial splines
◮ Convenient is the so-called B-spline basis
◮ Data points a = ξ0 < ξ1 < ξ2 < . . . < ξn < ξn+1 = b
◮ These are n + 2 real values
  ◮ The n ≥ 0 points ξ1, . . . , ξn are called ‘interior knots’ or ‘control points’
  ◮ And there are always two endpoints, ξ0 and ξn+1
◮ When the knots are equidistant they are said to be ‘uniform’

SLIDE 85

B-splines

Now define new knots τ as

◮ τ1 ≤ . . . ≤ τp = ξ0 = a
◮ τj+p = ξj
◮ b = ξn+1 = τn+p+1 ≤ τn+p+2 ≤ . . . ≤ τn+2p

where

◮ p is the order of the polynomial
◮ p + 1 is the order of the spline
◮ The lower and upper boundary knots ξ0 and ξn+1 are appended p times
◮ This is needed due to the recursive nature of B-splines

SLIDE 86

B-splines

Define recursively:

◮ For k = 0 and i = 1, . . . , n + 2p:

  Bi,0(x) = 1 if τi ≤ x < τi+1, and 0 otherwise

◮ For k = 1, 2, . . . , p:

  Bi,k(x) = ((x − τi) / (τi+k − τi)) Bi,k−1(x) + ((τi+k+1 − x) / (τi+k+1 − τi+1)) Bi+1,k−1(x)

The support of Bi,k(x) is [τi, τi+k+1]
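A direct transcription of this recursion (a sketch: the helper name bspline is my own, the 0/0 convention for repeated knots is handled by dropping the offending term, and splines::splineDesign offers a library version):

bspline <- function(x, i, k, tau) {
  if (k == 0) return(as.numeric(tau[i] <= x & x < tau[i + 1]))
  left <- right <- 0
  if (tau[i + k] > tau[i])   # skip the term when the knots coincide (0/0 = 0)
    left <- (x - tau[i]) / (tau[i + k] - tau[i]) * bspline(x, i, k - 1, tau)
  if (tau[i + k + 1] > tau[i + 1])
    right <- (tau[i + k + 1] - x) / (tau[i + k + 1] - tau[i + 1]) * bspline(x, i + 1, k - 1, tau)
  left + right
}
tau <- seq(0, 1, by = 0.1)   # uniform knots
xg  <- seq(0, 1, length.out = 200)
plot(xg, bspline(xg, i = 1, k = 3, tau), type = "l")  # one cubic basis function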

SLIDE 87

B-splines

SLIDE 88

Solving

◮ The solution depends on the regularization parameter λ
  ◮ It determines the amount of “roughness”
◮ Choosing λ isn’t necessarily intuitive
◮ Degrees of freedom = trace of the smoothing matrix
  ◮ The sum of its eigenvalues

  S = B(B^T B + λΩ)^{−1} B^T

◮ Monotone relationship between df and λ
◮ Search for the value of λ that gives the desired df
  ◮ df = 2 → linear regression
  ◮ df = n → interpolate the data exactly

SLIDE 89

Example: Doppler function

SLIDE 90

Example: Doppler function

We could of course choose λ by LOO-CV or GCV

SLIDE 91

Cross validation

> plot(x, y)
> fitcv <- smooth.spline(x, y, cv=T)
> lines(fitcv, col=2)
> fitcv

Call:
smooth.spline(x = x, y = y, cv = T)

Smoothing Parameter  spar= 0.157514  lambda= 2.291527e-08 (16 iterations)
Equivalent Degrees of Freedom (Df): 124.738
Penalized Criterion: 6.071742
PRESS: 0.007898575

SLIDE 92

Generalised cross validation

> plot(x, y)
> fitgcv <- smooth.spline(x, y, cv=F)
> lines(fitgcv, col=4)
> fitgcv

Call:
smooth.spline(x = x, y = y, cv = F)

Smoothing Parameter  spar= 0.1597504  lambda= 2.378386e-08 (15 iterations)
Equivalent Degrees of Freedom (Df): 124.2353
Penalized Criterion: 6.078626
GCV: 0.007925571

SLIDE 93

Multivariate smoothing

SLIDE 94

Multivariate smoothing

◮ So far we have only considered univariate functions
◮ Suppose there are several predictors that we would like to treat nonparametrically
◮ Most ‘interesting’ statistical problems nowadays are high-dimensional, easily with p > 1000
  ◮ Biology: microarrays, gene maps, network inference
  ◮ Finance: prediction from multivariate time series
  ◮ Physics: climate models
◮ Can we just extend the methods and model functions R^p → R nonparametrically?

SLIDE 95

Curse of dimensionality

◮ One might consider multidimensional smoothers aimed at estimating

  Y = m(x1, x2, . . . , xp)

◮ The methods considered rely on ‘local’ approximations
◮ They examine the behaviour of data points in the neighbourhood of the point of interest
◮ But what is ‘local’, and what is a ‘neighbourhood’, if p → ∞ with n constant?

SLIDE 96

Curse of dimensionality

Let x = (x^(1), x^(2), . . . , x^(p)) ∈ [0, 1]^p. To get 5% of all n sample points into a cube-shaped neighbourhood of x, we need a cube with side length 0 < ℓ < 1 such that ℓ^p ≥ 0.05; the smallest such side length is ℓ = 0.05^{1/p}:

  Dimension p | side length ℓ
            1 | 0.05
            2 | 0.22
            5 | 0.54
           10 | 0.74
         1000 | 0.997
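A one-line check of the table:

p <- c(1, 2, 5, 10, 1000)
round(0.05^(1/p), 3)   # 0.050 0.224 0.549 0.741 0.997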

SLIDE 97

Additive models

Require the function m : R^p → R to be of the form

  m_add(x) = µ + m1(x^(1)) + m2(x^(2)) + . . . + mp(x^(p)) = µ + Σ_{j=1}^p mj(x^(j)),  µ ∈ R

where each mj(·) : R → R is just a univariate nonparametric function, with E[mj(x^(j))] = 0 for j = 1, . . . , p

◮ The choice of smoother is left open
◮ Avoids the curse of dimensionality → ‘less flexible’
◮ The functions can be estimated by ‘backfitting’

SLIDE 98

Backfitting

Data: x_i^(j), 1 ≤ i ≤ n and 1 ≤ j ≤ p

A linear smoother for variable j can be described by an n × n matrix S^(j), so that

  m̂j = S^(j) Y,

◮ Y = (Y1, . . . , Yn)^T: observed vector of responses
◮ m̂j = (m̂j(x_1^(j)), . . . , m̂j(x_n^(j)))^T: regression fit
◮ S^(j): smoother with bandwidth estimated by LOO-CV or GCV

SLIDE 99

Backfitting

  m̂_add(x) = µ̂ + Σ_{j=1}^p m̂j(x^(j))

Suppose µ̂ and m̂k are given for all k ≠ j. Then

  m̂_add(xi) = ( µ̂ + Σ_{k≠j} m̂k(x_i^(k)) ) + m̂j(x_i^(j))

Now apply the smoother S^(j) to

  Y − ( µ̂ + Σ_{k≠j} m̂k )

Cycle through all j = 1, . . . , p to get

  m̂_add(xi) = µ̂ + Σ_{j=1}^p m̂j(x_i^(j))

SLIDE 100

Backfitting

1. Set µ̂ ← n^{−1} Σ_{i=1}^n Yi. Start with m̂j ≡ 0 for all j = 1, . . . , p

2. Cycle through the indices j = 1, 2, . . . , p, 1, 2, . . . , p, . . .:

     m̂j ← S^(j)( Y − µ̂1 − Σ_{k≠j} m̂k )

   Also normalize

     m̂j(·) ← m̂j(·) − n^{−1} Σ_{i=1}^n m̂j(x_i^(j))

   and update

     µ̂ ← n^{−1} Σ_{i=1}^n ( Yi − Σ_k m̂k(x_i^(k)) )

   Stop iterating when the functions do not change very much

3. Return

     m̂_add(xi) ← µ̂ + Σ_{j=1}^p m̂j(x_i^(j))
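A compact sketch of this algorithm with smooth.spline() as the univariate smoother (the function name backfit and the fixed iteration count are my own choices; a proper convergence check would replace the latter in practice):

backfit <- function(X, y, iters = 20) {
  n <- nrow(X); p <- ncol(X)
  mu <- mean(y)                   # step 1
  fits <- matrix(0, n, p)         # column j holds m_j evaluated at the data points
  for (it in seq_len(iters)) {    # step 2
    for (j in seq_len(p)) {
      partial <- y - mu - rowSums(fits[, -j, drop = FALSE])
      fits[, j] <- predict(smooth.spline(X[, j], partial), X[, j])$y
      fits[, j] <- fits[, j] - mean(fits[, j])   # normalize to mean zero
    }
    mu <- mean(y - rowSums(fits))                # update mu-hat
  }
  list(mu = mu, fits = fits, fitted = mu + rowSums(fits))  # step 3
}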

SLIDE 101

Example: Ozone data

SLIDE 102

Example: Ozone data

SLIDE 103

Iteration 1

SLIDE 104

Iteration 2

SLIDE 105

Iteration 3

SLIDE 106

Iteration 7