Computational Statistics Lectures 10-13: Smoothing and Nonparametric Inference
Dr Jennifer Rogers
Hilary Term 2017
Background
Smoothing and nonparametric methods
◮ Approximating function that attempts to capture important patterns in datasets or images
◮ Leaves out noise
◮ Aids data analysis by being able to extract more information from the data
◮ Analyses are flexible and robust
◮ Should we always just use a nonparametric estimator?
Smoothing and nonparametric methods
No!
◮ There is no miracle!
◮ There is a price to pay for the gain in generality
◮ When we have clear evidence of a good parametric model for the data, we should use it
◮ Nonparametric estimators converge to the true curve more slowly than VALID parametric estimators
◮ But as soon as the parametric model is incorrect, a parametric estimator will never converge to the true curve
So nonparametric methods have their place!
The regression problem
◮ Goal of regression → discover the relationship between two variables, X and Y
◮ Wish to find a curve m that passes “in the middle” of the points
◮ Observations (xi, Yi) for i = 1, . . . , n
◮ xi is a real-valued variable
◮ Yi is a real-valued random response
◮ Yi = m(xi) + εi for i = 1, . . . , n
◮ E(εi | Xi) = 0
◮ Var(εi | Xi) = σ²(Xi)
◮ m(x) = E(Y | X = x)
◮ m(·): regression function
◮ Reflects the relationship between X and Y
◮ Curve of interest that “lies in the middle” of all the points
◮ Goal is to infer m(x) from observations (xi, Yi)
Example: Cosmic microwave background data
Example: FTSE stock market index
Linear smoothers
Linear smoothers
◮ In the parametric context, we assume we know the shape of m(·)
◮ Linear model: Y = α + βX + ε
◮ m(x) = E(Y | X = x) = α + βx
◮ We estimate α and β from the data
◮ Least squares estimator:
  (α̂, β̂) = argmin_{(α,β)} Σᵢ (Yi − α − βXi)²
◮ Consider m(xi) = 1 + 2xi
◮ We can fit a linear model to the data and obtain α̂ = 0.9905 and β̂ = 2.0025
Linear modelling
> x <- seq(from=0, to=1, length.out=1000)
> e <- rnorm(1000, 0, 0.2)
> y1 <- 2*x + 1 + e
> lm(y1 ~ x)

Call:
lm(formula = y1 ~ x)

Coefficients:
(Intercept)            x
     0.9905       2.0025
Linear modelling
Linear modelling
◮ Linear model: linear in the parameters!
◮ No higher order terms such as αβ or β²
◮ Not necessarily linear in Xi
◮ Examples of linear models:
  ◮ Yi = 8Xi³ − 13.6Xi² + 7.28Xi − 1.176 + ε
  ◮ Yi = cos(20Xi) + ε
Linear modelling
> lm(y2 ~ x + I(x^2) + I(x^3))
Linear modelling
> lm(y3 ~ I(cos(20*x)))
Linear modelling
◮ Can still use linear modelling
◮ Requires knowledge of the functional form of the explanatory variables
◮ May not always be obvious
◮ Consider linear smoothers - much more general
◮ Obtain a non-trivial smoothing matrix even for just a single ‘predictor’ variable (p = 1)
Linear smoothers
◮ For some n × n matrix S, Ŷ = SY
◮ The fitted value Ŷi at design point xi is a linear combination of the measurements:
  Ŷi = Σ_{j=1}^n Sij Yj
◮ Linear regression with p predictor variables:
  θ̂ = argmin_θ Σᵢ (Yi − Xiθ)² = argmin_θ (Y − Xθ)ᵀ(Y − Xθ) = argmin_θ (YᵀY − 2θᵀXᵀY + θᵀXᵀXθ)
◮ Differentiating with respect to θ and setting to zero:
  −2XᵀY + 2XᵀXθ = 0, so θ̂ = (XᵀX)⁻¹XᵀY
Linear smoothers
θ̂ = (XᵀX)⁻¹XᵀY
◮ The estimated (fitted) values Ŷ are Xθ̂:
  Ŷ = X(XᵀX)⁻¹XᵀY = SY
◮ S is an n × n matrix
◮ It is the hat matrix, H, from linear regression
Linear smoothers
◮ Degrees of freedom for linear regression:
  tr(X(XᵀX)⁻¹Xᵀ) = tr(XᵀX(XᵀX)⁻¹) = tr(I_p) = p
◮ How large is the expected residual sum of squares,
  E(RSS) = E(Σ_{i=1}^n (Ŷi − Yi)²)?
◮ If Ŷ = Xθ, then E(RSS) = E(Σ_{i=1}^n εi²) = nσ²
◮ If Ŷ = SY, then Ŷ − Y = Sε − ε and E(RSS) = σ²(n − p)
◮ A good estimator of σ² is thus σ̂² = RSS/(n − p) = RSS/(n − df)
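To illustrate these identities, here is a minimal sketch (reusing the simulated x and y1 from the earlier lm() example) that builds the hat matrix explicitly, checks that the fitted values agree with lm(), and that df = tr(S):

> # Hat matrix for the straight-line fit: S = X (X'X)^{-1} X'
> X <- cbind(1, x)                           # design matrix, p = 2 columns
> S <- X %*% solve(t(X) %*% X) %*% t(X)
> yhat <- S %*% y1                           # fitted values, Yhat = S Y
> all.equal(as.vector(yhat), as.vector(fitted(lm(y1 ~ x))))
> sum(diag(S))                               # df = tr(S) = p = 2
> sum((y1 - yhat)^2) / (length(y1) - sum(diag(S)))   # sigma^2 estimate, RSS/(n - df)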
Linear smoothers
Ŷ = m̂(x) = SY
◮ Serum data, taken in connection with diabetes research
◮ Y: log-concentration of a serum
◮ x: age of children in months
◮ There are various ways in which S can be chosen:
Linear smoothers
The methods considered fall into two categories:
◮ Local regression, including
  ◮ kernel estimators and
  ◮ local polynomial regression.
◮ Penalized estimators, mainly
  ◮ smoothing splines.
Local estimation: Kernel estimators
Histogram
◮ X ∼ F
◮ P(x ≤ X ≤ x + Δx) = ∫_x^{x+Δx} fX(v) dv
◮ Thus, for any u ∈ [x, x + Δx]:
  P(x ≤ X ≤ x + Δx) ≈ Δx · fX(u)
◮ This implies
  fX(u) ≈ P(x ≤ X ≤ x + Δx) / Δx, estimated by f̂X(u) = #{Xi : x ≤ Xi ≤ x + Δx} / (nΔx)
This is the idea behind histograms.
Histogram
◮ Choose an origin t0
◮ Choose a bin size, h
◮ Partition the real line into intervals Ik = [tk, tk+1] of equal length h
◮ Histogram estimator, for x in bin Ik:
  f̂H(x) = #{Xi : tk ≤ Xi ≤ tk+1} / (nh)
A step function that depends heavily on both the origin, t0, and the bin width, h.
Example: Old Faithful geyser
Duration in minutes of 272 eruptions of the Old Faithful geyser in Yellowstone National Park
> hist(faithful$eruptions,probability = T)
Example: Old Faithful geyser
What happens if we change the time origin and the bin width?
> hist(faithful$eruptions, breaks=seq(1.5,5.5,1), probability = T, xlim=c(1,5.5))
> hist(faithful$eruptions, breaks=seq(1.1,5.1,1), probability = T, xlim=c(1,5.5))
> hist(faithful$eruptions, breaks=seq(0.5,5.5,0.5), probability = T, xlim=c(1,5.5))
> hist(faithful$eruptions, breaks=seq(0.75,5.75,0.5), probability = T, xlim=c(1,5.5))
Example: Old Faithful geyser
Density estimator
Can we do better? We can get rid of the time origin.
  fX(x) = lim_{h→0} [FX(x + h) − FX(x)] / h = lim_{h→0} [FX(x) − FX(x − h)] / h
Combining the two expressions:
  fX(x) = lim_{h→0} [FX(x + h) − FX(x − h)] / (2h) = lim_{h→0} P(x − h < X < x + h) / (2h)
which we can estimate using proportions:
  f̂X(x) = (1/(nh)) Σ_{i=1}^n K((x − Xi)/h),
with K(x) = (1/2) · I{|x| < 1}
Density estimator
◮ Similar to the histogram
◮ No longer have the origin, t0
◮ More flexible
◮ Constructs a box of length 2h around each observation Xi
◮ The estimator is then the sum of the boxes at x
◮ A density that depends on the bandwidth, h
Kernel estimators
◮ Put a smooth symmetric ‘bump’ of shape K around each observation
◮ The estimator at x is now the sum of the bumps at x
◮ We define
  f̂X(x) = (1/(nh)) Σ_{i=1}^n K((x − Xi)/h),
◮ K: ‘kernel’ function
◮ The estimator inherits the properties of K
  ◮ If K is continuous and differentiable → so is the estimator
  ◮ The estimator is a density if K is a density
◮ The shape of K does not influence the resulting estimator much
◮ The estimator does depend heavily on h
Kernel estimators
Kernels
A kernel is a real-valued function K(x) such that
◮ K(x) ≥ 0 for all x ∈ R,
◮ ∫ K(x) dx = 1,
◮ ∫ xK(x) dx = 0.
In practice, the choice of K does not influence the results much, but the value of h is crucial.
Kernels
Commonly used kernels include
◮ Boxcar: K(x) = I(x)/2
◮ Gaussian: K(x) = (2π)^{−1/2} exp(−x²/2)
◮ Epanechnikov: K(x) = (3/4)(1 − x²)I(x)
◮ Biweight: K(x) = (15/16)(1 − x²)²I(x)
◮ Triweight: K(x) = (35/32)(1 − x²)³I(x)
◮ Uniform: K(x) = (1/2)I(x)
where I(x) = 1 if |x| ≤ 1 and I(x) = 0 otherwise
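In R, kernel density estimates can be compared across kernels and bandwidths with the built-in density() function; a small sketch on the Old Faithful eruption durations (note that the bw argument is the standard deviation of the smoothing kernel, a rescaling of h, and the values below are illustrative):

> # Kernel density estimates with different kernels and bandwidths
> d1 <- density(faithful$eruptions, kernel = "gaussian", bw = 0.1)
> d2 <- density(faithful$eruptions, kernel = "epanechnikov", bw = 0.1)
> d3 <- density(faithful$eruptions, kernel = "gaussian", bw = 0.5)
> plot(d1)
> lines(d2, col = 2)   # changing the kernel shape changes little
> lines(d3, col = 4)   # changing the bandwidth changes a lot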
Example: Old Faithful geyser
Kernel regression
Kernel regression
Y = m(x) + ε
1. We want to estimate E(Y | X = x). Naive estimator:
   m̂(x) = (Σ_{i=1}^n Yi) / n
   Same for all x.
2. Average the Yis of only those Xis that are close to x (local average):
   m̂(x) = Σ_{i=1}^n Yi · I{|Xi − x| < h} / Σ_{i=1}^n I{|Xi − x| < h}
   h: bandwidth, determines the size of the neighbourhood around x
Kernel regression
Y = m(x) + ε
3. Give a slowly decreasing weight to Xi as it gets farther from x, rather than giving the same weight to all observations close to x:
   m̂(x) = Σ_{i=1}^n Yi W(x − Xi)
   W(·): weight function that decreases as the distance |x − Xi| increases, with Σ_{i=1}^n W(x − Xi) = 1
Nadaraya-Watson estimator
W(x − Xi) = K((x − Xi)/h) / Σ_{j=1}^n K((x − Xj)/h)
Hence, the Nadaraya-Watson kernel estimator is
  m̂(x) = Σ_{i=1}^n Yi K((x − Xi)/h) / Σ_{j=1}^n K((x − Xj)/h)
The estimated function values Ŷj = m̂(xj) at the observed design points are given by
  Ŷj = Σᵢ Sji Yi, where Sji = K((xj − xi)/h) / Σ_k K((xj − xk)/h)
The kernel smoother is thus a linear smoother.
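A minimal sketch of the Nadaraya-Watson estimator with a Gaussian kernel, for generic vectors x and y of design points and responses (the bandwidth below is illustrative; the built-in ksmooth() with kernel = "normal" gives an equivalent fit):

> # Nadaraya-Watson estimate at a grid of evaluation points
> nw <- function(x0, x, y, h) {
+   sapply(x0, function(u) {
+     w <- dnorm((u - x) / h)   # kernel weights K((u - x_i)/h)
+     sum(w * y) / sum(w)       # locally weighted average of the y_i
+   })
+ }
> xgrid <- seq(min(x), max(x), length.out = 200)
> plot(x, y)
> lines(xgrid, nw(xgrid, x, y, h = 0.05), col = 2)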
Local least squares
We can rewrite the kernel regression estimator as
  m̂(x) = argmin_{mx ∈ R} Σ_{i=1}^n K((x − xi)/h) (Yi − mx)².
Exercise: this can be verified by solving
  d/dmx Σ_{i=1}^n K((x − xi)/h)(Yi − mx)² = 0
◮ Thus, for every fixed x, we have to search for the best local constant mx such that the localized sum of squares is minimized
◮ Localization is here described by the kernel, which gives a large weight to those observations (xi, Yi) where xi is close to the point x of interest. The choice of the bandwidth h is crucial.
Example: FTSE stock market index
Example: Cosmic microwave background data
Choosing the bandwidth
Choosing the bandwidth
◮ Measure the “success” of the fit using the mean squared error on new observations (MSE),
  MSE(h) = E[(Y − m̂h(x))²]
◮ Splitting into noise, bias and variance:
  MSE(h) = Noise + Bias² + Variance
◮ Bias decreases if h ↓ 0
◮ Variance increases if h ↓ 0
◮ Choosing the bandwidth is a bias-variance trade-off.
Choosing the bandwidth
◮ But... we don’t know MSE(h), as we just have n observations and cannot generate new random observations Y
◮ A first idea is to compute, for various values of the bandwidth h:
  ◮ the estimator m̂h for a training sample
  ◮ the error
    n⁻¹ Σ_{i=1}^n (Yi − m̂h(xi))² = n⁻¹ Σ_{i=1}^n (Yi − Ŷi)²
◮ Choose the bandwidth with the smallest training error
CMB data
Choosing the bandwidth
◮ We would choose a bandwidth close to h = 0, giving near perfect interpolation of the data, that is m̂(xi) ≈ Yi
◮ This is unsurprising
◮ Parametric context
  ◮ Shape of the model is fixed
  ◮ Minimising the MSE makes the parametric model as close as possible to the data
◮ Nonparametric setting
  ◮ We don’t have a fixed shape
  ◮ The value of h dictates the model
  ◮ Minimising the MSE → fitted model as close as possible to the data
  ◮ Leads us to choose h as small as possible
  ◮ Interpolation of the data
◮ Misleading result → only noise is fitted for very small bandwidths
Cross-validation
◮ Solution... don’t use Xi to construct m̂(Xi)
◮ This is the idea behind cross-validation
  ◮ Leave-one-out cross-validation
  ◮ Least squares cross-validation
◮ For each value of h:
  ◮ For each i = 1, . . . , n, compute the estimator m̂h^(−i)(x), where m̂h^(−i)(x) is computed without using observation i
  ◮ The estimated MSE is then given by
    M̂SE(h) = n⁻¹ Σᵢ (Yi − m̂h^(−i)(xi))²
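As a concrete sketch of leave-one-out bandwidth selection for the Nadaraya-Watson smoother, the code below reuses the nw() helper sketched earlier and simply loops over candidate bandwidths (the grid of h values is illustrative):

> # Direct (brute-force) leave-one-out CV over a grid of bandwidths
> loocv <- function(h, x, y) {
+   mean(sapply(seq_along(x), function(i) {
+     (y[i] - nw(x[i], x[-i], y[-i], h))^2   # predict point i from all other points
+   }))
+ }
> hs <- seq(0.01, 0.2, length.out = 20)
> cv <- sapply(hs, loocv, x = x, y = y)
> hs[which.min(cv)]   # bandwidth minimising the estimated MSE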
CMB data
CMB data
A bandwidth of 44 minimises the estimated MSE
Cross-validation
◮ A drawback of LOO-CV is that it is expensive to compute
◮ The fit has to be recalculated n times (once for each left-out observation)
◮ We can avoid needing to calculate m̂^(−i)(x) for all i
◮ For some n × n matrix S, the linear smoother fulfils Ŷ = SY
◮ The risk (MSE) under LOO-CV can subsequently be written as
  M̂SE(h) = n⁻¹ Σ_{i=1}^n ((Yi − m̂h(xi)) / (1 − Sii))²
Cross-validation
◮ We do not need to recompute m̂h while leaving out each of the n observations in turn
◮ The same results can be obtained much faster by rescaling the residuals Yi − m̂h(xi) by the factor 1/(1 − Sii)
◮ Sii is the i-th diagonal entry of the smoothing matrix
Generalized Cross-Validation
M̂SE(h) = n⁻¹ Σ_{i=1}^n ((Yi − m̂h(xi)) / (1 − Sii))²
Replace Sii by its average ν/n (where ν = Σᵢ Sii).
Choose the bandwidth h that minimizes
  GCV(h) = n⁻¹ Σ_{i=1}^n ((Yi − m̂h(xi)) / (1 − ν/n))²
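The shortcut and GCV criteria are easy to compute once the smoothing matrix is available; a sketch for the Nadaraya-Watson smoother (Gaussian kernel, same generic x and y, bandwidth illustrative):

> # LOO-CV shortcut and GCV from the explicit smoothing matrix S
> cv_scores <- function(h, x, y) {
+   K <- outer(x, x, function(a, b) dnorm((a - b) / h))
+   S <- K / rowSums(K)                           # S[i, j] = K((x_i - x_j)/h) / sum_k K((x_i - x_k)/h)
+   res <- y - as.vector(S %*% y)                 # residuals Y_i - mhat_h(x_i)
+   c(loocv = mean((res / (1 - diag(S)))^2),      # rescale residuals by 1/(1 - S_ii)
+     gcv   = mean((res / (1 - mean(diag(S))))^2))  # replace S_ii by its average nu/n
+ }
> cv_scores(0.05, x, y)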
Local polynomial regression
Nadaraya-Watson kernel estimator
◮ A major disadvantage of the Nadaraya-Watson kernel estimator → boundary bias
◮ The bias is of large order at the boundaries
Local polynomial regression
◮ Even when a curve doesn’t look like a polynomial:
  ◮ restrict to a small neighbourhood of a given point, x
  ◮ approximate the curve by a polynomial in that neighbourhood
  ◮ fit its coefficients using only observations Xi close to x
  ◮ (or rather, putting more weight on observations close to x)
◮ Repeat this procedure at every point x where we want to estimate m(x)
Local polynomial regression
◮ Kernel estimator approximates the data by taking local
averages within small bandwidths
◮ Use local linear regression to obtain an approximation
Kernel regression estimator
Recall that the kernel regression estimator is the solution to:
  m̂(x) = argmin_{m(x) ∈ R} Σ_{i=1}^n K((x − xi)/h) (Yi − m(x))²
This is given by
  m̂(x) = Σ_{i=1}^n Yi K((x − Xi)/h) / Σ_{j=1}^n K((x − Xj)/h)
Thus estimation corresponds to the solution of a weighted sum of squares problem.
Local polynomial regression
Using a Taylor series, we can approximate m(x), for x close to a point x0, by the following polynomial:
  m(x) ≈ m(x0) + m^(1)(x0)(x − x0) + [m^(2)(x0)/2!](x − x0)² + · · · + [m^(p)(x0)/p!](x − x0)^p
       = m(x0) + β1(x − x0) + β2(x − x0)² + · · · + βp(x − x0)^p
where m^(k)(x0) = k!βk, provided that all the required derivatives exist.
Local polynomial regression
◮ Use the data to estimate the polynomial of degree p which best approximates m(xi) in a small neighbourhood around the point x
◮ Minimise with respect to β0, β1, . . . , βp the function:
  Σᵢ {Yi − β0 − β1(xi − x) − . . . − βp(xi − x)^p}² K((x − xi)/h)
◮ This is a weighted least squares problem, where the weights are given by the kernel values K((x − xi)/h)
◮ As m^(k)(x) = k!βk, we then have that m(x) = β0, or m̂(x) = β̂0
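A minimal sketch of this local weighted least squares fit at a single point x0, using lm() with kernel weights (x, y, the point x0 and the bandwidth below are illustrative; locpoly() in the KernSmooth package performs the same fit on a grid):

> # Local linear (p = 1) estimate of m(x0) via weighted least squares
> local_linear <- function(x0, x, y, h) {
+   w <- dnorm((x - x0) / h)                 # kernel weights K((x - x0)/h)
+   fit <- lm(y ~ I(x - x0), weights = w)    # degree 1 polynomial in (x - x0)
+   unname(coef(fit)[1])                     # beta0-hat estimates m(x0)
+ }
> local_linear(0.5, x, y, h = 0.05)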
CMB data
Red: Kernel smoother (p = 0) Green: Local linear regression (p = 1)
Boundary bias: kernel estimator
Let ℓi(x) = ωi(x) / Σⱼ ωⱼ(x), so that
  m̂(x) = Σᵢ ℓi(x) Yi
For the kernel smoother (p = 0), the bias of the linear smoother is thus
  Bias = E(m̂(x)) − m(x) = m′(x) Σᵢ (xi − x)ℓi(x) + [m′′(x)/2] Σᵢ (xi − x)²ℓi(x) + R
Boundary bias: Kernel estimator
The first term in the expansion is equal to
  m′(x) Σᵢ (xi − x)K((x − xi)/h) / Σᵢ K((x − xi)/h)
◮ It vanishes if the design points xi are centred symmetrically around x
◮ It does not vanish if x sits at the boundary (all xi − x will have the same sign)
Boundary bias: polynomial estimator
◮ Suppose m(x) is truly a local polynomial of degree p
◮ and there are at least p + 1 points with positive weights in the neighbourhood of x
The bias will hence be of order
  Bias = E(m̂(x)) − m(x) = [m^(p+1)(x)/(p + 1)!] Σⱼ (xj − x)^(p+1) ℓj(x) + R
Why not choose p = 20?
Boundary bias: polynomial estimator
◮ Yi = m(xi) + σεi with εi ∼ N(0, 1)
◮ The variance of the linear smoother m̂(x) = Σⱼ ℓj(x)Yj is
  Var(m̂(x)) = σ² Σⱼ ℓj²(x) = σ²‖ℓ(x)‖²
◮ ‖ℓ(x)‖² tends to be large if p is large
◮ In practice, p = 1 is a good choice
Example: Doppler function
m(x) = √(x(1 − x)) · sin(2.1π / (x + 0.05)),   0 ≤ x ≤ 1.
Example: Doppler function
> library(KernSmooth)   # provides dpill() and locpoly()
> n <- 1000
> x <- seq(0, 1, length=n)
> m <- sqrt(x*(1-x))*sin(2.1*pi/(x+0.05))
> plot(x, m, type='l')
> y <- m + rnorm(n)*0.075
> plot(x, y)
> fit <- locpoly(x, y, bandwidth=dpill(x,y)*2, degree=1)
> lines(fit, col=2)
> plot(x, y)
> fit2 <- locpoly(x, y, bandwidth=dpill(x,y)/2, degree=1)
> lines(fit2, col=2)
> plot(x, y)
> fit3 <- locpoly(x, y, bandwidth=dpill(x,y)/4, degree=1)
> lines(fit3, col=2)
Example: Doppler function
Penalised regression
Penalised regression
Regression model, i = 1, . . . , n:
  Yi = m(xi) + εi,   E(εi) = 0
Estimating m by choosing m̂(x) to minimize
  Σ_{i=1}^n (Yi − m̂(xi))²
leads to
◮ the linear regression estimate if minimizing over all linear functions
◮ an interpolation of the data if minimizing over all functions.
Penalised regression
Estimate m by choosing m̂(x) to minimize
  Σ_{i=1}^n (Yi − m̂(xi))² + λJ(m)
J(m): roughness penalty, typically J(m) = ∫ (m′′(x))² dx.
The parameter λ controls the trade-off between fit and penalty:
◮ For λ = 0: interpolation
◮ For λ → ∞: the linear least squares line
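A quick sketch of this trade-off with smooth.spline() on the Doppler data simulated above (the two λ values are purely illustrative extremes):

> # Penalised regression fits for a very small and a very large lambda
> fit_rough  <- smooth.spline(x, y, lambda = 1e-9)   # near-interpolation of the data
> fit_smooth <- smooth.spline(x, y, lambda = 10)     # close to the least squares line
> plot(x, y)
> lines(fit_rough, col = 2)
> lines(fit_smooth, col = 4)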
Example: Doppler function
Splines
Splines
◮ Kernel regression
  ◮ The researcher isn’t interested in actually calculating m(x) for a single location x
  ◮ m(x) is calculated on a sufficiently fine grid of x-values
  ◮ The curve is obtained by interpolation
◮ Local polynomial regression: the unknown mean function was assumed to be locally well approximated by a polynomial
◮ Alternative approach
  ◮ Represent the fit as a piecewise polynomial
  ◮ Pieces connect at points called knots
  ◮ Once the knots are selected, an estimator can be computed globally, in a manner similar to that for a parametrically specified mean function
This is the idea behind splines
Splines
IID sample (X1, Y1), (X2, Y2), . . . , (Xn, Yn) coming from the model
  Yi = m(Xi) + εi
We want to estimate the mean of the variable Y with m(x) = E(Y | X = x). A very naive estimator of E(Y | X = x) would be the sample mean of the Yi:
  m̂(x) = (Σ_{i=1}^n Yi) / n
Not very good (same for all x).
Splines
Approximate m by piecewise polynomials, each on a small interval:
  m̂(x) = c1      if x < ξ1
          c2      if ξ1 ≤ x < ξ2
          ...
          ck      if ξk−1 ≤ x < ξk
          ck+1    if x ≥ ξk
Splines
Use more general lines, which join at the ξs:
  m̂(x) = a1 + b1x        if x < ξ1
          a2 + b2x        if ξ1 ≤ x < ξ2
          ...
          ak + bkx        if ξk−1 ≤ x < ξk
          ak+1 + bk+1x    if x ≥ ξk
The a’s and b’s are such that the lines join at each ξ
Splines
Approximate m(x) by polynomials:
  m̂(x) = Σ_{j=0}^p β1,j x^j      if x < ξ1
          Σ_{j=0}^p β2,j x^j      if ξ1 ≤ x < ξ2
          ...
          Σ_{j=0}^p βk,j x^j      if ξk−1 ≤ x < ξk
          Σ_{j=0}^p βk+1,j x^j    if x ≥ ξk
The βs are such that the polynomials join at each ξ and the approximation has p − 1 continuous derivatives. Splines which are piecewise polynomials of degree p are called
◮ splines of order p + 1, or
◮ splines of degree p
◮ ξ: knots
Splines
Piecewise constant splines
Knots
How many knots should we have?
◮ Choose a lot of knots, well spread over the data range
  → reduces the bias of the estimator
◮ If we make it too local
  → the estimator will be too wiggly
◮ To overcome the bias problem without increasing the variance
  → take a lot of knots, but constrain their influence
◮ We can do this using penalised regression
Spline order
What order spline should we use?
◮ Increasing the value of p
  → makes the estimator m̂p smoother (since it has p − 1 continuous derivatives)
◮ If we take p too large
  → it increases the number of parameters to estimate
◮ In practice it is rarely useful to take p > 3
◮ p = 2: splines of order three, or quadratic splines
◮ p = 3: splines of order four, or cubic splines
◮ A p-th order spline is a piecewise degree p − 1 polynomial with p − 2 continuous derivatives at the knots
Natural splines
◮ Natural spline: a spline that is linear beyond the boundary knots
◮ Why this constraint?
  ◮ We usually have very few observations beyond the two extreme knots
  ◮ We want to obtain an estimator of the regression curve there
  ◮ We cannot reasonably estimate anything correct there
  ◮ Rather use a simplified model (e.g. linear)
  ◮ This often gives more or less reasonable results
Natural cubic splines
ξ1 < ξ2 < . . . < ξn: a set of ordered points, so-called knots, contained in an interval (a, b).
A cubic spline is a continuous function m such that
(i) m is a cubic polynomial over (ξ1, ξ2), . . ., and
(ii) m has continuous first and second derivatives at the knots.
The solution to
  Σ_{i=1}^n (yi − m̂(xi))² + λ ∫ (m̂′′(x))² dx
is a natural cubic spline with knots at the data points; m̂(x) is called a smoothing spline.
Natural cubic splines
◮ Sequence of values f1, . . . , fn at specified locations x1 < x2 < · · · < xn
◮ Find a smooth curve g(x) that passes through the points (x1, f1), (x2, f2), . . . , (xn, fn)
◮ The natural cubic spline g is an interpolating function that satisfies the following conditions:
  (i) g(xj) = fj, j = 1, . . . , n,
  (ii) g(x) is cubic on each subinterval (xk, xk+1), k = 1, . . . , (n − 1),
  (iii) g(x) is continuous and has continuous first and second derivatives,
  (iv) g′′(x1) = g′′(xn) = 0.
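Natural cubic spline interpolation is available in base R through splinefun(); a small sketch with arbitrary example points:

> # Natural cubic spline g interpolating (x1, f1), ..., (xn, fn)
> xk <- c(0, 1, 2, 3, 4)
> fk <- c(0, 2, 1, 3, 2)
> g <- splinefun(xk, fk, method = "natural")   # enforces g''(x1) = g''(xn) = 0
> g(xk)                                        # reproduces fk exactly at the knots
> curve(g(x), from = 0, to = 4)
> points(xk, fk)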
B-splines
◮ We need a basis for natural polynomial splines
◮ A convenient choice is the so-called B-spline basis
◮ Data points a = ξ0 < ξ1 < ξ2 < . . . < ξn ≤ ξn+1 = b in (a, b)
◮ There are n + 2 real values
  ◮ The n ≥ 0 values ξ1, . . . , ξn are called ‘interior knots’ or ‘control points’
  ◮ And there are always two endpoints, ξ0 and ξn+1
◮ When the knots are equidistant they are said to be ‘uniform’
B-splines
Now define new knots τ as
◮ τ1 ≤ . . . ≤ τp = ξ0 = a
◮ τj+p = ξj
◮ b = ξn+1 = τn+p+1 ≤ τn+p+2 ≤ . . . ≤ τn+2p
where
◮ p: order of the polynomial
◮ p + 1 is the order of the spline
◮ The lower and upper boundary knots ξ0 and ξn+1 are appended p times
◮ This is needed due to the recursive nature of B-splines
B-splines
Define recursively:
◮ For k = 0 and i = 1, . . . , n + 2p:
  Bi,0(x) = 1 if τi ≤ x < τi+1, and 0 otherwise
◮ For k = 1, 2, . . . , p and i = 1, . . . , n + 2p:
  Bi,k(x) = [(x − τi)/(τi+k−1 − τi)] Bi,k−1(x) + [(τi+k − x)/(τi+k − τi+1)] Bi+1,k−1(x)
The support of Bi,k(x) is [τi, τi+k]
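Rather than coding this recursion by hand, a B-spline basis can be generated in R with the splines package; a small sketch (the interior knot positions here are arbitrary):

> # Cubic B-spline basis with three interior knots, evaluated on a grid
> library(splines)
> xg <- seq(0, 1, length.out = 200)
> B <- bs(xg, knots = c(0.25, 0.5, 0.75), degree = 3, intercept = TRUE)
> dim(B)                      # one column per basis function
> matplot(xg, B, type = "l")  # plot the basis functions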
B-splines
Solving
◮ The solution depends on the regularization parameter λ
  ◮ It determines the amount of “roughness”
  ◮ Choosing λ isn’t necessarily intuitive
◮ Degrees of freedom = trace of the smoothing matrix S
  ◮ The sum of the eigenvalues of
    S = B(BᵀB + λΩ)⁻¹Bᵀ
◮ There is a monotone relationship between df and λ
◮ Search for the value of λ that gives the desired df
  ◮ df = 2 → linear regression
  ◮ df = n → interpolate the data exactly
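In R, smooth.spline() lets you specify the equivalent degrees of freedom directly and reports the λ it corresponds to; a short sketch on the Doppler data (df = 50 is an arbitrary choice):

> # Specify the smoothing spline by degrees of freedom rather than lambda
> fit_df <- smooth.spline(x, y, df = 50)
> fit_df$lambda                 # the lambda value found for df = 50
> plot(x, y)
> lines(fit_df, col = 2)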
Example: Doppler function
Example: Doppler function
Could of course choose λ by LOO-CV or GCV
Cross validation
> plot(x, y)
> fitcv <- smooth.spline(x, y, cv=T)
> lines(fitcv, col=2)
> fitcv
Call:
smooth.spline(x = x, y = y, cv = T)

Smoothing Parameter  spar= 0.157514  lambda= 2.291527e-08 (16 iterations)
Equivalent Degrees of Freedom (Df): 124.738
Penalized Criterion: 6.071742
PRESS: 0.007898575
Generalised cross validation
> plot(x, y)
> fitgcv <- smooth.spline(x, y, cv=F)
> lines(fitgcv, col=4)
> fitgcv
Call:
smooth.spline(x = x, y = y, cv = F)

Smoothing Parameter  spar= 0.1597504  lambda= 2.378386e-08 (15 iterations)
Equivalent Degrees of Freedom (Df): 124.2353
Penalized Criterion: 6.078626
GCV: 0.007925571
Multivariate smoothing
Multivariate smoothing
◮ So far we have only considered univariate functions
◮ Suppose there are several predictors that we would like to treat nonparametrically
◮ Most ‘interesting’ statistical problems nowadays are high-dimensional with, easily, p > 1000
  ◮ Biology: microarrays, gene maps, network inference
  ◮ Finance: prediction from multivariate time series
  ◮ Physics: climate models
◮ Can we just extend the methods and model functions R^p → R nonparametrically?
Curse of dimensionality
◮ One might consider multidimensional smoothers aimed at estimating Y = m(x1, x2, . . . , xp)
◮ The methods considered rely on ‘local’ approximations
◮ They examine the behaviour of data points in the neighbourhood of the point of interest
◮ What is ‘local’ and what is a ‘neighbourhood’ if p → ∞ and n stays constant?
Curse of dimensionality
x = (x^(1), x^(2), . . . , x^(p)) ∈ [0, 1]^p. To get 5% of all n sample points into a cube-shaped neighbourhood of x, we need a cube with side length 0 < ℓ < 1 such that ℓ^p = 0.05:

Dimension p    Side length ℓ
1              0.05
2              0.22
5              0.54
10             0.74
1000           0.997
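The side lengths in the table follow directly from ℓ = 0.05^(1/p):

> # Side length needed to capture 5% of uniform points in [0, 1]^p
> p <- c(1, 2, 5, 10, 1000)
> round(0.05^(1/p), 3)
[1] 0.050 0.224 0.549 0.741 0.997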
Additive models
Require the function m : R^p → R to be of the form
  m_add(x) = µ + m1(x^(1)) + m2(x^(2)) + . . . + mp(x^(p)) = µ + Σ_{j=1}^p mj(x^(j)),   µ ∈ R
where each mj(·) : R → R is just a univariate nonparametric function, with E[mj(x^(j))] = 0 for j = 1, . . . , p
◮ The choice of smoother is left open
◮ Avoids the curse of dimensionality → ‘less flexible’
◮ The functions can be estimated by ‘backfitting’
Backfitting
Data x_i^(j), 1 ≤ i ≤ n and 1 ≤ j ≤ p. A linear smoother for variable j can be described by an n × n matrix S^(j), so that m̂j = S^(j)Y, where
◮ Y = (Y1, . . . , Yn)ᵀ: observed vector of responses
◮ m̂j = (m̂j(x_1^(j)), . . . , m̂j(x_n^(j))): regression fit
◮ S^(j): smoother with bandwidth estimated by LOO-CV or GCV
Backfitting
m̂_add(x) = µ̂ + Σ_{j=1}^p m̂j(x^(j))
Suppose µ̂ and m̂k are given for all k ≠ j. Then
  m̂_add(xi) = [µ̂ + Σ_{k≠j} m̂k(x_i^(k))] + m̂j(x_i^(j))
Now apply the smoother S^(j) to Y − [µ̂ + Σ_{k≠j} m̂k]
Cycle through all j = 1, . . . , p to get
  m̂_add(xi) = µ̂ + Σ_{j=1}^p m̂j(x_i^(j))
Backfitting
1. Set µ̂ ← n⁻¹ Σ_{i=1}^n Yi. Start with m̂j ≡ 0 for all j = 1, . . . , p
2. Cycle through the indices j = 1, 2, . . . , p, 1, 2, . . . , p, . . .:
   m̂j ← S^(j)(Y − µ̂1 − Σ_{k≠j} m̂k)
   Also normalize m̂j(·) ← m̂j(·) − n⁻¹ Σ_{i=1}^n m̂j(x_i^(j))
   and update µ̂ ← n⁻¹ Σ_{i=1}^n (Yi − Σ_k m̂k(x_i^(k)))
   Stop the iterations if the functions do not change very much
3. Return
   m̂_add(xi) ← µ̂ + Σ_{j=1}^p m̂j(x_i^(j))