Computational Statistics Lectures 10-13: Smoothing and Nonparametric Inference
Dr Jennifer Rogers
Hilary Term 2017
Background
Smoothing and nonparametric methods
◮ Approximating function that attempts to capture important patterns in datasets or images
◮ Leaves out noise
◮ Aids data analysis by being able to extract more information from the data
◮ Analyses are flexible and robust
◮ Should we always just use a nonparametric estimator?
Smoothing and nonparametric methods
No!
◮ There is no miracle!
◮ There is a price to pay for the gain in generality
◮ When we have clear evidence of a good parametric model for the data, we should use it
◮ Nonparametric estimators converge to the true curve more slowly than VALID parametric estimators
◮ But as soon as the parametric model is incorrect, a parametric estimator will never converge to the true curve
So nonparametric methods have their place!
The regression problem
◮ Goal of regression → discover the relationship between two variables, X and Y
◮ Wish to find a curve m that passes “in the middle” of the points
◮ Observations (xi, Yi) for i = 1, . . . , n
◮ xi is a real-valued variable
◮ Yi is a real-valued random response
◮ Yi = m(xi) + εi for i = 1, . . . , n
◮ E(εi | Xi) = 0
◮ Var(εi | Xi) = σ²(Xi)
◮ m(x) = E(Y | X = x)
◮ m(·): regression function
◮ Reflects the relationship between X and Y
◮ Curve of interest that “lies in the middle” of all the points
◮ Goal is to infer m(x) from observations (xi, Yi)
Example: Cosmic microwave background data
Example: FTSE stock market index
Linear smoothers
Linear smoothers
◮ In the parametric context, we assume we know the shape of m(·)
◮ Linear model: Y = α + βX + ε
◮ m(x) = E(Y | X = x) = α + βx
◮ We estimate α and β from the data
◮ Least squares estimator:
  (α̂, β̂) = argmin_{(α,β)} Σᵢ (Yi − α − βXi)²
◮ Consider m(xi) = 1 + 2xi
◮ We can fit a linear model to the data and obtain α̂ = 0.9905 and β̂ = 2.0025
Linear modelling
> x <- seq(from=0, to=1, length.out=1000)
> e <- rnorm(1000, 0, 0.2)
> y1 <- 2*x + 1 + e
> lm(y1 ~ x)

Call:
lm(formula = y1 ~ x)

Coefficients:
(Intercept)            x
     0.9905       2.0025
Linear modelling
Linear modelling
◮ Linear model: linear in the parameters!
◮ No higher order terms such as αβ or β²
◮ Not necessarily linear in Xi
◮ Examples of linear models:
  ◮ Yi = 8Xi³ − 13.6Xi² + 7.28Xi − 1.176 + ε
  ◮ Yi = cos(20Xi) + ε
Linear modelling
> lm(y2 ~ x + I(x^2) + I(x^3))
Linear modelling
> lm(y3 ~ I(cos(20*x)))
Linear modelling
◮ Can still use linear modelling
◮ Requires knowledge of the functional form of the explanatory variables
◮ May not always be obvious
◮ Consider linear smoothers - much more general
◮ Obtain a non-trivial smoothing matrix even for just a single ‘predictor’ variable (p = 1)
Linear smoothers
◮ For some n × n matrix S, Ŷ = SY
◮ The fitted value Ŷi at design point xi is a linear combination of the measurements:
  Ŷi = Σ_{j=1}^n Sij Yj
◮ Linear regression with p predictor variables:
  θ̂ = argmin_θ Σᵢ (Yi − Xiθ)² = argmin_θ (Y − Xθ)ᵀ(Y − Xθ) = argmin_θ (YᵀY − 2θᵀXᵀY + θᵀXᵀXθ)
◮ Differentiating with respect to θ and setting to zero:
  −2XᵀY + 2XᵀXθ = 0, so θ̂ = (XᵀX)⁻¹XᵀY
Linear smoothers
θ̂ = (XᵀX)⁻¹XᵀY
◮ The estimated (fitted) values Ŷ are Xθ̂:
  Ŷ = X(XᵀX)⁻¹XᵀY = SY
◮ S is an n × n matrix
◮ It is the hat matrix, H, from linear regression
Linear smoothers
◮ Degrees of freedom for linear regression:
  tr(X(XᵀX)⁻¹Xᵀ) = tr(XᵀX(XᵀX)⁻¹) = tr(I_p) = p
◮ How large is the expected residual sum of squares,
  E(RSS) = E(Σ_{i=1}^n (Ŷi − Yi)²)?
◮ If Ŷ = Xθ, then E(RSS) = E(Σ_{i=1}^n εi²) = nσ²
◮ If Ŷ = SY, then Ŷ − Y = Sε − ε and E(RSS) = σ²(n − p)
◮ A good estimator of σ² is thus σ̂² = RSS/(n − p) = RSS/(n − df)
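To illustrate these identities, here is a minimal sketch (reusing the simulated x and y1 from the earlier lm() example) that builds the hat matrix explicitly, checks that the fitted values agree with lm(), and that df = tr(S):

> # Hat matrix for the straight-line fit: S = X (X'X)^{-1} X'
> X <- cbind(1, x)                           # design matrix, p = 2 columns
> S <- X %*% solve(t(X) %*% X) %*% t(X)
> yhat <- S %*% y1                           # fitted values, Yhat = S Y
> all.equal(as.vector(yhat), as.vector(fitted(lm(y1 ~ x))))
> sum(diag(S))                               # df = tr(S) = p = 2
> sum((y1 - yhat)^2) / (length(y1) - sum(diag(S)))   # sigma^2 estimate, RSS/(n - df)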
Linear smoothers
Ŷ = m̂(x) = SY
◮ Serum data, taken in connection with diabetes research
◮ Y: log-concentration of a serum
◮ x: age of children in months
◮ There are various ways in which S can be chosen:
Linear smoothers
The methods considered fall into two categories:
◮ Local regression, including
  ◮ kernel estimators and
  ◮ local polynomial regression.
◮ Penalized estimators, mainly
  ◮ smoothing splines.
Local estimation: Kernel estimators
Histogram
◮ X ∼ F
◮ P(x ≤ X ≤ x + Δx) = ∫_x^{x+Δx} fX(v) dv
◮ Thus, for any u ∈ [x, x + Δx]:
  P(x ≤ X ≤ x + Δx) ≈ Δx · fX(u)
◮ This implies
  fX(u) ≈ P(x ≤ X ≤ x + Δx) / Δx, estimated by f̂X(u) = #{Xi : x ≤ Xi ≤ x + Δx} / (nΔx)
This is the idea behind histograms.
Histogram
◮ Choose an origin t0
◮ Choose a bin size, h
◮ Partition the real line into intervals Ik = [tk, tk+1] of equal length h
◮ Histogram estimator, for x in bin Ik:
  f̂H(x) = #{Xi : tk ≤ Xi ≤ tk+1} / (nh)
A step function that depends heavily on both the origin, t0, and the bin width, h.
Example: Old Faithful geyser
Duration in minutes of 272 eruptions of the Old Faithful geyser in Yellowstone National Park
> hist(faithful$eruptions,probability = T)
Example: Old Faithful geyser
What happens if we change the time origin and the bin width?
> hist(faithful$eruptions, breaks=seq(1.5,5.5,1), probability = T, xlim=c(1,5.5))
> hist(faithful$eruptions, breaks=seq(1.1,5.1,1), probability = T, xlim=c(1,5.5))
> hist(faithful$eruptions, breaks=seq(0.5,5.5,0.5), probability = T, xlim=c(1,5.5))
> hist(faithful$eruptions, breaks=seq(0.75,5.75,0.5), probability = T, xlim=c(1,5.5))
Example: Old Faithful geyser
Density estimator
Can we do better? We can get rid of the time origin.
  fX(x) = lim_{h→0} [FX(x + h) − FX(x)] / h = lim_{h→0} [FX(x) − FX(x − h)] / h
Combining the two expressions:
  fX(x) = lim_{h→0} [FX(x + h) − FX(x − h)] / (2h) = lim_{h→0} P(x − h < X < x + h) / (2h)
which we can estimate using proportions:
  f̂X(x) = (1/(nh)) Σ_{i=1}^n K((x − Xi)/h),
with K(x) = (1/2) · I{|x| < 1}
Density estimator
◮ Similar to the histogram
◮ No longer have the origin, t0
◮ More flexible
◮ Constructs a box of length 2h around each observation Xi
◮ The estimator is then the sum of the boxes at x
◮ A density that depends on the bandwidth, h
Kernel estimators
◮ Put a smooth symmetric ‘bump’ of shape K around each observation
◮ The estimator at x is now the sum of the bumps at x
◮ We define
  f̂X(x) = (1/(nh)) Σ_{i=1}^n K((x − Xi)/h),
◮ K: ‘kernel’ function
◮ The estimator inherits the properties of K
  ◮ If K is continuous and differentiable → so is the estimator
  ◮ The estimator is a density if K is a density
◮ The shape of K does not influence the resulting estimator much
◮ The estimator does depend heavily on h
Kernel estimators
Kernels
A kernel is a real-valued function K(x) such that
◮ K(x) ≥ 0 for all x ∈ R,
◮ ∫ K(x) dx = 1,
◮ ∫ xK(x) dx = 0.
In practice, the choice of K does not influence the results much, but the value of h is crucial.
Kernels
Commonly used kernels include
◮ Boxcar: K(x) = I(x)/2
◮ Gaussian: K(x) = (2π)^{−1/2} exp(−x²/2)
◮ Epanechnikov: K(x) = (3/4)(1 − x²)I(x)
◮ Biweight: K(x) = (15/16)(1 − x²)²I(x)
◮ Triweight: K(x) = (35/32)(1 − x²)³I(x)
◮ Uniform: K(x) = (1/2)I(x)
where I(x) = 1 if |x| ≤ 1 and I(x) = 0 otherwise
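In R, kernel density estimates can be compared across kernels and bandwidths with the built-in density() function; a small sketch on the Old Faithful eruption durations (note that the bw argument is the standard deviation of the smoothing kernel, a rescaling of h, and the values below are illustrative):

> # Kernel density estimates with different kernels and bandwidths
> d1 <- density(faithful$eruptions, kernel = "gaussian", bw = 0.1)
> d2 <- density(faithful$eruptions, kernel = "epanechnikov", bw = 0.1)
> d3 <- density(faithful$eruptions, kernel = "gaussian", bw = 0.5)
> plot(d1)
> lines(d2, col = 2)   # changing the kernel shape changes little
> lines(d3, col = 4)   # changing the bandwidth changes a lot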
Example: Old Faithful geyser
Kernel regression
Kernel regression
Y = m(x) + ε
1. We want to estimate E(Y | X = x). Naive estimator:
   m̂(x) = (Σ_{i=1}^n Yi) / n
   Same for all x.
2. Average the Yis of only those Xis that are close to x (local average):
   m̂(x) = Σ_{i=1}^n Yi · I{|Xi − x| < h} / Σ_{i=1}^n I{|Xi − x| < h}
   h: bandwidth, determines the size of the neighbourhood around x
Kernel regression
Y = m(x) + ε
3. Give a slowly decreasing weight to Xi as it gets farther from x, rather than giving the same weight to all observations close to x:
   m̂(x) = Σ_{i=1}^n Yi W(x − Xi)
   W(·): weight function that decreases as the distance |x − Xi| increases, with Σ_{i=1}^n W(x − Xi) = 1
Nadaraya-Watson estimator
W(x − Xi) = K((x − Xi)/h) / Σ_{j=1}^n K((x − Xj)/h)
Hence, the Nadaraya-Watson kernel estimator is
  m̂(x) = Σ_{i=1}^n Yi K((x − Xi)/h) / Σ_{j=1}^n K((x − Xj)/h)
The estimated function values Ŷj = m̂(xj) at the observed design points are given by
  Ŷj = Σᵢ Sji Yi, where Sji = K((xj − xi)/h) / Σ_k K((xj − xk)/h)
The kernel smoother is thus a linear smoother.
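A minimal sketch of the Nadaraya-Watson estimator with a Gaussian kernel, for generic vectors x and y of design points and responses (the bandwidth below is illustrative; the built-in ksmooth() with kernel = "normal" gives an equivalent fit):

> # Nadaraya-Watson estimate at a grid of evaluation points
> nw <- function(x0, x, y, h) {
+   sapply(x0, function(u) {
+     w <- dnorm((u - x) / h)   # kernel weights K((u - x_i)/h)
+     sum(w * y) / sum(w)       # locally weighted average of the y_i
+   })
+ }
> xgrid <- seq(min(x), max(x), length.out = 200)
> plot(x, y)
> lines(xgrid, nw(xgrid, x, y, h = 0.05), col = 2)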
Local least squares
We can rewrite the kernel regression estimator as
  m̂(x) = argmin_{mx ∈ R} Σ_{i=1}^n K((x − xi)/h) (Yi − mx)².
Exercise: this can be verified by solving
  d/dmx Σ_{i=1}^n K((x − xi)/h)(Yi − mx)² = 0
◮ Thus, for every fixed x, we have to search for the best local constant mx such that the localized sum of squares is minimized
◮ Localization is here described by the kernel, which gives a large weight to those observations (xi, Yi) where xi is close to the point x of interest. The choice of the bandwidth h is crucial.
Example: FTSE stock market index
Example: Cosmic microwave background data
Choosing the bandwidth
Choosing the bandwidth
◮ Measure the “success” of the fit using the mean squared error on new observations (MSE),
  MSE(h) = E[(Y − m̂h(x))²]
◮ Splitting into noise, bias and variance:
  MSE(h) = Noise + Bias² + Variance
◮ Bias decreases if h ↓ 0
◮ Variance increases if h ↓ 0
◮ Choosing the bandwidth is a bias-variance trade-off.
Choosing the bandwidth
◮ But... we don’t know MSE(h), as we just have n observations and cannot generate new random observations Y
◮ A first idea is to compute, for various values of the bandwidth h:
  ◮ the estimator m̂h for a training sample
  ◮ the error
    n⁻¹ Σ_{i=1}^n (Yi − m̂h(xi))² = n⁻¹ Σ_{i=1}^n (Yi − Ŷi)²
◮ Choose the bandwidth with the smallest training error
CMB data
Choosing the bandwidth
◮ We would choose a bandwidth close to h = 0, giving near perfect interpolation of the data, that is m̂(xi) ≈ Yi
◮ This is unsurprising
◮ Parametric context
  ◮ Shape of the model is fixed
  ◮ Minimising the MSE makes the parametric model as close as possible to the data
◮ Nonparametric setting
  ◮ We don’t have a fixed shape
  ◮ The value of h dictates the model
  ◮ Minimising the MSE → fitted model as close as possible to the data
  ◮ Leads us to choose h as small as possible
  ◮ Interpolation of the data
◮ Misleading result → only noise is fitted for very small bandwidths
Cross-validation
◮ Solution... don’t use Xi to construct m̂(Xi)
◮ This is the idea behind cross-validation
  ◮ Leave-one-out cross-validation
  ◮ Least squares cross-validation
◮ For each value of h:
  ◮ For each i = 1, . . . , n, compute the estimator m̂h^(−i)(x), where m̂h^(−i)(x) is computed without using observation i
  ◮ The estimated MSE is then given by
    M̂SE(h) = n⁻¹ Σᵢ (Yi − m̂h^(−i)(xi))²
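As a concrete sketch of leave-one-out bandwidth selection for the Nadaraya-Watson smoother, the code below reuses the nw() helper sketched earlier and simply loops over candidate bandwidths (the grid of h values is illustrative):

> # Direct (brute-force) leave-one-out CV over a grid of bandwidths
> loocv <- function(h, x, y) {
+   mean(sapply(seq_along(x), function(i) {
+     (y[i] - nw(x[i], x[-i], y[-i], h))^2   # predict point i from all other points
+   }))
+ }
> hs <- seq(0.01, 0.2, length.out = 20)
> cv <- sapply(hs, loocv, x = x, y = y)
> hs[which.min(cv)]   # bandwidth minimising the estimated MSE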
CMB data
CMB data
A bandwidth of 44 minimises the estimated MSE
Cross-validation
◮ A drawback of LOO-CV is that it is expensive to compute
◮ The fit has to be recalculated n times (once for each left-out observation)
◮ We can avoid needing to calculate m̂^(−i)(x) for all i
◮ For some n × n matrix S, the linear smoother fulfils Ŷ = SY
◮ The risk (MSE) under LOO-CV can subsequently be written as
  M̂SE(h) = n⁻¹ Σ_{i=1}^n ((Yi − m̂h(xi)) / (1 − Sii))²
Cross-validation
◮ We do not need to recompute m̂h while leaving out each of the n observations in turn
◮ The same results can be obtained much faster by rescaling the residuals Yi − m̂h(xi) by the factor 1/(1 − Sii)
◮ Sii is the i-th diagonal entry of the smoothing matrix
Generalized Cross-Validation
M̂SE(h) = n⁻¹ Σ_{i=1}^n ((Yi − m̂h(xi)) / (1 − Sii))²
Replace Sii by its average ν/n (where ν = Σᵢ Sii).
Choose the bandwidth h that minimizes
  GCV(h) = n⁻¹ Σ_{i=1}^n ((Yi − m̂h(xi)) / (1 − ν/n))²
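The shortcut and GCV criteria are easy to compute once the smoothing matrix is available; a sketch for the Nadaraya-Watson smoother (Gaussian kernel, same generic x and y, bandwidth illustrative):

> # LOO-CV shortcut and GCV from the explicit smoothing matrix S
> cv_scores <- function(h, x, y) {
+   K <- outer(x, x, function(a, b) dnorm((a - b) / h))
+   S <- K / rowSums(K)                           # S[i, j] = K((x_i - x_j)/h) / sum_k K((x_i - x_k)/h)
+   res <- y - as.vector(S %*% y)                 # residuals Y_i - mhat_h(x_i)
+   c(loocv = mean((res / (1 - diag(S)))^2),      # rescale residuals by 1/(1 - S_ii)
+     gcv   = mean((res / (1 - mean(diag(S))))^2))  # replace S_ii by its average nu/n
+ }
> cv_scores(0.05, x, y)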
Local polynomial regression
Nadaraya-Watson kernel estimator
◮ A major disadvantage of the Nadaraya-Watson kernel estimator → boundary bias
◮ The bias is of large order at the boundaries
Local polynomial regression
◮ Even when a curve doesn’t look like a polynomial:
  ◮ restrict to a small neighbourhood of a given point, x
  ◮ approximate the curve by a polynomial in that neighbourhood
  ◮ fit its coefficients using only observations Xi close to x
  ◮ (or rather, putting more weight on observations close to x)
◮ Repeat this procedure at every point x where we want to estimate m(x)
Local polynomial regression
◮ Kernel estimator approximates the data by taking local
averages within small bandwidths
◮ Use local linear regression to obtain an approximation
Kernel regression estimator
Recall that the kernel regression estimator is the solution to:
  m̂(x) = argmin_{m(x) ∈ R} Σ_{i=1}^n K((x − xi)/h) (Yi − m(x))²
This is given by
  m̂(x) = Σ_{i=1}^n Yi K((x − Xi)/h) / Σ_{j=1}^n K((x − Xj)/h)
Thus estimation corresponds to the solution of a weighted sum of squares problem.
Local polynomial regression
Using a Taylor series, we can approximate m(x), for x close to a point x0, by the following polynomial:
  m(x) ≈ m(x0) + m^(1)(x0)(x − x0) + [m^(2)(x0)/2!](x − x0)² + · · · + [m^(p)(x0)/p!](x − x0)^p
       = m(x0) + β1(x − x0) + β2(x − x0)² + · · · + βp(x − x0)^p
where m^(k)(x0) = k!βk, provided that all the required derivatives exist.
Local polynomial regression
◮ Use the data to estimate the polynomial of degree p which best approximates m(xi) in a small neighbourhood around the point x
◮ Minimise with respect to β0, β1, . . . , βp the function:
  Σᵢ {Yi − β0 − β1(xi − x) − . . . − βp(xi − x)^p}² K((x − xi)/h)
◮ This is a weighted least squares problem, where the weights are given by the kernel values K((x − xi)/h)
◮ As m^(k)(x) = k!βk, we then have that m(x) = β0, or m̂(x) = β̂0
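A minimal sketch of this local weighted least squares fit at a single point x0, using lm() with kernel weights (x, y, the point x0 and the bandwidth below are illustrative; locpoly() in the KernSmooth package performs the same fit on a grid):

> # Local linear (p = 1) estimate of m(x0) via weighted least squares
> local_linear <- function(x0, x, y, h) {
+   w <- dnorm((x - x0) / h)                 # kernel weights K((x - x0)/h)
+   fit <- lm(y ~ I(x - x0), weights = w)    # degree 1 polynomial in (x - x0)
+   unname(coef(fit)[1])                     # beta0-hat estimates m(x0)
+ }
> local_linear(0.5, x, y, h = 0.05)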
CMB data
Red: Kernel smoother (p = 0) Green: Local linear regression (p = 1)
Boundary bias: kernel estimator
Let ℓi(x) = ωi(x) / Σⱼ ωⱼ(x), so that
  m̂(x) = Σᵢ ℓi(x) Yi
For the kernel smoother (p = 0), the bias of the linear smoother is thus
  Bias = E(m̂(x)) − m(x) = m′(x) Σᵢ (xi − x)ℓi(x) + [m′′(x)/2] Σᵢ (xi − x)²ℓi(x) + R
Boundary bias: Kernel estimator
The first term in the expansion is equal to
  m′(x) Σᵢ (xi − x)K((x − xi)/h) / Σᵢ K((x − xi)/h)
◮ It vanishes if the design points xi are centred symmetrically around x
◮ It does not vanish if x sits at the boundary (all xi − x will have the same sign)
Boundary bias: polynomial estimator
◮ Suppose m(x) is truly a local polynomial of degree p
◮ and there are at least p + 1 points with positive weights in the neighbourhood of x
The bias will hence be of order
  Bias = E(m̂(x)) − m(x) = [m^(p+1)(x)/(p + 1)!] Σⱼ (xj − x)^(p+1) ℓj(x) + R
Why not choose p = 20?
Boundary bias: polynomial estimator
◮ Yi = m(xi) + σεi with εi ∼ N(0, 1)
◮ The variance of the linear smoother m̂(x) = Σⱼ ℓj(x)Yj is
  Var(m̂(x)) = σ² Σⱼ ℓj²(x) = σ²‖ℓ(x)‖²
◮ ‖ℓ(x)‖² tends to be large if p is large
◮ In practice, p = 1 is a good choice
Example: Doppler function
m(x) = √(x(1 − x)) · sin(2.1π / (x + 0.05)),   0 ≤ x ≤ 1.
Example: Doppler function
> library(KernSmooth)   # provides dpill() and locpoly()
> n <- 1000
> x <- seq(0, 1, length=n)
> m <- sqrt(x*(1-x))*sin(2.1*pi/(x+0.05))
> plot(x, m, type='l')
> y <- m + rnorm(n)*0.075
> plot(x, y)
> fit <- locpoly(x, y, bandwidth=dpill(x,y)*2, degree=1)
> lines(fit, col=2)
> plot(x, y)
> fit2 <- locpoly(x, y, bandwidth=dpill(x,y)/2, degree=1)
> lines(fit2, col=2)
> plot(x, y)
> fit3 <- locpoly(x, y, bandwidth=dpill(x,y)/4, degree=1)
> lines(fit3, col=2)
Example: Doppler function
Penalised regression
Penalised regression
Regression model, i = 1, . . . , n:
  Yi = m(xi) + εi,   E(εi) = 0
Estimating m by choosing m̂(x) to minimize
  Σ_{i=1}^n (Yi − m̂(xi))²
leads to
◮ the linear regression estimate if minimizing over all linear functions
◮ an interpolation of the data if minimizing over all functions.
Penalised regression
Estimate m by choosing m̂(x) to minimize
  Σ_{i=1}^n (Yi − m̂(xi))² + λJ(m)
J(m): roughness penalty, typically J(m) = ∫ (m′′(x))² dx.
The parameter λ controls the trade-off between fit and penalty:
◮ For λ = 0: interpolation
◮ For λ → ∞: the linear least squares line
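A quick sketch of this trade-off with smooth.spline() on the Doppler data simulated above (the two λ values are purely illustrative extremes):

> # Penalised regression fits for a very small and a very large lambda
> fit_rough  <- smooth.spline(x, y, lambda = 1e-9)   # near-interpolation of the data
> fit_smooth <- smooth.spline(x, y, lambda = 10)     # close to the least squares line
> plot(x, y)
> lines(fit_rough, col = 2)
> lines(fit_smooth, col = 4)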
Example: Doppler function
Splines
Splines
◮ Kernel regression
  ◮ The researcher isn’t interested in actually calculating m(x) for a single location x
  ◮ m(x) is calculated on a sufficiently fine grid of x-values
  ◮ The curve is obtained by interpolation
◮ Local polynomial regression: the unknown mean function was assumed to be locally well approximated by a polynomial
◮ Alternative approach
  ◮ Represent the fit as a piecewise polynomial
  ◮ Pieces connect at points called knots
  ◮ Once the knots are selected, an estimator can be computed globally, in a manner similar to that for a parametrically specified mean function
This is the idea behind splines
Splines
IID sample (X1, Y1), (X2, Y2), . . . , (Xn, Yn) coming from the model
  Yi = m(Xi) + εi
We want to estimate the mean of the variable Y with m(x) = E(Y | X = x). A very naive estimator of E(Y | X = x) would be the sample mean of the Yi:
  m̂(x) = (Σ_{i=1}^n Yi) / n
Not very good (same for all x).
Splines
Approximate m by piecewise polynomials, each on a small interval:
  m̂(x) = c1      if x < ξ1
          c2      if ξ1 ≤ x < ξ2
          ...
          ck      if ξk−1 ≤ x < ξk
          ck+1    if x ≥ ξk
Splines
Use more general lines, which join at the ξs:
  m̂(x) = a1 + b1x        if x < ξ1
          a2 + b2x        if ξ1 ≤ x < ξ2
          ...
          ak + bkx        if ξk−1 ≤ x < ξk
          ak+1 + bk+1x    if x ≥ ξk
The a’s and b’s are such that the lines join at each ξ
Splines
Approximate m(x) by polynomials:
  m̂(x) = Σ_{j=0}^p β1,j x^j      if x < ξ1
          Σ_{j=0}^p β2,j x^j      if ξ1 ≤ x < ξ2
          ...
          Σ_{j=0}^p βk,j x^j      if ξk−1 ≤ x < ξk
          Σ_{j=0}^p βk+1,j x^j    if x ≥ ξk
The βs are such that the polynomials join at each ξ and the approximation has p − 1 continuous derivatives. Splines which are piecewise polynomials of degree p are called
◮ splines of order p + 1, or
◮ splines of degree p
◮ ξ: knots
Splines
Piecewise constant splines
Knots
How many knots should we have?
◮ Choose a lot of knots, well spread over the data range
  → reduces the bias of the estimator
◮ If we make it too local
  → the estimator will be too wiggly
◮ To overcome the bias problem without increasing the variance
  → take a lot of knots, but constrain their influence
◮ We can do this using penalised regression
Spline order
What order spline should we use?
◮ Increasing the value of p
  → makes the estimator m̂p smoother (since it has p − 1 continuous derivatives)
◮ If we take p too large
  → it increases the number of parameters to estimate
◮ In practice it is rarely useful to take p > 3
◮ p = 2: splines of order three, or quadratic splines
◮ p = 3: splines of order four, or cubic splines
◮ A p-th order spline is a piecewise degree p − 1 polynomial with p − 2 continuous derivatives at the knots
Natural splines
◮ Natural spline: a spline that is linear beyond the boundary knots
◮ Why this constraint?
  ◮ We usually have very few observations beyond the two extreme knots
  ◮ We want to obtain an estimator of the regression curve there
  ◮ We cannot reasonably estimate anything correct there
  ◮ Rather use a simplified model (e.g. linear)
  ◮ This often gives more or less reasonable results
Natural cubic splines
ξ1 < ξ2 < . . . < ξn: a set of ordered points, so-called knots, contained in an interval (a, b).
A cubic spline is a continuous function m such that
(i) m is a cubic polynomial over (ξ1, ξ2), . . ., and
(ii) m has continuous first and second derivatives at the knots.
The solution to
  Σ_{i=1}^n (yi − m̂(xi))² + λ ∫ (m̂′′(x))² dx
is a natural cubic spline with knots at the data points; m̂(x) is called a smoothing spline.
Natural cubic splines
◮ Sequence of values f1, . . . , fn at specified locations x1 < x2 < · · · < xn
◮ Find a smooth curve g(x) that passes through the points (x1, f1), (x2, f2), . . . , (xn, fn)
◮ The natural cubic spline g is an interpolating function that satisfies the following conditions:
  (i) g(xj) = fj, j = 1, . . . , n,
  (ii) g(x) is cubic on each subinterval (xk, xk+1), k = 1, . . . , (n − 1),
  (iii) g(x) is continuous and has continuous first and second derivatives,
  (iv) g′′(x1) = g′′(xn) = 0.
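Natural cubic spline interpolation is available in base R through splinefun(); a small sketch with arbitrary example points:

> # Natural cubic spline g interpolating (x1, f1), ..., (xn, fn)
> xk <- c(0, 1, 2, 3, 4)
> fk <- c(0, 2, 1, 3, 2)
> g <- splinefun(xk, fk, method = "natural")   # enforces g''(x1) = g''(xn) = 0
> g(xk)                                        # reproduces fk exactly at the knots
> curve(g(x), from = 0, to = 4)
> points(xk, fk)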
B-splines
◮ We need a basis for natural polynomial splines
◮ A convenient choice is the so-called B-spline basis
◮ Data points a = ξ0 < ξ1 < ξ2 < . . . < ξn ≤ ξn+1 = b in (a, b)
◮ There are n + 2 real values
  ◮ The n ≥ 0 values ξ1, . . . , ξn are called ‘interior knots’ or ‘control points’
  ◮ And there are always two endpoints, ξ0 and ξn+1
◮ When the knots are equidistant they are said to be ‘uniform’
B-splines
Now define new knots τ as
◮ τ1 ≤ . . . ≤ τp = ξ0 = a
◮ τj+p = ξj
◮ b = ξn+1 = τn+p+1 ≤ τn+p+2 ≤ . . . ≤ τn+2p
where
◮ p: order of the polynomial
◮ p + 1 is the order of the spline
◮ The lower and upper boundary knots ξ0 and ξn+1 are appended p times
◮ This is needed due to the recursive nature of B-splines
B-splines
Define recursively:
◮ For k = 0 and i = 1, . . . , n + 2p:
  Bi,0(x) = 1 if τi ≤ x < τi+1, and 0 otherwise
◮ For k = 1, 2, . . . , p and i = 1, . . . , n + 2p:
  Bi,k(x) = [(x − τi)/(τi+k−1 − τi)] Bi,k−1(x) + [(τi+k − x)/(τi+k − τi+1)] Bi+1,k−1(x)
The support of Bi,k(x) is [τi, τi+k]
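Rather than coding this recursion by hand, a B-spline basis can be generated in R with the splines package; a small sketch (the interior knot positions here are arbitrary):

> # Cubic B-spline basis with three interior knots, evaluated on a grid
> library(splines)
> xg <- seq(0, 1, length.out = 200)
> B <- bs(xg, knots = c(0.25, 0.5, 0.75), degree = 3, intercept = TRUE)
> dim(B)                      # one column per basis function
> matplot(xg, B, type = "l")  # plot the basis functions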
B-splines
Solving
◮ The solution depends on the regularization parameter λ
  ◮ It determines the amount of “roughness”
  ◮ Choosing λ isn’t necessarily intuitive
◮ Degrees of freedom = trace of the smoothing matrix S
  ◮ The sum of the eigenvalues of
    S = B(BᵀB + λΩ)⁻¹Bᵀ
◮ There is a monotone relationship between df and λ
◮ Search for the value of λ that gives the desired df
  ◮ df = 2 → linear regression
  ◮ df = n → interpolate the data exactly
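In R, smooth.spline() lets you specify the equivalent degrees of freedom directly and reports the λ it corresponds to; a short sketch on the Doppler data (df = 50 is an arbitrary choice):

> # Specify the smoothing spline by degrees of freedom rather than lambda
> fit_df <- smooth.spline(x, y, df = 50)
> fit_df$lambda                 # the lambda value found for df = 50
> plot(x, y)
> lines(fit_df, col = 2)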
Example: Doppler function
Example: Doppler function
Could of course choose λ by LOO-CV or GCV
Cross validation
> plot(x, y)
> fitcv <- smooth.spline(x, y, cv=T)
> lines(fitcv, col=2)
> fitcv
Call:
smooth.spline(x = x, y = y, cv = T)

Smoothing Parameter  spar= 0.157514  lambda= 2.291527e-08 (16 iterations)
Equivalent Degrees of Freedom (Df): 124.738
Penalized Criterion: 6.071742
PRESS: 0.007898575
Generalised cross validation
> plot(x, y)
> fitgcv <- smooth.spline(x, y, cv=F)
> lines(fitgcv, col=4)
> fitgcv
Call:
smooth.spline(x = x, y = y, cv = F)

Smoothing Parameter  spar= 0.1597504  lambda= 2.378386e-08 (15 iterations)
Equivalent Degrees of Freedom (Df): 124.2353
Penalized Criterion: 6.078626
GCV: 0.007925571
Multivariate smoothing
Multivariate smoothing
◮ So far we have only considered univariate functions
◮ Suppose there are several predictors that we would like to treat nonparametrically
◮ Most ‘interesting’ statistical problems nowadays are high-dimensional with, easily, p > 1000
  ◮ Biology: microarrays, gene maps, network inference
  ◮ Finance: prediction from multivariate time series
  ◮ Physics: climate models
◮ Can we just extend the methods and model functions R^p → R nonparametrically?
Curse of dimensionality
◮ One might consider multidimensional smoothers aimed at estimating Y = m(x1, x2, . . . , xp)
◮ The methods considered rely on ‘local’ approximations
◮ They examine the behaviour of data points in the neighbourhood of the point of interest
◮ What is ‘local’ and what is a ‘neighbourhood’ if p → ∞ and n stays constant?
Curse of dimensionality
x = (x^(1), x^(2), . . . , x^(p)) ∈ [0, 1]^p. To get 5% of all n sample points into a cube-shaped neighbourhood of x, we need a cube with side length 0 < ℓ < 1 such that ℓ^p = 0.05:

Dimension p    Side length ℓ
1              0.05
2              0.22
5              0.54
10             0.74
1000           0.997
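The side lengths in the table follow directly from ℓ = 0.05^(1/p):

> # Side length needed to capture 5% of uniform points in [0, 1]^p
> p <- c(1, 2, 5, 10, 1000)
> round(0.05^(1/p), 3)
[1] 0.050 0.224 0.549 0.741 0.997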
Additive models
Require the function m : R^p → R to be of the form
  m_add(x) = µ + m1(x^(1)) + m2(x^(2)) + . . . + mp(x^(p)) = µ + Σ_{j=1}^p mj(x^(j)),   µ ∈ R
where each mj(·) : R → R is just a univariate nonparametric function, with E[mj(x^(j))] = 0 for j = 1, . . . , p
◮ The choice of smoother is left open
◮ Avoids the curse of dimensionality → ‘less flexible’
◮ The functions can be estimated by ‘backfitting’
Backfitting
Data x_i^(j), 1 ≤ i ≤ n and 1 ≤ j ≤ p. A linear smoother for variable j can be described by an n × n matrix S^(j), so that m̂j = S^(j)Y, where
◮ Y = (Y1, . . . , Yn)ᵀ: observed vector of responses
◮ m̂j = (m̂j(x_1^(j)), . . . , m̂j(x_n^(j))): regression fit
◮ S^(j): smoother with bandwidth estimated by LOO-CV or GCV
Backfitting
m̂_add(x) = µ̂ + Σ_{j=1}^p m̂j(x^(j))
Suppose µ̂ and m̂k are given for all k ≠ j. Then
  m̂_add(xi) = [µ̂ + Σ_{k≠j} m̂k(x_i^(k))] + m̂j(x_i^(j))
Now apply the smoother S^(j) to Y − [µ̂ + Σ_{k≠j} m̂k]
Cycle through all j = 1, . . . , p to get
  m̂_add(xi) = µ̂ + Σ_{j=1}^p m̂j(x_i^(j))
Backfitting
1. Set µ̂ ← n⁻¹ Σ_{i=1}^n Yi. Start with m̂j ≡ 0 for all j = 1, . . . , p
2. Cycle through the indices j = 1, 2, . . . , p, 1, 2, . . . , p, . . .:
   m̂j ← S^(j)(Y − µ̂1 − Σ_{k≠j} m̂k)
   Also normalize m̂j(·) ← m̂j(·) − n⁻¹ Σ_{i=1}^n m̂j(x_i^(j))
   and update µ̂ ← n⁻¹ Σ_{i=1}^n (Yi − Σ_k m̂k(x_i^(k)))
   Stop the iterations if the functions do not change very much
3. Return
   m̂_add(xi) ← µ̂ + Σ_{j=1}^p m̂j(x_i^(j))