

SLIDE 1

Lecture 14. Nonparametric GLMs (cont.) Nan Ye

School of Mathematics and Physics University of Queensland

1 / 22

SLIDE 2

Recall: Nonparametric Models

Parametric models

  • Fixed structure and number of parameters.
  • Represent a fixed class of functions.

Nonparametric models

  • Flexible structure where the number of parameters usually grows as more data becomes available.
  • The class of functions represented depends on the data.
  • Not models without parameters, but nonparametric in the sense that they do not have the fixed structure and fixed number of parameters of parametric models.

SLIDE 3

This Lecture

  • Smoothing splines
  • Generalized additive models

SLIDE 4

Smoothing Splines

If we fit a degree 8 polynomial on these 9 points, will the polynomial be a good fit?

[Figure: 9 data points on the actual curve; axes x and y, both from −1.0 to 1.0.]

SLIDE 5

No...

[Figure: the actual curve and the oscillating degree-8 polynomial fit; axes x and y, both from −1.0 to 1.0.]

Runge phenomenon: polynomial fits can be very unstable.
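The slide's data are not reproduced here, but the instability is easy to demonstrate with the textbook example behind the Runge phenomenon. A minimal sketch, assuming the classic function f(x) = 1/(1 + 25x²) and 9 equispaced points (not the lecture's curve):

```python
import numpy as np

# Interpolate f(x) = 1 / (1 + 25 x^2) at 9 equispaced points on [-1, 1]
# with a degree-8 polynomial (9 points, degree 8: exact interpolation).
f = lambda x: 1.0 / (1.0 + 25.0 * x**2)

x_knots = np.linspace(-1, 1, 9)
coeffs = np.polyfit(x_knots, f(x_knots), deg=8)

x_grid = np.linspace(-1, 1, 1001)
p = np.polyval(coeffs, x_grid)

max_err = np.max(np.abs(p - f(x_grid)))
print(f"max |p(x) - f(x)| on [-1, 1]: {max_err:.3f}")  # large despite an exact fit at the knots
```

Even though f is positive everywhere and the polynomial passes through all 9 points exactly, the fit dips well below zero near the boundary (around x ≈ ±0.875).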

SLIDE 6

Trade-off between smoothness and quality of fit

  • We want to find a curve f(x) that fits the data well, and is sufficiently smooth at the same time.
  • This can be formulated as finding f to minimize

        R(f) = ∑_{i=1}^n (y_i − f(x_i))² + λ J(f),

    where J(f) is a measure of the roughness of f, and λ > 0 is a parameter controlling the trade-off between smoothness and quality of fit.
  • J(f) is also called a regularizer.

SLIDE 7

Measuring roughness

  • For a quadratic function f(x) = cx², a large |f″(x)| = 2|c| indicates that the curve is very wiggly.
  • In general, for any function f, if |f″(x)| is usually large, then f looks very wiggly.
  • We can use

        J(f) = ∫_a^b f″(x)² dx

    as a measure of the overall roughness of f over [a, b].
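A quick numerical illustration of this roughness measure (the two functions are chosen for illustration, not taken from the lecture):

```python
import numpy as np

# Compare the roughness J(f) = ∫_0^1 f''(x)^2 dx of a gentle parabola
# and a wiggly sine wave by numerical quadrature.
def trapezoid(y, x):
    """Composite trapezoid rule for ∫ y dx on the grid x."""
    return float(np.sum((y[:-1] + y[1:]) * np.diff(x)) / 2.0)

x = np.linspace(0.0, 1.0, 10001)

# f(x) = x^2  =>  f''(x) = 2, so J(f) = ∫_0^1 2^2 dx = 4 exactly.
J_parabola = trapezoid(np.full_like(x, 2.0) ** 2, x)

# g(x) = sin(5x)  =>  g''(x) = -25 sin(5x), which is usually large.
J_sine = trapezoid((-25.0 * np.sin(5.0 * x)) ** 2, x)

print(J_parabola)  # 4.0 (up to rounding)
print(J_sine)      # ≈ 330: far rougher
```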

SLIDE 8

Smoothing splines

  • Assume that a < min_i x_i and b > max_i x_i.
  • Consider the problem of finding a function f minimizing

        R(f) = ∑_{i=1}^n (y_i − f(x_i))² + λ ∫_a^b f″(x)² dx.

  • When λ = 0, f can be any function passing through all the data points.
  • When λ = ∞, no curvature is tolerated (f″ ≡ 0), so f is the OLS straight-line fit.
  • When 0 < λ < ∞, the minimizing f is a natural cubic spline with knots at the unique x_i values.

SLIDE 9

Revisiting the example

[Figure: the actual curve and the smoothing spline fit; axes x and y, both from −1.0 to 1.0.]

A smoothing spline can fit the data well and is smooth!

SLIDE 10

A basis for natural cubic spline

  • Recall: natural splines are linear beyond the two boundary knots.
  • Assume that the knots are t_1, …, t_m.
  • A natural cubic spline is a linear combination of the following m basis functions:

        n_1(x) = 1,
        n_2(x) = x,
        n_{2+i}(x) = d_i(x) − d_{m−1}(x),   i = 1, …, m − 2,

    where d_i(x) = ((x − t_i)₊³ − (x − t_m)₊³) / (t_m − t_i).
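A minimal sketch of this basis with hypothetical knot values (the knots are illustrative, not from the lecture), numerically checking that each basis function is linear beyond the last knot:

```python
import numpy as np

# Natural cubic spline basis with hypothetical knots t_1 < ... < t_m.
# (x - t)_+^3 denotes the truncated cube max(x - t, 0)^3.
t = np.array([0.0, 0.3, 0.5, 0.8, 1.0])  # m = 5 knots (illustrative values)
m = len(t)

def d(i, x):
    """d_i(x) = ((x - t_i)_+^3 - (x - t_m)_+^3) / (t_m - t_i), i is 1-based."""
    pos = lambda u: np.maximum(u, 0.0) ** 3
    return (pos(x - t[i - 1]) - pos(x - t[m - 1])) / (t[m - 1] - t[i - 1])

def basis(x):
    """The m basis functions n_1, ..., n_m evaluated at the points x."""
    cols = [np.ones_like(x), x]
    cols += [d(i, x) - d(m - 1, x) for i in range(1, m - 1)]
    return np.column_stack(cols)

# Beyond the last knot every basis function is linear, so the numerical
# second derivative there is (approximately) zero.
x_right = np.array([1.2, 1.3, 1.4])
h = 1e-4
second_diff = (basis(x_right + h) - 2 * basis(x_right) + basis(x_right - h)) / h**2
print(np.max(np.abs(second_diff)))  # ~0: linear to the right of t_m
```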

SLIDE 11

Fitting a smoothing spline

  • Training data: (x_1, y_1), …, (x_n, y_n) ∈ R × R.
  • A smoothing spline is fitted by solving the penalized least squares problem

        β̂ = argmin_β ∑_{i=1}^n (β⊤z_i − y_i)² + λ β⊤Ωβ,

    where z_i = (n_1(x_i), …, n_n(x_i)), the n_j's use the x_i's as the knots, and Ω_jk = ∫ n_j″(x) n_k″(x) dx.
  • The fitted spline is

        f(x) = ∑_i β̂_i n_i(x).

SLIDE 12

Matrix form

  • Let Z be the n × n matrix with z_i as the i-th row.
  • Then β̂ can be written as

        β̂ = (Z⊤Z + λΩ)⁻¹ Z⊤y.

  • We thus have ŷ = Zβ̂ = S_λ y, where S_λ is the smoother matrix

        S_λ = Z (Z⊤Z + λΩ)⁻¹ Z⊤.
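A quick numerical check of these identities; the random Z, Ω, and y below are arbitrary stand-ins for a real spline basis, penalty, and response, not an actual spline fit:

```python
import numpy as np

# With any basis matrix Z and symmetric PSD penalty Omega, the penalized
# least squares solution and the smoother matrix satisfy y_hat = Z beta_hat
# = S_lambda y, and S_lambda is symmetric.
rng = np.random.default_rng(0)
n = 8
Z = rng.normal(size=(n, n))
A = rng.normal(size=(n, n))
Omega = A.T @ A                      # symmetric positive semidefinite penalty
y = rng.normal(size=n)
lam = 0.5

beta_hat = np.linalg.solve(Z.T @ Z + lam * Omega, Z.T @ y)
S = Z @ np.linalg.solve(Z.T @ Z + lam * Omega, Z.T)
y_hat = Z @ beta_hat

print(np.allclose(y_hat, S @ y))  # True
```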

SLIDE 13

Effective degrees of freedom

  • The effective degrees of freedom of a smoothing spline is

        df_λ = trace(S_λ),

    where the trace of a matrix is the sum of its diagonal elements.
  • The effective degrees of freedom can be considered a generalization of the number of free parameters.
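To see this behave like a parameter count, the sketch below evaluates trace(S_λ) for a toy random Z and PSD Ω (stand-ins, not a real spline basis): with no penalty the effective degrees of freedom equals the number of basis functions, and it shrinks as λ grows.

```python
import numpy as np

# Effective degrees of freedom df_lambda = trace(S_lambda) for a toy setup.
rng = np.random.default_rng(1)
n = 10
Z = rng.normal(size=(n, n))
A = rng.normal(size=(n, n))
Omega = A.T @ A                      # symmetric positive semidefinite penalty

def eff_df(lam):
    S = Z @ np.linalg.solve(Z.T @ Z + lam * Omega, Z.T)
    return float(np.trace(S))

print(eff_df(0.0))   # 10: no penalty, n free parameters
print(eff_df(1e6))   # near 0: heavy smoothing leaves almost no flexibility
```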

SLIDE 14

Selection of smoothing parameters

  • The effective degrees of freedom df_λ provides an intuitive way to specify the smoothing parameter λ manually.
  • There are also procedures for determining λ automatically, such as cross-validation and generalized cross-validation.

SLIDE 15

Smoothing splines in R

> fit.spline.df <- smooth.spline(cars$speed, cars$dist, df=9)
Smoothing Parameter  spar= 0.3858413  lambda= 0.0001576001 (11 iterations)
Equivalent Degrees of Freedom (Df): 8.998755
Penalized Criterion (RSS): 2054.319
GCV: 262.3012

> fit.spline.gcv <- smooth.spline(cars$speed, cars$dist)
Smoothing Parameter  spar= 0.7801305  lambda= 0.1112206 (11 iterations)
Equivalent Degrees of Freedom (Df): 2.635278
Penalized Criterion (RSS): 4187.776
GCV: 244.1044

  • By default, the smoothing parameter λ is determined using

generalized cross validation.
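For readers working in Python, scipy offers a loose analogue; note that scipy's `UnivariateSpline` controls smoothness through a residual bound `s` rather than a roughness penalty λ, and performs no GCV, so this is a sketch of the same idea, not a drop-in replacement for `smooth.spline`. Synthetic data stands in for the cars dataset:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Synthetic speed/dist-like data (illustrative, not the cars dataset).
rng = np.random.default_rng(42)
x = np.linspace(4.0, 25.0, 50)
y = 0.2 * x**2 + rng.normal(scale=5.0, size=50)

smooth = UnivariateSpline(x, y, s=len(x) * 25.0)  # heavier smoothing
wiggly = UnivariateSpline(x, y, s=0.0)            # s=0: interpolates the data

print(np.max(np.abs(wiggly(x) - y)))  # ~0: exact fit at the data points
```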

SLIDE 16
[Figure: cars data (speed vs. dist) with the lm line and the two smoothing splines (df = 2.64 and df = 9); speed from 5 to 25, dist from 20 to 120.]

SLIDE 17

Generalized Additive Models

  • The smoothing spline is a nonparametric analogue of OLS.
  • We can extend the approach to GLMs.

SLIDE 18

Idea

  • Replace the linear predictor by β_0 + h_1(x_1) + … + h_d(x_d).
  • Maximize the roughness-penalized log-likelihood instead of the log-likelihood.

SLIDE 19

Generalized additive model (GAM)

  • Recall: a GLM has the following structure:

        (systematic)  E(Y | x) = h(β⊤x),
        (random)      Y | x follows an exponential family distribution.

  • A generalized additive model has the following structure:

        (systematic)  E(Y | x) = h(β_0 + ∑_i h_i(x_i)),
        (random)      Y | x follows an exponential family distribution.

    This defines a conditional probability model p(y | x, β_0, h_1, …, h_d).

SLIDE 20

Roughness penalty approach for GAM

  • We want to choose β_0, h_1, …, h_d to maximize

        ∑_i ln p(y_i | x_i, β_0, h_1, …, h_d) − ∑_j λ_j ∫ h_j″(x_j)² dx_j.

  • Again, if each λ_j > 0, then each h_j in the maximizer must be a natural cubic spline with knots at the unique values of x_j.
  • This reduces the problem to a finite-dimensional parametric regression problem.
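To see concretely how the problem becomes finite-dimensional, here is a deliberately simplified sketch for the Gaussian, identity-link case: each h_j is represented in a small polynomial basis with a generic quadratic coefficient penalty standing in for the roughness penalty (the bases, penalty, and data are all illustrative assumptions, not the lecture's natural-spline reduction). The penalized log-likelihood then reduces to penalized least squares in the coefficients.

```python
import numpy as np

# Additive model E(Y|x) = beta_0 + h_1(x_1) + h_2(x_2), identity link.
rng = np.random.default_rng(7)
n = 200
x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)
y = np.sin(3 * x1) + x2**2 + rng.normal(scale=0.1, size=n)

# Finite basis: intercept + cubic basis in x1 + cubic basis in x2.
B = np.column_stack([np.ones(n), x1, x1**2, x1**3, x2, x2**2, x2**3])
P = np.diag([0.0, 1, 1, 1, 1, 1, 1])  # penalize everything but the intercept
lam = 1e-3

# Penalized least squares in the coefficients: a parametric problem.
beta = np.linalg.solve(B.T @ B + lam * P, B.T @ y)
mse_fit = float(np.mean((B @ beta - y) ** 2))
mse_mean = float(np.mean((y - y.mean()) ** 2))
print(mse_fit < mse_mean)  # True: the additive fit beats the constant model
```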

SLIDE 21

Remarks

  • Higher-order derivatives may be used in the regularizer (smoothness penalty).
  • We can also use regression splines instead of smoothing splines to represent the h_i's.
  • The h_i's may use a mix of different representations, e.g. h_1(x_1) = x_1, h_2(x_2) a regression spline, h_3(x_3) a smoothing spline, ...

SLIDE 22

What You Need to Know

  • Smoothing splines
  • The roughness penalty approach
  • Natural cubic splines as smoothing splines
  • Smoothing parameter and effective degree of freedom
  • Generalized additive model
  • GAM as a generalization of GLM
  • Roughness penalty approach for GAM
