SLIDE 1

Bayesian Interpretations of Regularization

Charlie Frogner

9.520 Class 17

April 6, 2011

SLIDE 2

The Plan

Regularized least squares maps $\{(x_i, y_i)\}_{i=1}^n$ to a function that minimizes the regularized loss:

$$f_S = \arg\min_{f \in \mathcal{H}} \; \frac{1}{2}\sum_{i=1}^n (y_i - f(x_i))^2 + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2$$

Can we interpret RLS from a probabilistic point of view?
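To fix ideas, here is a minimal numerical sketch of RLS, assuming a linear hypothesis space $f(x) = \langle x, \theta \rangle$ so that $\|f\|_{\mathcal{H}}^2 = \|\theta\|^2$; the data and the value of `lam` below are made up for illustration.

```python
import numpy as np

# Minimal RLS sketch for a linear hypothesis space f(x) = <x, theta>,
# so that ||f||_H^2 = ||theta||^2. Setting the gradient of
#   1/2 ||Y - X theta||^2 + lambda/2 ||theta||^2
# to zero gives the closed form theta = (X^T X + lambda I)^{-1} X^T Y.
def rls_fit(X, Y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Hypothetical example data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true + 0.1 * rng.normal(size=50)

print(rls_fit(X, Y, lam=0.1))   # close to theta_true when the noise is small
```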

SLIDE 3

Some notation

Training set: $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$.
Inputs: $X = \{x_1, \ldots, x_n\}$.
Labels: $Y = \{y_1, \ldots, y_n\}$.
Parameters: $\theta \in \mathbb{R}^p$.
$p(Y|X, \theta)$ is the joint distribution over labels $Y$ given inputs $X$ and the parameters.

SLIDE 4

Where do probabilities show up?

$$\frac{1}{2}\sum_{i=1}^n V(y_i, f(x_i)) + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2$$

becomes $p(Y|f, X) \cdot p(f)$.

Likelihood, a.k.a. noise model: $p(Y|f, X)$.

  • Gaussian: $y_i \sim \mathcal{N}\left(f^*(x_i), \sigma^2\right)$
  • Poisson: $y_i \sim \mathrm{Pois}\left(f^*(x_i)\right)$

Prior: $p(f)$.
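As a concrete illustration of the two noise models, here is a small sampling sketch; the target function `f_star` and the use of $\exp(f^*(x))$ as the Poisson rate (to keep it nonnegative) are assumptions made for this example, not part of the slides.

```python
import numpy as np

# Sampling labels under the two noise models, assuming a hypothetical
# target function f_star on scalar inputs.
def f_star(x):
    return np.sin(x)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 3.0, 5)

# Gaussian noise model: y_i ~ N(f*(x_i), sigma^2)
sigma = 0.2
y_gauss = rng.normal(loc=f_star(x), scale=sigma)

# Poisson noise model: y_i ~ Pois(rate_i); here the rate is exp(f*(x_i)),
# purely to keep it nonnegative in this illustration.
y_pois = rng.poisson(lam=np.exp(f_star(x)))

print(y_gauss)
print(y_pois)
```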

SLIDE 6

Estimation

The estimation problem: given data $\{(x_i, y_i)\}_{i=1}^N$ and a model $p(Y|f, X)$, $p(f)$, find a good $f$ to explain the data.

SLIDE 7

The Plan

  • Maximum likelihood estimation for ERM
  • MAP estimation for linear RLS
  • MAP estimation for kernel RLS
  • Transductive model
  • Infinite dimensions get more complicated

SLIDE 8

Maximum likelihood estimation

Given data $\{(x_i, y_i)\}_{i=1}^N$ and a model $p(Y|f, X)$, $p(f)$: a good $f$ is one that maximizes $p(Y|f, X)$.

SLIDE 9

Maximum likelihood and least squares

For least squares, the noise model is $y_i | f, x_i \sim \mathcal{N}(f(x_i), \sigma^2)$, a.k.a. $Y | f, X \sim \mathcal{N}(f(X), \sigma^2 I)$. So

$$p(Y|f, X) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - f(x_i))^2\right)$$
SLIDE 11

Maximum likelihood and least squares

Maximum likelihood: maximize

$$p(Y|f, X) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - f(x_i))^2\right)$$

Empirical risk minimization: minimize

$$\sum_{i=1}^N (y_i - f(x_i))^2$$

SLIDE 12

…

$$\sum_{i=1}^N (y_i - f(x_i))^2$$

SLIDE 13

…

$$e^{-\frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - f(x_i))^2}$$

(Minimizing the sum of squared errors and maximizing this exponential pick out the same $f$, since $\exp(-t)$ is monotone decreasing in $t$.)
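A small numeric sketch of this equivalence (the data, the grid of candidate slopes, and the noise level `sigma` are hypothetical): the Gaussian negative log-likelihood and the sum of squared errors are minimized by the same parameter.

```python
import numpy as np

# Hypothetical 1-d example: f(x) = theta * x with Gaussian noise of known sigma.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 0.3 * rng.normal(size=100)
sigma = 0.3

thetas = np.linspace(0.0, 4.0, 401)
sse = np.array([np.sum((y - t * x) ** 2) for t in thetas])   # empirical risk
nll = sse / (2 * sigma ** 2) \
    + 0.5 * len(x) * np.log(2 * np.pi * sigma ** 2)          # -log p(Y|f, X)

# Both criteria are minimized at the same theta.
assert np.argmin(sse) == np.argmin(nll)
print(thetas[np.argmin(sse)])
```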

SLIDE 14

What about regularization?

RLS:

$$\arg\min_{f} \; \frac{1}{2}\sum_{i=1}^n (y_i - f(x_i))^2 + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2$$

Is there a model of $Y$ and $f$ that yields RLS? Yes:

$$e^{-\frac{1}{2\sigma_\varepsilon^2}\sum_{i=1}^n (y_i - f(x_i))^2} \cdot e^{-\frac{\lambda}{2}\|f\|_{\mathcal{H}}^2}$$

$$p(Y|f, X) \cdot p(f)$$
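A minimal numeric check of this correspondence, assuming a linear $f(x) = \langle x, \theta \rangle$ with $\|f\|_{\mathcal{H}}^2 = \|\theta\|^2$ and taking $\sigma_\varepsilon = 1$ so that the two expressions match exactly; the data are made up.

```python
import numpy as np

# Check that the RLS objective equals -log( p(Y|f,X) * p(f) ) for the
# unnormalized densities above, assuming a linear f(x) = <x, theta> with
# ||f||_H^2 = ||theta||^2 and sigma_eps = 1. Hypothetical data below.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
Y = rng.normal(size=20)
theta = rng.normal(size=3)
lam = 0.5

rls_objective = 0.5 * np.sum((Y - X @ theta) ** 2) + 0.5 * lam * (theta @ theta)
neg_log_lik_prior = -np.log(np.exp(-0.5 * np.sum((Y - X @ theta) ** 2))
                            * np.exp(-0.5 * lam * (theta @ theta)))

print(np.isclose(rls_objective, neg_log_lik_prior))   # True
```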

SLIDE 18

Posterior function estimates

Given data $\{(x_i, y_i)\}_{i=1}^N$ and a model $p(Y|f, X)$, $p(f)$: find a good $f$ to explain the data. (If we can get $p(f|Y, X)$:)

Bayes least squares estimate: $\hat{f}_{BLS} = \mathbb{E}_{f|X,Y}[f]$, i.e. the mean of the posterior.

MAP estimate: $\hat{f}_{MAP}(Y|X) = \arg\max_f \, p(f|X, Y)$, i.e. a mode of the posterior.

SLIDE 21

A posterior on functions?

How to find $p(f|Y, X)$? Bayes' rule:

$$p(f|X, Y) = \frac{p(Y|X, f) \cdot p(f)}{p(Y|X)} = \frac{p(Y|X, f) \cdot p(f)}{\int p(Y|X, f)\, dp(f)}$$

When is this well-defined?

SLIDE 23

A posterior on functions?

Functions vs. parameters: $\mathcal{H} \cong \mathbb{R}^p$. Represent functions in $\mathcal{H}$ by their coordinates w.r.t. a basis: $f \in \mathcal{H} \leftrightarrow \theta \in \mathbb{R}^p$. Assume (for the moment): $p < \infty$.

SLIDE 25

A posterior on functions?

Mercer's theorem:

$$K(x_i, x_j) = \sum_k \nu_k \psi_k(x_i)\psi_k(x_j), \quad \text{where} \quad \nu_k \psi_k(\cdot) = \int K(\cdot, y)\psi_k(y)\, dy \quad \text{for all } k.$$

The functions $\{\sqrt{\nu_k}\psi_k(\cdot)\}$ form an orthonormal basis for $\mathcal{H}_K$. Let $\phi(\cdot) = [\sqrt{\nu_1}\psi_1(\cdot), \ldots, \sqrt{\nu_p}\psi_p(\cdot)]$. Then:

$$\mathcal{H}_K = \{\phi(\cdot)\theta \mid \theta \in \mathbb{R}^p\}$$
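To make the feature-map identity $K(x, z) = \langle \phi(x), \phi(z) \rangle$ concrete, here is a finite-dimensional sketch using the degree-2 homogeneous polynomial kernel on $\mathbb{R}^2$; this is an illustrative stand-in, not the Mercer eigenfunction expansion itself.

```python
import numpy as np

# Finite-dimensional illustration of K(x, z) = <phi(x), phi(z)>:
# the degree-2 homogeneous polynomial kernel on R^2 with its explicit
# feature map phi(x) = [x1^2, sqrt(2) x1 x2, x2^2].
def K(x, z):
    return (x @ z) ** 2

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([-0.5, 3.0])
print(K(x, z), phi(x) @ phi(z))   # both print 30.25
```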

SLIDE 26

Prior on infinite-dimensional space

Problem: there's no such thing as $\theta \sim \mathcal{N}(0, I)$ when $\theta \in \mathbb{R}^\infty$!

SLIDE 27

Posterior for linear RLS

Linear function: $f(x) = \langle x, \theta \rangle$. Noise model: $Y|X, \theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I)$. Add a prior: $\theta \sim \mathcal{N}(0, I)$.

SLIDE 28

Posterior for linear RLS

Model: $Y|X, \theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I)$, $\theta \sim \mathcal{N}(0, I)$. Joint over $Y$ and $\theta$:

$$\begin{bmatrix} Y \\ \theta \end{bmatrix} \sim \mathcal{N}\left( 0, \begin{bmatrix} XX^T + \sigma_\varepsilon^2 I & X \\ X^T & I \end{bmatrix} \right)$$

Condition on $Y$.
SLIDE 29

Posterior for linear RLS

Posterior: $\theta|X, Y \sim \mathcal{N}(\mu_{\theta|X,Y}, \Sigma_{\theta|X,Y})$, where

$$\mu_{\theta|X,Y} = X^T(XX^T + \sigma_\varepsilon^2 I)^{-1}Y$$

$$\Sigma_{\theta|X,Y} = I - X^T(XX^T + \sigma_\varepsilon^2 I)^{-1}X$$

This is Gaussian, so

$$\hat{\theta}_{MAP}(Y|X) = \hat{\theta}_{BLS}(Y|X) = X^T(XX^T + \sigma_\varepsilon^2 I)^{-1}Y$$
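A minimal numeric sketch of these formulas; the data, dimensions, and noise level below are hypothetical.

```python
import numpy as np

# Posterior over theta for Y = X theta + eps, eps ~ N(0, sigma_eps^2 I),
# theta ~ N(0, I), using the formulas above. Hypothetical data below.
rng = np.random.default_rng(0)
n, p = 30, 4
X = rng.normal(size=(n, p))
theta_true = rng.normal(size=p)
sigma_eps = 0.5
Y = X @ theta_true + sigma_eps * rng.normal(size=n)

A = np.linalg.inv(X @ X.T + sigma_eps ** 2 * np.eye(n))
mu_post = X.T @ A @ Y                   # posterior mean = MAP = BLS estimate
Sigma_post = np.eye(p) - X.T @ A @ X    # posterior covariance

print(mu_post)
print(np.diag(Sigma_post))              # per-coordinate posterior variances
```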

SLIDE 31

Linear RLS as a MAP estimator

Model: $Y|X, \theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I)$, $\theta \sim \mathcal{N}(0, I)$.

$$\hat{\theta}_{MAP}(Y|X) = X^T(XX^T + \sigma_\varepsilon^2 I)^{-1}Y$$

Recall the linear RLS solution:

$$\hat{\theta}_{RLS}(Y|X) = \arg\min_\theta \frac{1}{2}\sum_{i=1}^N (y_i - \langle x_i, \theta \rangle)^2 + \frac{\lambda}{2}\|\theta\|^2 = X^T(XX^T + \lambda I)^{-1}Y$$

So what's $\lambda$?
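Comparing the two closed forms suggests $\lambda = \sigma_\varepsilon^2$. A quick numeric check on hypothetical data, using the equivalent primal form $(X^TX + \lambda I)^{-1}X^TY$ of the RLS solution so the comparison is not circular:

```python
import numpy as np

# MAP estimate vs. ridge/RLS solution with lambda = sigma_eps^2.
# Hypothetical data; the RLS solution is computed in its primal form
# (X^T X + lambda I)^{-1} X^T Y, which equals X^T (X X^T + lambda I)^{-1} Y.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
Y = X @ np.array([0.5, -1.0, 2.0]) + 0.3 * rng.normal(size=40)
sigma_eps = 0.3

theta_map = X.T @ np.linalg.solve(X @ X.T + sigma_eps ** 2 * np.eye(40), Y)

lam = sigma_eps ** 2
theta_rls = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ Y)

print(np.allclose(theta_map, theta_rls))   # True
```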

SLIDE 33

Posterior for kernel RLS

Model for linear RLS: $Y|X, \theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I)$, $\theta \sim \mathcal{N}(0, I)$.

Model for kernel RLS? $Y|X, \theta \sim \mathcal{N}(\phi(X)\theta, \sigma_\varepsilon^2 I)$, $\theta \sim \mathcal{N}(0, I)$.

Then:

$$\hat{\theta}_{MAP}(Y|X) = \phi(X)^T(\phi(X)\phi(X)^T + \sigma_\varepsilon^2 I)^{-1}Y$$

SLIDE 35

Posterior for kernel RLS

Model for linear RLS: $Y|X, \theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I)$, $\theta \sim \mathcal{N}(0, I)$.

Model for kernel RLS? $Y|X, \theta \sim \mathcal{N}(\phi(X)\theta, \sigma_\varepsilon^2 I)$, $\theta \sim \mathcal{N}(0, I)$.

Then:

$$\hat{\theta}_{MAP}(Y|X) = \phi(X)^T(K + \sigma_\varepsilon^2 I)^{-1}Y$$
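A minimal kernel-RLS sketch: rather than forming $\phi$ explicitly, compute $c = (K + \sigma_\varepsilon^2 I)^{-1}Y$ and predict with $f(x^*) = \sum_i c_i K(x_i, x^*)$. The Gaussian (RBF) kernel and the synthetic data below are assumptions made for the example.

```python
import numpy as np

# Kernel RLS / MAP prediction without forming phi explicitly:
# c = (K + sigma_eps^2 I)^{-1} Y, and f(x*) = sum_i c_i K(x_i, x*).
def rbf_kernel(A, B, gamma=1.0):
    d2 = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))         # synthetic inputs
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
sigma_eps = 0.1

K = rbf_kernel(X, X)
c = np.linalg.solve(K + sigma_eps ** 2 * np.eye(len(X)), Y)

X_test = np.linspace(-3, 3, 7)[:, None]
f_test = rbf_kernel(X_test, X) @ c           # predictions at the test points
print(f_test)
```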

SLIDE 36

A quick recap

Empirical risk minimization is ML:

$$p(Y|f, X) \propto e^{-\frac{1}{2}\sum_{i=1}^N (y_i - f(x_i))^2}$$

Linear RLS is MAP:

$$p(Y, f|X) \propto e^{-\frac{1}{2}\sum_{i=1}^N (y_i - \langle x_i, \theta \rangle)^2} \cdot e^{-\frac{\lambda}{2}\theta^T\theta}$$

Kernel RLS is also MAP:

$$p(Y, f|X) \propto e^{-\frac{1}{2}\sum_{i=1}^N (y_i - f(x_i))^2} \cdot e^{-\frac{\lambda}{2}\|f\|_{\mathcal{H}}^2}$$

SLIDE 39

Transductive setting

Idea: forget about estimating $\theta$ (i.e. $f$). Instead, estimate the predicted outputs $Y^* = [y^*_1, \ldots, y^*_M]^T$ at test inputs $X^* = [x^*_1, \ldots, x^*_M]^T$.

Need the joint distribution over $Y^*$ and $Y$.

SLIDE 40

Transductive setting

Say $Y^*$ and $Y$ are jointly Gaussian:

$$\begin{bmatrix} Y \\ Y^* \end{bmatrix} \sim \mathcal{N}\left( 0, \begin{bmatrix} \Lambda_Y & \Lambda_{YY^*} \\ \Lambda_{Y^*Y} & \Lambda_{Y^*} \end{bmatrix} \right)$$

Want: kernel RLS.

General form for the posterior: $Y^*|X, Y \sim \mathcal{N}(\mu_{Y^*|X,Y}, \Sigma_{Y^*|X,Y})$, where

$$\mu_{Y^*|X,Y} = \Lambda_{YY^*}^T \Lambda_Y^{-1} Y$$

$$\Sigma_{Y^*|X,Y} = \Lambda_{Y^*} - \Lambda_{YY^*}^T \Lambda_Y^{-1} \Lambda_{YY^*}$$
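A minimal sketch of this Gaussian conditioning on a small, explicitly written joint covariance; all numbers below are made up for illustration.

```python
import numpy as np

# Condition a zero-mean joint Gaussian [Y; Y*] on observed Y using the
# formulas above. The joint covariance blocks below are made up.
L_Y   = np.array([[2.0, 0.5],
                  [0.5, 1.5]])     # Cov(Y)
L_YYs = np.array([[0.8],
                  [0.3]])          # Cov(Y, Y*)
L_Ys  = np.array([[1.0]])          # Cov(Y*)

Y_obs = np.array([1.2, -0.4])

L_Y_inv = np.linalg.inv(L_Y)
mu_post = L_YYs.T @ L_Y_inv @ Y_obs                # E[Y* | Y]
Sigma_post = L_Ys - L_YYs.T @ L_Y_inv @ L_YYs      # Cov[Y* | Y]

print(mu_post, Sigma_post)
```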

SLIDE 42

Transductive setting

Set $\Lambda_Y = K(X, X) + \sigma^2 I$, $\Lambda_{YY^*} = K(X, X^*)$, $\Lambda_{Y^*} = K(X^*, X^*)$. Posterior: $Y^*|X, Y \sim \mathcal{N}(\mu_{Y^*|X,Y}, \Sigma_{Y^*|X,Y})$, where

$$\mu_{Y^*|X,Y} = K(X^*, X)(K(X, X) + \sigma^2 I)^{-1}Y$$

$$\Sigma_{Y^*|X,Y} = K(X^*, X^*) - K(X^*, X)(K(X, X) + \sigma^2 I)^{-1}K(X, X^*)$$

So: $\hat{Y}^*_{MAP} = \hat{f}_{RLS}(X^*)$.
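A minimal sketch of these formulas with an RBF kernel on synthetic data (both are assumptions made for the example): the posterior mean is the kernel RLS prediction at the test points, and the diagonal of the posterior covariance gives pointwise uncertainty, e.g. $\mu \pm 2\sigma$ intervals.

```python
import numpy as np

# Posterior over test outputs Y* given training data, using the formulas above.
# RBF kernel and synthetic data are illustrative assumptions.
def rbf_kernel(A, B, gamma=1.0):
    d2 = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
X_star = np.linspace(-3, 3, 9)[:, None]
sigma = 0.1

Lam_Y = rbf_kernel(X, X) + sigma ** 2 * np.eye(len(X))   # K(X, X) + sigma^2 I
K_sX = rbf_kernel(X_star, X)                             # K(X*, X)
K_ss = rbf_kernel(X_star, X_star)                        # K(X*, X*)

mu_star = K_sX @ np.linalg.solve(Lam_Y, Y)               # = kernel RLS prediction
Sigma_star = K_ss - K_sX @ np.linalg.solve(Lam_Y, K_sX.T)
std_star = np.sqrt(np.maximum(np.diag(Sigma_star), 0.0)) # clip round-off negatives

print(mu_star)
print(std_star)                                          # for mu +/- 2*std intervals
```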

SLIDE 43

Transductive setting

Model:

$$\begin{bmatrix} Y \\ Y^* \end{bmatrix} \sim \mathcal{N}\left( 0, \begin{bmatrix} K(X, X) + \sigma_\varepsilon^2 I & K(X, X^*) \\ K(X^*, X) & K(X^*, X^*) \end{bmatrix} \right)$$

MAP estimate (posterior mean) = RLS function at every point $x^*$, regardless of $\dim \mathcal{H}_K$. Are the prior and posterior (on points!) consistent with a distribution on $\mathcal{H}_K$?

SLIDE 44

Transductive setting

Strictly speaking, $\theta$ and $f$ don't come into play here at all. Have: $p(Y^*|X, Y)$. Do not have: $p(\theta|X, Y)$ or $p(f|X, Y)$.

But, if $\mathcal{H}_K$ is finite dimensional, the joint over $Y$ and $Y^*$ is consistent with: $Y = f(X) + \varepsilon$, $Y^* = f(X^*)$, and $f \in \mathcal{H}_K$ a random trajectory from a Gaussian process over the domain, with mean $\mu$ and covariance $K$.

(Ergo, people call this "Gaussian process regression." Also "Kriging," because of a guy.)

SLIDE 46

Recap redux

Empirical risk minimization is the maximum likelihood estimator when: $y = x^T\theta + \varepsilon$.

Linear RLS is the MAP estimator when: $y = x^T\theta + \varepsilon$, $\theta \sim \mathcal{N}(0, I)$.

Kernel RLS is the MAP estimator when: $y = \phi(x)^T\theta + \varepsilon$, $\theta \sim \mathcal{N}(0, I)$, in finite dimensional $\mathcal{H}_K$.

Kernel RLS is the MAP estimator at points when:

$$\begin{bmatrix} Y \\ Y^* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_Y \\ \mu_{Y^*} \end{bmatrix}, \begin{bmatrix} K(X, X) + \sigma_\varepsilon^2 I & K(X, X^*) \\ K(X^*, X) & K(X^*, X^*) \end{bmatrix} \right)$$

in possibly infinite dimensional $\mathcal{H}_K$.
SLIDE 47

Is this useful in practice?

Want confidence intervals + believe the posteriors are meaningful = yes. Maybe other reasons?

[Figure: "Gaussian Regression + Confidence Intervals", plotting the regression function and the observed points against X and Y.]
