Reproducing Kernel Hilbert Spaces for Classification

SLIDE 1

Reproducing Kernel Hilbert Spaces for Classification

Katarina Domijan and Simon P. Wilson

Department of Statistics, University of Dublin, Trinity College, Ireland

November 1, 2005

Working Group on Statistical Learning

SLIDE 2

General problem

  • Regression problem.
  • Data are available: (X_1, Y_1), ..., (X_n, Y_n); X_i ∈ R^p and Y_i ∈ R.
  • The aim is to find f(X) for predicting Y given the values of X.
  • Linear model: Y = f(X) + ε, where E(ε) = 0 and ε is independent of X; f(X) = X^T β for a set of parameters β.
  • Another approach is to use linear basis expansions: replace X with a transformation of it, and then use a linear model in the new space of input features.


SLIDE 3

General problem cont’d

  • Let h_m(X) : R^p → R be the mth transformation of X.
  • Then f(X) = \sum_{m=1}^{M} h_m(X) \beta_m.
  • Examples of h_m(X) are polynomial and trigonometric expansions, e.g. X_1^3, X_1 X_2, \sin(X_1), etc.
  • Classical solution: use least squares to estimate β in f(X): \hat{\beta} = (H^T H)^{-1} H^T y (sketched in code below).
  • Bayesian solution: place a prior (MVN) on the β's. The likelihood is given by:

f(Y | X, \beta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (y_i - f(x_i))^2 \right).
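As a concrete illustration of the classical solution, here is a minimal NumPy sketch (not from the slides; the data, basis functions, and sample size are assumptions) that builds a design matrix H of basis expansions and computes \hat{\beta} = (H^T H)^{-1} H^T y.

```python
import numpy as np

# Illustrative data (assumed, not from the slides): n = 50 points in R^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 - 2.0 * X[:, 0] ** 3 + 0.5 * X[:, 0] * X[:, 1] \
    + np.sin(X[:, 0]) + rng.normal(scale=0.1, size=50)

def basis(X):
    """Basis expansion h_m(X): constant, X1^3, X1*X2, sin(X1) (as on the slide)."""
    return np.column_stack([
        np.ones(len(X)),    # h_1(X) = 1
        X[:, 0] ** 3,       # h_2(X) = X1^3
        X[:, 0] * X[:, 1],  # h_3(X) = X1 * X2
        np.sin(X[:, 0]),    # h_4(X) = sin(X1)
    ])

H = basis(X)
# Classical least-squares estimate: beta_hat = (H^T H)^{-1} H^T y.
beta_hat = np.linalg.solve(H.T @ H, H.T @ y)
y_hat = H @ beta_hat  # fitted values f(x_i)
print(beta_hat)       # should be close to (1.0, -2.0, 0.5, 1.0)
```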

SLIDE 4

Example: a cubic spline

  • Assume X is one dimensional.
  • Divide the domain of X into contiguous intervals.
  • f is represented by a separate polynomial in each interval.
  • Basis functions are (see the code sketch after this list):

h_1(X) = 1,  h_2(X) = X,  h_3(X) = X^2,  h_4(X) = X^3,  h_5(X) = (X − ψ_1)^3_+,  h_6(X) = (X − ψ_2)^3_+.

  • ψ_1 and ψ_2 are knots.
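A short code sketch of this truncated-power basis (the knot locations ψ_1 = −0.5 and ψ_2 = 0.5 are illustrative assumptions):

```python
import numpy as np

def cubic_spline_basis(x, psi1=-0.5, psi2=0.5):
    """Truncated-power cubic spline basis h_1, ..., h_6 with knots psi1, psi2."""
    x = np.asarray(x, dtype=float)
    pos = lambda t: np.maximum(t, 0.0)  # (t)_+ , the positive part
    return np.column_stack([
        np.ones_like(x),     # h1(X) = 1
        x,                   # h2(X) = X
        x ** 2,              # h3(X) = X^2
        x ** 3,              # h4(X) = X^3
        pos(x - psi1) ** 3,  # h5(X) = (X - psi1)^3_+
        pos(x - psi2) ** 3,  # h6(X) = (X - psi2)^3_+
    ])

H = cubic_spline_basis(np.linspace(-2.0, 2.0, 9))
print(H.shape)  # (9, 6): one row per point, one column per basis function
```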


SLIDE 5

Example: a cubic spline

[Figure: the fitted cubic spline f(x) plotted against x, with the knots ψ_1 and ψ_2 marked on the axis.]

SLIDE 6

Use in classification

  • Let the outputs Y take values in a discrete set.
  • We want to divide the input space into a collection of regions labelled according to the classification.
  • For Y ∈ {0, 1}, the model is (see the sketch after this list):

\log \frac{P(Y = 1 | X = x)}{P(Y = 0 | X = x)} = f(x). \quad \text{Hence:} \quad P(Y = 1 | X = x) = \frac{e^{f(x)}}{1 + e^{f(x)}}.
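A tiny numerically stable sketch of the resulting class probability (illustrative only):

```python
import numpy as np

def p_class1(f_x):
    """P(Y = 1 | X = x) = exp(f(x)) / (1 + exp(f(x)))."""
    f_x = np.asarray(f_x, dtype=float)
    # exp(f - log(1 + e^f)) = e^f / (1 + e^f), computed without overflow
    return np.exp(f_x - np.logaddexp(0.0, f_x))

print(p_class1([-2.0, 0.0, 2.0]))  # approximately [0.119, 0.5, 0.881]
```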


SLIDE 7

Regularisation

  • Let’s move from cubic splines to consider all f that are twice continuously differentiable.
  • Many f will have \sum_{i=1}^{n} (y_i − f(x_i))^2 = 0.
  • So we look at the penalized RSS (approximated numerically in the sketch below):

RSS(f, λ) = \sum_{i=1}^{n} (y_i − f(x_i))^2 + λ \int (f''(t))^2 \, dt.

  • The second term penalizes curvature: it encourages splines whose slope changes slowly.
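A rough numerical sketch of RSS(f, λ) (an illustration under assumed inputs; the curvature integral is approximated by second differences on a fine grid):

```python
import numpy as np

def penalized_rss(f, x, y, lam, grid):
    """RSS(f, lambda) = sum_i (y_i - f(x_i))^2 + lambda * integral of (f''(t))^2 dt,
    with the integral approximated by finite differences on `grid`."""
    rss = np.sum((y - f(x)) ** 2)
    h = grid[1] - grid[0]
    fg = f(grid)
    f2 = (fg[2:] - 2.0 * fg[1:-1] + fg[:-2]) / h ** 2  # second-difference estimate of f''
    penalty = np.sum(f2 ** 2) * h                       # Riemann sum of (f'')^2
    return rss + lam * penalty

x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x)
grid = np.linspace(0.0, 1.0, 500)
# A wiggly candidate fits the data but pays a curvature penalty; a flat line pays none.
print(penalized_rss(lambda t: np.sin(2 * np.pi * t), x, y, lam=0.01, grid=grid))
print(penalized_rss(lambda t: 0.0 * t, x, y, lam=0.01, grid=grid))
```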


SLIDE 8

Regularisation cont’d

  • λ = 0: f can be any function that interpolates the data.
  • λ = ∞: f is a least squares line fit.
  • Note that this is defined on an infinite-dimensional function space.
  • However, the solution is finite-dimensional and unique:

f(x) = \sum_{j=1}^{n} D_j(x) \beta_j,

where the D_j(x) are an n-dimensional set of basis functions representing a family of natural splines.
  • Natural splines have additional constraints that force the function to be linear beyond the boundary knots.


SLIDE 9

Regularisation cont’d

  • Clearly, all inference about f is inference about β = (β_0, β_1, ..., β_n).
  • The LS solution can be shown to be (sketched in code below):

\hat{\beta} = (D^T D + λ Φ_D)^{-1} D^T y,

where D and Φ_D are matrices with elements {D}_{i,j} = D_j(x_i) and {Φ_D}_{j,k} = \int D_j''(t) D_k''(t) \, dt, respectively.
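A minimal sketch of this penalized least-squares solve; the matrices D and Φ_D are placeholders here (building them from an actual natural-spline basis is not shown):

```python
import numpy as np

def penalized_ls(D, Phi_D, y, lam):
    """Penalized least squares: beta_hat = (D^T D + lambda * Phi_D)^{-1} D^T y."""
    return np.linalg.solve(D.T @ D + lam * Phi_D, D.T @ y)

# Toy usage with placeholder matrices (assumptions, not a real spline basis):
rng = np.random.default_rng(1)
D = rng.normal(size=(30, 5))   # {D}_{i,j} = D_j(x_i)
B = rng.normal(size=(5, 5))
Phi_D = B.T @ B                # symmetric PSD stand-in for the curvature penalty matrix
y = rng.normal(size=30)
beta_hat = penalized_ls(D, Phi_D, y, lam=0.5)
print(beta_hat)
```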


SLIDE 10

Generalisation

  • We can generalise this to higher dimensions.
  • Suppose X ∈ R^2 and consider

\min_{f} \sum_{i=1}^{n} (y_i − f(x_i))^2 + λ J(f),

  • where J(f) is the penalty term; an example is

J(f) = \int_{R^2} \left[ \left( \frac{\partial^2 f(x)}{\partial x_1^2} \right)^2 + 2 \left( \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} \right)^2 + \left( \frac{\partial^2 f(x)}{\partial x_2^2} \right)^2 \right] dx_1 \, dx_2.


SLIDE 11

Generalisation cont’d

  • Optimizing with this penalty leads to a thin plate spline.
  • The solution can be written as a linear expansion of basis functions (evaluated in the sketch below):

f(x) = \beta_0 + \beta^T x + \sum_{j=1}^{n} \alpha_j h_j(x),

where the h_j are radial basis functions: h_j(x) = ||x − x_j||^2 \log(||x − x_j||).
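A small sketch evaluating this thin-plate spline expansion; the centres, coefficients, and query point are placeholders chosen for illustration:

```python
import numpy as np

def thin_plate_rbf(r):
    """h(r) = r^2 * log(r), with the usual convention h(0) = 0."""
    out = np.zeros_like(r)
    nz = r > 0
    out[nz] = r[nz] ** 2 * np.log(r[nz])
    return out

def tps_eval(x, centers, beta0, beta, alpha):
    """f(x) = beta0 + beta^T x + sum_j alpha_j h_j(x), h_j(x) = ||x - x_j||^2 log||x - x_j||."""
    r = np.linalg.norm(x[None, :] - centers, axis=1)  # distances ||x - x_j||
    return beta0 + beta @ x + alpha @ thin_plate_rbf(r)

# Placeholder values (assumptions for illustration only):
rng = np.random.default_rng(2)
centers = rng.normal(size=(10, 2))
print(tps_eval(np.array([0.3, -0.1]), centers,
               beta0=0.5, beta=np.array([1.0, -2.0]), alpha=rng.normal(size=10)))
```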


SLIDE 12

Most general case

  • The general class of problems can be represented as:

\min_{f \in H} \sum_{i=1}^{n} L(y_i, f(x_i)) + λ J(f),   (1)

  • where L(y_i, f(x_i)) is a loss function, e.g. (y_i − f(x_i))^2,
  • J(f) is the penalty term,
  • H is the space on which J(f) is defined.
  • A general functional form can be used for J(f). See Girosi et al. (1995).
  • The solution can be written in terms of a finite number of coefficients.


SLIDE 13

Reproducing Kernel Hilbert Spaces (RKHS)

  • This is a subclass of the problems on the previous slide.
  • Let φ_1, φ_2, ... be an infinite sequence of basis functions.
  • H_K is defined to be the space of f's such that:

H_K = \{ f(x) \mid f(x) = \sum_{i=1}^{\infty} c_i φ_i(x) \}.

  • Let K be a positive definite kernel with an eigen-expansion (a numerical check is sketched below):

K(x_1, x_2) = \sum_{i=1}^{\infty} γ_i φ_i(x_1) φ_i(x_2),   (2)

where γ_i ≥ 0 and \sum_{i=1}^{\infty} γ_i^2 < ∞.
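As an illustrative numerical check (not from the slides): the finite-sample analogue of the condition γ_i ≥ 0 is that a kernel matrix built from a positive definite kernel, such as the Gaussian kernel used later, has non-negative eigenvalues.

```python
import numpy as np

def gaussian_kernel_matrix(X, theta=1.0):
    """K[i, j] = exp(-||x_i - x_j||^2 / theta)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / theta)

rng = np.random.default_rng(3)
X = rng.normal(size=(25, 4))
K = gaussian_kernel_matrix(X)
eigvals = np.linalg.eigvalsh(K)  # eigenvalues of the symmetric matrix K
print(eigvals.min() >= -1e-10)   # True: K is (numerically) positive semi-definite
```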


SLIDE 14

RKHS cont’d

  • Define J(f) to be:

J(f) = ||f||^2_{H_K} = \sum_{i=1}^{\infty} \frac{c_i^2}{γ_i} < ∞.

  • J(f) penalizes functions with small eigenvalues in the expansion (2).
  • Wahba (1990) shows that (1) with these f and J has a finite-dimensional solution given by:

f(x) = \sum_{i=1}^{n} β_i K(x, x_i).


SLIDE 15

RKHS cont’d

  • Given this, the problem in (1) reduces to a finite-dimensional optimization (solved in closed form for squared-error loss in the sketch below):

\min_{β} \left\{ L(y, Kβ) + λ β^T K β \right\},

where K is an n × n matrix with elements {K}_{i,j} = K(x_i, x_j).

  • Hence, the problem is defined in terms of L and K!
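For squared-error loss, L(y, Kβ) = ||y − Kβ||^2, this finite-dimensional problem has the closed-form solution β̂ = (K + λI)^{-1} y when K is positive definite (the usual kernel ridge estimate). A minimal sketch with an assumed Gaussian kernel and made-up data:

```python
import numpy as np

def gaussian_kernel(A, B, theta=1.0):
    """K[i, j] = exp(-||a_i - b_j||^2 / theta)."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / theta)

def fit(X, y, lam, theta=1.0):
    """Minimize ||y - K beta||^2 + lam * beta^T K beta  =>  beta = (K + lam*I)^{-1} y."""
    K = gaussian_kernel(X, X, theta)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_new, X_train, beta, theta=1.0):
    """f(x) = sum_j beta_j K(x, x_j)."""
    return gaussian_kernel(X_new, X_train, theta) @ beta

rng = np.random.default_rng(4)
X = rng.uniform(-2.0, 2.0, size=(40, 1))
y = np.sin(2.0 * X[:, 0]) + rng.normal(scale=0.1, size=40)
beta = fit(X, y, lam=0.1)
print(predict(np.array([[0.5]]), X, beta))  # prediction near x = 0.5
```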


SLIDE 16

Bayesian RKHS for classification

  • Mallick et al. (2005): molecular classification of two types of tumour using cDNA microarrays.
  • Data have undergone within- and between-slide normalization.
  • p genes, n tumour samples, so x_{i,j} is a measurement of the expression level of the jth gene for the ith sample.
  • They wish to model p(y|x) and use it to predict future observations.
  • Assume latent variables z_i such that:

p(y | z) = \prod_{i=1}^{n} p(y_i | z_i), \quad \text{and} \quad z_i = f(x_i) + ε_i, \quad i = 1, ..., n, \quad ε_i \sim \text{i.i.d. } N(0, σ^2).


SLIDE 17

Bayesian RKHS for classification

  • To develop the complete model, they need to specify p(y|z) and f.
  • f(x) is modeled by the RKHS approach.
  • Their kernel choices are Gaussian and polynomial.
  • Both kernels contain only one parameter θ, e.g. Gaussian:

K(x_i, x_j) = \exp(−||x_i − x_j||^2 / θ).

  • Hence, the random variable z_i is modeled by (a generative sketch follows below):

z_i = f(x_i) + ε_i = β_0 + \sum_{j=1}^{n} β_j K(x_i, x_j | θ) + ε_i, \quad i = 1, ..., n, \quad ε_i \sim \text{i.i.d. } N(0, σ^2).
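A small generative sketch of this latent-variable layer (a transcription of the equation with made-up dimensions and parameter values; not the authors' code):

```python
import numpy as np

def gaussian_kernel_matrix(X, theta):
    """K[i, j] = exp(-||x_i - x_j||^2 / theta)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / theta)

# Placeholder gene-expression data: n samples, p genes (assumed dimensions).
rng = np.random.default_rng(5)
n, p = 30, 100
X = rng.normal(size=(n, p))

theta, sigma2 = 50.0, 0.25            # illustrative kernel width and noise variance
beta0 = 0.0
beta = rng.normal(scale=0.3, size=n)  # coefficients of the kernel expansion

K = gaussian_kernel_matrix(X, theta)  # K[i, j] = K(x_i, x_j | theta)
eps = rng.normal(scale=np.sqrt(sigma2), size=n)
z = beta0 + K @ beta + eps            # z_i = beta0 + sum_j beta_j K(x_i, x_j | theta) + eps_i
```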


SLIDE 18

Bayesian RKHS for classification

  • The Bayesian formulation requires priors to be assigned to β, θ, and σ^2.
  • The model is specified as:

z_i | β, θ, σ^2 \sim N(z_i | K_i' β, σ^2),
β, σ^2 \sim N(β | 0, σ^2 M^{-1}) \, IG(σ^2 | γ_1, γ_2),
θ \sim \prod_{q=1}^{p} U(a_{1q}, a_{2q}),

where K_i' = (1, K(x_i, x_1 | θ), ..., K(x_i, x_n | θ)) and M is a diagonal matrix with elements ξ = (ξ_1, ..., ξ_{n+1}).

  • Jeffreys' independence prior p(ξ) ∝ \prod_{i=1}^{n+1} ξ_i^{-1} promotes sparseness (Figueiredo, 2002).


SLIDE 19

Bayesian RKHS for classification

  • p(y|z) is modeled on the basis of a loss function.
  • Two models considered in the paper are logistic regression and SVM.
  • The logistic regression approach:

p(y_i | z_i) = [p_i(z_i)]^{y_i} [1 − p_i(z_i)]^{1 − y_i}, \quad p_i(z_i) = \frac{e^{z_i}}{1 + e^{z_i}}.

  • It follows that the log-likelihood is equal to (computed in the sketch below):

\sum_{i=1}^{n} y_i z_i − \sum_{i=1}^{n} \log(1 + e^{z_i}).

  • So the loss function is given by: L(y_i, z_i) = y_i z_i − \log(1 + e^{z_i}).
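A direct transcription of this log-likelihood into code (a sketch; np.logaddexp(0, z) computes log(1 + e^z) stably):

```python
import numpy as np

def logistic_log_likelihood(y, z):
    """sum_i y_i * z_i  -  sum_i log(1 + exp(z_i))."""
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    return np.sum(y * z) - np.sum(np.logaddexp(0.0, z))

# Toy check with made-up labels and latent values:
print(logistic_log_likelihood([1, 0, 1], [2.0, -1.0, 0.5]))
```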


SLIDE 20

Bayesian RKHS for classification

  • MCMC sampling is used for sampling from the posterior p(β, θ, z, λ, σ^2 | y).
  • Proposed work:
      • variable selection:
          • kernel selection (which β_i = 0?)
          • regressor selection (which x_i to ignore?)
      • more than two classes (multivariate logistic regression).


SLIDE 21

References

[1] T. Evgeniou, M. Pontil and T. Poggio (2000). Regularization Networks and Support Vector Machines. Advances in Computational Mathematics, 13(1), 1–50.

[2] M. Figueiredo (2002). Adaptive sparseness using Jeffreys prior. In Advances in Neural Information Processing Systems 14 (eds T. G. Dietterich, S. Becker and Z. Ghahramani), 697–704. Cambridge: MIT Press.

[3] F. Girosi, M. Jones and T. Poggio (1995). Regularization Theory and Neural Networks Architectures. Neural Computation, 7(2), 219–269.

[4] T. Hastie, R. Tibshirani and J. Friedman (2001). The Elements of Statistical Learning. Springer.

[5] B. K. Mallick, D. Ghosh and M. Ghosh (2005). Bayesian Classification of Tumors Using Gene Expression Data. Journal of the Royal Statistical Society, Series B, 67, 219–234.

[6] G. Wahba (1990). Spline Models for Observational Data. SIAM (Society for Industrial and Applied Mathematics).
