Reproducing Kernel Hilbert Spaces for Classification

SLIDE 1

Reproducing Kernel Hilbert Spaces for Classification

Katarina Domijan and Simon P. Wilson

Department of Statistics, University of Dublin, Trinity College, Ireland

November 1, 2005

Working Group on Statistical Learning

SLIDE 2

General problem

  • Regression problem.
  • Data are available: (X_1, Y_1), ..., (X_n, Y_n); X_i ∈ R^p and Y_i ∈ R.
  • The aim is to find f(X) for predicting Y given the values of X.
  • Linear model: Y = f(X) + ε, where E(ε) = 0 and ε is independent of X; f(X) = X^T β for a set of parameters β.
  • Another approach is to use linear basis expansions: replace X with a transformation of it, and then use a linear model in the new space of input features.


SLIDE 3

General problem cont’d

  • Let h_m(X) : R^p → R be the mth transformation of X.
  • Then f(X) = \sum_{m=1}^{M} h_m(X) \beta_m.
  • Examples of h_m(X) are polynomial and trigonometric expansions, e.g. X_1^3, X_1 X_2, \sin(X_1), etc.
  • Classical solution: use least squares to estimate β in f(X): \hat{\beta} = (H^T H)^{-1} H^T y (sketched in code below).
  • Bayesian solution: place a prior (MVN) on the β's. The likelihood is given by:

f(Y | X, \beta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (y_i - f(x_i))^2 \right).
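As a concrete illustration of the classical solution, here is a minimal NumPy sketch (not from the slides; the data, basis functions, and sample size are assumptions) that builds a design matrix H of basis expansions and computes \hat{\beta} = (H^T H)^{-1} H^T y.

```python
import numpy as np

# Illustrative data (assumed, not from the slides): n = 50 points in R^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 - 2.0 * X[:, 0] ** 3 + 0.5 * X[:, 0] * X[:, 1] \
    + np.sin(X[:, 0]) + rng.normal(scale=0.1, size=50)

def basis(X):
    """Basis expansion h_m(X): constant, X1^3, X1*X2, sin(X1) (as on the slide)."""
    return np.column_stack([
        np.ones(len(X)),    # h_1(X) = 1
        X[:, 0] ** 3,       # h_2(X) = X1^3
        X[:, 0] * X[:, 1],  # h_3(X) = X1 * X2
        np.sin(X[:, 0]),    # h_4(X) = sin(X1)
    ])

H = basis(X)
# Classical least-squares estimate: beta_hat = (H^T H)^{-1} H^T y.
beta_hat = np.linalg.solve(H.T @ H, H.T @ y)
y_hat = H @ beta_hat  # fitted values f(x_i)
print(beta_hat)       # should be close to (1.0, -2.0, 0.5, 1.0)
```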

SLIDE 4

Example: a cubic spline

  • Assume X is one dimensional.
  • Divide the domain of X into contiguous intervals.
  • f is represented by a separate polynomial in each interval.
  • Basis functions are (see the code sketch after this list):

h_1(X) = 1,  h_2(X) = X,  h_3(X) = X^2,  h_4(X) = X^3,  h_5(X) = (X − ψ_1)^3_+,  h_6(X) = (X − ψ_2)^3_+.

  • ψ_1 and ψ_2 are knots.
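A short code sketch of this truncated-power basis (the knot locations ψ_1 = −0.5 and ψ_2 = 0.5 are illustrative assumptions):

```python
import numpy as np

def cubic_spline_basis(x, psi1=-0.5, psi2=0.5):
    """Truncated-power cubic spline basis h_1, ..., h_6 with knots psi1, psi2."""
    x = np.asarray(x, dtype=float)
    pos = lambda t: np.maximum(t, 0.0)  # (t)_+ , the positive part
    return np.column_stack([
        np.ones_like(x),     # h1(X) = 1
        x,                   # h2(X) = X
        x ** 2,              # h3(X) = X^2
        x ** 3,              # h4(X) = X^3
        pos(x - psi1) ** 3,  # h5(X) = (X - psi1)^3_+
        pos(x - psi2) ** 3,  # h6(X) = (X - psi2)^3_+
    ])

H = cubic_spline_basis(np.linspace(-2.0, 2.0, 9))
print(H.shape)  # (9, 6): one row per point, one column per basis function
```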


SLIDE 5

Example: a cubic spline

[Figure: the fitted cubic spline f(x) plotted against x, with the knots ψ_1 and ψ_2 marked on the axis.]

SLIDE 6

Use in classification

  • Let the outputs Y take values in a discrete set.
  • We want to divide the input space into a collection of regions labelled according to the classification.
  • For Y ∈ {0, 1}, the model is (see the sketch after this list):

\log \frac{P(Y = 1 | X = x)}{P(Y = 0 | X = x)} = f(x). \quad \text{Hence:} \quad P(Y = 1 | X = x) = \frac{e^{f(x)}}{1 + e^{f(x)}}.
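A tiny numerically stable sketch of the resulting class probability (illustrative only):

```python
import numpy as np

def p_class1(f_x):
    """P(Y = 1 | X = x) = exp(f(x)) / (1 + exp(f(x)))."""
    f_x = np.asarray(f_x, dtype=float)
    # exp(f - log(1 + e^f)) = e^f / (1 + e^f), computed without overflow
    return np.exp(f_x - np.logaddexp(0.0, f_x))

print(p_class1([-2.0, 0.0, 2.0]))  # approximately [0.119, 0.5, 0.881]
```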


SLIDE 7

Regularisation

  • Let’s move from cubic splines to consider all f that are twice continuously differentiable.
  • Many f will have \sum_{i=1}^{n} (y_i − f(x_i))^2 = 0.
  • So we look at the penalized RSS (approximated numerically in the sketch below):

RSS(f, λ) = \sum_{i=1}^{n} (y_i − f(x_i))^2 + λ \int (f''(t))^2 \, dt.

  • The second term penalizes curvature: it encourages splines whose slope changes slowly.
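A rough numerical sketch of RSS(f, λ) (an illustration under assumed inputs; the curvature integral is approximated by second differences on a fine grid):

```python
import numpy as np

def penalized_rss(f, x, y, lam, grid):
    """RSS(f, lambda) = sum_i (y_i - f(x_i))^2 + lambda * integral of (f''(t))^2 dt,
    with the integral approximated by finite differences on `grid`."""
    rss = np.sum((y - f(x)) ** 2)
    h = grid[1] - grid[0]
    fg = f(grid)
    f2 = (fg[2:] - 2.0 * fg[1:-1] + fg[:-2]) / h ** 2  # second-difference estimate of f''
    penalty = np.sum(f2 ** 2) * h                       # Riemann sum of (f'')^2
    return rss + lam * penalty

x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x)
grid = np.linspace(0.0, 1.0, 500)
# A wiggly candidate fits the data but pays a curvature penalty; a flat line pays none.
print(penalized_rss(lambda t: np.sin(2 * np.pi * t), x, y, lam=0.01, grid=grid))
print(penalized_rss(lambda t: 0.0 * t, x, y, lam=0.01, grid=grid))
```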


SLIDE 8

Regularisation cont’d

  • λ = 0: f can be any function that interpolates the data.
  • λ = ∞: f is a least squares line fit.
  • Note that this is defined on an infinite-dimensional function space.
  • However, the solution is finite-dimensional and unique:

f(x) = \sum_{j=1}^{n} D_j(x) \beta_j,

where the D_j(x) are an n-dimensional set of basis functions representing a family of natural splines.
  • Natural splines have additional constraints that force the function to be linear beyond the boundary knots.


SLIDE 9

Regularisation cont’d

  • Clearly, all inference about f is inference about β = (β_0, β_1, ..., β_n).
  • The LS solution can be shown to be (sketched in code below):

\hat{\beta} = (D^T D + λ Φ_D)^{-1} D^T y,

where D and Φ_D are matrices with elements {D}_{i,j} = D_j(x_i) and {Φ_D}_{j,k} = \int D_j''(t) D_k''(t) \, dt, respectively.
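A minimal sketch of this penalized least-squares solve; the matrices D and Φ_D are placeholders here (building them from an actual natural-spline basis is not shown):

```python
import numpy as np

def penalized_ls(D, Phi_D, y, lam):
    """Penalized least squares: beta_hat = (D^T D + lambda * Phi_D)^{-1} D^T y."""
    return np.linalg.solve(D.T @ D + lam * Phi_D, D.T @ y)

# Toy usage with placeholder matrices (assumptions, not a real spline basis):
rng = np.random.default_rng(1)
D = rng.normal(size=(30, 5))   # {D}_{i,j} = D_j(x_i)
B = rng.normal(size=(5, 5))
Phi_D = B.T @ B                # symmetric PSD stand-in for the curvature penalty matrix
y = rng.normal(size=30)
beta_hat = penalized_ls(D, Phi_D, y, lam=0.5)
print(beta_hat)
```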


SLIDE 10

Generalisation

  • We can generalise this to higher dimensions.
  • Suppose X ∈ R^2 and consider

\min_{f} \sum_{i=1}^{n} (y_i − f(x_i))^2 + λ J(f),

  • where J(f) is the penalty term; an example is

J(f) = \int_{R^2} \left[ \left( \frac{\partial^2 f(x)}{\partial x_1^2} \right)^2 + 2 \left( \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} \right)^2 + \left( \frac{\partial^2 f(x)}{\partial x_2^2} \right)^2 \right] dx_1 \, dx_2.


SLIDE 11

Generalisation cont’d

  • Optimizing with this penalty leads to a thin plate spline.
  • The solution can be written as a linear expansion of basis functions (evaluated in the sketch below):

f(x) = \beta_0 + \beta^T x + \sum_{j=1}^{n} \alpha_j h_j(x),

where the h_j are radial basis functions: h_j(x) = ||x − x_j||^2 \log(||x − x_j||).
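A small sketch evaluating this thin-plate spline expansion; the centres, coefficients, and query point are placeholders chosen for illustration:

```python
import numpy as np

def thin_plate_rbf(r):
    """h(r) = r^2 * log(r), with the usual convention h(0) = 0."""
    out = np.zeros_like(r)
    nz = r > 0
    out[nz] = r[nz] ** 2 * np.log(r[nz])
    return out

def tps_eval(x, centers, beta0, beta, alpha):
    """f(x) = beta0 + beta^T x + sum_j alpha_j h_j(x), h_j(x) = ||x - x_j||^2 log||x - x_j||."""
    r = np.linalg.norm(x[None, :] - centers, axis=1)  # distances ||x - x_j||
    return beta0 + beta @ x + alpha @ thin_plate_rbf(r)

# Placeholder values (assumptions for illustration only):
rng = np.random.default_rng(2)
centers = rng.normal(size=(10, 2))
print(tps_eval(np.array([0.3, -0.1]), centers,
               beta0=0.5, beta=np.array([1.0, -2.0]), alpha=rng.normal(size=10)))
```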


SLIDE 12

Most general case

  • The general class of problems can be represented as:

\min_{f \in H} \sum_{i=1}^{n} L(y_i, f(x_i)) + λ J(f),   (1)

  • where L(y_i, f(x_i)) is a loss function, e.g. (y_i − f(x_i))^2,
  • J(f) is the penalty term,
  • H is the space on which J(f) is defined.
  • A general functional form can be used for J(f). See Girosi et al. (1995).
  • The solution can be written in terms of a finite number of coefficients.


SLIDE 13

Reproducing Kernel Hilbert Spaces (RKHS)

  • This is a subclass of the problems on the previous slide.
  • Let φ_1, φ_2, ... be an infinite sequence of basis functions.
  • H_K is defined to be the space of f's such that:

H_K = \{ f(x) \mid f(x) = \sum_{i=1}^{\infty} c_i φ_i(x) \}.

  • Let K be a positive definite kernel with an eigen-expansion (a numerical check is sketched below):

K(x_1, x_2) = \sum_{i=1}^{\infty} γ_i φ_i(x_1) φ_i(x_2),   (2)

where γ_i ≥ 0 and \sum_{i=1}^{\infty} γ_i^2 < ∞.
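As an illustrative numerical check (not from the slides): the finite-sample analogue of the condition γ_i ≥ 0 is that a kernel matrix built from a positive definite kernel, such as the Gaussian kernel used later, has non-negative eigenvalues.

```python
import numpy as np

def gaussian_kernel_matrix(X, theta=1.0):
    """K[i, j] = exp(-||x_i - x_j||^2 / theta)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / theta)

rng = np.random.default_rng(3)
X = rng.normal(size=(25, 4))
K = gaussian_kernel_matrix(X)
eigvals = np.linalg.eigvalsh(K)  # eigenvalues of the symmetric matrix K
print(eigvals.min() >= -1e-10)   # True: K is (numerically) positive semi-definite
```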


SLIDE 14

RKHS cont’d

  • Define J(f) to be:

J(f) = ||f||^2_{H_K} = \sum_{i=1}^{\infty} \frac{c_i^2}{γ_i} < ∞.

  • J(f) penalizes functions with small eigenvalues in the expansion (2).
  • Wahba (1990) shows that (1) with these f and J has a finite-dimensional solution given by:

f(x) = \sum_{i=1}^{n} β_i K(x, x_i).


SLIDE 15

RKHS cont’d

  • Given this, the problem in (1) reduces to a finite-dimensional optimization (solved in closed form for squared-error loss in the sketch below):

\min_{β} \left\{ L(y, Kβ) + λ β^T K β \right\},

where K is an n × n matrix with elements {K}_{i,j} = K(x_i, x_j).

  • Hence, the problem is defined in terms of L and K!
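For squared-error loss, L(y, Kβ) = ||y − Kβ||^2, this finite-dimensional problem has the closed-form solution β̂ = (K + λI)^{-1} y when K is positive definite (the usual kernel ridge estimate). A minimal sketch with an assumed Gaussian kernel and made-up data:

```python
import numpy as np

def gaussian_kernel(A, B, theta=1.0):
    """K[i, j] = exp(-||a_i - b_j||^2 / theta)."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / theta)

def fit(X, y, lam, theta=1.0):
    """Minimize ||y - K beta||^2 + lam * beta^T K beta  =>  beta = (K + lam*I)^{-1} y."""
    K = gaussian_kernel(X, X, theta)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_new, X_train, beta, theta=1.0):
    """f(x) = sum_j beta_j K(x, x_j)."""
    return gaussian_kernel(X_new, X_train, theta) @ beta

rng = np.random.default_rng(4)
X = rng.uniform(-2.0, 2.0, size=(40, 1))
y = np.sin(2.0 * X[:, 0]) + rng.normal(scale=0.1, size=40)
beta = fit(X, y, lam=0.1)
print(predict(np.array([[0.5]]), X, beta))  # prediction near x = 0.5
```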


SLIDE 16

Bayesian RKHS for classification

  • Mallick et al. (2005): molecular classification of two types of tumour using cDNA microarrays.
  • Data have undergone within- and between-slide normalization.
  • p genes, n tumour samples, so x_{i,j} is a measurement of the expression level of the jth gene for the ith sample.
  • They wish to model p(y|x) and use it to predict future observations.
  • Assume latent variables z_i such that:

p(y | z) = \prod_{i=1}^{n} p(y_i | z_i), \quad \text{and} \quad z_i = f(x_i) + ε_i, \quad i = 1, ..., n, \quad ε_i \sim \text{i.i.d. } N(0, σ^2).


SLIDE 17

Bayesian RKHS for classification

  • To develop the complete model, they need to specify p(y|z) and f.
  • f(x) is modeled by the RKHS approach.
  • Their kernel choices are Gaussian and polynomial.
  • Both kernels contain only one parameter θ, e.g. Gaussian:

K(x_i, x_j) = \exp(−||x_i − x_j||^2 / θ).

  • Hence, the random variable z_i is modeled by (a generative sketch follows below):

z_i = f(x_i) + ε_i = β_0 + \sum_{j=1}^{n} β_j K(x_i, x_j | θ) + ε_i, \quad i = 1, ..., n, \quad ε_i \sim \text{i.i.d. } N(0, σ^2).
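A small generative sketch of this latent-variable layer (a transcription of the equation with made-up dimensions and parameter values; not the authors' code):

```python
import numpy as np

def gaussian_kernel_matrix(X, theta):
    """K[i, j] = exp(-||x_i - x_j||^2 / theta)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / theta)

# Placeholder gene-expression data: n samples, p genes (assumed dimensions).
rng = np.random.default_rng(5)
n, p = 30, 100
X = rng.normal(size=(n, p))

theta, sigma2 = 50.0, 0.25            # illustrative kernel width and noise variance
beta0 = 0.0
beta = rng.normal(scale=0.3, size=n)  # coefficients of the kernel expansion

K = gaussian_kernel_matrix(X, theta)  # K[i, j] = K(x_i, x_j | theta)
eps = rng.normal(scale=np.sqrt(sigma2), size=n)
z = beta0 + K @ beta + eps            # z_i = beta0 + sum_j beta_j K(x_i, x_j | theta) + eps_i
```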


SLIDE 18

Bayesian RKHS for classification

  • The Bayesian formulation requires priors to be assigned to β, θ, and σ^2.
  • The model is specified as:

z_i | β, θ, σ^2 \sim N(z_i | K_i' β, σ^2),
β, σ^2 \sim N(β | 0, σ^2 M^{-1}) \, IG(σ^2 | γ_1, γ_2),
θ \sim \prod_{q=1}^{p} U(a_{1q}, a_{2q}),

where K_i' = (1, K(x_i, x_1 | θ), ..., K(x_i, x_n | θ)) and M is a diagonal matrix with elements ξ = (ξ_1, ..., ξ_{n+1}).

  • Jeffreys' independence prior p(ξ) ∝ \prod_{i=1}^{n+1} ξ_i^{-1} promotes sparseness (Figueiredo, 2002).


SLIDE 19

Bayesian RKHS for classification

  • p(y|z) is modeled on the basis of a loss function.
  • Two models considered in the paper are logistic regression and SVM.
  • The logistic regression approach:

p(y_i | z_i) = [p_i(z_i)]^{y_i} [1 − p_i(z_i)]^{1 − y_i}, \quad p_i(z_i) = \frac{e^{z_i}}{1 + e^{z_i}}.

  • It follows that the log-likelihood is equal to (computed in the sketch below):

\sum_{i=1}^{n} y_i z_i − \sum_{i=1}^{n} \log(1 + e^{z_i}).

  • So the loss function is given by: L(y_i, z_i) = y_i z_i − \log(1 + e^{z_i}).
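A direct transcription of this log-likelihood into code (a sketch; np.logaddexp(0, z) computes log(1 + e^z) stably):

```python
import numpy as np

def logistic_log_likelihood(y, z):
    """sum_i y_i * z_i  -  sum_i log(1 + exp(z_i))."""
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    return np.sum(y * z) - np.sum(np.logaddexp(0.0, z))

# Toy check with made-up labels and latent values:
print(logistic_log_likelihood([1, 0, 1], [2.0, -1.0, 0.5]))
```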


SLIDE 20

Bayesian RKHS for classification

  • MCMC sampling is used for sampling from the posterior p(β, θ, z, λ, σ^2 | y).
  • Proposed work:
      • variable selection:
          • kernel selection (which β_i = 0?)
          • regressor selection (which x_i to ignore?)
      • more than two classes (multivariate logistic regression).


SLIDE 21

References

[1] T. Evgeniou, M. Pontil and T. Poggio (2000). Regularization Networks and Support Vector Machines. Advances in Computational Mathematics, 13(1), 1–50.

[2] M. Figueiredo (2002). Adaptive sparseness using Jeffreys prior. In Advances in Neural Information Processing Systems 14 (eds T. G. Dietterich, S. Becker and Z. Ghahramani), 697–704. Cambridge: MIT Press.

[3] F. Girosi, M. Jones and T. Poggio (1995). Regularization Theory and Neural Networks Architectures. Neural Computation, 7(2), 219–269.

[4] T. Hastie, R. Tibshirani and J. Friedman (2001). The Elements of Statistical Learning. Springer.

[5] B. K. Mallick, D. Ghosh and M. Ghosh (2005). Bayesian Classification of Tumors Using Gene Expression Data. Journal of the Royal Statistical Society, Series B, 67, 219–234.

[6] G. Wahba (1990). Spline Models for Observational Data. SIAM (Society for Industrial and Applied Mathematics).
