SLIDE 1

Introduction to Statistical Learning and Kernel Machines

Hichem SAHBI

CNRS UPMC

June 2018

SLIDE 2

Outline

• Introduction to Statistical Learning
  - Definitions
  - Probability Tools
  - Generalization Bounds
  - Machine Learning Algorithms
• Kernel Machines: Supervised and Unsupervised Learning
  - The Representer Theorem
  - Supervised Learning (Support Vector Machines and Regression)
  - Kernel Design (kernel combination, CDK kernels, ...)
  - Unsupervised Learning (Kernel PCA and CCA)

SLIDE 4

1. The Representer Theorem
2. Supervised Learning (SVMs and SVRs)
3. Kernel Design
4. Unsupervised Learning (Kernel PCA and CCA)

SLIDE 5

Pattern Recognition Problems

Given a pattern (observation) X ∈ 𝒳, the goal is to predict the unknown label Y of X.

• Character recognition (OCR): X is an image, Y is a letter.
• Face detection (resp. recognition): X is an image, Y indicates the presence of a face in the picture (resp. the identity).
• Text classification: X is a text, Y is a category (topic, spam/non-spam, ...).
• Medical diagnosis: X is a set of features (age, genome, ...), Y is the risk.

SLIDE 6

Section 1 The Representer Theorem

SLIDE 7

Regularization, Kernel Methods and Representer Theorem

Tikhonov regularization:

  $\min_{g \in \mathcal{G}} \; R_n(g) + \lambda\,\Omega(g), \qquad \lambda \geq 0$

For a particular regularizer $\Omega(g)$ and class $\mathcal{G}$, the solution of the above problem (Kimeldorf and Wahba, 1971) takes the form

  $g_\alpha(\cdot) = \sum_{i=1}^{n} \alpha_i\, k(\cdot, X_i)$,  where $\{(X_i, Y_i)\}_{i=1}^{n} \subseteq \mathcal{X} \times \mathcal{Y}$ is fixed.

Here k is a kernel: symmetric, continuous on $\mathcal{X} \times \mathcal{X}$ and positive definite (Mercer, 1909), with $k(X, X') = \langle \Phi(X), \Phi(X') \rangle$.

Examples of empirical risks:

  $R_n(g) = \frac{1}{n} \sum_{i=1}^{n} (g(X_i) - Y_i)^2$  (kernel regression)

  $R_n(g) = \frac{1}{2n} \sum_{i=1}^{n} (\mathrm{sign}[g(X_i)] - Y_i)^2$  (e.g., max-margin classifiers, SVMs)
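As a quick illustration (not part of the original slides), the sketch below instantiates the representer theorem for kernel ridge regression, i.e., the squared loss above with $\Omega(g) = \|g\|^2$; the Gaussian kernel and the toy data are arbitrary choices:

# Minimal sketch: with the squared loss and Omega(g) = ||g||^2, the expansion
# coefficients of the representer theorem have the closed form
# alpha = (K + lambda * n * I)^{-1} Y  (kernel ridge regression).
import numpy as np

def gaussian_gram(X, sigma=1.0):
    # K_ij = exp(-||x_i - x_j||^2 / sigma^2)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

lam = 1e-2
K = gaussian_gram(X)
alpha = np.linalg.solve(K + lam * len(X) * np.eye(len(X)), Y)
# the learned function is g(X) = sum_i alpha_i k(X, X_i), exactly the expansion above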

SLIDE 8

Section 2 Supervised Learning (SVMs and SVRs)

SLIDE 9

Support Vector Machines

SLIDE 10

Support Vector Machines (Large Margin Classifiers)

Let {(X_1, Y_1), ..., (X_n, Y_n)} be a training set generated i.i.d.

[Figure: separating hyperplane $w^t x + b = 0$ with margin boundaries $w^t x + b = \pm 1$ and margin width $\frac{2}{\|w\|}$]

Primal problem:

  $\min_{w,b} \; \frac{1}{2} w' w \quad \text{s.t.} \quad Y_i (w' X_i + b) - 1 \geq 0, \;\; \forall i$

Optimality conditions lead to $w = \sum_i \alpha_i Y_i X_i$ and the dual form

  $\max_{\{\alpha_i\}} \; \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j Y_i Y_j \langle X_i, X_j \rangle \quad \text{s.t.} \quad \alpha_i \geq 0, \;\forall i \quad \text{and} \quad \sum_i \alpha_i Y_i = 0$

SLIDE 12

VC Dimension of Large Margin Classifiers

The set of hyperplane classifiers with margin at least M has a VC dimension upper bounded by h ≤ r²/M², where r is the radius of the smallest sphere containing all the patterns X.

SLIDE 13

Interpretation of Lagrange Multipliers

• α_i > 0 implies Y_i (w'X_i + b) = 1 : X_i is a support vector.
• α_i = 0 implies Y_i (w'X_i + b) > 1 : X_i is a useless (non-support) vector.

[Figure: separating hyperplane $w^t x + b = 0$ with margin boundaries $w^t x + b = \pm 1$ and margin width $\frac{2}{\|w\|}$; support vectors lie on the margin boundaries]

SLIDE 14

Classification Function

[Figure: separating hyperplane $w^t x + b = 0$ with margin boundaries $w^t x + b = \pm 1$]

Classification function:

  $g_\alpha(X) - b = \langle w, X \rangle = \sum_{Y_i = +1} \alpha_i \langle X_i, X \rangle - \sum_{Y_i = -1} \alpha_i \langle X_i, X \rangle$

SLIDE 15

Linear Soft-SVMs

Introduce slack variables {ξ_1, ..., ξ_n} to allow misclassification, trading off a large margin against misclassification errors.

  $\min_{w,b} \; \frac{1}{2} w' w + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad Y_i (w' X_i + b) + \xi_i \geq 1, \;\; \xi_i \geq 0, \;\; \forall i$

[Figure: soft margin with a slack variable ξ_i for a point violating the margin]

SLIDE 16

Dual Formulation

  $\max_{\{\alpha_i\}} \; \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j Y_i Y_j \langle X_i, X_j \rangle \quad \text{s.t.} \quad 0 \leq \alpha_i \leq C, \;\forall i \quad \text{and} \quad \sum_i \alpha_i Y_i = 0$
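As a hedged, minimal illustration (not from the slides), the snippet below solves this soft-margin dual with scikit-learn and reads back w = Σ_i α_i Y_i X_i from the support vectors; the toy data and the value C = 10 are arbitrary:

# Soft-margin SVM: scikit-learn solves the dual above; dual_coef_ stores
# alpha_i * Y_i for the support vectors only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(+2, 1, (20, 2))])
Y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=10.0).fit(X, Y)
w = clf.dual_coef_ @ clf.support_vectors_   # w = sum_i alpha_i Y_i X_i
print(np.allclose(w, clf.coef_))            # True: same separating hyperplane
print("support vector indices:", clf.support_)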

SLIDE 17

Non-Linear SVMs

[Figure: a nonlinear map Φ sends the data into a feature space where they become linearly separable]

  $\max_{\{\alpha_i\}} \; \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j Y_i Y_j \langle \Phi(X_i), \Phi(X_j) \rangle \quad \text{s.t.} \quad \alpha_i \geq 0, \;\forall i \quad \text{and} \quad \sum_i \alpha_i Y_i = 0$

  $g_\alpha(X) - b = \sum_{Y_i = +1} \alpha_i \langle \Phi(X_i), \Phi(X) \rangle - \sum_{Y_i = -1} \alpha_i \langle \Phi(X_i), \Phi(X) \rangle$

The product $\langle \Phi(X), \Phi(X') \rangle$ defines a kernel $k(X, X')$.

SLIDE 18

Kernels

Kernels are symmetric and positive (semi-)definite functions that measure similarity between data. Positive semi-definite means $k(X, X') = \langle \Phi(X), \Phi(X') \rangle$ and

  $\forall X_1, \ldots, X_n \in \mathcal{X}, \;\; \forall c_1, \ldots, c_n \in \mathbb{R}, \quad \sum_{i,j} c_i c_j\, k(X_i, X_j) \geq 0$

Equivalently, the Gram (kernel) matrix K with $K_{ij} = k(X_i, X_j)$ has nonnegative eigenvalues.

Kernels on vectorial data: linear $\langle X, X' \rangle$, polynomial $(1 + \langle X, X' \rangle)^p$, Gaussian $\exp(-\frac{1}{\sigma^2}\|X - X'\|^2)$, etc.

Kernels can be designed using closure operations (additions, products, exponentiation, etc.).
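A minimal numerical sketch of the p.s.d. criterion above (illustrative only, not from the slides): build the Gram matrices of the linear, polynomial and Gaussian kernels on random data and check that their spectra are nonnegative:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))

def linear(X):             return X @ X.T
def polynomial(X, p=3):    return (1.0 + X @ X.T) ** p
def gaussian(X, sigma=2.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma ** 2)

for name, K in [("linear", linear(X)), ("polynomial", polynomial(X)), ("gaussian", gaussian(X))]:
    # symmetric matrix: real spectrum; eigenvalues should be >= 0 up to numerical error
    print(name, "min eigenvalue:", np.linalg.eigvalsh(K).min())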

SLIDE 19

Examples (Linear vs Gaussian)

SLIDE 20

Gaussian Kernel

  $k(X, X') = \langle \Phi(X), \Phi(X') \rangle = \exp\big(-\frac{1}{\sigma^2}\|X - X'\|^2\big)$

The dimension of the feature space $\mathbb{R}^H$ is infinite. The Gaussian kernel has good generalization properties but requires a careful selection of the scale parameter σ, usually through a tedious cross-validation process.

[Figure: generalization error vs. scale parameter, showing the trade-off between over-fitting (small σ) and over-smoothing (large σ)]

SLIDE 21

Gaussian Kernel

  $\langle \Phi(X), w \rangle = \sum_j \alpha_j\, k(X, X_j) = \sum_j \alpha_j \exp(-\|X - X_j\|^2 / \sigma^2)$

Large scale (σ → ∞): the machine degenerates into a linear one,

  $\langle \Phi(X), w \rangle \simeq \sum_j \alpha_j (1 - \|X - X_j\|^2 / \sigma^2)
   = -\frac{1}{\sigma^2} \sum_j \alpha_j \big(\langle X, X \rangle + \langle X_j, X_j \rangle - 2\langle X, X_j \rangle\big) + \ldots
   = \sum_j \frac{2\alpha_j}{\sigma^2} \langle X, X_j \rangle + \text{Cst}
   = \big\langle X, \sum_j \frac{2\alpha_j}{\sigma^2} X_j \big\rangle + \text{Cst}$

Small scale (σ → 0): the machine behaves like a lookup table over the training samples,

  $\langle \Phi(X), w \rangle = \sum_j \alpha_j \exp(-\|X - X_j\|^2 / \sigma^2) \simeq \sum_j \alpha_j\, 1_{\{X = X_j\}}$

SLIDE 23

The Spiral Example

SLIDE 24

Triangular Kernel

  $k(X, X') = -\|X - X'\|^p, \quad p \in \,]0, 2]$

While the Gaussian kernel behaves either as a Dirac-like function, a bump, or a uniform weighting depending on the scale, the triangular kernel behaves similarly at all scales.

SLIDE 25

Scale Invariance

Consider a training set and its scaled version:

  $\mathcal{T} = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}, \qquad \gamma\mathcal{T} = \{(\gamma X_1, Y_1), \ldots, (\gamma X_n, Y_n)\}$

  $F(\mathcal{T}, \alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j Y_i Y_j\, k(X_i, X_j)$

  $F(\gamma\mathcal{T}, \alpha^\gamma) = \sum_i \alpha^\gamma_i - \frac{1}{2} \sum_i \sum_j \alpha^\gamma_i \alpha^\gamma_j Y_i Y_j\, k(\gamma X_i, \gamma X_j)$

We have [Sahbi & Fleuret, 2002]:

  $\arg\max[F^\gamma(\alpha^\gamma)] = \frac{1}{\gamma^p} \arg\max[F(\alpha)]$  (equivariant),  $\qquad g_{\alpha^\gamma}(\gamma X) = g_\alpha(X)$  (invariant)

SLIDE 26

Scale Invariance (Examples)

[Figure: SVM decision boundaries on the same data at scales ×10⁻¹, ×1 and ×10, for the Gaussian kernel (σ = 0.2) and for the triangular kernel]

SLIDE 27

VC Dimension

For a given training set $\mathcal{T} = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, we have

  $\begin{pmatrix} g(X_1) \\ g(X_2) \\ \vdots \\ g(X_n) \end{pmatrix}
   = \begin{pmatrix}
       k(X_1, X_1) & k(X_1, X_2) & \cdots & k(X_1, X_n) \\
       k(X_2, X_1) & k(X_2, X_2) & \cdots & k(X_2, X_n) \\
       \vdots      & \vdots      & \ddots & \vdots      \\
       k(X_n, X_1) & k(X_n, X_2) & \cdots & k(X_n, X_n)
     \end{pmatrix}
     \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_n \end{pmatrix}
   + b \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}$

Since any Gram matrix built with this (triangular) kernel is invertible for 0 < p < 2 [Micchelli, 1986]:

1. We can learn any function with null empirical error (i.e., ∀i, g(X_i) = Y_i).
2. The VC-dimension of SVMs trained with this kernel is ∞.

This does not mean that the kernel is bad: the actual dimension of the data does not exceed n.

SLIDE 28

Leave One Out Error (LOOE) bound

In SVMs, only the support vectors are critical for training and classification.

Training:  $\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j Y_i Y_j\, k(X_i, X_j)$

Classification:  $g(X) = \sum_{i=1}^{n} \alpha_i Y_i\, k(X, X_i) + b$

When removing the non-support vectors, the optimal solution, and hence the decision function, remain unchanged. This yields the leave-one-out error bound

  $R(g) \leq R_n(g) + \frac{N_{SV}(g)}{n}$

where $N_{SV}(g)$ denotes the number of support vectors.

SLIDE 29

Support Vector Regression

SLIDE 30

Support Vector Regression

Let {(X_1, Y_1), ..., (X_n, Y_n)} be a training set generated i.i.d. according to a probability distribution; the labels are now real-valued (for instance age prediction, weight prediction, etc.).

  $\min_{w,b} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)$

  $\text{s.t.} \quad Y_i - (\langle w, \Phi(X_i) \rangle + b) \leq \epsilon + \xi_i, \qquad
   (\langle w, \Phi(X_i) \rangle + b) - Y_i \leq \epsilon + \xi_i^*, \qquad \xi_i, \xi_i^* \geq 0$

SLIDE 31

Support Vector Regression (Dual Form and Solution)

  $\max_{\alpha} \; -\frac{1}{2} \sum_{i,j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \langle \Phi(X_i), \Phi(X_j) \rangle
   \;-\; \epsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*) \;+\; \sum_{i=1}^{n} Y_i (\alpha_i - \alpha_i^*)$

  $\text{s.t.} \quad \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0, \qquad \alpha_i, \alpha_i^* \in [0, C]$

Solution:

  $w = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, \Phi(X_i), \qquad
   g(X) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) \langle \Phi(X_i), \Phi(X) \rangle + b$
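As a hedged illustration (not from the slides), ε-SVR with an RBF kernel in scikit-learn returns exactly a model of this dual form, g(X) = Σ_i (α_i − α_i*) k(X_i, X) + b; data and hyper-parameters below are arbitrary:

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, (80, 1)), axis=0)
Y = np.sin(X).ravel() + 0.1 * rng.normal(size=80)

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0).fit(X, Y)
# reg.dual_coef_ holds (alpha_i - alpha_i^*) for the support vectors,
# reg.intercept_ holds b
Y_hat = reg.predict(X)
print("support vectors:", len(reg.support_), "out of", len(X))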

SLIDE 32

Section 3 Kernel Design

SLIDE 33

Standard Kernels (vectorial data)

• Laplacian: $k(X, X') = \exp\big(-\frac{\|X - X'\|}{\sigma}\big)$
• Hyperbolic Tangent (Sigmoid): $k(X, X') = \tanh(a\langle X, X' \rangle + c)$
• Rational Quadratic: $k(X, X') = 1 - \frac{\|X - X'\|^2}{\|X - X'\|^2 + c}$
• Multiquadric: $k(X, X') = \sqrt{\|X - X'\|^2 + c}$
• Inverse Multiquadric: $k(X, X') = \frac{1}{\sqrt{\|X - X'\|^2 + c}}$
• Log: $k(X, X') = -\log(\|X - X'\|^d + 1)$
• Cauchy: $k(X, X') = \frac{1}{1 + \|X - X'\|^2 / \sigma^2}$
• ANOVA: $k(X, X') = \sum_{k=1}^{d} \exp(-\sigma (X^k - X'^k)^2)^d$
• Histogram Intersection: $k(X, X') = \sum_{k=1}^{d} \min(X^k, X'^k)$
• Bayesian: $k(X, X') = \prod_{k=1}^{d} k_l(X^k, X'^k)$ with $k_l(a, b) = \sum_c P(a|c)\, P(c|b)$
• Chi-Square: $k(X, X') = 1 - \sum_{k=1}^{d} \frac{(X^k - X'^k)^2}{X^k + X'^k}$

What about data described with graphs or strings (text, social networks, etc.)?
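For concreteness, here is a hedged sketch (not from the slides) of two of the kernels above on nonnegative feature vectors such as histograms; the values are illustrative only:

import numpy as np

def histogram_intersection(x, y):
    # k(X, X') = sum_k min(X^k, X'^k)
    return np.minimum(x, y).sum()

def chi_square(x, y, eps=1e-12):
    # k(X, X') = 1 - sum_k (X^k - X'^k)^2 / (X^k + X'^k)
    return 1.0 - (((x - y) ** 2) / (x + y + eps)).sum()

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.3, 0.3, 0.4])
print(histogram_intersection(x, y), chi_square(x, y))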

SLIDE 34

Standard Kernels on graph data (random walks)

Given two graphs G = (V, E) and G' = (V', E'), define the product graph $G_\times = (V_\times, E_\times)$ with

  $V_\times = \{(u, u') : u \in V, \; u' \in V'\}, \qquad
   E_\times = \{((u, u'), (v, v')) : (u, v) \in E, \; (u', v') \in E'\}$

Its adjacency matrix is the Kronecker product of the two adjacency matrices:

  $A_\times = A_G \otimes A_{G'} =
   \begin{pmatrix}
     [A_G]_{11} A_{G'} & \cdots & [A_G]_{1n} A_{G'} \\
     \vdots            & \ddots & \vdots            \\
     [A_G]_{n1} A_{G'} & \cdots & [A_G]_{nn} A_{G'}
   \end{pmatrix}$

The random-walk kernel counts common walks of all lengths, discounted by λ:

  $k(G, G') = \sum_{(u,u'),(v,v')} \; \sum_{\ell = 0}^{\infty} \lambda^\ell\, \big[A_\times^\ell\big]_{(u,u'),(v,v')}$
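A hedged numerical sketch of this random-walk kernel (not from the slides): the Kronecker product of the adjacency matrices gives A_x, and the geometric series Σ_ℓ λ^ℓ A_x^ℓ is summed in closed form as (I − λ A_x)⁻¹, assuming λ is small enough for convergence:

import numpy as np

def random_walk_kernel(A1, A2, lam=0.05):
    Ax = np.kron(A1, A2)                        # adjacency of the product graph
    n = Ax.shape[0]
    S = np.linalg.inv(np.eye(n) - lam * Ax)     # sum_{l >= 0} lam^l Ax^l
    return S.sum()                              # sum over all pairs of product-graph nodes

# two small undirected graphs: a triangle and a path
A_triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
A_path     = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
print(random_walk_kernel(A_triangle, A_path))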

SLIDE 35

Standard Kernels on graph data (graphlets)

Sample and count subgraphs of limited size T in G and G'. These subgraphs, referred to as graphlets, are not restricted to chains.

  $k(G, G') = \sum_{g \in \mathcal{D}} \min\big(H_G(g), H_{G'}(g)\big)$

where $H_G(g)$ counts the occurrences of graphlet g in G and $\mathcal{D}$ is the graphlet dictionary.

SLIDE 36

Standard Kernels on string data

K(car, cat)?

  u          c-a   c-t   a-t   b-a   b-t   c-r   a-r   b-r
  φ_u(cat)   λ²    λ³    λ²    .     .     .     .     .
  φ_u(car)   λ²    .     .     .     .     λ³    λ²    .
  φ_u(bat)   .     .     λ²    λ²    λ³    .     .     .
  φ_u(bar)   .     .     .     λ²    .     .     λ²    λ³

  $\phi_u(S) = \sum_{i : u = S[i]} \lambda^{l(i)}, \qquad k(S, S') = \sum_{u} \phi_u(S)\, \phi_u(S')$

Hence K(car, cat) = λ⁴.

SLIDE 37

Explicit Kernel Maps

Sometimes knowing Φ explicitly is very helpful, especially when training and testing on very large-scale datasets.

  $g(X) = \sum_{i=1}^{n} \alpha_i Y_i\, k(X, X_i) + b$

If we know Φ(·), we may pre-compute $w = \sum_i \alpha_i Y_i \Phi(X_i)$ and use

  $g(X) = \langle w, \Phi(X) \rangle + b$

The complexity of testing drops from O(n) to O(1), and the complexity of training also drops from O(n³) to O(n).

SLIDE 38

Explicit Kernel Maps (Examples)

• Linear: k(X, X') = ⟨X, X'⟩ (the identity map).

• Polynomial: k(X, X') = ⟨X, X'⟩^p = ⟨Φ(X), Φ(X')⟩ with Φ(X) = X ⊗ ... ⊗ X (p times).
  Example: let k(X, X') = ⟨X, X'⟩², X = (a_1, b_1), X' = (a_2, b_2). Then
  Φ(X) = X ⊗ X = (a_1², a_1 b_1, b_1 a_1, b_1²), Φ(X') = X' ⊗ X' = (a_2², a_2 b_2, b_2 a_2, b_2²), and
  ⟨Φ(X), Φ(X')⟩ = a_1² a_2² + 2 a_1 a_2 b_1 b_2 + b_1² b_2² = ⟨X, X'⟩².

• Histogram intersection: k(X, X') = Σ_d min(X^d, X'^d) = ⟨Φ(X), Φ(X')⟩ with Φ(X) = (ψ(X¹)ᵗ ... ψ(X^d)ᵗ)ᵗ and ψ(·) the "decimal-to-unary" map.
  Example: X = (3, 2, 1), X' = (1, 2, 3):
  ⟨Φ(X), Φ(X')⟩ = ⟨0111 0011 0001, 0001 0011 0111⟩ = 4 = min(3, 1) + min(2, 2) + min(1, 3) = K(X, X').
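The two explicit maps above can be checked numerically; this sketch (illustrative, not from the slides) verifies that ⟨Φ(X), Φ(X')⟩ reproduces the degree-2 polynomial kernel and the histogram-intersection kernel:

import numpy as np

X, Xp = np.array([2.0, 3.0]), np.array([1.0, 4.0])
phi, phip = np.kron(X, X), np.kron(Xp, Xp)     # tensor-product map for p = 2
print(phi @ phip, (X @ Xp) ** 2)               # both equal 196.0

def unary(h, n_max):
    # "decimal-to-unary" map: integer c -> c ones followed by zeros
    return np.concatenate([(np.arange(n_max) < c).astype(float) for c in h])

H, Hp = np.array([3, 2, 1]), np.array([1, 2, 3])
print(unary(H, 3) @ unary(Hp, 3), np.minimum(H, Hp).sum())   # both equal 4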

SLIDE 39

Kernel Design Principles

So far, we have seen standard kernels. This is not enough! How can we design other kernels? Through kernel combination: sums, products and exponentiation of kernels.

Proposition. If k_1, k_2 are p.s.d., then
  k(X, X') = k_1(X, X') + k_2(X, X')
  k(X, X') = k_1(X, X') · k_2(X, X')
are also p.s.d.

Proposition. Let k(X, X') = exp(k_1(X, X')). If k_1 is p.s.d., then k is also p.s.d.

SLIDE 40

Kernel Design Principles (proofs)

Sum: ∀X_1, ..., X_n ∈ 𝒳, ∀c_1, ..., c_n ∈ ℝ,

  $\sum_{i,j} c_i c_j \big(k_1(X_i, X_j) + k_2(X_i, X_j)\big)
   = \sum_{i,j} c_i c_j\, k_1(X_i, X_j) + \sum_{i,j} c_i c_j\, k_2(X_i, X_j) \geq 0$

Product:

  $k_1(X_i, X_j) \cdot k_2(X_i, X_j)
   = \langle \Phi_1(X_i), \Phi_1(X_j) \rangle \cdot \langle \Phi_2(X_i), \Phi_2(X_j) \rangle
   = \langle \Phi_1(X_i) \otimes \Phi_2(X_i), \; \Phi_1(X_j) \otimes \Phi_2(X_j) \rangle
   = \langle \Phi(X_i), \Phi(X_j) \rangle$

Exponential (Taylor expansion and closure of p.s.d. kernels w.r.t. sums and products):

  $\exp(k_1(X_i, X_j)) = \sum_{\ell=0}^{\infty} \frac{1}{\ell!} \big(k_1(X_i, X_j)\big)^\ell$
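A small numerical check of these closure properties (illustrative only, not from the slides): the sum, the pointwise product and the pointwise exponential of two Gram matrices remain p.s.d.:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 4))
K1 = X @ X.T                                               # linear kernel
K2 = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))     # Gaussian kernel

for name, K in [("sum", K1 + K2), ("product", K1 * K2), ("exp", np.exp(K1))]:
    # eigenvalues should be >= 0 up to numerical error
    print(name, "min eigenvalue:", np.linalg.eigvalsh(K).min())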

SLIDE 41

Multiple Kernel Learning (MKL)

Use a weighted linear combination of kernels

  $k(X, X') = \sum_{m=1}^{M} \beta_m\, k_m(X, X')$,  with $\beta_m \geq 0$ (to guarantee p.s.d.) and $\sum_m \beta_m = 1$ (convex or sparse combination).

  $g(X) = \sum_{m=1}^{M} \beta_m \Big( \sum_{i=1}^{n} \alpha_i\, k_m(X, X_i) + b_m \Big) = \sum_{m=1}^{M} \beta_m\, g_m(X)$
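As a hedged sketch (not from the slides), the snippet below trains an SVM on a fixed convex combination of Gram matrices; this is the inner step of MKL, while learning the weights β as in the bi-level scheme of the next slide is omitted:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (30, 3)), rng.normal(+1, 1, (30, 3))])
Y = np.array([-1] * 30 + [+1] * 30)

K_lin = X @ X.T
K_rbf = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))
beta  = np.array([0.3, 0.7])                   # fixed convex weights (sum to 1)
K     = beta[0] * K_lin + beta[1] * K_rbf

clf = SVC(kernel="precomputed", C=1.0).fit(K, Y)
print("training accuracy:", clf.score(K, Y))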

SLIDE 42

Multiple Kernel Learning (MKL), Bi-level Optimization

  $g_m(X) = \langle w_m, \Phi(X) \rangle + b_m$

  $\min_{\{w_m, b_m\}, \xi, \{\beta_m\}} \; \frac{1}{2} \sum_m \beta_m \|w_m\|^2 + C \sum_i \xi_i
   \quad \text{s.t.} \quad Y_i \sum_m \beta_m g_m(X_i) + \xi_i \geq 1, \;\; \xi_i \geq 0, \;\forall i, \qquad \beta_m \geq 0, \;\; \sum_m \beta_m = 1$

  $\max_{\{\alpha_i\}} \; \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j Y_i Y_j \sum_m \beta_m k_m(X_i, X_j)
   \quad \text{s.t.} \quad \alpha_i \geq 0, \;\forall i \quad \text{and} \quad \sum_i \alpha_i Y_i = 0$

Step 1: fix {β_m} and solve the above QP.
Step 2: fix {α_i} (and hence {(w_m, b_m)}_m) and solve the above LP.

SLIDE 43

Deep Multiple Kernel Learning

Deep MKL is a multi-layer perceptron (MLP) acting on kernel values, where $k^{\ell}_p$ at layer ℓ is a nonlinear combination of the kernel values at layer (ℓ - 1):

  $k^{\ell}_p(X, X') = \sigma\Big( \sum_q w^{(\ell-1)}_{q,p}\, k^{(\ell-1)}_q(X, X') \Big)$

When combined with SVMs, the whole framework becomes semi-parametric (parametric in kernel design and non-parametric in SVM learning).

SLIDE 44

Deep Multiple Kernel Learning : Optimization

Step 1: fix {w_{q,p}} (and hence the deep kernel) and solve the QP for the SVM parameters (as in MKL).
Step 2: fix the SVM parameters and solve the QP objective (denoted J) w.r.t. the weights, now using backpropagation + gradient descent.

Algorithm (Deep MKL)
Inputs: initial $w^{(\ell)}$, $\ell = 1, \ldots, L-1$, and $\frac{\partial J}{\partial k^{(L)}_1(\cdot,\cdot)}$
Outputs: optimal $w^{(\ell)}$, $\ell = 1, \ldots, L-1$
Repeat till convergence
  for ℓ = L : 2 do
    Compute the gradient:
      $\frac{\partial J}{\partial w^{(\ell-1)}_{q,p}} = \sum_{i,j} \frac{\partial J}{\partial k^{(\ell)}_p(x_i, x_j)}\, \frac{\partial k^{(\ell)}_p(x_i, x_j)}{\partial w^{(\ell-1)}_{q,p}}$
    Compute the sensitivity:
      $\frac{\partial J}{\partial k^{(\ell-1)}_q(x_i, x_j)} = \sum_{p} \frac{\partial J}{\partial k^{(\ell)}_p(x_i, x_j)}\, \frac{\partial k^{(\ell)}_p(x_i, x_j)}{\partial k^{(\ell-1)}_q(x_i, x_j)}$
  end
  Update the weights: $w^{(\ell)}_{q,p} \leftarrow w^{(\ell)}_{q,p} - \beta\, \frac{\partial J}{\partial w^{(\ell)}_{q,p}}$, $\ell = 1, \ldots, L-1$.
EndRepeat

SLIDE 45

Context Dependent Kernels (Image Comparison)

SLIDE 46

Context Dependent Kernels (Problem Formulation)

  $\min_{k} \;\; \sum_{i,j} k(X_i, X_j)\, d(X_i, X_j)
   \;+\; \beta \sum_{i,j} k(X_i, X_j) \log k(X_i, X_j)
   \;+\; \alpha \sum_{i,j} k(X_i, X_j) \Big( - \sum_{X_k \in \mathcal{N}(X_i),\, X_\ell \in \mathcal{N}(X_j)} k(X_k, X_\ell) \Big)$

  $\text{s.t.} \quad k(X_i, X_j) \in [0, 1], \qquad \sum_{i,j} k(X_i, X_j) = 1$

The solution is obtained as a fixed point, iterated over t:

  $k_t(X_i, X_j) \;\propto\; \exp\Big(\frac{-d(X_i, X_j)}{\beta} - 1\Big)\,
   \exp\Big(\frac{2\alpha}{\beta} \sum_{k,\ell} 1_{\{X_k \in \mathcal{N}(X_i)\}}\, 1_{\{X_\ell \in \mathcal{N}(X_j)\}}\, k_{t-1}(X_k, X_\ell)\Big)$
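A hedged sketch of this fixed-point update on toy data (all names and parameter values are illustrative, not from the slides); with N the binary neighborhood matrix, the context term Σ_{k,ℓ} 1{X_k ∈ N(X_i)} 1{X_ℓ ∈ N(X_j)} k_{t-1}(X_k, X_ℓ) is simply (N K N^t)_{ij}:

import numpy as np

def cdk(D, N, alpha=0.1, beta=1.0, iters=10):
    K = np.exp(-D / beta - 1.0)
    K /= K.sum()                               # initial kernel, sums to 1
    for _ in range(iters):
        context = N @ K @ N.T                  # neighborhood (context) term
        K = np.exp(-D / beta - 1.0) * np.exp(2.0 * alpha / beta * context)
        K /= K.sum()                           # re-normalize (proportionality)
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))
D = ((X[:, None] - X[None, :]) ** 2).sum(-1)   # pairwise distances
N = (D < np.quantile(D, 0.2)).astype(float)    # crude neighborhood graph
print(cdk(D, N).shape)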

SLIDE 47

Context Dependent Kernels (P.S.D)

By induction:

  $k_0(X_i, X_j) = \exp\Big(\frac{-d(X_i, X_j)}{\beta} - 1\Big)$  is p.s.d.

Assuming $k_{t-1}$ p.s.d. (with map $\Phi_{t-1}$),

  $k_t(X_i, X_j) \;\propto\; \exp\Big(\frac{-d(X_i, X_j)}{\beta} - 1\Big) \cdot
   \exp\Big(\frac{2\alpha}{\beta} \Big\langle \sum_k 1_{\{X_k \in \mathcal{N}(X_i)\}} \Phi_{t-1}(X_k), \;
   \sum_\ell 1_{\{X_\ell \in \mathcal{N}(X_j)\}} \Phi_{t-1}(X_\ell) \Big\rangle \Big)$

From the closure of p.s.d. kernels w.r.t. exponentiation and products, it follows that $k_t$ is also p.s.d.

SLIDE 48

Relation to Deep-Learning

  $k_t(X, X') = \big\langle \phi_t(\phi_{t-1}(\ldots \phi_1(\phi_0(X)))), \;
   \phi_t(\phi_{t-1}(\ldots \phi_1(\phi_0(X')))) \big\rangle$  (t nested maps on each side)

[Figure: the maps φ_0, φ_1, ..., φ_{t-1}, φ_t stacked as successive layers, as in a deep network]

SLIDE 49

Context Dependent Kernels : ImageCLEF Benchmark

• 250k training images, 1k images for development and 2k images for test (labels not available to participants).
• Number of concepts: 116.
• Visual features provided by the organizers, including GIST, Color Histograms, SIFT, C-SIFT, RGB-SIFT and OPPONENT-SIFT. For all the SIFT-based descriptors, a bag-of-words representation is provided.

SLIDE 50

Context Dependent Kernels : ImageCLEF Benchmark

In order to build the graph of images, two images are connected if the number of their shared keywords (in the meta-data) is sufficiently large. We use the resulting context-dependent kernel and "one-versus-all" SVMs for each concept/class; the scores of these SVMs decide about the presence of the concepts in images. The histogram intersection kernel is used for initialization.

                        CF       CD (2α/β)
                                 10⁻³     10⁻²     10⁻¹     1        10
F-scores (samples)      41.40    40.17    41.31    46.70    51.30    48.14
F-scores (concepts)     30.90    29.77    31.03    37.55    45.00    42.21

SLIDE 51

Context Dependent Kernels : Comparison with MKL

For comparison, we build 8 kernel matrices corresponding to the 7 visual features and 1 tag-based (TF-IDF) feature; we learn an "optimal" linear combination of these kernels via MKL.

                                             F-scores     F-scores
                                             (samples)    (concepts)
Visual without Context
  Standard SVM + Linear kernel               37.80        28.45
  Standard SVM + Polynomial kernel           33.06        26.04
  Standard SVM + HI kernel                   41.40        30.90
Visual + Context
  MKL SVM based on Linear                    46.58        43.01
  MKL SVM based on Polynomial                43.81        42.18
  MKL SVM based on HI                        49.49        45.10
Visual + Context
  SVM + CD kernel                            51.30        45.00

The improvement of CDK is not only due to the exploitation of the tags but also to the way context is used in kernel design.

SLIDE 52

Context Dependent Kernels : Annotation Examples

[Figure: annotation examples (a)-(o); each image is shown with two keyword sets, CF and CA, e.g. (a) CF: beach cloud coast outdoor reflection sea silhouette sky sun sunset water; CA: beach cloud coast reflection sand sea silhouette sky sun sunset water]

SLIDE 53

Section 4 Unsupervised Learning (Kernel PCA and CCA)

SLIDE 54

Principal Component Analysis

SLIDE 55

Principal Component Analysis

Given a training set {X_1, ..., X_n} (no labels are available), PCA finds the principal axes as the eigenvectors of

  $M = \frac{1}{n} \sum_{j=1}^{n} X_j X_j^t$

SLIDE 56

Kernel Principal Component Analysis

A function Φ maps the data from the original (input) space into a high-dimensional (feature) space. A space of dimension n guarantees the existence of a linear manifold for at most n + 1 training data.

SLIDE 57

Kernel Principal Component Analysis

  $M = \frac{1}{n} \sum_{j=1}^{n} \Phi(X_j)\, \Phi(X_j)^t$

  $\forall k = 1, \ldots, n, \;\; \exists\, \alpha_{k1}, \ldots, \alpha_{kn} \in \mathbb{R} \;\; \text{s.t.} \;\; V_k = \sum_{j=1}^{n} \alpha_{kj}\, \Phi(X_j)$

where $\alpha_k = (\alpha_{k1}, \ldots, \alpha_{kn})$ is found by solving the eigenproblem

  $K \alpha_k = \lambda_k \alpha_k$

K is the Gram matrix, $K_{ji} = \langle \Phi(X_j), \Phi(X_i) \rangle = k(X_j, X_i)$.

Using the kernel trick, the projection of a training sample onto an axis of the mapping space is

  $\langle \Phi(X), V_k \rangle = \sum_{j=1}^{n} \alpha_{kj} \langle \Phi(X), \Phi(X_j) \rangle = \sum_{j=1}^{n} \alpha_{kj}\, k(X, X_j)$
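A minimal numerical sketch of kernel PCA as described above (illustrative, not from the slides); for simplicity the Gram matrix is used as-is, whereas in practice it is first centered in feature space:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)                          # Gaussian Gram matrix

eigvals, eigvecs = np.linalg.eigh(K)           # K alpha_k = lambda_k alpha_k
order = np.argsort(eigvals)[::-1]
alphas = eigvecs[:, order[:2]]                 # top-2 eigenvectors

# projections of the training samples onto the first two kernel principal axes:
# <Phi(X_i), V_k> = sum_j alpha_kj k(X_i, X_j)
projections = K @ alphas
print(projections.shape)                       # (100, 2)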

SLIDE 58

Similarity Invariance

KPCA is translation and rotation invariant for some kernels (e.g., the triangular kernel); the invariance is encoded in the kernel itself:

  $k(R X_i + t, R X_j + t) = -\|R X_i + t - R X_j - t\|^p = -\|X_i - X_j\|^p = k(X_i, X_j)$

Scale? The kernel is not invariant (just equivariant), but the KPCA projections can be made invariant. With

  $S = \{X_1, \ldots, X_n\}, \qquad \gamma S = \{\gamma X_1, \ldots, \gamma X_n\}$

we need to show: $\forall X, \; \forall k = 1, \ldots, n, \quad \langle V^{(\gamma)}_k, \gamma X \rangle = \langle V^{(1)}_k, X \rangle$

SLIDE 59

Scale Invariance

Consider a new expansion of each eigenvector:

  $V^{(\gamma)}_k = \frac{1}{\lambda^{\gamma}_k\, n} \sum_j \alpha^{\gamma}_{kj}\, \Phi(\gamma X_j)$

where $\alpha^{\gamma}_k$ comes from the eigenproblem $K^{\gamma} \alpha^{\gamma}_k = \lambda^{\gamma}_k \alpha^{\gamma}_k$.

The proof follows from the fact that the Gram matrix $K^{\gamma}$ of the scaled set can be written as

  $K^{\gamma} = \gamma^p\, K^{1}$

which implies, $\forall k = 1, \ldots, n$: $\lambda^{\gamma}_k = \gamma^p \lambda^{1}_k$ and $\alpha^{\gamma}_k = \alpha^{1}_k$.

Hence, for all X, $\langle V^{(\gamma)}_k, \gamma X \rangle = \langle V^{(1)}_k, X \rangle$.

SLIDE 60

Examples

SLIDE 61

Examples : Dimensions

SLIDE 62

Examples : Dimensions

SLIDE 63

Kernel PCA and Shape Description & Recognition

SLIDE 64

Interpretation

SLIDE 65

Databases

• SQUID: unavailable ground truth (5500 contours).
• Swedish: 15 categories with 75 images each.
• Smithsonian: 135 categories.

Performance is measured using Recall and Precision.

SLIDE 66

Performance

Precision at each recall level (recall over 16):

Recall        1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16
Linear PCA    1.0   .59   .46   .40   .35   .32   .31   .29   .28   .27   .27   .26   .25   .25   .24   .24
Kernel PCA    1.0   .967  .949  .935  .925  .914  .903  .892  .882  .871  .860  .851  .841  .832  .825  .816
Hough         1.0   .899  .848  .814  .785  .765  .749  .737  .724  .711  .702  .693  .684  .677  .671  .664
EOH           1.0   .752  .651  .592  .554  .528  .508  .490  .474  .462  .453  .444  .436  .431  .424  .418
CSS           1.0   .959  .936  .924  .916  .914  .905  .899  .892  .888  .884  .879  .876  .872  .867  .864

SLIDE 67

Kernel Parameters

[Figure: (left) precision vs. recall for the triangular-kernel parameter p ∈ {0.2, 0.5, 0.8, 1.0, 1.2, 1.5, 1.8, 1.9, 1.98} and for the linear kernel; (right) first eigenvalue as a function of p for three shapes (two from class 1, one from class 2)]

SLIDE 68

Matching

Given two curves S_1, S_2 and X_i ∈ S_1, we define X'_j ∈ S_2 as a match of X_i if

  $X'_j = \arg\min_{X' \in S_2} \; \|\pi(X_i) - \pi(X')\|^2$

with

  $\pi(X_i) = \big( \langle \Phi(X_i), V_1 \rangle, \ldots, \langle \Phi(X_i), V_d \rangle \big)^t, \qquad
   \pi(X'_j) = \big( \langle \Phi(X'_j), V'_1 \rangle, \ldots, \langle \Phi(X'_j), V'_d \rangle \big)^t$

SLIDE 69

Matching

SLIDE 70

Canonical Correlation Analysis

SLIDE 71

Canonical Correlation Analysis

Given aligned training data {(X_1, X'_1), ..., (X_n, X'_n)} taken from two different views (for instance camera 1 and camera 2, two languages, document and text, etc.), canonical correlation analysis maps the two views into a common (latent) space where the data become highly correlated.

[Figure: points from four classes in the two original views and in the common latent space]

SLIDE 72

Canonical Correlation Analysis

  $\max_{P_1, P_2} \; \sum_i \langle P_1^t X_i, \; P_2^t X'_i \rangle
   \quad \text{s.t.} \quad \sum_i \langle P_1^t X_i, P_1^t X_i \rangle = 1, \quad \sum_i \langle P_2^t X'_i, P_2^t X'_i \rangle = 1$

In matrix form (with $C_{12} = \mathbf{X}\mathbf{X}'^t$, $C_{11} = \mathbf{X}\mathbf{X}^t$, $C_{22} = \mathbf{X}'\mathbf{X}'^t$):

  $\max_{P_1, P_2} \; P_1^t \mathbf{X}\mathbf{X}'^t P_2 = \max_{P_1, P_2} \; P_1^t C_{12} P_2
   \quad \text{s.t.} \quad P_1^t C_{11} P_1 = 1, \quad P_2^t C_{22} P_2 = 1$

SLIDE 73

Canonical Correlation Analysis

The corresponding Lagrangian is

  $L(\lambda, P_1, P_2) = P_1^t C_{12} P_2 - \frac{\lambda_1}{2}(P_1^t C_{11} P_1 - 1) - \frac{\lambda_2}{2}(P_2^t C_{22} P_2 - 1)$

Taking derivatives w.r.t. P_1 and P_2, one obtains

  $\frac{\partial L}{\partial P_1} = C_{12} P_2 - \lambda_1 C_{11} P_1 = 0, \qquad
   \frac{\partial L}{\partial P_2} = C_{21} P_1 - \lambda_2 C_{22} P_2 = 0$

which leads (with $\lambda_1 = \lambda_2 = \lambda$) to the generalized eigenproblem

  $C_{12} C_{22}^{-1} C_{21} P_1 = \lambda^2 C_{11} P_1, \qquad P_2 = \frac{C_{22}^{-1} C_{21} P_1}{\lambda}$
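A hedged numerical sketch of this solution (not from the slides): solve the generalized eigenproblem C_12 C_22^{-1} C_21 P_1 = λ² C_11 P_1 and recover P_2; a small ridge term keeps the covariance matrices invertible, and the two views are synthetic:

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))                          # shared latent signal
X1 = Z @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))
X2 = Z @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(200, 4))
X1, X2 = X1 - X1.mean(0), X2 - X2.mean(0)

eps = 1e-6
C11 = X1.T @ X1 + eps * np.eye(5)
C22 = X2.T @ X2 + eps * np.eye(4)
C12 = X1.T @ X2
C21 = C12.T

lam2, P1 = eigh(C12 @ np.linalg.solve(C22, C21), C11)  # generalized eigenproblem
lam = np.sqrt(lam2[-1])                                # top canonical correlation
p1 = P1[:, -1]
p2 = np.linalg.solve(C22, C21 @ p1) / lam
print("top canonical correlation:", lam)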

SLIDE 74

Kernel Canonical Correlation Analysis

When p ≪ n, linear transformations are not appropriate; map the two views into feature spaces

  $\mathbf{X} = [X_1, \ldots, X_n] \rightarrow \Phi(\mathbf{X}) = [\Phi(X_1), \ldots, \Phi(X_n)], \qquad
   \mathbf{X}' = [X'_1, \ldots, X'_n] \rightarrow \Phi(\mathbf{X}') = [\Phi(X'_1), \ldots, \Phi(X'_n)]$

and expand the projections over the mapped data: $P_1 = \Phi(\mathbf{X})\alpha_1$, $P_2 = \Phi(\mathbf{X}')\alpha_2$.

  $\max_{\alpha_1, \alpha_2} \; \alpha_1^t\, \Phi(\mathbf{X})^t \Phi(\mathbf{X})\, \Phi(\mathbf{X}')^t \Phi(\mathbf{X}')\, \alpha_2
   = \max_{\alpha_1, \alpha_2} \; \alpha_1^t K_1 K_2 \alpha_2$

  $\text{s.t.} \quad \alpha_1^t\, \Phi(\mathbf{X})^t \Phi(\mathbf{X})\, \Phi(\mathbf{X})^t \Phi(\mathbf{X})\, \alpha_1 = \alpha_1^t K_1^2 \alpha_1 = 1, \qquad
   \alpha_2^t\, \Phi(\mathbf{X}')^t \Phi(\mathbf{X}')\, \Phi(\mathbf{X}')^t \Phi(\mathbf{X}')\, \alpha_2 = \alpha_2^t K_2^2 \alpha_2 = 1$

SLIDE 75

Kernel Canonical Correlation Analysis

The corresponding Lagrangian is

  $L(\lambda, \alpha_1, \alpha_2) = \alpha_1^t K_1 K_2 \alpha_2 - \frac{\lambda_1}{2}(\alpha_1^t K_1^2 \alpha_1 - 1) - \frac{\lambda_2}{2}(\alpha_2^t K_2^2 \alpha_2 - 1)$

Taking derivatives w.r.t. α_1 and α_2, one obtains

  $\frac{\partial L}{\partial \alpha_1} = K_1 K_2 \alpha_2 - \lambda_1 K_1^2 \alpha_1 = 0, \qquad
   \frac{\partial L}{\partial \alpha_2} = K_2 K_1 \alpha_1 - \lambda_2 K_2^2 \alpha_2 = 0$

which leads (with $\lambda_1 = \lambda_2 = \lambda$) to

  $\alpha_2 = \frac{K_2^{-1} K_1 \alpha_1}{\lambda}, \qquad I \alpha_1 = \lambda^2 \alpha_1$

SLIDE 76

Examples : multi-view recognition

SLIDE 77

Examples : multi-view recognition

SLIDE 78

Examples : remote sensing change detection

[Figure: Equal Error Rate (%) as a function of the iteration number t, comparing Image Difference, Large Scale (Monolithic) FDA, Large Scale (Monolithic) SVM, Large Scale (Monolithic) CA-DCCA, the proposed RF with no CCA, and the proposed RF with CA-DCCA]

SLIDE 79

General Conclusion & Good Practices

• Use shallow models (kernel methods) when you have few data in high-dimensional spaces.
• Representation learning then turns into similarity (kernel) learning.
• Convexity is an asset but not a necessity (the game is more open with non-convex problems, but finding a "good" solution is another story).
• Use deep models when data (actually ground truth) is not an issue.
• Is the future of ML semi-parametric?

SLIDE 80

Some References

• V. Vapnik. Statistical Learning Theory. 1998.
• C. Bishop. Pattern Recognition and Machine Learning. 2006.
• J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
• B. Scholkopf, K. Tsuda and J.-P. Vert. Kernel Methods in Computational Biology. 2004.
• S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
• J.C. Spall. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control (Vol. 65). John Wiley & Sons, 2005.
• I. Goodfellow, Y. Bengio and A. Courville. Deep Learning (Vol. 1). Cambridge: MIT Press, 2016.
• M. Jiu and H. Sahbi. Nonlinear Deep Kernel Learning for Image Annotation. IEEE Trans. Image Processing 26(4): 1820-1832, 2017.