SLIDE 1

Advanced Machine Learning

Learning Kernels

MEHRYAR MOHRI (MOHRI@)
COURANT INSTITUTE & GOOGLE RESEARCH

SLIDE 2

Outline

Kernel methods.

Learning kernels:

  • scenario.
  • learning bounds.
  • algorithms.

SLIDE 3

Machine Learning Components

[Diagram: the user supplies features; the sample and features feed the learning algorithm, which outputs a hypothesis h.]

SLIDE 4

Machine Learning Components

[Diagram: same components; choosing the features is the critical task, while the algorithm is the main focus of the ML literature.]
SLIDE 5

Kernel Methods

Features are implicitly defined via the choice of a PDS (positive definite symmetric) kernel $K$, interpreted as a similarity measure. Flexibility: the PDS kernel can be chosen arbitrarily. Kernels help extend a variety of algorithms to non-linear predictors, e.g., SVMs, KRR, SVR, KPCA. The PDS condition is directly related to the convexity of the optimization problem.

Feature map $\Phi\colon \mathcal{X} \to \mathbb{H}$ implicitly associated with $K$:
$$\forall x, y \in \mathcal{X}, \quad \Phi(x) \cdot \Phi(y) = K(x, y).$$

SLIDE 6

Example - Polynomial Kernels

Definition: $\forall x, y \in \mathbb{R}^N,\; K(x, y) = (x \cdot y + c)^d$, with $c > 0$.

Example: for $N = 2$ and $d = 2$,
$$K(x, y) = (x_1 y_1 + x_2 y_2 + c)^2 = \begin{pmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \\ \sqrt{2c}\, x_1 \\ \sqrt{2c}\, x_2 \\ c \end{pmatrix} \cdot \begin{pmatrix} y_1^2 \\ y_2^2 \\ \sqrt{2}\, y_1 y_2 \\ \sqrt{2c}\, y_1 \\ \sqrt{2c}\, y_2 \\ c \end{pmatrix}.$$
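A quick numerical check of this identity (a minimal sketch; the helper names `poly_kernel` and `phi` are ours):

```python
import numpy as np

def poly_kernel(x, y, c=1.0, d=2):
    # K(x, y) = (x . y + c)^d
    return (np.dot(x, y) + c) ** d

def phi(x, c=1.0):
    # Explicit feature map for N = 2, d = 2 from the slide.
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

x, y = np.array([1.0, -2.0]), np.array([0.5, 3.0])
# Inner product in feature space equals the kernel value.
assert np.isclose(poly_kernel(x, y), phi(x) @ phi(y))
```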

SLIDE 7

XOR Problem

Use the second-degree polynomial kernel with $c = 1$. The four points $(1, 1), (-1, 1), (-1, -1), (1, -1)$ of the XOR problem are linearly non-separable in the input space $(x_1, x_2)$, but their images
$$(1, 1, +\sqrt{2}, +\sqrt{2}, +\sqrt{2}, 1),\; (1, 1, -\sqrt{2}, -\sqrt{2}, +\sqrt{2}, 1),\; (1, 1, +\sqrt{2}, -\sqrt{2}, -\sqrt{2}, 1),\; (1, 1, -\sqrt{2}, +\sqrt{2}, -\sqrt{2}, 1)$$
are linearly separable in feature space by the hyperplane $x_1 x_2 = 0$ (the $\sqrt{2}\, x_1 x_2$ coordinate).
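A small sketch verifying the separation in feature space (the XOR labels and helper name are ours):

```python
import numpy as np

def phi(x, c=1.0):
    # Second-degree polynomial feature map (same as the previous slide).
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

points = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
labels = [1, -1, 1, -1]          # XOR labeling: +1 iff x1 * x2 > 0
for x, y in zip(points, labels):
    # The third coordinate, sqrt(2) * x1 * x2, carries the sign of the label,
    # so a hyperplane through the origin along that axis separates the classes.
    assert np.sign(phi(np.array(x, dtype=float))[2]) == y
```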

SLIDE 8

Other Standard PDS Kernels

Gaussian kernels:
$$K(x, y) = \exp\Big(-\frac{\|x - y\|^2}{2\sigma^2}\Big), \quad \sigma \neq 0.$$

  • Normalized kernel of $K'(x, y) = \exp\big(\frac{x \cdot y}{\sigma^2}\big)$.

Sigmoid kernels:
$$K(x, y) = \tanh(a (x \cdot y) + b), \quad a, b \geq 0.$$
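A minimal sketch of the normalization relation, assuming the standard definition of a normalized kernel, $K(x, y) = K'(x, y)/\sqrt{K'(x, x)\, K'(y, y)}$ (helper names are ours):

```python
import numpy as np

def k_prime(x, y, sigma=1.0):
    # Unnormalized exponential kernel K'(x, y) = exp(x . y / sigma^2).
    return np.exp(np.dot(x, y) / sigma**2)

def normalized(x, y, sigma=1.0):
    # Normalizing K' yields exactly the Gaussian kernel.
    return k_prime(x, y, sigma) / np.sqrt(k_prime(x, x, sigma) * k_prime(y, y, sigma))

def gaussian(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y)**2 / (2 * sigma**2))

x, y = np.array([0.3, -1.2]), np.array([2.0, 0.7])
assert np.isclose(normalized(x, y), gaussian(x, y))
```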
SLIDE 9

SVM

Primal:
$$\min_{w, b} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \Big[1 - y_i \big(w \cdot \Phi_K(x_i) + b\big)\Big]_+.$$

Dual:
$$\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to: } 0 \leq \alpha_i \leq C \;\wedge\; \sum_{i=1}^{m} \alpha_i y_i = 0,\; i \in [1, m].$$

(Cortes and Vapnik, 1995; Boser, Guyon, and Vapnik, 1992)
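This dual is what standard solvers optimize; a minimal sketch with scikit-learn's `SVC` and a precomputed Gram matrix (the data here is synthetic, our own choice):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] * X[:, 1])           # XOR-like labels

def gram(A, B, c=1.0, d=2):
    # Polynomial kernel matrix K[i, j] = (a_i . b_j + c)^d.
    return (A @ B.T + c) ** d

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(gram(X, X), y)
print(clf.score(gram(X, X), y))          # training accuracy
```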

SLIDE 10

Kernel Ridge Regression

Primal:
$$\min_{w} \; \lambda \|w\|^2 + \sum_{i=1}^{m} \big(w \cdot \Phi_K(x_i) - y_i\big)^2.$$

Dual:
$$\max_{\alpha \in \mathbb{R}^m} \; -\alpha^\top (K + \lambda I)\alpha + 2\alpha^\top y.$$

(Hoerl and Kennard, 1970; Saunders et al., 1998)
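The dual has the closed-form solution $\alpha = (K + \lambda I)^{-1} y$, giving predictions $h(x) = \sum_i \alpha_i K(x_i, x)$; a minimal sketch (helper names and data are ours):

```python
import numpy as np

def krr_fit(K, y, lam=0.1):
    # Closed-form dual solution: alpha = (K + lam I)^{-1} y.
    m = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(m), y)

def krr_predict(K_test_train, alpha):
    # h(x) = sum_i alpha_i K(x_i, x), vectorized over test points.
    return K_test_train @ alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=30)
K = (X @ X.T + 1.0) ** 2                 # polynomial kernel Gram matrix
alpha = krr_fit(K, y)
print(krr_predict(K, alpha)[:3], y[:3])  # fitted vs. target values
```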

SLIDE 11

Questions

How should the user choose the kernel?

  • a problem similar to that of selecting features for other learning algorithms.
  • a poor choice makes learning very difficult.
  • with a good choice, even weak learners can succeed.

The requirement placed on the user is thus critical.

  • can this requirement be lessened?
  • is a more automatic selection of features possible?

SLIDE 12

Outline

Kernel methods.

Learning kernels:

  • scenario.
  • learning bounds.
  • algorithms.

SLIDE 13

Standard Learning with Kernels

[Diagram: the user supplies a kernel K; the sample and kernel feed the algorithm, which outputs a hypothesis h.]

SLIDE 14

Learning Kernel Framework

[Diagram: the user supplies a kernel family $\mathcal{K}$; the algorithm selects both the kernel and the hypothesis, returning the pair $(K, h)$.]

SLIDE 15

Kernel Families

Most frequently used kernel families, for $q \geq 1$:
$$\mathcal{K}_q = \Big\{ K_\mu : K_\mu = \sum_{k=1}^{p} \mu_k K_k,\; \mu = (\mu_1, \ldots, \mu_p) \in \Delta_q \Big\}, \quad \text{with } \Delta_q = \big\{\mu : \mu \geq 0,\; \|\mu\|_q = 1\big\}.$$

Hypothesis sets:
$$H_q = \big\{ h \in \mathbb{H}_K : K \in \mathcal{K}_q,\; \|h\|_{\mathbb{H}_K} \leq 1 \big\}.$$
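A minimal sketch of forming $K_\mu = \sum_k \mu_k K_k$ for $\mu \in \Delta_q$ (the helper names are ours):

```python
import numpy as np

def combine(kernel_mats, mu):
    # K_mu = sum_k mu_k K_k for a non-negative weight vector mu.
    return np.tensordot(mu, kernel_mats, axes=1)

def project_to_delta_q(mu, q=1.0):
    # Clip to the non-negative orthant, then rescale so that ||mu||_q = 1.
    mu = np.maximum(mu, 0.0)
    return mu / np.linalg.norm(mu, ord=q)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
Ks = np.stack([(X @ X.T + 1.0) ** d for d in (1, 2, 3)])  # base kernel matrices
mu = project_to_delta_q(np.array([0.2, 0.5, 0.3]), q=2)
K_mu = combine(Ks, mu)
assert np.allclose(K_mu, K_mu.T)  # symmetric (and PSD, as a non-negative combination)
```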
SLIDE 16

Relation between Norms

Lemma: for $p, q \in (0, +\infty]$ with $p \leq q$, the following holds for $x \in \mathbb{R}^N$:
$$\|x\|_q \leq \|x\|_p \leq N^{\frac{1}{p} - \frac{1}{q}} \|x\|_q.$$

Proof: for the left inequality, observe that for $x \neq 0$,
$$\Big(\frac{\|x\|_p}{\|x\|_q}\Big)^p = \sum_{i=1}^{N} \Big(\underbrace{\frac{|x_i|}{\|x\|_q}}_{\leq 1}\Big)^p \geq \sum_{i=1}^{N} \Big(\frac{|x_i|}{\|x\|_q}\Big)^q = 1.$$

The right inequality follows immediately from Hölder's inequality:
$$\|x\|_p = \Big[\sum_{i=1}^{N} |x_i|^p\Big]^{\frac{1}{p}} \leq \Big[\Big(\sum_{i=1}^{N} (|x_i|^p)^{\frac{q}{p}}\Big)^{\frac{p}{q}} \Big(\sum_{i=1}^{N} 1^{\frac{q}{q-p}}\Big)^{1 - \frac{p}{q}}\Big]^{\frac{1}{p}} = \|x\|_q\, N^{\frac{1}{p} - \frac{1}{q}}.$$

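A quick numerical check of the lemma (sketch only; the constants are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, q = 10, 1.5, 3.0                      # any 0 < p <= q
for _ in range(1000):
    x = rng.normal(size=N)
    lo = np.linalg.norm(x, ord=q)
    mid = np.linalg.norm(x, ord=p)
    hi = N ** (1 / p - 1 / q) * np.linalg.norm(x, ord=q)
    # ||x||_q <= ||x||_p <= N^(1/p - 1/q) ||x||_q, up to float tolerance.
    assert lo <= mid + 1e-12 and mid <= hi + 1e-12
```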
SLIDE 17

Single Kernel Guarantee

Theorem: fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H_1$:
$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \frac{\sqrt{\mathrm{Tr}[K]}}{m} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$

(Koltchinskii and Panchenko, 2002)
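A small sketch computing the two deviation terms of this bound for a given Gram matrix (the helper name is ours):

```python
import numpy as np

def margin_bound_terms(K, rho=1.0, delta=0.05):
    # Complexity and confidence terms of the single-kernel margin bound.
    m = K.shape[0]
    complexity = (2 / rho) * np.sqrt(np.trace(K)) / m
    confidence = np.sqrt(np.log(1 / delta) / (2 * m))
    return complexity, confidence

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
K = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2)  # Gaussian Gram matrix
print(margin_bound_terms(K, rho=0.5))
# For a Gaussian kernel, K(x, x) = 1, so Tr[K] = m and the
# complexity term reduces to (2 / rho) / sqrt(m).
```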

SLIDE 18

Multiple Kernel Guarantee

Theorem: fix $\rho > 0$. Let $q, r \geq 1$ with $\frac{1}{q} + \frac{1}{r} = 1$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H_q$ and any integer $1 \leq s \leq r$:
$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \frac{\sqrt{s\, \|\mathbf{u}\|_s}}{m} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}},$$
with $\mathbf{u} = (\mathrm{Tr}[K_1], \ldots, \mathrm{Tr}[K_p])$.

(Cortes, MM, and Rostamizadeh, 2010)

SLIDE 19

Proof

Let $q, r \geq 1$ with $\frac{1}{q} + \frac{1}{r} = 1$. Then,
$$\begin{aligned}
\widehat{\mathfrak{R}}_S(H_q)
&= \frac{1}{m}\, \mathbb{E}_{\sigma}\Big[\sup_{h \in H_q} \sum_{i=1}^{m} \sigma_i h(x_i)\Big]
= \frac{1}{m}\, \mathbb{E}_{\sigma}\Big[\sup_{\mu \in \Delta_q,\, \alpha^\top K_\mu \alpha \leq 1} \sum_{i,j=1}^{m} \sigma_i \alpha_j K_\mu(x_i, x_j)\Big] \\
&= \frac{1}{m}\, \mathbb{E}_{\sigma}\Big[\sup_{\mu \in \Delta_q,\, \alpha^\top K_\mu \alpha \leq 1} \sigma^\top K_\mu \alpha\Big]
= \frac{1}{m}\, \mathbb{E}_{\sigma}\Big[\sup_{\mu \in \Delta_q,\, \|\alpha\|_{K_\mu^{1/2}} \leq 1} \langle \sigma, \alpha \rangle_{K_\mu^{1/2}}\Big] \\
&= \frac{1}{m}\, \mathbb{E}_{\sigma}\Big[\sup_{\mu \in \Delta_q} \sqrt{\sigma^\top K_\mu \sigma}\Big] && \text{(Cauchy-Schwarz)} \\
&= \frac{1}{m}\, \mathbb{E}_{\sigma}\Big[\sup_{\mu \in \Delta_q} \sqrt{\mu \cdot \mathbf{u}_\sigma}\Big] && \big[\mathbf{u}_\sigma = (\sigma^\top K_1 \sigma, \ldots, \sigma^\top K_p \sigma)^\top\big] \\
&= \frac{1}{m}\, \mathbb{E}_{\sigma}\Big[\sqrt{\|\mathbf{u}_\sigma\|_r}\Big]. && \text{(definition of the dual norm)}
\end{aligned}$$

SLIDE 20

Lemma

Lemma: let $K$ be a kernel matrix for a finite sample. Then, for any integer $r$,
$$\mathbb{E}_{\sigma}\big[(\sigma^\top K \sigma)^r\big] \leq \big(r\, \mathrm{Tr}[K]\big)^r.$$

Proof: combinatorial argument.

(Cortes, MM, and Rostamizadeh, 2010)

SLIDE 21

Proof

For any $1 \leq s \leq r$,
$$\begin{aligned}
\widehat{\mathfrak{R}}_S(H_q)
&= \frac{1}{m}\, \mathbb{E}_{\sigma}\big[\sqrt{\|\mathbf{u}_\sigma\|_r}\big]
\leq \frac{1}{m}\, \mathbb{E}_{\sigma}\big[\sqrt{\|\mathbf{u}_\sigma\|_s}\big]
= \frac{1}{m}\, \mathbb{E}_{\sigma}\Big[\Big[\sum_{k=1}^{p} (\sigma^\top K_k \sigma)^s\Big]^{\frac{1}{2s}}\Big] \\
&\leq \frac{1}{m} \Big[\mathbb{E}_{\sigma}\Big[\sum_{k=1}^{p} (\sigma^\top K_k \sigma)^s\Big]\Big]^{\frac{1}{2s}} && \text{(Jensen's inequality)} \\
&= \frac{1}{m} \Big[\sum_{k=1}^{p} \mathbb{E}_{\sigma}\big[(\sigma^\top K_k \sigma)^s\big]\Big]^{\frac{1}{2s}} \\
&\leq \frac{1}{m} \Big[\sum_{k=1}^{p} \big(s\, \mathrm{Tr}[K_k]\big)^s\Big]^{\frac{1}{2s}}
= \frac{\sqrt{s\, \|\mathbf{u}\|_s}}{m}. && \text{(lemma)}
\end{aligned}$$

SLIDE 22

L1 Learning Bound

Corollary: fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H_1$:
$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \frac{\sqrt{e \lceil \log p \rceil \max_{k=1}^{p} \mathrm{Tr}[K_k]}}{m} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$

  • weak dependency on $p$.
  • bound valid for $p \gg m$; note $\mathrm{Tr}[K_k] \leq m \max_x K_k(x, x)$.

(Cortes, MM, and Rostamizadeh, 2010)
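A quick sketch of how slowly the complexity term grows with the number of base kernels $p$, assuming normalized kernels with $K_k(x, x) = 1$ so that $\mathrm{Tr}[K_k] = m$ (the helper name is ours):

```python
import numpy as np

def l1_complexity_term(p, m, rho=1.0):
    # (2 / rho) * sqrt(e * ceil(log p) * max_k Tr[K_k]) / m, with Tr[K_k] = m.
    return (2 / rho) * np.sqrt(np.e * np.ceil(np.log(p)) * m) / m

m = 1000
for p in (10, 100, 10_000, 1_000_000):
    print(p, l1_complexity_term(p, m))   # grows only like sqrt(log p)
```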

SLIDE 23

Proof

For $q = 1$, the bound
$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \frac{\sqrt{s\, \|\mathbf{u}\|_s}}{m} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$$
holds for any integer $s \geq 1$, with
$$s\, \|\mathbf{u}\|_s = s \Big[\sum_{k=1}^{p} \mathrm{Tr}[K_k]^s\Big]^{\frac{1}{s}} \leq s\, p^{\frac{1}{s}} \max_{k=1}^{p} \mathrm{Tr}[K_k].$$
The function $s \mapsto s\, p^{1/s}$ reaches its minimum at $s = \log p$.

SLIDE 24

Lower Bound

Tight bound:

  • the $\sqrt{\log p}$ dependency cannot be improved.
  • argument based on VC dimension, or on the following example.

Observations: case $\mathcal{X} = \{-1, +1\}^p$.

  • canonical projection kernels $K_k(x, x') = x_k x'_k$.
  • $H_1$ contains $J_p = \{x \mapsto s\, x_k : k \in [1, p],\, s \in \{-1, +1\}\}$.
  • $\mathrm{VCdim}(J_p) = \Omega(\log p)$.
  • for $\rho = 1$ and $h \in J_p$, $\widehat{R}_\rho(h) = 0$.
  • VC lower bound: $R(h) = \Omega\big(\sqrt{\mathrm{VCdim}(J_p)/m}\big)$.
SLIDE 25

Pseudo-Dimension Bound

Assume that $K_k(x, x) \leq R^2$ for all $k \in [1, p]$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H_1$:
$$R(h) \leq \widehat{R}_\rho(h) + \sqrt{\frac{8\Big[2 + p \log \frac{128\, e\, m^3 R^2}{\rho^2 p} + \frac{256\, R^2}{\rho^2} \log \frac{\rho e m}{8R} \log \frac{128\, m R^2}{\rho^2}\Big] + \log(1/\delta)}{m}}.$$

  • bound additive in $p$ (modulo log terms).
  • not informative for $p > m$.
  • based on the pseudo-dimension of the kernel family.
  • similar guarantees for other families.

(Srebro and Ben-David, 2006)

SLIDE 26

Comparison

[Plot: comparison of the bounds for $\rho/R = 0.2$.]

SLIDE 27

Lq Learning Bound

Corollary: fix $\rho > 0$. Let $q, r \geq 1$ with $\frac{1}{q} + \frac{1}{r} = 1$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H_q$:
$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \frac{\sqrt{r\, p^{\frac{1}{r}} \max_{k=1}^{p} \mathrm{Tr}[K_k]}}{m} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}},$$
with $\mathrm{Tr}[K_k] \leq m \max_x K_k(x, x)$.

  • mild dependency on $p$.

(Cortes, MM, and Rostamizadeh, 2010)

SLIDE 28

Lower Bound

Tight bound:

  • the $p^{\frac{1}{2r}}$ dependency cannot be improved.
  • in particular, tight for $L_2$ regularization: $p^{\frac{1}{4}}$.

Observations: case of $p$ equal kernels.

  • $\sum_{k=1}^{p} \mu_k K_k = \big(\sum_{k=1}^{p} \mu_k\big) K_1$.
  • thus, $\|h\|^2_{\mathbb{H}_{K_1}} = \big(\sum_{k=1}^{p} \mu_k\big) \|h\|^2_{\mathbb{H}_{K_\mu}}$ for $\sum_{k=1}^{p} \mu_k \neq 0$.
  • $\sum_{k=1}^{p} \mu_k \leq p^{\frac{1}{r}} \|\mu\|_q = p^{\frac{1}{r}}$ (Hölder's inequality).
  • $H_q$ coincides with $\big\{h \in \mathbb{H}_{K_1} : \|h\|_{\mathbb{H}_{K_1}} \leq p^{\frac{1}{2r}}\big\}$.

SLIDE 29

Outline

Kernel methods.

Learning kernels:

  • scenario.
  • learning bounds.
  • algorithms.

SLIDE 30

General LK Formulation - SVMs

Notation:

  • $\mathcal{K}$: set of PDS kernel functions.
  • $\mathbf{K}$: set of kernel matrices associated to $\mathcal{K}$, assumed convex.
  • $Y \in \mathbb{R}^{m \times m}$: diagonal matrix with $Y_{ii} = y_i$.

Optimization problem:
$$\min_{K \in \mathbf{K}} \max_{\alpha} \; 2\, \alpha^\top \mathbf{1} - \alpha^\top Y K Y \alpha \quad \text{subject to: } 0 \leq \alpha \leq C \wedge \alpha^\top y = 0.$$

  • convex problem: the function is linear in $K$, and a pointwise maximum of convex functions is convex.

SLIDE 31

Parameterized LK Formulation

Notation:

  • $(K_\mu)_{\mu \in \Delta}$: parameterized set of PDS kernel functions.
  • $\Delta$: convex set; $\mu \mapsto K_\mu$: concave function.
  • $Y \in \mathbb{R}^{m \times m}$: diagonal matrix with $Y_{ii} = y_i$.

Optimization problem:
$$\min_{\mu \in \Delta} \max_{\alpha} \; 2\, \alpha^\top \mathbf{1} - \alpha^\top Y K_\mu Y \alpha \quad \text{subject to: } 0 \leq \alpha \leq C \wedge \alpha^\top y = 0.$$

  • convex problem: the function is convex in $\mu$ (by concavity of $\mu \mapsto K_\mu$), and a pointwise maximum of convex functions is convex.

SLIDE 32

Non-Negative Combinations

Let $K_\mu = \sum_{k=1}^{p} \mu_k K_k$ with $\mu \in \Delta_1$, and let $A$ denote the feasible set $\{\alpha : 0 \leq \alpha \leq C \wedge \alpha^\top y = 0\}$. By von Neumann's generalized minimax theorem (convexity w.r.t. $\mu$, concavity w.r.t. $\alpha$, $\Delta_1$ convex and compact, $A$ convex and compact):
$$\begin{aligned}
\min_{\mu \in \Delta_1} \max_{\alpha \in A} \; 2\, \alpha^\top \mathbf{1} - \alpha^\top Y K_\mu Y \alpha
&= \max_{\alpha \in A} \min_{\mu \in \Delta_1} \; 2\, \alpha^\top \mathbf{1} - \alpha^\top Y K_\mu Y \alpha \\
&= \max_{\alpha \in A} \; 2\, \alpha^\top \mathbf{1} - \max_{\mu \in \Delta_1} \alpha^\top Y K_\mu Y \alpha \\
&= \max_{\alpha \in A} \; 2\, \alpha^\top \mathbf{1} - \max_{k \in [1, p]} \alpha^\top Y K_k Y \alpha.
\end{aligned}$$

SLIDE 33

Non-Negative Combinations

Optimization problem: in view of the previous analysis, the problem can be rewritten as the following QCQP.

  • complexity (interior-point methods): $O(p m^3)$.

$$\max_{\alpha, t} \; 2\, \alpha^\top \mathbf{1} - t \quad \text{subject to: } \forall k \in [1, p],\; t \geq \alpha^\top Y K_k Y \alpha; \quad 0 \leq \alpha \leq C \wedge \alpha^\top y = 0.$$

(Lanckriet et al., 2004)
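A minimal sketch of this QCQP using CVXPY (our choice of modeling tool, not the solver used in the original work; the data is synthetic):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
m, p, C = 30, 3, 1.0
X = rng.normal(size=(m, 2))
y = np.sign(X[:, 0] * X[:, 1])
Y = np.diag(y)
Ks = [(X @ X.T + 1.0) ** d for d in (1, 2, 3)]   # base kernel matrices

alpha = cp.Variable(m)
t = cp.Variable()
constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]
# One quadratic constraint per base kernel: t >= alpha' Y K_k Y alpha.
# psd_wrap marks Y K Y (PSD by construction) to skip slow eigenvalue checks.
constraints += [t >= cp.quad_form(alpha, cp.psd_wrap(Y @ K @ Y)) for K in Ks]
prob = cp.Problem(cp.Maximize(2 * cp.sum(alpha) - t), constraints)
prob.solve()
print(prob.value)
```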

SLIDE 34

Equivalent Primal Formulation

Optimization problem:
$$\min_{w,\, \mu \in \Delta_q} \; \frac{1}{2} \sum_{k=1}^{p} \frac{\|w_k\|_2^2}{\mu_k} + C \sum_{i=1}^{m} \max\Big(0,\; 1 - y_i \sum_{k=1}^{p} w_k \cdot \Phi_k(x_i)\Big).$$

SLIDE 35

Lots of Optimization Solutions

QCQP (Lanckriet et al., 2004).

Wrapper methods, interleaving calls to an SVM solver with updates of $\mu$:

  • SILP (Sonnenburg et al., 2006).
  • Reduced gradient (SimpleMKL) (Rakotomamonjy et al., 2008).
  • Newton's method (Kloft et al., 2009).
  • Mirror descent (Nath et al., 2009).

On-line method (Orabona & Jie, 2011). Many other methods proposed.

SLIDE 36

Does It Work?

Experiments:

  • this algorithm and its different optimization solutions often do not significantly outperform the simple uniform combination kernel in practice!
  • observations corroborated by NIPS workshops.

Alternative algorithms yield significant improvements (see the empirical results of (Gönen and Alpaydin, 2011)):

  • centered alignment-based LK algorithms (Cortes, MM, and Rostamizadeh, 2010 and 2012).
  • non-linear combinations of kernels (Cortes, MM, and Rostamizadeh, 2009).

SLIDE 37

LK Formulation - KRR

Kernel family:

  • non-negative combinations: $K_\mu = \sum_{k=1}^{p} \mu_k K_k$.
  • $L_q$ regularization.

Optimization problem:
$$\min_{\mu} \max_{\alpha} \; -\lambda\, \alpha^\top \alpha - \sum_{k=1}^{p} \mu_k\, \alpha^\top K_k \alpha + 2\, \alpha^\top y \quad \text{subject to: } \mu \geq 0 \wedge \|\mu - \mu_0\|_q \leq \Lambda.$$

  • convex optimization: linearity in $\mu$ and convexity of the pointwise maximum.

(Cortes, MM, and Rostamizadeh, 2009)

SLIDE 38

Projected Gradient

Solving the maximization problem in $\alpha$ in closed form, $\alpha = (K_\mu + \lambda I)^{-1} y$, reduces the problem to
$$\min_{\mu} \; y^\top (K_\mu + \lambda I)^{-1} y \quad \text{subject to: } \mu \geq 0 \wedge \|\mu - \mu_0\|_2 \leq \Lambda.$$
This is a convex optimization problem; one solution uses projection-based gradient descent, with
$$\frac{\partial F}{\partial \mu_k}
= \mathrm{Tr}\Big[\frac{\partial\, y^\top (K_\mu + \lambda I)^{-1} y}{\partial (K_\mu + \lambda I)}\, \frac{\partial (K_\mu + \lambda I)}{\partial \mu_k}\Big]
= -\mathrm{Tr}\big[(K_\mu + \lambda I)^{-1} y\, y^\top (K_\mu + \lambda I)^{-1} K_k\big]
= -y^\top (K_\mu + \lambda I)^{-1} K_k (K_\mu + \lambda I)^{-1} y
= -\alpha^\top K_k \alpha.$$
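A small numerical check of the gradient formula $\partial F / \partial \mu_k = -\alpha^\top K_k \alpha$ against finite differences (sketch only; data and constants are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, lam = 15, 3, 0.5
X = rng.normal(size=(m, 2))
y = rng.normal(size=m)
Ks = np.stack([(X @ X.T + 1.0) ** d for d in (1, 2, 3)])

def F(mu):
    # Objective F(mu) = y' (K_mu + lam I)^{-1} y.
    M = np.tensordot(mu, Ks, axes=1) + lam * np.eye(m)
    return y @ np.linalg.solve(M, y)

mu = np.array([0.3, 0.4, 0.3])
alpha = np.linalg.solve(np.tensordot(mu, Ks, axes=1) + lam * np.eye(m), y)
analytic = np.array([-alpha @ K @ alpha for K in Ks])
eps = 1e-6
numeric = np.array([(F(mu + eps * e) - F(mu - eps * e)) / (2 * eps)
                    for e in np.eye(p)])
assert np.allclose(analytic, numeric, atol=1e-4)
```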
SLIDE 39

Proj. Grad. KRR - L2 Reg.

ProjectionBasedGradientDescent((K_k)_{k ∈ [1, p]}, μ_0)
 1  μ' ← μ_0
 2  μ ← ∞
 3  while ‖μ' − μ‖ > ε do
 4      μ ← μ'
 5      α ← (K_μ + λI)⁻¹ y
 6      μ' ← μ + η (α⊤K_1α, ..., α⊤K_pα)⊤      ▷ descent step: ∂F/∂μ_k = −α⊤K_kα
 7      for k ← 1 to p do
 8          μ'_k ← max(0, μ'_k)
 9      μ' ← μ_0 + Λ (μ' − μ_0)/‖μ' − μ_0‖      ▷ projection onto the L2 ball
10  return μ
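A runnable sketch of this procedure in Python (the step size `eta`, tolerance `tol`, and the if-guard on the projection are our choices; the updates follow the listing above):

```python
import numpy as np

def projected_gradient_krr(Ks, y, mu0, lam=0.5, Lambda=1.0, eta=0.01, tol=1e-6):
    # Projection-based gradient descent on F(mu) = y' (K_mu + lam I)^{-1} y.
    m = Ks.shape[1]
    mu_new, mu = mu0.copy(), None
    while mu is None or np.linalg.norm(mu_new - mu) > tol:
        mu = mu_new
        alpha = np.linalg.solve(np.tensordot(mu, Ks, axes=1) + lam * np.eye(m), y)
        mu_new = mu + eta * np.array([alpha @ K @ alpha for K in Ks])  # descent step
        mu_new = np.maximum(mu_new, 0.0)                               # mu >= 0
        d = mu_new - mu0
        if np.linalg.norm(d) > Lambda:                                 # project onto L2 ball
            mu_new = mu0 + Lambda * d / np.linalg.norm(d)
    return mu_new

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2)); y = rng.normal(size=20)
Ks = np.stack([(X @ X.T + 1.0) ** d for d in (1, 2)])
print(projected_gradient_krr(Ks, y, mu0=np.ones(2)))
```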

SLIDE 40

Interpolated Step KRR - L2 Reg.

Simple and very efficient: few iterations (fewer than 15).

InterpolatedIterativeAlgorithm((K_k)_{k ∈ [1, p]}, μ_0)
 1  α' ← ∞
 2  α ← (K_{μ_0} + λI)⁻¹ y
 3  while ‖α' − α‖ > ε do
 4      α' ← α
 5      v ← (α⊤K_1α, ..., α⊤K_pα)⊤
 6      μ ← μ_0 + Λ v/‖v‖
 7      α ← η α' + (1 − η)(K_μ + λI)⁻¹ y      ▷ η: interpolation parameter
 8  return α
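A runnable sketch of the interpolated iteration (the interpolation parameter `eta`, tolerance `tol`, and regularization values are our choices):

```python
import numpy as np

def interpolated_krr(Ks, y, mu0, lam=0.5, Lambda=1.0, eta=0.5, tol=1e-8):
    # Alternate a closed-form mu update with a damped alpha update.
    m = Ks.shape[1]
    alpha_new = np.linalg.solve(np.tensordot(mu0, Ks, axes=1) + lam * np.eye(m), y)
    alpha = None
    while alpha is None or np.linalg.norm(alpha_new - alpha) > tol:
        alpha = alpha_new
        v = np.array([alpha @ K @ alpha for K in Ks])
        mu = mu0 + Lambda * v / np.linalg.norm(v)
        # Interpolated step: alpha <- eta * alpha + (1 - eta) * (K_mu + lam I)^{-1} y.
        alpha_new = eta * alpha + (1 - eta) * np.linalg.solve(
            np.tensordot(mu, Ks, axes=1) + lam * np.eye(m), y)
    return alpha_new, mu

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2)); y = rng.normal(size=20)
Ks = np.stack([(X @ X.T + 1.0) ** d for d in (1, 2)])
alpha, mu = interpolated_krr(Ks, y, mu0=np.ones(2))
print(mu)
```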

SLIDE 41

L2-Regularized Combinations

Dense combinations are beneficial when using many kernels. Combining kernels based on single features can be viewed as a principled form of feature weighting.

[Plots: error vs. training set size on the Reuters (acq) and DVD datasets, comparing the uniform-combination baseline with the L1- and L2-regularized learned combinations.]

(Cortes, MM, and Rostamizadeh, 2009)

SLIDE 42

Conclusion

Solid theoretical guarantees suggesting the use of a large number of base kernels.

Broad literature on optimization techniques, but often no significant improvement over the uniform combination.

Recent algorithms with significant improvements, in particular non-linear combinations.

Still many theoretical and algorithmic questions left to explore.

SLIDE 43

References

Olivier Bousquet and Daniel J. L. Herrmann. On the complexity of learning the kernel matrix. In NIPS, 2002.

Corinna Cortes, Marius Kloft, and Mehryar Mohri. Learning kernels using local Rademacher complexity. In NIPS, 2013.

Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Learning non-linear combinations of kernels. In NIPS, 2009.

Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Generalization bounds for learning kernels. In ICML, 2010.

Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Two-stage learning kernel methods. In ICML, 2010.

Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Algorithms for learning kernels based on centered alignment. JMLR 13: 795-828, 2012.

SLIDE 44

References

Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Tutorial: Learning Kernels. ICML 2011, Bellevue, Washington, July 2011.

Zakria Hussain and John Shawe-Taylor. Improved loss bounds for multiple kernel learning. In AISTATS, 2011 [see arXiv for corrected version].

Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for learning with matrices. JMLR 13: 1865-1890, 2012.

Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.

Vladimir Koltchinskii and Ming Yuan. Sparse recovery in large ensembles of kernel machines. In COLT, 2008.

Gert Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael Jordan. Learning the kernel matrix with semidefinite programming. JMLR, 5, 2004.

SLIDE 45

References

Mehmet Gönen and Ethem Alpaydin. Multiple kernel learning algorithms. JMLR 12: 2211-2268, 2011.

Nathan Srebro and Shai Ben-David. Learning bounds for support vector machines with learned kernels. In COLT, 2006.

Yiming Ying and Colin Campbell. Generalization bounds for learning the kernel problem. In COLT, 2009.