

SLIDE 1

Optimization for Machine Learning

Lecture 4: SMO-MKL

S.V.N. (vishy) Vishwanathan
Purdue University
vishy@purdue.edu

July 11, 2012

S.V.N. Vishwanathan (Purdue University) Optimization for Machine Learning 1 / 22

SLIDE 2

Motivation

Binary Classification (figure: two classes of points, labeled yᵢ = −1 and yᵢ = +1)


SLIDE 5

Motivation

Binary Classification (figure: separating hyperplane with margins through the two classes yᵢ = −1 and yᵢ = +1)

  {x | ⟨w, x⟩ + b = 0}    (decision boundary)
  {x | ⟨w, x⟩ + b = −1}   (lower margin)
  {x | ⟨w, x⟩ + b = +1}   (upper margin)

For points x₁ and x₂ on the two margins:

  ⟨w, x₁⟩ + b = +1
  ⟨w, x₂⟩ + b = −1
  ⟨w, x₁ − x₂⟩ = 2
  ⟨w/‖w‖, x₁ − x₂⟩ = 2/‖w‖
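The margin identity on this slide can be checked numerically; a short sketch with hypothetical values of w and b:

```python
# Numeric check of the margin identity <w/||w||, x1 - x2> = 2/||w||.
# The values of w and b are hypothetical, chosen only for illustration.
import math

w = [3.0, 4.0]          # hypothetical weight vector, ||w|| = 5
b = 0.5                 # hypothetical bias
norm_w = math.sqrt(sum(wi * wi for wi in w))

# Place x1 on the +1 margin and x2 on the -1 margin along the direction of w:
# <w, c*w> + b = c*||w||^2 + b, so c = (target - b) / ||w||^2.
x1 = [(1.0 - b) / norm_w**2 * wi for wi in w]   # <w, x1> + b = +1
x2 = [(-1.0 - b) / norm_w**2 * wi for wi in w]  # <w, x2> + b = -1

dot = lambda u, v: sum(a * c for a, c in zip(u, v))
gap = dot(w, [a - c for a, c in zip(x1, x2)])   # <w, x1 - x2> = 2
margin = gap / norm_w                           # 2 / ||w|| = 0.4 here
```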

SLIDE 6

Motivation

Linear Support Vector Machines

Optimization Problem:

  min_{w,b,ξ}  (1/2)‖w‖² + C Σᵢ₌₁ᵐ ξᵢ
  s.t.  yᵢ(⟨w, xᵢ⟩ + b) ≥ 1 − ξᵢ  for all i
        ξᵢ ≥ 0
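This primal can be attacked directly; a minimal sketch using batch subgradient descent on the equivalent unconstrained hinge-loss form, with illustrative toy data and step size (not the method of this lecture, which works on the dual):

```python
# Batch subgradient descent on  (1/2)||w||^2 + C * sum_i max(0, 1 - y_i(<w,x_i>+b)),
# the unconstrained form of the soft-margin primal above.
import numpy as np

X = np.array([[2., 2.], [3., 3.], [2., 3.],
              [-2., -2.], [-3., -3.], [-2., -3.]])
y = np.array([1., 1., 1., -1., -1., -1.])
C, lr = 1.0, 0.05

w, b = np.zeros(2), 0.0
for _ in range(500):
    margins = y * (X @ w + b)
    viol = margins < 1.0                                  # points inside the margin
    grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

pred = np.sign(X @ w + b)
```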

SLIDE 7

Motivation

The Kernel Trick (figure: data in the (x, y) plane that is not linearly separable)


SLIDE 9

Motivation

The Kernel Trick (figure: the same data plotted against x and x² + y², where it becomes linearly separable)
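The effect of the mapping can be shown in a few lines: points that no line separates in the plane become separable by a single threshold after the (illustrative) feature x² + y²:

```python
# Points near the origin vs. points on a surrounding ring are not linearly
# separable in (x, y), but the feature x^2 + y^2 separates them with one
# threshold. The specific points and threshold are illustrative.
inner = [(0.5, 0.0), (0.0, -0.4), (-0.3, 0.3)]   # class -1, near the origin
outer = [(2.0, 0.0), (0.0, -2.5), (-1.5, 1.5)]   # class +1, on a ring

feat = lambda p: p[0] ** 2 + p[1] ** 2

threshold = 1.0
# In feature space every inner point falls below the threshold and every
# outer point above it.
```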

SLIDE 10

Motivation

Kernel Trick Optimization Problem:

  min_{w,b,ξ}  (1/2)‖w‖² + C Σᵢ₌₁ᵐ ξᵢ
  s.t.  yᵢ(⟨w, φ(xᵢ)⟩ + b) ≥ 1 − ξᵢ  for all i
        ξᵢ ≥ 0

SLIDE 11

Motivation

Kernel Trick Optimization Problem (dual):

  max_α  −(1/2) α⊤Hα + 1⊤α
  s.t.  0 ≤ αᵢ ≤ C
        Σᵢ αᵢyᵢ = 0

  where Hᵢⱼ = yᵢyⱼ ⟨φ(xᵢ), φ(xⱼ)⟩

SLIDE 12

Motivation

Kernel Trick Optimization Problem (dual):

  max_α  −(1/2) α⊤Hα + 1⊤α
  s.t.  0 ≤ αᵢ ≤ C
        Σᵢ αᵢyᵢ = 0

  where Hᵢⱼ = yᵢyⱼ k(xᵢ, xⱼ)
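A minimal sketch of assembling H for this dual, assuming an RBF kernel k(x, x′) = exp(−γ‖x − x′‖²) as one possible choice (the data and γ are illustrative):

```python
# Build H_ij = y_i y_j k(x_i, x_j) for a small toy problem.
import numpy as np

def rbf_kernel(X, gamma=0.5):
    # pairwise squared distances ||x - x'||^2, then exp(-gamma * .)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

X = np.array([[0., 0.], [1., 0.], [0., 2.]])
y = np.array([1., -1., 1.])

K = rbf_kernel(X)
H = np.outer(y, y) * K          # H_ij = y_i y_j k(x_i, x_j)
```

Since yᵢ² = 1 and k(x, x) = 1 for the RBF kernel, H is symmetric with unit diagonal.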

SLIDE 13

Motivation

Key Question: Which kernel should I use?

The Multiple Kernel Learning Answer:
  Cook up as many (base) kernels as you can
  Compute a data-dependent kernel function as a linear combination of the base kernels:

  k(x, x′) = Σₖ dₖ kₖ(x, x′)  s.t. dₖ ≥ 0

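The combined kernel can be sanity-checked in a few lines: a nonnegative combination of valid (positive semidefinite) base kernel matrices is again a valid kernel matrix. The base kernels and weights below are illustrative:

```python
# Combine a linear and an RBF base kernel with nonnegative weights d_k
# and verify the result is still symmetric PSD.
import numpy as np

X = np.array([[0., 1.], [1., 0.], [1., 1.], [2., 2.]])

K_lin = X @ X.T                                        # linear kernel
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_rbf = np.exp(-0.5 * sq)                              # RBF kernel

d = np.array([0.3, 0.7])                               # d_k >= 0
K = d[0] * K_lin + d[1] * K_rbf                        # k = sum_k d_k k_k

eigvals = np.linalg.eigvalsh(K)                        # all >= 0 up to rounding
```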

SLIDE 15

Motivation

Object Detection: localize a specified object of interest, if it exists, in a given image

SLIDE 16

Motivation

Some Examples of MKL Detection (figure: example detection results)

SLIDE 17

Motivation

Summary of Our Results

Sonar dataset with 800 kernels:

  p      Training Time (s)        # Kernels Selected
         SMO-MKL    Shogun        SMO-MKL    Shogun
  1.1      4.71      47.43          91.20    258.00
  1.33     3.21      19.94         248.20    374.20
  2.0      3.39      34.67         661.20    664.80

Web dataset (≈ 50,000 points, 50 kernels): ≈ 30 minutes

Sonar with a hundred thousand kernels:
  Precomputed: ≈ 8 minutes
  Kernels computed on-the-fly: ≈ 30 minutes


SLIDE 19

Motivation

Setting up the Optimization Problem - I

The Setup: We are given n kernel functions k₁, …, kₙ with corresponding feature maps φ₁(·), …, φₙ(·)

We are interested in deriving the stacked feature map

  φ(x) = [√d₁ φ₁(x); …; √dₙ φₙ(x)]  ⇒  w = [w₁; …; wₙ]
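The construction above can be checked numerically: for the stacked map, ⟨φ(x), φ(x′)⟩ = Σₖ dₖ kₖ(x, x′). The two explicit base feature maps below are illustrative:

```python
# Verify <phi(x), phi(x')> = sum_k d_k k_k(x, x') for the stacked feature map
# phi(x) = [sqrt(d_1) phi_1(x); sqrt(d_2) phi_2(x)].
import numpy as np

phi1 = lambda x: x            # k_1(x, x') = <x, x'>           (linear)
phi2 = lambda x: x ** 2       # k_2(x, x') = <x^2, x'^2>       (squared features)
d = np.array([0.25, 4.0])     # d_k >= 0

def phi(x):
    return np.concatenate([np.sqrt(d[0]) * phi1(x), np.sqrt(d[1]) * phi2(x)])

x, xp = np.array([1., 2.]), np.array([3., -1.])

lhs = phi(x) @ phi(xp)                                         # stacked inner product
rhs = d[0] * (phi1(x) @ phi1(xp)) + d[1] * (phi2(x) @ phi2(xp))  # weighted kernel sum
```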

SLIDE 20

Motivation

Setting up the Optimization Problem

Optimization Problem:

  min_{w,b,ξ}  (1/2)‖w‖² + C Σᵢ₌₁ᵐ ξᵢ
  s.t.  yᵢ(⟨w, φ(xᵢ)⟩ + b) ≥ 1 − ξᵢ  for all i
        ξᵢ ≥ 0

SLIDE 21

Motivation

Setting up the Optimization Problem

Optimization Problem:

  min_{w,b,ξ,d}  (1/2) Σₖ ‖wₖ‖² + C Σᵢ₌₁ᵐ ξᵢ
  s.t.  yᵢ(Σₖ √dₖ ⟨wₖ, φₖ(xᵢ)⟩ + b) ≥ 1 − ξᵢ  for all i
        ξᵢ ≥ 0    dₖ ≥ 0  for all k

SLIDE 22

Motivation

Setting up the Optimization Problem

Optimization Problem:

  min_{w,b,ξ,d}  (1/2) Σₖ ‖wₖ‖² + C Σᵢ₌₁ᵐ ξᵢ + (ρ/2) (Σₖ dₖᵖ)^(2/p)
  s.t.  yᵢ(Σₖ √dₖ ⟨wₖ, φₖ(xᵢ)⟩ + b) ≥ 1 − ξᵢ  for all i
        ξᵢ ≥ 0    dₖ ≥ 0  for all k

SLIDE 23

Motivation

Setting up the Optimization Problem

Optimization Problem (after the substitution wₖ ← √dₖ wₖ):

  min_{w,b,ξ,d}  (1/2) Σₖ ‖wₖ‖²/dₖ + C Σᵢ₌₁ᵐ ξᵢ + (ρ/2) (Σₖ dₖᵖ)^(2/p)
  s.t.  yᵢ(Σₖ ⟨wₖ, φₖ(xᵢ)⟩ + b) ≥ 1 − ξᵢ  for all i
        ξᵢ ≥ 0    dₖ ≥ 0  for all k

SLIDE 24

Motivation

Setting up the Optimization Problem

Optimization Problem:

  min_d max_α  −(1/2) Σₖ dₖ α⊤Hₖα + 1⊤α + (ρ/2) (Σₖ dₖᵖ)^(2/p)
  s.t.  0 ≤ αᵢ ≤ C
        Σᵢ αᵢyᵢ = 0
        dₖ ≥ 0
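The saddle-point objective above is easy to evaluate for a fixed (d, α) pair; a minimal sketch, with a made-up one-kernel instance:

```python
# Evaluate  -(1/2) sum_k d_k a^T H_k a + 1^T a + (rho/2) (sum_k d_k^p)^(2/p)
# for given d and alpha. The instance (H_k, d, alpha, rho, p) is illustrative.
import numpy as np

def mkl_objective(alpha, d, Hs, rho, p):
    quad = sum(dk * alpha @ Hk @ alpha for dk, Hk in zip(d, Hs))
    reg = (rho / 2.0) * (d ** p).sum() ** (2.0 / p)
    return -0.5 * quad + alpha.sum() + reg

Hs = [np.eye(2)]                   # a single base kernel's H_k
alpha = np.array([0.5, 0.5])
d = np.array([2.0])
val = mkl_objective(alpha, d, Hs, rho=2.0, p=2.0)
# by hand: -0.5*2*0.5 + 1 + 1*(2^2) = 4.5
```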

SLIDE 25

Motivation

Saddle Point Problem (figure: a saddle-shaped surface over the (d, α) plane)

SLIDE 26

Motivation

Solving the Saddle Point

Saddle Point Problem:

  min_d max_α  −(1/2) Σₖ dₖ α⊤Hₖα + 1⊤α + (ρ/2) (Σₖ dₖᵖ)^(2/p)
  s.t.  0 ≤ αᵢ ≤ C
        Σᵢ αᵢyᵢ = 0
        dₖ ≥ 0

SLIDE 27

Our Approach

The Key Insight: Eliminate d

  max_α  D(α) := −(1/(8ρ)) (Σₖ (α⊤Hₖα)^q)^(2/q) + 1⊤α
  s.t.  0 ≤ αᵢ ≤ C
        Σᵢ αᵢyᵢ = 0

  where 1/p + 1/q = 1

Not a QP, but very close to one!
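After d is eliminated, the reduced objective D(α) depends on the data only through the per-kernel quadratic forms α⊤Hₖα; a minimal sketch of evaluating it, with q obtained from p and a made-up instance:

```python
# Evaluate  D(alpha) = -(1/(8 rho)) (sum_k (a^T H_k a)^q)^(2/q) + 1^T a,
# where q is the conjugate exponent of p: 1/p + 1/q = 1.
import numpy as np

def D(alpha, Hs, rho, p):
    q = p / (p - 1.0)                              # conjugate exponent
    s = sum((alpha @ Hk @ alpha) ** q for Hk in Hs)
    return -(1.0 / (8.0 * rho)) * s ** (2.0 / q) + alpha.sum()

Hs = [np.eye(2)]                                   # illustrative single kernel
alpha = np.array([1.0, 1.0])
val = D(alpha, Hs, rho=1.0, p=2.0)                 # p = 2  =>  q = 2
# by hand: -(1/8)*(2^2)^1 + 2 = 1.5
```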

SLIDE 28

Our Approach

SMO-MKL: High Level Overview

  max_α  D(α) := −(1/(8ρ)) (Σₖ (α⊤Hₖα)^q)^(2/q) + 1⊤α
  s.t.  0 ≤ αᵢ ≤ C
        Σᵢ αᵢyᵢ = 0

Algorithm:
  Choose two variables αᵢ and αⱼ to optimize
  Solve the one-dimensional reduced optimization problem
  Repeat until convergence
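The three steps above can be sketched for the plain SVM dual (single kernel, quadratic objective max_α 1⊤α − (1/2)α⊤Hα); SMO-MKL keeps the same pair-selection structure but with the non-quadratic objective D(α). Max-violating-pair selection and the clipped one-dimensional update below are standard SMO ingredients; the toy data is illustrative:

```python
import numpy as np

def smo(H, y, C, eps=1e-6, max_iter=1000):
    """Simplified SMO for  max_a 1^T a - (1/2) a^T H a,
    s.t. 0 <= a_i <= C, sum_i a_i y_i = 0."""
    m = len(y)
    alpha, g = np.zeros(m), np.ones(m)      # g = gradient = 1 - H @ alpha
    for _ in range(max_iter):
        yg = y * g
        up = ((y > 0) & (alpha < C)) | ((y < 0) & (alpha > 0))
        low = ((y > 0) & (alpha > 0)) | ((y < 0) & (alpha < C))
        i = np.where(up)[0][np.argmax(yg[up])]     # most violating pair
        j = np.where(low)[0][np.argmin(yg[low])]
        if yg[i] - yg[j] < eps:                    # KKT conditions met
            break
        # One-dimensional problem along a_i += y_i t, a_j -= y_j t
        # (this direction keeps sum_i a_i y_i = 0):
        eta = H[i, i] + H[j, j] - 2 * y[i] * y[j] * H[i, j]
        t = (yg[i] - yg[j]) / max(eta, 1e-12)
        # Clip t so both variables stay in the box [0, C].
        t = min(t,
                (C - alpha[i]) if y[i] > 0 else alpha[i],
                alpha[j] if y[j] > 0 else (C - alpha[j]))
        alpha[i] += y[i] * t
        alpha[j] -= y[j] * t
        g -= t * (y[i] * H[:, i] - y[j] * H[:, j])
    return alpha

# Toy linearly separable problem with a linear kernel.
X = np.array([[1., 1.], [2., 2.], [-1., -1.], [-2., -2.]])
y = np.array([1., 1., -1., -1.])
K = X @ X.T
H = np.outer(y, y) * K
alpha = smo(H, y, C=10.0)

# Recover b from a free support vector (0 < alpha_i < C) and predict.
sv = np.where((alpha > 1e-8) & (alpha < 10.0 - 1e-8))[0][0]
b = y[sv] - (alpha * y) @ K[sv]
pred = np.sign((alpha * y) @ K + b)
```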

SLIDE 29

Our Approach

SMO-MKL: High Level Overview

Selecting the Working Set:
  Compute the directional derivative and directional Hessian
  Greedily select the variables

Solving the Reduced Problem:
  Analytic solution for p = q = 2 (a one-dimensional quartic)
  For other values of p, use Newton-Raphson

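For values of p other than 2, the reduced one-dimensional problem is solved by Newton-Raphson. A generic sketch of such an inner solver (the quartic below is a stand-in objective, not the actual SMO-MKL reduced problem):

```python
# Newton-Raphson for smooth one-dimensional minimization: iterate
# t <- t - f'(t)/f''(t). Here f(t) = t^4/4 - t, so f'(t) = t^3 - 1,
# f''(t) = 3t^2, and the minimizer is t = 1.

def newton_1d(fprime, fsecond, t0, tol=1e-10, max_iter=50):
    t = t0
    for _ in range(max_iter):
        step = fprime(t) / fsecond(t)
        t -= step
        if abs(step) < tol:     # converged: derivative root found
            break
    return t

t_star = newton_1d(lambda t: t ** 3 - 1.0, lambda t: 3.0 * t ** 2, t0=2.0)
```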

SLIDE 31

Experiments

Generalization Performance (figure: test accuracy (%) on Australian for p ∈ {1.1, 1.33, 1.66, 2.0, 2.33, 2.66, 3.0}, comparing SMO-MKL and Shogun; accuracies fall in the 80-90% range)

SLIDE 32

Experiments

Generalization Performance (figure: test accuracy (%) on ionosphere for p ∈ {1.1, 1.33, 1.66, 2.0, 2.33, 2.66, 3.0}, comparing SMO-MKL and Shogun; accuracies fall in the 88-94% range)

SLIDE 33

Experiments

Scaling with Training Set Size

Adult: 123 dimensions, 50 RBF kernels, p = 1.33, C = 1
(figure: CPU time in seconds vs. number of training examples, log-log, comparing SMO-MKL and Shogun)

SLIDE 34

Experiments

Scaling with Training Set Size

Adult: 123 dimensions, 50 RBF kernels, p = 1.33, C = 1
(figure: CPU time in seconds vs. number of training examples, log-log; observed scaling ≈ O(n^1.9))

SLIDE 35

Experiments

On Another Dataset

Web: 300 dimensions, 50 RBF kernels, p = 1.33, C = 1
(figure: CPU time in seconds vs. number of training examples, log-log)

SLIDE 36

Experiments

Scaling with Number of Kernels

Sonar: 208 examples, 59 dimensions, p = 1.33, C = 1
(figure: CPU time in seconds vs. number of kernels, log-log)

SLIDE 37

Experiments

Scaling with Number of Kernels

Sonar: 208 examples, 59 dimensions, p = 1.33, C = 1
(figure: CPU time in seconds vs. number of kernels, log-log; SMO-MKL scales ≈ O(n) in the number of kernels)

SLIDE 38

Experiments

On Another Dataset

Real-sim: 72,309 examples, 20,958 dimensions, p = 1.33, C = 1
(figure: CPU time in seconds vs. number of training examples, log-log; SMO-MKL scales ≈ O(n))

SLIDE 39

References

S. V. N. Vishwanathan, Zhaonan Sun, Theera-Ampornpunt, and Manik Varma. Multiple Kernel Learning and the SMO Algorithm. NIPS 2010, pages 2361-2369.
Code available for download from http://research.microsoft.com/en-us/um/people/manik/code/SMO-MKL/download.html