Kernel Learning with a Million Kernels Ashesh Jain SVN - PowerPoint PPT Presentation

Kernel Learning with a Million Kernels Ashesh Jain SVN Vishwanathan IIT Delhi Purdue University Manik Varma Microsoft Research India

Kernel Learning • The objective in kernel learning is to jointly learn both SVM and kernel parameters from training data. • Kernel parameterizations • Linear : 𝐿 = 𝑒 𝑗 𝐿 𝑗 𝑗 • Non-linear : 𝐿 = 𝐿 𝑗 = 𝑓 −𝑒 𝑗 𝐸 𝑗 𝑗 𝑗 • Regularizers • Sparse l 1 • Sparse and non-sparse l p >1 • Log determinant

Kernel Learning for Object Detection • Vedaldi, Gulshan, Varma and Zisserman ICCV 2009

Kernel Learning for Object Recognition • Orabona, Jie and Caputo CVPR 2010

Kernel Learning for Feature Selection • Varma and Babu ICML 2009 FERET Gender Identification Data Set AdaBoost Baluja et al . LP-SVM BAHSIC Linear # OWL-QN SSVM QCQP Non-Linear Feat [ICML 2007] [ICML 2007] MKL [IJCV 2007] [COA 2004] [ICML 2007] MKL 76.3  0.9 79.5  1.9 71.6  1.4 84.9  1.9 79.5  2.6 81.2  3.2 80.8  0.2 88.7  0.8 10 82.6  0.6 80.5  3.3 87.6  0.5 85.6  0.7 86.5  1.3 83.8  0.7 93.2  0.9 20 - 83.4  0.3 84.8  0.4 89.3  1.1 88.6  0.2 89.4  2.4 86.3  1.6 95.1  0.5 30 - 86.9  1.0 88.8  0.4 90.6  0.6 89.5  0.2 91.0  1.3 89.4  0.9 95.5  0.7 50 - 88.9  0.6 90.4  0.2 90.6  1.1 92.4  1.4 90.5  0.2 80 - - - 89.5  0.2 90.6  0.3 90.5  0.2 94.1  1.3 91.3  1.3 100 - - - 91.3  0.5 90.3  0.8 90.7  0.2 94.5  0.7 150 - - - - 93.1  0.5 90.8  0.0 94.3  0.1 252 - - - - - 76.3(12.6) - 91 (221.3) 91 (58.3) 90.8 (252) - 91.6(146.3) 95.5 (69.6)

The GMKL Primal Formulation P = Min w , b , d ½ 𝐱 𝑢 𝐱 + 𝐷 𝑀 𝐱 𝑢 𝛠 𝐞 𝐲 𝑗 + 𝑐, 𝑧 𝑗 + 𝑠(𝐞) 𝑗 s. t. 𝐞 ∈ 𝑬 𝑢 𝐲 𝑗 𝛠 𝐞 𝐲 𝑘 ≻ 0 ∀𝐞 ∈ 𝐸 • 𝐿 𝐞 (𝐲 𝑗 , 𝐲 𝑘 ) = 𝛠 𝐞 • 𝛂 𝐞 𝐿 and 𝛂 𝐞 𝑠 exist and are continuous

The GMKL Primal Formulation • The GMKL primal formulation for binary classification. ½ 𝐱 𝑢 𝐱 + 𝐷 𝜊 𝑗 + 𝑠(𝐞) P = Min w , b , d, 𝜊 𝑗 𝑧 𝑗 𝐱 𝑢 𝛠 𝐞 𝐲 𝑗 + 𝑐 ≥ 1 − 𝜊 𝑗 s. t. 𝜊 𝑗 ≥ 0 & 𝐞 ∈ 𝑬

The GMKL Primal Formulation • The GMKL primal formulation for binary classification. ½ 𝐱 𝑢 𝐱 + 𝐷 𝜊 𝑗 + 𝑠(𝐞) P = Min w , b , d, 𝜊 𝑗 𝑧 𝑗 𝐱 𝑢 𝛠 𝐞 𝐲 𝑗 + 𝑐 ≥ 1 − 𝜊 𝑗 s. t. 𝜊 𝑗 ≥ 0 & 𝐞 ∈ 𝑬 • Intermediate Dual 1 t  – ½  t YK d Y  + r( d ) D = Min d Max  1 t Y  = 0 s. t. 0    C & 𝐞 ∈ 𝑬

Projected Gradient Descent x 0 1 0.9 0.8 0.7 0.6 d 2 0.5 0.4 0.3 0.2 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 d 1

Projected Gradient Descent x 0 1 0.9 x 1 0.8 0.7 0.6 d 2 0.5 0.4 0.3 0.2 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 d 1

Projected Gradient Descent x 0 1 0.9 x 1 0.8 0.7 0.6 d 2 0.5 0.4 0.3 0.2 0.1 x 2 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 d 1 z 1

Projected Gradient Descent x 0 1 0.9 x 1 0.8 0.7 0.6 d 2 0.5 0.4 0.3 x 3 0.2 0.1 x 2 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 d 1 z 1

Projected Gradient Descent x 0 1 0.9 x 1 0.8 0.7 0.6 d 2 0.5 x ∗ 0.4 0.3 x 3 0.2 0.1 x 2 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 d 1 z 1

PGD Limitations • PGD requires many function and gradient evaluations as • No step size information is available. • The Armijo rule might reject many step size proposals. • Inaccurate gradient values can lead to many tiny steps.

PGD Limitations • PGD requires many function and gradient evaluations as • No step size information is available. • The Armijo rule might reject many step size proposals. • Inaccurate gradient values can lead to many tiny steps. • Noisy function and gradient values can cause PGD to converge to points far away from the optimum.

PGD Limitations • PGD requires many function and gradient evaluations as • No step size information is available. • The Armijo rule might reject many step size proposals. • Inaccurate gradient values can lead to many tiny steps. • Noisy function and gradient values can cause PGD to converge to points far away from the optimum. • Solving SVMs to high precision to obtain accurate function and gradient values is very expensive.

PGD Limitations • PGD requires many function and gradient evaluations as • No step size information is available. • The Armijo rule might reject many step size proposals. • Inaccurate gradient values can lead to many tiny steps. • Noisy function and gradient values can cause PGD to converge to points far away from the optimum. • Solving SVMs to high precision to obtain accurate function and gradient values is very expensive. • Repeated projection onto the feasible set might also be expensive.

SPG Solution – Spectral Step Length • Quadratic approximation : ½ 𝜇 −𝟐 𝐲 𝑢 𝐲 + 𝐝 𝑢 𝐲 + 𝑒 𝐲 𝑜 − 𝐲 𝑜−𝟐 , 𝐲 𝑜 − 𝐲 𝑜−𝟐 • Spectral step length : 𝜇 𝑇𝑄𝐻 = 𝐲 𝑜 − 𝐲 𝑜−𝟐 , 𝛂𝑔(𝐲 𝑜 ) − 𝛂𝑔(𝐲 𝑜−𝟐 ) x * x 1 x 1 x 0 x 0 Original Function Approximation

SPG Solution – Spectral Step Length 𝐲 𝑜 − 𝐲 𝑜−𝟐 , 𝐲 𝑜 − 𝐲 𝑜−𝟐 • Spectral step length : 𝜇 𝑇𝑄𝐻 = 𝐲 𝑜 − 𝐲 𝑜−𝟐 , 𝛂𝑔(𝐲 𝑜 ) − 𝛂𝑔(𝐲 𝑜−𝟐 ) 1 x 1 x 0 0.5 0 -0.5 -1 -1 -0.5 0 0.5 1

PGD Limitations – Repeated Projections • Accept P ( z t ) if it satisfies the Armijo rule z t – 𝛂 f x t P ( z t )

PGD Limitations – Repeated Projections • PGD might require many projections before accepting a point z t – 𝛂 f x t P ( z t )

SPG Solution – Spectral Proj Gradient • SPG requires a single projection per step z t – 𝛂 f x t P ( z t ) – 𝛂 spg

SPG Solution – Non-Monotone Rule • Handling function and gradient noise. • Non-monotone rule : 𝑔 𝑦 𝑢 − 𝑡𝛼𝑔 𝑦 𝑢 0≤𝑘≤𝑁 𝑔 𝑦 𝑢−𝑘 − 𝛿𝑡 𝛼𝑔 𝑦 𝑢 2 ≤ max 2 1438 Global Minimum SPG-GMKL 1436 1434 f(x) 1432 1430 1428 1426 0 10 20 30 40 50 60 time (s)

PGD Limitations – Step Size Selection • The Armijo rule might get stuck due to noisy function values 1438 Global Minimum PGD 1436 1434 f(x) 1432 1430 1428 1426 0 10 20 30 40 50 60 time (s)

SPG Solution – SVM Precision Tuning SPG PGD 0.5 hr 6.5 3 hr 0.2 hr 6 5.5 0.3 hr 5 4.5 2 hr 0.1 hr 4 3.5 1 hr 1 1 0.9 0.9 0.8 0.1 hr 0.8 0.7 0.7 0.6 0.6 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0

SPG Advantages • SPG requires fewer function and gradient evaluations due to • The 2 nd order spectral step length estimation. • The non-monotone line search criterion. • SPG is more robust to noisy function and gradient values due to the non-monotone line search criterion. • SPG never needs to solve an SVM with high precision due to our precision tuning strategy. • SPG needs to perform only a single projection per step.

SPG Algorithm 1: 𝑜 ← 0 2: Initialize 𝐞 0 randomly 3: 𝐬𝐟𝐪𝐟𝐛𝐮 𝛃 ∗ ← SolveSVM(𝐋 𝐞 𝑜 , ϵ) 4: 𝜇 ← SpectralStepLength 5: 𝐪 𝑜 ← 𝐞 𝑜 − 𝐐 𝐞 𝑜 − 𝛍𝛂𝑋 𝐞 𝑜 , 𝛃 ∗ 6: s 𝑜 ← Non − Monotone 7: ϵ ← TuneSVMPrecision 8: 𝐞 𝑜+𝟐 ← 𝐞 𝑜 − s 𝑜 𝐪 𝑜 9: 10: 𝐯𝐨𝐮𝐣𝐦 converged

Results on Large Scale Data Sets Covertype: Sum of kernels subject to 𝑚 1.33 regularization • • Number of training points 581,012 • Number of Kernels 5 • SPG time taken 64.46 hrs • SPG took 26 SVM evaluations • First SVM evaluation took 44 hours • Only 0.19% of SV were cached

Results on Large Scale Data Sets Sonar: Sum of kernels subject to 𝑚 1.33 regularization • • Number of training points 208 • Number of Kernels 1 Million • SPG time taken 105.62 hrs 5.5 SPG p=1.33 SPG p=1.66 5 PGD p=1.33 PGD p=1.66 4.5 log(Time) 4 3.5 3 2.5 2 3.6 3.8 4 4.2 4.4 4.6 4.8 5 5.2 log(#Kernel)

Results on Large Scale Data Sets Sum of kernels subject to 𝑚 𝑞≥1 regularization • p =1 p =1.33 Data Sets # Train # Kernels PGD (hrs) SPG (hrs) PGD (hrs) SPG (hrs) 32,561 50 35.84 4.55 31.77 4.42 Adult - 9 59,535 50 – 25.17 66.48 19.10 Cod - RNA 50,000 40.10 42.20 KDDCup04 50 – –

Results on Small Scale Data Sets Sum of kernels subject to 𝑚 1 regularization • Data Sets SimpleMKL (s) Shogun (s) PGD (s) SPG (s) 400 ± 128.4 15 ± 7.7 38 ± 17.6 6 ± 4.2 Wpbc 676 ± 356.4 12 ± 1.2 57 ± 85.1 5 ± 0.6 Breast - Cancer 383 ± 33.5 1094 ± 621.6 29 ± 7.1 10 ± 0.8 Australian 1247 ± 680.0 107 ± 18.8 1392 ± 824.2 39 ± 6.8 Ionosphere 1468 ± 1252.7 935 ± 65.0 273 ± 64.0 Sonar –

Kernel Learning with a Million Kernels Ashesh Jain SVN - PowerPoint PPT Presentation

Kernel Learning with a Million Kernels Ashesh Jain SVN Vishwanathan IIT Delhi Purdue University Manik Varma Microsoft Research India Kernel Learning The objective in kernel learning is to jointly learn both SVM and kernel parameters

The Gray Code Kernels The Gray Code Kernels The Gray Code Kernels Gil Ben-Artzi Hagit Hel-Or

Overview: Kernels for Sequences and Graphs String Kernels 8 Example Sequence Classification

Learning From Data Lecture 26 Kernel Machines Popular Kernels The Kernel Measures Similarity

Kernels on structures Andrea Passerini passerini@disi.unitn.it Machine Learning Kernels on

Beta kernels and transformed kernels applications to copulas and quantiles Arthur Charpentier

SVM Kernels COMPSCI 371D Machine Learning COMPSCI 371D Machine Learning SVM Kernels 1 /

MACHINE LEARNING kernels 1 MACHINE LEARNING 2012 MACHINE LEARNING Kernels: Intuition How

Kernel methods and Graph kernels Social and Technological Networks Rik Sarkar University of

Kernel methods and Graph kernels Social and Technological Networks Rik Sarkar University of

Scalable Machine Learning 6. Kernels Alex Smola Yahoo! Research and ANU

Kernel on Automata Cousins of String Kernels and Dynamic Systems Kernels? S.V.N. Vishy

Tight Kernel Query Complexity of Kernel Ridge Regression and Kernel -means Clustering Manuel

Advanced Machine Learning Learning Kernels MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE &

GRAPH MINING AND GRAPH KERNELS Part II: Graph Kernels Karsten Borgwardt^ and Xifeng Yan*

MICROKERNELS: MACH AND L4 Hakim Weatherspoon CS6410 Introduction to Kernels Different Types

Rumprun for Rump Kernels: Instant Unikernels for POSIX applications Martin Lucina, @matolucina 1

Mobile IPv6 Security Arnaud Ebalard - EADS Corporate Research Center France Guillaume Valadon - The

COAs Integrated Behavioral Health & Primary Care Supplement Melissa Dury, LCSW Associate

4. Convex Sets and (Quasi-)Concave Functions Daisuke Oyama Mathematics II April 17, 2020 Convex

Security Models: Proofs, Protocols and Certification Florent Autrau - Yassine Lakhnech -

DETERMINATION OF NEED FAFSA & CSS Profile Expected Family Contribution (EFC)

UAV Presentation Bill Timmins GIS Services UAV copters can provide for a variety of sensors for

COAPS API A Generic Cloud Application Provisioning and Management API Why COAPS ? PaaS 1 Cloud

Synchronization Computer Architecture J. Daniel Garca Snchez (coordinator) David Expsito

Kernel Learning with a Million Kernels Ashesh Jain SVN - PowerPoint PPT Presentation

Kernel Learning with a Million Kernels Ashesh Jain SVN Vishwanathan IIT Delhi Purdue University Manik Varma Microsoft Research India Kernel Learning The objective in kernel learning is to jointly learn both SVM and kernel parameters

The Gray Code Kernels The Gray Code Kernels The Gray Code Kernels Gil Ben-Artzi Hagit Hel-Or

Overview: Kernels for Sequences and Graphs String Kernels 8 Example Sequence Classification

Learning From Data Lecture 26 Kernel Machines Popular Kernels The Kernel Measures Similarity

Kernels on structures Andrea Passerini passerini@disi.unitn.it Machine Learning Kernels on

Beta kernels and transformed kernels applications to copulas and quantiles Arthur Charpentier

SVM Kernels COMPSCI 371D Machine Learning COMPSCI 371D Machine Learning SVM Kernels 1 /

MACHINE LEARNING kernels 1 MACHINE LEARNING 2012 MACHINE LEARNING Kernels: Intuition How

Kernel methods and Graph kernels Social and Technological Networks Rik Sarkar University of

Kernel methods and Graph kernels Social and Technological Networks Rik Sarkar University of

Scalable Machine Learning 6. Kernels Alex Smola Yahoo! Research and ANU

Kernel on Automata Cousins of String Kernels and Dynamic Systems Kernels? S.V.N. Vishy

Tight Kernel Query Complexity of Kernel Ridge Regression and Kernel -means Clustering Manuel

Advanced Machine Learning Learning Kernels MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE &amp;

GRAPH MINING AND GRAPH KERNELS Part II: Graph Kernels Karsten Borgwardt^ and Xifeng Yan*

MICROKERNELS: MACH AND L4 Hakim Weatherspoon CS6410 Introduction to Kernels Different Types

Rumprun for Rump Kernels: Instant Unikernels for POSIX applications Martin Lucina, @matolucina 1

Mobile IPv6 Security Arnaud Ebalard - EADS Corporate Research Center France Guillaume Valadon - The

COAs Integrated Behavioral Health &amp; Primary Care Supplement Melissa Dury, LCSW Associate

4. Convex Sets and (Quasi-)Concave Functions Daisuke Oyama Mathematics II April 17, 2020 Convex

Security Models: Proofs, Protocols and Certification Florent Autrau - Yassine Lakhnech -

DETERMINATION OF NEED FAFSA &amp; CSS Profile Expected Family Contribution (EFC)

UAV Presentation Bill Timmins GIS Services UAV copters can provide for a variety of sensors for

COAPS API A Generic Cloud Application Provisioning and Management API Why COAPS ? PaaS 1 Cloud

Synchronization Computer Architecture J. Daniel Garca Snchez (coordinator) David Expsito

Advanced Machine Learning Learning Kernels MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE &

COAs Integrated Behavioral Health & Primary Care Supplement Melissa Dury, LCSW Associate

DETERMINATION OF NEED FAFSA & CSS Profile Expected Family Contribution (EFC)