SLIDE 1

Explicit Feature Methods for Accelerated Kernel Learning

Purushottam Kar

SLIDE 2

Quick Motivation

  • Kernel algorithms (SVM, SVR, KPCA) have output

f(x) = ∑ᵢ αᵢ k(xᵢ, x)

  • The number of “support vectors” n is typically large
    • Provably a constant fraction of the training set size*
  • Prediction time is O(nd), where x ∈ 𝒳 ⊂ ℝᵈ
  • Slow for real-time applications


*[Steinwart NIPS 03, Steinwart-Christmann. NIPS 08]

SLIDE 3

The General Idea

  • Approximate the kernel using an explicit feature map

Z : 𝒳 ⟶ ℝᴰ s.t. ⟨Z(x), Z(y)⟩ ≈ k(x, y)

  • Speeds up prediction time to O(D) ≪ O(nd):

f(x) = ∑ᵢ αᵢ k(xᵢ, x) ≈ ∑ᵢ αᵢ ⟨Z(xᵢ), Z(x)⟩ = ⟨w, Z(x)⟩

  • where w = ∑ᵢ αᵢ Z(xᵢ) is precomputed once (see the sketch below)
  • Speeds up training time as well, since linear solvers replace kernel solvers

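As a sketch of the prediction-time saving (illustrative code, not from the talk; `Z` and `k` stand for any explicit feature map and kernel):

```python
import numpy as np

def predict_kernel(alphas, SV, x, k):
    # Naive kernel prediction: O(nd) per query point.
    return sum(a * k(sv, x) for a, sv in zip(alphas, SV))

def precompute_w(alphas, SV, Z):
    # Collapse the support-vector expansion once: w = sum_i alpha_i Z(x_i).
    return sum(a * Z(sv) for a, sv in zip(alphas, SV))

def predict_explicit(w, x, Z):
    # Approximate prediction: O(D) per query, independent of n.
    return w @ Z(x)
```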

SLIDE 4

Why Should Such Maps Exist?

  • Mercer’s theorem*

Every PSD kernel has the expansion k(x, y) = ∑ₙ λₙ φₙ(x) φₙ(y)

  • The series converges uniformly to the kernel
  • For every ε > 0, there exists D such that if we construct the map

Z(x) = (√λ₁ φ₁(x), √λ₂ φ₂(x), …, √λ_D φ_D(x)) ∈ ℝᴰ,

  • then for all x, y ∈ 𝒳

|k(x, y) − ⟨Z(x), Z(y)⟩| ≤ ε

  • Call such maps uniformly ε-approximate (a worked example follows below)


*[Mercer 09]
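For intuition (an illustration not on the slide), here is a case where the expansion is finite and the map is exact rather than ε-approximate: the homogeneous quadratic kernel.

$$
k(x, y) = (x^\top y)^2 = \sum_{i=1}^{d}\sum_{j=1}^{d} (x_i x_j)(y_i y_j)
        = \langle Z(x), Z(y) \rangle,
\qquad Z(x) = (x_i x_j)_{i,j=1}^{d} \in \mathbb{R}^{d^2}.
$$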

SLIDE 5

Today’s Agenda

  • Some explicit feature map constructions
    • Randomized feature maps, e.g. translation invariant, rotation invariant
    • Deterministic feature maps, e.g. intersection, scale invariant
  • Some “fast” random feature constructions
    • Translation invariant, dot product
  • The BIG picture?


SLIDE 6

Random Feature Maps

Approximate recovery of kernel values with high confidence


SLIDE 7

Translation Invariant Kernels*

  • Kernels of the form k(x, y) = g(x − y)
    • Gaussian kernel, Laplacian kernel
  • Bochner’s Theorem**

For every such kernel there exists a positive function p such that

g(x − y) = ∫ p(ω) cos(ω⊤(x − y)) dω = E_{ω∼p}[cos(ω⊤(x − y))]

(the second equality once p is normalized to a probability density)

  • Finding p: take the inverse Fourier transform of g
  • Select ωᵢ ∼ p for i = 1, …, D and use the map (sketch below)

Z : x ↦ (1/√D) (cos(ω₁⊤x), sin(ω₁⊤x), …, cos(ω_D⊤x), sin(ω_D⊤x))

*[Rahimi-Recht NIPS 07], ** Special case for ⊂ ℝ, [Bochner 33]
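A minimal NumPy sketch of this random Fourier feature construction for the Gaussian kernel (for which p is itself Gaussian; see two slides ahead). The function name and defaults are mine, not from the talk:

```python
import numpy as np

def random_fourier_features(X, D, sigma=1.0, rng=None):
    """Random Fourier features for the Gaussian (RBF) kernel
    k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), for which Bochner's
    theorem gives omega ~ N(0, I / sigma^2)."""
    rng = np.random.default_rng(rng)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], D))  # omega_1 .. omega_D
    proj = X @ W                                             # (n, D) projections
    # Stacking cos and sin makes <Z(x), Z(y)> an average of
    # cos(omega_i^T (x - y)) terms, as on the next slide.
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)
```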

SLIDE 8

Translation Invariant Kernels

  • Empirical averages approximate expectations
  • Let Z : x ↦ (1/√D) (cos(ω₁⊤x), sin(ω₁⊤x), …, cos(ω_D⊤x), sin(ω_D⊤x)). Then

⟨Z(x), Z(y)⟩ = (1/D) ∑ᵢ cos(ωᵢ⊤(x − y)) ≈ E_{ω∼p}[cos(ω⊤(x − y))] = g(x − y)

  • Let us assume the points x, y lie in a set 𝒳 ⊂ ℝᵈ of diameter R

Then we require D ≥ Ω((d/ε²) · log(σ_p R/ε))

  • where σ_p depends on the spectrum of the kernel (a numerical check follows below)

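To make the concentration claim tangible, a quick self-contained check on synthetic data (σ = 1; all names and values here are illustrative):

```python
import numpy as np

# The feature-space inner product approaches the true RBF kernel
# value as the number of random features D grows.
rng = np.random.default_rng(0)
x, y = rng.uniform(-1, 1, size=(2, 20))
exact = np.exp(-np.linalg.norm(x - y) ** 2 / 2)   # sigma = 1
for D in (10, 100, 10_000):
    W = rng.normal(size=(20, D))                  # omega_i ~ N(0, I)
    zx = np.concatenate([np.cos(x @ W), np.sin(x @ W)]) / np.sqrt(D)
    zy = np.concatenate([np.cos(y @ W), np.sin(y @ W)]) / np.sqrt(D)
    print(f"D={D:6d}  exact={exact:.4f}  approx={zx @ zy:.4f}")
```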

SLIDE 9

Translation Invariant Kernels

  • For the RBF kernel

k(x, y) = exp(−‖x − y‖²/(2σ²))

  • the sampling density is itself Gaussian:

p(ω) = (σ²/2π)^(d/2) · exp(−σ²‖ω‖²/2), i.e. ω ∼ 𝒩(0, σ⁻²I)
  • If the kernel offers a margin γ, then we should require

D ≳ (1/γ²) · log(1/δ)

  • Compare: the number of support vectors behaves as n ≈ R²/γ², where x ∈ B(0, R) ⊂ ℝᵈ


SLIDE 10

Rotation Invariant Kernels*

  • Kernels of the form k(x, y) = f(⟨x, y⟩)
    • Polynomial kernels, exponential kernel
  • Schoenberg’s theorem**

f(⟨x, y⟩) = ∑ₙ aₙ ⟨x, y⟩ⁿ, with all aₙ ≥ 0

  • Select degrees Nᵢ ∼ 𝒟 over ℕ for i = 1, …, D
  • To approximate ⟨x, y⟩ᴺ: select w₁, …, w_N ∼ {−1, +1}ᵈ and use

Z_N : x ↦ ∏ⱼ wⱼ⊤x

  • Similar approximation guarantees as earlier (sketch below)


*[K.-Karnick AISTATS 12], **[Schoenberg 42]
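A sketch of this recipe, following my reading of the cited paper; the geometric degree distribution and the scaling are the standard choices there, but treat the details as assumptions:

```python
import numpy as np
from math import factorial

def random_maclaurin_features(X, D, coeff, rng=None):
    """Random Maclaurin features for k(x, y) = f(<x, y>) with
    f(t) = sum_n a_n t^n, a_n >= 0; `coeff(n)` returns a_n.
    Degrees are drawn with P[N = n] = 2^-(n+1)."""
    rng = np.random.default_rng(rng)
    n_pts, d = X.shape
    Z = np.empty((n_pts, D))
    for i in range(D):
        N = rng.geometric(0.5) - 1                  # P[N = n] = 2^-(n+1)
        W = rng.choice([-1.0, 1.0], size=(N, d))    # Rademacher vectors
        scale = np.sqrt(coeff(N) * 2.0 ** (N + 1))  # unbiasedness correction
        Z[:, i] = scale * np.prod(W @ X.T, axis=0)  # product of N projections
    return Z / np.sqrt(D)

# e.g. the exponential dot-product kernel exp(<x, y>) has a_n = 1/n!:
# Z = random_maclaurin_features(X, 1000, lambda n: 1.0 / factorial(n))
```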

SLIDE 11

Deterministic Feature Maps

Exact/approximate recovery of kernel values with certainty


SLIDE 12

Intersection Kernel*

  • Kernels of the form k(x, y) = ∑ᵢ min(xᵢ, yᵢ)
  • Exploit the additive separability of the kernel:

f(x) = ∑ⱼ αⱼ k(xⱼ, x) = ∑ᵢ ∑ⱼ αⱼ min(xⱼᵢ, xᵢ) = ∑ᵢ fᵢ(xᵢ)

  • Each fᵢ is piecewise linear, so fᵢ(xᵢ) can be calculated in O(log n) time by binary search!
  • Requires O(n log n) preprocessing time per dimension
  • Prediction time O(d log n), almost independent of n
  • Moreover, the method is deterministic and exact – no ε or δ (sketch below)


*[Maji-Berg-Malik CVPR 08]
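A minimal sketch of the fast evaluation (in the spirit of the cited paper; the prefix-sum layout and names are mine):

```python
import numpy as np

def preprocess_intersection(alphas, SV):
    """O(n log n) per-dimension preprocessing for the min kernel.
    For each dimension i, f_i(t) = sum_j alpha_j * min(SV[j, i], t)
    is piecewise linear in t."""
    tables = []
    for i in range(SV.shape[1]):
        order = np.argsort(SV[:, i])
        v, a = SV[order, i], alphas[order]
        # Prefix sums evaluate both halves of the min in O(1) each.
        tables.append((v, np.cumsum(a * v), np.cumsum(a)))
    return tables

def predict_intersection(tables, x):
    """O(d log n) prediction: one binary search per dimension."""
    total = 0.0
    for (v, csum_av, csum_a), t in zip(tables, x):
        k = np.searchsorted(v, t, side="right")   # v_j <= t for j < k
        below = csum_av[k - 1] if k > 0 else 0.0  # sum of alpha_j * v_j
        above = (csum_a[-1] - (csum_a[k - 1] if k > 0 else 0.0)) * t
        total += below + above
    return total
```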

SLIDE 13

Scale Invariant Kernels*

  • Kernels of the form k(x, y) = ∑ᵢ k₀(xᵢ, yᵢ)
  • where k₀ is homogeneous:

k₀(cx, cy) = c · k₀(x, y) for all c ≥ 0

  • Bochner’s theorem still applies**
  • Involves working with the log-domain variable

λ = log x − log y, writing k₀(x, y) = √(xy) · 𝒦(λ)

  • Restrict the domain so that 𝒦 has a Fourier series
  • Use only the lower frequencies n ∈ {−W, …, W}
  • Yields deterministic ε-approximate maps (sketch below)


*[Vedaldi-Zisserman CVPR 10], **[K. 12]
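As a concrete instance (my example, not from the slide): for the χ² kernel k₀(x, y) = 2xy/(x + y) the spectrum is sech(πω), and a small deterministic map can be built by sampling that spectrum on a grid. The period L and window W are tuning knobs, and the normalization follows my reading of the Vedaldi-Zisserman construction:

```python
import numpy as np

def chi2_feature_map(x, W=2, L=0.65):
    """Deterministic homogeneous-kernel map for the chi-squared kernel
    k0(x, y) = 2xy/(x + y) on x > 0. Samples the spectrum sech(pi*w)
    at frequencies 0, L, ..., W*L; returns 2W + 1 features per value."""
    x = np.asarray(x, dtype=float)           # assumes all entries > 0
    sech = lambda w: 1.0 / np.cosh(np.pi * w)
    feats = [np.sqrt(x * L * sech(0.0))]     # zero-frequency component
    for j in range(1, W + 1):
        r = np.sqrt(2.0 * x * L * sech(j * L))
        feats.append(r * np.cos(j * L * np.log(x)))
        feats.append(r * np.sin(j * L * np.log(x)))
    # Apply per coordinate of a vector, then concatenate the results.
    return np.stack(feats, axis=-1)
```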

SLIDE 14

Fast Feature Maps

Accelerated Random Feature Constructions


SLIDE 15

Fast Fourier Features

  • Special case of k(x, y) = exp(−‖x − y‖²/(2σ²))
  • Old method: W ∈ ℝᴰˣᵈ, Wᵢⱼ ∼ 𝒩(0, 1), O(Dd) time
  • Instead use

Z : x ↦ cos(Vx) where V = (1/σ√d) · S H G Π H B

  • H is the Hadamard transform, Π is a random permutation

S, G, B are random diagonal scaling, Gaussian and sign matrices

  • Prediction time O(D log d), and E⟨Z(x), Z(y)⟩ = k(x, y)
  • Rows of V are (non-independent) Gaussian vectors
    • Correlations are sufficiently low that the variance stays small
  • However, exponential convergence (for now) only for D = d (sketch below)

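A self-contained sketch of one such block for D = d (this construction is known in the literature as Fastfood; the exact normalization of S below is my reconstruction and should be treated as an assumption):

```python
import numpy as np

def fwht(a):
    """Unnormalized fast Walsh-Hadamard transform, O(d log d); len(a) a power of 2."""
    a = a.copy()
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            x, y = a[i:i + h].copy(), a[i + h:i + 2 * h].copy()
            a[i:i + h], a[i + h:i + 2 * h] = x + y, x - y
        h *= 2
    return a

def fast_fourier_features(x, sigma=1.0, rng=None):
    """One block of V = (1/(sigma*sqrt(d))) S H G P H B applied to x,
    in O(d log d) time; requires len(x) to be a power of two."""
    rng = np.random.default_rng(rng)
    d = len(x)
    B = rng.choice([-1.0, 1.0], size=d)     # random sign flips
    P = rng.permutation(d)                  # random permutation
    G = rng.normal(size=d)                  # random Gaussian scaling
    # S matches row norms to those of an i.i.d. Gaussian matrix (assumption).
    S = np.sqrt(rng.chisquare(df=d, size=d)) / np.linalg.norm(G)
    v = fwht(B * x)                         # H B x
    v = fwht(G * v[P])                      # H G P H B x
    v = S * v / (sigma * np.sqrt(d))        # V x
    return np.concatenate([np.cos(v), np.sin(v)]) / np.sqrt(d)
```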

SLIDE 16

Fast Taylor Features

  • Say k(x, y) = (⟨x, y⟩ + c)ᵖ
  • Earlier method Z : x ↦ ∏ⱼ wⱼ⊤x (slide 10) takes O(pDd) time
  • The new method* takes O(p(d + D log D)) time
  • The earlier method works (a bit) better for larger p
  • Should be possible to improve the new method as well
  • Crucial idea: ⟨x, y⟩ᵖ = ⟨x^{⊗p}, y^{⊗p}⟩
  • Count Sketch**: x ↦ C(x) such that ⟨C(x), C(y)⟩ ≈ ⟨x, y⟩
  • Create a sketch in ℝᴰ of the tensor x^{⊗p} without materializing it:
    • Create p independent count sketches C₁, …, C_p
    • Can show that C(x^{⊗p}) ∼ C₁(x) ∗ ⋯ ∗ C_p(x) (a convolution)
    • The convolution can be done in O(p(d + D log D)) time using the FFT (sketch below)


*[Pham-Pagh KDD 13], **[Charikar et al 02]
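A compact sketch of the FFT trick for the homogeneous case k(x, y) = ⟨x, y⟩ᵖ (hash and sign choices are the standard Count Sketch ingredients; names are mine):

```python
import numpy as np

def tensor_sketch(X, D, p=2, rng=None):
    """Tensor-Sketch-style features for k(x, y) = <x, y>^p.
    The convolution of p independent count sketches is computed as an
    elementwise product in the Fourier domain."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    prod = np.ones((n, D), dtype=complex)
    for _ in range(p):
        h = rng.integers(0, D, size=d)          # hash bucket per coordinate
        s = rng.choice([-1.0, 1.0], size=d)     # random signs
        cs = np.zeros((n, D))
        np.add.at(cs, (slice(None), h), X * s)  # count sketch of each row
        prod *= np.fft.fft(cs, axis=1)          # convolve via FFT
    return np.real(np.fft.ifft(prod, axis=1))   # <Z(x), Z(y)> ~ <x, y>^p
```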

SLIDE 17

The BIG Picture

An Overview of Explicit Feature Methods


SLIDE 18

Other Feature Construction Methods

  • Efficiently evaluable maps for efficient prediction
    • Fidelity to a particular kernel is not an objective
    • Hard(er?) to give generalization guarantees
  • Local Deep Kernel Learning (LDKL)*
    • Sparse features speed up evaluation to time logarithmic in the model size
    • Training phase is more involved
  • Pairwise Piecewise Linear Embedding (PL2)**
    • Encodes (a discretization of) individual features and pairs of features
    • Constructs a feature map whose dimension is quadratic in d
    • The resulting features are sparse, so evaluation remains fast


*[Jose et al ICML 13], **[Pele et al ICML 13]

SLIDE 19

A Taxonomy of Feature Methods

                        Data dependent            Data oblivious

Kernel dependent        Nystrom methods           Explicit maps
                        • Slow training           • Fast training
                        • Data aware              • Data oblivious
                        • Problem oblivious       • Problem oblivious

Kernel independent      LDKL, PL2
                        • Slow(er) training
                        • Data aware
                        • Problem aware

SLIDE 20

Discussion

The next big thing in accelerated kernel learning?