Skoltech Skolkovo Institute of Science and Technology Kernel - - PowerPoint PPT Presentation




slide-1
SLIDE 1

Skoltech

Skolkovo Institute of Science and Technology

Quadrature-based Features for Kernel Approximation

Marina Munkhoeva, Yermek Kapushev, Evgeny Burnaev, Ivan Oseledets

slide-2
SLIDE 2
Kernel Methods Refresher

  • Kernel trick: compute ⟨ψ(x), ψ(z)⟩ via the kernel function k(x, z)
  • Inner product in an implicit feature space, computed using only the input features
  • Naively, kernel methods scale poorly with the number of samples

k(x, z) = ⟨ψ(x), ψ(z)⟩

[Figure: the map ψ from input space to feature space]
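As an illustration of the kernel trick (not taken from the slides), the sketch below uses the homogeneous degree-2 polynomial kernel k(x, z) = (x⊤z)², for which an explicit feature map ψ is known. The kernel choice and the names `poly2_kernel` and `psi` are assumptions made for this example only.

```python
import numpy as np

def poly2_kernel(x, z):
    """Kernel function: works directly on the input features."""
    return float(np.dot(x, z)) ** 2

def psi(x):
    """Explicit feature map for this kernel: all pairwise products
    x_i * x_j, flattened into a vector of length d * d."""
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

k_value = poly2_kernel(x, z)                  # O(d) work in the input space
inner_product = float(np.dot(psi(x), psi(z)))  # O(d^2) work in the feature space
```

Both quantities agree, but the kernel evaluation never builds the d²-dimensional vectors.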

slide-3
SLIDE 3

Scalable Kernel Methods

  • Revert the trick: k(x, z) ≈ ϕ(x)⊤ϕ(z)
  • Use linear methods with mapped objects x → ϕ(x)
  • How to generate the approximate mapping ϕ(⋅)?

k(x, y) = ⟨ψ(x), ψ(y)⟩ ≈ ϕ(x)⊤ϕ(y)

[Figure: the map ψ from input space to feature space]
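To make "use linear methods with mapped objects" concrete, here is a minimal sketch with ridge regression as the linear method (an assumption; the slides do not name one). The matrix `Phi` is a random stand-in for the mapped training objects ϕ(x_i). A linear ridge fit on `Phi` gives the same predictions as kernel ridge regression with the approximate kernel K̂ = ΦΦ⊤, but it solves a D×D system instead of an N×N one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 10                   # N objects, D-dimensional phi(x)
Phi = rng.normal(size=(n_samples, n_features))    # stand-in for mapped objects phi(x_i)
y = rng.normal(size=n_samples)
Phi_test = rng.normal(size=(5, n_features))
lam = 0.1                                         # ridge regularisation strength

# Linear (primal) ridge regression on the mapped objects: D x D system.
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_features), Phi.T @ y)
pred_linear = Phi_test @ w

# Kernel (dual) ridge regression with K_hat = Phi Phi^T: N x N system.
K_hat = Phi @ Phi.T
alpha = np.linalg.solve(K_hat + lam * np.eye(n_samples), y)
pred_kernel = (Phi_test @ Phi.T) @ alpha
```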

slide-4
SLIDE 4

Kernel Function Approximation

Consider kernels that allow an integral representation:

$$k(x, y) = \mathbb{E}_{p(w)} f_{xy}(w) = \int_{\mathbb{R}^d} f_{xy}(w)\, p(w)\, dw = I(f), \qquad f_{xy}(w) = \phi(w^\top x)\, \phi(w^\top y) = f(w)$$

slide-5
SLIDE 5

Kernel Function Approximation

The weight density p(w) in the integral representation is the standard Gaussian:

$$p(w) = (2\pi)^{-d/2}\, e^{-\|w\|^2/2}$$

slide-6
SLIDE 6

Kernel Function Approximation

Kernels that allow this integral representation include:

  • Shift-invariant kernels (e.g. the radial basis function (RBF) kernel)
  • Pointwise Nonlinear Gaussian kernels (e.g. arc-cosine kernels)
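A small Monte Carlo sketch of the integral representation for a pointwise nonlinear Gaussian kernel. It assumes the step nonlinearity ϕ(t) = 1[t > 0], which corresponds to the arc-cosine kernel of order 0 up to the normalisation convention. For w ∼ 𝒩(0, I), the expectation of ϕ(w⊤x)ϕ(w⊤y) is the probability that x and y fall in the same half-space, which equals (π − θ)/(2π) where θ is the angle between x and y. Variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200_000
x = rng.normal(size=d)
y = rng.normal(size=d)

# Samples w_i ~ N(0, I), i.e. draws from the density p(w).
W = rng.normal(size=(n, d))

# f_xy(w) = step(w^T x) * step(w^T y), averaged over the samples.
f_xy = (W @ x > 0) & (W @ y > 0)
mc_estimate = float(f_xy.mean())

# Closed form of the same expectation in terms of the angle between x and y.
cos_theta = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
closed_form = float((np.pi - theta) / (2 * np.pi))
```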

slide-7
SLIDE 7
Random Fourier Features (RFF)

[Rahimi and Recht, 2008]

RFF mapping ϕ(⋅):

k(x, z) = 𝔼[ϕw(x)⊤ϕw(z)], ϕw(x) = [cos(w⊤x), sin(w⊤x)], w ∼ p(w)

RFF is a Monte Carlo approximation of I(f).

  • Orthogonal points w: more accurate
  • Structured w: faster
  • Orthogonal + structured w: more accurate and faster
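A minimal sketch of the RFF mapping for the Gaussian kernel with unit bandwidth, assuming w ∼ 𝒩(0, I) so that the target is k(x, z) = exp(−‖x − z‖²/2). The function name `rff_features` and the 1/√D scaling (which turns the inner product into a Monte Carlo average over D sampled w) are choices made for this example.

```python
import numpy as np

def rff_features(X, W):
    """Map each row x of X to [cos(Wx), sin(Wx)] / sqrt(D),
    where D is the number of sampled frequencies (rows of W)."""
    proj = X @ W.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(W.shape[0])

rng = np.random.default_rng(0)
d, D = 4, 20_000
X = 0.5 * rng.normal(size=(3, d))     # three toy input points
W = rng.normal(size=(D, d))           # w ~ N(0, I)

Phi = rff_features(X, W)
K_approx = Phi @ Phi.T                # phi(x)^T phi(z) for all pairs

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-0.5 * sq_dists)     # Gaussian kernel, unit bandwidth
max_abs_error = float(np.max(np.abs(K_approx - K_exact)))
```

The error shrinks roughly like 1/√D, which is the Monte Carlo rate the later slides aim to improve on in the constant.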

slide-8
SLIDE 8

Our method uses polar form of the integral

Change to polar coordinates (w = rz, ‖z‖₂ = 1):

$$I(f) = (2\pi)^{-d/2} \int_{\mathbb{R}^d} e^{-\|w\|^2/2} f(w)\, dw = \frac{(2\pi)^{-d/2}}{2} \int_{U_d} \int_{-\infty}^{\infty} e^{-r^2/2}\, |r|^{d-1} f(rz)\, dr\, dz$$
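A quick sanity check on the factor 1/2 in the polar form (a worked special case, not from the slides). Because r runs over the whole real line rather than [0, ∞), every w ≠ 0 is represented twice, as (r, z) and (−r, −z). In d = 1 the unit sphere is U_1 = {−1, +1}, the integral over U_1 becomes a sum over two points, and |r|^{d−1} = 1:

```latex
\frac{(2\pi)^{-1/2}}{2} \sum_{z \in \{-1, +1\}} \int_{-\infty}^{\infty} e^{-r^2/2} f(rz)\, dr
= \frac{(2\pi)^{-1/2}}{2} \left[ \int_{-\infty}^{\infty} e^{-r^2/2} f(r)\, dr
  + \int_{-\infty}^{\infty} e^{-r^2/2} f(-r)\, dr \right]
= (2\pi)^{-1/2} \int_{-\infty}^{\infty} e^{-w^2/2} f(w)\, dw = I(f).
```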

slide-9
SLIDE 9

Our method uses polar form of the integral

Integration over radius r:

$$\int_{-\infty}^{\infty} e^{-r^2/2}\, |r|^{d-1} h(r)\, dr$$

slide-10
SLIDE 10

Our method uses polar form of the integral

Integration over radius r: use radial rules

$$R(h) = \sum_{i=0}^{l} \hat{w}_i\, \frac{h(\rho_i) + h(-\rho_i)}{2}$$
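A small numerical sketch of a radial rule, assuming the two-node case l = 1 with ρ_0 = 0, ρ_1 = ρ and weights ŵ_0 = 1 − d/ρ², ŵ_1 = d/ρ² (the weights that appear in the SR rule on the Quadrature-based Features slide). The radial integral is normalised here so that h ≡ 1 integrates to 1; that normalisation, the grid-based reference integral, and the helper names are assumptions of this example. The rule reproduces the monomials 1, r, r², r³ for any ρ > 0.

```python
import numpy as np

def radial_integral(h, d, r_max=12.0, num=200_001):
    """Normalised integral of exp(-r^2/2) |r|^(d-1) h(r) over the real line,
    computed with a plain Riemann sum on a fine symmetric grid."""
    r = np.linspace(-r_max, r_max, num)
    weight = np.exp(-0.5 * r**2) * np.abs(r) ** (d - 1)
    return float(np.sum(weight * h(r)) / np.sum(weight))

def radial_rule(h, d, rho):
    """R(h) = sum_i w_i (h(rho_i) + h(-rho_i)) / 2 with nodes 0 and rho."""
    nodes = np.array([0.0, rho])
    weights = np.array([1.0 - d / rho**2, d / rho**2])
    return float(np.sum(weights * (h(nodes) + h(-nodes)) / 2.0))

d, rho = 6, 2.5
monomials = [lambda r: r**0, lambda r: r, lambda r: r**2, lambda r: r**3]
errors = [abs(radial_rule(h, d, rho) - radial_integral(h, d)) for h in monomials]

# The rule is not exact for r^4, which is why it has degree 3.
error_degree4 = abs(radial_rule(lambda r: r**4, d, rho)
                    - radial_integral(lambda r: r**4, d))
```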

slide-11
SLIDE 11

Our method uses polar form of the integral

Integration over the unit d-sphere U_d: use spherical rules to approximate ∫_{U_d} s(z) dz

$$S_Q(s) = \sum_{j=1}^{p} \tilde{w}_j\, s(Q z_j)$$
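A sketch of one simple spherical rule, assuming the points z_j = ±e_i (p = 2d points) with equal weights 1/(2d) and the sphere measure normalised to total mass 1. These choices and the helper names are assumptions for illustration; the slides do not fix a particular rule here. For any orthogonal Q the rule reproduces the sphere averages of polynomials up to degree 3.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian
    matrix; the sign correction makes the distribution uniform (Haar)."""
    A = rng.normal(size=(d, d))
    Q, R = np.linalg.qr(A)
    return Q * np.sign(np.diag(R))

def spherical_rule(s, Q):
    """S_Q(s) = sum_j w_j s(Q z_j) with z_j = +/- e_i and weights 1/(2d)."""
    d = Q.shape[0]
    points = np.hstack([Q, -Q]).T        # rows are Q e_i and -Q e_i
    return sum(s(z) for z in points) / (2 * d)

rng = np.random.default_rng(0)
d = 5
Q = random_orthogonal(d, rng)

avg_const = spherical_rule(lambda z: 1.0, Q)            # sphere average is 1
avg_square = spherical_rule(lambda z: z[0] ** 2, Q)     # sphere average is 1/d
avg_cross = spherical_rule(lambda z: z[0] * z[1], Q)    # sphere average is 0
avg_cubic = spherical_rule(lambda z: z[0] ** 3, Q)      # sphere average is 0
```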

slide-12
SLIDE 12

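Quadrature-based Features

[Genz and Monahan, 1998] introduced Spherical-Radial (SR) rules.
We propose to estimate the integral by SR rules.
Sample complexity is 𝒪(ε⁻²), with a constant smaller than that of RFF.

$$SR^{3,3}_{Q,\rho}(f_{xy}) = \left(1 - \frac{d}{\rho^2}\right) f_{xy}(0) + \frac{d}{d+1} \sum_{j=1}^{d+1} \left[ \frac{f_{xy}(-\rho Q v_j) + f_{xy}(\rho Q v_j)}{2\rho^2} \right]$$

$$I(f_{xy}) = \mathbb{E}_{Q,\rho}\left[ SR^{3,3}_{Q,\rho}(f_{xy}) \right] \approx \hat{I}(f_{xy}) = \frac{1}{n} \sum_{i=1}^{n} SR^{3,3}_{Q_i,\rho_i}(f_{xy})$$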
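A sketch of the SR(3,3) estimator for the Gaussian kernel, where f_xy(w) = cos(w⊤(x − y)). Two ingredients are not stated on the slide and are assumptions here: the v_j are taken to be the d + 1 vertices of a regular simplex on the unit sphere, and ρ is drawn from χ(d + 2), the choice under which the rule is unbiased. Q is a random orthogonal matrix. All helper names are illustrative.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Haar-distributed orthogonal matrix via QR of a Gaussian matrix."""
    A = rng.normal(size=(d, d))
    Q, R = np.linalg.qr(A)
    return Q * np.sign(np.diag(R))

def simplex_vertices(d):
    """Rows are d + 1 unit vectors v_j in R^d forming a regular simplex."""
    centered = np.eye(d + 1) - 1.0 / (d + 1)     # centred basis of R^(d+1)
    U, _, _ = np.linalg.svd(centered)
    V = centered @ U[:, :d]                      # coordinates in the d-dim subspace
    return V / np.linalg.norm(V, axis=1, keepdims=True)

def sr33(f, Q, rho, V):
    """One SR(3,3) sample; f acts on the rows of a matrix of points."""
    d = Q.shape[0]
    pts = rho * V @ Q.T                          # row j is rho * Q v_j
    sym = (f(-pts) + f(pts)) / (2.0 * rho**2)
    origin = f(np.zeros((1, d)))[0]
    return (1.0 - d / rho**2) * origin + d / (d + 1) * np.sum(sym)

rng = np.random.default_rng(0)
d, n = 8, 2000
x = np.full(d, 0.25)
y = np.full(d, -0.10)
delta = x - y
f_xy = lambda W: np.cos(W @ delta)               # Gaussian-kernel integrand

V = simplex_vertices(d)
samples = []
for _ in range(n):
    Q = random_orthogonal(d, rng)
    rho = np.sqrt(rng.chisquare(d + 2))          # rho ~ chi(d + 2) (assumption)
    samples.append(sr33(f_xy, Q, rho, V))

estimate = float(np.mean(samples))               # I_hat(f_xy)
exact = float(np.exp(-0.5 * delta @ delta))      # k(x, y) for unit bandwidth
```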

slide-13
SLIDE 13

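Our method generalizes RFF and ORF

RFF are SR rules of degree (1, 1):

$$SR^{(1,1)}_{Q,\rho} = \frac{f(\rho Q z) + f(-\rho Q z)}{2}, \quad \rho \sim \chi(d), \quad \rho Q z \sim \mathcal{N}(0, I) \implies SR^{(1,1)}_{Q,\rho} = f(w), \quad w \sim \mathcal{N}(0, I)$$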

slide-14
SLIDE 14

Our method generalizes RFF and ORF

Orthogonal Random Features (ORF) are SR rules of degree (1, 3):

$$SR^{(1,3)}_{Q,\rho} = \frac{1}{d} \sum_{i=1}^{d} \frac{f(\rho Q e_i) + f(-\rho Q e_i)}{2}, \quad \rho \sim \chi(d)$$
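A numerical sketch of both special cases, assuming a Haar-distributed Q, z = e_1, and the Gaussian-kernel integrand f_xy(w) = cos(w⊤(x − y)). The first check samples w = ρQz with ρ ∼ χ(d) and confirms that its covariance is close to the identity, as expected for w ∼ 𝒩(0, I). The second averages the degree (1, 3) rule over the d orthogonal directions ρQe_i; the average over i is used so that a constant f is reproduced exactly. Names are illustrative.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Haar-distributed orthogonal matrix via QR of a Gaussian matrix."""
    A = rng.normal(size=(d, d))
    Q, R = np.linalg.qr(A)
    return Q * np.sign(np.diag(R))

def sr13(f, Q, rho):
    """SR rule of degree (1, 3) on the points +/- rho Q e_i, averaged over i."""
    pts = rho * Q.T                      # row i is rho * Q e_i
    return float(np.mean((f(pts) + f(-pts)) / 2.0))

rng = np.random.default_rng(0)
d, n = 4, 10_000
x = np.full(d, 0.3)
y = np.full(d, -0.2)
delta = x - y
f_xy = lambda W: np.cos(W @ delta)       # even in w, so SR(1,1) equals f(w)

W = np.empty((n, d))
sr13_samples = np.empty(n)
for i in range(n):
    Q = random_orthogonal(d, rng)
    rho = np.sqrt(rng.chisquare(d))      # rho ~ chi(d)
    W[i] = rho * Q[:, 0]                 # SR(1,1) point w = rho Q z with z = e_1
    sr13_samples[i] = sr13(f_xy, Q, rho)

sample_cov = W.T @ W / n                 # close to the identity matrix
rff_estimate = float(np.mean(f_xy(W)))   # SR(1,1): plain Monte Carlo
orf_estimate = float(np.mean(sr13_samples))
exact = float(np.exp(-0.5 * delta @ delta))
```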

slide-15
SLIDE 15

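Faster mapping with orthogonal Q

  • Use orthogonal butterfly matrices with structured factors
  • Allow fast matrix-vector multiplication (𝒪(n log n))

$$B^{(4)} = \begin{bmatrix} c_1 & -s_1 & 0 & 0 \\ s_1 & c_1 & 0 & 0 \\ 0 & 0 & c_3 & -s_3 \\ 0 & 0 & s_3 & c_3 \end{bmatrix} \begin{bmatrix} c_2 & 0 & -s_2 & 0 \\ 0 & c_2 & 0 & -s_2 \\ s_2 & 0 & c_2 & 0 \\ 0 & s_2 & 0 & c_2 \end{bmatrix} = \begin{bmatrix} c_1 c_2 & -s_1 c_2 & -c_1 s_2 & s_1 s_2 \\ s_1 c_2 & c_1 c_2 & -s_1 s_2 & -c_1 s_2 \\ c_3 s_2 & -s_3 s_2 & c_3 c_2 & -s_3 c_2 \\ s_3 s_2 & c_3 s_2 & s_3 c_2 & c_3 c_2 \end{bmatrix}$$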
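A sketch of the 4×4 butterfly matrix, assuming c_i = cos θ_i and s_i = sin θ_i for three angles θ_1, θ_2, θ_3 (the slide does not define them, but this reading makes each factor a rotation and hence orthogonal). It builds B⁽⁴⁾ from its two sparse factors and also applies it to a vector factor by factor. Each factor touches every coordinate once, and a size-n butterfly has log₂ n factors, which is where the 𝒪(n log n) cost comes from. Function names are illustrative.

```python
import numpy as np

def rotation(theta):
    """2 x 2 rotation block [[c, -s], [s, c]]."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def butterfly4(theta1, theta2, theta3):
    """Dense B^(4) as the product of its two sparse factors."""
    left = np.zeros((4, 4))
    left[:2, :2] = rotation(theta1)
    left[2:, 2:] = rotation(theta3)
    right = np.kron(rotation(theta2), np.eye(2))
    return left @ right

def butterfly4_matvec(theta1, theta2, theta3, x):
    """Apply B^(4) to x one factor at a time, without forming the matrix."""
    c1, s1 = np.cos(theta1), np.sin(theta1)
    c2, s2 = np.cos(theta2), np.sin(theta2)
    c3, s3 = np.cos(theta3), np.sin(theta3)
    # Right factor: mixes coordinate pairs (0, 2) and (1, 3).
    y = np.array([c2 * x[0] - s2 * x[2], c2 * x[1] - s2 * x[3],
                  s2 * x[0] + c2 * x[2], s2 * x[1] + c2 * x[3]])
    # Left factor: mixes coordinate pairs (0, 1) and (2, 3).
    return np.array([c1 * y[0] - s1 * y[1], s1 * y[0] + c1 * y[1],
                     c3 * y[2] - s3 * y[3], s3 * y[2] + c3 * y[3]])

rng = np.random.default_rng(0)
t1, t2, t3 = rng.uniform(0.0, 2.0 * np.pi, size=3)
B = butterfly4(t1, t2, t3)
x = rng.normal(size=4)
fast_result = butterfly4_matvec(t1, t2, t3, x)
```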

slide-16
SLIDE 16

Kernel Approximation Accuracy (ours - B)

[Figure: relative approximation error ‖K − K̂‖/‖K‖ as a function of n (1 to 5). Rows: Arc-cosine 0, Arc-cosine 1, and Gaussian kernels. Columns: Powerplant, LETTER, USPS, MNIST, CIFAR100, and LEUKEMIA datasets. Methods compared: G, Gort, ROM, QMC, GQ, B.]
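The vertical axis of the accuracy plots is the relative error ‖K − K̂‖/‖K‖ between the exact kernel matrix K and its approximation K̂. A minimal sketch of that metric follows, assuming the Frobenius norm (the slide does not say which matrix norm is used); the function name and the toy matrices are illustrative.

```python
import numpy as np

def relative_error(K_exact, K_approx):
    """Relative approximation error ||K - K_hat|| / ||K||.
    np.linalg.norm on a 2-D array defaults to the Frobenius norm."""
    return float(np.linalg.norm(K_exact - K_approx) / np.linalg.norm(K_exact))

# Toy example: a 2 x 2 kernel matrix and a slightly perturbed approximation.
K = np.eye(2)
K_hat = np.array([[1.0, 0.1],
                  [0.1, 1.0]])
err = relative_error(K, K_hat)
```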

slide-17
SLIDE 17

Summary

Our method (quadrature-based features):

  • achieves higher accuracy
  • generalizes previous work
  • is applicable to a wide range of kernels
  • uses structured matrices

Poster #130