Random Fourier Features for Kernel Ridge Regression (Michael Kapralov)


SLIDE 1

Random Fourier Features for Kernel Ridge Regression

Michael Kapralov (EPFL)

(Joint work with H. Avron, C. Musco, C. Musco, A. Velingker and A. Zandieh)

SLIDE 2

Scalable machine learning algorithms with provable guarantees

In this talk: towards scalable numerical linear algebra in kernel spaces, with provable guarantees.

SLIDE 3

Linear regression

Input:

  • a sequence of d-dimensional data points x_1, ..., x_n ∈ R^d
  • values y_j = f(x_j), j = 1, ..., n

Output: a linear approximation to f.

Solve the least squares problem:

\[ \min_{\alpha \in \mathbb{R}^d} \sum_{j=1}^{n} |x_j^\top \alpha - y_j|^2 + \lambda \|\alpha\|_2^2 \]
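To make the setup concrete, here is a minimal sketch of this regularized least squares solve (our code, not from the talk; the normal-equations route is chosen for brevity):

```python
import numpy as np

def linear_ridge(X, y, lam):
    """Solve min_alpha ||X alpha - y||^2 + lam * ||alpha||_2^2
    via the normal equations (X^T X + lam*I) alpha = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```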

SLIDE 6

Kernel ridge regression

Input:

  • a sequence of d-dimensional data points x_1, ..., x_n ∈ R^d
  • values y_j = f(x_j), j = 1, ..., n

Output: an approximation from a class of 'smooth' functions on R^d.

[Figure: noisy data points together with the true function.]

SLIDE 8

Choose an embedding into a high dimensional feature space:

\[ \Psi : \mathbb{R} \to \mathbb{R}^D \]

The dimension D may be infinite (e.g. for the Gaussian kernel). Solve the least squares problem:

\[ \min_{\alpha \in \mathbb{R}^D} \sum_{j=1}^{n} |\Psi(x_j)\alpha - y_j|^2 + \lambda \|\alpha\|_2^2 \]

[Figure: the data points, without a fitted curve.]

SLIDE 9

Choose an embedding into a high dimensional feature space:

\[ \Psi : x \mapsto \frac{1}{(2\pi)^{1/4}} e^{-(\cdot - x)^2/4} \]

[Figure: Gaussian bumps centered at the data points x_1, ..., x_{10}.]

Solve the least squares problem:

\[ \min_{\alpha \in \mathbb{R}^D} \sum_{j=1}^{n} |\Psi(x_j)\alpha - y_j|^2 + \lambda \|\alpha\|_2^2 \]

SLIDE 17

Solve the least squares problem:

\[ \min_{\alpha \in \mathbb{R}^D} \sum_{j=1}^{n} |\Psi(x_j)\alpha - y_j|^2 + \lambda \|\alpha\|_2^2 \]

[Figure: true function, estimator, and data.]

After algebraic manipulations,

\[ \alpha^* = \Psi^\top (K + \lambda I)^{-1} y \]
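A sketch of how this closed form is used (our code, not the talk's): Ψ is never formed explicitly, since the prediction Ψ(x)α* = Σ_j k(x, x_j) c_j only needs kernel evaluations, where c = (K + λI)^{-1} y.

```python
import numpy as np

def gaussian_kernel(X, Y):
    # K[i, j] = exp(-||X[i] - Y[j]||^2 / 2), the kernel used on the next slides
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / 2)

def krr_fit_predict(X, y, X_test, lam):
    # c = (K + lam*I)^{-1} y; predictions are kernel combinations of c
    c = np.linalg.solve(gaussian_kernel(X, X) + lam * np.eye(len(X)), y)
    return gaussian_kernel(X_test, X) @ c
```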

SLIDE 20

Kernel ridge regression

Main computational effort: computing (K + λI)^{-1} y.

[Figure: K = ΨΨᵀ, where Ψ is the n × ∞ matrix with rows Ψ(x_j).]

The (i, j)-th entry of the Gaussian kernel matrix K is K_{ij} = e^{-(x_i - x_j)^2/2}.

SLIDE 22

How quickly can we compute (K + λI)^{-1} y? (Recall K_{ij} = e^{-(x_i - x_j)^2/2}.)

  • n^3 (or n^ω) time in full generality...
  • Ω(n^2) time needed when λ = 0, assuming SETH [Backurs-Indyk-Schmidt, NIPS'17].

In practice: find Z ∈ R^{n×s}, s ≪ n, such that K ≈ ZZᵀ, and use ZZᵀ + λI as a proxy for K + λI!

Can compute (ZZᵀ + λI)^{-1} y in O(ns^2) time and O(ns) space!
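One way to see the O(ns^2) claim is the Woodbury identity, (ZZᵀ + λI)^{-1} = λ^{-1}(I − Z(λI_s + ZᵀZ)^{-1}Zᵀ); a minimal sketch (function name is ours):

```python
import numpy as np

def solve_low_rank_ridge(Z, y, lam):
    """Apply (Z Z^T + lam*I)^{-1} to y in O(n s^2) time via Woodbury,
    never forming the n x n matrix Z Z^T."""
    s = Z.shape[1]
    inner = np.linalg.solve(lam * np.eye(s) + Z.T @ Z, Z.T @ y)  # s x s system
    return (y - Z @ inner) / lam
```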

SLIDE 26

Fourier Features

Theorem (Bochner's Theorem)

A normalized continuous function k : R → R is a shift-invariant kernel if and only if its Fourier transform \hat{k} is a measure.

Let p(η) := \hat{k}(η). Then for every x_a, x_b:

\[ K_{ab} = k(x_a - x_b) = \int_{\mathbb{R}} \hat{k}(\eta)\, e^{-2\pi i (x_a - x_b)\eta}\, d\eta = \int_{\mathbb{R}} e^{-2\pi i (x_a - x_b)\eta}\, p(\eta)\, d\eta = \mathbb{E}_{\eta \sim p(\eta)}\!\left[ e^{-2\pi i (x_a - x_b)\eta} \right] \]
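A quick Monte Carlo check of this identity for the Gaussian kernel k(x) = e^{-x^2/2} (our sketch; with this Fourier convention, p is by our computation the N(0, (2π)^{-2}) density, which is an assumption about the slides' convention):

```python
import numpy as np

rng = np.random.default_rng(0)
xa, xb = 0.3, 1.1
eta = rng.normal(0.0, 1.0 / (2 * np.pi), size=200_000)  # eta ~ p
mc = np.exp(-2j * np.pi * (xa - xb) * eta).mean()        # E[e^{-2 pi i (xa - xb) eta}]
print(abs(mc - np.exp(-((xa - xb) ** 2) / 2)))           # close to 0
```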

SLIDE 28

Fourier Features

[Figure: K = AAᵀ, where A is the n × ∞ matrix with entries A_{j,η} = √(p(η)) e^{-2πi x_j η}.]

Rahimi-Recht'2007: fix s, sample i.i.d. η_1, ..., η_s ∼ p(η). Let the j-th row of Z be

\[ Z_{j,k} := \frac{1}{\sqrt{s}} e^{-2\pi i x_j \eta_k} \]

(samples of the pure frequency x_j) and use ZZᵀ as a proxy for K!
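A sketch of the construction (our code; the features are complex, so the proxy is really ZZ*, which the slides write as ZZᵀ):

```python
import numpy as np

def fourier_features(x, s, rng):
    """Z[j, k] = exp(-2 pi i x_j eta_k) / sqrt(s), with eta_k ~ p i.i.d.
    (here p is taken to be the Gaussian density matching k(x) = exp(-x^2/2))."""
    eta = rng.normal(0.0, 1.0 / (2 * np.pi), size=s)
    return np.exp(-2j * np.pi * np.outer(x, eta)) / np.sqrt(s)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=300)
Z = fourier_features(x, s=4000, rng=rng)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
print(np.abs(Z @ Z.conj().T - K).max())  # entrywise error shrinks as s grows
```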

SLIDE 30

Fourier Features: sampling columns of the Fourier factorization of K

[Figure: K = AAᵀ for the n × ∞ matrix A with entries A_{j,η} = √(p(η)) e^{-2πi x_j η}; the sampled proxy is ZZᵀ with Z ∈ R^{n×s}.]

Column η has squared ℓ_2 norm n · p(η)!

Fourier features = sampling columns of A with probability proportional to squared column norms!

One has E[ZZᵀ] = K.

SLIDE 36

Spectral approximations

[Figure: the factorization K = AAᵀ with entries √(p(η)) e^{-2πi x_j η}.]

Our goal: find Z ∈ R^{n×s}, s ≪ n, such that

\[ (1-\varepsilon)(K + \lambda I) \preceq ZZ^\top + \lambda I \preceq (1+\varepsilon)(K + \lambda I)? \]

Subspace embeddings for kernel matrices that can be applied implicitly to points x_1, ..., x_n ∈ R^d?

Known for the polynomial kernel only: Avron et al., NIPS'2014, via TENSORSKETCH.

SLIDE 39

Spectral approximation via column sampling

[Figure: K = AAᵀ for an n × D matrix A.]

For each j = 1, ..., D compute a sampling probability τ(j). Sample s columns independently from the distribution τ; if column j is sampled, include it in Z with weight 1/√(s · τ(j)).

That way E[ZZᵀ] = K.

Choose τ to ensure ZZᵀ is spectrally close to K with high probability?
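A sketch for finite D (synthetic A, all names ours), showing why the 1/√(s·τ(j)) weights make the estimator unbiased: each sampled term contributes a_j a_jᵀ/(s·τ(j)), so E[ZZᵀ] = Σ_j τ(j)·a_j a_jᵀ/τ(j) = AAᵀ = K.

```python
import numpy as np

rng = np.random.default_rng(1)
n, D, s = 50, 400, 5000
A = rng.standard_normal((n, D))
K = A @ A.T

tau = rng.random(D)
tau /= tau.sum()                        # an arbitrary sampling distribution tau
idx = rng.choice(D, size=s, p=tau)      # s independent column draws
Z = A[:, idx] / np.sqrt(s * tau[idx])   # weight sampled column j by 1/sqrt(s*tau(j))
print(np.linalg.norm(Z @ Z.T - K) / np.linalg.norm(K))  # small for large s
```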

SLIDE 42

Ridge leverage scores

Define the λ-ridge leverage scores by

\[ \tau_\lambda(j) := a_j^\top (K + \lambda I)^{+} a_j \]

The number of samples required ≈ the statistical dimension of K:

\[ s_\lambda(K) := \mathrm{tr}\big((K + \lambda I)^{+} K\big) = \sum_{j=1}^{n} \frac{\lambda_j}{\lambda_j + \lambda} \]

Statistical dimension ≈ (# eigenvalues above λ) + (sum of eigenvalues below λ)/λ.
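A numeric rendering for a finite factorization K = AAᵀ (our sketch): the ridge leverage scores sum to exactly the statistical dimension, since Σ_j τ_λ(j) = tr((K + λI)^{+} AAᵀ) = s_λ(K).

```python
import numpy as np

rng = np.random.default_rng(2)
n, D, lam = 50, 400, 5.0
A = rng.standard_normal((n, D))
K = A @ A.T

inv = np.linalg.inv(K + lam * np.eye(n))        # (K + lam*I) is PD, so inverse = pseudoinverse here
tau = np.einsum('ij,jk,ik->i', A.T, inv, A.T)   # tau_lambda(j) = a_j^T (K + lam*I)^{-1} a_j
eigs = np.linalg.eigvalsh(K)
s_lam = (eigs / (eigs + lam)).sum()             # statistical dimension of K
print(tau.sum(), s_lam)                         # the two agree
```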

SLIDE 45

[Figure: K = AAᵀ for an n × D matrix A.]

Theorem (Folklore)

Suppose that

  • for each i = 1, ..., s one has Z_i ∼ a_j with probability proportional to τ_λ(j), independently;
  • s = O(ε^{-2} s_λ log s_λ).

Then (1−ε)(K + λI) ⪯ ZZᵀ + λI ⪯ (1+ε)(K + λI) with high probability.

SLIDE 46

Q1: do Fourier Features provide spectral guarantees with Õ(s_λ) samples?

This paper: NO, not even in dimension d = 1.

Q1': how many samples are necessary and sufficient for spectral guarantees?

This paper: (essentially) tight bounds.

Q2: a better sampling scheme with O(s_λ) samples?

This paper: YES, at least in constant dimension for bounded datasets.

SLIDE 50

  • Leverage score density function
  • Primal-dual characterization
  • Tight lower bound for Fourier Features
23 / 43

slide-52
SLIDE 52

n

  • p(η)z(η)

n ∞ =

K For each η ∈ R let z(η)j := e−2πxjη and let dµ(η) := p(η)dη so that K =

  • Rz(η)z(η)∗dµ(η).

24 / 43

slide-53
SLIDE 53

n

  • p(η)z(η)

n ∞ =

K For each η ∈ R let z(η)j := e−2πxjη and let dµ(η) := p(η)dη so that K =

  • Rz(η)z(η)∗dµ(η).

Define the ridge leverage score function

τλ(η) := p(η)z(η)∗(K +λI)−1z(η)

24 / 43

SLIDE 54

Define the ridge leverage score function

\[ \tau_\lambda(\eta) := p(\eta)\, z(\eta)^* (K + \lambda I)^{-1} z(\eta) \]

Lemma

For every η ∈ R,

\[ \tau_\lambda(\eta) \le p(\eta) \cdot \frac{n}{\lambda} \]

Proof:

\[ \tau_\lambda(\eta) = p(\eta)\, z(\eta)^* (K + \lambda I)^{-1} z(\eta) \le p(\eta)\, z(\eta)^* z(\eta)/\lambda = p(\eta)\, \|z(\eta)\|_2^2/\lambda = p(\eta) \cdot \frac{n}{\lambda} \]

SLIDE 56

Theorem

For every kernel k, any dataset x_1, ..., x_n, and any ε ∈ (0, 1/2): if Z is a Fourier Features matrix with

\[ s = O\!\left(\frac{1}{\varepsilon^2} \cdot \frac{n}{\lambda} \cdot s_\lambda \log s_\lambda\right) \]

columns, then (1−ε)(K + λI) ⪯ ZZᵀ + λI ⪯ (1+ε)(K + λI) with high probability.

Is this good? Usually λ = ω(1) (e.g. λ = √n), and definitely λ ≤ n.

Is this best possible? Basically YES, even for 1-d datasets!

Can we do better? YES, at least for bounded datasets in constant dimension.

SLIDE 60

Assume: the dimension d is constant (one in the pictures), the kernel is Gaussian, and the data points belong to [−R, +R].

[Figure: data points x_1, ..., x_n on the interval [−R, +R].]

SLIDE 61

Theorem (Upper bound, informal)

For every |η| ≤ 10√(log(n/λ)):

\[ \tau_\lambda(\eta) \le 25 \max\!\big(R,\; 3000 \log^{1.5}(n/\lambda)\big). \]

[Figure: Gaussian density p(η) and the upper bound, over the frequency range [−10√(log(n/λ)), +10√(log(n/λ))]; data points x_1, ..., x_n lie in [−R, +R].]

SLIDE 63

Theorem (Lower bound, informal)

For integer n, regularization parameter λ, and radius R¹, there exist x_1, ..., x_n ∈ [−R, R] such that for every η ∈ [−100√(log(n/λ)), +100√(log(n/λ))]:

\[ \tau_\lambda(\eta) \ge \frac{R}{150} \cdot \frac{p(\eta)}{p(\eta) + 2R(\lambda/n)}. \]

[Figure: Gaussian density p(η) and the lower bound, over the frequency range [−√(2 log(n/(Rλ))), +√(2 log(n/(Rλ)))]; data points x_1, ..., x_n lie in [−R, +R].]

¹Restrictions apply

SLIDE 65

[Figure: K = AAᵀ for an n × ∞ matrix A.]

Theorem

Suppose that

  • for each i = 1, ..., s one has Z_i ∼ a_η with probability density proportional to τ_λ(η), independently;
  • s = O(ε^{-2} s_λ log s_λ).

Then (1−ε)(K + λI) ⪯ ZZᵀ + λI ⪯ (1+ε)(K + λI) with high probability.

Statistical dimension: s_λ(K) = \sum_{j=1}^{n} \frac{\lambda_j}{\lambda_j + \lambda}.

SLIDE 66

  • Leverage score density function
  • Primal-dual characterization
  • Tight lower bound for Fourier Features

SLIDE 68

Define the operator Φ : L_2(dµ) → C^n by

\[ \Phi y = \int_{\mathbb{R}} z(\xi)\, y(\xi) \, d\mu(\xi). \]

[Figure: Φ as an n × ∞ matrix applied to a function y.]

SLIDE 69

We have ΦΦ* = K.

[Figure: the factorization K = ΦΦ*.]

SLIDE 70

Lemma

The ridge leverage function can alternatively be defined as follows:

\[ \tau_\lambda(\eta) = \min_{y \in L_2(d\mu)} \; \lambda^{-1} \big\| \Phi y - \sqrt{p(\eta)}\, z(\eta) \big\|_2^2 + \|y\|_{L_2(d\mu)}^2 \]

Intuition: recombine many columns of Φ to approximately get our column (i.e. frequency η).

[Figure: Φy ≈ z(η).]

SLIDE 71

For a function y ∈ L_2(dµ),

\[ \Phi y = \int_{\mathbb{R}} z(\xi)\, y(\xi) \, d\mu(\xi) \]

Fix η ∈ R. We want to upper bound

\[ \tau_\lambda(\eta) = \min_{y \in L_2(d\mu)} \; \lambda^{-1} \big\| \Phi y - \sqrt{p(\eta)}\, z(\eta) \big\|_2^2 + \|y\|_{L_2(d\mu)}^2 \]

[Figure: the test function, a Gaussian times sinc; equivalently, in the other domain, a Gaussian convolved with a box supported on [−R, +R].]

SLIDE 73

Lemma

The ridge leverage function can alternatively be defined as follows:

\[ \tau_\lambda(\eta) = \max_{\alpha \in \mathbb{C}^n} \frac{p(\eta) \cdot |\alpha^* z(\eta)|^2}{\|\Phi^* \alpha\|_{L_2(d\mu)}^2 + \lambda \|\alpha\|_2^2} \]

Intuition: recombine rows of Φ to create a 'localized' vector. Similar construction of test functions.

[Figure: Φ* applied to α.]

SLIDE 74

  • Leverage score density function
  • Primal-dual characterization
  • Tight lower bound for Fourier Features

SLIDE 75

Tight lower bound – proof idea

Need: for every α ∈ R^n,

\[ \alpha^\top K \alpha + \lambda \|\alpha\|_2^2 \in (1 \pm \varepsilon)\big(\alpha^\top Z Z^\top \alpha + \lambda \|\alpha\|_2^2\big) \]

For a vector α ∈ R^n,

\[ \alpha^\top K \alpha = \int_{\mathbb{R}^d} p(\eta) \Big| \sum_{j=1}^{n} e^{-2\pi i x_j^\top \eta} \alpha_j \Big|^2 \, d\eta \;\approx\; \frac{1}{s} \sum_{k=1}^{s} \Big| \sum_{j=1}^{n} e^{-2\pi i x_j^\top \eta_k} \alpha_j \Big|^2, \]

with i.i.d. samples η_k drawn from the Gaussian density p(η) ∝ e^{-η²/2}.

[Figure: data points x_1, ..., x_n in [−R, +R].]

SLIDE 81

Experiments: one-dimensional

Sample from the function f⋆(x) = sin(6x) + sin(60 exp(x)). Use a 400-point uniform grid spanning [−5/2π, +5/2π], and sample according to y_i = f⋆(x_i) + ν_i, where the ν_i are i.i.d. Gaussian noise.

[Figure: left, true function, estimator, and data; right, true function, MRF estimator, and CRF estimator.]
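A sketch of a classical-features baseline on this target (our reading of the interval as [−5/(2π), +5/(2π)], and the noise level, s, and λ, are all our assumptions, not reported here):

```python
import numpy as np

rng = np.random.default_rng(3)
n, s, lam, noise = 400, 200, 1e-2, 0.2   # assumed hyperparameters

x = np.linspace(-5 / (2 * np.pi), 5 / (2 * np.pi), n)
f_star = np.sin(6 * x) + np.sin(60 * np.exp(x))
y = f_star + noise * rng.standard_normal(n)

eta = rng.normal(0.0, 1.0 / (2 * np.pi), size=s)         # classical Fourier features
Z = np.exp(-2j * np.pi * np.outer(x, eta)) / np.sqrt(s)
c = np.linalg.solve(Z.conj().T @ Z + lam * np.eye(s), Z.conj().T @ y)
print(np.mean(((Z @ c).real - f_star) ** 2))             # error against the true function
```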

SLIDE 83

Experiments: two-dimensional

f⋆(x, z) = (sin(x) + sin(10 exp(x)))(sin(z) + sin(10 exp(z))). Sample points on a 40×40 uniform grid.

SLIDE 85

Summary

Our results:

  • tight bounds for Fourier Features for bounded datasets in constant dimension
  • tight bounds on the leverage score function for bounded datasets in any constant dimension

Open questions: subspace embeddings with poly(d) dependence? Tight bounds for worst-case datasets? Does Rahimi-Recht work on 'typical' datasets? Other kernels?

Thank you!