Random Fourier Features for Kernel Ridge Regression


  1. Random Fourier Features for Kernel Ridge Regression. Michael Kapralov (EPFL). Joint work with H. Avron, C. Musco, C. Musco, A. Velingker, and A. Zandieh.

  2. Scalable machine learning algorithms with provable guarantees. In this talk: towards scalable numerical linear algebra in kernel spaces, with provable guarantees.

  3-5. Linear regression. Input: a sequence of d-dimensional data points x_1, ..., x_n ∈ R^d and values y_j = f(x_j), j = 1, ..., n. Output: a linear approximation to f. Solve the least squares problem
       min_{α ∈ R^d}  Σ_{j=1}^n |x_j α − y_j|^2 + λ ||α||_2^2
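
A minimal numpy sketch of this regularized least squares problem (the closed form α* = (XᵀX + λI)^{-1} Xᵀy is the standard ridge solution rather than something stated on the slide, and the data and λ below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1

X = rng.normal(size=(n, d))                              # rows are the points x_1, ..., x_n
y = X @ rng.normal(size=d) + 0.01 * rng.normal(size=n)   # noisy linear targets

# Regularized least squares: min_alpha ||X alpha - y||^2 + lam * ||alpha||^2
alpha = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(alpha)
```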

  6-7. Kernel ridge regression. Input: a sequence of d-dimensional data points x_1, ..., x_n ∈ R^d and values y_j = f(x_j), j = 1, ..., n. Output: an approximation from a class of 'smooth' functions on R^d. [Plot: the true function and the sampled data points.]

  8. Choose an embedding into a high-dimensional feature space Ψ : R → R^D. The dimension D may be infinite (e.g., the Gaussian kernel). Solve the least squares problem
     min_{α ∈ R^D}  Σ_{j=1}^n |Ψ(x_j) α − y_j|^2 + λ ||α||_2^2
     [Plot: the data points.]

  9-16. Choose an embedding into a high-dimensional feature space: Ψ : x → (2π)^{-1/4} e^{−(· − x)^2/4}, illustrated on points x_1, ..., x_10. Solve the least squares problem
        min_{α ∈ R^D}  Σ_{j=1}^n |Ψ(x_j) α − y_j|^2 + λ ||α||_2^2
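
As a sanity check that this map induces a shift-invariant kernel, one can numerically integrate the product of two such bumps. With this particular normalization the inner product works out to e^{−(x−y)^2/8}, a Gaussian in x − y (the bandwidth differs from the e^{−(x−y)^2/2} kernel used later in the deck, so treat the constants here as illustrative):

```python
import numpy as np

def psi(x, t):
    # Psi(x) evaluated at location t: (2*pi)^(-1/4) * exp(-(t - x)^2 / 4)
    return (2 * np.pi) ** (-0.25) * np.exp(-(t - x) ** 2 / 4)

t = np.linspace(-30, 30, 200001)             # fine grid for numerical integration
x, y = 0.3, 1.1

inner = np.trapz(psi(x, t) * psi(y, t), t)   # <Psi(x), Psi(y)> in L2
print(inner, np.exp(-(x - y) ** 2 / 8))      # the two values should agree
```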

  17-19. Solve the least squares problem
         min_{α ∈ R^D}  Σ_{j=1}^n |Ψ(x_j) α − y_j|^2 + λ ||α||_2^2
         [Plot: true function, estimator, and data.] After algebraic manipulations, α* = Ψ^T (K + λI)^{−1} y.
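
A small numpy sketch of the resulting estimator for the Gaussian kernel. The prediction rule f(x) = Σ_j [(K + λI)^{−1} y]_j · k(x, x_j) is the standard kernel ridge regression form implied by α* = Ψ^T (K + λI)^{−1} y; the toy data and λ are illustrative assumptions:

```python
import numpy as np

def gauss_kernel(a, b):
    # K_ij = exp(-(a_i - b_j)^2 / 2) for 1-d points
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / 2)

rng = np.random.default_rng(0)
n, lam = 100, 1e-2
x = np.sort(rng.uniform(-1, 1, n))
y = np.sin(4 * x) + 0.1 * rng.normal(size=n)    # noisy samples of a smooth f

K = gauss_kernel(x, x)
coef = np.linalg.solve(K + lam * np.eye(n), y)  # (K + lam*I)^{-1} y

x_test = np.linspace(-1, 1, 5)
f_hat = gauss_kernel(x_test, x) @ coef          # estimator at new points
print(f_hat)
```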

  20-21. Kernel ridge regression. Main computational effort: computing (K + λI)^{−1} y. [Diagram: K = Ψ Ψ^T, where Ψ is n × ∞ with j-th row Ψ(x_j).] The (i, j)-th entry of the Gaussian kernel matrix K is K_ij = e^{−(x_i − x_j)^2/2}.

  22-25. How quickly can we compute (K + λI)^{−1} y? The (i, j)-th entry of the Gaussian kernel matrix K is K_ij = e^{−(x_i − x_j)^2/2}.
         n^3 (or n^ω) time in full generality... Ω(n^2) time is needed when λ = 0, assuming SETH (Backurs-Indyk-Schmidt, NIPS'17).
         In practice: find Z ∈ R^{n×s}, s ≪ n, such that K ≈ ZZ^T, and use ZZ^T + λI as a proxy for K + λI. One can then compute (ZZ^T + λI)^{−1} y in O(ns^2) time and O(ns) space.
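
The O(ns^2) claim can be seen from the Woodbury identity (ZZ^T + λI)^{−1} y = λ^{−1}(y − Z(λI_s + Z^T Z)^{−1} Z^T y), which only requires solving an s × s system. A sketch under that standard identity, with a made-up Z, is below:

```python
import numpy as np

def proxy_solve(Z, y, lam):
    """Compute (Z Z^T + lam*I)^{-1} y via the Woodbury identity.

    Only an s x s system is solved, so the cost is O(n s^2 + s^3)
    and only Z itself (O(n s) numbers) needs to be stored.
    """
    s = Z.shape[1]
    small = lam * np.eye(s) + Z.T @ Z                      # s x s
    return (y - Z @ np.linalg.solve(small, Z.T @ y)) / lam

# Quick check against the direct n x n solve on a small example.
rng = np.random.default_rng(0)
n, s, lam = 300, 20, 0.5
Z = rng.normal(size=(n, s))
y = rng.normal(size=n)

direct = np.linalg.solve(Z @ Z.T + lam * np.eye(n), y)
print(np.allclose(proxy_solve(Z, y, lam), direct))         # True
```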

  26-27. Fourier features. Theorem (Bochner's theorem): a normalized continuous function k : R → R is a shift-invariant kernel if and only if its Fourier transform k̂ is a non-negative measure.
         Let p(η) := k̂(η). Then for every x_a, x_b:
         K_ab = k(x_a − x_b) = ∫_R k̂(η) e^{−2πi(x_a − x_b)η} dη = ∫_R e^{−2πi(x_a − x_b)η} p(η) dη = E_{η∼p}[e^{−2πi(x_a − x_b)η}].
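
For the Gaussian kernel K_ab = e^{−(x_a − x_b)^2/2} used in this deck, p(η) is itself a Gaussian density, with standard deviation 1/(2π) under the e^{−2πiΔη} convention above (a standard Fourier-transform fact, not stated on the slide). A quick Monte Carlo check of the identity, with an arbitrary Δ and sample size, is sketched below:

```python
import numpy as np

rng = np.random.default_rng(0)
delta = 0.7                                  # x_a - x_b
m = 200_000

# Fourier transform of exp(-delta^2/2) under the e^{-2*pi*i*delta*eta}
# convention is a Gaussian density with standard deviation 1/(2*pi).
eta = rng.normal(0.0, 1.0 / (2 * np.pi), size=m)

estimate = np.mean(np.exp(-2j * np.pi * delta * eta)).real
print(estimate, np.exp(-delta ** 2 / 2))     # the two should be close
```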

  28-29. Fourier features. [Diagram: K = A Ā^T, where A is n × ∞ with entries A_{j,η} = √p(η) · e^{−2πi x_j η}.]
         Rahimi-Recht 2007: fix s and sample i.i.d. η_1, ..., η_s ∼ p(η). Let the j-th row of Z be Z_{j,k} := (1/√s) e^{−2πi x_j η_k} (samples of the pure frequency e^{−2πi x_j η}), and use ZZ^T as a proxy for K.
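
A sketch of this construction for the Gaussian kernel in one dimension (s and the data are illustrative; the frequency distribution is the one from the Bochner check above). Since the entries are complex, the proxy is taken as ZZ* and its real part is compared to K:

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 50, 4000
x = rng.uniform(-1, 1, n)

K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2)           # exact Gaussian kernel

eta = rng.normal(0.0, 1.0 / (2 * np.pi), size=s)          # eta_1, ..., eta_s ~ p
Z = np.exp(-2j * np.pi * np.outer(x, eta)) / np.sqrt(s)   # Z[j, k] = e^{-2 pi i x_j eta_k} / sqrt(s)

K_hat = (Z @ Z.conj().T).real                             # proxy for K
print(np.abs(K_hat - K).max())                            # shrinks as s grows
```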

  30-34. Fourier features: sampling columns of the Fourier factorization of K. [Diagram: K = A Ā^T, with column η of A having entries √p(η) · e^{−2πi x_j η}.]
         Column η has squared ℓ_2 norm n · p(η). So Fourier features = sampling columns of A with probability proportional to squared column norms.

  35. Fourier features: sampling columns of the Fourier factorization of K. [Diagram: K ≈ Z Z^T with Z ∈ R^{n×s}.] Column η has squared ℓ_2 norm n · p(η), so Fourier features sample columns of A with probability proportional to squared column norms. One has E[ZZ^T] = K.

  36-38. Spectral approximations. Our goal: find Z ∈ R^{n×s}, s ≪ n, such that
         (1 − ε)(K + λI) ⪯ ZZ^T + λI ⪯ (1 + ε)(K + λI).
         That is, subspace embeddings for kernel matrices that can be applied implicitly to the points x_1, ..., x_n ∈ R^d. Previously known only for the polynomial kernel: Avron et al., NIPS 2014, via TensorSketch.
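
The guarantee can be tested numerically: it holds with parameter ε exactly when every eigenvalue of (K + λI)^{−1/2}(ZZ^T + λI)(K + λI)^{−1/2} lies in [1 − ε, 1 + ε]. The sketch below computes that ε for a real-valued Z; the cosine/sine form of random Fourier features in the example is a standard real-valued variant, not something spelled out in the deck:

```python
import numpy as np

def spectral_error(K, Z, lam):
    """Smallest eps with (1-eps)(K+lam I) <= Z Z^T + lam I <= (1+eps)(K+lam I)."""
    n = K.shape[0]
    w, V = np.linalg.eigh(K + lam * np.eye(n))
    inv_sqrt = V @ np.diag(w ** -0.5) @ V.T                 # (K + lam I)^{-1/2}
    M = inv_sqrt @ (Z @ Z.T + lam * np.eye(n)) @ inv_sqrt
    evals = np.linalg.eigvalsh((M + M.T) / 2)               # symmetrize for safety
    return max(1 - evals.min(), evals.max() - 1)

# Example: real-valued random Fourier features (cosine/sine pairs) for the Gaussian kernel.
rng = np.random.default_rng(0)
n, s, lam = 40, 2000, 1e-1
x = rng.uniform(-1, 1, n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2)
eta = rng.normal(0.0, 1.0 / (2 * np.pi), size=s)
angles = 2 * np.pi * np.outer(x, eta)
Z = np.hstack([np.cos(angles), np.sin(angles)]) / np.sqrt(s)
print(spectral_error(K, Z, lam))                            # eps achieved by this Z
```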

  39-40. Spectral approximation via column sampling. [Diagram: K = A A^T with A ∈ R^{n×D}.] For each j = 1, ..., D compute a sampling probability τ(j). Sample s columns independently from the distribution τ; if column j is sampled, include it in Z with weight 1/√(s · τ(j)). That way E[ZZ^T] = K.
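
A finite-D illustration of this sampling scheme (the matrix A, the value of D, and the particular choice of τ below are placeholders; the point is only that the 1/√(s · τ(j)) reweighting makes ZZ^T an unbiased estimator of K = AA^T):

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, s = 30, 500, 200

A = rng.normal(size=(n, D))
K = A @ A.T

# Example probabilities: proportional to squared column norms (as with
# Fourier features); any valid distribution tau gives unbiasedness.
tau = (A ** 2).sum(axis=0)
tau = tau / tau.sum()

idx = rng.choice(D, size=s, p=tau)                 # sample s columns i.i.d. from tau
Z = A[:, idx] / np.sqrt(s * tau[idx])              # rescale each sampled column

print(np.abs(Z @ Z.T - K).max())                   # E[Z Z^T] = K; error shrinks with s
```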
