Falkon: optimal and efficient large scale kernel learning


  1. Falkon: optimal and efficient large scale kernel learning
Alessandro Rudi, INRIA - École Normale Supérieure
joint work with Luigi Carratino (UniGe), Lorenzo Rosasco (MIT - IIT)
July 6th – ISMP 2018

  2. Learning problem
The problem: find
  f_H = \arg\min_{f \in H} \mathcal{E}(f),  where  \mathcal{E}(f) = \int (y - f(x))^2 \, d\rho(x, y),
with ρ unknown but given (x_i, y_i)_{i=1}^n i.i.d. samples.
Basic assumptions:
◮ Tail assumption: \int |y|^p \, d\rho \le \frac{1}{2} p!\, \sigma^2 b^{p-2} for all p ≥ 2
◮ (H, ⟨·,·⟩_H) RKHS with bounded kernel K

  3. Kernel ridge regression
  \hat{f}_\lambda = \arg\min_{f \in H} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_H^2
  \hat{f}_\lambda(x) = \sum_{i=1}^n K(x, x_i)\, c_i,   (\hat{K} + \lambda n I)\, c = \hat{y}
Complexity:  Space O(n^2),  Kernel eval. O(n^2),  Time O(n^3)
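To make the complexity figures above concrete, here is a minimal sketch of exact KRR in Python, assuming a Gaussian kernel; the helper names (gaussian_kernel, krr_fit, krr_predict) are illustrative and not from the talk.

```python
# A minimal sketch of exact KRR with a Gaussian kernel (hypothetical helper
# names; X is an n x d array, y an n-vector) -- not code from the talk.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise squared distances, then the Gaussian (RBF) kernel matrix.
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def krr_fit(X, y, lam, sigma=1.0):
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)                    # O(n^2) space and kernel evaluations
    return np.linalg.solve(K + lam * n * np.eye(n), y)  # O(n^3) time

def krr_predict(X_train, c, X_test, sigma=1.0):
    return gaussian_kernel(X_test, X_train, sigma) @ c  # f(x) = sum_i K(x, x_i) c_i
```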

  4. Random projections
Solve P_n on H_M = span{ K(\tilde{x}_1, ·), ..., K(\tilde{x}_M, ·) }:
  \hat{f}_{\lambda,M} = \arg\min_{f \in H_M} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_H^2
◮ ... that is, pick M columns at random
  \hat{f}_{\lambda,M}(x) = \sum_{i=1}^M K(x, \tilde{x}_i)\, c_i,   (\hat{K}_{nM}^\top \hat{K}_{nM} + \lambda n \hat{K}_{MM})\, c = \hat{K}_{nM}^\top \hat{y}
- Nyström methods (Smola, Schölkopf '00)
- Gaussian processes: inducing inputs (Quiñonero-Candela et al. '05)
- Galerkin methods and randomized linear algebra (Halko et al. '11)
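A minimal sketch of the Nyström system above, reusing the gaussian_kernel helper from the KRR sketch; uniform sampling of the M centers and the function name are illustrative assumptions.

```python
# Nystrom KRR sketch: pick M centers at random and solve the reduced M x M system.
import numpy as np

def nystrom_krr_fit(X, y, lam, M, sigma=1.0, seed=0):
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=M, replace=False)   # Nystrom centers x~_1, ..., x~_M
    Xm = X[idx]
    K_nM = gaussian_kernel(X, Xm, sigma)         # n x M block of the kernel matrix
    K_MM = gaussian_kernel(Xm, Xm, sigma)        # M x M kernel among the centers
    c = np.linalg.solve(K_nM.T @ K_nM + lam * n * K_MM, K_nM.T @ y)
    return Xm, c                                 # predict with sum_i K(x, x~_i) c_i
```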

  5. Nyström KRR: statistics (refined)
Let L f(x') = E[ K(x', x) f(x) ] and N(\lambda) = \mathrm{Tr}\big( (L + \lambda I)^{-1} L \big).
Capacity condition: N(\lambda) = O(\lambda^{-\gamma}), γ ∈ [0, 1]
Source condition: f_H ∈ Range(L^r), r ≥ 1/2
Theorem [Rudi, Camoriano, Rosasco '15]. Under (basic) and (refined),
  \mathbb{E}\, \mathcal{E}(\hat{f}_{\lambda,M}) - \mathcal{E}(f_H) \lesssim \frac{N(\lambda)}{n} + \lambda^{2r} + \frac{1}{M}.
By selecting \lambda_n = n^{-\frac{1}{2r+\gamma}} and M_n = \frac{1}{\lambda_n},
  \mathbb{E}\, \mathcal{E}(\hat{f}_{\lambda_n, M_n}) - \mathcal{E}(f_H) \lesssim n^{-\frac{2r}{2r+\gamma}}.
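As a worked instance of the theorem (consistent with the remark on the next slide that M = O(√n) suffices for O(1/√n) rates), take the worst-case parameters r = 1/2, γ = 1:

```latex
% Worked instance of the rate with r = 1/2, gamma = 1
\lambda_n = n^{-\frac{1}{2r+\gamma}} = n^{-1/2}, \qquad
M_n = \frac{1}{\lambda_n} = \sqrt{n}, \qquad
\mathbb{E}\,\mathcal{E}(\hat f_{\lambda_n, M_n}) - \mathcal{E}(f_H)
  \;\lesssim\; n^{-\frac{2r}{2r+\gamma}} = n^{-1/2}.
```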

  6. Remarks
◮ M = O(√n) suffices for O(1/√n) rates
◮ Previous works: only for fixed design (Bach '13; Alaoui, Mahoney '15; Yang et al. '15; Musco, Musco '16)
◮ Same minimax bound as KRR [Caponnetto, De Vito '05]
◮ Projection regularizes!

  7. Computations required for the O(1/√n) rate
Space: O(n),  Kernel eval.: O(n√n),  Time: O(n^2),  Test: O(√n)
Possible improvements:
◮ adaptive sampling
◮ optimization

  8. Optimization to the rescue
  (\hat{K}_{nM}^\top \hat{K}_{nM} + \lambda n \hat{K}_{MM})\, c = \hat{K}_{nM}^\top \hat{y},   written as  H c = b.
Idea: first-order methods
  c_t = c_{t-1} - \frac{\tau}{n} \left[ \hat{K}_{nM}^\top (\hat{K}_{nM} c_{t-1} - \hat{y}_n) + \lambda n \hat{K}_{MM} c_{t-1} \right]
Pros: requires O(nMt)
Cons: t ∝ κ(H) can be arbitrarily large; κ(H) = σ_max(H)/σ_min(H) is the condition number.
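A sketch of this plain first-order iteration on the Nyström system, with an illustrative function name; the step size τ is left to the user, and without preconditioning the number of iterations needed grows with κ(H).

```python
# Plain gradient descent on H c = b, H = K_nM^T K_nM + lam*n*K_MM (sketch).
import numpy as np

def nystrom_gradient_descent(K_nM, K_MM, y, lam, tau, t_max):
    n, M = K_nM.shape
    c = np.zeros(M)
    for _ in range(t_max):
        # Gradient of the reduced objective; each step costs O(nM).
        grad = K_nM.T @ (K_nM @ c - y) + lam * n * (K_MM @ c)
        c = c - (tau / n) * grad
    return c
```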

  9. Preconditioning
Idea: solve an equivalent linear system with a better condition number.
Preconditioning:
  H c = b   ↦   P^\top H P \beta = P^\top b,   c = P\beta.
Ideally P P^\top = H^{-1}, so that t = O(κ(H)) ↦ t = O(1)!
Note: preconditioning KRR (Fasshauer et al. '12; Avron et al. '16; Cutajar et al. '16; Ma, Belkin '17) targets H = K + λnI.
Can we precondition Nyström-KRR?

  10. Preconditioning Nyström-KRR
Consider H := \hat{K}_{nM}^\top \hat{K}_{nM} + \lambda n \hat{K}_{MM}.
Proposed preconditioning:
  P P^\top = \left( \frac{n}{M} \hat{K}_{MM}^2 + \lambda n \hat{K}_{MM} \right)^{-1}
Compare to naive preconditioning:
  P P^\top = \left( \hat{K}_{nM}^\top \hat{K}_{nM} + \lambda n \hat{K}_{MM} \right)^{-1}.

  11. Baby FALKON
Proposed preconditioning:
  P P^\top = \left( \frac{n}{M} \hat{K}_{MM}^2 + \lambda n \hat{K}_{MM} \right)^{-1}
Gradient descent:
  \hat{f}_{\lambda,M,t}(x) = \sum_{i=1}^M K(x, \tilde{x}_i)\, c_{t,i},   c_t = P \beta_t
  \beta_t = \beta_{t-1} - \frac{\tau}{n} P^\top \left[ \hat{K}_{nM}^\top (\hat{K}_{nM} P \beta_{t-1} - \hat{y}_n) + \lambda n \hat{K}_{MM} P \beta_{t-1} \right]
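A sketch of the Baby FALKON iteration: gradient descent in the preconditioned variable β, with c = Pβ and PP^⊤ equal to the proposed preconditioner. Forming P through a dense inverse here is purely for illustration; the Cholesky-based construction on the next slide avoids it.

```python
# Baby FALKON sketch: preconditioned gradient descent, c = P beta,
# P P^T = (n/M K_MM^2 + lam*n*K_MM)^{-1}.
import numpy as np

def baby_falkon(K_nM, K_MM, y, lam, tau, t_max):
    n, M = K_nM.shape
    # Any square root of (n/M K_MM^2 + lam n K_MM)^{-1} works as P.
    B = (n / M) * K_MM @ K_MM + lam * n * K_MM
    P = np.linalg.cholesky(np.linalg.inv(B))      # dense, illustrative only
    beta = np.zeros(M)
    for _ in range(t_max):
        c = P @ beta
        grad = K_nM.T @ (K_nM @ c - y) + lam * n * (K_MM @ c)
        beta = beta - (tau / n) * (P.T @ grad)
    return P @ beta                               # coefficients c
```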

  12. FALKON
◮ Gradient descent ↦ conjugate gradient
◮ Computing P:
  P = \frac{1}{\sqrt{n}} T^{-1} A^{-1},   T = \mathrm{chol}(K_{MM}),   A = \mathrm{chol}\!\left( \frac{1}{M} T T^\top + \lambda I \right),
where chol(·) is the Cholesky decomposition.
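A sketch of the full FALKON solver under the assumptions above: build P from the two Cholesky factors (SciPy's cholesky returns the upper factor, so T^⊤T = K_MM), apply P and P^⊤ through triangular solves, and run conjugate gradient on P^⊤HPβ = P^⊤b. This is a simplified reading of the slide, not the authors' implementation.

```python
# FALKON sketch: Cholesky-based preconditioner + conjugate gradient.
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.sparse.linalg import LinearOperator, cg

def falkon_solve(K_nM, K_MM, y, lam, t_max=20):
    n, M = K_nM.shape
    T = cholesky(K_MM)                               # upper triangular, T.T @ T = K_MM
    A = cholesky(T @ T.T / M + lam * np.eye(M))      # upper triangular

    def apply_P(v):   # P v = (1/sqrt(n)) T^{-1} A^{-1} v
        return solve_triangular(T, solve_triangular(A, v)) / np.sqrt(n)

    def apply_Pt(v):  # P^T v = (1/sqrt(n)) A^{-T} T^{-T} v
        w = solve_triangular(T.T, v, lower=True)
        return solve_triangular(A.T, w, lower=True) / np.sqrt(n)

    def matvec(beta):  # beta -> P^T H P beta, H = K_nM^T K_nM + lam*n*K_MM
        c = apply_P(beta)
        return apply_Pt(K_nM.T @ (K_nM @ c) + lam * n * (K_MM @ c))

    op = LinearOperator((M, M), matvec=matvec)
    beta, _ = cg(op, apply_Pt(K_nM.T @ y), maxiter=t_max)
    return apply_P(beta)                             # coefficients c of the expansion
```

Applying P and P^⊤ via triangular solves keeps each CG iteration at O(nM + M^2) and never forms an n × n matrix, in line with the O(nMt) cost claimed earlier.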

  13. Falkon statistics
Theorem. Under (basic) and (refined), when M > \frac{\log n}{\lambda},
  \mathbb{E}\, \mathcal{E}(\hat{f}_{\lambda,M,t}) - \mathcal{E}(f_H) \lesssim \frac{N(\lambda)}{n} + \lambda^{2r} + \frac{1}{M} + \exp\!\left( -t \left( 1 - \frac{\log n}{\lambda M} \right)^{1/2} \right).
By selecting
  \lambda_n = n^{-\frac{1}{2r+\gamma}},   M_n = \frac{2 \log n}{\lambda_n},   t_n = \log n,
then
  \mathbb{E}\, \mathcal{E}(\hat{f}_{\lambda_n, M_n, t_n}) - \mathcal{E}(f_H) \lesssim n^{-\frac{2r}{2r+\gamma}}.

  14. Remarks
◮ Same rates and memory as Nyström-KRR, with much smaller time complexity. For the O(1/√n) rate:
  Model: O(√n),  Space: O(n),  Kernel eval.: O(n√n),  Time: O(n^2) → O(n√n)
Related (worse complexity):
◮ EigenPro (Belkin et al. '16)
◮ SGD (Smale, Yao '05; Tarres, Yao '07; Ying, Pontil '08; Bach et al. '14-...)
◮ RF-KRR (Rahimi, Recht '07; Bach '15; Rudi, Rosasco '17)
◮ Divide and conquer (Zhang et al. '13)
◮ NYTRO (Angles et al. '16)
◮ Nyström SGD (Lin, Rosasco '16)

  15. In practice
Higgs dataset: n = 10,000,000, M = 50,000
[plot omitted]

  16. Some experiments
MillionSongs (n ∼ 10^6), YELP (n ∼ 10^6), TIMIT (n ∼ 10^6):

| Method         | MSE (MillionSongs) | Rel. error (MillionSongs) | Time s (MillionSongs) | RMSE (YELP) | Time m (YELP) | c-err (TIMIT) | Time h (TIMIT) |
|----------------|--------------------|---------------------------|-----------------------|-------------|---------------|---------------|----------------|
| FALKON         | 80.30              | 4.51 × 10^-3              | 55                    | 0.833       | 20            | 32.3%         | 1.5            |
| Prec. KRR      | -                  | 4.58 × 10^-3              | 289†                  | -           | -             | -             | -              |
| Hierarchical   | -                  | 4.56 × 10^-3              | 293⋆                  | -           | -             | -             | -              |
| D&C            | 80.35              | -                         | 737∗                  | -           | -             | -             | -              |
| Rand. Feat.    | 80.93              | -                         | 772∗                  | -           | -             | -             | -              |
| Nyström        | 80.38              | -                         | 876∗                  | -           | -             | -             | -              |
| ADMM R. F.     | -                  | 5.01 × 10^-3              | 958†                  | -           | -             | -             | -              |
| BCD R. F.      | -                  | -                         | -                     | 0.949       | 42‡           | 34.0%         | 1.7‡           |
| BCD Nyström    | -                  | -                         | -                     | 0.861       | 60‡           | 33.7%         | 1.7‡           |
| KRR            | -                  | 4.55 × 10^-3              | -                     | 0.854       | 500‡          | 33.5%         | 8.3‡           |
| EigenPro       | -                  | -                         | -                     | -           | -             | 32.6%         | 3.9≀           |
| Deep NN        | -                  | -                         | -                     | -           | -             | 32.4%         | -              |
| Sparse Kernels | -                  | -                         | -                     | -           | -             | 30.9%         | -              |
| Ensemble       | -                  | -                         | -                     | -           | -             | 33.5%         | -              |

Table: MillionSongs, YELP and TIMIT datasets. Times obtained on: ‡ = cluster of 128 EC2 r3.2xlarge machines, † = cluster of 8 EC2 r3.8xlarge machines, ≀ = single machine with two Intel Xeon E5-2620, one Nvidia GTX Titan X GPU and 128GB of RAM, ⋆ = cluster with 512 GB of RAM and IBM POWER8 12-core processor, ∗ = unknown platform.

  17. Some more experiments
SUSY (n ∼ 10^6), HIGGS (n ∼ 10^7), IMAGENET (n ∼ 10^6):

| Method                | c-err (SUSY) | AUC (SUSY) | Time m (SUSY) | AUC (HIGGS) | Time h (HIGGS) | c-err (IMAGENET) | Time h (IMAGENET) |
|-----------------------|--------------|------------|---------------|-------------|----------------|------------------|-------------------|
| FALKON                | 19.6%        | 0.877      | 4             | 0.833       | 3              | 20.7%            | 4                 |
| EigenPro              | 19.8%        | -          | 6≀            | -           | -              | -                | -                 |
| Hierarchical          | 20.1%        | -          | 40†           | -           | -              | -                | -                 |
| Boosted Decision Tree | -            | 0.863      | -             | 0.810       | -              | -                | -                 |
| Neural Network        | -            | 0.875      | -             | 0.816       | -              | -                | -                 |
| Deep Neural Network   | -            | 0.879      | 4680‡         | 0.885       | 78‡            | -                | -                 |
| Inception-V4          | -            | -          | -             | -           | -              | 20.0%            | -                 |

Table: Architectures: † = cluster with IBM POWER8 12-core cpu, 512 GB RAM; ≀ = single machine with two Intel Xeon E5-2620, one Nvidia GTX Titan X GPU, 128GB RAM; ‡ = single machine.

  18. Contributions
◮ Best computational cost so far for optimal statistics: Time O(n√n), Space O(n)
◮ In the pipeline: adaptive sampling, general projections, SGD
◮ TBD: other losses, other regularizers, other problems, other solvers...

  19. Proof: bridging statistics and optimization
Lemma. Let δ > 0, \kappa_P := \kappa(P^\top H P), c_\delta = c_0 \log\frac{1}{\delta}. When \lambda \ge \frac{1}{n},
  \mathcal{E}(\hat{f}_{\lambda,M,t}) - \mathcal{E}(f_H) \le \mathcal{E}(\hat{f}_{\lambda,M}) - \mathcal{E}(f_H) + c_\delta \exp(-t/\sqrt{\kappa_P})
with probability 1 - δ.
Lemma. Let δ ∈ (0, 1], λ > 0. When M = \frac{2 \log\frac{1}{\delta}}{\lambda}, then
  \kappa(P^\top H P) \le \left( 1 - \frac{\log\frac{1}{\delta}}{\lambda M} \right)^{-1} < 4
with probability 1 - δ.

  20. Proving κ(P^⊤ H P) ≈ 1
Let K_x = K(x, ·) ∈ H, and
  C = \int K_x \otimes K_x \, d\rho_X(x),   \hat{C}_n = \frac{1}{n} \sum_{i=1}^n K_{x_i} \otimes K_{x_i},   \hat{C}_M = \frac{1}{M} \sum_{j=1}^M K_{\tilde{x}_j} \otimes K_{\tilde{x}_j}.
Recall that P = \frac{1}{\sqrt{n}} T^{-1} A^{-1}, with T = \mathrm{chol}(K_{MM}), A = \mathrm{chol}\!\left( \frac{1}{M} T T^\top + \lambda I \right).
Steps:
  1. P^\top H P = A^{-\top} V^* (\hat{C}_n + \lambda I) V A^{-1}
  2. P^\top H P = A^{-\top} V^* (\hat{C}_M + \lambda I) V A^{-1} + A^{-\top} V^* (\hat{C}_n - \hat{C}_M) V A^{-1}
  3. P^\top H P = I + A^{-\top} V^* (\hat{C}_n - \hat{C}_M) V A^{-1} = I + E,   with E = A^{-\top} V^* (\hat{C}_n - \hat{C}_M) V A^{-1}
