
1. Learning with random features
Alessandro Rudi, INRIA - École Normale Supérieure, Paris
Joint work with Lorenzo Rosasco (IIT-MIT)
January 17th, 2018 – Cambridge

2. Data + computers + machine learning = AI/Data science
◮ 1Y US data center = 1M houses
◮ Mobileye pays 1000 labellers
Can we make do with less? Beyond a theoretical divide → integrate statistics and numerics/optimization.

3. Outline
Part I: Random feature networks
Part II: Properties of RFN
Part III: Refined results on RFN

  4. Supervised learning

  5. Supervised learning

6. Supervised learning
Problem: given $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, find $f$ such that $f(x_{\mathrm{new}}) \sim y_{\mathrm{new}}$.

7. Neural networks
$$f(x) = \sum_{j=1}^{M} \beta_j \, \sigma(w_j^\top x + b_j)$$
◮ $\sigma: \mathbb{R} \to \mathbb{R}$ a nonlinear activation function.
◮ For $j = 1, \ldots, M$, $\beta_j, w_j, b_j$ are parameters to be determined.

8. Neural networks
$$f(x) = \sum_{j=1}^{M} \beta_j \, \sigma(w_j^\top x + b_j)$$
◮ $\sigma: \mathbb{R} \to \mathbb{R}$ a nonlinear activation function.
◮ For $j = 1, \ldots, M$, $\beta_j, w_j, b_j$ are parameters to be determined.
Some references
◮ History [McCulloch, Pitts '43; Rosenblatt '58; Minsky, Papert '69; Y. LeCun '85; Hinton et al. '06]
◮ Deep learning [Krizhevsky et al. '12 - 18705 Cit.!!!]
◮ Theory [Barron '92-94; Bartlett, Anthony '99; Pinkus '99]
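A minimal sketch of this one-hidden-layer model (not from the slides), assuming a ReLU activation and random toy parameters; all names here are illustrative.

```python
import numpy as np

def nn_forward(x, W, b, beta, sigma=lambda t: np.maximum(t, 0.0)):
    """One-hidden-layer network f(x) = sum_j beta_j * sigma(w_j^T x + b_j).

    x: (d,) input, W: (M, d) weights, b: (M,) biases, beta: (M,) output weights.
    """
    return sigma(W @ x + b) @ beta

# toy usage; in a standard neural network all of W, b, beta would be trained
d, M = 5, 100
rng = np.random.default_rng(0)
W, b, beta = rng.standard_normal((M, d)), rng.standard_normal(M), rng.standard_normal(M)
print(nn_forward(rng.standard_normal(d), W, b, beta))
```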

9. Random feature networks
$$f(x) = \sum_{j=1}^{M} \beta_j \, \sigma(w_j^\top x + b_j)$$
◮ $\sigma: \mathbb{R} \to \mathbb{R}$ a nonlinear activation function.
◮ For $j = 1, \ldots, M$, $\beta_j$ parameters to be determined.
◮ For $j = 1, \ldots, M$, $w_j, b_j$ chosen at random.

10. Random feature networks
$$f(x) = \sum_{j=1}^{M} \beta_j \, \sigma(w_j^\top x + b_j)$$
◮ $\sigma: \mathbb{R} \to \mathbb{R}$ a nonlinear activation function.
◮ For $j = 1, \ldots, M$, $\beta_j$ parameters to be determined.
◮ For $j = 1, \ldots, M$, $w_j, b_j$ chosen at random.
Some references
◮ Neural nets [Block '62], Extreme learning machines [Huang et al. '06] 5196 Cit.??
◮ Sketching / one-bit compressed sensing, see e.g. [Plan, Vershynin '11-14]: $x \mapsto \sigma(S^\top x)$, with $S$ a random matrix
◮ Gaussian processes / kernel methods [Neal '95; Rahimi, Recht '06, '08, '08]
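A hedged sketch of the random-feature variant: $w_j, b_j$ are drawn once and frozen, and only $\beta$ is fitted. The ReLU features, the synthetic data, and the plain least-squares fit are assumptions for illustration (the slides do not prescribe this particular fit).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, M = 200, 5, 100

# synthetic data
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# w_j, b_j drawn at random, then frozen
W = rng.standard_normal((M, d))
b = rng.uniform(0, 2 * np.pi, size=M)

Phi = np.maximum(X @ W.T + b, 0.0)               # features sigma(w_j^T x + b_j)
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # only beta is learned

f = lambda x: np.maximum(x @ W.T + b, 0.0) @ beta
```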

11. From RFN to PD kernels
$$\frac{1}{M} \sum_{j=1}^{M} \sigma(w_j^\top x + b_j)\,\sigma(w_j^\top x' + b_j) \;\approx\; K(x, x') = \mathbb{E}\big[\sigma(W^\top x + B)\,\sigma(W^\top x' + B)\big]$$

12. From RFN to PD kernels
$$\frac{1}{M} \sum_{j=1}^{M} \sigma(w_j^\top x + b_j)\,\sigma(w_j^\top x' + b_j) \;\approx\; K(x, x') = \mathbb{E}\big[\sigma(W^\top x + B)\,\sigma(W^\top x' + B)\big]$$
Example I: Gaussian kernel / random Fourier features [Rahimi, Recht '08]. Let $\sigma(\cdot) = \cos(\cdot)$, $W \sim \mathcal{N}(0, I)$ and $B \sim U[0, 2\pi]$; then
$$K(x, x') = e^{-\gamma \|x - x'\|^2}.$$
Example II: Arc-cosine kernel / ReLU features [Le Roux, Bengio '07; Cho, Saul '09]. Let $\sigma(\cdot) = |\cdot|_+$ and $(W, B) \sim U[\mathbb{S}^{d+1}]$; then
$$K(x, x') = \sin\theta + (\pi - \theta)\cos\theta, \qquad \theta = \arccos(x^\top x').$$
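A small numerical check of Example I (a sketch, not the speaker's code). It uses the common $\sqrt{2}\cos(w^\top x + b)$ normalization, under which the Monte Carlo average matches $e^{-\gamma\|x-x'\|^2}$ when $W \sim \mathcal{N}(0, 2\gamma I)$; the value of $\gamma$ is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, gamma = 3, 20000, 0.5

x, xp = rng.standard_normal(d), rng.standard_normal(d)

# random Fourier features: phi(x) = sqrt(2/M) * cos(W x + b)
W = rng.normal(scale=np.sqrt(2 * gamma), size=(M, d))
b = rng.uniform(0, 2 * np.pi, size=M)
phi = lambda z: np.sqrt(2.0 / M) * np.cos(W @ z + b)

approx = phi(x) @ phi(xp)                        # (1/M) sum_j sigma(..) sigma(..)
exact = np.exp(-gamma * np.linalg.norm(x - xp) ** 2)
print(approx, exact)                             # close for large M
```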

13. A general view
Let $X$ be a measurable space and $K: X \times X \to \mathbb{R}$ symmetric and positive definite.
Assumption (RF): there exist
◮ $W$, a random variable in $\mathcal{W}$ with law $\pi$,
◮ $\phi: \mathcal{W} \times X \to \mathbb{R}$, a measurable function,
such that for all $x, x' \in X$,
$$K(x, x') = \mathbb{E}[\phi(W, x)\,\phi(W, x')].$$

14. A general view
Let $X$ be a measurable space and $K: X \times X \to \mathbb{R}$ symmetric and positive definite.
Assumption (RF): there exist
◮ $W$, a random variable in $\mathcal{W}$ with law $\pi$,
◮ $\phi: \mathcal{W} \times X \to \mathbb{R}$, a measurable function,
such that for all $x, x' \in X$,
$$K(x, x') = \mathbb{E}[\phi(W, x)\,\phi(W, x')].$$
Random feature representation: given a sample $w_1, \ldots, w_M$ of $M$ i.i.d. copies of $W$, consider
$$K(x, x') \approx \frac{1}{M} \sum_{j=1}^{M} \phi(w_j, x)\,\phi(w_j, x').$$
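An illustrative helper (assumed, not from the talk) that packages Assumption (RF): it takes a sampler for $W$ and a feature map $\phi$ and returns the Monte Carlo kernel estimate. The Fourier-feature pair from Example I is used purely as a demo; `gamma` and the test points are made up.

```python
import numpy as np

def mc_kernel(x, xp, sample_w, phi, M=10000, seed=0):
    """Monte Carlo estimate K(x, x') ~= (1/M) sum_j phi(w_j, x) phi(w_j, x')."""
    rng = np.random.default_rng(seed)
    ws = [sample_w(rng) for _ in range(M)]
    return np.mean([phi(w, x) * phi(w, xp) for w in ws])

# demo: random Fourier feature pair for the Gaussian kernel (gamma assumed)
d, gamma = 3, 0.5
sample_w = lambda rng: (rng.normal(scale=np.sqrt(2 * gamma), size=d),
                        rng.uniform(0, 2 * np.pi))
phi = lambda w, x: np.sqrt(2.0) * np.cos(w[0] @ x + w[1])

x, xp = np.ones(d), np.zeros(d)
print(mc_kernel(x, xp, sample_w, phi), np.exp(-gamma * d))
```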

15. Functional view
Reproducing kernel Hilbert space (RKHS) [Aronszajn '50]: $\mathcal{H}_K$ is the space of functions
$$f(x) = \sum_{j=1}^{p} \beta_j K(x, x_j),$$
completed with respect to $\langle K_x, K_{x'} \rangle := K(x, x')$.

16. Functional view
Reproducing kernel Hilbert space (RKHS) [Aronszajn '50]: $\mathcal{H}_K$ is the space of functions
$$f(x) = \sum_{j=1}^{p} \beta_j K(x, x_j),$$
completed with respect to $\langle K_x, K_{x'} \rangle := K(x, x')$.
RFN spaces: $\mathcal{H}_{\phi,p}$ is the space of functions
$$f(x) = \int d\pi(w)\, \beta(w)\, \phi(w, x), \qquad \text{with } \|\beta\|_p^p = \mathbb{E}|\beta(W)|^p < \infty.$$

17. Functional view
Reproducing kernel Hilbert space (RKHS) [Aronszajn '50]: $\mathcal{H}_K$ is the space of functions
$$f(x) = \sum_{j=1}^{p} \beta_j K(x, x_j),$$
completed with respect to $\langle K_x, K_{x'} \rangle := K(x, x')$.
RFN spaces: $\mathcal{H}_{\phi,p}$ is the space of functions
$$f(x) = \int d\pi(w)\, \beta(w)\, \phi(w, x), \qquad \text{with } \|\beta\|_p^p = \mathbb{E}|\beta(W)|^p < \infty.$$
Theorem (Schoenberg '38, Aronszajn '50). Under Assumption (RF), $\mathcal{H}_K \simeq \mathcal{H}_{\phi,2}$.

18. Why should you care
RFN promises:
◮ Replace optimization with randomization in NN.
◮ Reduce the memory/time footprint of GP/kernel methods.

19. Outline
Part I: Random feature networks
Part II: Properties of RFN
Part III: Refined results on RFN

20. Kernel approximations
$$\tilde K(x, x') = \frac{1}{M} \sum_{j=1}^{M} \phi(w_j, x)\,\phi(w_j, x'), \qquad K(x, x') = \mathbb{E}[\phi(W, x)\,\phi(W, x')]$$
Theorem. Assume $\phi$ is bounded. Let $\mathcal{K} \subset X$ be compact; then w.h.p.
$$\sup_{x, x' \in \mathcal{K}} \big|K(x, x') - \tilde K(x, x')\big| \lesssim \frac{C_{\mathcal{K}}}{\sqrt{M}}.$$
◮ [Rahimi, Recht '08; Sutherland, Schneider '15; Sriperumbudur, Szabó '15]
◮ Empirical characteristic function [Feuerverger, Mureika '77; Csörgő '84; Yukich '87]
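A quick empirical illustration of the $1/\sqrt{M}$ behaviour (a sketch under the same Fourier-feature setup as above; the finite grid standing in for the compact set, `gamma`, and the values of $M$ are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma = 2, 0.5
Xgrid = rng.standard_normal((200, d))            # finite stand-in for the compact set K

def sup_error(M):
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(M, d))
    b = rng.uniform(0, 2 * np.pi, size=M)
    Phi = np.sqrt(2.0 / M) * np.cos(Xgrid @ W.T + b)
    K_tilde = Phi @ Phi.T                        # approximate kernel matrix
    sq = np.sum(Xgrid ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * Xgrid @ Xgrid.T))
    return np.max(np.abs(K - K_tilde))

for M in [100, 400, 1600, 6400]:
    print(M, sup_error(M))                       # error roughly halves as M quadruples
```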

21. Supervised learning
◮ $(X, Y)$ a pair of random variables in $X \times \mathbb{R}$.
◮ $L: \mathbb{R} \times \mathbb{R} \to [0, \infty)$ a loss function.
◮ $\mathcal{H} \subset \mathbb{R}^X$.
Problem: solve $\min_{f \in \mathcal{H}} \mathbb{E}[L(f(X), Y)]$ given only $(x_1, y_1), \ldots, (x_n, y_n)$, a sample of $n$ i.i.d. copies of $(X, Y)$.

22. Rahimi & Recht estimator
Ideally, $\mathcal{H} = \mathcal{H}_{\phi, \infty, R}$, the space of functions
$$f(x) = \int d\pi(w)\, \beta(w)\, \phi(w, x), \qquad \|\beta\|_\infty \le R.$$
In practice, $\mathcal{H} = \mathcal{H}_{\phi, \infty, R, M}$, the space of functions
$$f(x) = \sum_{j=1}^{M} \tilde\beta_j\, \phi(w_j, x), \qquad \sup_j |\tilde\beta_j| \le R.$$
Estimator:
$$\operatorname*{argmin}_{f \in \mathcal{H}_{\phi, \infty, R, M}} \; \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i).$$
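A hedged sketch of the "in practice" estimator. For concreteness it uses the squared loss (the slides allow any Lipschitz convex $L$) and SciPy's bounded least squares to enforce $\sup_j|\tilde\beta_j| \le R$; the data, the cosine features, and $R$ are made-up choices.

```python
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(0)
n, d, M, R = 300, 4, 200, 5.0

X = rng.standard_normal((n, d))
y = np.cos(X[:, 0]) + 0.1 * rng.standard_normal(n)

# random features phi(w_j, x) = cos(w_j^T x + b_j), with w_j, b_j fixed at random
W = rng.standard_normal((M, d))
b = rng.uniform(0, 2 * np.pi, size=M)
Phi = np.cos(X @ W.T + b)

# min_beta (1/n) sum_i (y_i - Phi_i beta)^2   s.t.   |beta_j| <= R for all j
res = lsq_linear(Phi, y, bounds=(-R, R))
beta = res.x
```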

23. Rahimi & Recht result
Theorem (Rahimi, Recht '08). Assume $L$ is $\ell$-Lipschitz and convex. If $\phi$ is bounded, then w.h.p.
$$\mathbb{E}[L(\hat f(X), Y)] - \min_{f \in \mathcal{H}_{\phi,\infty,R}} \mathbb{E}[L(f(X), Y)] \;\lesssim\; \ell R \left( \frac{1}{\sqrt{n}} + \frac{1}{\sqrt{M}} \right).$$
Other result: [Bach '15] replaced $\mathcal{H}_{\phi,\infty,R}$ with a ball in $\mathcal{H}_{\phi,2}$.
$R$ needs to be fixed, and $M = n$ is needed for $1/\sqrt{n}$ rates.

24. Our approach
For $f_\beta(x) = \sum_{j=1}^{M} \beta_j \phi(w_j, x)$, consider RF ridge regression:
$$\min_{\beta \in \mathbb{R}^M} \; \frac{1}{n} \sum_{i=1}^{n} (y_i - f_\beta(x_i))^2 + \lambda \sum_{j=1}^{M} |\beta_j|^2.$$

25. Our approach
For $f_\beta(x) = \sum_{j=1}^{M} \beta_j \phi(w_j, x)$, consider RF ridge regression:
$$\min_{\beta \in \mathbb{R}^M} \; \frac{1}{n} \sum_{i=1}^{n} (y_i - f_\beta(x_i))^2 + \lambda \sum_{j=1}^{M} |\beta_j|^2.$$
Computations:
$$\hat\beta_\lambda = (\hat\Phi^\top \hat\Phi + \lambda n I)^{-1} \hat\Phi^\top \hat y$$
◮ $\hat\Phi_{i,j} = \phi(w_j, x_i)$, the $n \times M$ data matrix
◮ $\hat y$, the $n \times 1$ vector of outputs
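A direct NumPy sketch of this closed form. The data, the cosine feature map, and the value of $\lambda$ are illustrative assumptions; in practice one would use a Cholesky solve.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, M, lam = 500, 4, 100, 1e-3

X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

W = rng.standard_normal((M, d))
b = rng.uniform(0, 2 * np.pi, size=M)
Phi = np.cos(X @ W.T + b)                        # n x M matrix, Phi[i, j] = phi(w_j, x_i)

# beta_lambda = (Phi^T Phi + lambda n I)^{-1} Phi^T y  --  O(n M^2) time, O(n M) memory
beta = np.linalg.solve(Phi.T @ Phi + lam * n * np.eye(M), Phi.T @ y)

predict = lambda Xnew: np.cos(Xnew @ W.T + b) @ beta
```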

26. Computational footprint
$$\hat\beta_\lambda = (\hat\Phi^\top \hat\Phi + \lambda n I)^{-1} \hat\Phi^\top \hat y$$
$O(nM^2)$ time and $O(Mn)$ memory cost.
Compare to $O(n^3)$ time and $O(n^2)$ memory using kernel methods/GPs.
What are the learning properties if $M < n$?

27. Worst case: basic assumptions
Noise: $\mathbb{E}[|Y|^p \mid X = x] \le \tfrac{1}{2}\, p!\, \sigma^2 b^{p-2}$, for all $p \ge 2$.
RF boundedness: under Assumption (RF), let $\phi$ be bounded.
Best model: there exists $f^\dagger$ solving $\min_{f \in \mathcal{H}_{\phi,2}} \mathbb{E}[(Y - f(X))^2]$.
Note:
◮ we allow considering the whole space $\mathcal{H}_{\phi,2}$ rather than a ball;
◮ we allow misspecified models (the regression function need not be in $\mathcal{H}$).

28. Worst case: analysis
Theorem (Rudi, R. '17). Under the basic assumptions, let $\hat f_\lambda = f_{\hat\beta_\lambda}$; then w.h.p.
$$\mathbb{E}[(Y - \hat f_\lambda(X))^2] - \mathbb{E}[(Y - f^\dagger(X))^2] \;\lesssim\; \frac{1}{\lambda n} + \lambda + \frac{1}{M},$$
so that, for
$$\lambda = O\!\left(\frac{1}{\sqrt{n}}\right), \qquad M = O\!\left(\frac{1}{\lambda}\right),$$
w.h.p.
$$\mathbb{E}[(Y - \hat f_\lambda(X))^2] - \mathbb{E}[(Y - f^\dagger(X))^2] \;\lesssim\; \frac{1}{\sqrt{n}}.$$
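The theorem's parameter choice, written out as a hypothetical helper around the ridge solver sketched above, with $M = \lceil\sqrt{n}\rceil$ and $\lambda = 1/\sqrt{n}$ (constants absorbed); `feature_sampler` and the cosine features are assumptions.

```python
import numpy as np

def rf_ridge_fit(X, y, feature_sampler, seed=0):
    """RF ridge regression with the theorem's scaling: M ~ sqrt(n), lambda ~ 1/sqrt(n)."""
    n, d = X.shape
    M = int(np.ceil(np.sqrt(n)))
    lam = 1.0 / np.sqrt(n)
    rng = np.random.default_rng(seed)
    W, b = feature_sampler(rng, M, d)            # user-supplied sampler for (w_j, b_j)
    Phi = np.cos(X @ W.T + b)
    beta = np.linalg.solve(Phi.T @ Phi + lam * n * np.eye(M), Phi.T @ y)
    return lambda Xnew: np.cos(Xnew @ W.T + b) @ beta

# example sampler: Gaussian weights, uniform phases (assumed Fourier-type features)
sampler = lambda rng, M, d: (rng.standard_normal((M, d)), rng.uniform(0, 2 * np.pi, M))
# usage: f_hat = rf_ridge_fit(X, y, sampler)
```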

29. Remarks
◮ Matches statistical minimax lower bounds [Caponnetto, De Vito '05].
◮ Special case: Sobolev spaces with $s = d/2$, e.g. the exponential kernel and Fourier features.
◮ Corollaries for classification using plug-in classifiers [Audibert, Tsybakov '07; Yao, Caponnetto, R. '07].
◮ Same statistical bound as (kernel) ridge regression [Caponnetto, De Vito '05].

30. $M = \sqrt{n}$ suffices for $1/\sqrt{n}$ rates.
$O(n^2)$ time and $O(n\sqrt{n})$ memory suffice, rather than $O(n^3)$ / $O(n^2)$.

31. Some ideas from the proof [Caponnetto, De Vito, R. '05-; Smale, Zhou '05]
Fixed design linear regression: $\hat y = \hat X w_* + \delta$.
Ridge regression:
$$\hat X(\hat X^\top \hat X + \lambda)^{-1}\hat X^\top \hat y - \hat X w_*
= \hat X(\hat X^\top \hat X + \lambda)^{-1}\hat X^\top \delta + \hat X\big((\hat X^\top \hat X + \lambda)^{-1}\hat X^\top \hat X - I\big) w_*
= \hat X(\hat X^\top \hat X + \lambda)^{-1}\hat X^\top \delta - \lambda\, \hat X(\hat X^\top \hat X + \lambda)^{-1} w_*.$$
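A tiny numerical check of this decomposition (a sketch with random matrices and arbitrary sizes; it only confirms the algebra above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1

Xh = rng.standard_normal((n, d))                 # fixed design matrix X-hat
w_star = rng.standard_normal(d)
delta = 0.1 * rng.standard_normal(n)
yh = Xh @ w_star + delta                         # y-hat = X-hat w_* + delta

A = np.linalg.inv(Xh.T @ Xh + lam * np.eye(d))   # (X^T X + lambda)^{-1}

lhs = Xh @ A @ Xh.T @ yh - Xh @ w_star
rhs = Xh @ A @ Xh.T @ delta - lam * (Xh @ A @ w_star)
print(np.allclose(lhs, rhs))                     # True
```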

32. Key quantities
$$Lf(x) = \mathbb{E}[K(x, X) f(X)], \qquad L_M f(x) = \mathbb{E}[K_M(x, X) f(X)].$$
Let $K_x = K(x, \cdot)$.
◮ Noise: $(L_M + \lambda I)^{-1/2}\, \tilde K_X Y$ [Pinelis '94]
◮ Sampling: $(L_M + \lambda I)^{-1/2}\, \tilde K_X \otimes \tilde K_X$ [Tropp '12; Minsker '17]
◮ Bias: $\lambda (L + \lambda I)^{-1} L^{1/2}$ [...]

33. Key quantities (cont.)
RF approximation:
◮ $L^{1/2}\big[(L + \lambda I)^{-1} L - (L_M + \lambda I)^{-1} L_M\big]$ [Rudi, R. '17]
◮ $(I - P)\,\phi(w, \cdot)$, where $P = L^\dagger L$ [Rudi, R. '17; De Vito, R., Toigo '14]
Note: it can be that $\phi(w, \cdot) \notin \mathcal{H}_K$.
