
1. Learning with random features
Alessandro Rudi, INRIA - École Normale Supérieure, Paris
Joint work with Lorenzo Rosasco (IIT-MIT)
January 17th, 2018 – Cambridge

2. Data + computers + machine learning = AI/Data science
◮ 1Y US data center = 1M houses
◮ Mobileye pays 1000 labellers
Can we make do with less? Beyond a theoretical divide → integrate statistics and numerics/optimization.

3. Outline
Part I: Random feature networks
Part II: Properties of RFN
Part III: Refined results on RFN

  4. Supervised learning

  5. Supervised learning

6. Supervised learning
Problem: given $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, find $f$ such that $f(x_{\mathrm{new}}) \sim y_{\mathrm{new}}$.

7. Neural networks
$$f(x) = \sum_{j=1}^{M} \beta_j \, \sigma(w_j^\top x + b_j)$$
◮ $\sigma: \mathbb{R} \to \mathbb{R}$ a nonlinear activation function.
◮ For $j = 1, \ldots, M$, $\beta_j, w_j, b_j$ are parameters to be determined.

8. Neural networks
$$f(x) = \sum_{j=1}^{M} \beta_j \, \sigma(w_j^\top x + b_j)$$
◮ $\sigma: \mathbb{R} \to \mathbb{R}$ a nonlinear activation function.
◮ For $j = 1, \ldots, M$, $\beta_j, w_j, b_j$ are parameters to be determined.
Some references
◮ History [McCulloch, Pitts '43; Rosenblatt '58; Minsky, Papert '69; Y. LeCun '85; Hinton et al. '06]
◮ Deep learning [Krizhevsky et al. '12 - 18705 Cit.!!!]
◮ Theory [Barron '92-94; Bartlett, Anthony '99; Pinkus '99]
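A minimal sketch of this one-hidden-layer model (not from the slides), assuming a ReLU activation and random toy parameters; all names here are illustrative.

```python
import numpy as np

def nn_forward(x, W, b, beta, sigma=lambda t: np.maximum(t, 0.0)):
    """One-hidden-layer network f(x) = sum_j beta_j * sigma(w_j^T x + b_j).

    x: (d,) input, W: (M, d) weights, b: (M,) biases, beta: (M,) output weights.
    """
    return sigma(W @ x + b) @ beta

# toy usage; in a standard neural network all of W, b, beta would be trained
d, M = 5, 100
rng = np.random.default_rng(0)
W, b, beta = rng.standard_normal((M, d)), rng.standard_normal(M), rng.standard_normal(M)
print(nn_forward(rng.standard_normal(d), W, b, beta))
```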

9. Random feature networks
$$f(x) = \sum_{j=1}^{M} \beta_j \, \sigma(w_j^\top x + b_j)$$
◮ $\sigma: \mathbb{R} \to \mathbb{R}$ a nonlinear activation function.
◮ For $j = 1, \ldots, M$, $\beta_j$ parameters to be determined.
◮ For $j = 1, \ldots, M$, $w_j, b_j$ chosen at random.

10. Random feature networks
$$f(x) = \sum_{j=1}^{M} \beta_j \, \sigma(w_j^\top x + b_j)$$
◮ $\sigma: \mathbb{R} \to \mathbb{R}$ a nonlinear activation function.
◮ For $j = 1, \ldots, M$, $\beta_j$ parameters to be determined.
◮ For $j = 1, \ldots, M$, $w_j, b_j$ chosen at random.
Some references
◮ Neural nets [Block '62], Extreme learning machines [Huang et al. '06] 5196 Cit.??
◮ Sketching / one-bit compressed sensing, see e.g. [Plan, Vershynin '11-14]: $x \mapsto \sigma(S^\top x)$, with $S$ a random matrix
◮ Gaussian processes / kernel methods [Neal '95; Rahimi, Recht '06, '08, '08]
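A hedged sketch of the random-feature variant: $w_j, b_j$ are drawn once and frozen, and only $\beta$ is fitted. The ReLU features, the synthetic data, and the plain least-squares fit are assumptions for illustration (the slides do not prescribe this particular fit).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, M = 200, 5, 100

# synthetic data
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# w_j, b_j drawn at random, then frozen
W = rng.standard_normal((M, d))
b = rng.uniform(0, 2 * np.pi, size=M)

Phi = np.maximum(X @ W.T + b, 0.0)               # features sigma(w_j^T x + b_j)
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # only beta is learned

f = lambda x: np.maximum(x @ W.T + b, 0.0) @ beta
```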

11. From RFN to PD kernels
$$\frac{1}{M} \sum_{j=1}^{M} \sigma(w_j^\top x + b_j)\,\sigma(w_j^\top x' + b_j) \;\approx\; K(x, x') = \mathbb{E}\big[\sigma(W^\top x + B)\,\sigma(W^\top x' + B)\big]$$

12. From RFN to PD kernels
$$\frac{1}{M} \sum_{j=1}^{M} \sigma(w_j^\top x + b_j)\,\sigma(w_j^\top x' + b_j) \;\approx\; K(x, x') = \mathbb{E}\big[\sigma(W^\top x + B)\,\sigma(W^\top x' + B)\big]$$
Example I: Gaussian kernel / random Fourier features [Rahimi, Recht '08]. Let $\sigma(\cdot) = \cos(\cdot)$, $W \sim \mathcal{N}(0, I)$ and $B \sim U[0, 2\pi]$; then
$$K(x, x') = e^{-\gamma \|x - x'\|^2}.$$
Example II: Arc-cosine kernel / ReLU features [Le Roux, Bengio '07; Cho, Saul '09]. Let $\sigma(\cdot) = |\cdot|_+$ and $(W, B) \sim U[\mathbb{S}^{d+1}]$; then
$$K(x, x') = \sin\theta + (\pi - \theta)\cos\theta, \qquad \theta = \arccos(x^\top x').$$
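A small numerical check of Example I (a sketch, not the speaker's code). It uses the common $\sqrt{2}\cos(w^\top x + b)$ normalization, under which the Monte Carlo average matches $e^{-\gamma\|x-x'\|^2}$ when $W \sim \mathcal{N}(0, 2\gamma I)$; the value of $\gamma$ is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, gamma = 3, 20000, 0.5

x, xp = rng.standard_normal(d), rng.standard_normal(d)

# random Fourier features: phi(x) = sqrt(2/M) * cos(W x + b)
W = rng.normal(scale=np.sqrt(2 * gamma), size=(M, d))
b = rng.uniform(0, 2 * np.pi, size=M)
phi = lambda z: np.sqrt(2.0 / M) * np.cos(W @ z + b)

approx = phi(x) @ phi(xp)                        # (1/M) sum_j sigma(..) sigma(..)
exact = np.exp(-gamma * np.linalg.norm(x - xp) ** 2)
print(approx, exact)                             # close for large M
```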

13. A general view
Let $X$ be a measurable space and $K: X \times X \to \mathbb{R}$ symmetric and positive definite.
Assumption (RF): there exist
◮ $W$, a random variable in $\mathcal{W}$ with law $\pi$,
◮ $\phi: \mathcal{W} \times X \to \mathbb{R}$, a measurable function,
such that for all $x, x' \in X$,
$$K(x, x') = \mathbb{E}[\phi(W, x)\,\phi(W, x')].$$

14. A general view
Let $X$ be a measurable space and $K: X \times X \to \mathbb{R}$ symmetric and positive definite.
Assumption (RF): there exist
◮ $W$, a random variable in $\mathcal{W}$ with law $\pi$,
◮ $\phi: \mathcal{W} \times X \to \mathbb{R}$, a measurable function,
such that for all $x, x' \in X$,
$$K(x, x') = \mathbb{E}[\phi(W, x)\,\phi(W, x')].$$
Random feature representation: given a sample $w_1, \ldots, w_M$ of $M$ i.i.d. copies of $W$, consider
$$K(x, x') \approx \frac{1}{M} \sum_{j=1}^{M} \phi(w_j, x)\,\phi(w_j, x').$$
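An illustrative helper (assumed, not from the talk) that packages Assumption (RF): it takes a sampler for $W$ and a feature map $\phi$ and returns the Monte Carlo kernel estimate. The Fourier-feature pair from Example I is used purely as a demo; `gamma` and the test points are made up.

```python
import numpy as np

def mc_kernel(x, xp, sample_w, phi, M=10000, seed=0):
    """Monte Carlo estimate K(x, x') ~= (1/M) sum_j phi(w_j, x) phi(w_j, x')."""
    rng = np.random.default_rng(seed)
    ws = [sample_w(rng) for _ in range(M)]
    return np.mean([phi(w, x) * phi(w, xp) for w in ws])

# demo: random Fourier feature pair for the Gaussian kernel (gamma assumed)
d, gamma = 3, 0.5
sample_w = lambda rng: (rng.normal(scale=np.sqrt(2 * gamma), size=d),
                        rng.uniform(0, 2 * np.pi))
phi = lambda w, x: np.sqrt(2.0) * np.cos(w[0] @ x + w[1])

x, xp = np.ones(d), np.zeros(d)
print(mc_kernel(x, xp, sample_w, phi), np.exp(-gamma * d))
```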

15. Functional view
Reproducing kernel Hilbert space (RKHS) [Aronszajn '50]: $\mathcal{H}_K$ is the space of functions
$$f(x) = \sum_{j=1}^{p} \beta_j K(x, x_j),$$
completed with respect to $\langle K_x, K_{x'} \rangle := K(x, x')$.

16. Functional view
Reproducing kernel Hilbert space (RKHS) [Aronszajn '50]: $\mathcal{H}_K$ is the space of functions
$$f(x) = \sum_{j=1}^{p} \beta_j K(x, x_j),$$
completed with respect to $\langle K_x, K_{x'} \rangle := K(x, x')$.
RFN spaces: $\mathcal{H}_{\phi,p}$ is the space of functions
$$f(x) = \int d\pi(w)\, \beta(w)\, \phi(w, x), \qquad \text{with } \|\beta\|_p^p = \mathbb{E}|\beta(W)|^p < \infty.$$

17. Functional view
Reproducing kernel Hilbert space (RKHS) [Aronszajn '50]: $\mathcal{H}_K$ is the space of functions
$$f(x) = \sum_{j=1}^{p} \beta_j K(x, x_j),$$
completed with respect to $\langle K_x, K_{x'} \rangle := K(x, x')$.
RFN spaces: $\mathcal{H}_{\phi,p}$ is the space of functions
$$f(x) = \int d\pi(w)\, \beta(w)\, \phi(w, x), \qquad \text{with } \|\beta\|_p^p = \mathbb{E}|\beta(W)|^p < \infty.$$
Theorem (Schoenberg '38, Aronszajn '50). Under Assumption (RF), $\mathcal{H}_K \simeq \mathcal{H}_{\phi,2}$.

18. Why should you care
RFN promises:
◮ Replace optimization with randomization in NN.
◮ Reduce the memory/time footprint of GP/kernel methods.

19. Outline
Part I: Random feature networks
Part II: Properties of RFN
Part III: Refined results on RFN

20. Kernel approximations
$$\tilde K(x, x') = \frac{1}{M} \sum_{j=1}^{M} \phi(w_j, x)\,\phi(w_j, x'), \qquad K(x, x') = \mathbb{E}[\phi(W, x)\,\phi(W, x')]$$
Theorem. Assume $\phi$ is bounded. Let $\mathcal{K} \subset X$ be compact; then w.h.p.
$$\sup_{x, x' \in \mathcal{K}} \big|K(x, x') - \tilde K(x, x')\big| \lesssim \frac{C_{\mathcal{K}}}{\sqrt{M}}.$$
◮ [Rahimi, Recht '08; Sutherland, Schneider '15; Sriperumbudur, Szabó '15]
◮ Empirical characteristic function [Feuerverger, Mureika '77; Csörgő '84; Yukich '87]
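A quick empirical illustration of the $1/\sqrt{M}$ behaviour (a sketch under the same Fourier-feature setup as above; the finite grid standing in for the compact set, `gamma`, and the values of $M$ are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma = 2, 0.5
Xgrid = rng.standard_normal((200, d))            # finite stand-in for the compact set K

def sup_error(M):
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(M, d))
    b = rng.uniform(0, 2 * np.pi, size=M)
    Phi = np.sqrt(2.0 / M) * np.cos(Xgrid @ W.T + b)
    K_tilde = Phi @ Phi.T                        # approximate kernel matrix
    sq = np.sum(Xgrid ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * Xgrid @ Xgrid.T))
    return np.max(np.abs(K - K_tilde))

for M in [100, 400, 1600, 6400]:
    print(M, sup_error(M))                       # error roughly halves as M quadruples
```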

21. Supervised learning
◮ $(X, Y)$ a pair of random variables in $X \times \mathbb{R}$.
◮ $L: \mathbb{R} \times \mathbb{R} \to [0, \infty)$ a loss function.
◮ $\mathcal{H} \subset \mathbb{R}^X$.
Problem: solve $\min_{f \in \mathcal{H}} \mathbb{E}[L(f(X), Y)]$ given only $(x_1, y_1), \ldots, (x_n, y_n)$, a sample of $n$ i.i.d. copies of $(X, Y)$.

22. Rahimi & Recht estimator
Ideally, $\mathcal{H} = \mathcal{H}_{\phi, \infty, R}$, the space of functions
$$f(x) = \int d\pi(w)\, \beta(w)\, \phi(w, x), \qquad \|\beta\|_\infty \le R.$$
In practice, $\mathcal{H} = \mathcal{H}_{\phi, \infty, R, M}$, the space of functions
$$f(x) = \sum_{j=1}^{M} \tilde\beta_j\, \phi(w_j, x), \qquad \sup_j |\tilde\beta_j| \le R.$$
Estimator:
$$\operatorname*{argmin}_{f \in \mathcal{H}_{\phi, \infty, R, M}} \; \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i).$$
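A hedged sketch of the "in practice" estimator. For concreteness it uses the squared loss (the slides allow any Lipschitz convex $L$) and SciPy's bounded least squares to enforce $\sup_j|\tilde\beta_j| \le R$; the data, the cosine features, and $R$ are made-up choices.

```python
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(0)
n, d, M, R = 300, 4, 200, 5.0

X = rng.standard_normal((n, d))
y = np.cos(X[:, 0]) + 0.1 * rng.standard_normal(n)

# random features phi(w_j, x) = cos(w_j^T x + b_j), with w_j, b_j fixed at random
W = rng.standard_normal((M, d))
b = rng.uniform(0, 2 * np.pi, size=M)
Phi = np.cos(X @ W.T + b)

# min_beta (1/n) sum_i (y_i - Phi_i beta)^2   s.t.   |beta_j| <= R for all j
res = lsq_linear(Phi, y, bounds=(-R, R))
beta = res.x
```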

23. Rahimi & Recht result
Theorem (Rahimi, Recht '08). Assume $L$ is $\ell$-Lipschitz and convex. If $\phi$ is bounded, then w.h.p.
$$\mathbb{E}[L(\hat f(X), Y)] - \min_{f \in \mathcal{H}_{\phi,\infty,R}} \mathbb{E}[L(f(X), Y)] \;\lesssim\; \ell R \left( \frac{1}{\sqrt{n}} + \frac{1}{\sqrt{M}} \right).$$
Other result: [Bach '15] replaced $\mathcal{H}_{\phi,\infty,R}$ with a ball in $\mathcal{H}_{\phi,2}$.
$R$ needs to be fixed, and $M = n$ is needed for $1/\sqrt{n}$ rates.

24. Our approach
For $f_\beta(x) = \sum_{j=1}^{M} \beta_j \phi(w_j, x)$, consider RF ridge regression:
$$\min_{\beta \in \mathbb{R}^M} \; \frac{1}{n} \sum_{i=1}^{n} (y_i - f_\beta(x_i))^2 + \lambda \sum_{j=1}^{M} |\beta_j|^2.$$

25. Our approach
For $f_\beta(x) = \sum_{j=1}^{M} \beta_j \phi(w_j, x)$, consider RF ridge regression:
$$\min_{\beta \in \mathbb{R}^M} \; \frac{1}{n} \sum_{i=1}^{n} (y_i - f_\beta(x_i))^2 + \lambda \sum_{j=1}^{M} |\beta_j|^2.$$
Computations:
$$\hat\beta_\lambda = (\hat\Phi^\top \hat\Phi + \lambda n I)^{-1} \hat\Phi^\top \hat y$$
◮ $\hat\Phi_{i,j} = \phi(w_j, x_i)$, the $n \times M$ data matrix
◮ $\hat y$, the $n \times 1$ vector of outputs
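A direct NumPy sketch of this closed form. The data, the cosine feature map, and the value of $\lambda$ are illustrative assumptions; in practice one would use a Cholesky solve.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, M, lam = 500, 4, 100, 1e-3

X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

W = rng.standard_normal((M, d))
b = rng.uniform(0, 2 * np.pi, size=M)
Phi = np.cos(X @ W.T + b)                        # n x M matrix, Phi[i, j] = phi(w_j, x_i)

# beta_lambda = (Phi^T Phi + lambda n I)^{-1} Phi^T y  --  O(n M^2) time, O(n M) memory
beta = np.linalg.solve(Phi.T @ Phi + lam * n * np.eye(M), Phi.T @ y)

predict = lambda Xnew: np.cos(Xnew @ W.T + b) @ beta
```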

26. Computational footprint
$$\hat\beta_\lambda = (\hat\Phi^\top \hat\Phi + \lambda n I)^{-1} \hat\Phi^\top \hat y$$
$O(nM^2)$ time and $O(Mn)$ memory cost.
Compare to $O(n^3)$ time and $O(n^2)$ memory using kernel methods/GPs.
What are the learning properties if $M < n$?

27. Worst case: basic assumptions
Noise: $\mathbb{E}[|Y|^p \mid X = x] \le \tfrac{1}{2}\, p!\, \sigma^2 b^{p-2}$, for all $p \ge 2$.
RF boundedness: under Assumption (RF), let $\phi$ be bounded.
Best model: there exists $f^\dagger$ solving $\min_{f \in \mathcal{H}_{\phi,2}} \mathbb{E}[(Y - f(X))^2]$.
Note:
◮ we allow considering the whole space $\mathcal{H}_{\phi,2}$ rather than a ball;
◮ we allow misspecified models (the regression function need not be in $\mathcal{H}$).

28. Worst case: analysis
Theorem (Rudi, R. '17). Under the basic assumptions, let $\hat f_\lambda = f_{\hat\beta_\lambda}$; then w.h.p.
$$\mathbb{E}[(Y - \hat f_\lambda(X))^2] - \mathbb{E}[(Y - f^\dagger(X))^2] \;\lesssim\; \frac{1}{\lambda n} + \lambda + \frac{1}{M},$$
so that, for
$$\lambda = O\!\left(\frac{1}{\sqrt{n}}\right), \qquad M = O\!\left(\frac{1}{\lambda}\right),$$
w.h.p.
$$\mathbb{E}[(Y - \hat f_\lambda(X))^2] - \mathbb{E}[(Y - f^\dagger(X))^2] \;\lesssim\; \frac{1}{\sqrt{n}}.$$
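The theorem's parameter choice, written out as a hypothetical helper around the ridge solver sketched above, with $M = \lceil\sqrt{n}\rceil$ and $\lambda = 1/\sqrt{n}$ (constants absorbed); `feature_sampler` and the cosine features are assumptions.

```python
import numpy as np

def rf_ridge_fit(X, y, feature_sampler, seed=0):
    """RF ridge regression with the theorem's scaling: M ~ sqrt(n), lambda ~ 1/sqrt(n)."""
    n, d = X.shape
    M = int(np.ceil(np.sqrt(n)))
    lam = 1.0 / np.sqrt(n)
    rng = np.random.default_rng(seed)
    W, b = feature_sampler(rng, M, d)            # user-supplied sampler for (w_j, b_j)
    Phi = np.cos(X @ W.T + b)
    beta = np.linalg.solve(Phi.T @ Phi + lam * n * np.eye(M), Phi.T @ y)
    return lambda Xnew: np.cos(Xnew @ W.T + b) @ beta

# example sampler: Gaussian weights, uniform phases (assumed Fourier-type features)
sampler = lambda rng, M, d: (rng.standard_normal((M, d)), rng.uniform(0, 2 * np.pi, M))
# usage: f_hat = rf_ridge_fit(X, y, sampler)
```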

29. Remarks
◮ Matches statistical minimax lower bounds [Caponnetto, De Vito '05].
◮ Special case: Sobolev spaces with $s = d/2$, e.g. the exponential kernel and Fourier features.
◮ Corollaries for classification using plug-in classifiers [Audibert, Tsybakov '07; Yao, Caponnetto, R. '07].
◮ Same statistical bound as (kernel) ridge regression [Caponnetto, De Vito '05].

30. $M = \sqrt{n}$ suffices for $1/\sqrt{n}$ rates.
$O(n^2)$ time and $O(n\sqrt{n})$ memory suffice, rather than $O(n^3)$ / $O(n^2)$.

31. Some ideas from the proof [Caponnetto, De Vito, R. '05-; Smale, Zhou '05]
Fixed design linear regression: $\hat y = \hat X w_* + \delta$.
Ridge regression:
$$\hat X(\hat X^\top \hat X + \lambda)^{-1}\hat X^\top \hat y - \hat X w_*
= \hat X(\hat X^\top \hat X + \lambda)^{-1}\hat X^\top \delta + \hat X\big((\hat X^\top \hat X + \lambda)^{-1}\hat X^\top \hat X - I\big) w_*
= \hat X(\hat X^\top \hat X + \lambda)^{-1}\hat X^\top \delta - \lambda\, \hat X(\hat X^\top \hat X + \lambda)^{-1} w_*.$$
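A tiny numerical check of this decomposition (a sketch with random matrices and arbitrary sizes; it only confirms the algebra above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1

Xh = rng.standard_normal((n, d))                 # fixed design matrix X-hat
w_star = rng.standard_normal(d)
delta = 0.1 * rng.standard_normal(n)
yh = Xh @ w_star + delta                         # y-hat = X-hat w_* + delta

A = np.linalg.inv(Xh.T @ Xh + lam * np.eye(d))   # (X^T X + lambda)^{-1}

lhs = Xh @ A @ Xh.T @ yh - Xh @ w_star
rhs = Xh @ A @ Xh.T @ delta - lam * (Xh @ A @ w_star)
print(np.allclose(lhs, rhs))                     # True
```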

32. Key quantities
$$Lf(x) = \mathbb{E}[K(x, X) f(X)], \qquad L_M f(x) = \mathbb{E}[K_M(x, X) f(X)].$$
Let $K_x = K(x, \cdot)$.
◮ Noise: $(L_M + \lambda I)^{-1/2}\, \tilde K_X Y$ [Pinelis '94]
◮ Sampling: $(L_M + \lambda I)^{-1/2}\, \tilde K_X \otimes \tilde K_X$ [Tropp '12; Minsker '17]
◮ Bias: $\lambda (L + \lambda I)^{-1} L^{1/2}$ [...]

33. Key quantities (cont.)
RF approximation:
◮ $L^{1/2}\big[(L + \lambda I)^{-1} L - (L_M + \lambda I)^{-1} L_M\big]$ [Rudi, R. '17]
◮ $(I - P)\,\phi(w, \cdot)$, where $P = L^\dagger L$ [Rudi, R. '17; De Vito, R., Toigo '14]
Note: it can be that $\phi(w, \cdot) \notin \mathcal{H}_K$.
