Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels

  1. Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels. Jiyan Yang, Stanford University. June 24th, 2014, ICML 2014, Beijing. Joint work with Vikas Sindhwani, Haim Avron and Michael Mahoney.

  2. Brief Overview of Kernel Methods • Low-dimensional Explicit Feature Map • Quasi-Monte Carlo Random Feature • Empirical Results

  3. Brief Overview of Kernel Methods • Low-dimensional Explicit Feature Map • Quasi-Monte Carlo Random Feature • Empirical Results

  4. Problem setting. We will start with the kernel ridge regression problem,
$$\min_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2 + \lambda \|f\|_{\mathcal{H}}^2, \qquad (1)$$
where $x_i \in \mathbb{R}^d$ and $\mathcal{H}$ is a nice hypothesis space (an RKHS); more generally, the squared loss can be replaced by any convex loss $\ell$.
◮ A symmetric and positive-definite kernel $k(x, y)$ generates a unique RKHS $\mathcal{H}$.
◮ For example, the RBF kernel $k(x, y) = e^{-\|x - y\|^2 / (2\sigma^2)}$.
◮ Kernel methods are widely used for regression, classification and inverse problems arising in many areas, as well as for unsupervised learning.

  5. Scalability
◮ By the Representer Theorem, the minimizer of (1) can be represented by $c = (K + \lambda n I)^{-1} Y$.
◮ Here the Gram matrix $K$ is defined by $K_{ij} = k(x_i, x_j)$. Forming the $n \times n$ matrix $K$ needs $O(n^2)$ storage, and standard dense linear algebra needs $O(n^3)$ running time.
◮ This is an $n \times n$ dense linear system, which is not scalable for large $n$.
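A minimal NumPy sketch of this dual solve for the RBF kernel (the function names, `lam` and `sigma` are illustrative, not from the talk):

```python
import numpy as np

def rbf_gram(X, Y, sigma):
    """Gram matrix K_ij = exp(-||x_i - y_j||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def kernel_ridge_fit(X, y, lam, sigma):
    """Solve (K + lam*n*I) c = y; O(n^2) storage, O(n^3) time."""
    n = X.shape[0]
    K = rbf_gram(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def kernel_ridge_predict(X_train, X_test, c, sigma):
    """f(x) = sum_i c_i k(x, x_i); O(n d) work per test point."""
    return rbf_gram(X_test, X_train, sigma) @ c
```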

  6. Linear kernel and explicit feature maps
◮ Suppose we can find a feature map $z : \mathcal{X} \to \mathbb{R}^s$ such that $k(x, y) = z(x)^T z(y)$. Then the Gram matrix $K = Z Z^T$, where the $i$-th row of $Z \in \mathbb{R}^{n \times s}$ is $z(x_i)$.
◮ The solution to (1) can be expressed as $w = (Z^T Z + \lambda n I)^{-1} Z^T Y$.
◮ This is an $s \times s$ linear system.
◮ It is attractive if $s < n$.
◮ Testing time reduces from $O(nd)$ to $O(s + d)$.
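With an explicit feature map the same problem can be solved in the primal; a sketch assuming real-valued features (the complex exponential features later in the talk can be converted to real cosine/sine pairs):

```python
import numpy as np

def feature_ridge_fit(Z, y, lam):
    """Rows of the n x s matrix Z are the features z(x_i).
    Solve the s x s system (Z^T Z + lam*n*I) w = Z^T y
    instead of the n x n dual system."""
    n, s = Z.shape
    return np.linalg.solve(Z.T @ Z + lam * n * np.eye(s), Z.T @ y)

def feature_ridge_predict(Z_test, w):
    """Prediction is one inner product per test point: z(x)^T w."""
    return Z_test @ w
```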

  7. Brief Overview of Kernel Methods • Low-dimensional Explicit Feature Map • Quasi-Monte Carlo Random Feature • Empirical Results

  8. Mercer's Theorem and explicit feature maps
Theorem (Mercer). Any positive-definite kernel $k$ can be expanded as
$$k(x, y) = \sum_{i=1}^{N_F} \lambda_i \phi_i(x) \phi_i(y).$$
◮ We can define $\Phi(x) = \big(\sqrt{\lambda_1}\,\phi_1(x), \ldots, \sqrt{\lambda_{N_F}}\,\phi_{N_F}(x)\big)$.
◮ For many kernels, such as the RBF kernel, $N_F = \infty$.
◮ Goal: find an explicit feature map $z(x) \in \mathbb{R}^s$ such that $k(x, y) \approx z(x)^T z(y)$, where $s < n$. Then $K \approx Z Z^T$.

  9. Bochner's Theorem
Theorem (Bochner). A continuous kernel $k(x, y) = k(x - y)$ on $\mathbb{R}^d$ is positive definite if and only if $k(x - y)$ is the Fourier transform of a non-negative measure.

  10. A Monte Carlo approximation
◮ More specifically, given a shift-invariant kernel $k$, we have
$$k(x, y) = k(x - y) = \int_{\mathbb{R}^d} e^{-i w^T (x - y)}\, p(w)\, dw.$$
◮ By the standard Monte Carlo (MC) approach, this can be approximated by
$$\tilde{k}(x, y) = \frac{1}{s} \sum_{j=1}^{s} e^{-i w_j^T (x - y)}, \qquad (2)$$
where the $w_j$ are drawn from $p(w)$.

  11. Random Fourier features
◮ The random Fourier feature map can be defined as
$$\psi(x) = \frac{1}{\sqrt{s}}\,\big(g_{w_1}(x), \ldots, g_{w_s}(x)\big), \quad \text{where } g_{w_j}(x) = e^{-i w_j^T x}$$
[Rahimi and Recht 07].
◮ So
$$\tilde{k}(x, y) = \frac{1}{s} \sum_{j=1}^{s} e^{-i w_j^T (x - y)} = \psi(x)^T \overline{\psi(y)}.$$
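A sketch of these Monte Carlo features for the RBF kernel $k(x,y) = e^{-\|x-y\|^2/(2\sigma^2)}$, whose spectral density $p(w)$ is the Gaussian $N(0, \sigma^{-2} I)$ (a standard fact, not stated on this slide; names and parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_fourier_features(X, s, sigma):
    """psi(x) = s^{-1/2} (exp(-i w_1^T x), ..., exp(-i w_s^T x)),
    with w_j drawn i.i.d. from N(0, sigma^{-2} I)."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, s))
    return np.exp(-1j * X @ W) / np.sqrt(s)

# k(x, y) is approximated by psi(x)^T conj(psi(y)), so the Gram matrix is
# approximated (up to MC error) by the real part of Psi @ Psi.conj().T:
# Psi = mc_fourier_features(X, s=2000, sigma=1.0)
# K_approx = (Psi @ Psi.conj().T).real
```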

  12. Motivation
◮ We want to use fewer random features while maintaining the same approximation accuracy.
◮ The MC method has a convergence rate of $O(1/\sqrt{s})$.
◮ To obtain faster convergence, the quasi-Monte Carlo method is a better choice, since it has a convergence rate of $O((\log s)^d / s)$.

  13. Brief Overview of Kernel Methods • Low-dimensional Explicit Feature Map • Quasi-Monte Carlo Random Feature • Empirical Results

  14. Quasi-Monte Carlo methods
Goal: approximate an integral over the $d$-dimensional unit cube $[0, 1]^d$,
$$I_d(f) = \int_{[0,1]^d} f(x)\, dx_1 \cdots dx_d.$$
Quasi-Monte Carlo methods usually take the form
$$Q_s(f) = \frac{1}{s} \sum_{i=1}^{s} f(t_i),$$
where $t_1, \ldots, t_s \in [0, 1]^d$ are deterministically chosen low-discrepancy (quasi-random) points.
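A toy comparison of the two quadrature rules on a product integrand with a known integral (illustrative only; it assumes SciPy's scipy.stats.qmc module, available in SciPy 1.7+):

```python
import numpy as np
from scipy.stats import qmc

d, s = 6, 4096
rng = np.random.default_rng(0)

f = lambda T: np.prod(T, axis=1)   # f(x) = x_1 * ... * x_d; integral over [0,1]^d is 2^{-d}
exact = 0.5 ** d

T_mc = rng.random((s, d))                          # i.i.d. uniform points
T_qmc = qmc.Halton(d=d, scramble=False).random(s)  # low-discrepancy Halton points

print("MC  error:", abs(f(T_mc).mean() - exact))
print("QMC error:", abs(f(T_qmc).mean() - exact))
```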

  15. Low-discrepancy sequences
◮ Many low-discrepancy sequences $\{t_i\}_{i=1}^{\infty}$ are available, such as the Halton sequence and the Sobol' sequence.
◮ They tend to be more "uniform" than points drawn uniformly at random.
◮ Notice the clumping and the empty regions in the left subplot.
[Figure: points in $[0,1]^2$ drawn uniformly at random (left) versus Halton points (right).]
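The two point sets in the figure can be regenerated in a few lines (a sketch; the point count and plotting details are incidental):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import qmc

rng = np.random.default_rng(0)
uniform_pts = rng.random((256, 2))                        # i.i.d. uniform points
halton_pts = qmc.Halton(d=2, scramble=False).random(256)  # low-discrepancy points

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].scatter(*uniform_pts.T, s=5)
axes[0].set_title("Uniform")
axes[1].scatter(*halton_pts.T, s=5)
axes[1].set_title("Halton")
plt.show()
```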

  16. Quasi-random features
◮ By setting $w = \Phi^{-1}(t)$, where $\Phi^{-1}$ is the inverse CDF of $p$ applied coordinate-wise, $k(x, y)$ can be rewritten as
$$\int_{\mathbb{R}^d} e^{-i (x - y)^T w}\, p(w)\, dw = \int_{[0,1]^d} e^{-i (x - y)^T \Phi^{-1}(t)}\, dt \approx \frac{1}{s} \sum_{j=1}^{s} e^{-i (x - y)^T \Phi^{-1}(t_j)}. \qquad (3)$$
◮ After generating the low-discrepancy sequence $\{t_j\}_{j=1}^{s}$, the quasi-random features are $g_{t_j}(x) = e^{-i x^T \Phi^{-1}(t_j)}$, $j = 1, \ldots, s$, so that $k(x, y) \approx \frac{1}{s} \sum_{j=1}^{s} g_{t_j}(x)\, \overline{g_{t_j}(y)}$.

  17. Algorithm: Quasi-Random Fourier Features
Input: shift-invariant kernel $k$, size $s$.
Output: feature map $\hat{\Psi}(x) : \mathbb{R}^d \mapsto \mathbb{C}^s$.
1: Find $p$, the inverse Fourier transform of $k$.
2: Generate a low-discrepancy sequence $t_1, \ldots, t_s$.
3: Transform the sequence: $w_j = \Phi^{-1}(t_j)$.
4: Set $\hat{\Psi}(x) = \frac{1}{\sqrt{s}}\left(e^{-i x^T w_1}, \ldots, e^{-i x^T w_s}\right)$.
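A minimal sketch of this algorithm specialized to the RBF kernel, using a Halton sequence (the choice of Halton, scipy.stats.qmc, and all parameter names are illustrative assumptions, not prescribed by the slide):

```python
import numpy as np
from scipy.stats import norm, qmc

def qmc_fourier_features(X, s, sigma, seed=0):
    """Steps 1-4 for the RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)):
    its spectral density p is N(0, sigma^{-2} I), so Phi^{-1} is the standard
    normal inverse CDF applied coordinate-wise, scaled by 1/sigma."""
    n, d = X.shape
    t = qmc.Halton(d=d, scramble=True, seed=seed).random(s)  # step 2 (scrambling keeps t away from 0 and 1)
    W = norm.ppf(t).T / sigma                                # step 3: columns are w_j = Phi^{-1}(t_j)
    return np.exp(-1j * X @ W) / np.sqrt(s)                  # step 4

# Usage: Psi = qmc_fourier_features(X, s=512, sigma=1.0)
#        K_approx = (Psi @ Psi.conj().T).real
```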

  18. Quality of approximation
◮ Given a pair of points $x, y$, let $u = x - y$. The approximation error is
$$\epsilon[f_u] = \int_{\mathbb{R}^d} f_u(w)\, p(w)\, dw - \frac{1}{s} \sum_{i=1}^{s} f_u(w_i),$$
where $f_u(w) = e^{i u^T w}$.
◮ We want to characterize the behavior of $\epsilon[f_u]$ for $u \in \bar{\mathcal{X}}$, where $\bar{\mathcal{X}} = \{x - z \mid x, z \in \mathcal{X}\}$.
◮ Consider a broader class of integrands, $F_{\square_b} = \{f_u \mid u \in \square_b\}$. Here $\square_b = \{u \in \mathbb{R}^d : |u_j| \le b_j\}$ and $\bar{\mathcal{X}} \subseteq \square_b$.
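For a Gaussian spectral density the integral is known in closed form ($\int e^{i u^T w} N(w; 0, I)\, dw = e^{-\|u\|^2/2}$), so $\epsilon[f_u]$ can be measured directly; a small sketch under that assumption:

```python
import numpy as np

def integration_error(u, W):
    """eps[f_u] = int f_u(w) p(w) dw - (1/s) sum_i f_u(w_i), for p = N(0, I).
    W is s x d with rows w_i; the exact integral is exp(-||u||^2 / 2)."""
    exact = np.exp(-0.5 * np.dot(u, u))
    approx = np.mean(np.exp(1j * W @ u))
    return exact - approx

# With W_mc drawn i.i.d. from N(0, I) and W_qmc taken from the previous sketch
# (transposed to s x d), |integration_error(u, W_qmc)| is typically smaller.
```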

  19. Main theoretical result
Theorem (Average Case Error). Let $U(F_{\square_b})$ denote the uniform distribution on $F_{\square_b}$. That is, $f \sim U(F_{\square_b})$ denotes $f = f_u$, where $f_u(x) = e^{-i u^T x}$ and $u$ is drawn from the uniform distribution on $\square_b$. We have
$$\mathbb{E}_{f \sim U(F_{\square_b})}\left[\epsilon_{S,p}[f]^2\right] = \frac{\pi^d}{\prod_{j=1}^{d} b_j}\, D^{p}_{\square_b}(S)^2.$$
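The left-hand side can be checked numerically by drawing $u$ uniformly from the box and averaging $|\epsilon|^2$; a sketch reusing integration_error above for $p = N(0, I)$ (the box bounds b and number of draws are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def avg_case_error_sq(W, b, n_draws=2000):
    """Monte Carlo estimate of E_{f ~ U(F_b)} [ |eps_{S,p}[f]|^2 ] for p = N(0, I),
    where u is drawn uniformly from the box {|u_j| <= b_j}."""
    d = W.shape[1]
    errs = []
    for _ in range(n_draws):
        u = rng.uniform(-b, b, size=d)
        errs.append(abs(integration_error(u, W)) ** 2)
    return np.mean(errs)
```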

  20. Box discrepancy
Suppose that $p(\cdot)$ is a probability density function, and that we can write $p(x) = \prod_{j=1}^{d} p_j(x_j)$, where each $p_j(\cdot)$ is a univariate probability density function as well. Let $\phi_j(\cdot)$ be the characteristic function associated with $p_j(\cdot)$. Then
$$D^{p}_{\square_b}(S)^2 = \pi^{-d} \prod_{j=1}^{d} \int_{-b_j}^{b_j} |\phi_j(\beta)|^2\, d\beta \;-\; \frac{2 (2\pi)^{-d}}{s} \sum_{l=1}^{s} \prod_{j=1}^{d} \int_{-b_j}^{b_j} \phi_j(\beta)\, e^{i w_{lj} \beta}\, d\beta \;+\; \frac{1}{s^2} \sum_{l=1}^{s} \sum_{j=1}^{s} \operatorname{sinc}_b(w_l, w_j). \qquad (4)$$

  21. Proof techniques
◮ Consider the integrands to lie in some Reproducing Kernel Hilbert Space (RKHS). Uniform bounds on the approximation error can then be derived by standard arguments.
◮ Here we consider the space of functions that admit an integral representation over $\square_b$ of the form
$$f(x) = \int_{u \in \square_b} \hat{f}(u)\, e^{-i u^T x}\, du, \quad \text{where } \hat{f}(u) \in L^2(\square_b). \qquad (5)$$
These spaces are called Paley-Wiener spaces $PW_b$, and they constitute an RKHS.
◮ The damped approximations $\tilde{f}_u(x) = e^{-i u^T x} \operatorname{sinc}(T x)$ of the integrands in $F_{\square_b}$ are members of $PW_b$ with $\|\tilde{f}_u\|_{PW_b} = \frac{1}{\sqrt{T}}$. Hence, we expect $D^{p}_{\square_b}$ to provide a discrepancy measure for integrating functions in $F_{\square_b}$.

  22. Brief Overview of Kernel Methods • Low-dimensional Explicit Feature Map • Quasi-Monte Carlo Random Feature • Empirical Results

  23. Approximation error on the Gram matrix
[Figure: relative error on approximating the Gram matrix, measured in the Euclidean norm and the Frobenius norm, i.e. $\|K - \tilde{K}\|_2 / \|K\|_2$ and $\|K - \tilde{K}\|_F / \|K\|_F$, as the number of random features $s$ ranges from 200 to 800, on (a) MNIST and (b) CPU. Methods compared: MC, Halton, Sobol', Lattice, Digital Net. For each kind of random feature and each $s$, 10 independent trials are executed, and the mean and standard deviation are plotted.]
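A sketch of how this kind of comparison can be reproduced (not the authors' exact experimental protocol; it reuses rbf_gram and the two feature maps defined earlier):

```python
import numpy as np

def gram_rel_error(X, feature_map, sigma, s, norm_ord):
    """||K - K_tilde|| / ||K||, where K_tilde = Re(Psi Psi^*) is built from s features."""
    K = rbf_gram(X, X, sigma)              # exact Gram matrix
    Psi = feature_map(X, s, sigma)
    K_tilde = (Psi @ Psi.conj().T).real
    return np.linalg.norm(K - K_tilde, ord=norm_ord) / np.linalg.norm(K, ord=norm_ord)

# e.g. gram_rel_error(X, mc_fourier_features, 1.0, 400, 2)       # Euclidean (spectral) norm
#      gram_rel_error(X, qmc_fourier_features, 1.0, 400, 'fro')  # Frobenius norm
```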

  24. Generalization error

Table: regression error, i.e. $\|\hat{y} - y\|_2 / \|y\|_2$, where $\hat{y}$ is the predicted value and $y$ is the ground truth. Each cell shows the mean with the standard deviation in parentheses.

  dataset   s      Halton        Sobol'            Lattice           Digital Net       MC
  cpu       100    0.0367 (0)    0.0383 (0.0015)   0.0374 (0.0010)   0.0376 (0.0010)   0.0383 (0.0013)
  cpu       500    0.0339 (0)    0.0344 (0.0005)   0.0348 (0.0007)   0.0343 (0.0005)   0.0349 (0.0009)
  cpu       1000   0.0334 (0)    0.0339 (0.0007)   0.0337 (0.0004)   0.0335 (0.0003)   0.0338 (0.0005)
  census    400    0.0529 (0)    0.0747 (0.0138)   0.0801 (0.0206)   0.0755 (0.0080)   0.0791 (0.0180)
  census    1200   0.0553 (0)    0.0588 (0.0080)   0.0694 (0.0188)   0.0587 (0.0067)   0.0670 (0.0078)
  census    1800   0.0498 (0)    0.0613 (0.0084)   0.0608 (0.0129)   0.0583 (0.0100)   0.0600 (0.0113)
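The reported metric and the end-to-end pipeline fit in a few lines, reusing the earlier sketches (illustrative, not the exact protocol behind these numbers):

```python
import numpy as np

def relative_regression_error(y_hat, y):
    """||y_hat - y||_2 / ||y||_2, the error reported in the table."""
    return np.linalg.norm(y_hat - y) / np.linalg.norm(y)

def real_features(Psi):
    """Stack real and imaginary parts so that z(x)^T z(y) = Re(psi(x)^T conj(psi(y)))
    and the real-valued primal ridge solve applies (now with 2s features)."""
    return np.hstack([Psi.real, Psi.imag])

# Illustrative pipeline (X_train, y_train, X_test, y_test assumed given; the same
# seed guarantees train and test share the same frequencies w_j):
# Z_tr = real_features(qmc_fourier_features(X_train, s=1000, sigma=1.0, seed=0))
# Z_te = real_features(qmc_fourier_features(X_test,  s=1000, sigma=1.0, seed=0))
# w = feature_ridge_fit(Z_tr, y_train, lam=1e-3)
# err = relative_regression_error(feature_ridge_predict(Z_te, w), y_test)
```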
