  1. Learning with Approximate Kernel Embeddings. Dino Sejdinovic, Department of Statistics, University of Oxford. RegML Workshop, Simula, Oslo, 06/05/2017.

  2. Outline: (1) Preliminaries on Kernel Embeddings; (2) Testing and Learning on Distributions with Symmetric Noise Invariance.

  4. Reproducing Kernel Hilbert Spaces. An RKHS is a Hilbert space of functions on $\mathcal{X}$ with continuous evaluation $f \mapsto f(x)$ for all $x \in \mathcal{X}$ (norm convergence implies pointwise convergence). Each RKHS corresponds to a positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that (1) $k(\cdot, x) \in \mathcal{H}$ for all $x \in \mathcal{X}$, and (2) $\langle f, k(\cdot, x) \rangle_{\mathcal{H}} = f(x)$ for all $x \in \mathcal{X}$ and $f \in \mathcal{H}$. The RKHS can be constructed as $\mathcal{H}_k = \overline{\mathrm{span}}\{ k(\cdot, x) \mid x \in \mathcal{X} \}$ and includes functions $f(x) = \sum_{i=1}^n \alpha_i k(x, x_i)$ and their pointwise limits. [Figure: an example RKHS function $f(x)$ plotted on the real line.]
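A minimal numerical sketch of the last point, assuming a Gaussian RBF kernel on the real line; the centres and coefficients below are arbitrary illustrations, not taken from the slide. It shows that a function of the form $f(x) = \sum_i \alpha_i k(x, x_i)$ can be evaluated directly from kernel values.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) between rows of x and y."""
    d2 = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2 * x @ y.T
    return np.exp(-d2 / (2 * sigma**2))

# An RKHS function f = sum_i alpha_i k(., x_i), specified by centres and coefficients.
centres = np.array([[-3.0], [0.0], [2.0], [5.0]])
alpha = np.array([0.5, -1.0, 0.8, 0.3])

def f(x):
    """Evaluate f at the rows of x using only kernel evaluations."""
    return rbf_kernel(x, centres) @ alpha

grid = np.linspace(-6, 8, 5)[:, None]
print(f(grid))  # pointwise values of an RKHS function, as in the slide's figure
```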

  5. Kernel Trick and Kernel Mean Trick. The implicit feature map $x \mapsto k(\cdot, x) \in \mathcal{H}_k$ replaces the explicit feature map $x \mapsto [\phi_1(x), \ldots, \phi_s(x)] \in \mathbb{R}^s$; inner products are readily available, $\langle k(\cdot, x), k(\cdot, y) \rangle_{\mathcal{H}_k} = k(x, y)$. This gives nonlinear decision boundaries, nonlinear regression functions, and learning on non-Euclidean/structured data [Cortes & Vapnik, 1995; Schölkopf & Smola, 2001].
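As a concrete instance of the nonlinear regression mentioned above, here is a sketch of kernel ridge regression, which uses only Gram-matrix entries $k(x_i, x_j)$; the kernel, bandwidth, regularisation strength and toy data are illustrative assumptions, not prescribed by the slide.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    d2 = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2 * x @ y.T
    return np.exp(-d2 / (2 * sigma**2))

def kernel_ridge_fit(X, y, kernel, lam=1e-3):
    """Solve (K + lam * n * I) alpha = y; the fitted function is f(x) = sum_i alpha_i k(x, x_i)."""
    n = X.shape[0]
    K = kernel(X, X)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def kernel_ridge_predict(X_test, X_train, alpha, kernel):
    return kernel(X_test, X_train) @ alpha

# toy one-dimensional nonlinear regression problem
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(200)

alpha = kernel_ridge_fit(X, y, rbf_kernel)
X_test = np.linspace(-3, 3, 7)[:, None]
print(kernel_ridge_predict(X_test, X, alpha, rbf_kernel))  # approximates sin(2x) on the grid
```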

  6. Kernel Trick and Kernel Mean Trick (continued). RKHS embedding: the implicit feature mean [Smola et al, 2007; Sriperumbudur et al, 2010] $P \mapsto \mu_k(P) = \mathbb{E}_{X \sim P}\, k(\cdot, X) \in \mathcal{H}_k$ replaces the explicit feature mean $P \mapsto [\mathbb{E}\phi_1(X), \ldots, \mathbb{E}\phi_s(X)] \in \mathbb{R}^s$; inner products are easy to estimate, $\langle \mu_k(P), \mu_k(Q) \rangle_{\mathcal{H}_k} = \mathbb{E}_{X \sim P, Y \sim Q}\, k(X, Y)$. This gives nonparametric two-sample, independence, conditional independence and interaction testing, and learning on distributions [Gretton et al, 2005; Gretton et al, 2006; Fukumizu et al, 2007; DS et al, 2013; Muandet et al, 2012; Szabo et al, 2015].
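A sketch of how the embedding inner product is estimated in practice: given samples from $P$ and $Q$, $\langle \mu_k(P), \mu_k(Q) \rangle_{\mathcal{H}_k}$ is approximated by the average of the cross Gram matrix. The Gaussian kernel, sample sizes and distributions are illustrative choices.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    d2 = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2 * x @ y.T
    return np.exp(-d2 / (2 * sigma**2))

def embedding_inner_product(X, Y, kernel):
    """Estimate <mu_k(P), mu_k(Q)>_H = E_{X~P, Y~Q} k(X, Y) by the mean of the cross Gram matrix."""
    return kernel(X, Y).mean()

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(500, 2))   # sample from P
Y = rng.normal(0.5, 1.0, size=(500, 2))   # sample from Q
print(embedding_inner_product(X, Y, rbf_kernel))
```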

  7. Maximum Mean Discrepancy. The Maximum Mean Discrepancy (MMD) [Borgwardt et al, 2006; Gretton et al, 2007] between $P$ and $Q$:
$$\mathrm{MMD}_k(P, Q) = \left\| \mu_k(P) - \mu_k(Q) \right\|_{\mathcal{H}_k} = \sup_{f \in \mathcal{H}_k :\, \|f\|_{\mathcal{H}_k} \le 1} \left| \mathbb{E} f(X) - \mathbb{E} f(Y) \right|.$$

  8. Maximum Mean Discrepancy (continued). Characteristic kernels: $\mathrm{MMD}_k(P, Q) = 0$ iff $P = Q$. Examples: the Gaussian RBF kernel $\exp\!\left(-\frac{1}{2\sigma^2}\|x - x'\|_2^2\right)$, the Matérn family, inverse multiquadrics. For characteristic kernels on a locally compact Hausdorff $\mathcal{X}$, MMD metrizes the weak* topology on probability measures [Sriperumbudur, 2010]: $\mathrm{MMD}_k(P_n, P) \to 0 \Leftrightarrow P_n \rightsquigarrow P$.
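A plug-in sketch of the MMD estimate between two samples, assuming the (characteristic) Gaussian kernel: the empirical embeddings replace $\mu_k(P)$ and $\mu_k(Q)$, and the squared RKHS norm expands into Gram-matrix averages. The Gaussian-versus-Laplace example is an illustrative assumption.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    d2 = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2 * x @ y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd_biased(X, Y, kernel):
    """Plug-in (biased) estimate of MMD_k(P, Q) = ||mu_k(P) - mu_k(Q)||_H."""
    mmd2 = kernel(X, X).mean() + kernel(Y, Y).mean() - 2 * kernel(X, Y).mean()
    return np.sqrt(max(mmd2, 0.0))

rng = np.random.default_rng(2)
X = rng.normal(0.0, 1.0, size=(400, 1))                    # P: Gaussian
Y = rng.laplace(0.0, 1 / np.sqrt(2), size=(400, 1))        # Q: Laplace with matching variance
print(mmd_biased(X, Y, rbf_kernel))                        # small but nonzero: P and Q differ in shape
```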

  9. Some uses of MMD. MMD has been applied to:
• two-sample tests and independence tests [Gretton et al, 2009; Gretton et al, 2012]
• model criticism and interpretability [Lloyd & Ghahramani, 2015; Kim, Khanna & Koyejo, 2016]
• analysis of Bayesian quadrature [Briol et al, 2015+]
• ABC summary statistics [Park, Jitkrittum & DS, 2015]
• summarising streaming data [Paige, DS & Wood, 2016]
• traversal of manifolds learned by convolutional nets [Gardner et al, 2015]
• training deep generative models [Dziugaite, Roy & Ghahramani, 2015; Sutherland et al, 2017]
The squared MMD decomposes into within-sample average similarities and a between-sample average similarity:
$$\mathrm{MMD}^2_k(P, Q) = \mathbb{E}_{X, X' \overset{\mathrm{i.i.d.}}{\sim} P}\, k(X, X') + \mathbb{E}_{Y, Y' \overset{\mathrm{i.i.d.}}{\sim} Q}\, k(Y, Y') - 2\, \mathbb{E}_{X \sim P,\, Y \sim Q}\, k(X, Y).$$
[Figure by Arthur Gretton: block Gram matrix with within-sample blocks $k(\mathrm{dog}_i, \mathrm{dog}_j)$, $k(\mathrm{fish}_i, \mathrm{fish}_j)$ and between-sample blocks $k(\mathrm{dog}_i, \mathrm{fish}_j)$.]
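To turn the MMD$^2$ statistic above into the two-sample test listed first, a common approach is to compare the observed statistic against a permutation null obtained by reshuffling the pooled sample; this is a hedged sketch of that idea, not necessarily the exact procedure of the cited papers, and the kernel, shift and sample sizes are illustrative.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    d2 = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2 * x @ y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(X, Y, kernel):
    """Empirical version of the MMD^2 expansion on this slide."""
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2 * kernel(X, Y).mean()

def mmd_permutation_test(X, Y, kernel, n_perm=200, seed=0):
    """Approximate p-value of an MMD two-sample test via a permutation null."""
    rng = np.random.default_rng(seed)
    observed = mmd2(X, Y, kernel)
    pooled = np.vstack([X, Y])
    n = X.shape[0]
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled.shape[0])
        if mmd2(pooled[perm[:n]], pooled[perm[n:]], kernel) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, size=(200, 1))
Y = rng.normal(0.3, 1.0, size=(200, 1))
print(mmd_permutation_test(X, Y, rbf_kernel))  # the shift in means should typically give a small p-value
```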

  10. Kernel dependence measures. The Hilbert-Schmidt Independence Criterion,
$$\mathrm{HSIC}^2(X, Y; \kappa) = \left\| \mu_\kappa(P_{XY}) - \mu_\kappa(P_X P_Y) \right\|^2_{\mathcal{H}_\kappa},$$
where HSIC is the Hilbert-Schmidt norm of the feature-space cross-covariance [Gretton et al, 2009]. The kernel on the product space factorises as $\kappa((x, y), (x', y')) = k(x, x')\, l(y, y')$, and the dependence witness is a smooth function in the RKHS $\mathcal{H}_\kappa$ of functions on $\mathcal{X} \times \mathcal{Y}$. This yields an independence testing framework that generalises the Distance Correlation (dcor) of [Szekely et al, 2007]: HSIC with Brownian motion covariance kernels [DS et al, 2013]. [Figures by Arthur Gretton: cor vs. dcor; dependence witness and sample.]
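An empirical HSIC sketch, assuming Gaussian kernels on both spaces and using the standard biased estimator $\frac{1}{n^2}\operatorname{tr}(KHLH)$ with the centring matrix $H$; bandwidths and the toy dependent/independent pairs are illustrative.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    d2 = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2 * x @ y.T
    return np.exp(-d2 / (2 * sigma**2))

def hsic_biased(X, Y, kernel_x, kernel_y):
    """Biased estimator of HSIC^2(X, Y): (1/n^2) * trace(K H L H), H = I - (1/n) 11^T."""
    n = X.shape[0]
    K = kernel_x(X, X)
    L = kernel_y(Y, Y)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n**2

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 1))
Y_dep = X**2 + 0.1 * rng.normal(size=(300, 1))   # dependent but uncorrelated with X
Y_ind = rng.normal(size=(300, 1))                # independent of X
print(hsic_biased(X, Y_dep, rbf_kernel, rbf_kernel))  # dependent pair: clearly positive
print(hsic_biased(X, Y_ind, rbf_kernel, rbf_kernel))  # independent pair: near zero
```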

  11. Distribution Regression: supervised learning where labels are available at the group level rather than at the individual level. Applications include:
• classifying text based on word features [Yoshikawa et al, 2014; Kusner et al, 2015]
• aggregate voting behaviour of demographic groups [Flaxman et al, 2015; 2016]
• image labels based on a distribution of small patches [Szabo et al, 2016]
• “traditional” parametric statistical inference by learning a function from sets of samples to parameters: ABC [Mitrovic et al, 2016], EP [Jitkrittum et al, 2015]
• identifying the cause-effect direction between a pair of variables from a joint sample [Lopez-Paz et al, 2015]
Possible (distributional) covariate shift? [Figure from Flaxman et al, 2015: regional bags of individuals mapped to feature-space mean embeddings $\mu_1, \mu_2, \mu_3$ to predict “% vote for Obama”. Figure from Mooij et al, 2014.]
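A minimal distribution-regression sketch in the spirit of the works above: represent each bag $B_i$ by its empirical mean embedding (here via random Fourier features, introduced on a later slide, so each bag becomes a fixed-length vector) and run ridge regression on the bag-level labels. The bag sizes, kernel bandwidth, feature dimension and synthetic labelling rule are all illustrative assumptions.

```python
import numpy as np

def rff_features(X, omegas):
    """Random Fourier features: x -> (1/sqrt(m)) [cos(w_j^T x), sin(w_j^T x)]_{j=1..m}."""
    m = omegas.shape[0]
    proj = X @ omegas.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

def bag_embedding(bag, omegas):
    """Empirical mean embedding of one bag in the random feature space."""
    return rff_features(bag, omegas).mean(axis=0)

rng = np.random.default_rng(5)
m, p = 100, 2
omegas = rng.normal(size=(m, p))        # frequencies for a unit-bandwidth Gaussian kernel

# synthetic bags: the label is the (unobserved) location parameter of each bag's distribution
bags, labels = [], []
for _ in range(150):
    theta = rng.uniform(-2, 2)
    bags.append(rng.normal(theta, 1.0, size=(rng.integers(30, 80), p)))
    labels.append(theta)
Z = np.stack([bag_embedding(b, omegas) for b in bags])   # one row per bag
y = np.array(labels)

lam = 1e-3
W = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)  # ridge regression on bag embeddings
test_bag = rng.normal(1.2, 1.0, size=(50, p))
print(bag_embedding(test_bag, omegas) @ W)   # predicted label for a held-out bag (true parameter 1.2)
```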

  12. Outline: (1) Preliminaries on Kernel Embeddings; (2) Testing and Learning on Distributions with Symmetric Noise Invariance.

  13. All possible differences between generating processes? Differences discovered by an MMD two-sample test can be due to different types of measurement noise or data collection artefacts. With a large sample size, the test uncovers potentially irrelevant sources of variability: slightly different calibration of the data-collecting equipment, different numerical precision, different conventions for dealing with edge cases. Learning on distributions: each label $y_i$ in supervised learning is associated with a whole bag of observations $B_i = \{X_{ij}\}_{j=1}^{N_i}$, assumed to come from a probability distribution $P_i$; each bag could be impaired by a different measurement noise process. Distributional covariate shift: different measurement noise on test bags? Both problems require encoding the distribution with a representation invariant to symmetric noise. Testing and Learning on Distributions with Symmetric Noise Invariance. Ho Chung Leon Law, Christopher Yau, DS. http://arxiv.org/abs/1703.07596
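One way to obtain such an invariant representation, in the spirit of the cited paper but presented here only as a hedged sketch (not necessarily the exact construction of Law, Yau & DS): if an observed bag is $X + E$ with $E$ symmetric and its characteristic function positive, then the characteristic function factorises as $\varphi_{X+E}(\omega) = \varphi_X(\omega)\varphi_E(\omega)$ with $\varphi_E$ real and positive, so normalising the empirical characteristic function to unit modulus removes the noise component and keeps only the phase of $\varphi_X$.

```python
import numpy as np

def phase_features(bag, omegas, eps=1e-12):
    """Phase of the empirical characteristic function at the given frequencies.

    phi(w) = mean_j exp(i w^T x_j); dividing by |phi(w)| discards the modulus,
    which is where symmetric additive noise (with positive characteristic function) enters.
    """
    proj = bag @ omegas.T                      # (n, m)
    phi = np.mean(np.exp(1j * proj), axis=0)   # empirical characteristic function, (m,)
    phase = phi / (np.abs(phi) + eps)
    return np.concatenate([phase.real, phase.imag])

rng = np.random.default_rng(6)
omegas = rng.normal(0.0, 0.5, size=(50, 1))
clean = rng.normal(1.0, 1.0, size=(2000, 1))
noisy = clean + rng.normal(0.0, 0.5, size=(2000, 1))   # symmetric Gaussian measurement noise

# The two bags receive similar phase representations despite the added noise.
print(np.linalg.norm(phase_features(clean, omegas) - phase_features(noisy, omegas)))
```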

  14. Random Fourier Features: the Inverse Kernel Trick. Bochner's representation: assume that $k$ is a positive definite translation-invariant kernel on $\mathbb{R}^p$. Then $k$ can be written as
$$k(x, y) = \int_{\mathbb{R}^p} \exp\!\left(i\omega^\top (x - y)\right) d\Lambda(\omega) = \int_{\mathbb{R}^p} \left[\cos(\omega^\top x)\cos(\omega^\top y) + \sin(\omega^\top x)\sin(\omega^\top y)\right] d\Lambda(\omega)$$
for some positive measure (w.l.o.g. a probability distribution) $\Lambda$. Sample $m$ frequencies $\Omega = \{\omega_j\}_{j=1}^m \sim \Lambda$ and use a Monte Carlo estimator of the kernel function instead [Rahimi & Recht, 2007]:
$$\hat{k}(x, y) = \frac{1}{m} \sum_{j=1}^m \left[\cos(\omega_j^\top x)\cos(\omega_j^\top y) + \sin(\omega_j^\top x)\sin(\omega_j^\top y)\right] = \left\langle \xi_\Omega(x), \xi_\Omega(y) \right\rangle_{\mathbb{R}^{2m}},$$
with an explicit set of features $\xi_\Omega : x \mapsto \frac{1}{\sqrt{m}}\left[\cos(\omega_1^\top x), \sin(\omega_1^\top x), \ldots, \cos(\omega_m^\top x), \sin(\omega_m^\top x)\right]^\top$. How fast does $m$ need to grow with $n$? It can be sublinear for regression [Bach, 2015].
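A minimal sketch of the construction above for the Gaussian RBF kernel, whose spectral measure $\Lambda$ is itself Gaussian: sampled frequencies give explicit features whose inner products approximate $k(x, y)$ up to Monte Carlo error. The bandwidth, dimensions and number of features are illustrative.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    d2 = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2 * x @ y.T
    return np.exp(-d2 / (2 * sigma**2))

def sample_frequencies(p, m, sigma=1.0, seed=0):
    """For the Gaussian kernel with bandwidth sigma, Lambda = N(0, sigma^{-2} I)."""
    rng = np.random.default_rng(seed)
    return rng.normal(scale=1.0 / sigma, size=(m, p))

def rff_map(X, omegas):
    """xi_Omega(x) = (1/sqrt(m)) [cos(w_1^T x), sin(w_1^T x), ..., cos(w_m^T x), sin(w_m^T x)]."""
    m = omegas.shape[0]
    proj = X @ omegas.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

rng = np.random.default_rng(7)
X = rng.normal(size=(5, 3))
omegas = sample_frequencies(p=3, m=2000, sigma=1.0)
Phi = rff_map(X, omegas)
print(np.max(np.abs(Phi @ Phi.T - rbf_kernel(X, X))))   # Monte Carlo error, shrinking as O(1/sqrt(m))
```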
