Duality in vv-RKHSs with Infinite Dimensional Outputs: Application to Robust Losses


  1. Duality in vv-RKHSs with Infinite Dimensional Outputs: Application to Robust Losses. Pierre Laforgue, Alex Lambert, Luc Brogat-Motte, Florence d'Alché-Buc (LTCI, Télécom Paris, Institut Polytechnique de Paris, France)

  2. Outline: Motivations, A duality theory for general OVKs, Robust losses as convolutions, Experiments, Conclusion

  3. Motivation 1: structured prediction by surrogate approach. Kernel trick in the input space, kernel trick in the output space [Cortes '05, Geurts '06, Brouard '11, Kadri '13, Brouard '16]: Input Output Kernel Regression (IOKR).

     $$\hat{h} = \operatorname*{argmin}_{h \in \mathcal{H}_\mathcal{K}} \frac{1}{2n} \sum_{i=1}^n \|\phi(y_i) - h(x_i)\|_{\mathcal{F}_\mathcal{Y}}^2 + \frac{\Lambda}{2}\|h\|_{\mathcal{H}_\mathcal{K}}^2, \qquad \hat{g}(x) = \operatorname*{argmin}_{y \in \mathcal{Y}} \|\phi(y) - \hat{h}(x)\|_{\mathcal{F}_\mathcal{Y}}^2$$
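     The two-step scheme above can be made concrete in a few lines. Below is a minimal numpy sketch of IOKR, assuming Gaussian kernels on both inputs and outputs and decoding over a finite candidate set; `gaussian_gram`, `iokr_fit`, `iokr_decode` and the candidate-set decoding are illustrative choices, not the slides' exact setup. The constant term $\|\hat{h}(x)\|^2$ is dropped since it does not depend on the candidate.

```python
import numpy as np

def gaussian_gram(U, V, gamma=1.0):
    """Gram matrix k(u, v) = exp(-gamma * ||u - v||^2), rows of U vs rows of V."""
    d2 = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def iokr_fit(X, lam=0.1):
    """Ridge step in the output feature space: hat{h}(x) = sum_i alpha_i(x) phi(y_i),
    with alpha(x) = (K + n*lam*I)^{-1} k_x(x)."""
    n = len(X)
    Kx = gaussian_gram(X, X)
    return np.linalg.solve(Kx + n * lam * np.eye(n), np.eye(n))

def iokr_decode(x, X, Y, A, candidates):
    """Pre-image step: g(x) = argmin_c ||phi(c) - hat{h}(x)||^2 over a candidate set."""
    alpha = A @ gaussian_gram(X, x[None, :]).ravel()  # combination weights alpha(x)
    Kyc = gaussian_gram(Y, candidates)                # <phi(y_i), phi(c)>
    Kcc = gaussian_gram(candidates, candidates)       # ||phi(c)||^2 on the diagonal
    scores = np.diag(Kcc) - 2 * alpha @ Kyc           # ||phi(c) - h(x)||^2 up to a constant
    return candidates[np.argmin(scores)]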

  4. Motivation 2: function-to-function regression. [Figure: EMG curves (Millivolts vs. seconds) and lip acceleration curves (Meters/s² vs. seconds).]

     $$\min_{h \in \mathcal{H}_\mathcal{K}} \frac{1}{2n} \sum_{i=1}^n \|y_i - h(x_i)\|_{L^2}^2 + \frac{\Lambda}{2}\|h\|_{\mathcal{H}_\mathcal{K}}^2 \qquad \text{[Kadri et al., 2016]}$$

     And many more! E.g. structured data autoencoding [Laforgue et al., 2019]:

     $$\min_{h_1, h_2 \in \mathcal{H}_\mathcal{K}^1 \times \mathcal{H}_\mathcal{K}^2} \frac{1}{2n} \sum_{i=1}^n \|\phi(x_i) - h_2 \circ h_1(\phi(x_i))\|_{\mathcal{F}_\mathcal{X}}^2 + \Lambda\, \mathrm{Reg}(h_1, h_2).$$

  5. Purpose of this work. Question: is it possible to extend the previous approaches to different (ideally robust) loss functions? First answer: yes, possible extensions to maximum-margin regression [Brouard et al., 2016] and to ε-insensitive loss functions for matrix-valued kernels [Sangnier et al., 2017]. What about general Operator-Valued Kernels (OVKs)? What about other types of loss functions?

  6. Outline: Motivations, A duality theory for general OVKs, Robust losses as convolutions, Experiments, Conclusion

  7. Learning in vector-valued RKHSs (vv-RKHSs)
     • An OVK is a map $\mathcal{K}: \mathcal{X} \times \mathcal{X} \to \mathcal{L}(\mathcal{Y})$ s.t. $\mathcal{K}(x, x') = \mathcal{K}(x', x)^*$ and $\sum_{i,j} \langle y_i, \mathcal{K}(x_i, x_j) y_j \rangle_\mathcal{Y} \ge 0$
     • Unique vv-RKHS $\mathcal{H}_\mathcal{K} \subset \mathcal{F}(\mathcal{X}, \mathcal{Y})$, with $\mathcal{H}_\mathcal{K} = \mathrm{Span}\{\mathcal{K}(\cdot, x)y : (x, y) \in \mathcal{X} \times \mathcal{Y}\}$
     • Ex: decomposable OVK $\mathcal{K}(x, x') = k(x, x')A$, with $k$ scalar, $A$ p.s.d. on $\mathcal{Y}$

  8. Learning in vector-valued RKHSs (vv-RKHSs), continued. For $\{(x_i, y_i)\}_{i=1}^n \in (\mathcal{X} \times \mathcal{Y})^n$ with $\mathcal{Y}$ a Hilbert space, we want to find:

     $$\hat{h} \in \operatorname*{argmin}_{h \in \mathcal{H}_\mathcal{K}} \frac{1}{n} \sum_{i=1}^n \ell(h(x_i), y_i) + \frac{\Lambda}{2}\|h\|_{\mathcal{H}_\mathcal{K}}^2.$$

     Representer theorem [Micchelli and Pontil, 2005]:

     $$\exists\, (\hat{\alpha}_i)_{i=1}^n \in \mathcal{Y}^n \ (\text{infinite dimensional!}) \quad \text{s.t.} \quad \hat{h} = \sum_{i=1}^n \mathcal{K}(\cdot, x_i)\,\hat{\alpha}_i.$$

     When $\ell(\cdot, \cdot) = \frac{1}{2}\|\cdot - \cdot\|_\mathcal{Y}^2$ and $\mathcal{K} = k \cdot I_\mathcal{Y}$: $\hat{\alpha}_i = \sum_{j=1}^n A_{ij}\, y_j$, with $A = (K + n\Lambda I_n)^{-1}$.
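     Why the closed form holds: with the identity-decomposable kernel $\mathcal{K} = k \cdot I_\mathcal{Y}$ and the square loss, the vector-valued problem decouples across output coordinates into independent scalar kernel ridge regressions, each solved by $(K + n\Lambda I_n)^{-1}$. A quick numerical check of this decoupling (numpy, synthetic data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p, lam = 40, 3, 5, 0.1
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, p))

K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # scalar Gram k(x_i, x_j)
alpha = np.linalg.solve(K + n * lam * np.eye(n), Y)          # hat{alpha}_i = sum_j A_ij y_j

# With K(x, x') = k(x, x') I_Y, each output coordinate is an
# independent scalar kernel ridge regression with the same system matrix.
for c in range(p):
    a_c = np.linalg.solve(K + n * lam * np.eye(n), Y[:, c])
    assert np.allclose(alpha[:, c], a_c)
```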

  9. Applying duality.

     $$\hat{h} \in \operatorname*{argmin}_{h \in \mathcal{H}_\mathcal{K}} \frac{1}{n} \sum_{i=1}^n \ell_i(h(x_i)) + \frac{\Lambda}{2}\|h\|_{\mathcal{H}_\mathcal{K}}^2 \quad \text{is given by} \quad \hat{h} = \frac{1}{\Lambda n} \sum_{i=1}^n \mathcal{K}(\cdot, x_i)\,\hat{\alpha}_i,$$

     with $(\hat{\alpha}_i)_{i=1}^n \in \mathcal{Y}^n$ the solutions to the dual problem:

     $$\min_{(\alpha_i)_{i=1}^n \in \mathcal{Y}^n} \sum_{i=1}^n \ell_i^\star(-\alpha_i) + \frac{1}{2\Lambda n} \sum_{i,j=1}^n \langle \alpha_i, \mathcal{K}(x_i, x_j)\alpha_j \rangle_\mathcal{Y},$$

     with $f^\star: \alpha \in \mathcal{Y} \mapsto \sup_{y \in \mathcal{Y}} \langle \alpha, y \rangle_\mathcal{Y} - f(y)$ the Fenchel-Legendre transform of $f$.

  10. Applying duality, continued.
     • 1st limitation: the FL transform $\ell^\star$ needs to be computable ($\to$ assumption)
     • 2nd limitation: the dual variables $(\alpha_i)_{i=1}^n$ are still infinite dimensional!

  11. Applying duality, continued. If the span of the training outputs $\mathcal{Y}_n = \mathrm{Span}\{y_j : j \le n\}$ is invariant under $\mathcal{K}$, i.e. for all $(x, x')$, $y \in \mathcal{Y}_n \Rightarrow \mathcal{K}(x, x')y \in \mathcal{Y}_n$, then $\hat{\alpha}_i \in \mathcal{Y}_n$, which makes possible the reparametrization $\hat{\alpha}_i = \sum_j \hat{\omega}_{ij}\, y_j$.
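     As a sanity check (a worked example, not on the slides), the square loss makes every step explicit. With $\ell_i(y) = \frac{1}{2}\|y - y_i\|_\mathcal{Y}^2$,

     $$\ell_i^\star(\alpha) = \sup_{y \in \mathcal{Y}} \langle \alpha, y \rangle_\mathcal{Y} - \tfrac{1}{2}\|y - y_i\|_\mathcal{Y}^2 = \langle \alpha, y_i \rangle_\mathcal{Y} + \tfrac{1}{2}\|\alpha\|_\mathcal{Y}^2 \quad (\text{attained at } y = y_i + \alpha),$$

     so $\ell_i^\star(-\alpha_i) = \tfrac{1}{2}\|\alpha_i\|_\mathcal{Y}^2 - \langle \alpha_i, y_i \rangle_\mathcal{Y}$. For $\mathcal{K} = k \cdot I_\mathcal{Y}$ the dual's first-order condition reads $\alpha_i - y_i + \frac{1}{\Lambda n}\sum_j K_{ij}\alpha_j = 0$, i.e. $\hat{\alpha}_i = \sum_j \hat{\omega}_{ij}\, y_j$ with $\hat{\Omega} = (I_n + K/(\Lambda n))^{-1}$: each $\hat{\alpha}_i$ indeed lies in $\mathcal{Y}_n$, and since $\frac{1}{\Lambda n}(I_n + K/(\Lambda n))^{-1} = (K + n\Lambda I_n)^{-1}$, the dual route recovers exactly the closed form of slide 8.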

  12. The double representer theorem (1/2). Assume that the OVK $\mathcal{K}$ and the loss $\ell$ satisfy the appropriate assumptions (see paper for details, verified by standard kernels and losses). Then

     $$\hat{h} = \operatorname*{argmin}_{h \in \mathcal{H}_\mathcal{K}} \frac{1}{n} \sum_i \ell(h(x_i), y_i) + \frac{\Lambda}{2}\|h\|_{\mathcal{H}_\mathcal{K}}^2 \quad \text{is given by} \quad \hat{h} = \frac{1}{\Lambda n} \sum_{i,j=1}^n \mathcal{K}(\cdot, x_i)\,\hat{\omega}_{ij}\, y_j,$$

     with $\hat{\Omega} = [\hat{\omega}_{ij}] \in \mathbb{R}^{n \times n}$ the solution to the finite dimensional problem

     $$\min_{\Omega \in \mathbb{R}^{n \times n}} \sum_{i=1}^n L_i\big(\Omega_{i:}, K^\mathcal{Y}\big) + \frac{1}{2\Lambda n} \mathrm{Tr}\big(\tilde{M}^\top (\Omega \otimes \Omega)\big),$$

     with $\tilde{M}$ the $n^2 \times n^2$ matrix writing of the tensor $M$ s.t. $M_{ijkl} = \langle y_k, \mathcal{K}(x_i, x_j)\, y_l \rangle_\mathcal{Y}$.

  13. The double representer theorem (2/2). If $\mathcal{K}$ further satisfies $\mathcal{K}(x, x') = \sum_t k_t(x, x')\, A_t$, then the tensor $M$ simplifies to $M_{ijkl} = \sum_t [K_t^\mathcal{X}]_{ij}\, [K_t^\mathcal{Y}]_{kl}$ and the problem rewrites

     $$\min_{\Omega \in \mathbb{R}^{n \times n}} \sum_{i=1}^n L_i\big(\Omega_{i:}, K^\mathcal{Y}\big) + \frac{1}{2\Lambda n} \sum_{t=1}^T \mathrm{Tr}\big(K_t^\mathcal{X}\, \Omega\, K_t^\mathcal{Y}\, \Omega^\top\big).$$

     Rmk. Only the $n^4$ tensor $\langle y_k, \mathcal{K}(x_i, x_j)\, y_l \rangle_\mathcal{Y}$ is needed to learn OVK machines.
     Rmk. It simplifies to just the two $n \times n$ matrices $[K^\mathcal{X}]_{ij}$, $[K^\mathcal{Y}]_{kl}$ if $\mathcal{K}$ is decomposable.
     How to apply the duality approach?
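     The simplification is easy to verify numerically for a decomposable kernel ($T = 1$): contracting the full $n^4$ tensor against $\Omega$ twice agrees with the trace shortcut. A small numpy check (synthetic data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, p = 12, 3, 4
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, p))

Kx = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # scalar input Gram
Ky = Y @ Y.T                                                  # output Gram <y_k, y_l>
Omega = rng.normal(size=(n, n))

# Full n^4 tensor M_ijkl = <y_k, K(x_i, x_j) y_l> = [Kx]_ij [Ky]_kl
M = np.einsum('ij,kl->ijkl', Kx, Ky)

# Quadratic term of the dual, sum_{ijkl} Omega_ik M_ijkl Omega_jl ...
quad_tensor = np.einsum('ik,ijkl,jl->', Omega, M, Omega)
# ... equals the trace shortcut Tr(Kx Omega Ky Omega^T)
quad_trace = np.trace(Kx @ Omega @ Ky @ Omega.T)

assert np.isclose(quad_tensor, quad_trace)
```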

  14. Outline: Motivations, A duality theory for general OVKs, Robust losses as convolutions, Experiments, Conclusion

  15. Infimal convolution and Fenchel-Legendre transforms. The infimal-convolution operator $\square$ between proper lower semicontinuous functions [Bauschke et al., 2011]:

     $$(f \,\square\, g)(x) = \inf_y f(y) + g(x - y).$$

     Relation to the FL transform: $(f \,\square\, g)^\star = f^\star + g^\star$.
     Ex: ε-insensitive losses. Let $\ell: \mathcal{Y} \to \mathbb{R}$ be a convex loss with unique minimum at $0$, and $\epsilon > 0$. The ε-insensitive version of $\ell$, denoted $\ell_\epsilon$, is defined by:

     $$\ell_\epsilon(y) = (\ell \,\square\, \chi_{B_\epsilon})(y) = \begin{cases} \ell(0) & \text{if } \|y\|_\mathcal{Y} \le \epsilon \\ \inf_{\|d\|_\mathcal{Y} \le 1} \ell(y - \epsilon d) & \text{otherwise} \end{cases}$$

     and has FL transform:

     $$\ell_\epsilon^\star(y) = (\ell \,\square\, \chi_{B_\epsilon})^\star(y) = \ell^\star(y) + \epsilon \|y\|_\mathcal{Y}.$$
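     To make the recipe concrete (a worked example, not on the slides): the conjugate of the Hilbert norm is the indicator of the unit ball, $\|\cdot\|^\star = \chi_{B_1}$, the conjugate of $\chi_{B_\epsilon}$ is $\epsilon\|\cdot\|$, and $(\frac{1}{2}\|\cdot\|^2)^\star = \frac{1}{2}\|\cdot\|^2$, so for the three losses of the next slide:

     $$\Big(\tfrac{1}{2}\|\cdot\|^2 \,\square\, \chi_{B_\epsilon}\Big)^\star = \tfrac{1}{2}\|\cdot\|^2 + \epsilon\|\cdot\|, \qquad \Big(\|\cdot\| \,\square\, \chi_{B_\epsilon}\Big)^\star = \chi_{B_1} + \epsilon\|\cdot\|, \qquad \Big(\kappa\|\cdot\| \,\square\, \tfrac{1}{2}\|\cdot\|^2\Big)^\star = \chi_{B_\kappa} + \tfrac{1}{2}\|\cdot\|^2.$$

     These three conjugates are exactly what produce the $\epsilon\|W\|_{2,1}$ penalties and the $\|W\|_{2,\infty}$ constraints in the dual problems of slide 17.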

  16. Interesting loss functions: sparsity and robustness. [Figure: 1-D profiles and 2-D surfaces of the three losses, each compared to $\|x\|$, $\frac{1}{2}\|x\|^2$ and the Huber loss.]
     • ε-Ridge: $\frac{1}{2}\|\cdot\|^2 \,\square\, \chi_{B_\epsilon}$ (sparsity)
     • ε-SVR: $\|\cdot\| \,\square\, \chi_{B_\epsilon}$ (sparsity, robustness)
     • κ-Huber: $\kappa\|\cdot\| \,\square\, \frac{1}{2}\|\cdot\|^2$ (robustness)

  17. Specific dual problems. For the ε-ridge, the ε-SVR and the κ-Huber, it holds $\hat{\Omega} = \hat{W} V^{-1}$, with $\hat{W}$ the solution to these finite dimensional dual problems:

     $$(D1)\quad \min_{W \in \mathbb{R}^{n \times n}} \frac{1}{2}\|AW - B\|_{\mathrm{Fro}}^2 + \epsilon \|W\|_{2,1},$$

     $$(D2)\quad \min_{W \in \mathbb{R}^{n \times n}} \frac{1}{2}\|AW - B\|_{\mathrm{Fro}}^2 + \epsilon \|W\|_{2,1} \quad \text{s.t.} \quad \|W\|_{2,\infty} \le 1,$$

     $$(D3)\quad \min_{W \in \mathbb{R}^{n \times n}} \frac{1}{2}\|AW - B\|_{\mathrm{Fro}}^2 \quad \text{s.t.} \quad \|W\|_{2,\infty} \le \kappa,$$

     with $V$, $A$, $B$ such that $VV^\top = K^\mathcal{Y}$, $A^\top A = K^\mathcal{X}/(\Lambda n) + I_n$ (or $A^\top A = K^\mathcal{X}/(\Lambda n)$ for the ε-SVR), and $A^\top B = V$.
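     Problem (D1) has the form of a multi-task group lasso, so any proximal gradient solver applies; below is a minimal sketch (numpy, fixed step size, illustrative names; the paper's own solver may differ). For (D2) and (D3), since all the dual terms act row-wise and radially, the proximal step becomes the same row-wise soft-thresholding followed by projection onto the ball of radius 1, or the projection onto the ball of radius $\kappa$ alone.

```python
import numpy as np

def solve_d1(A, B, eps, n_iter=500):
    """Proximal gradient on (D1): min_W 0.5 * ||A W - B||_F^2 + eps * ||W||_{2,1},
    where ||W||_{2,1} is the sum of the Euclidean norms of the rows of W."""
    W = np.zeros((A.shape[1], B.shape[1]))
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        G = A.T @ (A @ W - B)                   # gradient of the smooth part
        W = W - step * G
        # Proximal operator of step * eps * ||.||_{2,1}: row-wise soft-thresholding;
        # rows shrunk exactly to zero give the sparsity observed in the experiments.
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        W = np.maximum(0.0, 1.0 - step * eps / np.maximum(norms, 1e-12)) * W
    return W
```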

  18. Outline: Motivations, A duality theory for general OVKs, Robust losses as convolutions, Experiments, Conclusion

  19. Surrogate approaches for structured prediction.
     • Experiments on the YEAST dataset
     • Empirically, ε-SV-IOKR outperforms ridge-IOKR for a wide range of ε
     • Promotes sparsity and acts as a regularizer
     Figure 1: test MSEs and sparsity (% null components) w.r.t. $\Lambda$ for several ε, compared to KRR.

  20. Robust function-to-function regression. Task from [Kadri et al., 2016]: predict lip acceleration from EMG signals.
     • Dataset augmented with outliers, model learned with the Huber loss
     • Improvement over ridge regression ($\kappa = +\infty$) for every output size $m$ (see paper for the approximation)
     Figure 2: LOO generalization error w.r.t. $\kappa$, for $m \in \{4, 5, 6, 7, 15\}$.

  21. Outline: Motivations, A duality theory for general OVKs, Robust losses as convolutions, Experiments, Conclusion
