RegML 2016 Class 4: Regularization for multi-task learning (Lorenzo Rosasco)


  1. RegML 2016 Class 4: Regularization for multi-task learning. Lorenzo Rosasco, UNIGE-MIT-IIT. June 28, 2016

  2. Supervised learning so far
     ◮ Regression: f : X → Y ⊆ ℝ
     ◮ Classification: f : X → Y = {−1, 1}
     What next?
     ◮ Vector-valued: f : X → Y ⊆ ℝ^T
     ◮ Multiclass: f : X → Y = {1, 2, . . . , T}
     ◮ ...

  3. Multitask learning
     Given S_1 = (x_i^1, y_i^1)_{i=1}^{n_1}, . . . , S_T = (x_i^T, y_i^T)_{i=1}^{n_T},
     find f_1 : X_1 → Y_1, . . . , f_T : X_T → Y_T

  4. Multitask learning
     Given S_1 = (x_i^1, y_i^1)_{i=1}^{n_1}, . . . , S_T = (x_i^T, y_i^T)_{i=1}^{n_T},
     find f_1 : X_1 → Y_1, . . . , f_T : X_T → Y_T
     ◮ Vector valued regression: S_n = (x_i, y_i)_{i=1}^n, x_i ∈ X, y_i ∈ ℝ^T
       (MTL with equal inputs! Output coordinates are the “tasks”)
     ◮ Multiclass: S_n = (x_i, y_i)_{i=1}^n, x_i ∈ X, y_i ∈ {1, . . . , T}

  5. Why MTL? [Figure: two scatter plots of Y vs. X, one for Task 1 and one for Task 2]

  6. Why MTL? [Figure: four panels of task data, y-axis 0–60, x-axis 0–25] Real data!

  7. Why MTL?
     Related problems:
     ◮ conjoint analysis
     ◮ transfer learning
     ◮ collaborative filtering
     ◮ co-kriging
     Examples of applications:
     ◮ geophysics
     ◮ music recommendation (Dinuzzo 08)
     ◮ pharmacological data (Pillonetto et al. 08)
     ◮ binding data (Jacob et al. 08)
     ◮ movie recommendation (Abernethy et al. 08)
     ◮ HIV therapy screening (Bickel et al. 08)

  8. Why MTL? VVR, e.g. vector field estimation

  9. Why MTL? [Figure: a vector-valued function plotted as Y vs. X, one panel per component, Component 1 and Component 2]

  10. Penalized regularization for MTL
      err(w_1, . . . , w_T) + pen(w_1, . . . , w_T)
      We start with linear models: f_1(x) = w_1^⊤ x, . . . , f_T(x) = w_T^⊤ x

  11. Empirical error
      E(w_1, . . . , w_T) = Σ_{i=1}^T (1/n_i) Σ_{j=1}^{n_i} (y_j^i − w_i^⊤ x_j^i)²
      ◮ could consider other losses
      ◮ could try to “couple” errors

  12. Least squares error
      We focus on vector valued regression (VVR):
      S_n = (x_i, y_i)_{i=1}^n, x_i ∈ X, y_i ∈ ℝ^T

  13. Least squares error
      We focus on vector valued regression (VVR):
      S_n = (x_i, y_i)_{i=1}^n, x_i ∈ X, y_i ∈ ℝ^T
      (1/n) Σ_{t=1}^T Σ_{i=1}^n (y_i^t − w_t^⊤ x_i)² = (1/n) ‖X̂ W − Ŷ‖_F²
      with X̂ ∈ ℝ^{n×d}, W ∈ ℝ^{d×T}, Ŷ ∈ ℝ^{n×T}, where
      W = (w_1, . . . , w_T), Ŷ_{it} = y_i^t for i = 1 . . . n, t = 1 . . . T,
      and ‖W‖_F² = Tr(W^⊤ W)
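A minimal NumPy sketch checking that the sum over tasks and samples matches the Frobenius-norm form; all shapes and data below are synthetic placeholders, not from the slides:

     import numpy as np

     # Synthetic placeholders: X is n x d, W is d x T (columns w_t), Y is n x T.
     n, d, T = 100, 5, 3
     rng = np.random.default_rng(0)
     X = rng.standard_normal((n, d))
     W = rng.standard_normal((d, T))
     Y = rng.standard_normal((n, T))

     # Empirical least squares error, written both ways.
     err_sums = sum((Y[i, t] - W[:, t] @ X[i]) ** 2
                    for t in range(T) for i in range(n)) / n
     err_frob = np.linalg.norm(X @ W - Y, "fro") ** 2 / n
     assert np.isclose(err_sums, err_frob)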

  14. MTL by regularization
      pen(w_1, . . . , w_T)
      ◮ Coupling task solutions by regularization
      ◮ Borrowing strength
      ◮ Exploiting structure

  15. Regularizations for MTL
      pen(w_1, . . . , w_T) = Σ_{t=1}^T ‖w_t‖²

  16. Regularizations for MTL
      pen(w_1, . . . , w_T) = Σ_{t=1}^T ‖w_t‖²
      Single-task regularization! The problem decouples:
      min_{w_1, ..., w_T} Σ_{t=1}^T ( (1/n) Σ_{i=1}^n (y_i^t − w_t^⊤ x_i)² + λ ‖w_t‖² )
      = Σ_{t=1}^T ( min_{w_t} (1/n) Σ_{i=1}^n (y_i^t − w_t^⊤ x_i)² + λ ‖w_t‖² )
      i.e. T independent regularized least squares problems (see the sketch below).
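Since the penalty does not couple the tasks, each column of W can be found by an ordinary ridge regression. A minimal sketch; the function and variable names are illustrative, not from the slides:

     import numpy as np

     def single_task_ridge(X, Y, lam):
         """One independent ridge regression per task (the A = I case)."""
         n, d = X.shape
         # Each column w_t solves (X^T X / n + lam I) w_t = X^T y_t / n;
         # with a shared design matrix the solve can be batched over tasks.
         G = X.T @ X / n + lam * np.eye(d)
         return np.linalg.solve(G, X.T @ Y / n)  # d x T, columns are w_t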

  17. Regularizations for MTL
      ◮ Isotropic coupling
        (1 − α) Σ_{j=1}^T ‖w_j‖² + α Σ_{i=1}^T ‖ w_i − (1/T) Σ_{j=1}^T w_j ‖²

  18. Regularizations for MTL
      ◮ Isotropic coupling
        (1 − α) Σ_{j=1}^T ‖w_j‖² + α Σ_{i=1}^T ‖ w_i − (1/T) Σ_{j=1}^T w_j ‖²
      ◮ Graph coupling: let M ∈ ℝ^{T×T} be an adjacency matrix with M_{ts} ≥ 0,
        Σ_{t=1}^T Σ_{s=1}^T M_{ts} ‖w_t − w_s‖² + γ Σ_{t=1}^T ‖w_t‖²
        special case: outputs divided into clusters

  19. A general form of regularization
      All the regularizers so far are of the form
      Σ_{t=1}^T Σ_{s=1}^T A_{ts} w_t^⊤ w_s
      for a suitable positive definite matrix A

  20. MTL regularization revisited
      ◮ Single tasks: Σ_{j=1}^T ‖w_j‖²  ⟹  A = I

  21. MTL regularization revisited
      ◮ Single tasks: Σ_{j=1}^T ‖w_j‖²  ⟹  A = I
      ◮ Isotropic coupling:
        (1 − α) Σ_{j=1}^T ‖w_j‖² + α Σ_{j=1}^T ‖ w_j − (1/T) Σ_{j'=1}^T w_{j'} ‖²
        ⟹  A = I − (α/T) 𝟙, with 𝟙 the T × T all-ones matrix

  22. MTL regularization revisited
      ◮ Single tasks: Σ_{j=1}^T ‖w_j‖²  ⟹  A = I
      ◮ Isotropic coupling:
        (1 − α) Σ_{j=1}^T ‖w_j‖² + α Σ_{j=1}^T ‖ w_j − (1/T) Σ_{j'=1}^T w_{j'} ‖²
        ⟹  A = I − (α/T) 𝟙
      ◮ Graph coupling:
        Σ_{t=1}^T Σ_{s=1}^T M_{ts} ‖w_t − w_s‖² + γ Σ_{t=1}^T ‖w_t‖²
        ⟹  A = L + γI, where L is the graph Laplacian of M:
        L = D − M, D = diag( Σ_j M_{1j}, . . . , Σ_j M_{Tj} )
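A small sketch of how the coupling matrices above can be assembled in NumPy; the two-cluster adjacency matrix below is a made-up example:

     import numpy as np

     def isotropic_A(T, alpha):
         # A = I - (alpha / T) * ones(T, T)
         return np.eye(T) - alpha / T * np.ones((T, T))

     def graph_A(M, gamma):
         # A = L + gamma I, with L = D - M the graph Laplacian of M.
         D = np.diag(M.sum(axis=1))
         return D - M + gamma * np.eye(M.shape[0])

     # Example: four tasks in two clusters, {0, 1} and {2, 3}.
     M = np.array([[0., 1., 0., 0.],
                   [1., 0., 0., 0.],
                   [0., 0., 0., 1.],
                   [0., 0., 1., 0.]])
     A = graph_A(M, gamma=0.1)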

  23. A general form of regularization
      Let W = (w_1, . . . , w_T), A ∈ ℝ^{T×T}. Note that
      Σ_{t=1}^T Σ_{s=1}^T A_{ts} w_t^⊤ w_s = Tr(W A W^⊤)

  24. A general form of regularization
      Let W = (w_1, . . . , w_T), A ∈ ℝ^{T×T}. Note that
      Σ_{t=1}^T Σ_{s=1}^T A_{ts} w_t^⊤ w_s = Tr(W A W^⊤)
      Indeed, writing W_i for the i-th row of W,
      Tr(W A W^⊤) = Σ_{i=1}^d W_i A W_i^⊤ = Σ_{i=1}^d Σ_{t,s=1}^T A_{ts} W_{it} W_{is}
                  = Σ_{t,s=1}^T A_{ts} Σ_{i=1}^d W_{it} W_{is} = Σ_{t,s=1}^T A_{ts} w_t^⊤ w_s
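The identity is easy to sanity-check numerically; a throwaway sketch with random matrices:

     import numpy as np

     # Check sum_{t,s} A_ts w_t^T w_s == Tr(W A W^T), with W d x T
     # (columns are the task vectors w_t) and A an arbitrary T x T matrix.
     rng = np.random.default_rng(0)
     d, T = 4, 3
     W = rng.standard_normal((d, T))
     A = rng.standard_normal((T, T))
     lhs = sum(A[t, s] * W[:, t] @ W[:, s]
               for t in range(T) for s in range(T))
     rhs = np.trace(W @ A @ W.T)
     assert np.isclose(lhs, rhs)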

  25. Computations
      (1/n) ‖X̂ W − Ŷ‖_F² + λ Tr(W A W^⊤)

  26. Computations
      (1/n) ‖X̂ W − Ŷ‖_F² + λ Tr(W A W^⊤)
      Consider the SVD A = U Σ U^⊤, Σ = diag(σ_1, . . . , σ_T)

  27. Computations
      (1/n) ‖X̂ W − Ŷ‖_F² + λ Tr(W A W^⊤)
      Consider the SVD A = U Σ U^⊤, Σ = diag(σ_1, . . . , σ_T),
      and let W̃ = W U, Ỹ = Ŷ U. Then we can rewrite the above problem as
      (1/n) ‖X̂ W̃ − Ỹ‖_F² + λ Tr(W̃ Σ W̃^⊤)

  28. Computations (cont.)
      Finally, rewrite
      (1/n) ‖X̂ W̃ − Ỹ‖_F² + λ Tr(W̃ Σ W̃^⊤)
      as
      Σ_{t=1}^T ( (1/n) Σ_{i=1}^n (ỹ_i^t − w̃_t^⊤ x_i)² + λ σ_t ‖w̃_t‖² )
      and recover W = W̃ U^⊤.
      Compare to single task regularization.
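Putting the rotation trick together gives a direct solver. A minimal sketch; the closed-form per-task solve is standard ridge algebra under the slides' assumptions (A symmetric), not spelled out on the slides:

     import numpy as np

     def mtl_ridge(X, Y, A, lam):
         """min_W (1/n)||XW - Y||_F^2 + lam Tr(W A W^T), via diagonalizing A."""
         n, d = X.shape
         sigma, U = np.linalg.eigh(A)       # A = U diag(sigma) U^T
         Y_rot = Y @ U                      # rotated outputs Y~
         XtX, XtY = X.T @ X / n, X.T @ Y_rot / n
         # One decoupled ridge problem per rotated task, penalty lam * sigma_t.
         W_rot = np.column_stack([
             np.linalg.solve(XtX + lam * s * np.eye(d), XtY[:, t])
             for t, s in enumerate(sigma)])
         return W_rot @ U.T                 # rotate back: W = W~ U^T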

  29. Computations (cont.)
      E_λ(W) = (1/n) ‖X̂ W − Ŷ‖_F² + λ Tr(W A W^⊤)
      Alternatively, use gradient descent:
      ∇E_λ(W) = (2/n) X̂^⊤ (X̂ W − Ŷ) + 2λ W A
      W_{t+1} = W_t − γ ∇E_λ(W_t)
      Trivially extends to other loss functions.
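A sketch of the iteration; the default step size below is a heuristic of mine, based on a bound on the gradient's Lipschitz constant, and is not prescribed by the slides:

     import numpy as np

     def mtl_gradient_descent(X, Y, A, lam, gamma=None, iters=1000):
         """Gradient descent on E(W) = (1/n)||XW - Y||_F^2 + lam Tr(W A W^T)."""
         n, d = X.shape
         if gamma is None:
             # Heuristic step size ~ 1 / (Lipschitz constant of the gradient).
             gamma = 1.0 / (2 * np.linalg.norm(X, 2) ** 2 / n
                            + 2 * lam * np.linalg.norm(A, 2))
         W = np.zeros((d, Y.shape[1]))
         for _ in range(iters):
             W = W - gamma * (2 / n * X.T @ (X @ W - Y) + 2 * lam * W @ A)
         return W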

  30. Beyond linearity
      f_t(x) = w_t^⊤ Φ(x), Φ(x) = (φ_1(x), . . . , φ_p(x))
      E_λ(W) = (1/n) ‖Φ̂ W − Ŷ‖² + λ Tr(W A W^⊤),
      with Φ̂ the matrix with rows Φ(x_1), . . . , Φ(x_n)

  31. Nonparametrics and kernels
      f_t(x) = Σ_{i=1}^n K(x, x_i) C_{it}
      with
      C_{ℓ+1} = C_ℓ − γ ( (2/n) (K̂ C_ℓ − Ŷ) + 2λ C_ℓ A )
      ◮ C_ℓ ∈ ℝ^{n×T}
      ◮ K̂ ∈ ℝ^{n×n}, K̂_{ij} = K(x_i, x_j)
      ◮ Ŷ ∈ ℝ^{n×T}, Ŷ_{ij} = y_i^j
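A minimal sketch of this kernelized iteration, plus prediction with the resulting coefficients; the step size gamma is left to the caller, as on the slides:

     import numpy as np

     def kernel_mtl(K, Y, A, lam, gamma, iters=1000):
         """Iterate C <- C - gamma ((2/n)(K C - Y) + 2 lam C A)."""
         n, T = Y.shape
         C = np.zeros((n, T))
         for _ in range(iters):
             C = C - gamma * (2 / n * (K @ C - Y) + 2 * lam * C @ A)
         return C

     def predict(K_new, C):
         # f_t(x) = sum_i K(x, x_i) C_it;
         # each row of K_new is (K(x, x_1), ..., K(x, x_n)) for a test point x.
         return K_new @ C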

  32. Spectral filtering for MTL
      Beyond penalization,
      min_W (1/n) ‖X̂ W − Ŷ‖² + λ Tr(W A W^⊤),
      other forms of regularization can be considered:
      ◮ projection
      ◮ early stopping

  33. Multiclass and MTL
      Y = {1, . . . , T}

  34. From Multiclass to MTL
      Encoding: for j = 1, . . . , T,
      j ↦ e_j, the canonical vector of ℝ^T;
      the problem reduces to vector valued regression.
      Decoding: for f(x) ∈ ℝ^T,
      f(x) ↦ argmax_{t=1,...,T} e_t^⊤ f(x) = argmax_{t=1,...,T} f_t(x)
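The encode/decode pair in a few lines of NumPy; the labels and shapes below are illustrative:

     import numpy as np

     def encode(labels, T):
         # Class j in {1, ..., T} maps to the canonical vector e_j of R^T.
         Y = np.zeros((len(labels), T))
         Y[np.arange(len(labels)), np.asarray(labels) - 1] = 1.0
         return Y

     def decode(F):
         # f(x) in R^T maps to argmax_t f_t(x); rows of F are predictions f(x).
         return F.argmax(axis=1) + 1

     assert (decode(encode([1, 3, 2], T=3)) == np.array([1, 3, 2])).all()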

  35. Single-task MTL and OVA
      Write
      min_W (1/n) ‖X̂ W − Ŷ‖² + λ Tr(W W^⊤)
      as
      Σ_{t=1}^T ( min_{w_t} (1/n) Σ_{i=1}^n (w_t^⊤ x_i − y_i^t)² + λ ‖w_t‖² )
      This is known as one versus all (OVA).

  36. Beyond OVA
      Consider
      min_W (1/n) ‖X̂ W − Ŷ‖² + λ Tr(W A W^⊤),
      that is,
      min_{w̃_1, ..., w̃_T} Σ_{t=1}^T ( (1/n) Σ_{i=1}^n (ỹ_i^t − w̃_t^⊤ x_i)² + λ σ_t ‖w̃_t‖² )
      Class relatedness is encoded in A.

  37. Back to MTL
      Σ_{t=1}^T (1/n_t) Σ_{j=1}^{n_t} (y_j^t − w_t^⊤ x_j^t)²
      ⇓
      ‖ (X̂ W − Y) ⊙ M ‖_F²
      with X̂ ∈ ℝ^{n×d}, W ∈ ℝ^{d×T}, Y, M ∈ ℝ^{n×T}, where
      ◮ ⊙ is the Hadamard (entrywise) product
      ◮ M is a mask (its nonzero entries can absorb the 1/n_t weights)
      ◮ Y has one non-zero value in each row
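A sketch of the masked objective; the task-membership bookkeeping in M is my reading of the slide's mask:

     import numpy as np

     def masked_error(X, W, Y, M):
         # ||(X W - Y) * M||_F^2: rows of X stack the inputs of all tasks;
         # M[i, t] is nonzero only if example i belongs to task t, so each
         # row of Y carries a single observed output.
         R = (X @ W - Y) * M
         return np.sum(R ** 2)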

  38. Computations
      min_W ‖ (X̂ W − Y) ⊙ M ‖_F² + λ Tr(W A W^⊤)
      ◮ can be rewritten using tensor calculus
      ◮ the computations for vector valued regression extend easily
      ◮ the sparsity of M can be exploited

  39. From MTL to matrix completion
      Special case: take d = n and X̂ = I,
      ‖ (X̂ W − Y) ⊙ M ‖_F²
      ⇓
      Σ_{t=1}^T Σ_{i=1}^n M_{it} (W_{it} − Y_{it})²

  40. Summary so far
      A regularization framework for
      ◮ VVR
      ◮ Multiclass
      ◮ MTL
      ◮ Matrix completion
      if the structure of the “tasks” is known. What if it is not?
