Multi-Task Learning and Matrix Regularization


  1. Multi-Task Learning and Matrix Regularization. Andreas Argyriou, TTI Chicago

  2. Outline • Multi-task learning and related problems • Multi-task feature learning (trace norm, Schatten $L_p$ norms, non-convex regularizers) • Representer theorems; “kernelization”

  3. Multi-Task Learning • Tasks $t = 1, \ldots, n$ • $m$ examples per task are given: $(x_{t1}, y_{t1}), \ldots, (x_{tm}, y_{tm}) \in X \times Y$ (simplification: sample sizes need not be equal; subsumes the case of common input data) • Predict using functions $f_t : X \to Y$, $t = 1, \ldots, n$ • When the tasks are related, learning the tasks jointly should perform better than learning each task independently • Especially important when few data points are available per task (small $m$); in such cases, independent learning is not successful

  4. Transfer • Want good generalization on the $n$ given tasks but also on new tasks (transfer learning) • Given a few examples from a new task $t'$, $\{(x_{t'1}, y_{t'1}), \ldots, (x_{t'\ell}, y_{t'\ell})\}$, want to learn $f_{t'}$ • Do this by “transferring” the common task structure / features learned from the $n$ tasks • Transfer is an important feature of human intelligence

  5. Multi-Task Applications • Marketing databases, collaborative filtering, recommendation systems (e.g. Netflix); task = product preferences for each person

  6. Matrix Completion • Matrix completion:
  $$\min_{W \in \mathbb{R}^{d \times n}} \ \mathrm{rank}(W) \quad \text{s.t.} \quad w_{ij} = y_{ij} \ \ \forall (i, j) \in E$$
  • Special case of multi-task learning (input vectors are elements of the canonical basis) • Each column of the matrix corresponds to the regression vector for a task; in matrix completion the emphasis is on recovery of the matrix, while in learning we are also interested in generalization
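
To make the reduction concrete, here is a minimal sketch (NumPy, with hypothetical variable names) of how each observed entry $(i, j) \in E$ becomes a training example $(e_i, y_{ij})$ for task $j$, since $\langle w_j, e_i \rangle = w_{ij}$:

```python
import numpy as np

d, n = 5, 4                                   # matrix dimensions (d x n)
E = [(0, 1), (2, 1), (4, 3)]                  # observed entries (i, j)
Y = {(0, 1): 0.7, (2, 1): -1.2, (4, 3): 3.1}  # observed values y_ij

# Each column j of W is the regression vector of task j.
# An observed entry (i, j) becomes the example (e_i, y_ij) for task j.
examples_per_task = {t: [] for t in range(n)}
for (i, j) in E:
    e_i = np.zeros(d)
    e_i[i] = 1.0                              # canonical basis vector as input
    examples_per_task[j].append((e_i, Y[(i, j)]))

print(examples_per_task[1])
```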

  7. Related Problems • Domain adaptation / transfer • Multi-view learning • Multi-label learning • Multi-task learning is a broad problem; no single method can solve everything

  8. Learning Multiple Tasks with a Common Kernel • Learn a common kernel $K(x, x') = \langle x, Dx' \rangle$ from a convex set of kernels:
  $$\inf_{\substack{w_1, \ldots, w_n \in \mathbb{R}^d \\ D \succ 0,\ \mathrm{tr}(D) \le 1}} \ \sum_{t=1}^{n} \sum_{i=1}^{m} E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \mathrm{tr}(W^\top D^{-1} W) \qquad (MTL)$$
  where $\mathrm{tr}(W^\top D^{-1} W) = \sum_{t=1}^{n} \langle w_t, D^{-1} w_t \rangle$ and $W = [\, w_1 \ \cdots \ w_n \,]$
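
As a sanity check on the identity above, the following sketch (NumPy, with random data and the square loss standing in for a generic $E$) evaluates the (MTL) objective and verifies that $\mathrm{tr}(W^\top D^{-1} W) = \sum_t \langle w_t, D^{-1} w_t \rangle$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m, gamma = 6, 3, 10, 0.1

X = rng.normal(size=(n, m, d))        # inputs x_ti
Y = rng.normal(size=(n, m))           # outputs y_ti
W = rng.normal(size=(d, n))           # columns are the task vectors w_t
A = rng.normal(size=(d, d))
D = A @ A.T / d                       # a positive definite D
D /= np.trace(D)                      # enforce tr(D) <= 1

# Empirical error term with the square loss (one possible choice of E)
loss = sum(((X[t] @ W[:, t] - Y[t]) ** 2).sum() for t in range(n))

# Penalty: tr(W^T D^{-1} W) equals the sum of <w_t, D^{-1} w_t> over tasks
D_inv = np.linalg.inv(D)
penalty_trace = np.trace(W.T @ D_inv @ W)
penalty_sum = sum(W[:, t] @ D_inv @ W[:, t] for t in range(n))
assert np.isclose(penalty_trace, penalty_sum)

objective = loss + gamma * penalty_trace
print(objective)
```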

  9. Learning Multiple Tasks with a Common Kernel • Jointly convex problem in $(W, D)$ • The choice of constraint $\mathrm{tr}(D) \le 1$ is important; intuitively, it penalizes the number of common features (eigenvectors of $D$) • Once we have learned $\hat{D}$, we can transfer it to the learning of a new task $t'$:
  $$\min_{w \in \mathbb{R}^d} \ \sum_{i=1}^{m} E(\langle w, x_{t'i} \rangle, y_{t'i}) + \gamma \langle w, \hat{D}^{-1} w \rangle$$
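
With the square loss, the transfer step above is a generalized ridge regression and has a closed form; a minimal sketch under that assumption (function name and arguments are hypothetical):

```python
import numpy as np

def transfer_to_new_task(X_new, y_new, D_hat, gamma):
    """Solve min_w sum_i (<w, x_i> - y_i)^2 + gamma <w, D_hat^{-1} w>
    in closed form (square-loss case), using the learned D_hat."""
    D_inv = np.linalg.inv(D_hat)
    # Normal equations: (X^T X + gamma D_hat^{-1}) w = X^T y
    return np.linalg.solve(X_new.T @ X_new + gamma * D_inv, X_new.T @ y_new)
```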

  10. Alternating Minimization Algorithm • Alternating minimization over $W$ and $D$:
     Initialization: given an initial $D$, e.g. $D = I_d / d$
     while the convergence condition is not true do
       for $t = 1, \ldots, n$: learn $w_t$ independently by minimizing $\sum_{i=1}^{m} E(\langle w, x_{ti} \rangle, y_{ti}) + \gamma \langle w, D^{-1} w \rangle$
       end for
       set $D = \dfrac{(W W^\top)^{1/2}}{\mathrm{tr}\big((W W^\top)^{1/2}\big)}$
     end while
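
A minimal sketch of this alternating scheme, assuming the square loss (so that the per-task step has a closed form), a fixed iteration count in place of the convergence test, and a small ridge eps added before the matrix square root for numerical stability; these are implementation choices, not part of the slide:

```python
import numpy as np
from scipy.linalg import sqrtm

def alternating_min(X, Y, gamma, iters=50, eps=1e-6):
    """X: (n, m, d) inputs, Y: (n, m) outputs. Returns (W, D)."""
    n, m, d = X.shape
    D = np.eye(d) / d                          # initialization D = I_d / d
    W = np.zeros((d, n))
    for _ in range(iters):                     # fixed iteration budget (simplification)
        D_inv = np.linalg.inv(D)
        # Step 1: learn each w_t independently (ridge-type closed form)
        for t in range(n):
            W[:, t] = np.linalg.solve(X[t].T @ X[t] + gamma * D_inv,
                                      X[t].T @ Y[t])
        # Step 2: D = (W W^T)^{1/2} / tr((W W^T)^{1/2})
        S = np.real(sqrtm(W @ W.T)) + eps * np.eye(d)
        D = S / np.trace(S)
    return W, D
```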

  11. Alternating Minimization (contd.)
  [Figure: left panel, objective function vs. #iterations; right panel, running time in seconds vs. #tasks; the alternating algorithm (green/blue curves) compared with gradient descent at learning rates η = 0.01, 0.03, 0.05]
  • Compare computational cost with a gradient descent on $W$ only ($\eta$ := learning rate)

  12. Alternating Minimization (contd.) • Small number of iterations (typically fewer than 50 in experiments) • Alternative algorithms: singular value thresholding [Cai et al. 2008], Bregman-type gradient descent [Ma et al. 2009], etc. • Non-SVD alternatives like [Rennie & Srebro 2005, Maurer 2007] or SOCP methods [Srebro et al. 2005, Liu and Vandenberghe 2008]

  13. Trace Norm Regularization • Problem (MTL) is equivalent to
  $$\min_{W \in \mathbb{R}^{d \times n}} \ \sum_{t=1}^{n} \sum_{i=1}^{m} E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \|W\|_{\mathrm{tr}}^2 \qquad (TR)$$
  • The trace norm (or nuclear norm) $\|W\|_{\mathrm{tr}}$ is the sum of the singular values of $W$: if $W = U \Sigma V^\top$, then $\|W\|_{\mathrm{tr}} = \sum_i \sigma_i(W)$
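
A quick numerical illustration of the definition (NumPy); the trace norm is the sum of the singular values, which NumPy also exposes directly as the nuclear norm:

```python
import numpy as np

W = np.random.default_rng(0).normal(size=(6, 3))
sigma = np.linalg.svd(W, compute_uv=False)     # singular values of W
trace_norm = sigma.sum()                       # ||W||_tr = sum_i sigma_i(W)
assert np.isclose(trace_norm, np.linalg.norm(W, ord='nuc'))
print(trace_norm)
```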

  14. Trace Norm vs. Rank • Problem (TR) is a convex relaxation of the problem
  $$\min_{W \in \mathbb{R}^{d \times n}} \ \sum_{t=1}^{n} \sum_{i=1}^{m} E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \mathrm{rank}(W)$$
  • The latter is an NP-hard problem • Rank and trace norm correspond to $L_0$ and $L_1$ on the vector of singular values • Hence one (qualified) interpretation: we want the task parameter vectors $w_t$ to lie on a low-dimensional subspace

  15. Machine Learning Interpretations • Learning a common linear kernel for all tasks (discussed already) • Maximum likelihood (learning a Gaussian covariance with fixed trace) • Matrix factorization:
  $$\|W\|_{\mathrm{tr}} = \tfrac{1}{2} \min_{F^\top G = W} \big( \|F\|_{\mathrm{Fr}}^2 + \|G\|_{\mathrm{Fr}}^2 \big)$$
  • MAP in a graphical model (as above) • Gaussian process interpretation
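
The factorization identity can be checked numerically: the factorization built from the SVD, $F = \Sigma^{1/2} U^\top$ and $G = \Sigma^{1/2} V^\top$, attains the minimum, while any other factorization with $F^\top G = W$ only gives an upper bound. A small sketch under those assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 4))
U, s, Vt = np.linalg.svd(W, full_matrices=False)
trace_norm = s.sum()

# Optimal factorization: F = S^{1/2} U^T, G = S^{1/2} V^T, so F^T G = U S V^T = W
S_half = np.diag(np.sqrt(s))
F, G = S_half @ U.T, S_half @ Vt
assert np.allclose(F.T @ G, W)
assert np.isclose(0.5 * (np.linalg.norm(F, 'fro')**2 + np.linalg.norm(G, 'fro')**2),
                  trace_norm)

# Any other factorization F2^T G2 = W only gives an upper bound
R = rng.normal(size=(4, 4))
F2, G2 = np.linalg.inv(R).T @ F, R @ G          # still satisfies F2^T G2 = W
assert np.allclose(F2.T @ G2, W)
assert 0.5 * (np.linalg.norm(F2, 'fro')**2 + np.linalg.norm(G2, 'fro')**2) >= trace_norm - 1e-9
```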

  16. “Rotation invariant” Group Lasso • Problem (MTL) is equivalent to
  $$\min_{\substack{A \in \mathbb{R}^{d \times n},\ U \in \mathbb{R}^{d \times d} \\ U^\top U = I}} \ \sum_{t=1}^{n} \sum_{i=1}^{m} E(\langle a_t, U^\top x_{ti} \rangle, y_{ti}) + \gamma \|A\|_{2,1}^2$$
  where $\|A\|_{2,1} := \sum_{i=1}^{d} \sqrt{\sum_{t=1}^{n} a_{it}^2}$
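
The $(2,1)$-norm is an $L_1$ norm over features (rows of $A$) of the $L_2$ norms taken across tasks, which encourages whole rows of $A$ to be zero; a short check in NumPy:

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(6, 3))   # rows = features, columns = tasks
norm_2_1 = np.sqrt((A ** 2).sum(axis=1)).sum()     # sum_i sqrt( sum_t a_it^2 )
print(norm_2_1)
```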

  17. Experiment (Computer Survey) • Consumers’ ratings of products [Lenk et al. 1996] • 180 persons (tasks) • 8 PC models (training examples) • 13 binary input variables (RAM, CPU, price, etc.) + bias term • Integer output in $\{0, \ldots, 10\}$ (likelihood of purchase) • The square loss was used

  18. Experiment (Computer Survey)

     Method                              RMSE
     Alternating Alg.                    1.93
     Hierarchical Bayes [Lenk et al.]    1.90
     Independent                         3.88
     Aggregate                           2.35
     Group Lasso                         2.01

  [Figure: components of the leading eigenvector $u_1$ of $D$ over the 13 input variables (TE, RAM, SC, CPU, HD, CD, CA, CO, AV, WA, SW, GU, PR)]
  • The most important feature (eigenvector of $D$) weighs technical characteristics (RAM, CPU, CD-ROM) vs. price

  19. Generalizations: Spectral Regularization • Generalize (MTL):
  $$\inf_{W \in \mathbb{R}^{d \times n}} \ \sum_{t=1}^{n} \sum_{i=1}^{m} E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \|W\|_p^2$$
  where $\|W\|_p$ is the Schatten $L_p$ norm, i.e. the $L_p$ norm of the singular values of $W$ • $L_1$-$L_2$ trade-off • Can be generalized to a family of spectral functions • A similar alternating algorithm can be used
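
A Schatten $L_p$ norm is the vector $L_p$ norm applied to the singular values; a small sketch (note that $p = 1$ recovers the trace norm, $p = 2$ the Frobenius norm, and $p = \infty$ the spectral norm):

```python
import numpy as np

def schatten_norm(W, p):
    """Schatten L_p norm: the vector L_p norm of the singular values of W."""
    sigma = np.linalg.svd(W, compute_uv=False)
    return np.linalg.norm(sigma, ord=p)

W = np.random.default_rng(0).normal(size=(5, 4))
assert np.isclose(schatten_norm(W, 1), np.linalg.norm(W, 'nuc'))   # trace norm
assert np.isclose(schatten_norm(W, 2), np.linalg.norm(W, 'fro'))   # Frobenius norm
assert np.isclose(schatten_norm(W, np.inf), np.linalg.norm(W, 2))  # spectral norm
```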

  20. Generalizations: Learning Groups of Tasks • Assume a heterogeneous environment, i.e. $K$ low-dimensional subspaces • Learn a partition of the tasks into $K$ groups:
  $$\inf_{\substack{D_1, \ldots, D_K \succ 0 \\ \mathrm{tr}(D_k) \le 1}} \ \sum_{t=1}^{n} \ \min_{k = 1, \ldots, K} \ \min_{w_t \in \mathbb{R}^d} \left[ \sum_{i=1}^{m} E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \langle w_t, D_k^{-1} w_t \rangle \right]$$
  • The representation learned is $(\hat{D}_1, \ldots, \hat{D}_K)$; we can transfer this representation to easily learn a new task • Non-convex problem; we use stochastic gradient descent

  21. Nonlinear Kernels • An important note: all methods presented satisfy a multi-task representer theorem (a type of necessary optimality condition) • This fact enables “kernelization”, i.e. we may use a given kernel (e.g. polynomial, RBF) via its Gram matrix • We now expand on this observation

  22. Representer Theorems • Consider any learning problem of the form
  $$\min_{w \in \mathbb{R}^d} \ \sum_{i=1}^{m} E(\langle w, x_i \rangle, y_i) + \Omega(w)$$
  • This problem can be “kernelized” if $\Omega$ satisfies the “classical” representer theorem
  $$\hat{w} = \sum_{i=1}^{m} c_i x_i$$
  (a necessary but not sufficient optimality condition)
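
As an illustration (my example, not from the slides), take $\Omega(w) = \gamma \langle w, w \rangle$ with the square loss: substituting $w = \sum_i c_i x_i$ reduces the problem to the coefficients $c$ and the Gram matrix alone, which is kernel ridge regression. A minimal sketch under those assumptions:

```python
import numpy as np

def kernel_ridge(K, y, gamma):
    """Given a Gram matrix K_ij = <x_i, x_j> (or any kernel evaluated on the data),
    solve for the representer coefficients c under the square loss and
    Omega(w) = gamma <w, w>:  w = sum_i c_i x_i,  c = (K + gamma I)^{-1} y."""
    m = K.shape[0]
    return np.linalg.solve(K + gamma * np.eye(m), y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 5)), rng.normal(size=20)
c = kernel_ridge(X @ X.T, y, gamma=0.1)   # linear kernel here; any Gram matrix works
w = X.T @ c                               # the solution lies in the span of the x_i
```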

  23. Representer Theorems (contd.)
  Theorem. The “classical” representer theorem for single-task learning holds if and only if there exists a nondecreasing function $h : \mathbb{R}_+ \to \mathbb{R}$ such that $\Omega(w) = h(\langle w, w \rangle)$ for all $w \in \mathbb{R}^d$ (under differentiability assumptions).
  • Sufficiency of the condition was known [Kimeldorf & Wahba, 1970; Schölkopf et al., 2001, etc.]

  24. Representer Theorems (contd.) • Sketch of the proof: an equivalent condition is $\Omega(w + p) \ge \Omega(w)$ for all $w, p$ such that $\langle w, p \rangle = 0$.
  [Figure: vectors $w$ and $w + p$ drawn from the origin $0$]

  25. Multi-Task Representer Theorems • Our multi-task formulations satisfy a multi-task representer theorem
  $$\hat{w}_t = \sum_{s=1}^{n} \sum_{i=1}^{m} c^{(t)}_{si} x_{si} \qquad \forall t \in \{1, \ldots, n\} \qquad (R.T.)$$
  • All tasks are involved in this expression (unlike the single-task representer theorem, which corresponds to Frobenius norm regularization) • Generally, consider any matrix optimization problem of the form
  $$\min_{w_1, \ldots, w_n \in \mathbb{R}^d} \ \sum_{t=1}^{n} \sum_{i=1}^{m} E(\langle w_t, x_{ti} \rangle, y_{ti}) + \Omega(W)$$

  26. Multi-Task Representer Theorems (contd.) • Definitions: $S^n_+$ = the positive semidefinite cone. The function $h : S^n_+ \to \mathbb{R}$ is matrix nondecreasing if $h(A) \le h(B)$ for all $A, B \in S^n_+$ such that $A \preceq B$.
  Theorem. The representer theorem (R.T.) holds if and only if there exists a matrix nondecreasing function $h : S^n_+ \to \mathbb{R}$ such that $\Omega(W) = h(W^\top W)$ for all $W \in \mathbb{R}^{d \times n}$ (under differentiability assumptions).
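
The condition $\Omega(W) = h(W^\top W)$ is exactly what makes kernelization possible: the regularizer can be evaluated from the $n \times n$ matrix $W^\top W$ alone. As an illustration (my example, not from the slides), the squared trace norm satisfies this with $h(A) = \big(\mathrm{tr}(A^{1/2})\big)^2$:

```python
import numpy as np
from scipy.linalg import sqrtm

W = np.random.default_rng(0).normal(size=(8, 3))

# Omega(W) = ||W||_tr^2 depends on W only through the n x n matrix W^T W:
#   ||W||_tr = tr((W^T W)^{1/2}), i.e. h(A) = (tr(A^{1/2}))^2 with A = W^T W
omega_from_svd = np.linalg.svd(W, compute_uv=False).sum() ** 2
omega_from_gram = np.trace(np.real(sqrtm(W.T @ W))) ** 2
assert np.isclose(omega_from_svd, omega_from_gram)
```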

  27. Implications • The theorem tells us when a matrix learning problem can be “kernelized” • In single-task learning, the choice of $h$ does not matter essentially • However, in multi-task learning, the choice of $h$ is important (since $\preceq$ is a partial ordering) • Many valid regularizers: Schatten $L_p$ norms $\|\cdot\|_p$, rank, orthogonally invariant norms, norms of the type $W \mapsto \|WM\|_p$, etc.
