

  1. Multi-Task Learning and Matrix Regularization Andreas Argyriou Department of Computer Science University College London

  2. Collaborators • T. Evgeniou (INSEAD) • R. Hauser (University of Oxford) • M. Herbster (University College London) • A. Maurer (Stemmer Imaging) • C.A. Micchelli (SUNY Albany) • M. Pontil (University College London) • Y. Ying (University of Bristol) 1

  3. Main Themes • Machine learning • Convex optimization • Sparse recovery 2

  4. Outline • Multi-task learning and related problems • Matrix learning and an alternating algorithm • Extensions of the method • Multi-task representer theorems • Kernel hyperparameter learning; convex kernel learning 3

  5. Supervised Learning (Single-Task) • $m$ examples are given: $(x_1, y_1), \dots, (x_m, y_m) \in X \times Y$ • Predict using a function $f : X \to Y$ • Want the function to generalize well over the whole of $X \times Y$ • Includes regression, classification etc. • Task = probability measure on $X \times Y$ 4

  6. Multi-Task Learning • Tasks $t = 1, \dots, n$ • $m$ examples per task are given: $(x_{t1}, y_{t1}), \dots, (x_{tm}, y_{tm}) \in X \times Y$ • Predict using functions $f_t : X \to Y$, $t = 1, \dots, n$ • When the tasks are related, learning the tasks jointly should perform better than learning each task independently • Especially important when few data points are available per task (small $m$); in such cases, independent learning is not successful 5

  7. Multi-Task Learning (contd.) • One goal is to learn what structure is common across the $n$ tasks • Want simple, interpretable models that can explain multiple tasks • Want good generalization on the $n$ given tasks but also on new tasks (transfer learning) • Given a few examples from a new task $t'$, $\{(x_{t'1}, y_{t'1}), \dots, (x_{t'\ell}, y_{t'\ell})\}$, want to learn $f_{t'}$ using just the learned task structure 6

  8. Learning Theoretic View: Environment of Tasks • Environment = probability distribution on a set of learning tasks [Baxter, 1996] • To generate a task-specific sample from the environment – draw a function $f_t$ from the environment – generate a sample $\{(x_{t1}, y_{t1}), \dots, (x_{tm}, y_{tm})\} \in (X \times Y)^m$ using $f_t$ • Multi-task learning means learning a common hypothesis space 7

  9. Learning Theoretic View (contd.) • Baxter’s results: – As $n$ (#tasks) increases, $m$ (#examples per task needed) decreases as $O(1/n)$ – Once we have learned a hypothesis space $H$, we can use it to learn a new task drawn from the same environment; the sample complexity depends on the log-capacity of $H$ • Other results: – Task relatedness due to input transformations: improved multi-task bounds in some cases [Ben-David & Schuller, 2003] – Using common feature maps (bounded linear operators): error bounds depend on the Hilbert-Schmidt norm [Maurer, 2006] 8

  10. Multi-Task Applications • Multi-task learning is ubiquitous • Human intelligence relies on transferring learned knowledge from previous tasks to new tasks • E.g. character recognition (very few examples should be needed to recognize new characters) • Integration of medical / bioinformatics databases 9

  11. Multi-Task Applications (contd.) • Marketing databases, collaborative filtering, recommendation systems (e.g. Netflix); task = product preferences for each person 10

  12. Multi-Task Applications (contd.) • Multiple object classification in scenes: an image may contain multiple objects; learning common visual features enhances performance 11

  13. Related Problems • Sparse coding (some images share common basis images) • Vector-valued / structured output • Multi-class problems • Regression with grouped variables, multifactor ANOVA in statistics (selection of groups of variables) • Multi-task learning is a broad problem; no single method can solve everything 12

  14. Learning Multiple Tasks with a Common Kernel • Let $X \subseteq \mathbb{R}^d$, $Y \subseteq \mathbb{R}$, and let us learn $n$ linear functions $f_t(x) = \langle w_t, x \rangle$, $t = 1, \dots, n$ (we ignore nonlinearities for the moment) • Want to impose common structure / relatedness across tasks • Idea: use a common linear kernel for all tasks $K(x, x') = \langle x, D x' \rangle$ (where $D \succ 0$) 13

  15. Learning Multiple Tasks with a Common Kernel • For every $t = 1, \dots, n$ solve
$$\min_{w_t \in \mathbb{R}^d} \ \sum_{i=1}^{m} E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \langle w_t, D^{-1} w_t \rangle$$
• Adding up, we obtain the equivalent problem
$$\min_{w_1, \dots, w_n \in \mathbb{R}^d} \ \sum_{t=1}^{n} \sum_{i=1}^{m} E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \sum_{t=1}^{n} \langle w_t, D^{-1} w_t \rangle$$ 14
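As an illustration, here is a minimal NumPy sketch of the per-task problem above with the square loss and a fixed common matrix $D$; the function name solve_task and the closed-form solve are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def solve_task(X, y, D, gamma):
    """Minimize sum_i (<w, x_i> - y_i)^2 + gamma <w, D^{-1} w> for one task.

    Setting the gradient to zero gives (X^T X + gamma D^{-1}) w = X^T y;
    multiplying through by D avoids inverting D:
        (D X^T X + gamma I) w = D X^T y.
    X : (m, d) design matrix, y : (m,) targets, D : (d, d) positive definite.
    """
    d = X.shape[1]
    A = D @ X.T @ X + gamma * np.eye(d)
    return np.linalg.solve(A, D @ X.T @ y)

# toy usage: with isotropic D = I/d this reduces to ordinary ridge regression
# (up to a rescaling of gamma)
rng = np.random.default_rng(0)
X, y = rng.standard_normal((8, 5)), rng.standard_normal(8)
w = solve_task(X, y, np.eye(5) / 5, gamma=0.1)
```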

  16. Learning Multiple Tasks with a Common Kernel • For multi-task learning, we want to learn the common kernel from a convex set of kernels:
$$\inf_{\substack{w_1, \dots, w_n \in \mathbb{R}^d \\ D \succ 0, \ \mathrm{tr}(D) \le 1}} \ \sum_{t=1}^{n} \sum_{i=1}^{m} E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \mathrm{tr}(W^\top D^{-1} W) \qquad (MTL)$$
where $\mathrm{tr}(W^\top D^{-1} W) = \sum_{t=1}^{n} \langle w_t, D^{-1} w_t \rangle$ • We denote by $W = [\, w_1 \cdots w_n \,]$ the $d \times n$ matrix with the $w_t$ as columns 15

  17. Learning Multiple Tasks with a Common Kernel • Jointly convex problem in $(W, D)$ • The constraint $\mathrm{tr}(D) \le 1$ is important • Fixing $W$, the optimal $D(W)$ is $D(W) \propto (WW^\top)^{\frac{1}{2}}$ ($D(W)$ is usually not in the feasible set because of the inf) • Once we have learned $\hat{D}$, we can transfer it to learning of a new task $t'$:
$$\min_{w \in \mathbb{R}^d} \ \sum_{i=1}^{m} E(\langle w, x_{t'i} \rangle, y_{t'i}) + \gamma \langle w, \hat{D}^{-1} w \rangle$$ 16

  18. Alternating Minimization Algorithm • Alternating minimization over $W$ (supervised learning) and $D$ (unsupervised “correlation” of tasks).
Initialization: set $D = \frac{1}{d} I_{d \times d}$
while convergence condition is not true do
  for $t = 1, \dots, n$: learn $w_t$ independently by minimizing $\sum_{i=1}^{m} E(\langle w, x_{ti} \rangle, y_{ti}) + \gamma \langle w, D^{-1} w \rangle$
  end for
  set $D = (WW^\top)^{\frac{1}{2}} \, / \, \mathrm{tr}\big((WW^\top)^{\frac{1}{2}}\big)$
end while 17
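A compact NumPy sketch of this alternating scheme, instantiated with the square loss; the function name, the fixed iteration count standing in for the convergence test, and the small ε added in the D-step (anticipating the perturbed problem on slide 20) are illustrative assumptions.

```python
import numpy as np

def multitask_alternating(Xs, ys, gamma=1.0, eps=1e-6, n_iter=50):
    """Alternating minimization over W (per-task solves) and D (spectral update).

    Xs : list of (m_t, d) design matrices, one per task
    ys : list of (m_t,) target vectors, one per task
    Returns the task parameter matrix W (d, n) and the learned D (d, d).
    """
    n, d = len(Xs), Xs[0].shape[1]
    D = np.eye(d) / d                       # initialization: D = I_{dxd} / d
    W = np.zeros((d, n))
    for _ in range(n_iter):                 # stand-in for the convergence test
        # W-step: each task solved independently (square loss, fixed D);
        # (D X^T X + gamma I) w = D X^T y avoids inverting D
        for t in range(n):
            A = D @ Xs[t].T @ Xs[t] + gamma * np.eye(d)
            W[:, t] = np.linalg.solve(A, D @ Xs[t].T @ ys[t])
        # D-step: D = (W W^T + eps I)^{1/2} / tr((W W^T + eps I)^{1/2})
        lam, U = np.linalg.eigh(W @ W.T + eps * np.eye(d))
        C = (U * np.sqrt(np.clip(lam, 0.0, None))) @ U.T
        D = C / np.trace(C)
    return W, D
```

With eps = 0 the D-step is exactly the update on the slide; the small eps keeps D strictly positive definite so the W-step stays well posed.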

  19. Alternating Minimization (contd.) • Each w t step is a regularization problem (e.g. SVM, ridge regression etc.) • It does not require computation of the (pseudo)inverse of D • Each D step requires an SVD; this is usually the most costly step 18

  20. Alternating Minimization (contd.) • The algorithm (with some perturbation) converges to an optimal solution
$$\min_{\substack{w_1, \dots, w_n \in \mathbb{R}^d \\ D \succ 0, \ \mathrm{tr}(D) \le 1}} \ \sum_{t=1}^{n} \sum_{i=1}^{m} E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \mathrm{tr}\big(D^{-1}(WW^\top + \varepsilon I)\big) \qquad (R_\varepsilon)$$
Theorem. An alternating algorithm for problem $(R_\varepsilon)$ has the property that its iterates $(W^{(k)}, D^{(k)})$ converge to the minimizer of $(R_\varepsilon)$ as $k \to \infty$.
Theorem. Consider a sequence $\varepsilon_\ell \to 0^+$ and let $(W_\ell, D_\ell)$ be the minimizer of $(R_{\varepsilon_\ell})$. Then any limiting point of the sequence $\{(W_\ell, D_\ell)\}$ is an optimal solution of the problem $(MTL)$.
• Note: the starting value of $D$ does not matter 19

  21. Alternating Minimization (contd.) • [Figure: two plots comparing the alternating algorithm with gradient descent at learning rates η = 0.01, 0.03, 0.05; left: objective function vs. #iterations (green = alternating); right: running time in seconds vs. #tasks (blue = alternating)] • Compare computational cost with a gradient descent approach (η := learning rate) 20

  22. Alternating Minimization (contd.) • Typically fewer than 50 iterations needed in experiments • At least an order of magnitude fewer iterations than gradient descent (but cost per iteration is larger) • Scales better with the number of tasks • Both methods require SVD (costly if d is large) • Alternative algorithms: SOCP methods [ Srebro et al. 2005, Liu and Vandenberghe 2008 ], gradient descent on matrix factors [ Rennie & Srebro 2005 ], singular value thresholding [ Cai et al. 2008 ] 21

  23. Trace Norm Regularization • Eliminating $D$ in optimization problem $(MTL)$ yields
$$\min_{W \in \mathbb{R}^{d \times n}} \ \sum_{t=1}^{n} \sum_{i=1}^{m} E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \|W\|_{\mathrm{tr}}^2 \qquad (TR)$$
• The trace norm (or nuclear norm) $\|W\|_{\mathrm{tr}}$ is the sum of the singular values of $W$ • There has been recent interest in trace norm / rank problems in matrix factorization, statistics, matrix completion etc. [Cai et al. 2008, Fazel et al. 2001, Izenman 1975, Liu and Vandenberghe 2008, Srebro et al. 2005] 22
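As a quick numerical sanity check (a sketch with an arbitrary random $W$, assumed to have full row rank so that $D(W)$ is invertible), the trace norm computed from the singular values agrees with the variational form $\mathrm{tr}(W^\top D(W)^{-1} W)$ using the optimal $D(W) = (WW^\top)^{1/2} / \mathrm{tr}((WW^\top)^{1/2})$ from slide 17.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))            # d = 3 features, n = 5 tasks

# trace (nuclear) norm: sum of singular values of W
trace_norm = np.linalg.svd(W, compute_uv=False).sum()

# variational form: plug in the optimal D(W) = (W W^T)^{1/2} / tr((W W^T)^{1/2})
lam, U = np.linalg.eigh(W @ W.T)
C = (U * np.sqrt(np.clip(lam, 0.0, None))) @ U.T   # (W W^T)^{1/2}
D = C / np.trace(C)
variational = np.trace(W.T @ np.linalg.inv(D) @ W)

print(trace_norm ** 2, variational)        # the two values agree: ||W||_tr^2
```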

  24. Trace Norm vs. Rank • Problem $(TR)$ is a convex relaxation of the problem
$$\min_{W \in \mathbb{R}^{d \times n}} \ \sum_{t=1}^{n} \sum_{i=1}^{m} E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \mathrm{rank}(W)$$
• NP-hard problem (at least as hard as Boolean LP) • Rank and trace norm correspond to $L_0$, $L_1$ on the vector of singular values • Multi-task intuition: we want the task parameter vectors $w_t$ to lie on a low dimensional subspace 23
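The $L_0$ / $L_1$ analogy can be read directly off the singular values; a tiny sketch with an arbitrary rank-one matrix and an arbitrary numerical threshold (both assumptions):

```python
import numpy as np

W = np.outer([1.0, 2.0, 0.5], [1.0, -1.0, 0.0, 2.0])   # rank-1 example, 3 x 4
s = np.linalg.svd(W, compute_uv=False)                  # singular values of W

rank = int(np.sum(s > 1e-10))   # "L0 norm" of the singular values
trace_norm = s.sum()            # L1 norm of the singular values
print(rank, trace_norm)         # 1, and the single nonzero singular value
```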

  25. Connection to Group Lasso • Problem $(MTL)$ is equivalent to
$$\min_{\substack{A \in \mathbb{R}^{d \times n}, \ U \in \mathbb{R}^{d \times d} \\ U^\top U = I}} \ \sum_{t=1}^{n} \sum_{i=1}^{m} E(\langle a_t, U^\top x_{ti} \rangle, y_{ti}) + \gamma \, \|A\|_{2,1}^2$$
where $\|A\|_{2,1} := \sum_{i=1}^{d} \sqrt{\sum_{t=1}^{n} a_{it}^2}$ 24
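A short numerical illustration of this equivalence (not the slides' derivation): taking $U$ to be the left singular vectors of $W$ and $A = U^\top W$, the rows of $A$ have $\ell_2$ norms equal to the singular values of $W$, so $\|A\|_{2,1} = \|W\|_{\mathrm{tr}}$.

```python
import numpy as np

def norm_2_1(A):
    """Sum over rows of the row-wise L2 norms (the (2,1)-norm above)."""
    return np.linalg.norm(A, axis=1).sum()

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))            # d = 4 features, n = 6 tasks

U, s, Vt = np.linalg.svd(W)                # U: d x d orthogonal
A = U.T @ W                                # rotated features: row i of A is sigma_i * v_i^T

print(norm_2_1(A), s.sum())                # ||A||_{2,1} equals the trace norm of W
```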

  26. Experiment (Computer Survey) • Consumers’ ratings of products [Lenk et al. 1996] • 180 persons (tasks) • 8 PC models (training examples); 4 PC models (test examples) • 13 binary input variables (RAM, CPU, price etc.) + bias term • Integer output in { 0 , . . . , 10 } (likelihood of purchase) • The square loss was used 25
