SLIDE 1

Multi-Task Learning and Matrix Regularization

Andreas Argyriou
Department of Computer Science
University College London

SLIDE 2

Collaborators

  • T. Evgeniou (INSEAD)
  • R. Hauser (University of Oxford)
  • M. Herbster (University College London)
  • A. Maurer (Stemmer Imaging)
  • C.A. Micchelli (SUNY Albany)
  • M. Pontil (University College London)
  • Y. Ying (University of Bristol)

SLIDE 3

Main Themes

  • Machine learning
  • Convex optimization
  • Sparse recovery

SLIDE 4

Outline

  • Multi-task learning and related problems
  • Matrix learning and an alternating algorithm
  • Extensions of the method
  • Multi-task representer theorems
  • Kernel hyperparameter learning; convex kernel learning

SLIDE 5

Supervised Learning (Single-Task)

  • m examples are given: (x1, y1), . . . , (xm, ym) ∈ X × Y
  • Predict using a function f : X → Y
  • Want the function to generalize well over the whole of X × Y
  • Includes regression, classification etc.
  • Task = probability measure on X × Y

SLIDE 6

Multi-Task Learning

  • Tasks t = 1, . . . , n
  • m examples per task are given: (xt1, yt1), . . . , (xtm, ytm) ∈ X × Y
  • Predict using functions ft : X → Y , t = 1, . . . , n
  • When the tasks are related, learning the tasks jointly should perform better than learning each task independently
  • Especially important when few data points are available per task (small m); in such cases, independent learning is not successful

SLIDE 7

Multi-Task Learning (contd.)

  • One goal is to learn what structure is common across the n tasks
  • Want simple, interpretable models that can explain multiple tasks
  • Want good generalization on the n given tasks but also on new tasks (transfer learning)
  • Given a few examples from a new task t′, {(xt′1, yt′1), . . . , (xt′ℓ, yt′ℓ)}, want to learn ft′ using just the learned task structure

SLIDE 8

Learning Theoretic View: Environment of Tasks

  • Environment = probability distribution on a set of learning tasks [Baxter, 1996]
  • To sample a task-specific sample from the environment
    – draw a function ft from the environment
    – generate a sample {(xt1, yt1), . . . , (xtm, ytm)} ∈ (X × Y)^m using ft

  • Multi-task learning means learning a common hypothesis space

SLIDE 9

Learning Theoretic View (contd.)

  • Baxter’s results:
    – As n (#tasks) increases, m (#examples per task needed) decreases as O(1/n)
    – Once we have learned a hypothesis space H, we can use it to learn a new task drawn from the same environment; the sample complexity depends on the log-capacity of H
  • Other results:
    – Task relatedness due to input transformations: improved multi-task bounds in some cases [Ben-David & Schuller, 2003]
    – Using common feature maps (bounded linear operators): error bounds depend on the Hilbert–Schmidt norm [Maurer, 2006]

SLIDE 10

Multi-Task Applications

  • Multi-task learning is ubiquitous
  • Human intelligence relies on transferring learned knowledge from previous tasks to new tasks
  • E.g. character recognition (very few examples should be needed to recognize new characters)

  • Integration of medical / bioinformatics databases

SLIDE 11

Multi-Task Applications (contd.)

  • Marketing databases, collaborative filtering, recommendation systems (e.g. Netflix); task = product preferences for each person

SLIDE 12

Multi-Task Applications (contd.)

  • Multiple object classification in scenes: an image may contain multiple objects; learning common visual features enhances performance

SLIDE 13

Related Problems

  • Sparse coding (some images share common basis images)
  • Vector-valued / structured output
  • Multi-class problems
  • Regression with grouped variables, multifactor ANOVA in statistics (selection of groups of variables)
  • Multi-task learning is a broad problem; no single method can solve everything

SLIDE 14

Learning Multiple Tasks with a Common Kernel

  • Let $X \subseteq \mathbb{R}^d$, $Y \subseteq \mathbb{R}$ and let us learn $n$ linear functions $f_t(x) = \langle w_t, x \rangle$, $t = 1, \ldots, n$ (we ignore nonlinearities for the moment)
  • Want to impose common structure / relatedness across tasks
  • Idea: use a common linear kernel for all tasks
$$K(x, x') = \langle x, D x' \rangle \qquad \text{where } D \succ 0$$

SLIDE 15

Learning Multiple Tasks with a Common Kernel

  • For every $t = 1, \ldots, n$ solve
$$\min_{w_t \in \mathbb{R}^d} \; \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \langle w_t, D^{-1} w_t \rangle$$
  • Adding up, we obtain the equivalent problem
$$\min_{w_1, \ldots, w_n \in \mathbb{R}^d} \; \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \sum_{t=1}^n \langle w_t, D^{-1} w_t \rangle$$
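For the square loss, each of the per-task problems above has a closed form. The following is a minimal numpy sketch (the data, the shapes and the helper name task_step are illustrative assumptions, not from the slides): it solves (XᵀX + γD⁻¹)w = Xᵀy for one task.

```python
import numpy as np

def task_step(X_t, y_t, D, gamma):
    """Minimize sum_i (<w, x_ti> - y_ti)^2 + gamma <w, D^{-1} w> (square loss assumed).

    Setting the gradient to zero gives w = (X^T X + gamma D^{-1})^{-1} X^T y.
    X_t: (m, d) inputs for task t, y_t: (m,) outputs, D: (d, d) positive definite.
    """
    A = X_t.T @ X_t + gamma * np.linalg.inv(D)
    return np.linalg.solve(A, X_t.T @ y_t)

# Toy usage with a common D shared by all tasks (here D = I/d)
rng = np.random.default_rng(0)
m, d, n = 5, 10, 3
D = np.eye(d) / d
W = np.column_stack([task_step(rng.standard_normal((m, d)),
                               rng.standard_normal(m), D, gamma=0.1)
                     for _ in range(n)])
print(W.shape)  # (d, n), one column per task
```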

SLIDE 16

Learning Multiple Tasks with a Common Kernel

  • For multi-task learning, we want to learn the common kernel from a convex set of kernels:
$$\inf_{\substack{w_1, \ldots, w_n \in \mathbb{R}^d \\ D \succ 0, \; \mathrm{tr}(D) \le 1}} \; \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \mathrm{tr}(W^\top D^{-1} W) \qquad (MTL)$$
where $\mathrm{tr}(W^\top D^{-1} W) = \sum_{t=1}^n \langle w_t, D^{-1} w_t \rangle$
  • We denote $W = [\, w_1 \; \cdots \; w_n \,]$

SLIDE 17

Learning Multiple Tasks with a Common Kernel

  • Jointly convex problem in (W, D)
  • The constraint tr(D) ≤ 1 is important
  • Fixing $W$, the optimal $D(W)$ is
$$D(W) \propto (WW^\top)^{\frac{1}{2}}$$
($D(W)$ is usually not in the feasible set because of the inf)
  • Once we have learned $\hat{D}$, we can transfer it to the learning of a new task $t'$:
$$\min_{w \in \mathbb{R}^d} \; \sum_{i=1}^m E(\langle w, x_{t'i} \rangle, y_{t'i}) + \gamma \, \langle w, \hat{D}^{-1} w \rangle$$

SLIDE 18

Alternating Minimization Algorithm

  • Alternating minimization over W (supervised learning) and D (unsupervised “correlation” of tasks)

Initialization: set $D = \frac{1}{d} I_{d \times d}$
while convergence condition is not true do
    for t = 1, . . . , n: learn $w_t$ independently by minimizing $\sum_{i=1}^m E(\langle w, x_{ti} \rangle, y_{ti}) + \gamma \, \langle w, D^{-1} w \rangle$
    end for
    set $D = (WW^\top)^{\frac{1}{2}} \big/ \mathrm{tr}\big((WW^\top)^{\frac{1}{2}}\big)$
end while
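Below is a runnable sketch of this alternating scheme, under the assumption of the square loss for E and on synthetic data (the function name alternating_mtl and all shapes are made up for illustration). The small ε added before the matrix square root anticipates the perturbed problem discussed on the next slide.

```python
import numpy as np
from scipy.linalg import sqrtm

def alternating_mtl(X, Y, gamma=0.1, eps=1e-6, iters=50):
    """Alternate between the per-task w-step (square loss assumed) and the D-step.

    X: (n, m, d) inputs, Y: (n, m) outputs; returns W (d, n) and D (d, d).
    """
    n, m, d = X.shape
    D = np.eye(d) / d                                   # initialization: D = I/d
    for _ in range(iters):
        # w-step: ridge regression with penalty <w, D^{-1} w> for each task
        D_inv = np.linalg.inv(D)
        W = np.column_stack([
            np.linalg.solve(X[t].T @ X[t] + gamma * D_inv, X[t].T @ Y[t])
            for t in range(n)])
        # D-step: D = (W W^T + eps I)^{1/2} / tr((W W^T + eps I)^{1/2})
        C = np.real(sqrtm(W @ W.T + eps * np.eye(d)))
        D = C / np.trace(C)
    return W, D

# Toy data: all tasks share a one-dimensional subspace spanned by u
rng = np.random.default_rng(0)
n, m, d = 5, 20, 10
u = rng.standard_normal(d)
X = rng.standard_normal((n, m, d))
Y = np.stack([X[t] @ (rng.standard_normal() * u) for t in range(n)])
W, D = alternating_mtl(X, Y)
print(np.round(np.linalg.eigvalsh(D)[-3:], 3))  # spectrum of D concentrates on one direction
```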

SLIDE 19

Alternating Minimization (contd.)

  • Each wt step is a regularization problem (e.g. SVM, ridge regression etc.)

  • It does not require computation of the (pseudo)inverse of D
  • Each D step requires an SVD; this is usually the most costly step

SLIDE 20

Alternating Minimization (contd.)

  • The algorithm (with some perturbation) converges to an optimal solution of
$$\min_{\substack{w_1, \ldots, w_n \in \mathbb{R}^d \\ D \succ 0, \; \mathrm{tr}(D) \le 1}} \; \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \mathrm{tr}\big(D^{-1}(WW^\top + \varepsilon I)\big) \qquad (R_\varepsilon)$$
  • Theorem. An alternating algorithm for problem $(R_\varepsilon)$ has the property that its iterates $(W^{(k)}, D^{(k)})$ converge to the minimizer of $(R_\varepsilon)$ as $k \to \infty$.
  • Theorem. Consider a sequence $\varepsilon_\ell \to 0^+$ and let $(W_\ell, D_\ell)$ be the minimizer of $(R_{\varepsilon_\ell})$. Then any limit point of the sequence $\{(W_\ell, D_\ell)\}$ is an optimal solution of the problem $(MTL)$.
  • Note: the starting value of D does not matter

SLIDE 21

Alternating Minimization (contd.)

[Plots: objective function vs. #iterations for the alternating algorithm and for gradient descent with η = 0.05, 0.03, 0.01; running time in seconds vs. #tasks for the alternating algorithm and for gradient descent with η = 0.05]

  • Compare computational cost with a gradient descent approach (η := learning rate)

SLIDE 22

Alternating Minimization (contd.)

  • Typically fewer than 50 iterations needed in experiments
  • At least an order of magnitude fewer iterations than gradient descent (but cost per iteration is larger)
  • Scales better with the number of tasks
  • Both methods require SVD (costly if d is large)
  • Alternative algorithms: SOCP methods [Srebro et al. 2005, Liu and Vandenberghe 2008], gradient descent on matrix factors [Rennie & Srebro 2005], singular value thresholding [Cai et al. 2008]

SLIDE 23

Trace Norm Regularization

  • Eliminating D in optimization problem $(MTL)$ yields
$$\min_{W \in \mathbb{R}^{d \times n}} \; \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \|W\|_{\mathrm{tr}}^2 \qquad (TR)$$
The trace norm (or nuclear norm) $\|W\|_{\mathrm{tr}}$ is the sum of the singular values of $W$
  • There has been recent interest in trace norm / rank problems in matrix factorization, statistics, matrix completion etc. [Cai et al. 2008, Fazel et al. 2001, Izenman 1975, Liu and Vandenberghe 2008, Srebro et al. 2005]
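As a quick numerical sanity check of this elimination (a sketch with a hypothetical random W, not an experiment from the talk): the trace norm is the sum of the singular values, and plugging the optimal D(W) from the earlier slide into tr(WᵀD⁻¹W) reproduces ‖W‖_tr².

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
d, n = 4, 6
W = rng.standard_normal((d, n))

trace_norm = np.linalg.svd(W, compute_uv=False).sum()    # sum of singular values

# Optimal D for this W (cf. D(W) on the earlier slide); feasible since tr(D) = 1
C = np.real(sqrtm(W @ W.T))
D = C / np.trace(C)
penalty = np.trace(W.T @ np.linalg.inv(D) @ W)            # tr(W^T D^{-1} W)

print(np.isclose(penalty, trace_norm ** 2))                # True: equals ||W||_tr^2
```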

SLIDE 24

Trace Norm vs. Rank

  • Problem $(TR)$ is a convex relaxation of the problem
$$\min_{W \in \mathbb{R}^{d \times n}} \; \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \mathrm{rank}(W)$$
  • NP-hard problem (at least as hard as Boolean LP)
  • Rank and trace norm correspond to L0, L1 on the vector of singular values
  • Multi-task intuition: we want the task parameter vectors wt to lie on a low dimensional subspace

SLIDE 25

Connection to Group Lasso

  • Problem $(MTL)$ is equivalent to
$$\min_{\substack{A \in \mathbb{R}^{d \times n} \\ U \in \mathbb{R}^{d \times d}, \; U^\top U = I}} \; \sum_{t=1}^n \sum_{i=1}^m E(\langle a_t, U^\top x_{ti} \rangle, y_{ti}) + \gamma \, \|A\|_{2,1}^2$$
where $\|A\|_{2,1} := \sum_{i=1}^d \sqrt{\sum_{t=1}^n a_{it}^2}$

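A minimal sketch of the ‖·‖_{2,1} penalty above (the matrix A here is a made-up example; rows index features, columns index tasks): each row contributes its Euclidean length, so a feature that no task needs is driven to zero for all tasks jointly.

```python
import numpy as np

def norm_2_1(A):
    """||A||_{2,1} = sum over rows i of sqrt(sum over tasks t of A[i, t]^2)."""
    return np.sqrt((A ** 2).sum(axis=1)).sum()

A = np.array([[1.0, 2.0],
              [0.0, 0.0],
              [3.0, 4.0]])
print(norm_2_1(A))  # |(1,2)| + |(0,0)| + |(3,4)| = sqrt(5) + 0 + 5
```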

SLIDE 26

Experiment (Computer Survey)

  • Consumers’ ratings of products [Lenk et al. 1996]
  • 180 persons (tasks)
  • 8 PC models (training examples); 4 PC models (test examples)
  • 13 binary input variables (RAM, CPU, price etc.) + bias term
  • Integer output in {0, . . . , 10} (likelihood of purchase)
  • The square loss was used

SLIDE 27

Experiment (Computer Survey)

[Plots: test error vs. #tasks; eigenvalues of D]

  • Performance improves with more tasks (for learning of the tasks independently, error = 16.53)
  • A single most important feature shared by all persons

SLIDE 28

Experiment (Computer Survey)

[Bar plot: components of the dominant eigenvector u1 of D over the input features TE, RAM, SC, CPU, HD, CD, CA, CO, AV, WA, SW, GU, PR]

Method                              RMSE
Alternating Alg.                    1.93
Hierarchical Bayes [Lenk et al.]    1.90
Independent                         3.88
Aggregate                           2.35
Group Lasso                         2.01

  • The most important feature (eigenvector of D) weighs technical characteristics (RAM, CPU, CD-ROM) vs. price

SLIDE 29

Spectral Regularization

  • Generalize $(MTL)$:
$$\inf_{\substack{w_1, \ldots, w_n \in \mathbb{R}^d \\ D \succ 0, \; \mathrm{tr}(D) \le 1}} \; \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \mathrm{tr}(W^\top F(D) W)$$
where $F$ is a spectral matrix function: $f : (0, +\infty) \to (0, +\infty)$, $F(U \Lambda U^\top) = U \, \mathrm{diag}[f(\lambda_1), \ldots, f(\lambda_d)] \, U^\top$
  • or
$$\min_{W \in \mathbb{R}^{d \times n}} \; \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \Omega(W)$$
  • It can be shown that Ω(W) is a function of the singular values of W
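A short sketch (numpy; the helper name spectral_function is an assumption) of applying a spectral matrix function F to a symmetric positive definite D via its eigendecomposition, as defined above; choosing f(λ) = 1/λ recovers F(D) = D⁻¹ and hence the (MTL) regularizer.

```python
import numpy as np

def spectral_function(D, f):
    """F(U diag(lambda) U^T) = U diag(f(lambda)) U^T for symmetric D."""
    lam, U = np.linalg.eigh(D)
    return U @ np.diag(f(lam)) @ U.T

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
D = A @ A.T + np.eye(4)                    # positive definite

F_D = spectral_function(D, lambda lam: 1.0 / lam)
print(np.allclose(F_D, np.linalg.inv(D)))  # True: f(x) = 1/x gives F(D) = D^{-1}
```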

SLIDE 30

Spectral Regularization (contd.)

  • In particular, if $f(\lambda) = \lambda^{1 - \frac{2}{p}}$, $p \in (0, 2]$, we have
$$\Omega(W) = \|W\|_p^2$$
where $\|W\|_p$ is the Schatten $L_p$ (pre)norm of the singular values of $W$
  • Theorem. The regularizer $\mathrm{tr}(W^\top F(D) W)$ is jointly convex if and only if $\frac{1}{f}$ is matrix concave of order $d$, that is,
$$\mu \, \tfrac{1}{F}(A) + (1 - \mu) \, \tfrac{1}{F}(B) \preceq \tfrac{1}{F}(\mu A + (1 - \mu) B)$$
for all $A, B \succ 0$, $\mu \in [0, 1]$
  • Spectral problems appear also in graph applications, control theory etc.
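A small sketch of the Schatten L_p (pre)norm used above (the matrix W is a made-up example): it is simply the vector p-norm of the singular values, so p = 1 gives the trace norm and p = 2 the Frobenius norm.

```python
import numpy as np

def schatten(W, p):
    """Schatten L_p (pre)norm: p-norm of the vector of singular values of W."""
    s = np.linalg.svd(W, compute_uv=False)
    return (s ** p).sum() ** (1.0 / p)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))
print(np.isclose(schatten(W, 1), np.linalg.svd(W, compute_uv=False).sum()))  # trace norm
print(np.isclose(schatten(W, 2), np.linalg.norm(W, 'fro')))                  # Frobenius norm
```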

SLIDE 31

Learning Groups of Tasks

  • Assume heterogeneous environment, i.e. K low dimensional subspaces
  • Learn a partition of tasks in K groups

$$\inf_{\substack{D_1, \ldots, D_K \succ 0 \\ \mathrm{tr}(D_k) \le 1}} \; \sum_{t=1}^n \; \min_{k = 1, \ldots, K} \; \min_{w_t \in \mathbb{R}^d} \; \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \langle w_t, D_k^{-1} w_t \rangle$$
  • The representation learned is $(\hat{D}_1, \ldots, \hat{D}_K)$; we can transfer this representation to easily learn a new task
  • Non-convex problem; we use stochastic gradient descent

SLIDE 32

Experiment (Character Recognition - Projection on Image Halves)

6 vs. 1 task (on the right half)

  • Binary classification tasks on 28 × 56 images
  • One half of the image contains the relevant character, the other half contains a randomly chosen character
  • Two groups of tasks (with probabilities 50–50%): projection on the left half or projection on the right half

SLIDE 33

Experiment (Character Recognition - Projection on Image Halves)

  • Training set contains pairs of alphabetic characters
  • 1000 tasks, 10 examples per task
  • Wish to obtain a representation that captures rotation invariance on either half of the image
  • Wish to transfer this representation to pairs of digits

SLIDE 34

Experiment (Character Recognition - Projection on Image Halves)

Transfer error for different methods:

    Independent   K = 1   K = 2
    0.27          0.036   0.013

Assignment of tasks to groups (with K = 2):

                                 D1       D2
    All digits (left & right)    48.2%    51.8%
    Left                         99.2%     0.8%
    Right                         1.4%    98.6%
    Training data                50.7%    49.3%

SLIDE 35

Experiment (Character Recognition - Projection on Image Halves)

[Images: dominant eigenvectors of D (for K = 1) and of D1, D2 (for K = 2)]

SLIDE 36

Experiment (Character Recognition - Projection on Image Halves)

[Plots: spectrum of D learned with K = 1 (left); spectra of D1 (middle) and D2 (right) learned with K = 2]

SLIDE 37

Representer Theorems

  • All previous formulations satisfy a multi-task representer theorem
$$\hat{w}_t = \sum_{s=1}^n \sum_{i=1}^m c^{(t)}_{si} \, x_{si} \qquad \forall \, t \in \{1, \ldots, n\} \qquad (R.T.)$$
Consequently, a nonlinear kernel can be used
  • All tasks are involved in this expression (unlike the single-task representer theorem ⇔ Frobenius norm regularization)
  • Generally, consider any matrix optimization problem of the form
$$\min_{w_1, \ldots, w_n \in \mathbb{R}^d} \; \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \Omega(W)$$
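A numpy sketch of the expansion (R.T.) above with made-up shapes: each task's weight vector is a linear combination of the training inputs of all tasks, which is what allows the inner products to be replaced by a nonlinear kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 3, 5, 8
X = rng.standard_normal((n, m, d))      # X[s, i] = input x_si (example i of task s)
C = rng.standard_normal((n, n, m))      # C[t, s, i] = coefficient c^{(t)}_{si}

# w_t = sum_s sum_i c^{(t)}_{si} x_{si}, for every task t
W = np.einsum('tsi,sid->td', C, X).T    # shape (d, n), one column per task
print(W.shape)
```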

SLIDE 38

Representer Theorems (contd.)

  • Definitions:
$\mathbb{S}^n_+$ = the positive semidefinite cone
The function $h : \mathbb{S}^n_+ \to \mathbb{R}$ is matrix nondecreasing if $h(A) \le h(B)$ for all $A, B \in \mathbb{S}^n_+$ such that $A \preceq B$
  • Theorem. The representer theorem $(R.T.)$ holds if and only if there exists a matrix nondecreasing function $h : \mathbb{S}^n_+ \to \mathbb{R}$ such that $\Omega(W) = h(W^\top W)$ for all $W \in \mathbb{R}^{d \times n}$ (under differentiability assumptions)

SLIDE 39

Representer Theorems (contd.)

  • Corollary. The standard representer theorem for single-task learning (n = 1),
$$\hat{w} = \sum_{i=1}^m c_i x_i \, ,$$
holds if and only if there exists a nondecreasing function $h : \mathbb{R}_+ \to \mathbb{R}$ such that $\Omega(w) = h(\langle w, w \rangle)$ for all $w \in \mathbb{R}^d$
  • Sufficiency of the condition has been known [Kimeldorf & Wahba, 1970, Schölkopf et al., 2001 etc.]

SLIDE 40

Implications

  • “Kernelization”
  • In single-task learning, the choice of h does not matter essentially
  • However, in multi-task learning, the choice of h is important (since $\preceq$ is a partial ordering)
  • Many valid regularizers: Schatten $L_p$ norms $\|\cdot\|_p$, rank, orthogonally invariant norms, norms of the type $W \mapsto \|WM\|_p$ etc.
  • In matrix learning, kernels and sparsity can be exploited in the same model

SLIDE 41

Connection to Learning the Kernel

  • Recall problem $(MTL)$
$$\min_{D \succ 0, \; \mathrm{tr}(D) \le 1} \; \min_{w_1, \ldots, w_n \in \mathbb{R}^d} \; \sum_{t=1}^n \left[ \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \langle w_t, D^{-1} w_t \rangle \right] \qquad (MTL)$$
$\Longleftrightarrow$ learning a common kernel $K(x, x') = \langle x, Dx' \rangle$ within the convex hull of an infinite number of linear kernels
  • Extends the formulation of [Lanckriet et al. 2004] (single task)
$$\min_{K \in \mathcal{K}} \; \min_{c \in \mathbb{R}^m} \; \sum_{i=1}^m E\big((Kc)_i, y_i\big) + \gamma \, \langle c, Kc \rangle$$
in which $\mathcal{K}$ was a polytope (convex hull of a finite set of kernels)

SLIDE 42

A General Framework for Learning the Kernel

  • Convex set K is generated by basic kernels: K = conv(B)
  • Example 1: Finite set of basic kernels (aka “multiple kernel learning”)
  • Example 2: Linear basic kernels
$$B(x, x') = \langle x, Dx' \rangle \qquad \text{where } D \text{ belongs to a bounded, convex set (e.g. } (MTL)\text{)}$$
  • Example 3: Gaussian basic kernels
$$B(x, x') = e^{-\langle x - x', \, \Sigma^{-1}(x - x') \rangle} \qquad \text{where } \Sigma \text{ belongs to a convex subset of the p.s.d. cone}$$
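A sketch of Example 3 (the data and the particular Σ are made-up assumptions): building the Gram matrix of a single Gaussian basic kernel B(x, x′) = exp(−⟨x − x′, Σ⁻¹(x − x′)⟩).

```python
import numpy as np

def gaussian_gram(X, Sigma):
    """Gram matrix B_ij = exp(-<x_i - x_j, Sigma^{-1} (x_i - x_j)>)."""
    S_inv = np.linalg.inv(Sigma)
    diff = X[:, None, :] - X[None, :, :]              # (m, m, d) pairwise differences
    sq = np.einsum('ijd,de,ije->ij', diff, S_inv, diff)
    return np.exp(-sq)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
Sigma = np.diag([1.0, 2.0, 0.5])                      # one member of the convex parameter set
B = gaussian_gram(X, Sigma)
print(B.shape, np.allclose(B, B.T), np.all(np.diag(B) == 1.0))
```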

SLIDE 43

Learning the Kernel and Structured Sparsity

  • Interpretation of LTK in the feature space [Bach et al. 2004, Micchelli & Pontil 2005]
$$\min_{v_1, \ldots, v_N \in \mathbb{R}^m} \; \sum_{i=1}^m E\left( \sum_{j=1}^N \langle v_j, \Phi_j(x_i) \rangle, \; y_i \right) + \gamma \left( \sum_{j=1}^N \|v_j\| \right)^2$$
  • Group Lasso in the feature space
  • The $\|\cdot\|_{2,1}$ norm tends to favor a small number of feature maps / kernels in the solution

SLIDE 44

Why Learn Kernels in Convex Sets?

  • Data fusion (e.g. in bioinformatics)
  • Kernel hyperparameter learning (e.g. Gaussian kernel parameters)
  • Multi-task learning
  • Metric learning
  • Semi-supervised learning (learning the graph)
  • Efficient alternative to cross validation; exploits the power of convex optimization

SLIDE 45

Properties of the Solution of LTK

  • Formulation for learning the kernel
$$\min_{K \in \mathcal{K}} \; \min_{c \in \mathbb{R}^m} \; \sum_{i=1}^m E\big((Kc)_i, y_i\big) + \gamma \, \langle c, Kc \rangle \qquad (LTK)$$
where $\mathcal{K} = \mathrm{conv}(\mathcal{B})$
  • I.e. solutions of $(LTK)$ are of the form $\hat{K} = \sum_{i=1}^N \hat{\lambda}_i \hat{B}_i$, where $\hat{\lambda}_i \ge 0$, $\sum_{i=1}^N \hat{\lambda}_i = 1$, $\hat{B}_i \in \mathcal{B}$
  • Any solution $(\hat{c}, \hat{K})$ of $(LTK)$ is a saddle point of a minimax problem

SLIDE 46

Properties of the Solution of LTK (contd.)

  • Theorem. $(\hat{c}, \hat{K})$ solves $(LTK)$ if and only if
    1. $\langle \hat{c}, \hat{B}_i \hat{c} \rangle = \max_{B \in \mathcal{B}} \langle \hat{c}, B \hat{c} \rangle$, for $i = 1, \ldots, N$
    2. $\hat{c}$ is the solution to $\min_{c \in \mathbb{R}^m} \; \sum_{i=1}^m E\big((\hat{K}c)_i, y_i\big) + \gamma \, \langle c, \hat{K}c \rangle$
Moreover, there exists a solution involving at most $m + 1$ kernels: $\hat{K} = \sum_{i=1}^{m+1} \hat{\lambda}_i \hat{B}_i$

SLIDE 47

A General Algorithm for Learning the Kernel

  • Incrementally builds an estimate of the solution, $K^{(k)} = \sum_{i=1}^k \lambda_i K^{(i)}$

Initialization: Given an initial kernel $K^{(1)}$ in the convex set $\mathcal{K}$
while convergence condition is not true do
    1. Compute $\hat{c} = \operatorname{argmin}_{c \in \mathbb{R}^m} \; \sum_{i=1}^m E\big((K^{(k)}c)_i, y_i\big) + \gamma \, \langle c, K^{(k)}c \rangle$
    2. Find a basic kernel $\hat{B} \in \operatorname{argmax}_{B \in \mathcal{B}} \; \langle \hat{c}, B\hat{c} \rangle$
    3. Compute $K^{(k+1)}$ as the optimal convex combination of $\hat{B}$ and $K^{(k)}$
end while
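Below is a runnable sketch of this greedy scheme for the simplest setting of a finite set of basic kernels and the square loss (all data, kernels and parameter values are made up): step 1 then has the closed form ĉ = (K + γI)⁻¹y, step 2 is a maximum over the finite set, and step 3 is approximated by a one-dimensional grid search over the mixing coefficient.

```python
import numpy as np

def ltk_objective(K, y, gamma):
    """min_c sum_i ((Kc)_i - y_i)^2 + gamma <c, Kc> = gamma y^T (K + gamma I)^{-1} y (square loss)."""
    m = len(y)
    return gamma * y @ np.linalg.solve(K + gamma * np.eye(m), y)

def greedy_ltk(basic_kernels, y, gamma=0.1, iters=20):
    K = basic_kernels[0].copy()                      # K^{(1)}: any kernel in the set
    m = len(y)
    for _ in range(iters):
        # Step 1: c-hat for the current kernel (closed form for the square loss)
        c = np.linalg.solve(K + gamma * np.eye(m), y)
        # Step 2: basic kernel maximizing <c, B c>
        B = max(basic_kernels, key=lambda Bi: c @ Bi @ c)
        # Step 3: best convex combination of B and K (grid search over the mixing weight)
        lams = np.linspace(0.0, 1.0, 51)
        K = min((lam * B + (1 - lam) * K for lam in lams),
                key=lambda K_new: ltk_objective(K_new, y, gamma))
    return K

# Toy usage: Gaussian kernels with different widths on 1-d inputs
rng = np.random.default_rng(0)
x = rng.standard_normal(30)
y = np.sin(3 * x)
basic_kernels = [np.exp(-((x[:, None] - x[None, :]) ** 2) / s) for s in (0.01, 0.1, 1.0, 10.0)]
K_hat = greedy_ltk(basic_kernels, y)
print(ltk_objective(K_hat, y, gamma=0.1))
```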

SLIDE 48

A General Algorithm for Learning the Kernel (contd.)

  • Theorem. There exists a limit point of the LTK algorithm and any limit point is a solution of $(LTK)$.
  • Step 1 is standard regularization (SVM, ridge regression etc.)
  • Step 2 is the hardest: tractable for e.g. finite $\mathcal{B}$ or MTL; non-convex for e.g. Gaussian kernels
  • Step 3 is convex in one variable (requires solving a few regularization problems)

  • Some non-convex cases (e.g. few-parameter Gaussians) are solvable

SLIDE 49

Character Recognition Experiments

  • MNIST classification of digits.
  • 1st experiment: Gaussian basic kernels with one parameter σ.
  • Compared with
    – continuously parameterized + local search
    – finite grid of basic kernels
    – SVM
  • 2nd experiment: Gaussian basic kernels with two parameters σ1, σ2 (left, right halves of images).

  • Compared with varying finite grids.

SLIDE 50

Character Recognition Experiments (contd.)

1st experiment (one kernel parameter σ):

            σ ∈ [75, 25000]           σ ∈ [100, 10000]          σ ∈ [500, 5000]
Task        LTK  local finite SVM     LTK  local finite SVM     LTK  local finite SVM
odd-even    6.5  6.6   18.0   11.8    6.5  6.6   10.9   8.6     6.5  6.5   6.7    6.9
3 vs. 8     3.7  3.8    6.9    6.0    3.9  3.8    4.9   5.1     3.6  3.8   3.7    3.8
4 vs. 7     2.7  2.5    4.2    2.8    2.4  2.5    2.7   2.6     2.3  2.5   2.6    2.3

2nd experiment (two kernel parameters σ1, σ2), LTK vs. finite grids:

            σ ∈ [75, 25000]        σ ∈ [100, 10000]       σ ∈ [500, 5000]
Task        LTK  5 × 5  10 × 10    LTK  5 × 5  10 × 10    LTK  5 × 5  10 × 10
odd-even    5.8  15.8   11.2       5.8  10.1    6.2       5.8   6.8    5.8
3 vs. 8     2.7   6.5    5.1       2.5   4.6    2.5       2.6   3.5    2.5
4 vs. 7     1.8   3.9    2.9       1.7   2.7    2.0       1.8   2.0    1.8

  • Using two parameters improves performance
  • Continuous vs. finite grid
    – more efficient
    – robust wrt. parameter range
    – does not overfit

SLIDE 51

Character Recognition Experiments (contd.)

[Plots] Learned kernel coefficients. From left to right: odd vs. even (1 and 2 kernel params.), 3 vs. 8 and 4 vs. 7 (2 params.)

SLIDE 52

Learning the Graph in Semi-Supervised Learning

  • We can also apply LTK when there are few labeled data points
  • A number of graphs are given, e.g. generated using k-NN, exponential decay weights etc.
  • To exploit the structure of each graph, the Gram matrices Bi are taken to be the pseudoinverses of the graph Laplacians
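A minimal sketch (the small example graph is made up) of turning one such graph into a basic Gram matrix as described above: form the graph Laplacian L = diag(degree) − A and take its pseudoinverse.

```python
import numpy as np

def laplacian_kernel(A):
    """Gram matrix = pseudoinverse of the graph Laplacian L = diag(degree) - A."""
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.pinv(L)

# Tiny example graph on 4 data points (adjacency matrix)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
B = laplacian_kernel(A)
print(np.allclose(B, B.T), np.all(np.linalg.eigvalsh(B) > -1e-10))  # symmetric, p.s.d.
```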

SLIDE 53

Conclusion

  • Multi-task learning is ubiquitous; exploiting task relatedness can enhance learning performance significantly
  • Proposed an alternating algorithm to learn tasks that lie on a common subspace; this algorithm is simple and efficient
  • Nonlinear kernels can be introduced via a multi-task representer theorem; also proposed spectral and non-convex matrix learning
  • Multi-task learning can be viewed as an instance of learning convex combinations of infinitely many kernels
  • Generally, we can use a greedy incremental algorithm to learn combinations of (finitely or infinitely many) kernels

SLIDE 54

Future Work

  • Convergence rates for the algorithms proposed
  • Task relatedness can be of many types (hierarchical features, input transformation invariances etc.)
  • Convex relaxation techniques for the non-convex formulations presented
  • Algorithms and representer theorems for special types of norms
  • Related problems (sparse coding, structured output, metric learning etc.)
