SLIDE 1

Multi-Task Learning: Models, Optimization and Applications

Linli Xu

University of Science and Technology of China

slide-2
SLIDE 2

Outline

  • Introduction to multi-task learning (MTL): problem and models
  • Multi-task learning with task-feature co-clusters
  • Low-rank optimization in multi-task learning
  • Multi-task learning applied to trajectory regression

SLIDE 3

Multiple Tasks

Examination Score Prediction¹ (Argyriou et al., 2008)

Each school is one task: predict a student's exam score from student-dependent features (birth year, previous score, ...) and school-dependent features (school ranking, ...).

School 1 - Alverno High School:             Student id 72981, Birth year 1985, Previous score 95, ..., School ranking 83%, ...  ->  Exam score ?
School 138 - Jefferson Intermediate School: Student id 31256, Birth year 1986, Previous score 87, ..., School ranking 72%, ...  ->  Exam score ?
School 139 - Rosemead High School:          Student id 12381, Birth year 1986, Previous score 83, ..., School ranking 77%, ...  ->  Exam score ?

¹ The Inner London Education Authority (ILEA)

SLIDE 4

Learning Multiple Tasks

1st task ... 138th task ... 139th task: one exam-score predictor per school, trained only on that school's own data (same per-school tables as the previous slide).

Learning each task independently

SLIDE 5

Learning Multiple Tasks

1st task ... 138th task ... 139th task: the same per-school prediction problems (same tables as before).

Learn the tasks simultaneously; model the task relationships.

Learning multiple tasks simultaneously

SLIDE 6

Multi-Task Learning

  • Different from single-task learning
  • Trains multiple tasks simultaneously to exploit the task relationships

[Figure: single-task learning - the training data of each task (1..m) is used to train that task's model separately]

[Figure: multi-task learning - the training data of all tasks enters one joint training procedure that produces all m models]

SLIDE 7

Exploiting Task Relationships

Key challenge in multi-task learning: exploiting the (statistical) relationships between the tasks so as to improve individual and/or overall predictive accuracy, in comparison to training individual models.

SLIDE 8

How Are Tasks Related?

  • All tasks are related
    – Models of all tasks are close to each other
    – Models of all tasks share a common set of features
    – Models share the same low-rank subspace
  • Structure in tasks
    – clusters / graphs / trees
  • Learning with outlier tasks

SLIDE 9

Regularization-based Multi-Task Learning

We focus on linear models: $y_i \approx X_i w_i$, where $X_i \in \mathbb{R}^{n_i \times d}$, $y_i \in \mathbb{R}^{n_i \times 1}$, and $W = [w_1, w_2, \ldots, w_m] \in \mathbb{R}^{d \times m}$.

Generic framework:

$$\min_W \; \sum_i \mathrm{Loss}(W, X_i, y_i) + \lambda \, \mathrm{Reg}(W)$$

Impose various types of relations on the tasks through $\mathrm{Reg}(W)$.

[Figure: per-task feature matrices $X_i$ ($n_i$ samples, $d$ dimensions), target vectors $y_i$, and the $d \times m$ model matrix $W$]
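As a concrete reference, here is a minimal Python sketch of this generic objective (squared loss chosen for illustration; `reg` and the other names are ours, standing in for an arbitrary $\mathrm{Reg}(W)$, not the authors' code). The regularizers sketched on the following slides can all be passed in as `reg`.

```python
import numpy as np

def mtl_objective(W, Xs, ys, lam, reg):
    """Generic regularized MTL objective: sum of per-task losses + lam * Reg(W).

    W   : (d, m) model matrix, one column w_i per task
    Xs  : list of m feature matrices X_i of shape (n_i, d)
    ys  : list of m target vectors y_i of shape (n_i,)
    reg : callable implementing a structure-inducing penalty Reg(W)
    """
    loss = sum(np.sum((X @ W[:, i] - y) ** 2)
               for i, (X, y) in enumerate(zip(Xs, ys)))
    return loss + lam * reg(W)
```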

SLIDE 10

How Are Tasks Related?

  • All tasks are related
    – Models of all tasks are close to each other
    – Models of all tasks share a common set of features
    – Models share the same low-rank subspace
  • Structure in tasks
    – clusters / graphs / trees
  • Learning with outlier tasks

SLIDE 11

MTL Methods: Mean-Regularized MTL

Evgeniou & Pontil, KDD 2004

Assumption: model parameters of all tasks are close to each other.

  – Advantage: simple, intuitive, easy to implement
  – Disadvantage: too simple

Regularization

  – Penalizes the deviation of each task from the mean

$$\min_W \; \mathrm{Loss}(W) + \lambda \sum_{i=1}^{m} \Big\| w_i - \frac{1}{m} \sum_{s=1}^{m} w_s \Big\|_2^2$$
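A minimal sketch of this penalty (columns of `W` assumed to be the task models; illustrative code, not the authors' implementation):

```python
import numpy as np

def mean_regularizer(W):
    """Sum of squared deviations of each task model (a column of W) from the
    mean model across tasks: sum_i ||w_i - mean_s(w_s)||_2^2."""
    w_bar = W.mean(axis=1, keepdims=True)
    return float(np.sum((W - w_bar) ** 2))
```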

SLIDE 12

MTL Methods: Joint Feature Learning

Evgeniou et al., NIPS 2006; Obozinski et al., Stat Comput 2009; Liu et al., Technical Report 2010

Assumption: models of all tasks share a common set of features

  – Using group sparsity: $\ell_{1,q}$-norm regularization

Regularization

  – $\|W\|_{1,q} = \sum_{j=1}^{d} \|w^j\|_q$, where $w^j$ denotes the $j$-th row of $W$
  – When $q > 1$ we have group sparsity

$$\min_W \; \mathrm{Loss}(W) + \lambda \|W\|_{1,q}$$

[Figure: the rows of $W$ (features 1..d) are selected or zeroed out jointly across tasks 1..m]
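A sketch of the $\ell_{1,q}$ penalty under these conventions (rows of `W` are features; names are ours):

```python
import numpy as np

def l1q_norm(W, q=2.0):
    """||W||_{1,q}: sum over rows (features) of the per-row l_q norm.
    For q > 1 the penalty couples the tasks, driving entire rows of W
    to zero, i.e. selecting a common feature subset for all tasks."""
    return float(np.sum(np.linalg.norm(W, ord=q, axis=1)))
```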

SLIDE 13

MTL Methods: Low-Rank MTL

Ji & Ye, ICML 2009

Assumption: in a high-dimensional feature space, the linear models share the same low-rank subspace.

Regularization: rank minimization formulation

$$\min_W \; \mathrm{Loss}(W) + \lambda \cdot \mathrm{rank}(W)$$

  – Rank minimization is NP-hard for general loss functions

  • Convex relaxation: nuclear norm minimization

$$\min_W \; \mathrm{Loss}(W) + \lambda \|W\|_*$$

( $\|W\|_*$: sum of the singular values of $W$ )
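For reference, a small sketch of the nuclear norm and of singular value soft-thresholding, the standard proximal building block for the relaxed problem (illustrative code, not this paper's implementation):

```python
import numpy as np

def nuclear_norm(W):
    """||W||_*: sum of singular values, the convex surrogate for rank."""
    return float(np.linalg.svd(W, compute_uv=False).sum())

def svt(A, tau):
    """Proximal operator of tau * ||.||_*: soft-threshold the singular
    values of A while keeping its singular vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```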

SLIDE 14

How Are Tasks Related?

  • All tasks are related
    – Models of all tasks are close to each other
    – Models of all tasks share a common set of features
    – Models share the same low-rank subspace
  • Structure in tasks
    – clusters / graphs / trees
  • Learning with outlier tasks

SLIDE 15

MTL Methods: Clustered MTL

Zhou et al., NIPS 2011

Assumption: cluster structure in tasks - the models of tasks from the same group are closer to each other than those from a different group.

Regularization: capture the clustered structure, which improves generalization performance.

$$\min_{W,\,F:\,F^{\top}F = I_k} \; \mathrm{Loss}(W) + \alpha \big( \mathrm{tr}(W^{\top}W) - \mathrm{tr}(F^{\top}W^{\top}WF) \big) + \beta \, \mathrm{tr}(W^{\top}W)$$

where $F \in \mathbb{R}^{m \times k}$ is a relaxed cluster-indicator matrix for $k$ task clusters.
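A sketch of this clustered penalty, assuming `F` is the $m \times k$ relaxed indicator matrix with orthonormal columns (illustrative names):

```python
import numpy as np

def clustered_mtl_penalty(W, F, alpha, beta):
    """alpha * (tr(W^T W) - tr(F^T W^T W F)) + beta * tr(W^T W), with F an
    m x k matrix satisfying F^T F = I_k. The first term shrinks the
    within-cluster variance of the task models; the last term controls
    overall model complexity."""
    WtW = W.T @ W
    return float(alpha * (np.trace(WtW) - np.trace(F.T @ WtW @ F))
                 + beta * np.trace(WtW))
```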

SLIDE 16

Regularization-based MTL: Decomposition Framework

  • In practice, it is too restrictive to constrain all tasks to share a single common structure.
  • Assumption: the model is the sum of two components, $W = P + Q$
    – A shared low-dimensional subspace plus a task-specific component (Ando and Zhang, JMLR 2005)
    – A group-sparse component plus a task-specific sparse component (Jalali et al., NIPS 2010)
    – A low-rank structure among relevant tasks plus outlier tasks (Gong et al., KDD 2011)

SLIDE 17

How Are Tasks Related?

  • All tasks are related
    – Models of all tasks are close to each other
    – Models of all tasks share a common set of features
    – Models share the same low-rank subspace
  • Structure in tasks
    – clusters / graphs / trees
  • Learning with outlier tasks

SLIDE 18

MTL Methods: Robust MTL

Chen et al., KDD 2011

Assumption: the models share the same low-rank subspace, plus outlier tasks: $W = P + Q$

Regularization:

  – $\|P\|_*$: nuclear norm (low-rank part)
  – $\|Q\|_{2,1} = \sum_{k=1}^{m} \|q_{:,k}\|_2$: column-sparse part

$$\min_W \; \mathrm{Loss}(W) + \alpha \|P\|_* + \beta \|Q\|_{2,1}$$

[Figure: $W$ (features x tasks) decomposed into a low-rank $P$ plus a column-sparse $Q$ whose nonzero columns mark the outlier tasks]
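A sketch of the combined penalty (columns of `Q` correspond to tasks; illustrative, not the paper's code):

```python
import numpy as np

def robust_mtl_penalty(P, Q, alpha, beta):
    """alpha * ||P||_* + beta * ||Q||_{2,1} for W = P + Q: nuclear norm on
    the shared low-rank part plus a sum of column l2 norms on Q, whose
    few surviving columns flag the outlier tasks."""
    nuclear = np.linalg.svd(P, compute_uv=False).sum()
    l21 = np.linalg.norm(Q, axis=0).sum()  # column-wise l2 norms
    return float(alpha * nuclear + beta * l21)
```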

SLIDE 19

Summary So Far…

  • All multi-task learning formulations discussed above fit the $W = P + Q$ schema.
    – Component $P$: shared structure
    – Component $Q$: information not captured by the shared structure

SLIDE 20

Outline

  • Introduction to multi-task learning (MTL): problem and models
  • Multi-task learning with task-feature co-clusters
  • Low-rank optimization in multi-task learning
  • Multi-task learning applied to trajectory regression

SLIDE 21

Recap: How Are Tasks Related?

  • All tasks are related
    – Models of all tasks are close to each other
    – Models of all tasks share a common set of features
    – Models share the same low-rank subspace
  • Structure in tasks
    – clusters / graphs / trees
  • Learning with outlier tasks

All of the above model relatedness at the task level.

SLIDE 22

How Are Tasks Related?

  • Existing methods consider structure at a general task level
  • This is a restrictive assumption in practice:
    – In document classification, different tasks may be relevant to different sets of words
    – In a recommender system, two users with similar tastes on one feature subset may have totally different preferences on another subset

SLIDE 23

CoCMTL: MTL with Task-Feature Co-Clusters

[Xu et al., AAAI-15]

  • Motivation: feature-level groups
  • Impose a task-feature co-clustering structure through $\mathrm{Reg}(W)$

[Figure: bipartite graph between tasks and features; clustering on this graph groups tasks and features jointly]

SLIDE 24

CoCMTL: Model

  • Decomposition model: $W = P + Q$

$$\min_W \; \mathrm{Loss}(W) + \lambda_1 \Omega_1(P) + \lambda_2 \Omega_2(Q)$$

SLIDE 25

CoCMTL: Model

  • Decomposition model: $W = P + Q$

$$\min_W \; \mathrm{Loss}(W) + \lambda_1 \Omega_1(P) + \lambda_2 \Omega_2(Q)$$

with

$$\Omega_2(Q) = \sum_{i=k+1}^{\min(d,m)} \sigma_i^2(Q)$$

giving

$$\min_W \; \mathrm{Loss}(W) + \lambda_1 \, \mathrm{tr}(P L P^{\top}) + \lambda_2 \sum_{i=k+1}^{\min(d,m)} \sigma_i^2(Q)$$

where $L$ is a Laplacian matrix built from the task-feature bipartite graph. The problem is non-convex.
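A sketch of $\Omega_2$ under the conventions above: the tail sum of squared singular values beyond the $k$-th, which is zero exactly when $\mathrm{rank}(Q) \le k$ (illustrative code):

```python
import numpy as np

def omega2_tail(Q, k):
    """Omega_2(Q) = sum of squared singular values beyond the k-th;
    minimizing it pushes Q toward a rank-k matrix."""
    s = np.linalg.svd(Q, compute_uv=False)  # sorted non-ascending
    return float(np.sum(s[k:] ** 2))
```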

SLIDE 26

CoCMTL: Optimization

  • We follow the Proximal Alternating Linearized Minimization (PALM) method to solve the non-convex problem.

General form:

$$\min_{P, Q} \; h(P, Q) + g(P) + f(Q)$$

  – $h(P, Q)$: convex, smooth coupling term
  – $g(P)$, $f(Q)$: lower semicontinuous

Specific form:

$$\min_{W, P, Q} \; \mathrm{Loss}(W) + \lambda_1 \Omega_1(P) + \lambda_2 \Omega_2(Q), \qquad W = P + Q$$
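A generic sketch of one PALM sweep, not the exact CoCMTL solver; the callables and step constants (`grad_h_P`, `prox_g`, `cP`, ...) are illustrative assumptions:

```python
def palm_iteration(P, Q, grad_h_P, grad_h_Q, prox_g, prox_f, cP, cQ):
    """One PALM sweep: a proximal-gradient step on each block in turn.
    cP and cQ should upper-bound the blockwise Lipschitz constants of the
    partial gradients of h; prox_g / prox_f are the (possibly non-convex)
    proximal maps of g and f, called as prox(point, step_size)."""
    P = prox_g(P - grad_h_P(P, Q) / cP, 1.0 / cP)  # update P with Q fixed
    Q = prox_f(Q - grad_h_Q(P, Q) / cQ, 1.0 / cQ)  # update Q with the new P
    return P, Q
```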

SLIDE 27

CoCMTL: Results

School data: 139 tasks, 27 features, ~15k samples

[Figure: experimental results of CoCMTL on the School data]

SLIDE 28

Outline

  • Introduction to multi-task learning (MTL): problem and models
  • Multi-task learning with task-feature co-clusters
  • Low-rank optimization in multi-task learning
  • Multi-task learning applied to trajectory regression

SLIDE 29

Recap: Low-Rank MTL

Assumption: in a high-dimensional feature space, the linear models share the same low-rank subspace.

Regularization: rank minimization formulation

$$\min_W \; \mathrm{Loss}(W) + \lambda \cdot \mathrm{rank}(W)$$

  – Rank minimization is NP-hard for general loss functions

  • Convex relaxation: nuclear norm minimization

$$\min_W \; \mathrm{Loss}(W) + \lambda \|W\|_*$$

( $\|W\|_*$: sum of the singular values of $W$ )

SLIDE 30

More on Nuclear Norm

Rank minimization formulation

$$\min_W \; \mathrm{Loss}(W) + \lambda \cdot \mathrm{rank}(W)$$

  – $\mathrm{rank}(W)$ = number of non-zero singular values
  – $\|W\|_* = \sum_i \sigma_i(W)$: sum of singular values

  • Limitation of $\|W\|_*$
    – Large singular values are penalized more heavily,
    – yet large singular values are dominant in determining the properties of a matrix
SLIDE 31

Idea: Weighted Nuclear Norm

[Zhong et al., AAAI-15; Xu et al., ICDM-16]

$$\min_W \; \mathrm{Loss}(W) + \lambda \sum_i c_i \, \sigma_i(W)$$

  • Intuition: penalize large singular values less
    – Non-descending weights $c_i$ (for singular values sorted in non-ascending order)
  • Reweighting strategy:
    – Given the current weights $c^{(t-1)}$, solve for $W^{(t)}$
    – Reweight $c$:

$$c_i^{(t)} = \frac{p}{\big(\sigma_i(W^{(t)}) + \epsilon\big)^{1-p}}, \qquad 0 < p < 1, \; \epsilon > 0$$

  • Each weight is inversely proportional to the corresponding singular value

The resulting problem is non-convex.
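A sketch of the reweighting step under this scheme (parameter names `p` and `eps` are ours):

```python
import numpy as np

def update_weights(W, p=0.5, eps=1e-3):
    """Reweighting c_i = p / (sigma_i(W) + eps)^(1 - p): singular values
    come out of the SVD sorted non-ascending, so the returned weights are
    non-descending and large singular values are penalized less."""
    s = np.linalg.svd(W, compute_uv=False)
    return p / (s + eps) ** (1.0 - p)
```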

SLIDE 32

Idea: Weighted Nuclear Norm

$$\min_W \; \mathrm{Loss}(W) + \lambda \sum_i c_i \, \sigma_i(W) \quad \longleftrightarrow \quad \min_W \; \mathrm{Loss}(W) + \lambda \sum_i \big(\sigma_i(W) + \epsilon\big)^{p}$$

with the reweighting

$$c_i^{(t)} = \frac{p}{\big(\sigma_i(W^{(t)}) + \epsilon\big)^{1-p}}$$

  • $\sum_i (\sigma_i(W) + \epsilon)^p \to \mathrm{rank}(W)$ as $\epsilon \to 0$, $p \to 0$: enhances the low-rank approximation
SLIDE 33

Optimization: Proximal Operator

First-order approximation of $\mathrm{Loss}(W)$ at $W^{(t)}$, regularized by a proximal term:

$$Q_{\eta_t}(W, W^{(t)}) = \mathrm{Loss}(W^{(t)}) + \big\langle W - W^{(t)}, \nabla \mathrm{Loss}(W^{(t)}) \big\rangle + \frac{\eta_t}{2} \big\| W - W^{(t)} \big\|_F^2$$

Generate the sequence

$$W^{(t+1)} = \arg\min_{W} \; \frac{\eta_t}{2\lambda} \Big\| W - \Big( W^{(t)} - \frac{1}{\eta_t} \nabla \mathrm{Loss}(W^{(t)}) \Big) \Big\|_F^2 + (c^{(t)})^{\top} \sigma(W)$$

  – A non-convex proximal operator problem
  – Has a closed-form solution by exploiting the structure of the weighted nuclear norm (unitarily invariant property)

  • Theorem. Suppose $A = U \Sigma V^{\top}$. Then $W^* = U \, \mathrm{Diag}(x^*) \, V^{\top}$ is a global solution of the problem

$$\min_{W} \; \frac{\nu}{2} \|W - A\|_F^2 + c^{\top} \sigma(W)$$

where $x^* = \max\!\big( \sigma(A) - \tfrac{1}{\nu} c, \, 0 \big)$ elementwise.
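The theorem translates directly into a weighted singular-value-thresholding routine; a sketch, assuming non-descending weights `c` (illustrative code):

```python
import numpy as np

def weighted_svt(A, c, nu):
    """Global solution of min_W (nu/2)||W - A||_F^2 + sum_i c_i sigma_i(W)
    for non-descending weights c: shift each singular value down by c_i/nu,
    clip at zero, and keep A's singular vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    x = np.maximum(s - np.asarray(c) / nu, 0.0)
    return U @ np.diag(x) @ Vt
```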

SLIDE 34

Optimization: Algorithm

[Algorithm ISTRA: step size initialized by the Barzilai-Borwein (BB) rule and decreased as needed; weights updated by the reweighting strategy]

SLIDE 35

Convergence Analysis

  • Convergence to critical points
  • Sublinear convergence rate

  • Theorem. The sequence $\{W^{(t)}\}$ generated by the ISTRA algorithm makes the objective function monotonically decrease, and all accumulation points (i.e. the limit points of convergent subsequences of $\{W^{(t)}\}$) are critical points (i.e. $0$ belongs to the subdifferential).

  • Theorem. Suppose $\{W^{(t)}\}$ is the sequence generated by the ISTRA algorithm, and $W^*$ is an accumulation point of $\{W^{(t)}\}$. Then

$$\min_{0 \le t \le n} \big\| W^{(t+1)} - W^{(t)} \big\|_F^2 \; \le \; \frac{2 \big( F(W^{(0)}) - F(W^*) \big)}{n \, \tau \, \eta_{\min}}$$

for the objective $F$, a positive constant $\tau$, and the minimum step size $\eta_{\min}$: an $O(1/n)$ (sublinear) rate.
SLIDE 36

Results

[Figure: experimental results on the School data]
SLIDE 37

Outline

  • Introduction to multi-task learning (MTL): problem and models
  • Multi-task learning with task-feature co-clusters
  • Low-rank optimization in multi-task learning
  • Multi-task learning applied to trajectory regression

SLIDE 38

Trajectory Regression: Problem

Trajectory: a sequence of links (road segments) in which any two consecutive links share an intersection.

Goal: estimate the total travel time of an arbitrary trajectory.

SLIDE 39

Trajectory Regression: Problem

Given a set of $N$ trajectory-cost pairs $D \equiv \{(x_i, y_i) \mid i = 1, 2, \ldots, N\}$, $x_i \in \mathbb{R}^d$

  – Each feature of $x_i$ corresponds to a link: the distance traveled along that link

Goal: learn the weights $w \in \mathbb{R}^d$ that encode the cost per distance unit for each link.

Single-task learning:

$$\min_{w} \; \|y - Xw\|_2^2 + \beta \|w\|_2^2$$

Example: Road Seg. 1: 5 km, Road Seg. 2: 2 km, Road Seg. 3: 1 km, ..., Road Seg. d: ...  ->  Cost: 20 min
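A sketch of this single-task baseline in closed form (ridge regression; names are illustrative):

```python
import numpy as np

def ridge_link_costs(X, y, gamma):
    """Closed-form solution of min_w ||y - Xw||^2 + gamma * ||w||^2.
    Rows of X hold the per-link distances of one trajectory, y holds the
    observed travel times, and the learned w holds each link's cost per
    distance unit."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ y)
```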

SLIDE 40

Trajectory Regression: Key Challenges

  • Dynamic: costs of road segments are not static over time
    – The cost of a road segment fluctuates smoothly most of the time
    – Costs can be abruptly different between peak periods and off-peak periods
  • Trajectories are extremely sparse
    – A driving path spans just a small fraction of the road segments
  • Insufficient instances
SLIDE 41

Trajectory Regression: Idea

[Huang et al., ICDM-14]

Dynamic trajectory regression in an MTL framework:

  – Divide $D$ into $m$ disjoint subsets ordered by time
  – Multi-task learning framework: each time slot corresponds to a task
  – Leverage the inherent relations of the tasks to enhance predictive performance, especially when the data samples are insufficient

SLIDE 42

Trajectory Regression

$$\min_W \; \sum_i \mathrm{Loss}(W, X_i, y_i) + \lambda \, \mathrm{Reg}(W) \;=\; \min_W \; \sum_i \|y_i - X_i w_i\|^2 + \lambda \, \mathrm{Reg}(W)$$

Structure in the trajectory regression problem:

  • Global temporal smoothness:
    – Link costs change smoothly most of the time
  • Global spatial smoothness:
    – Costs are similar if the two corresponding links are close to each other
  • Local temporal patterns:
    – Significant temporal changes during rush hours

SLIDE 43

Trajectory Regression - Additive Model

$$\min_W \; \sum_i \mathrm{Loss}(W, X_i, y_i) + \lambda \, \mathrm{Reg}(W) \;=\; \min_W \; \sum_i \|y_i - X_i w_i\|^2 + \lambda \, \mathrm{Reg}(W)$$

$$W = P + Q$$

  • $P$: models the global smoothness over links and time
  • $Q$: captures the local “outliers”, including rush hours

SLIDE 44

Trajectory Regression - Regularization

$$W = P + Q$$

  • $P$: models the global smoothness over links and time
  • Global temporal smoothness

$$\Omega_1 = \sum_{t=1}^{m} \Big\| P_{:,t} - \frac{1}{m} \sum_{s=1}^{m} P_{:,s} \Big\|_2^2 = \mathrm{tr}(P L_1 P^{\top}), \qquad L_1 = I - \frac{1}{m} \mathbf{1}\mathbf{1}^{\top}$$

  • Enforces the columns of $P$, i.e. the tasks, to be similar, with some discrepancy allowed
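A sketch of $\Omega_1$ via the centering matrix (columns of `P` are time slots; illustrative code):

```python
import numpy as np

def temporal_smoothness(P):
    """Omega_1 = tr(P L1 P^T) with L1 = I - (1/m) * ones * ones^T: equals
    the sum of squared deviations of each time slot's model (a column of P)
    from the mean model over all slots."""
    m = P.shape[1]
    L1 = np.eye(m) - np.ones((m, m)) / m
    return float(np.trace(P @ L1 @ P.T))
```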

SLIDE 45

Trajectory Regression - Regularization

$$W = P + Q$$

  • $P$: models the global smoothness over links and time
  • Global spatial smoothness

$$\Omega_2 = \sum_{j,k=1}^{d} S_{jk} \, \big\| P_{j,:} - P_{k,:} \big\|_2^2 = \mathrm{tr}(P^{\top} L_2 P)$$

  • $S$ measures the spatial closeness of links; $L_2$ is the corresponding graph Laplacian
  • Costs are similar if the two corresponding links are close to each other
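A sketch of $\Omega_2$ via the graph Laplacian of `S` (note that $\mathrm{tr}(P^{\top} L_2 P)$ equals half the double sum; the constant factor can be absorbed into $\lambda_2$):

```python
import numpy as np

def spatial_smoothness(P, S):
    """tr(P^T L2 P) with L2 = D - S, the Laplacian of the symmetric
    link-closeness matrix S; equals (1/2) * sum_{j,k} S_jk ||P_j - P_k||^2,
    penalizing cost differences between spatially close road segments."""
    L2 = np.diag(S.sum(axis=1)) - S
    return float(np.trace(P.T @ L2 @ P))
```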

SLIDE 46

Trajectory Regression - Regularization

$$W = P + Q$$

  • $Q$: captures the local “outliers”, including rush hours
  • Local significant temporal transitions:

$$\Omega_3 = \|Q\|_{\infty,1}, \qquad \|Q\|_{\infty,1} = \sum_{k} \|q_{:,k}\|_{\infty}, \quad \|q_{:,k}\|_{\infty} = \max_{j} |q_{jk}|$$

  • Enforces column sparsity to identify peak traffic
  • The $\ell_{\infty,1}$ norm is influenced only by the maximum elements of the nonzero columns: the cost of a trajectory is mostly decided by the link with the highest cost during traffic peaks
  • Leaves out the outliers: ROBUST

SLIDE 47

Trajectory Regression - Model

$$\min_W \; \sum_{i=1}^{m} \|y_i - X_i w_i\|_2^2 + \lambda_1 \mathrm{tr}(P L_1 P^{\top}) + \lambda_2 \mathrm{tr}(P^{\top} L_2 P) + \lambda_3 \|Q\|_{\infty,1}$$

SLIDE 48

Trajectory Regression - Optimization

$$\min_W \; \sum_{i=1}^{m} \|y_i - X_i w_i\|_2^2 + \lambda_1 \mathrm{tr}(P L_1 P^{\top}) + \lambda_2 \mathrm{tr}(P^{\top} L_2 P) + \lambda_3 \|Q\|_{\infty,1}$$

A convex problem, but non-trivial to optimize because of the $\ell_{\infty,1}$ term.

Proximal method: split the objective into a smooth part and a non-smooth part,

$$\min_W \; F(W) + R(W), \qquad F(W) = \mathrm{Loss}(W) + \lambda_1 \mathrm{tr}(P L_1 P^{\top}) + \lambda_2 \mathrm{tr}(P^{\top} L_2 P), \quad R(W) = \lambda_3 \|Q\|_{\infty,1}$$

and iterate proximal-gradient steps on the two components of $W = P + Q$:

$$P^{(t)} = \arg\min_{P} \; \frac{\gamma_t}{2} \big\| P - C_P(P^{(t-1)}) \big\|_F^2, \qquad Q^{(t)} = \arg\min_{Q} \; \frac{\gamma_t}{2} \big\| Q - C_Q(Q^{(t-1)}) \big\|_F^2 + \lambda_3 \|Q\|_{\infty,1}$$

where $C_P(\cdot)$ and $C_Q(\cdot)$ denote gradient steps on $F$. The $Q$-update decouples over columns into

$$\min_{q_j} \; \frac{1}{2} \|q_j - c_j\|_2^2 + \mu \|q_j\|_{\infty}$$

which is solved through its conjugate, a projection onto an $\ell_1$ ball, via the Moreau decomposition $c = \mathrm{prox}_R(c) + \mathrm{prox}_{R^*}(c)$.
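A sketch of the column-wise proximal step: the prox of $\mu \|\cdot\|_{\infty}$ computed through the Moreau decomposition and a sort-based $\ell_1$-ball projection (`project_l1_ball` and `prox_linf` are our illustrative names):

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {x : ||x||_1 <= radius}
    (sort-based algorithm, O(n log n))."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u) - radius
    rho = np.nonzero(u > css / np.arange(1, len(u) + 1))[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def prox_linf(c, mu):
    """prox of mu * ||.||_inf via the Moreau decomposition
    c = prox_R(c) + prox_{R*}(c): subtract the projection of c onto the
    l1 ball of radius mu (the ball of the conjugate norm)."""
    return c - project_l1_ball(c, mu)
```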

SLIDE 49

Trajectory Regression - Results

  • Suzhou Traffic Data
    – Contains 59,593 trajectory records of 4,797 taxis, collected from 7:00 to 19:59 in the urban area of Suzhou during the first week of March 2012

SLIDE 50

Summary

  • Multi-task learning (MTL)
    – MTL is preferred when dealing with multiple related tasks, each with a small number of training samples
    – Key issue of MTL: exploiting the relationships among the tasks
  • Optimization
    – General formulations; classical algorithms apply
    – Distributed optimization
  • Applications
    – Task relationships are specific to the nature of the problem

SLIDE 51

References

  • Xu, L., Huang, A., Chen, J., and Chen, E. Exploiting Task-Feature Co-Clusters in Multi-Task Learning. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI-15).
  • Zhong, X., Xu, L., Li, Y., Liu, Z., and Chen, E. A Nonconvex Relaxation Approach for Rank Minimization Problems. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI-15).
  • Xu, L., Chen, Z., Zhou, Q., Chen, E., Yuan, N.J., and Xie, X. Aligned Matrix Completion: Integrating Consistency and Independency in Multiple Domains. In Proceedings of the 16th IEEE International Conference on Data Mining (ICDM-16).
  • Huang, A., Xu, L., Li, Y., and Chen, E. Robust Dynamic Trajectory Regression on Road Networks: A Multi-Task Learning Framework. In Proceedings of the 14th IEEE International Conference on Data Mining (ICDM-14).
  • Zhou, J., Chen, J., and Ye, J. Multi-Task Learning: Theory, Algorithms, and Applications. SDM 2012 Tutorial.
  • Evgeniou, T., and Pontil, M. Regularized Multi-Task Learning. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-04).
  • Argyriou, A., Evgeniou, T., and Pontil, M. Multi-Task Feature Learning. In Advances in Neural Information Processing Systems 19 (NIPS-06).
  • Argyriou, A., Evgeniou, T., and Pontil, M. Convex Multi-Task Feature Learning. Machine Learning, 73, 2008.
  • Ji, S., and Ye, J. An Accelerated Gradient Method for Trace Norm Minimization. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML-09).
  • Zhou, J., Chen, J., and Ye, J. Clustered Multi-Task Learning Via Alternating Structure Optimization. In Advances in Neural Information Processing Systems 24 (NIPS-11).
  • Chen, J., Zhou, J., and Ye, J. Integrating Low-Rank and Group-Sparse Structures for Robust Multi-Task Learning. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-11).

SLIDE 52

Joint Work With

Students

  – Aiqing Huang
  – Xiaowei Zhong
  – Zaiyi Chen
  – Yitan Li

Collaborators

  – Enhong Chen
  – Jianhui Chen

SLIDE 53

Thanks!

Email: linlixu@ustc.edu.cn