Multi-Task Learning: Models, Optimization and Applications
Linli Xu University of Science and Technology of China
Outline
Introduction to multi-task learning (MTL): problem and models
2016/11/5 2
School 1 - Alverno High School, …, School 138 - Jefferson Intermediate School, School 139 - Rosemead High School

1. The Inner London Education Authority (ILEA)

Features are student-dependent (e.g. birth year, previous score) or school-dependent (e.g. school ranking).

School  Student id  Birth year  Previous score  …  School ranking  …  Exam score
1       72981       1985        95              …  83%             …  ?
138     31256       1986        87              …  72%             …  ?
139     12381       1986        83              …  77%             …  ?
Each school is one task: School 1 is the 1st task, …, School 138 is the 138th task, and School 139 is the 139th task.
[Figure: the student records of Schools 1, 138 and 139 with unknown exam scores, each annotated "Excellent".]
Learn tasks simultaneously
Model the task relationships
Single Task Learning: the training data of Task 1, Task 2, …, Task m each go through a separate training step, producing one model per task.
Multi-Task Learning: the training data of Task 1, Task 2, …, Task m go through one joint training step, producing the models of all tasks.
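We focus on linear models: $Y_i \sim X_i \boldsymbol{w}_i$, with $X_i \in \mathbb{R}^{n_i \times d}$, $Y_i \in \mathbb{R}^{n_i \times 1}$, $W = [\boldsymbol{w}_1, \boldsymbol{w}_2, \dots, \boldsymbol{w}_m] \in \mathbb{R}^{d \times m}$
Generic framework:
$$\min_{W} \sum_{i} \mathrm{Loss}(W, X_i, Y_i) + \lambda\, \mathrm{Re}(W)$$
Impose various types of relations on tasks with $\mathrm{Re}(W)$
[Figure: feature matrices $X_i$ (sample $n_i$ × dimension $d$) and target vectors $Y_i$ (sample $n_i$) for tasks 1 to m; learning produces the model matrix $W$ (dimension $d$ × task $m$).]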
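The generic framework can be sketched in a few lines of code. The sketch below evaluates the objective for a given model matrix; it assumes a squared loss per task and takes the regularizer as a callable, and the names `mtl_objective`, `frobenius_sq`, `Xs`, `Ys` and `reg` are hypothetical, not from the slides.

```python
import numpy as np

def mtl_objective(W, Xs, Ys, lam, reg):
    """Sum of per-task squared losses plus lam * reg(W).

    W   : (d, m) model matrix; column i is the weight vector of task i
    Xs  : list of m feature matrices, Xs[i] has shape (n_i, d)
    Ys  : list of m target vectors, Ys[i] has shape (n_i,)
    reg : callable mapping W to a scalar (assumed interface)
    """
    loss = 0.0
    for i, (X, Y) in enumerate(zip(Xs, Ys)):
        residual = Y - X @ W[:, i]          # prediction error of task i
        loss += float(residual @ residual)  # squared loss (an assumption)
    return loss + lam * reg(W)

def frobenius_sq(W):
    """A placeholder regularizer: squared Frobenius norm of W."""
    return float(np.sum(W ** 2))

# Two toy tasks with two samples and two features each.
Xs = [np.eye(2), np.eye(2)]
Ys = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(mtl_objective(np.zeros((2, 2)), Xs, Ys, lam=0.5, reg=frobenius_sq))
```

Swapping `frobenius_sq` for another penalty is how the different task relations discussed in the talk would enter this sketch.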
Evgeniou & Pontil, 2004 KDD
$$\min_{W} \mathrm{Loss}(W) + \lambda \sum_{i=1}^{m} \Big\| W_i - \frac{1}{m} \sum_{s=1}^{m} W_s \Big\|_2^2$$
Evgeniou et al. 2006 NIPS, Obozinski et al. 2009 Stat Comput, Liu et al. 2010 Technical Report
$$\min_{W} \mathrm{Loss}(W) + \lambda \|W\|_{1,q}$$
[Figure: model matrix with rows Feature 1, Feature 2, …, Feature d and columns Task 1, Task 2, …, Task m.]
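A minimal sketch of the $\ell_{1,q}$ penalty follows. It assumes the usual definition, the sum over features (rows of $W$) of the $\ell_q$ norm of each row, which the slide itself does not spell out; the function name `l1q_norm` is hypothetical.

```python
import numpy as np

def l1q_norm(W, q=2):
    """Sum over rows (features) of the l_q norm of each row of W."""
    row_norms = np.linalg.norm(W, ord=q, axis=1)  # one norm per feature
    return float(np.sum(row_norms))

# Rows are features, columns are tasks.
W = np.array([[3.0, 4.0],
              [0.0, 0.0],   # a feature unused by every task adds nothing
              [1.0, 0.0]])
print(l1q_norm(W, q=2))
```

Because whole rows are penalized together, a feature tends to be kept or dropped jointly across tasks, which matches the feature-by-task grid in the figure.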
Ji et al. 2009 ICML
$$\min_{W} \mathrm{Loss}(W) + \lambda \cdot \mathrm{rank}(W)$$
$$\min_{W} \mathrm{Loss}(W) + \lambda \|W\|_*$$
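To make the two penalties concrete, the sketch below computes the rank and the nuclear norm (sum of singular values) of a small matrix. The helper name `nuclear_norm` and the toy matrix are illustrative assumptions.

```python
import numpy as np

def nuclear_norm(W):
    """Nuclear norm: the sum of the singular values of W."""
    return float(np.linalg.svd(W, compute_uv=False).sum())

# A rank-one model matrix: every task uses a multiple of one weight vector.
W = np.outer([1.0, 2.0], [3.0, 4.0])
print(np.linalg.matrix_rank(W), nuclear_norm(W))
```

The rank counts nonzero singular values, while the nuclear norm sums them, which is why the second problem is a continuous surrogate of the first.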
Zhou et al. 2011 NIPS
Improves generalization performance; captures cluster structures
$$\min_{W,\,F:\,F^\top F = I_k} \mathrm{Loss}(W) + \alpha \big( \mathrm{tr}(W^\top W) - \mathrm{tr}(F^\top W^\top W F) \big) + \beta\, \mathrm{tr}(W^\top W)$$
Chen et al. 2011 KDD
$$\min_{W = P + Q} \mathrm{Loss}(W) + \alpha \|P\|_* + \beta \|Q\|_{2,1}$$
[Figure: model matrix over features and $m$ tasks, split into a low rank part and a column-sparse part.]
[Xu et al., AAAI15]
Tasks and features: clustering on the bipartite graph
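$$\min_{W} \mathrm{Loss}(W) + \lambda_1 \Omega_1(P) + \lambda_2 \Omega_2(Q)$$
$$\Omega_2(Q) = \sum_{i=k+1}^{\min(d,m)} \sigma_i^2(Q)$$
$$\min_{W} \mathrm{Loss}(W) + \lambda_1 \mathrm{tr}(P L P^\top) + \lambda_2 \sum_{i=k+1}^{\min(d,m)} \sigma_i^2(Q)$$
The problem is non-convex.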
General form:
$$\min_{P,Q} \; h(P,Q) + g(P) + f(Q)$$
$h(P,Q)$: convex; $g(P)$, $f(Q)$: lower semi-continuous
Specific form (formulation):
$$\min_{W,P,Q} \; \mathrm{Loss}(W) + \lambda_1 \Omega_1(P) + \lambda_2 \Omega_2(Q) \quad \text{s.t. } W = P + Q$$
Rank minimization:
$$\min_{W} \mathrm{Loss}(W) + \lambda \cdot \mathrm{rank}(W)$$
Nuclear norm relaxation:
$$\min_{W} \mathrm{Loss}(W) + \lambda \|W\|_*$$
[Zhong et al, AAAI15; Xu et al, ICDM16]
$$\min_{W} \mathrm{Loss}(W) + \lambda \sum_{i} p_i^k\, \sigma_i(W), \qquad p_i^k = \frac{r}{\big(\sigma_i(W^k) + \epsilon\big)^{1-r}}, \quad \text{where } 0 < r < 1,\ \epsilon > 0$$
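The weight update can be sketched as follows, reading the formula as $p_i^k = r\,(\sigma_i(W^k) + \epsilon)^{r-1}$. The function name `reweight` and the default values of `r` and `eps` are assumptions chosen for illustration.

```python
import numpy as np

def reweight(W_k, r=0.5, eps=1e-3):
    """Weights p_i = r * (sigma_i(W_k) + eps)**(r - 1), with 0 < r < 1."""
    sigma = np.linalg.svd(W_k, compute_uv=False)  # sorted in descending order
    return r * (sigma + eps) ** (r - 1.0)

W_k = np.diag([4.0, 1.0])
p = reweight(W_k, r=0.5, eps=0.0)
print(p)
```

Since the exponent `r - 1` is negative, larger singular values receive smaller weights, so the dominant directions of $W^k$ are penalized less in the next iteration.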
First-order approximation of $\mathrm{Loss}(W)$, regularized by a proximal term:
$$P_{t_k}(W, W^k) = \mathrm{Loss}(W^k) + \langle W - W^k, \nabla \mathrm{Loss}(W^k) \rangle + \frac{t_k}{2} \|W - W^k\|_F^2$$
Generate the sequence
$$W^{k+1} = \arg\min_{W} \frac{t_k}{2\lambda} \Big\| W - \Big( W^k - \frac{1}{t_k} \nabla \mathrm{Loss}(W^k) \Big) \Big\|_F^2 + (p^k)^\top \sigma(W)$$
– Non-convex proximal operator problem
– Has closed form solution by exploiting structure of the weighted nuclear norm (unitarily invariant property)
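A solution of the problem
$$\min_{X} \frac{\mu}{2} \|X - A\|_F^2 + p^\top \sigma(X)$$
is $X^* = U\,\mathrm{diag}(x^*)\,V^\top$, with $A = U\,\mathrm{diag}(\sigma(A))\,V^\top$ the SVD of $A$, where $x^*$ can be denoted as
$$x^* = \max\Big( \sigma(A) - \frac{1}{\mu}\, p,\ 0 \Big)$$
Algorithm ingredients: Barzilai-Borwein (BB) rule, decrease the step size, reweighting strategy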
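The closed-form step can be sketched as a weighted singular value thresholding. This is a minimal sketch, assuming the weights `p` are in non-decreasing order (the condition under which this closed form is usually stated for the weighted nuclear norm); the function name `weighted_svt` is hypothetical.

```python
import numpy as np

def weighted_svt(A, p, mu):
    """Shrink each singular value of A by p_i / mu and clip at zero.

    Assumes p is non-decreasing, matching singular values sorted in
    descending order as returned by numpy.
    """
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    x_star = np.maximum(sigma - p / mu, 0.0)
    return (U * x_star) @ Vt  # same as U @ diag(x_star) @ Vt

A = np.diag([3.0, 1.0])
X_star = weighted_svt(A, p=np.array([1.0, 2.0]), mu=1.0)
print(X_star)
```

In the iteration above, `A` would play the role of the gradient step $W^k - \frac{1}{t_k}\nabla \mathrm{Loss}(W^k)$ and `mu` the role of $t_k/\lambda$.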
Convergence: the limit points of convergent subsequences of $\{W^k\}$ are critical points (i.e. 0 belongs to the subgradients).
Convergence rate: if $\{W^k\}$ is the sequence generated by the algorithm, and $W^*$ is an accumulation point of $\{W^k\}$, then
$$\min_{0 \le k \le n} \|W^{k+1} - W^k\|^2 \le \frac{2\,\big(f(W^0) - f(W^*)\big)}{n\,\tau\,t_{\min}}$$
where $f$ denotes the objective function.
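$$\min_{\boldsymbol{w}} \|Y - X\boldsymbol{w}\|_2^2 + \beta \|\boldsymbol{w}\|_2^2$$
[Figure: a trajectory with cost 20 min over a sequence of roads of length 5 km, 2 km, 1 km, …]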
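The ridge regression problem above has the closed-form solution $\boldsymbol{w} = (X^\top X + \beta I)^{-1} X^\top Y$, which the sketch below computes. The function name `ridge` and the toy data are assumptions for illustration, not values from the slides.

```python
import numpy as np

def ridge(X, Y, beta):
    """Minimize ||Y - X w||_2^2 + beta * ||w||_2^2 in closed form."""
    d = X.shape[1]
    # Setting the gradient to zero gives (X^T X + beta I) w = X^T Y.
    return np.linalg.solve(X.T @ X + beta * np.eye(d), X.T @ Y)

# Toy data: two samples, two features.
X = np.eye(2)
Y = np.array([2.0, 4.0])
w = ridge(X, Y, beta=1.0)
print(w)
```

`np.linalg.solve` is used instead of an explicit matrix inverse, which is the more numerically stable choice for this linear system.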
[Huang et al. ICDM14]
$$\min_{W} \sum_{i} \mathrm{Loss}(W, X_i, Y_i) + \lambda\, \mathrm{Re}(W) = \min_{W} \sum_{i} \|Y_i - X_i \boldsymbol{w}_i\|_2^2 + \lambda\, \mathrm{Re}(W)$$
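$$\Omega_1 = \sum_{t=1}^{m} \Big\| P_{:,t} - \frac{1}{m} \sum_{r=1}^{m} P_{:,r} \Big\|_2^2 = \mathrm{tr}(P L_1 P^\top), \qquad L_1 = I - \frac{1}{m} \mathbf{1}\mathbf{1}'$$
$\Omega_1$ penalizes the discrepancy between each column $P_{:,t}$ and the mean column.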
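The equality between the sum form and the trace form of $\Omega_1$ can be checked numerically. The sketch below is only such a check on random data; the function names `omega1_sum` and `omega1_trace` are hypothetical.

```python
import numpy as np

def omega1_sum(P):
    """Sum over columns of the squared distance to the mean column."""
    mean_col = P.mean(axis=1, keepdims=True)
    return float(np.sum((P - mean_col) ** 2))

def omega1_trace(P):
    """Trace form tr(P L1 P^T) with L1 = I - (1/m) * ones * ones^T."""
    m = P.shape[1]
    L1 = np.eye(m) - np.ones((m, m)) / m
    return float(np.trace(P @ L1 @ P.T))

rng = np.random.default_rng(0)
P = rng.standard_normal((4, 3))  # 4 features, 3 tasks
print(omega1_sum(P), omega1_trace(P))
```

The two values agree because $L_1$ is a symmetric projection: $P L_1$ subtracts the mean column from every column, and $L_1 L_1^\top = L_1$.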
$$\sum^{d} \big\| \cdots_{j,:} \big\|_2^2 = \mathrm{tr}(P^\top L_2 P)$$
$$\|Z\|_{\infty,1} = \sum_{j} \|Z_{:,j}\|_\infty, \qquad \|Z_{:,j}\|_\infty = \max_{i} |Z_{ij}|$$
Nonzero columns: the cost of a trajectory is mostly decided by the link with highest cost during traffic peaks
$$\min_{W} \sum_{i=1}^{m} \|Y_i - X_i \boldsymbol{w}_i\|_2^2 + \lambda_1 \mathrm{tr}(P L_1 P^\top) + \lambda_2 \mathrm{tr}(P^\top L_2 P) + \lambda_3 \|Q\|_{\infty,1}$$
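Convex problem, but non-trivial for optimization due to the $\ell_{\infty,1}$ term.
Proximal Method:
$$\min_{W} F(W) + R(W)$$
$$F(W) = L(W) + \lambda_1 \mathrm{tr}(P L_1 P^\top) + \lambda_2 \mathrm{tr}(P^\top L_2 P), \qquad R(W) = \lambda_3 \|Q\|_{\infty,1}$$
$$P^r = \arg\min_{P} \frac{\gamma_r}{2} \|P - C_P(P^{r-1})\|_F^2,$$
$$Q^r = \arg\min_{Q} \frac{\gamma_r}{2} \|Q - C_Q(Q^{r-1})\|_F^2 + \lambda_3 \|Q\|_{\infty,1}$$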
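Column-wise subproblem:
$$\min_{\boldsymbol{q}_i} \frac{1}{2} \|\boldsymbol{q}_i - \boldsymbol{c}_i\|_2^2 + \lambda \|\boldsymbol{q}_i\|_\infty$$
Related $\ell_1$ problem:
$$\min_{\boldsymbol{q}_i} \frac{1}{2} \|\boldsymbol{q}_i - \boldsymbol{c}_i\|_2^2 + \lambda \|\boldsymbol{q}_i\|_1$$
Moreau decomposition:
$$\boldsymbol{c} = \mathrm{prox}_{R}(\boldsymbol{c}) + \mathrm{prox}_{R^*}(\boldsymbol{c})$$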
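One common way to use the Moreau decomposition here is sketched below: the proximal operator of $\lambda\|\cdot\|_\infty$ equals the input minus its Euclidean projection onto the $\ell_1$ ball of radius $\lambda$. The sort-based projection is a standard routine swapped in for illustration and is not taken from the slides; the names `project_l1_ball` and `prox_linf` are hypothetical.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {x : ||x||_1 <= radius} (sort-based)."""
    abs_v = np.abs(v)
    if abs_v.sum() <= radius:
        return v.copy()                      # already inside the ball
    u = np.sort(abs_v)[::-1]                 # magnitudes, descending
    cumsum = np.cumsum(u)
    k = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * k > cumsum - radius)[0][-1]
    theta = (cumsum[rho] - radius) / (rho + 1)
    return np.sign(v) * np.maximum(abs_v - theta, 0.0)

def prox_linf(c, lam):
    """Solve min_q 0.5 * ||q - c||_2^2 + lam * ||q||_inf via Moreau."""
    return c - project_l1_ball(c, lam)

c = np.array([3.0, 1.0, -2.0])
q = prox_linf(c, lam=1.0)
print(q)
```

Applied column by column to $C_Q(Q^{r-1})$, with `lam` set to $\lambda_3/\gamma_r$, a routine of this kind would give the $Q$ update of the proximal method.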
29th National Conference on Artificial Intelligence (AAAI-15).
Proceedings of the 29th National Conference on Artificial Intelligence (AAAI-15).
Independency in Multiple Domains. In Proceedings of the 16th IEEE Conference on Data Mining (ICDM-16).
Conference on Knowledge Discovery and Data Mining (KDD-04).
19 (NIPS-06).
International Conference on Machine Learning (ICML-09).
Information Processing Systems 24 (NIPS-11).
Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-11).