Multi-Task Learning and Matrix Regularization

Andreas Argyriou
Department of Computer Science
University College London
Collaborators
- T. Evgeniou (INSEAD)
- R. Hauser (University of Oxford)
- M. Herbster (University College London)
- A. Maurer (Stemmer Imaging)
- C.A. Micchelli (SUNY Albany)
- M. Pontil (University College London)
- Y. Ying (University of Bristol)
Main Themes
- Machine learning
- Convex optimization
- Sparse recovery
Outline
- Multi-task learning and related problems
- Matrix learning and an alternating algorithm
- Extensions of the method
- Multi-task representer theorems
- Kernel hyperparameter learning; convex kernel learning
Supervised Learning (Single-Task)
- $m$ examples are given: $(x_1, y_1), \dots, (x_m, y_m) \in X \times Y$
- Predict using a function $f : X \to Y$
- Want the function to generalize well over the whole of $X \times Y$
- Includes regression, classification etc.
- Task = probability measure on $X \times Y$
Multi-Task Learning
- Tasks $t = 1, \dots, n$
- $m$ examples per task are given: $(x_{t1}, y_{t1}), \dots, (x_{tm}, y_{tm}) \in X \times Y$
- Predict using functions $f_t : X \to Y$, $t = 1, \dots, n$
- When the tasks are related, learning the tasks jointly should perform better than learning each task independently
- Especially important when few data points are available per task (small $m$); in such cases, independent learning is not successful
Multi-Task Learning (contd.)
- One goal is to learn what structure is common across the $n$ tasks
- Want simple, interpretable models that can explain multiple tasks
- Want good generalization on the $n$ given tasks but also on new tasks (transfer learning)
- Given a few examples from a new task $t'$, $\{(x_{t'1}, y_{t'1}), \dots, (x_{t'\ell}, y_{t'\ell})\}$, we want to learn $f_{t'}$ using just the learned task structure
Learning Theoretic View: Environment of Tasks
- Environment = probability distribution on a set of learning tasks [Baxter, 1996]
- To sample a task-specific sample from the environment:
– draw a function $f_t$ from the environment
– generate a sample $\{(x_{t1}, y_{t1}), \dots, (x_{tm}, y_{tm})\} \in (X \times Y)^m$ using $f_t$
- Multi-task learning means learning a common hypothesis space
Learning Theoretic View (contd.)
- Baxter’s results:
– As $n$ (number of tasks) increases, $m$ (number of examples needed per task) decreases as $O(1/n)$
– Once we have learned a hypothesis space $\mathcal{H}$, we can use it to learn a new task drawn from the same environment; the sample complexity depends on the log-capacity of $\mathcal{H}$
- Other results:
– Task relatedness due to input transformations: improved multi-task bounds in some cases [Ben-David & Schuller, 2003]
– Using common feature maps (bounded linear operators): error bounds depend on the Hilbert-Schmidt norm [Maurer, 2006]
Multi-Task Applications
- Multi-task learning is ubiquitous
- Human intelligence relies on transferring learned knowledge from previous tasks to new tasks
- E.g. character recognition (very few examples should be needed to
recognize new characters)
- Integration of medical / bioinformatics databases
Multi-Task Applications (contd.)
- Marketing databases, collaborative filtering, recommendation systems
(e.g. Netflix); task = product preferences for each person
Multi-Task Applications (contd.)
- Multiple object classification in scenes: an image may contain multiple objects, and learning common visual features enhances performance
Related Problems
- Sparse coding (some images share common basis images)
- Vector-valued / structured output
- Multi-class problems
- Regression with grouped variables, multifactor ANOVA in statistics (selection of groups of variables)
- Multi-task learning is a broad problem; no single method can solve everything
Learning Multiple Tasks with a Common Kernel
- Let $X \subseteq \mathbb{R}^d$, $Y \subseteq \mathbb{R}$ and let us learn $n$ linear functions $f_t(x) = \langle w_t, x \rangle$, $t = 1, \dots, n$ (we ignore nonlinearities for the moment)
- Want to impose common structure / relatedness across tasks
- Idea: use a common linear kernel for all tasks (sketched in code below)
$$K(x, x') = \langle x, D x' \rangle \qquad (\text{where } D \succ 0)$$
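For concreteness, here is a minimal numpy rendering of this shared linear kernel; the helper name and the randomly generated $D$ are illustrative only, not part of the slides.

```python
# A minimal sketch of the shared linear kernel K(x, x') = <x, D x'>.
import numpy as np

def common_kernel(x, x_prime, D):
    return x @ D @ x_prime           # assumes D symmetric positive definite

d = 5
A = np.random.randn(d, d)
D = A @ A.T + np.eye(d)              # a generic positive definite matrix
D = D / np.trace(D)                  # scaled so that tr(D) <= 1
```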
Learning Multiple Tasks with a Common Kernel
- For every $t = 1, \dots, n$ solve
$$\min_{w_t \in \mathbb{R}^d} \ \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \langle w_t, D^{-1} w_t \rangle$$
- Adding up, we obtain the equivalent problem
$$\min_{w_1, \dots, w_n \in \mathbb{R}^d} \ \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \sum_{t=1}^n \langle w_t, D^{-1} w_t \rangle$$
Learning Multiple Tasks with a Common Kernel
- For multi-task learning, we want to learn the common kernel from a convex set of kernels:
$$\inf_{\substack{w_1, \dots, w_n \in \mathbb{R}^d \\ D \succ 0,\ \mathrm{tr}(D) \le 1}} \ \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma\, \mathrm{tr}(W^\top D^{-1} W) \qquad (MTL)$$
where $\mathrm{tr}(W^\top D^{-1} W) = \sum_{t=1}^n \langle w_t, D^{-1} w_t \rangle$
- We denote $W = [\, w_1 \ \cdots \ w_n \,]$
Learning Multiple Tasks with a Common Kernel
- Jointly convex problem in $(W, D)$
- The constraint $\mathrm{tr}(D) \le 1$ is important
- Fixing $W$, the optimal $D$ is
$$D(W) \propto (W W^\top)^{\frac{1}{2}}$$
($D(W)$ is usually not in the feasible set, hence the inf)
- Once we have learned $\hat{D}$, we can transfer it to the learning of a new task $t'$:
$$\min_{w \in \mathbb{R}^d} \ \sum_{i=1}^m E(\langle w, x_{t'i} \rangle, y_{t'i}) + \gamma \langle w, \hat{D}^{-1} w \rangle$$
Alternating Minimization Algorithm
- Alternating minimization over $W$ (supervised learning) and $D$ (unsupervised “correlation” of tasks); a numpy sketch follows

Initialization: set $D = \frac{1}{d} I_{d \times d}$
while convergence condition is not true do
  for $t = 1, \dots, n$: learn $w_t$ independently by minimizing $\sum_{i=1}^m E(\langle w, x_{ti} \rangle, y_{ti}) + \gamma \langle w, D^{-1} w \rangle$
  set $D = \dfrac{(W W^\top)^{1/2}}{\mathrm{tr}\,(W W^\top)^{1/2}}$
end while
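A minimal sketch of this loop with the square loss, so that each $w_t$-step is a generalized ridge regression with a closed form; the function names, data layout and the $\varepsilon$ perturbation (from problem $(R_\varepsilon)$ below) are illustrative assumptions, not the authors’ code.

```python
import numpy as np

def sqrtm_psd(A):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    lam, U = np.linalg.eigh(A)
    return (U * np.sqrt(np.clip(lam, 0.0, None))) @ U.T

def alternating_mtl(tasks_X, tasks_y, gamma=0.1, eps=1e-6, n_iter=50):
    """tasks_X: list of n arrays of shape (m, d); tasks_y: list of (m,) arrays."""
    d, n = tasks_X[0].shape[1], len(tasks_X)
    D = np.eye(d) / d                      # feasible start: tr(D) = 1
    W = np.zeros((d, n))
    for _ in range(n_iter):
        # W-step: min_w ||X_t w - y_t||^2 + gamma <w, D^{-1} w> per task.
        # Substituting w = D^{1/2} v turns it into plain ridge regression
        # and avoids inverting D explicitly.
        sqrtD = sqrtm_psd(D)
        for t in range(n):
            Xt = tasks_X[t] @ sqrtD
            v = np.linalg.solve(Xt.T @ Xt + gamma * np.eye(d), Xt.T @ tasks_y[t])
            W[:, t] = sqrtD @ v
        # D-step: closed form D = (W W^T + eps I)^{1/2} / tr(...); the eps
        # term keeps D positive definite, as in the perturbed problem.
        C = sqrtm_psd(W @ W.T + eps * np.eye(d))
        D = C / np.trace(C)
    return W, D
```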
Alternating Minimization (contd.)
- Each $w_t$ step is a regularization problem (e.g. SVM, ridge regression etc.)
- It does not require computation of the (pseudo)inverse of $D$
- Each $D$ step requires an SVD; this is usually the most costly step
Alternating Minimization (contd.)
- The algorithm (with some perturbation) converges to an optimal solution of
$$\min_{\substack{w_1, \dots, w_n \in \mathbb{R}^d \\ D \succ 0,\ \mathrm{tr}(D) \le 1}} \ \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma\, \mathrm{tr}\big(D^{-1}(W W^\top + \varepsilon I)\big) \qquad (R_\varepsilon)$$
- Theorem. The alternating algorithm for problem $(R_\varepsilon)$ has the property that its iterates $(W^{(k)}, D^{(k)})$ converge to the minimizer of $(R_\varepsilon)$ as $k \to \infty$.
- Theorem. Consider a sequence $\varepsilon_\ell \to 0^+$ and let $(W_\ell, D_\ell)$ be the minimizer of $(R_{\varepsilon_\ell})$. Then any limit point of the sequence $\{(W_\ell, D_\ell)\}$ is an optimal solution of the problem $(MTL)$.
- Note: the starting value of $D$ does not matter
Alternating Minimization (contd.)
[Figure: objective value vs. #iterations and runtime (seconds) vs. #tasks, comparing the alternating algorithm (green/blue) against gradient descent with learning rates $\eta = 0.05, 0.03, 0.01$]
- Compare computational cost with a gradient descent approach ($\eta :=$ learning rate)
Alternating Minimization (contd.)
- Typically fewer than 50 iterations needed in experiments
- At least an order of magnitude fewer iterations than gradient descent
(but cost per iteration is larger)
- Scales better with the number of tasks
- Both methods require SVD (costly if d is large)
- Alternative algorithms: SOCP methods [Srebro et al. 2005, Liu and
Vandenberghe 2008], gradient descent on matrix factors [Rennie & Srebro 2005], singular value thresholding [Cai et al. 2008]
Trace Norm Regularization
- Eliminating $D$ in optimization problem $(MTL)$ yields
$$\min_{W \in \mathbb{R}^{d \times n}} \ \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \|W\|_{\mathrm{tr}}^2 \qquad (TR)$$
The trace norm (or nuclear norm) $\|W\|_{\mathrm{tr}}$ is the sum of the singular values of $W$ (a small numerical check follows)
- There has been recent interest in trace norm / rank problems in matrix factorization, statistics, matrix completion etc. [Cai et al. 2008, Fazel et al. 2001, Izenman 1975, Liu and Vandenberghe 2008, Srebro et al. 2005]
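As a quick numerical illustration (ours, not from the slides), the trace norm can be computed from a singular value decomposition, and agrees with $\mathrm{tr}\,(W W^\top)^{1/2}$:

```python
# Illustrative check of two equivalent expressions for ||W||_tr; shapes arbitrary.
import numpy as np

W = np.random.randn(13, 180)                           # d features x n tasks
trace_norm = np.linalg.svd(W, compute_uv=False).sum()
lam = np.clip(np.linalg.eigvalsh(W @ W.T), 0.0, None)  # eigenvalues of W W^T
assert np.isclose(trace_norm, np.sqrt(lam).sum())
```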
Trace Norm vs. Rank
- Problem $(TR)$ is a convex relaxation of the problem
$$\min_{W \in \mathbb{R}^{d \times n}} \ \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma\, \mathrm{rank}(W)$$
- NP-hard problem (at least as hard as Boolean LP)
- Rank and trace norm correspond to $L_0$ and $L_1$ norms on the vector of singular values
- Multi-task intuition: we want the task parameter vectors $w_t$ to lie on a low dimensional subspace
Connection to Group Lasso
- Problem $(MTL)$ is equivalent to
$$\min_{\substack{A \in \mathbb{R}^{d \times n} \\ U \in \mathbb{R}^{d \times d},\ U^\top U = I}} \ \sum_{t=1}^n \sum_{i=1}^m E(\langle a_t, U^\top x_{ti} \rangle, y_{ti}) + \gamma \|A\|_{2,1}^2$$
where $\|A\|_{2,1} := \sum_{i=1}^d \sqrt{\sum_{t=1}^n a_{it}^2}$ (a one-line numpy rendering follows)
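To fix the indexing (rows of $A$ are features, columns are tasks), here is a one-line numpy rendering of this norm; the function name is illustrative:

```python
# ||A||_{2,1}: L2 norm across tasks for each feature row, then summed over rows.
import numpy as np

def norm_2_1(A):
    return np.sqrt((A ** 2).sum(axis=1)).sum()   # A has shape (d, n)
```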
Experiment (Computer Survey)
- Consumers’ ratings of products [Lenk et al. 1996]
- 180 persons (tasks)
- 8 PC models (training examples); 4 PC models (test examples)
- 13 binary input variables (RAM, CPU, price etc.) + bias term
- Integer output in $\{0, \dots, 10\}$ (likelihood of purchase)
- The square loss was used
Experiment (Computer Survey)
[Figure: test error vs. #tasks, and the eigenvalues of $D$, which show a single dominant eigenvalue]
- Performance improves with more tasks (for learning of the tasks independently, error = 16.53)
- A single most important feature shared by all persons
Experiment (Computer Survey)
[Figure: components of the dominant eigenvector $u_1$ over the input variables TE, RAM, SC, CPU, HD, CD, CA, CO, AV, WA, SW, GU, PR]

Method                             RMSE
Alternating Alg.                   1.93
Hierarchical Bayes [Lenk et al.]   1.90
Independent                        3.88
Aggregate                          2.35
Group Lasso                        2.01

- The most important feature (eigenvector of $D$) weighs technical characteristics (RAM, CPU, CD-ROM) vs. price
Spectral Regularization
- Generalize $(MTL)$:
$$\inf_{\substack{w_1, \dots, w_n \in \mathbb{R}^d \\ D \succ 0,\ \mathrm{tr}(D) \le 1}} \ \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma\, \mathrm{tr}(W^\top F(D) W)$$
where $F$ is the spectral matrix function induced by $f : (0, +\infty) \to (0, +\infty)$, that is, $F(U \Lambda U^\top) = U\, \mathrm{diag}[f(\lambda_1), \dots, f(\lambda_d)]\, U^\top$
- Eliminating $D$ yields a problem of the form
$$\min_{W \in \mathbb{R}^{d \times n}} \ \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma\, \Omega(W)$$
- It can be shown that $\Omega(W)$ is a function of the singular values of $W$
Spectral Regularization (contd.)
- In particular, if $f(\lambda) = \lambda^{1 - \frac{2}{p}}$, $p \in (0, 2]$, we have
$$\Omega(W) = \|W\|_p^2$$
where $\|W\|_p$ is the Schatten $L_p$ (pre)norm of the singular values of $W$ (sketched in code below)
- Theorem. The regularizer $\mathrm{tr}(W^\top F(D)\, W)$ is jointly convex if and only if $\frac{1}{f}$ is matrix concave of order $d$, that is,
$$\mu\, \tfrac{1}{F}(A) + (1 - \mu)\, \tfrac{1}{F}(B) \ \preceq\ \tfrac{1}{F}(\mu A + (1 - \mu) B)$$
for all $A, B \succ 0$, $\mu \in [0, 1]$
- Spectral problems appear also in graph applications, control theory etc.
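For concreteness, a small numpy sketch of the Schatten $L_p$ (pre)norm; the function name is illustrative:

```python
# Schatten Lp (pre)norm of W: the Lp norm of its singular value vector.
import numpy as np

def schatten_norm(W, p):
    s = np.linalg.svd(W, compute_uv=False)
    return (s ** p).sum() ** (1.0 / p)     # p = 1: trace norm; p = 2: Frobenius
```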
Learning Groups of Tasks
- Assume a heterogeneous environment, i.e. $K$ low dimensional subspaces
- Learn a partition of the tasks in $K$ groups:
$$\inf_{\substack{D_1, \dots, D_K \succ 0 \\ \mathrm{tr}(D_k) \le 1}} \ \sum_{t=1}^n\ \min_{k=1,\dots,K}\ \min_{w_t \in \mathbb{R}^d} \ \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \langle w_t, D_k^{-1} w_t \rangle$$
- The representation learned is $(\hat{D}_1, \dots, \hat{D}_K)$; we can transfer this representation to easily learn a new task
- Non-convex problem; we use stochastic gradient descent (a heuristic sketch of the group assignment step follows)
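To illustrate the inner $\min_k$, here is a hedged numpy sketch of a heuristic assignment step with the task vectors held fixed; the slides optimize the full non-convex objective by stochastic gradient descent, so this helper (and its name) is an illustrative simplification:

```python
# Assign each task t to the group k minimizing its penalty <w_t, D_k^{-1} w_t>,
# holding the task vectors W fixed.
import numpy as np

def assign_groups(W, Ds):
    """W: (d, n) matrix of task vectors; Ds: list of K positive definite (d, d)."""
    penalties = np.stack([
        np.einsum('dn,dn->n', W, np.linalg.solve(D, W)) for D in Ds
    ])                                     # shape (K, n)
    return penalties.argmin(axis=0)        # group index for each task
```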
Experiment (Character Recognition - Projection on Image Halves)
[Figure: a “6 vs. 1” task, with the relevant character on the right half of the image]
- Binary classification tasks on $28 \times 56$ images
- One half of the image contains the relevant character, the other half contains a randomly chosen character
- Two groups of tasks (with probabilities 50-50%): projection on the left half or projection on the right half
Experiment (Character Recognition - Projection on Image Halves)
- Training set contains pairs of alphabetic characters
- 1000 tasks, 10 examples per task
- Wish to obtain a representation that captures rotation invariance on either half of the image
- Wish to transfer this representation to pairs of digits
Experiment (Character Recognition - Projection on Image Halves)
Transfer error for different methods:

Independent   K = 1   K = 2
0.27          0.036   0.013

Assignment of tasks in groups (with K = 2):

                            D1       D2
All digits (left & right)   48.2%    51.8%
Left                        99.2%    0.8%
Right                       1.4%     98.6%
Training data               50.7%    49.3%
Experiment (Character Recognition - Projection on Image Halves)
[Figure: dominant eigenvectors of $D$ (for $K = 1$) and of $D_1$, $D_2$ (for $K = 2$)]
Experiment (Character Recognition - Projection on Image Halves)
[Figure: spectrum of $D$ learned with $K = 1$ (left); spectra of $D_1$ (middle) and $D_2$ (right) learned with $K = 2$]
Representer Theorems
- All previous formulations satisfy a multi-task representer theorem:
$$\hat{w}_t = \sum_{s=1}^n \sum_{i=1}^m c^{(t)}_{si}\, x_{si} \qquad \forall\, t \in \{1, \dots, n\} \qquad (R.T.)$$
Consequently, a nonlinear kernel can be used
- All tasks are involved in this expression (unlike the single-task representer theorem, which corresponds to Frobenius norm regularization)
- Generally, consider any matrix optimization problem of the form
$$\min_{w_1, \dots, w_n \in \mathbb{R}^d} \ \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \Omega(W)$$
Representer Theorems (contd.)
- Definitions:
$S^n_+$ = the positive semidefinite cone
The function $h : S^n_+ \to \mathbb{R}$ is matrix nondecreasing if $h(A) \le h(B)$ for all $A, B \in S^n_+$ such that $A \preceq B$
- Theorem. The representer theorem $(R.T.)$ holds if and only if there exists a matrix nondecreasing function $h : S^n_+ \to \mathbb{R}$ such that $\Omega(W) = h(W^\top W)$ for all $W \in \mathbb{R}^{d \times n}$ (under differentiability assumptions); a worked instance follows
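As a concrete check (our own worked instance, not from the slides): for squared trace norm regularization,
$$\Omega(W) = \|W\|_{\mathrm{tr}}^2 = \big(\mathrm{tr}\,(W^\top W)^{1/2}\big)^2 = h(W^\top W), \qquad h(A) = \big(\mathrm{tr}\, A^{1/2}\big)^2,$$
and $h$ is matrix nondecreasing, since $A \preceq B$ implies $A^{1/2} \preceq B^{1/2}$ and hence $\mathrm{tr}\, A^{1/2} \le \mathrm{tr}\, B^{1/2}$; so $(R.T.)$ holds for $(TR)$.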
Representer Theorems (contd.)
- Corollary. The standard representer theorem for single-task learning ($n = 1$),
$$\hat{w} = \sum_{i=1}^m c_i x_i,$$
holds if and only if there exists a nondecreasing function $h : \mathbb{R}_+ \to \mathbb{R}$ such that $\Omega(w) = h(\langle w, w \rangle)$ for all $w \in \mathbb{R}^d$
- Sufficiency of the condition has been known [Kimeldorf & Wahba, 1970, Schölkopf et al., 2001 etc.]
Implications
- “Kernelization”
- In single-task learning, the choice of $h$ does not matter essentially
- However, in multi-task learning, the choice of $h$ is important (since $\preceq$ is only a partial ordering)
- Many valid regularizers: Schatten $L_p$ norms $\|\cdot\|_p$, rank, orthogonally invariant norms, norms of the type $W \mapsto \|W M\|_p$ etc.
- In matrix learning, kernels and sparsity can be exploited in the same model
Connection to Learning the Kernel
- Recall problem $(MTL)$:
$$\min_{D \succ 0,\ \mathrm{tr}(D) \le 1} \ \min_{w_1, \dots, w_n \in \mathbb{R}^d} \ \sum_{t=1}^n \Big[ \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \langle w_t, D^{-1} w_t \rangle \Big] \qquad (MTL)$$
$\iff$ learning a common kernel $K(x, x') = \langle x, D x' \rangle$ within the convex hull of an infinite number of linear kernels
- Extends the formulation of [Lanckriet et al. 2004] (single task),
$$\min_{K \in \mathcal{K}} \ \min_{c \in \mathbb{R}^m} \ \sum_{i=1}^m E\big((Kc)_i, y_i\big) + \gamma \langle c, Kc \rangle,$$
in which $\mathcal{K}$ was a polytope (convex hull of a finite set of kernels)
A General Framework for Learning the Kernel
- The convex set $\mathcal{K}$ is generated by basic kernels: $\mathcal{K} = \mathrm{conv}(\mathcal{B})$
- Example 1: finite set of basic kernels (aka “multiple kernel learning”)
- Example 2: linear basic kernels $B(x, x') = \langle x, D x' \rangle$, where $D$ belongs to a bounded, convex set (e.g. $(MTL)$)
- Example 3: Gaussian basic kernels $B(x, x') = e^{-\langle x - x',\, \Sigma^{-1}(x - x') \rangle}$, where $\Sigma$ belongs to a convex subset of the p.s.d. cone (sketched in code below)
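A minimal sketch of Example 3 for the isotropic special case $\Sigma = \sigma I$, where the exponent reduces to $\|x - x'\|^2 / \sigma$; the helper name is illustrative:

```python
# Gaussian Gram matrix for Sigma = sigma * I: B(x, x') = exp(-||x - x'||^2 / sigma).
import numpy as np

def gaussian_gram(X, sigma):
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / sigma)       # X has shape (m, d)
```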
Learning the Kernel and Structured Sparsity
- Interpretation of LTK in the feature space [Bach et al. 2004, Micchelli & Pontil 2005]:
$$\min_{v_1, \dots, v_N} \ \sum_{i=1}^m E\Big(\sum_{j=1}^N \langle v_j, \Phi_j(x_i) \rangle,\ y_i\Big) + \gamma \Big(\sum_{j=1}^N \|v_j\|_2\Big)^2$$
- Group Lasso in the feature space
- The $\|\cdot\|_{2,1}$ norm tends to favor a small number of feature maps / kernels in the solution
Why Learn Kernels in Convex Sets?
- Data fusion (e.g. in bioinformatics)
- Kernel hyperparameter learning (e.g. Gaussian kernel parameters)
- Multi-task learning
- Metric learning
- Semi-supervised learning (learning the graph)
- Efficient alternative to cross validation; exploits the power of convex optimization
Properties of the Solution of LTK
- Formulation for learning the kernel:
$$\min_{K \in \mathcal{K}} \ \min_{c \in \mathbb{R}^m} \ \sum_{i=1}^m E\big((Kc)_i, y_i\big) + \gamma \langle c, Kc \rangle \qquad (LTK)$$
where $\mathcal{K} = \mathrm{conv}(\mathcal{B})$
- I.e. solutions of $(LTK)$ are of the form $\hat{K} = \sum_{i=1}^N \hat{\lambda}_i \hat{B}_i$, where $\hat{\lambda}_i \ge 0$, $\sum_{i=1}^N \hat{\lambda}_i = 1$, $\hat{B}_i \in \mathcal{B}$
- Any solution $(\hat{c}, \hat{K})$ of $(LTK)$ is a saddle point of a minimax problem
Properties of the Solution of LTK (contd.)
- Theorem. $(\hat{c}, \hat{K})$ solves $(LTK)$ if and only if
1. $\langle \hat{c}, \hat{B}_i \hat{c} \rangle = \max_{B \in \mathcal{B}} \langle \hat{c}, B \hat{c} \rangle$ for $i = 1, \dots, N$
2. $\hat{c}$ is the solution to $\min_{c \in \mathbb{R}^m} \ \sum_{i=1}^m E\big((\hat{K}c)_i, y_i\big) + \gamma \langle c, \hat{K}c \rangle$
- Moreover, there exists a solution involving at most $m + 1$ kernels: $\hat{K} = \sum_{i=1}^{m+1} \hat{\lambda}_i \hat{B}_i$
A General Algorithm for Learning the Kernel
- Incrementally builds an estimate of the solution, $K^{(k)} = \sum_{i=1}^k \lambda_i K^{(i)}$ (a numpy sketch follows)

Initialization: given an initial kernel $K^{(1)}$ in the convex set $\mathcal{K}$
while convergence condition is not true do
  1. Compute $\hat{c} = \underset{c \in \mathbb{R}^m}{\mathrm{argmin}} \ \sum_{i=1}^m E\big((K^{(k)} c)_i, y_i\big) + \gamma \langle c, K^{(k)} c \rangle$
  2. Find a basic kernel $\hat{B} \in \underset{B \in \mathcal{B}}{\mathrm{argmax}} \ \langle \hat{c}, B \hat{c} \rangle$
  3. Compute $K^{(k+1)}$ as the optimal convex combination of $\hat{B}$ and $K^{(k)}$
end while
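A minimal sketch of this loop for a finite family $\mathcal{B}$ (Example 1) and the square loss, so that step 1 has a closed form; the grid search in step 3 is a coarse stand-in for the exact one-dimensional convex minimization, and all names are illustrative:

```python
import numpy as np

def _objective(K, y, gamma):
    # Square-loss value at the closed-form minimizer c = (K + gamma I)^{-1} y.
    c = np.linalg.solve(K + gamma * np.eye(len(y)), y)
    return float((K @ c - y) @ (K @ c - y) + gamma * (c @ K @ c))

def greedy_ltk(basic_kernels, y, gamma=0.1, n_iter=20):
    """basic_kernels: list of (m, m) Gram matrices; y: (m,) targets."""
    K = basic_kernels[0]                   # any initial kernel in conv(B)
    for _ in range(n_iter):
        # Step 1: regularization with the current kernel (closed form here).
        c = np.linalg.solve(K + gamma * np.eye(len(y)), y)
        # Step 2: the basic kernel maximizing <c, B c>.
        B_hat = max(basic_kernels, key=lambda B: c @ B @ c)
        # Step 3: best convex combination of K and B_hat, found on a grid.
        K = min(((1 - a) * K + a * B_hat for a in np.linspace(0.0, 1.0, 21)),
                key=lambda Kmix: _objective(Kmix, y, gamma))
    return K
```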
A General Algorithm for Learning the Kernel (contd.)
- Theorem. There exists a limit point of the LTK algorithm and any
limit point is a solution of (LT K).
- Step 1 is standard regularization (SVM, ridge regression etc.)
- Step 2 is the hardest: tractable for e.g. finite B or MTL;
non-convex for e.g. Gaussian kernels
- Step 3 is convex in one variable
(requires solving a few regularization problems)
- Some non-convex cases (e.g. few-parameter Gaussians) are solvable
Character Recognition Experiments
- MNIST classification of digits.
- 1st experiment: Gaussian basic kernels with one parameter σ.
- Compared with:
– continuously parameterized kernels + local search
– finite grid of basic kernels
– SVM
- 2nd experiment: Gaussian basic kernels with two parameters σ1, σ2
(left, right halves of images).
- Compared with varying finite grids.
Character Recognition Experiments (contd.)
Test error, 1st experiment (one kernel parameter $\sigma$):

          σ ∈ [75, 25000]           σ ∈ [100, 10000]          σ ∈ [500, 5000]
Task      LTK  local finite SVM     LTK  local finite SVM     LTK  local finite SVM
odd-even  6.5  6.6   18.0   11.8    6.5  6.6   10.9   8.6     6.5  6.5   6.7    6.9
3 vs. 8   3.7  3.8   6.9    6.0     3.9  3.8   4.9    5.1     3.6  3.8   3.7    3.8
4 vs. 7   2.7  2.5   4.2    2.8     2.4  2.5   2.7    2.6     2.3  2.5   2.6    2.3

Test error, 2nd experiment (two kernel parameters $\sigma_1, \sigma_2$):

          σ ∈ [75, 25000]      σ ∈ [100, 10000]     σ ∈ [500, 5000]
Task      LTK  5×5   10×10     LTK  5×5   10×10     LTK  5×5   10×10
odd-even  5.8  15.8  11.2      5.8  10.1  6.2       5.8  6.8   5.8
3 vs. 8   2.7  6.5   5.1       2.5  4.6   2.5       2.6  3.5   2.5
4 vs. 7   1.8  3.9   2.9       1.7  2.7   2.0       1.8  2.0   1.8

- Using two parameters improves performance
- Continuous vs. finite grid:
– more efficient
– robust wrt. parameter range
– does not overfit
Character Recognition Experiments (contd.)
[Figure: learned kernel coefficients. From left to right: odd vs. even (1 and 2 kernel params.), 3 vs. 8 and 4 vs. 7 (2 params.)]
Learning the Graph in Semi-Supervised Learning
- We can also apply LTK when there are few labeled data points
- A number of graphs are given, e.g. generated using k-NN, exponential
decay weights etc.
- To exploit the structure of each graph, the Gram matrices $B_i$ are taken to be the pseudoinverses of the graph Laplacians (a short sketch follows)
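A two-line sketch of constructing such a basic Gram matrix, assuming a symmetric adjacency matrix over the labeled and unlabeled points; names are illustrative:

```python
# Basic Gram matrix from a graph: pseudoinverse of the combinatorial Laplacian.
import numpy as np

def laplacian_pinv_kernel(adjacency):
    L = np.diag(adjacency.sum(axis=1)) - adjacency   # degree matrix - adjacency
    return np.linalg.pinv(L)                         # one B_i for the LTK set B
```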
Conclusion
- Multi-task learning is ubiquitous; exploiting task relatedness can
enhance learning performance significantly
- Proposed an alternating algorithm to learn tasks that lie on a common
subspace; this algorithm is simple and efficient
- Nonlinear kernels can be introduced via a multi-task representer
theorem; also proposed spectral and non-convex matrix learning
- Multi-task learning can be viewed as an instance of learning
combinations of infinite kernels
- Generally, we can use a greedy incremental algorithm to learn
combinations of (finite or infinite) kernels
Future Work
- Convergence rates for the algorithms proposed
- Task relatedness can be of many types (hierarchical features, input
transformation invariances etc.)
- Convex relaxation techniques for the non-convex formulations presented
- Algorithms and representer theorems for special types of norms
- Related problems (sparse coding, structured output, metric learning
etc.)