SLIDE 1

Multi-Task Learning and Matrix Regularization

Andreas Argyriou
Department of Computer Science
University College London

SLIDE 2

Collaborators

  • T. Evgeniou (INSEAD)
  • R. Hauser (University of Oxford)
  • M. Herbster (University College London)
  • A. Maurer (Stemmer Imaging)
  • C.A. Micchelli (SUNY Albany)
  • M. Pontil (University College London)
  • Y. Ying (University of Bristol)

SLIDE 3

Main Themes

  • Machine learning
  • Convex optimization
  • Sparse recovery

SLIDE 4

Outline

  • Multi-task learning and related problems
  • Matrix learning and an alternating algorithm
  • Extensions of the method
  • Multi-task representer theorems
  • Kernel hyperparameter learning; convex kernel learning

SLIDE 5

Supervised Learning (Single-Task)

  • m examples are given: (x1, y1), . . . , (xm, ym) ∈ X × Y
  • Predict using a function f : X → Y
  • Want the function to generalize well over the whole of X × Y
  • Includes regression, classification etc.
  • Task = probability measure on X × Y

SLIDE 6

Multi-Task Learning

  • Tasks t = 1, . . . , n
  • m examples per task are given: (xt1, yt1), . . . , (xtm, ytm) ∈ X × Y
  • Predict using functions ft : X → Y , t = 1, . . . , n
  • When the tasks are related, learning the tasks jointly should perform better than learning each task independently
  • Especially important when few data points are available per task (small m); in such cases, independent learning is not successful

SLIDE 7

Multi-Task Learning (contd.)

  • One goal is to learn what structure is common across the n tasks
  • Want simple, interpretable models that can explain multiple tasks
  • Want good generalization on the n given tasks but also on new tasks (transfer learning)
  • Given a few examples from a new task t′, {(xt′1, yt′1), . . . , (xt′ℓ, yt′ℓ)}, want to learn ft′ using just the learned task structure

SLIDE 8

Learning Theoretic View: Environment of Tasks

  • Environment = probability distribution on a set of learning tasks [Baxter, 1996]
  • To sample a task-specific sample from the environment
    – draw a function ft from the environment
    – generate a sample {(xt1, yt1), . . . , (xtm, ytm)} ∈ (X × Y)^m using ft

  • Multi-task learning means learning a common hypothesis space

SLIDE 9

Learning Theoretic View (contd.)

  • Baxter’s results:
    – As n (#tasks) increases, m (#examples per task needed) decreases as O(1/n)
    – Once we have learned a hypothesis space H, we can use it to learn a new task drawn from the same environment; the sample complexity depends on the log-capacity of H
  • Other results:
    – Task relatedness due to input transformations: improved multi-task bounds in some cases [Ben-David & Schuller, 2003]
    – Using common feature maps (bounded linear operators): error bounds depend on the Hilbert–Schmidt norm [Maurer, 2006]

SLIDE 10

Multi-Task Applications

  • Multi-task learning is ubiquitous
  • Human intelligence relies on transferring learned knowledge from previous tasks to new tasks
  • E.g. character recognition (very few examples should be needed to recognize new characters)

  • Integration of medical / bioinformatics databases

SLIDE 11

Multi-Task Applications (contd.)

  • Marketing databases, collaborative filtering, recommendation systems (e.g. Netflix); task = product preferences for each person

SLIDE 12

Multi-Task Applications (contd.)

  • Multiple object classification in scenes: an image may contain multiple objects; learning common visual features enhances performance

SLIDE 13

Related Problems

  • Sparse coding (some images share common basis images)
  • Vector-valued / structured output
  • Multi-class problems
  • Regression with grouped variables, multifactor ANOVA in statistics (selection of groups of variables)
  • Multi-task learning is a broad problem; no single method can solve everything

SLIDE 14

Learning Multiple Tasks with a Common Kernel

  • Let $X \subseteq \mathbb{R}^d$, $Y \subseteq \mathbb{R}$ and let us learn $n$ linear functions $f_t(x) = \langle w_t, x \rangle$, $t = 1, \ldots, n$ (we ignore nonlinearities for the moment)
  • Want to impose common structure / relatedness across tasks
  • Idea: use a common linear kernel for all tasks
$$K(x, x') = \langle x, D x' \rangle \qquad \text{where } D \succ 0$$

SLIDE 15

Learning Multiple Tasks with a Common Kernel

  • For every $t = 1, \ldots, n$ solve
$$\min_{w_t \in \mathbb{R}^d} \; \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \langle w_t, D^{-1} w_t \rangle$$
  • Adding up, we obtain the equivalent problem
$$\min_{w_1, \ldots, w_n \in \mathbb{R}^d} \; \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \sum_{t=1}^n \langle w_t, D^{-1} w_t \rangle$$
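For the square loss, each of the per-task problems above has a closed form. The following is a minimal numpy sketch (the data, the shapes and the helper name task_step are illustrative assumptions, not from the slides): it solves (XᵀX + γD⁻¹)w = Xᵀy for one task.

```python
import numpy as np

def task_step(X_t, y_t, D, gamma):
    """Minimize sum_i (<w, x_ti> - y_ti)^2 + gamma <w, D^{-1} w> (square loss assumed).

    Setting the gradient to zero gives w = (X^T X + gamma D^{-1})^{-1} X^T y.
    X_t: (m, d) inputs for task t, y_t: (m,) outputs, D: (d, d) positive definite.
    """
    A = X_t.T @ X_t + gamma * np.linalg.inv(D)
    return np.linalg.solve(A, X_t.T @ y_t)

# Toy usage with a common D shared by all tasks (here D = I/d)
rng = np.random.default_rng(0)
m, d, n = 5, 10, 3
D = np.eye(d) / d
W = np.column_stack([task_step(rng.standard_normal((m, d)),
                               rng.standard_normal(m), D, gamma=0.1)
                     for _ in range(n)])
print(W.shape)  # (d, n), one column per task
```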

SLIDE 16

Learning Multiple Tasks with a Common Kernel

  • For multi-task learning, we want to learn the common kernel from a convex set of kernels:
$$\inf_{\substack{w_1, \ldots, w_n \in \mathbb{R}^d \\ D \succ 0, \; \mathrm{tr}(D) \le 1}} \; \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \mathrm{tr}(W^\top D^{-1} W) \qquad (MTL)$$
where $\mathrm{tr}(W^\top D^{-1} W) = \sum_{t=1}^n \langle w_t, D^{-1} w_t \rangle$
  • We denote $W = [\, w_1 \; \cdots \; w_n \,]$

SLIDE 17

Learning Multiple Tasks with a Common Kernel

  • Jointly convex problem in (W, D)
  • The constraint tr(D) ≤ 1 is important
  • Fixing $W$, the optimal $D(W)$ is
$$D(W) \propto (WW^\top)^{\frac{1}{2}}$$
($D(W)$ is usually not in the feasible set because of the inf)
  • Once we have learned $\hat{D}$, we can transfer it to the learning of a new task $t'$:
$$\min_{w \in \mathbb{R}^d} \; \sum_{i=1}^m E(\langle w, x_{t'i} \rangle, y_{t'i}) + \gamma \, \langle w, \hat{D}^{-1} w \rangle$$

SLIDE 18

Alternating Minimization Algorithm

  • Alternating minimization over W (supervised learning) and D (unsupervised “correlation” of tasks)

Initialization: set $D = \frac{1}{d} I_{d \times d}$
while convergence condition is not true do
    for t = 1, . . . , n: learn $w_t$ independently by minimizing $\sum_{i=1}^m E(\langle w, x_{ti} \rangle, y_{ti}) + \gamma \, \langle w, D^{-1} w \rangle$
    end for
    set $D = (WW^\top)^{\frac{1}{2}} \big/ \mathrm{tr}\big((WW^\top)^{\frac{1}{2}}\big)$
end while
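Below is a runnable sketch of this alternating scheme, under the assumption of the square loss for E and on synthetic data (the function name alternating_mtl and all shapes are made up for illustration). The small ε added before the matrix square root anticipates the perturbed problem discussed on the next slide.

```python
import numpy as np
from scipy.linalg import sqrtm

def alternating_mtl(X, Y, gamma=0.1, eps=1e-6, iters=50):
    """Alternate between the per-task w-step (square loss assumed) and the D-step.

    X: (n, m, d) inputs, Y: (n, m) outputs; returns W (d, n) and D (d, d).
    """
    n, m, d = X.shape
    D = np.eye(d) / d                                   # initialization: D = I/d
    for _ in range(iters):
        # w-step: ridge regression with penalty <w, D^{-1} w> for each task
        D_inv = np.linalg.inv(D)
        W = np.column_stack([
            np.linalg.solve(X[t].T @ X[t] + gamma * D_inv, X[t].T @ Y[t])
            for t in range(n)])
        # D-step: D = (W W^T + eps I)^{1/2} / tr((W W^T + eps I)^{1/2})
        C = np.real(sqrtm(W @ W.T + eps * np.eye(d)))
        D = C / np.trace(C)
    return W, D

# Toy data: all tasks share a one-dimensional subspace spanned by u
rng = np.random.default_rng(0)
n, m, d = 5, 20, 10
u = rng.standard_normal(d)
X = rng.standard_normal((n, m, d))
Y = np.stack([X[t] @ (rng.standard_normal() * u) for t in range(n)])
W, D = alternating_mtl(X, Y)
print(np.round(np.linalg.eigvalsh(D)[-3:], 3))  # spectrum of D concentrates on one direction
```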

SLIDE 19

Alternating Minimization (contd.)

  • Each wt step is a regularization problem (e.g. SVM, ridge regression etc.)

  • It does not require computation of the (pseudo)inverse of D
  • Each D step requires an SVD; this is usually the most costly step

SLIDE 20

Alternating Minimization (contd.)

  • The algorithm (with some perturbation) converges to an optimal solution of
$$\min_{\substack{w_1, \ldots, w_n \in \mathbb{R}^d \\ D \succ 0, \; \mathrm{tr}(D) \le 1}} \; \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \mathrm{tr}\big(D^{-1}(WW^\top + \varepsilon I)\big) \qquad (R_\varepsilon)$$
  • Theorem. An alternating algorithm for problem $(R_\varepsilon)$ has the property that its iterates $(W^{(k)}, D^{(k)})$ converge to the minimizer of $(R_\varepsilon)$ as $k \to \infty$.
  • Theorem. Consider a sequence $\varepsilon_\ell \to 0^+$ and let $(W_\ell, D_\ell)$ be the minimizer of $(R_{\varepsilon_\ell})$. Then any limit point of the sequence $\{(W_\ell, D_\ell)\}$ is an optimal solution of the problem $(MTL)$.
  • Note: the starting value of D does not matter

SLIDE 21

Alternating Minimization (contd.)

[Plots: objective function vs. #iterations for the alternating algorithm and for gradient descent with η = 0.05, 0.03, 0.01; running time in seconds vs. #tasks for the alternating algorithm and for gradient descent with η = 0.05]

  • Compare computational cost with a gradient descent approach (η := learning rate)

SLIDE 22

Alternating Minimization (contd.)

  • Typically fewer than 50 iterations needed in experiments
  • At least an order of magnitude fewer iterations than gradient descent (but cost per iteration is larger)
  • Scales better with the number of tasks
  • Both methods require SVD (costly if d is large)
  • Alternative algorithms: SOCP methods [Srebro et al. 2005, Liu and Vandenberghe 2008], gradient descent on matrix factors [Rennie & Srebro 2005], singular value thresholding [Cai et al. 2008]

SLIDE 23

Trace Norm Regularization

  • Eliminating D in optimization problem $(MTL)$ yields
$$\min_{W \in \mathbb{R}^{d \times n}} \; \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \|W\|_{\mathrm{tr}}^2 \qquad (TR)$$
The trace norm (or nuclear norm) $\|W\|_{\mathrm{tr}}$ is the sum of the singular values of $W$
  • There has been recent interest in trace norm / rank problems in matrix factorization, statistics, matrix completion etc. [Cai et al. 2008, Fazel et al. 2001, Izenman 1975, Liu and Vandenberghe 2008, Srebro et al. 2005]
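As a quick numerical sanity check of this elimination (a sketch with a hypothetical random W, not an experiment from the talk): the trace norm is the sum of the singular values, and plugging the optimal D(W) from the earlier slide into tr(WᵀD⁻¹W) reproduces ‖W‖_tr².

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
d, n = 4, 6
W = rng.standard_normal((d, n))

trace_norm = np.linalg.svd(W, compute_uv=False).sum()    # sum of singular values

# Optimal D for this W (cf. D(W) on the earlier slide); feasible since tr(D) = 1
C = np.real(sqrtm(W @ W.T))
D = C / np.trace(C)
penalty = np.trace(W.T @ np.linalg.inv(D) @ W)            # tr(W^T D^{-1} W)

print(np.isclose(penalty, trace_norm ** 2))                # True: equals ||W||_tr^2
```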

SLIDE 24

Trace Norm vs. Rank

  • Problem $(TR)$ is a convex relaxation of the problem
$$\min_{W \in \mathbb{R}^{d \times n}} \; \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \mathrm{rank}(W)$$
  • NP-hard problem (at least as hard as Boolean LP)
  • Rank and trace norm correspond to L0, L1 on the vector of singular values
  • Multi-task intuition: we want the task parameter vectors wt to lie on a low dimensional subspace

SLIDE 25

Connection to Group Lasso

  • Problem $(MTL)$ is equivalent to
$$\min_{\substack{A \in \mathbb{R}^{d \times n} \\ U \in \mathbb{R}^{d \times d}, \; U^\top U = I}} \; \sum_{t=1}^n \sum_{i=1}^m E(\langle a_t, U^\top x_{ti} \rangle, y_{ti}) + \gamma \, \|A\|_{2,1}^2$$
where $\|A\|_{2,1} := \sum_{i=1}^d \sqrt{\sum_{t=1}^n a_{it}^2}$

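A minimal sketch of the ‖·‖_{2,1} penalty above (the matrix A here is a made-up example; rows index features, columns index tasks): each row contributes its Euclidean length, so a feature that no task needs is driven to zero for all tasks jointly.

```python
import numpy as np

def norm_2_1(A):
    """||A||_{2,1} = sum over rows i of sqrt(sum over tasks t of A[i, t]^2)."""
    return np.sqrt((A ** 2).sum(axis=1)).sum()

A = np.array([[1.0, 2.0],
              [0.0, 0.0],
              [3.0, 4.0]])
print(norm_2_1(A))  # |(1,2)| + |(0,0)| + |(3,4)| = sqrt(5) + 0 + 5
```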

SLIDE 26

Experiment (Computer Survey)

  • Consumers’ ratings of products [Lenk et al. 1996]
  • 180 persons (tasks)
  • 8 PC models (training examples); 4 PC models (test examples)
  • 13 binary input variables (RAM, CPU, price etc.) + bias term
  • Integer output in {0, . . . , 10} (likelihood of purchase)
  • The square loss was used

SLIDE 27

Experiment (Computer Survey)

[Plots: test error vs. #tasks; eigenvalues of D]

  • Performance improves with more tasks (for learning of the tasks independently, error = 16.53)
  • A single most important feature shared by all persons

SLIDE 28

Experiment (Computer Survey)

[Bar plot: components of the dominant eigenvector u1 of D over the input features TE, RAM, SC, CPU, HD, CD, CA, CO, AV, WA, SW, GU, PR]

Method                              RMSE
Alternating Alg.                    1.93
Hierarchical Bayes [Lenk et al.]    1.90
Independent                         3.88
Aggregate                           2.35
Group Lasso                         2.01

  • The most important feature (eigenvector of D) weighs technical characteristics (RAM, CPU, CD-ROM) vs. price

SLIDE 29

Spectral Regularization

  • Generalize $(MTL)$:
$$\inf_{\substack{w_1, \ldots, w_n \in \mathbb{R}^d \\ D \succ 0, \; \mathrm{tr}(D) \le 1}} \; \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \mathrm{tr}(W^\top F(D) W)$$
where $F$ is a spectral matrix function: $f : (0, +\infty) \to (0, +\infty)$, $F(U \Lambda U^\top) = U \, \mathrm{diag}[f(\lambda_1), \ldots, f(\lambda_d)] \, U^\top$
  • or
$$\min_{W \in \mathbb{R}^{d \times n}} \; \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \Omega(W)$$
  • It can be shown that Ω(W) is a function of the singular values of W
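A short sketch (numpy; the helper name spectral_function is an assumption) of applying a spectral matrix function F to a symmetric positive definite D via its eigendecomposition, as defined above; choosing f(λ) = 1/λ recovers F(D) = D⁻¹ and hence the (MTL) regularizer.

```python
import numpy as np

def spectral_function(D, f):
    """F(U diag(lambda) U^T) = U diag(f(lambda)) U^T for symmetric D."""
    lam, U = np.linalg.eigh(D)
    return U @ np.diag(f(lam)) @ U.T

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
D = A @ A.T + np.eye(4)                    # positive definite

F_D = spectral_function(D, lambda lam: 1.0 / lam)
print(np.allclose(F_D, np.linalg.inv(D)))  # True: f(x) = 1/x gives F(D) = D^{-1}
```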

SLIDE 30

Spectral Regularization (contd.)

  • In particular, if $f(\lambda) = \lambda^{1 - \frac{2}{p}}$, $p \in (0, 2]$, we have
$$\Omega(W) = \|W\|_p^2$$
where $\|W\|_p$ is the Schatten $L_p$ (pre)norm of the singular values of $W$
  • Theorem. The regularizer $\mathrm{tr}(W^\top F(D) W)$ is jointly convex if and only if $\frac{1}{f}$ is matrix concave of order $d$, that is,
$$\mu \, \tfrac{1}{F}(A) + (1 - \mu) \, \tfrac{1}{F}(B) \preceq \tfrac{1}{F}(\mu A + (1 - \mu) B)$$
for all $A, B \succ 0$, $\mu \in [0, 1]$
  • Spectral problems appear also in graph applications, control theory etc.
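A small sketch of the Schatten L_p (pre)norm used above (the matrix W is a made-up example): it is simply the vector p-norm of the singular values, so p = 1 gives the trace norm and p = 2 the Frobenius norm.

```python
import numpy as np

def schatten(W, p):
    """Schatten L_p (pre)norm: p-norm of the vector of singular values of W."""
    s = np.linalg.svd(W, compute_uv=False)
    return (s ** p).sum() ** (1.0 / p)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))
print(np.isclose(schatten(W, 1), np.linalg.svd(W, compute_uv=False).sum()))  # trace norm
print(np.isclose(schatten(W, 2), np.linalg.norm(W, 'fro')))                  # Frobenius norm
```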

SLIDE 31

Learning Groups of Tasks

  • Assume heterogeneous environment, i.e. K low dimensional subspaces
  • Learn a partition of tasks in K groups

$$\inf_{\substack{D_1, \ldots, D_K \succ 0 \\ \mathrm{tr}(D_k) \le 1}} \; \sum_{t=1}^n \; \min_{k = 1, \ldots, K} \; \min_{w_t \in \mathbb{R}^d} \; \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \langle w_t, D_k^{-1} w_t \rangle$$
  • The representation learned is $(\hat{D}_1, \ldots, \hat{D}_K)$; we can transfer this representation to easily learn a new task
  • Non-convex problem; we use stochastic gradient descent

SLIDE 32

Experiment (Character Recognition - Projection on Image Halves)

6 vs. 1 task (on the right half)

  • Binary classification tasks on 28 × 56 images
  • One half of the image contains the relevant character, the other half contains a randomly chosen character
  • Two groups of tasks (with probabilities 50–50%): projection on the left half or projection on the right half

SLIDE 33

Experiment (Character Recognition - Projection on Image Halves)

  • Training set contains pairs of alphabetic characters
  • 1000 tasks, 10 examples per task
  • Wish to obtain a representation that captures rotation invariance on either half of the image
  • Wish to transfer this representation to pairs of digits

SLIDE 34

Experiment (Character Recognition - Projection on Image Halves)

Transfer error for different methods:

    Independent   K = 1   K = 2
    0.27          0.036   0.013

Assignment of tasks to groups (with K = 2):

                                 D1       D2
    All digits (left & right)    48.2%    51.8%
    Left                         99.2%     0.8%
    Right                         1.4%    98.6%
    Training data                50.7%    49.3%

SLIDE 35

Experiment (Character Recognition - Projection on Image Halves)

[Images: dominant eigenvectors of D (for K = 1) and of D1, D2 (for K = 2)]

SLIDE 36

Experiment (Character Recognition - Projection on Image Halves)

[Plots: spectrum of D learned with K = 1 (left); spectra of D1 (middle) and D2 (right) learned with K = 2]

SLIDE 37

Representer Theorems

  • All previous formulations satisfy a multi-task representer theorem
$$\hat{w}_t = \sum_{s=1}^n \sum_{i=1}^m c^{(t)}_{si} \, x_{si} \qquad \forall \, t \in \{1, \ldots, n\} \qquad (R.T.)$$
Consequently, a nonlinear kernel can be used
  • All tasks are involved in this expression (unlike the single-task representer theorem ⇔ Frobenius norm regularization)
  • Generally, consider any matrix optimization problem of the form
$$\min_{w_1, \ldots, w_n \in \mathbb{R}^d} \; \sum_{t=1}^n \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \Omega(W)$$
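A numpy sketch of the expansion (R.T.) above with made-up shapes: each task's weight vector is a linear combination of the training inputs of all tasks, which is what allows the inner products to be replaced by a nonlinear kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 3, 5, 8
X = rng.standard_normal((n, m, d))      # X[s, i] = input x_si (example i of task s)
C = rng.standard_normal((n, n, m))      # C[t, s, i] = coefficient c^{(t)}_{si}

# w_t = sum_s sum_i c^{(t)}_{si} x_{si}, for every task t
W = np.einsum('tsi,sid->td', C, X).T    # shape (d, n), one column per task
print(W.shape)
```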

SLIDE 38

Representer Theorems (contd.)

  • Definitions:
$\mathbb{S}^n_+$ = the positive semidefinite cone
The function $h : \mathbb{S}^n_+ \to \mathbb{R}$ is matrix nondecreasing if $h(A) \le h(B)$ for all $A, B \in \mathbb{S}^n_+$ such that $A \preceq B$
  • Theorem. The representer theorem $(R.T.)$ holds if and only if there exists a matrix nondecreasing function $h : \mathbb{S}^n_+ \to \mathbb{R}$ such that $\Omega(W) = h(W^\top W)$ for all $W \in \mathbb{R}^{d \times n}$ (under differentiability assumptions)

SLIDE 39

Representer Theorems (contd.)

  • Corollary. The standard representer theorem for single-task learning (n = 1),
$$\hat{w} = \sum_{i=1}^m c_i x_i \, ,$$
holds if and only if there exists a nondecreasing function $h : \mathbb{R}_+ \to \mathbb{R}$ such that $\Omega(w) = h(\langle w, w \rangle)$ for all $w \in \mathbb{R}^d$
  • Sufficiency of the condition has been known [Kimeldorf & Wahba, 1970, Schölkopf et al., 2001 etc.]

SLIDE 40

Implications

  • “Kernelization”
  • In single-task learning, the choice of h does not matter essentially
  • However, in multi-task learning, the choice of h is important (since $\preceq$ is a partial ordering)
  • Many valid regularizers: Schatten $L_p$ norms $\|\cdot\|_p$, rank, orthogonally invariant norms, norms of the type $W \mapsto \|WM\|_p$ etc.
  • In matrix learning, kernels and sparsity can be exploited in the same model

SLIDE 41

Connection to Learning the Kernel

  • Recall problem $(MTL)$
$$\min_{D \succ 0, \; \mathrm{tr}(D) \le 1} \; \min_{w_1, \ldots, w_n \in \mathbb{R}^d} \; \sum_{t=1}^n \left[ \sum_{i=1}^m E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \, \langle w_t, D^{-1} w_t \rangle \right] \qquad (MTL)$$
$\Longleftrightarrow$ learning a common kernel $K(x, x') = \langle x, Dx' \rangle$ within the convex hull of an infinite number of linear kernels
  • Extends the formulation of [Lanckriet et al. 2004] (single task)
$$\min_{K \in \mathcal{K}} \; \min_{c \in \mathbb{R}^m} \; \sum_{i=1}^m E\big((Kc)_i, y_i\big) + \gamma \, \langle c, Kc \rangle$$
in which $\mathcal{K}$ was a polytope (convex hull of a finite set of kernels)

SLIDE 42

A General Framework for Learning the Kernel

  • Convex set K is generated by basic kernels: K = conv(B)
  • Example 1: Finite set of basic kernels (aka “multiple kernel learning”)
  • Example 2: Linear basic kernels
$$B(x, x') = \langle x, Dx' \rangle \qquad \text{where } D \text{ belongs to a bounded, convex set (e.g. } (MTL)\text{)}$$
  • Example 3: Gaussian basic kernels
$$B(x, x') = e^{-\langle x - x', \, \Sigma^{-1}(x - x') \rangle} \qquad \text{where } \Sigma \text{ belongs to a convex subset of the p.s.d. cone}$$
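A sketch of Example 3 (the data and the particular Σ are made-up assumptions): building the Gram matrix of a single Gaussian basic kernel B(x, x′) = exp(−⟨x − x′, Σ⁻¹(x − x′)⟩).

```python
import numpy as np

def gaussian_gram(X, Sigma):
    """Gram matrix B_ij = exp(-<x_i - x_j, Sigma^{-1} (x_i - x_j)>)."""
    S_inv = np.linalg.inv(Sigma)
    diff = X[:, None, :] - X[None, :, :]              # (m, m, d) pairwise differences
    sq = np.einsum('ijd,de,ije->ij', diff, S_inv, diff)
    return np.exp(-sq)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
Sigma = np.diag([1.0, 2.0, 0.5])                      # one member of the convex parameter set
B = gaussian_gram(X, Sigma)
print(B.shape, np.allclose(B, B.T), np.all(np.diag(B) == 1.0))
```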

SLIDE 43

Learning the Kernel and Structured Sparsity

  • Interpretation of LTK in the feature space [Bach et al. 2004, Micchelli & Pontil 2005]
$$\min_{v_1, \ldots, v_N \in \mathbb{R}^m} \; \sum_{i=1}^m E\left( \sum_{j=1}^N \langle v_j, \Phi_j(x_i) \rangle, \; y_i \right) + \gamma \left( \sum_{j=1}^N \|v_j\| \right)^2$$
  • Group Lasso in the feature space
  • The $\|\cdot\|_{2,1}$ norm tends to favor a small number of feature maps / kernels in the solution

SLIDE 44

Why Learn Kernels in Convex Sets?

  • Data fusion (e.g. in bioinformatics)
  • Kernel hyperparameter learning (e.g. Gaussian kernel parameters)
  • Multi-task learning
  • Metric learning
  • Semi-supervised learning (learning the graph)
  • Efficient alternative to cross validation; exploits the power of convex optimization

SLIDE 45

Properties of the Solution of LTK

  • Formulation for learning the kernel
$$\min_{K \in \mathcal{K}} \; \min_{c \in \mathbb{R}^m} \; \sum_{i=1}^m E\big((Kc)_i, y_i\big) + \gamma \, \langle c, Kc \rangle \qquad (LTK)$$
where $\mathcal{K} = \mathrm{conv}(\mathcal{B})$
  • I.e. solutions of $(LTK)$ are of the form $\hat{K} = \sum_{i=1}^N \hat{\lambda}_i \hat{B}_i$, where $\hat{\lambda}_i \ge 0$, $\sum_{i=1}^N \hat{\lambda}_i = 1$, $\hat{B}_i \in \mathcal{B}$
  • Any solution $(\hat{c}, \hat{K})$ of $(LTK)$ is a saddle point of a minimax problem

SLIDE 46

Properties of the Solution of LTK (contd.)

  • Theorem. $(\hat{c}, \hat{K})$ solves $(LTK)$ if and only if
    1. $\langle \hat{c}, \hat{B}_i \hat{c} \rangle = \max_{B \in \mathcal{B}} \langle \hat{c}, B \hat{c} \rangle$, for $i = 1, \ldots, N$
    2. $\hat{c}$ is the solution to $\min_{c \in \mathbb{R}^m} \; \sum_{i=1}^m E\big((\hat{K}c)_i, y_i\big) + \gamma \, \langle c, \hat{K}c \rangle$
Moreover, there exists a solution involving at most $m + 1$ kernels: $\hat{K} = \sum_{i=1}^{m+1} \hat{\lambda}_i \hat{B}_i$

SLIDE 47

A General Algorithm for Learning the Kernel

  • Incrementally builds an estimate of the solution, $K^{(k)} = \sum_{i=1}^k \lambda_i K^{(i)}$

Initialization: Given an initial kernel $K^{(1)}$ in the convex set $\mathcal{K}$
while convergence condition is not true do
    1. Compute $\hat{c} = \operatorname{argmin}_{c \in \mathbb{R}^m} \; \sum_{i=1}^m E\big((K^{(k)}c)_i, y_i\big) + \gamma \, \langle c, K^{(k)}c \rangle$
    2. Find a basic kernel $\hat{B} \in \operatorname{argmax}_{B \in \mathcal{B}} \; \langle \hat{c}, B\hat{c} \rangle$
    3. Compute $K^{(k+1)}$ as the optimal convex combination of $\hat{B}$ and $K^{(k)}$
end while
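Below is a runnable sketch of this greedy scheme for the simplest setting of a finite set of basic kernels and the square loss (all data, kernels and parameter values are made up): step 1 then has the closed form ĉ = (K + γI)⁻¹y, step 2 is a maximum over the finite set, and step 3 is approximated by a one-dimensional grid search over the mixing coefficient.

```python
import numpy as np

def ltk_objective(K, y, gamma):
    """min_c sum_i ((Kc)_i - y_i)^2 + gamma <c, Kc> = gamma y^T (K + gamma I)^{-1} y (square loss)."""
    m = len(y)
    return gamma * y @ np.linalg.solve(K + gamma * np.eye(m), y)

def greedy_ltk(basic_kernels, y, gamma=0.1, iters=20):
    K = basic_kernels[0].copy()                      # K^{(1)}: any kernel in the set
    m = len(y)
    for _ in range(iters):
        # Step 1: c-hat for the current kernel (closed form for the square loss)
        c = np.linalg.solve(K + gamma * np.eye(m), y)
        # Step 2: basic kernel maximizing <c, B c>
        B = max(basic_kernels, key=lambda Bi: c @ Bi @ c)
        # Step 3: best convex combination of B and K (grid search over the mixing weight)
        lams = np.linspace(0.0, 1.0, 51)
        K = min((lam * B + (1 - lam) * K for lam in lams),
                key=lambda K_new: ltk_objective(K_new, y, gamma))
    return K

# Toy usage: Gaussian kernels with different widths on 1-d inputs
rng = np.random.default_rng(0)
x = rng.standard_normal(30)
y = np.sin(3 * x)
basic_kernels = [np.exp(-((x[:, None] - x[None, :]) ** 2) / s) for s in (0.01, 0.1, 1.0, 10.0)]
K_hat = greedy_ltk(basic_kernels, y)
print(ltk_objective(K_hat, y, gamma=0.1))
```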

SLIDE 48

A General Algorithm for Learning the Kernel (contd.)

  • Theorem. There exists a limit point of the LTK algorithm and any limit point is a solution of $(LTK)$.
  • Step 1 is standard regularization (SVM, ridge regression etc.)
  • Step 2 is the hardest: tractable for e.g. finite $\mathcal{B}$ or MTL; non-convex for e.g. Gaussian kernels
  • Step 3 is convex in one variable (requires solving a few regularization problems)

  • Some non-convex cases (e.g. few-parameter Gaussians) are solvable

SLIDE 49

Character Recognition Experiments

  • MNIST classification of digits.
  • 1st experiment: Gaussian basic kernels with one parameter σ.
  • Compared with
    – continuously parameterized + local search
    – finite grid of basic kernels
    – SVM
  • 2nd experiment: Gaussian basic kernels with two parameters σ1, σ2 (left, right halves of images).

  • Compared with varying finite grids.

SLIDE 50

Character Recognition Experiments (contd.)

1st experiment (one kernel parameter σ):

            σ ∈ [75, 25000]           σ ∈ [100, 10000]          σ ∈ [500, 5000]
Task        LTK  local finite SVM     LTK  local finite SVM     LTK  local finite SVM
odd-even    6.5  6.6   18.0   11.8    6.5  6.6   10.9   8.6     6.5  6.5   6.7    6.9
3 vs. 8     3.7  3.8    6.9    6.0    3.9  3.8    4.9   5.1     3.6  3.8   3.7    3.8
4 vs. 7     2.7  2.5    4.2    2.8    2.4  2.5    2.7   2.6     2.3  2.5   2.6    2.3

2nd experiment (two kernel parameters σ1, σ2), LTK vs. finite grids:

            σ ∈ [75, 25000]        σ ∈ [100, 10000]       σ ∈ [500, 5000]
Task        LTK  5 × 5  10 × 10    LTK  5 × 5  10 × 10    LTK  5 × 5  10 × 10
odd-even    5.8  15.8   11.2       5.8  10.1    6.2       5.8   6.8    5.8
3 vs. 8     2.7   6.5    5.1       2.5   4.6    2.5       2.6   3.5    2.5
4 vs. 7     1.8   3.9    2.9       1.7   2.7    2.0       1.8   2.0    1.8

  • Using two parameters improves performance
  • Continuous vs. finite grid
    – more efficient
    – robust wrt. parameter range
    – does not overfit

SLIDE 51

Character Recognition Experiments (contd.)

[Plots] Learned kernel coefficients. From left to right: odd vs. even (1 and 2 kernel params.), 3 vs. 8 and 4 vs. 7 (2 params.)

SLIDE 52

Learning the Graph in Semi-Supervised Learning

  • We can also apply LTK when there are few labeled data points
  • A number of graphs are given, e.g. generated using k-NN, exponential decay weights etc.
  • To exploit the structure of each graph, the Gram matrices Bi are taken to be the pseudoinverses of the graph Laplacians
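A minimal sketch (the small example graph is made up) of turning one such graph into a basic Gram matrix as described above: form the graph Laplacian L = diag(degree) − A and take its pseudoinverse.

```python
import numpy as np

def laplacian_kernel(A):
    """Gram matrix = pseudoinverse of the graph Laplacian L = diag(degree) - A."""
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.pinv(L)

# Tiny example graph on 4 data points (adjacency matrix)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
B = laplacian_kernel(A)
print(np.allclose(B, B.T), np.all(np.linalg.eigvalsh(B) > -1e-10))  # symmetric, p.s.d.
```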

SLIDE 53

Conclusion

  • Multi-task learning is ubiquitous; exploiting task relatedness can enhance learning performance significantly
  • Proposed an alternating algorithm to learn tasks that lie on a common subspace; this algorithm is simple and efficient
  • Nonlinear kernels can be introduced via a multi-task representer theorem; also proposed spectral and non-convex matrix learning
  • Multi-task learning can be viewed as an instance of learning convex combinations of infinitely many kernels
  • Generally, we can use a greedy incremental algorithm to learn combinations of (finitely or infinitely many) kernels

SLIDE 54

Future Work

  • Convergence rates for the algorithms proposed
  • Task relatedness can be of many types (hierarchical features, input transformation invariances etc.)
  • Convex relaxation techniques for the non-convex formulations presented
  • Algorithms and representer theorems for special types of norms
  • Related problems (sparse coding, structured output, metric learning etc.)
