Consistent Multitask Learning with Nonlinear Output Constraints
Carlo Ciliberto, Department of Computer Science, UCL
Joint work with Alessandro Rudi, Lorenzo Rosasco and Massimiliano Pontil
Multitask Learning (MTL)
MTL Mantra: leverage the similarities among multiple learning problems (tasks) to reduce the complexity of the overall learning process.
◮ Previous literature: investigated linear task relations (more on this in a minute).
◮ This work: we address the problem of learning multiple tasks that are nonlinearly related to one another.
MTL Setting
Given $T$ datasets $S_t = \{(x_{it}, y_{it})\}_{i=1}^{n_t}$, learn $\hat f_t : \mathcal{X} \to \mathbb{R}$ by solving

$$(\hat f_1, \dots, \hat f_T) = \operatorname*{argmin}_{f_1, \dots, f_T \in \mathcal{H}} \; \frac{1}{T} \sum_{t=1}^{T} L(f_t, S_t) + R(f_1, \dots, f_T)$$

◮ $\mathcal{H}$ space of hypotheses.
◮ $L(f_t, S_t) = \frac{1}{n_t} \sum_{i=1}^{n_t} \ell(f_t(x_{it}), y_{it})$ data-fitting term, with loss $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ (e.g. least squares, logistic, hinge).
◮ $R(f_1, \dots, f_T)$ a joint task-structure regularizer.
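To fix ideas, here is a minimal sketch of this objective (a hypothetical instantiation assuming linear models $f_t(x) = \langle w_t, x \rangle$ and the least-squares loss; all names are illustrative, and the single-task regularizer used below is the first example on the next slide):

```python
import numpy as np

# A minimal sketch of the regularized multitask objective, assuming
# linear models f_t(x) = <w_t, x> and the least-squares loss.

def mtl_objective(W, data, reg, lam=0.1):
    """(1/T) sum_t L(f_t, S_t) + lam * R(f_1, ..., f_T), with row W[t] = w_t."""
    fit = np.mean([np.mean((X_t @ w_t - y_t) ** 2)
                   for w_t, (X_t, y_t) in zip(W, data)])
    return fit + lam * reg(W)

# Example: T = 3 tasks, single-task regularizer R(W) = sum_t ||w_t||^2.
rng = np.random.default_rng(0)
data = [(rng.standard_normal((10, 4)), rng.standard_normal(10)) for _ in range(3)]
W = np.zeros((3, 4))
print(mtl_objective(W, data, reg=lambda W: np.sum(W ** 2)))
```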
Previous Work: Linear MTL
For example, $R(f_1, \dots, f_T)$ can be:

◮ Single-task learning: $\lambda \sum_{t=1}^{T} \|f_t\|_{\mathcal{H}}^2$
◮ Variance regularization: $\lambda \sum_{t=1}^{T} \|f_t - \bar f\|_{\mathcal{H}}^2$ with $\bar f = \frac{1}{T} \sum_{t=1}^{T} f_t$
◮ Clustered tasks: $\lambda_1 \sum_{c=1}^{|C|} \sum_{t \in C(c)} \|f_t - \bar f_c\|_{\mathcal{H}}^2 + \lambda_2 \sum_{c=1}^{|C|} \|\bar f_c - \bar f\|_{\mathcal{H}}^2$
◮ Similarity regularizer: $\lambda \sum_{t,s} W_{s,t} \|f_t - f_s\|_{\mathcal{H}}^2$ with $W_{s,t} \ge 0$
Why "linear"? Because the task relations are encoded in a matrix:

$$R(f_1, \dots, f_T) = \sum_{t,s=1}^{T} A_{t,s} \, \langle f_t, f_s \rangle_{\mathcal{H}} \quad \text{with} \quad A \in \mathbb{R}^{T \times T}$$
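For instance, the variance regularizer above corresponds to the choice $A = I - \frac{1}{T}\mathbf{1}\mathbf{1}^\top$. A quick numerical sanity check (a sketch, with each $f_t$ represented as a finite-dimensional vector rather than an element of $\mathcal{H}$):

```python
import numpy as np

# Sanity check: the variance regularizer sum_t ||f_t - f_bar||^2 equals
# the quadratic form sum_{t,s} A_{t,s} <f_t, f_s> with A = I - (1/T) 11^T.
rng = np.random.default_rng(0)
T, d = 5, 3                      # number of tasks, feature dimension
F = rng.standard_normal((T, d))  # row t plays the role of f_t

f_bar = F.mean(axis=0)
variance_reg = np.sum((F - f_bar) ** 2)

A = np.eye(T) - np.ones((T, T)) / T
quadratic_form = np.sum(A * (F @ F.T))  # sum_{t,s} A_{t,s} <f_t, f_s>

assert np.isclose(variance_reg, quadratic_form)
```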
Nonlinear MTL: Setting
What if the relations are nonlinear? We study the case where the tasks satisfy a set of $k$ equations $\gamma(f_1(x), \dots, f_T(x)) = 0$ identified by a map $\gamma : \mathbb{R}^T \to \mathbb{R}^k$. Examples:
◮ Manifold-valued learning
◮ Physical systems (e.g. robotics)
◮ Logical constraints (e.g. ranking)
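Concretely, a constraint map is just a function that vanishes on admissible outputs. Two illustrative (hypothetical) choices of $\gamma$:

```python
import numpy as np

# Two illustrative constraint maps gamma: R^T -> R^k (hypothetical
# examples chosen for concreteness; the framework only requires
# gamma(f_1(x), ..., f_T(x)) = 0).

def gamma_sphere(c: np.ndarray) -> np.ndarray:
    """k = 1 equation: outputs constrained to the unit sphere in R^T."""
    return np.array([np.sum(c ** 2) - 1.0])

def gamma_sum_to_one(c: np.ndarray) -> np.ndarray:
    """k = 1 equation: outputs must sum to one (e.g. a conserved quantity)."""
    return np.array([np.sum(c) - 1.0])

# A candidate output c is feasible iff gamma(c) == 0 (up to tolerance):
c = np.array([0.6, 0.8])
print(np.allclose(gamma_sphere(c), 0.0))  # True: 0.36 + 0.64 = 1
```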
Nonlinear MTL: Setting
NL-MTL goal: approximate $f^* : \mathcal{X} \to \mathcal{C}$, the minimizer of the expected risk

$$\min_{f : \mathcal{X} \to \mathcal{C}} \mathcal{E}(f), \qquad \mathcal{E}(f) = \frac{1}{T} \sum_{t=1}^{T} \int \ell(f_t(x), y) \, d\rho_t(x, y)$$

where
◮ $f : \mathcal{X} \to \mathcal{C}$ is such that $f(x) = (f_1(x), \dots, f_T(x))$ for all $x \in \mathcal{X}$.
◮ $\mathcal{C} = \{c \in \mathbb{R}^T \mid \gamma(c) = 0\}$ is the constraint set induced by $\gamma$.
◮ $\rho_t(x, y) = \rho_t(y|x)\rho_{\mathcal{X}}(x)$ is the unknown data distribution of task $t$.
Nonlinear MTL: Challenges
Why not try empirical risk minimization?

$$\hat f = \operatorname*{argmin}_{f \in \mathcal{H}} \; \frac{1}{T} \sum_{t=1}^{T} L(f_t, S_t), \qquad \mathcal{H} \subset \{f : \mathcal{X} \to \mathcal{C}\}$$

Problems:
◮ Modeling: $f_1, f_2 : \mathcal{X} \to \mathcal{C}$ does not guarantee $f_1 + f_2 : \mathcal{X} \to \mathcal{C}$, so $\mathcal{H}$ is not a linear space. How to choose a "good" $\mathcal{H}$ in practice?
◮ Computations: hard (non-convex) optimization. How to solve it?
◮ Statistics: how to study the generalization properties of $\hat f$?
Nonlinear MTL: a Structured Prediction Perspective

Idea: formulate NL-MTL as a structured prediction problem. Structured prediction was originally designed for discrete outputs, but has recently been generalized to any set $\mathcal{C}$ within the SELF framework [Ciliberto et al. 2016].
Nonlinear MTL Estimator
We propose to approximate $f^*$ via the estimator $\hat f : \mathcal{X} \to \mathcal{C}$ such that

$$\hat f(x) = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{n_t} \alpha_{it}(x) \, \ell(c_t, y_{it})$$

where the weights are obtained in closed form as

$$(\alpha_{1t}(x), \dots, \alpha_{n_t t}(x))^\top = (K_t + \lambda I)^{-1} v_t(x)$$

with $K_t$ the kernel matrix $(K_t)_{ij} = k(x_{it}, x_{jt})$ of the $t$-th dataset and $v_t(x) \in \mathbb{R}^{n_t}$ with $v_t(x)_i = k(x_{it}, x)$, for a kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$.

Note: evaluating $\hat f(x)$ requires solving an optimization problem over $\mathcal{C}$ (e.g. for the least-squares loss, $\hat f$ reduces to a projection onto $\mathcal{C}$).
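A minimal end-to-end sketch of this evaluation for the least-squares loss, assuming a Gaussian kernel and a generic equality-constrained solver for the argmin over $\mathcal{C}$ (all data and names are illustrative, not the authors' implementation):

```python
import numpy as np
from scipy.optimize import minimize

def rbf(A, B, sigma=1.0):
    """Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def alphas(X_t, x, lam=1e-3):
    """Closed-form weights alpha_t(x) = (K_t + lam * I)^{-1} v_t(x)."""
    K = rbf(X_t, X_t)
    v = rbf(X_t, x[None, :])[:, 0]
    return np.linalg.solve(K + lam * np.eye(len(X_t)), v)

def f_hat(x, data, gamma):
    """argmin_{c : gamma(c) = 0} (1/T) sum_t sum_i alpha_it(x) (c_t - y_it)^2."""
    a = [alphas(X_t, x) for X_t, _ in data]          # weights per task
    ys = [y_t for _, y_t in data]
    def objective(c):
        return sum(a_t @ (c_t - y_t) ** 2
                   for c_t, a_t, y_t in zip(c, a, ys)) / len(data)
    c0 = np.array([a_t @ y_t for a_t, y_t in zip(a, ys)])  # weighted means
    res = minimize(objective, c0, method="SLSQP",
                   constraints={"type": "eq", "fun": gamma})
    return res.x

# Usage: T = 2 tasks whose outputs must lie on the unit circle.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 50)
X = theta[:, None]                                   # shared inputs here
data = [(X, np.cos(theta)), (X, np.sin(theta))]      # (X_t, y_t) per task
gamma = lambda c: np.sum(c ** 2) - 1.0               # sphere constraint
print(f_hat(np.array([1.0]), data, gamma))           # ~ [cos(1), sin(1)]
```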
Theoretical Results
Thm. 1 (Universal Consistency). $\mathcal{E}(\hat f) - \mathcal{E}(f^*) \to 0$ with probability 1.

Thm. 2 (Rates). Let $n = n_t$ and $g_t^* \in \mathcal{G}$ for all $t = 1, \dots, T$. Then $\mathcal{E}(\hat f) - \mathcal{E}(f^*) \leq O(n^{-1/4})$ with high probability.

Thm. 3 (Benefits of MTL). Let $\mathcal{C} \subset \mathbb{R}^T$ be the sphere of radius 1 and let $N = nT$. Then $\mathcal{E}(\hat f) - \mathcal{E}(f^*) \leq O(N^{-1/2})$ with high probability.
Intuition
Ok... but how did we get there?
Structure Encoding Loss Function (SELF)
Ciliberto et al. 2016
Def. $\ell : \mathcal{C} \times \mathcal{Y} \to \mathbb{R}$ is a structure encoding loss function (SELF) if there exist a Hilbert space $\mathcal{H}$ and maps $\psi : \mathcal{C} \to \mathcal{H}$, $\varphi : \mathcal{Y} \to \mathcal{H}$ such that

$$\ell(c, y) = \langle \psi(c), \varphi(y) \rangle_{\mathcal{H}} \qquad \forall c \in \mathcal{C}, \; \forall y \in \mathcal{Y}.$$

An abstract definition... BUT "most" loss functions used in MTL settings are SELF! More precisely, any loss that is Lipschitz continuous and differentiable almost everywhere (e.g. least squares, logistic, hinge).
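For instance (a worked example consistent with the definition, not taken from the slides), the least-squares loss is SELF with a three-dimensional $\mathcal{H}$:

$$\ell(c, y) = (c - y)^2 = c^2 - 2cy + y^2 = \langle \psi(c), \varphi(y) \rangle_{\mathbb{R}^3}, \qquad \psi(c) = \begin{pmatrix} c^2 \\ -2c \\ 1 \end{pmatrix}, \quad \varphi(y) = \begin{pmatrix} 1 \\ y \\ y^2 \end{pmatrix}.$$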
Nonlinear MTL + SELF
The minimizer of the expected risk satisfies, for every $x \in \mathcal{X}$,

$$f^*(x) = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \int \ell(c_t, y) \, d\rho_t(y|x)$$
Nonlinear MTL + SELF
By the SELF property, $\ell(c_t, y) = \langle \psi(c_t), \varphi(y) \rangle_{\mathcal{H}}$, so

$$f^*(x) = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \int \langle \psi(c_t), \varphi(y) \rangle_{\mathcal{H}} \, d\rho_t(y|x)$$
Nonlinear MTL + SELF
Since the inner product is linear and continuous, the integral can be moved inside:

$$f^*(x) = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \Big\langle \psi(c_t), \int \varphi(y) \, d\rho_t(y|x) \Big\rangle_{\mathcal{H}}$$
Nonlinear MTL + SELF
That is,

$$f^*(x) = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \langle \psi(c_t), g_t^*(x) \rangle_{\mathcal{H}}$$

where $g_t^* : \mathcal{X} \to \mathcal{H}$ is such that $g_t^*(x) = \int \varphi(y) \, d\rho_t(y|x)$.
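In one display, the chain of equalities underlying the previous four slides, for each task $t$ and input $x$:

$$\int \ell(c_t, y) \, d\rho_t(y|x) = \int \langle \psi(c_t), \varphi(y) \rangle_{\mathcal{H}} \, d\rho_t(y|x) = \Big\langle \psi(c_t), \underbrace{\int \varphi(y) \, d\rho_t(y|x)}_{g_t^*(x)} \Big\rangle_{\mathcal{H}}$$

so the conditional expectation $g_t^*(x)$ is the only unknown quantity, and learning it is a regression problem.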
Nonlinear MTL Estimator
Idea: learn an estimator $\hat g_t : \mathcal{X} \to \mathcal{H}$ for each $g_t^*$. Then approximate

$$f^*(x) = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \langle \psi(c_t), g_t^*(x) \rangle_{\mathcal{H}}$$

with $\hat f : \mathcal{X} \to \mathcal{C}$ given by

$$\hat f(x) = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \langle \psi(c_t), \hat g_t(x) \rangle_{\mathcal{H}}$$
Nonlinear MTL Estimator
This work: learn $\hat g_t$ via kernel ridge regression. Let $\mathcal{G}$¹ be a reproducing kernel Hilbert space with kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$:

$$\hat g_t = \operatorname*{argmin}_{g \in \mathcal{G}} \; \frac{1}{n_t} \sum_{i=1}^{n_t} \|g(x_{it}) - \varphi(y_{it})\|_{\mathcal{H}}^2 + \lambda \|g\|_{\mathcal{G}}^2$$

Then

$$\hat g_t(x) = \sum_{i=1}^{n_t} \alpha_{it}(x) \, \varphi(y_{it}), \qquad (\alpha_{1t}(x), \dots, \alpha_{n_t t}(x))^\top = (K_t + \lambda I)^{-1} v_t(x)$$

where $K_t$ is the kernel matrix of the $t$-th dataset and $v_t(x) \in \mathbb{R}^{n_t}$ the evaluation vector $v_t(x)_i = k(x_{it}, x)$.

¹ actually $\mathcal{G} \otimes \mathcal{H}$
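To see the mechanics with a finite-dimensional embedding, here is a small numerical check (illustrative data and kernel; the SELF maps are the ones from the least-squares example above) that the inner product $\langle \psi(c), \hat g_t(x) \rangle_{\mathcal{H}}$ equals the weighted empirical loss used by the estimator:

```python
import numpy as np

# With the explicit SELF embedding of the least-squares loss, the KRR
# estimate g_hat(x) = sum_i alpha_i(x) phi(y_i) turns <psi(c), g_hat(x)>
# into the weighted empirical loss sum_i alpha_i(x) * l(c, y_i).

psi = lambda c: np.array([c ** 2, -2.0 * c, 1.0])
phi = lambda y: np.array([1.0, y, y ** 2])

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 1))
y = np.sin(X[:, 0])
x = np.array([0.3])
lam = 1e-2

# Closed-form KRR weights alpha(x) = (K + lam I)^{-1} v(x), Gaussian kernel.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2)
v = np.exp(-((X - x) ** 2).sum(-1) / 2)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), v)

g_hat = sum(a * phi(yi) for a, yi in zip(alpha, y))   # g_hat(x) in H = R^3
c = 0.7                                                # any candidate output
lhs = psi(c) @ g_hat                                   # <psi(c), g_hat(x)>
rhs = alpha @ (c - y) ** 2                             # sum_i alpha_i l(c, y_i)
assert np.isclose(lhs, rhs)
```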
Nonlinear MTL Estimator
Plugging $\hat g_t$ into

$$\hat f(x) = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \langle \psi(c_t), \hat g_t(x) \rangle_{\mathcal{H}},$$

by the SELF property we have

$$\hat f(x) = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \Big\langle \psi(c_t), \sum_{i=1}^{n_t} \alpha_{it}(x) \, \varphi(y_{it}) \Big\rangle_{\mathcal{H}}$$
Nonlinear MTL Estimator
Plugging $\hat g_t$ into

$$\hat f(x) = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \langle \psi(c_t), \hat g_t(x) \rangle_{\mathcal{H}},$$

by the SELF property we have

$$\hat f(x) = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{n_t} \alpha_{it}(x) \, \langle \psi(c_t), \varphi(y_{it}) \rangle_{\mathcal{H}}$$
Nonlinear MTL Estimator
Plugging $\hat g_t$ into

$$\hat f(x) = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \langle \psi(c_t), \hat g_t(x) \rangle_{\mathcal{H}},$$

by the SELF property we have

$$\hat f(x) = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{n_t} \alpha_{it}(x) \, \ell(c_t, y_{it}),$$

recovering exactly the estimator proposed earlier.