SLIDE 1

Consistent Multitask Learning with Nonlinear Output Constraints

Carlo Ciliberto, Department of Computer Science, UCL. Joint work with Alessandro Rudi, Lorenzo Rosasco and Massi Pontil.

SLIDE 2

Multitask Learning (MTL)

MTL Mantra: leverage the similarities among multiple learning problems (tasks) to reduce the complexity of the overall learning process.

◮ Previous literature: investigated linear task relations (more on this in a minute).
◮ This work: we address the problem of learning multiple tasks that are nonlinearly related to one another.

SLIDE 3

MTL Setting

Given $T$ datasets $S_t = \{(x_{it}, y_{it})\}_{i=1}^{n_t}$, learn $\hat f_t : X \to \mathbb{R}$ by solving

$$(\hat f_1, \ldots, \hat f_T) = \operatorname*{argmin}_{f_1, \ldots, f_T \in \mathcal{H}} \; \frac{1}{T} \sum_{t=1}^{T} L(f_t, S_t) + R(f_1, \ldots, f_T)$$

◮ $\mathcal{H}$ space of hypotheses.
◮ $L(f_t, S_t) = \frac{1}{n_t} \sum_{i=1}^{n_t} \ell(f_t(x_{it}), y_{it})$ data-fitting term, with loss $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ (e.g. least squares, logistic, hinge, etc.).
◮ $R(f_1, \ldots, f_T)$ a joint task-structure regularizer.

SLIDE 4

Previous Work: Linear MTL

For example, $R(f_1, \ldots, f_T) =$

◮ Single-task learning: $\lambda \sum_{t=1}^{T} \|f_t\|_{\mathcal{H}}^2$
◮ Variance regularization: $\lambda \sum_{t=1}^{T} \|f_t - \bar f\|_{\mathcal{H}}^2$ with $\bar f = \frac{1}{T} \sum_{t=1}^{T} f_t$
◮ Clustered tasks: $\lambda_1 \sum_{c=1}^{|\mathcal{C}|} \sum_{t \in \mathcal{C}(c)} \|f_t - \bar f_c\|_{\mathcal{H}}^2 + \lambda_2 \sum_{c=1}^{|\mathcal{C}|} \|\bar f_c - \bar f\|_{\mathcal{H}}^2$
◮ Similarity regularizer: $\lambda \sum_{t,s=1}^{T} W_{s,t} \|f_t - f_s\|_{\mathcal{H}}^2$ with $W_{s,t} \geq 0$

Why "linear"? Because the task relations are encoded in a matrix:

$$R(f_1, \ldots, f_T) = \sum_{t,s=1}^{T} A_{t,s} \langle f_t, f_s \rangle_{\mathcal{H}} \qquad \text{with } A \in \mathbb{R}^{T \times T}$$
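To make the matrix encoding concrete, here is a minimal sketch (ours, not from the slides) for linear tasks $f_t(x) = \langle w_t, x \rangle$, where $\langle f_t, f_s \rangle_{\mathcal{H}} = \langle w_t, w_s \rangle$ and the regularizer becomes a quadratic form in the task weights; the variance-regularization choice of $A$ is used as the example.

    import numpy as np

    # Sketch: linear tasks f_t(x) = <w_t, x>; rows of W are w_1, ..., w_T.
    # R(f_1, ..., f_T) = sum_{t,s} A[t,s] <w_t, w_s> = <A, W W^T>_Frobenius.
    def linear_mtl_regularizer(W, A):
        return np.sum(A * (W @ W.T))

    T, d = 4, 3
    W = np.random.randn(T, d)

    # Variance regularization corresponds to A = I - (1/T) * ones(T, T):
    A_var = np.eye(T) - np.ones((T, T)) / T
    w_bar = W.mean(axis=0)
    assert np.isclose(linear_mtl_regularizer(W, A_var),
                      np.sum((W - w_bar) ** 2))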

SLIDE 5

Nonlinear MTL: Setting

What if the relations are nonlinear? We study the case where the tasks satisfy a set of $k$ equations $\gamma(f_1(x), \ldots, f_T(x)) = 0$, identified by a map $\gamma : \mathbb{R}^T \to \mathbb{R}^k$.

Examples:
◮ Manifold-valued learning
◮ Physical systems (e.g. robotics)
◮ Logical constraints (e.g. ranking)
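For instance (an illustrative case, not from the slides), outputs constrained to lie on the unit sphere correspond to a single equation ($k = 1$):

$$\gamma(c) = \|c\|^2 - 1 = \sum_{t=1}^{T} c_t^2 - 1, \qquad C = \{c \in \mathbb{R}^T \mid \gamma(c) = 0\} = \mathbb{S}^{T-1},$$

which is exactly the constraint set appearing in Thm. 3 below.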

SLIDE 6

Nonlinear MTL: Setting

NL-MTL Goal: approximate $f^* : X \to C$, the minimizer of the Expected Risk

$$\min_{f : X \to C} \mathcal{E}(f), \qquad \mathcal{E}(f) = \frac{1}{T} \sum_{t=1}^{T} \int \ell(f_t(x), y) \, d\rho_t(x, y)$$

where
◮ $f : X \to C$ is such that $f(x) = (f_1(x), \ldots, f_T(x))$ for all $x \in X$.
◮ $C = \{c \in \mathbb{R}^T \mid \gamma(c) = 0\}$ is the constraint set induced by $\gamma$.
◮ $\rho_t(x, y) = \rho_t(y|x)\rho_X(x)$ is the unknown data distribution of task $t$.

SLIDE 7

Nonlinear MTL: Challenges

Why not try Empirical Risk Minimization?

$$\hat f = \operatorname*{argmin}_{f \in \mathcal{H}} \; \frac{1}{T} \sum_{t=1}^{T} L(f_t, S_t), \qquad \mathcal{H} \subset \{f : X \to C\}$$

Problems:
◮ Modeling: $f_1, f_2 : X \to C$ does not guarantee $f_1 + f_2 : X \to C$, so $\mathcal{H}$ is not a linear space. How to choose a "good" $\mathcal{H}$ in practice?
◮ Computations: hard (non-convex) optimization. How to solve it?
◮ Statistics: how to study the generalization properties of $\hat f$?

SLIDE 8

Nonlinear MTL: a Structured Prediction Perspective

Idea: formulate NL-MTL as a structured prediction problem. Structured prediction was originally designed for discrete outputs, but has recently been generalized to any set C within the SELF framework [Ciliberto et al. 2016].
SLIDE 9

Nonlinear MTL Estimator

We propose to approximate $f^*$ via the estimator $\hat f : X \to C$ such that

$$\hat f(x) = \operatorname*{argmin}_{c \in C} \; \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{n_t} \alpha_{it}(x) \, \ell(c_t, y_{it})$$

where the weights are obtained in closed form as

$$(\alpha_{1t}(x), \ldots, \alpha_{n_t t}(x))^\top = (K_t + \lambda I)^{-1} v_t(x)$$

with $K_t$ the kernel matrix $(K_t)_{ij} = k(x_{it}, x_{jt})$ of the $t$-th dataset and $v_t(x) \in \mathbb{R}^{n_t}$ the vector with entries $v_t(x)_i = k(x_{it}, x)$, for a kernel $k : X \times X \to \mathbb{R}$.

Note: evaluating $\hat f(x)$ requires solving an optimization problem over $C$ (e.g. for the least-squares loss, $\hat f$ reduces to a projection onto $C$).
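A minimal sketch of this evaluation for the least-squares loss under the unit-sphere constraint $\gamma(c) = \|c\|^2 - 1$ (our illustration: the Gaussian kernel, the helper names, and the projection step are assumptions, not the paper's code). For least squares, the unconstrained minimizer decouples across tasks into a weighted mean of the training outputs, which, following the slide's remark, is then projected onto $C$:

    import numpy as np

    def gaussian_kernel(X1, X2, sigma=1.0):
        # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), computed pairwise
        d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2))

    def alpha_weights(X_train, x, lam=1e-3, sigma=1.0):
        # alpha_t(x) = (K_t + lam * I)^{-1} v_t(x), for one task's dataset
        K = gaussian_kernel(X_train, X_train, sigma)
        v = gaussian_kernel(X_train, x[None, :], sigma).ravel()
        return np.linalg.solve(K + lam * np.eye(len(K)), v)

    def f_hat(x, tasks):
        # tasks: list of (X_t, y_t) pairs, one per task t
        c = np.empty(len(tasks))
        for t, (X_t, y_t) in enumerate(tasks):
            a = alpha_weights(X_t, x)
            # unconstrained minimizer of sum_i a_i * (c_t - y_it)^2
            c[t] = a @ y_t / a.sum()
        # decode: project onto C = unit sphere (gamma(c) = ||c||^2 - 1)
        return c / np.linalg.norm(c)

    tasks = [(np.random.randn(30, 2), np.random.randn(30)) for _ in range(3)]
    print(f_hat(np.zeros(2), tasks))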

SLIDE 10

Theoretical Results

Thm. 1 (Universal Consistency). $\mathcal{E}(\hat f) - \mathcal{E}(f^*) \to 0$ with probability $1$.

Thm. 2 (Rates). Let $n = n_t$ and $g_t^* \in \mathcal{G}$ for all $t = 1, \ldots, T$. Then $\mathcal{E}(\hat f) - \mathcal{E}(f^*) \leq O(n^{-1/4})$ with high probability.

Thm. 3 (Benefits of MTL). Let $C \subset \mathbb{R}^T$ be the sphere of radius $1$ and let $N = nT$. Then $\mathcal{E}(\hat f) - \mathcal{E}(f^*) \leq O(N^{-1/2})$ with high probability.

SLIDE 11

Intuition

Ok... but how did we get there?

SLIDE 12

Structure Encoding Loss Function (SELF)

Ciliberto et al. 2016

Def. $\ell : C \times Y \to \mathbb{R}$ is a structure encoding loss function (SELF) if there exist a Hilbert space $\mathcal{H}$ and maps $\psi : C \to \mathcal{H}$, $\varphi : Y \to \mathcal{H}$ such that

$$\ell(c, y) = \langle \psi(c), \varphi(y) \rangle_{\mathcal{H}} \qquad \forall c \in C, \; \forall y \in Y.$$

An abstract definition... BUT "most" loss functions used in MTL settings are SELF! More precisely, any Lipschitz-continuous function that is differentiable almost everywhere (e.g. least squares, logistic, hinge).
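As a quick sanity check (our example, not from the slides), the least-squares loss is SELF with $\mathcal{H} = \mathbb{R}^3$:

$$\ell(c, y) = (c - y)^2 = \langle \psi(c), \varphi(y) \rangle_{\mathbb{R}^3}, \qquad \psi(c) = (c^2, \, 1, \, -2c), \quad \varphi(y) = (1, \, y^2, \, y),$$

since $\langle \psi(c), \varphi(y) \rangle = c^2 + y^2 - 2cy$.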

SLIDE 13

Nonlinear MTL + SELF

Minimizer of the expected risk:

$$f^*(x) = \operatorname*{argmin}_{c \in C} \; \frac{1}{T} \sum_{t=1}^{T} \int \ell(c_t, y) \, d\rho_t(y|x)$$
SLIDE 14

Nonlinear MTL + SELF

Minimizer of the expected risk, rewriting the loss via the SELF property:

$$f^*(x) = \operatorname*{argmin}_{c \in C} \; \frac{1}{T} \sum_{t=1}^{T} \int \langle \psi(c_t), \varphi(y) \rangle_{\mathcal{H}} \, d\rho_t(y|x)$$
SLIDE 15

Nonlinear MTL + SELF

Minimizer of the expected risk, moving the integral inside the inner product by linearity:

$$f^*(x) = \operatorname*{argmin}_{c \in C} \; \frac{1}{T} \sum_{t=1}^{T} \left\langle \psi(c_t), \int \varphi(y) \, d\rho_t(y|x) \right\rangle_{\mathcal{H}}$$
SLIDE 16

Nonlinear MTL + SELF

Minimizer of the expected risk:

$$f^*(x) = \operatorname*{argmin}_{c \in C} \; \frac{1}{T} \sum_{t=1}^{T} \langle \psi(c_t), g_t^*(x) \rangle_{\mathcal{H}}$$

where $g_t^* : X \to \mathcal{H}$ is such that $g_t^*(x) = \int \varphi(y) \, d\rho_t(y|x)$ (e.g. if $\varphi(y) = y$, then $g_t^*(x)$ is the conditional mean $\mathbb{E}[y \mid x]$).
SLIDE 17

Nonlinear MTL Estimator

Idea: learn an estimator $\hat g_t : X \to \mathcal{H}$ for each $g_t^*$. Then approximate

$$f^*(x) = \operatorname*{argmin}_{c \in C} \; \frac{1}{T} \sum_{t=1}^{T} \langle \psi(c_t), g_t^*(x) \rangle_{\mathcal{H}}$$

with $\hat f : X \to C$ given by

$$\hat f(x) = \operatorname*{argmin}_{c \in C} \; \frac{1}{T} \sum_{t=1}^{T} \langle \psi(c_t), \hat g_t(x) \rangle_{\mathcal{H}}$$

SLIDE 18

Nonlinear MTL Estimator

This work: learn $\hat g_t$ via kernel ridge regression. Let $\mathcal{G}$¹ be a reproducing kernel Hilbert space with kernel $k : X \times X \to \mathbb{R}$ and set

$$\hat g_t = \operatorname*{argmin}_{g \in \mathcal{G}} \; \frac{1}{n_t} \sum_{i=1}^{n_t} \|g(x_{it}) - \varphi(y_{it})\|_{\mathcal{H}}^2 + \lambda \|g\|_{\mathcal{G}}^2$$

Then

$$\hat g_t(x) = \sum_{i=1}^{n_t} \alpha_{it}(x) \, \varphi(y_{it}), \qquad (\alpha_{1t}(x), \ldots, \alpha_{n_t t}(x))^\top = (K_t + \lambda I)^{-1} v_t(x)$$

where $K_t$ is the kernel matrix of the $t$-th dataset and $v_t(x) \in \mathbb{R}^{n_t}$ the evaluation vector with $v_t(x)_i = k(x_{it}, x)$.

¹ Actually $\mathcal{G} \otimes \mathcal{H}$.

SLIDE 19

Nonlinear MTL Estimator

Plugging $\hat g_t(x) = \sum_{i=1}^{n_t} \alpha_{it}(x) \varphi(y_{it})$ into

$$\hat f(x) = \operatorname*{argmin}_{c \in C} \; \frac{1}{T} \sum_{t=1}^{T} \langle \psi(c_t), \hat g_t(x) \rangle_{\mathcal{H}}$$

we have

$$\hat f(x) = \operatorname*{argmin}_{c \in C} \; \frac{1}{T} \sum_{t=1}^{T} \left\langle \psi(c_t), \sum_{i=1}^{n_t} \alpha_{it}(x) \, \varphi(y_{it}) \right\rangle_{\mathcal{H}}$$
SLIDE 20

Nonlinear MTL Estimator

By linearity of the inner product,

$$\hat f(x) = \operatorname*{argmin}_{c \in C} \; \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{n_t} \alpha_{it}(x) \, \langle \psi(c_t), \varphi(y_{it}) \rangle_{\mathcal{H}}$$

SLIDE 21

Nonlinear MTL Estimator

By the SELF property, $\langle \psi(c_t), \varphi(y_{it}) \rangle_{\mathcal{H}} = \ell(c_t, y_{it})$, so

$$\hat f(x) = \operatorname*{argmin}_{c \in C} \; \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{n_t} \alpha_{it}(x) \, \ell(c_t, y_{it})$$

as desired. Note that evaluating $\hat f(x)$ does not require knowledge of $\mathcal{H}$, $\varphi$ or $\psi$!
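Since the final expression involves only the weights $\alpha_{it}(x)$ and the loss $\ell$, the decoding step can be carried out with any constrained solver when no closed form is available. A minimal sketch (our illustration; it assumes SciPy's SLSQP solver and reuses the unit-sphere constraint from the earlier example):

    import numpy as np
    from scipy.optimize import minimize

    # Sketch of the decoding step for a generic SELF loss (ours, not the
    # paper's code): only the weights alpha and the loss l are needed.
    def decode(alphas, ys, loss, gamma, c0):
        # alphas[t]: weight vector alpha_t(x) of length n_t
        # ys[t]: training outputs of task t; gamma(c) = 0 defines C
        def objective(c):
            return sum(a @ loss(c[t], y)   # sum_i alpha_it(x) * l(c_t, y_it)
                       for t, (a, y) in enumerate(zip(alphas, ys)))
        res = minimize(objective, c0, method="SLSQP",
                       constraints=[{"type": "eq", "fun": gamma}])
        return res.x

    # Usage with the least-squares loss and the unit-sphere constraint:
    rng = np.random.default_rng(0)
    T, n = 3, 20
    alphas = [rng.random(n) / n for _ in range(T)]
    ys = [rng.standard_normal(n) for _ in range(T)]
    loss = lambda c_t, y: (c_t - y) ** 2      # vectorized over y
    gamma = lambda c: c @ c - 1.0             # ||c||^2 - 1 = 0
    c_hat = decode(alphas, ys, loss, gamma, c0=np.ones(T) / np.sqrt(T))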

SLIDE 22

Empirical Results

◮ Synthetic data
◮ Inverse dynamics (Sarcos)
◮ Logic constraints (ranking, MovieLens100k)