 
              Consistent Multitask Learning with Nonlinear Output Constraints Carlo Ciliberto Department of Computer Science, UCL joint work w/ Alessandro Rudi, Lorenzo Rosasco and Massi Pontil
Multitask Learning (MTL)
MTL Mantra: leverage the similarities among multiple learning problems (tasks) to reduce the complexity of the overall learning process.
Prev. Literature: investigated linear task relations (more on this in a minute).
This work: we address the problem of learning multiple tasks that are nonlinearly related to one another.
MTL Setting
Given $T$ datasets $S_t = (x_{it}, y_{it})_{i=1}^{n_t}$, learn $\hat f_t : \mathcal{X} \to \mathbb{R}$ by solving
$$(\hat f_1, \dots, \hat f_T) = \operatorname*{argmin}_{f_1, \dots, f_T \in \mathcal{H}} \; \frac{1}{T} \sum_{t=1}^{T} L(f_t, S_t) + R(f_1, \dots, f_T)$$
◮ $\mathcal{H}$: space of hypotheses.
◮ $L(f_t, S_t) = \frac{1}{n_t} \sum_{i=1}^{n_t} \ell(f_t(x_{it}), y_{it})$: data fitting term, with loss $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ (e.g. least squares, logistic, hinge, etc.).
◮ $R(f_1, \dots, f_T)$: a joint task-structure regularizer.
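Previous Work: Linear MTL
For example, $R(f_1, \dots, f_T) =$
◮ Single task learning: $\lambda \sum_{t=1}^{T} \|f_t\|_{\mathcal{H}}^2$
◮ Variance regularization: $\lambda \sum_{t=1}^{T} \|f_t - \bar f\|_{\mathcal{H}}^2$ with $\bar f = \frac{1}{T} \sum_{t=1}^{T} f_t$
◮ Clustered tasks: $\lambda_1 \sum_{c=1}^{|\mathcal{C}|} \sum_{t \in \mathcal{C}(c)} \|f_t - \bar f_c\|_{\mathcal{H}}^2 + \lambda_2 \sum_{c=1}^{|\mathcal{C}|} \|\bar f_c - \bar f\|_{\mathcal{H}}^2$
◮ Similarity regularizer: $\lambda \sum_{t,s} W_{s,t} \|f_t - f_s\|_{\mathcal{H}}^2$ with $W_{s,t} \ge 0$
Why "Linear"? Because the task relations are encoded in a matrix:
$$R(f_1, \dots, f_T) = \sum_{t,s=1}^{T} A_{t,s} \langle f_t, f_s \rangle_{\mathcal{H}} \quad \text{with } A \in \mathbb{R}^{T \times T}$$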
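The matrix form can be checked numerically in the simplest case. The sketch below assumes linear tasks $f_t(x) = \langle w_t, x \rangle$, so that the norm of $f_t$ is the Euclidean norm of $w_t$ (an assumption made only for illustration), and compares the variance regularizer with $\sum_{t,s} A_{t,s} \langle f_t, f_s \rangle$ for $A = \lambda (I - \frac{1}{T} \mathbf{1}\mathbf{1}^\top)$. The function names are hypothetical.

```python
import numpy as np

def variance_regularizer(W, lam):
    """lam * sum_t ||w_t - w_bar||^2, with the rows of W as task vectors w_t."""
    w_bar = W.mean(axis=0)
    return lam * np.sum((W - w_bar) ** 2)

def matrix_regularizer(W, A):
    """sum_{t,s} A[t, s] * <w_t, w_s>."""
    gram = W @ W.T  # gram[t, s] = <w_t, w_s>
    return np.sum(A * gram)

T, d, lam = 4, 3, 0.5
rng = np.random.default_rng(0)
W = rng.normal(size=(T, d))                   # one row per task (illustrative data)
A = lam * (np.eye(T) - np.ones((T, T)) / T)   # matrix encoding the variance regularizer

print(variance_regularizer(W, lam), matrix_regularizer(W, A))
```

The two printed values should agree up to floating-point error, since $\sum_t \|w_t - \bar w\|^2 = \sum_t \|w_t\|^2 - T \|\bar w\|^2$.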
Nonlinear MTL: Setting
What if relations are nonlinear? We study the case where tasks satisfy a set of $k$ equations
$$\gamma(f_1(x), \dots, f_T(x)) = 0$$
identified by $\gamma : \mathbb{R}^T \to \mathbb{R}^k$.
Examples
◮ Manifold-valued learning
◮ Physical systems (e.g. robotics)
◮ Logical constraints (e.g. ranking)
Nonlinear MTL: Setting
NL-MTL Goal: approximate $f^* : \mathcal{X} \to \mathcal{C}$, the minimizer of the expected risk
$$\min_{f : \mathcal{X} \to \mathcal{C}} \mathcal{E}(f), \qquad \mathcal{E}(f) = \frac{1}{T} \sum_{t=1}^{T} \int \ell(f_t(x), y) \, d\rho_t(x, y)$$
where
◮ $f : \mathcal{X} \to \mathcal{C}$ is such that $f(x) = (f_1(x), \dots, f_T(x))$ for all $x \in \mathcal{X}$.
◮ $\mathcal{C} = \{ c \in \mathbb{R}^T \mid \gamma(c) = 0 \}$ is the constraint set induced by $\gamma$.
◮ $\rho_t(x, y) = \rho_t(y \mid x) \, \rho_{\mathcal{X}}(x)$ is the unknown data distribution.
Nonlinear MTL: Challenges
Why not try Empirical Risk Minimization?
$$\hat f = \operatorname*{argmin}_{f \in \mathcal{H}} \; \frac{1}{T} \sum_{t=1}^{T} L(f_t, S_t), \qquad \mathcal{H} \subset \{ f : \mathcal{X} \to \mathcal{C} \}$$
Problems:
◮ Modeling: $f_1, f_2 : \mathcal{X} \to \mathcal{C}$ does not guarantee $f_1 + f_2 : \mathcal{X} \to \mathcal{C}$. $\mathcal{H}$ is not a linear space. How to choose a "good" $\mathcal{H}$ in practice?
◮ Computations: Hard (non-convex) optimization. How to solve it?
◮ Statistics: How to study the generalization properties of $\hat f$?
Nonlinear MTL: a Structured Prediction Perspective
Idea: formulate NL-MTL as a structured prediction problem.
Structured Prediction: originally designed for discrete outputs, but recently generalized to any set $\mathcal{C}$ within the SELF framework [Ciliberto et al. 2016].
Nonlinear MTL Estimator
We propose to approximate $f^*$ via the estimator $\hat f : \mathcal{X} \to \mathcal{C}$ such that
$$\hat f(x) = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{n_t} \alpha_{it}(x) \, \ell(c_t, y_{it})$$
where the weights are obtained in closed form as
$$(\alpha_{1t}(x), \dots, \alpha_{n_t t}(x)) = (K_t + \lambda I)^{-1} v_t(x)$$
with $K_t$ the kernel matrix $(K_t)_{ij} = k(x_{it}, x_{jt})$ of the $t$-th dataset, $v_t(x) \in \mathbb{R}^{n_t}$ with $v_t(x)_i = k(x_{it}, x)$, and $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ a kernel.
Note. Evaluating $\hat f(x)$ requires solving an optimization over $\mathcal{C}$ (e.g. for $\ell$ the least-squares loss, $\hat f$ reduces to a projection onto $\mathcal{C}$).
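A minimal sketch of this estimator, under assumptions that are not part of the slides: two tasks with outputs constrained to the unit circle ($\gamma(c) = c_1^2 + c_2^2 - 1$), a Gaussian kernel, the least-squares loss, and a plain grid search over the circle in place of a proper solver. All function names and the toy data are hypothetical.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=0.5):
    """Gaussian kernel matrix between 1-D input arrays a and b."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

def krr_weights(x_train, x, lam):
    """alpha_t(x) = (K_t + lam * I)^{-1} v_t(x) for a single task."""
    K = gaussian_kernel(x_train, x_train)
    v = gaussian_kernel(x_train, np.array([x]))[:, 0]
    return np.linalg.solve(K + lam * np.eye(len(x_train)), v)

def predict_on_circle(datasets, x, lam=1e-2, n_grid=2000):
    """Grid search for argmin_c (1/T) sum_t sum_i alpha_it(x) (c_t - y_it)^2 over the unit circle."""
    thetas = np.linspace(0.0, 2 * np.pi, n_grid, endpoint=False)
    candidates = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)  # points of C
    objective = np.zeros(n_grid)
    for t, (x_train, y_train) in enumerate(datasets):
        alpha = krr_weights(x_train, x, lam)
        # squared losses (c_t - y_it)^2 for every candidate c and every sample i
        sq = (candidates[:, t][:, None] - y_train[None, :]) ** 2
        objective += sq @ alpha
    return candidates[np.argmin(objective / len(datasets))]

# Toy data (assumption): task 1 observes cos(x), task 2 observes sin(x), both with noise.
rng = np.random.default_rng(0)
x1 = np.linspace(0.0, 2 * np.pi, 50)
x2 = np.linspace(0.05, 2 * np.pi, 40)
datasets = [(x1, np.cos(x1) + 0.05 * rng.normal(size=x1.size)),
            (x2, np.sin(x2) + 0.05 * rng.normal(size=x2.size))]

c_hat = predict_on_circle(datasets, x=1.0)
print(c_hat)
```

Each task keeps its own dataset and its own weights; the tasks interact only through the search over $\mathcal{C}$, so the prediction satisfies the constraint by construction.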
Theoretical Results
Thm. 1 (Universal Consistency). $\mathcal{E}(\hat f) - \mathcal{E}(f^*) \to 0$ with probability 1.
Thm. 2 (Rates). Let $n = n_t$ and $g_t^* \in \mathcal{G}$ for all $t = 1, \dots, T$. Then $\mathcal{E}(\hat f) - \mathcal{E}(f^*) \le O(n^{-1/4})$ with high probability.
Thm. 3 (Benefits of MTL). Let $\mathcal{C} \subset \mathbb{R}^T$ be the radius-1 sphere. Let $N = nT$. Then $\mathcal{E}(\hat f) - \mathcal{E}(f^*) \le O(N^{-1/2})$ with high probability.
Intuition Ok... but how did we get there?
Structure Encoding Loss Function (SELF) [Ciliberto et al. 2016]
Def. $\ell : \mathcal{C} \times \mathcal{Y} \to \mathbb{R}$ is a structure encoding loss function (SELF) if there exist a Hilbert space $\mathcal{H}$ and maps $\psi : \mathcal{C} \to \mathcal{H}$, $\varphi : \mathcal{Y} \to \mathcal{H}$ such that
$$\ell(c, y) = \langle \psi(c), \varphi(y) \rangle_{\mathcal{H}} \qquad \forall c \in \mathcal{C}, \ \forall y \in \mathcal{Y}.$$
Abstract definition... BUT "most" loss functions used in MTL settings are SELF! More precisely, any Lipschitz continuous function that is differentiable almost everywhere (e.g. least squares, logistic, hinge).
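For a concrete instance, the least-squares loss on real numbers factorizes with $\mathcal{H} = \mathbb{R}^3$, since $(c - y)^2 = c^2 \cdot 1 - 2c \cdot y + 1 \cdot y^2$. The sketch below checks this identity; the particular choice of `psi` and `phi` is one possible factorization picked for illustration, not one taken from the slides.

```python
def psi(c):
    """Feature map of the candidate output c (illustrative choice)."""
    return (c * c, -2.0 * c, 1.0)

def phi(y):
    """Feature map of the observed label y (illustrative choice)."""
    return (1.0, y, y * y)

def inner(u, v):
    """Euclidean inner product in R^3."""
    return sum(a * b for a, b in zip(u, v))

def least_squares(c, y):
    return (c - y) ** 2

for c, y in [(0.0, 1.0), (2.5, -1.0), (-0.3, 0.7)]:
    print(least_squares(c, y), inner(psi(c), phi(y)))
```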
Nonlinear MTL + SELF
Minimizer of the expected risk:
$$f^*(x) = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \int \ell(c_t, y) \, d\rho_t(y \mid x)$$
$$\phantom{f^*(x)} = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \int \langle \psi(c_t), \varphi(y) \rangle_{\mathcal{H}} \, d\rho_t(y \mid x)$$
$$\phantom{f^*(x)} = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \Big\langle \psi(c_t), \int \varphi(y) \, d\rho_t(y \mid x) \Big\rangle_{\mathcal{H}}$$
$$\phantom{f^*(x)} = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \langle \psi(c_t), g_t^*(x) \rangle_{\mathcal{H}}$$
where $g_t^* : \mathcal{X} \to \mathcal{H}$ is such that $g_t^*(x) = \int \varphi(y) \, d\rho_t(y \mid x)$.
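Nonlinear MTL Estimator
Idea: learn a $\hat g_t : \mathcal{X} \to \mathcal{H}$ for each $g_t^*$. Then approximate
$$f^*(x) = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \langle \psi(c_t), g_t^*(x) \rangle_{\mathcal{H}}$$
with $\hat f : \mathcal{X} \to \mathcal{C}$
$$\hat f(x) = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \langle \psi(c_t), \hat g_t(x) \rangle_{\mathcal{H}}$$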
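Nonlinear MTL Estimator
This work: learn $\hat g_t$ via kernel ridge regression. Let $\mathcal{G}$ [1] be a reproducing kernel Hilbert space with kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$.
$$\hat g_t = \operatorname*{argmin}_{g \in \mathcal{G}} \; \frac{1}{n_t} \sum_{i=1}^{n_t} \| g(x_{it}) - \varphi(y_{it}) \|_{\mathcal{H}}^2 + \lambda \| g \|_{\mathcal{G}}^2$$
Then
$$\hat g_t(x) = \sum_{i=1}^{n_t} \alpha_{it}(x) \, \varphi(y_{it}), \qquad (\alpha_{1t}(x), \dots, \alpha_{n_t t}(x)) = (K_t + \lambda I)^{-1} v_t(x)$$
where $K_t$ is the kernel matrix of the $t$-th dataset and $v_t(x) \in \mathbb{R}^{n_t}$ is the evaluation vector $v_t(x)_i = k(x_{it}, x)$.
[1] Actually $\mathcal{G} \otimes \mathcal{H}$.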
Nonlinear MTL Estimator
Plugging $\hat g_t(x) = \sum_{i=1}^{n_t} \alpha_{it}(x) \, \varphi(y_{it})$ into
$$\hat f(x) = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \langle \psi(c_t), \hat g_t(x) \rangle_{\mathcal{H}}$$
we have
$$\hat f(x) = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \Big\langle \psi(c_t), \sum_{i=1}^{n_t} \alpha_{it}(x) \, \varphi(y_{it}) \Big\rangle_{\mathcal{H}}$$
then, by linearity of the inner product,
$$\hat f(x) = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{n_t} \alpha_{it}(x) \, \langle \psi(c_t), \varphi(y_{it}) \rangle_{\mathcal{H}}$$
and, by the SELF property,
$$\hat f(x) = \operatorname*{argmin}_{c \in \mathcal{C}} \; \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{n_t} \alpha_{it}(x) \, \ell(c_t, y_{it})$$
as desired. Note that evaluating $\hat f(x)$ does not require knowledge of $\mathcal{H}$, $\varphi$ or $\psi$!
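The last step can be checked numerically for a single task. The sketch below assumes the least-squares loss with one possible explicit factorization ($\psi(c) = (c^2, -2c, 1)$, $\varphi(y) = (1, y, y^2)$, an illustrative choice) and uses random numbers as a stand-in for the weights $\alpha_{it}(x)$. It compares the objective computed through the feature maps with the one computed from the loss alone.

```python
import numpy as np

def psi(c):
    """Illustrative feature map of the candidate output c."""
    return np.array([c * c, -2.0 * c, 1.0])

def phi(y):
    """Illustrative feature map of the label y."""
    return np.array([1.0, y, y * y])

rng = np.random.default_rng(1)
y = rng.normal(size=5)       # labels y_it of one task (toy data)
alpha = rng.normal(size=5)   # stand-in for the weights alpha_it(x)
c = 0.3                      # a candidate output c_t

# Route 1: build g_hat(x) = sum_i alpha_i * phi(y_i) explicitly, then take <psi(c), g_hat(x)>.
g_hat = sum(a * phi(yi) for a, yi in zip(alpha, y))
with_features = psi(c) @ g_hat

# Route 2: use only the loss values, with no reference to psi, phi or H.
loss_only = np.sum(alpha * (c - y) ** 2)

print(with_features, loss_only)
```

Both routes give the same value, which is why the estimator can be evaluated from loss values and kernel weights alone.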
Empirical Results Synthetic data Inverse dynamics (Sarcos) Logic constraints (Ranking Movielens100k)