  1. Consistent Multitask Learning with Nonlinear Output Constraints Carlo Ciliberto Department of Computer Science, UCL joint work w/ Alessandro Rudi, Lorenzo Rosasco and Massi Pontil

  2. Multitask Learning (MTL). MTL mantra: leverage the similarities among multiple learning problems (tasks) to reduce the complexity of the overall learning process. Previous literature: investigated linear task relations (more on this in a minute). This work: we address the problem of learning multiple tasks that are nonlinearly related to one another.

  3. MTL Setting. Given T datasets S_t = (x_{it}, y_{it})_{i=1}^{n_t}, learn \hat f_t : X \to R by solving
     (\hat f_1, ..., \hat f_T) = argmin_{f_1, ..., f_T \in H} \frac{1}{T} \sum_{t=1}^T L(f_t, S_t) + R(f_1, ..., f_T)
     ◮ H is the space of hypotheses.
     ◮ L(f_t, S_t) = \frac{1}{n_t} \sum_{i=1}^{n_t} \ell(f_t(x_{it}), y_{it}) is the data-fitting term, with loss \ell : R \times R \to R (e.g. least squares, logistic, hinge).
     ◮ R(f_1, ..., f_T) is a joint task-structure regularizer.

  4. Previous Work: Linear MTL. For example:
     ◮ Single-task learning: R(f_1, ..., f_T) = \lambda \sum_{t=1}^T \| f_t \|_H^2
     ◮ Variance regularization: \lambda \sum_{t=1}^T \| f_t - \bar f \|_H^2 with \bar f = \frac{1}{T} \sum_{t=1}^T f_t
     ◮ Clustered tasks: \lambda_1 \sum_{c=1}^{|C|} \sum_{t \in C(c)} \| f_t - \bar f_c \|_H^2 + \lambda_2 \sum_{c=1}^{|C|} \| \bar f_c - \bar f \|_H^2
     ◮ Similarity regularizer: \lambda \sum_{t,s=1}^T W_{s,t} \| f_t - f_s \|_H^2 with W_{s,t} \geq 0
     Why "linear"? Because the task relations are encoded in a matrix:
     R(f_1, ..., f_T) = \sum_{t,s=1}^T A_{t,s} \langle f_t, f_s \rangle_H with A \in R^{T \times T}
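     As a quick check of the matrix form (a worked step, not on the original slide), the variance regularizer is itself such a quadratic form in the f_t's:
     \sum_{t=1}^T \| f_t - \bar f \|_H^2 = \sum_{t=1}^T \| f_t \|_H^2 - T \| \bar f \|_H^2 = \sum_{t,s=1}^T ( \delta_{ts} - \tfrac{1}{T} ) \langle f_t, f_s \rangle_H,
     i.e. it corresponds to the matrix A = I - \tfrac{1}{T} \mathbf{1}\mathbf{1}^\top.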

  5. Nonlinear MTL: Setting. What if the relations are nonlinear? We study the case where the tasks satisfy a set of k equations
     \gamma(f_1(x), ..., f_T(x)) = 0
     identified by a map \gamma : R^T \to R^k. Examples:
     ◮ Manifold-valued learning
     ◮ Physical systems (e.g. robotics)
     ◮ Logical constraints (e.g. ranking)
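     An illustrative instance (not from the slides): manifold-valued learning on the unit sphere, the setting of Thm. 3 below, corresponds to a single constraint (k = 1),
     \gamma(c) = \sum_{t=1}^T c_t^2 - 1, \qquad C = \{ c \in R^T \mid \gamma(c) = 0 \} = S^{T-1}.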

  6. Nonlinear MTL: Setting. NL-MTL goal: approximate f^* : X \to C, the minimizer of the expected risk
     min_{f : X \to C} E(f), \qquad E(f) = \frac{1}{T} \sum_{t=1}^T \int \ell(f_t(x), y) \, d\rho_t(x, y)
     where
     ◮ f : X \to C is such that f(x) = (f_1(x), ..., f_T(x)) for all x \in X.
     ◮ C = \{ c \in R^T \mid \gamma(c) = 0 \} is the constraint set induced by \gamma.
     ◮ \rho_t(x, y) = \rho_t(y|x) \rho_X(x) is the unknown data distribution.

  7. Nonlinear MTL: Challenges. Why not try Empirical Risk Minimization?
     \hat f = argmin_{f \in H \subset \{ f : X \to C \}} \frac{1}{T} \sum_{t=1}^T L(f_t, S_t)
     Problems:
     ◮ Modeling: f_1, f_2 : X \to C does not guarantee f_1 + f_2 : X \to C, so H is not a linear space. How to choose a "good" H in practice?
     ◮ Computations: hard (non-convex) optimization. How to solve it?
     ◮ Statistics: how to study the generalization properties of \hat f?

  8. Nonlinear MTL: a Structured Prediction Perspective. Idea: formulate NL-MTL as a structured prediction problem. Structured prediction was originally designed for discrete outputs, but has recently been generalized to any set C within the SELF framework [Ciliberto et al. 2016].

  9. Nonlinear MTL Estimator. We propose to approximate f^* via the estimator \hat f : X \to C such that
     \hat f(x) = argmin_{c \in C} \frac{1}{T} \sum_{t=1}^T \sum_{i=1}^{n_t} \alpha_{it}(x) \, \ell(c_t, y_{it})
     where the weights are obtained in closed form as
     (\alpha_{1t}(x), ..., \alpha_{n_t t}(x)) = (K_t + \lambda I)^{-1} v_t(x)
     with K_t the kernel matrix (K_t)_{ij} = k(x_{it}, x_{jt}) of the t-th dataset, v_t(x) \in R^{n_t} with v_t(x)_i = k(x_{it}, x), and k : X \times X \to R a kernel.
     Note: evaluating \hat f(x) requires solving an optimization over C (e.g. for the least-squares loss \hat f reduces to a projection onto C).
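     A minimal sketch of this estimator for the least-squares loss (not the authors' implementation; the Gaussian kernel, the names gaussian_kernel and nlmtl_predict, the default lam and sigma values, and the use of SciPy's generic constrained solver for the inner problem over C are all illustrative assumptions):

        import numpy as np
        from scipy.optimize import minimize

        def gaussian_kernel(A, B, sigma=1.0):
            # k(a, b) = exp(-||a - b||^2 / (2 sigma^2)); returns the (m x p) kernel matrix
            sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-sq / (2.0 * sigma ** 2))

        def nlmtl_predict(x, tasks, gamma, lam=1e-3, sigma=1.0):
            # x     : test point, 1-D array of shape (d,)
            # tasks : list of (X_t, y_t) with X_t of shape (n_t, d) and y_t of shape (n_t,)
            # gamma : callable R^T -> R^k encoding the constraint set C = {c : gamma(c) = 0}
            T = len(tasks)
            alphas = []
            for X_t, y_t in tasks:
                K_t = gaussian_kernel(X_t, X_t, sigma)                   # (K_t)_ij = k(x_it, x_jt)
                v_t = gaussian_kernel(X_t, x[None, :], sigma).ravel()    # v_t(x)_i = k(x_it, x)
                alphas.append(np.linalg.solve(K_t + lam * np.eye(len(y_t)), v_t))

            def objective(c):
                # (1/T) sum_t sum_i alpha_it(x) * (c_t - y_it)^2  (least-squares loss)
                return sum(a @ (c[t] - y) ** 2 for t, (a, (_, y)) in enumerate(zip(alphas, tasks))) / T

            c0 = np.array([a @ y for a, (_, y) in zip(alphas, tasks)])   # per-task kernel ridge predictions
            res = minimize(objective, c0, constraints={"type": "eq", "fun": gamma})  # inner problem over C
            return res.x

     For the unit-sphere constraint of Thm. 3 one could pass gamma = lambda c: np.sum(c ** 2) - 1.0; for the least-squares loss this inner problem is exactly the projection onto C mentioned on the slide.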

  10. Theoretical Results.
      Thm. 1 (Universal Consistency). E(\hat f) - E(f^*) \to 0 with probability 1.
      Thm. 2 (Rates). Let n = n_t and g^*_t \in G for all t = 1, ..., T. Then E(\hat f) - E(f^*) \leq O(n^{-1/4}) with high probability.
      Thm. 3 (Benefits of MTL). Let C \subset R^T be the radius-1 sphere and let N = nT. Then E(\hat f) - E(f^*) \leq O(N^{-1/2}) with high probability.

  11. Intuition Ok... but how did we get there?

  12. Structure Encoding Loss Function (SELF) [Ciliberto et al. 2016].
      Def. \ell : C \times Y \to R is a structure encoding loss function (SELF) if there exist a Hilbert space H and maps \psi : C \to H, \varphi : Y \to H such that
      \ell(c, y) = \langle \psi(c), \varphi(y) \rangle_H \quad for all c \in C, y \in Y.
      An abstract definition... BUT "most" loss functions used in MTL settings are SELF! More precisely, any loss that is Lipschitz continuous and differentiable almost everywhere (e.g. least squares, logistic, hinge).
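      A standard concrete instance (not spelled out on this slide): the least-squares loss on C = Y = R is SELF with H = R^3, since
      \ell(c, y) = (c - y)^2 = c^2 - 2cy + y^2 = \langle \psi(c), \varphi(y) \rangle_{R^3}, \qquad \psi(c) = (c^2, -2c, 1), \quad \varphi(y) = (1, y, y^2).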

  13. Nonlinear MTL + SELF. Minimizer of the expected risk:
      f^*(x) = argmin_{c \in C} \frac{1}{T} \sum_{t=1}^T \int \ell(c_t, y) \, d\rho_t(y|x)

  14. Nonlinear MTL + SELF. Minimizer of the expected risk:
      f^*(x) = argmin_{c \in C} \frac{1}{T} \sum_{t=1}^T \int \langle \psi(c_t), \varphi(y) \rangle_H \, d\rho_t(y|x)

  15. Nonlinear MTL + SELF. Minimizer of the expected risk:
      f^*(x) = argmin_{c \in C} \frac{1}{T} \sum_{t=1}^T \Big\langle \psi(c_t), \int \varphi(y) \, d\rho_t(y|x) \Big\rangle_H

  16. Nonlinear MTL + SELF. Minimizer of the expected risk:
      f^*(x) = argmin_{c \in C} \frac{1}{T} \sum_{t=1}^T \langle \psi(c_t), g^*_t(x) \rangle_H
      where g^*_t : X \to H is such that g^*_t(x) = \int \varphi(y) \, d\rho_t(y|x).
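      For the least-squares SELF decomposition above (an illustrative check, not on the slide, with expectations taken under \rho_t(\cdot|x)):
      g^*_t(x) = \int \varphi(y) \, d\rho_t(y|x) = (1, \; E[y|x], \; E[y^2|x]), \qquad \langle \psi(c_t), g^*_t(x) \rangle_H = (c_t - E[y|x])^2 + Var(y|x),
      so without the constraint set C each task would simply be solved by its conditional mean c_t = E[y|x].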

  17. Nonlinear MTL Estimator. Idea: learn a \hat g_t : X \to H for each g^*_t. Then approximate
      f^*(x) = argmin_{c \in C} \frac{1}{T} \sum_{t=1}^T \langle \psi(c_t), g^*_t(x) \rangle_H
      with \hat f : X \to C,
      \hat f(x) = argmin_{c \in C} \frac{1}{T} \sum_{t=1}^T \langle \psi(c_t), \hat g_t(x) \rangle_H

  18. Nonlinear MTL Estimator. This work: learn \hat g_t via kernel ridge regression. Let G (more precisely, G \otimes H) be a reproducing kernel Hilbert space with kernel k : X \times X \to R, and set
      \hat g_t = argmin_{g \in G} \frac{1}{n_t} \sum_{i=1}^{n_t} \| g(x_{it}) - \varphi(y_{it}) \|_H^2 + \lambda \| g \|_G^2
      Then
      \hat g_t(x) = \sum_{i=1}^{n_t} \alpha_{it}(x) \varphi(y_{it}), \qquad (\alpha_{1t}(x), ..., \alpha_{n_t t}(x)) = (K_t + \lambda I)^{-1} v_t(x)
      where K_t is the kernel matrix of the t-th dataset and v_t(x) \in R^{n_t} is the evaluation vector with v_t(x)_i = k(x_{it}, x).

  19. Nonlinear MTL Estimator. Plugging \hat g_t(x) = \sum_{i=1}^{n_t} \alpha_{it}(x) \varphi(y_{it}) into
      \hat f(x) = argmin_{c \in C} \frac{1}{T} \sum_{t=1}^T \langle \psi(c_t), \hat g_t(x) \rangle_H
      we have
      \hat f(x) = argmin_{c \in C} \frac{1}{T} \sum_{t=1}^T \Big\langle \psi(c_t), \sum_{i=1}^{n_t} \alpha_{it}(x) \varphi(y_{it}) \Big\rangle_H

  20. Nonlinear MTL Estimator. By linearity of the inner product,
      \hat f(x) = argmin_{c \in C} \frac{1}{T} \sum_{t=1}^T \sum_{i=1}^{n_t} \alpha_{it}(x) \, \langle \psi(c_t), \varphi(y_{it}) \rangle_H

  21. Nonlinear MTL Estimator. Finally, by the SELF property \ell(c, y) = \langle \psi(c), \varphi(y) \rangle_H,
      \hat f(x) = argmin_{c \in C} \frac{1}{T} \sum_{t=1}^T \sum_{i=1}^{n_t} \alpha_{it}(x) \, \ell(c_t, y_{it})
      as desired. Note that evaluating \hat f(x) does not require knowledge of H, \varphi or \psi!

  22. Empirical Results.
      ◮ Synthetic data
      ◮ Inverse dynamics (Sarcos)
      ◮ Logic constraints (ranking, Movielens100k)
