

  1. Localized Structured Prediction. Carlo Ciliberto¹, Francis Bach²³, Alessandro Rudi²³. ¹ Department of Electrical and Electronic Engineering, Imperial College London, London. ² Département d'informatique, École normale supérieure, PSL Research University, Paris. ³ INRIA, Paris, France.

  2. Supervised Learning 101.
• $\mathcal{X}$ input space, $\mathcal{Y}$ output space,
• $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ loss function,
• $\rho$ probability distribution on $\mathcal{X} \times \mathcal{Y}$.
Goal:
$$f^\star = \operatorname*{argmin}_{f : \mathcal{X} \to \mathcal{Y}} \; \mathbb{E}\big[\ell(f(x), y)\big],$$
given only the dataset $(x_i, y_i)_{i=1}^n$ sampled independently from $\rho$.

  3. Structured Prediction

  4–6. Prototypical Approach: Empirical Risk Minimization. Solve the problem:
$$\hat f = \operatorname*{argmin}_{f \in \mathcal{G}} \; \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i) + \lambda R(f),$$
where $\mathcal{G} \subseteq \{ f : \mathcal{X} \to \mathcal{Y} \}$ (usually a convex function space).
• If $\mathcal{Y}$ is a vector space: $\mathcal{G}$ is easy to choose and optimize over, e.g. (generalized) linear models, kernel methods, neural networks, etc. (a minimal sketch follows below).
• If $\mathcal{Y}$ is a "structured" space: how do we choose $\mathcal{G}$? How do we optimize over it?
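To make the "easy" vector-valued case concrete, here is a minimal sketch (not from the slides) of regularized ERM with the square loss and a linear model, i.e. ridge regression; the feature dimension, regularization value, and synthetic data are arbitrary choices for illustration.

```python
import numpy as np

def fit_linear_erm(X, Y, lam=0.1):
    """Regularized ERM with square loss and a linear model f(x) = W^T x:
    minimizes (1/n) * sum_i ||W^T x_i - y_i||^2 + lam * ||W||_F^2,
    whose closed-form solution is W = (X^T X / n + lam * I)^{-1} X^T Y / n."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ Y / n)

# Toy usage with synthetic data: X in R^{n x d}, Y in R^{n x k} (a vector space).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = X @ rng.normal(size=(5, 3)) + 0.1 * rng.normal(size=(100, 3))
W = fit_linear_erm(X, Y)
print("prediction for a new point:", rng.normal(size=5) @ W)
```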

  7. State of the Art: the Structured Case. $\mathcal{Y}$ arbitrary: how do we parametrize $\mathcal{G}$ and learn $\hat f$?
Surrogate approaches:
+ clear theory (e.g. convergence and learning rates),
− only for special cases (classification, ranking, multi-labeling, etc.) [Bartlett et al., 2006, Duchi et al., 2010, Mroueh et al., 2012].
Score learning techniques:
+ general algorithmic framework (e.g. StructSVM [Tsochantaridis et al., 2005]),
− limited theory (no consistency, see e.g. [Bakir et al., 2007]).

  8. Is it possible to have the best of both worlds? A general algorithmic framework + clear theory.

  9. Table of contents. 1. A General Framework for Structured Prediction [Ciliberto et al., 2016]. 2. Leveraging Local Structure [This Work].

  10. A General Framework for Structured Prediction

  11–12. Characterizing the Target Function. Recall
$$f^\star = \operatorname*{argmin}_{f : \mathcal{X} \to \mathcal{Y}} \; \mathbb{E}_{xy}\big[\ell(f(x), y)\big].$$
Pointwise characterization in terms of the conditional expectation:
$$f^\star(x) = \operatorname*{argmin}_{z \in \mathcal{Y}} \; \mathbb{E}_y\big[\ell(z, y) \mid x\big].$$
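This pointwise characterization is just the tower property of expectations: the risk splits into an outer expectation over $x$ of an inner conditional expectation, which can be minimized separately for each $x$:
$$\mathbb{E}_{xy}\big[\ell(f(x), y)\big] = \mathbb{E}_x\Big[\,\mathbb{E}_y\big[\ell(f(x), y) \mid x\big]\Big] \;\;\Longrightarrow\;\; f^\star(x) = \operatorname*{argmin}_{z \in \mathcal{Y}} \; \mathbb{E}_y\big[\ell(z, y) \mid x\big].$$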

  13. Deriving an Estimator. Idea: approximate
$$f^\star(x) = \operatorname*{argmin}_{z \in \mathcal{Y}} \; E(z, x), \qquad E(z, x) = \mathbb{E}_y\big[\ell(z, y) \mid x\big],$$
by means of an estimator $\widehat{E}(z, x)$ of the ideal $E(z, x)$:
$$\hat f(x) = \operatorname*{argmin}_{z \in \mathcal{Y}} \; \widehat{E}(z, x), \qquad \widehat{E}(z, x) \approx E(z, x).$$
Question: how do we choose $\widehat{E}(z, x)$ given the dataset $(x_i, y_i)_{i=1}^n$?

  14. Estimating the Conditional Expectation. Idea: for every $z$, perform "regression" on the values $\ell(z, \cdot)$:
$$\widehat{g}_z = \operatorname*{argmin}_{g : \mathcal{X} \to \mathbb{R}} \; \frac{1}{n} \sum_{i=1}^n L\big(g(x_i), \ell(z, y_i)\big) + \lambda R(g),$$
and then take $\widehat{E}(z, x) = \widehat{g}_z(x)$.
Questions:
• Models: how to choose $L$?
• Computations: do we need to compute $\widehat{g}_z$ for every $z \in \mathcal{Y}$?
• Theory: does $\widehat{E}(z, x) \to E(z, x)$? More generally, does $\hat f \to f^\star$?

  15. Square Loss! Let $L$ be the square loss. Then:
$$\widehat{g}_z = \operatorname*{argmin}_{g} \; \frac{1}{n} \sum_{i=1}^n \big(g(x_i) - \ell(z, y_i)\big)^2 + \lambda \|g\|^2.$$
In particular, for linear models $g(x) = \phi(x)^\top w$,
$$\widehat{g}_z(x) = \phi(x)^\top \widehat{w}_z, \qquad \widehat{w}_z = \operatorname*{argmin}_{w} \; \frac{1}{n}\|A w - b\|^2 + \lambda \|w\|^2,$$
with $A = [\phi(x_1), \ldots, \phi(x_n)]^\top$ and $b = [\ell(z, y_1), \ldots, \ell(z, y_n)]^\top$.

  16. Computing the $\widehat{g}_z$ All at Once. Closed-form solution:
$$\widehat{g}_z(x) = \phi(x)^\top \widehat{w}_z = \phi(x)^\top (A^\top A + \lambda n I)^{-1} A^\top b = \alpha(x)^\top b, \qquad \alpha_i(x) = \phi(x)^\top (A^\top A + \lambda n I)^{-1} \phi(x_i).$$
In particular, we can compute $\alpha(x)$ only once (independently of $z$). Then, for any $z$,
$$\widehat{g}_z(x) = \sum_{i=1}^n \alpha_i(x)\, b_i = \sum_{i=1}^n \alpha_i(x)\, \ell(z, y_i).$$

  17. Structured Prediction Algorithm.
Input: dataset $(x_i, y_i)_{i=1}^n$.
Training: for $i = 1, \ldots, n$, compute $v_i = (A^\top A + \lambda n I)^{-1} \phi(x_i)$.
Prediction: given a new test point $x$, compute $\alpha_i(x) = \phi(x)^\top v_i$. Then
$$\hat f(x) = \operatorname*{argmin}_{z \in \mathcal{Y}} \; \sum_{i=1}^n \alpha_i(x)\, \ell(z, y_i).$$
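To make the training/prediction split concrete, here is a minimal sketch (not the authors' code) of the algorithm in its kernelized form, where $\phi(x)^\top \phi(x') = k(x, x')$ and the weights become $\alpha(x) = (K + \lambda n I)^{-1} k_x$ with $K$ the training Gram matrix. The Gaussian kernel, the brute-force inference over a small finite candidate set, and the Hamming-style loss on binary tuples are illustrative choices, not part of the slides.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    """Gram matrix k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

class StructuredPredictor:
    """Kernelized estimator: alpha(x) = (K + lam * n * I)^{-1} k_x."""

    def __init__(self, loss, lam=0.1, sigma=1.0):
        self.loss, self.lam, self.sigma = loss, lam, sigma

    def fit(self, X, Y):
        self.X_tr, self.Y_tr = X, list(Y)
        n = len(X)
        K = gaussian_kernel(X, X, self.sigma)
        # Training reduces to one linear solve, independent of the output structure.
        self.W = np.linalg.inv(K + self.lam * n * np.eye(n))
        return self

    def predict(self, x, candidates):
        # Weights alpha_i(x), computed once per test point (independently of z).
        alpha = self.W @ gaussian_kernel(self.X_tr, x[None, :], self.sigma)[:, 0]
        # Inference: brute-force argmin of sum_i alpha_i(x) * loss(z, y_i).
        scores = [sum(a * self.loss(z, y) for a, y in zip(alpha, self.Y_tr))
                  for z in candidates]
        return candidates[int(np.argmin(scores))]

# Toy usage: outputs are binary pairs, loss = Hamming distance (illustrative only).
hamming = lambda z, y: sum(zi != yi for zi, yi in zip(z, y))
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = [tuple((x[:2] > 0).astype(int)) for x in X]          # "structured" labels
model = StructuredPredictor(hamming, lam=0.05).fit(X, Y)
cands = [(a, b) for a in (0, 1) for b in (0, 1)]
print(model.predict(rng.normal(size=3), cands))
```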

  18–19. The Proposed Structured Prediction Algorithm. Back to the questions:
• Models: how to choose $L$? Square loss!
• Computations: do we need to compute $\widehat{g}_z$ for every $z \in \mathcal{Y}$? No need, compute them all at once!
• Theory: does $\hat f \to f^\star$? Yes!
Theorem (Rates, [Ciliberto et al., 2016]). Under mild assumptions on $\ell$, let $\lambda = n^{-1/2}$; then, with high probability,
$$\mathbb{E}\big[\ell(\hat f(x), y) - \ell(f^\star(x), y)\big] \;\leq\; O(n^{-1/4}).$$

  20. A General Framework for Structured Prediction (General Algorithm + Theory). Is it possible to have the best of both worlds? Yes! We introduced an algorithmic framework for structured prediction:
• Directly applicable to a wide family of problems ($\mathcal{Y}$, $\ell$).
• With strong theoretical guarantees.
• Recovering many existing algorithms (not seen here).

  21. What Am I Hiding?
• Theory. The key assumption to achieve consistency and rates is that $\ell$ is a Structure Encoding Loss Function (SELF):
$$\ell(z, y) = \langle \psi(z), \varphi(y) \rangle_{\mathcal{H}} \quad \forall\, z, y \in \mathcal{Y},$$
with $\psi, \varphi : \mathcal{Y} \to \mathcal{H}$ continuous maps into a Hilbert space $\mathcal{H}$.
  – Similar to the characterization of reproducing kernels.
  – In principle hard to verify; however, lots of ML losses satisfy it (see the finite-output example below).
• Computations. We need to solve an optimization problem at prediction time!
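As a quick illustration of why many losses satisfy the condition (a standard observation, not spelled out on the slide): if $\mathcal{Y} = \{1, \ldots, M\}$ is finite, any loss is SELF. Take $\mathcal{H} = \mathbb{R}^M$, let $L \in \mathbb{R}^{M \times M}$ be the loss matrix with entries $L_{zy} = \ell(z, y)$, and set
$$\psi(z) = e_z, \qquad \varphi(y) = L\, e_y, \qquad \text{so that} \quad \langle \psi(z), \varphi(y) \rangle_{\mathbb{R}^M} = e_z^\top L\, e_y = \ell(z, y).$$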

  22. Prediction: The Inference Problem. Solving an optimization problem at prediction time is standard practice in structured prediction, known as the inference problem:
$$\hat f(x) = \operatorname*{argmin}_{z \in \mathcal{Y}} \; \widehat{E}(x, z) \qquad \leadsto \qquad \hat f(x) = \operatorname*{argmin}_{z \in \mathcal{Y}} \; \sum_{i=1}^n \alpha_i(x)\, \ell(z, y_i).$$
It is *very* problem dependent. In our case it is reminiscent of a weighted barycenter.

  23. Example: Learning to Rank. Goal: given a query $x$, order a set of documents $d_1, \ldots, d_k$ according to their relevance scores $y_1, \ldots, y_k$ w.r.t. $x$.
Pairwise loss:
$$\ell_{\mathrm{rank}}(f(x), y) = \sum_{i,j=1}^k (y_i - y_j)\, \mathrm{sign}\big(f(x)_i - f(x)_j\big).$$
It can be shown that computing $\hat f(x) = \operatorname*{argmin}_{z \in \mathcal{Y}} \sum_{i=1}^n \alpha_i(x)\, \ell(z, y_i)$ is a Minimum Feedback Arc Set problem on DAGs (NP-hard!). Still, approximate solutions can improve upon non-consistent approaches.
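For concreteness, the pairwise quantity above written as code, with a plain double loop over document pairs (the relevance and score vectors are toy data; sign conventions for turning this into a loss to minimize vary across papers):

```python
import numpy as np

def pairwise_rank_loss(f_x, y):
    """Pairwise ranking quantity from the slide:
    sum_{i,j} (y_i - y_j) * sign(f(x)_i - f(x)_j)."""
    f_x, y = np.asarray(f_x, float), np.asarray(y, float)
    total = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            total += (y[i] - y[j]) * np.sign(f_x[i] - f_x[j])
    return total

# Toy example: 4 documents with relevances y and predicted scores f(x).
y = [3, 1, 0, 2]
print(pairwise_rank_loss([0.9, 0.1, 0.0, 0.5], y))  # predicted order agrees with y
print(pairwise_rank_loss([0.0, 0.1, 0.9, 0.5], y))  # predicted order mostly disagrees
```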

  24. Additional Work.
Case studies:
• Learning to rank [Korba et al., 2018]
• Output Fisher embeddings [Djerrab et al., 2018]
• $\mathcal{Y}$ = manifolds, $\ell$ = geodesic distance [Rudi et al., 2018]
• $\mathcal{Y}$ = probability space, $\ell$ = Wasserstein distance [Luise et al., 2018]
Refinements of the analysis:
• Alternative derivations [Osokin et al., 2017]
• Discrete losses [Nowak-Vila et al., 2018, Struminsky et al., 2018]
Extensions:
• Application to multitask learning [Ciliberto et al., 2017]
• Beyond the least squares surrogate [Nowak-Vila et al., 2019]
• Regularizing with the trace norm [Luise et al., 2019]

  25. Predicting Probability Distributions [Luise, Rudi, Pontil, Ciliberto '18].
Setting: $\mathcal{Y} = \mathcal{P}(\mathbb{R}^d)$, probability distributions on $\mathbb{R}^d$.
Loss: Wasserstein distance,
$$\ell(\mu, \nu) = \min_{\tau \in \Pi(\mu, \nu)} \int \|z - y\|^2 \, d\tau(z, y).$$
Application: digit reconstruction.
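For intuition about this loss (a sketch not taken from the slides): in one dimension, between two empirical measures with the same number of equally weighted samples, the optimal coupling simply matches sorted samples, so the squared 2-Wasserstein distance reduces to a sort:

```python
import numpy as np

def w2_squared_1d(mu_samples, nu_samples):
    """Squared 2-Wasserstein distance between two 1-D empirical measures
    with equally many, equally weighted samples: the optimal coupling
    matches the i-th smallest sample of mu to the i-th smallest of nu."""
    a = np.sort(np.asarray(mu_samples, float))
    b = np.sort(np.asarray(nu_samples, float))
    assert a.shape == b.shape, "this shortcut needs equally sized sample sets"
    return float(np.mean((a - b) ** 2))

# Toy example: two small point clouds on the real line.
print(w2_squared_1d([0.0, 1.0, 2.0], [0.5, 1.5, 2.5]))  # -> 0.25
```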

  26. Manifold Regression [Rudi, Ciliberto, Marconi, Rosasco '18].
Setting: $\mathcal{Y}$ Riemannian manifold.
Loss: (squared) geodesic distance.
Optimization: Riemannian gradient descent.
Applications: fingerprint reconstruction ($\mathcal{Y} = S^1$ sphere) and multi-labeling ($\mathcal{Y}$ a statistical manifold).

  27. Nonlinear Multi-task Learning [Ciliberto, Rudi, Rosasco, Pontil '17; Luise, Stamos, Pontil, Ciliberto '19].
Idea: instead of solving multiple learning problems (tasks) separately, leverage the potential relations among them.
Previous methods: only impose/learn linear task relations; unable to cope with non-linear constraints (e.g. ranking, robotics, etc.).
MTL + structured prediction: interpret multiple tasks as separate outputs and impose constraints as structure on the joint output.

  28. Leveraging local structure

  29. Local Structure

  30. Motivating Example (Between-Locality). Super-resolution: learn $f$ : low res → high res. However:
• Very large output sets (high sample complexity).
• Local info might be sufficient to predict the output.
