

  1. Localized Structured Prediction. Carlo Ciliberto¹, Francis Bach²³, Alessandro Rudi²³. ¹ Department of Electrical and Electronic Engineering, Imperial College London, London. ² Département d'informatique, École normale supérieure, PSL Research University, Paris. ³ INRIA, Paris, France.

  2. Supervised Learning 101.
• $\mathcal{X}$ input space, $\mathcal{Y}$ output space,
• $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ loss function,
• $\rho$ probability distribution on $\mathcal{X} \times \mathcal{Y}$.
Goal:
$$f^\star = \operatorname*{argmin}_{f : \mathcal{X} \to \mathcal{Y}} \; \mathbb{E}\big[\ell(f(x), y)\big],$$
given only the dataset $(x_i, y_i)_{i=1}^n$ sampled independently from $\rho$.

  3. Structured Prediction

  4–6. Prototypical Approach: Empirical Risk Minimization. Solve the problem:
$$\hat f = \operatorname*{argmin}_{f \in \mathcal{G}} \; \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i) + \lambda R(f),$$
where $\mathcal{G} \subseteq \{ f : \mathcal{X} \to \mathcal{Y} \}$ (usually a convex function space).
• If $\mathcal{Y}$ is a vector space: $\mathcal{G}$ is easy to choose and optimize over, e.g. (generalized) linear models, kernel methods, neural networks, etc. (a minimal sketch follows below).
• If $\mathcal{Y}$ is a "structured" space: how do we choose $\mathcal{G}$? How do we optimize over it?
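To make the "easy" vector-valued case concrete, here is a minimal sketch (not from the slides) of regularized ERM with the square loss and a linear model, i.e. ridge regression; the feature dimension, regularization value, and synthetic data are arbitrary choices for illustration.

```python
import numpy as np

def fit_linear_erm(X, Y, lam=0.1):
    """Regularized ERM with square loss and a linear model f(x) = W^T x:
    minimizes (1/n) * sum_i ||W^T x_i - y_i||^2 + lam * ||W||_F^2,
    whose closed-form solution is W = (X^T X / n + lam * I)^{-1} X^T Y / n."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ Y / n)

# Toy usage with synthetic data: X in R^{n x d}, Y in R^{n x k} (a vector space).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = X @ rng.normal(size=(5, 3)) + 0.1 * rng.normal(size=(100, 3))
W = fit_linear_erm(X, Y)
print("prediction for a new point:", rng.normal(size=5) @ W)
```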

  7. State of the Art: the Structured Case. $\mathcal{Y}$ arbitrary: how do we parametrize $\mathcal{G}$ and learn $\hat f$?
Surrogate approaches:
+ clear theory (e.g. convergence and learning rates),
− only for special cases (classification, ranking, multi-labeling, etc.) [Bartlett et al., 2006, Duchi et al., 2010, Mroueh et al., 2012].
Score learning techniques:
+ general algorithmic framework (e.g. StructSVM [Tsochantaridis et al., 2005]),
− limited theory (no consistency, see e.g. [Bakir et al., 2007]).

  8. Is it possible to have the best of both worlds? A general algorithmic framework + clear theory.

  9. Table of contents. 1. A General Framework for Structured Prediction [Ciliberto et al., 2016]. 2. Leveraging Local Structure [This Work].

  10. A General Framework for Structured Prediction

  11–12. Characterizing the Target Function. Recall
$$f^\star = \operatorname*{argmin}_{f : \mathcal{X} \to \mathcal{Y}} \; \mathbb{E}_{xy}\big[\ell(f(x), y)\big].$$
Pointwise characterization in terms of the conditional expectation:
$$f^\star(x) = \operatorname*{argmin}_{z \in \mathcal{Y}} \; \mathbb{E}_y\big[\ell(z, y) \mid x\big].$$
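This pointwise characterization is just the tower property of expectations: the risk splits into an outer expectation over $x$ of an inner conditional expectation, which can be minimized separately for each $x$:
$$\mathbb{E}_{xy}\big[\ell(f(x), y)\big] = \mathbb{E}_x\Big[\,\mathbb{E}_y\big[\ell(f(x), y) \mid x\big]\Big] \;\;\Longrightarrow\;\; f^\star(x) = \operatorname*{argmin}_{z \in \mathcal{Y}} \; \mathbb{E}_y\big[\ell(z, y) \mid x\big].$$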

  13. Deriving an Estimator. Idea: approximate
$$f^\star(x) = \operatorname*{argmin}_{z \in \mathcal{Y}} \; E(z, x), \qquad E(z, x) = \mathbb{E}_y\big[\ell(z, y) \mid x\big],$$
by means of an estimator $\widehat{E}(z, x)$ of the ideal $E(z, x)$:
$$\hat f(x) = \operatorname*{argmin}_{z \in \mathcal{Y}} \; \widehat{E}(z, x), \qquad \widehat{E}(z, x) \approx E(z, x).$$
Question: how do we choose $\widehat{E}(z, x)$ given the dataset $(x_i, y_i)_{i=1}^n$?

  14. Estimating the Conditional Expectation. Idea: for every $z$, perform "regression" on the values $\ell(z, \cdot)$:
$$\widehat{g}_z = \operatorname*{argmin}_{g : \mathcal{X} \to \mathbb{R}} \; \frac{1}{n} \sum_{i=1}^n L\big(g(x_i), \ell(z, y_i)\big) + \lambda R(g),$$
and then take $\widehat{E}(z, x) = \widehat{g}_z(x)$.
Questions:
• Models: how to choose $L$?
• Computations: do we need to compute $\widehat{g}_z$ for every $z \in \mathcal{Y}$?
• Theory: does $\widehat{E}(z, x) \to E(z, x)$? More generally, does $\hat f \to f^\star$?

  15. Square Loss! Let $L$ be the square loss. Then:
$$\widehat{g}_z = \operatorname*{argmin}_{g} \; \frac{1}{n} \sum_{i=1}^n \big(g(x_i) - \ell(z, y_i)\big)^2 + \lambda \|g\|^2.$$
In particular, for linear models $g(x) = \phi(x)^\top w$,
$$\widehat{g}_z(x) = \phi(x)^\top \widehat{w}_z, \qquad \widehat{w}_z = \operatorname*{argmin}_{w} \; \frac{1}{n}\|A w - b\|^2 + \lambda \|w\|^2,$$
with $A = [\phi(x_1), \ldots, \phi(x_n)]^\top$ and $b = [\ell(z, y_1), \ldots, \ell(z, y_n)]^\top$.

  16. Computing the $\widehat{g}_z$ All at Once. Closed-form solution:
$$\widehat{g}_z(x) = \phi(x)^\top \widehat{w}_z = \phi(x)^\top (A^\top A + \lambda n I)^{-1} A^\top b = \alpha(x)^\top b, \qquad \alpha_i(x) = \phi(x)^\top (A^\top A + \lambda n I)^{-1} \phi(x_i).$$
In particular, we can compute $\alpha(x)$ only once (independently of $z$). Then, for any $z$,
$$\widehat{g}_z(x) = \sum_{i=1}^n \alpha_i(x)\, b_i = \sum_{i=1}^n \alpha_i(x)\, \ell(z, y_i).$$

  17. Structured Prediction Algorithm.
Input: dataset $(x_i, y_i)_{i=1}^n$.
Training: for $i = 1, \ldots, n$, compute $v_i = (A^\top A + \lambda n I)^{-1} \phi(x_i)$.
Prediction: given a new test point $x$, compute $\alpha_i(x) = \phi(x)^\top v_i$. Then
$$\hat f(x) = \operatorname*{argmin}_{z \in \mathcal{Y}} \; \sum_{i=1}^n \alpha_i(x)\, \ell(z, y_i).$$
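To make the training/prediction split concrete, here is a minimal sketch (not the authors' code) of the algorithm in its kernelized form, where $\phi(x)^\top \phi(x') = k(x, x')$ and the weights become $\alpha(x) = (K + \lambda n I)^{-1} k_x$ with $K$ the training Gram matrix. The Gaussian kernel, the brute-force inference over a small finite candidate set, and the Hamming-style loss on binary tuples are illustrative choices, not part of the slides.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    """Gram matrix k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

class StructuredPredictor:
    """Kernelized estimator: alpha(x) = (K + lam * n * I)^{-1} k_x."""

    def __init__(self, loss, lam=0.1, sigma=1.0):
        self.loss, self.lam, self.sigma = loss, lam, sigma

    def fit(self, X, Y):
        self.X_tr, self.Y_tr = X, list(Y)
        n = len(X)
        K = gaussian_kernel(X, X, self.sigma)
        # Training reduces to one linear solve, independent of the output structure.
        self.W = np.linalg.inv(K + self.lam * n * np.eye(n))
        return self

    def predict(self, x, candidates):
        # Weights alpha_i(x), computed once per test point (independently of z).
        alpha = self.W @ gaussian_kernel(self.X_tr, x[None, :], self.sigma)[:, 0]
        # Inference: brute-force argmin of sum_i alpha_i(x) * loss(z, y_i).
        scores = [sum(a * self.loss(z, y) for a, y in zip(alpha, self.Y_tr))
                  for z in candidates]
        return candidates[int(np.argmin(scores))]

# Toy usage: outputs are binary pairs, loss = Hamming distance (illustrative only).
hamming = lambda z, y: sum(zi != yi for zi, yi in zip(z, y))
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = [tuple((x[:2] > 0).astype(int)) for x in X]          # "structured" labels
model = StructuredPredictor(hamming, lam=0.05).fit(X, Y)
cands = [(a, b) for a in (0, 1) for b in (0, 1)]
print(model.predict(rng.normal(size=3), cands))
```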

  18–19. The Proposed Structured Prediction Algorithm. Back to the questions:
• Models: how to choose $L$? Square loss!
• Computations: do we need to compute $\widehat{g}_z$ for every $z \in \mathcal{Y}$? No need, compute them all at once!
• Theory: does $\hat f \to f^\star$? Yes!
Theorem (Rates, [Ciliberto et al., 2016]). Under mild assumptions on $\ell$, let $\lambda = n^{-1/2}$; then, with high probability,
$$\mathbb{E}\big[\ell(\hat f(x), y) - \ell(f^\star(x), y)\big] \;\leq\; O(n^{-1/4}).$$

  20. A General Framework for Structured Prediction (General Algorithm + Theory). Is it possible to have the best of both worlds? Yes! We introduced an algorithmic framework for structured prediction:
• Directly applicable to a wide family of problems ($\mathcal{Y}$, $\ell$).
• With strong theoretical guarantees.
• Recovering many existing algorithms (not seen here).

  21. What Am I Hiding?
• Theory. The key assumption to achieve consistency and rates is that $\ell$ is a Structure Encoding Loss Function (SELF):
$$\ell(z, y) = \langle \psi(z), \varphi(y) \rangle_{\mathcal{H}} \quad \forall\, z, y \in \mathcal{Y},$$
with $\psi, \varphi : \mathcal{Y} \to \mathcal{H}$ continuous maps into a Hilbert space $\mathcal{H}$.
  – Similar to the characterization of reproducing kernels.
  – In principle hard to verify; however, lots of ML losses satisfy it (see the finite-output example below).
• Computations. We need to solve an optimization problem at prediction time!
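As a quick illustration of why many losses satisfy the condition (a standard observation, not spelled out on the slide): if $\mathcal{Y} = \{1, \ldots, M\}$ is finite, any loss is SELF. Take $\mathcal{H} = \mathbb{R}^M$, let $L \in \mathbb{R}^{M \times M}$ be the loss matrix with entries $L_{zy} = \ell(z, y)$, and set
$$\psi(z) = e_z, \qquad \varphi(y) = L\, e_y, \qquad \text{so that} \quad \langle \psi(z), \varphi(y) \rangle_{\mathbb{R}^M} = e_z^\top L\, e_y = \ell(z, y).$$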

  22. Prediction: The Inference Problem. Solving an optimization problem at prediction time is standard practice in structured prediction, known as the inference problem:
$$\hat f(x) = \operatorname*{argmin}_{z \in \mathcal{Y}} \; \widehat{E}(x, z) \qquad \leadsto \qquad \hat f(x) = \operatorname*{argmin}_{z \in \mathcal{Y}} \; \sum_{i=1}^n \alpha_i(x)\, \ell(z, y_i).$$
It is *very* problem dependent. In our case it is reminiscent of a weighted barycenter.

  23. Example: Learning to Rank. Goal: given a query $x$, order a set of documents $d_1, \ldots, d_k$ according to their relevance scores $y_1, \ldots, y_k$ w.r.t. $x$.
Pairwise loss:
$$\ell_{\mathrm{rank}}(f(x), y) = \sum_{i,j=1}^k (y_i - y_j)\, \mathrm{sign}\big(f(x)_i - f(x)_j\big).$$
It can be shown that computing $\hat f(x) = \operatorname*{argmin}_{z \in \mathcal{Y}} \sum_{i=1}^n \alpha_i(x)\, \ell(z, y_i)$ is a Minimum Feedback Arc Set problem on DAGs (NP-hard!). Still, approximate solutions can improve upon non-consistent approaches.
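For concreteness, the pairwise quantity above written as code, with a plain double loop over document pairs (the relevance and score vectors are toy data; sign conventions for turning this into a loss to minimize vary across papers):

```python
import numpy as np

def pairwise_rank_loss(f_x, y):
    """Pairwise ranking quantity from the slide:
    sum_{i,j} (y_i - y_j) * sign(f(x)_i - f(x)_j)."""
    f_x, y = np.asarray(f_x, float), np.asarray(y, float)
    total = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            total += (y[i] - y[j]) * np.sign(f_x[i] - f_x[j])
    return total

# Toy example: 4 documents with relevances y and predicted scores f(x).
y = [3, 1, 0, 2]
print(pairwise_rank_loss([0.9, 0.1, 0.0, 0.5], y))  # predicted order agrees with y
print(pairwise_rank_loss([0.0, 0.1, 0.9, 0.5], y))  # predicted order mostly disagrees
```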

  24. Additional Work.
Case studies:
• Learning to rank [Korba et al., 2018]
• Output Fisher embeddings [Djerrab et al., 2018]
• $\mathcal{Y}$ = manifolds, $\ell$ = geodesic distance [Rudi et al., 2018]
• $\mathcal{Y}$ = probability space, $\ell$ = Wasserstein distance [Luise et al., 2018]
Refinements of the analysis:
• Alternative derivations [Osokin et al., 2017]
• Discrete losses [Nowak-Vila et al., 2018, Struminsky et al., 2018]
Extensions:
• Application to multitask learning [Ciliberto et al., 2017]
• Beyond the least squares surrogate [Nowak-Vila et al., 2019]
• Regularizing with the trace norm [Luise et al., 2019]

  25. Predicting Probability Distributions [Luise, Rudi, Pontil, Ciliberto '18].
Setting: $\mathcal{Y} = \mathcal{P}(\mathbb{R}^d)$, probability distributions on $\mathbb{R}^d$.
Loss: Wasserstein distance,
$$\ell(\mu, \nu) = \min_{\tau \in \Pi(\mu, \nu)} \int \|z - y\|^2 \, d\tau(z, y).$$
Application: digit reconstruction.
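For intuition about this loss (a sketch not taken from the slides): in one dimension, between two empirical measures with the same number of equally weighted samples, the optimal coupling simply matches sorted samples, so the squared 2-Wasserstein distance reduces to a sort:

```python
import numpy as np

def w2_squared_1d(mu_samples, nu_samples):
    """Squared 2-Wasserstein distance between two 1-D empirical measures
    with equally many, equally weighted samples: the optimal coupling
    matches the i-th smallest sample of mu to the i-th smallest of nu."""
    a = np.sort(np.asarray(mu_samples, float))
    b = np.sort(np.asarray(nu_samples, float))
    assert a.shape == b.shape, "this shortcut needs equally sized sample sets"
    return float(np.mean((a - b) ** 2))

# Toy example: two small point clouds on the real line.
print(w2_squared_1d([0.0, 1.0, 2.0], [0.5, 1.5, 2.5]))  # -> 0.25
```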

  26. Manifold Regression [Rudi, Ciliberto, Marconi, Rosasco '18].
Setting: $\mathcal{Y}$ Riemannian manifold.
Loss: (squared) geodesic distance.
Optimization: Riemannian gradient descent.
Applications: fingerprint reconstruction ($\mathcal{Y} = S^1$ sphere) and multi-labeling ($\mathcal{Y}$ a statistical manifold).

  27. Nonlinear Multi-task Learning [Ciliberto, Rudi, Rosasco, Pontil '17; Luise, Stamos, Pontil, Ciliberto '19].
Idea: instead of solving multiple learning problems (tasks) separately, leverage the potential relations among them.
Previous methods: only impose/learn linear task relations; unable to cope with non-linear constraints (e.g. ranking, robotics, etc.).
MTL + structured prediction: interpret multiple tasks as separate outputs and impose constraints as structure on the joint output.

  28. Leveraging local structure

  29. Local Structure

  30. Motivating Example (Between-Locality). Super-resolution: learn $f$ : low res → high res. However:
• Very large output sets (high sample complexity).
• Local info might be sufficient to predict the output.
