Structured Prediction via Implicit Embeddings
Alessandro Rudi
Imaging and Machine Learning, April 1st, Paris
Inria, École normale supérieure
In collaboration with: Carlo Ciliberto, Lorenzo Rosasco, Francis Bach

Structured Prediction


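Supervised Learning
• $\mathcal{X}$ input space, $\mathcal{Y}$ output space,
• $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ loss function,
• $\rho$ probability on $\mathcal{X} \times \mathcal{Y}$.
$f^\star = \operatorname{argmin}_{f : \mathcal{X} \to \mathcal{Y}} \mathcal{E}(f), \qquad \mathcal{E}(f) := \mathbb{E}[\ell(f(x), y)],$
given only the dataset $(x_i, y_i)_{i=1}^n$ sampled independently from $\rho$.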

Supervised learning: Goal
Given the dataset $(x_i, y_i)_{i=1}^n$ sampled independently from $\rho$, produce $\hat f_n$ such that
Consistency: $\lim_{n \to \infty} \mathcal{E}(\hat f_n) = \mathcal{E}(f^\star)$, a.s.
Learning rates: $\mathcal{E}(\hat f_n) - \mathcal{E}(f^\star) \le c(n)$, w.h.p.

State of the art: Vector-valued case
$\mathcal{Y}$ is a vector space
• choose suitable $\mathcal{G} \subseteq \{ f : \mathcal{X} \to \mathcal{Y} \}$ (usually a convex function space)
• solve empirical risk minimization
$\hat f = \operatorname{argmin}_{f \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i) + \lambda R(f).$
• Well known methods: Linear models, generalized linear models, Kernel machines, Kernel SVM. Easy to optimize.
• Consistency and (optimal) learning rates for many losses

State of the art: Structured case
$\mathcal{Y}$ arbitrary: how do we parametrize $\mathcal{G}$ and learn $\hat f$?
Surrogate approaches
+ Clear theory
- Only for special cases (e.g. classification, ranking, multi-labeling etc.) [Bartlett et al ’06, Duchi et al ’10, Mroueh et al ’12, Gao et al. ’13]
Score learning techniques
+ General algorithmic framework (e.g. StructSVM [Tsochantaridis et al ’05])
- Limited Theory ([McAllester ’06])

Supervised learning with structure
Is it possible to
(a) have the best of both worlds? (general algorithmic framework with clear theory)
(b) learn leveraging the local structure of the input and the output?
We will address (a), (b) using implicit embeddings (related techniques: Cortes et al. 2005; Geurts, Wehenkel, d’Alché Buc ’06; Kadri et al. ’13; Brouard, Szafranski, d’Alché Buc ’16)

Table of contents
1. Structured learning with implicit embeddings
2. Algorithm and properties
3. Leveraging local structure

Structured learning with implicit embeddings

Characterizing the target function
Pointwise characterization
$f^\star = \operatorname{argmin}_{f : \mathcal{X} \to \mathcal{Y}} \mathbb{E}[\ell(f(x), y)].$
$f^\star(x) = \operatorname{argmin}_{y' \in \mathcal{Y}} \mathbb{E}[\ell(y', y) \mid x]$

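Characterizing the target function
Let $\tilde f(x) = \operatorname{argmin}_{y' \in \mathcal{Y}} \mathbb{E}[\ell(y', y) \mid x]$. Then
$\mathbb{E}[\ell(\tilde f(x), y)] = \mathbb{E}_x[\mathbb{E}[\ell(\tilde f(x), y) \mid x]] = \mathbb{E}_x[\inf_{y' \in \mathcal{Y}} \mathbb{E}[\ell(y', y) \mid x]] \le \mathbb{E}[\ell(f(x), y)], \quad \forall f : \mathcal{X} \to \mathcal{Y}.$
Then $\mathcal{E}(\tilde f) = \inf_{f : \mathcal{X} \to \mathcal{Y}} \mathcal{E}(f)$ (measurability issues solved via Berge maximum theory for measurable functions).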

Implicit embedding
A1. There exist a Hilbert space $\mathcal{H}$ and $\psi, \phi : \mathcal{Y} \to \mathcal{H}$, bounded and continuous, such that
$\ell(y', y) := \langle \psi(y'), \phi(y) \rangle.$
Theorem (Ciliberto, Rosasco, Rudi ’16)
A1 is satisfied
1. for any loss $\ell$ when $\mathcal{Y}$ is a discrete space
2. for any smooth loss $\ell$ when $\mathcal{Y} \subset \mathbb{R}^d$ is compact
3. for any smooth loss $\ell$ when $\mathcal{Y} \subseteq \mathcal{M}$ with $\mathcal{M}$ a compact manifold
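To make case 1 concrete, here is a minimal sketch for a finite output set. It assumes one particular (and by no means unique) choice of embedding: $\mathcal{H} = \mathbb{R}^{|\mathcal{Y}|}$, $\psi(y')$ equal to the row of the loss matrix indexed by $y'$, and $\phi(y)$ equal to the canonical basis vector $e_y$. The loss matrix `L` and the names `psi`, `phi` are illustrative, not taken from the talk.

```python
import numpy as np

# Hypothetical finite output set Y = {0, 1, 2} with an arbitrary, asymmetric loss.
# L[a, b] is the loss of predicting a when the true output is b.
L = np.array([[0.0, 1.0, 4.0],
              [2.0, 0.0, 1.0],
              [3.0, 0.5, 0.0]])

def psi(y_pred):
    # psi(y') is the row of the loss matrix indexed by the prediction y'.
    return L[y_pred]

def phi(y_true):
    # phi(y) is the canonical basis vector e_y in R^{|Y|}.
    e = np.zeros(L.shape[1])
    e[y_true] = 1.0
    return e

# Rebuild the loss from inner products <psi(y'), phi(y)> for every pair (y', y).
recovered = np.array([[psi(a) @ phi(b) for b in range(3)] for a in range(3)])
```

With this choice the inner product simply selects the entry `L[y', y]`, so any loss on a finite set fits A1 without symmetry or metric assumptions.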

Idea for a unified approach
When A1 holds
$f^\star(x) = \operatorname{argmin}_{y' \in \mathcal{Y}} \mathbb{E}[\ell(y', y) \mid x]$
$\phantom{f^\star(x)} = \operatorname{argmin}_{y' \in \mathcal{Y}} \mathbb{E}[\langle \psi(y'), \phi(y) \rangle \mid x]$
$\phantom{f^\star(x)} = \operatorname{argmin}_{y' \in \mathcal{Y}} \langle \psi(y'), \mathbb{E}[\phi(y) \mid x] \rangle$
$\phantom{f^\star(x)} = \operatorname{argmin}_{y' \in \mathcal{Y}} \langle \psi(y'), \mu^\star(x) \rangle$
with $\mu^\star(x) = \mathbb{E}[\phi(y) \mid x]$ the conditional expectation of $\phi(y)$ given $x$.

The estimator
Given $\hat\mu$ estimating $\mu^\star$, define
$\hat f(x) = \operatorname{argmin}_{y' \in \mathcal{Y}} \langle \psi(y'), \hat\mu(x) \rangle$

How to compute $\hat\mu$
$\mu^\star = \mathbb{E}[\phi(y) \mid x]$ is characterized by
$\mu^\star = \operatorname{argmin}_{\mu : \mathcal{X} \to \mathcal{H}} \mathbb{E}[\| \mu(x) - \phi(y) \|^2]$
Use standard techniques for vector-valued problems. Given $\mathcal{G}$, a suitable space of functions,
$\hat\mu = \operatorname{argmin}_{\mu \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n \| \mu(x_i) - \phi(y_i) \|^2 + \lambda \| \mu \|^2.$

$\mathcal{G}$ space of linear functions
Let $\mathcal{X}$ be a vector space and $\mathcal{G} = \mathcal{X} \otimes \mathcal{H}$, then
$\hat\mu(x) = \sum_{i=1}^n \alpha_i(x) \phi(y_i),$
where
$\alpha_i(x) := [(K + \lambda n I)^{-1} v(x)]_i,$
and $v(x) = (x^\top x_1, \dots, x^\top x_n) \in \mathbb{R}^n$, $K \in \mathbb{R}^{n \times n}$, $K_{i,j} = x_i^\top x_j$.

Non-parametric model
Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a kernel on $\mathcal{X}$. Denote by $\mathcal{F}$ the reproducing kernel Hilbert space induced by $k$ over $\mathcal{X}$. Let $\mathcal{G} = \mathcal{F} \otimes \mathcal{H}$, then
$\hat\mu(x) = \sum_{i=1}^n \alpha_i(x) \phi(y_i),$
where
$\alpha_i(x) := [(K + \lambda n I)^{-1} v(x)]_i,$
and $v(x) = (k(x, x_1), \dots, k(x, x_n)) \in \mathbb{R}^n$, $K \in \mathbb{R}^{n \times n}$, $K_{i,j} = k(x_i, x_j)$.
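The weights $\alpha_i(x)$ depend only on the inputs and the kernel, so they can be computed with a single linear solve. The sketch below assumes a Gaussian kernel and synthetic inputs; the function names `gaussian_kernel` and `alpha_weights`, the bandwidth `sigma`, and the data are all illustrative choices, not part of the talk.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise k(a, b) = exp(-||a - b||^2 / (2 sigma^2)); an assumed kernel choice.
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def alpha_weights(X_train, X_query, lam, sigma=1.0):
    # Column j of the result is alpha(x_j) = (K + lam * n * I)^{-1} v(x_j).
    n = X_train.shape[0]
    K = gaussian_kernel(X_train, X_train, sigma)
    V = gaussian_kernel(X_train, X_query, sigma)  # column j is v(x_j)
    return np.linalg.solve(K + lam * n * np.eye(n), V)

# Synthetic inputs, for illustration only.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 2))
X_query = rng.normal(size=(5, 2))
A = alpha_weights(X_train, X_query, lam=0.1)  # shape (n, number of queries)
```

Solving the linear system directly avoids forming the inverse explicitly, which is the usual choice for numerical stability.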

Algorithm and properties

Explicit representation of $\hat f$
When $\hat\mu$ is a non-parametric model, then
$\hat f(x) = \operatorname{argmin}_{y' \in \mathcal{Y}} \langle \psi(y'), \hat\mu(x) \rangle$
$\phantom{\hat f(x)} = \operatorname{argmin}_{y' \in \mathcal{Y}} \left\langle \psi(y'), \sum_{i=1}^n \alpha_i(x) \phi(y_i) \right\rangle$
$\phantom{\hat f(x)} = \operatorname{argmin}_{y' \in \mathcal{Y}} \sum_{i=1}^n \alpha_i(x) \langle \psi(y'), \phi(y_i) \rangle$
$\phantom{\hat f(x)} = \operatorname{argmin}_{y' \in \mathcal{Y}} \sum_{i=1}^n \alpha_i(x) \ell(y', y_i).$
No need to know $\mathcal{H}, \phi, \psi$ to run the algorithm!
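The explicit form can be run end to end when the argmin over $\mathcal{Y}$ is taken by enumerating a finite candidate list. The following is a minimal sketch under that assumption, with a Gaussian kernel, a toy one-dimensional dataset, and the absolute loss on $\{0, 1, 2\}$; the helper `predict`, its parameters, and the data are hypothetical. For large structured output spaces the enumeration would have to be replaced by a problem-specific solver.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise k(a, b) = exp(-||a - b||^2 / (2 sigma^2)); an assumed kernel choice.
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def predict(X_train, y_train, X_query, candidates, loss, lam, sigma=1.0):
    n = X_train.shape[0]
    K = gaussian_kernel(X_train, X_train, sigma)
    V = gaussian_kernel(X_train, X_query, sigma)
    # alpha[:, j] = (K + lam * n * I)^{-1} v(x_j)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), V)
    # loss_matrix[c, i] = loss(candidate_c, y_i); only loss evaluations are needed.
    loss_matrix = np.array([[loss(c, y) for y in y_train] for c in candidates])
    # scores[c, j] = sum_i alpha_i(x_j) * loss(candidate_c, y_i)
    scores = loss_matrix @ alpha
    return [candidates[c] for c in scores.argmin(axis=0)]

# Toy data: negative inputs carry output 0, positive inputs carry output 2.
X_train = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y_train = [0, 0, 0, 2, 2, 2]
X_query = np.array([[-1.5], [1.5]])
preds = predict(X_train, y_train, X_query, candidates=[0, 1, 2],
                loss=lambda y_pred, y: abs(y_pred - y), lam=0.01)
```

Note that the candidate list may contain outputs never seen in training (here `1`), and that neither $\psi$ nor $\phi$ appears anywhere in the code.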

Recap
The proposed estimator has the form: given
• $\ell$ satisfying A1
• $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, kernel on $\mathcal{X}$
$\hat f(x) = \operatorname{argmin}_{y' \in \mathcal{Y}} \sum_{i=1}^n \alpha_i(x) \ell(y', y_i),$
with
$\alpha_i(x) := [(K + \lambda n I)^{-1} v(x)]_i,$
and $v(x) = (k(x, x_1), \dots, k(x, x_n)) \in \mathbb{R}^n$, $K \in \mathbb{R}^{n \times n}$, $K_{i,j} = k(x_i, x_j)$.
• Applicable to a wide family of problems (no need to know $\mathcal{H}, \phi, \psi$)
• Only optimization on $\mathcal{Y}$ and not on $\{ f : \mathcal{X} \to \mathcal{Y} \} = \mathcal{Y}^{\mathcal{X}}$
• Generalization properties?

Properties of $\hat f$
Theorem (Comparison inequality)
Let $\ell$ satisfy A1. For any $\hat\mu : \mathcal{X} \to \mathcal{H}$,
$\mathcal{E}(\hat f) - \mathcal{E}(f^\star) \le 2 c_\psi \sqrt{\mathbb{E}[\| \hat\mu(x) - \mu^\star(x) \|^2]},$
with $c_\psi = \sup_{y' \in \mathcal{Y}} \| \psi(y') \|$.
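A short sketch of why the comparison inequality holds, reconstructed from A1 and the definitions of $\hat f$ and $f^\star$ rather than transcribed from the talk. It uses $\mathcal{E}(f) = \mathbb{E}_x \langle \psi(f(x)), \mu^\star(x) \rangle$, the fact that $\hat f(x)$ minimizes $\langle \psi(\cdot), \hat\mu(x) \rangle$, then Cauchy-Schwarz and Jensen's inequality.

```latex
\begin{align*}
\mathcal{E}(\hat f) - \mathcal{E}(f^\star)
 &= \mathbb{E}_x\big[\langle \psi(\hat f(x)), \mu^\star(x) \rangle
    - \langle \psi(f^\star(x)), \mu^\star(x) \rangle\big] \\
 &= \mathbb{E}_x\big[\langle \psi(\hat f(x)), \mu^\star(x) - \hat\mu(x) \rangle
    + \underbrace{\langle \psi(\hat f(x)), \hat\mu(x) \rangle
    - \langle \psi(f^\star(x)), \hat\mu(x) \rangle}_{\le 0 \text{ by definition of } \hat f}
    + \langle \psi(f^\star(x)), \hat\mu(x) - \mu^\star(x) \rangle\big] \\
 &\le 2 c_\psi \, \mathbb{E}_x \| \hat\mu(x) - \mu^\star(x) \|
 \;\le\; 2 c_\psi \sqrt{\mathbb{E}_x \| \hat\mu(x) - \mu^\star(x) \|^2}.
\end{align*}
```

The inequality reduces the analysis of the structured estimator to the analysis of a vector-valued least-squares problem.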

Consistency of $\hat f$
Theorem (Universal consistency - Ciliberto, Rosasco, Rudi ’16)
Let $\ell$ satisfy A1 and $k$ be a universal kernel. Let $\lambda = n^{-1/4}$, then
$\lim_{n \to \infty} \mathcal{E}(\hat f) = \mathcal{E}(f^\star),$ with probability 1.

Learning rates of $\hat f$
Theorem (Rates - Ciliberto, Rosasco, Rudi ’16)
Let $\ell$ satisfy A1 and $\mu^\star \in \mathcal{G}$. Let $\lambda = n^{-1/2}$, then
$\mathcal{E}(\hat f) - \mathcal{E}(f^\star) \le 2 c_\psi n^{-1/4},$ w.h.p.

Check point
We provide a framework for structured prediction with
• theoretical guarantees as empirical risk minimization
• explicit algorithm applicable to a wide family of problems $(\mathcal{Y}, \ell)$
• some important existing algorithms are covered by this framework (not seen here)

Case studies:
• ranking with different losses (Korba, Garcia, d’Alché-Buc ’18)
• Output Fisher Embeddings (Djerrab, Garcia, Sangnier, d’Alché-Buc ’18)
• $\mathcal{Y}$ = manifolds, $\ell$ = geodesic distance (Ciliberto et al. ’18)
• $\mathcal{Y}$ = probability space, $\ell$ = Wasserstein distance (Luise et al. ’18)
Refinements of the analysis:
• different derivation (Osokin, Bach, Lacoste-Julien ’17; Goh ’18)
• determination of the constant $c_\psi$ in terms of $\log |\mathcal{Y}|$ for discrete sets (Nowak, Bach, Rudi ’18; Struminsky et al. ’18)
Extensions:
• application to multitask-learning (Ciliberto, Rosasco, Rudi ’17)
• beyond least squares surrogate (Nowak, Bach, Rudi ’19)
• regularizing with trace norm (Luise, Stamos, Pontil, Ciliberto ’19)
• localized structured prediction (Ciliberto, Bach, Rudi ’18)

Leveraging local structure

Local Structure
