Structured Prediction via Implicit Embeddings Alessandro Rudi - PowerPoint PPT Presentation

Structured Prediction via Implicit Embeddings Alessandro Rudi Imaging and Machine Learning, April 1st, Paris Inria, École normale supérieure In collaboration with: Carlo Ciliberto, Lorenzo Rosasco, Francis Bach

Structured Prediction

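Supervised Learning
• $\mathcal{X}$ input space, $\mathcal{Y}$ output space,
• $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ loss function,
• $\rho$ probability on $\mathcal{X} \times \mathcal{Y}$.
$$f^\star = \operatorname*{argmin}_{f : \mathcal{X} \to \mathcal{Y}} \mathcal{E}(f), \qquad \mathcal{E}(f) := \mathbb{E}[\ell(f(x), y)],$$
given only the dataset $(x_i, y_i)_{i=1}^n$ sampled independently from $\rho$.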

Supervised learning: Goal
Given the dataset $(x_i, y_i)_{i=1}^n$ sampled independently from $\rho$, produce $\hat f_n$ such that
Consistency: $\lim_{n \to \infty} \mathcal{E}(\hat f_n) = \mathcal{E}(f^\star)$, a.s.
Learning rates: $\mathcal{E}(\hat f_n) - \mathcal{E}(f^\star) \le c(n)$, w.h.p.

State of the art: Vector-valued case
$\mathcal{Y}$ is a vector space
• choose suitable $\mathcal{G} \subseteq \{ f : \mathcal{X} \to \mathcal{Y} \}$ (usually a convex function space)
• solve empirical risk minimization
$$\hat f = \operatorname*{argmin}_{f \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i) + \lambda R(f).$$
• Well known methods: Linear models, generalized linear models, Kernel machines, Kernel SVM. Easy to optimize.
• Consistency and (optimal) learning rates for many losses

State of the art: Structured case
$\mathcal{Y}$ arbitrary: how do we parametrize $\mathcal{G}$ and learn $\hat f$?
Surrogate approaches
+ Clear theory
- Only for special cases (e.g. classification, ranking, multi-labeling etc.) [Bartlett et al ’06, Duchi et al ’10, Mroueh et al ’12, Gao et al. ’13]
Score learning techniques
+ General algorithmic framework (e.g. StructSVM [Tsochantaridis et al ’05])
- Limited theory ([McAllester ’06])

Supervised learning with structure
Is it possible to
(a) have best of both worlds? (general algorithmic framework with clear theory)
(b) learn leveraging the local structure of the input and the output?
We will address (a), (b) using implicit embeddings (related techniques: Cortes et al. 2005; Geurts, Wehenkel, d’Alché Buc ’06; Kadri et al. ’13; Brouard, Szafranski, d’Alché Buc ’16)

Table of contents
1. Structured learning with implicit embeddings
2. Algorithm and properties
3. Leveraging local structure

Structured learning with implicit embeddings

Characterizing the target function
$$f^\star = \operatorname*{argmin}_{f : \mathcal{X} \to \mathcal{Y}} \mathbb{E}[\ell(f(x), y)].$$
Pointwise characterization
$$f^\star(x) = \operatorname*{argmin}_{y' \in \mathcal{Y}} \mathbb{E}[\ell(y', y) \mid x]$$

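Characterizing the target function
Let
$$\tilde f(x) = \operatorname*{argmin}_{y' \in \mathcal{Y}} \mathbb{E}[\ell(y', y) \mid x].$$
Then, for all $f : \mathcal{X} \to \mathcal{Y}$,
$$\mathbb{E}[\ell(\tilde f(x), y)] = \mathbb{E}_x\big[\mathbb{E}[\ell(\tilde f(x), y) \mid x]\big] = \mathbb{E}_x\Big[\inf_{y' \in \mathcal{Y}} \mathbb{E}[\ell(y', y) \mid x]\Big] \le \mathbb{E}[\ell(f(x), y)].$$
Then $\mathcal{E}(\tilde f) = \inf_{f : \mathcal{X} \to \mathcal{Y}} \mathcal{E}(f)$ (measurability issues solved via Berge maximum theory for measurable functions).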
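To make the pointwise characterization concrete, here is a minimal sketch for a finite output space. The loss matrix and the conditional distribution are hypothetical values chosen for illustration; they do not come from the slides.

```python
import numpy as np

# Hypothetical squared-difference loss on 3 ordered labels: loss[y_pred, y_true].
loss = np.array([
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 1.0],
    [4.0, 1.0, 0.0],
])

# Hypothetical conditional distribution p(y | x) at one fixed input x.
p_y_given_x = np.array([0.45, 0.10, 0.45])


def pointwise_minimizer(loss, p):
    """Return argmin over y' of E[loss(y', y) | x] and all expected losses."""
    expected = loss @ p  # expected[y'] = sum_y loss[y', y] * p[y]
    return int(np.argmin(expected)), expected


best, expected = pointwise_minimizer(loss, p_y_given_x)
print(best, expected)
```

In this toy case the minimizer is the middle label even though it is the least probable one, which shows that the target function depends on the loss and not only on the most likely output.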

Implicit embedding
A1. There exists a Hilbert space $\mathcal{H}$ and $\psi, \varphi : \mathcal{Y} \to \mathcal{H}$, bounded continuous, such that
$$\ell(y', y) := \langle \psi(y'), \varphi(y) \rangle.$$
Theorem (Ciliberto, Rosasco, Rudi ’16)
A1 is satisfied
1. for any loss $\ell$ when $\mathcal{Y}$ discrete space
2. for any smooth loss $\ell$ when $\mathcal{Y} \subset \mathbb{R}^d$ compact
3. for any smooth loss $\ell$ when $\mathcal{Y} \subseteq M$ with $M$ compact manifold
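For the discrete case, one possible choice of embedding (an illustrative assumption, not necessarily the construction used in the cited theorem) takes H = R^{|Y|}, psi(y') as the one-hot vector of y', and phi(y) as the column of the loss matrix indexed by y. The sketch below checks numerically that this reproduces an arbitrary loss matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_labels = 4

# Arbitrary (hypothetical) loss matrix: loss[y_pred, y_true].
loss = rng.uniform(size=(n_labels, n_labels))


def psi(y_pred):
    """One-hot vector of the predicted label."""
    e = np.zeros(n_labels)
    e[y_pred] = 1.0
    return e


def phi(y_true):
    """Column of the loss matrix for the true label."""
    return loss[:, y_true].copy()


# <psi(y'), phi(y)> should equal loss[y', y] for every pair of labels.
reconstructed = np.array(
    [[psi(a) @ phi(b) for b in range(n_labels)] for a in range(n_labels)]
)
print(np.allclose(reconstructed, loss))
```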

Idea for a unified approach
When A1 holds
$$f^\star(x) = \operatorname*{argmin}_{y' \in \mathcal{Y}} \mathbb{E}[\ell(y', y) \mid x] = \operatorname*{argmin}_{y' \in \mathcal{Y}} \mathbb{E}[\langle \psi(y'), \varphi(y) \rangle \mid x] = \operatorname*{argmin}_{y' \in \mathcal{Y}} \langle \psi(y'), \mathbb{E}[\varphi(y) \mid x] \rangle = \operatorname*{argmin}_{y' \in \mathcal{Y}} \langle \psi(y'), \mu^\star(x) \rangle$$
with $\mu^\star(x) = \mathbb{E}[\varphi(y) \mid x]$ the conditional expectation of $\varphi(y)$ given $x$.

The estimator
Given $\hat\mu$ estimating $\mu^\star$, define
$$\hat f(x) = \operatorname*{argmin}_{y' \in \mathcal{Y}} \langle \psi(y'), \hat\mu(x) \rangle$$

How to compute $\hat\mu$
$\mu^\star = \mathbb{E}[\varphi(y) \mid x]$ is characterized by
$$\mu^\star = \operatorname*{argmin}_{\mu : \mathcal{X} \to \mathcal{H}} \mathbb{E}[\| \mu(x) - \varphi(y) \|^2],$$
so we can use standard techniques for vector-valued problems. Given $\mathcal{G}$, a suitable space of functions,
$$\hat\mu = \operatorname*{argmin}_{\mu \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n \| \mu(x_i) - \varphi(y_i) \|^2 + \lambda \| \mu \|^2.$$

$\mathcal{G}$ space of linear functions
Let $\mathcal{X}$ be a vector space and $\mathcal{G} = \mathcal{X} \otimes \mathcal{H}$, then
$$\hat\mu(x) = \sum_{i=1}^n \alpha_i(x) \varphi(y_i),$$
where $\alpha_i(x) := [(K + \lambda n I)^{-1} v(x)]_i$, and
$$v(x) = (x^\top x_1, \dots, x^\top x_n) \in \mathbb{R}^n, \qquad K \in \mathbb{R}^{n \times n}, \quad K_{i,j} = x_i^\top x_j.$$

Non-parametric model
Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a kernel on $\mathcal{X}$. Denote by $\mathcal{F}$ the reproducing kernel Hilbert space induced by $k$ over $\mathcal{X}$. Let $\mathcal{G} = \mathcal{F} \otimes \mathcal{H}$, then
$$\hat\mu(x) = \sum_{i=1}^n \alpha_i(x) \varphi(y_i),$$
where $\alpha_i(x) := [(K + \lambda n I)^{-1} v(x)]_i$, and
$$v(x) = (k(x, x_1), \dots, k(x, x_n)) \in \mathbb{R}^n, \qquad K \in \mathbb{R}^{n \times n}, \quad K_{i,j} = k(x_i, x_j).$$
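The weights alpha(x) can be computed with a single linear solve. The following is a minimal NumPy sketch; the Gaussian kernel, its bandwidth `sigma`, the value of `lam`, and the random data are assumptions made for illustration and are not prescribed by the slides.

```python
import numpy as np


def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between rows of A (n, d) and rows of B (m, d)."""
    sq = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def alpha_weights(X_train, X_test, lam, kernel=gaussian_kernel):
    """Row j holds alpha(x_j) = (K + lam * n * I)^{-1} v(x_j) for test point x_j."""
    n = X_train.shape[0]
    K = kernel(X_train, X_train)  # K[i, j] = k(x_i, x_j)
    V = kernel(X_train, X_test)   # column j is v(x_j)
    return np.linalg.solve(K + lam * n * np.eye(n), V).T


rng = np.random.default_rng(0)
X_train = rng.normal(size=(5, 2))  # hypothetical training inputs
X_test = rng.normal(size=(3, 2))   # hypothetical test inputs
lam = 0.1                          # hypothetical regularization parameter
alphas = alpha_weights(X_train, X_test, lam)
print(alphas.shape)
```

Passing `kernel=lambda A, B: A @ B.T` gives the linear case, where K and v(x) are built from inner products of the inputs.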

Algorithm and properties

Explicit representation of $\hat f$
When $\hat\mu$ is a non-parametric model, then
$$\hat f(x) = \operatorname*{argmin}_{y' \in \mathcal{Y}} \langle \psi(y'), \hat\mu(x) \rangle = \operatorname*{argmin}_{y' \in \mathcal{Y}} \Big\langle \psi(y'), \sum_{i=1}^n \alpha_i(x) \varphi(y_i) \Big\rangle = \operatorname*{argmin}_{y' \in \mathcal{Y}} \sum_{i=1}^n \alpha_i(x) \langle \psi(y'), \varphi(y_i) \rangle = \operatorname*{argmin}_{y' \in \mathcal{Y}} \sum_{i=1}^n \alpha_i(x) \ell(y', y_i).$$
No need to know $\mathcal{H}, \varphi, \psi$ to run the algorithm!
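The explicit form of the estimator can be sketched end to end when the minimization over outputs is done by enumerating a finite candidate set. That enumeration, the Gaussian kernel, the 0-1 loss, and the toy data below are all illustrative assumptions; for large structured output spaces a problem-specific solver would replace the enumeration.

```python
import numpy as np


def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between rows of A (n, d) and rows of B (m, d)."""
    sq = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def fit_predict(X_train, y_train, X_test, candidates, loss, lam, kernel=gaussian_kernel):
    """For each test x, return argmin over candidates y' of sum_i alpha_i(x) * loss(y', y_i)."""
    n = len(X_train)
    K = kernel(X_train, X_train)
    V = kernel(X_train, X_test)
    A = np.linalg.solve(K + lam * n * np.eye(n), V).T  # A[j, i] = alpha_i(x_j)
    # L[c, i] = loss(candidate c, y_i); only loss evaluations are needed.
    L = np.array([[loss(c, y) for y in y_train] for c in candidates])
    scores = A @ L.T  # scores[j, c] = sum_i alpha_i(x_j) * loss(c, y_i)
    return [candidates[j] for j in np.argmin(scores, axis=1)]


def zero_one(y_pred, y_true):
    """0-1 loss on arbitrary hashable outputs."""
    return 0.0 if y_pred == y_true else 1.0


# Hypothetical toy data: two well-separated 1-D clusters with labels 'a' and 'b'.
X_train = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y_train = ["a", "a", "a", "b", "b", "b"]
X_test = np.array([[-1.2], [1.7]])

predictions = fit_predict(X_train, y_train, X_test, ["a", "b"], zero_one, lam=0.1)
print(predictions)
```

The code never touches the Hilbert space or the maps psi and phi; swapping `zero_one` for another loss changes the predictor without changing the training step.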

Recap
The proposed estimator has the form
• Given $\ell$ satisfying A1
• $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, kernel on $\mathcal{X}$
$$\hat f(x) = \operatorname*{argmin}_{y' \in \mathcal{Y}} \sum_{i=1}^n \alpha_i(x) \ell(y', y_i),$$
with $\alpha_i(x) := [(K + \lambda n I)^{-1} v(x)]_i$, and
$$v(x) = (k(x, x_1), \dots, k(x, x_n)) \in \mathbb{R}^n, \qquad K \in \mathbb{R}^{n \times n}, \quad K_{i,j} = k(x_i, x_j).$$
• Applicable to a wide family of problems (no need to know $\mathcal{H}, \varphi, \psi$)
• Only optimization on $\mathcal{Y}$ and not on $\{ f : \mathcal{X} \to \mathcal{Y} \} = \mathcal{Y}^{\mathcal{X}}$
• Generalization properties?

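Properties of $\hat f$
Theorem (Comparison inequality)
Let $\ell$ satisfy A1. For any $\hat\mu : \mathcal{X} \to \mathcal{H}$,
$$\mathcal{E}(\hat f) - \mathcal{E}(f^\star) \le 2 c_\psi \sqrt{\mathbb{E}[\| \hat\mu(x) - \mu^\star(x) \|^2]},$$
with $c_\psi = \sup_{y' \in \mathcal{Y}} \| \psi(y') \|$.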

Consistency of $\hat f$
Theorem (Universal consistency - Ciliberto, Rosasco, Rudi ’16)
Let $\ell$ satisfy A1 and $k$ be a universal kernel. Let $\lambda = n^{-1/4}$, then
$$\lim_{n \to \infty} \mathcal{E}(\hat f) = \mathcal{E}(f^\star), \quad \text{with probability } 1.$$

Learning rates of $\hat f$
Theorem (Rates - Ciliberto, Rosasco, Rudi ’16)
Let $\ell$ satisfy A1 and $\mu^\star \in \mathcal{G}$. Let $\lambda = n^{-1/2}$, then
$$\mathcal{E}(\hat f) - \mathcal{E}(f^\star) \le 2 c_\psi n^{-1/4}, \quad \text{w.h.p.}$$

Check point
We provide a framework for structured prediction with
• theoretical guarantees as empirical risk minimization
• explicit algorithm applicable on wide family of problems $(\mathcal{Y}, \ell)$
• some important existing algorithms are covered by this framework (not seen here)

Case studies:
• ranking with different losses (Korba, Garcia, d’Alché-Buc ’18)
• Output Fisher Embeddings (Djerrab, Garcia, Sangnier, d’Alché-Buc ’18)
• $\mathcal{Y}$ = manifolds, $\ell$ = geodesic distance (Ciliberto et al. ’18)
• $\mathcal{Y}$ = probability space, $\ell$ = Wasserstein distance (Luise et al. ’18)
Refinements of the analysis:
• different derivation (Osokin, Bach, Lacoste-Julien ’17; Goh ’18)
• determination of the constant $c_\psi$ in terms of $\log |\mathcal{Y}|$ for discrete sets (Nowak, Bach, Rudi ’18; Struminsky et al. ’18)
Extensions:
• application to multitask-learning (Ciliberto, Rosasco, Rudi ’17)
• beyond least squares surrogate (Nowak, Bach, Rudi ’19)
• regularizing with trace norm (Luise, Stamos, Pontil, Ciliberto ’19)
• localized structured prediction (Ciliberto, Bach, Rudi ’18)

Leveraging local structure

Local Structure
