

SLIDE 1

A Consistent Regularization Approach for Structured Prediction

Carlo Ciliberto, Alessandro Rudi, Lorenzo Rosasco
University of Genova, Istituto Italiano di Tecnologia, Massachusetts Institute of Technology
lcsl.mit.edu
Dec 9th, NIPS 2016

SLIDE 2

Structured Prediction

SLIDE 3

Outline

◮ Standard Supervised Learning
◮ Structured Prediction with SELF
◮ Algorithm
◮ Theory
◮ Experiments
◮ Conclusions

SLIDE 4

Outline

◮ Standard Supervised Learning
◮ Structured Prediction with SELF
◮ Algorithm
◮ Theory
◮ Experiments
◮ Conclusions

SLIDE 5

Scalar Learning

Goal: given (xi, yi)_{i=1}^n, find fn : X → Y. Let Y = R.

◮ Parametrize

f(x) = w⊤ϕ(x),   w ∈ RP,   ϕ : X → RP

◮ Learn

fn(x) = wn⊤ϕ(x),   wn = argmin_{w∈RP} (1/n) Σ_{i=1}^n L(w⊤ϕ(xi), yi)

SLIDE 6

Multi-variate Learning

Goal: given (xi, yi)_{i=1}^n, find fn : X → Y. Let Y = RM.

◮ Parametrize

f(x) = Wϕ(x),   W ∈ RM×P,   ϕ : X → RP

◮ Learn

fn(x) = Wn ϕ(x),   Wn = argmin_{W∈RM×P} (1/n) Σ_{i=1}^n L(Wϕ(xi), yi)
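For concreteness, here is a minimal numpy sketch of this multi-variate least-squares step with the identity feature map; the data and variable names are illustrative, not from the talk.

```python
import numpy as np

# Toy data: n inputs in R^P, outputs in R^M, identity feature map phi(x) = x.
n, P, M = 200, 5, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(n, P))                     # rows are phi(x_i)
W_true = rng.normal(size=(M, P))
Y = X @ W_true.T + 0.1 * rng.normal(size=(n, M))

# W_n = argmin_W (1/n) sum_i ||W phi(x_i) - y_i||^2  (squared loss, closed form).
W_lsq, *_ = np.linalg.lstsq(X, Y, rcond=None)   # shape (P, M); W_n is its transpose
f_n = lambda x: W_lsq.T @ x                     # f_n(x) = W_n phi(x)

print(np.allclose(W_lsq.T, W_true, atol=0.1))   # close to the generating W
```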

SLIDE 7

Learning Theory

Expected Risk

E(f) = ∫_{X×Y} L(f(x), y) dρ(x, y)

◮ Consistency

lim_{n→+∞} E(fn) = inf_f E(f)   (in probability)

◮ Excess Risk Bounds

E(fn) − inf_{f∈H} E(f) ≤ ε(n, ρ, H)   (w.h.p.)

SLIDE 8

Outline

◮ Standard Supervised Learning
◮ Structured Prediction with SELF
◮ Algorithm
◮ Theory
◮ Experiments
◮ Conclusions

SLIDE 9

(Un)Structured prediction

What if Y is not a vector space? (e.g. strings, graphs, histograms, etc.)

Q. How do we:

◮ Parametrize
◮ Learn

a function f : X → Y ?

SLIDE 10

Possible Approaches

◮ Score-Learning Methods

+ General algorithmic framework (e.g. StructSVM [Tsochantaridis et al. '05])
− Limited theory ([McAllester '06])

◮ Surrogate/Relaxation approaches:

+ Clear theory
− Only for special cases (e.g. classification, ranking, multi-labeling, etc.)

[Bartlett et al. '06, Duchi et al. '10, Mroueh et al. '12, Gao et al. '13]

SLIDE 11

Relaxation Approaches

1. Encoding: choose c : Y → RM

2. Learning: given (xi, c(yi))_{i=1}^n, find gn : X → RM

3. Decoding: choose d : RM → Y and let fn(x) = (d ◦ gn)(x)

SLIDE 12

Example I: Binary Classification

Let Y = {−1, 1}

1. Encoding: c : {−1, 1} → R the identity

2. Scalar learning: gn : X → R

3. Decoding: d = sign : R → {−1, 1},   fn(x) = sign(gn(x))

SLIDE 13

Example II: Multi-class Classification

Let Y = {1, . . . , M}

1. Encoding: c : Y → {e1, . . . , eM} ⊂ RM the canonical basis, c(j) = ej ∈ RM

2. Multi-variate learning: gn : X → RM

3. Decoding: d : RM → {1, . . . , M},

fn(x) = argmax_{j=1,...,M} ej⊤ gn(x)   (the j-th entry of gn(x); see the sketch below)
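A minimal numpy sketch of this encode/learn/decode pipeline (one-hot encoding, multi-variate least squares, argmax decoding); the toy data and the 0-indexed labels are assumptions made for the example.

```python
import numpy as np

def encode(y, M):
    """c(j) = e_j: one-hot encoding of labels in {0, ..., M-1}."""
    C = np.zeros((len(y), M))
    C[np.arange(len(y)), y] = 1.0
    return C

# Toy multi-class data with identity feature map (labels 0-indexed here).
M, n, P = 3, 300, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(n, P))
y = X[:, :M].argmax(axis=1)

# Multi-variate least squares on the encoded outputs: g_n(x) = W_n phi(x).
W_lsq, *_ = np.linalg.lstsq(X, encode(y, M), rcond=None)

def decode(x):
    """f_n(x) = argmax_j e_j^T g_n(x), i.e. the largest entry of g_n(x)."""
    return int(np.argmax(W_lsq.T @ x))

print(decode(X[0]), y[0])   # prediction vs. true label on a training point
```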
SLIDE 14

A General Relaxation Approach

SLIDE 15

A General Relaxation Approach

Main Assumption. Structure Encoding Loss Function (SELF)

Given △ : Y × Y → R, there exist

◮ an RKHS HY with feature map c : Y → HY
◮ a bounded linear operator V : HY → HY

such that

△(y, y′) = ⟨c(y), V c(y′)⟩HY   ∀ y, y′ ∈ Y

Note. If V is positive semidefinite, then △ is a kernel.

SLIDE 16

SELF: Examples

◮ Binary classification: c : {−1, 1} → R and V = 1.

◮ Multi-class classification: c(j) = ej ∈ RM and V = 1 − I ∈ RM×M (1 the all-ones matrix), so that △(j, j′) = 1 − δjj′ is the 0-1 loss (verified numerically below).

◮ Kernel Dependency Estimation (KDE) [Weston et al. '02, Cortes et al. '05]:

△(y, y′) = 1 − h(y, y′), with h : Y × Y → R a kernel on Y.
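As a quick sanity check (not from the talk), the multi-class case can be verified numerically: with c(j) = ej and V = 1 − I, the SELF inner product reproduces the 0-1 loss.

```python
import numpy as np

M = 5
V = np.ones((M, M)) - np.eye(M)     # V = 1 - I (all-ones matrix minus identity)
c = lambda j: np.eye(M)[j]          # c(j) = e_j, canonical basis vector

# SELF identity: <c(y), V c(y')> should reproduce the 0-1 loss.
for y in range(M):
    for yp in range(M):
        assert np.isclose(c(y) @ V @ c(yp), 0.0 if y == yp else 1.0)
print("0-1 loss is SELF with c(j) = e_j and V = 1 - I")
```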

SLIDE 17

SELF: Finite Y

All △ on a discrete Y are SELF.

Examples:

◮ Strings: edit distance, KL divergence, word error rate, . . .
◮ Ordered sequences: rank loss, . . .
◮ Graphs/Trees: graph/tree edit distance, subgraph matching, . . .
◮ Discrete subsets: weighted overlap loss, . . .
◮ . . .

SLIDE 18

SELF: More examples

◮ Histograms/Probabilities: e.g. χ², Hellinger, . . .
◮ Manifolds: diffusion distances
◮ . . .

SLIDE 19

Relaxation with SELF

1. Encoding. c : Y → HY, the canonical feature map of HY

2. Surrogate Learning. Multi-variate regression gn : X → HY

3. Decoding.

fn(x) = argmin_{y∈Y} ⟨c(y), V gn(x)⟩HY

SLIDE 20

Surrogate Learning

Multi-variate learning with ridge regression

◮ Parametrize

g(x) = Wϕ(x),   W ∈ RM×P,   ϕ : X → RP

◮ Learn

gn(x) = Wn ϕ(x),   Wn = argmin_{W∈RM×P} (1/n) Σ_{i=1}^n ‖Wϕ(xi) − c(yi)‖²HY   (least squares)

SLIDE 21

Learning (cont.)

Solution¹

gn(x) = Wn ϕ(x),   Wn = C (Φ⊤Φ)−1Φ⊤ = C A,   with A = (Φ⊤Φ)−1Φ⊤ ∈ Rn×P

◮ Φ = [ϕ(x1), . . . , ϕ(xn)] ∈ RP×n   (input features)
◮ C = [c(y1), . . . , c(yn)] ∈ RM×n   (output features)

¹ In practice, add a regularizer!

SLIDE 22

Decoding

Lemma (Ciliberto, Rudi, Rosasco '16)

Let gn(x) = C A ϕ(x) be the solution of the surrogate problem. Then

fn(x) = argmin_{y∈Y} ⟨c(y), V gn(x)⟩HY

can be written as

fn(x) = argmin_{y∈Y} Σ_{i=1}^n αi(x) △(y, yi),   where (α1(x), . . . , αn(x))⊤ = A ϕ(x) ∈ Rn

SLIDE 23

Decoding

Sketch of the proof:

◮ gn(x) = C A ϕ(x) = Σ_{i=1}^n αi(x) c(yi),   with (α1(x), . . . , αn(x))⊤ = A ϕ(x) ∈ Rn

◮ Plugging gn(x) in:

⟨c(y), V gn(x)⟩HY = ⟨c(y), V Σ_{i=1}^n αi(x) c(yi)⟩HY = Σ_{i=1}^n αi(x) ⟨c(y), V c(yi)⟩HY = Σ_{i=1}^n αi(x) △(y, yi)   (SELF)

SLIDE 24

SELF Learning

Two steps (sketched in code below):

1. Surrogate Learning

(α1(x), . . . , αn(x))⊤ = A ϕ(x),   A = (Φ⊤Φ + nλI)−1Φ⊤

2. Decoding

fn(x) = argmin_{y∈Y} Σ_{i=1}^n αi(x) △(y, yi)

Note:

◮ Implicit encoding: no need to know HY or V (extends the kernel trick)!
◮ Optimization over Y is problem specific and can be a challenge.
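A compact numpy sketch of the two steps for a finite Y, assuming a Gaussian kernel on X and brute-force decoding over a list of candidate outputs; the kernel, data, and loss below are illustrative choices, not prescribed by the talk.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs of rows."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def self_fit_predict(X_tr, y_tr, X_te, candidates, loss, lam=1e-3):
    """Step 1: alpha(x) = (K + n*lam*I)^{-1} k_x   (kernel ridge weights).
    Step 2:  f_n(x) = argmin_{y in Y} sum_i alpha_i(x) loss(y, y_i)."""
    n = len(X_tr)
    K = gaussian_kernel(X_tr, X_tr)
    alphas = np.linalg.solve(K + n * lam * np.eye(n), gaussian_kernel(X_tr, X_te)).T
    # Precompute loss(y, y_i) for every candidate y and every training label y_i.
    L = np.array([[loss(y, yi) for yi in y_tr] for y in candidates])   # |Y| x n
    return [candidates[np.argmin(L @ a)] for a in alphas]

# Toy usage: binary labels with the 0-1 loss (any SELF loss would do).
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(100, 2)); y_tr = (X_tr[:, 0] > 0).astype(int)
X_te = rng.normal(size=(5, 2))
print(self_fit_predict(X_tr, y_tr, X_te, candidates=[0, 1],
                       loss=lambda y, yi: float(y != yi)))
```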

SLIDE 25

Connections with Previous Work

◮ Score-Learning approaches (e.g. StructSVM [Tsochantaridis et al. '05]):

In StructSVM it is possible to choose any feature map on the output...
... here we show that this choice must be compatible with △.

◮ Kernel dependency estimation: △ is (one minus) a kernel.

◮ Conditional mean embeddings? [Smola et al. '07]

SLIDE 26

Relaxation Analysis

SLIDE 27

Relaxation Analysis

Consider

E(f) = ∫_{X×Y} △(f(x), y) dρ(x, y)   and   R(g) = ∫_{X×Y} ‖g(x) − c(y)‖²HY dρ(x, y)

How are R(gn) and E(fn) related?

SLIDE 28

Relaxation Analysis

f∗ = argmin_{f:X→Y} E(f)   and   g∗ = argmin_{g:X→HY} R(g)

Key properties:

◮ Fisher Consistency (FC)

E(d ◦ g∗) = E(f∗)

◮ Comparison Inequality (CI)

∃ θ : R → R with θ(r) → 0 as r → 0 such that

E(d ◦ g) − E(f∗) ≤ θ(R(g) − R(g∗))   ∀ g : X → HY

SLIDE 29

SELF Relaxation Analysis

Theorem (Ciliberto, Rudi, Rosasco '16)

Let △ : Y × Y → R be a SELF loss and g∗ : X → HY the least-squares "relaxed" solution. Then:

◮ Fisher Consistency

E(d ◦ g∗) = E(f∗)

◮ Comparison Inequality: ∀ g : X → HY

E(d ◦ g) − E(f∗) ≲ √( R(g) − R(g∗) )
SLIDE 30

SELF Relaxation Analysis (cont.)

Lemma (Ciliberto, Rudi, Rosasco '16)

Let △ : Y × Y → R be a SELF loss. Then

E(f) = ∫_X ⟨c(f(x)), V g∗(x)⟩HY dρX(x)

where g∗ : X → HY minimizes

R(g) = ∫_{X×Y} ‖g(x) − c(y)‖²HY dρ(x, y)

Least squares on HY is a good surrogate loss.

SLIDE 31

Consistency and Generalization Bounds

Theorem (Ciliberto, Rudi, Rosasco '16)

If we consider a universal feature map and λ = 1/√n, then

lim_{n→∞} E(fn) = E(f∗)   almost surely.

Moreover, under mild assumptions,

E(fn) − E(f∗) ≲ n^{−1/4}   (w.h.p.)

Proof. Relaxation analysis + (kernel) ridge regression results: R(gn) − R(g∗) ≲ n^{−1/2}.

SLIDE 32

Remarks

◮ First result proving universal consistency and excess risk bounds for general structured prediction (partial results for KDE in [Giguère et al. '13]).

◮ Rates are sharp for the class of SELF loss functions △, i.e. matching classification results.

◮ Faster rates under further regularity conditions.

SLIDE 33

Experiments: Ranking

△rank(f(x), y) = Σ_{i,j=1}^M γ(y)ij (1 − sign(f(x)i − f(x)j)) / 2   (see the sketch below)

                                 Rank Loss
[Herbrich et al. '99]            0.432 ± 0.008
[Dekel et al. '04]               0.432 ± 0.012
[Duchi et al. '10]               0.430 ± 0.004
[Tsochantaridis et al. '05]      0.451 ± 0.008
[Ciliberto, Rudi, Rosasco '16]   0.396 ± 0.003

Ranking experiments on the MovieLens dataset with △rank [Dekel et al. '04, Duchi et al. '10]. ∼1600 movies for ∼900 users.
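A small Python sketch of △rank as written above, with a made-up pairwise weight matrix γ(y) for illustration.

```python
import numpy as np

def rank_loss(scores, gamma):
    """Delta_rank(f(x), y) = sum_{i,j} gamma(y)_ij (1 - sign(f(x)_i - f(x)_j)) / 2.

    scores : predicted scores f(x), shape (M,)
    gamma  : pairwise weights gamma(y), shape (M, M), encoding the true preferences
    """
    diff = np.sign(scores[:, None] - scores[None, :])    # sign(f(x)_i - f(x)_j)
    return float((gamma * (1.0 - diff) / 2.0).sum())

# Item 0 should rank above item 1, and item 1 above item 2.
gamma = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
print(rank_loss(np.array([3.0, 2.0, 1.0]), gamma))   # 0.0: ordering respected
print(rank_loss(np.array([1.0, 2.0, 3.0]), gamma))   # 2.0: both pairs violated
```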

SLIDE 34

Experiments: Digit Reconstruction

Digit reconstruction on USPS dataset

Loss    KDE △G           SELF △H
△G      0.149 ± 0.013    0.172 ± 0.011
△H      0.736 ± 0.032    0.647 ± 0.017
△R      0.294 ± 0.012    0.193 ± 0.015

◮ △G(f(x), y) = 1 − k(f(x), y), with k a Gaussian kernel on the output.

◮ △H(f(x), y) = ‖√f(x) − √y‖, the Hellinger distance.

◮ △R(f(x), y): recognition accuracy of an SVM digit classifier.

SLIDE 35

Experiments: Robust Estimation

△Cauchy(f(x), y) = (c/2) log(1 + ‖f(x) − y‖²/c),   c > 0
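A one-function Python version of this loss (as reconstructed above), for scalar outputs and an illustrative value of c.

```python
import numpy as np

def cauchy_loss(pred, y, c=1.0):
    """(c/2) * log(1 + |pred - y|^2 / c): grows only logarithmically for large
    residuals, which is what makes the estimator robust to outliers."""
    return 0.5 * c * np.log1p(np.abs(pred - y) ** 2 / c)

print(cauchy_loss(0.1, 0.0), cauchy_loss(5.0, 0.0))   # small vs. large residual
```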

[Plot: regression estimates produced by Alg. 1 (SELF), RNW, and KRLS.]

n      SELF           RNW            KRR
50     0.39 ± 0.17    0.45 ± 0.18    0.62 ± 0.13
100    0.21 ± 0.04    0.29 ± 0.04    0.47 ± 0.09
200    0.12 ± 0.02    0.24 ± 0.03    0.33 ± 0.04
500    0.08 ± 0.01    0.22 ± 0.02    0.31 ± 0.03
1000   0.07 ± 0.01    0.21 ± 0.02    0.19 ± 0.02

SLIDE 36

Outline

◮ Standard Supervised Learning
◮ Structured Prediction with SELF
◮ Algorithm
◮ Theory
◮ Experiments
◮ Conclusions

SLIDE 37

Wrapping Up

Contributions

1. A relaxation/regularization framework for structured prediction.
2. Theoretical guarantees: universal consistency + sharp bounds.
3. Promising empirical results.

Open Questions

◮ Surrogate loss functions beyond least squares.
◮ Efficient decoding, exploiting the loss structure.
◮ Tsybakov-noise-like conditions.

P.S. I have post-doc positions! Ping me if you are interested.