A Consistent Regularization Approach for Structured Prediction


  1. A Consistent Regularization Approach for Structured Prediction. Carlo Ciliberto, Alessandro Rudi, Lorenzo Rosasco. University of Genova, Istituto Italiano di Tecnologia, Massachusetts Institute of Technology. lcsl.mit.edu. Dec 9th, NIPS 2016

  2. Structured Prediction

  3. Outline Standard Supervised Learning Structured Prediction with SELF Algorithm Theory Experiments Conclusions

  4. Outline Standard Supervised Learning Structured Prediction with SELF Algorithm Theory Experiments Conclusions

  5. Scalar Learning
  Goal: given $(x_i, y_i)_{i=1}^n$, find $f_n : \mathcal{X} \to \mathcal{Y}$. Let $\mathcal{Y} = \mathbb{R}$.
  ◮ Parametrize: $f(x) = w^\top \varphi(x)$, with $w \in \mathbb{R}^P$ and $\varphi : \mathcal{X} \to \mathbb{R}^P$
  ◮ Learn: $f_n(x) = w_n^\top \varphi(x)$, where $w_n = \operatorname*{argmin}_{w \in \mathbb{R}^P} \frac{1}{n} \sum_{i=1}^n L(w^\top \varphi(x_i), y_i)$

  6. Multi-variate Learning
  Goal: given $(x_i, y_i)_{i=1}^n$, find $f_n : \mathcal{X} \to \mathcal{Y}$. Let $\mathcal{Y} = \mathbb{R}^M$.
  ◮ Parametrize: $f(x) = W\varphi(x)$, with $W \in \mathbb{R}^{M \times P}$ and $\varphi : \mathcal{X} \to \mathbb{R}^P$
  ◮ Learn: $f_n(x) = W_n \varphi(x)$, where $W_n = \operatorname*{argmin}_{W \in \mathbb{R}^{M \times P}} \frac{1}{n} \sum_{i=1}^n L(W\varphi(x_i), y_i)$
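For concreteness, a minimal NumPy sketch of this multi-variate step with the squared loss; the random feature map and the small ridge term are illustrative assumptions, not part of the slides.

```python
import numpy as np

def phi(X, P=50, seed=0):
    """Illustrative random feature map X -> R^P (a stand-in for a problem-specific phi)."""
    rng = np.random.default_rng(seed)
    W_feat = rng.normal(size=(X.shape[1], P))
    return np.tanh(X @ W_feat)

def fit_multivariate(X, Y, lam=1e-3):
    """Least-squares estimate of W in f(x) = W phi(x) (ridge term added for stability)."""
    Phi = phi(X)                                   # n x P design matrix
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])   # P x P
    return np.linalg.solve(A, Phi.T @ Y).T         # W, shape M x P

def predict(W, X):
    """Evaluate f(x) = W phi(x) on new inputs."""
    return phi(X) @ W.T                            # n x M predictions
```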

  7. Learning Theory
  Expected Risk: $\mathcal{E}(f) = \int_{\mathcal{X} \times \mathcal{Y}} L(f(x), y) \, d\rho(x, y)$
  ◮ Consistency: $\lim_{n \to +\infty} \mathcal{E}(f_n) = \inf_f \mathcal{E}(f)$ (in probability)
  ◮ Excess Risk Bounds: $\mathcal{E}(f_n) - \inf_{f \in \mathcal{H}} \mathcal{E}(f) \lesssim \epsilon(n, \rho, \mathcal{H})$ (w.h.p.)

  8. Outline Standard Supervised Learning Structured Prediction with SELF Algorithm Theory Experiments Conclusions

  9. (Un)Structured prediction
  What if $\mathcal{Y}$ is not a vector space? (e.g. strings, graphs, histograms, etc.)
  Q. How do we parametrize and learn a function $f : \mathcal{X} \to \mathcal{Y}$?

  10. Possible Approaches
  ◮ Score-learning methods: + general algorithmic framework (e.g. StructSVM [Tsochantaridis et al. '05]); − limited theory ([McAllester '06])
  ◮ Surrogate/relaxation approaches: + clear theory; − only for special cases (e.g. classification, ranking, multi-labeling, etc.) [Bartlett et al. '06, Duchi et al. '10, Mroueh et al. '12, Gao et al. '13]

  11. Relaxation Approaches
  1. Encoding: choose $c : \mathcal{Y} \to \mathbb{R}^M$
  2. Learning: given $(x_i, c(y_i))_{i=1}^n$, find $g_n : \mathcal{X} \to \mathbb{R}^M$
  3. Decoding: choose $d : \mathbb{R}^M \to \mathcal{Y}$ and let $f_n(x) = (d \circ g_n)(x)$

  12. Example I: Binary Classification. Let $\mathcal{Y} = \{-1, 1\}$
  1. Encoding: $c : \{-1, 1\} \to \mathbb{R}$ the identity
  2. Learning: scalar learning, $g_n : \mathcal{X} \to \mathbb{R}$
  3. Decoding: $d = \operatorname{sign} : \mathbb{R} \to \{-1, 1\}$, so $f_n(x) = \operatorname{sign}(g_n(x))$

  13. Example II: Multi-class Classification. Let $\mathcal{Y} = \{1, \ldots, M\}$
  1. Encoding: $c : \mathcal{Y} \to \{e_1, \ldots, e_M\} \subset \mathbb{R}^M$, the canonical basis, $c(j) = e_j \in \mathbb{R}^M$
  2. Learning: multi-variate learning, $g_n : \mathcal{X} \to \mathbb{R}^M$
  3. Decoding: $d : \mathbb{R}^M \to \{1, \ldots, M\}$, with $f_n(x) = \operatorname*{argmax}_{j = 1, \ldots, M} e_j^\top g_n(x)$ (the $j$-th entry of $g_n(x)$)
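A toy sketch of this encode/learn/decode pipeline for the multi-class case, reusing the illustrative `fit_multivariate` / `predict` helpers from the earlier sketch (labels are taken in {0, ..., M-1} here purely for array indexing):

```python
import numpy as np

def encode(y, M):
    """c(j) = e_j: map class labels {0, ..., M-1} to canonical basis vectors of R^M."""
    C = np.zeros((len(y), M))
    C[np.arange(len(y)), y] = 1.0
    return C

def decode(G):
    """d(g(x)) = argmax_j e_j^T g(x): pick the coordinate with the largest value."""
    return np.argmax(G, axis=1)

# Hypothetical usage:
# C = encode(y_train, M)                 # 1. encoding
# W = fit_multivariate(X_train, C)       # 2. multi-variate surrogate learning
# y_pred = decode(predict(W, X_test))    # 3. decoding
```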

  14. A General Relaxation Approach

  15. A General Relaxation Approach
  Main Assumption: Structure Encoding Loss Function (SELF). Given $\triangle : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, there exist:
  ◮ an RKHS $\mathcal{H}_{\mathcal{Y}}$ with feature map $c : \mathcal{Y} \to \mathcal{H}_{\mathcal{Y}}$
  ◮ a bounded linear operator $V : \mathcal{H}_{\mathcal{Y}} \to \mathcal{H}_{\mathcal{Y}}$
  such that $\triangle(y, y') = \langle c(y), V c(y') \rangle_{\mathcal{H}_{\mathcal{Y}}}$ for all $y, y' \in \mathcal{Y}$.
  Note: if $V$ is positive semidefinite, then $\triangle$ is a kernel.

  16. SELF: Examples
  ◮ Binary classification: $c : \{-1, 1\} \to \mathbb{R}$ and $V = 1$.
  ◮ Multi-class classification: $c(j) = e_j \in \mathbb{R}^M$ and $V = \mathbf{1}\mathbf{1}^\top - I \in \mathbb{R}^{M \times M}$ (all-ones matrix minus the identity), which recovers the 0-1 loss.
  ◮ Kernel Dependency Estimation (KDE) [Weston et al. '02, Cortes et al. '05]: $\triangle(y, y') = 1 - h(y, y')$, with $h : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ a kernel on $\mathcal{Y}$.
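A quick sanity check of the multi-class example (my own small script, not from the slides): with $c(j) = e_j$ and $V$ the all-ones matrix minus the identity, $\langle c(y), V c(y') \rangle$ reproduces the 0-1 loss.

```python
import numpy as np

M = 5
V = np.ones((M, M)) - np.eye(M)   # V = 1 1^T - I
E = np.eye(M)                     # row j is the canonical vector c(j) = e_j

for j in range(M):
    for k in range(M):
        self_value = E[j] @ V @ E[k]          # <c(j), V c(k)>
        zero_one = 0.0 if j == k else 1.0     # 0-1 loss
        assert self_value == zero_one
print("The 0-1 loss is SELF with c(j) = e_j and V = ones - identity.")
```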

  17. SELF: Finite $\mathcal{Y}$
  All $\triangle$ on a discrete $\mathcal{Y}$ are SELF. Examples:
  ◮ Strings: edit distance, KL divergence, word error rate, ...
  ◮ Ordered sequences: rank loss, ...
  ◮ Graphs/Trees: graph/tree edit distance, subgraph matching, ...
  ◮ Discrete subsets: weighted overlap loss, ...
  ◮ ...

  18. SELF: More examples
  ◮ Histograms / probabilities: e.g. $\chi^2$, Hellinger, ...
  ◮ Manifolds: diffusion distances
  ◮ ...

  19. Relaxation with SELF
  1. Encoding: $c : \mathcal{Y} \to \mathcal{H}_{\mathcal{Y}}$, the canonical feature map of $\mathcal{H}_{\mathcal{Y}}$
  2. Surrogate learning: multi-variate regression, $g_n : \mathcal{X} \to \mathcal{H}_{\mathcal{Y}}$
  3. Decoding: $f_n(x) = \operatorname*{argmin}_{y \in \mathcal{Y}} \langle c(y), V g_n(x) \rangle_{\mathcal{H}_{\mathcal{Y}}}$

  20. Surrogate Learning
  Multi-variate learning with ridge regression:
  ◮ Parametrize: $g(x) = W\varphi(x)$, with $W \in \mathbb{R}^{M \times P}$ and $\varphi : \mathcal{X} \to \mathbb{R}^P$
  ◮ Learn (least squares): $g_n(x) = W_n \varphi(x)$, where $W_n = \operatorname*{argmin}_{W \in \mathbb{R}^{M \times P}} \frac{1}{n} \sum_{i=1}^n \| W\varphi(x_i) - c(y_i) \|_{\mathcal{H}_{\mathcal{Y}}}^2$

  21. Learning (cont.)
  Solution (1): $g_n(x) = W_n \varphi(x)$, with $W_n = C\,(\Phi^\top \Phi)^{-1} \Phi^\top = C A$, where $A = (\Phi^\top \Phi)^{-1} \Phi^\top \in \mathbb{R}^{n \times P}$
  ◮ $\Phi = [\varphi(x_1), \ldots, \varphi(x_n)] \in \mathbb{R}^{P \times n}$: input features
  ◮ $C = [c(y_1), \ldots, c(y_n)] \in \mathbb{R}^{M \times n}$: output features
  (1) In practice, add a regularizer!

  22. Decoding
  Lemma (Ciliberto, Rudi, Rosasco '16). Let $g_n(x) = C A \varphi(x)$ be the solution of the surrogate problem. Then
  $f_n(x) = \operatorname*{argmin}_{y \in \mathcal{Y}} \langle c(y), V g_n(x) \rangle_{\mathcal{H}_{\mathcal{Y}}}$
  can be written as
  $f_n(x) = \operatorname*{argmin}_{y \in \mathcal{Y}} \sum_{i=1}^n \alpha_i(x)\, \triangle(y, y_i)$
  where $(\alpha_1(x), \ldots, \alpha_n(x))^\top = A \varphi(x) \in \mathbb{R}^n$.

  23. Decoding
  Sketch of the proof:
  ◮ $g_n(x) = C A \varphi(x) = \sum_{i=1}^n \alpha_i(x)\, c(y_i)$, with $(\alpha_1(x), \ldots, \alpha_n(x))^\top = A \varphi(x) \in \mathbb{R}^n$
  ◮ Plugging $g_n(x)$ into the decoding objective:
  $\langle c(y), V g_n(x) \rangle_{\mathcal{H}_{\mathcal{Y}}} = \big\langle c(y), V \sum_{i=1}^n \alpha_i(x)\, c(y_i) \big\rangle_{\mathcal{H}_{\mathcal{Y}}} = \sum_{i=1}^n \alpha_i(x)\, \langle c(y), V c(y_i) \rangle_{\mathcal{H}_{\mathcal{Y}}} \overset{\text{(SELF)}}{=} \sum_{i=1}^n \alpha_i(x)\, \triangle(y, y_i)$

  24. SELF Learning
  Two steps:
  1. Surrogate learning: $(\alpha_1(x), \ldots, \alpha_n(x))^\top = A \varphi(x)$, with $A = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top$
  2. Decoding: $f_n(x) = \operatorname*{argmin}_{y \in \mathcal{Y}} \sum_{i=1}^n \alpha_i(x)\, \triangle(y, y_i)$
  Note:
  ◮ Implicit encoding: no need to know $\mathcal{H}_{\mathcal{Y}}$ or $V$ (extends the kernel trick)!
  ◮ Optimization over $\mathcal{Y}$ is problem specific and can be a challenge.
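The two steps above translate directly into code. Below is a minimal sketch, assuming a Gaussian kernel on the inputs and decoding by exhaustive search over a finite set of candidate outputs; the function names, kernel choice, and hyperparameters are illustrative, not taken from the paper.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def self_fit(X_tr, lam=1e-3, gamma=1.0):
    """Step 1 (surrogate learning), kernelized: alpha(x) = (K + lam*I)^{-1} k_x."""
    n = len(X_tr)
    K = gaussian_kernel(X_tr, X_tr, gamma)
    return np.linalg.solve(K + lam * np.eye(n), np.eye(n))   # (K + lam*I)^{-1}

def self_predict(x, X_tr, Y_tr, B, loss, candidates, gamma=1.0):
    """Step 2 (decoding): minimize sum_i alpha_i(x) * loss(y, y_i) over the candidates."""
    k_x = gaussian_kernel(X_tr, x[None, :], gamma).ravel()
    alpha = B @ k_x
    scores = [sum(a * loss(y, y_i) for a, y_i in zip(alpha, Y_tr)) for y in candidates]
    return candidates[int(np.argmin(scores))]
```

Here `loss` is any SELF loss $\triangle$ on the outputs (e.g. an edit distance on strings); when $\mathcal{Y}$ cannot be enumerated, the decoding step is replaced by a problem-specific optimizer.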

  25. Connections with Previous Work
  ◮ Score-learning approaches (e.g. StructSVM [Tsochantaridis et al. '05]): in StructSVM it is possible to choose any feature map on the output... here we show that this choice must be compatible with $\triangle$
  ◮ Kernel dependency estimation: $\triangle$ is (one minus) a kernel
  ◮ Conditional mean embeddings? [Smola et al. '07]

  26. Relaxation Analysis

  27. Relaxation Analysis
  Consider
  $\mathcal{E}(f) = \int_{\mathcal{X} \times \mathcal{Y}} \triangle(f(x), y) \, d\rho(x, y)$
  and
  $\mathcal{R}(g) = \int_{\mathcal{X} \times \mathcal{Y}} \| g(x) - c(y) \|^2 \, d\rho(x, y)$
  How are $\mathcal{R}(g_n)$ and $\mathcal{E}(f_n)$ related?

  28. Relaxation Analysis
  Let $f^* = \operatorname*{argmin}_{f : \mathcal{X} \to \mathcal{Y}} \mathcal{E}(f)$ and $g^* = \operatorname*{argmin}_{g : \mathcal{X} \to \mathcal{H}_{\mathcal{Y}}} \mathcal{R}(g)$.
  Key properties:
  ◮ Fisher Consistency (FC): $\mathcal{E}(d \circ g^*) = \mathcal{E}(f^*)$
  ◮ Comparison Inequality (CI): there exists $\theta : \mathbb{R} \to \mathbb{R}$ with $\theta(r) \to 0$ as $r \to 0$ such that $\mathcal{E}(d \circ g) - \mathcal{E}(f^*) \le \theta(\mathcal{R}(g) - \mathcal{R}(g^*))$ for all $g : \mathcal{X} \to \mathcal{H}_{\mathcal{Y}}$

  29. SELF Relaxation Analysis
  Theorem (Ciliberto, Rudi, Rosasco '16). Let $\triangle : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ be a SELF loss and $g^* : \mathcal{X} \to \mathcal{H}_{\mathcal{Y}}$ the least-squares "relaxed" solution. Then:
  ◮ Fisher Consistency: $\mathcal{E}(d \circ g^*) = \mathcal{E}(f^*)$
  ◮ Comparison Inequality: $\mathcal{E}(d \circ g) - \mathcal{E}(f^*) \lesssim \sqrt{\mathcal{R}(g) - \mathcal{R}(g^*)}$ for all $g : \mathcal{X} \to \mathcal{H}_{\mathcal{Y}}$

  30. SELF Relaxation Analysis (cont.)
  Lemma (Ciliberto, Rudi, Rosasco '16). Let $\triangle : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ be a SELF loss. Then
  $\mathcal{E}(f) = \int_{\mathcal{X}} \langle c(f(x)), V g^*(x) \rangle_{\mathcal{H}_{\mathcal{Y}}} \, d\rho_{\mathcal{X}}(x)$
  where $g^* : \mathcal{X} \to \mathcal{H}_{\mathcal{Y}}$ minimizes $\mathcal{R}(g) = \int_{\mathcal{X} \times \mathcal{Y}} \| g(x) - c(y) \|_{\mathcal{H}_{\mathcal{Y}}}^2 \, d\rho(x, y)$.
  Least squares on $\mathcal{H}_{\mathcal{Y}}$ is a good surrogate loss.

  31. Consistency and Generalization Bounds
  Theorem (Ciliberto, Rudi, Rosasco '16). If we consider a universal feature map and $\lambda = 1/\sqrt{n}$, then
  $\lim_{n \to \infty} \mathcal{E}(f_n) = \mathcal{E}(f^*)$ almost surely.
  Moreover, under mild assumptions,
  $\mathcal{E}(f_n) - \mathcal{E}(f^*) \lesssim n^{-1/4}$ (w.h.p.)
  Proof: relaxation analysis + (kernel) ridge regression results, namely $\mathcal{R}(g_n) - \mathcal{R}(g^*) \lesssim n^{-1/2}$.
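Spelling out how the two ingredients combine: applying the comparison inequality to $g_n$ and plugging in the ridge regression rate gives
$\mathcal{E}(f_n) - \mathcal{E}(f^*) \lesssim \sqrt{\mathcal{R}(g_n) - \mathcal{R}(g^*)} \lesssim \sqrt{n^{-1/2}} = n^{-1/4}.$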

  32. Remarks
  ◮ First result proving universal consistency and excess risk bounds for general structured prediction (partial results for KDE in [Giguère et al. '13])
  ◮ The rates are sharp for the class of SELF loss functions $\triangle$, i.e. they match known results for classification.
  ◮ Faster rates hold under further regularity conditions.

  33. Experiments: Ranking
  $\triangle_{\text{rank}}(f(x), y) = \sum_{i,j=1}^M \gamma(y)_{ij}\, \big(1 - \operatorname{sign}(f(x)_i - f(x)_j)\big) / 2$

  Method                          Rank Loss
  [Herbrich et al. '99]           0.432 ± 0.008
  [Dekel et al. '04]              0.432 ± 0.012
  [Duchi et al. '10]              0.430 ± 0.004
  [Tsochantaridis et al. '05]     0.451 ± 0.008
  [Ciliberto, Rudi, R. '16]       0.396 ± 0.003

  Ranking experiments on the MovieLens dataset with $\triangle_{\text{rank}}$ [Dekel et al. '04, Duchi et al. '10]. ~1600 movies, ~900 users.
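A direct transcription of $\triangle_{\text{rank}}$ as it would enter the decoding step (my own illustrative code; `gamma` is assumed here to be a precomputed nonnegative weight matrix with $\gamma(y)_{ij} > 0$ when $y$ ranks item $i$ above item $j$):

```python
import numpy as np

def rank_loss(scores, gamma):
    """Pairwise rank loss: sum_{i,j} gamma_ij * (1 - sign(scores_i - scores_j)) / 2.

    scores : shape (M,), predicted scores f(x) for the M items
    gamma  : shape (M, M), nonnegative preference weights derived from y
    """
    diff = scores[:, None] - scores[None, :]   # diff[i, j] = f(x)_i - f(x)_j
    return float((gamma * (1.0 - np.sign(diff)) / 2.0).sum())
```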

  34. Experiments: Digit Reconstruction
  Digit reconstruction on the USPS dataset.

  Loss             KDE ($\triangle_G$)    SELF ($\triangle_H$)
  $\triangle_G$    0.149 ± 0.013          0.172 ± 0.011
  $\triangle_H$    0.736 ± 0.032          0.647 ± 0.017
  $\triangle_R$    0.294 ± 0.012          0.193 ± 0.015

  ◮ $\triangle_G(f(x), y) = 1 - k(f(x), y)$, with $k$ a Gaussian kernel on the output.
  ◮ $\triangle_H(f(x), y) = \| \sqrt{f(x)} - \sqrt{y} \|$, the Hellinger distance.
  ◮ $\triangle_R(f(x), y)$: recognition accuracy of an SVM digit classifier.

  35. Experiments: Robust Estimation
  $\triangle_{\text{Cauchy}}(f(x), y) = \frac{c^2}{2} \log\!\left(1 + \frac{\| f(x) - y \|^2}{c^2}\right)$, with $c > 0$

  n       SELF           RNW            KRR
  50      0.39 ± 0.17    0.45 ± 0.18    0.62 ± 0.13
  100     0.21 ± 0.04    0.29 ± 0.04    0.47 ± 0.09
  200     0.12 ± 0.02    0.24 ± 0.03    0.33 ± 0.04
  500     0.08 ± 0.01    0.22 ± 0.02    0.31 ± 0.03
  1000    0.07 ± 0.01    0.21 ± 0.02    0.19 ± 0.02

  [Figure: estimated regression functions of Alg. 1 (SELF), RNW, and KRLS on the robust estimation problem.]
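Since the outputs here live in a continuous space, decoding cannot enumerate $\mathcal{Y}$; one simple option is gradient descent on the weighted objective $\sum_i \alpha_i(x)\, \triangle_{\text{Cauchy}}(y, y_i)$. A rough sketch under that assumption (not the authors' implementation; the warm start, step size, and iteration count are arbitrary choices):

```python
import numpy as np

def decode_cauchy(alpha, Y_tr, c=1.0, steps=200, lr=0.1):
    """Gradient descent on y -> sum_i alpha_i * (c^2 / 2) * log(1 + ||y - y_i||^2 / c^2)."""
    Y_tr = np.atleast_2d(Y_tr)            # shape (n, d)
    y = Y_tr.mean(axis=0)                 # warm start at the mean of the training outputs
    for _ in range(steps):
        diff = y - Y_tr                   # (n, d)
        weights = alpha / (1.0 + (diff ** 2).sum(axis=1) / c ** 2)
        grad = (weights[:, None] * diff).sum(axis=0)   # gradient of the weighted objective
        y = y - lr * grad
    return y
```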

  36. Outline Standard Supervised Learning Structured Prediction with SELF Algorithm Theory Experiments Conclusions
