SLIDE 1

Duality in vv-RKHSs with Infinite Dimensional Outputs: Application to Robust Losses

Pierre Laforgue, Alex Lambert, Luc Brogat-Motte, Florence d'Alché-Buc

LTCI, Télécom Paris, Institut Polytechnique de Paris, France

SLIDE 2

Outline

• Motivations
• A duality theory for general OVKs
• Robust losses as convolutions
• Experiments
• Conclusion

SLIDE 3

Motivation 1: structured prediction by surrogate approach

Kernel trick in the input space, and kernel trick in the output space [Cortes '05, Geurts '06, Brouard '11, Kadri '13, Brouard '16]: Input Output Kernel Regression (IOKR).

ĥ = argmin_{h ∈ H_K} (1/2n) Σ_{i=1}^n ‖φ(y_i) − h(x_i)‖²_{F_Y} + (Λ/2) ‖h‖²_{H_K},

g(x) = argmin_{y ∈ Y} ‖φ(y) − ĥ(x)‖_{F_Y}

SLIDE 4

Motivation 2: function to function regression

[Figure: EMG curves (Millivolts vs. seconds) and lip acceleration curves (Meters/s² vs. seconds)]

min_{h ∈ H_K} (1/2n) Σ_{i=1}^n ‖y_i − h(x_i)‖²_{L²} + (Λ/2) ‖h‖²  [Kadri et al., 2016]

And many more! E.g., structured data autoencoding [Laforgue et al., 2019]:

min_{(h1, h2) ∈ H¹_K × H²_K} (1/2n) Σ_{i=1}^n ‖φ(x_i) − h2 ∘ h1(φ(x_i))‖²_{F_X} + Λ Reg(h1, h2).

SLIDE 5

Purpose of this work. Question: is it possible to extend the previous approaches to different (ideally robust) loss functions? First answer: yes, via extensions to maximum-margin regression [Brouard et al., 2016] and to ǫ-insensitive loss functions for matrix-valued kernels [Sangnier et al., 2017]. What about general Operator-Valued Kernels (OVKs)? What about other types of loss functions?

SLIDE 6

Outline

• Motivations
• A duality theory for general OVKs
• Robust losses as convolutions
• Experiments
• Conclusion
SLIDE 7

Learning in vector-valued RKHSs (vv-RKHSs)

• K: X × X → L(Y), with K(x, x′) = K(x′, x)^* and Σ_{i,j} ⟨y_i, K(x_i, x_j) y_j⟩_Y ≥ 0
• Unique vv-RKHS H_K ⊂ F(X, Y), with H_K = Span{K(·, x) y : (x, y) ∈ X × Y}
• Ex: decomposable OVK K(x, x′) = k(x, x′) A, with k scalar and A p.s.d. on Y
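For finite-dimensional outputs Y = R^p, the Gram matrix of a decomposable OVK is a Kronecker product; a minimal NumPy sketch (function name is ours):

```python
import numpy as np

def decomposable_gram(K_x, A):
    """Big (n*p, n*p) Gram matrix of the decomposable OVK
    K(x, x') = k(x, x') A, for finite-dimensional Y = R^p:
    block (i, j) equals k(x_i, x_j) * A."""
    return np.kron(K_x, A)
```

np.kron places the block k(x_i, x_j)·A at position (i, j), and the result is p.s.d. whenever K_x and A are.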

SLIDE 8

Learning in vector-valued RKHSs (vv-RKHSs)

• For {(x_i, y_i)}_{i=1}^n ∈ (X × Y)^n with Y a Hilbert space, we want to find:

ĥ ∈ argmin_{h ∈ H_K} (1/n) Σ_{i=1}^n ℓ(h(x_i), y_i) + (Λ/2) ‖h‖²_{H_K}.

• Representer Theorem [Micchelli and Pontil, 2005]: ∃ (α̂_i)_{i=1}^n ∈ Y^n (infinite dimensional!) s.t. ĥ = Σ_{i=1}^n K(·, x_i) α̂_i.

• When ℓ(·, ·) = (1/2) ‖· − ·‖²_Y and K = k · I_Y: α̂_i = Σ_{j=1}^n A_{ij} y_j, with A = (K + nΛ I_n)^{-1}.
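Under the square loss with the identity-decomposable kernel K = k · I_Y, the closed form above amounts to a linear solve; a minimal NumPy sketch with a finite-dimensional stand-in for Y (function names are ours):

```python
import numpy as np

def ridge_dual_coefs(K_x, Y, lam):
    """alpha_i = sum_j A_ij y_j with A = (K + n*Lambda*I)^(-1),
    i.e. the stacked coefficients solve (K + n*Lambda*I) alpha = Y.

    K_x : (n, n) scalar Gram matrix k(x_i, x_j)
    Y   : (n, d) outputs, d standing in for the possibly
          infinite-dimensional output space
    """
    n = K_x.shape[0]
    return np.linalg.solve(K_x + n * lam * np.eye(n), Y)

def predict(K_test, alpha):
    """h(x) = sum_i k(x, x_i) alpha_i, for each test row of K_test."""
    return K_test @ alpha
```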

SLIDE 9

Applying duality

ĥ ∈ argmin_{h ∈ H_K} (1/n) Σ_{i=1}^n ℓ_i(h(x_i)) + (Λ/2) ‖h‖²_{H_K}

is given by ĥ = (1/Λn) Σ_{i=1}^n K(·, x_i) α̂_i, with (α̂_i)_{i=1}^n ∈ Y^n the solutions to the dual problem:

min_{(α_i)_{i=1}^n ∈ Y^n} Σ_{i=1}^n ℓ_i^⋆(−α_i) + (1/2Λn) Σ_{i,j=1}^n ⟨α_i, K(x_i, x_j) α_j⟩_Y,

with f^⋆: α ∈ Y ↦ sup_{y ∈ Y} ⟨α, y⟩_Y − f(y) the Fenchel-Legendre transform of f.
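In one dimension the Fenchel-Legendre transform can be approximated on a grid; a toy sketch (helper name is ours), using the standard fact that f = ½|·|² is its own transform:

```python
import numpy as np

def fl_transform(f, grid):
    """Discretized Fenchel-Legendre transform:
    f*(a) = sup_y a*y - f(y), the sup taken over the grid."""
    return np.array([np.max(a * grid - f(grid)) for a in grid])

grid = np.linspace(-5, 5, 2001)
f = lambda y: 0.5 * y**2
f_star = fl_transform(f, grid)   # should match 0.5 * a**2 on the grid
```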

SLIDE 10

Applying duality (continued)

• 1st limitation: the FL transform ℓ^⋆ needs to be computable (→ assumption)
• 2nd limitation: the dual variables (α_i)_{i=1}^n are still infinite dimensional!

SLIDE 11

Applying duality (continued)

If Y := Span{y_j, j ≤ n} is invariant under K, i.e. ∀(x, x′), y ∈ Y ⇒ K(x, x′) y ∈ Y, then α̂_i ∈ Y, which makes possible the reparametrization α̂_i = Σ_j ω̂_{ij} y_j.

SLIDE 12

The double representer theorem (1/2)

Assume that the OVK K and the loss ℓ satisfy the appropriate assumptions (see the paper for details; they are verified by standard kernels and losses). Then

ĥ = argmin_{h ∈ H_K} (1/n) Σ_i ℓ(h(x_i), y_i) + (Λ/2) ‖h‖²_{H_K}

is given by

ĥ = (1/Λn) Σ_{i,j=1}^n K(·, x_i) ω̂_{ij} y_j,

with Ω̂ = [ω̂_{ij}] ∈ R^{n×n} the solution to the finite dimensional problem

min_{Ω ∈ R^{n×n}} Σ_{i=1}^n L_i(Ω_{i:}, K^Y) + (1/2Λn) Tr(M̃^⊤ (Ω ⊗ Ω)),

with M̃ the n² × n² matrix rewriting of M, s.t. M_{ijkl} = ⟨y_k, K(x_i, x_j) y_l⟩_Y.

SLIDE 13

The double representer theorem (2/2)

If K further satisfies K(x, x′) = Σ_t k_t(x, x′) A_t, then the tensor M simplifies to M_{ijkl} = Σ_t [K^X_t]_{ij} [K^Y_t]_{kl}, and the problem rewrites as

min_{Ω ∈ R^{n×n}} Σ_{i=1}^n L_i(Ω_{i:}, K^Y) + (1/2Λn) Σ_{t=1}^T Tr(K^X_t Ω K^Y_t Ω^⊤).

• Rmk. Only the n⁴ tensor ⟨y_k, K(x_i, x_j) y_l⟩_Y is needed to learn OVK machines.
• Rmk. It simplifies to the two n² matrices K^X_{ij}, K^Y_{kl} if K is decomposable.
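For a decomposable kernel (T = 1), the quadratic term of the finite-dimensional problem reduces to ordinary matrix products, and can be cross-checked against the full n⁴ tensor; a small NumPy sketch (names are ours):

```python
import numpy as np

def quad_term(K_x, K_y, Omega, lam):
    """(1 / 2*Lambda*n) * Tr(K_x Omega K_y Omega^T): the quadratic part
    of the finite-dimensional dual problem for a decomposable OVK."""
    n = K_x.shape[0]
    return np.trace(K_x @ Omega @ K_y @ Omega.T) / (2 * lam * n)

def quad_term_full(M, Omega, lam):
    """Same value from the full tensor M[i, j, k, l] = <y_k, K(x_i, x_j) y_l>."""
    n = Omega.shape[0]
    return np.einsum('ik,ijkl,jl->', Omega, M, Omega) / (2 * lam * n)
```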

How to apply the duality approach?

SLIDE 14

Outline

• Motivations
• A duality theory for general OVKs
• Robust losses as convolutions
• Experiments
• Conclusion
SLIDE 15

Infimal convolution and Fenchel-Legendre transforms

Infimal-convolution operator between proper lower semicontinuous functions [Bauschke et al., 2011]:

(f □ g)(x) = inf_y f(y) + g(x − y).

Relation to the FL transform: (f □ g)^⋆ = f^⋆ + g^⋆.

Ex: ǫ-insensitive losses. Let ℓ: Y → R be a convex loss with unique minimum at 0, and ǫ > 0. The ǫ-insensitive version of ℓ, denoted ℓ_ǫ, is defined by:

ℓ_ǫ(y) = (ℓ □ χ_{B_ǫ})(y) = ℓ(0) if ‖y‖_Y ≤ ǫ, and inf_{‖d‖_Y ≤ 1} ℓ(y − ǫd) otherwise,

and has FL transform: ℓ_ǫ^⋆(y) = (ℓ □ χ_{B_ǫ})^⋆(y) = ℓ^⋆(y) + ǫ‖y‖.

SLIDE 16

Interesting loss functions: sparsity and robustness

[Figure: the three losses and their infimal-convolution decompositions]

• ǫ-Ridge: (1/2)‖·‖² □ χ_{B_ǫ}  (sparsity)
• ǫ-SVR: ‖·‖ □ χ_{B_ǫ}  (sparsity, robustness)
• κ-Huber: κ‖·‖ □ (1/2)‖·‖²  (robustness)

SLIDE 17

Specific dual problems

For the ǫ-ridge, ǫ-SVR and κ-Huber losses, it holds Ω̂ = Ŵ V^{-1}, with Ŵ the solution to the following finite dimensional dual problems:

(D1)  min_{W ∈ R^{n×n}} (1/2) ‖AW − B‖²_Fro + ǫ ‖W‖_{2,1},

(D2)  min_{W ∈ R^{n×n}} (1/2) ‖AW − B‖²_Fro + ǫ ‖W‖_{2,1}  s.t. ‖W‖_{2,∞} ≤ 1,

(D3)  min_{W ∈ R^{n×n}} (1/2) ‖AW − B‖²_Fro  s.t. ‖W‖_{2,∞} ≤ κ,

with V, A, B such that VV^⊤ = K^Y, A^⊤A = K^X/(Λn) + I_n (or A^⊤A = K^X/(Λn) for the ǫ-SVR), and A^⊤B = V.
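(D1) is a group-Lasso-type problem; assuming proximal gradient descent as the solver (the paper may use a different algorithm), a sketch using the row-wise soft-thresholding prox of ǫ‖·‖_{2,1} (names, step size, and iteration count are ours):

```python
import numpy as np

def prox_l21(W, tau):
    """Prox of tau * ||W||_{2,1}: row-wise soft-thresholding."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return scale * W

def solve_d1(A, B, eps, n_iter=500):
    """Proximal gradient on (1/2)||A W - B||_Fro^2 + eps * ||W||_{2,1}."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L with L = ||A^T A||_2
    W = np.zeros((A.shape[1], B.shape[1]))
    for _ in range(n_iter):
        grad = A.T @ (A @ W - B)
        W = prox_l21(W - step * grad, step * eps)
    return W
```

With eps = 0 this reduces to plain gradient descent on the least-squares term; large eps drives whole rows of W to zero, which is the sparsity mechanism of the ǫ-insensitive losses.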

SLIDE 18

Outline

• Motivations
• A duality theory for general OVKs
• Robust losses as convolutions
• Experiments
• Conclusion
SLIDE 19

Surrogate approaches for structured prediction

  • Experiments on YEAST dataset
  • Empirically, ǫ-SV-IOKR outperforms ridge-IOKR for a wide range of ǫ
  • Promotes sparsity and acts as a regularizer

Figure 1: MSEs (comparison ǫ-SVR / KRR) and sparsity (% null components) w.r.t. Λ, for several values of ǫ

SLIDE 20

Robust function-to-function regression

Task from [Kadri et al., 2016]: predict lip acceleration from EMG signals.

  • Dataset augmented with outliers, model learned with Huber loss
  • Improvement for every output size M (see paper for approximation)

Figure 2: LOO generalization error w.r.t. κ for output sizes m = 4, 5, 6, 7, 15, compared to ridge regression (κ = +∞)

SLIDE 21

Outline

• Motivations
• A duality theory for general OVKs
• Robust losses as convolutions
• Experiments
• Conclusion
SLIDE 22

Conclusion

State of the art:

• OVKs and vv-RKHSs tailored to infinite dimensional outputs
• Representer Theorem (RT): an expansion with little information on the coefficients
• Duality: the coefficients are solutions to the (infinite dimensional) dual problem

Contributions:

• Double RT: the coefficients are linear combinations of the outputs
• Allows one to cope with many losses (ǫ-insensitive, Huber) and kernels
• Empirical improvements on surrogate approaches

Much more in the paper!

• Thorough algorithmic stability analysis
• What if Y is not invariant under K?

SLIDE 23

References I

Audiffren, J. and Kadri, H. (2013). Stability of multi-task kernel regression algorithms. In Asian Conference on Machine Learning, pages 1–16.

Bauschke, H. H., Combettes, P. L., et al. (2011). Convex Analysis and Monotone Operator Theory in Hilbert Spaces, volume 408. Springer.

Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526.

Brouard, C., Szafranski, M., and d'Alché-Buc, F. (2016). Input output kernel regression: Supervised and semi-supervised structured output prediction with operator-valued kernels. Journal of Machine Learning Research, 17:176:1–176:48.

SLIDE 24

References II

Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, pages 73–101.

Kadri, H., Duflos, E., Preux, P., Canu, S., Rakotomamonjy, A., and Audiffren, J. (2016). Operator-valued kernels for learning from functional response data. Journal of Machine Learning Research, 17:20:1–20:54.

Kadri, H., Ghavamzadeh, M., and Preux, P. (2013). A generalized kernel approach to structured output learning. In International Conference on Machine Learning (ICML), pages 471–479.

Laforgue, P., Clémençon, S., and d'Alché-Buc, F. (2019). Autoencoding any data through kernel autoencoders. In Artificial Intelligence and Statistics, pages 1061–1069.

SLIDE 25

References III

Micchelli, C. A. and Pontil, M. (2005). On learning vector-valued functions. Neural Computation, 17(1):177–204.

Moreau, J. J. (1962). Fonctions convexes duales et points proximaux dans un espace hilbertien. Comptes rendus hebdomadaires des séances de l'Académie des sciences, 255:2897–2899.

Sangnier, M., Fercoq, O., and d'Alché-Buc, F. (2017). Data sparse nonparametric regression with ǫ-insensitive losses. In Asian Conference on Machine Learning, pages 192–207.

SLIDE 26

On the invariance assumption

With Y = Span{y_j, j ≤ n}, the assumption reads: ∀(x, x′) ∈ X², ∀y ∈ Y: y ∈ Y ⇒ K(x, x′) y ∈ Y.

• We do not need it to hold for every collection {y_i}_{i≤n} ∈ Y^n
• It is rather an a posteriori condition ensuring that the kernel is aligned with the outputs
• The little we know about Y should be preserved through K
• If Y is finite dimensional and there are sufficiently many outputs, then Y = Y
• Identity-decomposable kernels fit (nontrivial in infinite dimension)
• The empirical covariance kernel Σ_i y_i ⊗ y_i [Kadri et al., 2013] fits
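For the empirical covariance kernel, the invariance is easy to sanity-check numerically with finite-dimensional outputs; a toy sketch (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((4, 10))   # n = 4 outputs y_i in R^10 (rows)
A = Y.T @ Y                        # empirical covariance: sum_i y_i (x) y_i
k_xx = 0.7                         # some value of the scalar kernel k(x, x')

y = 0.3 * Y[0] - 1.2 * Y[2]        # an element of Y = Span{y_i}
Ky = k_xx * (A @ y)                # K(x, x') y for the kernel k(.,.) * A

# Residual of Ky after projecting back onto Span{y_i}: should vanish,
# since A y = sum_i <y_i, y> y_i already lies in the span.
coeffs, *_ = np.linalg.lstsq(Y.T, Ky, rcond=None)
residual = np.linalg.norm(Y.T @ coeffs - Ky)
```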

SLIDE 27

Admissible kernels

• K(s, t) = Σ_i k_i(s, t) y_i ⊗ y_i, with k_i positive semi-definite (p.s.d.) scalar kernels for all i ≤ n
• K(s, t) = Σ_i μ_i k(s, t) y_i ⊗ y_i, with k a p.s.d. scalar kernel and μ_i ≥ 0 for all i ≤ n
• K(s, t) = Σ_i k(s, x_i) k(t, x_i) y_i ⊗ y_i
• K(s, t) = Σ_{i,j} k_{ij}(s, t) (y_i + y_j) ⊗ (y_i + y_j), with k_{ij} p.s.d. scalar kernels for all i, j ≤ n
• K(s, t) = Σ_{i,j} μ_{ij} k(s, t) (y_i + y_j) ⊗ (y_i + y_j), with k a p.s.d. scalar kernel and μ_{ij} ≥ 0
• K(s, t) = Σ_{i,j} k(s, x_i, x_j) k(t, x_i, x_j) (y_i + y_j) ⊗ (y_i + y_j)

SLIDE 28

Admissible losses

∀i ≤ n, ∀(α_Y, α_⊥) ∈ Y × Y^⊥:  ℓ_i^⋆(α_Y) ≤ ℓ_i^⋆(α_Y + α_⊥)

• ℓ_i(y) = f(⟨y, z_i⟩), with z_i ∈ Y and f: R → R convex. Maximum-margin is obtained with z_i = y_i and f(t) = max(0, 1 − t).
• ℓ(y) = f(‖y‖), with f: R₊ → R convex increasing s.t. t ↦ f′(t)/t is continuous over R₊. Includes the functions (λ/η) ‖y‖_Y^η for η > 1, λ > 0.
• ∀λ > 0, with B_λ the centered ball of radius λ:
  ◮ ℓ(y) = λ‖y‖, ◮ ℓ(y) = λ‖y‖ log(‖y‖), ◮ ℓ(y) = χ_{B_λ}(y), ◮ ℓ(y) = λ(exp(‖y‖) − 1).
• ℓ_i(y) = f(y − y_i), with f^⋆ verifying the condition.
• Infimal convolutions of functions verifying the condition (ǫ-insensitive losses [Sangnier et al., 2017], the Huber loss [Huber, 1964], Moreau or Pasch-Hausdorff envelopes [Moreau, 1962, Bauschke et al., 2011]).

SLIDE 29

Proof of the Double Representer Theorem

Dual problem:

(α̂_i)_{i=1}^n ∈ argmin_{(α_i)_{i=1}^n ∈ Y^n} Σ_{i=1}^n ℓ_i^⋆(−α_i) + (1/2Λn) Σ_{i,j=1}^n ⟨α_i, K(x_i, x_j) α_j⟩_Y.

• Decompose α̂_i = α_i^Y + α_i^⊥, with (α_i^Y)_{i≤n} ∈ Y^n and (α_i^⊥)_{i≤n} ∈ (Y^⊥)^n
• Assume that ℓ_i^⋆(α_Y) ≤ ℓ_i^⋆(α_Y + α_⊥) (satisfied if ℓ relies on ⟨·, ·⟩)

Then it holds:

Σ_{i=1}^n ℓ_i^⋆(−α_i^Y) + (1/2Λn) Σ_{i,j=1}^n ⟨α_i^Y, K(x_i, x_j) α_j^Y⟩_Y
≤ Σ_{i=1}^n ℓ_i^⋆(−α_i^Y − α_i^⊥) + (1/2Λn) Σ_{i,j=1}^n ⟨α_i^Y + α_i^⊥, K(x_i, x_j)(α_j^Y + α_j^⊥)⟩_Y,

so the projections (α_i^Y)_{i≤n} are also optimal.

SLIDE 30

Approximating the dual problem if no invariance

The kernel K = k · A is a separable OVK, with A a compact operator. There exists an o.n.b. (ψ_j)_{j=1}^∞ of Y s.t. A = Σ_{j=1}^∞ λ_j ψ_j ⊗ ψ_j (λ_j ≥ 0), and there exist (ω̂_i)_{i=1}^n ∈ ℓ²(R)^n such that ∀i ≤ n, α̂_i = Σ_{j=1}^∞ ω̂_{ij} ψ_j. Denoting Y_m = span({ψ_j}_{j=1}^m) and S = diag(λ_j)_{j=1}^m, solve instead:

min_{(α_i)_{i=1}^n ∈ Y_m^n} Σ_{i=1}^n ℓ_i^⋆(−α_i) + (1/2Λn) Σ_{i,j=1}^n ⟨α_i, K(x_i, x_j) α_j⟩_Y.

The final solution is given by ĥ = (1/Λn) Σ_{i=1}^n Σ_{j=1}^m k(·, x_i) λ_j ω̂_{ij} ψ_j, with Ω̂ the solution to:

min_{Ω ∈ R^{n×m}} Σ_{i=1}^n L_i(Ω_{i:}, R_{i:}) + (1/2Λn) Tr(K^X Ω S Ω^⊤).
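With a finite-dimensional stand-in for the compact operator A, extracting the truncation (λ_j, ψ_j)_{j≤m} is a plain eigendecomposition; a sketch (function name is ours):

```python
import numpy as np

def truncate_operator(A, m):
    """Rank-m spectral truncation of a symmetric p.s.d. matrix A
    (finite-dimensional stand-in for the compact operator A):
    returns S = diag of the m largest eigenvalues, and Psi with the
    corresponding orthonormal eigenvectors as columns."""
    vals, vecs = np.linalg.eigh(A)        # eigh returns ascending order
    idx = np.argsort(vals)[::-1][:m]      # keep the m largest
    return np.diag(vals[idx]), vecs[:, idx]
```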

SLIDE 31

Application to kernel autoencoding

• Experiments on molecules with the Tanimoto-Gaussian kernel
• Empirical improvements for a wide range of ǫ
• Introduces sparsity

Figure 3: Performance of the ǫ-insensitive Kernel Autoencoder: test reconstruction MSE and ‖W‖_{2,1} w.r.t. ǫ (ǫ-KAE vs. standard KAE, with the number of discarded data points)

SLIDE 32

Algorithmic stability analysis [Bousquet and Elisseeff, 2002]

Algorithm A has stability β if for any sample S_n and any i ≤ n, it holds:

sup_{(x,y) ∈ X×Y} |ℓ(h_{A(S_n)}(x), y) − ℓ(h_{A(S_n^{∖i})}(x), y)| ≤ β.

Let A be an algorithm with stability β and a loss function bounded by M. Then, for any n ≥ 1 and δ ∈ (0, 1), it holds with probability at least 1 − δ:

R(h_{A(S_n)}) ≤ R̂_n(h_{A(S_n)}) + 2β + (4nβ + M) √(ln(1/δ) / (2n)).

If ‖K(x, x)‖_op ≤ γ², and |ℓ(h_S(x), y) − ℓ(h_{S∖i}(x), y)| ≤ C ‖h_S(x) − h_{S∖i}(x)‖_Y, then the OVK algorithm has stability β ≤ C²γ²/(Λn) [Audiffren and Kadri, 2013].

Loss      M                                        C
ǫ-SVR     γ√(2(M_Y − ǫ)/Λ) + M_Y − ǫ               1
ǫ-Ridge   (M_Y − ǫ)² (1 + 2√2 γ/√Λ + 2γ²/Λ)        2(M_Y − ǫ)(1 + √2 γ/√Λ)
κ-Huber   κ(γ√(2κ(M_Y − κ/2)/Λ) + M_Y − κ/2)       κ