Duality in vv-RKHSs with Infinite Dimensional Outputs: Application to Robust Losses

Pierre Laforgue, Alex Lambert, Luc Brogat-Motte, Florence d'Alché-Buc
LTCI, Télécom Paris, Institut Polytechnique de Paris, France

Outline
- Motivations
- A duality theory for general OVKs
- Robust losses as convolutions
- Experiments
- Conclusion
Motivation 1: structured prediction by surrogate approach
Kernel trick in the input space, and kernel trick in the output space [Cortes '05, Geurts '06, Brouard '11, Kadri '13, Brouard '16]: Input Output Kernel Regression (IOKR).
$$\hat h = \operatorname*{argmin}_{h \in \mathcal{H}_K}\; \frac{1}{2n} \sum_{i=1}^n \big\| \varphi(y_i) - h(x_i) \big\|_{\mathcal{F}_{\mathcal{Y}}}^2 + \frac{\Lambda}{2} \|h\|_{\mathcal{H}_K}^2, \qquad g(x) = \operatorname*{argmin}_{y \in \mathcal{Y}}\; \big\| \varphi(y) - \hat h(x) \big\|_{\mathcal{F}_{\mathcal{Y}}}$$
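To make the surrogate pipeline concrete, here is a minimal numpy sketch of IOKR under hypothetical choices (Gaussian input kernel, random stand-ins for the output features $\varphi(y_i)$, decoding over the training candidates); it is an illustration, not the authors' code:

```python
import numpy as np

# Minimal IOKR sketch (hypothetical data and kernels): learn h mapping inputs
# to output features, then decode by nearest candidate in feature space.

def gaussian_gram(A, B, gamma=1.0):
    """Gram matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # inputs
Phi = rng.normal(size=(50, 7))          # stand-in for output features phi(y_i)
lam, n = 0.1, len(X)

Kx = gaussian_gram(X, X)
C = np.linalg.solve(Kx + n * lam * np.eye(n), Phi)   # ridge coefficients

def h(x):
    """Surrogate prediction hat{h}(x) = sum_i k(x, x_i) c_i, in feature space."""
    return gaussian_gram(x[None, :], X) @ C

# Decoding step: g(x) = argmin over a candidate set of ||phi(y) - h(x)||^2.
candidates = Phi                         # here: reuse the training outputs
x_new = rng.normal(size=3)
scores = ((candidates - h(x_new)) ** 2).sum(-1)
y_hat_index = int(np.argmin(scores))
```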
Motivation 2: function to function regression
[Figure: input EMG curves (millivolts vs. seconds) and output lip acceleration curves (meters/s^2 vs. seconds)]
$$\min_{h \in \mathcal{H}_K}\; \frac{1}{2n} \sum_{i=1}^n \big\| y_i - h(x_i) \big\|_{L^2}^2 + \frac{\Lambda}{2} \|h\|^2 \quad \text{[Kadri et al., 2016]}$$

And many more! E.g., structured data autoencoding [Laforgue et al., 2019]:

$$\min_{h_1, h_2 \in \mathcal{H}_K^1 \times \mathcal{H}_K^2}\; \frac{1}{2n} \sum_{i=1}^n \big\| \varphi(x_i) - h_2 \circ h_1(\varphi(x_i)) \big\|_{\mathcal{F}_{\mathcal{X}}}^2 + \Lambda\, \mathrm{Reg}(h_1, h_2).$$
Purpose of this work

Question: can the previous approaches be extended to different (ideally robust) loss functions?

First answer: yes, with existing extensions to maximum-margin regression [Brouard et al., 2016] and to ε-insensitive losses for matrix-valued kernels [Sangnier et al., 2017].

What about general Operator-Valued Kernels (OVKs)? What about other types of loss functions?
Learning in vector-valued RKHSs (vv-RKHSs)

- $K \colon \mathcal{X} \times \mathcal{X} \to \mathcal{L}(\mathcal{Y})$, with $K(x, x') = K(x', x)^*$ and $\sum_{i,j} \langle y_i, K(x_i, x_j) y_j \rangle_{\mathcal{Y}} \ge 0$
- Unique vv-RKHS $\mathcal{H}_K \subset \mathcal{F}(\mathcal{X}, \mathcal{Y})$, with $\mathcal{H}_K = \mathrm{Span}\,\{K(\cdot, x)y : (x, y) \in \mathcal{X} \times \mathcal{Y}\}$
- Ex: decomposable OVK $K(x, x') = k(x, x')A$, with $k$ scalar and $A$ p.s.d. on $\mathcal{Y}$
- For $\{(x_i, y_i)\}_{i=1}^n \in (\mathcal{X} \times \mathcal{Y})^n$ with $\mathcal{Y}$ a Hilbert space, we want to find:

$$\hat h \in \operatorname*{argmin}_{h \in \mathcal{H}_K}\; \frac{1}{n} \sum_{i=1}^n \ell\big(h(x_i), y_i\big) + \frac{\Lambda}{2} \|h\|_{\mathcal{H}_K}^2.$$

Representer Theorem [Micchelli and Pontil, 2005]: there exist $(\hat\alpha_i)_{i=1}^n \in \mathcal{Y}^n$ (infinite dimensional!) s.t. $\hat h = \sum_{i=1}^n K(\cdot, x_i)\hat\alpha_i$. When $\ell(\cdot, \cdot) = \frac{1}{2}\|\cdot - \cdot\|_{\mathcal{Y}}^2$ and $K = k \cdot I_{\mathcal{Y}}$: $\hat\alpha_i = \sum_{j=1}^n A_{ij} y_j$, with $A = (K^{\mathcal{X}} + n\Lambda I_n)^{-1}$.
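As an illustration of the representer theorem's closed form beyond $K = k \cdot I_{\mathcal{Y}}$, here is a minimal sketch for the square loss with a decomposable OVK $K = k \cdot A$; the data are synthetic, $\mathcal{Y}$ is taken finite dimensional purely for illustration, and the Kronecker system is our own derivation for this case:

```python
import numpy as np

# Sketch: ridge regression in a vv-RKHS with a decomposable OVK
# K(x, x') = k(x, x') * A. For the square loss, valid representer
# coefficients solve the Kronecker system (K_X (x) A + n*lam*I) c = y_stacked.

def gaussian_gram(P, Q, gamma=1.0):
    sq = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(1)
n, d = 30, 4                      # d = dim(Y), finite here for illustration
X = rng.normal(size=(n, 2))
Y = rng.normal(size=(n, d))
lam = 0.1

Kx = gaussian_gram(X, X)
M = rng.normal(size=(d, d))
A = M @ M.T / d                   # p.s.d. output operator

# Solve for all coefficients alpha_i in R^d at once; block (i, j) of the
# Kronecker matrix is k(x_i, x_j) * A, matching h(x_i) = sum_j k_ij A alpha_j.
G = np.kron(Kx, A) + n * lam * np.eye(n * d)
alpha = np.linalg.solve(G, Y.reshape(-1)).reshape(n, d)

def h(x):
    """hat{h}(x) = sum_i k(x, x_i) A alpha_i (A is symmetric here)."""
    kx = gaussian_gram(x[None, :], X)[0]       # shape (n,)
    return (kx[:, None] * (alpha @ A)).sum(0)
```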
Applying duality

$$\hat h \in \operatorname*{argmin}_{h \in \mathcal{H}_K}\; \frac{1}{n} \sum_{i=1}^n \ell_i\big(h(x_i)\big) + \frac{\Lambda}{2} \|h\|_{\mathcal{H}_K}^2$$

is given by $\hat h = \frac{1}{\Lambda n} \sum_{i=1}^n K(\cdot, x_i)\hat\alpha_i$, with $(\hat\alpha_i)_{i=1}^n \in \mathcal{Y}^n$ the solutions to the dual problem:

$$\min_{(\alpha_i)_{i=1}^n \in \mathcal{Y}^n}\; \sum_{i=1}^n \ell_i^\star(-\alpha_i) + \frac{1}{2\Lambda n} \sum_{i,j=1}^n \langle \alpha_i, K(x_i, x_j)\alpha_j \rangle_{\mathcal{Y}},$$

with $f^\star \colon \alpha \in \mathcal{Y} \mapsto \sup_{y \in \mathcal{Y}} \langle \alpha, y \rangle_{\mathcal{Y}} - f(y)$ the Fenchel-Legendre transform of $f$.

- 1st limitation: the FL transform $\ell^\star$ needs to be computable (→ assumption)
- 2nd limitation: the dual variables $(\alpha_i)_{i=1}^n$ are still infinite dimensional!

If $\tilde{\mathcal{Y}} = \mathrm{Span}\{y_j, j \le n\}$ is invariant under $K$, i.e. $\forall (x, x')$, $y \in \tilde{\mathcal{Y}} \Rightarrow K(x, x')y \in \tilde{\mathcal{Y}}$, then $\hat\alpha_i \in \tilde{\mathcal{Y}}$, which allows the reparametrization $\hat\alpha_i = \sum_j \hat\omega_{ij} y_j$.
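For the square loss and $K = k \cdot I_{\mathcal{Y}}$, the FL transform is $\ell_i^\star(\alpha) = \frac{1}{2}\|\alpha\|^2 + \langle \alpha, y_i \rangle$, so the dual is an explicit quadratic. A small numerical check (synthetic data, our own construction) that the dual recovery formula matches the primal closed form:

```python
import numpy as np

# For l_i(u) = 0.5*||u - y_i||^2 and K = k * I_Y, the dual objective is
# sum_i [0.5*||a_i||^2 - <a_i, y_i>] + (1/(2*lam*n)) * sum_ij k_ij <a_i, a_j>,
# an unconstrained quadratic in the matrix of dual variables.

rng = np.random.default_rng(2)
n, d, lam = 20, 3, 0.5
X = rng.normal(size=(n, 2))
Y = rng.normal(size=(n, d))
Kx = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))

# Setting the dual gradient to zero: (I + Kx/(lam*n)) A = Y.
A_dual = np.linalg.solve(np.eye(n) + Kx / (lam * n), Y)

# Primal closed form from the representer theorem: C = (Kx + n*lam*I)^{-1} Y,
# with hat{h}(x) = sum_i k(x, x_i) c_i.
C_primal = np.linalg.solve(Kx + n * lam * np.eye(n), Y)

# The recovery formula hat{h} = (1/(lam*n)) * sum_i K(., x_i) a_i implies
# A_dual / (lam*n) == C_primal:
assert np.allclose(A_dual / (lam * n), C_primal)
```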
The double representer theorem (1/2)

Assume that the OVK $K$ and the loss $\ell$ satisfy the appropriate assumptions (see paper for details; they are verified by standard kernels and losses). Then

$$\hat h = \operatorname*{argmin}_{h \in \mathcal{H}_K}\; \frac{1}{n} \sum_i \ell\big(h(x_i), y_i\big) + \frac{\Lambda}{2} \|h\|_{\mathcal{H}_K}^2$$

is given by $\hat h = \frac{1}{\Lambda n} \sum_{i,j=1}^n K(\cdot, x_i)\, \hat\omega_{ij}\, y_j$, with $\hat\Omega = [\hat\omega_{ij}] \in \mathbb{R}^{n \times n}$ the solution to the finite dimensional problem

$$\min_{\Omega \in \mathbb{R}^{n \times n}}\; \sum_{i=1}^n L_i\big(\Omega_{i:}, K^{\mathcal{Y}}\big) + \frac{1}{2\Lambda n} \operatorname{Tr}\big(\tilde M^\top (\Omega \otimes \Omega)\big),$$

with $\tilde M$ the $n^2 \times n^2$ matrix rewriting of the tensor $M$ s.t. $M_{ijkl} = \langle y_k, K(x_i, x_j) y_l \rangle_{\mathcal{Y}}$.
The double representer theorem (2/2)

If $K$ further satisfies $K(x, x') = \sum_t k_t(x, x') A_t$, then the tensor $M$ simplifies to $M_{ijkl} = \sum_t [K_t^{\mathcal{X}}]_{ij} [K_t^{\mathcal{Y}}]_{kl}$ and the problem rewrites

$$\min_{\Omega \in \mathbb{R}^{n \times n}}\; \sum_{i=1}^n L_i\big(\Omega_{i:}, K^{\mathcal{Y}}\big) + \frac{1}{2\Lambda n} \sum_{t=1}^T \operatorname{Tr}\big(K_t^{\mathcal{X}} \Omega K_t^{\mathcal{Y}} \Omega^\top\big).$$

- Rmk. Only the $n^4$ tensor $\langle y_k, K(x_i, x_j) y_l \rangle_{\mathcal{Y}}$ is needed to learn OVK machines.
- Rmk. It reduces to the two $n^2$ Gram matrices $K^{\mathcal{X}}$ and $K^{\mathcal{Y}}$ if $K$ is decomposable (see the sketch below).
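A sketch of that reduction for the square loss and an identity-decomposable kernel (synthetic data; the closed-form minimizer is our own derivation for this quadratic case):

```python
import numpy as np

# The reparametrized dual for the square loss and K = k * I_Y touches the
# outputs only through the two Gram matrices K_X and K_Y, as the double
# representer theorem promises.

rng = np.random.default_rng(3)
n, lam = 15, 0.3
X = rng.normal(size=(n, 2))
Y = rng.normal(size=(n, 4))          # stand-in for possibly infinite-dim outputs
K_X = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))
K_Y = Y @ Y.T                        # only inner products <y_i, y_j> are needed

def dual_objective(Omega):
    """Square-loss dual in Omega, where each alpha_i = sum_j Omega[i, j] y_j."""
    quad = Omega @ K_Y @ Omega.T
    data_fit = 0.5 * np.trace(quad) - np.trace(Omega @ K_Y)
    return data_fit + np.trace(K_X @ quad) / (2 * lam * n)

# Closed-form minimizer of this quadratic: (I + K_X/(lam*n)) Omega K_Y = K_Y,
# so Omega = (I + K_X/(lam*n))^{-1} works (any solution modulo ker K_Y).
Omega_hat = np.linalg.solve(np.eye(n) + K_X / (lam * n), np.eye(n))

# Sanity check: the candidate minimizer beats random perturbations.
assert all(dual_objective(Omega_hat)
           <= dual_objective(Omega_hat + 0.01 * rng.normal(size=(n, n)))
           for _ in range(5))
```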
How to apply the duality approach?
Infimal convolution and Fenchel-Legendre transforms

Infimal convolution of proper lower semicontinuous functions [Bauschke et al., 2011]:

$$(f \,\square\, g)(x) = \inf_y\; f(y) + g(x - y).$$

Relation to the FL transform: $(f \,\square\, g)^\star = f^\star + g^\star$.

Ex: ε-insensitive losses. Let $\ell \colon \mathcal{Y} \to \mathbb{R}$ be a convex loss with unique minimum at $0$, and $\epsilon > 0$. The ε-insensitive version of $\ell$, denoted $\ell_\epsilon$, is defined by:

$$\ell_\epsilon(y) = (\ell \,\square\, \chi_{B_\epsilon})(y) = \begin{cases} \ell(0) & \text{if } \|y\|_{\mathcal{Y}} \le \epsilon \\ \inf_{\|d\|_{\mathcal{Y}} \le 1} \ell(y - \epsilon d) & \text{otherwise} \end{cases}$$

and has FL transform $\ell_\epsilon^\star(y) = (\ell \,\square\, \chi_{B_\epsilon})^\star(y) = \ell^\star(y) + \epsilon \|y\|$.
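A small 1-D illustration (our own) that the inf-convolution with $\chi_{B_\epsilon}$ indeed flattens the loss on the ε-ball:

```python
import numpy as np

# Illustration on a 1-D grid: the eps-insensitive version of l(y) = 0.5*y^2
# is the infimal convolution of l with the indicator of the eps-ball,
# (l box chi_Beps)(x) = inf_{|u| <= eps} l(x - u).

eps = 0.5
x = np.linspace(-3, 3, 601)
u = np.linspace(-eps, eps, 201)                   # grid over the eps-ball
inf_conv = 0.5 * np.minimum.reduce([(x - ui) ** 2 for ui in u])

# Closed form for comparison: 0.5 * max(|x| - eps, 0)^2.
closed = 0.5 * np.maximum(np.abs(x) - eps, 0.0) ** 2
assert np.allclose(inf_conv, closed, atol=1e-6)
```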
Interesting loss functions: sparsity and robustness

- ε-Ridge: $\frac{1}{2}\|\cdot\|^2 \,\square\, \chi_{B_\epsilon}$, the ε-insensitive version of $\frac{1}{2}\|x\|^2$ (sparsity)
- ε-SVR: $\|\cdot\| \,\square\, \chi_{B_\epsilon}$, the ε-insensitive version of $\|x\|$ (sparsity, robustness)
- κ-Huber: $\kappa\|\cdot\| \,\square\, \frac{1}{2}\|\cdot\|^2$, the Huber loss (robustness)

[Figure: each base loss plotted next to its ε-insensitive or Huber version]
Specific dual problems

For the ε-Ridge, ε-SVR and κ-Huber losses, it holds $\hat\Omega = \hat W V^{-1}$, with $\hat W$ the solution to the following finite dimensional dual problems:

$$\text{(D1)} \quad \min_{W \in \mathbb{R}^{n \times n}}\; \frac{1}{2}\|AW - B\|_{\mathrm{Fro}}^2 + \epsilon \|W\|_{2,1},$$

$$\text{(D2)} \quad \min_{W \in \mathbb{R}^{n \times n}}\; \frac{1}{2}\|AW - B\|_{\mathrm{Fro}}^2 + \epsilon \|W\|_{2,1}, \quad \text{s.t. } \|W\|_{2,\infty} \le 1,$$

$$\text{(D3)} \quad \min_{W \in \mathbb{R}^{n \times n}}\; \frac{1}{2}\|AW - B\|_{\mathrm{Fro}}^2, \quad \text{s.t. } \|W\|_{2,\infty} \le \kappa,$$

with $V$, $A$, $B$ such that $VV^\top = K^{\mathcal{Y}}$, $A^\top A = K^{\mathcal{X}}/(\Lambda n) + I_n$ (or $A^\top A = K^{\mathcal{X}}/(\Lambda n)$ for the ε-SVR), and $A^\top B = V$.
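Problem (D1) can be solved by proximal gradient descent, since the prox of $\|\cdot\|_{2,1}$ is row-wise soft-thresholding; a minimal sketch with synthetic $A$ and $B$ (our own solver choice, not the paper's):

```python
import numpy as np

# Proximal gradient on the eps-Ridge dual (D1):
# min_W 0.5*||A W - B||_F^2 + eps*||W||_{2,1}. The ||.||_{2,1} penalty zeroes
# whole rows of W, which is where the observed sparsity comes from.

def prox_group_l21(W, tau):
    """Row-wise soft-thresholding: prox of tau * sum_i ||W_i:||_2."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return scale * W

rng = np.random.default_rng(4)
n, eps = 25, 0.2
A = rng.normal(size=(n, n)) / np.sqrt(n)
B = rng.normal(size=(n, n))

step = 1.0 / np.linalg.norm(A, 2) ** 2          # 1 / L, with L = ||A^T A||_2
W = np.zeros((n, n))
for _ in range(500):
    grad = A.T @ (A @ W - B)
    W = prox_group_l21(W - step * grad, step * eps)

sparsity = np.mean(np.linalg.norm(W, axis=1) < 1e-10)   # fraction of null rows
```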
Surrogate approaches for structured prediction

- Experiments on the YEAST dataset
- Empirically, ε-SV-IOKR outperforms Ridge-IOKR for a wide range of ε
- Promotes sparsity and acts as a regularizer

Figure 1: Test MSEs (ε-SVR vs. KRR) and sparsity (% of null components) w.r.t. Λ, for several values of ε
Robust function-to-function regression

Task from [Kadri et al., 2016]: predict lip acceleration from EMG signals.

- Dataset augmented with outliers, model learned with the Huber loss
- Improvement for every output size m (see paper for the approximation scheme)

Figure 2: LOO generalization error w.r.t. κ, for m = 4, 5, 6, 7, 15, compared to ridge regression (κ = +∞)
Conclusion

State of the art:

- OVKs and vv-RKHSs are tailored to infinite dimensional outputs
- Representer Theorem: an expansion with little information on the coefficients
- Duality: the coefficients solve the (infinite dimensional) dual problem

Contributions:

- Double Representer Theorem: the coefficients are linear combinations of the outputs
- Makes it possible to cope with many losses (ε-insensitive, Huber) and kernels
- Empirical improvements on surrogate approaches

Much more in the paper!

- Thorough algorithmic stability analysis
- What if $\tilde{\mathcal{Y}}$ is not invariant under $K$?
References I

- Audiffren, J. and Kadri, H. (2013). Stability of multi-task kernel regression algorithms. In Asian Conference on Machine Learning, pages 1-16.
- Bauschke, H. H., Combettes, P. L., et al. (2011). Convex Analysis and Monotone Operator Theory in Hilbert Spaces, volume 408. Springer.
- Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2(Mar):499-526.
- Brouard, C., Szafranski, M., and d'Alché-Buc, F. (2016). Input output kernel regression: Supervised and semi-supervised structured output prediction with operator-valued kernels. Journal of Machine Learning Research, 17:176:1-176:48.
References II

- Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, pages 73-101.
- Kadri, H., Duflos, E., Preux, P., Canu, S., Rakotomamonjy, A., and Audiffren, J. (2016). Operator-valued kernels for learning from functional response data. Journal of Machine Learning Research, 17:20:1-20:54.
- Kadri, H., Ghavamzadeh, M., and Preux, P. (2013). A generalized kernel approach to structured output learning. In International Conference on Machine Learning (ICML), pages 471-479.
- Laforgue, P., Clémençon, S., and d'Alché-Buc, F. (2019). Autoencoding any data through kernel autoencoders. In Artificial Intelligence and Statistics, pages 1061-1069.
References III

- Micchelli, C. A. and Pontil, M. (2005). On learning vector-valued functions. Neural Computation, 17(1):177-204.
- Moreau, J. J. (1962). Fonctions convexes duales et points proximaux dans un espace hilbertien. Comptes rendus hebdomadaires des séances de l'Académie des sciences, 255:2897-2899.
- Sangnier, M., Fercoq, O., and d'Alché-Buc, F. (2017). Data sparse nonparametric regression with ε-insensitive losses. In Asian Conference on Machine Learning, pages 192-207.
On the invariance assumption

With $\tilde{\mathcal{Y}} = \mathrm{Span}\{y_j, j \le n\}$, the assumption reads: $\forall (x, x') \in \mathcal{X}^2$, $\forall y \in \mathcal{Y}$, $y \in \tilde{\mathcal{Y}} \Rightarrow K(x, x')y \in \tilde{\mathcal{Y}}$.

- We do not need it to hold for every collection $\{y_i\}_{i \le n} \in \mathcal{Y}^n$
- It is rather an a posteriori condition ensuring that the kernel is aligned with the data
- The little we know about $\mathcal{Y}$ should be preserved through $K$
- If $\mathcal{Y}$ is finite dimensional and there are sufficiently many outputs, then $\tilde{\mathcal{Y}} = \mathcal{Y}$
- Identity-decomposable kernels fit (nontrivial in infinite dimension)
- The empirical covariance kernel $\sum_i y_i \otimes y_i$ [Kadri et al., 2013] fits
Admissible kernels

- $K(s, t) = \sum_i k_i(s, t)\, y_i \otimes y_i$, with $k_i$ positive semi-definite (p.s.d.) scalar kernels for all $i \le n$
- $K(s, t) = \sum_i \mu_i\, k(s, t)\, y_i \otimes y_i$, with $k$ a p.s.d. scalar kernel and $\mu_i \ge 0$ for all $i \le n$ (see the sketch below)
- $K(s, t) = \sum_i k(s, x_i) k(t, x_i)\, y_i \otimes y_i$
- $K(s, t) = \sum_{i,j} k_{ij}(s, t)\, (y_i + y_j) \otimes (y_i + y_j)$, with $k_{ij}$ p.s.d. scalar kernels for all $i, j \le n$
- $K(s, t) = \sum_{i,j} \mu_{ij}\, k(s, t)\, (y_i + y_j) \otimes (y_i + y_j)$, with $k$ a p.s.d. scalar kernel and $\mu_{ij} \ge 0$
- $K(s, t) = \sum_{i,j} k(s, x_i, x_j) k(t, x_i, x_j)\, (y_i + y_j) \otimes (y_i + y_j)$
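A quick numerical sanity check (synthetic data, our own) of the invariance property for the second family above, using $K(s, t)y = k(s, t) \sum_i \mu_i \langle y_i, y \rangle\, y_i \in \tilde{\mathcal{Y}}$:

```python
import numpy as np

# Check that K(s, t) = sum_i mu_i * k(s, t) * (y_i (x) y_i) maps any y into
# span{y_1, ..., y_n}: K(s, t) y = k(s, t) * sum_i mu_i * <y_i, y> * y_i.

rng = np.random.default_rng(5)
n, d = 5, 12
Ytr = rng.normal(size=(n, d))                 # outputs spanning Y_tilde
mu = rng.uniform(size=n)

def K_apply(s, t, y, gamma=1.0):
    k_st = np.exp(-gamma * np.sum((s - t) ** 2))
    return k_st * (mu * (Ytr @ y)) @ Ytr      # lies in the row span of Ytr

y = rng.normal(size=d)
out = K_apply(rng.normal(size=3), rng.normal(size=3), y)

# The residual of projecting K(s, t) y onto span{y_i} should vanish.
coef, *_ = np.linalg.lstsq(Ytr.T, out, rcond=None)
assert np.allclose(Ytr.T @ coef, out)
```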
Admissible losses

$$\forall i \le n,\; \forall (\alpha_{\tilde{\mathcal{Y}}}, \alpha_\perp) \in \tilde{\mathcal{Y}} \times \tilde{\mathcal{Y}}^\perp, \quad \ell_i^\star(\alpha_{\tilde{\mathcal{Y}}}) \le \ell_i^\star(\alpha_{\tilde{\mathcal{Y}}} + \alpha_\perp)$$

- $\ell_i(y) = f(\langle y, z_i \rangle)$, with $z_i \in \tilde{\mathcal{Y}}$ and $f \colon \mathbb{R} \to \mathbb{R}$ convex. Maximum-margin is obtained with $z_i = y_i$ and $f(t) = \max(0, 1 - t)$.
- $\ell(y) = f(\|y\|)$, with $f \colon \mathbb{R}_+ \to \mathbb{R}$ convex increasing s.t. $t \mapsto f'(t)/t$ is continuous over $\mathbb{R}_+$. Includes the functions $\frac{\lambda}{\eta}\|y\|_{\mathcal{Y}}^\eta$ for $\eta > 1$, $\lambda > 0$.
- $\forall \lambda > 0$, with $B_\lambda$ the centered ball of radius $\lambda$: $\ell(y) = \lambda\|y\|$, $\ell(y) = \lambda\|y\|\log(\|y\|)$, $\ell(y) = \chi_{B_\lambda}(y)$, $\ell(y) = \lambda(\exp(\|y\|) - 1)$.
- $\ell_i(y) = f(y - y_i)$, with $f^\star$ verifying the condition.
- Infimal convolutions of functions verifying the condition (ε-insensitive losses [Sangnier et al., 2017], the Huber loss [Huber, 1964], Moreau or Pasch-Hausdorff envelopes [Moreau, 1962, Bauschke et al., 2011]).
Proof of the Double Representer Theorem

Dual problem:

$$(\hat\alpha_i)_{i=1}^n \in \operatorname*{argmin}_{(\alpha_i)_{i=1}^n \in \mathcal{Y}^n}\; \sum_{i=1}^n \ell_i^\star(-\alpha_i) + \frac{1}{2\Lambda n} \sum_{i,j=1}^n \langle \alpha_i, K(x_i, x_j)\alpha_j \rangle_{\mathcal{Y}}.$$

- Decompose $\hat\alpha_i = \alpha_i^{\tilde{\mathcal{Y}}} + \alpha_i^\perp$, with $(\alpha_i^{\tilde{\mathcal{Y}}})_{i \le n} \in \tilde{\mathcal{Y}}^n$ and $(\alpha_i^\perp)_{i \le n} \in \tilde{\mathcal{Y}}^{\perp\, n}$
- Assume that $\ell_i^\star(\alpha^{\tilde{\mathcal{Y}}}) \le \ell_i^\star(\alpha^{\tilde{\mathcal{Y}}} + \alpha^\perp)$ (satisfied if $\ell$ relies on $\langle \cdot, \cdot \rangle$)

Then it holds:

$$\sum_{i=1}^n \ell_i^\star(-\alpha_i^{\tilde{\mathcal{Y}}}) + \frac{1}{2\Lambda n} \sum_{i,j=1}^n \big\langle \alpha_i^{\tilde{\mathcal{Y}}}, K(x_i, x_j)\alpha_j^{\tilde{\mathcal{Y}}} \big\rangle_{\mathcal{Y}} \;\le\; \sum_{i=1}^n \ell_i^\star(-\alpha_i^{\tilde{\mathcal{Y}}} - \alpha_i^\perp) + \frac{1}{2\Lambda n} \sum_{i,j=1}^n \big\langle \alpha_i^{\tilde{\mathcal{Y}}} + \alpha_i^\perp, K(x_i, x_j)(\alpha_j^{\tilde{\mathcal{Y}}} + \alpha_j^\perp) \big\rangle_{\mathcal{Y}}.$$
Approximating the dual problem without invariance

Let $K = k \cdot A$ be a separable OVK, with $A$ a compact operator. There exists an o.n.b. $(\psi_j)_{j=1}^\infty$ of $\mathcal{Y}$ s.t. $A = \sum_{j=1}^\infty \lambda_j\, \psi_j \otimes \psi_j$ ($\lambda_j \ge 0$), and there exist $(\hat\omega_i)_{i=1}^n \in \ell^2(\mathbb{R})^n$ such that $\hat\alpha_i = \sum_{j=1}^\infty \hat\omega_{ij} \psi_j$ for all $i \le n$. Denoting $\mathcal{Y}_m = \mathrm{span}(\{\psi_j\}_{j=1}^m)$ and $S = \mathrm{diag}(\lambda_j)_{j=1}^m$, solve instead:

$$\min_{(\alpha_i)_{i=1}^n \in \mathcal{Y}_m^n}\; \sum_{i=1}^n \ell_i^\star(-\alpha_i) + \frac{1}{2\Lambda n} \sum_{i,j=1}^n \langle \alpha_i, K(x_i, x_j)\alpha_j \rangle_{\mathcal{Y}}.$$

The final solution is given by

$$\hat h = \frac{1}{\Lambda n} \sum_{i=1}^n \sum_{j=1}^m k(\cdot, x_i)\, \lambda_j\, \hat\omega_{ij}\, \psi_j,$$

with $\hat\Omega$ the solution to:

$$\min_{\Omega \in \mathbb{R}^{n \times m}}\; \sum_{i=1}^n L_i(\Omega_{i:}, R_{i:}) + \frac{1}{2\Lambda n} \operatorname{Tr}\big(K^{\mathcal{X}} \Omega S \Omega^\top\big).$$
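For the square loss, the truncated dual decouples across the $m$ eigen-modes of $A$; a sketch under this assumption (synthetic data; the per-mode systems and the coordinate matrix $R_{ij} = \langle y_i, \psi_j \rangle$ are our own derivation for this case):

```python
import numpy as np

# When K = k * A with A compact, truncate A = sum_j lambda_j psi_j (x) psi_j
# to its first m modes. For the square loss the dual decouples: for each j,
# (I + lambda_j * K_X / (lam*n)) omega_j = R[:, j], with R_ij = <y_i, psi_j>.

rng = np.random.default_rng(6)
n, d, m, lam = 20, 30, 5, 0.2
X = rng.normal(size=(n, 2))
Y = rng.normal(size=(n, d))
K_X = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))

M_ = rng.normal(size=(d, d))
A = M_ @ M_.T / d
eigval, eigvec = np.linalg.eigh(A)
lam_j, Psi = eigval[::-1][:m], eigvec[:, ::-1][:, :m]   # top-m eigenpairs

R = Y @ Psi                                   # coordinates <y_i, psi_j>
Omega = np.column_stack([
    np.linalg.solve(np.eye(n) + lj * K_X / (lam * n), R[:, j])
    for j, lj in enumerate(lam_j)
])

def h(kx):
    """hat{h}(x) = (1/(lam*n)) sum_{i,j} k(x, x_i) lambda_j Omega_ij psi_j,
    given the vector kx of kernel evaluations k(x, x_i)."""
    return Psi @ (lam_j * (kx @ Omega)) / (lam * n)
```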
Application to kernel autoencoding

- Experiments on molecules with the Tanimoto-Gaussian kernel
- Empirical improvements for a wide range of ε
- Introduces sparsity

Figure 3: Performance of the ε-insensitive Kernel Autoencoder: test reconstruction MSE and $\|W\|_{2,1}$ w.r.t. ε (ε-KAE vs. standard KAE), with the amount of discarded data
Algorithmic stability analysis [Bousquet and Elisseeff, 2002]

Algorithm $\mathcal{A}$ has stability $\beta$ if for any sample $S_n$ and any $i \le n$, it holds:

$$\sup_{(x, y) \in \mathcal{X} \times \mathcal{Y}} \big| \ell(h_{\mathcal{A}(S_n)}(x), y) - \ell(h_{\mathcal{A}(S_n^{\setminus i})}(x), y) \big| \le \beta.$$

Let $\mathcal{A}$ be an algorithm with stability $\beta$ and a loss function bounded by $M$. Then, for any $n \ge 1$ and $\delta \in\, ]0, 1[$, it holds with probability at least $1 - \delta$:

$$R(h_{\mathcal{A}(S_n)}) \le \hat R_n(h_{\mathcal{A}(S_n)}) + 2\beta + (4n\beta + M) \sqrt{\frac{\ln(1/\delta)}{2n}}.$$

If $\|K(x, x)\|_{\mathrm{op}} \le \gamma^2$ and $|\ell(h_S(x), y) - \ell(h_{S^{\setminus i}}(x), y)| \le C \|h_S(x) - h_{S^{\setminus i}}(x)\|_{\mathcal{Y}}$, then the OVK algorithm has stability $\beta \le C^2\gamma^2/(\Lambda n)$ [Audiffren and Kadri, 2013].
Constants for the three losses:

- ε-SVR: $M = \sqrt{M_{\mathcal{Y}} - \epsilon}\, \dfrac{\sqrt{2}\,\gamma}{\sqrt{\Lambda}} + (M_{\mathcal{Y}} - \epsilon)$, $\quad C = 1$
- ε-Ridge: $M = (M_{\mathcal{Y}} - \epsilon)^2 \Big(1 + \dfrac{2\sqrt{2}\,\gamma}{\sqrt{\Lambda}} + \dfrac{2\gamma^2}{\Lambda}\Big)$, $\quad C = 2(M_{\mathcal{Y}} - \epsilon)\Big(1 + \dfrac{\sqrt{2}\,\gamma}{\sqrt{\Lambda}}\Big)$
- κ-Huber: $M = \kappa \sqrt{M_{\mathcal{Y}} - \tfrac{\kappa}{2}} \Big(\dfrac{\sqrt{2\kappa}\,\gamma}{\sqrt{\Lambda}} + \sqrt{M_{\mathcal{Y}} - \tfrac{\kappa}{2}}\Big)$, $\quad C = \kappa$
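Plugging the stability $\beta \le C^2\gamma^2/(\Lambda n)$ into the bound gives a computable generalization gap; a small helper (our own, with hypothetical constants):

```python
import numpy as np

# Helper evaluating the Bousquet-Elisseeff bound with the OVK stability
# beta <= C^2 * gamma^2 / (lam * n), for given loss constants M and C.

def generalization_gap(n, delta, lam, gamma, M, C):
    """Upper bound on R(h) - hat{R}_n(h), holding with probability >= 1 - delta."""
    beta = C ** 2 * gamma ** 2 / (lam * n)
    return 2 * beta + (4 * n * beta + M) * np.sqrt(np.log(1 / delta) / (2 * n))

# Example with hypothetical constants (e.g. the eps-SVR has C = 1):
print(generalization_gap(n=1000, delta=0.05, lam=0.1, gamma=1.0, M=2.0, C=1.0))
```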