SLIDE 1 Large-Scale Sparse Kernel Canonical Correlation Analysis
Viivi Uurtio1, Sahely Bhadra2, and Juho Rousu1
1 Department of Computer Science, Aalto University
Helsinki Institute for Information Technology HIIT
2 Indian Institute of Technology (IIT), Palakkad
June 11, 2019
SLIDES 2–11
From large two-view datasets, it is not straightforward to identify which of the variables are related:

    \rho(u, v) = \frac{\langle Xu, \, Yv \rangle}{\|Xu\|_2 \, \|Yv\|_2}

→ In standard CCA, we identify the related variables from u and v
→ In the non-linear and/or large-scale variants, we cannot access u and v

Method        Scalability   Access to u and v
Kernel CCA         ✗                ✗
RF KCCA            ✓                ✗
KNOI               ✓                ✗
Deep CCA           ✓                ✗
SCCA-HSIC          ✗                ✓
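As a reference point for the objective above, here is a minimal NumPy sketch (illustrative, not from the talk; it assumes arrays X of shape (n, p) and Y of shape (n, q)) that evaluates the standard CCA correlation for given directions u and v:

```python
import numpy as np

def linear_cca_correlation(X, Y, u, v):
    """Standard CCA objective: rho(u, v) = <Xu, Yv> / (||Xu||_2 ||Yv||_2)."""
    a = X @ u  # projection of view X onto direction u, shape (n,)
    b = Y @ v  # projection of view Y onto direction v, shape (n,)
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

Standard CCA maximizes this quantity over u and v, so the related variables can be read off the entries of the maximizers; the non-linear and large-scale variants in the table above lose this direct access.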
SLIDES 12–20
gradKCCA is a kernel-matrix-free method that efficiently maximizes the canonical correlation in the kernel-induced feature spaces

Let k_x(u) = (k_x(x_i, u))_{i=1}^n and k_y(v) = (k_y(y_i, v))_{i=1}^n

    \max_{u,v} \; \rho_{\mathrm{gradKCCA}}(u, v) = \frac{k_x(u)^\top k_y(v)}{\|k_x(u)\|_2 \, \|k_y(v)\|_2} \quad \text{s.t.} \quad \|u\|_{P_x} \le s_x \;\text{and}\; \|v\|_{P_y} \le s_y

Maximum through alternating projected gradient ascent

Optimization steps for u:
→ Compute the gradient ∇ρ_u = ∂ρ(u, v)/∂u
→ Determine the step size by line search: max_γ ρ(u + γ∇ρ_u)
→ Take a gradient step towards the maximum: u_grad = u + γ*∇ρ_u
→ Project onto the ℓ_P ball: u = Π_{‖·‖_{P_x} ≤ s_x}(u_grad)
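A minimal sketch of these steps, under stated assumptions: a polynomial kernel k(x, w) = (xᵀw)^d as one illustrative kernel choice, an ℓ2 constraint, and a simple grid search standing in for the line search. This is not the authors' implementation; the point it illustrates is that only n-vectors of kernel evaluations are formed, never an n × n kernel matrix:

```python
import numpy as np

def rho(X, Y, u, v, d=2):
    """gradKCCA objective under a polynomial kernel k(x, w) = (x^T w)^d.
    Only n-vectors of kernel evaluations are formed, never an n x n matrix."""
    kx, ky = (X @ u) ** d, (Y @ v) ** d
    return (kx @ ky) / (np.linalg.norm(kx) * np.linalg.norm(ky))

def grad_rho_u(X, Y, u, v, d=2):
    """Gradient of rho with respect to u (chain rule through k_x(u))."""
    s = X @ u
    kx, ky = s ** d, (Y @ v) ** d
    nx, ny = np.linalg.norm(kx), np.linalg.norm(ky)
    g_kx = ky / (nx * ny) - (kx @ ky) * kx / (nx ** 3 * ny)  # d rho / d k_x
    return X.T @ (d * s ** (d - 1) * g_kx)                   # pull back through k_x(u)

def update_u(X, Y, u, v, s_x, d=2):
    """One projected gradient ascent step for u: gradient, grid 'line search',
    and projection onto the l2 ball of radius s_x."""
    g = grad_rho_u(X, Y, u, v, d)

    def proj(w):  # projection onto {w : ||w||_2 <= s_x}
        n = np.linalg.norm(w)
        return w if n <= s_x else w * (s_x / n)

    candidates = [proj(u + gamma * g) for gamma in np.logspace(-4, 1, 12)]
    return max(candidates, key=lambda w: rho(X, Y, w, v, d))
```

The analogous step for v alternates with this one until the correlation stops improving.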
SLIDES 21–24
Experiments demonstrate noise tolerance, scalability, and superior speed of gradKCCA

[Figure: F1 score and AUC on train and test data versus the proportion of noise variables, comparing DCCA, KNOI, KCCA, gradKCCA, KCCApreimage, and SCCA-HSIC]

[Figure: F1 score and running time (1 s to 10 h) on train and test data versus sample size (10^3 to 10^6), comparing gradKCCA, DCCA, RCCA, KNOI, and SCCA-HSIC]

MediaMill     ρ_train         ρ_test          Time (s)
gradKCCA    0.666 ± 0.004   0.657 ± 0.007       8 ± 4
Deep CCA    0.643 ± 0.005   0.633 ± 0.003   1280 ± 112
RF KCCA     0.633 ± 0.001   0.626 ± 0.005      23 ± 9
KNOI        0.652 ± 0.001   0.645 ± 0.003     218 ± 73
SCCA-HSIC   0.627 ± 0.004   0.625 ± 0.002   1804 ± 143
SLIDE 25 Thanks and meet me at the poster!
Large-Scale Sparse Kernel Canonical Correlation Analysis
Viivi Uurtio1, Sahely Bhadra2, and Juho Rousu1
1 Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University firstname.lastname@aalto.fi 2 Indian Institute of Technology (IIT), Palakkad sahely@iitpkd.ac.in
gradKCCA: Sparse kernel-based non-linear CCA
- ✓ maximizes canonical correlation in the kernel-induced feature spaces through the gradients of the preimages u and v
- ✓ does not rely on a kernel matrix
- ✓ sparsity-inducing variant of the model is achieved through controlling the ℓ1 norms of the preimages of the projection directions

    \max_{u,v} \; \rho_{\mathrm{gradKCCA}}(u, v) = \frac{k_x(u)^\top k_y(v)}{\|k_x(u)\|_2 \, \|k_y(v)\|_2} \quad \text{s.t.} \quad \|u\|_{P_x} \le s_x \;\text{and}\; \|v\|_{P_y} \le s_y
Algorithm: Alternating projected gradient ascent
1: Input: X, Y, M (components), R (repetitions), δ (convergence limit), P_x and P_y (norms of u and v), s_x and s_y (ℓ1 or ℓ2 norm constraints for u and v), d_x and d_y (hyperparameters for k_x and k_y)
2: Output: U, V
3: for all m = 1, 2, ..., M do
4:   for all r = 1, 2, ..., R do
5:     Initialize u_mr and v_mr
6:     Compute k_x(u), k_y(v)
7:     repeat
8:       Compute ρ_old = ρ(u, v)
9:       Compute ∇ρ_u = ∂ρ(u, v)/∂u
10:      Update u_mr = Π_{‖·‖_{P_x} ≤ s_x}(u_mr + γ∇ρ_u) (step size γ determined by line search)
11:      Re-compute k_x(u)
12:      Compute ∇ρ_v = ∂ρ(u, v)/∂v
13:      Update v_mr = Π_{‖·‖_{P_y} ≤ s_y}(v_mr + γ∇ρ_v) (step size γ determined by line search)
14:      Re-compute k_y(v)
15:      Compute ρ_current = ρ(u, v)
16:    until |ρ_old − ρ_current| / |ρ_old + ρ_current| < δ
17:    ρ_r = ρ_current, u_r = u_mr, v_r = v_mr
18:  end for
19:  Select r* = arg max_r ρ_r
20:  Store U(:, m) = u_r*, V(:, m) = v_r*
21:  Deflate X^(m), Y^(m) by U(:, m) and V(:, m)
22: end for
23: Return: U, V

Time complexity: O((p + q)n), where n denotes the sample size and p and q the numbers of variables in the two views.
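The projections on lines 10 and 13 are what produce the sparsity-inducing variant. For P = 2 the projection is a simple rescaling onto the ℓ2 ball; for P = 1 a standard choice is the sort-based Euclidean projection of Duchi et al. (2008), sketched below in NumPy (illustrative, not the authors' MATLAB code):

```python
import numpy as np

def project_l1_ball(w, s):
    """Euclidean projection of w onto the l1 ball {x : ||x||_1 <= s}."""
    if np.abs(w).sum() <= s:
        return w  # already feasible
    a = np.sort(np.abs(w))[::-1]  # magnitudes in descending order
    cumsum = np.cumsum(a)
    # largest index rho with a[rho] > (cumsum[rho] - s) / (rho + 1)
    rho = np.nonzero(a * np.arange(1, w.size + 1) > cumsum - s)[0][-1]
    theta = (cumsum[rho] - s) / (rho + 1.0)
    return np.sign(w) * np.maximum(np.abs(w) - theta, 0.0)  # soft-threshold
```

Because the projection soft-thresholds the preimage, entries tied to irrelevant variables are driven exactly to zero, which is what lets gradKCCA identify the related variables.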
Simulated monotone relations
[Figure: F1 score and AUC on train and test data versus the proportion of noise variables (DCCA, KNOI, KCCA, gradKCCA, KCCApreimage, SCCA-HSIC), and F1 score and running time (1 s to 10 h) versus sample size (10^3 to 10^6) (gradKCCA, DCCA, RCCA, KNOI, SCCA-HSIC)]
gradKCCA is more robust to noise than KCCA
Simulated non-monotone relations
[Figure: the same comparison for non-monotone relations: F1 score and AUC versus the proportion of noise variables, and F1 score and running time versus sample size]
The canonical correlation obtained using the computed preimages of KCCA is lower than the kernel canonical correlation. The preimage of KCCA is computed as

    \arg\min_u \; \alpha^{*\top} K_x \alpha^* - 2\,\alpha^{*\top} k_x(u) + k_x(u, u)

by gradient descent on u.
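A minimal sketch of this preimage computation, assuming a Gaussian kernel k(x, u) = exp(−‖x − u‖² / 2σ²), so that k(u, u) = 1 and the first term is constant in u; the fixed step size, iteration count, and initialization are illustrative choices, not the authors' settings:

```python
import numpy as np

def kcca_preimage(X, alpha, sigma=1.0, lr=0.1, n_iter=200):
    """Gradient descent on the preimage objective
    alpha^T K_x alpha - 2 alpha^T k_x(u) + k(u, u); for the Gaussian
    kernel only the middle term depends on u."""
    u = X.T @ alpha  # initialize at the input-space linear combination
    for _ in range(n_iter):
        diff = X - u  # shape (n, p), broadcasting u over rows
        k = np.exp(-np.sum(diff ** 2, axis=1) / (2 * sigma ** 2))
        grad = -2.0 * ((alpha * k) @ diff) / sigma ** 2  # d/du of -2 alpha^T k_x(u)
        u = u - lr * grad
    return u
```

The gap between the kernel canonical correlation and the correlation of these recovered preimages is exactly what gradKCCA avoids by optimizing u and v directly.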
First components from real-world datasets
MNIST Handwritten Digits   ρ_train         ρ_test          Time (s)
gradKCCA                 0.955 ± 0.001   0.952 ± 0.001      56 ± 6
DCCA                     0.941 ± 0.010   0.943 ± 0.003   4578 ± 203
RCCA                     0.949 ± 0.001   0.949 ± 0.010      78 ± 13
KNOI                     0.950 ± 0.001   0.950 ± 0.005     878 ± 62
SCCA-HSIC                0.912 ± 0.020   0.934 ± 0.006   5611 ± 193

MediaMill                  ρ_train         ρ_test          Time (s)
gradKCCA                 0.666 ± 0.004   0.657 ± 0.007       8 ± 4
DCCA                     0.643 ± 0.005   0.633 ± 0.003   1280 ± 112
RCCA                     0.633 ± 0.001   0.626 ± 0.005      23 ± 9
KNOI                     0.652 ± 0.001   0.645 ± 0.003     218 ± 73
SCCA-HSIC                0.627 ± 0.004   0.625 ± 0.002   1804 ± 143
Concluding remarks
- ✓ finds the related variables in the data space accurately, when using a linear or a non-linear kernel
- ✓ unlike KCCA, gradKCCA does not rely on a kernel matrix, which results in superior speed
- ✓ the ℓ1 norm constraints on the preimages give robustness against irrelevant variables

This work has been supported in part by Academy of Finland grants 310107 (MACOME) and 313268 (TensorBiomed).
MATLAB codes are available at https://github.com/aalto-ics-kepaco/gradKCCA
Questions can be sent to viivi.uurtio@aalto.fi