

SLIDE 1

Large-Scale Sparse Kernel Canonical Correlation Analysis

Viivi Uurtio¹, Sahely Bhadra², and Juho Rousu¹

¹ Department of Computer Science, Aalto University,
Helsinki Institute for Information Technology HIIT

² Indian Institute of Technology (IIT), Palakkad

June 11, 2019


SLIDES 2-11

From large two-view datasets, it is not straightforward to identify which of the variables are related.

$$\rho(u, v) = \frac{\langle Xu,\, Yv\rangle}{\|Xu\|_2\, \|Yv\|_2}$$

→ In standard CCA, we identify the related variables from u and v
→ In the non-linear and/or large-scale variants, we cannot access the u and v

[Slide table: Kernel CCA, RF KCCA, KNOI, Deep CCA, and SCCA-HSIC compared on two criteria, scalability and access to u and v; Kernel CCA satisfies neither (⊠ ⊠).]
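As a quick illustration of the objective above (not part of the slides), a few lines of Python evaluate the linear CCA correlation for candidate projection vectors u and v; the toy data and all names here are hypothetical:

```python
import numpy as np

def cca_correlation(X, Y, u, v):
    """Linear CCA objective: <Xu, Yv> / (||Xu||_2 ||Yv||_2)."""
    a, b = X @ u, Y @ v
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# toy two-view data: 100 samples, 5 and 4 variables per view
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
Y = rng.standard_normal((100, 4))
u, v = rng.standard_normal(5), rng.standard_normal(4)
print(cca_correlation(X, Y, u, v))  # value in [-1, 1]
```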

SLIDES 12-20

gradKCCA is a kernel-matrix-free method that efficiently optimizes u and v.

Let $k_x(u) = \big(k_x(x_i, u)\big)_{i=1}^{n}$ and $k_y(v) = \big(k_y(y_i, v)\big)_{i=1}^{n}$.

$$\max_{u, v}\ \rho_{\mathrm{gradKCCA}}(u, v) = \frac{k_x(u)^\top k_y(v)}{\|k_x(u)\|_2\, \|k_y(v)\|_2} \quad \text{s.t. } \|u\|_{P_x} \le s_u \text{ and } \|v\|_{P_y} \le s_v$$

Maximum through alternating projected gradient ascent. Optimization steps for u:

→ Compute the gradient $\nabla\rho_u = \partial\rho(u, v)/\partial u$
→ Step size by line search: $\max_\gamma \rho(u + \gamma\nabla\rho_u)$
→ Gradient step towards the maximum: $u_{\mathrm{grad}} = u + \gamma^* \nabla\rho_u$
→ Project onto the $\ell_P$ ball: $u = \Pi_{\|\cdot\|_{P_x} \le s_u}(u_{\mathrm{grad}})$
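A minimal Python sketch of these update steps, assuming a homogeneous polynomial kernel k(x, u) = (x^T u)^d and an ℓ1 constraint (P = 1); the helper names and the grid line search are illustrative, not the authors' MATLAB implementation:

```python
import numpy as np

def kx(X, u, d=2):
    """Kernel vector k_x(u) = (k_x(x_i, u))_{i=1}^n for k(x, u) = (x^T u)^d."""
    return (X @ u) ** d

def rho(a, b):
    """Objective k_x(u)^T k_y(v) / (||k_x(u)||_2 ||k_y(v)||_2)."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def grad_rho_u(X, u, b, d=2):
    """Gradient of rho w.r.t. u, with the other view's kernel vector b held fixed."""
    a = kx(X, u, d)
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    drho_da = b / (na * nb) - (a @ b) * a / (na**3 * nb)  # d rho / d k_x(u)
    J = d * ((X @ u) ** (d - 1))[:, None] * X             # Jacobian d k_x(u) / d u
    return J.T @ drho_da

def project_l1(w, s):
    """Euclidean projection onto the l1 ball of radius s (Duchi et al., 2008)."""
    if np.abs(w).sum() <= s:
        return w
    z = np.sort(np.abs(w))[::-1]
    css = np.cumsum(z)
    k = np.nonzero(z * np.arange(1, len(w) + 1) > css - s)[0][-1]
    theta = (css[k] - s) / (k + 1.0)
    return np.sign(w) * np.maximum(np.abs(w) - theta, 0.0)

def update_u(X, u, b, s_u, d=2):
    """One projected-gradient step for u: gradient, line search, projection."""
    g = grad_rho_u(X, u, b, d)
    # crude grid line search over the step size gamma: max_gamma rho(u + gamma g)
    gammas = np.logspace(-3, 1, 20)
    gamma = max(gammas, key=lambda t: rho(kx(X, u + t * g, d), b))
    return project_l1(u + gamma * g, s_u)
```

The same update applied to v, alternating with the one for u, yields the ascent scheme on the slide; no n × n kernel matrix is ever formed, only the length-n kernel vectors.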

SLIDES 21-24

Experiments demonstrate the noise tolerance, scalability, and superior speed of gradKCCA.

[Figure: train and test F1 score and AUC versus the proportion of noise variables, comparing DCCA, KNOI, KCCA, gradKCCA, KCCA preimage, and SCCA-HSIC.]

[Figure: train and test F1 score and running time (1 s to 10 h) versus sample size (10^3 to 10^6), comparing gradKCCA, DCCA, RCCA, KNOI, and SCCA-HSIC.]

MediaMill     ρ_train         ρ_test          Time (s)
gradKCCA      0.666 ± 0.004   0.657 ± 0.007      8 ± 4
Deep CCA      0.643 ± 0.005   0.633 ± 0.003   1280 ± 112
RF KCCA       0.633 ± 0.001   0.626 ± 0.005     23 ± 9
KNOI          0.652 ± 0.001   0.645 ± 0.003    218 ± 73
SCCA-HSIC     0.627 ± 0.004   0.625 ± 0.002   1804 ± 143
SLIDE 25

Thanks and meet me at the poster!

Large-Scale Sparse Kernel Canonical Correlation Analysis

Viivi Uurtio¹, Sahely Bhadra², and Juho Rousu¹
¹ Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, firstname.lastname@aalto.fi
² Indian Institute of Technology (IIT), Palakkad, sahely@iitpkd.ac.in

gradKCCA: sparse kernel-based non-linear CCA
✓ maximizes the canonical correlation in the kernel-induced feature spaces through the gradients of the preimages u and v
✓ does not rely on a kernel matrix
✓ a sparsity-inducing variant of the model is obtained by controlling the ℓ1 norms of the preimages of the projection directions

$$\max_{u, v}\ \rho_{\mathrm{gradKCCA}}(u, v) = \frac{k_x(u)^\top k_y(v)}{\|k_x(u)\|_2\, \|k_y(v)\|_2} \quad \text{s.t. } \|u\|_{P_u} \le s_u \text{ and } \|v\|_{P_v} \le s_v$$

Algorithm: alternating projected gradient ascent
 1: Input: X, Y, M (components), R (repetitions), δ (convergence limit), Px and Py (norms for u and v), sx and sy (ℓ1 or ℓ2 norm constraints for u and v), dx and dy (hyperparameters for kx and ky)
 2: Output: U, V
 3: for all m = 1, 2, …, M do
 4:   for all r = 1, 2, …, R do
 5:     Initialize umr and vmr
 6:     Compute kx(u), ky(v)
 7:     repeat
 8:       Compute ρold = ρ(u, v)
 9:       Compute ∇ρu = ∂ρ(u, v)/∂u
10:       Update umr = Π_{‖·‖Px ≤ sx}(umr + γ∇ρu) (step size γ determined by line search)
11:       Re-compute kx(u)
12:       Compute ∇ρv = ∂ρ(u, v)/∂v
13:       Update vmr = Π_{‖·‖Py ≤ sy}(vmr + γ∇ρv) (step size γ determined by line search)
14:       Re-compute ky(v)
15:       Compute ρcurrent = ρ(u, v)
16:     until |ρold − ρcurrent| / |ρold + ρcurrent| < δ
17:     ρr = ρcurrent, ur = umr, vr = vmr
18:   end for
19:   Select r* = arg maxr ρr
20:   Store U(:, m) = ur*, V(:, m) = vr*
21:   Deflate X(m), Y(m) by U(:, m) and V(:, m)
22: end for
23: Return: U, V

Complexity: O((p + q)n), where n denotes the sample size and p and q the numbers of variables in the two views.

Simulated monotone relations
[Figure: train and test F1 score and AUC versus the proportion of noise variables, and F1 score and running time versus sample size, comparing DCCA, KNOI, KCCA, gradKCCA, KCCA preimage, RCCA, and SCCA-HSIC.]
gradKCCA is more robust to noise than KCCA.

Simulated non-monotone relations
[Figure: the same comparison for non-monotone relations.]

The canonical correlation computed from the preimages of KCCA is lower than the kernel canonical correlation. Computing the preimage of KCCA:

$$\arg\min_{u}\ \alpha^{*\top} K_x\, \alpha^{*} - 2\,\alpha^{*\top} k_x(u) + k_x(u, u)$$

by gradient descent on u.

First components from real-world datasets:

MNIST Handwritten Digits   ρ_train         ρ_test          Time (s)
gradKCCA                   0.955 ± 0.001   0.952 ± 0.001     56 ± 6
DCCA                       0.941 ± 0.010   0.943 ± 0.003   4578 ± 203
RCCA                       0.949 ± 0.001   0.949 ± 0.010     78 ± 13
KNOI                       0.950 ± 0.001   0.950 ± 0.005    878 ± 62
SCCA-HSIC                  0.912 ± 0.020   0.934 ± 0.006   5611 ± 193

MediaMill                  ρ_train         ρ_test          Time (s)
gradKCCA                   0.666 ± 0.004   0.657 ± 0.007      8 ± 4
DCCA                       0.643 ± 0.005   0.633 ± 0.003   1280 ± 112
RCCA                       0.633 ± 0.001   0.626 ± 0.005     23 ± 9
KNOI                       0.652 ± 0.001   0.645 ± 0.003    218 ± 73
SCCA-HSIC                  0.627 ± 0.004   0.625 ± 0.002   1804 ± 143

Concluding remarks
✓ gradKCCA finds the related variables in the data space accurately, whether using a linear or a non-linear kernel
✓ unlike KCCA, gradKCCA does not rely on a kernel matrix, which results in superior speed
✓ the ℓ1 norm constraints on the preimages give robustness against irrelevant variables

This work has been supported in part by Academy of Finland grants 310107 (MACOME) and 313268 (TensorBiomed).
Code: https://github.com/aalto-ics-kepaco/gradKCCA
Contact: viivi.uurtio@aalto.fi
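A minimal sketch of the KCCA preimage computation above, assuming a Gaussian RBF kernel (for which k(u, u) = 1 and α*ᵀKₓα* is constant in u, so only the cross term contributes to the gradient); the kernel choice, initialization heuristic, and step size are illustrative assumptions:

```python
import numpy as np

def rbf_vec(X, u, sigma=1.0):
    """k_x(u) = (k(x_i, u))_i for the Gaussian kernel exp(-||x - u||^2 / (2 sigma^2))."""
    d2 = ((X - u) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * sigma**2))

def preimage(X, alpha, sigma=1.0, lr=0.1, iters=500):
    """Gradient descent on f(u) = alpha^T K alpha - 2 alpha^T k_x(u) + k(u, u)."""
    u = X.T @ alpha / np.abs(alpha).sum()  # heuristic start in the data space
    for _ in range(iters):
        k = rbf_vec(X, u, sigma)
        # gradient of -2 alpha^T k_x(u): -2 sum_i alpha_i k_i (x_i - u) / sigma^2
        g = -2.0 * ((alpha * k)[:, None] * (X - u)).sum(axis=0) / sigma**2
        u -= lr * g
    return u
```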

Questions and comments can be sent to viivi.uurtio@aalto.fi.

MATLAB code is available at https://github.com/aalto-ics-kepaco/gradKCCA
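For orientation, a compact Python sketch of the algorithm's outer loop (components, random restarts, deflation), reusing the hypothetical kx, rho, and update_u helpers from the earlier sketch; the projection-based deflation here is an assumption, not necessarily the authors' scheme:

```python
import numpy as np

def grad_kcca(X, Y, M=2, R=5, s_u=1.0, s_v=1.0, d=2, delta=1e-6, max_iter=200, seed=0):
    """For each component, run R random restarts of the alternating projected
    gradient ascent, keep the best, then deflate both views."""
    rng = np.random.default_rng(seed)
    U = np.zeros((X.shape[1], M))
    V = np.zeros((Y.shape[1], M))
    for m in range(M):
        best = (-np.inf, None, None)
        for _ in range(R):
            u = rng.standard_normal(X.shape[1]); u /= np.linalg.norm(u)
            v = rng.standard_normal(Y.shape[1]); v /= np.linalg.norm(v)
            r_old = 0.0
            for _ in range(max_iter):
                u = update_u(X, u, kx(Y, v, d), s_u, d)  # update u, then v
                v = update_u(Y, v, kx(X, u, d), s_v, d)
                r = rho(kx(X, u, d), kx(Y, v, d))
                if abs(r_old - r) / (abs(r_old + r) + 1e-12) < delta:
                    break
                r_old = r
            if r > best[0]:
                best = (r, u, v)
        _, U[:, m], V[:, m] = best
        # deflation (assumed): project the found direction out of each view
        X = X - np.outer(X @ U[:, m], U[:, m]) / (U[:, m] @ U[:, m])
        Y = Y - np.outer(Y @ V[:, m], V[:, m]) / (V[:, m] @ V[:, m])
    return U, V
```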
