SLIDE 1

Eliminating the Invariance on the Loss Landscape of Linear Autoencoders

Reza Oftadeh, Jiayi Shen, Atlas Wang, Dylan Shell

Texas A&M University, Department of Computer Science and Engineering

ICML 2020

SLIDE 2

Overview

◮ Linear Autoencoder (LAE) with the Mean Square Error (MSE) loss. The classical results:

– The loss surface has been analytically characterized.
– All local minima are global minima.
– The columns of the optimal decoder do not identify the principal directions, only the low-dimensional subspace they span (the so-called invariance problem).

◮ We present a new loss function for LAEs:

– The loss landscape is analytically characterized.
– All local minima are global minima.
– The columns of the optimal decoder span the principal directions.
– Invariant local minima become saddle points.
– The computational complexity is of the same order as the MSE loss.


SLIDE 12

Setup

◮ Data: m sample points of dimension n.

– Input: x_j ∈ R^n, output: y_j ∈ R^n, for j = 1, …, m.
– In matrix form: X ∈ R^{n×m}, Y ∈ R^{n×m}.

◮ LAE: a neural network with linear activation functions and a single hidden layer of width p < n, consisting of an encoder B ∈ R^{p×n} and a decoder A ∈ R^{n×p}.

– The weights: the encoder matrix B and the decoder matrix A.
– The global map is ŷ_j = ABx_j, or Ŷ = ABX.
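To make the shapes concrete, here is a minimal NumPy sketch of the LAE global map; the dimensions n, p, m and the random data are illustrative assumptions, not values from the slides.

```python
import numpy as np

# Illustrative dimensions (assumptions): n features, hidden width p < n, m samples.
n, p, m = 8, 3, 200
rng = np.random.default_rng(0)

X = rng.standard_normal((n, m))   # inputs, one sample per column
Y = X.copy()                      # autoencoding case: the target equals the input

B = rng.standard_normal((p, n))   # encoder weights
A = rng.standard_normal((n, p))   # decoder weights

Y_hat = A @ B @ X                 # global map  Ŷ = A B X
print(Y_hat.shape)                # (8, 200)
```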


SLIDE 18

The loss functions

◮ The MSE loss: ˜L(A, B) := ‖Y − ABX‖_F².

– If (A∗, B∗) is a local minimum of ˜L, then for any invertible C ∈ R^{p×p}, (A∗C, C⁻¹B∗) is another local minimum:
  ˜L(A∗C, C⁻¹B∗) = ‖Y − A∗CC⁻¹B∗X‖_F² = ˜L(A∗, B∗).

◮ The proposed loss: L(A, B) := ∑_{i=1}^{p} ‖Y − A I_{i;p} B X‖_F²,
  where I_{i;p} = diag(1, …, 1, 0, …, 0) ∈ R^{p×p} has ones in its first i diagonal entries.
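The invariance of the MSE loss, and the fact that the summed loss breaks it, can be checked numerically. Below is a minimal sketch; the dimensions, the random data, and the naive per-term implementation of L are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 6, 3, 100
X = rng.standard_normal((n, m)); Y = X
A = rng.standard_normal((n, p)); B = rng.standard_normal((p, n))

def mse_loss(A, B):
    # MSE loss:  ||Y - A B X||_F^2
    return np.linalg.norm(Y - A @ B @ X, 'fro') ** 2

def proposed_loss(A, B):
    # Proposed loss (naive p-term form):  sum_i ||Y - A I_{i;p} B X||_F^2
    total = 0.0
    for i in range(1, p + 1):
        I_ip = np.diag([1.0] * i + [0.0] * (p - i))
        total += np.linalg.norm(Y - A @ I_ip @ B @ X, 'fro') ** 2
    return total

C = rng.standard_normal((p, p))          # a generic (almost surely invertible) matrix
A2, B2 = A @ C, np.linalg.solve(C, B)    # the transformed pair (A C, C⁻¹ B)

print(np.isclose(mse_loss(A, B), mse_loss(A2, B2)))            # True: MSE is invariant
print(np.isclose(proposed_loss(A, B), proposed_loss(A2, B2)))  # generally False
```

For a generic invertible C the two MSE values agree exactly, while the proposed loss takes a different value; this is the invariance the new loss eliminates.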


SLIDE 21

A Visualization: MSE Loss

[Figure: 3-D visualization of the MSE loss; axes x, y, z, with directions v1 and v2 marked.]

SLIDE 22

A Visualization: Proposed Loss

[Figure: 3-D visualization of the proposed loss; axes x, y, z, with directions v1 and v2 marked.]

SLIDE 23

The loss functions


– Intuition (sequential): as an example, for p = 3 the masks are
  I_{1;3} = diag(1, 0, 0),  I_{2;3} = diag(1, 1, 0),  I_{3;3} = diag(1, 1, 1).
– But does this work simultaneously? And is it computationally feasible (p can be large)?
– Well, it does and it is! But before getting into details, let's discuss some implications ...
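As a quick aside on the feasibility question raised above: the p-term sum collapses into two fixed weighting matrices (the T_p and S_p that appear on the later slides), so it never has to be evaluated term by term. The sketch below checks this numerically; the dimension p, the random symmetric matrix M standing in for A′A, and the variable names are assumptions for illustration.

```python
import numpy as np

p = 4
rng = np.random.default_rng(2)
M = rng.standard_normal((p, p))
M = M + M.T                      # an arbitrary symmetric matrix standing in for A'A

# Nested masks I_{i;p} = diag(1, ..., 1, 0, ..., 0) with i ones.
masks = [np.diag([1.0] * i + [0.0] * (p - i)) for i in range(1, p + 1)]

# The p-term sums collapse into fixed matrices:
#   sum_i I_{i;p}            = T_p = diag(p, p-1, ..., 1)
#   sum_i I_{i;p} M I_{i;p}  = S_p ∘ M,  with (S_p)_{jk} = p + 1 - max(j, k)
T_p = np.diag(np.arange(p, 0, -1)).astype(float)
S_p = p + 1 - np.maximum.outer(np.arange(1, p + 1), np.arange(1, p + 1))

print(np.allclose(sum(masks), T_p))                          # True
print(np.allclose(sum(I @ M @ I for I in masks), S_p * M))   # True
```

So evaluating L costs one Hadamard product and one fixed diagonal weighting on top of the MSE computation, consistent with the claim that the complexity stays of the same order.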


SLIDE 26

Implications

◮ Let (A∗, B∗) be a local minimum of the MSE loss in which the columns of A∗ are the top eigenvectors of the sample covariance matrix; then for any invertible C ∈ R^{p×p}, (A∗C, C⁻¹B∗) is another local minimum.

– Numerically, different runs on the same dataset with different initializations lead to different optimal points.
– Almost surely, none of them will represent the principal directions.

◮ The only local minimum of the loss L is (A∗, B∗), up to the normalization of the columns.

– The loss L enables low-rank decomposition as a single optimization block that can be incorporated as part of a larger pipeline.
– This potentially enables LAEs to compete with other approaches for low-rank decomposition.

SLIDE 27

Critical Points

◮ The critical point equations of ˜L and L:

  For ˜L(A, B):
    A′AB Σ_xx = A′ Σ_yx,
    AB Σ_xx B′ = Σ_yx B′.

  For L(A, B):
    (S_p ∘ (A′A)) B Σ_xx = T_p A′ Σ_yx,
    A (S_p ∘ (B Σ_xx B′)) = Σ_yx B′ T_p,

where:

– A′ is the transpose of A.
– Σ_xx = XX′ and Σ_yx = Y X′ are covariance matrices.
– ∘ is the (element-wise) Hadamard product.
– T_p = diag(p, p−1, …, 1), and S_p has entries (S_p)_{jk} = p + 1 − max(j, k), e.g.

  S_4 = [ 4 3 2 1
          3 3 2 1
          2 2 2 1
          1 1 1 1 ].
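The equations for L are the stationarity conditions of the summed loss: the left-hand side minus the right-hand side of the first equation is, up to a factor of 2, the gradient of L with respect to B. The following sketch checks this against a finite-difference estimate; the dimensions, random data, and naive implementation of L are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, m = 5, 3, 40
X = rng.standard_normal((n, m)); Y = X
A = rng.standard_normal((n, p)); B = rng.standard_normal((p, n))

Sigma_xx, Sigma_yx = X @ X.T, Y @ X.T
T_p = np.diag(np.arange(p, 0, -1)).astype(float)
S_p = p + 1 - np.maximum.outer(np.arange(1, p + 1), np.arange(1, p + 1))

def L(A, B):
    # Naive p-term form of the proposed loss:  sum_i ||Y - A I_{i;p} B X||_F^2.
    return sum(np.linalg.norm(Y - A @ np.diag([1.0] * i + [0.0] * (p - i)) @ B @ X, 'fro') ** 2
               for i in range(1, p + 1))

# Gradient of L with respect to B, read off from the critical point equation:
# 2 * (LHS - RHS) of  (S_p ∘ (A'A)) B Σ_xx = T_p A' Σ_yx.
grad_B = 2 * ((S_p * (A.T @ A)) @ B @ Sigma_xx - T_p @ A.T @ Sigma_yx)

# Central finite-difference estimate of one entry of the same gradient.
eps = 1e-5
E = np.zeros_like(B); E[1, 2] = eps
fd = (L(A, B + E) - L(A, B - E)) / (2 * eps)
print(np.isclose(fd, grad_B[1, 2], rtol=1e-4))   # True, up to numerical error
```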


SLIDE 32

Results I

◮ Every critical point of L(A, B) is a critical point of ˜L(A, B), but not the other way around.

◮ Local minima of ˜L and L:

  For ˜L(A, B):  A∗ = U_{1:p} C_p,  B∗ = C_p⁻¹ U′_{1:p} Σ_yx Σ_xx⁻¹.
  For L(A, B):   A∗ = U_{1:p} D_p,  B∗ = D_p⁻¹ U′_{1:p} Σ_yx Σ_xx⁻¹.

– The ith column of U_{1:p} is a unit eigenvector of Σ := Σ_yx Σ_xx⁻¹ Σ_xy corresponding to the ith largest eigenvalue.
– D_p is a diagonal matrix with nonzero diagonal elements, and C_p ∈ GL_p(R).

◮ The characterization of the loss landscape:

– The structure of full-rank saddle points.
– The structure of low-rank saddle points (rather involved!).
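As a sanity check, the closed-form minimum of L can be built from data and verified against the critical point equations of the previous slide. This is a minimal sketch under illustrative assumptions (autoencoding data Y = X, small random dimensions, and an arbitrary choice of the diagonal matrix D_p):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, m = 6, 3, 500
X = rng.standard_normal((n, m)); Y = X        # autoencoding case, Y = X

Sigma_xx, Sigma_yx = X @ X.T, Y @ X.T
Sigma_xy = Sigma_yx.T
Sigma = Sigma_yx @ np.linalg.solve(Sigma_xx, Sigma_xy)   # Σ = Σ_yx Σ_xx⁻¹ Σ_xy

# U_{1:p}: unit eigenvectors of Σ for the p largest eigenvalues, in decreasing order.
eigvals, eigvecs = np.linalg.eigh(Sigma)
U = eigvecs[:, np.argsort(eigvals)[::-1][:p]]

D = np.diag(rng.uniform(0.5, 2.0, size=p))    # any diagonal D_p with nonzero entries
A_star = U @ D
B_star = np.linalg.inv(D) @ U.T @ Sigma_yx @ np.linalg.inv(Sigma_xx)

# Check the critical point equations of L at (A*, B*).
T_p = np.diag(np.arange(p, 0, -1)).astype(float)
S_p = p + 1 - np.maximum.outer(np.arange(1, p + 1), np.arange(1, p + 1))

eq1 = np.allclose((S_p * (A_star.T @ A_star)) @ B_star @ Sigma_xx, T_p @ A_star.T @ Sigma_yx)
eq2 = np.allclose(A_star @ (S_p * (B_star @ Sigma_xx @ B_star.T)), Sigma_yx @ B_star.T @ T_p)
print(eq1, eq2)   # True True
```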


SLIDE 37

Results II

◮ The MSE loss ˜L and our loss L can be written as

  ˜L(A, B) = Tr(Σ_yy) − 2 Tr(AB Σ_xy) + Tr(B′A′AB Σ_xx),
  L(A, B) = p Tr(Σ_yy) − 2 Tr(A T_p B Σ_xy) + Tr(B′ (S_p ∘ (A′A)) B Σ_xx).

◮ The analytical gradients are:

  d_B ˜L(A, B)W = −2 ⟨A′Σ_yx − A′AB Σ_xx, W⟩_F,
  d_B L(A, B)W = −2 ⟨T_p A′Σ_yx − (S_p ∘ (A′A)) B Σ_xx, W⟩_F,

in the direction W ∈ R^{p×n}. The gradient for A is similar.

◮ Finally, since the loss function is given explicitly, any optimization method that works with the MSE loss can be used with the proposed loss.
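The closed trace form can be checked against the naive p-term definition; it needs only the two fixed matrices T_p and S_p and one extra Hadamard product beyond what the MSE loss already computes, consistent with the claim that the cost is of the same order. A minimal sketch, with illustrative random data:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, m = 6, 3, 50
X = rng.standard_normal((n, m)); Y = X
A = rng.standard_normal((n, p)); B = rng.standard_normal((p, n))

Sigma_xx, Sigma_xy, Sigma_yy = X @ X.T, X @ Y.T, Y @ Y.T
T_p = np.diag(np.arange(p, 0, -1)).astype(float)
S_p = p + 1 - np.maximum.outer(np.arange(1, p + 1), np.arange(1, p + 1))

# Naive p-term form of the proposed loss.
L_naive = sum(np.linalg.norm(Y - A @ np.diag([1.0] * i + [0.0] * (p - i)) @ B @ X, 'fro') ** 2
              for i in range(1, p + 1))

# Closed trace form from this slide.
L_trace = (p * np.trace(Sigma_yy)
           - 2 * np.trace(A @ T_p @ B @ Sigma_xy)
           + np.trace(B.T @ (S_p * (A.T @ A)) @ B @ Sigma_xx))

# The MSE loss in its trace form, for comparison.
mse_trace = (np.trace(Sigma_yy)
             - 2 * np.trace(A @ B @ Sigma_xy)
             + np.trace(B.T @ (A.T @ A) @ B @ Sigma_xx))

print(np.isclose(L_naive, L_trace))                                        # True
print(np.isclose(mse_trace, np.linalg.norm(Y - A @ B @ X, 'fro') ** 2))    # True
```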
