Low Rank Approximation Lecture 6
Daniel Kressner Chair for Numerical Algorithms and HPC Institute of Mathematics, EPFL daniel.kressner@epfl.ch
The vectorization of an m × n matrix A is denoted by vec(A), where vec : Rm×n → Rm·n stacks the columns of a matrix into a long column vector.
Example:
A = [ a11 a12
      a21 a22
      a31 a32 ]   ⇒   vec(A) = (a11, a21, a31, a12, a22, a32)T
Remarks:
◮ In MATLAB: A(:)
◮ This way of vectorizing corresponds to how matrices are laid out in memory in MATLAB. In other programming languages (e.g., C arrays) matrices are laid out rowwise.
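A minimal MATLAB sketch of the above (the numbers are only for illustration):
A = [1 4; 2 5; 3 6];    % 3 x 2 matrix
v = A(:);               % vec(A) = [1 2 3 4 5 6]', columnwise stacking
B = reshape(v, 3, 2);   % reshape recovers A from vec(A)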
For an m × n matrix A and a k × ℓ matrix B, the Kronecker product is defined as
B ⊗ A := [ b11·A  · · ·  b1ℓ·A
            ⋮              ⋮
           bk1·A  · · ·  bkℓ·A ] ∈ Rkm×ℓn.
Most important property (for our purposes): the mixed-product rule
(B ⊗ A)(D ⊗ C) = (BD) ⊗ (AC)
for A ∈ Rm×n, B ∈ Rk×ℓ, C ∈ Rn×q, D ∈ Rℓ×p.
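A quick MATLAB check of the mixed-product rule (sizes chosen arbitrarily); note that kron(B,A) realizes B ⊗ A as defined above:
m = 3; n = 4; k = 2; l = 5; q = 3; p = 2;
A = randn(m,n); B = randn(k,l); C = randn(n,q); D = randn(l,p);
err = norm(kron(B,A)*kron(D,C) - kron(B*D, A*C), 'fro')   % ~ 1e-14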
◮ scalar = tensor of order 0
◮ (column) vector = tensor of order 1
◮ matrix = tensor of order 2
◮ tensor of order 3 = n1·n2·n3 numbers arranged in an n1 × n2 × n3 array
A d-th order tensor X of size n1 × n2 × · · · × nd is a d-dimensional array with entries Xi1,i2,...,id, where iµ ∈ {1, . . . , nµ} for µ = 1, . . . , d. In the following, the entries of X are usually real (for simplicity): X ∈ Rn1×n2×···×nd.
Multi-index notation: I = {1, . . . , n1} × {1, . . . , n2} × · · · × {1, . . . , nd}. Then i ∈ I is a tuple of d indices: i = (i1, i2, . . . , id). This allows us to write the entries of X as Xi for i ∈ I.
A matrix A ∈ Rm×n can be identified with a linear operator A : Rn → Rm, x ↦ A · x. There is no such (unique and natural) interpretation for tensors! ⇒ fundamental difficulty to define a meaningful general notion of eigenvalues and singular values of tensors.
Curse of dimensionality. Example: A tensor of order d = 30 with n1 = n2 = · · · = nd = 10 has 10^30 entries ⇒ 8 × 10^12 exabytes of storage!¹ For d ≫ 1: Cannot afford to store the tensor explicitly (in terms of its entries).
¹Global data storage a few years ago was calculated at 295 exabytes, see http://www.bbc.co.uk/news/technology-12419672.
◮ Addition of two equal-sized tensors X, Y:
Z = X + Y ⇔ Zi = Xi + Yi ∀i ∈ I.
◮ Scalar multiplication with α ∈ R:
Z = αX ⇔ Zi = αXi ∀i ∈ I.
⇒ vector space structure.
◮ Inner product of two equal-sized tensors X, Y:
⟨X, Y⟩ := Σ_{i∈I} Xi Yi.
Induced norm ‖X‖ := ( Σ_{i∈I} Xi² )^{1/2}.
For a 2nd order tensor (= matrix) this corresponds to the usual Euclidean geometry and the Frobenius norm.
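In MATLAB, both quantities are conveniently computed via vectorization (a minimal sketch):
X = rand(3,4,5); Y = rand(3,4,5);
ip  = X(:).' * Y(:);    % inner product <X, Y>
nrm = norm(X(:));       % induced norm ||X||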
A tensor X of size n1 × n2 × · · · × nd has n1 · n2 · · · nd entries ⇒ many ways to stack the entries in a (loooong) column vector. One possible choice: The vectorization of X is denoted by vec(X), where vec : Rn1×n2×···×nd → Rn1·n2···nd stacks the entries of a tensor in reverse lexicographical order into a long column vector.
Example: d = 3, n1 = 3, n2 = 2, n3 = 3.
vec(X) = (x111, x211, x311, x121, x221, x321, x112, . . . , x123, x223, x323)T
◮ A matrix has two modes (column mode and row mode).
◮ A dth-order tensor X has d modes (µ = 1, µ = 2, . . ., µ = d).
Let us fix all but one mode, e.g., µ = 1: Then X(:, i2, i3, . . . , id) (abuse of MATLAB notation) is a vector of length n1 for each choice of i2, . . . , id. These vectors are called fibers. The tensor X can thus be viewed as a bunch of column vectors (its 1-mode fibers).
Stacking these vectors into an n1 × (n2 · · · nd) matrix turns X ∈ Rn1×n2×···×nd into X^(1) ∈ Rn1×(n2n3···nd). More generally, for µ = 1, . . . , d, the µ-mode matricization of X is a matrix X^(µ) ∈ Rnµ×(n1···nµ−1nµ+1···nd) with entries
X^(µ)_{iµ,(i1,...,iµ−1,iµ+1,...,id)} = Xi ∀i ∈ I.
In MATLAB: a = rand(2,3,4,5);
◮ 1-mode matricization:
reshape(a,2,3*4*5)
◮ 2-mode matricization:
b = permute(a,[2 1 3 4]); reshape(b,3,2*4*5)
◮ 3-mode matricization:
b = permute(a,[3 1 2 4]); reshape(b,4,2*3*5)
◮ 4-mode matricization:
b = permute(a,[4 1 2 3]); reshape(b,5,2*3*4)
For a matrix A ∈ Rn1×n2: A^(1) = A, A^(2) = AT.
Consider the 1-mode matricization X^(1) ∈ Rn1×(n2···nd): It makes sense to multiply an m × n1 matrix A from the left:
Y^(1) := A X^(1) ∈ Rm×(n2···nd).
We can rearrange Y^(1) back into an m × n2 × · · · × nd tensor Y. This is called the 1-mode matrix multiplication
Y = A ◦1 X ⇔ Y^(1) = A X^(1).
More formally (and more ugly): Yi1,i2,...,id = Σ_{k=1}^{n1} ai1,k Xk,i2,...,id.
General definition of the µ-mode matrix product with A ∈ Rm×nµ:
Y = A ◦µ X ⇔ Y^(µ) = A X^(µ).
More formally (and more ugly): Yi1,i2,...,id = Σ_{k=1}^{nµ} aiµ,k Xi1,...,iµ−1,k,iµ+1,...,id.
For matrices:
◮ 1-mode multiplication = multiplication from the left: Y = A ◦1 X = A X.
◮ 2-mode multiplication = transposed multiplication from the right: Y = A ◦2 X = X AT.
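A minimal MATLAB sketch of the µ-mode product via matricization (modeprod is a hypothetical helper name, not part of any toolbox):
function Y = modeprod(A, X, mu)
  sz = size(X);
  order = [mu, 1:mu-1, mu+1:numel(sz)];            % bring mode mu to the front
  Xmu = reshape(permute(X, order), sz(mu), []);    % mu-mode matricization X^(mu)
  sz(mu) = size(A, 1);
  Y = ipermute(reshape(A*Xmu, sz(order)), order);  % fold A*X^(mu) back into a tensor
end
% Example: X = rand(3,4,5); A = rand(6,4); size(modeprod(A,X,2)) is [3 6 5].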
By definition, vec(X) = vec(X^(1)). Consequently, also vec(A ◦1 X) = vec(A X^(1)). Vectorized version of the 1-mode matrix product:
vec(A ◦1 X) = (In2···nd ⊗ A) vec(X) = (Ind ⊗ · · · ⊗ In2 ⊗ A) vec(X).
Relation between µ-mode matrix product and matrix-vector product:
vec(A ◦µ X) = (Ind ⊗ · · · ⊗ Inµ+1 ⊗ A ⊗ Inµ−1 ⊗ · · · ⊗ In1) vec(X)
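A quick MATLAB check of this relation for µ = 1 and a third-order tensor (sizes are arbitrary):
n1 = 3; n2 = 4; n3 = 2; m = 5;
X = rand(n1,n2,n3); A = rand(m,n1);
Y = reshape(A*reshape(X, n1, []), [m n2 n3]);             % A o_1 X
err = norm(Y(:) - kron(eye(n3), kron(eye(n2), A))*X(:))   % ~ 1e-15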
◮ A tensor X ∈ Rn1×···×nd is a d-dimensional array.
◮ There are various ways of reshaping the entries of a tensor X into a vector or matrix.
◮ µ-mode matrix multiplication can be expressed with Kronecker products.
Further reading:
◮ T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
Software:
◮ MATLAB (and all programming languages) offer basic functionality to work with d-dimensional arrays.
◮ MATLAB Tensor Toolbox: http://www.tensortoolbox.org/
Class 1: function-related tensors
Consider a function u(ξ1, . . . , ξd) ∈ R in d variables ξ1, . . . , ξd. A tensor U ∈ Rn1×···×nd represents a discretization of u:
◮ U contains function values of u evaluated on a grid; or
◮ U contains coefficients of a truncated expansion in tensorized basis functions:
u(ξ1, . . . , ξd) ≈ Σ_{i∈I} Ui φi1(ξ1) φi2(ξ2) · · · φid(ξd).
Typical setting:
◮ U only given implicitly, e.g., as the solution of a discretized PDE;
◮ seek approximations to U with very low storage and tolerable accuracy.
◮ d may become very large.
Discretization of a function in d variables ξ1, . . . , ξd ∈ [0, 1] ⇒ #function values grows exponentially with d.
Ideal situation: The function f is separable: f(ξ1, ξ2, . . . , ξd) = f1(ξ1) f2(ξ2) · · · fd(ξd). Then the discretized f is the Kronecker product of the discretized factors fj, and O(n^d) memory is reduced to O(dn) memory. Of course: Exact separability is rarely satisfied in practice.
Class 2: data-related tensors
Tensor U ∈ Rn1×···×nd contains multi-dimensional data.
Example 1: U2011,3,2 denotes the number of papers published in 2011 by author 3 in mathematical journal 2.
Example 2: A video of 1000 frames with resolution 640 × 480 can be viewed as a 640 × 480 × 1000 tensor.
Example 3: Hyperspectral images.
Example 4: Deep learning: coefficients in each layer of a deep NN stored as tensors (TensorFlow); interpretation of RNNs as hierarchical tensor decompositions.
Typical setting (except for Example 4):
◮ entries of U often given explicitly (at least partially).
◮ extraction of dominant features from U.
◮ usually moderate values for d.
◮ Consider
−∆u = f in Ω, u|∂Ω = 0, with Ω = [0, 1]³.
◮ Discretize on a tensor grid. Uniform grid for simplicity:
ξµ^(j) = j·h, h = 1/(n + 1), for µ = 1, 2, 3 and j = 1, . . . , n.
◮ Approximate solution tensor U ∈ Rn×n×n:
Ui1,i2,i3 ≈ u(ξ1^(i1), ξ2^(i2), ξ3^(i3)).
◮ Discretization of the 1D Laplace operator:
−∂xx ≈ tridiag(−1, 2, −1) = [  2 −1
                               −1  2  ⋱
                                   ⋱  ⋱ −1
                                      −1  2 ] =: A.
◮ Application in each coordinate direction:
−∂ξ1ξ1 u(ξ1, ξ2, ξ3) ≈ A ◦1 U, −∂ξ2ξ2 u(ξ1, ξ2, ξ3) ≈ A ◦2 U, −∂ξ3ξ3 u(ξ1, ξ2, ξ3) ≈ A ◦3 U.
◮ Hence,
−∆u ≈ A ◦1 U + A ◦2 U + A ◦3 U,
or, in vectorized form with u = vec(U),
−∆u ≈ (I ⊗ I ⊗ A + I ⊗ A ⊗ I + A ⊗ I ⊗ I) u.
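A minimal MATLAB sketch assembling this Kronecker-structured matrix and checking it against the µ-mode form (assumption: the 1/h² scaling is omitted, as in the matrix A above):
n = 8; e = ones(n,1);
A = spdiags([-e 2*e -e], -1:1, n, n);    % 1D Laplace stencil tridiag(-1,2,-1)
I = speye(n);
L = kron(I,kron(I,A)) + kron(I,kron(A,I)) + kron(A,kron(I,I));
U = rand(n,n,n);
Y1 = reshape(A*reshape(U, n, []), [n n n]);                                       % A o_1 U
Y2 = ipermute(reshape(A*reshape(permute(U,[2 1 3]), n, []), [n n n]), [2 1 3]);   % A o_2 U
Y3 = ipermute(reshape(A*reshape(permute(U,[3 1 2]), n, []), [n n n]), [3 1 2]);   % A o_3 U
err = norm(L*U(:) - (Y1(:) + Y2(:) + Y3(:)))   % ~ 1e-15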
Finite difference discretization of the model problem −∆u = f in Ω, u|∂Ω = 0 for Ω = [0, 1]³ takes the form
(I ⊗ I ⊗ A + I ⊗ A ⊗ I + A ⊗ I ⊗ I) u = f.
Similar structure for finite element discretization with tensorized FEs:
V ⊗ W ⊗ Z = { Σ_{i,j,k} αijk vi(ξ1) wj(ξ2) zk(ξ3) : αijk ∈ R },
with V = {v1(ξ1), . . . , vn(ξ1)}, W = {w1(ξ2), . . . , wn(ξ2)}, Z = {z1(ξ3), . . . , zn(ξ3)}.
Galerkin discretization:
(KV ⊗ MW ⊗ MZ + MV ⊗ KW ⊗ MZ + MV ⊗ MW ⊗ KZ) u = f,
with 1D mass/stiffness matrices MV, MW, MZ, KV, KW, KZ.
Finite difference discretization of the model problem −∆u = f in Ω, u|∂Ω = 0 for Ω = [0, 1]^d takes the form
( Σ_{µ=1}^{d} I ⊗ · · · ⊗ I ⊗ A ⊗ I ⊗ · · · ⊗ I ) u = f,   with A in the µ-th position.
To obtain such Kronecker structure in general, one needs:
◮ a tensorized domain;
◮ a highly structured grid;
◮ coefficients that can be written/approximated as a sum of separable functions.
PDE eigenvalue problem
∆u(ξ) + V(ξ) u(ξ) = λ u(ξ) in Ω = [0, 1]^d, u(ξ) = 0 on ∂Ω.
Assumption: The potential can be represented as
V(ξ) = Σ_{j=1}^{s} Vj^(1)(ξ1) Vj^(2)(ξ2) · · · Vj^(d)(ξd).
⇒ finite difference discretization
A u = (AL + AV) u = λ u,
with
AL = Σ_{µ=1}^{d} I ⊗ · · · ⊗ I ⊗ AL ⊗ I ⊗ · · · ⊗ I   (1D Laplacian in the µ-th position),
AV = Σ_{j=1}^{s} A_{V,j}^(d) ⊗ · · · ⊗ A_{V,j}^(2) ⊗ A_{V,j}^(1).
Consider Ω = [−10, 2]^d and the potential ([Meyer et al. 1990; Raab et al. 2000; Faou et al. 2009])
V(ξ) = (1/2) Σ_{j=1}^{d} σj ξj² + Σ_{j=1}^{d−1} ( σ∗ ( ξj ξ_{j+1}² − (1/3) ξj³ ) + (σ∗²/16) ( ξj² + ξ_{j+1}² )² ),
with σj ≡ 1, σ∗ = 0.2. Discretization with n = 128 dof/dimension for d = 20 dimensions.
◮ Eigenvector has n^d ≈ 10^42 entries.
◮ Explicit storage of the eigenvector would require 10^25 exabytes!
Solved with accuracy 10^−12 in less than 1 hour on a laptop.
◮ spin-1/2 particles: proton, neutron, electron, and quark.
◮ two states: spin-up, spin-down
◮ quantum state of each spin represented by a vector in C² (spinor)
◮ quantum state of a system of d spins represented by a vector in C^{2^d}
◮ quantum mechanical operators expressed in terms of the Pauli matrices
Px = [0 1; 1 0], Py = [0 −i; i 0], Pz = [1 0; 0 −1]
◮ spin Hamiltonian: sum of Kronecker products of Pauli matrices and identities ⇒ each term describes a physical (inter)action of spins
◮ interaction of spins described by a graph
◮ Goal: Compute the ground state of the spin Hamiltonian.
Example: 1d chain of 5 spins with periodic boundary conditions
Hamiltonian describing pairwise interaction between nearest neighbors: H = Pz ⊗ Pz ⊗ I ⊗ I ⊗ I + I ⊗ Pz ⊗ Pz ⊗ I ⊗ I + I ⊗ I ⊗ Pz ⊗ Pz ⊗ I + I ⊗ I ⊗ I ⊗ Pz ⊗ Pz + Pz ⊗ I ⊗ I ⊗ I ⊗ Pz
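A minimal MATLAB sketch assembling this Hamiltonian via kron (the loop cyclically shifts the pattern (Pz, Pz, I, I, I) over the 5 sites):
Pz = sparse([1 0; 0 -1]); I2 = speye(2);
pattern = {Pz, Pz, I2, I2, I2};
H = sparse(2^5, 2^5);
for shift = 0:4                      % one term per pair of neighbours (periodic)
    P = circshift(pattern, shift);
    term = P{1};
    for k = 2:5, term = kron(term, P{k}); end
    H = H + term;
end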
◮ Ising (ZZ) model for a 1d chain of d spins with open boundary conditions:
H = Σ_{p=1}^{d−1} I ⊗ · · · ⊗ I ⊗ Pz ⊗ Pz ⊗ I ⊗ · · · ⊗ I   (Pz pair in positions p, p+1)
  + λ Σ_{p=1}^{d} I ⊗ · · · ⊗ I ⊗ Px ⊗ I ⊗ · · · ⊗ I   (Px in position p),
where λ = ratio between the strength of the magnetic field and the pairwise interactions.
◮ 1d Heisenberg (XY) model
◮ Current research: 2d models.
◮ More details in:
Huckle/Waldherr/Schulte-Herbrüggen: Computations in Quantum Tensor Networks.
Schollwöck: The density-matrix renormalization group in the age of matrix product states.
◮ 3 stochastic automata A1, A2, A3 having 3 states each.
◮ Vector x_t^(i) ∈ R³ describes the probabilities of states (1), (2), (3) in Ai at time t.
◮ No coupling between the automata ⇒ local transition x_t^(i) → x_{t+1}^(i) described by a Markov chain:
x_{t+1}^(i) = Ei x_t^(i),
with a stochastic matrix Ei.
◮ Stationary distribution of Ai = Perron vector of Ei (eigenvector for eigenvalue 1).
◮ 3 stochastic automata A1, A2, A3 having 3 states each.
◮ Coupling between the automata ⇒ local transition x_t^(i) → x_{t+1}^(i) is not described by a Markov chain.
◮ Need to consider all possible combinations of states in (A1, A2, A3): (1, 1, 1), (1, 1, 2), (1, 1, 3), (1, 2, 1), (1, 2, 2), . . . .
◮ Vector xt ∈ R^{3³} (or tensor X(t) ∈ R3×3×3) describes the probabilities of these combinations.
◮ Transition xt → xt+1 described by a Markov chain:
xt+1 = E xt, with a large stochastic matrix E.
◮ Oversimplified example:
E = I ⊗ I ⊗ E1 + I ⊗ E2 ⊗ I + E3 ⊗ I ⊗ I + I ⊗ E21 ⊗ E12 + E32 ⊗ E23 ⊗ I.
◮ Goal: Compute the stationary distribution = Perron vector of E.
◮ More details in:
Stewart: Introduction to the Numerical Solution of Markov Chains.
Buchholz: Product Form Approximations for Communicating Markov Processes.
◮ Emerged during the last five years in numerical analysis.
◮ Successfully applied to:
◮ parameter-dependent / multi-dimensional integrals;
◮ electronic structure calculations: Hartree-Fock / DFT;
◮ stochastic and parametric PDEs;
◮ high-dimensional Boltzmann / chemical master / Fokker-Planck / Schrödinger equations;
◮ micromagnetism;
◮ rational approximation problems;
◮ computational homogenization;
◮ computational finance;
◮ multivariate regression and machine learning;
◮ . . .
M. Bachmayr, R. Schneider, and A. Uschmajew. Tensor networks and hierarchical tensors for the solution of high-dimensional partial differential equations.
A. Cichocki. Era of big data processing: A new approach via tensor networks and tensor decompositions. arXiv:1403.2048, 2014.
L. Grasedyck, D. Kressner, and C. Tobler. A literature survey of low-rank tensor approximation techniques. GAMM-Mitt., 36(1):53–78, 2013.
W. Hackbusch. Tensor Spaces and Numerical Tensor Calculus. Springer, Heidelberg, 2012.
V. Khrulkov, A. Novikov, and I. Oseledets. Expressive power of recurrent neural networks, 2018. ICLR: Sixth International Conference on Learning Representations.
A. Nouy. Low-rank methods for high-dimensional approximation and model order reduction. In Model reduction and approximation, volume 15 of Comput. Sci. Eng., pages 171–226. SIAM, Philadelphia, PA, 2017.
N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos. Tensor decomposition for signal processing and machine learning. IEEE Trans. Signal Process., 65(13):3551–3582, 2017.
◮ Aim: Generalize the concept of low rank from matrices to tensors.
◮ One possibility is motivated by the rank-R matrix factorization
X = A BT = a1 b1T + a2 b2T + · · · + aR bRT,
whose vectorization reads vec(X) = b1 ⊗ a1 + b2 ⊗ a2 + · · · + bR ⊗ aR.
The Canonical Polyadic (CP) decomposition of a tensor X ∈ Rn1×n2×n3 is defined via
vec(X) = c1 ⊗ b1 ⊗ a1 + c2 ⊗ b2 ⊗ a2 + · · · + cR ⊗ bR ⊗ aR,
i.e.,
X = a1 ◦ b1 ◦ c1 + a2 ◦ b2 ◦ c2 + · · · + aR ◦ bR ◦ cR
for vectors aj ∈ Rn1, bj ∈ Rn2, cj ∈ Rn3. CP directly corresponds to semi-separable approximation.
Tensor rank of X = minimal possible R.
Illustration of CP decomposition X = a1 ◦ b1 ◦ c1 + a2 ◦ b2 ◦ c2 + · · · + aR ◦ bR ◦ cR (figure: sum of R rank-one tensors with vectors ar, br, cr).
More compact notation: X = ⟦A, B, C⟧, with
A = [a1, . . . , aR] ∈ Rn1×R, B = [b1, . . . , bR] ∈ Rn2×R, C = [c1, . . . , cR] ∈ Rn3×R.
Elementwise expression of the CP decomposition:
Xijk = Σ_{r=1}^{R} air bjr ckr, iµ = 1, . . . , nµ.
This shows that every matricization has rank at most R.
Question: What is a necessary and sufficient condition on the matricizations for a tensor X to have tensor rank 1?
For the 1-mode matricization: X^(1) = A Y, with the rth row of Y given by the Kronecker product of the rth columns of C and B.
Definition: For V ∈ Rm×k and W ∈ Rn×k, the Khatri-Rao product of V and W is the mn × k matrix
V ⊙ W = [ v1 ⊗ w1, · · · , vk ⊗ wk ].
We have X^(1) = A (C ⊙ B)T, X^(2) = B (C ⊙ A)T, X^(3) = C (B ⊙ A)T.
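A small MATLAB check of the 1-mode identity, building X from its CP factors (sizes arbitrary):
n1 = 4; n2 = 3; n3 = 5; R = 2;
A = rand(n1,R); B = rand(n2,R); C = rand(n3,R);
KR = zeros(n2*n3, R); X = zeros(n1,n2,n3);
for r = 1:R
    KR(:,r) = kron(C(:,r), B(:,r));                                  % column r of C (.) B
    X = X + reshape(kron(C(:,r), kron(B(:,r), A(:,r))), [n1 n2 n3]); % a_r o b_r o c_r
end
err = norm(reshape(X, n1, []) - A*KR.', 'fro')   % ~ 1e-16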
With Rµ := rank(X^(µ)), the tensor rank satisfies
max{R1, R2, R3} ≤ rank(X) ≤ min{R2R3, R1R3, R1R2}.
◮ Except for very special situations (e.g., n3 = 2) there are no tighter upper bounds known.
◮ A real 2 × 2 × 2 tensor has rank at most 3 [Kruskal’1989], has ranks 2 or 3 with positive probability, and ranks 0 or 1 with zero probability.
◮ A real 3 × 3 × 3 tensor has rank at most 5 [Kruskal’1989] and has rank 4 with probability 1 [ten Berge et al.’2004].
◮ Computing the tensor rank is an NP-hard problem.
See [Kolda/Bader’2009] for more.
The relation X^(1) = A (C ⊙ B)T can be written as
X^(1) = [ A·diag(C(1, :))·BT, · · · , A·diag(C(n3, :))·BT ],
i.e., a simultaneous diagonalization of all frontal slices of X:
X(:, :, k) = A · diag(C(k, :)) · BT, k = 1, . . . , n3.
For n3 = 1, significant degrees of freedom. For n3 > 1, the CP decomposition is often unique (up to permutations and scalings).
Definition: The k-rank of a matrix A, denoted k-rank(A), is the largest k such that any k columns of A are linearly independent.
Theorem (Kruskal’1977). The CP decomposition is unique (up to permutations and scalings) if
k-rank(A) + k-rank(B) + k-rank(C) ≥ 2R + 2.
Given a real matrix A, its (matrix) rank is the same over R and C. This is not true, in general, for tensor rank. Consider the 2 × 2 × 2 tensor X given by the 1-mode matricization
X^(1) = [ 1 0 0 1; 0 1 −1 0 ],
i.e., the frontal slices X(:, :, 1) = [1 0; 0 1] and X(:, :, 2) = [0 1; −1 0].
Over C, the tensor rank of X is 2, because one can write X = ⟦A, B, C⟧ with
A = (1/√2)·[1 1; −i i], B = (1/√2)·[1 1; i −i], C = [1 1; −i i].
Over R, the tensor rank of X is 3, because one can write X = ⟦A, B, C⟧ with
A = [1 0 1; 0 1 −1], B = [1 0 1; 0 1 1], C = [1 1 0; −1 1 1],
and the real rank cannot be smaller than 3, which can be shown using techniques by ten Berge [1991].
The matrix rank is a lower semi-continuous function. Meaning: A converging sequence of matrices of rank r always converges to a matrix of rank at most r.
Proof: Equivalently, one has to show that for any m × n matrix A of rank r there is an open neighbourhood U such that all matrices B ∈ U have rank r or larger. Use the fact that a matrix A has rank at least r if and only if it has an invertible r × r submatrix. For this submatrix there is a neighbourhood of A in which the submatrix stays invertible.
The matrix rank is a lower semi-continuous function. This is not true, in general, for tensor rank. Consider the 2 × 2 × 2 tensor
X = e1 ◦ e1 ◦ e2 + e1 ◦ e2 ◦ e1 + e2 ◦ e1 ◦ e1
with unit vectors e1, e2. Its rank can be shown to be 3. Consider the parametrized rank-2 tensor
Xα = α (e1 + α^{−1} e2) ◦ (e1 + α^{−1} e2) ◦ (e1 + α^{−1} e2) − α e1 ◦ e1 ◦ e1.
This tensor has rank 2 but, as Xα → X for α → ∞, it is arbitrarily close to a tensor of rank 3.
A tensor X is called degenerate if it is arbitrarily close to a tensor of lower tensor rank. See [Silva/Lim’2008] for more details on degenerate tensors.
The matrix rank is a lower semi-continuous function. This is not true, in general, for tensor rank. For tensors of order d ≥ 3:
◮ the tensor rank R is not lower semi-continuous ⇒ lack of closedness;
◮ successive rank-1 approximations fail.
(Picture taken from [Kolda/Bader’2009].)
To avoid degeneracies, one can use a smoothed version of the tensor rank. The border rank of a tensor X is defined as the smallest integer r such that for any ε > 0 there exists E with ‖E‖ ≤ ε such that X + E has tensor rank r.
Consider the 2 × 2 block matrix product
[ C11 C12; C21 C22 ] = [ A11 A12; A21 A22 ] · [ B11 B12; B21 B22 ]:
C11 = A11B11 + A12B21, C21 = A21B11 + A22B21,
C12 = A11B12 + A12B22, C22 = A21B12 + A22B22.
Requires 8 block matrix multiplications! Can be reduced via tensors + CP decomposition.
More general view of matrix multiplication algorithms:
◮ Map the multi-index into a single index:
(1, 1) → 1, (2, 1) → 2, (1, 2) → 3, (2, 2) → 4.
◮ Translate each operation into a rank-one tensor.
“Compute A11B11 and add to the (1, 1) entry of C” becomes: e(1,1) ◦ e(1,1) ◦ e(1,1) = e1 ◦ e1 ◦ e1.
“Compute A21B12 and add to the (2, 2) entry of C” becomes: e(2,1) ◦ e(1,2) ◦ e(2,2) = e2 ◦ e3 ◦ e4.
2 × 2 matrix multiplication tensor:
T2 = e(1,1) ◦ e(1,1) ◦ e(1,1) + e(1,2) ◦ e(2,1) ◦ e(1,1) + e(2,1) ◦ e(1,1) ◦ e(2,1) + e(2,2) ◦ e(2,1) ◦ e(2,1) + · · ·
Strassen discovered in 1969 that this tensor has rank less than 8 by providing an explicit CP decomposition with 7 terms:
T2 = (e(1,1) + e(2,2)) ◦ (e(1,1) + e(2,2)) ◦ (e(1,1) + e(2,2))
+ (e(2,1) + e(2,2)) ◦ e(1,1) ◦ (e(2,1) − e(2,2))
+ e(1,1) ◦ (e(1,2) − e(2,2)) ◦ (e(1,2) + e(2,2))
+ e(2,2) ◦ (e(2,1) − e(1,1)) ◦ (e(1,1) + e(2,1))
+ (e(1,1) + e(1,2)) ◦ e(2,2) ◦ (−e(1,1) + e(1,2))
+ (e(2,1) − e(1,1)) ◦ (e(1,1) + e(1,2)) ◦ e(2,2)
+ (e(1,2) − e(2,2)) ◦ (e(2,1) + e(2,2)) ◦ e(1,1)
First compute seven matrix products:
M1 := (A11 + A22)(B11 + B22)
M2 := (A21 + A22)B11
M3 := A11(B12 − B22)
M4 := A22(B21 − B11)
M5 := (A11 + A12)B22
M6 := (A21 − A11)(B11 + B12)
M7 := (A12 − A22)(B21 + B22)
Then compute the entries of C:
C11 = M1 + M4 − M5 + M7, C12 = M3 + M5,
C21 = M2 + M4, C22 = M1 − M2 + M3 + M6.
Reduces the complexity from O(n³) to O(n^{log2 7}) = O(n^{2.807...}).
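A minimal recursive MATLAB sketch of this scheme for n a power of two (strassen is a hypothetical function name; no cut-off to standard multiplication is used):
function C = strassen(A, B)
  n = size(A,1);
  if n == 1, C = A*B; return; end
  i1 = 1:n/2; i2 = n/2+1:n;
  A11 = A(i1,i1); A12 = A(i1,i2); A21 = A(i2,i1); A22 = A(i2,i2);
  B11 = B(i1,i1); B12 = B(i1,i2); B21 = B(i2,i1); B22 = B(i2,i2);
  M1 = strassen(A11+A22, B11+B22); M2 = strassen(A21+A22, B11);
  M3 = strassen(A11, B12-B22);     M4 = strassen(A22, B21-B11);
  M5 = strassen(A11+A12, B22);     M6 = strassen(A21-A11, B11+B12);
  M7 = strassen(A12-A22, B21+B22);
  C = [M1+M4-M5+M7, M3+M5; M2+M4, M1-M2+M3+M6];   % assemble the 2 x 2 block result
end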
◮ T2 has tensor and border rank 7 [Strassen, Winograd, ...].
◮ T3 has tensor rank between 19 and 23; border rank between 15 and 21.
◮ [Bläser’1999]: Tn has tensor rank at least (5/2)n² − 3n.
◮ State-of-the-art constructions of fast matrix multiplication are all tensor based.
◮ Record based on the so-called laser method for tensors: O(n^{2.3728639}) [Le Gall’2014]. See also https://simons.berkeley.edu/sites/default/files/docs/2438/slideslegall.pdf.
Fitting the CP decomposition to a tensor X requires the solution of the optimization problem
min_{A,B,C} ‖X − ⟦A, B, C⟧‖.
Difficulties:
◮ Scaling (and permutation) indeterminacy.
◮ The target function is convex in each factor but not jointly convex ⇒ potentially many local minima.
Idea: Normalize all but one factor and exploit the multilinearity of the format.
Let B, C be fixed and with unit norm columns. Optimize for A only:
min_A ‖X − ⟦A, B, C⟧‖.
This is a standard linear least-squares problem (in an unusual form). Recall that the 1-mode matricization of ⟦A, B, C⟧ is given by A(C ⊙ B)T:
min { ‖X^(1) − A(C ⊙ B)T‖_F : A ∈ Rn1×R }.
The solution via the normal equations is given by
A = X^(1)(C ⊙ B)((C ⊙ B)T(C ⊙ B))^{−1} = X^(1)(C ⊙ B)(CTC ∗ BTB)^{−1},
where ∗ denotes the elementwise (Hadamard) product.
Remarks:
◮ This assumes that C ⊙ B has full column rank. If not, replace (CTC ∗ BTB)^{−1} by the pseudo-inverse.
Input: X ∈ Rn1×n2×n3, starting factors B ∈ Rn2×R, C ∈ Rn3×R.
Output: CP approximation ⟦A, B, C⟧.
1: Normalize the columns of B.
2: while not converged do
3:   Normalize the columns of C.
4:   Set A ← X^(1)(C ⊙ B)(CTC ∗ BTB)^{−1}.
5:   Normalize the columns of A.
6:   Set B ← X^(2)(C ⊙ A)(CTC ∗ ATA)^{−1}.
7:   Normalize the columns of B.
8:   Set C ← X^(3)(B ⊙ A)(BTB ∗ ATA)^{−1}.
9: end while
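A minimal MATLAB sketch of this ALS iteration for a third-order tensor (cp_als and kr are hypothetical helper names; a fixed number of sweeps replaces the convergence test):
function [A, B, C] = cp_als(X, R, nsweeps)
  [n1, n2, n3] = size(X);
  X1 = reshape(X, n1, []);                    % X^(1)
  X2 = reshape(permute(X, [2 1 3]), n2, []);  % X^(2)
  X3 = reshape(permute(X, [3 1 2]), n3, []);  % X^(3)
  nrm = @(M) M ./ sqrt(sum(M.^2, 1));         % normalize columns
  B = nrm(randn(n2, R)); C = nrm(randn(n3, R));
  for it = 1:nsweeps
    A = X1 * kr(C, B) / ((C'*C) .* (B'*B));   A = nrm(A);
    B = X2 * kr(C, A) / ((C'*C) .* (A'*A));   B = nrm(B);
    C = X3 * kr(B, A) / ((B'*B) .* (A'*A));
  end
end
function K = kr(V, W)                         % Khatri-Rao product V (.) W
  K = zeros(size(V,1)*size(W,1), size(V,2));
  for r = 1:size(V,2), K(:,r) = kron(V(:,r), W(:,r)); end
end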
Question: How does the algorithm reduce for the special case R = 1, d = 2?
◮ An alternative rank concept for tensors is motivated by the factorization
A = U · Σ · V T, U ∈ Rn1×r, V ∈ Rn2×r, Σ ∈ Rr×r,
whose vectorization reads vec(A) = (V ⊗ U) vec(Σ).
Ignore the diagonal structure of Σ and call it C. The Tucker decomposition of a tensor X ∈ Rn1×n2×n3 is defined via
vec(X) = (W ⊗ V ⊗ U) vec(C),
with U ∈ Rn1×r1, V ∈ Rn2×r2, W ∈ Rn3×r3, and core tensor C ∈ Rr1×r2×r3. In terms of µ-mode matrix products:
X = U ◦1 V ◦2 W ◦3 C =: (U, V, W) ◦ C.
Illustration of Tucker decomposition X = (U, V, W) ◦ C (figure: core tensor C multiplied by the factor matrices U, V, W along the three modes).
Consider all three matricizations:
X^(1) = U · C^(1) · (W ⊗ V)T, X^(2) = V · C^(2) · (W ⊗ U)T, X^(3) = W · C^(3) · (V ⊗ U)T.
These are low-rank decompositions:
rank(X^(1)) ≤ r1, rank(X^(2)) ≤ r2, rank(X^(3)) ≤ r3.
The multilinear rank of a tensor X ∈ Rn1×n2×n3 is defined as the tuple (r1, r2, r3) with rµ = rank(X^(µ)).
Goal: Approximate a given tensor X by a Tucker decomposition with prescribed multilinear rank (r1, r2, r3).
1. Compute the SVDs X^(µ) = Uµ Σµ VµT for µ = 1, 2, 3.
2. Truncate: Uµ := Uµ(:, 1 : rµ) for µ = 1, 2, 3.
3. Form the core tensor C := U1T ◦1 U2T ◦2 U3T ◦3 X.
Truncated tensor produced by the HOSVD [De Lathauwer/De Moor/Vandewalle’2000]:
X̃ := (U1, U2, U3) ◦ C.
Remark: X̃ is an orthogonal projection of X:
X̃ = π1 ◦ π2 ◦ π3 X with πµ X := (Uµ UµT) ◦µ X.
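A minimal MATLAB sketch of this truncated HOSVD for a third-order tensor (hosvd_trunc is a hypothetical helper name):
function [C, U] = hosvd_trunc(X, r)           % r = [r1 r2 r3]
  d = 3; U = cell(1, d); C = X;
  for mu = 1:d                                % steps 1 and 2: truncated SVDs of X^(mu)
    order = [mu, 1:mu-1, mu+1:d];
    Xmu = reshape(permute(X, order), size(X, mu), []);
    [Umu, ~, ~] = svd(Xmu, 'econ');
    U{mu} = Umu(:, 1:r(mu));
  end
  for mu = 1:d                                % step 3: core C = U1' o_1 U2' o_2 U3' o_3 X
    order = [mu, 1:mu-1, mu+1:d];
    sz = size(C); sz(mu) = r(mu);
    Cmu = U{mu}' * reshape(permute(C, order), size(C, mu), []);
    C = ipermute(reshape(Cmu, sz(order)), order);
  end
end
% The truncated tensor is recovered by applying U{1}, U{2}, U{3} to C in modes 1, 2, 3.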
The tensor X̃ resulting from the HOSVD satisfies the quasi-optimality condition
‖X − X̃‖ ≤ √d ‖X − Xbest‖,
where Xbest is the best approximation of X with multilinear ranks (r1, . . . , rd).
Proof (for d = 3):
‖X − X̃‖² = ‖X − (π1 ◦ π2 ◦ π3)X‖²
= ‖X − π1X‖² + ‖π1X − (π1 ◦ π2)X‖² + ‖(π1 ◦ π2)X − (π1 ◦ π2 ◦ π3)X‖²
≤ ‖X − π1X‖² + ‖X − π2X‖² + ‖X − π3X‖².
Using ‖X − πµX‖ ≤ ‖X − Xbest‖ for µ = 1, 2, 3 leads to ‖X − X̃‖² ≤ 3 · ‖X − Xbest‖².
Another direct consequence of the proof: Let σk^(µ) denote the kth singular value of X^(µ). Then the approximation X̃ obtained from the HOSVD satisfies
‖X − X̃‖² ≤ Σ_{µ=1}^{3} Σ_{k=rµ+1}^{nµ} (σk^(µ))².
This also implies a lower bound for ‖X − Xbest‖ in terms of the singular values of the matricizations of X.
◮ The SVD can be replaced by any low-rank approximation technique discussed in this course. By the triangle inequality, the bound of the Corollary still holds with an extra term accounting for the inexact SVD.
Aim at finding the best approximation:
‖X − U1 ◦1 U2 ◦2 U3 ◦3 C‖ = min!
Given U1, U2, U3 with orthonormal columns (ONB), the core tensor C := U1T ◦1 U2T ◦2 U3T ◦3 X is the optimal choice. With this choice of core tensor, we have
‖X − U1 ◦1 U2 ◦2 U3 ◦3 C‖² = ‖X‖² − ‖U1T ◦1 U2T ◦2 U3T ◦3 X‖².
In turn, the minimization problem is equivalent to
‖U1T ◦1 U2T ◦2 U3T ◦3 X‖ = max! s.t. U1, U2, U3 ONB.
ALS for the minimization problem (in its equivalent maximization form): Let U2, U3 be fixed and maximize with respect to U1:
max_{U1TU1=I} ‖U1T ◦1 U2T ◦2 U3T ◦3 X‖ = max_{U1TU1=I} ‖U1T X^(1)(U3 ⊗ U2)‖_F.
The solution is obtained by setting U1 to the r1 dominant left singular vectors of X^(1)(U3 ⊗ U2).
Higher-order orthogonal iteration (HOOI):
1: while not converged do
2:   Set U1 to the r1 dominant left singular vectors of X^(1)(U3 ⊗ U2).
3:   Set U2 to the r2 dominant left singular vectors of X^(2)(U3 ⊗ U1).
4:   Set U3 to the r3 dominant left singular vectors of X^(3)(U2 ⊗ U1).
5: end while
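One sweep of HOOI in MATLAB might look as follows (a sketch: X is the tensor, r = [r1 r2 r3], and U2, U3 are current factors, e.g., from the HOSVD):
[n1, n2, n3] = size(X);
X1 = reshape(X, n1, []);
X2 = reshape(permute(X, [2 1 3]), n2, []);
X3 = reshape(permute(X, [3 1 2]), n3, []);
[U1, ~, ~] = svd(X1 * kron(U3, U2), 'econ'); U1 = U1(:, 1:r(1));
[U2, ~, ~] = svd(X2 * kron(U3, U1), 'econ'); U2 = U2(:, 1:r(2));
[U3, ~, ~] = svd(X3 * kron(U2, U1), 'econ'); U3 = U3(:, 1:r(3));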
◮ Convergence is often observed to be good, but there are examples where HOOI does not even converge to a stationary point.
◮ HOOI best initialized with factors from the (ST)HOSVD. ◮ Again, SVD can be replaced by other methods but little to no
analysis exists that would guide such a choice.
◮ See [Kolda/Bader’09] for more details.
For general tensors:
◮ the multilinear rank r is lower semi-continuous ⇒ closedness property.
◮ HOSVD – a simple and robust algorithm to obtain a quasi-optimal low-rank approximation.
◮ quasi-optimality is good enough for most applications in scientific computing.
◮ robust black-box algorithms/software available (e.g., Tensor Toolbox).
Storage of the core tensor ∼ r^d ⇒ curse of dimensionality.