slide-1
SLIDE 1

Low Rank Approximation Lecture 6

Daniel Kressner
Chair for Numerical Algorithms and HPC
Institute of Mathematics, EPFL
daniel.kressner@epfl.ch

1

slide-2
SLIDE 2

The Kronecker product

2

slide-3
SLIDE 3

Vectorization

The vectorization of an m × n matrix A is denoted by vec(A), where vec : Rm×n → Rm·n stacks the columns of a matrix into a long column vector. Example:

  A = [ a11 a12 ; a21 a22 ; a31 a32 ]  ⇒  vec(A) = ( a11, a21, a31, a12, a22, a32 )T

Remarks:

◮ In MATLAB: A(:)
◮ This way of vectorizing corresponds to how matrices are laid out in memory in MATLAB. In other programming languages (e.g., C arrays), matrices are laid out row-wise.
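A one-line check for the 3 × 2 example above (a small sketch with stand-in numeric values aij = 10i + j):

A = [11 12; 21 22; 31 32];    % stand-in values with A(i,j) = 10*i + j
A(:)                          % returns (11, 21, 31, 12, 22, 32)^T, i.e., the columns stacked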

3

slide-4
SLIDE 4

Kronecker product

For an m × n matrix A and a k × ℓ matrix B, the Kronecker product is defined as

  B ⊗ A := [ b11 A  · · ·  b1ℓ A
              ...    ...    ...
             bk1 A  · · ·  bkℓ A ]  ∈ Rkm×ℓn.

Most important properties (for our purposes):

1. vec(A X) = (I ⊗ A) vec(X).
2. vec(X AT) = (A ⊗ I) vec(X).
3. (B ⊗ A)(D ⊗ C) = BD ⊗ AC for A ∈ Rm×n, B ∈ Rk×ℓ, C ∈ Rn×q, D ∈ Rℓ×p.
4. Im ⊗ In = Imn.
5. (A1 + A2) ⊗ B = A1 ⊗ B + A2 ⊗ B and A ⊗ (B1 + B2) = A ⊗ B1 + A ⊗ B2.
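These properties are easy to verify numerically; a small MATLAB sanity check for properties 1–3 with random matrices (the sizes below are chosen arbitrarily, any compatible choice works):

m = 3; n = 4; p = 5; q = 2;
A = randn(m,n); X = randn(n,p); Y = randn(q,n);
e1 = norm(reshape(A*X,[],1)  - kron(eye(p), A) * X(:));    % property 1: vec(AX) = (I ⊗ A) vec(X)
e2 = norm(reshape(Y*A',[],1) - kron(A, eye(q)) * Y(:));    % property 2: vec(Y A^T) = (A ⊗ I) vec(Y)
B = randn(2,3); C = randn(n,4); D = randn(3,5);
e3 = norm(kron(B,A)*kron(D,C) - kron(B*D, A*C), 'fro');    % property 3
disp([e1 e2 e3])                                           % all of the order of rounding error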

4

slide-5
SLIDE 5

First steps with tensors

5

slide-6
SLIDE 6

Vectors, matrices, and tensors

[Figure: a vector, a matrix, and a third-order tensor]

◮ scalar = tensor of order 0
◮ (column) vector = tensor of order 1
◮ matrix = tensor of order 2
◮ tensor of order 3 = n1n2n3 numbers arranged in an n1 × n2 × n3 array

6

slide-7
SLIDE 7

Tensors of arbitrary order

A d-th order tensor X of size n1 × n2 × · · · × nd is a d-dimensional array with entries Xi1,i2,...,id, iµ ∈ {1, . . . , nµ} for µ = 1, . . . , d. In the following, the entries of X are usually real (for simplicity): X ∈ Rn1×n2×···×nd.
Multi-index notation: I = {1, . . . , n1} × {1, . . . , n2} × · · · × {1, . . . , nd}. Then i ∈ I is a tuple of d indices: i = (i1, i2, . . . , id). This allows us to write the entries of X as Xi for i ∈ I.

7

slide-8
SLIDE 8

Two important points

1. A matrix A ∈ Rm×n has a natural interpretation as a linear operator in terms of matrix-vector multiplications:
   A : Rn → Rm, A : x ↦ A · x.
   There is no such (unique and natural) interpretation for tensors! ⇒ fundamental difficulty to define a meaningful general notion of eigenvalues and singular values of tensors.

2. The number of entries in a tensor grows exponentially with d ⇒ curse of dimensionality. Example: A tensor of order 30 with n1 = n2 = · · · = nd = 10 has 10^30 entries = 8 × 10^12 exabytes of storage!¹ For d ≫ 1: cannot afford to store the tensor explicitly (in terms of its entries).

¹ Global data storage a few years ago was calculated at 295 exabytes, see http://www.bbc.co.uk/news/technology-12419672.

8

slide-9
SLIDE 9

Basic calculus

◮ Addition of two equal-sized tensors X, Y:
  Z = X + Y ⇔ Zi = Xi + Yi ∀i ∈ I.

◮ Scalar multiplication with α ∈ R:
  Z = αX ⇔ Zi = αXi ∀i ∈ I.
  ⇒ vector space structure.

◮ Inner product of two equal-sized tensors X, Y:
  ⟨X, Y⟩ := Σ_{i∈I} Xi Yi.
  Induced norm ‖X‖ := ( Σ_{i∈I} Xi² )^(1/2).
  For a 2nd-order tensor (= matrix) this corresponds to the usual Euclidean geometry and the Frobenius norm.

9

slide-10
SLIDE 10

Vectorization

A tensor X of size n1 × n2 × · · · × nd has n1 · n2 · · · nd entries ⇒ many ways to stack the entries in a (loooong) column vector. One possible choice: The vectorization of X is denoted by vec(X), where vec : Rn1×n2×···×nd → Rn1·n2···nd stacks the entries of a tensor in reverse lexicographical order into a long column vector. Example: d = 3, n1 = 3, n2 = 2, n3 = 3.

  vec(X) = ( x111, x211, x311, x121, . . . , x123, x223, x323 )T

10

slide-11
SLIDE 11

Matricization

◮ A matrix has two modes (column mode and row mode).
◮ A dth-order tensor X has d modes (µ = 1, µ = 2, . . . , µ = d).

Let us fix all but one mode, e.g., µ = 1: Then X(:, i2, i3, . . . , id) (abuse of MATLAB notation) is a vector of length n1 for each choice of i2, . . . , id. These vectors are called fibers. View the tensor X as a bunch of column vectors:

11

slide-12
SLIDE 12

Matricization

Stack the fibers into an n1 × (n2 · · · nd) matrix: X ∈ Rn1×n2×···×nd ↦ X(1) ∈ Rn1×(n2n3···nd).
For µ = 1, . . . , d, the µ-mode matricization of X is the matrix X(µ) ∈ Rnµ×(n1···nµ−1nµ+1···nd) with entries

  X(µ)_{iµ, (i1,...,iµ−1,iµ+1,...,id)} = Xi   ∀i ∈ I.

12

slide-13
SLIDE 13

Matricization

In MATLAB: a = rand(2,3,4,5);

◮ 1-mode matricization: reshape(a,2,3*4*5)
◮ 2-mode matricization: b = permute(a,[2 1 3 4]); reshape(b,3,2*4*5)
◮ 3-mode matricization: b = permute(a,[3 1 2 4]); reshape(b,4,2*3*5)
◮ 4-mode matricization: b = permute(a,[4 1 2 3]); reshape(b,5,2*3*4)

For a matrix A ∈ Rn1×n2: A(1) = A, A(2) = AT.
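The same permute/reshape pattern works for any order d and any mode µ; a small helper function (the name matricize is ours, not a MATLAB built-in):

function Xmu = matricize(X, mu)
% mu-mode matricization: bring mode mu to the front, then flatten the remaining
% modes (keeping their original order), as in the reshape/permute calls above.
d    = ndims(X);
perm = [mu, 1:mu-1, mu+1:d];
Xmu  = reshape(permute(X, perm), size(X, mu), []);
end

For example, matricize(a,3) reproduces reshape(permute(a,[3 1 2 4]),4,2*3*5).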

13

slide-14
SLIDE 14

µ-mode matrix products

Consider the 1-mode matricization X(1) ∈ Rn1×(n2···nd): it seems to make sense to multiply an m × n1 matrix A from the left:
  Y(1) := A X(1) ∈ Rm×(n2···nd).
Can rearrange Y(1) back into an m × n2 × · · · × nd tensor Y. This is called 1-mode matrix multiplication:
  Y = A ◦1 X ⇔ Y(1) = A X(1).
More formally (and more ugly):
  Yi1,i2,...,id = Σ_{k=1}^{n1} ai1,k Xk,i2,...,id.

14

slide-15
SLIDE 15

µ-mode matrix products

General definition of the µ-mode matrix product with A ∈ Rm×nµ:
  Y = A ◦µ X ⇔ Y(µ) = A X(µ).
More formally (and more ugly):
  Yi1,i2,...,id = Σ_{k=1}^{nµ} aiµ,k Xi1,...,iµ−1,k,iµ+1,...,id.

For matrices:
◮ 1-mode multiplication = multiplication from the left: Y = A ◦1 X = A X.
◮ 2-mode multiplication = transposed multiplication from the right: Y = A ◦2 X = X AT.
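In MATLAB, the µ-mode product can be realized by matricizing, multiplying, and folding back into a tensor; a sketch (the function name modeprod is ours), which for matrices reduces to the two cases above:

function Y = modeprod(X, A, mu)
% mu-mode matrix product Y = A o_mu X, computed via the mu-mode matricization.
sz   = size(X);
d    = ndims(X);
perm = [mu, 1:mu-1, mu+1:d];
Xmu  = reshape(permute(X, perm), sz(mu), []);     % matricize
sz(mu) = size(A, 1);
Y = ipermute(reshape(A * Xmu, sz(perm)), perm);   % multiply and fold back
end

For a matrix X, modeprod(X, A, 1) returns A*X and modeprod(X, B, 2) returns X*B.', matching the two bullets above.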

15

slide-16
SLIDE 16

µ-mode matrix products and vectorization

By definition, vec(X) = vec(X(1)). Consequently, also vec(A ◦1 X) = vec(A X(1)).
Vectorized version of the 1-mode matrix product:
  vec(A ◦1 X) = (In2···nd ⊗ A) vec(X) = (Ind ⊗ · · · ⊗ In2 ⊗ A) vec(X).
Relation between the µ-mode matrix product and a matrix-vector product:
  vec(A ◦µ X) = (Ind ⊗ · · · ⊗ Inµ+1 ⊗ A ⊗ Inµ−1 ⊗ · · · ⊗ In1) vec(X).

16

slide-17
SLIDE 17

Summary

◮ A tensor X ∈ Rn1×···×nd is a d-dimensional array.
◮ There are various ways of reshaping the entries of a tensor X into a vector or matrix.
◮ µ-mode matrix multiplication can be expressed with Kronecker products.

Further reading:
◮ T. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Rev. 51 (2009), no. 3, 455–500.

Software:
◮ MATLAB (and all programming languages) offer basic functionality to work with d-dimensional arrays.
◮ MATLAB Tensor Toolbox: http://www.tensortoolbox.org/

17

slide-18
SLIDE 18

Applications of tensors

18

slide-19
SLIDE 19

Two classes of tensor problems

Class 1: function-related tensors
Consider a function u(ξ1, . . . , ξd) ∈ R in d variables ξ1, . . . , ξd. A tensor U ∈ Rn1×···×nd represents a discretization of u:
◮ U contains function values of u evaluated on a grid; or
◮ U contains coefficients of a truncated expansion in tensorized basis functions:
  u(ξ1, . . . , ξd) ≈ Σ_{i∈I} Ui φi1(ξ1) φi2(ξ2) · · · φid(ξd).

Typical setting:
◮ U only given implicitly, e.g., as the solution of a discretized PDE;
◮ seek approximations to U with very low storage and tolerable accuracy;
◮ d may become very large.

19

slide-20
SLIDE 20

Discretization of a function in d variables ξ1, . . . , ξd ∈ [0, 1]: the number of function values grows exponentially with d.

20

slide-21
SLIDE 21

Separability helps

Ideal situation: the function f is separable, f(ξ1, ξ2, . . . , ξd) = f1(ξ1) f2(ξ2) · · · fd(ξd). The discretized f is then the Kronecker product of the discretized factors fj: O(n^d) memory reduces to O(dn) memory.
Of course: exact separability is rarely satisfied in practice.

21

slide-22
SLIDE 22

Two classes of tensor problems

Class 2: data-related tensors
Tensor U ∈ Rn1×···×nd contains multi-dimensional data.
Example 1: U2011,3,2 denotes the number of papers published in 2011 by author 3 in mathematics journal 2.
Example 2: A video of 1000 frames with resolution 640 × 480 can be viewed as a 640 × 480 × 1000 tensor.
Example 3: Hyperspectral images.
Example 4: Deep learning: coefficients in each layer of a deep NN are stored as tensors (TensorFlow); interpretation of RNNs as hierarchical tensor decompositions.

Typical setting (except for Example 4):
◮ entries of U often given explicitly (at least partially);
◮ extraction of dominant features from U;
◮ usually moderate values for d.

22

slide-23
SLIDE 23

High-dimensional elliptic PDEs: 3D model problem

◮ Consider
  −∆u = f in Ω, u|∂Ω = 0,
  on the unit cube Ω = [0, 1]^3.
◮ Discretize on a tensor grid. Uniform grid for simplicity:
  ξµ^(j) = jh, h = 1/(n + 1), for µ = 1, 2, 3.
◮ Approximate the solution tensor U ∈ Rn×n×n:
  Ui1,i2,i3 ≈ u( ξ1^(i1), ξ2^(i2), ξ3^(i3) ).

23

slide-24
SLIDE 24

High-dimensional elliptic PDEs: 3D model problem

◮ Discretization of the 1D Laplace operator:
  −∂xx ≈ tridiag(−1, 2, −1) =: A,
  the tridiagonal matrix with 2 on the diagonal and −1 on the sub- and superdiagonals.
◮ Application in each coordinate direction:
  −∂ξ1ξ1 u(ξ1, ξ2, ξ3) ≈ A ◦1 U, −∂ξ2ξ2 u(ξ1, ξ2, ξ3) ≈ A ◦2 U, −∂ξ3ξ3 u(ξ1, ξ2, ξ3) ≈ A ◦3 U.
◮ Hence,
  −∆u ≈ A ◦1 U + A ◦2 U + A ◦3 U,
  or in vectorized form with u = vec(U):
  −∆u ≈ (I ⊗ I ⊗ A + I ⊗ A ⊗ I + A ⊗ I ⊗ I) u.
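A MATLAB sketch assembling this Kronecker sum and checking it against the µ-mode formulation (it reuses the modeprod sketch from above; the 1/h² scaling, left implicit on the slide, is included here):

n = 10; h = 1/(n+1);
e = ones(n,1);
A = spdiags([-e 2*e -e], -1:1, n, n) / h^2;     % 1D second-difference matrix
I = speye(n);
L3 = kron(I, kron(I, A)) + kron(I, kron(A, I)) + kron(A, kron(I, I));
U = randn(n, n, n);
Y = modeprod(U, A, 1) + modeprod(U, A, 2) + modeprod(U, A, 3);
norm(L3 * U(:) - Y(:)) / norm(Y(:))             % ~1e-16: both formulations agree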

24

slide-25
SLIDE 25

High-dimensional elliptic PDEs: 3D model problem

Finite difference discretization of the model problem −∆u = f in Ω, u|∂Ω = 0 for Ω = [0, 1]^3 takes the form
  (I ⊗ I ⊗ A + I ⊗ A ⊗ I + A ⊗ I ⊗ I) u = f.
Similar structure for a finite element discretization with tensorized FEs:
  V ⊗ W ⊗ Z = { Σ_{i,j,k} αijk vi(ξ1) wj(ξ2) zk(ξ3) : αijk ∈ R }
with
  V = {v1(ξ1), . . . , vn(ξ1)}, W = {w1(ξ2), . . . , wn(ξ2)}, Z = {z1(ξ3), . . . , zn(ξ3)}.
Galerkin discretization:
  (KV ⊗ MW ⊗ MZ + MV ⊗ KW ⊗ MZ + MV ⊗ MW ⊗ KZ) u = f,
with 1D mass/stiffness matrices MV, MW, MZ, KV, KW, KZ.

25

slide-26
SLIDE 26

High-dimensional elliptic PDEs: Arbitrary dimensions

Finite difference discretization of the model problem −∆u = f in Ω, u|∂Ω = 0 for Ω = [0, 1]^d takes the form
  ( Σ_{j=1}^{d} I ⊗ · · · ⊗ I ⊗ A ⊗ I ⊗ · · · ⊗ I ) u = f.

To obtain such Kronecker structure in general:
◮ tensorized domain;
◮ highly structured grid;
◮ coefficients that can be written/approximated as a sum of separable functions.

26

slide-27
SLIDE 27

High-dimensional PDE-eigenvalue problems

PDE eigenvalue problem
  ∆u(ξ) + V(ξ) u(ξ) = λ u(ξ) in Ω = [0, 1]^d, u(ξ) = 0 on ∂Ω.
Assumption: the potential is represented as
  V(ξ) = Σ_{j=1}^{s} Vj^(1)(ξ1) Vj^(2)(ξ2) · · · Vj^(d)(ξd).
⇒ finite difference discretization Au = (AL + AV) u = λ u, with
  AL = Σ_{j=1}^{d} I ⊗ · · · ⊗ I (d−j times) ⊗ AL ⊗ I ⊗ · · · ⊗ I (j−1 times),
  AV = Σ_{j=1}^{s} AV,j^(d) ⊗ · · · ⊗ AV,j^(2) ⊗ AV,j^(1).

27

slide-28
SLIDE 28

Example: Henon-Heiles potential

Consider Ω = [−10, 2]^d and the potential ([Meyer et al. 1990; Raab et al. 2000; Faou et al. 2009])
  V(ξ) = (1/2) Σ_{j=1}^{d} σj ξj² + Σ_{j=1}^{d−1} ( σ∗ ( ξj ξ_{j+1}² − (1/3) ξj³ ) + (σ∗²/16) ( ξj² + ξ_{j+1}² )² ),
with σj ≡ 1, σ∗ = 0.2. Discretization with n = 128 dof/dimension for d = 20 dimensions.

◮ The eigenvector has n^d ≈ 10^42 entries.
◮ Explicit storage of the eigenvector would require 10^25 exabytes!

Solved with accuracy 10^−12 in less than 1 hour on a laptop.

28

slide-29
SLIDE 29

Quantum many-body problems

◮ spin-1/2 particles: proton, neutron, electron, and quark.
◮ two states: spin-up, spin-down
◮ quantum state of each spin represented by a vector in C^2 (spinor)
◮ quantum state of a system of d spins represented by a vector in C^(2^d)
◮ quantum mechanical operators expressed in terms of Pauli matrices
  Px = [ 0 1 ; 1 0 ], Py = [ 0 −i ; i 0 ], Pz = [ 1 0 ; 0 −1 ].
◮ spin Hamiltonian: sum of Kronecker products of Pauli matrices and identities; each term describes a physical (inter)action of spins
◮ interaction of spins described by a graph
◮ Goal: Compute the ground state of the spin Hamiltonian.

29

slide-30
SLIDE 30

Quantum many-body problems

Example: 1d chain of 5 spins with periodic boundary conditions

[Figure: cycle graph connecting the spins 1–2–3–4–5–1]

Hamiltonian describing pairwise interaction between nearest neighbors:
  H = Pz ⊗ Pz ⊗ I ⊗ I ⊗ I
    + I ⊗ Pz ⊗ Pz ⊗ I ⊗ I
    + I ⊗ I ⊗ Pz ⊗ Pz ⊗ I
    + I ⊗ I ⊗ I ⊗ Pz ⊗ Pz
    + Pz ⊗ I ⊗ I ⊗ I ⊗ Pz
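A MATLAB sketch assembling this Hamiltonian; the loop places Pz on each pair of neighbouring sites, with wrap-around for the periodic boundary condition:

Pz = [1 0; 0 -1];                 % Pauli-Z
I2 = eye(2);
d  = 5;
H  = zeros(2^d);
for k = 1:d
    ops = repmat({I2}, 1, d);
    ops{k} = Pz;
    ops{mod(k, d) + 1} = Pz;      % neighbour of site k (site d couples back to site 1)
    term = 1;
    for j = 1:d
        term = kron(term, ops{j});
    end
    H = H + term;                 % adds the k-th Kronecker-product term above
end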

30

slide-31
SLIDE 31

Quantum many-body problems

◮ Ising (ZZ) model for a 1d chain of d spins with open boundary conditions:
  H = Σ_{k=1}^{d−1} I ⊗ · · · ⊗ I ⊗ Pz ⊗ Pz ⊗ I ⊗ · · · ⊗ I + λ Σ_{k=1}^{d} I ⊗ · · · ⊗ I ⊗ Px ⊗ I ⊗ · · · ⊗ I,
  where λ = ratio between the strength of the magnetic field and the pairwise interactions.
◮ 1d Heisenberg (XY) model
◮ Current research: 2d models.
◮ More details in:
  Huckle/Waldherr/Schulte-Herbrüggen: Computations in Quantum Tensor Networks.
  Schollwöck: The density-matrix renormalization group in the age of matrix product states.

31

slide-32
SLIDE 32

Stochastic Automata Networks (SANs)

◮ 3 stochastic automata A1, A2, A3 having 3 states each.
◮ Vector x_t^(i) ∈ R3 describes the probabilities of the states (1), (2), (3) in Ai at time t.
◮ No coupling between automata ⇒ local transition x_t^(i) → x_{t+1}^(i) described by a Markov chain:
  x_{t+1}^(i) = Ei x_t^(i),
  with a stochastic matrix Ei.
◮ Stationary distribution of Ai = Perron vector of Ei (eigenvector for eigenvalue 1).

32

slide-33
SLIDE 33

Stochastic Automata Networks (SANs)

◮ 3 stochastic automata A1, A2, A3 having 3 states each.
◮ Coupling between automata ⇒ local transition x_t^(i) → x_{t+1}^(i) not described by a Markov chain.
◮ Need to consider all possible combinations of states in (A1, A2, A3): (1, 1, 1), (1, 1, 2), (1, 1, 3), (1, 2, 1), (1, 2, 2), . . .
◮ Vector xt ∈ R^27 (or tensor X(t) ∈ R3×3×3) describes the probabilities of the combined states.

33

slide-34
SLIDE 34

Stochastic Automata Networks (SANs)

◮ Transition xt → xt+1 described by a Markov chain:
  xt+1 = E xt, with a large stochastic matrix E.
◮ Oversimplified example:
  E = I ⊗ I ⊗ E1 + I ⊗ E2 ⊗ I + E3 ⊗ I ⊗ I   (local transitions)
    + I ⊗ E21 ⊗ E12                           (interaction between A1, A2)
    + E32 ⊗ E23 ⊗ I                           (interaction between A2, A3)
◮ Goal: Compute the stationary distribution = Perron vector of E.
◮ More details in:
  Stewart: Introduction to the Numerical Solution of Markov Chains. Chapter 9.
  Buchholz: Product Form Approximations for Communicating Markov Processes.

34

slide-35
SLIDE 35

Low-rank tensor techniques

◮ Emerged during the last five years in numerical analysis.
◮ Successfully applied to:
  ◮ parameter-dependent / multi-dimensional integrals;
  ◮ electronic structure calculations: Hartree-Fock / DFT;
  ◮ stochastic and parametric PDEs;
  ◮ high-dimensional Boltzmann / chemical master / Fokker-Planck / Schrödinger equations;
  ◮ micromagnetism;
  ◮ rational approximation problems;
  ◮ computational homogenization;
  ◮ computational finance;
  ◮ multivariate regression and machine learning;
  ◮ . . .

35

slide-36
SLIDE 36

References on applications

◮ M. Bachmayr, R. Schneider, and A. Uschmajew. Tensor networks and hierarchical tensors for the solution of high-dimensional partial differential equations. Found. Comput. Math., 16(6):1423–1472, 2016.
◮ A. Cichocki. Era of big data processing: A new approach via tensor networks and tensor decompositions. arXiv:1403.2048, 2014.
◮ L. Grasedyck, D. Kressner, and C. Tobler. A literature survey of low-rank tensor approximation techniques. GAMM-Mitt., 36(1):53–78, 2013.
◮ W. Hackbusch. Tensor Spaces and Numerical Tensor Calculus. Springer, Heidelberg, 2012.
◮ V. Khrulkov, A. Novikov, and I. Oseledets. Expressive power of recurrent neural networks. ICLR: Sixth International Conference on Learning Representations, 2018.
◮ A. Nouy. Low-rank methods for high-dimensional approximation and model order reduction. In Model Reduction and Approximation, volume 15 of Comput. Sci. Eng., pages 171–226. SIAM, Philadelphia, PA, 2017.
◮ N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos. Tensor decomposition for signal processing and machine learning. IEEE Trans. Signal Process., 65(13):3551–3582, 2017.

36

slide-37
SLIDE 37

The CP decomposition

37

slide-38
SLIDE 38

CP decomposition

◮ Aim: Generalize the concept of low rank from matrices to tensors.
◮ One possibility, motivated by the rank-R matrix decomposition
  X = [a1, a2, . . . , aR] [b1, b2, . . . , bR]^T = a1 b1^T + a2 b2^T + · · · + aR bR^T
  ⇒ vectorization: vec(X) = b1 ⊗ a1 + b2 ⊗ a2 + · · · + bR ⊗ aR.

Canonical Polyadic (CP) decomposition of a tensor X ∈ Rn1×n2×n3 defined via
  vec(X) = c1 ⊗ b1 ⊗ a1 + c2 ⊗ b2 ⊗ a2 + · · · + cR ⊗ bR ⊗ aR
  ⇔ X = a1 ◦ b1 ◦ c1 + a2 ◦ b2 ◦ c2 + · · · + aR ◦ bR ◦ cR
for vectors aj ∈ Rn1, bj ∈ Rn2, cj ∈ Rn3. CP directly corresponds to a semi-separable approximation.
Tensor rank of X = minimal possible R.

38

slide-39
SLIDE 39

CP decomposition

Illustration of the CP decomposition X = a1 ◦ b1 ◦ c1 + a2 ◦ b2 ◦ c2 + · · · + aR ◦ bR ◦ cR.

[Figure: X as a sum of R rank-one terms ar ◦ br ◦ cr]

More compact notation: X = ⟦A, B, C⟧, with
  A = [a1, . . . , aR] ∈ Rn1×R, B = [b1, . . . , bR] ∈ Rn2×R, C = [c1, . . . , cR] ∈ Rn3×R.

39

slide-40
SLIDE 40

CP decomposition and matricizations

Elementwise expression of the CP decomposition:
  Xijk = Σ_{r=1}^{R} air bjr ckr, for i = 1, . . . , n1, j = 1, . . . , n2, k = 1, . . . , n3.
Shows that every matricization has rank at most R.
Question: What is a necessary and sufficient condition on the matricizations for a tensor X to have tensor rank 1?
For the 1-mode matricization: X(1) = A Y, with the rth row of Y given by the Kronecker product of the rth columns of B and C.

Definition. Let V = [v1, . . . , vk] be m × k and W = [w1, . . . , wk] be n × k. Then the Khatri-Rao product of V and W is the mn × k matrix given by
  V ⊙ W = [ v1 ⊗ w1, · · · , vk ⊗ wk ].

40
slide-41
SLIDE 41

CP decomposition and matricizations

We have X(1) = A(C ⊙ B)T, X(2) = B(C ⊙ A)T, X(3) = C(B ⊙ A)T.
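These identities are easy to check numerically; a MATLAB sketch in which kr is a small stand-in for the Khatri-Rao product (the Tensor Toolbox provides khatrirao for this):

n1 = 4; n2 = 3; n3 = 5; R = 2;
A = randn(n1,R); B = randn(n2,R); C = randn(n3,R);
kr = @(V,W) cell2mat(arrayfun(@(r) kron(V(:,r), W(:,r)), 1:size(V,2), 'UniformOutput', false));
X = zeros(n1,n2,n3);
for r = 1:R
    X = X + reshape(kron(C(:,r), kron(B(:,r), A(:,r))), n1, n2, n3);   % a_r o b_r o c_r
end
norm(reshape(X, n1, [])                  - A * kr(C,B).', 'fro')       % X(1) = A (C kr B)^T
norm(reshape(permute(X,[2 1 3]), n2, []) - B * kr(C,A).', 'fro')       % X(2) = B (C kr A)^T
norm(reshape(permute(X,[3 1 2]), n3, []) - C * kr(B,A).', 'fro')       % X(3) = C (B kr A)^T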

Corollary. Let Rµ be the rank of X(µ). Then
  max{R1, R2, R3} ≤ rank(X) ≤ min{R2R3, R1R3, R1R2}.
Proof. Lower bound obvious, upper bound EFY.

◮ Except for very special situations (e.g., n3 = 2) there are no tighter upper bounds known.
◮ A real 2 × 2 × 2 tensor has rank at most 3 [Kruskal'1989], has ranks 2 or 3 with positive probability, and ranks 0 or 1 with zero probability.
◮ A real 3 × 3 × 3 tensor has rank at most 5 [Kruskal'1989] and has rank 4 with probability 1 [ten Berge et al.'2004].
◮ Computing the tensor rank is an NP-hard problem.

See [Kolda/Bader'2009] for more.

41

slide-42
SLIDE 42

CP decomposition and uniqueness

The relation X(1) = A(C ⊙ B)T can be written as
  X(1) = [ A · diag(C(1, :)) · BT, · · · , A · diag(C(n3, :)) · BT ]
⇒ simultaneous diagonalization of all frontal slices of X:
  X(:, :, k) = A · diag(C(k, :)) · BT, k = 1, . . . , n3.
For n3 = 1, significant degrees of freedom. For n3 > 1, the CP decomposition is often unique (up to permutations and scalings).

Definition. The k-rank of a matrix A is the maximum value k such that any k columns are linearly independent.

Theorem (Kruskal'1977). The CP decomposition is unique (up to permutations and scalings) if
  k-rank(A) + k-rank(B) + k-rank(C) ≥ 2R + 2.

42

slide-43
SLIDE 43

Properties of tensor vs. matrix rank

Given a real matrix A, its (matrix) rank is the same over R and C. This is not true, in general, for tensor rank.
Consider the 2 × 2 × 2 tensor X given by the 1-mode matricization
  X(1) = [ 1 0 0 1 ; 0 1 −1 0 ].
◮ Considered as a complex tensor X ∈ C2×2×2, this tensor has rank 2, because one can write X = ⟦A, B, C⟧ with
  A = (1/√2) [ 1 1 ; −i i ], B = (1/√2) [ 1 1 ; i −i ], C = [ 1 1 ; −i i ],
  and it is easy to see that X does not have rank 1.

43

slide-44
SLIDE 44

Properties of tensor vs. matrix rank

Given a real matrix A, its (matrix) rank is the same over R and C. This is not true, in general, for tensor rank.
Consider the 2 × 2 × 2 tensor X given by the 1-mode matricization
  X(1) = [ 1 0 0 1 ; 0 1 −1 0 ].
◮ Considered as a real tensor X ∈ R2×2×2, this tensor has rank 3, because one can write X = ⟦A, B, C⟧ with
  A = [ 1 0 1 ; 0 1 −1 ], B = [ 1 0 1 ; 0 1 1 ], C = [ 1 1 0 ; −1 1 1 ],
  and the tensor can be shown not to have rank 2 or smaller, using techniques by ten Berge [1991].

44

slide-45
SLIDE 45

Properties of tensor vs. matrix rank

The matrix rank is a lower semi-continuous function. Meaning: a converging sequence of matrices of rank r always converges to a matrix of rank at most r.
Proof: Equivalently, one has to show that for any m × n matrix A of rank r there is an open neighbourhood U such that all matrices B ∈ U have rank r or larger. Use the fact that a matrix has rank at least r if and only if at least one of its r × r submatrices is invertible. For such a submatrix of A there is a neighbourhood of A in which this submatrix stays invertible, so all matrices in that neighbourhood have rank at least r. Take U to be this neighbourhood.

45

slide-46
SLIDE 46

Properties of tensor vs. matrix rank

The matrix rank is a lower semi-continuous function. This is not true, in general, for tensor rank.
Consider the 2 × 2 × 2 tensor
  X = e1 ◦ e1 ◦ e2 + e1 ◦ e2 ◦ e1 + e2 ◦ e1 ◦ e1
with unit vectors e1, e2. Its rank can be shown to be 3.
Consider the parametrized rank-2 tensor
  Xα = α ( e1 + (1/α) e2 ) ◦ ( e1 + (1/α) e2 ) ◦ ( e1 + (1/α) e2 ) − α e1 ◦ e1 ◦ e1.
This tensor has rank 2 but, as Xα → X for α → ∞, it is arbitrarily close to a tensor of rank 3.
A tensor X is called degenerate if it is arbitrarily close to a tensor of lower tensor rank. See [Silva/Lim'2008] for more details on degenerate tensors.

46

slide-47
SLIDE 47

Properties of tensor vs. matrix rank

The matrix rank is a lower semi-continuous function. This is not true, in general, for tensor rank. For tensors of order d ≥ 3:

◮ tensor rank R is not upper

semi-continuous lack of closedness

◮ successive rank-1 approximations fail Picture taken from [Kolda/Bader’2009].

To avoid degeneracies, can use smoothed version of tensor rank. The border rank of a tensor X is defined as the smallest integer r such that for any ε > 0 there exists E with E = ε and X + E has tensor rank r.

47

slide-48
SLIDE 48

Block matrix multiplication and tensor rank

Consider the 2 × 2 block matrix product
  [ C11 C12 ; C21 C22 ] = [ A11 A12 ; A21 A22 ] [ B11 B12 ; B21 B22 ].
Standard block matrix multiplication:
  C11 = A11 B11 + A12 B21, C12 = A11 B12 + A12 B22,
  C21 = A21 B11 + A22 B21, C22 = A21 B12 + A22 B22.
Requires 8 block matrix multiplications! Can be reduced via tensors + CP decomposition.

48

slide-49
SLIDE 49

Block matrix multiplication and tensor rank

More general view of matrix multiplication algorithms:

◮ Map multiindex into single index:

    (1, 1) (2, 1) (1, 2) (2, 2)     →     1 2 3 4    

◮ Translate each operation into a rank-one tensor.

Compute A11B11 and add to (1, 1) entry of C becomes: e(1,1) ◦ e(1,1) ◦ e(1,1) =     1     ◦     1     ◦     1     Compute A21B12 and add to (2, 2) entry of C becomes: e(2,1) ◦ e(1,2) ◦ e(2,2) =     1     ◦     1     ◦     1    

49

slide-50
SLIDE 50

Block matrix multiplication and tensor rank

2 × 2 matrix multiplication tensor:
  T2 = e(1,1) ◦ e(1,1) ◦ e(1,1) + e(1,2) ◦ e(2,1) ◦ e(1,1) + e(2,1) ◦ e(1,1) ◦ e(2,1) + e(2,2) ◦ e(2,1) ◦ e(2,1) + · · ·
Strassen discovered in 1969 that this tensor has rank less than 8 by providing an explicit CP decomposition with 7 terms:
  T2 = (e(1,1) + e(2,2)) ◦ (e(1,1) + e(2,2)) ◦ (e(1,1) + e(2,2))
     + (e(2,1) + e(2,2)) ◦ e(1,1) ◦ (e(2,1) − e(2,2))
     + e(1,1) ◦ (e(1,2) − e(2,2)) ◦ (e(1,2) + e(2,2))
     + e(2,2) ◦ (e(2,1) − e(1,1)) ◦ (e(1,1) + e(2,1))
     + (e(1,1) + e(1,2)) ◦ e(2,2) ◦ (−e(1,1) + e(1,2))
     + (e(2,1) − e(1,1)) ◦ (e(1,1) + e(1,2)) ◦ e(2,2)
     + (e(1,2) − e(2,2)) ◦ (e(2,1) + e(2,2)) ◦ e(1,1)

50

slide-51
SLIDE 51

Block matrix multiplication and tensor rank

First compute seven matrix products:
  M1 := (A11 + A22)(B11 + B22)
  M2 := (A21 + A22) B11
  M3 := A11 (B12 − B22)
  M4 := A22 (B21 − B11)
  M5 := (A11 + A12) B22
  M6 := (A21 − A11)(B11 + B12)
  M7 := (A12 − A22)(B21 + B22)
Then compute the entries of C:
  C11 = M1 + M4 − M5 + M7
  C12 = M3 + M5
  C21 = M2 + M4
  C22 = M1 − M2 + M3 + M6
Reduces the complexity from O(n^3) to O(n^(log2 7)) = O(n^2.807...).
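A MATLAB sketch of one level of this recursion on random 2 × 2 block matrices, verified against the ordinary product:

n = 4;                                   % block size (any size works)
A = randn(2*n); B = randn(2*n);
A11 = A(1:n,1:n); A12 = A(1:n,n+1:end); A21 = A(n+1:end,1:n); A22 = A(n+1:end,n+1:end);
B11 = B(1:n,1:n); B12 = B(1:n,n+1:end); B21 = B(n+1:end,1:n); B22 = B(n+1:end,n+1:end);
M1 = (A11 + A22) * (B11 + B22);
M2 = (A21 + A22) * B11;
M3 = A11 * (B12 - B22);
M4 = A22 * (B21 - B11);
M5 = (A11 + A12) * B22;
M6 = (A21 - A11) * (B11 + B12);
M7 = (A12 - A22) * (B21 + B22);
C = [M1 + M4 - M5 + M7, M3 + M5; M2 + M4, M1 - M2 + M3 + M6];
norm(C - A*B, 'fro')                     % ~1e-14: 7 block products instead of 8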

51

slide-52
SLIDE 52

Block matrix multiplication and tensor rank

◮ T2 has tensor and border rank 7 [Strassen, Winograd, ...].
◮ T3 has tensor rank between 19 and 23; border rank between 15 and 21.
◮ [Bläser'1999]: Tn has tensor rank at least (5/2) n² − 3n.
◮ State-of-the-art constructions of fast matrix multiplication are all tensor based.
◮ Record based on the so-called laser method for tensors: O(n^2.3728639) [Le Gall'2014]. See also https://simons.berkeley.edu/sites/default/files/docs/2438/slideslegall.pdf.

52

slide-53
SLIDE 53

Alternating least squares (ALS)

Fitting the CP decomposition to a tensor X requires the solution of the optimization problem
  min { ‖X − ⟦A, B, C⟧‖² : A ∈ Rn1×R, B ∈ Rn2×R, C ∈ Rn3×R }.
Difficulties:
◮ Scaling (and permutation) indeterminacy.
◮ Target function is convex in each factor but not jointly convex ⇒ potentially many local minima.
Idea: Normalize all but one factor and exploit the multilinearity of the format.

53

slide-54
SLIDE 54

Alternating least squares (ALS)

Let B, C be fixed and with unit-norm columns. Optimize for A only:
  min { ‖X − ⟦A, B, C⟧‖² : A ∈ Rn1×R }.
This is a standard linear least-squares problem (in an unusual form). Recall that the 1-mode matricization of ⟦A, B, C⟧ is given by A(C ⊙ B)T:
  min { ‖X(1) − A(C ⊙ B)T‖²F : A ∈ Rn1×R }.
Solution via the normal equations:
  A = X(1)(C ⊙ B)((C ⊙ B)T(C ⊙ B))−1 = X(1)(C ⊙ B)(CTC ∗ BTB)−1,
where ∗ denotes the elementwise (Hadamard) product.
Remarks:
◮ This assumes that C ⊙ B has full column rank. If not, replace (CTC ∗ BTB)−1 by the pseudo-inverse.

54

slide-55
SLIDE 55

Alternating least squares (ALS)

Input: X ∈ Rn1×n2×n3, starting factors B ∈ Rn2×R, C ∈ Rn3×R.
Output: CP approximation ⟦A, B, C⟧.

1: Normalize the columns of B.
2: while not converged do
3:   Normalize the columns of C.
4:   Set A ← X(1)(C ⊙ B)(CTC ∗ BTB)−1.
5:   Normalize the columns of A.
6:   Set B ← X(2)(C ⊙ A)(CTC ∗ ATA)−1.
7:   Normalize the columns of B.
8:   Set C ← X(3)(B ⊙ A)(BTB ∗ ATA)−1.
9: end while

Question: How does the algorithm reduce for the special case R = 1, d = 2?
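A compact MATLAB sketch of this ALS loop (with a fixed number of sweeps instead of a convergence test, and kr again standing in for the Khatri-Rao product):

X = randn(10, 9, 8); R = 3;                    % example data tensor and target rank
kr = @(V,W) cell2mat(arrayfun(@(r) kron(V(:,r), W(:,r)), 1:size(V,2), 'UniformOutput', false));
[n1,n2,n3] = size(X);
X1 = reshape(X, n1, []);                       % the three matricizations
X2 = reshape(permute(X,[2 1 3]), n2, []);
X3 = reshape(permute(X,[3 1 2]), n3, []);
B = randn(n2,R); C = randn(n3,R);
B = B ./ vecnorm(B);                           % unit-norm columns
for sweep = 1:50
    C = C ./ vecnorm(C);
    A = X1 * kr(C,B) / ((C'*C) .* (B'*B));     % normal equations from the previous slide
    A = A ./ vecnorm(A);
    B = X2 * kr(C,A) / ((C'*C) .* (A'*A));
    B = B ./ vecnorm(B);
    C = X3 * kr(B,A) / ((B'*B) .* (A'*A));
end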

55

slide-56
SLIDE 56

The Tucker decomposition

56

slide-57
SLIDE 57

Tucker decomposition

◮ Alternative rank concept for tensors, motivated by
  A = U · Σ · V^T, U ∈ Rn1×r, V ∈ Rn2×r, Σ ∈ Rr×r
  ⇒ vectorization: vec(A) = (V ⊗ U) · vec(Σ).
Ignore the diagonal structure of Σ and call it C. The Tucker decomposition of a tensor X ∈ Rn1×n2×n3 is defined via
  vec(X) = (W ⊗ V ⊗ U) · vec(C)
with U ∈ Rn1×r1, V ∈ Rn2×r2, W ∈ Rn3×r3, and core tensor C ∈ Rr1×r2×r3.
In terms of µ-mode matrix products: X = U ◦1 V ◦2 W ◦3 C =: (U, V, W) ◦ C.

57

slide-58
SLIDE 58

Tucker decomposition

Illustration of the Tucker decomposition X = (U, V, W) ◦ C.

[Figure: core tensor C multiplied by U, V, W along the three modes]

58

slide-59
SLIDE 59

Tucker decomposition

Consider all three matricizations:
  X(1) = U · C(1) · (W ⊗ V)^T,
  X(2) = V · C(2) · (W ⊗ U)^T,
  X(3) = W · C(3) · (V ⊗ U)^T.
These are low-rank decompositions:
  rank(X(1)) ≤ r1, rank(X(2)) ≤ r2, rank(X(3)) ≤ r3.
The multilinear rank of a tensor X ∈ Rn1×n2×n3 is defined by the tuple (r1, r2, r3), with ri = rank(X(i)).

59

slide-60
SLIDE 60

Higher-order SVD (HOSVD)

Goal: Approximate a given tensor X by a Tucker decomposition with prescribed multilinear rank (r1, r2, r3).

1. Calculate the SVDs of the matricizations: X(µ) = Uµ Σµ VµT for µ = 1, 2, 3.
2. Truncate the basis matrices: Uµ := Uµ(:, 1 : rµ) for µ = 1, 2, 3.
3. Form the core tensor: C := U1T ◦1 U2T ◦2 U3T ◦3 X.

Truncated tensor produced by the HOSVD [De Lathauwer/De Moor/Vandewalle'2000]:
  X̂ := U1 ◦1 U2 ◦2 U3 ◦3 C.
Remark: Orthogonal projection X̂ = (π1 ◦ π2 ◦ π3) X with πµ X := Uµ UµT ◦µ X.
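A MATLAB sketch of the three steps on a random tensor, with multilinear rank (5, 4, 3) prescribed (modeprod is the µ-mode product sketch from earlier):

X = randn(20, 15, 10); r = [5 4 3];
U = cell(1, 3);
for mu = 1:3
    Xmu = reshape(permute(X, [mu, setdiff(1:3, mu)]), size(X, mu), []);   % step 1: matricize and SVD
    [Ufull, ~, ~] = svd(Xmu, 'econ');
    U{mu} = Ufull(:, 1:r(mu));                                            % step 2: truncate
end
C    = modeprod(modeprod(modeprod(X, U{1}', 1), U{2}', 2), U{3}', 3);     % step 3: core tensor
Xhat = modeprod(modeprod(modeprod(C, U{1}, 1), U{2}, 2), U{3}, 3);        % truncated tensor
norm(X(:) - Xhat(:)) / norm(X(:))                                         % quasi-optimal relative error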

60

slide-61
SLIDE 61

Higher-order SVD (HOSVD)

Theorem. The tensor X̂ resulting from the HOSVD satisfies the quasi-optimality condition
  ‖X − X̂‖ ≤ √d ‖X − Xbest‖,
where Xbest is the best approximation of X with multilinear ranks (r1, . . . , rd).
Proof:
  ‖X − X̂‖² = ‖X − (π1 ◦ π2 ◦ π3)X‖²
            = ‖X − π1X‖² + ‖π1X − (π1 ◦ π2)X‖² + ‖(π1 ◦ π2)X − (π1 ◦ π2 ◦ π3)X‖²
            ≤ ‖X − π1X‖² + ‖X − π2X‖² + ‖X − π3X‖².
Using ‖X − πµX‖ ≤ ‖X − Xbest‖ for µ = 1, 2, 3 leads to ‖X − X̂‖² ≤ 3 · ‖X − Xbest‖².

61

slide-62
SLIDE 62

Approximation error obtained from HOSVD

Another direct consequence of the proof:

Corollary. Let σk(µ) denote the kth singular value of X(µ). Then the approximation X̂ obtained from the HOSVD satisfies
  ‖X − X̂‖² ≤ Σ_{µ=1}^{3} Σ_{k>rµ} ( σk(µ) )².
This also implies a lower bound for ‖X − Xbest‖ in terms of the singular values of the matricizations of X.

◮ The SVD can be replaced by any low-rank approximation technique discussed in this course. By the triangle inequality, the bound of the Corollary still holds with an extra term accounting for the inexact SVD.

62

slide-63
SLIDE 63

HOOI

Aim at finding the best approximation:
  ‖X − U1 ◦1 U2 ◦2 U3 ◦3 C‖ = min!
Given ONB U1, U2, U3, the core tensor C := U1T ◦1 U2T ◦2 U3T ◦3 X is the optimal choice. With this choice of core tensor, we have
  ‖X − U1 ◦1 U2 ◦2 U3 ◦3 C‖² = ‖X‖² − ‖U1T ◦1 U2T ◦2 U3T ◦3 X‖².
In turn, the minimization problem is equivalent to
  ‖U1T ◦1 U2T ◦2 U3T ◦3 X‖ = max!   s.t. U1, U2, U3 ONB.

63

slide-64
SLIDE 64

HOOI

ALS for this maximization problem: Let U2, U3 be fixed and maximize with respect to U1:
  max_{U1T U1 = I} ‖U1T ◦1 U2T ◦2 U3T ◦3 X‖ = max_{U1T U1 = I} ‖U1T X(1)(U3 ⊗ U2)‖F.
The solution is obtained by setting U1 to the r1 dominant left singular vectors of X(1)(U3 ⊗ U2).

Higher-order orthogonal iteration (HOOI):

1: while not converged do
2:   Set U1 to the r1 dominant left singular vectors of X(1)(U3 ⊗ U2).
3:   Set U2 to the r2 dominant left singular vectors of X(2)(U3 ⊗ U1).
4:   Set U3 to the r3 dominant left singular vectors of X(3)(U2 ⊗ U1).
5: end while

◮ Convergence is often observed to be good, but there are examples where it does not even converge to a stationary point.
◮ HOOI is best initialized with the factors from the (ST)HOSVD.
◮ Again, the SVD can be replaced by other methods, but little to no analysis exists that would guide such a choice.
◮ See [Kolda/Bader'09] for more details.
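A minimal MATLAB sketch of a HOOI sweep (random orthonormal starting factors instead of the recommended HOSVD initialization, and a fixed number of sweeps):

X = randn(20, 15, 10); r = [5 4 3];
X1 = reshape(X, size(X,1), []);
X2 = reshape(permute(X,[2 1 3]), size(X,2), []);
X3 = reshape(permute(X,[3 1 2]), size(X,3), []);
U2 = orth(randn(size(X,2), r(2)));
U3 = orth(randn(size(X,3), r(3)));
for sweep = 1:20
    [U1,~,~] = svds(X1 * kron(U3, U2), r(1));   % r1 dominant left singular vectors
    [U2,~,~] = svds(X2 * kron(U3, U1), r(2));
    [U3,~,~] = svds(X3 * kron(U2, U1), r(3));
end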

64

slide-65
SLIDE 65

Tucker decomposition – Summary

For general tensors:
◮ The multilinear rank r is lower semi-continuous ⇒ closedness property.
◮ HOSVD: a simple and robust algorithm to obtain a quasi-optimal low-rank approximation.
◮ Quasi-optimality is good enough for most applications in scientific computing.
◮ Robust black-box algorithms/software available (e.g., Tensor Toolbox).

Drawback:
Storage of the core tensor ∼ r^d ⇒ curse of dimensionality

65