SLIDE 1

STRUCTURED LOW-RANK MATRIX FACTORIZATION: GLOBAL OPTIMALITY, ALGORITHMS, AND APPLICATIONS

ARTICLE BY BENJAMIN D. HAEFFELE AND RENÉ VIDAL (2017)

CMAP Machine Learning Journal Club

Speaker: Imke Mayer (CMAP), December 13th 2018

SLIDE 2

OUTLINE

I. Structured matrix factorization
   i. Context and definition
   ii. Special case 1: Sparse dictionary learning (SDL)
   iii. Special case 2: Subspace clustering (SC)
II. Global optimality for structured matrix factorization
   i. Main theorem
   ii. Polar problem
III. Application: SDL global optimality
IV. Extension to tensor factorization and deep learning

SLIDE 3

STRUCTURED MATRIX FACTORIZATION: CONTEXT

(Large) high-dimensional datasets (images, videos, user ratings, etc.) are
• difficult to handle directly (computational and memory complexity),
• but the relevant information often lies in a low-dimensional structure.

Goal: recover this underlying low-dimensional structure of the given (large-scale) data X.

Example applications: motion segmentation, face clustering [12].

[12] VIDAL, R., MA, Y., AND SASTRY, S. S. Generalized principal component analysis, vol. 5. Springer, 2016.

SLIDE 4

STRUCTURED MATRIX FACTORIZATION: CONTEXT

Model assumption: linear subspace model. The data can be approximated by one or more low-dimensional subspace(s):

$X \approx UV^T$

where the columns of U form a basis of the linear low-dimensional structure and V is the low-dimensional data representation.

Goal: recover this underlying low-dimensional structure of the given (large-scale) data X.

SLIDE 5

STRUCTURED MATRIX FACTORIZATION: CONTEXT

$\min_{U,V} \; \ell(X, UV^T) + \lambda\,\Theta(U, V)$   (1)

The loss $\ell$ measures the quality of the approximation $X \approx UV^T$; the regularizer $\Theta$ imposes restrictions on the factors.

• Issue: without any assumptions there are infinitely many choices of U and V such that $X \approx UV^T$.
• Solution: constrain the factors to satisfy certain properties.

Properties of problem (1):
• Non-convex
• Structured factors → more modeling flexibility
• Explicit representation
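To make the objects in (1) concrete, here is a minimal numerical sketch for the squared Frobenius loss; `theta` and `lam` are illustrative names for the rank-1 regularizer and trade-off parameter (the column-wise form of Θ is introduced on slide 13), not notation from the article:

```python
import numpy as np

def structured_mf_objective(X, U, V, theta, lam):
    """f(U, V) = ell(X, U V^T) + lam * sum_i theta(U_i, V_i),
    with the squared Frobenius loss ell(X, Y) = ||X - Y||_F^2."""
    loss = np.linalg.norm(X - U @ V.T, "fro") ** 2
    reg = sum(theta(U[:, i], V[:, i]) for i in range(U.shape[1]))
    return loss + lam * reg

# Example rank-1 regularizer: theta(u, v) = ||u||_2 * ||v||_2,
# which induces the nuclear norm (see slide 19).
theta_l2 = lambda u, v: np.linalg.norm(u) * np.linalg.norm(v)
```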

SLIDE 6

STRUCTURED MATRIX FACTORIZATION: SPECIAL CASE 1: SPARSE DICTIONARY LEARNING

Given a set of signals X, find a set of dictionary atoms U and sparse codes V that approximate the signals [9]. Applications: denoising, inpainting, classification.

$\min_{U,V} \; \|X - UV^T\|_F^2 + \lambda \|V\|_1 \quad \text{subject to} \quad \|U_i\|_2 \le 1$   (3)

[Figure: a noisy image is denoised by expressing each patch as a sparse linear combination of dictionary atoms.]

[9] OLSHAUSEN, B. A., AND FIELD, D. J. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research 37, 23 (1997), 3311–3325.

SLIDE 7

STRUCTURED MATRIX FACTORIZATION: SPECIAL CASE 1: SPARSE DICTIONARY LEARNING

$\min_{U,V} \; \|X - UV^T\|_F^2 + \lambda \|V\|_1 \quad \text{subject to} \quad \|U_i\|_2 \le 1$   (3)

Challenges:
• Optimization strategies lack global convergence guarantees (a solver sketch follows below).
• Which size for U and V? The number of columns r must be picked a priori.

Letting the size r vary leads to

$\min_{U,V,r} \; \|X - UV^T\|_F^2 + \lambda \sum_{i=1}^{r} \big( \gamma \|V_i\|_1 + (1-\gamma) \|V_i\|_2 \big) \quad \text{subject to} \quad \|U_i\|_2 \le 1$   (4)
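As referenced above, a minimal sketch of one alternating proximal/projected-gradient step for problem (3): soft-thresholding handles the ℓ1 term and a projection enforces ‖Uᵢ‖₂ ≤ 1. This is one plausible solver under stated assumptions, not the method prescribed by the article, and the step size is left to be tuned:

```python
import numpy as np

def soft_threshold(A, t):
    """Proximal operator of t * ||.||_1 (entrywise soft-thresholding)."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def sdl_step(X, U, V, lam, step):
    """One alternating proximal/projected-gradient step on problem (3)."""
    # Proximal-gradient update of the sparse codes V
    grad_V = 2 * (U @ V.T - X).T @ U
    V = soft_threshold(V - step * grad_V, step * lam)
    # Projected-gradient update of the dictionary U
    grad_U = 2 * (U @ V.T - X) @ V
    U = U - step * grad_U
    norms = np.maximum(np.linalg.norm(U, axis=0), 1.0)
    return U / norms, V   # enforce the constraint ||U_i||_2 <= 1
```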

SLIDE 8

STRUCTURED MATRIX FACTORIZATION: SPECIAL CASE 2: SUBSPACE CLUSTERING

Given data X coming from a union of subspaces, find these underlying subspaces and separate the data according to them.
• clustering
• recovery of low-dimensional structures

SLIDE 9

STRUCTURED MATRIX FACTORIZATION: SPECIAL CASE 2: SUBSPACE CLUSTERING

Given data X coming from a union of subspaces, determine these underlying subspaces and separate the data according to them.

• The subspaces S1, ..., Sn are characterized by bases → U (recover the number and dimensions of the subspaces).
• Segmentation is obtained by finding a subspace-preserving representation → V (recover the data segmentation).

Challenges:
• Model selection: how many subspaces, and what dimension for each subspace?
• Potentially difficult subspace configurations.

SLIDE 10

STRUCTURED MATRIX FACTORIZATION: SPECIAL CASE 2: SUBSPACE CLUSTERING

One solution: Sparse Subspace Clustering [4].
• Self-expressive dictionary: fix the dictionary as U ← X.
• Find a sparse representation over U that allows one to segment the data (a sketch follows below).

But the optimality of the dictionary is not addressed. Idea: sparse dictionary learning on a union-of-subspaces model is suited to recover a more compact factorization with subspace-sparse codes [1].

[1] ADLER, A., ELAD, M., AND HEL-OR, Y. Linear-time subspace clustering via bipartite graph modeling. IEEE Transactions on Neural Networks and Learning Systems 26, 10 (2015), 2234–2246.
[4] ELHAMIFAR, E., AND VIDAL, R. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 11 (2013), 2765–2781.
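A minimal sketch of the self-expressive coding step, assuming the lasso-style formulation $\min_C \|X - XC\|_F^2 + \lambda \|C\|_1$ with diag(C) = 0, solved by ISTA; a simplified illustration of the idea in [4], not its exact algorithm:

```python
import numpy as np

def ssc_codes(X, lam, n_iter=500):
    """Self-expressive sparse codes: min_C ||X - X C||_F^2 + lam * ||C||_1
    with diag(C) = 0, solved by ISTA. Columns of X are the data points."""
    n = X.shape[1]
    G = X.T @ X                                   # Gram matrix
    step = 1.0 / (2.0 * np.linalg.norm(G, 2))     # 1 / Lipschitz constant
    C = np.zeros((n, n))
    for _ in range(n_iter):
        grad = 2.0 * (G @ C - G)                  # gradient of ||X - X C||_F^2
        Z = C - step * grad
        C = np.sign(Z) * np.maximum(np.abs(Z) - step * lam, 0.0)
        np.fill_diagonal(C, 0.0)                  # rule out the trivial C = I
    return C   # the affinity |C| + |C|^T then feeds spectral clustering
```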

SLIDE 11

STRUCTURED MATRIX FACTORIZATION: THEORY FOR GLOBAL OPTIMALITY

Matrix factorization:
$\min_{U,V} \; \ell(X, UV^T) + \lambda\,\Theta(U, V)$   (1)
• Non-convex
• Small problem size
• Structured factors → more modeling flexibility
• Explicit representation

Matrix approximation:
$\min_{Y} \; \ell(X, Y) + \lambda\,\Omega_\Theta(Y)$   (2)
• Convex
• Large problem size
• Unstructured

How are (1) and (2) related? The guiding example is the low-rank case:

Low-rank matrix factorization: $\min_{U,V} \; \ell(X, UV^T)$ subject to U, V having at most r columns.
Low-rank matrix approximation: $\min_{Y} \; \ell(X, Y) + \lambda \|Y\|_*$
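For the squared loss (with a 1/2 scaling, an assumption made here to obtain the standard proximal form), the low-rank matrix approximation above has a closed-form solution by singular value thresholding, a useful reference point for what follows:

```python
import numpy as np

def svt(X, lam):
    """Solve min_Y 0.5 * ||X - Y||_F^2 + lam * ||Y||_* in closed form:
    shrink the singular values of X by lam (singular value thresholding)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt
```

This is exactly the shrinkage that the convex relaxation applies to the spectrum, whereas the factorized problem (1) keeps U and V explicit.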

SLIDE 12

STRUCTURED MATRIX FACTORIZATION: THEORY FOR GLOBAL OPTIMALITY

Matrix factorization: $\min_{U,V} \; \ell(X, UV^T) + \lambda\,\Theta(U, V)$   (1)
Matrix approximation: $\min_{Y} \; \ell(X, Y) + \lambda\,\Omega_\Theta(Y)$   (2)

Ideas:
• Find a convex regularization function $\Omega_\Theta$, derived from a general $\Theta$, that couples the two problems (1) and (2).
• Allow the number of columns of U and V to change in (1).

Results:
• Problem (2) gives a global lower bound for problem (1).
• This convex lower bound makes it possible to analyze global optimality for problem (1).

SLIDE 13

GLOBAL OPTIMALITY OF STRUCTURED MATRIX FACTORIZATION AT A LOCAL MINIMUM

$\min_{U,V} \; f(U, V) = \ell(X, UV^T) + \lambda\,\Theta(U, V)$   (1)

Assumptions:
• The factorization size r is allowed to change.
• The loss $\ell$ is convex and once differentiable w.r.t. Y.
• $\Theta$ is a sum of positively homogeneous functions of degree 2:
  $\Theta(U, V) = \sum_{i=1}^{r} \theta(U_i, V_i)$, with $\theta(\alpha u, \alpha v) = \alpha^2 \theta(u, v)$ for all $\alpha \ge 0$
  (e.g., $\theta(u, v) = \|u\|_2 \|v\|_2$).

THEOREM [6]
Local minima $(\tilde U, \tilde V)$ of $f(U, V)$ are globally optimal if $(\tilde U_i, \tilde V_i) = (0, 0)$ for some $i \in [r]$.

In other words: all local minima of f of sufficient size are global minima.

[6] HAEFFELE, B. D., AND VIDAL, R. Structured low-rank matrix factorization: Global optimality, algorithms, and applications. arXiv preprint arXiv:1708.07850 (2017).

SLIDE 14

GLOBAL OPTIMALITY OF STRUCTURED MATRIX FACTORIZATION AT ANY POINT

$\min_{U,V} \; f(U, V) = \ell(X, UV^T) + \lambda\,\Theta(U, V)$   (1)

Assumptions:
• The factorization size r is allowed to change.
• The loss $\ell$ is convex and once differentiable w.r.t. Y.
• $\Theta$ is a sum of positively homogeneous functions of degree 2:
  $\Theta(U, V) = \sum_{i=1}^{r} \theta(U_i, V_i)$, with $\theta(\alpha u, \alpha v) = \alpha^2 \theta(u, v)$ for all $\alpha \ge 0$.

COROLLARY [6]
A point $(\tilde U, \tilde V)$ is a global optimum of $f(U, V)$ if it satisfies the following conditions:
1) $\tilde U_i^T \left( -\frac{1}{\lambda} \nabla_Y \ell(X, \tilde U \tilde V^T) \right) \tilde V_i = \theta(\tilde U_i, \tilde V_i)$ for all $i \in [r]$;
2) $u^T \left( -\frac{1}{\lambda} \nabla_Y \ell(X, \tilde U \tilde V^T) \right) v \le \theta(u, v)$ for all $(u, v)$.

For many choices of $\ell$, condition 1 is satisfied by first-order optimal points.

[6] HAEFFELE, B. D., AND VIDAL, R. Structured low-rank matrix factorization: Global optimality, algorithms, and applications. arXiv preprint arXiv:1708.07850 (2017).

SLIDE 15

GLOBAL OPTIMALITY OF STRUCTURED MATRIX FACTORIZATION AT ANY POINT

$\min_{U,V} \; f(U, V) = \ell(X, UV^T) + \lambda\,\Theta(U, V)$   (1)

Assumptions (as before): the factorization size r is allowed to change; the loss $\ell$ is convex and once differentiable w.r.t. Y; $\Theta(U, V) = \sum_{i=1}^{r} \theta(U_i, V_i)$ with $\theta$ positively homogeneous of degree 2.

COROLLARY [6]
Given a point $(\tilde U, \tilde V)$, we can test whether it is a local minimum of sufficient size by testing:

$u^T \left( -\frac{1}{\lambda} \nabla_Y \ell(X, \tilde U \tilde V^T) \right) v \le \theta(u, v)$ for all $(u, v)$   (5)

[6] HAEFFELE, B. D., AND VIDAL, R. Structured low-rank matrix factorization: Global optimality, algorithms, and applications. arXiv preprint arXiv:1708.07850 (2017).

SLIDE 16

GLOBAL OPTIMALITY OF STRUCTURED MATRIX FACTORIZATION AT ANY POINT

$\min_{U,V} \; f(U, V) = \ell(X, UV^T) + \lambda\,\Theta(U, V)$   (1)

COROLLARY [6]
Given a point $(\tilde U, \tilde V)$, we can test whether it is a local minimum of sufficient size by testing:

$u^T \left( -\frac{1}{\lambda} \nabla_Y \ell(X, \tilde U \tilde V^T) \right) v \le \theta(u, v)$ for all $(u, v)$   (5)

Condition (5) says that the directional derivative of f in any direction (u, v), i.e. when the factorization is augmented by a new column pair, is non-negative:

$0 \le \langle \nabla_Y \ell(X, \tilde U \tilde V^T), uv^T \rangle + \lambda\,\theta(u, v)$ for all $(u, v)$   (6)

which is equivalent to

$\Omega_\theta^\circ \left( -\frac{1}{\lambda} \nabla_Y \ell(X, \tilde U \tilde V^T) \right) \le 1$, where $\Omega_\theta^\circ(Z) = \sup_{u,v} \; u^T Z v \;\;\text{s.t.}\;\; \theta(u, v) \le 1$   (7)

is the polar problem. (A numerical sketch of this test follows below.)

[6] HAEFFELE, B. D., AND VIDAL, R. Structured low-rank matrix factorization: Global optimality, algorithms, and applications. arXiv preprint arXiv:1708.07850 (2017).
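A concrete sketch of test (7) for the squared loss and θ(u, v) = ‖u‖₂‖v‖₂, for which the polar is the spectral norm (see the next slide); the function name and tolerance are illustrative assumptions:

```python
import numpy as np

def passes_polar_test(X, U, V, lam, tol=1e-8):
    """Test condition (7) for ell(X, Y) = ||X - Y||_F^2 and
    theta(u, v) = ||u||_2 * ||v||_2, for which the polar of Omega_theta
    is the spectral norm (largest singular value)."""
    grad = 2.0 * (U @ V.T - X)                 # gradient of the loss at Y = U V^T
    Z = -grad / lam
    return np.linalg.norm(Z, 2) <= 1.0 + tol   # polar value = sigma_max(Z)
```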

SLIDE 17

POLAR PROBLEM

$\Omega_\theta^\circ(Z) = \sup_{u,v} \; u^T Z v \;\;\text{s.t.}\;\; \theta(u, v) \le 1$   (7)

Why are we interested in solving this polar problem?
• For a non-convex problem, first-order optimality is not sufficient to guarantee local minimality.
  → The theorem on global optimality only applies to local minima.
• The polar problem allows one to test global optimality at any (critical) point.
  → It is, however, a higher-order non-smooth saddle point problem.

The difficulty of solving the polar depends on the choice of $\theta$ (implementations of the two tractable cases follow below):
• $\theta(u, v) = \|u\|_2 \|v\|_2$: $\Omega_\theta^\circ(Z) = \sigma_{\max}(Z)$, the largest singular value of Z (the square root of the top eigenvalue of $Z^T Z$).
• $\theta(u, v) = \|u\|_1 \|v\|_1$: $\Omega_\theta^\circ(Z)$ is the largest entry (in absolute value) of Z.
• $\theta(u, v) = \|u\|_\infty \|v\|_\infty$: solving the polar problem is NP-hard [7].

[7] HENDRICKX, J. M., AND OLSHEVSKY, A. Matrix p-norms are NP-hard to approximate if p ≠ 1, 2, ∞. SIAM Journal on Matrix Analysis and Applications 31, 5 (2010), 2802–2812.
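The two tractable cases above are one-liners in NumPy (a sketch; the names `polar_l2` and `polar_l1` are illustrative):

```python
import numpy as np

def polar_l2(Z):
    """Polar for theta(u, v) = ||u||_2 * ||v||_2: largest singular value of Z."""
    return np.linalg.norm(Z, 2)

def polar_l1(Z):
    """Polar for theta(u, v) = ||u||_1 * ||v||_1: largest |entry| of Z."""
    return np.abs(Z).max()
```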

SLIDE 18

ELEMENTS OF PROOF (THEOREM)

Rank-1 regularizer $\theta : \mathbb{R}^D \times \mathbb{R}^N \to \mathbb{R}_+ \cup \{\infty\}$:
(i) positively homogeneous of degree 2,
(ii) positive semi-definite,
(iii) well-defined as a regularizer for rank-1 matrices.

Matrix factorization regularizer:

$\Omega_\theta : \mathbb{R}^{D \times N} \to \mathbb{R}_+ \cup \{\infty\}, \quad Y \mapsto \inf_{r \in \mathbb{N}_+} \; \inf_{U \in \mathbb{R}^{D \times r},\, V \in \mathbb{R}^{N \times r}} \; \sum_{i=1}^{r} \theta(U_i, V_i) \;\;\text{s.t.}\;\; Y = UV^T$   (8)

• Generalization of the decomposition/atomic norm.
• Convex function.
• The infimum is achieved with r ≤ DN.

SLIDE 19

ELEMENTS OF PROOF (THEOREM)

Rank-1 regularizer $\theta : \mathbb{R}^D \times \mathbb{R}^N \to \mathbb{R}_+ \cup \{\infty\}$ and matrix factorization regularizer $\Omega_\theta$ as in (8):
• Generalization of the decomposition/atomic norm.
• Convex function.
• The infimum is achieved with r ≤ DN.
• Example: the variational form of the nuclear norm (the sum of the singular values of the given matrix), numerically checked below:

$\|Y\|_* = \|Y\|_{2,2} = \inf_{r \in \mathbb{N}_+} \; \inf_{(U,V):\, UV^T = Y} \; \sum_{i=1}^{r} \|U_i\|_2 \|V_i\|_2$
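A quick numerical check of this variational form: the balanced factorization $U = P\sqrt{\Sigma}$, $V = Q\sqrt{\Sigma}$ built from the SVD $Y = P \Sigma Q^T$ attains the infimum (an illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((5, 4))

# Balanced factorization from the SVD attains the infimum.
P, s, Qt = np.linalg.svd(Y, full_matrices=False)
U = P * np.sqrt(s)        # scale columns by sqrt of singular values
V = Qt.T * np.sqrt(s)

cost = sum(np.linalg.norm(U[:, i]) * np.linalg.norm(V[:, i]) for i in range(len(s)))
assert np.isclose(cost, s.sum())   # equals the nuclear norm ||Y||_*
assert np.allclose(U @ V.T, Y)     # and is a valid factorization of Y
```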

SLIDE 20

ELEMENTS OF PROOF (THEOREM)

Matrix factorization regularizer:
$\Omega_\theta(Y) = \inf_{r \in \mathbb{N}_+} \; \inf_{U \in \mathbb{R}^{D \times r},\, V \in \mathbb{R}^{N \times r}} \; \sum_{i=1}^{r} \theta(U_i, V_i) \;\;\text{s.t.}\;\; Y = UV^T$   (8)

Convex optimization problem:
$\min_{Y} \; \{ F(Y) \equiv \ell(X, Y) + \lambda\,\Omega_\theta(Y) \}$   (9)

Non-convex factorized formulation:
$\min_{U,V} \; \{ f(U, V) \equiv \ell(X, UV^T) + \lambda \sum_{i=1}^{r} \theta(U_i, V_i) \}$   (10)

• If $\hat Y$ is an optimal solution to problem (9), then any factorization $UV^T = \hat Y$ such that $\sum_{i=1}^{r} \theta(U_i, V_i) = \Omega_\theta(\hat Y)$ is also an optimal solution to the non-convex problem (10).
• Idea of the proof: local minima of f that satisfy the conditions of the theorem also satisfy the conditions for global optimality of F.

SLIDE 21

ADDITIONAL COMMENTS ON THE POLAR PROBLEM

Matrix factorization regularizer:
$\Omega_\theta(Y) = \inf_{r \in \mathbb{N}_+} \; \inf_{U \in \mathbb{R}^{D \times r},\, V \in \mathbb{R}^{N \times r}} \; \sum_{i=1}^{r} \theta(U_i, V_i) \;\;\text{s.t.}\;\; Y = UV^T$   (8)

Polar function:
$\Omega_\theta^\circ(Z) = \sup_{Y} \; \langle Z, Y \rangle \;\;\text{s.t.}\;\; \Omega_\theta(Y) \le 1$   (11)
$\Omega_\theta^\circ(Z) = \sup_{u,v} \; u^T Z v \;\;\text{s.t.}\;\; \theta(u, v) \le 1$   (12)

• Given a first-order optimal point of the non-convex problem, the value of the polar problem at this point provides a bound on how far this point is from being globally optimal.

SLIDE 22

ALGORITHM FOR SPARSE DICTIONARY LEARNING

Algorithm 1: SDL matrix factorization
    Input: data X, initial factorization (U_init, V_init) of size r_init.
    Result: optimal factorization (U_final, V_final) of size r_final.
    while not converged do
        1) Local descent to a first-order optimal point (Ũ, Ṽ).
        2) Solve the SDL polar problem at $Z = -\frac{1}{\lambda} \nabla_Y \ell(X, \tilde U \tilde V^T)$.
        if polar value ≤ 1 then
            the algorithm has converged
        else
            append the polar solution (u*, v*) with some step size τ to (Ũ, Ṽ):
            (U, V) ← ([Ũ, τu*], [Ṽ, τv*])
        end
    end

• Instantiation of the structured MF meta-algorithm (a runnable sketch follows below).
• Builds upon the COROLLARY (on global optimality at any given point).
• Alternates between:
  a) local descent to a critical point;
  b) evaluation of the polar function (by solving the polar problem) in order to test for global optimality at this critical point;
  c) augmentation of the current factorization with the solution of the polar step (b), as long as the polar value is > 1.
• From [6] we know that condition 2) of the COROLLARY will hold for finite r.

[6] HAEFFELE, B. D., AND VIDAL, R. Structured low-rank matrix factorization: Global optimality, algorithms, and applications. arXiv preprint arXiv:1708.07850 (2017).
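A minimal runnable sketch of this meta-algorithm, instantiated for the tractable regularizer θ(u, v) = ‖u‖₂‖v‖₂ so that the polar step reduces to a top singular pair; the actual SDL polar of [6] is more involved, and the descent scheme, step sizes, and iteration counts here are illustrative assumptions:

```python
import numpy as np

def meta_algorithm(X, lam, r0=1, tau=0.1, n_descent=200, step=1e-3, max_outer=50):
    """Sketch for f(U, V) = ||X - U V^T||_F^2 + lam * sum_i ||U_i||_2 ||V_i||_2."""
    D, N = X.shape
    U, V = np.zeros((D, r0)), np.zeros((N, r0))
    for _ in range(max_outer):
        # a) local descent via (sub)gradient steps
        for _ in range(n_descent):
            R = U @ V.T - X
            nu = np.linalg.norm(U, axis=0)
            nv = np.linalg.norm(V, axis=0)
            U = U - step * (2 * R @ V + lam * U / np.maximum(nu, 1e-12) * nv)
            V = V - step * (2 * R.T @ U + lam * V / np.maximum(nv, 1e-12) * nu)
        # b) polar step: top singular pair of Z = -(1/lam) * gradient
        Z = -2.0 * (U @ V.T - X) / lam
        P, s, Qt = np.linalg.svd(Z, full_matrices=False)
        if s[0] <= 1.0 + 1e-6:
            break                              # test (7) passed: globally optimal
        # c) append the polar solution with step size tau
        U = np.hstack([U, tau * P[:, :1]])
        V = np.hstack([V, tau * Qt[:1].T])
    return U, V
```

The loop mirrors steps a) to c): plain (sub)gradient descent for the local step, an SVD for the polar test, and column appending whenever the polar value exceeds 1.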

SLIDE 23

BEYOND MATRIX FACTORIZATION

Structured tensor factorization and deep learning.

• Mapping: the matrix product $\Phi(U, V) = UV^T$ generalizes to the network map
  $\Phi(W^1, \ldots, W^K) = \psi_K(\cdots \psi_2(\psi_1(V W^1) W^2) \cdots W^K)$
  where the $\psi_k$ are non-linearities, V holds the input features, and the $W^k$ are the weights.
• Dimension: r = number of columns in U and V becomes r = number of parallel subnetworks.
• Optimization problem:
  $\min_{W^1, \ldots, W^K} \; \ell(Y, \Phi(W^1, \ldots, W^K)) + \lambda\,\Theta(W^1, \ldots, W^K)$
• Key ingredients: positive homogeneity of the network architecture and a parallel subnetwork structure (see the sketch below).
• Example of regularizer: product of norms.

Figure from the ICCV '17 tutorial on Global Optimality in Matrix and Tensor Factorization, Deep Learning & Beyond (Ben Haeffele and René Vidal).
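For intuition on these two ingredients, a sketch of a parallel-subnetwork map with ReLU non-linearities: ReLU is positively homogeneous of degree 1, so scaling one subnetwork's weights by α scales its output by α^K. Purely illustrative, not the article's exact formulation:

```python
import numpy as np

relu = lambda A: np.maximum(A, 0.0)   # positively homogeneous non-linearity

def phi(V, subnetworks):
    """Sum of r parallel subnetworks, each computing
    psi_K(... psi_2(psi_1(V W^1) W^2) ... W^K) with psi_k = ReLU.
    Scaling one subnetwork's weights by alpha scales its output by alpha**K."""
    out = 0.0
    for weights in subnetworks:   # one list of K weight matrices per subnetwork
        h = V
        for W in weights:
            h = relu(h @ W)
        out = out + h
    return out
```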

SLIDE 24

CONCLUSION AND FUTURE WORK

• Structured matrix factorization is a general formulation of many popular problems (low-rank MF, sparse PCA, NMF, SDL, etc.).
• Global optimality of structured matrix factorization can be characterized.
• An iterative algorithm alternating local descent and the polar problem can reach a global optimum.
• Limits of direct applicability: the optimization landscape of the polar problem can be complicated.
• The analysis extends to global optimality for deep learning problems.

Further reading:
• HAEFFELE, B. D., AND VIDAL, R. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540 (2015).
• BACH, F. Convex relaxations of structured matrix factorizations. arXiv preprint arXiv:1309.3117 (2013).
• SCHWAB, E., HAEFFELE, B., CHARON, N., AND VIDAL, R. Separable dictionary learning with global optimality and applications to diffusion MRI. arXiv preprint arXiv:1807.05595 (2018).

SLIDE 25

THANK YOU FOR YOUR ATTENTION

Presented article:
HAEFFELE, B. D., AND VIDAL, R. (2017). Structured Low-Rank Matrix Factorization: Global Optimality, Algorithms, and Applications. URL: https://arxiv.org/abs/1708.07850

SLIDE 26

REFERENCES I

(1) ADLER, A., ELAD, M., AND HEL-OR, Y. Linear-time subspace clustering via bipartite graph modeling. IEEE Transactions on Neural Networks and Learning Systems 26, 10 (2015), 2234–2246.
(2) BACH, F. Convex relaxations of structured matrix factorizations. arXiv preprint arXiv:1309.3117 (2013).
(3) BACH, F., MAIRAL, J., AND PONCE, J. Convex sparse matrix factorizations. arXiv preprint arXiv:0812.1869 (2008).
(4) ELHAMIFAR, E., AND VIDAL, R. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 11 (2013), 2765–2781.
(5) HAEFFELE, B. D., AND VIDAL, R. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540 (2015).
(6) HAEFFELE, B. D., AND VIDAL, R. Structured low-rank matrix factorization: Global optimality, algorithms, and applications. arXiv preprint arXiv:1708.07850 (2017).
(7) HENDRICKX, J. M., AND OLSHEVSKY, A. Matrix p-norms are NP-hard to approximate if p ≠ 1, 2, ∞. SIAM Journal on Matrix Analysis and Applications 31, 5 (2010), 2802–2812.
(8) LIU, G., LIN, Z., AND YU, Y. Robust subspace segmentation by low-rank representation. In Proceedings of the 27th International Conference on Machine Learning (ICML-10) (2010), pp. 663–670.

SLIDE 27

REFERENCES II

(9) OLSHAUSEN, B. A., AND FIELD, D. J. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research 37, 23 (1997), 3311–3325.
(10) SCHWAB, E., HAEFFELE, B., CHARON, N., AND VIDAL, R. Separable dictionary learning with global optimality and applications to diffusion MRI. arXiv preprint arXiv:1807.05595 (2018).
(11) SUN, J., QU, Q., AND WRIGHT, J. Complete dictionary recovery over the sphere. arXiv preprint arXiv:1504.06785 (2015).
(12) VIDAL, R., MA, Y., AND SASTRY, S. S. Generalized Principal Component Analysis, vol. 5. Springer, 2016.
(13) ZHU, Z., LI, Q., TANG, G., AND WAKIN, M. B. Global optimality in low-rank matrix optimization. IEEE Transactions on Signal Processing 66, 13 (2018), 3614–3628.