STRUCTURED LOW-RANK MATRIX FACTORIZATION: GLOBAL OPTIMALITY, ALGORITHMS, AND APPLICATIONS
Article by Benjamin D. Haeffele and René Vidal (2017)
CMAP Machine Learning Journal Club
Speaker: Imke Mayer, December 13th 2018
OUTLINE
I. Structured Matrix Factorization
   i. Context and definition
   ii. Special case 1: Sparse dictionary learning (SDL)
   iii. Special case 2: Subspace clustering (SC)
II. Global optimality for structured matrix factorization
   i. Main theorem
   ii. Polar problem
III. Application: SDL global optimality
IV. Extension to tensor factorization and deep learning
CMAP Machine Learning Journal Club, December 13th 2018
STRUCTURED MATRIX FACTORIZATION: CONTEXT
(Large) high-dimensional datasets (images, videos, user ratings, etc.)
Ø difficult to process (computational and memory complexity)
Ø but the relevant information often lies in a low-dimensional structure
Goal: recover this underlying low-dimensional structure of the given (large-scale) data X
[Figure: motion segmentation and face clustering examples]
[12] VIDAL, R., MA, Y., AND SASTRY, S. S. Generalized principal component analysis, vol. 5. Springer, 2016.
STRUCTURED MATRIX FACTORIZATION: CONTEXT
Model assumption: linear subspace model. The data can be approximated by one or more low-dimensional subspace(s):

    X ≈ U V^T

where U is a basis of the linear low-dimensional structure and V is the low-dimensional data representation.
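As a minimal illustration of the linear subspace model (not from the talk; sizes and noise level are arbitrary choices), a low-rank matrix corrupted by small noise is recovered by a truncated SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, r = 50, 200, 3   # illustrative sizes: ambient dimensions and true low rank

# Synthetic data following the linear subspace model X ≈ U V^T
U_true = rng.standard_normal((D, r))
V_true = rng.standard_normal((N, r))
X = U_true @ V_true.T + 0.01 * rng.standard_normal((D, N))

# A truncated SVD recovers the underlying r-dimensional structure
Uc, s, Vt = np.linalg.svd(X, full_matrices=False)
X_hat = (Uc[:, :r] * s[:r]) @ Vt[:r]

rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(rel_err)  # small: almost all of the energy lies in the top-r subspace
```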
STRUCTURED MATRIX FACTORIZATION: CONTEXT

    min_{U,V} ℓ(X, U V^T) + λ Θ(U, V)    (1)

Ø Loss ℓ: measures the quality of the approximation X ≈ U V^T.
Ø Regularization Θ: imposes restrictions on the factors.
Ø Issue: without any assumptions, there are infinitely many choices of U and V such that X ≈ U V^T.
Ø Solution: constrain the factors to satisfy certain properties.
Properties of problem (1):
Ø Non-convex
Ø Structured factors → more modeling flexibility
Ø Explicit representation
STRUCTURED MATRIX FACTORIZATION: SPECIAL CASE 1: SPARSE DICTIONARY LEARNING
Given a set of signals, find a set of dictionary atoms and sparse codes to approximate the signals [9]:

    min_{U,V} ‖X − U V^T‖_F² + λ‖V‖₁   subject to ‖U_i‖₂ ≤ 1    (3)

(X: signals, U: dictionary, V: sparse codes)
Ø denoising, inpainting
Ø classification
[Figure: a noisy image is denoised by sparse linear combinations of dictionary atoms.]
[9] OLSHAUSEN, B. A., AND FIELD, D. J. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research 37, 23 (1997), 3311–3325.
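Problem (3) is commonly attacked by alternating between the codes and the dictionary. The sketch below (my own illustrative scheme, not the solver studied in the talk) takes one proximal-gradient step on V and one projected-gradient step on U, which keeps the objective non-increasing:

```python
import numpy as np

def soft_threshold(A, t):
    """Proximal operator of t * ||.||_1 (element-wise soft-thresholding)."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def sdl_alternating_step(X, U, V, lam):
    """One alternating step for problem (3):
        min_{U,V} ||X - U V^T||_F^2 + lam ||V||_1  s.t. ||U_i||_2 <= 1.
    (An illustrative scheme; the paper's algorithm differs.)"""
    # ISTA step on the sparse codes V (step = 1/Lipschitz constant)
    step_V = 1.0 / (2 * np.linalg.norm(U.T @ U, 2) + 1e-12)
    grad_V = 2 * (V @ U.T - X.T) @ U          # gradient of ||X - U V^T||_F^2 in V
    V = soft_threshold(V - step_V * grad_V, step_V * lam)
    # Projected gradient step on the dictionary U
    step_U = 1.0 / (2 * np.linalg.norm(V.T @ V, 2) + 1e-12)
    grad_U = 2 * (U @ V.T - X) @ V
    U = U - step_U * grad_U
    U /= np.maximum(np.linalg.norm(U, axis=0), 1.0)  # project onto ||U_i||_2 <= 1

    return U, V

# Quick check on random data: the objective does not increase
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 30))
U = rng.standard_normal((10, 5))
U /= np.maximum(np.linalg.norm(U, axis=0), 1.0)
V = rng.standard_normal((30, 5))
lam = 0.1
before = np.linalg.norm(X - U @ V.T) ** 2 + lam * np.abs(V).sum()
for _ in range(10):
    U, V = sdl_alternating_step(X, U, V, lam)
after = np.linalg.norm(X - U @ V.T) ** 2 + lam * np.abs(V).sum()
print(after <= before)  # → True
```

As the next slide notes, such alternating schemes carry no global convergence guarantee, which is exactly the gap the paper's theory addresses.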
STRUCTURED MATRIX FACTORIZATION: SPECIAL CASE 1: SPARSE DICTIONARY LEARNING

    min_{U,V} ‖X − U V^T‖_F² + λ‖V‖₁   subject to ‖U_i‖₂ ≤ 1    (3)

    min_{U,V,r} ‖X − U V^T‖_F² + λ Σ_{i=1}^r [γ‖V_i‖₁ + (1 − γ)‖V_i‖₂]   subject to ‖U_i‖₂ ≤ 1    (4)

Challenges:
Ø Optimization strategies without global convergence guarantees.
Ø Which size for U and V? In (3) the number of columns r must be picked a priori; in (4) r is itself a variable.
STRUCTURED MATRIX FACTORIZATION: SPECIAL CASE 2: SUBSPACE CLUSTERING
Given data X coming from a union of subspaces, find these underlying subspaces and separate the data according to them.
Ø clustering
Ø recover low-dimensional structures
STRUCTURED MATRIX FACTORIZATION: SPECIAL CASE 2: SUBSPACE CLUSTERING
Given data X coming from a union of subspaces, determine these underlying subspaces and separate the data according to them.
Ø recover low-dimensional structures: the subspaces S₁, ..., Sₙ are characterized by bases → U (recover the number and dimensions of the subspaces)
Ø clustering: segmentation by finding a subspace-preserving representation → V (recover the data segmentation)
Challenges:
Ø Model selection: how many subspaces? Dimension of each subspace?
Ø Potentially difficult subspace configurations.
STRUCTURED MATRIX FACTORIZATION: SPECIAL CASE 2: SUBSPACE CLUSTERING
One solution for subspace clustering: Sparse Subspace Clustering [4]
- Self-expressive dictionary: fix the dictionary as U ← X.
- Find a sparse representation over U which allows one to segment the data.
But the optimality of the dictionary is not addressed.
Idea: sparse dictionary learning on a union-of-subspaces model is suited to recover a more compact factorization with subspace-sparse codes [1].

[1] ADLER, A., ELAD, M., AND HEL-OR, Y. Linear-time subspace clustering via bipartite graph modeling. IEEE Transactions on Neural Networks and Learning Systems 26, 10 (2015), 2234–2246.
[4] ELHAMIFAR, E., AND VIDAL, R. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 11 (2013), 2765–2781.
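The self-expressive step can be sketched numerically. Below, the constrained lasso min_C ‖X − X C‖_F² + λ‖C‖₁ s.t. diag(C) = 0 is solved with plain ISTA plus a diagonal projection (an illustrative heuristic of mine, not the solver of [4]); on data from two lines, the affinity |C| + |C|^T comes out approximately block-diagonal:

```python
import numpy as np

def ssc_coefficients(X, lam=0.05, iters=1000):
    """Self-expressive step of Sparse Subspace Clustering (sketch):
        min_C ||X - X C||_F^2 + lam ||C||_1   s.t.  diag(C) = 0."""
    N = X.shape[1]
    C = np.zeros((N, N))
    L = 2 * np.linalg.norm(X.T @ X, 2) + 1e-12   # Lipschitz constant of the smooth part
    for _ in range(iters):
        grad = 2 * X.T @ (X @ C - X)
        C = C - grad / L
        C = np.sign(C) * np.maximum(np.abs(C) - lam / L, 0.0)  # soft-thresholding
        np.fill_diagonal(C, 0.0)                 # enforce diag(C) = 0
    return C

# Points drawn from two lines (1-D subspaces) in R^5: points should mostly be
# expressed by points from their own subspace.
rng = np.random.default_rng(0)
d1, d2 = rng.standard_normal(5), rng.standard_normal(5)
X = np.hstack([np.outer(d1, rng.uniform(1, 2, 10)),
               np.outer(d2, rng.uniform(1, 2, 10))])
C = ssc_coefficients(X)
```

Spectral clustering on the affinity |C| + |C|^T then yields the segmentation; here the off-diagonal blocks of |C| carry far less mass than the diagonal blocks.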
STRUCTURED MATRIX FACTORIZATION: THEORY FOR GLOBAL OPTIMALITY

Matrix factorization:
    min_{U,V} ℓ(X, U V^T) + λ Θ(U, V)    (1)
Ø Non-convex
Ø Small problem size
Ø Structured factors → more modeling flexibility
Ø Explicit representation

Matrix approximation:
    min_Y ℓ(X, Y) + λ Ω_Θ(Y)    (2)
Ø Convex
Ø Large problem size
Ø Unstructured

How are (1) and (2) related? Prototype: low-rank matrix factorization vs. low-rank matrix approximation:
    min_{U,V} ℓ(X, U V^T)   subject to U, V have r columns
    min_Y ℓ(X, Y) + λ‖Y‖_*
STRUCTURED MATRIX FACTORIZATION: THEORY FOR GLOBAL OPTIMALITY

Matrix factorization:
    min_{U,V} ℓ(X, U V^T) + λ Θ(U, V)    (1)

Matrix approximation:
    min_Y ℓ(X, Y) + λ Ω_Θ(Y)    (2)

Ideas:
- Find a convex relaxation Ω_Θ for a general regularization function Θ to couple the two problems (1) and (2).
- Allow the number of columns of U and V to change in (1).
Results:
- Problem (2) gives a global lower bound to problem (1).
- This convex lower bound allows one to analyze global optimality for problem (1).
GLOBAL OPTIMALITY OF STRUCTURED MATRIX FACTORIZATION AT A LOCAL MINIMUM
Assumptions:
- Factorization size r is allowed to change.
- Loss ℓ(X, Y) is convex and once differentiable w.r.t. Y.
- Θ is a sum of positively homogeneous functions of degree 2:

    Θ(U, V) = Σ_{i=1}^r θ(U_i, V_i),   θ(αu, αv) = α²θ(u, v) for all α ≥ 0

THEOREM [6]
Local minima (Ũ, Ṽ) of

    min_{U,V} f(U, V) = ℓ(X, U V^T) + λ Θ(U, V)    (1)

are globally optimal if (Ũ_i, Ṽ_i) = (0, 0) for some i ∈ [r].
In other words: all local minima of f(U, V) of sufficient size are global minima.

[6] HAEFFELE, B. D., AND VIDAL, R. Structured low-rank matrix factorization: Global optimality, algorithms, and applications. arXiv preprint arXiv:1708.07850 (2017).
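The degree-2 positive homogeneity assumed of θ is easy to check numerically. A small sketch for the example choice θ(u, v) = ‖u‖₂‖v‖₂ (the choice that induces the nuclear norm):

```python
import numpy as np

def theta(u, v):
    """Example rank-1 regularizer: theta(u, v) = ||u||_2 ||v||_2."""
    return np.linalg.norm(u) * np.linalg.norm(v)

rng = np.random.default_rng(0)
u, v = rng.standard_normal(5), rng.standard_normal(7)

# Positive homogeneity of degree 2: theta(alpha u, alpha v) = alpha^2 theta(u, v)
for alpha in [0.0, 0.5, 2.0]:
    print(np.isclose(theta(alpha * u, alpha * v), alpha ** 2 * theta(u, v)))  # → True
```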
GLOBAL OPTIMALITY OF STRUCTURED MATRIX FACTORIZATION AT ANY POINT
Assumptions:
- Factorization size r is allowed to change.
- Loss ℓ(X, Y) is convex and once differentiable w.r.t. Y.
- Θ is a sum of positively homogeneous functions of degree 2:

    Θ(U, V) = Σ_{i=1}^r θ(U_i, V_i),   θ(αu, αv) = α²θ(u, v) for all α ≥ 0

COROLLARY [6]
A point (Ũ, Ṽ) is a global optimum of f(U, V) = ℓ(X, U V^T) + λ Θ(U, V) if it satisfies the following conditions:
1) Ũ_i^T ( −(1/λ) ∇_Y ℓ(X, Ũ Ṽ^T) ) Ṽ_i = θ(Ũ_i, Ṽ_i)  for all i ∈ [r]
   ← for many choices of ℓ and θ, condition 1 is satisfied by first-order optimal points
2) u^T ( −(1/λ) ∇_Y ℓ(X, Ũ Ṽ^T) ) v ≤ θ(u, v)  for all (u, v)
GLOBAL OPTIMALITY OF STRUCTURED MATRIX FACTORIZATION AT ANY POINT
COROLLARY [6]
Given a point (Ũ, Ṽ) of f(U, V) = ℓ(X, U V^T) + λ Θ(U, V), we can test whether it is a local minimum of sufficient size by testing:

    u^T ( −(1/λ) ∇_Y ℓ(X, Ũ Ṽ^T) ) v ≤ θ(u, v)  for all (u, v)    (5)

⟺ 0 ≤ ⟨∇_Y ℓ(X, Ũ Ṽ^T), u v^T⟩ + λ θ(u, v)  for all (u, v)    (6)
   (the directional derivative of f in direction (u, v))

⟺ Ω°_θ( −(1/λ) ∇_Y ℓ(X, Ũ Ṽ^T) ) ≤ 1,   where   Ω°_θ(Z) = sup_{u,v} u^T Z v  s.t.  θ(u, v) ≤ 1    (7)
   (the polar problem)
POLAR PROBLEM

    Ω°_θ(Z) = sup_{u,v} u^T Z v  s.t.  θ(u, v) ≤ 1    (7)

Why are we interested in solving this polar problem?
- For a non-convex problem, first-order optimality is not sufficient to guarantee local minimality.
  → The theorem on global optimality only applies to local minima.
- The polar problem allows one to test global optimality at any (critical) point.
  → It is a higher-order non-smooth saddle point problem.

The difficulty of solving the polar depends on the choice of θ:
- θ(u, v) = ‖u‖₂‖v‖₂ ⟹ Ω°_θ(Z) = σ_max(Z), the largest singular value of Z (the square root of the top eigenvalue of Z^T Z).
- θ(u, v) = ‖u‖₁‖v‖₁ ⟹ Ω°_θ(Z) is the largest entry (in absolute value) of Z.
- θ(u, v) = ‖u‖_∞‖v‖_∞ ⟹ solving this polar problem is NP-hard [7].

[7] HENDRICKX, J. M., AND OLSHEVSKY, A. Matrix p-norms are NP-hard to approximate if p ≠ 1, 2, ∞. SIAM Journal on Matrix Analysis and Applications 31, 5 (2010), 2802–2812.
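The two tractable cases can be checked directly against the definition of the polar: closed-form values on one side, random feasible pairs (u, v) with θ(u, v) = 1 on the other. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((6, 8))

# theta(u, v) = ||u||_2 ||v||_2  =>  polar = largest singular value of Z
polar_22 = np.linalg.norm(Z, 2)   # sigma_max(Z) = sqrt(top eigenvalue of Z^T Z)

# theta(u, v) = ||u||_1 ||v||_1  =>  polar = largest entry of Z in absolute value
polar_11 = np.abs(Z).max()

# Sanity check against the definition sup u^T Z v s.t. theta(u, v) <= 1:
# random feasible pairs never beat the closed-form values.
for _ in range(1000):
    u, v = rng.standard_normal(6), rng.standard_normal(8)
    u2, v2 = u / np.linalg.norm(u), v / np.linalg.norm(v)       # ||u||_2 ||v||_2 = 1
    assert u2 @ Z @ v2 <= polar_22 + 1e-9
    u1, v1 = u / np.linalg.norm(u, 1), v / np.linalg.norm(v, 1)  # ||u||_1 ||v||_1 = 1
    assert u1 @ Z @ v1 <= polar_11 + 1e-9
```

The suprema are attained by the top singular pair in the ℓ₂/ℓ₂ case and by a pair of standard basis vectors in the ℓ₁/ℓ₁ case.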
ELEMENTS OF PROOF (THEOREM)
Ø Rank-1 regularizer θ : R^D × R^N → R₊ ∪ {∞}:
  (i) positively homogeneous with degree 2
  (ii) positive semi-definite
  (iii) well-defined as a regularizer for rank-1 matrices
Ø Matrix factorization regularizer:

    Ω_θ : R^{D×N} → R₊ ∪ {∞},   Y ↦ inf_{r ∈ N₊} inf_{U ∈ R^{D×r}, V ∈ R^{N×r}} Σ_{i=1}^r θ(U_i, V_i)  s.t.  Y = U V^T    (8)

  - Generalization of the decomposition/atomic norm.
  - Convex function.
  - The infimum is achieved with r ≤ DN.
ELEMENTS OF PROOF (THEOREM)
Ø Example: the variational form of the nuclear norm (the sum of the singular values of the given matrix), obtained from (8) with θ(u, v) = ‖u‖₂‖v‖₂:

    ‖Y‖_* = inf_{r ∈ N₊} inf_{(U,V): U V^T = Y} Σ_{i=1}^r ‖U_i‖₂‖V_i‖₂
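This variational form can be verified numerically: the balanced factorization built from the SVD attains the infimum exactly. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((5, 4))

# Nuclear norm directly: the sum of the singular values of Y
Uc, s, Vt = np.linalg.svd(Y, full_matrices=False)
nuc = s.sum()

# Balanced factorization U = Uc sqrt(S), V = V sqrt(S): it satisfies U V^T = Y
# and attains the infimum of sum_i ||U_i||_2 ||V_i||_2
U = Uc * np.sqrt(s)
V = Vt.T * np.sqrt(s)
attained = sum(np.linalg.norm(U[:, i]) * np.linalg.norm(V[:, i])
               for i in range(len(s)))

print(np.allclose(U @ V.T, Y), np.isclose(attained, nuc))  # → True True
```

Any other factorization of Y does no better, since ‖Y‖_* ≤ Σ_i ‖U_i V_i^T‖_* = Σ_i ‖U_i‖₂‖V_i‖₂ by the triangle inequality.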
ELEMENTS OF PROOF (THEOREM)
Ø Matrix factorization regularizer:

    Ω_θ(Y) = inf_{r ∈ N₊} inf_{U ∈ R^{D×r}, V ∈ R^{N×r}} Σ_{i=1}^r θ(U_i, V_i)  s.t.  Y = U V^T    (8)

Ø Convex optimization problem:

    min_Y { F(Y) ≡ ℓ(X, Y) + λ Ω_θ(Y) }    (9)

Ø Non-convex factorized formulation:

    min_{U,V} { f(U, V) ≡ ℓ(X, U V^T) + λ Σ_{i=1}^r θ(U_i, V_i) }    (10)

Ø If Ŷ is an optimal solution to problem (9), then any factorization U V^T = Ŷ such that Σ_{i=1}^r θ(U_i, V_i) = Ω_θ(Ŷ) is also an optimal solution to the non-convex problem (10).
Ø Idea of the proof: local minima of f that satisfy the conditions of the theorem also satisfy the conditions for global optimality of F.
ADDITIONAL COMMENTS ON THE POLAR PROBLEM
Ø Matrix factorization regularizer:

    Ω_θ(Y) = inf_{r ∈ N₊} inf_{U ∈ R^{D×r}, V ∈ R^{N×r}} Σ_{i=1}^r θ(U_i, V_i)  s.t.  Y = U V^T    (8)

Ø Polar function:

    Ω°_θ(Z) = sup_Y ⟨Z, Y⟩  s.t.  Ω_θ(Y) ≤ 1    (11)
    Ω°_θ(Z) = sup_{u,v} u^T Z v  s.t.  θ(u, v) ≤ 1    (12)

Ø Given a first-order optimal point of the non-convex problem, the value of the polar problem solution at this point provides a bound on how far this point is from being globally optimal.
Algorithm 1: SDL Matrix Factorization
Input: data X, initial factorization (U_init, V_init) of size r_init
Result: optimal factorization (U_final, V_final) of size r_final
while not converged do
    1) Local descent to a first-order optimal point (Ũ, Ṽ).
    2) Solve the SDL polar problem at Z = −(1/λ) ∇_Y ℓ(X, Ũ Ṽ^T).
    if polar value ≤ 1 then
        Algorithm has converged.
    else
        Append the polar solution (u*, v*) with some step size τ to (Ũ, Ṽ):
        (U, V) ← ([Ũ  τu*], [Ṽ  τv*]).
    end
end
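The same descend / polar-test / append loop can be sketched for a simpler regularizer than the SDL one: θ(u, v) = (‖u‖₂² + ‖v‖₂²)/2 also induces the nuclear norm, is smooth, and its polar is σ_max(Z), computable from one SVD (the SDL polar itself is harder to evaluate). All constants below are illustrative choices of mine, not the paper's:

```python
import numpy as np

def meta_algorithm(X, lam=0.5, step=0.05, outer=50, inner=500, tol=1e-3):
    """Sketch of Algorithm 1 for theta(u, v) = (||u||_2^2 + ||v||_2^2)/2,
    whose polar is sigma_max(Z).  Loss: ell(X, Y) = (1/2)||X - Y||_F^2."""
    D, N = X.shape
    U, V = 1e-3 * np.ones((D, 1)), 1e-3 * np.ones((N, 1))
    for _ in range(outer):
        # 1) local descent toward a first-order optimal point of f(U, V)
        for _ in range(inner):
            R = U @ V.T - X                              # grad_Y ell(X, U V^T)
            U, V = U - step * (R @ V + lam * U), V - step * (R.T @ U + lam * V)
        # 2) polar problem at Z = -(1/lam) grad_Y ell(X, U V^T)
        Z = -(U @ V.T - X) / lam
        Uz, s, Vzt = np.linalg.svd(Z, full_matrices=False)
        if s[0] <= 1 + tol:                              # polar value <= 1: done
            break
        # 3) append the polar maximizer (top singular pair) with step size tau
        tau = 0.1
        U = np.hstack([U, tau * Uz[:, :1]])
        V = np.hstack([V, tau * Vzt[:1].T])
    return U, V
```

For this regularizer the convex counterpart (9) is nuclear-norm-regularized least squares, whose solution is singular-value thresholding of X by λ, so the loop can be checked against that closed form.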
ALGORITHM FOR SPARSE DICTIONARY LEARNING
Ø Instantiation of the structured MF meta-algorithm.
Ø Builds upon the COROLLARY (on global optimality at any given point).
Ø Alternates between:
  a) local descent to a critical point;
  b) evaluation of the polar function (by solving the polar problem) in order to test for global optimality at this critical point;
  c) augmentation of the current factorization with the solution of the polar step (b), as long as the polar value is > 1.
Ø From [6] we know that condition 2) of the COROLLARY will hold for finite r.
BEYOND MATRIX FACTORIZATION
Ø Structured tensor factorization and deep learning.
Ø Mapping: replace Φ(U, V) = U V^T by the network map

    Φ(W^1, ..., W^K) = ψ_K(··· ψ₂(ψ₁(V W^1) W^2) ··· W^K)

  (ψ_k: non-linearities, V: input features, W^k: weights)
Ø Dimension: instead of r = number of columns in U and V, take r = number of parallel subnetworks.
Ø Optimization problem:

    min_{W^1,...,W^K} ℓ(Y, Φ(W^1, ..., W^K)) + λ Θ(W^1, ..., W^K)

Ø Key requirements: positive homogeneity of the network architecture and a parallel subnetwork structure.
Ø Example of regularizer: product of norms.

Figure from the ICCV '17 tutorial on Global Optimality in Matrix and Tensor Factorization, Deep Learning & Beyond (Ben Haeffele and René Vidal).
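Positive homogeneity does hold for common architectures: a bias-free ReLU network of depth K is positively homogeneous of degree K in its weights, since ReLU(αx) = α·ReLU(x) for α ≥ 0. A small sketch (an illustrative network of my own, not the paper's):

```python
import numpy as np

def relu_net(V, Ws):
    """Illustrative Phi(W^1, ..., W^K): an elementwise ReLU non-linearity
    psi_k after each linear layer, with no bias terms."""
    Y = V
    for W in Ws:
        Y = np.maximum(Y @ W, 0.0)
    return Y

rng = np.random.default_rng(0)
V = rng.standard_normal((3, 5))                     # input features
Ws = [rng.standard_normal((5, 4)), rng.standard_normal((4, 2))]  # K = 2 layers

# Scaling every weight matrix by alpha scales the output by alpha^K:
alpha = 1.7
out = relu_net(V, Ws)
out_scaled = relu_net(V, [alpha * W for W in Ws])
print(np.allclose(out_scaled, alpha ** 2 * out))  # → True
```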
CONCLUSION AND FUTURE WORK
Ø Structured matrix factorization as a general formulation of many popular problems (low-rank MF, sparse PCA, NMF, SDL, etc.).
Ø Characterization of global optimality of structured matrix factorization.
Ø Iterative algorithm using local descent and the polar problem to reach a global optimum.
Ø Limits of direct applicability: the polar problem optimization landscape can be complicated.
Ø Extendable to the analysis of global optimality for deep learning problems.
Ø Further reading:
  - HAEFFELE, B. D., AND VIDAL, R. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540 (2015).
  - BACH, F. Convex relaxations of structured matrix factorizations. arXiv preprint arXiv:1309.3117 (2013).
  - SCHWAB, E., HAEFFELE, B., CHARON, N., AND VIDAL, R. Separable dictionary learning with global optimality and applications to diffusion MRI. arXiv preprint arXiv:1807.05595 (2018).
THANK YOU FOR YOUR ATTENTION
Presented article:
Haeffele, B. D., and Vidal, R. (2017). Structured Low-Rank Matrix Factorization: Global Optimality, Algorithms, and Applications. URL: https://arxiv.org/abs/1708.07850
REFERENCES
(1) ADLER, A., ELAD, M., AND HEL-OR, Y. Linear-time subspace clustering via bipartite graph modeling. IEEE Transactions on Neural Networks and Learning Systems 26, 10 (2015), 2234–2246.
(2) BACH, F. Convex relaxations of structured matrix factorizations. arXiv preprint arXiv:1309.3117 (2013).
(3) BACH, F., MAIRAL, J., AND PONCE, J. Convex sparse matrix factorizations. arXiv preprint arXiv:0812.1869 (2008).
(4) ELHAMIFAR, E., AND VIDAL, R. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 11 (2013), 2765–2781.
(5) HAEFFELE, B. D., AND VIDAL, R. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540 (2015).
(6) HAEFFELE, B. D., AND VIDAL, R. Structured low-rank matrix factorization: Global optimality, algorithms, and applications. arXiv preprint arXiv:1708.07850 (2017).
(7) HENDRICKX, J. M., AND OLSHEVSKY, A. Matrix p-norms are NP-hard to approximate if p ≠ 1, 2, ∞. SIAM Journal on Matrix Analysis and Applications 31, 5 (2010), 2802–2812.
(8) LIU, G., LIN, Z., AND YU, Y. Robust subspace segmentation by low-rank representation. In Proceedings of the 27th International Conference on Machine Learning (ICML-10) (2010), pp. 663–670.
(9) OLSHAUSEN, B. A., AND FIELD, D. J. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research 37, 23 (1997), 3311–3325.
(10) SCHWAB, E., HAEFFELE, B., CHARON, N., AND VIDAL, R. Separable dictionary learning with global optimality and applications to diffusion MRI. arXiv preprint arXiv:1807.05595 (2018).
(11) SUN, J., QU, Q., AND WRIGHT, J. Complete dictionary recovery over the sphere. arXiv preprint arXiv:1504.06785 (2015).
(12) VIDAL, R., MA, Y., AND SASTRY, S. S. Generalized Principal Component Analysis, vol. 5. Springer, 2016.
(13) ZHU, Z., LI, Q., TANG, G., AND WAKIN, M. B. Global optimality in low-rank matrix optimization. IEEE Transactions on Signal Processing 66, 13 (2018), 3614–3628.