SLIDE 1

Robust Principal Component Pursuit via Alternating Minimization Scheme on Matrix Manifolds

Tao Wu

Institute for Mathematics and Scientific Computing Karl-Franzens-University of Graz

joint work with Prof. Michael Hintermüller


SLIDE 2

Low-rank paradigm.

Low-rank matrices arise in one way or another:

◮ low-degree statistical processes; e.g. collaborative filtering, latent semantic indexing.

◮ regularization on complex objects; e.g. manifold learning, metric learning.

◮ approximation of compact operators; e.g. proper orthogonal decomposition.

Fig.: Collaborative filtering (courtesy of wikipedia.org).


SLIDE 3

Robust principal component pursuit.

◮ The sparse component corresponds to pattern-irrelevant outliers.

◮ Robustifies classical principal component analysis.

◮ Carries important information in certain applications; e.g. moving objects in surveillance video.

◮ Robust principal component pursuit:
$$\underbrace{Z}_{\text{data}} = \underbrace{A}_{\text{low-rank}} + \underbrace{B}_{\text{sparse}} + \underbrace{N}_{\text{noise}}$$

◮ Introduced in [Candès, Li, Ma, and Wright, '11], [Chandrasekaran, Sanghavi, Parrilo, and Willsky, '11].


SLIDE 4

Convex-relaxation approach.

◮ A popular (convex) variational model:
$$\min_{A,B}\ \|A\|_{\text{nuclear}} + \lambda\|B\|_{\ell^1} \quad \text{s.t.} \quad \|A + B - Z\| \le \varepsilon.$$

◮ Considered in [Candès, Li, Ma, and Wright, '11], [Chandrasekaran, Sanghavi, Parrilo, and Willsky, '11], ...

◮ $\mathrm{rank}(A)$ relaxed by the nuclear norm; $\|B\|_0$ relaxed by the $\ell^1$-norm.

◮ Numerical solvers: proximal gradient method, augmented Lagrangian method, ... Efficiency is constrained by the SVD in full dimension required at each iteration (see the sketch below).
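To make the bottleneck concrete, here is a minimal sketch (Python with numpy; the function name and the threshold parameter `tau` are illustrative, not from the talk) of the singular value thresholding step, i.e. the nuclear-norm proximal operator such solvers evaluate at every iteration:

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: prox of tau * nuclear norm.

    The full SVD below is what dominates the per-iteration cost
    for large matrices."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt  # soft-threshold singular values
```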


SLIDE 5

Manifold-constrained least-squares model.

◮ Our variational model:
$$\min\ \tfrac12\|A + B - Z\|^2 \quad \text{s.t.} \quad A \in \mathcal{M}(r) := \{A \in \mathbb{R}^{m\times n} : \mathrm{rank}(A) \le r\},\quad B \in \mathcal{N}(s) := \{B \in \mathbb{R}^{m\times n} : \|B\|_0 \le s\}.$$

◮ Our goal is to develop an algorithm that:

◮ globally converges to a stationary point (often a local minimizer);

◮ provides exact decomposition with high probability for noiseless data;

◮ outperforms solvers based on the convex-relaxation approach, especially at large scales.


SLIDE 6

Existence of solution and optimality condition.

◮ A little quadratic regularization ($0 < \mu \ll 1$) is included for the (theoretical) sake of existence of a solution; i.e.
$$\min f(A, B) := \tfrac12\|A + B - Z\|^2 + \tfrac{\mu}{2}\|A\|^2 \quad \text{s.t.} \quad (A, B) \in \mathcal{M}(r) \times \mathcal{N}(s).$$
In numerics, choosing $\mu = 0$ seems fine.

◮ Stationarity condition as variational inequalities:
$$\langle \Delta, (1 + \mu)A^* + B^* - Z \rangle \ge 0 \quad \text{for any } \Delta \in T_{\mathcal{M}(r)}(A^*),$$
$$\langle \Delta, A^* + B^* - Z \rangle \ge 0 \quad \text{for any } \Delta \in T_{\mathcal{N}(s)}(B^*).$$
$T_{\mathcal{M}(r)}(A^*)$ and $T_{\mathcal{N}(s)}(B^*)$ refer to tangent cones.


SLIDE 7

Constraints as Riemannian manifolds.

◮ $\mathcal{M}(r)$ is a Riemannian manifold around $A^*$ if $\mathrm{rank}(A^*) = r$; $\mathcal{N}(s)$ is a Riemannian manifold around $B^*$ if $\|B^*\|_0 = s$.

◮ The optimality condition reduces to:
$$P_{T_{\mathcal{M}(r)}(A^*)}\big((1+\mu)A^* + B^* - Z\big) = 0, \qquad P_{T_{\mathcal{N}(s)}(B^*)}\big(A^* + B^* - Z\big) = 0,$$
where $P_{T_{\mathcal{M}(r)}(A^*)}$ and $P_{T_{\mathcal{N}(s)}(B^*)}$ are orthogonal projections onto subspaces.

◮ Tangent space formulae:
$$T_{\mathcal{M}(r)}(A^*) = \big\{UMV^\top + U_pV^\top + UV_p^\top : A^* = U\Sigma V^\top \text{ as compact SVD},\ M \in \mathbb{R}^{r\times r},\ U_p \in \mathbb{R}^{m\times r},\ U_p^\top U = 0,\ V_p \in \mathbb{R}^{n\times r},\ V_p^\top V = 0\big\},$$
$$T_{\mathcal{N}(s)}(B^*) = \{\Delta \in \mathbb{R}^{m\times n} : \mathrm{supp}(\Delta) \subset \mathrm{supp}(B^*)\}.$$


SLIDE 8

A conceptual alternating minimization scheme.

Initialize $A^0 \in \mathcal{M}(r)$, $B^0 \in \mathcal{N}(s)$. Set $k := 0$ and iterate (see the sketch below):

1. $A^{k+1} \approx \arg\min_{A \in \mathcal{M}(r)} \tfrac12\|A + B^k - Z\|^2 + \tfrac{\mu}{2}\|A\|^2$.
2. $B^{k+1} \approx \arg\min_{B \in \mathcal{N}(s)} \tfrac12\|A^{k+1} + B - Z\|^2$.
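A minimal sketch of this outer loop (Python with numpy; `prox_rank_r` and `prox_card_s` are the global metric projections discussed on the following slides, and the stopping rule and parameter names are illustrative assumptions):

```python
import numpy as np

def prox_rank_r(X, r):
    """Metric projection onto M(r): best rank-r approximation (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def prox_card_s(X, s):
    """Metric projection onto N(s): keep the s largest-magnitude entries."""
    B = np.zeros_like(X)
    idx = np.unravel_index(np.argsort(np.abs(X), axis=None)[-s:], X.shape)
    B[idx] = X[idx]
    return B

def alternating_minimization(Z, r, s, mu=0.0, max_iter=100, tol=1e-8):
    A, B = prox_rank_r(Z, r), np.zeros_like(Z)
    for _ in range(max_iter):
        # step 1: the low-rank subproblem reduces to P_{M(r)}((Z - B)/(1 + mu))
        A_new = prox_rank_r((Z - B) / (1.0 + mu), r)
        # step 2: the sparse subproblem reduces to P_{N(s)}(Z - A_new)
        B_new = prox_card_s(Z - A_new, s)
        done = np.linalg.norm(A_new - A) + np.linalg.norm(B_new - B) < tol
        A, B = A_new, B_new
        if done:
            break
    return A, B
```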

Theorem (sufficient decrease + stationarity ⇒ convergence)

Let $\{(A^k, B^k)\}$ be generated as above. Suppose that there exist $\delta > 0$, $\varepsilon_a^k \downarrow 0$, and $\varepsilon_b^k \downarrow 0$ such that for all $k$:
$$f(A^{k+1}, B^k) \le f(A^k, B^k) - \delta\|A^{k+1} - A^k\|^2,$$
$$f(A^{k+1}, B^{k+1}) \le f(A^{k+1}, B^k) - \delta\|B^{k+1} - B^k\|^2,$$
$$\langle \Delta, (1+\mu)A^{k+1} + B^k - Z \rangle \ge -\varepsilon_a^k\|\Delta\| \quad \text{for any } \Delta \in T_{\mathcal{M}(r)}(A^{k+1}),$$
$$\langle \Delta, A^{k+1} + B^{k+1} - Z \rangle \ge -\varepsilon_b^k\|\Delta\| \quad \text{for any } \Delta \in T_{\mathcal{N}(s)}(B^{k+1}).$$

Then any non-degenerate limit point $(A^*, B^*)$, i.e. $\mathrm{rank}(A^*) = r$ and $\|B^*\|_0 = s$, satisfies the first-order optimality condition.


SLIDE 9

Sparse matrix subproblem.

◮ The global solution $P_{\mathcal{N}(s)}(Z - A^{k+1})$ (as metric projection) can be efficiently calculated by sorting; see the sketch after this list.

◮ The global solution may not necessarily fulfill the sufficient decrease condition.

◮ Whenever necessary, safeguard by a local solution:
$$B^{k+1}_{ij} = \begin{cases} (Z - A^{k+1})_{ij}, & \text{if } B^k_{ij} \ne 0, \\ 0, & \text{otherwise}. \end{cases}$$

◮ Given non-degeneracy of $B^{k+1}$, i.e. $\|B^{k+1}\|_0 = s$, exact stationarity holds.
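A minimal sketch of this subproblem step (Python with numpy; the function name, the objective callable `f`, and the constant `delta` are illustrative, not from the talk):

```python
import numpy as np

def sparse_step(Z, A_next, B_prev, s, f, delta=1e-4):
    """Global projection onto N(s) by sorting; fall back to the
    support-preserving local solution if sufficient decrease fails."""
    R = Z - A_next
    # global solution: keep the s largest-magnitude entries of the residual
    B_glob = np.zeros_like(R)
    idx = np.unravel_index(np.argsort(np.abs(R), axis=None)[-s:], R.shape)
    B_glob[idx] = R[idx]
    decrease = f(A_next, B_prev) - f(A_next, B_glob)
    if decrease >= delta * np.linalg.norm(B_glob - B_prev) ** 2:
        return B_glob
    # safeguard: local solution supported on supp(B_prev)
    return np.where(B_prev != 0, R, 0.0)
```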


SLIDE 10

Low-rank matrix subproblem: a Riemannian perspective.

◮ Global solution $P_{\mathcal{M}(r)}\big(\tfrac{1}{1+\mu}(Z - B^k)\big)$ as metric projection:

◮ available due to the Eckart-Young theorem; i.e.
$$\frac{1}{1+\mu}(Z - B^k) = \sum_{j=1}^{n} \sigma_j u_j v_j^\top \;\Rightarrow\; P_{\mathcal{M}(r)}\Big(\frac{1}{1+\mu}(Z - B^k)\Big) = \sum_{j=1}^{r} \sigma_j u_j v_j^\top.$$

◮ but requires an SVD in full dimension, which is expensive for large-scale problems (e.g. $m, n \ge 2000$).

◮ Alternatively resolved by a single Riemannian optimization step on the matrix manifold; the global projection is sketched after this list for reference.

◮ Riemannian optimization has been applied to low-rank matrix/tensor problems; see [Simonsson and Eldén, '10], [Savas and Lim, '10], [Vandereycken, '13], ...

◮ Our goal: the subproblem solver should activate the convergence criteria, i.e. sufficient decrease + stationarity.
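For reference, a minimal sketch of the global metric projection (Python with scipy; `svds` extracts only the $r$ leading singular triplets of the full-dimensional matrix, so it mitigates but does not remove the SVD cost at this scale):

```python
import numpy as np
from scipy.sparse.linalg import svds

def project_rank_r(X, r):
    """P_{M(r)}(X) via the r leading singular triplets (Eckart-Young)."""
    U, s, Vt = svds(X, k=r)  # Lanczos-based partial SVD of the full matrix
    return (U * s) @ Vt

# low-rank subproblem, global solution: A_next = project_rank_r((Z - B_k)/(1 + mu), r)
```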


SLIDE 11

Riemannian optimization: an overview.

Fig.: a smooth objective $f : \mathcal{M} \to \mathbb{R}$ defined on a Riemannian manifold $\mathcal{M}$.

◮ References: [Smith, ’93], [Edelman, Arias, and Smith, ’98],

[Absil, Mahony, and Sepulchre, ’08], ...

◮ Why Riemannian optimization?

◮ Local homeomorphism is computationally infeasible/expensive.

◮ Intrinsically low dimensionality of the underlying manifold.

◮ Further dimension reduction via quotient manifold.

◮ Typical Riemannian manifolds in applications:

◮ finite-dimensional (matrix manifold): Stiefel manifold, Grassmann manifold, fixed-rank matrix manifold, ...

◮ infinite-dimensional: shape/curve spaces, ...

SLIDE 12

Riemannian optimization: a conceptual algorithm.

Fig.: retraction on $\bar{\mathcal{M}}(r)$: a tangent step $\Delta^k \in T_{\bar{\mathcal{M}}(r)}(A^k)$ is mapped back onto the manifold by $\mathrm{retract}_{\bar{\mathcal{M}}(r)}(A^k, \Delta^k)$ (figure after [Absil, Mahony, and Sepulchre, '08]).

At the current iterate:

1. Build a quadratic model in the tangent space using the Riemannian gradient and Riemannian Hessian.
2. Based on the quadratic model, build a tangential search path.
3. Perform a backtracking path search via retraction to determine the step size.
4. Generate the next iterate.


SLIDE 13

Riemannian gradient and Hessian.

$\bar{\mathcal{M}}(r) := \{A : \mathrm{rank}(A) = r\}$; $f_A^k : A \in \bar{\mathcal{M}}(r) \mapsto f(A, B^k)$.

◮ The Riemannian gradient, $\mathrm{grad} f_A^k(A) \in T_{\bar{\mathcal{M}}(r)}(A)$, is defined s.t.
$$\langle \mathrm{grad} f_A^k(A), \Delta \rangle = Df_A^k(A)[\Delta], \quad \forall \Delta \in T_{\bar{\mathcal{M}}(r)}(A);$$
explicitly, $\mathrm{grad} f_A^k(A) = P_{T_{\bar{\mathcal{M}}(r)}(A)}(\nabla f_A^k(A))$.

◮ The Riemannian Hessian, $\mathrm{Hess} f_A^k(A) : T_{\bar{\mathcal{M}}(r)}(A) \to T_{\bar{\mathcal{M}}(r)}(A)$, is defined s.t. $\mathrm{Hess} f_A^k(A)[\Delta] = \nabla_\Delta \mathrm{grad} f_A^k(A)$, $\forall \Delta \in T_{\bar{\mathcal{M}}(r)}(A)$; explicitly,
$$\mathrm{Hess} f_A^k(A)[\Delta] = (I - UU^\top)\nabla f_A^k(A)(I - VV^\top)\Delta^\top U\Sigma^{-1}V^\top + U\Sigma^{-1}V^\top\Delta^\top(I - UU^\top)\nabla f_A^k(A)(I - VV^\top) + (1 + \mu)\Delta.$$

See, e.g., [Vandereycken, '12].


SLIDE 14

Dogleg search path and projective retraction.

Fig.: (left) trust region in the tangent space, with the dogleg path interpolating between the unconstrained minimizer along $-g$ and the full step, approximating the optimal trajectory $\Delta(\sigma)$; (right) retraction of the tangent step $\Delta^k$ back onto $\bar{\mathcal{M}}(r)$.

◮ The “dogleg” path $\Delta^k(\tau^k)$ approximates the optimal trajectory of the tangential trust-region subproblem (left figure):
$$\min\ f_A^k(A^k) + \langle g^k, \Delta \rangle + \tfrac12\langle \Delta, H^k[\Delta] \rangle \quad \text{s.t.} \quad \Delta \in T_{\bar{\mathcal{M}}(r)}(A^k),\ \|\Delta\| \le \sigma.$$

◮ Metric projection as retraction (right figure):
$$\mathrm{retract}_{\bar{\mathcal{M}}(r)}(A^k, \Delta^k(\tau^k)) = P_{\bar{\mathcal{M}}(r)}(A^k + \Delta^k(\tau^k)).$$
Computationally efficient: only a “reduced” SVD of a $2r \times 2r$ matrix is needed (see the sketch below)!
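A minimal sketch of this retraction (Python with numpy; the factored tangent-vector representation $\Delta = UMV^\top + U_pV^\top + UV_p^\top$ is taken from the tangent-space formula on SLIDE 7, while the function signature and variable names are ours):

```python
import numpy as np

def retract(U, sigma, V, M, Up, Vp, r):
    """Project A + Delta onto the rank-r manifold, where
    A = U diag(sigma) V^T and Delta = U M V^T + Up V^T + U Vp^T.
    Only two thin QRs and an SVD of a 2r-by-2r core are needed."""
    Qu, Ru = np.linalg.qr(Up)  # Up = Qu @ Ru, with Qu orthogonal to U
    Qv, Rv = np.linalg.qr(Vp)  # Vp = Qv @ Rv, with Qv orthogonal to V
    # A + Delta = [U Qu] @ K @ [V Qv]^T with the small 2r-by-2r core K:
    K = np.block([[np.diag(sigma) + M, Rv.T],
                  [Ru,                 np.zeros((r, r))]])
    Uk, sk, Vkt = np.linalg.svd(K)  # SVD of the 2r-by-2r core only
    U_new = np.hstack([U, Qu]) @ Uk[:, :r]
    V_new = np.hstack([V, Qv]) @ Vkt[:r, :].T
    return U_new, sk[:r], V_new  # factors of the retracted rank-r iterate
```

Since $[U\ Q_u]$ and $[V\ Q_v]$ have orthonormal columns, truncating the SVD of the small core gives exactly the best rank-$r$ approximation of $A^k + \Delta^k$, with no full-dimension SVD.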


SLIDE 15

Low-rank matrix subproblem: projected dogleg step.

Given $A^k \in \bar{\mathcal{M}}(r)$, $B^k \in \mathcal{N}(s)$:

1. Compute $g^k$, $H^k$, and build the dogleg search path $\Delta^k(\tau^k)$ in $T_{\bar{\mathcal{M}}(r)}(A^k)$.
2. Whenever non-positive definiteness of $H^k$ is detected, replace the dogleg search path by the line search path along the steepest descent direction, i.e. $\Delta^k(\tau^k) = -\tau^k g^k$.
3. Perform a backtracking path/line search; i.e. find the largest step size $\tau^k \in \{2, 3/2, 1, 1/2, 1/4, 1/8, \ldots\}$ s.t. the sufficient decrease condition is satisfied:
$$f_A^k(A^k) - f_A^k(P_{\bar{\mathcal{M}}(r)}(A^k + \Delta^k(\tau^k))) \ge \delta\|A^k - P_{\bar{\mathcal{M}}(r)}(A^k + \Delta^k(\tau^k))\|^2.$$
4. Return $A^{k+1} = P_{\bar{\mathcal{M}}(r)}(A^k + \Delta^k(\tau^k))$.

A sketch of the backtracking search follows below.
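A minimal sketch of the backtracking search in step 3 (Python with numpy; `f`, `path`, and `retract_fn` are hypothetical callables standing in for $f_A^k$, the path $\tau \mapsto \Delta^k(\tau)$, and the projective retraction):

```python
import numpy as np

def backtracking_search(A, f, path, retract_fn, delta=1e-4):
    """Try step sizes 2, 3/2, 1, 1/2, 1/4, ... until the sufficient
    decrease condition holds; return the retracted next iterate."""
    f_A = f(A)
    taus = [2.0, 1.5] + [2.0 ** (-i) for i in range(30)]
    for tau in taus:
        A_trial = retract_fn(A + path(tau))  # metric-projection retraction
        if f_A - f(A_trial) >= delta * np.linalg.norm(A - A_trial) ** 2:
            return A_trial
    raise RuntimeError("sufficient decrease not reached")
```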


SLIDE 16

Low-rank matrix subproblem: convergence theory.

◮ Backtracking path search:

◮ The sufficient decrease condition can always be fulfilled after finitely many trials on $\tau^k$.

◮ Any accumulation point of $\{A^k\}$ is stationary.

◮ Further assume $\mathrm{Hess} f(A^*, B^*)\big|_{\mu=0} \succ 0$ at a non-degenerate accumulation point $(A^*, B^*)$. Then:

◮ Tangent-space transversality holds, i.e.
$$T_{\bar{\mathcal{M}}(r)}(A^*) \cap T_{\mathcal{N}(s)}(B^*) = \{0\}.$$

◮ Contractivity of $P_{T_{\bar{\mathcal{M}}(r)}(A^*)} \circ P_{T_{\mathcal{N}(s)}(B^*)}$: $\exists \kappa \in [0, 1)$ s.t.
$$\|(P_{T_{\bar{\mathcal{M}}(r)}(A^*)} \circ P_{T_{\mathcal{N}(s)}(B^*)})(\Delta)\| \le \kappa\|\Delta\|.$$

◮ q-linear convergence of $\{A^k\}$ towards stationarity:
$$\limsup_{k \to \infty} \frac{\|A^{k+1} - A^*\|}{\|A^k - A^*\|} \le \kappa.$$


SLIDE 17

Numerical implementation.

◮ Trimming: adaptive tuning of rank $r^{k+1}$ and cardinality $s^{k+1}$ based on the current iterate $(A^k, B^k)$:

◮ k-means clustering on the (nonzero) singular values of $A^k$ in logarithmic scale (see the sketch below);

◮ hard thresholding on the entries of $B^k$.
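A minimal sketch of the rank-trimming heuristic (Python with numpy/scipy; the two-cluster k-means on log-scale singular values follows the bullet above, while the function name, tolerance, and tie-breaking rule are our assumptions, not the talk's exact recipe):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def trim_rank(A, tol=1e-12):
    """2-means clustering of log10 singular values; the cluster with the
    larger centroid is taken as significant, its size as the trimmed rank."""
    s = np.linalg.svd(A, compute_uv=False)
    logs = np.log10(s[s > tol]).reshape(-1, 1)
    centroids, labels = kmeans2(logs, 2, minit="++", seed=0)
    significant = np.argmax(centroids.ravel())
    return int(np.sum(labels == significant))  # trimmed rank r^{k+1}
```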

◮ q-linear convergence confirmed numerically:

Fig.: semilogarithmic convergence histories: (a) convergence of $\{A^k\}$, $\|A^k - A^*\|/\|A^*\|$ vs. iteration; (b) convergence of $\{B^k\}$, $\|B^k - B^*\|/\|B^*\|$ vs. iteration.


SLIDE 18

Comparison with augmented Lagrangian method (m = n = 2000).

Fig.: performance vs. CPU time for AMS, AMS#, fSVD-ALM, and pSVD-ALM: (a) relative error of $\{A^k\}$; (b) relative error of $\{B^k\}$; (c) phase transition of $\{A^k\}$ ($\mathrm{rank}(A^k)/n$); (d) phase transition of $\{B^k\}$ ($\|B^k\|_0/n^2$).


SLIDE 19

Application to surveillance video.

◮ Problem settings:

◮ A sequence of 200 frames taken from a surveillance video at an airport.

◮ Each frame is a gray image of resolution 144 × 176.

◮ Stack the 3D array into a 25344 × 200 matrix; a minimal sketch follows below.
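A minimal sketch of the stacking step (Python with numpy; `frames` is a hypothetical stand-in for the decoded 200-frame grayscale sequence):

```python
import numpy as np

# hypothetical stand-in for 200 decoded grayscale frames of size 144 x 176
frames = np.random.rand(200, 144, 176)

# vectorize each frame into one column: Z becomes the 25344 x 200 data matrix
Z = frames.reshape(200, -1).T
assert Z.shape == (144 * 176, 200)  # (25344, 200)
```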

◮ Results:

◮ CPU time: AMS 39.4 s; ALM 124.4 s.

◮ Visual comparison.