Online robust matrix factorization for dependent data streams - PowerPoint PPT Presentation


SLIDE 1

Online robust matrix factorization for dependent data streams

Hanbaek Lyu

Department of Mathematics, University of California, Los Angeles

Seminar on applied math and data science, HKUST

Joint work with HanQin Cai and Deanna Needell

  • Mar. 24, 2019

Hanbaek Lyu (UCLA) Online robust matrix factorization for dependent data streams

SLIDE 2

Overview

  • 1. Introduction
  • 2. ORMF algorithm and convergence result
  • 3. Applications: Dictionary learning from networks

SLIDE 3

  • 1. Introduction

SLIDE 6

Introduction

Learning parts of images – Image reconstruction

[Figure: original image, Cycle (1938) by M.C. Escher; 11 by 11 dictionary (basis) learned from Cycle; reconstructed image using the learned dictionary]

◮ Dictionary learning enables a compressed representation of complex objects using a few dictionary elements.

◮ Used in data compression, reconstruction, transfer learning, etc.

◮ Image reconstruction = (local approximation by dictionary) + (averaging)

SLIDE 7

Introduction

Simultaneous dictionary learning and outlier detection

[Figure: corrupted image; detected outliers; image reconstructed by ORNMF; learned dictionary]

◮ What defines an outlier? How can we detect them?

◮ Low-rank based approach – outlier = data − low-rank approximation

◮ Dictionary-based approach – outlier = data − reconstruction from dictionary

◮ Dictionary learning has to be done in a robust way
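The dictionary-based rule above (outlier = data − reconstruction from dictionary) amounts to thresholding the reconstruction residual. A minimal sketch of my own, not code from the talk; the function name and threshold `tau` are made up:

```python
import numpy as np

def detect_outliers(X, W, H, tau):
    """Flag entries of X whose residual from the dictionary
    reconstruction W @ H exceeds tau in absolute value."""
    residual = X - W @ H
    return np.abs(residual) > tau

# One corrupted entry on top of an exactly representable signal.
W = np.array([[1.0], [1.0]])    # d x r dictionary
H = np.array([[3.0, 3.0]])      # r x n code
X = W @ H
X[0, 1] += 10.0                 # inject an outlier
mask = detect_outliers(X, W, H, tau=1.0)  # only (0, 1) is flagged
```

In practice the threshold is replaced by the l1 penalty on S in the robust objective, which performs this thresholding implicitly.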

SLIDE 10

Introduction

◮ Matrix factorization is a fundamental tool in dictionary learning problems.

[Figure: data X ≅ dictionary W × code H; NMF example factorizing a 441 × 10000 matrix of sampled image patches into a rank-r basis and codes]

◮ Formulated as an optimization problem:

minimize ‖X − WH‖_F² + λ₁‖H‖₁ (reconstruction error)
subject to W ∈ C, H ∈ C′ (constraints)

◮ Non-convex optimization problem → no guarantee of global convergence

SLIDE 13

Introduction

◮ Robust matrix factorization enables simultaneous dictionary learning and outlier detection.

[Figure: data X ≅ dictionary W × code H + sparse outlier matrix S; original image vs. reconstructed image]

◮ Formulated as an optimization problem:

minimize ‖X − WH − S‖_F² + λ₁‖H‖₁ + λ₂‖S‖₁ (reconstruction error)
subject to W ∈ C, H ∈ C′ (constraints)

◮ Non-convex optimization problem → no guarantee of global convergence
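For concreteness, the robust objective can be evaluated directly for given factors. The sketch below is mine (function and parameter names `rmf_objective`, `lam1`, `lam2` are illustrative, not from the talk) and shows how the penalty splits between the code H and the sparse outlier matrix S:

```python
import numpy as np

def rmf_objective(X, W, H, S, lam1=1.0, lam2=1.0):
    """Evaluate ||X - W@H - S||_F^2 + lam1*||H||_1 + lam2*||S||_1."""
    residual = X - W @ H - S
    return (residual ** 2).sum() + lam1 * np.abs(H).sum() + lam2 * np.abs(S).sum()

# Tiny example: X is exactly W@H plus a single sparse outlier entry.
W = np.array([[1.0], [0.0]])             # d x r dictionary (d=2, r=1)
H = np.array([[2.0, 0.0]])               # r x n code (n=2)
S = np.array([[0.0, 0.0], [5.0, 0.0]])   # sparse outlier matrix
X = W @ H + S
val = rmf_objective(X, W, H, S, lam1=1.0, lam2=1.0)
# residual is zero, so the value reduces to lam1*||H||_1 + lam2*||S||_1 = 2 + 5 = 7
```

Setting S = 0 forces the outlier into the residual term instead, which is exactly why the joint minimization separates gross corruption from the low-rank part.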

SLIDE 14

Introduction

Matrix factorization – other examples

◮ Singular Value Decomposition (SVD):

minimize over W ∈ R^{d×r}, H ∈ R^{r×n}: ‖X − WH‖_F

◮ Non-negative Matrix Factorization (NMF):

minimize over W ∈ R^{d×r}_{≥0}, H ∈ R^{r×n}_{≥0}: ‖X − WH‖_F

  • Corresponding dictionary columns can be interpreted as ‘parts’ of the data matrix (Lee, Seung ’99 [lee1999learning])

◮ Subspace Clustering (may have r > d):

minimize over W ∈ R^{d×r}, H group sparse: ‖X − WH‖_F

Matrix Completion, Probabilistic PCA, Sparse PCA, Robust PCA, Poisson PCA, Heteroscedastic PCA, Bilinear Inverse Problems, Robust NMF, Max-Plus Factorization, ...

SLIDE 15

Introduction

Illustration of RMF application to images

[Figure: data X ≅ dictionary W × code H + outlier matrix; RNMF of a matrix of sampled square patches (rows = pixels, columns = number of square patches sampled) into a rank-r basis, codes, and outliers]

SLIDE 19

Introduction

Online RMF

◮ The data matrix could be too large to be loaded in memory or processed at once.

◮ Only sub-matrices of a huge data set may be available through sampling.

◮ We may want to learn from a complicated probability distribution on the sample space of data – e.g., a posterior distribution.

◮ The Online Matrix Factorization (OMF) problem concerns a similar matrix factorization problem for a sequence of input matrices (X_t)_{t≥0}.

[Figure: a data stream over successive times – underlying information, observed data X, dictionary Y, and (code, noise) Z]

SLIDE 20

Introduction

Reminder of matrix factorization

◮ Robust Matrix Factorization:

[Figure: data X ≅ dictionary W × code H + outlier S; original vs. reconstructed image]

minimize ‖X − WH − S‖_F² + λ₁‖H‖₁ + λ₂‖S‖₁ (reconstruction error)
subject to W ∈ C, H ∈ C′ (compact, convex)

◮ Online RMF for streaming data:

Learn a robust dictionary W from a sequence of data matrices (X_t)_{t≥0}.

SLIDE 21

  • 2. ORMF algorithm and convergence result

SLIDE 25

ORMF algorithm and convergence result

Online MF as empirical loss minimization

◮ Fix λ₁, λ₂ > 0 and define the quadratic loss function

ℓ(X, W) = inf over H ∈ C′ ⊆ R^{r×n}, S ∈ R^{d×n} of ‖X − WH − S‖_F² + λ₁‖H‖₁ + λ₂‖S‖₁

Define the expected loss and empirical loss functions

f(W) = E_{X∼π}[ℓ(X, W)],   f_t(W) = (1/t) Σ_{s=1}^t ℓ(X_s, W)

◮ If (X_t)_{t≥0} is i.i.d. with common distribution π, then by the SLLN,

lim_{t→∞} f_t(W) = f(W) a.s. for all W ∈ C.

◮ The same holds if (X_t)_{t≥0} is a Markov chain (irreducible, aperiodic, Harris recurrent), by the Markov chain ergodic theorem.

◮ Furthermore, for C compact, by Glivenko-Cantelli,

lim_{t→∞} sup_{W ∈ C} |f_t(W) − f(W)| = 0 a.s.
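The SLLN statement can be sanity-checked on a one-dimensional toy model where ℓ has a closed form. All specific choices below (w = 1, λ = 1, data uniform on {0, 2}, no outlier term) are mine, for illustration only:

```python
import numpy as np

def soft(x, tau):
    """Soft-thresholding operator."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def loss(x, w=1.0, lam=1.0):
    """Scalar analogue of l(X, W) = inf_h (x - w*h)^2 + lam*|h|."""
    h = soft(x * w, lam / 2.0) / (w * w)   # closed-form minimizer for scalar w
    return (x - w * h) ** 2 + lam * np.abs(h)

rng = np.random.default_rng(42)
xs = rng.choice([0.0, 2.0], size=20000)    # i.i.d. stream X_1, X_2, ...
f_t = np.mean([loss(x) for x in xs])       # empirical loss at w = 1
f_expected = 0.5 * loss(0.0) + 0.5 * loss(2.0)  # expected loss f(1) = 0.875
# By the SLLN, f_t is close to f_expected for large t.
```

The same Monte Carlo average converges for an ergodic Markovian stream as well, which is the regime the talk targets.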

SLIDE 27

ORMF algorithm and convergence result

Online MF as empirical loss minimization

◮ Empirical loss (risk) minimization for Online RMF:

Input: (Markovian) sequence of data matrices (X_t)_{t≥0}, X_t ∼ π.
Objective: W_t = argmin_{W ∈ C} f_t(W)

◮ But how do we minimize the empirical loss f_t?

  • f_t is non-convex
  • Each ℓ(X_s, W) involves a separate optimization
  • Need to store all data X₁, ⋯, X_t.

SLIDE 31

ORMF algorithm and convergence result

Asymptotic solution minimizing surrogate loss function

◮ Online surrogate optimization algorithm. Given X_t:

(H_t, S_t) = argmin over H ∈ C′, S of ‖X_t − W_{t−1}H − S‖_F² + λ₁‖H‖₁ + λ₂‖S‖₁

W_t = argmin_{W ∈ C} f̂_t(W), where f̂_t(W) is a surrogate loss defined by

f̂_t(W) = (1/t) Σ_{s=1}^t ( ‖X_s − WH_s − S_s‖_F² + λ₁‖H_s‖₁ + λ₂‖S_s‖₁ )   ( ≥ f_t(W) )

◮ Recycle the previously found codes H₁, ⋯, H_t and outliers S₁, ⋯, S_t and use them as approximate solutions of the sub-problems.

◮ Block optimization + Majorization-Minimization (MM) + convex relaxation

◮ W_t = argmin_W tr(W A_t Wᵀ) − 2 tr(W B_t) for summary matrices A_t, B_t
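The dictionary update in the last bullet, minimizing tr(W A Wᵀ) − 2 tr(W B), can be sketched with projected gradient descent. This is an illustrative solver of my own (the talk does not specify one); I read the summary matrices as A = aggregate of H_s H_sᵀ and B = aggregate of H_s (X_s − S_s)ᵀ, which is an assumption, and nonnegativity stands in for a generic constraint set C:

```python
import numpy as np

def update_dictionary(W0, A, B, step=None, n_iter=500):
    """Minimize tr(W A W^T) - 2 tr(W B) over W >= 0 by projected gradient.

    A: r x r summary matrix (assumed aggregate of H_s H_s^T),
    B: r x d summary matrix (assumed aggregate of H_s (X_s - S_s)^T).
    The gradient of the objective in W is 2*(W @ A - B.T).
    """
    if step is None:
        step = 1.0 / (2 * np.linalg.norm(A, 2) + 1e-12)  # safe step size
    W = W0.copy()
    for _ in range(n_iter):
        grad = 2 * (W @ A - B.T)
        W = np.maximum(W - step * grad, 0.0)  # project onto nonnegative orthant
    return W

# Sanity check: with A = I, the (nonnegative) minimizer is W = B.T.
r, d = 3, 4
A = np.eye(r)
B = np.arange(r * d, dtype=float).reshape(r, d) / 10.0
W = update_dictionary(np.zeros((d, r)), A, B)
```

Because A_t and B_t are small (r × r and r × d), this step has constant memory cost regardless of how many matrices X_s have streamed by, which is the point of the surrogate formulation.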

SLIDE 34

ORMF algorithm and convergence result

Solving the joint sparse coding problem

◮ We solve the following joint sparse coding problem by proximal gradient:

(H_t, S_t) = argmin over H ∈ C′, S of ‖X_t − W_{t−1}H − S‖_F² + λ₁‖H‖₁ + λ₂‖S‖₁   (1)

◮ Fix W_{t−1} ∈ R^{d×r} and parameters α, β > 0. Define a d × (r + d) matrix

G_{t−1} = [W_{t−1}, βI_d].   (2)

Consider the following constrained LASSO problem:

V_t = argmin over V = [H; S′] with (H, S′) ∈ C_code × R^{d×n} of ‖X_t − G_{t−1}V‖_F² + α‖V‖₁.   (3)

◮ Equivalent to the original problem for the choice α = λ₁ and β = λ₁/λ₂:

‖X_t − G_{t−1}V‖_F² + α‖V‖₁ = ‖X_t − W_{t−1}H − βS′‖_F² + α‖H‖₁ + α‖S′‖₁
= ‖X_t − W_{t−1}H − S‖_F² + α‖H‖₁ + (α/β)‖S‖₁,

with the change of variable S = βS′ (note α/β = λ₂).
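The change of variables can be checked numerically, and the stacked LASSO solved by plain ISTA, i.e., proximal gradient with soft-thresholding. The solver below is a generic sketch of my own (it ignores the constraint set C_code; all names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n, beta = 5, 3, 4, 2.0
W = rng.standard_normal((d, r))
X = rng.standard_normal((d, n))
H = rng.standard_normal((r, n))
Sp = rng.standard_normal((d, n))       # S' in the slide; S = beta * S'

# The stacked problem matches the original one term by term:
G = np.hstack([W, beta * np.eye(d)])   # G = [W, beta*I_d], shape d x (r+d)
V = np.vstack([H, Sp])                 # V = [H; S'], shape (r+d) x n
lhs = np.linalg.norm(X - G @ V) ** 2
rhs = np.linalg.norm(X - W @ H - beta * Sp) ** 2

def ista(X, G, alpha, n_iter=300):
    """Minimize ||X - G V||_F^2 + alpha*||V||_1 by proximal gradient (ISTA)."""
    L = 2 * np.linalg.norm(G, 2) ** 2  # Lipschitz constant of the smooth part
    V = np.zeros((G.shape[1], X.shape[1]))
    for _ in range(n_iter):
        V = V - (2 / L) * G.T @ (G @ V - X)                      # gradient step
        V = np.sign(V) * np.maximum(np.abs(V) - alpha / L, 0.0)  # soft-threshold
    return V
```

Splitting the returned V back into its top r rows (H) and bottom d rows (S′), and rescaling by β, recovers the code and outlier matrices of problem (1).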

SLIDE 40

ORMF algorithm and convergence result

f = expected loss, f_t = empirical loss, f̂_t = surrogate loss

Theorem (Cai, Lyu, Needell ’20+)

Suppose (X_t)_{t≥0} is a hidden Markov chain (irreducible, aperiodic, finite state space). Let (W_t, H_t, S_t)_{t≥1} be a solution to the ORMF algorithm above.

(i) lim_{t→∞} E[f_t(W_t)] = lim_{t→∞} E[f̂_t(W_t)] < ∞.
(ii) f_t(W_t) − f̂_t(W_t) → 0 as t → ∞ almost surely.
(iii) W_t → set of critical points of f as t → ∞ almost surely.

◮ IDEA: condition on a distant past + control the conditional error by Markov chain mixing + control the unconditional error by a uniform functional CLT for Markov chains

◮ First convergence result for ORMF algorithms for Markovian input (even for i.i.d. input)

◮ A similar result was obtained for the non-robust version by Lyu, Needell, and Balzano ’19 for dependent data matrices.

◮ The first result of this kind was obtained for the non-robust version by MBPS ’10 for i.i.d. data matrices.

SLIDE 43

ORMF algorithm and convergence result

Notation

◮ Fix λ₁, λ₂ > 0 and define the quadratic loss function

ℓ(X, W) = inf over H ∈ C′ ⊆ R^{r×n}, S ∈ R^{d×n} of ‖X − WH − S‖_F² + λ₁‖H‖₁ + λ₂‖S‖₁,

and the expected and empirical loss functions

f(W) = E_{X∼π}[ℓ(X, W)],   f_t(W) = (1/t) Σ_{s=1}^t ℓ(X_s, W)

◮ Online surrogate optimization algorithm. Given X_t:

(H_t, S_t) = argmin over H ∈ C′, S of ‖X_t − W_{t−1}H − S‖_F² + λ₁‖H‖₁ + λ₂‖S‖₁

W_t = argmin_{W ∈ C} f̂_t(W), where f̂_t(W) is the surrogate loss

f̂_t(W) = (1/t) Σ_{s=1}^t ( ‖X_s − WH_s − S_s‖_F² + λ₁‖H_s‖₁ + λ₂‖S_s‖₁ )   ( ≥ f_t(W) )

◮ Want to show: W_t converges to the set of critical points of the expected loss f.

SLIDE 44

ORMF algorithm and convergence result

f = expected loss, f_t = empirical loss, f̂_t = surrogate loss

Proposition

(i) f̂_{t+1}(W_{t+1}) − f̂_t(W_t) ≤ (1/(t+1)) ( ℓ(X_{t+1}, W_t) − f_t(W_t) ).
(ii) 0 ≤ (1/(t+1)) ( f̂_t(W_t) − f_t(W_t) ) ≤ (1/(t+1)) ( ℓ(X_{t+1}, W_t) − f_t(W_t) ) + f̂_t(W_t) − f̂_{t+1}(W_{t+1}).

Sketch of main argument:

◮ Σ_{t=0}^∞ E[ ( (1/(t+1)) ( ℓ(X_{t+1}, W_t) − f_t(W_t) ) )⁺ ] < ∞ implies E[f̂_t(W_t)] converges.

◮ Σ_{t=0}^∞ E[ (1/(t+1)) ( f̂_t(W_t) − f_t(W_t) ) ] < ∞ implies f̂_t(W_t) − f_t(W_t) → 0 a.s.

◮ f_t ≤ f̂_t, W_t = argmin f̂_t, and f̂_t(W_t) − f_t(W_t) → 0 a.s. imply W_t → set of critical points of f a.s.

Suffices to show

Σ_{t=0}^∞ E[ | (1/(t+1)) ( ℓ(X_{t+1}, W_t) − f_t(W_t) ) | ] < ∞

SLIDE 49

ORMF algorithm and convergence result

Key estimate in the i.i.d. case

◮ Suffices to show Σ_{t=0}^∞ E[ | (1/(t+1)) ( ℓ(X_{t+1}, W_t) − f_t(W_t) ) | ] < ∞

◮ Suppose the data matrices X_t are i.i.d. and let F_t denote the information up to time t. Then

| E[ ℓ(X_{t+1}, W_t) − f_t(W_t) | F_t ] | = | E_{X∼π}[ℓ(X, W_t)] − f_t(W_t) | = | f(W_t) − f_t(W_t) | ≤ ‖f − f_t‖_∞

◮ ‖f − f_t‖_∞ → 0 by the Glivenko-Cantelli theorem (W ranges over a compact set)

◮ E[ t^{1/2} ‖f − f_t‖_∞ ] = O(1) by a uniform functional CLT

◮ Averaging over F_t, this gives

E[ | (1/(t+1)) ( ℓ(X_{t+1}, W_t) − f_t(W_t) ) | ] ≤ E[ E[ | ℓ(X_{t+1}, W_t) − f_t(W_t) | / (t+1) | F_t ] ] ≤ t^{−3/2} E[ t^{1/2} ‖f − f_t‖_∞ ] = O(t^{−3/2}).

SLIDE 53

ORMF algorithm and convergence result

Key estimate in the Markovian case

◮ If (X_t)_{t≥0} is Markovian, then in general

E[ ℓ(X_{t+1}, W) | F_t ] ≠ E_{X∼π}[ℓ(X, W)] = f(W).

◮ Instead, condition on a distant past F_{t−N} and see how much the chain mixes to the stationary distribution during [t − N, t]:

| E[ ℓ(X_{t+1}, W) | F_{t−N} ] − f(W) | ≤ 2 ‖ℓ(·, W)‖_∞ ‖P^{N+1}(x, ·) − π‖_TV.

◮ The TV distance decays exponentially in N.

◮ Choose N = N(t) appropriately and average over F_{t−N}.
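The exponential decay of the total-variation term can be seen on a toy two-state chain. This example is mine, for illustration; the decay rate 0.7 below is the second-largest eigenvalue of this particular transition matrix P:

```python
import numpy as np

# Two-state transition matrix; the stationary distribution pi solves pi @ P = pi.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2.0 / 3.0, 1.0 / 3.0])

dist = np.array([1.0, 0.0])   # start deterministically in state 0
tv = []
for _ in range(20):
    dist = dist @ P
    tv.append(0.5 * np.abs(dist - pi).sum())   # ||P^N(x, .) - pi||_TV

# The error vector stays along the eigenvector of the second eigenvalue (0.7),
# so tv[N] shrinks geometrically: tv[N+1] / tv[N] = 0.7 at every step.
ratios = [tv[k + 1] / tv[k] for k in range(10)]
```

Choosing N(t) to grow like log t then makes the mixing term polynomially small, which is what the summability argument needs.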

SLIDE 54

  • 3. Applications: Dictionary learning from Facebook networks

SLIDE 55

Applications: Dictionary learning from networks

Facebook100 network data - UCLA26

◮ Traud, Mucha, Porter ’12
◮ Snapshot of the UCLA Facebook network in Sep. 2005
◮ (i, j)-entry = 1(user i and j are friends)
◮ Number of nodes = 20467
◮ Number of edges = 747613
◮ Edge density = 0.00357
◮ Figure shows only the network on the first 3000 nodes

SLIDE 56

Applications: Dictionary learning from networks

Facebook100 network data - Caltech36

◮ Traud, Mucha, Porter ’12
◮ Snapshot of the Caltech Facebook network in Sep. 2005
◮ (i, j)-entry = 1(user i and j are friends)
◮ Number of nodes = 769
◮ Number of edges = 8328
◮ Edge density = 0.05640

SLIDE 58

Applications: Dictionary learning from networks

[Figure: Network Dictionary Learning pipeline - MCMC motif sampling (Memoli, Lyu, Sivakoff ’19+) produces subgraph data, and Online Matrix Factorization for Markovian data (Lyu, Needell, Balzano ’19+) learns a low-rank network dictionary; shown for the UCLA26 and Caltech36 Facebook networks, alongside the image dictionary learned from Cycle (1938) by M.C. Escher]

Main question: Can we learn parts of networks like we do for images?

Answer: Network Dictionary Learning (Lyu, Needell, and Balzano ’19)

◮ Theoretical background: MCMC, motif sampling, Markov chains, optimization, Online Matrix Factorization.

◮ Applications: network + (compression, completion, comparison, classification, visualization, inference)

SLIDE 59

Applications: Dictionary learning from networks

MCMC motif sampling + OMF dictionary learning

[Figure: MCMC motif sampling from the UCLA Facebook network yields minibatches of subgraph patterns; Online Nonnegative Matrix Factorization learns a limiting 11 by 11 network dictionary, which is used to reconstruct the original network]

slide-60
SLIDE 60

Applications: Dictionary learning from networks

Network Dictionary Learning – UCLA26

[Figure: original UCLA26 FB network (left); 25 dictionary atoms of size 21 learned from the UCLA26 FB network (right).]

◮ Extract k-node subgraph patterns by k-chain motif sampling from UCLA26.
◮ Let k = 21, so that dim(all subgraph patterns) = (21 choose 2) − 20 = 190.
◮ On the right: rank-25 (approximate) basis for subgraph patterns in UCLA26
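The dimension count can be checked directly: a k-chain motif pins down the k − 1 chain edges, so the free degrees of freedom are the remaining node pairs.

```python
from math import comb

def pattern_dim(k):
    # A k-node chain motif fixes its k-1 chain edges; each of the
    # remaining node pairs may or may not carry an edge.
    return comb(k, 2) - (k - 1)

print(pattern_dim(21))  # 190
```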

slide-61
SLIDE 61

Applications: Dictionary learning from networks

Network Dictionary Learning – Reconstructing UCLA from UCLA

[Figure: 25 dictionary atoms learned from the UCLA26 FB network; original UCLA26 FB network; UCLA network reconstructed using the dictionary learned from UCLA.]

◮ Can we reconstruct the original network using the learned dictionary?

slide-62
SLIDE 62

Applications: Dictionary learning from networks

Learning parts of networks – Reconstructing UCLA from UCLA

[Figure: same as the previous slide — dictionary learned from UCLA26, original network, and its reconstruction.]

◮ 95% reconstruction accuracy, where accuracy = (# common edges)/(# edges in original)
◮ Network reconstruction = (local approximation by dictionary) + (averaging) + (rounding)
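The accuracy metric and the final rounding step can be made concrete as follows; the threshold theta = 0.5 is an assumption for illustration, not a value stated on the slide.

```python
import numpy as np

def round_reconstruction(A_avg, theta=0.5):
    # Final rounding step: threshold the edgewise average of the local
    # dictionary approximations (theta = 0.5 is an assumed threshold).
    return (A_avg >= theta).astype(int)

def reconstruction_accuracy(A_orig, A_recon):
    # (# common edges) / (# edges in original), for symmetric 0/1 matrices.
    common = np.sum((A_orig > 0) & (A_recon > 0)) / 2
    total = np.sum(A_orig > 0) / 2
    return common / total

# Tiny example: a triangle, reconstructed with one edge falling below threshold.
A_orig = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
A_recon = round_reconstruction(np.array([[0.0, 0.9, 0.2],
                                         [0.9, 0.0, 0.8],
                                         [0.2, 0.8, 0.0]]))
acc = reconstruction_accuracy(A_orig, A_recon)  # 2 of 3 edges recovered
```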

slide-63
SLIDE 63

Applications: Dictionary learning from networks

Network Dictionary Learning – Caltech36

[Figure: original Caltech36 FB network (left); 25 dictionary atoms of size 21 learned from the Caltech36 FB network (right).]

◮ Extract k-node subgraph patterns by k-chain motif sampling from Caltech36.
◮ We choose k = 21, so that dim(all subgraph patterns) = (21 choose 2) − 20 = 190.
◮ On the right: rank-25 (approximate) basis for subgraph patterns in Caltech36

slide-64
SLIDE 64

Applications: Dictionary learning from networks

Network Dictionary Learning – Reconstructing Caltech from Caltech

[Figure: original Caltech36 FB network; 25 dictionary atoms learned from the Caltech36 FB network; Caltech network reconstructed using the dictionary learned from Caltech.]

◮ 85% reconstruction accuracy, where accuracy = (# common edges)/(# edges in original)
◮ Network reconstruction = (local approximation by dictionary) + (averaging) + (rounding)

slide-65
SLIDE 65

Applications: Dictionary learning from networks

Network Dictionary Learning - Self-reconstruction accuracies

[Figure: self-reconstruction accuracy (0.70–1.00) vs. number of dictionary components (9–100) for the Facebook networks Stanford3, Columbia2, Harvard1, UCLA26, Georgetown15, Dartmouth6, Brown11, Yale4, BC17, Tufts18, Wellesley22, Northwestern25, Duke14, Caltech36, Princeton12, MIT8, and Northeastern19.]

slide-66
SLIDE 66

Applications: Dictionary learning from networks

Learning parts of networks – Reconstructing Caltech from Escher

[Figure: 100 dictionary atoms of size 21-by-21 learned from Cycle by M.C. Escher; original Caltech FB network; Caltech network reconstructed using the dictionary learned from Escher.]

◮ Can we use the dictionary learned from Escher to reconstruct Caltech?

slide-67
SLIDE 67

Applications: Dictionary learning from networks

[Figure: same as the previous slide — Escher dictionary, original Caltech FB network, and its reconstruction.]

◮ # edges in original network = 16656
◮ # edges in reconstructed network = 34
◮ # common edges = 0 (zero reconstruction accuracy)
◮ A non-example of transfer learning

slide-68
SLIDE 68

Applications: Dictionary learning from networks

[Figure: Caltech reconstructed from Caltech (left) vs. Caltech reconstructed from UCLA (right).]


slide-69
SLIDE 69

Applications: Dictionary learning from networks

Network Dictionary Learning - Cross-reconstruction accuracies

[Figure: cross-reconstruction accuracy (0.60–1.00) vs. number of dictionary components (9–100): MIT8 from Caltech36, MIT8 from Harvard1, MIT8 from UCLA26, Caltech36 from MIT8, Caltech36 from Harvard1, Caltech36 from UCLA26, Harvard1 from MIT8, Harvard1 from Caltech36, Harvard1 from UCLA26, UCLA26 from MIT8, UCLA26 from Caltech36, UCLA26 from Harvard1.]

slide-70
SLIDE 70

Applications: Dictionary learning from networks

Related current/future work

  • 1. Applications/implications of Network Dictionary Learning

◮ Completion, inference, and transfer learning for social network data (joint with Kureh and Porter)
◮ Edge completion and outlier detection on networks

  • 2. Deep neural networks + matrix factorization

◮ Topic-aware chatbot using recurrent NN and NMF (joint with summer REU students and Needell)

  • 3. Learning parts of tensor data

◮ Hyper-motif sampling from hyper-networks
◮ Online tensor factorization for Markovian data (joint with Needell and Strohmeier) (c.f., no convergence is known even in the i.i.d. case)
◮ Applications: dictionary learning for video, trajectories of evolving networks, and dynamic topic modeling

  • 4. Further extensions of Online Matrix Factorization

◮ OMF for a variable number of dictionary atoms (an added optimization dimension)
◮ OMF for non-stationary data matrices (what do we want to learn in this case?)


slide-71
SLIDE 71

Applications: Dictionary learning from networks

Thanks!
