

SLIDE 1

Learning sparsely used overcomplete dictionaries

Alekh Agarwal Microsoft Research

Joint work with Anima Anandkumar, Prateek Jain, Praneeth Netrapalli and Rashish Tandon

Agarwal, Anandkumar, Jain, Netrapalli, Tandon Overcomplete Dictionary Learning


SLIDE 6

Motivation I: Feature learning

[Figure: practice/papers → feature engineering → feature vectors (1.2, 0.8, 1.5, ...)]

Feature engineering takes considerable time and skill
Typically critical to good performance
Can we learn good features from data?



SLIDE 9

Motivation II: Signal compression

Expensive to store high-dimensional signals
Sparse signals have compact representation
Can we learn a representation where signals of interest are sparse?



SLIDE 12

Dictionary learning in practice

Image compression (Bruckstein et al., 2009)
Similar successes in image denoising, inpainting, superresolution, ...
Non-convex optimization, limited theoretical understanding



SLIDE 15

Dictionary learning setup

Goal: Find a dictionary with r elements such that each data point is a combination of only s dictionary elements.

[Figure: examples Y = dictionary A∗ × coefficients X∗]

Encode faces using dictionary rather than pixel values
Sparsity for compression, signal processing, ...


SLIDE 16

Dictionary learning setup

Goal: Find a dictionary with r elements such that each data point is a combination of only s dictionary elements.

[Figure: Y = A∗ X∗, with Y ∈ R^{d×n} (examples), A∗ ∈ R^{d×r} (dictionary), X∗ ∈ R^{r×n} (coefficients)]

Topic models, overlapping clustering, image representation
Overcomplete setting, r ≫ d, relevant in practice

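To make the generative model concrete, here is a minimal numpy sketch (the function name and parameter values are illustrative, not from the talk) that samples Y = A∗X∗ with unit-norm dictionary columns and s-sparse coefficient columns:

```python
import numpy as np

def generate_data(d=20, r=40, n=500, s=3, seed=0):
    """Sample Y = A @ X with a unit-norm dictionary and s-sparse coefficients."""
    rng = np.random.default_rng(seed)
    # Dictionary: d x r with unit-norm columns (overcomplete when r > d)
    A = rng.standard_normal((d, r))
    A /= np.linalg.norm(A, axis=0)
    # Coefficients: r x n, each column has exactly s non-zeros at random positions
    X = np.zeros((r, n))
    for j in range(n):
        support = rng.choice(r, size=s, replace=False)
        X[support, j] = rng.standard_normal(s)
    return A, X, A @ X

A, X, Y = generate_data()
print(Y.shape)  # (20, 500)
```

Each column of Y is a combination of only s = 3 of the r = 40 dictionary elements, in d = 20 dimensions.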


SLIDE 20

Alternating minimization

Objective: min_{A,X} ‖X‖_1 = Σ_{i,j} |X_ij| subject to Y = AX

Dominant approach in practice:
Start with initial dictionary A(0)
Sparse regression for coefficients given dictionary:
  X(t+1)_i = argmin_{x ∈ R^r} ‖x‖_1 s.t. ‖Y_i − A(t)x‖_2 ≤ ε_t
Least squares for dictionary given coefficients:
  A(t+1) = Y X(t+1)^+, i.e. Y ≈ A(t+1) X(t+1)
Similar to EM for this problem
Does not converge to the global optimum from arbitrary A(0)

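The alternating loop can be sketched in a few lines of numpy. This is not the paper's exact procedure: the sparse-coding step below is a crude correlation-and-least-squares stand-in for the ℓ1-constrained regression on the slide, the "good initialization" is simulated by perturbing A∗, and all dimensions are illustrative.

```python
import numpy as np

# Exact data Y = A* X* with s-sparse coefficient columns.
rng = np.random.default_rng(0)
d, r, n, s = 100, 150, 2000, 3
A_star = rng.standard_normal((d, r))
A_star /= np.linalg.norm(A_star, axis=0)
X_star = np.zeros((r, n))
for j in range(n):
    sup = rng.choice(r, size=s, replace=False)
    X_star[sup, j] = rng.choice([-1.0, 1.0], size=s) * rng.uniform(1.0, 2.0, size=s)
Y = A_star @ X_star

def sparse_code(Y, A, s):
    # Stand-in for the constrained l1 step: take the s largest correlations
    # |A^T y| as the support, then solve least squares on that support.
    X = np.zeros((A.shape[1], Y.shape[1]))
    for j in range(Y.shape[1]):
        sup = np.argsort(-np.abs(A.T @ Y[:, j]))[:s]
        X[sup, j] = np.linalg.lstsq(A[:, sup], Y[:, j], rcond=None)[0]
    return X

def dict_update(Y, X):
    # A(t+1) = Y X(t+1)^+ (least squares), then renormalize columns.
    A = Y @ np.linalg.pinv(X)
    return A / np.linalg.norm(A, axis=0)

# Start near A*, mimicking what the initialization step is meant to provide.
A = A_star + 0.02 * rng.standard_normal((d, r))
A /= np.linalg.norm(A, axis=0)
err0 = np.max(np.linalg.norm(A - A_star, axis=0))
for t in range(5):
    A = dict_update(Y, sparse_code(Y, A, s))
err = np.max(np.linalg.norm(A - A_star, axis=0))
print(err0, err)  # err should come out smaller than err0
```

Started far from A∗, the same loop can stall at a poor dictionary, which is exactly the failure mode the initialization is designed to rule out.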


SLIDE 22

Alternating minimization goal

(Â, X̂) = argmin_{A,X} ‖X‖_1 subject to Y = AX

Y = AX is a non-convex constraint
Average of solutions is not a solution! Both Y = AX and Y = (−A)(−X) hold, yet the average ((A + (−A))/2, (X + (−X))/2) = (0, 0) fails Y = AX
Non-convex optimization, NP-hard in general

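The sign-flip argument above is easy to verify numerically; a tiny sketch with arbitrary random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 8))
X = rng.standard_normal((8, 10))
Y = A @ X

# Both (A, X) and (-A, -X) satisfy the constraint Y = AX exactly...
assert np.allclose(Y, (-A) @ (-X))

# ...but their average ((A + (-A))/2, (X + (-X))/2) = (0, 0) does not.
A_avg, X_avg = (A + (-A)) / 2, (X + (-X)) / 2
print(np.allclose(Y, A_avg @ X_avg))  # False: the feasible set is non-convex
```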


SLIDE 24

Previous theory work

Exact recovery in undercomplete setting by Spielman et al. via linear programming
We combine alternating minimization with a novel initialization
Global optimum despite non-convexity in the overcomplete setting



SLIDE 26

Initialization: Key ideas

[Figure: 2D scatter of samples, axes from −1 to 1]

Find several samples with a common dictionary element
Top singular vector of these samples is an estimate of this element



SLIDE 32

Correlation graph

Definition (Correlation graph):
One node for each example
Edge {Y_i, Y_j} if |⟨Y_i, Y_j⟩| ≥ ρ

[Figure: correlation graph with a good clique and a bad clique, sets S1 and S2]

Large correlation ⇒ common dictionary element
Samples in a clique contain a common dictionary element
Easy to construct cliques

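The graph in the definition is just a thresholded Gram matrix. A minimal sketch (the threshold ρ and the toy samples are illustrative):

```python
import numpy as np

def correlation_graph(Y, rho):
    """Adjacency matrix of the correlation graph: one node per sample,
    an edge {Y_i, Y_j} whenever |<Y_i, Y_j>| >= rho (no self-loops)."""
    G = np.abs(Y.T @ Y)
    np.fill_diagonal(G, 0.0)
    return G >= rho

# Toy example: three unit-norm samples; the first two nearly coincide.
Y = np.array([[1.0, 0.9, 0.0],
              [0.0, np.sqrt(1 - 0.81), 1.0]])
G = correlation_graph(Y, rho=0.5)
print(G)  # only the pair (Y_0, Y_1) is connected
```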


SLIDE 35

Initialization algorithm

1. Construct correlation graph G_ρ given a threshold ρ
2. For each edge (Y_i, Y_j) in G_ρ, if (Y_i, Y_j) is good:
   (a) Let S be all common neighbors of Y_i and Y_j
   (b) Let M be the covariance matrix of S: M = Σ_{i∈S} Y_i Y_i^T
   (c) Set â to the top singular vector of M
3. Each vector â is an estimate of some A∗_i

Similar algorithm developed simultaneously and independently in Arora et al. (2013)

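Steps 2(a)-(c) for a single edge can be sketched as follows. This is a toy demonstration, not the paper's algorithm: the "good edge" test is skipped, and the data is rigged so that every sample shares one element a and every edge is good.

```python
import numpy as np

def estimate_element(Y, G, i, j):
    """For one good edge (Y_i, Y_j): take the common neighbors S,
    form M = sum_{k in S} Y_k Y_k^T, and return the top singular
    vector of M as the estimate of a dictionary element."""
    S = np.flatnonzero(G[i] & G[j])   # (a) common neighbors of i and j
    M = Y[:, S] @ Y[:, S].T           # (b) M = sum of Y_k Y_k^T over S
    U, _, _ = np.linalg.svd(M)        # (c) top singular vector of M
    return U[:, 0]

rng = np.random.default_rng(0)
d, n = 10, 40
a = np.zeros(d)
a[0] = 1.0
# Samples that all use the element a, plus a small residual component.
Y = np.outer(a, rng.uniform(1, 2, n)) + 0.1 * rng.standard_normal((d, n))
G = np.abs(Y.T @ Y) >= 1.0
np.fill_diagonal(G, False)
a_hat = estimate_element(Y, G, 0, 1)
print(abs(a_hat @ a))  # close to 1: the estimate aligns with a, up to sign
```

The sign ambiguity in â is inherent (both ±A∗_i are valid dictionary elements), which is why alignment is measured with an absolute value.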


SLIDE 37

Assumptions

Incoherent dictionary: |⟨A∗_i, A∗_j⟩| ≤ µ0/√d
Sparse coefficients: each sample has at most s non-zero entries X∗_ij, with a random sparsity pattern



SLIDE 39

Exact recovery

Theorem (AAJNT’13): Suppose we have O(r²) examples. Use the graph clustering algorithm to initialize alternating minimization. With high probability, for all t ≥ 1 and i = 1, 2, ..., r:
  ‖A(t)_i − A∗_i‖_2 ≤ ‖A(0)_i − A∗_i‖_2 · 2^{−t}

Exact recovery from O(r²) samples
Global optimum through novel initialization
Approximate recovery in initialization step
Local linear convergence of alternating minimization


SLIDE 40

Local linear convergence

[Figure: Error vs Iteration (d=100, r=200, s=3); error in A on a log scale for n = 1.5, 2, 2.5, 3, 3.5 × s·r·log r]


SLIDE 41

One-shot vs alternating

[Figure: Error vs N (d=100, r=100, s=3, n = C·s·r·log r); error in A on a log scale for initialization alone vs. alternating minimization]


SLIDE 42

Sample complexity

[Figure: probability of success vs n/r (d=200, s=5), for r = d, 2d, 4d, 8d]



SLIDE 47

Alternating minimization proof sketch

Ideally want:
[Figure: iterates A(0), X(1), A(1), X(2), ... approaching (A∗, X∗)]

But what about:
[Figure: the same iterates on a trajectory that need not approach (A∗, X∗)]


SLIDE 48

Alternating minimization proof sketch (contd.)

Lemma: Suppose ‖X(t+1) − X∗‖_∞ = O(1/s). Then
  ‖A(t+1)_i − A∗_i‖_2 = O( (s²/√d) · ‖X(t+1) − X∗‖_∞ )

s² ≤ √d ensures the error decreases
Contraction by relating ‖X(t+1) − X∗‖_∞ to ‖A(t)_i − A∗_i‖_2
Good initialization ensures the precondition

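Schematically, chaining the lemma with the relation in the second bullet gives the per-iteration contraction; a sketch in LaTeX form, with c and c′ standing for unspecified constants of the analysis (not values from the paper):

```latex
% Lemma: dictionary error controlled by coefficient error
\|A(t+1)_i - A^*_i\|_2 \;\le\; c\,\frac{s^2}{\sqrt{d}}\,\|X(t+1) - X^*\|_\infty
% Sparse-regression step: coefficient error controlled by dictionary error
\|X(t+1) - X^*\|_\infty \;\le\; c'\,\max_i \|A(t)_i - A^*_i\|_2
% Combined: geometric decrease of the dictionary error
\max_i \|A(t+1)_i - A^*_i\|_2
  \;\le\; \frac{c\,c'\,s^2}{\sqrt{d}}\,\max_i \|A(t)_i - A^*_i\|_2
```

The error thus contracts geometrically whenever s² is sufficiently small relative to √d, and the initialization guarantees the precondition ‖X(1) − X∗‖_∞ = O(1/s) at the first step.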

SLIDE 49

Conclusions

Provable recovery of overcomplete dictionaries
Global optimality through novel initialization
Local linear convergence of alternating minimization
Local convexity under the same initialization
General theory for latent variable models


SLIDE 50

A Clustering Approach to Learn Sparsely-Used Overcomplete Dictionaries, arXiv:1309.1952
Learning Sparsely Used Overcomplete Dictionaries via Alternating Minimization, arXiv:1310.7991

Questions?
