Test of Time Award: Online Dictionary Learning for Sparse Coding

SLIDE 1

Test of Time Award Online Dictionary Learning for Sparse Coding

Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro
International Conference on Machine Learning, 2019

SLIDE 2

Test of Time Award Online Learning for Matrix Factorization and Sparse Coding

Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro
International Conference on Machine Learning, 2019

SLIDE 3

[Photos of the co-authors: Francis Bach, Jean Ponce, Guillermo Sapiro]

SLIDE 4

What are these papers about?

They are dealing with matrix factorization

X (m×n) ≈ D (m×p) × A (p×n)

SLIDE 5

What are these papers about?

They are dealing with matrix factorization

X (m×n) ≈ D (m×p) × A (p×n)

when a factor is sparse.

SLIDE 6

What are these papers about?

They are dealing with matrix factorization

X (m×n) ≈ D (m×p) × A (p×n)

... or the other one.

SLIDE 7

What are these papers about?

They are dealing with matrix factorization

X (m×n) ≈ D (m×p) × A (p×n)

... or both.

SLIDE 8

What are these papers about?

They are dealing with matrix factorization

X (m×n) ≈ D (m×p) × A (p×n)

... or not only is one factor sparse, but it also admits a particular structure.

SLIDE 9

What are these papers about?

They are dealing with matrix factorization

X (m×n) ≈ D (m×p) × A (p×n)

... or one factor admits a particular structure (e.g., piecewise constant), but it is not sparse.

SLIDE 10

What are these papers about?

In these papers, data matrices have many columns,

X (m×n) ≈ D (m×p) × A (p×n),   with n → +∞

... or an infinite number of columns, or columns that are streamed online.
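
For concreteness, here is a minimal NumPy sketch of this setting (the sizes and the random data are illustrative choices, not anything from the papers): the dictionary D has a fixed size, the coefficient matrix A has one column per sample, and the columns of X can just as well arrive one at a time.

import numpy as np

rng = np.random.default_rng(0)
m, p, n = 64, 32, 1000             # illustrative sizes: signal dim, dictionary size, number of samples
D = rng.standard_normal((m, p))    # dictionary, m x p (its size does not depend on n)
A = rng.standard_normal((p, n))    # coefficients, p x n (one column per sample)
X = D @ A                          # data matrix, m x n, so that X = D A exactly in this toy example

def column_stream(X):
    """Columns x_i delivered one at a time, as in the online / n -> +infinity setting."""
    for i in range(X.shape[1]):
        yield X[:, i]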

SLIDE 11

Formulation(s)

X = [x1, x2, . . . , xn] is a data matrix. We may call D = [d1, . . . , dp] a dictionary. A = [α1, . . . , αn] carries the decomposition coefficients of X onto D.

SLIDE 12

Formulation(s)

X = [x1, x2, . . . , xn] is a data matrix. We may call D = [d1, . . . , dp] a dictionary. A = [α1, . . . , αn] carries the decomposition coefficients of X onto D.

Interpretation as signal/data decomposition

X ≈ DA  ⇔  ∀ i, xi ≈ Dαi = Σ_{j=1}^p αi[j] dj.

SLIDE 13

Formulation(s)

X = [x1, x2, . . . , xn] is a data matrix. We may call D = [d1, . . . , dp] a dictionary. A = [α1, . . . , αn] carries the decomposition coefficients of X onto D.

Interpretation as signal/data decomposition

X ≈ DA  ⇔  ∀ i, xi ≈ Dαi = Σ_{j=1}^p αi[j] dj.

Generic formulation

min_{D∈D} (1/n) Σ_{i=1}^n L(xi, D)   with   L(x, D) ≜ min_{α∈A} (1/2)‖x − Dα‖₂² + λψ(α).
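
To make the inner problem concrete: with ψ the ℓ1 norm and A = R^p, L(x, D) is a Lasso in α, which can be solved, for instance, by proximal gradient (ISTA) iterations. The sketch below is only an illustration; the routine names, step size, and iteration count are my own choices, not anything prescribed by the papers.

import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code(x, D, lam, n_iter=200):
    """Approximately solve L(x, D) = min_a 0.5*||x - D a||^2 + lam*||a||_1 with ISTA."""
    a = np.zeros(D.shape[1])
    step = 1.0 / np.linalg.norm(D, 2) ** 2      # 1 / Lipschitz constant of the smooth part
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)                # gradient of 0.5*||x - D a||^2
        a = soft_threshold(a - step * grad, step * lam)
    return a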

SLIDE 14

Formulation(s)

X = [x1, x2, . . . , xn] is a data matrix. We may call D = [d1, . . . , dp] a dictionary. A = [α1, . . . , αn] carries the decomposition coefficients of X onto D.

Interpretation as signal/data decomposition

X ≈ DA  ⇔  ∀ i, xi ≈ Dαi = Σ_{j=1}^p αi[j] dj.

Generic formulation / stochastic case

min_{D∈D} E_x[L(x, D)]   with   L(x, D) ≜ min_{α∈A} (1/2)‖x − Dα‖₂² + λψ(α).

SLIDE 15

Formulation(s)

min_{D∈D} E_x[L(x, D)]   with   L(x, D) ≜ min_{α∈A} (1/2)‖x − Dα‖₂² + λψ(α).

Which formulations does it cover?

non-negative matrix factorization:  D = R_+^{m×p},  A = R_+^p,  ψ = 0

[Paatero and Tapper, ’94]

SLIDE 16

Formulation(s)

min_{D∈D} E_x[L(x, D)]   with   L(x, D) ≜ min_{α∈A} (1/2)‖x − Dα‖₂² + λψ(α).

Which formulations does it cover?

non-negative matrix factorization:  D = R_+^{m×p},  A = R_+^p,  ψ = 0
sparse coding:  D = {D : ∀ j, ‖dj‖₂ ≤ 1},  A = R^p,  ψ = ‖·‖₁

[Paatero and Tapper, ’94], [Olshausen and Field, ’96]

SLIDE 17

Formulation(s)

min_{D∈D} E_x[L(x, D)]   with   L(x, D) ≜ min_{α∈A} (1/2)‖x − Dα‖₂² + λψ(α).

Which formulations does it cover?

non-negative matrix factorization:  D = R_+^{m×p},  A = R_+^p,  ψ = 0
sparse coding:  D = {D : ∀ j, ‖dj‖₂ ≤ 1},  A = R^p,  ψ = ‖·‖₁
non-negative sparse coding:  D = {D : ∀ j, ‖dj‖₂ ≤ 1},  A = R_+^p,  ψ = ‖·‖₁

[Paatero and Tapper, ’94], [Olshausen and Field, ’96], [Hoyer, 2002]

SLIDE 18

Formulation(s)

min_{D∈D} E_x[L(x, D)]   with   L(x, D) ≜ min_{α∈A} (1/2)‖x − Dα‖₂² + λψ(α).

Which formulations does it cover?

non-negative matrix factorization:  D = R_+^{m×p},  A = R_+^p,  ψ = 0
sparse coding:  D = {D : ∀ j, ‖dj‖₂ ≤ 1},  A = R^p,  ψ = ‖·‖₁
non-negative sparse coding:  D = {D : ∀ j, ‖dj‖₂ ≤ 1},  A = R_+^p,  ψ = ‖·‖₁
structured sparse coding:  D = {D : ∀ j, ‖dj‖₂ ≤ 1},  A = R^p,  ψ = ‖·‖₁ + Ω(·)

[Paatero and Tapper, ’94], [Olshausen and Field, ’96], [Hoyer, 2002], [Mairal et al., 2011]

SLIDE 19

Formulation(s)

min_{D∈D} E_x[L(x, D)]   with   L(x, D) ≜ min_{α∈A} (1/2)‖x − Dα‖₂² + λψ(α).

Which formulations does it cover?

non-negative matrix factorization:  D = R_+^{m×p},  A = R_+^p,  ψ = 0
sparse coding:  D = {D : ∀ j, ‖dj‖₂ ≤ 1},  A = R^p,  ψ = ‖·‖₁
non-negative sparse coding:  D = {D : ∀ j, ‖dj‖₂ ≤ 1},  A = R_+^p,  ψ = ‖·‖₁
structured sparse coding:  D = {D : ∀ j, ‖dj‖₂ ≤ 1},  A = R^p,  ψ = ‖·‖₁ + Ω(·)
≈ sparse PCA:  D = {D : ∀ j, ‖dj‖₂² + ‖dj‖₁ ≤ 1},  A = R^p,  ψ = ‖·‖₁

[Paatero and Tapper, ’94], [Olshausen and Field, ’96], [Hoyer, 2002], [Mairal et al., 2011], [Zou et al., 2004].

SLIDE 20

Formulation(s)

min_{D∈D} E_x[L(x, D)]   with   L(x, D) ≜ min_{α∈A} (1/2)‖x − Dα‖₂² + λψ(α).

Which formulations does it cover?

non-negative matrix factorization:  D = R_+^{m×p},  A = R_+^p,  ψ = 0
sparse coding:  D = {D : ∀ j, ‖dj‖₂ ≤ 1},  A = R^p,  ψ = ‖·‖₁
non-negative sparse coding:  D = {D : ∀ j, ‖dj‖₂ ≤ 1},  A = R_+^p,  ψ = ‖·‖₁
structured sparse coding:  D = {D : ∀ j, ‖dj‖₂ ≤ 1},  A = R^p,  ψ = ‖·‖₁ + Ω(·)
≈ sparse PCA:  D = {D : ∀ j, ‖dj‖₂² + ‖dj‖₁ ≤ 1},  A = R^p,  ψ = ‖·‖₁
. . .

[Paatero and Tapper, ’94], [Olshausen and Field, ’96], [Hoyer, 2002], [Mairal et al., 2011], [Zou et al., 2004].

SLIDE 21

The sparse coding context

Sparse coding was introduced by Olshausen and Field, '96. It was the first time (together with ICA, see [Bell and Sejnowski, '97]) that a simple unsupervised learning principle led to various sorts of “Gabor-like” filters when trained on natural image patches.

SLIDE 22

The sparse coding context

Remember that we can play with various structured sparsity-inducing penalties:

[Jenatton et al., 2010], [Kavukcuoglu et al., 2009], [Mairal et al., 2011], [Hyvärinen and Hoyer, 2001].

SLIDE 23

Sparsity and simplicity principles

1921: Wrinch and Jeffreys' simplicity principle.

SLIDE 24

Sparsity and simplicity principles

1921: Wrinch and Jeffreys' simplicity principle.
1952: Markowitz's portfolio selection.

SLIDE 25

Sparsity and simplicity principles

1921: Wrinch and Jeffreys' simplicity principle.
1952: Markowitz's portfolio selection.
1960s and 70s: best subset selection in statistics.

SLIDE 26

Sparsity and simplicity principles

1921: Wrinch and Jeffreys' simplicity principle.
1952: Markowitz's portfolio selection.
1960s and 70s: best subset selection in statistics.
1990s: the wavelet era in signal processing.

SLIDE 27

Sparsity and simplicity principles

1921: Wrinch and Jeffreys' simplicity principle.
1952: Markowitz's portfolio selection.
1960s and 70s: best subset selection in statistics.
1990s: the wavelet era in signal processing.
1996: Olshausen and Field's dictionary learning method.
1994–1996: the Lasso (Tibshirani) and Basis pursuit (Chen and Donoho).

SLIDE 28

Sparsity and simplicity principles

1921: Wrinch and Jeffreys' simplicity principle.
1952: Markowitz's portfolio selection.
1960s and 70s: best subset selection in statistics.
1990s: the wavelet era in signal processing.
1996: Olshausen and Field's dictionary learning method.
1994–1996: the Lasso (Tibshirani) and Basis pursuit (Chen and Donoho).
2004: compressed sensing (Candès, Romberg and Tao).
2006: Elad and Aharon's image denoising method.

SLIDE 29

Context of 2009

SLIDE 30

Context of 2009

Many success stories of dictionary learning in image processing

image denoising, inpainting, demosaicing, super-resolution . . .

[Elad and Aharon, 2006], [Mairal et al., 2008], [Yang et al., 2008] . . .

SLIDE 32

Context of 2009

Many success stories of dictionary learning in image processing

image denoising, inpainting, demosaicing, super-resolution . . .

Also success stories in computer vision for modeling local features

Dictionary learning on top of SIFT wins the PASCAL VOC'09 challenge. Another variant wins the ImageNet 2010 challenge.

[Yang et al., 2009], [Lin et al., 2010] . . .

SLIDE 33

Context of 2009

Many success stories of dictionary learning in image processing

image denoising, inpainting, demosaicing, super-resolution . . .

Also success stories in computer vision for modeling local features

Dictionary learning on top of SIFT wins the PASCAL VOC'09 challenge. Another variant wins the ImageNet 2010 challenge.

Matrix factorization becomes a key technique for unsupervised data modeling

recommender systems (Netflix prize) and social networks, document clustering, genomic pattern discovery, . . .

[Koren et al., 2009b], [Ma et al., 2008], [Xu et al., 2003], [Brunet et al., 2004] . . .

SLIDE 34

Context of 2009

Classical approach for matrix factorization: alternating minimization

min_{D∈D, A∈A} (1/2)‖X − DA‖_F² + λψ(A),

which requires loading all data at every iteration (batch optimization).
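
A minimal sketch of this batch alternating scheme, assuming ψ = ‖·‖₁ and unit-norm-ball constraints on the dictionary columns (the specific update rules below are common heuristics chosen for brevity, not the papers' exact steps):

import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def batch_dictionary_learning(X, p, lam, n_outer=20, n_inner=50, seed=0):
    """Alternating minimization of 0.5*||X - D A||_F^2 + lam*||A||_1 (illustrative sketch)."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((m, p))
    D /= np.linalg.norm(D, axis=0)                          # start from unit-norm columns
    A = np.zeros((p, n))
    for _ in range(n_outer):                                # every pass touches the whole matrix X
        # (1) sparse coding: Lasso in A with D fixed (a few ISTA iterations)
        step = 1.0 / np.linalg.norm(D, 2) ** 2
        for _ in range(n_inner):
            A = soft_threshold(A - step * (D.T @ (D @ A - X)), step * lam)
        # (2) dictionary update: least squares in D with A fixed, then project columns onto the unit ball
        D = X @ A.T @ np.linalg.pinv(A @ A.T)
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)
    return D, A

Every outer iteration above reads the full data matrix X, which is exactly what becomes impractical when n is huge or the columns are streamed.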

SLIDE 35

Context of 2009

Classical approach for matrix factorization: alternating minimization

min_{D∈D, A∈A} (1/2)‖X − DA‖_F² + λψ(A),

which requires loading all data at every iteration (batch optimization). Meanwhile, Léon Bottou is advocating stochastic optimization for machine learning

see Léon's tutorial at NIPS'07, or NeurIPS'18 test of time award [Bottou and Bousquet, 2008].

SLIDE 36

Context of 2009

Classical approach for matrix factorization: alternating minimization

min_{D∈D, A∈A} (1/2)‖X − DA‖_F² + λψ(A),

which requires loading all data at every iteration (batch optimization). Meanwhile, Léon Bottou is advocating stochastic optimization for machine learning, which makes the risk minimization point of view relevant:

min_{D∈D} (1/n) Σ_{i=1}^n L(xi, D)   with   L(x, D) ≜ min_{α∈A} (1/2)‖x − Dα‖₂² + λψ(α).

see Léon's tutorial at NIPS'07, or NeurIPS'18 test of time award [Bottou and Bousquet, 2008].

SLIDE 37

What we did

We started experimenting with SGD

SLIDE 38

What we did

We started experimenting with SGD, and tuning the step-size turned out to be painful.

SLIDE 39

What we did

We started experimenting with SGD, and tuning the step-size turned out to be painful.

Can we then design an algorithm that would be as fast as SGD, but more practical?

SLIDE 40

What we did

We started experimenting with SGD, and tuning the step-size turned out to be painful.

Can we then design an algorithm that would be as fast as SGD, but more practical?

Idea 1: If we knew optimal codes αi⋆ for all xi's in advance, then the problem becomes

min_{D∈D} (1/2) trace(D⊤DB) − trace(D⊤C)   with   B = (1/n) Σ_{i=1}^n αi⋆αi⋆⊤   and   C = (1/n) Σ_{i=1}^n xiαi⋆⊤,

which yields parameter-free block coordinate descent rules for updating D.

SLIDE 41

What we did

We started experimenting with SGD, and tuning the step-size turned out to be painful.

Can we then design an algorithm that would be as fast as SGD, but more practical?

Idea 1: If we knew optimal codes αi⋆ for all xi's in advance, then the problem becomes

min_{D∈D} (1/2) trace(D⊤DB) − trace(D⊤C)   with   B = (1/n) Σ_{i=1}^n αi⋆αi⋆⊤   and   C = (1/n) Σ_{i=1}^n xiαi⋆⊤,

which yields parameter-free block coordinate descent rules for updating D.

Idea 2: Build appropriate matrices B and C in an online fashion.

[Neal and Hinton, ’98]

SLIDE 42

What we did

We started experimenting with SGD, and tuning the step-size turned out to be painful.

Can we then design an algorithm that would be as fast as SGD, but more practical?

Idea 1: If we knew optimal codes αi⋆ for all xi's in advance, then the problem becomes

min_{D∈D} (1/2) trace(D⊤DB) − trace(D⊤C)   with   B = (1/n) Σ_{i=1}^n αi⋆αi⋆⊤   and   C = (1/n) Σ_{i=1}^n xiαi⋆⊤,

which yields parameter-free block coordinate descent rules for updating D.

Idea 2: Build appropriate matrices B and C in an online fashion.

What about theory?

We could provide guarantees of convergence to stationary points, even though the problem is non-convex, stochastic, constrained, and non-smooth.

[Neal and Hinton, ’98], [Mairal, 2013], [Mensch, 2018].

SLIDE 43

Reasons for impact: How did it help other fields?

SLIDE 44

Reasons for impact: How did it help other fields?

A timely context ( ≈ luck)

Datasets were becoming larger and larger, and there was suddenly a need for more scalable matrix factorization methods.

SLIDE 45

Reasons for impact: How did it help other fields?

A timely context ( ≈ luck)

Datasets were becoming larger and larger, and there was suddenly a need for more scalable matrix factorization methods.

A combination of mathematics and engineering?

An efficient software package: the SPAMS toolbox (try it with pip install spams in Python, or download the R/Matlab packages).
Robustness to hyper-parameters: a default setting that works (many times) in practice.

SLIDE 46

Reasons for impact: How did it help other fields?

A timely context ( ≈ luck)

Datasets were becoming larger and larger, and there was suddenly a need for more scalable matrix factorization methods.

A combination of mathematics and engineering?

An efficient software package: the SPAMS toolbox (try it with pip install spams in Python, or download the R/Matlab packages).
Robustness to hyper-parameters: a default setting that works (many times) in practice.

Flexibility in the constraints/penalty design

allowing the method to be used in unexpected contexts.

SLIDE 51

Connection with neural networks

SLIDE 52

Connection with neural networks

A cheap way to obtain a sparse code β from x and D is β = relu(D⊤x − λ), versus α ∈ argmin_{α∈A} (1/2)‖x − Dα‖₂² + λ‖α‖₁.
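
The contrast can be seen in a few lines of NumPy (an illustrative sketch: the Lasso solver is plain ISTA, and note that the relu code is non-negative by construction while the Lasso code may take both signs):

import numpy as np

def relu_code(x, D, lam):
    """One-shot encoder: rectified correlations with the dictionary, beta = relu(D^T x - lam)."""
    return np.maximum(D.T @ x - lam, 0.0)

def lasso_code(x, D, lam, n_iter=200):
    """ISTA iterations for min_a 0.5*||x - D a||^2 + lam*||a||_1."""
    a = np.zeros(D.shape[1])
    step = 1.0 / np.linalg.norm(D, 2) ** 2
    for _ in range(n_iter):
        z = a - step * (D.T @ (D @ a - x))
        a = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return a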

SLIDE 53

Connection with neural networks

A cheap way to obtain a sparse code β from x and D is β = relu(D⊤x − λ), versus α ∈ argmin_{α∈A} (1/2)‖x − Dα‖₂² + λ‖α‖₁.

Then, not surprisingly, for dictionary learning,

end-to-end feature learning is feasible.

[Mairal et al., 2012]

SLIDE 54

Connection with neural networks

A cheap way to obtain a sparse code β from x and D is β = relu(D⊤x − λ), versus α ∈ argmin_{α∈A} (1/2)‖x − Dα‖₂² + λ‖α‖₁.

Then, not surprisingly, for dictionary learning,

end-to-end feature learning is feasible.

one can design convolutional and multilayer models.

[Mairal et al., 2012], [Zeiler and Fergus, 2010]

SLIDE 55

Connection with neural networks

A cheap way to obtain a sparse code β from x and D is β = relu(D⊤x − λ), versus α ∈ argmin_{α∈A} (1/2)‖x − Dα‖₂² + λ‖α‖₁.

Then, not surprisingly, for dictionary learning,

end-to-end feature learning is feasible.

one can design convolutional and multilayer models.

sparse decomposition algorithms perform neural network-like operations (LISTA).

[Mairal et al., 2012], [Zeiler and Fergus, 2010], [Gregor and LeCun, 2010].

SLIDE 56

Thoughts

Is Wrinch and Jeffrey’s simplicity principle still relevant?

Simplicity is a key to interpretability and to model/hypothesis selection. Next form will probably be different than ℓ1. Which one? Simplicity is not enough. Various forms and robustness and stability are also needed.

[Yu and Kumbier, 2019].
