

SLIDE 1

Dictionary learning - fast and dirty

Karin Schnass

Department of Mathematics, University of Innsbruck
karin.schnass@uibk.ac.at

Supported by the FWF (Der Wissenschaftsfonds)

Dagstuhl, August 31

SLIDES 2-5

why do we care about sparsity again?

A sparse representation of the data is the basis for

  • efficient data processing, e.g. denoising, compressed sensing, inpainting
    (example: inpainting, cf. J. Mairal, F. Bach, J. Ponce, G. Sapiro, Online learning for matrix factorization and sparse coding)
  • efficient data analysis, e.g. source separation, anomaly detection, sparse components
    (example: sparse components, cf. D.J. Field, B.A. Olshausen, Emergence of simple-cell receptive field properties by learning a sparse code for natural images)

In all examples: the sparser, the more efficient.

SLIDES 6-9

why do we care about dictionary learning?

data: Y = (y_1, . . . , y_N), N vectors y_n ∈ R^d, with d, N large

Learning the dictionary from the data means:
  • no need for intuition
  • time (days vs. years)

SLIDES 10-13

Let's do it.

We have:
  • data Y
  • a model (Y is S-sparse in a d × K dictionary Φ)

We want:
  • an algorithm (fast, cheap)
  • guarantees that the algorithm will find Φ

Promising directions:
  • graph clustering algorithms (not so cheap)
  • tensor methods (not so cheap) - later today!
  • (alternating) optimisation (not so many guarantees)

SLIDES 14-16

warm up - a bit of K-SVD.

Since our signals are S-sparse, let's minimise

    min_{Ψ∈D, X∈X_S} ‖Y − ΨX‖²_F

(Ψ ∈ D has normalised columns, X ∈ X_S has S-sparse columns), or equivalently maximise

    max_{Ψ∈D} Σ_n max_{|I|≤S} ‖Ψ_I Ψ_I^† y_n‖²_2,    (1)

but this leads to K-SVD, which is slow.
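As a concrete illustration of why (1) is expensive, here is a minimal numpy sketch of my own (not part of the talk; function and variable names are hypothetical): evaluating the inner maximum for a single signal already requires a search over all supports of size S.

from itertools import combinations
import numpy as np

def best_projection_energy(Psi, y, S):
    """max over |I| = S of ||Psi_I Psi_I^dagger y||_2^2, by brute force."""
    d, K = Psi.shape
    best = 0.0
    for I in combinations(range(K), S):
        P = Psi[:, list(I)]                 # d x S sub-dictionary Psi_I
        proj = P @ np.linalg.pinv(P) @ y    # orthogonal projection of y onto span(Psi_I)
        best = max(best, float(proj @ proj))
    return best

The loop visits all K-choose-S supports and solves a least squares problem for each, which is exactly the combinatorial cost that motivates the modification on the next slides.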

SLIDES 17-20

warm up - a bit of K-SVD.

So let's modify the optimisation programme: replace the projection energy in (1) by simple inner-product responses, first for a single atom,

    max_{Ψ∈D} Σ_n max_i |⟨ψ_i, y_n⟩|²   resp.   max_{Ψ∈D} Σ_n max_i |⟨ψ_i, y_n⟩|,

and, for general S, by the thresholding objective

    max_{Ψ∈D} Σ_n max_{|I|≤S} ‖Ψ⋆_I y_n‖_1.    (2)
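The payoff of (2) is computational. A numpy sketch of mine (hypothetical names, assuming unit-norm columns): the inner maximum decouples over atoms, so the optimal support is found by simple thresholding, i.e. picking the S largest responses |⟨ψ_i, y_n⟩|, at cost O(dK) per signal instead of a combinatorial search.

import numpy as np

def thresholding_support(Psi, y, S):
    """arg max over |I| = S of ||Psi_I^* y||_1, found by simple thresholding."""
    responses = np.abs(Psi.T @ y)               # |<psi_i, y>| for all K atoms
    return np.argpartition(responses, -S)[-S:]  # indices of the S largest responses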

SLIDES 21-22

Iterative Thresholding and K signal means (ITKsM)

To optimise:

    max_{Ψ∈D} Σ_n max_{|I|=S} ‖Ψ⋆_I y_n‖_1    (3)

Algorithm (ITKsM, one iteration)
Given an input dictionary Ψ and N training signals y_n, do:
  • For all n find I^t_{Ψ,n} = arg max_{I: |I|=S} ‖Ψ⋆_I y_n‖_1.
  • For all k calculate ψ̄_k = (1/N) Σ_n y_n · sign(⟨ψ_k, y_n⟩) · χ(I^t_{Ψ,n}, k).    (4)
  • Output Ψ̄ = (ψ̄_1/‖ψ̄_1‖_2, . . . , ψ̄_K/‖ψ̄_K‖_2).
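A minimal numpy sketch of the iteration above in its batch form (my rendering, names are mine; the per-column normalisation guard is an implementation detail, not from the talk):

import numpy as np

def itksm_step(Psi, Y, S):
    """One batch iteration of ITKsM: thresholding + K signal means."""
    d, K = Psi.shape
    N = Y.shape[1]
    R = Psi.T @ Y                                        # responses <psi_k, y_n>, K x N
    Psi_new = np.zeros((d, K))
    for n in range(N):
        I = np.argpartition(np.abs(R[:, n]), -S)[-S:]    # I_{Psi,n} by thresholding
        for k in I:
            Psi_new[:, k] += np.sign(R[k, n]) * Y[:, n]  # signed signal mean, as in (4)
    Psi_new /= N
    norms = np.linalg.norm(Psi_new, axis=0)
    return Psi_new / np.maximum(norms, 1e-12)            # renormalise the columns

Since each signal only needs its K responses and a few rank-one updates, the cost per iteration is O(dKN), and the sum over n can be parallelised or computed online.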

SLIDES 23-26

ITKsM is

  • ridiculously cheap: O(dKN) (parallelisable, online version)
  • robust to noise and to representations that are not exactly sparse, for sparsity levels S = O(µ⁻²)
  • locally convergent (radius 1/√log K) for sparsity S = O(µ⁻²), needing only O(K log K ε⁻²) samples,

but it is not globally convergent.

Algorithm (ITKrM, one iteration)
Given an input dictionary Ψ and N training signals y_n, do:
  • For all n find I^t_{Ψ,n} = arg max_{I: |I|=S} ‖Ψ⋆_I y_n‖_1.
  • For all k calculate ψ̄_k = Σ_{n: k∈I^t_{Ψ,n}} sign(⟨ψ_k, y_n⟩) · [Id − P(Ψ_{I^t_{Ψ,n}}) + P(ψ_k)] y_n.
  • Output Ψ̄ = (ψ̄_1/‖ψ̄_1‖_2, . . . , ψ̄_K/‖ψ̄_K‖_2).
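A corresponding sketch of one ITKrM iteration (again my own numpy rendering, under the assumption that P(·) denotes the orthogonal projection onto the span of its argument): the signal mean is replaced by a mean of residuals [Id − P(Ψ_I) + P(ψ_k)] y_n.

import numpy as np

def itkrm_step(Psi, Y, S):
    """One batch iteration of ITKrM: thresholding + K residual means."""
    d, K = Psi.shape
    Psi_new = np.zeros((d, K))
    for n in range(Y.shape[1]):
        y = Y[:, n]
        r = Psi.T @ y                                    # responses <psi_k, y>
        I = np.argpartition(np.abs(r), -S)[-S:]          # thresholding support
        PsiI = Psi[:, I]
        proj = PsiI @ np.linalg.pinv(PsiI) @ y           # P(Psi_I) y
        for k in I:
            psi_k = Psi[:, k]
            res = y - proj + psi_k * (psi_k @ y)         # (Id - P(Psi_I) + P(psi_k)) y
            Psi_new[:, k] += np.sign(r[k]) * res
    norms = np.linalg.norm(Psi_new, axis=0)
    return Psi_new / np.maximum(norms, 1e-12)            # renormalise the columns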

SLIDES 27-31

intermediate quiz

Are we going to recover the dictionary? No, no, no!! We need a sparse model.

SLIDE 32

first fine-tune the model

Signal model: take Φ with max_{i≠j} |⟨φ_i, φ_j⟩| = µ < 1, and draw a positive, decaying, normed sequence c so that a.s. c(S) − c(S+1) > β_S and (c(S) − c(S+1))/c(1) > ∆_S. For a random permutation p, sign sequence σ and subgaussian noise r, set

    y = (Φ x_{c,p,σ} + r) / √(1 + ‖r‖²_2),  where  x_{c,p,σ}(k) = σ(k) c(p(k)).    (5)

E.g. c_1 = . . . = c_S = 1/√S and c_i = 0 for i > S; then β_S = 1/√S, ∆_S = 1.
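A hedged numpy sketch of a generator for model (5), using the flat coefficient sequence from the example above and Gaussian noise as one admissible subgaussian choice (the function name and the noise_level parameter are mine):

import numpy as np

def draw_signal(Phi, S, noise_level=0.0, rng=None):
    """Draw one signal y = (Phi x + r) / sqrt(1 + ||r||_2^2) from model (5)."""
    rng = rng or np.random.default_rng()
    d, K = Phi.shape
    c = np.zeros(K)
    c[:S] = 1.0 / np.sqrt(S)                 # c_1 = ... = c_S = 1/sqrt(S), c_i = 0 else
    p = rng.permutation(K)                   # random permutation p
    sigma = rng.choice([-1.0, 1.0], size=K)  # random sign sequence sigma
    x = sigma * c[p]                         # x(k) = sigma(k) c(p(k))
    r = noise_level * rng.standard_normal(d) # Gaussian noise, one subgaussian choice
    return (Phi @ x + r) / np.sqrt(1.0 + r @ r)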

SLIDE 33

a very detailed result

Theorem
Let Φ be a unit norm frame with frame constants A ≤ B and coherence µ, and assume that the N training signals y_n are generated according to the signal model in (5) with coefficients that are S-sparse with absolute gap β_S and relative gap ∆_S. Assume further that S ≤ K/(98B) and

    ε_δ := K exp(−1/(4741 µ²S)) ≤ 1/(24(B+1)).

Fix a target error ε̄ ≥ 8 ε_{µ,ρ}, with

    ε_{µ,ρ} = (8K² √(B+1) / (C_r γ_{1,S})) · exp(−β²_S / (98 max{µ², ρ²})),    (6)

and assume that ε̄ ≤ 1 − γ_{2,S} + dρ². If for the input dictionary Ψ we have

    d(Ψ, Φ) ≤ ∆_S / (√(98B) · [1/4 + √(log(2544 K²(B+1) / (∆_S C_r γ_{1,S})))])  and  d(Ψ, Φ) ≤ 1/(32√S),    (7)

then after 12⌈log(ε̄⁻¹)⌉ iterations the output dictionary Ψ̄ of ITKrM, both in its batch and online version, satisfies d(Ψ̄, Φ) ≤ ε̄, except with probability

    60⌈log(ε̄⁻¹)⌉ · K exp( −C²_r γ²_{1,S} N ε̄² / (576 K max{S, B+1} (ε̄ + 1 − γ_{2,S} + dρ²)) ).    (8)

SLIDES 34-35

and its understandable version

Theorem
Assume the number of training samples N scales as O(K log K ε⁻²). If S ≤ O(1/(ℓ µ² log K)), then with high probability, for any starting dictionary Ψ within distance O(1/√S) of the generating dictionary Φ, i.e., max_k ‖φ_k − ψ_k‖_2 ≤ O(1/√S), after O(log(ε⁻¹)) iterations of ITKM the distance of the output dictionary Ψ̄ to the generating dictionary will be smaller than

    max_k ‖φ_k − ψ̄_k‖_2 ≤ max{ε, O(K^{2−ℓ})}.    (9)

If the signals are noiseless and exactly S-sparse with S ≤ O(µ⁻¹), the right hand side above reduces to ε and the number of necessary training samples reduces to O(K log K ε⁻¹).

SLIDES 36-40

plus some graphical explanations

Answers:
  • In expectation there is a local maximum of (3) at/near the generating dictionary.
  • For final accuracy ε, the ITKMs need K log K ε⁻² training signals; in the ideal case ITKrM needs only K log K ε⁻¹.
  • The convergence radius of ITKsM resp. ITKrM is of size at least 1/√log K resp. 1/√S.
  • Experimentally, ITKrM shows global convergence properties (see the sketch below).
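To illustrate the last point, a toy experiment sketch of my own construction (reusing the draw_signal and itkrm_step sketches above; the parameters are arbitrary, not the talk's experiments): start ITKrM from a random initialisation on synthetic data and track the worst atom error.

import numpy as np

rng = np.random.default_rng(0)
d, K, S, N = 32, 48, 4, 5000

Phi = rng.standard_normal((d, K))
Phi /= np.linalg.norm(Phi, axis=0)           # unit norm generating dictionary
Y = np.column_stack([draw_signal(Phi, S, rng=rng) for _ in range(N)])

Psi = rng.standard_normal((d, K))
Psi /= np.linalg.norm(Psi, axis=0)           # random initialisation, far from Phi
for it in range(25):
    Psi = itkrm_step(Psi, Y, S)
    # crude sign- and permutation-invariant error: greedily match each phi_k
    # to its best-aligned learned atom and take the worst distance
    overlap = np.abs(Phi.T @ Psi).max(axis=1)
    err = np.sqrt(np.maximum(0.0, 2.0 - 2.0 * overlap)).max()
    print(f"iteration {it + 1:2d}: max_k ||phi_k - psi_k||_2 ~= {err:.3f}")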

SLIDES 41-42

proofs

and other gory details can be found in
  • Convergence radius and sample complexity of ITKM algorithms for dictionary learning, arXiv:1503.07027.
  • Identification of overcomplete dictionaries, to appear in Journal of Machine Learning Research (arXiv:1401.6354).

SLIDES 43-45

questions

Questions:
  • Is the global maximum of (1)/(3) at/near the generating dictionary?
  • How do we get into the basin of attraction? Initialisation?
  • Is ITKrM/K-SVD globally convergent (random initialisation)?

SLIDE 46

background and complementary literature

  • R. Rubinstein, A. Bruckstein, and M. Elad. Dictionaries for sparse representation modeling. Proceedings of the IEEE, 98(6):1045–1057, 2010.
  • K. Schnass. A personal introduction to theoretical dictionary learning. Internationale Mathematische Nachrichten, 228:5–15, 2015.
  • S. Arora, R. Ge, T. Ma, and A. Moitra. Simple, efficient, and neural algorithms for sparse coding. arXiv:1503.00778, 2015.
  • J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere. arXiv:1504.06785, 2015.

SLIDE 47

the last slide

Questions? Comments? Thanks for your attention!!

SLIDE 48

solutions?

  • local convergence of K-SVD and weighted ITKM
  • random properties of (non-tight) frames, e.g.

        E_{I: k∈I, |I|=S} ‖(Id − Φ_I Φ_I^†) Φ Φ⋆ φ_k‖ = ?

  • stability of eigenvectors
  • stable average case results for sparse approximation (beyond thresholding)
  • global optimiser of the ITKM/K-SVD principle (ideal case)

        max_Φ E_y max_{|I|=S} ‖Φ⋆_I y‖_1  /  max_Φ E_y max_{|I|=S} ‖Φ_I Φ_I^† y‖²_2

  • stuck at 1/log(S) - via Khintchine, decoupling; better tail bounds? combined estimates?