Learning low-dimensional models with penalized matrix factorization
Rémi Gribonval - PANAMA Project-team, INRIA Rennes - Bretagne Atlantique, France
remi.gribonval@inria.fr
Overview
✓ Context: inverse problems, sparsity & low-dimensional models
✓ Dictionary learning as penalized matrix factorization: algorithms and learned fast transforms
✓ Statistical guarantees: sample complexity and identifiability
Main Credits
small-project.eu
Inverse problems
Inverse Problems in Image Processing
+ Compression, Source Localization, Separation, Compressed Sensing ...
Inverse Problems in Acoustics
✓ localize emitting sources ✓ reconstruct emitted signals ✓ extrapolate acoustic field
$y = Mx$, with $y \in \mathbb{R}^m$ the time series recorded at the sensors and $x \in \mathbb{R}^N$ the (discretized) spatio-temporal acoustic field
Inverse Problems & Signal Models
Observation Domain
Need for a model = prior knowledge
Sparse signal models
Typical Sparse Models
[Figure: typical sparse models, ANALYSIS and SYNTHESIS views; zero coefficients shown in black or white depending on the panel]
Mathematical Expression
(ex: time-frequency atoms, wavelets)
Dictionary of atoms (Mallat & Zhang 93): $x \in \mathbb{R}^d$, $x \approx \sum_k z_k d_k = Dz$, with sparsity measured by $\|z\|_0 = \sum_k |z_k|^0 = \mathrm{card}\{k : z_k \neq 0\}$
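As a minimal numerical illustration of this synthesis model (the Gaussian random dictionary and the dimensions below are arbitrary choices for the example, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, s = 64, 128, 5                      # signal dimension, number of atoms, sparsity

D = rng.standard_normal((d, K))
D /= np.linalg.norm(D, axis=0)            # unit-norm atoms d_k

z = np.zeros(K)
support = rng.choice(K, size=s, replace=False)
z[support] = rng.standard_normal(s)       # s-sparse coefficient vector

x = D @ z                                 # synthesis: x = D z
print("||z||_0 =", np.count_nonzero(z))   # card{k : z_k != 0} = 5
```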
CoSparse Models and Inverse Problems
Observation Domain
Algorithmic Principles
$$\hat z = \arg\min_z \tfrac{1}{2}\|y - MDz\|_2^2 + \lambda\|z\|_p^p, \qquad \hat x = D\hat z$$
with the iteration
$$\hat z^{i+1/2} \leftarrow \hat z^i + \mu D^T M^T (y - MD\hat z^i), \qquad \hat z^{i+1} \leftarrow \mathrm{Threshold}_p(\hat z^{i+1/2}, \lambda)$$
✓ gradient descent to improve data fidelity ✓ thresholding to promote (structured) sparsity
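A minimal NumPy sketch of this iteration for p = 1, in which case the threshold is soft thresholding and the scheme is the classical ISTA; the step-size rule and function names below are my own choices, and M, D are assumed to be available as dense arrays:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (promotes sparsity)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(y, M, D, lam, n_iter=200):
    """Solve min_z 0.5 * ||y - M D z||_2^2 + lam * ||z||_1 by gradient + thresholding steps."""
    A = M @ D
    mu = 1.0 / np.linalg.norm(A, 2) ** 2        # step size from the spectral norm of MD
    z = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z_half = z + mu * A.T @ (y - A @ z)     # gradient step: improve data fidelity
        z = soft_threshold(z_half, mu * lam)    # thresholding step: promote sparsity
    return D @ z, z                             # reconstruction x_hat = D z_hat
```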
Example: «Audio Inpainting»
[Figure: spectrogram (time vs frequency) and clipped waveform; degradations include clicks, limited bandwidth, holes (packet loss), and clipping]
http://people.rennes.inria.fr/Srdan.Kitic/?page_id=40
CoSparse models and inverse problems
Observation Domain
Dictionary learning for sparse modeling
Sparse Signal Model
[Figure: signal/image $x$ ≈ (overcomplete) dictionary of atoms $D$ (wavelets ...) × sparse representation coefficients $z$]
From Analytic to Learned Dictionaries
✓ Analytic dictionaries (Fourier, wavelets ...) exist for signals and images, but many other data types call for adapted atoms: hyperspectral / satellite imaging, spherical geometry (cosmology, HRTF for 3D audio), graph data (social networks, brain connectivity), vector-valued data (diffusion tensors)
✓ Data-driven (learned) dictionaries
A Quest for the Perfect Sparse Model
✓ Training database, patch extraction, training patches: $x_n = Dz_n$, $1 \le n \le N$, with unknown dictionary $D$ and unknown sparse coefficients $z_n$
✓ Sparse learning yields a dictionary $\hat D$ whose atoms are edge-like [Olshausen & Field 96, Aharon et al 06, Mairal et al 09, ...] or shifts of edge-like motifs [Blumensath 05, Jost et al 05, ...]
Dictionary learning as sparse matrix factorization
Dictionary Learning = Sparse Matrix Factorization
✓ $x_n \approx Dz_n \in \mathbb{R}^d$: stacking the training samples, $X \approx DZ$ with $X = [x_1, x_2, \ldots, x_N]$ and $Z = [z_1, z_2, \ldots, z_N]$
✓ $X$ is $d \times N$, $D$ is $d \times K$, $Z$ is $K \times N$ with s-sparse columns (at most s nonzero entries per column)
✓ sounds familiar? similar to ICA! $X = AS$
Many Approaches
✦ [see e.g. book by Comon & Jutten 2011]
✦ [Bach et al., 2008; Bradley and Bagnell, 2009]
✦ [Krause and Cevher, 2010]
✦ [Zhou et al., 2009]
✦ [Olshausen and Field, 1997; Pearlmutter & Zibulevsky 2001; Aharon et al. 2006; Lee et al., 2007; Mairal et al., 2010 (... and many other authors)]
Nonconvex optimization for dictionary learning
Sparse Coding Objective Function
✓ sparse regression: $f_{x_n}(D) = \min_{z_n} \tfrac{1}{2}\|x_n - Dz_n\|_2^2 + \phi(z_n)$
✦ LASSO/Basis Pursuit: $\phi(z) = \lambda\|z\|_1$
✦ Ideal s-sparse approximation: $\phi(z) = \chi_s(z) = \begin{cases} 0, & \|z\|_0 \le s \\ +\infty, & \text{otherwise} \end{cases}$
Sparse Coding Objective Function
✓ sparse regression over the whole training set:
$$F_X(D) = \frac{1}{N}\sum_{n=1}^{N} f_{x_n}(D), \qquad f_{x_n}(D) = \min_{z_n} \tfrac{1}{2}\|x_n - Dz_n\|_2^2 + \phi(z_n)$$
$$\propto \min_Z \tfrac{1}{2}\|X - DZ\|_F^2 + \Phi(Z)$$
Learning = Constrained Minimization
$$\min_{D \in \mathcal{D}} F_X(D) \;\propto\; \min_{D \in \mathcal{D},\,Z} \tfrac{1}{2}\|X - DZ\|_F^2 + \Phi(Z)$$
✓ without a constraint, the penalty can be circumvented by the scale ambiguity $D \to \infty$, $Z \to 0$
✓ constraint set: $\mathcal{D} = \{D = [d_1, \ldots, d_K] : \forall k,\ \|d_k\|_2 = 1\}$
A versatile matrix factorization framework: different choices of penalty $\Phi$ and constraint set $\mathcal{D}$ recover classical models
✦ penalty: L1 norm; constraint: unit-norm dictionary (sparse dictionary learning)
✦ penalty: indicator function of canonical basis vectors; constraint: none (K-means)
✦ penalty: indicator function of non-negative coefficients; constraint: unit-norm non-negative dictionary (NMF)
✦ penalty: none; constraint: dictionary with orthonormal columns (PCA)
Algorithms for penalized matrix factorization
Principle: Alternate Optimization
$$\min_{D,Z} \tfrac{1}{2}\|X - DZ\|_F^2 + \Phi(Z)$$
✓ Update coefficients given current dictionary $D$: $\min_{z_i} \tfrac{1}{2}\|x_i - Dz_i\|_2^2 + \phi(z_i)$
✓ Update dictionary given current coefficients $Z$: $\min_D \tfrac{1}{2}\|X - DZ\|_F^2$
(a NumPy sketch combining both steps follows the Dictionary Update slide below)
Coefficient Update = Sparse Coding
$$\min_{z_i} \tfrac{1}{2}\|x_i - Dz_i\|_2^2 + \phi(z_i)$$
✓ Batch: for all training samples i at each iteration ✓ Online: for one (randomly selected) training sample i
✓ L1 minimization, (Orthogonal) Matching Pursuit, ...
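For the ideal s-sparse penalty, a greedy alternative is Orthogonal Matching Pursuit; a minimal sketch, assuming unit-norm atoms (function and variable names are mine, not from the slides):

```python
import numpy as np

def omp(x, D, s):
    """Orthogonal Matching Pursuit: greedy s-sparse coding of x in the dictionary D."""
    K = D.shape[1]
    residual = x.copy()
    support = []
    z = np.zeros(K)
    for _ in range(s):
        k = int(np.argmax(np.abs(D.T @ residual)))   # atom most correlated with residual
        if k not in support:
            support.append(k)
        # jointly re-fit the selected coefficients by least squares on the support
        z_support, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        z = np.zeros(K)
        z[support] = z_support
        residual = x - D @ z
    return z
```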
Dictionary Update
$$\min_D \tfrac{1}{2}\|X - DZ\|_F^2$$
✓ Method of Optimal Directions (MOD) [Engan et al., 1999]: $\hat D = X \cdot \mathrm{pinv}(Z) = \arg\min_D \|X - DZ\|_F^2$
✓ K-SVD: atom-wise update via PCA/SVD of the residual; coefficients are jointly updated [Aharon et al. 2006]
✓ Online L1: stochastic gradient [Engan & al 2007, Mairal et al., 2010]
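A sketch of the whole alternation announced above, combining a batch ISTA coefficient update with the MOD dictionary update and a renormalization onto unit-norm atoms; the penalty weight, iteration counts and initialization are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def dictionary_learning_mod(X, K, lam=0.1, n_outer=30, n_inner=50, seed=0):
    """Alternate optimization sketch: ISTA sparse coding + MOD dictionary update."""
    rng = np.random.default_rng(seed)
    d, N = X.shape
    D = rng.standard_normal((d, K))
    D /= np.linalg.norm(D, axis=0)                     # unit-norm atoms
    Z = np.zeros((K, N))
    for _ in range(n_outer):
        # --- coefficient update (batch sparse coding by ISTA) ---
        mu = 1.0 / np.linalg.norm(D, 2) ** 2
        for _ in range(n_inner):
            Z_half = Z + mu * D.T @ (X - D @ Z)
            Z = np.sign(Z_half) * np.maximum(np.abs(Z_half) - mu * lam, 0.0)
        # --- dictionary update (MOD): D = X * pinv(Z) ---
        D = X @ np.linalg.pinv(Z)
        norms = np.maximum(np.linalg.norm(D, axis=0), 1e-12)
        D /= norms                                     # project back onto unit-norm columns
        Z *= norms[:, None]                            # rescale rows so that D @ Z is unchanged
    return D, Z
```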
... but also
✓ Non-negativity (NMF): multiplicative update [Lee & Seung 1999]
✓ Known rows up to gains (blind calibration), $D = \mathrm{diag}(g)D_0$: convex formulation [G & al 2012, Bilen & al 2013]
✓ Known rows up to permutation (cable chaos), $D = \Pi D_0$: branch & bound [Emiya & al, 2014]
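For the non-negative case, a sketch of the Lee & Seung multiplicative updates for the Frobenius-norm objective (the random initialization and the small eps safeguard are illustrative choices; X is assumed entrywise non-negative):

```python
import numpy as np

def nmf_multiplicative(X, K, n_iter=200, eps=1e-12, seed=0):
    """Lee & Seung multiplicative updates for min_{D, Z >= 0} ||X - D Z||_F^2."""
    rng = np.random.default_rng(seed)
    d, N = X.shape
    D = rng.random((d, K)) + eps
    Z = rng.random((K, N)) + eps
    for _ in range(n_iter):
        Z *= (D.T @ X) / (D.T @ D @ Z + eps)   # coefficient update (keeps Z >= 0)
        D *= (X @ Z.T) / (D @ Z @ Z.T + eps)   # dictionary update (keeps D >= 0)
    return D, Z
```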
Analytic vs Learned Dictionaries: Learning Fast Transforms
Ph.D. of Luc Le Magoarou
Analytic vs Learned Dictionaries
✓ Analytic (Fourier, wavelets, ...): no adaptation to training data, low computational complexity
✓ Learned: adapted to training data, high computational complexity
Best of both worlds?
Sparse-KSVD
✓ choose a reference (fast) dictionary $D_0$
✓ learn with the constraint $D = D_0 S$ where $S$ is sparse: $X \approx D_0 S Z$, two unknown factors, a strong prior!
[Rubinstein, Zibulevsky & Elad, "Double Sparsity: Learning Sparse Dictionaries for Sparse Signal Approximation," IEEE TSP, vol. 58, no. 3, pp. 1553–1564, 2010]
Speed = Factorizable Structure
Learning Fast Transforms = Chasing Butterflies
$$D = \prod_{j=1}^{M} S_j, \quad \text{with sparse factors } S_j$$
✓ covers standard fast transforms ✓ more flexible, better adaptation to training data
✓ benefits:
✦ Speed: inverse problems and more
✦ Storage: compression
✦ Statistical significance / sample complexity: denoising
✓ Nonconvex optimization algorithm: PALM
✦ guaranteed convergence to a stationary point
✓ Hierarchical strategy
Example 1: Reverse-Engineering the Fast Hadamard Transform
✓ complexity $O(n^2)$ (dense) vs $2n\log_2 n$ (factorized) ✓ tested up to n = 1024
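A sketch of what such a factorizable structure looks like in the Hadamard case: the Sylvester/Kronecker construction writes the $2^m \times 2^m$ Hadamard matrix as a product of $m = \log_2 n$ sparse «butterfly» factors with $2n$ nonzeros each, so applying the transform costs about $2n\log_2 n$ operations instead of $n^2$. This only illustrates the target structure, not the learning algorithm of the cited work:

```python
import numpy as np

H2 = np.array([[1.0, 1.0], [1.0, -1.0]])

def hadamard_butterfly_factors(m):
    """Factor the 2^m x 2^m Hadamard matrix into m sparse 'butterfly' factors."""
    return [np.kron(np.kron(np.eye(2 ** j), H2), np.eye(2 ** (m - 1 - j)))
            for j in range(m)]

m = 3                                               # n = 8
factors = hadamard_butterfly_factors(m)
H = np.linalg.multi_dot(factors)                    # product of the sparse factors

H_ref = np.array([[1.0]])                           # Sylvester construction: m-fold Kronecker power of H2
for _ in range(m):
    H_ref = np.kron(H_ref, H2)

print(np.allclose(H, H_ref))                        # True
print([int(np.count_nonzero(S)) for S in factors])  # [16, 16, 16], i.e. 2n nonzeros per factor
```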
Example 2: Image Denoising with Learned Fast Transform
✓ small-project.eu
✓ complexity $O(n \log_2 n)$ instead of $O(n^2)$
Comparison with Sparse KSVD (KSVDS)
✓ structure $D = D_0 S$, with $D_0$ very close to the DCT
Statistical guarantees
Theoretical Guarantees ?
$$\hat D_N \in \arg\min_D F_X(D)$$
✓ Source localization, neural coding ...: a «ground truth» dictionary exists, and the goal is dictionary estimation (~Signal Processing): control $\|\hat D_N - D_0\|_F$ under the model $x = D_0 z + \varepsilon$
✦ [Independent Component Analysis, e.g. book Comon & Jutten 2011]
✓ Compression, denoising, calibration, inverse problems ...: no «ground truth dictionary», and the goal is performance generalization (~Machine Learning): $\mathbb{E}F_X(\hat D_N) \le \min_D \mathbb{E}F_X(D) + \eta_N$
✦ [Maurer and Pontil, 2010; Vainsencher & al., 2010; Mehta and Gray, 2012; G. & al 2013]
Theorem: Excess Risk Control
✓ X obtained from N draws, i.i.d., bounded: $P(\|x\|_2 \le 1) = 1$
✓ Penalty function $\phi(z)$:
✦ non-negative and minimum at zero
✦ lower semi-continuous
✦ coercive
✓ Constraint set $\mathcal{D}$: (upper box-counting) dimension $h$
✦ typically $h = dK$, with $d$ = signal dimension, $K$ = number of atoms
✓ Then, with probability at least $1 - 2e^{-x}$,
$$\mathbb{E}F_X(\hat D_N) \le \min_D \mathbb{E}F_X(D) + \eta_N, \qquad \eta_N \le C\sqrt{\frac{(h + x)\log N}{N}}$$
[G. & al, Sample Complexity of Dictionary Learning and Other Matrix Factorizations, 2013, arXiv/HAL]
A word about the proof
✓ Concentration of $F_X(D)$ around its mean $\mathbb{E}_x f_x(D)$
✓ Lipschitz behaviour of $D \mapsto F_X(D)$
➡ main technical contribution, under assumptions on the penalty
✓ Union bound using covering numbers of $\mathcal{D}$
✓ Dimension-dependent bound: $O\!\left(\sqrt{\tfrac{dK\log N}{N}}\right)$
✓ With Rademacher complexities & Slepian's lemma, can recover known dimension-independent bounds as $d \to \infty$
✓ E.g., for PCA: $O\!\left(\sqrt{\tfrac{K^2}{N}}\right)$
Versatility of the Sample Complexity Results
✦ penalties: l1 norm / mixed norms / lp quasi-norms
✦ ... but also non-coercive penalties (with an additional RIP assumption on the constraint set)
✦ constraint sets: unit norm / sparse / shift-invariant / tensor product / tight frame ...
✦ «complexity» captured by the (upper box-counting) dimension
✦ bounded samples: $P(\|x\|_2 \le 1) = 1$
✦ ... but also sub-Gaussian: $P(\|x\|_2 \ge At) \le \exp(-t)$ for $t \ge 1$
✦ covers PCA / NMF / K-Means / sparse PCA
Identifiability analysis ? Empirical findings
Numerical Example (2D)
✓ training data $X = D_0 Z_0$; candidate bases $D_{\theta_0,\theta_1}$ parametrized by two angles $(\theta_0, \theta_1)$
✓ criterion $F_X(D)$, here $\|D_{\theta_0,\theta_1}^{-1} X\|_1$
✓ symmetry = permutation ambiguity
Empirical observations: a) global minima match the angles of the original basis; b) there is no other local minimum
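A sketch reproducing this 2D experiment, assuming Bernoulli-Gaussian coefficients (as in the figure of the next slide) and a basis parametrized by two angles; grid resolution and parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 1000, 0.5                                  # training samples, Bernoulli parameter

def basis(theta0, theta1):                        # unit-norm 2D basis D_{theta0, theta1}
    return np.array([[np.cos(theta0), np.cos(theta1)],
                     [np.sin(theta0), np.sin(theta1)]])

theta0_true, theta1_true = 0.3, 1.8
Z0 = (rng.random((2, N)) < p) * rng.standard_normal((2, N))   # Bernoulli-Gaussian
X = basis(theta0_true, theta1_true) @ Z0                      # X = D0 Z0

# scan the criterion ||D^{-1} X||_1 over a grid of angle pairs
angles = np.linspace(0.0, np.pi, 181)
crit = np.array([[np.abs(np.linalg.inv(basis(t0, t1)) @ X).sum()
                  if abs(np.sin(t1 - t0)) > 1e-6 else np.inf
                  for t1 in angles] for t0 in angles])
i, j = np.unravel_index(np.argmin(crit), crit.shape)
print(angles[i], angles[j])   # close to the true angles, up to permutation and sign
```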
Sparsity vs Coherence (2D)
[Figure: empirical probability of success over N = 1000 Bernoulli-Gaussian training samples, as a function of sparsity (sparse vs weakly sparse, parameter $p \le 1$) and coherence $\mu = |\cos(\theta_1 - \theta_0)|$ (incoherent vs coherent); regions where the ground truth is a local min, a global min, and where there is no spurious local min]
Rule of thumb: perfect recovery if a) incoherence $\mu < 1 - p$ and b) enough training samples (N large enough)
Empirical Findings
✓ Global minima often match the ground truth ✓ Often, there is no spurious local minimum
Open questions: ✓ sparsity level ? ✓ incoherence of D ? ✓ noise level ? ✓ presence / nature of outliers ? ✓ sample complexity (number of training samples) ?
Identifiability Analysis: Overview
✓ [G. & Schnass 2010], [Geng & al 2011], [Jenatton, Bach & G.] analyze different signal models and cost functions; only the latter handles noise
✓ cost functions range from the constrained formulation $\min_{D,Z}\|Z\|_1$ s.t. $DZ = X$ to the penalized $\min_D F_X(D)$ with $\phi(z) = \lambda\|z\|_1$
See also: [Spielman & al 2012, Agarwal & al 2013/2014, Arora & al 2013/2014, Schnass 2013, Schnass 2014]
Theoretical Guarantees ? (recap: from generalization bounds to dictionary estimation)
✓ Ground truth ✓ Goal = dictionary estimation (~Signal Processing): control $\|\hat D_N - D_0\|_F$ under the model $x = D_0 z + \varepsilon$
«Ground Truth» = Sparse Signal Model
✓ support $J \subset [1, K]$ with $|J| = s$, and $x = \sum_{i \in J} z_i d_i + \varepsilon = D_J z_J + \varepsilon$
✓ $P(\min_{i \in J} |z_i| < \underline{z}) = 0$, $P(\|z_J\|_2 > M_z) = 0$, $P(\|\varepsilon\|_2 > M_\varepsilon) = 0$ (+ second moment assumptions)
NB: z not required to have i.i.d. entries
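A sketch of a generator for this signal model; the uniform amplitude draw and the parameter names are illustrative choices (the model itself does not require i.i.d. entries), and it assumes $\underline{z} \le M_z/\sqrt{s}$ so that both amplitude constraints can hold simultaneously:

```python
import numpy as np

def draw_sparse_signal(D0, s, z_min=0.2, M_z=3.0, M_eps=0.01, rng=None):
    """Draw x = D_J z_J + eps with |J| = s, min_i |z_i| >= z_min,
    ||z_J||_2 <= M_z and ||eps||_2 <= M_eps (parameter names are illustrative)."""
    if rng is None:
        rng = np.random.default_rng()
    d, K = D0.shape
    J = rng.choice(K, size=s, replace=False)          # random support of size s
    amp_max = M_z / np.sqrt(s)                        # guarantees ||z_J||_2 <= M_z
    z_J = rng.uniform(z_min, amp_max, size=s) * rng.choice([-1.0, 1.0], size=s)
    eps = rng.standard_normal(d)
    eps *= (M_eps * rng.random()) / max(np.linalg.norm(eps), 1e-12)   # ||eps||_2 <= M_eps
    return D0[:, J] @ z_J + eps, J, z_J
```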
Theorem: Robust Local Identifiability
✦ dictionary with small coherence $\mu(D_0) = \max_{i \neq j} |\langle d_i, d_j\rangle| \in [0, 1]$, with sparsity level $s \lesssim \frac{1}{\mu(D_0)\,\|D_0\|_2}$ ($\|D_0\|_2$: operator norm)
✦ s-sparse coefficient model (no outlier, no noise)
✓ for any small enough $\lambda$, with high probability on $X$, there is a local minimum $\hat D$ of
$$F_X(D) = \min_Z \tfrac{1}{2}\|X - DZ\|_F^2 + \lambda\|Z\|_{1,1}$$
such that $\|\hat D - D_0\|_F \le O(\lambda s \mu \|D_0\|_2)$
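A minimal helper computing the coherence that appears in the sparsity condition, illustrated on a Hadamard-Dirac dictionary (the same family as in Example 2 below); the choice of dictionary here is mine, for illustration only:

```python
import numpy as np

def coherence(D):
    """mu(D) = max_{i != j} |<d_i, d_j>| after normalizing the columns."""
    D = D / np.linalg.norm(D, axis=0)
    G = np.abs(D.T @ D)
    np.fill_diagonal(G, 0.0)
    return float(G.max())

# Example: Hadamard-Dirac dictionary [I | H/sqrt(d)] in dimension d = 16
d = 16
H = np.array([[1.0]])
for _ in range(4):
    H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
D0 = np.hstack([np.eye(d), H / np.sqrt(d)])
print(coherence(D0))   # 1/sqrt(d) = 0.25
```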
Example 1: Orthonormal Dictionary
✓ $\mu(D_0) = 0$
✓ for $M_\varepsilon < \lambda \le \underline{z}/4$, with high probability on $X$, there is a local minimum $\hat D$ such that $\|\hat D - D_0\|_F \le O(\lambda s \mu \|D_0\|_2) = 0$
✓ exact recovery, with sample complexity $N = \Omega(d^4)$ for $D \in \mathbb{R}^{d \times d}$
Example 2: Guarantees vs Observations
[Figure: relative error vs number N of training signals for a d x 2d Hadamard-Dirac dictionary, and relative error vs noise level for a d x d orthonormal Hadamard dictionary, for d = 8, 16, 32 with random and oracle initializations; the predicted slope is indicated]
Flavor of the proof
✓ Minimum exactly at ground truth: via one-sided directional derivatives
✓ Minimum close to ground truth: show $F_X(D) - F_X(D_0)$ is zero at the ground truth and lower-bounded at radius r
Characterizing Local Minima (1)
[Figure: $F_X(D) - F_X(D_0)$ in a neighbourhood of the ground truth $D_0$]
✓ stable to dictionary perturbations and noise
✦ adaptation from [Fuchs, 2005; Zhao and Yu, 2006; Wainwright, 2009]
✦ uses a «guess» of the minimizer
Leveraging Sparse Recovery Results
$$f_{x_n}(D) = \min_{z_n} \tfrac{1}{2}\|x_n - Dz_n\|_2^2 + \lambda\|z_n\|_1$$
✓ under the model $x = D_0 z_0 + \varepsilon$, the «guess» is $f_x(D) = \phi_x(D \mid \mathrm{sign}(z_0))$, built from the closed-form candidate $\hat z = D_J^{+}x - \lambda (D_J^{\top}D_J)^{-1}\mathrm{sign}(z_0)$
Step 1: Asymptotic Regime
63
D0 Efx(D) − Efx(D0)
expectation
✓ more explicit form
D
Step 1: Asymptotic Regime
63
D0 Efx(D) − Efx(D0) Eφx(D|sign(z0)) − Eφx(D0|sign(z0))
expectation
✓ more explicit form ✓ lower bound
✦
where
D
Step 1: Asymptotic Regime
63
D0 Efx(D) − Efx(D0) Eφx(D|sign(z0)) − Eφx(D0|sign(z0)) akD D0kF (kD D0kF r0) r0 = O(λsµk |D0k |2)
expectation asymptotic bound
✓ more explicit form ✓ lower bound
✦
where
✓ there is a local minimum within radius r0
D
Step 1: Asymptotic Regime
63
D0 Efx(D) − Efx(D0) Eφx(D|sign(z0)) − Eφx(D0|sign(z0)) akD D0kF (kD D0kF r0) r0 = O(λsµk |D0k |2) ˆ D
expectation asymptotic bound
✓ local min with if
✓ local minimum if [noiseless case]
D
Step 2: Finite Sample Analysis
✓ bound the deviation of the empirical average from its expectation [Rademacher averages & Slepian's lemma]:
$$\sup_{D} |F_X(D) - \mathbb{E}f_x(D)| \;\le\; \eta_N = O\!\left(\sqrt{\tfrac{\log N}{N}}\right)$$
✓ combined with the lower bound on $F_X(D) - F_X(D_0)$, this gives a local minimum with $\|\hat D - D_0\|_F < r$ once $N = \Omega(dK^3 r^{-2})$, i.e. $N = \Omega(dK^3)$ for a fixed radius, with $D \in \mathbb{R}^{d \times K}$
Outliers ?
✓ inliers follow the sparse model $x = \sum_{i \in J} z_i d_i + \varepsilon = D_J z_J + \varepsilon$; the training set is $X = [X_{\mathrm{in}}, X_{\mathrm{out}}]$
✓ regimes considered: no noise / no outliers, no outliers, many small outliers, few large outliers
Step 3: Robustness to Outliers
✓ bound the inliers' empirical average of $F_X(D) - F_X(D_0)$ around the ground truth $D_0$
✓ if the outliers' admissible «energy» satisfies $\sum_{n \in \mathrm{outliers}} \|x_n\|^2 \le C(r)\,N_{\mathrm{inliers}}$, then w.h.p. there is a local minimum $\hat D$ with $\|\hat D - D_0\|_F < r$, for $r > r_0$
From Local to Global Guarantees ?
$$\min_{D \in \mathcal{D}} F_X(D)$$
[Figure: empirical probability that the ground truth is a local min / a global min, and that there is no spurious local min]
Related results: [Spielman & al 2012, Agarwal & al 2013/2014, Arora & al 2013/2014]
Recent results
[Table comparing recent theoretical results along the criteria: reference, overcompleteness, noise, outliers, global min / algorithm, polynomial algorithm, exactness (no noise, no outlier, finite n), sample complexity, admissible sparsity for exact recovery, coefficient model (main characteristics); first entry: Georgiev et al. [2005], $D \in \mathbb{R}^{m \times p}$, sparsity $k = m - 1$, combinatorial approach]
➡ POLYNOMIAL ALGORITHMS
To conclude ...
Summary
✓ dictionary learning: widely used in image processing and machine learning
✓ batch / online algorithms (K-SVD & al)
✓ sample complexity (also NMF, PCA, sparse PCA ...) [G. & al, Sample Complexity of Dictionary Learning and Other Matrix Factorizations, arXiv/HAL, December 2013]
✓ local stability and robustness guarantees
What’s next ?
✓ From local to global guarantees? Empirically yes ... on simple synthetic data
✓ Guarantees from cost functions to algorithms ?
✦ Dictionaries with intrinsically fast implementations [Le Magoarou & G., learning fast transforms, http://hal.inria.fr/hal-01010577, June 2014]
✦ Compressive learning with randomized generalized moments
✓ analysis sparsity, classification, clustering ...