Sparse dictionary learning in the presence of noise & outliers Rémi Gribonval INRIA Rennes - Bretagne Atlantique, France
remi.gribonval@inria.fr
Overview
2
✓ Context: sparse signal processing
✓ Dictionary learning
✓ Statistical guarantees
Sparse signal processing
Sparse Signal / Image Processing
4
+ Compression, Source Localization, Separation, Compressed Sensing ...
Typical Sparse Models
5
(Figure: ANALYSIS vs. SYNTHESIS sparse models; legends mark zero entries in black or white.)
Mathematical expression
(e.g. time-frequency atoms, wavelets)
6
Dictionary of atoms (Mallat & Zhang 93)
$$x \in \mathbb{R}^d, \qquad x \approx \sum_k z_k d_k = Dz, \qquad \|z\|_0 = \sum_k |z_k|^0 = \operatorname{card}\{k : z_k \neq 0\}$$
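As a concrete illustration of the synthesis model above, here is a minimal NumPy sketch that builds a random unit-norm dictionary and an s-sparse coefficient vector; the dimensions and the Gaussian atoms are illustrative choices, not taken from the slides.

```python
import numpy as np

d, K, s = 64, 256, 5                       # signal dimension, number of atoms, sparsity
rng = np.random.default_rng(0)

D = rng.standard_normal((d, K))
D /= np.linalg.norm(D, axis=0)             # unit-norm atoms d_k (columns of D)

z = np.zeros(K)
support = rng.choice(K, size=s, replace=False)
z[support] = rng.standard_normal(s)        # s nonzero coefficients

x = D @ z                                  # synthesis: x = sum_k z_k d_k = D z
print(np.count_nonzero(z))                 # ||z||_0 = s
```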
CoSparse models and inverse problems
7
Observation Domain
Acoustic Imaging
✓ direct optical measures ✓ sequential ✓ 2000 measures
✓ indirect acoustic measures ✓ 120 microphones at a time ✓ 120 x 16 = 1920 measures ✓ Tikhonov regularization
8
echange.inria.fr
Compressive Nearfield Acoustic Holography
9
echange.inria.fr
Dictionary learning
with K. Schnass, F. Bach, R. Jenatton
small-project.eu
Sparse Atomic Decompositions
11
(Figure: signal/image ≈ (overcomplete) dictionary of atoms × sparse representation coefficients.)
✓ bottleneck = large-scale algorithms
✓ bottleneck = dictionary/operator design/learning
Data Deluge + Jungle
12
✓ Signals, images
✓ Hyperspectral: satellite imaging
✓ Spherical geometry: cosmology, HRTF (3D audio)
✓ Graph data: social networks, brain connectivity
✓ Vector-valued: diffusion tensor
Unknown sparse coefficients Unknown dictionary
A quest for the perfect sparse model
sparse learning
13
Training database
Patch extraction → training patches
✓ learned atoms = edge-like atoms [Olshausen & Field 96, Aharon et al 06, Mairal et al 09, ...]
✓ or = shifts of edge-like motifs [Blumensath 05, Jost et al 05, ...]
$x_n = D z_n, \quad 1 \le n \le N \quad \longrightarrow \quad \hat{D}$
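A minimal sketch of the patch-extraction step, assuming scikit-learn is available and using an arbitrary grayscale image; the patch size, number of patches, and mean-removal preprocessing are illustrative assumptions, not the slides' settings.

```python
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d

# Hypothetical grayscale image; real experiments would use a natural-image database.
image = np.random.rand(256, 256)

patches = extract_patches_2d(image, patch_size=(8, 8),
                             max_patches=10_000, random_state=0)
X = patches.reshape(len(patches), -1).T   # d x N training matrix (columns = patches)
X = X - X.mean(axis=0)                    # remove each patch's mean (common preprocessing)
print(X.shape)                            # (64, 10000): d = 64, N = 10000
```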
Dictionary Learning = Sparse Matrix Factorization
14
$$X = [x_1\; x_2\; \cdots\; x_N] \;\approx\; D\,[z_1\; z_2\; \cdots\; z_N] = DZ$$
$X$: $d \times N$; $D$: $d \times K$; $Z$: $K \times N$ with $s$-sparse columns.
Many approaches
15
✦ [see e.g. book by Comon & Jutten 2011]
✦ [Bach et al., 2008; Bradley and Bagnell, 2009]
✦ [Krause and Cevher, 2010]
✦ [Zhou et al., 2009]
✦ [Olshausen and Field, 1997; Pearlmutter & Zibulevsky 2001; Aharon et al. 2006; Lee et al., 2007; Mairal et al., 2010 (... and many other authors)]
Sparse coding objective function
16
$$f_{x_n}(D) = \min_{z_n} \frac{1}{2}\|x_n - D z_n\|_2^2 + \lambda \|z_n\|_1$$
$$F_X(D) = \frac{1}{N} \sum_{n=1}^{N} f_{x_n}(D) \;\propto\; \min_{Z} \frac{1}{2}\|X - DZ\|_F^2 + \lambda \|Z\|_1$$
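For concreteness, a small sketch of how the per-sample cost $f_{x_n}(D)$ can be evaluated with a plain ISTA loop; this is only an illustrative solver (the slides rely on the SPAMS library for the actual online algorithm), and the iteration count and λ are placeholders.

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(D, x, lam, n_iter=500):
    """Approximately minimize 1/2 ||x - D z||_2^2 + lam * ||z||_1 with ISTA.
    Illustrative solver only; not the algorithm used on the slides."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the smooth part
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)             # gradient of the quadratic term
        z = soft_threshold(z - grad / L, lam / L)
    return z

def f_x(D, x, lam):
    """Per-sample sparse coding cost f_x(D) from the slide."""
    z = sparse_code(D, x, lam)
    return 0.5 * np.sum((x - D @ z) ** 2) + lam * np.sum(np.abs(z))
```

Soft-thresholding is the proximal operator of the ℓ1 penalty, which is why the iteration alternates a gradient step on the quadratic term with entrywise shrinkage.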
Learning = constrained minimization
✓ Online learning with the SPAMS library (Mairal et al.) ✓ Constraint = dictionary with unit-norm columns
17
$$\min_{D \in \mathcal{D}} F_X(D), \qquad \mathcal{D} = \{D = [d_1, \ldots, d_K] : \|d_k\|_2 = 1 \ \forall k\}$$
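A rough batch sketch of this constrained minimization by alternating sparse coding and dictionary updates, with columns projected back onto the unit sphere after each update; this is not the online SPAMS algorithm used in the slides, just a readable stand-in with arbitrary iteration counts.

```python
import numpy as np

def learn_dictionary(X, K, lam, n_outer=30, n_inner=100, seed=0):
    """Batch alternating minimization for  min over unit-norm D and Z of
    1/2 ||X - D Z||_F^2 + lam * ||Z||_1.  Rough illustrative stand-in for the
    online SPAMS solver mentioned on the slide; iteration counts are arbitrary."""
    d, N = X.shape
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((d, K))
    D /= np.linalg.norm(D, axis=0)                 # start from a unit-norm dictionary
    Z = np.zeros((K, N))
    for _ in range(n_outer):
        # Sparse coding step (ISTA on all columns of Z at once).
        L = np.linalg.norm(D, 2) ** 2
        for _ in range(n_inner):
            W = Z - D.T @ (D @ Z - X) / L
            Z = np.sign(W) * np.maximum(np.abs(W) - lam / L, 0.0)
        # Dictionary step: least-squares update, then project columns back onto
        # the unit sphere (a common heuristic, not an exact joint minimization).
        D = X @ Z.T @ np.linalg.pinv(Z @ Z.T)
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    return D, Z
```

Normalizing the columns after the least-squares step is a heuristic projection onto the constraint set; online solvers (SPAMS, or scikit-learn's MiniBatchDictionaryLearning) scale far better when N is large.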
Empirical findings
19
Numerical example (2D)
Empirical observations: a) global minima match the angles of the original basis; b) there is no other local minimum.
$X = D_0 Z_0$; dictionaries $D_{\theta_0,\theta_1}$ parameterized by angles $\theta_0, \theta_1$; criteria $\|D_{\theta_0,\theta_1}^{-1} X\|_1$ and $F_X(D)$.
Symmetry = permutation ambiguity
(Legend: ground truth = local min; ground truth = global min; no spurious local min.)
Sparsity vs coherence (2D)
20
(Figure: N = 1000 Bernoulli-Gaussian training samples, panels ranging from sparse to weakly sparse.)
(Phase diagram: coherence $\mu = |\cos(\theta_1 - \theta_0)|$ on one axis, from incoherent to coherent; sparsity parameter $p$ (Bernoulli-Gaussian model) on the other; color = empirical probability of success.)
Rule of thumb: perfect recovery if a) incoherence, b) enough training samples ($N$ large enough):
$$\mu < 1 - p$$
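The coherence on the horizontal axis of the phase diagram can be computed directly; the sketch below does it for the two-atom 2-D dictionary $D_{\theta_0,\theta_1}$, with illustrative angle values, and checks that the generic mutual-coherence formula $\max_{k \neq l} |\langle d_k, d_l\rangle|$ reduces to $|\cos(\theta_1 - \theta_0)|$ here.

```python
import numpy as np

def mutual_coherence(D):
    """Largest absolute inner product between distinct unit-norm atoms of D."""
    Dn = D / np.linalg.norm(D, axis=0)
    G = np.abs(Dn.T @ Dn)
    np.fill_diagonal(G, 0.0)
    return G.max()

# Two-atom 2-D dictionary parameterized by angles (values chosen for illustration).
theta0, theta1 = 0.1, 1.2
D = np.column_stack(([np.cos(theta0), np.sin(theta0)],
                     [np.cos(theta1), np.sin(theta1)]))

print(mutual_coherence(D))            # equals ...
print(abs(np.cos(theta1 - theta0)))   # ... |cos(theta1 - theta0)|
```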
Empirical findings
✓ Global minima often match ground truth ✓ Often, there is no spurious local minimum
✓ sparsity of Z ? ✓ incoherence of D ? ✓ noise level ? ✓ presence / nature of outliers ? ✓ sample complexity (number of training samples) ?
21
Theoretical guarantees
Theoretical guarantees
✦ [Maurer and Pontil, 2010; Vainsencher et al., 2010; Mehta and Gray, 2012]
✦ [Independent Component Analysis, e.g. book by Comon & Jutten 2011]
23
$\|\hat{D} - D_0\|_F$
✓ Array processing perspective
✦ Dictionary ~ directions of arrival
✦ Identification ~ source localization
✓ Neural coding perspective:
✦ Dictionaries ~ receptive fields
$F_X(\hat{D}) - \min_D \mathbb{E}_X F_X(D)$
Theoretical guarantees: overview
24
[G. & Schnass 2010] / [Geng & al 2011] / [Jenatton, Bach & G.]
signal model: no / yes / yes
yes / no / yes
noise: no / no / yes
cost function: $\min_D F_X(D)$ vs. $\min_{D,Z} \|Z\|_1 \ \text{s.t.}\ DZ = X$
Sparse Signal Model
25
$\mathbb{P}(|z_i| < \underline{z}) = 0$ (nonzero coefficients bounded away from zero in magnitude)
$$x = \sum_{i \in J} z_i d_i + \varepsilon = D_J z_J + \varepsilon, \qquad J \subset [1, K], \ |J| = s$$
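A short sampler for this signal model; the noise level, the lower bound on nonzero magnitudes, and the coefficient distribution below are illustrative assumptions (the slides only require a sub-Gaussian coefficient model with magnitudes bounded away from zero).

```python
import numpy as np

def sample_signal(D, s, z_min=0.1, noise_std=0.05, rng=None):
    """Draw x = D_J z_J + eps: random support J of size s, nonzero coefficients
    bounded away from zero in magnitude (|z_i| >= z_min), Gaussian noise.
    Numerical values are illustrative assumptions, not taken from the slides."""
    rng = np.random.default_rng(rng)
    d, K = D.shape
    J = rng.choice(K, size=s, replace=False)
    signs = rng.choice([-1.0, 1.0], size=s)
    magnitudes = z_min + np.abs(rng.standard_normal(s))   # sub-Gaussian, >= z_min
    z = np.zeros(K)
    z[J] = signs * magnitudes
    eps = noise_std * rng.standard_normal(d)
    return D @ z + eps, z, J
```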
Local stability & robustness
26
✓ Assumptions:
✦ incoherent dictionary: $s\,\mu(D_0) \ll 1$
✦ $s$-sparse sub-Gaussian coefficient model (no outliers)
✓ Conclusion:
✦ with high probability there exists a local minimum $\hat{D}$ of $F_X(D)$ such that
$$\|\hat{D} - D_0\|_F \;\le\; C \sqrt{\frac{s\, d\, K^3 \cdot \log N}{N}}$$
✓ technical assumption: bounded coefficient model
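Read as a sample-complexity statement, the bound above rearranges as follows (with the constant $C$ absorbed and the log factor kept loose):
$$\|\hat{D} - D_0\|_F \le r \quad \text{whenever} \quad \frac{N}{\log N} \;\gtrsim\; \frac{s\, d\, K^3}{r^2},$$
i.e., up to logarithmic factors, on the order of $s\, d\, K^3 / r^2$ training samples suffice to reach a target radius $r$.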
Learning Guarantees vs Empirical Findings
27
(Figures: relative error vs. number N of training signals for a Hadamard-Dirac dictionary in dimension d, and relative error vs. noise level for a Hadamard dictionary in dimension d; curves for d = 8, 16, 32 with random or oracle initialization; predicted slope indicated. Dictionary sizes d × d and d × 2d.)
Flavor of the proof
✓ Minimum exactly at ground truth ✓ one-sided directional derivatives
✓ Minimum close to ground truth ✓ Zero at ground truth ✓ Lower bound at radius r
Characterizing local minima (1)
29
(Figure: $F_X(D) - F_X(D_0)$ as a function of $D$ around the ground truth $D_0$.)
✦ adaptation from [Fuchs, 2005; Zhao and Yu, 2006; Wainwright, 2009]
✓ Approximate cost function
Controlling the cost function
30
Recall $F_X(D) = \frac{1}{N}\sum_{n=1}^{N} f_{x_n}(D)$ with
$$f_{x_n}(D) = \min_{z_n} \frac{1}{2}\|x_n - D z_n\|_2^2 + \lambda \|z_n\|_1.$$
For $x = D_0 z_0 + \varepsilon$, consider the sign-conditioned surrogate $\phi_x(D \mid \operatorname{sign}(z_0))$ of $f_x(D)$; summing over samples gives $\Phi_X(D) \approx F_X(D)$.
✓ Need uniform lower bound on the sphere
✓ Lower bound expectation for a given D ✓ Control Lipschitz constant (with high probability) ✓ Conclude with epsilon-net argument
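To make the sign-conditioned surrogate $\phi_x(D \mid \operatorname{sign}(z_0))$ more concrete: in the primal-dual witness style of the works cited above (Fuchs 2005; Wainwright 2009), fixing the support $J = \operatorname{supp}(z_0)$ and the sign vector $s_J = \operatorname{sign}(z_{0,J})$ turns the ℓ1 problem into a smooth quadratic one with a closed-form minimizer. The exact definition used on the slides is not shown, so the following is one standard instance of such a surrogate, assuming $D_J$ has full column rank:
$$\tilde{z}_J = (D_J^\top D_J)^{-1}\bigl(D_J^\top x - \lambda\, s_J\bigr), \qquad \tilde{z}_{J^c} = 0,$$
$$\phi_x(D \mid \operatorname{sign}(z_0)) = \tfrac{1}{2}\,\|x - D_J \tilde{z}_J\|_2^2 + \lambda\, s_J^\top \tilde{z}_J.$$
Because $\tilde{z}$ is available in closed form, differences such as $\Phi_X(D) - \Phi_X(D_0)$ can be expanded explicitly, which is why such surrogates are convenient for the uniform lower bounds discussed above.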
Controlling the approximate cost function
31
For $\|D - D_0\|_F = r$, control $\Phi_X(D) - \Phi_X(D_0)$:
✓ lower-bound on approximate cost function ✓ lower-bound on cost function
Putting the pieces together
32
Admissible outlier energy:
$$\frac{1}{N} \sum_{n \in \text{outliers}} \|x_n\|_2^2 \;\le\; c$$
(Figure: $\Phi_X(D) - \Phi_X(D_0)$ and $F_X(D) - F_X(D_0)$ around the ground truth $D_0$.)
From local to global guarantees ?
33
$$\min_{D \in \mathcal{D}} F_X(D)$$
(Phase-diagram legend: ground truth = local min; ground truth = global min; no spurious local min.)
To conclude ...
Summary
✓ Dictionary learning is widely used in image processing and machine learning
✓ from heuristics ...
✓ ... to statistics
✦ local stability and robustness guarantees
✦ http://hal.inria.fr/hal-00737152 [Jenatton, G. & Bach, Local stability and robustness of sparse dictionary learning in the presence of noise, Oct 2012]
35
What’s next ?
✓ global guarantees? (empirically: yes) ✓ sharp sample complexity ✓ guarantees from cost functions to algorithms
✓ synthesis / analysis flavor (e.g. TV-like) ✓ structured models (shift-invariance, etc.) ✓ structured sparsity (e.g. trees, graphs)
36
37