Robust nonnegative matrix factorisation with the β-divergence and applications in imaging
Cédric Févotte
Institut de Recherche en Informatique de Toulouse
Imaging & Machine Learning, Institut Henri Poincaré, April 2019
Outline
◮ Generalities: matrix factorisation models; nonnegative matrix factorisation (NMF)
◮ Optimisation for NMF: measures of fit; majorisation-minimisation
◮ Applications in imaging: hyperspectral unmixing in remote sensing; factor analysis in dynamic PET
Data is often available in matrix form. Examples:
◮ features × samples: each entry is a coefficient
◮ movies × users: each entry is a movie rating
◮ words × text documents: each entry is a word count
◮ frequencies × time: each entry is a Fourier coefficient
data X ≈ dictionary W × activations H
Known under many names: dictionary learning, low-rank approximation, factor analysis, latent semantic analysis.
Matrix factorisation is used:
◮ for dimensionality reduction (coding, low-dimensional embedding)
◮ for unmixing (source separation, latent topic discovery)
◮ for interpolation (collaborative filtering, image inpainting)
Nonnegative matrix factorisation (NMF): V ≈ WH, where V has F features (rows) and N samples (columns), and W contains K patterns.
◮ data V and factors W, H have nonnegative entries.
◮ nonnegativity of W ensures interpretability of the dictionary, because the patterns w_k and the samples v_n belong to the same space.
◮ nonnegativity of H tends to produce part-based representations, because subtractive combinations are forbidden.
Early work by Paatero and Tapper (1994); landmark Nature paper by Lee and Seung (1999).
(Lee and Seung, 1999; Hofmann, 1999)
Encyclopedia entry 'Constitution of the United States', represented by its word counts: president (148), congress (124), power (120), united (104), constitution (81), amendment (71), government (57), law (49).
The entry is approximated as v_n ≈ W h_n, a nonnegative combination of learnt semantic patterns (topics) such as: {court, government, council, culture, supreme, constitutional, rights, justice}, {president, served, governor, secretary, senate, congress, presidential, elected}, {flowers, leaves, plant, perennial, flower, plants, growing, annual}, {disease, behaviour, glands, contact, symptoms, skin, pain, infection}.
[Figure reproduced from (Lee and Seung, 1999)]
(Smaragdis and Brown, 2003)
[Figure: input music passage spectrogram (frequency in Hz vs time in sec) and its NMF decomposition into 4 components; reproduced from (Smaragdis, 2013)]
(Berry, Browne, Langville, Pauca, and Plemmons, 2007)
[Figure: hyperspectral imaging illustration, reproduced from (Bioucas-Dias et al., 2012)]
Outline
◮ Generalities: matrix factorisation models; nonnegative matrix factorisation (NMF)
◮ Optimisation for NMF: measures of fit; majorisation-minimisation
◮ Applications in imaging: hyperspectral unmixing in remote sensing; factor analysis in dynamic PET
Minimise a measure of fit between V and WH, subject to nonnegativity:
min_{W,H ≥ 0} D(V|WH) = Σ_{f,n} d([V]_{fn} | [WH]_{fn}),
where d(x|y) is a scalar cost function, e.g.,
◮ squared Euclidean distance (Paatero and Tapper, 1994; Lee and Seung, 2001)
◮ Kullback-Leibler divergence (Lee and Seung, 1999; Finesso and Spreij, 2006)
◮ Itakura-Saito divergence (Févotte, Bertin, and Durrieu, 2009)
◮ α-divergence (Cichocki et al., 2008)
◮ β-divergence (Cichocki et al., 2006; Févotte and Idier, 2011)
◮ Bregman divergences (Dhillon and Sra, 2005)
◮ and more in (Yang and Oja, 2011)
Regularisation terms are often added to D(V|WH) to promote sparsity, smoothness, dynamics, etc. The problem is nonconvex.
◮ Let V ∼ p(V|WH) such that
  ◮ E[V|WH] = WH
  ◮ p(V|WH) = Π_{f,n} p(v_{fn} | [WH]_{fn})
◮ then the following correspondences apply, with D(V|WH) = −log p(V|WH) + cst:

data support          | distribution/noise   | divergence        | examples
real-valued           | additive Gaussian    | squared Euclidean | many
integer               | multinomial⋆         | weighted KL       | word counts
integer               | Poisson              | generalised KL    | photon counts
nonnegative           | multiplicative Gamma | Itakura-Saito     | spectrogram
generally nonnegative | Tweedie              | β-divergence      | generalises above models

⋆ conditional independence over f does not apply
A popular measure of fit in NMF (Basu et al., 1998; Cichocki and Amari, 2010):

d_β(x|y) :=
  (1/(β(β−1))) (x^β + (β−1) y^β − β x y^(β−1))   for β ∈ ℝ \ {0, 1}
  x log(x/y) + (y − x)                           for β = 1
  x/y − log(x/y) − 1                             for β = 0

Special cases:
◮ squared Euclidean distance (β = 2)
◮ generalised Kullback-Leibler (KL) divergence (β = 1)
◮ Itakura-Saito (IS) divergence (β = 0)

Properties:
◮ homogeneity: d_β(λx|λy) = λ^β d_β(x|y)
◮ d_β(x|y) is a convex function of y for 1 ≤ β ≤ 2
◮ d_β is a Bregman divergence
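As a concrete illustration, here is a minimal NumPy sketch of d_β summed over a matrix (the function name and signature are mine, not from the talk; it assumes strictly positive entries when β ≤ 1):

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Elementwise beta-divergence d_beta(x|y), summed over all entries.

    Special cases: beta=2 (squared Euclidean), beta=1 (generalised KL),
    beta=0 (Itakura-Saito). Assumes positive entries for beta <= 1.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if beta == 1:  # generalised Kullback-Leibler
        return np.sum(x * np.log(x / y) + (y - x))
    if beta == 0:  # Itakura-Saito
        return np.sum(x / y - np.log(x / y) - 1)
    return np.sum((x**beta + (beta - 1) * y**beta
                   - beta * x * y**(beta - 1)) / (beta * (beta - 1)))
```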
[Figure: d_β(x = 1|y) as a function of y, plotted successively for β = 2 (Euc), β = 1 (KL), β = 0 (IS), β = −1 and β = 3]
◮ Block-coordinate update of H given W^(i−1), then of W given H^(i).
◮ Updates of W and H are equivalent by transposition: V ≈ WH ⇔ Vᵀ ≈ HᵀWᵀ
◮ The objective function is separable in the columns of H (and the rows of W): D(V|WH) = Σ_n D(v_n | Wh_n)
◮ We are essentially left with nonnegative linear regression: min_{h ≥ 0} C(h) := D(v|Wh)
Numerous references in the image restoration literature, e.g., (Richardson, 1972; Lucy, 1974; Daube-Witherspoon and Muehllehner, 1986; De Pierro, 1993).
Block-descent algorithm for a nonconvex problem: initialisation is an issue.
Majorisation-minimisation (MM): build G(h|h̃) such that G(h|h̃) ≥ C(h) and G(h̃|h̃) = C(h̃), then optimise (iteratively) G(h|h̃) instead of C(h).
[Figure: the objective function C(h) together with the successive auxiliary functions G(h|h^(0)), G(h|h^(1)), G(h|h^(2)), ...; minimising each auxiliary function produces the iterates h^(0), h^(1), h^(2), h^(3), ..., which descend towards a minimiser h*]
◮ Finding a good & workable local majorisation is the crucial point.
◮ Treating convex and concave terms separately with Jensen and tangent inequalities usually works. E.g., for the IS divergence: C_IS(h) = Σ_f [ v_f / (Σ_k w_fk h_k) + log(Σ_k w_fk h_k) ] + cst, where the first term is convex and the second concave in h.
◮ In most cases, this leads to nonnegativity-preserving multiplicative algorithms:
h_k = h̃_k [ ∇⁻_{h_k} C(h̃) / ∇⁺_{h_k} C(h̃) ]^γ
◮ ∇_{h_k} C(h) = ∇⁺_{h_k} C(h) − ∇⁻_{h_k} C(h), where the two summands are nonnegative.
◮ if ∇_{h_k} C(h̃) > 0, the ratio of summands is < 1 and h_k moves left (decreases).
◮ γ is a divergence-specific scalar exponent.
◮ Details in (Févotte and Idier, 2011; Yang and Oja, 2011; Zhao and Tan, 2018)
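To make the recipe concrete, here is a bare-bones NumPy sketch of the resulting multiplicative updates for the β-divergence (naming, initialisation and iteration count are my own choices; see Févotte and Idier (2011) for the actual derivation and convergence analysis):

```python
import numpy as np

def gamma_exponent(beta):
    """Divergence-specific exponent ensuring monotonic descent
    (Fevotte and Idier, 2011)."""
    if beta < 1:
        return 1.0 / (2.0 - beta)
    if beta > 2:
        return 1.0 / (beta - 1.0)
    return 1.0  # 1 <= beta <= 2

def nmf_beta(V, K, beta=1.0, n_iter=200, seed=0):
    """Sketch of multiplicative-update NMF for the beta-divergence.

    No regularisation, no stopping criterion, naive random
    initialisation. Assumes V has positive entries.
    """
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + 1e-3
    H = rng.random((K, N)) + 1e-3
    g = gamma_exponent(beta)
    for _ in range(n_iter):
        WH = W @ H
        # H <- H .* (grad^- / grad^+)^gamma
        H *= (W.T @ (WH ** (beta - 2) * V) / (W.T @ WH ** (beta - 1))) ** g
        WH = W @ H
        # same update for W, by transposition
        W *= ((WH ** (beta - 2) * V) @ H.T / (WH ** (beta - 1) @ H.T)) ** g
    return W, H
```

For β = 2 this reduces to the classical Euclidean updates of Lee and Seung (2001), and for β = 1 to the generalised-KL updates.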
Outline
◮ Generalities: matrix factorisation models; nonnegative matrix factorisation (NMF)
◮ Optimisation for NMF: measures of fit; majorisation-minimisation
◮ Applications in imaging: hyperspectral unmixing in remote sensing; factor analysis in dynamic PET
(Févotte and Dobigeon, 2015)
◮ Data: two unfolded hyperspectral cubes, F ≈ 150, N = 50 × 50:
  ◮ Aviris instrument over Moffett Field (CA): lake, soil & vegetation.
  ◮ Hyspex/Madonna instrument over Villelongue (FR): forested area.
◮ A percentage p of the pixels is randomly removed.
◮ W and H are estimated with K = 3 (≈ ground truth) and various values of β.
◮ Missing pixels are reconstructed from V̂ = WH.
◮ Evaluation using the average spectral angle mapper (aSAM):
aSAM(V) = (1/N) Σ_{n=1}^{N} acos( ⟨v_n, v̂_n⟩ / (‖v_n‖ ‖v̂_n‖) )
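For reference, a small sketch of this criterion (the helper name is mine; pixels are assumed stored as columns of V and V_hat):

```python
import numpy as np

def asam(V, V_hat):
    """Average spectral angle mapper between observed pixels v_n
    (columns of V) and reconstructions v_hat_n (columns of V_hat)."""
    cos = np.sum(V * V_hat, axis=0) / (
        np.linalg.norm(V, axis=0) * np.linalg.norm(V_hat, axis=0))
    # clip guards against round-off pushing |cos| slightly above 1
    return np.mean(np.arccos(np.clip(cos, -1.0, 1.0)))
```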
(Févotte and Dobigeon, 2015)
[Figure: aSAM as a function of β for the Moffett data (p = 0.25, 0.5, 0.75) and the Madonna data (p = 0.25, 0.5, 0.75)]
Recommended value β ≈ 1.5 (a compromise between Poisson and additive Gaussian noise).
(Févotte and Dobigeon, 2015)
◮ Variants of the linear mixing model account for "non-linear" effects: v_n ≈ Wh_n + r_n
◮ Often, r_n has a parametric form, e.g., a linear combination of the quadratic components {w_k ⊙ w_j}_{kj} (Nascimento and Bioucas-Dias, 2009; Fan et al., 2009; Altmann et al., 2012).
◮ Nonlinear effects usually affect only a few pixels.
◮ We treat them as non-parametric sparse outliers:
min_{W,H,R ≥ 0} D_β(V | WH + R) + λ ‖R‖_{2,1}
where ‖R‖_{2,1} = Σ_{n=1}^{N} ‖r_n‖_2 induces sparsity at the group level.
◮ Optimised with majorisation-minimisation (the objective is sketched in code below).
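As a sanity check on the notation, a NumPy sketch of this objective (it reuses the hypothetical beta_divergence helper sketched earlier; the function name is mine):

```python
import numpy as np

def robust_nmf_objective(V, W, H, R, beta, lam):
    """D_beta(V | WH + R) + lam * ||R||_{2,1}, where the group-sparse
    penalty sums the Euclidean norms of the outlier columns r_n."""
    penalty = np.sum(np.linalg.norm(R, axis=0))  # ||R||_{2,1}
    return beta_divergence(V, W @ H + R, beta) + lam * penalty
```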
(Févotte and Dobigeon, 2015)
Moffett Field data
[Figure reproduced from (Dobigeon, 2007)]
(Févotte and Dobigeon, 2015)
Unmixing results: spectral endmembers & activation maps
[Figure: endmembers (red: β = 1, black: β = 2) and activation maps (β = 1) for the Vegetation, Water and Soil components]
The outlier term captures specific water/soil interactions.
(Févotte and Dobigeon, 2015)
Villelongue/Madonna data (forested area)
[Figure: 50 × 50 pixel scene]
(Févotte and Dobigeon, 2015)
Unmixing results: spectral endmembers & activation maps
[Figure: endmembers (red: β = 1, black: β = 2) and activation maps (β = 1), including Chestnut tree and Oak tree components]
The outlier term seems to capture patterns due to sensor miscalibration.
Outline
◮ Generalities: matrix factorisation models; nonnegative matrix factorisation (NMF)
◮ Optimisation for NMF: measures of fit; majorisation-minimisation
◮ Applications in imaging: hyperspectral unmixing in remote sensing; factor analysis in dynamic PET
(Cavalcanti, Oberlin, Dobigeon, Févotte, Stute, Ribeiro, and Tauber, 2019)
◮ 3D functional imaging.
◮ Observes the temporal evolution of brain activity after injection of a radiotracer (a biomarker of a specific compound).
◮ v_n is the time-activity curve (TAC) in voxel n.
◮ Neuroimaging: each voxel mixes contributions of 4 TAC signatures (specific-binding gray matter, blood, non-specific gray matter, white matter).
[Figure: concentration of radiotracer over time for each signature; dynamic PET voxel decomposition reproduced from (Cavalcanti, 2018)]
(Cavalcanti, Oberlin, Dobigeon, Févotte, Stute, Ribeiro, and Tauber, 2019)
Mixing model
◮ the specific-binding TAC signature varies in space:
v_n ≈ [w_1 + δ_n] h_1n + Σ_{k=2}^{K} w_k h_kn ≈ [w_1 + D b_n] h_1n + Σ_{k=2}^{K} w_k h_kn ≈ Wh_n + h_1n D b_n
◮ D is fixed and pre-trained using labeled or simulated data.
Estimation:
min_{W,H,B ≥ 0} D_β(V | WH + (1 h_1) ⊙ DB) + λ ‖B‖_{2,1}
where h_1 denotes the first row of H. Optimised with majorisation-minimisation. The outlier term is unpacked in the sketch below.
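To unpack the notation (1 h_1) ⊙ DB, a small sketch (the helper is hypothetical, mine rather than from the paper): the outlier term is DB with column n scaled by h_1n:

```python
import numpy as np

def pet_outlier_term(D, B, H):
    """Column n of the result equals h_{1n} * D @ b_n, i.e. (1 h_1) o (DB)."""
    return (D @ B) * H[0, :]  # broadcasting scales column n by h_1n
```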
Unmixing results
◮ Real dynamic PET image of a stroke subject injected with a tracer for neuroinflammation.
◮ MRI ground-truth region of the stroke.
[Fig.: specific-binding activation (h_1n) and variability maps (‖b_n‖_2) in three different planes and for three values of β]
Conclusions:
◮ NMF can efficiently unmix composite data in imaging problems.
◮ Application-specific variants have been proposed.
◮ The β-divergence can be adjusted to the statistics of the noise.
◮ Majorisation-minimisation works well in this setting.

ERC-funded postdoc positions in machine learning & signal processing:
◮ Multimodal data processing for multimedia artistic creation (with Tim Van de Cruys)
◮ Learning with low-rank models (with Emmanuel Soubies)
◮ Bayesian deep learning (with Nicolas Dobigeon)
http://projectfactory.irit.fr/
Plenary speakers: Yuejie Chi (CMU), Pier Luigi Dragotti (ICL), Emilie Chouzenoux (Univ. Paris-Est), Bhaskar Rao (UC San Diego), Mark Davenport (Georgia Tech), Simon Thorpe (CNRS), Monika Dörfler, Lenka Zdeborová (CNRS).
Special talk by Michael I. Jordan (UC Berkeley).
http://spars-workshop.org/
References

Y. Altmann, A. Halimi, N. Dobigeon, and J.-Y. Tourneret. Supervised nonlinear spectral unmixing using a post-nonlinear mixing model for hyperspectral imagery. IEEE Transactions on Image Processing, 21(6):3017–3025, June 2012.

A. Basu, I. R. Harris, N. L. Hjort, and M. C. Jones. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559, Sep. 1998.

M. W. Berry, M. Browne, A. N. Langville, V. P. Pauca, and R. J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis, 52(1):155–173, Sep. 2007.

J. M. Bioucas-Dias, A. Plaza, N. Dobigeon, M. Parente, Q. Du, P. Gader, and J. Chanussot. Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5(2):354–379, 2012.

Y. C. Cavalcanti, T. Oberlin, N. Dobigeon, C. Févotte, S. Stute, M. Ribeiro, and C. Tauber. Factor analysis of dynamic PET images: beyond Gaussian noise. IEEE Transactions on Medical Imaging, 2019. ISSN 0278-0062. doi: 10.1109/TMI.2019.2906828.

A. Cichocki and S. Amari. Families of alpha- beta- and gamma-divergences: Flexible and robust measures of similarities. Entropy, 12(6):1532–1568, June 2010.

A. Cichocki, R. Zdunek, and S. Amari. Csiszár's divergences for non-negative matrix factorization: Family of new algorithms. In Proc. International Conference on Independent Component Analysis and Blind Signal Separation (ICA), pages 32–39, Charleston SC, USA, Mar. 2006.

A. Cichocki, H. Lee, Y.-D. Kim, and S. Choi. Non-negative matrix factorization with α-divergence. Pattern Recognition Letters, 29(9):1433–1440, July 2008.

M. E. Daube-Witherspoon and G. Muehllehner. An iterative image space reconstruction algorithm suitable for volume ECT. IEEE Transactions on Medical Imaging, 5(5):61–66, 1986. doi: 10.1109/TMI.1986.4307748.

I. S. Dhillon and S. Sra. Generalized nonnegative matrix approximations with Bregman divergences. In Advances in Neural Information Processing Systems (NIPS), 2005.

N. Dobigeon, J.-Y. Tourneret, C. Richard, J. C. M. Bermudez, S. McLaughlin, and A. O. Hero. Nonlinear unmixing of hyperspectral images: Models and algorithms. IEEE Signal Processing Magazine, 31(1):89–94, Jan. 2014.

W. Fan, B. Hu, J. Miller, and M. Li. Comparative study between a new nonlinear model and common linear model for analysing laboratory simulated-forest hyperspectral data. International Journal of Remote Sensing, 30(11):2951–2962, June 2009.

C. Févotte and N. Dobigeon. Nonlinear hyperspectral unmixing with robust nonnegative matrix factorization. IEEE Transactions on Image Processing, 24(12):4810–4819, Dec. 2015. doi: 10.1109/TIP.2015.2468177. URL https://www.irit.fr/~Cedric.Fevotte/publications/journals/tip2015.pdf.

C. Févotte and J. Idier. Algorithms for nonnegative matrix factorization with the beta-divergence. Neural Computation, 23(9):2421–2456, Sep. 2011. doi: 10.1162/NECO_a_00168. URL https://www.irit.fr/~Cedric.Fevotte/publications/journals/neco11.pdf.

C. Févotte, N. Bertin, and J.-L. Durrieu. Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis. Neural Computation, 21(3):793–830, Mar. 2009. doi: 10.1162/neco.2008.04-08-771. URL https://www.irit.fr/~Cedric.Fevotte/publications/journals/neco09_is-nmf.pdf.

L. Finesso and P. Spreij. Nonnegative matrix factorization and I-divergence alternating minimization. Linear Algebra and its Applications, 416:270–287, 2006.

T. Hofmann. Probabilistic latent semantic indexing. In Proc. ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 1999. URL http://www.cs.brown.edu/~th/papers/Hofmann-SIGIR99.pdf.

D. D. Lee and H. S. Seung. Learning the parts of objects with nonnegative matrix factorization. Nature, 401:788–791, 1999.

D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 13, pages 556–562, 2001.

L. B. Lucy. An iterative technique for the rectification of observed distributions. The Astronomical Journal, 79:745–754, 1974. doi: 10.1086/111605.

P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5:111–126, 1994.

P. Smaragdis. Keynote presentation, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013. URL http://web.engr.illinois.edu/~paris/pubs/smaragdis-waspaa2013keynote.pdf.

P. Smaragdis and J. C. Brown. Non-negative matrix factorization for polyphonic music transcription. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2003.

Z. Yang and E. Oja. Unified development of multiplicative algorithms for linear and quadratic nonnegative matrix factorization. IEEE Transactions on Neural Networks, 22:1878–1891, Dec. 2011.

R. Zhao and V. Y. F. Tan. A unified convergence analysis of the multiplicative update algorithm for regularized nonnegative matrix factorization. IEEE Transactions on Signal Processing, 66(1):129–138, Jan. 2018. ISSN 1053-587X. doi: 10.1109/TSP.2017.2757914.
[Backup figure: red pixels indicate negative values]
[Backup figure: experiment reproduced from (Lee and Seung, 1999)]