Nonnegative matrix factorization and applications in audio signal processing
Cédric Févotte
Laboratoire Lagrange, Nice
Machine Learning Crash Course, Genova, June 2015
Outline
◮ Generalities
  Matrix factorisation models
  Nonnegative matrix factorisation
  Majorisation-minimisation algorithms
◮ Audio examples
  Piano toy example
  Audio restoration
  Audio bandwidth extension
  Multichannel IS-NMF
Data often available in matrix form.
[Figure: features × samples matrix, entries are coefficients.]
[Figure: movies × users matrix, entries are movie ratings.]
[Figure: words × text documents matrix, entries are word counts.]
[Figure: frequencies × time matrix, entries are Fourier coefficients.]
data X ≈ dictionary W × activations H
Related paradigms: dictionary learning, low-rank approximation, factor analysis, latent semantic analysis.
◮ for dimensionality reduction (coding, low-dimensional embedding)
◮ for unmixing (source separation, latent topic discovery)
◮ for interpolation (collaborative filtering, image inpainting)
V ≈ WH, with F features, N samples, K patterns.
◮ data V and factors W, H have nonnegative entries.
◮ nonnegativity of W ensures interpretability of the dictionary, because patterns w_k and samples v_n belong to the same space.
◮ nonnegativity of H tends to produce part-based representations, because subtractive combinations are forbidden.
Early work by Paatero and Tapper (1994); landmark Nature paper by Lee and Seung (1999).
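As a concrete illustration of the model shapes (a minimal NumPy sketch, not from the slides; dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
F, N, K = 100, 500, 8           # F features, N samples, K patterns

# Nonnegative dictionary W (F x K) and activations H (K x N)
W = rng.random((F, K))
H = rng.random((K, N))
V = W @ H                       # rank-K nonnegative data, V = W H exactly here

# Each sample v_n is an additive (subtraction-free) combination of the
# K dictionary patterns w_k, which live in the same space as v_n.
assert (W >= 0).all() and (H >= 0).all() and (V >= 0).all()
```

In practice V is given and W, H are estimated so that V ≈ WH only approximately.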
[Figure: red pixels indicate negative values.]
[Figure: experiment reproduced from (Lee and Seung, 1999).]
(Lee and Seung, 1999; Hofmann, 1999)
Encyclopedia entry: 'Constitution of the United States'
Top words by count: president (148), congress (124), power (120), united (104), constitution (81), amendment (71), government (57), law (49).
[Figure: v_n ≈ W h_n; dictionary columns group semantically related words (government/courts, presidency/elections, plants/flowers, disease/symptoms); reproduced from (Lee and Seung, 1999).]
(Berry, Browne, Langville, Pauca, and Plemmons, 2007)
[Figure: hyperspectral unmixing example, reproduced from (Bioucas-Dias et al., 2012).]
(Smaragdis and Brown, 2003)
[Figure: spectrogram of an input music passage decomposed into 4 components, each with its frequency profile and time activation; reproduced from (Smaragdis, 2013).]
Minimise a measure of fit between V and WH, subject to nonnegativity:

  min_{W,H≥0} D(V|WH) = ∑_{f,n} d([V]_{fn} | [WH]_{fn}),

where d(x|y) is a scalar cost function, e.g.,
◮ squared Euclidean distance (Paatero and Tapper, 1994; Lee and Seung, 2001)
◮ Kullback-Leibler divergence (Lee and Seung, 1999; Finesso and Spreij, 2006)
◮ Itakura-Saito divergence (Févotte, Bertin, and Durrieu, 2009)
◮ α-divergence (Cichocki et al., 2008)
◮ β-divergence (Cichocki et al., 2006; Févotte and Idier, 2011)
◮ Bregman divergences (Dhillon and Sra, 2005)
◮ and more in (Yang and Oja, 2011)

Regularisation terms often added to D(V|WH) for sparsity, smoothness, dynamics, etc.
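For concreteness, the three classical scalar costs d(x|y) can be sketched as follows (hypothetical helper functions, using the conventions of Févotte and Idier, 2011, where the Euclidean cost carries a 1/2 factor):

```python
import numpy as np

def d_euc(x, y):
    """Squared Euclidean distance (beta = 2, up to the 1/2 convention)."""
    return 0.5 * (x - y) ** 2

def d_kl(x, y):
    """(Generalised) Kullback-Leibler divergence (beta = 1)."""
    return x * np.log(x / y) - x + y

def d_is(x, y):
    """Itakura-Saito divergence (beta = 0); scale-invariant: d(ax|ay) = d(x|y)."""
    return x / y - np.log(x / y) - 1

def fit(V, W, H, d):
    """Measure of fit D(V|WH) = sum over f,n of d([V]_fn | [WH]_fn)."""
    return d(V, W @ H).sum()
```

The scale invariance of the Itakura-Saito divergence is what makes it natural for audio spectra, whose entries span many orders of magnitude.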
◮ Block-coordinate update of H given W(i−1), then of W given H(i).
◮ Updates of W and H equivalent by transposition:

  V ≈ WH ⇔ V^T ≈ H^T W^T

◮ Objective function separable in the columns of H (or the rows of W):

  D(V|WH) = ∑_n D(v_n | W h_n)

◮ Essentially left with nonnegative linear regression:

  min_{h≥0} C(h) := D(v|Wh)

Numerous references in the image restoration literature, e.g., (Richardson, 1972; Lucy, 1974; Daube-Witherspoon and Muehllehner, 1986; De Pierro, 1993)
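For the squared Euclidean case, this per-column subproblem min_{h≥0} ||v − Wh||² is exactly nonnegative least squares, which SciPy solves directly (a sketch with synthetic data, not tied to the slides):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
F, K = 20, 4
W = rng.random((F, K))
h_true = np.array([0.0, 2.0, 0.0, 1.5])   # sparse nonnegative ground truth
v = W @ h_true

# min_{h >= 0} ||v - W h||_2, solved by an active-set NNLS algorithm
h, rnorm = nnls(W, v)
```

With noiseless data and a full-column-rank W, the nonnegative ground truth is recovered exactly (rnorm ≈ 0); NMF algorithms instead solve many such subproblems approximately, for all columns at once.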
Majorisation-minimisation (MM): build G(h|h̃) such that G(h|h̃) ≥ C(h) and G(h̃|h̃) = C(h̃). Optimise (iteratively) G(h|h̃) instead of C(h).

[Figure: objective function C(h) with the successive auxiliary functions G(h|h(0)), G(h|h(1)), G(h|h(2)), …; minimising each auxiliary function yields the next iterate, h(0) → h(1) → h(2) → h(3) → … → h*.]
◮ Finding a good & workable local majorisation is the crucial point.
◮ For most of the divergences mentioned, the Jensen and tangent inequalities are usually enough.
◮ In many cases, this leads to multiplicative algorithms such that

  h_k = h̃_k [ ∇⁻_{h_k} C(h̃) / ∇⁺_{h_k} C(h̃) ]^γ

  where
  ◮ ∇_{h_k} C(h) = ∇⁻_{h_k} C(h) − ∇⁺_{h_k} C(h), and the two summands are nonnegative,
  ◮ γ is a divergence-specific scalar exponent.
◮ More details about MM in (Lee and Seung, 2001; Févotte and Idier, 2011; Yang and Oja, 2011).
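For the generalised KL divergence the recipe above gives the well-known multiplicative updates of Lee and Seung (2001), with γ = 1. A minimal implementation sketch (not the authors' code; initialisation and iteration count are arbitrary):

```python
import numpy as np

def kl_nmf(V, K, n_iter=100, seed=0, eps=1e-12):
    """KL-NMF by multiplicative majorisation-minimisation updates.

    Each entry is multiplied by the ratio of the negative over the
    positive part of the gradient, which preserves nonnegativity and
    decreases D(V|WH) monotonically.
    """
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, N)) + eps
    for _ in range(n_iter):
        # H update: H <- H * (W^T (V / WH)) / (W^T 1)
        H *= (W.T @ (V / (W @ H + eps))) / W.sum(axis=0)[:, None]
        # W update: W <- W * ((V / WH) H^T) / (1 H^T)
        W *= ((V / (W @ H + eps)) @ H.T) / H.sum(axis=1)[None, :]
    return W, H

def kl_div(V, WH):
    """Generalised KL divergence D(V|WH)."""
    return (V * np.log(V / WH) - V + WH).sum()
```

Note that the updates never subtract anything, so a factor initialised nonnegative stays nonnegative throughout.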
◮ Squared Euclidean distance is a common default choice.
◮ It underlies a Gaussian additive noise model such that v_{fn} = [WH]_{fn} + ε_{fn}, which can generate negative values – not very natural for nonnegative data.
◮ Many other options.

Select the right divergence (for a specific problem) by
◮ comparing performances, given ground-truth data.
◮ assessing the ability to predict missing/unseen data (interpolation, cross-validation).
◮ probabilistic modelling:

  D(V|WH) = − log p(V|WH) + cst
◮ Let V ∼ p(V|WH) such that E[V|WH] = WH.
◮ Then the following correspondences apply, with D(V|WH) = − log p(V|WH) + cst:

  data support           distribution/noise     divergence          examples
  real-valued            additive Gaussian      squared Euclidean   many
  integer                multinomial            Kullback-Leibler    word counts
  integer                Poisson                generalised KL      photon counts
  nonnegative            multiplicative Gamma   Itakura-Saito       spectral data
  generally nonnegative  Tweedie                β-divergence        generalises above models
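The Poisson row, for instance, can be checked numerically: the Poisson negative log-likelihood −log p(v|λ) differs from the generalised KL divergence d(v|λ) only by a term depending on v, not on λ = [WH]_{fn} (a small sketch):

```python
import math

def poisson_nll(v, lam):
    """Poisson negative log-likelihood: -log p(v | lam)."""
    return lam - v * math.log(lam) + math.lgamma(v + 1)

def gen_kl(v, lam):
    """Generalised Kullback-Leibler divergence d(v | lam)."""
    return v * math.log(v / lam) - v + lam

v = 7
# The gap is constant in lam, so maximum likelihood under Poisson noise
# is equivalent to minimum generalised-KL fit.
gaps = [poisson_nll(v, lam) - gen_kl(v, lam) for lam in (0.5, 1.0, 3.0, 10.0)]
```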
Figure: Three representations of data.
IS-NMF on power spectrogram with K = 8
[Figure: dictionary W (log scale), coefficients H, and reconstructed time-domain components for K = 1, …, 8.]
Pitch estimates: 65.0, 68.0, 61.0, 72.0 (true values: 61, 65, 68, 72)
KL-NMF on magnitude spectrogram with K = 8
[Figure: dictionary W (log scale), coefficients H, and reconstructed time-domain components for K = 1, …, 8.]
Pitch estimates: 65.2, 68.2, 61.0, 72.2, 56.2 (true values: 61, 65, 68, 72)
Louis Armstrong and His Hot Five
[Figure: log-power spectrogram and original time-domain waveform.]
Decomposition: original mono = accompaniment + brass + trombone + noise.
[Audio examples: original mono denoised; original denoised & upmixed to stereo.]
(Sun and Mazumder, 2013)
Y = [ full-band training samples | band-limited samples ]
[Figure: adapted from (Sun and Mazumder, 2013).]
(Sun and Mazumder, 2013)
AC/DC example: band-limited data (Back in Black), training data (Highway to Hell), bandwidth-extended result, ground truth.
Examples from http://statweb.stanford.edu/~dlsun/bandwidth.html, used with permission from the author.
(Ozerov and Févotte, 2010)
Multichannel NMF problem: sources S modelled by NMF (W, H), mixing system A, mixture X observed with noise; estimate W, H and A from X.
◮ Best scores on the underdetermined speech and music separation task at the Signal Separation Evaluation Campaign (SiSEC) 2008.
◮ IEEE Signal Processing Society 2014 Best Paper Award.
(Ozerov, Févotte, Blouet, and Durrieu, 2011)
◮ the decomposition is guided by the operator: source activation time-codes are input to the separation system.
◮ forced zeros are set in H when a source is silent.
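This constraint is cheap to enforce because multiplicative updates preserve zeros: once an entry of H is set to zero it stays zero, so zeroing H where the time-codes say a source is silent holds at every iteration without any projection step. An illustrative sketch (standard KL-NMF updates on synthetic data, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
F, N, K = 16, 50, 3
V = rng.random((F, N)) + 0.1
W = rng.random((F, K))
H = rng.random((K, N))

# Operator-supplied time-codes: source 0 (row 0 of H) is silent from frame 20 on.
H[0, 20:] = 0.0

eps = 1e-12
for _ in range(50):  # multiplicative KL-NMF updates
    H *= (W.T @ (V / (W @ H + eps))) / W.sum(axis=0)[:, None]
    W *= ((V / (W @ H + eps)) @ H.T) / (H.sum(axis=1)[None, :] + eps)

# Multiplying a zero entry by any finite ratio leaves it at zero,
# so the forced zeros survive every update.
```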
References

M. W. Berry, M. Browne, A. N. Langville, V. P. Pauca, and R. J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis, 52(1):155–173, Sep. 2007.

J. M. Bioucas-Dias, A. Plaza, N. Dobigeon, M. Parente, Q. Du, P. Gader, and J. Chanussot. Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5(2):354–379, 2012.

A. Cichocki, R. Zdunek, and S. Amari. Csiszár's divergences for non-negative matrix factorization: Family of new algorithms. In Proc. International Conference on Independent Component Analysis and Blind Signal Separation (ICA), pages 32–39, Charleston SC, USA, Mar. 2006.

A. Cichocki, H. Lee, Y.-D. Kim, and S. Choi. Non-negative matrix factorization with α-divergence. Pattern Recognition Letters, 29(9):1433–1440, July 2008.

M. E. Daube-Witherspoon and G. Muehllehner. An iterative image space reconstruction algorithm suitable for volume ECT. IEEE Transactions on Medical Imaging, 5(5):61–66, 1986. doi: 10.1109/TMI.1986.4307748.

I. S. Dhillon and S. Sra. Generalized nonnegative matrix approximations with Bregman divergences. In Advances in Neural Information Processing Systems (NIPS), 2005.

C. Févotte and J. Idier. Algorithms for nonnegative matrix factorization with the β-divergence. Neural Computation, 23(9):2421–2456, Sep. 2011. doi: 10.1162/NECO_a_00168. URL http://www.unice.fr/cfevotte/publications/journals/neco11.pdf.

C. Févotte, N. Bertin, and J.-L. Durrieu. Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis. Neural Computation, 21(3):793–830, Mar. 2009. doi: 10.1162/neco.2008.04-08-771. URL http://www.unice.fr/cfevotte/publications/journals/neco09_is-nmf.pdf.

L. Finesso and P. Spreij. Nonnegative matrix factorization and I-divergence alternating minimization. Linear Algebra and its Applications, 416:270–287, 2006.

T. Hofmann. Probabilistic latent semantic indexing. In Proc. ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 1999. URL http://www.cs.brown.edu/~th/papers/Hofmann-SIGIR99.pdf.

D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.

D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 13, pages 556–562, 2001.

L. B. Lucy. An iterative technique for the rectification of observed distributions. The Astronomical Journal, 79:745–754, 1974. doi: 10.1086/111605.

A. Ozerov and C. Févotte. Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Transactions on Audio, Speech and Language Processing, 18(3):550–563, Mar. 2010. doi: 10.1109/TASL.2009.2031510. URL http://www.unice.fr/cfevotte/publications/journals/ieee_asl_multinmf.pdf.

A. Ozerov, C. Févotte, R. Blouet, and J.-L. Durrieu. Multichannel nonnegative tensor factorization with structured constraints for user-guided audio source separation. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, May 2011.

P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5:111–126, 1994.

P. Smaragdis. Keynote address, 2013. URL http://web.engr.illinois.edu/~paris/pubs/smaragdis-waspaa2013keynote.pdf.

D. L. Sun and R. Mazumder. Non-negative matrix completion for bandwidth extension: A convex optimization approach. In Proc. IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2013.

Z. Yang and E. Oja. Unified development of multiplicative algorithms for linear and quadratic nonnegative matrix factorization. IEEE Transactions on Neural Networks, 22:1878–1891, Dec. 2011.