An Efficient Posterior Regularized Latent Variable Model for Interactive Sound Source Separation
Nicholas J. Bryan, Stanford University; Gautham J. Mysore, Adobe Research. ICML 2013.


SLIDE 1

An Efficient Posterior Regularized Latent Variable Model for Interactive Sound Source Separation

Nicholas J. Bryan, Stanford University
Gautham J. Mysore, Adobe Research
ICML 2013

Sound Check

SLIDE 2

Motivation I

§ Real-world sounds are mixtures of many individual sounds

SLIDE 3

Current State-of-the-Art

§ Non-negative matrix factorization (NMF)
  • [Lee & Seung, 2001; Smaragdis & Brown, 2003]
§ Related latent variable models (LVM)
  • [Raj & Smaragdis, 2005; Smaragdis et al., 2006]
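
As a point of reference for the NMF baseline cited on this slide, here is a minimal sketch of the Lee & Seung multiplicative updates for the generalized KL divergence. The function names `nmf_kl` and `kl_divergence`, the rank `K`, the iteration count, and the small constant `eps` are illustrative assumptions, not details taken from the slides.

```python
import numpy as np

def nmf_kl(V, K, n_iter=200, seed=0, eps=1e-12):
    """Factor a nonnegative matrix V (F x T) as W @ H, with W (F x K) and H (K x T)."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, T)) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        # Update the activations H with the dictionary W held fixed.
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
        # Update the dictionary W with the activations H held fixed.
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
    return W, H

def kl_divergence(V, V_hat, eps=1e-12):
    """Generalized KL divergence D(V || V_hat)."""
    return float(np.sum(V * np.log((V + eps) / (V_hat + eps)) - V + V_hat))
```

Each update only rescales the current factors by a nonnegative ratio, so W and H stay nonnegative without any projection step.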
SLIDE 4

Latent Variable Model

  • Probabilistic latent component analysis (PLCA) [Smaragdis et al., 2006]

$$X \approx P(f, t) = \sum_{z} P(z)\, P(f \mid z)\, P(t \mid z)$$

  • P(f|z): basis vectors, frequency components, dictionary
  • P(t|z): time activations or gains
  • P(z): latent component weights
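
The factorization on this slide can be written as a single matrix product. The sketch below assumes the three factors are stored as NumPy arrays with the shapes given in the docstring; the name `plca_reconstruct` and the toy sizes are hypothetical.

```python
import numpy as np

def plca_reconstruct(pz, pf_z, pt_z):
    """Return P(f, t) = sum_z P(z) P(f|z) P(t|z).

    pz:   shape (K,),   sums to 1
    pf_z: shape (F, K), each column sums to 1
    pt_z: shape (K, T), each row sums to 1
    """
    # Scale column z of P(f|z) by P(z), then sum over z via the matrix product.
    return (pf_z * pz) @ pt_z

# Toy example with random, properly normalized factors.
rng = np.random.default_rng(0)
F, T, K = 6, 5, 3
pz = rng.random(K)
pz /= pz.sum()
pf_z = rng.random((F, K))
pf_z /= pf_z.sum(axis=0, keepdims=True)
pt_z = rng.random((K, T))
pt_z /= pt_z.sum(axis=1, keepdims=True)
P = plca_reconstruct(pz, pf_z, pt_z)
```

Because every factor is a normalized distribution, `P` sums to 1 over all (f, t), so it models the spectrogram X up to an overall scale.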

SLIDE 5

Latent Variable Model

[Figure: P(f|z), P(z), P(t|z)]

$$X \approx P(f, t) = \sum_{z} P(z)\, P(f \mid z)\, P(t \mid z)$$

  • Solve via an expectation-maximization (EM) algorithm
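
A minimal sketch of the EM iteration for this model, assuming the standard PLCA updates: the E step computes the posterior P(z|f,t), and the M step reweights the data X by that posterior and renormalizes each factor. The function name `plca_em`, the random initialization, and `eps` are assumptions made for illustration.

```python
import numpy as np

def plca_em(X, K, n_iter=100, seed=0, eps=1e-12):
    """Fit P(z), P(f|z), P(t|z) to a nonnegative (F x T) array X with EM."""
    rng = np.random.default_rng(seed)
    F, T = X.shape
    pz = np.full(K, 1.0 / K)
    pf_z = rng.random((F, K))
    pf_z /= pf_z.sum(axis=0, keepdims=True)
    pt_z = rng.random((K, T))
    pt_z /= pt_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E step: posterior P(z | f, t), stored with shape (F, K, T).
        joint = pf_z[:, :, None] * pz[None, :, None] * pt_z[None, :, :]
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)
        # M step: weight the data by the posterior, then renormalize.
        w = X[:, None, :] * post
        pf_z = w.sum(axis=2)
        pf_z /= pf_z.sum(axis=0, keepdims=True) + eps
        pt_z = w.sum(axis=0)
        pt_z /= pt_z.sum(axis=1, keepdims=True) + eps
        pz = w.sum(axis=(0, 2))
        pz /= pz.sum() + eps
    return pz, pf_z, pt_z
```

The (F, K, T) posterior array keeps the code short; a memory-conscious implementation would fold the E step into the M step instead of materializing it.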
SLIDE 6

Latent Variable Model

[Figure: P(f|z), P(z), P(t|z), and the source posteriors P(s = s1|f, t) and P(s = s2|f, t)]

$$X \approx P(f, t) = \sum_{z} P(z)\, P(f \mid z)\, P(t \mid z)$$
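
The source posteriors P(s|f,t) named on this slide can be used as soft masks on the mixture. The sketch below assumes each latent component z is assigned to exactly one source and that P(s|f,t) is the share of P(f,t) explained by that source's components; `source_posteriors`, `separate`, and the `groups` argument are hypothetical names.

```python
import numpy as np

def source_posteriors(pz, pf_z, pt_z, groups, eps=1e-12):
    """Return one (F x T) array P(s | f, t) per source.

    groups[s] lists the latent components z assigned to source s.
    """
    total = (pf_z * pz) @ pt_z + eps  # P(f, t), summed over all z
    return [((pf_z[:, g] * pz[g]) @ pt_z[g, :]) / total for g in groups]

def separate(X, masks):
    """Apply each soft mask to the mixture spectrogram X."""
    return [m * X for m in masks]
```

Since the masks sum to 1 at every time-frequency point, the separated spectrograms add back up to the mixture.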

SLIDE 7

Problems

§ Requires isolated training data (supervised/semi-supervised)
§ Does not incorporate auditory/perceptual models of hearing
§ One-shot process, cannot correct for poor results
§ Very difficult, underdetermined problem

SLIDE 8

Focus

§ Eliminate the need for explicit training data
§ Method of user feedback to guide separation
§ Algorithm to incorporate the user feedback

SLIDE 9

Paradigm: Listen, Paint, Remove

[Figure panels: Speech + Cell Phone; Speech; Cell Phone; looping playback]

SLIDE 10

Latent Variable Model w/Painting Constraints

§ Incorporate painting annotations into the model

[Figure: p(f|z), p(t|z), p(z); Λ1, Λ2]

$$\tilde{P}(f, t) = \sum_{z} \tilde{P}(z)\, \tilde{P}(f \mid z)\, \tilde{P}(t \mid z)$$

SLIDE 11

Constraints

§ Constraints are typically encoded as:
  • Prior probabilities on model parameters
  • Direct observations
§ Does not (reasonably) allow time-frequency constraints
§ Posterior regularization [Graça et al., 2007, 2009]
  • Complementary method that allows time-frequency constraints
  • Iterative optimization procedure for each E step
  • Well suited for our problem

[Figure: P(f|z), P(z), P(t|z), P(z|f, t)]

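SLIDE 12

Expectation Maximization

$$\ln P(X \mid \Theta) = F(Q, \Theta) + \mathrm{KL}(Q \,\|\, P)$$

$$\ln P(X \mid \Theta) \ge F(Q, \Theta)$$

E Step:

$$Q^{n+1} = \arg\max_{Q} F(Q, \Theta^{n}) = \arg\min_{Q} \mathrm{KL}(Q \,\|\, P)$$

M Step:

$$\Theta^{n+1} = \arg\max_{\Theta} F(Q^{n+1}, \Theta)$$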
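
The decomposition ln P(X|Θ) = F(Q, Θ) + KL(Q||P) can be checked numerically on a toy model. The sketch below assumes one observation x and three latent states, with F(Q, Θ) defined in the usual way as the expected log joint plus the entropy of Q. The probability values are made up for illustration.

```python
import numpy as np

def free_energy(q, joint):
    """F(Q, Theta) = sum_z Q(z) * ln( P(x, z | Theta) / Q(z) )."""
    return float(np.sum(q * (np.log(joint) - np.log(q))))

def kl(q, p):
    """KL(Q || P) = sum_z Q(z) * ln( Q(z) / P(z) )."""
    return float(np.sum(q * (np.log(q) - np.log(p))))

# Toy joint P(x, z | Theta) for one observed x and three latent states z.
joint = np.array([0.10, 0.25, 0.05])
log_evidence = float(np.log(joint.sum()))   # ln P(x | Theta)
posterior = joint / joint.sum()             # P(z | x, Theta)

q = np.array([0.5, 0.3, 0.2])               # an arbitrary distribution Q
gap = log_evidence - free_energy(q, joint)  # equals KL(Q || P)
```

Setting Q equal to the posterior closes the gap entirely, which is why the unconstrained E step is just the posterior computation.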

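SLIDE 13

Expectation Maximization w/Posterior Constraints I

$$\ln P(X \mid \Theta) = F(Q, \Theta) + \mathrm{KL}(Q \,\|\, P)$$

$$\ln P(X \mid \Theta) \ge F(Q, \Theta)$$

E Step:

$$Q^{n+1} = \arg\max_{Q \in \mathcal{Q}} F(Q, \Theta^{n}) = \arg\min_{Q \in \mathcal{Q}} \mathrm{KL}(Q \,\|\, P)$$

M Step:

$$\Theta^{n+1} = \arg\max_{\Theta} F(Q^{n+1}, \Theta)$$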

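SLIDE 14

Linear Grouping Expectation Constraints

  • For each time-frequency point of P(z|f, t), solve

$$\arg\min_{Q \in \mathcal{Q}} \mathrm{KL}\big( Q(z \mid f, t) \,\|\, P(z \mid f, t) \big)$$

[Figure: Λ1, Λ2]

$$\arg\min_{q} \; -q^{T} \ln p + q^{T} \ln q + q^{T} \lambda \quad \text{subject to} \quad q^{T} \mathbf{1} = 1, \; q \succeq 0$$

$$\lambda^{T} = [\Lambda_{1ft} \; \Lambda_{1ft} \; \Lambda_{1ft} \; \dots \; \Lambda_{2ft} \; \Lambda_{2ft} \; \Lambda_{2ft}]$$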
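
The per-point problem on this slide has a closed-form minimizer. Setting the gradient of its Lagrangian to zero gives q proportional to p ⊙ exp(−λ), renormalized to sum to 1. The slide does not state this, but it follows from the stated objective. The sketch below computes that solution and the objective value; the function names are hypothetical, and λ is assumed to hold one nonnegative penalty per latent component.

```python
import numpy as np

def constrained_posterior(p, lam):
    """Minimize -q.ln(p) + q.ln(q) + q.lam over the probability simplex."""
    q = p * np.exp(-lam)
    return q / q.sum()

def objective(q, p, lam):
    """Penalized KL objective: KL(q || p) + q . lam."""
    return float(q @ (np.log(q) - np.log(p) + lam))
```

With λ = 0 the solution is the unconstrained posterior p; a larger penalty on one source's components shifts probability mass toward the other source.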

SLIDE 15

Fast Updates

  • With a simple penalty, both E and M steps are in closed form
  • Reduces to simple, fast multiplicative updates (cf. NMF)
  • Roughly the same computational cost as without constraints
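
To make the closed-form updates concrete, here is a sketch of one way the penalized E step can be combined with the usual PLCA M step. It is written under stated assumptions and is not the authors' released software: each source owns a contiguous group of components, each source has one (F x T) nonnegative penalty matrix (zero where nothing was painted), and the names `pr_plca`, `groups`, `penalties`, and `damp` are hypothetical.

```python
import numpy as np

def pr_plca(X, groups, penalties, n_iter=100, seed=0, eps=1e-12):
    """Posterior-regularized PLCA sketch.

    X:         nonnegative (F x T) mixture spectrogram
    groups:    groups[s] lists the component indices z of source s;
               together the groups must cover 0 .. K-1 exactly once
    penalties: penalties[s] is the (F x T) nonnegative penalty for source s
    Returns (pz, pf_z, pt_z, q), where q is the last constrained
    posterior Q(z | f, t), stored with shape (F, K, T).
    """
    rng = np.random.default_rng(seed)
    F, T = X.shape
    K = sum(len(g) for g in groups)
    pz = np.full(K, 1.0 / K)
    pf_z = rng.random((F, K))
    pf_z /= pf_z.sum(axis=0, keepdims=True)
    pt_z = rng.random((K, T))
    pt_z /= pt_z.sum(axis=1, keepdims=True)

    # exp(-lambda) for every (f, z, t): components of one source share a penalty.
    damp = np.ones((F, K, T))
    for g, lam in zip(groups, penalties):
        damp[:, g, :] = np.exp(-lam)[:, None, :]

    for _ in range(n_iter):
        # E step: joint P(z) P(f|z) P(t|z) times exp(-lambda), normalized over z.
        q = pf_z[:, :, None] * pz[None, :, None] * pt_z[None, :, :] * damp
        q /= q.sum(axis=1, keepdims=True) + eps
        # M step: weight the data by Q, then renormalize each factor.
        w = X[:, None, :] * q
        pf_z = w.sum(axis=2)
        pf_z /= pf_z.sum(axis=0, keepdims=True) + eps
        pt_z = w.sum(axis=0)
        pt_z /= pt_z.sum(axis=1, keepdims=True) + eps
        pz = w.sum(axis=(0, 2))
        pz /= pz.sum() + eps
    return pz, pf_z, pt_z, q
```

Compared with unconstrained EM, the only extra work per iteration is one elementwise multiply by the precomputed `damp` array, which is consistent with the claim that the cost is roughly unchanged.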
SLIDE 16

Evaluation

§ BSS-EVAL metrics [Vincent et al., 2006]
  • Signal-to-Distortion Ratio (SDR)
  • Signal-to-Interference Ratio (SIR)
  • Signal-to-Artifact Ratio (SAR)
§ Test material
  • Cell phone + speech (C), drums + bass (D), orchestra + cough (O), piano + wrong note (P), siren + speech (S)
  • Vocals + background music (S1, S2, S3, S4)
§ Results
  • Outperformed prior state-of-the-art on tested material
  • Outperformed SiSEC 2011 vocals + background music winner
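
For intuition about the SDR figure, the sketch below computes a simplified energy ratio between a reference source and the error in its estimate. It is not the BSS-EVAL procedure: BSS-EVAL first projects the estimate onto the true sources and splits the error into interference and artifact terms, whereas this version treats the whole difference as distortion. The name `simple_sdr` is hypothetical.

```python
import numpy as np

def simple_sdr(reference, estimate):
    """Simplified distortion ratio in dB: reference energy over error energy."""
    error = estimate - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))
```

Higher values mean the estimate is closer to the reference. An estimate whose error has 1% of the reference energy scores 20 dB.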
SLIDE 17

Live Demonstration

SLIDE 18

Jackson 5 Remix

Jackson 5’s “I Want You Back”
Cher Lloyd’s “Want U Back”
Remix

SLIDE 19

A Look Back

§ Perceptual domain, objective evaluation is difficult
§ Human evaluation within the learning process
§ Processing training data only

SLIDE 20

Conclusion

§ Sound source separation algorithm
  • Time-frequency constraints via posterior regularization
  • No explicit training data
  • Efficient, interactive algorithm w/closed-form update equations
  • Improved separation quality over prior work
  • Open source software
§ Poster ID: 348
§ Demos at ccrma.stanford.edu/~njb/research/iss

SLIDE 21

An Efficient Posterior Regularized Latent Variable Model for Interactive Sound Source Separation

Nicholas J. Bryan, Stanford University
Gautham J. Mysore, Adobe Research
ICML 2013