

SLIDE 1

Monaural speech separation using source-adapted models

Ron Weiss, Dan Ellis

{ronw,dpwe}@ee.columbia.edu

LabROSA, Department of Electrical Engineering, Columbia University

2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics

Ron Weiss, Dan Ellis (Columbia University) Monaural speech separation using source-adapted models WASPAA 2007 1 / 15

SLIDE 2

Monaural speech separation

Given a single-channel recording of multiple talkers

Infer the original source signals from the mixture

Under-determined: more unknowns (sources) than observations


SLIDE 3

Speech separation challenge [Cooke and Lee, 2006]

Single-channel, two-talker mixtures of utterances from 34 speakers

Constrained grammar: command(4) color(4) preposition(4) letter(25) digit(10) adverb(4)

Task: determine letter and digit for source that said “white”

  • −9 to 6 dB TMR
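As an illustrative sketch, the constrained grammar can be modeled as six word slots. Only the slot sizes come from the slide; the word lists below are assumptions in the style of the GRID corpus used in the challenge:

```python
import random

# Slot sizes from the slide: command(4) color(4) preposition(4)
# letter(25) digit(10) adverb(4). Word lists are illustrative.
commands = ["bin", "lay", "place", "set"]
colors = ["blue", "green", "red", "white"]
preps = ["at", "by", "in", "with"]
letters = list("abcdefghijklmnopqrstuvxyz")  # 25 letters, 'w' excluded
digits = [str(d) for d in range(10)]
adverbs = ["again", "now", "please", "soon"]

random.seed(0)
# Sample one utterance: one word per slot, in grammar order.
utterance = " ".join(random.choice(slot) for slot in
                     (commands, colors, preps, letters, digits, adverbs))
```

The task then reduces to finding the letter and digit slots of whichever source produced "white".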


SLIDE 4

Model-based separation

[Figure: model means (dB) plotted as state index vs. frequency (kHz)]

Use constraints from prior signal models to guide separation

HMM over log-spectral features

Factorial model inference

Explain each frame of mixed signal as combination of model states

e.g. Iroquois [Kristjansson et al., 2006]

Speaker-dependent models

Acoustic dynamics and grammar constraints

Superhuman performance


SLIDE 5

Model-based separation - Limitations

Rely on speaker-dependent models to disambiguate sources

What if the task isn’t so well defined?

No a priori knowledge of speaker identities or grammar

Adapt speaker-independent source model [Ozerov et al., 2005]

Problems

1. Want to adapt to a single utterance; not enough data for MLLR

Use PCA to reduce the number of adaptation parameters: “eigenvoices”

2. Only observation is the mixed signal

Iterative separation/adaptation algorithm


SLIDE 6

Eigenvoices [Kuhn et al., 2000]

Train N speaker-dependent models → priors on the space of speaker variation

Pack model parameters (Gaussian means) into a speaker supervector

Principal component analysis finds orthonormal bases

Speaker model is a linear combination of bases: μ = μ̄ + wU + g

(μ: adapted model, μ̄: mean voice, w: weights, U: eigenvoice bases, g: gain)
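The eigenvoice construction can be sketched as PCA over speaker supervectors. All shapes and data here are illustrative stand-ins for the trained speaker-dependent models, not the actual system:

```python
import numpy as np

# Hypothetical setup: N speaker-dependent models, each with S states
# of F-dimensional Gaussian means (log-spectral features).
N, S, F = 34, 35, 129
rng = np.random.default_rng(0)
supervectors = rng.standard_normal((N, S * F))  # stand-in for trained SD means

# Mean voice and eigenvoice bases via PCA (SVD of the centered supervectors).
mean_voice = supervectors.mean(axis=0)
centered = supervectors - mean_voice
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
K = 3                    # number of eigenvoice dimensions kept
U = Vt[:K]               # (K, S*F) orthonormal eigenvoice bases

# Adapted model mean: mu = mean_voice + w U + g
w = np.array([0.5, -0.2, 0.1])   # adaptation weights (illustrative)
g = 0.0                          # gain term
mu = mean_voice + w @ U + g
```

Adapting a speaker model then means estimating only the K weights (plus gain) rather than all S·F mean parameters.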

SLIDE 7

Eigenvoice example

[Figure: mean voice and eigenvoice dimensions 1–3, each plotted as frequency (kHz) vs. phone state]

SLIDE 8

Separation algorithm - Signal separation

Compose a factorial HMM from the adapted models

Find the maximum-likelihood path using the Viterbi algorithm

Reconstruct the source signals from the Viterbi path

[Figure: factorial HMM trellis combining the states of model 1 and model 2 to explain the observations over time]
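A toy sketch of Viterbi decoding over the joint state space of two source models. The elementwise-max combination of log-spectral means is an assumption for illustration (a common approximation in this literature); the slide itself does not specify the interaction model:

```python
import numpy as np

# Toy factorial Viterbi over joint states (i, j) of two source models.
rng = np.random.default_rng(1)
S, F, T = 4, 8, 10                       # states per model, features, frames
means = rng.standard_normal((2, S, F))   # per-model state means (illustrative)
logA = np.log(np.full((S, S), 1.0 / S))  # flat transitions (illustrative)
obs = rng.standard_normal((T, F))

def frame_ll(x):
    # Log-likelihood of frame x under each joint state (i, j), using the
    # max-approximation: mixture log-spectrum ~ max of the source means.
    ll = np.empty((S, S))
    for i in range(S):
        for j in range(S):
            pred = np.maximum(means[0, i], means[1, j])
            ll[i, j] = -0.5 * np.sum((x - pred) ** 2)
    return ll

# Joint transition matrix: logA[i,i'] + logA[j,j'], flattened to (S*S, S*S).
trans = (logA[:, None, :, None] + logA[None, :, None, :]).reshape(S * S, S * S)

# Viterbi recursion over the S*S joint state space.
delta = frame_ll(obs[0]).ravel()
back = []
for t in range(1, T):
    scores = delta[:, None] + trans
    back.append(scores.argmax(axis=0))
    delta = scores.max(axis=0) + frame_ll(obs[t]).ravel()

# Backtrace the maximum-likelihood joint path; each entry encodes (i, j).
path = [int(delta.argmax())]
for bp in reversed(back):
    path.append(int(bp[path[-1]]))
path = path[::-1]
```

Each joint state on the path assigns one state to each source model, from which per-source signals can be reconstructed.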


SLIDE 9

Separation algorithm - Model adaptation

Find the projection of the reconstructed source signals onto the eigenvoice bases

But the state sequence is hidden, so use EM

E-step: HMM forward-backward

M-step: project the signal frames onto the corresponding sequence of states from each eigenvoice basis vector

Iterate...
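The M-step projection can be sketched as a posterior-weighted least-squares estimate of the eigenvoice weights. Here `gamma` stands in for the state posteriors from forward-backward, and all data are illustrative assumptions:

```python
import numpy as np

# M-step sketch: re-estimate eigenvoice weights w from state posteriors.
rng = np.random.default_rng(2)
S, F, T, K = 4, 8, 10, 3
mean_voice = rng.standard_normal((S, F))   # mean-voice means per state
U = rng.standard_normal((K, S, F))         # eigenvoice bases, per state
frames = rng.standard_normal((T, F))       # reconstructed source frames
gamma = rng.random((T, S))                 # stand-in for forward-backward
gamma /= gamma.sum(axis=1, keepdims=True)  # normalize posteriors per frame

# Weighted least squares: minimize sum_{t,s} gamma[t,s] ||x_t - mu_s(w)||^2
# where mu_s(w) = mean_voice[s] + sum_k w[k] * U[k, s].
A = np.zeros((K, K))
b = np.zeros(K)
for t in range(T):
    for s in range(S):
        r = frames[t] - mean_voice[s]   # residual against the mean voice
        Us = U[:, s, :]                 # (K, F) bases for state s
        A += gamma[t, s] * (Us @ Us.T)
        b += gamma[t, s] * (Us @ r)
w = np.linalg.solve(A, b)               # new eigenvoice weights
```

Alternating this with the separation step gives the iterative separation/adaptation loop.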


SLIDE 10

Separation example

[Figure: spectrograms (frequency in kHz vs. time in sec) of the mixture t32_swil2a_m18_sbar9n, the separation after adaptation iterations 1, 3, and 5, and the SD-model separation]


SLIDE 11

Performance

[Figure: letter-digit accuracy (%) vs. adaptation iteration for different-gender, same-gender, and same-talker mixtures]

Letter-digit accuracy averaged across all TMRs

Adaptation improves separation

Same-talker case is hard: source permutations


SLIDE 12

Performance - Adapted vs. source-dependent models

[Figure: accuracy vs. TMR (6 dB down to −9 dB) for same-talker, same-gender, and different-gender mixtures, comparing SD, SA, and SI models against the baseline]

SLIDE 13

Performance - Held out speakers

[Figure: accuracy vs. number of training speakers (34, 30, 20, 10) for the SA and SD systems, on same-gender and different-gender mixtures]

Trained models on a subset of speakers

Tested on mixtures from held-out speakers

Performance suffers for both systems

Relative decrease is significantly larger for SD than for SA

Open question: scale


SLIDE 14

Summary

Limitations of model-based source separation

Algorithm for model adaptation from the mixed signal

Significant improvement over speaker-independent models

Speaker-dependent models better on matched training/testing data

Adaptation generalizes better to held-out speakers


SLIDE 15

References

Cooke, M. and Lee, T.-W. (2006). The speech separation challenge.

Kristjansson, T., Hershey, J., Olsen, P., Rennie, S., and Gopinath, R. (2006). Super-human multi-talker speech recognition: the IBM 2006 speech separation challenge system. In Proceedings of Interspeech.

Kuhn, R., Junqua, J., Nguyen, P., and Niedzielski, N. (2000). Rapid speaker adaptation in eigenvoice space. IEEE Transactions on Speech and Audio Processing, 8(6):695–707.

Ozerov, A., Philippe, P., Gribonval, R., and Bimbot, F. (2005). One microphone singing voice separation using source-adapted models. In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.

SLIDE 16

Separation algorithm - Initialization

Fast convergence needs good initialization

Want to differentiate the source models to get the best separation

Get an initial coefficient for each eigenvoice dimension independently

Coarsely quantize the eigenvoice weights

Find the most likely combination in the mixture

[Figure: eigenvoice weights w1 vs. w2 per speaker, showing male and female speakers separating along the leading dimensions]
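The initialization can be sketched as a coarse grid search over quantized eigenvoice weights, keeping the setting that best explains the mixture. The squared-error likelihood proxy and all data here are illustrative assumptions:

```python
import numpy as np

# Grid-search initialization over coarsely quantized eigenvoice weights.
rng = np.random.default_rng(3)
F, K = 8, 2
mean_voice = rng.standard_normal(F)
U = rng.standard_normal((K, F))          # eigenvoice bases (illustrative)
mixture = rng.standard_normal((5, F))    # mixture frames (illustrative)
grid = np.array([-1.0, 0.0, 1.0])        # coarse quantization levels

best_w, best_score = None, -np.inf
for w0 in grid:
    for w1 in grid:
        mu = mean_voice + np.array([w0, w1]) @ U
        # Crude likelihood proxy: negative squared error to the frames.
        score = -np.sum((mixture - mu) ** 2)
        if score > best_score:
            best_w, best_score = np.array([w0, w1]), score
```

The winning quantized weights seed the EM loop, which then refines them continuously.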