Underdetermined Source Separation Using Speaker Subspace Models



SLIDE 1

Outline Introduction Speaker subspace model Monaural speech separation Binaural separation Conclusions

Underdetermined Source Separation Using Speaker Subspace Models

Thesis Defense Ron Weiss May 4, 2009

Ron Weiss Underdetermined Source Separation Using Speaker Subspace Models May 4, 2009 1 / 34

SLIDE 2

1. Introduction
2. Speaker subspace model
3. Monaural speech separation
4. Binaural separation
5. Conclusions


SLIDE 4

Audio source separation

Many real world signals contain contributions from multiple sources

E.g. cocktail party

Want to infer the original sources from the mixture

Applications: robust speech recognition, hearing aids

SLIDE 5

Previous work

Instantaneous mixing system

[y1(t) … yC(t)]ᵀ = A [x1(t) … xI(t)]ᵀ, where A is the C × I matrix of mixing coefficients aci

Simplest case: more channels than sources (overdetermined)

Perfect separation possible

Use constraints on source signals to guide separation

Independence constraints (e.g. independent component analysis)
Spatial constraints (e.g. beamforming)
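The overdetermined case can be sketched numerically: with more channels than sources, the pseudo-inverse of the mixing matrix recovers the sources exactly in the noiseless case. This is a toy illustration; the dimensions and values are not from the talk.

```python
import numpy as np

# Toy instantaneous mixture: C channels observe I sources through a
# C x I mixing matrix A, i.e. y(t) = A @ x(t) at every sample t.
rng = np.random.default_rng(0)
I, C, T = 2, 3, 100               # 2 sources, 3 channels: overdetermined
x = rng.standard_normal((I, T))   # source signals
A = rng.standard_normal((C, I))   # mixing matrix
y = A @ x                         # observed mixture

# With more channels than sources (and A full column rank), the
# pseudo-inverse recovers the sources exactly in the noiseless case.
x_hat = np.linalg.pinv(A) @ y
print(np.allclose(x_hat, x))  # → True
```

With fewer channels than sources (the underdetermined setting of this talk), `pinv` only gives a minimum-norm projection, so the stronger constraints discussed on the following slides are needed.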

SLIDE 6

Underdetermined source separation

More sources than channels: need stronger constraints
CASA: use perceptual cues similar to the human auditory system

Segment the STFT into short glimpses of each source, by harmonicity, common onset, etc.
Sequential grouping heuristics
Create a time-frequency mask for each source

Inference based on prior source models

SLIDE 7

Time-frequency masking

[Figure: spectrograms of the mixture, clean source, time-frequency masks, and reconstructed source (8.2 dB SNR)]

Natural sounds tend to be sparse in time and frequency

10% of spectrogram cells contain 78% of energy

And redundant

Still intelligible when 22% of source energy is masked
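The sparsity argument above can be sketched with an ideal binary mask: keep only the cells where the target dominates the mixture, and because natural sources are sparse, most of the target energy survives. Toy magnitude "spectrograms" stand in for a real STFT; all values are illustrative.

```python
import numpy as np

# Ideal binary time-frequency masking on toy magnitude "spectrograms".
rng = np.random.default_rng(1)
F, T = 64, 50
# Sparse sources: most cells near zero, a few large (mimics speech sparsity)
S1 = rng.standard_normal((F, T)) ** 4
S2 = rng.standard_normal((F, T)) ** 4
mix = S1 + S2

# Ideal binary mask: keep cells where source 1 dominates the mixture
mask = (S1 > S2).astype(float)
est1 = mask * mix

# Because the sources are sparse, most of S1's energy survives the mask
snr = 10 * np.log10(np.sum(S1**2) / np.sum((S1 - est1)**2))
print(f"SNR of masked reconstruction: {snr:.1f} dB")
```

The heavier-tailed the source distributions (i.e. the sparser the spectrograms), the higher this SNR; with overlapping non-sparse sources the binary mask fails.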

SLIDE 8

Model-based separation

Use constraints from prior source models to guide separation

Leverage differences in spectral characteristics of different sources

Hidden Markov models, log spectral features
Factorial model inference, e.g. IBM Iroquois system [Kristjansson et al., 2006]:

Speaker-dependent models
Acoustic dynamics and grammar constraints
Superhuman performance under some conditions

SLIDE 9

Model-based separation – Limitations

Rely on speaker-dependent models to disambiguate sources What if the task isn’t so well defined?

No prior knowledge of speaker identities or grammar

Use speaker-independent (SI) model for all sources

Need strong temporal constraints or sources will permute:
“place white by t 4 now” mixed with “lay green with p 9 again”
Separated source: “place white by t p 9 again”

Solution: adapt speaker-independent model to compensate

SLIDE 10

1. Introduction
2. Speaker subspace model (Model adaptation, Eigenvoices)
3. Monaural speech separation
4. Binaural separation
5. Conclusions

SLIDE 11

Model selection vs. adaptation

Model selection (e.g. [Kristjansson et al., 2006]): given a set of speaker-dependent (SD) models:

1. Identify the sources in the mixture
2. Use the corresponding models for separation

How to generalize to speakers outside of the training set? Selection: choose the closest model. Adaptation: interpolate.

[Figure: speaker models, mean voice, speaker subspace bases, and quantization boundaries in model space]

SLIDE 12

Model adaptation

Adjust model parameters to better match observations

Caveats:

1. Want to adapt to a single utterance; not enough data for MLLR or MAP. Need an adaptation framework with few parameters.
2. Observations are a mixture of multiple sources. Use an iterative separation/adaptation algorithm.

[Figure: original distribution adapted toward the observations in feature space]

SLIDE 13

Eigenvoice adaptation [Kuhn et al., 2000]

Train a set of SD models

Pack params into speaker supervector Samples from space of speaker variation

Principal component analysis to find orthonormal bases for the speaker subspace

Model is linear combination of bases

[Figure: speaker models and other models in the speaker subspace]

Eigenvoice adaptation

µ = µ̄ + U w + B h

adapted model mean = mean voice + (eigenvoice bases × eigenvoice weights) + (channel bases × channel weights)
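The adaptation equation can be sketched directly; the dimensions, bases, and weights below are illustrative, not the thesis's actual models.

```python
import numpy as np

# Sketch of eigenvoice adaptation mu = mu_bar + U w + B h: an adapted
# model mean is the mean voice plus a low-dimensional combination of
# eigenvoice (speaker subspace) bases and channel bases.
rng = np.random.default_rng(2)
D, K, J = 10, 3, 2                # supervector dim, eigenvoices, channel bases
mu_bar = rng.standard_normal(D)   # mean voice (speaker-independent model)
U = np.linalg.qr(rng.standard_normal((D, K)))[0]  # orthonormal eigenvoice bases
B = rng.standard_normal((D, J))   # channel bases
w = np.array([0.5, -1.0, 0.2])    # speaker (eigenvoice) weights
h = np.array([0.1, 0.3])          # channel weights

mu = mu_bar + U @ w + B @ h       # adapted model mean

# With orthonormal U, speaker weights are recoverable by projection
# (ignoring the channel term for this check):
w_hat = U.T @ (mu_bar + U @ w - mu_bar)
print(np.allclose(w_hat, w))  # → True
```

In the real system w and h are estimated from a mixed observation, not read off a clean supervector; this just shows the linear structure of the subspace model.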

SLIDE 14

Eigenvoice bases

Mean voice = speaker-independent model
Eigenvoices shift formant frequencies, add pitch
Independent bases to capture channel variation

[Figure: mean voice and first three eigenvoice dimensions, shown as spectral patterns across phones]

SLIDE 15

1. Introduction
2. Speaker subspace model
3. Monaural speech separation (Mixed signal model, Adaptation algorithm, Experiments)
4. Binaural separation
5. Conclusions

SLIDE 16

Eigenvoice factorial HMM

Model the mixture with a combination of source HMMs
Need adaptation parameters wi to estimate source signals xi(t), and vice versa

SLIDE 17

Adaptation algorithm

SLIDE 18

Adaptation example

[Figure: mixture t32_swil2a_m18_sbar9n; separated source after adaptation iterations 1, 3, and 5; SD model separation]

SLIDE 19

2006 Speech separation challenge [Cooke and Lee, 2006]

Single-channel mixtures of utterances from 34 different speakers
Constrained grammar:

command(4) color(4) preposition(4) letter(25) digit(10) adverb(4)

Separation/recognition task

Determine letter and digit for source that said “white”
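The grammar can be illustrated with a toy sentence generator. The slide gives only the set sizes (4, 4, 4, 25, 10, 4); the specific word lists below follow the GRID corpus used in the challenge and should be treated as an assumption.

```python
import random

# Speech Separation Challenge grammar: every utterance is
#   command color preposition letter digit adverb
commands = ["bin", "lay", "place", "set"]
colors = ["blue", "green", "red", "white"]
preps = ["at", "by", "in", "with"]
letters = list("abcdefghijklmnopqrstuvxyz")   # 25 letters: no 'w'
digits = [str(d) for d in range(10)]
adverbs = ["again", "now", "please", "soon"]

n_sentences = 4 * 4 * 4 * 25 * 10 * 4
print(n_sentences)  # → 64000 distinct sentences

rng = random.Random(0)
sentence = " ".join(rng.choice(s) for s in
                    [commands, colors, preps, letters, digits, adverbs])
print(sentence)  # a random grammatical utterance, e.g. of the form "place white by t 4 now"
```

The task keyword "white" and the letter/digit slots are exactly the parts the separation/recognition systems are scored on.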

SLIDE 20

Performance – Adapted vs. source-dependent models

[Figure: performance of adapted vs. source-dependent models; condition labels include −3 dB]

SLIDE 21

Experiments – Switchboard

[Figure: spectrogram of a Switchboard mixture]

What about previously unseen speakers?
Switchboard: corpus of conversational telephone speech

200+ hours, 500+ speakers

Task significantly more difficult than Speech Separation Challenge

Spontaneous speech
Large vocabulary
Significant channel variation across calls

SLIDE 22

Switchboard – Results

Adaptation outperforms SD model selection

Model selection errors due to channel variation

SD performance drops off under mismatched conditions
SA performance improves as the number of training speakers increases

SLIDE 23

1. Introduction
2. Speaker subspace model
3. Monaural speech separation
4. Binaural separation (Mixed signal model, Parameter estimation and source separation, Experiments)
5. Conclusions

SLIDE 24

Binaural audition

y^ℓ(t) = Σ_i x_i(t − τ_i^ℓ) ∗ h_i^ℓ(t)
y^r(t) = Σ_i x_i(t − τ_i^r) ∗ h_i^r(t)

Given a stereo recording of multiple sound sources, utilize spatial cues to aid separation:

Interaural time difference (ITD)
Interaural level difference (ILD)
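The two cues can be sketched on a toy binaural pair: ITD as the lag of the cross-correlation peak, ILD as an energy ratio in dB. The delay and gain values are illustrative.

```python
import numpy as np

# Estimate ITD and ILD from a synthetic left/right pair.
rng = np.random.default_rng(3)
x = rng.standard_normal(1000)   # source signal (white noise stand-in)
true_itd = 5                    # samples: right ear delayed by 5
gain = 0.5                      # right ear attenuated (ILD ≈ -6 dB)
left = x
right = gain * np.roll(x, true_itd)

# ITD: lag of the cross-correlation peak
lags = np.arange(-20, 21)
xcorr = [np.dot(left, np.roll(right, -lag)) for lag in lags]
itd_hat = int(lags[int(np.argmax(xcorr))])

# ILD: ratio of energies in dB
ild_hat = 10 * np.log10(np.sum(right**2) / np.sum(left**2))

print(itd_hat)            # → 5
print(round(ild_hat, 1))  # → -6.0
```

Real binaural mixtures complicate both cues with head-related filtering, reverberation, and spatial aliasing at high frequencies, which is what the per-frequency probabilistic models on the next slides address.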

SLIDE 25

MESSL: Interaural model [Mandel and Ellis, 2007]

Model-based EM Source Separation and Localization
Probabilistic model of the interaural spectrogram

Independent of underlying source signals

Assume each time-frequency cell is dominated by a single source
EM algorithm to learn model parameters for each source
Derive probabilistic time-frequency masks for separation
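A minimal sketch of the masking step: model each source's interaural phase difference (IPD) with one Gaussian per source and take the per-cell posterior over sources as a soft mask. This is a single E-step with hand-set, illustrative parameters; the actual MESSL model is considerably richer (per-frequency ITD/ILD models learned by EM).

```python
import numpy as np

# Per-source Gaussians over IPD -> posterior soft masks.
rng = np.random.default_rng(4)
F, T = 8, 10
mu = np.array([0.0, 1.0])   # per-source IPD means (radians), illustrative
sigma = 0.3
# Synthetic observed IPDs: each cell dominated by one of the two sources
dominant = rng.integers(0, 2, size=(F, T))
ipd = mu[dominant] + sigma * rng.standard_normal((F, T))

# Per-source likelihoods -> posterior soft masks (uniform source priors)
lik = np.stack([np.exp(-0.5 * ((ipd - m) / sigma) ** 2) for m in mu])
mask = lik / lik.sum(axis=0)    # masks sum to 1 at every cell

print(np.allclose(mask.sum(axis=0), 1.0))  # → True
# Fraction of cells assigned to the source that actually dominates them
acc = float(np.mean(mask.argmax(axis=0) == dominant))
print(acc)
```

Applying `mask` to the mixture spectrogram gives the separated estimates; in the full EM loop the source means and variances are re-estimated from the masked cells.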

SLIDE 26

MESSL-SP: Source prior

Extend MESSL to include a prior source model
Pre-trained GMM for the speech signals in the mixture
Channel model to compensate for HRTF and reverberation
Can incorporate eigenvoice adaptation (MESSL-EV)

SLIDE 27

Parameter estimation and source separation

SLIDE 28

Experiments

[Figure: reconstructed sources: Ground truth 12.04 dB, DUET 3.84 dB, 2S-FD-BSS 5.41 dB, MESSL 5.66 dB, MESSL-SP 10.01 dB, MESSL-EV 10.37 dB]


Mixtures of 2 and 3 speech sources, anechoic and reverberant
Evaluated on TIMIT and SSC test data
Source models trained on SSC data (32 components)
Compare MESSL systems to:

DUET – Clustering using ILD/ITD histogram [Yilmaz and Rickard, 2004]
2S-FD-BSS – Frequency-domain ICA [Sawada et al., 2007]

SLIDE 29

Experiments – Performance as function of distractor angle

[Figure: SNR improvement vs. distractor separation angle for 2 and 3 sources, anechoic and reverberant]

SLIDE 30

Experiments – Matched vs. mismatched

[Figure: average SNR improvement on GRID and TIMIT for Ground truth, MESSL-EV, MESSL-SP, MESSL, 2S-FD-BSS, and DUET]

SSC – matched train/test speakers

MESSL-EV and MESSL-SP beat the MESSL baseline by ∼3 dB in reverb
MESSL-EV beats MESSL-SP by ∼1 dB on anechoic mixtures

TIMIT – mismatched train/test speakers

Small difference between MESSL-EV and MESSL-SP

SLIDE 31

1. Introduction
2. Speaker subspace model
3. Monaural speech separation
4. Binaural separation
5. Conclusions

SLIDE 32

Summary

Prior signal models for underdetermined source separation
Subspace model for source adaptation

Adapt Gaussian means and covariances using a single utterance
Natural extension to compensate for source-independent channel effects

Monaural separation

Speaker-dependent > speaker-adapted ≫ speaker-independent
Adaptation helps generalize better to held-out speakers
Improves as the number of training speakers increases

Binaural separation

Extend MESSL framework to use source models (joint work with M. Mandel)
Improved performance by incorporating a simple SI model
Smaller improvement with adaptation

SLIDE 33

Contributions

Model-based source separation making minimal assumptions, using subspace adaptation
Extend the model-based approach to binaural separation

Ellis, D. P. W. and Weiss, R. J. (2006). Model-based monaural source separation using a vector-quantized phase-vocoder representation. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages V-957–960.
Weiss, R. J. and Ellis, D. P. W. (2006). Estimating single-channel source separation masks: Relevance vector machine classifiers vs. pitch-based masking. In Proc. ISCA Tutorial and Research Workshop on Statistical and Perceptual Audition (SAPA), pages 31–36.
Weiss, R. J. and Ellis, D. P. W. (2007). Monaural speech separation using source-adapted models. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 114–117.
Weiss, R. J. and Ellis, D. P. W. (2008). Speech separation using speaker-adapted eigenvoice speech models. Computer Speech and Language, in press.
Weiss, R. J., Mandel, M. I., and Ellis, D. P. W. (2008). Source separation based on binaural cues and source model constraints. In Proc. Interspeech, pages 419–422.
Weiss, R. J. and Ellis, D. P. W. (2009). A variational EM algorithm for learning eigenvoice parameters in mixed signals. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

SLIDE 34

References

Cooke, M. and Lee, T.-W. (2006). The speech separation challenge.
Kristjansson, T., Hershey, J., Olsen, P., Rennie, S., and Gopinath, R. (2006). Super-human multi-talker speech recognition: The IBM 2006 speech separation challenge system. In Proc. Interspeech, pages 97–100.
Kuhn, R., Junqua, J., Nguyen, P., and Niedzielski, N. (2000). Rapid speaker adaptation in eigenvoice space. IEEE Transactions on Speech and Audio Processing, 8(6):695–707.
Mandel, M. I. and Ellis, D. P. W. (2007). EM localization and separation using interaural level and phase cues. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).
Sawada, H., Araki, S., and Makino, S. (2007). A two-stage frequency-domain blind source separation method for underdetermined convolutive mixtures. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).
Yilmaz, O. and Rickard, S. (2004). Blind separation of speech mixtures via time-frequency masking. IEEE Transactions on Signal Processing, 52(7):1830–1847.

SLIDE 35

Extra slides

6. Extra slides

SLIDE 36

Factorial HMM separation

Each source signal is characterized by state sequence through its HMM Viterbi algorithm to find maximum likelihood path through combined factorial HMM Reconstruct source signals using Viterbi path Aggressively prune unlikely paths to speed up separation
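The decoding step can be sketched by brute force: run Viterbi over the product state space of two toy HMMs and split the joint path back into per-source state sequences. Model sizes and values are illustrative, and no pruning is applied here (the real system aggressively prunes unlikely paths).

```python
import numpy as np

# Factorial-HMM Viterbi over the product state space of two small HMMs.
np.random.seed(5)
S, T = 3, 6                        # states per source, frames
logA = np.log(np.full((S, S), 0.1) + 0.7 * np.eye(S))  # sticky transitions
logB = np.random.randn(2, S, T)    # per-source per-state frame log-likelihoods

# Product state (i, j): transition and observation scores add in log domain
idx = [(i, j) for i in range(S) for j in range(S)]
n = len(idx)
delta = np.array([logB[0, i, 0] + logB[1, j, 0] for i, j in idx])
back = np.zeros((T, n), dtype=int)
for t in range(1, T):
    # scores[p, q]: best score ending in product state q coming from p
    scores = delta[:, None] + np.array(
        [[logA[pi, i] + logA[pj, j] for i, j in idx] for pi, pj in idx])
    back[t] = scores.argmax(axis=0)
    delta = scores.max(axis=0) + np.array(
        [logB[0, i, t] + logB[1, j, t] for i, j in idx])

# Backtrace the joint path, then split into one state sequence per source
k = int(delta.argmax())
path = [k]
for t in range(T - 1, 0, -1):
    k = int(back[t, k])
    path.insert(0, k)
states1 = [idx[k][0] for k in path]
states2 = [idx[k][1] for k in path]
print(len(states1) == T and len(states2) == T)  # → True
```

The product space grows as S^2 (and S^I for I sources), which is why pruning is essential in practice.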

SLIDE 37

Adaptation algorithm initialization

[Figure: coarsely quantized eigenvoice weights (w1, w2) for male and female speakers]

Fast convergence needs good initialization
Want to differentiate source models to get the best initial separation
Treat each eigenvoice dimension independently:

Coarsely quantize weights
Find the most likely combination in the mixture

SLIDE 38

Adaptation performance

[Figure: average letter-digit accuracy vs. adaptation iteration for different-gender, same-gender, and same-talker mixtures]

Letter-digit accuracy averaged across all TMRs
Adaptation clearly improves separation
Same-talker case is hard: source permutations

SLIDE 39

Variational learning

Approximate EM algorithm to estimate adaptation parameters
Treat each source HMM independently
Introduce variational parameters to couple them

SLIDE 40

Performance – Learning algorithm comparison

Adapting Gaussian covariances as well as means significantly improves performance
Hierarchical algorithm outperforms variational EM
But the variational algorithm is significantly (∼4x) faster
At the same speed, variational EM performs better

SLIDE 41

Performance – Comparison to other participants

SLIDE 42

MESSL-EV: Putting it all together

Big mixture of Gaussians

Interaural model:
ITD: Gaussian for each source and time delay
ILD: single Gaussian for each source

Source model:
Separate channel responses for each source at each ear
Both channels share eigenvoice adaptation parameters

Explain each point in spectrogram by a particular source, time delay, and source model mixture component

SLIDE 43

MESSL-EV example

[Figure: per-cue masks: IPD 0.73 dB, ILD 8.54 dB, SP 7.93 dB, full mask 10.37 dB]

IPD informative in low frequencies, but not in high frequencies
ILD primarily adds information about high frequencies
Source model introduces correlations across frequency and emphasizes reliable time-frequency regions

Helps resolve ambiguities in interaural parameters from reverberation and spatial aliasing

SLIDE 44

Just for fun...