Improved Soft Decisions in Missing Data ASR: Using Harmonicity in Conjunction with Local SNR Estimates



SLIDE 1

Improved Soft Decisions in Missing Data ASR: Using Harmonicity in Conjunction with Local SNR Estimates

Speech and Hearing Research Group,
Dept. Computer Science,
University of Sheffield, UK

January 24, 2001

SLIDE 2

Improved Soft Decisions in Missing Data ASR: Combining Masks

  • Soft Decisions in Missing Data
  • Harmonicity-based Fuzzy Masks
  • Merging Local SNR and Harmonicity Masks
  • Aurora 2000 Results
  • Conclusions

September 25, 2000

SLIDE 3

Soft Decisions in Missing Data

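[Figure: an SNR estimate over time and frequency is converted to a mask in two ways: a threshold gives a discrete 0/1 mask, and a function F(S) gives a fuzzy mask with values between 0 and 1.]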

Soft mask values are interpreted as "the probability that the data is reliable". So rather than use the present data likelihood OR the missing data ‘induction constraint’, every point uses a weighted sum of BOTH terms.
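A minimal sketch of the two mask types, assuming the fuzzy mask is a sigmoid of the local SNR estimate. The function names, the slope and the 0 dB threshold are illustrative assumptions, not values taken from the slides.

```python
import math

def discrete_mask(snr_db, threshold_db=0.0):
    """Hard 0/1 decision: 1 where the local SNR estimate exceeds the threshold."""
    return [[1.0 if s > threshold_db else 0.0 for s in frame] for frame in snr_db]

def fuzzy_mask(snr_db, slope=0.5, threshold_db=0.0):
    """Soft decision in (0, 1): a sigmoid of the local SNR estimate (assumed form)."""
    return [[1.0 / (1.0 + math.exp(-slope * (s - threshold_db))) for s in frame]
            for frame in snr_db]

# Toy local SNR estimates in dB: 2 frames x 3 frequency channels.
snr = [[-10.0, 0.0, 10.0],
       [3.0, -3.0, 20.0]]
hard = discrete_mask(snr)
soft = fuzzy_mask(snr)
```

A point exactly at the threshold gets a soft value of 0.5, so it contributes equally to the present and missing terms instead of being forced to one side.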

SLIDE 4

Using Soft Decisions

Missing data probability calculation for discrete masks, showing the separate present and missing components:

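$$ f(\mathbf{x} \mid q) = \prod_{i \in P} f(x_i \mid q) \prod_{j \in M} \frac{1}{x_j} \int_0^{x_j} f(x_j' \mid q)\, dx_j' $$

where P and M are the sets of present and missing components.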

With soft decisions the probability due to each feature vector component becomes a weighted sum of the present and missing probability terms:

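$$ f(\mathbf{x} \mid q) = \prod_{i=1}^{N} \left[ m_i\, f(x_i \mid q) + (1 - m_i)\, \frac{1}{x_i} \int_0^{x_i} f(x_i' \mid q)\, dx_i' \right] $$

where m_i is the soft mask value for component i.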
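The weighted sum can be sketched for a single diagonal Gaussian state as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, `m` holds the soft mask values, and the features are assumed to be positive spectral energies so that the bounded integral from 0 to x is well defined.

```python
import math

def gauss_pdf(x, mu, sigma):
    """Present term: Gaussian density evaluated at the observed value."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gauss_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def missing_term(x, mu, sigma):
    """Missing term: (1/x) times the Gaussian mass between 0 and x (needs x > 0)."""
    return (gauss_cdf(x, mu, sigma) - gauss_cdf(0.0, mu, sigma)) / x

def soft_likelihood(x, m, mu, sigma):
    """Product over components of m_i * present + (1 - m_i) * missing."""
    total = 1.0
    for x_i, m_i, mu_i, s_i in zip(x, m, mu, sigma):
        present = gauss_pdf(x_i, mu_i, s_i)
        missing = missing_term(x_i, mu_i, s_i)
        total *= m_i * present + (1.0 - m_i) * missing
    return total
```

With mask values restricted to 0 and 1 this reduces to the discrete-mask calculation: each component contributes either its present term or its missing term.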

SLIDE 5

Using Soft Decisions

Generalising to models employing Gaussian mixtures:

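$$ f(\mathbf{x} \mid q) = \sum_{k=1}^{K} P(k \mid q) \prod_{i=1}^{N} \left[ m_i\, f(x_i \mid k, q) + (1 - m_i)\, \frac{1}{x_i} \int_0^{x_i} f(x_i' \mid k, q)\, dx_i' \right] $$

where k indexes the K mixture components of state q and m_i is the soft mask value for component i.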
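For mixtures, the per-component weighted sum is evaluated inside each mixture component and the results are summed with the mixture weights. The sketch below assumes diagonal Gaussians and positive-valued features; all names are hypothetical.

```python
import math

def present(x, mu, sigma):
    """Gaussian density at the observed value."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def missing(x, mu, sigma):
    """(1/x) times the Gaussian mass between 0 and x (needs x > 0)."""
    def cdf(v):
        return 0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0))))
    return (cdf(x) - cdf(0.0)) / x

def gmm_soft_likelihood(x, m, weights, means, sigmas):
    """Sum over mixtures of P(k|q) times the soft-decision product over components."""
    total = 0.0
    for p_k, mu_k, sigma_k in zip(weights, means, sigmas):
        prod = 1.0
        for x_i, m_i, mu, s in zip(x, m, mu_k, sigma_k):
            prod *= m_i * present(x_i, mu, s) + (1.0 - m_i) * missing(x_i, mu, s)
        total += p_k * prod
    return total
```

A practical decoder would normally work with log likelihoods to avoid underflow over many components; that is omitted here to keep the correspondence with the formula direct.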

SLIDE 6

Harmonicity Masks

[Figure: fuzzy harmonicity mask generation. Noisy signal –> gammatone filterbank (32 frequency channels) –> haircell, instantaneous envelope –> autocorrelation, applied to each channel over a temporal window –> autocorrelogram (32 frequency channels, 150 lags) –> sum across channels –> summary correlogram –> pitch peak tracking. The peak's lag index (~1/f0) selects a lag from the (freq, lag) correlogram, and a sigmoid with parameters (s1, T1) gives the fuzzy harmonicity mask. The peak's height gives the degree of voicing. Remaining labels: f0 model, frequency.]

  • The Harmonicity Mask is designed to mark voiced speech regions.
  • It works well when noise is inharmonic or the SNR is favourable.
  • Refinements necessary when noise is harmonic and dominant:
    –> pitch tracking, multisource decoding?
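The harmonicity mask for one frame can be sketched from a precomputed correlogram. This is a simplified illustration: the input layout, the normalisation of the voicing measure by the zero-lag value, and the sigmoid slope and threshold are all assumptions, and the filterbank and autocorrelation stages are not shown.

```python
import math

def sigmoid(v, slope, threshold):
    return 1.0 / (1.0 + math.exp(-slope * (v - threshold)))

def harmonicity_mask(correlogram, min_lag, max_lag, slope=10.0, threshold=0.5):
    """correlogram[c][lag]: normalised autocorrelation of channel c for one frame."""
    n_lags = len(correlogram[0])
    # Summary correlogram: sum the autocorrelation across channels at each lag.
    summary = [sum(ch[lag] for ch in correlogram) for lag in range(n_lags)]
    # Pitch peak: the largest summary value in the allowed lag range (~1/f0).
    peak_lag = max(range(min_lag, max_lag + 1), key=lambda lag: summary[lag])
    # Degree of voicing: peak height relative to the zero-lag value (assumed).
    voicing = summary[peak_lag] / summary[0]
    # Per-channel harmonicity: each channel's autocorrelation at the peak lag,
    # mapped to (0, 1) by a sigmoid.
    mask = [sigmoid(ch[peak_lag], slope, threshold) for ch in correlogram]
    return mask, peak_lag, voicing

# Toy frame: two channels periodic at lag 4, one noise-like channel.
frame = [[1.0, 0.0, -1.0, 0.0, 1.0, 0.0],
         [1.0, 0.0, -1.0, 0.0, 1.0, 0.0],
         [1.0, 0.1, 0.0, -0.1, 0.05, 0.0]]
mask, peak_lag, voicing = harmonicity_mask(frame, min_lag=2, max_lag=5)
```

In the toy frame the two periodic channels receive mask values near 1 and the noise-like channel a value near 0, which is the behaviour the slide describes for inharmonic noise.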

SLIDE 7

Mask Combination

We now have two fuzzy masks:

  • Fuzzy SNR-based mask - Works well in stationary noise.
  • Fuzzy Harmonicity-based mask - Highlights voiced speech regions.

We also have a ‘degree of voicing’ parameter, V.

How do we combine the masks?

SLIDE 8

Mask Combination

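Discrete combination: (One parameter)

If the degree of voicing V exceeds a threshold, the frame is Voiced, else the frame is Unvoiced. Then,

Voiced frames –> Use harmonicity-based mask

Unvoiced frames –> Fall back on SNR masks

Fuzzy combination: (Two parameters)

$$ w M_h + (1 - w) M_s $$

where M_h is the harmonicity mask, M_s is the SNR mask and w is the combination weight.

[Figure: mask combination. The raw harmonicity data, the local SNR estimate and the degree of voicing are mapped through three sigmoids with parameters (s1, T1), (s2, T2) and (s3, T3), giving the harmonicity mask M_h, the SNR mask M_s and the weight w, which are combined into the hybrid mask.]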
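Both combination rules can be sketched for a single frame. The sketch assumes that the weight w is a sigmoid of the degree of voicing, with the slope and threshold as the two parameters of the fuzzy rule; the function names and parameter values are illustrative.

```python
import math

def sigmoid(v, slope, threshold):
    return 1.0 / (1.0 + math.exp(-slope * (v - threshold)))

def combine_discrete(m_h, m_s, voicing, threshold):
    """One parameter: harmonicity mask in voiced frames, SNR mask otherwise."""
    return list(m_h) if voicing > threshold else list(m_s)

def combine_fuzzy(m_h, m_s, voicing, slope, threshold):
    """Two parameters: w = sigmoid(voicing), mask = w * M_h + (1 - w) * M_s."""
    w = sigmoid(voicing, slope, threshold)
    return [w * h + (1.0 - w) * s for h, s in zip(m_h, m_s)]

# Toy masks for one frame with two frequency channels.
m_h = [1.0, 0.0]   # harmonicity mask
m_s = [0.2, 0.8]   # SNR mask
voiced = combine_discrete(m_h, m_s, voicing=0.9, threshold=0.6)
unvoiced = combine_discrete(m_h, m_s, voicing=0.3, threshold=0.6)
halfway = combine_fuzzy(m_h, m_s, voicing=0.6, slope=10.0, threshold=0.6)
```

As the slope grows, the fuzzy rule approaches the discrete rule; a smaller slope blends the two masks in frames where the voicing evidence is ambiguous.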

SLIDE 9

Tuning the Voicing Sigmoid

[Figure: two panels, Clean and Car 10dB, each plotting voicing (0.4 to 1.2) against lag (~1/f0, 5 ms to 15 ms).]

Voicing vs. Lag for female and male speakers.

SLIDE 10

Comparison with Apriori Masks

Male "4382" + Car @ 20dB SNR

[Figure: four masks for the utterance, spanning 50 Hz to 3800 Hz over 1.7 secs: Apriori mask, Local SNR Estimate mask, Harmonicity Based mask, Combined mask.]

SLIDE 11

Comparison with Apriori Masks

Male "4382" + Car @ 10dB SNR

[Figure: four masks for the utterance, spanning 50 Hz to 3800 Hz over 1.7 secs: Apriori mask, Local SNR Estimate mask, Harmonicity Based mask, Combined mask.]

SLIDE 12

Aurora 2000 Experiments

  • Trained on clean data.
  • Testing using Set A (i.e. subway, exhibition, babble and car noises).
  • Features: 32 channel gammatone filter bank, + deltas.
  • Two slightly different sets of models:
      + Aurora Models: 16 states per digit,
      + DC Models: 11.5 states per digit on average.
  • 7 mixtures per state (note: a relatively large number of mixtures is needed for spectral features).

SLIDE 13

Aurora Results: Test Set A

[Figure: four panels (Subway Noise, Babble Noise, Car Noise, Exhibition Noise), each plotting WER against SNR (dB) from −5 to 20 dB and Clean, for Discrete SNR, Fuzzy SNR and +Harmonicity masks. Features: 32 channel filter bank + deltas.]

SLIDE 14

Aurora Results: WER averaged over noise condition

[Figure: WER against SNR (dB) from −5 to 20 dB and Clean, for Discrete SNR, Fuzzy SNR, +Harmonicity and (MultiCondition).]

MASK / SNR      -5dB   0dB   5dB  10dB  15dB  20dB  Clean
Discrete SNR    83.8  56.6  34.0  17.2   8.5   4.1    1.2
Fuzzy SNR       69.7  41.2  20.1  10.1   5.7   3.4    1.5
+ Harmonicity   66.6  36.4  16.9   8.3   4.3   2.5    1.4
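To read the table as relative improvements, the WER figures can be converted to a percentage reduction against each baseline. The snippet below only performs that arithmetic on the values above; the helper name is hypothetical.

```python
snr_labels = ["-5dB", "0dB", "5dB", "10dB", "15dB", "20dB", "Clean"]
wer = {
    "Discrete SNR":  [83.8, 56.6, 34.0, 17.2, 8.5, 4.1, 1.2],
    "Fuzzy SNR":     [69.7, 41.2, 20.1, 10.1, 5.7, 3.4, 1.5],
    "+ Harmonicity": [66.6, 36.4, 16.9, 8.3, 4.3, 2.5, 1.4],
}

def relative_reduction(baseline, improved):
    """Percentage WER reduction of `improved` relative to `baseline`."""
    return [100.0 * (b - i) / b for b, i in zip(baseline, improved)]

vs_discrete = relative_reduction(wer["Discrete SNR"], wer["+ Harmonicity"])
vs_fuzzy = relative_reduction(wer["Fuzzy SNR"], wer["+ Harmonicity"])
for label, d, f in zip(snr_labels, vs_discrete, vs_fuzzy):
    print(f"{label:>5}: {d:6.1f}% vs Discrete SNR, {f:6.1f}% vs Fuzzy SNR")
```

A negative value means the combined mask is worse than the baseline at that condition, which is the case against Discrete SNR on clean speech (1.4 against 1.2).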

SLIDE 15

Aurora WER Results: Aurora vs. DC Word Models

[Figure: WER against SNR (dB) from −5 to 20 dB and Clean, for 16 State Models and DC Models.]

Models / SNR      -5dB   0dB   5dB  10dB  15dB  20dB  Clean
16 State Models   66.6  36.4  16.9   8.3   4.3   2.5    1.4
DC Word Models    69.4  39.8  19.9   9.9   5.3   3.2    1.7

SLIDE 16

Conclusions

  • In combination, Harmonicity and Local SNR masks perform better than either mask individually, i.e.:
      + better approximation to the apriori (‘cheating’) mask,
      + better recognition results.
  • The mask generation parameters are robust, i.e. one set of parameters will perform well over a large range of noise types and noise levels.
  • Sensible values can be estimated from clean speech.

SLIDE 17

Further Work

  • Temporal Smoothing.
    Smoothing the masks appears to improve results for some noise types, but seriously damages results for others.
  • Using F0 Information.
    Using F0 to distinguish between voiced speech and harmonic noise. F0 tracking. ‘Multi-pitch’ decoding.
  • Adaptive Sigmoid Parameters.
    Techniques for fine tuning the mask generation parameters according to the noise estimate.
  • More General Mask Combination Techniques.

SLIDE 18

Learning Noise Specific Parameters

20 kHz TIDigits + Factory Noise

[Figure: digit recognition accuracy against SNR (dB), comparing Discrete SNR, Fuzzy SNR (ICSLP), Tuned Fuzzy Autoc/SNR and (Apriori).]

Parameters tuned to minimise distance to Apriori masks at 0 and 5 dB.
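One way such tuning could work is a grid search over the sigmoid parameters, scoring each candidate by its distance to the apriori mask. The slides do not say which distance or search was used; the mean squared difference, the grid and all names below are assumptions for illustration.

```python
import math

def fuzzy_mask(snr_db, slope, threshold_db):
    """Sigmoid mask for a flat list of local SNR estimates (assumed form)."""
    return [1.0 / (1.0 + math.exp(-slope * (s - threshold_db))) for s in snr_db]

def mask_distance(mask, apriori):
    """Mean squared difference between a fuzzy mask and the apriori 0/1 mask."""
    return sum((m - a) ** 2 for m, a in zip(mask, apriori)) / len(mask)

def tune(snr_db, apriori, slopes, thresholds):
    """Return the (slope, threshold) pair whose mask is closest to the apriori mask."""
    candidates = [(sl, th) for sl in slopes for th in thresholds]
    return min(candidates,
               key=lambda p: mask_distance(fuzzy_mask(snr_db, p[0], p[1]), apriori))

# Toy data: estimated local SNRs and the apriori mask for the same points.
snr_est = [-8.0, -3.0, 1.0, 4.0, 9.0, 15.0]
apriori = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
best = tune(snr_est, apriori, slopes=[0.2, 1.0, 5.0], thresholds=[0.0, 2.5, 5.0])
```

In practice the search would pool points from many utterances at the chosen SNRs, and a finer grid or a gradient-based optimiser could replace the exhaustive loop.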
