An Efferent-inspired Auditory Model Front-end for Speech Recognition


SLIDE 1

An Efferent-inspired Auditory Model Front-end for Speech Recognition

Chia-ying Lee, James Glass and Oded Ghitza*

MIT Computer Science and Artificial Intelligence Lab, Cambridge, MA, USA *Boston University Hearing Research Lab, Boston, MA, USA

SLIDE 4

Motivation

  • Humans vs. Automatic Speech Recognizers (ASRs)
  • Humans are particularly good at dealing with previously unseen or dynamic noise.
  • Mounting evidence of the role of efferent feedback in mammalian auditory systems
  • Operating point of the cochlea is regulated by background noise
  • Results in stable internal representations
  • Explore the potential use of a feedback mechanism for ASR
  • Use an MOC (medial olivocochlear) efferent-inspired auditory model as an ASR front-end

  • Use a MOC efferent-inspired auditory model as an ASR front-end
SLIDE 5

An Efferent-inspired Auditory Model

  • Messing et al., 2009

[Diagram: Middle Ear → Cochlea → Inner Hair Cell → Dynamic Range Window, with efferent gain G]


SLIDE 7

Model of Ascending Pathway

[Diagram: Middle Ear → Cochlea → Inner Hair Cell → Dynamic Range Window]

  • Middle Ear
  • Modeled by a high-pass filter
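A minimal sketch of this middle-ear stage as a one-pole high-pass filter. The cutoff frequency and filter order here are illustrative assumptions, not the exact values used in the model:

```python
import math

def highpass(signal, fs, fc=700.0):
    """First-order high-pass filter; fc is an assumed cutoff."""
    # y[n] = a * (y[n-1] + x[n] - x[n-1]), with a = 1 / (1 + 2*pi*fc/fs)
    a = 1.0 / (1.0 + 2.0 * math.pi * fc / fs)
    y, y_prev, x_prev = [], 0.0, 0.0
    for x in signal:
        y_prev = a * (y_prev + x - x_prev)
        x_prev = x
        y.append(y_prev)
    return y

fs = 16000
# A DC (0 Hz) input is strongly attenuated; a high-frequency tone passes.
dc = [1.0] * 1000
tone = [math.sin(2 * math.pi * 4000 * n / fs) for n in range(1000)]
dc_out = highpass(dc, fs)
tone_out = highpass(tone, fs)
```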
SLIDE 8

Model of Ascending Pathway

  • J. Goldstein, 1990
  • Multi-Band Path Non-Linear model (MBPNL)

[Diagram: Middle Ear → Nonlinear Cochlea → Inner Hair Cell → Dynamic Range Window]

SLIDE 9

MBPNL Model

  • Models cochlear nonlinearity
  • Example for center frequency = 1820 Hz
  • Filter characteristics change instantaneously as a function of input signal strength

SLIDE 10

[Diagram: Middle Ear → Cochlea → Inner Hair Cell → Dynamic Range Window]

  • Inner Hair Cell
  • Generic MIT model
  • A half-wave rectifier followed by a low-pass filter

Model of Ascending Pathway
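The inner hair cell stage described above (half-wave rectifier followed by a low-pass filter) can be sketched as follows; the 1 kHz cutoff is an illustrative assumption:

```python
import math

def inner_hair_cell(signal, fs, fc=1000.0):
    """Half-wave rectification followed by a one-pole low-pass filter."""
    rectified = [max(x, 0.0) for x in signal]      # half-wave rectifier
    b = 1.0 - math.exp(-2.0 * math.pi * fc / fs)   # one-pole LPF coefficient
    y, state = [], 0.0
    for x in rectified:
        state += b * (x - state)                   # smooth the rectified signal
        y.append(state)
    return y

fs = 16000
tone = [math.sin(2 * math.pi * 100 * n / fs) for n in range(1600)]
envelope = inner_hair_cell(tone, fs)
# Output is non-negative and follows the positive half-cycles of the tone.
```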

SLIDE 11

[Diagram: Middle Ear → Cochlea → Inner Hair Cell → Dynamic Range Window]

  • Dynamic Range Window (DRW)
  • A hard limiter with upper and lower bounds, representing the dynamic range of auditory nerve firing

Model of Ascending Pathway

SLIDE 12

Dynamic Range Window

[Plot: DRW input-output characteristic with lower and upper bounds]

  • No firing for signals below the lower bound
  • Saturation in firing rate for signals above the upper bound
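The DRW behavior described above is a simple hard limiter; the bound values below are illustrative:

```python
def dynamic_range_window(signal, lower, upper):
    """Hard limiter: values below `lower` are floored (no firing),
    values above `upper` saturate (firing-rate saturation)."""
    return [min(max(x, lower), upper) for x in signal]

out = dynamic_range_window([-2.0, 0.1, 0.5, 3.0], lower=0.0, upper=1.0)
# → [0.0, 0.1, 0.5, 1.0]
```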

SLIDE 14

An Efferent-inspired Auditory Model

[Diagram: noise n(t) → Middle Ear → Cochlea → Inner Hair Cell → Dynamic Range Window, with feedback gain G]

  • G is adjusted based on the background noise such that the output of the DRW is at the “epsilon level”.
  • G impacts the filter response in the MBPNL cochlear model.
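A hedged sketch of this closed-loop idea: a gain G applied ahead of a highly simplified "cochlea + DRW" is adjusted until the DRW output on noise alone sits at a small target level ("epsilon level"). The level measure and multiplicative update rule here are illustrative assumptions, not the actual MBPNL-based mechanism:

```python
def drw(x, lower=0.0, upper=1.0):
    # Hard limiter standing in for the Dynamic Range Window
    return [min(max(v, lower), upper) for v in x]

def mean_level(x):
    # Crude output-level measure (assumption for illustration)
    return sum(abs(v) for v in x) / len(x)

def tune_gain(noise, epsilon=0.05, g=1.0, steps=50):
    """Adjust G until the DRW output on background noise reaches epsilon."""
    for _ in range(steps):
        level = mean_level(drw([g * v for v in noise]))
        if level > epsilon:
            g *= 0.9   # output too strong: attenuate
        else:
            g *= 1.05  # output too weak: amplify
    return g

noise = [0.3, -0.4, 0.2, -0.1, 0.5, -0.3] * 100
g = tune_gain(noise)
```

The tuned G would then be held fixed while the noisy speech is processed, as in the following slides.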

SLIDE 15

An Efferent-inspired Auditory Model

  • The noisy speech signal is processed by the tuned auditory model.

[Diagram: s(t) + n(t) → Middle Ear → Cochlea → Inner Hair Cell → Dynamic Range Window, with feedback gain G]

SLIDE 16

Definitions

[Diagram: Middle Ear → Nonlinear Cochlea → Inner Hair Cell → Dynamic Range Window]

  • Open-loop model
  • The model for the ascending pathway
SLIDE 17

Definitions

[Diagram: Middle Ear → Nonlinear Cochlea → Inner Hair Cell → Dynamic Range Window, with feedback gain G]

  • Closed-loop model
  • The ascending pathway model with the efferent-inspired feedback
SLIDE 18

Visual Illustration

[Figure: spectrogram-like representations from the short-time Fourier transform and the closed-loop model]

  • Rows represent speech in different types of noise at 10 dB SNR
SLIDE 19

A Closed-loop Front-end for ASR

[Diagram: s(t)+n(t) → Middle Ear → Cochlea → Inner Hair Cell → Dynamic Range Window, with feedback gain G]

  • Need to extract features that can be processed by speech recognizers
SLIDE 20

A Closed-loop Front-end for ASR

[Diagram: s(t)+n(t) → auditory model (Middle Ear, Cochlea, Inner Hair Cell, Dynamic Range Window, with feedback gain G) → R(n) → DC Offset → Framing → Log → DCT]

  • The feature generation method follows the standard MFCC extraction process.
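The feature chain above can be sketched as follows. The per-frame mean subtraction (standing in for the DC-offset step), frame sizes, and coefficient count are illustrative assumptions, not the paper's exact settings:

```python
import math

def features(channel_outputs, frame_len=400, hop=160, n_coeffs=13):
    """channel_outputs: list of per-channel signals R(n) from the model."""
    n = len(channel_outputs[0])
    feats = []
    for start in range(0, n - frame_len + 1, hop):
        logs = []
        for ch in channel_outputs:
            frame = ch[start:start + frame_len]
            mean = sum(frame) / frame_len          # remove DC offset
            energy = sum((v - mean) ** 2 for v in frame) / frame_len
            logs.append(math.log(max(energy, 1e-10)))  # log, floored
        # DCT-II over channels decorrelates, as in MFCC extraction
        feats.append([
            sum(logs[j] * math.cos(math.pi * k * (j + 0.5) / len(logs))
                for j in range(len(logs)))
            for k in range(n_coeffs)
        ])
    return feats

# Toy input: 20 channels, 800 samples each
channels = [[math.sin(2 * math.pi * (c + 1) * n / 800) for n in range(800)]
            for c in range(20)]
feats = features(channels)
```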

SLIDE 21

Experimental Setup

  • Corpus creation (noisy speech data synthesis)
  • Feature extraction methods
  • Recognizer training and testing
  • Experimental results
SLIDE 22

Corpus Creation

  • Noise signals
  • Stationary noise: speech-shaped, white, pink
  • Non-stationary Aurora2 noise: train, subway
  • Speech signals
  • Aurora2 digits (TIDigits)
  • Noisy speech synthesis
  • Noise signals are fixed at 70 dB SPL
  • Speech signals are adjusted to create 5 to 20 dB SNRs
  • 300 ms adaptation prior to speech signal
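The synthesis step above (noise level fixed, speech scaled to hit a target SNR) can be sketched as follows; the calibration to absolute dB SPL and the prepended adaptation interval are omitted, and the signals here are toy examples:

```python
import math

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

def mix_at_snr(speech, noise, snr_db):
    """Scale speech against the fixed-level noise to reach snr_db, then add."""
    gain = (rms(noise) / rms(speech)) * 10 ** (snr_db / 20.0)
    return [gain * s + n for s, n in zip(speech, noise)]

fs = 8000
speech = [math.sin(2 * math.pi * 440 * n / fs) for n in range(8000)]
noise = [0.05 * math.sin(2 * math.pi * 97 * n / fs) for n in range(8000)]
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```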
SLIDE 23

Feature Extraction Methods

  • Three feature extraction methods
  • MFCC baseline with conventional normalization method
  • The open-loop auditory model (in paper)
  • The closed-loop auditory model
SLIDE 24

Recognizer Training and Testing

  • Standard Aurora2 HMM-based recognizer was used
  • Jackknifing experiments with mismatched training and test conditions

[Diagram: training sets (6672 utterances) and test sets (4004 utterances) spanning noise conditions N1-N5 at 20, 15, 10, and 5 dB SNR]

SLIDE 25

Experimental Results

Accuracy (%)

             MFCC Baseline   Closed-loop model
  Average         86                92
  STD              8.6               4.7

  • The closed-loop model performs 43% better than the MFCC baseline (a 43% relative reduction in error rate), and reduces variation across mismatched conditions by 45%.

SLIDE 26

Experimental Results

Accuracy (%), MFCC baseline:

  dB SNR   speech-shaped   White   Pink   Subway   Train
  20            95           92     91      88       94
  15            94           90     89      84       93
  10            91           85     85      76       92
  5             81           73     76      62       84
  Avg           90           85     85      77       91

Accuracy (%), closed-loop model:

  dB SNR   speech-shaped   White   Pink   Subway   Train
  20            96           94     95      93       96
  15            96           93     96      92       95
  10            94           91     95      89       93
  5             83           83     91      78       84
  Avg           92           90     94      88       92

  • The closed-loop model performed better than the baseline across all mismatched training and test conditions.

SLIDE 27

Conclusions

  • Key ideas
  • Efferent-inspired feedback regulates the operating point of the front-end
  • Results in a stable representation -- a desired property for ASR
  • Experimental validation
  • Digit recognition in noise in mismatched conditions with multiple noise types and SNRs
  • The closed-loop model outperformed the baseline across all mismatched training and test conditions.
  • The results indicate that incorporating feedback in the front-end shows promise for generating robust speech features.