  1. A Low-Power Text-Dependent Speaker Verification System with Narrow-band Feature Pre-selection and Weighted Dynamic Time Warping. Qing He, Gregory Wornell, Wei Ma. June 21, 2016. Texas Instruments / MIT Signals, Information and Algorithms Lab.

  2. Motivation: Low-Power Wake-up
  • Conventionally, for voice wake-up, the host device is always ON:
    – High data acquisition rate to minimize information loss and to enable flexible downstream processing
    – Many stages of processing on high-dimensional data
  • Much lower power consumption can be achieved with an application-specific, voice-authenticated wake-up front-end:
    – Early-stage signal dimension reduction with analog components
    – Adaptive data acquisition and robust processing
  [Diagram: a host device (ADC + DSP) that is always ON draws >100 mW; a low-power front-end that is always ON draws ~50-300 µW and turns the host on only when it detects the wake-up signal.]

  3. System Architecture: A Comparison
  • Conventional system (high power consumption): high-rate sampling (e.g., ~24 kHz) → windowing and MFCC feature extraction → speaker verification → accept/reject. High-dimensional features and fast processing throughout.
  • Proposed system (low power consumption): low-rate analog front-end (< 4 kHz) → narrow-band filter bank extracting spectral features (NBSC) with a low-rate ADC → weighted DTW pattern match against enrollment samples → accept/reject.

  4. Spectral Feature Pre-Selection
  • A few carefully selected narrow bands are capable of preserving most speech information.
  [Roadmap figure: spectral feature pre-selection feeds the speaker-verification back-end, which produces the output decision.]

  5. Review: Speech Sound Generation
  [Figure: (a) excitation signal, (b) vocal tract modulation, (c) speech spectrogram.]
  • The vocal-tract modulation carries the information essential for speech recognition; the excitation and vocal-tract components are separable in the cepstral domain.

  6. Cepstral Representation of Speech
  [Figure: spectrogram → spectral density (harmonics plus vocal tract modulation) → IFFT → cepstral coefficients; axes in frequency (Hz), time (s), and quefrency (cycles/kHz).]
  • Acquiring the entire speech spectrum and transforming it to the cepstral domain is power-expensive.
  • Question: how can the essential features be extracted without acquiring the full spectrum or performing the transformation to the cepstral domain?
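  For reference, the following is a minimal sketch (not from the slides) of the cepstral transform the question refers to: the IFFT of a frame's log-magnitude spectrum, which separates the slowly varying vocal-tract envelope from the excitation harmonics. The frame length and sampling rate are illustrative assumptions.

```python
import numpy as np

def real_cepstrum(frame: np.ndarray) -> np.ndarray:
    """Real cepstrum of one windowed speech frame.

    The IFFT of the log-magnitude spectrum places the vocal-tract envelope at
    low quefrencies and the excitation harmonics near the pitch period.
    """
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # small offset avoids log(0)
    return np.fft.irfft(log_mag)

# Example: 25 ms frame at an assumed 8 kHz sampling rate (random stand-in for speech).
fs = 8000
frame = np.random.randn(int(0.025 * fs))
ceps = real_cepstrum(frame)
vocal_tract_part = ceps[:20]   # low-quefrency coefficients
```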

  7. Point-Wise Spectral Sampling on the Harmonics
  • Sampling the spectrum point-wise at the harmonic frequencies retains the cepstral coefficients.
  [Figure: spectrogram multiplied by a harmonic sampling pattern; axes in frequency (Hz), time (s), and magnitude (dB).]
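  A minimal sketch of the point-wise sampling idea, under the assumption that a pitch estimate f0 is available: the magnitude spectrum of a frame is read out only at (approximately) harmonic bins rather than over the full band. The frame length, sampling rate, and number of harmonics are illustrative assumptions.

```python
import numpy as np

def sample_at_harmonics(frame, fs, f0, n_harmonics=12):
    """Sample the magnitude spectrum of one frame at integer multiples of f0."""
    spectrum = np.abs(np.fft.rfft(frame))
    bin_width = fs / len(frame)                      # spacing of rfft bins in Hz
    harmonics = f0 * np.arange(1, n_harmonics + 1)
    # Round each harmonic frequency to the nearest FFT bin (clipped to Nyquist).
    bins = np.clip(np.round(harmonics / bin_width).astype(int), 0, len(spectrum) - 1)
    return harmonics, spectrum[bins]

# Example: a 50 ms frame at 8 kHz with an assumed pitch of 120 Hz.
fs = 8000
frame = np.random.randn(int(0.05 * fs))
freqs, values = sample_at_harmonics(frame, fs, f0=120.0)
```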

  8. Narrow-Band Spectral Filtering
  [Figure: the spectrogram multiplied by a bank of narrow band-pass responses; axes in frequency (Hz), time (s), and magnitude (dB).]

  9. Narrow-Band Spectral Filtering: Parameters
  • With filter parameter values of 100 Hz, 200 Hz, and 800 Hz, where the key parameters are the narrow-band bandwidth and the spacing between narrow bands, aliasing at the baseband is attenuated significantly.
  [Figure: the spectrogram is mostly retained after narrow-band filtering; axes in frequency (Hz) and time (s).]

  10. Narrow-Band Spectral Coefficients (NBSC)
  [Figure: band-pass filtering applied to the input to produce the narrow-band coefficients.]
  • Narrow-band spectral features retain the essential speech information.
  • A small number of filters keeps the power low.
  • Low-rate sampling and simple processing.
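  The slides do not give the exact filter implementation, so the following is only a software stand-in for the analog front-end: a small bank of 200 Hz-wide Butterworth band-pass filters whose per-frame log energies serve as narrow-band spectral coefficients. The filter order, band centers, and frame length are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def nbsc_features(x, fs, centers, bandwidth=200.0, frame_len=0.06):
    """Narrow-band spectral coefficients: per-band log frame energies.

    `centers` lists narrow-band center frequencies (Hz); each band is
    `bandwidth` Hz wide, mirroring the narrow-band filtering on the slides.
    """
    hop = int(frame_len * fs)
    feats = []
    for fc in centers:
        lo, hi = fc - bandwidth / 2, fc + bandwidth / 2
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfilt(sos, x)
        n_frames = len(band) // hop
        # Energy per analysis frame, expressed in dB.
        energy = np.array([np.sum(band[i * hop:(i + 1) * hop] ** 2)
                           for i in range(n_frames)])
        feats.append(10 * np.log10(energy + 1e-12))
    return np.stack(feats, axis=1)   # shape: (frames, bands)

# Example: 8 narrow bands spaced 400 Hz apart, signal sampled at 8 kHz (assumed values).
fs = 8000
x = np.random.randn(fs)                    # stand-in for one second of speech
centers = 400.0 + 400.0 * np.arange(8)     # 400 Hz .. 3200 Hz
F = nbsc_features(x, fs, centers)
```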

  11. Block Diagram of the Proposed System
  [Block diagram: the analog front-end filters the input into narrow bands; the digital back-end compares the resulting features against enrollment samples with weighted dynamic time warping (DTW) and thresholds the result to accept or reject.]
  • Individual bands can be discarded in the presence of noise (see the sketch below).
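  A sketch of how the back-end decision could be composed, assuming per-band DTW distances to the enrollment template are already available: noisy bands are dropped before the distances are pooled and thresholded. The noise-level proxy, the noise floor, and the threshold value are illustrative assumptions, not values from the slides.

```python
import numpy as np

def verify(band_distances, band_noise_energy, noise_floor_db=-30.0, threshold=1.0):
    """Accept/reject from per-band DTW distances, discarding noisy bands.

    band_distances:    DTW distance to the enrollment template, one per band.
    band_noise_energy: estimated noise level per band (dB), e.g. from
                       non-speech frames; bands above `noise_floor_db` are dropped.
    """
    band_distances = np.asarray(band_distances, dtype=float)
    keep = np.asarray(band_noise_energy) < noise_floor_db
    if not np.any(keep):                  # nothing usable: reject conservatively
        return False
    score = band_distances[keep].mean()   # pool the remaining bands
    return score < threshold              # accept if close enough to the template
```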

  12. Weighted DTW
  Voice-authenticated wake-up:
  • Identifies the user and the passphrase in one shot
  • User-defined passphrase (~1 s)
  • Very few enrollment samples (e.g., 3)

  13. Overview: Speaker-Verification Systems
  • Text-dependent speaker verification
    – Model based: GMM [Reynolds, 2000], i-vectors [Dehak, 2011], DNN [Liu, 2015], HMM [Rosenberg, 1990]. A model is trained on separate training data and adapted with the enrollment samples; MFCC features are then scored against the model and thresholded to accept or reject.
    – Template based: DTW [Sakoe, 1978]. No prior model training; extracted features are compared to the enrollment templates with a distance measure and thresholded to accept or reject.

  14. Weighted Dynamic Time Warping
  [Figure: aligning a speech input to a reference signal; stretching or compressing a low-magnitude region incurs a small penalty, while warping a high-magnitude region incurs a large penalty (example shown with M = 3 consecutive warping steps).]
  • The distance between two aligned points is the point-wise distance plus a penalty term.
  • The penalty scales with the number of consecutive warping steps M and with the signal magnitude: warping is cheap where the signal is small and expensive where the signal is large.

  15. Distance Matrix Computation
  • The accumulated cost at each cell combines the point-wise distance with a warping cost that is a function of the signal magnitude and the number of consecutive warping steps (see the sketch below).
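  The exact cost expression is not recoverable from this transcript, so the sketch below is only one plausible reading of the description on slide 14: a warping penalty proportional to the local signal magnitude times the length of the consecutive warp run, not the authors' formula. The penalty scale `lam` is an illustrative value.

```python
import numpy as np

def weighted_dtw(x, y, lam=0.1):
    """DTW whose warping penalty grows with the local signal magnitude and
    with the number of consecutive non-diagonal (warping) steps.

    x, y: 1-D feature sequences (e.g. one narrow-band energy track each).
    Returns the accumulated alignment cost.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)       # accumulated cost
    R = np.zeros((n + 1, m + 1), dtype=int)   # consecutive warping steps
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            mag = max(abs(x[i - 1]), abs(y[j - 1]))
            # Candidate predecessors: diagonal (no warp), vertical/horizontal (warp).
            diag = D[i - 1, j - 1] + d
            vert = D[i - 1, j] + d + lam * mag * (R[i - 1, j] + 1)
            horiz = D[i, j - 1] + d + lam * mag * (R[i, j - 1] + 1)
            best = min(diag, vert, horiz)
            D[i, j] = best
            if best == diag:
                R[i, j] = 0
            elif best == vert:
                R[i, j] = R[i - 1, j] + 1
            else:
                R[i, j] = R[i, j - 1] + 1
    return D[n, m]
```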

  16. Classical vs. Weighted DTW
  [Figure: classical DTW fails to align the signal envelopes and distorts the shape of the warped signal; weighted DTW aligns the envelopes well with much less distortion.]

  17. System Experiment
  [Roadmap figure: spectral feature selection → weighted dynamic time warping → system experiment → output.]

  18. Experiment Setup
  Data set:
    Passphrase    # of speakers   # of repetitions
    Hi Galaxy          40               40
    OK Glass           40               20
    OK Hua Wei         30               20
  • Noisy samples: wind and car noise are added to each clean sample so that the total SNR is 3 dB.
  • Number of enrollment samples: 3.
  Parameters:
  • Narrow-band spectral coefficient (NBSC) bandwidth: 200 Hz
  • f0 estimation using the autocorrelation method [Rabiner, 1976] (see the sketch below)
  Baseline systems:
  • 40-dim MFCC + classical DTW
  • 40-dim MFCC + GMM-UBM model
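  The setup cites the autocorrelation method of [Rabiner, 1976] for f0 estimation. Below is a minimal, generic autocorrelation pitch estimator for one voiced frame, not the authors' implementation; the pitch search range is an assumption.

```python
import numpy as np

def estimate_f0(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate pitch of one voiced frame by peak-picking its autocorrelation."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # non-negative lags
    lag_min = int(fs / fmax)                       # shortest plausible pitch period
    lag_max = min(int(fs / fmin), len(ac) - 1)     # longest plausible pitch period
    lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])
    return fs / lag
```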

  19. Summary of Experiment Results
  (In the noisy condition, NBSC features below 2 kHz are dropped.)

                   Clean (EER [%])                    Noisy, 3 dB SNR (EER [%])
  Algorithm        MFCC (40-dim)   NBSC (12 bands)    MFCC (40-dim)   NBSC (8 bands)
  Weighted-DTW         0.9             1.1                10.5             5.7
  DTW                  1.4             1.5                13               6.7
  GMM/UBM              2.6             N/A                 6.8             N/A

  • Without noise, the NBSC yields accuracy comparable to the MFCC features.
  • At 3 dB SNR, the NBSC yields much better accuracy than the MFCC features.
  • Weighted DTW improves accuracy over classical DTW for all features.
  • The proposed system improves accuracy over the GMM/UBM method while taking only 3 enrollment samples as prior and requiring no background-model training.
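  The tables report equal error rate (EER). As a reminder of how that figure is obtained, here is a minimal sketch that sweeps the decision threshold over genuine and impostor scores and approximates the point where the false-reject and false-accept rates meet (score convention: lower means a better match, as with DTW distances).

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate EER for distance-like scores (smaller = better match)."""
    genuine_scores = np.asarray(genuine_scores, float)
    impostor_scores = np.asarray(impostor_scores, float)
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best = 1.0
    for t in thresholds:
        frr = np.mean(genuine_scores > t)     # genuine trials rejected
        far = np.mean(impostor_scores <= t)   # impostor trials accepted
        best = min(best, max(frr, far))       # point where the two rates cross
    return best
```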

  20. Experiments: Adaptive Band Selection
  NBSC (EER [%]):
    # of filters               6      8      10     12
    Clean                      1.99   1.9    1.54   1.1
    Noisy (band selection)     6.8    6.6    6.3    5.7
    Noisy (all bands)          15.5   15     15     14.5

  • Accuracy improves as the number of bands increases.
  • Accuracy improves significantly with band selection.

  21. System Power Estimation
                                                    Fixed power (µW)   Additional power per band (µW)   Total, 12 bands (µW)
    Front-end (TI's 13-band filter-bank features)        150                     10                           270
    Back-end (text-dependent speaker verification)         0                    <9                           <108
    Overall                                                                                                  <380

  (The totals follow from the fixed power plus 12 bands times the per-band power: 150 + 12 × 10 = 270 µW for the front-end, 12 × 9 = 108 µW for the back-end, and under 380 µW overall.)
  • Back-end implementation: Cortex-M0 microcontroller, 40 MHz clock speed, one decision every 60 ms.

  22. Summary: Low-Power Text-Dependent Speaker Verification
  Spectral feature selection:
  • Early-stage signal dimension reduction
  • Analog feature-extraction front-end
  • Low-rate sampling and processing
  • Supports adaptive band selection
  Weighted dynamic time warping:
  • Improved robustness to noise (noisy bands can be discarded)
  • Demonstrated accuracy comparable to existing systems, with a low-power implementation

  23. Questions?

  24. Back-up Slides

  25. False Positives under Continuous Running
  • Out-of-vocabulary (OOV) samples: 50,000 samples of 1.2 s duration, drawn from short commands and from utterances in audio books and conversations.
  • The decision threshold is the same as the speaker-verification EER threshold.

                   Clean (OOV false positives [%])      Noisy, 3 dB SNR (OOV false positives [%])
  Algorithm        MFCC (40-dim)   NBSC (12 bands)      MFCC (40-dim)   NBSC (8 bands)
  Weighted-DTW         0               0                    1.4             0.6

  • Roughly one false positive per hour in a noisy restaurant.

  26. Experiments: Adaptive Band Selection
  EER [%]:
                               NBSC                          MFSC
    # of filters               6      8      10     12       13      26
    Clean                      1.99   1.9    1.54   1.1      1.95    1.83
    Noisy (band selection)     6.8    6.6    6.3    5.7      16.4    17.2
    Noisy (all bands)          15.5   15     15     14.5     33.4    33.9

  • Accuracy improves as the number of bands increases.
  • Accuracy improves significantly with band selection.
  • The NBSC yields much better performance than the MFSC, which uses a larger number of filters.

  27. Narrow-Band Spectral Filtering
  [Figure: spectrogram and cepstral coefficients under narrow-band filtering; axes in frequency (Hz), time (s), quefrency (cycles/kHz), and magnitude (dB).]
