  1. A Low-Power Text-Dependent Speaker Verification System with Narrow-band Feature Pre-selection and Weighted Dynamic Time Warping. Qing He, Gregory Wornell, Wei Ma. June 21, 2016. Texas Instruments / MIT Signals, Information and Algorithms Lab.

  2. Motivation: Low-Power Wake-up
  • Conventionally, for voice wake-up, the host device is always ON:
    – High data acquisition rate to minimize information loss and to enable flexible downstream processing
    – Many stages of processing on high-dimensional data
  • Much lower power consumption can be achieved with an application-specific, voice-authenticated wake-up front-end:
    – Early-stage signal dimension reduction with analog components
    – Adaptive data acquisition and robust processing
  [Diagram: a host device (ADC + DSP) that is always ON draws >100 mW; a low-power front-end that is always ON draws ~50-300 µW and turns the host on only when it detects the wake-up signal.]

  3. System Architecture: A Comparison
  • Conventional system (high power consumption): high-rate sampling (e.g., ~24 kHz) → windowing and MFCC feature extraction → speaker verification → accept/reject. High-dimensional features and fast processing throughout.
  • Proposed system (low power consumption): low-rate analog front-end (< 4 kHz) → narrow-band filter bank extracting spectral features (NBSC) with a low-rate ADC → weighted DTW pattern match against enrollment samples → accept/reject.

  4. Spectral Feature Pre-Selection
  • A few carefully selected narrow bands are capable of preserving most speech information.
  [Roadmap figure: spectral feature pre-selection feeds the speaker-verification back-end, which produces the output decision.]

  5. Review: Speech Sound Generation
  [Figure: (a) excitation signal, (b) vocal tract modulation, (c) speech spectrogram.]
  • The vocal-tract modulation carries the information essential for speech recognition; the excitation and vocal-tract components are separable in the cepstral domain.

  6. Cepstral Representation of Speech
  [Figure: spectrogram → spectral density (harmonics plus vocal tract modulation) → IFFT → cepstral coefficients; axes in frequency (Hz), time (s), and quefrency (cycles/kHz).]
  • Acquiring the entire speech spectrum and transforming it to the cepstral domain is power-expensive.
  • Question: how can the essential features be extracted without acquiring the full spectrum or performing the transformation to the cepstral domain?
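  For reference, the following is a minimal sketch (not from the slides) of the cepstral transform the question refers to: the IFFT of a frame's log-magnitude spectrum, which separates the slowly varying vocal-tract envelope from the excitation harmonics. The frame length and sampling rate are illustrative assumptions.

```python
import numpy as np

def real_cepstrum(frame: np.ndarray) -> np.ndarray:
    """Real cepstrum of one windowed speech frame.

    The IFFT of the log-magnitude spectrum places the vocal-tract envelope at
    low quefrencies and the excitation harmonics near the pitch period.
    """
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # small offset avoids log(0)
    return np.fft.irfft(log_mag)

# Example: 25 ms frame at an assumed 8 kHz sampling rate (random stand-in for speech).
fs = 8000
frame = np.random.randn(int(0.025 * fs))
ceps = real_cepstrum(frame)
vocal_tract_part = ceps[:20]   # low-quefrency coefficients
```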

  7. Point-Wise Spectral Sampling on the Harmonics
  • Sampling the spectrum point-wise at the harmonic frequencies retains the cepstral coefficients.
  [Figure: spectrogram multiplied by a harmonic sampling pattern; axes in frequency (Hz), time (s), and magnitude (dB).]
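  A minimal sketch of the point-wise sampling idea, under the assumption that a pitch estimate f0 is available: the magnitude spectrum of a frame is read out only at (approximately) harmonic bins rather than over the full band. The frame length, sampling rate, and number of harmonics are illustrative assumptions.

```python
import numpy as np

def sample_at_harmonics(frame, fs, f0, n_harmonics=12):
    """Sample the magnitude spectrum of one frame at integer multiples of f0."""
    spectrum = np.abs(np.fft.rfft(frame))
    bin_width = fs / len(frame)                      # spacing of rfft bins in Hz
    harmonics = f0 * np.arange(1, n_harmonics + 1)
    # Round each harmonic frequency to the nearest FFT bin (clipped to Nyquist).
    bins = np.clip(np.round(harmonics / bin_width).astype(int), 0, len(spectrum) - 1)
    return harmonics, spectrum[bins]

# Example: a 50 ms frame at 8 kHz with an assumed pitch of 120 Hz.
fs = 8000
frame = np.random.randn(int(0.05 * fs))
freqs, values = sample_at_harmonics(frame, fs, f0=120.0)
```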

  8. Narrow-Band Spectral Filtering
  [Figure: the spectrogram multiplied by a bank of narrow band-pass responses; axes in frequency (Hz), time (s), and magnitude (dB).]

  9. Narrow-Band Spectral Filtering: Parameters
  • With filter parameter values of 100 Hz, 200 Hz, and 800 Hz, where the key parameters are the narrow-band bandwidth and the spacing between narrow bands, aliasing at the baseband is attenuated significantly.
  [Figure: the spectrogram is mostly retained after narrow-band filtering; axes in frequency (Hz) and time (s).]

  10. Narrow-Band Spectral Coefficients (NBSC)
  [Figure: band-pass filtering applied to the input to produce the narrow-band coefficients.]
  • Narrow-band spectral features retain the essential speech information.
  • A small number of filters keeps the power low.
  • Low-rate sampling and simple processing.
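  The slides do not give the exact filter implementation, so the following is only a software stand-in for the analog front-end: a small bank of 200 Hz-wide Butterworth band-pass filters whose per-frame log energies serve as narrow-band spectral coefficients. The filter order, band centers, and frame length are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def nbsc_features(x, fs, centers, bandwidth=200.0, frame_len=0.06):
    """Narrow-band spectral coefficients: per-band log frame energies.

    `centers` lists narrow-band center frequencies (Hz); each band is
    `bandwidth` Hz wide, mirroring the narrow-band filtering on the slides.
    """
    hop = int(frame_len * fs)
    feats = []
    for fc in centers:
        lo, hi = fc - bandwidth / 2, fc + bandwidth / 2
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfilt(sos, x)
        n_frames = len(band) // hop
        # Energy per analysis frame, expressed in dB.
        energy = np.array([np.sum(band[i * hop:(i + 1) * hop] ** 2)
                           for i in range(n_frames)])
        feats.append(10 * np.log10(energy + 1e-12))
    return np.stack(feats, axis=1)   # shape: (frames, bands)

# Example: 8 narrow bands spaced 400 Hz apart, signal sampled at 8 kHz (assumed values).
fs = 8000
x = np.random.randn(fs)                    # stand-in for one second of speech
centers = 400.0 + 400.0 * np.arange(8)     # 400 Hz .. 3200 Hz
F = nbsc_features(x, fs, centers)
```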

  11. Block Diagram of the Proposed System
  [Block diagram: the analog front-end filters the input into narrow bands; the digital back-end compares the resulting features against enrollment samples with weighted dynamic time warping (DTW) and thresholds the result to accept or reject.]
  • Individual bands can be discarded in the presence of noise (see the sketch below).
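  A sketch of how the back-end decision could be composed, assuming per-band DTW distances to the enrollment template are already available: noisy bands are dropped before the distances are pooled and thresholded. The noise-level proxy, the noise floor, and the threshold value are illustrative assumptions, not values from the slides.

```python
import numpy as np

def verify(band_distances, band_noise_energy, noise_floor_db=-30.0, threshold=1.0):
    """Accept/reject from per-band DTW distances, discarding noisy bands.

    band_distances:    DTW distance to the enrollment template, one per band.
    band_noise_energy: estimated noise level per band (dB), e.g. from
                       non-speech frames; bands above `noise_floor_db` are dropped.
    """
    band_distances = np.asarray(band_distances, dtype=float)
    keep = np.asarray(band_noise_energy) < noise_floor_db
    if not np.any(keep):                  # nothing usable: reject conservatively
        return False
    score = band_distances[keep].mean()   # pool the remaining bands
    return score < threshold              # accept if close enough to the template
```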

  12. Weighted DTW
  Voice-authenticated wake-up:
  • Identifies the user and the passphrase in one shot
  • User-defined passphrase (~1 s)
  • Very few enrollment samples (e.g., 3)

  13. Overview: Speaker-Verification Systems
  • Text-dependent speaker verification
    – Model based: GMM [Reynolds, 2000], i-vectors [Dehak, 2011], DNN [Liu, 2015], HMM [Rosenberg, 1990]. A model is trained on separate training data and adapted with the enrollment samples; MFCC features are then scored against the model and thresholded to accept or reject.
    – Template based: DTW [Sakoe, 1978]. No prior model training; extracted features are compared to the enrollment templates with a distance measure and thresholded to accept or reject.

  14. Weighted Dynamic Time Warping
  [Figure: aligning a speech input to a reference signal; stretching or compressing a low-magnitude region incurs a small penalty, while warping a high-magnitude region incurs a large penalty (example shown with M = 3 consecutive warping steps).]
  • The distance between two aligned points is the point-wise distance plus a penalty term.
  • The penalty scales with the number of consecutive warping steps M and with the signal magnitude: warping is cheap where the signal is small and expensive where the signal is large.

  15. Distance Matrix Computation
  • The accumulated cost at each cell combines the point-wise distance with a warping cost that is a function of the signal magnitude and the number of consecutive warping steps (see the sketch below).
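  The exact cost expression is not recoverable from this transcript, so the sketch below is only one plausible reading of the description on slide 14: a warping penalty proportional to the local signal magnitude times the length of the consecutive warp run, not the authors' formula. The penalty scale `lam` is an illustrative value.

```python
import numpy as np

def weighted_dtw(x, y, lam=0.1):
    """DTW whose warping penalty grows with the local signal magnitude and
    with the number of consecutive non-diagonal (warping) steps.

    x, y: 1-D feature sequences (e.g. one narrow-band energy track each).
    Returns the accumulated alignment cost.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)       # accumulated cost
    R = np.zeros((n + 1, m + 1), dtype=int)   # consecutive warping steps
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            mag = max(abs(x[i - 1]), abs(y[j - 1]))
            # Candidate predecessors: diagonal (no warp), vertical/horizontal (warp).
            diag = D[i - 1, j - 1] + d
            vert = D[i - 1, j] + d + lam * mag * (R[i - 1, j] + 1)
            horiz = D[i, j - 1] + d + lam * mag * (R[i, j - 1] + 1)
            best = min(diag, vert, horiz)
            D[i, j] = best
            if best == diag:
                R[i, j] = 0
            elif best == vert:
                R[i, j] = R[i - 1, j] + 1
            else:
                R[i, j] = R[i, j - 1] + 1
    return D[n, m]
```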

  16. Classical vs. Weighted DTW
  [Figure: classical DTW fails to align the signal envelopes and distorts the shape of the warped signal; weighted DTW aligns the envelopes well with much less distortion.]

  17. System Experiment
  [Roadmap figure: spectral feature selection → weighted dynamic time warping → system experiment → output.]

  18. Experiment Setup
  Data set:
    Passphrase    # of speakers   # of repetitions
    Hi Galaxy          40               40
    OK Glass           40               20
    OK Hua Wei         30               20
  • Noisy samples: wind and car noise are added to each clean sample so that the total SNR is 3 dB.
  • Number of enrollment samples: 3.
  Parameters:
  • Narrow-band spectral coefficient (NBSC) bandwidth: 200 Hz
  • f0 estimation using the autocorrelation method [Rabiner, 1976] (see the sketch below)
  Baseline systems:
  • 40-dim MFCC + classical DTW
  • 40-dim MFCC + GMM-UBM model
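  The setup cites the autocorrelation method of [Rabiner, 1976] for f0 estimation. Below is a minimal, generic autocorrelation pitch estimator for one voiced frame, not the authors' implementation; the pitch search range is an assumption.

```python
import numpy as np

def estimate_f0(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate pitch of one voiced frame by peak-picking its autocorrelation."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # non-negative lags
    lag_min = int(fs / fmax)                       # shortest plausible pitch period
    lag_max = min(int(fs / fmin), len(ac) - 1)     # longest plausible pitch period
    lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])
    return fs / lag
```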

  19. Summary of Experiment Results
  (In the noisy condition, NBSC features below 2 kHz are dropped.)

                   Clean (EER [%])                    Noisy, 3 dB SNR (EER [%])
  Algorithm        MFCC (40-dim)   NBSC (12 bands)    MFCC (40-dim)   NBSC (8 bands)
  Weighted-DTW         0.9             1.1                10.5             5.7
  DTW                  1.4             1.5                13               6.7
  GMM/UBM              2.6             N/A                 6.8             N/A

  • Without noise, the NBSC yields accuracy comparable to the MFCC features.
  • At 3 dB SNR, the NBSC yields much better accuracy than the MFCC features.
  • Weighted DTW improves accuracy over classical DTW for all features.
  • The proposed system improves accuracy over the GMM/UBM method while taking only 3 enrollment samples as prior and requiring no background-model training.
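  The tables report equal error rate (EER). As a reminder of how that figure is obtained, here is a minimal sketch that sweeps the decision threshold over genuine and impostor scores and approximates the point where the false-reject and false-accept rates meet (score convention: lower means a better match, as with DTW distances).

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate EER for distance-like scores (smaller = better match)."""
    genuine_scores = np.asarray(genuine_scores, float)
    impostor_scores = np.asarray(impostor_scores, float)
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best = 1.0
    for t in thresholds:
        frr = np.mean(genuine_scores > t)     # genuine trials rejected
        far = np.mean(impostor_scores <= t)   # impostor trials accepted
        best = min(best, max(frr, far))       # point where the two rates cross
    return best
```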

  20. Experiments: Adaptive Band Selection
  NBSC (EER [%]):
    # of filters               6      8      10     12
    Clean                      1.99   1.9    1.54   1.1
    Noisy (band selection)     6.8    6.6    6.3    5.7
    Noisy (all bands)          15.5   15     15     14.5

  • Accuracy improves as the number of bands increases.
  • Accuracy improves significantly with band selection.

  21. System Power Estimation
                                                    Fixed power (µW)   Additional power per band (µW)   Total, 12 bands (µW)
    Front-end (TI's 13-band filter-bank features)        150                     10                           270
    Back-end (text-dependent speaker verification)         0                    <9                           <108
    Overall                                                                                                  <380

  (The totals follow from the fixed power plus 12 bands times the per-band power: 150 + 12 × 10 = 270 µW for the front-end, 12 × 9 = 108 µW for the back-end, and under 380 µW overall.)
  • Back-end implementation: Cortex-M0 microcontroller, 40 MHz clock speed, one decision every 60 ms.

  22. Summary: Low-Power Text-Dependent Speaker Verification
  Spectral feature selection:
  • Early-stage signal dimension reduction
  • Analog feature-extraction front-end
  • Low-rate sampling and processing
  • Supports adaptive band selection
  Weighted dynamic time warping:
  • Improved robustness to noise (noisy bands can be discarded)
  • Demonstrated accuracy comparable to existing systems, with a low-power implementation

  23. Questions?

  24. Back-up Slides

  25. False Positives under Continuous Running
  • Out-of-vocabulary (OOV) samples: 50,000 samples of 1.2 s duration, drawn from short commands and from utterances in audio books and conversations.
  • The decision threshold is the same as the speaker-verification EER threshold.

                   Clean (OOV false positives [%])      Noisy, 3 dB SNR (OOV false positives [%])
  Algorithm        MFCC (40-dim)   NBSC (12 bands)      MFCC (40-dim)   NBSC (8 bands)
  Weighted-DTW         0               0                    1.4             0.6

  • Roughly one false positive per hour in a noisy restaurant.

  26. Experiments: Adaptive Band Selection
  EER [%]:
                               NBSC                          MFSC
    # of filters               6      8      10     12       13      26
    Clean                      1.99   1.9    1.54   1.1      1.95    1.83
    Noisy (band selection)     6.8    6.6    6.3    5.7      16.4    17.2
    Noisy (all bands)          15.5   15     15     14.5     33.4    33.9

  • Accuracy improves as the number of bands increases.
  • Accuracy improves significantly with band selection.
  • The NBSC yields much better performance than the MFSC, which uses a larger number of filters.

  27. Narrow-Band Spectral Filtering
  [Figure: spectrogram and cepstral coefficients under narrow-band filtering; axes in frequency (Hz), time (s), quefrency (cycles/kHz), and magnitude (dB).]
