
Multi-Task Joint-Learning for Robust Voice Activity Detection
Yimeng Zhuang, Sibo Tong, Maofan Yin, Yanmin Qian, Kai Yu
Speech Lab, Department of Computer Science & Engineering
Shanghai Jiao Tong University
October 2016

VAD Overview
◮ Voice activity detection
  ◮ A technique used in speech processing in which the presence or absence of human speech is detected
◮ Model based VAD
◮ Zero crossings rate
◮ Energy
◮ Long term spectral
◮ Gaussian mixture model (GMM)
◮ Deep neural network (DNN) based VAD
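Two of the classic cues listed above, short-time energy and zero crossing rate, can be computed per frame as in the minimal sketch below. The function name `frame_features` and the `frame_len` and `hop` parameters are illustrative assumptions, not part of the presentation, and the sketch assumes the signal is at least one frame long.

```python
import numpy as np

def frame_features(signal, frame_len, hop):
    """Return per-frame short-time energy and zero crossing rate.

    Hypothetical helper for illustration only; frame_len and hop are in samples.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    energy = np.empty(n_frames)
    zcr = np.empty(n_frames)
    for k in range(n_frames):
        frame = signal[k * hop : k * hop + frame_len]
        # Mean squared amplitude of the frame.
        energy[k] = np.mean(frame ** 2)
        # Fraction of adjacent sample pairs whose sign differs.
        signs = np.sign(frame)
        zcr[k] = np.mean(signs[1:] != signs[:-1])
    return energy, zcr
```

A simple detector could threshold these values per frame; the thresholds would have to be tuned on data and are not given in the slides.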

Basic DNN based VAD

Multi-frame prediction

\[
L_{\mathrm{vad}}(W) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{t=-M}^{M} \sum_{i=1}^{2} \lambda_t \, d_{s(n+t)i} \log P(s_{(n+t)i} \mid o_n, W) \tag{1}
\]
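As a rough illustration of Eq. (1), the weighted multi-frame cross-entropy can be sketched in NumPy. The array layout, the class indexing (0 for non-speech, 1 for speech), and the function name `multi_frame_vad_loss` are assumptions made for this sketch; the slides do not specify an implementation.

```python
import numpy as np

def multi_frame_vad_loss(log_probs, labels, lam):
    """Weighted multi-frame cross-entropy in the spirit of Eq. (1).

    log_probs: (N, 2M+1, 2) log-posteriors predicted from input frame o_n
               for the frames n-M .. n+M (assumed layout).
    labels:    (N, 2M+1) integer class of frame n+t (0 = non-speech, 1 = speech).
    lam:       (2M+1,) per-offset weights lambda_t.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    lam = np.asarray(lam, dtype=float)
    # One-hot targets play the role of d_{s(n+t)i}.
    one_hot = np.eye(2)[labels]                       # (N, 2M+1, 2)
    # Inner sum over the two classes i.
    per_offset = (one_hot * log_probs).sum(axis=-1)   # (N, 2M+1)
    # Weighted sum over offsets t, then average over the N frames.
    return -(per_offset * lam).sum(axis=1).mean()
```

With uniform posteriors of 0.5 and all weights equal to 1, the loss reduces to (2M+1) log 2, which is a quick sanity check of the summation order.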

Train multi-frame DNN with multi-task joint-learning

\[
L(W) = L_{\mathrm{vad}}(W) + \frac{1}{N} \sum_{n=1}^{N} \lVert \hat{o}_n - o_n \rVert_2^2 + \kappa \lVert W \rVert_2^2 \tag{2}
\]
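The joint objective in Eq. (2) adds a mean squared reconstruction term and an L2 weight penalty to the VAD loss. The sketch below shows only how the three terms combine; the function name `joint_loss`, the representation of the weights as a list of arrays, and the argument layout are assumptions for illustration.

```python
import numpy as np

def joint_loss(l_vad, enhanced, target, weights, kappa):
    """Combine the three terms of Eq. (2).

    l_vad:    scalar VAD loss L_vad(W), computed elsewhere.
    enhanced: (N, D) enhancement-layer outputs (o-hat in the slide).
    target:   (N, D) regression targets (o in the slide).
    weights:  iterable of weight arrays standing in for W (assumed layout).
    kappa:    L2 regularization coefficient.
    """
    enhanced = np.asarray(enhanced, dtype=float)
    target = np.asarray(target, dtype=float)
    # Squared L2 norm per frame, averaged over the N frames.
    regression = np.sum((enhanced - target) ** 2, axis=1).mean()
    # Squared L2 norm of all weights.
    l2_penalty = sum(np.sum(np.square(w)) for w in weights)
    return l_vad + regression + kappa * l2_penalty
```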

Prediction
◮ Enhancement layer is removed
◮ Functions to combine multiple prediction results
  ◮ Maximum:
\[
P(s_t \mid o, W) = \max_{-M \le i \le M} \{ P(s_t \mid o_{t+i}, W) \} \tag{3}
\]
  ◮ Arithmetic mean:
\[
P(s_t \mid o, W) = \frac{1}{2M+1} \sum_{i=-M}^{M} P(s_t \mid o_{t+i}, W) \tag{4}
\]
  ◮ Harmonic mean:
\[
\frac{1}{P(s_t \mid o, W)} = \frac{1}{2M+1} \sum_{i=-M}^{M} \frac{1}{P(s_t \mid o_{t+i}, W)} \tag{5}
\]
  ◮ Geometric mean:
\[
\log P(s_t \mid o, W) = \frac{1}{2M+1} \sum_{i=-M}^{M} \log P(s_t \mid o_{t+i}, W) \tag{6}
\]
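The four combination rules in Eqs. (3) to (6) can be sketched as follows. This assumes that `preds[n, j]` holds the speech posterior that input frame n assigns to frame n + (j - M), and that frames near the utterance edges simply use the predictions that exist. Both are assumptions of this sketch, as is the name `combine_scores`.

```python
import numpy as np

def combine_scores(preds, method="arithmetic"):
    """Combine the 2M+1 predictions available for each frame t.

    preds: (T, 2M+1) array; preds[n, j] is the posterior for frame n + (j - M)
           predicted from input frame n (assumed layout, values in (0, 1]).
    """
    preds = np.asarray(preds, dtype=float)
    T, width = preds.shape
    M = (width - 1) // 2
    combined = np.empty(T)
    for t in range(T):
        # Input frame t+i predicts frame t at its own offset -i, i.e. column M - i.
        scores = np.array([preds[t + i, M - i]
                           for i in range(-M, M + 1) if 0 <= t + i < T])
        if method == "max":
            combined[t] = scores.max()
        elif method == "arithmetic":
            combined[t] = scores.mean()
        elif method == "harmonic":
            combined[t] = len(scores) / np.sum(1.0 / scores)
        elif method == "geometric":
            combined[t] = np.exp(np.mean(np.log(scores)))
        else:
            raise ValueError(f"unknown method: {method}")
    return combined
```

The harmonic and geometric means are pulled down by a single low score, while the maximum is driven by the single highest score, so the choice affects how conservative the combined decision is.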

Experiment Setup
◮ Aurora 4 dataset is used
  ◮ Six different types of noises, including airport, babble, car, restaurant, street and train
  ◮ 10-20 dB SNR
  ◮ 7 test sets, including the clean set and six noise sets (seen noise)
◮ To simulate a more realistic scenario, an unseen noise test set is designed with 100 noise types

Choosing context window size and score combination methods

Frame-level evaluation (AUC)

Hidden layers  Noise condition  Single frame  Multi-frame  Multi-frame + Multi-task
2 (1+1)        clean            99.75         99.78        99.79
2 (1+1)        seen             98.85         98.95        99.00
2 (1+1)        unseen           96.62         97.35        97.72
3 (2+1)        clean            99.76         99.79        99.79
3 (2+1)        seen             98.90         99.03        99.08
3 (2+1)        unseen           96.82         97.58        97.95

◮ The model of multi-frame prediction with multi-task joint-learning yields the best results
◮ The multi-task approach is an effective method to further improve VAD performance at the frame level
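For reference, frame-level AUC can be computed directly from per-frame scores and labels as the fraction of (speech, non-speech) frame pairs that are ranked correctly. The sketch below is a generic illustration, not the evaluation script used for the table; the name `frame_auc` is hypothetical, and the pairwise comparison uses memory proportional to the product of the two class sizes, so it suits small examples only.

```python
import numpy as np

def frame_auc(scores, labels):
    """Area under the ROC curve from frame scores and binary labels.

    scores: (T,) speech posteriors; labels: (T,) with 1 = speech, 0 = non-speech.
    Both classes must be present. Ties count as half-correct.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]
    neg = scores[~labels]
    # Compare every speech frame with every non-speech frame.
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size
```

Multiplying the result by 100 gives values on the same scale as the table.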

Segment-level evaluation (J_VAD)

Hidden layers  Noise condition  Single frame  Multi-frame  Multi-frame + Multi-task
2 (1+1)        clean            81.6          90.28        91.0
2 (1+1)        seen             55.4          71.81        71.9
2 (1+1)        unseen           45.9          63.80        65.7
3 (2+1)        clean            82.2          90.23        91.3
3 (2+1)        seen             56.5          71.89        75.1
3 (2+1)        unseen           46.0          63.86        66.6

◮ J_VAD is sensitive to boundary accuracy and to the total number of speech/non-speech segments. The improved J_VAD suggests that the proposed approaches produce more accurate boundaries and fewer fragmented segments.

Conclusion
◮ Multi-frame prediction with multi-task joint learning is utilized for VAD
  ◮ The proposed approach needs to predict classification posteriors covering multiple neighboring frames
  ◮ A speech enhancement task is jointly trained in order to obtain better regression ability
◮ Future work
  ◮ More experiments are needed to examine whether other score combination functions can achieve better performance
  ◮ It is also worth exploring a postprocessing method that suits the newly proposed approach
