SLIDE 1

DNN-based Ultrasound-to-Speech Conversion for a Silent Speech Interface

Tamás Gábor Csapó,1,2 Tamás Grósz,3 Gábor Gosztolya,3,4 László Tóth,4 Alexandra Markó2,5

1 BME Department of Telecommunications and Media Informatics
2 MTA-ELTE Lendület Lingual Articulation Research Group
3 Institute of Informatics, University of Szeged
4 MTA-SZTE Research Group on Artificial Intelligence
5 ELTE Department of Phonetics

Hungary

Interspeech 2017, Stockholm, August 24, 2017

SLIDE 2

Introduction Methods Results Summary Articulation and speech

Introduction

2 / 25 Csapó, Grósz, Gosztolya, Tóth, Markó DNN-based Ultrasound-to-Speech Conversion

SLIDE 3

Silent Speech Interface (SSI) I

goal

convert silent articulation to audible speech while the speaker just "mouths" the words
→ articulatory-to-acoustic mapping

imaging techniques

Ultrasound Tongue Imaging (UTI)
Electromagnetic Articulography (EMA)
Permanent Magnetic Articulography (PMA)
lip video
multimodal
. . .

SLIDE 4

Silent Speech Interface (SSI) II

Ultrasound Tongue Imaging (UTI)

used in speech research since the early '80s
ultrasound transducer positioned below the chin during speech
tongue movement on video (up to 100 frames/sec)
tongue surface has a greater brightness than the surrounding tissue and air
relatively good temporal and spatial resolution

[Stone et al., 1983, Stone, 2005]

SLIDE 5

Silent Speech Interface (SSI) III

[Figure: vocal tract schematic and an ultrasound sample, with the tongue surface marked]

SLIDE 6

Silent Speech Interface (SSI) IV

SSI types

recognition followed by synthesis
direct synthesis

mapping techniques

Gaussian Mixture Models
Mixture of Factor Analyzers
Deep Neural Networks (DNN)

previously, only one study combined UTI and DNNs

[Jaumard-Hakoun et al., 2016]

singing voice synthesis
estimation of vocoder spectral parameters based on UTI and lip video
AutoEncoder / Multi-Layer Perceptron

SLIDE 7

Goal of the current study

initial experiments in articulatory-to-acoustic mapping

"Micro" ultrasound equipment – access to raw data
direct speech synthesis from ultrasound recordings
based on deep learning, using a feed-forward deep neural network

SLIDE 8

Methods

SLIDE 9

Recordings and data I

parallel / synchronized ultrasound and speech recordings
"Micro" system, with stabilization headset (Articulate Instruments Ltd.)

one female speaker

473 Hungarian sentences from PPBA database

[Olaszy, 2013]

ultrasound frame rate: 82 fps
speech sampling frequency: 22 050 Hz

SLIDE 10

Recordings and data II

Images from the same speaker, with differing quality.

SLIDE 11

Processing the speech signal

simple impulse-noise excited vocoder

analysis

speech resampled to 11 050 Hz
excitation parameter: fundamental frequency (F0)
spectral parameter: Mel-Generalized Cepstrum (MGC), order-12 MGC-LSP
in synchrony with ultrasound → frame shift: 1 / (82 fps)

synthesis

impulse-noise excitation using original F0
spectral filtering using predicted MGC
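The synthesis side above reduces to generating a frame-synchronous excitation signal. A minimal numpy-only sketch follows; the function name, the hop size of 135 samples (≈ 11 050 Hz / 82 fps) and the noise gain are illustrative assumptions, not the authors' code:

```python
import numpy as np

def impulse_noise_excitation(f0_per_frame, fs=11050, hop=135):
    """Frame-synchronous excitation: impulse train in voiced frames
    (F0 > 0), white noise in unvoiced frames (F0 == 0)."""
    rng = np.random.default_rng(0)
    excitation = np.zeros(len(f0_per_frame) * hop)
    next_pulse = 0                         # index of the next glottal impulse
    for i, f0 in enumerate(f0_per_frame):
        start, end = i * hop, (i + 1) * hop
        if f0 > 0:                         # voiced: impulses one pitch period apart
            period = int(round(fs / f0))
            while next_pulse < end:
                if next_pulse >= start:
                    excitation[next_pulse] = 1.0
                next_pulse += period
        else:                              # unvoiced: low-energy white noise
            excitation[start:end] = 0.1 * rng.standard_normal(hop)
            next_pulse = end
    return excitation
```

Filtering this excitation with the MGC-LSP-derived spectral envelope (e.g. via an MGLSA filter, as in standard vocoder toolkits) then yields the synthesized waveform.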

SLIDE 12

Preprocessing the ultrasound data


raw ultrasound image is 64×946 pixels

resized to 64×119 using bicubic interpolation

input of DNN

simplest case: one ultrasound image
more advanced: use several consecutive images
further reduction of image size is necessary
size of the DNN input vector can be reduced by discarding irrelevant pixels
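The two preprocessing steps can be sketched as follows (assumptions: plain index subsampling stands in for the bicubic interpolation named on the slide, and edge-padding handles utterance boundaries when stacking consecutive frames):

```python
import numpy as np

def resize_scanlines(img, new_w=119):
    """Shrink the 946-sample axis of a raw 64x946 scan to new_w columns.
    Index subsampling here is a stand-in for bicubic interpolation."""
    idx = np.linspace(0, img.shape[1] - 1, new_w).round().astype(int)
    return img[:, idx]

def stack_frames(frames, context=2):
    """Concatenate each flattened frame with its +/-context neighbours,
    giving the '5 consecutive images' DNN input for context=2."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    width = 2 * context + 1
    return np.stack([padded[i:i + width].ravel() for i in range(len(frames))])
```

Even after resizing, 5 × 64 × 119 = 38 080 inputs per frame motivate the feature selection methods on the next slides.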

SLIDE 13

Correlation-based feature selection

mean / maximum of the correlation of pixels (DNN input) with MGC (DNN output)

only retained 5, 10, ..., 25% of the pixels with the largest importance scores

A raw ultrasound image and the mask (max., 20%).
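A minimal sketch of this ranking, assuming Pearson correlation between each pixel and each MGC coefficient (function and argument names are illustrative):

```python
import numpy as np

def select_pixels(pixels, mgc, keep_ratio=0.20, reduce="max"):
    """Keep the pixels whose |correlation| with the MGC targets is largest.
    pixels: (n_frames, n_pixels); mgc: (n_frames, n_coeffs)."""
    # Standardize both sides so a scaled dot product equals Pearson correlation.
    px = (pixels - pixels.mean(0)) / (pixels.std(0) + 1e-8)
    tg = (mgc - mgc.mean(0)) / (mgc.std(0) + 1e-8)
    corr = np.abs(px.T @ tg) / len(pixels)            # (n_pixels, n_coeffs)
    score = corr.max(1) if reduce == "max" else corr.mean(1)
    n_keep = int(round(keep_ratio * pixels.shape[1]))
    return np.sort(np.argsort(score)[::-1][:n_keep])  # retained pixel indices
```

The `reduce="max"` and `reduce="mean"` choices correspond to the two variants (max. / avg.) compared in the results table.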

SLIDE 14

Eigentongue feature extraction

find a finite set of orthogonal images (called eigentongues) [Hueber et al., 2007]
apply PCA on the ultrasound images, keep 20% of the information

The first two extracted Eigentongues.
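A sketch of the idea via SVD-based PCA, assuming "20% of the information" means 20% of the total variance (function names are illustrative):

```python
import numpy as np

def eigentongues(images, var_kept=0.20):
    """PCA over flattened ultrasound frames: returns the leading
    'eigentongues' covering var_kept of the variance, plus the mean image."""
    X = images.reshape(len(images), -1).astype(float)
    mean = X.mean(0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(explained, var_kept)) + 1
    return vt[:k], mean            # each row of vt[:k] is one eigentongue

def project(images, basis, mean):
    """Low-dimensional DNN input: coordinates in the eigentongue basis."""
    X = images.reshape(len(images), -1).astype(float) - mean
    return X @ basis.T
```

In practice the basis is fitted on training frames only and `project` is applied to all frames.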

SLIDE 15

Deep learning I

input: 1 image / 5 consecutive images
feature selection / EigenTongue
output: spectral parameters (MGC)

fully connected deep rectifier neural networks
5 hidden layers, 1000 neurons / layer
linear output layer

two training types
joint model: one DNN for the full MGC vector
separate models: separate DNNs for each of the output features
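The network shape described above can be sketched in numpy (initialization and forward pass only; training by backpropagation and all function names are illustrative, not the authors' setup):

```python
import numpy as np

def init_mlp(n_in, n_out, hidden=1000, depth=5, seed=0):
    """Weights for a fully connected rectifier net:
    `depth` hidden ReLU layers of `hidden` units and a linear output."""
    rng = np.random.default_rng(seed)
    sizes = [n_in] + [hidden] * depth + [n_out]
    return [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for w, b in params[:-1]:
        x = np.maximum(x @ w + b, 0.0)   # ReLU hidden layers
    w, b = params[-1]
    return x @ w + b                     # linear output: the MGC vector
```

The joint model uses one such net with `n_out` equal to the full MGC vector length; the separate-models variant trains one net per coefficient with `n_out=1`.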

SLIDE 16

Experimental results

SLIDE 17

Objective measurements

Normalized Mean Square Error (NMSE) and mean R² scores on the development set

Type                                  NMSE   Mean R²
DNN (separate models)                 0.409  0.597
DNN (joint model)                     0.384  0.619
DNN (feature selection (max.), 20%)   0.441  0.562
DNN (feature selection (avg.), 20%)   0.442  0.561
DNN (Eigentongue, 20%)                0.432  0.577
DNN (feature sel. (max.), 5 images)   0.380  0.625
DNN (feature sel. (avg.), 5 images)   0.388  0.615
DNN (Eigentongue, 5 images)           0.402  0.608
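For reference, common definitions of the two scores (the paper's exact normalization may differ; this is one standard reading, with variance-normalized MSE and per-coefficient R² averaged over the MGC dimensions):

```python
import numpy as np

def nmse(y_true, y_pred):
    """Mean squared error normalized by the overall target variance."""
    return float(np.mean((y_true - y_pred) ** 2) / np.var(y_true))

def mean_r2(y_true, y_pred):
    """R^2 per output dimension (per MGC coefficient), then averaged."""
    ss_res = ((y_true - y_pred) ** 2).sum(0)
    ss_tot = ((y_true - y_true.mean(0)) ** 2).sum(0)
    return float(np.mean(1.0 - ss_res / ss_tot))
```

Under these definitions, lower NMSE and higher mean R² are better, matching the ordering in the table.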

SLIDE 18

Subjective listening test I

MUSHRA (MUlti-Stimulus test with Hidden Reference and Anchor)
6 types, 10 sentences, 1 speaker

natural sentences
vocoded reference
anchor: using constant MGC from a schwa vowel
3 proposed DNN approaches

goal: evaluate overall naturalness

rate from 0 (highly unnatural) to 100 (highly natural)

23 Hungarian listeners

20 females, 3 males; 19–32 years old

SLIDE 19

Subjective listening test II

Results of the listening test – mean naturalness (0–100):

Natural                           94.82
Anchor                             2.65
Vocoded                           56.22
DNN (joint model)                 30.21
DNN (Eigentongue, 5 images)       31.10
DNN (feat.sel.(max.), 5 images)   32.18

SLIDE 20

Summary, conclusions

SLIDE 21

Summary and conclusions I

goal of the study: synthesize speech from tongue ultrasound images

DNN-based articulatory-to-acoustic mapping
tongue ultrasound → vocoder spectral parameters

various approaches

joint model to predict all spectral features
separate models for predicting the spectral features
two variants of a correlation-based feature selection
Eigentongue feature extraction to reduce the size of ultrasound images
feature selection methods combined with using several consecutive ultrasound frames

SLIDE 22

Summary and conclusions II

synthesized sentences (using the original F0) are mostly intelligible

a) vocoded reference [sample]
b) DNN (feature sel. (max.), 5 images) [sample]

[Spectrograms (0–5000 Hz, ~4 s) of a) the vocoded reference and b) the DNN output]

„Gyengéden megcirógatta az orrát egy papírcsiptetővel."
(English: "He gently stroked his nose with a paper clip.")

SLIDE 23

Applications

Silent Speech Interface (long-term goals)

useful for the speaking impaired (e.g. after laryngectomy)
can be used in extremely noisy environments to create speech, just with 'mouthing'
private conversations in public areas

SLIDE 24

Future plans

mapping from articulatory data to F0
investigate other neural network types (e.g. AutoEncoders and CNNs)
use multimodal articulatory data (e.g. video of the lips; EMA)
test more advanced vocoders
record real silent speech (silent articulation)

SLIDE 25

Thank you for your attention!

SLIDE 26

References I

Hueber, T., Aversano, G., Chollet, G., Denby, B., Dreyfus, G., Oussar, Y., Roussel, P., and Stone, M. (2007). Eigentongue feature extraction for an ultrasound-based silent speech interface. In Proc. ICASSP, pages 1245–1248, Honolulu, HI, USA.

Jaumard-Hakoun, A., Xu, K., Leboullenger, C., Roussel-Ragot, P., and Denby, B. (2016). An articulatory-based singing voice synthesis using tongue and lips imaging. In Proc. Interspeech, pages 1467–1471.

Olaszy, G. (2013). Precíziós, párhuzamos magyar beszédadatbázis fejlesztése és szolgáltatásai [Development and services of a precise, parallel Hungarian speech database]. Beszédkutatás 2013, pages 261–270.

Stone, M. (2005). A guide to analysing tongue motion from ultrasound images. Clinical Linguistics & Phonetics, 19(6-7):455–501.

Stone, M., Sonies, B., Shawker, T., Weiss, G., and Nadel, L. (1983). Analysis of real-time ultrasound images of tongue configuration using a grid-digitizing system. Journal of Phonetics, 11:207–218.