SLIDE 1

DNN-based Ultrasound-to-Speech Conversion for a Silent Speech Interface

Tamás Gábor Csapó,1,2 Tamás Grósz,3 Gábor Gosztolya,3,4 László Tóth,4 Alexandra Markó2,5

1 BME Department of Telecommunications and Media Informatics
2 MTA-ELTE Lendület Lingual Articulation Research Group
3 Institute of Informatics, University of Szeged
4 MTA-SZTE Research Group on Artificial Intelligence
5 ELTE Department of Phonetics

Hungary

Interspeech 2017, Stockholm, August 24, 2017

SLIDE 2

Introduction Methods Results Summary Articulation and speech

Introduction

2 / 25 Csapó, Grósz, Gosztolya, Tóth, Markó DNN-based Ultrasound-to-Speech Conversion

SLIDE 3

Silent Speech Interface (SSI) I

goal

convert silent articulation to audible speech while the speaker just "mouths" the words
→ articulatory-to-acoustic mapping

imaging techniques

Ultrasound Tongue Imaging (UTI)
Electromagnetic Articulography (EMA)
Permanent Magnetic Articulography (PMA)
lip video
multimodal
. . .

SLIDE 4

Silent Speech Interface (SSI) II

Ultrasound Tongue Imaging (UTI)

used in speech research since the early '80s
ultrasound transducer positioned below the chin during speech
tongue movement on video (up to 100 frames/sec)
tongue surface has a greater brightness than the surrounding tissue and air
relatively good temporal and spatial resolution

[Stone et al., 1983, Stone, 2005]

SLIDE 5

Silent Speech Interface (SSI) III

[Figure: vocal tract schematic and an ultrasound sample, with the tongue surface marked]

SLIDE 6

Silent Speech Interface (SSI) IV

SSI types

recognition followed by synthesis
direct synthesis

mapping techniques

Gaussian Mixture Models
Mixture of Factor Analyzers
Deep Neural Networks (DNN)

previously, only one study combined UTI and DNNs

[Jaumard-Hakoun et al., 2016]

singing voice synthesis
estimation of vocoder spectral parameters based on UTI and lip video
AutoEncoder / Multi-Layer Perceptron

SLIDE 7

Goal of the current study

initial experiments in articulatory-to-acoustic mapping

"Micro" ultrasound equipment – access to raw data
direct speech synthesis from ultrasound recordings
based on deep learning, using a feed-forward deep neural network

SLIDE 8

Methods

SLIDE 9

Recordings and data I

parallel / synchronized ultrasound and speech recordings
"Micro" system, with stabilization headset (Articulate Instruments Ltd.)

one female speaker

473 Hungarian sentences from PPBA database

[Olaszy, 2013]

ultrasound frame rate: 82 fps
speech sampling frequency: 22 050 Hz

SLIDE 10

Recordings and data II

Images from the same speaker, with differing quality.

SLIDE 11

Processing the speech signal

simple impulse-noise excited vocoder

analysis

speech resampled to 11 050 Hz
excitation parameter: fundamental frequency (F0)
spectral parameter: Mel-Generalized Cepstrum (MGC), order-12 MGC-LSP
in synchrony with ultrasound → frame shift: 1 / (82 fps)

synthesis

impulse-noise excitation using original F0
spectral filtering using predicted MGC
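The synthesis side above reduces to generating a frame-synchronous excitation signal. A minimal numpy-only sketch follows; the function name, the hop size of 135 samples (≈ 11 050 Hz / 82 fps) and the noise gain are illustrative assumptions, not the authors' code:

```python
import numpy as np

def impulse_noise_excitation(f0_per_frame, fs=11050, hop=135):
    """Frame-synchronous excitation: impulse train in voiced frames
    (F0 > 0), white noise in unvoiced frames (F0 == 0)."""
    rng = np.random.default_rng(0)
    excitation = np.zeros(len(f0_per_frame) * hop)
    next_pulse = 0                         # index of the next glottal impulse
    for i, f0 in enumerate(f0_per_frame):
        start, end = i * hop, (i + 1) * hop
        if f0 > 0:                         # voiced: impulses one pitch period apart
            period = int(round(fs / f0))
            while next_pulse < end:
                if next_pulse >= start:
                    excitation[next_pulse] = 1.0
                next_pulse += period
        else:                              # unvoiced: low-energy white noise
            excitation[start:end] = 0.1 * rng.standard_normal(hop)
            next_pulse = end
    return excitation
```

Filtering this excitation with the MGC-LSP-derived spectral envelope (e.g. via an MGLSA filter, as in standard vocoder toolkits) then yields the synthesized waveform.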

SLIDE 12

Preprocessing the ultrasound data


raw ultrasound image is 64×946 pixels

resized to 64×119 using bicubic interpolation

input of DNN

simplest case: one ultrasound image
more advanced: use several consecutive images
further reduction of image size is necessary
size of the DNN input vector can be reduced by discarding irrelevant pixels
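The two preprocessing steps can be sketched as follows (assumptions: plain index subsampling stands in for the bicubic interpolation named on the slide, and edge-padding handles utterance boundaries when stacking consecutive frames):

```python
import numpy as np

def resize_scanlines(img, new_w=119):
    """Shrink the 946-sample axis of a raw 64x946 scan to new_w columns.
    Index subsampling here is a stand-in for bicubic interpolation."""
    idx = np.linspace(0, img.shape[1] - 1, new_w).round().astype(int)
    return img[:, idx]

def stack_frames(frames, context=2):
    """Concatenate each flattened frame with its +/-context neighbours,
    giving the '5 consecutive images' DNN input for context=2."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    width = 2 * context + 1
    return np.stack([padded[i:i + width].ravel() for i in range(len(frames))])
```

Even after resizing, 5 × 64 × 119 = 38 080 inputs per frame motivate the feature selection methods on the next slides.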

SLIDE 13

Correlation-based feature selection

mean / maximum of the correlation of pixels (DNN input) with MGC (DNN output)

only retained 5, 10, ..., 25% of the pixels with the largest importance scores

A raw ultrasound image and the mask (max., 20%).
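A minimal sketch of this ranking, assuming Pearson correlation between each pixel and each MGC coefficient (function and argument names are illustrative):

```python
import numpy as np

def select_pixels(pixels, mgc, keep_ratio=0.20, reduce="max"):
    """Keep the pixels whose |correlation| with the MGC targets is largest.
    pixels: (n_frames, n_pixels); mgc: (n_frames, n_coeffs)."""
    # Standardize both sides so a scaled dot product equals Pearson correlation.
    px = (pixels - pixels.mean(0)) / (pixels.std(0) + 1e-8)
    tg = (mgc - mgc.mean(0)) / (mgc.std(0) + 1e-8)
    corr = np.abs(px.T @ tg) / len(pixels)            # (n_pixels, n_coeffs)
    score = corr.max(1) if reduce == "max" else corr.mean(1)
    n_keep = int(round(keep_ratio * pixels.shape[1]))
    return np.sort(np.argsort(score)[::-1][:n_keep])  # retained pixel indices
```

The `reduce="max"` and `reduce="mean"` choices correspond to the two variants (max. / avg.) compared in the results table.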

SLIDE 14

Eigentongue feature extraction

find a finite set of orthogonal images (called eigentongues) [Hueber et al., 2007]
apply PCA on the ultrasound images, keep 20% of the information

The first two extracted Eigentongues.
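A sketch of the idea via SVD-based PCA, assuming "20% of the information" means 20% of the total variance (function names are illustrative):

```python
import numpy as np

def eigentongues(images, var_kept=0.20):
    """PCA over flattened ultrasound frames: returns the leading
    'eigentongues' covering var_kept of the variance, plus the mean image."""
    X = images.reshape(len(images), -1).astype(float)
    mean = X.mean(0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(explained, var_kept)) + 1
    return vt[:k], mean            # each row of vt[:k] is one eigentongue

def project(images, basis, mean):
    """Low-dimensional DNN input: coordinates in the eigentongue basis."""
    X = images.reshape(len(images), -1).astype(float) - mean
    return X @ basis.T
```

In practice the basis is fitted on training frames only and `project` is applied to all frames.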

SLIDE 15

Deep learning I

input: 1 image / 5 consecutive images
feature selection / EigenTongue
output: spectral parameters (MGC)

fully connected deep rectifier neural networks
5 hidden layers, 1000 neurons / layer
linear output layer

two training types
joint model: one DNN for the full MGC vector
separate models: separate DNNs for each of the output features
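The network shape described above can be sketched in numpy (initialization and forward pass only; training by backpropagation and all function names are illustrative, not the authors' setup):

```python
import numpy as np

def init_mlp(n_in, n_out, hidden=1000, depth=5, seed=0):
    """Weights for a fully connected rectifier net:
    `depth` hidden ReLU layers of `hidden` units and a linear output."""
    rng = np.random.default_rng(seed)
    sizes = [n_in] + [hidden] * depth + [n_out]
    return [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for w, b in params[:-1]:
        x = np.maximum(x @ w + b, 0.0)   # ReLU hidden layers
    w, b = params[-1]
    return x @ w + b                     # linear output: the MGC vector
```

The joint model uses one such net with `n_out` equal to the full MGC vector length; the separate-models variant trains one net per coefficient with `n_out=1`.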

SLIDE 16

Experimental results

SLIDE 17

Objective measurements

Normalized Mean Square Error (NMSE) and mean R² scores on the development set

Type                                  NMSE   Mean R²
DNN (separate models)                 0.409  0.597
DNN (joint model)                     0.384  0.619
DNN (feature selection (max.), 20%)   0.441  0.562
DNN (feature selection (avg.), 20%)   0.442  0.561
DNN (Eigentongue, 20%)                0.432  0.577
DNN (feature sel. (max.), 5 images)   0.380  0.625
DNN (feature sel. (avg.), 5 images)   0.388  0.615
DNN (Eigentongue, 5 images)           0.402  0.608
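For reference, common definitions of the two scores (the paper's exact normalization may differ; this is one standard reading, with variance-normalized MSE and per-coefficient R² averaged over the MGC dimensions):

```python
import numpy as np

def nmse(y_true, y_pred):
    """Mean squared error normalized by the overall target variance."""
    return float(np.mean((y_true - y_pred) ** 2) / np.var(y_true))

def mean_r2(y_true, y_pred):
    """R^2 per output dimension (per MGC coefficient), then averaged."""
    ss_res = ((y_true - y_pred) ** 2).sum(0)
    ss_tot = ((y_true - y_true.mean(0)) ** 2).sum(0)
    return float(np.mean(1.0 - ss_res / ss_tot))
```

Under these definitions, lower NMSE and higher mean R² are better, matching the ordering in the table.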

SLIDE 18

Subjective listening test I

MUSHRA (MUlti-Stimulus test with Hidden Reference and Anchor)
6 types, 10 sentences, 1 speaker

natural sentences
vocoded reference
anchor: using constant MGC from a schwa vowel
3 proposed DNN approaches

goal: evaluate overall naturalness

rate from 0 (highly unnatural) to 100 (highly natural)

23 Hungarian listeners

20 females, 3 males; 19–32 years old

SLIDE 19

Subjective listening test II

Results of the listening test – mean naturalness (0–100):

Natural                           94.82
Anchor                             2.65
Vocoded                           56.22
DNN (joint model)                 30.21
DNN (Eigentongue, 5 images)       31.10
DNN (feat.sel.(max.), 5 images)   32.18

SLIDE 20

Summary, conclusions

SLIDE 21

Summary and conclusions I

goal of the study: synthesize speech from tongue ultrasound images

DNN-based articulatory-to-acoustic mapping
tongue ultrasound → vocoder spectral parameters

various approaches

joint model to predict all spectral features
separate models for predicting the spectral features
two variants of a correlation-based feature selection
Eigentongue feature extraction to reduce the size of ultrasound images
feature selection methods combined with using several consecutive ultrasound frames

SLIDE 22

Summary and conclusions II

synthesized sentences (using the original F0) are mostly intelligible

a) vocoded reference [sample]
b) DNN (feature sel. (max.), 5 images) [sample]

[Spectrograms (0–5000 Hz, ~4 s) of a) the vocoded reference and b) the DNN output]

„Gyengéden megcirógatta az orrát egy papírcsiptetővel."
(English: "He gently stroked his nose with a paper clip.")

SLIDE 23

Applications

Silent Speech Interface (long-term goals)

useful for the speaking impaired (e.g. after laryngectomy)
can be used in extremely noisy environments to create speech, just with 'mouthing'
private conversations in public areas

SLIDE 24

Future plans

mapping from articulatory data to F0
investigate other neural network types (e.g. AutoEncoders and CNNs)
use multimodal articulatory data (e.g. video of the lips; EMA)
test more advanced vocoders
record real silent speech (silent articulation)

SLIDE 25

Thank you for your attention!

SLIDE 26

References I

Hueber, T., Aversano, G., Chollet, G., Denby, B., Dreyfus, G., Oussar, Y., Roussel, P., and Stone, M. (2007). Eigentongue feature extraction for an ultrasound-based silent speech interface. In Proc. ICASSP, pages 1245–1248, Honolulu, HI, USA.

Jaumard-Hakoun, A., Xu, K., Leboullenger, C., Roussel-Ragot, P., and Denby, B. (2016). An articulatory-based singing voice synthesis using tongue and lips imaging. In Proc. Interspeech, pages 1467–1471.

Olaszy, G. (2013). Precíziós, párhuzamos magyar beszédadatbázis fejlesztése és szolgáltatásai [Development and services of a precise, parallel Hungarian speech database]. Beszédkutatás 2013, pages 261–270.

Stone, M. (2005). A guide to analysing tongue motion from ultrasound images. Clinical Linguistics & Phonetics, 19(6-7):455–501.

Stone, M., Sonies, B., Shawker, T., Weiss, G., and Nadel, L. (1983). Analysis of real-time ultrasound images of tongue configuration using a grid-digitizing system. Journal of Phonetics, 11:207–218.