The Voice Conversion Challenge 2016 - PowerPoint PPT Presentation




SLIDE 1

The Voice Conversion Challenge 2016

Tomoki Toda (Nagoya U, Japan)
Ling-Hui Chen (USTC, China)
Daisuke Saito (Tokyo U, Japan)
Fernando Villavicencio (NII, Japan)
Mirjam Wester (CSTR, UK)
Zhizheng Wu (CSTR, UK)
Junichi Yamagishi (NII/CSTR, Japan/UK)

  • Sep. 10th, 2016
SLIDE 2

Voice Conversion (VC)

  • Technique to modify a speech waveform to convert non-/para-linguistic information while preserving linguistic information
  • Core questions around VC: How to factorize? How to analyze? How to generate? How to convert? How to parameterize?
  • Research progress since the late 1980s
  • Development of various VC techniques (& potential applications)
  • Not straightforward to compare across different VC techniques…

SLIDE 3

Voice Conversion Challenge 2016

Objective: better understand different VC techniques by comparing their performance using a freely-available dataset as a common dataset

  • Following the policy of the Blizzard Challenge [Black & Tokuda, 2005]: an "evaluation campaign" rather than a "competition"
  • Also reveal a risk of VC techniques
  • Effective but possible to be used for spoofing
  • Important to inform people of VC as a "kitchen knife"

SLIDE 4

Timelines of VCC 2016

(Short announcement at INTERSPEECH 2015, Sep. 9th, 2015)

  • Nov. 18th, 2015: Announcement & registration open
  • Nov. 25th, 2015: Release of training data (1.5 months for training)
  • Jan. 8th, 2016: Release of evaluation data (1 week for conversion)
  • Jan. 15th, 2016: Deadline to submit the converted voice samples (1.5 months for evaluation)
  • Feb. 29th, 2016: Notification of results
SLIDE 5

Task of VCC 2016

  • Simple speaker identity conversion [Abe et al., 1990]
  • Develop conversion systems using parallel data of each speaker pair
  • 1. Training with parallel data (utterance pairs): the source speaker and the target speaker utter the same sentences (e.g., "Please say the same thing.")
  • 2. Conversion of any utterance (e.g., "Let's convert my voice."): the trained conversion system y = f(x) maps source speech x to converted speech y
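Steps 1 and 2 above can be sketched in toy form: dynamic time warping (DTW) aligns the parallel utterance pair frame by frame, and a mapping is fitted on the aligned frames. The snippet below is an illustrative assumption only; a single affine transform stands in for the GMM or neural mappings actually used in the challenge, and all function names are mine:

```python
import numpy as np

def dtw_align(x, y):
    """Align two feature sequences (frames x dims) with plain DTW,
    returning index pairs on the optimal warping path."""
    n, m = len(x), len(y)
    # frame-wise Euclidean distance matrix
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack from the end of both sequences
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def train_linear_conversion(src, tgt):
    """Step 1: fit y ~ W [x; 1] by least squares on DTW-aligned frames."""
    pairs = dtw_align(src, tgt)
    X = np.hstack([src[[i for i, _ in pairs]], np.ones((len(pairs), 1))])
    Y = tgt[[j for _, j in pairs]]
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def convert(src, W):
    """Step 2: apply the trained mapping to any utterance."""
    return np.hstack([src, np.ones((len(src), 1))]) @ W
```

In a real system the frames would be spectral features (e.g., mel-cepstra) and the affine map would be replaced by one of the conversion methods listed on the later slides.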

SLIDE 6

VCC 2016 Dataset [http://dx.doi.org/10.7488/ds/1430]

  • DAPS (Data And Production Speech) [Mysore, 2015]
  • Professional US English speakers
  • Freely available [https://archive.org/details/daps_dataset]
  • Design of VCC 2016 dataset
  • Select 10 speakers including 5 female and 5 male speakers
  • Manually segmented into 216 sentences for each speaker
  • Down-sampled to 16 kHz

|         | # of speakers       | # of sentences                       |
| Sources | 3 females & 2 males | 162 for training & 54 for evaluation |
| Targets | 2 females & 3 males | 162 for training                     |
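The down-sampling step can be mimicked with a naive pure-NumPy sketch (an illustrative assumption: linear interpolation, without the anti-aliasing low-pass filter that a production resampler would apply first):

```python
import numpy as np

def downsample(wave, orig_sr, target_sr=16000):
    """Naively resample a waveform to target_sr by linear interpolation.
    A real pipeline would low-pass filter first to avoid aliasing."""
    n_out = int(len(wave) * target_sr / orig_sr)
    # fractional sample positions in the original waveform
    t_out = np.arange(n_out) * (orig_sr / target_sr)
    return np.interp(t_out, np.arange(len(wave)), wave)
```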

SLIDE 7

Rules of VCC 2016

  • Requirement
  • Develop all 5 x 5 = 25 combinations of source-target pairs
  • Main guidelines
  • Transform any acoustic features: OK!
  • Manual edit or tuning of systems in conversion: NOT allowed
  • Use manual transcriptions: NOT allowed
  • Use automatic speech recognition (ASR): OK!
  • Develop a system for a certain speaker pair using data of other pairs within the VCC 2016 dataset: NOT allowed
  • Use external data outside the VCC 2016 dataset: OK!
  • Discard a part of utterances of the training set: OK!
  • Submit multiple entries: NOT allowed

SLIDE 8

Evaluation Methodology

  • Subjective evaluation
  • Use only 16 speaker pairs (2 males & 2 females) from 25 speaker pairs
  • Use headphones in sound-treated booths
  • Listeners: 200 subjects
  • 1. Opinion test on naturalness
  • Evaluate naturalness of each voice sample using a 5-scale opinion score: 1 (completely unnatural) to 5 (completely natural)
  • 2. Pair-comparison test on speaker similarity
  • Judge whether 2 voice samples are uttered by the same speaker
  • Decision with confidence: "Same, absolutely sure" / "Same, not sure" / "Different, not sure" / "Different, absolutely sure"
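Both tests reduce to simple statistics over listener responses. A minimal scoring sketch (the helper names are hypothetical, not part of any challenge toolkit):

```python
import numpy as np

def mean_opinion_score(ratings):
    """Average 1-5 naturalness ratings into a mean opinion score (MOS)."""
    return float(np.mean(ratings))

def similarity_correct_rate(answers, same_speaker):
    """Collapse the 4-way confidence answers to a binary same/different
    decision and score it against ground truth, in percent."""
    hits = sum(a.startswith("Same") == truth
               for a, truth in zip(answers, same_speaker))
    return 100.0 * hits / len(answers)
```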

SLIDE 9

Baseline System (Freely Available)

  • VC tools [Toda] within FestVox [Black & Lenzo]
  • Analysis methods
  • F0 extraction with Edinburgh Speech Tools [Taylor et al.]
  • Spectral analysis with Signal Processing Toolkit (SPTK) [Tokuda et al.]
  • Converted parameters
  • Mel-cepstrum (MCEP): trajectory-wise conversion (MLPG) using global variance (GV) w/ Gaussian mixture model (GMM)
  • Log-scaled F0 (LF0): linear transformation w/ mean & variance (M&V)
  • Synthesis methods
  • Simple pulse/noise excitation
  • Mel-log spectrum approximation (MLSA) filter
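The baseline's LF0 conversion is a closed-form mean & variance (M&V) linear transform in the log domain. A minimal sketch, not the FestVox code itself; the unvoiced-frame handling (frames with F0 == 0 passed through unchanged) is my assumption:

```python
import numpy as np

def convert_f0(f0, src_train_f0, tgt_train_f0):
    """Mean/variance linear transformation of log-scaled F0 (LF0):
    normalize by the source speaker's log-F0 statistics, then rescale
    to the target speaker's. Unvoiced frames (F0 == 0) pass through."""
    def lf0_stats(values):
        lf0 = np.log(values[values > 0])  # voiced frames only
        return lf0.mean(), lf0.std()

    mu_s, sd_s = lf0_stats(src_train_f0)
    mu_t, sd_t = lf0_stats(tgt_train_f0)
    out = f0.astype(float).copy()
    voiced = f0 > 0
    out[voiced] = np.exp((np.log(f0[voiced]) - mu_s) / sd_s * sd_t + mu_t)
    return out
```

For example, a source frame sitting exactly at the source speaker's mean log-F0 is mapped to the target speaker's mean log-F0.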

SLIDE 10

Submitted Systems

| Team | Ana-Syn  | Spectral envelope            | F0 pattern | Excitation | Duration | ASR | +DB |
| A    | Ahocoder | MCEP: GMM, MGE, MLPG, PF     | LF0 M&V    | -          | -        | No  | No  |
| B    | STRAIGHT | MCEP: Exemplar, MLPG, GV     | LF0 M&V    | -          | -        | No  | No  |
| C    | STRAIGHT | MLSP: DNN & GMM, PF          | LF0 M&V    | -          | -        | No  | Yes |
| D    | STRAIGHT | MCEP: MDN & GMM, PF          | LF0 M&V    | -          | -        | No  | No  |
| E    | Ahocoder | MCEP: GMM, FW & scaling      | LF0 M&V    | -          | -        | No  | No  |
| F    | STRAIGHT | MCEP: phone posteriorgram    | LF0 M&V    | -          | -        | Yes | Yes |
| G    | STRAIGHT | MCEP: LSTM-RNN               | LF0 M&V    | -          | Spk rate | Yes | Yes |
| H    | STRAIGHT | MCEP: DNN, MTL               | LF0 M&V    | -          | Spk rate | Yes | Yes |
| I    | Ahocoder | LSP: GMM, MMSE, i-vector     | LF0 M&V    | -          | -        | No  | Yes |
| J    | STRAIGHT | MCEP: GMM, MS, diff filter   | LF0 M&V    | BAP        | -        | No  | No  |
| K    | TEAP     | MLSP: FW & GMM, diff filter  | F0 shift   | -          | Spk rate | No  | No  |
| L    | STRAIGHT | Multi systems & selection    | LF0 M&V    | Resid      | -        | Yes | Yes |
| M    | STRAIGHT | MCEP: LSTM                   | LF0 M&V    | -          | -        | No  | No  |
| N    | LPC      | LP coef: FW                  | F0 shift   | -          | Spk rate | No  | No  |
| O    | STRAIGHT | ST spec: FW & GTDNN          | LF0 LSTM   | BAP        | -        | No  | No  |
| P    | STRAIGHT | MCEP: GMM, MLPG, GV          | LF0 M&V    | BAP        | -        | No  | No  |
| Q    | Ahocoder | MCEP: Frame selection, MLPG  | LF0 M&V    | -          | -        | No  | No  |
SLIDE 11

Submitted Systems

(Same table as Slide 10, with the converted parameters grouped by column: Spectral envelope, F0 pattern, Excitation, Duration.)

SLIDE 12

Overall Results of Listening Tests

[Figure: scatter plot of the submitted systems A-Q, the Baseline, the Source, and the Target. X-axis: MOS on naturalness (1-5, higher is better); y-axis: correct rate [%] on speaker similarity to the target (higher is better).]

SLIDE 13

Overall Results of Listening Tests

[Same scatter plot as Slide 12.]

SLIDE 14

Overall Results of Listening Tests

[Same scatter plot as Slide 12, annotated with reference values MOS = 3.5 and correct rate = 75%.]

SLIDE 15

Overall Results of Listening Tests

[Same annotated scatter plot as Slide 14.]

SLIDE 16

Discussion and Future Plan

  • Issues of listening test
  • US English evaluated by British English subjects (less sensitive to prosody?)
  • Hard to separately evaluate prosodic and spectral conversion
  • Suggestions towards next challenge
  • Use fewer or more training utterances
  • Use non-parallel datasets
  • Use data recorded in non-ideal acoustic conditions
  • Future plan and collaboration
  • Provide converted voices for the Automatic Speaker Verification Spoofing and Countermeasures (ASVspoof) Challenge [Wu et al., 2015]
  • Hold VCC every 2 years (?)
  • We appreciate your help (e.g., providing data, managing evaluation, …)!
SLIDE 17

Conclusions

  • Voice Conversion Challenge 2016 (VCC 2016)
  • Task: speaker identity conversion
  • Datasets: VCC 2016 dataset built from the DAPS dataset
  • Participants: 17 teams
  • Test: naturalness & speaker similarity evaluated by 200 subjects
  • Results: MOS on naturalness < 3.5 & correct rate on similarity < 75%

  • VCC homepage: http://vc‐challenge.org/ (to be updated)
  • Datasets & results: http://dx.doi.org/10.7488/ds/1430
  • Email: vcc2016@vc‐challenge.org

Any comments and suggestions are very welcome!

SLIDE 18

Acknowledgement

  • We are grateful to
  • COLIPS for sponsoring the evaluation
  • iFLYTEK for supporting database development
  • This work was supported in part by
  • EPSRC through Programme Grants EP/I031022/1 (NST) and EP/J002526/1 (CAF)
  • JSPS KAKENHI Grant Number 26280060
  • We specially thank the Blizzard Challenge Organizers [King et al.] for kindly allowing us to use the evaluation system!