The Voice Conversion Challenge 2016 - PowerPoint PPT Presentation




SLIDE 1

The Voice Conversion Challenge 2016

Tomoki Toda (Nagoya U, Japan)
Ling-Hui Chen (USTC, China)
Daisuke Saito (Tokyo U, Japan)
Fernando Villavicencio (NII, Japan)
Mirjam Wester (CSTR, UK)
Zhizheng Wu (CSTR, UK)
Junichi Yamagishi (NII/CSTR, Japan/UK)

  • Sep. 10th, 2016
SLIDE 2

Voice Conversion (VC)

  • Technique to modify a speech waveform to convert non-/para-linguistic information while preserving linguistic information
  • Core questions around VC: How to factorize? How to analyze? How to generate? How to convert? How to parameterize?
  • Research progress since the late 1980s
  • Development of various VC techniques (& potential applications)
  • Not straightforward to compare across different VC techniques…

SLIDE 3

Voice Conversion Challenge 2016

Objective: better understand different VC techniques by comparing their performance using a freely-available dataset as a common dataset

  • Following the policy of the Blizzard Challenge [Black & Tokuda, 2005]: an "evaluation campaign" rather than a "competition"
  • Also reveal a risk of VC techniques
  • Effective but possible to be used for spoofing
  • Important to inform people of VC as a "kitchen knife"

SLIDE 4

Timelines of VCC 2016

(Short announcement at INTERSPEECH 2015, Sep. 9th, 2015)

  • Nov. 18th, 2015: Announcement & registration open
  • Nov. 25th, 2015: Release of training data (1.5 months for training)
  • Jan. 8th, 2016: Release of evaluation data (1 week for conversion)
  • Jan. 15th, 2016: Deadline to submit the converted voice samples (1.5 months for evaluation)
  • Feb. 29th, 2016: Notification of results
SLIDE 5

Task of VCC 2016

  • Simple speaker identity conversion [Abe et al., 1990]
  • Develop conversion systems using parallel data of each speaker pair
  • 1. Training with parallel data (utterance pairs): the source speaker and the target speaker utter the same sentences (e.g., "Please say the same thing.")
  • 2. Conversion of any utterance (e.g., "Let's convert my voice."): the trained conversion system y = f(x) maps source speech x to converted speech y
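Steps 1 and 2 above can be sketched in toy form: dynamic time warping (DTW) aligns the parallel utterance pair frame by frame, and a mapping is fitted on the aligned frames. The snippet below is an illustrative assumption only; a single affine transform stands in for the GMM or neural mappings actually used in the challenge, and all function names are mine:

```python
import numpy as np

def dtw_align(x, y):
    """Align two feature sequences (frames x dims) with plain DTW,
    returning index pairs on the optimal warping path."""
    n, m = len(x), len(y)
    # frame-wise Euclidean distance matrix
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack from the end of both sequences
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def train_linear_conversion(src, tgt):
    """Step 1: fit y ~ W [x; 1] by least squares on DTW-aligned frames."""
    pairs = dtw_align(src, tgt)
    X = np.hstack([src[[i for i, _ in pairs]], np.ones((len(pairs), 1))])
    Y = tgt[[j for _, j in pairs]]
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def convert(src, W):
    """Step 2: apply the trained mapping to any utterance."""
    return np.hstack([src, np.ones((len(src), 1))]) @ W
```

In a real system the frames would be spectral features (e.g., mel-cepstra) and the affine map would be replaced by one of the conversion methods listed on the later slides.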

SLIDE 6

VCC 2016 Dataset [http://dx.doi.org/10.7488/ds/1430]

  • DAPS (Data And Production Speech) [Mysore, 2015]
  • Professional US English speakers
  • Freely available [https://archive.org/details/daps_dataset]
  • Design of VCC 2016 dataset
  • Select 10 speakers including 5 female and 5 male speakers
  • Manually segmented into 216 sentences for each speaker
  • Down-sampled to 16 kHz

|         | # of speakers       | # of sentences                       |
| Sources | 3 females & 2 males | 162 for training & 54 for evaluation |
| Targets | 2 females & 3 males | 162 for training                     |
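The down-sampling step can be mimicked with a naive pure-NumPy sketch (an illustrative assumption: linear interpolation, without the anti-aliasing low-pass filter that a production resampler would apply first):

```python
import numpy as np

def downsample(wave, orig_sr, target_sr=16000):
    """Naively resample a waveform to target_sr by linear interpolation.
    A real pipeline would low-pass filter first to avoid aliasing."""
    n_out = int(len(wave) * target_sr / orig_sr)
    # fractional sample positions in the original waveform
    t_out = np.arange(n_out) * (orig_sr / target_sr)
    return np.interp(t_out, np.arange(len(wave)), wave)
```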

SLIDE 7

Rules of VCC 2016

  • Requirement
  • Develop all 5 x 5 = 25 combinations of source-target pairs
  • Main guidelines
  • Transform any acoustic features: OK!
  • Manual edit or tuning of systems in conversion: NOT allowed
  • Use manual transcriptions: NOT allowed
  • Use automatic speech recognition (ASR): OK!
  • Develop a system for a certain speaker pair using data of other pairs within the VCC 2016 dataset: NOT allowed
  • Use external data outside the VCC 2016 dataset: OK!
  • Discard a part of utterances of the training set: OK!
  • Submit multiple entries: NOT allowed

SLIDE 8

Evaluation Methodology

  • Subjective evaluation
  • Use only 16 speaker pairs (2 males & 2 females) from 25 speaker pairs
  • Use headphones in sound-treated booths
  • Listeners: 200 subjects
  • 1. Opinion test on naturalness
  • Evaluate naturalness of each voice sample using a 5-scale opinion score: 1 (completely unnatural) to 5 (completely natural)
  • 2. Pair-comparison test on speaker similarity
  • Judge whether 2 voice samples are uttered by the same speaker
  • Decision with confidence: "Same, absolutely sure" / "Same, not sure" / "Different, not sure" / "Different, absolutely sure"
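Both tests reduce to simple statistics over listener responses. A minimal scoring sketch (the helper names are hypothetical, not part of any challenge toolkit):

```python
import numpy as np

def mean_opinion_score(ratings):
    """Average 1-5 naturalness ratings into a mean opinion score (MOS)."""
    return float(np.mean(ratings))

def similarity_correct_rate(answers, same_speaker):
    """Collapse the 4-way confidence answers to a binary same/different
    decision and score it against ground truth, in percent."""
    hits = sum(a.startswith("Same") == truth
               for a, truth in zip(answers, same_speaker))
    return 100.0 * hits / len(answers)
```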

SLIDE 9

Baseline System (Freely Available)

  • VC tools [Toda] within FestVox [Black & Lenzo]
  • Analysis methods
  • F0 extraction with Edinburgh Speech Tools [Taylor et al.]
  • Spectral analysis with Signal Processing Toolkit (SPTK) [Tokuda et al.]
  • Converted parameters
  • Mel-cepstrum (MCEP): trajectory-wise conversion (MLPG) using global variance (GV) w/ Gaussian mixture model (GMM)
  • Log-scaled F0 (LF0): linear transformation w/ mean & variance (M&V)
  • Synthesis methods
  • Simple pulse/noise excitation
  • Mel-log spectrum approximation (MLSA) filter
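The baseline's LF0 conversion is a closed-form mean & variance (M&V) linear transform in the log domain. A minimal sketch, not the FestVox code itself; the unvoiced-frame handling (frames with F0 == 0 passed through unchanged) is my assumption:

```python
import numpy as np

def convert_f0(f0, src_train_f0, tgt_train_f0):
    """Mean/variance linear transformation of log-scaled F0 (LF0):
    normalize by the source speaker's log-F0 statistics, then rescale
    to the target speaker's. Unvoiced frames (F0 == 0) pass through."""
    def lf0_stats(values):
        lf0 = np.log(values[values > 0])  # voiced frames only
        return lf0.mean(), lf0.std()

    mu_s, sd_s = lf0_stats(src_train_f0)
    mu_t, sd_t = lf0_stats(tgt_train_f0)
    out = f0.astype(float).copy()
    voiced = f0 > 0
    out[voiced] = np.exp((np.log(f0[voiced]) - mu_s) / sd_s * sd_t + mu_t)
    return out
```

For example, a source frame sitting exactly at the source speaker's mean log-F0 is mapped to the target speaker's mean log-F0.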

SLIDE 10

Submitted Systems

| Team | Ana-Syn  | Spectral envelope            | F0 pattern | Excitation | Duration | ASR | +DB |
| A    | Ahocoder | MCEP: GMM, MGE, MLPG, PF     | LF0 M&V    | -          | -        | No  | No  |
| B    | STRAIGHT | MCEP: Exemplar, MLPG, GV     | LF0 M&V    | -          | -        | No  | No  |
| C    | STRAIGHT | MLSP: DNN & GMM, PF          | LF0 M&V    | -          | -        | No  | Yes |
| D    | STRAIGHT | MCEP: MDN & GMM, PF          | LF0 M&V    | -          | -        | No  | No  |
| E    | Ahocoder | MCEP: GMM, FW & scaling      | LF0 M&V    | -          | -        | No  | No  |
| F    | STRAIGHT | MCEP: phone posteriorgram    | LF0 M&V    | -          | -        | Yes | Yes |
| G    | STRAIGHT | MCEP: LSTM-RNN               | LF0 M&V    | -          | Spk rate | Yes | Yes |
| H    | STRAIGHT | MCEP: DNN, MTL               | LF0 M&V    | -          | Spk rate | Yes | Yes |
| I    | Ahocoder | LSP: GMM, MMSE, i-vector     | LF0 M&V    | -          | -        | No  | Yes |
| J    | STRAIGHT | MCEP: GMM, MS, diff filter   | LF0 M&V    | BAP        | -        | No  | No  |
| K    | TEAP     | MLSP: FW & GMM, diff filter  | F0 shift   | -          | Spk rate | No  | No  |
| L    | STRAIGHT | Multi systems & selection    | LF0 M&V    | Resid      | -        | Yes | Yes |
| M    | STRAIGHT | MCEP: LSTM                   | LF0 M&V    | -          | -        | No  | No  |
| N    | LPC      | LP coef: FW                  | F0 shift   | -          | Spk rate | No  | No  |
| O    | STRAIGHT | ST spec: FW & GTDNN          | LF0 LSTM   | BAP        | -        | No  | No  |
| P    | STRAIGHT | MCEP: GMM, MLPG, GV          | LF0 M&V    | BAP        | -        | No  | No  |
| Q    | Ahocoder | MCEP: Frame selection, MLPG  | LF0 M&V    | -          | -        | No  | No  |
SLIDE 11

Submitted Systems

(Same table as Slide 10, with the converted parameters grouped by column: Spectral envelope, F0 pattern, Excitation, Duration.)

SLIDE 12

Overall Results of Listening Tests

[Figure: scatter plot of the submitted systems A-Q, the Baseline, the Source, and the Target. X-axis: MOS on naturalness (1-5, higher is better); y-axis: correct rate [%] on speaker similarity to the target (higher is better).]

SLIDE 13

Overall Results of Listening Tests

[Same scatter plot as Slide 12.]

SLIDE 14

Overall Results of Listening Tests

[Same scatter plot as Slide 12, annotated with reference values MOS = 3.5 and correct rate = 75%.]

SLIDE 15

Overall Results of Listening Tests

[Same annotated scatter plot as Slide 14.]

SLIDE 16

Discussion and Future Plan

  • Issues of listening test
  • US English evaluated by British English subjects (less sensitive to prosody?)
  • Hard to separately evaluate prosodic and spectral conversion
  • Suggestions towards next challenge
  • Use fewer or more training utterances
  • Use non-parallel datasets
  • Use data recorded in non-ideal acoustic conditions
  • Future plan and collaboration
  • Provide converted voices for the Automatic Speaker Verification Spoofing and Countermeasures (ASVspoof) Challenge [Wu et al., 2015]
  • Hold VCC every 2 years (?)
  • We appreciate your help (e.g., providing data, managing evaluation, …)!
SLIDE 17

Conclusions

  • Voice Conversion Challenge 2016 (VCC 2016)
  • Task: speaker identity conversion
  • Datasets: VCC 2016 dataset built from the DAPS dataset
  • Participants: 17 teams
  • Test: naturalness & speaker similarity evaluated by 200 subjects
  • Results: MOS on naturalness < 3.5 & correct rate on similarity < 75%

  • VCC homepage: http://vc‐challenge.org/ (to be updated)
  • Datasets & results: http://dx.doi.org/10.7488/ds/1430
  • Email: vcc2016@vc‐challenge.org

Any comments and suggestions are very welcome!

SLIDE 18

Acknowledgement

  • We are grateful to
  • COLIPS for sponsoring the evaluation
  • iFLYTEK for supporting database development
  • This work was supported in part by
  • EPSRC through Programme Grants EP/I031022/1 (NST) and EP/J002526/1 (CAF)
  • JSPS KAKENHI Grant Number 26280060
  • We specially thank the Blizzard Challenge Organizers [King et al.] for kindly allowing us to use the evaluation system!