Analysis of the Voice Conversion Challenge 2016 Evaluation Results

Jan 01, 2023

Analysis of the Voice Conversion Challenge 2016 Evaluation Results. Mirjam Wester, Zhizheng Wu & Junichi Yamagishi.

SLIDE 1

CSTR, The University of Edinburgh

Analysis of the Voice Conversion Challenge 2016 Evaluation Results

Mirjam Wester, Zhizheng Wu & Junichi Yamagishi

SLIDE 2

Voice Conversion

Voice-converted voices were evaluated in terms of naturalness and similarity. The questions we addressed were:

1. How natural does the voice-converted voice sound?

2. How similar does the voice-converted voice sound compared to the target speaker and to the source speaker?

SLIDE 3

Naturalness

How to make the task doable for listeners?
How to measure naturalness?

SLIDE 4

Amount of data…

5 target and 5 source speakers -> 25 voices per system.
17 participants + baseline: 25 * 18 = 450 voices!
Reduced source-target (ST) pairs from 25 to 16
16 * 18 = 288 voices + 4 source + 4 target = 296 stimuli

-> 50 minutes

It would take too long for a single listener to judge naturalness and similarity
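
The counts on this slide can be rechecked with a few lines of arithmetic. This is only a sketch: the variable names are made up for illustration, and the per-stimulus time assumes that the 50 minutes on the slide refers to one listener hearing all 296 stimuli.

```python
# Recompute the stimulus counts quoted on the slide.
N_SOURCE = 5
N_TARGET = 5
N_SYSTEMS = 17 + 1  # 17 participants + baseline

all_pairs = N_SOURCE * N_TARGET        # every source-target (ST) pair
all_voices = all_pairs * N_SYSTEMS     # one converted voice per pair and system

reduced_pairs = 16                     # reduced set of ST pairs
reduced_voices = reduced_pairs * N_SYSTEMS
stimuli = reduced_voices + 4 + 4       # plus 4 source and 4 target speakers

# Assumption: the 50 minutes covers all 296 stimuli for one listener.
seconds_per_stimulus = 50 * 60 / stimuli

print(all_voices, reduced_voices, stimuli, round(seconds_per_stimulus, 1))
```

Under that reading, each stimulus takes roughly ten seconds of listener time including the rating.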

SLIDE 5

Amount of data…

Instead of asking each listener to judge all ST pairs, how about just one single ST pair?

In terms of time this would be an excellent solution.

However, each listener would then only encounter one gender condition, and listeners needed to encounter the full range of gender conditions as ratings are context-sensitive.

SLIDE 6

Our solution…

Intermediate solution: the 16 ST pairs are split into two sets, and each listener hears the 8 source-target (ST) pairs of one set

Each set has two pairs from each gender condition, to make the two sets as comparable as possible.
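
One way such a balanced split could be produced is sketched below. The speaker labels are hypothetical, and the layout of 2 female and 2 male speakers on each side (giving 4 pairs per gender condition) is an assumption, as is the random assignment; the slides do not say how the actual pairs were chosen.

```python
import random
from collections import defaultdict

# Hypothetical labels; assumes 2 female + 2 male source and target speakers.
SOURCES = {"srcF1": "F", "srcF2": "F", "srcM1": "M", "srcM2": "M"}
TARGETS = {"tgtF1": "F", "tgtF2": "F", "tgtM1": "M", "tgtM2": "M"}


def split_into_sets(seed=0):
    """Split the 16 ST pairs into two sets of 8, two pairs per gender condition."""
    rng = random.Random(seed)
    by_condition = defaultdict(list)
    for src, src_gender in SOURCES.items():
        for tgt, tgt_gender in TARGETS.items():
            # Condition label is source gender followed by target gender, e.g. "MF".
            by_condition[src_gender + tgt_gender].append((src, tgt))

    set1, set2 = [], []
    for condition in sorted(by_condition):
        pairs = by_condition[condition]
        rng.shuffle(pairs)        # random assignment within the condition
        set1.extend(pairs[:2])    # two pairs of this condition go to set 1
        set2.extend(pairs[2:])    # the other two go to set 2
    return set1, set2


set1, set2 = split_into_sets()
```

Each listener would then be assigned one of the two sets, so everyone hears all four gender conditions.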

SLIDE 7

How to measure?

Standard MOS scale, as in the Blizzard Challenge, for naturalness
(1) totally unnatural to (5) completely natural
The subjects were instructed that the score should reflect their opinion of how natural or unnatural the sentence sounded
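
A mean opinion score is the average of the 1 to 5 ratings a system receives. A minimal sketch of that aggregation follows; the function name and the example ratings are invented for illustration and are not challenge data.

```python
from collections import defaultdict
from statistics import mean


def mos_per_system(ratings):
    """ratings: iterable of (system, score) pairs, score in 1..5.

    Returns a dict mapping each system to the mean of its scores.
    """
    scores = defaultdict(list)
    for system, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        scores[system].append(score)
    return {system: mean(values) for system, values in scores.items()}


# Made-up ratings, for illustration only.
example = [("A", 4), ("A", 2), ("B", 5), ("B", 4), ("B", 3)]
mos = mos_per_system(example)
```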

SLIDE 8

Listeners

Each set was rated by 100 subjects
Duration roughly 25 minutes
The order of stimuli was random
Each sentence was selected at random with replacement from a pool of 30 test sentences
Sentences > 5 sec or < 2 sec were removed for the listening tests (hence a pool of 30 rather than 54 sentences)
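
The sentence selection described on this slide could be sketched as below. The sentence IDs and durations are hypothetical, and the function names are assumptions made for the example.

```python
import random


def eligible_sentences(durations, min_sec=2.0, max_sec=5.0):
    """Keep sentence IDs whose duration is neither < min_sec nor > max_sec."""
    return [sid for sid, dur in durations.items() if min_sec <= dur <= max_sec]


def draw_sentences(pool, n_stimuli, seed=None):
    """Pick one sentence per stimulus, at random with replacement."""
    rng = random.Random(seed)
    return rng.choices(pool, k=n_stimuli)


# Hypothetical durations in seconds, for illustration only.
durations = {"s01": 1.5, "s02": 3.2, "s03": 5.4, "s04": 2.0}
pool = eligible_sentences(durations)
draws = draw_sentences(pool, n_stimuli=10, seed=1)
```

Sampling with replacement means the same sentence can occur for more than one stimulus within a listener's session.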

SLIDE 9

Similarity

Judging how similar voices are on a scale from 1 to 5 may not be all that meaningful.

Judging how similar two voices are is not part of everyday speech perception.

However, recognising speakers is something we do all the time.

-> Same/different paradigm

SLIDE 10

Similarity: exp set-up

Listeners were given pairs of stimuli and the instructions:

“Do you think these two samples could have been produced by the same speaker? Some of the samples may sound somewhat degraded/distorted. Please try to listen beyond the distortion and concentrate on identifying the voice. Are the two voices the same or different? You have the option to indicate how sure you are of your decision.”

SLIDE 11

Similarity: exp set-up

The scale for judging was:
Same: absolutely sure
Same: not sure
Different: not sure
Different: absolutely sure
VC stimuli compared to target speaker and to source speaker.
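
Responses on this four-point scale can be summarised as the percentage of judgements in each category, per system and per comparison (source or target). The sketch below shows one way to do that; the function name and the example responses are invented for illustration.

```python
from collections import Counter

CATEGORIES = [
    "Same: absolutely sure",
    "Same: not sure",
    "Different: not sure",
    "Different: absolutely sure",
]


def response_percentages(responses):
    """Return the percentage of responses in each of the four categories."""
    responses = list(responses)
    if not responses:
        raise ValueError("no responses given")
    counts = Counter(responses)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    total = len(responses)
    return {cat: 100.0 * counts[cat] / total for cat in CATEGORIES}


# Made-up responses for one system compared to the target speaker.
example = [
    "Same: absolutely sure",
    "Same: absolutely sure",
    "Same: not sure",
    "Different: absolutely sure",
]
percentages = response_percentages(example)
```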

SLIDE 12

Similarity: exp set-up

Each listener was given three ST pairs to judge: one within-gender, one cross-gender and one at random, ensuring all ST pairs were covered across listeners.

200 listeners

SLIDE 13

Results

Naturalness: MOS

SLIDE 14

[Figure: naturalness MOS per system. Axes: System, Score (1-5). System order as extracted: T N K J L O P G F A B Q E H D M I B_ C]

SLIDE 15

[Figure: naturalness MOS per system, score 1-5, one panel per set. System order as extracted:]
Set 1: S T N K J O L P G F Q B A E H D M I B_ C
Set 2: S T N K J L O P G F A B Q E H D M I B_ C

SLIDE 16

Significance

[Figure: significance groupings of systems for All ST pairs, Set 1 and Set 2, shown with the Set 1 and Set 2 naturalness plots; the groupings are not recoverable from the extracted text]

SLIDE 17

[Figure: naturalness MOS per system, score 1-5, four panels. System order as extracted:]
T N K O L J P A H F Q B G E D M B_ I C
S N J K L P Q G F B O H E A D M I B_ C
T K G L O F J P N A E B D Q M H B_ C I
T O K N J G L B P F A Q E D M H I B_ C

SLIDE 18

[Figure: naturalness MOS per system, score 1-5, one panel per gender condition. System order as extracted:]
MM: T N K O L J P A H F Q B G E D M B_ I C
FF: S N J K L P Q G F B O H E A D M I B_ C
MF: T K G L O F J P N A E B D Q M H B_ C I
FM: T O K N J G L B P F A Q E D M H I B_ C

[Figure: significance groupings of systems for MM, MF, FM and FF; the groupings are not recoverable from the extracted text]

SLIDE 19

Results

Similarity: Same-Different

SLIDE 20

[Figure: same/different responses per system as stacked percentage bars (axis up to 100). Categories: Different: absolutely sure, Different: not sure, Same: not sure, Same: absolutely sure. Panels: Source, Target]

SLIDE 21

VCC - evaluation

Such a large evaluation is complex; compromises are inevitable.

Two sets of source-target pairs for naturalness ratings is not ideal.

Including comparisons to source as well as target was informative.

SLIDE 22

VCC data set

Database (training and test samples)
Participants’ submissions
Listening test materials
Available at:

http://dx.doi.org/10.7488/ds/1430