Analysis of the Voice Conversion Challenge 2016 Evaluation Results




SLIDE 1

CSTR, The University of Edinburgh

Analysis of the Voice Conversion Challenge 2016 Evaluation Results

Mirjam Wester, Zhizheng Wu & Junichi Yamagishi

SLIDE 2

Voice Conversion

Voice-converted voices were evaluated in terms of naturalness and similarity. The questions we addressed were:

  • 1. How natural does the voice-converted voice sound?
  • 2. How similar does the voice-converted voice sound compared to the target speaker and to the source speaker?

SLIDE 3

Naturalness

  • How to make the task doable for listeners?
  • How to measure naturalness?
SLIDE 4

Amount of data…

  • 5 target and 5 source speakers -> 25 voices.
  • 17 participants + baseline: 25 * 18 = 450 voices!
  • Reduced source-target (ST) pairs from 25 to 16
  • 288 voices + 4 source + 4 target = 296 stimuli -> 50 minutes
  • It would take too long for a single listener to judge naturalness and similarity
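The counts on this slide can be checked with a few lines of arithmetic. The sketch below only restates the slide's numbers; the variable names are illustrative, and the per-stimulus time is simply what 50 minutes over 296 stimuli implies, not a figure given in the slides.

```python
# Stimulus counts from the slide; variable names are illustrative.
n_source, n_target = 5, 5
n_systems = 17 + 1                      # 17 participants + baseline

all_pairs = n_source * n_target         # 25 source-target (ST) pairs
all_voices = all_pairs * n_systems      # 450 voices

reduced_pairs = 16
converted = reduced_pairs * n_systems   # 288 converted voices
stimuli = converted + 4 + 4             # plus 4 source and 4 target voices

print(all_pairs, all_voices, converted, stimuli)  # 25 450 288 296

# Implied listening time per stimulus if all 296 take 50 minutes.
seconds_per_stimulus = 50 * 60 / stimuli
print(f"{seconds_per_stimulus:.1f} s per stimulus")
```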

SLIDE 5

Amount of data…

  • Instead of asking each listener to judge all ST pairs, how about just one single ST pair?
  • In terms of time this would be an excellent solution.
  • However, each listener would then only encounter one gender condition, and listeners needed to encounter the full range of gender conditions as ratings are context-sensitive.

SLIDE 6

Our solution…

  • Intermediate solution: each listener hears 8 source-target (ST) pairs
  • Two from each gender condition, to make the two sets as comparable as possible.
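One way to picture this split is sketched below. It assumes four gender conditions (MM, FF, MF, FM) with four ST pairs each, which is consistent with 16 pairs in total but is not stated on the slide. The pair labels are hypothetical, and the random assignment is an assumption too, since the slides do not say how pairs were allocated to the two sets.

```python
import random

# Hypothetical ST pairs: 4 per gender condition (an assumption; the slides
# only give 16 pairs in total and 2 per condition in each set).
pairs_by_condition = {
    cond: [f"{cond}-pair{i}" for i in range(1, 5)]
    for cond in ("MM", "FF", "MF", "FM")
}

def split_into_sets(pairs_by_condition, per_set=2, seed=0):
    """Split every condition's pairs across two sets, per_set pairs each."""
    rng = random.Random(seed)
    set1, set2 = [], []
    for pairs in pairs_by_condition.values():
        shuffled = pairs[:]          # copy so the input is left untouched
        rng.shuffle(shuffled)
        set1.extend(shuffled[:per_set])
        set2.extend(shuffled[per_set:2 * per_set])
    return set1, set2

set1, set2 = split_into_sets(pairs_by_condition)
print(len(set1), len(set2))  # 8 8
```

Each set then contains two pairs from every gender condition, so a listener assigned to either set hears the full range of conditions.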

SLIDE 7

How to measure?

  • Standard MOS, as in the Blizzard Challenge, for naturalness
  • (1) totally unnatural to (5) completely natural
  • The subjects were instructed that the score should reflect their opinion of how natural or unnatural the sentence sounded
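A mean opinion score is just the average of the 1 to 5 ratings collected for a system. The sketch below uses made-up ratings and hypothetical system names (`sys1` to `sys3`); it is not the challenge data or the authors' analysis script.

```python
from statistics import mean

# Hypothetical ratings on the 1-5 naturalness scale (made-up values).
ratings = {
    "sys1": [3, 4, 2, 3],
    "sys2": [4, 5, 4, 4],
    "sys3": [1, 2, 2, 1],
}

def mos(scores):
    """Mean opinion score for one system; rejects out-of-scale ratings."""
    if not scores or any(not 1 <= s <= 5 for s in scores):
        raise ValueError("ratings must be non-empty and lie between 1 and 5")
    return mean(scores)

mos_by_system = {name: mos(scores) for name, scores in ratings.items()}

# Systems ordered from most to least natural.
ranked = sorted(mos_by_system, key=mos_by_system.get, reverse=True)
print(ranked)  # ['sys2', 'sys1', 'sys3']
```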

SLIDE 8

Listeners

  • Each set was rated by 100 subjects
  • Duration roughly 25 minutes
  • The order of stimuli was random
  • Each sentence selected at random with replacement from a pool of 30 test sentences
  • Sentences > 5 sec or < 2 sec were removed for the listening tests (hence 30 rather than 54 sentences)
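The filtering and sampling steps above could look roughly like this. The sentence identifiers and durations are invented for illustration, so the filtered pool here will not come out at exactly 30 sentences; only the 2 to 5 second limits and the sampling with replacement come from the slide.

```python
import random

rng = random.Random(0)

# Hypothetical durations in seconds for 54 test sentences (made-up values).
durations = {f"sent{i:02d}": rng.uniform(1.0, 7.0) for i in range(1, 55)}

# Drop sentences shorter than 2 s or longer than 5 s.
pool = [sent for sent, dur in durations.items() if 2.0 <= dur <= 5.0]

# For each stimulus, draw a sentence at random WITH replacement,
# so the same sentence may be chosen more than once.
n_stimuli = 10
chosen = rng.choices(pool, k=n_stimuli)
print(len(pool), chosen[:3])
```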

SLIDE 9

Similarity

  • Judging how similar voices are on a scale from 1 to 5 may not be all that meaningful.
  • Judging how similar two voices are is not part of everyday speech perception.
  • However, recognising speakers is something we do all the time.
  • -> Same/different paradigm
SLIDE 10

Similarity: exp set-up

  • Listeners were given pairs of stimuli and the instructions:
  • “Do you think these two samples could have been produced by the same speaker? Some of the samples may sound somewhat degraded/distorted. Please try to listen beyond the distortion and concentrate on identifying the voice. Are the two voices the same or different? You have the option to indicate how sure you are of your decision.”

SLIDE 11

Similarity: exp set-up

  • The scale for judging was:
  • Same: absolutely sure
  • Same: not sure
  • Different: not sure
  • Different: absolutely sure
  • VC stimuli were compared to the target speaker and to the source speaker.
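Responses on this four-point scale can be summarised as the percentage of judgements in each category. The sketch below uses made-up responses. Collapsing the two "Same" categories into a single percentage is an assumption for illustration, not something the slides describe.

```python
from collections import Counter

# The four response categories from the slide.
SCALE = [
    "Same: absolutely sure",
    "Same: not sure",
    "Different: not sure",
    "Different: absolutely sure",
]

# Hypothetical responses for one system compared to the target speaker.
responses = (
    ["Same: absolutely sure"] * 2
    + ["Same: not sure"] * 2
    + ["Different: not sure"] * 1
    + ["Different: absolutely sure"] * 3
)

def response_percentages(responses):
    """Percentage of responses falling in each scale category."""
    if not responses:
        raise ValueError("no responses given")
    counts = Counter(responses)
    return {label: 100.0 * counts[label] / len(responses) for label in SCALE}

pct = response_percentages(responses)

# Assumed summary: treat both "Same" categories as a "same speaker" vote.
percent_same = pct["Same: absolutely sure"] + pct["Same: not sure"]
print(pct, percent_same)
```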

SLIDE 12

Similarity: exp set-up

  • Each listener was given three ST pairs to judge: one within-gender, one cross-gender and one at random, ensuring all ST pairs were covered across listeners.

  • 200 listeners
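A possible allocation scheme is sketched below. It assumes 8 within-gender and 8 cross-gender ST pairs with hypothetical labels, and it uses round-robin selection to guarantee coverage. The slides do not describe the actual procedure, so this is only one way to meet the stated constraints.

```python
import random

# Hypothetical ST pairs: 8 within-gender and 8 cross-gender (an assumption).
within = [f"within{i}" for i in range(1, 9)]
cross = [f"cross{i}" for i in range(1, 9)]

def assign_pairs(listener_index, rng):
    """Pick one within-gender, one cross-gender and one random extra pair."""
    w = within[listener_index % len(within)]   # round-robin gives coverage
    c = cross[listener_index % len(cross)]
    remaining = [p for p in within + cross if p not in (w, c)]
    return [w, c, rng.choice(remaining)]

rng = random.Random(0)
assignments = [assign_pairs(i, rng) for i in range(200)]   # 200 listeners

covered = {pair for triple in assignments for pair in triple}
print(len(covered))  # 16
```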
SLIDE 13

Results

  • Naturalness - MOS
SLIDE 14
[Figure: naturalness MOS per system. x-axis: System (S T N K J L O P G F A B Q E H D M I B_ C); y-axis: Score, 1 to 5]

SLIDE 15

[Figure: naturalness MOS per system, two panels; score axis 1 to 5]

Set 1: S T N K J O L P G F Q B A E H D M I B_ C

Set 2: S T N K J L O P G F A B Q E H D M I B_ C

SLIDE 16

Significance

[Figure: significance results for the naturalness scores. Panels: All ST pairs, Set 1, Set 2; shown next to the Set 1 and Set 2 MOS plots]

SLIDE 17
[Figure: naturalness MOS per system by gender condition, four panels; score axis 1 to 5]

MM: S T N K O L J P A H F Q B G E D M B_ I C

FF: T S N J K L P Q G F B O H E A D M I B_ C

MF: S T K G L O F J P N A E B D Q M H B_ C I

FM: S T O K N J G L B P F A Q E D M H I B_ C

SLIDE 18
[Figure: the four gender-condition MOS plots (MM, FF, MF, FM), shown with the corresponding significance results for each condition]

SLIDE 19

Results

  • Similarity: Same-Different
SLIDE 20

[Figure: same/different similarity judgements per system, two panels: Source and Target. Response categories: Different: absolutely sure, Different: not sure, Same: not sure, Same: absolutely sure; axis ticks 20 to 100]

SLIDE 21

VCC - evaluation

  • Such a large evaluation is complex; compromises are inevitable.
  • Two sets of source-target pairs for naturalness ratings is not ideal.
  • Including comparisons to source as well as target was informative.

SLIDE 22

VCC data set

  • Database (training and test samples)
  • Participants’ submissions
  • Listening test materials
  • Available at:

http://dx.doi.org/10.7488/ds/1430
