AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss



SLIDE 1

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

Kaizhi Qian*¹, Yang Zhang*²,³, Shiyu Chang²,³, Xuesong Yang¹, Mark Hasegawa-Johnson¹

¹University of Illinois at Urbana-Champaign
²MIT-IBM Watson AI Lab
³IBM Research Cambridge

SLIDE 2

Motivation

  • Voice conversion aims to modify source speech so that it sounds as if it were uttered by another speaker.
SLIDE 3

Motivation

  • Voice conversion aims to modify source speech so that it sounds as if it were uttered by another speaker.

  • Existing voice style transfer techniques:
      ➢ Use complex architectures and training schemes but do not work well for speech
      ➢ Only convert between seen speakers

SLIDE 4

AutoVC

[Diagram, training: content encoder, speaker encoder, decoder.]

SLIDE 5

AutoVC

[Diagram, training: the content encoder maps $X_1$ to $C_1$. Also shown: speaker encoder, decoder.]

SLIDE 6

AutoVC

[Diagram, training: the content encoder maps $X_1$ to $C_1$; the speaker encoder maps $X_1$ to $S_1$. Also shown: decoder.]

SLIDE 7

AutoVC

[Diagram, training: the content encoder maps $X_1$ to $C_1$; the speaker encoder maps $X_1$ to $S_1$; the decoder maps $(C_1, S_1)$ to $\hat{X}_{1\to 1}$.]

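SLIDE 8
  • Train only on self-reconstruction. Loss:

$$\mathbb{E}\left[ \lVert \hat{X}_{1\to 1} - X_1 \rVert_2^2 + \lambda \lVert \tilde{C}_{1\to 1} - C_1 \rVert_1 \right]$$

AutoVC

[Diagram, training: the content encoder maps $X_1$ to $C_1$; the speaker encoder maps $X_1$ to $S_1$; the decoder maps $(C_1, S_1)$ to $\hat{X}_{1\to 1}$, which is compared with $X_1$.]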
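The loss on this slide can be read as a squared L2 distance between the reconstruction and the input, plus a weighted L1 distance between the content code of the reconstruction and the content code of the input. The following is a minimal sketch of that arithmetic for a single utterance, assuming features are stored as lists of frames. The function names and the default weight `lam=1.0` are illustrative assumptions, not values from the slides, and the expectation would in practice be approximated by averaging over a training batch.

```python
from typing import List

Frames = List[List[float]]  # time-major layout: frames[t][k]


def squared_l2(a: Frames, b: Frames) -> float:
    """Sum of squared differences over all frames and channels."""
    return sum((x - y) ** 2 for fa, fb in zip(a, b) for x, y in zip(fa, fb))


def l1(a: Frames, b: Frames) -> float:
    """Sum of absolute differences over all frames and channels."""
    return sum(abs(x - y) for fa, fb in zip(a, b) for x, y in zip(fa, fb))


def autoencoder_loss(
    x1: Frames,      # input features X_1
    x1_hat: Frames,  # reconstruction of X_1 produced by the decoder
    c1: Frames,      # content code C_1 of the input
    c1_rec: Frames,  # content code of the reconstruction
    lam: float = 1.0,  # weight of the content term (assumed value)
) -> float:
    """Reconstruction term plus lam times the content-code term."""
    return squared_l2(x1_hat, x1) + lam * l1(c1_rec, c1)
```

When the reconstruction is perfect, both terms vanish and the loss is zero. Any mismatch in either the features or the content codes makes it positive.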

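SLIDE 9
  • Train only on self-reconstruction. Loss:

$$\mathbb{E}\left[ \lVert \hat{X}_{1\to 1} - X_1 \rVert_2^2 + \lambda \lVert \tilde{C}_{1\to 1} - C_1 \rVert_1 \right]$$

AutoVC

[Diagram, training: the content encoder maps $X_1$ to $C_1$; the speaker encoder maps $X_1$ to $S_1$; the decoder maps $(C_1, S_1)$ to $\hat{X}_{1\to 1}$, which is compared with $X_1$ and re-encoded to $\tilde{C}_{1\to 1}$.]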

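SLIDE 10
  • Train only on self-reconstruction. Loss:

$$\mathbb{E}\left[ \lVert \hat{X}_{1\to 1} - X_1 \rVert_2^2 + \lambda \lVert \tilde{C}_{1\to 1} - C_1 \rVert_1 \right]$$

AutoVC

[Diagram, training: the content encoder maps $X_1$ to $C_1$; the speaker encoder maps $X_1$ to $S_1$; the decoder maps $(C_1, S_1)$ to $\hat{X}_{1\to 1}$, which is compared with $X_1$ and re-encoded to $\tilde{C}_{1\to 1}$.]

[Diagram, conversion: the content encoder maps $X_1$ to $C_1$. Also shown: speaker encoder, decoder.]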

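SLIDE 11
  • Train only on self-reconstruction. Loss:

$$\mathbb{E}\left[ \lVert \hat{X}_{1\to 1} - X_1 \rVert_2^2 + \lambda \lVert \tilde{C}_{1\to 1} - C_1 \rVert_1 \right]$$

AutoVC

[Diagram, training: the content encoder maps $X_1$ to $C_1$; the speaker encoder maps $X_1$ to $S_1$; the decoder maps $(C_1, S_1)$ to $\hat{X}_{1\to 1}$, which is compared with $X_1$ and re-encoded to $\tilde{C}_{1\to 1}$.]

[Diagram, conversion: the content encoder maps the source $X_1$ to $C_1$; the speaker encoder maps the target $X_2$ to $S_2$. Also shown: decoder.]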

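SLIDE 12
  • Train only on self-reconstruction. Loss:

$$\mathbb{E}\left[ \lVert \hat{X}_{1\to 1} - X_1 \rVert_2^2 + \lambda \lVert \tilde{C}_{1\to 1} - C_1 \rVert_1 \right]$$

AutoVC

[Diagram, training: the content encoder maps $X_1$ to $C_1$; the speaker encoder maps $X_1$ to $S_1$; the decoder maps $(C_1, S_1)$ to $\hat{X}_{1\to 1}$, which is compared with $X_1$ and re-encoded to $\tilde{C}_{1\to 1}$.]

[Diagram, conversion: the content encoder maps the source $X_1$ to $C_1$; the speaker encoder maps the target $X_2$ to $S_2$; the decoder maps $(C_1, S_2)$ to $\hat{X}_{1\to 2}$.]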
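The training and conversion paths differ only in which utterance feeds the speaker encoder. The sketch below illustrates that data flow with deliberately trivial stand-ins: each "utterance" is a list of numbers, the utterance mean plays the role of the speaker embedding, and the mean-removed frames play the role of the content code. These stand-ins are assumptions made purely for illustration. The real modules are neural networks, and none of these function bodies come from the slides.

```python
from typing import List

Utterance = List[float]  # toy 1-D "spectrogram": one value per frame


def speaker_encoder(x: Utterance) -> float:
    """Toy stand-in: the utterance mean acts as the speaker embedding S."""
    return sum(x) / len(x)


def content_encoder(x: Utterance) -> List[float]:
    """Toy stand-in: the mean-removed frames act as the content code C."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def decoder(content: List[float], speaker: float) -> Utterance:
    """Toy stand-in: recombine a content code with a speaker embedding."""
    return [c + speaker for c in content]


def convert(source: Utterance, target: Utterance) -> Utterance:
    """Content comes from `source`, speaker identity from `target`."""
    return decoder(content_encoder(source), speaker_encoder(target))
```

Calling `convert(x, x)` follows the training path (self-reconstruction), while `convert(x1, x2)` follows the conversion path. In this toy setting the first call returns `x` unchanged and the second keeps the frame-to-frame pattern of `x1` while shifting it to the mean of `x2`.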

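SLIDE 13
  • Train only on self-reconstruction. Loss:

$$\mathbb{E}\left[ \lVert \hat{X}_{1\to 1} - X_1 \rVert_2^2 + \lambda \lVert \tilde{C}_{1\to 1} - C_1 \rVert_1 \right]$$

  • With bottleneck tuning, AutoVC's converted speech can match the distribution of real speech from the target speaker!

AutoVC

[Diagram, training: the content encoder maps $X_1$ to $C_1$; the speaker encoder maps $X_1$ to $S_1$; the decoder maps $(C_1, S_1)$ to $\hat{X}_{1\to 1}$, which is compared with $X_1$ and re-encoded to $\tilde{C}_{1\to 1}$.]

[Diagram, conversion: the content encoder maps the source $X_1$ to $C_1$; the speaker encoder maps the target $X_2$ to $S_2$; the decoder maps $(C_1, S_2)$ to $\hat{X}_{1\to 2}$.]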
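The slides do not say which knobs "bottleneck tuning" adjusts. One plausible reading, assumed here, is that the size of the content code is restricted in two ways: fewer channels per frame and fewer frames over time. The sketch below shows those two restrictions on a list-of-frames code. The names `narrow` and `widen` are hypothetical helpers, not part of any AutoVC code.

```python
from typing import List

Frames = List[List[float]]  # time-major layout: frames[t][k]


def narrow(code: Frames, dim: int, stride: int) -> Frames:
    """Keep the first `dim` channels of every `stride`-th frame."""
    return [frame[:dim] for frame in code[::stride]]


def widen(code: Frames, stride: int, length: int) -> Frames:
    """Repeat each kept frame `stride` times, then trim to `length` frames."""
    repeated = [list(frame) for frame in code for _ in range(stride)]
    return repeated[:length]
```

A smaller `dim` or a larger `stride` leaves less room in the code. The intuition is that a code that is too wide can leak source-speaker information, while one that is too narrow loses content, so the setting has to be tuned between the two.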

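SLIDE 14
  • Speaker encoder is pretrained

AutoVC

[Diagram, conversion: the content encoder maps the source $X_1$ to $C_1$; the speaker encoder maps the target $X_2$ to $S_2$; the decoder maps $(C_1, S_2)$ to $\hat{X}_{1\to 2}$.]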

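SLIDE 15
  • Speaker encoder is pretrained
  • Can generalize to unseen speakers – zero-shot conversion

AutoVC

[Diagram, conversion: the content encoder maps the source $X_1$ to $C_1$; the speaker encoder maps the target $X_2$ to $S_2$; the decoder maps $(C_1, S_2)$ to $\hat{X}_{1\to 2}$. Annotation: Seen OR Unseen.]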
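Zero-shot conversion is possible because the decoder is conditioned on an embedding computed from audio rather than on a speaker index. The sketch below contrasts the two under simplifying assumptions: `one_hot_embedding` is a lookup table that only covers training speakers, and `encoder_embedding` is a toy stand-in for a frozen, pretrained speaker encoder that averages the frames of a reference utterance and L2-normalizes the result. Both functions are hypothetical illustrations, not the actual speaker encoder.

```python
import math
from typing import Dict, List

Frames = List[List[float]]  # time-major layout: frames[t][k]


def one_hot_embedding(speaker_id: str, table: Dict[str, List[float]]) -> List[float]:
    """Lookup-table speaker code: defined only for speakers seen in training."""
    return table[speaker_id]  # raises KeyError for an unseen speaker


def encoder_embedding(reference: Frames) -> List[float]:
    """Toy stand-in for a frozen, pretrained speaker encoder.

    Averages the frames of one reference utterance and L2-normalizes the
    result, so any speaker with a reference utterance gets an embedding.
    """
    dim = len(reference[0])
    mean = [sum(frame[k] for frame in reference) / len(reference) for k in range(dim)]
    norm = math.sqrt(sum(v * v for v in mean)) or 1.0  # avoid dividing by zero
    return [v / norm for v in mean]
```

With the lookup table, a speaker outside the training set has no code at all. With the encoder, a single reference utterance from that speaker is enough to produce the embedding that the decoder consumes.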

SLIDE 16

Conversion Between Seen Speakers

[Bar charts: MOS and Similarity, axis ticks 1 to 4, over the gender pairings M2M, F2F, M2F, F2M. Systems compared: AutoVC, AutoVC-one-hot, StarGAN¹, Chou et al.²]

[Labels: Source, Target, Converted]

¹StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks
²Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations

SLIDE 17
  • The first zero-shot voice conversion framework

Conversion Between Unseen Speakers

[Bar charts: MOS and Similarity, axis ticks 1 to 4, over the gender pairings M2M, F2F, M2F, F2M. Conditions compared: seen to seen, seen to unseen, unseen to seen, unseen to unseen.]

[Labels: Source, Target, Converted]

SLIDE 18

Take Away

  • Autoencoder is all you need to achieve theoretically ideal voice conversion

SLIDE 19

Take Away

  • Autoencoder is all you need to achieve theoretically ideal voice conversion

  • AutoVC generalizes well to unseen speakers
SLIDE 20

Thank you!

Poster #225