AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

Oct 07, 2022

SLIDE 1

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

Kaizhi Qian* (1), Yang Zhang* (2,3), Shiyu Chang (2,3), Xuesong Yang (1), Mark Hasegawa-Johnson (1)

(1) University of Illinois at Urbana-Champaign
(2) MIT-IBM Watson AI Lab
(3) IBM Research Cambridge

SLIDE 2

Motivation

Voice conversion aims to modify source speech so that it sounds as if it were uttered by another speaker.

SLIDE 3

Motivation

Voice conversion aims to modify source speech so that it sounds as if it were uttered by another speaker.

Existing voice style transfer techniques:

- Use complex architectures and training schemes, but do not work well for speech
- Only convert between seen speakers

SLIDE 4

AutoVC

[Diagram, training stage: content encoder, speaker encoder, decoder]

SLIDE 5

AutoVC

[Diagram, training stage: content encoder ($X_1 \to C_1$), speaker encoder, decoder]

SLIDE 6

AutoVC

[Diagram, training stage: content encoder ($X_1 \to C_1$), speaker encoder ($X_1 \to S_1$), decoder]

SLIDE 7

AutoVC

[Diagram, training stage: content encoder ($X_1 \to C_1$), speaker encoder ($X_1 \to S_1$), decoder ($C_1, S_1 \to \hat{X}_{1\to1}$)]

SLIDE 8

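AutoVC

Train only on self-reconstruction loss:

$$\mathbb{E}\left[ \lVert \hat{X}_{1\to1} - X_1 \rVert_2^2 + \lambda \lVert \hat{C}_{1\to1} - C_1 \rVert_1 \right]$$

[Diagram, training stage: content encoder ($X_1 \to C_1$), speaker encoder ($X_1 \to S_1$), decoder ($C_1, S_1 \to \hat{X}_{1\to1}$)]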
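
As a rough illustration of the self-reconstruction loss on this slide, the sketch below evaluates it for a single utterance: a squared L2 term between the reconstruction and the input, plus a weighted L1 term between the content code of the reconstruction and the content code of the input. The function names, the flattened-list representation, and the default weight `lam=1.0` are assumptions made for the example, not details taken from the slides; the expectation in the formula would correspond to averaging this value over training utterances.

```python
def l2_squared(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def l1(a, b):
    """L1 distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def self_reconstruction_loss(x, x_hat, c, c_hat, lam=1.0):
    """Per-utterance loss: ||x_hat - x||_2^2 + lam * ||c_hat - c||_1.

    x      input features (flattened)
    x_hat  reconstruction of x produced by the decoder
    c      content code of x
    c_hat  content code obtained by re-encoding x_hat
    lam    weight of the content term (hypothetical default)
    """
    return l2_squared(x_hat, x) + lam * l1(c_hat, c)


# Toy numbers, chosen only to make the arithmetic easy to follow.
x, x_hat = [0.0, 1.0, 2.0], [0.5, 1.0, 1.0]
c, c_hat = [1.0, -1.0], [1.0, 0.0]
loss = self_reconstruction_loss(x, x_hat, c, c_hat, lam=1.0)
print(loss)
```

Note that no term in this loss involves a second speaker: training only ever compares an utterance with its own reconstruction.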

SLIDE 9

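AutoVC

Train only on self-reconstruction loss:

$$\mathbb{E}\left[ \lVert \hat{X}_{1\to1} - X_1 \rVert_2^2 + \lambda \lVert \hat{C}_{1\to1} - C_1 \rVert_1 \right]$$

[Diagram, training stage: content encoder ($X_1 \to C_1$), speaker encoder ($X_1 \to S_1$), decoder ($C_1, S_1 \to \hat{X}_{1\to1}$); the reconstruction is re-encoded to give $\hat{C}_{1\to1}$]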

SLIDE 10

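AutoVC

Train only on self-reconstruction loss:

$$\mathbb{E}\left[ \lVert \hat{X}_{1\to1} - X_1 \rVert_2^2 + \lambda \lVert \hat{C}_{1\to1} - C_1 \rVert_1 \right]$$

[Diagram, training stage: content encoder ($X_1 \to C_1$), speaker encoder ($X_1 \to S_1$), decoder ($C_1, S_1 \to \hat{X}_{1\to1}$); the reconstruction is re-encoded to give $\hat{C}_{1\to1}$]

[Diagram, conversion stage: content encoder ($X_1 \to C_1$), speaker encoder, decoder]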

SLIDE 11

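AutoVC

Train only on self-reconstruction loss:

$$\mathbb{E}\left[ \lVert \hat{X}_{1\to1} - X_1 \rVert_2^2 + \lambda \lVert \hat{C}_{1\to1} - C_1 \rVert_1 \right]$$

[Diagram, training stage: content encoder ($X_1 \to C_1$), speaker encoder ($X_1 \to S_1$), decoder ($C_1, S_1 \to \hat{X}_{1\to1}$); the reconstruction is re-encoded to give $\hat{C}_{1\to1}$]

[Diagram, conversion stage: content encoder ($X_1 \to C_1$), speaker encoder ($X_2 \to S_2$), decoder]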

SLIDE 12

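AutoVC

Train only on self-reconstruction loss:

$$\mathbb{E}\left[ \lVert \hat{X}_{1\to1} - X_1 \rVert_2^2 + \lambda \lVert \hat{C}_{1\to1} - C_1 \rVert_1 \right]$$

[Diagram, training stage: content encoder ($X_1 \to C_1$), speaker encoder ($X_1 \to S_1$), decoder ($C_1, S_1 \to \hat{X}_{1\to1}$); the reconstruction is re-encoded to give $\hat{C}_{1\to1}$]

[Diagram, conversion stage: content encoder ($X_1 \to C_1$), speaker encoder ($X_2 \to S_2$), decoder ($C_1, S_2 \to \hat{X}_{1\to2}$)]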
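
The difference between the training and conversion stages comes down to which utterance feeds the speaker encoder. The sketch below illustrates that data flow with toy stand-ins: an "utterance" is a dictionary with a `words` field and a `voice` field, and the three components are trivial functions. All names here are hypothetical; real encoders and decoders are neural networks operating on spectrograms.

```python
def run_autovc(content_encoder, speaker_encoder, decoder, content_utt, speaker_utt):
    """Encode content from one utterance, speaker identity from another, then decode."""
    content_code = content_encoder(content_utt)
    speaker_embedding = speaker_encoder(speaker_utt)
    return decoder(content_code, speaker_embedding)


# Toy stand-ins for the three components (assumptions for illustration only).
def toy_content_encoder(utt):
    return utt["words"]


def toy_speaker_encoder(utt):
    return utt["voice"]


def toy_decoder(content_code, speaker_embedding):
    return {"words": content_code, "voice": speaker_embedding}


x1 = {"words": "hello", "voice": "speaker_1"}
x2 = {"words": "goodbye", "voice": "speaker_2"}

# Training: content and speaker come from the same utterance (self-reconstruction).
reconstructed = run_autovc(toy_content_encoder, toy_speaker_encoder, toy_decoder, x1, x1)

# Conversion: content from the source utterance, speaker from a target utterance.
converted = run_autovc(toy_content_encoder, toy_speaker_encoder, toy_decoder, x1, x2)

print(reconstructed)
print(converted)
```

The same three components are used in both stages; only the second input changes.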

SLIDE 13

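AutoVC

Train only on self-reconstruction loss:

$$\mathbb{E}\left[ \lVert \hat{X}_{1\to1} - X_1 \rVert_2^2 + \lambda \lVert \hat{C}_{1\to1} - C_1 \rVert_1 \right]$$

With bottleneck tuning, AutoVC can match the distribution!

[Diagram, training stage: content encoder ($X_1 \to C_1$), speaker encoder ($X_1 \to S_1$), decoder ($C_1, S_1 \to \hat{X}_{1\to1}$); the reconstruction is re-encoded to give $\hat{C}_{1\to1}$]

[Diagram, conversion stage: content encoder ($X_1 \to C_1$), speaker encoder ($X_2 \to S_2$), decoder ($C_1, S_2 \to \hat{X}_{1\to2}$)]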
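
The slide does not say how the bottleneck is implemented or tuned. One plausible knob, sketched below purely as an assumption, is temporal subsampling of the content code: keep every `factor`-th frame, then copy each kept frame back to the original length before decoding. A larger `factor` means a narrower bottleneck and less room for speaker information to leak through the content code. The function names and the `factor` parameter are hypothetical.

```python
def downsample(frames, factor):
    """Keep every `factor`-th frame of the content code."""
    return frames[::factor]


def upsample(kept_frames, factor, length):
    """Repeat each kept frame `factor` times, then trim to the original length."""
    restored = []
    for frame in kept_frames:
        restored.extend([frame] * factor)
    return restored[:length]


# Toy content code with one scalar per frame.
frames = [1, 2, 3, 4, 5, 6, 7]
kept = downsample(frames, factor=3)
restored = upsample(kept, factor=3, length=len(frames))
print(kept)
print(restored)
```

With `factor=1` the code passes through unchanged, which corresponds to no temporal bottleneck at all.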

SLIDE 14

Speaker encoder is pretrained

AutoVC

[Diagram, conversion stage: content encoder ($X_1 \to C_1$), speaker encoder ($X_2 \to S_2$), decoder ($C_1, S_2 \to \hat{X}_{1\to2}$)]

SLIDE 15

Speaker encoder is pretrained
Can generalize to unseen speakers: zero-shot conversion

AutoVC

[Diagram, conversion stage: content encoder ($X_1 \to C_1$), speaker encoder ($X_2 \to S_2$), decoder ($C_1, S_2 \to \hat{X}_{1\to2}$); the speakers may be seen OR unseen]
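
To see why a pretrained speaker encoder allows zero-shot conversion, compare it with a one-hot speaker identity. A one-hot code only exists for speakers in the training set, whereas an encoder computes an embedding from any utterance. The sketch below is a toy illustration under stated assumptions: `SEEN_SPEAKERS`, `one_hot_identity`, and `toy_speaker_embedding` are hypothetical names, and the "embedding" is just a per-dimension mean over frames, not a real speaker-verification model.

```python
SEEN_SPEAKERS = ["spk_a", "spk_b"]  # hypothetical training-set speakers


def one_hot_identity(speaker_id):
    """One-hot code; defined only for speakers seen during training."""
    code = [0.0] * len(SEEN_SPEAKERS)
    code[SEEN_SPEAKERS.index(speaker_id)] = 1.0  # raises ValueError if unseen
    return code


def toy_speaker_embedding(frames):
    """Stand-in for a pretrained speaker encoder: per-dimension mean over frames."""
    n_frames = len(frames)
    n_dims = len(frames[0])
    return [sum(frame[d] for frame in frames) / n_frames for d in range(n_dims)]


print(one_hot_identity("spk_b"))

try:
    one_hot_identity("spk_new")
except ValueError:
    print("one-hot code is undefined for an unseen speaker")

# An embedding can be computed from any utterance, seen speaker or not.
unseen_utterance = [[1.0, 2.0], [3.0, 4.0]]
print(toy_speaker_embedding(unseen_utterance))
```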

SLIDE 16

Conversion Between Seen Speakers

[Bar charts: MOS and Similarity (axis ticks 1 to 4) for M2M, F2F, M2F, and F2M conversion. Systems compared: AutoVC, AutoVC-one-hot, StarGAN (1), Chou et al. (2)]

[Audio samples: Source, Target, Converted]

(1) StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks
(2) Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations

SLIDE 17

Conversion Between Unseen Speakers

The first zero-shot voice conversion framework

[Bar charts: MOS and Similarity (axis ticks 1 to 4) for M2M, F2F, M2F, and F2M conversion. Conditions compared: seen to seen, seen to unseen, unseen to seen, unseen to unseen]

[Audio samples: Source, Target, Converted]

SLIDE 18

Take Away

- An autoencoder is all you need to achieve theoretically ideal voice conversion

SLIDE 19

Take Away

- An autoencoder is all you need to achieve theoretically ideal voice conversion
- AutoVC generalizes well to unseen speakers

SLIDE 20

Thank you!

Poster #225