One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization
Ju-Chieh Chou, Hung-yi Lee, Interspeech 2019.
Outline
- 1. Introduction
- 2. Proposed Approach
- Model
- Experiments
- 3. Conclusion
Voice conversion
- Change the characteristics of an utterance while keeping the
language content the same.
- Characteristic: accent, speaker identity, emotion…
- This work focuses on speaker identity conversion.
[Figure: a model converts the utterance "How are you" from Speaker 1 into Speaker A's voice.]
Conventional: supervised VC with parallel data
- The same sentences, recorded as different signals by two speakers.
- Formulated as a supervised learning problem.
- Problem: requires parallel data, which is hard to collect.
[Figure: parallel data: the same sentences ("How are you", "Nice to meet you", "I am fine") spoken by both Speaker 1 and Speaker A.]
Recently: unsupervised VC with non-parallel data
- Trained on a non-parallel corpus, which is easier to collect.
- Prior work: utilizes deep generative models, e.g. VAE, GAN, CycleGAN.
- Problem: cannot convert to speakers not in the training data.
- Our goal: train a model which is able to convert to speakers not in
the training data.
[Figure: non-parallel data: Speaker 1 and Speaker A do not have to speak the same sentences.]
Motivation
- Intuition: speech signals inherently carry both content and
speaker information.
- Learn the content/speaker representation separately.
- Synthesize the target voice by combining the source content
representation and target speaker representation.
[Figure: an encoder extracts the content representation of "How are you" from the source utterance; a decoder combines it with the target speaker representation (instead of the source speaker representation) to synthesize the converted utterance.]
Outline
- 1. Introduction
- 2. Proposed Approach
- Model
- Experiments
- 3. Conclusion
Model overview
- One-shot VC: use an utterance from the target speaker as a reference, and
synthesize speech in this reference speaker's voice.
- Idea: separately encode speaker and content information with some
specially designed layers.
Idea
- Speaker information - invariant within an utterance.
- Content information - varying within an utterance.
Specially Designed Layers (IN, AdaIN, AVG):
- Instance Normalization (IN) layer: normalizes the speaker information $(\nu, \tau)$ out of the feature map while preserving content information. For each channel $d$ of the feature map $M$:
  $M'_d = \frac{M_d - \nu_d}{\tau_d}$,
  where $\nu_d$ and $\tau_d$ are the mean and standard deviation of channel $d$ computed over time.
- Adaptive Instance Normalization (AdaIN) layer: provides the speaker information $(\delta, \gamma)$:
  $M'_d = \delta_d \frac{M_d - \nu_d}{\tau_d} + \gamma_d$.
- Average Pooling (AVG) layer: calculates the speaker information $(\delta, \gamma)$ by averaging each channel over the $U$ time steps:
  $M'_d = \frac{1}{U} \sum_{u=1}^{U} M_d[u]$.
- Intuition: normalize global information out (e.g. high frequency), retain changes over time.
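Below is a minimal sketch of how these three layers could look in code. The PyTorch-style tensor shapes, function names, and the epsilon constant are illustrative assumptions, not the authors' implementation.

```python
import torch

def instance_norm(M, eps=1e-5):
    """IN: remove per-channel statistics computed over time.
    M: feature map of shape (batch, channels, time)."""
    nu = M.mean(dim=2, keepdim=True)          # per-channel mean over time
    tau = M.std(dim=2, keepdim=True) + eps    # per-channel std over time
    return (M - nu) / tau

def adaptive_instance_norm(M, delta, gamma, eps=1e-5):
    """AdaIN: normalize, then re-inject speaker statistics (delta, gamma)
    predicted from the speaker representation; delta, gamma: (batch, channels)."""
    normalized = instance_norm(M, eps)
    return delta.unsqueeze(2) * normalized + gamma.unsqueeze(2)

def average_pool(M):
    """AVG: collapse the time axis, keeping only global (speaker) information."""
    return M.mean(dim=2)                      # (batch, channels)
```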
Model - training
[Figure: training. Both encoders receive the same utterance y. The content encoder (with IN layers, which normalize the speaker information (𝜈, 𝜏) out while preserving content) produces the content representation 𝑨𝑑; the speaker encoder (ending in AVG, which calculates the speaker information) produces the speaker representation 𝑨𝑡; the decoder D (with AdaIN layers, which provide the speaker information (𝛿, 𝛾)) reconstructs the utterance from 𝑨𝑑 and 𝑨𝑡.]
- Problem: how to factorize the representations?
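As a rough illustration of the training procedure shown above, here is a hedged sketch assuming an autoencoder-style reconstruction objective; the L1 loss and all function names are assumptions, not the authors' exact recipe.

```python
import torch
import torch.nn.functional as F

def training_step(y, speaker_encoder, content_encoder, decoder, optimizer):
    """y: acoustic features of one utterance, shape (batch, channels, time)."""
    speaker_repr = speaker_encoder(y)            # AVG inside -> global speaker vector
    content_repr = content_encoder(y)            # IN inside  -> speaker-normalized content
    y_hat = decoder(content_repr, speaker_repr)  # AdaIN re-injects speaker information
    loss = F.l1_loss(y_hat, y)                   # reconstruct the same utterance
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```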
Model - testing
[Figure: testing. The speaker representation 𝑨𝑡 is extracted from the target speaker's utterance (speaker encoder with AVG), the content representation 𝑨𝑑 from the source speaker's utterance (content encoder with IN), and the decoder D (with AdaIN) combines them to produce the converted utterance.]
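At test time, conversion only requires swapping which utterance each encoder sees. A minimal sketch under the same assumptions as above (names are illustrative):

```python
import torch

def convert(source_y, target_y, speaker_encoder, content_encoder, decoder):
    """One-shot conversion: content from the source, speaker from one target utterance."""
    with torch.no_grad():
        target_speaker = speaker_encoder(target_y)       # single reference utterance
        source_content = content_encoder(source_y)       # speaker-normalized content
        return decoder(source_content, target_speaker)   # converted acoustic features
```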
Experiments – effect of IN
- Train a separate speaker classifier to measure how much speaker information
remains in the content representations.
- The lower the accuracy is, the less speaker information it contains.
- Content encoder + IN: less speaker information.
- Speaker classification accuracy on the content representation 𝑨𝑑:
  - Content encoder with IN: 0.375
  - Content encoder without IN: 0.658
[Figure: a speaker classifier takes the content representation 𝑨𝑑 as input and predicts the speaker identity.]
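A hedged sketch of this probing experiment, assuming the content representations are frozen and pooled over time before a small classifier; the architecture and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

content_dim, num_speakers = 128, 80   # illustrative sizes, not the paper's values

classifier = nn.Sequential(
    nn.Linear(content_dim, 256), nn.ReLU(),
    nn.Linear(256, num_speakers),
)

def probe_step(content_repr, speaker_labels, optimizer):
    """content_repr: (batch, content_dim), e.g. time-averaged and detached
    so that the VC model itself is never updated."""
    logits = classifier(content_repr.detach())
    loss = nn.functional.cross_entropy(logits, speaker_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    accuracy = (logits.argmax(dim=1) == speaker_labels).float().mean()
    return accuracy.item()
```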
Experiments – speaker embedding visualization
- Does the speaker encoder learn meaningful representations?
- One color represents one
speaker’s utterances.
- 𝑨𝑡 from different speakers are
well separated.
[Figure: 2-D visualization of the speaker representations 𝑨𝑡 extracted (speaker encoder + AVG) from unseen speakers' utterances; utterances from the same speaker cluster together.]
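A figure like the one above can be produced with a standard 2-D projection of the speaker representations. The sketch below assumes t-SNE, which is not stated on this slide and is purely an illustrative choice.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_speaker_embeddings(embeddings, speaker_ids):
    """embeddings: (num_utterances, dim) array of speaker representations;
    speaker_ids: integer label per utterance (one color per speaker)."""
    points = TSNE(n_components=2).fit_transform(embeddings)
    plt.scatter(points[:, 0], points[:, 1], c=speaker_ids, cmap="tab20", s=8)
    plt.title("Speaker representations of unseen speakers' utterances")
    plt.show()
```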
Experiments - subjective
- Ask subjects to score the similarity between two utterances on a 4-point scale.
- Our model is able to generate voices similar to the target speaker's.
Demo (unseen)
- Male to Male: source / target / converted audio samples.
- Female to Male: source / target / converted audio samples.
Demo page: https://jjery2243542.github.io/one-shot-vc-demo/
Conclusion
- We proposed a one-shot VC model, which is able to convert to
unseen speakers with only one reference utterance.
- Using IN and AdaIN, our model is able to learn factorized
representations.