SLIDE 1

One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization

Ju-Chieh Chou, Hung-yi Lee, Interspeech 2019.

SLIDE 2

Outline

  • 1. Introduction
  • 2. Proposed Approach
      • Model
      • Experiments
  • 3. Conclusion
SLIDE 4

Voice conversion

  • Change a characteristic of an utterance while keeping the linguistic content the same.
  • Characteristic: accent, speaker identity, emotion, etc.
  • This work focuses on speaker identity conversion.

[Figure: the model converts "How are you" spoken by Speaker 1 into "How are you" in Speaker A's voice.]

SLIDE 5

Conventional: supervised VC with parallel data

  • Same sentences, different signals from two speakers.
  • Formulated as a supervised learning problem.
  • Problem: requires parallel data, which is hard to collect.

[Figure: parallel data: Speaker 1 and Speaker A each utter the same sentences ("How are you", "Nice to meet you", "I am fine").]

SLIDE 6

Recently: unsupervised VC with non-parallel data

  • Trained on a non-parallel corpus, which is more attainable.
  • Prior work utilizes deep generative models, e.g. VAE, GAN, CycleGAN.
  • Problem: cannot convert to speakers not in the training data.
  • Our goal: train a model that is able to convert to speakers not in the training data.

[Figure: non-parallel data: Speaker 1 and Speaker A don't have to speak the same sentences.]

SLIDE 7

Motivation

  • Intuition: speech signals inherently carry both content and speaker information.
  • Learn the content and speaker representations separately.
  • Synthesize the target voice by combining the source content representation with the target speaker representation.

[Figure: an encoder maps "How are you" to the content representation "How are you"; a decoder combines it with either the source or the target speaker representation to synthesize "How are you" in that speaker's voice.]

SLIDE 8

Outline

  • 1. Introduction
  • 2. Proposed Approach
      • Model
      • Experiments
  • 3. Conclusion
SLIDE 9

Model overview

  • One-shot VC: use an utterance from the target speaker as a reference and synthesize speech in that speaker's voice.
  • Idea: separately encode speaker and content information with specially designed layers.

SLIDE 10

Idea

  • Speaker information: invariant within an utterance.
  • Content information: varying within an utterance.

Specially designed layers (per channel d of a feature map N, with U time steps):

  • Instance Normalization (IN) layer: normalizes speaker information (ν_d, τ_d) out while preserving content information:

    N′_d = (N_d − ν_d) / τ_d

  • Average pooling (AVG) layer: calculates speaker information (δ, γ) by averaging each channel over time:

    N′_d = (1/U) · Σ_{u=1}^{U} N_d[u]

  • Adaptive Instance Normalization (AdaIN) layer: provides speaker information (δ_d, γ_d):

    N′_d = δ_d · (N_d − ν_d) / τ_d + γ_d

Intuition: normalize global information out (e.g. high frequency), retain changes over time.
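As a concrete sketch of the three layers, here is a minimal pure-Python version (the helper names are mine, not from the paper; a feature map is a list of channels, each a list of frames; following the slide's notation, (ν, τ) are the per-channel mean and standard deviation over time, and (δ, γ) are the AdaIN scale and shift):

```python
def instance_norm(feat, eps=1e-8):
    """IN: remove per-channel statistics (nu_d, tau_d) computed over time."""
    out = []
    for ch in feat:                                   # one channel = one list of frames
        nu = sum(ch) / len(ch)                        # per-channel mean nu_d
        tau = (sum((x - nu) ** 2 for x in ch) / len(ch) + eps) ** 0.5  # std tau_d
        out.append([(x - nu) / tau for x in ch])
    return out

def avg_pool(feat):
    """AVG: per-channel mean over the U frames -> speaker information."""
    return [sum(ch) / len(ch) for ch in feat]

def adain(feat, delta, gamma, eps=1e-8):
    """AdaIN: normalize, then re-inject speaker information (delta_d, gamma_d)."""
    normed = instance_norm(feat, eps)
    return [[delta[d] * x + gamma[d] for x in ch] for d, ch in enumerate(normed)]
```

After `instance_norm`, every channel has mean ≈ 0 and std ≈ 1; after `adain`, channel d has mean γ_d and std |δ_d|, i.e. the global statistics are fully replaced.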

SLIDE 11

Model - training

[Figure: the same utterance y is fed to both the speaker encoder (whose AVG layers calculate speaker information (δ, γ)) and the content encoder (whose IN layers normalize speaker information (ν, τ) out while preserving content); the decoder D re-injects the speaker information through AdaIN layers to reconstruct y.]

Problem: how to factorize the representations?

SLIDE 12

Model - testing

[Figure: at test time, the content encoder (with IN) takes the source speaker's utterance and the speaker encoder (with AVG) takes the target speaker's utterance; the decoder D combines them through AdaIN to produce the converted utterance.]
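To make the test-time wiring concrete, here is a toy end-to-end sketch (pure Python; the real encoders and decoder are neural networks, which I replace here with the bare IN / AVG / AdaIN operations and a unit AdaIN scale, so only the data flow matches the model):

```python
def _instance_norm(ch, eps=1e-8):
    # Per-channel IN: subtract the mean (nu) and divide by the std (tau).
    nu = sum(ch) / len(ch)
    tau = (sum((x - nu) ** 2 for x in ch) / len(ch) + eps) ** 0.5
    return [(x - nu) / tau for x in ch]

def convert(source_utt, target_utt):
    """One-shot VC data flow: content from the source, speaker from the target.

    Utterances are lists of channels, each a list of frames.
    """
    # Content encoder + IN: strip the source speaker's global statistics.
    content = [_instance_norm(ch) for ch in source_utt]
    # Speaker encoder + AVG: per-channel time average of the target reference.
    spk = [sum(ch) / len(ch) for ch in target_utt]
    # Decoder + AdaIN (toy: scale delta = 1, shift gamma = speaker embedding).
    return [[x + spk[d] for x in ch] for d, ch in enumerate(content)]
```

The converted channels inherit the target reference's per-channel averages while keeping the source utterance's variation over time.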

SLIDE 13

Experiments – effect of IN

  • Train an additional speaker classifier to measure how much speaker information remains in the content representations.
  • The lower the accuracy, the less speaker information they contain.
  • Content encoder + IN: less speaker information.

                Content encoder with IN   Content encoder without IN
    Accuracy    0.375                     0.658

[Figure: the content representation is fed to a speaker classifier that predicts the speaker.]
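A probe of this kind can be sketched as follows (a simple nearest-centroid classifier stands in for the paper's trained network, and the data in the usage example is synthetic; the point is only that accuracy near chance level indicates little speaker leakage):

```python
def probe_accuracy(reps, labels):
    """Predict the speaker from representations; lower accuracy => less leakage."""
    speakers = sorted(set(labels))
    # One centroid per speaker, averaged over that speaker's representations.
    cents = {}
    for s in speakers:
        vecs = [r for r, l in zip(reps, labels) if l == s]
        dim = len(vecs[0])
        cents[s] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

    def predict(r):
        # Nearest centroid by squared Euclidean distance.
        return min(speakers,
                   key=lambda s: sum((a - b) ** 2 for a, b in zip(r, cents[s])))

    correct = sum(predict(r) == l for r, l in zip(reps, labels))
    return correct / len(reps)
```

With well-separated per-speaker clusters the probe scores 1.0; when the representations carry no speaker information, it falls to chance level.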

SLIDE 14

Experiments – speaker embedding visualization

  • Does the speaker encoder learn meaningful representations?
  • One color represents one speaker's utterances.
  • The speaker embeddings of different speakers are well separated.

[Figure: visualization of speaker embeddings of unseen speakers' utterances.]


SLIDE 16

Experiments - subjective

  • Ask subjects to score the similarity between two utterances on a 4-point scale.
  • Our model is able to generate voices similar to the target speaker's.

SLIDE 17

Demo (unseen)

  • Male to male: source / target / converted samples
  • Female to male: source / target / converted samples

Demo page: https://jjery2243542.github.io/one-shot-vc-demo/

SLIDE 18

Conclusion

  • We proposed a one-shot VC model that is able to convert to an unseen speaker with a single reference utterance.
  • Through IN and AdaIN, our model is able to learn factorized representations.

SLIDE 19

Thank you for your attention.