Unsupervised Voice Conversion by Separately Embedding Speaker and Content Information with Deep Generative Model
Speaker: 周儒杰 (Ju-Chieh Chou) Advisor: 李琳山 (Lin-shan Lee)
Voice conversion: convert a source speaker's voice into a target speaker's voice while keeping the language content the same.
(Example: Speaker 1's "How are you" becomes Speaker A's voice saying "How are you".)
Parallel data: utterances of the same sentences from the 2 speakers, used to learn to convert Speaker 1's voice to Speaker A. Such parallel data is hard to collect.
Non-parallel data: the 2 speakers don't have to speak the same sentences (e.g., "Nice to meet you", "I am fine").
Taxonomy of voice conversion:
Parallel data.
Non-parallel data: with transcription (phonemes), or without transcription (phonemes).
Without transcription: single-target, multi-target, or one-shot.
Idea: factorize an utterance into a content representation ("How are you") and a speaker representation; conversion replaces the source speaker representation with the target speaker representation while keeping the content.
Single-target conversion needs one model per pair: 3 models (Model-A, Model-B, Model-C) for 3 target speakers, and O(N²) models for N speakers. Multi-target conversion (Speaker 1 to Speaker A/B/C): only one model is needed.
Stage 1 (training):
The Encoder maps the input utterance x from Speaker 1 into a content representation enc(x).
The Decoder reconstructs x from enc(x) together with the speaker id (reconstruction loss).
Classifier-1 tries to identify the speaker from enc(x); the Encoder is trained adversarially against it, so that speaker information is removed from enc(x).
The Encoder/Decoder and Classifier-1 are trained iteratively, as in the sketch below.
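A minimal PyTorch sketch of one stage-1 training iteration. The module architectures, the Decoder's conditioning interface, the optimizers, and the adversarial weight `lam` are placeholder assumptions for illustration, not the thesis's exact configuration:

```python
import torch
import torch.nn.functional as F

def stage1_step(x, speaker_id, encoder, decoder, clf1, opt_ae, opt_clf, lam=0.01):
    """One stage-1 iteration (sketch).
    x: (batch, ...) acoustic features; speaker_id: (batch,) integer labels."""
    # Step A: train Classifier-1 to identify the speaker from enc(x).
    with torch.no_grad():
        content = encoder(x)                  # detached: only clf1 is updated here
    clf_loss = F.cross_entropy(clf1(content), speaker_id)
    opt_clf.zero_grad()
    clf_loss.backward()
    opt_clf.step()

    # Step B: train Encoder/Decoder to reconstruct x while fooling
    # Classifier-1, i.e., removing speaker information from enc(x).
    content = encoder(x)
    recon = decoder(content, speaker_id)      # Decoder also sees the speaker id
    recon_loss = F.l1_loss(recon, x)          # reconstruction loss
    adv_loss = -F.cross_entropy(clf1(content), speaker_id)  # maximize clf error
    opt_ae.zero_grad()
    (recon_loss + lam * adv_loss).backward()
    opt_ae.step()
    return recon_loss.item(), clf_loss.item()
```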
Stage 1 (testing): the Decoder receives enc(x) of Speaker 1 together with the target speaker id.
Classifier-1 in detail: during training, it sees enc(x) from speaker y and learns to predict speaker y; during testing, the Decoder is conditioned on a different speaker y'.
Hence the mismatch: in training, enc(x) is always paired with the same speaker; in testing, with a different speaker. Conversion then looks as follows.
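In code, testing-time conversion reduces to decoding the content with the target speaker's id (same placeholder modules as above):

```python
# Conversion at test time (sketch): the Decoder receives the *target*
# speaker id y' instead of the source speaker's own id.
with torch.no_grad():
    content = encoder(x_source)                      # speaker info removed in stage 1
    converted = decoder(content, target_speaker_id)  # decode as the target speaker
```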
Problem: with only the reconstruction loss, the Decoder output merely approximates the target and lacks details. This leads to over-smoothed spectra, and results in buzzy synthesized speech.
Stage 2 (training):
The Encoder and Decoder come from stage 1 and are kept fixed.
A Generator takes the content enc(x) plus a randomly sampled speaker id and outputs a residual signal, which is added to the Decoder output to restore the missing details.
A Discriminator judges whether an utterance is real or generated, and Classifier-2 identifies the speaker, so the Generator must both fool the Discriminator and match the sampled speaker id. One iteration is sketched below.
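A sketch of one stage-2 iteration. The separate Discriminator/Classifier-2 heads, the cross-entropy GAN losses, and all module signatures are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def stage2_step(x, speaker_id, n_speakers, encoder, decoder,
                generator, disc, clf2, opt_g, opt_d):
    """One stage-2 iteration (sketch). encoder/decoder are the fixed
    stage-1 modules; generator/disc/clf2 are the new stage-2 modules."""
    with torch.no_grad():                        # stage-1 modules stay fixed
        content = encoder(x)                     # enc(x)
        rand_id = torch.randint(0, n_speakers, (x.size(0),))
        coarse = decoder(content, rand_id)       # possibly over-smoothed output
    # The Generator produces a residual signal added to the Decoder output.
    fake = coarse + generator(content, rand_id)

    # Train the Discriminator (real vs. generated) and Classifier-2 (speaker id).
    d_real, d_fake = disc(x), disc(fake.detach())
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
              + F.cross_entropy(clf2(x), speaker_id))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Train the Generator: fool the Discriminator, match the sampled speaker id.
    d_out = disc(fake)
    g_loss = (F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
              + F.cross_entropy(clf2(fake), rand_id))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```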
Stage 2 (testing): the randomly sampled speaker id is replaced by the target speaker id; the Generator's residual signal is added to the Decoder output to produce the converted voice.
Evaluation by subjective preference tests: converted speech is improved by stage 2.
Is stage 2 helpful? Listeners chose among: "Stage 1 + stage 2" is better / "Stage 1 alone" is better / indistinguishable.
Comparison to the baseline [1]: "Stage 1 + stage 2" is better / "CycleGAN-VC" [1] is better / indistinguishable.
[1] CycleGAN-VC: Kaneko et al., EUSIPCO 2018.
Audio demo (Male to Female; Female to Female): source, target, and converted samples.
Demo page: https://jjery2243542.github.io/voice_conversion_demo/
Prior work (including the model above): only able to convert to speakers in the training data, because the speaker id is a fixed-length one-hot encoding and the training data must include utterances of all target speakers.
This work: source/target speakers may be unseen during training; one reference utterance of the target speaker is used instead of a speaker id (one-shot).
Specially designed layers. Let $M_c$ be channel $c$ of a feature map, with $U$ frames across time:
Instance Normalization (IN) layer: normalizes out speaker information $(\nu, \tau)$ while preserving content information:
$M'_c = \frac{M_c - \nu_c}{\tau_c}$, where $\nu_c = \frac{1}{U}\sum_{u=1}^{U} M_c[u]$ and $\tau_c$ is the corresponding standard deviation across time.
Adaptive Instance Normalization (AdaIN) layer: provides speaker information $(\delta, \gamma)$ after normalization:
$M'_c = \delta_c M_c + \gamma_c$.
Average pooling (AVG) layer: calculates the speaker information $(\delta, \gamma)$ by averaging across time.
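A minimal PyTorch sketch of the two layers defined by the equations above; the (batch, channels, time) tensor layout and the `eps` stabilizer are assumptions:

```python
import torch

def instance_norm(M, eps=1e-5):
    """IN layer (sketch): remove per-channel statistics (nu, tau) computed
    across time, normalizing out speaker information while the content
    (variation across time) survives. M: (batch, channels, time)."""
    nu = M.mean(dim=2, keepdim=True)        # nu_c = (1/U) sum_u M_c[u]
    tau = M.std(dim=2, keepdim=True) + eps  # tau_c: std across time
    return (M - nu) / tau

def adaptive_instance_norm(M, delta, gamma, eps=1e-5):
    """AdaIN layer (sketch): normalize as in IN, then re-inject speaker
    information (delta, gamma), each of shape (batch, channels, 1)."""
    return delta * instance_norm(M, eps) + gamma
```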
Problem: how to factorize the representations?
Answer: the Content Encoder (with IN layers) encodes the source speaker's utterance, normalizing out speaker information $(\nu, \tau)$ while preserving content information; the Speaker Encoder (with AVG layers) encodes the target speaker's utterance, calculating speaker information $(\delta, \gamma)$; the Decoder D (with AdaIN layers) combines the two into the converted utterance.
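At conversion time the three components compose as sketched below; the module names are placeholders, and the Speaker Encoder is assumed to return the $(\delta, \gamma)$ pair after its AVG layer:

```python
# One-shot conversion (sketch), following the architecture above.
with torch.no_grad():
    content = content_encoder(source_utterance)       # IN inside removes speaker info
    delta, gamma = speaker_encoder(target_reference)  # AVG pooling across time inside
    converted = decoder(content, delta, gamma)        # AdaIN re-injects speaker info
```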
Is the factorization successful? A classifier is trained to predict the speaker from the content representation, checking how much speaker information remains in content representations; and unseen speakers' utterances are fed through the Speaker Encoder (AVG output) to inspect the resulting speaker embeddings.
Subjective evaluation: listeners rated the similarity between 2 utterances on a 4-point scale; the model is able to generate voice similar to the target speaker's.
Audio demo (Male to Male; Female to Male): source, target, and converted samples.
Demo page: https://jjery2243542.github.io/one-shot-vc-demo/
Conclusion: unsupervised voice conversion by "separately embedding speaker and content information".
○ We proposed a multi-target VC model that removes speaker information with adversarial training.
○ GAN training mitigates the problem of over-smoothing and improves the results.
○ We proposed a one-shot VC model, which is able to convert to unseen speakers with one reference utterance.
○ With IN and AdaIN, our model is able to learn factorized representations.
In short: IN removes the per-channel statistics $(\nu, \tau)$ of the feature map but retains changes across time, $M'_c = \frac{M_c - \nu_c}{\tau_c}$; AdaIN re-applies $(\delta, \gamma)$ provided by the speaker encoder, which control the global information, $M'_c = \delta_c M_c + \gamma_c$.