  1. Unsupervised Voice Conversion by Separately Embedding Speaker and Content Information with Deep Generative Model
  以分別嵌入語者及語言內容資訊之深層生成模型達成無監督式語音轉換
  Speaker: 周儒杰 (Ju-Chieh Chou)
  Advisor: 李琳山 (Lin-shan Lee)

  2. Outline: 1. Introduction (Voice Conversion, Branches, Motivation); 2. Proposed Approach (Multi-target Model: model, experiments; One-shot Model: model, experiments); 3. Conclusion

  3. Outline: 1. Introduction (Voice Conversion, Branches, Motivation); 2. Proposed Approach (Multi-target Model: model, experiments; One-shot Model: model, experiments); 3. Conclusion

  4. Voice conversion
  ● Change the characteristics of an utterance while keeping the language content the same.
  ● Characteristics: accent, speaker identity, emotion, …
  ● This work focuses on speaker identity conversion.
  [Diagram: an utterance "How are you" from Speaker 1 is fed to the model and comes out in Speaker A's voice, still saying "How are you".]

  5. Conventional: supervised with parallel data
  ● Parallel data: the same sentences spoken by two speakers, giving different signals.
  ● Train a model to map utterances from Speaker 1 to Speaker A.
  ● Problem: parallel data is required, and it is hard to collect.
  [Diagram: parallel data from Speaker 1 and Speaker A: "How are you" / "How are you", "Nice to meet you" / "Nice to meet you", "I am fine" / "I am fine".]

  6. This work: unsupervised with non-parallel data
  ● Trained on a non-parallel corpus, which is much easier to obtain.
  ● An actively investigated direction.
  ● Prior work utilizes deep generative models, e.g. VAE, GAN, CycleGAN.
  [Diagram: Speaker 1 and Speaker A do not have to speak the same sentences.]

  7. Voice Conversion Branches
  [Taxonomy diagram, ordered by increasing data efficiency:]
  ● Parallel data: single-target voice conversion.
  ● Non-parallel data, with transcription (phonemes): e.g. Yeh et al.
  ● Non-parallel data, without transcription: this work, covering multi-target conversion (ch. 3) and one-shot conversion (ch. 4).

  8. Motivation
  ● Intuition: speech signals inherently carry both content and speaker information.
  ● Learn the content and speaker representations separately.
  ● Synthesize the target voice by combining the source content representation with the target speaker representation.
  [Diagram: the encoder maps "How are you" to the content representation "How are you"; the decoder combines it with the target speaker representation instead of the source speaker representation, and outputs "How are you" in the target voice.]

  9. Outline: 1. Introduction (Voice Conversion, Branches, Motivation); 2. Proposed Approach (Multi-target Model: model, experiments; One-shot Model: model, experiments); 3. Conclusion

  10. Multi-target unsupervised conversion with non-parallel data
  ● Single-target approaches: one model per conversion pair, e.g. Model-A, Model-B and Model-C to convert Speaker 1 to Speakers A, B and C; in general on the order of N² models for N speakers.
  ● Proposed multi-target approach: a single model converts Speaker 1 to Speaker A, B or C.

  11. Stage 1: disentanglement between content and speaker representation
  ● Goal of classifier-1: maximize the likelihood of the correct speaker given the content representation enc(x).
  [Training diagram: the encoder maps Speaker 1's utterance to enc(x); classifier-1 tries to identify the speaker from enc(x), which pushes the encoder to remove speaker information; the decoder, conditioned on the speaker id, reconstructs the input under a reconstruction loss.]

  12. Stage 1: disentanglement between content and speaker representation
  ● Goal of classifier-1: maximize the likelihood of the correct speaker given enc(x).
  ● Goal of the encoder: minimize that likelihood, i.e. remove speaker information from enc(x).
  ● The classifier and the encoder/decoder are trained iteratively.
  [Diagram: during training the decoder is conditioned on the true speaker id and trained with the reconstruction loss; at testing, the target speaker id is used instead.]
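
A minimal sketch of one stage-1 training iteration (slides 11-12) in PyTorch. The module names (encoder, decoder, classifier1), optimizers (opt_clf, opt_ae), the batch x with speaker labels spk, the weight lambda_adv, the L1 reconstruction loss and the negated cross-entropy used as the adversarial term are all assumptions for illustration; the slides only specify that classifier-1 maximizes the speaker likelihood, the encoder minimizes it, and the two are trained iteratively.

    import torch.nn.functional as F

    # Classifier-1 step: identify the speaker from the content representation enc(x).
    z = encoder(x).detach()                              # no gradient flows back into the encoder
    clf_loss = F.cross_entropy(classifier1(z), spk)
    clf_loss.backward()
    opt_clf.step()
    opt_clf.zero_grad()

    # Encoder/decoder step: reconstruct x while removing speaker information from enc(x).
    z = encoder(x)
    recon = decoder(z, spk)                              # decoder is conditioned on the speaker id
    recon_loss = F.l1_loss(recon, x)
    adv_loss = -F.cross_entropy(classifier1(z), spk)     # encoder tries to make classifier-1 fail
    (recon_loss + lambda_adv * adv_loss).backward()
    opt_ae.step()
    opt_ae.zero_grad()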

  13. Problem of stage 1: training-testing mismatch
  [Diagram: during training, the decoder always sees enc(x) together with the same speaker y whose utterance produced x; during testing, it is given a different speaker y', a combination never seen in training.]

  14. Problem of stage 1: over-smoothed spectra
  ● Stage 1 alone can synthesize the target voice to some extent.
  ● The reconstruction loss encourages the model to generate the average value of the target, so details are lost.
  ● This leads to over-smoothed spectra and buzzy synthesized speech.
  [Diagram: the stage-1 training pipeline; the decoder output may be over-smoothed.]
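
Why a reconstruction loss averages out detail can be seen with a toy example: if several spectral frames are equally plausible targets for the same content, the single prediction that minimizes the mean squared error is their element-wise average, which is flatter than any individual frame. The numbers below are hypothetical.

    import numpy as np

    # Three equally plausible target frames for the same content (hypothetical values).
    targets = np.array([[0.1, 0.9, 0.2],
                        [0.3, 0.5, 0.8],
                        [0.2, 0.7, 0.4]])

    # The MSE-optimal single prediction is the element-wise mean of the targets.
    best = targets.mean(axis=0)
    print(best)   # approx. [0.2, 0.7, 0.467]: the sharp 0.9 and 0.8 peaks are smoothed away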

  15. Stage 2: patch the output with a residual signal
  ● Randomly sample a speaker id as the condition.
  ● Train another generator to produce a residual signal (spectral details), making the output more natural.
  [Diagram: the stage-1 encoder/decoder are fixed; the content representation enc(x) and the randomly sampled speaker id feed both the decoder and the new generator, and the generator's residual signal is added to the decoder output.]

  16. Stage 2: patch the output with a residual signal
  ● The discriminator learns to tell synthesized data from real data.
  ● The generator learns to fool the discriminator.
  [Diagram: real data and the patched output are fed to the discriminator, which outputs real or generated.]

  17. Stage 2: patch the output with a residual signal
  ● Classifier-2 learns to identify the speaker.
  ● The generator also tries to make classifier-2 predict the correct (sampled) speaker id.
  [Diagram: real data and the patched output are fed to classifier-2, which predicts the speaker id.]

  18. Stage 2: patch the output with a residual signal
  ● The generator and the discriminator/classifier-2 are trained iteratively.
  [Diagram: during training the conditioning speaker id is randomly sampled; at testing it is set to the target speaker.]
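
A minimal sketch of one stage-2 iteration (slides 15-18) in PyTorch. The stage-1 encoder/decoder are frozen; the module names (generator, discriminator, classifier2), optimizers (opt_d covering the discriminator and classifier-2, opt_g for the generator), n_speakers, and the vanilla GAN loss form are assumptions for illustration; the slides only specify the role of each component and iterative training.

    import torch
    import torch.nn.functional as F

    spk_rand = torch.randint(0, n_speakers, (x.size(0),))   # randomly sampled target speaker ids
    with torch.no_grad():                                    # stage-1 modules stay fixed
        z = encoder(x)
        base = decoder(z, spk_rand)
    fake = base + generator(z, spk_rand)                     # patch the output with the residual signal

    # Discriminator / classifier-2 step: real vs. generated, plus speaker identification on real data.
    d_real, d_fake = discriminator(x), discriminator(fake.detach())
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    c_loss = F.cross_entropy(classifier2(x), spk)
    (d_loss + c_loss).backward()
    opt_d.step()
    opt_d.zero_grad()

    # Generator step: fool the discriminator and make classifier-2 predict the sampled speaker.
    g_adv = F.binary_cross_entropy_with_logits(discriminator(fake), torch.ones_like(d_fake))
    g_cls = F.cross_entropy(classifier2(fake), spk_rand)
    (g_adv + g_cls).backward()
    opt_g.step()
    opt_g.zero_grad()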

  19. Experiments – spectrogram visualization
  ● Is stage 2 helpful?
  ● The sharpness of the spectrogram is improved by stage 2.

  20. Experiments – subjective preference
  ● Subjects were asked to choose their preference in terms of naturalness and similarity.
  ● Stage 2 improves over stage 1 alone.
  ● The proposed approach is comparable to the baseline, CycleGAN-VC [1].
  [Preference charts: "Is stage 2 helpful?" compares "stage 1 + stage 2", "stage 1 alone" and "indistinguishable"; "Comparison to baseline [1]" compares "stage 1 + stage 2", "CycleGAN-VC [1]" and "indistinguishable".]
  [1] CycleGAN-VC: Kaneko et al., EUSIPCO 2018.

  21. Demo page: https://jjery2243542.github.io/voice_conversion_demo/
  ● Male to Female: source, target and converted samples.
  ● Female to Female: source, target and converted samples.
  ● Prof. Hung-yi Lee (male, never seen in the training data) to Female: source, target and converted samples.

  22. Outline: 1. Introduction (Voice Conversion, Branches, Motivation); 2. Proposed Approach (Multi-target Model: model, experiments; One-shot Model: model, experiments); 3. Conclusion

  23. One-shot unsupervised conversion with non-parallel data
  ● Prior work: only able to convert to speakers seen in the training data; the target speaker is specified by a one-hot speaker id, so the training data must include all target speakers.
  ● This work: the source/target speakers may be unseen during training; the target speaker is specified by a single reference utterance (one-shot).
  [Diagram: prior work converts Speaker 1 to Speakers A, B and C from the training set with a speaker-id condition; this work feeds Speaker 1's utterance and one target-speaker reference utterance to a single model.]

  24. Idea
  ● Speaker information: invariant within an utterance.
  ● Content information: varying within an utterance.
  ● Specially designed layers (for a feature map $M$ with channel index $c$, time index $t$ and $T$ frames):
  ● Instance Normalization (IN) layer: normalizes speaker information $(\mu, \sigma)$ out while preserving content information: $M'_c = \frac{M_c - \mu_c}{\sigma_c}$.
  ● Average Pooling (AVG) layer: calculates speaker information $(\gamma, \beta)$ by pooling over time: $v_c = \frac{1}{T}\sum_{t=1}^{T} M_c[t]$.
  ● Adaptive Instance Normalization (AdaIN) layer: provides speaker information $(\gamma, \beta)$: $M'_c = \gamma_c \frac{M_c - \mu_c}{\sigma_c} + \beta_c$.
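
A minimal PyTorch sketch of the three layers for a feature map M of shape (batch, channels, time). The function names and the eps handling are illustrative, not the thesis implementation.

    import torch

    def instance_norm(M, eps=1e-5):
        # IN: remove the per-channel statistics (mu, sigma) computed over time,
        # normalizing speaker information out while keeping the variation across time (content).
        mu = M.mean(dim=2, keepdim=True)            # (B, C, 1)
        sigma = M.std(dim=2, keepdim=True) + eps    # (B, C, 1)
        return (M - mu) / sigma

    def avg_pool(M):
        # AVG: summarize each channel over time; the time-invariant result carries
        # the global (speaker) information of the utterance.
        return M.mean(dim=2)                        # (B, C)

    def adain(M, gamma, beta, eps=1e-5):
        # AdaIN: normalize the content feature map, then re-introduce speaker
        # information (gamma, beta) predicted from the speaker representation.
        return gamma.unsqueeze(2) * instance_norm(M, eps) + beta.unsqueeze(2)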

  25. Intuition
  ● Normalize global information out (e.g. high-frequency characteristics) and retain the changes across time.

  26. Model - training
  ● Problem: how to factorize the representations?
  [Architecture diagram: the speaker encoder E_s (ending in AVG) maps the utterance x to the speaker representation z_s; the content encoder E_c (with IN) maps x to the content representation z_c; the decoder D (with AdaIN) reconstructs x from z_c and z_s.]
  ● AVG: calculating speaker information (γ, β).
  ● IN: normalizing speaker information (μ, σ) out while preserving content information.
  ● AdaIN: providing speaker information (γ, β).
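
A minimal sketch of the factorized forward pass, assuming hypothetical modules speaker_encoder (E_s, ending in average pooling), content_encoder (E_c, using IN) and decoder (D, whose AdaIN layers take gamma/beta derived from z_s); the plain L1 reconstruction loss is an assumption, and the thesis may add further regularization.

    import torch.nn.functional as F

    # Training: reconstruct the utterance x from its own factorized representations.
    z_s = speaker_encoder(x)     # time-invariant speaker representation (AVG inside E_s)
    z_c = content_encoder(x)     # time-varying content representation (IN inside E_c)
    recon = decoder(z_c, z_s)    # AdaIN layers in D inject speaker information derived from z_s
    loss = F.l1_loss(recon, x)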

  27. Model - testing
  [Architecture diagram: the target speaker's utterance goes through the speaker encoder E_s to get z_s; the source speaker's utterance goes through the content encoder E_c to get z_c; the decoder D combines them to produce the converted utterance. The AVG / IN / AdaIN legend is the same as in the previous slide.]
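
Under the same assumed modules as in the previous sketch, conversion at test time only changes which utterance feeds which encoder:

    # Content from the source utterance, voice from a single target-speaker reference utterance.
    z_s = speaker_encoder(target_utt)    # speaker representation from one reference utterance
    z_c = content_encoder(source_utt)    # content representation from the source utterance
    converted = decoder(z_c, z_s)        # AdaIN in the decoder injects the target speaker's (gamma, beta)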

  28. Experiments – effect of IN
  ● Train another speaker classifier to see how much speaker information remains in the content representations z_c.
  ● The lower the accuracy, the less speaker information z_c contains.
  ● Content encoder with IN: less speaker information.
  [Diagram: z_c (content representation) is fed to a speaker classifier that predicts the speaker.]
  Speaker classification accuracy: E_c with IN: 0.375; E_c without IN: 0.658.
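
A minimal sketch of this probing experiment, assuming a separate classifier `probe` trained on frozen content representations (the module and optimizer names are hypothetical); the lower its held-out accuracy, the less speaker information remains in z_c.

    import torch
    import torch.nn.functional as F

    with torch.no_grad():
        z_c = content_encoder(x)        # frozen content representations z_c
    logits = probe(z_c)                 # only the probing classifier is trained here
    loss = F.cross_entropy(logits, spk)
    loss.backward()
    opt_probe.step()
    opt_probe.zero_grad()

    accuracy = (logits.argmax(dim=1) == spk).float().mean()   # reported: 0.375 with IN vs. 0.658 without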

  29. Experiments – speaker embedding visualization
  • Does the speaker encoder learn meaningful representations?
  • One color represents one speaker's utterances.
  • z_s from different speakers are well separated.
  [Scatter plot: speaker representations z_s of unseen speakers' utterances, obtained from the speaker encoder E_s (with AVG).]

  30. Experiments - subjective
  • Subjects were asked to score the similarity between two utterances on a 4-point scale.

  31. Experiments - subjective
  • Subjects were asked to score the similarity between two utterances on a 4-point scale.
  • Our model is able to generate voices similar to the target speaker's.
