
CoT: Cooperative Training for Generative Modeling of Discrete Data - PowerPoint PPT Presentation



  1. CoT: Cooperative Training for Generative Modeling of Discrete Data https://github.com/desire2020/CoT Sidi Lu, Lantao Yu, Siyuan Feng, Yaoming Zhu, Weinan Zhang, and Yong Yu Shanghai Jiao Tong University

  2. Autoregressive Models • Autoregressive models factorize the joint distribution sequentially to build a fully tractable density function: • q_θ(x_0, x_1, …, x_{n−1}) = q_θ(x_0) q_θ(x_1 | x_[0:1)) q_θ(x_2 | x_[0:2)) … q_θ(x_{n−1} | x_[0:n−1))
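The factorization above can be sketched with a toy tabular model; the bigram table and the names `conditional` / `sequence_log_prob` are illustrative stand-ins, not code from the paper:

```python
import numpy as np

# Toy autoregressive model over a 3-token vocabulary {0, 1, 2}.
# Any model of the form q_theta(x_0..x_{n-1}) = prod_t q_theta(x_t | x_[0:t))
# works the same way; here the conditional only looks at the last token.
def conditional(prefix):
    """q(x_t | prefix): a simple bigram table for illustration."""
    table = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.3, 0.3, 0.4]])
    start = np.array([0.5, 0.3, 0.2])  # distribution of the first token
    return start if len(prefix) == 0 else table[prefix[-1]]

def sequence_log_prob(x):
    """Sum of per-step log conditionals = log of the factorized joint."""
    return sum(np.log(conditional(x[:t])[x[t]]) for t in range(len(x)))
```

For example, q(0, 1, 1) = 0.5 · 0.2 · 0.8 = 0.08, recovered by exponentiating the summed log conditionals.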

  3. Teacher Forcing and Exposure Bias • For each sequence in the training set, maximize the estimated likelihood in log scale. [Diagram: at every step the model estimates p(x|s) while being fed the real ("forced") observation; example prefix: "I have"]

  4. Teacher Forcing and Exposure Bias • When used to generate a random sample: [Diagram: at every step the model conditions on its own ("self") stochastic sample; example output: "Billie Jean"]

  5. Teacher Forcing and Exposure Bias • Exposure Bias [Ranzato et al., 2015]: • The intermediate process is inconsistent between the training stage and the inference stage: during training (teacher forcing) the model p(x|s) conditions on a real prefix, while during inference (random sampling) it conditions on a generated prefix. • The resulting distribution shift accumulates along the timeline.
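The mismatch can be made concrete with a toy bigram model; the table and function names below are hypothetical, not from the paper:

```python
import numpy as np

# Row i = distribution of the next token given previous token i.
table = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])

def teacher_forced_nll(x):
    # Training: every step conditions on the REAL previous token x[t-1].
    return -sum(np.log(table[x[t - 1], x[t]]) for t in range(1, len(x)))

def free_running_sample(x0, n, rng):
    # Inference: every step conditions on the model's OWN previous sample,
    # so an early sampling error shifts all later conditionals — the
    # distribution shift accumulates along the timeline.
    seq = [x0]
    for _ in range(n - 1):
        seq.append(int(rng.choice(3, p=table[seq[-1]])))
    return seq
```

Training never exposes the model to its own imperfect prefixes, which is exactly the inconsistency the slide describes.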

  6. Exposure Bias and Kullback-Leibler Divergence • Exposure bias can also be regarded as a consequence of optimizing by minimizing the Kullback-Leibler divergence, denoted KL(P||Q) for distributions P and Q.

  7. Kullback-Leibler Divergence, Symmetry of Divergences • For arbitrary P and Q, KL(P||Q) does not necessarily equal KL(Q||P). • Smoothing and symmetrizing KL yields the Jensen-Shannon divergence: JSD(P||Q) = 0.5 KL(P||M) + 0.5 KL(Q||M), where M = 0.5 * (P + Q)
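The asymmetry of KL and the symmetry of JSD are easy to check numerically for discrete distributions; this is a standard computation, not code from the paper:

```python
import numpy as np

def kl(p, q):
    """KL(P||Q) for discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    """JSD(P||Q) = 0.5 KL(P||M) + 0.5 KL(Q||M), M = 0.5 (P + Q)."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])
# kl(p, q) != kl(q, p), but jsd(p, q) == jsd(q, p),
# and JSD is bounded above by log 2.
```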

  8. GAN, SeqGAN and Language GANs • Ian Goodfellow proposed the Generative Adversarial Network [2014]. • Ideally, GAN minimizes the JSD. • It cannot be directly applied to discrete sequence generation, since sampling discrete tokens is not differentiable. • SeqGAN uses the REINFORCE gradient estimator to resolve this.
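A minimal REINFORCE sketch, assuming a single categorical variable with logits `theta` (a hypothetical toy, not SeqGAN's actual code): the gradient of E_{x~q_θ}[r(x)] is estimated as the sample mean of r(x) · ∇ log q_θ(x), which sidesteps the non-differentiable sampling step.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_grad(theta, reward, n_samples, rng):
    """Monte-Carlo estimate of grad_theta E_{x~softmax(theta)}[reward[x]]."""
    probs = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        x = rng.choice(len(theta), p=probs)
        grad_log_p = -probs.copy()      # grad of log softmax w.r.t. logits
        grad_log_p[x] += 1.0
        grad += reward[x] * grad_log_p
    return grad / n_samples
```

With reward concentrated on one token, the estimate pushes that token's logit up on average, but single samples are noisy — the high variance of this scalar-reward signal is part of what makes SeqGAN training fragile.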

  9. Problems of SeqGAN • Not trivially able to work from scratch; SeqGAN's work-around is pre-training via teacher forcing. • Trades diversity for quality (mode collapse), according to previous reports [Lu et al., 2018; Caccia et al., 2018].

  10. Problems of SeqGAN • The training signal is too sparse: the discriminator returns a single scalar for the whole generated sequence. [Diagram: the generator samples token by token ("self", stochastic); the discriminator scores the finished sample "Billie Jean" with a single-point signal]

  11. Cooperative Training: Back to the Formula! • Reconsider the algorithm from the viewpoint of estimating and minimizing the JSD: JSD(P||G) = 0.5 KL(P||M) + 0.5 KL(G||M), where M = 0.5 * (P + G). • Instead of using a discriminator to achieve this, use another sequence model, called the "Mediator", to approximate the mixture density M.

  12. Cooperative Training: More Information from the Mediator • Key Idea: the mediator provides a DISTRIBUTION-level signal at each time step. [Diagram: at every step the generator's conditional G(x|s) receives a signal from the mediator's conditional M(x|s); example: "Billie Jean"]

  13. Cooperative Training: Factorizing the Cumulative Gradient Through Time, Final Objectives • Generator Gradient: the expected gradient is factorized into per-step terms over the step-wise conditionals, where π_g(s_t) = G_θ(· | s_t) and π_m(s_t) = M_φ(· | s_t). • Mediator Objective: maximum likelihood on the balanced mixture of real and generated samples, J_m(φ) = 0.5 (E_{s~P}[log M_φ(s)] + E_{s~G_θ}[log M_φ(s)])
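A one-step tabular sketch of a cooperative update, under the simplifying assumption of single-token "sequences" (an illustrative reduction of the sequential objectives, not the paper's implementation): the mediator's MLE optimum on a balanced real/generated mixture is the mixture density itself, and the generator then moves toward it using the full per-step distribution rather than a scalar reward.

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # real data distribution (assumed known here)
g = np.array([1/3, 1/3, 1/3])   # generator distribution, to be updated

# Mediator step: MLE on a 50/50 mix of real and generated samples
# converges to the mixture density m = 0.5 * (p + g).
m = 0.5 * (p + g)

# Generator step: use the distribution-level signal log m - log g at the
# step, here via a multiplicative (exponentiated-gradient) update.
lr = 0.5
g_new = g * np.exp(lr * (np.log(m) - np.log(g)))
g_new /= g_new.sum()
# Every component of g_new moves toward the corresponding component of p.
```

The multiplicative update is one simple way to follow the signal; the paper instead backpropagates the factorized gradient through the generator's parameters.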

  14. Experiment: Synthetic Turing Test

  15. Experiment: Real-World Data • Quality test and diversity test on the EMNLP2017 WMT News section.

  16. Poster #44 Conclusion • Key Ideas: • Use a max-max game to replace the min-max game of GANs, while still minimizing the JSD. • Use the distribution-level signal from the introduced mediator at each step. • Advantages: • Works from scratch. • Achieves a trade-off-invariant performance gain while remaining computationally cheap.
