SLIDE 1

CoT: Cooperative Training for Generative Modeling of Discrete Data

Sidi Lu, Lantao Yu, Siyuan Feng, Yaoming Zhu, Weinan Zhang, and Yong Yu
Shanghai Jiao Tong University

https://github.com/desire2020/CoT

SLIDE 2

Autoregressive Models

  • Autoregressive models factorize the distribution sequentially to build a fully tractable density function:

$$q_\theta(y_0, y_1, \ldots, y_{n-1}) = q_\theta(y_0)\, q_\theta(y_1 \mid y_{[0:1)})\, q_\theta(y_2 \mid y_{[0:2)})\, q_\theta(y_3 \mid y_{[0:3)}) \cdots q_\theta(y_{n-1} \mid y_{[0:n-1)})$$

SLIDE 3

Teacher Forcing and Exposure Bias

  • For each sequence in the training set, maximize the estimated log-likelihood: $\max_\theta \sum_t \log q_\theta(y_t \mid y_{[0:t)})$

[Figure: teacher forcing. Starting from the initial state, the model p(x|s) at each step receives the ground-truth token ("I", "have", ...) as a forced observation, rather than its own prediction.]
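Teacher forcing is then just stochastic gradient ascent on that log-likelihood over the training set. A sketch reusing the `sequence_log_prob` helper from slide 2 (`optimizer` is any standard one, e.g. Adam):

```python
def teacher_forcing_step(model, optimizer, real_sequence):
    """One maximum-likelihood update. Every step conditions on the
    ground-truth prefix (a "forced observation"), never on a sample
    drawn from the model itself."""
    loss = -sequence_log_prob(model, real_sequence)  # negative log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```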

SLIDE 4

Teacher Forcing and Exposure Bias

  • When used to generate random samples:

[Figure: random sampling. Starting from the initial state, the model p(x|s) at each step draws a stochastic sample ("Billie", "Jean", ...) and feeds its own sample back as the observation for the next step.]
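At inference time the loop changes: a sketch of free-running sampling under the same hypothetical interface, where each step consumes the model's own previous samples:

```python
import torch

def free_running_sample(model, max_len):
    """Generation: each step conditions on the model's own stochastic
    samples, not on real data. This train/inference mismatch is what
    the next slide names exposure bias."""
    tokens = []
    for _ in range(max_len):
        logits = model(torch.tensor(tokens, dtype=torch.long))
        next_token = torch.distributions.Categorical(logits=logits).sample()
        tokens.append(int(next_token))    # feed the sample back as the "observation"
    return tokens
```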

SLIDE 5

Teacher Forcing and Exposure Bias

  • Exposure Bias [Ranzato et al., 2015]:
  • The intermediate process at the training stage and at the inference stage is inconsistent.
  • The resulting distribution shift accumulates along the timeline.

[Figure: at training time (teacher forcing) the model p(x|s) conditions on a real prefix; at inference time (random sampling) it conditions on a generated prefix.]

SLIDE 6

Exposure Bias and Kullback-Leibler Divergence

  • Exposure bias can also be regarded as a result of optimizing via minimization of the Kullback-Leibler divergence, denoted KL(P||Q) for distributions P, Q.
  • Maximum-likelihood (teacher-forcing) training minimizes KL(P||Q), with P the data distribution and Q the model.

SLIDE 7

Kullback-Leibler Divergence, Symmetry of Divergences

  • For any P, Q, KL(P||Q) does not necessarily equal KL(Q||P).
  • Smoothing and symmetrizing the KL yields the Jensen-Shannon divergence: $\mathrm{JSD}(P\|Q) = \frac{1}{2}\mathrm{KL}(P\|M) + \frac{1}{2}\mathrm{KL}(Q\|M)$
  • where $M = \frac{1}{2}(P + Q)$
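A small NumPy illustration of the asymmetry, and of how the mixture M symmetrizes it (the two toy distributions are chosen only for the demo):

```python
import numpy as np

def kl(p, q):
    """KL(P||Q) for discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    """Jensen-Shannon divergence: symmetrized, smoothed KL via M = (P+Q)/2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.8, 0.1, 0.1])
q = np.array([0.1, 0.2, 0.7])
print(kl(p, q), kl(q, p))    # differ: KL is asymmetric
print(jsd(p, q), jsd(q, p))  # identical: JSD is symmetric
```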
SLIDE 8

GAN, SeqGAN and Language GANs

  • Ian Goodfellow proposed the Generative Adversarial Network (GAN) [2014].
  • Ideally, GAN training minimizes the JSD between the data and generator distributions.
  • GANs cannot be directly applied to discrete sequence generation, since sampling discrete tokens is not differentiable.
  • SeqGAN uses the REINFORCE gradient estimator to resolve this, as sketched below.
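A sketch of the REINFORCE-style update SeqGAN relies on (hypothetical interfaces as before; the `discriminator` is assumed to return a scalar score for a finished sequence). Because sampling is not differentiable, the reward can only enter through the log-probabilities of the sampled tokens:

```python
import torch

def seqgan_generator_step(generator, discriminator, optimizer, max_len):
    """REINFORCE update: gradients cannot flow through discrete samples,
    so the generator is trained on reward-weighted log-probabilities."""
    tokens, log_probs = [], []
    for _ in range(max_len):
        logits = generator(torch.tensor(tokens, dtype=torch.long))
        dist = torch.distributions.Categorical(logits=logits)
        token = dist.sample()
        log_probs.append(dist.log_prob(token))
        tokens.append(int(token))
    reward = discriminator(torch.tensor(tokens))        # one scalar per sequence
    loss = -reward.detach() * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```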
SLIDE 9

Problems of SeqGAN

  • Not trivially able to work from scratch.
  • SeqGAN's work-around: pre-training via teacher forcing.
  • Trades diversity for quality (mode collapse), according to previous reports [Lu et al., 2018; Caccia et al., 2018].
SLIDE 10

Problems of SeqGAN

  • Training signal is too sparse.

[Figure: SeqGAN rollout. From the initial state, the generator p(x|s) stochastically samples tokens ("Billie", "Jean", ...) step by step, but the discriminator returns only a single-point signal for the entire sequence.]

SLIDE 11

Cooperative Training: Back to Formula!

  • Reconsider the algorithm from the standpoint of estimating and minimizing the JSD: $\mathrm{JSD}(P\|G) = \frac{1}{2}\mathrm{KL}(P\|M) + \frac{1}{2}\mathrm{KL}(G\|M)$
  • where $M = \frac{1}{2}(P + G)$
  • Instead of using a discriminator to achieve this, use another sequence model, called the "Mediator", to approximate the mixture density M (a training sketch follows below).
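Under this reading the mediator needs no adversarial loss at all. A sketch reusing `sequence_log_prob`: the mediator is an ordinary language model trained by maximum likelihood on a balanced mixture of real and generated sequences, so its density approaches M = 0.5(P + G):

```python
def mediator_step(mediator, optimizer, real_sequence, generated_sequence):
    """MLE on a 50/50 mixture of data and generator samples, pushing the
    mediator's density toward M = 0.5 * (P + G)."""
    loss = -0.5 * (sequence_log_prob(mediator, real_sequence) +
                   sequence_log_prob(mediator, generated_sequence))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```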

SLIDE 12

Cooperative Training: More Information from Mediator

  • Key Idea: the mediator provides a DISTRIBUTION-level signal at each time step.

[Figure: starting from the initial state, each generator step's next-token distribution G(x|s) is compared with the mediator's M(x|s) along the sampled rollout ("Billie", "Jean", ...), yielding a per-step signal.]

SLIDE 13

Cooperative Training: Factorizing the Cumulative Gradient Through Time, Final Objectives

  • Generator gradient: $\nabla_\theta J_g(\theta) = \mathbb{E}_{s \sim G_\theta}\!\left[\sum_t \nabla_\theta \pi_g(s_t)^{\top}\big(\log \pi_m(s_t) - \log \pi_g(s_t)\big)\right]$
  • where $\pi_g(s_t) = G_\theta(\cdot \mid s_t)$ and $\pi_m(s_t) = M_\phi(\cdot \mid s_t)$ are the generator's and the mediator's next-token distributions given prefix $s_t$.
  • Mediator objective: $J_m(\phi) = \frac{1}{2}\left(\mathbb{E}_{s \sim G_\theta}[\log M_\phi(s)] + \mathbb{E}_{s \sim P}[\log M_\phi(s)]\right)$ (a sketch of the generator update follows below)
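A sketch of the resulting generator update (same hypothetical interfaces; the mediator is held fixed during this step). At every position the full next-token distributions are compared over the whole vocabulary, so the signal is dense rather than one scalar per sequence:

```python
import torch

def cot_generator_step(generator, mediator, optimizer, max_len):
    """Cooperative Training generator update: per-step, distribution-level
    signal log(pi_m) - log(pi_g), summed over the vocabulary."""
    tokens, loss = [], torch.tensor(0.0)
    for _ in range(max_len):
        prefix = torch.tensor(tokens, dtype=torch.long)
        log_pi_g = torch.log_softmax(generator(prefix), dim=-1)
        with torch.no_grad():                        # mediator provides targets only
            log_pi_m = torch.log_softmax(mediator(prefix), dim=-1)
        advantage = log_pi_m - log_pi_g.detach()     # distribution-level signal
        loss = loss - (log_pi_g.exp() * advantage).sum()
        tokens.append(int(torch.distributions.Categorical(
            logits=log_pi_g.detach()).sample()))     # extend the rollout
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```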
SLIDE 14

Experiment: Synthetic Turing Test

SLIDE 15

Experiment: Real World Data

[Figures: quality test on the EMNLP2017 WMT News section; reasonable-diversity test on the EMNLP2017 WMT News section.]

SLIDE 16

Conclusion

  • Key Ideas:
  • Use a max-max game to replace the min-max game of GANs, while still targeting minimization of the JSD.
  • Use a distribution-level signal from the introduced mediator at each step.
  • Advantages:
  • Works from scratch, with no teacher-forcing pre-training.
  • Performance gains without the quality-diversity trade-off, while remaining computationally cheap.

Poster #44