Probabilistic Graphical Models: Inference & Learning in DL



slide-1
SLIDE 1

Probabilistic Graphical Models

Inference & Learning in DL

Zhiting Hu

Lecture 19, March 29, 2017

Reading:

1

slide-2
SLIDE 2

Deep Generative Models

2

l Explicit probabilistic models

l Provide an explicit parametric specification of the distribution of x

l Tractable likelihood function p_θ(x)

l E.g.,

   p(x, z | β) = p(x | z) p(z | β)

slide-3
SLIDE 3

Deep Generative Models

3

l Explicit probabilistic models

l Provide an explicit parametric specification of the distribution of x

l Tractable likelihood function p_θ(x)

l E.g., Sigmoid Belief Nets

   p(x_{n,i} = 1 | h_n^(1), w_i, b_i) = σ( w_i^T h_n^(1) + b_i )

   p(h_{n,j}^(1) = 1 | h_n^(2), w_j, b_j) = σ( w_j^T h_n^(2) + b_j )

   x_n ∈ {0,1}^D,   h_n^(1) ∈ {0,1}^{D_1},   h_n^(2) ∈ {0,1}^{D_2}
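To make the generative direction concrete, here is a minimal NumPy sketch of ancestral sampling from a two-layer sigmoid belief net. The layer widths, the random weights, and the Bernoulli(0.5) top-level prior are illustrative assumptions, not taken from the slides.

```python
# Ancestral sampling in a two-layer sigmoid belief net (h2 -> h1 -> x).
import numpy as np

rng = np.random.default_rng(0)
D2, D1, D = 5, 20, 100                               # assumed layer widths
W2, b2 = rng.normal(size=(D1, D2)), np.zeros(D1)     # h2 -> h1 weights
W1, b1 = rng.normal(size=(D, D1)), np.zeros(D)       # h1 -> x  weights

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_sbn(n):
    """Draw n samples x ~ p(x) by sampling top-down through the layers."""
    h2 = rng.binomial(1, 0.5, size=(n, D2))          # assumed Bernoulli(0.5) prior on h2
    h1 = rng.binomial(1, sigmoid(h2 @ W2.T + b2))    # p(h1_j = 1 | h2) = sigmoid(w_j^T h2 + b_j)
    x = rng.binomial(1, sigmoid(h1 @ W1.T + b1))     # p(x_i = 1 | h1)  = sigmoid(w_i^T h1 + b_i)
    return x

samples = sample_sbn(4)   # 4 binary vectors of length D
```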

slide-4
SLIDE 4

Deep Generative Models

4

l Explicit probabilistic models

l Provide an explicit parametric specification of the distribution of x

l Tractable likelihood function p_θ(x)

l E.g., deep generative model parameterized with NNs (e.g., VAEs)

   p_θ(x | z) = N( x; μ_θ(z), σ² I ),   p(z) = N( z; 0, I )

slide-5
SLIDE 5

Deep Generative Models

5

l Implicit probabilistic models

l Define a stochastic process to simulate data x

l Do not require a tractable likelihood function

l Data simulator

l Natural approach for problems in population genetics, weather, ecology, etc.

l E.g., generate data from a deterministic equation given parameters and random noise (e.g., GANs)

   x_n = h(z_n; θ),   z_n ∼ N(0, I)

slide-6
SLIDE 6

Recap: Variational Inference

l Consider a probabilistic model p_θ(x, z)
l Assume a variational distribution q_φ(z|x)
l Lower bound for the log-likelihood:

   log p(x) = KL( q_φ(z|x) || p_θ(z|x) ) + E_{q_φ(z|x)}[ log p_θ(x, z) / q_φ(z|x) ]

            ≥ E_{q_φ(z|x)}[ log p_θ(x, z) / q_φ(z|x) ]  := L(θ, φ; x)

l Free energy:

   F(θ, φ; x) = -log p(x) + KL( q_φ(z|x) || p_θ(z|x) )

6

slide-7
SLIDE 7

Wake Sleep Algorithm

l Consider a generative model p_θ(x|z)

l E.g., sigmoid belief nets

l Variational bound:

   log p(x) ≥ E_{q_φ(z|x)}[ log p_θ(x, z) / q_φ(z|x) ]  := L(θ, φ; x)

l Use an inference network q_φ(z|x)
l Maximize the bound w.r.t. θ:

   max_θ E_{q_φ(z|x)}[ log p_θ(x|z) ]

Wake phase

l Get samples of z from q_φ(z|x) through a bottom-up pass
l Use the samples as targets for updating the generator p_θ

7

slide-8
SLIDE 8

Wake Sleep Algorithm

l [Hinton et al., Science 1995]
l Generally applicable to a wide range of generative models by training a separate inference network

l Consider a generative model p_θ(x|z), with prior p(z)

l E.g., multi-layer belief nets

l Free energy:

   F(θ, φ; x) = -log p(x) + KL( q_φ(z|x) || p_θ(z|x) )

l Inference network q_φ(z|x)

l a.k.a. recognition network

8

slide-9
SLIDE 9

Wake Sleep Algorithm

l Free energy:

   F(θ, φ; x) = -log p(x) + KL( q_φ(z|x) || p_θ(z|x) )

l Minimize the free energy w.r.t. θ:

   max_θ E_{q_φ(z|x)}[ log p_θ(x|z) ]

Wake phase

l Get samples of z from q_φ(z|x) through a bottom-up pass on training data
l Use the samples as targets for updating the generator p_θ

9

[Figure: recognition weights R1, R2 mapping data x up through the hidden layers]

[Figure courtesy: Maei's slides]

slide-10
SLIDE 10

Wake Sleep Algorithm

l Free energy:

   F(θ, φ; x) = -log p(x) + KL( q_φ(z|x) || p_θ(z|x) )

l Minimizing the free energy w.r.t. φ is computationally expensive / high variance
l Instead, minimize the reversed-KL free energy w.r.t. φ:

   F'(θ, φ; x) = -log p(x) + KL( p_θ(z|x) || q_φ(z|x) )

Sleep phase

l "Dream" up samples (x, z) from p_θ(x, z) through a top-down pass
l Use the samples as targets for updating the recognition network:

   max_φ E_{p_θ(x,z)}[ log q_φ(z|x) ]

10

[Figure: generative weights G1, G2 and recognition weights R1, R2 between x and the hidden layers]

slide-11
SLIDE 11

Wake Sleep Algorithm

l Wake phase:

l Use the recognition network to perform a bottom-up pass in order to create samples for the layers above (from data)

l Train the generative network using the samples obtained from the recognition model

l Sleep phase:

l Use the generative weights to reconstruct data by performing a top-down pass

l Train the recognition weights using the samples obtained from the generative model

l KL is not symmetric
l Doesn't optimize a well-defined objective function
l Not guaranteed to converge

11

[Figure: generative weights G1, G2 and recognition weights R1, R2 between x and the hidden layers]
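To make the two phases concrete, here is a minimal NumPy sketch of one wake-sleep iteration for a one-hidden-layer sigmoid belief net: generator p_θ(x|h), recognition network q_φ(h|x). The layer sizes, learning rate, and the fixed Bernoulli(0.5) prior p(h) are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, lr = 20, 8, 0.05                                 # visible size, hidden size, step size
Wg, bg = 0.01 * rng.normal(size=(D, H)), np.zeros(D)   # generative weights θ: h -> x
Wr, br = 0.01 * rng.normal(size=(H, D)), np.zeros(H)   # recognition weights φ: x -> h

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bernoulli(p):
    return (rng.random(p.shape) < p).astype(float)

def wake_sleep_step(x):
    global Wg, bg, Wr, br
    # Wake phase: sample h ~ q_φ(h|x) bottom-up, use (x, h) as a target for the generator.
    h = bernoulli(sigmoid(Wr @ x + br))
    px = sigmoid(Wg @ h + bg)                  # p_θ(x_i = 1 | h)
    Wg += lr * np.outer(x - px, h)             # ascend log p_θ(x | h)
    bg += lr * (x - px)

    # Sleep phase: "dream" (x', h') ~ p_θ top-down, use it as a target for the recognizer.
    h_d = bernoulli(np.full(H, 0.5))           # prior p(h) kept fixed for simplicity
    x_d = bernoulli(sigmoid(Wg @ h_d + bg))
    qh = sigmoid(Wr @ x_d + br)                # q_φ(h_j = 1 | x')
    Wr += lr * np.outer(h_d - qh, x_d)         # ascend log q_φ(h' | x')
    br += lr * (h_d - qh)

# Toy usage on random binary "data" standing in for a real dataset
for _ in range(100):
    wake_sleep_step(bernoulli(np.full(D, 0.3)))
```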

slide-12
SLIDE 12

Variational Auto-encoders (VAEs)

l [Kingma & Welling, 2014]
l Enjoy similar applicability to the wake-sleep algorithm

l Not applicable to discrete latent variables, however

l Optimize a variational lower bound on the log-likelihood

l Reduce variance through reparameterization of the recognition distribution

l Alternative: use control variates as in reinforcement learning [Mnih & Gregor, 2014]

12

slide-13
SLIDE 13

Variational Auto-encoders (VAEs)

l Generative model p_θ(x|z), with prior p(z)

l a.k.a. decoder

l Inference network q_φ(z|x)

l a.k.a. encoder, recognition network

l Variational lower bound:

   log p(x) ≥ E_{q_φ(z|x)}[ log p_θ(x|z) ] - KL( q_φ(z|x) || p(z) )  := L(θ, φ; x)

13

slide-14
SLIDE 14

Variational Auto-encoders (VAEs)

l Variational lower bound:

   L(θ, φ; x) = E_{q_φ(z|x)}[ log p_θ(x|z) ] - KL( q_φ(z|x) || p(z) )

l Optimize L(θ, φ; x) w.r.t. θ (the generative model p_θ(x|z))

l The same as the wake phase

l Optimize L(θ, φ; x) w.r.t. φ (the inference network q_φ(z|x))

l Directly computing the gradient with MC estimation gives a REINFORCE-like update rule, which suffers from high variance [Mnih & Gregor 2014] (next lecture for more on REINFORCE)

l VAEs use a reparameterization trick to reduce variance

14

slide-15
SLIDE 15

VAEs: Reparameterization Trick


15

slide-16
SLIDE 16

VAEs: Reparameterization Trick

l Recognition distribution:

   q_φ( z^(i) | x^(i) ) = N( z^(i); μ^(i), (σ^(i))² I )

l z = g_φ(ε) is a deterministic mapping of the noise ε ∼ N(0, I)

16

[Figure courtesy: Chang’s slides]

slide-17
SLIDE 17

VAEs: Reparameterization Trick

l Variational lower bound:

   L(θ, φ; x) = E_{q_φ(z|x)}[ log p_θ(x|z) ] - KL( q_φ(z|x) || p(z) )

l Optimize w.r.t. q_φ(z|x), i.e., w.r.t. φ

l Uses the gradients w.r.t. the latent variables:

   E_{q_φ(z|x)}[ log p_θ(x|z) ] = E_{ε∼N(0,I)}[ log p_θ( x | g_φ(ε) ) ]

   ∇_φ E_{q_φ(z|x)}[ log p_θ(x|z) ] = E_{ε∼N(0,I)}[ ∇_φ log p_θ( x | g_φ(ε) ) ]

l The KL term KL( q_φ(z|x) || p(z) ): for Gaussian distributions, can be computed and differentiated analytically

17

slide-18
SLIDE 18

VAEs: Training


18

slide-19
SLIDE 19

VAEs: Results


19

slide-20
SLIDE 20

VAEs: Results


20

Generated MNIST images [Gregor et al., 2015]

slide-21
SLIDE 21

VAEs: Limitations and variants

l Element-wise reconstruction error

l For image generation, needs to reconstruct every pixel

l Sensitive to irrelevant variance, e.g., translations

l Variant: feature-wise (perceptual-level) reconstruction [Dosovitskiy et al., 2016]

l Use a pre-trained neural network to extract features of data

l Generated images are required to have feature vectors similar to those of the data

l Variant: combining VAEs with GANs [Larsen et al., 2016] (more later)

21

Reconstruction results with different losses

slide-22
SLIDE 22

VAEs: Limitations and variants

l Not applicable to discrete latent variables

l

Differentiable reparameterization does not apply to discrete variables

l

Wake-sleep algorithm/GANs allow discrete latents

l

Variant: marginalize out discrete latents [Kingma et al., 2014]

l

Expensive when the discrete space is large

l

Variant: use continuous approximations

l

Gumbel-softmax [Jang et al., 2017] for approximating multinomial variables (see the sketch below)

l

Variant: combine VAEs with wake-sleep algorithm [Hu et al., 2017]

22
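For reference, a minimal NumPy sketch of the Gumbel-softmax relaxation mentioned above: a continuous, differentiable surrogate for sampling from a categorical distribution. The logits and temperature are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_sample(logits, tau=0.5):
    """Return a point on the simplex that approaches a one-hot sample as tau -> 0."""
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))          # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = np.exp(y - y.max())          # numerically stable softmax
    return y / y.sum()

sample = gumbel_softmax_sample(np.array([1.0, 0.5, -1.0]), tau=0.5)
```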

slide-23
SLIDE 23

VAEs: Limitations and variants

l Usually use a fixed standard normal distribution as prior:

   p(z) = N( z; 0, I )

l For ease of inference and learning

l Limited flexibility: converts the data distribution to a fixed, single-mode prior distribution

l Variant: use hierarchical nonparametric priors [Goyal et al., 2017]

l E.g., Dirichlet process, nested Chinese restaurant process (more later)

l Learn the structures of priors jointly with the model

23


slide-25
SLIDE 25

Deep Generative Models

25

l Implicit probabilistic models

l Define a stochastic process to simulate data x

l Do not require a tractable likelihood function

l Data simulator

l Natural approach for problems in population genetics, weather, ecology, etc.

l E.g., generate data from a deterministic equation given parameters and random noise (e.g., GANs)

   x_n = h(z_n; θ),   z_n ∼ N(0, I)

slide-26
SLIDE 26

Generative Adversarial Nets (GANs)

l [Goodfellow et al., 2014]

l Assume an implicit generative model

l Learn the cost function jointly

l Interpreted as a minimax game between a generator and a discriminator

l Generate sharp, high-fidelity samples

26

slide-27
SLIDE 27

Generative Adversarial Nets (GANs)

l Generator G

l Maps a noise variable z to data space:

   x = G(z; θ_g),   z ∼ N(0, I)

l Discriminator D

l Outputs the probability that x came from the data rather than the generator:

   D(x; θ_d) ∈ [0, 1]

l Learning

l Train D to maximize the probability of assigning the correct label to both training examples and generated samples

l Train G to fool the discriminator

27

[Figure courtesy: Kim’s slides]

slide-28
SLIDE 28

GANs: Learning

l For D: binary cross entropy with label 1 for real, 0 for fake

l For G:

   min_G E_z[ log(1 - D(G(z))) ]

l Alternate training of D and G

28

slide-29
SLIDE 29

GANs: Theoretical results

l For a fixed G, the optimal discriminator is D*_G(x) = p_data(x) / ( p_data(x) + p_g(x) )
l Plug it into the minimax objective:

   C(G) = max_D V(G, D)
        = E_{x∼p_data}[ log D*_G(x) ] + E_{z∼p_z}[ log(1 - D*_G(G(z))) ]
        = E_{x∼p_data}[ log D*_G(x) ] + E_{x∼p_g}[ log(1 - D*_G(x)) ]
        = E_{x∼p_data}[ log ( p_data(x) / (p_data(x) + p_g(x)) ) ]
          + E_{x∼p_g}[ log ( p_g(x) / (p_data(x) + p_g(x)) ) ]

l The global minimum of the virtual training criterion C(G) is C* = -log 4, achieved if and only if p_g = p_data

29

slide-30
SLIDE 30

GANs in Practice

l Optimizing D to completion in the inner loop is computationally prohibitive

l Alternate between k steps of optimizing D and one step of optimizing G

l Optimizing G with

   min_G E_z[ log(1 - D(G(z))) ]

l suffers from vanishing gradients when D is too strong

l Use  max_G E_z[ log D(G(z)) ]  in practice instead

30

slide-31
SLIDE 31

GANs in Practice

l Instability of training

l Requires a careful balance between the training of D and G

l [Arjovsky & Bottou, 2017]: under certain conditions, the gradient of  E_z[ log D(G(z)) ]  follows a centered Cauchy distribution with infinite expectation and variance

l Mode collapse

l Generated samples are often from only a few modes of the data distribution

l A set of heuristics attempt to fix these problems

l e.g., [Salimans et al., 2016]: minibatch discrimination, one-sided label smoothing, ...

31

slide-32
SLIDE 32

GANs: Applications & results

l Generating images

32

slide-33
SLIDE 33

GANs: Applications & results

l Generating images

l

Translating images (e.g., Isola et al., 2016)

33

slide-34
SLIDE 34

GANs: Applications & results

l Generating images

l

Translating images (e.g., Isola et al., 2016)

l

Domain adaptation (e.g., Purushotham et al., 2017)

l

Imitation learning (e.g., Ho & Ermon 2016)

l

…

34

slide-35
SLIDE 35

GANs vs. VAEs

l Variational Auto-encoders:

l

Probabilistic graphical model framework

l

Allow efficient Bayesian inference

l

Generated samples tend to be blurry

l

An issue of maximum likelihood training

l

Do not support discrete latent variables

l

GANs:

l

Generate sharp images

l

Do not support inference (x → z)

l

Do not support discrete visible variables

35

slide-36
SLIDE 36

GANs: Limitations & variants

l Do not support inference

l

No mechanism for inferring z from x

l

Variants: additionally learn an inference network

l

[Dumoulin et al., 2016; Donahue et al., 2016]

36

slide-37
SLIDE 37

GANs: Limitations & variants

37

l Do not support discrete visible variables

l

Non-differentiability of samples hinders gradient backpropagation

l

Variant: treats generator training as policy learning [Yu et al., 2017]

l

High variance, slow convergence

l

Variant: continuous approximations

l

Gumbel-softmax [Kusner & Hernández-Lobato, 2016]

l

Only qualitative results with small discrete space

l

Variant: combines VAEs with GANs-like discriminators [Hu et al., 2017]

l

Uses VAEs to handle discrete visibles, GANs to handle discrete latents

l

Wake-sleep style learning

slide-38
SLIDE 38

GANs: Limitations & variants

38

l Unable to control the attributes of generated samples

l Uninterpretability of the input latent vector z

l An issue shared with VAEs and other DNN methods

l Variants: add a mutual-information regularizer to enforce disentangled hidden codes [Chen et al., 2016]

l Unsupervised

l The semantics of each dimension are observed after training, rather than designated by users in a controllable way

slide-39
SLIDE 39

GANs: Limitations & variants

39

l Unable to control the attributes of generated samples

l Uninterpretability of the input latent vector z

l An issue shared with VAEs and other DNN methods

l Variants: add a mutual-information regularizer to enforce disentangled hidden codes [Chen et al., 2016]

l Unsupervised

l The semantics of each dimension are observed after training, rather than designated by users in a controllable way

l Variants: use supervision information to enforce designated semantics on certain dimensions of z [Odena et al., 2017; Hu et al., 2017]

slide-40
SLIDE 40

Harnessing DNNs with Logic Rules

l [Hu et al., 2016]
l Deep NNs

l

Heavily rely on massive labeled data

l

Uninterpretable

l

Hard to encode human intention/domain knowledge

40

slide-41
SLIDE 41

Harnessing DNNs with Logic Rules

l How humans learn

l

Learn from concrete examples (as DNNs do)

l

Learn from general knowledge and rich experiences [Minsky 1980; Lake et al., 2015]

l

E.g., the past tense of verbs [1]:

41

[1] https://www.technologyreview.com/s/544606/can-this-man-make-aimore-human

Examples: add -> added, accept -> accepted, ignore -> ignored, end -> ended, block -> blocked, love -> loved, ...

vs.

Rule: regular verbs take -d/-ed

slide-42
SLIDE 42

DNNs + knowledge

l Logic rule

l

A flexible declarative language

l

Expresses structured knowledge

l

E.g., sentence sentiment analysis

l Input-target space: (X, Y)
l Soft first-order logic (FOL) rules: (r, λ)

l r(X, Y) ∈ [0, 1]

l λ: confidence level of the rule

42

S has 'A-but-B' structure ⇒ ( sentiment(S) ⇔ sentiment(B) )

slide-43
SLIDE 43

Rule knowledge distillation

l Neural network p_θ(y|x)

43

[Equation annotations at iteration t: soft prediction of p_θ; true hard label]

slide-44
SLIDE 44

Rule knowledge distillation

l Neural network p_θ(y|x)
l Train to imitate the outputs of a rule-regularized teacher network (i.e., distillation)

44

[Equation annotations at iteration t: soft prediction of p_θ; true hard label; soft prediction of the teacher network]

slide-45
SLIDE 45

Rule knowledge distillation

l Neural network p_θ(y|x)
l Train to imitate the outputs of a rule-regularized teacher network (i.e., distillation)

45

[Equation annotations at iteration t: soft prediction of p_θ; true hard label; balancing parameter; soft prediction of the teacher network]
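The objective these annotations refer to has, following Hu et al. (2016), roughly the form below; it is reconstructed here since the slide shows it only as an image:

```latex
\theta^{(t+1)} = \arg\min_{\theta}\ \frac{1}{N}\sum_{n=1}^{N}
  \Big[(1-\pi)\,\ell\big(y_n,\, \sigma_\theta(x_n)\big)
     + \pi\,\ell\big(s_n^{(t)},\, \sigma_\theta(x_n)\big)\Big]
```

Here σ_θ(x_n) is the student network's soft prediction, y_n the true hard label, s_n^(t) the teacher's soft prediction at iteration t, π the balancing (imitation) parameter, and ℓ a cross-entropy-style loss.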

slide-46
SLIDE 46

Teacher network construction

l Teacher network: q(y|x)

l Comes out of p_θ

l Fits the logic rules: E_q[ r(X, Y) ] = 1, with confidence λ

46

[Equation annotations: slack variable; rule constraints; closed-form solution]
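The construction these annotations refer to is, following Hu et al. (2016), roughly the constrained projection of p_θ below (reconstructed here; the slide shows it only as an image), with slack variables ξ_l and rule-expectation constraints:

```latex
\min_{q,\,\xi\ge 0}\ \mathrm{KL}\big(q(Y|X)\,\|\,p_\theta(Y|X)\big) + C\sum_{l}\xi_l
\quad \text{s.t.}\quad \lambda_l\big(1 - \mathbb{E}_q[r_l(X,Y)]\big) \le \xi_l
```

with closed-form solution

```latex
q^*(Y|X) \ \propto\ p_\theta(Y|X)\,
  \exp\Big\{-\sum_l C\,\lambda_l\big(1 - r_l(X,Y)\big)\Big\}
```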

slide-47
SLIDE 47

Method summary

47

l At each iteration

l

Construct a teacher network through posterior constraints

l

Train the NN to emulate the predictions of the teacher

slide-48
SLIDE 48

Application: sentiment classification

l Sentence => positive/negative
l Base network: CNN [Kim, 2014]

l

Rule knowledge:

l

sentence S with structure A-but-B: => sentiment of B dominates

48

slide-49
SLIDE 49

Results

49

l accuracy (%)

slide-50
SLIDE 50

References

l

[Dosovitskiy et al., 2016] β€œGenerating Images with Perceptual Similarity Metrics based on Deep Networks”, NIPS’16

l

[Larsen et al., 2016] "Autoencoding beyond pixels using a learned similarity metric", ICML'16

l

[Kingma et al., 2014] β€œSemi-supervised learning with deep generative models”, NIPS’14

l

[Jang et al., 2017] β€œCategorical Reparameterization with Gumbel-Softmax”, ICLR’17

l

[Hu et al., 2017] β€œControllable Text Generation”, 2017

l

[Goyal et al., 2017] β€œNonparametric Variational Auto-encoders for Hierarchical Representation Learning”, 2017

l

[Goodfellow et al., 2014] β€œGenerative Adversarial Nets”, NIPS’14

l

[Arjovsky & Bottou, 2017] "Towards Principled Methods for Training Generative Adversarial Networks", ICLR'17

l

[Salimans et al., 2016] β€œImproved Techniques for Training GANs”, NIPS’16

l

[Isola et al., 2016] β€œImage-to-Image Translation with Conditional Adversarial Networks”, 2016

l

[Purushotham et al., 2017] β€œVariational Recurrent Adversarial Deep Domain Adaptation”, ICLR’17

l

[Ho & Ermon 2016] β€œGenerative Adversarial Imitation Learning”, NIPS’16

l

[Dumoulin, et al., 2016] β€œAdversarially Learned Inference”, ICLR’17

l

[Donahue et al., 2016] β€œAdversarial Feature Learning”, 2016

l

[Yu et al., 2017] β€œSeqGAN: Sequence Generative Adversarial Nets with Policy Gradient”, AAAI’17

l

[Kusner & Hernández-Lobato, 2016] "GANs for sequences of discrete elements with the Gumbel-softmax distribution", 2016

l

[Chen et al., 2016] "InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets", NIPS'16

l

[Odena et al., 2017] β€œConditional image synthesis with auxiliary classifier GANs”, 2017

l

[Hu et al., 2016] "Harnessing DNNs with logic rules", ACL'16

l

[Minsky, 1980] Learning meaning. Technical Report AI Lab Memo. 1980

l

[Lake et al., 2015] Human-level concept learning through probabilistic program induction. Science’15.

l

[Kim, 2014] Convolutional neural networks for sentence classification. EMNLP’14

l

Hamid Reza Maei, Wake-Sleep algorithm for Representational Learning

l

Mark Chang, Variational Autoencoder

l

Namju Kim, Generative Adversarial Networks (GAN)

50