15-780 Graduate Artificial Intelligence: Probabilistic inference


SLIDE 1

15-780 – Graduate Artificial Intelligence: Probabilistic inference

  • J. Zico Kolter (this lecture) and Nihar Shah

Carnegie Mellon University Spring 2020

SLIDE 2

Outline

  • Probabilistic graphical models
  • Probabilistic inference
  • Exact inference
  • Sample-based inference
  • A brief look at deep generative models

SLIDE 3

Outline

  • Probabilistic graphical models
  • Probabilistic inference
  • Exact inference
  • Sample-based inference
  • A brief look at deep generative models

SLIDE 4

Probabilistic graphical models

Probabilistic graphical models are all about representing distributions $p(X)$, where $X$ represents some large set of random variables.

Example: suppose $X \in \{0,1\}^n$ (an $n$-dimensional binary random variable); it would take $2^n - 1$ parameters to describe the full joint distribution.

Graphical models offer a way to represent these same distributions more compactly, by exploiting conditional independencies in the distribution.

Note: I'm going to use "probabilistic graphical model" and "Bayesian network" interchangeably, even though there are differences.

SLIDE 5

Bayesian networks

A Bayesian network is defined by:

  • 1. A directed acyclic graph, $G = \{V = \{X_1, \ldots, X_n\}, E\}$
  • 2. A set of conditional distributions $p(X_i \mid \text{Parents}(X_i))$

It defines the joint probability distribution
$$p(X) = \prod_{i=1}^{n} p(X_i \mid \text{Parents}(X_i))$$
Equivalently: each node is conditionally independent of all non-descendants given its parents.

SLIDE 6

Example Bayesian network

Conditional independencies let us simplify the joint distribution:
$$p(X_1, X_2, X_3, X_4) = p(X_1)\, p(X_2 \mid X_1)\, p(X_3 \mid X_1, X_2)\, p(X_4 \mid X_1, X_2, X_3) = p(X_1)\, p(X_2 \mid X_1)\, p(X_3 \mid X_2)\, p(X_4 \mid X_3)$$

[Figure: chain Bayesian network $X_1 \to X_2 \to X_3 \to X_4$]

Full joint distribution: $2^4 - 1 = 15$ parameters (assuming binary variables)
SLIDE 7

Example Bayesian network

Conditional independencies let us simplify the joint distribution:
$$p(X_1, X_2, X_3, X_4) = p(X_1)\, p(X_2 \mid X_1)\, p(X_3 \mid X_1, X_2)\, p(X_4 \mid X_1, X_2, X_3) = p(X_1)\, p(X_2 \mid X_1)\, p(X_3 \mid X_2)\, p(X_4 \mid X_3)$$

[Figure: chain Bayesian network $X_1 \to X_2 \to X_3 \to X_4$]

Full joint distribution: $2^4 - 1 = 15$ parameters (assuming binary variables); in the factored form, $p(X_1)$ takes 1 parameter and $p(X_2 \mid X_1)$ takes 2 parameters
SLIDE 8

Example Bayesian network

Conditional independencies let us simplify the joint distribution:
$$p(X_1, X_2, X_3, X_4) = p(X_1)\, p(X_2 \mid X_1)\, p(X_3 \mid X_1, X_2)\, p(X_4 \mid X_1, X_2, X_3) = p(X_1)\, p(X_2 \mid X_1)\, p(X_3 \mid X_2)\, p(X_4 \mid X_3)$$

[Figure: chain Bayesian network $X_1 \to X_2 \to X_3 \to X_4$]

Full joint distribution: $2^4 - 1 = 15$ parameters (assuming binary variables); the factored form needs only $1 + 2 + 2 + 2 = 7$ parameters
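The two counts above can be checked mechanically. A small sketch (an illustration, not from the slides): a binary node with $k$ binary parents needs $2^k$ free parameters, one Bernoulli parameter per parent configuration.

```python
# Parameter counting for binary-variable Bayesian networks.
def num_parameters(parents):
    # Each node contributes 2**(number of parents) free parameters.
    return sum(2 ** len(pa) for pa in parents.values())

chain = {"X1": [], "X2": ["X1"], "X3": ["X2"], "X4": ["X3"]}
full = {"X1": [], "X2": ["X1"], "X3": ["X1", "X2"], "X4": ["X1", "X2", "X3"]}

print(num_parameters(chain))  # 1 + 2 + 2 + 2 = 7
print(num_parameters(full))   # 1 + 2 + 4 + 8 = 15 = 2**4 - 1
```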
SLIDE 9

Poll: Simple Bayesian network

What conditional independencies exist in the following Bayesian network?

  • 1. $X_1$ and $X_2$ are marginally independent
  • 2. $X_4$ is conditionally independent of $X_1$ given $X_3$
  • 3. $X_1$ is conditionally independent of $X_4$ given $X_3$
  • 4. $X_1$ is conditionally independent of $X_2$ given $X_3$

[Figure: Bayesian network over $X_1, X_2, X_3, X_4$]
SLIDE 10

Generative model

Can also describe the probability distribution as a sequential "story"; this is called a generative model:
$$X_1 \sim \text{Bernoulli}(\phi^{(1)})$$
$$X_2 \mid X_1 = x_1 \sim \text{Bernoulli}(\phi^{(2)}_{x_1})$$
$$X_3 \mid X_2 = x_2 \sim \text{Bernoulli}(\phi^{(3)}_{x_2})$$
$$X_4 \mid X_3 = x_3 \sim \text{Bernoulli}(\phi^{(4)}_{x_3})$$

"First sample $X_1$ from a Bernoulli distribution with parameter $\phi^{(1)}$, then sample $X_2$ from a Bernoulli distribution with parameter $\phi^{(2)}_{x_1}$, where $x_1$ is the value we sampled for $X_1$, then sample $X_3$ from a Bernoulli …"

[Figure: chain Bayesian network $X_1 \to X_2 \to X_3 \to X_4$]
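This "story" is exactly ancestral sampling, and it is easy to run. A sketch with made-up parameter values (the `phi` numbers are hypothetical, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

phi1 = 0.3            # phi^(1)
phi2 = [0.1, 0.6]     # phi^(2)_{x1}, indexed by the sampled x1
phi3 = [0.2, 0.5]     # phi^(3)_{x2}
phi4 = [0.4, 0.8]     # phi^(4)_{x3}

def sample_chain():
    """Sample each variable in turn, conditioning on its sampled parent."""
    x1 = rng.binomial(1, phi1)
    x2 = rng.binomial(1, phi2[x1])
    x3 = rng.binomial(1, phi3[x2])
    x4 = rng.binomial(1, phi4[x3])
    return x1, x2, x3, x4

print([sample_chain() for _ in range(3)])
```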
SLIDE 11

More general generative models

This notion of a "sequential story" (a generative model) is extremely powerful for describing very general distributions.

Naive Bayes:
$$Y \sim \text{Bernoulli}(\phi), \qquad X_i \mid Y = y \sim \text{Categorical}(\phi^{(i)}_y)$$

Gaussian mixture model:
$$Z \sim \text{Categorical}(\phi), \qquad X \mid Z = z \sim \mathcal{N}(\mu_z, \Sigma_z)$$
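As a sketch of the mixture model's "story" (all parameter values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

phi = np.array([0.4, 0.6])                  # mixing weights
mu = np.array([[0.0, 0.0], [3.0, 3.0]])     # component means
Sigma = [np.eye(2), 0.5 * np.eye(2)]        # component covariances

def sample_gmm(m):
    """First sample z ~ Categorical(phi), then x | z ~ N(mu_z, Sigma_z)."""
    z = rng.choice(len(phi), size=m, p=phi)
    x = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    return z, x

z, x = sample_gmm(5)
print(z)
print(x)
```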

SLIDE 12

More general generative models

Linear regression:
$$Y \mid X = x \sim \mathcal{N}(\theta^T x, \sigma^2)$$

Changepoint model:
$$X \sim \text{Uniform}(0,1), \qquad Y \mid X = x \sim \begin{cases} \mathcal{N}(\mu_1, \sigma^2) & \text{if } x < t \\ \mathcal{N}(\mu_2, \sigma^2) & \text{if } x \geq t \end{cases}$$

Latent Dirichlet Allocation ($N$ documents, $K$ topics, $M_j$ words per document):
$$\theta_j \sim \text{Dirichlet}(\alpha) \quad \text{(topic distribution per document)}$$
$$\phi_k \sim \text{Dirichlet}(\beta) \quad \text{(word distribution per topic)}$$
$$z_{j,\ell} \sim \text{Categorical}(\theta_j) \quad \text{(topic of the } \ell\text{th word in document } j)$$
$$w_{j,\ell} \sim \text{Categorical}(\phi_{z_{j,\ell}}) \quad \text{(the } \ell\text{th word in document } j)$$
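The LDA story also translates directly into code. A sketch with hypothetical sizes and hyperparameters (and, for brevity, a fixed number of words per document):

```python
import numpy as np

rng = np.random.default_rng(0)

N, K, V, M = 4, 3, 10, 6     # documents, topics, vocabulary size, words/doc
alpha, beta = 0.5, 0.1       # Dirichlet hyperparameters

theta = rng.dirichlet(alpha * np.ones(K), size=N)  # topic dist. per document
phi = rng.dirichlet(beta * np.ones(V), size=K)     # word dist. per topic

docs = []
for j in range(N):
    z = rng.choice(K, size=M, p=theta[j])          # topic of each word
    w = [rng.choice(V, p=phi[zl]) for zl in z]     # word given its topic
    docs.append(w)
print(docs)
```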

SLIDE 13

Outline

  • Probabilistic graphical models
  • Probabilistic inference
  • Exact inference
  • Sample-based inference
  • A brief look at deep generative models

SLIDE 14

The inference problem

Given observations (i.e., knowing the value of some of the variables in a model), what is the distribution over the other (hidden) variables?

A relatively "easy" problem if we observe variables at the "beginning" of chains in a Bayesian network:

  • If we observe the value of $X_1$, then $X_2, X_3, X_4$ have the same distribution as before, just with $X_1$ "fixed"
  • But if we observe $X_4$, what is the distribution over $X_1, X_2, X_3$?

[Figure: two copies of the chain $X_1 \to X_2 \to X_3 \to X_4$, illustrating observing $X_1$ versus observing $X_4$]
SLIDE 15

Many types of inference problems

Marginal inference: given a generative distribution $p(X)$ over $X = \{X_1, \ldots, X_n\}$, determine $p(X_I)$ for $I \subseteq \{1, \ldots, n\}$

MAP inference: determine the assignment with the maximum probability

Conditional variants: solve either of the two problems above conditioned on some observed variables, e.g. $p(X_I \mid X_E = x_E)$

SLIDE 16

Approaches to inference

There are three categories of common approaches to inference (more exist, but these are the most common):

  • 1. Exact methods: Bayes' rule or variable elimination methods
  • 2. Sampling approaches: draw samples from the distribution over hidden variables, without constructing it explicitly
  • 3. Approximate variational approaches: approximate the distributions over hidden variables using "simple" distributions, minimizing the difference between these distributions and the true distributions

SLIDE 17

Outline

  • Probabilistic graphical models
  • Probabilistic inference
  • Exact inference
  • Sample-based inference
  • A brief look at deep generative models

SLIDE 18

Exact inference example

Mixture of Gaussians model:
$$Z \sim \text{Categorical}(\phi), \qquad X \mid Z = z \sim \mathcal{N}(\mu_z, \Sigma_z)$$
Task: compute $p(Z \mid x)$. In this case, we can solve inference exactly with Bayes' rule:
$$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{\sum_{z'} p(x \mid z')\, p(z')}$$

[Figure: two-node network $Z \to X$]
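A sketch of this computation in Python (the mixture parameters are made up; the Gaussian density is written out so the example stays self-contained):

```python
import numpy as np

def gauss_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) evaluated at x."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

phi = np.array([0.4, 0.6])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigma = [np.eye(2), 0.5 * np.eye(2)]

def posterior_z(x):
    """Bayes' rule: p(z | x) proportional to p(x | z) p(z)."""
    unnorm = np.array([gauss_pdf(x, mu[k], Sigma[k]) * phi[k]
                       for k in range(len(phi))])
    return unnorm / unnorm.sum()

print(posterior_z(np.array([2.0, 2.0])))
```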
SLIDE 19

Exact inference in graphical models

In some cases, it's possible to exploit the structure of the graphical model to develop efficient exact inference methods.

Example: how can I compute $p(X_4)$?
$$p(X_4) = \sum_{x_1, x_2, x_3} p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2)\, p(X_4 \mid x_3)$$

[Figure: chain Bayesian network $X_1 \to X_2 \to X_3 \to X_4$]
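Computed naively, this sum has exponentially many terms, but pushing each sum inward (variable elimination) reduces it to a sequence of small matrix products. A sketch for the binary chain with made-up CPTs:

```python
import numpy as np

# Hypothetical CPTs; row index is the parent's value.
p1 = np.array([0.7, 0.3])                  # p(x1)
p21 = np.array([[0.9, 0.1], [0.4, 0.6]])   # p(x2 | x1)
p32 = np.array([[0.8, 0.2], [0.5, 0.5]])   # p(x3 | x2)
p43 = np.array([[0.6, 0.4], [0.2, 0.8]])   # p(x4 | x3)

# Variable elimination: sum out x1, then x2, then x3.
p4 = ((p1 @ p21) @ p32) @ p43
print(p4)

# Naive check: enumerate all hidden assignments.
naive = np.zeros(2)
for x1 in range(2):
    for x2 in range(2):
        for x3 in range(2):
            naive += p1[x1] * p21[x1, x2] * p32[x2, x3] * p43[x3]
print(naive)  # matches p4
```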
SLIDE 20

Need for approximate inference

In most cases, the exact distribution over hidden variables cannot be computed; it would require representing an exponentially large distribution over the hidden variables (or an infinite one, in the continuous case). For example:
$$Z_i \sim \text{Bernoulli}(\phi_i), \; i = 1, \ldots, n, \qquad X \mid Z = z \sim \mathcal{N}(\theta^T z, \sigma^2)$$
The distribution $p(Z \mid x)$ is a full distribution over $n$ binary random variables.

[Figure: network with $Z_1, Z_2, \ldots, Z_n$ all parents of $X$]
SLIDE 21

Outline

  • Probabilistic graphical models
  • Probabilistic inference
  • Exact inference
  • Sample-based inference
  • A brief look at deep generative models

SLIDE 22

Sample-based inference

If we can draw samples from a posterior distribution, then we can approximate arbitrary probabilistic queries about that distribution.

A naive strategy (rejection sampling): draw samples from the generative model until we find one that matches the observed data; the hidden-variable values of the kept samples are then distributed according to the distribution over hidden variables given the observed variables.

As models get more complex and more variables are observed, the probability that we see our exact observations goes to zero.

[Figure: chain Bayesian network $X_1 \to X_2 \to X_3 \to X_4$]
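A sketch of rejection sampling on the running chain example (CPT values are the same made-up numbers as in the earlier sketches); note how wasteful it is: every draw with the wrong $X_4$ is thrown away.

```python
import numpy as np

rng = np.random.default_rng(0)

phi1, phi2, phi3, phi4 = 0.3, [0.1, 0.6], [0.2, 0.5], [0.4, 0.8]

def sample_chain():
    x1 = rng.binomial(1, phi1)
    x2 = rng.binomial(1, phi2[x1])
    x3 = rng.binomial(1, phi3[x2])
    x4 = rng.binomial(1, phi4[x3])
    return x1, x2, x3, x4

# Approximate p(X1, X2, X3 | X4 = 1): keep only draws where x4 == 1.
kept = []
while len(kept) < 1000:
    x1, x2, x3, x4 = sample_chain()
    if x4 == 1:
        kept.append((x1, x2, x3))
print(np.mean([k[0] for k in kept]))  # estimate of p(X1 = 1 | X4 = 1)
```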
SLIDE 23

Markov Chain Monte Carlo

Let's consider a generic technique for generating samples from a distribution $p(X)$ (suppose the distribution is complex enough that we cannot compute it or sample from it directly).

Our strategy is going to be to generate samples $x_t$ via some conditional distribution $p(X_{t+1} \mid X_t)$, constructed to guarantee that $p(X_t) \to p(X)$.

SLIDE 24

Metropolis-Hastings Algorithm

One of the workhorses of modern probabilistic methods

  • 1. Pick some $x_0$ (e.g., completely randomly)
  • 2. For $t = 1, 2, \ldots$: sample $\tilde{x}_{t+1} \sim q(X' \mid X = x_t)$, then set
$$x_{t+1} := \tilde{x}_{t+1} \;\text{ with probability }\; \min\left(1, \frac{p(\tilde{x}_{t+1})\, q(x_t \mid \tilde{x}_{t+1})}{p(x_t)\, q(\tilde{x}_{t+1} \mid x_t)}\right), \qquad \text{otherwise } x_{t+1} := x_t$$

SLIDE 25

Notes on MH

We choose the proposal $q(X' \mid X)$ so that we can easily sample from it; e.g., for continuous distributions, it's common to choose
$$q(X' \mid X = x) = \mathcal{N}(x' \mid x, I)$$
Note that even if we cannot compute the probabilities $p(x_t)$ and $p(\tilde{x}_{t+1})$ themselves, we can often compute their ratio $p(\tilde{x}_{t+1})/p(x_t)$ (this requires only being able to compute the unnormalized probabilities); e.g., consider the case below.

[Figure: chain Bayesian network over $X_1, X_2, X_3, X_4$]
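Putting the last two slides together, here is a minimal random-walk MH sketch (my illustration, with a made-up bimodal target): the proposal is a symmetric Gaussian, so its $q$ terms cancel and the acceptance test needs only the unnormalized density.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_unnorm(x):
    """Unnormalized bimodal target density (made up for illustration)."""
    return np.exp(-0.5 * (x - 2) ** 2) + np.exp(-0.5 * (x + 2) ** 2)

def metropolis_hastings(steps=10000, step_size=1.0):
    x = 0.0                      # arbitrary starting point
    samples = []
    for _ in range(steps):
        x_prop = x + step_size * rng.standard_normal()  # symmetric proposal
        # Accept with probability min(1, p(x') / p(x)); q terms cancel here.
        if rng.random() < min(1.0, p_unnorm(x_prop) / p_unnorm(x)):
            x = x_prop
        samples.append(x)
    return np.array(samples)

s = metropolis_hastings()
print(s.mean(), s.std())  # mean near 0, std near 2.2 for this target
```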
SLIDE 26

Proof of MH algorithm

Theorem: For samples generated by MH, $p(X_t) \to p(X)$ as $t \to \infty$.

Proof: We'll proceed in two parts.

  • 1. (Detailed balance equations) First, we show that given any distribution $p(X)$ and a conditional distribution $p(X' \mid X)$, if
$$p(X)\, p(X' \mid X) = p(X')\, p(X \mid X')$$
and $p(X' \mid X) > 0$ for all $x, x'$, then repeatedly sampling $x_{t+1} \sim p(X' \mid X = x_t)$ gives $p(X_t) \to p(X)$
  • 2. The Metropolis-Hastings update gives a distribution that satisfies the detailed balance equations

SLIDE 27

Proof of MH algorithm (cont)

Part 1 (not a complete proof): detailed balance says that for any $x_t, x_{t+1}$,
$$p(x_t)\, p(x_{t+1} \mid x_t) = p(x_{t+1})\, p(x_t \mid x_{t+1})$$
Summing both sides over $x_t$ gives
$$\sum_{x_t} p(x_t)\, p(x_{t+1} \mid x_t) = p(x_{t+1})$$
which is equivalent to saying that $p(X)$ is a stationary distribution of the conditional distribution $p(X' \mid X)$.

Under some properties of conditional distributions that we won't cover, repeated sampling from the conditional will converge to the stationary distribution, assuming e.g. the conditional has positive probabilities.

SLIDE 28

Proof of MH algorithm (cont)

Part 2: First, note that detailed balance is trivially satisfied for $x_{t+1} = x_t$:
$$p(x_{t+1} \mid x_t)\, p(x_t) = p(x_t \mid x_{t+1})\, p(x_{t+1})$$
Now assuming $x_{t+1} \neq x_t$, suppose that (the opposite case proceeds in exactly the same manner)
$$p(x_t)\, q(x_{t+1} \mid x_t) \leq p(x_{t+1})\, q(x_t \mid x_{t+1})$$
Then:
$$\min\left(1, \frac{p(x_{t+1})\, q(x_t \mid x_{t+1})}{p(x_t)\, q(x_{t+1} \mid x_t)}\right) = 1$$
$$\min\left(1, \frac{p(x_t)\, q(x_{t+1} \mid x_t)}{p(x_{t+1})\, q(x_t \mid x_{t+1})}\right) p(x_{t+1})\, q(x_t \mid x_{t+1}) = p(x_t)\, q(x_{t+1} \mid x_t)$$

SLIDE 29

Proof of MH algorithm (cont)

So finally, note that
$$p(x_t)\, p(x_{t+1} \mid x_t) = p(x_t)\, q(x_{t+1} \mid x_t) \min\left(1, \frac{p(x_{t+1})\, q(x_t \mid x_{t+1})}{p(x_t)\, q(x_{t+1} \mid x_t)}\right) = p(x_t)\, q(x_{t+1} \mid x_t)$$
$$= \min\left(1, \frac{p(x_t)\, q(x_{t+1} \mid x_t)}{p(x_{t+1})\, q(x_t \mid x_{t+1})}\right) p(x_{t+1})\, q(x_t \mid x_{t+1}) = p(x_{t+1})\, p(x_t \mid x_{t+1})$$
(the first and last equalities follow by the definition of $p(x_{t+1} \mid x_t)$ and $p(x_t \mid x_{t+1})$, the second and third by the equations on the previous slide), which shows the transition probabilities satisfy detailed balance. ∎
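The detailed-balance property is easy to check numerically for a small discrete chain. A sketch (target and proposal are made up): build the MH transition matrix, verify that $p(x)\,T(x' \mid x)$ is symmetric, and watch the chain converge to $p$.

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])    # target distribution (made up)
q = np.full((3, 3), 1.0 / 3)     # uniform proposal q(x' | x)

T = np.zeros((3, 3))             # MH transition matrix T[x, x']
for x in range(3):
    for xn in range(3):
        if x != xn:
            accept = min(1.0, (p[xn] * q[xn, x]) / (p[x] * q[x, xn]))
            T[x, xn] = q[x, xn] * accept
    T[x, x] = 1.0 - T[x].sum()   # rejected proposals stay put

flow = p[:, None] * T            # flow[x, x'] = p(x) T(x' | x)
assert np.allclose(flow, flow.T) # detailed balance holds
print(np.linalg.matrix_power(T, 50)[0])  # rows converge to p
```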

SLIDE 30

Poll: Metropolis-Hastings

Which of the following pairs of true distribution $p$ and proposal distribution $q$ would result in accurate samples from the true distribution?

  • 1. $p(x) = \mathcal{N}(0,1)$, $q(x') = U(0,1)$
  • 2. $p(x) = U(0,1)$, $q(x') = \mathcal{N}(0,1)$
  • 3. $p(x) = \mathcal{N}(0,1)$, $x' \mid x = x + U(-1,1)$

SLIDE 31

Gibbs sampling

An application of MH to graphical models leads to what is called Gibbs sampling.

Suppose we want to draw a sample from $p(Z \mid X = x)$ (i.e., sample the unobserved variables given the observed variables):

  • 1. Initialize $z$ randomly
  • 2. Repeat: pick some $i$ and sample $z_i \sim p(Z_i \mid Z_{\neg i} = z_{\neg i}, X = x)$

This is practical to implement as long as we can sample one variable given fixed values of all the others (exploiting the independence structure), as in the sketch below.
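A Gibbs sketch for the running chain with $X_4$ observed (same made-up CPTs as in the earlier sketches); each conditional involves only the variable's neighbors in the graph.

```python
import numpy as np

rng = np.random.default_rng(0)

p1 = np.array([0.7, 0.3])
p21 = np.array([[0.9, 0.1], [0.4, 0.6]])
p32 = np.array([[0.8, 0.2], [0.5, 0.5]])
p43 = np.array([[0.6, 0.4], [0.2, 0.8]])

def gibbs(x4, steps=5000):
    """Sample from p(X1, X2, X3 | X4 = x4) by resampling one variable
    at a time from its conditional given everything else."""
    x = [rng.integers(2) for _ in range(3)]
    samples = []
    for _ in range(steps):
        w1 = p1 * p21[:, x[1]]            # p(x1) p(x2 | x1)
        x[0] = rng.choice(2, p=w1 / w1.sum())
        w2 = p21[x[0]] * p32[:, x[2]]     # p(x2 | x1) p(x3 | x2)
        x[1] = rng.choice(2, p=w2 / w2.sum())
        w3 = p32[x[1]] * p43[:, x4]       # p(x3 | x2) p(x4 | x3)
        x[2] = rng.choice(2, p=w3 / w3.sum())
        samples.append(tuple(x))
    return samples

s = gibbs(x4=1)
print(np.mean([si[0] for si in s]))  # estimate of p(X1 = 1 | X4 = 1)
```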

SLIDE 32

Gibbs as Metropolis-Hastings

We can derive Gibbs sampling as an application of the MH algorithm, with the proposal distribution (omitting the $X$ terms for simplicity)
$$q_i(z' \mid z): \quad z_i' \sim p(Z_i \mid Z_{\neg i} = z_{\neg i}), \qquad z_j' = z_j \text{ for } j \neq i$$
Under this distribution, the proposal is always accepted:
$$\frac{p(z')\, q_i(z \mid z')}{p(z)\, q_i(z' \mid z)} = \frac{p(z_i' \mid z_{\neg i}')\, p(z_{\neg i}')\, p(z_i \mid z_{\neg i}')}{p(z_i \mid z_{\neg i})\, p(z_{\neg i})\, p(z_i' \mid z_{\neg i})} = \frac{p(z_i' \mid z_{\neg i}')\, p(z_{\neg i}')\, p(z_i \mid z_{\neg i}')}{p(z_i \mid z_{\neg i}')\, p(z_{\neg i}')\, p(z_i' \mid z_{\neg i}')} = 1$$
(the second equality uses $z_{\neg i}' = z_{\neg i}$)

Technically, this uses a different $q_i$ selected at random for each $Z_i$ variable, but we can show that the product of all these individual $q_i$'s leads to a single "global" $q$ that still has all the necessary properties.

SLIDE 33

Outline

  • Probabilistic graphical models
  • Probabilistic inference
  • Exact inference
  • Sample-based inference
  • A brief look at deep generative models

SLIDE 34

Deep generative models

Probabilistic models + deep learning (what could be better?)

A huge landscape, going back many years; we will just briefly highlight two common current approaches:

  • Variational autoencoders
  • Generative adversarial networks
  • See also (not discussed): normalizing flow models

SLIDE 35

Generator networks

The starting point for most deep generative models is a "generator network" $G$: a generative model that takes random noise as input and outputs elements from the desired distribution, e.g.
$$z \sim \mathcal{N}(0, I), \qquad x \sim \mathcal{N}(G(z; \theta), \sigma^2 I)$$
For these models, how do we train the network, e.g., via maximum likelihood estimation?
$$\underset{\theta}{\text{maximize}} \; \sum_{i=1}^{m} \log p(x^{(i)}; \theta) \;\equiv\; \underset{\theta}{\text{maximize}} \; \sum_{i=1}^{m} \log \int_z p(x^{(i)} \mid z; \theta)\, p(z)\, dz$$
This is typically hard to optimize via "standard" approaches (e.g., sampling + MLE), so alternative approaches are needed.

SLIDE 36

Variational autoencoders

Variational autoencoders (VAEs) approximate the MLE using the so-called variational lower bound: for any distribution $q(Z \mid X)$ we have
$$\log p(x^{(i)}) \geq \mathbf{E}_{z \sim q(Z \mid x^{(i)})}\left[\log p(x^{(i)} \mid z)\right] - \mathrm{KL}\left[q(Z \mid x^{(i)})\, \|\, p(Z)\right]$$
where $\mathrm{KL}(q \| p) = \int q(x) \log \frac{q(x)}{p(x)}\, dx$ is called the KL divergence between two distributions.

In general, finding the "right" distribution $q(Z \mid x)$ is the goal of what are called variational inference methods.

Key idea of variational autoencoders: use a neural network to predict the $q(Z \mid x)$ distribution.
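In the common case where the encoder outputs a diagonal Gaussian $q(Z \mid x) = \mathcal{N}(\mu, \text{diag}(\sigma^2))$ and the prior is $p(Z) = \mathcal{N}(0, I)$, the KL term has a closed form; a small sketch:

```python
import numpy as np

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form:
    0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

# Hypothetical encoder outputs for one datapoint:
print(kl_diag_gaussian(np.array([0.5, -0.2]), np.array([0.0, -1.0])))
```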

SLIDE 37

VAEs continued

VAEs consist of two networks: an "encoder" network that predicts $q(Z \mid x)$ and a "decoder" (generator) network that models $p(X \mid z)$.

Train VAEs by maximizing the variational lower bound:
$$\underset{E, G}{\text{maximize}} \; \sum_{i=1}^{m} \mathbf{E}_{z \sim q(Z \mid x^{(i)})}\left[\log p(x^{(i)} \mid z)\right] - \mathrm{KL}\left(q(Z \mid x^{(i)})\, \|\, p(Z)\right)$$
(some tricks are required to make this a differentiable process)

[Figure: $x \to E \to q(Z \mid x) \to z \to G \to p(X \mid z)$]

SLIDE 38

Generative adversarial models (GANs)

An alternative approach to training deep generative models: try to build a classifier that can "tell apart" generated samples from real data
$$\underset{G}{\text{minimize}} \;\; \underset{D}{\text{maximize}} \;\; \frac{1}{m} \sum_{i=1}^{m} \log p(x^{(i)}; D) + \mathbf{E}_{x \sim p(x, z; G)}\left[\log(1 - p(x; D))\right]$$
Training requires solving a min-max optimization problem, but current results suggest that it can generate very realistic samples.

This "avoids" the challenges associated with MLE by not trying to approximate it at all, instead considering a different loss function.

SLIDE 39

Examples of GANs

Samples of faces generated by a (quite complex) GAN

Figure from (Karras et al., 2018)