15-780 Graduate Artificial Intelligence: Probabilistic inference


SLIDE 1

15-780 – Graduate Artificial Intelligence: Probabilistic inference

  • J. Zico Kolter (this lecture) and Nihar Shah

Carnegie Mellon University Spring 2020

SLIDE 2

Outline

  • Probabilistic graphical models
  • Probabilistic inference
  • Exact inference
  • Sample-based inference
  • A brief look at deep generative models

SLIDE 3

Outline

  • Probabilistic graphical models
  • Probabilistic inference
  • Exact inference
  • Sample-based inference
  • A brief look at deep generative models

SLIDE 4

Probabilistic graphical models

Probabilistic graphical models are all about representing distributions $p(X)$, where $X$ represents some large set of random variables.

Example: suppose $X \in \{0,1\}^n$ (an $n$-dimensional binary random variable); it would take $2^n - 1$ parameters to describe the full joint distribution.

Graphical models offer a way to represent these same distributions more compactly, by exploiting conditional independencies in the distribution.

Note: I'm going to use "probabilistic graphical model" and "Bayesian network" interchangeably, even though there are differences.

SLIDE 5

Bayesian networks

A Bayesian network is defined by:

  • 1. A directed acyclic graph, $G = \{V = \{X_1, \ldots, X_n\}, E\}$
  • 2. A set of conditional distributions $p(X_i \mid \text{Parents}(X_i))$

It defines the joint probability distribution
$$p(X) = \prod_{i=1}^{n} p(X_i \mid \text{Parents}(X_i))$$
Equivalently: each node is conditionally independent of all non-descendants given its parents.

SLIDE 6

Example Bayesian network

Conditional independencies let us simplify the joint distribution:
$$p(X_1, X_2, X_3, X_4) = p(X_1)\, p(X_2 \mid X_1)\, p(X_3 \mid X_1, X_2)\, p(X_4 \mid X_1, X_2, X_3) = p(X_1)\, p(X_2 \mid X_1)\, p(X_3 \mid X_2)\, p(X_4 \mid X_3)$$

[Figure: chain Bayesian network $X_1 \to X_2 \to X_3 \to X_4$]

Full joint distribution: $2^4 - 1 = 15$ parameters (assuming binary variables)
SLIDE 7

Example Bayesian network

Conditional independencies let us simplify the joint distribution:
$$p(X_1, X_2, X_3, X_4) = p(X_1)\, p(X_2 \mid X_1)\, p(X_3 \mid X_1, X_2)\, p(X_4 \mid X_1, X_2, X_3) = p(X_1)\, p(X_2 \mid X_1)\, p(X_3 \mid X_2)\, p(X_4 \mid X_3)$$

[Figure: chain Bayesian network $X_1 \to X_2 \to X_3 \to X_4$]

Full joint distribution: $2^4 - 1 = 15$ parameters (assuming binary variables); in the factored form, $p(X_1)$ takes 1 parameter and $p(X_2 \mid X_1)$ takes 2 parameters
SLIDE 8

Example Bayesian network

Conditional independencies let us simplify the joint distribution:
$$p(X_1, X_2, X_3, X_4) = p(X_1)\, p(X_2 \mid X_1)\, p(X_3 \mid X_1, X_2)\, p(X_4 \mid X_1, X_2, X_3) = p(X_1)\, p(X_2 \mid X_1)\, p(X_3 \mid X_2)\, p(X_4 \mid X_3)$$

[Figure: chain Bayesian network $X_1 \to X_2 \to X_3 \to X_4$]

Full joint distribution: $2^4 - 1 = 15$ parameters (assuming binary variables); the factored form needs only $1 + 2 + 2 + 2 = 7$ parameters
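The two counts above can be checked mechanically. A small sketch (an illustration, not from the slides): a binary node with $k$ binary parents needs $2^k$ free parameters, one Bernoulli parameter per parent configuration.

```python
# Parameter counting for binary-variable Bayesian networks.
def num_parameters(parents):
    # Each node contributes 2**(number of parents) free parameters.
    return sum(2 ** len(pa) for pa in parents.values())

chain = {"X1": [], "X2": ["X1"], "X3": ["X2"], "X4": ["X3"]}
full = {"X1": [], "X2": ["X1"], "X3": ["X1", "X2"], "X4": ["X1", "X2", "X3"]}

print(num_parameters(chain))  # 1 + 2 + 2 + 2 = 7
print(num_parameters(full))   # 1 + 2 + 4 + 8 = 15 = 2**4 - 1
```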
SLIDE 9

Poll: Simple Bayesian network

What conditional independencies exist in the following Bayesian network?

  • 1. $X_1$ and $X_2$ are marginally independent
  • 2. $X_4$ is conditionally independent of $X_1$ given $X_3$
  • 3. $X_1$ is conditionally independent of $X_4$ given $X_3$
  • 4. $X_1$ is conditionally independent of $X_2$ given $X_3$

[Figure: Bayesian network over $X_1, X_2, X_3, X_4$]
SLIDE 10

Generative model

Can also describe the probability distribution as a sequential "story"; this is called a generative model:
$$X_1 \sim \text{Bernoulli}(\phi^{(1)})$$
$$X_2 \mid X_1 = x_1 \sim \text{Bernoulli}(\phi^{(2)}_{x_1})$$
$$X_3 \mid X_2 = x_2 \sim \text{Bernoulli}(\phi^{(3)}_{x_2})$$
$$X_4 \mid X_3 = x_3 \sim \text{Bernoulli}(\phi^{(4)}_{x_3})$$

"First sample $X_1$ from a Bernoulli distribution with parameter $\phi^{(1)}$, then sample $X_2$ from a Bernoulli distribution with parameter $\phi^{(2)}_{x_1}$, where $x_1$ is the value we sampled for $X_1$, then sample $X_3$ from a Bernoulli …"

[Figure: chain Bayesian network $X_1 \to X_2 \to X_3 \to X_4$]
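This "story" is exactly ancestral sampling, and it is easy to run. A sketch with made-up parameter values (the `phi` numbers are hypothetical, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

phi1 = 0.3            # phi^(1)
phi2 = [0.1, 0.6]     # phi^(2)_{x1}, indexed by the sampled x1
phi3 = [0.2, 0.5]     # phi^(3)_{x2}
phi4 = [0.4, 0.8]     # phi^(4)_{x3}

def sample_chain():
    """Sample each variable in turn, conditioning on its sampled parent."""
    x1 = rng.binomial(1, phi1)
    x2 = rng.binomial(1, phi2[x1])
    x3 = rng.binomial(1, phi3[x2])
    x4 = rng.binomial(1, phi4[x3])
    return x1, x2, x3, x4

print([sample_chain() for _ in range(3)])
```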
SLIDE 11

More general generative models

This notion of a "sequential story" (a generative model) is extremely powerful for describing very general distributions.

Naive Bayes:
$$Y \sim \text{Bernoulli}(\phi), \qquad X_i \mid Y = y \sim \text{Categorical}(\phi^{(i)}_y)$$

Gaussian mixture model:
$$Z \sim \text{Categorical}(\phi), \qquad X \mid Z = z \sim \mathcal{N}(\mu_z, \Sigma_z)$$
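As a sketch of the mixture model's "story" (all parameter values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

phi = np.array([0.4, 0.6])                  # mixing weights
mu = np.array([[0.0, 0.0], [3.0, 3.0]])     # component means
Sigma = [np.eye(2), 0.5 * np.eye(2)]        # component covariances

def sample_gmm(m):
    """First sample z ~ Categorical(phi), then x | z ~ N(mu_z, Sigma_z)."""
    z = rng.choice(len(phi), size=m, p=phi)
    x = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    return z, x

z, x = sample_gmm(5)
print(z)
print(x)
```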

SLIDE 12

More general generative models

Linear regression:
$$Y \mid X = x \sim \mathcal{N}(\theta^T x, \sigma^2)$$

Changepoint model:
$$X \sim \text{Uniform}(0,1), \qquad Y \mid X = x \sim \begin{cases} \mathcal{N}(\mu_1, \sigma^2) & \text{if } x < t \\ \mathcal{N}(\mu_2, \sigma^2) & \text{if } x \geq t \end{cases}$$

Latent Dirichlet Allocation ($N$ documents, $K$ topics, $M_j$ words per document):
$$\theta_j \sim \text{Dirichlet}(\alpha) \quad \text{(topic distribution per document)}$$
$$\phi_k \sim \text{Dirichlet}(\beta) \quad \text{(word distribution per topic)}$$
$$z_{j,\ell} \sim \text{Categorical}(\theta_j) \quad \text{(topic of the } \ell\text{th word in document } j)$$
$$w_{j,\ell} \sim \text{Categorical}(\phi_{z_{j,\ell}}) \quad \text{(the } \ell\text{th word in document } j)$$
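The LDA story also translates directly into code. A sketch with hypothetical sizes and hyperparameters (and, for brevity, a fixed number of words per document):

```python
import numpy as np

rng = np.random.default_rng(0)

N, K, V, M = 4, 3, 10, 6     # documents, topics, vocabulary size, words/doc
alpha, beta = 0.5, 0.1       # Dirichlet hyperparameters

theta = rng.dirichlet(alpha * np.ones(K), size=N)  # topic dist. per document
phi = rng.dirichlet(beta * np.ones(V), size=K)     # word dist. per topic

docs = []
for j in range(N):
    z = rng.choice(K, size=M, p=theta[j])          # topic of each word
    w = [rng.choice(V, p=phi[zl]) for zl in z]     # word given its topic
    docs.append(w)
print(docs)
```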

SLIDE 13

Outline

  • Probabilistic graphical models
  • Probabilistic inference
  • Exact inference
  • Sample-based inference
  • A brief look at deep generative models

SLIDE 14

The inference problem

Given observations (i.e., knowing the value of some of the variables in a model), what is the distribution over the other (hidden) variables?

A relatively "easy" problem if we observe variables at the "beginning" of chains in a Bayesian network:

  • If we observe the value of $X_1$, then $X_2, X_3, X_4$ have the same distribution as before, just with $X_1$ "fixed"
  • But if we observe $X_4$, what is the distribution over $X_1, X_2, X_3$?

[Figure: two copies of the chain $X_1 \to X_2 \to X_3 \to X_4$, illustrating observing $X_1$ versus observing $X_4$]
SLIDE 15

Many types of inference problems

Marginal inference: given a generative distribution $p(X)$ over $X = \{X_1, \ldots, X_n\}$, determine $p(X_I)$ for $I \subseteq \{1, \ldots, n\}$

MAP inference: determine the assignment with the maximum probability

Conditional variants: solve either of the two problems above conditioned on some observed variables, e.g. $p(X_I \mid X_E = x_E)$

SLIDE 16

Approaches to inference

There are three categories of common approaches to inference (more exist, but these are the most common):

  • 1. Exact methods: Bayes' rule or variable elimination methods
  • 2. Sampling approaches: draw samples from the distribution over hidden variables, without constructing it explicitly
  • 3. Approximate variational approaches: approximate the distributions over hidden variables using "simple" distributions, minimizing the difference between these distributions and the true distributions

SLIDE 17

Outline

  • Probabilistic graphical models
  • Probabilistic inference
  • Exact inference
  • Sample-based inference
  • A brief look at deep generative models

SLIDE 18

Exact inference example

Mixture of Gaussians model:
$$Z \sim \text{Categorical}(\phi), \qquad X \mid Z = z \sim \mathcal{N}(\mu_z, \Sigma_z)$$
Task: compute $p(Z \mid x)$. In this case, we can solve inference exactly with Bayes' rule:
$$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{\sum_{z'} p(x \mid z')\, p(z')}$$

[Figure: two-node network $Z \to X$]
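A sketch of this computation in Python (the mixture parameters are made up; the Gaussian density is written out so the example stays self-contained):

```python
import numpy as np

def gauss_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) evaluated at x."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

phi = np.array([0.4, 0.6])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigma = [np.eye(2), 0.5 * np.eye(2)]

def posterior_z(x):
    """Bayes' rule: p(z | x) proportional to p(x | z) p(z)."""
    unnorm = np.array([gauss_pdf(x, mu[k], Sigma[k]) * phi[k]
                       for k in range(len(phi))])
    return unnorm / unnorm.sum()

print(posterior_z(np.array([2.0, 2.0])))
```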
SLIDE 19

Exact inference in graphical models

In some cases, it's possible to exploit the structure of the graphical model to develop efficient exact inference methods.

Example: how can I compute $p(X_4)$?
$$p(X_4) = \sum_{x_1, x_2, x_3} p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2)\, p(X_4 \mid x_3)$$

[Figure: chain Bayesian network $X_1 \to X_2 \to X_3 \to X_4$]
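Computed naively, this sum has exponentially many terms, but pushing each sum inward (variable elimination) reduces it to a sequence of small matrix products. A sketch for the binary chain with made-up CPTs:

```python
import numpy as np

# Hypothetical CPTs; row index is the parent's value.
p1 = np.array([0.7, 0.3])                  # p(x1)
p21 = np.array([[0.9, 0.1], [0.4, 0.6]])   # p(x2 | x1)
p32 = np.array([[0.8, 0.2], [0.5, 0.5]])   # p(x3 | x2)
p43 = np.array([[0.6, 0.4], [0.2, 0.8]])   # p(x4 | x3)

# Variable elimination: sum out x1, then x2, then x3.
p4 = ((p1 @ p21) @ p32) @ p43
print(p4)

# Naive check: enumerate all hidden assignments.
naive = np.zeros(2)
for x1 in range(2):
    for x2 in range(2):
        for x3 in range(2):
            naive += p1[x1] * p21[x1, x2] * p32[x2, x3] * p43[x3]
print(naive)  # matches p4
```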
SLIDE 20

Need for approximate inference

In most cases, the exact distribution over hidden variables cannot be computed; it would require representing an exponentially large distribution over the hidden variables (or an infinite one, in the continuous case). For example:
$$Z_i \sim \text{Bernoulli}(\phi_i), \; i = 1, \ldots, n, \qquad X \mid Z = z \sim \mathcal{N}(\theta^T z, \sigma^2)$$
The distribution $p(Z \mid x)$ is a full distribution over $n$ binary random variables.

[Figure: network with $Z_1, Z_2, \ldots, Z_n$ all parents of $X$]
SLIDE 21

Outline

  • Probabilistic graphical models
  • Probabilistic inference
  • Exact inference
  • Sample-based inference
  • A brief look at deep generative models

SLIDE 22

Sample-based inference

If we can draw samples from a posterior distribution, then we can approximate arbitrary probabilistic queries about that distribution.

A naive strategy (rejection sampling): draw samples from the generative model until we find one that matches the observed data; the hidden-variable values of the kept samples are then distributed according to the distribution over hidden variables given the observed variables.

As models get more complex and more variables are observed, the probability that we see our exact observations goes to zero.

[Figure: chain Bayesian network $X_1 \to X_2 \to X_3 \to X_4$]
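A sketch of rejection sampling on the running chain example (CPT values are the same made-up numbers as in the earlier sketches); note how wasteful it is: every draw with the wrong $X_4$ is thrown away.

```python
import numpy as np

rng = np.random.default_rng(0)

phi1, phi2, phi3, phi4 = 0.3, [0.1, 0.6], [0.2, 0.5], [0.4, 0.8]

def sample_chain():
    x1 = rng.binomial(1, phi1)
    x2 = rng.binomial(1, phi2[x1])
    x3 = rng.binomial(1, phi3[x2])
    x4 = rng.binomial(1, phi4[x3])
    return x1, x2, x3, x4

# Approximate p(X1, X2, X3 | X4 = 1): keep only draws where x4 == 1.
kept = []
while len(kept) < 1000:
    x1, x2, x3, x4 = sample_chain()
    if x4 == 1:
        kept.append((x1, x2, x3))
print(np.mean([k[0] for k in kept]))  # estimate of p(X1 = 1 | X4 = 1)
```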
SLIDE 23

Markov Chain Monte Carlo

Let's consider a generic technique for generating samples from a distribution $p(X)$ (suppose the distribution is complex enough that we cannot compute it or sample from it directly).

Our strategy is going to be to generate samples $x_t$ via some conditional distribution $p(X_{t+1} \mid X_t)$, constructed to guarantee that $p(X_t) \to p(X)$.

SLIDE 24

Metropolis-Hastings Algorithm

One of the workhorses of modern probabilistic methods

  • 1. Pick some $x_0$ (e.g., completely randomly)
  • 2. For $t = 1, 2, \ldots$: sample $\tilde{x}_{t+1} \sim q(X' \mid X = x_t)$, then set
$$x_{t+1} := \tilde{x}_{t+1} \;\text{ with probability }\; \min\left(1, \frac{p(\tilde{x}_{t+1})\, q(x_t \mid \tilde{x}_{t+1})}{p(x_t)\, q(\tilde{x}_{t+1} \mid x_t)}\right), \qquad \text{otherwise } x_{t+1} := x_t$$

SLIDE 25

Notes on MH

We choose the proposal $q(X' \mid X)$ so that we can easily sample from it; e.g., for continuous distributions, it's common to choose
$$q(X' \mid X = x) = \mathcal{N}(x' \mid x, I)$$
Note that even if we cannot compute the probabilities $p(x_t)$ and $p(\tilde{x}_{t+1})$ themselves, we can often compute their ratio $p(\tilde{x}_{t+1})/p(x_t)$ (this requires only being able to compute the unnormalized probabilities); e.g., consider the case below.

[Figure: chain Bayesian network over $X_1, X_2, X_3, X_4$]
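Putting the last two slides together, here is a minimal random-walk MH sketch (my illustration, with a made-up bimodal target): the proposal is a symmetric Gaussian, so its $q$ terms cancel and the acceptance test needs only the unnormalized density.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_unnorm(x):
    """Unnormalized bimodal target density (made up for illustration)."""
    return np.exp(-0.5 * (x - 2) ** 2) + np.exp(-0.5 * (x + 2) ** 2)

def metropolis_hastings(steps=10000, step_size=1.0):
    x = 0.0                      # arbitrary starting point
    samples = []
    for _ in range(steps):
        x_prop = x + step_size * rng.standard_normal()  # symmetric proposal
        # Accept with probability min(1, p(x') / p(x)); q terms cancel here.
        if rng.random() < min(1.0, p_unnorm(x_prop) / p_unnorm(x)):
            x = x_prop
        samples.append(x)
    return np.array(samples)

s = metropolis_hastings()
print(s.mean(), s.std())  # mean near 0, std near 2.2 for this target
```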
SLIDE 26

Proof of MH algorithm

Theorem: For samples generated by MH, $p(X_t) \to p(X)$ as $t \to \infty$.

Proof: We'll proceed in two parts.

  • 1. (Detailed balance equations) First, we show that given any distribution $p(X)$ and a conditional distribution $p(X' \mid X)$, if
$$p(X)\, p(X' \mid X) = p(X')\, p(X \mid X')$$
and $p(X' \mid X) > 0$ for all $x, x'$, then repeatedly sampling $x_{t+1} \sim p(X' \mid X = x_t)$ gives $p(X_t) \to p(X)$
  • 2. The Metropolis-Hastings update gives a distribution that satisfies the detailed balance equations

SLIDE 27

Proof of MH algorithm (cont)

Part 1 (not a complete proof): detailed balance says that for any $x_t, x_{t+1}$,
$$p(x_t)\, p(x_{t+1} \mid x_t) = p(x_{t+1})\, p(x_t \mid x_{t+1})$$
Summing both sides over $x_t$ gives
$$\sum_{x_t} p(x_t)\, p(x_{t+1} \mid x_t) = p(x_{t+1})$$
which is equivalent to saying that $p(X)$ is a stationary distribution of the conditional distribution $p(X' \mid X)$.

Under some properties of conditional distributions that we won't cover, repeated sampling from the conditional will converge to the stationary distribution, assuming e.g. the conditional has positive probabilities.

SLIDE 28

Proof of MH algorithm (cont)

Part 2: First, note that detailed balance is trivially satisfied for $x_{t+1} = x_t$:
$$p(x_{t+1} \mid x_t)\, p(x_t) = p(x_t \mid x_{t+1})\, p(x_{t+1})$$
Now assuming $x_{t+1} \neq x_t$, suppose that (the opposite case proceeds in exactly the same manner)
$$p(x_t)\, q(x_{t+1} \mid x_t) \leq p(x_{t+1})\, q(x_t \mid x_{t+1})$$
Then:
$$\min\left(1, \frac{p(x_{t+1})\, q(x_t \mid x_{t+1})}{p(x_t)\, q(x_{t+1} \mid x_t)}\right) = 1$$
$$\min\left(1, \frac{p(x_t)\, q(x_{t+1} \mid x_t)}{p(x_{t+1})\, q(x_t \mid x_{t+1})}\right) p(x_{t+1})\, q(x_t \mid x_{t+1}) = p(x_t)\, q(x_{t+1} \mid x_t)$$

SLIDE 29

Proof of MH algorithm (cont)

So finally, note that
$$p(x_t)\, p(x_{t+1} \mid x_t) = p(x_t)\, q(x_{t+1} \mid x_t) \min\left(1, \frac{p(x_{t+1})\, q(x_t \mid x_{t+1})}{p(x_t)\, q(x_{t+1} \mid x_t)}\right) = p(x_t)\, q(x_{t+1} \mid x_t)$$
$$= \min\left(1, \frac{p(x_t)\, q(x_{t+1} \mid x_t)}{p(x_{t+1})\, q(x_t \mid x_{t+1})}\right) p(x_{t+1})\, q(x_t \mid x_{t+1}) = p(x_{t+1})\, p(x_t \mid x_{t+1})$$
(the first and last equalities follow by the definition of $p(x_{t+1} \mid x_t)$ and $p(x_t \mid x_{t+1})$, the second and third by the equations on the previous slide), which shows the transition probabilities satisfy detailed balance. ∎
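The detailed-balance property is easy to check numerically for a small discrete chain. A sketch (target and proposal are made up): build the MH transition matrix, verify that $p(x)\,T(x' \mid x)$ is symmetric, and watch the chain converge to $p$.

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])    # target distribution (made up)
q = np.full((3, 3), 1.0 / 3)     # uniform proposal q(x' | x)

T = np.zeros((3, 3))             # MH transition matrix T[x, x']
for x in range(3):
    for xn in range(3):
        if x != xn:
            accept = min(1.0, (p[xn] * q[xn, x]) / (p[x] * q[x, xn]))
            T[x, xn] = q[x, xn] * accept
    T[x, x] = 1.0 - T[x].sum()   # rejected proposals stay put

flow = p[:, None] * T            # flow[x, x'] = p(x) T(x' | x)
assert np.allclose(flow, flow.T) # detailed balance holds
print(np.linalg.matrix_power(T, 50)[0])  # rows converge to p
```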

SLIDE 30

Poll: Metropolis-Hastings

Which of the following pairs of true distribution $p$ and proposal distribution $q$ would result in accurate samples from the true distribution?

  • 1. $p(x) = \mathcal{N}(0,1)$, $q(x') = U(0,1)$
  • 2. $p(x) = U(0,1)$, $q(x') = \mathcal{N}(0,1)$
  • 3. $p(x) = \mathcal{N}(0,1)$, $x' \mid x = x + U(-1,1)$

SLIDE 31

Gibbs sampling

An application of MH to graphical models leads to what is called Gibbs sampling.

Suppose we want to draw a sample from $p(Z \mid X = x)$ (i.e., sample the unobserved variables given the observed variables):

  • 1. Initialize $z$ randomly
  • 2. Repeat: pick some $i$ and sample $z_i \sim p(Z_i \mid Z_{\neg i} = z_{\neg i}, X = x)$

This is practical to implement as long as we can sample one variable given fixed values of all the others (exploiting the independence structure), as in the sketch below.
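A Gibbs sketch for the running chain with $X_4$ observed (same made-up CPTs as in the earlier sketches); each conditional involves only the variable's neighbors in the graph.

```python
import numpy as np

rng = np.random.default_rng(0)

p1 = np.array([0.7, 0.3])
p21 = np.array([[0.9, 0.1], [0.4, 0.6]])
p32 = np.array([[0.8, 0.2], [0.5, 0.5]])
p43 = np.array([[0.6, 0.4], [0.2, 0.8]])

def gibbs(x4, steps=5000):
    """Sample from p(X1, X2, X3 | X4 = x4) by resampling one variable
    at a time from its conditional given everything else."""
    x = [rng.integers(2) for _ in range(3)]
    samples = []
    for _ in range(steps):
        w1 = p1 * p21[:, x[1]]            # p(x1) p(x2 | x1)
        x[0] = rng.choice(2, p=w1 / w1.sum())
        w2 = p21[x[0]] * p32[:, x[2]]     # p(x2 | x1) p(x3 | x2)
        x[1] = rng.choice(2, p=w2 / w2.sum())
        w3 = p32[x[1]] * p43[:, x4]       # p(x3 | x2) p(x4 | x3)
        x[2] = rng.choice(2, p=w3 / w3.sum())
        samples.append(tuple(x))
    return samples

s = gibbs(x4=1)
print(np.mean([si[0] for si in s]))  # estimate of p(X1 = 1 | X4 = 1)
```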

SLIDE 32

Gibbs as Metropolis-Hastings

We can derive Gibbs sampling as an application of the MH algorithm, with the proposal distribution (omitting the $X$ terms for simplicity)
$$q_i(z' \mid z): \quad z_i' \sim p(Z_i \mid Z_{\neg i} = z_{\neg i}), \qquad z_j' = z_j \text{ for } j \neq i$$
Under this distribution, the proposal is always accepted:
$$\frac{p(z')\, q_i(z \mid z')}{p(z)\, q_i(z' \mid z)} = \frac{p(z_i' \mid z_{\neg i}')\, p(z_{\neg i}')\, p(z_i \mid z_{\neg i}')}{p(z_i \mid z_{\neg i})\, p(z_{\neg i})\, p(z_i' \mid z_{\neg i})} = \frac{p(z_i' \mid z_{\neg i}')\, p(z_{\neg i}')\, p(z_i \mid z_{\neg i}')}{p(z_i \mid z_{\neg i}')\, p(z_{\neg i}')\, p(z_i' \mid z_{\neg i}')} = 1$$
(the second equality uses $z_{\neg i}' = z_{\neg i}$)

Technically, this uses a different $q_i$ selected at random for each $Z_i$ variable, but we can show that the product of all these individual $q_i$'s leads to a single "global" $q$ that still has all the necessary properties.

SLIDE 33

Outline

  • Probabilistic graphical models
  • Probabilistic inference
  • Exact inference
  • Sample-based inference
  • A brief look at deep generative models

SLIDE 34

Deep generative models

Probabilistic models + deep learning (what could be better?)

A huge landscape, going back many years; we will just briefly highlight two common current approaches:

  • Variational autoencoders
  • Generative adversarial networks
  • See also (not discussed): normalizing flow models

SLIDE 35

Generator networks

The starting point for most deep generative models is a "generator network" $G$: a generative model that takes random noise as input and outputs elements from the desired distribution, e.g.
$$z \sim \mathcal{N}(0, I), \qquad x \sim \mathcal{N}(G(z; \theta), \sigma^2 I)$$
For these models, how do we train the network, e.g., via maximum likelihood estimation?
$$\underset{\theta}{\text{maximize}} \; \sum_{i=1}^{m} \log p(x^{(i)}; \theta) \;\equiv\; \underset{\theta}{\text{maximize}} \; \sum_{i=1}^{m} \log \int_z p(x^{(i)} \mid z; \theta)\, p(z)\, dz$$
This is typically hard to optimize via "standard" approaches (e.g., sampling + MLE), so alternative approaches are needed.

SLIDE 36

Variational autoencoders

Variational autoencoders (VAEs) approximate the MLE using the so-called variational lower bound: for any distribution $q(Z \mid X)$ we have
$$\log p(x^{(i)}) \geq \mathbf{E}_{z \sim q(Z \mid x^{(i)})}\left[\log p(x^{(i)} \mid z)\right] - \mathrm{KL}\left[q(Z \mid x^{(i)})\, \|\, p(Z)\right]$$
where $\mathrm{KL}(q \| p) = \int q(x) \log \frac{q(x)}{p(x)}\, dx$ is called the KL divergence between two distributions.

In general, finding the "right" distribution $q(Z \mid x)$ is the goal of what are called variational inference methods.

Key idea of variational autoencoders: use a neural network to predict the $q(Z \mid x)$ distribution.
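In the common case where the encoder outputs a diagonal Gaussian $q(Z \mid x) = \mathcal{N}(\mu, \text{diag}(\sigma^2))$ and the prior is $p(Z) = \mathcal{N}(0, I)$, the KL term has a closed form; a small sketch:

```python
import numpy as np

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form:
    0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

# Hypothetical encoder outputs for one datapoint:
print(kl_diag_gaussian(np.array([0.5, -0.2]), np.array([0.0, -1.0])))
```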

SLIDE 37

VAEs continued

VAEs consist of two networks: an "encoder" network that predicts $q(Z \mid x)$ and a "decoder" (generator) network that models $p(X \mid z)$.

Train VAEs by maximizing the variational lower bound:
$$\underset{E, G}{\text{maximize}} \; \sum_{i=1}^{m} \mathbf{E}_{z \sim q(Z \mid x^{(i)})}\left[\log p(x^{(i)} \mid z)\right] - \mathrm{KL}\left(q(Z \mid x^{(i)})\, \|\, p(Z)\right)$$
(some tricks are required to make this a differentiable process)

[Figure: $x \to E \to q(Z \mid x) \to z \to G \to p(X \mid z)$]

SLIDE 38

Generative adversarial models (GANs)

An alternative approach to training deep generative models: try to build a classifier that can "tell apart" generated samples from real data
$$\underset{G}{\text{minimize}} \;\; \underset{D}{\text{maximize}} \;\; \frac{1}{m} \sum_{i=1}^{m} \log p(x^{(i)}; D) + \mathbf{E}_{x \sim p(x, z; G)}\left[\log(1 - p(x; D))\right]$$
Training requires solving a min-max optimization problem, but current results suggest that it can generate very realistic samples.

This "avoids" the challenges associated with MLE by not trying to approximate it at all, instead considering a different loss function.

SLIDE 39

Examples of GANs

Samples of faces generated by a (quite complex) GAN

Figure from (Karras et al., 2018)