Probabilistic Graphical Models - Guest Lecture by Narges Razavian (Machine Learning Class, April 14, 2017)


SLIDE 1

Probabilistic Graphical Models

Guest Lecture by Narges Razavian Machine Learning Class April 14 2017

SLIDE 2

Today

  • What is a probabilistic graphical model, and why is it useful?
  • Bayesian Networks
  • Basic inference
  • Generative models
  • Fancy inference (when some variables are unobserved)
  • How to learn model parameters from data
  • Undirected graphical models
  • Inference (belief propagation)
  • New directions in PGM research & wrapping up

SLIDE 3

“What I cannot create, I do not understand.”

  • Richard Feynman
SLIDE 4

Generative models vs Discriminative models

Discriminative models learn P(Y|X). It’s easier, requires less data, but is only useful for one particular task: Given X, what is P(Y|X)?

[Example: Logistic Regression, Feed-Forward or Convolutional Neural Networks, etc.]

Generative models instead learn the full joint P(Y, X). Once they have it, they can compute everything:

P(X) = ∫y P(X, Y) dy
P(Y) = ∫x P(X, Y) dx
P(Y|X) = P(Y, X) / ∫y P(Y, X) dy

[Caveat: No Free Lunch!! You want to answer every question under the sun? You need more data!]
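To make this concrete, here is a minimal sketch in numpy, with a made-up 2x2 joint table standing in for a learned P(Y, X); the marginals and the conditional all fall out of sums over the joint:

```python
import numpy as np

# Hypothetical joint distribution P(Y, X) over binary Y (rows) and X (columns).
joint = np.array([[0.3, 0.1],   # P(Y=0, X=0), P(Y=0, X=1)
                  [0.2, 0.4]])  # P(Y=1, X=0), P(Y=1, X=1)

p_x = joint.sum(axis=0)          # P(X): marginalize out Y
p_y = joint.sum(axis=1)          # P(Y): marginalize out X
p_y_given_x = joint / p_x        # P(Y|X): normalize each column of the joint

print(p_x)           # [0.5 0.5]
print(p_y)           # [0.4 0.6]
print(p_y_given_x)   # column j holds P(Y | X=j)
```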

SLIDE 5

Probabilistic Graphical Models: the main "classic" approach to modeling P(Y, X) = P(Y1, …, YM, X1, …, XD)

SLIDE 6

Some Calculations on Space

Imagine each variable is binary: P(Y1, …, YM, X1, …, XD)

SLIDE 7

Some Calculations on Space

Imagine each variable is binary: P(Y1, …, YM, X1, …, XD)

How many parameters do we need to estimate from data to specify P(Y,X)??

SLIDE 8

Some Calculations on Space

Imagine each variable is binary: P(Y1, …, YM, X1, …, XD)

How many parameters do we need to estimate from data to specify P(Y,X)??

2^(M+D) − 1  (e.g., with just M + D = 20 binary variables, that is already 2^20 − 1 ≈ one million parameters)

SLIDE 9

Too many parameters!

What can be done?

1) Look for conditional independences
2) Use the chain rule for probabilities to break P(Y,X) into smaller pieces
3) Rewrite P(Y,X) as a product of smaller factors
   a) Maybe you have more data for a subset of variables
4) Simplify some of the modeling assumptions to cut parameters
   a) E.g., assume the data is multivariate Gaussian
   b) E.g., assume conditional independencies even if they don't really always apply

SLIDE 10

Bayesian Networks

Use the chain rule for probabilities:

P(X1, …, XD) = P(X1) P(X2 | X1) P(X3 | X1, X2) … P(XD | X1, …, XD−1)

  • This is always true (no approximations or assumptions), so there is no reduction in the number of parameters either
  • BNs add a conditional independence assumption:

○ For some of the variables, P(Xi | X1, …, Xi−1) is approximated by P(Xi | subset of (X1, …, Xi−1)) ■ This "subset of (X1, …, Xi−1)" is referred to as Parents(Xi) ■ Reduces the parameters (in the binary case, for instance) from 2^(i−1) to 2^|Parents(Xi)|

SLIDE 11

Bayesian Networks

Number of parameters in the binary case:

| Variable | Assumption | Raw chain rule | # params | BN chain rule | # params |
| X1 (Difficulty) | (none) | P(X1) | 1 | P(X1) | 1 |
| X2 (Intelligence) | P(X2|X1) = P(X2) | P(X2|X1) | 2 | P(X2) | 1 |
| X3 (Grade) | (none) | P(X3|X1,X2) | 4 | P(X3|X1,X2) | 4 |
| X4 (SAT score) | P(X4|X1,X2,X3) = P(X4|X2) | P(X4|X1,X2,X3) | 8 | P(X4|X2) | 2 |
| X5 (Letter) | P(X5|X1,X2,X3,X4) = P(X5|X3) | P(X5|X1,X2,X3,X4) | 16 | P(X5|X3) | 2 |
| Total: P(X1,X2,X3,X4,X5) | | | 1+2+4+8+16 = 31 | | 1+1+4+2+2 = 10 |

X1: Difficulty X2: Intelligence X3: Grade X4: SAT X5: Letter
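To see the 10-parameter factorization in action, here is a sketch that assembles the full joint from the five CPTs (all probability values below are hypothetical):

```python
import numpy as np

# Hypothetical CPTs; the last axis of each table is the child variable.
p_x1 = np.array([0.6, 0.4])                      # P(X1): difficulty
p_x2 = np.array([0.7, 0.3])                      # P(X2): intelligence
p_x3 = np.array([[[0.3, 0.7], [0.1, 0.9]],
                 [[0.7, 0.3], [0.4, 0.6]]])      # P(X3 | X1, X2): grade
p_x4 = np.array([[0.95, 0.05], [0.2, 0.8]])      # P(X4 | X2): SAT
p_x5 = np.array([[0.9, 0.1], [0.3, 0.7]])        # P(X5 | X3): letter

def joint(x1, x2, x3, x4, x5):
    """The BN chain rule: 1 + 1 + 4 + 2 + 2 = 10 free parameters in total."""
    return (p_x1[x1] * p_x2[x2] * p_x3[x1, x2, x3]
            * p_x4[x2, x4] * p_x5[x3, x5])

# Sanity check: the 2^5 = 32 joint entries sum to 1.
print(sum(joint(*v) for v in np.ndindex(2, 2, 2, 2, 2)))  # 1.0
```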

SLIDE 12

An Example of a BN for SNPs

SLIDE 13

Benefits of Bayesian Networks

1) Once estimated, they can answer any conditional or marginal query!
   a) This is called inference
2) Fewer parameters to estimate!
3) We can start putting prior information into the network
4) We can incorporate LATENT (hidden/unobserved) variables based on how we/domain experts think variables might be related
5) Generating samples from the distribution becomes super easy.

SLIDE 14

Inference in Bayesian Networks

X1: Difficulty X2: Intelligence X3: Grade X4: SAT X5: Letter


Query types:
1) Conditional probabilities: P(Y | X) = ?   P(Xi = a | X\i = B, Y = C) = ?
2) Maximum a posteriori (MAP) estimates: argmax_xi P(Xi | X\i) = ?   argmax_yi P(Yi | X) = ?

SLIDE 15

Key operation: marginalization P(X) = Σ_y P(X, Y)

P(X5 | X2=a) = ?
P(X5 | X2=a) = P(X5, X2=a) / P(X2=a)
P(X5, X2=a) = Σ_{X1,X3,X4} P(X1, X2=a, X3, X4, X5)
P(X2=a) = Σ_{X1,X3,X4,X5} P(X1, X2=a, X3, X4, X5)

X1: Difficulty X2: Intelligence X3: Grade X4: SAT X5: Letter

SLIDES 16-22

Marginalize from the first parents (root) to the variable...


This method is called sum-product or variable elimination

SLIDE 23

Marginalization when P(X) = Σ_y P(X, Y)

P(X5 | X2=a) = ?
P(X5 | X2=a) = P(X5, X2=a) / P(X2=a)

X1: Difficulty X2: Intelligence X3: Grade X4: SAT X5: Letter

SLIDE 24

Marginalization when P(X) = Σ_y P(X, Y)

P(X5 | X2=a) = ?
P(X5 | X2=a) = P(X5, X2=a) / P(X2=a)

X1: Difficulty X2: Intelligence X3: Grade X4: SAT X5: Letter


P(X5, X2=a) = Σ_{X1,X3,X4} P(X1, X2=a, X3, X4, X5)
 = Σ_{X1,X3,X4} P(X1) P(X2=a) P(X3 | X1, X2=a) P(X4 | X2=a) P(X5 | X3)
 = P(X2=a) Σ_{X1,X3,X4} P(X1) P(X3 | X1, X2=a) P(X4 | X2=a) P(X5 | X3)
 = P(X2=a) Σ_{X1,X3} P(X1) P(X3 | X1, X2=a) P(X5 | X3) Σ_{X4} P(X4 | X2=a)
 = P(X2=a) Σ_{X1,X3} P(X1) P(X3 | X1, X2=a) P(X5 | X3)        (the X4 sum equals 1)
 = P(X2=a) Σ_{X3} P(X5 | X3) Σ_{X1} P(X3 | X1, X2=a) P(X1)
 = P(X2=a) Σ_{X3} P(X5 | X3) f_{X2=a}(X3)
 = P(X2=a) g_{X2=a}(X5)

SLIDE 25

Marginalization when P(X) = Σ_y P(X, Y)

P(X5 | X2=a) = P(X5, X2=a) / P(X2=a) = g_{X2=a}(X5)   (using the derivation from the previous slide)

X1: Difficulty X2: Intelligence X3: Grade X4: SAT X5: Letter
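The elimination transcribes almost line by line into numpy (a sketch reusing the hypothetical CPT tables from the slide 11 example; each einsum performs one Σ):

```python
import numpy as np

p_x1 = np.array([0.6, 0.4])                      # P(X1), hypothetical
p_x3 = np.array([[[0.3, 0.7], [0.1, 0.9]],
                 [[0.7, 0.3], [0.4, 0.6]]])      # P(X3 | X1, X2), hypothetical
p_x5 = np.array([[0.9, 0.1], [0.3, 0.7]])        # P(X5 | X3), hypothetical

a = 1                                            # condition on X2 = a

# f_{X2=a}(X3) = sum_X1 P(X3 | X1, X2=a) P(X1)   -- eliminate X1
f_x3 = np.einsum('ij,i->j', p_x3[:, a, :], p_x1)
# g_{X2=a}(X5) = sum_X3 P(X5 | X3) f_{X2=a}(X3)  -- eliminate X3
g_x5 = np.einsum('ij,i->j', p_x5, f_x3)

print(g_x5, g_x5.sum())  # P(X5 | X2=a); sums to 1, and X4 dropped out entirely
```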

SLIDE 26
SLIDE 27

Estimating Parameters of a Bayesian Network

  • Maximum Likelihood Estimation
  • Also sometimes Maximum Pseudolikelihood estimation
SLIDE 28

How to estimate parameters of a Bayesian Network?

(1) You have observed all Y, X variables and the dependency structure is known

If you remember from other lectures:

Likelihood(D; Parameters) = ∏_{Dj in data} P(Dj | Parameters)
 = ∏_{Dj in data} ∏_{Xij in Dj} P(Xij | Par(Xij), Parameters_{Par(Xij) → Xij})
 = ∏_{i in variable set} ∏_{Dj in data} P(Xij | Par(Xij), Parameters_{Par(Xij) → Xij})
 = ∏_{i in variable set} (independent local terms, each a function of the observed Xij and Par(Xij))

MLE-Parameters_{Par(Xi) → Xi} = argmax (local likelihood of the observed Xij and Par(Xij) in the data!)

SLIDE 29

How to estimate parameters of a Bayesian Network?

(1) You have observed all Y, X variables and the dependency structure is known

  • If variables are discrete:

P(Xi = a | Parents(Xi) = B) = Count(Xi == a & Pa(Xi) == B) / Count(Pa(Xi) == B)
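In code, the counting estimator is a few lines (a sketch; estimate_cpt and the random data matrix are illustrative, one row per sample and one column per variable):

```python
import numpy as np

def estimate_cpt(data, child, parents, n_states=2):
    """MLE of P(child | parents) by counting co-occurrences."""
    counts = np.zeros([n_states] * len(parents) + [n_states])
    for row in data:
        counts[tuple(row[p] for p in parents) + (row[child],)] += 1
    return counts / counts.sum(axis=-1, keepdims=True)  # normalize over child

# Hypothetical dataset with columns [X1, X2, X3, X4, X5]:
data = np.random.default_rng(0).integers(0, 2, size=(1000, 5))
print(estimate_cpt(data, child=3, parents=[1]))  # estimate of P(X4 | X2)
```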
SLIDE 30

How to estimate parameters of a Bayesian Network?

(1) You have observed all Y, X variables and the dependency structure is known

  • If variables are discrete:

P(Xi = a | Parents(Xi) = B) = Count(Xi == a & Pa(Xi) == B) / Count(Pa(Xi) == B)

  • If variables are continuous:

P(Xi = a | Parents(Xi) = B) = fit Some_PDF_Function(a, B)

SLIDE 31

How to estimate parameters of a Bayesian Network?

(1) You have observed all Y, X variables and the dependency structure is known

P(Xi = a | Parents(Xi) = B) = Some_PDF_Function(a, B), e.g.:
  • A single multivariate Gaussian
  • A mixture of multivariate Gaussians
  • Non-parametric density functions
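For the continuous case, one common concrete choice is a linear-Gaussian CPD; here is a sketch on simulated data (np.polyfit stands in for any regression fit):

```python
import numpy as np

rng = np.random.default_rng(1)
parent = rng.normal(size=2000)                           # simulated parent
child = 2.0 * parent + rng.normal(scale=0.5, size=2000)  # simulated child

# Fit P(child | parent) = Normal(w * parent + b, sigma^2)
w, b = np.polyfit(parent, child, deg=1)
sigma = np.std(child - (w * parent + b))
print(w, b, sigma)  # roughly 2.0, 0.0, 0.5
```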

SLIDE 32

How to estimate parameters of a Bayesian Network?

(2) You have observed all Y, X variables, but the dependency structure is NOT known

SLIDE 33

Structure learning when all variables are observed

1) Neighborhood selection:

  • Lasso: L1-regularized regression per variable, learned using the other variables.
  • Not necessarily a tree structure

2) Tree learning via the Chow-Liu method:

  • Per variable pair, find the empirical distribution P(Xi, Xj) = Count(Xi, Xj) / M
  • Per variable pair, compute the mutual information I(Xi, Xj)
  • Use I(Xi, Xj) as edge weights in a graph and learn the maximum spanning tree (a sketch follows below).
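A compact Chow-Liu sketch for binary data (the data matrix is hypothetical; the maximum spanning tree is obtained by running scipy's minimum spanning tree on the negated mutual-information matrix):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def mutual_info(x, y, eps=1e-12):
    """Empirical mutual information between two binary columns."""
    pxy = np.histogram2d(x, y, bins=2)[0] / len(x)
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    return float(np.sum(pxy * np.log((pxy + eps) / (np.outer(px, py) + eps))))

data = np.random.default_rng(2).integers(0, 2, size=(500, 5))  # hypothetical
n = data.shape[1]
mi = np.zeros((n, n))                      # symmetric pairwise MI, zero diagonal
for i in range(n):
    for j in range(i + 1, n):
        mi[i, j] = mi[j, i] = mutual_info(data[:, i], data[:, j])

tree = minimum_spanning_tree(-mi)          # max spanning tree of the MI weights
print(np.transpose(tree.nonzero()))        # learned tree edges as [i, j] pairs
```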
SLIDE 34

How to estimate parameters of a Bayesian Network?

(3) You have unobserved variables, but the dependency structure is known

Most commonly used Bayesian Networks these days!

SLIDE 35

In practice, Bayes Nets are most used to inject priors and structure into the task

Modeling documents as a collection of topics where each topic is a distribution over words: Topic Modeling via Latent Dirichlet Allocation

SLIDE 36

In Practice, Bayes Nets are most used to inject priors and structure

Correcting for hidden confounders in expression data

SLIDE 37

In practice, Bayes Nets are most used to inject priors and structure

Correcting for hidden confounders in expression data

SLIDE 38

Estimation/Inference with missing values

1) Sometimes P(observed) = Σ_unobserved P(observed, unobserved) has a closed form!
   a) Combining Gaussian conditionals and priors usually leads to Gaussian marginals (closed form)
   b) If your prior distribution on the latent variables is conjugate to the conditional distribution, you get a closed form
      i) Lots of known pairs of distributions: Gaussian and Gaussian; Dirichlet and Multinomial; Gamma and Poisson; etc.
2) Expectation Maximization (EM) (see the sketch after this list)
   a) Initialize parameters randomly.
   b) Do inference (E step): MAP-estimate the most likely values of the unobserved variables
   c) Re-estimate (M step): MLE re-estimate the parameters
   d) Iterate (b) and (c) until the parameters converge
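As a toy version of this loop, here is a hard-EM sketch for a mixture of two 1-D Gaussians with known, equal variances (all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])  # data

mu = np.array([-1.0, 1.0])   # (a) initialize the two unknown means
for _ in range(50):
    # (b) E step: MAP-assign each point to its most likely component
    # (nearest mean, since variances and mixing weights are equal here)
    z = np.abs(x[:, None] - mu[None, :]).argmin(axis=1)
    # (c) M step: MLE re-estimate each mean from its assigned points
    mu = np.array([x[z == k].mean() for k in (0, 1)])
    # (d) a fixed iteration count stands in for a convergence test
print(mu)  # approximately [-2, 3]
```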

SLIDE 39

Estimation/Inference with missing values

3) Gibbs sampling or MCMC (a sketch follows below)
   a) Initialize randomly.
   b) Repeatedly sample xi ~ P(xi | everything else).
   c) Burn-in: sweep over the variables and draw thousands of samples sequentially.
   d) Eventually (provably), you will be sampling from the true distribution! Use those samples to compute anything you want. (Note that in those samples, all variables are observed.)
4) Variational inference (approximate with another model which HAS a closed form)
   a) Find a functional mapping from the probability under the original Bayesian model to the probability under a 'simpler' model (per data point)
   b) Estimation = minimize the gap between the two distributions
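A minimal Gibbs sketch for a model where every conditional is known exactly: a zero-mean bivariate Gaussian with correlation rho (rho and the sample counts are made up):

```python
import numpy as np

rng = np.random.default_rng(4)
rho = 0.8                       # target correlation; conditionals are Gaussian
x = y = 0.0                     # arbitrary initialization
samples = []
for t in range(20000):
    # Sample each variable given everything else (here, the other variable):
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))   # P(x | y)
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))   # P(y | x)
    if t >= 1000:               # discard burn-in samples
        samples.append((x, y))

print(np.corrcoef(np.array(samples).T)[0, 1])      # close to 0.8
```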

SLIDE 40

Example of EM for estimating Hidden Markov Model Parameters

[Figure: an HMM chain with hidden states Y1 … Y6, each emitting an observation X1 … X6]

P(X, Y) = P(Y1) P(X1 | Y1) ∏_{i≥2} P(Yi | Yi−1) P(Xi | Yi)
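The E step for an HMM is built on dynamic programming; for instance, the forward pass below computes P(X) under fixed parameters (the transition and emission tables are hypothetical):

```python
import numpy as np

pi = np.array([0.5, 0.5])                  # P(Y1), hypothetical
A = np.array([[0.9, 0.1], [0.2, 0.8]])     # A[i, j] = P(Y_t = j | Y_{t-1} = i)
B = np.array([[0.7, 0.3], [0.1, 0.9]])     # B[i, k] = P(X_t = k | Y_t = i)
obs = [0, 0, 1, 1, 0, 1]                   # an observed sequence X1..X6

alpha = pi * B[:, obs[0]]                  # alpha_1(i) = P(X1, Y1 = i)
for x in obs[1:]:
    alpha = (alpha @ A) * B[:, x]          # alpha_t = (alpha_{t-1} A) * emission
print(alpha.sum())                         # likelihood P(X1, ..., X6)
```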

SLIDE 41

Gibbs Sampling for all variants of models. Let your imagination go wild!

SLIDE 42

Problems with Bayesian Networks

  • The prior has to take the form of a conditional probability. What if the variables are symmetric?
  • Bayes Nets can't have loops.
  • What if the relationship can be described in an unnormalized way (i.e., as an energy)?

SLIDE 43

Undirected Graphical Models (aka Markov Random Fields)

  • They come from the world of statistical physics, where they model energies and electron spins.
  • Define the joint probability as a normalized product of factors (i.e., energies) over cliques of variables:

P(X1, …, XD) = (1/Z) ∏_{Ci = subsets of X1..XD} f(Ci),  where  Z = Σ_{x1, x2, …, xD} ∏_{Ci} f(Ci)

  • In practice, people often use pairwise and node-wise factors only.

○ Often called edge and node potentials

  • The main problem with these models: how do we estimate Z?! (A brute-force sketch follows below.)
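For a tiny model, Z can simply be brute-forced; here is a sketch of a 3-node binary chain MRF with made-up pairwise potentials:

```python
import numpy as np
from itertools import product

# Hypothetical pairwise potentials on the chain x1 - x2 - x3 (binary states).
f12 = np.array([[5.0, 1.0], [1.0, 5.0]])   # rewards x1 == x2
f23 = np.array([[5.0, 1.0], [1.0, 5.0]])   # rewards x2 == x3

Z = sum(f12[x1, x2] * f23[x2, x3]
        for x1, x2, x3 in product((0, 1), repeat=3))

def p(x1, x2, x3):
    """Normalized probability of one joint configuration."""
    return f12[x1, x2] * f23[x2, x3] / Z

print(Z, p(0, 0, 0))   # Z = 72.0, p(0,0,0) = 25/72 ≈ 0.347
```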
SLIDE 44

Conditional Independencies in Markov Random Fields

We assume one edge for every pairwise potential. By the definition of undirected graphical models: a variable Xi is conditionally independent of another variable Xj if, on every path that goes from Xi to Xj, at least one variable is observed.

SLIDE 45

Example: Gaussian Graphical Models

They are equivalent to a multivariate Gaussian distribution, P(x) ∝ exp(−½ xᵀΘx + hᵀx), with covariance Σ = Θ⁻¹ and mean μ = Θ⁻¹h, where a zero entry in the precision matrix Θ corresponds to a missing edge. They easily allow conditional independence decisions, especially during inference.

SLIDE 46

Computing Z (Normalization factor)

Note: Z is a function of the parameters, not of the samples. So:
  • Without Z, you can still compute some conditional probabilities
  • But you need Z to compute:
    ○ MAP estimates
    ○ Actual probabilities
  • Just like with Bayes Nets: you can use the sum-product method to compute Z

SLIDE 47

Factor graph representation of MRFs

P(X) = (1/Z) f1(x1,x2) f2(x2,x3,x4) f3(x3,x5) f4(x4,x6)

Z = Σ_{x1,…,x6} f1(x1,x2) f2(x2,x3,x4) f3(x3,x5) f4(x4,x6)
  = Σ_{x1,x2} f1(x1,x2) Σ_{x3,x4} f2(x2,x3,x4) (Σ_{x5} f3(x3,x5)) (Σ_{x6} f4(x4,x6))
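The same distributive-law saving can be demonstrated with numpy's einsum, which eliminates variables one at a time when asked to optimize the contraction order (the factor tables below are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(5)
f1 = rng.random((2, 2))           # f1(x1, x2)
f2 = rng.random((2, 2, 2))        # f2(x2, x3, x4)
f3 = rng.random((2, 2))           # f3(x3, x5)
f4 = rng.random((2, 2))           # f4(x4, x6)

# Summing out all six variables of the factor product gives Z.
Z_naive = np.einsum('ab,bcd,ce,df->', f1, f2, f3, f4, optimize=False)
Z_smart = np.einsum('ab,bcd,ce,df->', f1, f2, f3, f4, optimize=True)
print(Z_naive, Z_smart)           # same number; the cost differs at scale
```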

SLIDE 48

Belief Propagation Algorithm

Kschischang, Frank R., Brendan J. Frey, and Hans-Andrea Loeliger. "Factor graphs and the sum-product algorithm." IEEE Transactions on Information Theory 47.2 (2001): 498-519.
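To give a minimal flavor of the message passing, here is a sketch on the 3-node chain MRF used earlier: the two messages into x2 multiply into its (unnormalized) marginal. This is not a general factor-graph implementation:

```python
import numpy as np

f12 = np.array([[5.0, 1.0], [1.0, 5.0]])   # chain x1 - x2 - x3, as before
f23 = np.array([[5.0, 1.0], [1.0, 5.0]])

# Sum-product messages into x2 (sum over the sending variable's states):
m1_to_2 = f12.sum(axis=0)        # m(x2) = sum_x1 f12(x1, x2)
m3_to_2 = f23.sum(axis=1)        # m(x2) = sum_x3 f23(x2, x3)

belief = m1_to_2 * m3_to_2       # unnormalized marginal of x2
print(belief / belief.sum())     # P(x2) = [0.5, 0.5] here, by symmetry
```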

SLIDE 49

Some notes on Belief Propagation/Inference in MRFs

  • If the structure doesn't have a loop, the results are exact.
  • If the structure is loopy, people still use loopy BP for inferring Z.

○ Keep passing messages until the messages converge ○ Some theoretical properties of the convergence exist

  • Sometimes the messages don't have a closed form.

○ Use approximations to keep them within closed form ■ E.g., if the D incoming messages are mixtures of K Gaussians, the outgoing message is a mixture of K^D Gaussians; re-approximate it with K new Gaussians ○ Variants of this method exist, e.g. expectation propagation

  • If you replace sum with max, you get MAP estimates at the same time complexity

SLIDE 50

Related Topics (No time to cover)

Generative Adversarial Networks

  • Another method to generate samples, but without factorizing the probability
  • Useful when conditional independencies are bad assumptions
  • Useful for highly correlated data like images, sounds, etc.

Deep variational inference: make the function that maps the two distributions more powerful, and optimize it via gradient descent

Probabilistic programming! http://probabilistic-programming.org/wiki/Home

Nonparametric models (Dirichlet processes) & kernel-based graphical models

Causal inference and Bayesian Networks

SLIDE 51

Back to the big picture

PGMs give you a full model of the task:

  • You can inject prior information into your model
  • You can use partial data for better estimation
  • They give you justifications for your results
  • They are easy to interpret and allow humans to form hypotheses
  • If your data changes, you can keep parts of the model and re-estimate the other parts

This comes with costs:

  • You're making independence assumptions: often wrong
  • You're multiplying a ton of factors: errors can grow exponentially
  • Inference can be slow if you need sampling