SLIDE 1

Probabilistic Graphical Models

David Sontag

New York University

Lecture 1, January 31, 2013

SLIDE 2

One of the most exciting advances in machine learning (AI, signal processing, coding, control, . . .) in the last decades

SLIDE 3

How can we gain global insight based on local observations?

SLIDE 4

Key idea

1. Represent the world as a collection of random variables X1, . . . , Xn with joint distribution p(X1, . . . , Xn)

2. Learn the distribution from data

3. Perform “inference” (compute conditional distributions p(Xi | X1 = x1, . . . , Xm = xm))

SLIDE 5

Reasoning under uncertainty

As humans, we are continuously making predictions under uncertainty.

Classical AI and ML research ignored this phenomenon.

Many of the most recent advances in technology are possible because of this new, probabilistic, approach.

SLIDE 6

Applications: Deep question answering

SLIDE 7

Applications: Machine translation

SLIDE 8

Applications: Speech recognition

SLIDE 9

Applications: Stereo vision

Input: two images

Output: disparity

SLIDE 10

Key challenges

1. Represent the world as a collection of random variables X1, . . . , Xn with joint distribution p(X1, . . . , Xn)
   - How does one compactly describe this joint distribution?
   - Directed graphical models (Bayesian networks)
   - Undirected graphical models (Markov random fields, factor graphs)

2. Learn the distribution from data
   - Maximum likelihood estimation. Other estimation methods?
   - How much data do we need? How much computation does it take?

3. Perform “inference” (compute conditional distributions p(Xi | X1 = x1, . . . , Xm = xm))

SLIDE 11

Syllabus overview

We will study Representation, Inference & Learning.

First in the simplest case:
- Only discrete variables
- Fully observed models
- Exact inference & learning

Then generalize:
- Continuous variables
- Partially observed data during learning (hidden variables)
- Approximate inference & learning

Learn about algorithms, theory & applications.

SLIDE 12

Logistics: class

Class webpage: http://cs.nyu.edu/~dsontag/courses/pgm13/
- Sign up for mailing list!
- Draft slides posted before each lecture

Book: Probabilistic Graphical Models: Principles and Techniques by Daphne Koller and Nir Friedman, MIT Press (2009)
- Required readings for each lecture posted to course website
- Many additional reference materials available!

Office hours: Wednesday 5-6pm and by appointment. 715 Broadway, 12th floor, Room 1204

Teaching Assistant: Li Wan (wanli@cs.nyu.edu)

Li’s office hours: Monday 5-6pm. 715 Broadway, Room 1231

SLIDE 13

Logistics: prerequisites & grading

Prerequisites:
- Previous class on machine learning
- Basic concepts from probability and statistics
- Algorithms (e.g., dynamic programming, graphs, complexity)
- Calculus

Grading: problem sets (65%) + in-class final exam (30%) + participation (5%)
- Class attendance is required
- 7-8 assignments (every 1–2 weeks), both theory and programming
- First homework out today, due next Thursday (Feb. 7) at 5pm
- Important: see collaboration policy on class webpage
- Solutions to the theoretical questions require formal proofs
- For the programming assignments, I recommend Python, Java, or Matlab. Do not use C++.

SLIDE 14

Review of probability: outcomes

Reference: Chapter 2 and Appendix A

An outcome space specifies the possible outcomes that we would like to reason about, e.g.

- Coin toss: Ω = { Heads, Tails }
- Die toss: Ω = { 1, 2, 3, 4, 5, 6 }

We specify a probability p(ω) for each outcome ω such that

p(ω) ≥ 0,   ∑_{ω∈Ω} p(ω) = 1

E.g., for a biased coin, p(Heads) = .6, p(Tails) = .4

SLIDE 15

Review of probability: events

An event is a subset of the outcome space, e.g.

- Odd die tosses: O = { 1, 3, 5 }
- Even die tosses: E = { 2, 4, 6 }

The probability of an event is given by the sum of the probabilities of the outcomes it contains,

p(E) = ∑_{ω∈E} p(ω)

E.g., p(E) = p(2) + p(4) + p(6) = 1/2, if fair die
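To make this concrete, here is a minimal Python sketch (not from the slides; Python is one of the course's recommended languages) that stores an outcome distribution as a dictionary and sums it over an event, reproducing the fair-die example above.

```python
# Minimal sketch: an outcome space as a dict from outcomes to probabilities.

def event_prob(p, event):
    """p(E) = sum of p(omega) for omega in E."""
    return sum(p[omega] for omega in event)

# Fair six-sided die: p(omega) = 1/6 for each outcome.
die = {omega: 1.0 / 6 for omega in range(1, 7)}

odd = {1, 3, 5}   # event O
even = {2, 4, 6}  # event E

assert abs(sum(die.values()) - 1.0) < 1e-9  # probabilities sum to 1
print(event_prob(die, even))                # 0.5, matching the slide
```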

SLIDE 16

Independence of events

Two events A and B are independent if p(A ∩ B) = p(A)p(B)

Are these two events independent? (Venn diagram: A and B are two disjoint single-outcome events of one die.)

No! p(A ∩ B) = 0, while p(A)p(B) = (1/6)²

Now suppose our outcome space had two different dice:

Ω = { (1,1), (1,2), . . . , (6,6) }   two die tosses, 6² = 36 outcomes

and the probability distribution is such that each die is independent: the probability of a pair of faces is the product of the two faces' individual probabilities.

SLIDE 17

Independence of events

Two events A and B are independent if p(A ∩ B) = p(A)p(B)

Are these two events independent? (Here A is an event about the first die only and B is an event about the second die only.)

p(A) = p(first die's outcome),   p(B) = p(second die's outcome)

Yes! Since the dice are independent, p(A ∩ B) = p(A)p(B) = (1/6)²
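A minimal Python sketch (not from the slides) of the same independence check on the 36-outcome two-dice space; the particular events chosen here (each die showing a 6) are illustrative assumptions.

```python
# Minimal sketch: checking p(A ∩ B) = p(A)p(B) on the two-dice outcome space.
from itertools import product

# Uniform distribution over ordered pairs (die1, die2).
outcomes = list(product(range(1, 7), repeat=2))
p = {pair: 1.0 / 36 for pair in outcomes}

A = {pair for pair in outcomes if pair[0] == 6}  # first die shows 6
B = {pair for pair in outcomes if pair[1] == 6}  # second die shows 6

p_A = sum(p[w] for w in A)
p_B = sum(p[w] for w in B)
p_AB = sum(p[w] for w in A & B)

print(p_AB, p_A * p_B)  # both ≈ 1/36, so A and B are independent
```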

SLIDE 18

Conditional probability

(Venn diagram: conditioning on B restricts attention to the region A ∩ B)

Let A, B be events with p(B) > 0. The conditional probability is

p(A | B) = p(A ∩ B) / p(B)

Claim 1: ∑_{ω∈S} p(ω | S) = 1, for any event S with p(S) > 0

Claim 2: If A and B are independent, then p(A | B) = p(A)

SLIDE 19

Two important rules

1. Chain rule

   Let S1, . . . , Sn be events with p(Si) > 0.

   p(S1 ∩ S2 ∩ · · · ∩ Sn) = p(S1) p(S2 | S1) · · · p(Sn | S1, . . . , Sn−1)

2. Bayes’ rule

   Let S1, S2 be events with p(S1) > 0 and p(S2) > 0.

   p(S1 | S2) = p(S1 ∩ S2) / p(S2) = p(S2 | S1) p(S1) / p(S2)
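As a concrete illustration of Bayes' rule (not from the slides), the following sketch computes a posterior from a prior and two likelihoods; all numbers are hypothetical.

```python
# Minimal sketch: p(S1 | S2) = p(S2 | S1) p(S1) / p(S2) with made-up numbers.

p_disease = 0.01                # p(S1): prior probability of the disease
p_pos_given_disease = 0.95      # p(S2 | S1): test sensitivity
p_pos_given_healthy = 0.05      # p(S2 | not S1): false positive rate

# Marginalize to get p(S2), the overall probability of a positive test.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' rule: posterior probability of disease given a positive test.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ≈ 0.161
```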

SLIDE 20

Discrete random variables

Often each outcome corresponds to a setting of various attributes (e.g., “age”, “gender”, “hasPneumonia”, “hasDiabetes”)

A random variable X is a mapping X : Ω → D
- D is some set (e.g., the integers)
- Induces a partition of all outcomes Ω

For some x ∈ D, we say

p(X = x) = p({ω ∈ Ω : X(ω) = x})

“probability that variable X assumes state x”

Notation: Val(X) = set D of all values assumed by X (will interchangeably call these the “values” or “states” of variable X)

p(X) is a distribution: ∑_{x∈Val(X)} p(X = x) = 1

SLIDE 21

Multivariate distributions

Instead of one random variable, we have a random vector X(ω) = [X1(ω), . . . , Xn(ω)]

Xi = xi is an event. The joint distribution p(X1 = x1, . . . , Xn = xn) is simply defined as p(X1 = x1 ∩ · · · ∩ Xn = xn)

We will often write p(x1, . . . , xn) instead of p(X1 = x1, . . . , Xn = xn)

Conditioning, chain rule, Bayes’ rule, etc. all apply

SLIDE 22

Working with random variables

For example, the conditional distribution

p(X1 | X2 = x2) = p(X1, X2 = x2) / p(X2 = x2).

This notation means

p(X1 = x1 | X2 = x2) = p(X1 = x1, X2 = x2) / p(X2 = x2)   ∀x1 ∈ Val(X1)

Two random variables are independent, X1 ⊥ X2, if

p(X1 = x1, X2 = x2) = p(X1 = x1) p(X2 = x2)

for all values x1 ∈ Val(X1) and x2 ∈ Val(X2).

SLIDE 23

Example

Consider three binary-valued random variables X1, X2, X3 with Val(Xi) = {0, 1}

Let the outcome space Ω be the cross-product of their states: Ω = Val(X1) × Val(X2) × Val(X3)

Xi(ω) is the value for Xi in the assignment ω ∈ Ω

Specify p(ω) for each outcome ω ∈ Ω by a big table:

  x1  x2  x3   p(x1, x2, x3)
   0   0   0       .11
   0   0   1       .02
   .   .   .       . . .
   1   1   1       .05

How many parameters do we need to specify? 2³ − 1
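A minimal sketch (not from the slides) of storing such a table explicitly; the uniform probabilities are placeholders, and the point is the 2³ entries and 2³ − 1 free parameters.

```python
# Minimal sketch: an explicit joint table over three binary variables.
from itertools import product

# One entry per assignment (x1, x2, x3): 2**3 = 8 entries in total.
joint = {assignment: 1.0 / 8 for assignment in product([0, 1], repeat=3)}

num_entries = len(joint)            # 8
num_free_params = num_entries - 1   # 7 = 2**3 - 1, since the entries sum to 1
print(num_entries, num_free_params)
```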

SLIDE 24

Marginalization

Suppose X and Y are random variables with distribution p(X, Y)
- X: Intelligence, Val(X) = {“Very High”, “High”}
- Y: Grade, Val(Y) = {“a”, “b”}

Joint distribution specified by:

           X = vh    X = h
  Y = a      0.7      0.15
  Y = b      0.1      0.05

p(Y = a) = ?  It is 0.7 + 0.15 = 0.85

More generally, suppose we have a joint distribution p(X1, . . . , Xn). Then

p(Xi = xi) = ∑_{x1} ∑_{x2} · · · ∑_{xi−1} ∑_{xi+1} · · · ∑_{xn} p(x1, . . . , xn)
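The same marginalization can be spelled out in a few lines of Python (a sketch, not from the slides), using the joint table from this slide:

```python
# Minimal sketch: marginalizing p(X, Y) to recover p(Y = a).

# Joint distribution p(X, Y) from the slide, keyed by (x, y).
joint = {
    ("vh", "a"): 0.7, ("h", "a"): 0.15,
    ("vh", "b"): 0.1, ("h", "b"): 0.05,
}

def marginal_y(joint, y):
    """p(Y = y) = sum over x of p(X = x, Y = y)."""
    return sum(p for (x, y2), p in joint.items() if y2 == y)

print(marginal_y(joint, "a"))  # ≈ 0.85, matching the slide
```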

SLIDE 25

Conditioning

Suppose X and Y are random variables with distribution p(X, Y)
- X: Intelligence, Val(X) = {“Very High”, “High”}
- Y: Grade, Val(Y) = {“a”, “b”}

           X = vh    X = h
  Y = a      0.7      0.15
  Y = b      0.1      0.05

Can compute the conditional probability

p(Y = a | X = vh) = p(Y = a, X = vh) / p(X = vh)
                  = p(Y = a, X = vh) / [p(Y = a, X = vh) + p(Y = b, X = vh)]
                  = 0.7 / (0.7 + 0.1) = 0.875
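And the conditioning computation, again as a small sketch (not from the slides) over the same table:

```python
# Minimal sketch: conditioning p(X, Y) to get p(Y = a | X = vh).

joint = {
    ("vh", "a"): 0.7, ("h", "a"): 0.15,
    ("vh", "b"): 0.1, ("h", "b"): 0.05,
}

def conditional_y_given_x(joint, y, x):
    """p(Y = y | X = x) = p(X = x, Y = y) / p(X = x)."""
    p_x = sum(p for (xx, _), p in joint.items() if xx == x)
    return joint[(x, y)] / p_x

print(conditional_y_given_x(joint, "a", "vh"))  # ≈ 0.875, matching the slide
```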

SLIDE 26

Example: Medical diagnosis

Variable for each symptom (e.g. “fever”, “cough”, “fast breathing”, “shaking”, “nausea”, “vomiting”)

Variable for each disease (e.g. “pneumonia”, “flu”, “common cold”, “bronchitis”, “tuberculosis”)

Diagnosis is performed by inference in the model:

p(pneumonia = 1 | cough = 1, fever = 1, vomiting = 0)

One famous model, Quick Medical Reference (QMR-DT), has 600 diseases and 4000 findings

SLIDE 27

Representing the distribution

Naively, we could represent a multivariate distribution with a table of probabilities for each outcome (assignment)

How many outcomes are there in QMR-DT? 2^4600

Estimation of the joint distribution would require a huge amount of data

Inference of conditional probabilities, e.g.

p(pneumonia = 1 | cough = 1, fever = 1, vomiting = 0)

would require summing over exponentially many variables’ values

Moreover, this defeats the purpose of probabilistic modeling, which is to make predictions with previously unseen observations

SLIDE 28

Structure through independence

If X1, . . . , Xn are independent, then

p(x1, . . . , xn) = p(x1) p(x2) · · · p(xn)

2^n entries can be described by just n numbers (if |Val(Xi)| = 2)!

However, this is not a very useful model – observing a variable Xi cannot influence our predictions of Xj

If X1, . . . , Xn are conditionally independent given Y, denoted as Xi ⊥ X−i | Y, then

p(y, x1, . . . , xn) = p(y) p(x1 | y) ∏_{i=2}^{n} p(xi | x1, . . . , xi−1, y)
                    = p(y) p(x1 | y) ∏_{i=2}^{n} p(xi | y).

This is a simple, yet powerful, model
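A small sketch (not from the slides) tabulating these parameter counts for n binary variables; the conditionally-independent count assumes a binary class variable Y.

```python
# Minimal sketch: free parameters needed for n binary variables under the
# full joint, full independence, and conditional independence given binary Y.

def param_counts(n):
    full_joint = 2 ** n - 1               # one entry per assignment, minus sum-to-1
    fully_independent = n                 # one p(Xi = 1) per variable
    cond_independent_given_y = 1 + 2 * n  # p(Y = 1) plus p(Xi = 1 | y) for y in {0, 1}
    return full_joint, fully_independent, cond_independent_given_y

print(param_counts(10))  # (1023, 10, 21)
# At QMR-DT scale (thousands of binary variables) the full joint is astronomical,
# while the factored models stay linear in n.
```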

SLIDE 29

Example: naive Bayes for classification

Classify e-mails as spam (Y = 1) or not spam (Y = 0)
- Let 1 : n index the words in our vocabulary (e.g., English)
- Xi = 1 if word i appears in an e-mail, and 0 otherwise
- E-mails are drawn according to some distribution p(Y, X1, . . . , Xn)

Suppose that the words are conditionally independent given Y. Then,

p(y, x1, . . . , xn) = p(y) ∏_{i=1}^{n} p(xi | y)

Estimate the model with maximum likelihood. Predict with:

p(Y = 1 | x1, . . . , xn) = p(Y = 1) ∏_{i=1}^{n} p(xi | Y = 1) / ∑_{y∈{0,1}} p(Y = y) ∏_{i=1}^{n} p(xi | Y = y)

Are the independence assumptions made here reasonable?

Philosophy: Nearly all probabilistic models are “wrong”, but many are nonetheless useful
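A minimal sketch (not from the slides) of this prediction rule with a hypothetical three-word vocabulary and made-up parameter values:

```python
# Minimal sketch: naive Bayes posterior p(Y = 1 | x) via the formula above.

p_y = {0: 0.7, 1: 0.3}                   # p(Y): prior over not-spam / spam
p_x_given_y = {                          # p(X_i = 1 | Y = y) for each word i
    0: [0.10, 0.20, 0.05],
    1: [0.60, 0.25, 0.40],
}

def posterior_spam(x):
    """Return p(Y = 1 | x1, ..., xn) for a binary word-presence vector x."""
    scores = {}
    for y in (0, 1):
        score = p_y[y]
        for i, xi in enumerate(x):
            theta = p_x_given_y[y][i]
            score *= theta if xi == 1 else (1 - theta)
        scores[y] = score
    return scores[1] / (scores[0] + scores[1])  # normalize over y in {0, 1}

print(posterior_spam([1, 0, 1]))  # posterior probability the e-mail is spam
```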

SLIDE 30

Bayesian networks

Reference: Chapter 3

A Bayesian network is specified by a directed acyclic graph G = (V, E) with:

1. One node i ∈ V for each random variable Xi
2. One conditional probability distribution (CPD) per node, p(xi | xPa(i)), specifying the variable’s probability conditioned on its parents’ values

Corresponds 1-1 with a particular factorization of the joint distribution:

p(x1, . . . , xn) = ∏_{i∈V} p(xi | xPa(i))

Powerful framework for designing algorithms to perform probability computations

SLIDE 31

Example

Consider the following Bayesian network over Difficulty (D), Intelligence (I), Grade (G), SAT (S), and Letter (L), with CPDs:

  p(D):  d0 = 0.6,  d1 = 0.4
  p(I):  i0 = 0.7,  i1 = 0.3

  p(S | I):          s0     s1
      i0            0.95   0.05
      i1            0.2    0.8

  p(G | I, D):       g1     g2     g3
      i0, d0        0.3    0.4    0.3
      i0, d1        0.05   0.25   0.7
      i1, d0        0.9    0.08   0.02
      i1, d1        0.5    0.3    0.2

  p(L | G):          l0     l1
      g1            0.1    0.9
      g2            0.4    0.6
      g3            0.99   0.01

What is its joint distribution?

p(x1, . . . , xn) = ∏_{i∈V} p(xi | xPa(i))

p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)
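A minimal sketch (not from the slides) that evaluates the joint probability of one full assignment by multiplying one entry from each CPD, using the values shown in the example above:

```python
# Minimal sketch: p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g).

p_d = {"d0": 0.6, "d1": 0.4}
p_i = {"i0": 0.7, "i1": 0.3}
p_s_given_i = {"i0": {"s0": 0.95, "s1": 0.05}, "i1": {"s0": 0.2, "s1": 0.8}}
p_g_given_id = {
    ("i0", "d0"): {"g1": 0.3,  "g2": 0.4,  "g3": 0.3},
    ("i0", "d1"): {"g1": 0.05, "g2": 0.25, "g3": 0.7},
    ("i1", "d0"): {"g1": 0.9,  "g2": 0.08, "g3": 0.02},
    ("i1", "d1"): {"g1": 0.5,  "g2": 0.3,  "g3": 0.2},
}
p_l_given_g = {"g1": {"l0": 0.1,  "l1": 0.9},
               "g2": {"l0": 0.4,  "l1": 0.6},
               "g3": {"l0": 0.99, "l1": 0.01}}

def joint(d, i, g, s, l):
    """Product of one entry from each CPD, following the network structure."""
    return (p_d[d] * p_i[i] * p_g_given_id[(i, d)][g]
            * p_s_given_i[i][s] * p_l_given_g[g][l])

print(joint("d0", "i1", "g1", "s1", "l1"))  # 0.6 * 0.3 * 0.9 * 0.8 * 0.9
```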

SLIDE 32

More examples

p(x1, . . . , xn) = ∏_{i∈V} p(xi | xPa(i))

Will my car start this morning?

Heckerman et al., Decision-Theoretic Troubleshooting, 1995

SLIDE 33

More examples

p(x1, . . . , xn) = ∏_{i∈V} p(xi | xPa(i))

What is the differential diagnosis?

Beinlich et al., The ALARM Monitoring System, 1989

SLIDE 34

Bayesian networks are generative models

Naive Bayes model: the label Y is the parent of the features X1, X2, X3, . . . , Xn

Evidence is denoted by shading in a node

Can interpret a Bayesian network as a generative process. For example, to generate an e-mail, we

1. Decide whether it is spam or not spam, by sampling y ∼ p(Y)
2. For each word i = 1 to n, sample xi ∼ p(Xi | Y = y)
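A minimal sketch (not from the slides) of this generative process for a naive Bayes model, with hypothetical parameters and a three-word vocabulary:

```python
# Minimal sketch: ancestral sampling from a naive Bayes model.
import random

p_spam = 0.3                              # p(Y = 1)
p_word_given_y = {0: [0.10, 0.20, 0.05],  # p(X_i = 1 | Y = y), 3-word vocabulary
                  1: [0.60, 0.25, 0.40]}

def sample_email(rng=random):
    # Step 1: sample the label y ~ p(Y).
    y = 1 if rng.random() < p_spam else 0
    # Step 2: sample each word indicator x_i ~ p(X_i | Y = y).
    x = [1 if rng.random() < theta else 0 for theta in p_word_given_y[y]]
    return y, x

print(sample_email())  # e.g. (1, [1, 0, 1])
```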

SLIDE 35

Bayesian network structure implies conditional independencies!

(Student network: Difficulty, Intelligence, Grade, SAT, Letter)

The joint distribution corresponding to the above BN factors as

p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)

However, by the chain rule, any distribution can be written as

p(d, i, g, s, l) = p(d) p(i | d) p(g | i, d) p(s | i, d, g) p(l | d, i, g, s)

Thus, we are assuming the following additional independencies:

D ⊥ I,   S ⊥ {D, G} | I,   L ⊥ {I, D, S} | G.   What else?

SLIDE 36

Bayesian network structure implies conditional independencies!

Generalizing the above arguments, we obtain that a variable is independent from its non-descendants given its parents

Common parent – fixing B decouples A and C (graph: A ← B → C)

Cascade – knowing B decouples A and C (graph: A → B → C)

V-structure – knowing C couples A and B (graph: A → C ← B)

This important phenomenon is called explaining away and is what makes Bayesian networks so powerful

SLIDE 37

A simple justification (for common parent)

(Graph: A ← B → C)

We’ll show that p(A, C | B) = p(A | B) p(C | B) for any distribution p(A, B, C) that factors according to this graph structure, i.e.

p(A, B, C) = p(B) p(A | B) p(C | B)

Proof.

p(A, C | B) = p(A, B, C) / p(B) = p(A | B) p(C | B)

SLIDE 38

D-separation (“directed separated”) in Bayesian networks

Algorithm to calculate whether X ⊥ Z | Y by looking at graph separation

Look to see if there is an active path between X and Z when variables Y are observed:

(Figures (a) and (b): example three-node graphs over X, Y, and Z)

SLIDE 39

D-separation (“directed separated”) in Bayesian networks

Algorithm to calculate whether X ⊥ Z | Y by looking at graph separation

Look to see if there is an active path between X and Z when variables Y are observed:

(Figures (a) and (b): example three-node graphs over X, Y, and Z)

SLIDE 40

D-separation (“directed separated”) in Bayesian networks

Algorithm to calculate whether X ⊥ Z | Y by looking at graph separation

Look to see if there is an active path between X and Z when variables Y are observed:

(Figures (a) and (b): example three-node graphs over X, Y, and Z)

If no such path exists, then X and Z are d-separated with respect to Y

d-separation reduces statistical independencies (hard) to connectivity in graphs (easy)

Important because it allows us to quickly prune the Bayesian network, finding just the relevant variables for answering a query
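One standard way to implement such a test (a sketch, not the course's prescribed algorithm) uses the moralized ancestral graph: X ⊥ Z | Y holds iff X and Z are disconnected after restricting to the ancestors of X ∪ Y ∪ Z, "marrying" co-parents, dropping edge directions, and deleting the observed nodes Y.

```python
# Minimal sketch: d-separation via the moralized ancestral graph.
from collections import deque

def ancestors(parents, nodes):
    """All nodes in `nodes` plus their ancestors, given a node -> parents dict."""
    result, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in result:
                result.add(p)
                stack.append(p)
    return result

def d_separated(parents, X, Z, Y):
    keep = ancestors(parents, set(X) | set(Z) | set(Y))
    # Build the moralized, undirected graph on the ancestral set.
    adj = {v: set() for v in keep}
    for v in keep:
        ps = [p for p in parents.get(v, []) if p in keep]
        for p in ps:                       # parent-child edges
            adj[v].add(p); adj[p].add(v)
        for a in ps:                       # marry co-parents
            for b in ps:
                if a != b:
                    adj[a].add(b)
    # Delete observed nodes and check connectivity from X to Z.
    observed = set(Y)
    seen = set(X) - observed
    queue = deque(seen)
    while queue:
        v = queue.popleft()
        if v in Z:
            return False                   # active path found
        for u in adj[v] - observed - seen:
            seen.add(u)
            queue.append(u)
    return True

# Student network from the earlier example (node -> list of parents).
parents = {"G": ["I", "D"], "S": ["I"], "L": ["G"], "D": [], "I": []}
print(d_separated(parents, {"D"}, {"I"}, set()))  # True: D and I independent
print(d_separated(parents, {"D"}, {"I"}, {"G"}))  # False: observing G couples them
```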

SLIDE 41

D-separation example 1

(Figure: a Bayesian network over X1, X2, X3, X4, X5, X6)

SLIDE 42

D-separation example 2

(Figure: a Bayesian network over X1, X2, X3, X4, X5, X6)

SLIDE 43

2011 Turing Award was for Bayesian networks

SLIDE 44

Summary

Bayesian networks given by (G, P) where P is specified as a set of local conditional probability distributions associated with G’s nodes

One interpretation of a BN is as a generative model, where variables are sampled in topological order

Local and global independence properties identifiable via d-separation criteria

Computing the probability of any assignment is obtained by multiplying CPDs

Bayes’ rule is used to compute conditional probabilities

Marginalization or inference is often computationally difficult
