SLIDE 1

Bayesian networks Lecture 18

David Sontag New York University

SLIDE 2

Outline for today

  • Modeling sequential data (e.g., time series, speech processing) using hidden Markov models (HMMs)
  • Bayesian networks
    – Independence properties
    – Examples
    – Learning and inference

SLIDE 3

Example application: Tracking

Radar

Observe noisy measurements of missile location: Y1, Y2, … Where is the missile now? Where will it be in 10 seconds?

SLIDE 4

Probabilistic approach

  • Our measurements of the missile location were Y1, Y2, …, Yn
  • Let Xt be the true <missile location, velocity> at time t
  • To keep this simple, suppose that everything is discrete, i.e., Xt takes the values 1, …, k

Grid the space:

SLIDE 5

Probabilistic approach

  • First, we specify the conditional distribution Pr(Xt | Xt−1): from basic physics, we can bound the distance that the missile can have traveled in one time step.
  • Then, we specify Pr(Yt | Xt = <(10,20), 200 mph toward the northeast>): with probability ½, Yt = Xt (ignoring the velocity); otherwise, Yt is a uniformly chosen grid location. (Both CPDs are sketched in code below.)
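To make the two CPDs above concrete, here is a minimal Python sketch. It assumes a 10×10 grid of locations, ignores velocity, and bounds movement to one cell per time step; the grid size and step bound are illustrative choices, not from the slides.

```python
# A rough sketch (not from the slides) of the two CPDs described above,
# assuming a 10x10 grid of locations and ignoring velocity for simplicity.
import numpy as np

g = 10                # assumed grid side length, so k = g * g states
k = g * g

def reachable(s, max_step=1):
    """Cells reachable from cell s in one time step (physics bounds the distance)."""
    r, c = divmod(s, g)
    out = []
    for dr in range(-max_step, max_step + 1):
        for dc in range(-max_step, max_step + 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < g and 0 <= cc < g:
                out.append(rr * g + cc)
    return out

# Transition CPD Pr(X_t | X_{t-1}): uniform over the reachable neighboring cells.
T = np.zeros((k, k))
for s in range(k):
    nbrs = reachable(s)
    T[s, nbrs] = 1.0 / len(nbrs)

# Emission CPD Pr(Y_t | X_t): with probability 1/2 the radar reports the true cell,
# otherwise a uniformly chosen grid location.
E = np.full((k, k), 0.5 / k)
E[np.arange(k), np.arange(k)] += 0.5
```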

SLIDE 6

Hidden Markov models

  • Assume that the joint distribution on X1, X2, …, Xn and Y1, Y2, …, Yn factors as follows:

        Pr(x_1, ..., x_n, y_1, ..., y_n) = Pr(x_1) Pr(y_1 | x_1) \prod_{t=2}^{n} Pr(x_t | x_{t-1}) Pr(y_t | x_t)

    (HMMs were developed in the 1960’s)

  • To find out where the missile is now, we do marginal inference:

        Pr(x_n | y_1, ..., y_n)

  • To find the most likely trajectory, we do MAP (maximum a posteriori) inference:

        \arg\max_x Pr(x_1, ..., x_n | y_1, ..., y_n)
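As a sanity check on the factorization, here is a minimal sketch that evaluates the joint probability of a complete (x, y) sequence. The representation is an assumption for illustration: `init[i]` = Pr(X_1 = i), `T[i, j]` = Pr(X_t = j | X_{t-1} = i), and `E[i, o]` = Pr(Y_t = o | X_t = i), with integer-coded states and observations.

```python
# Evaluate Pr(x_1..x_n, y_1..y_n) = Pr(x_1) Pr(y_1 | x_1) * prod_{t>=2} Pr(x_t | x_{t-1}) Pr(y_t | x_t).
# init, T, E are assumed numpy arrays as described above; xs, ys are integer-coded sequences.
def joint_prob(xs, ys, init, T, E):
    p = init[xs[0]] * E[xs[0], ys[0]]            # Pr(x_1) Pr(y_1 | x_1)
    for t in range(1, len(xs)):
        p *= T[xs[t - 1], xs[t]] * E[xs[t], ys[t]]   # Pr(x_t | x_{t-1}) Pr(y_t | x_t)
    return p
```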

SLIDE 7

Inference

  • Recall, to find out where the missile is now, we do marginal inference:

        Pr(x_n | y_1, ..., y_n)

  • How does one compute this?
  • Applying the rule of conditional probability, we have:

        Pr(x_n | y_1, ..., y_n) = Pr(x_n, y_1, ..., y_n) / Pr(y_1, ..., y_n)
                                = Pr(x_n, y_1, ..., y_n) / \sum_{\hat{x}_n=1}^{k} Pr(\hat{x}_n, y_1, ..., y_n)

        where  Pr(x_n, y_1, ..., y_n) = \sum_{x_1, ..., x_{n-1}} Pr(x_1, ..., x_n, y_1, ..., y_n)

  • Naively, this would seem to require k^{n-1} summations. Is there a more efficient algorithm?

SLIDE 8

Marginal inference in HMMs

  • Use dynamic programming
  • For n = 1, initialize:  Pr(x_1, y_1) = Pr(x_1) Pr(y_1 | x_1)
  • For n > 1:

        Pr(x_n, y_1, ..., y_n)
          = \sum_{x_{n-1}} Pr(x_{n-1}, x_n, y_1, ..., y_n)                                           [marginalization: Pr(A=a) = \sum_b Pr(B=b, A=a)]
          = \sum_{x_{n-1}} Pr(x_{n-1}, y_1, ..., y_{n-1}) Pr(x_n, y_n | x_{n-1}, y_1, ..., y_{n-1})   [chain rule: Pr(A=a, B=b) = Pr(A=a) Pr(B=b | A=a)]
          = \sum_{x_{n-1}} Pr(x_{n-1}, y_1, ..., y_{n-1}) Pr(x_n, y_n | x_{n-1})                     [conditional independence in HMMs]
          = \sum_{x_{n-1}} Pr(x_{n-1}, y_1, ..., y_{n-1}) Pr(x_n | x_{n-1}) Pr(y_n | x_n, x_{n-1})   [chain rule]
          = \sum_{x_{n-1}} Pr(x_{n-1}, y_1, ..., y_{n-1}) Pr(x_n | x_{n-1}) Pr(y_n | x_n)            [conditional independence in HMMs]

  • Total running time is O(nk^2) – linear in the length of the sequence!
  • Easy to do filtering (sketched in code below)
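A minimal sketch of this dynamic program (the forward algorithm), using the same assumed `init`, `T`, `E` arrays as in the earlier sketch; filtering then just normalizes the last row of the table.

```python
# Forward algorithm: alpha[t, i] = Pr(X_t = i, y_1, ..., y_t).
# Assumed arrays: init[i] = Pr(X_1 = i), T[i, j] = Pr(X_t = j | X_{t-1} = i),
# E[i, o] = Pr(Y_t = o | X_t = i).
import numpy as np

def forward(ys, init, T, E):
    n, k = len(ys), len(init)
    alpha = np.zeros((n, k))
    alpha[0] = init * E[:, ys[0]]                # Pr(x_1, y_1) = Pr(x_1) Pr(y_1 | x_1)
    for t in range(1, n):
        # sum over x_{t-1} of Pr(x_{t-1}, y_{1:t-1}) Pr(x_t | x_{t-1}), times Pr(y_t | x_t)
        alpha[t] = (alpha[t - 1] @ T) * E[:, ys[t]]
    return alpha

def filtering(ys, init, T, E):
    """Pr(X_n | y_1, ..., y_n): normalize the last row (the normalizer is Pr(y_1, ..., y_n))."""
    alpha = forward(ys, init, T, E)
    return alpha[-1] / alpha[-1].sum()
```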

SLIDE 9

MAP inference in HMMs

  • MAP inference in HMMs can also be solved in linear time!
  • Formulate as a shortest paths problem:

        \arg\max_x Pr(x_1, ..., x_n | y_1, ..., y_n)
          = \arg\max_x Pr(x_1, ..., x_n, y_1, ..., y_n)
          = \arg\max_x \log Pr(x_1, ..., x_n, y_1, ..., y_n)
          = \arg\max_x \log[ Pr(x_1) Pr(y_1 | x_1) ] + \sum_{i=2}^{n} \log[ Pr(x_i | x_{i-1}) Pr(y_i | x_i) ]

  • Build a graph with a source node s, a sink node t, and k nodes per variable X1, X2, …, Xn-1, Xn:
    – Weight for edge (s, x_1) is  −log[ Pr(x_1) Pr(y_1 | x_1) ]
    – Weight for edge (x_{i−1}, x_i) is  −log[ Pr(x_i | x_{i−1}) Pr(y_i | x_i) ]
    – Weight for edge (x_n, t) is 0
  • The shortest path from s to t gives the MAP assignment. This is called the Viterbi algorithm (see the sketch below).
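A sketch of the same computation done directly as dynamic programming in log space, which is equivalent to running shortest paths with the −log edge weights above; the array conventions are the same assumptions as in the earlier sketches.

```python
# Viterbi as dynamic programming in log space; maximizing the sum of logs is the
# same as finding the shortest s-to-t path under the -log edge weights above.
import numpy as np

def viterbi(ys, init, T, E):
    n, k = len(ys), len(init)
    logT, logE = np.log(T), np.log(E)
    score = np.log(init) + logE[:, ys[0]]        # log[ Pr(x_1) Pr(y_1 | x_1) ]
    back = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        # cand[a, b] = best log-prob of a prefix ending with x_{i-1} = a, x_i = b
        cand = score[:, None] + logT + logE[:, ys[i]][None, :]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # Backtrack from the best final state to recover the MAP assignment.
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1]
```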

SLIDE 10

Applications of HMMs

  • Speech recognition
    – Predict phonemes from the sounds forming words (i.e., the actual signals)
  • Natural language processing
    – Predict parts of speech (verb, noun, determiner, etc.) from the words in a sentence
  • Computational biology
    – Predict intron/exon regions from DNA
    – Predict protein structure from DNA (locally)
  • And many many more!

SLIDE 11

HMMs as a graphical model

  • We can represent a hidden Markov model with a graph:

    [Figure: chain of hidden states X1 → X2 → X3 → X4 → X5 → X6, each Xt with an observed (shaded) child Yt]

  • There is a 1-1 mapping between the graph structure and the factorization of the joint distribution:

        Pr(x_1, ..., x_n, y_1, ..., y_n) = Pr(x_1) Pr(y_1 | x_1) \prod_{t=2}^{n} Pr(x_t | x_{t-1}) Pr(y_t | x_t)

  • Shading in denotes observed variables (e.g., what is available at test time)

SLIDE 12

Naïve Bayes as a graphical model

  • We can represent a naïve Bayes model with a graph:

    [Figure: label node Y with children X1, X2, X3, …, Xn (the features); the Xi are shaded]

  • There is a 1-1 mapping between the graph structure and the factorization of the joint distribution:

        Pr(y, x_1, ..., x_n) = Pr(y) \prod_{i=1}^{n} Pr(x_i | y)

  • Shading in denotes observed variables (e.g., what is available at test time)

SLIDE 13

Bayesian networks

  • A Bayesian network is specified by a directed acyclic graph G = (V, E) with:
    – One node i for each random variable Xi
    – One conditional probability distribution (CPD) per node, p(x_i | x_{Pa(i)}), specifying the variable's probability conditioned on its parents' values
  • Corresponds 1-1 with a particular factorization of the joint distribution:

        p(x_1, ..., x_n) = \prod_{i \in V} p(x_i | x_{Pa(i)})

  • Powerful framework for designing algorithms to perform probability computations

SLIDE 14

2011 Turing award was for Bayesian networks

SLIDE 15

Example

  • Consider the following Bayesian network:
  • What is its joint distribution?

    [Figure: Difficulty and Intelligence are parents of Grade; Intelligence is the parent of SAT; Grade is the parent of Letter. The CPDs are:]

        Difficulty:    P(d0) = 0.6,  P(d1) = 0.4
        Intelligence:  P(i0) = 0.7,  P(i1) = 0.3

        SAT | Intelligence:        s0     s1
          i0                       0.95   0.05
          i1                       0.2    0.8

        Grade | Intelligence, Difficulty:   g1     g2     g3
          i0, d0                            0.3    0.4    0.3
          i0, d1                            0.05   0.25   0.7
          i1, d0                            0.9    0.08   0.02
          i1, d1                            0.5    0.3    0.2

        Letter | Grade:            l0     l1
          g1                       0.1    0.9
          g2                       0.4    0.6
          g3                       0.99   0.01

        p(x_1, ..., x_n) = \prod_{i \in V} p(x_i | x_{Pa(i)})

        p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)

Example from Koller & Friedman, Probabilistic Graphical Models, 2009
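A small sketch that evaluates this factorization with the CPT values above; the variable names and dictionary layout are just one possible encoding chosen for illustration.

```python
# Evaluate p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g) with the CPTs above.
p_d = {'d0': 0.6, 'd1': 0.4}
p_i = {'i0': 0.7, 'i1': 0.3}
p_s_given_i = {'i0': {'s0': 0.95, 's1': 0.05}, 'i1': {'s0': 0.2, 's1': 0.8}}
p_g_given_id = {('i0', 'd0'): {'g1': 0.3,  'g2': 0.4,  'g3': 0.3},
                ('i0', 'd1'): {'g1': 0.05, 'g2': 0.25, 'g3': 0.7},
                ('i1', 'd0'): {'g1': 0.9,  'g2': 0.08, 'g3': 0.02},
                ('i1', 'd1'): {'g1': 0.5,  'g2': 0.3,  'g3': 0.2}}
p_l_given_g = {'g1': {'l0': 0.1,  'l1': 0.9},
               'g2': {'l0': 0.4,  'l1': 0.6},
               'g3': {'l0': 0.99, 'l1': 0.01}}

def joint(d, i, g, s, l):
    return p_d[d] * p_i[i] * p_g_given_id[(i, d)][g] * p_s_given_i[i][s] * p_l_given_g[g][l]

print(joint('d0', 'i1', 'g1', 's1', 'l1'))   # 0.6 * 0.3 * 0.9 * 0.8 * 0.9 = 0.11664
```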

SLIDE 16

Example

  • Consider the following Bayesian network:
  • What is this model assuming?

    [Figure: the same network and CPDs as on the previous slide]
    (Example from Koller & Friedman, Probabilistic Graphical Models, 2009)

    SAT ⊥ Grade does not hold, but SAT ⊥ Grade | Intelligence does.
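Continuing the sketch from the previous slide (it reuses the `joint` function and CPT dictionaries defined there), here is a brute-force check of this assumption:

```python
# Brute-force check over the joint defined above: Pr(SAT | Grade, Intelligence)
# does not depend on Grade, but Pr(SAT | Grade) does depend on Grade.
from itertools import product

vals = {'d': ['d0', 'd1'], 'i': ['i0', 'i1'], 'g': ['g1', 'g2', 'g3'],
        's': ['s0', 's1'], 'l': ['l0', 'l1']}

def prob(**fixed):
    """Probability of a partial assignment, by summing the joint over the rest."""
    total = 0.0
    for d, i, g, s, l in product(*(vals[v] for v in 'digsl')):
        assign = dict(zip('digsl', (d, i, g, s, l)))
        if all(assign[v] == x for v, x in fixed.items()):
            total += joint(d, i, g, s, l)
    return total

print(prob(s='s1', g='g1', i='i1') / prob(g='g1', i='i1'))   # 0.8
print(prob(s='s1', i='i1') / prob(i='i1'))                   # 0.8  (same: independent given Intelligence)
print(prob(s='s1', g='g1') / prob(g='g1'))                   # higher than Pr(s1), so not marginally independent
print(prob(s='s1'))                                          # 0.275
```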

SLIDE 17

Example

  • Consider the following Bayesian network:

    [Figure: the same network and CPDs as on the previous slides]

  • Compared to a simple log-linear model to predict intelligence:
    – Captures non-linearity between grade, course difficulty, and intelligence
    – Modular. Training data can come from different sources!
    – Built-in feature selection: letter of recommendation is irrelevant given grade

Example from Koller & Friedman, Probabilistic Graphical Models, 2009

SLIDE 18

Bayesian networks enable use of domain knowledge

Will my car start this morning?

    p(x_1, ..., x_n) = \prod_{i \in V} p(x_i | x_{Pa(i)})

Heckerman et al., Decision-Theoretic Troubleshooting, 1995

SLIDE 19

Bayesian networks enable use of domain knowledge

What is the differential diagnosis?

    p(x_1, ..., x_n) = \prod_{i \in V} p(x_i | x_{Pa(i)})

Beinlich et al., The ALARM Monitoring System, 1989

SLIDE 20

Bayesian networks are generative models

  • Can sample from the joint distribution, top-down
  • Suppose Y can be "spam" or "not spam", and Xi is a binary indicator of whether word i is present in the e-mail
  • Let's try generating a few emails! (see the sketch below)
  • Often helps to think about Bayesian networks as a generative model when constructing the structure and thinking about the model assumptions

    [Figure: naïve Bayes graph with label Y and features X1, X2, X3, …, Xn]
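A minimal sketch of this top-down (ancestral) sampling; the vocabulary, spam prior, and word probabilities are made-up illustrative numbers, not from the lecture.

```python
# Ancestral (top-down) sampling from the naive Bayes spam model: sample Y first,
# then each word indicator X_i given Y.  All numbers below are made up for illustration.
import random

words = ['viagra', 'meeting', 'free', 'deadline']
p_spam = 0.3                                     # Pr(Y = spam)
p_word_given_y = {                               # Pr(X_i = 1 | Y)
    'spam':     {'viagra': 0.40, 'meeting': 0.05, 'free': 0.50, 'deadline': 0.05},
    'not spam': {'viagra': 0.001, 'meeting': 0.30, 'free': 0.10, 'deadline': 0.20},
}

def sample_email():
    y = 'spam' if random.random() < p_spam else 'not spam'
    x = {w: int(random.random() < p_word_given_y[y][w]) for w in words}
    return y, x

for _ in range(3):                               # "let's try generating a few emails!"
    print(sample_email())
```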

SLIDE 21

Inference in Bayesian networks

  • Computing marginal probabilities in tree-structured Bayesian networks is easy
    – The algorithm called "belief propagation" generalizes what we showed for hidden Markov models to arbitrary trees
  • Wait… this isn't a tree! What can we do?

    [Figures: the HMM and naïve Bayes graphs from the earlier slides]

SLIDE 22

Inference in Bayesian networks

  • In some cases (such as this) we can transform this into what is called a "junction tree", and then run belief propagation

SLIDE 23

Approximate inference

  • There is also a wealth of approximate inference algorithms that can be applied to Bayesian networks such as these
  • Markov chain Monte Carlo algorithms repeatedly sample assignments for estimating marginals (see the sketch below)
  • Variational inference algorithms (deterministic) find a simpler distribution which is "close" to the original, then compute marginals using the simpler distribution
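As one concrete instance of the MCMC idea, here is a sketch of Gibbs sampling on the student network from the earlier slides; it reuses the CPT dictionaries defined in that sketch. Each step resamples one unobserved variable from its conditional given everything else, and averaging over samples estimates a conditional marginal (burn-in is omitted for brevity).

```python
# Gibbs sampling (one flavor of MCMC) on the student network, reusing the CPT
# dictionaries from the earlier sketch.  Letter is observed (l1); the unobserved
# variables d, i, g, s are resampled in turn from their full conditionals.
import random

def sample_from(weights):
    """Sample a key with probability proportional to its (unnormalized) weight."""
    total = sum(weights.values())
    r = random.random() * total
    for key, w in weights.items():
        r -= w
        if r <= 0:
            return key
    return key

def gibbs_estimate(n_iters=50000, l='l1'):
    d, i, g, s = 'd0', 'i0', 'g2', 's0'          # arbitrary initialization
    count_i1 = 0
    for _ in range(n_iters):                     # (burn-in omitted for brevity)
        d = sample_from({dv: p_d[dv] * p_g_given_id[(i, dv)][g] for dv in ('d0', 'd1')})
        i = sample_from({iv: p_i[iv] * p_g_given_id[(iv, d)][g] * p_s_given_i[iv][s]
                         for iv in ('i0', 'i1')})
        g = sample_from({gv: p_g_given_id[(i, d)][gv] * p_l_given_g[gv][l]
                         for gv in ('g1', 'g2', 'g3')})
        s = sample_from({sv: p_s_given_i[i][sv] for sv in ('s0', 's1')})
        count_i1 += (i == 'i1')
    return count_i1 / n_iters

print(gibbs_estimate())   # estimate of Pr(Intelligence = i1 | Letter = l1)
```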

SLIDE 24

Maximum likelihood estimation in Bayesian networks

Suppose that we know the Bayesian network structure G.

Let \theta_{x_i | x_{pa(i)}} be the parameter giving the value of the CPD p(x_i | x_{pa(i)}).

Maximum likelihood estimation corresponds to solving:

    \max_\theta \frac{1}{M} \sum_{m=1}^{M} \log p(x^m; \theta)

subject to the non-negativity and normalization constraints. This is equal to:

    \max_\theta \frac{1}{M} \sum_{m=1}^{M} \log p(x^m; \theta) = \max_\theta \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{N} \log p(x_i^m | x_{pa(i)}^m; \theta)

                                                              = \max_\theta \sum_{i=1}^{N} \frac{1}{M} \sum_{m=1}^{M} \log p(x_i^m | x_{pa(i)}^m; \theta)

The optimization problem decomposes into an independent optimization problem for each CPD! It has a simple closed-form solution (see the sketch below).
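A minimal sketch of that closed-form solution: the MLE for each CPD is just a normalized count. The `data` and `parents` representations below are assumptions made for illustration, not notation from the slides.

```python
# Closed-form MLE for each CPD: theta_{x_i | x_pa(i)} = count(x_i, x_pa(i)) / count(x_pa(i)).
# Assumed representation: `data` is a list of dicts {variable: value}, and `parents`
# maps each variable to the tuple of its parents in G.
from collections import Counter, defaultdict

def mle_cpds(data, parents):
    cpds = {}
    for var, pa in parents.items():
        pair_counts = Counter((tuple(x[p] for p in pa), x[var]) for x in data)
        pa_counts = Counter(tuple(x[p] for p in pa) for x in data)
        cpd = defaultdict(dict)
        for (pa_vals, val), c in pair_counts.items():
            cpd[pa_vals][val] = c / pa_counts[pa_vals]   # normalized count
        cpds[var] = dict(cpd)
    return cpds

# For the student network: parents = {'d': (), 'i': (), 'g': ('i', 'd'), 's': ('i',), 'l': ('g',)}
```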