Bayesian networks
Lecture 18
David Sontag, New York University
Outline for today
- Modeling sequential data (e.g., time series, speech processing) using hidden Markov models (HMMs)
- Bayesian networks
– Independence properties
– Examples
– Learning and inference
Example application: Tracking
Radar
Observe noisy measurements of missile location: Y1, Y2, … Where is the missile now? Where will it be in 10 seconds?
Probabilistic approach
- Our measurements of the missile location were Y1, Y2, …, Yn
- Let Xt be the true <missile location, velocity> at time t
- To keep this simple, suppose that everything is discrete, i.e., Xt takes the values 1, …, k
Grid the space:
Probabilistic approach
- First, we specify the conditional distribution Pr(Xt | Xt-1). From basic physics, we can bound the distance that the missile can have traveled in one time step.
- Then, we specify Pr(Yt | Xt = <(10,20), 200 mph toward the northeast>): with probability ½, Yt = Xt (ignoring the velocity); otherwise, Yt is a uniformly chosen grid location. (A small code sketch of both CPDs follows below.)
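To make the setup concrete, here is a minimal sketch (not from the slides) of what these two CPDs could look like in code. Assumptions: a 10x10 grid of k = 100 locations, velocity ignored, and the physics bound modeled as "move at most one grid cell per time step".

```python
import numpy as np

GRID = 10                      # grid is GRID x GRID
K = GRID * GRID                # number of discrete states

def neighbors(s):
    """States reachable from s in one step (stay put or move one cell)."""
    r, c = divmod(s, GRID)
    return [rr * GRID + cc
            for rr in range(max(0, r - 1), min(GRID, r + 2))
            for cc in range(max(0, c - 1), min(GRID, c + 2))]

# Transition model Pr(X_t = j | X_{t-1} = i): uniform over reachable cells.
T = np.zeros((K, K))
for i in range(K):
    nbrs = neighbors(i)
    T[i, nbrs] = 1.0 / len(nbrs)

# Observation model Pr(Y_t = y | X_t = x): with probability 1/2 the radar
# reports the true cell, otherwise a uniformly chosen grid cell.
E = np.full((K, K), 0.5 / K)
E[np.arange(K), np.arange(K)] += 0.5

# Sanity check: both conditional distributions sum to 1 over their outcomes.
assert np.allclose(T.sum(axis=1), 1.0) and np.allclose(E.sum(axis=1), 1.0)
```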
Hidden Markov models
- Assume that the joint distribution on X1, X2, …, Xn and Y1, Y2, …, Yn factors as follows:

  Pr(x_1, …, x_n, y_1, …, y_n) = Pr(x_1) Pr(y_1 | x_1) ∏_{t=2}^{n} Pr(x_t | x_{t-1}) Pr(y_t | x_t)

- To find out where the missile is now, we do marginal inference: compute Pr(x_n | y_1, …, y_n)
- To find the most likely trajectory, we do MAP (maximum a posteriori) inference: compute arg max_x Pr(x_1, …, x_n | y_1, …, y_n)
(HMMs date back to the 1960s.)
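As a quick illustration of the factorization, here is a short sketch of evaluating the joint probability of one full assignment. It assumes initial, transition, and emission parameters pi, T, E in the style of the tracking sketch above.

```python
import numpy as np

def hmm_joint(pi, T, E, xs, ys):
    """Pr(x_1..x_n, y_1..y_n) = Pr(x_1) Pr(y_1|x_1) * prod_{t>=2} Pr(x_t|x_{t-1}) Pr(y_t|x_t)."""
    p = pi[xs[0]] * E[xs[0], ys[0]]
    for t in range(1, len(xs)):
        p *= T[xs[t - 1], xs[t]] * E[xs[t], ys[t]]
    return p
```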
Inference
- Recall, to find out where the missile is now, we do marginal inference: compute Pr(x_n | y_1, …, y_n)
- How does one compute this?
- Applying the rule of conditional probability, we have:

  Pr(x_n | y_1, …, y_n) = Pr(x_n, y_1, …, y_n) / Pr(y_1, …, y_n) = Pr(x_n, y_1, …, y_n) / Σ_{x̂_n=1}^{k} Pr(x̂_n, y_1, …, y_n)

- Naively, computing the numerator Pr(x_n, y_1, …, y_n) = Σ_{x_1, …, x_{n-1}} Pr(x_1, …, x_n, y_1, …, y_n) would seem to require k^{n-1} summations
Is there a more efficient algorithm?
Marginal inference in HMMs
- Use dynamic programming
- For n = 1, initialize: Pr(x_1, y_1) = Pr(x_1) Pr(y_1 | x_1)
- For n > 1:

  Pr(x_n, y_1, …, y_n)
    = Σ_{x_{n-1}} Pr(x_{n-1}, x_n, y_1, …, y_n)
    = Σ_{x_{n-1}} Pr(x_{n-1}, y_1, …, y_{n-1}) Pr(x_n, y_n | x_{n-1}, y_1, …, y_{n-1})
    = Σ_{x_{n-1}} Pr(x_{n-1}, y_1, …, y_{n-1}) Pr(x_n, y_n | x_{n-1})
    = Σ_{x_{n-1}} Pr(x_{n-1}, y_1, …, y_{n-1}) Pr(x_n | x_{n-1}) Pr(y_n | x_n, x_{n-1})
    = Σ_{x_{n-1}} Pr(x_{n-1}, y_1, …, y_{n-1}) Pr(x_n | x_{n-1}) Pr(y_n | x_n)

- Total running time is O(nk) – linear time!
- Easy to do filtering (a sketch follows below)
(The derivation uses the marginalization rule Pr(A = a) = Σ_b Pr(A = a, B = b) and the chain rule Pr(A = a, B = b) = Pr(A = a) Pr(B = b | A = a).)
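Below is a minimal sketch of this forward recursion (filtering), assuming initial, transition, and emission parameters pi, T, E as in the earlier tracking sketches. In practice one would renormalize each step or work in log space to avoid underflow on long sequences.

```python
import numpy as np

def forward_filter(pi, T, E, ys):
    """Return Pr(x_n | y_1..y_n) by the forward recursion.

    alpha[t, x] stores Pr(x_{t+1} = x, y_1..y_{t+1}); normalizing the last
    row gives the filtering distribution over the current state.
    """
    n, k = len(ys), len(pi)
    alpha = np.zeros((n, k))
    alpha[0] = pi * E[:, ys[0]]                 # Pr(x_1) Pr(y_1 | x_1)
    for t in range(1, n):
        # Pr(x_t, y_1..y_t) = sum_{x_{t-1}} alpha[t-1] * Pr(x_t|x_{t-1}) * Pr(y_t|x_t)
        alpha[t] = (alpha[t - 1] @ T) * E[:, ys[t]]
    return alpha[-1] / alpha[-1].sum()          # normalize the joint to get the conditional
```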
Conditional independence in HMMs
- The remaining steps of the derivation use the conditional independence properties of the HMM: given x_{n-1}, the pair (x_n, y_n) is independent of y_1, …, y_{n-1}; and given x_n, the observation y_n is independent of x_{n-1}.
MAP inference in HMMs
- MAP inference in HMMs can also be solved in linear time!
- Formulate as a shortest paths problem:

  arg max_x Pr(x_1, …, x_n | y_1, …, y_n) = arg max_x Pr(x_1, …, x_n, y_1, …, y_n)
    = arg max_x log Pr(x_1, …, x_n, y_1, …, y_n)
    = arg max_x [ log(Pr(x_1) Pr(y_1 | x_1)) + Σ_{i=2}^{n} log(Pr(x_i | x_{i-1}) Pr(y_i | x_i)) ]
- Build a graph with a source node s, a sink node t, and k nodes per variable X_1, X_2, …, X_{n-1}, X_n:
  – Weight for edge (s, x_1) is −log[ Pr(x_1) Pr(y_1 | x_1) ]
  – Weight for edge (x_{i-1}, x_i) is −log[ Pr(x_i | x_{i-1}) Pr(y_i | x_i) ]
  – Weight for edge (x_n, t) is 0
- The shortest path from s to t gives the MAP assignment. This is called the Viterbi algorithm (a sketch follows below).
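The same computation is usually implemented directly as dynamic programming over negative log-probabilities rather than by building an explicit shortest-paths graph; the two views are equivalent. A minimal sketch, again assuming parameters pi, T, E:

```python
import numpy as np

def viterbi(pi, T, E, ys):
    """Return the MAP state sequence arg max_x Pr(x_1..x_n | y_1..y_n)."""
    n, k = len(ys), len(pi)
    with np.errstate(divide="ignore"):            # allow log(0) = -inf for impossible moves
        lp, lT, lE = np.log(pi), np.log(T), np.log(E)
    score = np.zeros((n, k))                      # best log-prob of any path ending in x_t = x
    back = np.zeros((n, k), dtype=int)            # back-pointers to the best predecessor
    score[0] = lp + lE[:, ys[0]]
    for t in range(1, n):
        cand = score[t - 1][:, None] + lT         # cand[i, j]: best path ending x_{t-1}=i, then x_t=j
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + lE[:, ys[t]]
    xs = [int(score[-1].argmax())]
    for t in range(n - 1, 0, -1):                 # follow back-pointers to recover the MAP path
        xs.append(int(back[t, xs[-1]]))
    return xs[::-1]
```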
Applications of HMMs
- Speech recognition
  – Predict phonemes from the sounds forming words (i.e., the actual signals)
- Natural language processing
  – Predict parts of speech (verb, noun, determiner, etc.) from the words in a sentence
- Computational biology
  – Predict intron/exon regions from DNA
  – Predict protein structure from DNA (locally)
- And many, many more!
HMMs as a graphical model
- We can represent a hidden Markov model with a graph: a chain X1 → X2 → X3 → X4 → X5 → X6, with an observed child Yt attached to each Xt. Shading denotes observed variables (e.g., what is available at test time).
- There is a 1-1 mapping between the graph structure and the factorization of the joint distribution:

  Pr(x_1, …, x_n, y_1, …, y_n) = Pr(x_1) Pr(y_1 | x_1) ∏_{t=2}^{n} Pr(x_t | x_{t-1}) Pr(y_t | x_t)
Naïve Bayes as a graphical model
- We can represent a naïve Bayes model with a graph: the label Y is the parent of the features X1, X2, X3, …, Xn. Shading denotes observed variables (e.g., what is available at test time).
- There is a 1-1 mapping between the graph structure and the factorization of the joint distribution:

  Pr(y, x_1, …, x_n) = Pr(y) ∏_{i=1}^{n} Pr(x_i | y)
Bayesian networks
- A Bayesian network is specified by a directed acyclic graph G = (V, E) with:
  – One node i for each random variable X_i
  – One conditional probability distribution (CPD) per node, p(x_i | x_{Pa(i)}), specifying the variable's probability conditioned on its parents' values
- Corresponds 1-1 with a particular factorization of the joint distribution:

  p(x_1, …, x_n) = ∏_{i∈V} p(x_i | x_{Pa(i)})

- Powerful framework for designing algorithms to perform probability computations
- The 2011 Turing Award (Judea Pearl) was for Bayesian networks
Example
- Consider the following Bayesian network (the "student" network): Difficulty → Grade ← Intelligence, Intelligence → SAT, Grade → Letter, with CPDs
  – p(D): p(d0) = 0.6, p(d1) = 0.4
  – p(I): p(i0) = 0.7, p(i1) = 0.3
  – p(S | I): p(s1 | i0) = 0.05, p(s1 | i1) = 0.8
  – p(L | G): p(l1 | g1) = 0.9, p(l1 | g2) = 0.6, p(l1 | g3) = 0.01
  – p(G | I, D): (g1, g2, g3) = (0.3, 0.4, 0.3) given i0, d0; (0.05, 0.25, 0.7) given i0, d1; (0.9, 0.08, 0.02) given i1, d0; (0.5, 0.3, 0.2) given i1, d1
- What is its joint distribution?

  p(x_1, …, x_n) = ∏_{i∈V} p(x_i | x_{Pa(i)})
  p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)

Example from Koller & Friedman, Probabilistic Graphical Models, 2009
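A small sketch that encodes these CPDs as plain dictionaries and evaluates the joint, with the CPT values as reconstructed above from the Koller & Friedman example:

```python
# One factor per node of the DAG: p(d) p(i) p(g|i,d) p(s|i) p(l|g).
P_D = {"d0": 0.6, "d1": 0.4}
P_I = {"i0": 0.7, "i1": 0.3}
P_G = {("i0", "d0"): {"g1": 0.3,  "g2": 0.4,  "g3": 0.3},
       ("i0", "d1"): {"g1": 0.05, "g2": 0.25, "g3": 0.7},
       ("i1", "d0"): {"g1": 0.9,  "g2": 0.08, "g3": 0.02},
       ("i1", "d1"): {"g1": 0.5,  "g2": 0.3,  "g3": 0.2}}
P_S = {"i0": {"s0": 0.95, "s1": 0.05}, "i1": {"s0": 0.2, "s1": 0.8}}
P_L = {"g1": {"l0": 0.1,  "l1": 0.9},
       "g2": {"l0": 0.4,  "l1": 0.6},
       "g3": {"l0": 0.99, "l1": 0.01}}

def joint(d, i, g, s, l):
    """Joint probability of one complete assignment, one factor per node."""
    return P_D[d] * P_I[i] * P_G[(i, d)][g] * P_S[i][s] * P_L[g][l]

print(joint("d0", "i1", "g1", "s1", "l1"))   # 0.6 * 0.3 * 0.9 * 0.8 * 0.9
```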
Example
- Consider the same Bayesian network (the student network above)
- What is this model assuming? SAT is not marginally independent of Grade, but SAT ⊥ Grade | Intelligence (a quick enumeration check follows below).
Example from Koller & Friedman, Probabilistic Graphical Models, 2009
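A quick brute-force check of this claim, reusing joint() and the CPT dictionaries from the previous sketch (enumeration is fine here: the network has only 48 joint configurations):

```python
from itertools import product

D, I, G, S, L = ["d0", "d1"], ["i0", "i1"], ["g1", "g2", "g3"], ["s0", "s1"], ["l0", "l1"]

def prob(**fixed):
    """Marginal probability of the fixed assignment, summing out all other variables."""
    total = 0.0
    for d, i, g, s, l in product(D, I, G, S, L):
        full = {"d": d, "i": i, "g": g, "s": s, "l": l}
        if all(full[k] == v for k, v in fixed.items()):
            total += joint(d, i, g, s, l)
    return total

# Conditional independence: p(s, g | i) = p(s | i) p(g | i) for every i, s, g.
for i, s, g in product(I, S, G):
    lhs = prob(s=s, g=g, i=i) / prob(i=i)
    rhs = (prob(s=s, i=i) / prob(i=i)) * (prob(g=g, i=i) / prob(i=i))
    assert abs(lhs - rhs) < 1e-12

# Marginal dependence: p(s, g) differs from p(s) p(g).
print(abs(prob(s="s1", g="g1") - prob(s="s1") * prob(g="g1")))   # nonzero
```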
Example
- Consider the same Bayesian network (the student network above)
- Compared to a simple log-linear model to predict intelligence:
  – Captures the non-linearity between grade, course difficulty, and intelligence
  – Modular: training data can come from different sources!
  – Built-in feature selection: the letter of recommendation is irrelevant given the grade
Example from Koller & Friedman, Probabilistic Graphical Models, 2009
Bayesian networks enable use of domain knowledge
- Will my car start this morning? (Heckerman et al., Decision-Theoretic Troubleshooting, 1995)

  p(x_1, …, x_n) = ∏_{i∈V} p(x_i | x_{Pa(i)})
Bayesian networks enable use of domain knowledge
- What is the differential diagnosis? (Beinlich et al., The ALARM Monitoring System, 1989)

  p(x_1, …, x_n) = ∏_{i∈V} p(x_i | x_{Pa(i)})
Bayesian networks are generative models
- Can sample from the joint distribution, top-down
- Suppose Y can be "spam" or "not spam", and X_i is a binary indicator of whether word i is present in the e-mail
- Let's try generating a few emails! (a small sampling sketch follows below)
- Often helps to think about Bayesian networks as a generative model when constructing the structure and thinking about the model assumptions
(Figure: the naïve Bayes model, label Y with feature children X1, …, Xn)
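A minimal generative sketch of this idea. The vocabulary, Pr(Y = spam), and the word probabilities below are made-up toy numbers, not from the slides; sampling proceeds top-down, first the label, then each feature given the label, exactly as the factorization Pr(y) ∏_i Pr(x_i | y) prescribes.

```python
import random

VOCAB = ["free", "money", "meeting", "project", "viagra"]
P_SPAM = 0.3                                        # Pr(Y = spam), assumed
P_WORD = {                                          # Pr(X_i = 1 | Y), assumed
    "spam":     {"free": 0.6, "money": 0.5, "meeting": 0.05, "project": 0.05, "viagra": 0.4},
    "not spam": {"free": 0.1, "money": 0.1, "meeting": 0.4,  "project": 0.5,  "viagra": 0.01},
}

def sample_email(rng=random):
    y = "spam" if rng.random() < P_SPAM else "not spam"        # sample the root first
    words = [w for w in VOCAB if rng.random() < P_WORD[y][w]]  # then each child given its parent
    return y, words

for _ in range(3):
    print(sample_email())   # e.g. ('spam', ['free', 'viagra'])
```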
Inference in Bayesian networks
- Computing marginal probabilities in tree-structured Bayesian networks is easy
  – The algorithm, called "belief propagation", generalizes what we showed for hidden Markov models to arbitrary trees
- Wait… this isn't a tree! What can we do?
(Figures: the HMM chain X1, …, X6 with observations Y1, …, Y6, and the naïve Bayes model with label Y and features X1, …, Xn)
Inference in Bayesian networks
- In some cases (such as this) we can transform the model into what is called a "junction tree", and then run belief propagation
Approximate inference
- There is also a wealth of approximate inference algorithms that can be applied to Bayesian networks such as these
- Markov chain Monte Carlo algorithms repeatedly sample assignments to estimate marginals
- Variational inference algorithms (deterministic) find a simpler distribution which is "close" to the original, then compute marginals using the simpler distribution
Maximum likelihood estimation in Bayesian networks
- Suppose that we know the Bayesian network structure G
- Let θ_{x_i | x_{pa(i)}} be the parameter giving the value of the CPD p(x_i | x_{pa(i)})
- Maximum likelihood estimation corresponds to solving

  max_θ (1/M) Σ_{m=1}^{M} log p(x^m; θ)

  subject to the non-negativity and normalization constraints
- This is equal to

  max_θ (1/M) Σ_{m=1}^{M} log p(x^m; θ) = max_θ (1/M) Σ_{m=1}^{M} Σ_{i=1}^{N} log p(x_i^m | x_{pa(i)}^m; θ)
    = max_θ Σ_{i=1}^{N} (1/M) Σ_{m=1}^{M} log p(x_i^m | x_{pa(i)}^m; θ)

- The optimization problem decomposes into an independent optimization problem for each CPD! It has a simple closed-form solution (a counting sketch follows below).
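A minimal sketch of that closed-form solution for discrete CPDs: the MLE of each entry θ_{x_i | x_{pa(i)}} is simply the empirical count of (x_i, x_{pa(i)}) divided by the count of x_{pa(i)}. The data format and variable names here are assumptions for illustration.

```python
from collections import Counter

def mle_cpd(data, child, parents):
    """Estimate p(child | parents) by counting.

    data: list of dicts mapping variable name -> observed value (one dict per sample).
    Returns a dict mapping (child_value, parent_value_tuple) -> MLE probability.
    """
    joint_counts, parent_counts = Counter(), Counter()
    for sample in data:
        pa = tuple(sample[p] for p in parents)
        parent_counts[pa] += 1
        joint_counts[(sample[child], pa)] += 1
    return {(x, pa): c / parent_counts[pa] for (x, pa), c in joint_counts.items()}

# Example: estimate p(g | i, d) from a few hypothetical fully observed records.
data = [{"d": "d0", "i": "i1", "g": "g1"},
        {"d": "d0", "i": "i1", "g": "g1"},
        {"d": "d0", "i": "i1", "g": "g2"},
        {"d": "d1", "i": "i0", "g": "g3"}]
print(mle_cpd(data, "g", ["i", "d"]))
# {('g1', ('i1', 'd0')): 0.66..., ('g2', ('i1', 'd0')): 0.33..., ('g3', ('i0', 'd1')): 1.0}
```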