

  1. Bayesian networks Lecture 18 David Sontag New York University

  2. Outline for today • Modeling sequential data (e.g., time series, speech processing) using hidden Markov models (HMMs) • Bayesian networks – Independence properties – Examples – Learning and inference

  3. Example application: Tracking • Observe noisy measurements (from radar) of the missile location: Y_1, Y_2, … • Where is the missile now? Where will it be in 10 seconds?

  4. Probabilistic approach • Our measurements of the missile location were Y_1, Y_2, …, Y_n • Let X_t be the true <missile location, velocity> at time t • To keep this simple, suppose that everything is discrete, i.e. X_t takes the values 1, …, k (grid the space)

  5. Probabilistic approach • First, we specify the conditional distribution Pr(X_t | X_{t-1}): from basic physics, we can bound the distance that the missile can have traveled • Then, we specify Pr(Y_t | X_t = <(10,20), 200 mph toward the northeast>): with probability ½, Y_t = X_t (ignoring the velocity); otherwise, Y_t is a uniformly chosen grid location
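A minimal sketch of how these two conditional distributions might be tabulated. The 1-D grid of k cells, the "move at most one cell per step" bound, and the uniform transition weights are all simplifying assumptions made here for illustration; the lecture's example uses a 2-D grid with velocity.

```python
import numpy as np

k = 20  # number of grid cells (toy value)

# Pr(X_t | X_{t-1}): physics bounds how far the missile can travel,
# so only neighbouring cells get nonzero probability.
transition = np.zeros((k, k))
for prev in range(k):
    reachable = [c for c in (prev - 1, prev, prev + 1) if 0 <= c < k]
    for cur in reachable:
        transition[prev, cur] = 1.0 / len(reachable)

# Pr(Y_t | X_t): with probability 1/2 the radar reports the true cell,
# otherwise a uniformly chosen grid cell.
emission = np.full((k, k), 0.5 / k)
emission[np.arange(k), np.arange(k)] += 0.5

# Each row is a valid conditional distribution.
assert np.allclose(transition.sum(axis=1), 1.0)
assert np.allclose(emission.sum(axis=1), 1.0)
```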

  6. Hidden Markov models (1960s) • Assume that the joint distribution on X_1, X_2, …, X_n and Y_1, Y_2, …, Y_n factors as follows:

     Pr(x_1, …, x_n, y_1, …, y_n) = Pr(x_1) Pr(y_1 | x_1) ∏_{t=2}^{n} Pr(x_t | x_{t−1}) Pr(y_t | x_t)

  • To find out where the missile is now, we do marginal inference: Pr(x_n | y_1, …, y_n) • To find the most likely trajectory, we do MAP (maximum a posteriori) inference: arg max_x Pr(x_1, …, x_n | y_1, …, y_n)
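As a concrete reading of that factorization, a minimal sketch that evaluates the joint probability of one hidden trajectory and one observation sequence from an initial distribution, a transition table, and an emission table; the uniform toy tables at the bottom are made up for illustration.

```python
import numpy as np

def hmm_joint(x, y, initial, transition, emission):
    """Pr(x_1..x_n, y_1..y_n) = Pr(x_1) Pr(y_1|x_1) * prod_t Pr(x_t|x_{t-1}) Pr(y_t|x_t)."""
    p = initial[x[0]] * emission[x[0], y[0]]
    for t in range(1, len(x)):
        p *= transition[x[t - 1], x[t]] * emission[x[t], y[t]]
    return p

# Toy usage with k = 3 states/observations and uniform tables.
k = 3
initial = np.full(k, 1.0 / k)
transition = np.full((k, k), 1.0 / k)
emission = np.full((k, k), 1.0 / k)
print(hmm_joint([0, 1, 2], [0, 0, 1], initial, transition, emission))  # (1/3)^6
```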

  7. Inference • Recall, to find out where the missile is now, we do marginal inference: Pr(x_n | y_1, …, y_n) • How does one compute this? • Applying the rule of conditional probability, we have:

     Pr(x_n | y_1, …, y_n) = Pr(x_n, y_1, …, y_n) / Pr(y_1, …, y_n) = Pr(x_n, y_1, …, y_n) / Σ_{x̂_n=1}^{k} Pr(x̂_n, y_1, …, y_n),

     where Pr(x_n, y_1, …, y_n) = Σ_{x_1, …, x_{n−1}} Pr(x_1, …, x_n, y_1, …, y_n)

  • Naively, this would seem to require k^{n−1} summations. Is there a more efficient algorithm?
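To make that cost concrete, a minimal sketch of the naive computation, enumerating all k^{n−1} settings of the earlier hidden states; fine for tiny n, hopeless for long sequences, which is what motivates the dynamic program on the next slide. The toy tables are made up.

```python
import itertools
import numpy as np

def naive_marginal(x_n, y, initial, transition, emission):
    """Pr(x_n, y_1..y_n) by brute-force summation over x_1..x_{n-1}."""
    k, n = len(initial), len(y)
    total = 0.0
    for prefix in itertools.product(range(k), repeat=n - 1):  # k^(n-1) terms
        x = list(prefix) + [x_n]
        p = initial[x[0]] * emission[x[0], y[0]]
        for t in range(1, n):
            p *= transition[x[t - 1], x[t]] * emission[x[t], y[t]]
        total += p
    return total

# Toy usage: uniform tables, so every trajectory has the same probability.
k = 2
uniform = np.full((k, k), 0.5)
print(naive_marginal(0, [0, 1, 1], np.full(k, 0.5), uniform, uniform))  # 1/16
```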

  8. Marginal inference in HMMs • Use dynamic programming:

     Pr(x_n, y_1, …, y_n) = Σ_{x_{n−1}} Pr(x_{n−1}, x_n, y_1, …, y_n)
        = Σ_{x_{n−1}} Pr(x_{n−1}, y_1, …, y_{n−1}) Pr(x_n, y_n | x_{n−1}, y_1, …, y_{n−1})
        = Σ_{x_{n−1}} Pr(x_{n−1}, y_1, …, y_{n−1}) Pr(x_n, y_n | x_{n−1})              [conditional independence in HMMs]
        = Σ_{x_{n−1}} Pr(x_{n−1}, y_1, …, y_{n−1}) Pr(x_n | x_{n−1}) Pr(y_n | x_n, x_{n−1})
        = Σ_{x_{n−1}} Pr(x_{n−1}, y_1, …, y_{n−1}) Pr(x_n | x_{n−1}) Pr(y_n | x_n)      [conditional independence in HMMs]

     (The steps use only the basic rules Pr(A = a) = Σ_b Pr(A = a, B = b) and Pr(A = a, B = b) = Pr(A = a) Pr(B = b | A = a).)

  • For n = 1, initialize Pr(x_1, y_1) = Pr(x_1) Pr(y_1 | x_1). This makes filtering easy • Total running time is O(nk^2) – linear in the length of the sequence!
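A minimal sketch of this recursion as code (the standard forward / filtering pass): alpha[t, j] stores Pr(x_t = j, y_1, …, y_t), and the last row, renormalized, gives the filtering distribution Pr(x_n | y_1, …, y_n). The numeric tables are made up for illustration.

```python
import numpy as np

def forward(y, initial, transition, emission):
    """Forward recursion: alpha[t, j] = Pr(x_t = j, y_1, ..., y_t)."""
    k, n = len(initial), len(y)
    alpha = np.zeros((n, k))
    alpha[0] = initial * emission[:, y[0]]          # Pr(x_1, y_1)
    for t in range(1, n):
        # Pr(x_t, y_1..y_t) = sum_{x_{t-1}} Pr(x_{t-1}, y_1..y_{t-1}) Pr(x_t|x_{t-1}) Pr(y_t|x_t)
        alpha[t] = (alpha[t - 1] @ transition) * emission[:, y[t]]
    return alpha

# Filtering: Pr(x_n | y_1..y_n) is the last row, renormalized.
initial = np.array([0.6, 0.4])
transition = np.array([[0.7, 0.3], [0.2, 0.8]])
emission = np.array([[0.9, 0.1], [0.3, 0.7]])
alpha = forward([0, 1, 1], initial, transition, emission)
print(alpha[-1] / alpha[-1].sum())
```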

  9. MAP inference in HMMs • MAP inference in HMMs can also be solved in linear time!

     arg max_x Pr(x_1, …, x_n | y_1, …, y_n) = arg max_x Pr(x_1, …, x_n, y_1, …, y_n)
        = arg max_x log Pr(x_1, …, x_n, y_1, …, y_n)
        = arg max_x [ log[Pr(x_1) Pr(y_1 | x_1)] + Σ_{i=2}^{n} log[Pr(x_i | x_{i−1}) Pr(y_i | x_i)] ]

  • Formulate as a shortest paths problem on a layered graph with k nodes per variable X_1, X_2, …, X_{n−1}, X_n, plus a source s and a sink t:
     – the weight for edge (s, x_1) is −log[Pr(x_1) Pr(y_1 | x_1)]
     – the weight for edge (x_{i−1}, x_i) is −log[Pr(x_i | x_{i−1}) Pr(y_i | x_i)]
     – the weight for edge (x_n, t) is 0
    A shortest path from s to t gives the MAP assignment. This is called the Viterbi algorithm.
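A minimal sketch of that dynamic program in log space: the same recursion as filtering, with the sum replaced by a max plus backpointers (equivalently, a shortest path through the layered graph). The tables reuse the made-up values from the filtering sketch.

```python
import numpy as np

def viterbi(y, initial, transition, emission):
    """Return the MAP hidden-state sequence arg max_x Pr(x | y)."""
    k, n = len(initial), len(y)
    log_delta = np.zeros((n, k))
    backptr = np.zeros((n, k), dtype=int)
    log_delta[0] = np.log(initial) + np.log(emission[:, y[0]])
    for t in range(1, n):
        # score[i, j]: best log-probability of reaching state j at time t via state i at t-1
        scores = log_delta[t - 1][:, None] + np.log(transition) + np.log(emission[:, y[t]])[None, :]
        backptr[t] = scores.argmax(axis=0)
        log_delta[t] = scores.max(axis=0)
    # Trace back the best path (negated, this is the shortest s-to-t path).
    path = [int(log_delta[-1].argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

initial = np.array([0.6, 0.4])
transition = np.array([[0.7, 0.3], [0.2, 0.8]])
emission = np.array([[0.9, 0.1], [0.3, 0.7]])
print(viterbi([0, 1, 1], initial, transition, emission))
```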

  10. Applications of HMMs • Speech recognition – predict phonemes from the sounds forming words (i.e., the actual signals) • Natural language processing – predict parts of speech (verb, noun, determiner, etc.) from the words in a sentence • Computational biology – predict intron/exon regions from DNA – predict protein structure from DNA (locally) • And many, many more!

  11. HMMs as a graphical model • We can represent a hidden Markov model with a graph: a chain X_1 → X_2 → X_3 → X_4 → X_5 → X_6, with an edge from each X_t to its observation Y_t; shading denotes observed variables (e.g. what is available at test time)

     Pr(x_1, …, x_n, y_1, …, y_n) = Pr(x_1) Pr(y_1 | x_1) ∏_{t=2}^{n} Pr(x_t | x_{t−1}) Pr(y_t | x_t)

  • There is a 1-1 mapping between the graph structure and the factorization of the joint distribution

  12. Naïve Bayes as a graphical model • We can represent a naïve Bayes model with a graph: the label Y at the root, with an edge from Y to each feature X_1, X_2, X_3, …, X_n; shading denotes observed variables (e.g. what is available at test time)

     Pr(y, x_1, …, x_n) = Pr(y) ∏_{i=1}^{n} Pr(x_i | y)

  • There is a 1-1 mapping between the graph structure and the factorization of the joint distribution

  13. Bayesian networks • A Bayesian network is specified by a directed acyclic graph G = (V, E) with: – one node i for each random variable X_i – one conditional probability distribution (CPD) per node, p(x_i | x_Pa(i)), specifying the variable's probability conditioned on its parents' values • Corresponds 1-1 with a particular factorization of the joint distribution:

     p(x_1, …, x_n) = ∏_{i ∈ V} p(x_i | x_Pa(i))

  • Powerful framework for designing algorithms to perform probability computations
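A minimal sketch of that definition as a data structure: each variable stores its parent list and a CPD table, and the joint probability of a full assignment is the product of the local CPD entries. The two-node network and all numbers below are made up for illustration.

```python
def bn_joint(assignment, parents, cpds):
    """p(x_1..x_n) = prod_i p(x_i | x_Pa(i)) for a full assignment {name: value}."""
    p = 1.0
    for var, par_list in parents.items():
        parent_vals = tuple(assignment[par] for par in par_list)
        p *= cpds[var][parent_vals][assignment[var]]
    return p

# Toy two-node network A -> B (names and numbers are hypothetical).
parents = {"A": [], "B": ["A"]}
cpds = {
    "A": {(): {0: 0.3, 1: 0.7}},
    "B": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.4, 1: 0.6}},
}
print(bn_joint({"A": 1, "B": 0}, parents, cpds))  # 0.7 * 0.4 = 0.28
```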

  14. The 2011 Turing Award (to Judea Pearl) was for Bayesian networks

  15. Example • Consider the following Bayesian network (example from Koller & Friedman, Probabilistic Graphical Models, 2009): Difficulty and Intelligence are parents of Grade, Intelligence is the parent of SAT, and Grade is the parent of Letter. The CPDs are:

     P(D):        d^0 = 0.6, d^1 = 0.4
     P(I):        i^0 = 0.7, i^1 = 0.3
     P(G | I, D): i^0, d^0:  g^1 0.3,  g^2 0.4,  g^3 0.3
                  i^0, d^1:  g^1 0.05, g^2 0.25, g^3 0.7
                  i^1, d^0:  g^1 0.9,  g^2 0.08, g^3 0.02
                  i^1, d^1:  g^1 0.5,  g^2 0.3,  g^3 0.2
     P(S | I):    i^0:  s^0 0.95, s^1 0.05
                  i^1:  s^0 0.2,  s^1 0.8
     P(L | G):    g^1:  l^0 0.1,  l^1 0.9
                  g^2:  l^0 0.4,  l^1 0.6
                  g^3:  l^0 0.99, l^1 0.01

  • What is its joint distribution?

     p(x_1, …, x_n) = ∏_{i ∈ V} p(x_i | x_Pa(i))
     p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)
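A minimal sketch that evaluates this joint distribution for one complete assignment by multiplying the CPT entries above; indices 0/1/2 stand for d^0/d^1, i^0/i^1, g^1/g^2/g^3, s^0/s^1, l^0/l^1.

```python
p_d = [0.6, 0.4]
p_i = [0.7, 0.3]
p_g_given_id = {  # (i, d) -> [P(g^1), P(g^2), P(g^3)]
    (0, 0): [0.3, 0.4, 0.3],
    (0, 1): [0.05, 0.25, 0.7],
    (1, 0): [0.9, 0.08, 0.02],
    (1, 1): [0.5, 0.3, 0.2],
}
p_s_given_i = {0: [0.95, 0.05], 1: [0.2, 0.8]}
p_l_given_g = {0: [0.1, 0.9], 1: [0.4, 0.6], 2: [0.99, 0.01]}

def student_joint(d, i, g, s, l):
    """p(d, i, g, s, l) = p(d) p(i) p(g|i,d) p(s|i) p(l|g)."""
    return p_d[d] * p_i[i] * p_g_given_id[(i, d)][g] * p_s_given_i[i][s] * p_l_given_g[g][l]

# E.g. easy course, intelligent student, grade g^2, high SAT, weak letter:
print(student_joint(d=0, i=1, g=1, s=1, l=0))  # 0.6 * 0.3 * 0.08 * 0.8 * 0.4 = 0.004608
```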

  16. Example • Consider the same Bayesian network as above (Difficulty, Intelligence, Grade, SAT, Letter, with the CPDs shown on the previous slide) • What is this model assuming? SAT is not independent of Grade, but SAT ⊥ Grade | Intelligence

  17. Example • Consider the same Bayesian network as above • Compared to a simple log-linear model to predict intelligence: – captures the non-linearity between grade, course difficulty, and intelligence – modular: training data can come from different sources! – built-in feature selection: the letter of recommendation is irrelevant given the grade

  18. Bayesian networks enable use of domain knowledge

     p(x_1, …, x_n) = ∏_{i ∈ V} p(x_i | x_Pa(i))

  Will my car start this morning? (Heckerman et al., Decision-Theoretic Troubleshooting, 1995)

  19. Bayesian networks enable use of domain knowledge

     p(x_1, …, x_n) = ∏_{i ∈ V} p(x_i | x_Pa(i))

  What is the differential diagnosis? (Beinlich et al., The ALARM Monitoring System, 1989)

  20. Bayesian networks are generative models • Can sample from the joint distribution, top-down • Suppose Y can be "spam" or "not spam", and X_i is a binary indicator of whether word i is present in the e-mail • Let's try generating a few emails! (Graph: the label Y with an edge to each feature X_1, X_2, X_3, …, X_n) • Often helps to think about Bayesian networks as a generative model when constructing the structure and thinking about the model assumptions
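A minimal sketch of this top-down (ancestral) sampling for the naïve Bayes spam model: first sample the label Y, then each word indicator X_i given Y. The vocabulary and all probabilities below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["free", "meeting", "viagra", "deadline"]
p_spam = 0.4                                    # Pr(Y = spam)
p_word_given_y = {                              # Pr(X_i = 1 | Y), one entry per vocab word
    "spam":     [0.8, 0.1, 0.5, 0.05],
    "not spam": [0.1, 0.6, 0.01, 0.4],
}

def sample_email():
    # Sample the root first, then its children -- the topological order of the graph.
    y = "spam" if rng.random() < p_spam else "not spam"
    words = [w for w, p in zip(vocab, p_word_given_y[y]) if rng.random() < p]
    return y, words

for _ in range(3):
    print(sample_email())
```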

  21. Inference in Bayesian networks • Computing marginal probabilities in tree-structured Bayesian networks is easy – the algorithm called "belief propagation" generalizes what we showed for hidden Markov models to arbitrary trees (e.g., both the HMM chain and the naïve Bayes model are trees) • Wait… this isn't a tree! What can we do?

  22. Inference in Bayesian networks • In some cases (such as this one) we can transform the network into what is called a "junction tree", and then run belief propagation

  23. Approximate inference • There is also a wealth of approximate inference algorithms that can be applied to Bayesian networks such as these • Markov chain Monte Carlo algorithms repeatedly sample assignments for estimating marginals • Variational inference algorithms (deterministic) find a simpler distribution which is "close" to the original, then compute marginals using the simpler distribution
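As one concrete illustration of the MCMC idea, a minimal Gibbs-sampling sketch on the earlier HMM: repeatedly resample each hidden state given its neighbours and its observation, then estimate the filtering marginal Pr(x_n | y_1, …, y_n) from the sample counts. All numbers are toy values, and the sweep/burn-in counts are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

initial = np.array([0.6, 0.4])
transition = np.array([[0.7, 0.3], [0.2, 0.8]])
emission = np.array([[0.9, 0.1], [0.3, 0.7]])
y = [0, 1, 1]
n, k = len(y), len(initial)

x = rng.integers(k, size=n)                  # arbitrary starting assignment
counts = np.zeros(k)
for sweep in range(5000):
    for t in range(n):
        # Pr(x_t | everything else) is proportional to the product of the factors touching x_t.
        p = (initial if t == 0 else transition[x[t - 1]]).copy()
        if t + 1 < n:
            p = p * transition[:, x[t + 1]]
        p = p * emission[:, y[t]]
        x[t] = rng.choice(k, p=p / p.sum())
    if sweep >= 500:                         # discard burn-in sweeps
        counts[x[-1]] += 1

print(counts / counts.sum())                 # approximates Pr(x_n | y_1..y_n)
```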
