Bayesian networks
Lecture 18
David Sontag, New York University
Outline for today
- Modeling sequential data (e.g., time series, speech processing) using hidden Markov models (HMMs)
- Bayesian networks
– Independence properties
– Examples
– Learning and inference
Example application: Tracking
Radar
Observe noisy measurements of missile location: Y1, Y2, … Where is the missile now? Where will it be in 10 seconds?
Probabilistic approach
- Our measurements of the missile location were Y1, Y2, …, Yn
- Let Xt be the true <missile location, velocity> at time t
- To keep this simple, suppose that everything is discrete, i.e., Xt takes the values 1, …, k
Grid the space:
Probabilistic approach
- First, we specify the conditional distribution Pr(Xt | Xt-1). From basic physics, we can bound the distance that the missile can have traveled in one time step.
- Then, we specify Pr(Yt | Xt = <(10,20), 200 mph toward the northeast>): with probability ½, Yt = Xt (ignoring the velocity); otherwise, Yt is a uniformly chosen grid location. (A small code sketch of both CPDs follows below.)
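To make the setup concrete, here is a minimal sketch (not from the slides) of what these two CPDs could look like in code. Assumptions: a 10x10 grid of k = 100 locations, velocity ignored, and the physics bound modeled as "move at most one grid cell per time step".

```python
import numpy as np

GRID = 10                      # grid is GRID x GRID
K = GRID * GRID                # number of discrete states

def neighbors(s):
    """States reachable from s in one step (stay put or move one cell)."""
    r, c = divmod(s, GRID)
    return [rr * GRID + cc
            for rr in range(max(0, r - 1), min(GRID, r + 2))
            for cc in range(max(0, c - 1), min(GRID, c + 2))]

# Transition model Pr(X_t = j | X_{t-1} = i): uniform over reachable cells.
T = np.zeros((K, K))
for i in range(K):
    nbrs = neighbors(i)
    T[i, nbrs] = 1.0 / len(nbrs)

# Observation model Pr(Y_t = y | X_t = x): with probability 1/2 the radar
# reports the true cell, otherwise a uniformly chosen grid cell.
E = np.full((K, K), 0.5 / K)
E[np.arange(K), np.arange(K)] += 0.5

# Sanity check: both conditional distributions sum to 1 over their outcomes.
assert np.allclose(T.sum(axis=1), 1.0) and np.allclose(E.sum(axis=1), 1.0)
```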
Hidden Markov models
- Assume that the joint distribution on X1, X2, …, Xn and Y1, Y2, …, Yn factors as follows:

  Pr(x_1, …, x_n, y_1, …, y_n) = Pr(x_1) Pr(y_1 | x_1) ∏_{t=2}^{n} Pr(x_t | x_{t-1}) Pr(y_t | x_t)

- To find out where the missile is now, we do marginal inference: compute Pr(x_n | y_1, …, y_n)
- To find the most likely trajectory, we do MAP (maximum a posteriori) inference: compute arg max_x Pr(x_1, …, x_n | y_1, …, y_n)
(HMMs date back to the 1960s.)
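As a quick illustration of the factorization, here is a short sketch of evaluating the joint probability of one full assignment. It assumes initial, transition, and emission parameters pi, T, E in the style of the tracking sketch above.

```python
import numpy as np

def hmm_joint(pi, T, E, xs, ys):
    """Pr(x_1..x_n, y_1..y_n) = Pr(x_1) Pr(y_1|x_1) * prod_{t>=2} Pr(x_t|x_{t-1}) Pr(y_t|x_t)."""
    p = pi[xs[0]] * E[xs[0], ys[0]]
    for t in range(1, len(xs)):
        p *= T[xs[t - 1], xs[t]] * E[xs[t], ys[t]]
    return p
```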
Inference
- Recall, to find out where the missile is now, we do marginal inference: compute Pr(x_n | y_1, …, y_n)
- How does one compute this?
- Applying the rule of conditional probability, we have:

  Pr(x_n | y_1, …, y_n) = Pr(x_n, y_1, …, y_n) / Pr(y_1, …, y_n) = Pr(x_n, y_1, …, y_n) / Σ_{x̂_n=1}^{k} Pr(x̂_n, y_1, …, y_n)

- Naively, computing the numerator Pr(x_n, y_1, …, y_n) = Σ_{x_1, …, x_{n-1}} Pr(x_1, …, x_n, y_1, …, y_n) would seem to require k^{n-1} summations
Is there a more efficient algorithm?
Marginal inference in HMMs
- Use dynamic programming
- For n = 1, initialize: Pr(x_1, y_1) = Pr(x_1) Pr(y_1 | x_1)
- For n > 1:

  Pr(x_n, y_1, …, y_n)
    = Σ_{x_{n-1}} Pr(x_{n-1}, x_n, y_1, …, y_n)
    = Σ_{x_{n-1}} Pr(x_{n-1}, y_1, …, y_{n-1}) Pr(x_n, y_n | x_{n-1}, y_1, …, y_{n-1})
    = Σ_{x_{n-1}} Pr(x_{n-1}, y_1, …, y_{n-1}) Pr(x_n, y_n | x_{n-1})
    = Σ_{x_{n-1}} Pr(x_{n-1}, y_1, …, y_{n-1}) Pr(x_n | x_{n-1}) Pr(y_n | x_n, x_{n-1})
    = Σ_{x_{n-1}} Pr(x_{n-1}, y_1, …, y_{n-1}) Pr(x_n | x_{n-1}) Pr(y_n | x_n)

- Total running time is O(nk) – linear time!
- Easy to do filtering (a sketch follows below)
(The derivation uses the marginalization rule Pr(A = a) = Σ_b Pr(A = a, B = b) and the chain rule Pr(A = a, B = b) = Pr(A = a) Pr(B = b | A = a).)
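Below is a minimal sketch of this forward recursion (filtering), assuming initial, transition, and emission parameters pi, T, E as in the earlier tracking sketches. In practice one would renormalize each step or work in log space to avoid underflow on long sequences.

```python
import numpy as np

def forward_filter(pi, T, E, ys):
    """Return Pr(x_n | y_1..y_n) by the forward recursion.

    alpha[t, x] stores Pr(x_{t+1} = x, y_1..y_{t+1}); normalizing the last
    row gives the filtering distribution over the current state.
    """
    n, k = len(ys), len(pi)
    alpha = np.zeros((n, k))
    alpha[0] = pi * E[:, ys[0]]                 # Pr(x_1) Pr(y_1 | x_1)
    for t in range(1, n):
        # Pr(x_t, y_1..y_t) = sum_{x_{t-1}} alpha[t-1] * Pr(x_t|x_{t-1}) * Pr(y_t|x_t)
        alpha[t] = (alpha[t - 1] @ T) * E[:, ys[t]]
    return alpha[-1] / alpha[-1].sum()          # normalize the joint to get the conditional
```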
Conditional independence in HMMs
- The remaining steps of the derivation use the conditional independence properties of the HMM: given x_{n-1}, the pair (x_n, y_n) is independent of y_1, …, y_{n-1}; and given x_n, the observation y_n is independent of x_{n-1}.
MAP inference in HMMs
- MAP inference in HMMs can also be solved in linear time!
- Formulate as a shortest paths problem:

  arg max_x Pr(x_1, …, x_n | y_1, …, y_n) = arg max_x Pr(x_1, …, x_n, y_1, …, y_n)
    = arg max_x log Pr(x_1, …, x_n, y_1, …, y_n)
    = arg max_x [ log(Pr(x_1) Pr(y_1 | x_1)) + Σ_{i=2}^{n} log(Pr(x_i | x_{i-1}) Pr(y_i | x_i)) ]
- Build a graph with a source node s, a sink node t, and k nodes per variable X_1, X_2, …, X_{n-1}, X_n:
  – Weight for edge (s, x_1) is −log[ Pr(x_1) Pr(y_1 | x_1) ]
  – Weight for edge (x_{i-1}, x_i) is −log[ Pr(x_i | x_{i-1}) Pr(y_i | x_i) ]
  – Weight for edge (x_n, t) is 0
- The shortest path from s to t gives the MAP assignment. This is called the Viterbi algorithm (a sketch follows below).
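The same computation is usually implemented directly as dynamic programming over negative log-probabilities rather than by building an explicit shortest-paths graph; the two views are equivalent. A minimal sketch, again assuming parameters pi, T, E:

```python
import numpy as np

def viterbi(pi, T, E, ys):
    """Return the MAP state sequence arg max_x Pr(x_1..x_n | y_1..y_n)."""
    n, k = len(ys), len(pi)
    with np.errstate(divide="ignore"):            # allow log(0) = -inf for impossible moves
        lp, lT, lE = np.log(pi), np.log(T), np.log(E)
    score = np.zeros((n, k))                      # best log-prob of any path ending in x_t = x
    back = np.zeros((n, k), dtype=int)            # back-pointers to the best predecessor
    score[0] = lp + lE[:, ys[0]]
    for t in range(1, n):
        cand = score[t - 1][:, None] + lT         # cand[i, j]: best path ending x_{t-1}=i, then x_t=j
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + lE[:, ys[t]]
    xs = [int(score[-1].argmax())]
    for t in range(n - 1, 0, -1):                 # follow back-pointers to recover the MAP path
        xs.append(int(back[t, xs[-1]]))
    return xs[::-1]
```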
Applications of HMMs
- Speech recognition
  – Predict phonemes from the sounds forming words (i.e., the actual signals)
- Natural language processing
  – Predict parts of speech (verb, noun, determiner, etc.) from the words in a sentence
- Computational biology
  – Predict intron/exon regions from DNA
  – Predict protein structure from DNA (locally)
- And many, many more!
HMMs as a graphical model
- We can represent a hidden Markov model with a graph: a chain X1 → X2 → X3 → X4 → X5 → X6, with an observed child Yt attached to each Xt. Shading denotes observed variables (e.g., what is available at test time).
- There is a 1-1 mapping between the graph structure and the factorization of the joint distribution:

  Pr(x_1, …, x_n, y_1, …, y_n) = Pr(x_1) Pr(y_1 | x_1) ∏_{t=2}^{n} Pr(x_t | x_{t-1}) Pr(y_t | x_t)
Naïve Bayes as a graphical model
- We can represent a naïve Bayes model with a graph: the label Y is the parent of the features X1, X2, X3, …, Xn. Shading denotes observed variables (e.g., what is available at test time).
- There is a 1-1 mapping between the graph structure and the factorization of the joint distribution:

  Pr(y, x_1, …, x_n) = Pr(y) ∏_{i=1}^{n} Pr(x_i | y)
Bayesian networks
- A Bayesian network is specified by a directed acyclic graph G = (V, E) with:
  – One node i for each random variable X_i
  – One conditional probability distribution (CPD) per node, p(x_i | x_{Pa(i)}), specifying the variable's probability conditioned on its parents' values
- Corresponds 1-1 with a particular factorization of the joint distribution:

  p(x_1, …, x_n) = ∏_{i∈V} p(x_i | x_{Pa(i)})

- Powerful framework for designing algorithms to perform probability computations
- The 2011 Turing Award (Judea Pearl) was for Bayesian networks
Example
- Consider the following Bayesian network (the "student" network): Difficulty → Grade ← Intelligence, Intelligence → SAT, Grade → Letter, with CPDs
  – p(D): p(d0) = 0.6, p(d1) = 0.4
  – p(I): p(i0) = 0.7, p(i1) = 0.3
  – p(S | I): p(s1 | i0) = 0.05, p(s1 | i1) = 0.8
  – p(L | G): p(l1 | g1) = 0.9, p(l1 | g2) = 0.6, p(l1 | g3) = 0.01
  – p(G | I, D): (g1, g2, g3) = (0.3, 0.4, 0.3) given i0, d0; (0.05, 0.25, 0.7) given i0, d1; (0.9, 0.08, 0.02) given i1, d0; (0.5, 0.3, 0.2) given i1, d1
- What is its joint distribution?

  p(x_1, …, x_n) = ∏_{i∈V} p(x_i | x_{Pa(i)})
  p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)

Example from Koller & Friedman, Probabilistic Graphical Models, 2009
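A small sketch that encodes these CPDs as plain dictionaries and evaluates the joint, with the CPT values as reconstructed above from the Koller & Friedman example:

```python
# One factor per node of the DAG: p(d) p(i) p(g|i,d) p(s|i) p(l|g).
P_D = {"d0": 0.6, "d1": 0.4}
P_I = {"i0": 0.7, "i1": 0.3}
P_G = {("i0", "d0"): {"g1": 0.3,  "g2": 0.4,  "g3": 0.3},
       ("i0", "d1"): {"g1": 0.05, "g2": 0.25, "g3": 0.7},
       ("i1", "d0"): {"g1": 0.9,  "g2": 0.08, "g3": 0.02},
       ("i1", "d1"): {"g1": 0.5,  "g2": 0.3,  "g3": 0.2}}
P_S = {"i0": {"s0": 0.95, "s1": 0.05}, "i1": {"s0": 0.2, "s1": 0.8}}
P_L = {"g1": {"l0": 0.1,  "l1": 0.9},
       "g2": {"l0": 0.4,  "l1": 0.6},
       "g3": {"l0": 0.99, "l1": 0.01}}

def joint(d, i, g, s, l):
    """Joint probability of one complete assignment, one factor per node."""
    return P_D[d] * P_I[i] * P_G[(i, d)][g] * P_S[i][s] * P_L[g][l]

print(joint("d0", "i1", "g1", "s1", "l1"))   # 0.6 * 0.3 * 0.9 * 0.8 * 0.9
```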
Example
- Consider the same Bayesian network (the student network above)
- What is this model assuming? SAT is not marginally independent of Grade, but SAT ⊥ Grade | Intelligence (a quick enumeration check follows below).
Example from Koller & Friedman, Probabilistic Graphical Models, 2009
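A quick brute-force check of this claim, reusing joint() and the CPT dictionaries from the previous sketch (enumeration is fine here: the network has only 48 joint configurations):

```python
from itertools import product

D, I, G, S, L = ["d0", "d1"], ["i0", "i1"], ["g1", "g2", "g3"], ["s0", "s1"], ["l0", "l1"]

def prob(**fixed):
    """Marginal probability of the fixed assignment, summing out all other variables."""
    total = 0.0
    for d, i, g, s, l in product(D, I, G, S, L):
        full = {"d": d, "i": i, "g": g, "s": s, "l": l}
        if all(full[k] == v for k, v in fixed.items()):
            total += joint(d, i, g, s, l)
    return total

# Conditional independence: p(s, g | i) = p(s | i) p(g | i) for every i, s, g.
for i, s, g in product(I, S, G):
    lhs = prob(s=s, g=g, i=i) / prob(i=i)
    rhs = (prob(s=s, i=i) / prob(i=i)) * (prob(g=g, i=i) / prob(i=i))
    assert abs(lhs - rhs) < 1e-12

# Marginal dependence: p(s, g) differs from p(s) p(g).
print(abs(prob(s="s1", g="g1") - prob(s="s1") * prob(g="g1")))   # nonzero
```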
Example
- Consider the same Bayesian network (the student network above)
- Compared to a simple log-linear model to predict intelligence:
  – Captures the non-linearity between grade, course difficulty, and intelligence
  – Modular: training data can come from different sources!
  – Built-in feature selection: the letter of recommendation is irrelevant given the grade
Example from Koller & Friedman, Probabilistic Graphical Models, 2009
Bayesian networks enable use of domain knowledge
- Will my car start this morning? (Heckerman et al., Decision-Theoretic Troubleshooting, 1995)

  p(x_1, …, x_n) = ∏_{i∈V} p(x_i | x_{Pa(i)})
Bayesian networks enable use of domain knowledge
- What is the differential diagnosis? (Beinlich et al., The ALARM Monitoring System, 1989)

  p(x_1, …, x_n) = ∏_{i∈V} p(x_i | x_{Pa(i)})
Bayesian networks are generative models
- Can sample from the joint distribution, top-down
- Suppose Y can be "spam" or "not spam", and X_i is a binary indicator of whether word i is present in the e-mail
- Let's try generating a few emails! (a small sampling sketch follows below)
- Often helps to think about Bayesian networks as a generative model when constructing the structure and thinking about the model assumptions
(Figure: the naïve Bayes model, label Y with feature children X1, …, Xn)
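A minimal generative sketch of this idea. The vocabulary, Pr(Y = spam), and the word probabilities below are made-up toy numbers, not from the slides; sampling proceeds top-down, first the label, then each feature given the label, exactly as the factorization Pr(y) ∏_i Pr(x_i | y) prescribes.

```python
import random

VOCAB = ["free", "money", "meeting", "project", "viagra"]
P_SPAM = 0.3                                        # Pr(Y = spam), assumed
P_WORD = {                                          # Pr(X_i = 1 | Y), assumed
    "spam":     {"free": 0.6, "money": 0.5, "meeting": 0.05, "project": 0.05, "viagra": 0.4},
    "not spam": {"free": 0.1, "money": 0.1, "meeting": 0.4,  "project": 0.5,  "viagra": 0.01},
}

def sample_email(rng=random):
    y = "spam" if rng.random() < P_SPAM else "not spam"        # sample the root first
    words = [w for w in VOCAB if rng.random() < P_WORD[y][w]]  # then each child given its parent
    return y, words

for _ in range(3):
    print(sample_email())   # e.g. ('spam', ['free', 'viagra'])
```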
Inference in Bayesian networks
- Computing marginal probabilities in tree-structured Bayesian networks is easy
  – The algorithm, called "belief propagation", generalizes what we showed for hidden Markov models to arbitrary trees
- Wait… this isn't a tree! What can we do?
(Figures: the HMM chain X1, …, X6 with observations Y1, …, Y6, and the naïve Bayes model with label Y and features X1, …, Xn)
Inference in Bayesian networks
- In some cases (such as this) we can transform the model into what is called a "junction tree", and then run belief propagation
Approximate inference
- There is also a wealth of approximate inference algorithms that can be applied to Bayesian networks such as these
- Markov chain Monte Carlo algorithms repeatedly sample assignments to estimate marginals
- Variational inference algorithms (deterministic) find a simpler distribution which is "close" to the original, then compute marginals using the simpler distribution
Maximum likelihood estimation in Bayesian networks
- Suppose that we know the Bayesian network structure G
- Let θ_{x_i | x_{pa(i)}} be the parameter giving the value of the CPD p(x_i | x_{pa(i)})
- Maximum likelihood estimation corresponds to solving

  max_θ (1/M) Σ_{m=1}^{M} log p(x^m; θ)

  subject to the non-negativity and normalization constraints
- This is equal to

  max_θ (1/M) Σ_{m=1}^{M} log p(x^m; θ) = max_θ (1/M) Σ_{m=1}^{M} Σ_{i=1}^{N} log p(x_i^m | x_{pa(i)}^m; θ)
    = max_θ Σ_{i=1}^{N} (1/M) Σ_{m=1}^{M} log p(x_i^m | x_{pa(i)}^m; θ)

- The optimization problem decomposes into an independent optimization problem for each CPD! It has a simple closed-form solution (a counting sketch follows below).
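A minimal sketch of that closed-form solution for discrete CPDs: the MLE of each entry θ_{x_i | x_{pa(i)}} is simply the empirical count of (x_i, x_{pa(i)}) divided by the count of x_{pa(i)}. The data format and variable names here are assumptions for illustration.

```python
from collections import Counter

def mle_cpd(data, child, parents):
    """Estimate p(child | parents) by counting.

    data: list of dicts mapping variable name -> observed value (one dict per sample).
    Returns a dict mapping (child_value, parent_value_tuple) -> MLE probability.
    """
    joint_counts, parent_counts = Counter(), Counter()
    for sample in data:
        pa = tuple(sample[p] for p in parents)
        parent_counts[pa] += 1
        joint_counts[(sample[child], pa)] += 1
    return {(x, pa): c / parent_counts[pa] for (x, pa), c in joint_counts.items()}

# Example: estimate p(g | i, d) from a few hypothetical fully observed records.
data = [{"d": "d0", "i": "i1", "g": "g1"},
        {"d": "d0", "i": "i1", "g": "g1"},
        {"d": "d0", "i": "i1", "g": "g2"},
        {"d": "d1", "i": "i0", "g": "g3"}]
print(mle_cpd(data, "g", ["i", "d"]))
# {('g1', ('i1', 'd0')): 0.66..., ('g2', ('i1', 'd0')): 0.33..., ('g3', ('i0', 'd1')): 1.0}
```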