Bayesian Networks and Decision Graphs, Chapter 6

Learning probabilities from a database

We have:
➤ A Bayesian network structure.
➤ A database of cases over (some of) the variables.

We want:
➤ A Bayesian network model (with probabilities) representing the database.

Network structure: Pr → Bt and Pr → Ut, with the distributions $P(\mathrm{Pr})$, $P(\mathrm{Bt} \mid \mathrm{Pr})$ and $P(\mathrm{Ut} \mid \mathrm{Pr})$ to be estimated.

Case   Pr    Bt    Ut
1.     ?     pos   pos
2.     yes   neg   pos
3.     yes   pos   ?
4.     yes   pos   neg
5.     ?     neg   ?

Complete data: Maximum likelihood estimation

We have tossed a thumb tack 100 times. It has landed pin up 80 times, and we now look for the model that best fits the observations/data:

Model       Structure        P(pin up)
$M_{0.1}$   single node T    0.1
$M_{0.2}$   single node T    0.2
$M_{0.3}$   single node T    0.3

We can measure how well a model fits the data using:

$$P(\mathcal{D} \mid M_\theta) = P(\text{pin up}, \text{pin up}, \text{pin down}, \ldots, \text{pin up} \mid M_\theta) = P(\text{pin up} \mid M_\theta)\, P(\text{pin up} \mid M_\theta)\, P(\text{pin down} \mid M_\theta) \cdots P(\text{pin up} \mid M_\theta)$$

This is also called the likelihood of $M_\theta$ given $\mathcal{D}$.
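As a rough illustration, the likelihood of the three candidate models can be compared numerically. The sketch below assumes independent tosses and works with the log-likelihood to avoid underflow; the function name `log_likelihood` and the variable names are illustrative choices, not taken from the slides.

```python
import math

def log_likelihood(theta, n_up, n_down):
    """Log of P(D | M_theta) for one specific sequence of independent tosses."""
    return n_up * math.log(theta) + n_down * math.log(1.0 - theta)

# The three candidate values of P(pin up) shown on the slide.
candidates = [0.1, 0.2, 0.3]

# 80 pin-up and 20 pin-down outcomes, as in the thumb tack experiment.
scores = {theta: log_likelihood(theta, 80, 20) for theta in candidates}

# The candidate with the highest (log-)likelihood fits the data best.
best = max(scores, key=scores.get)
print(best, scores)
```

Among these three candidates, the one closest to the observed frequency of 80/100 obtains the highest likelihood.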

We select the parameter $\hat{\theta}$ that maximizes the likelihood:

$$\hat{\theta} = \arg\max_{\theta} P(\mathcal{D} \mid M_\theta) = \arg\max_{\theta} \prod_{i=1}^{100} P(d_i \mid M_\theta) = \arg\max_{\theta}\, \mu \cdot \theta^{80} (1 - \theta)^{20}.$$

By setting

$$\frac{d}{d\theta}\, \mu \cdot \theta^{80} (1 - \theta)^{20} = 0$$

we get the maximum likelihood estimate: $\hat{\theta} = 0.8$.
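The intermediate steps are not spelled out on the slide. One way to fill them in, assuming that $\mu$ does not depend on $\theta$ and that $0 < \theta < 1$, is:

```latex
\begin{align*}
\frac{d}{d\theta}\, \mu \cdot \theta^{80} (1 - \theta)^{20}
  &= \mu \left[ 80\, \theta^{79} (1 - \theta)^{20} - 20\, \theta^{80} (1 - \theta)^{19} \right] \\
  &= \mu\, \theta^{79} (1 - \theta)^{19} \left[ 80 (1 - \theta) - 20\, \theta \right] = 0 \\
  &\Rightarrow\; 80 - 100\, \theta = 0
  \;\Rightarrow\; \hat{\theta} = \frac{80}{100} = 0.8 .
\end{align*}
```

The result is the fraction of pin-up outcomes among all tosses.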

Complete data: maximum likelihood estimation

In general, you get a maximum likelihood estimate as the fraction of counts over the total number of counts.

Network structure: B → A ← C. We want $P(A = a \mid B = b, C = c)$!

To find the maximum likelihood estimate $\hat{P}(A = a \mid B = b, C = c)$ we simply calculate:

$$\hat{P}(A = a \mid B = b, C = c) = \frac{\hat{P}(A = a, B = b, C = c)}{\hat{P}(B = b, C = c)} = \frac{N(A = a, B = b, C = c) / N}{N(B = b, C = c) / N} = \frac{N(A = a, B = b, C = c)}{N(B = b, C = c)}.$$

So we have a simple counting problem!
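The counting can be sketched in a few lines of Python. The helper `mle_conditional` and the tiny database below are hypothetical and only serve to illustrate the ratio $N(A, B, C) / N(B, C)$; they assume a complete database stored as a list of dictionaries.

```python
from collections import Counter

def mle_conditional(cases, child, parents):
    """Return {(parent_values, child_value): N(child, parents) / N(parents)}."""
    joint = Counter()     # N(A = a, B = b, C = c)
    marginal = Counter()  # N(B = b, C = c)
    for case in cases:
        pa = tuple(case[p] for p in parents)
        joint[(pa, case[child])] += 1
        marginal[pa] += 1
    return {key: n / marginal[key[0]] for key, n in joint.items()}

# Hypothetical complete database over A, B and C (values invented for illustration).
cases = [
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a2", "B": "b1", "C": "c1"},
    {"A": "a1", "B": "b2", "C": "c1"},
]

cpt = mle_conditional(cases, "A", ["B", "C"])
print(cpt[(("b1", "c1"), "a1")])  # N(a1, b1, c1) / N(b1, c1)
```

Parent configurations that never occur in the database get no entry at all, and combinations with a zero count are simply absent from the result.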

Complete data: maximum likelihood estimation

Unfortunately, maximum likelihood estimation has a drawback:

                        Last three letters
First two letters   aaa  aab  aba  abb  baa  bba  bab  bbb
aa                    2    2    2    2    5    7    5    7
ab                    3    4    4    4    1    2    0    2
ba                    0    1    0    0    3    5    3    5
bb                    5    6    6    6    2    2    2    2

By using this table to estimate e.g. $P(T_1 = b, T_2 = a, T_3 = T_4 = T_5 = a)$ we get:

$$\hat{P}(T_1 = b, T_2 = a, T_3 = T_4 = T_5 = a) = \frac{N(T_1 = b, T_2 = a, T_3 = T_4 = T_5 = a)}{N} = 0$$

This is not reliable!

Complete data: maximum likelihood estimation

An even prior distribution corresponds to adding a virtual count of 1:

                        Last three letters
First two letters   aaa  aab  aba  abb  baa  bba  bab  bbb
aa                    2    2    2    2    5    7    5    7
ab                    3    4    4    4    1    2    0    2
ba                    0    1    0    0    3    5    3    5
bb                    5    6    6    6    2    2    2    2

From this table we get:

$N(T_1, T_2)$           $T_1 = a$   $T_1 = b$
$T_2 = a$                  32          17
$T_2 = b$                  20          31

⇒

$N'(T_1, T_2)$          $T_1 = a$   $T_1 = b$
$T_2 = a$                32 + 1      17 + 1
$T_2 = b$                20 + 1      31 + 1

⇒

$P(T_2 \mid T_1)$       $T_1 = a$   $T_1 = b$
$T_2 = a$                 33/54       18/50
$T_2 = b$                 21/54       32/50

where

$$P(T_2 \mid T_1) = \frac{N'(T_1, T_2)}{N'(T_1)}.$$
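The virtual-count step can be reproduced with a short sketch. It assumes, as in the tables above, that the virtual count of 1 is added to each cell of the aggregated $N(T_1, T_2)$ table; the function name `smoothed_conditional` is an illustrative choice.

```python
# N(T1, T2), keyed by (t1, t2), obtained by summing each row of the letter table.
counts = {
    ("a", "a"): 32, ("a", "b"): 20,
    ("b", "a"): 17, ("b", "b"): 31,
}

def smoothed_conditional(counts, virtual=1):
    """Estimate P(T2 | T1) as N'(T1, T2) / N'(T1) with N' = N + virtual."""
    adjusted = {key: n + virtual for key, n in counts.items()}
    totals = {}  # N'(T1), the column sums of the adjusted table
    for (t1, _), n in adjusted.items():
        totals[t1] = totals.get(t1, 0) + n
    return {(t1, t2): n / totals[t1] for (t1, t2), n in adjusted.items()}

p_t2_given_t1 = smoothed_conditional(counts)
print(p_t2_given_t1[("a", "a")])  # 33 / 54
print(p_t2_given_t1[("b", "b")])  # 32 / 50
```

With `virtual=0` the function falls back to the plain maximum likelihood estimate, which makes the effect of the prior easy to compare.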

Incomplete data

How do we handle cases with missing values:
➤ Faulty sensor readings.
➤ Values have been intentionally removed.
➤ Some variables may be unobservable.

Why don't we just throw away the cases with missing values?

Consider a database of 20 cases over A and B:

A     B     Number of cases
a1    b1    10
a2    b1    5
a2    ?     5

Using the entire database:

$$\hat{P}(a_1) = \frac{N(a_1)}{N(a_1) + N(a_2)} = \frac{10}{10 + 10} = 0.5.$$

Having removed the cases with missing values:

$$\hat{P}'(a_1) = \frac{N'(a_1)}{N'(a_1) + N'(a_2)} = \frac{10}{10 + 5} = 2/3.$$
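The two estimates can be checked with a small sketch that rebuilds the 20-case database. Representing a missing value as `None` and the helper name `estimate_a1` are assumptions made for this illustration.

```python
# The 20 cases over (A, B); None marks a missing value of B.
cases = [("a1", "b1")] * 10 + [("a2", "b1")] * 5 + [("a2", None)] * 5

def estimate_a1(cases):
    """Fraction of the given cases in which A = a1."""
    n_a1 = sum(1 for a, _ in cases if a == "a1")
    return n_a1 / len(cases)

# Estimate from the entire database: A is always observed.
full = estimate_a1(cases)

# Estimate after discarding every case that contains a missing value.
complete_only = estimate_a1([case for case in cases if None not in case])

print(full, complete_only)
```

Here the missing values occur only together with a2, so discarding incomplete cases removes a2 cases selectively and shifts the estimate towards a1.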

How is the data missing?

We need to take into account how the data is missing:

Missing completely at random (MCAR): The probability that a value is missing is independent of both the observed and unobserved values.
Missing at random (MAR): The probability that a value is missing depends only on the observed values.
Non-ignorable: Neither MAR nor MCAR.

What is the type of missingness:
➤ In an exit poll, where an extreme right-wing party is running for parliament?
➤ In a database containing the results of two tests, where the second test has only been performed (as a "backup test") when the result of the first test was negative?
➤ In a monitoring system that is not completely stable and where some sensor values are not stored properly?

The EM algorithm

Network structure: Pr → Bt and Pr → Ut.

Case   Pr    Bt    Ut
1.     ?     pos   pos
2.     yes   neg   pos
3.     yes   pos   ?
4.     yes   pos   neg
5.     ?     neg   ?

Estimate the required probability distributions for the network.

If the database were complete we would estimate the required probabilities, $P(\mathrm{Pr})$, $P(\mathrm{Ut} \mid \mathrm{Pr})$ and $P(\mathrm{Bt} \mid \mathrm{Pr})$, as:

$$P(\mathrm{Pr} = \text{yes}) = \frac{N(\mathrm{Pr} = \text{yes})}{N}$$

$$P(\mathrm{Ut} = \text{pos} \mid \mathrm{Pr} = \text{yes}) = \frac{N(\mathrm{Ut} = \text{pos}, \mathrm{Pr} = \text{yes})}{N(\mathrm{Pr} = \text{yes})}$$

$$P(\mathrm{Bt} = \text{pos} \mid \mathrm{Pr} = \text{no}) = \frac{N(\mathrm{Bt} = \text{pos}, \mathrm{Pr} = \text{no})}{N(\mathrm{Pr} = \text{no})}$$

So estimating the probabilities is basically a counting problem!

Estimate $P(\mathrm{Pr})$ from the database: Cases 2, 3 and 4 each contribute a value of 1 to $N(\mathrm{Pr} = \text{yes})$, but what is the contribution from cases 1 and 5?
➤ Case 1 contributes $P(\mathrm{Pr} = \text{yes} \mid \mathrm{Bt} = \text{pos}, \mathrm{Ut} = \text{pos})$.
➤ Case 5 contributes $P(\mathrm{Pr} = \text{yes} \mid \mathrm{Bt} = \text{neg})$.

To find these probabilities we assume that some initial distributions, $P_0(\cdot)$, have been assigned to the network.

We are basically calculating the expectation of $N(\mathrm{Pr} = \text{yes})$, denoted $E[N(\mathrm{Pr} = \text{yes})]$.

Using $P_0(\mathrm{Pr}) = (0.5, 0.5)$, $P_0(\mathrm{Bt} \mid \mathrm{Pr} = \text{yes}) = (0.5, 0.5)$ etc., as starting distributions we get:

$$E[N(\mathrm{Pr} = \text{yes})] = P_0(\mathrm{Pr} = \text{yes} \mid \mathrm{Bt} = \mathrm{Ut} = \text{pos}) + 1 + 1 + 1 + P_0(\mathrm{Pr} = \text{yes} \mid \mathrm{Bt} = \text{neg}) = 0.5 + 1 + 1 + 1 + 0.5 = 4$$

$$E[N(\mathrm{Pr} = \text{no})] = P_0(\mathrm{Pr} = \text{no} \mid \mathrm{Bt} = \mathrm{Ut} = \text{pos}) + 0 + 0 + 0 + P_0(\mathrm{Pr} = \text{no} \mid \mathrm{Bt} = \text{neg}) = 0.5 + 0 + 0 + 0 + 0.5 = 1$$

So we e.g. get:

$$\hat{P}_1(\mathrm{Pr} = \text{yes}) = \frac{E[N(\mathrm{Pr} = \text{yes})]}{N} = \frac{4}{5} = 0.8$$

To estimate $\hat{P}_1(\mathrm{Ut} \mid \mathrm{Pr}) = E[N(\mathrm{Ut}, \mathrm{Pr})] / E[N(\mathrm{Pr})]$ we e.g. need (writing p, n, y for pos, neg, yes):

$$E[N(\mathrm{Ut} = p, \mathrm{Pr} = y)] = P_0(\mathrm{Ut} = p, \mathrm{Pr} = y \mid \mathrm{Bt} = \mathrm{Ut} = p) + 1 + P_0(\mathrm{Ut} = p, \mathrm{Pr} = y \mid \mathrm{Bt} = p, \mathrm{Pr} = y) + 0 + P_0(\mathrm{Ut} = p, \mathrm{Pr} = y \mid \mathrm{Bt} = n) = 0.5 + 1 + 0.5 + 0 + 0.25 = 2.25$$

$$E[N(\mathrm{Pr} = \text{yes})] = P_0(\mathrm{Pr} = \text{yes} \mid \mathrm{Bt} = \mathrm{Ut} = \text{pos}) + 1 + 1 + 1 + P_0(\mathrm{Pr} = \text{yes} \mid \mathrm{Bt} = \text{neg}) = 0.5 + 1 + 1 + 1 + 0.5 = 4$$

So we e.g. get:

$$\hat{P}_1(\mathrm{Ut} = \text{pos} \mid \mathrm{Pr} = \text{yes}) = \frac{E[N(\mathrm{Ut} = p, \mathrm{Pr} = y)]}{E[N(\mathrm{Pr} = \text{yes})]} = \frac{2.25}{4} = 0.5625$$
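One EM iteration for this small network can be sketched as follows. This is a minimal illustration under stated assumptions rather than the slides' own implementation: missing values are encoded as `None`, every case is completed by brute-force enumeration (feasible only for tiny networks), and the names `joint`, `em_step` and `uniform` are hypothetical.

```python
from itertools import product

PR, BT, UT = ("yes", "no"), ("pos", "neg"), ("pos", "neg")

# The five cases (Pr, Bt, Ut); None marks a missing value.
cases = [
    (None, "pos", "pos"),
    ("yes", "neg", "pos"),
    ("yes", "pos", None),
    ("yes", "pos", "neg"),
    (None, "neg", None),
]

def joint(params, pr, bt, ut):
    """P(Pr, Bt, Ut) = P(Pr) * P(Bt | Pr) * P(Ut | Pr) for the structure Pr -> Bt, Pr -> Ut."""
    p_pr, p_bt, p_ut = params
    return p_pr[pr] * p_bt[(bt, pr)] * p_ut[(ut, pr)]

def em_step(params, cases):
    """One E-step (expected counts) followed by one M-step (normalized counts)."""
    n_pr = {pr: 0.0 for pr in PR}
    n_bt = {(bt, pr): 0.0 for bt in BT for pr in PR}
    n_ut = {(ut, pr): 0.0 for ut in UT for pr in PR}
    for case in cases:
        # E-step: enumerate all completions consistent with the observed values
        # and weight each by its posterior probability under the current parameters.
        completions = [
            c for c in product(PR, BT, UT)
            if all(obs is None or obs == val for obs, val in zip(case, c))
        ]
        weights = [joint(params, *c) for c in completions]
        total = sum(weights)
        for (pr, bt, ut), w in zip(completions, weights):
            n_pr[pr] += w / total
            n_bt[(bt, pr)] += w / total
            n_ut[(ut, pr)] += w / total
    # M-step: treat the expected counts as if they were observed counts.
    # (No guard against zero expected counts; this sketch assumes they stay positive.)
    n = len(cases)
    new_pr = {pr: n_pr[pr] / n for pr in PR}
    new_bt = {key: n_bt[key] / n_pr[key[1]] for key in n_bt}
    new_ut = {key: n_ut[key] / n_pr[key[1]] for key in n_ut}
    return new_pr, new_bt, new_ut

# Uniform starting distributions P_0, as on the slides.
uniform = (
    {pr: 0.5 for pr in PR},
    {(bt, pr): 0.5 for bt in BT for pr in PR},
    {(ut, pr): 0.5 for ut in UT for pr in PR},
)

p1_pr, p1_bt, p1_ut = em_step(uniform, cases)
print(p1_pr["yes"], p1_ut[("pos", "yes")])
```

Calling `em_step` repeatedly on its own output, until the parameters change very little, gives the usual iteration; the first step reproduces the hand-computed values for $\hat{P}_1(\mathrm{Pr} = \text{yes})$ and $\hat{P}_1(\mathrm{Ut} = \text{pos} \mid \mathrm{Pr} = \text{yes})$.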
