Bayesian Networks and Decision Graphs, Chapter 6


1. Bayesian Networks and Decision Graphs, Chapter 6

2. Learning probabilities from a database

We have:
➤ A Bayesian network structure.
➤ A database of cases over (some of) the variables.

We want:
➤ A Bayesian network model (with probabilities) representing the database.

The network structure is Pr → Bt and Pr → Ut, with the distributions P(Pr), P(Bt | Pr) and P(Ut | Pr) to be estimated from the cases:

Case   Pr    Bt    Ut
1      ?     pos   pos
2      yes   neg   pos
3      yes   pos   ?
4      yes   pos   neg
5      ?     neg   ?

3. Complete data: maximum likelihood estimation

We have tossed a thumb tack 100 times. It has landed pin up 80 times, and we now look for the model that best fits the observations/data.

[Figure: candidate models M_0.1, M_0.2, M_0.3, ... All share the same structure (a single variable T) and differ only in the parameter, P(pin up) = 0.1, 0.2, 0.3, ...]

4. Complete data: maximum likelihood estimation (continued)

Same thumb tack setup as above. We can measure how well a model M_θ fits the data D using:

P(D | M_θ) = P(pin up, pin up, pin down, ..., pin up | M_θ)
           = P(pin up | M_θ) · P(pin up | M_θ) · P(pin down | M_θ) · ... · P(pin up | M_θ)

This is also called the likelihood of M_θ given D.

5. Complete data: maximum likelihood estimation (continued)

We select the parameter θ̂ that maximizes the likelihood:

θ̂ = argmax_θ P(D | M_θ)
   = argmax_θ ∏_{i=1}^{100} P(d_i | M_θ)
   = argmax_θ μ · θ^80 · (1 − θ)^20

6. Complete data: maximum likelihood estimation (continued)

By setting

d/dθ [ μ · θ^80 · (1 − θ)^20 ] = 0

we get the maximum likelihood estimate θ̂ = 0.8.
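As a quick sanity check on this result, here is a minimal Python sketch (my own, not from the slides) that evaluates the log-likelihood on a grid of θ values and confirms that the maximizer coincides with the relative frequency 80/100:

```python
import numpy as np

n_up, n_down = 80, 20
thetas = np.linspace(0.001, 0.999, 999)

# Use the log-likelihood to avoid numerical underflow of theta^80 * (1 - theta)^20.
log_lik = n_up * np.log(thetas) + n_down * np.log(1 - thetas)

theta_hat = thetas[np.argmax(log_lik)]
print(theta_hat)                  # approx. 0.8
print(n_up / (n_up + n_down))     # closed-form maximum likelihood estimate: 0.8
```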

7. Complete data: maximum likelihood estimation

In general, you get a maximum likelihood estimate as the fraction of counts over the total number of counts. Suppose we want P(A = a | B = b, C = c) for a variable A with parents B and C. The next three slides build up the maximum likelihood estimate P̂(A = a | B = b, C = c) step by step.

8. Complete data: maximum likelihood estimation (continued)

Start from the definition of conditional probability:

P̂(A = a | B = b, C = c) = P̂(A = a, B = b, C = c) / P̂(B = b, C = c)

9. Complete data: maximum likelihood estimation (continued)

Estimate each joint probability by its relative frequency in the database, where N is the total number of cases:

P̂(A = a | B = b, C = c) = [N(A = a, B = b, C = c) / N] / [N(B = b, C = c) / N]

10. Complete data: maximum likelihood estimation (continued)

The factors of N cancel:

P̂(A = a | B = b, C = c) = N(A = a, B = b, C = c) / N(B = b, C = c)

So we have a simple counting problem!
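To make the counting concrete, here is a small Python sketch (the tiny case list and the helper `mle` are my own illustration, not from the book) that estimates P̂(A = a | B = b, C = c) directly from counts:

```python
# A tiny, made-up complete database over A, B and C.
cases = [
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a2", "B": "b1", "C": "c1"},
    {"A": "a1", "B": "b2", "C": "c1"},
]

def mle(cases, a, b, c):
    """Maximum likelihood estimate P_hat(A=a | B=b, C=c) = N(A=a,B=b,C=c) / N(B=b,C=c)."""
    n_abc = sum(1 for d in cases if d["A"] == a and d["B"] == b and d["C"] == c)
    n_bc = sum(1 for d in cases if d["B"] == b and d["C"] == c)
    return n_abc / n_bc

print(mle(cases, "a1", "b1", "c1"))  # N(a1,b1,c1) / N(b1,c1) = 2/3
```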

11. Complete data: maximum likelihood estimation

Unfortunately, maximum likelihood estimation has a drawback. Consider the following counts of five-letter sequences (T1, ..., T5) over the letters a and b, indexed by the first two letters (rows) and the last three letters (columns):

        aaa  aab  aba  abb  baa  bba  bab  bbb
  aa     2    2    2    2    5    7    5    7
  ab     3    4    4    4    1    2    0    2
  ba     0    1    0    0    3    5    3    5
  bb     5    6    6    6    2    2    2    2

By using this table to estimate, e.g., P(T1 = b, T2 = a, T3 = T4 = T5 = a) we get:

P̂(T1 = b, T2 = a, T3 = T4 = T5 = a) = N(T1 = b, T2 = a, T3 = T4 = T5 = a) / N = 0

This is not reliable!

12. Complete data: maximum likelihood estimation

An even prior distribution corresponds to adding a virtual count of 1 to every configuration. From the table on the previous slide we get the counts N(T1, T2), add 1 to each, and normalize:

             T1 = a   T1 = b            T1 = a   T1 = b            T1 = a   T1 = b
  T2 = a       32       17      ⇒       32 + 1   17 + 1     ⇒      33/54    18/50
  T2 = b       20       31              20 + 1   31 + 1            21/54    32/50

That is, with N'(T1, T2) = N(T1, T2) + 1 and N'(T1) = Σ_{T2} N'(T1, T2):

P̂(T2 | T1) = N'(T1, T2) / N'(T1)
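A sketch of the same computation in Python (the array layout is my own; the counts are the ones from the slide), adding a virtual count of 1 before normalizing so that no configuration is estimated as impossible:

```python
import numpy as np

# Rows: T1 in {a, b}; columns: T2 in {a, b}. Counts N(T1, T2) taken from the slide.
counts = np.array([[32.0, 20.0],    # T1 = a: N(T2 = a) = 32, N(T2 = b) = 20
                   [17.0, 31.0]])   # T1 = b: N(T2 = a) = 17, N(T2 = b) = 31

virtual = counts + 1.0              # N'(T1, T2) = N(T1, T2) + 1

# P_hat(T2 | T1) = N'(T1, T2) / N'(T1): normalize over T2 for each value of T1.
p_t2_given_t1 = virtual / virtual.sum(axis=1, keepdims=True)
print(p_t2_given_t1)                # [[33/54, 21/54], [18/50, 32/50]]
```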

13. Incomplete data

How do we handle cases with missing values?
➤ Faulty sensor readings.
➤ Values have been intentionally removed.
➤ Some variables may be unobservable.

Why don't we just throw away the cases with missing values?

14. Incomplete data

Why don't we just throw away the cases with missing values? Consider a database over A and B with 20 cases: ten cases (a1, b1), five cases (a2, b1) and five cases (a2, ?) in which the value of B is missing.

Using the entire database (A is observed in every case):

P̂(a1) = N(a1) / (N(a1) + N(a2)) = 10 / (10 + 10) = 0.5

Having removed the cases with missing values:

P̂'(a1) = N'(a1) / (N'(a1) + N'(a2)) = 10 / (10 + 5) = 2/3
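The same comparison as a few lines of Python (a sketch mirroring the slide's numbers; the list construction is my own):

```python
# Ten (a1, b1) cases, five (a2, b1) cases and five (a2, ?) cases; None marks a missing B.
cases = [("a1", "b1")] * 10 + [("a2", "b1")] * 5 + [("a2", None)] * 5

# Using the entire database (A is observed in every case):
n_a1 = sum(1 for a, _ in cases if a == "a1")
print(n_a1 / len(cases))                       # 10 / 20 = 0.5

# Having removed the cases with a missing value:
complete = [(a, b) for a, b in cases if b is not None]
n_a1_complete = sum(1 for a, _ in complete if a == "a1")
print(n_a1_complete / len(complete))           # 10 / 15 ≈ 0.67
```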

15. How is the data missing?

We need to take into account how the data is missing:

Missing completely at random (MCAR): the probability that a value is missing is independent of both the observed and the unobserved values.

Missing at random (MAR): the probability that a value is missing depends only on the observed values.

Non-ignorable: neither MAR nor MCAR.

What is the type of missingness:
➤ In an exit poll where an extreme right-wing party is running for parliament?
➤ In a database containing the results of two tests, where the second test has only been performed (as a "backup test") when the result of the first test was negative?
➤ In a monitoring system that is not completely stable and where some sensor values are not stored properly?

16. The EM algorithm

The network is Pr → Bt, Pr → Ut, and the database of cases is:

Case   Pr    Bt    Ut
1      ?     pos   pos
2      yes   neg   pos
3      yes   pos   ?
4      yes   pos   neg
5      ?     neg   ?

Task: estimate the required probability distributions for the network.

17. The EM algorithm

If the database were complete, we would estimate the required probabilities P(Pr), P(Ut | Pr) and P(Bt | Pr) as, e.g.:

P(Pr = yes) = N(Pr = yes) / N
P(Ut = yes | Pr = yes) = N(Ut = yes, Pr = yes) / N(Pr = yes)
P(Bt = yes | Pr = no) = N(Bt = yes, Pr = no) / N(Pr = no)

So estimating the probabilities is basically a counting problem!

18. The EM algorithm

Estimate P(Pr) from the database above. Cases 2, 3 and 4 each contribute a value of 1 to N(Pr = yes), but what are the contributions from cases 1 and 5?

➤ Case 1 contributes with P(Pr = yes | Bt = pos, Ut = pos).
➤ Case 5 contributes with P(Pr = yes | Bt = neg).

To find these probabilities we assume that some initial distributions, P0(·), have been assigned to the network. We are basically calculating the expectation of N(Pr = yes), denoted E[N(Pr = yes)].

19. The EM algorithm

Using P0(Pr) = (0.5, 0.5), P0(Bt | Pr = yes) = (0.5, 0.5), etc., as starting distributions, we get:

E[N(Pr = yes)] = P0(Pr = yes | Bt = Ut = pos) + 1 + 1 + 1 + P0(Pr = yes | Bt = neg)
               = 0.5 + 1 + 1 + 1 + 0.5 = 4

E[N(Pr = no)] = P0(Pr = no | Bt = Ut = pos) + 0 + 0 + 0 + P0(Pr = no | Bt = neg)
              = 0.5 + 0 + 0 + 0 + 0.5 = 1

So we get, e.g.:

P̂1(Pr = yes) = E[N(Pr = yes)] / N = 4/5 = 0.8
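A few lines of Python mirroring this arithmetic (the contribution list is written out by hand, exactly as on the slide; a sketch, not the book's implementation):

```python
# Per-case contributions to N(Pr = yes) under the uniform starting model P0.
contributions = [
    0.5,  # case 1: Pr missing, contributes P0(Pr=yes | Bt=pos, Ut=pos) = 0.5
    1.0,  # case 2: Pr = yes observed
    1.0,  # case 3: Pr = yes observed
    1.0,  # case 4: Pr = yes observed
    0.5,  # case 5: Pr missing, contributes P0(Pr=yes | Bt=neg) = 0.5
]

expected_n_yes = sum(contributions)            # E[N(Pr = yes)] = 4.0
print(expected_n_yes / len(contributions))     # P1(Pr = yes) = 4 / 5 = 0.8
```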

20. The EM algorithm

To estimate P̂1(Ut | Pr) = E[N(Ut, Pr)] / E[N(Pr)] we need, e.g.:

E[N(Ut = pos, Pr = yes)] = P0(Ut = pos, Pr = yes | Bt = pos, Ut = pos) + 1
                           + P0(Ut = pos, Pr = yes | Bt = pos, Pr = yes) + 0
                           + P0(Ut = pos, Pr = yes | Bt = neg)
                         = 0.5 + 1 + 0.5 + 0 + 0.25 = 2.25

E[N(Pr = yes)] = P0(Pr = yes | Bt = Ut = pos) + 1 + 1 + 1 + P0(Pr = yes | Bt = neg)
               = 0.5 + 1 + 1 + 1 + 0.5 = 4

So we get, e.g.:

P̂1(Ut = pos | Pr = yes) = E[N(Ut = pos, Pr = yes)] / E[N(Pr = yes)] = 2.25 / 4 = 0.5625
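To show where these posterior contributions come from, here is a self-contained sketch of the first E-step and M-step for this network (the helper functions and variable names are my own; only the five cases and the uniform starting distributions come from the slides). It computes the posteriors by brute-force enumeration over the missing values, which is feasible for a network this small; a full EM run would simply repeat these two steps with the updated distributions:

```python
from itertools import product

VARS = ["Pr", "Bt", "Ut"]
VALS = {"Pr": ["yes", "no"], "Bt": ["pos", "neg"], "Ut": ["pos", "neg"]}

def p0_joint(cfg):
    """Joint probability of a full configuration under the uniform starting model P0:
    P0(Pr) * P0(Bt | Pr) * P0(Ut | Pr), with every entry equal to 0.5."""
    return 0.5 * 0.5 * 0.5

cases = [                                      # None marks a missing value
    {"Pr": None,  "Bt": "pos", "Ut": "pos"},   # case 1
    {"Pr": "yes", "Bt": "neg", "Ut": "pos"},   # case 2
    {"Pr": "yes", "Bt": "pos", "Ut": None},    # case 3
    {"Pr": "yes", "Bt": "pos", "Ut": "neg"},   # case 4
    {"Pr": None,  "Bt": "neg", "Ut": None},    # case 5
]

def posterior_completions(case):
    """All full configurations consistent with the case, weighted by P0(config | evidence)."""
    weighted = []
    for combo in product(*(VALS[v] for v in VARS)):
        cfg = dict(zip(VARS, combo))
        if all(case[v] is None or case[v] == cfg[v] for v in VARS):
            weighted.append((cfg, p0_joint(cfg)))
    z = sum(w for _, w in weighted)            # normalize over the consistent completions
    return [(cfg, w / z) for cfg, w in weighted]

# E-step: accumulate the expected counts used on the slides.
e_pr_yes = 0.0
e_ut_pos_pr_yes = 0.0
for case in cases:
    for cfg, w in posterior_completions(case):
        if cfg["Pr"] == "yes":
            e_pr_yes += w
            if cfg["Ut"] == "pos":
                e_ut_pos_pr_yes += w

# M-step for the two quantities computed on the slides.
print(e_pr_yes / len(cases))         # P1(Pr = yes)            = 4 / 5    = 0.8
print(e_ut_pos_pr_yes / e_pr_yes)    # P1(Ut = pos | Pr = yes) = 2.25 / 4 = 0.5625
```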
