ECE 6504: Advanced Topics in Machine Learning – Probabilistic Graphical Models and Large-Scale Learning


  1. ECE 6504: Advanced Topics in Machine Learning Probabilistic Graphical Models and Large-Scale Learning Topics – Bayes Nets – (Finish) Parameter Learning – Structure Learning Readings: KF 18.1, 18.3; Barber 9.5, 10.4 Dhruv Batra Virginia Tech

  2. Administrativia • HW1 – Out – Due in 2 weeks: Feb 17, Feb 19, 11:59pm – Please please please please start early – Implementation: TAN, structure + parameter learning – Please post questions on Scholar Forum. (C) Dhruv Batra 2

  3. Recap of Last Time (C) Dhruv Batra 3

  4. Learning Bayes nets
                              Known structure    Unknown structure
     Fully observable data    Very easy          Hard
     Missing data             Somewhat easy      Very very hard (EM)
     Data x(1), …, x(m) → learn structure + parameters (CPTs P(Xi | PaXi))
     (C) Dhruv Batra Slide Credit: Carlos Guestrin 4

  5. Learning the CPTs • For each discrete variable Xi, from data x(1), …, x(m): P_MLE(Xi = a | PaXi = b) = Count(Xi = a, PaXi = b) / Count(PaXi = b) (C) Dhruv Batra Slide Credit: Carlos Guestrin 5
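A minimal sketch of the counting rule above, assuming fully observed data; the function and variable names are mine, not the course's reference code:

```python
from collections import Counter

def mle_cpt(child_vals, parent_vals):
    """child_vals, parent_vals: equal-length sequences of observed assignments
    (parent_vals can be tuples when X_i has several parents).
    Returns a dict mapping (a, b) -> Count(X_i = a, Pa = b) / Count(Pa = b)."""
    joint = Counter(zip(child_vals, parent_vals))
    marginal = Counter(parent_vals)
    return {(a, b): n / marginal[b] for (a, b), n in joint.items()}

# hypothetical usage: mle_cpt(flu_column, season_column)[(1, 'winter')]
#   = Count(Flu = 1, Season = winter) / Count(Season = winter)
```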

  6. Plan for today • (Finish) BN Parameter Learning – Parameter Sharing – Plate notation • (Start) BN Structure Learning – Log-likelihood score – Decomposability – Information never hurts (C) Dhruv Batra 6

  7. Meta BN • Explicitly showing parameters as variables • Example on board – One variable X; parameter θ X – Two variables X,Y; parameters θ X , θ Y|X (C) Dhruv Batra 7

  8. Global parameter independence • Global parameter independence: – All CPT parameters are independent – Prior over parameters is the product of priors over the individual CPTs [Figure: BN over Flu, Allergy, Sinus, Nose, Headache] • Proposition: For fully observable data D, if the prior satisfies global parameter independence, then the posterior over the parameters factorizes the same way, so each CPT's posterior can be computed independently (see the note below)
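One way to write the slide's statement (notation mine, following KF): the prior factorizes over CPTs, and with fully observed data the posterior then factorizes the same way:

```latex
p(\theta_G) \;=\; \prod_{i} p\!\left(\theta_{X_i \mid \mathrm{Pa}_{X_i}}\right)
\quad\Longrightarrow\quad
p(\theta_G \mid D) \;=\; \prod_{i} p\!\left(\theta_{X_i \mid \mathrm{Pa}_{X_i}} \mid D\right)
```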

  9. Parameter Sharing • What if X 1 , … , X n are n random variables for coin tosses of the same coin? (C) Dhruv Batra 9

  10. Naïve Bayes vs Bag-of-Words • What’s the difference? • Parameter sharing! (C) Dhruv Batra 10

  11. Text classification • Classify e-mails – Y = {Spam, NotSpam} • What about the features X? – Xi represents the i-th word in the document; i = 1 to doc-length – Xi takes values in the vocabulary (e.g., 10,000 words) (C) Dhruv Batra 11

  12. Bag of Words • Position in document doesn’t matter : P(X i =x i |Y=y) = P(X k =x i |Y=y) – Order of words on the page ignored – Parameter sharing When the lecture is over, remember to wake up the person sitting next to you in the lecture room. (C) Dhruv Batra Slide Credit: Carlos Guestrin 12

  13. Bag of Words • Position in document doesn’t matter : P(X i =x i |Y=y) = P(X k =x i |Y=y) – Order of words on the page ignored – Parameter sharing in is lecture lecture next over person remember room sitting the the the to to up wake when you (C) Dhruv Batra Slide Credit: Carlos Guestrin 13
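A minimal sketch of what parameter sharing buys in the bag-of-words model: one class-conditional word distribution is estimated from all positions pooled together. The names and the Laplace smoothing are my own choices, not the course's code:

```python
from collections import Counter

def shared_word_probs(documents, labels, vocab):
    """documents: list of token lists; labels: parallel list of class labels.
    Returns P(word | y) with add-one (Laplace) smoothing; word position is
    ignored, so every position shares the same per-class distribution."""
    counts = {}
    for doc, y in zip(documents, labels):
        counts.setdefault(y, Counter()).update(doc)  # position discarded here
    probs = {}
    for y, c in counts.items():
        total = sum(c.values()) + len(vocab)
        probs[y] = {w: (c[w] + 1) / total for w in vocab}
    return probs
```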

  14. HMMs semantics: Details • Hidden states X1, …, X5, each taking values in {a, …, z}; observations O1, …, O5 • Just 3 distributions: the initial distribution P(X1), the transition model P(Xi | Xi-1), and the observation model P(Oi | Xi) (C) Dhruv Batra Slide Credit: Carlos Guestrin 14
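For reference, the joint distribution those three tables define, in standard HMM form:

```latex
P(X_{1:T}, O_{1:T}) \;=\; P(X_1)\,\prod_{t=2}^{T} P(X_t \mid X_{t-1})\,\prod_{t=1}^{T} P(O_t \mid X_t)
```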

  15. N-grams • Learnt from Darwin's On the Origin of Species [Figure: unigram and bigram character distributions over the 27 symbols {_, a, …, z}; e.g., unigram frequencies P(_) ≈ 0.161, P(e) ≈ 0.111, P(t) ≈ 0.076] (C) Dhruv Batra Image Credit: Kevin Murphy 15
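A small sketch (mine, not from the slides) of how such unigram and bigram character tables can be estimated from raw text, mapping every non-letter to '_' as in the figure:

```python
from collections import Counter

def char_ngrams(text):
    """Estimate unigram P(c) and bigram P(c' | c) character distributions."""
    chars = [ch if ch.isalpha() else '_' for ch in text.lower()]
    unigrams = Counter(chars)
    bigrams = Counter(zip(chars, chars[1:]))
    total = sum(unigrams.values())
    # count how often each character appears as the first element of a bigram
    first = Counter()
    for (a, _b), n in bigrams.items():
        first[a] += n
    p_uni = {c: n / total for c, n in unigrams.items()}
    p_bi = {(a, b): n / first[a] for (a, b), n in bigrams.items()}
    return p_uni, p_bi
```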

  16. Plate Notation • X 1 , … , X n are n random variables for coin tosses of the same coin • Plate denotes replication (C) Dhruv Batra 16

  17. Plate Notation • Plates denote replication of random variables [Figure: Y and Xj, with a plate indicating replication (D)] (C) Dhruv Batra 17

  18. Hierarchical Bayesian Models • Why stop with a single prior? (C) Dhruv Batra 18

  19. BN: Parameter Learning: What you need to know • Parameter Learning – MLE • Decomposes; results in counting procedure • Will shatter dataset if too many parents – Bayesian Estimation • Conjugate priors • Priors = regularization (also viewed as smoothing) • Hierarchical priors – Plate notation – Shared parameters (C) Dhruv Batra 19
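A one-function sketch of the "priors = regularization / smoothing" point in the summary above, assuming a symmetric Dirichlet prior on each CPT row (names mine):

```python
def bayesian_cpt_estimate(count_a_and_b, count_b, num_values_of_x, alpha=1.0):
    """Posterior-mean estimate of P(X = a | Pa = b) under a symmetric
    Dirichlet(alpha) prior on the CPT row for Pa = b; alpha -> 0 recovers
    the MLE counting rule, larger alpha smooths the counts more."""
    return (count_a_and_b + alpha) / (count_b + alpha * num_values_of_x)
```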

  20. Learning Bayes nets
                              Known structure    Unknown structure
     Fully observable data    Very easy          Hard
     Missing data             Somewhat easy      Very very hard (EM)
     Data x(1), …, x(m) → learn structure + parameters (CPTs P(Xi | PaXi))
     (C) Dhruv Batra Slide Credit: Carlos Guestrin 20

  21. Goals of Structure Learning • Prediction – Care about a good structure because presumably it will lead to good predictions • Discovery – I want to understand some system [Figure: data x(1), …, x(m) → structure + parameters (CPTs P(Xi | PaXi))] (C) Dhruv Batra 21

  22. Types of Errors • Truth: [BN over Flu, Allergy, Sinus, Nose, Headache] • Recovered: [two candidate structures over the same variables] (C) Dhruv Batra 22

  23. Learning the structure of a BN • Constraint-based approach – Test conditional independencies in data – Find an I-map • Score-based approach – Finding a structure and parameters is a density estimation task – Evaluate model as we evaluated parameters: maximum likelihood, Bayesian, etc. [Figure: data <x1(1), …, xn(1)>, …, <x1(m), …, xn(m)> → learn structure and parameters of a BN over Flu, Allergy, Sinus, Nose, Headache] (C) Dhruv Batra Slide Credit: Carlos Guestrin 23

  24. Score-based approach [Figure: from data <x1(1), …, xn(1)>, …, <x1(m), …, xn(m)>, consider possible structures over Flu, Allergy, Sinus, Nose, Headache; for each candidate, learn parameters and score the structure (scores shown: -52, -60, -500)] (C) Dhruv Batra Slide Credit: Carlos Guestrin 24

  25. How many graphs? • N vertices. • How many (undirected) graphs? • How many (undirected) trees? (C) Dhruv Batra 25
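For reference, the standard counts (the tree count is Cayley's formula):

```latex
\#\{\text{undirected graphs on } N \text{ labeled vertices}\} = 2^{\binom{N}{2}},
\qquad
\#\{\text{labeled trees}\} = N^{\,N-2}
```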

  26. What’s a good score? • Score(G) = log-likelihood(G : D, θ MLE ) (C) Dhruv Batra 26

  27. Information-theoretic interpretation of Maximum Likelihood Score • Consider two node graph – Derived on board (C) Dhruv Batra 27

  28. Information-theoretic interpretation of Maximum Likelihood Score • For a general graph G Flu Allergy Sinus Nose Headache (C) Dhruv Batra Slide Credit: Carlos Guestrin 28
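The decomposition the slide refers to can be written as follows (cf. KF 18.3), with \hat{P} the empirical distribution over the m samples in D:

```latex
\mathrm{score}_{\mathrm{ML}}(G : D)
  \;=\; m \sum_{i} \hat{I}\big(X_i ;\, \mathrm{Pa}^{G}_{X_i}\big)
  \;-\; m \sum_{i} \hat{H}(X_i)
```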

  29. Information-theoretic interpretation of Maximum Likelihood Score [Figure: BN over Flu, Allergy, Sinus, Nose, Headache] • Implications: – Intuitive: higher mutual information → higher score – Decomposes over families in the BN (a node and its parents) – Same score for I-equivalent structures! – Information never hurts! (C) Dhruv Batra 29

  30. Chow-Liu tree learning algorithm 1 • For each pair of variables Xi, Xj – Compute the empirical distribution: P̂(xi, xj) = Count(xi, xj) / m – Compute the mutual information: Î(Xi, Xj) = Σ_{xi, xj} P̂(xi, xj) log [ P̂(xi, xj) / (P̂(xi) P̂(xj)) ] • Define a graph – Nodes X1, …, Xn – Edge (i, j) gets weight Î(Xi, Xj) (C) Dhruv Batra Slide Credit: Carlos Guestrin 30

  31. Chow-Liu tree learning algorithm 2 • Optimal tree BN – Compute maximum weight spanning tree – Directions in BN: pick any node as root, and direct edges away from root • breadth-first-search defines directions (C) Dhruv Batra Slide Credit: Carlos Guestrin 31
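A compact sketch of the two slides above, assuming networkx is available for the spanning tree and BFS (any MST routine would do); the names are mine and this is not the course's reference implementation:

```python
import numpy as np
import networkx as nx

def mutual_information(data, i, j):
    """Empirical MI between discrete columns i and j of an (m x n) integer array."""
    xi, xj = data[:, i], data[:, j]
    mi = 0.0
    for a in np.unique(xi):
        for b in np.unique(xj):
            p_ab = np.mean((xi == a) & (xj == b))
            p_a, p_b = np.mean(xi == a), np.mean(xj == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(data, root=0):
    """Return the Chow-Liu tree as a list of directed edges (parent, child)."""
    n = data.shape[1]
    g = nx.Graph()
    for i in range(n):
        for j in range(i + 1, n):
            # negate MI so a minimum spanning tree maximizes total mutual information
            g.add_edge(i, j, weight=-mutual_information(data, i, j))
    mst = nx.minimum_spanning_tree(g)
    # pick a root and direct edges away from it via breadth-first search
    return list(nx.bfs_edges(mst, root))
```

For the HW's TAN variant, the same skeleton applies with the edge weight swapped for conditional mutual information, as on the next slide.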

  32. Can we extend Chow-Liu? • Tree augmented naïve Bayes (TAN) [Friedman et al. '97] – The naïve Bayes model overcounts evidence, because correlations between features are not considered – Same as Chow-Liu, but score edges with the conditional mutual information Î(Xi, Xj | Y) (see the sketch below) (C) Dhruv Batra Slide Credit: Carlos Guestrin 32
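A possible drop-in replacement for mutual_information in the sketch above, scoring edges by the empirical conditional mutual information Î(Xi, Xj | Y); again a sketch under my own naming, not the course's reference code:

```python
import numpy as np

def conditional_mutual_information(data, i, j, labels):
    """Empirical I(X_i; X_j | Y) for discrete columns of an (m x n) integer
    array, with labels an array of m class values for Y."""
    cmi = 0.0
    for c in np.unique(labels):
        mask = (labels == c)
        p_c = np.mean(mask)                       # P(Y = c)
        xi, xj = data[mask, i], data[mask, j]     # rows with Y = c
        for a in np.unique(xi):
            for b in np.unique(xj):
                p_ab = np.mean((xi == a) & (xj == b))
                p_a, p_b = np.mean(xi == a), np.mean(xj == b)
                if p_ab > 0:
                    cmi += p_c * p_ab * np.log(p_ab / (p_a * p_b))
    return cmi
```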
