CS 391L: Machine Learning: Bayesian Learning: Naïve Bayes
Raymond J. Mooney
University of Texas at Austin

Axioms of Probability Theory
• All probabilities are between 0 and 1: 0 ≤ P(A) ≤ 1
• A true proposition has probability 1, a false proposition has probability 0: P(true) = 1, P(false) = 0
• The probability of a disjunction is: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

Conditional Probability
• P(A | B) is the probability of A given B.
• Assumes that B is all and only the information known.
• Defined by: P(A | B) = P(A ∧ B) / P(B)

Independence
• A and B are independent iff:
    P(A | B) = P(A)
    P(B | A) = P(B)
  (These two constraints are logically equivalent.)
• Therefore, if A and B are independent:
    P(A | B) = P(A ∧ B) / P(B) = P(A)
    P(A ∧ B) = P(A) P(B)

Joint Distribution
• The joint probability distribution for a set of random variables X_1, …, X_n gives the probability of every combination of values (an n-dimensional array with v^n values if all variables are discrete with v values; all v^n values must sum to 1): P(X_1, …, X_n)

    positive        circle   square        negative        circle   square
    red             0.20     0.02          red             0.05     0.30
    blue            0.02     0.01          blue            0.20     0.20

• The probability of any conjunction (an assignment of values to some subset of the variables) can be calculated by summing the appropriate subset of values from the joint distribution:
    P(red ∧ circle) = 0.20 + 0.05 = 0.25
    P(red) = 0.20 + 0.02 + 0.05 + 0.30 = 0.57
• Therefore, all conditional probabilities can also be calculated:
    P(positive | red ∧ circle) = P(positive ∧ red ∧ circle) / P(red ∧ circle) = 0.20 / 0.25 = 0.80
  (A code sketch of these calculations appears after this slide group.)

Probabilistic Classification
• Let Y be the random variable for the class, which takes values {y_1, y_2, …, y_m}.
• Let X be the random variable describing an instance consisting of a vector of values for n features <X_1, X_2, …, X_n>; let x_k be a possible value for X and x_ij a possible value for X_i.
• For classification, we need to compute P(Y = y_i | X = x_k) for i = 1 … m.
• However, given no other assumptions, this requires a table giving the probability of each category for each possible instance in the instance space, which is impossible to estimate accurately from a reasonably-sized training set.
  – Assuming Y and all X_i are binary, we need 2^n entries to specify P(Y = pos | X = x_k) for each of the 2^n possible x_k's, since P(Y = neg | X = x_k) = 1 − P(Y = pos | X = x_k).
  – Compare this to 2^(n+1) − 1 entries for the full joint distribution P(Y, X_1, X_2, …, X_n).
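To make the Joint Distribution calculations concrete, here is a minimal Python sketch (not part of the original slides; the dictionary layout and the helper name prob are my own) that stores the slide's joint distribution and recovers marginal and conditional probabilities by summing entries, exactly as described above.

```python
# Joint distribution from the slide, stored as a dict keyed by
# (category, color, shape). All eight entries sum to 1.
joint = {
    ("positive", "red",  "circle"): 0.20,
    ("positive", "red",  "square"): 0.02,
    ("positive", "blue", "circle"): 0.02,
    ("positive", "blue", "square"): 0.01,
    ("negative", "red",  "circle"): 0.05,
    ("negative", "red",  "square"): 0.30,
    ("negative", "blue", "circle"): 0.20,
    ("negative", "blue", "square"): 0.20,
}

NAMES = ("category", "color", "shape")

def prob(**conditions):
    """P(conjunction): sum all joint entries consistent with the given values."""
    total = 0.0
    for values, p in joint.items():
        assignment = dict(zip(NAMES, values))
        if all(assignment[name] == value for name, value in conditions.items()):
            total += p
    return total

p_red_circle = prob(color="red", shape="circle")        # 0.20 + 0.05 = 0.25
p_red = prob(color="red")                               # 0.57
p_pos_given = prob(category="positive", color="red", shape="circle") / p_red_circle
print(round(p_red_circle, 2), round(p_red, 2), round(p_pos_given, 2))   # 0.25 0.57 0.8
```

The same summing trick gives any marginal or conditional probability, which is the point of the slide: the joint distribution determines everything, but it is far too large to estimate directly for many features.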

Bayes Theorem

    P(H | E) = P(E | H) P(H) / P(E)

Simple proof from the definition of conditional probability:
    P(H | E) = P(H ∧ E) / P(E)        (def. cond. prob.)
    P(E | H) = P(H ∧ E) / P(H)        (def. cond. prob.)
    Thus P(H ∧ E) = P(E | H) P(H)
    QED: P(H | E) = P(E | H) P(H) / P(E)

Bayesian Categorization
• Determine the category of x_k by determining, for each y_i:
    P(Y = y_i | X = x_k) = P(Y = y_i) P(X = x_k | Y = y_i) / P(X = x_k)
• P(X = x_k) can be determined since the categories are complete and disjoint:
    ∑_{i=1}^{m} P(Y = y_i | X = x_k) = ∑_{i=1}^{m} P(Y = y_i) P(X = x_k | Y = y_i) / P(X = x_k) = 1
    P(X = x_k) = ∑_{i=1}^{m} P(Y = y_i) P(X = x_k | Y = y_i)
  (The first code sketch after this slide group shows this normalization directly.)

Bayesian Categorization (cont.)
• Need to know:
  – Priors: P(Y = y_i)
  – Conditionals: P(X = x_k | Y = y_i)
• The P(Y = y_i) are easily estimated from data:
  – If n_i of the examples in D are in y_i, then P(Y = y_i) = n_i / |D|.
• There are too many possible instances (e.g. 2^n for binary features) to estimate all P(X = x_k | Y = y_i).
• We still need to make some sort of independence assumption about the features to make learning tractable.

Generative Probabilistic Models
• Assume a simple (usually unrealistic) probabilistic method by which the data was generated.
• For categorization, each category has a different parameterized generative model that characterizes that category.
• Training: Use the data for each category to estimate the parameters of the generative model for that category.
  – Maximum Likelihood Estimation (MLE): Set the parameters to maximize the probability that the model produced the given training data.
  – If M_λ denotes a model with parameter values λ and D_k is the training data for the k-th class, find the model parameters for class k (λ_k) that maximize the likelihood of D_k:
      λ_k = argmax_λ P(D_k | M_λ)
    (The second code sketch below gives a tiny MLE example.)
• Testing: Use Bayesian analysis to determine the category model that most likely generated a specific test instance.

Naïve Bayes Generative Model
[Figure: for each category (Positive, Negative), the model first selects the category and then independently draws a value for each feature — Size (lg, med, sm), Color (red, blue, grn), Shape (circ, sqr, tri) — from that category's own feature distributions.]

Naïve Bayes Inference Problem
[Figure: given a test instance such as <lg, red, circ> with unknown category (??), the generative model is run "in reverse": the category whose feature distributions most probably generated the observed size, color, and shape is chosen.]
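The normalization used in the Bayesian Categorization slide can be written out directly. The sketch below is my own illustration (the function name categorize and the example numbers are assumptions, not from the slides): given priors and class-conditional probabilities for one instance, it recovers P(X = x_k) by summing over the complete, disjoint categories and returns the posterior for each class.

```python
# Minimal sketch: Bayesian categorization from priors P(Y=y) and
# class-conditional probabilities P(X=x_k | Y=y) for a single instance x_k.

def categorize(priors, likelihoods):
    """priors: {y: P(Y=y)}; likelihoods: {y: P(X=x_k | Y=y)} for one instance."""
    joint = {y: priors[y] * likelihoods[y] for y in priors}   # P(Y=y) P(X=x_k | Y=y)
    p_x = sum(joint.values())                                 # P(X=x_k), categories complete and disjoint
    posteriors = {y: joint[y] / p_x for y in joint}
    return max(posteriors, key=posteriors.get), posteriors

# Hypothetical numbers, chosen only to illustrate the calculation
label, post = categorize(priors={"pos": 0.5, "neg": 0.5},
                         likelihoods={"pos": 0.081, "neg": 0.018})
print(label, post)   # pos  {'pos': 0.8181..., 'neg': 0.1818...}
```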

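The MLE bullet in the Generative Probabilistic Models slide can also be illustrated on a single Bernoulli feature. The sketch below is my own illustration (the data in data_k is made up): it finds the parameter maximizing the likelihood of one class's training data by brute-force grid search, and the answer is the observed frequency, which is why the frequency-based estimates on the later slides are maximum-likelihood estimates.

```python
# One Bernoulli feature of one class's generative model. The training data
# D_k for the class is hypothetical.
data_k = [True, True, True, False, True]

def likelihood(theta, data):
    """P(data | theta) for independent Bernoulli draws with P(true) = theta."""
    p = 1.0
    for x in data:
        p *= theta if x else (1.0 - theta)
    return p

# Brute-force MLE over a grid of candidate parameter values:
# lambda_k = argmax_theta P(D_k | theta)
grid = [i / 1000 for i in range(1001)]
theta_mle = max(grid, key=lambda theta: likelihood(theta, data_k))
print(theta_mle)   # 0.8, the observed frequency of True in D_k (4 out of 5)
```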
Naïve Bayesian Categorization
• If we assume the features of an instance are independent given the category (conditionally independent), then:
    P(X | Y) = P(X_1, X_2, …, X_n | Y) = ∏_{i=1}^{n} P(X_i | Y)
• Therefore, we only need to know P(X_i | Y) for each possible pair of a feature value and a category.
• If Y and all X_i are binary, this requires specifying only 2n parameters:
  – P(X_i = true | Y = true) and P(X_i = true | Y = false) for each X_i
  – P(X_i = false | Y) = 1 − P(X_i = true | Y)
• Compare this to specifying 2^n parameters without any independence assumptions.

Naïve Bayes Example

Test instance: <medium, red, circle>

    Probability       positive   negative
    P(Y)              0.5        0.5
    P(small | Y)      0.4        0.4
    P(medium | Y)     0.1        0.2
    P(large | Y)      0.5        0.4
    P(red | Y)        0.9        0.3
    P(blue | Y)       0.05       0.3
    P(green | Y)      0.05       0.4
    P(square | Y)     0.05       0.4
    P(triangle | Y)   0.05       0.3
    P(circle | Y)     0.9        0.3

Naïve Bayes Example (cont.)

Only the rows matching the test instance <medium, red, circle> are needed:

    Probability       positive   negative
    P(Y)              0.5        0.5
    P(medium | Y)     0.1        0.2
    P(red | Y)        0.9        0.3
    P(circle | Y)     0.9        0.3

P(positive | X) = P(positive) P(medium | positive) P(red | positive) P(circle | positive) / P(X)
                = 0.5 × 0.1 × 0.9 × 0.9 / P(X) = 0.0405 / P(X) = 0.0405 / 0.0495 = 0.8181
P(negative | X) = P(negative) P(medium | negative) P(red | negative) P(circle | negative) / P(X)
                = 0.5 × 0.2 × 0.3 × 0.3 / P(X) = 0.009 / P(X) = 0.009 / 0.0495 = 0.1818

Since P(positive | X) + P(negative | X) = 0.0405 / P(X) + 0.009 / P(X) = 1,
    P(X) = 0.0405 + 0.009 = 0.0495
(A code sketch reproducing this calculation follows this slide group.)

Estimating Probabilities
• Normally, probabilities are estimated from observed frequencies in the training data.
• If D contains n_k examples in category y_k, and n_ijk of these n_k examples have the j-th value x_ij for feature X_i, then:
    P(X_i = x_ij | Y = y_k) = n_ijk / n_k
• However, estimating such probabilities from small training sets is error-prone.
• If, due only to chance, a rare feature X_i is always false in the training data, then ∀ y_k: P(X_i = true | Y = y_k) = 0.
• If X_i = true subsequently occurs in a test example X, the result is that ∀ y_k: P(X | Y = y_k) = 0 and therefore ∀ y_k: P(Y = y_k | X) = 0.

Probability Estimation Example

    Ex   Size    Color   Shape      Category
    1    small   red     circle     positive
    2    large   red     circle     positive
    3    small   red     triangle   negative
    4    large   blue    circle     negative

    Probability       positive   negative
    P(Y)              0.5        0.5
    P(small | Y)      0.5        0.5
    P(medium | Y)     0.0        0.0
    P(large | Y)      0.5        0.5
    P(red | Y)        1.0        0.5
    P(blue | Y)       0.0        0.5
    P(green | Y)      0.0        0.0
    P(square | Y)     0.0        0.0
    P(triangle | Y)   0.0        0.5
    P(circle | Y)     1.0        0.5

Test instance X: <medium, red, circle>
    P(positive | X) = 0.5 × 0.0 × 1.0 × 1.0 / P(X) = 0
    P(negative | X) = 0.5 × 0.0 × 0.5 × 0.5 / P(X) = 0

Smoothing
• To account for estimation from small samples, probability estimates are adjusted, or smoothed.
• Laplace smoothing using an m-estimate assumes that each feature is given a prior probability p that is assumed to have been previously observed in a "virtual" sample of size m:
    P(X_i = x_ij | Y = y_k) = (n_ijk + m p) / (n_k + m)
  (A sketch applying this estimate to the example above appears below.)
• For binary features, p is simply assumed to be 0.5.
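For readers who want to check the Naïve Bayes Example arithmetic, here is a minimal Python sketch (not part of the original slides; the dictionary layout and the function name posterior are my own) that encodes the slide's conditional probability tables and reproduces the 0.8181 / 0.1818 posterior for <medium, red, circle>.

```python
prior = {"positive": 0.5, "negative": 0.5}

cond = {  # P(feature value | class), copied from the example slide
    "positive": {"small": 0.4, "medium": 0.1, "large": 0.5,
                 "red": 0.9, "blue": 0.05, "green": 0.05,
                 "square": 0.05, "triangle": 0.05, "circle": 0.9},
    "negative": {"small": 0.4, "medium": 0.2, "large": 0.4,
                 "red": 0.3, "blue": 0.3, "green": 0.4,
                 "square": 0.4, "triangle": 0.3, "circle": 0.3},
}

def posterior(instance):
    """P(Y=y | X=instance) under the conditional-independence assumption."""
    scores = {}
    for y in prior:
        p = prior[y]
        for value in instance:        # product of per-feature conditionals
            p *= cond[y][value]
        scores[y] = p                 # P(Y=y) * P(X=instance | Y=y)
    p_x = sum(scores.values())        # P(X=instance)
    return {y: scores[y] / p_x for y in scores}

print(posterior(["medium", "red", "circle"]))
# {'positive': 0.818..., 'negative': 0.181...}
```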

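The m-estimate from the Smoothing slide can be applied directly to the four-example training set in the Probability Estimation Example. The sketch below is my own illustration (the helper name estimate and the choice p = 1/3 for a three-valued feature are assumptions; the slide's p = 0.5 applies to binary features). It shows that the raw frequency estimate for P(medium | positive) is 0, which would zero out any test instance containing "medium", while the smoothed estimate is not.

```python
examples = [  # (size, color, shape, category), from the Probability Estimation Example
    ("small", "red",  "circle",   "positive"),
    ("large", "red",  "circle",   "positive"),
    ("small", "red",  "triangle", "negative"),
    ("large", "blue", "circle",   "negative"),
]

def estimate(value, feature_index, category, m=0.0, p=0.0):
    """(n_ijk + m*p) / (n_k + m); with m = 0 this is the raw frequency n_ijk / n_k."""
    in_class = [ex for ex in examples if ex[3] == category]
    n_k = len(in_class)
    n_ijk = sum(1 for ex in in_class if ex[feature_index] == value)
    return (n_ijk + m * p) / (n_k + m)

# Raw frequency estimate: P(medium | positive) = 0/2 = 0
print(estimate("medium", 0, "positive"))                 # 0.0
# m-estimate with a "virtual" sample of size m = 1 and prior p = 1/3
print(estimate("medium", 0, "positive", m=1.0, p=1/3))   # (0 + 1/3) / (2 + 1) ≈ 0.111
```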