
1

CS 391L: Machine Learning: Bayesian Learning: Naïve Bayes

Raymond J. Mooney

University of Texas at Austin

2

Axioms of Probability Theory

  • All probabilities are between 0 and 1: 0 ≤ P(A) ≤ 1
  • A true proposition has probability 1 and a false proposition has probability 0: P(true) = 1, P(false) = 0.

  • The probability of disjunction is:

P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

[Venn diagram: events A and B, with their overlap A ∧ B]

3

Conditional Probability

  • P(A | B) is the probability of A given B.
  • Assumes that B is all and only the information known.

  • Defined by:

P(A | B) = P(A ∧ B) / P(B)

[Venn diagram: events A and B, with their overlap A ∧ B]

4

Independence

  • A and B are independent iff:

    P(A | B) = P(A)
    P(B | A) = P(B)

    (These two constraints are logically equivalent.)

  • Therefore, if A and B are independent:

    P(A | B) = P(A ∧ B) / P(B) = P(A)

    P(A ∧ B) = P(A) P(B)

5

Joint Distribution

  • The joint probability distribution for a set of random variables X1,…,Xn gives the probability of every combination of values: P(X1,…,Xn). If all variables are discrete with v values each, this is an n-dimensional array with v^n entries, and all v^n entries must sum to 1.

  • The probability of all possible conjunctions (assignments of values to some subset of variables) can be calculated by summing the appropriate subset of values from the joint distribution.

  • Therefore, all conditional probabilities can also be calculated.

positive:
             circle   square
    red       0.20     0.02
    blue      0.02     0.01

negative:
             circle   square
    red       0.05     0.30
    blue      0.20     0.20

P(red ∧ circle) = 0.20 + 0.05 = 0.25

P(positive | red ∧ circle) = P(positive ∧ red ∧ circle) / P(red ∧ circle) = 0.20 / 0.25 = 0.80

P(red) = 0.20 + 0.02 + 0.05 + 0.30 = 0.57
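The calculations above are just sums over cells of the joint table; a minimal Python sketch (the dictionary layout and names are mine, not from the slides):

```python
# Joint distribution P(category, color, shape) from the two tables above.
joint = {
    ("positive", "red",  "circle"): 0.20, ("positive", "red",  "square"): 0.02,
    ("positive", "blue", "circle"): 0.02, ("positive", "blue", "square"): 0.01,
    ("negative", "red",  "circle"): 0.05, ("negative", "red",  "square"): 0.30,
    ("negative", "blue", "circle"): 0.20, ("negative", "blue", "square"): 0.20,
}

# Any conjunction is the sum of the matching cells of the joint table.
p_red_circle = sum(p for (cat, col, shp), p in joint.items()
                   if col == "red" and shp == "circle")                 # ≈ 0.25
p_red = sum(p for (cat, col, shp), p in joint.items() if col == "red")  # ≈ 0.57

# Conditionals follow from the definition P(A | B) = P(A ∧ B) / P(B).
p_pos_given_red_circle = joint[("positive", "red", "circle")] / p_red_circle  # ≈ 0.80
print(p_red_circle, p_red, p_pos_given_red_circle)
```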

6

Probabilistic Classification

  • Let Y be the random variable for the class, which takes values {y1, y2, …, ym}.
  • Let X be the random variable describing an instance consisting of a vector of values for n features <X1, X2, …, Xn>; let xk be a possible value for X and xij a possible value for Xi.

  • For classification, we need to compute P(Y=yi | X=xk) for i = 1…m.
  • However, given no other assumptions, this requires a table giving the probability of each category for each possible instance in the instance space, which is impossible to estimate accurately from a reasonably-sized training set.

    – Assuming Y and all Xi are binary, we need 2^n entries to specify P(Y=pos | X=xk) for each of the 2^n possible xk's, since P(Y=neg | X=xk) = 1 – P(Y=pos | X=xk).
    – Compare this to the 2^(n+1) – 1 entries needed for the joint distribution P(Y, X1, X2, …, Xn).


7

Bayes Theorem

Simple proof from definition of conditional probability:

P(H | E) = P(H ∧ E) / P(E)        (def. cond. prob.)
P(E | H) = P(H ∧ E) / P(H)        (def. cond. prob.)
⇒  P(H ∧ E) = P(E | H) P(H)

QED:   P(H | E) = P(E | H) P(H) / P(E)
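As a quick sanity check (not on the original slide), the theorem can be verified numerically with the joint-distribution figures from slide 5, taking H = positive and E = red ∧ circle:

```python
# Figures from the joint-distribution example on slide 5.
p_h = 0.25                   # P(H) = P(positive) = 0.20 + 0.02 + 0.02 + 0.01
p_e = 0.25                   # P(E) = P(red ∧ circle) = 0.20 + 0.05
p_e_given_h = 0.20 / 0.25    # P(E | H) = P(positive ∧ red ∧ circle) / P(positive)

# Bayes theorem: P(H | E) = P(E | H) P(H) / P(E)
print(p_e_given_h * p_h / p_e)   # ≈ 0.8, matching P(positive | red ∧ circle) on slide 5
```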

8

Bayesian Categorization

  • Determine the category of xk by computing, for each yi:
  • P(X=xk) can be determined since the categories are complete and disjoint:

P(Y=yi | X=xk) = P(Y=yi) P(X=xk | Y=yi) / P(X=xk)

Σi=1..m P(Y=yi | X=xk) = Σi=1..m P(Y=yi) P(X=xk | Y=yi) / P(X=xk) = 1

P(X=xk) = Σi=1..m P(Y=yi) P(X=xk | Y=yi)
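A minimal sketch of this normalization, assuming the priors P(Y=yi) and likelihoods P(X=xk | Y=yi) are already available as Python dictionaries (names and numbers here are illustrative):

```python
def categorize(priors, likelihoods):
    """priors[y] = P(Y=y); likelihoods[y] = P(X=xk | Y=y) for one fixed instance xk.
    Returns the posterior P(Y=y | X=xk) for every category y."""
    p_x = sum(priors[y] * likelihoods[y] for y in priors)   # P(X=xk), the normalizer
    return {y: priors[y] * likelihoods[y] / p_x for y in priors}

# Illustrative two-class call; the returned posteriors always sum to 1.
print(categorize({"pos": 0.5, "neg": 0.5}, {"pos": 0.081, "neg": 0.018}))
# {'pos': ≈0.818, 'neg': ≈0.182}
```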

9

Bayesian Categorization (cont.)

  • Need to know:
    – Priors: P(Y=yi)
    – Conditionals: P(X=xk | Y=yi)
  • P(Y=yi) are easily estimated from data.
    – If ni of the examples in D are in yi, then P(Y=yi) = ni / |D|

  • Too many possible instances (e.g. 2^n for binary features) to estimate all P(X=xk | Y=yi).

  • Still need to make some sort of independence assumptions about the features to make learning tractable.

10

Generative Probabilistic Models

  • Assume a simple (usually unrealistic) probabilistic method by which the data was generated.
  • For categorization, each category has a different parameterized generative model that characterizes that category.
  • Training: Use the data for each category to estimate the parameters of the generative model for that category.
    – Maximum Likelihood Estimation (MLE): Set the parameters to maximize the probability that the model produced the given training data.
    – If Mλ denotes a model with parameter values λ and Dk is the training data for the kth class, find the model parameters for class k (λk) that maximize the likelihood of Dk:

      λk = argmax_λ P(Dk | Mλ)

  • Testing: Use Bayesian analysis to determine the category model that most likely generated a specific test instance.
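A minimal sketch of this train/test recipe, under the simplifying assumption that each class's generative model Mλ is just a categorical distribution over a single discrete feature fit by MLE (all names and data here are illustrative, not from the slides):

```python
from collections import Counter

def mle_fit(examples):
    """MLE for a categorical model: each parameter is the relative frequency
    of that value in the class's training data D_k."""
    counts = Counter(examples)
    return {value: n / len(examples) for value, n in counts.items()}

def likelihood(model, x):
    """P(x | M_lambda) under the fitted categorical model (0 for unseen values)."""
    return model.get(x, 0.0)

# Training: one model per category.  Testing: pick the category whose model
# (weighted by its prior) most likely generated the test instance.
models = {"pos": mle_fit(["red", "red", "blue"]), "neg": mle_fit(["blue", "green"])}
priors = {"pos": 0.5, "neg": 0.5}
print(max(models, key=lambda y: priors[y] * likelihood(models[y], "red")))  # "pos"
```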

11

Naïve Bayes Generative Model

[Figure: generative model. A Category node selects Positive or Negative; each category is drawn as a bag of training examples, each with Size (sm/med/lg), Color (red/blue/grn), and Shape (circ/sqr/tri) values.]

12

Naïve Bayes Inference Problem

[Figure: the same two bags of examples as on the previous slide; the inference problem is to decide which category most likely generated the test instance <lg, red, circ>.]

slide-3
SLIDE 3

3

13

Naïve Bayesian Categorization

  • If we assume the features of an instance are independent given the category (conditionally independent):

    P(X | Y) = P(X1, X2, …, Xn | Y) = Πi=1..n P(Xi | Y)

  • Therefore, we then only need to know P(Xi | Y) for each possible pair of a feature value and a category.
  • If Y and all Xi are binary, this requires specifying only 2n parameters:
    – P(Xi=true | Y=true) and P(Xi=true | Y=false) for each Xi
    – P(Xi=false | Y) = 1 – P(Xi=true | Y)
  • Compare this to specifying 2^n parameters without any independence assumptions.

14

Naïve Bayes Example

Probability        positive   negative
P(Y)                  0.5        0.5
P(small | Y)          0.4        0.4
P(medium | Y)         0.1        0.2
P(large | Y)          0.5        0.4
P(red | Y)            0.9        0.3
P(blue | Y)           0.05       0.3
P(green | Y)          0.05       0.4
P(square | Y)         0.05       0.4
P(triangle | Y)       0.05       0.3
P(circle | Y)         0.9        0.3

Test Instance: <medium, red, circle>

15

Naïve Bayes Example

Probability        positive   negative
P(Y)                  0.5        0.5
P(medium | Y)         0.1        0.2
P(red | Y)            0.9        0.3
P(circle | Y)         0.9        0.3

Test Instance: <medium, red, circle>

P(positive | X) = P(positive) * P(medium | positive) * P(red | positive) * P(circle | positive) / P(X)
                = 0.5 * 0.1 * 0.9 * 0.9 / P(X) = 0.0405 / P(X)

P(negative | X) = P(negative) * P(medium | negative) * P(red | negative) * P(circle | negative) / P(X)
                = 0.5 * 0.2 * 0.3 * 0.3 / P(X) = 0.009 / P(X)

Since P(positive | X) + P(negative | X) = 0.0405 / P(X) + 0.009 / P(X) = 1:
P(X) = 0.0405 + 0.009 = 0.0495

P(positive | X) = 0.0405 / 0.0495 = 0.8181
P(negative | X) = 0.009 / 0.0495 = 0.1818
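The arithmetic on this slide can be reproduced with a few lines of Python (a sketch; the conditional probabilities are copied from the table above, the variable names are mine):

```python
# Parameters from the table above (slide 14).
p_y = {"positive": 0.5, "negative": 0.5}
p_x_given_y = {
    "positive": {"medium": 0.1, "red": 0.9, "circle": 0.9},
    "negative": {"medium": 0.2, "red": 0.3, "circle": 0.3},
}

x = ("medium", "red", "circle")            # test instance

# Unnormalized posterior: P(Y=y) * product of P(Xi | Y=y) over the instance's values.
unnorm = {}
for y in p_y:
    score = p_y[y]
    for value in x:
        score *= p_x_given_y[y][value]
    unnorm[y] = score                      # positive: 0.0405, negative: 0.009

p_x = sum(unnorm.values())                 # P(X) = 0.0495
for y in unnorm:
    print(y, unnorm[y] / p_x)              # positive ≈ 0.818, negative ≈ 0.182
```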

16

Estimating Probabilities

  • Normally, probabilities are estimated based on observed frequencies in the training data.
  • If D contains nk examples in category yk, and nijk of these nk examples have the jth value for feature Xi, xij, then:

    P(Xi=xij | Y=yk) = nijk / nk

  • However, estimating such probabilities from small training sets is error-prone.
  • If, due only to chance, a rare feature Xi is always false in the training data, then ∀yk: P(Xi=true | Y=yk) = 0.
  • If Xi=true then occurs in a test example X, the result is that ∀yk: P(X | Y=yk) = 0, and therefore ∀yk: P(Y=yk | X) = 0.

17

Probability Estimation Example

Training examples:

Ex   Size    Color   Shape      Category
1    small   red     circle     positive
2    large   red     circle     positive
3    small   red     triangle   negative
4    large   blue    circle     negative

Estimated probabilities:

Probability        positive   negative
P(Y)                  0.5        0.5
P(small | Y)          0.5        0.5
P(medium | Y)         0.0        0.0
P(large | Y)          0.5        0.5
P(red | Y)            1.0        0.5
P(blue | Y)           0.0        0.5
P(green | Y)          0.0        0.0
P(square | Y)         0.0        0.0
P(triangle | Y)       0.0        0.5
P(circle | Y)         1.0        0.5

Test Instance X: <medium, red, circle>
P(positive | X) = 0.5 * 0.0 * 1.0 * 1.0 / P(X) = 0
P(negative | X) = 0.5 * 0.0 * 0.5 * 0.5 / P(X) = 0
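A minimal sketch of estimating P(Xi=xij | Y=yk) = nijk / nk by counting over the four training examples above (code and names are mine); it reproduces the zero estimates that make both posteriors vanish:

```python
# The four training examples above: (size, color, shape, category).
D = [
    ("small", "red",  "circle",   "positive"),
    ("large", "red",  "circle",   "positive"),
    ("small", "red",  "triangle", "negative"),
    ("large", "blue", "circle",   "negative"),
]

def estimate(feature_index, value, category):
    """Unsmoothed relative frequency: P(X_i = value | Y = category) = n_ijk / n_k."""
    in_class = [ex for ex in D if ex[3] == category]
    n_ijk = sum(1 for ex in in_class if ex[feature_index] == value)
    return n_ijk / len(in_class)

print(estimate(1, "red", "positive"))      # 1.0
print(estimate(0, "medium", "positive"))   # 0.0 -> zeroes out P(positive | X) above
print(estimate(2, "circle", "negative"))   # 0.5
```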

18

Smoothing

  • To account for estimation from small samples, probability estimates are adjusted or smoothed.
  • Laplace smoothing using an m-estimate assumes that each feature is given a prior probability, p, that is assumed to have been previously observed in a “virtual” sample of size m.

  • For binary features, p is simply assumed to be 0.5.

P(Xi=xij | Y=yk) = (nijk + m·p) / (nk + m)


19

Laplace Smoothing Example

  • Assume the training set contains 10 positive examples:
    – 4: small
    – 0: medium
    – 6: large
  • Estimate parameters as follows (if m = 1, p = 1/3):
    – P(small | positive) = (4 + 1/3) / (10 + 1) = 0.394
    – P(medium | positive) = (0 + 1/3) / (10 + 1) = 0.03
    – P(large | positive) = (6 + 1/3) / (10 + 1) = 0.576
    – P(small or medium or large | positive) = 1.0
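A minimal sketch of the m-estimate from the previous slide that reproduces the numbers above (the function name is mine):

```python
def m_estimate(n_ijk, n_k, p, m):
    """Smoothed estimate: P(X_i = x_ij | Y = y_k) = (n_ijk + m*p) / (n_k + m)."""
    return (n_ijk + m * p) / (n_k + m)

# 10 positive examples: 4 small, 0 medium, 6 large; m = 1, p = 1/3.
for value, count in [("small", 4), ("medium", 0), ("large", 6)]:
    print(value, m_estimate(count, 10, 1/3, 1))
# small ≈ 0.394, medium ≈ 0.030, large ≈ 0.576; the three estimates sum to 1.0
```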

20

Continuous Attributes

  • If Xi is a continuous feature rather than a discrete one, we need another way to calculate P(Xi | Y).
  • Assume that Xi has a Gaussian distribution whose mean and variance depend on Y.
  • During training, for each combination of a continuous feature Xi and a class value yk for Y, estimate a mean, µik, and a standard deviation, σik, based on the values of feature Xi in class yk in the training data.
  • During testing, estimate P(Xi | Y=yk) for a given example using the Gaussian distribution defined by µik and σik:

        − − = =

2 2

2 ) ( exp 2 1 ) | (

ik ik i ik k i

X y Y X P σ µ π σ
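A minimal sketch of the Gaussian estimate above, using only the standard library (the function name and the example numbers are mine):

```python
import math

def gaussian_conditional(x, mu_ik, sigma_ik):
    """P(X_i = x | Y = y_k) modeled with the class-specific mean mu_ik and
    standard deviation sigma_ik estimated during training."""
    coeff = 1.0 / (sigma_ik * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu_ik) ** 2) / (2.0 * sigma_ik ** 2))

# Example: a feature value one standard deviation above the class mean.
print(gaussian_conditional(6.0, 5.0, 1.0))   # ≈ 0.242
```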

21

Comments on Naïve Bayes

  • Tends to work well despite its strong assumption of conditional independence.
  • Experiments show it to be quite competitive with other classification methods on standard UCI datasets.
  • Although it does not produce accurate probability estimates when its independence assumptions are violated, it may still pick the correct maximum-probability class in many cases.
    – Able to learn conjunctive concepts in any case
  • Does not perform any search of the hypothesis space; it directly constructs a hypothesis from parameter estimates that are easily calculated from the training data.
    – Strong bias
  • Not guaranteed to be consistent with the training data.
  • Typically handles noise well since it does not even focus on completely fitting the training data.