Lecture 5

Outline
- Bayes theorem
- Discrete Bayesian classifiers
  - Maximum likelihood classification
  - "Brute force" Bayesian learning
  - Naïve Bayes
- Bayesian Belief Networks

Bayes theorem
- P(c|x) = P(x|c) P(c) / P(x)
- P(c) – prior probability of class c: the expected proportion of data from class c at test time
- P(x) – prior probability of instance x: the probability that an instance with attribute vector x occurs
- P(c|x) – probability that an instance is of class c, given that it is described by attribute vector x
- P(x|c) – probability that an instance has the attributes described by x, given that it comes from class c

Maximum likelihood
- From the training data estimate, for all x and c_i:
  - P(c_i)
  - P(x|c_i)
- During classification:
  - Choose the class c_i with maximal P(c_i|x)
  - P(c_i|x) is also called the likelihood

'Brute force' Bayesian learning
- Instance x is described by attributes <a_1, ..., a_n>
- Most probable class:
  c = argmax_{c_i} P(c_i | a_1, ..., a_n)
    = argmax_{c_i} P(a_1, ..., a_n | c_i) P(c_i) / P(a_1, ..., a_n)
    = argmax_{c_i} P(a_1, ..., a_n | c_i) P(c_i)

Example
- Consider the data:

  Wind    Rain    Balloon
  Strong  Shower  No
  Strong  Shower  No
  Strong  None    No
  Weak    Shower  No
  Weak    None    No
  Weak    None    Yes
  Weak    None    Yes
  Weak    None    Yes

- We estimate:
  - P(No) = 62%
  - P(Yes) = 38%
  - P(<Weak, None> | No) = 20%
  - P(<Weak, None> | Yes) = 100%
  - P(<Weak, Shower> | No) = 20%
  - ...
- Classification of <Weak, None>:
  - L(No) ~ 0.62 × 0.2 = 0.12
  - L(Yes) ~ 0.38 × 1 = 0.38
  - Answer: Yes
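The brute-force estimate on the balloon example can be reproduced in a few lines of code. The following is a minimal Python sketch, not part of the lecture: the data layout and the brute_force_classify helper are illustrative. It counts how often each full attribute vector occurs within each class and picks the class maximizing P(x | c_i) P(c_i).

```python
from collections import Counter

# Training data from the balloon example: (Wind, Rain) -> Balloon
data = [
    (("Strong", "Shower"), "No"),
    (("Strong", "Shower"), "No"),
    (("Strong", "None"),   "No"),
    (("Weak",   "Shower"), "No"),
    (("Weak",   "None"),   "No"),
    (("Weak",   "None"),   "Yes"),
    (("Weak",   "None"),   "Yes"),
    (("Weak",   "None"),   "Yes"),
]

class_counts = Counter(c for _, c in data)        # n of examples per class
joint_counts = Counter((x, c) for x, c in data)   # n of examples with vector x and class c
n = len(data)

def brute_force_classify(x):
    """Return the class maximizing P(x | c_i) P(c_i), estimated directly from counts."""
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        p_c = n_c / n                               # P(c_i)
        p_x_given_c = joint_counts[(x, c)] / n_c    # P(x | c_i)
        score = p_x_given_c * p_c
        if score > best_score:
            best_class, best_score = c, score
    return best_class, best_score

print(brute_force_classify(("Weak", "None")))   # ('Yes', 0.375), matching L(Yes) ~ 0.38
```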

Problem with 'Brute force'
- Brute force: c = argmax_{c_i} P(a_1, ..., a_n | c_i) P(c_i)
- It cannot generalize to unseen examples x_new, because it has no estimates of P(c_i | x_new)
- It is therefore useless in practice
- Brute force does not have any bias
- So, in order to make learning possible, we have to introduce a bias

Naïve Bayes
- Naïve Bayes assumes that the attributes are independent for instances from a given class:
  P(a_1, ..., a_n | c_i) = ∏_j P(a_j | c_i)
- Which gives: c = argmax_{c_i} P(c_i) ∏_j P(a_j | c_i)
- The independence assumption is often violated, but Naïve Bayes works surprisingly well anyway

Example
- Recall the 'advanced ballooning' set:

  Sky     Temper.  Rain    Wind    Fly Balloon
  Sunny   Cold     None    Strong  Yes
  Cloudy  Cold     Shower  Weak    Yes
  Cloudy  Cold     Shower  Strong  No
  Sunny   Hot      Shower  Strong  No

- Classify: x = <Cloudy, Hot, Shower, Strong>
- P(Y|x) ~ P(Y) P(Cl|Y) P(H|Y) P(Sh|Y) P(St|Y) = 0.5 × 0.5 × 0 × 0.5 × 0.5 = 0
- P(N|x) ~ 0.5 × 0.5 × 0.5 × 1 × 1 = 0.125

Missing estimates
- What if none of the training instances of class c_i have attribute value a_j? Then:
  - P(a_j | c_i) = 0, and
  - P(a_1, ..., a_n | c_i) = ∏_j P(a_j | c_i) = 0,
  - no matter what the values of the other attributes are
- For example:
  - x = <Sunny, Hot, None, Weak>
  - P(Hot|Yes) = 0, hence
  - P(Yes|x) = 0

Solution
- Let m denote the number of possible values of attribute a_j
- For each class, consider adding m "virtual examples", one for each value of a_j
- The Bayesian estimate for P(a_j | c_i) becomes:
  P(a_j | c_i) = (n_ciaj + 1) / (n_ci + m)
- Where:
  - n_ci – number of training examples of class c_i
  - n_ciaj – number of training examples of class c_i with attribute value a_j
- A code sketch of Naïve Bayes with this estimate follows after the next slide

Learning to classify text
- For example: is an e-mail spam?
- Represent each document by a set of words
- Independence assumptions:
  - The order of words does not matter
  - Co-occurrences of words do not matter
- Learning: estimate from the training documents:
  - For every class c_i, estimate P(c_i)
  - For every word w and class c_i, estimate P(w | c_i)
- Classification: maximum likelihood
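Here is a minimal Python sketch of Naïve Bayes with the "virtual examples" estimate on the 'advanced ballooning' data. The p_attr and classify helpers are illustrative, and m is taken to be the number of attribute values observed in the training data, which is an assumption not stated in the lecture.

```python
from collections import Counter, defaultdict

# 'Advanced ballooning' data: (Sky, Temperature, Rain, Wind) -> Fly Balloon
data = [
    (("Sunny",  "Cold", "None",   "Strong"), "Yes"),
    (("Cloudy", "Cold", "Shower", "Weak"),   "Yes"),
    (("Cloudy", "Cold", "Shower", "Strong"), "No"),
    (("Sunny",  "Hot",  "Shower", "Strong"), "No"),
]

n_attrs = 4
classes = Counter(c for _, c in data)                            # n_ci per class
values = [set(x[j] for x, _ in data) for j in range(n_attrs)]    # observed values of each attribute
counts = defaultdict(Counter)                                    # counts[(j, c)][value] = n_ciaj
for x, c in data:
    for j, a in enumerate(x):
        counts[(j, c)][a] += 1

def p_attr(j, a, c, smooth=True):
    """P(a_j = a | c_i), optionally with the m-virtual-examples estimate."""
    m = len(values[j])              # number of possible values of attribute a_j (assumed = observed)
    n_c = classes[c]
    n_ca = counts[(j, c)][a]
    return (n_ca + 1) / (n_c + m) if smooth else n_ca / n_c

def classify(x, smooth=True):
    """Return argmax_ci P(ci) * prod_j P(a_j | ci), plus all class scores."""
    scores = {}
    for c, n_c in classes.items():
        score = n_c / len(data)                 # P(ci)
        for j, a in enumerate(x):
            score *= p_attr(j, a, c, smooth)
        scores[c] = score
    return max(scores, key=scores.get), scores

print(classify(("Cloudy", "Hot", "Shower", "Strong"), smooth=False))
# P(Yes|x) ~ 0, P(No|x) ~ 0.125: the zero estimate for P(Hot|Yes) kills class Yes entirely
print(classify(("Cloudy", "Hot", "Shower", "Strong"), smooth=True))
# With the virtual examples, neither class collapses to 0
```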

Learning in detail
- Vocabulary = all distinct words in the training text
- For each class c_i:
  - P(c_i) = (number of documents of class c_i) / (total number of documents)
  - Text_ci = the concatenation of all documents of class c_i
  - n_ci = total number of words in Text_ci (counting duplicates multiple times)
  - For each word w_j in Vocabulary:
    - n_ciwj = number of times word w_j occurs in Text_ci
    - P(w_j | c_i) = (n_ciwj + 1) / (n_ci + |Vocabulary|)

Classification in detail
- Index all words in the document to classify by j, i.e. denote the j-th word of the document by w_j
- Classify:
  c(document) = argmax_{c_i} P(c_i) ∏_j P(w_j | c_i)
- In practice the P(w_j | c_i) are small, so their product is very close to 0; it is better to use:
  c(document) = argmax_{c_i} log[ P(c_i) ∏_j P(w_j | c_i) ]
              = argmax_{c_i} [ log P(c_i) + Σ_j log P(w_j | c_i) ]
- A code sketch of this learning and classification procedure is given below, after the Conditional independence slide

Pre-processing
- Allows adding background knowledge
- May dramatically increase accuracy
- Sample techniques:
  - Lemmatisation – converts words to their basic form
  - Stop-list – removes the 100 most frequent words

Understanding Naïve Bayes
- Although Naïve Bayes is considered to be subsymbolic, the estimated probabilities may give insight into the classification process
- For example, in spam filtering, the words with maximum P(w_j | spam) are the words whose presence most strongly predicts that an e-mail is spam

Bayesian Belief Networks
- The Naïve Bayes assumption of conditional independence of attributes is too restrictive for some problems
- But some assumptions need to be made to allow generalization
- Bayesian Belief Networks assume conditional independence among subsets of attributes
- They allow combining prior knowledge about (in)dependencies among attributes

Conditional independence
- X is conditionally independent of Y given Z if
  ∀ x,y,z: P(X=x | Y=y, Z=z) = P(X=x | Z=z)
- Usually written: P(X|Y,Z) = P(X|Z)
- Example: P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
- Used by Naïve Bayes:
  P(A_1, A_2 | C) = P(A_1 | A_2, C) P(A_2 | C)   (always true)
                  = P(A_1 | C) P(A_2 | C)        (only true if A_1 and A_2 are conditionally independent given C)
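The following is a minimal Python sketch of the text learning and classification procedure described above. The train and classify function names and the tiny spam/ham example are illustrative, not from the lecture; the estimates are (n_ciwj + 1) / (n_ci + |Vocabulary|), and classification sums log-probabilities as recommended.

```python
import math
from collections import Counter

def train(documents):
    """documents: list of (list_of_words, class_label) pairs.
    Returns priors P(ci), per-class word counts n_ciwj, class word totals n_ci, and the vocabulary."""
    vocabulary = set(w for words, _ in documents for w in words)
    doc_counts = Counter(c for _, c in documents)
    priors = {c: n / len(documents) for c, n in doc_counts.items()}   # P(ci)
    word_counts = {c: Counter() for c in doc_counts}                  # n_ciwj
    totals = Counter()                                                # n_ci (words in Text_ci)
    for words, c in documents:
        word_counts[c].update(words)
        totals[c] += len(words)
    return priors, word_counts, totals, vocabulary

def classify(words, priors, word_counts, totals, vocabulary):
    """Return argmax_ci [ log P(ci) + sum_j log P(w_j | ci) ]."""
    V = len(vocabulary)
    best, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for w in words:
            p_w = (word_counts[c][w] + 1) / (totals[c] + V)   # (n_ciwj + 1) / (n_ci + |Vocabulary|)
            score += math.log(p_w)
        if score > best_score:
            best, best_score = c, score
    return best

docs = [("win money now".split(), "spam"),
        ("meeting schedule for monday".split(), "ham"),
        ("win a free prize now".split(), "spam")]
model = train(docs)
print(classify("free money".split(), *model))   # -> 'spam' for this tiny example
```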

Bayesian Belief Network
- Example network: Flu is a parent of Weakness, feVer and Headache, and feVer is also a parent of Weakness
- Connections describe dependence and causality
- Each node is conditionally independent of its nondescendants, given its immediate predecessors
- Examples:
  - feVer and Headache are independent given Flu
  - feVer and Weakness are not independent given Flu (Weakness also depends directly on feVer)

Learning a Bayesian Network
- The probabilities of attribute values given their parents can be estimated from the training set
- Conditional probability tables for the Flu network:

  Flu:       P(F) = 0.1, P(¬F) = 0.9
  feVer:     P(V|F) = 0.8, P(V|¬F) = 0.1
  Headache:  P(H|F) = 0.6, P(H|¬F) = 0.1
  Weakness:  P(W|F,V) = 0.9, P(W|F,¬V) = 0.8, P(W|¬F,V) = 0.5, P(W|¬F,¬V) = 0.05

Inference
- During Bayesian classification we compute:
  c = argmax_{c_i} P(a_1, ..., a_n | c_i) P(c_i) = argmax_{c_i} P(a_1, ..., a_n, c_i)
- In general, in a Bayesian network with nodes Y_i:
  P(y_1, ..., y_n) = ∏_{i=1..n} P(y_i | Parents(Y_i))
- Thus:
  P(a_1, ..., a_n, c_i) = ∏_j P(a_j | Parents(A_j)) P(c_i)
- Example: classify a patient with W, V, ¬H:
  P(W, V, ¬H, F) = P(W|V,F) P(V|F) P(¬H|F) P(F) = 0.9 × 0.8 × 0.4 × 0.1 = 0.0288
  (a code sketch of this computation is given after the Summary)

Naïve Bayes network
- Naïve Bayes corresponds to a network in which the class (Flu) is the only parent of every attribute (Sore throat, feVer, Headache)
- In the case of this network:
  P(a_1, ..., a_n, c) = ∏_i P(a_i | Parents(A_i)) P(c) = ∏_i P(a_i | c) P(c)

Extensions to Bayesian nets
- Networks with hidden states
- Learning the structure of the network from data

Summary
- Inductive bias of Naïve Bayes: the attributes are independent given the class
  - Although this assumption is often violated, it provides a very efficient tool, often used e.g. for spam filtering
  - Applicable to data with many attributes (possibly missing) which take discrete values (e.g. words)
- Bayesian belief networks allow prior knowledge about dependencies
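The inference example can be checked with a short Python sketch. The dictionary layout and the joint helper below are my own illustration; the numbers are the conditional probability tables of the Flu network above. It evaluates the factored joint for the patient with Weakness and feVer but no Headache, for both Flu and no Flu, so the two classes can be compared.

```python
# CPTs from the Flu network (True = the event occurs)
P_F = {True: 0.1, False: 0.9}                                    # P(Flu)
P_V_given_F = {True: 0.8, False: 0.1}                            # P(feVer | Flu)
P_H_given_F = {True: 0.6, False: 0.1}                            # P(Headache | Flu)
P_W_given_FV = {(True, True): 0.9, (True, False): 0.8,           # P(Weakness | Flu, feVer)
                (False, True): 0.5, (False, False): 0.05}

def joint(w, v, h, f):
    """P(W=w, V=v, H=h, F=f) = P(w | f, v) * P(v | f) * P(h | f) * P(f)."""
    p_w = P_W_given_FV[(f, v)] if w else 1 - P_W_given_FV[(f, v)]
    p_v = P_V_given_F[f] if v else 1 - P_V_given_F[f]
    p_h = P_H_given_F[f] if h else 1 - P_H_given_F[f]
    return p_w * p_v * p_h * P_F[f]

# Patient with Weakness and feVer but no Headache:
p_flu    = joint(True, True, False, True)    # 0.9 * 0.8 * 0.4 * 0.1 = 0.0288
p_no_flu = joint(True, True, False, False)   # 0.5 * 0.1 * 0.9 * 0.9 = 0.0405
print(p_flu, p_no_flu)   # under these CPTs, 'no flu' is the more probable class
```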
