
Text classification I (Naïve Bayes)

CE-324: Modern Information Retrieval, Sharif University of Technology, M. Soleymani, Spring 2020. Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford).


  1. Text classification I (Naïve Bayes)
     CE-324: Modern Information Retrieval, Sharif University of Technology
     M. Soleymani, Spring 2020
     Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

  2. Outline
     - Text classification
       - definition
       - relevance to information retrieval
     - Naïve Bayes classifier

  3. Formal definition of text classification
     - Document space $X$
       - Docs are represented in this (typically high-dimensional) space
     - Set of classes $C = \{c_1, \ldots, c_J\}$
       - Example: $C = \{\text{spam}, \text{non-spam}\}$
     - Training set: a set of labeled docs. Each labeled doc $(d, c) \in X \times C$
     - Using a learning method, we find a classifier $\gamma$ that maps docs to classes: $\gamma : X \to C$

  4. Examples of using classification in IR systems
     - Language identification (classes: English vs. French etc.)
     - Automatic detection of spam pages (spam vs. non-spam)
     - Automatic detection of secure pages for safe search
     - Topic-specific or vertical search: restrict search to a "vertical" like "related to health" (relevant to vertical vs. not)
     - Sentiment detection: is a movie or product review positive or negative (positive vs. negative)
     - Exercise: Find examples of uses of text classification in IR

  5. Standing queries (Ch. 13)
     - The path from IR to text classification:
       - You have an information need to monitor, say: unrest in the Niger delta region
       - You want to rerun an appropriate query periodically to find new news items on this topic
       - You will be sent new documents that are found
       - I.e., it's not ranking but classification (relevant vs. not relevant)
     - Such queries are called standing queries
       - Long used by "information professionals"
       - A modern mass instantiation is Google Alerts

  7. Spam filtering: another text classification task

     From: "" <takworlld@hotmail.com>
     Subject: real estate is the only way... gem oalvgkay

     Anyone can buy real estate with no money down
     Stop paying rent TODAY !
     There is no need to spend hundreds or even thousands for similar courses
     I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook.
     Change your life NOW !
     =================================================
     Click Below to order: http://www.wholesaledaily.com/sales/nmd.htm
     =================================================

  8. Categorization/Classification
     - Given:
       - A representation of a document $d$
         - Issue: how to represent text documents
         - Usually some type of high-dimensional space: bag of words
       - A fixed set of classes: $C = \{c_1, c_2, \ldots, c_J\}$
     - Determine:
       - The category of $d$: $\gamma(d) \in C$
       - $\gamma(d)$ is a classification function
       - We want to build classification functions ("classifiers")
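
A minimal sketch of the bag-of-words representation mentioned above. It assumes lowercasing and whitespace tokenization, which the slides do not specify, and `bag_of_words` is a hypothetical helper name used only for illustration.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Map a document to its term counts, discarding word order."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    return Counter(text.lower().split())


doc = "Stop paying rent TODAY stop"
bow = bag_of_words(doc)
```

Two documents with the same words in a different order get the same representation, which is exactly the information this model throws away.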

  9. Classification Methods (1)
     - Manual classification
       - Used by the original Yahoo! Directory
       - Looksmart, about.com, ODP, PubMed
       - Accurate when the job is done by experts
       - Consistent when the problem size and the team are small
       - Difficult and expensive to scale
         - Means we need automatic classification methods for big problems

  10. Classification Methods (2)
      - Hand-coded rule-based classifiers
        - One technique used by news agencies, intelligence agencies, etc.
        - Widely deployed in government and enterprise
        - Vendors provide an "IDE" for writing such rules
      - Issues:
        - Commercial systems have complex query languages
        - Accuracy can be high if a rule has been carefully refined over time by a subject expert
        - Building and maintaining these rules is expensive

  11. Classification Methods (3): Supervised learning
      - Given:
        - A document $d$
        - A fixed set of classes: $C = \{c_1, c_2, \ldots, c_J\}$
        - A training set $D$ of documents, each with a label in $C$
      - Determine:
        - A learning method or algorithm which will enable us to learn a classifier $\gamma$
        - For a test document $d$, we assign it the class $\gamma(d) \in C$

  12. Bayes classifier
      - Bayesian classifier is a probabilistic classifier:
        $$c = \arg\max_k P(C_k \mid d) = \arg\max_k P(d \mid C_k)\, P(C_k)$$
        - $d = \langle t_1, \ldots, t_{L_d} \rangle$
      - There are too many parameters $P(\langle t_1, \ldots, t_{L_d} \rangle \mid C_k)$
        - One for each unique combination of a class and a sequence of words.
        - We would need a very, very large number of training examples to estimate that many parameters.

  13. Naïve Bayes assumption
      - Naïve Bayes assumption:
        $$P(d \mid C_k) = P(\langle t_1, \ldots, t_{L_d} \rangle \mid C_k) = \prod_{i=1}^{L_d} P(t_i \mid C_k)$$
      - $L_d$: length of doc $d$ (number of tokens)
      - $P(t_i \mid C_k)$: probability of term $t_i$ occurring in a doc of class $C_k$
      - $P(C_k)$: prior probability of class $C_k$

  14. Naïve Bayes assumption (language model view)
      - The product over token positions, $P(d \mid C_k) = \prod_{i=1}^{L_d} P(t_i \mid C_k)$, is equivalent to a product over vocabulary terms:
        $$P(d \mid C_k) = \prod_{i=1}^{|V|} P(t_i \mid C_k)^{tf_{i,d}}$$
      - $tf_{i,d}$: number of occurrences of term $t_i$ in doc $d$
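
The two forms agree because grouping the per-position factors by term gives each term an exponent equal to its count in the doc. A small numerical check of that equivalence, assuming made-up conditional probabilities (the values in `p_term` are illustrative and not from the slides):

```python
import math
from collections import Counter

# Hypothetical P(t | C_k) values for one class; illustrative only.
p_term = {"chinese": 0.5, "tokyo": 0.2, "japan": 0.3}
doc = ["chinese", "tokyo", "chinese"]

# Position view: one factor per token position i = 1..L_d.
by_position = math.prod(p_term[t] for t in doc)

# Language-model view: one factor per vocabulary term, raised to tf_{i,d}.
# Terms absent from the doc have tf = 0 and contribute a factor of 1.
tf = Counter(doc)
by_vocab = math.prod(p ** tf[t] for t, p in p_term.items())
```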

  15. Naive Bayes classifier
      $$c = \arg\max_k P(d \mid C_k)\, P(C_k) = \arg\max_k P(C_k) \prod_{i=1}^{L_d} P(t_i \mid C_k)$$
      - Since log is a monotonic function, the class with the highest score does not change:
        $$c = \arg\max_k \left[ \log P(C_k) + \sum_{i=1}^{L_d} \log P(t_i \mid C_k) \right]$$
      - $\log P(t_i \mid C_k)$: a weight that indicates how good an indicator $t_i$ is for $C_k$
      - Recall: $\log(xy) = \log(x) + \log(y)$
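
A short sketch comparing the product form and the log form of the score. All probabilities here are hypothetical. Besides leaving the argmax unchanged, the log form also avoids floating-point underflow on long docs, a practical motivation that the slide does not state explicitly.

```python
import math


def score_product(prior, cond_probs):
    """P(C_k) * prod_i P(t_i | C_k)."""
    return prior * math.prod(cond_probs)


def score_log(prior, cond_probs):
    """log P(C_k) + sum_i log P(t_i | C_k)."""
    return math.log(prior) + sum(math.log(p) for p in cond_probs)


# Hypothetical (prior, per-token probabilities) for two classes on a 5-token doc.
classes = {
    "A": (0.75, [0.4, 0.4, 0.4, 0.1, 0.1]),
    "B": (0.25, [0.2, 0.2, 0.2, 0.2, 0.2]),
}
best_by_product = max(classes, key=lambda c: score_product(*classes[c]))
best_by_log = max(classes, key=lambda c: score_log(*classes[c]))

# On a long doc the raw product underflows to 0.0; the log score stays finite.
long_doc = [1e-4] * 100
underflowed = score_product(0.5, long_doc)
log_score = score_log(0.5, long_doc)
```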

  16. Estimating parameters
      - Estimate $\hat{P}(C_k)$ and $\hat{P}(t_i \mid C_k)$ from training data
        - $N$: total number of training docs
        - $N_k$: number of docs in class $C_k$
        - $T_{i,k}$: number of occurrences of $t_i$ in training docs from class $C_k$ (includes multiple occurrences)
      - $$\hat{P}(C_k) = \frac{N_k}{N}$$
      - $$\hat{P}(t_i \mid C_k) = \frac{T_{i,k}}{\sum_{j=1}^{|V|} T_{j,k}}$$
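
A minimal sketch of these relative-frequency estimates, assuming a tiny hypothetical training set (the docs and class names below are invented for illustration). Note that the unsmoothed estimate is exactly zero for any term never seen in a class.

```python
from collections import Counter

# Hypothetical labeled training docs as (tokens, class) pairs; not from the slides.
training = [
    (["cheap", "pills", "cheap"], "spam"),
    (["meeting", "today"], "ham"),
    (["cheap", "meeting"], "ham"),
]

n_docs = len(training)                                   # N
docs_per_class = Counter(c for _, c in training)         # N_k
term_counts = {c: Counter() for c in docs_per_class}     # T_{i,k}
for tokens, c in training:
    term_counts[c].update(tokens)


def prior(c):
    """Estimate of P(C_k) = N_k / N."""
    return docs_per_class[c] / n_docs


def cond_prob(term, c):
    """Estimate of P(t_i | C_k) = T_{i,k} / sum_j T_{j,k} (no smoothing)."""
    return term_counts[c][term] / sum(term_counts[c].values())
```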

  17. Problem with estimates: Zeros
      - $d$: BEIJING AND TAIPEI JOIN WTO
      - $P(\text{WTO} \mid \text{China}) = 0$

  18. Problem with estimates: Zeros
      - For a doc $d$ containing a term $t$ that does not occur in any doc of a class $c$ $\Rightarrow$ $\hat{P}(c \mid d) = 0$
        - Thus $d$ cannot be assigned to class $c$
      - We use
        $$\hat{P}(t \mid c) = \frac{T_{t,c} + 1}{\sum_{t' \in V} T_{t',c} + |V|}$$
      - Instead of
        $$\hat{P}(t \mid c) = \frac{T_{t,c}}{\sum_{t' \in V} T_{t',c}}$$
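
A small sketch of the add-one estimate, assuming hypothetical counts for a single class in which "wto" never occurs. `smoothed_cond_prob` is an illustrative helper name, and `len(vocab)` plays the role of $|V|$.

```python
from collections import Counter


def smoothed_cond_prob(term, class_counts, vocab):
    """Add-one estimate: (T_{t,c} + 1) / (sum_{t'} T_{t',c} + |V|)."""
    total = sum(class_counts[t] for t in vocab)
    return (class_counts[term] + 1) / (total + len(vocab))


# Hypothetical counts T_{t,c} for one class c; "wto" has count 0.
counts_c = Counter({"beijing": 2, "taipei": 1})
vocab = {"beijing", "taipei", "wto"}

p_wto = smoothed_cond_prob("wto", counts_c, vocab)
p_sum = sum(smoothed_cond_prob(t, counts_c, vocab) for t in vocab)
```

The unseen term now gets a small non-zero probability, and the estimates over the vocabulary still sum to one.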

  19. Naïve Bayes: summary
      - Estimate parameters from the training corpus using add-one smoothing
      - For a new doc $d = \langle t_1, \ldots, t_{L_d} \rangle$, for each class, compute
        $$\log P(C_k) + \sum_{i=1}^{L_d} \log P(t_i \mid C_k)$$
      - Assign doc $d$ to the class with the largest score
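
The three steps above can be sketched end to end as follows. This is a minimal illustration and not the course's reference implementation: `train_nb` and `classify_nb` are hypothetical names, the training docs are invented, and skipping test terms that are outside the training vocabulary is an assumption the slides do not address.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Return (log_prior, log_cond, vocab) from (tokens, class) pairs."""
    n_docs = len(labeled_docs)
    docs_per_class = Counter(c for _, c in labeled_docs)
    term_counts = defaultdict(Counter)
    for tokens, c in labeled_docs:
        term_counts[c].update(tokens)
    vocab = {t for tokens, _ in labeled_docs for t in tokens}

    log_prior = {c: math.log(n / n_docs) for c, n in docs_per_class.items()}
    log_cond = {}
    for c in docs_per_class:
        # Add-one smoothing: every vocabulary term gets count + 1.
        denom = sum(term_counts[c].values()) + len(vocab)
        log_cond[c] = {t: math.log((term_counts[c][t] + 1) / denom) for t in vocab}
    return log_prior, log_cond, vocab


def classify_nb(tokens, log_prior, log_cond, vocab):
    """Return the class with the largest log score."""
    # Assumption: terms never seen in training are skipped.
    known = [t for t in tokens if t in vocab]
    scores = {c: log_prior[c] + sum(log_cond[c][t] for t in known) for c in log_prior}
    return max(scores, key=scores.get)


# Hypothetical training data for illustration.
model = train_nb([
    (["buy", "cheap", "pills"], "spam"),
    (["cheap", "cheap", "offer"], "spam"),
    (["project", "meeting", "notes"], "ham"),
])
label = classify_nb(["cheap", "pills", "unseen"], *model)
```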

  20. Naïve Bayes: example
      - Training phase:
        - Estimate parameters of Naive Bayes classifier
      - Test phase:
        - Classifying the test doc

  21. Naïve Bayes: example ($C$ = China)
      - Estimating parameters
        - $\hat{P}(C) = 3/4$, $\hat{P}(\bar{C}) = 1/4$
        - $\hat{P}(\text{CHINESE} \mid C) = (5+1)/(8+6) = 6/14$, $\hat{P}(\text{CHINESE} \mid \bar{C}) = (1+1)/(3+6) = 2/9$
        - $\hat{P}(\text{TOKYO} \mid C) = (0+1)/(8+6) = 1/14$, $\hat{P}(\text{TOKYO} \mid \bar{C}) = (1+1)/(3+6) = 2/9$
        - $\hat{P}(\text{JAPAN} \mid C) = (0+1)/(8+6) = 1/14$, $\hat{P}(\text{JAPAN} \mid \bar{C}) = (1+1)/(3+6) = 2/9$
      - Classifying the test doc
        - $\hat{P}(C \mid d) \propto 3/4 \times (6/14)^3 \times 1/14 \times 1/14 \approx 0.0003$
        - $\hat{P}(\bar{C} \mid d) \propto 1/4 \times (2/9)^3 \times 2/9 \times 2/9 \approx 0.0001$
        - $\hat{c} = C$
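
The arithmetic above can be checked with exact fractions. The counts below are the ones appearing in the slide's formulas (5, 0, 0 out of 8 tokens for China; 1, 1, 1 out of 3 tokens for the other class; a vocabulary of 6 terms). The test doc itself is an assumption inferred from the exponents in the product: CHINESE three times, then TOKYO and JAPAN.

```python
from fractions import Fraction as F

vocab_size = 6  # |V|, from the "+6" in the slide's denominators
prior = {"China": F(3, 4), "not China": F(1, 4)}
total_tokens = {"China": 8, "not China": 3}
term_count = {
    "China": {"Chinese": 5, "Tokyo": 0, "Japan": 0},
    "not China": {"Chinese": 1, "Tokyo": 1, "Japan": 1},
}
# Assumption: test doc inferred from the exponents in the slide's product.
test_doc = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]


def cond(term, c):
    """Add-one estimate of P(term | c)."""
    return F(term_count[c][term] + 1, total_tokens[c] + vocab_size)


score = {}
for c in prior:
    s = prior[c]
    for t in test_doc:
        s *= cond(t, c)
    score[c] = s

predicted = max(score, key=score.get)
```

Using `Fraction` keeps the intermediate values exact, so the rounded results can be compared directly with the slide's 0.0003 and 0.0001.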

  22. Naïve Bayes: training

  23. Naïve Bayes: test

  24. Time complexity of Naive Bayes
      - Generally: $|\mathbb{C}||V| < |D| L_{ave}$
      - $D$: training set, $V$: vocabulary, $\mathbb{C}$: set of classes
      - $L_{ave}$: average length of a training doc
      - $L_a$: length of the test doc
      - $M_a$: number of distinct terms in the test doc
      - Thus: Naive Bayes is linear in the size of the training set (training) and the test doc (testing).
        - This is optimal time.
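
For reference, the training and test bounds as given in Chapter 13 of IIR, written in the symbols defined above (a restatement from that chapter, not text taken from the slide):

```latex
\begin{align*}
\text{Training:} \quad & \Theta\bigl(|D|\, L_{ave} + |\mathbb{C}|\,|V|\bigr) \\
\text{Testing:}  \quad & \Theta\bigl(L_a + |\mathbb{C}|\, M_a\bigr) = \Theta\bigl(|\mathbb{C}|\, M_a\bigr)
\end{align*}
```

Since $|\mathbb{C}||V| < |D| L_{ave}$ generally, the training bound is dominated by the single pass over the training tokens.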

  25. Why does Naive Bayes work?
      - The independence assumptions do not really hold for docs written in natural language.
      - Naive Bayes can work well even though these assumptions are badly violated.
      - Classification is about predicting the correct class, not about accurately estimating probabilities.
        - Naive Bayes is terrible at estimating probabilities correctly...
        - but it often performs well at choosing the correct class.
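
A toy illustration of this point, under stated assumptions: two classes with equal priors and a single informative feature whose likelihoods (0.6 and 0.4) are invented. Repeating the same feature ten times is a perfectly correlated input that the model wrongly treats as ten independent pieces of evidence.

```python
def nb_posterior(lik_c1, lik_c2, copies, prior_c1=0.5):
    """Posterior of class c1 when the model sees `copies` identical features."""
    s1 = prior_c1 * lik_c1 ** copies
    s2 = (1 - prior_c1) * lik_c2 ** copies
    return s1 / (s1 + s2)


true_posterior = nb_posterior(0.6, 0.4, copies=1)    # one informative feature
naive_posterior = nb_posterior(0.6, 0.4, copies=10)  # same feature counted 10 times
```

The probability estimate moves from 0.6 to above 0.98, which is badly overconfident, yet the predicted class (the one with posterior above 0.5) is the same in both cases.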

  26. Naive Bayes is not so naive
      - Naive Bayes has won some bakeoffs (e.g., KDD-CUP 97)
      - A good dependable baseline for text classification (but not the best)
      - Optimal if independence assumptions hold (never true for text, but true for some domains)
      - More robust to non-relevant features than some more complex learning methods
      - More robust to concept drift (changing of definition of class over time) than some more complex learning methods
      - Very fast
      - Low storage requirements

  27. Resources
      - Chapter 13 of IIR
