Text classification I (Naïve Bayes) CE-324: Modern Information Retrieval

Text classification I (Naïve Bayes) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Spring 2020 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

Outline } Text classification } definition } relevance to information retrieval } Naïve Bayes classifier

Formal definition of text classification } Document space $X$ } Docs are represented in this (typically high-dimensional) space } Set of classes $C = \{c_1, \ldots, c_J\}$ } Example: $C = \{\text{spam}, \text{non-spam}\}$ } Training set: a set of labeled docs. Each labeled doc $\langle d, c \rangle \in X \times C$ } Using a learning method, we find a classifier $\gamma(\cdot)$ that maps docs to classes: $\gamma: X \to C$

Examples of using classification in IR systems } Language identification (classes: English vs. French etc.) } Automatic detection of spam pages (spam vs. non-spam) } Automatic detection of secure pages for safe search } Topic-specific or vertical search – restrict search to a “vertical” like “related to health” (relevant to vertical vs. not) } Sentiment detection: is a movie or product review positive or negative (positive vs. negative) } Exercise: Find examples of uses of text classification in IR

Ch. 13 Standing queries } The path from IR to text classification: } You have an information need to monitor, say: } Unrest in the Niger delta region } You want to rerun an appropriate query periodically to find new news items on this topic } You will be sent new documents that are found } I.e., it’s not ranking but classification (relevant vs. not relevant) } Such queries are called standing queries } Long used by “information professionals” } A modern mass instantiation is Google Alerts

Spam filtering: another text classification task
From: "" <takworlld@hotmail.com>
Subject: real estate is the only way... gem oalvgkay
Anyone can buy real estate with no money down
Stop paying rent TODAY !
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook.
Change your life NOW !
=================================================
Click Below to order: http://www.wholesaledaily.com/sales/nmd.htm
=================================================

Categorization/Classification } Given: } A representation of a document $d$ } Issue: how to represent text documents. } Usually some type of high-dimensional space – bag of words } A fixed set of classes: $C = \{c_1, c_2, \ldots, c_J\}$ } Determine: } The category of $d$: $\gamma(d) \in C$ } $\gamma(d)$ is a classification function } We want to build classification functions (“classifiers”).

Classification Methods (1) } Manual classification } Used by the original Yahoo! Directory } Looksmart, about.com, ODP, PubMed } Accurate when job is done by experts } Consistent when the problem size and team are small } Difficult and expensive to scale } Means we need automatic classification methods for big problems

Classification Methods (2) } Hand-coded rule-based classifiers } One technique used by news agencies, intelligence agencies, etc. } Widely deployed in government and enterprise } Vendors provide “IDE” for writing such rules } Issues: } Commercial systems have complex query languages } Accuracy can be high if a rule has been carefully refined over time by a subject expert } Building and maintaining these rules is expensive

Classification Methods (3): Supervised learning } Given: } A document $d$ } A fixed set of classes: $C = \{c_1, c_2, \ldots, c_J\}$ } A training set $D$ of documents each with a label in $C$ } Determine: } A learning method or algorithm which will enable us to learn a classifier $\gamma$ } For a test document $d$, we assign it the class $\gamma(d) \in C$

Bayes classifier } The Bayesian classifier is a probabilistic classifier: $c = \arg\max_k P(C_k \mid d) = \arg\max_k P(d \mid C_k) P(C_k)$ } $d = \langle t_1, \ldots, t_{L_d} \rangle$ } There are too many parameters $P(t_1, \ldots, t_{L_d} \mid C_k)$ } One for each unique combination of a class and a sequence of words. } We would need a very, very large number of training examples to estimate that many parameters.

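Naïve Bayes assumption } Naïve Bayes assumption: $P(d \mid C_k) = P(t_1, \ldots, t_{L_d} \mid C_k) = \prod_{i=1}^{L_d} P(t_i \mid C_k)$ } $L_d$: length of doc $d$ (number of tokens) } $P(t_i \mid C_k)$: probability of term $t_i$ occurring in a doc of class $C_k$ } $P(C_k)$: prior probability of class $C_k$ } Equivalent to (language model view): $P(d \mid C_k) = \prod_{i=1}^{|V|} P(t_i \mid C_k)^{tf_{i,d}}$, where the product now runs over the vocabulary terms and $tf_{i,d}$ is the number of occurrences of term $t_i$ in $d$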
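The equivalence between the per-position product and the language model view can be checked numerically. The following is a minimal sketch, assuming a made-up table of term probabilities for a single class; the names `p_term` and `doc` and all values are hypothetical and do not come from the slides.

```python
import math
from collections import Counter

# Hypothetical P(t | C_k) for one class; values are illustrative only.
p_term = {"chinese": 0.5, "tokyo": 0.2, "japan": 0.3}
doc = ["chinese", "chinese", "tokyo", "chinese", "japan"]

# Positional view: one factor per token position i = 1..L_d.
positional = math.prod(p_term[t] for t in doc)

# Language model view: one factor per vocabulary term, raised to tf_{i,d}.
tf = Counter(doc)
by_term = math.prod(p_term[t] ** tf.get(t, 0) for t in p_term)

print(positional, by_term)
```

Both products multiply the same factors, only grouped differently, so they agree up to floating-point rounding.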

Naive Bayes classifier } $c = \arg\max_k P(d \mid C_k) P(C_k) = \arg\max_k P(C_k) \prod_{i=1}^{L_d} P(t_i \mid C_k)$ } Since log is a monotonic function and $\log(xy) = \log(x) + \log(y)$, the class with the highest score does not change: $c = \arg\max_k \left[ \log P(C_k) + \sum_{i=1}^{L_d} \log P(t_i \mid C_k) \right]$ } $\log P(t_i \mid C_k)$: a weight that indicates how good an indicator $t_i$ is for $C_k$
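One practical reason to score in log space, which the slide does not state explicitly, is floating-point underflow: a product of many small probabilities can round to zero, while the sum of their logs stays finite. A small sketch, assuming a hypothetical per-token probability `p_token` and document length `doc_length`:

```python
import math

# Hypothetical per-token probability; a long doc multiplies many such factors.
p_token = 1e-5
doc_length = 100

# Direct product: 1e-500 is below the smallest positive double, so it underflows.
product = math.prod([p_token] * doc_length)

# Sum of logs: stays finite, and the ordering between classes is preserved.
log_score = sum(math.log(p_token) for _ in range(doc_length))

print(product)    # 0.0
print(log_score)
```

With the direct product, every class would score 0.0 on a long enough doc and the argmax would be meaningless; the log scores remain comparable.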

Estimating parameters } Estimate $\hat{P}(C_k)$ and $\hat{P}(t_i \mid C_k)$ from training data } $N_k$: number of docs in class $C_k$; $N$: total number of training docs } $T_{i,k}$: number of occurrences of $t_i$ in training docs from class $C_k$ (includes multiple occurrences) } $\hat{P}(C_k) = \frac{N_k}{N}$ } $\hat{P}(t_i \mid C_k) = \frac{T_{i,k}}{\sum_{j=1}^{|V|} T_{j,k}}$
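The two estimates are simple relative frequencies. As a rough sketch, assuming a tiny hypothetical labeled corpus (the docs, class names, and helper names `prior` and `cond_prob` are invented for illustration):

```python
from collections import Counter

# Hypothetical labeled training set: (tokens, class) pairs.
training = [
    (["cheap", "pills", "cheap"], "spam"),
    (["meeting", "agenda"], "ham"),
    (["cheap", "meeting"], "ham"),
]

n_docs = len(training)                                  # N
docs_per_class = Counter(c for _, c in training)        # N_k
term_counts = {c: Counter() for c in docs_per_class}    # T_{i,k}
for tokens, c in training:
    term_counts[c].update(tokens)

def prior(c):
    """Estimate of P(C_k) as N_k / N."""
    return docs_per_class[c] / n_docs

def cond_prob(t, c):
    """Unsmoothed estimate of P(t_i | C_k) as T_{i,k} / sum_j T_{j,k}."""
    return term_counts[c][t] / sum(term_counts[c].values())

print(prior("spam"), cond_prob("cheap", "spam"), cond_prob("cheap", "ham"))
```

Note that `cond_prob("pills", "ham")` is exactly zero here, because "pills" never occurs in a "ham" training doc.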

Problem with estimates: Zeros } $d$: BEIJING AND TAIPEI JOIN WTO } $P(\text{WTO} \mid \text{China}) = 0$

Problem with estimates: Zeros } For doc $d$ containing a term $t$ that does not occur in any doc of a class $c$ $\Rightarrow$ $\hat{P}(c \mid d) = 0$ } Thus $d$ cannot be assigned to class $c$ } We use $\hat{P}(t \mid c) = \frac{T_{t,c} + 1}{\left(\sum_{t' \in V} T_{t',c}\right) + |V|}$ } Instead of $\hat{P}(t \mid c) = \frac{T_{t,c}}{\sum_{t' \in V} T_{t',c}}$
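The effect of add-one smoothing can be seen on a single class. This is a minimal sketch, assuming hypothetical term counts for one class and a four-word vocabulary; `counts_c`, `vocabulary`, and the two helper functions are illustrative names only.

```python
from collections import Counter

# Hypothetical term counts T_{t,c} for one class c, and the full vocabulary V.
counts_c = Counter({"beijing": 2, "taipei": 1, "join": 1})
vocabulary = {"beijing", "taipei", "join", "wto"}

def unsmoothed(t):
    """T_{t,c} / sum_{t'} T_{t',c}: zero for any term unseen in class c."""
    return counts_c[t] / sum(counts_c.values())

def add_one(t):
    """(T_{t,c} + 1) / (sum_{t'} T_{t',c} + |V|): never zero."""
    return (counts_c[t] + 1) / (sum(counts_c.values()) + len(vocabulary))

print(unsmoothed("wto"), add_one("wto"))  # 0.0 0.125
```

The smoothed estimates still sum to one over the vocabulary, so they remain a valid distribution.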

Naïve Bayes: summary } Estimate parameters from the training corpus using add-one smoothing } For a new doc $d = \langle t_1, \ldots, t_{L_d} \rangle$, for each class, compute $\log P(C_k) + \sum_{i=1}^{L_d} \log P(t_i \mid C_k)$ } Assign doc $d$ to the class with the largest score
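The three summary steps can be sketched end to end as follows. This is an illustrative sketch rather than the course's reference implementation: the function names `train` and `classify`, the toy spam/ham data, and the choice to skip test tokens outside the training vocabulary are all assumptions.

```python
import math
from collections import Counter

def train(labeled_docs):
    """Estimate log priors and add-one-smoothed log term probabilities."""
    vocabulary = {t for tokens, _ in labeled_docs for t in tokens}
    docs_per_class = Counter(c for _, c in labeled_docs)
    term_counts = {c: Counter() for c in docs_per_class}
    for tokens, c in labeled_docs:
        term_counts[c].update(tokens)

    log_prior = {c: math.log(n / len(labeled_docs)) for c, n in docs_per_class.items()}
    log_cond = {}
    for c, counts in term_counts.items():
        denom = sum(counts.values()) + len(vocabulary)
        log_cond[c] = {t: math.log((counts[t] + 1) / denom) for t in vocabulary}
    return vocabulary, log_prior, log_cond

def classify(tokens, vocabulary, log_prior, log_cond):
    """Return the class with the largest log score, plus all scores."""
    scores = {}
    for c in log_prior:
        # Assumption of this sketch: tokens outside the training vocabulary are skipped.
        scores[c] = log_prior[c] + sum(log_cond[c][t] for t in tokens if t in vocabulary)
    return max(scores, key=scores.get), scores

# Hypothetical toy data for illustration only.
training = [
    (["cheap", "pills", "buy", "cheap"], "spam"),
    (["buy", "now", "cheap"], "spam"),
    (["meeting", "agenda", "now"], "ham"),
]
model = train(training)
label, scores = classify(["cheap", "pills", "meeting"], *model)
print(label, scores)
```

Training touches every token once and testing does one table lookup per token per class, which is where the linear running time comes from.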

Naïve Bayes: example } Training phase: } Estimate parameters of Naive Bayes classifier } Test phase: } Classifying the test doc

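Naïve Bayes: example ($C = \text{China}$) } Estimating parameters } $\hat{P}(C) = \frac{3}{4}$, $\hat{P}(\bar{C}) = \frac{1}{4}$ } $\hat{P}(\text{CHINESE} \mid C) = \frac{5+1}{8+6} = \frac{6}{14}$, $\hat{P}(\text{CHINESE} \mid \bar{C}) = \frac{1+1}{3+6} = \frac{2}{9}$ } $\hat{P}(\text{TOKYO} \mid C) = \frac{0+1}{8+6} = \frac{1}{14}$, $\hat{P}(\text{TOKYO} \mid \bar{C}) = \frac{1+1}{3+6} = \frac{2}{9}$ } $\hat{P}(\text{JAPAN} \mid C) = \frac{0+1}{8+6} = \frac{1}{14}$, $\hat{P}(\text{JAPAN} \mid \bar{C}) = \frac{1+1}{3+6} = \frac{2}{9}$ } Classifying the test doc: } $\hat{P}(C \mid d) \propto \frac{3}{4} \times \left(\frac{6}{14}\right)^3 \times \frac{1}{14} \times \frac{1}{14} \approx 0.0003$ } $\hat{P}(\bar{C} \mid d) \propto \frac{1}{4} \times \left(\frac{2}{9}\right)^3 \times \frac{2}{9} \times \frac{2}{9} \approx 0.0001$ } $\hat{c} = C$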
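The arithmetic of this example can be reproduced with exact fractions. The sketch below uses only the counts shown on the slide; the variable names are hypothetical, and the test doc (CHINESE three times, then TOKYO and JAPAN) is inferred from the factors in the products rather than stated on the slide.

```python
from fractions import Fraction

# Estimates as given on the slide (add-one smoothing, vocabulary size 6).
prior_c, prior_not_c = Fraction(3, 4), Fraction(1, 4)
p_chinese_c = Fraction(5 + 1, 8 + 6)                 # 6/14 = 3/7
p_tokyo_c = p_japan_c = Fraction(0 + 1, 8 + 6)       # 1/14
p_chinese_not_c = p_tokyo_not_c = p_japan_not_c = Fraction(1 + 1, 3 + 6)  # 2/9

# Test doc inferred from the products on the slide: CHINESE x3, TOKYO, JAPAN.
score_c = prior_c * p_chinese_c**3 * p_tokyo_c * p_japan_c
score_not_c = prior_not_c * p_chinese_not_c**3 * p_tokyo_not_c * p_japan_not_c

print(float(score_c), float(score_not_c))  # roughly 0.0003 and 0.0001
print("predicted class:", "China" if score_c > score_not_c else "not China")
```

The three occurrences of CHINESE outweigh the two negative indicators TOKYO and JAPAN, so the doc is assigned to China.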

Naïve Bayes: training

Naïve Bayes: test

Time complexity of Naive Bayes } Training: $\Theta(|D| L_{ave} + |\mathbb{C}||V|)$ } Testing: $\Theta(L_a + |\mathbb{C}| M_a) = \Theta(|\mathbb{C}| M_a)$ } Generally: $|\mathbb{C}||V| < |D| L_{ave}$ } $D$: training set, $V$: vocabulary, $\mathbb{C}$: set of classes } $L_{ave}$: average length of a training doc } $L_a$: length of the test doc } $M_a$: number of distinct terms in the test doc } Thus: Naive Bayes is linear in the size of the training set (training) and the test doc (testing). } This is optimal time.

Why does Naive Bayes work? } The independence assumptions do not really hold for docs written in natural language. } Naive Bayes can work well even though these assumptions are badly violated. } Classification is about predicting the correct class and not about accurately estimating probabilities. } Naive Bayes is terrible for correct estimation ... } but it often performs well at choosing the correct class.

Naive Bayes is not so naive } Naive Bayes has won some bakeoffs (e.g., KDD-CUP 97) } A good dependable baseline for text classification (but not the best) } Optimal if independence assumptions hold (never true for text, but true for some domains) } More robust to non-relevant features than some more complex learning methods } More robust to concept drift (changing of definition of class over time) than some more complex learning methods } Very fast } Low storage requirements

Resources } Chapter 13 of IIR (Introduction to Information Retrieval)
