SLIDE 1

Data Mining 2020 Text Classification Naive Bayes

Ad Feelders

Universiteit Utrecht

SLIDE 2

Text Mining

Text Mining is data mining applied to text data. Often uses well-known data mining algorithms. Text data requires substantial pre-processing. This typically results in a large number of attributes (for example, the size of the dictionary).

SLIDE 3

Text Classification

Predict the class(es) of text documents. Can be single-label or multi-label. Multi-label classification is often performed by building multiple binary classifiers (one for each possible class). Examples of text classification:

• topics of news articles,
• spam/no spam for e-mail messages,
• sentiment analysis (e.g. positive/negative review),
• opinion spam (e.g. fake reviews),
• music genre from song lyrics.

SLIDE 4

Is this Rap, Blues, Metal, or Country?

Blasting our way through the boundaries of Hell
No one can stop us tonight
We take on the world with hatred inside
Mayhem the reason we fight
Surviving the slaughters and killing we’ve lost
Then we return from the dead
Attacking once more now with twice as much strength
We conquer then move on ahead

[Chorus:]
Evil My words defy
Evil Has no disguise
Evil Will take your soul
Evil My wrath unfolds

Satan our master in evil mayhem
Guides us with every first step
Our axes are growing with power and fury
Soon there’ll be nothingness left
Midnight has come and the leathers strapped on
Evil is at our command
We clash with God’s angel and conquer new souls
Consuming all that we can

SLIDE 5

Probabilistic Classifier

A probabilistic classifier assigns a probability to each class. In case a class prediction is required we typically predict the class with highest probability:

$$\hat{c} = \arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)}$$

where $d$ is a document, and $C$ is the set of all possible class labels.

Since $P(d) = \sum_{c \in C} P(c, d)$ is the same for all classes, we can ignore the denominator:

$$\hat{c} = \arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} P(d \mid c)\,P(c)$$

SLIDE 6

Naive Bayes

Represent a document as a set of features:

$$\hat{c} = \arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} P(x_1, \ldots, x_m \mid c)\,P(c)$$

Naive Bayes assumption:

$$P(x_1, \ldots, x_m \mid c) = P(x_1 \mid c)\,P(x_2 \mid c) \cdots P(x_m \mid c)$$

The features are assumed to be independent within each class (avoiding the curse of dimensionality). This gives

$$c_{NB} = \arg\max_{c \in C} P(c) \prod_{i=1}^{m} P(x_i \mid c)$$

SLIDE 7

Independence Graph of Naive Bayes

[Figure: directed graph with the class node $C$ pointing to the feature nodes $X_1, X_2, \ldots, X_m$.]

SLIDE 8

Bag Of Words Representation of a Document

"I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!"

Word frequencies extracted from the review:

it 6, I 5, the 4, to 3, and 3, seen 2, yet 1, would 1, whimsical 1, times 1, sweet 1, satirical 1, adventure 1, genre 1, fairy 1, humor 1, have 1, great 1, ...

Figure 6.1: Intuition of the multinomial naive Bayes classifier applied to a movie review. The position of the words is ignored (the bag of words assumption) and we make use of the frequency of each word.

SLIDE 9

Bag Of Words Representation of a Document

Not matter, the order and position do.

SLIDE 10

Multinomial Naive Bayes for Text

Represent document $d$ as a sequence of words: $d = w_1, w_2, \ldots, w_n$. Then

$$c_{NB} = \arg\max_{c \in C} P(c) \prod_{k=1}^{n} P(w_k \mid c)$$

Notice that $P(w \mid c)$ is independent of word position or word order, so $d$ is truly represented as a bag of words. Taking the log we obtain:

$$c_{NB} = \arg\max_{c \in C} \left( \log P(c) + \sum_{k=1}^{n} \log P(w_k \mid c) \right)$$

By the way, why is it allowed to take the log?
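In practice the log form also avoids numerical underflow, and since log is strictly increasing it does not change the arg max. A small illustration in R (the numbers are made up):

> p <- rep(1e-4, 200)   # 200 hypothetical small word probabilities
> prod(p)               # the raw product underflows to zero
[1] 0
> sum(log(p))           # the log score is still perfectly usable for arg max
[1] -1842.068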

SLIDE 11

Multinomial Naive Bayes for Text

Consider the text (perhaps after some pre-processing)

catch as catch can

We have $d = \text{catch, as, catch, can}$, with $w_1 = \text{catch}$, $w_2 = \text{as}$, $w_3 = \text{catch}$, and $w_4 = \text{can}$. Suppose we have two classes, say $C = \{+, -\}$; then for this document:

$$c_{NB} = \arg\max_{c \in \{+,-\}} \Big( \log P(c) + \log P(\text{catch} \mid c) + \log P(\text{as} \mid c) + \log P(\text{catch} \mid c) + \log P(\text{can} \mid c) \Big)$$
$$= \arg\max_{c \in \{+,-\}} \Big( \log P(c) + 2 \log P(\text{catch} \mid c) + \log P(\text{as} \mid c) + \log P(\text{can} \mid c) \Big)$$
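A minimal R sketch of this scoring rule (the function nb.score and its arguments are illustrative, not part of the lecture code):

> nb.score <- function(words, logprior, logcondprob) {
+   # logcondprob: named vector with log P(w | c) for one class c
+   logprior + sum(logcondprob[words])
+ }
> d <- c("catch", "as", "catch", "can")
> # summing over d makes the repeated word "catch" contribute log P(catch | c) twice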

SLIDE 12

Training Multinomial Naive Bayes

Class priors:

$$\hat{P}(c) = \frac{N_c}{N_{doc}}$$

Word probabilities within each class:

$$\hat{P}(w_i \mid c) = \frac{\mathrm{count}(w_i, c)}{\sum_{w_j \in V} \mathrm{count}(w_j, c)} \quad \text{for all } w_i \in V,$$

where $V$ (for Vocabulary) denotes the collection of all words that occur in the training corpus (after possibly extensive pre-processing).

Verify that $\sum_{w_i \in V} \hat{P}(w_i \mid c) = 1$, as required.

SLIDE 13

Interpretation of word probabilities

Word probabilities within each class:

$$\hat{P}(w_i \mid c) = \frac{\mathrm{count}(w_i, c)}{\sum_{w_j \in V} \mathrm{count}(w_j, c)} \quad \text{for all } w_i \in V$$

Interpretation: if we draw a word at random from a document of class $c$, the probability that we draw $w_i$ is $\hat{P}(w_i \mid c)$.

SLIDE 14

Training Multinomial Naive Bayes: Smoothing

Perform smoothing to avoid zero probability estimates. Word probabilities within each class with Laplace smoothing are:

$$\hat{P}(w_i \mid c) = \frac{\mathrm{count}(w_i, c) + 1}{\sum_{w_j \in V} \big(\mathrm{count}(w_j, c) + 1\big)} = \frac{\mathrm{count}(w_i, c) + 1}{\sum_{w_j \in V} \mathrm{count}(w_j, c) + |V|}$$

Verify that again $\sum_{w_i \in V} \hat{P}(w_i \mid c) = 1$, as required.

The +1 is also called a pseudo-count: pretend you already observed one occurrence of each word in each class.
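A small R sketch of Laplace smoothing (the helper function and the tiny three-word vocabulary are made up for illustration):

> laplace.probs <- function(counts) (counts + 1) / (sum(counts) + length(counts))
> counts <- c(good=3, bad=1, fun=0)    # hypothetical word counts for one class, |V| = 3
> laplace.probs(counts)
     good       bad       fun
0.5714286 0.2857143 0.1428571
> sum(laplace.probs(counts))           # the smoothed estimates still sum to 1
[1] 1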

SLIDE 15

Worked Example: Movie Reviews

          Cat  Documents
Training   −   just plain boring
           −   entirely predictable and lacks energy
           −   no surprises and very few laughs
           +   very powerful
           +   the most fun film of the summer
Test       ?   predictable with no fun

SLIDE 16

Class Prior Probabilities

Recall that

$$\hat{P}(c) = \frac{N_c}{N_{doc}}$$

So we get:

$$\hat{P}(+) = \frac{2}{5}, \qquad \hat{P}(-) = \frac{3}{5}$$

SLIDE 17

Word Conditional Probabilities

To classify the test example, we need the following probability estimates:

$$\hat{P}(\text{predictable} \mid -) = \frac{1+1}{14+20} = \frac{1}{17} \qquad \hat{P}(\text{predictable} \mid +) = \frac{0+1}{9+20} = \frac{1}{29}$$
$$\hat{P}(\text{no} \mid -) = \frac{1+1}{14+20} = \frac{1}{17} \qquad \hat{P}(\text{no} \mid +) = \frac{0+1}{9+20} = \frac{1}{29}$$
$$\hat{P}(\text{fun} \mid -) = \frac{0+1}{14+20} = \frac{1}{34} \qquad \hat{P}(\text{fun} \mid +) = \frac{1+1}{9+20} = \frac{2}{29}$$

Classification:

$$\hat{P}(-)\,\hat{P}(\text{predictable no fun} \mid -) = \frac{3}{5} \times \frac{1}{17} \times \frac{1}{17} \times \frac{1}{34} = \frac{3}{49{,}130}$$
$$\hat{P}(+)\,\hat{P}(\text{predictable no fun} \mid +) = \frac{2}{5} \times \frac{1}{29} \times \frac{1}{29} \times \frac{2}{29} = \frac{4}{121{,}945}$$

The model predicts class negative for the test review.
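The same computation can be checked in R (variable names are illustrative; vocabulary size 20, with 14 and 9 word tokens in the negative and positive class):

> V <- 20                                       # vocabulary size of the toy corpus
> prior <- c(neg=3/5, pos=2/5)
> total <- c(neg=14, pos=9)                     # total word tokens per class
> counts <- rbind(predictable=c(1,0), no=c(1,0), fun=c(0,1))   # counts per class (neg, pos)
> condprob <- (counts + 1) / matrix(total + V, nrow=3, ncol=2, byrow=TRUE)
> scores <- prior * apply(condprob, 2, prod)    # approx 6.1e-05 (neg) vs 3.3e-05 (pos)
> names(which.max(scores))                      # "neg": the test review is classified as negative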

SLIDE 18

Why smoothing?

If we don’t use smoothing, the estimates are:

$$\hat{P}(\text{predictable} \mid -) = \frac{1}{14} \qquad \hat{P}(\text{predictable} \mid +) = \frac{0}{9} = 0$$
$$\hat{P}(\text{no} \mid -) = \frac{1}{14} \qquad \hat{P}(\text{no} \mid +) = \frac{0}{9} = 0$$
$$\hat{P}(\text{fun} \mid -) = \frac{0}{14} = 0 \qquad \hat{P}(\text{fun} \mid +) = \frac{1}{9}$$

Classification:

$$\hat{P}(-)\,\hat{P}(\text{predictable no fun} \mid -) = \frac{3}{5} \times \frac{1}{14} \times \frac{1}{14} \times 0 = 0$$
$$\hat{P}(+)\,\hat{P}(\text{predictable no fun} \mid +) = \frac{2}{5} \times 0 \times 0 \times \frac{1}{9} = 0$$

Both class scores are zero, so the normalized estimate of $P(c \mid d)$ is undefined (division by zero) and no sensible prediction can be made.

SLIDE 19

Multinomial Naive Bayes: Training

TrainMultinomialNB(C, D)
 1  V ← ExtractVocabulary(D)
 2  Ndoc ← CountDocs(D)
 3  for each c ∈ C
 4  do Nc ← CountDocsInClass(D, c)
 5     prior[c] ← Nc / Ndoc
 6     textc ← ConcatenateTextOfAllDocsInClass(D, c)
 7     for each w ∈ V
 8     do countcw ← CountWordOccurrence(textc, w)
 9     for each w ∈ V
10     do condprob[w][c] ← (countcw + 1) / Σw′∈V (countcw′ + 1)
11  return V, prior, condprob

SLIDE 20

Multinomial Naive Bayes: Prediction

Predict the class of a document d.

ApplyMultinomialNB(C, V, prior, condprob, d)
1  W ← ExtractWordOccurrencesFromDoc(V, d)
2  for each c ∈ C
3  do score[c] ← log prior[c]
4     for each w ∈ W
5     do score[c] += log condprob[w][c]
6  return arg maxc∈C score[c]

SLIDE 21

Violation of Naive Bayes independence assumptions

The multinomial naive Bayes model makes two kinds of independence assumptions:

1. Conditional independence:

$$P(w_1, \ldots, w_n \mid c) = \prod_{k=1}^{n} P(W_k = w_k \mid c)$$

2. Positional independence:

$$P(W_{k_1} = w \mid c) = P(W_{k_2} = w \mid c)$$

These independence assumptions do not really hold for documents written in natural language. How can naive Bayes get away with such heroic assumptions?

SLIDE 22

Why does Naive Bayes work?

Naive Bayes can work well even though independence assumptions are badly violated. Example:

                                            c1        c2        predicted
true probability $P(c \mid d)$              0.6       0.4       c1
$\hat{P}(c) \prod_k \hat{P}(w_k \mid c)$    0.00099   0.00001
NB estimate $\hat{P}(c \mid d)$             0.99      0.01      c1

Double counting of evidence causes underestimation (0.01) and overestimation (0.99).

Classification is about predicting the correct class, not about accurate estimation.

SLIDE 23

Double counting of evidence

Suppose the words special and effects always occur together in a movie review: either both occur in the review, or neither occurs. The independence assumption is badly violated! Let P(special effects | pos) = 0.01 and P(special effects | neg) = 0.001. The occurrence of special effects in a review is evidence in favor of the positive class, because its probability under the positive class is 10 times as big as under the negative class. But naive Bayes will count this evidence twice, namely when it sees special and when it sees effects.
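The effect in numbers (a sketch; it assumes P(special | c) = P(effects | c) = P(special effects | c), which is roughly what the training counts give when the two words always co-occur):

> p.pos <- 0.01; p.neg <- 0.001       # P(special effects | pos), P(special effects | neg)
> p.pos / p.neg                       # evidence ratio of the phrase, counted once: 10
[1] 10
> (p.pos * p.pos) / (p.neg * p.neg)   # naive Bayes multiplies P(special|c) and P(effects|c): 100
[1] 100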

SLIDE 24

Naive Bayes is not so naive

• Probability estimates may be way off, but that doesn’t have to hurt classification performance (much).
• Requires the estimation of relatively few parameters, which may be beneficial if you have a small training set.
• Fast, low storage requirements.

SLIDE 25

Feature Selection

The vocabulary of a training corpus may be huge, but not all words will be good class predictors. How can we reduce the number of features? Feature utility measures:

• Frequency – select the most frequent terms.
• Mutual information – select the terms that have the highest mutual information with the class label.
• Chi-square test of independence between term and class label.

Sort features by utility and select top k. Can we miss good sets of features this way?

SLIDE 26

Entropy

Entropy is the average amount of information generated by observing the value of a random variable:

$$H(X) = \sum_{x} P(x) \log_2 \frac{1}{P(x)} = -\sum_{x} P(x) \log_2 P(x)$$

We can also interpret it as a measure of the uncertainty about the value of $X$ prior to observation.

Compare the weather forecast in the Netherlands ($P(\text{sunny}) = 0.5$, $P(\text{rain}) = 0.5$):

$$P(\text{sunny}) \log_2 \frac{1}{P(\text{sunny})} + P(\text{rain}) \log_2 \frac{1}{P(\text{rain})} = 0.5 \log_2 2 + 0.5 \log_2 2 = 1 \text{ bit}.$$

On the Canary Islands ($P(\text{sunny}) = 0.9$, $P(\text{rain}) = 0.1$):

$$0.9 \log_2 1.11 + 0.1 \log_2 10 \approx 0.47 \text{ bits}.$$
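The same calculation in R (the helper function H is just for illustration and assumes all probabilities are strictly positive):

> H <- function(p) -sum(p * log2(p))   # entropy in bits
> H(c(0.5, 0.5))                       # the Netherlands
[1] 1
> H(c(0.9, 0.1))                       # the Canary Islands
[1] 0.4689956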

SLIDE 27

Conditional Entropy

Conditional entropy:

$$H(X \mid Y) = \sum_{x,y} P(x, y) \log_2 \frac{1}{P(x \mid y)} = -\sum_{x,y} P(x, y) \log_2 P(x \mid y)$$

A measure of the uncertainty about the value of $X$ after observing the value of $Y$.

If $X$ and $Y$ are independent, then $H(X) = H(X \mid Y)$. Example: gender and eye color.

SLIDE 28

Mutual Information

For random variables $X$ and $Y$, their mutual information is given by

$$I(X; Y) = H(X) - H(X \mid Y) = \sum_{x} \sum_{y} P(x, y) \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$

Mutual information measures the reduction in uncertainty about $X$ achieved by observing the value of $Y$ (and vice versa). If $X$ and $Y$ are independent, then for all $x, y$ we have $P(x, y) = P(x)P(y)$, so $I(X; Y) = 0$. Otherwise $I(X; Y)$ is a positive quantity, and the larger its value, the stronger the association.

SLIDE 29

Estimated Mutual Information

To estimate $I(X; Y)$ from data we compute

$$I(X; Y) = \sum_{x} \sum_{y} \hat{P}(x, y) \log_2 \frac{\hat{P}(x, y)}{\hat{P}(x)\,\hat{P}(y)}, \quad \text{where} \quad \hat{P}(x, y) = \frac{n(x, y)}{N}, \quad \hat{P}(x) = \frac{n(x)}{N},$$

and $n(x, y)$ denotes the number of records with $X = x$ and $Y = y$. Plugging in these estimates we get:

$$I(X; Y) = \sum_{x} \sum_{y} \frac{n(x, y)}{N} \log_2 \frac{n(x, y)/N}{(n(x)/N)\,(n(y)/N)} = \sum_{x} \sum_{y} \frac{n(x, y)}{N} \log_2 \frac{N \times n(x, y)}{n(x) \times n(y)}$$

SLIDE 30

Estimated Mutual Information

Mutual information between occurrence of the word “bad” and class (negative/positive review):

bad \ class      0 (neg)   1 (pos)   Total
0 (absent)          5243      7080   12323
1 (present)         2757       920    3677
Total               8000      8000   16000

$$I(\text{bad}; \text{class}) = \frac{5243}{16000}\log_2\frac{16000 \times 5243}{12323 \times 8000} + \frac{7080}{16000}\log_2\frac{16000 \times 7080}{12323 \times 8000} + \frac{2757}{16000}\log_2\frac{16000 \times 2757}{3677 \times 8000} + \frac{920}{16000}\log_2\frac{16000 \times 920}{3677 \times 8000} \approx 0.056$$

Fun fact: the estimated mutual information is equal to the deviance of the independence model divided by $2N$ (if we take the log with base 2 in computing the deviance).
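This number can be checked in R with the entropy package (its mi.plugin function is also used on a later slide); the 2×2 table is the one shown above:

> library(entropy)
> counts <- matrix(c(5243, 2757, 7080, 920), nrow=2,
+                  dimnames=list(bad=c(0,1), class=c(0,1)))
> mi.plugin(counts/sum(counts), unit="log2")
[1] 0.05568853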

SLIDE 31

Movie Reviews: IMDB Review Dataset

Collection of 50,000 reviews from IMDB, allowing no more than 30 reviews per movie. Contains an equal number of positive and negative reviews, so random guessing yields 50% accuracy. Considers only highly polarized reviews: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included in the dataset.

Andrew L. Maas et al., Learning Word Vectors for Sentiment Analysis, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, 2011. Data available at: http://ai.stanford.edu/~amaas/data/sentiment/

SLIDE 32

Analysis of Movie Reviews in R

# load the tm package
> library(tm)
# Read in the data using UTF-8 encoding
> reviews.neg <- VCorpus(DirSource("D:/MovieReviews/train/neg", encoding="UTF-8"))
> reviews.pos <- VCorpus(DirSource("D:/MovieReviews/train/pos", encoding="UTF-8"))
# Join negative and positive reviews into a single Corpus
> reviews.all <- c(reviews.neg, reviews.pos)
# create label vector (0=negative, 1=positive)
> labels <- c(rep(0,12500), rep(1,12500))
> reviews.all
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 25000

SLIDE 33

Analysis of Movie Reviews

The first review before pre-processing:

> as.character(reviews.all[[1]])
[1] "Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it’s singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it’s better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."

SLIDE 34

Analysis of Movie Reviews: Pre-Processing

# Remove punctuation marks (commas, etc.)
> reviews.all <- tm_map(reviews.all, removePunctuation)
# Make all letters lower case
> reviews.all <- tm_map(reviews.all, content_transformer(tolower))
# Remove stopwords
> reviews.all <- tm_map(reviews.all, removeWords, stopwords("english"))
# Remove numbers
> reviews.all <- tm_map(reviews.all, removeNumbers)
# Remove excess whitespace
> reviews.all <- tm_map(reviews.all, stripWhitespace)

Not done: stemming, part-of-speech tagging, ...

SLIDE 35

Analysis of Movie Reviews

The first review after pre-processing:

> as.character(reviews.all[[1]])
[1] "story man unnatural feelings pig starts opening scene terrific example absurd comedy formal orchestra audience turned insane violent mob crazy chantings singers unfortunately stays absurd whole time general narrative eventually making just putting even era turned cryptic dialogue make shakespeare seem easy third grader technical level better might think good cinematography future great vilmos zsigmond future stars sally kirkland frederic forrest can seen briefly"

SLIDE 36

Analysis of Movie Reviews

# draw training sample (stratified)
# draw 8000 negative reviews at random
> index.neg <- sample(12500, 8000)
# draw 8000 positive reviews at random
> index.pos <- 12500 + sample(12500, 8000)
> index.train <- c(index.neg, index.pos)
# create document-term matrix from training corpus
> train.dtm <- DocumentTermMatrix(reviews.all[index.train])
> dim(train.dtm)
[1] 16000 92819

We’ve got 92,819 features. Perhaps this is a bit too much.

# remove terms that occur in less than 5% of the documents
# (so-called sparse terms)
> train.dtm <- removeSparseTerms(train.dtm, 0.95)
> dim(train.dtm)
[1] 16000   306
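To get a feel for which terms survive the sparsity filter, one could peek at the retained vocabulary (a hedged sketch; Terms and findFreqTerms are standard tm functions, and the frequency threshold is arbitrary):

> head(Terms(train.dtm), 10)                # a few of the 306 retained terms
> findFreqTerms(train.dtm, lowfreq=5000)    # retained terms with total count of at least 5000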

SLIDE 37

Analysis of Movie Reviews

# view a small part of the document-term matrix
> inspect(train.dtm[100:110, 80:85])
<<DocumentTermMatrix (documents: 11, terms: 6)>>
Non-/sparse entries: 7/59
Sparsity           : 89%
Maximal term length: 6
Weighting          : term frequency (tf)
Sample (Terms: family fan far father feel felt):
Docs          counts
10099_1.txt   1
1033_4.txt    1
10718_4.txt
11182_3.txt
11861_4.txt   1
3014_4.txt    1 1 2
315_1.txt
6482_2.txt    1
9577_1.txt
9674_3.txt

SLIDE 38

Multinomial naive Bayes in R: Training

> train.mnb
function (dtm, labels)
{
  call <- match.call()
  V <- ncol(dtm)
  N <- nrow(dtm)
  prior <- table(labels)/N
  labelnames <- names(prior)
  nclass <- length(prior)
  cond.probs <- matrix(nrow=V, ncol=nclass)
  dimnames(cond.probs)[[1]] <- dimnames(dtm)[[2]]
  dimnames(cond.probs)[[2]] <- labelnames
  index <- list(length=nclass)
  for(j in 1:nclass){
    index[[j]] <- c(1:N)[labels == labelnames[j]]
  }
  for(i in 1:V){
    for(j in 1:nclass){
      cond.probs[i,j] <- (sum(dtm[index[[j]],i])+1)/(sum(dtm[index[[j]],])+V)
    }
  }
  list(call=call, prior=prior, cond.probs=cond.probs)
}

SLIDE 39

Multinomial naive Bayes in R: Prediction

> predict.mnb
function (model, dtm)
{
  classlabels <- dimnames(model$cond.probs)[[2]]
  logprobs <- dtm %*% log(model$cond.probs)
  N <- nrow(dtm)
  nclass <- ncol(model$cond.probs)
  logprobs <- logprobs + matrix(nrow=N, ncol=nclass, log(model$prior), byrow=T)
  classlabels[max.col(logprobs)]
}

SLIDE 40

Application of Multinomial naive Bayes to Movie Reviews

# Train multinomial naive Bayes model
> reviews.mnb <- train.mnb(as.matrix(train.dtm), labels[index.train])
# create document term matrix for test set
> test.dtm <- DocumentTermMatrix(reviews.all[-index.train],
                                 list(dictionary=dimnames(train.dtm)[[2]]))
> dim(test.dtm)
[1] 9000 306
> reviews.mnb.pred <- predict.mnb(reviews.mnb, as.matrix(test.dtm))
> table(reviews.mnb.pred, labels[-index.train])
reviews.mnb.pred    0    1
               0 3473  849
               1 1027 3651
# compute accuracy on test set: about 79% correct
> (3473+3651)/9000
[1] 0.7915556
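Equivalently, and less error-prone than typing the cell counts by hand, the accuracy can be computed directly from the predictions (a small sketch):

> conf <- table(reviews.mnb.pred, labels[-index.train])
> sum(diag(conf)) / sum(conf)
[1] 0.7915556
> mean(reviews.mnb.pred == labels[-index.train])
[1] 0.7915556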

SLIDE 41

Feature Selection with Mutual Information

The top-10 features (terms) according to mutual information are:

term        MI(term, class)
bad                   0.056
worst                 0.052
waste                 0.035
awful                 0.032
great                 0.028
terrible              0.020
excellent             0.020
wonderful             0.018
boring                0.018
stupid                0.018

SLIDE 42

Computing Mutual Information

# load library "entropy"
> library(entropy)
# convert document term matrix to binary (term present/absent)
> train.dtm.bin <- as.matrix(train.dtm) > 0
# compute mutual information of each term with class label
> train.mi <- apply(as.matrix(train.dtm.bin), 2,
                    function(x,y){mi.plugin(table(x,y)/length(y), unit="log2")},
                    labels[index.train])
# sort the indices from high to low mutual information
> train.mi.order <- order(train.mi, decreasing=T)
# show the five terms with highest mutual information
> train.mi[train.mi.order[1:5]]
       bad      worst      waste      awful      great
0.05568853 0.05161474 0.03456289 0.03168221 0.02807607

SLIDE 43

Using the top-50 features

# train on the 50 best features
> revs.mnb.top50 <- train.mnb(as.matrix(train.dtm)[,train.mi.order[1:50]],
                              labels[index.train])
# predict on the test set
> revs.mnb.top50.pred <- predict.mnb(revs.mnb.top50,
                                     as.matrix(test.dtm)[,train.mi.order[1:50]])
# show the confusion matrix
> table(revs.mnb.top50.pred, labels[-index.train])
revs.mnb.top50.pred    0    1
                  0 3429  996
                  1 1071 3504
# accuracy is a bit worse compared to using all features
> (3429+3504)/9000
[1] 0.7703333

SLIDE 44

Feature score in model

The score of word $w_k$ for the positive class is $\log \hat{P}(w_k \mid \text{pos}) - \log \hat{P}(w_k \mid \text{neg})$.

The top-20 feature scores (by absolute value) are:

word          score     word           score
waste         −2.71     poor           −1.24
awful         −2.28     loved           1.18
worst         −2.17     beautiful       0.93
terrible      −1.72     minutes        −0.91
wonderful      1.61     great           0.90
stupid        −1.55     money          −0.83
boring        −1.49     nothing        −0.81
excellent      1.41     best            0.76
bad           −1.33     performances    0.76
perfect        1.33     script         −0.74
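A sketch of how these scores can be read off the model trained on slide 40 (it assumes the column names of reviews.mnb$cond.probs are the label names "0" and "1", as produced by train.mnb):

> score <- log(reviews.mnb$cond.probs[, "1"]) - log(reviews.mnb$cond.probs[, "0"])
> score[order(abs(score), decreasing=TRUE)][1:20]   # the top-20 words by absolute score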

SLIDE 45

Classification Trees

# load the required packages
> library(rpart)
> library(rpart.plot)
# grow the tree
> reviews.rpart <- rpart(label~., data=data.frame(as.matrix(train.dtm),
                         label=labels[index.train]), cp=0, method="class")
# plot cv-error of pruning sequence
> plotcp(reviews.rpart)

SLIDE 46

Cross-Validation Error of Pruning Sequence

[Plot produced by plotcp: cross-validated relative error (y-axis, roughly 0.4 to 1.1) against the complexity parameter cp (lower x-axis, from Inf down to 3e−05) and the corresponding size of the tree (upper x-axis, 1 to 456).]

SLIDE 47

Classification Trees

# simple tree for plotting
> reviews.rpart.pruned <- prune(reviews.rpart, cp=5.0000e-03)
> rpart.plot(reviews.rpart.pruned)
# tree with lowest cv error
> reviews.rpart.pruned <- prune(reviews.rpart, cp=5.833333e-04)
# make predictions on the test set
> reviews.rpart.pred <- predict(reviews.rpart.pruned,
                                newdata=data.frame(as.matrix(test.dtm)), type="class")
# show confusion matrix
> table(reviews.rpart.pred, labels[-index.train])
reviews.rpart.pred    0    1
                 0 3150 1021
                 1 1350 3479
# accuracy is worse than naive Bayes!
> (3150+3479)/9000
[1] 0.7365556

SLIDE 48

The Simple Tree

[Tree plot produced by rpart.plot: the pruned tree splits on bad >= 1, worst >= 1, waste >= 1, awful >= 1, great < 1, and nothing >= 1; each node shows the estimated probability of a positive review and the percentage of training documents it covers (root node: 0.50, 100%).]

SLIDE 49

Random Forests

# load the required packages
> library(randomForest)
# train random forest with default settings: 500 trees and mtry = 17
> reviews.rf <- randomForest(as.factor(label)~.,
                             data=data.frame(as.matrix(train.dtm), label=labels[index.train]))
# make predictions
> reviews.rf.pred <- predict(reviews.rf, newdata=data.frame(as.matrix(test.dtm)))
# show confusion matrix
> table(reviews.rf.pred, labels[-index.train])
reviews.rf.pred    0    1
              0 3483  824
              1 1017 3676
# compute accuracy: only slightly better than naive Bayes!
> (3483+3676)/9000
[1] 0.7954444
