Web Information Retrieval
Lecture 14 Text classification
Outline (Sec. 13.1):
Text Classification
Naïve Bayes Classification
Vector space methods for Text Classification
K Nearest Neighbors
Decision boundaries
Linear Classifiers
Bayes' Rule: for events A and B,
P(A|B) = P(B|A) P(A) / P(B)
(P(A|B) is the posterior; P(A) is the prior)
Our focus this lecture: learning and classification methods based on probability theory.
Bayes' theorem plays a critical role in probabilistic
learning and classification.
Builds a generative model that approximates how data is
produced
Uses prior probability of each category given no
information about an item.
Categorization produces a posterior probability
distribution over the possible categories given a description of an item.
Sec.13.2
For a document d and a class c:
P(c) = probability that we see a document of class c
P(d) = probability that we see document d
P(c|d) = P(d|c) P(c) / P(d)
Sec.13.2
Task: classify a new instance d, described by a tuple of attribute values
d = ⟨x_1, x_2, …, x_n⟩, into one of the classes c_j ∈ C.

c_MAP = argmax_{c_j ∈ C} P(c_j | x_1, x_2, …, x_n)
      = argmax_{c_j ∈ C} P(x_1, x_2, …, x_n | c_j) P(c_j) / P(x_1, x_2, …, x_n)
      = argmax_{c_j ∈ C} P(x_1, x_2, …, x_n | c_j) P(c_j)
Sec.13.2
MAP is “maximum a posteriori” = most likely class
P(cj)
Can be estimated from the frequency of classes in the
training examples.
P(x1,x2,…,xn|cj)
O(|X|^n · |C|) parameters. Could only be estimated if a very, very large number of
training examples were available.
Naïve Bayes Conditional Independence Assumption:
assume that the probability of observing the conjunction of attributes
is equal to the product of the individual probabilities P(x_i | c_j).
Sec.13.2
[Figure: the Flu example, class variable Flu with features X1…X5: fever, sinus, cough, runny nose, muscle-ache]
Conditional Independence Assumption:
features detect term presence and are independent of each other given the class:
P(X_1, …, X_5 | C) = P(X_1|C) · P(X_2|C) · … · P(X_5|C)
This model is appropriate for binary variables:
the multivariate Bernoulli model.
Sec.13.3
First attempt: maximum likelihood estimates
simply use the frequencies in the data
P̂(c_j) = N(C = c_j) / N
P̂(x_i | c_j) = N(X_i = x_i, C = c_j) / N(C = c_j)
[Figure: class C with features X1…X6]
Sec.13.3
What if we have seen no training documents with the word muscle-ache that are classified in the topic Flu?
Then the maximum likelihood estimate gives P̂(muscle-ache | Flu) = 0.
Zero probabilities cannot be conditioned away, no matter the other evidence!
Sec.13.3
Laplace (add-one) smoothing:
P̂(x_i | c_j) = (N(X_i = x_i, C = c_j) + 1) / (N(C = c_j) + |Vocabulary|)
More advanced smoothing is possible
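A minimal sketch of computing such an add-one estimate, assuming hypothetical toy counts gathered from a training corpus (the words and numbers below are illustrative, not from the lecture):

```python
from collections import Counter

# Hypothetical counts of how often each word occurs in documents of class "flu";
# note that "muscle-ache" was never seen in this class.
counts_flu = Counter({"fever": 10, "cough": 7, "sinus": 4})
vocabulary = ["fever", "cough", "sinus", "runnynose", "muscle-ache"]

def smoothed_prob(word, counts, vocabulary):
    """Add-one (Laplace) estimate of P(word | class)."""
    total = sum(counts[w] for w in vocabulary)
    return (counts[word] + 1) / (total + len(vocabulary))

# The unseen word no longer gets probability zero:
print(smoothed_prob("muscle-ache", counts_flu, vocabulary))  # 1 / (21 + 5) ≈ 0.038
```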
Sec.13.3
Model probability of generating strings (each word
in turn) in a language (commonly all strings over alphabet ∑). E.g., a unigram model
Model M: P(the)=0.2, P(a)=0.1, P(man)=0.01, P(woman)=0.01, P(said)=0.03, P(likes)=0.02, …
s = "the man likes the woman"
P(s | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008 (multiply the unigram probabilities)
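A small sketch of scoring a string under such a unigram model (variable names are illustrative):

```python
# The unigram model M from the slide (word -> probability)
model_M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}

def string_prob(s, model):
    """P(s | M): multiply the unigram probability of each word in turn."""
    p = 1.0
    for word in s.split():
        p *= model.get(word, 0.0)  # unseen words get probability 0 here (no smoothing)
    return p

print(string_prob("the man likes the woman", model_M))  # ≈ 8e-08
```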
Sec.13.2.1
Model probability of generating any string
Model M1: P(the)=0.2, P(class)=0.01, P(sayst)=0.0001, P(pleaseth)=0.0001, P(yon)=0.0001, P(maiden)=0.0005, P(woman)=0.01
Model M2: P(the)=0.2, P(class)=0.0001, P(sayst)=0.03, P(pleaseth)=0.02, P(yon)=0.1, P(maiden)=0.01, P(woman)=0.0001

s = "maiden class pleaseth yon the"
P(s | M1) = 0.0005 × 0.01 × 0.0001 × 0.0001 × 0.2
P(s | M2) = 0.01 × 0.0001 × 0.02 × 0.1 × 0.2
P(s | M2) > P(s | M1)
Sec.13.2.1
Effectively, the probability of a document given each class is computed with a
class-specific unigram language model. [Figure: class C generating words w1 … w6]
Sec.13.2
Attributes are text positions, values are words.
Still too many possibilities.
Assume that classification is independent of the positions of the words:
use the same parameters for each position. The result is a bag-of-words model.
c_NB = argmax_{c_j ∈ C} P(c_j) ∏_i P(x_i | c_j)
Sec.13.2
From the training corpus, extract Vocabulary
Calculate the required P(c_j) and P(x_k | c_j) terms:
For each c_j in C do
  docs_j ← the subset of documents for which the target class is c_j
  P(c_j) ← |docs_j| / |total # of documents|
  Text_j ← a single document containing all of docs_j concatenated
  n ← total number of word positions in Text_j
  For each word x_k in Vocabulary
    n_k ← number of occurrences of x_k in Text_j
    P(x_k | c_j) ← (n_k + 1) / (n + |Vocabulary|)
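A compact sketch of this training procedure in Python, assuming the training data is a list of (token list, class label) pairs; the function and variable names are illustrative, not from the lecture, and log probabilities are stored for the reasons discussed later:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (tokens, class_label) pairs. Returns log priors and smoothed log conditionals."""
    vocabulary = {w for tokens, _ in docs for w in tokens}
    classes = {c for _, c in docs}
    log_prior, log_cond = {}, defaultdict(dict)
    for c in classes:
        docs_c = [tokens for tokens, label in docs if label == c]
        log_prior[c] = math.log(len(docs_c) / len(docs))      # P(c_j) = |docs_j| / # documents
        text_c = [w for tokens in docs_c for w in tokens]      # Text_j: all docs_j concatenated
        counts = Counter(text_c)
        n = len(text_c)                                        # word positions in Text_j
        for w in vocabulary:                                   # add-one smoothing
            log_cond[c][w] = math.log((counts[w] + 1) / (n + len(vocabulary)))
    return log_prior, log_cond, vocabulary
```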
Sec.13.2
positions ← all word positions in the current document that contain tokens found in Vocabulary
Return c_NB, where
c_NB = argmax_{c_j ∈ C} P(c_j) ∏_{i ∈ positions} P(x_i | c_j)
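A matching sketch of the test-time rule, paired with the hypothetical training function above (it sums log probabilities, which is equivalent to maximizing the product):

```python
def classify_naive_bayes(tokens, log_prior, log_cond, vocabulary):
    """Return the class maximizing log P(c) + sum over known tokens of log P(x_i | c)."""
    positions = [w for w in tokens if w in vocabulary]   # ignore out-of-vocabulary tokens
    best_class, best_score = None, float("-inf")
    for c in log_prior:
        score = log_prior[c] + sum(log_cond[c][w] for w in positions)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```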
Sec.13.2
Training Time: O(|D| L_ave + |C||V|), where L_ave is
the average length of a document in D.
Assumes all counts are pre-computed in O(|D| L_ave) time during one pass through all of the data.
Generally just O(|D|Lave) since usually |C||V| < |D|Lave
Test Time: O(|C| Lt)
where Lt is the average length of a test document.
Very efficient overall, linearly proportional to the time needed to just read in all the data.
Why?
Sec.13.2
Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
Class with highest final un-normalized log probability score is still the most probable.
Note that the model is now just a max of a sum of weights:
c_NB = argmax_{c_j ∈ C} [ log P(c_j) + Σ_{i ∈ positions} log P(x_i | c_j) ]
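A tiny illustration of why the log trick matters (the numbers are toy values, not from the lecture):

```python
import math

probs = [1e-5] * 100          # 100 small conditional probabilities

product = 1.0
for p in probs:
    product *= p              # 1e-500 is below the smallest double, so this underflows to 0.0

log_sum = sum(math.log(p) for p in probs)   # stays representable

print(product)   # 0.0 (floating-point underflow)
print(log_sum)   # ≈ -1151.29, i.e. 100 * log(1e-5)
```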
Sec.13.2
Simple interpretation: Each conditional parameter log
P(xi|cj) is a weight that indicates how good an indicator xi is for cj.
The prior log P(cj) is a weight that indicates the relative
frequency of cj.
The sum is then a measure of how much evidence there
is for the document being in the class.
We select the class with the most evidence for it
Text collections have a large number of features
10,000 – 1,000,000 unique words … and more
Feature selection may make using a particular classifier feasible:
some classifiers can't deal with 100,000s of features.
Reduces training time
Training time for some methods is quadratic or worse in
the number of features
Can improve generalization (performance)
Eliminates noise features and avoids overfitting.
Sec.13.5
Two ideas:
Hypothesis testing statistics:
Are we confident that the value of one categorical variable is
associated with the value of another?
Chi-square test (χ²)
Information theory:
How much information does the value of one categorical
variable give you about the value of another?
Mutual information
They're similar, but χ² measures confidence in association (based on
available statistics), while MI measures the extent of association (assuming perfect knowledge of probabilities); a computational sketch of MI follows below.
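A minimal sketch of the mutual information computation for one term and one class, from a 2×2 contingency table of document counts; the counts in the example call are made up for illustration:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI between a term (present/absent) and a class (in/out), from document counts.
    n11: docs containing the term and in the class; n10: term present, not in class;
    n01: term absent, in class; n00: term absent, not in class."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # For each cell: (cell count, marginal count for the term value, marginal count for the class value)
    for n_tc, n_t, n_c in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if n_tc > 0:
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi

# Made-up counts: the term appears mostly in documents of the class.
print(mutual_information(n11=80, n10=20, n01=120, n00=780))
```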
Sec.13.5
The independence assumptions do not really hold for
documents written in natural language.
Conditional independence. Positional independence.
Examples?
Naive Bayes won 1st and 2nd place in KDD-CUP 97 competition out of 16 systems
Goal: Financial services industry direct mail response prediction model: Predict if the recipient of mail will actually respond to the advertisement – 750,000 records.
More robust to irrelevant features than many learning methods
Irrelevant features cancel each other out without affecting results; decision trees can suffer heavily from this.
More robust to concept drift (changing class definitions over time).
Very good in domains with many equally important features; decision trees suffer from fragmentation in such cases, especially with little data.
A good, dependable baseline for text classification (but not the best)!
Optimal if the independence assumptions hold (the Bayes Optimal Classifier); never true for text, but possible in some domains.
Very fast learning and testing (basically just count the data).
Low storage requirements.
Classify based on the prior weight of the class and the
conditional parameter for what each word says:
c_NB = argmax_{c_j ∈ C} [ log P(c_j) + Σ_{i ∈ positions} log P(x_i | c_j) ]
Training is done by counting and dividing (don't forget to smooth):
P̂(c_j) = N_{c_j} / N
P̂(x_i | c_j) = (N(x_i, c_j) + 1) / (Σ_{x ∈ V} (N(x, c_j) + 1))
Each document is a vector, one component for each
term (= word).
Normally normalize vectors to unit length.
High-dimensional vector space:
terms are axes (10,000+ dimensions, or even 100,000+); docs are vectors in this space.
How can we do classification in this space?
As before, the training set is a set of documents,
each labeled with its class (e.g., topic)
In vector space classification, this set corresponds to
a labeled set of points (or, equivalently, vectors) in the vector space
Premise 1: Documents in the same class form a
contiguous region of space
Premise 2: Documents from different classes don't overlap (much).
We define surfaces to delineate classes in the space
[Figure: documents in vector space, with regions for Government, Science, and Arts]
Is this similarity hypothesis true in general? Our main topic today is how to find good separators
kNN = k Nearest Neighbor
To classify document d into class c:
Define the k-neighborhood N as the k nearest neighbors of d
Count the number of documents in N that belong to c
Assign d to the class c with the most documents in N
[Figure: a test document among Government, Science, and Arts training documents: P(Science | test doc)?]
Learning is just storing the representations of the training examples in D.
Testing instance x (under 1NN):
Compute similarity between x and all examples in D. Assign x the category of the most similar example in D.
Does not explicitly compute a generalization or category prototypes.
Also called:
Case-based learning, memory-based learning, lazy learning.
Rationale of kNN: contiguity hypothesis
Cover and Hart (1967): asymptotically, the error rate of 1-nearest-neighbor
classification is less than twice the Bayes rate [the error rate of the optimal
classifier that knows the distribution generating the data].
In particular, the asymptotic error rate is 0 if the Bayes rate is 0.
Assume: query point coincides with a training point. Both query point and training point contribute error →
2 times Bayes rate
Using only the closest example (1NN) to determine
the class is subject to errors due to:
A single atypical example. Noise (i.e., an error) in the category label of a single
training example.
More robust alternative is to find the k most-similar
examples and return the majority category of these k examples.
Value of k is typically odd to avoid ties; 3 and 5 are
most common.
[Figure: Government, Science, and Arts regions with decision boundaries] Boundaries are in principle arbitrary surfaces, but usually polyhedra.
kNN gives locally defined decision boundaries between classes – far away points do not influence each classification decision (unlike in Naïve Bayes, etc.)
Nearest neighbor method depends on a similarity (or
distance) metric.
Simplest for continuous m-dimensional instance
space is Euclidean distance.
Simplest for m-dimensional binary instance space is
Hamming distance (number of feature values that differ).
For text, cosine similarity of tf.idf weighted vectors is
typically most effective.
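A small sketch of kNN classification with cosine similarity over sparse term-weight vectors; the representation (dicts of term weights) and tf.idf weighting are assumed to be produced elsewhere:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts mapping term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_classify(test_vec, training_set, k=3):
    """training_set: list of (vector, class_label). Majority vote among the k nearest neighbors."""
    neighbors = sorted(training_set, key=lambda item: cosine(test_vec, item[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```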
Naively finding nearest neighbors requires a linear
search through |D| documents in collection
But determining k nearest neighbors is the same as
determining the k best retrievals using the test document as a query to a database of training documents.
Use standard vector space inverted index methods to
find the k nearest neighbors.
No feature selection necessary.
Scales well with a large number of classes:
don't need to train n classifiers for n classes.
Scores can be hard to convert to probabilities.
No training necessary.
May be more expensive at test time.
Consider 2 class problems
Deciding between two classes, perhaps, government
and non-government
One-versus-rest classification
How do we define (and find) the separating surface? How do we decide which region a test doc is in?
A strong high-bias assumption is linear separability:
in 2 dimensions, can separate classes by a line; in higher dimensions, need hyperplanes.
Can find separating hyperplane by linear programming (or can iteratively fit solution via perceptron):
separator can be expressed as ax + by = c
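A minimal perceptron sketch that iteratively fits such a separator; it finds some separating hyperplane if one exists (not necessarily an optimal one), and the learning rate and epoch count are illustrative choices:

```python
def perceptron(points, labels, epochs=100, lr=1.0):
    """points: list of feature vectors; labels: +1 / -1.
    Learns weights w and bias b so that sign(w.x + b) separates the classes
    (equivalently, a line ax + by = c in two dimensions)."""
    dim = len(points[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        updated = False
        for x, y in zip(points, labels):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:   # misclassified point
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                updated = True
        if not updated:      # converged: every training point is on the correct side
            break
    return w, b
```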
Lots of possible solutions for a,b,c.
Some methods find a separating hyperplane, but not the optimal one [according to some criterion].
E.g., perceptron
Most methods find an optimal separating hyperplane
Which points should influence optimality?
All points:
Linear regression, Naïve Bayes
Only "difficult points" close to the decision boundary:
Support vector machines
Two-class Naïve Bayes. We compute:
log [ P(C|d) / P(¬C|d) ] = log [ P(C) / P(¬C) ] + Σ_{w ∈ d} log [ P(w|C) / P(w|¬C) ]
Decide class C if the odds are greater than 1, i.e., if the log odds are greater than 0.
So the decision boundary is the hyperplane:
α + Σ_{w ∈ V} β_w n_w = 0
where α = log [ P(C) / P(¬C) ], β_w = log [ P(w|C) / P(w|¬C) ], and n_w = number of occurrences of w in d.
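A short sketch of turning two-class Naïve Bayes parameters into this linear form; the prior and conditional probability values below are placeholders, not estimates from any real corpus:

```python
import math

# Hypothetical smoothed Naïve Bayes parameters for classes C and not-C.
prior = {"C": 0.3, "notC": 0.7}
cond = {  # P(w | class)
    "C":    {"export": 0.05, "poultry": 0.04, "the": 0.10},
    "notC": {"export": 0.01, "poultry": 0.001, "the": 0.10},
}

alpha = math.log(prior["C"] / prior["notC"])                              # log prior odds
beta = {w: math.log(cond["C"][w] / cond["notC"][w]) for w in cond["C"]}   # per-word weights

def log_odds(doc_tokens):
    """alpha + sum over words of beta_w * n_w; decide class C if this is > 0."""
    score = alpha
    for w in doc_tokens:
        score += beta.get(w, 0.0)
    return score

print(log_odds(["export", "poultry", "poultry", "the"]))   # > 0, so decide class C
```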
[Figure: a classification task whose class boundary is not linear]
A linear classifier like Naïve Bayes does badly on this task.
kNN will do very well (assuming enough training data).
Resources: IIR Chapter 13 (13.1–13.2, 13.5.0); IIR Chapter 14 (14.1, 14.3, 14.4).