
Web Information Retrieval, Lecture 14: Text Classification (Sec. 13.1)



  1. Web Information Retrieval, Lecture 14: Text Classification

  2. Sec. 13.1 Text Classification
     - Naïve Bayes classification
     - Vector space methods for text classification
     - K Nearest Neighbors
     - Decision boundaries
     - Linear classifiers

  3. Recall a few probability basics
     For events A and B, the product rule gives
        P(A, B) = P(A | B) P(B) = P(B | A) P(A)
     Rearranging yields Bayes' Rule:
        P(A | B) = P(B | A) P(A) / P(B)
     P(A) is the prior; P(A | B) is the posterior.
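     As a quick sanity check of the rule, here is a minimal numeric sketch in Python; the probabilities are invented for illustration and are not from the lecture.

```python
# Hypothetical numbers, purely to illustrate Bayes' Rule; not from the lecture.
p_a = 0.01            # prior P(A)
p_b_given_a = 0.90    # likelihood P(B | A)
p_b_given_not_a = 0.05

# Total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' Rule: posterior P(A|B) = P(B|A) P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))   # 0.154
```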

  4. Sec. 13.2 Probabilistic Methods
     Our focus this lecture.
     - Learning and classification methods based on probability theory.
     - Bayes' theorem plays a critical role in probabilistic learning and classification.
     - Builds a generative model that approximates how data is produced.
     - Uses the prior probability of each category given no information about an item.
     - Categorization produces a posterior probability distribution over the possible
       categories given a description of an item.

  5. Sec. 13.2 Bayes' Rule for text classification
     For a document d and a class c:
     - P(c) = probability that we see a document of class c
     - P(d) = probability that we see document d
        P(c, d) = P(c | d) P(d) = P(d | c) P(c)
     so
        P(c | d) = P(d | c) P(c) / P(d)

  6. Sec. 13.2 Naive Bayes Classifiers
     Task: classify a new instance d, described by a tuple of attribute values
     d = <x_1, x_2, …, x_n>, into one of the classes c_j ∈ C.
        c_MAP = argmax_{c_j ∈ C} P(c_j | x_1, x_2, …, x_n)
              = argmax_{c_j ∈ C} P(x_1, x_2, …, x_n | c_j) P(c_j) / P(x_1, x_2, …, x_n)
              = argmax_{c_j ∈ C} P(x_1, x_2, …, x_n | c_j) P(c_j)
     MAP is "maximum a posteriori" = the most likely class.

  7. Sec. 13.2 Naive Bayes Classifier: Naive Bayes Assumption
     - P(c_j) can be estimated from the frequency of classes in the training examples.
     - P(x_1, x_2, …, x_n | c_j) has O(|X|^n · |C|) parameters and could only be
       estimated if a very, very large number of training examples was available.
     - Naive Bayes conditional independence assumption: assume that the probability of
       observing the conjunction of attributes is equal to the product of the
       individual probabilities P(x_i | c_j):
          P(x_1, x_2, …, x_n | c_j) = Π_i P(x_i | c_j)
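     To make the parameter blow-up concrete, a rough back-of-the-envelope comparison; the sizes below are hypothetical, chosen only to illustrate the counts on this slide.

```python
# Hypothetical sizes, just to illustrate the parameter counts on this slide.
num_values = 2      # |X|: each attribute is binary (term present / absent)
num_attrs = 1000    # n: number of attributes (e.g. vocabulary terms)
num_classes = 2     # |C|

# Full joint model: one parameter per attribute combination per class.
full_joint_params = num_values ** num_attrs * num_classes

# Naive Bayes: one parameter per attribute value per class.
naive_bayes_params = num_values * num_attrs * num_classes

print(naive_bayes_params)             # 4000
print(full_joint_params > 10 ** 300)  # True: astronomically many parameters
```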

  8. Sec. 13.3 The Naive Bayes Classifier
     (Figure: class Flu with binary features X_1 … X_5: runny nose, sinus, cough,
     fever, muscle ache.)
     - Conditional independence assumption: features detect term presence and are
       independent of each other given the class:
          P(X_1, …, X_5 | C) = P(X_1 | C) · P(X_2 | C) · … · P(X_5 | C)
     - This model is appropriate for binary variables: the multivariate Bernoulli model.

  9. Sec. 13.3 Learning the Model
     (Figure: Bayes net with class C and features X_1 … X_6.)
     - First attempt: maximum likelihood estimates, i.e. simply use the frequencies
       in the data:
          P̂(c_j) = N(C = c_j) / N
          P̂(x_i | c_j) = N(X_i = x_i, C = c_j) / N(C = c_j)

  10. Sec. 13.3 Problem with Maximum Likelihood
      Recall P(X_1, …, X_5 | C) = P(X_1 | C) · P(X_2 | C) · … · P(X_5 | C).
      What if we have seen no training documents that contain the word "muscle-ache"
      and are classified in the topic Flu?
         P̂(X_5 = t | C = Flu) = N(X_5 = t, C = Flu) / N(C = Flu) = 0
      Zero probabilities cannot be conditioned away, no matter the other evidence:
         c_NB = argmax_c P̂(c) Π_i P̂(x_i | c)
      is zero for that class as soon as a single factor is zero.

  11. Sec. 13.3 Smoothing
      Add-one (Laplace) smoothing of the estimates:
         P̂(x_i | c_j) = ( N(X_i = x_i, C = c_j) + 1 ) /
                        ( Σ_{w ∈ Vocabulary} N(X_i = w, C = c_j) + |Vocabulary| )
      - More advanced smoothing is possible.
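      A minimal sketch of the effect of add-one smoothing; the counts and class below are invented for illustration, not taken from the lecture.

```python
from collections import Counter

# Toy counts of word occurrences in documents of a hypothetical class "flu".
counts_flu = Counter({"fever": 10, "cough": 7, "sinus": 3})  # "muscle-ache" unseen
vocabulary = ["fever", "cough", "sinus", "muscle-ache", "runny-nose"]

total = sum(counts_flu[w] for w in vocabulary)

def p_mle(word):
    # Maximum likelihood estimate: zero for unseen words.
    return counts_flu[word] / total

def p_laplace(word):
    # Add-one smoothing: every vocabulary word gets a small non-zero probability.
    return (counts_flu[word] + 1) / (total + len(vocabulary))

print(p_mle("muscle-ache"))      # 0.0  -> would zero out the whole product
print(p_laplace("muscle-ache"))  # 0.04 (1 / 25)
```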

  12. Sec. 13.2.1 Stochastic Language Models
      Model the probability of generating strings (each word in turn) in a language
      (commonly all strings over an alphabet Σ). E.g., a unigram model M:
         P(the) = 0.2, P(a) = 0.1, P(man) = 0.01, P(woman) = 0.01,
         P(said) = 0.03, P(likes) = 0.02, …
      For the string s = "the man likes the woman", multiply the word probabilities:
         P(s | M) = 0.2 · 0.01 · 0.02 · 0.2 · 0.01 = 0.00000008
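      A minimal Python sketch of this unigram model, using the same numbers as the slide; the function name is my own.

```python
# A minimal unigram language model sketch using the probabilities on this slide.
model_m = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01,
           "said": 0.03, "likes": 0.02}

def string_probability(sentence, model):
    """P(s | M): product of the unigram probabilities of the words in s."""
    p = 1.0
    for word in sentence.split():
        p *= model.get(word, 0.0)  # unseen words get probability 0 here
    return p

print(string_probability("the man likes the woman", model_m))  # ~8e-08
```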

  13. Sec. 13.2.1 Stochastic Language Models
      Model the probability of generating any string. Two unigram models:
         word        M1        M2
         the         0.2       0.2
         class       0.01      0.0001
         sayst       0.0001    0.03
         pleaseth    0.0001    0.02
         yon         0.0001    0.1
         maiden      0.0005    0.01
         woman       0.01      0.0001
      For s = "the class pleaseth yon maiden":
         P(s | M1) = 0.2 · 0.01 · 0.0001 · 0.0001 · 0.0005
         P(s | M2) = 0.2 · 0.0001 · 0.02 · 0.1 · 0.01
      so P(s | M2) > P(s | M1).

  14. Sec. 13.2 Naive Bayes via a class-conditional language model = multinomial NB
      (Figure: class C generating words w_1 … w_6.)
      - Effectively, the probability of a document given a class is computed by a
        class-specific unigram language model.

  15. Sec. 13.2 Using Multinomial Naive Bayes Classifiers to Classify Text: Basic method
      Attributes are text positions, values are words:
         c_NB = argmax_{c_j ∈ C} P(c_j) Π_i P(x_i | c_j)
              = argmax_{c_j ∈ C} P(c_j) P(x_1 = "our" | c_j) · … · P(x_n = "text" | c_j)
      - Still too many possibilities.
      - Assume that classification is independent of the positions of the words:
        use the same parameters for each position.
      - Result is the bag-of-words model.

  16. Sec. 13.2 Naive Bayes: Learning
      - From the training corpus, extract Vocabulary.
      - Calculate the required P(c_j) and P(x_k | c_j) terms:
        For each c_j in C do
           docs_j <- subset of documents for which the target class is c_j
           P(c_j) <- |docs_j| / |total # documents|
           Text_j <- single document containing all of docs_j
           n <- total number of word positions in Text_j
           for each word x_k in Vocabulary
              n_k <- number of occurrences of x_k in Text_j
              P(x_k | c_j) <- (n_k + 1) / (n + |Vocabulary|)
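      A minimal Python sketch of this training procedure, assuming documents arrive pre-tokenized as lists of words; the function and variable names are my own, not from the lecture.

```python
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels):
    """Minimal multinomial Naive Bayes training sketch following this slide.
    `docs` is a list of token lists; `labels` holds the corresponding classes."""
    vocabulary = {w for doc in docs for w in doc}
    prior, cond_prob = {}, defaultdict(dict)
    for c in set(labels):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        prior[c] = len(class_docs) / len(docs)           # P(c_j)
        text_j = [w for d in class_docs for w in d]      # Text_j: all docs of class c
        counts = Counter(text_j)
        n = len(text_j)                                  # total word positions in Text_j
        for w in vocabulary:
            # Add-one smoothed estimate: P(x_k | c_j) = (n_k + 1) / (n + |Vocabulary|)
            cond_prob[c][w] = (counts[w] + 1) / (n + len(vocabulary))
    return vocabulary, prior, cond_prob
```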

  17. Sec. 13.2 Naive Bayes: Classifying
      - positions <- all word positions in the current document that contain tokens
        found in Vocabulary
      - Return c_NB, where
           c_NB = argmax_{c_j ∈ C} P(c_j) Π_{i ∈ positions} P(x_i | c_j)
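      Continuing the training sketch above, a minimal classification routine in the direct product form shown on this slide (the log-space form on slide 19 is what you would use in practice); the toy data is invented.

```python
def classify_nb(doc, vocabulary, prior, cond_prob):
    """Return c_NB = argmax_c P(c) * prod_{i in positions} P(x_i | c).
    Uses only the tokens of `doc` that appear in Vocabulary."""
    positions = [w for w in doc if w in vocabulary]
    best_class, best_score = None, -1.0
    for c in prior:
        score = prior[c]
        for w in positions:
            score *= cond_prob[c][w]
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Usage with the training sketch above (toy data, purely illustrative):
docs = [["fever", "cough", "fever"], ["goal", "match", "goal"]]
labels = ["flu", "sports"]
vocab, prior, cond = train_multinomial_nb(docs, labels)
print(classify_nb(["cough", "fever"], vocab, prior, cond))  # flu
```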

  18. Sec. 13.2 Naive Bayes: Time Complexity
      - Training time: O(|D| L_ave + |C||V|), where L_ave is the average length of a
        document in D.
        - Assumes all counts are pre-computed in O(|D| L_ave) time during one pass
          through all of the data.
        - Generally just O(|D| L_ave), since usually |C||V| < |D| L_ave. Why?
      - Test time: O(|C| L_t), where L_t is the average length of a test document.
      - Very efficient overall: linearly proportional to the time needed just to read
        in all the data.

  19. Sec. 13.2 Underflow Prevention: using logs
      - Multiplying lots of probabilities, which are between 0 and 1 by definition,
        can result in floating-point underflow.
      - Since log(xy) = log(x) + log(y), it is better to perform all computations by
        summing logs of probabilities rather than multiplying probabilities.
      - The class with the highest final un-normalized log probability score is still
        the most probable:
           c_NB = argmax_{c_j ∈ C} [ log P(c_j) + Σ_{i ∈ positions} log P(x_i | c_j) ]
      - Note that the model is now just a max of a sum of weights.
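      A log-space variant of the earlier classification sketch, using the same hypothetical model data structures as above.

```python
import math

def classify_nb_log(doc, vocabulary, prior, cond_prob):
    """Log-space version of the classifier sketch above: summing log probabilities
    replaces the product and avoids floating-point underflow on long documents."""
    positions = [w for w in doc if w in vocabulary]
    scores = {}
    for c in prior:
        scores[c] = math.log(prior[c]) + sum(math.log(cond_prob[c][w]) for w in positions)
    return max(scores, key=scores.get)
```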

  20. Naive Bayes Classifier
         c_NB = argmax_{c_j ∈ C} [ log P(c_j) + Σ_{i ∈ positions} log P(x_i | c_j) ]
      - Simple interpretation: each conditional parameter log P(x_i | c_j) is a weight
        that indicates how good an indicator x_i is for c_j.
      - The prior log P(c_j) is a weight that indicates the relative frequency of c_j.
      - The sum is then a measure of how much evidence there is for the document being
        in the class.
      - We select the class with the most evidence for it.

  21. Sec. 13.5 Feature Selection: Why?
      - Text collections have a large number of features: 10,000 to 1,000,000 unique
        words, and more.
      - May make using a particular classifier feasible: some classifiers can't deal
        with hundreds of thousands of features.
      - Reduces training time: training time for some methods is quadratic or worse in
        the number of features.
      - Can improve generalization (performance): eliminates noise features and avoids
        overfitting.

  22. Sec. 13.5 Feature Selection: How?
      Two ideas:
      - Hypothesis-testing statistics: are we confident that the value of one
        categorical variable is associated with the value of another?
        Chi-square test (χ²).
      - Information theory: how much information does the value of one categorical
        variable give you about the value of another?
        Mutual information (MI).
      They're similar, but χ² measures confidence in the association (based on the
      available statistics), while MI measures the extent of the association (assuming
      perfect knowledge of the probabilities).
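      As an illustration of the MI criterion, a minimal sketch that estimates the mutual information between a term and a class from a 2x2 contingency table of document counts; the counts in the usage line are invented.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Expected mutual information between a term and a class, estimated from a
    2x2 contingency table of document counts:
    n11: term present, in class      n10: term present, not in class
    n01: term absent,  in class      n00: term absent,  not in class"""
    n = n11 + n10 + n01 + n00
    cells = [
        (n11, n11 + n10, n11 + n01),   # (cell count, term marginal, class marginal)
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    return sum((n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
               for n_tc, n_t, n_c in cells if n_tc > 0)

# Invented counts: the term occurs mostly in documents of the class.
print(round(mutual_information(8, 2, 2, 8), 3))   # 0.278
```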

  23. Violation of NB Assumptions
      - The independence assumptions do not really hold for documents written in
        natural language:
        - conditional independence
        - positional independence
      - Examples?
