

  1. Natural Language Processing (CSE 490U): Text Classification. Noah Smith. © 2017 University of Washington. nasmith@cs.washington.edu. January 20–23, 2017.

  2. Text Classification
Input: a piece of text x ∈ V†, usually a document (r.v. X).
Output: a label from a finite set L (r.v. L).
Standard line of attack:
1. Human experts label some data.
2. Feed the data to a supervised machine learning algorithm that constructs an automatic classifier, classify : V† → L.
3. Apply classify to as much data as you want!
Note: we assume the texts are segmented already, even the new ones.

  3. Text Classification: Examples
◮ Library-like subjects (e.g., the Dewey decimal system)
◮ News stories: politics vs. sports vs. business vs. technology ...
◮ Reviews of films, restaurants, products: positive vs. negative
◮ Author attributes: identity, political stance, gender, age, ...
◮ Email, arXiv submissions, etc.: spam vs. not
◮ What is the reading level of a piece of text?
◮ How influential will a scientific paper be?
◮ Will a piece of proposed legislation pass?
Closely related: relevance to a query.

  4. Evaluation
Accuracy:
A(classify) = p(classify(X) = L)
            = Σ_{x ∈ V†, ℓ ∈ L} p(X = x, L = ℓ) · 1{classify(x) = ℓ}
where 1{·} is the indicator function (1 if classify(x) = ℓ, 0 otherwise) and p is the true distribution over data. Error is 1 − A.
This is estimated using a test dataset ⟨x̄_1, ℓ̄_1⟩, ..., ⟨x̄_m, ℓ̄_m⟩:
Â(classify) = (1/m) Σ_{i=1}^{m} 1{classify(x̄_i) = ℓ̄_i}
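To make the estimator concrete, here is a minimal Python sketch of Â; the function name and the shape of `test_data` (a list of ⟨text, label⟩ pairs) are illustrative assumptions, not part of the original slides.

```python
# A minimal sketch of the A-hat estimator, assuming `test_data` is a list of
# (text, label) pairs and `classify` maps a text to a label (names illustrative).
def accuracy(classify, test_data):
    correct = sum(classify(x) == label for x, label in test_data)
    return correct / len(test_data)  # fraction of test items classified correctly
```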

  5.–8. Issues with Test-Set Accuracy
◮ Class imbalance: if p(L = not spam) = 0.99, then you can get Â ≈ 0.99 by always guessing "not spam."
◮ Relative importance of classes or cost of error types.
◮ Variance due to the test data.

  9. Evaluation in the Two-Class Case
Suppose we have two classes, and one of them, t ∈ L, is a "target."
◮ E.g., given a query, find relevant documents.
Precision and recall encode the goals of returning a "pure" set of targeted instances and capturing all of them. Let A be the set of instances actually in the target class (L = t), B the set believed to be in the target class (classify(x) = t), and C = A ∩ B the set correctly labeled as t.
P̂(classify) = |C| / |B| = |A ∩ B| / |B|
R̂(classify) = |C| / |A| = |A ∩ B| / |A|
F̂₁(classify) = 2 · P̂ · R̂ / (P̂ + R̂)

  10. Another View: Contingency Table

                     L = t                      L ≠ t
classify(X) = t      C (true positives)         B \ C (false positives)    [row total: B]
classify(X) ≠ t      A \ C (false negatives)    (true negatives)
                     [column total: A]
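A sketch of these quantities in Python, assuming parallel lists of predicted and gold labels (names illustrative):

```python
# A sketch of P-hat, R-hat, and F1-hat for a target class t, computed from the
# contingency counts above; `predictions` and `gold` are assumed parallel label lists.
def prf1(predictions, gold, t):
    tp = sum(p == t and g == t for p, g in zip(predictions, gold))  # |C|
    fp = sum(p == t and g != t for p, g in zip(predictions, gold))  # |B \ C|
    fn = sum(p != t and g == t for p, g in zip(predictions, gold))  # |A \ C|
    precision = tp / (tp + fp) if tp + fp else 0.0  # |C| / |B|
    recall = tp / (tp + fn) if tp + fn else 0.0     # |C| / |A|
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```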

  11. Evaluation with > 2 Classes
Macroaveraged precision and recall: let each class be the target and report the average P̂ and R̂ across all classes.
Microaveraged precision and recall: pool all one-vs.-rest decisions into a single contingency table, then calculate P̂ and R̂ from that.
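A sketch of the two averaging schemes for precision (recall is analogous), under the same assumed parallel-list representation:

```python
# A sketch of macro- vs. microaveraged precision over a set of classes;
# `predictions`, `gold`, and `classes` are illustrative names.
def macro_micro_precision(predictions, gold, classes):
    per_class, tp_total, fp_total = [], 0, 0
    for t in classes:
        tp = sum(p == t and g == t for p, g in zip(predictions, gold))
        fp = sum(p == t and g != t for p, g in zip(predictions, gold))
        per_class.append(tp / (tp + fp) if tp + fp else 0.0)
        tp_total += tp
        fp_total += fp
    macro = sum(per_class) / len(classes)  # average of per-class P-hats
    micro = tp_total / (tp_total + fp_total) if tp_total + fp_total else 0.0  # pooled table
    return macro, micro
```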

  12. Cross-Validation
Remember that Â, P̂, R̂, and F̂₁ are all estimates of the classifier's quality under the true data distribution.
◮ Estimates are noisy!
K-fold cross-validation:
◮ Partition the training set into K non-overlapping "folds" x_1, ..., x_K.
◮ For i ∈ {1, ..., K}:
  ◮ Train on x_{1:n} \ x_i, using x_i as development data.
  ◮ Estimate quality on the i-th development set: Â_i.
◮ Report the average, Â = (1/K) Σ_{i=1}^{K} Â_i, and perhaps also the standard error.
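A sketch of the procedure, assuming a `train` function that fits a classifier on a dataset and an `evaluate` function such as the accuracy estimator sketched earlier (both names are assumptions):

```python
# A sketch of K-fold cross-validation with mean and standard error of the fold scores.
import statistics

def cross_validate(train, evaluate, data, K=10):
    folds = [data[i::K] for i in range(K)]  # K non-overlapping folds
    scores = []
    for i in range(K):
        held_out = folds[i]
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        classify = train(training)
        scores.append(evaluate(classify, held_out))
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / K ** 0.5  # standard error (requires K >= 2)
    return mean, stderr
```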

  13.–17. Statistical Significance
Suppose we have two classifiers, classify_1 and classify_2. Is classify_1 better? The "null hypothesis," denoted H_0, is that it isn't. But if Â_1 ≫ Â_2, we are tempted to believe otherwise.
How much larger must Â_1 be than Â_2 to reject H_0?
Frequentist view: how (im)probable is the observed difference, given H_0 = true?
Caution: statistical significance is neither necessary nor sufficient for research significance!

  18. A Hypothesis Test for Text Classifiers (McNemar, 1947)
1. The null hypothesis: A_1 = A_2.
2. Pick significance level α, an "acceptably" high probability of incorrectly rejecting H_0.
3. Calculate the test statistic, k (explained on the next slide).
4. Calculate the probability of a more extreme value of k, assuming H_0 is true; this is the p-value.
5. Reject the null hypothesis if the p-value is less than α.
The p-value is p(this observation | H_0 is true), not the other way around!

  19. McNemar's Test: Details
Assumptions: independent (test) samples and binary measurements. Count test-set error patterns:

                          classify_1 is incorrect    classify_1 is correct
classify_2 is incorrect   c_00                       c_10
classify_2 is correct     c_01                       c_11

(The "classify_1 is correct" column sums to m · Â_1; the "classify_2 is correct" row sums to m · Â_2.)
If A_1 = A_2, then c_01 and c_10 are each distributed according to Binomial(c_01 + c_10, 1/2).
Test statistic: k = min{c_01, c_10}.
p-value = (1 / 2^{c_01 + c_10 − 1}) · Σ_{j=0}^{k} (c_01 + c_10 choose j)
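The p-value formula above translates directly into code; a sketch, where `c01` and `c10` are the counts of test items that exactly one of the two classifiers got right:

```python
# A sketch of the exact McNemar p-value from the formula above.
from math import comb

def mcnemar_p_value(c01, c10):
    n = c01 + c10
    k = min(c01, c10)  # the test statistic
    tail = sum(comb(n, j) for j in range(k + 1))
    return tail / 2 ** (n - 1)  # = 2 * P(Binomial(n, 1/2) <= k)
```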

  20. Other Tests
Different tests make different assumptions. Sometimes we calculate an interval that would be "unsurprising" under H_0 and test whether the test statistic falls in that interval (e.g., the t-test and the Wald test). In many cases, there is no closed form for estimating p-values, so we use random approximations (e.g., the permutation test and the paired bootstrap test). If you do lots of tests, you need to correct for that! Read lots more in Smith (2011), appendix B.
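As one illustration of a randomized approximation, here is a sketch of a paired bootstrap test; the per-item 0/1 correctness vectors and the one-sided formulation are assumptions, not details from the slides:

```python
# A sketch of a paired bootstrap test for the accuracy difference between two
# classifiers. `correct1` and `correct2` are parallel 0/1 vectors recording
# whether each classifier got each test item right.
import random

def paired_bootstrap(correct1, correct2, trials=10_000):
    m = len(correct1)
    losses = 0
    for _ in range(trials):
        idx = [random.randrange(m) for _ in range(m)]  # resample items with replacement
        diff = sum(correct1[i] - correct2[i] for i in idx)
        if diff <= 0:  # resamples where classifier 1 does not beat classifier 2
            losses += 1
    return losses / trials  # rough one-sided p-value estimate
```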

  21. Features in Text Classification
A different representation of the text sequence r.v. X: feature r.v.s. For j ∈ {1, ..., d}, let F_j be a discrete random variable taking a value in F_j.
◮ Often, these are term (word, and perhaps n-gram) frequencies.
◮ Can also be word "presence" features.
◮ Transformations on word frequencies: logarithm, idf weighting.
◮ Disjunctions of terms.
◮ Clusters.
◮ Task-specific lexicons.
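A sketch of the simplest of these feature maps, term frequencies with an optional log transform, assuming the text has already been tokenized (names are illustrative):

```python
# A sketch of term-frequency features with an optional log transform.
from collections import Counter
from math import log

def term_frequencies(tokens, log_transform=False):
    counts = Counter(tokens)  # raw term frequencies
    if log_transform:
        return {w: 1.0 + log(c) for w, c in counts.items()}  # dampen frequent terms
    return dict(counts)
```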

  22. Probabilistic Classification
Classification rule:
classify(f) = argmax_{ℓ ∈ L} p(ℓ | f)
            = argmax_{ℓ ∈ L} p(ℓ, f) / p(f)
            = argmax_{ℓ ∈ L} p(ℓ, f)
The last step holds because p(f) does not depend on ℓ, so dividing by it does not change which label attains the maximum.

  23. Naïve Bayes Classifier
p(L = ℓ, F_1 = f_1, ..., F_d = f_d) = p(ℓ) ∏_{j=1}^{d} p(F_j = f_j | ℓ)
                                    = π_ℓ ∏_{j=1}^{d} θ_{f_j | j, ℓ}
Parameters:
◮ π is the "class prior" (it sums to one).
◮ For each feature function j and label ℓ, a distribution over values θ_{·|j,ℓ} (sums to one for every ⟨j, ℓ⟩ pair).
The "bag of words" version of naïve Bayes sets F_j = X_j:
p(ℓ, x) = π_ℓ ∏_{j=1}^{|x|} θ_{x_j | ℓ}
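A sketch of the bag-of-words model with add-α smoothed relative-frequency estimation; the data format (a list of (tokens, label) pairs), the fixed vocabulary, and the smoothing constant are assumptions:

```python
# A sketch of bag-of-words naive Bayes: estimate pi and theta by smoothed
# relative frequencies, then classify by argmax of log p(l, x).
from collections import Counter
from math import log

def train_nb(data, vocab, alpha=1.0):
    label_counts = Counter(label for _, label in data)
    word_counts = {l: Counter() for l in label_counts}
    for tokens, label in data:
        word_counts[label].update(tokens)
    pi = {l: c / len(data) for l, c in label_counts.items()}  # class prior, sums to one
    theta = {}
    for l in label_counts:
        denom = sum(word_counts[l].values()) + alpha * len(vocab)
        theta[l] = {w: (word_counts[l][w] + alpha) / denom for w in vocab}
    def classify(tokens):
        # classification rule: argmax_l log pi_l + sum_j log theta_{x_j | l};
        # out-of-vocabulary tokens are simply skipped in this sketch.
        return max(pi, key=lambda l: log(pi[l]) +
                   sum(log(theta[l][w]) for w in tokens if w in theta[l]))
    return classify
```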

  24.–27. Naïve Bayes: Remarks
◮ Estimation by (smoothed) relative frequency estimation: easy!
◮ For continuous or integer-valued features, use different distributions.
◮ The bag of words version equates to building a conditional language model for each label.
◮ The Collins reading assumes a binary version, with F_v indicating whether v ∈ V occurs in x.

  28. Generative vs. Discriminative Classification
Naïve Bayes is the prototypical generative classifier.
◮ It describes a probabilistic process (a "generative story") for X and L.
◮ But why model X? It's always observed!
Discriminative models instead:
◮ seek to optimize a performance measure, like accuracy, or a computationally convenient surrogate;
◮ do not worry about p(X);
◮ tend to perform better when you have reasonable amounts of data.

