

  1. Natural Language Processing (CSE 490U): Text Classification. Noah Smith. © 2017 University of Washington. nasmith@cs.washington.edu. January 20–23, 2017.

  2. Text Classification
Input: a piece of text x ∈ V†, usually a document (r.v. X).
Output: a label from a finite set L (r.v. L).
Standard line of attack:
1. Human experts label some data.
2. Feed the data to a supervised machine learning algorithm that constructs an automatic classifier, classify : V† → L.
3. Apply classify to as much data as you want!
Note: we assume the texts are segmented already, even the new ones.

  3. Text Classification: Examples
◮ Library-like subjects (e.g., the Dewey decimal system)
◮ News stories: politics vs. sports vs. business vs. technology ...
◮ Reviews of films, restaurants, products: positive vs. negative
◮ Author attributes: identity, political stance, gender, age, ...
◮ Email, arXiv submissions, etc.: spam vs. not
◮ What is the reading level of a piece of text?
◮ How influential will a scientific paper be?
◮ Will a piece of proposed legislation pass?
Closely related: relevance to a query.

  4. Evaluation
Accuracy:
A(classify) = p(classify(X) = L)
            = Σ_{x ∈ V†, ℓ ∈ L} p(X = x, L = ℓ) · 1{classify(x) = ℓ}
where 1{·} is the indicator function (1 if classify(x) = ℓ, 0 otherwise) and p is the true distribution over data. Error is 1 − A.
This is estimated using a test dataset ⟨x̄_1, ℓ̄_1⟩, ..., ⟨x̄_m, ℓ̄_m⟩:
Â(classify) = (1/m) Σ_{i=1}^{m} 1{classify(x̄_i) = ℓ̄_i}
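To make the estimator concrete, here is a minimal Python sketch of Â; the function name and the shape of `test_data` (a list of ⟨text, label⟩ pairs) are illustrative assumptions, not part of the original slides.

```python
# A minimal sketch of the A-hat estimator, assuming `test_data` is a list of
# (text, label) pairs and `classify` maps a text to a label (names illustrative).
def accuracy(classify, test_data):
    correct = sum(classify(x) == label for x, label in test_data)
    return correct / len(test_data)  # fraction of test items classified correctly
```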

  5.–8. Issues with Test-Set Accuracy
◮ Class imbalance: if p(L = not spam) = 0.99, then you can get Â ≈ 0.99 by always guessing "not spam."
◮ Relative importance of classes or cost of error types.
◮ Variance due to the test data.

  9. Evaluation in the Two-Class Case
Suppose we have two classes, and one of them, t ∈ L, is a "target."
◮ E.g., given a query, find relevant documents.
Precision and recall encode the goals of returning a "pure" set of targeted instances and capturing all of them. Let A be the set of instances actually in the target class (L = t), B the set believed to be in the target class (classify(x) = t), and C = A ∩ B the set correctly labeled as t.
P̂(classify) = |C| / |B| = |A ∩ B| / |B|
R̂(classify) = |C| / |A| = |A ∩ B| / |A|
F̂₁(classify) = 2 · P̂ · R̂ / (P̂ + R̂)

  10. Another View: Contingency Table

                     L = t                      L ≠ t
classify(X) = t      C (true positives)         B \ C (false positives)    [row total: B]
classify(X) ≠ t      A \ C (false negatives)    (true negatives)
                     [column total: A]
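A sketch of these quantities in Python, assuming parallel lists of predicted and gold labels (names illustrative):

```python
# A sketch of P-hat, R-hat, and F1-hat for a target class t, computed from the
# contingency counts above; `predictions` and `gold` are assumed parallel label lists.
def prf1(predictions, gold, t):
    tp = sum(p == t and g == t for p, g in zip(predictions, gold))  # |C|
    fp = sum(p == t and g != t for p, g in zip(predictions, gold))  # |B \ C|
    fn = sum(p != t and g == t for p, g in zip(predictions, gold))  # |A \ C|
    precision = tp / (tp + fp) if tp + fp else 0.0  # |C| / |B|
    recall = tp / (tp + fn) if tp + fn else 0.0     # |C| / |A|
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```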

  11. Evaluation with > 2 Classes
Macroaveraged precision and recall: let each class be the target and report the average P̂ and R̂ across all classes.
Microaveraged precision and recall: pool all one-vs.-rest decisions into a single contingency table, then calculate P̂ and R̂ from that.
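A sketch of the two averaging schemes for precision (recall is analogous), under the same assumed parallel-list representation:

```python
# A sketch of macro- vs. microaveraged precision over a set of classes;
# `predictions`, `gold`, and `classes` are illustrative names.
def macro_micro_precision(predictions, gold, classes):
    per_class, tp_total, fp_total = [], 0, 0
    for t in classes:
        tp = sum(p == t and g == t for p, g in zip(predictions, gold))
        fp = sum(p == t and g != t for p, g in zip(predictions, gold))
        per_class.append(tp / (tp + fp) if tp + fp else 0.0)
        tp_total += tp
        fp_total += fp
    macro = sum(per_class) / len(classes)  # average of per-class P-hats
    micro = tp_total / (tp_total + fp_total) if tp_total + fp_total else 0.0  # pooled table
    return macro, micro
```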

  12. Cross-Validation
Remember that Â, P̂, R̂, and F̂₁ are all estimates of the classifier's quality under the true data distribution.
◮ Estimates are noisy!
K-fold cross-validation:
◮ Partition the training set into K non-overlapping "folds" x_1, ..., x_K.
◮ For i ∈ {1, ..., K}:
  ◮ Train on x_{1:n} \ x_i, using x_i as development data.
  ◮ Estimate quality on the i-th development set: Â_i.
◮ Report the average, Â = (1/K) Σ_{i=1}^{K} Â_i, and perhaps also the standard error.
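A sketch of the procedure, assuming a `train` function that fits a classifier on a dataset and an `evaluate` function such as the accuracy estimator sketched earlier (both names are assumptions):

```python
# A sketch of K-fold cross-validation with mean and standard error of the fold scores.
import statistics

def cross_validate(train, evaluate, data, K=10):
    folds = [data[i::K] for i in range(K)]  # K non-overlapping folds
    scores = []
    for i in range(K):
        held_out = folds[i]
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        classify = train(training)
        scores.append(evaluate(classify, held_out))
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / K ** 0.5  # standard error (requires K >= 2)
    return mean, stderr
```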

  13.–17. Statistical Significance
Suppose we have two classifiers, classify_1 and classify_2. Is classify_1 better? The "null hypothesis," denoted H_0, is that it isn't. But if Â_1 ≫ Â_2, we are tempted to believe otherwise.
How much larger must Â_1 be than Â_2 to reject H_0?
Frequentist view: how (im)probable is the observed difference, given H_0 = true?
Caution: statistical significance is neither necessary nor sufficient for research significance!

  18. A Hypothesis Test for Text Classifiers (McNemar, 1947)
1. The null hypothesis: A_1 = A_2.
2. Pick significance level α, an "acceptably" high probability of incorrectly rejecting H_0.
3. Calculate the test statistic, k (explained on the next slide).
4. Calculate the probability of a more extreme value of k, assuming H_0 is true; this is the p-value.
5. Reject the null hypothesis if the p-value is less than α.
The p-value is p(this observation | H_0 is true), not the other way around!

  19. McNemar's Test: Details
Assumptions: independent (test) samples and binary measurements. Count test-set error patterns:

                          classify_1 is incorrect    classify_1 is correct
classify_2 is incorrect   c_00                       c_10
classify_2 is correct     c_01                       c_11

(The "classify_1 is correct" column sums to m · Â_1; the "classify_2 is correct" row sums to m · Â_2.)
If A_1 = A_2, then c_01 and c_10 are each distributed according to Binomial(c_01 + c_10, 1/2).
Test statistic: k = min{c_01, c_10}.
p-value = (1 / 2^{c_01 + c_10 − 1}) · Σ_{j=0}^{k} (c_01 + c_10 choose j)
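The p-value formula above translates directly into code; a sketch, where `c01` and `c10` are the counts of test items that exactly one of the two classifiers got right:

```python
# A sketch of the exact McNemar p-value from the formula above.
from math import comb

def mcnemar_p_value(c01, c10):
    n = c01 + c10
    k = min(c01, c10)  # the test statistic
    tail = sum(comb(n, j) for j in range(k + 1))
    return tail / 2 ** (n - 1)  # = 2 * P(Binomial(n, 1/2) <= k)
```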

  20. Other Tests
Different tests make different assumptions. Sometimes we calculate an interval that would be "unsurprising" under H_0 and test whether the test statistic falls in that interval (e.g., the t-test and the Wald test). In many cases, there is no closed form for estimating p-values, so we use random approximations (e.g., the permutation test and the paired bootstrap test). If you do lots of tests, you need to correct for that! Read lots more in Smith (2011), appendix B.
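As one illustration of a randomized approximation, here is a sketch of a paired bootstrap test; the per-item 0/1 correctness vectors and the one-sided formulation are assumptions, not details from the slides:

```python
# A sketch of a paired bootstrap test for the accuracy difference between two
# classifiers. `correct1` and `correct2` are parallel 0/1 vectors recording
# whether each classifier got each test item right.
import random

def paired_bootstrap(correct1, correct2, trials=10_000):
    m = len(correct1)
    losses = 0
    for _ in range(trials):
        idx = [random.randrange(m) for _ in range(m)]  # resample items with replacement
        diff = sum(correct1[i] - correct2[i] for i in idx)
        if diff <= 0:  # resamples where classifier 1 does not beat classifier 2
            losses += 1
    return losses / trials  # rough one-sided p-value estimate
```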

  21. Features in Text Classification
A different representation of the text sequence r.v. X: feature r.v.s. For j ∈ {1, ..., d}, let F_j be a discrete random variable taking a value in F_j.
◮ Often, these are term (word, and perhaps n-gram) frequencies.
◮ Can also be word "presence" features.
◮ Transformations on word frequencies: logarithm, idf weighting.
◮ Disjunctions of terms.
◮ Clusters.
◮ Task-specific lexicons.
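A sketch of the simplest of these feature maps, term frequencies with an optional log transform, assuming the text has already been tokenized (names are illustrative):

```python
# A sketch of term-frequency features with an optional log transform.
from collections import Counter
from math import log

def term_frequencies(tokens, log_transform=False):
    counts = Counter(tokens)  # raw term frequencies
    if log_transform:
        return {w: 1.0 + log(c) for w, c in counts.items()}  # dampen frequent terms
    return dict(counts)
```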

  22. Probabilistic Classification
Classification rule:
classify(f) = argmax_{ℓ ∈ L} p(ℓ | f)
            = argmax_{ℓ ∈ L} p(ℓ, f) / p(f)
            = argmax_{ℓ ∈ L} p(ℓ, f)
The last step holds because p(f) does not depend on ℓ, so dividing by it does not change which label attains the maximum.

  23. Naïve Bayes Classifier
p(L = ℓ, F_1 = f_1, ..., F_d = f_d) = p(ℓ) ∏_{j=1}^{d} p(F_j = f_j | ℓ)
                                    = π_ℓ ∏_{j=1}^{d} θ_{f_j | j, ℓ}
Parameters:
◮ π is the "class prior" (it sums to one).
◮ For each feature function j and label ℓ, a distribution over values θ_{·|j,ℓ} (sums to one for every ⟨j, ℓ⟩ pair).
The "bag of words" version of naïve Bayes sets F_j = X_j:
p(ℓ, x) = π_ℓ ∏_{j=1}^{|x|} θ_{x_j | ℓ}
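A sketch of the bag-of-words model with add-α smoothed relative-frequency estimation; the data format (a list of (tokens, label) pairs), the fixed vocabulary, and the smoothing constant are assumptions:

```python
# A sketch of bag-of-words naive Bayes: estimate pi and theta by smoothed
# relative frequencies, then classify by argmax of log p(l, x).
from collections import Counter
from math import log

def train_nb(data, vocab, alpha=1.0):
    label_counts = Counter(label for _, label in data)
    word_counts = {l: Counter() for l in label_counts}
    for tokens, label in data:
        word_counts[label].update(tokens)
    pi = {l: c / len(data) for l, c in label_counts.items()}  # class prior, sums to one
    theta = {}
    for l in label_counts:
        denom = sum(word_counts[l].values()) + alpha * len(vocab)
        theta[l] = {w: (word_counts[l][w] + alpha) / denom for w in vocab}
    def classify(tokens):
        # classification rule: argmax_l log pi_l + sum_j log theta_{x_j | l};
        # out-of-vocabulary tokens are simply skipped in this sketch.
        return max(pi, key=lambda l: log(pi[l]) +
                   sum(log(theta[l][w]) for w in tokens if w in theta[l]))
    return classify
```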

  24.–27. Naïve Bayes: Remarks
◮ Estimation by (smoothed) relative frequency estimation: easy!
◮ For continuous or integer-valued features, use different distributions.
◮ The bag of words version equates to building a conditional language model for each label.
◮ The Collins reading assumes a binary version, with F_v indicating whether v ∈ V occurs in x.

  28. Generative vs. Discriminative Classification
Naïve Bayes is the prototypical generative classifier.
◮ It describes a probabilistic process (a "generative story") for X and L.
◮ But why model X? It's always observed!
Discriminative models instead:
◮ seek to optimize a performance measure, like accuracy, or a computationally convenient surrogate;
◮ do not worry about p(X);
◮ tend to perform better when you have reasonable amounts of data.

