Natural Language Processing (CSE 517): Text Classification (II)


  1. Natural Language Processing (CSE 517): Text Classification (II)
     Noah Smith. © 2016 University of Washington. nasmith@cs.washington.edu. February 1, 2016.

  2. Quick Review: Text Classification
     Input: a piece of text x ∈ V†, usually a document (r.v. X). Output: a label from a finite set L (r.v. L).
     Standard line of attack:
     1. Human experts label some data.
     2. Feed the data to a supervised machine learning algorithm that constructs an automatic classifier classify : V† → L.
     3. Apply classify to as much data as you want!
     We covered naïve Bayes, reviewed multinomial logistic regression, and, briefly, the perceptron.
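  A minimal sketch of this standard line of attack in Python, using scikit-learn as the supervised learner; the toy documents, labels, and spam/ham task are invented purely for illustration:

    # Toy illustration of the standard line of attack (data invented for this sketch).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # 1. Human experts label some data.
    texts = ["free money now", "meeting at noon", "win a big prize", "lunch tomorrow?"]
    labels = ["spam", "ham", "spam", "ham"]

    # 2. A supervised learner constructs the classifier classify : V† → L.
    classify = make_pipeline(CountVectorizer(), LogisticRegression())
    classify.fit(texts, labels)

    # 3. Apply classify to as much data as you want.
    print(classify.predict(["win free money"]))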

  3. Multinomial Logistic Regression as "Log Loss"
     p(L = ℓ | x) = exp(w · φ(x, ℓ)) / Σ_{ℓ′ ∈ L} exp(w · φ(x, ℓ′))
     MLE can be rewritten as a minimization problem:
     ŵ = argmin_w Σ_{i=1}^{n} [ log Σ_{ℓ′ ∈ L} exp(w · φ(x_i, ℓ′))  ("fear")  −  w · φ(x_i, ℓ_i)  ("hope") ]
     Recall from lecture 3:
     ◮ Be wise and regularize!
     ◮ Solve with batch or stochastic gradient methods.
     ◮ w_j has an interpretation.
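  A minimal sketch of these two quantities, assuming a weight vector `w` (a NumPy array), a feature function `phi(x, l)` returning a NumPy array, and `labels` standing in for the label set L:

    import numpy as np

    def label_probs(w, phi, x, labels):
        """p(L = l | x): softmax over the linear scores w · φ(x, l)."""
        scores = np.array([w @ phi(x, l) for l in labels])
        scores -= scores.max()              # subtract the max for numerical stability
        exp_scores = np.exp(scores)
        return exp_scores / exp_scores.sum()

    def log_loss(w, phi, x, gold, labels):
        """One term of the MLE objective: log-sum-exp over all labels ("fear")
        minus the score of the correct label ("hope")."""
        scores = np.array([w @ phi(x, l) for l in labels])
        return np.logaddexp.reduce(scores) - w @ phi(x, gold)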

  4. Log Loss and Hinge Loss for (x, ℓ)
     log loss:   log Σ_{ℓ′ ∈ L} exp(w · φ(x, ℓ′)) − w · φ(x, ℓ)
     hinge loss: max_{ℓ′ ∈ L} w · φ(x, ℓ′) − w · φ(x, ℓ)
     [Plot: loss as a function of the linear score of the correct label, in the binary case. In purple is the hinge loss, in blue is the log loss, and in red is the "zero-one" loss (error).]
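  The hinge loss simply swaps the log-sum-exp ("soft" max) for a hard max; a sketch in the same assumed notation as above:

    import numpy as np

    def hinge_loss(w, phi, x, gold, labels):
        """Multiclass hinge loss: the best score over all labels
        minus the score of the correct label (always ≥ 0)."""
        scores = np.array([w @ phi(x, l) for l in labels])
        return scores.max() - w @ phi(x, gold)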

  5. Minimizing Hinge Loss: Perceptron
     min_w Σ_{i=1}^{n} [ max_{ℓ′ ∈ L} w · φ(x_i, ℓ′) − w · φ(x_i, ℓ_i) ]
     Stochastic subgradient descent on the above is called the perceptron algorithm:
     ◮ For t ∈ {1, . . . , T}:
       ◮ Pick i_t uniformly at random from {1, . . . , n}.
       ◮ ℓ̂_{i_t} ← argmax_{ℓ ∈ L} w · φ(x_{i_t}, ℓ)
       ◮ w ← w − α (φ(x_{i_t}, ℓ̂_{i_t}) − φ(x_{i_t}, ℓ_{i_t}))
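  A sketch of this loop in Python, assuming `examples` is a list of (x, gold label) pairs, `phi` the feature function, `dim` the feature dimension, and `alpha` the step size; all names are illustrative:

    import numpy as np

    def perceptron(examples, labels, phi, dim, T=1000, alpha=1.0, seed=0):
        """Stochastic subgradient descent on the hinge loss (the perceptron)."""
        rng = np.random.default_rng(seed)
        w = np.zeros(dim)
        for _ in range(T):
            x, gold = examples[rng.integers(len(examples))]    # pick i_t uniformly at random
            pred = max(labels, key=lambda l: w @ phi(x, l))    # current best-scoring label
            w -= alpha * (phi(x, pred) - phi(x, gold))         # subgradient step; zero update if pred == gold
        return w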

  6. Error Costs
     Suppose that not all mistakes are equally bad, e.g., false positives vs. false negatives in spam detection.

  7. Error Costs
     Suppose that not all mistakes are equally bad, e.g., false positives vs. false negatives in spam detection.
     Let cost(ℓ, ℓ′) quantify the "badness" of substituting ℓ′ for the correct label ℓ.

  8. Error Costs
     Suppose that not all mistakes are equally bad, e.g., false positives vs. false negatives in spam detection.
     Let cost(ℓ, ℓ′) quantify the "badness" of substituting ℓ′ for the correct label ℓ.
     Intuition: estimate the scoring function so that score(ℓ_i) − score(ℓ̂) ∝ cost(ℓ_i, ℓ̂).
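  One concrete cost function for the spam example; the asymmetric numbers below are assumptions chosen only to illustrate cost(ℓ, ℓ′), not values from the lecture:

    # Assumed costs: missed spam is mildly bad, misfiled real mail is much worse.
    COST = {
        ("spam", "spam"): 0.0, ("spam", "ham"): 1.0,   # missed spam: annoying
        ("ham", "ham"): 0.0,   ("ham", "spam"): 5.0,   # real mail lost to the spam folder: costly
    }

    def cost(gold, pred):
        """Badness of predicting `pred` when the correct label is `gold`."""
        return COST[(gold, pred)]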

  9. General Hinge Loss for (x, ℓ)
     max_{ℓ′ ∈ L} [ w · φ(x, ℓ′) + cost(ℓ, ℓ′) ] − w · φ(x, ℓ)
     [Plot of −x + max(x, 1) in the binary case, with cost(−1, 1) = 1. In blue is the general hinge loss; in red is the "zero-one" loss (error).]
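  A sketch of the cost-augmented hinge in the same assumed notation; each competing label's score is inflated by its cost before the max is taken:

    import numpy as np

    def general_hinge_loss(w, phi, x, gold, labels, cost):
        """Cost-augmented ("general") hinge loss for one example."""
        augmented = np.array([w @ phi(x, l) + cost(gold, l) for l in labels])
        return augmented.max() - w @ phi(x, gold)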

  10. Support Vector Machines
      A different motivation for the generalized hinge:
      ŵ = Σ_{i=1}^{n} Σ_{ℓ ∈ L} α_{i,ℓ} · φ(x_i, ℓ)
      where only a small number of the α_{i,ℓ} are nonzero.

  11. Support Vector Machines
      A different motivation for the generalized hinge:
      ŵ = Σ_{i=1}^{n} Σ_{ℓ ∈ L} α_{i,ℓ} · φ(x_i, ℓ)
      where only a small number of the α_{i,ℓ} are nonzero.
      Those φ(x_i, ℓ) are called "support vectors" because they "support" the decision boundary:
      ŵ · φ(x, ℓ′) = Σ_{(i,ℓ) ∈ S} α_{i,ℓ} · φ(x_i, ℓ) · φ(x, ℓ′)
      See Crammer and Singer (2001) for the multiclass version.

  12. Support Vector Machines
      A different motivation for the generalized hinge:
      ŵ = Σ_{i=1}^{n} Σ_{ℓ ∈ L} α_{i,ℓ} · φ(x_i, ℓ)
      where only a small number of the α_{i,ℓ} are nonzero.
      Those φ(x_i, ℓ) are called "support vectors" because they "support" the decision boundary:
      ŵ · φ(x, ℓ′) = Σ_{(i,ℓ) ∈ S} α_{i,ℓ} · φ(x_i, ℓ) · φ(x, ℓ′)
      See Crammer and Singer (2001) for the multiclass version.
      Really good tool: SVMlight, http://svmlight.joachims.org
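  A minimal sketch of scoring with this dual form, assuming `support_vectors` holds the φ(x_i, ℓ) with nonzero weight and `alphas` the matching α_{i,ℓ}; these names are illustrative and not SVMlight's API:

    import numpy as np

    def sv_score(support_vectors, alphas, phi, x, label):
        """ŵ · φ(x, ℓ′) computed as a weighted sum of dot products against
        the support vectors, without ever forming ŵ explicitly."""
        target = phi(x, label)
        return sum(a * np.dot(sv, target) for sv, a in zip(support_vectors, alphas))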

  13. Support Vector Machines: Remarks
      ◮ Regularization is critical; squared ℓ₂ is most common, and is often used in (yet another) motivation built around the idea of "maximizing the margin" around the hyperplane separator.

  14. Support Vector Machines: Remarks
      ◮ Regularization is critical; squared ℓ₂ is most common, and is often used in (yet another) motivation built around the idea of "maximizing the margin" around the hyperplane separator.
      ◮ Often, instead of linear models that explicitly calculate w · φ, these methods are "kernelized": all calculations are rearranged to involve only inner products between φ vectors. Examples (sketched in code below):
        K_linear(v, w) = v · w
        K_polynomial(v, w) = (v · w + 1)^p
        K_Gaussian(v, w) = exp(−‖v − w‖₂² / (2σ²))
      ◮ Linear kernels are most common in NLP.
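  The three kernels above as Python functions; the degree p and bandwidth σ are free hyperparameters (the defaults here are arbitrary). Replacing `np.dot` in the support-vector scoring sketch earlier with any of these gives the kernelized version:

    import numpy as np

    def k_linear(v, w):
        return v @ w

    def k_polynomial(v, w, p=2):        # degree p: hyperparameter
        return (v @ w + 1.0) ** p

    def k_gaussian(v, w, sigma=1.0):    # bandwidth σ: hyperparameter
        return np.exp(-np.sum((v - w) ** 2) / (2.0 * sigma ** 2))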

  15. General Remarks
      ◮ Text classification: many problems, all solved with supervised learners.
      ◮ Lexicon features can provide problem-specific guidance.

  16. General Remarks
      ◮ Text classification: many problems, all solved with supervised learners.
      ◮ Lexicon features can provide problem-specific guidance.
      ◮ Naïve Bayes, log-linear models, and SVMs are all linear methods that tend to work reasonably well, given good features and smoothing/regularization.
      ◮ You should have a basic understanding of the tradeoffs in choosing among them.

  17. General Remarks
      ◮ Text classification: many problems, all solved with supervised learners.
      ◮ Lexicon features can provide problem-specific guidance.
      ◮ Naïve Bayes, log-linear models, and SVMs are all linear methods that tend to work reasonably well, given good features and smoothing/regularization.
      ◮ You should have a basic understanding of the tradeoffs in choosing among them.
      ◮ Rumor: random forests are widely used in industry when performance matters more than interpretability.

  18. General Remarks
      ◮ Text classification: many problems, all solved with supervised learners.
      ◮ Lexicon features can provide problem-specific guidance.
      ◮ Naïve Bayes, log-linear models, and SVMs are all linear methods that tend to work reasonably well, given good features and smoothing/regularization.
      ◮ You should have a basic understanding of the tradeoffs in choosing among them.
      ◮ Rumor: random forests are widely used in industry when performance matters more than interpretability.
      ◮ Lots of papers about neural networks, but with hyperparameter tuning applied fairly to linear models, the advantage is not clear (Yogatama et al., 2015).

  19. Readings and Reminders
      ◮ Jurafsky and Martin (2015); Collins (2011)
      ◮ Submit a suggestion for an exam question by Friday at 5pm.

  20. References I
      Michael Collins. The naive Bayes model, maximum-likelihood estimation, and the EM algorithm. 2011. URL http://www.cs.columbia.edu/~mcollins/em.pdf
      Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(5):265–292, 2001.
      Daniel Jurafsky and James H. Martin. Classification: Naive Bayes, logistic regression, sentiment (draft chapter). 2015. URL https://web.stanford.edu/~jurafsky/slp3/7.pdf
      Dani Yogatama, Lingpeng Kong, and Noah A. Smith. Bayesian optimization of text representations. In Proc. of EMNLP, 2015. URL http://www.aclweb.org/anthology/D/D15/D15-1251.pdf
