
Algorithms for NLP CS 11-711, Fall 2020, Lecture 2: Linear text classification



  1. Algorithms for NLP CS 11-711 · Fall 2020 Lecture 2: Linear text classification Emma Strubell

  2. Let’s try this again… Emma (she/her), Yulia (she/her), Bob (he/him), Sanket (he/him), Han (he/him), Jiateng (he/him)

  3. Outline ■ Basic representations of text data for classification ■ Four linear classifiers ■ Naïve Bayes ■ Perceptron ■ Large-margin (support vector machine; SVM) ■ Logistic regression

  4. Text classification: problem definition ■ Given a text w = (w_1, w_2, …, w_T) ∈ V* ■ Choose a label y ∈ Y ■ For example: ■ Sentiment analysis: Y = { positive, negative, neutral } ■ Toxic comment classification: Y = { toxic, non-toxic } ■ Language identification: Y = { Mandarin, English, Spanish, … } ■ Example: w = "The drinks were strong but the fish tacos were bland" (w_1 … w_10), y = negative

  5. How to represent text for classification? One choice of R: bag-of-words ■ Sequence length T can be different for every sentence/document ■ The bag-of-words is a fixed-length vector of word counts: for w = "The drinks were strong but the fish tacos were bland", x has a count for every vocabulary word (e.g. 2 for "the", 2 for "were", 1 for "drinks", 1 for "bland", and 0 for words that do not appear) ■ The length of x is equal to the size of the vocabulary, V ■ For each x there may be many possible w (the representation ignores word order)
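
As a concrete illustration (not from the slides), a bag-of-words vector can be built by counting tokens against a fixed vocabulary. The toy vocabulary and whitespace tokenizer below are assumptions for the sketch.

      from collections import Counter

      import numpy as np

      # Hypothetical toy vocabulary; in practice it is built from the training corpus.
      vocab = ["the", "drinks", "were", "strong", "but", "fish", "tacos", "bland"]
      word_to_index = {word: i for i, word in enumerate(vocab)}

      def bag_of_words(text):
          """Map a text to a fixed-length vector of word counts (length = V)."""
          counts = Counter(text.lower().split())
          x = np.zeros(len(vocab))
          for word, count in counts.items():
              if word in word_to_index:  # out-of-vocabulary words are dropped
                  x[word_to_index[word]] = count
          return x

      x = bag_of_words("The drinks were strong but the fish tacos were bland")
      # x[word_to_index["the"]] == 2.0, x[word_to_index["bland"]] == 1.0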

  6. Linear classification on bag-of-words ■ Let ψ(x, y) score the compatibility of bag-of-words x and label y. Then: ŷ = argmax_y ψ(x, y) ■ In a linear classifier this scoring function has the simple form: ψ(x, y) = θ · f(x, y) = Σ_j θ_j × f_j(x, y), where θ is a vector of weights and f is a feature function

  7. Feature functions ■ In classification, the feature function is usually a simple combination of x and y, such as: f_j(x, y) = x_fantastic if y = positive, and 0 otherwise ■ If we have K labels, this corresponds to column vectors that look like: f(x, y = 1) = [x_0, x_1, …, x_|V|, 0, 0, …, 0]^T, where the trailing zeros fill the remaining (K − 1) × V positions

  8. Feature functions ■ In classification, the feature function is usually a simple combination of x and y, such as: f_j(x, y) = x_fantastic if y = positive, and 0 otherwise ■ If we have K labels, this corresponds to column vectors that look like: f(x, y = 1) = [x_0, x_1, …, x_|V|, 0, 0, …, 0]^T and f(x, y = 2) = [0, 0, …, 0, x_0, x_1, …, x_|V|, 0, 0, …, 0]^T, with V leading zeros and (K − 2) × V trailing zeros in the second case

  9. Feature functions ■ In classification, the feature function is usually a simple combination of x and y, such as: f_j(x, y) = x_bland if y = negative, and 0 otherwise ■ If we have K labels, this corresponds to column vectors that look like: f(x, y = 1) = [x_0, x_1, …, x_|V|, 0, 0, …, 0]^T, f(x, y = 2) = [0, 0, …, 0, x_0, x_1, …, x_|V|, 0, 0, …, 0]^T, …, f(x, y = K) = [0, 0, …, 0, x_0, x_1, …, x_|V|]^T, where the last vector has (K − 1) × V leading zeros
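
A minimal sketch of this label-offset feature function, matching the vector form used with np.dot on slide 11 (slide 10 instead iterates over a dict of feature counts). The label list is a hypothetical example, and bag_of_words refers to the sketch above.

      import numpy as np

      labels = ["positive", "negative", "neutral"]  # K hypothetical labels

      def feature_function(x, y):
          """Copy the bag-of-words counts x into the block of a K*V vector
          corresponding to label y; every other block stays zero."""
          V, K = len(x), len(labels)
          f = np.zeros(K * V)
          block = labels.index(y)
          f[block * V:(block + 1) * V] = x
          return f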

  10. Linear classification in Python ■ x: the bag-of-words count vector from slide 5 ■ θ: a weight vector of length K × V, e.g. θ = [-0.16, -1.66, -1.55, 0.23, 0.17, -3.43, 0.18, -2.08, -1.46, 0.13, 1.47, -0.06, 1.84, …, 0.36]

      def compute_score(x, y, weights):
          total = 0
          for feature, count in feature_function(x, y).items():
              total += weights[feature] * count
          return total

  11. Linear classification in Python ■ The same score computed as a dot product, with feature_function returning a vector: θ = [-1.13, -0.37, 0.97, 0.58, -1.46, -1, -0.49, 2.35, 0.49, -0.34, 0.69, 0.87, 0.36, …, -0.26] (length K × V)

      import numpy as np

      def compute_score(x, y, weights):
          return np.dot(weights, feature_function(x, y))
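
Putting the pieces together, prediction follows the argmax rule from slide 6. This is a sketch assuming the compute_score, feature_function, and labels defined above.

      def predict(x, weights):
          """Return the label with the highest compatibility score psi(x, y)."""
          return max(labels, key=lambda y: compute_score(x, y, weights))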

  12. Ok, but how to obtain θ? ■ The learning problem is to find the right weights θ. ■ The rest of this lecture will cover four supervised learning algorithms: ■ Naïve Bayes ■ Perceptron ■ Large-margin (support vector machine) ■ Logistic regression ■ All these methods assume a labeled dataset of N examples: { (x^(i), y^(i)) } for i = 1, …, N

  13. Probabilistic classification ■ Naïve Bayes is a probabilistic classifier. It takes the following strategy: ■ Define a probability model p(x, y) ■ Estimate the parameters of the probability model by maximum likelihood, i.e. by maximizing the likelihood of the dataset ■ Set the scoring function equal to the log-probability: ψ(x, y) = log p(x, y) = log p(y | x) + C, where C is constant in y. This ensures that: ŷ = argmax_y p(y | x)

  14. A probability model for text classification ■ First, assume each instance (each (x, y) pair) is independent of the others: p(x^(1:N), y^(1:N)) = ∏_{i=1}^{N} p(x^(i), y^(i)) ■ Apply the chain rule of probability: p(x, y) = p(x | y) × p(y) ■ Define the parametric form of each probability: p(y) = Categorical(μ), p(x | y) = Multinomial(φ, T) ■ The multinomial is a distribution over vectors of counts ■ The parameters μ and φ are vectors of probabilities
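
A small sketch of the joint log-probability under this model. The arrays mu (label probabilities) and phi (per-label word probabilities, one row per label) are assumed, and y is an integer label index.

      import numpy as np

      def log_joint(x, y, mu, phi):
          """log p(x, y) = log p(x | y) + log p(y), dropping the multinomial
          count-ordering coefficient, which does not depend on y."""
          return np.dot(x, np.log(phi[y])) + np.log(mu[y])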

  15. The multinomial distribution ■ Suppose the word bland has probability φ_j. What is the probability that this word appears 3 times? ■ Each word's probability is exponentiated by its count: Multinomial(x; φ, T) = [(Σ_{j=1}^{V} x_j)! / ∏_{j=1}^{V} (x_j!)] × ∏_{j=1}^{V} φ_j^{x_j} ■ The coefficient is the count of the number of possible orderings of x. Crucially, it does not depend on the frequency parameter φ.
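
For completeness, a sketch of the multinomial log-probability that mirrors the formula above (it is not code from the slides); scipy's gammaln supplies the log-factorials, since log k! = gammaln(k + 1).

      import numpy as np
      from scipy.special import gammaln

      def multinomial_logpmf(x, phi):
          """Log of the multinomial pmf: log ordering coefficient + sum_j x_j * log(phi_j)."""
          x = np.asarray(x, dtype=float)
          log_coef = gammaln(x.sum() + 1) - gammaln(x + 1).sum()
          return log_coef + np.dot(x, np.log(phi))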

  16. Naïve Bayes text classification ■ Naïve Bayes can be formulated in our linear classification framework by setting θ equal to the log parameters: θ is a K × (V + 1) matrix whose row for label y is [log φ_{y,w_1}, log φ_{y,w_2}, …, log φ_{y,w_V}, log μ_y] ■ Then ψ(x, y) = θ · f(x, y) = log p(x | y) + log p(y), where f(x, y) is extended to include an "offset" 1 for each possible label after the word counts.
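
A sketch of this construction, assuming phi has shape (K, V) and mu has shape (K,) as in the earlier sketches.

      import numpy as np

      def naive_bayes_weights(phi, mu):
          """Stack log word probabilities and a log-prior column into a K x (V + 1) theta."""
          return np.hstack([np.log(phi), np.log(mu)[:, None]])

      def extended_features(x):
          """Append the offset feature 1 after the word counts."""
          return np.append(x, 1.0)

      # Score for label y: theta[y] . extended_features(x) = log p(x | y) + log p(y),
      # up to the count-ordering coefficient, which is constant in y.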

  17. Estimating Naïve Bayes ■ In relative frequency estimation, the parameters are set to empirical frequencies: φ̂_{y,j} = Σ_{i: y^(i) = y} x_j^(i) / Σ_{j'=1}^{V} Σ_{i: y^(i) = y} x_{j'}^(i) = count(y, j) / Σ_{j'=1}^{V} count(y, j'), and μ̂_y = count(y) / Σ_{y'} count(y') ■ This turns out to be identical to the maximum likelihood estimate (yay): φ̂, μ̂ = argmax_{φ, μ} ∏_{i=1}^{N} p(x^(i), y^(i)) = argmax_{φ, μ} Σ_{i=1}^{N} log p(x^(i), y^(i))
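
A sketch of relative frequency estimation from a matrix X of count vectors (shape N x V, one row per example) and a numpy array Y of integer labels; the names and shapes are assumptions.

      import numpy as np

      def estimate_naive_bayes(X, Y, num_labels):
          """Relative frequency estimates of phi (K x V) and mu (K,)."""
          K, V = num_labels, X.shape[1]
          phi = np.zeros((K, V))
          mu = np.zeros(K)
          for y in range(K):
              counts = X[Y == y].sum(axis=0)  # count(y, j) for every word j
              phi[y] = counts / counts.sum()
              mu[y] = (Y == y).sum()
          return phi, mu / mu.sum()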

  18. Smoothing, bias, variance ■ To deal with low counts, it can be helpful to smooth probabilities: φ̂_{y,j} = (α + count(y, j)) / (Vα + Σ_{j'=1}^{V} count(y, j')) ■ Smoothing introduces bias, moving the parameters away from their maximum-likelihood estimates. ■ But it reduces variance, the extent to which the parameters depend on the idiosyncrasies of a finite dataset. ■ The smoothing term α is a hyperparameter that must be tuned on a development set.
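
The smoothed estimate drops into the previous sketch as a one-line change; here counts is the count(y, j) vector for a single label and alpha is the smoothing hyperparameter.

      def smoothed_phi(counts, alpha):
          """Add-alpha smoothing: (alpha + count(y, j)) / (V * alpha + sum_j' count(y, j'))."""
          V = len(counts)
          return (alpha + counts) / (V * alpha + counts.sum())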
