
Natural Language Processing Angel Xuan Chang - PowerPoint PPT Presentation



  1. SFU NatLangLab. Natural Language Processing. Angel Xuan Chang, angelxuanchang.github.io/nlp-class. Adapted from lecture slides by Anoop Sarkar. Simon Fraser University, 2020-01-23.

  2. Natural Language Processing. Angel Xuan Chang, angelxuanchang.github.io/nlp-class. Adapted from lecture slides by Anoop Sarkar. Simon Fraser University, January 23, 2020. Part 1: Classification tasks in NLP.

  3. Classification tasks in NLP; Naive Bayes Classifier; Log-linear models

  4. Sentiment classification: Movie reviews
     ◮ neg: unbelievably disappointing
     ◮ pos: Full of zany characters and richly applied satire, and some great plot twists
     ◮ pos: this is the greatest screwball comedy ever filmed
     ◮ neg: It was pathetic. The worst part about it was the boxing scenes.

  5. Intent Detection
     ◮ ADDR_CHANGE: I just moved and want to change my address.
     ◮ ADDR_CHANGE: Please help me update my address.
     ◮ FILE_CLAIM: I just got into a terrible accident and I want to file a claim.
     ◮ CLOSE_ACCOUNT: I'm moving and I want to disconnect my service.

  6. Prepositional Phrases
     ◮ noun attach: I bought the shirt with pockets
     ◮ verb attach: I bought the shirt with my credit card
     ◮ noun attach: I washed the shirt with mud
     ◮ verb attach: I washed the shirt with soap
     ◮ Attachment depends on the meaning of the entire sentence; it needs world knowledge, etc.
     ◮ Maybe there is a simpler solution: we can attempt to solve it using heuristics or associations between words

  7. Ambiguity Resolution: Prepositional Phrases in English
     ◮ Learning Prepositional Phrase Attachment: Annotated Data

        v          n1           p     n2         Attachment
        join       board        as    director   V
        is         chairman     of    N.V.       N
        using      crocidolite  in    filters    V
        bring      attention    to    problem    V
        is         asbestos     in    products   N
        making     paper        for   filters    N
        including  three        with  cancer     N
        ...

  8. Prepositional Phrase Attachment

     Method                               Accuracy
     Always noun attachment               59.0
     Most likely for each preposition     72.2
     Average human (4 head words only)    88.2
     Average human (whole sentence)       93.2

  9. Back-off Smoothing
     ◮ Random variable a represents attachment.
     ◮ a = n1 or a = v (two-class classification)
     ◮ We want to compute the probability of noun attachment: p(a = n1 | v, n1, p, n2).
     ◮ The probability of verb attachment is 1 - p(a = n1 | v, n1, p, n2).

  10. Back-off Smoothing
      1. If f(v, n1, p, n2) > 0 and p̂ ≠ 0.5:
         p̂(a = n1 | v, n1, p, n2) = f(a = n1, v, n1, p, n2) / f(v, n1, p, n2)
      2. Else if f(v, n1, p) + f(v, p, n2) + f(n1, p, n2) > 0 and p̂ ≠ 0.5:
         p̂(a = n1 | v, n1, p, n2) = [f(a = n1, v, n1, p) + f(a = n1, v, p, n2) + f(a = n1, n1, p, n2)]
                                    / [f(v, n1, p) + f(v, p, n2) + f(n1, p, n2)]
      3. Else if f(v, p) + f(n1, p) + f(p, n2) > 0:
         p̂(a = n1 | v, n1, p, n2) = [f(a = n1, v, p) + f(a = n1, n1, p) + f(a = n1, p, n2)]
                                    / [f(v, p) + f(n1, p) + f(p, n2)]
      4. Else if f(p) > 0 (try choosing attachment based on the preposition alone):
         p̂(a = n1 | v, n1, p, n2) = f(a = n1, p) / f(p)
      5. Else: p̂(a = n1 | v, n1, p, n2) = 1.0
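The cascade above is straightforward to implement by accumulating counts over sub-tuples at training time. Below is a minimal Python sketch of this back-off estimator; the data format (a list of ((v, n1, p, n2), attachment) pairs with labels "N"/"V") and all function names are my assumptions, not from the slides.

```python
from collections import Counter

def backoff_levels(v, n1, p, n2):
    """Sub-tuples consulted at each back-off level; the preposition is always kept."""
    return [
        [(v, n1, p, n2)],
        [(v, n1, p), (v, p, n2), (n1, p, n2)],
        [(v, p), (n1, p), (p, n2)],
        [(p,)],
    ]

def train(data):
    """Count occurrences and noun-attachment occurrences of every sub-tuple."""
    total, noun = Counter(), Counter()
    for (v, n1, p, n2), a in data:
        for level in backoff_levels(v, n1, p, n2):
            for ctx in level:
                total[ctx] += 1
                if a == "N":
                    noun[ctx] += 1
    return total, noun

def p_noun_attach(total, noun, v, n1, p, n2):
    """Back off from the full 4-tuple down to the bare preposition."""
    for i, level in enumerate(backoff_levels(v, n1, p, n2)):
        denom = sum(total[ctx] for ctx in level)
        if denom > 0:
            p_hat = sum(noun[ctx] for ctx in level) / denom
            # Per the slide, the first two levels also back off on a 0.5 tie.
            if p_hat != 0.5 or i >= 2:
                return p_hat
    return 1.0  # step 5: default to noun attachment
```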

  11. Prepositional Phrase Attachment: Results
      ◮ Collins and Brooks (1995): 84.5% accuracy, using some limited word classes for dates, numbers, etc.
      ◮ Toutanova, Manning, and Ng (2004): sophisticated smoothing model for PP attachment; 86.18% with words and stems, 87.54% with word classes
      ◮ Merlo, Crocker, and Berthouzoz (1997): test on multiple PPs, generalizing disambiguation from 1 PP to 2-3 PPs; 1 PP: 84.3%, 2 PPs: 69.6%, 3 PPs: 43.6%

  12. Natural Language Processing. Angel Xuan Chang, angelxuanchang.github.io/nlp-class. Adapted from lecture slides by Anoop Sarkar, Danqi Chen, and Karthik Narasimhan. Simon Fraser University, January 23, 2020. Part 2: Probabilistic Classifiers.

  13. Classification Task
      ◮ Input:
        ◮ a document d
        ◮ a set of classes C = {c1, c2, ..., cm}
      ◮ Output: predicted class c for document d
      ◮ Example:
        ◮ neg: unbelievably disappointing
        ◮ pos: this is the greatest screwball comedy ever filmed

  14. Supervised learning: Let's use statistics!
      ◮ Inputs:
        ◮ set of m classes C = {c1, c2, ..., cm}
        ◮ set of n labeled documents: {(d1, c1), (d2, c2), ..., (dn, cn)}
      ◮ Output: trained classifier F : d → c
      ◮ What form should F take?
      ◮ How to learn F?

  15. Types of supervised classifiers [figure]

  16. Classification tasks in NLP; Naive Bayes Classifier; Log-linear models

  17. Naive Bayes Classifier
      ◮ x is the input, represented as d independent features fj, 1 ≤ j ≤ d
      ◮ y is the output classification
      ◮ P(y | x) = P(y) · P(x | y) / P(x)   (Bayes rule)
      ◮ P(x | y) = ∏_{j=1..d} P(fj | y)
      ◮ P(y | x) ∝ P(y) · ∏_{j=1..d} P(fj | y)
      ◮ We can ignore P(x) in the above equation because it is a constant scaling factor for each y.

  18. Naive Bayes Classifier for text classification
      ◮ For text classification: input x = document d = (w1, ..., wk)
      ◮ Use as our features the words wj, 1 ≤ j ≤ |V|, where V is our vocabulary
      ◮ c is the output classification
      ◮ Assume that the position of each word is irrelevant and that the words are conditionally independent given class c:
        P(w1, w2, ..., wk | c) = P(w1 | c) P(w2 | c) ... P(wk | c)
      ◮ Maximum a posteriori estimate:
        ĉ_MAP = argmax_c P̂(c) P̂(d | c) = argmax_c P̂(c) ∏_{i=1..k} P̂(wi | c)

  19. Bag of words [figure]

  20. Estimating probabilities
      Maximum likelihood estimates:
        P̂(cj) = Count(cj) / n
        P̂(wi | cj) = Count(wi, cj) / Σ_{w∈V} Count(w, cj)
      Add-α smoothing:
        P̂(wi | cj) = (Count(wi, cj) + α) / Σ_{w∈V} (Count(w, cj) + α)

  21. Overall process
      Input: set of labeled documents {(di, ci)}, i = 1..n
      ◮ Compute vocabulary V of all words
      ◮ Calculate P̂(cj) = Count(cj) / n
      ◮ Calculate P̂(wi | cj) = (Count(wi, cj) + α) / Σ_{w∈V} (Count(w, cj) + α)
      ◮ Prediction: given document d = (w1, ..., wk),
        ĉ_MAP = argmax_c P̂(c) ∏_{i=1..k} P̂(wi | c)
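A minimal Python sketch of this overall process, assuming documents arrive pre-tokenized as lists of words; all names here are illustrative, not from the slides. It stores log-probabilities, anticipating the log-space trick from the summary slide later on.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Estimate priors and add-alpha smoothed log-likelihoods."""
    n = len(docs)
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / n for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        counts[c].update(doc)
    vocab = {w for c in classes for w in counts[c]}
    log_lik = {}
    for c in classes:
        denom = sum(counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((counts[c][w] + alpha) / denom) for w in vocab}
    return prior, log_lik, vocab

def predict_nb(doc, prior, log_lik, vocab):
    """Return argmax_c of log P(c) + sum_i log P(wi | c)."""
    def score(c):
        return math.log(prior[c]) + sum(
            log_lik[c][w] for w in doc if w in vocab)  # skip unseen words
    return max(prior, key=score)

# Tiny usage example with the review snippets from the earlier slide:
docs = [["unbelievably", "disappointing"],
        ["the", "greatest", "screwball", "comedy"]]
prior, log_lik, vocab = train_nb(docs, ["neg", "pos"])
print(predict_nb(["greatest", "comedy"], prior, log_lik, vocab))  # pos
```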

  22. Naive Bayes Example [figure]

  23. Tokenization
      Tokenization matters: it can affect your vocabulary.
      ◮ "aren't" can be tokenized as: aren't | arent | are n't | aren t
      ◮ Other tricky cases: emails, URLs, phone numbers, dates, emoticons
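A tiny illustration (mine, not from the slides) of how those tokenizer choices change the vocabulary:

```python
# How different tokenizers split "aren't"; each choice yields different
# vocabulary entries. The variant names here are my own labels.
variants = {
    "keep contraction":    ["aren't"],
    "strip punctuation":   ["arent"],
    "Penn Treebank style": ["are", "n't"],
    "split on non-alpha":  ["aren", "t"],
}
for name, tokens in variants.items():
    print(f"{name:20s} -> {tokens}")
```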

  24. Features
      ◮ Remember: Naive Bayes can use any set of features
      ◮ Capitalization, subword features (e.g., ends with -ing), etc.
      ◮ Domain knowledge is crucial for performance
      Top features for spam detection [Alqatawna et al., IJCNSS 2015]

  25. Evaluation
      ◮ Table of predictions (binary classification): the confusion matrix of TP, FP, FN, TN [figure]
      ◮ Ideally we want all predictions to be correct (only TP and TN)

  26. Evaluation Metrics
      Accuracy = (TP + TN) / Total = 200 / 250 = 80%

  27. Evaluation Metrics
      Accuracy = (TP + TN) / Total = 200 / 250 = 80%
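In code, accuracy is just a ratio over confusion-matrix counts. The individual TP/TN/FP/FN values below are hypothetical, chosen only to reproduce the slide's 200 correct out of 250:

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / total predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical split consistent with 200 correct out of 250:
print(accuracy(tp=150, tn=50, fp=30, fn=20))  # 0.8
```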

  28. Precision and Recall [figure]

  29. Precision and Recall [figure from Wikipedia]

  30. F-Score [figure]

  31. Choosing Beta [figure]
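The figures on these two slides define the F-score. As a reminder of the standard formulas (not transcribed from the slides): F_beta = (1 + beta^2) · P · R / (beta^2 · P + R), where beta > 1 weights recall more heavily and beta < 1 weights precision. A short sketch with hypothetical counts:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

p, r = precision(tp=40, fp=10), recall(tp=40, fn=20)  # hypothetical counts
print(f_beta(p, r, beta=1.0))  # balanced F1
print(f_beta(p, r, beta=2.0))  # recall-weighted F2
```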

  32. Aggregating scores
      ◮ We have precision, recall, and F1 for each class
      ◮ How do we combine them into an overall score?
      ◮ Macro-average: compute the metric for each class, then average
      ◮ Micro-average: collect predictions for all classes and jointly evaluate

  33. Macro vs Micro average
      ◮ Macro-averaged precision: (0.5 + 0.9) / 2 = 0.7
      ◮ Micro-averaged precision: 100 / 120 ≈ 0.83
      ◮ The micro-averaged score is dominated by the score on common classes
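The per-class counts behind these numbers are not on the slide; the sketch below uses a hypothetical split (a rare class at precision 0.5, a common class at 0.9) chosen to reproduce them:

```python
# Hypothetical per-class confusion counts that reproduce the slide's numbers.
per_class = {"rare":   {"tp": 10, "fp": 10},   # precision 0.5
             "common": {"tp": 90, "fp": 10}}   # precision 0.9

# Macro-average: compute precision per class, then take the plain mean.
macro = sum(c["tp"] / (c["tp"] + c["fp"])
            for c in per_class.values()) / len(per_class)

# Micro-average: pool all counts first, then compute a single precision.
tp = sum(c["tp"] for c in per_class.values())
fp = sum(c["fp"] for c in per_class.values())
micro = tp / (tp + fp)

print(macro)  # (0.5 + 0.9) / 2 = 0.7
print(micro)  # 100 / 120 ≈ 0.83, dominated by the common class
```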

  34. Validation
      ◮ Choose a metric: precision / recall / F1
      ◮ Optimize for the metric on a validation (aka development) set
      ◮ Finally, evaluate on an 'unseen' test set
      ◮ Cross-validation (see the sketch below):
        ◮ repeatedly sample several train-val splits
        ◮ reduces bias due to sampling errors
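A minimal sketch of the repeated-split idea, assuming caller-supplied train_fn and eval_fn (both hypothetical names, e.g. the Naive Bayes trainer above plus an accuracy function):

```python
import random

def cross_validate(examples, train_fn, eval_fn, k=5, val_frac=0.2, seed=0):
    """Average a metric over k random train/validation splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(k):
        data = list(examples)
        rng.shuffle(data)
        cut = int(len(data) * (1 - val_frac))
        train, val = data[:cut], data[cut:]
        model = train_fn(train)
        scores.append(eval_fn(model, val))
    return sum(scores) / len(scores)
```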

  35. Advantages of Naive Bayes
      ◮ Very fast, low storage requirements
      ◮ Robust to irrelevant features
      ◮ Very good in domains with many equally important features
      ◮ Optimal if the independence assumptions hold
      ◮ A good, dependable baseline for text classification

  36. When to use Naive Bayes
      ◮ Small datasets: Naive Bayes is great! Rule-based classifiers can work well too
      ◮ Medium-sized datasets: more advanced classifiers might perform better (SVM, logistic regression)
      ◮ Large datasets: Naive Bayes becomes competitive again (most learned classifiers will work well)

  37. Failings of Naive Bayes (1): Independence assumptions are too strong
      ◮ XOR problem: Naive Bayes cannot learn the correct decision boundary
      ◮ Both variables are jointly required to predict the class, so the independence assumption is broken

  38. Failings of Naive Bayes (2): Class imbalance
      ◮ One or more classes have more instances than others
      ◮ Data skew causes NB to prefer one class over the other

  39. Failings of Naive Bayes (3): Weight magnitude errors
      ◮ Classes with larger weights are preferred
      ◮ 10 documents with class=MA, each containing "Boston" once
      ◮ 10 documents with class=CA, each containing "San Francisco" once
      ◮ New document d: "Boston Boston Boston San Francisco San Francisco"
      ◮ Result: P(class = CA | d) > P(class = MA | d), since each mention of "San Francisco" contributes two word tokens, so four tokens vote for CA versus three for MA

  40. Naive Bayes Summary
      ◮ Domain knowledge is crucial for selecting good features
      ◮ Handle class imbalance by re-weighting classes
      ◮ Use log-scale operations instead of multiplying probabilities:
        c_NB = argmax_{cj ∈ C} [ log P(cj) + Σ_i log P(xi | cj) ]
      ◮ The model is now just a max over sums of weights
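A standalone illustration of why the log-space form matters (the Naive Bayes sketch after slide 21 already uses it): multiplying many small probabilities underflows to zero in floating point, while summing their logs stays finite. The numbers below are illustrative, not from the slides.

```python
import math

probs = [1e-5] * 80          # 80 features, each with probability 1e-5

direct = 1.0
for p in probs:
    direct *= p              # 1e-400 underflows float64 to 0.0

log_score = sum(math.log(p) for p in probs)   # about -921.03, still usable

print(direct)     # 0.0 -- every class would tie at zero
print(log_score)  # finite; argmax over log-scores is safe
```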
