SLIDE 1
SFU NatLangLab
Natural Language Processing
Angel Xuan Chang
angelxuanchang.github.io/nlp-class
adapted from lecture slides by Anoop Sarkar, Simon Fraser University
2020-01-23
SLIDE 2
SLIDE 3
◮ Classification tasks in NLP
◮ Naive Bayes Classifier
◮ Log-linear models
SLIDE 4
Sentiment classification: Movie reviews
◮ neg: unbelievably disappointing
◮ pos: Full of zany characters and richly applied satire, and some great plot twists
◮ pos: this is the greatest screwball comedy ever filmed
◮ neg: It was pathetic. The worst part about it was the boxing scenes.
SLIDE 5
Intent Detection
◮ ADDR CHANGE: I just moved and want to change my address.
◮ ADDR CHANGE: Please help me update my address.
◮ FILE CLAIM: I just got into a terrible accident and I want to file a claim.
◮ CLOSE ACCOUNT: I’m moving and I want to disconnect my service.
SLIDE 6
Prepositional Phrases
◮ noun attach: I bought the shirt with pockets
◮ verb attach: I bought the shirt with my credit card
◮ noun attach: I washed the shirt with mud
◮ verb attach: I washed the shirt with soap
◮ Attachment depends on the meaning of the entire sentence; it needs world knowledge, etc.
◮ Maybe there is a simpler solution: we can attempt to solve it using heuristics or associations between words
SLIDE 7
Ambiguity Resolution: Prepositional Phrases in English
◮ Learning Prepositional Phrase Attachment: Annotated Data

v          n1           p     n2        Attachment
join       board        as    director  V
is         chairman     of    N.V.      N
using      crocidolite  in    filters   V
bring      attention    to    problem   V
is         asbestos     in    products  N
making     paper        for   filters   N
including  three        with  cancer    N
. . .      . . .        . . . . . .     . . .
SLIDE 8
Prepositional Phrase Attachment
Method                             Accuracy
Always noun attachment             59.0
Most likely for each preposition   72.2
Average human (4 head words only)  88.2
Average human (whole sentence)     93.2
SLIDE 9
Back-off Smoothing
◮ Random variable a represents the attachment: a = n1 (noun attach) or a = v (verb attach), a two-class classification
◮ We want to compute the probability of noun attachment: p(a = n1 | v, n1, p, n2)
◮ The probability of verb attachment is 1 − p(a = n1 | v, n1, p, n2)
SLIDE 10
Back-off Smoothing
1. If f(v, n1, p, n2) > 0 and p̂ ≠ 0.5:
   p̂(a = n1 | v, n1, p, n2) = f(a = n1, v, n1, p, n2) / f(v, n1, p, n2)
2. Else if f(v, n1, p) + f(v, p, n2) + f(n1, p, n2) > 0 and p̂ ≠ 0.5:
   p̂(a = n1 | v, n1, p, n2) = [f(a = n1, v, n1, p) + f(a = n1, v, p, n2) + f(a = n1, n1, p, n2)] / [f(v, n1, p) + f(v, p, n2) + f(n1, p, n2)]
3. Else if f(v, p) + f(n1, p) + f(p, n2) > 0:
   p̂(a = n1 | v, n1, p, n2) = [f(a = n1, v, p) + f(a = n1, n1, p) + f(a = n1, p, n2)] / [f(v, p) + f(n1, p) + f(p, n2)]
4. Else if f(p) > 0 (choose the attachment based on the preposition alone):
   p̂(a = n1 | v, n1, p, n2) = f(a = n1, p) / f(p)
5. Else: p̂(a = n1 | v, n1, p, n2) = 1.0 (default to noun attachment)
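As a concrete sketch, here is the back-off procedure above in Python, assuming the counts f(·) have been collected into Counter dictionaries keyed by word tuples (all function and variable names are illustrative, not from the slides):

from collections import Counter

def train_pp_counts(examples):
    # examples: iterable of (v, n1, p, n2, attach) tuples, attach in {'N', 'V'}
    f = Counter()    # f(context): count of each back-off context
    f_n = Counter()  # f(a = n1, context): same contexts, noun attachments only
    for v, n1, p, n2, attach in examples:
        contexts = [(v, n1, p, n2),
                    (v, n1, p), (v, p, n2), (n1, p, n2),
                    (v, p), (n1, p), (p, n2),
                    (p,)]
        for c in contexts:
            f[c] += 1
            if attach == 'N':
                f_n[c] += 1
    return f, f_n

def p_noun_attach(f, f_n, v, n1, p, n2):
    # Back-off levels 1-4: the quadruple, then triples, then pairs, then p alone.
    levels = [[(v, n1, p, n2)],
              [(v, n1, p), (v, p, n2), (n1, p, n2)],
              [(v, p), (n1, p), (p, n2)],
              [(p,)]]
    for depth, contexts in enumerate(levels):
        denom = sum(f[c] for c in contexts)
        if denom == 0:
            continue                  # no counts at this level: back off
        p_hat = sum(f_n[c] for c in contexts) / denom
        if depth < 2 and p_hat == 0.5:
            continue                  # tie at levels 1-2: back off further
        return p_hat
    return 1.0                        # level 5: default to noun attachment

Noun attachment would then be chosen whenever the returned probability is at least 0.5.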
SLIDE 11
Prepositional Phrase Attachment: Results
◮ Collins and Brooks (1995): 84.5% accuracy with the use of some limited word classes for dates, numbers, etc.
◮ Toutanova, Manning, and Ng (2004): sophisticated smoothing model for PP attachment; 86.18% with words and stems, 87.54% with word classes
◮ Merlo, Crocker, and Berthouzoz (1997): test on multiple PPs, generalizing disambiguation of 1 PP to 2-3 PPs; 1 PP: 84.3%, 2 PPs: 69.6%, 3 PPs: 43.6%
SLIDE 12
Natural Language Processing
Angel Xuan Chang
angelxuanchang.github.io/nlp-class
adapted from lecture slides by Anoop Sarkar, Danqi Chen, and Karthik Narasimhan
Simon Fraser University
January 23, 2020
Part 2: Probabilistic Classifiers
SLIDE 13
Classification Task
◮ Input:
  ◮ A document d
  ◮ A set of classes C = {c1, c2, . . . , cm}
◮ Output: Predicted class c for document d
◮ Example:
  ◮ neg: unbelievably disappointing
  ◮ pos: this is the greatest screwball comedy ever filmed
SLIDE 14
Supervised learning: Let’s use statistics!
◮ Inputs:
  ◮ Set of m classes C = {c1, c2, . . . , cm}
  ◮ Set of n labeled documents: {(d1, c1), (d2, c2), . . . , (dn, cn)}
◮ Output: Trained classifier F : d → c
◮ What form should F take?
◮ How to learn F?
SLIDE 15
Types of supervised classifiers
SLIDE 16
◮ Classification tasks in NLP
◮ Naive Bayes Classifier
◮ Log-linear models
SLIDE 17
Naive Bayes Classifier
◮ x is the input, represented as d independent features fj, 1 ≤ j ≤ d
◮ y is the output classification
◮ P(y | x) = P(y) · P(x | y) / P(x)   (Bayes rule)
◮ P(x | y) = ∏_{j=1}^{d} P(fj | y)   (naive independence assumption)
◮ P(y | x) ∝ P(y) · ∏_{j=1}^{d} P(fj | y)
◮ We can ignore P(x) in the above equation because it is a constant scaling factor for each y
SLIDE 18
Naive Bayes Classifier for text classification
◮ For text classification: input x = document d = (w1, . . . , wk)
◮ Use as our features the words wj, 1 ≤ j ≤ |V|, where V is our vocabulary
◮ c is the output classification
◮ Assume that the position of each word is irrelevant and that the words are conditionally independent given class c:
  P(w1, w2, . . . , wk | c) = P(w1 | c) P(w2 | c) · · · P(wk | c)
◮ Maximum a posteriori estimate:
  cMAP = arg max_c P(c) P(d | c) = arg max_c P̂(c) ∏_{i=1}^{k} P̂(wi | c)
SLIDE 19
Bag of words
SLIDE 20
Estimating probabilities
Maximum likelihood estimate:
  P̂(cj) = Count(cj) / n
  P̂(wi | cj) = Count(wi, cj) / Σ_{w∈V} Count(w, cj)
Smoothing (add-α):
  P̂(wi | cj) = (Count(wi, cj) + α) / Σ_{w∈V} (Count(w, cj) + α)
SLIDE 21
Overall process
Input: Set of labeled documents {(di, ci)}_{i=1}^{n}
◮ Compute vocabulary V of all words
◮ Calculate P̂(cj) = Count(cj) / n
◮ Calculate P̂(wi | cj) = (Count(wi, cj) + α) / Σ_{w∈V} (Count(w, cj) + α)
◮ Prediction: given document d = (w1, . . . , wk),
  cMAP = arg max_c P̂(c) ∏_{i=1}^{k} P̂(wi | c)
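A minimal end-to-end sketch of this process, assuming documents arrive pre-tokenized as lists of words (the class name, and the choice to skip unseen words at prediction time, are assumptions of the sketch); it computes the arg max in log space, as the summary slide later recommends:

import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha                         # add-alpha smoothing constant

    def train(self, docs, labels):
        # docs: list of token lists; labels: parallel list of class labels.
        n = len(docs)
        self.vocab = {w for d in docs for w in d}  # vocabulary V
        class_counts = Counter(labels)
        word_counts = defaultdict(Counter)         # class -> word -> count
        for d, c in zip(docs, labels):
            word_counts[c].update(d)
        self.log_prior = {c: math.log(k / n) for c, k in class_counts.items()}
        self.log_lik = {}
        for c in class_counts:
            total = sum(word_counts[c].values()) + self.alpha * len(self.vocab)
            self.log_lik[c] = {w: math.log((word_counts[c][w] + self.alpha) / total)
                               for w in self.vocab}

    def predict(self, doc):
        # MAP class: argmax_c  log P(c) + sum_i log P(wi | c).
        return max(self.log_prior,
                   key=lambda c: self.log_prior[c] + sum(
                       self.log_lik[c][w] for w in doc if w in self.vocab))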
SLIDE 22
Naive Bayes Example
SLIDE 23
Tokenization
Tokenization matters - it can affect your vocabulary
◮ aren’t can become: aren’t | arent | are n’t | aren t, depending on the tokenizer (see the sketch below)
◮ Emails, URLs, phone numbers, dates, emoticons
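As a small illustration, each variant in the first bullet corresponds to a different tokenization rule (the specific regular expressions here are assumptions, not a prescribed tokenizer):

import re

text = "aren't"

print(text.split())                            # ["aren't"]: keep clitics attached
print(text.replace("'", "").split())           # ['arent']: delete apostrophes
print(re.sub(r"n't\b", " n't", text).split())  # ['are', "n't"]: Treebank-style split
print(re.findall(r"[a-z]+", text))             # ['aren', 't']: letters only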
SLIDE 24
Features
◮ Remember: Naive Bayes can use any set of features
◮ Capitalization, subword features (ends with -ing), etc.
◮ Domain knowledge is crucial for performance
Top features for spam detection
[Alqatawna et al, IJCNSS 2015]
SLIDE 25
Evaluation
◮ Table of predictions vs. true labels (binary classification)
◮ Ideally we want predictions that agree with the true labels (true positives and true negatives)
SLIDE 26
Evaluation Metrics
Accuracy = (TP + TN) / Total = 200 / 250 = 80%
SLIDE 28
Precision and Recall
◮ Precision = TP / (TP + FP): of the items predicted positive, the fraction that are truly positive
◮ Recall = TP / (TP + FN): of the truly positive items, the fraction that are predicted positive
SLIDE 29
Precision and Recall
(figure from Wikipedia)
SLIDE 30
F-Score
◮ Weighted harmonic mean of precision P and recall R: Fβ = (1 + β²) · P · R / (β² · P + R)
◮ F1 (β = 1): F1 = 2 · P · R / (P + R)
SLIDE 31
Choosing Beta
◮ β > 1 weights recall more heavily (e.g., F2); β < 1 weights precision more heavily (e.g., F0.5)
SLIDE 32
Aggregating scores
◮ We have Precision, Recall, F1 for each class
◮ How do we combine them into an overall score?
◮ Macro-average: compute the metric for each class, then average the per-class scores
◮ Micro-average: pool the predictions for all classes and evaluate jointly
SLIDE 33
Macro vs Micro average
◮ Macro-averaged precision: (0.5 + 0.9)/2 = 0.7
◮ Micro-averaged precision: 100/120 ≈ 0.83
◮ The micro-averaged score is dominated by the score on common classes
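The slide's numbers follow from per-class confusion counts. The sketch below uses counts chosen to reproduce them (class 1: TP=10, FP=10; class 2: TP=90, FP=10; these counts are an assumption consistent with the per-class precisions of 0.5 and 0.9):

# Per-class confusion counts (assumed, chosen to match the slide's numbers).
classes = {
    "c1": {"tp": 10, "fp": 10},   # precision 0.5
    "c2": {"tp": 90, "fp": 10},   # precision 0.9
}

def precision(tp, fp):
    return tp / (tp + fp)

# Macro-average: compute per class, then average the per-class scores.
macro = sum(precision(c["tp"], c["fp"]) for c in classes.values()) / len(classes)

# Micro-average: pool the counts across classes, then compute once.
tp = sum(c["tp"] for c in classes.values())
fp = sum(c["fp"] for c in classes.values())
micro = precision(tp, fp)

print(macro)  # 0.7
print(micro)  # 0.833... (dominated by the common class c2)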
SLIDE 34
Validation
◮ Choose a metric: Precision / Recall / F1
◮ Optimize for the metric on a Validation (aka Development) set
◮ Finally, evaluate on an ‘unseen’ Test set
◮ Cross-validation:
  ◮ Repeatedly sample several train-val splits
  ◮ Reduces bias due to sampling errors
SLIDE 35
Advantages of Naive Bayes
◮ Very fast, low storage requirements
◮ Robust to irrelevant features
◮ Very good in domains with many equally important features
◮ Optimal if the independence assumptions hold
◮ A good, dependable baseline for text classification
SLIDE 36
When to use Naive Bayes
◮ Small data sizes: Naive Bayes is great! Rule-based classifiers can work well too
◮ Medium-sized datasets: more advanced classifiers might perform better (SVM, logistic regression)
◮ Large datasets: Naive Bayes becomes competitive again (most learned classifiers will work well)
SLIDE 37
Failings of Naive Bayes (1)
Independence assumptions are too strong
◮ XOR problem: Naive Bayes cannot learn the decision boundary
◮ Both variables are jointly required to predict the class; the independence assumption is broken!
SLIDE 38
Failings of Naive Bayes (2)
Class Imbalance
◮ One or more classes have more instances than others
◮ Data skew causes NB to prefer one class over the other
SLIDE 39
Failings of Naive Bayes (3)
Weight magnitude errors
◮ Classes with larger weights are preferred
◮ 10 documents with class=MA and “Boston” occurring once each
◮ 10 documents with class=CA and “San Francisco” occurring once each
◮ New document d: “Boston Boston Boston San Francisco San Francisco”
  P(class = CA | d) > P(class = MA | d)
◮ “San Francisco” is two tokens, so each occurrence contributes twice to the CA score, giving the CA class the larger total weight
SLIDE 40
Naive Bayes Summary
◮ Domain knowledge is crucial to selecting good features
◮ Handle class imbalance by re-weighting the classes
◮ Use log-scale operations instead of multiplying probabilities:
  cNB = arg max_{cj ∈ C} [ log P(cj) + Σ_i log P(xi | cj) ]
◮ The model is now just a max over a sum of weights
SLIDE 41
◮ Classification tasks in NLP
◮ Naive Bayes Classifier
◮ Log-linear models
SLIDE 42
Log-linear model
◮ The model classifies input x into output labels y ∈ Y
◮ Let there be m features fk(x, y), for k = 1, . . . , m
◮ Define a parameter vector v ∈ R^m
◮ Each (x, y) pair is mapped to a score: s(x, y) = Σ_k vk · fk(x, y)
◮ Using inner-product notation: s(x, y) = v · f(x, y)
◮ To get a probability from the score, renormalize:
  Pr(y | x; v) = exp(s(x, y)) / Σ_{y′∈Y} exp(s(x, y′))
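A direct transcription of this renormalization into Python (names are illustrative; f(x, y) is assumed to return a dense list of feature values, one per weight):

import math

def prob(v, f, x, y, labels):
    # Pr(y | x; v) for a log-linear model.
    def score(y_):                    # s(x, y) = v . f(x, y)
        return sum(vk * fk for vk, fk in zip(v, f(x, y_)))
    scores = {y_: score(y_) for y_ in labels}
    # Subtract the max score before exponentiating, for numerical stability.
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    return math.exp(scores[y] - m) / z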
SLIDE 43
Log-linear model
◮ The name ‘log-linear model’ comes from:
  log Pr(y | x; v) = v · f(x, y) − log Σ_{y′} exp(v · f(x, y′))
                     (linear term)  (normalization term)
◮ Once the weights v are learned, we can perform predictions using these features
◮ The goal: find v that maximizes the log-likelihood L(v) of the labeled training set containing (xi, yi) for i = 1 . . . n:
  L(v) = Σ_i log Pr(yi | xi; v) = Σ_i v · f(xi, yi) − Σ_i log Σ_{y′} exp(v · f(xi, y′))
SLIDE 44
Log-linear model
◮ Maximize:
  L(v) = Σ_i v · f(xi, yi) − Σ_i log Σ_{y′} exp(v · f(xi, y′))
◮ Calculate the gradient:
  dL(v)/dv = Σ_i f(xi, yi) − Σ_i [1 / Σ_{y′′} exp(v · f(xi, y′′))] Σ_{y′} f(xi, y′) exp(v · f(xi, y′))
           = Σ_i f(xi, yi) − Σ_i Σ_{y′} f(xi, y′) · exp(v · f(xi, y′)) / Σ_{y′′} exp(v · f(xi, y′′))
           = Σ_i f(xi, yi) − Σ_i Σ_{y′} f(xi, y′) · Pr(y′ | xi; v)
             (observed counts)  (expected counts)
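The last line, observed counts minus expected counts, translates directly into code. A self-contained sketch (again assuming f(x, y) returns a dense list of feature values; all names are illustrative):

import math

def gradient(v, f, data, labels):
    # dL/dv = sum_i f(xi, yi) - sum_i sum_y' f(xi, y') Pr(y' | xi; v)
    grad = [0.0] * len(v)
    for x, y in data:
        # Observed counts: f(xi, yi).
        for k, fk in enumerate(f(x, y)):
            grad[k] += fk
        # Model distribution Pr(y' | xi; v), computed stably.
        scores = {yp: sum(vk * fk for vk, fk in zip(v, f(x, yp))) for yp in labels}
        mx = max(scores.values())
        z = sum(math.exp(s - mx) for s in scores.values())
        # Expected counts: f(xi, y') weighted by Pr(y' | xi; v).
        for yp in labels:
            p = math.exp(scores[yp] - mx) / z
            for k, fk in enumerate(f(x, yp)):
                grad[k] -= p * fk
    return grad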
SLIDE 45
Gradient ascent
◮ Init: v(0) = 0
◮ t ← 0
◮ Iterate until convergence:
  ◮ Calculate: Δ = dL(v)/dv evaluated at v = v(t)
  ◮ Find β∗ = arg max_β L(v(t) + βΔ)   (a line search along the gradient direction)
  ◮ Set v(t+1) ← v(t) + β∗Δ
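A sketch of this loop, with one simplification: the exact line search for β∗ is replaced by a fixed step size (grad_fn stands for any gradient routine, e.g. the gradient sketch above):

def gradient_ascent(grad_fn, dim, step=0.1, tol=1e-6, max_iters=1000):
    # Maximize L(v) by repeatedly stepping in the direction of its gradient.
    v = [0.0] * dim                               # v(0) = 0
    for _ in range(max_iters):
        delta = grad_fn(v)                        # dL/dv at v(t)
        v = [vk + step * dk for vk, dk in zip(v, delta)]
        if max(abs(dk) for dk in delta) < tol:    # convergence test
            break
    return v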
SLIDE 46
Learning the weights v: Generalized Iterative Scaling
f# = max_{x,y} Σ_j fj(x, y)   (the maximum possible total feature value; needed for scaling)
Initialize v(0)
For each iteration t:
  expected[j] ← 0 for j = 1 .. number of features
  For i = 1 to |training data|:
    For each label y and each feature fj:
      expected[j] += fj(xi, y) · P(y | xi; v(t)) / |training data|
  For each feature fj:
    observed[j] = Σ_{(x,y)} fj(x, y) · c(x, y) / |training data|   (c(x, y): count of (x, y) in the training data)
  For each feature fj:
    v(t+1)_j ← v(t)_j · (observed[j] / expected[j])^(1/f#)
cf. Goodman, NIPS ’01
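Read as code, one GIS round might look like the sketch below. Two caveats: the slide's multiplicative update applies to weights of the form μj = exp(vj), so on the log-space weights used here it becomes additive; and GIS formally assumes a correction feature so that feature totals always equal f#, which is omitted here. All names are illustrative:

import math

def gis_iteration(v, f, data, labels, f_sharp):
    # v: current log-space weights; f(x, y): dense feature list;
    # f_sharp: max over (x, y) of sum_j fj(x, y), as defined above.
    n = len(data)
    observed = [0.0] * len(v)
    expected = [0.0] * len(v)
    for x, y in data:
        for j, fj in enumerate(f(x, y)):
            observed[j] += fj / n                 # empirical feature counts
        scores = {yp: sum(vk * fk for vk, fk in zip(v, f(x, yp))) for yp in labels}
        mx = max(scores.values())
        z = sum(math.exp(s - mx) for s in scores.values())
        for yp in labels:
            p = math.exp(scores[yp] - mx) / z     # P(yp | x; v)
            for j, fj in enumerate(f(x, yp)):
                expected[j] += p * fj / n         # model's expected counts
    # v_j <- v_j + (1/f#) log(observed[j] / expected[j]);
    # features that never fire in the data are left unchanged.
    return [vj + math.log(o / e) / f_sharp if o > 0 else vj
            for vj, o, e in zip(v, observed, expected)]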
SLIDE 47