Part-of-Speech Tagging for Twitter:
Annotation, Features, and Experiments
presented by:
Pragati Shah Sally Gao Kennan Grant
Overview
1. Introduction
2. Problem
3. Methodology
4. Results
5. Extensions
1. Introduction
Goals:
○ Enable richer text analysis of Twitter and similar social media datasets
○ Provide a case study in how to rapidly engineer a core NLP system for a new and idiosyncratic dataset
Results:
○ ~90% accuracy on the test corpus
○ An openly accessible annotated corpus and trained POS tagger
2. Problem
Twitter has 328 million monthly active users and is a fruitful source of user-generated content. However, POS tagging for Twitter is challenging: tweets are conversational, full of nonstandard spelling, and contain Twitter-specific tokens (hashtags, at-mentions, URLs, emoticons) that standard taggers were not designed to handle.
3. Methodology
◎ Develop tag set and manually annotate corpus
◎ Create additional features to incorporate into the model
◎ Train a Conditional Random Field (CRF) tagger
◎ 1,827 manually tagged tweets
◎ Cross-validate and compare tagging accuracy against the Stanford tagger
Aim: Develop an intuitive tag set to maximize tagging consistency
Steps:
1. Design a coarse tag set: {standard tags} + {Twitter-specific tags}.
2. Tokenize with a Twitter tokenizer, and tag with the Stanford POS tagger.
3. Correct the automatic predictions of Step 2 with manual annotation.
4. Revise tokenization and tagging guidelines.
5. Correct the annotations from Step 3 under the revised guidelines.
6. Calculate inter-annotator agreement.
7. Make a final sweep to correct errors.
Cohen’s Kappa (κ)
◎ Measures inter-rater reliability
◎ i.e., the agreement between two raters who each classify N items into C mutually exclusive categories
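As a reminder (the standard definition, not specific to this paper), κ corrects raw agreement for the agreement expected by chance:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```

where $p_o$ is the observed proportion of agreement and $p_e$ is the agreement expected by chance. For example, $p_o = 0.92$ with $p_e = 0.10$ gives $\kappa \approx 0.91$.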
In the paper, κ = 0.914
Final Tagging Scheme: 25 tags
◎ Standard POS tags: nouns, pronouns, verbs, adjectives, etc.
◎ Combined POS tags: {nominal, proper noun} × {verbal, possessive}
◎ Twitter/online-specific tags: hashtags (#), at-mentions (@), URLs & email addresses, emoticons, and discourse markers
◎ Miscellaneous category tag (G): multiword abbreviations, partial words, artifacts of tokenization errors, miscellaneous symbols, possessive endings
Tag  Description                                                                Example
S    Nominal + possessive                                                       someone’s
^    Proper noun                                                                usa
M    Proper noun + verbal                                                       Mark’ll
!    Interjection                                                               lol, haha, yea
#    Hashtag*                                                                   #acl
@    At-mention                                                                 @BarackObama
E    Emoticon                                                                   :-)
G    Other: abbreviations, foreign words, possessive endings, symbols, garbage  ily [I love you], ♫
*35% of hashtags were tagged with something other than #
◎ Discriminative undirected probabilistic graphical model
○ Models global dependencies across the whole tag sequence
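For reference, a linear-chain CRF (the standard form for sequence tagging, and the model family used here) defines the probability of a tag sequence y given the token sequence x as:

```latex
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T} \exp\!\Big(\sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, \mathbf{x}, t)\Big)
```

where the $f_k$ are feature functions (such as those listed below), $\lambda_k$ are their learned weights, and $Z(\mathbf{x})$ is the normalizing partition function.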
CRFs enable the incorporation of arbitrary local features. Base features (sketched below):
◎ A feature for each word type
◎ Features checking whether the word contains digits or hyphens
◎ Suffix features
◎ Features looking at capitalization patterns
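A minimal sketch of what token-level base features of this kind could look like (the function and feature names are illustrative, not the authors' implementation):

```python
def base_features(tokens, t):
    """Illustrative base features for the token at position t."""
    word = tokens[t]
    feats = {
        f"word={word.lower()}": 1.0,                         # word-type feature
        "has_digit": float(any(c.isdigit() for c in word)),  # contains a digit
        "has_hyphen": float("-" in word),                    # contains a hyphen
        "init_cap": float(word[:1].isupper()),               # capitalization patterns
        "all_caps": float(word.isupper()),
    }
    for n in (1, 2, 3):                                      # suffix features
        if len(word) >= n:
            feats[f"suffix{n}={word[-n:].lower()}"] = 1.0
    return feats

print(base_features(["Yess", "its", "official", "Nintendo", "announced"], 3))
```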
◎ TwOrth: Twitter orthography.
○ Regex-style rules to detect @-mentions, hashtags, and URLs.
◎ Names: Frequently capitalized tokens.
○ Twitter users are inconsistent in their use of capitalization.
○ Likelihood of capitalization = (capitalized occurrences of a token) / (total occurrences of that token). (See the sketch after this list.)
◎ TagDict: Traditional tag dictionary.
○ Features for POS tags from a traditional tag dictionary (PTB).
◎ DistSim: Distributional similarity.
○ Representation of term similarity via distributional features.
○ Used 1.9 million tokens from 134,000 unlabeled tweets for the 10,000 most common terms.
◎ Metaph: Phonetic normalization.
○ Used the Metaphone algorithm (Philips, 1990) to create a coarse phonetic normalization, e.g. “lmao,” “lmaoo,” and “lmaooo” all map to LM.
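A minimal sketch of how the capitalization-likelihood statistic behind the Names feature could be estimated from an unlabeled corpus (the function name and toy data are illustrative, not the paper's code):

```python
from collections import Counter

def capitalization_likelihood(corpus_tokens):
    """Estimate, per word type, the fraction of occurrences that are capitalized."""
    cap, total = Counter(), Counter()
    for tok in corpus_tokens:
        key = tok.lower()
        total[key] += 1
        if tok[:1].isupper():
            cap[key] += 1
    return {w: cap[w] / total[w] for w in total}

toy = ["Obama", "said", "obama", "will", "visit", "Austin", "austin", "Austin"]
likelihood = capitalization_likelihood(toy)
print(likelihood["obama"])   # 0.5  -> likely a name, inconsistently capitalized
print(likelihood["said"])    # 0.0  -> ordinary word
```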
4. Results
◎ Training set: 1,000 tweets (14,542 tokens)
◎ Development set: 327 tweets (4,770 tokens)
◎ Test set: 500 tweets (7,124 tokens)
◎ Trained the Stanford tagger on the labeled data as a baseline
◎ Tuned the Gaussian prior on the development data
◎ In addition to the tagger with the full feature set, performed feature ablation experiments (removing one feature category at a time; see the sketch below)
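A compact sketch of such an ablation loop (the train_tagger and accuracy functions are hypothetical placeholders standing in for whatever training and evaluation code is available):

```python
# Feature groups named in the paper; train_tagger/accuracy are hypothetical.
FEATURE_GROUPS = ["TwOrth", "Names", "TagDict", "DistSim", "Metaph"]

def ablation_study(train_data, dev_data, train_tagger, accuracy):
    """Retrain with each feature group removed and report the accuracy change."""
    full = accuracy(train_tagger(train_data, features=FEATURE_GROUPS), dev_data)
    results = {"full": full}
    for group in FEATURE_GROUPS:
        kept = [g for g in FEATURE_GROUPS if g != group]
        acc = accuracy(train_tagger(train_data, features=kept), dev_data)
        results[f"-{group}"] = acc
        print(f"without {group}: {acc:.3f} (delta vs. full: {acc - full:+.3f})")
    return results
```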
Relative error reduction of 25% compared to the Stanford tagger
[Results figure: tagging accuracy of the CRF tagger with the full feature set vs. the Stanford tagger, plus feature ablation experiments]
◎ Despite the NAMES feature, the system struggles to identify proper nouns with nonstandard capitalization
◎ The recall of proper nouns is only 71%
◎ The system also struggles with the miscellaneous category, G: accuracy of only 26%
5. Extensions
◎ Cited 739 times according to Google Scholar
◎ Owoputi et al. (2013):
○ Developed improved annotation guidelines
○ Improved the annotations in the Gimpel et al. corpus
○ Raised Twitter tagging accuracy from 90% to 93% (state-of-the-art results) using large-scale unsupervised word clustering and new lexical features
◎ Mohammad et al. (2013):
○ Used the Gimpel et al. POS tagger to build a state-of-the-art Twitter sentiment classifier
◎ Lamb et al. (2013):
○ Used the Gimpel et al. POS tagger to track the spread of flu infections on Twitter