

SLIDE 1

Filter keywords and majority class strategies for company name disambiguation on Twitter

Damiano Spina, Enrique Amigó and Julio Gonzalo {damiano,enrique,julio}@lsi.uned.es UNED NLP & IR Group

CLEF 2011 Conference September 19-22, Amsterdam

SLIDE 2

SLIDE 3

SLIDE 4

SLIDE 5

Goal

  • Two signals coming from intuition:

– Filter keywords
– Majority class

  • Do they help characterize and solve the problem?

SLIDE 6

WePS-3 Online Reputation Management Task

SLIDE 7

WePS-3 Online Reputation Management Task

SLIDE 8

WePS-3 Online Reputation Management Task

SLIDE 9
  • related tweets=8
  • unrelated tweets=2
  • Related ratio = 8/(8+2) = 0.8

Tweets for query «jaguar»

SLIDE 10
  • related tweets=0
  • unrelated tweets=10
  • Related ratio = 0

Tweets for query «orange»

SLIDE 11
  • related tweets=5
  • unrelated tweets=5
  • Related ratio = 0.5

Tweets for query «apple»
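The related ratio in the three examples above is simply the fraction of gold-annotated tweets labeled «related» for one company query. A minimal sketch in Python (the function name is mine, not from the slides):

```python
def related_ratio(labels):
    """Fraction of a company's tweets annotated as 'related'."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == "related") / len(labels)

# The three queries from the slides:
jaguar = ["related"] * 8 + ["unrelated"] * 2   # ratio 0.8
orange = ["unrelated"] * 10                    # ratio 0.0
apple  = ["related"] * 5 + ["unrelated"] * 5   # ratio 0.5
```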

SLIDE 12

Fingerprint representation

SLIDE 13

Fingerprint representation

SLIDE 14

Fingerprint representation

SLIDE 15

Fingerprint representation

SLIDE 16

WePS-3 Task 2 Systems

SLIDE 17

WePS-3 Task 2 Systems

SLIDE 18

Filter keywords

SLIDE 19

Tweets for query «apple»

SLIDE 20

Tweets for query «apple»

  • positive keyword: store
  • 4 tweets annotated as «related»

SLIDE 21
  • positive keyword: store
  • 4 tweets annotated as «related»
  • negative keyword: eating
  • 2 tweets annotated as «unrelated»

Tweets for query «apple»

SLIDE 22
  • positive keyword: store
  • 4 tweets annotated as «related»
  • negative keyword: eating
  • 2 tweets annotated as «unrelated»
  • Accuracy = 1.0
  • Recall = 60%

Tweets for query «apple»
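The accuracy and recall figures above follow from a simple rule: label any tweet containing a positive keyword as related and any containing a negative keyword as unrelated, leaving the rest undecided; accuracy is then measured over the decided tweets only, and recall is the fraction of tweets decided. A sketch (the helper and the toy tweets are illustrative, not the actual WePS-3 data):

```python
def keyword_filter_scores(tweets, positive, negative):
    """Label tweets containing a positive keyword as related and tweets
    containing a negative keyword as unrelated; skip the rest.
    Returns (accuracy over decided tweets, fraction of tweets decided)."""
    decided = correct = 0
    for text, gold in tweets:
        words = set(text.lower().split())
        if words & positive:
            pred = "related"
        elif words & negative:
            pred = "unrelated"
        else:
            continue  # no filter keyword: leave undecided
        decided += 1
        correct += pred == gold
    accuracy = correct / decided if decided else 0.0
    return accuracy, decided / len(tweets)

# Toy version of the «apple» example: 4 related tweets mention "store",
# 2 unrelated ones mention "eating", 4 carry no filter keyword at all.
tweets = (
    [("long line at the apple store", "related")] * 4
    + [("eating an apple pie", "unrelated")] * 2
    + [("apple released new ads", "related")] * 2
    + [("fresh apple juice", "unrelated")] * 2
)
print(keyword_filter_scores(tweets, {"store"}, {"eating"}))  # (1.0, 0.6)
```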

SLIDE 23

Company name | Positive keywords | Negative keywords
amazon | electronics, books, apparel, computers, buy | river, rainforest, deforestation, bolivian, brazilian
fox | tv, broadcast, shows, episodes, fringe, bones | animal, terrier, hunting, volkswagen, racing
ford | motor, cars, hybrids, crossovers, mondeo, focus, fiesta, prices, dealer, electric | tom, harrison, henry, glenn, gucci

Manual keywords (perfect for a Web user)

SLIDE 24

Company name | Positive keywords | Negative keywords
amazon | sale, books, deal, deals, gift | followdaibosyu, pest, plug, brothers, pirotta
fox | money, weather, leader, denouncing, viewers | megan, matthew, lazy, valley, michael
ford | mustang, focus, hybrid, motor, truck | tom, harrison, rob, bring, coppola

Oracle keywords (perfect on Twitter)

Company name | Positive keywords | Negative keywords
amazon | electronics, books, apparel, computers, buy | river, rainforest, deforestation, bolivian, brazilian
fox | tv, broadcast, shows, episodes, fringe, bones | animal, terrier, hunting, volkswagen, racing
ford | motor, cars, hybrids, crossovers, mondeo, focus, fiesta, prices, dealer, electric | tom, harrison, henry, glenn, gucci

Manual keywords (perfect for a Web user)

SLIDE 25

SLIDE 26

SLIDE 27

Upper bound of Filter Keywords

5 oracle keywords ≈ 30% recall
20 oracle keywords ≈ 50% recall

Oracle keywords

SLIDE 28

Upper bound of Filter Keywords

Manual keywords

– ≈10 per company
– 14.61% recall (vs. 39.97% with 10 oracle keywords)
– 0.86 accuracy

5 oracle keywords ≈ 30% recall
20 oracle keywords ≈ 50% recall

Oracle keywords

SLIDE 29

Upper bound of Filter Keywords

Manual keywords

– ≈10 per company
– 14.61% recall (vs. 39.97% with 10 oracle keywords)
– 0.86 accuracy

5 oracle keywords ≈ 30% recall
20 oracle keywords ≈ 50% recall

Oracle keywords

Twitter ≠ Web

SLIDE 30

Majority Class

SLIDE 31
  • related tweets=8
  • unrelated tweets=2
  • Related ratio = 8/(8+2) = 0.8

Tweets for query «jaguar»

  • Accuracy = 0.80
  • Recall = 100%
SLIDE 32

Upper bound of Majority Class

  • For each test case/company name

– all unrelated or all related

winner-takes-all

SLIDE 33

Upper bound of Majority Class

  • For each test case/company name

– all unrelated or all related

  • Optimal decision

– 0.80 accuracy

winner-takes-all

SLIDE 34

Upper bound of Majority Class

  • For each test case/company name

– all unrelated or all related

  • Optimal decision

– 0.80 accuracy

  • ≈ best manual system (0.83)
  • > best automatic system (0.75)

winner-takes-all
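This winner-takes-all upper bound can be sketched directly: when every tweet of a company gets the same label, the best single choice matches max(r, 1 − r) of the tweets, where r is the company's related ratio, and averaging over test cases yields the optimal 0.80 accuracy. A minimal illustration (the example ratios are the ones from the earlier slides, not the full WePS-3 set):

```python
def winner_takes_all_upper_bound(related_ratios):
    """Optimal accuracy when each company's tweets all receive one label:
    per company, pick whichever label (related/unrelated) fits more tweets,
    then macro-average over companies."""
    per_company = [max(r, 1.0 - r) for r in related_ratios]
    return sum(per_company) / len(per_company)

# jaguar (0.8), orange (0.0) and apple (0.5) from the earlier slides:
print(winner_takes_all_upper_bound([0.8, 0.0, 0.5]))
```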

SLIDE 35

Filter keywords + majority class upper bound

Tweets → Filter keywords (oracle or manual) → Majority Class?

SLIDE 36

(1) winner-takes-all

Tweets → Filter keywords (oracle or manual) → Majority Class

SLIDE 37

(2) winner-takes-remainder

Tweets → Majority Class → Filter keywords (oracle or manual)

SLIDE 38

(3) bootstrapping

Tweets + Filter keywords (oracle or manual) → Machine learning: training

SLIDE 39

(3) bootstrapping

Tweets + Filter keywords (oracle or manual) → Machine learning: training → application
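The slides only name the stages of strategy (3): filter keywords label a seed set, a classifier is trained on it, and the trained model is applied to the remaining tweets. A toy end-to-end sketch under my own assumptions (the word-overlap "learner" is a stand-in for whatever classifier is actually trained, and the keywords/tweets below are invented):

```python
def bootstrap_labels(tweets, positive, negative):
    """(3) bootstrapping: filter keywords label a seed set (training data),
    a classifier is trained on it, and it is then applied to the remaining,
    keyword-free tweets."""
    seed, rest = [], []
    for text in tweets:
        words = set(text.lower().split())
        if words & positive:
            seed.append((text, "related"))
        elif words & negative:
            seed.append((text, "unrelated"))
        else:
            rest.append(text)

    # Toy stand-in learner: remember the vocabulary seen with each label...
    vocab = {"related": set(), "unrelated": set()}
    for text, label in seed:
        vocab[label].update(text.lower().split())

    # ...and label each remaining tweet by its larger vocabulary overlap.
    def predict(text):
        words = set(text.lower().split())
        return max(vocab, key=lambda label: len(words & vocab[label]))

    return seed + [(text, predict(text)) for text in rest]

labeled = bootstrap_labels(
    ["apple store downtown", "eating apple pie",
     "store sale today", "pie recipe with apple"],
    positive={"store"}, negative={"eating"},
)
```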

SLIDE 40

Filter keywords + majority class

SLIDE 41

Filter keywords + majority class

≈ ‘all related’ baseline

SLIDE 42

Filter keywords + majority class baseline

SLIDE 43

Filter keywords + majority class baseline

  • Automatic Discovery of Filter Keywords:

Terms → Keyword Classification → Filter keywords (automatic)
SLIDE 44

Filter keywords + majority class baseline

Terms → Keyword Classification → Filter keywords (automatic)

  • Automatic Discovery of Filter Keywords:

– 13 term features:

  • 3 collection-based features
  • 6 Web-based features
  • 4 expanded-by-co-occurrence features

– 3 classification methods:

  • Machine learning (neural net + all features)
  • Heuristic (2 features: col_c_specificity + cooc_om_assoc)
  • Hybrid (neural net + heuristic's features)
SLIDE 45

Automatic Tweet Classification

[Chart: accuracy of WePS-3 systems (manual and automatic) compared with the Filter keywords + Majority Class baseline; values shown: 0.83, 0.75, 0.73, 0.63, 0.56, 0.48]

SLIDE 46

Conclusions

  • Fingerprint representation

– Behaviour of binary classification systems on skewed datasets
– Baselines independent of corpus

SLIDE 47

Conclusions

  • Fingerprint representation

– Behaviour of binary classification systems on skewed datasets
– Baselines independent of corpus

  • Twitter ≠ Web

– Oracle keywords ≠ Manual keywords

SLIDE 48

Conclusions

  • Fingerprint representation

– Behaviour of binary classification systems on skewed datasets
– Baselines independent of corpus

  • Twitter ≠ Web

– Oracle keywords ≠ Manual keywords

  • Filter keywords & majority class strategies

– Useful signals that help solve the problem
– Both signals alone already give competitive performance

SLIDE 49

Filter keywords and majority class strategies for company name disambiguation on Twitter

CLEF 2011 Conference September 19-22, Amsterdam

Damiano Spina, Enrique Amigó and Julio Gonzalo {damiano,enrique,julio}@lsi.uned.es UNED NLP & IR Group