

SLIDE 1

Filter keywords and majority class strategies for company name disambiguation on Twitter

Damiano Spina, Enrique Amigó and Julio Gonzalo {damiano,enrique,julio}@lsi.uned.es UNED NLP & IR Group

CLEF 2011 Conference September 19-22, Amsterdam

SLIDE 2

SLIDE 3

SLIDE 4

SLIDE 5

Goal

  • Two signals coming from intuition:

– Filter keywords
– Majority class

  • Do they help characterize and solve the problem?

SLIDE 6

WePS-3 Online Reputation Management Task

SLIDE 7

WePS-3 Online Reputation Management Task

SLIDE 8

WePS-3 Online Reputation Management Task

SLIDE 9
  • related tweets=8
  • unrelated tweets=2
  • Related ratio = 8/(8+2) = 0.8

Tweets for query «jaguar»

SLIDE 10
  • related tweets=0
  • unrelated tweets=10
  • Related ratio = 0

Tweets for query «orange»

SLIDE 11
  • related tweets=5
  • unrelated tweets=5
  • Related ratio = 0.5

Tweets for query «apple»
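The related ratio in the three examples above is simply the fraction of gold-annotated tweets labeled «related» for one company query. A minimal sketch in Python (the function name is mine, not from the slides):

```python
def related_ratio(labels):
    """Fraction of a company's tweets annotated as 'related'."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == "related") / len(labels)

# The three queries from the slides:
jaguar = ["related"] * 8 + ["unrelated"] * 2   # ratio 0.8
orange = ["unrelated"] * 10                    # ratio 0.0
apple  = ["related"] * 5 + ["unrelated"] * 5   # ratio 0.5
```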

SLIDE 12

Fingerprint representation

SLIDE 13

Fingerprint representation

SLIDE 14

Fingerprint representation

SLIDE 15

Fingerprint representation

SLIDE 16

WePS-3 Task 2 Systems

SLIDE 17

WePS-3 Task 2 Systems

SLIDE 18

Filter keywords

SLIDE 19

Tweets for query «apple»

SLIDE 20

Tweets for query «apple»

  • positive keyword: store
  • 4 tweets annotated as «related»

SLIDE 21
  • positive keyword: store
  • 4 tweets annotated as «related»
  • negative keyword: eating
  • 2 tweets annotated as «unrelated»

Tweets for query «apple»

SLIDE 22
  • positive keyword: store
  • 4 tweets annotated as «related»
  • negative keyword: eating
  • 2 tweets annotated as «unrelated»
  • Accuracy = 1.0
  • Recall = 60%

Tweets for query «apple»
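The accuracy and recall figures above follow from a simple rule: label any tweet containing a positive keyword as related and any containing a negative keyword as unrelated, leaving the rest undecided; accuracy is then measured over the decided tweets only, and recall is the fraction of tweets decided. A sketch (the helper and the toy tweets are illustrative, not the actual WePS-3 data):

```python
def keyword_filter_scores(tweets, positive, negative):
    """Label tweets containing a positive keyword as related and tweets
    containing a negative keyword as unrelated; skip the rest.
    Returns (accuracy over decided tweets, fraction of tweets decided)."""
    decided = correct = 0
    for text, gold in tweets:
        words = set(text.lower().split())
        if words & positive:
            pred = "related"
        elif words & negative:
            pred = "unrelated"
        else:
            continue  # no filter keyword: leave undecided
        decided += 1
        correct += pred == gold
    accuracy = correct / decided if decided else 0.0
    return accuracy, decided / len(tweets)

# Toy version of the «apple» example: 4 related tweets mention "store",
# 2 unrelated ones mention "eating", 4 carry no filter keyword at all.
tweets = (
    [("long line at the apple store", "related")] * 4
    + [("eating an apple pie", "unrelated")] * 2
    + [("apple released new ads", "related")] * 2
    + [("fresh apple juice", "unrelated")] * 2
)
print(keyword_filter_scores(tweets, {"store"}, {"eating"}))  # (1.0, 0.6)
```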

SLIDE 23

Company name | Positive keywords | Negative keywords
amazon | electronics, books, apparel, computers, buy | river, rainforest, deforestation, bolivian, brazilian
fox | tv, broadcast, shows, episodes, fringe, bones | animal, terrier, hunting, volkswagen, racing
ford | motor, cars, hybrids, crossovers, mondeo, focus, fiesta, prices, dealer, electric | tom, harrison, henry, glenn, gucci

Manual keywords (perfect for a Web user)

SLIDE 24

Company name | Positive keywords | Negative keywords
amazon | sale, books, deal, deals, gift | followdaibosyu, pest, plug, brothers, pirotta
fox | money, weather, leader, denouncing, viewers | megan, matthew, lazy, valley, michael
ford | mustang, focus, hybrid, motor, truck | tom, harrison, rob, bring, coppola

Oracle keywords (perfect on Twitter)

Company name | Positive keywords | Negative keywords
amazon | electronics, books, apparel, computers, buy | river, rainforest, deforestation, bolivian, brazilian
fox | tv, broadcast, shows, episodes, fringe, bones | animal, terrier, hunting, volkswagen, racing
ford | motor, cars, hybrids, crossovers, mondeo, focus, fiesta, prices, dealer, electric | tom, harrison, henry, glenn, gucci

Manual keywords (perfect for a Web user)

SLIDE 25

SLIDE 26

SLIDE 27

Upper bound of Filter Keywords

5 oracle keywords ≈ 30% recall
20 oracle keywords ≈ 50% recall

Oracle keywords

SLIDE 28

Upper bound of Filter Keywords

Manual keywords

– ≈10 per company
– 14.61% recall (vs. 39.97% with 10 oracle keywords)
– 0.86 accuracy

5 oracle keywords ≈ 30% recall
20 oracle keywords ≈ 50% recall

Oracle keywords

SLIDE 29

Upper bound of Filter Keywords

Manual keywords

– ≈10 per company
– 14.61% recall (vs. 39.97% with 10 oracle keywords)
– 0.86 accuracy

5 oracle keywords ≈ 30% recall
20 oracle keywords ≈ 50% recall

Oracle keywords

Twitter ≠ Web

SLIDE 30

Majority Class

SLIDE 31
  • related tweets=8
  • unrelated tweets=2
  • Related ratio = 8/(8+2) = 0.8

Tweets for query «jaguar»

  • Accuracy = 0.80
  • Recall = 100%
SLIDE 32

Upper bound of Majority Class

  • For each test case/company name

– all unrelated or all related

winner-takes-all

SLIDE 33

Upper bound of Majority Class

  • For each test case/company name

– all unrelated or all related

  • Optimal decision

– 0.80 accuracy

winner-takes-all

SLIDE 34

Upper bound of Majority Class

  • For each test case/company name

– all unrelated or all related

  • Optimal decision

– 0.80 accuracy

  • ≈ best manual system (0.83)
  • > best automatic system (0.75)

winner-takes-all
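This winner-takes-all upper bound can be sketched directly: when every tweet of a company gets the same label, the best single choice matches max(r, 1 − r) of the tweets, where r is the company's related ratio, and averaging over test cases yields the optimal 0.80 accuracy. A minimal illustration (the example ratios are the ones from the earlier slides, not the full WePS-3 set):

```python
def winner_takes_all_upper_bound(related_ratios):
    """Optimal accuracy when each company's tweets all receive one label:
    per company, pick whichever label (related/unrelated) fits more tweets,
    then macro-average over companies."""
    per_company = [max(r, 1.0 - r) for r in related_ratios]
    return sum(per_company) / len(per_company)

# jaguar (0.8), orange (0.0) and apple (0.5) from the earlier slides:
print(winner_takes_all_upper_bound([0.8, 0.0, 0.5]))
```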

SLIDE 35

Filter keywords + majority class upper bound

Tweets → Filter keywords (oracle or manual) → Majority Class?

SLIDE 36

(1) winner-takes-all

Tweets → Filter keywords (oracle or manual) → Majority Class

SLIDE 37

(2) winner-takes-remainder

Tweets → Majority Class → Filter keywords (oracle or manual)

SLIDE 38

(3) bootstrapping

Tweets + Filter keywords (oracle or manual) → Machine learning: training

SLIDE 39

(3) bootstrapping

Tweets + Filter keywords (oracle or manual) → Machine learning: training → application
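The slides only name the stages of strategy (3): filter keywords label a seed set, a classifier is trained on it, and the trained model is applied to the remaining tweets. A toy end-to-end sketch under my own assumptions (the word-overlap "learner" is a stand-in for whatever classifier is actually trained, and the keywords/tweets below are invented):

```python
def bootstrap_labels(tweets, positive, negative):
    """(3) bootstrapping: filter keywords label a seed set (training data),
    a classifier is trained on it, and it is then applied to the remaining,
    keyword-free tweets."""
    seed, rest = [], []
    for text in tweets:
        words = set(text.lower().split())
        if words & positive:
            seed.append((text, "related"))
        elif words & negative:
            seed.append((text, "unrelated"))
        else:
            rest.append(text)

    # Toy stand-in learner: remember the vocabulary seen with each label...
    vocab = {"related": set(), "unrelated": set()}
    for text, label in seed:
        vocab[label].update(text.lower().split())

    # ...and label each remaining tweet by its larger vocabulary overlap.
    def predict(text):
        words = set(text.lower().split())
        return max(vocab, key=lambda label: len(words & vocab[label]))

    return seed + [(text, predict(text)) for text in rest]

labeled = bootstrap_labels(
    ["apple store downtown", "eating apple pie",
     "store sale today", "pie recipe with apple"],
    positive={"store"}, negative={"eating"},
)
```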

SLIDE 40

Filter keywords + majority class

SLIDE 41

Filter keywords + majority class

≈ ‘all related’ baseline

SLIDE 42

Filter keywords + majority class baseline

SLIDE 43

Filter keywords + majority class baseline

  • Automatic Discovery of Filter Keywords:

Terms → Keyword Classification → Filter keywords (automatic)
SLIDE 44

Filter keywords + majority class baseline

Terms → Keyword Classification → Filter keywords (automatic)

  • Automatic Discovery of Filter Keywords:

– 13 term features:

  • 3 collection-based features
  • 6 Web-based features
  • 4 expanded-by-co-occurrence features

– 3 classification methods:

  • Machine learning (neural net + all features)
  • Heuristic (2 features: col_c_specificity + cooc_om_assoc)
  • Hybrid (neural net + heuristic's features)
SLIDE 45

Automatic Tweet Classification

[Chart: accuracy of WePS-3 systems (manual and automatic) compared with the Filter keywords + Majority Class baseline; values shown: 0.83, 0.75, 0.73, 0.63, 0.56, 0.48]

SLIDE 46

Conclusions

  • Fingerprint representation

– Behaviour of binary classification systems on skewed datasets
– Baselines independent of corpus

SLIDE 47

Conclusions

  • Fingerprint representation

– Behaviour of binary classification systems on skewed datasets
– Baselines independent of corpus

  • Twitter ≠ Web

– Oracle keywords ≠ Manual keywords

SLIDE 48

Conclusions

  • Fingerprint representation

– Behaviour of binary classification systems on skewed datasets
– Baselines independent of corpus

  • Twitter ≠ Web

– Oracle keywords ≠ Manual keywords

  • Filter keywords & majority class strategies

– Useful signals that help solve the problem
– Both signals alone already give competitive performance

SLIDE 49

Filter keywords and majority class strategies for company name disambiguation on Twitter

CLEF 2011 Conference September 19-22, Amsterdam

Damiano Spina, Enrique Amigó and Julio Gonzalo {damiano,enrique,julio}@lsi.uned.es UNED NLP & IR Group