


slide-1
SLIDE 1

Tulip: Lightweight Entity Recognition and Disambiguation Using Wikipedia-Based Topic Centroids

Marek Lipczak Arash Koushkestani Evangelos Milios

slide-2
SLIDE 2

Problem definition

▪ The goal of Entity Recognition and Disambiguation (ERD)
  □ Identify mentions of entities
  □ Link the mentions to a relevant entry in an external knowledge base
  □ The knowledge base is typically a large subset of Wikipedia articles
▪ Example:
  The selling offsets decent earnings from Cisco Systems and Home Depot. Techs fall, led by Microsoft and Intel. Michael Kors rises. Gold and oil slip.

slide-4
SLIDE 4

Recognition and Disambiguation

The selling offsets decent earnings from Cisco Systems and Home Depot. Techs fall, led by Microsoft and Intel. Michael Kors rises. Gold and oil slip.

▪ Recognition
  □ Is this a valid mention of an entity present in the knowledge base?
▪ Disambiguation
  □ Which of the potential entities (senses) is correct?
▪ Default sense: the entity with the largest number of wiki-links that use the mention as the anchor text
  □ Tulip focuses on default sense entities
  □ The main goal is to recognize whether the default sense is consistent with the document
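
The default sense can be derived from link counts alone. A minimal sketch, assuming wiki-links are available as (anchor text, target article) pairs; the function name `default_senses` and the sample links are hypothetical and are not Tulip's code or data:

```python
from collections import Counter

def default_senses(wiki_links):
    """Map each anchor text to the article it links to most often."""
    counts = {}
    for anchor, target in wiki_links:
        counts.setdefault(anchor, Counter())[target] += 1
    # most_common(1) returns the single (target, count) pair with the top count.
    return {anchor: c.most_common(1)[0][0] for anchor, c in counts.items()}

# Hypothetical sample: "Gold" links to the metal twice and to the color once.
links = [
    ("Gold", "Gold"),
    ("Gold", "Gold (color)"),
    ("Gold", "Gold"),
    ("Intel", "Intel"),
]
senses = default_senses(links)
print(senses)
```

Recognition then only has to decide whether this single candidate per mention fits the document, rather than ranking every possible sense.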

slide-5
SLIDE 5

Our background

▪ Visual Text Analytics Lab
  □ Some experience with using ERD systems
  □ No experience implementing ERD systems
▪ Key issue with state-of-the-art systems: obvious false positive mistakes
  □ Visualize Prof. Smith's research interests:
    ▪ Data Mining
    ▪ Machine Learning
    ▪ 50 cent
▪ Our goal: minimize the number of false positives

slide-6
SLIDE 6

Tulip – system overview

▪ Spotter
  □ Find all mentions of entities in the text (Solr Text Tagger)
  □ Special handling for personal names
▪ Recognizer
  □ Retrieve profiles of spotted entities (from Sunflower)
  □ Generate a topic centroid representing the document
  □ Select entities consistent with the document

slide-7
SLIDE 7

Spotter


slide-8
SLIDE 8

Solr Text Tagger

▪ Solr (Lucene) is a text search engine
  □ Indexes textual documents
  □ Retrieves documents for keyword-based queries
▪ Solr Text Tagger
  □ Indexes entity surface forms stored in a lexicon
    ▪ E.g., Baltimore Ravens, Ravens, Baltimore (…)
  □ Uses full-text documents as queries
  □ Finds all entity mentions in the document
  □ Retrieves the mentioned entities (candidate selection)
  □ Implemented based on Solr's Finite State Transducers
    ▪ By David Smiley and Rupert Westenthaler (thanks!)

slide-11
SLIDE 11

Building the lexicon

▪ Three sources of entity surface forms (external datasets)
  □ Entity names (from Freebase)
  □ Wiki-link anchor text (from Wikipedia)
  □ Web anchor text (from Google's Wikilinks corpus)
▪ Special handling of personal names
  □ “Jack” and “London” are not allowed as surface forms for Jack London
  □ Instead, they are indexed as “generic” personal names and are matched only if Jack London is mentioned by his full name
▪ Flagging suspicious surface forms (e.g., “It”, Stephen King's novel)
  □ The stop-word filter marks all stop-words and phrases composed of stop-words (e.g., “This is”)
  □ The Wiktionary filter marks all common nouns, verbs, adjectives, etc. found in Wiktionary
  □ The lower-case filter marks all lower-case words and phrases
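
The three filters can be sketched as simple predicates over a surface form. This is a minimal illustration, not Tulip's implementation: the word sets are tiny hypothetical stand-ins for a real stop-word list and for the common words extracted from Wiktionary, and the case-insensitive whole-form lookup in the Wiktionary filter is an assumption.

```python
# Hypothetical stand-ins: a real system would load a full stop-word list
# and the set of common words extracted from Wiktionary.
STOP_WORDS = {"it", "this", "is", "the", "a", "of"}
WIKTIONARY_COMMON = {"it", "gold", "oil", "fall"}

def suspicion_flags(surface_form):
    """Return the names of the filters that flag this surface form."""
    words = surface_form.split()
    flags = set()
    # Stop-word filter: every word of the phrase is a stop-word.
    if words and all(w.lower() in STOP_WORDS for w in words):
        flags.add("stop-word")
    # Wiktionary filter (assumed: case-insensitive lookup of the whole form).
    if surface_form.lower() in WIKTIONARY_COMMON:
        flags.add("wiktionary")
    # Lower-case filter: the form contains no upper-case letters.
    if words and surface_form == surface_form.lower():
        flags.add("lower-case")
    return flags

for form in ["It", "This is", "oil", "Cisco Systems"]:
    print(form, sorted(suspicion_flags(form)))
```

A flagged form is only marked, not discarded, so a later stage can still accept it if it turns out to be related to the document.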

slide-15
SLIDE 15

Spotter – example

The [1] (...) [97] selling offsets decent earnings from Cisco Systems [1] and Home Depot [1]. Techs fall (1) (...) [7], led by Microsoft [1] (...) [13] and Intel [1] (...) [9]. Michael Kors [1] rises. Gold (1) (...) [31] and oil slip.

▪ Default sense for all mentions (Freebase only)
▪ Default sense for all mentions (Freebase + Wikipedia)
▪ Suspicious mentions removed
▪ How can we remove Michael Kors and bring back Home Depot?
  □ Relatedness of entities to the document

slide-16
SLIDE 16

Recognizer


slide-18
SLIDE 18

Relatedness score

The selling offsets decent earnings from Cisco Systems and Home Depot. Techs fall, led by Microsoft and Intel. Michael Kors rises. Gold and oil slip.

▪ Our solution
  □ Retrieve a profile of every entity mentioned in the text
  □ Aggregate the profiles into a centroid representing the document
  □ Check which entities are coherent with the topics (relatedness score)
  □ How do we create the entity profiles?

How strongly are the mentioned entities related to the document?

slide-19
SLIDE 19

Relatedness – Sunflower

▪ A concept graph based on a unified category graph from 120 Wikipedia language versions
  □ Each language version acts as a witness for the importance of a stored relation
▪ Compact and accurate category profiles for all Wikipedia articles
  □ Removal of unimportant categories
  □ Inference of more general categories

slide-20
SLIDE 20

Sunflower – from graph to term profile

▪ The Sunflower graph is:
  □ Directed
  □ Weighted (importance score)
  □ Sparse (only the k most important links per node)
▪ A category-based profile is a sparse, weighted term vector
  □ All categories at distance < d
  □ Term weights based on edge weights
  □ E.g., k = 3, d = 2
  □ Path weight is the product of edge weights
    ▪ w(Intel → Comp. of US → Ec. of US) = 0.42 × 0.27 ≈ 0.11
  □ Category weight is the sum of path weights
    ▪ w(Ec. of US) = 0.11 + 0.19 = 0.30
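
The two weighting rules can be checked with a short traversal. A minimal sketch, assuming the graph is stored as a dict of outgoing weighted edges and that paths of up to d edges are followed (the worked example reaches Ec. of US over two edges with d = 2). The second path through "Technology companies of US" and its edge weights are hypothetical, chosen only so that its path weight is 0.19 as in the example:

```python
from collections import defaultdict

def category_profile(graph, entity, max_depth):
    """Sum, per category, the weights of all paths of at most max_depth edges."""
    profile = defaultdict(float)
    frontier = [(entity, 1.0)]  # (node, weight of the path that reached it)
    for _ in range(max_depth):
        next_frontier = []
        for node, path_weight in frontier:
            for category, edge_weight in graph.get(node, {}).items():
                w = path_weight * edge_weight      # path weight: product of edges
                profile[category] += w             # category weight: sum of paths
                next_frontier.append((category, w))
        frontier = next_frontier
    return dict(profile)

graph = {
    "Intel": {"Companies of US": 0.42, "Technology companies of US": 0.38},
    "Companies of US": {"Economy of US": 0.27},
    "Technology companies of US": {"Economy of US": 0.50},
}
profile = category_profile(graph, "Intel", max_depth=2)
print(round(profile["Economy of US"], 2))
```

Because only the k strongest links per node are stored and the depth is bounded, each profile stays small enough to be kept as a sparse vector.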

slide-21
SLIDE 21

Topic centroids in Tulip

▪ Retrieve category-based profiles for all default senses (example next slide)

slide-22
SLIDE 22

[Figure: example of category-based profiles; the image is not preserved in this transcript]

slide-26
SLIDE 26

Topic centroids in Tulip

▪ Retrieve category-based profiles for all default senses
▪ Topic Centroid Generation
  □ The centroid is a linear combination of entity profiles
  □ Default senses of non-suspicious mentions only (entity core)
▪ Topic Centroid Refinement
  □ Entities far from the centroid are removed from the core
  □ Cosine similarity with a predefined threshold t_coh = 0.2
▪ Entity Scoring
  □ A relatedness score is assigned to each default sense entity (including suspicious mentions)
▪ System output
  □ Entities with score > t_coh
  □ The entity with the best relatedness score for each mention
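
The generation, refinement, and scoring steps can be sketched over sparse dict vectors. This is a minimal illustration under stated assumptions, not Tulip's implementation: the centroid is an unweighted sum of profiles (one possible linear combination), refinement is a single pass, and all profiles and the function name `recognize` are hypothetical. The per-mention choice of the best-scoring entity is omitted.

```python
import math
from collections import defaultdict

T_COH = 0.2  # coherence threshold quoted on the slide

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(profiles):
    """Unweighted sum of profiles (assumed form of the linear combination)."""
    total = defaultdict(float)
    for profile in profiles:
        for term, w in profile.items():
            total[term] += w
    return dict(total)

def recognize(profiles, suspicious, t_coh=T_COH):
    # Generation: the entity core holds non-suspicious default senses only.
    core = {e for e in profiles if e not in suspicious}
    c = centroid(profiles[e] for e in core)
    # Refinement (single pass, an assumption): drop entities far from the centroid.
    core = {e for e in core if cosine(profiles[e], c) > t_coh}
    c = centroid(profiles[e] for e in core)
    # Scoring: every default sense is scored, including suspicious mentions.
    scores = {e: cosine(p, c) for e, p in profiles.items()}
    return {e: s for e, s in scores.items() if s > t_coh}

# Hypothetical profiles; "Michael Kors" stands for the fashion designer sense.
profiles = {
    "Cisco Systems": {"Technology companies": 0.6, "Economy of US": 0.3},
    "Intel": {"Technology companies": 0.5, "Economy of US": 0.3},
    "Microsoft": {"Technology companies": 0.7, "Economy of US": 0.2},
    "Michael Kors": {"Fashion designers": 0.3},
    "Home Depot": {"Retail companies": 0.4, "Economy of US": 0.5},
}
accepted = recognize(profiles, suspicious={"Home Depot"})
print(sorted(accepted))
```

With these made-up numbers the off-topic sense falls below the threshold during refinement, while the suspicious but on-topic mention is recovered at scoring time because it shares a category with the centroid.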

slide-27
SLIDE 27

Challenge results

▪ Tulip took second place in the long track
  □ Category-based topic centroids: a promising solution for relatedness
  □ Top recall among all submitted systems (?!)
  □ Lowest latency among all submitted systems

slide-29
SLIDE 29

Lightweight ERD

▪ Entity Recognition and Disambiguation is typically just a single step in a more complex document processing system
▪ To be practical, the ERD system has to be lightweight:
  □ Fast: lowest latency among all competing systems, throughput of over 200 documents per minute
  □ Adaptable: both Solr Text Tagger and Sunflower can be easily adapted to changing data
  □ Compact: the full system requires less than 4 GB of operational memory and uses no external data repositories

slide-30
SLIDE 30

The importance of default sense

▪ Analysis of 50 documents with ground-truth data (1166 entities)
▪ 85% of the mentions that can be disambiguated should be disambiguated with the default sense
  □ Another 5% are explicitly disambiguated by another mention in the document (e.g., E72 and Nokia E72)
▪ By focusing on the default sense, Tulip missed < 5% of entities

slide-33
SLIDE 33

Conclusions

▪ Wikipedia-based category profiles can be used to determine the relatedness of an entity to the topics of a document
▪ The small size of category profiles allows the system to represent the aggregated topics of the document in the form of a centroid, which simplifies the recognition process
▪ Pruning suspicious mentions and focusing on default sense entities help Tulip build precise document centroids that can be further used to clean or expand the set of returned entities
▪ The accuracy of extracted entities relies more on the successful recognition of correct entity mentions than on their disambiguation

Project website: http://www.cs.dal.ca/~lipczak/erd/

slide-35
SLIDE 35

Solr Text Tagger

▪ Two-level Finite State Transducer approach
  □ Word to index (each edge is a letter)
  □ Surface form to list of entities (each edge is a word)
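
The two-level idea can be illustrated with plain dictionary tries. This is a simplified stand-in, not Lucene's FST data structure and not the Solr Text Tagger API: level one maps a word to an integer index letter by letter, level two maps a sequence of word indexes to a list of entities. All function names, the whitespace tokenization, and the sample lexicon are assumptions for illustration.

```python
def build_word_trie(words):
    """Level 1: each edge is a letter; a terminal node stores the word index."""
    root = {}
    for index, word in enumerate(sorted(set(words))):
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node["$id"] = index  # multi-character key cannot clash with a letter edge
    return root

def word_index(word_trie, word):
    node = word_trie
    for letter in word:
        node = node.get(letter)
        if node is None:
            return None
    return node.get("$id")

def build_phrase_trie(lexicon, word_trie):
    """Level 2: each edge is a word index; a terminal node stores the entities."""
    root = {}
    for surface_form, entities in lexicon.items():
        node = root
        for word in surface_form.lower().split():
            node = node.setdefault(word_index(word_trie, word), {})
        node["entities"] = entities
    return root

def tag(text, word_trie, phrase_trie):
    """Return every (surface form, candidate entities) match in the text."""
    tokens = text.lower().split()
    ids = [word_index(word_trie, t) for t in tokens]
    matches = []
    for start in range(len(ids)):
        node = phrase_trie
        for end in range(start, len(ids)):
            if ids[end] is None:          # word not in the lexicon at all
                break
            node = node.get(ids[end])
            if node is None:              # no surface form continues this way
                break
            if "entities" in node:
                matches.append((" ".join(tokens[start:end + 1]), node["entities"]))
    return matches

# Hypothetical lexicon built around the surface forms named earlier.
lexicon = {
    "Baltimore Ravens": ["Baltimore Ravens"],
    "Ravens": ["Baltimore Ravens"],
    "Baltimore": ["Baltimore", "Baltimore Ravens"],
}
word_trie = build_word_trie(w for s in lexicon for w in s.lower().split())
phrase_trie = build_phrase_trie(lexicon, word_trie)
matches = tag("the baltimore ravens won", word_trie, phrase_trie)
print(matches)
```

Splitting the lookup this way keeps the second level compact: it stores short integer sequences rather than full strings, and a word that is absent from level one ends a candidate match immediately.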