Extracting World and Linguistic Knowledge from Wikipedia



SLIDE 1

Extracting World and Linguistic Knowledge from Wikipedia

Simone Paolo Ponzetto Michael Strube University of Heidelberg EML Research gGmbH


SLIDE 2

Outline

  • Introduction
  • Deriving world knowledge from Wikipedia
  • Leveraging linguistic knowledge
  • Applications
  • Outlook and future work
  • Conclusions

Encyclopedic knowledge & NLP

The crisis at General Motors threatens to drag down Adam Opel, a storied German brand that GM bought 80 years ago, on the eve of the Great Depression. Many in the industry say Opel has a future only if it can get a temporary helping hand from the German government. But whether Chancellor Angela Merkel will make available the public financing needed to help release Opel from the clutches of General Motors now depends on a reluctant government, an influential automotive union that wants politicians to save jobs, and employees who yearn to re-establish Opel as an independent German company.

source: Herald Tribune Europe, March 6, 2009

What about a widely used resource like WordNet?

SLIDE 3

Encyclopedic knowledge & NLP

What about a widely used resource like WordNet? And Cyc?

SLIDE 4

Encyclopedic knowledge & NLP

What about a widely used resource like WordNet? And Cyc? Let’s check Wikipedia on that topic!

SLIDE 5

Wikipedia

SLIDE 6

Two main problems

1. where to get this knowledge from?
2. how to effectively use it within NLP applications to advance the state-of-the-art?

SLIDE 7

Outline

  • Introduction
  • Deriving world knowledge from Wikipedia
  • Leveraging linguistic knowledge
  • Applications
  • Outlook and future work
  • Conclusions

Domain and world knowledge

Project-specific domain knowledge bases:
  + very high quality
  – small domain
  – poor reusability
  – high cost

[Figure: example domain ontology for computer hardware, with concepts such as Notebook, Hard-Disk-Drive, CPU, Central-Unit, Clock-Frequency, Storage-Space and Access-Time, instances such as the Compaq LTE-Lite-20/25 and the Seagate ST-3144, and relations such as has-cpu, has-hd-drive, has-system-software, has-trackball and developed-by]

SLIDE 8

Domain and world knowledge

WordNet
  + pretty high quality
  + good coverage of everyday language
  + many languages
  – very high cost
  – sense proliferation
  – arbitrary coverage in domains

[Figure: fragment of the WordNet noun hierarchy (entity → physical entity / abstract entity, object, causal agent, substance, process, abstraction, thing), illustrating sense proliferation with synsets such as change, freshener, horror, jimdandy, stinker, whacker]

Domain and world knowledge

Cyc
  + pretty high quality
  + good coverage of everyday language
  + common sense knowledge
  – very high cost
  – arbitrary coverage in domains
  – English only

SLIDE 9

Domain and world knowledge

Ontology Learning from Text
  + low cost
  + potentially domain independent
  – mostly only small domains
  – low quality

[Figure: automatically learned relations between General Motors, Opel, US car company, German car company and car company, labeled isa and belongs-to]

Domain and world knowledge

Manual approach
  + knowledge is manually input by human experts
    ➠ produces high-quality information
  – limited number of human experts
    ➠ expensive, scales poorly to cover all domains

Automatic approach
  + requires minimal supervision on large amounts of data
    ➠ low cost and scalable
  – overall quality lower than that of humans
    ➠ unconstrained output, not necessarily ‘ontologized’

SLIDE 10

Domain and world knowledge

And Wikipedia? “one of the most fascinating developments of the Digital Age” “incredible example of open-source intellectual collaboration” “faith-based encyclopedia” “a joke at best”

Domain and world knowledge

And Wikipedia? “. . . an expert-led investigation carried out by Nature – the first to use peer review to compare Wikipedia and Britannica’s coverage of science . . . revealed numerous errors in both encyclopedias, but among 42 entries tested, the difference in accuracy was not particularly great: the average science entry in Wikipedia contained around four inaccuracies; Britannica about three.” (Nature 15. Dec. 2005)

SLIDE 11

Domain and world knowledge

And Wikipedia?
  + low cost
  + very good coverage, domain independent
  + very many languages
  + up to date
  – quality: ??
➠ we evaluate quality empirically!

Where to get this knowledge from?

we are after a “steak and lobster” combination . . .

  • manual approaches achieve high quality for a limited coverage
  • automatic ones achieve large coverage for a lower quality
➠ use manually annotated semi-structured input
➠ develop lightweight methods to generate large-coverage, high-quality structured output

SLIDE 12

Wikipedia

Wikipedia is . . .

  • a free, on-line encyclopedia
  • based on a model of communal content creation
  • available in more than 266 different languages (April 2009)
  • user interface provided by a Web-based Wiki software application, e.g. MediaWiki, running on top of a LAMP architecture
  • edited as plain text by means of a markup language (wiki markup), in order to provide structured annotations
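The structured annotations just mentioned (internal links and category links in wiki markup) can be pulled out with simple regular expressions. A minimal sketch in Python, on an invented markup snippet (not from a real dump):

```python
import re

# Invented wiki-markup snippet for illustration.
markup = """
'''Caffeine''' is a [[stimulant]] of the [[methylxanthine]] class.
[[Category:Stimulants]]
[[Category:Xanthines]]
"""

# Internal links: [[Target]] or [[Target|label]]; the negative lookahead
# skips links in the Category: namespace, which are handled separately.
links = re.findall(r"\[\[(?!Category:)([^\]|]+)(?:\|[^\]]*)?\]\]", markup)
categories = re.findall(r"\[\[Category:([^\]|]+)\]\]", markup)

print(links)       # ['stimulant', 'methylxanthine']
print(categories)  # ['Stimulants', 'Xanthines']
```

Real dumps need a proper parser (templates, nested markup), but the category links that feed the category network are exactly this simple in the common case.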

Why Wikipedia

Wikipedia is . . .
1. domain independent ➠ it has a large coverage
2. up-to-date ➠ to process current information
3. multilingual ➠ to process information in many languages

SLIDE 13

Wikipedia category network

  • since May 2004 Wikipedia provides a collaboratively generated category network

Semantic relatedness with Wikipedia

WikiRelate! (Strube & Ponzetto, 2006):

  • 1. Wikipedia pages represent categorized concepts
  • 2. all Wikipedia categories form a semantic network
  • 3. relations between concepts are given along the network

➠ use the category network as a semantic network . . .
➠ . . . to compute semantic relatedness
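The path-based idea behind WikiRelate! can be sketched as a shortest-path search over the category network, here with breadth-first search over a toy undirected graph (the labels and edges are illustrative, not the actual Wikipedia graph):

```python
from collections import deque

# Toy category network: undirected edges between category labels.
edges = [
    ("Caffeine", "Stimulants"),
    ("Stimulants", "Psychoactive drugs"),
    ("Psychoactive drugs", "Drugs"),
    ("Tea", "Caffeine"),
]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def path_length(src, dst):
    """Shortest-path length between two categories via BFS, or None."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

print(path_length("Tea", "Drugs"))  # 4
```

Path length is then plugged into a relatedness measure (shorter path → more related); the actual system also maps words to pages and pages to categories first.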

SLIDE 14

Comparison of different approaches

WikiRelate!, ESA and WLM leverage different features of Wikipedia

  • WikiRelate! uses categories (∼ 3 categories/article)
  • ESA uses articles (∼ 2,800,000) and words (∼ 400 words/article)
  • WLM uses hyperlinks (∼ 34 hyperlinks/article)

Deriving a taxonomy from Wikipedia

SLIDE 15

Deriving a taxonomy

  • induce semantically-typed relations

Deriving a taxonomy

  • the category network is merely a thematic categorization of the topics of articles
  • task: label the relations between categories as isa and notisa
  • goal: transform a thematic categorization into a fully-fledged taxonomy

SLIDE 16

Deriving a taxonomy

  • methods:
    • syntactic matching
    • connectivity in the network
    • lexico-syntactic patterns
  • results:
    • we start with 337,522 categories and 743,140 links
    • we generate 335,128 isa relations
➠ large-scale, multi-domain taxonomy

Category network cleanup (1)

  • removal of meta-categories used for encyclopedia management, e.g. categories under WIKIPEDIA ADMINISTRATION
  • we remove all nodes whose labels contain any of the following strings: MEDIAWIKI, TEMPLATE, USER, PORTAL, CATEGORIES, ARTICLES, PAGES
  • this leaves 240,760 categories and 515,423 links still to be processed
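The string-based cleanup amounts to a substring filter over category labels; a minimal sketch, with invented category labels:

```python
# Marker strings from the slide; a label containing any of them is
# treated as a meta-category and removed.
META_MARKERS = ("MEDIAWIKI", "TEMPLATE", "USER", "PORTAL",
                "CATEGORIES", "ARTICLES", "PAGES")

def is_meta(category):
    """True if the category label contains any meta-marker string."""
    label = category.upper()
    return any(marker in label for marker in META_MARKERS)

# Invented labels for illustration.
categories = ["User pages", "Stimulants",
              "Articles needing cleanup", "French cuisine"]
kept = [c for c in categories if not is_meta(c)]
print(kept)  # ['Stimulants', 'French cuisine']
```

Note this filter alone does not catch subtrees such as WIKIPEDIA ADMINISTRATION, which the slide removes by descending from the subtree root.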

SLIDE 17

Refinement link identification (2)

  • examples: ALBUMS BY ARTIST is-refined-by MILES DAVIS ALBUMS; CUISINE BY NATIONALITY is-refined-by FRENCH CUISINE
  • patterns such as Y X and X BY Z
  • their purpose is to better structure and simplify the categorization network
  • we assume these represent is-refined-by relations
  • this labels 126,920 category links notisa and leaves 388,503 relations to be analyzed
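One plausible way to detect the X BY Z refinement pattern is a regular expression over the parent label plus a suffix check on the child label (Y X ends with X). This is a hedged sketch, not the authors' implementation:

```python
import re

# "X BY Z" pattern, e.g. ALBUMS BY ARTIST, CUISINE BY NATIONALITY.
BY_PATTERN = re.compile(r"^(?P<x>.+)\s+by\s+(?P<z>.+)$", re.IGNORECASE)

def is_refinement(parent, child):
    """Label parent -> child as is-refined-by when the parent matches
    'X BY Z' and the child label (of form 'Y X') ends with the X part."""
    m = BY_PATTERN.match(parent)
    return bool(m) and child.lower().endswith(m.group("x").lower())

print(is_refinement("Albums by artist", "Miles Davis albums"))    # True
print(is_refinement("Cuisine by nationality", "French cuisine"))  # True
print(is_refinement("Stimulants", "Caffeine"))                    # False
```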

Syntax-based methods (3)

same lexical head, e.g. BRITISH COMPUTER SCIENTISTS isa COMPUTER SCIENTISTS isa SCIENTISTS

  • head matching labels pairs of categories sharing the same lexical head word (or lemma)
  • we identify lexical heads using the Stanford parser and lemmata using morpha

SLIDE 18

Syntax-based methods (3)

modifier in head position, e.g. ISLAMIC MYSTICISM notisa ISLAM

  • modifier matching labels pairs as notisa, if the stem of the lexical head of one of the categories occurs in non-head position in the other category, e.g. CRIME COMICS and CRIME or ISLAMIC MYSTICISM and ISLAM
  • head and modifier matching identify 141,728 isa relations and 67,437 notisa relations
➠ relatively ‘simple’ (→ baseline)
➠ still large coverage
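Head and modifier matching can be approximated with a last-word head heuristic (the slides use the Stanford parser and morpha instead), with `startswith` standing in for proper stemming. A rough sketch:

```python
def head(label):
    # Crude head heuristic: the last word of the label.
    return label.lower().split()[-1]

def modifiers(label):
    # All words except the head.
    return label.lower().split()[:-1]

def label_pair(cat1, cat2):
    """Return 'isa', 'notisa', or None for a pair of category labels."""
    h1, h2 = head(cat1), head(cat2)
    if h1 == h2:                          # head matching -> isa
        return "isa"
    # Modifier matching: the stem of one head appears as a modifier
    # of the other category -> notisa.
    if any(m.startswith(h2) for m in modifiers(cat1)) or \
       any(m.startswith(h1) for m in modifiers(cat2)):
        return "notisa"
    return None

print(label_pair("British computer scientists", "Computer scientists"))  # isa
print(label_pair("Islamic mysticism", "Islam"))                          # notisa
print(label_pair("Caffeine", "Stimulants"))                              # None
```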

Connectivity-based methods (4)

[Figure: the page MICROSOFT, the category MICROSOFT titled as the page (same lexical head), and the categories COMPANIES LISTED ON NASDAQ and COMPUTER AND VIDEO GAME COMPANIES, with links labeled instance-of and isa]

  • instance categorization assumes that relations between entities (Wikipedia pages) and classes (categories) can be labeled as instance-of (Suchanek et al., 2007)
➠ identifies 14,886 isa relations

SLIDE 19

Connectivity-based methods (4)

[Figure: the page ETHYL CARBAMATE categorized under both CARBAMATES and AMIDES (instance-of), making the link CARBAMATES isa AMIDES redundant]

  • if users redundantly categorize we take this as evidence for isa relations, e.g. ETHYL CARBAMATE
➠ identifies 16,523 isa relations

we are left with 147,929 unclassified relations . . .
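The redundant-categorization heuristic can be sketched as a set check over page-category assignments: a category link counts as isa evidence when some page sits in both its endpoints. Toy data, illustrative only:

```python
# Toy page-to-categories assignments and candidate category links.
page_cats = {"Ethyl carbamate": {"Carbamates", "Amides"}}
cat_links = [("Carbamates", "Amides")]

# A link (sub, sup) is isa evidence if some page is redundantly
# categorized under both sub and sup.
isa = [(sub, sup) for sub, sup in cat_links
       if any(sub in cats and sup in cats for cats in page_cats.values())]
print(isa)  # [('Carbamates', 'Amides')]
```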

Lexico-syntactic based methods (5)

pattern match: NP2 ,? (such as|like|, especially) NP* NP1
e.g. STIMULANTS SUCH AS CAFFEINE ➠ CAFFEINE isa STIMULANT

  • we apply lexico-syntactic patterns to sentences in large text corpora to identify isa relations (Hearst, 1992; Caraballo, 1999)
  • we assume that patterns used for identifying meronymic relations (Berland & Charniak, 1999) indicate that the relation is not an isa relation ➠ notisa

SLIDE 20

Lexico-syntactic based methods (5)

  • examples of ISA patterns:
    ➠ NP2 ,? (such as|like|, especially) NP* NP1
      a stimulant such as caffeine
    ➠ NP1 NP* (and|or|,) other NP2
      caffeine and other stimulants
  • examples of NOTISA patterns:
    ➠ NP2’s NP1
      car’s engine
    ➠ NP2 with NP1
      a car with an engine
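A toy regex version of two of these ISA patterns (real systems match over NP chunks produced by a tagger/chunker rather than raw words, and use a lemmatizer for plurals; the lazy `(\w+?)s?\b` hack below is only illustrative):

```python
import re

# "NP2 such as NP1"  ->  (NP1 isa NP2)
ISA_SUCH_AS = re.compile(r"(\w+?)s?\b\s+such as\s+(\w+)", re.IGNORECASE)
# "NP1 and/or other NP2"  ->  (NP1 isa NP2)
ISA_OTHER = re.compile(r"(\w+)\s+(?:and|or)\s+other\s+(\w+?)s?\b",
                       re.IGNORECASE)

def extract_isa(sentence):
    """Return (hyponym, hypernym) pairs matched in the sentence."""
    pairs = []
    for m in ISA_SUCH_AS.finditer(sentence):
        pairs.append((m.group(2), m.group(1)))
    for m in ISA_OTHER.finditer(sentence):
        pairs.append((m.group(1), m.group(2)))
    return pairs

print(extract_isa("stimulants such as caffeine"))    # [('caffeine', 'stimulant')]
print(extract_isa("caffeine and other stimulants"))  # [('caffeine', 'stimulant')]
```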

Lexico-syntactic based methods (5)

  • we use the Tipster corpus (2.5 × 10^8 words) and the English Wikipedia itself (8 × 10^8 words)
  • preprocessing: tokenization, sentence splitting, POS-tagging, NP-chunking ➠ 15 GB of data
  • majority voting strategy between isa and notisa patterns
  • this method identifies 49,054 isa relations
  • we also apply this method to the relations identified in step (4) and filter out 3,226 previously identified isa relations
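The majority-voting step can be sketched as counting pattern votes per category pair and keeping the label with the most votes; the pairs and vote lists below are invented:

```python
from collections import Counter

# Toy vote lists: each pattern match casts one isa or notisa vote.
votes = {
    ("Caffeine", "Stimulants"): ["isa", "isa", "notisa"],
    ("Engine", "Cars"): ["notisa", "notisa", "isa"],
}

# Keep the majority label for each pair.
labels = {pair: Counter(v).most_common(1)[0][0] for pair, v in votes.items()}
print(labels[("Caffeine", "Stimulants")])  # isa
print(labels[("Engine", "Cars")])          # notisa
```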

SLIDE 21

Inference-based methods (6)

[Figure: ARTIFICIAL INTELLIGENCE isa COGNITIVE SCIENCES (previously found); COGNITIVE SCIENCES isa INTERDISCIPLINARY FIELDS (previously found) ➠ ARTIFICIAL INTELLIGENCE isa INTERDISCIPLINARY FIELDS (inferred)]

  • assumption: the isa relation models set inclusion, and therefore is a transitive relation
  • propagate previously found relations based on transitivity

Inference-based methods (6)

[Figure: BORGHESE isa PAPAL FAMILIES (previously found); PAPAL FAMILIES and ITALIAN NOBLE FAMILIES share the same lexical head ➠ BORGHESE isa ITALIAN NOBLE FAMILIES (inferred)]

  • propagate all isa relations to those supercategories whose head lemma matches the head lemma of a previously identified isa supercategory
➠ propagate the isa relation to the sisters of the previously identified isa supercategories
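Transitive propagation can be sketched as computing a fixed point over the labeled isa pairs: whenever A isa B and B isa C, add A isa C. Toy relations, illustrative:

```python
# Previously found isa relations (toy data).
isa = {("Artificial intelligence", "Cognitive sciences"),
       ("Cognitive sciences", "Interdisciplinary fields")}

def propagate(relations):
    """Transitive closure of an isa relation set, by fixed-point iteration."""
    rels = set(relations)
    changed = True
    while changed:
        changed = False
        for a, b in list(rels):
            for c, d in list(rels):
                if b == c and (a, d) not in rels:
                    rels.add((a, d))  # infer A isa D from A isa B, B isa D
                    changed = True
    return rels

closure = propagate(isa)
print(("Artificial intelligence", "Interdisciplinary fields") in closure)  # True
```

The quadratic fixed-point loop is fine for a sketch; at the scale of the full network one would propagate along a topological order instead.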

SLIDE 22

Size of the taxonomy

             ResearchCyc     WordNet           Wikipedia           Wikipedia
                                               (sem. network)      (taxonomy)
  # nodes    ∼ 300,000       117,659           337,522             209,919
             (concepts)      (synsets)         (categories)        (categories)
  # edges    ∼ 3,000,000     285,348           743,140             335,128
             (assertions)    (sem. pointers)   (category links)    (isa relations)

Semantic similarity

(results on Wikipedia from March 2008, journal submission)

M&C               pl     wup    lch    res
WordNet           0.64   0.76   0.78   0.81
WikiRelate!       0.67   0.68   0.71   0.44
WikiRelate! isa   0.75   0.81   0.80   0.87

R&G               pl     wup    lch    res
WordNet           0.74   0.80   0.84   0.82
WikiRelate!       0.65   0.68   0.69   0.34
WikiRelate! isa   0.70   0.77   0.75   0.78
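For reference, the measures in the tables (pl, wup, lch, res) are the standard taxonomy-based definitions (path length; Wu & Palmer; Leacock & Chodorow; Resnik), written here in their usual textbook form, not taken from the slides:

```latex
\begin{align*}
\mathrm{pl}(c_1, c_2)  &= \text{length of the shortest path between } c_1 \text{ and } c_2\\
\mathrm{wup}(c_1, c_2) &= \frac{2\,\mathrm{depth}(\mathrm{lcs}(c_1, c_2))}
                               {\mathrm{depth}(c_1) + \mathrm{depth}(c_2)}\\
\mathrm{lch}(c_1, c_2) &= -\log \frac{\mathrm{pl}(c_1, c_2)}{2D}
    \qquad (D = \text{maximum taxonomy depth})\\
\mathrm{res}(c_1, c_2) &= -\log P(\mathrm{lcs}(c_1, c_2))
\end{align*}
```

Here lcs is the least common subsumer of the two concepts, and P is estimated from corpus frequencies (information content).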

SLIDE 23

Comparison of different approaches

  • taxonomy (i.e. WordNet) based measures: Lin (1998) .82; Li et al. (2006) .89
  • Web based measures: Bollegala et al. (2007) .81
  • Wikipedia based measures:

                      M&C    R&G
    ESA               .73    .82
    WLM               .70    .64
    WikiRelate! isa   .87    .78

Manual evaluation

1,106 instances evaluated manually by three judges

                           R      P      F
  random baseline          51.1   51.6   51.3
  syntax (1-3)             17.0   95.4   28.9
  connectivity (1-4, 6)    38.9   88.1   54.0
  pattern-based (1-3, 5-6) 62.7   84.3   71.9
  all (1-6)                69.5   81.6   75.0