Extracting World and Linguistic Knowledge from Wikipedia



SLIDE 1

Extracting World and Linguistic Knowledge from Wikipedia

Simone Paolo Ponzetto Michael Strube University of Heidelberg EML Research gGmbH


SLIDE 2

Outline

  • Introduction
  • Deriving world knowledge from Wikipedia
  • Leveraging linguistic knowledge
  • Applications
  • Outlook and future work
  • Conclusions

Encyclopedic knowledge & NLP

The crisis at General Motors threatens to drag down Adam Opel, a storied German brand that GM bought 80 years ago, on the eve of the Great Depression. Many in the industry say Opel has a future only if it can get a temporary helping hand from the German government. But whether Chancellor Angela Merkel will make available the public financing needed to help release Opel from the clutches of General Motors now depends on a reluctant government, an influential automotive union that wants politicians to save jobs, and employees who yearn to re-establish Opel as an independent German company.

source: Herald Tribune Europe, March 6, 2009

What about a widely used resource like WordNet?

SLIDE 3

Encyclopedic knowledge & NLP

What about a widely used resource like WordNet? And Cyc?

SLIDE 4

Encyclopedic knowledge & NLP

What about a widely used resource like WordNet? And Cyc? Let’s check Wikipedia on that topic!

SLIDE 5

Wikipedia

SLIDE 6

Two main problems

1. where to get this knowledge from?
2. how to effectively use it within NLP applications to advance the state-of-the-art?

SLIDE 7

Outline

  • Introduction
  • Deriving world knowledge from Wikipedia
  • Leveraging linguistic knowledge
  • Applications
  • Outlook and future work
  • Conclusions

Domain and world knowledge

Project-specific domain knowledge bases:
  + very high quality
  – small domain
  – poor reusability
  – high cost

[Figure: example domain ontology for computer hardware, with concepts such as Notebook, Hard-Disk-Drive, CPU, Central-Unit, Clock-Frequency, Storage-Space and Access-Time, instances such as the Compaq LTE-Lite-20/25 and the Seagate ST-3144, and relations such as has-cpu, has-hd-drive, has-system-software, has-trackball and developed-by]

SLIDE 8

Domain and world knowledge

WordNet
  + pretty high quality
  + good coverage of everyday language
  + many languages
  – very high cost
  – sense proliferation
  – arbitrary coverage in domains

[Figure: fragment of the WordNet noun hierarchy (entity → physical entity / abstract entity, object, causal agent, substance, process, abstraction, thing), illustrating sense proliferation with synsets such as change, freshener, horror, jimdandy, stinker, whacker]

Domain and world knowledge

Cyc
  + pretty high quality
  + good coverage of everyday language
  + common sense knowledge
  – very high cost
  – arbitrary coverage in domains
  – English only

SLIDE 9

Domain and world knowledge

Ontology Learning from Text
  + low cost
  + potentially domain independent
  – mostly only small domains
  – low quality

[Figure: automatically learned relations between General Motors, Opel, US car company, German car company and car company, labeled isa and belongs-to]

Domain and world knowledge

Manual approach
  + knowledge is manually input by human experts
    ➠ produces high-quality information
  – limited number of human experts
    ➠ expensive, scales poorly to cover all domains

Automatic approach
  + requires minimal supervision on large amounts of data
    ➠ low cost and scalable
  – overall quality lower than that of humans
    ➠ unconstrained output, not necessarily ‘ontologized’

SLIDE 10

Domain and world knowledge

And Wikipedia? “one of the most fascinating developments of the Digital Age” “incredible example of open-source intellectual collaboration” “faith-based encyclopedia” “a joke at best”

Domain and world knowledge

And Wikipedia? “. . . an expert-led investigation carried out by Nature – the first to use peer review to compare Wikipedia and Britannica’s coverage of science . . . revealed numerous errors in both encyclopedias, but among 42 entries tested, the difference in accuracy was not particularly great: the average science entry in Wikipedia contained around four inaccuracies; Britannica about three.” (Nature 15. Dec. 2005)

SLIDE 11

Domain and world knowledge

And Wikipedia?
  + low cost
  + very good coverage, domain independent
  + very many languages
  + up to date
  – quality: ??
➠ we evaluate quality empirically!

Where to get this knowledge from?

we are after a “steak and lobster” combination . . .

  • manual approaches achieve high quality for a limited coverage
  • automatic ones achieve large coverage for a lower quality
➠ use manually annotated semi-structured input
➠ develop lightweight methods to generate large-coverage, high-quality structured output

SLIDE 12

Wikipedia

Wikipedia is . . .

  • a free, on-line encyclopedia
  • based on a model of communal content creation
  • available in more than 266 different languages (April 2009)
  • user interface provided by a Web-based Wiki software application, e.g. MediaWiki, running on top of a LAMP architecture
  • edited as plain text by means of a markup language (wiki markup), in order to provide structured annotations
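The structured annotations just mentioned (internal links and category links in wiki markup) can be pulled out with simple regular expressions. A minimal sketch in Python, on an invented markup snippet (not from a real dump):

```python
import re

# Invented wiki-markup snippet for illustration.
markup = """
'''Caffeine''' is a [[stimulant]] of the [[methylxanthine]] class.
[[Category:Stimulants]]
[[Category:Xanthines]]
"""

# Internal links: [[Target]] or [[Target|label]]; the negative lookahead
# skips links in the Category: namespace, which are handled separately.
links = re.findall(r"\[\[(?!Category:)([^\]|]+)(?:\|[^\]]*)?\]\]", markup)
categories = re.findall(r"\[\[Category:([^\]|]+)\]\]", markup)

print(links)       # ['stimulant', 'methylxanthine']
print(categories)  # ['Stimulants', 'Xanthines']
```

Real dumps need a proper parser (templates, nested markup), but the category links that feed the category network are exactly this simple in the common case.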

Why Wikipedia

Wikipedia is . . .
1. domain independent ➠ it has a large coverage
2. up-to-date ➠ to process current information
3. multilingual ➠ to process information in many languages

SLIDE 13

Wikipedia category network

  • since May 2004 Wikipedia provides a collaboratively generated category network

Semantic relatedness with Wikipedia

WikiRelate! (Strube & Ponzetto, 2006):

  • 1. Wikipedia pages represent categorized concepts
  • 2. all Wikipedia categories form a semantic network
  • 3. relations between concepts are given along the network

➠ use the category network as a semantic network . . .
➠ . . . to compute semantic relatedness
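The path-based idea behind WikiRelate! can be sketched as a shortest-path search over the category network, here with breadth-first search over a toy undirected graph (the labels and edges are illustrative, not the actual Wikipedia graph):

```python
from collections import deque

# Toy category network: undirected edges between category labels.
edges = [
    ("Caffeine", "Stimulants"),
    ("Stimulants", "Psychoactive drugs"),
    ("Psychoactive drugs", "Drugs"),
    ("Tea", "Caffeine"),
]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def path_length(src, dst):
    """Shortest-path length between two categories via BFS, or None."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

print(path_length("Tea", "Drugs"))  # 4
```

Path length is then plugged into a relatedness measure (shorter path → more related); the actual system also maps words to pages and pages to categories first.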

SLIDE 14

Comparison of different approaches

WikiRelate!, ESA and WLM leverage different features of Wikipedia

  • WikiRelate! uses categories (∼ 3 categories/article)
  • ESA uses articles (∼ 2,800,000) and words (∼ 400 words/article)
  • WLM uses hyperlinks (∼ 34 hyperlinks/article)

Deriving a taxonomy from Wikipedia

SLIDE 15

Deriving a taxonomy

  • induce semantically-typed relations

Deriving a taxonomy

  • the category network is merely a thematic categorization of the topics of articles
  • task: label the relations between categories as isa and notisa
  • goal: transform a thematic categorization into a fully-fledged taxonomy

SLIDE 16

Deriving a taxonomy

  • methods:
    • syntactic matching
    • connectivity in the network
    • lexico-syntactic patterns
  • results:
    • we start with 337,522 categories and 743,140 links
    • we generate 335,128 isa relations
➠ large-scale, multi-domain taxonomy

Category network cleanup (1)

  • removal of meta-categories used for encyclopedia management, e.g. categories under WIKIPEDIA ADMINISTRATION
  • we remove all nodes whose labels contain any of the following strings: MEDIAWIKI, TEMPLATE, USER, PORTAL, CATEGORIES, ARTICLES, PAGES
  • this leaves 240,760 categories and 515,423 links still to be processed
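The string-based cleanup amounts to a substring filter over category labels; a minimal sketch, with invented category labels:

```python
# Marker strings from the slide; a label containing any of them is
# treated as a meta-category and removed.
META_MARKERS = ("MEDIAWIKI", "TEMPLATE", "USER", "PORTAL",
                "CATEGORIES", "ARTICLES", "PAGES")

def is_meta(category):
    """True if the category label contains any meta-marker string."""
    label = category.upper()
    return any(marker in label for marker in META_MARKERS)

# Invented labels for illustration.
categories = ["User pages", "Stimulants",
              "Articles needing cleanup", "French cuisine"]
kept = [c for c in categories if not is_meta(c)]
print(kept)  # ['Stimulants', 'French cuisine']
```

Note this filter alone does not catch subtrees such as WIKIPEDIA ADMINISTRATION, which the slide removes by descending from the subtree root.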

SLIDE 17

Refinement link identification (2)

  • examples: ALBUMS BY ARTIST is-refined-by MILES DAVIS ALBUMS; CUISINE BY NATIONALITY is-refined-by FRENCH CUISINE
  • patterns such as Y X and X BY Z
  • their purpose is to better structure and simplify the categorization network
  • we assume these represent is-refined-by relations
  • this labels 126,920 category links notisa and leaves 388,503 relations to be analyzed
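One plausible way to detect the X BY Z refinement pattern is a regular expression over the parent label plus a suffix check on the child label (Y X ends with X). This is a hedged sketch, not the authors' implementation:

```python
import re

# "X BY Z" pattern, e.g. ALBUMS BY ARTIST, CUISINE BY NATIONALITY.
BY_PATTERN = re.compile(r"^(?P<x>.+)\s+by\s+(?P<z>.+)$", re.IGNORECASE)

def is_refinement(parent, child):
    """Label parent -> child as is-refined-by when the parent matches
    'X BY Z' and the child label (of form 'Y X') ends with the X part."""
    m = BY_PATTERN.match(parent)
    return bool(m) and child.lower().endswith(m.group("x").lower())

print(is_refinement("Albums by artist", "Miles Davis albums"))    # True
print(is_refinement("Cuisine by nationality", "French cuisine"))  # True
print(is_refinement("Stimulants", "Caffeine"))                    # False
```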

Syntax-based methods (3)

same lexical head, e.g. BRITISH COMPUTER SCIENTISTS isa COMPUTER SCIENTISTS isa SCIENTISTS

  • head matching labels pairs of categories sharing the same lexical head word (or lemma)
  • we identify lexical heads using the Stanford parser and lemmata using morpha

SLIDE 18

Syntax-based methods (3)

modifier in head position, e.g. ISLAMIC MYSTICISM notisa ISLAM

  • modifier matching labels pairs as notisa, if the stem of the lexical head of one of the categories occurs in non-head position in the other category, e.g. CRIME COMICS and CRIME or ISLAMIC MYSTICISM and ISLAM
  • head and modifier matching identify 141,728 isa relations and 67,437 notisa relations
➠ relatively ‘simple’ (→ baseline)
➠ still large coverage
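Head and modifier matching can be approximated with a last-word head heuristic (the slides use the Stanford parser and morpha instead), with `startswith` standing in for proper stemming. A rough sketch:

```python
def head(label):
    # Crude head heuristic: the last word of the label.
    return label.lower().split()[-1]

def modifiers(label):
    # All words except the head.
    return label.lower().split()[:-1]

def label_pair(cat1, cat2):
    """Return 'isa', 'notisa', or None for a pair of category labels."""
    h1, h2 = head(cat1), head(cat2)
    if h1 == h2:                          # head matching -> isa
        return "isa"
    # Modifier matching: the stem of one head appears as a modifier
    # of the other category -> notisa.
    if any(m.startswith(h2) for m in modifiers(cat1)) or \
       any(m.startswith(h1) for m in modifiers(cat2)):
        return "notisa"
    return None

print(label_pair("British computer scientists", "Computer scientists"))  # isa
print(label_pair("Islamic mysticism", "Islam"))                          # notisa
print(label_pair("Caffeine", "Stimulants"))                              # None
```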

Connectivity-based methods (4)

[Figure: the page MICROSOFT, the category MICROSOFT titled as the page (same lexical head), and the categories COMPANIES LISTED ON NASDAQ and COMPUTER AND VIDEO GAME COMPANIES, with links labeled instance-of and isa]

  • instance categorization assumes that relations between entities (Wikipedia pages) and classes (categories) can be labeled as instance-of (Suchanek et al., 2007)
➠ identifies 14,886 isa relations

SLIDE 19

Connectivity-based methods (4)

[Figure: the page ETHYL CARBAMATE categorized under both CARBAMATES and AMIDES (instance-of), making the link CARBAMATES isa AMIDES redundant]

  • if users redundantly categorize we take this as evidence for isa relations, e.g. ETHYL CARBAMATE
➠ identifies 16,523 isa relations

we are left with 147,929 unclassified relations . . .
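The redundant-categorization heuristic can be sketched as a set check over page-category assignments: a category link counts as isa evidence when some page sits in both its endpoints. Toy data, illustrative only:

```python
# Toy page-to-categories assignments and candidate category links.
page_cats = {"Ethyl carbamate": {"Carbamates", "Amides"}}
cat_links = [("Carbamates", "Amides")]

# A link (sub, sup) is isa evidence if some page is redundantly
# categorized under both sub and sup.
isa = [(sub, sup) for sub, sup in cat_links
       if any(sub in cats and sup in cats for cats in page_cats.values())]
print(isa)  # [('Carbamates', 'Amides')]
```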

Lexico-syntactic based methods (5)

pattern match: NP2 ,? (such as|like|, especially) NP* NP1
e.g. STIMULANTS SUCH AS CAFFEINE ➠ CAFFEINE isa STIMULANT

  • we apply lexico-syntactic patterns to sentences in large text corpora to identify isa relations (Hearst, 1992; Caraballo, 1999)
  • we assume that patterns used for identifying meronymic relations (Berland & Charniak, 1999) indicate that the relation is not an isa relation ➠ notisa

SLIDE 20

Lexico-syntactic based methods (5)

  • examples of ISA patterns:
    ➠ NP2 ,? (such as|like|, especially) NP* NP1
      a stimulant such as caffeine
    ➠ NP1 NP* (and|or|,) other NP2
      caffeine and other stimulants
  • examples of NOTISA patterns:
    ➠ NP2’s NP1
      car’s engine
    ➠ NP2 with NP1
      a car with an engine
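A toy regex version of two of these ISA patterns (real systems match over NP chunks produced by a tagger/chunker rather than raw words, and use a lemmatizer for plurals; the lazy `(\w+?)s?\b` hack below is only illustrative):

```python
import re

# "NP2 such as NP1"  ->  (NP1 isa NP2)
ISA_SUCH_AS = re.compile(r"(\w+?)s?\b\s+such as\s+(\w+)", re.IGNORECASE)
# "NP1 and/or other NP2"  ->  (NP1 isa NP2)
ISA_OTHER = re.compile(r"(\w+)\s+(?:and|or)\s+other\s+(\w+?)s?\b",
                       re.IGNORECASE)

def extract_isa(sentence):
    """Return (hyponym, hypernym) pairs matched in the sentence."""
    pairs = []
    for m in ISA_SUCH_AS.finditer(sentence):
        pairs.append((m.group(2), m.group(1)))
    for m in ISA_OTHER.finditer(sentence):
        pairs.append((m.group(1), m.group(2)))
    return pairs

print(extract_isa("stimulants such as caffeine"))    # [('caffeine', 'stimulant')]
print(extract_isa("caffeine and other stimulants"))  # [('caffeine', 'stimulant')]
```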

Lexico-syntactic based methods (5)

  • we use the Tipster corpus (2.5 × 10^8 words) and the English Wikipedia itself (8 × 10^8 words)
  • preprocessing: tokenization, sentence splitting, POS-tagging, NP-chunking ➠ 15 GB of data
  • majority voting strategy between isa and notisa patterns
  • this method identifies 49,054 isa relations
  • we also apply this method to the relations identified in step (4) and filter out 3,226 previously identified isa relations
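The majority-voting step can be sketched as counting pattern votes per category pair and keeping the label with the most votes; the pairs and vote lists below are invented:

```python
from collections import Counter

# Toy vote lists: each pattern match casts one isa or notisa vote.
votes = {
    ("Caffeine", "Stimulants"): ["isa", "isa", "notisa"],
    ("Engine", "Cars"): ["notisa", "notisa", "isa"],
}

# Keep the majority label for each pair.
labels = {pair: Counter(v).most_common(1)[0][0] for pair, v in votes.items()}
print(labels[("Caffeine", "Stimulants")])  # isa
print(labels[("Engine", "Cars")])          # notisa
```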

SLIDE 21

Inference-based methods (6)

[Figure: ARTIFICIAL INTELLIGENCE isa COGNITIVE SCIENCES (previously found); COGNITIVE SCIENCES isa INTERDISCIPLINARY FIELDS (previously found) ➠ ARTIFICIAL INTELLIGENCE isa INTERDISCIPLINARY FIELDS (inferred)]

  • assumption: the isa relation models set inclusion, and therefore is a transitive relation
  • propagate previously found relations based on transitivity

Inference-based methods (6)

[Figure: BORGHESE isa PAPAL FAMILIES (previously found); PAPAL FAMILIES and ITALIAN NOBLE FAMILIES share the same lexical head ➠ BORGHESE isa ITALIAN NOBLE FAMILIES (inferred)]

  • propagate all isa relations to those supercategories whose head lemma matches the head lemma of a previously identified isa supercategory
➠ propagate the isa relation to the sisters of the previously identified isa supercategories
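Transitive propagation can be sketched as computing a fixed point over the labeled isa pairs: whenever A isa B and B isa C, add A isa C. Toy relations, illustrative:

```python
# Previously found isa relations (toy data).
isa = {("Artificial intelligence", "Cognitive sciences"),
       ("Cognitive sciences", "Interdisciplinary fields")}

def propagate(relations):
    """Transitive closure of an isa relation set, by fixed-point iteration."""
    rels = set(relations)
    changed = True
    while changed:
        changed = False
        for a, b in list(rels):
            for c, d in list(rels):
                if b == c and (a, d) not in rels:
                    rels.add((a, d))  # infer A isa D from A isa B, B isa D
                    changed = True
    return rels

closure = propagate(isa)
print(("Artificial intelligence", "Interdisciplinary fields") in closure)  # True
```

The quadratic fixed-point loop is fine for a sketch; at the scale of the full network one would propagate along a topological order instead.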

SLIDE 22

Size of the taxonomy

             ResearchCyc     WordNet           Wikipedia           Wikipedia
                                               (sem. network)      (taxonomy)
  # nodes    ∼ 300,000       117,659           337,522             209,919
             (concepts)      (synsets)         (categories)        (categories)
  # edges    ∼ 3,000,000     285,348           743,140             335,128
             (assertions)    (sem. pointers)   (category links)    (isa relations)

Semantic similarity

(results on Wikipedia from March 2008, journal submission)

M&C               pl     wup    lch    res
WordNet           0.64   0.76   0.78   0.81
WikiRelate!       0.67   0.68   0.71   0.44
WikiRelate! isa   0.75   0.81   0.80   0.87

R&G               pl     wup    lch    res
WordNet           0.74   0.80   0.84   0.82
WikiRelate!       0.65   0.68   0.69   0.34
WikiRelate! isa   0.70   0.77   0.75   0.78
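For reference, the measures in the tables (pl, wup, lch, res) are the standard taxonomy-based definitions (path length; Wu & Palmer; Leacock & Chodorow; Resnik), written here in their usual textbook form, not taken from the slides:

```latex
\begin{align*}
\mathrm{pl}(c_1, c_2)  &= \text{length of the shortest path between } c_1 \text{ and } c_2\\
\mathrm{wup}(c_1, c_2) &= \frac{2\,\mathrm{depth}(\mathrm{lcs}(c_1, c_2))}
                               {\mathrm{depth}(c_1) + \mathrm{depth}(c_2)}\\
\mathrm{lch}(c_1, c_2) &= -\log \frac{\mathrm{pl}(c_1, c_2)}{2D}
    \qquad (D = \text{maximum taxonomy depth})\\
\mathrm{res}(c_1, c_2) &= -\log P(\mathrm{lcs}(c_1, c_2))
\end{align*}
```

Here lcs is the least common subsumer of the two concepts, and P is estimated from corpus frequencies (information content).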

SLIDE 23

Comparison of different approaches

  • taxonomy (i.e. WordNet) based measures: Lin (1998) .82; Li et al. (2006) .89
  • Web based measures: Bollegala et al. (2007) .81
  • Wikipedia based measures:

                      M&C    R&G
    ESA               .73    .82
    WLM               .70    .64
    WikiRelate! isa   .87    .78

Manual evaluation

1,106 instances evaluated manually by three judges

                           R      P      F
  random baseline          51.1   51.6   51.3
  syntax (1-3)             17.0   95.4   28.9
  connectivity (1-4, 6)    38.9   88.1   54.0
  pattern-based (1-3, 5-6) 62.7   84.3   71.9
  all (1-6)                69.5   81.6   75.0