Current Topics in Applied Computer Science (Aktuelle Themen der Angewandten Informatik)
Semantic Technologies
(M-TANI)
Christian Chiarcos, Applied Computational Linguistics, chiarcos@informatik.uni-frankfurt.de
11 July 2013
Machine Reading & Open IE
(Poon et al. 2010)
* Other BKBs may be built using syntax-based generalizations as described by Penas & Hovy (2010)
Fader et al. (2011), Identifying Relations for Open Information Extraction, EMNLP 2011.
Elvis was a great artist, but while all of Elvis’ colleagues loved the song “Oh yeah, honey”, Elvis did not perform that song at his concert in Hintertuepflingen.
Idea (by Hearst): Sentences express class membership in very predictable patterns. Use these patterns for instance extraction.
Entity: Elvis
Class: artist
Hearst patterns:
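A minimal sketch of pattern-based instance extraction, assuming three simplified regular expressions: two well-known Hearst-style patterns ("Y such as X", "X and other Y") and a copular pattern suggested by the Elvis sentence. The names HEARST_PATTERNS and extract_is_a are illustrative assumptions, and each noun phrase is approximated by a single word, which real systems would not do.

```python
import re

# Simplified, illustrative patterns (assumption: one word per noun phrase).
# Each regex exposes the named groups "ent" (entity) and "cls" (class).
HEARST_PATTERNS = [
    # "artists such as Elvis"
    re.compile(r"(?P<cls>\w+),? such as (?P<ent>[A-Z]\w*)"),
    # "Elvis and other singers"
    re.compile(r"(?P<ent>[A-Z]\w*) and other (?P<cls>\w+)"),
    # "Elvis was a great artist" (copula with at most one modifier)
    re.compile(r"(?P<ent>[A-Z]\w*) (?:is|was) an? (?:\w+ )?(?P<cls>\w+)"),
]


def extract_is_a(text):
    """Return (entity, class) pairs found by any of the patterns."""
    pairs = []
    for pattern in HEARST_PATTERNS:
        for match in pattern.finditer(text):
            pairs.append((match.group("ent"), match.group("cls")))
    return pairs


print(extract_is_a("Elvis was a great artist, but Elvis did not perform that song."))
```

The word-level approximation breaks on longer modifier chains, so a chunker or parser would normally supply the noun phrases.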
Knowledge Base
Several pre-defined relations plus instances, e.g., for is-a relations (class membership)
Slide from Fabian M. Suchanek (2010)
Bootstrapping
– Seed: manually collected instances of a relation (or use a hand-crafted pattern, e.g., a Hearst pattern, to retrieve such instances)
– Search: for every seed instance, retrieve sentences that contain its elements
– Generate patterns: for every instance match, replace the matches with variables, keep the immediate context (say, the words between)
– Pruning: keep only the most confident (frequent, recurring, etc.) patterns
– Iterate: retrieve new instances with the patterns and iterate over the pattern candidates
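The bootstrapping loop can be sketched as follows. This is a toy version under stated assumptions: the corpus is a list of sentences, arguments are single words, the "immediate context" is the string between the two arguments, and confidence is plain frequency. The function name bootstrap and the parameters iterations and min_count are hypothetical.

```python
import re
from collections import Counter


def bootstrap(corpus, seeds, iterations=2, min_count=2):
    """corpus: list of sentences; seeds: set of (x, y) instance pairs."""
    instances = set(seeds)
    patterns = set()
    for _ in range(iterations):
        # Search + generate patterns: keep the words between the arguments.
        counts = Counter()
        for x, y in instances:
            for sentence in corpus:
                m = re.search(re.escape(x) + r"(.+?)" + re.escape(y), sentence)
                if m:
                    counts[m.group(1)] += 1
        # Pruning: keep only contexts seen at least min_count times.
        patterns |= {ctx for ctx, n in counts.items() if n >= min_count}
        # Iterate: apply the surviving patterns to retrieve new instances.
        for ctx in patterns:
            regex = re.compile(r"(\w+)" + re.escape(ctx) + r"(\w+)")
            for sentence in corpus:
                for m in regex.finditer(sentence):
                    instances.add((m.group(1), m.group(2)))
    return instances, patterns


corpus = [
    "Elvis was born in Tupelo.",
    "Einstein was born in Ulm.",
    "Mozart was born in Salzburg.",
    "Einstein liked Ulm.",
]
seeds = {("Elvis", "Tupelo"), ("Einstein", "Ulm")}
instances, patterns = bootstrap(corpus, seeds)
print(sorted(instances), patterns)
```

In this toy corpus the context " liked " occurs only once and is pruned, while " was born in " survives and retrieves the unseen pair (Mozart, Salzburg).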
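OPEN VERSUS TRADITIONAL IE
Input: Corpus + O(R) hand-labeled data (traditional IE) vs. Corpus (open IE)
Relations: Specified in advance (traditional IE) vs. Discovered automatically (open IE)
Extractor: Relation-specific (traditional IE) vs. Relation-independent (open IE)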
Etzioni, University of Washington
NUMBER OF RELATIONS
invented; acquired by; has a PhD in; denied; voted for; inhibits tumor growth in; inherited; born in; mastered the art of; downloaded; aspired to; is the patron saint of; expelled; arrived from; wrote the book on
SAMPLE OF EXTRACTED RELATIONS
– Count shared (relation, arg2)
– Relations: count shared (arg1, arg2)
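A minimal sketch of the two counts, assuming extractions are (arg1, relation, arg2) triples. The first count is read here as applying to argument strings, which is an assumption since the bullet does not name its target; the function name shared_context_counts is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shared_context_counts(tuples):
    """Count shared contexts for pairs of arg1 strings and pairs of relations."""
    arg_contexts = defaultdict(set)  # arg1 -> {(relation, arg2)}
    rel_contexts = defaultdict(set)  # relation -> {(arg1, arg2)}
    for arg1, rel, arg2 in tuples:
        arg_contexts[arg1].add((rel, arg2))
        rel_contexts[rel].add((arg1, arg2))

    def pair_counts(contexts):
        # Keep only pairs that share at least one context.
        result = {}
        for a, b in combinations(sorted(contexts), 2):
            shared = len(contexts[a] & contexts[b])
            if shared:
                result[(a, b)] = shared
        return result

    return pair_counts(arg_contexts), pair_counts(rel_contexts)


tuples = [
    ("Einstein", "was born in", "Ulm"),
    ("Albert Einstein", "was born in", "Ulm"),
    ("Einstein", "born in", "Ulm"),
    ("Einstein Bros.", "sell", "bagels"),
]
string_pairs, relation_pairs = shared_context_counts(tuples)
print(string_pairs)
print(relation_pairs)
```

High shared counts are evidence that two strings or two relation phrases are synonyms; a real system would turn the counts into a probability rather than use them raw.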
Yates et al. (2007), TextRunner: Open Information Extraction on the Web, NAACL-HLT 2007.
TEXTRUNNER
– improves the original TextRunner implementation by specifying constraints
=> basis of current TextRunner (http://openie.cs.washington.edu)
– Parser-based, not POS-based
"Recentemente", conferma Maria Serena Balestracci, "mi ha telefonato un signore da Bologna, che aveva sentito parlare del libro alla radio
statistical MT (Google Translate): "... a gentleman from London ..." [Oct. 2010]
statistical MT + Freebase (Google Knowledge Graph): "... a gentleman from Bologna ..." [July 2013]
Relation-independent extraction
Synonym detection, confidence
Index in Lucene; link entities
Web corpus -> Extractor -> Raw tuples -> Assessor -> Extractions -> Query processor
Raw tuples:
(XYZ Corp.; acquired; Go Inc.)
(oranges; contain; Vitamin C)
(Einstein; was born in; Ulm)
(XYZ; buyout of; Go Inc.)
(Albert Einstein; born in; Ulm)
(Einstein Bros.; sell; bagels)
Synonyms:
XYZ Corp. = XYZ
Albert Einstein = Einstein != Einstein Bros.
Extractions:
Acquire(XYZ Corp., Go Inc.) [7]
BornIn(Albert Einstein, Ulm) [5]
Sell(Einstein Bros., bagels) [1]
Contain(oranges, Vitamin C) [1]
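The assessor step can be illustrated with a small sketch that normalizes raw tuples and counts the merged extractions. It assumes the synonym tables are already given as hand-written dictionaries (in a real system they would be learned); ENTITY_SYNONYMS, RELATION_SYNONYMS and assess are hypothetical names. The counts cover only the six raw tuples shown, so they differ from the bracketed numbers above.

```python
from collections import Counter

# Hand-written synonym tables for illustration only (assumption).
ENTITY_SYNONYMS = {"XYZ": "XYZ Corp.", "Einstein": "Albert Einstein"}
RELATION_SYNONYMS = {
    "acquired": "Acquire",
    "buyout of": "Acquire",
    "was born in": "BornIn",
    "born in": "BornIn",
    "contain": "Contain",
    "sell": "Sell",
}


def assess(raw_tuples):
    """Map each raw tuple to canonical names and count duplicates."""
    counts = Counter()
    for arg1, rel, arg2 in raw_tuples:
        key = (
            RELATION_SYNONYMS.get(rel, rel),
            ENTITY_SYNONYMS.get(arg1, arg1),
            ENTITY_SYNONYMS.get(arg2, arg2),
        )
        counts[key] += 1
    return counts


raw = [
    ("XYZ Corp.", "acquired", "Go Inc."),
    ("oranges", "contain", "Vitamin C"),
    ("Einstein", "was born in", "Ulm"),
    ("XYZ", "buyout of", "Go Inc."),
    ("Albert Einstein", "born in", "Ulm"),
    ("Einstein Bros.", "sell", "bagels"),
]
for (rel, a1, a2), n in assess(raw).items():
    print(f"{rel}({a1}, {a2}) [{n}]")
```

Exact dictionary lookup keeps "Einstein Bros." distinct from "Einstein", mirroring the inequality in the synonym list.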
DEMO
i.e. ReVerb
– http://ai.cs.washington.edu/www/media/downloadable/media/sherlockrules.zip
Score 144.0: require(process_A, product_B) :- produce(process_A, product_C), be make from(product_B, product_C)
Score 114.5: require(process_A, product_B) :- produce(process_A, product_C), make(product_B, product_C)
Score 40.8: require(process_A, product_B) :- produce(process_A, product_C), to make(product_C, product_B)
Score 37.8: require(process_A, product_B) :- produce(process_A, product_C), to make(product_B, product_C)
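How such a rule licenses new facts can be shown with a one-step forward-chaining sketch. It assumes facts are stored as (relation, arg1, arg2) triples and handles only rules of the shape head(A, B) :- body1(A, C), body2(B, C); the function apply_rule and the placeholder facts are hypothetical and not taken from the rule set.

```python
def apply_rule(facts, head, body1, body2):
    """Derive head(A, B) whenever body1(A, C) and body2(B, C) both hold."""
    derived = set()
    for rel1, a, c1 in facts:
        if rel1 != body1:
            continue
        for rel2, b, c2 in facts:
            # The shared variable C must bind to the same value in both atoms.
            if rel2 == body2 and c1 == c2:
                derived.add((head, a, b))
    return derived


# Placeholder facts for illustration only.
facts = {
    ("produce", "process_1", "product_x"),
    ("be make from", "product_y", "product_x"),
    ("produce", "process_2", "product_z"),
}
print(apply_rule(facts, "require", "produce", "be make from"))
```

Rules whose second body atom has its arguments in the opposite order, such as the one scored 40.8, would need the join condition adapted; the scores themselves are ignored here.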
http://conceptnet5.media.mit.edu
BUT: uses a proprietary format => limited interoperability, reasoning?
– Jurafsky & Martin (2009), §22.1-22.3 – Carstensen et al. (2010), §5.3.3-5.3.4
– Ken Barker et al. (2007), Learning by reading: A prototype system, performance baseline and lessons learned, In: Proceedings of 22nd National Conference on Artificial Intelligence (AAAI-07), Vancouver, BC.
– Hoifung Poon et al. (2010), Machine Reading at the University of Washington, In: Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading, Los Angeles, California, 87-95.
– Anthony Fader, Stephen Soderland, and Oren Etzioni (2011), Identifying Relations for Open Information Extraction, EMNLP 2011.
– Anselmo Penas, Eduard Hovy (2010), Filling Knowledge Gaps in Text for Machine Reading, COLING 2010