Semantische Technologien (M-TANI), Christian Chiarcos, Angewandte Computerlinguistik (PowerPoint PPT Presentation)




SLIDE 1

Aktuelle Themen der Angewandten Informatik

Semantische Technologien

(M-TANI)

Christian Chiarcos Angewandte Computerlinguistik chiarcos@informatik.uni-frankfurt.de

  • July 11, 2013
SLIDE 2

Machine Reading & Open IE

  • Pretext: Continue from last week

– Structured evidence, slide 90ff.

  • Machine Reading: Definition and goals
  • Open IE

– TextRunner

  • Applications
  • Structured Knowledge

– Entities, types, ontologies
– OpenIE + LOD

SLIDE 3

Machine Reading (“Learning by Reading”)

  • goal (informally): “acquisition of commonsense knowledge”

– machine reading is the automatic, unsupervised understanding of text

  • Machine reading, or learning by reading, aims to extract knowledge automatically from unstructured text and apply the extracted knowledge to end tasks such as decision making and question answering (Poon et al. 2010)

  • Our task is to build a formal representation of a specific, coherent topic through deep processing of concise texts focused on that topic. (Barker et al. 2007)

SLIDE 4

Machine Reading: Desiderata

  • End-to-end

– input raw text, extract knowledge, and be able to answer questions and support other end tasks

  • High quality

– extract knowledge with high accuracy

  • Large-scale

– acquire knowledge at Web-scale and be open to arbitrary domains, genres, and languages

  • Maximally autonomous

– the system should incur minimal human effort

  • Continuous learning from experience

– constantly integrate new information sources and learn from user questions and feedback

(Poon et al. 2010)

SLIDE 5

Breadth/Depth tradeoff

(a) broad/shallow (e.g., KnowItAll/TextRunner)

– use a broad range of materials
– extract repetitive facts from them → set of relational tuples

(b) narrow/deep (e.g., Möbius [Barker et al 2007])

– narrow range of materials (either in terms of simplified NL syntax or being limited to a single domain)
– extract as much knowledge as possible from those materials → a coherent and complete semantic model for an entire focused text

SLIDE 6

Breadth/Depth tradeoff

(c) support deep systems with resources built by broad/shallow systems

→ Open IE to construct a Background Knowledge Base (BKB)*
→ consult this BKB in a deep system, e.g., for type inferences

  • for inference of implicit information

Today, we focus on a shallow system (KnowItAll/TextRunner, Oren Etzioni, University of Washington, since 2003)

– Slides from Oren Etzioni (2012), Open Information Extraction from the Web. Invited talk at the NAACL-HLTC 2012 Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX 2012), June 2012, Montréal, Canada

* Other BKBs may be built using syntax-based generalizations as described by Penas & Hovy (2010)

SLIDE 7

Definition Machine Reading

  • “MR is an exploratory, open-ended, serendipitous process”

  • “In contrast with many NLP tasks, MR is inherently unsupervised”

  • “Very large scale”
  • “Forming Generalizations based on extracted assertions”

  • Ontology-free!
SLIDE 8

Open Information Extraction

  • Definition and goals
  • Open IE

– TextRunner

  • Applications
  • Structured Knowledge

– Entities, types, ontologies
– OpenIE + LOD

SLIDE 9

Open IE

  • Open Information Extraction (IE) is the task of extracting assertions from massive corpora without requiring a pre-specified vocabulary.

  • Information Extraction (IE) systems learn an extractor for each target relation from labeled training examples

– does not scale to corpora where the number of target relations is very large, or where the target relations cannot be specified in advance.

  • Open IE approach: identifying relation phrases—phrases that denote relations in English sentences

– extraction of arbitrary relations from sentences, obviating the restriction to a pre-specified vocabulary

Fader et al. (2011), Identifying Relations for Open Information Extraction, EMNLP 2011.

SLIDE 10

Classical KR research

  • Declarative KR is expensive & difficult
  • Formal semantics is at odds with

– Broad scope
– Distributed authorship

  • KBs are brittle: “can only be used for tasks whose knowledge needs have been anticipated in advance”

SLIDE 11

KR-based IE: Hearst Patterns

Elvis was a great artist, but while all of Elvis’ colleagues loved the song “Oh yeah, honey”, Elvis did not perform that song at his concert in Hintertuepflingen.

Idea (by Hearst): Sentences express class membership in very predictable patterns. Use these patterns for instance extraction.

Entity: Elvis → Class: artist

Hearst patterns:

  • X was a great Y

Knowledge Base: several pre-defined relations plus instances, e.g., for is-a relations (class membership)

Slide from Fabian M. Suchanek (2010)
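The Hearst idea above can be sketched with a single regular expression. The pattern and function names below are illustrative; Hearst's original work used a richer inventory of patterns.

```python
import re

# One Hearst-style pattern: "X was a great Y" signals X is-a Y.
IS_A_PATTERN = re.compile(r"(?P<entity>[A-Z]\w*) was a great (?P<cls>[a-z]+)")

def extract_is_a(sentence):
    """Return (entity, class) instances matched by the pattern."""
    return [(m.group("entity"), m.group("cls"))
            for m in IS_A_PATTERN.finditer(sentence)]

print(extract_is_a("Elvis was a great artist, but he did not perform."))
# [('Elvis', 'artist')]
```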

SLIDE 12

KR-based IE: Bootstrapping

Bootstrapping

Seed: manually collected instances of a relation (or use a hand-crafted pattern to retrieve such instances)

hand-crafted Hearst pattern:

  • X was a great Y
SLIDE 13

KR-based IE: Bootstrapping

Bootstrapping

Seed: manually collected instances of a relation (or use a hand-crafted pattern to retrieve such instances)

Search: for every seed instance, retrieve sentences that contain its elements

seed:

  • (June, is-a, month)
  • (52, is-a, comic)
  • (Robert Altman, is-a, pothead)
  • (Lowry, is-a, reporter)
SLIDE 16

KR-based IE: Bootstrapping

Bootstrapping

Seed: manually collected instances of a relation (or use a hand-crafted pattern to retrieve such instances)

Search: for every seed instance, retrieve sentences that contain its elements

Generate patterns: for every instance match, replace the matches with variables, keep the immediate context (say, the words between)

Pruning: keep only the most confident (frequent, recurring, etc.) patterns

Iterate: retrieve instances and iterate pattern candidates

  • X. Education Y (Lowry: 1)
  • X is a Y (Lowry: 4)
  • X is a weekly American Y (52: 2)
  • X new Y (52: 1)
  • X is the sixth Y (June: 1)
  • X is National Y (June: 1)
  • X is PTSD Awareness Y (June: 1)
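The Seed / Search / Generate / Prune / Iterate cycle from the preceding slides can be sketched as follows. The corpus, the seeds, and the simple middle-context patterns are toy illustrations, not a production system.

```python
import re
from collections import Counter

def middle_pattern(sentence, x, y):
    """Replace a matched seed pair with variables, keeping
    the immediate context (the words between): 'X is a Y'."""
    i, j = sentence.find(x), sentence.find(y)
    if i == -1 or j == -1 or i >= j:
        return None
    return "X" + sentence[i + len(x):j] + "Y"

def bootstrap(corpus, seeds, rounds=2, min_count=2):
    instances = set(seeds)
    for _ in range(rounds):
        # Generate pattern candidates from all current instances.
        counts = Counter()
        for x, y in instances:
            for sent in corpus:
                p = middle_pattern(sent, x, y)
                if p:
                    counts[p] += 1
        # Pruning: keep only patterns supported by several matches.
        kept = [p for p, c in counts.items() if c >= min_count]
        # Retrieve new instances with the surviving patterns.
        for p in kept:
            rx = re.compile(p.replace("X", r"(\w+)", 1).replace("Y", r"(\w+)", 1))
            for sent in corpus:
                for m in rx.finditer(sent):
                    instances.add((m.group(1), m.group(2)))
    return instances

corpus = ["June is a month", "Lowry is a reporter", "Paris is a city"]
seeds = {("June", "month"), ("Lowry", "reporter")}
print(sorted(bootstrap(corpus, seeds)))
# [('June', 'month'), ('Lowry', 'reporter'), ('Paris', 'city')]
```

Both seeds share the context "X is a Y", so the pattern survives pruning and retrieves the new instance (Paris, city); with noisy patterns like "X new Y", this is also where the noise enters.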

SLIDE 17

Limits of Bootstrapping

  • established methodology to increase the coverage of pattern-based information extraction / information retrieval

– cf. http://bootcat.sslmit.unibo.it (to bootstrap corpora for a particular language given a small number of seed words)

  • noise increases with every generation of patterns and instances

  • noise level cannot be reliably measured

– no negative evidence

  • cannot extend the number of relations in the KB
SLIDE 18

KR-based Open IE ?

  • A “universal ontology” is impossible

– Global consistency is like world peace

  • Micro ontologies ?

– Do these scale? Interconnections?

  • Ontological “glass ceiling”

– Limited vocabulary
– Pre-determined predicates
– Coverage restricted to pre-defined relations

SLIDE 19

Open vs. Traditional IE

             Traditional IE                   Open IE
Input:       Corpus + O(R) hand-labeled data  Corpus
Relations:   Specified in advance             Discovered automatically
Extractor:   Relation-specific                Relation-independent

Etzioni, University of Washington

How is Open IE Possible?

SLIDE 20

Open IE: TextRunner (2007)

  • Extractor

– a single pass over all documents, POS-tagging, NP chunking
– for each pair of NPs that are not too far apart,* apply a classifier to determine whether or not to extract a relationship

  • several other constraints apply, as well
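The single-pass candidate generation can be sketched like this. The toy chunker, the tag set, and the gap threshold are simplifications; in TextRunner, a trained classifier then decides whether to keep each candidate.

```python
def np_chunks(tokens, tags):
    """Toy NP chunker: maximal runs of determiner/noun tags,
    returned as (start, end, text) spans."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["."]):        # sentinel closes the last run
        if tag in ("DT", "NNP", "NN", "NNS"):
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i, " ".join(tokens[start:i])))
            start = None
    return spans

def candidates(tokens, tags, max_gap=4):
    """For each pair of nearby NPs, propose (arg1, relation, arg2);
    a classifier would then decide whether to extract it."""
    spans = np_chunks(tokens, tags)
    out = []
    for (s1, e1, a1), (s2, e2, a2) in zip(spans, spans[1:]):
        gap = tokens[e1:s2]
        if 0 < len(gap) <= max_gap:               # NPs not too far apart
            out.append((a1, " ".join(gap), a2))
    return out

tokens = "Einstein was born in Ulm".split()
tags = ["NNP", "VBD", "VBN", "IN", "NNP"]
print(candidates(tokens, tags))
# [('Einstein', 'was born in', 'Ulm')]
```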
SLIDE 21

Open IE: TextRunner (2007)

  • Self-Supervised Classifier

– generate training examples for extraction
– using several heuristic constraints, automatically label the training set as trustworthy or untrustworthy (positive and negative examples)
– the classifier is trained on these examples

  • main feature: part of speech tags
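A minimal sketch of the self-supervision step: heuristics, not humans, label candidate tuples as positive or negative training examples. The concrete constraints below are illustrative, not the exact ones used by TextRunner.

```python
def heuristic_label(arg1, rel, arg2, rel_tags):
    """True = trustworthy, False = untrustworthy (illustrative rules)."""
    if not any(t.startswith("VB") for t in rel_tags):
        return False                  # relation phrase should contain a verb
    if len(rel.split()) > 5:
        return False                  # overly long relation phrases are noisy
    if arg1.lower() in ("it", "this", "that"):
        return False                  # pronoun argument names no entity
    return True

examples = [
    ("Einstein", "was born in", "Ulm", ["VBD", "VBN", "IN"]),
    ("it", "was born in", "Ulm", ["VBD", "VBN", "IN"]),
    ("Berlin", "which according to some accounts near", "Potsdam",
     ["WDT", "VBG", "IN", "DT", "NNS", "IN"]),
]
labels = [heuristic_label(*ex) for ex in examples]
print(labels)
# [True, False, False]
```

A classifier (with POS tags as its main features) trained on such automatically labeled examples can then judge candidates the heuristics never saw.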
SLIDE 22

Number of Relations

DARPA MR Domains              <50
NYU, Yago                     <100
NELL                          ~500
DBpedia 3.2                   940
PropBank                      3,600
VerbNet                       5,000
Wikipedia InfoBoxes (f > 10)  ~5,000
TextRunner (phrases)          100,000+
ReVerb (phrases)              1,000,000+

Etzioni, University of Washington

SLIDE 23

Relation Phrases

invented, acquired by, has a PhD in, denied, voted for, inhibits tumor growth in, inherited, born in, mastered the art of, downloaded, aspired to, is the patron saint of, expelled, arrived from, wrote the book on

Etzioni, University of Washington

SAMPLE OF EXTRACTED RELATIONS

SLIDE 24

Relation Phrases

Etzioni, University of Washington

SLIDE 25

Open IE: TextRunner (2007)

  • Cleaning up relations

– Unsupervised, probabilistic synonym detection

  • P(Bill Clinton = President Clinton)

– Count shared (relation, arg2)

  • P(acquired = bought)

– Relations: count shared (arg1, arg2)

Etzioni, University of Washington
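The count-based synonym intuition can be sketched with a Jaccard overlap over shared (relation, arg2) contexts. This is a toy stand-in for TextRunner's probabilistic synonym model, on illustrative tuples:

```python
from collections import defaultdict

tuples = [
    ("Bill Clinton", "was elected", "president"),
    ("President Clinton", "was elected", "president"),
    ("Bill Clinton", "vetoed", "the bill"),
    ("President Clinton", "vetoed", "the bill"),
    ("Einstein", "was born in", "Ulm"),
]

# Collect, for every arg1 string, the set of (relation, arg2) contexts.
contexts = defaultdict(set)
for a1, rel, a2 in tuples:
    contexts[a1].add((rel, a2))

def synonym_score(x, y):
    """Jaccard overlap of shared (relation, arg2) contexts."""
    cx, cy = contexts[x], contexts[y]
    return len(cx & cy) / len(cx | cy) if cx | cy else 0.0

print(synonym_score("Bill Clinton", "President Clinton"))  # 1.0
print(synonym_score("Bill Clinton", "Einstein"))           # 0.0
```

The same idea works for relation synonyms (acquired = bought) by counting shared (arg1, arg2) pairs instead.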

SLIDE 26

OpenIE: TextRunner (2007)

  • 1. Single Pass Extractor
  • 2. Self-Supervised Classifier
  • 3. Synonym Resolution
  • 4. Query Interface

Evaluation (2007):

9 million Web documents => 7.8 million well-formed tuples

randomly selected subset (400 tuples):

  • 80.4% deemed correct by human judges

Yates et al. (2007), TextRunner: Open Information Extraction on the Web, NAACL-HLT 2007.

SLIDE 27

TextRunner (2007)

  • First Web-scale Open IE system
  • Distant supervision + CRF models of relations (Arg1, Relation phrase, Arg2)
  • 1,000,000,000 distinct extractions

Etzioni, University of Washington

SLIDE 28

Relation Extraction from Web

Etzioni, University of Washington

SLIDE 29

TextRunner Extractions

born_in(Einstein, Ulm)
headquartered_in(Microsoft, Redmond)
founded_in(Microsoft, 1973)
born_in(Bill Gates, Seattle)
founded_in(Google, 1998)
headquartered_in(Google, Mountain View)
born_in(Sergey Brin, Moscow)
founded_in(Microsoft, Albuquerque)
born_in(Einstein, March)
born_in(Sergey Brin, 1973)

SLIDE 30

Systems for Open IE (2012)

  • ReVerb (http://reverb.cs.washington.edu, EMNLP 2011)

– improves the original TextRunner implementation by specifying constraints

  • restricting possible POS patterns
  • keep only patterns that have multiple different instances

=> basis of current TextRunner (http://openie.cs.washington.edu)
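ReVerb's POS-pattern constraint can be approximated as a regular expression over tag sequences. The tag classes below are a simplified rendering of the paper's "verb, optionally followed by words, ending in a preposition" (V | V W* P) pattern:

```python
import re

# Simplified tag classes; the real constraint is richer.
VERB = r"VB[DGNPZ]?"
WORD = r"(?:NN|JJ|RB|PRP|VB[DGNPZ]?)"
PREP = r"(?:IN|TO|RP)"
RELATION = re.compile(rf"^{VERB}(?: {WORD})*(?: {PREP})?$")

def is_relation_phrase(tags):
    """Does this POS-tag sequence look like a relation phrase?"""
    return bool(RELATION.match(" ".join(tags)))

print(is_relation_phrase(["VBD", "VBN", "IN"]))  # "was born in" -> True
print(is_relation_phrase(["NN", "IN"]))          # "book of"    -> False
```

Together with the lexical constraint (keep only patterns seen with multiple distinct argument pairs), this filters out incoherent and uninformative extractions.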

  • Ollie (http://knowitall.github.io/ollie, EMNLP 2012)

– Parser-based, not POS-based

  • Verbs → Nouns and more
  • Analyze context (beliefs, counterfactuals)

But what about entities, types, ontologies?

Etzioni, University of Washington

SLIDE 31

Open IE Applications

  • Definition and goals
  • Open IE
  • Applications

– Search & Q/A
– KB construction
– Reasoning

  • Structured Knowledge

– Entities, types, ontologies
– OpenIE + LOD

SLIDE 32

Novel search paradigms

“Moving Up the Information Food Chain”

Retrieval → Extraction
Snippets, docs → Entities, Relations
Keyword queries → Questions
List of docs → Answers

Essential for smartphones! (Siri meets Watson)

Etzioni, University of Washington

SLIDE 33

Case Study over Yelp Reviews

  • 1. Map review corpus to (attribute, value)

(sushi = fresh) (parking = free)

  • 2. Natural-language queries

“Where’s the best sushi in Seattle?”

  • 3. Sort results via sentiment analysis

exquisite > very good > so-so

Etzioni, University of Washington
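The three steps of the Yelp case study can be sketched on toy data; the reviews, the extraction pattern, and the sentiment scale below are all illustrative:

```python
import re

# 3. A tiny hand-made sentiment scale (illustrative).
SENTIMENT = {"exquisite": 3, "very good": 2, "so-so": 1}

# 1. Map review text to (attribute, value) pairs.
PAIR = re.compile(r"the (\w+) (?:was|is) (exquisite|very good|so-so)")

reviews = {
    "Sushi Aoki": "the sushi was exquisite and the parking is free",
    "Ocean Roll": "the sushi was so-so",
    "Ginza":      "the sushi was very good",
}

def query(attribute):
    """2. Answer an attribute query, 3. ranked by sentiment."""
    hits = []
    for restaurant, text in reviews.items():
        for attr, value in PAIR.findall(text):
            if attr == attribute:
                hits.append((restaurant, value))
    return sorted(hits, key=lambda h: SENTIMENT.get(h[1], 0), reverse=True)

print(query("sushi"))
# [('Sushi Aoki', 'exquisite'), ('Ginza', 'very good'), ('Ocean Roll', 'so-so')]
```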

SLIDE 34

RevMiner: Interface to 400K Yelp Reviews (http://revminer.com)

Etzioni, University of Washington

SLIDE 35

Knowledge Base construction

KnowItAll/TextRunner (http://openie.cs.washington.edu)

  • 94M Rel-grams: n-grams, but over relations in text

  • 600K Relation phrases
  • Relation Meta-data:

– 50K Domain/range for relations
– 10K Functional relations
– 30K Horn clauses
– 10M entailment rules

Etzioni, University of Washington

SLIDE 36

Knowledge Bases → Type Inference

SLIDE 37

Examples of Learned Domain/range

  • elect(Country, Person)
  • predict(Expert, Event)
  • download(People, Software)
  • invest(People, Assets)
  • Was-born-in(Person, Location OR Date)

Etzioni, University of Washington
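Learned domain/range meta-data can be used to type-check extractions, e.g., to reject the erroneous born_in(Einstein, March) tuple seen earlier. The type inventory and the checker below are illustrative:

```python
# Relation signatures following the slide's examples; a range may
# allow several types, as in born_in(Person, Location OR Date).
SIGNATURES = {
    "elect":   ("Country", "Person"),
    "predict": ("Expert", "Event"),
    "born_in": ("Person", ("Location", "Date")),
}

# Toy entity typing (a real system would use NER or a KB lookup).
ENTITY_TYPES = {
    "France": "Country", "Macron": "Person",
    "Einstein": "Person", "Ulm": "Location", "March": "Month",
}

def well_typed(arg1, rel, arg2):
    """Does the extraction respect the relation's domain/range?"""
    if rel not in SIGNATURES:
        return True                       # no constraint learned
    dom, rng = SIGNATURES[rel]
    allowed = rng if isinstance(rng, tuple) else (rng,)
    return ENTITY_TYPES.get(arg1) == dom and ENTITY_TYPES.get(arg2) in allowed

print(well_typed("France", "elect", "Macron"))     # True
print(well_typed("Einstein", "born_in", "March"))  # False: Month not allowed
```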

SLIDE 38

Adding Structured Knowledge

  • Definition and goals
  • Open IE

– TextRunner

  • Applications
  • Structured Knowledge

– Entities, types, ontologies
– OpenIE + LOD

SLIDE 39

Entities, types, ontologies

  • After beating the Heat, the Celtics are now the “top dog” in the NBA.

  • (the Celtics, beat, the Heat)

=> Ontologies to facilitate interpretation/disambiguation/reasoning

NER: recognize entity type => type-specific treatment

SLIDE 40

Entities, types, ontologies

  • After beating the Heat, the Celtics are now the “top dog” in the NBA.

  • (the Celtics, beat, the Heat)

=> Ontologies to facilitate interpretation/disambiguation/reasoning

NER: recognize entity type => type-specific treatment

Related domain: Machine Translation

“Recentemente”, conferma Maria Serena Balestracci, “mi ha telefonato un signore da Bologna, che aveva sentito parlare del libro alla radio”
(“Recently”, Maria Serena Balestracci confirms, “a gentleman from Bologna phoned me who had heard about the book on the radio”)

statistical MT (Google Translate): “... a gentleman from London ...” [Oct. 2010]
statistical MT + Freebase (Google Knowledge Graph): “... a gentleman from Bologna ...” [July 2013]

SLIDE 41

TextRunner pipeline (Etzioni, University of Washington)

Input: Web corpus
Extractor → raw tuples (relation-independent extraction)
Assessor → extractions (synonym detection, confidence)
Index in Lucene; link entities
Query processor (DEMO; i.e., ReVerb)

Raw tuples:
(XYZ Corp.; acquired; Go Inc.)
(oranges; contain; Vitamin C)
(Einstein; was born in; Ulm)
(XYZ; buyout of; Go Inc.)
(Albert Einstein; born in; Ulm)
(Einstein Bros.; sell; bagels)

Entity linking: XYZ Corp. = XYZ; Albert Einstein = Einstein != Einstein Bros.

Assessed extractions:
Acquire(XYZ Corp., Go Inc.) [7]
BornIn(Albert Einstein, Ulm) [5]
Sell(Einstein Bros., bagels) [1]
Contain(oranges, Vitamin C) [1]

SLIDE 42

Ontologies

Open IE + entity types and synonym detection. But still:

  • Lack of formal ontology/vocabulary
  • Inconsistent extractions
  • Reasoning?
SLIDE 43

Open IE-based Reasoning

  • Sherlock: Extract Horn clauses from running text

– http://ai.cs.washington.edu/www/media/downloadable/media/sherlockrules.zip

require(process_A, product_B) :- produce(process_A, product_C), be make from(product_B, product_C)   [Score 144.0]
require(process_A, product_B) :- produce(process_A, product_C), make(product_B, product_C)           [Score 114.5]
require(process_A, product_B) :- produce(process_A, product_C), to make(product_C, product_B)        [Score 40.8]
require(process_A, product_B) :- produce(process_A, product_C), to make(product_B, product_C)        [Score 37.8]

SLIDE 44

Open IE-based Reasoning

  • Sherlock: Extract Horn clauses from running text

– http://ai.cs.washington.edu/www/media/downloadable/media/sherlockrules.zip

require(process_A, product_B) :- produce(process_A, product_C), be make from(product_B, product_C)   [Score 144.0]
require(process_A, product_B) :- produce(process_A, product_C), make(product_B, product_C)           [Score 114.5]
require(process_A, product_B) :- produce(process_A, product_C), to make(product_C, product_B)        [Score 40.8]
require(process_A, product_B) :- produce(process_A, product_C), to make(product_B, product_C)        [Score 37.8]

Possible, but quality and coverage are limited. => Augment with other resources.
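Applying one learned Horn clause by forward chaining can be sketched as follows. The rule mirrors the top-scoring Sherlock rule above (the lemmatized relation string "be make from" is kept verbatim); the facts and the tiny matcher are illustrative:

```python
# require(A, B) :- produce(A, C), be make from(B, C)
RULE = {
    "head": "require",
    "body": ("produce", "be make from"),
}

# Illustrative extracted facts as (relation, arg1, arg2) tuples.
facts = {
    ("produce", "photosynthesis", "oxygen"),
    ("be make from", "ozone", "oxygen"),
}

def apply_rule(rule, facts):
    """Forward chaining for one two-literal Horn clause:
    join the body literals on the shared variable C."""
    inferred = set()
    r1, r2 = rule["body"]
    for (p, a, c1) in facts:
        if p != r1:
            continue
        for (q, b, c2) in facts:
            if q == r2 and c1 == c2:
                inferred.add((rule["head"], a, b))
    return inferred

print(apply_rule(RULE, facts))
# {('require', 'photosynthesis', 'ozone')}
```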

SLIDE 45

Open IE + Domain Knowledge

http://conceptnet5.media.mit.edu

BUT: uses a proprietary format => limited interoperability; reasoning?

SLIDE 46

OpenIE + Linked Open Data

  • Linked Open Output:

– Extractions → Linked Open Data (LOD) cloud
– Relation normalization
– Use LOD best practices

  • Specialized reasoners

– OWL/DL data models for LOD data

  • Still a desideratum

Etzioni, University of Washington

SLIDE 47

Linked Open Data

  • Interoperability, distributed authorship, vs. a monolithic system

  • Open IE meets RDF:

– Need URIs for predicates. How to obtain them?
– What about errors in mapping to URIs?
– Ambiguity? Uncertainty?

Etzioni, University of Washington
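Minting URIs for an Open IE tuple can be sketched as below. The namespace and slug scheme are assumptions for illustration; as the slide notes, real mappings must additionally deal with ambiguity and mapping errors.

```python
from urllib.parse import quote

BASE = "http://example.org/openie/"       # assumed, illustrative namespace

def to_uri(text, kind):
    """Mint a URI for an extracted string (naive slug scheme)."""
    slug = quote(text.strip().lower().replace(" ", "_"))
    return f"<{BASE}{kind}/{slug}>"

def to_ntriple(arg1, rel, arg2):
    """Serialize one extraction as an N-Triples line."""
    return f"{to_uri(arg1, 'entity')} {to_uri(rel, 'relation')} {to_uri(arg2, 'entity')} ."

print(to_ntriple("Albert Einstein", "was born in", "Ulm"))
# <http://example.org/openie/entity/albert_einstein> <http://example.org/openie/relation/was_born_in> <http://example.org/openie/entity/ulm> .
```

In practice one would link the minted URIs to existing LOD identifiers (e.g., DBpedia resources) rather than keep a private namespace.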

SLIDE 48

Context matters

  • If he wins 5 key states, Romney will be president

  • (counterfactual: “if he wins 5 key states”)

Discourse relations (next week)

SLIDE 49

References & Further Reading

  • Information Extraction

– Jurafsky & Martin (2009), §22.1-22.3
– Carstensen et al. (2010), §5.3.3-5.3.4

  • Deep Machine Reading

– Ken Barker et al. (2007), Learning by reading: A prototype system, performance baseline and lessons learned, In: Proceedings of 22nd National Conference on Artificial Intelligence (AAAI-07), Vancouver, BC.

  • Shallow Machine Reading / OpenIE (TextRunner)

– Poon, Hoifung et al. (2010), Machine Reading at the University of Washington, In: Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading, Los Angeles, California, 87-95
– Anthony Fader, Stephen Soderland, and Oren Etzioni (2011), Identifying Relations for Open Information Extraction, EMNLP 2011.

  • FYI: Alternative Approaches in Machine Reading

– Anselmo Penas, Eduard Hovy (2010), Filling Knowledge Gaps in Text for Machine Reading, COLING 2010

  • an alternative strategy to build background knowledge bases for deep machine reading