Semantische Technologien (M-TANI), Christian Chiarcos, Angewandte Computerlinguistik (PowerPoint PPT Presentation)




SLIDE 1

Aktuelle Themen der Angewandten Informatik

Semantische Technologien

(M-TANI)

Christian Chiarcos Angewandte Computerlinguistik chiarcos@informatik.uni-frankfurt.de

  • July 11, 2013
SLIDE 2

Machine Reading & Open IE

  • Pretext: Continue from last week

– Structured evidence, slide 90ff.

  • Machine Reading: Definition and goals
  • Open IE

– TextRunner

  • Applications
  • Structured Knowledge

– Entities, types, ontologies
– OpenIE + LOD

SLIDE 3

Machine Reading (“Learning by Reading”)

  • goal (informally): “acquisition of commonsense knowledge”

– machine reading is the automatic, unsupervised understanding of text

  • Machine reading, or learning by reading, aims to extract knowledge automatically from unstructured text and apply the extracted knowledge to end tasks such as decision making and question answering (Poon et al. 2010)

  • Our task is to build a formal representation of a specific, coherent topic through deep processing of concise texts focused on that topic. (Barker et al. 2007)

SLIDE 4

Machine Reading: Desiderata

  • End-to-end

– input raw text, extract knowledge, and be able to answer questions and support other end tasks

  • High quality

– extract knowledge with high accuracy

  • Large-scale

– acquire knowledge at Web-scale and be open to arbitrary domains, genres, and languages

  • Maximally autonomous

– the system should incur minimal human effort

  • Continuous learning from experience

– constantly integrate new information sources and learn from user questions and feedback

(Poon et al. 2010)

SLIDE 5

Breadth/Depth tradeoff

(a) broad/shallow (e.g., KnowItAll/TextRunner)

– use a broad range of materials
– extract repetitive facts from them → set of relational tuples

(b) narrow/deep (e.g., Möbius [Barker et al 2007])

– narrow range of materials (either in terms of simplified NL syntax or being limited to a single domain)
– extract as much knowledge as possible from those materials → a coherent and complete semantic model for an entire focused text

SLIDE 6

Breadth/Depth tradeoff

(c) support deep systems with resources built by broad/shallow systems

→ Open IE to construct a Background Knowledge Base (BKB)*
→ consult this BKB in a deep system, e.g., for type inferences

  • for inference of implicit information

Today, we focus on a shallow system (KnowItAll/TextRunner, Oren Etzioni, University of Washington, since 2003)

– Slides from Oren Etzioni (2012), Open Information Extraction from the Web. Invited talk at the NAACL-HLTC 2012 Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX 2012), June 2012, Montréal, Canada

* Other BKBs may be built using syntax-based generalizations as described by Penas & Hovy (2010)

SLIDE 7

Definition Machine Reading

  • “MR is an exploratory, open-ended, serendipitous process”

  • “In contrast with many NLP tasks, MR is inherently unsupervised”

  • “Very large scale”
  • “Forming Generalizations based on extracted assertions”

  • Ontology-free!
SLIDE 8

Open Information Extraction

  • Definition and goals
  • Open IE

– TextRunner

  • Applications
  • Structured Knowledge

– Entities, types, ontologies
– OpenIE + LOD

SLIDE 9

Open IE

  • Open Information Extraction (IE) is the task of extracting assertions from massive corpora without requiring a pre-specified vocabulary.

  • Information Extraction (IE) systems learn an extractor for each target relation from labeled training examples

– does not scale to corpora where the number of target relations is very large, or where the target relations cannot be specified in advance.

  • Open IE approach: identifying relation phrases—phrases that denote relations in English sentences

– extraction of arbitrary relations from sentences, obviating the restriction to a pre-specified vocabulary

Fader et al. (2011), Identifying Relations for Open Information Extraction, EMNLP 2011.

SLIDE 10

Classical KR research

  • Declarative KR is expensive & difficult
  • Formal semantics is at odds with

– Broad scope
– Distributed authorship

  • KBs are brittle: “can only be used for tasks whose knowledge needs have been anticipated in advance”

SLIDE 11

KR-based IE: Hearst Patterns

Elvis was a great artist, but while all of Elvis’ colleagues loved the song “Oh yeah, honey”, Elvis did not perform that song at his concert in Hintertuepflingen.

Idea (by Hearst): Sentences express class membership in very predictable patterns. Use these patterns for instance extraction.

Entity: Elvis → Class: artist

Hearst patterns:

  • X was a great Y

Knowledge Base: several pre-defined relations plus instances, e.g., for is-a relations (class membership)

Slide from Fabian M. Suchanek (2010)
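The Hearst idea above can be sketched with a single regular expression. The pattern and function names below are illustrative; Hearst's original work used a richer inventory of patterns.

```python
import re

# One Hearst-style pattern: "X was a great Y" signals X is-a Y.
IS_A_PATTERN = re.compile(r"(?P<entity>[A-Z]\w*) was a great (?P<cls>[a-z]+)")

def extract_is_a(sentence):
    """Return (entity, class) instances matched by the pattern."""
    return [(m.group("entity"), m.group("cls"))
            for m in IS_A_PATTERN.finditer(sentence)]

print(extract_is_a("Elvis was a great artist, but he did not perform."))
# [('Elvis', 'artist')]
```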

SLIDE 12

KR-based IE: Bootstrapping

Bootstrapping

Seed: manually collected instances of a relation (or use a hand-crafted pattern to retrieve such instances)

hand-crafted Hearst pattern:

  • X was a great Y
SLIDE 13

KR-based IE: Bootstrapping

Bootstrapping

Seed: manually collected instances of a relation (or use a hand-crafted pattern to retrieve such instances)

Search: for every seed instance, retrieve sentences that contain its elements

seed:

  • (June, is-a, month)
  • (52, is-a, comic)
  • (Robert Altman, is-a, pothead)
  • (Lowry, is-a, reporter)
SLIDE 16

KR-based IE: Bootstrapping

Bootstrapping

Seed: manually collected instances of a relation (or use a hand-crafted pattern to retrieve such instances)

Search: for every seed instance, retrieve sentences that contain its elements

Generate patterns: for every instance match, replace the matches with variables, keep the immediate context (say, the words between)

Pruning: keep only the most confident (frequent, recurring, etc.) patterns

Iterate: retrieve instances and iterate pattern candidates

  • X. Education Y (Lowry: 1)
  • X is a Y (Lowry: 4)
  • X is a weekly American Y (52: 2)
  • X new Y (52: 1)
  • X is the sixth Y (June: 1)
  • X is National Y (June: 1)
  • X is PTSD Awareness Y (June: 1)
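The Seed / Search / Generate / Prune / Iterate cycle from the preceding slides can be sketched as follows. The corpus, the seeds, and the simple middle-context patterns are toy illustrations, not a production system.

```python
import re
from collections import Counter

def middle_pattern(sentence, x, y):
    """Replace a matched seed pair with variables, keeping
    the immediate context (the words between): 'X is a Y'."""
    i, j = sentence.find(x), sentence.find(y)
    if i == -1 or j == -1 or i >= j:
        return None
    return "X" + sentence[i + len(x):j] + "Y"

def bootstrap(corpus, seeds, rounds=2, min_count=2):
    instances = set(seeds)
    for _ in range(rounds):
        # Generate pattern candidates from all current instances.
        counts = Counter()
        for x, y in instances:
            for sent in corpus:
                p = middle_pattern(sent, x, y)
                if p:
                    counts[p] += 1
        # Pruning: keep only patterns supported by several matches.
        kept = [p for p, c in counts.items() if c >= min_count]
        # Retrieve new instances with the surviving patterns.
        for p in kept:
            rx = re.compile(p.replace("X", r"(\w+)", 1).replace("Y", r"(\w+)", 1))
            for sent in corpus:
                for m in rx.finditer(sent):
                    instances.add((m.group(1), m.group(2)))
    return instances

corpus = ["June is a month", "Lowry is a reporter", "Paris is a city"]
seeds = {("June", "month"), ("Lowry", "reporter")}
print(sorted(bootstrap(corpus, seeds)))
# [('June', 'month'), ('Lowry', 'reporter'), ('Paris', 'city')]
```

Both seeds share the context "X is a Y", so the pattern survives pruning and retrieves the new instance (Paris, city); with noisy patterns like "X new Y", this is also where the noise enters.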

SLIDE 17

Limits of Bootstrapping

  • established methodology to increase the coverage of pattern-based information extraction / information retrieval

– cf. http://bootcat.sslmit.unibo.it (to bootstrap corpora for a particular language given a small number of seed words)

  • noise increases with every generation of patterns and instances

  • noise level cannot be reliably measured

– no negative evidence

  • cannot extend the number of relations in the KB
SLIDE 18

KR-based Open IE ?

  • A “universal ontology” is impossible

– Global consistency is like world peace

  • Micro ontologies ?

– Do these scale? Interconnections?

  • Ontological “glass ceiling”

– Limited vocabulary
– Pre-determined predicates
– Coverage restricted to pre-defined relations

SLIDE 19

Open vs. Traditional IE

             Traditional IE                   Open IE
Input:       Corpus + O(R) hand-labeled data  Corpus
Relations:   Specified in advance             Discovered automatically
Extractor:   Relation-specific                Relation-independent

Etzioni, University of Washington

How is Open IE Possible?

SLIDE 20

Open IE: TextRunner (2007)

  • Extractor

– a single pass over all documents, POS-tagging, NP chunking
– for each pair of NPs that are not too far apart,* apply a classifier to determine whether or not to extract a relationship

  • several other constraints apply, as well
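The single-pass candidate generation can be sketched like this. The toy chunker, the tag set, and the gap threshold are simplifications; in TextRunner, a trained classifier then decides whether to keep each candidate.

```python
def np_chunks(tokens, tags):
    """Toy NP chunker: maximal runs of determiner/noun tags,
    returned as (start, end, text) spans."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["."]):        # sentinel closes the last run
        if tag in ("DT", "NNP", "NN", "NNS"):
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i, " ".join(tokens[start:i])))
            start = None
    return spans

def candidates(tokens, tags, max_gap=4):
    """For each pair of nearby NPs, propose (arg1, relation, arg2);
    a classifier would then decide whether to extract it."""
    spans = np_chunks(tokens, tags)
    out = []
    for (s1, e1, a1), (s2, e2, a2) in zip(spans, spans[1:]):
        gap = tokens[e1:s2]
        if 0 < len(gap) <= max_gap:               # NPs not too far apart
            out.append((a1, " ".join(gap), a2))
    return out

tokens = "Einstein was born in Ulm".split()
tags = ["NNP", "VBD", "VBN", "IN", "NNP"]
print(candidates(tokens, tags))
# [('Einstein', 'was born in', 'Ulm')]
```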
SLIDE 21

Open IE: TextRunner (2007)

  • Self-Supervised Classifier

– generate training examples for extraction
– using several heuristic constraints, automatically label the training set as trustworthy or untrustworthy (positive and negative examples)
– the classifier is trained on these examples

  • main feature: part of speech tags
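A minimal sketch of the self-supervision step: heuristics, not humans, label candidate tuples as positive or negative training examples. The concrete constraints below are illustrative, not the exact ones used by TextRunner.

```python
def heuristic_label(arg1, rel, arg2, rel_tags):
    """True = trustworthy, False = untrustworthy (illustrative rules)."""
    if not any(t.startswith("VB") for t in rel_tags):
        return False                  # relation phrase should contain a verb
    if len(rel.split()) > 5:
        return False                  # overly long relation phrases are noisy
    if arg1.lower() in ("it", "this", "that"):
        return False                  # pronoun argument names no entity
    return True

examples = [
    ("Einstein", "was born in", "Ulm", ["VBD", "VBN", "IN"]),
    ("it", "was born in", "Ulm", ["VBD", "VBN", "IN"]),
    ("Berlin", "which according to some accounts near", "Potsdam",
     ["WDT", "VBG", "IN", "DT", "NNS", "IN"]),
]
labels = [heuristic_label(*ex) for ex in examples]
print(labels)
# [True, False, False]
```

A classifier (with POS tags as its main features) trained on such automatically labeled examples can then judge candidates the heuristics never saw.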
SLIDE 22

Number of Relations

DARPA MR Domains              <50
NYU, Yago                     <100
NELL                          ~500
DBpedia 3.2                   940
PropBank                      3,600
VerbNet                       5,000
Wikipedia InfoBoxes (f > 10)  ~5,000
TextRunner (phrases)          100,000+
ReVerb (phrases)              1,000,000+

Etzioni, University of Washington

SLIDE 23

Relation Phrases

invented, acquired by, has a PhD in, denied, voted for, inhibits tumor growth in, inherited, born in, mastered the art of, downloaded, aspired to, is the patron saint of, expelled, arrived from, wrote the book on

Etzioni, University of Washington

SAMPLE OF EXTRACTED RELATIONS

SLIDE 24

Relation Phrases

Etzioni, University of Washington

SLIDE 25

Open IE: TextRunner (2007)

  • Cleaning up relations

– Unsupervised, probabilistic synonym detection

  • P(Bill Clinton = President Clinton)

– Count shared (relation, arg2)

  • P(acquired = bought)

– Relations: count shared (arg1, arg2)

Etzioni, University of Washington
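The count-based synonym intuition can be sketched with a Jaccard overlap over shared (relation, arg2) contexts. This is a toy stand-in for TextRunner's probabilistic synonym model, on illustrative tuples:

```python
from collections import defaultdict

tuples = [
    ("Bill Clinton", "was elected", "president"),
    ("President Clinton", "was elected", "president"),
    ("Bill Clinton", "vetoed", "the bill"),
    ("President Clinton", "vetoed", "the bill"),
    ("Einstein", "was born in", "Ulm"),
]

# Collect, for every arg1 string, the set of (relation, arg2) contexts.
contexts = defaultdict(set)
for a1, rel, a2 in tuples:
    contexts[a1].add((rel, a2))

def synonym_score(x, y):
    """Jaccard overlap of shared (relation, arg2) contexts."""
    cx, cy = contexts[x], contexts[y]
    return len(cx & cy) / len(cx | cy) if cx | cy else 0.0

print(synonym_score("Bill Clinton", "President Clinton"))  # 1.0
print(synonym_score("Bill Clinton", "Einstein"))           # 0.0
```

The same idea works for relation synonyms (acquired = bought) by counting shared (arg1, arg2) pairs instead.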

SLIDE 26

OpenIE: TextRunner (2007)

  • 1. Single Pass Extractor
  • 2. Self-Supervised Classifier
  • 3. Synonym Resolution
  • 4. Query Interface

Evaluation (2007):

9 million Web documents => 7.8 million well-formed tuples

randomly selected subset (400 tuples):

  • 80.4% deemed correct by human judges

Yates et al. (2007), TextRunner: Open Information Extraction on the Web, NAACL-HLT 2007.

SLIDE 27

TextRunner (2007)

  • First Web-scale Open IE system
  • Distant supervision + CRF models of relations (Arg1, Relation phrase, Arg2)
  • 1,000,000,000 distinct extractions

Etzioni, University of Washington

SLIDE 28

Relation Extraction from Web

Etzioni, University of Washington

SLIDE 29

TextRunner Extractions

born_in(Einstein, Ulm)
headquartered_in(Microsoft, Redmond)
founded_in(Microsoft, 1973)
born_in(Bill Gates, Seattle)
founded_in(Google, 1998)
headquartered_in(Google, Mountain View)
born_in(Sergey Brin, Moscow)
founded_in(Microsoft, Albuquerque)
born_in(Einstein, March)
born_in(Sergey Brin, 1973)

SLIDE 30

Systems for Open IE (2012)

  • ReVerb (http://reverb.cs.washington.edu, EMNLP 2011)

– improves the original TextRunner implementation by specifying constraints

  • restricting possible POS patterns
  • keep only patterns that have multiple different instances

=> basis of current TextRunner (http://openie.cs.washington.edu)
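ReVerb's POS-pattern constraint can be approximated as a regular expression over tag sequences. The tag classes below are a simplified rendering of the paper's "verb, optionally followed by words, ending in a preposition" (V | V W* P) pattern:

```python
import re

# Simplified tag classes; the real constraint is richer.
VERB = r"VB[DGNPZ]?"
WORD = r"(?:NN|JJ|RB|PRP|VB[DGNPZ]?)"
PREP = r"(?:IN|TO|RP)"
RELATION = re.compile(rf"^{VERB}(?: {WORD})*(?: {PREP})?$")

def is_relation_phrase(tags):
    """Does this POS-tag sequence look like a relation phrase?"""
    return bool(RELATION.match(" ".join(tags)))

print(is_relation_phrase(["VBD", "VBN", "IN"]))  # "was born in" -> True
print(is_relation_phrase(["NN", "IN"]))          # "book of"    -> False
```

Together with the lexical constraint (keep only patterns seen with multiple distinct argument pairs), this filters out incoherent and uninformative extractions.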

  • Ollie (http://knowitall.github.io/ollie, EMNLP 2012)

– Parser-based, not POS-based

  • Verbs → Nouns and more
  • Analyze context (beliefs, counterfactuals)

But what about entities, types, ontologies?

Etzioni, University of Washington

SLIDE 31

Open IE Applications

  • Definition and goals
  • Open IE
  • Applications

– Search & Q/A
– KB construction
– Reasoning

  • Structured Knowledge

– Entities, types, ontologies
– OpenIE + LOD

SLIDE 32

Novel search paradigms

“Moving Up the Information Food Chain”

Retrieval → Extraction
Snippets, docs → Entities, Relations
Keyword queries → Questions
List of docs → Answers

Essential for smartphones! (Siri meets Watson)

Etzioni, University of Washington

SLIDE 33

Case Study over Yelp Reviews

  • 1. Map review corpus to (attribute, value)

(sushi = fresh) (parking = free)

  • 2. Natural-language queries

“Where’s the best sushi in Seattle?”

  • 3. Sort results via sentiment analysis

exquisite > very good > so-so

Etzioni, University of Washington
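The three steps of the Yelp case study can be sketched on toy data; the reviews, the extraction pattern, and the sentiment scale below are all illustrative:

```python
import re

# 3. A tiny hand-made sentiment scale (illustrative).
SENTIMENT = {"exquisite": 3, "very good": 2, "so-so": 1}

# 1. Map review text to (attribute, value) pairs.
PAIR = re.compile(r"the (\w+) (?:was|is) (exquisite|very good|so-so)")

reviews = {
    "Sushi Aoki": "the sushi was exquisite and the parking is free",
    "Ocean Roll": "the sushi was so-so",
    "Ginza":      "the sushi was very good",
}

def query(attribute):
    """2. Answer an attribute query, 3. ranked by sentiment."""
    hits = []
    for restaurant, text in reviews.items():
        for attr, value in PAIR.findall(text):
            if attr == attribute:
                hits.append((restaurant, value))
    return sorted(hits, key=lambda h: SENTIMENT.get(h[1], 0), reverse=True)

print(query("sushi"))
# [('Sushi Aoki', 'exquisite'), ('Ginza', 'very good'), ('Ocean Roll', 'so-so')]
```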

SLIDE 34

RevMiner: Interface to 400K Yelp Reviews (http://revminer.com)

Etzioni, University of Washington

SLIDE 35

Knowledge Base construction

KnowItAll/TextRunner (http://openie.cs.washington.edu)

  • 94M Rel-grams: n-grams, but over relations in text

  • 600K Relation phrases
  • Relation Meta-data:

– 50K Domain/range for relations
– 10K Functional relations
– 30K Horn clauses
– 10M entailment rules

Etzioni, University of Washington

SLIDE 36

Knowledge Bases → Type Inference

SLIDE 37

Examples of Learned Domain/range

  • elect(Country, Person)
  • predict(Expert, Event)
  • download(People, Software)
  • invest(People, Assets)
  • Was-born-in(Person, Location OR Date)

Etzioni, University of Washington
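Learned domain/range meta-data can be used to type-check extractions, e.g., to reject the erroneous born_in(Einstein, March) tuple seen earlier. The type inventory and the checker below are illustrative:

```python
# Relation signatures following the slide's examples; a range may
# allow several types, as in born_in(Person, Location OR Date).
SIGNATURES = {
    "elect":   ("Country", "Person"),
    "predict": ("Expert", "Event"),
    "born_in": ("Person", ("Location", "Date")),
}

# Toy entity typing (a real system would use NER or a KB lookup).
ENTITY_TYPES = {
    "France": "Country", "Macron": "Person",
    "Einstein": "Person", "Ulm": "Location", "March": "Month",
}

def well_typed(arg1, rel, arg2):
    """Does the extraction respect the relation's domain/range?"""
    if rel not in SIGNATURES:
        return True                       # no constraint learned
    dom, rng = SIGNATURES[rel]
    allowed = rng if isinstance(rng, tuple) else (rng,)
    return ENTITY_TYPES.get(arg1) == dom and ENTITY_TYPES.get(arg2) in allowed

print(well_typed("France", "elect", "Macron"))     # True
print(well_typed("Einstein", "born_in", "March"))  # False: Month not allowed
```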

SLIDE 38

Adding Structured Knowledge

  • Definition and goals
  • Open IE

– TextRunner

  • Applications
  • Structured Knowledge

– Entities, types, ontologies
– OpenIE + LOD

SLIDE 39

Entities, types, ontologies

  • After beating the Heat, the Celtics are now the “top dog” in the NBA.

  • (the Celtics, beat, the Heat)

=> Ontologies to facilitate interpretation/disambiguation/reasoning

NER: recognize entity type => type-specific treatment

SLIDE 40

Entities, types, ontologies

  • After beating the Heat, the Celtics are now the “top dog” in the NBA.

  • (the Celtics, beat, the Heat)

=> Ontologies to facilitate interpretation/disambiguation/reasoning

NER: recognize entity type => type-specific treatment

Related domain: Machine Translation

“Recentemente”, conferma Maria Serena Balestracci, “mi ha telefonato un signore da Bologna, che aveva sentito parlare del libro alla radio”
(“Recently”, Maria Serena Balestracci confirms, “a gentleman from Bologna phoned me who had heard about the book on the radio”)

statistical MT (Google Translate): “... a gentleman from London ...” [Oct. 2010]
statistical MT + Freebase (Google Knowledge Graph): “... a gentleman from Bologna ...” [July 2013]

SLIDE 41

TextRunner pipeline (Etzioni, University of Washington)

Input: Web corpus
Extractor → raw tuples (relation-independent extraction)
Assessor → extractions (synonym detection, confidence)
Index in Lucene; link entities
Query processor (DEMO; i.e., ReVerb)

Raw tuples:
(XYZ Corp.; acquired; Go Inc.)
(oranges; contain; Vitamin C)
(Einstein; was born in; Ulm)
(XYZ; buyout of; Go Inc.)
(Albert Einstein; born in; Ulm)
(Einstein Bros.; sell; bagels)

Entity linking: XYZ Corp. = XYZ; Albert Einstein = Einstein != Einstein Bros.

Assessed extractions:
Acquire(XYZ Corp., Go Inc.) [7]
BornIn(Albert Einstein, Ulm) [5]
Sell(Einstein Bros., bagels) [1]
Contain(oranges, Vitamin C) [1]

SLIDE 42

Ontologies

Open IE + entity types and synonym detection. But still:

  • Lack of formal ontology/vocabulary
  • Inconsistent extractions
  • Reasoning?
SLIDE 43

Open IE-based Reasoning

  • Sherlock: Extract Horn clauses from running text

– http://ai.cs.washington.edu/www/media/downloadable/media/sherlockrules.zip

require(process_A, product_B) :- produce(process_A, product_C), be make from(product_B, product_C)   [Score 144.0]
require(process_A, product_B) :- produce(process_A, product_C), make(product_B, product_C)           [Score 114.5]
require(process_A, product_B) :- produce(process_A, product_C), to make(product_C, product_B)        [Score 40.8]
require(process_A, product_B) :- produce(process_A, product_C), to make(product_B, product_C)        [Score 37.8]

SLIDE 44

Open IE-based Reasoning

  • Sherlock: Extract Horn clauses from running text

– http://ai.cs.washington.edu/www/media/downloadable/media/sherlockrules.zip

require(process_A, product_B) :- produce(process_A, product_C), be make from(product_B, product_C)   [Score 144.0]
require(process_A, product_B) :- produce(process_A, product_C), make(product_B, product_C)           [Score 114.5]
require(process_A, product_B) :- produce(process_A, product_C), to make(product_C, product_B)        [Score 40.8]
require(process_A, product_B) :- produce(process_A, product_C), to make(product_B, product_C)        [Score 37.8]

Possible, but quality and coverage are limited. => Augment with other resources.
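Applying one learned Horn clause by forward chaining can be sketched as follows. The rule mirrors the top-scoring Sherlock rule above (the lemmatized relation string "be make from" is kept verbatim); the facts and the tiny matcher are illustrative:

```python
# require(A, B) :- produce(A, C), be make from(B, C)
RULE = {
    "head": "require",
    "body": ("produce", "be make from"),
}

# Illustrative extracted facts as (relation, arg1, arg2) tuples.
facts = {
    ("produce", "photosynthesis", "oxygen"),
    ("be make from", "ozone", "oxygen"),
}

def apply_rule(rule, facts):
    """Forward chaining for one two-literal Horn clause:
    join the body literals on the shared variable C."""
    inferred = set()
    r1, r2 = rule["body"]
    for (p, a, c1) in facts:
        if p != r1:
            continue
        for (q, b, c2) in facts:
            if q == r2 and c1 == c2:
                inferred.add((rule["head"], a, b))
    return inferred

print(apply_rule(RULE, facts))
# {('require', 'photosynthesis', 'ozone')}
```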

SLIDE 45

Open IE + Domain Knowledge

http://conceptnet5.media.mit.edu

BUT: uses a proprietary format => limited interoperability; reasoning?

SLIDE 46

OpenIE + Linked Open Data

  • Linked Open Output:

– Extractions → Linked Open Data (LOD) cloud
– Relation normalization
– Use LOD best practices

  • Specialized reasoners

– OWL/DL data models for LOD data

  • Still a desideratum

Etzioni, University of Washington

SLIDE 47

Linked Open Data

  • Interoperability, distributed authorship, vs. a monolithic system

  • Open IE meets RDF:

– Need URIs for predicates. How to obtain them?
– What about errors in mapping to URIs?
– Ambiguity? Uncertainty?

Etzioni, University of Washington
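Minting URIs for an Open IE tuple can be sketched as below. The namespace and slug scheme are assumptions for illustration; as the slide notes, real mappings must additionally deal with ambiguity and mapping errors.

```python
from urllib.parse import quote

BASE = "http://example.org/openie/"       # assumed, illustrative namespace

def to_uri(text, kind):
    """Mint a URI for an extracted string (naive slug scheme)."""
    slug = quote(text.strip().lower().replace(" ", "_"))
    return f"<{BASE}{kind}/{slug}>"

def to_ntriple(arg1, rel, arg2):
    """Serialize one extraction as an N-Triples line."""
    return f"{to_uri(arg1, 'entity')} {to_uri(rel, 'relation')} {to_uri(arg2, 'entity')} ."

print(to_ntriple("Albert Einstein", "was born in", "Ulm"))
# <http://example.org/openie/entity/albert_einstein> <http://example.org/openie/relation/was_born_in> <http://example.org/openie/entity/ulm> .
```

In practice one would link the minted URIs to existing LOD identifiers (e.g., DBpedia resources) rather than keep a private namespace.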

SLIDE 48

Context matters

  • If he wins 5 key states, Romney will be president

  • (counterfactual: “if he wins 5 key states”)

Discourse relations (next week)

SLIDE 49

References & Further Reading

  • Information Extraction

– Jurafsky & Martin (2009), §22.1-22.3
– Carstensen et al. (2010), §5.3.3-5.3.4

  • Deep Machine Reading

– Ken Barker et al. (2007), Learning by reading: A prototype system, performance baseline and lessons learned, In: Proceedings of 22nd National Conference on Artificial Intelligence (AAAI-07), Vancouver, BC.

  • Shallow Machine Reading / OpenIE (TextRunner)

– Poon, Hoifung et al. (2010), Machine Reading at the University of Washington, In: Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading, Los Angeles, California, 87-95
– Anthony Fader, Stephen Soderland, and Oren Etzioni (2011), Identifying Relations for Open Information Extraction, EMNLP 2011.

  • FYI: Alternative Approaches in Machine Reading

– Anselmo Penas, Eduard Hovy (2010), Filling Knowledge Gaps in Text for Machine Reading, COLING 2010

  • an alternative strategy to build background knowledge bases for deep machine reading