HANDLING UNCERTAINTY IN INFORMATION EXTRACTION Maurice van Keulen

HANDLING UNCERTAINTY IN INFORMATION EXTRACTION
Maurice van Keulen and Mena Badieh Habib
URSW, 23 Oct 2011

INFORMATION EXTRACTION
Information extraction: Unstructured Text → Web of Data
Inherently imperfect process
Word “Paris” (source: GeoNames)
• First name? City? …
• City => over 60 cities “Paris”
Toponyms: 46% have >2 references
“We humans happily deal with doubt and misinterpretation every day; why shouldn’t computers?”
Goal: Technology to support the development of domain-specific information extractors
Uncertainty Reasoning for the Semantic Web, Bonn, Germany, 23 Oct 2011

SHERLOCK HOLMES-STYLE INFORMATION EXTRACTION
“when you have eliminated the impossible, whatever remains, however improbable, must be the truth”
Information extraction is about gathering enough evidence to decide upon a certain combination of annotations among many possible ones. Evidence comes from ML + developer (generic) + end user (instances).
• Annotations are uncertain
  Maintain alternatives + probabilities throughout the process (incl. result)
• Unconventional starting point
  Not “no annotations”, but “no knowledge, hence anything is possible”
• Developer interactively defines information extractor until “good enough”
  Iterations: add knowledge, apply to sample texts, evaluate result
• Scalability for storage, querying, manipulation of annotations
  From my own field (databases): probabilistic databases?

SHERLOCK HOLMES-STYLE INFORMATION EXTRACTION
EXAMPLE: NAMED ENTITY RECOGNITION (NER)
“when you have eliminated the impossible, whatever remains, however improbable, must be the truth”
[Figure: the sentence “Paris Hilton stayed in the Paris Hilton” with candidate annotations a1 to a28 over its phrases; entity types Person, Toponym and City; relations labelled “dnc” and “isa”; marked “interactively defined”.]

SHERLOCK HOLMES-STYLE INFORMATION EXTRACTION
EXAMPLE: NAMED ENTITY RECOGNITION (NER)
[Figure: the sentence “Paris Hilton stayed in the Paris Hilton” with candidate annotations a1 to a28.]
• |A| = O(klt): linear?!?
  k: length of string
  l: maximum length of phrases considered
  t: number of entity types
• Here: 28 * 3 = 84 possible annotations
• URSW call for papers: about 1300 words, say 20 types, say max length 6 (I saw one with 5)
  = roughly 1300 * 20 * 6 = roughly 156,000 possible annotations
Although conceptual/theoretical, it doesn’t seem to be a severe challenge for a probabilistic database. The problem is not in the amount of alternative annotations!
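The counts on this slide can be reproduced with a few lines of Python. This is only a sketch under one assumption: a candidate annotation is taken to be a contiguous phrase of at most l words paired with one entity type. The function name `count_candidate_annotations` is made up for illustration and does not come from the slides.

```python
def count_candidate_annotations(num_words: int, max_phrase_len: int, num_types: int) -> int:
    """Count (phrase, type) pairs for contiguous phrases of up to max_phrase_len words."""
    longest = min(num_words, max_phrase_len)
    # A text of k words contains k - n + 1 contiguous phrases of exactly n words.
    num_phrases = sum(num_words - n + 1 for n in range(1, longest + 1))
    return num_phrases * num_types


sentence = "Paris Hilton stayed in the Paris Hilton".split()

# 7 words, phrases of any length, 3 entity types (Person, Toponym, City).
small = count_candidate_annotations(len(sentence), len(sentence), 3)

# Call-for-papers estimate: 1300 words, phrases of at most 6 words, 20 types.
large = count_candidate_annotations(1300, 6, 20)

print(small)  # 84
print(large)  # 155700
```

The exact count for the larger text, 155,700, sits just below the slide's rough bound of 1300 * 20 * 6 = 156,000, because phrases cannot run past the end of the text.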

ADDING KNOWLEDGE = CONDITIONING
Paris Hilton stayed in the Paris Hilton
a and b independent, P(a)=0.6, P(b)=0.8:
  ∅         0.08
  a ∧¬ b    0.12
  b ∧¬ a    0.32
  a ∧ b     0.48
Person --- dnc --- City: x 1 2 (“Paris” is a City) [a] and x 8 1 (“Paris Hilton” is a Person) [b] become mutually exclusive
a and b mutually exclusive (a ∧ b is not possible):
  ∅         0.15
  a ∧¬ b    0.23
  b ∧¬ a    0.62
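The two tables are connected by one renormalisation step, worked out here. The normalising constant Z is notation introduced only for this derivation; everything else uses the slide's numbers.

```latex
\begin{align*}
% Independent prior with P(a) = 0.6 and P(b) = 0.8
P(\neg a \wedge \neg b) &= 0.4 \times 0.2 = 0.08, &
P(a \wedge \neg b) &= 0.6 \times 0.2 = 0.12, \\
P(\neg a \wedge b) &= 0.4 \times 0.8 = 0.32, &
P(a \wedge b) &= 0.6 \times 0.8 = 0.48. \\[6pt]
% Conditioning on the new knowledge \neg(a \wedge b)
Z &= 1 - P(a \wedge b) = 0.52, &
P\bigl(w \mid \neg(a \wedge b)\bigr) &= \frac{P(w)}{Z}, \\[6pt]
\frac{0.08}{0.52} &\approx 0.15, \qquad
\frac{0.12}{0.52} \approx 0.23, &
\frac{0.32}{0.52} &\approx 0.62.
\end{align*}
```

The probability mass of the eliminated world (0.48) is redistributed proportionally over the three remaining worlds, so their relative odds stay the same.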

ADDING KNOWLEDGE CREATES DEPENDENCIES
NUMBER OF DEPS MAGNITUDES IN SIZE SMALLER THAN POSSIBLE COMBINATIONS
[Figure: annotations over “Paris Hilton stayed” with “dnc” and “neq” rules between Person and City annotations; dependency counts 8 + 8 + 15.]

PROBLEM AND SOLUTION DIRECTIONS
I’m looking for a scalable approach to reason and redistribute probability mass considering all these dependencies, in order to find the remaining possible interpretations and their probabilities.
• Feasibility of the approach hinges on efficient representation and conditioning of probabilistic dependencies
• Solution directions (in my own field):
  • Koch et al., VLDB 2008 (Conditioning in MayBMS)
  • Getoor et al., VLDB 2008 (Shared correlations)
• This is not only about learning a joint probability distribution. Here I’d like to estimate a joint probability distribution based on initial independent observations and then, batch by batch, add constraints/dependencies and recalculate
• Techniques out there that fit this problem?
Questions / Suggestions?
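To make the batch-by-batch idea concrete, here is a naive baseline in Python. It is a sketch and not the scalable technique the slide asks for: it enumerates every possible world, which is exponential in the number of annotations. The names `independent_prior` and `add_batch`, the third variable c, and the second constraint are all invented for illustration; only P(a)=0.6, P(b)=0.8 and the mutual exclusion of a and b come from the slides.

```python
from itertools import product


def independent_prior(marginals):
    """Joint distribution over all truth assignments, assuming independence."""
    worlds = {}
    for values in product([False, True], repeat=len(marginals)):
        p = 1.0
        for p_true, value in zip(marginals, values):
            p *= p_true if value else 1.0 - p_true
        worlds[values] = p
    return worlds


def add_batch(worlds, constraints):
    """Drop worlds that violate any constraint in the batch, then renormalise."""
    kept = {w: p for w, p in worlds.items() if all(c(w) for c in constraints)}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("constraints eliminate every interpretation")
    return {w: p / total for w, p in kept.items()}


# Index 0 = a, 1 = b (from the slides); index 2 = c is a hypothetical third annotation.
prior = independent_prior([0.6, 0.8, 0.5])

batch_1 = [lambda w: not (w[0] and w[1])]  # a and b are mutually exclusive
batch_2 = [lambda w: w[1] or w[2]]         # hypothetical: at least one of b, c holds

step_by_step = add_batch(add_batch(prior, batch_1), batch_2)
all_at_once = add_batch(prior, batch_1 + batch_2)

for world, p in sorted(step_by_step.items()):
    print(world, round(p, 3))
```

Conditioning in two batches gives the same distribution as conditioning on all constraints at once, which is what makes incremental recalculation well defined. The open question on the slide is how to get the same result without materialising every world.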
