To change More textrunner, more pattern learning Reorder: Open - PDF document


  1. To change
     • More textrunner
     • More pattern learning
     • Reorder:
     • Kia start

     Open Information Extraction
     CSE 454, Daniel Weld

     CSE 454 Overview
     • Human Comp
     • Cool UIs (Zoetrope & Revisiting)
     • Adverts
     • Open IE
     • Parsing & POS Tags
     • Web Tables
     • Search Engines
     • Information Extraction
     • Supervised Learning
     • HTTP, HTML, Scaling & Crawling
     • Inverted Indices
     • Cryptography & Security

     Traditional, Supervised I.E.
     Raw Data → Labeled Training Data → Learning Algorithm → Extractor
     Example sentences:
       Kirkland-based Microsoft is the largest software company.
       Boeing moved its headquarters to Chicago in 2003.
       Hank Levy was named chair of Computer Science & Engr.
       …
     Extractor: HeadquarterOf(<company>,<city>)

     What is Open Information Extraction?

  2. What is Open Information Extraction?

     Methods for Open IE
     • Self Supervision
     • Kylin (Wikipedia)
     • Shrinkage & Retraining
     • Temporal Extraction
     • Hearst Patterns
     • PMI Validation
     • Subclass Extraction
     • Pattern Learning
     • Structural Extraction
     • List Extraction & WebTables
     • TextRunner

     Intelligence in Wikipedia Project
     Daniel S. Weld
     Department of Computer Science & Engineering
     University of Washington, Seattle, WA, USA
     Joint Work with … Fei Wu, Raphael Hoffmann, Stef Schoenmackers, Eytan Adar, Saleema Amershi, Oren Etzioni, James Fogarty, Chloe Kiddon, Shawn Ling & Kayur Patel

     The Motivating Vision
     Next-Generation Search = Information Extraction + Ontology + Inference
     Which German Scientists Taught at US Universities?
     … Einstein was a guest lecturer at the Institute for Advanced Study in New Jersey …

     Next-Generation Search
     • Information Extraction
       <Einstein, Born-In, Germany>
       <Einstein, ISA, Physicist>
       <Einstein, Lectured-At, IAS>
       <IAS, In, New-Jersey>
       <New-Jersey, In, United-States>
       …
     • Ontology
       Physicist(x) ⇒ Scientist(x) …
     • Inference
       Einstein = Einstein …
     Scalable Means Self-Supervised

     Why Wikipedia?
     • Comprehensive
     • High Quality [Giles Nature 05]
     • Useful Structure
       Unique IDs & Links
       Infoboxes
       Categories & Lists
       First Sentence
       Redirection pages
       Disambiguation pages
       Revision History
       Multilingual
     Cons
       Natural-Language
       Missing Data
       Inconsistent
       Low Redundancy
     Comscore MediaMetrix – August 2007

  3. Kylin: Self-Supervised Information Extraction from Wikipedia [Wu & Weld CIKM 2007]
     From infoboxes to a training set
     Example article text:
       Clearfield County was created in 1804 from parts of Huntingdon and Lycoming Counties but was administered as part of Centre County until 1812. Its county seat is Clearfield. 2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water. As of 2005, the population density was 28.2/km².

     Kylin Architecture

     The Precision / Recall Tradeoff
     • Precision: proportion of selected items that are correct
       Precision = tp / (tp + fp)
     • Recall: proportion of target items that were selected
       Recall = tp / (tp + fn)
     • Precision-Recall curve: shows tradeoff
     [Figure: Correct Tuples vs. Tuples returned by System, with regions tp, fp, fn, tn; Precision-Recall plot with AuC]

     Preliminary Evaluation
     • Kylin Performed Well on Popular Classes:
       Precision: mid 70% ~ high 90%
       Recall: low 50% ~ mid 90%
     • ... Floundered on Sparse Classes – Little Training Data
       82% < 100 instances; 40% < 10 instances

     Shrinkage?
     [Figure: class hierarchy with instance counts: person (1201), performer (44), actor (8738), comedian (106); attribute names: .birth_place, .location, .birthplace, .birth_place, .cityofbirth, .origin]

     Long-Tail 2: Incomplete Articles
     • Desired Information Missing from Wikipedia
       800,000/1,800,000 (44.2%) stub pages [July 2007 of Wikipedia]
     [Figure: axes Length and ID]

  4. KOG: Kylin Ontology Generator [Wu & Weld, WWW08]

     Subsumption Detection
     • Binary Classification Problem
     • Nine Complex Features
       E.g., String Features
       … IR Measures
       … Mapping to Wordnet
       … Hearst Pattern Matches
       … Class Transitions in Revision History
     • Learning Algorithm
       SVM & MLN Joint Inference
     [Figure: Person, Scientist, Physicist; label "6/07: Einstein"]

     KOG Architecture

     Schema Mapping
       Performer      Person
       birth_date     birthdate
       birth_place    location
       name           name
       other_names    othername
       …              …
     • Heuristics
     • Edit History
     • String Similarity
     • Experiments
     • Precision: 94% Recall: 87%
     • Future
     • Integrated Joint Inference

     Improving Recall on Sparse Classes [Wu et al. KDD-08]
     • Shrinkage
     • Extra Training Examples from Related Classes
     • How Weight New Examples?
     [Figure: class hierarchy with instance counts: person (1201), performer (44), actor (8738), comedian (106); attribute names: .birth_place, .location, .birthplace, .birth_place, .cityofbirth, .origin]

  5. Improvement due to Shrinkage

     Improving Recall on Sparse Classes [Wu et al. KDD-08]
     Retraining
     • Compare Kylin Extractions with Tuples from TextRunner
     • Additional Positive Examples
     • Eliminate False Negatives
     TextRunner [Banko et al. IJCAI-07, ACL-08]
     • Relation-Independent Extraction
     • Exploits Grammatical Structure
     • CRF Extractor with POS Tag Features

     Recall after Shrinkage / Retraining…

     Improving Recall on Sparse Classes [Wu et al. KDD-08]
     • Shrinkage
     • Retraining
     • Extract from Broader Web
     • 44% of Wikipedia Pages = “stub”
     • Extractor quality irrelevant
     • Query Google & Extract
     • How maintain high precision?
     • Many Web pages noisy, describe multiple objects
     • How integrate with Wikipedia extractions?

     Bootstrapping to the Web

     Main Lesson: Self Supervision
     • Find structured data source
     • Use heuristics to generate training data
     • E.g. Infobox attributes & matching sentences

  6. Self-supervised Temporal Extraction
     • Goal
       Extract: happened(recognizes(UK, China), 1/6/1950)

     Other Sources
     • Google News Archives

     Methods for Open IE
     • Self Supervision
     • Kylin (Wikipedia)
     • Shrinkage & Retraining
     • Temporal Extraction
     • Hearst Patterns
     • PMI Validation
     • Subclass Extraction
     • Pattern Learning
     • Structural Extraction
     • List Extraction & WebTables
     • TextRunner

     The KnowItAll System
     [Figure: system diagram with these labels]
       Predicates: Country(X)
       Domain-independent Rule Templates: <class> “such as” NP
       Bootstrapping
       Extraction Rules: “countries such as” NP
       Discriminators: “country X”
       World Wide Web
       Extractor
       Extractions: Country(“France”)
       Assessor
       Validated Extractions: Country(“France”), prob=0.999

     Unary predicates: instances of a class
     Unary predicates: instanceOf(City), instanceOf(Film), instanceOf(Company), …
     Good recall and precision from generic patterns:
       <class> “such as” X
       X “and other” <class>
     Instantiated rules:
       “cities such as” X        X “and other cities”
       “films such as” X         X “and other films”
       “companies such as” X     X “and other companies”

     Recall – Precision Tradeoff
     High precision rules apply to only a small percentage of sentences on Web
       hits for    “X”            “cities such as X”    “X and other cities”
       Boston      365,000,000    15,600,000            12,000
       Tukwila     1,300,000      73,000                44
       Gjatsk      88             34                    0
       Hadaslav    51             1                     0
     “Redundancy-based extraction” ignores all but the unambiguous references.
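The generic patterns and instantiated rules on the KnowItAll slides can be illustrated with a small sketch. It fills the class slot of the two templates, `<class> “such as” X` and `X “and other” <class>`, and applies the resulting rules to a sentence. The names `TEMPLATES`, `NP`, `instantiate_rules` and `extract` are hypothetical, and the capitalized-word regex is only an assumed stand-in for a real noun-phrase chunker; this is not KnowItAll's actual implementation.

```python
import re

# The two generic templates from the slides; {cls} is the plural class name.
TEMPLATES = ("{cls} such as {np}", "{np} and other {cls}")

# Assumption: a run of capitalized words approximates the NP / X slot.
NP = r"(?P<np>[A-Z][\w-]*(?: [A-Z][\w-]*)*)"


def instantiate_rules(plural_class):
    """Fill the class slot of each template and compile the result."""
    cls = re.escape(plural_class)
    return [re.compile(t.format(cls=cls, np=NP)) for t in TEMPLATES]


def extract(text, plural_class):
    """Return candidate instances, grouped by rule and then by text order."""
    candidates = []
    for rule in instantiate_rules(plural_class):
        candidates.extend(m.group("np") for m in rule.finditer(text))
    return candidates


text = ("He toured cities such as Boston last year. "
        "Tukwila and other cities were skipped.")
print(extract(text, "cities"))
```

Rules this strict match few sentences, which is the tradeoff the hit-count table shows: precision is high, and recall depends on the Web being large enough to contain an unambiguous mention of each instance.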
