Learning to Extract Entities from Labeled and Unlabeled Text
Rosie Jones
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
May 5th, 2005

Extracting Information from Text
Yesterday Rio de Janeiro was chosen as the new site for Arizona Building Inc. headquarters. Production will continue in Mali where Jaco Kumalo first founded it in 1987. Arizona rose 2.5% in after hours trading.

Extracting Information from Text
Yesterday [Location: Rio de Janeiro] was chosen as the new site for [Company: Arizona Building Inc.] headquarters. Production will continue in [Location: Mali] where [Person: Jaco Kumalo] first founded [Company: it] in 1987. [Company: Arizona] rose 2.5% in after hours trading.

Information Extraction
• Set of rules for extracting words or phrases from sentences:
  extract(X) if $p(\text{location} \mid X, \text{context}(X)) > \tau$
– "hotel in paris": X = "paris", context(X) = "hotel in"
– "paris hilton": X = "paris", context(X) = "hilton"
– $p_{\text{location}}(\text{"paris"}) = 0.5$
– $p_{\text{location}}(\text{"hilton"}) = 0.01$
– $p_{\text{location}}(\text{"hotel in"}) = 0.9$
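The thresholded rule on this slide can be sketched in a few lines. The three probabilities are the ones given on the slide. The slide does not say how the noun-phrase and context estimates are combined or what value τ takes, so the product and the threshold `TAU = 0.4` below are assumptions for illustration only.

```python
# Probabilities taken from the slide; everything else is a hypothetical sketch.
P_LOCATION = {
    "paris": 0.5,
    "hilton": 0.01,
    "hotel in": 0.9,
}

TAU = 0.4  # assumed threshold; the slide leaves tau unspecified


def should_extract(x, context, tau=TAU):
    """Return True if X would be extracted as a location.

    Assumption: the joint estimate is the product of the noun-phrase
    estimate and the context estimate.
    """
    score = P_LOCATION[x] * P_LOCATION[context]
    return score > tau


if __name__ == "__main__":
    print(should_extract("paris", "hotel in"))  # "hotel in paris"
    print(should_extract("paris", "hilton"))    # "paris hilton"
```

Under these assumptions "paris" is extracted in "hotel in paris" but not in "paris hilton", because the context estimate dominates the ambiguous noun-phrase estimate.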

Information Extraction II
• Types of Information:
– "Locations"
– "Organizations"
– "People"
– "Products"
– "Job titles"
– ...

Costs of Information Extraction
Data Collection, Labeling Time, Information Verification
[Figure: a Trainable IE System for the question "What companies are hiring for which positions where?". Candidate values shown: IBM? Microsoft? Shell? (companies); CEO? Accountant? (positions); Texas? Mali? Japan? (places). Example relation: Hiring(Yahoo, IR Researcher, Pasadena).]

Costs of Information Extraction
• 3-6 months to port to new domain [Cardie 98]
• 20,000 words required to learn named entity extraction [Seymore et al 99]
• 7000 labeled examples: supervised learning of extraction rules for MUC task [Soderland 99]

Automated IE System Construction
[Figure: the HomeIE system. Inputs: initial suggestions from the user (giraffe, hippo, zebra, lion, bear) and a document collection (WWW, in-house). Training phase: the user provides feedback. Output: trained models for IE, namely a probability distribution over noun-phrases and a probability distribution over contexts.]

Thesis Statement
We can train semantic class extractors from text using minimal supervision in the form of
• seed examples
• actively labeled examples
by exploiting the graph structure of text cooccurrence relationships.

Talk Outline
• Information Extraction
• Data Representation
• Bootstrapping Algorithms: Learning From Almost Nothing
• Understanding the Data: Graph Properties
• Active learning: Effective Use of User Time

Data Representation
[Figure: bipartite graph linking noun-phrases (the dog, australia, france, the canary islands, shares) to lexico-syntactic contexts (<X> ran quickly, travelled to <X>, <X> is pleasant, bought <X>).]

noun-phrase           context
the dog               X ran quickly
the dog               X is pleasant
australia             X is pleasant
shares                bought X
australia             travelled to X
france                travelled to X
the canary islands    travelled to X
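The noun-phrase/context pairs on this slide can be stored as a bipartite graph with one adjacency map per side. The sketch below is a minimal illustration; the names `PAIRS` and `build_graph` are hypothetical and not taken from the talk.

```python
from collections import defaultdict

# (noun-phrase, context) pairs as listed on the slide.
PAIRS = [
    ("the dog", "<X> ran quickly"),
    ("the dog", "<X> is pleasant"),
    ("australia", "<X> is pleasant"),
    ("shares", "bought <X>"),
    ("australia", "travelled to <X>"),
    ("france", "travelled to <X>"),
    ("the canary islands", "travelled to <X>"),
]


def build_graph(pairs):
    """Build both directions of the bipartite cooccurrence graph."""
    np_to_contexts = defaultdict(set)
    context_to_nps = defaultdict(set)
    for noun_phrase, context in pairs:
        np_to_contexts[noun_phrase].add(context)
        context_to_nps[context].add(noun_phrase)
    return np_to_contexts, context_to_nps


if __name__ == "__main__":
    np_to_contexts, context_to_nps = build_graph(PAIRS)
    for context, nps in sorted(context_to_nps.items()):
        print(context, "->", sorted(nps))
```

Keeping both directions makes it cheap to ask which contexts a noun-phrase occurs in and which noun-phrases fill a context, which is what the bootstrapping step needs.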

Information Extraction Approaches
• Hand-constructed
• Supervised learning from many labeled examples
• Semi-supervised learning

The Semi-supervised IE Learning Task
Given:
• A large collection of unlabeled documents
• A small set (10) of nouns representing the target class
Learn:
A set of rules for extracting members of the target class from novel unseen documents (test collection)

Initialization from Seeds
• foreach instance in unlabeled docs
– if matchesSeed(noun-phrase)
– hardlabel(instance) = 1
– else softlabel(instance) = 0
• hardlabel(australia, located-in) = 1
• softlabel(the canary islands, located-in) = 0
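The seed initialization above can be sketched as follows. The slide does not define `matchesSeed`, so exact matching after lowercasing is an assumption here, as are the function names and the dictionary layout used to distinguish hard labels (fixed) from soft labels (revisable).

```python
# Location seeds as listed on the Seeds slide.
LOCATION_SEEDS = {
    "australia", "canada", "china", "england", "france",
    "germany", "japan", "mexico", "switzerland", "united states",
}


def matches_seed(noun_phrase, seeds):
    # Assumption: a seed matches only on exact, case-insensitive equality.
    return noun_phrase.strip().lower() in seeds


def initialize_labels(instances, seeds):
    """Label each (noun-phrase, context) instance.

    Seed matches get a hard label of 1; everything else gets a soft
    label of 0 that later iterations may revise.
    """
    labels = {}
    for noun_phrase, context in instances:
        if matches_seed(noun_phrase, seeds):
            labels[(noun_phrase, context)] = {"score": 1.0, "hard": True}
        else:
            labels[(noun_phrase, context)] = {"score": 0.0, "hard": False}
    return labels


if __name__ == "__main__":
    instances = [
        ("australia", "located-in"),
        ("the canary islands", "located-in"),
    ]
    for instance, label in initialize_labels(instances, LOCATION_SEEDS).items():
        print(instance, label)
```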

Bootstrapping Approach to Semi-supervised Learning
• learn two models:
– noun-phrases: { New York, Timbuktu, China, the place we met last time, the nation’s capitol ... }
– contexts: { located-in <X>, travelled to <X> ... }
• Use redundancy in two models:
– noun-phrases can label contexts
– contexts can label noun-phrases
⇒ bootstrapping
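The mutual labeling idea (noun-phrases label contexts, contexts label noun-phrases) can be illustrated with a deliberately simplified propagation loop. This is not the coEM algorithm from the talk: it uses unweighted averages instead of probabilistic estimates, and the function name, the example pairs, and the iteration count are assumptions made for the sketch.

```python
from collections import defaultdict


def bootstrap(pairs, seeds, iterations=2):
    """Alternate between scoring contexts from noun-phrases and back.

    Simplifying assumption: a score is the plain average of its
    neighbours' scores; seed noun-phrases stay fixed at 1.0.
    """
    np_to_ctx = defaultdict(set)
    ctx_to_np = defaultdict(set)
    for noun_phrase, context in pairs:
        np_to_ctx[noun_phrase].add(context)
        ctx_to_np[context].add(noun_phrase)

    np_score = {n: (1.0 if n in seeds else 0.0) for n in np_to_ctx}
    ctx_score = {}
    for _ in range(iterations):
        # Noun-phrases label contexts.
        for context, nps in ctx_to_np.items():
            ctx_score[context] = sum(np_score[n] for n in nps) / len(nps)
        # Contexts label noun-phrases (seeds keep their hard label).
        for noun_phrase, contexts in np_to_ctx.items():
            if noun_phrase in seeds:
                continue
            np_score[noun_phrase] = (
                sum(ctx_score[c] for c in contexts) / len(contexts)
            )
    return np_score, ctx_score


if __name__ == "__main__":
    pairs = [
        ("the dog", "<X> ran quickly"),
        ("the dog", "<X> is pleasant"),
        ("australia", "<X> is pleasant"),
        ("shares", "bought <X>"),
        ("australia", "travelled to <X>"),
        ("france", "travelled to <X>"),
        ("the canary islands", "travelled to <X>"),
    ]
    np_score, ctx_score = bootstrap(pairs, seeds={"australia"})
    for noun_phrase, score in sorted(np_score.items(), key=lambda kv: -kv[1]):
        print(f"{score:.3f}  {noun_phrase}")
```

Because this sketch has no negative evidence, the scores of everything connected to a seed tend to rise as iterations are added, so it only illustrates the direction of information flow, not a usable stopping criterion.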

Space of Bootstrapping Algorithms
• Incremental (label one-at-a-time) / All at once [Cotraining: Blum & Mitchell, 1998] [coEM: Nigam & Ghani, 2000]
• asymmetric / symmetric
• heuristic / probabilistic
• use knowledge about language / assume nothing about language

Bootstrapping Inputs
• corpus
– 4160 company web pages
– parsed [Riloff 1996] into noun-phrases and contexts (around 200,000 instances)
∗ "Ultramar Diamond Shamrock has a strong network of approximately 4,400 locations in 10 Southwestern states and eastern Canada."
∗ Ultramar Diamond Shamrock - <X> has network
∗ 10 Southwestern states and eastern Canada - locations in <X>

Seeds
• locations: { australia, canada, china, england, france, germany, japan, mexico, switzerland, united states }
• people: { customers, subscriber, people, users, shareholders, individuals, clients, leader, director, customer }
• organizations: { inc., praxair, company, companies, dataram, halter marine group, xerox, arco, rayonier timberlands, puretec }

CoEM for Information Extraction
[Figure: bipartite graph linking noun-phrases (the dog, australia, france, the canary islands, shares) to contexts (<X> ran quickly, travelled to <X>, <X> is pleasant, bought <X>).]

Evaluation
[Figure: coEM produces a noun-phrase model (e.g. Australia .999, Washington 0.52) and a context model (e.g. moved-to <> 0.078, <> ate 0.001). A labeller applies both models to the test examples and assigns each a score; the scored examples are then sorted from highest to lowest score, with 1% marked at the top of the sorted list and 99% at the bottom.]

Test examples with scores:
the dog ate              0.0023
moved to australia       0.9998
washington said          0.156
moved to washington      0.674
...

Evaluation
• $\hat{P}(\text{location} \mid \text{example}) \sim \hat{P}(\text{location} \mid \text{NP}) \cdot \hat{P}(\text{location} \mid \text{context})$ for test collection
• sort test examples by $\hat{P}(\text{location} \mid \text{example})$: 800 cut points for precision-recall calculation
Precision and Recall at each of 800 points:
$$\text{Precision} = \frac{\text{TargetClassRetrieved}}{\text{AllRetrieved}} \qquad \text{Recall} = \frac{\text{TargetClassRetrieved}}{\text{TargetClassInCollection}}$$
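The evaluation procedure (score each test example by the product of its noun-phrase and context estimates, sort, then compute precision and recall at a set of cut points) can be sketched as below. The function name, the tuple layout, and the example data in the usage section are hypothetical; the talk uses 800 cut points, while the toy run uses three.

```python
def precision_recall_curve(examples, cut_points):
    """Compute (cut, precision, recall) at each cut point.

    examples: list of (p_np, p_context, is_target) tuples, where the
    first two entries are the model estimates and is_target is the
    gold label. An example's score is the product p_np * p_context.
    """
    ranked = sorted(examples, key=lambda e: e[0] * e[1], reverse=True)
    target_in_collection = sum(1 for e in ranked if e[2])

    curve = []
    for cut in cut_points:
        retrieved = ranked[:cut]
        target_retrieved = sum(1 for e in retrieved if e[2])
        precision = target_retrieved / len(retrieved) if retrieved else 0.0
        recall = (
            target_retrieved / target_in_collection
            if target_in_collection
            else 0.0
        )
        curve.append((cut, precision, recall))
    return curve


if __name__ == "__main__":
    # Made-up scores and gold labels, for illustration only.
    toy_examples = [
        (0.999, 0.9, True),
        (0.52, 0.9, True),
        (0.52, 0.1, False),
        (0.1, 0.001, False),
    ]
    for cut, precision, recall in precision_recall_curve(toy_examples, [1, 2, 4]):
        print(cut, precision, recall)
```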

Bootstrapping Results
[Figure: precision-recall curves for locations (x-axis: Recall, y-axis: Precision, both 0 to 1). Curves: coem; coem + hand-corrected seed examples; coem + 500 random labeled examples.]

Bootstrapping Results - People
[Figure: precision-recall curves for people (x-axis: Recall, y-axis: Precision, both 0 to 1). Curves: coem; coem + hand-corrected seed examples; coem + 500 random labeled examples.]

Bootstrapping Results - Organizations
[Figure: precision-recall curves for organizations (x-axis: Recall, y-axis: Precision, both 0 to 1). Curves: coem; coem + hand-corrected seed examples; coem + 500 random labeled examples.]

We can Learn Simple Extraction Without Extensive Labeling
• Using just 10 seeds, we learned to extract from an unseen collection of documents
• No significant improvements from hand-correcting these examples
• No significant improvements from adding 500 labeled examples selected uniformly at random
• Did we just get lucky with the seeds?

Random Sets of Seeds Not So Good
[Figure: precision-recall curves for locations under different seed selections, 10 random country names (x-axis: Recall, y-axis: Precision, both 0 to 1). Curves: 10 locations (669 initial); random10 (87 initial); random10 (2 initial); random10 (2 initial).]
