
1

We KnowItAll:

Lessons from a Quarter Century of Web Extraction Research

Oren Etzioni, Turing Center, University of Washington

http://turing.cs.washington.edu

2

KnowItAll Team

Michele Banko, Michael Cafarella, Tom Lin, Mausam, Alan Ritter, Stefan Schoenmackers, Stephen Soderland, Dan Weld.

Alumni: Doug Downey, Ana-Maria Popescu, Alex Yates, and others.


3

“In the Knowledge Lies the Power”

Manual "knowledge engineering": Cyc, Freebase, WordNet

Volunteers: Open Mind, DBpedia; von Ahn's games (Verbosity)

KnowItAll project: "read" the Web

4

Outline

1. Turing Center
2. Machine Reading & Open IE
3. TextRunner
4. Textual Inference
5. Conclusions

5

2. What is Machine Reading?

Unsupervised understanding of text: information extraction + inference

6

Information Extraction (IE) via Rules

(Hearst ’92)

"… Cities such as Boston, Los Angeles, and Seattle …"

("C such as NP1, NP2, and NP3") => IS-A(each(head(NPi)), C)

The rule needs applicability conditions such as ProperNoun(head(NP)); otherwise it misfires:

"Detailed information for several countries such as maps, …"

"I listen to pretty much all music but prefer country such as Garth Brooks"
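A Hearst-style rule of this kind can be sketched as a small regular-expression extractor. This is a minimal illustration, not Hearst's or KnowItAll's actual implementation; a capitalization test stands in for the real ProperNoun check.

```python
import re

# Minimal sketch of a Hearst-style "C such as NP1, NP2, and NP3" rule.
# The ProperNoun(head(NP)) applicability condition is approximated by
# requiring each extracted NP to start with a capital letter.
PATTERN = re.compile(
    r"(\w+) such as ((?:[A-Z]\w+(?: [A-Z]\w+)*(?:, and |, | and )?)+)"
)

def extract_isa(sentence):
    """Return (instance, class) pairs matched by the rule."""
    pairs = []
    for cls, nps in PATTERN.findall(sentence):
        for np in re.split(r", and |, | and ", nps):
            if np and np[0].isupper():   # crude ProperNoun check
                pairs.append((np.strip(), cls))
    return pairs

print(extract_isa("Cities such as Boston, Los Angeles, and Seattle attract tourists."))
```

Note that even with the ProperNoun condition, the "prefer country such as Garth Brooks" sentence still yields the spurious pair ("Garth Brooks", "country"), which is exactly the slide's point: surface patterns alone are not enough.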


7

IE as Supervised Learning

(E.g., Riloff ‘96, Soderland ’99)

Hand-label examples of each relation
Manual labor linear in |relations|
Learn a lexicalized extractor

Example: "S. Smith, formerly chairman of XYZ Corporation …"
ManagementSuccession slots: Person-In, Person-Out, Organization, Position
Learned pattern: <P2> formerly <POS> of <ORG>

Labeled Examples => Relation Extractor

8

Semi-Supervised Learning

Few hand-labeled examples per relation!
Limit on the number of relations
Relations are pre-specified
Problematic for Machine Reading

Alternative: self-supervised learning
Learner discovers relations on the fly (Sekine '06)
Learner automatically labels examples
"Look Ma, No Hands!" (Downey & Etzioni '08)


9

Open IE = Self-supervised IE

(Banko et al., IJCAI '07, ACL '08)

                 Traditional IE                    Open IE
Input:           Corpus + hand-labeled data        Corpus
Relations:       Specified in advance              Discovered automatically
Complexity:      O(D * R), R relations             O(D), D documents
Output:          Lexicalized, relation-specific    Relation-independent

10

How is Open IE Possible?

There is a compact set of relationship expressions in English (Banko & Etzioni, ACL '08).

"Relationship expressions" can be specified without reference to the individual relation!

Gates founded Microsoft => founded(Gates, Microsoft)


11

Relation-Independent Extraction Rules

Category       Frequency   Pattern                  Example
Verb           37.8%       E1 Verb E2               X established Y
Noun+Prep      22.8%       E1 NP Prep E2            the X settlement with Y
Verb+Prep      16.0%       E1 Verb Prep E2          X moved to Y
Infinitive     9.4%        E1 to Verb E2            X to acquire Y
Modifier       5.2%        E1 Verb E2 NP            X is Y winner
Coordinate_n   1.8%        E1 (and|,|-|:) E2 NP     X-Y deal
Coordinate_v   1.0%        E1 (and|,) E2 Verb       X, Y merge

12

Relation-Independent Observations

  • 95% of sample covered by 8 broad patterns!
  • Rules are simplified…
  • Applicability conditions are complex

"E1 Verb E2" also matches non-relations:
  1. Kentucky Fried Chicken
  2. Microsoft announced Tuesday that…

  • Learned via a CRF model
  • Effective, but far from perfect!

Lesson 1: to scale IE to the Web corpus, use Open IE.

What about extraction errors?


15

Leveraging the Web’s Redundancy

(Downey, Soderland, & Etzioni, IJCAI ’05)

1) Repetition in distinct sentences
2) Multiple, distinct extraction rules

Phrase                          Hits
"Atlanta and other cities"        980
"Canada and other cities"         286
"cities such as Atlanta"        5,860
"cities such as Canada"             7

IJCAI '05: a formal model of these intuitions

16

IJCAI ’05 Combinatorial Model

If an extraction x appears k times in a set of n distinct sentences matching a rule, what is the probability that x ∈ class C?
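The question can be illustrated with a deliberately simplified stand-in for the IJCAI '05 model: assume a true member of C is extracted from a matching sentence with probability p_true and a spurious string with probability p_noise, then apply Bayes' rule to the observed counts. The parameter values below are invented for illustration only.

```python
from math import comb

def p_in_class(k, n, prior=0.5, p_true=0.9, p_noise=0.1):
    """P(x in C | x extracted k times from n matching sentences).

    Simplified stand-in for the IJCAI '05 combinatorial ("urns") model:
    a correct member is extracted from each matching sentence with
    probability p_true, an error with probability p_noise.
    """
    like_true = comb(n, k) * p_true**k * (1 - p_true)**(n - k)
    like_false = comb(n, k) * p_noise**k * (1 - p_noise)**(n - k)
    return prior * like_true / (prior * like_true + (1 - prior) * like_false)

# "cities such as Atlanta" is frequent; "cities such as Canada" is rare.
print(p_in_class(k=9, n=10))   # near 1: repeated extraction, likely a city
print(p_in_class(k=1, n=10))   # near 0: rare extraction, likely an error
```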


Lesson 2: utilize the Web’s redundancy

18

What about complex sentences?

Paris, which has been proclaimed by many literary figures to be a great source of inspiration, is also a capital city, but not the capital city of an ordinary country, but rather the capital of a great republic --- the republic of France!

vs. Paris is the Capital of France.


19

But What About…

Relative clauses
Disjunction & conjunction
Anaphora
Quantification
Temporal qualification
Counterfactuals

See Van Durme & Schubert '08; MacCartney '08

Lesson 3: focus on ‘tractable’ sentences


21

3. TextRunner (the Web's 1st Open IE system)

1. Self-Supervised Learner: automatically labels example extractions & trains a CRF-based extractor

2. Single-Pass Extractor: makes a single pass over the corpus, identifying extractions in 'tractable' sentences

3. Search Engine: indexes and ranks extractions based on redundancy

4. Query Processor: interactive speeds
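The shape of the extractor's output can be sketched with a toy single-pass extractor. A hand-written regex and a tiny invented verb list stand in for TextRunner's CRF, and only the simplest 'tractable' sentences match.

```python
import re

# Toy single-pass extractor: find (arg1, relation, arg2) triples in simple
# "NP Verb NP" sentences. The verb lexicon is invented for illustration;
# TextRunner itself uses a learned CRF, not a fixed list.
TRIPLE = re.compile(
    r"([A-Z][\w ]*?)\s+(founded|acquired|moved to|taught at)\s+([A-Z][\w ]*)"
)

def extract_triples(corpus):
    """One pass over the corpus, emitting (arg1, rel, arg2) tuples."""
    triples = []
    for sentence in corpus:
        for arg1, rel, arg2 in TRIPLE.findall(sentence):
            triples.append((arg1.strip(), rel, arg2.strip()))
    return triples

corpus = ["Gates founded Microsoft.", "Oppenheimer taught at Berkeley."]
print(extract_triples(corpus))
```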

22

Application: Information Fusion

What kills bacteria?
What west coast nano-technology companies are hiring?
Compare Obama's "buzz" versus Hillary's?
What is a quiet, inexpensive, 4-star hotel in Vancouver?


TextRunner Demo

Extraction run at Google on 500,000,000 high-quality Web pages.

24


25 26

Sample of Web pages:

Triples                        11.3 million
  With well-formed relation     9.3 million
  With well-formed entities     7.8 million
    Abstract                    6.8 million (79.2% correct)
    Concrete                    1.0 million (88.1% correct)

Concrete facts: (Oppenheimer, taught at, Berkeley)
Abstract facts: (fruit, contain, vitamins)


Lesson 4: Open IE yields a mass of isolated “nuggets”

Textual inference is the next step towards machine reading.

28

4. Examples of Textual Inference

I. Entity and predicate resolution
II. Composing facts to draw conclusions


29

I. Entity Resolution

P(String1 = String2)?
Resolver (Yates & Etzioni, HLT '07): determines synonymy based on relations found by TextRunner (cf. Pantel & Lin '01)

(X, born in, 1941)     (M, born in, 1941)
(X, citizen of, US)    (M, citizen of, US)
(X, friend of, Joe)    (M, friend of, Mary)

P(X = M) ~ shared relations

30

Relation Synonymy

(1, R, 2) (2, R, 4) (4, R, 8) etc.
(1, Q, 2) (2, Q, 4) (4, Q, 8) etc.

P(R = Q) ~ shared argument pairs

  • See paper for the probabilistic model
  • O(n log(n)) algorithm, n = |tuples|
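The shared-argument-pair intuition can be sketched with a Jaccard overlap. This is only a stand-in for Resolver's actual probabilistic model described in the paper.

```python
def argument_pairs(triples, relation):
    """Set of (arg1, arg2) pairs observed with the given relation."""
    return {(a1, a2) for a1, r, a2 in triples if r == relation}

def synonymy_score(triples, r, q):
    """Jaccard overlap of shared argument pairs: a simple stand-in
    for Resolver's probabilistic estimate of P(R = Q)."""
    pr, pq = argument_pairs(triples, r), argument_pairs(triples, q)
    return len(pr & pq) / len(pr | pq) if pr | pq else 0.0

triples = [("1", "R", "2"), ("2", "R", "4"), ("4", "R", "8"),
           ("1", "Q", "2"), ("2", "Q", "4"), ("4", "Q", "8"),
           ("1", "S", "3")]
print(synonymy_score(triples, "R", "Q"))  # identical argument pairs
print(synonymy_score(triples, "R", "S"))  # no shared argument pairs
```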

31

Functional Relations help resolution

1. married-to(Oren, X)
2. married-to(Oren, Y)

If function(married-to) then P(X = Y) is high.

Caveat: assumes both mentions of Oren refer to the same individual.

Lesson 4: leverage relations implicit in the text

Functional relations are particularly informative.
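The functional-relation heuristic can be sketched as follows. The FUNCTIONAL set and the triples are illustrative only, and the slide's caveat applies: both "Oren" mentions are assumed to denote the same person.

```python
from collections import defaultdict

# Relations assumed (for illustration) to map each arg1 to a single arg2.
FUNCTIONAL = {"married-to", "born-in"}

def synonym_candidates(triples):
    """If a functional relation maps the same arg1 to two different
    strings, those strings are candidate synonyms. Caveat: this assumes
    all mentions of arg1 denote the same individual."""
    by_key = defaultdict(set)
    for a1, r, a2 in triples:
        if r in FUNCTIONAL:
            by_key[(a1, r)].add(a2)
    return [(key, vals) for key, vals in by_key.items() if len(vals) > 1]

triples = [("Oren", "married-to", "X"), ("Oren", "married-to", "Y")]
print(synonym_candidates(triples))
```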


33

Returning to Our Caveat…

BornIn(Oren, New York)
BornIn(Oren, Maryland)

But these are two different Orens. How can you tell? Web page context.

Lesson 5: context enhances entity resolution

To date, our extraction process has been myopic.


35

III. Compositional Inference is Key

Text:
  The Tour de France is the world's largest cycle race.
  The Tour de France takes place in France.
  The illuminated helmet is used during sporting events such as cycle racing.

Information extraction:
  Is-A(Tour de France, cycle race)
  In(Tour de France, France)
  Is-A(cycle race, sporting event)

Query: sporting event in France? => Tour de France
36

Scalable Textual Inference

(Schoenmackers, Etzioni, Weld EMNLP ‘08)

Desiderata for inference:
  • In text => probabilistic inference
  • On the Web => scalable, linear in |Corpus|

Novel property of textual relations:
  • Prevalent
  • Provably linear
  • Empirically linear!

37

Inference Scalability for Holmes

38

Related Work

Weld's Intelligence in Wikipedia project; Sekine's "pre-emptive IE"; Powerset; Textual Entailment; the AAAI '07 Symposium on "Machine Reading"; a growing body of work on IE from the Web.


39

Lessons Reprised

  • Scale & diversity of the Web => Open IE
  • Redundancy helps accuracy!
  • Focus on 'tractable' sentences
  • Open IE(Web) = disjointed "nuggets"
  • Scalable inference is key
  • Leverage relations implicit in the text
  • Sentence-based IE => page-based IE

40

Directions for Future Work

Characterize 'tractable' sentences
Page-based IE
Better entity resolution
Integrate Open IE with ontologies
Investigate other domains for Open IE (e.g., news)


41

Implications for Web Search

Imagine search systems that operate over a (more) semantic space:

  • Keywords, documents => extractions
  • TF-IDF, PageRank => entities, relations
  • Web pages, links => Q/A, information fusion

Reading the Web => a new search paradigm

42

Machine Reading of the Web

Q: How is the ThinkPad T-40?
A: Found 2,300 reviews; 85% positive, key features are…

[Diagram: the World Wide Web feeding an "information food chain"]

Open IE over 10^10 Web pages.


43 44

Machine Reading at Web Scale?

Open Information Extraction + linear-time inference, facing:

  Difficult, ungrammatical sentences  => tractable sentences
  Unanticipated objects & relations   => Open Information Extraction
  Massive scale                       => linear-time inference
  Unreliable information              => redundancy (PageRank)


45

Holmes - KBMC

KBs:         0.9 Prevents(Ca, osteo.); 0.8 Contains(kale, Ca); …
Inf. rules:  weighted Horn clauses, e.g. 1.5 P(X,Z) :- C(X,Y) ^ P(Y,Z)
Query:       Q(X) :- prevent(X, osteo.)

Pipeline: find proof trees (chain backwards from the query) => construct a Markov network (convert each proof tree into a Markov network) => approximate probabilistic inference => answers
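A naive backward chainer over the slide's facts and rule gives the flavor of the "find proof trees" step. Holmes itself converts proof trees into a Markov network for approximate probabilistic inference rather than multiplying confidences, and the rule weight 1.5 is not used in this simplification.

```python
# Naive backward chaining over a tiny KB with confidences -- a stand-in
# for Holmes's KBMC pipeline. Facts and the Horn rule come from the slide;
# combining scores by product is a simplification of real inference.
FACTS = {("Prevents", "Ca", "osteo."): 0.9,
         ("Contains", "kale", "Ca"): 0.8}

# Rule: Prevents(X, Z) :- Contains(X, Y) ^ Prevents(Y, Z)
def prove_prevents(x, z):
    best = FACTS.get(("Prevents", x, z), 0.0)
    for (pred, a, y), w1 in FACTS.items():
        if pred == "Contains" and a == x:
            best = max(best, w1 * prove_prevents(y, z))
    return best

def query(z):
    """Q(X) :- prevent(X, z): return scored answers, best first."""
    entities = {a for (_, a, _) in FACTS} | {b for (_, _, b) in FACTS}
    answers = [(e, prove_prevents(e, z)) for e in entities]
    return sorted([a for a in answers if a[1] > 0], key=lambda t: -t[1])

print(query("osteo."))  # Ca directly; kale via Contains(kale, Ca)
```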

46

Test

Assess a candidate extraction E using Pointwise Mutual Information (PMI) between E and a "discriminator phrase" D.

PMI-IR (Turney '01): use hit counts for efficient PMI computation over the Web.

PMI(Yakima, City) = |Hits(Yakima + City)| / |Hits(Yakima)|


47

PMI In Action

1. E = "Yakima" (1,340,000 hits)
2. D = <class name>
3. E + D = "Yakima city" (2,760 hits)
4. PMI = 2,760 / 1.34M = 0.002

  • E = "Avocado" (1,000,000 hits)
  • E + D = "Avocado city" (10 hits)
  • PMI = 0.00001 << 0.002
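The computation can be reproduced directly; the hits() lookup below is a stand-in for a real search-engine hit-count API, seeded with the counts from the slide.

```python
# PMI-IR sketch using the hit counts from the slide. HITS stands in for a
# real search-engine hit-count API, which this sketch does not call.
HITS = {"Yakima": 1_340_000, "Yakima city": 2_760,
        "Avocado": 1_000_000, "Avocado city": 10}

def hits(phrase):
    return HITS[phrase]

def pmi(extraction, discriminator):
    """PMI(E, D) = Hits(E + D) / Hits(E), as on the slide."""
    return hits(f"{extraction} {discriminator}") / hits(extraction)

print(pmi("Yakima", "city"))   # about 0.002: Yakima passes the City test
print(pmi("Avocado", "city"))  # about 0.00001: Avocado fails
```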

48


49 50

Can IE achieve "adequate" precision/recall on the Web corpus?


51

Recall-Precision Curve

[Figure: recall-precision curves for the City, Country, and USState classes]

52

High Precision for Binary Predicates

[Figure: precision vs. recall for the binary predicates CapitalOf and CeoOf]


53

Methods for Improving Recall

RL: learn class-specific patterns, e.g. "Headquartered in <city>"

SE: recursively extract subclasses: "Scientists such as physicists, chemists, and biologists…"

LE: extract lists of items & vote

Bootstrapped from generic KnowItAll: no hand-labeled examples!

Results for City: found 10,300 cities missing from the Tipster Gazetteer.


55

Sources of ambiguity

Time: "Bush is the president" … in 2006!
Context: "common misconceptions…"
Opinion: Who killed JFK?
Multiple word senses: Amazon, Chicago, Chevy Chase, etc.

56

Critique of KnowItAll 1.0

User query ("African cities") launches search-engine queries => slow, limited

Relations specified by user; N relations => N passes over the data

Can we make a single pass over the Web corpus?


57

Extractor Overview (Banko & Etzioni, ’08)

1. Use a simple model of relationships in English to label extractions
2. Bootstrap a general model of relationships in English sentences, encoded as a CRF
3. Decompose each sentence into one or more (NP1, VP, NP2) "chunks"
4. Use the CRF model to retain the relevant parts of each NP and VP

The extractor is relation-independent!
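Step 1's self-labeling can be illustrated with a few heuristic conditions. These particular tests (span length, clause punctuation, pronoun arguments) are plausible examples, not TextRunner's exact heuristics.

```python
PRONOUNS = {"he", "she", "it", "they", "we", "you", "i"}

def self_label(np1, span, np2):
    """Heuristically label a candidate (NP1, span, NP2) tuple as a positive
    or negative training example. Conditions are illustrative only;
    TextRunner's actual heuristics differ in detail."""
    tokens = span.split()
    if not tokens or len(tokens) > 5:       # relation phrase too long/empty
        return False
    if any(ch in ".!?;" for ch in span):    # span crosses a clause boundary
        return False
    if any(w.lower() in PRONOUNS for w in np1.split() + np2.split()):
        return False                        # unresolved anaphora in an arg
    return True

print(self_label("Gates", "founded", "Microsoft"))
print(self_label("He", "founded", "Microsoft"))
print(self_label("Paris", "which has been proclaimed by many figures to be", "France"))
```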

58


59 60

Fundamental AI Questions

How to accumulate massive amounts of knowledge?

Can a program read and "understand" what it's read?


61

Massive Information Extraction from the Web

Q: How is the ThinkPad T-40?
A: Found 2,300 reviews; 85% positive.

[Diagram: the World Wide Web feeding an "information food chain"]

Stats over 10^10 web pages.

KnowItAll