  1. We KnowItAll: Lessons from a Quarter Century of Web Extraction Research. Oren Etzioni, Turing Center, University of Washington, http://turing.cs.washington.edu. KnowItAll Team: Michele Banko, Michael Cafarella, Tom Lin, Mausam, Alan Ritter, Stefan Schoenmackers, Stephen Soderland, Dan Weld. Alumni: Doug Downey, Ana-Maria Popescu, Alex Yates, and others.

  2. "In the Knowledge Lies the Power"
  • Manual "knowledge engineering": Cyc, Freebase, WordNet
  • Volunteers: Open Mind, DBpedia
  • von Ahn: games (Verbosity)
  • KnowItAll project: "read" the Web
  Outline: 1. Turing Center; 2. Machine Reading & Open IE; 3. TextRunner; 4. Textual Inference; 5. Conclusions

  3. 2. What is Machine Reading? Unsupervised understanding of text: information extraction + inference.
  Information Extraction (IE) via Rules (Hearst '92)
  • "... cities such as Boston, Los Angeles, and Seattle ..."
  • Pattern ("C such as NP1, NP2, and NP3") => IS-A(head(NPi), C) for each NPi
  • But: "Detailed information for several countries such as maps, ..." motivates the added condition ProperNoun(head(NP))
  • And: "I listen to pretty much all music but prefer country such as Garth Brooks"
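The Hearst pattern above can be sketched as a tiny regex extractor. This is an illustrative toy, not Hearst's implementation: the regexes, the `extract_isa` name, and the crude capitalization test standing in for the ProperNoun(head(NP)) condition are all assumptions.

```python
import re

# Toy Hearst-style "C such as NP1, NP2, and NP3" extractor (after Hearst '92).
# The regexes and the crude capitalization test (a stand-in for the
# ProperNoun(head(NP)) condition on the slide) are illustrative only.
SUCH_AS = re.compile(r"(\w+) such as ((?:[A-Z]\w*(?: [A-Z]\w*)*(?:, and |, | and )?)+)")

def extract_isa(sentence):
    """Return (instance, class) pairs found by the 'such as' pattern."""
    pairs = []
    for cls, np_list in SUCH_AS.findall(sentence):
        for np in re.split(r", and |, | and ", np_list):
            if np and np[0].isupper():   # crude ProperNoun(head(NP)) stand-in
                pairs.append((np, cls))
    return pairs

print(extract_isa("... cities such as Boston, Los Angeles and Seattle ..."))
# -> [('Boston', 'cities'), ('Los Angeles', 'cities'), ('Seattle', 'cities')]
print(extract_isa("Detailed information for several countries such as maps"))
# -> []  (no proper-noun NPs, so the pattern yields nothing)
```

The capitalization filter keeps "maps" out, mirroring the slide's fix for the first counterexample, but it would still be fooled by "country such as Garth Brooks", which is the slide's second point.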

  4. IE as Supervised Learning (e.g., Riloff '96, Soderland '99)
  • Relation (e.g., ManagementSuccession, with slots Person-In, Person-Out, Organization, Position) + labeled examples ("S. Smith, formerly chairman of XYZ Corporation ...") = learned extractor ("<P2> formerly <POS> of <ORG>")
  • Hand-label examples of each relation: manual labor linear in |relations|
  • Learn a lexicalized extractor
  Semi-Supervised Learning: still per relation!
  • Few hand-labeled examples, but a limit on the number of relations
  • Relations are pre-specified: problematic for Machine Reading
  Alternative: self-supervised learning
  • Learner discovers relations on the fly (Sekine '06)
  • Learner automatically labels examples: "Look Ma, No Hands!" (Downey & Etzioni '08)

  5. Open IE = Self-Supervised IE (Banko et al., IJCAI '07, ACL '08)
                 Traditional IE                  Open IE
  Input:         corpus + hand-labeled data     corpus
  Relations:     specified in advance           discovered automatically
  Complexity:    O(D * R)                       O(D)
  Output:        lexicalized, relation-specific relation-independent
  (R relations, D documents)
  How is Open IE Possible?
  • There is a compact set of relationship expressions in English (Banko & Etzioni ACL '08)
  • "Relationship expressions" can be specified without reference to the individual relation!
  • "Gates founded Microsoft" => founded(Gates, Microsoft)

  6. Relation-Independent Extraction Rules
  Category      Pattern                  Frequency  Example
  Verb          E1 Verb E2               37.8%      X established Y
  Noun+Prep     E1 NP Prep E2            22.8%      the X settlement with Y
  Verb+Prep     E1 Verb Prep E2          16.0%      X moved to Y
  Infinitive    E1 to Verb E2             9.4%      X to acquire Y
  Modifier      E1 Verb E2 NP             5.2%      X is Y winner
  Coordinate_n  E1 (and|,|-|:) E2 NP      1.8%      X-Y deal
  Coordinate_v  E1 (and|,) E2 Verb        1.0%      X, Y merge
  Relation-Independent Observations
  • 95% of the sample is covered by 8 broad patterns!
  • The rules are simplified; the applicability conditions are complex. "E1 Verb E2" misfires on, e.g., "Kentucky Fried Chicken" and "Microsoft announced Tuesday that ..."
  • Learned via a CRF model: effective, but far from perfect!
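The pattern inventory above can be carried around as data. The sketch below records the slide's tabulated patterns and pairs them with a toy matcher for the dominant "E1 Verb E2" case over POS-tagged tokens; the matcher is a hypothetical stand-in for TextRunner's learned CRF extractor, not the real thing.

```python
# The slide's pattern inventory as data (categories, schematic patterns,
# and sample frequencies as reported on the slide).
PATTERNS = [
    ("Verb",         "E1 Verb E2",           37.8),
    ("Noun+Prep",    "E1 NP Prep E2",        22.8),
    ("Verb+Prep",    "E1 Verb Prep E2",      16.0),
    ("Infinitive",   "E1 to Verb E2",         9.4),
    ("Modifier",     "E1 Verb E2 NP",         5.2),
    ("Coordinate_n", "E1 (and|,|-|:) E2 NP",  1.8),
    ("Coordinate_v", "E1 (and|,) E2 Verb",    1.0),
]

# Toy matcher for the dominant "E1 Verb E2" case over (token, POS) pairs.
# A hypothetical stand-in for TextRunner's learned CRF extractor.
def match_verb_pattern(tagged):
    triples = []
    for i in range(1, len(tagged) - 1):
        (e1, t1), (verb, tv), (e2, t2) = tagged[i - 1], tagged[i], tagged[i + 1]
        if t1 == "NNP" and tv.startswith("VB") and t2 == "NNP":
            triples.append((e1, verb, e2))
    return triples

print(match_verb_pattern([("Gates", "NNP"), ("founded", "VBD"), ("Microsoft", "NNP")]))
# -> [('Gates', 'founded', 'Microsoft')]
# The same naive test misfires on "Microsoft announced Tuesday ...", which is
# the slide's point: the applicability conditions are complex.
```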

  7. Lesson 1: to scale IE to the Web corpus, use Open IE.
  What about extraction errors?

  8. Leveraging the Web's Redundancy (Downey, Soderland & Etzioni, IJCAI '05)
  1) Repetition in distinct sentences
  2) Multiple, distinct extraction rules
  Phrase                        Hits
  "Atlanta and other cities"     980
  "Canada and other cities"      286
  "cities such as Atlanta"      5860
  "cities such as Canada"          7
  IJCAI '05: a formal model of these intuitions.
  The IJCAI '05 combinatorial model: if an extraction x appears k times in a set of n distinct sentences matching a rule, what is the probability that x ∈ class C?
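The k-of-n question can be illustrated with a deliberately simplified stand-in for the IJCAI '05 combinatorial ("urns") model. This is not the paper's model: the binomial form, the hit/noise rates, and the prior are all illustrative assumptions; it only shows how repetition drives the membership probability toward 1.

```python
from math import comb

# Simplified binomial stand-in for the IJCAI '05 combinatorial ("urns")
# model -- NOT the paper's model. Compare the likelihood of seeing x in
# k of n rule-matching sentences under "x in C" (hit rate p_true) vs.
# "x not in C" (noise rate p_noise); all three parameters are illustrative.
def p_member(k, n, p_true=0.9, p_noise=0.1, prior=0.5):
    like_c  = comb(n, k) * p_true ** k  * (1 - p_true) ** (n - k)
    like_nc = comb(n, k) * p_noise ** k * (1 - p_noise) ** (n - k)
    return prior * like_c / (prior * like_c + (1 - prior) * like_nc)

print(p_member(k=9, n=10))  # repeated extraction: probability near 1
print(p_member(k=1, n=10))  # a single hit: probability near 0
```

Under these toy parameters, "cities such as Atlanta" (thousands of hits) would be accepted with near-certainty while "cities such as Canada" (7 hits among many Canada sentences) would not, matching the table's intuition.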

  9. Lesson 2: utilize the Web's redundancy.
  What about complex sentences?
  • "Paris, which has been proclaimed by many literary figures to be a great source of inspiration, is also a capital city, but not the capital city of an ordinary country, but rather the capital of a great republic: the republic of France!"
  => Paris is the capital of France.

  10. But What About...
  • Relative clauses
  • Disjunction & conjunction
  • Anaphora
  • Quantification
  • Temporal qualification
  • Counterfactuals
  See Van Durme & Schubert '08; MacCartney '08.
  Lesson 3: focus on 'tractable' sentences.

  11. 3. TextRunner (the Web's 1st Open IE system)
  1. Self-Supervised Learner: automatically labels example extractions & trains a CRF-based extractor
  2. Single-Pass Extractor: makes a single pass over the corpus, identifying extractions in 'tractable' sentences
  3. Search Engine: indexes and ranks extractions based on redundancy
  4. Query Processor: interactive speeds
  Application: Information Fusion
  • What kills bacteria?
  • What west coast, nano-technology companies are hiring?
  • Compare Obama's "buzz" versus Hillary's.
  • What is a quiet, inexpensive, 4-star hotel in Vancouver?
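The last two components, indexing and redundancy-based ranking, can be sketched in a few lines. `TripleIndex` is an illustrative toy, not TextRunner's actual data structure; it just shows how ranking by extraction count answers a query like "What kills bacteria?".

```python
from collections import Counter, defaultdict

# Toy sketch of TextRunner components 3 and 4: index extracted triples and
# rank query results by redundancy (how often each triple was extracted).
# `TripleIndex` is illustrative, not the system's actual data structure.
class TripleIndex:
    def __init__(self):
        self.counts = Counter()
        self.by_rel = defaultdict(set)

    def add(self, triple):            # triple = (arg1, relation, arg2)
        self.counts[triple] += 1
        self.by_rel[triple[1]].add(triple)

    def query(self, relation):        # most-redundant extractions first
        hits = self.by_rel.get(relation, set())
        return sorted(hits, key=lambda t: -self.counts[t])

idx = TripleIndex()
for t in [("penicillin", "kills", "bacteria")] * 3 + [("bleach", "kills", "bacteria")]:
    idx.add(t)
print(idx.query("kills"))   # penicillin triple first: extracted 3 times vs. 1
```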

  12. TextRunner Demo: extraction run at Google on 500,000,000 high-quality Web pages.

  13. Sample of Web Pages
  Triples                       11.3 million
  With well-formed relation      9.3 million
  With well-formed entities      7.8 million
    Abstract                     6.8 million (79.2% correct)
    Concrete                     1.0 million (88.1% correct)
  • Concrete facts: (Oppenheimer, taught at, Berkeley)
  • Abstract facts: (fruit, contain, vitamins)

  14. Lesson 4: Open IE yields a mass of isolated "nuggets". Textual inference is the next step towards machine reading.
  4. Examples of Textual Inference
  I. Entity and predicate resolution
  II. Composing facts to draw conclusions

  15. I. Entity Resolution: P(String1 = String2)?
  Resolver (Yates & Etzioni, HLT '07): determines synonymy based on relations found by TextRunner (cf. Pantel & Lin '01)
  • (X, born in, 1941)    (M, born in, 1941)
  • (X, citizen of, US)   (M, citizen of, US)
  • (X, friend of, Joe)   (M, friend of, Mary)
  P(X = M) ~ shared relations
  Relation Synonymy
  • (1, R, 2)   (1, Q, 2)
  • (2, R, 4)   (2, Q, 4)
  • (4, R, 8)   (4, Q, 8)
  • Etc.
  P(R = Q) ~ shared argument pairs
  • See the paper for the probabilistic model
  • O(n log n) algorithm, n = |tuples|
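Resolver's core intuition (more shared relational contexts means more likely synonymy) can be sketched with the slide's own X/M example. The Jaccard score below is an illustrative stand-in for the paper's probabilistic model, not its actual scoring function.

```python
# Sketch of Resolver's intuition (Yates & Etzioni HLT '07): two strings are
# more likely synonyms the more (relation, other-argument) contexts they share
# among the extracted triples. The Jaccard overlap here is an illustrative
# stand-in for the paper's probabilistic model.
def contexts(name, triples):
    ctx = set()
    for a1, rel, a2 in triples:
        if a1 == name:
            ctx.add((rel, "arg2", a2))
        if a2 == name:
            ctx.add((rel, "arg1", a1))
    return ctx

def synonym_score(x, y, triples):
    cx, cy = contexts(x, triples), contexts(y, triples)
    return len(cx & cy) / len(cx | cy) if cx | cy else 0.0

triples = [
    ("X", "born in", "1941"),   ("M", "born in", "1941"),
    ("X", "citizen of", "US"),  ("M", "citizen of", "US"),
    ("X", "friend of", "Joe"),  ("M", "friend of", "Mary"),
]
print(synonym_score("X", "M", triples))  # 2 shared of 4 distinct contexts -> 0.5
```

The same machinery, applied with the roles of strings and relations swapped (shared argument *pairs* instead of shared contexts), gives the relation-synonymy score P(R = Q) on the slide.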

  16. Functional Relations Help Resolution
  1. married-to(Oren, X)
  2. married-to(Oren, Y)
  If function(married-to), then P(X = Y) is high.
  Caveat: assumes "Oren" & "Oren" refer to the same individual.
  Lesson 4: leverage relations implicit in the text. Functional relations are particularly informative.
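The functional-relation argument can be sketched directly: if a relation maps each first argument to at most one value, two triples sharing a first argument are evidence that their second arguments co-refer (granting the slide's caveat that "Oren" names one individual). The `FUNCTIONAL` set and the function name are my own illustrative choices.

```python
# Sketch of the slide's functional-relation argument. The FUNCTIONAL set and
# the function name are illustrative, and the whole rule assumes -- per the
# slide's caveat -- that a repeated first argument names one individual.
FUNCTIONAL = {"married-to"}

def coreference_candidates(triples):
    last, candidates = {}, []
    for a1, rel, a2 in triples:
        if rel in FUNCTIONAL:
            if (rel, a1) in last and last[(rel, a1)] != a2:
                # functional relation, same first arg, different second args:
                # strong evidence the second args co-refer
                candidates.append((last[(rel, a1)], a2))
            last[(rel, a1)] = a2
    return candidates

print(coreference_candidates([("Oren", "married-to", "X"),
                              ("Oren", "married-to", "Y")]))  # -> [('X', 'Y')]
```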

  17. Returning to Our Caveat...
  • BornIn(Oren, New York)
  • BornIn(Oren, Maryland)
  • But these are two different Orens. How can you tell? Web page context.
  Lesson 5: context enhances entity resolution.
  To date, our extraction process has been myopic.

  18. II. Compositional Inference is Key
  • "The Tour de France is the world's largest cycle race." => Is-A(Tour de France, cycle race)
  • "The Tour de France takes place in France." => In(Tour de France, France)
  • "The illuminated helmet is used during sporting events such as cycle racing." => Is-A(cycle race, sporting event)
  • Sporting event in France? The Tour de France.
  Scalable Textual Inference (Schoenmackers, Etzioni & Weld, EMNLP '08)
  Desiderata for inference:
  • In text => probabilistic
  • On the Web => scalable, i.e., linear in |Corpus|
  Novel property of textual relations: prevalent, provably linear, empirically linear!
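The Tour de France example can be run as a tiny forward-chaining sketch. This is an illustrative stand-in for the Holmes-style inference of Schoenmackers, Etzioni & Weld (EMNLP '08), not that system: it applies one hand-written transitivity rule to a fixpoint and then composes the result with an In fact.

```python
# Tiny forward-chaining sketch of the slide's composition example (an
# illustrative stand-in for Holmes-style inference, not the real system).
# One rule is applied to a fixpoint: Is-A(x, y) & Is-A(y, z) => Is-A(x, z).
facts = {
    ("Tour de France", "Is-A", "cycle race"),
    ("Tour de France", "In", "France"),
    ("cycle race", "Is-A", "sporting event"),
}

def close_isa(facts):
    facts = set(facts)
    while True:
        new = {(x, "Is-A", z)
               for (x, r1, y1) in facts if r1 == "Is-A"
               for (y2, r2, z) in facts if r2 == "Is-A" and y1 == y2}
        if new <= facts:          # fixpoint reached
            return facts
        facts |= new

closed = close_isa(facts)
# "Sporting event in France?" composes the Is-A closure with an In fact:
answer = [x for (x, r, y) in closed
          if r == "Is-A" and y == "sporting event" and (x, "In", "France") in closed]
print(answer)   # -> ['Tour de France']
```

A naive closure like this is quadratic per pass; the EMNLP '08 point is precisely that the structure of textual relations lets inference stay linear in the corpus at Web scale.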

  19. Inference Scalability for Holmes [chart]
  Related Work
  • Weld's Intelligence in Wikipedia project
  • Sekine's "pre-emptive IE"
  • Powerset
  • Textual Entailment
  • AAAI '07 Symposium on "Machine Reading"
  • Growing body of work on IE from the Web

  20. Lessons Reprised
  • Scale & diversity of the Web => Open IE
  • Redundancy helps accuracy!
  • Focus on 'tractable' sentences
  • Open IE(Web) = disjointed "nuggets" => scalable inference is key
  • Leverage relations implicit in the text
  • Sentence-based IE => page-based IE
  Directions for Future Work
  • Characterize 'tractable' sentences
  • Page-based IE
  • Better entity resolution
  • Integrate Open IE with ontologies
  • Investigate other domains for Open IE (e.g., news)

  21. Implications for Web Search
  Imagine search systems that operate over a (more) semantic space:
  • keywords, documents => extractions
  • TF-IDF, PageRank => entities, relations
  • Web pages, links => Q/A, information fusion
  Reading the Web => a new search paradigm.
  Machine Reading of the Web
  • "How is the ThinkPad T-40?" => "Found 2,300 reviews; 85% positive, key features are ..."
  • Open IE over 10^10 Web pages. [Figure: the "Information Food Chain" over the World Wide Web]

  22. Machine Reading at Web Scale?
  Challenge                            Approach
  Difficult, ungrammatical sentences   'Tractable' sentences
  Unreliable information               Redundancy (PageRank)
  Unanticipated objects & relations    Open Information Extraction
  Massive scale                        Linear-time inference
