1
play

1 The Table Corpus The Table Corpus Table type % total count - PDF document

Outline WebTables & Octopus Michael J. Cafarella WebTables University of Washington Octopus CSE454 April 30, 2009 2 3 WebTables WebTables system automatically extracts dbs from This page contains 16 distinct HTML tables,


  1. Outline WebTables & Octopus Michael J. Cafarella � WebTables University of Washington � Octopus CSE454 April 30, 2009 2 3 WebTables � WebTables system automatically extracts dbs from This page contains 16 distinct HTML tables, web crawl but only one relational database [WebDB08, “Uncovering…”, Cafarella et al] [VLDB08, “WebTables: Exploring…”, Cafarella et al] Each relational database has its own schema, Schema Statistics usually with labeled columns. Applications Raw crawled pages Raw HTML Tables Recovered Relations � An extracted relation is one table plus labeled columns � Estimate that our crawl of 14.1B raw HTML tables contains ~ 154M good relational dbs 6 1

  2. The Table Corpus The Table Corpus Table type % total count Small tables 88.06 12.34B HTML forms 1.34 187.37M Calendars 0.04 5.50M Obvious non-rel 89.44 12.53B Other non-rel (est.) 9.46 1.33B Rel (est.) 1.10 154.15M 7 8 Relation Recovery Schema Statistics Recovered Relations make model year Schema Freq Toyo ta Camry 1984 {make , mode l , yea r } 1 2 make model year {make , mode l , yea r , co lo r } 1 { } { } Mazda Pro tégé 2003 { name , add r , c i t y , s t a te , z i p } 1 Chev ro le t Impa la 1979 { name , s i ze , l as t -mod i f i ed } 1 make model year color Step 1. Relational Filtering Step 2. Metadata Detection Chrys l e r Vo la re 1974 ye l l ow Recall 81% , Precision 41% Recall 85% , Precision 89% Nissan Sen t ra 1994 r ed � Output name addr city state zip � Schema stats useful for computing attribute � 271M databases, about 125M are good Dan S 16 Pa rk Sea t t l e W A 98195 probabilities A lon H 129 E lm Be lmon t CA 94011 � Five orders of magnitude larger than previous � p(“make”), p(“model”), p(“zipcode”) largest corpus [WWW02, “A Machine Learning…”, Wang & Hu] � p(“make” | “model”), p(“make” | “zipcode”) name size last-modified � 2.6M unique relational schemas Readme. t x t 182 Apr 26 , 2005 cac . xm l 813 Ju l 23 , 2008 � What can we do with those schemas? [VLDB08, “WebTables: Exploring…”, Cafarella et al] 9 10 App # 1: Relation Search App # 1: Relation Search � Problem: keyword search for high- quality extracted databases � Output depends on both quality of extracted tables and the ranking function 11 12 2

  3. App # 1: Relation Search App # 1: Experiments Metadata detection, when adding schema stats scoring Schema statistics can help improve both: � � Precision 0.79 ⇒ 0.89 � Relation Recovery (Metadata detection) � Recall 0.84 ⇒ 0.85 � Ranking � Ranking: compared 4 rankers on test set � By computing a schema coherency score S(R) for relation R, � Naïve : Top-10 pages from google.com � and adding it to feature vector Filter : Top-10 good tables from google.com � Measures how well a schema “hangs together” � Rank : Trained ranker � High: {make , mode l } � Rank-Stats : Trained ranker with coherency score � Low: {make , z i pcode } � What fraction of top-k are relevant? � Average pairwise Pointwise Mutual Information score for all � attributes in schema k Naïve Filter Rank Rank-Stats 10 0.26 0.35 0.43 0.47 (80%) (34% ) (65% ) 20 0.33 0.47 0.56 0.59 (79%) (42% ) (70% ) 30 0.34 0.59 0.66 0.68 (100%) (74% ) (94% ) 13 14 App # 2: Schema Autocomplete App # 2: Schema Autocomplete � Input: topic attribute ( e.g. , make ) name name, s i ze , l as t -mod i f i ed, t ype � Output: relevant schema i n s t ruc to r i n s t ruc to r , t ime , t i t l e , days , r oom, cou rse {make , mode l , yea r , p r i ce } e lec ted e lec ted , pa r t y , d i s t r i c t , i n cumben t , s t a tus , … � “tab-complete” for your database ab ab , h , r , bb , so , r b i , avg , l ob , h r , pos , ba t t e r s sq f t sq f t , p r i ce , ba ths , beds , yea r , t ype , l o t - sq f t , … � For input set I , output S , threshold t � while p( S - I | I ) > t � newAttr = max p( newAttr , S - I | I ) � S = S ∪ newAttr � emit newAttr 15 16 App # 2: Experiments App # 3: Synonym Discovery Input: topic attribute ( e.g. , addr ess ) � Asked experts for schemas in 10 areas � Output: relevant synonym pairs � � What was autocompleter’s recall? ( te lephone = tel - # ) Used for schema matching � [VLDB01, “Generic Schema Matching…”, Madhavan et al] Linguistic thesauri are incomplete; hand-made � thesauri are burdensome For attributes a , b and input domain C , � when p(a,b)= 0 17 18 3

  4. App # 3: Synonym Discovery App # 3: Experiments � For each input attr, repeatedly emit best synonym pair (until min threshold reached) name e-ma i l | ema i l , phone | te l ephone , e-ma i l _add ress |ema i l _add r ess , da te | l as t_mod i f i ed i ns t ruc to r cou rse - t i t l e | t i t l e , day |days , c ourse | cou rse -# , cou rse -name |cou rse - t i t l e e lec ted cand ida te |name, p res id ing - o f f i ce r | speake r ab k | so , h |h i t s , avg |ba , name |p l aye r sq f t ba th |ba ths , l i s t | l i s t - p r i ce , bed |beds , p r i ce | ren t 19 20 WebTables Contributions Outline � Largest collection of databases and schemas, by far � Large-scale extracted schema data for � WebTables first time; enables novel applications � Octopus 21 22 Multiple Tables Integration Challenge � Can we combine tables to create new � Try to create a database of all “VLDB p rog ram com mi t t ee membe rs ” data sources? � Data integration for the Structured Web � Many existing “mashup” tools, which ignore realities of Web data � A lot of useful data is not in XML � User cannot know all sources in advance � Transient integrations 23 24 4

  5. Octopus Walkthrough - Operator # 1 � SEARCH (“ VLDB p rog ram commi t tee � Provides “workbench” of data integration operators to build target database me mbers ”) � Most operators are not correct/incorrect, but se rge ab i t ebou l i n r i a high/low quality (like search) michae l ad iba … grenob le � Also, prosaic traditional operators an ton io a l bano … pisa … … se rge ab i t ebou l i n r i a anas tass ia a i l… ca rneg ie… gus tavo a lonso e tz zu r i ch … … 25 26 Walkthrough - Operator # 2 Walkthrough - Operator # 2 � Recover relevant data � Recover relevant data se rge ab i t ebou l i n r i a se rge ab i t ebou l i n r i a 1996 CONTEXT() CONTEXT() michae l ad iba … grenob le michae l ad iba … grenob le 1996 an ton io a l bano … pisa an ton io a l bano … pisa 1996 … … … … … se rge ab i t ebou l i n r i a 2005 se rge ab i t ebou l i n r i a anas tass ia a i l… ca rneg ie… 2005 anas tass ia a i l… ca rneg ie… CONTEXT() CONTEXT() gus tavo a lonso e tz zu r i ch 2005 gus tavo a lonso e tz zu r i ch … … … … … 27 28 Walkthrough - Union Walkthrough - Operator # 3 � Combine datasets � Add column to data � Similar to “join” but join target is a topic EXTEND ( “publications”, col= 0) “publications” se rge ab i t ebou l i n r i a 1996 michae l ad iba … grenob le 1996 se rge ab i t ebou l i n r i a 1996 se rge ab i t ebou l i n r i a se 1996 rge ab i t “ ebou La rge l Sca le i P2P n r i a D is t…” 1996 an ton io a l bano … pisa 1996 michae l ad iba … grenob le 1996 michae l ad iba … grenob le michae 1996 l ad “Exp iba lo i t i ng b i … grenob t empora l…” le 1996 … … … an ton io a l bano … pisa 1996 an ton io a l bano … pisa an 1996 ton io a l “Ano bano the r Examp … pisa le o f a…” 1996 Union() se rge ab i t ebou l i n r i a 2005 se rge ab i t ebou l i n r i a se rge 2005 ab i t “ ebou La rge l Sca le i P2P n r i a D is t…” 2005 anas tass ia a i l… ca rneg ie… 2005 anas tass ia a i l… ca rneg ie… anas 2005 tass ia “E a f i f l… i c i en t Use ca o rneg f t he…” ie… 2005 se rge ab i t ebou l i n r i a 2005 gus tavo a lonso e tz zu r i ch 2005 gus tavo a lonso e tz zu r i ch gus 2005 tavo a lonso “A Dynamic and e tz F zu lex r i ch ib l e …” 2005 anas tass ia a i l… ca rneg ie… 2005 … … … … … … … … … gus tavo a lonso e tz zu r i ch 2005 … … … • User has integrated data sources with little effort • No wrappers; data was never intended for reuse 29 30 5

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend