

slide-1
SLIDE 1

Bringing (Web) Databases to the Masses

Alon Halevy Google

slide-2
SLIDE 2

Structured Data & The Web

slide-3
SLIDE 3

Discover Manage, Analyze, Combine Extract Publish

  • Hard to query, visualize, combine data across organizations
  • Requires infrastructure, concerns about losing control
  • Hard to find structured data via search engines
  • Data is embedded in web pages, behind forms

slide-4
SLIDE 4

Discover Manage, Analyze, Combine Extract Publish

Fusion Tables: collaborating on data in the cloud, easy data publishing

Web-form crawling, finding all HTML tables, lists -> tables, extracting context

slide-5
SLIDE 5

Discover Manage, Analyze, Combine Extract Publish

slide-6
SLIDE 6

What is the Deep Web?

store locations, used cars, radio stations, patents, recipes

  • Deep = not accessible through general purpose search engines

– Major gap in the coverage of search engines.

slide-7
SLIDE 7

Vertical Search: Data Integration

slide-8
SLIDE 8

Tree search, Amish quilts, parking tickets in India, horses

slide-9
SLIDE 9

Three Flavors of Deep Web

  • Vertical search: a single domain.

– Can be done with data integration techniques (e.g. Fetch, Transformic, Morpheus, …)

– Goal: deeper experience than search

  • close a transaction, related items, reviews, …
  • Search for anything

– Goal: drive traffic to relevant sites

  • Product search

– In between the above two.

slide-10
SLIDE 10

Search for Anything: Surfacing

  • Crawl & indexing time

– Pre-compute interesting form submissions
– Insert resulting pages into the Google Index

  • Query time: nothing!

– Deep web URLs in the Google Index are like any other URL

  • Advantages

– Reuse existing search engine infrastructure as-is
– Reduced load on target web sites: users click only on what they deem relevant.
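A minimal sketch of the pre-computation step, assuming a GET form whose inputs each have a small set of candidate values. The function name `surface_urls`, the `example.com` action URL, and the input values are hypothetical. Enumerating every combination is only the naive baseline; a real crawler would prune it.

```python
from itertools import product
from urllib.parse import urlencode


def surface_urls(action, inputs):
    """Enumerate one GET URL per combination of candidate input values.

    `inputs` maps each form input name to a list of candidate values.
    """
    names = list(inputs)
    urls = []
    for combo in product(*(inputs[name] for name in names)):
        query = urlencode(list(zip(names, combo)))
        urls.append(action + "?" + query)
    return urls


# Hypothetical form with one select menu varied and one held fixed.
form_inputs = {"state": ["AL", "AK"], "type": ["All"]}
urls = surface_urls("http://example.com/search", form_inputs)
for url in urls:
    print(url)
```

Each generated URL can then be crawled and indexed like any other page, which is why nothing special is needed at query time.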

slide-11
SLIDE 11

Surfacing Challenges

[See VLDB 08, Madhavan et al.]

1. Predicting the correct input combinations

– Generating all possible URLs is wasteful and unnecessary
– Cars.com has ~500K listings, but 250M possible queries

2. Predicting the appropriate values for text inputs

– Valid input values are required for retrieving data
– Ingredients in recipes.com and zipcodes in borderstores.com

3. Don’t do anything bad!

slide-12
SLIDE 12

Informative Query Templates

http://jobs.shrm.org/search?state=All&kw=&type=All

http://jobs.shrm.org/search?state=AL&kw=&type=All
http://jobs.shrm.org/search?state=AK&kw=&type=All
…
http://jobs.shrm.org/search?state=WV&kw=&type=All

Result pages different -> informative

http://jobs.shrm.org/search?state=All&kw=&type=ALL
http://jobs.shrm.org/search?state=All&kw=&type=ANY
http://jobs.shrm.org/search?state=All&kw=&type=EXACT

Result pages similar -> un-informative
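One way to make the informative / un-informative distinction concrete is to compare content signatures of the result pages a template produces. This is a rough sketch under assumptions: the hash-based `signature`, the `min_distinct_fraction` threshold of 0.5, and the sample page texts are all hypothetical and not taken from the slides.

```python
import hashlib


def signature(page_text):
    """Crude content signature: hash of lowercased, whitespace-normalized text."""
    normalized = " ".join(page_text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def is_informative(pages, min_distinct_fraction=0.5):
    """Treat a template as informative if enough result pages are distinct."""
    if not pages:
        return False
    distinct = len({signature(page) for page in pages})
    return distinct / len(pages) >= min_distinct_fraction


# Hypothetical result pages: varying `state` changes the results,
# varying `type` does not.
state_pages = ["12 jobs in Alabama", "3 jobs in Alaska", "7 jobs in West Virginia"]
type_pages = ["250 jobs found", "250 jobs found", "250  Jobs found"]

state_informative = is_informative(state_pages)
type_informative = is_informative(type_pages)
print(state_informative, type_informative)
```

Only inputs whose variation yields distinct pages are worth expanding further, which keeps the number of generated URLs far below the full cross product.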

slide-13
SLIDE 13

Current Impact on Query Stream

  • Crawled ~3M sites
  • 50 languages, hundreds of domains
  • 1000 queries per second get results from the deep web!
  • 400K forms served per day, 800K per week
  • Impact mostly on the long and heavy tail of queries

  • Slash-dotted and valley-wagged
  • See VLDB 2008 paper
slide-14
SLIDE 14

The Role of Semantics

  • The form provides a structured interface to the data

– Extracting rows/tables from the resulting pages is very hard.
– We treat them as any other web page

  • There is a huge collection of data that is structured on the Web:

– HTML tables.

slide-15
SLIDE 15

Discover Manage, Analyze, Combine Extract Publish

Fusion Tables: collaborating on data in the cloud, easy data publishing

Web-form crawling, finding all HTML tables, lists -> tables, extracting context

slide-16
SLIDE 16
slide-17
SLIDE 17

WebTables: Exploring the Relational Web

[Cafarella et al., VLDB 2008, WebDB 08]

  • In a corpus of 14B raw tables, we estimate 154M are “good” relations

– Single-table databases; Schema = attr labels + types
– Largest corpus of databases & schemas we know of

  • The WebTables system:

– Recovers good relations from crawl and enables search
– Builds novel apps on the recovered data

slide-18
SLIDE 18

The WebTables System

Raw crawled pages Raw HTML Tables Recovered Relations

  • What are good relations?
  • What is the “schema”?

Relation Search Inverted Index

  • What features are important for ranking?
  • How to index the tables?

Data is interesting, but there is much more in the structure itself!

slide-19
SLIDE 19

Attribute Correlations DB

Raw crawled pages Raw HTML Tables Recovered Relations Relation Search Inverted Index

Schema                          Frequency
Job-title, company, date        104
Make, model, year               916
Rbi, ab, h, r, bb, avg, slg     12
Dob, player, height, weight     4
…                               …

Attribute Correlation Statistics Db

  • 2.6M distinct schemas
  • 5.4M attributes
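A small sketch of how such a statistics database could be built from (schema, frequency) rows. The four rows reuse the example counts shown on this slide; the function and variable names (`build_stats`, `p_attr`, `p_pair`) are hypothetical, and the probabilities are plain relative frequencies over this toy corpus.

```python
from collections import Counter
from itertools import combinations

# (schema, number of tables with that schema), using the slide's example rows.
schema_counts = [
    (("job-title", "company", "date"), 104),
    (("make", "model", "year"), 916),
    (("rbi", "ab", "h", "r", "bb", "avg", "slg"), 12),
    (("dob", "player", "height", "weight"), 4),
]


def build_stats(rows):
    """Count how often each attribute and each attribute pair occurs."""
    total = sum(n for _, n in rows)
    attr = Counter()
    pair = Counter()
    for schema, n in rows:
        attrs = sorted(set(schema))
        for a in attrs:
            attr[a] += n
        for a, b in combinations(attrs, 2):
            pair[(a, b)] += n  # keys are stored in sorted order
    return total, attr, pair


total, attr, pair = build_stats(schema_counts)


def p_attr(a):
    """Fraction of tables whose schema contains attribute a."""
    return attr[a] / total


def p_pair(a, b):
    """Fraction of tables whose schema contains both a and b."""
    return pair[tuple(sorted((a, b)))] / total


print(p_attr("make"), p_pair("make", "model"), p_pair("make", "company"))
```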
slide-20
SLIDE 20

App #1: Schema Auto-complete

  • Useful for traditional schema design for non-expert users
  • Input I: attr0
  • Output schema S: attr1, attr2, attr3, …
  • While p(S - I | I) > t

– Find attr_a that maximizes p(attr_a, S | I)
– S = S ∪ {attr_a}
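The greedy loop can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: probabilities are estimated as relative frequencies over a toy list of (schema, count) rows, the stopping test is applied to the best candidate before it is added, and the data, the threshold `t=0.3`, and the names `count_containing` and `autocomplete` are hypothetical.

```python
def count_containing(rows, attrs):
    """Number of tables whose schema contains every attribute in attrs."""
    return sum(n for schema, n in rows if attrs <= set(schema))


def autocomplete(rows, seed, t=0.3):
    """Greedily extend the seed attributes while p(S | I) stays above t."""
    seed_set = set(seed)
    base = count_containing(rows, seed_set)
    suggested = list(seed)
    if base == 0:
        return suggested  # seed never observed, nothing to suggest
    vocabulary = {a for schema, _ in rows for a in schema}
    while True:
        current = set(suggested)
        best, best_p = None, 0.0
        for a in sorted(vocabulary - current):
            # Estimate p(a, S | I) as a ratio of table counts.
            p = count_containing(rows, current | {a}) / base
            if p > best_p:
                best, best_p = a, p
        if best is None or best_p <= t:
            break
        suggested.append(best)
    return suggested


# Hypothetical toy corpus of (schema, count) rows.
toy_rows = [
    (("make", "model", "year"), 916),
    (("make", "model", "year", "price"), 300),
    (("make", "model", "color"), 50),
    (("name", "price"), 200),
]
result = autocomplete(toy_rows, ["make"], t=0.3)
print(result)
```

Lowering `t` lets the loop keep adding rarer attributes (here `price`), trading precision for longer suggested schemas.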

slide-21
SLIDE 21

Schema Auto-complete Examples

Input         Suggested schema
name          name, size, last-modified, type
instructor    instructor, time, title, days, room, course
elected       elected, party, district, incumbent, status, …
ab            ab, h, r, bb, so, rbi, avg, lob, hr, pos, batters
sqft          sqft, price, baths, beds, year, type, lot-sqft, …

slide-22
SLIDE 22

App #2: Synonym Discovery

  • Use schema statistics to automatically compute attribute synonyms

– More complete than a thesaurus

  • Given input “context” attribute set C:

1. A = all attrs that appear with C
2. P = all (a,b) where a∈A, b∈A, a≠b
3. Remove all (a,b) from P where p(a,b)>0
4. For each remaining pair (a,b) compute:
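Steps 1 to 3 can be sketched as candidate generation: two attributes that ever appear in the same schema are unlikely to be synonyms, so such pairs are dropped. The scoring formula of step 4 is not reproduced in this transcript and is deliberately left out. The function name `synonym_candidates` and the toy schemas are hypothetical.

```python
from itertools import combinations


def synonym_candidates(schemas, context):
    """Return attribute pairs seen with the context but never seen together."""
    context_set = set(context)

    # Step 1: A = all attributes that appear together with the context C.
    appear_with_context = set()
    for schema in schemas:
        if context_set <= set(schema):
            appear_with_context |= set(schema) - context_set

    # Pairs that co-occur in at least one schema, i.e. p(a, b) > 0.
    cooccurring = set()
    for schema in schemas:
        cooccurring.update(combinations(sorted(set(schema)), 2))

    # Steps 2 and 3: all pairs over A, minus the co-occurring ones.
    return [
        pair
        for pair in combinations(sorted(appear_with_context), 2)
        if pair not in cooccurring
    ]


# Hypothetical toy schemas.
toy_schemas = [
    ("name", "e-mail", "phone"),
    ("name", "email", "telephone"),
    ("name", "email", "phone"),
]
candidates = synonym_candidates(toy_schemas, ["name"])
print(candidates)
```

On this toy input the surviving pairs include a false candidate (`e-mail`, `telephone`), which is why the remaining pairs still need to be scored and ranked.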
slide-23
SLIDE 23

Synonym Discovery Examples

Context       Discovered synonym pairs
name          e-mail|email, phone|telephone, e-mail_address|email_address, date|last_modified
instructor    course-title|title, day|days, course|course-#, course-name|course-title
elected       candidate|name, presiding-officer|speaker
ab            k|so, h|hits, avg|ba, name|player
sqft          bath|baths, list|list-price, bed|beds, price|rent

slide-24
SLIDE 24

Harnessing the Structure

  • Tables provide much more information:

– Lists of entities

  • Useful in creating collections of instance-class pairs [Talukdar et al. EMNLP 2008]
  • Segmenting lists [Elmeleegy, Madhavan, H., VLDB 2009]

– Association between instances and labels
– Data-level synonyms (e.g., S. Korea, South Korea)

  • Goal: provide a set of “semantic services” based on analyzing tables.

slide-25
SLIDE 25

Discover Manage, Analyze, Combine Extract Publish

Fusion Tables: collaborating on data in the cloud, easy data publishing

Web-form crawling, finding all HTML tables, lists -> tables, extracting context

slide-26
SLIDE 26


Octopus: Integration from the Web

[Cafarella, Khoussainova, H., VLDB 2009; Elmeleegy, Madhavan, H., VLDB 2009]

  • Try to create a database of all “VLDB program committee members”

slide-27
SLIDE 27

An Integration Workbench

  • Operators that combine: search, extraction, cleaning and integration
  • Search: finds and clusters relevant data
  • Context: retrieves implicit data (e.g., year)
  • Split: transforms lists into tables

– Elmeleegy et al., 2009
– Uses several hints, including WebTables

  • Extend: adds new columns
slide-28
SLIDE 28

Discover Manage, Analyze, Combine Extract Publish

Fusion Tables: collaborating on data in the cloud, easy data publishing

Web-form crawling, finding all HTML tables, lists -> tables, extracting context

slide-29
SLIDE 29

Data Sharing on the Web

The Challenges

  • Hard to do:

– Need a DBMS and a publishing system

  • Lack of incentives:

– Hard to find the data later, looking at tables is boring

  • People afraid to lose control, attribution
  • Hard to combine data across multiple organizations:

– Integration research focused on other issues.

slide-30
SLIDE 30

Fusion Tables

Collaborative data management in the cloud

[Madhavan, Gonzalez, Langen, Shapley, Halevy]

  • Incentives:

– No administration, easy visualizations
– Attribution is sticky
– Fine control over data access.

  • Focus on collaboration:

– Share data with collaborators or everyone
– Fuse data from multiple tables.
– Conduct discussion on the data: rows, columns, cells.
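To illustrate what fusing two tables on a shared column amounts to, here is a plain-Python inner join. It is not the Fusion Tables API; the function name `fuse`, the column names, and the rows are hypothetical.

```python
def fuse(left, right, key):
    """Inner-join two lists of row dicts on a shared key column."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    fused = []
    for row in left:
        for match in index.get(row[key], []):
            merged = dict(match)
            merged.update(row)  # on a column clash, the left table's value wins
            fused.append(merged)
    return fused


# Hypothetical tables owned by two different collaborators.
visits = [{"site": "A", "visits": 10}, {"site": "B", "visits": 20}]
owners = [{"site": "A", "owner": "alice"}, {"site": "C", "owner": "carol"}]

fused = fuse(visits, owners, "site")
print(fused)
```

Rows without a partner in the other table are dropped here; a tool aimed at collaboration might instead keep them with empty cells.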

slide-31
SLIDE 31

Attribution and Description

slide-32
SLIDE 32
slide-33
SLIDE 33

Visualization

slide-34
SLIDE 34

Intensity Map

slide-35
SLIDE 35
slide-36
SLIDE 36

Collaboration

slide-37
SLIDE 37

Merging Data Sets

slide-38
SLIDE 38
slide-39
SLIDE 39

Discuss Data

slide-40
SLIDE 40

Conclusions

  • Structured data presents a significant challenge for web search

– Table search is still an open problem
– Automatically combining data from multiple tables is an important challenge.

  • Fusion Tables:

– A new kind of data management system
– An opportunity to build and study the ecosystem of structured data on the Web.

slide-41
SLIDE 41

References

  • Fusion Tables:

– tables.googlelabs.com

  • Deep-web crawling:

– [Madhavan et al., VLDB 08]

  • WebTables:

– [Cafarella et al., VLDB 08]

  • Octopus:

– [Cafarella et al., VLDB 09]
– [Elmeleegy et al., VLDB 09]