Bringing (Web) Databases to the Masses
Alon Halevy, Google

Structured Data & The Web

• Discover: hard to find structured data via search engines
• Publish: requires infrastructure, concerns about losing control
• Extract: data is embedded in web pages, behind forms
• Manage, Analyze, Combine: hard to query, visualize, combine data across organizations

• Discover: web-form crawling, finding all HTML tables
• Extract: lists -> tables, extracting context
• Publish; Manage, Analyze, Combine: Fusion Tables (collaborating on data in the cloud, easy data publishing)

Discover / Publish / Extract / Manage, Analyze, Combine

What is the Deep Web?
• Deep = not accessible through general purpose search engines
  – Major gap in the coverage of search engines.
• Examples: used cars, store locations, recipes, patents, radio stations

Vertical Search: Data Integration

Tree search, Amish quilts, parking tickets in India, horses

Three Flavors of Deep Web
• Vertical search: a single domain.
  – Can be done with data integration techniques (e.g., Fetch, Transformic, Morpheus, …)
  – Goal: deeper experience than search
    • close a transaction, related items, reviews, …
• Search for anything
  – Goal: drive traffic to relevant sites
• Product search
  – In between the above two.

Search for Anything: Surfacing
• Crawl & indexing time
  – Pre-compute interesting form submissions
  – Insert resulting pages into the Google Index
• Query time: nothing!
  – Deep web URLs in the Google Index are like any other URL
• Advantages
  – Reuse existing search engine infrastructure as-is
  – Reduced load on target web sites: users click only on what they deem relevant.

Surfacing Challenges [See VLDB 08, Madhavan et al.]
1. Predicting the correct input combinations
  – Generating all possible URLs is wasteful and unnecessary
  – Cars.com has ~500K listings, but 250M possible queries
2. Predicting the appropriate values for text inputs
  – Valid input values are required for retrieving data
  – Ingredients in recipes.com and zipcodes in borderstores.com
3. Don’t do anything bad!
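To make the first challenge concrete, here is a minimal sketch of why exhaustive enumeration grows so quickly: the number of possible submissions is the product of the number of values of each input. The form inputs below are hypothetical (they are not the real Cars.com form); only the 500K and 250M figures come from the slide.

```python
from itertools import product
from math import prod

# Hypothetical select-menu inputs of a used-car search form.
# These names and values are illustrative assumptions only.
form_inputs = {
    "make": ["any", "ford", "honda", "toyota"],
    "body": ["any", "sedan", "suv"],
    "max_price": ["any", "5000", "10000", "20000"],
}


def count_submissions(inputs):
    """Number of distinct submissions = product of per-input value counts."""
    return prod(len(values) for values in inputs.values())


def enumerate_submissions(inputs):
    """Yield every input combination as a dict (the wasteful brute-force approach)."""
    names = list(inputs)
    for values in product(*(inputs[name] for name in names)):
        yield dict(zip(names, values))


total = count_submissions(form_inputs)

# Figures quoted on the slide: ~500K listings versus 250M possible queries.
queries_per_listing = 250_000_000 / 500_000
```

Even this three-input toy form yields dozens of submissions, and each extra input multiplies the total, which is why only a selected subset of combinations is worth pre-computing.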

Informative Query Templates
Result pages different -> informative
http://jobs.shrm.org/search?state=All&kw=&type=All
http://jobs.shrm.org/search?state=AL&kw=&type=All
http://jobs.shrm.org/search?state=AK&kw=&type=All
…
http://jobs.shrm.org/search?state=WV&kw=&type=All
Result pages similar -> un-informative
http://jobs.shrm.org/search?state=All&kw=&type=ALL
http://jobs.shrm.org/search?state=All&kw=&type=ANY
http://jobs.shrm.org/search?state=All&kw=&type=EXACT
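One way to read the informativeness idea is as a test on how many distinct result pages a template produces. The sketch below is an assumption-laden illustration, not the method from the VLDB 08 paper: the content signature (a hash of normalized text), the 0.5 threshold, and the `fake_fetch` stand-in for a real HTTP fetch are all hypothetical.

```python
import hashlib
from typing import Callable, Iterable


def page_signature(html: str) -> str:
    """Crude content signature: hash of lowercased, whitespace-normalized text."""
    normalized = " ".join(html.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def is_informative(urls: Iterable[str],
                   fetch: Callable[[str], str],
                   threshold: float = 0.5) -> bool:
    """A template counts as informative if enough of its result pages differ.

    The threshold is an arbitrary assumption chosen for this example.
    """
    signatures = [page_signature(fetch(url)) for url in urls]
    if not signatures:
        return False
    return len(set(signatures)) / len(signatures) >= threshold


base = "http://jobs.shrm.org/search"
state_urls = [f"{base}?state={s}&kw=&type=All" for s in ("AL", "AK", "WV")]
type_urls = [f"{base}?state=All&kw=&type={t}" for t in ("ALL", "ANY", "EXACT")]

# Simulated responses: varying the state changes the page, varying the type does not.
fake_pages = {url: f"<h1>Jobs for {url}</h1>" for url in state_urls}
fake_pages.update({url: "<h1>All jobs</h1>" for url in type_urls})


def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for an HTTP request; reads from the dict above."""
    return fake_pages[url]


state_template_ok = is_informative(state_urls, fake_fetch)
type_template_ok = is_informative(type_urls, fake_fetch)
```

With the simulated pages, the state template produces three distinct signatures and passes, while the type template produces a single signature and is rejected.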

Current Impact on Query Stream
• Crawled ~3M sites
• 50 languages, hundreds of domains
• 1000 queries per second get results from the deep web!
• 400K forms served per day, 800K per week
• Impact mostly on the long and heavy tail of queries
• Slash-dotted and valley-wagged
• See VLDB 2008 paper

The Role of Semantics
• The form provides a structured interface to the data
  – Extracting rows/tables from the resulting pages is very hard.
  – We treat them as any other web page
• There is a huge collection of data that is structured on the Web:
  – HTML tables.

WebTables: Exploring the Relational Web [Cafarella et al., VLDB 2008, WebDB 08]
• In a corpus of 14B raw tables, we estimate 154M are “good” relations
  – Single-table databases; Schema = attr labels + types
  – Largest corpus of databases & schemas we know of
• The WebTables system:
  – Recovers good relations from crawl and enables search
  – Builds novel apps on the recovered data

The WebTables System
Raw crawled pages -> Raw HTML tables -> Recovered relations -> Inverted index -> Relation search
• What are good relations?
• What is the “schema”?
• What features are important for ranking?
• How to index the tables?
Data is interesting, but there is much more in the structure itself!

Attribute Correlations DB
Raw crawled pages -> Raw HTML tables -> Recovered relations -> Inverted index -> Relation search
Recovered relations -> Attribute Correlation Statistics DB
• 2.6M distinct schemas
• 5.4M attributes
Schema                            Count
Job-title, company, date          104
Make, model, year                 916
Rbi, ab, h, r, bb, avg, slg       12
Dob, player, height, weight       4
…                                 …

App #1: Schema Auto-complete
• Useful for traditional schema design for non-expert users
• Input I: attr_0
• Output schema S: attr_1, attr_2, attr_3, …
• While p(S − I | I) > t
  – Find attribute a that maximizes p(a, S | I)
  – S = S ∪ {a}
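The greedy loop above can be sketched roughly as follows. This is an illustration under stated assumptions, not the WebTables implementation: probabilities are estimated from schema counts as count(schemas containing X and I) / count(schemas containing I), the threshold is checked before an attribute is added, ties are broken alphabetically, and the tiny corpus reuses only the four example schemas and counts shown on the Attribute Correlations slide.

```python
from typing import Dict, FrozenSet, List

Corpus = Dict[FrozenSet[str], int]

# Toy statistics: the four example schemas and counts from the slides.
ACSDB: Corpus = {
    frozenset({"job-title", "company", "date"}): 104,
    frozenset({"make", "model", "year"}): 916,
    frozenset({"rbi", "ab", "h", "r", "bb", "avg", "slg"}): 12,
    frozenset({"dob", "player", "height", "weight"}): 4,
}


def count_containing(corpus: Corpus, attrs: FrozenSet[str]) -> int:
    """Total count of schemas that contain every attribute in attrs."""
    return sum(n for schema, n in corpus.items() if attrs <= schema)


def autocomplete(corpus: Corpus, seed: str, t: float = 0.1) -> List[str]:
    """Greedily extend the schema while the estimated p(S - I | I) stays above t."""
    base = frozenset([seed])
    denom = count_containing(corpus, base)
    if denom == 0:
        return [seed]  # unseen attribute: nothing to suggest
    chosen = [seed]
    current = base
    candidates = set().union(*corpus) - current
    while candidates:
        # Attribute that co-occurs most often with everything chosen so far.
        best = max(sorted(candidates),
                   key=lambda a: count_containing(corpus, current | {a}))
        p = count_containing(corpus, current | {best}) / denom
        if p <= t:
            break
        chosen.append(best)
        current = current | {best}
        candidates.discard(best)
    return chosen


car_schema = autocomplete(ACSDB, "make", t=0.5)
baseball_schema = autocomplete(ACSDB, "ab", t=0.5)
```

On this toy corpus every attribute of a schema co-occurs with the seed equally often, so the whole schema is recovered; on real statistics the threshold t decides where the suggestion list stops.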

Schema Auto-complete Examples
Input        Auto-completed schema
name         name, size, last-modified, type
instructor   instructor, time, title, days, room, course
elected      elected, party, district, incumbent, status, …
ab           ab, h, r, bb, so, rbi, avg, lob, hr, pos, batters
sqft         sqft, price, baths, beds, year, type, lot-sqft, …

App #2: Synonym Discovery
• Use schema statistics to automatically compute attribute synonyms
  – More complete than thesaurus
• Given input “context” attribute set C:
  1. A = all attrs that appear with C
  2. P = all (a,b) where a ∈ A, b ∈ A, a ≠ b
  3. Remove all (a,b) from P where p(a,b) > 0 (true synonyms should not appear together in one schema)
  4. For each remaining pair (a,b) compute:
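Steps 1 to 3 can be sketched as below; the scoring in step 4 is deliberately left out because its formula is not reproduced here. The corpus is a made-up toy example, pairs are treated as unordered (the relation is symmetric), and co-occurrence is checked over the whole corpus. All of these are assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Tuple

Corpus = Dict[FrozenSet[str], int]

# Hypothetical schema counts, invented for this example only.
toy_corpus: Corpus = {
    frozenset({"name", "phone", "e-mail"}): 5,
    frozenset({"name", "tel", "e-mail"}): 3,
    frozenset({"name", "phone", "address"}): 2,
}


def synonym_candidates(corpus: Corpus,
                       context: Iterable[str]) -> List[Tuple[str, str]]:
    """Return attribute pairs that share the context but never co-occur."""
    ctx = frozenset(context)

    # Step 1: A = all attributes that appear together with the context C.
    attrs = set()
    for schema in corpus:
        if ctx <= schema:
            attrs |= schema - ctx

    # Steps 2 and 3: form all pairs, then drop pairs seen in the same schema.
    candidates = []
    for a, b in combinations(sorted(attrs), 2):
        co_occurrences = sum(n for schema, n in corpus.items()
                             if {a, b} <= schema)
        if co_occurrences == 0:
            candidates.append((a, b))
    return candidates


pairs = synonym_candidates(toy_corpus, {"name"})
```

The surviving pairs still include non-synonyms such as ("address", "tel"), which is why a scoring step is needed afterwards to rank the remaining candidates.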

Harnessing the Structure
• Tables provide much more information:
  – Lists of entities
    • Useful in creating collections of instance-class pairs [Talukdar et al. EMNLP 2008]
    • Segmenting lists [Elmeleegy, Madhavan, H., VLDB 2009]
  – Association between instances and labels
  – Data-level synonyms (e.g., S. Korea, South Korea)
• Goal: provide a set of “semantic services” based on analyzing tables.

Octopus: Integration from the Web [Cafarella, Khoussainova, H., VLDB 2009; Elmeleegy, Madhavan, H., VLDB 2009]
• Try to create a database of all “VLDB program committee members”

An Integration Workbench
• Operators that combine: search, extraction, cleaning and integration
• Search: finds and clusters relevant data
• Context: retrieves implicit data (e.g., year)
• Split: transforms lists into tables
  – Elmeleegy et al., 2009
  – Uses several hints, including WebTables
• Extend: adds new columns

Data Sharing on the Web: The Challenges
• Hard to do:
  – Need a DBMS and a publishing system
• Lack of incentives:
  – Hard to find the data later, looking at tables is boring
• People afraid to lose control, attribution
• Hard to combine data across multiple organizations:
  – Integration research focused on other issues.

Fusion Tables: Collaborative data management in the cloud [Madhavan, Gonzalez, Langen, Shapley, Halevy]
• Incentives:
  – No administration, easy visualizations
  – Attribution is sticky
  – Fine control over data access.
• Focus on collaboration:
  – Share data with collaborators or everyone
  – Fuse data from multiple tables.
  – Conduct discussion on the data: rows, columns, cells.

Attribution and Description

Visualization

Intensity Map

Collaboration

Merging Data Sets

Discuss Data

Conclusions
• Structured data presents a significant challenge for web search
  – Table search is still an open problem
  – Automatically combining data from multiple tables is an important challenge.
• Fusion Tables:
  – A new kind of data management system
  – An opportunity to build and study the ecosystem of structured data on the Web.

References
• Fusion Tables:
  – tables.googlelabs.com
• Deep-web crawling:
  – [Madhavan et al., VLDB 08]
• WebTables:
  – [Cafarella et al., VLDB 08]
• Octopus:
  – [Cafarella et al., VLDB 09]
  – [Elmeleegy et al., VLDB 09]
