
1Table: A System for Managing Structured Web Data
Yang Zhang, with Alon Halevy, Mike Cafarella, Nodira Khoussainova, Eugene Wu and Daisy Zhe

Structured Web Data
• Web is more than just text
  – tables, tags, lists, etc.
  – 50% of pages have tables
  – 25% of tables appear to be useful data tables (relational, entity, sets, etc.)
• No existing tools to effectively query this data
  – RDBMSs don’t scale, process noisy data poorly
  – Search engines are structure-blind
• 1Table fills the gap!
(Pie chart labels: No tables, Other tables, Data tables)

The 1Table Project
• Table Search
• Data Visualization
• Synthetic Table Generation
• Reference Reconciliation
• Schema Reconciliation

1Table Project HOBO: TABLE SEARCH

The Quest for Infrastructure
• _: limited indexing options, inefficient structure
• _: lots of hoops, unstructured
• _: little bang for the buck, slow setup, inefficient structure
• Wanted control over query model, ranking
Hobo: “poor man’s text search”

Challenges
• Millions of tables (~100M in Core)
• Noisy: many are not data tables (layout)
• Query by: attributes? values? similar examples?
• No structured metadata

Hobo
• Similar to traditional inverted index search
• Schema-agnostic structured query model
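To make "inverted index search" over tables concrete, here is a minimal sketch of an index whose postings carry structural attributes. The `Posting` fields, the `build_index` helper and the toy tables are illustrative assumptions, not Hobo's actual data layout; in particular, `offset` is taken here to mean the token position inside a cell.

```python
from collections import defaultdict
from typing import NamedTuple


class Posting(NamedTuple):
    table_id: int
    row: int
    col: int
    offset: int  # assumption: token position within the cell


def build_index(tables):
    """Map each lowercased token to the postings where it occurs."""
    index = defaultdict(list)
    for table_id, table in enumerate(tables):
        for row, cells in enumerate(table):
            for col, cell in enumerate(cells):
                for offset, token in enumerate(cell.lower().split()):
                    index[token].append(Posting(table_id, row, col, offset))
    return index


# Toy corpus: each table is a list of rows, each row a list of cell strings.
TABLES = [
    [["country", "code"], ["United States", "us"], ["China", "cn"]],
    [["France", "Paris"], ["Germany", "Berlin"]],
]
INDEX = build_index(TABLES)
```

Because every posting records where a token sits, the same index can answer plain keyword lookups and structural queries without knowing any table's schema in advance.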

Hobo (architecture diagram): Query Processor, Master, Slave 0 through Slave 499, each slave with a Table Index and TID shards (labels 00000 to 00216), GFS
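The diagram suggests a master that sends a query to slaves holding index shards and merges what they return. The following is a minimal sketch of that scatter-gather pattern under that reading; the in-memory `SHARDS`, the thread pool and the function names are hypothetical stand-ins for the real slaves and storage.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each maps a term to the table IDs (TIDs) it holds.
SHARDS = [
    {"china": [101, 105], "turkey": [101]},
    {"china": [216], "japan": [220]},
    {"turkey": [307]},
]


def search_shard(shard, term):
    """What one slave would do: look the term up in its local shard."""
    return shard.get(term, [])


def master_search(term):
    """Fan the term out to every shard in parallel and merge the TIDs."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = list(pool.map(search_shard, SHARDS, [term] * len(SHARDS)))
    return sorted(tid for part in partials for tid in part)
```

For example, `master_search("china")` collects TIDs from the first two shards and returns them in sorted order.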

Processing Pipeline
docjoins → extraction → raw tables → filtering → good tables → labeling, annotation, munging (annotation servers, Daffie) → analyzed/cleaned tables → indexing → Hobo inverted index → querying → query processor

Recipe: Hobo Query Model
• Start with Google.com-style conjunction of disjunctions
• Add structural primitives: terms have attributes
• Introduce binding of variables to terms
• Impose binary relational constraints (½ cup)
• Mix bindings and constraints in arbitrary boolean expressions
• Serve and enjoy
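The starting point of the recipe, a conjunction of disjunctions, can be sketched as set operations over an inverted index. This is only an illustration: the `INDEX` contents and the `evaluate` function are assumptions, and it covers the first step only (no attributes, bindings or constraints).

```python
# Hypothetical inverted index: term -> set of table IDs containing it.
INDEX = {
    "china": {1, 2, 5},
    "prc": {3},
    "cn": {1, 3, 4},
    "us": {1, 4},
}


def evaluate(query, index):
    """query is a list of clauses; each clause is a list of alternative terms.

    A table matches if, for every clause, it contains at least one term.
    """
    result = None
    for clause in query:
        ids = set()
        for term in clause:          # disjunction: union of postings
            ids |= index.get(term, set())
        result = ids if result is None else result & ids  # conjunction
    return result if result is not None else set()


# (china OR prc) AND cn
MATCHES = evaluate([["china", "prc"], ["cn"]], INDEX)
```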

Query Model
and: x = “united”, y = “states”
where x.offset + 1 = y.offset
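A minimal sketch of how this offset constraint turns two bound terms into a phrase match. The postings and the `adjacent_pairs` helper are hypothetical; because `offset` is assumed here to be a position within a cell, the sketch also requires both terms to sit in the same cell, which the slide does not state.

```python
from typing import NamedTuple


class Posting(NamedTuple):
    table_id: int
    row: int
    col: int
    offset: int  # assumption: token position within the cell


# Hypothetical postings for the two terms.
POSTINGS = {
    "united": [Posting(0, 1, 0, 0), Posting(3, 2, 1, 4)],
    "states": [Posting(0, 1, 0, 1), Posting(3, 5, 0, 0)],
}


def adjacent_pairs(first, second):
    """Return bindings (x, y) with x.offset + 1 == y.offset in one cell."""
    hits = []
    for x in POSTINGS.get(first, []):
        for y in POSTINGS.get(second, []):
            same_cell = (x.table_id, x.row, x.col) == (y.table_id, y.row, y.col)
            if same_cell and x.offset + 1 == y.offset:
                hits.append((x, y))
    return hits
```

Only table 0 matches: in table 3 both words occur, but in different cells.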

Query Model
and: variables x, z, y over the terms “france”, “paris”, “germany”
where x.row = y.row and x.col = z.col
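A sketch of how bindings and row/column constraints could be checked per table. The binding used below (x = france, y = paris, z = germany) is an assumption chosen so that a country shares a row with its capital and a column with another country; the slide's diagram does not survive clearly enough to confirm it. `TABLES` and `match` are likewise hypothetical.

```python
from itertools import product
from typing import NamedTuple


class Cell(NamedTuple):
    row: int
    col: int


# Hypothetical data: table_id -> term -> cells where the term occurs.
TABLES = {
    1: {"france": [Cell(1, 0)], "paris": [Cell(1, 1)], "germany": [Cell(2, 0)]},
    2: {"france": [Cell(0, 0)], "paris": [Cell(3, 2)], "germany": [Cell(1, 1)]},
}


def match(terms, constraints):
    """terms: variable -> term; constraints: predicates over a binding dict."""
    names = list(terms)
    matched = []
    for table_id, cells in TABLES.items():
        candidates = [cells.get(terms[name], []) for name in names]
        for combo in product(*candidates):   # try every possible binding
            binding = dict(zip(names, combo))
            if all(check(binding) for check in constraints):
                matched.append(table_id)
                break                        # one satisfying binding is enough
    return matched


RESULT = match(
    {"x": "france", "y": "paris", "z": "germany"},
    [
        lambda b: b["x"].row == b["y"].row,
        lambda b: b["x"].col == b["z"].col,
    ],
)
```

Table 2 contains all three terms but is rejected because they are not laid out as the constraints require, which is exactly what a structure-blind keyword search cannot express.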

Query Model
• What attributes are currently available?
  – Physical: offset, col, row
  – Logical: source (header/body/context)
  – For ranking: size, pageRank, isDataTable, hasHeaders, …
  – Easy to add more!
• Fast (poly-time) constraint verifier

Query Languages

High-level template-based query language, example:

    “united states”    us
    china | prc        cn
    *                  to

After the parser and rewriter:

    ((("united states") (us))
     ((china | prc) (cn))
     ((_) (to)))

Low-level constraint-based query language:

    and {
      a = and {
        a = term { united }
        b = term { states }
        where a.pos + 1 = b.pos
      }
      b = or {
        term { china }
        term { prc }
      }
      c = us
      d = cn
      e = to
      where
        a.col == b.col
        c.col == d.col
        c.col == e.col
        a.row == c.row
        b.row == d.row
    }
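One way the rewriting step could work is sketched below: a template grid is turned into variable bindings plus row and column equality constraints. The `expand_template` function, its column-major naming and its treatment of `*` as an unbound wildcard are assumptions chosen to reproduce the constraints shown on the slide, not the actual rewriter.

```python
from string import ascii_lowercase


def expand_template(grid):
    """Turn a grid of template cells into (bindings, constraints) strings."""
    names = iter(ascii_lowercase)
    bound = {}  # (row, col) -> variable name, assigned column by column
    n_rows, n_cols = len(grid), len(grid[0])
    for col in range(n_cols):
        for row in range(n_rows):
            if grid[row][col] != "*":        # wildcard cells bind nothing
                bound[(row, col)] = next(names)

    bindings = [f"{name} = {grid[r][c]}" for (r, c), name in bound.items()]

    constraints = []
    for col in range(n_cols):                # tie each column to its first variable
        column = [bound[(r, col)] for r in range(n_rows) if (r, col) in bound]
        constraints += [f"{column[0]}.col == {v}.col" for v in column[1:]]
    for row in range(n_rows):                # tie each row to its first variable
        line = [bound[(row, c)] for c in range(n_cols) if (row, c) in bound]
        constraints += [f"{line[0]}.row == {v}.row" for v in line[1:]]
    return bindings, constraints


GRID = [
    ['"united states"', "us"],
    ["china | prc", "cn"],
    ["*", "to"],
]
BINDINGS, CONSTRAINTS = expand_template(GRID)
```

The sketch stops at naming cells; expanding a phrase such as "united states" into adjacent terms, or `china | prc` into an `or`, would be a further rewriting pass.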

Areas for Future Work
• Low-hanging performance fruit
  – O(n) constraint verification by ordering/hashing
  – Smarter concurrent iteration over inverted index
  – Query rewriting
  – More resources
• Soft constraints: not required, but used for ranking
• Frontend: richer data visualization
• Ranking of results
• Easy integration into Dataspaces
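As an illustration of the "O(n) constraint verification by ordering/hashing" item, the sketch below contrasts a nested-loop check of an equality constraint with a hashed one. It is one possible reading of that bullet; the `Posting` type and both functions are hypothetical.

```python
from typing import NamedTuple


class Posting(NamedTuple):
    row: int
    col: int


def satisfiable_nested(xs, ys):
    """Check x.row == y.row by comparing every pair: O(len(xs) * len(ys))."""
    return any(x.row == y.row for x in xs for y in ys)


def satisfiable_hashed(xs, ys):
    """Same check with a hash set of rows: O(len(xs) + len(ys)) expected."""
    rows_of_y = {y.row for y in ys}
    return any(x.row in rows_of_y for x in xs)


XS = [Posting(1, 0), Posting(4, 0)]
YS = [Posting(2, 1), Posting(4, 3)]
```

Hashing helps for equality constraints like `x.row = y.row`; an arithmetic constraint such as `x.offset + 1 = y.offset` can be handled the same way by probing for `x.offset + 1`.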

1Table Project TABLE SUGGEST

Synthetic Table Generation
What country corresponds to code “tr”?

Input:
    united states    us
    china            cn
    ?                tr

Output:
    united states    us
    china            cn
    turkey           tr
    japan            jp
    …                …
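A minimal sketch of how a missing value might be suggested from web tables: tables that already contain some of the user's example rows vote for the value they pair with the requested code. The toy `WEB_TABLES` and the voting rule are assumptions for illustration and are not described in the slides.

```python
from collections import Counter

# Hypothetical two-column web tables of (value, code) rows.
WEB_TABLES = [
    [("united states", "us"), ("china", "cn"), ("turkey", "tr")],
    [("china", "cn"), ("turkey", "tr"), ("japan", "jp")],
    [("turkey", "tr"), ("trinidad", "tt")],
    [("train", "tr"), ("bus", "bs")],  # unrelated table that reuses "tr"
]


def suggest(examples, code):
    """Rank candidate values for `code`, weighted by shared example rows."""
    votes = Counter()
    for table in WEB_TABLES:
        rows = set(table)
        support = sum(1 for example in examples if example in rows)
        if support == 0:
            continue  # table shares nothing with the user's rows: ignore it
        for value, row_code in table:
            if row_code == code:
                votes[value] += support
    return votes.most_common()


EXAMPLES = [("united states", "us"), ("china", "cn")]
SUGGESTIONS = suggest(EXAMPLES, "tr")
```

Requiring overlap with the example rows is what keeps the unrelated "train" table from contributing a suggestion.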

Challenges
• Inconsistent/inaccurate information
• Resolving data from multiple sources
• Ad-hoc semantics
• Data with nested (sub-cell) structure
  – .us (united states)
  – united states/us
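The two sub-cell examples above can be pulled apart with simple patterns. This is a sketch covering only those two shapes (a parenthesised gloss and a slash-separated pair); the `split_cell` function is hypothetical, and real cells would need many more cases.

```python
import re


def split_cell(cell):
    """Split a cell into its parts if it has a known nested shape."""
    text = cell.strip()
    # Shape 1: "head (gloss)", e.g. ".us (united states)"
    gloss = re.fullmatch(r"(.+?)\s*\((.+)\)", text)
    if gloss:
        return [gloss.group(1).strip(), gloss.group(2).strip()]
    # Shape 2: "a/b", e.g. "united states/us"
    if "/" in text:
        return [part.strip() for part in text.split("/")]
    return [text]
```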

TableSuggest Features
• Spreadsheet that suggests values to fill in
• Can draw data from _ and Google Sets, but primarily 1Table (Hobo)
• Hodgepodge of techniques (added ad hoc after inspecting results)
  – Type enumeration (_, Hobo)
  – Set expansion (Sets, Hobo)
  – Attribute resolution (Hobo)
  – Column clustering (1Table)
  – …

Areas for Future Work
• More principled evaluation
• Implementation infelicities
• Support for numeric queries using two-tier indexing structure with “range buckets”
• Richer sub-structure extraction (lists)
• Incremental indexing with live data feeds/sources
• Tailoring to specific domains
• Entity tables
• Aggregating values in denormalized tables
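The slides do not say how the "range buckets" would work. One plausible reading is sketched below: a first tier maps coarse buckets to postings, and a second step filters exact values inside the buckets a range touches. `BUCKET_WIDTH`, `build` and `range_query` are all hypothetical.

```python
from collections import defaultdict

BUCKET_WIDTH = 10  # hypothetical bucket size


def bucket_of(value):
    """First tier: the coarse bucket a numeric value falls into."""
    return int(value // BUCKET_WIDTH)


def build(entries):
    """Index (value, table_id) pairs by bucket."""
    index = defaultdict(list)
    for value, table_id in entries:
        index[bucket_of(value)].append((value, table_id))
    return index


def range_query(index, low, high):
    """Visit only the buckets overlapping [low, high], then filter exactly."""
    hits = []
    for bucket in range(bucket_of(low), bucket_of(high) + 1):
        for value, table_id in index.get(bucket, []):
            if low <= value <= high:         # second tier: exact check
                hits.append((value, table_id))
    return sorted(hits)


NUMERIC_INDEX = build([(3, 1), (12, 2), (19, 3), (25, 4), (40, 5)])
```

Here `range_query(NUMERIC_INDEX, 10, 25)` touches only buckets 1 and 2 rather than scanning every posting.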
