1Table: A System for Managing Structured Web Data
Yang Zhang
with: Alon Halevy, Mike Cafarella, Nodira Khoussainova, Eugene Wu and Daisy Zhe
Structured Web Data
- Web is more than just text
– tables, tags, lists, etc.
– 50% of pages have tables
– 25% of tables appear to be useful data tables (relational, entity, sets, etc.)
- No existing tools to effectively query this data
– RDBMSs don’t scale, process noisy data poorly
– Search engines are structure‐blind
- 1Table fills the gap!
[Chart: No tables / Other tables / Data tables]
The 1Table Project
- Schema Reconciliation
- Reference Reconciliation
- Data Visualization
- Table Search
- Synthetic Table Generation
HOBO: TABLE SEARCH
1Table Project
The Quest for Infrastructure
- _: limited indexing options, inefficient structure
- _: lots of hoops, unstructured
- _: little bang for the buck, slow setup, inefficient structure
- Wanted control over query model, ranking
Hobo: “poor man’s text search”
Challenges
- Millions of tables (~100M in Core)
- Noisy: many are not data tables (layout)
- Query by: attributes? values? similar examples?
- No structured metadata
Hobo
- Similar to traditional inverted index search
- Schema‐agnostic structured query model
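As a rough illustration of the inverted-index idea, the sketch below indexes table cells by token and records where each token occurs. The `Posting` fields, the `build_index` helper, and the toy tables are assumptions made for illustration; the slides do not describe Hobo's actual index layout, and treating `offset` as the token position within a cell is also an assumption.

```python
from collections import defaultdict
from typing import NamedTuple


class Posting(NamedTuple):
    """One occurrence of a token (hypothetical layout, not Hobo's)."""
    table_id: int
    row: int
    col: int
    offset: int  # assumed: token position within the cell


def build_index(tables):
    """Map each lowercased token to the list of places it occurs.

    tables: dict of table_id -> list of rows, each row a list of cell strings.
    """
    index = defaultdict(list)
    for table_id, rows in tables.items():
        for r, row in enumerate(rows):
            for c, cell in enumerate(row):
                for off, token in enumerate(cell.lower().split()):
                    index[token].append(Posting(table_id, r, c, off))
    return index


# Toy data for illustration only.
tables = {
    0: [["United States", "us"], ["China", "cn"]],
    1: [["France", "Paris"], ["Germany", "Berlin"]],
}
index = build_index(tables)
```

Because each posting carries row, column, and offset rather than any schema information, the same index can serve queries over arbitrary tables, which is one way to read "schema-agnostic".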
Hobo Query Processor
[Diagram: Master; Slave 0, Slave 1, ..., Slave 499 (shard slaves); GFS holding TID, Table, and Index shards 00000 through 00216]
Processing Pipeline
docjoins -> (extraction) -> raw tables -> (filtering) -> good tables -> (labeling, annotation, munging) -> analyzed/cleaned tables -> (indexing) -> Hobo inverted index -> (querying) -> query processor
Daffie annotation servers
Recipe: Hobo Query Model
- Start with Google.com-style conjunction of disjunctions
- Add structural primitives: terms have attributes
- Introduce binding of variables to terms
- Impose binary relational constraints (½ cup)
- Mix bindings and constraints in arbitrary boolean expressions
- Serve and enjoy
Query Model
“united states”: x = “united”, y = “states”
x and y where x.offset + 1 = y.offset
Query Model
x = “france”, y = “paris”, z = “germany”
x and y and z where x.row = y.row and x.col = z.col
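To make the binding-plus-constraint idea concrete, here is a minimal brute-force sketch. It reads the slide's example as x bound to "france", y to "paris", and z to "germany", with x and y in the same row and x and z in the same column. The `occurrences` and `matches` helpers and the toy table are hypothetical; this is not Hobo's query processor.

```python
from itertools import product

# Toy table for illustration; each cell is addressed by (row, col).
table = [
    ["france", "paris"],
    ["germany", "berlin"],
]


def occurrences(table, term):
    """All cell positions whose content equals the term."""
    return [
        {"row": r, "col": c}
        for r, row in enumerate(table)
        for c, cell in enumerate(row)
        if cell == term
    ]


def matches(table, terms, constraints):
    """Try every combination of occurrences and keep those that satisfy
    all constraints. terms: variable name -> term; constraints: predicates
    over a binding (variable name -> position)."""
    names = list(terms)
    candidates = [occurrences(table, terms[name]) for name in names]
    results = []
    for combo in product(*candidates):
        binding = dict(zip(names, combo))
        if all(check(binding) for check in constraints):
            results.append(binding)
    return results


constraints = [
    lambda b: b["x"]["row"] == b["y"]["row"],  # x.row = y.row
    lambda b: b["x"]["col"] == b["z"]["col"],  # x.col = z.col
]
found = matches(table, {"x": "france", "y": "paris", "z": "germany"}, constraints)
not_found = matches(table, {"x": "france", "y": "paris", "z": "berlin"}, constraints)
```

Enumerating all combinations is exponential in the number of variables, so this only shows what a match means, not how to find one efficiently.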
Query Model
- What attributes are currently available?
– Physical: offset, col, row
– Logical: source (header/body/context)
– For ranking: size, pageRank, isDataTable, hasHeaders, …
– Easy to add more!
- Fast (poly‐time) constraint verifier
Query Languages
High‐level template‐based query language example:
((("united states") (us)) ((china | prc) (cn)) ((_) (to)))
Low‐level constraint‐based query language:
and {
  a = and {
    a = term { united }
    b = term { states }
    where a.pos + 1 = b.pos
  }
  b = or {
    term { china }
    term { prc }
  }
  c = us
  d = cn
  e = to
  where
    a.col == b.col
    c.col == d.col
    c.col == e.col
    a.row == c.row
    b.row == d.row
}
Template as a grid (passed through the parser and rewriter):
  “united states”    us
  china | prc        cn
  *                  to
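The correspondence between the template and the low-level query can be sketched as a small rewriter: cells in the same template column must share a table column, and cells in the same template row must share a table row. The `rewrite` function, the `v0`, `v1`, ... variable names, and the use of `None` for a wildcard cell are assumptions for illustration; the actual parser and rewriter are not shown in the slides.

```python
def rewrite(template):
    """Turn a grid of cell patterns into variables plus row/col constraints.

    template: list of rows; each row is a list of pattern strings,
    with None standing for a wildcard cell (assumed convention).
    """
    variables = {}  # variable name -> pattern
    grid = []       # same shape as template, holding variable names or None
    for row in template:
        names = []
        for pattern in row:
            if pattern is None:
                names.append(None)
            else:
                name = f"v{len(variables)}"
                variables[name] = pattern
                names.append(name)
        grid.append(names)

    constraints = []
    # Same template column => same table column.
    width = max(len(names) for names in grid)
    for c in range(width):
        col_vars = [names[c] for names in grid
                    if c < len(names) and names[c] is not None]
        for other in col_vars[1:]:
            constraints.append(f"{col_vars[0]}.col == {other}.col")
    # Same template row => same table row.
    for names in grid:
        row_vars = [n for n in names if n is not None]
        for other in row_vars[1:]:
            constraints.append(f"{row_vars[0]}.row == {other}.row")
    return variables, constraints


template = [
    ['"united states"', "us"],
    ["china | prc", "cn"],
    [None, "to"],
]
variables, constraints = rewrite(template)
```

For this template the sketch yields five constraints (one for the first column, two for the second, and one per fully specified row), the same count and shape as the `where` clause of the low-level query, up to variable naming.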
Demo!
Areas for Future Work
- Low‐hanging performance fruit
– O(n) constraint verification by ordering/hashing
– Smarter concurrent iteration over inverted index
– Query rewriting
– More resources
- Soft constraints: not required, but use for ranking
- Frontend: richer data visualization
- Ranking of results
- Easy integration into Dataspaces
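One way to read "constraint verification by hashing" is a hash join: instead of comparing every pair of candidate occurrences, bucket one side by the attribute the constraint tests and probe with the other. The sketch below is an assumption about what such an optimization could look like, not the planned implementation; `hash_join` and the toy postings are hypothetical.

```python
from collections import defaultdict


def hash_join(xs, ys, key_x, key_y):
    """Pairs (x, y) with key_x(x) == key_y(y).

    Expected time is linear in len(xs) + len(ys) plus the number of
    matching pairs, versus len(xs) * len(ys) for a nested loop.
    """
    buckets = defaultdict(list)
    for y in ys:
        buckets[key_y(y)].append(y)
    return [(x, y) for x in xs for y in buckets.get(key_x(x), [])]


# Toy occurrences of two terms (illustrative data only).
united = [{"row": 0, "col": 0, "offset": 0}, {"row": 3, "col": 1, "offset": 4}]
states = [{"row": 0, "col": 0, "offset": 1}, {"row": 5, "col": 0, "offset": 1}]

# x.offset + 1 = y.offset, restricted to the same cell.
adjacent = hash_join(
    united, states,
    key_x=lambda x: (x["row"], x["col"], x["offset"] + 1),
    key_y=lambda y: (y["row"], y["col"], y["offset"]),
)

# x.row = y.row
same_row = hash_join(united, states, lambda x: x["row"], lambda y: y["row"])
```

This handles equality constraints only; the "ordering" half of the bullet would presumably correspond to a merge over sorted postings, which is not shown here.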
TABLE SUGGEST
1Table Project
Synthetic Table Generation
Input (partial table):
  united states    us
  china            cn
  ?                tr
Output (synthesized table):
  united states    us
  china            cn
  turkey           tr
  japan            jp
  ...              …
What country corresponds to code “tr”?
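A minimal sketch of how the "tr" question might be answered from a corpus of tables: find source tables whose columns agree with the rows the user already filled in, then let those tables vote on the missing value. Majority voting, the `column_pairs` and `suggest` helpers, and the toy source tables are all assumptions for illustration, not the method the project uses.

```python
from collections import Counter


def column_pairs(table, examples):
    """Ordered column pairs (i, j) where some row holds a known
    (left, right) example in columns i and j. Assumes rectangular tables."""
    width = len(table[0]) if table else 0
    return [
        (i, j)
        for i in range(width)
        for j in range(width)
        if i != j and any((row[i], row[j]) in examples for row in table)
    ]


def suggest(source_tables, examples, known):
    """Vote on the left-hand value that pairs with the known right-hand value."""
    votes = Counter()
    for table in source_tables:
        for left_col, right_col in column_pairs(table, examples):
            for row in table:
                if row[right_col] == known:
                    votes[row[left_col]] += 1
    return votes.most_common()


# Toy source tables (illustrative data only).
sources = [
    [["united states", "us"], ["turkey", "tr"], ["japan", "jp"]],
    [["cn", "china"], ["tr", "turkey"]],   # same relation, columns swapped
    [["tr", "table row"], ["td", "table cell"]],  # unrelated use of "tr"
]
examples = {("united states", "us"), ("china", "cn")}
ranked = suggest(sources, examples, "tr")
```

Requiring agreement with the existing rows filters out tables that merely contain the string "tr", and counting votes across sources is one simple response to inconsistent or inaccurate data.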
Challenges
- Inconsistent/inaccurate information
- Resolving data from multiple sources
- Ad‐hoc semantics
- Data with nested (sub‐cell) structure
– .us (united states)
– united states/us
TableSuggest Features
- Spreadsheet that suggests values to fill in
- Can draw data from _ and Google Sets, but primarily 1Table (Hobo)
- Hodgepodge of techniques (added in an ad hoc manner after inspecting results)
– Type enumeration (_, Hobo)
– Set expansion (Sets, Hobo)
– Attribute resolution (Hobo)
– Column clustering (1Table)
– …
Demo!
Areas for Future Work
- More principled evaluation
- Implementation infelicities
- Support for numeric queries using two‐tier indexing structure with “range buckets”
- Richer sub‐structure extraction (lists)
- Incremental indexing with live data feeds/sources
- Tailoring to specific domains
- Entity tables
- Aggregating values in denormalized tables
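The "range buckets" item above is not elaborated in the slides. One plausible reading, sketched here purely as an assumption, is a two-tier lookup: the first tier maps a coarse bucket id to the numeric cells that fall in it, and the second tier filters those candidates by exact value. The `RangeBucketIndex` class, its fixed bucket width, and the cell ids are hypothetical.

```python
import math
from collections import defaultdict


class RangeBucketIndex:
    """Fixed-width buckets over numeric values (assumed design)."""

    def __init__(self, width):
        self.width = width
        self.buckets = defaultdict(list)  # bucket id -> [(value, cell_id)]

    def add(self, value, cell_id):
        self.buckets[math.floor(value / self.width)].append((value, cell_id))

    def query(self, lo, hi):
        """Cell ids whose value lies in the closed range [lo, hi]."""
        first = math.floor(lo / self.width)
        last = math.floor(hi / self.width)
        hits = []
        for bucket_id in range(first, last + 1):        # tier 1: coarse buckets
            for value, cell_id in self.buckets.get(bucket_id, []):
                if lo <= value <= hi:                   # tier 2: exact check
                    hits.append(cell_id)
        return hits


idx = RangeBucketIndex(width=10)
for value, cell_id in [(3, "a"), (12, "b"), (19, "c"), (25, "d")]:
    idx.add(value, cell_id)
in_range = idx.query(10, 20)
```

In an inverted index, the bucket id could be stored as an ordinary term, so a numeric range query would become a disjunction over a few bucket terms followed by a filter; whether that matches the intended design is not stated in the slides.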