Data Integration
Sam Birch & Alex Leblang
Two faces of data integration
Businesses
○ Have relatively more structured databases which they need to organize
Research on integrating less structured data
○ Databases coming from different organizations without common architecture
Businesses want to control and access their
According to Bill Inmon, a data warehouse means:
linked;
data in a consistent way;
and load it into another location, usually a data warehouse
data integration without the need for maintaining one single data warehouse
database systems into a single database
the advantage of possible geographic distribution
coupled
component to construct their own schema
○ ...forces the user to have knowledge of the schema when using the database
processes to create a schema used across the federated database
○ …shifts much of the work from the user or DBA to the software itself
your database
○ Timestamps on rows
○ Version number on rows
○ Triggers on tables
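As a rough illustration of the first option, timestamps on rows, the sketch below pulls only the rows modified since the last extraction. The table `customers`, its `updated_at` column, the helper `changed_since`, and the sample rows are all hypothetical, and SQLite simply stands in for whatever database is actually in use.

```python
import sqlite3

def changed_since(conn, last_sync):
    """Return rows whose (hypothetical) updated_at column is newer than last_sync."""
    cur = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_sync,),
    )
    return cur.fetchall()

# Made-up sample data for the illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "Ada", "2014-01-01T09:00:00"),
        (2, "Grace", "2014-01-03T12:30:00"),
        (3, "Edsger", "2014-01-05T08:15:00"),
    ],
)

# ISO-8601 strings sort chronologically, so plain string comparison works here.
changes = changed_since(conn, "2014-01-02T00:00:00")
```

A version number on each row would work the same way, with the filter comparing against the last version seen instead of the last timestamp.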
The 4 Vs (according to Dong)
○ large Volume of sources
○ changing at a high Velocity
○ as well as a huge Variety of sources
○ with many questions regarding data Veracity
Dong et al.
○ Identify domain-specific modeling
○ Identify similarities between schema attributes
○ Specify how to map records in different schemas
content
○ Voting
○ Source Quality
○ Copy Detection
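A minimal sketch of the first two ideas, assuming that voting means taking the value most sources agree on and that source quality can be expressed as a per-source weight. The function `fuse_by_vote`, the claim format, and the sample claims are hypothetical; copy detection is not covered here.

```python
from collections import defaultdict

def fuse_by_vote(claims, source_weight=None):
    """claims: iterable of (source, item, value). Returns {item: winning value}.

    Each source casts one vote per claim; source_weight (assumed, optional)
    scales a source's vote, and unlisted sources default to weight 1.0.
    """
    source_weight = source_weight or {}
    scores = defaultdict(lambda: defaultdict(float))
    for source, item, value in claims:
        scores[item][value] += source_weight.get(source, 1.0)
    # On a tie, max() keeps the value that was seen first.
    return {item: max(votes, key=votes.get) for item, votes in scores.items()}

# Made-up claims from three sources about one data item.
claims = [
    ("s1", "flight_49_gate", "A12"),
    ("s2", "flight_49_gate", "A12"),
    ("s3", "flight_49_gate", "B7"),
]
plain = fuse_by_vote(claims)
weighted = fuse_by_vote(claims, {"s1": 0.2, "s2": 0.2, "s3": 0.9})
```

With equal weights the majority value wins; once the weights favor `s3`, its single claim outvotes the other two sources.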
http://research.cs.wisc.edu/dibook/ Chapter 3: Data Source
○ organization of tables
○ naming of schemas
○ data-level representation
University CIT”
“tim_kraska@brown.edu”
Republican Army?
photos of the same place
“[The] problem of identifying and linking/grouping different manifestations of the same real world object.”
Getoor, 2012.
Ironically, AKA: deduplication, entity clustering, merge/purge, fuzzy match, record linkage, approximate match...
truncation, ambiguity)
likely to match other similar data
○ Splitting / combining rows
case of text, expanding truncations)
○ Maximally informative, but standard format
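For text fields, normalization of this kind might look like the sketch below: lowercase the string, drop punctuation, collapse whitespace, and expand known truncations. The `normalize` function and the `ABBREVIATIONS` table are assumptions made for the example, not something taken from the slides.

```python
import re

# Hypothetical abbreviation table; a real one would be domain specific.
ABBREVIATIONS = {"univ": "university", "dept": "department", "st": "street"}

def normalize(text):
    """Lowercase, drop punctuation, collapse whitespace, expand known truncations."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

cleaned = normalize("Brown Univ., Dept. of  Computer Science")
```

Two differently typed variants of the same name end up as the same standard string, which is what makes later matching easier.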
Raw data → Normalized data → Matching features
dot-product) for
to datum
○ May also require schema alignment
○ Training data not trivial: most pairs are obviously not matches
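A pairwise comparison can be sketched with two common token-based similarity measures: Jaccard overlap and cosine similarity (a normalized dot product). The function names and the `threshold` value are assumptions for illustration; a trained classifier over such features would normally replace the fixed threshold.

```python
import math
from collections import Counter

def jaccard(a, b):
    """Share of distinct tokens the two strings have in common."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def cosine(a, b):
    """Dot product of token-count vectors, divided by their lengths."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0

def is_match(a, b, threshold=0.5):
    """Hypothetical decision rule: call it a match above a fixed Jaccard score."""
    return jaccard(a, b) >= threshold

score = jaccard("big data integration tutorial", "data integration tutorial")
```

Because most candidate pairs share no tokens at all, a cheap measure like this is also a plausible way to discard obvious non-matches before building training data.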
independent of the other data in the record
○ e.g. two research papers in the same venue are more likely to be by the same authors
relationships of columns in a record
○ Transitivity (if A = B, and B = C then A = C)
○ Exclusivity (if A = B then B != C)
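The transitivity constraint can be enforced by taking the transitive closure of the pairwise matches. The sketch below does this with a small union-find structure; the function `cluster` and the record labels are hypothetical, and the exclusivity constraint is not modeled.

```python
def cluster(records, matched_pairs):
    """Group records so that A = B and B = C puts A, B, and C together."""
    parent = {r: r for r in records}

    def find(x):
        # Follow parent links to the representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for r in records:
        groups.setdefault(find(r), set()).add(r)
    # Sort clusters by their sorted members to get a stable order.
    return sorted(groups.values(), key=lambda g: sorted(g))

clusters = cluster(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
```

Here A and C land in the same cluster even though they were never compared directly, which is exactly what transitivity asks for.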
tables and from that found 154 million that they considered to contain high-quality relational data
Cafarella et al
statistics database
in the corpus”
○ schema auto-complete
○ attribute synonym finding
○ join-graph traversal
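One plausible reading of schema auto-complete, assuming the statistics database records which attributes appear together in the extracted schemas, is to suggest the attributes that most often co-occur with the ones a user has already chosen. The `suggest` function and the toy corpus below are hypothetical and only illustrate the idea, not the paper's actual algorithm.

```python
from collections import Counter

def suggest(schemas, given, k=3):
    """Rank attributes by how often they co-occur with all attributes in `given`."""
    given = set(given)
    counts = Counter()
    for schema in schemas:
        attrs = set(schema)
        if given <= attrs:  # schema contains every attribute chosen so far
            counts.update(attrs - given)
    # Most frequent first; ties broken alphabetically for a stable result.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [attr for attr, _ in ranked][:k]

# Made-up corpus of table schemas.
schemas = [
    ["make", "model", "year", "price"],
    ["make", "model", "price"],
    ["make", "model", "year"],
    ["name", "address", "phone"],
]
suggestions = suggest(schemas, ["make", "model"], k=2)
```

The same co-occurrence counts could plausibly feed attribute synonym finding as well, since synonyms tend to appear with similar attributes but rarely with each other.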
Extracting Tabular Data on the Web
The VLDB 2013 paper discusses the idea of row classes, which give a more flexible way of determining the table schema
Adelfio et al
specialized systems and integration
combining very large amounts of disparate data
Tutorial in Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2013
Resolution: Theory, Practice & Open Challenges, PVLDB 5(12): 2018-2019 (2012)
Eugene Wu, Yang Zhang: WebTables: exploring the power of tables on the web. PVLDB 1(1): 538-549 (2008)
Tabular Data on the Web, In International Conference