Data Integration Sam Birch & Alex Leblang Two faces of data - - PowerPoint PPT Presentation

data integration
SMART_READER_LITE
LIVE PREVIEW

Data Integration Sam Birch & Alex Leblang Two faces of data - - PowerPoint PPT Presentation

Data Integration Sam Birch & Alex Leblang Two faces of data integration Businesses Have relatively more structured databases which they need to organize Research on integrating less structured data Databases coming from


slide-1
SLIDE 1

Data Integration

Sam Birch & Alex Leblang

slide-2
SLIDE 2

Two faces of data integration

  • Businesses

○ Have relatively more structured databases which they need to organize

  • Research on integrating less structured data

○ Databases coming from different organizations without common architecture

slide-3
SLIDE 3

Businesses have a lot of data …in many different databases

slide-4
SLIDE 4

One safe place: the data warehouse

Businesses want to control and access their

  • perational data in a single place:
  • 1. Backups and versioning
  • 2. Single query interface
  • 3. Quality & consistency
  • 4. Separation of analytic and
  • perational workloads
slide-5
SLIDE 5

Data warehouse design

Bill Inmon, a data warehouse means:

  • 1. All data about a single real-world thing are

linked;

  • 2. Data are never over-written or deleted;
  • 3. Comprises all (or nearly all) an organization's

data in a consistent way;

  • 4. Comprises all versions of the data in the
  • perational system;
  • 5. Not virtualized
slide-6
SLIDE 6

Update methods

  • 1. ETL
  • 2. Virtual / federated databases
  • 3. Change data capture
slide-7
SLIDE 7

ETL

  • Extract, Transform, Load
  • Take data from one data source, transform it,

and load it into another location, usually a data warehouse

  • Generally periodic (hourly, daily…)
slide-8
SLIDE 8

Virtual (federated) DBs

  • A method for integrating data virtually
  • No actual (physical) data integration
  • Virtual database systems give you the feel of

data integration without the need for maintaining one single data warehouse

slide-9
SLIDE 9

How they work

  • Federated databases map autonomous

database systems into a single database

  • This is done over a computer network and has

the advantage of possible geographic distribution

  • Federated databases can be loosely or tightly

coupled

slide-10
SLIDE 10

Loosely / Tightly Coupled

  • Loosely coupled databases require each

component to construct their own schema

○ ...forces the user to have knowledge of the schema when using the database

  • Tightly coupled databases use independent

processes to create a schema used across the federated database

○ …removes much of the work from the user or DBA to the software itself

slide-11
SLIDE 11

Change data capture

  • Keep track of diffs, like version control for

your database

  • Helpful for large data with smaller changes
  • Different Implementations:

○ Timestamps on rows ○ Version number on rows ○ Triggers on tables

slide-12
SLIDE 12

Reforming unstructured data

slide-13
SLIDE 13

Large and unstructured

The 4 Vs (according to Dong)

○ large Volume of sources ○ changing at a high Velocity ○ as well as a huge Variety of sources ○ with lots of question regarding data Veracity

Dong et al.

slide-14
SLIDE 14

Goals

  • Schema alignment
  • Record linkage
  • Data fusion
slide-15
SLIDE 15

Schema Alignment

Dong et al

slide-16
SLIDE 16

Schema Alignment

  • Mediated Schema

○ Identify domain specific modeling

  • Attribute Matching

○ Identify similarities between schema attributes

  • Schema Mapping

○ Specify how to specifically map records in different schemas

Dong et al

slide-17
SLIDE 17

Record Linkage

Dong et al

slide-18
SLIDE 18

Record Linkage

Dong et al

slide-19
SLIDE 19

Data Fusion

Dong et al

slide-20
SLIDE 20

Data Fusion

Dong et al

slide-21
SLIDE 21

Data Fusion

  • Reconciliation of conflicting non-identifying

content

○ Voting ○ Source Quality ○ Copy Detection

Dong et al

slide-22
SLIDE 22

Dealing With Different Data Sources

  • Semantic Heterogeneity
  • Access Pattern Limitations
  • Integrity Constraints
  • Data-level Heterogeneity

http://research.cs.wisc.edu/dibook/ Chapter 3: Data Source

slide-23
SLIDE 23

Semantic Heterogeneity

  • Data integration can suffer from many issues
  • Differences in:

○ organization of tables ○ naming of schemas ○ data-level representation

http://research.cs.wisc.edu/dibook/ Chapter 3: Data Source

slide-24
SLIDE 24

Data-level Heterogeneity

  • “115 Waterman St. 02912” / “Brown

University CIT”

  • “Tim Kraska” / “Timothy Kraska” /

“tim_kraska@brown.edu”

  • IRA: Individual Retirement Account or Irish

Republican Army?

  • Arbitrarily hard: different descriptions /

photos of the same place

slide-25
SLIDE 25

Entity resolution (ER)

“[The] problem of identifying and linking/grouping different manifestations of the same real world object.”

Getoor, 2012.

Ironically, AKA: deduplication, entity clustering, merge/purge, fuzzy match, record linkage, approximate match...

slide-26
SLIDE 26

Motivating examples

  • Mining unstructured data (e.g. webpages)
  • Governance (census, intelligence)
  • Generally, when data comes from different
  • rganizations

Getoor, 2012.

slide-27
SLIDE 27

ER Challenges

  • Fundamental ambiguity
  • Diversity in representations (format,

truncation, ambiguity)

  • Errors
  • Missing data
  • Records from different times
  • Relationships in addition to equality

Getoor, 2012.

slide-28
SLIDE 28

Normalization

  • Transform data into a format which is more

likely to match other similar data

○ Splitting / combining rows

  • Canonicalization (e.g. phone numbers, URLs,

case of text, expanding truncations)

○ Maximally informative, but standard format

  • Logic is specific to data

Getoor, 2012.

slide-29
SLIDE 29

Pairwise matching

Raw data Normalized data Matching features

Getoor, 2012.

slide-30
SLIDE 30

Matching features

  • Edit distance (e.g. Levenstein) for typos
  • Set/vector similarity (Jaccard index, TF/IDF,

dot-product) for

  • Alignment (e.g. Monge-Elkan)
  • Phonetic (e.g. Soundex)

Getoor, 2012.

slide-31
SLIDE 31

Record linkage

  • Matching record to record rather than datum

to datum

○ May also require schema alignment

  • Average of component similarities
  • Specific rules about each column
  • Probabilistic models with ML

○ Training data not trivial: most pairs are obviously not matches

Getoor, 2012.

slide-32
SLIDE 32

Collective matching and constraints

  • Some data matching operations aren’t

independent of the other data in the record

○ e.g. two research papers in the same venue are more likely to be by the same authors

  • Expressed in constraints over the matching

relationships of columns in a record

○ Transitivity (if A = B, and B = C then A = C) ○ Exclusivity (if A = B then B != C)

Getoor, 2012.

slide-33
SLIDE 33

Getoor, 2012.

slide-34
SLIDE 34

AscDB

  • Developed at Google
  • The authors looked at 14.1 billion HTML

tables and from that found 154 million that they considered to contain high quality relational data

  • Work was done in 2008

Cafarella et al

slide-35
SLIDE 35

AscDB

  • The authors created the attribute correlation

statistics database

  • AscDB is “a set of statistics about the schemas

in the corpus”

Cafarella et al

slide-36
SLIDE 36

AscDB

  • AscBD makes possible:

○ schema auto-complete ○ attribute synonym finding ○ join-graph traversal

Cafarella et al

slide-37
SLIDE 37

ASCDb

Cafarella et al

slide-38
SLIDE 38

ASCDb

Cafarella et al

slide-39
SLIDE 39

2013 Follow Up

Extracting Tabular Data on the Web VLDB 2013 paper discusses the idea of row classes that have a more flexible method towards determining the table schema

Adelfio et al

slide-40
SLIDE 40

Conclusion

  • For businesses there are tradeoffs between

specialized systems and integration

  • Lots of research is being done involving

combining very large amounts of disparate data

slide-41
SLIDE 41

References

  • Luna Dong and Divesh Srivastava, Big Data Integration,

Tutorial in Proceedings of the IEEE International Conference on Database Engineering (ICDE), 2013

  • Lise Getoor, Ashwin Machanavajjhala, Entity

Resolution: Theory, Practice & Open Challenges, PVLDB 5(12): 2018-2019 (2012)

  • Michael J. Cafarella, Alon Y. Halevy, Daisy Zhe Wang,

Eugene Wu, Yang Zhang: WebTables: exploring the power of tables on the web. PVLDB 1(1): 538-549 (2008)

  • Marco D. Adelfio, Hanan Samet, Schema Extraction for

Tabular Data on the Web, In International Conference

  • n Very Large Data Bases (VLDB), 2013
  • http://research.cs.wisc.edu/dibook/