Программирование методов
разрешения сущностей и слияния данных
при реализации ETL в среде Hadoop
Вовченко А.Е., Калиниченко Л.А., Ковалев Д.Ю. alexey.vovchenko@gmail.com Институт Проблем Информатики РАН (ИПИ РАН) RCDL’2014, 13/10/2014
ETL Hadoop - - PowerPoint PPT Presentation
ETL Hadoop . . , ..,
Вовченко А.Е., Калиниченко Л.А., Ковалев Д.Ю. alexey.vovchenko@gmail.com Институт Проблем Информатики РАН (ИПИ РАН) RCDL’2014, 13/10/2014
Introduction: ETL Entity Resolution (ER) Data Fusion (DF) ETL+BigData, Jaql+HIL An HIL-based example of ER + DF
Schema Mapping Data Transformation Entity Resolution Data Fusion
Source A Source B
<pub> <Titel> Federated Database Systems </Titel> <Autoren> <Autor> Amit Sheth </Autor> <Autor> James Larson </Autor> </Autoren> </pub> <publication> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <author> Scheth & Larson </author> <year> 1990 </year> </publication>
Schema Mapping Data Transformation Entity Resolution Data Fusion
Source A Source B
<pub> <Titel> Federated Database Systems </Titel> <Autoren> <Autor> Amit Sheth </Autor> <Autor> James Larson </Autor> </Autoren> </pub> <publication> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <author> Scheth & Larson </author> <year> 1990 </year> </publication> <pub> <title> </title> <Autoren> <author> </author> <author> </author> </Autoren> <year> </year> </pub>
Schema Mapping Schema Integration
Schema Mapping Data Transformation Entity Resolution Data Fusion
Source A Source B
<pub> <Titel> Federated Database Systems </Titel> <Autoren> <Autor> Amit Sheth </Autor> <Autor> James Larson </Autor> </Autoren> </pub> <publication> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <author> Scheth & Larson </author> <year> 1990 </year> </publication> <pub> <title> Federated Database Systems </title> <Autoren> <author> Amit Sheth </author> <author> James Larson </author> </Autoren> </pub> <pub> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <Autoren> <author> Scheth & Larson </author> </Autoren> <year> 1990 </year> </pub>
XQuery XQuery Transformation queries or views
Schema Mapping Data Transformation Entity Resolution Data Fusion
Source A Source B
<pub> <Titel> Federated Database Systems </Titel> <Autoren> <Autor> Amit Sheth </Autor> <Autor> James Larson </Autor> </Autoren> </pub> <publication> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <author> Scheth & Larson </author> <year> 1990 </year> </publication> <pub> <title> Federated Database Systems </title> <Autoren> <author> Amit Sheth </author> <author> James Larson </author> </Autoren> </pub> <pub> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <Autoren> <author> Scheth & Larson </author> </Autoren> <year> 1990 </year> </pub>
Schema Mapping Data Transformation Entity Resolution Data Fusion
Source A Source B
<pub> <title> Federated Database Systems </title> <Autoren> <author> Amit Sheth </author> <author> James Larson </author> </Autoren> </pub> <pub> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <Autoren> <author> Scheth & Larson </author> </Autoren> <year> 1990 </year> </pub> <pub> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <Autoren> <author> Amit Sheth </author> <author> James Larson </author> </Autoren> <year> 1990 </year> </pub>
Preserve lineage
Introduction: ETL Entity Resolution (ER) Data Fusion (DF) ETL+BigData, Jaql+HIL An HIL-based example of ER + DF
Problem of identifying and linking/grouping different manifestations of the same real world object. Examples of manifestations and objects:
Different ways of addressing (names, email addresses,
FaceBook accounts) the same person in text.
Web pages with differing descriptions of the same
business.
Different photos of the same object. …
Name/Attribute ambiguity Errors due to data entry Missing Values Changing Attributes Data formatting Abbreviations / Data Truncation
Thomas Cruise
Let consider Data already Prepared Schema normalization Data normalization Similarity Pairwise ER Determining whether or not a pair of records match
Equality on a boolean
predicate
Edit distance
Levenstein, Smith‐Waterman,
Affine
Set similarity
Jaccard, Dice
Vector Based
Cosine similarity, TFIDF
Alignment‐based or
Two‐tiered
Jaro‐Winkler, Soft‐TFIDF
, Monge‐Elkan
Phonetic Similarity
Soundex
Translation‐based Numeric distance between
values
Domain‐specific
Useful packages
SecondString, http://secondstring.sourceforge.net/ Simmetrics: http://sourceforge.net/projects/simmetrics/ LingPipe, http://alias‐i.com/lingpipe/index.html
Handle Typographical errors
Good for Text like reviews/ tweets Useful for Abbreviations and alternate names.
Relational features are often set‐based
Set of coauthors for a paper Set of cities in a country Set of products manufactured by manufacturer
Can use set similarity functions mentioned earlier
Common Neighbors: Intersection size Jaccard’s Coefficient: Normalize by union size Adar Coefficient: Weighted set similarity
Can reason about similarity in sets of values
Average or Max Other aggregates
Problem: Given a vector of component‐wise similarities for a pair of records (x,y), compute P(x and y match). Solutions:
Weighted sum or average of component‐wise similarity
0.5*1st‐author‐match‐score + 0.2*venue‐match‐score +
0.3*paper‐match‐score.
Hard to pick weights. Hard to tune a threshold.
Formulate rules about what constitutes a match.
(1st‐author‐match‐score > 0.7 AND venue‐match‐score > 0.8)
OR (paper‐match‐score > 0.9 AND venue‐match‐score > 0.9)
Manually formulating the right set of rules is hard.
r = (x,y) is record pair, γ is comparison vector, M
matches, U nonmatches
Decision rule R > t r Match R < t r Non-Match
Introduction: ETL Entity Resolution (ER) Data Fusion (DF) ETL+BigData, Jaql+HIL An HIL-based example of ER + DF
Schema Matching: Same attribute semantics
Duplicate detection: Same real-world entities
Data Fusion: Resolve uncertainties and contradictions
Extensional completeness Intensional completeness Intensional conciseness Extensional conciseness
Problem
Given a duplicate, create a single object representation
while resolving conflicting data values.
Difficulties
Null values: Subsumption and complementation Contradictions in data values Uncertainty & truth: Discover the true value and model
uncertainty in this process
Metadata: Preferences, recency, correctness Lineage: Keep original values and their origin Implementation in DBMS: SQL, extended SQL, UDFs, etc.
Data Fusion Operators Resolution strategies Conflict types
Ignorance Consistent answers Uncertainty Contradiction Avoidance Resolution Instance-based Metadata-based
Resolution functions
Possible worlds Join-based Union-based Instance-based Metadata-based Complementation Advanced functions Aggregation Subsumption
Function Description Examples Min, Max, Sum, Count, Avg Standard aggregation NumChildren, Salary, Height Random Random choice Shoe size Longest, Shortest Longest/shortest value First_name Choose(source) Value from a particular source DoB (DMV), CEO (SEC) ChooseDepending(val, col) Value depends on value chosen in other column city & zip, e-mail & employer Vote Majority decision Rating Coalesce First non-null value First_name Group, Concat Group or concatenate all values Book_reviews MostRecent Most recent (up-to-date) value Address MostAbstract, MostSpecific, CommonAncestor Use a taxonomy / ontology Location Escalate Export conflicting values gender … … …
MostRecent MostAbstract MostSpecific
conflict ignorance conflict avoidance conflict resolution conflict resolution strategies instance based instance based metadata based metadata based deciding mediating deciding mediating
Coalesce ChooseDepending Concat AVG, SUM MIN, MAX Random Vote Choose Escalate CommonAncestor
a, b, c a, b, c, d Source 1(A,B,C) a, b, d Source 2(A,B,D) a, b, c, - a, b, -, d a, b, - a, b, -, - a, b, - a, b, -, - a, b, -, - a, b, c a, f(b,e), c, d a, e, d a, b, c, - a, e, -, d a, b, c a, b, c, - a, b, - a, b, c, - a, b, -, - Identical tuples Subsumed tuples Conflicting tuples Complementing tuples
Identical tuples
UNION, OUTER UNION
Subsumed tuples (uncertainty)
MINIMUM UNION
Complementing tuples (uncertainty)
COMPLEMENT UNION, MERGE
Conflicting tuples (contradiction)
Relational approaches: Match, Group, Fuse, …
Other approaches
Possible worlds, probabilistic answers, consistent
answers
Union: Elimination
Minimum Union:
Elimination of subsumed tuples
Outer union Subsumption A B C a b c e f g m n
B D a b e f h m p A B C D a b c a b e f g e f h m n
m p
+
R A tuple t1 subsumes a tuple t2, if it has same schema, has less NULL-values, and coincides in all non-NULL-values.
A B C D a b c e f g e f h m n
m p
Represents all
Full outer join on all common
attributes
All combinations for more than two sources
Minimum union over results
A B C a b c e f g k
m A B D a b e f h m p k q r A B C D a b c e f g h m p k
k m k q r
|⋈|
R
A B C D a b c e f g h m p k
k m k q r
Elimination of
complementing tuples
Outer union Complementation
No known SQL
rewriting
A B C a b c e f g m n
B D a b e f h m p A B C D a b c a b e f g e f h m n
m p
+
R⇅ A tuple t1 complements a tuple t2, if it has same schema and coincides in all non-NULL-values.
A B C D a b c e f g h m n
m p
Includes duplicate removal and subsumption
WITH OU AS ( ( SELECT A, B, C, NULL AS D FROM U1 ) UNION (ALL) ( SELECT A, B, NULL AS C, D FROM U2 ) ), SELECT A, MAX(B), MIN(C), SUM(D) FROM OU GROUP BY A
Outer union then group by real-world ID Aggregate all other columns using conflict resolving
aggregate function
Efficient implementations Catches inter- and intra-source duplicates Restricted to built-in
aggregate-functions
MAX, MIN, AVG,
VAR, STDDEV, SUM, COUNT
SELECT ID, RESOLVE(Title, Choose(IMDB)), RESOLVE(Year, Max), RESOLVE(Director), RESOLVE(Rating), RESOLVE(Genre, Concat) FUSE FROM IMDB, Filmdienst FUSE BY (ID) ON ORDER Year DESC
SQL extensions to resolve uncertainties and contradictions FUSE FROM implies OUTER UNION
Removes subsumed and duplicate tuples by default
FUSE BY declares real-world ID RESOLVE specifies conflict resolution function from catalog
Default: COALESCE
Implemented on top of relational DBMS “XXL”
Introduction: ETL Entity Resolution (ER) Data Fusion (DF) ETL+BigData, Jaql+HIL An HIL-based example of ER + DF
Hadoop framework consists on two main layers
Distributed file system (HDFS) Execution engine (MapReduce)
Provide a simple, yet powerful language to manipulate semi-
structured data.
Use JSON as a data model Data is usually converted to/from JSON view Most data has a natural JSON representation
Easily extended using Java, Python, … Exploit massive parallelism using Hadoop
Unix Jaql cat var -> merge join join grep filter cut, paste, sed, tr map sort sort head top uniq distinct sort sort > filename write tee tee
Hive
Good for relational data Sql like syntax with UDF
Pig
Good for complex data (objects) Good for small scripts (UDF)
Jaql
God for complex and semi-
structured data (for example for text analysis)
Modular structure Functions allow to build complex
programs
Jaql combines both approaches
SQL syntax (like Hive) Jaql data flow style for semi-
structured data (like Pig) Pig Hive Jaql Developed Yahoo! Facebook IBM Language Pig Latin HiveQL Jaql Type Data flow Declarative (sql-like) Data flow & sql-like Data structures Complex structures structure Semi-structured, JSON Data Schema Not required required Not required Turing Compleetness Yes, but with UDFs Yes, but with UDFs Yes source sink
HIL captures multiple stages together in one
Mapping and cleansing Entity resolution (ER, or linkage) Fusion and aggregation
Rich data model: define and handle any types of
Relational, XML, semi-structured (JSON) Entities to be linked (by ER) do not need to have the same
schema
HIL targets multiple platforms (Jaql Map/Reduce,
Introduction: ETL Entity Resolution (ER) Data Fusion (DF) ETL+BigData, Jaql+HIL An HIL-based example of ER + DF
44
Data Integration: ETL Entity Resolution Data Fusion BigData ETL (Jaql, HIL) An HIL-based example of ER + DF