Programming Entity Resolution and Data Fusion Methods for ETL in the Hadoop Environment. A.E. Vovchenko, L.A. Kalinichenko, D.Yu. Kovalev. alexey.vovchenko@gmail.com. Institute of Informatics Problems of the Russian Academy of Sciences (IPI RAN). RCDL’2014, 13/10/2014
Outline. Introduction: ETL; Entity Resolution (ER); Data Fusion (DF); ETL+BigData, Jaql+HIL; An HIL-based example of ER + DF
Information Integration: ETL
Stages: Schema Mapping, Data Transformation, Entity Resolution, Data Fusion
Source A: <pub> <Titel> Federated Database Systems </Titel> <Autoren> <Autor> Amit Sheth </Autor> <Autor> James Larson </Autor> </Autoren> </pub>
Source B: <publication> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <author> Scheth & Larson </author> <year> 1990 </year> </publication>
Information Integration: ETL
Schema Integration and Schema Mapping
Source A: <pub> <Titel> Federated Database Systems </Titel> <Autoren> <Autor> Amit Sheth </Autor> <Autor> James Larson </Autor> </Autoren> </pub>
Source B: <publication> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <author> Scheth & Larson </author> <year> 1990 </year> </publication>
Integrated schema: <pub> <title> </title> <Autoren> <author> </author> <author> </author> </Autoren> <year> </year> </pub>
Information Integration: ETL
Transformation queries or views (XQuery)
Source A: <pub> <Titel> Federated Database Systems </Titel> <Autoren> <Autor> Amit Sheth </Autor> <Autor> James Larson </Autor> </Autoren> </pub>
Source A, transformed: <pub> <title> Federated Database Systems </title> <Autoren> <author> Amit Sheth </author> <author> James Larson </author> </Autoren> </pub>
Source B: <publication> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <author> Scheth & Larson </author> <year> 1990 </year> </publication>
Source B, transformed: <pub> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <Autoren> <author> Scheth & Larson </author> </Autoren> <year> 1990 </year> </pub>
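The slide uses XQuery for the transformation. As a rough stand-in, the mapping of a Source A record onto the target schema can be sketched in Python with the standard library `xml.etree.ElementTree`. The function name `transform_source_a` is hypothetical, and the mapping rules (Titel to title, Autor to author) are read off the slide's before/after records.

```python
import xml.etree.ElementTree as ET

# Source A record as shown on the slide.
SOURCE_A = """
<pub>
  <Titel>Federated Database Systems</Titel>
  <Autoren>
    <Autor>Amit Sheth</Autor>
    <Autor>James Larson</Autor>
  </Autoren>
</pub>
"""


def transform_source_a(xml_text: str) -> ET.Element:
    """Map a Source A record onto the target schema (hypothetical helper)."""
    src = ET.fromstring(xml_text)
    target = ET.Element("pub")
    # Titel -> title
    ET.SubElement(target, "title").text = (src.findtext("Titel") or "").strip()
    # Autoren/Autor -> Autoren/author (the target keeps the Autoren wrapper)
    authors = ET.SubElement(target, "Autoren")
    for autor in src.findall("Autoren/Autor"):
        ET.SubElement(authors, "author").text = (autor.text or "").strip()
    # Source A carries no year, so no <year> element is emitted here.
    return target


record = transform_source_a(SOURCE_A)
print(ET.tostring(record, encoding="unicode"))
```

Source B would need its own mapping (publication to pub, wrapping author in Autoren); it is omitted to keep the sketch short.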
Information Integration: ETL
Source A: <pub> <Titel> Federated Database Systems </Titel> <Autoren> <Autor> Amit Sheth </Autor> <Autor> James Larson </Autor> </Autoren> </pub>
Source A, transformed: <pub> <title> Federated Database Systems </title> <Autoren> <author> Amit Sheth </author> <author> James Larson </author> </Autoren> </pub>
Source B: <publication> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <author> Scheth & Larson </author> <year> 1990 </year> </publication>
Source B, transformed: <pub> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <Autoren> <author> Scheth & Larson </author> </Autoren> <year> 1990 </year> </pub>
Information Integration: ETL
Source A, transformed: <pub> <title> Federated Database Systems </title> <Autoren> <author> Amit Sheth </author> <author> James Larson </author> </Autoren> </pub>
Source B, transformed: <pub> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <Autoren> <author> Scheth & Larson </author> </Autoren> <year> 1990 </year> </pub>
Fused record: <pub> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <Autoren> <author> Amit Sheth </author> <author> James Larson </author> </Autoren> <year> 1990 </year> </pub>
Preserve lineage
What is Entity Resolution? The problem of identifying and linking or grouping different manifestations of the same real-world object. Examples of manifestations and objects: different ways of addressing the same person in text (names, email addresses, Facebook accounts); web pages with differing descriptions of the same business; different photos of the same object; …
Entity Resolution: duplicate names
Challenges in ER: name/attribute ambiguity (e.g. Thomas Cruise); errors due to data entry; missing values; changing attributes; data formatting; abbreviations / data truncation
ER overview. Let us consider the data already prepared (schema normalization, data normalization). Topics: similarity; pairwise ER, i.e. determining whether or not a pair of records match.
Summary of Similarity
Equality on a boolean predicate
Edit distance: Levenshtein, Smith-Waterman, Affine. Handles typographical errors.
Set similarity: Jaccard, Dice
Vector-based: cosine similarity, TFIDF
Set and vector-based measures are good for text like reviews/tweets.
Alignment-based or two-tiered: Jaro-Winkler, Soft-TFIDF, Monge-Elkan. Useful for abbreviations and alternate names.
Phonetic similarity: Soundex
Translation-based
Numeric distance between values
Domain-specific
Useful packages: SecondString, http://secondstring.sourceforge.net/; Simmetrics, http://sourceforge.net/projects/simmetrics/; LingPipe, http://alias-i.com/lingpipe/index.html
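Two of the measures listed on this slide, Levenshtein edit distance and Jaccard set similarity, are small enough to sketch directly in Python. This is an illustrative sketch, not the implementation from any of the packages named above; the helper names (`edit_similarity`, `tokens`) and the whitespace tokenizer are assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def tokens(text: str) -> set:
    """Naive tokenizer (assumption): lowercase, split on whitespace, keep punctuation."""
    return set(text.lower().split())


def jaccard(a: set, b: set) -> float:
    """Intersection size divided by union size."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


short_title = "Federated Database Systems"
long_title = ("Federated Database Systems for Managing Distributed, "
              "Heterogeneous, and Autonomous Databases")

print(levenshtein("Sheth", "Scheth"))
print(edit_similarity("Sheth", "Scheth"))
print(jaccard(tokens(short_title), tokens(long_title)))
```

The edit distance catches the one-character typo in the author name, while the token-level Jaccard score shows how much of the longer title is shared with the shorter one.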
Relational Similarity. Relational features are often set-based: the set of coauthors for a paper; the set of cities in a country; the set of products manufactured by a manufacturer. The set similarity functions mentioned earlier can be used. Common Neighbors: intersection size. Jaccard’s Coefficient: normalize by union size. Adar Coefficient: weighted set similarity. One can also reason about similarity in sets of values: average, max, or other aggregates.
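The three set-based measures named on this slide can be sketched as follows. The slide only says the Adar coefficient is a weighted set similarity; the weighting used here (each shared item counts 1 / log of the number of sets it occurs in, so rare items weigh more) is the usual Adamic/Adar form and is an assumption about what the slide intends. The coauthor sets and author labels are made up for illustration.

```python
import math
from collections import Counter


def common_neighbors(a: set, b: set) -> int:
    """Intersection size."""
    return len(a & b)


def jaccard(a: set, b: set) -> float:
    """Intersection size normalized by union size."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def adar(a: set, b: set, frequency: Counter) -> float:
    """Weighted overlap: rare shared items contribute more (assumed Adamic/Adar form)."""
    # Items seen in only one set are skipped because log(1) == 0.
    return sum(1.0 / math.log(frequency[item])
               for item in a & b
               if frequency[item] > 1)


# Hypothetical coauthor sets for three papers.
coauthors = {
    "p1": {"author_a", "author_b"},
    "p2": {"author_a", "author_b", "author_c"},
    "p3": {"author_a", "author_d"},
}
# How many papers each author appears on.
frequency = Counter(name for names in coauthors.values() for name in names)

p1, p2 = coauthors["p1"], coauthors["p2"]
print(common_neighbors(p1, p2), jaccard(p1, p2), adar(p1, p2, frequency))
```

Here `author_a` appears on all three papers and so contributes less to the Adar score than `author_b`, who appears on only two.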
Pairwise Match Score. Problem: given a vector of component-wise similarities for a pair of records (x, y), compute P(x and y match). Solutions: (1) Weighted sum or average of component-wise similarity scores, with a threshold that determines match or non-match, e.g. 0.5*1st-author-match-score + 0.2*venue-match-score + 0.3*paper-match-score. Hard to pick weights; hard to tune a threshold. (2) Formulate rules about what constitutes a match, e.g. (1st-author-match-score > 0.7 AND venue-match-score > 0.8) OR (paper-match-score > 0.9 AND venue-match-score > 0.9). Manually formulating the right set of rules is hard.
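Both approaches from this slide fit in a few lines of Python. The weights and the rule come from the slide; the threshold of 0.75, the `Scores` container, and the sample scores are assumptions chosen only to show that the two approaches can disagree on the same pair.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    """Component-wise similarity scores for one record pair (hypothetical container)."""
    first_author: float
    venue: float
    paper: float


def weighted_score(s: Scores) -> float:
    # Weights taken from the slide.
    return 0.5 * s.first_author + 0.2 * s.venue + 0.3 * s.paper


def weighted_match(s: Scores, threshold: float = 0.75) -> bool:
    # The threshold value is an assumption; the slide notes it is hard to tune.
    return weighted_score(s) >= threshold


def rule_match(s: Scores) -> bool:
    # Rule taken from the slide.
    return ((s.first_author > 0.7 and s.venue > 0.8)
            or (s.paper > 0.9 and s.venue > 0.9))


pair = Scores(first_author=0.9, venue=0.85, paper=0.4)
print(weighted_score(pair), weighted_match(pair), rule_match(pair))
```

For this pair the weighted score falls just under the assumed threshold while the rule fires on the author and venue scores, which illustrates why choosing weights, thresholds, and rules by hand is fragile.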
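Basic ML Approach. r = (x, y) is a record pair, γ is the comparison vector, M is the set of matches, U is the set of non-matches. Decision rule, with R the likelihood ratio $R = \frac{P(\gamma \mid r \in M)}{P(\gamma \mid r \in U)}$: $R > t \Rightarrow r \to \text{Match}$; $R < t \Rightarrow r \to \text{Non-Match}$.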
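Assuming R is the likelihood ratio P(γ | r ∈ M) / P(γ | r ∈ U), and assuming the components of γ are binary and conditionally independent (a naive Bayes assumption that the slide does not state), the decision rule can be sketched as below. The per-feature probabilities `m` and `u` and the threshold `t` are made-up illustrative values, not estimates from any data.

```python
import math
from typing import Sequence


def log_likelihood_ratio(gamma: Sequence[int],
                         m: Sequence[float],
                         u: Sequence[float]) -> float:
    """log R for a binary comparison vector under conditional independence.

    m[i] = P(gamma_i = 1 | pair is a match), u[i] = P(gamma_i = 1 | non-match).
    """
    total = 0.0
    for g, m_i, u_i in zip(gamma, m, u):
        if g:
            total += math.log(m_i / u_i)              # feature agrees
        else:
            total += math.log((1 - m_i) / (1 - u_i))  # feature disagrees
    return total


def decide(gamma: Sequence[int], m: Sequence[float], u: Sequence[float],
           t: float) -> str:
    # The slide leaves R == t unspecified; this sketch treats it as non-match.
    return "match" if log_likelihood_ratio(gamma, m, u) > math.log(t) else "non-match"


# Illustrative (assumed) probabilities for three features, e.g. author, venue, title.
m = [0.9, 0.8, 0.95]
u = [0.1, 0.3, 0.05]
t = 10.0

print(decide((1, 1, 1), m, u, t))
print(decide((0, 0, 0), m, u, t))
```

Working in log space avoids underflow when many features are multiplied together; comparing log R with log t is equivalent to comparing R with t.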
Completeness, Conciseness, and Correctness. Schema Matching: same attribute semantics.
Completeness, Conciseness, and Correctness. Duplicate detection: same real-world entities.
Completeness, Conciseness, and Correctness. Data Fusion: resolve uncertainties and contradictions. Dimensions shown: intensional completeness, intensional conciseness, extensional completeness, extensional conciseness.
Data Fusion. Problem: given a duplicate, create a single object representation while resolving conflicting data values. Difficulties: null values (subsumption and complementation); contradictions in data values; uncertainty and truth (discover the true value and model uncertainty in this process); metadata (preferences, recency, correctness); lineage (keep original values and their origin); implementation in a DBMS (SQL, extended SQL, UDFs, etc.).
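A minimal Python sketch of the fusion step described here, under stated assumptions: null values are represented by `None` and filled from the other record (complementation), contradictions are settled by a per-attribute resolution function, and lineage is kept as a list of (source, value) pairs. The `fuse` and `longest` helpers and the choice of "prefer the longest value" as the resolution function are hypothetical; the records mirror the two publications from the introduction.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
Lineage = Dict[str, List[Tuple[str, Any]]]
Resolver = Callable[[List[Any]], Any]


def longest(values: List[Any]) -> Any:
    """Illustrative resolution function: prefer the longest string or list."""
    return max(values, key=len)


def fuse(records: Dict[str, Record],
         resolvers: Dict[str, Resolver]) -> Tuple[Record, Lineage]:
    """Fuse duplicate records into one representation, keeping lineage."""
    fused: Record = {}
    lineage: Lineage = {}
    attributes = sorted({attr for rec in records.values() for attr in rec})
    for attr in attributes:
        # Keep every non-null original value together with its source.
        observed = [(src, rec[attr]) for src, rec in records.items()
                    if rec.get(attr) is not None]
        lineage[attr] = observed
        distinct: List[Any] = []
        for _, value in observed:
            if value not in distinct:
                distinct.append(value)
        if not distinct:
            fused[attr] = None                 # null in every source
        elif len(distinct) == 1:
            fused[attr] = distinct[0]          # no conflict; nulls are complemented
        else:
            # Contradiction: default is to take the first source's value.
            resolve = resolvers.get(attr, lambda vals: vals[0])
            fused[attr] = resolve(distinct)
    return fused, lineage


records = {
    "A": {"title": "Federated Database Systems",
          "authors": ["Amit Sheth", "James Larson"],
          "year": None},
    "B": {"title": "Federated Database Systems for Managing Distributed, "
                   "Heterogeneous, and Autonomous Databases",
          "authors": ["Scheth & Larson"],
          "year": 1990},
}

fused, lineage = fuse(records, {"title": longest, "authors": longest})
print(fused)
print(lineage["year"])
```

In a DBMS setting the same logic would typically live in aggregation functions or UDFs applied per group of duplicates; the Python dictionary here only stands in for that grouping.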
The Field of Data Fusion (taxonomy diagram). Top-level branches: conflict types, resolution strategies, operators, resolution functions. Node labels: uncertainty, contradiction; ignorance, avoidance, resolution, with instance-based and metadata-based variants; join-based, union-based; subsumption, complementation; aggregation, advanced functions; possible worlds, consistent answers.