ETL Hadoop


  1. Programming Entity Resolution and Data Fusion Methods when Implementing ETL in a Hadoop Environment
     A.E. Vovchenko, L.A. Kalinichenko, D.Yu. Kovalev
     alexey.vovchenko@gmail.com
     Institute of Informatics Problems of the Russian Academy of Sciences (IPI RAN)
     RCDL'2014, 13/10/2014

  2. Outline
     - Introduction: ETL
     - Entity Resolution (ER)
     - Data Fusion (DF)
     - ETL + Big Data, Jaql + HIL
     - An HIL-based example of ER + DF

  3. Information Integration: ETL
     Pipeline: Schema Mapping → Data Transformation → Entity Resolution → Data Fusion
     Source A:
       <pub>
         <Titel> Federated Database Systems </Titel>
         <Autoren>
           <Autor> Amit Sheth </Autor>
           <Autor> James Larson </Autor>
         </Autoren>
       </pub>
     Source B:
       <publication>
         <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title>
         <author> Scheth & Larson </author>
         <year> 1990 </year>
       </publication>

  4. Information Integration: ETL (Schema Integration and Schema Mapping)
     Source A:
       <pub>
         <Titel> Federated Database Systems </Titel>
         <Autoren>
           <Autor> Amit Sheth </Autor>
           <Autor> James Larson </Autor>
         </Autoren>
       </pub>
     Source B:
       <publication>
         <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title>
         <author> Scheth & Larson </author>
         <year> 1990 </year>
       </publication>
     Integrated target schema:
       <pub>
         <title> </title>
         <Autoren>
           <author> </author>
           <author> </author>
         </Autoren>
         <year> </year>
       </pub>

  5. Information Integration: ETL (Transformation queries or views)
     Source A:
       <pub>
         <Titel> Federated Database Systems </Titel>
         <Autoren>
           <Autor> Amit Sheth </Autor>
           <Autor> James Larson </Autor>
         </Autoren>
       </pub>
     Source A after transformation (XQuery):
       <pub>
         <title> Federated Database Systems </title>
         <Autoren>
           <author> Amit Sheth </author>
           <author> James Larson </author>
         </Autoren>
       </pub>
     Source B:
       <publication>
         <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title>
         <author> Scheth & Larson </author>
         <year> 1990 </year>
       </publication>
     Source B after transformation (XQuery):
       <pub>
         <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title>
         <Autoren>
           <author> Scheth & Larson </author>
         </Autoren>
         <year> 1990 </year>
       </pub>
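The slide names XQuery as the transformation language. As a rough illustration of what such a transformation does, here is a minimal Python sketch that maps a Source A record onto the target element names (`title`, `Autoren`, `author`). The function name `transform_source_a` is hypothetical and this is not the authors' implementation, only a sketch of the same renaming step using the standard library.

```python
import xml.etree.ElementTree as ET

# Source A record as shown on the slide.
SOURCE_A = """
<pub>
  <Titel>Federated Database Systems</Titel>
  <Autoren>
    <Autor>Amit Sheth</Autor>
    <Autor>James Larson</Autor>
  </Autoren>
</pub>
"""


def transform_source_a(xml_text):
    """Rename Source A elements to the target schema (hypothetical helper)."""
    src = ET.fromstring(xml_text)
    target = ET.Element("pub")
    # Titel -> title
    ET.SubElement(target, "title").text = src.findtext("Titel", default="").strip()
    # Autoren/Autor -> Autoren/author
    authors = ET.SubElement(target, "Autoren")
    for autor in src.findall("./Autoren/Autor"):
        ET.SubElement(authors, "author").text = (autor.text or "").strip()
    return target


record = transform_source_a(SOURCE_A)
print(ET.tostring(record, encoding="unicode"))
```

Source B is left out of the sketch because its author value contains a bare `&`, which a strict XML parser would reject without escaping.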

  6. Information Integration: ETL
     Source A:
       <pub>
         <Titel> Federated Database Systems </Titel>
         <Autoren>
           <Autor> Amit Sheth </Autor>
           <Autor> James Larson </Autor>
         </Autoren>
       </pub>
     Source A after transformation:
       <pub>
         <title> Federated Database Systems </title>
         <Autoren>
           <author> Amit Sheth </author>
           <author> James Larson </author>
         </Autoren>
       </pub>
     Source B:
       <publication>
         <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title>
         <author> Scheth & Larson </author>
         <year> 1990 </year>
       </publication>
     Source B after transformation:
       <pub>
         <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title>
         <Autoren>
           <author> Scheth & Larson </author>
         </Autoren>
         <year> 1990 </year>
       </pub>

  7. Information Integration: ETL (Preserve lineage)
     Source A after transformation:
       <pub>
         <title> Federated Database Systems </title>
         <Autoren>
           <author> Amit Sheth </author>
           <author> James Larson </author>
         </Autoren>
       </pub>
     Source B after transformation:
       <pub>
         <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title>
         <Autoren>
           <author> Scheth & Larson </author>
         </Autoren>
         <year> 1990 </year>
       </pub>
     Fused record:
       <pub>
         <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title>
         <Autoren>
           <author> Amit Sheth </author>
           <author> James Larson </author>
         </Autoren>
         <year> 1990 </year>
       </pub>

  8. Outline
     - Introduction: ETL
     - Entity Resolution (ER)
     - Data Fusion (DF)
     - ETL + Big Data, Jaql + HIL
     - An HIL-based example of ER + DF

  9. What is Entity Resolution?
     The problem of identifying and linking/grouping different manifestations of the same real-world object.
     Examples of manifestations and objects:
     - Different ways of addressing the same person in text (names, email addresses, Facebook accounts).
     - Web pages with differing descriptions of the same business.
     - Different photos of the same object.
     - …

  10. Entity Resolution: duplicate names

  11. Challenges in ER
      - Name/attribute ambiguity (e.g. "Thomas Cruise")
      - Errors due to data entry
      - Missing values
      - Changing attributes
      - Data formatting
      - Abbreviations / data truncation

  12. ER overview
      - Assume the data has already been prepared:
        - Schema normalization
        - Data normalization
      - Similarity
      - Pairwise ER: determining whether or not a pair of records match

  13. Summary of Similarity
      - Equality on a boolean predicate
      - Edit distance: Levenshtein, Smith-Waterman, Affine. Handles typographical errors.
      - Set similarity: Jaccard, Dice. Good for text such as reviews and tweets.
      - Vector-based: cosine similarity, TFIDF. Good for text such as reviews and tweets.
      - Alignment-based or two-tiered: Jaro-Winkler, Soft-TFIDF, Monge-Elkan. Useful for abbreviations and alternate names.
      - Phonetic similarity: Soundex
      - Translation-based
      - Numeric distance between values
      - Domain-specific
      Useful packages:
      - SecondString, http://secondstring.sourceforge.net/
      - Simmetrics, http://sourceforge.net/projects/simmetrics/
      - LingPipe, http://alias-i.com/lingpipe/index.html
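To make three of the listed measures concrete, here is a minimal pure-Python sketch of Levenshtein edit distance, Jaccard set similarity, and cosine similarity over raw token counts. It does not use the packages named on the slide; the function names and the whitespace tokenization in the usage lines are assumptions made for illustration.

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def jaccard(a_tokens, b_tokens):
    """Set similarity: intersection size divided by union size."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine(a_tokens, b_tokens):
    """Cosine similarity of term-count vectors (no IDF weighting)."""
    va, vb = Counter(a_tokens), Counter(b_tokens)
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


short_title = "federated database systems".split()
long_title = "federated database systems for managing".split()
print(levenshtein("Sheth", "Scheth"))
print(jaccard(short_title, long_title))
print(cosine(short_title, long_title))
```

Edit distance suits short fields with typos such as author names, while the token-based measures suit longer fields such as titles.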

  14. Relational Similarity
      - Relational features are often set-based:
        - Set of coauthors for a paper
        - Set of cities in a country
        - Set of products manufactured by a manufacturer
      - Can use the set similarity functions mentioned earlier:
        - Common Neighbors: intersection size
        - Jaccard's Coefficient: normalize by union size
        - Adar Coefficient: weighted set similarity
      - Can reason about similarity in sets of values:
        - Average or Max
        - Other aggregates
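The three set-based coefficients can be sketched as follows. The slide only says that the Adar coefficient is a weighted set similarity; the weighting used here (each shared item contributes 1 / log of how often it occurs overall, so rare shared items count more) is the usual Adamic/Adar form and is an assumption, as are the function names and the made-up coauthor data.

```python
import math


def common_neighbors(a, b):
    """Intersection size of two sets."""
    return len(set(a) & set(b))


def jaccard_coefficient(a, b):
    """Intersection size normalized by union size."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def adamic_adar(a, b, frequency):
    """Weighted overlap: shared items with low overall frequency weigh more.

    `frequency` maps an item to the number of sets it occurs in (assumed input).
    Items with frequency <= 1 are skipped because log(1) == 0.
    """
    return sum(1.0 / math.log(frequency[item])
               for item in set(a) & set(b)
               if frequency.get(item, 0) > 1)


# Made-up coauthor sets for two author records x and y.
coauthors_x = {"c1", "c2", "c3"}
coauthors_y = {"c1", "c3", "c4"}
# Made-up counts: c1 is a rare coauthor, c3 a very prolific one.
frequency = {"c1": 2, "c2": 5, "c3": 40, "c4": 3}

print(common_neighbors(coauthors_x, coauthors_y))
print(jaccard_coefficient(coauthors_x, coauthors_y))
print(adamic_adar(coauthors_x, coauthors_y, frequency))
```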

  15. Pairwise Match Score
      Problem: Given a vector of component-wise similarities for a pair of records (x, y), compute P(x and y match).
      Solutions:
      - Weighted sum or average of component-wise similarity scores. A threshold determines match or non-match.
        - Example: 0.5 * 1st-author-match-score + 0.2 * venue-match-score + 0.3 * paper-match-score
        - Hard to pick weights.
        - Hard to tune a threshold.
      - Formulate rules about what constitutes a match.
        - Example: (1st-author-match-score > 0.7 AND venue-match-score > 0.8) OR (paper-match-score > 0.9 AND venue-match-score > 0.9)
        - Manually formulating the right set of rules is hard.
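Both solutions can be written down directly. The sketch below uses the weights and rule from the slide; the threshold value 0.75 and the two example similarity vectors are assumptions chosen only to show that the two approaches can disagree on the same pair.

```python
# Weights taken from the slide's example.
WEIGHTS = {"first_author": 0.5, "venue": 0.2, "paper": 0.3}


def weighted_score(sims, weights=WEIGHTS):
    """Weighted sum of component-wise similarity scores."""
    return sum(weights[name] * sims[name] for name in weights)


def weighted_match(sims, threshold=0.75):
    """Match if the weighted score reaches the threshold (0.75 is assumed)."""
    return weighted_score(sims) >= threshold


def rule_match(sims):
    """The hand-written rule from the slide."""
    return ((sims["first_author"] > 0.7 and sims["venue"] > 0.8)
            or (sims["paper"] > 0.9 and sims["venue"] > 0.9))


# Hypothetical similarity vectors for two candidate pairs.
pair_1 = {"first_author": 0.9, "venue": 0.85, "paper": 0.6}
pair_2 = {"first_author": 0.5, "venue": 0.95, "paper": 0.95}

for pair in (pair_1, pair_2):
    print(round(weighted_score(pair), 3), weighted_match(pair), rule_match(pair))
```

For `pair_2` the rule fires through its second clause while the weighted score stays below the assumed threshold, which illustrates why picking weights and thresholds is hard.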

  16. Basic ML Approach
      - r = (x, y) is a record pair, γ is its comparison vector, M is the set of matches, U is the set of non-matches.
      - Decision rule, with R the match score of r and t a threshold:
        - R > t ⇒ r → Match
        - R < t ⇒ r → Non-Match
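The transcript does not preserve how R is defined. A common choice, assumed here, is the likelihood ratio R = P(γ | r ∈ M) / P(γ | r ∈ U) with binary comparison components treated as conditionally independent. The sketch below implements the decision rule under that assumption; the per-field probabilities `M_PROBS` and `U_PROBS` are made up, and in practice they would be estimated from labeled or unlabeled data.

```python
def likelihood_ratio(gamma, m_probs, u_probs):
    """R = P(gamma | M) / P(gamma | U), assuming independent binary components.

    gamma[i]   : 1 if the i-th field agrees for the record pair, else 0
    m_probs[i] : P(field i agrees | pair is a match)       (assumed known)
    u_probs[i] : P(field i agrees | pair is a non-match)   (assumed known)
    """
    ratio = 1.0
    for agrees, m, u in zip(gamma, m_probs, u_probs):
        ratio *= (m / u) if agrees else ((1.0 - m) / (1.0 - u))
    return ratio


def decide(gamma, m_probs, u_probs, t=1.0):
    """Decision rule from the slide; R == t is treated as Non-Match here."""
    return "Match" if likelihood_ratio(gamma, m_probs, u_probs) > t else "Non-Match"


# Made-up probabilities for three fields: first author, venue, title.
M_PROBS = [0.9, 0.8, 0.95]
U_PROBS = [0.1, 0.2, 0.05]

print(decide((1, 1, 1), M_PROBS, U_PROBS))
print(decide((0, 0, 0), M_PROBS, U_PROBS))
```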

  17. Outline
      - Introduction: ETL
      - Entity Resolution (ER)
      - Data Fusion (DF)
      - ETL + Big Data, Jaql + HIL
      - An HIL-based example of ER + DF

  18. Completeness, Conciseness, and Correctness
      Schema Matching: same attribute semantics

  19. Completeness, Conciseness, and Correctness
      Duplicate detection: same real-world entities

  20. Completeness, Conciseness, and Correctness
      Data Fusion: resolve uncertainties and contradictions
      Figure labels: intensional completeness, intensional conciseness, extensional completeness, extensional conciseness

  21. Data Fusion
      Problem
      - Given a duplicate, create a single object representation while resolving conflicting data values.
      Difficulties
      - Null values: subsumption and complementation
      - Contradictions in data values
      - Uncertainty and truth: discover the true value and model uncertainty in this process
      - Metadata: preferences, recency, correctness
      - Lineage: keep original values and their origin
      - Implementation in DBMS: SQL, extended SQL, UDFs, etc.
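A minimal sketch of fusing one duplicate pair, assuming records are plain dictionaries and conflicts are settled by per-attribute resolution functions. The function names, the "longest value wins" resolvers, and the source labels are assumptions for illustration; the records mirror the publication example from the introduction slides. Null values are filled from the other source, contradictions go to a resolver, and every original value is kept with its origin as lineage.

```python
def first(values):
    """Default resolver: keep the first distinct value seen."""
    return values[0]


def fuse(records, resolvers):
    """Fuse duplicates into one record.

    records   : list of (source_name, dict) pairs describing the same entity
    resolvers : attribute -> function choosing one value among conflicting ones
    Returns (fused_record, lineage), where lineage keeps each original
    non-null value together with the source it came from.
    """
    fused, lineage = {}, {}
    attributes = sorted({attr for _, rec in records for attr in rec})
    for attr in attributes:
        candidates = [(rec[attr], source)
                      for source, rec in records
                      if rec.get(attr) is not None]
        lineage[attr] = candidates
        distinct = list(dict.fromkeys(value for value, _ in candidates))
        if not distinct:
            fused[attr] = None              # unknown in every source
        elif len(distinct) == 1:
            fused[attr] = distinct[0]       # nulls complemented, no conflict
        else:
            fused[attr] = resolvers.get(attr, first)(distinct)  # contradiction
    return fused, lineage


source_a = {"title": "Federated Database Systems",
            "authors": ("Amit Sheth", "James Larson"),
            "year": None}
source_b = {"title": ("Federated Database Systems for Managing Distributed, "
                      "Heterogeneous, and Autonomous Databases"),
            "authors": ("Scheth & Larson",),
            "year": 1990}

# Assumed strategies: longest title, author list with the most entries.
resolvers = {"title": lambda vs: max(vs, key=len),
             "authors": lambda vs: max(vs, key=len)}

fused, lineage = fuse([("A", source_a), ("B", source_b)], resolvers)
print(fused)
print(lineage["year"])
```

With these resolvers the fused record has the long title from Source B, the two separate authors from Source A, and the year from Source B.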

  22. The Field of Data Fusion
      Taxonomy diagram; branches under Data Fusion:
      - Conflict types: Uncertainty, Contradiction
      - Resolution strategies: Ignorance, Avoidance, Resolution (each instance-based or metadata-based)
      - Operators: Join-based, Union-based, Subsumption, Complementation
      - Resolution functions: Aggregation, Advanced functions, Possible worlds, Consistent answers
