
ETL Hadoop - PowerPoint PPT Presentation



  1. Programming entity resolution and data fusion methods for ETL in a Hadoop environment. A. E. Vovchenko, L. A. Kalinichenko, D. Yu. Kovalev. alexey.vovchenko@gmail.com. Institute of Informatics Problems of the Russian Academy of Sciences (IPI RAN). RCDL’2014, 13/10/2014

  2. Outline
     • Introduction: ETL
     • Entity Resolution (ER)
     • Data Fusion (DF)
     • ETL+BigData, Jaql+HIL
     • An HIL-based example of ER + DF

  3. Information Integration: ETL
     Source A: <pub> <Titel> Federated Database Systems </Titel> <Autoren> <Autor> Amit Sheth </Autor> <Autor> James Larson </Autor> </Autoren> </pub>
     Source B: <publication> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <author> Scheth & Larson </author> <year> 1990 </year> </publication>
     Pipeline: Schema Mapping → Data Transformation → Entity Resolution → Data Fusion

  4. Information Integration: ETL (Schema Integration, Schema Mapping)
     Source A: <pub> <Titel> Federated Database Systems </Titel> <Autoren> <Autor> Amit Sheth </Autor> <Autor> James Larson </Autor> </Autoren> </pub>
     Source B: <publication> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <author> Scheth & Larson </author> <year> 1990 </year> </publication>
     Integrated target schema: <pub> <title> </title> <Autoren> <author> </author> <author> </author> </Autoren> <year> </year> </pub>

  5. Information Integration: ETL (Transformation queries or views)
     Source A: <pub> <Titel> Federated Database Systems </Titel> <Autoren> <Autor> Amit Sheth </Autor> <Autor> James Larson </Autor> </Autoren> </pub>
     Source A after XQuery transformation: <pub> <title> Federated Database Systems </title> <Autoren> <author> Amit Sheth </author> <author> James Larson </author> </Autoren> </pub>
     Source B: <publication> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <author> Scheth & Larson </author> <year> 1990 </year> </publication>
     Source B after XQuery transformation: <pub> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <Autoren> <author> Scheth & Larson </author> </Autoren> <year> 1990 </year> </pub>

  6. Information Integration: ETL
     Source A: <pub> <Titel> Federated Database Systems </Titel> <Autoren> <Autor> Amit Sheth </Autor> <Autor> James Larson </Autor> </Autoren> </pub>
     Source A, transformed: <pub> <title> Federated Database Systems </title> <Autoren> <author> Amit Sheth </author> <author> James Larson </author> </Autoren> </pub>
     Source B: <publication> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <author> Scheth & Larson </author> <year> 1990 </year> </publication>
     Source B, transformed: <pub> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <Autoren> <author> Scheth & Larson </author> </Autoren> <year> 1990 </year> </pub>

  7. Information Integration: ETL (Preserve lineage)
     Source A, transformed: <pub> <title> Federated Database Systems </title> <Autoren> <author> Amit Sheth </author> <author> James Larson </author> </Autoren> </pub>
     Source B, transformed: <pub> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <Autoren> <author> Scheth & Larson </author> </Autoren> <year> 1990 </year> </pub>
     Fused result: <pub> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <Autoren> <author> Amit Sheth </author> <author> James Larson </author> </Autoren> <year> 1990 </year> </pub>

  8. Outline
     • Introduction: ETL
     • Entity Resolution (ER)
     • Data Fusion (DF)
     • ETL+BigData, Jaql+HIL
     • An HIL-based example of ER + DF

  9. What is Entity Resolution?
     The problem of identifying and linking/grouping different manifestations of the same real-world object.
     Examples of manifestations and objects:
     • Different ways of addressing the same person in text (names, email addresses, Facebook accounts).
     • Web pages with differing descriptions of the same business.
     • Different photos of the same object.
     • …

  10. Entity Resolution duplicate names

  11. Challenges in ER
     • Name/attribute ambiguity (example: Thomas Cruise)
     • Errors due to data entry
     • Missing values
     • Changing attributes
     • Data formatting
     • Abbreviations / data truncation

  12. ER overview
     • Assume the data has already been prepared
        - Schema normalization
        - Data normalization
     • Similarity
     • Pairwise ER
        - Determining whether or not a pair of records match

  13. Summary of Similarity
     • Equality on a boolean predicate
     • Edit distance (handles typographical errors)
        - Levenshtein, Smith-Waterman, Affine
     • Set similarity (good for text like reviews/tweets)
        - Jaccard, Dice
     • Vector-based
        - Cosine similarity, TFIDF
     • Alignment-based or two-tiered (useful for abbreviations and alternate names)
        - Jaro-Winkler, Soft-TFIDF, Monge-Elkan
     • Phonetic similarity
        - Soundex
     • Translation-based
     • Numeric distance between values
     • Domain-specific
     • Useful packages
        - SecondString, http://secondstring.sourceforge.net/
        - Simmetrics: http://sourceforge.net/projects/simmetrics/
        - LingPipe, http://alias-i.com/lingpipe/index.html
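Two of the measures listed on this slide, Levenshtein edit distance and Jaccard set similarity, can be sketched in a few lines of plain Python. This is only an illustration: the function names and the whitespace tokenization are assumptions, not part of the presentation, and the packages named on the slide provide tuned implementations.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of lower-cased, whitespace-split token sets (assumed tokenizer)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Values taken from the two sources shown on the earlier slides.
typo_distance = levenshtein("Sheth", "Scheth")
title_similarity = jaccard(
    "Federated Database Systems",
    "Federated Database Systems for Managing Distributed, "
    "Heterogeneous, and Autonomous Databases",
)
```

Edit distance is small for the misspelled author name, while the token-level Jaccard score for the short and long title is low, which is one reason several measures are usually combined.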

  14. Relational Similarity
     • Relational features are often set-based
        - Set of coauthors for a paper
        - Set of cities in a country
        - Set of products manufactured by a manufacturer
     • Can use the set similarity functions mentioned earlier
        - Common Neighbors: intersection size
        - Jaccard's Coefficient: normalize by union size
        - Adar (Adamic/Adar) Coefficient: weighted set similarity
     • Can reason about similarity in sets of values
        - Average or Max
        - Other aggregates
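The three set-based measures on this slide could look roughly as follows. The sketch assumes the common Adamic/Adar form, where each shared element is weighted by 1/log(frequency); the `frequency` table and the coauthor identifiers are hypothetical illustration data.

```python
import math


def common_neighbors(a: set, b: set) -> int:
    """Size of the intersection."""
    return len(a & b)


def jaccard_coefficient(a: set, b: set) -> float:
    """Intersection size normalized by union size."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def adamic_adar(a: set, b: set, frequency: dict) -> float:
    """Weighted overlap: shared elements that occur rarely overall count more.

    frequency[x] is the number of records containing x (assumed to be given);
    elements with frequency <= 1 are skipped because log(1) == 0.
    """
    return sum(1.0 / math.log(frequency[x]) for x in a & b if frequency[x] > 1)


# Hypothetical coauthor sets of two paper records.
coauthors_1 = {"a1", "a2", "a3"}
coauthors_2 = {"a1", "a2"}
frequency = {"a1": 10, "a2": 2, "a3": 3}  # hypothetical paper counts per author

shared = common_neighbors(coauthors_1, coauthors_2)
overlap = jaccard_coefficient(coauthors_1, coauthors_2)
weighted = adamic_adar(coauthors_1, coauthors_2, frequency)
```

In this example the prolific author `a1` contributes less to the weighted score than the rarer `a2`, which is the intuition behind weighting the overlap.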

  15. Pairwise Match Score
     Problem: Given a vector of component-wise similarities for a pair of records (x, y), compute P(x and y match).
     Solutions:
     • Weighted sum or average of component-wise similarity scores. A threshold determines match or non-match.
        - 0.5*1st-author-match-score + 0.2*venue-match-score + 0.3*paper-match-score
        - Hard to pick weights.
        - Hard to tune a threshold.
     • Formulate rules about what constitutes a match.
        - (1st-author-match-score > 0.7 AND venue-match-score > 0.8) OR (paper-match-score > 0.9 AND venue-match-score > 0.9)
        - Manually formulating the right set of rules is hard.
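Both solutions from this slide can be written down directly. The weights and rule thresholds below are the ones on the slide; the cut-off of 0.75 for the weighted sum and the example similarity vector are assumptions chosen only to show that the two approaches can disagree.

```python
# Weights from the slide's weighted-sum example.
WEIGHTS = {"first_author": 0.5, "venue": 0.2, "paper": 0.3}


def weighted_score(sim: dict) -> float:
    """Weighted sum of the component-wise similarity scores."""
    return sum(weight * sim[name] for name, weight in WEIGHTS.items())


def weighted_match(sim: dict, threshold: float = 0.75) -> bool:
    """Match if the weighted score reaches the (assumed) threshold."""
    return weighted_score(sim) >= threshold


def rule_match(sim: dict) -> bool:
    """The hand-written rule from the slide."""
    return ((sim["first_author"] > 0.7 and sim["venue"] > 0.8)
            or (sim["paper"] > 0.9 and sim["venue"] > 0.9))


# Hypothetical component-wise similarities for one record pair.
example = {"first_author": 0.9, "venue": 0.85, "paper": 0.4}
score = weighted_score(example)      # 0.45 + 0.17 + 0.12
by_weights = weighted_match(example)
by_rules = rule_match(example)
```

Here the weighted sum falls just below the assumed threshold while the rule fires, which illustrates why both weights and rules are hard to tune by hand.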

  16. Basic ML Approach
     • r = (x, y) is a record pair, γ is its comparison vector, M is the set of matches, U the set of non-matches
     • Decision rule on a score R with threshold t
        - R > t ⇒ r is a Match
        - R < t ⇒ r is a Non-Match
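The extracted slide text does not say how R is computed. A common choice, assumed here, is the Fellegi-Sunter likelihood ratio R = P(γ | r ∈ M) / P(γ | r ∈ U) with conditionally independent binary components. The sketch below follows that assumption; the per-component probabilities `m` and `u` are hypothetical and would normally be estimated from labeled pairs or with EM.

```python
import math


def likelihood_ratio(gamma, m, u) -> float:
    """R = P(gamma | M) / P(gamma | U) under conditional independence.

    gamma: sequence of 0/1 agreement indicators, one per compared attribute
    m[i]:  P(component i agrees | pair is a match)       (assumed known)
    u[i]:  P(component i agrees | pair is a non-match)   (assumed known)
    """
    log_r = 0.0
    for agrees, mi, ui in zip(gamma, m, u):
        if agrees:
            log_r += math.log(mi / ui)
        else:
            log_r += math.log((1 - mi) / (1 - ui))
    return math.exp(log_r)


def classify(gamma, m, u, t: float) -> str:
    """Apply the decision rule; R == t is treated as Non-Match here by choice."""
    return "Match" if likelihood_ratio(gamma, m, u) > t else "Non-Match"


# Hypothetical parameters for two compared attributes.
m = [0.9, 0.8]
u = [0.1, 0.2]
both_agree = likelihood_ratio([1, 1], m, u)
both_differ = likelihood_ratio([0, 0], m, u)
```

With these numbers, a pair agreeing on both attributes gets a ratio far above 1 and a pair disagreeing on both gets a ratio far below 1, so any threshold t between the two separates them.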

  17. Outline
     • Introduction: ETL
     • Entity Resolution (ER)
     • Data Fusion (DF)
     • ETL+BigData, Jaql+HIL
     • An HIL-based example of ER + DF

  18. Completeness, Conciseness, and Correctness. Schema Matching: Same attribute semantics

  19. Completeness, Conciseness, and Correctness. Duplicate detection: Same real-world entities

  20. Completeness, Conciseness, and Correctness. Data Fusion: Resolve uncertainties and contradictions. Figure labels: intensional completeness, intensional conciseness, extensional completeness, extensional conciseness

  21. Data Fusion
     • Problem
        - Given a duplicate, create a single object representation while resolving conflicting data values.
     • Difficulties
        - Null values: subsumption and complementation
        - Contradictions in data values
        - Uncertainty & truth: discover the true value and model uncertainty in this process
        - Metadata: preferences, recency, correctness
        - Lineage: keep original values and their origin
        - Implementation in DBMS: SQL, extended SQL, UDFs, etc.
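A minimal sketch of fusing one detected duplicate pair while keeping lineage, using the two publication records from the introduction. The resolution functions (`longest`, `first_non_null`) and the choice of which function applies to which attribute are assumptions made for illustration; they are not the HIL rules of the presentation.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Candidate = Tuple[str, Any]  # (source name, value)


def first_non_null(candidates: List[Candidate]) -> Candidate:
    """Complementation: take the first source that has a value at all."""
    return candidates[0]


def longest(candidates: List[Candidate]) -> Candidate:
    """Assumed conflict-resolution function: prefer the longest value."""
    return max(candidates, key=lambda c: len(c[1]))


def fuse(records: List[Tuple[str, Dict[str, Any]]],
         resolvers: Dict[str, Callable[[List[Candidate]], Candidate]]):
    """Merge duplicate records into one; return (fused record, lineage)."""
    fused: Dict[str, Any] = {}
    lineage: Dict[str, Optional[str]] = {}
    attributes = sorted({attr for _, rec in records for attr in rec})
    for attr in attributes:
        # Null values never compete with real values.
        candidates = [(src, rec[attr]) for src, rec in records
                      if rec.get(attr) is not None]
        if not candidates:
            fused[attr], lineage[attr] = None, None
            continue
        src, value = resolvers.get(attr, first_non_null)(candidates)
        fused[attr], lineage[attr] = value, src  # remember where the value came from
    return fused, lineage


records = [
    ("Source A", {"title": "Federated Database Systems",
                  "authors": ["Amit Sheth", "James Larson"],
                  "year": None}),
    ("Source B", {"title": "Federated Database Systems for Managing Distributed, "
                           "Heterogeneous, and Autonomous Databases",
                  "authors": ["Scheth & Larson"],
                  "year": 1990}),
]
fused, lineage = fuse(records, {"title": longest, "authors": longest})
```

Under these assumed resolvers the fused record takes the title and year from Source B and the author list from Source A, and `lineage` records the origin of each value.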

  22. The Field of Data Fusion (taxonomy diagram; labels as extracted)
     • Top-level branches: Conflict types, Resolution strategies, Operators, Resolution functions
     • Conflict types: Uncertainty, Contradiction
     • Resolution strategies: Ignorance, Avoidance, Resolution (with Instance-based and Metadata-based variants)
     • Remaining labels: Join-based, Union-based, Subsumption, Complementation, Aggregation, Advanced functions, Possible worlds, Consistent answers
