ETL Hadoop - PowerPoint PPT Presentation

Programming Entity Resolution and Data Fusion Methods for ETL Implementation in the Hadoop Environment
A.E. Vovchenko, L.A. Kalinichenko, D.Yu. Kovalev
alexey.vovchenko@gmail.com
Institute of Informatics Problems, Russian Academy of Sciences (IPI RAN)
RCDL'2014, 13/10/2014

Outline
• Introduction: ETL
• Entity Resolution (ER)
• Data Fusion (DF)
• ETL + BigData, Jaql + HIL
• An HIL-based example of ER + DF

Information Integration: ETL
Pipeline stages: Schema Mapping → Data Transformation → Entity Resolution → Data Fusion
Source A:
<pub>
  <Titel> Federated Database Systems </Titel>
  <Autoren>
    <Autor> Amit Sheth </Autor>
    <Autor> James Larson </Autor>
  </Autoren>
</pub>
Source B:
<publication>
  <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title>
  <author> Scheth & Larson </author>
  <year> 1990 </year>
</publication>

Information Integration: ETL
Pipeline stages: Schema Mapping → Data Transformation → Entity Resolution → Data Fusion
Source A:
<pub>
  <Titel> Federated Database Systems </Titel>
  <Autoren>
    <Autor> Amit Sheth </Autor>
    <Autor> James Larson </Autor>
  </Autoren>
</pub>
Source B:
<publication>
  <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title>
  <author> Scheth & Larson </author>
  <year> 1990 </year>
</publication>
Schema Integration / Schema Mapping: both sources are mapped onto one target schema:
<pub>
  <title> </title>
  <Autoren>
    <author> </author>
    <author> </author>
  </Autoren>
  <year> </year>
</pub>

Information Integration: ETL
Pipeline stages: Schema Mapping → Data Transformation → Entity Resolution → Data Fusion
Transformation queries or views (XQuery) rewrite each source record into the target schema.
Source A:
<pub>
  <Titel> Federated Database Systems </Titel>
  <Autoren>
    <Autor> Amit Sheth </Autor>
    <Autor> James Larson </Autor>
  </Autoren>
</pub>
Source A after XQuery transformation:
<pub>
  <title> Federated Database Systems </title>
  <Autoren>
    <author> Amit Sheth </author>
    <author> James Larson </author>
  </Autoren>
</pub>
Source B:
<publication>
  <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title>
  <author> Scheth & Larson </author>
  <year> 1990 </year>
</publication>
Source B after XQuery transformation:
<pub>
  <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title>
  <Autoren>
    <author> Scheth & Larson </author>
  </Autoren>
  <year> 1990 </year>
</pub>
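The slide names XQuery for this step. As a rough illustration only, the same Source A mapping can be sketched in Python with the standard `xml.etree.ElementTree` module, swapped in here for XQuery. The function name `transform_source_a` is hypothetical, and the sketch covers Source A only.

```python
import xml.etree.ElementTree as ET

# Source A record as shown on the slide.
SOURCE_A = """
<pub>
  <Titel> Federated Database Systems </Titel>
  <Autoren>
    <Autor> Amit Sheth </Autor>
    <Autor> James Larson </Autor>
  </Autoren>
</pub>
"""


def transform_source_a(xml_text):
    """Map one Source A record onto the target <pub> schema (hypothetical helper)."""
    src = ET.fromstring(xml_text.strip())
    out = ET.Element("pub")
    # Titel -> title
    ET.SubElement(out, "title").text = src.findtext("Titel", default="").strip()
    # Autoren/Autor -> Autoren/author
    authors = ET.SubElement(out, "Autoren")
    for autor in src.findall("Autoren/Autor"):
        ET.SubElement(authors, "author").text = (autor.text or "").strip()
    # Source A carries no year, so no <year> element is emitted.
    return out


result = transform_source_a(SOURCE_A)
print(ET.tostring(result, encoding="unicode"))
```

Source B would need its own mapping (for example, wrapping its single `author` element in `Autoren`), which is omitted here.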

Information Integration: ETL
Pipeline stages: Schema Mapping → Data Transformation → Entity Resolution → Data Fusion
Source A:
<pub>
  <Titel> Federated Database Systems </Titel>
  <Autoren>
    <Autor> Amit Sheth </Autor>
    <Autor> James Larson </Autor>
  </Autoren>
</pub>
Source A in the target schema:
<pub>
  <title> Federated Database Systems </title>
  <Autoren>
    <author> Amit Sheth </author>
    <author> James Larson </author>
  </Autoren>
</pub>
Source B:
<publication>
  <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title>
  <author> Scheth & Larson </author>
  <year> 1990 </year>
</publication>
Source B in the target schema:
<pub>
  <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title>
  <Autoren>
    <author> Scheth & Larson </author>
  </Autoren>
  <year> 1990 </year>
</pub>

Information Integration: ETL
Pipeline stages: Schema Mapping → Data Transformation → Entity Resolution → Data Fusion
Source A in the target schema:
<pub>
  <title> Federated Database Systems </title>
  <Autoren>
    <author> Amit Sheth </author>
    <author> James Larson </author>
  </Autoren>
</pub>
Source B in the target schema:
<pub>
  <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title>
  <Autoren>
    <author> Scheth & Larson </author>
  </Autoren>
  <year> 1990 </year>
</pub>
Fused record (preserve lineage):
<pub>
  <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title>
  <Autoren>
    <author> Amit Sheth </author>
    <author> James Larson </author>
  </Autoren>
  <year> 1990 </year>
</pub>

What is Entity Resolution?
The problem of identifying and linking or grouping different manifestations of the same real-world object.
Examples of manifestations and objects:
• Different ways of addressing the same person in text (names, email addresses, Facebook accounts).
• Web pages with differing descriptions of the same business.
• Different photos of the same object.
• …

Entity Resolution: duplicate names

Challenges in ER
• Name/attribute ambiguity (e.g., Thomas Cruise)
• Errors due to data entry
• Missing values
• Changing attributes
• Data formatting
• Abbreviations / data truncation

ER overview
• Assume the data is already prepared
  ◦ Schema normalization
  ◦ Data normalization
• Similarity
• Pairwise ER
  ◦ Determining whether or not a pair of records match

Summary of Similarity
• Equality on a boolean predicate
• Edit distance (handles typographical errors)
  ◦ Levenshtein, Smith-Waterman, Affine
• Set similarity (good for text like reviews/tweets)
  ◦ Jaccard, Dice
• Vector-based (good for text like reviews/tweets)
  ◦ Cosine similarity, TFIDF
• Alignment-based or two-tiered (useful for abbreviations and alternate names)
  ◦ Jaro-Winkler, Soft-TFIDF, Monge-Elkan
• Phonetic similarity (useful for abbreviations and alternate names)
  ◦ Soundex
• Translation-based
• Numeric distance between values
• Domain-specific
• Useful packages
  ◦ SecondString: http://secondstring.sourceforge.net/
  ◦ Simmetrics: http://sourceforge.net/projects/simmetrics/
  ◦ LingPipe: http://alias-i.com/lingpipe/index.html
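Two of the measures listed above, Levenshtein edit distance and Jaccard set similarity, can be sketched in a few lines of plain Python. This is a minimal illustration rather than the API of any of the packages named on the slide; the function names and the whitespace tokenization are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn a into b (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance normalized into [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercased whitespace tokens (assumed tokenizer)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# "Sheth" vs. "Scheth" differ by one inserted character.
print(levenshtein("Sheth", "Scheth"))
print(jaccard("Federated Database Systems",
              "Federated Database Systems for Managing Distributed, "
              "Heterogeneous, and Autonomous Databases"))
```

Edit distance suits short fields with typos such as names, while the token-based Jaccard score is more forgiving for longer fields such as titles.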

Relational Similarity
• Relational features are often set-based
  ◦ Set of coauthors for a paper
  ◦ Set of cities in a country
  ◦ Set of products manufactured by a manufacturer
• Can use the set similarity functions mentioned earlier
  ◦ Common Neighbors: intersection size
  ◦ Jaccard's Coefficient: normalize by union size
  ◦ Adar Coefficient (Adamic/Adar): weighted set similarity
• Can reason about similarity in sets of values
  ◦ Average or Max
  ◦ Other aggregates
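The three set-based measures above can be sketched over coauthor sets as follows. The coauthor names and the `degree` table (how many records each coauthor is linked to) are made-up illustration data, and the Adamic/Adar weighting shown is the usual 1/log(degree) form.

```python
import math


def common_neighbors(a: set, b: set) -> int:
    """Size of the intersection of the two neighbor sets."""
    return len(a & b)


def jaccard_coefficient(a: set, b: set) -> float:
    """Intersection size normalized by union size."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def adamic_adar(a: set, b: set, degree: dict) -> float:
    """Weighted overlap: rare shared neighbors count more than common ones.
    Neighbors with degree <= 1 are skipped to avoid dividing by log(1) = 0."""
    return sum(1.0 / math.log(degree[n]) for n in a & b if degree[n] > 1)


# Hypothetical coauthor sets for two author references.
coauthors_x = {"Larson", "Kashyap", "Rusinkiewicz"}
coauthors_y = {"Larson", "Kashyap", "Navathe"}
# Hypothetical number of records each coauthor appears in.
degree = {"Larson": 2, "Kashyap": 4, "Rusinkiewicz": 3, "Navathe": 5}

print(common_neighbors(coauthors_x, coauthors_y))
print(jaccard_coefficient(coauthors_x, coauthors_y))
print(adamic_adar(coauthors_x, coauthors_y, degree))
```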

Pairwise Match Score
Problem: Given a vector of component-wise similarities for a pair of records (x, y), compute P(x and y match).
Solutions:
• Weighted sum or average of component-wise similarity scores. A threshold determines match or non-match.
  ◦ 0.5 * 1st-author-match-score + 0.2 * venue-match-score + 0.3 * paper-match-score
  ◦ Hard to pick weights.
  ◦ Hard to tune a threshold.
• Formulate rules about what constitutes a match.
  ◦ (1st-author-match-score > 0.7 AND venue-match-score > 0.8) OR (paper-match-score > 0.9 AND venue-match-score > 0.9)
  ◦ Manually formulating the right set of rules is hard.
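Both solutions above fit in a few lines of Python. The weights and rule thresholds are taken from the slide; the decision threshold of 0.75, the dictionary keys, and the example scores are assumptions made for illustration.

```python
# Weights from the slide's weighted-sum example.
WEIGHTS = {"first_author": 0.5, "venue": 0.2, "paper": 0.3}


def weighted_score(sims: dict) -> float:
    """Weighted sum of the component-wise similarity scores."""
    return sum(WEIGHTS[name] * sims[name] for name in WEIGHTS)


def weighted_match(sims: dict, threshold: float = 0.75) -> bool:
    """Match if the weighted score reaches the (assumed) threshold."""
    return weighted_score(sims) >= threshold


def rule_match(sims: dict) -> bool:
    """Hand-written rule from the slide."""
    return ((sims["first_author"] > 0.7 and sims["venue"] > 0.8)
            or (sims["paper"] > 0.9 and sims["venue"] > 0.9))


# Hypothetical similarity vector for one record pair.
sims = {"first_author": 0.9, "venue": 0.85, "paper": 0.6}
print(weighted_score(sims), weighted_match(sims), rule_match(sims))
```

The sketch makes the slide's point concrete: every constant in it (three weights, one threshold, four rule cut-offs) has to be chosen by hand.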

Basic ML Approach
• r = (x, y) is a record pair, γ is the comparison vector, M is the set of matches, U is the set of non-matches
• Decision rule
  ◦ R > t ⇒ r → Match
  ◦ R < t ⇒ r → Non-Match
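The slide text does not define R. A common choice, assumed here, is the likelihood ratio R = P(γ | r ∈ M) / P(γ | r ∈ U) from the Fellegi-Sunter model. The sketch below further assumes a binary comparison vector and conditionally independent components; the probabilities `M_PROBS` and `U_PROBS` are made-up values, not estimates from any dataset.

```python
def likelihood_ratio(gamma, m, u):
    """R = P(gamma | M) / P(gamma | U) under conditional independence.
    m[i] = P(gamma_i = 1 | M), u[i] = P(gamma_i = 1 | U)."""
    ratio = 1.0
    for agrees, m_i, u_i in zip(gamma, m, u):
        if agrees:
            ratio *= m_i / u_i
        else:
            ratio *= (1.0 - m_i) / (1.0 - u_i)
    return ratio


def classify(gamma, m, u, t):
    """Decision rule from the slide. The slide leaves R == t unspecified;
    this sketch treats it as a non-match."""
    return "Match" if likelihood_ratio(gamma, m, u) > t else "Non-Match"


# Hypothetical agreement probabilities for three compared fields.
M_PROBS = [0.9, 0.8, 0.95]   # P(field agrees | pair is a match)
U_PROBS = [0.1, 0.2, 0.05]   # P(field agrees | pair is a non-match)

print(classify((1, 1, 1), M_PROBS, U_PROBS, t=1.0))
print(classify((0, 0, 0), M_PROBS, U_PROBS, t=1.0))
```

In practice the m and u probabilities would be estimated from labeled pairs or with an unsupervised method such as EM rather than set by hand.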

Completeness, Conciseness, and Correctness
Schema Matching: same attribute semantics

Completeness, Conciseness, and Correctness
Duplicate detection: same real-world entities

Completeness, Conciseness, and Correctness
Data Fusion: resolve uncertainties and contradictions
Dimensions labeled in the figure: intensional completeness, intensional conciseness, extensional completeness, extensional conciseness

Data Fusion
• Problem
  ◦ Given a duplicate, create a single object representation while resolving conflicting data values.
• Difficulties
  ◦ Null values: subsumption and complementation
  ◦ Contradictions in data values
  ◦ Uncertainty & truth: discover the true value and model uncertainty in this process
  ◦ Metadata: preferences, recency, correctness
  ◦ Lineage: keep original values and their origin
  ◦ Implementation in DBMS: SQL, extended SQL, UDFs, etc.
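A minimal sketch of the problem stated above: fuse two duplicate records into one, fill nulls from the other source (complementation), resolve contradictions with a per-attribute function, and keep lineage. The resolution functions (`longest`, `coalesce`), the record layout, and the source names "A" and "B" are assumptions chosen for illustration, not the authors' method.

```python
def longest(values):
    """Prefer the value with the greatest length (characters or list entries)."""
    return max(values, key=len)


def coalesce(values):
    """Take the first non-null value."""
    return values[0]


def fuse(records, resolvers):
    """records: list of (source_name, dict). Returns (fused_record, lineage)."""
    attributes = []
    for _, rec in records:
        for attr in rec:
            if attr not in attributes:
                attributes.append(attr)
    fused, lineage = {}, {}
    for attr in attributes:
        # Nulls are dropped, so a value present in only one source survives.
        observed = [(src, rec[attr]) for src, rec in records
                    if rec.get(attr) is not None]
        lineage[attr] = observed  # original values together with their origin
        values = [value for _, value in observed]
        fused[attr] = resolvers.get(attr, coalesce)(values) if values else None
    return fused, lineage


record_a = {"title": "Federated Database Systems",
            "authors": ["Amit Sheth", "James Larson"],
            "year": None}
record_b = {"title": "Federated Database Systems for Managing Distributed, "
                     "Heterogeneous, and Autonomous Databases",
            "authors": ["Scheth & Larson"],
            "year": "1990"}

fused, lineage = fuse([("A", record_a), ("B", record_b)],
                      {"title": longest, "authors": longest, "year": coalesce})
print(fused)
print(lineage["year"])
```

Choosing the longest title and the longer author list happens to work for this pair; a real pipeline would more likely rely on metadata such as source preference or recency, as the slide notes.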

The Field of Data Fusion
[Figure: taxonomy of data fusion. Labels recoverable from the figure: conflict types (uncertainty, contradiction); resolution strategies (ignorance, avoidance, resolution; instance-based and metadata-based variants); operators (join-based, union-based; subsumption, complementation); resolution functions (aggregation, advanced functions; possible worlds, consistent answers).]
