ETL Hadoop - - PowerPoint PPT Presentation

etl hadoop
SMART_READER_LITE
LIVE PREVIEW

ETL Hadoop - - PowerPoint PPT Presentation

ETL Hadoop . . , ..,


slide-1
SLIDE 1

Программирование методов

разрешения сущностей и слияния данных

при реализации ETL в среде Hadoop

Вовченко А.Е., Калиниченко Л.А., Ковалев Д.Ю. alexey.vovchenko@gmail.com Институт Проблем Информатики РАН (ИПИ РАН) RCDL’2014, 13/10/2014

slide-2
SLIDE 2

Outline

 Introduction: ETL  Entity Resolution (ER)  Data Fusion (DF)  ETL+BigData, Jaql+HIL  An HIL-based example of ER + DF

slide-3
SLIDE 3

Schema Mapping Data Transformation Entity Resolution Data Fusion

Information Integration: ETL

Source A Source B

<pub> <Titel> Federated Database Systems </Titel> <Autoren> <Autor> Amit Sheth </Autor> <Autor> James Larson </Autor> </Autoren> </pub> <publication> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <author> Scheth & Larson </author> <year> 1990 </year> </publication>

slide-4
SLIDE 4

Schema Mapping Data Transformation Entity Resolution Data Fusion

Information Integration: ETL

Source A Source B

<pub> <Titel> Federated Database Systems </Titel> <Autoren> <Autor> Amit Sheth </Autor> <Autor> James Larson </Autor> </Autoren> </pub> <publication> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <author> Scheth & Larson </author> <year> 1990 </year> </publication> <pub> <title> </title> <Autoren> <author> </author> <author> </author> </Autoren> <year> </year> </pub>

Schema Mapping Schema Integration

slide-5
SLIDE 5

Schema Mapping Data Transformation Entity Resolution Data Fusion

Information Integration: ETL

Source A Source B

<pub> <Titel> Federated Database Systems </Titel> <Autoren> <Autor> Amit Sheth </Autor> <Autor> James Larson </Autor> </Autoren> </pub> <publication> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <author> Scheth & Larson </author> <year> 1990 </year> </publication> <pub> <title> Federated Database Systems </title> <Autoren> <author> Amit Sheth </author> <author> James Larson </author> </Autoren> </pub> <pub> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <Autoren> <author> Scheth & Larson </author> </Autoren> <year> 1990 </year> </pub>

XQuery XQuery Transformation queries or views

slide-6
SLIDE 6

Schema Mapping Data Transformation Entity Resolution Data Fusion

Information Integration: ETL

Source A Source B

<pub> <Titel> Federated Database Systems </Titel> <Autoren> <Autor> Amit Sheth </Autor> <Autor> James Larson </Autor> </Autoren> </pub> <publication> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <author> Scheth & Larson </author> <year> 1990 </year> </publication> <pub> <title> Federated Database Systems </title> <Autoren> <author> Amit Sheth </author> <author> James Larson </author> </Autoren> </pub> <pub> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <Autoren> <author> Scheth & Larson </author> </Autoren> <year> 1990 </year> </pub>

slide-7
SLIDE 7

Schema Mapping Data Transformation Entity Resolution Data Fusion

Information Integration: ETL

Source A Source B

<pub> <title> Federated Database Systems </title> <Autoren> <author> Amit Sheth </author> <author> James Larson </author> </Autoren> </pub> <pub> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <Autoren> <author> Scheth & Larson </author> </Autoren> <year> 1990 </year> </pub> <pub> <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title> <Autoren> <author> Amit Sheth </author> <author> James Larson </author> </Autoren> <year> 1990 </year> </pub>

Preserve lineage

slide-8
SLIDE 8

Outline

 Introduction: ETL  Entity Resolution (ER)  Data Fusion (DF)  ETL+BigData, Jaql+HIL  An HIL-based example of ER + DF

slide-9
SLIDE 9

What is Entity Resolution?

Problem of identifying and linking/grouping different manifestations of the same real world object. Examples of manifestations and objects:

 Different ways of addressing (names, email addresses,

FaceBook accounts) the same person in text.

 Web pages with differing descriptions of the same

business.

 Different photos of the same object.  …

slide-10
SLIDE 10

Entity Resolution duplicate names

slide-11
SLIDE 11

Challenges in ER

 Name/Attribute ambiguity  Errors due to data entry  Missing Values  Changing Attributes  Data formatting  Abbreviations / Data Truncation

Thomas Cruise

slide-12
SLIDE 12

ER overview

 Let consider Data already Prepared  Schema normalization  Data normalization  Similarity  Pairwise ER  Determining whether or not a pair of records match

slide-13
SLIDE 13

 Equality on a boolean

predicate

 Edit distance

 Levenstein, Smith‐Waterman,

Affine

 Set similarity

 Jaccard, Dice

 Vector Based

 Cosine similarity, TFIDF

 Alignment‐based or

Two‐tiered

 Jaro‐Winkler, Soft‐TFIDF

, Monge‐Elkan

 Phonetic Similarity

 Soundex

 Translation‐based  Numeric distance between

values

 Domain‐specific

Summary of Similarity

 Useful packages

 SecondString, http://secondstring.sourceforge.net/  Simmetrics: http://sourceforge.net/projects/simmetrics/  LingPipe, http://alias‐i.com/lingpipe/index.html

Handle Typographical errors

Good for Text like reviews/ tweets Useful for Abbreviations and alternate names.

slide-14
SLIDE 14

Relational Similarity

 Relational features are often set‐based

 Set of coauthors for a paper  Set of cities in a country  Set of products manufactured by manufacturer

 Can use set similarity functions mentioned earlier

 Common Neighbors: Intersection size  Jaccard’s Coefficient: Normalize by union size  Adar Coefficient: Weighted set similarity

 Can reason about similarity in sets of values

 Average or Max  Other aggregates

slide-15
SLIDE 15

Pairwise Match Score

Problem: Given a vector of component‐wise similarities for a pair of records (x,y), compute P(x and y match). Solutions:

 Weighted sum or average of component‐wise similarity

  • scores. Threshold determines match or non‐match.

 0.5*1st‐author‐match‐score + 0.2*venue‐match‐score +

0.3*paper‐match‐score.

 Hard to pick weights.  Hard to tune a threshold.

 Formulate rules about what constitutes a match.

 (1st‐author‐match‐score > 0.7 AND venue‐match‐score > 0.8)

OR (paper‐match‐score > 0.9 AND venue‐match‐score > 0.9)

 Manually formulating the right set of rules is hard.

slide-16
SLIDE 16

Basic ML Approach

 r = (x,y) is record pair, γ is comparison vector, M

matches, U nonmatches

 Decision rule  R > t  r  Match  R < t  r  Non-Match

slide-17
SLIDE 17

Outline

 Introduction: ETL  Entity Resolution (ER)  Data Fusion (DF)  ETL+BigData, Jaql+HIL  An HIL-based example of ER + DF

slide-18
SLIDE 18

Completeness, Conciseness, and Correctness

Schema Matching: Same attribute semantics

slide-19
SLIDE 19

Completeness, Conciseness, and Correctness

Duplicate detection: Same real-world entities

slide-20
SLIDE 20

Completeness, Conciseness, and Correctness

Data Fusion: Resolve uncertainties and contradictions

Extensional completeness Intensional completeness Intensional conciseness Extensional conciseness

slide-21
SLIDE 21

Data Fusion

 Problem

 Given a duplicate, create a single object representation

while resolving conflicting data values.

 Difficulties

 Null values: Subsumption and complementation  Contradictions in data values  Uncertainty & truth: Discover the true value and model

uncertainty in this process

 Metadata: Preferences, recency, correctness  Lineage: Keep original values and their origin  Implementation in DBMS: SQL, extended SQL, UDFs, etc.

slide-22
SLIDE 22

The Field of Data Fusion

Data Fusion Operators Resolution strategies Conflict types

Ignorance Consistent answers Uncertainty Contradiction Avoidance Resolution Instance-based Metadata-based

Resolution functions

Possible worlds Join-based Union-based Instance-based Metadata-based Complementation Advanced functions Aggregation Subsumption

slide-23
SLIDE 23

Function Description Examples Min, Max, Sum, Count, Avg Standard aggregation NumChildren, Salary, Height Random Random choice Shoe size Longest, Shortest Longest/shortest value First_name Choose(source) Value from a particular source DoB (DMV), CEO (SEC) ChooseDepending(val, col) Value depends on value chosen in other column city & zip, e-mail & employer Vote Majority decision Rating Coalesce First non-null value First_name Group, Concat Group or concatenate all values Book_reviews MostRecent Most recent (up-to-date) value Address MostAbstract, MostSpecific, CommonAncestor Use a taxonomy / ontology Location Escalate Export conflicting values gender … … …

Conflict Resolution Functions

slide-24
SLIDE 24

MostRecent MostAbstract MostSpecific

Classification of Functions

conflict ignorance conflict avoidance conflict resolution conflict resolution strategies instance based instance based metadata based metadata based deciding mediating deciding mediating

Coalesce ChooseDepending Concat AVG, SUM MIN, MAX Random Vote Choose Escalate CommonAncestor

slide-25
SLIDE 25

Data Fusion Goals

a, b, c  a, b, c, d Source 1(A,B,C)  a, b, d Source 2(A,B,D)  a, b, c, - a, b, -, d  a, b, -  a, b, -, - a, b, - a, b, -, - a, b, -, -  a, b, c  a, f(b,e), c, d a, e, d a, b, c, - a, e, -, d  a, b, c  a, b, c, - a, b, - a, b, c, - a, b, -, -  Identical tuples Subsumed tuples Conflicting tuples Complementing tuples

slide-26
SLIDE 26

Relational Operators – Overview

 Identical tuples

 UNION, OUTER UNION

 Subsumed tuples (uncertainty)

 MINIMUM UNION

 Complementing tuples (uncertainty)

 COMPLEMENT UNION, MERGE

 Conflicting tuples (contradiction)

 Relational approaches: Match, Group, Fuse, …

 Other approaches

 Possible worlds, probabilistic answers, consistent

answers

slide-27
SLIDE 27

Minimum Union

 Union: Elimination

  • f exact duplicates

 Minimum Union:

Elimination of subsumed tuples

 Outer union  Subsumption A B C a b c e f g m n

  • A

B D a b  e f h m p  A B C D a b c  a b   e f g  e f  h m n

m p  

+

=

R A tuple t1 subsumes a tuple t2, if it has same schema, has less NULL-values, and coincides in all non-NULL-values.

A B C D a b c  e f g  e f  h m n

m p  

slide-28
SLIDE 28

Full Disjunction

 Represents all

possible combinations of source tuples

 Full outer join on all common

attributes

 All combinations for more than two sources

 Minimum union over results

A B C a b c e f g k 

  • k

m  A B D a b  e f h m p  k q r A B C D a b c  e f g h m p   k 

k m   k q  r

|⋈|

=

R

A B C D a b c  e f g h m p   k 

k m   k q  r

slide-29
SLIDE 29

Complement Union – Proposal

 Elimination of

complementing tuples

 Outer union  Complementation

 No known SQL

rewriting

A B C a b c e f g m n

  • A

B D a b  e f h m p  A B C D a b c  a b   e f g  e f  h m n

m p  

+

=

R⇅ A tuple t1 complements a tuple t2, if it has same schema and coincides in all non-NULL-values.

A B C D a b c  e f g h m n

m p  

Includes duplicate removal and subsumption

slide-30
SLIDE 30

WITH OU AS ( ( SELECT A, B, C, NULL AS D FROM U1 ) UNION (ALL) ( SELECT A, B, NULL AS C, D FROM U2 ) ), SELECT A, MAX(B), MIN(C), SUM(D) FROM OU GROUP BY A

Grouping and Aggregation

 Outer union then group by real-world ID  Aggregate all other columns using conflict resolving

aggregate function

 Efficient implementations  Catches inter- and intra-source duplicates  Restricted to built-in

aggregate-functions

 MAX, MIN, AVG,

VAR, STDDEV, SUM, COUNT

slide-31
SLIDE 31

SELECT ID, RESOLVE(Title, Choose(IMDB)), RESOLVE(Year, Max), RESOLVE(Director), RESOLVE(Rating), RESOLVE(Genre, Concat) FUSE FROM IMDB, Filmdienst FUSE BY (ID) ON ORDER Year DESC

FUSE BY

 SQL extensions to resolve uncertainties and contradictions  FUSE FROM implies OUTER UNION

 Removes subsumed and duplicate tuples by default

 FUSE BY declares real-world ID  RESOLVE specifies conflict resolution function from catalog

 Default: COALESCE

 Implemented on top of relational DBMS “XXL”

slide-32
SLIDE 32

Outline

 Introduction: ETL  Entity Resolution (ER)  Data Fusion (DF)  ETL+BigData, Jaql+HIL  An HIL-based example of ER + DF

slide-33
SLIDE 33

The Four V's of Big Data

slide-34
SLIDE 34

What is Hadoop

 Hadoop framework consists on two main layers

Distributed file system (HDFS) Execution engine (MapReduce)

slide-35
SLIDE 35

ETL + Big Data = Jaql

 Provide a simple, yet powerful language to manipulate semi-

structured data.

 Use JSON as a data model  Data is usually converted to/from JSON view  Most data has a natural JSON representation

 Easily extended using Java, Python, …  Exploit massive parallelism using Hadoop

Unix Jaql cat var -> merge join join grep filter cut, paste, sed, tr map sort sort head top uniq distinct sort sort > filename write tee tee

slide-36
SLIDE 36

When Jaql should be used

 Hive

 Good for relational data  Sql like syntax with UDF

 Pig

 Good for complex data (objects)  Good for small scripts (UDF)

 Jaql

 God for complex and semi-

structured data (for example for text analysis)

 Modular structure  Functions allow to build complex

programs

 Jaql combines both approaches

 SQL syntax (like Hive)  Jaql data flow style for semi-

structured data (like Pig) Pig Hive Jaql Developed Yahoo! Facebook IBM Language Pig Latin HiveQL Jaql Type Data flow Declarative (sql-like) Data flow & sql-like Data structures Complex structures structure Semi-structured, JSON Data Schema Not required required Not required Turing Compleetness Yes, but with UDFs Yes, but with UDFs Yes source sink

  • perator
  • perator
slide-37
SLIDE 37

HIL

 HIL captures multiple stages together in one

framework:

 Mapping and cleansing  Entity resolution (ER, or linkage)  Fusion and aggregation

 Rich data model: define and handle any types of

entities and relationships

 Relational, XML, semi-structured (JSON)  Entities to be linked (by ER) do not need to have the same

schema

 HIL targets multiple platforms (Jaql Map/Reduce,

RDBMS, MDM, Quality Stage)

slide-38
SLIDE 38

HIL vs Jaql

slide-39
SLIDE 39

Outline

 Introduction: ETL  Entity Resolution (ER)  Data Fusion (DF)  ETL+BigData, Jaql+HIL  An HIL-based example of ER + DF

slide-40
SLIDE 40

HIL: ER

insert into Deduplicated select [gen: [id: g.id, name: g.name, value: g.value], dup: [id: d.id, name: d.name, value: d.value], value: compareObject(g,d)] from Generated g, Duplicated d where compareObject(g, d) > 0.7;

slide-41
SLIDE 41

Jaql: DF Minimum Union

is_subsumed = fn(i,j) (( isnull(j.a) or (i.a == j.a) ) and ( isnull(j.b) or (i.b == j.b) ) and ( isnull(j.c) or (i.c == j.c) ) and ( isnull(j.d) or (i.d == j.d) ) and ( i != j)) ; removeSubsumed = fn (a) ( b = a, subs = for (i0 in b) [a->filter is_subsumed(i0,$)], s = subs -> expand, a -> filter not $ in s); minUnion = fn(id,a) ( {id:id, minunion : removeSubsumed(a)});

slide-42
SLIDE 42

insert into FusionIndex![id: f.gen.id] select [a: f.gen.a, b: f.gen.b, c: f.gen.c] from Deduplicated f; insert into FusionIndex![id: f.dup.id] select [a: f.dup.a, b: f.dup.b, d: f.dup.d] from Deduplicated f;

HIL: DF Minimum Union

create link Deduplicated as select [gen: [id: a.id, a:a.a, b:a.b, c:a.c], dup: [id: b.id, a:b.a, b:b.b, d:b.d]] from A a, B b match using rule1: a.id = b.id; insert into MinimumUnion select minUnion(i.dup.id, FusionIndex![id : i.dup.id]) from Deduplicated i;

slide-43
SLIDE 43

HIL: DF Minimum Union

insert into MinimumUnion select minUnion(i.dup.id, FusionIndex![id : i.dup.id]) from Deduplicated i;

slide-44
SLIDE 44

HIL: DF FuseBy

44

@jaql{ average = fn($a) avg($a[*].age); any = fn($a) any($a[*].name); concat = fn ($a) strJoin($a[*].info,"_"); } insert into Fused select [ id : i.dup.id, age: average(FusionIndex![id : i.dup.id]), name: any(FusionIndex![id : i.dup.id]), info: concat(FusionIndex![id: i.dup.id])] from Deduplicated i;

slide-45
SLIDE 45

Conclusion

Data Integration: ETL Entity Resolution Data Fusion BigData ETL (Jaql, HIL) An HIL-based example of ER + DF