 
              Human-in-the-loop Data Integration Guoliang Li Department of Computer Science, Tsinghua University, China http://dbgroup.cs.tsinghua.edu.cn/ligl
Acknowledgement Jianhua Feng Lizhu Zhou Beng Chin Ooi Chen Li @Tsinghua @Tsinghua @NUS @UCI 2
Acknowledgement Jiannan Wang Dong Deng Ju Fan Yudian Zheng aP@SFU PostDoc@MIT AP@RUC @HKU On Job Market! Thank everyone who support me! ü Collaborators ü Students ü Friends ü ...... 3
Data Integration (DI) Combine data in different sources and provide users with a unified view Data Integration Brand Product Region Price Brand Product Loc Price Sales Apple iPhone6S Beijing 4000 Apple iPhone6S Beijing 4000 40K Apple iPhone6SP Beijing 5000 Apple iPhone6SP Beijing 5000 30K Samsung Galaxy S7 Beijing 3500 Samsung Galaxy S7 Beijing 3500 35K Name Loc Sales Bei Jing 40K 6S 4.7 ’ Data Analysis Is there any correlation Bei Jing 30K 6S 5.5 ’ between location and revenue? S7 Bei Jing 35K Data Science Pipeline: Data Integration à Data Analysis 4
Data Integration (DI) p Data Integration is important and challenging – New York Times • 80% of a data science project is to clean and integrate the data, while 20% is actual data analysis – Mark Schreiber of Merck • data scientists spend 98% of on “grunt work” and • only one hour per week on “useful work” p In many communities – DB, AI, KDD, Web 5
�� �� � ��� � � Entity Matching in DI p Date Integration – data acquisition, extraction, cleaning, schema matching, entity matching, etc. p Entity Matching (EM): A core problem – Find pairs of records referring to the same entity Brand Product Region Price Apple iPhone6S Beijing 4000 Apple iPhone6SP Beijing 5000 Samsung Galaxy S7 Beijing 3500 Name Loc Sales Bei Jing 40K Apple 6S 4.7 ’ Bei Jing 30K Apple 6S 5.5 ’ Data are full of errors Samsung S7 Bei Jing 35K and inconsistencies. 6
Hybrid Human-Machine EM R Rule-based Crowd-based Refinement Candidate Generation Name Model Size Candidate 6S iPhone 5.7 Round-based Signature-based pairs Graph Model Verification Matching S7 Samsung 5.7 Latency Control Filtering pairs S Rules Unified Tuple-Level Rule Product Type Color Quality Control Cost Control Generation iPhone6 6th red Samsung 7gen white DIMA System CDB System p Similarity-based p Crowd-powered SQL processing system p New optimization Model p Entity Matching p Entity Matching p Ease to use p Cost, Latency, Quality 7
Hybrid Human-Machine EM iPhone6S iPhone 6S 4.7 ’ iPhone6SP iPhone 6S 5.5 ’ Two tables of records Galaxy S7 Samsung S7 Pruning 4 dissimilar pairs Machine-Based iPhone 6S 0.75 iPhone 6S 4.7 ’ Algorithms iPhone 6S 0.75 iPhone 6S 5.5 ’ Pruning dissimilar pairs iPhone 6SP 0.72 iPhone 6S 4.7 ’ Jaccard iPhone 6SP 0.72 iPhone 6S 5.5 ’ Edit distance candidate record pairs Galaxy S7 Samsung S7 0.5 Semantics iPhone6S Samsung S7 0.1 iPhone6S Samsung S7 0.1 Crowd-Based Galaxy S7 0.1 iPhone 6S 4.7 ’ Algorithms Galaxy S7 0.1 iPhone 6S 5.5 ’ Asking crowd to label some pairs Removing 2 non-matched pairs iPhone 6S iPhone 6S 4.7 ’ matching record pairs iPhone 6SP iPhone 6S 5.5 ’ Galaxy S7 Samsung S7 8
Dima: Distributed In-memory Similarity-based System Query interface CLI JDBC Scala Program ü Extended SQL simSQL Parser DataFrame API ü Easy to use Distributed In-memory Similarity-based Query Optimizer Processing Engine Similarity-based Query Operations ü Indexing ü Similarity Operations Global Indexing Local Indexing ü Optimizer Spark Support similarity-based RDBMS HDFS Native RDD query processing Ji Sun, Zeyuan Shang, Guoliang Li , Dong Deng, Zhifeng Bao. Dima: A Distributed In-Memory 9 Similarity-Based Query Processing System. VLDB 2017
CDB: A Crowd-powered Database CQL Parser Result Collection Fine-grained Graph-Based Query Model Tuple-level Graph model Query Optimization MetaData Cost Control Task Multi-goal Latency Control Worker optimization Quality Control Assignment !"#$ %##&'()*(+ !,-+. /(0*,*(1* Crowd UI Designer Statistics EaseCrowd Crowdsourcing Relational ChinaCrowd Platforms Database Guoliang Li , Chengliang Chai, Ju Fan, Jian Li, Yudian Zheng. CDB: A Crowd- Powered Database. SIGMOD 2017. 10
DIMA: Rule-Based Matching Brand Capacity Price Name Storage Sales Apple iPhone6S 64 4000 64GB 40K Apple 6S 4.7 ’ Apple iPhone6SP 128G 5000 128GB 30K Apple 6S 5.5 ’ Samsung Galaxy S7 64G 3500 Samsung S7 64GB 35K Storage is similar to Capacity Name is similar to Brand Jaccard(Name, Brand) ≥0.8 ∧ ED(Storage, Capacity)<2 Jiannan Wang, Guoliang Li , Jeffrey Xu Yu, Jianhua Feng: Entity Matching: How Similar Is 11 Similar. VLDB , 2011:622-633.
DIMA: Rule-Based Matching p Challenges – How to obtain the rules? Attribute pairs • High-quality rules • Explainable, Programmable Threshold – How to apply the rules? • Avoid Cartesian product Similarity Functions • Fast and Scalable 12
Quantifying Rules M: positive examples N: negative examples M Ψ ∩ M M Ψ ∩ N M φ : record pairs that satisfy a rule φ M Ψ : record pairs that satisfy a rule set Ψ={φ} Objective function: | M Ψ ∩ M|- |M Ψ ∩ N | Goal: Find a rule set Ψ to maximize | M Ψ ∩ M|-|M Ψ ∩ N| 13
Rule Generation a,b; Attributes Attribute pairs NP-hard Venue Effective Algorithm Computer Science … Chemical Sciences Database … System … Chemical Sciences (general) Threshold SIGMOD VLDB … … RSC Advances … ICPADS δ: [0,1] Similarity Infinite à Finite Functions ü F(a,b) on attribute pairs String: edit distance F Set: Jaccard Knowledge 14
Rule-Based Matching p Challenges – How to obtain the rules? Attribute pairs • High-quality rules • Explainable, Programmable Threshold – How to apply the rules? • Avoid Cartesian product Similarity Functions • Fast and Scalable 15
Applying Rules A Rule: Name is similar to Brand It is expensive to enumerate every record pairs! ü 10million*10million Signature-based Method ü If two records do not share a common signature, they cannot be matching 1 1 1 3 1 1 2 3 2 2 3 Guoliang Li , Dong Deng, Jiannan Wang, Jianhua Feng: Pass-Join: A Partition-based 16 Method for Similarity Joins. VLDB , 2012:253-264.
Dima - Signature CLI JDBC Scala Program EMSQL Parser DataFrame API EM Query Optimizer EM Query Operations Signature-based Method Global Indexing Local Indexing ü If two records do not share a common Balance-Aware Signature Generation signature, they cannot be matching Spark RDBMS HDFS Native RDD 1 2 1 2 3 1 1 3 3 2 Balance-Aware Signature ü The signatures are selectable ü Balance the workload 4 Dong Deng, Guoliang Li , He Wen, Jianhua Feng. An Efficient Partition Based Method for 17 Exact Set Similarity Joins. VLDB , 2016
Dima: Load Balance CLI JDBC Scala Program EMSQL Parser DataFrame API EM Query Optimizer Challenges EM Query Operations p How to generate signatures? Global Indexing Local Indexing Balance-Aware Signature Generation – Partition-based Spark p How to select the signatures? RDBMS HDFS Native RDD – Dynamic programming p How to balance the workload? – NP-hard η l ⇣ X X F � [ g ]+ W j = – Greedy algorithms b i i =1 g 2 pSig + s,i,l & P ( g )= j �⌘ X F + [ g ] + X F � [ g ] + F + [ g ] � c i g 2 pSig + g 2 pSig − s,i,l & P ( g )= j s,i,l & P ( g )= j ( ( 1 Z [ i ] = 1 1 Z [ i ] = 2 b i = c i = 0 Z [ i ] 6 = 1 0 Z [ i ] 6 = 2 η l X s.t. Z [ i ] � θ | s | ,l . i =1 18
CLI JDBC Scala Program Dima: Indexing EMSQL Parser DataFrame API EM Query Optimizer Signature-based partition EM Query Operations ü Local join Global Indexing Local Indexing ü Avoid join on different nodes Balance-Aware Signature Generation Spark Global Indexing RDBMS HDFS Native RDD ü Signature à nodes Local Indexing ü Signature à records DataRDD segment/deletion FrequencyTable IndexRDD LocalIndex + - sig=(seg/del,i,l) F (sig), F (sig) Index HashMap + sig L [pos] ���������� <signature> 0 ��������� - 0 sig L [pos] Global <signature> Data Array ���������� 1 ��������� 1 Mapping 1 record P(sig) 2 record … … … … record n record ���������� ��������� <signature> n n 19
CLI JDBC Scala Program Dima: Query Processing EMSQL Parser DataFrame API EM Query Optimizer EM Operation EM Query Operations ü Selection Global Indexing Local Indexing ü Join Balance-Aware Signature Generation Spark ü Topk RDBMS HDFS Native RDD EM Query Processing ü Global ü ZipPartition; Balance-aware ü Local ü Avoid Duplicates IndexRDD ZipPartition FrequencyTable DataRDD segment/deletion ProbeRDD + - (seg,i,l) F (sig), F (sig) ����� ; ����� ��������� 0 0 ��������� ����� ; ����� <signature> ���������� 0 ��������� 1 ��������� 1 Global <signature> Mapping … ���������� ����� ; ����� 1 },1,5):2 … … P(sig) … ���������� ��������� n <signature> ���������� ������ n 20
Recommend
More recommend