Human-in-the-loop Data Integration Guoliang Li Department of - PowerPoint PPT Presentation

Human-in-the-loop Data Integration Guoliang Li Department of Computer Science, Tsinghua University, China http://dbgroup.cs.tsinghua.edu.cn/ligl

Acknowledgement Jianhua Feng Lizhu Zhou Beng Chin Ooi Chen Li @Tsinghua @Tsinghua @NUS @UCI 2

Acknowledgement Jiannan Wang Dong Deng Ju Fan Yudian Zheng aP@SFU PostDoc@MIT AP@RUC @HKU On Job Market! Thank everyone who support me! ü Collaborators ü Students ü Friends ü ...... 3

Data Integration (DI) Combine data in different sources and provide users with a unified view Data Integration Brand Product Region Price Brand Product Loc Price Sales Apple iPhone6S Beijing 4000 Apple iPhone6S Beijing 4000 40K Apple iPhone6SP Beijing 5000 Apple iPhone6SP Beijing 5000 30K Samsung Galaxy S7 Beijing 3500 Samsung Galaxy S7 Beijing 3500 35K Name Loc Sales Bei Jing 40K 6S 4.7 ’ Data Analysis Is there any correlation Bei Jing 30K 6S 5.5 ’ between location and revenue? S7 Bei Jing 35K Data Science Pipeline: Data Integration à Data Analysis 4

Data Integration (DI) p Data Integration is important and challenging – New York Times • 80% of a data science project is to clean and integrate the data, while 20% is actual data analysis – Mark Schreiber of Merck • data scientists spend 98% of on “grunt work” and • only one hour per week on “useful work” p In many communities – DB, AI, KDD, Web 5

�� Entity Matching in DI p Date Integration – data acquisition, extraction, cleaning, schema matching, entity matching, etc. p Entity Matching (EM): A core problem – Find pairs of records referring to the same entity Brand Product Region Price Apple iPhone6S Beijing 4000 Apple iPhone6SP Beijing 5000 Samsung Galaxy S7 Beijing 3500 Name Loc Sales Bei Jing 40K Apple 6S 4.7 ’ Bei Jing 30K Apple 6S 5.5 ’ Data are full of errors Samsung S7 Bei Jing 35K and inconsistencies. 6

Hybrid Human-Machine EM R Rule-based Crowd-based Refinement Candidate Generation Name Model Size Candidate 6S iPhone 5.7 Round-based Signature-based pairs Graph Model Verification Matching S7 Samsung 5.7 Latency Control Filtering pairs S Rules Unified Tuple-Level Rule Product Type Color Quality Control Cost Control Generation iPhone6 6th red Samsung 7gen white DIMA System CDB System p Similarity-based p Crowd-powered SQL processing system p New optimization Model p Entity Matching p Entity Matching p Ease to use p Cost, Latency, Quality 7

Hybrid Human-Machine EM iPhone6S iPhone 6S 4.7 ’ iPhone6SP iPhone 6S 5.5 ’ Two tables of records Galaxy S7 Samsung S7 Pruning 4 dissimilar pairs Machine-Based iPhone 6S 0.75 iPhone 6S 4.7 ’ Algorithms iPhone 6S 0.75 iPhone 6S 5.5 ’ Pruning dissimilar pairs iPhone 6SP 0.72 iPhone 6S 4.7 ’ Jaccard iPhone 6SP 0.72 iPhone 6S 5.5 ’ Edit distance candidate record pairs Galaxy S7 Samsung S7 0.5 Semantics iPhone6S Samsung S7 0.1 iPhone6S Samsung S7 0.1 Crowd-Based Galaxy S7 0.1 iPhone 6S 4.7 ’ Algorithms Galaxy S7 0.1 iPhone 6S 5.5 ’ Asking crowd to label some pairs Removing 2 non-matched pairs iPhone 6S iPhone 6S 4.7 ’ matching record pairs iPhone 6SP iPhone 6S 5.5 ’ Galaxy S7 Samsung S7 8

Dima: Distributed In-memory Similarity-based System Query interface CLI JDBC Scala Program ü Extended SQL simSQL Parser DataFrame API ü Easy to use Distributed In-memory Similarity-based Query Optimizer Processing Engine Similarity-based Query Operations ü Indexing ü Similarity Operations Global Indexing Local Indexing ü Optimizer Spark Support similarity-based RDBMS HDFS Native RDD query processing Ji Sun, Zeyuan Shang, Guoliang Li , Dong Deng, Zhifeng Bao. Dima: A Distributed In-Memory 9 Similarity-Based Query Processing System. VLDB 2017

CDB: A Crowd-powered Database CQL Parser Result Collection Fine-grained Graph-Based Query Model Tuple-level Graph model Query Optimization MetaData Cost Control Task Multi-goal Latency Control Worker optimization Quality Control Assignment !"#$ %##&'()*(+ !,-+. /(0*,*(1* Crowd UI Designer Statistics EaseCrowd Crowdsourcing Relational ChinaCrowd Platforms Database Guoliang Li , Chengliang Chai, Ju Fan, Jian Li, Yudian Zheng. CDB: A Crowd- Powered Database. SIGMOD 2017. 10

DIMA: Rule-Based Matching Brand Capacity Price Name Storage Sales Apple iPhone6S 64 4000 64GB 40K Apple 6S 4.7 ’ Apple iPhone6SP 128G 5000 128GB 30K Apple 6S 5.5 ’ Samsung Galaxy S7 64G 3500 Samsung S7 64GB 35K Storage is similar to Capacity Name is similar to Brand Jaccard(Name, Brand) ≥0.8 ∧ ED(Storage, Capacity)<2 Jiannan Wang, Guoliang Li , Jeffrey Xu Yu, Jianhua Feng: Entity Matching: How Similar Is 11 Similar. VLDB , 2011:622-633.

DIMA: Rule-Based Matching p Challenges – How to obtain the rules? Attribute pairs • High-quality rules • Explainable, Programmable Threshold – How to apply the rules? • Avoid Cartesian product Similarity Functions • Fast and Scalable 12

Quantifying Rules M: positive examples N: negative examples M Ψ ∩ M M Ψ ∩ N M φ : record pairs that satisfy a rule φ M Ψ : record pairs that satisfy a rule set Ψ={φ} Objective function: | M Ψ ∩ M|- |M Ψ ∩ N | Goal: Find a rule set Ψ to maximize | M Ψ ∩ M|-|M Ψ ∩ N| 13

Rule Generation a,b; Attributes Attribute pairs NP-hard Venue Effective Algorithm Computer Science … Chemical Sciences Database … System … Chemical Sciences (general) Threshold SIGMOD VLDB … … RSC Advances … ICPADS δ: [0,1] Similarity Infinite à Finite Functions ü F(a,b) on attribute pairs String: edit distance F Set: Jaccard Knowledge 14

Rule-Based Matching p Challenges – How to obtain the rules? Attribute pairs • High-quality rules • Explainable, Programmable Threshold – How to apply the rules? • Avoid Cartesian product Similarity Functions • Fast and Scalable 15

Applying Rules A Rule: Name is similar to Brand It is expensive to enumerate every record pairs! ü 10million*10million Signature-based Method ü If two records do not share a common signature, they cannot be matching 1 1 1 3 1 1 2 3 2 2 3 Guoliang Li , Dong Deng, Jiannan Wang, Jianhua Feng: Pass-Join: A Partition-based 16 Method for Similarity Joins. VLDB , 2012:253-264.

Dima - Signature CLI JDBC Scala Program EMSQL Parser DataFrame API EM Query Optimizer EM Query Operations Signature-based Method Global Indexing Local Indexing ü If two records do not share a common Balance-Aware Signature Generation signature, they cannot be matching Spark RDBMS HDFS Native RDD 1 2 1 2 3 1 1 3 3 2 Balance-Aware Signature ü The signatures are selectable ü Balance the workload 4 Dong Deng, Guoliang Li , He Wen, Jianhua Feng. An Efficient Partition Based Method for 17 Exact Set Similarity Joins. VLDB , 2016

Dima: Load Balance CLI JDBC Scala Program EMSQL Parser DataFrame API EM Query Optimizer Challenges EM Query Operations p How to generate signatures? Global Indexing Local Indexing Balance-Aware Signature Generation – Partition-based Spark p How to select the signatures? RDBMS HDFS Native RDD – Dynamic programming p How to balance the workload? – NP-hard η l ⇣ X X F � [ g ]+ W j = – Greedy algorithms b i i =1 g 2 pSig + s,i,l & P ( g )= j �⌘ X F + [ g ] + X F � [ g ] + F + [ g ] � c i g 2 pSig + g 2 pSig − s,i,l & P ( g )= j s,i,l & P ( g )= j ( ( 1 Z [ i ] = 1 1 Z [ i ] = 2 b i = c i = 0 Z [ i ] 6 = 1 0 Z [ i ] 6 = 2 η l X s.t. Z [ i ] � θ | s | ,l . i =1 18

CLI JDBC Scala Program Dima: Indexing EMSQL Parser DataFrame API EM Query Optimizer Signature-based partition EM Query Operations ü Local join Global Indexing Local Indexing ü Avoid join on different nodes Balance-Aware Signature Generation Spark Global Indexing RDBMS HDFS Native RDD ü Signature à nodes Local Indexing ü Signature à records DataRDD segment/deletion FrequencyTable IndexRDD LocalIndex + - sig=(seg/del,i,l) F (sig), F (sig) Index HashMap + sig L [pos] �� <signature> 0 �� - 0 sig L [pos] Global <signature> Data Array �� 1 �� 1 Mapping 1 record P(sig) 2 record … … … … record n record �� <signature> n n 19

CLI JDBC Scala Program Dima: Query Processing EMSQL Parser DataFrame API EM Query Optimizer EM Operation EM Query Operations ü Selection Global Indexing Local Indexing ü Join Balance-Aware Signature Generation Spark ü Topk RDBMS HDFS Native RDD EM Query Processing ü Global ü ZipPartition; Balance-aware ü Local ü Avoid Duplicates IndexRDD ZipPartition FrequencyTable DataRDD segment/deletion ProbeRDD + - (seg,i,l) F (sig), F (sig) �� ; �� 0 0 �� ; �� <signature> �� 0 �� 1 �� 1 Global <signature> Mapping … �� ; �� 1 },1,5):2 … … P(sig) … �� n <signature> �� n 20

Human-in-the-loop Data Integration Guoliang Li Department of - PowerPoint PPT Presentation

Human-in-the-loop Data Integration Guoliang Li Department of Computer Science, Tsinghua University, China http://dbgroup.cs.tsinghua.edu.cn/ligl Acknowledgement Jianhua Feng Lizhu Zhou Beng Chin Ooi Chen Li @Tsinghua @Tsinghua @NUS @UCI

Closing the Loop Closing the Loop Closing the Loop Closing the Loop Closing the Loop Closing

Repetition Types of Loops Counting loop Know how many times to loop

Chicken Human 1 Human 2 Rat Chicken Human 1 Human 2 Rat Chicken Human 1 Human 2 Rat

Trading Strategies Introduction Trading Loop Trading Loop Trading Loop Trading Loop Three

Coarse-Grained Parallelism Variable Privatization, Loop Alignment, Loop Fusion, Loop

Loop Invariants: Part 2 7 January 2019 OSU CSE 1 Maintaining the Loop Invariant A claimed

Loop Optimizations Important because lots of execution Loop Optimizations Loop Optimizations

Upper and Lower Loop Bound Estimation by Symbolic Execution and Loop Acceleration Pavel Cadek

Enhancing Fine- Grained Parallelism Loop vectorization, Loop distribution, Scalar expansion

c } false loop body P (postcondition) Loop Invariant Defn : A boolean condition that

Systems Systems Systems Integration Systems Integration Systems Systems Integration Systems

Objectives You should be able to ... Loop Invariants Explain the concept of well formed

Loop Statements & Vectorizing Code Chapter 5 Attaway MATLAB 4E for loop used as a

Trace while Loop, cont. Trace while Loop, cont. Print Welcome to Java Print Welcome to Java int

Loop Transformations for Parallelism & Locality Previously Loop transformations,

Open loop synthesis for closed loop control Kazufumi Ito, North Carolina State University June

Review Ch 1-5 Executing code Compile code (convert from C++ to computer code) - Syntax errors

PERFORMANCE OPTIMISATION Adrian Jackson adrianj@epcc.ed.ac.uk Hardware design Image from Colfax

Nonassociative Lie Theory Ivan P . Shestakov The International Conference on Group Theory in

Conformal field theory from conformal loop ensembles Benjamin Doyon Kings College London

PERFORMANCE ANALYSIS OF CONTAINERIZED APPLICATIONS ON LOCAL AND REMOTE STORAGE Qiumin Xu 1 , Manu

Local Netlist Transformations in Lagrangian Relaxation Apostolos Stefanidis, Dimitrios Mangiras,

Broadband Mapping International Good Practices and World Bank Experience 2 July 2019 1 Agenda

1 SA-C Image Processing Support Cameron Project Objective Data parallelism through tight

Human-in-the-loop Data Integration Guoliang Li Department of - PowerPoint PPT Presentation

Human-in-the-loop Data Integration Guoliang Li Department of Computer Science, Tsinghua University, China http://dbgroup.cs.tsinghua.edu.cn/ligl Acknowledgement Jianhua Feng Lizhu Zhou Beng Chin Ooi Chen Li @Tsinghua @Tsinghua @NUS @UCI

Closing the Loop Closing the Loop Closing the Loop Closing the Loop Closing the Loop Closing

Repetition Types of Loops Counting loop Know how many times to loop

Chicken Human 1 Human 2 Rat Chicken Human 1 Human 2 Rat Chicken Human 1 Human 2 Rat

Trading Strategies Introduction Trading Loop Trading Loop Trading Loop Trading Loop Three

Coarse-Grained Parallelism Variable Privatization, Loop Alignment, Loop Fusion, Loop

Loop Invariants: Part 2 7 January 2019 OSU CSE 1 Maintaining the Loop Invariant A claimed

Loop Optimizations Important because lots of execution Loop Optimizations Loop Optimizations

Upper and Lower Loop Bound Estimation by Symbolic Execution and Loop Acceleration Pavel Cadek

Enhancing Fine- Grained Parallelism Loop vectorization, Loop distribution, Scalar expansion

c } false loop body P (postcondition) Loop Invariant Defn : A boolean condition that

Systems Systems Systems Integration Systems Integration Systems Systems Integration Systems

Objectives You should be able to ... Loop Invariants Explain the concept of well formed

Loop Statements &amp; Vectorizing Code Chapter 5 Attaway MATLAB 4E for loop used as a

Trace while Loop, cont. Trace while Loop, cont. Print Welcome to Java Print Welcome to Java int

Loop Transformations for Parallelism &amp; Locality Previously Loop transformations,

Open loop synthesis for closed loop control Kazufumi Ito, North Carolina State University June

Review Ch 1-5 Executing code Compile code (convert from C++ to computer code) - Syntax errors

PERFORMANCE OPTIMISATION Adrian Jackson adrianj@epcc.ed.ac.uk Hardware design Image from Colfax

Nonassociative Lie Theory Ivan P . Shestakov The International Conference on Group Theory in

Conformal field theory from conformal loop ensembles Benjamin Doyon Kings College London

PERFORMANCE ANALYSIS OF CONTAINERIZED APPLICATIONS ON LOCAL AND REMOTE STORAGE Qiumin Xu 1 , Manu

Local Netlist Transformations in Lagrangian Relaxation Apostolos Stefanidis, Dimitrios Mangiras,

Broadband Mapping International Good Practices and World Bank Experience 2 July 2019 1 Agenda

1 SA-C Image Processing Support Cameron Project Objective Data parallelism through tight

Loop Statements & Vectorizing Code Chapter 5 Attaway MATLAB 4E for loop used as a

Loop Transformations for Parallelism & Locality Previously Loop transformations,