
Cost-efficient Data Acquisition on Online Data Marketplaces for Correlation Analysis
VLDB’19
Yanying Li (1), Haipei Sun (1), Boxiang Dong (2), Hui (Wendy) Wang (1)
(1) Stevens Institute of Technology, Hoboken, NJ
(2) Montclair State University, Montclair, NJ
August 28, 2019

Data Marketplace
The rising demand for valuable online datasets has led to the emergence of data marketplaces.
Data seller: specifies data views for sale and their prices.
Data shopper: decides which views to purchase.

Data Acquisition
We consider the data shopper’s need to be correlation analysis.

(a) D_S: source instance owned by data shopper Adam

Age       Zipcode  Population
[35, 40]  10003    7,000
[20, 25]  01002    3,500
[55, 60]  07003    1,200
[35, 40]  07003    5,800
[35, 40]  07304    2,000

(b) Relevant instances on the data marketplace

D_1: Zipcode table (FD: Zipcode → State)

Zipcode  State
07003    NJ     (correct)
07304    NJ     (correct)
10001    NY     (correct)
10001    NJ     (wrong)

D_2: Data and statistics of diseases by state

State       Disease       # of cases
MA          Flu           300
NJ          Flu           400
Florida     Lyme disease  130
California  Lyme disease  40
NJ          Lyme disease  200

D_3: Insurance & disease data instance

Age       Address       Insurance         Disease
[35, 40]  10 North St.  UnitedHealthCare  Flu
[20, 25]  5 Main St.    MedLife           HIV
[35, 40]  25 South St.  UnitedHealthCare  Flu

Need: Find correlation between age groups and diseases in New Jersey

Data Acquisition
• Requirement 1: Meaningful join
D_S ⋈ D_3 is meaningless, as it associates aggregate data with individual records.

D_S ⋈ D_3:

Age       Zipcode  Population  Address       Insurance         Disease
[35, 40]  10003    7,000       10 North St.  UnitedHealthCare  Flu
[35, 40]  10003    7,000       25 South St.  UnitedHealthCare  Flu
[20, 25]  01002    3,500       5 Main St.    MedLife           HIV
[35, 40]  07003    5,800       10 North St.  UnitedHealthCare  Flu
[35, 40]  07003    5,800       10 North St.  UnitedHealthCare  Flu
[35, 40]  07304    2,000       25 South St.  UnitedHealthCare  Flu
[35, 40]  07304    2,000       25 South St.  UnitedHealthCare  Flu

Data Acquisition
• Requirement 1: Meaningful join
• Requirement 2: High data quality
We consider data inconsistency as the main quality issue.

FD: Zipcode → State

Zipcode  State
07003    NJ     (correct)
07304    NJ     (correct)
10001    NY     (correct)
10001    NJ     (wrong)

Data Acquisition
• Requirement 1: Meaningful join
• Requirement 2: High data quality
• Requirement 3: Budget constraint
The data shopper has a purchase budget. The price of the purchased datasets must be within the budget.

Our Contributions
We design a middleware service named DANCE, a Data Acquisition framework on oNline data market for CorrElation analysis, that
• provides cost-efficient data acquisition service;
• enables budget-conscious search of the high-quality data;
• maximizes the correlation of the desired attributes.

Outline
1 Introduction
2 Related Work
3 Preliminaries
4 DANCE
  • Offline Phase
  • Online Phase
5 Experiments
6 Conclusion

Related Work
Data Market
• Query-based pricing model [KUB+15]
• History-aware pricing model [U+16]
• Arbitrage-free pricing model [KUB+12, LK14, DK17]
Data Exploration via Join
• Summary graph [YPS11]
• Reverse engineering [ZEPS13]
These works do not consider data quality or budget.

Preliminaries - Data Pricing
• In this paper, we mainly focus on query-based pricing functions [KUB+15].
  Input: explicit prices for a few views
  Output: the derived price for any view
• DANCE is compatible with any pricing model.

Preliminaries - Data Quality
We define data quality as the fraction of tuples that are correct with regard to all the functional dependencies.

TID  A   B   C   D   E
t1   a1  b2  c1  d1  e1
t2   a1  b2  c1  d1  e1
t3   a1  b2  c2  d1  e1
t4   a1  b2  c3  d1  e2
t5   a1  b3  c3  d2  e2

FD: A → B, D → E
$C(D, A \to B) = \{t_1, t_2, t_3, t_4\}$
$C(D, D \to E) = \{t_1, t_2, t_3, t_5\}$
$Q(D) = 3/5 = 0.6$
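The quality computation on this slide can be reproduced with a short sketch. It assumes that a tuple is "correct" for an FD when its right-hand-side value is the most frequent one among tuples sharing the same left-hand-side value; the slide does not spell this rule out, but it yields the same sets C(D, A → B) and C(D, D → E). The function names and the ROWS/FDS layout are illustrative only.

```python
from collections import Counter, defaultdict

def correct_tuples(rows, lhs, rhs):
    """TIDs whose rhs value is the most frequent rhs value for their lhs value.

    Assumption: this majority rule is our reading of 'correct' on the slide.
    """
    groups = defaultdict(Counter)
    for row in rows.values():
        groups[row[lhs]][row[rhs]] += 1
    majority = {key: cnt.most_common(1)[0][0] for key, cnt in groups.items()}
    return {tid for tid, row in rows.items() if row[rhs] == majority[row[lhs]]}

def quality(rows, fds):
    """Fraction of tuples that are correct with regard to every FD."""
    correct = set(rows)
    for lhs, rhs in fds:
        correct &= correct_tuples(rows, lhs, rhs)
    return len(correct) / len(rows)

# The five tuples from the slide.
ROWS = {
    "t1": dict(A="a1", B="b2", C="c1", D="d1", E="e1"),
    "t2": dict(A="a1", B="b2", C="c1", D="d1", E="e1"),
    "t3": dict(A="a1", B="b2", C="c2", D="d1", E="e1"),
    "t4": dict(A="a1", B="b2", C="c3", D="d1", E="e2"),
    "t5": dict(A="a1", B="b3", C="c3", D="d2", E="e2"),
}
FDS = [("A", "B"), ("D", "E")]
```

Here `quality(ROWS, FDS)` intersects the two correct sets, leaving t1, t2 and t3, which gives 3/5.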

Preliminaries - Join Informativeness
Definition (Join Informativeness). Given two instances $D$ and $D'$, let $J$ be their join attribute(s). The join informativeness of $D$ and $D'$ is defined as
\[
JI(D, D') = \frac{Entropy(D.J, D'.J) - I(D.J, D'.J)}{Entropy(D.J, D'.J)},
\]
computed from the joint distribution of $D.J$ and $D'.J$ in the output of the full outer join of $D$ and $D'$, where $I$ denotes mutual information.
• It penalizes joins with excessive numbers of unmatched values [YPS09].
• $0 \le JI(D, D') \le 1$.
• The smaller $JI(D, D')$ is, the more important the join connection between $D$ and $D'$.
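A minimal sketch of the join informativeness formula follows. It assumes a single join attribute given as two plain lists of values, pads unmatched values with None as a full outer join would, and uses base-2 logarithms; the function names are illustrative, and None is assumed not to occur as a real join value.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (base 2) of a Counter of frequencies."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def join_informativeness(left, right):
    """JI over the full outer join of two lists of join-attribute values."""
    lc, rc = Counter(left), Counter(right)
    joint = Counter()
    for v in set(lc) | set(rc):
        if v in lc and v in rc:
            joint[(v, v)] += lc[v] * rc[v]   # matched rows pair up
        elif v in lc:
            joint[(v, None)] += lc[v]        # left-only rows, padded on the right
        else:
            joint[(None, v)] += rc[v]        # right-only rows, padded on the left
    h_xy = entropy(joint)
    if h_xy == 0:
        return 0.0  # a single distinct pair: treated as 0 by convention here
    x, y = Counter(), Counter()
    for (a, b), c in joint.items():
        x[a] += c
        y[b] += c
    mutual_info = entropy(x) + entropy(y) - h_xy
    return (h_xy - mutual_info) / h_xy
```

With fully matching values, for example `join_informativeness([1, 2], [1, 2])`, the joint entropy equals the mutual information and the result is 0. With two disjoint pairs of values, `join_informativeness([1, 2], [3, 4])`, the result is 0.5.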

Preliminaries - Correlation Measurement
Definition (Correlation Measurement). Given a dataset $D$ and two attribute sets $X$ and $Y$, the correlation of $X$ and $Y$, $CORR(X, Y)$, is measured as
• $CORR(X, Y) = Entropy(X) - Entropy(X \mid Y)$ if $X$ is categorical,
• $CORR(X, Y) = h(X) - h(X \mid Y)$ if $X$ is numerical,
where $h(X)$ is the cumulative entropy of attribute $X$,
\[
h(X) = -\int P(X \le x) \log P(X \le x)\, dx,
\]
and
\[
h(X \mid Y) = \int h(X \mid y)\, p(y)\, dy.
\]
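For the categorical case, the definition can be sketched as follows. The sketch assumes X and Y are given as two equally long lists of paired observations and uses base-2 logarithms; the numerical (cumulative entropy) case is not covered, and the function names are illustrative.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a list of categorical values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(xs, ys):
    """Entropy(X | Y): entropy of X within each Y group, weighted by group size."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    n = len(xs)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def corr_categorical(xs, ys):
    """CORR(X, Y) = Entropy(X) - Entropy(X | Y) for categorical X."""
    return entropy(xs) - conditional_entropy(xs, ys)
```

When Y determines X, as in `corr_categorical(["flu", "flu", "lyme", "lyme"], ["a", "a", "b", "b"])`, the conditional entropy vanishes and the correlation equals Entropy(X). When X is distributed identically within every Y group, the correlation is 0.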

Problem Statement
Input: a set of data instances $\mathcal{D} = \{D_1, \dots, D_n\}$, source attributes $A_S$, target attributes $A_T$, purchase budget $B$, join informativeness threshold $\alpha$, and quality threshold $\beta$.
Output: a set of data views $T \subseteq \mathcal{D}$ such that
\[
\begin{aligned}
\max_{T}\quad & CORR(A_S, A_T) && \text{(correlation)} \\
\text{subject to}\quad & \forall T_i \in T,\ \exists D_j \in \mathcal{D} \text{ s.t. } T_i \subseteq D_j, \\
& \sum_{T_i \in S \cup T} JI(T_i, T_{i+1}) \le \alpha, && \text{(informativeness)} \\
& Q(T) \ge \beta, && \text{(quality)} \\
& p(T) \le B. && \text{(budget)}
\end{aligned}
\]

Framework of DANCE
[Figure: DANCE framework. Components: data marketplace, data shopper, and DANCE (offline phase: construction of join graph; online phase: data acquisition). Labeled flows: request for samples, samples, join graph, source instances, correlation (A_S, A_T), data purchase query, purchased data.]
Offline Phase: Construct a two-layer join graph of the datasets on the marketplace.
Online Phase: Process data acquisition requests.

Dealing with Large-scale Data
Correlated Sampling
$S = \{ t_i \in D \mid h(t_i[J]) \le p \}$
Estimation from Samples
• $E(JI(S_1, S_2)) = JI(D_1, D_2)$
• $E(Q(S_1 \bowtie S_2)) = Q(D_1 \bowtie D_2)$
• $E(CORR_{S_1 \bowtie S_2}(A_S, A_T)) = CORR_{D_1 \bowtie D_2}(A_S, A_T)$
Re-sampling
We design a correlated re-sampling method to deal with large join results from samples in the case of long join paths.
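The sampling rule above can be sketched in a few lines. The slide does not define h and p; this sketch assumes h is a hash function that maps a join-attribute value to the unit interval and p is the sampling rate, and it picks SHA-256 purely for illustration. The function names and the dictionary row layout are hypothetical.

```python
import hashlib

def unit_hash(value):
    """Map a join value to the unit interval (assumed role of h on the slide)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def correlated_sample(rows, join_key, rate):
    """Keep the rows whose hashed join value is at most the sampling rate p."""
    return [row for row in rows if unit_hash(row[join_key]) <= rate]

# Toy instances sharing the join attribute "zipcode" (values taken from the slides).
D_S = [{"zipcode": z, "age": a} for z, a in
       [("10003", "[35, 40]"), ("01002", "[20, 25]"), ("07003", "[55, 60]"),
        ("07003", "[35, 40]"), ("07304", "[35, 40]")]]
D_1 = [{"zipcode": z, "state": s} for z, s in
       [("07003", "NJ"), ("07304", "NJ"), ("10001", "NY"), ("10001", "NJ")]]

S_S = correlated_sample(D_S, "zipcode", 0.5)
S_1 = correlated_sample(D_1, "zipcode", 0.5)
```

Because both instances are filtered with the same hash on the join value, a zipcode that survives in one sample also survives in the other, so the samples still join with each other. Independent uniform sampling would not give this guarantee.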

Offline Phase: Construction of Join Graph
Construct a two-layer join graph from the data samples.
Instance layer
  Nodes: data instances
  Edges: join attribute and the minimum informativeness
Attribute set layer
  Nodes: attribute sets
  Edges: join attribute and informativeness
[Figure: example two-layer join graph for instances D1 and D2. Instance level: D1 and D2 connected by an edge labeled (B, 0.45). Attribute set level: attribute sets AB, AC, BC, ABC and BC, BD, CD, BE, CE, DE, BCD, BCE, BDE, CDE, BCDE, connected by edges labeled (B, 0.45), (C, 0.6), and (BC, 0.5).]

Online Phase: Data Acquisition
We design a two-step algorithm to search for the data views.
Step 1: Find minimal weighted graphs at the instance layer.
• It is equivalent to the Steiner tree problem and is NP-hard [Vaz13].
• We adapt the approximate shortest path search algorithm [GBSW10] based on landmarks.
[Figure: instance-layer graph over D1 to D9 with join edges, marking the source attribute set, the target attribute set, and the landmark nodes.]
Step 2: Find optimal target graphs at the attribute set layer based on Markov chain Monte Carlo (MCMC).
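The landmark idea used in Step 1 can be illustrated with the generic technique: precompute exact distances from a few landmark nodes, then estimate the distance between any two nodes as the cheapest detour through a landmark. This is a sketch of that general scheme, not of the adapted algorithm in DANCE; the graph, its edge weights, and all function names are made up for illustration.

```python
import heapq

def dijkstra(graph, source):
    """Exact shortest-path distances from source in a weighted graph."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def build_index(graph, landmarks):
    """Precompute distances from every landmark to all reachable nodes."""
    return {lm: dijkstra(graph, lm) for lm in landmarks}

def estimate_distance(index, s, t):
    """Upper bound on dist(s, t): best route forced through some landmark."""
    inf = float("inf")
    return min((d.get(s, inf) + d.get(t, inf) for d in index.values()),
               default=inf)

# Hypothetical undirected instance graph; weights stand in for join informativeness.
GRAPH = {
    "D1": {"D2": 0.2, "D4": 0.4},
    "D2": {"D1": 0.2, "D3": 0.3},
    "D3": {"D2": 0.3, "D4": 0.4},
    "D4": {"D1": 0.4, "D3": 0.4},
}
```

In this toy graph the true shortest path from D1 to D3 runs through D2, so an index built on landmark D2 returns the exact distance, while an index built only on D4 overestimates it. The estimate is never below the true distance in an undirected graph, which is why landmark choice matters.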
