ERACER: A Database Approach for Statistical Inference and Data Cleaning
Chris Mayfield, Jennifer Neville, Sunil Prabhakar
Department of Computer Science, Purdue University, West Lafayette, Indiana, USA
SIGMOD 2010, Indianapolis
Problem
Real data is dirty
◮ Inconsistent, incomplete, corrupted, outdated, etc.
◮ Safety measures (e.g., constraints) are often not used
◮ Poor decisions based on dirty data cost billions annually
Data cleaning is hard
◮ Typically ad hoc, interactive, exploratory, etc.
◮ Uncertain process: what to do with the “errors”?
◮ Maintenance of results (e.g., lineage/provenance)
◮ Consumes a large amount of data management time
(see Fan, Geerts, & Jia, VLDB 2008 tutorial)
Example 1: Genealogy Data
5M people from Pedigree Resource File
◮ Person (ind id, birth, death)
◮ Relative (ind id, rel id, role)
[Figure: family tree with birth (B) and death (D) years; several individuals have unknown (?) birth or death years]
Integrated from many sources, e.g.:
◮ Census records
◮ Immigration lists
◮ Family history societies
◮ Individual submissions
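A hedged DDL sketch of the Person and Relative relations above (column types and the role values are assumptions, not from the slides):

-- Hedged sketch of the genealogy schema; types are assumptions.
CREATE TABLE person (
    ind_id integer PRIMARY KEY,
    birth  integer,              -- birth year, often missing
    death  integer               -- death year, often missing
);

CREATE TABLE relative (
    ind_id integer REFERENCES person,
    rel_id integer REFERENCES person,
    role   text                  -- e.g., 'father', 'mother', 'child'
);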
Example 2: Sensor Networks
2M readings of Intel Lab Data
◮ Sensor (epoch, mote id, temp, humid, light, volt)
◮ Neighbor (id1, id2, distance)
◮ 54 sensors, every 31 seconds, for 38 days
◮ ≈ 18% obviously incorrect
◮ Multiple data types
Source: http://db.csail.mit.edu/labdata/labdata.html
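A hedged DDL sketch of the Sensor and Neighbor relations (column types and the key choice are assumptions):

-- Hedged sketch of the Intel Lab schema; types are assumptions.
CREATE TABLE sensor (
    epoch    integer,
    mote_id  integer,
    temp     double precision,   -- degrees Celsius
    humid    double precision,   -- relative humidity (%)
    light    double precision,
    volt     double precision,
    PRIMARY KEY (epoch, mote_id)
);

CREATE TABLE neighbor (
    id1      integer,
    id2      integer,
    distance double precision    -- meters between motes id1 and id2
);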
Insight
Correlations within tuples, e.g.:
◮ Birth and death years
◮ Temperature and humidity values
Correlations across tuples, e.g.:
◮ Parents and children
◮ Neighboring sensors
Apply statistical relational learning
◮ Don’t just clean tuples in isolation (e.g., functional dependencies)
◮ Propagate inferences multiple times
Input:
◮ Possible tuple dependencies
◮ Correlation model skeleton
Output:
◮ PDFs for missing data
◮ Flags for dirty data
Baseline Approach:
◮ Bayesian networks
◮ Exact inference (junction tree)
◮ Bayes Net Toolbox for Matlab
Bayesian Network Formulation
Model template specifies conditional dependencies:
[Figure: model template (Individual with Birth and Death, linked to Father and Mother) unrolled into a network of birth/death nodes for the individual (I.b, I.d), parents (F.b, F.d, M.b, M.d), spouse (S.b, S.d), and children (C1, C2, C3)]
Conditional probability distribution (CPD) at each node:
◮ P(I.d | I.b): death year, given the birth year
◮ P(I.b | M.b, F.b): birth year, given parent birth years
Prior distribution at nodes with no parents: P(I.b)
Simplified version of Relational Bayesian Networks (see e.g., Getoor & Taskar 2007)
Maximum Likelihood Estimation
[Figure: the unrolled Bayesian network from the previous slide]
- 1. Learn CPDs from data, e.g.:
CREATE TABLE cpt_birth AS
SELECT birth, death, count(*)
FROM person
GROUP BY birth, death;
- 2. Share CPDs across all nodes (see the normalization sketch after step 3):
-- P(I.d | I.b = 1750)
SELECT death, count FROM cpt_birth WHERE birth = 1750;
- 3. Run inference (e.g., junction tree)
◮ Construct Bayesian network
◮ Bind evidence (query from DB)
◮ Extract results (store in DB)
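The counts learned in step 1 are shared as the CPD in step 2; to use them as proper conditional probabilities P(I.d | I.b), a minimal normalization sketch (assuming PostgreSQL window functions; not code from the paper):

-- Turn the raw (birth, death) counts into P(death | birth).
CREATE TABLE cpd_death AS
SELECT birth,
       death,
       count::double precision
         / sum(count) OVER (PARTITION BY birth) AS prob
FROM cpt_birth;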
Challenges and Lessons Learned
Limiting model assumptions
◮ Fixed CPD structure (e.g., always two parents)
◮ Acyclicity constraint (can’t model sensor data)
Potentially millions of parameters
◮ Becomes very inefficient
◮ Floating point underflow
Not scalable to large data sets
◮ DB may not fit into main memory
◮ Moving data in/out of R, Matlab, etc.
Not designed for data cleaning
◮ Propagates outliers/errors in original data
◮ Need to look beyond the Markov blanket
ERACER Approach:
◮ Relational dependency networks
◮ Approximate inference algorithm
◮ SQL-based framework
◮ Integrated data cleaning
Relational Dependency Networks
For example, at each sensor and time epoch:
[Figure: model template (Sensor with Temp and Humid, linked to Neighbors) unrolled into a cyclic network over the temperature and humidity nodes of neighboring sensors (S1.t, S1.h, ..., S4.t, S4.h)]
In contrast to Bayesian networks, RDNs:
◮ approximate the full joint distribution
◮ learn CPDs locally based on component models
◮ allow cyclic dependencies (i.e., many-to-many)
◮ use aggregation to deal with heterogeneity
(see Neville & Jensen, JMLR 2007)
Component Models
Convolution (for genealogy)
◮ parent age: MPA = P(I.b − P.b)
◮ death age: MDA = P(I.d − I.b)
-- death age model
SELECT hist(death - birth) FROM person;
[Figure: histogram of death ages, roughly 20–100 years]
Expected value = 54.0, Standard deviation = 29.0
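A hedged plain-SQL sketch of the death-age model (the paper uses its custom hist aggregate; the normalized-histogram table below is an assumption):

-- Estimate M_DA = P(I.d - I.b) as a normalized histogram of death ages.
CREATE TABLE death_age_model AS
SELECT death - birth AS age,
       count(*)::double precision / sum(count(*)) OVER () AS prob
FROM person
WHERE birth IS NOT NULL AND death IS NOT NULL
GROUP BY death - birth;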
Regression (for sensors)
◮ mean temperature:
S.t ∼ β0 + β1 · S.h + β2 · avg(N.t) + β3 · avg(N.h)
◮ mean humidity:
S.h ∼ γ0 + γ1 · S.t + γ2 · avg(N.t) + γ3 · avg(N.h)
ERACER Framework
Learning (one time, offline):
- 1. Extract graph structure using domain knowledge
- 2. RDNs aggregate existing data to learn parameters
Inference (multiple iterations):
- 3. Apply component models to every value in DB
- 4. Combine predictions to deal with heterogeneity
- 5. Evaluate posterior distributions for cleaning
- 6. Repeat 3–5 until happy (i.e., results converge)
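A hedged sketch of the outer loop, assuming a hypothetical erace_round() function that wraps the per-round SQL shown on the following slides; a fixed round count stands in for a real convergence test:

DO $$
BEGIN
    FOR round IN 1..10 LOOP         -- steps 3-5, repeated
        PERFORM erace_round(round); -- hypothetical wrapper, not the paper's API
    END LOOP;
END $$;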
Step 1: Extract Graphical Model
Construct nodes:
INSERT INTO node
SELECT make_nid(epoch, mote_id),  -- creates simple key
       new_basis(temp),
       new_basis(humid),
       new_basis(light),
       new_basis(volt)
FROM sensor;
The basis data type:
◮ initial: original value, if any (e.g., humid)
◮ pdf: current prediction (or distribution)
◮ suspect: data cleaning flag (true = outlier)
◮ round: when pdf/suspect was last updated
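The paper implements basis as a user-defined type; a minimal composite-type sketch, with the pdf field simplified to an array of histogram weights (an assumption), might look like:

CREATE TYPE basis AS (
    initial double precision,     -- original value, if any
    pdf     double precision[],   -- current prediction (distribution), simplified here
    suspect boolean,              -- data cleaning flag (true = outlier)
    round   integer               -- round when pdf/suspect was last updated
);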
Construct edges:
INSERT INTO link
SELECT make_nid(a.epoch, a.mote_id),
       make_nid(b.epoch, b.mote_id)
FROM neighbor AS c  -- e.g., within 6 meters
INNER JOIN sensor AS a ON c.id1 = a.mote_id
INNER JOIN sensor AS b ON c.id2 = b.mote_id
WHERE a.epoch - 30 <= b.epoch
  AND a.epoch + 30 >= b.epoch;
Step 2: Learn RDN Parameters
Aggregate original data values:
CREATE TABLE learn AS
SELECT
    -- individual instances
    min(expect(i.t)) AS ti,
    min(expect(i.h)) AS hi,
    min(expect(i.l)) AS li,
    min(expect(i.v)) AS vi,
    -- average neighbor values
    avg(expect(n.t)) AS tn,
    avg(expect(n.h)) AS hn,
    avg(expect(n.l)) AS ln,
    avg(expect(n.v)) AS vn
FROM node AS i
LEFT JOIN link AS l ON i.nid = l.id1
LEFT JOIN node AS n ON l.id2 = n.nid
GROUP BY i.nid;
Optional: apply noise filters, sample data, etc.
Estimate applicable component models:
◮ Convolution: use built-in hist aggregate
◮ Regression: export to R; use lm function
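As a rough in-database alternative to the R fit, PostgreSQL's built-in regression aggregates can estimate a simplified single-predictor version of the temperature model from the learn table (a sketch under that simplification, not the paper's multi-predictor fit):

-- S.t ~ beta0 + beta1 * S.h, ignoring the neighbor terms.
SELECT regr_intercept(ti, hi) AS beta0,
       regr_slope(ti, hi)     AS beta1
FROM learn;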
Steps 3–6: Approximate Inference
For each round of inference:
- 1. Update predictions via the erace aggregate query
◮ Infers/cleans all attributes in a single function call
SELECT erace(i, n)
FROM node AS i
LEFT JOIN link AS l ON i.nid = l.id1
LEFT JOIN node AS n ON l.id2 = n.nid
GROUP BY i;
Key design choice: grouping by tuples, not attributes
- 2. Store results via CREATE TABLE AS (i.e., propagation)
◮ Faster than UPDATE over the entire relation (MVCC)
◮ Other optimizations possible (e.g., indexes on nid’s)
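A hedged sketch of one propagation round using CREATE TABLE AS (the intermediate table name and the expansion of erace's composite result are assumptions):

-- Compute the next round of predictions into a fresh table,
-- then swap it in place of node.
CREATE TABLE node_next AS
SELECT (erace(i, n)).*
FROM node AS i
LEFT JOIN link AS l ON i.nid = l.id1
LEFT JOIN node AS n ON l.id2 = n.nid
GROUP BY i;

DROP TABLE node;
ALTER TABLE node_next RENAME TO node;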
erace Aggregate Function
SELECT erace(i, n)
[Figure: example inputs to erace for one tuple: distributions over ti, hi, tn, hn, with some values missing (?) and others given as weighted candidates (e.g., 40% at 21◦)]
For each attribute in i:
◮ Select applicable model
◮ Apply/combine predictions
◮ Evaluate (cf. init and prev)
Data cleaning algorithm:
◮ Run inference for known values, as if missing
◮ Is original evidence within expected range?
◮ Replace outliers with inferred distributions
◮ Do not propagate suspects (rely on other data)
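To illustrate the outlier test on a single attribute, a hedged sketch (expect() is the paper's helper; pdf_stddev() and the three-standard-deviation threshold are assumptions):

-- Flag temperature readings whose original value lies far from the
-- distribution inferred from the other attributes and neighbors.
UPDATE node
SET temp.suspect = true
WHERE (temp).initial IS NOT NULL
  AND abs((temp).initial - expect(temp)) > 3 * pdf_stddev(temp);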
Many more details in the paper!
Experiments:
◮ Generate synthetic populations
◮ Randomly set attributes to NULL
◮ Compare inferred values to original
Genealogy Data Results
Accuracy of birth pdfs:
[Plots: mean absolute error vs. percent missing (per blanket) and percent corrupt (per blanket), comparing BayesNet and ERACER]
Variance (uncertainty):
[Plots: average standard deviation vs. percent missing (per blanket) and percent corrupt (per blanket), comparing BayesNet and ERACER]
Sensor Data Results
Accuracy of temperature pdfs:
[Plots: mean absolute error of temperature vs. percent missing and percent corrupt, comparing Baseline and ERACER]
Accuracy of humidity pdfs:
[Plots: mean absolute error of humidity vs. percent missing and percent corrupt, comparing Baseline and ERACER]
Summary
Database-centric approach for approximate inference
◮ Statistical framework for correcting errors
◮ Efficient; no need to move data to/from R/Matlab
Synergy of imputation and data cleaning tasks
◮ Additional evidence identifies errors more accurately
◮ Corrected data values improve the quality of inference
Empirical evaluation on two real-world data sets
◮ Similar accuracy to Bayesian network baseline
◮ Significant gains in runtime performance
◮ Added benefit of simultaneous data cleaning