

SLIDE 1

ERACER: A Database Approach for Statistical Inference and Data Cleaning

Chris Mayfield, Jennifer Neville, Sunil Prabhakar

Department of Computer Science, Purdue University, West Lafayette, Indiana, USA

SIGMOD 2010, Indianapolis

SLIDE 2

Problem

Real data is dirty

◮ Inconsistent, incomplete, corrupted, outdated, etc.
◮ Safety measures (e.g., constraints) are often not used
◮ Poor decisions based on dirty data cost billions annually

Data cleaning is hard

◮ Typically ad hoc, interactive, exploratory, etc.
◮ Uncertain process: what to do with the “errors”?
◮ Maintenance of results (e.g., lineage/provenance)
◮ Consumes a large amount of data management time (see Fan, Geerts, & Jia, VLDB 2008 tutorial)

Mayfield, Neville, & Prabhakar SIGMOD 2010 2 of 21

SLIDE 3

Example 1: Genealogy Data

5M people from the Pedigree Resource File

◮ Person (ind_id, birth, death)
◮ Relative (ind_id, rel_id, role)

[Pedigree diagram: a family tree in which some individuals have known birth/death years (e.g., B: 1744, D: 1804) while many others are missing one or both (B: ?, D: ?)]

Integrated from many sources, e.g.:

◮ Census records
◮ Immigration lists
◮ Family history societies
◮ Individual submissions


SLIDE 4

Example 2: Sensor Networks

2M readings of Intel Lab Data

◮ Sensor (epoch, mote_id, temp, humid, light, volt)
◮ Neighbor (id1, id2, distance)

◮ 54 sensors, every 31 seconds, for 38 days
◮ ≈ 18% obviously incorrect
◮ Multiple data types

Source: http://db.csail.mit.edu/labdata/labdata.html


SLIDE 5

Insight

Correlations within tuples, e.g.:

◮ Birth and death years
◮ Temperature and humidity values

Correlations across tuples, e.g.:

◮ Parents and children
◮ Neighboring sensors

Apply statistical relational learning

◮ Don’t just clean tuples in isolation (e.g., functional dependencies)
◮ Propagate inferences multiple times

Input:

◮ Possible tuple dependencies
◮ Correlation model skeleton

Output:

◮ PDFs for missing data
◮ Flags for dirty data


SLIDE 6

Baseline Approach

◮ Bayesian networks
◮ Exact inference (junction tree)
◮ Bayes Net Toolbox for Matlab

SLIDE 7

Bayesian Network Formulation

Model template specifies conditional dependencies:

[Diagram: a template over Individual, Father, and Mother records with Birth/Death attributes, unrolled into a network of nodes F.b, F.d, M.b, M.d, I.b, I.d, plus spouse (S.b, S.d) and children (C1–C3)]

Conditional probability distribution (CPD) at each node:

◮ P(I.d | I.b): death year, given the birth year
◮ P(I.b | M.b, F.b): birth year, given parent birth years

Prior distribution at nodes with no parents: P(I.b)

Simplified version of Relational Bayesian Networks (see e.g., Getoor & Taskar 2007)


SLIDE 8

Maximum Likelihood Estimation

1. Learn CPDs from data, e.g.:

    CREATE TABLE cpt_birth AS
    SELECT birth, death, count(*)
    FROM person
    GROUP BY birth, death;

2. Share CPDs across all nodes, e.g., P(I.d | I.b = 1750):

    SELECT death, count
    FROM cpt_birth
    WHERE birth = 1750;

3. Run inference (e.g., junction tree)

◮ Construct Bayesian network
◮ Bind evidence (query from DB)
◮ Extract results (store in DB)
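The two SQL steps above can be exercised end to end with SQLite from Python; the toy person rows are invented for illustration, and a normalization step is added here (the slide's CPT stores raw counts):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy person table (rows invented for illustration)
cur.execute("CREATE TABLE person (ind_id INTEGER, birth INTEGER, death INTEGER)")
cur.executemany("INSERT INTO person VALUES (?, ?, ?)",
                [(1, 1750, 1804), (2, 1750, 1810), (3, 1750, 1804), (4, 1781, 1807)])

# Step 1: learn the CPD as a table of (birth, death) counts
cur.execute("""CREATE TABLE cpt_birth AS
               SELECT birth, death, count(*) AS count
               FROM person GROUP BY birth, death""")

# Step 2: look up and normalize P(I.d | I.b = 1750)
rows = cur.execute("SELECT death, count FROM cpt_birth WHERE birth = 1750").fetchall()
total = sum(c for _, c in rows)
cpd = {death: c / total for death, c in rows}
# cpd == {1804: 2/3, 1810: 1/3} for this toy data
```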


SLIDE 9

Challenges and Lessons Learned

Limiting model assumptions

◮ Fixed CPD structure (e.g., always two parents)
◮ Acyclicity constraint (can’t model sensor data)

Potentially millions of parameters

◮ Becomes very inefficient
◮ Floating point underflow

Not scalable to large data sets

◮ DB may not fit into main memory
◮ Moving data in/out of R, Matlab, etc.

Not designed for data cleaning

◮ Propagates outliers/errors in original data
◮ Need to look beyond the Markov blanket


SLIDE 10

ERACER Approach

◮ Relational dependency networks
◮ Approximate inference algorithm
◮ SQL-based framework
◮ Integrated data cleaning

SLIDE 11

Relational Dependency Networks

For example, at each sensor and time epoch:

[Diagram: a template over a Sensor and its Neighbors, unrolled into a cyclic network of Temp/Humid nodes (S1.t, S1.h, ..., S4.t, S4.h)]

In contrast to Bayesian networks, RDNs:

◮ approximate the full joint distribution
◮ learn CPDs locally based on component models
◮ allow cyclic dependencies (i.e., many-to-many)
◮ use aggregation to deal with heterogeneity (see Neville & Jensen, JMLR 2007)


SLIDE 12

Component Models

Convolution (for genealogy)

◮ parent age: MPA = P(I.b − P.b)
◮ death age: MDA = P(I.d − I.b)

Death age model:

    SELECT hist(death - birth) FROM person;

[Histogram of death age (roughly 20–100 years): expected value = 54.0, standard deviation = 29.0]
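A minimal sketch of the convolution model in Python, under the assumption (matching the slide) that the death-age model is just the empirical histogram of death − birth; the records below are invented:

```python
from collections import Counter

# Toy (birth, death) records, invented for illustration
people = [(1744, 1804), (1743, 1787), (1707, 1766), (1741, 1771), (1763, 1826)]

# Learn the death-age model M_DA = P(I.d - I.b) as an empirical histogram
ages = [d - b for b, d in people]
hist = Counter(ages)
n = len(ages)
death_age_pdf = {age: c / n for age, c in hist.items()}

# Predict a missing death year from a known birth year by shifting the
# death-age distribution (the "convolution" idea on the slide)
def predict_death(birth):
    return {birth + age: p for age, p in death_age_pdf.items()}

pdf = predict_death(1769)
expected_death = sum(year * p for year, p in pdf.items())
# expected_death == 1769 + mean(ages) == 1820.2 for this toy data
```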

Regression (for sensors)

◮ mean temperature:

S.t ∼ β0 + β1 · S.h + β2 · avg(N.t) + β3 · avg(N.h)

◮ mean humidity:

S.h ∼ γ0 + γ1 · S.t + γ2 · avg(N.t) + γ3 · avg(N.h)
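The regression component model can be sketched as ordinary least squares over the slide's features (S.h, avg(N.t), avg(N.h)); the sensor rows and "true" coefficients below are invented, and the paper fits these models with R's lm rather than hand-rolled normal equations:

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ols(rows, targets):
    """Normal equations: (X^T X) beta = X^T y, with an intercept column."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    Xty = [sum(x[i] * y for x, y in zip(X, targets)) for i in range(k)]
    return solve(XtX, Xty)

# Invented sensor rows: (S.h, avg(N.t), avg(N.h)), with S.t generated as
# 1.0 - 0.2*h + 0.9*nt + 0.05*nh, so OLS should recover those coefficients.
feats = [(40.0, 21.0, 38.0), (35.0, 23.0, 36.0), (45.0, 20.0, 41.0),
         (38.0, 22.0, 39.0), (42.0, 24.0, 37.0)]
temps = [1.0 - 0.2 * h + 0.9 * nt + 0.05 * nh for h, nt, nh in feats]
beta = ols(feats, temps)   # approximately [1.0, -0.2, 0.9, 0.05]
```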


SLIDE 13

ERACER Framework

Learning (one time, offline):

1. Extract graph structure using domain knowledge
2. RDNs aggregate existing data to learn parameters

Inference (multiple iterations):

3. Apply component models to every value in DB
4. Combine predictions to deal with heterogeneity
5. Evaluate posterior distributions for cleaning
6. Repeat 3–5 until happy (i.e., results converge)


SLIDE 14

Step 1: Extract Graphical Model

Construct nodes:

    INSERT INTO node
    SELECT make_nid(epoch, mote_id),  -- creates simple key
           new_basis(temp), new_basis(humid),
           new_basis(light), new_basis(volt)
    FROM sensor;

The basis data type:

◮ initial: original value, if any (e.g., humid)
◮ pdf: current prediction (or distribution)
◮ suspect: data cleaning flag (true = outlier)
◮ round: when pdf/suspect was last updated

Construct edges:

    INSERT INTO link
    SELECT make_nid(a.epoch, a.mote_id),
           make_nid(b.epoch, b.mote_id)
    FROM neighbor AS c  -- e.g., within 6 meters
    INNER JOIN sensor AS a ON c.id1 = a.mote_id
    INNER JOIN sensor AS b ON c.id2 = b.mote_id
    WHERE a.epoch - 30 <= b.epoch
      AND a.epoch + 30 >= b.epoch;
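A rough Python analogue of the basis type (field names taken from the slide; the paper implements it as a database type, so this dataclass is purely illustrative):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Basis:
    initial: Optional[float]                    # original value, if any
    pdf: Optional[Tuple[float, float]] = None   # current prediction as (mean, stddev)
    suspect: bool = False                       # data cleaning flag (True = outlier)
    round: int = 0                              # when pdf/suspect was last updated

def new_basis(value):
    """Mimic the slide's new_basis(): wrap an observed (possibly NULL) value."""
    return Basis(initial=value)

b = new_basis(45.2)   # an observed humidity reading
```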


SLIDE 15

Step 2: Learn RDN Parameters

Aggregate original data values:

    CREATE TABLE learn AS
    SELECT
      -- individual instances
      min(expect(i.t)) AS ti, min(expect(i.h)) AS hi,
      min(expect(i.l)) AS li, min(expect(i.v)) AS vi,
      -- average neighbor values
      avg(expect(n.t)) AS tn, avg(expect(n.h)) AS hn,
      avg(expect(n.l)) AS ln, avg(expect(n.v)) AS vn
    FROM node AS i
    LEFT JOIN link AS l ON i.nid = l.id1
    LEFT JOIN node AS n ON l.id2 = n.nid
    GROUP BY i.nid;

Optional: apply noise filters, sample data, etc.

Estimate applicable component models:

◮ Convolution: use built-in hist aggregate
◮ Regression: export to R; use lm function
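The node/link self-join can be tried on a toy schema with SQLite; expect() is a function of the paper's basis type, so plain numeric columns stand in for it here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy node/link tables (plain floats stand in for the basis/expect machinery)
cur.execute("CREATE TABLE node (nid INTEGER PRIMARY KEY, t REAL, h REAL)")
cur.execute("CREATE TABLE link (id1 INTEGER, id2 INTEGER)")
cur.executemany("INSERT INTO node VALUES (?, ?, ?)",
                [(1, 21.0, 40.0), (2, 22.0, 38.0), (3, 24.0, 35.0)])
cur.executemany("INSERT INTO link VALUES (?, ?)", [(1, 2), (1, 3), (2, 1)])

# One row per node: its own values plus averages over its neighbors
rows = cur.execute("""
    SELECT i.nid, min(i.t) AS ti, min(i.h) AS hi,
           avg(n.t) AS tn, avg(n.h) AS hn
    FROM node AS i
    LEFT JOIN link AS l ON i.nid = l.id1
    LEFT JOIN node AS n ON l.id2 = n.nid
    GROUP BY i.nid""").fetchall()
# Node 3 has no outgoing links, so the LEFT JOIN leaves its averages NULL
```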


SLIDE 16

Steps 3–6: Approximate Inference

For each round of inference:

1. Update predictions via the erace aggregate query
   ◮ Infers/cleans all attributes in a single function call

    SELECT erace(i, n)
    FROM node AS i
    LEFT JOIN link AS l ON i.nid = l.id1
    LEFT JOIN node AS n ON l.id2 = n.nid
    GROUP BY i;

   Key design choice: grouping by tuples, not attributes

2. Store results via CREATE TABLE AS (i.e., propagation)
   ◮ Faster than UPDATE over the entire relation (MVCC)
   ◮ Other optimizations possible (e.g., indexes on nid’s)
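The overall round structure (steps 3–6) can be sketched as plain Python over an adjacency list; simple neighbor averaging stands in for the paper's component models, and the tiny graph below is invented:

```python
# Each node holds one attribute value; None marks missing data to infer.
values = {1: 21.0, 2: None, 3: 24.0, 4: None}
neighbors = {1: [2], 2: [1, 3], 3: [2], 4: [2]}

def one_round(vals):
    """Steps 3-4: re-predict every missing value from neighbor estimates."""
    new = dict(vals)
    for nid, v in vals.items():
        if v is None:
            known = [vals[m] for m in neighbors[nid] if vals[m] is not None]
            if known:
                new[nid] = sum(known) / len(known)
    return new

# Steps 5-6: repeat until the estimates stop changing
rounds = 0
while True:
    nxt = one_round(values)
    rounds += 1
    if nxt == values:
        break
    values = nxt
# Node 2 is filled in round 1; node 4 needs a second round, since its only
# neighbor (node 2) was itself missing at the start -- this is the
# "propagate inferences multiple times" behavior from slide 5.
```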


SLIDE 17

erace Aggregate Function

[Diagram: example input to SELECT erace(i, n): the tuple's own attributes (ti, hi) alongside neighbor attributes (tn, hn), with some cells missing (?)]

For each attribute in i:

◮ Select applicable model
◮ Apply/combine predictions
◮ Evaluate (cf. init and prev)

Data cleaning algorithm:

◮ Run inference for known values, as if missing
◮ Is original evidence within expected range?
◮ Replace outliers with inferred distributions
◮ Do not propagate suspects (rely on other data)

Many more details in the paper!
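One way to sketch the outlier test in the cleaning algorithm is a z-score check against the inferred distribution; the 3-standard-deviation threshold below is an assumption, not a number from the paper:

```python
def clean_value(observed, predicted_mean, predicted_std, k=3.0):
    """Run inference as if the value were missing, then compare the
    original evidence against the predicted distribution.
    Returns (value_to_use, suspect_flag)."""
    if observed is None:
        return predicted_mean, False          # plain imputation
    if abs(observed - predicted_mean) > k * predicted_std:
        # Outlier: replace with the inferred distribution's mean and
        # flag it so later rounds do not propagate the bad evidence.
        return predicted_mean, True
    return observed, False

# A plausible reading is kept, an obvious outlier is replaced and flagged,
# and a missing value is imputed.
assert clean_value(21.5, 22.0, 1.0) == (21.5, False)
assert clean_value(95.0, 22.0, 1.0) == (22.0, True)
assert clean_value(None, 22.0, 1.0) == (22.0, False)
```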


SLIDE 18

Experiments

◮ Generate synthetic populations
◮ Randomly set attributes to NULL
◮ Compare inferred values to original

SLIDE 19

Genealogy Data Results

Accuracy of birth pdfs:

[Plots: mean absolute error vs. percent missing (20–60%, per blanket) and vs. percent corrupt (5–25%, per blanket), comparing BayesNet and ERACER]

Variance (uncertainty):

[Plots: average standard deviation vs. percent missing and vs. percent corrupt (per blanket), comparing BayesNet and ERACER]

SLIDE 20

Sensor Data Results

Accuracy of temperature pdfs:

[Plots: mean absolute error vs. percent missing (30–90%) and vs. percent corrupt (10–70%), comparing Baseline and ERACER]

Accuracy of humidity pdfs:

[Plots: mean absolute error vs. percent missing and vs. percent corrupt, comparing Baseline and ERACER]

SLIDE 21

Summary

Database-centric approach for approximate inference

◮ Statistical framework for correcting errors
◮ Efficient; no need to move data to/from R/Matlab

Synergy of imputation and data cleaning tasks

◮ Additional evidence identifies errors more accurately
◮ Corrected data values improve the quality of inference

Empirical evaluation on two real-world data sets

◮ Similar accuracy to Bayesian network baseline
◮ Significant gains in runtime performance
◮ Added benefit of simultaneous data cleaning
