A Scalable Prediction Engine for Automating Structured Data Prep
A Scalable Prediction Engine for Automating Structured Data Prep
Ihab Ilyas, University of Waterloo
The Notorious Data Quality Problem
- Manual labeling, fixing, and best-effort imputation
- Pushing low-quality data to “robust” models?
- A whole ecosystem tackling different aspects
Data Prep is the Impediment to AI
Building downstream ML models is fast and easy thanks to modern tooling, e.g., Overton, Ludwig, TensorFlow, and PyTorch. However, data cleaning and prep are:
- Labor-intensive: no solution offers automated end-to-end data curation infrastructure
- Costly: wrong analytics and human cleaning cost money
ID  Name   ZIP        City           State  Income
1   Green  60610      Chicago        IL     30k
2   Green  60611      Chicago        IL     32k
3   Peter  (missing)  New Yrk        NY     40k
4   John   11507      New York       NY     40k
5   Gree   90057      Los Angeles    CA     55k
6   Chuck  90057      San Francisco  CA     30k
And Problems Don’t Come Piecemeal
Error types illustrated: duplicates, value/syntactic errors, integrity constraint violations, missing values
Cleaning is Hard to Automate
The same table, annotated with repairs:
- Rows 1 and 2 are duplicates; the merged record is 1, Green, 60610, Chicago, IL, 31k
- Row 3: the missing ZIP is imputed as 11507 and “New Yrk” is repaired to “New York”
- Row 6: the city is repaired to “Los Angeles” (consistent with ZIP 90057)
Automating Cleaning with ML
Why ML for Cleaning?
+ Can combine all signals and contexts (rules, constraints, statistics)
+ Avoids rule explosion to cover edge cases
+ Can communicate “confidence” instead of “certain cleaning semantics”
But it is a hard problem:
- Representing data and background knowledge as model inputs (due to sparsity)
- Learning from limited (or no) training data and dirty observations
- Scaling to millions of random variables
State-of-the-art Results
Probabilistic Cleaning Model

A noisy-channel view of cleaning: a probabilistic data generator produces a clean instance I over schema R, and a probabilistic noise generator corrupts it into the observed dirty instance J∗ with probability Pr(J∗ | I).

- Model R as R∆ and model I as IΘ; estimate the parameters ∆ and Θ
- Recover the most likely clean instance:

    I∗ = argmax_I Pr(I) · Pr(J∗ | I)
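The argmax objective can be illustrated with a toy noisy-channel sketch. Everything below (the candidate city domain, the prior probabilities, and the character-edit noise model) is a made-up illustration, not HoloClean’s learned parameters:

```python
# Toy noisy-channel repair: pick the clean value v maximizing Pr(v) * Pr(J* | v).
# All probabilities below are invented for illustration.

def edit_distance_noise(observed, clean, p_typo=0.1):
    """Crude noise model: each character-level edit costs a factor of p_typo."""
    m, n = len(observed), len(clean)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if observed[i - 1] == clean[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return p_typo ** d[m][n]  # more edits -> exponentially less likely

def repair(observed, prior):
    """Return argmax_v Pr(v) * Pr(observed | v) over the candidate domain."""
    return max(prior, key=lambda v: prior[v] * edit_distance_noise(observed, v))

# Hypothetical candidate domain with prior probabilities.
prior = {"New York": 0.6, "Newark": 0.3, "Yonkers": 0.1}
print(repair("New Yrk", prior))  # "New York": one edit away and high prior
```

The same structure scales to tables by defining the prior over whole instances (attribute correlations, constraints) rather than single cells.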
Core AI Elements
- Self-supervision with multi-task learning
- Attention-based contextual representation
- Scale via distributed learning targeting different data partitions
Typical Prep Pipeline
[Pipeline diagram] Signals (trusted/untrusted data, rules, constraints) → domain pruning → automatic compilation to features → error detection (few-shot learning, weak supervision from a few error examples) → labeled training features → repair model builder → inference over the untrusted data’s features → repair suggestions
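As a rough illustration of the “compilation to features” step, the sketch below turns one kind of signal, value co-occurrence statistics, into a per-cell feature. The function names and feature definition are hypothetical, not HoloClean’s actual API:

```python
# Sketch: compile value co-occurrence statistics into a per-cell feature.
# Cells whose value rarely co-occurs with the rest of the row get low scores,
# which an error-detection model can pick up on. Names are illustrative only.
from collections import Counter
from itertools import combinations

rows = [
    {"ZIP": "60610", "City": "Chicago"},
    {"ZIP": "60610", "City": "Chicago"},
    {"ZIP": "90057", "City": "Los Angeles"},
    {"ZIP": "90057", "City": "San Francisco"},  # suspicious pairing
]

# Count how often each (attribute=value, attribute=value) pair co-occurs.
cooc = Counter()
for row in rows:
    for (a1, v1), (a2, v2) in combinations(sorted(row.items()), 2):
        cooc[(a1, v1, a2, v2)] += 1

def cooccurrence_feature(row, attr):
    """Average co-occurrence support between row[attr] and the other cells."""
    supports = []
    for a, v in row.items():
        if a == attr:
            continue
        (a1, v1), (a2, v2) = sorted([(attr, row[attr]), (a, v)])
        supports.append(cooc[(a1, v1, a2, v2)])
    return sum(supports) / len(supports)

print(cooccurrence_feature(rows[3], "City"))  # low support -> likely error
print(cooccurrence_feature(rows[0], "City"))  # high support -> looks clean
```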
Use Case 1: Imputation
Problem: a market research company’s data was missing many labels, which had been filled manually via an expensive and labor-intensive process. HoloClean was used to predict the label of each transaction from the master data (e.g., at the level of an SKU). A subset of the manually labeled data, together with data augmentation, was used to obtain training data.

Outcome: HoloClean was trained on 2 million transactions in 12 hours on a single machine, and predicted categories for 7.5 million transactions in under one hour. It annotated each transaction with a probability distribution over labels and a confidence for each possible label. Accuracy was evaluated using a test set of manually labeled data provided by the user.
k   Accuracy   Avg. Confidence
1   96.8%      97.22%
2   99.4%      95.2%
3   99.8%      94.8%

Error Type                  Sampled
Ground truth is incorrect   1128 (71.7%)
Prediction is incorrect     333 (21.1%)
Uncertain                   112 (7.1%)
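The top-k accuracy reported above counts a prediction as correct when the true label appears among the k most probable labels in the predicted distribution. A minimal sketch with toy predictions (labels and probabilities are invented):

```python
# Top-k accuracy over predicted label distributions. Toy data for illustration.

def top_k_accuracy(predictions, truths, k):
    """predictions: list of {label: probability}; truths: list of true labels."""
    hits = 0
    for dist, truth in zip(predictions, truths):
        top_k = sorted(dist, key=dist.get, reverse=True)[:k]
        hits += truth in top_k
    return hits / len(truths)

preds = [
    {"snacks": 0.7, "beverages": 0.2, "dairy": 0.1},
    {"snacks": 0.5, "beverages": 0.4, "dairy": 0.1},
]
truths = ["snacks", "beverages"]
print(top_k_accuracy(preds, truths, k=1))  # 0.5: second prediction misses
print(top_k_accuracy(preds, truths, k=2))  # 1.0: truth is in both top-2 sets
```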
Use Case 2: Error Detection
Problem: an insurance company’s reference data was noisy and led to poor analytics. A need for “automatic” error detection on categorical data, without any external supervision, was identified.

Outcome: HoloClean trained on 200,000 records in 1.5 hours and ran prediction on 800,000 records in 20 minutes. It produced a data set with each cell annotated with the probability of being an error. For each possible error, the top-k candidate values (ranked by prediction probability) were provided. The accuracy of the results was examined by experts manually inspecting a sample of the identified errors and their suggested repairs.
Confidence Threshold   Accuracy   Recall
0.0                    0.894      1
0.5                    0.966      0.69
0.7                    0.985      0.52
0.9                    0.995      0.41

k   Accuracy
1   0.894
2   0.952
3   0.972
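The threshold table reflects a standard precision/recall trade-off: flagging only cells whose predicted error probability exceeds a confidence threshold raises accuracy but lowers recall. A toy sketch (cell IDs and probabilities invented for illustration):

```python
# Confidence thresholding for error detection: flag a cell as an error only
# if its predicted error probability exceeds the threshold. Toy data only.

def flag_errors(cell_probs, threshold):
    """cell_probs: {cell_id: Pr(cell is an error)} -> set of flagged cells."""
    return {cell for cell, p in cell_probs.items() if p > threshold}

def precision_recall(flagged, true_errors):
    tp = len(flagged & true_errors)
    precision = tp / len(flagged) if flagged else 1.0
    recall = tp / len(true_errors)
    return precision, recall

cell_probs = {"c1": 0.95, "c2": 0.80, "c3": 0.55, "c4": 0.30}
true_errors = {"c1", "c2", "c4"}

# Raising the threshold flags fewer cells: precision rises, recall falls.
for t in (0.0, 0.5, 0.9):
    p, r = precision_recall(flag_errors(cell_probs, t), true_errors)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```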
Automating Data Cleaning Infrastructure
Thank You @ihabilyas
- A scalable prediction engine for structured data, building on modern AI technology
  ○ Self (and weak) supervision
  ○ Contextual data representation
- Direct applications/services in:
  ○ Error and anomaly detection
  ○ Data repair
  ○ Missing value imputation
  ○ Rules discovery and evaluation
- Reduced months of manual work to hours on modest hardware configurations with similar