
slide-1
SLIDE 1

The Art and Science of Data Wrangling

Kristen M. Altenburger and Sam Pepose Facebook Core Data Science & Portal AI Georgia Tech CS 4803/7643 Deep Learning February 11, 2020

slide-2
SLIDE 2

“The performance of machine learning methods is heavily dependent on the choice of data representation (or features) on which they are applied” (Bengio et al., 2013)

slide-3
SLIDE 3

The Pitfalls of Data Wrangling

3

(Aboumatar et al., 2019) (Camerer et al., 2018)

slide-4
SLIDE 4

The Data Wrangling Process

population

4

slide-5
SLIDE 5

The Data Wrangling Process

population

5

sample

slide-6
SLIDE 6

The Data Wrangling Process

population

sample

train

test

6

cross-validation

slide-7
SLIDE 7

The Data Wrangling Process

population

sample

train

test

Learn Model

7

cross-validation

slide-8
SLIDE 8

The Data Wrangling Process

population

sample

train

test

Learn Model

Evaluate Model

8

cross-validation

slide-9
SLIDE 9

The Data Wrangling Process

population

sample

train

test

Learn Model

Evaluate Model

9

cross-validation Step 1. What is the population of interest? What sample is predictive performance evaluated on, and is the sample representative of the population?
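
A minimal sketch of this stage on synthetic data (not from the deck; scikit-learn and made-up sizes are assumed): draw a sample, then hold out a test split before any modeling decisions are made.

```python
# Illustrative only: the "sample" here is synthetic; Step 1 asks whether such a
# sample actually represents the population the model will be used on.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)

# hold out a test set from the sample before any modeling decisions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print("train rows:", len(X_train), "test rows:", len(X_test))
print(f"positive rate, train: {y_train.mean():.2%}  test: {y_test.mean():.2%}")
```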

slide-10
SLIDE 10

We Illustrate the Data Wrangling Process with an Example

10

“Yelp might clean up the restaurant industry”

https://www.theatlantic.com/magazine/archive/2013/07/youll-never-throw-up-in-this-town-again/309383/

slide-11
SLIDE 11

Previous Claims: Yelp is Predictive of Unhygienic Restaurants

11

The Population: Yelp data and inspection records merged to predict restaurants with “severe violations” over 2006-2013 in Seattle.

Previous Results: Demonstrated usefulness of mappings between Yelp review text and hygiene inspections.

(Kang et al. 2013)

slide-12
SLIDE 12

However, Previous Sample Set-up Overlooked Class Imbalance

12

Original Data: 13k inspections (1,756 restaurants with 152k Yelp reviews) over 2006-2013 in Seattle

(Kang et al. 2013)

slide-13
SLIDE 13

However, Previous Sample Set-up Overlooked Class Imbalance

13

Original Data: 13k inspections (1,756 restaurants with 152k Yelp reviews) over 2006-2013 in Seattle

(Kang et al. 2013)

slide-14
SLIDE 14

However, Previous Sample Set-up Overlooked Class Imbalance

14

Original Data: 13k inspections (1,756 restaurants with 152k Yelp reviews) over 2006-2013 in Seattle

Sampled Data: 612 observations (306 hygienic observations and 306 unhygienic observations)

(Kang et al. 2013)
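
A hedged sketch of why the balanced 306/306 sample matters (the violation rate below is assumed for illustration, not taken from Kang et al.): performance measured on a 50/50 subsample does not transfer to a population where violations are rare.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inspections = 13_000
assumed_violation_rate = 306 / 13_000                  # assumption for illustration
y_population = rng.random(n_inspections) < assumed_violation_rate

# balanced subsample: all positives plus an equal number of sampled negatives
pos = np.flatnonzero(y_population)
neg = rng.choice(np.flatnonzero(~y_population), size=pos.size, replace=False)
y_balanced = y_population[np.concatenate([pos, neg])]

print(f"population positive rate: {y_population.mean():.1%}")    # a few percent
print(f"balanced-sample positive rate: {y_balanced.mean():.1%}")  # 50%
# Accuracy or precision quoted on the balanced sample says little about
# performance on the rare-violation population the model is deployed on.
```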

slide-15
SLIDE 15

A Step-by-Step Wrangling Example

15

Hygienic observations were non-randomly sampled, resulting in an unexpectedly high number of duplicate restaurants in the hygienic sample.

(Kang et al. 2013)

slide-16
SLIDE 16

A Step-by-Step Wrangling Example

16

Hygienic observations were non-randomly sampled, resulting in an unexpectedly high number of duplicate restaurants in the hygienic sample.

(Kang et al. 2013)
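
One way to catch this in practice (a sketch with hypothetical column names, not the original pipeline): check for repeated restaurants, and split by restaurant so the same establishment never appears in both train and test.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "restaurant_id": [101, 101, 102, 103, 103, 104],   # toy data
    "label":         [1,   1,   0,   0,   0,   1],
})

# how many rows are repeat appearances of the same restaurant?
print("duplicate-restaurant rows:", df["restaurant_id"].duplicated().sum())

# group-aware split: every row for a given restaurant lands on one side only
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, test_idx = next(splitter.split(df, df["label"], groups=df["restaurant_id"]))
assert not set(df.loc[train_idx, "restaurant_id"]) & set(df.loc[test_idx, "restaurant_id"])
```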

slide-17
SLIDE 17

Data Sample Representativeness

17

https://www.foodsafetymagazine.com/magazine-archive1/december-2019january-2020/artificial-intelligence-and-food-safety-hype-vs-reality/

slide-18
SLIDE 18

A Step-by-Step Wrangling Example

18

(Altenburger and Ho, 2018)

A Test of Bias by Asian vs. Non-Asian Establishments

slide-19
SLIDE 19

A Step-by-Step Wrangling Example

19

(Altenburger and Ho, 2018)

A Test of Bias by Asian vs. Non-Asian Establishments

slide-20
SLIDE 20

A Step-by-Step Wrangling Example

20

(Altenburger and Ho, 2018)

A Test of Bias by Asian vs. Non-Asian Establishments

slide-21
SLIDE 21

Data Wrangling Best Practices

21

  • 1. Clearly define your population and sample
  • 2. Understand the representativeness of your sample
slide-22
SLIDE 22

The Data Wrangling Process

population

sample

train

test

Learn Model

Evaluate Model

22

cross-validation Step 1. What is the population of interest? What sample is predictive performance evaluated on, and is the sample representative of the population?

slide-23
SLIDE 23

The Data Wrangling Process

population

sample

train

test

Learn Model

Evaluate Model

23

cross-validation Step 2. How do we cross-validate to evaluate our model? How do we avoid overfitting and data mining?
slide-24
SLIDE 24

Cross-validation

24

(Hastie et al., 2011)

slide-25
SLIDE 25

Cross-validation Example

25

(Hastie et al., 2011)

“1. Screen the predictors: find a subset of “good” predictors that show fairly strong (univariate) correlation with the class labels.

2. Using just this subset of predictors, build a multivariate classifier.

3. Use cross-validation to estimate the unknown tuning parameters and to estimate the prediction error of the final model.”

slide-26
SLIDE 26

“1. Screen the predictors: find a subset of “good” predictors that show fairly strong (univariate) correlation with the class labels.

2. Using just this subset of predictors, build a multivariate classifier.

3. Use cross-validation to estimate the unknown tuning parameters and to estimate the prediction error of the final model.”

Cross-validation Example

26

(Hastie et al., 2011)
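
The quoted procedure leaks label information because the screening in step 1 sees the test folds. A minimal leakage-free sketch (scikit-learn assumed, not the book's code) pushes the screening inside the cross-validation loop with a Pipeline:

```python
# Few samples, many noise features: the setting where screening outside the
# CV loop gives wildly optimistic error estimates.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=50, n_features=5_000, n_informative=5,
                           random_state=0)

# the univariate screen is refit on the training part of every fold
pipe = make_pipeline(SelectKBest(f_classif, k=100),
                     LogisticRegression(max_iter=1_000))
scores = cross_val_score(pipe, X, y, cv=5)
print(f"honest CV accuracy: {scores.mean():.2f}")
```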

slide-27
SLIDE 27

Class Imbalance and Cross-Validation

27

slide-28
SLIDE 28

Class Imbalance and Cross-Validation

28
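
A small sketch of the interaction (illustrative, scikit-learn assumed): with a rare positive class, unshuffled K-fold can leave test folds with no positives at all, while stratified folds preserve the class ratio.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array([1] * 10 + [0] * 190)    # 5% positive, positives listed first
X = np.zeros((len(y), 1))             # features are irrelevant to the split itself

for name, cv in [("KFold", KFold(n_splits=5)),
                 ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True,
                                                     random_state=0))]:
    pos_per_fold = [int(y[test].sum()) for _, test in cv.split(X, y)]
    print(name, "positives per test fold:", pos_per_fold)
```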

slide-29
SLIDE 29

Cross-Validation Best Practices

29

  • Random search vs. grid search for hyperparameters (Bergstra and Bengio, 2012)
  • Confirm the hyperparameter range is sufficient, e.g., by plotting the OOB error rate (a sketch of these two points follows after this list)

  • Temporal cross-validation considerations
  • Check for overfitting
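
A sketch of the first two bullets (scikit-learn assumed; parameters are arbitrary): random search over hyperparameters, then a check that the OOB error has flattened out within the chosen range of n_estimators.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# random search samples the hyperparameter space instead of walking a grid
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"max_depth": randint(2, 20),
                         "min_samples_leaf": randint(1, 20)},
    n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print("best params:", search.best_params_)

# OOB error vs. number of trees: if the curve is still dropping at the top of
# the range, the range was too small
for n in (50, 100, 200, 400):
    rf = RandomForestClassifier(n_estimators=n, oob_score=True,
                                random_state=0).fit(X, y)
    print(n, "trees, OOB error:", round(1 - rf.oob_score_, 3))
```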
slide-30
SLIDE 30

Data Wrangling Best Practices

30

  • 1. Clearly define your population and sample
  • 2. Understand the representativeness of your sample
slide-31
SLIDE 31

Data Wrangling Best Practices

31

  • 1. Clearly define your population and sample
  • 2. Understand the representativeness of your sample
  • 3. Cross-validation can go wrong in many ways; understand the relevant problem and prediction task that will be done in practice

slide-32
SLIDE 32

The Data Wrangling Process

population

sample

train

test

Learn Model

Evaluate Model

32

cross-validation Step 2. How do we cross-validate to evaluate our model? How do we avoid overfitting and data mining?
slide-33
SLIDE 33

The Data Wrangling Process

population

sample

train

test

Learn Model

Evaluate Model

33

cross-validation Step 3. What prediction task (classification vs. regression) do we care about? What are the meaningful evaluation criteria?

slide-34
SLIDE 34

Our Re-Analysis: Classification vs. Regression

34

(Altenburger and Ho, 2019)

slide-35
SLIDE 35

35

(Altenburger and Ho, 2019)

Our Re-Analysis: Classification vs. Regression

slide-36
SLIDE 36

36

(Altenburger and Ho, 2019)

Our Re-Analysis: Classification vs. Regression

slide-37
SLIDE 37

37

(Altenburger and Ho, 2019)

Our Re-Analysis: Classification vs. Regression

slide-38
SLIDE 38

Classification and Calibrated Models

38

https://scikit-learn.org/stable/modules/calibration.html
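
A minimal sketch of what the scikit-learn calibration page covers (toy data; not from the deck): wrap a classifier in CalibratedClassifierCV and inspect a reliability curve.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# isotonic recalibration of a typically over-confident classifier
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

prob_true, prob_pred = calibration_curve(
    y_test, calibrated.predict_proba(X_test)[:, 1], n_bins=10)
# for a well-calibrated model, prob_true should track prob_pred closely
print(list(zip(prob_pred.round(2), prob_true.round(2))))
```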

slide-39
SLIDE 39

Model Evaluation Statistics: Accuracy, AUC, Recall, Precision,...

39

Classification:

               Actual +    Actual -
  Predicted +     TP          FP
  Predicted -     FN          TN

Regression:

  • Mean-squared error
  • Visually analyze errors
  • Partial Dependence Plots
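
A hedged sketch of computing several of these statistics with scikit-learn (y_true / y_prob below are placeholders for real model output):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, mean_squared_error,
                             precision_score, recall_score, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3])
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP FP FN TN:", tp, fp, fn, tn)
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_prob))

# regression side: mean-squared error on continuous targets
print("MSE      :", mean_squared_error([3.0, -0.5, 2.0], [2.5, 0.0, 2.1]))
```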
slide-40
SLIDE 40

What are we comparing against? The importance of Baselines

  • Random guessing?
  • Current Model in Production?
  • Useful to compare predictive performance between the current and the proposed model.

40
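
A tiny baseline sketch (scikit-learn assumed): a DummyClassifier that always predicts the majority class is the kind of reference point any proposed model should clearly beat.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)

baseline = DummyClassifier(strategy="most_frequent")  # "always predict the majority class"
model = LogisticRegression(max_iter=1_000)

print("baseline accuracy:", cross_val_score(baseline, X, y, cv=5).mean().round(3))
print("model accuracy   :", cross_val_score(model, X, y, cv=5).mean().round(3))
# with 90% negatives, ~0.90 accuracy is free; the model has to do better than that
```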

slide-41
SLIDE 41

Data Wrangling Best Practices

41

  • 1. Clearly define your population and sample
  • 2. Understand the representativeness of your sample
  • 3. Cross-validation can go wrong in many ways; understand the relevant problem and prediction task that will be done in practice

slide-42
SLIDE 42

Data Wrangling Best Practices

42

  • 1. Clearly define your population and sample
  • 2. Understand the representativeness of your sample
  • 3. Cross-validation can go wrong in many ways; understand the relevant problem and prediction task that will be done in practice
  • 4. Know the prediction task of interest (regression vs. classification)
  • 5. Incorporate model checks and evaluate multiple predictive performance metrics

slide-43
SLIDE 43

The Data Wrangling Process

population

sample

train

test

Learn Model

Evaluate Model

43

cross-validation Step 3. What prediction task (classification vs. regression) do we care about? What are the meaningful evaluation criteria?

slide-44
SLIDE 44

The Data Wrangling Process

population

sample

train

test

Learn Model

Evaluate Model

44

cross-validation Step 4. How do we create a reproducible pipeline?

slide-45
SLIDE 45

“Datasheets for Datasets”

“...we propose that every dataset be accompanied with a datasheet that documents its motivation, composition, collection process, recommended uses, and so on.”

45

(Gebru et al., 2018)
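
A minimal, hypothetical illustration of the idea (field names paraphrase the quote above, not the official template from Gebru et al.): keep the datasheet as a small structured file that travels with the dataset.

```python
import json

# hypothetical datasheet fields, paraphrasing the quote above
datasheet = {
    "motivation": "Predict severe hygiene violations from Yelp review text.",
    "composition": "Seattle inspections 2006-2013 joined to Yelp reviews.",
    "collection_process": "Public inspection records merged with review text.",
    "preprocessing": "Restaurant-level deduplication; no class rebalancing.",
    "recommended_uses": "Research on review-text signals, not enforcement decisions.",
}

with open("dataset_datasheet.json", "w") as f:
    json.dump(datasheet, f, indent=2)
```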

slide-46
SLIDE 46

A Step-by-Step Wrangling Example

46

Data Cleaning for Deep Learning

(...and when you should use Deep Learning instead of Machine Learning)

slide-47
SLIDE 47

47 https://blogs.sas.com/content/subconsciousmusings/files/2017/04/machine-learning-cheet-sheet.png

slide-48
SLIDE 48

Data Preparation

48

1. Clean

Scrub a dub dub

2. Transform

Get your data in the right format

3. Preprocess

Algorithm-specific data preparation

slide-49
SLIDE 49

Missing Data Mechanisms

49

  • Missing Completely at Random: likelihood of any data observation to be missing is random
  • Missing at Random: likelihood of any data observation to be missing depends on observed data features
  • Missing Not at Random: likelihood of any data observation to be missing depends on the unobserved outcome

(Little and Rubin, 2019)
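
A hedged sketch that makes the three mechanisms concrete on toy data (variable names and rates are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
age = rng.normal(45, 12, n)
income = rng.normal(50_000, 10_000, n) + 300 * (age - 45)
df = pd.DataFrame({"age": age, "income": income})

# MCAR: every income is missing with the same probability, regardless of anything
mcar = df["income"].mask(rng.random(n) < 0.2)

# MAR: older respondents are less likely to report income (depends on observed age)
mar = df["income"].mask(rng.random(n) < (df["age"] > 55) * 0.5)

# MNAR: high earners are less likely to report income (depends on the unobserved value)
mnar = df["income"].mask(rng.random(n) < (df["income"] > 60_000) * 0.5)

for name, col in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, "missing rate:", round(col.isna().mean(), 2),
          "mean of observed incomes:", int(col.mean()))
```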

slide-50
SLIDE 50

Clean: Missing Data

50

Person   Age   Job
Jay      42    Waiter
Susan    65
Paco     30    Computer Scientist
Max            Student

slide-51
SLIDE 51

Missing Data: Removal

51

  • Easy, but lose information

Person   Age   Job
Jay      42    Waiter
Susan    65
Paco     30    Computer Scientist
Max            Student

slide-52
SLIDE 52

Missing Data: Imputation

52

  • Numerical Data: mean, mode, most frequent, zero, constant
  • Categorical Data: hot-deck imputation, k-Nearest Neighbors, deep-learned embeddings

Person   Age                       Job
Jay      42                        Waiter
Susan    65                        Waiter (hot-deck)
Paco     30                        Computer Scientist
Max      45.6 (mean), 42 (mode)    Student
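
A hedged sketch of both strategies on the toy table above (pandas and scikit-learn assumed): drop incomplete rows, or impute the missing age and job.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "person": ["Jay", "Susan", "Paco", "Max"],
    "age":    [42, 65, 30, np.nan],
    "job":    ["Waiter", np.nan, "Computer Scientist", "Student"],
})

# removal: simple, but Susan and Max disappear entirely
print(df.dropna())

# numerical imputation: fill the missing age with the column mean
df["age"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()

# categorical imputation: most-frequent is the simplest stand-in for the
# hot-deck / k-NN / learned-embedding approaches listed above
df["job"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["job"]]).ravel()
print(df)
```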

slide-53
SLIDE 53

Transform

  • Image:
  • Color conversion
  • Text:
  • Index: (Apple, Orange, Pear) -> (0, 1, 2)
  • Bag of Words and TF-IDF
  • Embedding
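
A short sketch of two of these text transforms (scikit-learn assumed; the example strings are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OrdinalEncoder

# index encoding: (Apple, Orange, Pear) -> (0, 1, 2)
encoder = OrdinalEncoder()
print(encoder.fit_transform([["Apple"], ["Orange"], ["Pear"]]).ravel())

# bag-of-words weighted by TF-IDF
reviews = ["great food clean kitchen", "dirty kitchen bad food", "great great place"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(reviews)
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))
```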

53

slide-54
SLIDE 54

Pre-Process

54 Image from http://cs231n.github.io/neural-networks-2/
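
A minimal sketch of the preprocessing the linked cs231n notes describe (zero-centering and normalization; the array below stands in for real image features):

```python
import numpy as np

# fake "image feature" matrix: 500 examples x 3072 values
X = np.random.default_rng(0).normal(loc=120.0, scale=40.0, size=(500, 3072))

X_centered = X - X.mean(axis=0)             # subtract the per-feature mean
X_normalized = X_centered / X.std(axis=0)   # scale to unit variance

print(X_normalized.mean(axis=0)[:3].round(4))
print(X_normalized.std(axis=0)[:3].round(4))
```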

slide-55
SLIDE 55

Case Study: Depth Estimation

55 Image from Wikipedia: https://upload.wikimedia.org/wikipedia/commons/6/67/Xbox-360-Kinect-Standalone.png

slide-56
SLIDE 56

Case Study: Depth Estimation

56 Image from Jaesik Park, Youtube: https://i.ytimg.com/vi/y6ZYH6vxXNI/maxresdefault.jpg

slide-57
SLIDE 57

Depth Estimation: Clean

57

Fill in the missing depth values:

  • Nearest Neighbor (naive)
  • Colorization (NYU Depth v2)

Image from NYU: http://cs.nyu.edu/~silberman/images/nyu_depth_v2_raw.jpg
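
A hedged sketch of the naive nearest-neighbor fill (SciPy assumed; the NYU Depth v2 toolbox itself uses the colorization approach): copy each hole's nearest valid depth value.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

depth = np.array([[1.0, 1.1, 0.0],
                  [0.0, 1.2, 1.3],
                  [1.4, 0.0, 1.5]])   # 0.0 marks a missing reading ("hole")

holes = depth == 0.0
# for every pixel, the index of the nearest pixel with a valid reading
idx = distance_transform_edt(holes, return_distances=False, return_indices=True)
filled = depth[tuple(idx)]            # copy that nearest valid value into each hole
print(filled)
```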

slide-58
SLIDE 58

Depth Estimation: Clean

58

No more holes!

Image from NYU: http://cs.nyu.edu/~silberman/images/nyu_depth_v2_raw.jpg

slide-59
SLIDE 59

Depth Estimation: Transform

59

Learning Rich Features from RGB-D Images for Object Detection and Segmentation. Gupta et al.

1-channel depth map → 3-channels:

  • Horizontal disparity
  • Height above ground
  • Angle with gravity
slide-60
SLIDE 60

Depth Estimation: Transform

60 https://d3i71xaburhd42.cloudfront.net/8a9c4f1b58258afa2016b0eca0b3bfd2dc2ba3d8/1-Figure1-1.png

slide-61
SLIDE 61

Depth Estimation: Preprocessing

61 Learning Depth from Monocular Videos using Direct Methods, Wang et al. 2017

Inverse depth helps:

  • Improve numerical stability
  • Gaussian error distribution
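
A tiny sketch of the inverse-depth parameterization (toy values; the epsilon guard is an assumption to avoid division by zero):

```python
import numpy as np

depth = np.array([0.5, 2.0, 10.0, 80.0])   # meters, toy values
eps = 1e-6                                  # assumption: guard against zero depth
inverse_depth = 1.0 / (depth + eps)         # far points map to small, bounded values
print(inverse_depth.round(4))

# recovering metric depth from a predicted inverse depth
print((1.0 / inverse_depth).round(2))
```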

slide-62
SLIDE 62

Testing for bias on Portal

62

slide-63
SLIDE 63

Large Literature on Bias in Machine Learning

63

  • Anti-classification: “protected attributes--like race, gender, and their proxies--are not explicitly used”
  • Classification parity: “common measures of predictive performance...are equal across groups defined by protected attributes”
  • Calibration: “conditional on risk estimates, outcomes are independent of protected attributes”

(Corbett-Davies and Goel, 2018)
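
A hedged sketch of a classification-parity style check on toy predictions (column and group names are made up): compare error rates across groups defined by a protected attribute.

```python
import pandas as pd
from sklearn.metrics import recall_score

df = pd.DataFrame({
    "group":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "y_true": [1,   0,   1,   0,   1,   0,   1,   0],
    "y_pred": [1,   0,   0,   0,   1,   1,   1,   0],
})

for group, sub in df.groupby("group"):
    fpr = ((sub["y_pred"] == 1) & (sub["y_true"] == 0)).sum() / (sub["y_true"] == 0).sum()
    print(group,
          "recall:", round(recall_score(sub["y_true"], sub["y_pred"]), 2),
          "false positive rate:", round(fpr, 2))
# large gaps between groups would flag a potential classification-parity violation
```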

slide-64
SLIDE 64

Testing for bias on Portal

64

slide-65
SLIDE 65

Testing for bias on Portal

65

  • Skin-tone
  • Lighting
  • People location XYZ
  • Many more...

Image from https://i.ytimg.com/vi/KYNDzlcQMWA/maxresdefault.jpg

slide-66
SLIDE 66

References

Aboumatar, Hanan, and Robert A. Wise. "Notice of Retraction. Aboumatar et al. Effect of a Program Combining Transitional Care and Long-term Self-management Support on Outcomes of Hospitalized Patients With Chronic Obstructive Pulmonary Disease: A Randomized Clinical Trial. JAMA. 2018;320(22):2335-2343." JAMA 322.14 (2019): 1417-1418.

Camerer, Colin F., et al. "Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015." Nature Human Behaviour 2.9 (2018): 637-644.

Corbett-Davies, Sam, and Sharad Goel. "The measure and mismeasure of fairness: A critical review of fair machine learning." arXiv preprint arXiv:1808.00023 (2018).

Altenburger, Kristen M., and Daniel E. Ho. "When Algorithms Import Private Bias into Public Enforcement: The Promise and Limitations of Statistical De-biasing Solutions." Journal of Institutional and Theoretical Economics (2018).

Altenburger, Kristen M., and Daniel E. Ho. "Is Yelp Actually Cleaning Up the Restaurant Industry? A Re-Analysis on the Relative Usefulness of Consumer Reviews." The World Wide Web Conference. 2019.

66

slide-67
SLIDE 67

References (cont’d.)

Bengio, Yoshua, Aaron Courville, and Pascal Vincent. "Representation learning: A review and new perspectives." IEEE Transactions on Pattern Analysis and Machine Intelligence 35.8 (2013): 1798-1828.

Bergstra, James, and Yoshua Bengio. "Random search for hyper-parameter optimization." Journal of Machine Learning Research 13.Feb (2012): 281-305.

Gebru, Timnit, et al. "Datasheets for datasets." arXiv preprint arXiv:1803.09010 (2018).

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Vol. 1. No. 10. New York: Springer Series in Statistics, 2001.

Kang, Jun Seok, et al. "Where not to eat? Improving public policy by predicting hygiene inspections using online reviews." Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 2013.

Little, Roderick J. A., and Donald B. Rubin. Statistical Analysis with Missing Data. Vol. 793. John Wiley & Sons, 2019.

67