SLIDE 1

Data normalization

PRACTICING MACHINE LEARNING INTERVIEW QUESTIONS IN R

Rafael Falcon

Data Scientist at Shopify

SLIDE 2

Course outline

Chapter 1: Data preprocessing and visualization
  • data normalization
  • handling missing data
  • outlier detection

Chapter 2: Supervised learning
  • regression
  • classification
  • ensemble models

SLIDE 3

Course outline

Chapter 3: Unsupervised learning
  • clustering
  • dimensionality reduction (feature selection, feature extraction)

Chapter 4: Model selection and evaluation
  • model evaluation
  • imbalanced classification
  • hyperparameter tuning
  • ensemble learners (Random Forests, Gradient Boosted Trees)

Heads-up: This course does not teach elementary Machine Learning concepts.

SLIDE 4

Q: Is data normalization always needed?

Data normalization is also referred to as feature scaling. It is not always needed, but it rarely hurts.

Some ML models can handle features with drastically different scales, e.g., decision-tree-based models.

Most ML models will benefit from it:
  • Support Vector Machines, K-nearest neighbors, Logistic Regression
  • neural networks (helps improve gradient descent convergence)
  • clustering algorithms (K-means, K-medoids, DBSCAN, etc.)
  • feature extraction (Principal Component Analysis, Linear Discriminant Analysis, etc.)

SLIDE 5

Q: Min-max scaling or standardization?

Min-max scaling: maps a numerical value x to the [0, 1] interval

x′ = (x − min) / (max − min)

Standardization (also called Z-score normalization): maps a numerical value x to a new distribution with mean μ = 0 and standard deviation σ = 1

x′ = (x − μ) / σ

The min(), max(), mean() and sd() R functions will help you normalize the data.
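As a quick sketch of both formulas in base R (a toy vector is assumed here, not the FIFA data used later in the slides):

```r
# Toy numeric vector; 50 plays the role of an outlier
x <- c(2, 5, 9, 14, 50)

# Min-max scaling: maps values onto [0, 1]
min_max <- (x - min(x)) / (max(x) - min(x))

# Z-score standardization: result has mean 0 and standard deviation 1
z_score <- (x - mean(x)) / sd(x)
# Equivalent built-in: as.numeric(scale(x))
```

Note the built-in scale() computes the same z-scores, which is handy when standardizing all columns of a data frame at once.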

SLIDE 6

FIFA player data

SLIDE 7

Normalizing FIFA player data

[Plots: with min-max scaling the x axis is still squished due to an outlier and differences in the y axis still dominate; with standardization both features end up on very similar scales, more robust to outliers.]

SLIDE 8

Min-max scaling vs. standardization

Min-max normalization:
  • ensures that all features will share the exact same scale
  • does not cope well with outliers

Z-score normalization:
  • more robust to outliers
  • normalized data may be on different scales

SLIDE 9

Let's practice!


SLIDE 10

Handling missing data


Rafael Falcon

Data Scientist at Shopify

SLIDE 11

Q: What to do with the missing values?

First, thoroughly explore your dataset to identify and visualize the missing values. Then talk to your client about the missing data and apply business knowledge if possible!

Three main choices:
  • Ignore: discard samples with missing values from your analysis.
  • Impute: "fill in" the missing values with other values.
  • Accept: apply methods that are unaffected by the missing data.

The strategy depends on the type of missingness you have.
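The "ignore" and "impute" choices can be sketched in base R on a toy vector (not the mammographic data used later; mean imputation here stands in for whatever imputation method fits your data):

```r
# Toy vector with two missing entries
x <- c(3, NA, 7, NA, 5)

# Ignore: drop the samples with missing values
ignored <- x[!is.na(x)]

# Impute: fill the NAs, here with the mean of the observed values
imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
```

The "accept" choice needs no transformation at all: some models (e.g., certain tree implementations) handle NAs natively.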

SLIDE 12

Bringing NAs in

library(naniar)

any_na(mammographic)
# [1] FALSE

# Set missing data symbols to NA
mammographic <- replace_with_na_all(mammographic, ~.x == '?')

any_na(mammographic)
# [1] TRUE

SLIDE 13

Summarizing missingness

miss_var_summary(mammographic)
# A tibble: 6 x 3
  variable n_miss pct_miss
  <chr>     <int>    <dbl>
1 Density      76    7.91
2 Margin       48    4.99
3 Shape        31    3.23
4 Age           5    0.520
5 BI_RADS       2    0.208
6 Severity      0    0

SLIDE 14

Visualizing missingness 1

vis_miss(mammographic)
vis_miss(mammographic, cluster = TRUE)

SLIDE 15

Visualizing missingness 2

gg_miss_case(mammographic)

SLIDE 16

Understanding missingness types

There are three types of missing data:
  • MCAR: Missing Completely At Random
  • MAR: Missing At Random
  • MNAR: Missing Not At Random

For their definitions and examples, check Chapter 2 of the Dealing with Missing Data in R DataCamp course.

SLIDE 17

Implications of the missingness types

Type | Imputation        | Deletion              | Visual cues
MCAR | Recommended       | Will not lead to bias | Random or noisy patterns in missingness clusters
MAR  | Recommended       | May lead to bias      | Well-defined missingness clusters when arranging by a particular variable(s)
MNAR | Will lead to bias | Will lead to bias     | Neither visual pattern above holds

It can be difficult to ascertain the missingness type using visual inspection!

SLIDE 18

Q: How good was the data imputation?

SLIDE 19

Mean and linear imputations

library(naniar)

# Mean imputation
imp_mean <- mammographic %>%
  bind_shadow(only_miss = TRUE) %>%
  add_label_shadow() %>%
  impute_mean_all()

library(simputation)

# Linear-model imputation
imp_lm <- ... %>%
  impute_lm(Y ~ X1 + X2)

SLIDE 20

Combining multiple imputation models

# Aggregate the imputation models
imp_models <- bind_rows(mean = imp_mean,
                        lm = imp_lm,
                        .id = "imp_model")

head(imp_models)
# A tibble: 6 x 14
  imp_model BI_RADS   Age Shape Margin Density Severity BI_RADS_NA
  <chr>       <dbl> <dbl> <dbl>  <dbl>   <dbl>    <dbl> <fct>
1 mean            5    67     3      5    3           1 !NA
2 mean            4    43     1      1    2.91        1 !NA
3 mean            5    58     4      5    3           1 !NA
4 mean            4    28     1      1    3           0 !NA
5 mean            5    74     1      5    2.91        1 !NA

SLIDE 21

Let's practice!


SLIDE 22

Detecting anomalies in data


Rafael Falcon

Data Scientist at Shopify

SLIDE 23

Q: How to detect outliers in the data?

Univariate methods:
  • the 3-sigma rule (for normally distributed data)
  • the 1.5*IQR rule (more general)

Multivariate methods:
  • distance-based: K-nearest neighbors (KNN) distance
  • density-based: Local Outlier Factor (LOF)

SLIDE 24

Univariate methods: the 3-sigma rule

Assumption: the variable follows a Gaussian distribution.
Outlier: any value below mean − 3·sigma or above mean + 3·sigma.
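A minimal base-R sketch of the rule, using a toy vector with a planted outlier (any real dataset would replace `x`):

```r
set.seed(7)
x <- c(rnorm(100), 10)  # 100 standard-normal draws plus one planted outlier

mu <- mean(x)
sigma <- sd(x)

# Flag values outside the [mu - 3*sigma, mu + 3*sigma] band
outliers <- x[x < mu - 3 * sigma | x > mu + 3 * sigma]
```

Because mu and sigma are themselves inflated by extreme values, a very large outlier can mask smaller ones; that is one motivation for the more robust 1.5*IQR rule on the next slide.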

SLIDE 25

Univariate methods: the 1.5*IQR rule

Q1: first quartile (25th percentile)
Q3: third quartile (75th percentile)
IQR: Inter-Quartile Range (Q3 − Q1)

Outlier: any value lower than Q1 − 1.5 × IQR or higher than Q3 + 1.5 × IQR
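In base R the rule maps directly onto quantile() and IQR() (toy vector assumed):

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)  # 100 is a planted outlier

q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
iqr <- q3 - q1  # same value as IQR(x)

# Fences at 1.5 * IQR beyond the quartiles
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

outliers <- x[x < lower | x > upper]
```

These fences are exactly what boxplot() uses to draw its whisker-exceeding points, so boxplot(x) gives a quick visual check of the same rule.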

SLIDE 26

Multivariate methods: distance vs. density-based

Two broad categories:
  • distance-based methods
  • density-based methods

Assumption: outliers often lie far from their neighbors.

Distance-based methods: average distance to the K-nearest neighbors.
Density-based methods: number of neighboring points within a certain distance.

SLIDE 27

Distance-based methods: K-nearest neighbors

get.knn() function from the FNN package
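The idea can be sketched in base R with dist() on toy data (in practice, get.knn() from the FNN package computes the neighbor distances far more efficiently on large datasets):

```r
set.seed(42)
# 20 points near the origin plus one planted outlier at (8, 8)
pts <- rbind(matrix(rnorm(40), ncol = 2), c(8, 8))
k <- 5

# Full pairwise distance matrix (fine for toy data only)
d <- as.matrix(dist(pts))

# Outlier score: mean distance to the k nearest neighbors,
# dropping each point's zero distance to itself
knn_score <- apply(d, 1, function(row) mean(sort(row)[2:(k + 1)]))

which.max(knn_score)  # index of the planted outlier
```

Points with the largest scores are the outlier candidates; a common follow-up is to threshold the scores or inspect the top few by hand.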

SLIDE 28

Local and global outliers

SLIDE 29

Density-based methods: Local Outlier Factor (LOF)

Measures the local deviation of a data point with respect to its neighbors. Outliers are observations with a substantially lower density than their neighbors. Each observation x has an associated score LOF(x):

LOF(x) ≈ 1: similar density to its neighbors
LOF(x) < 1: higher density than its neighbors (inlier)
LOF(x) > 1: lower density than its neighbors (outlier)

lof() function from the dbscan package
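A short usage sketch, assuming the dbscan package is installed; the toy data (a cluster plus one planted outlier) is illustrative, not from the course:

```r
library(dbscan)

set.seed(1)
# 20 points near the origin plus one planted outlier at (8, 8)
pts <- rbind(matrix(rnorm(40), ncol = 2), c(8, 8))

# LOF score per observation, using 5-point neighborhoods
scores <- lof(pts, minPts = 5)

scores[nrow(pts)]  # the planted outlier scores well above 1
```

Because LOF compares each point's density to its neighbors' densities, it can flag local outliers that a single global distance threshold would miss.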

SLIDE 30

Q: What to do with outlier observations?

  • 1. Retention: Keep them in your dataset and, if possible, use algorithms that are robust to outliers, e.g., K-nearest neighbors (KNN), tree-based methods (decision tree, random forest).
  • 2. Imputation: Use an imputation method to replace their value with a less extreme observation, e.g., mode imputation, linear imputation, KNN imputation.
  • 3. Capping: Replace them with the value of the 5th percentile (lower limit) or 95th percentile (upper limit).
  • 4. Exclusion: Not recommended, especially in small datasets or those where a normal distribution cannot be assumed.

Apply domain knowledge to help make sense of your outliers!

SLIDE 31

Let's practice!
