Data normalization
PRACTICING MACHINE LEARNING INTERVIEW QUESTIONS IN R
Rafael Falcon
Data Scientist at Shopify
Course outline

Chapter 1: Data preprocessing and visualization
- data normalization
- handling missing data
- outlier detection

Chapter 2: Supervised learning
- regression
- classification
- ensemble models

Chapter 3: Unsupervised learning
- clustering
- dimensionality reduction (feature selection, feature extraction)

Chapter 4: Model selection and evaluation
- model evaluation
- imbalanced classification
- hyperparameter tuning
- ensemble learners (Random Forests, Gradient Boosted Trees)

Heads-up: This course does not teach elementary Machine Learning concepts.
Data normalization: also referred to as feature scaling.
- Not always needed, but it rarely hurts.
- Some ML models can handle features with drastically different scales, e.g., decision-tree-based models.
- Most ML models will benefit from it:
  - Support Vector Machines, K-nearest neighbors, Logistic Regression
  - neural networks (helps improve gradient descent convergence)
  - clustering algorithms (K-means, K-medoids, DBSCAN, etc.)
  - feature extraction (Principal Component Analysis, Linear Discriminant Analysis, etc.)
Min-max scaling
Maps a numerical value x to the [0, 1] interval:

x' = (x − min) / (max − min)

Standardization (also called Z-score normalization)
Maps a numerical value x to a new distribution with mean μ = 0 and standard deviation σ = 1:

x' = (x − μ) / σ

The min(), max(), mean() and sd() R functions will help you normalize the data.
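As a quick sketch, both mappings can be written directly with the base R functions above (the helper names and the toy ages vector are made up for illustration):

```r
# Min-max scaling: map x onto the [0, 1] interval
min_max <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Z-score standardization: shift to mean 0, rescale to sd 1
z_score <- function(x) {
  (x - mean(x)) / sd(x)
}

ages <- c(28, 43, 58, 67, 74)  # toy data
round(min_max(ages), 2)  # 0.00 0.33 0.65 0.85 1.00
round(z_score(ages), 2)  # -1.40 -0.59 0.22 0.70 1.08
```

In practice the scaling parameters (min, max, μ, σ) should be computed on the training set only and then reused to transform the test set.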
(Figure: the same two features after each scaling method. With min-max scaling, the x axis is still squished due to an outlier, so differences in the y axis will still dominate; with z-score normalization, both features are on very similar scales, which is more robust to outliers.)
Min-max normalization:
- Ensures that all features will share the exact same scale
- Does not cope well with outliers

Z-score normalization:
- More robust to outliers
- Normalized data may be on different scales
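A tiny experiment (with a made-up vector) illustrates the trade-off: a single extreme value pins the min-max range, while the z-score still delivers mean 0 and standard deviation 1:

```r
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
z_score <- function(x) (x - mean(x)) / sd(x)

x <- c(1, 2, 3, 4, 1000)  # toy vector with one extreme outlier

round(min_max(x), 3)  # 0.000 0.001 0.002 0.003 1.000 -- bulk of the data crammed near 0
round(z_score(x), 3)  # -0.451 -0.448 -0.446 -0.444 1.789 -- mean 0 and sd 1 preserved
```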
Handling missing data
First, thoroughly explore your dataset to identify and visualize the missing values. Then talk to your client about the missing data and apply business knowledge if possible!

Three main choices:
- Ignore: Discard samples with missing values from your analysis.
- Impute: "Fill in" the missing values with other values.
- Accept: Apply methods that are unaffected by the missing data.

The strategy depends on the type of missingness you have.
library(naniar)

any_na(mammographic)
FALSE

# Set missing data symbols to NA
mammographic <- replace_with_na_all(mammographic, ~.x == '?')

any_na(mammographic)
TRUE
miss_var_summary(mammographic)

# A tibble: 6 x 3
  variable n_miss pct_miss
  <chr>     <int>    <dbl>
1 Density      76    7.91
2 Margin       48    4.99
3 Shape        31    3.23
4 Age           5    0.520
5 BI_RADS       2    0.208
6 Severity      0    0
vis_miss(mammographic)
vis_miss(mammographic, cluster = TRUE)
gg_miss_case(mammographic)
There are three types of missing data:
- MCAR: Missing Completely At Random
- MAR: Missing At Random
- MNAR: Missing Not At Random

For their definitions and examples, check Chapter 2 of the Dealing with Missing Data in R DataCamp course.
Type | Imputation        | Deletion              | Visual cues
MCAR | Recommended       | Will not lead to bias | Random or noisy patterns in missingness clusters
MAR  | Recommended       | May lead to bias      | Well-defined missingness clusters when arranging by a particular variable(s)
MNAR | Will lead to bias | Will lead to bias     | Neither visual pattern above holds

It can be difficult to ascertain the missingness type using visual inspection!
library(naniar)

# Mean imputation
imp_mean <- mammographic %>%
  bind_shadow(only_miss = TRUE) %>%
  add_label_shadow() %>%
  impute_mean_all()

library(simputation)

# Linear model imputation
imp_lm <- ... %>%
  impute_lm(Y ~ X1 + X2)
# Aggregate the imputation models
imp_models <- bind_rows(mean = imp_mean,
                        lm = imp_lm,
                        .id = "imp_model")
head(imp_models)

# A tibble: 6 x 14
  imp_model BI_RADS   Age Shape Margin Density Severity BI_RADS_NA
  <chr>       <dbl> <dbl> <dbl>  <dbl>   <dbl>    <dbl> <fct>
1 mean            5    67     3      5    3           1 !NA
2 mean            4    43     1      1    2.91        1 !NA
3 mean            5    58     4      5    3           1 !NA
4 mean            4    28     1      1    3           0 !NA
5 mean            5    74     1      5    2.91        1 !NA
Outlier detection
Univariate methods:
- The 3-sigma rule (for normally distributed data)
- The 1.5*IQR rule (more general)

Multivariate methods:
- Distance-based: K-nearest neighbors (KNN) distance
- Density-based: Local outlier factor (LOF)
Assumption: the variable follows a Gaussian distribution. Any value below mean − 3 sigma or above mean + 3 sigma is flagged as an outlier.
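A minimal base-R sketch of the rule (the function name and data are hypothetical):

```r
# Flag indices of values outside mean ± 3 standard deviations
three_sigma_outliers <- function(x) {
  mu <- mean(x)
  sigma <- sd(x)
  which(x < mu - 3 * sigma | x > mu + 3 * sigma)
}

set.seed(42)
x <- c(rnorm(100), 25)   # 100 roughly Gaussian points plus one extreme value
three_sigma_outliers(x)  # 101 -- only the planted outlier is flagged
```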
Q1: first quartile (25th percentile)
Q3: third quartile (75th percentile)
IQR: Inter-Quartile Range (Q3 − Q1)

Outlier: any value lower than Q1 − 1.5 × IQR or higher than Q3 + 1.5 × IQR
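In base R the rule can be sketched as follows (quantile() defaults give Q1 and Q3; the helper name and toy vector are made up):

```r
# Flag indices of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
iqr_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  which(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}

x <- c(3, 4, 5, 6, 7, 8, 50)  # toy vector; 50 is the obvious outlier
iqr_outliers(x)               # 7
```

The base function boxplot.stats() applies a closely related rule (using box-plot hinges) and returns the flagged values in its $out component.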
Two broad categories:
- Distance-based methods
- Density-based methods

Assumption: outliers often lie far from their neighbors.

Distance-based methods: average distance to the K-nearest neighbors.
Density-based methods: number of neighboring points within a certain distance.
get.knn() function from the FNN package
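A sketch of using it as a distance-based outlier score (the point coordinates are made up; the average distance to the k nearest neighbors serves as the score):

```r
library(FNN)

set.seed(42)
pts <- rbind(matrix(rnorm(40), ncol = 2),  # 20 points clustered near the origin
             c(10, 10))                    # one isolated point

# get.knn() returns the indices (nn.index) and distances (nn.dist)
# of the k nearest neighbors of each row
knn <- get.knn(pts, k = 5)
score <- rowMeans(knn$nn.dist)  # average distance to the 5 nearest neighbors

which.max(score)  # 21 -- the isolated point has the largest score
```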
Measures the local deviation of a data point with respect to its neighbors. Outliers are observations with a substantially lower density than their neighbors. Each observation x has an associated score LOF(x):

- LOF(x) ≈ 1: similar density to its neighbors
- LOF(x) < 1: higher density than neighbors (inlier)
- LOF(x) > 1: lower density than neighbors (outlier)
lof() function from the dbscan package
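A minimal sketch with made-up data (minPts controls the neighborhood size used to estimate local density):

```r
library(dbscan)

set.seed(42)
pts <- rbind(matrix(rnorm(40), ncol = 2),  # 20 points clustered near the origin
             c(10, 10))                    # one isolated point

scores <- lof(pts, minPts = 5)  # one LOF score per observation

which.max(scores)  # 21 -- the isolated point
scores[21] > 1     # TRUE: substantially lower density than its neighbors
```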
e.g., K-nearest neighbors (KNN), tree-based methods (decision tree, random forest)
e.g., mode imputation, linear imputation, KNN imputation
Use the 1.5 × IQR rule when a Gaussian distribution cannot be assumed. Apply domain knowledge to help make sense of your outliers!