Advanced Analytics in Business [D0S07a]
Big Data Platforms & Technologies [D0S06a]
Preprocessing
Overview
Selection
Cleaning
Transformation
Feature selection
Feature extraction
Sampling
2
The analytics process
3
The data set
Instance  Age  Account activity  Owns credit card  Churn
1         24   low               yes               no
2         21   medium            no                yes
3         42   high              yes               no
4         34   low               no                yes
5         19   medium            yes               yes
6         44   medium            no                no
7         31   high              yes               no
8         29   low               no                no
9         41   high              no                no
10        29   medium            yes               no
11        52   medium            no                yes
12        55   high              no                no
13        52   medium            yes               no
14        38   low               no                yes
4
The data set
A tabular data set ("structured data"):
Has instances (examples, rows, observations, customers, ...) and attributes (features, fields, variables, predictors)
These features can be:
Numeric (continuous) or categorical (discrete, nominal, ordinal, factor, binary)
Target (label, class, dependent variable) can be present
Can also be numeric, categorical, ...
5
Constructing a data set takes work
Merging different data sources
Levels of aggregation, e.g. household versus individual customer
Linking instance identifiers
Definition of the target variable
6
Selection
As data mining can only uncover patterns actually present in the data, the target data set must be large/complete enough to contain these patterns
But it must also be concise enough to be mined within an acceptable time limit
Data marts and data warehouses can help
When you want to predict a target variable y
Make absolutely sure you're not cheating by including explanatory variables that are too "perfectly" correlated with the target: "too good to be true"
Do you know this explanatory variable before the target outcome, or only after? Your finished model will not be able to look into the future!
Set aside a hold-out test set
7
Exploration
Visual analytics as a means for initial exploration (EDA: exploratory data analysis)
Boxplots
Scatter plots
Histograms
Correlation plots
Basic statistics
There is some debate on this topic
8
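A minimal sketch of such an initial exploration in R, using a small data frame in the spirit of the churn example above (the data and column names are purely illustrative):

df <- data.frame(
  age   = c(24, 21, 42, 34, 19, 44, 31, 29, 41, 29, 52, 55, 52, 38),
  churn = c("no", "yes", "no", "yes", "yes", "no", "no", "no", "no", "no", "yes", "no", "no", "yes")
)

summary(df)                      # basic statistics per column
hist(df$age)                     # histogram of a numeric feature
boxplot(age ~ churn, data = df)  # boxplot of age split by the target
table(df$churn)                  # class distribution of the target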
Cleaning
Consistency
Detecting errors and duplications: K.U.L., KU Leuven, K.U.Leuven, KULeuven, ...
Transforming different representations to common format
Male, Man, MAN → M
True, YES, Ok → 1
Removing “future variables”
Or variables that got modified according to the target at hand
9
Cleaning: missing values
Many techniques cannot deal with missing values
Not applicable (credit card limit = NA if no credit card owned) versus unknown or not disclosed (age = NA)
Detection is easy
Treatment:
Delete: complete row/column
Replace (impute): by mean, median, or mode, or by a separate model (the missing value becomes the target); often not worth it in practice
Keep: include a separate missing value indicator feature
10
Cleaning: missing values
11
Cleaning: missing values
Common approach: delete rows/columns with too many missing values, impute the others using median and mode, and add a separate column indicating whether the original value was missing
Sometimes imputation of the median/mode is based within the same class label
More advanced imputation: nearest-neighbor based
12
Cleaning: missing values
Always keep the production setting in mind: new unseen instances can contain missing values as well
Don't impute with a new median, but use the same "rules" as with the training data
Also when working with validation data!
What if we've never observed a missing value for this feature before?
Use the original training data to construct an imputation
Consider rebuilding your model (monitoring!)
13
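A minimal sketch of this approach in R (median imputation plus a missing-value indicator, learned on training data and reused on new data); the data frames and the income column are illustrative:

train <- data.frame(income = c(2000, NA, 3500, 2800, NA, 4100))
new   <- data.frame(income = c(NA, 3100))

# Learn the imputation rule on the training data only
income_median <- median(train$income, na.rm = TRUE)

# Add a missing-value indicator, then impute with the training median
train$income_missing <- as.integer(is.na(train$income))
train$income[is.na(train$income)] <- income_median

# Apply the *same* rule to validation/production data: reuse the training median
new$income_missing <- as.integer(is.na(new$income))
new$income[is.na(new$income)] <- income_median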
Cleaning: outliers
Extreme observations (age = 241 or income = 3 million)
Can be valid or invalid
Some techniques are sensitive to outliers
Detection: histograms, z-scores, box-plots
Sometimes the detection of outliers is the whole "analysis" task: anomaly detection (fraud analytics)
Treatment:
As missing: for invalid outliers, consider them as missing values
Treat: for valid outliers, but depends on the technique you will use: keep as-is, truncate (cap), bin, group
14
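A small sketch in R of detection via z-scores and treatment by truncation (capping); the data and the cut-offs are illustrative choices:

x <- c(23, 25, 31, 29, 241, 27, 35)   # one suspicious value (age = 241)

z <- (x - mean(x)) / sd(x)            # z-scores for detection
which(abs(z) > 2)                     # flag extreme values (the cut-off is a judgment call)

# Treat valid but extreme values by truncating (capping) at chosen percentiles
lo <- quantile(x, 0.01)
hi <- quantile(x, 0.99)
x_capped <- pmin(pmax(x, lo), hi)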
Cleaning: outliers
15
Cleaning: outliers
16
Cleaning: duplicates
Duplicate rows can be valid or invalid
Treat accordingly
17
Transformations: standardization and normalization
Standardization: constrain feature to ~N(0,1)
Good for Gaussian-distributed data and some techniques: SVMs, regression models, k-nearest neighbor
Useless for some techniques: decision trees
x_new = (x − μ) / σ
18
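A minimal sketch in R: estimate μ and σ on the training data and reuse the same parameters for new data (the vectors are illustrative):

x_train <- c(24, 21, 42, 34, 19, 44)
x_new   <- c(30, 55)

mu    <- mean(x_train)
sigma <- sd(x_train)

x_train_std <- (x_train - mu) / sigma   # roughly mean 0, sd 1 on the training data
x_new_std   <- (x_new - mu) / sigma     # same parameters, applied to new data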
Transformations: standardization and normalization
Normalization: also called feature scaling
Rescale to [0, 1] or [-1, 1]
In credit risk models, this is oftentimes applied to the resulting scores so that they fall in [0, 1000]
x_new = (x − x_min) / (x_max − x_min)
19
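And a matching min-max normalization sketch in R, again reusing the training minimum and maximum on new data (illustrative values; note that unseen extremes can then fall outside [0, 1]):

x_train <- c(24, 21, 42, 34, 19, 44)
x_new   <- c(30, 55)

x_min <- min(x_train)
x_max <- max(x_train)

x_train_norm <- (x_train - x_min) / (x_max - x_min)
x_new_norm   <- (x_new - x_min) / (x_max - x_min)   # can exceed [0, 1] for unseen extremes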
Transformations: categorization
Continuous to nominal
Binning: group ranges into categories
Can be useful to treat outliers
Oftentimes driven by domain knowledge
Equal width versus equal frequency
Grouping: grouping multiple nominal levels together
In case you have many levels (…)
20
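A small sketch of both flavours in R, using cut() for equal-width bins and quantile-based breaks for equal-frequency bins (the ages and the number of bins are illustrative choices):

age <- c(24, 21, 42, 34, 19, 44, 31, 29, 41, 29, 52, 55, 52, 38)

# Equal width: bins spanning equal ranges
age_eq_width <- cut(age, breaks = 4)

# Equal frequency: bins containing roughly the same number of instances
age_eq_freq <- cut(age, breaks = quantile(age, probs = seq(0, 1, 0.25)),
                   include.lowest = TRUE)

table(age_eq_width)
table(age_eq_freq)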
Transformations: dummyfication and other encodings
Nominal to continuous
Dummy variables: artificial variable to represent an attribute with multiple categories
Mainly used in regression (cannot deal with categoricals directly)
E.g. account activity = high, medium, low
Convert to: account_activity_high = 0/1, account_activity_medium = 0/1, account_activity_low = 0/1, ... and then drop one
Binary encoding
E.g. binarization:
account_activity_high = 1 → 001
account_activity_medium = 2 → 010
account_activity_low = 3 → 100
More compact than dummy variables
21
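A minimal dummy-encoding sketch in R with model.matrix(), which drops one reference level automatically (the data frame is illustrative; the account_activity levels follow the example above):

df <- data.frame(account_activity = factor(c("high", "medium", "low", "medium", "high")))

# One dummy column per level except the reference level (here "high", the first factor level)
dummies <- model.matrix(~ account_activity, data = df)[, -1]
dummies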
Transformations: high-level categoricals
Other ideas
Replace with counting variable (for high-level categoricals that are difficult to group using domain knowledge) (“grouping”)
E.g. postal code = 2000, 3000, 9401, 3001, ...
Solution 1: group using domain knowledge (working area, communities)
Solution 2: create a new variable postal_code_count, e.g. postal_code_count = 23 if the original postcode appears 23 times in the training data (again: keep the same rule for validation/production!)
You lose detailed information, but the goal is that the model can pick up on frequencies
Odds based grouping
Weight of evidence encoding
Probabilistic transformations and other "Kaggle" tricks such as Leave One Out Mean (Owen Zhang)
Risky, not always robust
Including noise and "jitter"
22
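A small sketch of the counting-variable idea in R: learn the counts on the training data and reuse that same lookup for validation/production (the data frames are illustrative; unseen codes get a count of 0):

train <- data.frame(postal_code = c("2000", "3000", "2000", "9401", "3001", "2000"))
new   <- data.frame(postal_code = c("3000", "8500"))   # 8500 was never seen in training

counts <- table(train$postal_code)   # the "rule", learned on the training data only

train$postal_code_count <- as.integer(counts[as.character(train$postal_code)])
new$postal_code_count   <- as.integer(counts[as.character(new$postal_code)])
new$postal_code_count[is.na(new$postal_code_count)] <- 0   # unseen levels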
Odds based grouping
The "pivot" table approach
Create a pivot table of the attribute versus target and compute the odds
Group variable values having similar odds
23
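A sketch of such a pivot table in R for a categorical feature and a binary target (the purpose/churn data is illustrative); levels with similar odds are candidates to be merged into one group:

df <- data.frame(
  purpose = c("car", "car", "car", "cash", "cash", "cash", "travel", "travel"),
  churn   = c("no",  "no",  "yes", "no",   "yes",  "yes",  "yes",    "no")
)

pivot <- table(df$purpose, df$churn)     # category versus target counts
odds  <- pivot[, "yes"] / pivot[, "no"]  # odds of churn per category
sort(odds)                               # group levels with similar odds together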
Weights of evidence encoding
Weights of evidence variables can be defined as follows:
WoE_cat = ln(p_c1,cat / p_c2,cat)
p_c1,cat = number of instances with class 1 in the category / number of instances with class 1 (total)
p_c2,cat = number of instances with class 2 in the category / number of instances with class 2 (total)
If p_c1,cat > p_c2,cat then WoE > 0; if p_c1,cat < p_c2,cat then WoE < 0
WoE has a monotonic relationship with the target variable
24
Weights of evidence encoding
Higher weights of evidence indicate less risk (monotonic relation to target)
Information Value (IV) is defined as:
IV = Σ_cat (p_G,cat − p_B,cat) × WoE_cat   (G and B denote the two classes, e.g. goods and bads)
Can be used to screen variables (important variables have IV > 0.1)
25
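A minimal sketch of the WoE and IV computation in R for one categorical feature and a binary target (illustrative data; in practice a small constant is often added to avoid division by zero for categories that contain only one class):

cat    <- c("low", "low", "low", "medium", "medium", "medium", "high", "high")
target <- c(1, 0, 1, 0, 1, 0, 1, 0)   # 1 = class 1 (e.g. good), 0 = class 2 (e.g. bad)

p_c1 <- tapply(target == 1, cat, sum) / sum(target == 1)   # share of class 1 per category
p_c2 <- tapply(target == 0, cat, sum) / sum(target == 0)   # share of class 2 per category

woe <- log(p_c1 / p_c2)           # weight of evidence per category
iv  <- sum((p_c1 - p_c2) * woe)   # information value of the whole feature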
Weights of evidence encoding
https://cran.r-project.org/web/packages/smbinning/index.html
26
Some other creative approaches
For geospatial data: group nearby communities together, Gaussian regression, ...
First build a decision tree only on the one categorical variable
1-dimensional k-means clustering on a continuous variable to suggest groups
27
Leave one out mean (Owen Zhang)
28
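A hedged sketch of the leave-one-out mean idea in R (all names and data are illustrative): on the training data, each row is encoded with the target mean of its category computed over all other rows, optionally with a little noise; validation/production rows simply get the plain training-data category mean.

train <- data.frame(
  city  = c("A", "A", "A", "B", "B", "B"),
  churn = c(1, 0, 1, 0, 0, 1)
)

cat_sum <- tapply(train$churn, train$city, sum)     # total churners per city
cat_n   <- tapply(train$churn, train$city, length)  # rows per city

# Training rows: leave the row's own target out of its category mean
train$city_loo <- (cat_sum[train$city] - train$churn) / (cat_n[train$city] - 1)

# Optionally add a little noise ("jitter") to make overfitting harder
set.seed(1)
train$city_loo <- train$city_loo * rnorm(nrow(train), mean = 1, sd = 0.01)

# New/validation rows get the plain training-data category mean
new <- data.frame(city = c("A", "B"))
new$city_enc <- (cat_sum / cat_n)[new$city]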
Hashing trick
The "hashing trick": for categoricals with many levels where it is expected that new levels will occur in new instances
Oftentimes used for text mining
Alternative for bagging:
29
Hashing trick
The "hashing trick": for categoricals with many levels where it is expected that new levels will occur in new instances
But: perhaps there are smarter approaches?
30
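A hedged sketch of the hashing trick in R; the digest package is an assumption here (any stable string hash works). Every level, including ones never seen during training, is mapped deterministically to one of a fixed number of buckets:

library(digest)   # assumed available: provides a stable hash of arbitrary strings

hash_bucket <- function(x, n_buckets = 32) {
  hex <- vapply(as.character(x), digest, character(1), algo = "md5", serialize = FALSE)
  (strtoi(substr(hex, 1, 7), base = 16L) %% n_buckets) + 1   # bucket id in 1..n_buckets
}

hash_bucket(c("postcode_3000", "postcode_9401", "postcode_never_seen_before"))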
Embeddings
Other “embedding” approaches are possible as well in the textual domain
"word2vec and friends"
Can even be used for high-level and sparse categoricals
We'll come back to this later on, when we talk about text mining in more detail
https://arxiv.org/pdf/1604.06737.pdf
31
Transformations: mathematical approaches
Logarithms, square roots, etc.
Mostly done to enforce some linearity
Box-Cox, Yeo-Johnson, and other "power transformations": applied to create a monotonic transformation of the data using power functions
Useful to stabilize variance, make the data more normal distribution-like, and improve the validity of measures of association such as the Pearson correlation
32
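A small sketch in R: a log transform for a right-skewed feature, plus a Box-Cox style power transform for a hand-picked lambda (the data and lambda are illustrative; helpers such as MASS::boxcox can suggest lambda for a regression model, but that is left out here):

income <- c(1200, 1500, 1800, 2500, 3000, 250000)   # heavily right-skewed

income_log <- log1p(income)                         # log(1 + x), safe for zero values

# Box-Cox style power transform for a chosen lambda > 0 (positive data only)
lambda <- 0.5
income_bc <- (income^lambda - 1) / lambda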
Transformations: interaction variables
When no linear effect is present on x1 ∼ y and x2 ∼ y, but there is one on f(x1, x2) ∼ y
In most cases: f(x1, x2) = x1 × x2
33
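A tiny sketch in R of constructing the multiplicative interaction feature by hand (illustrative data); inside a model formula, x1 * x2 achieves the same thing:

df <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(10, 5, 2, 1, 4, 3),
  y  = c(21, 12, 8, 6, 22, 20)
)

df$x1_x2 <- df$x1 * df$x2   # explicit multiplicative interaction feature

# Equivalent inside a model formula: x1 * x2 adds x1, x2 and the x1:x2 interaction
fit <- lm(y ~ x1 * x2, data = df)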
Transformations: deltas, trends, windows
Evolutions over time or differences between features are crucial in many settings
Solution 1: keep track of an instance through time and add it in as separate rows ("panel data analysis")
Solution 2: take one point in time as "base" and add in relative features
34
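A small dplyr sketch of the second solution: take the most recent month as the "base" and add relative (delta) features from earlier snapshots (the table and column names are illustrative):

library(dplyr)

balances <- data.frame(
  customer = c(1, 1, 1, 2, 2, 2),
  month    = c(1, 2, 3, 1, 2, 3),
  balance  = c(100, 120, 90, 500, 480, 470)
)

features <- balances %>%
  arrange(customer, month) %>%
  group_by(customer) %>%
  summarise(
    balance_now   = last(balance),                          # most recent month as "base"
    balance_delta = last(balance) - nth(balance, n() - 1),  # change versus the previous month
    balance_trend = last(balance) - first(balance)          # change over the full window
  )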
Feature selection
Oftentimes, you end up with lots of features
Will make many techniques go haywire: exploration space too large
Throw out weak features:
E.g. features with low variance
Chi-squared goodness-of-fit test
Information gain
Correlated features
More advanced techniques:
Stepwise introduction or removal
Based on genetic algorithms
Using techniques that provide a "variable importance" ranking
Retrain the model with the top-n features only and check performance stability
Apply clustering first and use distance to centroids (but also transforms)
Principal component analysis to lower dimensionality (but also transforms)
35
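A small sketch of two of the simpler filters in R: dropping (near-)zero-variance features and dropping one feature out of each highly correlated pair (the data and the 0.9 cut-off are illustrative):

X <- data.frame(
  f1 = c(1, 2, 3, 4, 5, 6),
  f2 = c(1, 1, 1, 1, 1, 1),     # no variance at all
  f3 = c(2, 4, 6, 8, 10, 12)    # perfectly correlated with f1
)

# Filter 1: remove features with (near-)zero variance
variances <- sapply(X, var)
X <- X[, variances > 1e-8, drop = FALSE]

# Filter 2: remove one feature from each highly correlated pair
cors <- cor(X)
cors[upper.tri(cors, diag = TRUE)] <- 0
drop <- apply(abs(cors) > 0.9, 2, any)
X <- X[, !drop, drop = FALSE]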
Principal component analysis
http://setosa.io/ev/principal-component-analysis/
Remains a surprisingly effective technique to have in your toolbox
Before and after the data mining
36
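A minimal PCA sketch in R with prcomp(); centering and scaling first matters because PCA is sensitive to the units of the features (the data frame is illustrative):

X <- data.frame(
  age    = c(24, 21, 42, 34, 19, 44, 31, 29),
  income = c(1800, 1500, 3200, 2600, 1400, 3500, 2300, 2100),
  tenure = c(1, 1, 8, 5, 1, 9, 4, 3)
)

pca <- prcomp(X, center = TRUE, scale. = TRUE)

summary(pca)            # proportion of variance explained per component
scores <- pca$x[, 1:2]  # keep the first two components as new, lower-dimensional features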
Feature extraction
How do you get a tabular data set out of something non-structured?
Squeeze out a representational vector
Very time-consuming/difficult
Approaches range from manual to... trained with models
37
Feature extraction
38
Feature engineering is the most important step
Most techniques are relatively “dumb” - any human domain knowledge and smart features, transformation, ... can help
Though avoid biasing data and remember correlation ≠ causation
39
Sampling
A subset of the data at hand
In case you have too much data to compute a model in a reasonable time
But this means throwing data away
Randomized
Block randomized
...
(Sampling will return later on, when we talk more about validation)
40
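A tiny sketch of a simple random sample in R (set.seed makes it reproducible; the data and the 10% fraction are illustrative):

set.seed(42)
df_full <- data.frame(id = 1:100000, x = rnorm(100000))

idx       <- sample(nrow(df_full), size = 0.1 * nrow(df_full))  # 10% simple random sample
df_sample <- df_full[idx, ]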
Conclusion
Pre-processing: many steps and checks
Depends on the technique being used later on, which might not yet be certain
Can you apply your pre-processing steps next month, on a future data set, in production?
Can you apply your pre-processing steps on the test set?
Time consuming! E.g. with SQL / pandas / Spark / ...: join columns, remove columns, create aggregates, sort, order, split, ...
Not the "fun" aspect of data science
Easy to introduce mistakes
41
# Packages used below (dplyr pipeline against a remote 'logs' table, plus stringr/tidyr helpers)
library(dplyr)
library(stringr)
library(tidyr)

# Make cross-link uid-sid: one row per uid with all of its valid sids pasted together
uid_sid <- logs %>%
  select(uid, sid) %>%
  distinct() %>%
  collect() %>%
  filter(sid != "" & !is.na(sid) & sid != "s0123456" & str_length(sid) == 8) %>%
  select(uid, sid) %>%
  distinct() %>%
  group_by(uid) %>%
  summarise(newsid = paste(sid, collapse = ' '), sidnr = n())
42
# Extract all interesting log events, match with the user ids,
# then retain only known students
data <- logs %>%
  select(time, address, type, value, uid, osid = sid) %>%
  mutate(stime = time / 1000) %>%                      # convert timestamp to seconds for as.POSIXct below
  collect() %>%
  mutate(sftime = format(as.POSIXct(stime, origin = "1970-01-01"), "%Y-%m-%d %H:%M:%S")) %>%
  inner_join(uid_sid, by = 'uid', copy = T) %>%        # attach the uid-sid cross-link built above
  mutate(rsid = ifelse(newsid == '', osid, newsid)) %>%
  select(stime, sftime, address, type, value, uid, rsid) %>%
  filter(type %in% c('EvaluationProgramSource', 'EvaluationProgramStatus',
                     'EvaluationProgramHash', 'EvaluationProgramHasErrors',
                     'EvaluationProgramError', 'EvaluationResult')) %>%
  separate(rsid, c('r1', 'r2', 'r3'), sep = ' ') %>%   # one row per linked sid (up to three)
  gather('rsidkey', 'rsid', r1, r2, r3) %>%
  filter(!is.na(rsid)) %>%
  arrange(stime) %>%
  filter(str_extract(rsid, "[0-9]+") %in% students$nummer) %>%  # keep known students only
  select(-rsidkey)