SLIDE 1

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a]

Preprocessing

SLIDE 2

Overview

Selection
Exploration
Cleaning
Transformations
Feature engineering
Feature selection
Conclusion

SLIDE 3

Today

SLIDE 4

Where we want to get to

Instance  Age  Account activity  Owns credit card  Churn
1         24   low               yes               no
2         21   medium            no                yes
3         42   high              yes               no
4         34   low               no                yes
5         19   medium            yes               yes
6         44   medium            no                no
7         31   high              yes               no
8         29   low               no                no
9         41   high              no                no
10        29   medium            yes               no
11        52   medium            no                yes
12        55   high              no                no
13        52   medium            yes               no
14        38   low               no                yes

SLIDE 5

The data set

Recall, a tabular data set (“structured data”):

Has instances (examples, rows, observations, customers, …)
And features (attributes, fields, variables, predictors, covariates, explanatory variables, regressors, independent variables)
These features can be:

Numeric (continuous)
Categorical (discrete, factor): either nominal (binary as a special case) or ordinal

A target (label, class, dependent variable, response variable) can also be present

Numeric, categorical, …

SLIDE 6

Constructing a data set takes work

Merging different data sources
Levels of aggregation, e.g. household versus individual customer
Linking instance identifiers
Definition of the target variable
Cleaning, preprocessing, featurization

Let’s go through the different steps…

SLIDE 7

Data Selection

SLIDE 8

External data

Social media data (e.g. Facebook, Twitter), e.g. for sentiment analysis
Macroeconomic data (e.g. GDP, inflation)
Weather data
Competitor data
Search data (e.g. Google Trends)
Web scraped data

Open data

External data that anyone can access, use and share
Government data (e.g. Eurostat, OECD)
Scientific data

Master data

Relates to the core entities a company works with
E.g. customers, products, employees, suppliers, vendors
Typically stored in operational databases and data warehouses (historical view)

Transactional data

Timing, quantity and items
E.g. POS data, credit card transactions, money transfers, web visits, etc.
Will typically require a featurization step (see later)

Data types

Though keep the production context in mind!

SLIDE 9

Example: Google Trends

https://medium.com/dataminingapps-articles/forecasting-with-google-trends-114ab741bda4

SLIDE 10

Example: Google Street View

https://arxiv.org/ftp/arxiv/papers/1904/1904.05270.pdf

SLIDE 11

Data types

Structured, unstructured, semi-structured data

Better: tabular, imagery, time series, …

Small vs. big data

Metadata

Data that describes other data: “data about data”
Data definitions, e.g. stored in the DBMS catalog
Oftentimes lacking, but can help a great deal in understanding, and in feature extraction as well

SLIDE 12

Selection

As data mining can only uncover patterns actually present in the data, the target data set must be large/complete enough to contain these patterns

But concise enough to be mined within an acceptable time limit
Data marts and data warehouses can help

When you want to predict a target variable y

Make absolutely sure you’re not cheating by including explanatory variables which are too “perfectly” correlated with the target: “too good to be true”
Do you know this explanatory variable after the target outcome, or before? At the time when your model will be used, or after?
Your finished model will not be able to look into the future!

Set aside a hold-out test set as soon as possible

SLIDE 13

Data Exploration

SLIDE 14

Exploration

Visual analytics as a means for initial exploration (EDA: exploratory data analysis)

Boxplots
Scatter plots
Histograms
Correlation plots
Basic statistics

There is some debate on this topic re: bias

SLIDE 15

Data Cleaning

SLIDE 16

Cleaning

Consistency

Detecting errors and duplications
K.U.L., KU Leuven, K.U.Leuven, KULeuven…

Transforming different representations to common format

Male, Man, MAN → M
True, YES, Ok → 1

Removing “future variables”

Or variables that got modified according to the target at hand

SLIDE 17

Missing values

Many techniques cannot deal with missing values

Not applicable (credit card limit = NA if no credit card owned) versus unknown or not disclosed (age = NA)

Missing at random vs. missing not at random

Detection is easy

Main treatment options:

Delete: complete row/column
Replace (impute): by mean, median (more robust), or mode; or by a separate model (the missing value then becomes the target to predict) – often not worth it in practice
Keep: include a separate missing-value indicator feature

SLIDE 18

Missing values

“Detection is easy?”

SLIDE 19

Missing values

missingno (Python) or VIM (R)

SLIDE 20

Missing values

SLIDE 21

Missing values

Common approach: delete rows with too many missing values, impute features using the median and mode, and add a separate column indicating whether the original value was missing

Sometimes the median/mode imputation is computed within the same class label
More advanced imputation: nearest-neighbor based
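A minimal pandas sketch of the common approach, on hypothetical age and account_activity columns; note that the imputation rules are learned on the training data only and then reused on new data (see the next slide):

```python
import numpy as np
import pandas as pd

# Toy train/new data with hypothetical columns
train = pd.DataFrame({"age": [24.0, 21.0, np.nan, 34.0],
                      "account_activity": ["low", None, "high", "low"]})
new = pd.DataFrame({"age": [np.nan, 19.0],
                    "account_activity": ["medium", None]})

# Learn the imputation "rules" on the training data only
age_median = train["age"].median()
activity_mode = train["account_activity"].mode()[0]

def impute(df):
    out = df.copy()
    out["age_missing"] = out["age"].isna().astype(int)  # keep: missingness indicator
    out["age"] = out["age"].fillna(age_median)          # replace: median (robust)
    out["account_activity"] = out["account_activity"].fillna(activity_mode)  # mode
    return out

train_clean, new_clean = impute(train), impute(new)     # same rules in production
```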

SLIDE 22

Intermezzo

Always keep the production setting in mind: new unseen instances can contain missing values as well
Don’t impute with a new median, but use the same “rules” as with the training data
The same holds when working with validation data!
What if we’ve never observed a missing value for this feature before? Use the original training data to construct an imputation
Consider rebuilding your model (monitoring!)

Everything you do over multiple instances becomes part of the model!

SLIDE 23

Outliers

Extreme observations (age = 241 or income = 3 million)

Can be valid or invalid

Some techniques are sensitive to outliers

Detection: histograms, z-scores, box plots
Sometimes the detection of outliers is the whole “analysis” task: anomaly detection (fraud analytics)

Treatment:

As missing: for invalid outliers, consider them as missing values
Treat: for valid outliers, but depends on the technique you will use: keep as-is, truncate (cap), categorize

SLIDE 24

Outliers

SLIDE 25

Outliers

SLIDE 26

Outliers

SLIDE 27

Duplicate rows

Duplicate rows can be valid or invalid; treat accordingly

SLIDE 28

Transformations: standardization

Standardization: constrain a feature to ∼ N(0, 1):

x_new = (x − μ) / σ

Good/necessary for Gaussian-distributed data and some techniques: SVMs, regression models, k-nearest neighbors, everything working with a Euclidean distance or similarity
Useless for other techniques: decision trees

SLIDE 29

Transformations: normalization

Normalization: also called “feature scaling”

Rescale to [0, 1] or [−1, 1]:

x_new = (x − x_min) / (x_max − x_min)

In credit risk models, this is oftentimes applied to the resulting scores so that they fall in [0, 1000]
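A small numeric sketch of both transformations; in practice, fit μ, σ, min and max on the training data only and reuse them (e.g. via scikit-learn’s StandardScaler and MinMaxScaler):

```python
import numpy as np

x = np.array([24.0, 21.0, 42.0, 34.0, 19.0])

x_std = (x - x.mean()) / x.std()              # standardization, aiming for ~N(0, 1)
x_norm = (x - x.min()) / (x.max() - x.min())  # min-max normalization to [0, 1]
```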

SLIDE 30

Transformations: categorization

Also called “coarse classification”, “classing”, “binning”, “grouping”

Continuous to nominal

Binning: group ranges into categories
Can be useful to treat outliers
Oftentimes driven by domain knowledge
Equal-width/interval binning versus equal-frequency binning (histogram equalization)

Nominal to reduced nominal

Grouping: grouping multiple nominal levels together
In case you have many levels (…)
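A minimal pandas sketch of both binning flavors:

```python
import pandas as pd

age = pd.Series([24, 21, 42, 34, 19, 44, 31, 29, 41, 29])

equal_width = pd.cut(age, bins=3)   # equal-width intervals
equal_freq = pd.qcut(age, q=3)      # roughly equal number of instances per bin
```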

SLIDE 31

Transformations: categorization

Treat outliers
Make the final model more interpretable
Reduce the curse of dimensionality following from a high number of levels
Introduce non-linear effects into linear models

SLIDE 32

Transformations: dummyfication and other encodings

Nominal to continuous

Dummy variables: artificial variables to represent an attribute with multiple categories (“one-hot-encoding”)

Mainly used in regression (cannot deal with categoricals directly)
E.g. account activity = high, medium, low
Convert to: account_activity_high = 0/1, account_activity_medium = 0/1, account_activity_low = 0/1 … and then drop one

Binary encoding

E.g. binarization:
account_activity_high = 1 → 001
account_activity_medium = 2 → 010
account_activity_low = 3 → 100
More compact than dummy variables
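A minimal pandas sketch of one-hot encoding with one level dropped:

```python
import pandas as pd

df = pd.DataFrame({"account_activity": ["high", "medium", "low", "medium"]})

# One dummy per level; drop one to avoid perfect collinearity in regression
dummies = pd.get_dummies(df["account_activity"], prefix="account_activity",
                         drop_first=True)
```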

SLIDE 33

Transformations: high-level categoricals

What now if we have a categorical variable with too many levels (or, alternatively, too many dummy variables)?

Domain-knowledge driven grouping (e.g. NACE codes)
Frequency-based grouping

E.g. postal code = 2000, 3000, 9401, 3001, …
Solution 1: group using domain knowledge (working areas, communities)
Solution 2: create a new variable postal_code_count, e.g. postal_code_count = 23 if the original postal code appears 23 times in the training data (again: keep the same rule for validation/production! See the sketch after the list below)
We lose detailed information, but the goal is that the model can pick up on frequencies

Odds based grouping
Weight of evidence encoding
Probabilistic transformations and other “Kaggle” tricks such as Leave One Out Mean (Owen Zhang)
Decision tree based
Embeddings
Not: integer encoding if your variable is not ordinal!
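A minimal pandas sketch of the frequency-based grouping (Solution 2), reusing the training counts on unseen data:

```python
import pandas as pd

train = pd.DataFrame({"postal_code": ["2000", "3000", "2000", "9401", "2000"]})
new = pd.DataFrame({"postal_code": ["3000", "5000"]})  # "5000" never seen in training

# Count each level in the training data and reuse that mapping everywhere
counts = train["postal_code"].value_counts()
train["postal_code_count"] = train["postal_code"].map(counts)
new["postal_code_count"] = new["postal_code"].map(counts).fillna(0)
```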

SLIDE 34

Transformations: odds based grouping

Drawback of equal-interval or equal-frequency binning: they do not take the outcome into account

The “pivot table” approach:

Create a pivot table of the attribute versus the target and compute the odds
Group variable values having similar odds

SLIDE 35

Transformations: odds based grouping

How to verify which option is better? Consider the following example taken from the book Credit Scoring and Its Applications:

                 Owner  Rent Unfurnished  Rent Furnished  Parents  Other  No Answer  Total
Goods             6000              1600             350      950     90         10   9000
Bads               300               400             140      100     50         10   1000
Goods/Bads odds     20                 4             2.5      9.5    1.8          1      9

Let’s say we have two options to group the levels:

1. Owners, Renters (Rent Unfurnished + Rent Furnished), and Others (every other level)
2. Owners, Parents, and Others

SLIDE 36

Transformations: odds based grouping

Empirical frequencies for Option 1:

        Owners  Renters  Others  Total
Goods     6000     1950    1050   9000
Bads       300      540     160   1000
Total     6300     2490    1210  10000

Independence frequencies for Option 1:

E.g. the number of good owners, given that the odds are the same as in the whole population, is 6300/10000 × 9000/10000 × 10000 = 5670

        Owners  Renters  Others  Total
Goods     5670     2241    1089   9000
Bads       630      249     121   1000
Total     6300     2490    1210  10000

Chi-square distance:

χ² = (6000−5670)²/5670 + (300−630)²/630 + (1950−2241)²/2241 + (540−249)²/249 + (1050−1089)²/1089 + (160−121)²/121 = 583

SLIDE 37

Transformations: odds based grouping

Chi-square distance for Option 1:

χ² = (6000−5670)²/5670 + (300−630)²/630 + (1950−2241)²/2241 + (540−249)²/249 + (1050−1089)²/1089 + (160−121)²/121 = 583

Chi-square distance for Option 2:

χ² = (6000−5670)²/5670 + (300−630)²/630 + (950−945)²/945 + (100−105)²/105 + (2050−2385)²/2385 + (600−265)²/265 = 662

In order to judge significance, the obtained chi-square statistic should follow a chi-square distribution with k − 1 degrees of freedom, with k the number of classes (3 in our case). This can then be summarized by a p-value to see whether there is a statistically significant dependence or not.

Since both options use 3 categories, we can directly compare the value of 662 to 583, and since the former is bigger, conclude that Option 2 is the better coarse classification.
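A small NumPy check of these computations; the expected counts are the independence frequencies, and the results match the 583 and 662 above up to rounding:

```python
import numpy as np

def chi2_distance(observed):
    """Chi-square statistic for a 2 x k goods/bads table of grouped levels."""
    observed = np.asarray(observed, dtype=float)
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return ((observed - expected) ** 2 / expected).sum()

option1 = [[6000, 1950, 1050],   # goods: Owners / Renters / Others
           [ 300,  540,  160]]   # bads
option2 = [[6000,  950, 2050],   # goods: Owners / Parents / Others
           [ 300,  100,  600]]   # bads
print(chi2_distance(option1), chi2_distance(option2))  # ~583 vs ~662
```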

SLIDE 38

Weights of evidence encoding

Weights of evidence variables can be defined as follows:

WoE_cat = ln(p_{c1,cat} / p_{c2,cat})

with:

p_{c1,cat} = number of instances with class 1 in the category / number of instances with class 1 (total)
p_{c2,cat} = number of instances with class 2 in the category / number of instances with class 2 (total)

If p_{c1,cat} > p_{c2,cat} then WoE > 0; if p_{c1,cat} < p_{c2,cat} then WoE < 0
WoE has a monotonic relationship with the target variable!
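A minimal pandas/NumPy sketch of WoE encoding on a toy churn column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"activity": ["high"] * 4 + ["low"] * 4,
                   "churn":    [0, 0, 0, 1, 1, 1, 0, 1]})

ct = pd.crosstab(df["activity"], df["churn"])
p_c1 = ct[1] / ct[1].sum()       # share of class-1 instances per category
p_c2 = ct[0] / ct[0].sum()       # share of class-2 instances per category
woe = np.log(p_c1 / p_c2)        # WoE_cat = ln(p_c1,cat / p_c2,cat)
df["activity_woe"] = df["activity"].map(woe)
```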

SLIDE 39

Weights of evidence encoding

A higher weight of evidence indicates less risk (monotonic relation to the target)
Information Value (IV) is defined as: IV = ∑ (p_G − p_B) × WoE
Can be used to screen variables (important variables have IV > 0.1)

SLIDE 40

Weights of evidence encoding

Category boundaries

Optimize so as to maximize IV

See: https://cran.r-project.org/web/packages/smbinning/index.html

SLIDE 41

Weights of evidence encoding

Number of categories?

Trade-off

Fewer categories because of simplicity, interpretability and stability
More categories to keep predictive power

Practical: perform sensitivity analysis

IV versus number of categories
Decide on a cut-off: elbow point?

Note: the fewer values in a category, the less reliable/robust/stable the WoE value
Laplace smoothing:

WoE_cat = ln((p_{c1,cat} + n) / (p_{c2,cat} + n))

n: smoothing parameter; larger n → less reliance on the data, pushes the WoE closer to 0
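A tiny sketch of the smoothed variant; with the proportions of a sparse category, the WoE is pulled toward 0:

```python
import numpy as np

def smoothed_woe(p_c1_cat, p_c2_cat, n=0.05):
    # Larger n -> less reliance on the data, WoE pushed toward 0
    return np.log((p_c1_cat + n) / (p_c2_cat + n))

print(np.log(0.010 / 0.001))        # raw WoE of a sparse category: ~2.30
print(smoothed_woe(0.010, 0.001))   # smoothed: ~0.16, much closer to 0
```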

SLIDE 42

Weights of evidence encoding

See the smbinning and Information packages in R, and https://www.kaggle.com/puremath86/iv-woe-starter-for-python for Python

SLIDE 43

Some other creative approaches

For geospatial data: group nearby communities together, spatial interpolation, …
First build a decision tree on only the one categorical variable and the target
1-dimensional k-means clustering on a continuous variable to suggest groups
Probabilistic transformations and other “Kaggle” tricks such as Leave One Out Mean (Owen Zhang)
Categorical embeddings

SLIDE 44

Leave one out mean (Owen Zhang)

Sometimes, a small “jitter” is added as well
Dangerous in practice
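A sketch of leave-one-out mean encoding on a hypothetical city column; note the caveats (single-instance categories divide by zero, and the target leakage is exactly what makes this dangerous in practice):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"city": ["A", "A", "A", "B", "B"],
                   "y":    [1, 0, 1, 0, 0]})

# Mean of the target per category, excluding each row's own label
g = df.groupby("city")["y"]
sums, counts = g.transform("sum"), g.transform("count")
df["city_loo"] = (sums - df["y"]) / (counts - 1)  # single-member categories: div by 0!

# The small "jitter" sometimes added on top
df["city_loo"] += np.random.normal(0, 0.01, size=len(df))
```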

SLIDE 45

Hashing trick

The “hashing trick”: for categoricals with many levels and when it is expected that new levels will occur in new instances

Oftentimes used for text mining

Alternative for bagging:
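A minimal sketch with scikit-learn’s FeatureHasher, on hypothetical postal-code levels:

```python
from sklearn.feature_extraction import FeatureHasher

# Hash arbitrary (even previously unseen) levels into a fixed number of columns
hasher = FeatureHasher(n_features=8, input_type="string")
X = hasher.transform([["postal=2000"], ["postal=9401"], ["postal=5000"]])
print(X.toarray().shape)  # (3, 8); new levels in production hash the same way
```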

SLIDE 46

Hashing trick

The “hashing trick”: for categoricals with many levels and when it is expected that new levels will occur in new instances
But: perhaps there are smarter approaches?

SLIDE 47

Embeddings

Other “embedding” approaches are possible as well in the textual domain

“word2vec and friends”
Can even be used for high-level and sparse categoricals
We’ll come back to this later on, when we talk about text mining in more detail

https://arxiv.org/pdf/1604.06737.pdf

SLIDE 48

Transformations: mathematical approaches

Logarithms, square roots, etc

Mostly done to enforce some linearity, or more Gaussian behavior
Box-Cox, Yeo-Johnson, and other “power transformations”, e.g. for saturation effects
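A small sketch of such transformations; scipy’s boxcox fits the power parameter itself (Box-Cox requires strictly positive values):

```python
import numpy as np
from scipy import stats

income = np.array([1200.0, 1800.0, 2500.0, 9000.0, 3e6])

log_income = np.log1p(income)          # compress the long right tail
bc_income, lam = stats.boxcox(income)  # Box-Cox: lambda fit by maximum likelihood
```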

SLIDE 49

Transformations: interaction variables

When no linear effect is present on x1 ∼ y and x2 ∼ y, but there is one on f(x1, x2) ∼ y
In most cases: f(x1, x2) = x1 × x2
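A tiny pandas sketch of adding such an interaction feature; scikit-learn’s PolynomialFeatures with interaction_only=True automates this across many columns:

```python
import pandas as pd

df = pd.DataFrame({"x1": [0.2, 0.5, 0.8], "x2": [1.0, -1.0, 0.5]})
df["x1_x2"] = df["x1"] * df["x2"]  # f(x1, x2) = x1 * x2 as an extra feature
```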

SLIDE 50

Feature Engineering

SLIDE 51

Feature engineering

The goal of the transformations above: creating more informative features!

Important to take the bias of your analytical technique into account

E.g. new features to make the data linearly separable for linear models

However: a key step for any kind of model

E.g. “date of birth” → “age”
Most techniques are relatively “dumb” – any human domain knowledge and smart features, transformations, … can help
Though avoid biasing data, and remember: correlation ≠ causation

“The aim of feature engineering is to transform data set variables into features so as to help the analytical models achieve better performance in terms of either predictive performance, interpretability or both.”

SLIDE 52

Feature engineering

https://www.kdnuggets.com/2018/12/feature-engineering-explained.html

SLIDE 53

Feature engineering: RFM features

Already popular since Cullinan (1977)

Recency: recency of transaction
Frequency: frequency of transaction
Monetary: monetary value of transaction

Can be operationalized in various ways

Frequency in the last month, last two months, …
Highest monetary value in the last month, average, minimum
Recency with varying time decays
Grouped by product, type, …

Popular in e.g. marketing and fraud analytics
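A minimal pandas sketch that operationalizes one variant of R, F and M from a toy transaction table (hypothetical column names):

```python
import pandas as pd

trx = pd.DataFrame({"customer_id": [1, 1, 2],
                    "date": pd.to_datetime(["2024-11-01", "2024-12-15", "2024-10-05"]),
                    "amount": [10.0, 25.0, 40.0]})
now = pd.Timestamp("2025-01-01")

rfm = trx.groupby("customer_id").agg(
    recency=("date", lambda d: (now - d.max()).days),  # days since last transaction
    frequency=("date", "count"),                       # number of transactions
    monetary=("amount", "mean"),                       # average transaction value
).reset_index()
```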

SLIDE 54

Feature engineering: RFM features

recency = e^(−γt)

SLIDE 55

Feature engineering: time features

Capture information about the time aspect with meaningful features

Dealing with time can be tricky

00:00 = 24:00
No natural ordering: 23:00 > 01:00?

Do not use arithmetic mean to compute average timestamp

Model time as a periodic variable using the von Mises distribution: the distribution of a normally distributed variable wrapped around a circle, defined by two parameters μ and κ

μ is the periodic mean
κ is a measure of concentration, such that 1/κ is the periodic variance
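A small NumPy sketch of periodic handling: encode clock time on a circle and take the circular mean (the mean direction underlying the von Mises view), rather than the arithmetic mean:

```python
import numpy as np

hours = np.array([23.0, 1.0])             # 23:00 and 01:00

angle = 2 * np.pi * hours / 24            # map the clock onto a circle
sin_t, cos_t = np.sin(angle), np.cos(angle)

# Periodic mean: ~0:00, instead of the misleading arithmetic mean of 12:00
mean_hour = (np.arctan2(sin_t.mean(), cos_t.mean()) * 24 / (2 * np.pi)) % 24
```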

SLIDE 56

Feature engineering: date features

Time since a date
Quarter, month of year
Week, day of month
Weekend or not
Holiday or not
…
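A small pandas sketch deriving several of these date features:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2024-03-29", "2024-12-25"]))

features = pd.DataFrame({
    "days_since":   (pd.Timestamp("2025-01-01") - dates).dt.days,
    "quarter":      dates.dt.quarter,
    "month":        dates.dt.month,
    "day_of_month": dates.dt.day,
    "is_weekend":   (dates.dt.dayofweek >= 5).astype(int),
})
```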

SLIDE 57

Feature engineering: delta’s, trends, windows

Evolutions over time or differences between features are crucial in many settings

Solution 1: Keep track of an instance through time and add it in as separate rows (“panel data analysis”)
Solution 2: One point in time as “base”, add in relative features

57

SLIDE 58

Feature engineering: delta’s, trends, windows

Absolute trends: (F_t − F_{t−x}) / x
Relative trends: (F_t − F_{t−x}) / F_{t−x}

Can be useful for size variables (e.g., asset size, loan amounts) and ratios
Beware of denominators equal to 0!
Can put a higher weight on recent values
Extension: time series analysis
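A small NumPy sketch of the relative trend, with a guard for zero denominators:

```python
import numpy as np

def relative_trend(f_t, f_prev):
    """(F_t - F_{t-x}) / F_{t-x}, with 0-denominators mapped to NaN."""
    out = np.full_like(f_t, np.nan, dtype=float)
    np.divide(f_t - f_prev, f_prev, out=out, where=f_prev != 0)
    return out

print(relative_trend(np.array([120.0, 80.0, 50.0]),
                     np.array([100.0, 100.0, 0.0])))  # [0.2, -0.2, nan]
```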

SLIDE 59

Feature engineering: ordinal variables

Ordinal features have intrinsic ordering in their values (e.g., credit rating, debt seniority)

Don’t: integer coding

Percentile coding

E.g. class ratings
Suppose 10% AAA, 35% AA, 15% A, 30% B and 10% C
Assign a real-valued number based on percentiles: 0.1, 0.45, 0.60, 0.90 and 1

Thermometer coding

Progressively codes the ordinal scale of the variable:

      F1  F2  F3  F4
AAA
AA     1
A      1   1
B      1   1   1
C      1   1   1   1
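A small pandas sketch of thermometer coding for the rating scale above:

```python
import pandas as pd

levels = ["AAA", "AA", "A", "B", "C"]        # ordered best to worst
ratings = pd.Series(["AA", "C", "AAA"])

rank = ratings.map({lvl: i for i, lvl in enumerate(levels)})
# F_j = 1 once the rating has moved at least j steps down the scale
thermo = pd.DataFrame({f"F{j}": (rank >= j).astype(int)
                       for j in range(1, len(levels))})
```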

SLIDE 60

Feature engineering: relational data

SLIDE 61

Feature engineering: relational data

dm : https://krlmlr.github.io/dm/

Analytics typically requires or presumes the data will be presented in a single table

Denormalization refers to the merging of several normalized source data tables into an aggregated, denormalized data table
Merging tables involves selecting and/or aggregating information from different tables related to an individual entity, and copying it to the aggregated data table
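A minimal pandas sketch of denormalization: aggregate a transactions table per customer, then merge it onto the customer table (hypothetical column names):

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2]})
transactions = pd.DataFrame({"customer_id": [1, 1, 2],
                             "amount": [10.0, 25.0, 40.0]})

# Aggregate the child table per entity, then merge onto the entity table
agg = (transactions.groupby("customer_id")["amount"]
       .agg(n_trx="count", total_amount="sum", avg_amount="mean")
       .reset_index())
flat = customers.merge(agg, on="customer_id", how="left")
```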

SLIDE 62

Featurization

SLIDE 63

Featurization

How do we get a tabular data set out of something non-structured?

Squeeze out a representational vector
Very time-consuming/difficult
Approaches range from manual to… trained with models

(Deep learning as a feature extractor, not necessarily as final model)

SLIDE 64

Featurization

featuretools: an open source Python framework for automated feature engineering, https://www.featuretools.com/

Basically applies lots of auto-aggregations and transformations
“Excels at transforming temporal and relational datasets into feature matrices for machine learning”

stumpy: https://github.com/TDAmeritrade/stumpy

“STUMPY is a powerful and scalable library that efficiently computes something called the matrix profile, which can be used for a variety of time series tasks”

tsfresh

“tsfresh is used to extract characteristics from time series”

SLIDE 65

Feature Selection

SLIDE 66

Feature selection

Oftentimes, you end up with lots of features

Will make many techniques go haywire: the exploration space becomes too large
Throw out weak features:

E.g. features with low variance
Chi-square goodness-of-fit test
Information gain
Correlated features
Information value

More advanced techniques:

Stepwise introduction or removal
Regularization
Based on genetic algorithms
Using techniques that provide a “variable importance” ranking

Retrain model with top-n only and check performance stability

Principal component analysis to lower dimensionality
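A short scikit-learn sketch combining two of the ideas above, a variance filter and a model-based variable importance ranking, on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

X_var = VarianceThreshold(threshold=0.01).fit_transform(X)  # drop near-constant features

# Rank the rest by a model-based variable importance; keep the top n
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_var, y)
top_n = np.argsort(rf.feature_importances_)[::-1][:10]
```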

SLIDE 67

Principal component analysis

http://setosa.io/ev/principal-component-analysis/

Remains a surprisingly effective technique to have in your toolbox

Note: dimensionality reduction rather than selection
Before and after the data mining
Note: standardization/normalization required
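A short scikit-learn sketch: standardize first (PCA is scale-sensitive), then keep 95% of the variance (synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=200, n_features=20, random_state=0)

pca = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
```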

SLIDE 68

Conclusion

SLIDE 69

More on feature engineering and dimensionality reduction

Stepwise selection
Regularization
Variable importance
Clustering
t-SNE
UMAP
Features from text
Categorical embeddings

(We’ll come back to these during later sessions)

SLIDE 70

Sampling

Think carefully about the population on which the model that is going to be built using the sample will be used

Timing of sample

How far do I go back to get my sample?
Trade-off: more data versus recent data
The sample taken must be from a normal business period, to get as accurate a picture as possible of the target population

Too many instances to compute a model in a reasonable time?

Don’t be afraid to quickly iterate on a small sample first!

(Sampling will return in a different context later on, when we talk more about validation: over/under/smart sampling)

SLIDE 71

Conclusion

Pre-processing: many steps and checks

Depends on the technique being used later on, which might not yet be certain
Can you apply your pre-processing steps next month, on a future data set, in production?

I.e.: can you apply your pre-processing steps on the test set?

Time consuming! E.g. with SQL / pandas / Spark / …: join columns, remove columns, create aggregates, sort, order, split, …

Not the “fun” aspect of data science
Easy to introduce mistakes
