Advanced Analytics in Business [D0S07a] Big Data Platforms & - PowerPoint PPT Presentation

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Preprocessing

Overview Selection Exploration Cleaning Transformations Feature engineering Feature selection Conclusion 2

Today 3

Where we want to get to Instance Age Account activity Owns credit card Churn 1 24 low yes no 2 21 medium no yes 3 42 high yes no 4 34 low no yes 5 19 medium yes yes 6 44 medium no no 7 31 high yes no 8 29 low no no 9 41 high no no 10 29 medium yes no 11 52 medium no yes 12 55 high no no 13 52 medium yes no 14 38 low no yes 4

The data set Recall, a tabular data set (“structured data”): Has instances (examples, rows, observations, customers, …) And features (attributes, fields, variables, predictors, covariates, explanatory variables, regressors, independent variables) These features can be: Numeric (continuous) Categorical (discrete, factor) either nominal (binary as a special case) or ordinal Target (label, class, dependent variable, repsonse variable) can also be present Numeric, categorical, … 5

Constructing a data set takes work Merging different data sources Levels of aggregation, e.g. household versus individual customer Linking instance identifiers Definition of target variable Cleaning, preprocessing, featurization Let’s go through the different steps… 6

Data Selection 7

Data types Master data External data Relates to core entities company is working Social media data (e.g. Facebook, Twitter, with e.g. for sentiment analysis) E.g. Customers, Products, Employees, Macro economic data (e.g. GPD, inflation) Suppliers, Vendors Weather data Typically stores in operational data bases Competitor data and data warehouses (historical view) Search data (e.g. Google Trends) Transactional data Web scraped data Timing, quantity and items Open data E.g. POS data, credit card transactions, money transfers, web visits, etc. External data that anyone can access, use and share Will typically require a featurization step (see later) Government data (e.g. Eurostat, OECD) Scientific data Though keep the production context in mind! 8

Example: Google Trends https://medium.com/dataminingapps-articles/forecasting-with-google-trends-114ab741bda4 9

Example: Google Street View https://arxiv.org/ftp/arxiv/papers/1904/1904.05270.pdf 10

Data types Structured, unstructured, semi-structured data Better: tabular, imagery, time series, … Small vs. big data Metadata Data that describes other data Data about data Data definitions E.g. stored in DBMS catalog Oftentimes lacking but can help a great deal in understanding, and feature extraction as well 11

Selection As data mining can only uncover patterns actually present in the data, target data set must be large/complete enough to contain these patterns But concise enough to be mined within an acceptable time limit Data marts and data warehouses can help When you want to predict a target variable y Make absolutely sure you’re not cheating by including explanatory variables which are too “perfectly” correlated with the target “Too good to be true” Do you know this explanatory variable after the target outcome, or before? At the time when your model will be used, or after? Your finished model will not be able to look into the future! Set aside a hold-out test set as soon as possible 12

Data Exploration 13

Exploration Visual analytics as a means for initial exploration (EDA: exploratory data analysis) Boxplots Scatter plots Histograms Correlation plots Basic statistics There is some debate on this topic re: bias 14

Data Cleaning 15

Cleaning Consistency Detecting errors and duplications K.U.L., KU Leuven, K.U.Leuven, KULeuven… Transforming different representations to common format Male, Man, MAN → M True, YES, Ok → 1 Removing “future variables” Or variables that got modified according to the target at hand 16

Missing values Many techniques cannot deal with missing values Not applicable (credit card limit = NA if no credit card owned) versus unknown or not disclosed (age = NA) Missing at random vs. missing not at random Detection is easy Main treatment options: Delete : complete row/column Replace (impute): by mean, median (more robust), or mode Or by separate model (the missing value then becomes the target to predict) – often not worth it in practice Keep : include a separate missing value indicator feature 17

Missing values “Detection is easy?” 18

Missing values missingno (Python) or VIM (R) 19

Missing values 20

Missing values Common approach: delete rows with too many missing values, impute features using median and mode, add separate column indicating if original value was missing Sometimes imputation of median/mode based within same class-label More advanced imputation: nearest-neighbor based 21

Intermezzo Always keep the production setting in mind: new unseen instances can contain missing values as well Don’t impute with a new median! But using the same “rules” as with the training data Also when working with validation data! What if we’ve never observe a missing value for this feature before? Use the original training data to construct an imputation Consider rebuilding your model (monitoring!) Everything you do over multiple instances becomes part of the model! 22

Outliers Extreme observations (age = 241 or income = 3 million) Can be valid or invalid Some techniques are sensitive to outliers Detection: histograms, z-scores, box-plots Sometimes the detection of outliers is the whole “analysis” task Anomaly detection (fraud analytics) Treatment: As missing : for invalid outliers, consider them as missing values Treat : for valid outliers, but depends on the technique you will use: keep as-is, truncate (cap), categorize 23

Outliers 24

Outliers 25

Outliers 26

Duplicate rows Duplicate rows can be valid or invalid Treat accordingly 27

Transformations: standardization Standardization: constrain feature to ∼ N (0, 1) Good/necessary for Gaussian-distributed data and some techniques: SVMs, regression models, k-nearest neighbor, everything working with Euclidian distance or similarity Useless for other techniques: decision trees x new = x − μ σ 28

Transformations: normalization Normalization: also called “feature scaling” Rescale to [0,1], [-1,1] In credit risk models, this is oftentimes applied to the resulting scores so that they fall in [0, 1000] x − x min x new = x max − x min 29

Transformations: categorization Also called “coarse classification”, “classing”, “binning”, “grouping” Continuous to nominal Binning: group ranges into categories Can be useful to treat outliers Oftentimes driven by domain knowledge Equal width/interval binning versus equal frequency binning (histogram equalization) Nominal to reduced nominal Grouping: grouping multiple nominal levels together In case you have many levels (…) 30

Transformations: categorization Treat outliers Make final model more interpretable Reduce curse of dimensionality following from high number of levels Introduce non-linear effects into linear models 31

Transformations: dummyfication and other encodings Nominal to continuous Dummy variables: artificial variable to represent an attribute with multiple categories (“one-hot-encoding”) Mainly used in regression (cannot deal with categoricals directly) E.g. account activity = high, medium, low Convert to: account_activity_high = 0, 1 account_activity_medium = 0, 1 account_activity_low = 0, 1 … and then drop one Binary encoding E.g. binarization: account_activity_high = 1 → 001 account_activity_medium = 2 → 010 account_activity_low = 3 → 100 More compact than dummy variables 32

Transformations: high-level categoricals What now if we have a categorical value with too many levels (or, alternatively, too many dummy variables) Domain-knowledge driven grouping (e.g. NACE codes) Frequency-based grouping E.g. postal code = 2000, 3000, 9401, 3001… Solution 1: Group using domain knowledge (working area, communities) Solution 2: Create new variable postal_code_count E.g. postal_code_count = 23 if original postcode appears 23 times in training data (Again: keep the same rule for validation/production!) Lose detailed information but goal is that model can pick up on frequencies Odds based grouping Weight of evidence encoding Probabilistic transformations and other “Kaggle”-tricks such as Leave One Out Mean (Owen Zhang) Decision tree based Embeddings Not : integer encoding if your variable is not ordinal! 33

Transformations: odds based grouping Drawback of equal-interval or equal-frequency based binning: do not take outcome into account The “pivot” table approach Create a pivot table of the attribute versus target and compute the odds Group variable values having similar odds 34

Advanced Analytics in Business [D0S07a] Big Data Platforms & - PowerPoint PPT Presentation

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Preprocessing Overview Selection Exploration Cleaning Transformations Feature engineering Feature selection Conclusion 2 Today 3 Where we want to

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Hadoop and

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Hadoop and

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Data

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Data

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Setting

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a]

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Supervised

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Setting

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Text

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Model

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Social

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Model

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Artificial

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Ensemble

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a]

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a]

What Else Can Central Banks Do ? Marking the launch of the Eighteenth Geneva Report on the

9M2008 Results Announcement November 7, 2008 Scope of Presentation CEOs Report

The ECB and the Crises of the Eurozone 5 th EU Expert Discussion of the Rosa-Luxemburg Foundation

Marketing & Distribution Creating Action for your LTC business by Creating Action for

Todays Agenda Reminder: Test 0 (on precalculus) is this Friday, 8/28. If you do well, you

Recommendations for Bachelor-/ Master Programs in Computer Science Hans-Ulrich Heiss Faculty IV

PERSPECTIVES FOR ITALIAN STEEL PRODUCERS, III. INTERNATIONAL STEEL TRADE DAY RE-ROLLERS AND

Learning Outcomes s for Credit it Tran ansfer fer in VET and Higher er Educat cation

Advanced Analytics in Business [D0S07a] Big Data Platforms & - PowerPoint PPT Presentation

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Preprocessing Overview Selection Exploration Cleaning Transformations Feature engineering Feature selection Conclusion 2 Today 3 Where we want to

Advanced Analytics in Business [D0S07a] Big Data Platforms &amp; Technologies [D0S06a] Hadoop and

Advanced Analytics in Business [D0S07a] Big Data Platforms &amp; Technologies [D0S06a] Hadoop and

Advanced Analytics in Business [D0S07a] Big Data Platforms &amp; Technologies [D0S06a] Data

Advanced Analytics in Business [D0S07a] Big Data Platforms &amp; Technologies [D0S06a] Data

Advanced Analytics in Business [D0S07a] Big Data Platforms &amp; Technologies [D0S06a] Setting

Advanced Analytics in Business [D0S07a] Big Data Platforms &amp; Technologies [D0S06a]

Advanced Analytics in Business [D0S07a] Big Data Platforms &amp; Technologies [D0S06a] Supervised

Advanced Analytics in Business [D0S07a] Big Data Platforms &amp; Technologies [D0S06a] Setting

Advanced Analytics in Business [D0S07a] Big Data Platforms &amp; Technologies [D0S06a] Text

Advanced Analytics in Business [D0S07a] Big Data Platforms &amp; Technologies [D0S06a] Model

Advanced Analytics in Business [D0S07a] Big Data Platforms &amp; Technologies [D0S06a] Social

Advanced Analytics in Business [D0S07a] Big Data Platforms &amp; Technologies [D0S06a] Model

Advanced Analytics in Business [D0S07a] Big Data Platforms &amp; Technologies [D0S06a] Artificial

Advanced Analytics in Business [D0S07a] Big Data Platforms &amp; Technologies [D0S06a] Ensemble

Advanced Analytics in Business [D0S07a] Big Data Platforms &amp; Technologies [D0S06a]

Advanced Analytics in Business [D0S07a] Big Data Platforms &amp; Technologies [D0S06a]

What Else Can Central Banks Do ? Marking the launch of the Eighteenth Geneva Report on the

9M2008 Results Announcement November 7, 2008 Scope of Presentation CEOs Report

The ECB and the Crises of the Eurozone 5 th EU Expert Discussion of the Rosa-Luxemburg Foundation

Marketing &amp; Distribution Creating Action for your LTC business by Creating Action for

Todays Agenda Reminder: Test 0 (on precalculus) is this Friday, 8/28. If you do well, you

Recommendations for Bachelor-/ Master Programs in Computer Science Hans-Ulrich Heiss Faculty IV

PERSPECTIVES FOR ITALIAN STEEL PRODUCERS, III. INTERNATIONAL STEEL TRADE DAY RE-ROLLERS AND

Learning Outcomes s for Credit it Tran ansfer fer in VET and Higher er Educat cation

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Hadoop and

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Hadoop and

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Data

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Data

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Setting

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a]

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Supervised

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Setting

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Text

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Model

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Social

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Model

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Artificial

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Ensemble

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a]

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a]

Marketing & Distribution Creating Action for your LTC business by Creating Action for