Essential Data Preparation, Descriptive Statistics and - - PowerPoint PPT Presentation



SLIDE 1

Essential Data Preparation, Descriptive Statistics and Visualizations

with Examples for

Slides by Michael Hahsler

SLIDE 2

Purpose

1) "Get used to the data."
2) Clean the data (e.g., find missing values, outliers, mistakes).
3) "Make sure the data makes sense."
4) Find simple relationships between variables.
5) Prepare data for predictive/prescriptive modeling.

SLIDE 3

RapidMiner

  • Install RapidMiner Studio and obtain an educational license (see course website).

  • The dataset for the examples can be obtained from
    http://michael.hahsler.net/SMU/EMIS3309/data/census.csv

  • RapidMiner processes for this slide set are available here (save and import the process in RapidMiner):
    http://michael.hahsler.net/SMU/EMIS3309/data/rapidminer/Basic_Statistics_and_Visualizations.rmp
    http://michael.hahsler.net/SMU/EMIS3309/data/rapidminer/Cleaning_and_preprocessing.rmp

SLIDE 4

Importing Data

For the examples we use a dataset with census data at the ZIP-code level (data and processes can be found on the class website).

Annotations on the data table:

  • Features/Attributes (columns) and Observations (rows)
  • Categorical (RM: polynomial)
  • Quantitative – ratio (RM: integer or real)
  • Annotation: should not be integer!

SLIDE 5

Single Variable - Quantitative

Descriptive Stats

5-Number Summary (plus the mean):

  • min
  • 1st quartile
  • median
  • mean
  • 3rd quartile
  • max

Visualization

Histogram to show the distribution

RapidMiner gives you this for Population per Zipcode: [screenshot not included]
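As a rough illustration outside RapidMiner, the summary above can be sketched in plain Python. The population values are made up for illustration (they are not taken from census.csv), the function name is hypothetical, and the quartile convention (`method="inclusive"`) is an assumption; other tools may interpolate quartiles differently.

```python
import statistics

def five_number_summary(values):
    """Return min, Q1, median, Q3 and max (plus the mean) for a list of numbers."""
    data = sorted(values)
    # "inclusive" interpolates between data points and never leaves the
    # observed range; this is one of several common quartile conventions.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "mean": statistics.mean(data),
        "q3": q3,
        "max": data[-1],
    }

# Hypothetical population counts for a handful of ZIP codes.
population = [0, 1200, 3400, 5600, 8900, 15000, 27000, 41000, 62000]
summary = five_number_summary(population)
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

A mean well above the median, as in this made-up sample, hints at a right-skewed distribution, which is worth confirming in the histogram.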

SLIDE 6

Single Variable - Categorical

Descriptive Stats

  • Count table

Visualization

  • Bar chart for counts
  • Pie chart (not ideal for more than a few groups)
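A count table is simple to sketch in plain Python. The state labels below are hypothetical stand-ins for a categorical attribute, not values from census.csv, and the text bars are only a crude substitute for a real bar chart.

```python
from collections import Counter

# Hypothetical state labels, one per ZIP-code record.
states = ["TX", "CA", "TX", "NY", "CA", "TX", "NY", "TX"]

counts = Counter(states)
total = sum(counts.values())

# Count table, most frequent category first, with a text bar per category.
for category, count in counts.most_common():
    share = count / total
    print(f"{category:<4}{count:>3}  {share:6.1%}  {'#' * count}")
```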

SLIDE 7

Data Cleaning

– Missing values?

Is this the result of reading the data? Are missing values read in correctly (or are there placeholder values like 99, 'N/A' or '.' stored as text)? Do we have to impute the missing values?

– Outliers and strange values

Identify them in histograms and scatter plots. Examples: many zeros, or a weird visual pattern. They might be the result of data collection. Needs investigation!

– Duplicates: Are these a data problem?

– Dates: Make sure that these are read in correctly!

RapidMiner Operators in Cleansing
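The checks for missing-value placeholders and duplicates can be sketched in plain Python as follows. The records, column names and the set of placeholder strings are assumptions made for this example; in particular, 99 only counts as "missing" if the data documentation says so.

```python
import csv
import io

# Placeholder strings treated as "missing" (an assumption for this sketch).
MISSING_MARKERS = {"", "N/A", "NA", ".", "99"}

# Hypothetical records, not taken from census.csv.
RAW = """zip,state,median_age
75205,TX,34.5
75206,TX,N/A
90210,CA,.
75205,TX,34.5
10001,NY,99
"""

def parse_number(text, missing_markers=MISSING_MARKERS):
    """Return a float, or None if the text is a missing-value placeholder."""
    text = text.strip()
    if text in missing_markers:
        return None
    return float(text)

# csv.DictReader keeps every field as text, so ZIP codes stay strings.
rows = list(csv.DictReader(io.StringIO(RAW)))
ages = [parse_number(row["median_age"]) for row in rows]
n_missing = sum(age is None for age in ages)

# Duplicates: records whose fields all match an earlier record.
seen = set()
duplicates = []
for row in rows:
    key = tuple(row.values())
    if key in seen:
        duplicates.append(row)
    else:
        seen.add(key)

print(f"{n_missing} of {len(ages)} median_age values are missing")
print(f"{len(duplicates)} duplicate record(s): {duplicates}")
```

Whether a duplicate is an error or a legitimate repeated observation still has to be decided by looking at how the data was collected.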

SLIDE 8

Data Cleaning

Set a higher number of bins. What do the spikes at 0 and 200,000 for median family income mean? What should we do?
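To see why the number of bins matters, here is a minimal sketch in plain Python. The income values are invented for illustration (they are not from census.csv) and `histogram_counts` is a hypothetical helper; it assumes every value lies between `low` and `high`.

```python
def histogram_counts(values, n_bins, low, high):
    """Count values in n_bins equal-width bins over [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical median family incomes with several records at exactly 0
# and at exactly 200,000.
income = [0, 0, 0, 31000, 42000, 48000, 55000, 61000, 67000, 90000,
          120000, 200000, 200000, 200000]

coarse = histogram_counts(income, n_bins=2, low=0, high=200000)
fine = histogram_counts(income, n_bins=20, low=0, high=200000)

print("2 bins: ", coarse)
print("20 bins:", fine)
```

With two bins the records at the extremes blend into their neighbors; with twenty bins the first and last bins stand out as separate spikes that call for investigation.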

SLIDE 9

Data Transformation

Data often needs to be transformed to be useful in a model.

  • Features

– Feature Selection (select attributes)
– Feature Generation (generate aggregation, etc.)
– Normalization (e.g., z-score)
– Discretization (binning)
– Manipulate values (map, replace)

  • Observations

– Sampling/Filtering examples
– Grouping and aggregation

RapidMiner Operators in Blending and Cleansing
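Two of the feature transformations listed above, z-score normalization and discretization, can be sketched in plain Python. The data, bin edges and labels are hypothetical, and the sample standard deviation is used here as an assumption; a tool may use the population standard deviation instead.

```python
import statistics

def z_scores(values):
    """Standardize values to mean 0 and (sample) standard deviation 1."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [(v - mean) / stdev for v in values]

def discretize(value, edges, labels):
    """Map a value to the label of the first bin whose upper edge exceeds it."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    # Values at or above the last edge fall into the final bin.
    return labels[-1]

# Hypothetical population counts per ZIP code.
population = [1200, 3400, 5600, 8900, 15000]
standardized = z_scores(population)

EDGES = [5000, 10000]                   # upper edges of the first two bins
LABELS = ["small", "medium", "large"]   # one more label than edges
size_class = [discretize(p, EDGES, LABELS) for p in population]

for p, z, c in zip(population, standardized, size_class):
    print(f"{p:>6}  z = {z:+.2f}  {c}")
```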

SLIDE 10

Two Variables - Quantitative

Descriptive Stats

  • Correlation

Example: population and # of housing units per zipcode have a (Pearson) correlation coefficient of 0.975.

Visualization

  • Scatterplot

RapidMiner: Use the Correlation Matrix operator
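For reference, the Pearson correlation coefficient can be computed by hand as in the sketch below. The two lists are invented values for illustration, so the result will not match the 0.975 reported for the census data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical population and housing-unit counts for five ZIP codes.
population = [1200, 3400, 5600, 8900, 15000]
housing_units = [500, 1500, 2100, 3800, 6100]

r = pearson(population, housing_units)
print(f"Pearson r = {r:.3f}")
```

Values near +1 or -1 indicate a strong linear relationship; a scatterplot is still needed to check that the relationship is actually linear.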

SLIDE 11

Two Variables - Categorical

Descriptive Stats

  • Cross-tabulation (i.e., contingency table)

Visualization

  • Grouped bar charts
  • Mosaic plot (not very popular)

RapidMiner: Use the Aggregate and Pivot operators
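A cross-tabulation is a count of how often each pair of categories occurs together. The sketch below builds one in plain Python; the two categorical attributes (state and an urban/rural label) are hypothetical and not taken from census.csv.

```python
from collections import Counter

# Hypothetical (state, area type) pairs, one per record.
records = [
    ("TX", "urban"), ("TX", "rural"), ("TX", "urban"),
    ("CA", "urban"), ("CA", "urban"), ("NY", "rural"),
]

# Count each combination; combinations that never occur count as 0.
table = Counter(records)
row_labels = sorted({state for state, _ in records})
col_labels = sorted({area for _, area in records})

# Print the contingency table with one row per state.
print(f"{'':<6}" + "".join(f"{c:>8}" for c in col_labels))
for r in row_labels:
    print(f"{r:<6}" + "".join(f"{table[(r, c)]:>8}" for c in col_labels))
```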

SLIDE 12

Two Variables - Mixed

Descriptive Stats

  • Compare the 5-number summary grouped by a categorical variable.

Visualization

  • Bar chart of an individual statistic to compare groups
  • Box plot

RapidMiner: Use the Aggregate operator
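Group-wise statistics amount to splitting the quantitative values by category and summarizing each group separately. A minimal sketch in plain Python, using made-up (state, median age) pairs rather than the census data:

```python
import statistics
from collections import defaultdict

# Hypothetical (state, median age) pairs, one per ZIP code.
records = [
    ("TX", 34.5), ("TX", 31.0), ("TX", 38.5),
    ("CA", 36.0), ("CA", 40.0),
    ("NY", 37.0), ("NY", 39.0), ("NY", 44.0),
]

# Split the quantitative values by the categorical variable.
groups = defaultdict(list)
for state, age in records:
    groups[state].append(age)

# Summarize each group separately.
group_stats = {
    state: {
        "n": len(values),
        "min": min(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "max": max(values),
    }
    for state, values in groups.items()
}

for state, stats in sorted(group_stats.items()):
    print(state, stats)
```

The per-group minimum, median and maximum are the same numbers a box plot would draw side by side for each group.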

SLIDE 13

Multiple Variables

Multiple variables are usually broken down into pairwise comparisons.

  • Correlation matrix
  • Scatterplot matrix
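The pairwise breakdown can be sketched in plain Python by computing one correlation per pair of attributes. The three columns below are hypothetical values, not from census.csv, and `pearson` is a hand-written helper for this example.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical quantitative attributes for five ZIP codes.
columns = {
    "population": [1200, 3400, 5600, 8900, 15000],
    "housing_units": [500, 1500, 2100, 3800, 6100],
    "median_age": [41.0, 38.5, 36.0, 33.5, 30.0],
}

# One entry per unordered pair of attributes: the upper triangle of the
# correlation matrix (the diagonal is always 1 and is omitted).
pairs = {
    (a, b): pearson(columns[a], columns[b])
    for a, b in combinations(columns, 2)
}

for (a, b), r in pairs.items():
    print(f"{a:>14} vs {b:<14} r = {r:+.3f}")
```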

SLIDE 14

Multiple Variables (cont.)

Comparing multiple quantitative variables (or comparing a single quantitative variable between groups defined by a categorical variable).

  • Tables with group-wise statistics, or
  • Boxplot (called Quartile in RapidMiner)

SLIDE 15

Basic Descriptive Statistics and Data Visualization Cheat Sheet

Single Variable - Explore the distribution

                         Statistics          Visualization
Categorical Variable     Counts              Bar chart
Quantitative Variable    5-number summary    Histogram

Two Variables - Explore the relationship

                         Statistics                               Visualization
Categorical Variables    Contingency table (Cross tabulation)     Grouped bar chart
Quantitative Variables   Correlation                              Scatter plot
Mixed Variables          Group-wise statistics (e.g., average)    Box plot, Bar chart of statistics

3+ Variables

Break it down into pairwise statistics or plots. E.g., correlation matrix, scatter plot matrix, box plot.