Essential Data Preparation, Descriptive Statistics and Visualizations
with Examples for
Slides by Michael Hahsler
Purpose
1) "Get used to the data."
2) Clean the data (e.g., find missing values, outliers, mistakes)
3) "Make
– http://michael.hahsler.net/SMU/EMIS3309/data/rapidminer/Basic_Statistics_and_Visualizations.rmp
– http://michael.hahsler.net/SMU/EMIS3309/data/rapidminer/Cleaning_and_preprocessing.rmp
Categorical (RM: polynominal)
Quantitative – ratio (RM: integer or real)
Figure annotations: Features/Attributes, Observations, "should not be integer!"
– min
– 1st quartile
– Median
– Mean
– 3rd quartile
– max
Descriptive Stats Visualization
RapidMiner gives you this for Population per Zipcode:
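Outside RapidMiner, the same summary can be sketched in Python with pandas. This is only an illustration: the population values below are made up, and pandas interpolates quartiles linearly by default, which may differ slightly from what RapidMiner reports.

```python
import pandas as pd

# Hypothetical population counts per zip code (made-up values, not the course data).
population = pd.Series([0, 1200, 3400, 5600, 8900, 15000, 42000], name="population")

# The statistics listed on the slide, computed one by one.
summary = {
    "min": population.min(),
    "1st quartile": population.quantile(0.25),  # linear interpolation (pandas default)
    "median": population.median(),
    "mean": population.mean(),
    "3rd quartile": population.quantile(0.75),
    "max": population.max(),
}

for name, value in summary.items():
    print(f"{name:>12}: {value:,.1f}")
```

A mean far above the median, as in this toy data, hints at a right-skewed distribution that is worth checking in a histogram.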
RapidMiner Operators in Cleansing
Set a higher number of bins. What do the spikes at 0 and 200,000 for median family income mean? What should we do?
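A minimal sketch of the binning effect in Python with NumPy, using made-up incomes. The slide leaves the meaning of the spikes as a question. The cleaning step here assumes, purely for illustration, that 0 encodes a missing value and 200,000 is a reporting cap.

```python
import numpy as np

# Hypothetical median family incomes; the 0 and 200,000 entries mimic the spikes.
income = np.array([0, 0, 0, 38000, 45500, 52000, 61000, 75000, 98000,
                   120000, 200000, 200000], dtype=float)

# With few bins the spikes blend into their neighbours;
# with more bins they stand out in the first and last bin.
coarse_counts, _ = np.histogram(income, bins=4, range=(0, 200000))
fine_counts, fine_edges = np.histogram(income, bins=20, range=(0, 200000))
print("4 bins: ", coarse_counts)
print("20 bins:", fine_counts)

# Count exact hits on the two suspicious values.
n_zero = int((income == 0).sum())
n_cap = int((income == 200000).sum())
print(f"exactly 0: {n_zero}, exactly 200,000: {n_cap}")

# ASSUMPTION: treat both values as missing before computing statistics.
cleaned = np.where((income == 0) | (income == 200000), np.nan, income)
print("median without the spikes:", np.nanmedian(cleaned))
```

Whether to drop, impute, or keep such values depends on what they turn out to mean in the actual data set.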
RapidMiner Operators in Blending and Cleansing
– Feature Selection (select attributes)
– Feature Generation (generate aggregation, etc.)
– Normalization (e.g., z-score)
– Discretization (binning)
– Manipulate values (map, replace)
– Sampling/Filtering examples
– Grouping and aggregation
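For comparison, a few of these steps sketched in Python with pandas rather than RapidMiner operators. The column names, values, and bin edges are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data set (made-up values).
df = pd.DataFrame({
    "zipcode": ["75205", "75206", "75225", "75230"],
    "income": [40000.0, 60000.0, 80000.0, 100000.0],
    "population": [10000, 20000, 30000, 40000],
})

# Feature selection: keep only the attributes of interest.
selected = df[["zipcode", "income"]].copy()

# Normalization (z-score): subtract the mean, divide by the sample standard deviation.
selected["income_z"] = (
    (selected["income"] - selected["income"].mean()) / selected["income"].std()
)

# Discretization (binning) with explicit, hypothetical bin edges.
selected["income_bin"] = pd.cut(
    selected["income"],
    bins=[0, 50000, 90000, float("inf")],
    labels=["low", "medium", "high"],
)

# Filtering examples: keep rows above the average income.
above_average = selected[selected["income_z"] > 0]

print(selected)
print(above_average)
```

After z-scoring, the column has mean 0 and standard deviation 1, so attributes measured on different scales become comparable.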
RapidMiner: Use Correlation Matrix node
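A rough Python equivalent, assuming pandas and made-up columns: `DataFrame.corr()` returns the pairwise Pearson correlations of all numeric attributes.

```python
import pandas as pd

# Hypothetical quantitative attributes (made-up values).
df = pd.DataFrame({
    "income": [30.0, 45.0, 60.0, 75.0, 90.0],
    "home_value": [100.0, 150.0, 200.0, 250.0, 300.0],  # exactly linear in income
    "vacancy_rate": [9.0, 8.0, 6.0, 5.0, 2.0],          # falls as income rises
})

# Pairwise Pearson correlation (the pandas default method).
corr = df.corr()
print(corr.round(2))
```

The matrix is symmetric with ones on the diagonal, so only the upper or lower triangle needs to be read.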
RapidMiner: Use Aggregate and Pivot nodes
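Aggregate followed by pivot is one way to build a contingency table. That this is what the slide intends is an assumption based on the summary table of statistics. A sketch of the same two steps in pandas, with hypothetical columns:

```python
import pandas as pd

# Hypothetical categorical attributes (made-up values).
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south", "north"],
    "income_level": ["low", "high", "low", "low", "high", "low"],
})

# Step 1 (aggregate): count the rows for each combination of categories.
counts = df.groupby(["region", "income_level"]).size().reset_index(name="n")

# Step 2 (pivot): spread one category across the columns.
table = (
    counts.pivot(index="region", columns="income_level", values="n")
    .fillna(0)
    .astype(int)
)
print(table)

# pandas also offers both steps in a single call.
shortcut = pd.crosstab(df["region"], df["income_level"])
```

Combinations that never occur are missing after the pivot, which is why the sketch fills them with 0.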
RapidMiner: Use Aggregate node
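Group-wise statistics for one categorical and one quantitative attribute can be sketched in pandas as a group-by followed by an aggregation. The data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical mixed data: one categorical and one quantitative attribute.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "income": [40000.0, 60000.0, 30000.0, 50000.0],
})

# Group by the categorical attribute, then aggregate the quantitative one.
stats = df.groupby("region")["income"].agg(["count", "mean", "median"])
print(stats)
```

The resulting table has one row per group and can be shown as a bar chart of statistics.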
Single Variable - Explore the distribution
                         Statistics                              Visualization
Categorical Variable     Counts                                  Bar chart
Quantitative Variable    5-number summary                        Histogram
Two Variables - Explore the relationship
                         Statistics                              Visualization
Categorical Variables    Contingency table (Cross tabulation)    Grouped bar chart
Quantitative Variables   Correlation                             Scatter plot
Mixed Variables          Group-wise statistics (e.g., average)   Box plot; Bar chart of statistics
3+ Variables
Break it down into pairwise statistics or plots, e.g., correlation matrix, scatter plot matrix, box plot.