Visualizing and Exploring Data Visual Methods for finding - - PDF document



SLIDE 1

Visualizing and Exploring Data

SLIDE 2

Visual Methods for finding structures in data

  • Power of human eye/brain to detect structures
    – Product of eons of evolution
  • Display data in ways that capitalize on human pattern processing abilities
  • Can find unexpected relationships
    – Limitation: very large data sets

SLIDE 3

Exploratory Data Analysis

  • Explore the data without any clear ideas of what we are looking for
  • EDA techniques are
    – Interactive
    – Visual
  • Many graphical methods for low-dimensional data
  • For higher dimensions: Principal Components Analysis

SLIDE 4

Topics in Visualization

  • 1. Summarizing Data
    – Mean, Variance, Standard Deviation, Skewness
  • 2. Tools for Single Variables
  • 3. Tools for Pairs of Variables
  • 4. Tools for Multiple Variables
  • 5. Principal Components Analysis
    – Reduced number of dimensions

SLIDE 5
1. Summarizing the data

  • Centrality (the mean)
    – Minimizes the sum of squared errors to all samples
    – If there are n data values, the mean is the value such that the sum of n copies of it equals the sum of the data values
  • Measures of Location
    – Mean is a measure of location
    – Median (value that has an equal number of points above and below)
    – Quartile (the first quartile is the value greater than a quarter of the data points)

$$\text{Mean: } \mu = \frac{1}{n} \sum_{i=1}^{n} x(i)$$
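The claim that the mean minimizes the sum of squared errors can be checked in one step. In the sketch below, a is a hypothetical candidate value for representing the data and S(a) is its sum of squared errors; neither symbol appears on the slide.

```latex
\begin{align*}
S(a) &= \sum_{i=1}^{n} \bigl(x(i) - a\bigr)^2 \\
\frac{dS}{da} &= -2 \sum_{i=1}^{n} \bigl(x(i) - a\bigr) = 0
\;\Longrightarrow\; n\,a = \sum_{i=1}^{n} x(i)
\;\Longrightarrow\; a = \frac{1}{n} \sum_{i=1}^{n} x(i) = \mu \\
\frac{d^2 S}{da^2} &= 2n > 0 \quad \text{(so the stationary point is a minimum)}
\end{align*}
```

The middle step, n a equal to the sum of the data values, is the "n copies" statement from the slide.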

SLIDE 6

Measures of Dispersion, or Variability

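$$\text{Variance: } \sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} [x(i) - \mu]^2$$

Average squared error when the mean is used to represent the data

$$\text{Standard Deviation: } \sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} [x(i) - \mu]^2}$$

$$\text{Skewness} = \frac{\sum_{i} (x(i) - \hat{\mu})^3}{\left(\sum_{i} (x(i) - \hat{\mu})^2\right)^{3/2}}$$

Measures how much the data is one-sided (single long tail); here $\hat{\mu}$ is the sample mean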
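The summary measures above can be sketched in a few lines of NumPy. The function name `summarize` and the sample values are made up for illustration. The skewness is computed as the ratio of sums shown on the slide, which differs from the more common moment-based definition by a factor of √n.

```python
import numpy as np

def summarize(x):
    """Return mean, variance, standard deviation and skewness of a 1-D sample."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mean = x.sum() / n
    dev = x - mean                                   # deviations from the mean
    variance = (dev ** 2).sum() / (n - 1)            # 1/(n-1) normalisation
    std_dev = np.sqrt(variance)
    # Ratio of sums, as on the slide (no 1/n factors)
    skewness = (dev ** 3).sum() / (dev ** 2).sum() ** 1.5
    return mean, variance, std_dev, skewness

# Made-up sample with a single long right tail
sample = [1.0, 2.0, 2.0, 3.0, 10.0]
mean, variance, std_dev, skewness = summarize(sample)
print(f"mean={mean:.3f} variance={variance:.3f} "
      f"std={std_dev:.3f} skewness={skewness:.3f}")
```

A positive skewness value indicates a long right tail, a negative one a long left tail.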

SLIDE 7
2. Tools for Displaying Single Variables

  • Basic display for univariate data is the histogram
    – Number of values of the variable that lie in consecutive intervals
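As a minimal sketch of the counting behind a histogram, the following uses `numpy.histogram` on made-up values; the choice of five bins over the range 0 to 10 is an assumption for illustration only.

```python
import numpy as np

# Made-up univariate data
values = np.array([0, 0, 0, 1, 2, 2, 5, 7, 8, 8, 9, 9, 9])

# Count how many values fall in each of 5 consecutive, equal-width intervals
counts, edges = np.histogram(values, bins=5, range=(0, 10))

# Text rendering: one row per interval (the last interval includes its right edge)
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:4.1f} to {right:4.1f}: {'#' * int(count)}")
```

Changing the number of bins changes the picture considerably, which is one motivation for smoothed density estimates.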

SLIDE 8

Histogram of supermarket credit card usage

  • Many did not use it at all
  • Others used it every week except holiday weeks

SLIDE 9

Histogram of diastolic blood pressure of individuals (UCI ML archive)

  • Zero BP means data missing

SLIDE 10

Smoothing estimates

  • Kernel Function K
  • Estimated density at point x is

$$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K\!\left(\frac{x - x(i)}{h}\right)$$

  • Gaussian Kernel with std dev h

$$K(t, h) = C\, e^{-\frac{1}{2}\left(\frac{t}{h}\right)^{2}}$$

where $t = x - x(i)$ and C is a normalizing constant
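A minimal sketch of the Gaussian kernel estimate, assuming the constant C is chosen as 1/(h√(2π)) so that the estimate integrates to 1. The function name `gaussian_kde`, the data values and the grid are made up for illustration.

```python
import numpy as np

def gaussian_kde(x_grid, data, h):
    """Kernel density estimate at each grid point, Gaussian kernel with std dev h."""
    x_grid = np.asarray(x_grid, dtype=float)
    data = np.asarray(data, dtype=float)
    t = x_grid[:, None] - data[None, :]        # t = x - x(i) for every (x, i) pair
    c = 1.0 / (h * np.sqrt(2.0 * np.pi))       # assumed normalizing constant C
    kernel = c * np.exp(-0.5 * (t / h) ** 2)
    return kernel.mean(axis=1)                 # (1/n) * sum over the data points

# Made-up data and evaluation grid
data = np.array([1.0, 1.5, 2.0, 6.0])
grid = np.linspace(-10.0, 20.0, 3001)

spiky = gaussian_kde(grid, data, h=0.1)    # small h: one narrow bump per point
smooth = gaussian_kde(grid, data, h=1.0)   # larger h: bumps merge together

dx = grid[1] - grid[0]
print("area under smooth estimate:", smooth.sum() * dx)
print("peak heights:", spiky.max(), smooth.max())
```

The two calls differ only in h, which is the single parameter controlling how much the estimate is smoothed.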

SLIDE 11

Kernel estimates with different values of h

  • Small values lead to spiky estimates
  • Larger values give higher smoothing
  • Data is right skewed with a hint of multimodality

SLIDE 12
3. Tools for Displaying Relationship between two variables

  • Box Plots
  • Scatter Plots
  • Contour Plots
  • Time as one of the two variables
SLIDE 13

Box Plot

Figure labels:

  • Upper Quartile
  • Median
  • Lower Quartile
  • 1.5 times inter-quartile range
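The quantities labeled on a box plot can be computed directly. This is a sketch under an assumption: it takes "1.5 times inter-quartile range" to mean the common convention of fences placed 1.5 × IQR beyond the lower and upper quartiles, which the slide does not spell out. The function name `box_plot_stats` and the sample are made up.

```python
import numpy as np

def box_plot_stats(x):
    """Median, quartiles, and 1.5 * IQR fences for a 1-D sample."""
    x = np.asarray(x, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1                              # inter-quartile range
    lower_fence = q1 - 1.5 * iqr               # assumed fence convention
    upper_fence = q3 + 1.5 * iqr
    outliers = x[(x < lower_fence) | (x > upper_fence)]
    return {
        "lower_quartile": q1,
        "median": median,
        "upper_quartile": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }

# Made-up sample with one extreme value
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
for name, value in stats.items():
    print(name, value)
```

Computing these statistics per group gives the numbers behind side-by-side box plots of one variable across categories.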

SLIDE 14

[Figure: Multiple Variables, with groups labeled Healthy and Diabetic]

SLIDE 15

Scatterplot

  • Credit card repayment data
  • Highly correlated data
  • A significant number of points depart from the pattern: worth investigating
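One way to follow up on a scatterplot like this is to measure the correlation and then list the points that depart from the linear pattern. The sketch below uses made-up data (not the credit card data), a least-squares line from `numpy.polyfit`, and a robust cutoff of five scaled median absolute deviations; the cutoff and the technique are assumptions, not something the slide prescribes.

```python
import numpy as np

# Made-up, strongly related data with three points moved off the pattern
x = np.arange(50, dtype=float)
y = 2.0 * x + 1.0 + np.sin(x)
y[[5, 20, 40]] += 30.0

# Strength of the linear relationship
r = np.corrcoef(x, y)[0, 1]

# Fit a straight line and look at what it leaves unexplained
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Robust spread of the residuals: scaled median absolute deviation
centered = residuals - np.median(residuals)
mad = np.median(np.abs(centered))
threshold = 5.0 * 1.4826 * mad                 # assumed cutoff

flagged = np.flatnonzero(np.abs(centered) > threshold)
print("correlation:", round(r, 3))
print("indices departing from the pattern:", flagged)
```

The median-based spread is used so that the departing points themselves do not inflate the cutoff.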