Visualizing and Exploring Data: Visual Methods for Finding Structures in Data


  1. Visualizing and Exploring Data

  2. Visual Methods for finding structures in data
     • Power of human eye/brain to detect structures
       – Product of eons of evolution
     • Display data in ways that capitalize on human pattern processing abilities
     • Can find unexpected relationships
       – Limitation: very large data sets

  3. Exploratory Data Analysis
     • Explore the data without any clear ideas of what we are looking for
     • EDA techniques are
       – Interactive
       – Visual
     • Many graphical methods for low-dimensional data
     • For higher dimensions: Principal Components Analysis

  4. Topics in Visualization
     1. Summarizing Data: Mean, Variance, Standard Deviation, Skewness
     2. Tools for Single Variables
     3. Tools for Pairs of Variables
     4. Tools for Multiple Variables
     5. Principal Components Analysis
        – Reduced number of dimensions

  5. 1. Summarizing the data
     • Mean: $\mu = \frac{1}{n} \sum_{i=1}^{n} x(i)$
     • Centrality
       – The mean minimizes the sum of squared errors to all samples
       – If there are n data values, the mean is the value such that the sum of n copies of it equals the sum of the data values
     • Measures of Location
       – Mean is a measure of location
       – Median (value that has an equal number of points above and below)
       – Quartile (the lower quartile is the value greater than a quarter of the data points)
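The location measures on this slide can be checked numerically. The following is a minimal sketch using only the Python standard library; the sample values and the helper name `sum_squared_errors` are hypothetical, and the `"inclusive"` quantile method is just one of several common quartile conventions.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0]

n = len(data)
mean = sum(data) / n

# Centrality: n copies of the mean add up to the sum of the data values.
total_from_mean = n * mean

def sum_squared_errors(center, values):
    """Sum of squared distances from `center` to every value."""
    return sum((x - center) ** 2 for x in values)

median = statistics.median(data)

# With n=4, statistics.quantiles returns the three quartile cut points.
lower_q, middle_q, upper_q = statistics.quantiles(data, n=4, method="inclusive")

print("mean:", mean, "median:", median)
print("lower quartile:", lower_q, "upper quartile:", upper_q)
print("SSE at mean:", sum_squared_errors(mean, data))
print("SSE at median:", sum_squared_errors(median, data))
```

For this sample the squared-error sum is smaller at the mean than at the median, which is the minimizing property stated under Centrality.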

  6. Measures of Dispersion, or Variability
     • Variance: $\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} [x(i) - \mu]^2$
       – Average squared error in the mean representing the data
     • Standard Deviation: $\sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} [x(i) - \mu]^2}$
     • Skewness: $\frac{\sum (x(i) - \hat{\mu})^3}{\left( \sum (x(i) - \hat{\mu})^2 \right)^{3/2}}$
       – Measures how much the data is one-sided (single long tail)
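As a rough illustration of these dispersion measures, the sketch below computes them directly from their definitions. The function names and sample values are hypothetical. The skewness follows the ratio written on the slide, which differs from the more common moment-based coefficient by a factor of $\sqrt{n}$.

```python
import math

def variance(values):
    """Sample variance with the n - 1 denominator."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / (n - 1)

def std_dev(values):
    """Standard deviation: square root of the sample variance."""
    return math.sqrt(variance(values))

def skewness(values):
    """Sum of cubed deviations over (sum of squared deviations)^(3/2)."""
    mu = sum(values) / len(values)
    cubed = sum((x - mu) ** 3 for x in values)
    squared = sum((x - mu) ** 2 for x in values)
    return cubed / squared ** 1.5

# Hypothetical data with a single long right tail.
right_tailed = [1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 14.0]
# Hypothetical data that is symmetric about its mean.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]

print("variance:", variance(right_tailed))
print("std dev:", std_dev(right_tailed))
print("skewness (right tail):", skewness(right_tailed))
print("skewness (symmetric):", skewness(symmetric))
```

A positive value indicates a long right tail, and a symmetric sample gives a value of zero.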

  7. 2. Tools for Displaying Single Variables
     • Basic display for univariate data is the histogram
       – Number of values of the variable that lie in consecutive intervals
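The counting behind a histogram can be sketched in a few lines. This is a minimal standard-library version under stated assumptions: equal-width intervals, a hypothetical function name `histogram_counts`, made-up data, and the convention that the top edge falls in the last interval.

```python
def histogram_counts(values, low, high, n_bins):
    """Count how many values fall in each of n_bins equal-width intervals."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if v < low or v > high:
            continue  # ignore values outside the plotted range
        # Values equal to `high` are placed in the last interval.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical univariate data.
data = [0, 1, 1, 2, 3, 3, 3, 7, 8, 10]
counts = histogram_counts(data, low=0, high=10, n_bins=5)

# Crude text rendering: one '#' per observation in each interval.
for i, count in enumerate(counts):
    print(f"[{2 * i:>2}, {2 * i + 2:>2}) {'#' * count}")
```

The choice of interval width matters in practice: intervals that are too wide hide structure, and intervals that are too narrow produce a noisy display.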

  8. Histogram of supermarket credit card usage. Figure annotations: "Many did not use it at all"; "These used it every week except holiday weeks".

  9. Histogram of diastolic blood pressure of individuals (UCI ML archive). Zero BP means data missing.

  10. Smoothing estimates
      • Kernel Function K
      • Estimated density at point x is $\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K\left(\frac{x - x(i)}{h}\right)$
      • Gaussian Kernel with std dev h: $K(t, h) = C e^{-\frac{1}{2} \left(\frac{t}{h}\right)^2}$, where $t = x - x(i)$
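A kernel estimate of this form can be sketched with the standard library alone. The slide leaves the constant C unspecified, so the code assumes $C = 1 / (h \sqrt{2\pi})$, which makes each kernel integrate to one. The function names and sample values are hypothetical.

```python
import math

def gaussian_kernel(t, h):
    """Gaussian kernel K(t, h) = C * exp(-0.5 * (t / h)^2).

    Assumption: C = 1 / (h * sqrt(2 * pi)), so the kernel integrates to 1.
    """
    c = 1.0 / (h * math.sqrt(2.0 * math.pi))
    return c * math.exp(-0.5 * (t / h) ** 2)

def kernel_density(x, sample, h):
    """Estimated density at x: the average of kernels centred on each x(i)."""
    return sum(gaussian_kernel(x - xi, h) for xi in sample) / len(sample)

# Hypothetical sample with a cluster near 2 and one distant point.
sample = [1.0, 2.0, 2.5, 6.0]

for h in (0.1, 0.5, 2.0):
    print(f"h={h}: estimate at x=2.0 is {kernel_density(2.0, sample, h):.4f}")
```

Smaller values of h give narrow, tall bumps around each data point, while larger values spread each point's contribution and flatten the estimate.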

  11. Kernel Estimates with different values of h. Figure annotations: small values lead to spiky estimates; data is right skewed with a hint of multimodality; higher smoothing.

  12. 3. Tools for Displaying Relationship between two variables
      • Box Plots
      • Scatter Plots
      • Contour Plots
      • Time as one of the two variables

  13. Box Plot. Figure labels: 1.5 times inter-quartile range; Upper Quartile; Median; Lower Quartile.
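The quantities labelled on a box plot can be computed as follows. This sketch assumes the common convention that whiskers extend to the most extreme data values within 1.5 times the inter-quartile range of the box, with anything beyond treated as an outlier; the slide only names the 1.5 factor. The function name, the data, and the `"inclusive"` quartile method are illustrative choices.

```python
import statistics

def box_plot_summary(values):
    """Return the quartiles, whisker ends and outliers for a box plot."""
    lower_q, median, upper_q = statistics.quantiles(
        values, n=4, method="inclusive"
    )
    iqr = upper_q - lower_q
    # Assumed convention: fences sit 1.5 * IQR beyond the box edges.
    low_fence = lower_q - 1.5 * iqr
    high_fence = upper_q + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "lower_quartile": lower_q,
        "median": median,
        "upper_quartile": upper_q,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }

# Hypothetical data with one unusually large value.
summary = box_plot_summary([2, 4, 4, 5, 7, 9, 11, 30])
for name, value in summary.items():
    print(f"{name}: {value}")
```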

  14. Multiple Variables. Figure labels: Diabetic; Healthy.

  15. Scatterplot of credit card repayment data. Highly correlated data. A significant number of points depart from the pattern: worth investigating.
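One way to follow up on a scatterplot like this is to quantify the correlation and list the points that depart from the overall pattern. The sketch below is not from the slides: it swaps in a least-squares line and flags points whose residual exceeds twice the residual standard deviation, which is an illustrative threshold. The data are made up and do not represent the credit card repayment data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def departing_points(xs, ys, factor=2.0):
    """Points whose least-squares residual exceeds factor * residual std."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    spread = math.sqrt(sum(r ** 2 for r in residuals) / n)
    return [(x, y) for x, y, r in zip(xs, ys, residuals)
            if abs(r) > factor * spread]

# Hypothetical data: a near-linear trend with one point off the pattern.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.1, 2.0, 2.9, 4.2, 5.0, 1.0, 7.1, 8.0]

flagged = departing_points(xs, ys)
print("correlation:", round(pearson_r(xs, ys), 3))
print("points worth investigating:", flagged)
```

Flagged points are candidates for inspection rather than errors to discard; the scatterplot itself remains the primary tool for judging the pattern.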
