Data Mining Getting to know your data Hamid Beigy Sharif - PowerPoint PPT Presentation

  1. Data Mining: Getting to know your data. Hamid Beigy, Sharif University of Technology, Fall 1396.

  2. Table of contents: (1) Introduction; (2) Getting to know your data; (3) Statistical description of data; (4) Data visualization; (5) Reading.

  4. Data mining process. A typical knowledge discovery process runs through the following stages: databases and flat files, cleaning and integration, data warehouse, selection and transformation, data mining, patterns, evaluation and presentation, and finally knowledge.

  6. Getting to know your data. Real-world data are typically noisy, enormous in volume, and may originate from heterogeneous sources. The first step of data mining is to get to know the data. We need to know:
     - What types of attributes or fields make up the data?
     - What kind of values does each attribute have?
     - Which attributes are discrete and which are continuous-valued?
     - How are the values distributed?
     - Are there ways we can visualize the data to get a better sense of it?
     - Can we spot any outliers?
     - Can we measure the similarity of some data objects with respect to others?

  7. Attribute types.
     - Nominal attributes: The values of a nominal attribute are symbols or names of things. Each value represents some kind of category, code, or state, so nominal attributes are also referred to as categorical. The values do not have any meaningful order.
     - Binary attributes: A binary attribute is a nominal attribute with only two categories. A binary attribute is symmetric if both of its states are equally valuable and carry the same weight. It is asymmetric if the outcomes of the states are not equally important.
     - Ordinal attributes: An ordinal attribute has possible values with a meaningful order or ranking among them, but the magnitude of the difference between successive values is not known.
     - Numeric attributes: A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or real values. Numeric attributes can be interval-scaled or ratio-scaled.
     - Interval-scaled attributes are measured on a scale of equal-size units. Their values have order, and we can compare and quantify the difference between values. Temperature in Celsius or Fahrenheit is an example: neither scale has a true zero point, so ratios of values are not meaningful.
     - Ratio-scaled attributes are numeric attributes with an inherent zero point. Their values have order, we can compare and quantify the difference between values, and we can also speak of one value as a multiple of another. Temperature in Kelvin is an example, since it has a true zero point.
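
The difference between interval-scaled and ratio-scaled attributes can be illustrated with the temperature example above. The following is a minimal sketch; the two temperatures are illustrative values chosen here, not taken from the slides.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert a Celsius temperature (interval scale) to Kelvin (ratio scale)."""
    return celsius + 273.15


# Illustrative values (assumptions, not from the slides).
warm_c, cool_c = 20.0, 10.0
warm_k, cool_k = celsius_to_kelvin(warm_c), celsius_to_kelvin(cool_c)

# Differences are meaningful on both scales and agree with each other.
diff_c = warm_c - cool_c
diff_k = warm_k - cool_k

# Ratios are only meaningful on the ratio scale (true zero point).
ratio_c = warm_c / cool_c  # 2.0, yet 20 °C is not "twice as hot" as 10 °C
ratio_k = warm_k / cool_k  # roughly 1.035

print(f"difference: {diff_c:.2f} °C vs {diff_k:.2f} K")
print(f"ratio: {ratio_c:.3f} (Celsius) vs {ratio_k:.3f} (Kelvin)")
```

The Celsius ratio depends on where the scale's arbitrary zero sits, which is why only the Kelvin ratio carries physical meaning.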

  9. Statistical description of data. For data preprocessing to be successful, it is essential to have an overall picture of your data. Basic statistical descriptions can be used to identify properties of the data and to highlight which data values should be treated as noise or outliers. Three kinds of basic statistical description are measures of central tendency, measures of dispersion, and graphic displays.
     - Measures of central tendency: These measure the location of the middle or center of a data distribution. Examples are the mean, median, and mode.
       [Figure: positions of the mean, median, and mode for (a) symmetric data, (b) positively skewed data, and (c) negatively skewed data]
     - Measures of data dispersion: These measure how the data are spread out. The most common are the range, quartiles, interquartile range (IQR), five-number summary, box plots, variance, and standard deviation.
       [Figure: quartiles Q1, Q2 (median), and Q3 of a distribution, with Q1 at the 25th percentile and Q3 at the 75th percentile]
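
The measures of central tendency and dispersion named above can be computed with Python's standard `statistics` module. This is a sketch on a small hypothetical sample. Note that `statistics.quantiles` uses one particular interpolation convention, and other tools may report slightly different quartiles.

```python
import statistics

# Hypothetical sample of 12 observations (an assumption for illustration).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Measures of central tendency.
mean = statistics.mean(values)
median = statistics.median(values)
modes = statistics.multimode(values)  # every most-frequent value

# Measures of dispersion.
value_range = max(values) - min(values)
q1, q2, q3 = statistics.quantiles(values, n=4)  # quartile cut points
iqr = q3 - q1
variance = statistics.pvariance(values)  # population variance
std_dev = statistics.pstdev(values)      # population standard deviation

print(f"mean={mean}, median={median}, modes={modes}")
print(f"range={value_range}, Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"variance={variance:.2f}, std dev={std_dev:.2f}")
```

In this sample the mean lies above the median, which is what one would typically expect when a few large values stretch the right tail.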

  10. Statistical description of data (cont.). Box plots are a popular way of visualizing a distribution. A box plot incorporates the five-number summary (minimum, Q1, median, Q3, maximum).
      [Figure: box plots of unit price ($) for Branch 1, Branch 2, Branch 3, and Branch 4]
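
A five-number summary can be sketched in a few lines. The outlier check below uses the 1.5 × IQR rule of thumb, a common convention that is an assumption here and is not stated on the slide. The unit prices are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a sequence of numbers."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return ordered[0], q1, median, q3, ordered[-1]


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical unit prices for one branch (an assumption for illustration).
unit_prices = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(five_number_summary(unit_prices))
print(iqr_outliers(unit_prices))
```

In a box plot, the box spans Q1 to Q3, the line inside it marks the median, and points flagged by a rule like this are usually drawn individually beyond the whiskers.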

  11. Statistical description of data (cont.). Graphic displays of basic statistical descriptions are helpful for visual inspection of data. These include quantile plots, quantile-quantile plots, histograms, and scatter plots.
      [Figure: histogram of the count of items sold by unit price ($), with bins 40–59, 60–79, 80–99, 100–119, and 120–139]

  12. Statistical description of data (Histogram). Plotting histograms is a graphical method for summarizing the distribution of a given attribute X. If X is nominal, a bar is drawn for each value of X, and the height of the bar indicates the frequency of that value; the resulting graph is more commonly known as a bar chart. If X is numeric, the range of values for X is partitioned into disjoint consecutive subranges, called buckets or bins, and the height of each bar indicates the number of values falling into that bucket.
      [Figure: histogram of the count of items sold by unit price ($), with bins 40–59, 60–79, 80–99, 100–119, and 120–139]
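
The bucketing step for a numeric attribute, and the per-value counting for a nominal one, can be sketched as follows. The function names, prices, and bin layout are assumptions chosen to resemble the figure, not values from the slides.

```python
from collections import Counter


def histogram(values, start, width, n_bins):
    """Count values in n_bins consecutive, disjoint buckets of equal width.

    Bucket i covers [start + i * width, start + (i + 1) * width).
    Values outside the overall range are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


def bar_chart_counts(labels):
    """Frequency of each distinct value of a nominal attribute."""
    return Counter(labels)


# Hypothetical unit prices (an assumption for illustration).
prices = [42, 45, 61, 63, 78, 85, 99, 104, 118, 125, 131, 139]
counts = histogram(prices, start=40, width=20, n_bins=5)

# Text rendering: one row per bucket, one '#' per value in it.
for i, count in enumerate(counts):
    low = 40 + 20 * i
    print(f"{low}-{low + 19}: {'#' * count}")
```

Equal-width buckets are the simplest choice; the bucket width strongly affects how the distribution appears, so it is worth trying more than one.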

  13. Statistical description of data (Scatter plots). A scatter plot is a graphical method for determining whether there appears to be a relationship between two numeric attributes.
      [Figure: scatter plots (a) and (b) of items sold versus unit price ($)]
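
A scatter plot shows a relationship visually. One common numeric companion, which the slide does not mention and which is introduced here only as an illustration, is the Pearson correlation coefficient. The sketch below computes it for hypothetical price and sales figures.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (unit price, items sold) pairs, an assumption for illustration.
unit_price = [20, 40, 60, 80, 100, 120]
items_sold = [650, 540, 430, 300, 210, 90]

r = pearson_r(unit_price, items_sold)
print(f"r = {r:.3f}")
```

A value near -1 or +1 indicates a strong linear trend, and a value near 0 indicates little linear relationship. The coefficient can miss nonlinear patterns that a scatter plot would reveal.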

  15. Data visualization. Data visualization aims to communicate data clearly and effectively through graphical representation. We can take advantage of visualization techniques to discover data relationships that exist but are not easily observable by looking at the raw data. For example, a data set can be visualized using scatter plots. Some visualization techniques:
      - Pixel-oriented techniques
      - Geometric projection techniques
      - Icon-based techniques
      - Hierarchical techniques
      - Graph-based techniques

  17. Reading. Read chapter 2 of the following book: J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2012.
