Data Mining Getting to know your data Hamid Beigy Sharif - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Data Mining

Getting to know your data Hamid Beigy

Sharif University of Technology

Fall 1396

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 12

slide-2
SLIDE 2

Table of contents

1. Introduction
2. Getting to know your data
3. Statistical description of data
4. Data visualization
5. Reading

slide-4
SLIDE 4

Data mining process

A typical knowledge discovery process is:

[Figure: knowledge discovery process. Flat files and databases → cleaning and integration → data warehouse → selection and transformation → data mining → patterns → evaluation and presentation → knowledge.]

slide-6
SLIDE 6

Getting to know your data

Real-world data are typically noisy, enormous in volume, and may originate from heterogeneous sources. The first step of data mining is to get to know the data. We need to know:

  • What types of attributes or fields make up the data?
  • What kind of values does each attribute have?
  • Which attributes are discrete and which are continuous-valued?
  • How are the values distributed?
  • Are there ways we can visualize the data to get a better sense of it?
  • Can we spot any outliers?
  • Can we measure the similarity of some data objects with respect to others?


slide-7
SLIDE 7

Attribute types

Nominal attributes: The values of a nominal attribute are symbols or names of things. Each value represents some kind of category, code, or state, so nominal attributes are also referred to as categorical. The values do not have any meaningful order.

Binary attributes: A binary attribute is a nominal attribute with only two categories. It is symmetric if both of its states are equally valuable and carry the same weight, and asymmetric if the outcomes of the states are not equally important.

Ordinal attributes: An ordinal attribute has possible values with a meaningful order or ranking among them, but the magnitude between successive values is not known.

Numeric attributes: A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or real values. Numeric attributes can be interval-scaled or ratio-scaled.

  • Interval-scaled attributes are measured on a scale of equal-size units. Their values have order, and we can compare and quantify the difference between values, but there is no true zero-point. Temperatures in Celsius or Fahrenheit are an example.
  • Ratio-scaled attributes have an inherent zero-point. Their values have order, and we can both quantify the difference between values and express one value as a multiple (ratio) of another. Temperature in Kelvin is an example, since it has a true zero-point.
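The difference between interval-scaled and ratio-scaled attributes can be checked with a small calculation. The sketch below uses two hypothetical temperature readings (not from the slides): differences agree on both scales, but only the Kelvin ratio is meaningful.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert a Celsius reading (interval scale) to Kelvin (ratio scale)."""
    return celsius + 273.15


# Hypothetical readings, chosen only for illustration.
a_c, b_c = 10.0, 20.0
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences are meaningful on both scales and agree with each other.
diff_c = b_c - a_c
diff_k = b_k - a_k

# A ratio is only meaningful when the scale has a true zero-point.
ratio_c = b_c / a_c  # equals 2, but 20 °C is not "twice as hot" as 10 °C
ratio_k = b_k / a_k  # close to 1.035: the physically meaningful ratio

print(f"difference: {diff_c:.2f} °C = {diff_k:.2f} K")
print(f"ratio in Celsius: {ratio_c:.3f}, ratio in Kelvin: {ratio_k:.3f}")
```

The Celsius ratio changes if the zero of the scale is moved (for example, to Fahrenheit), which is why ratios are not used on interval scales.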

slide-9
SLIDE 9

Statistical description of data

For data preprocessing to be successful, it is essential to have an overall picture of your data. Basic statistical descriptions can be used to identify properties of the data and highlight which data values should be treated as noise or outliers. Three basic statistical descriptions are measures of central tendency, measures of data dispersion, and graphic displays.

Measures of central tendency: These measure the location of the middle or center of a data distribution. Examples are the mean, median, and mode.

[Figure: positions of the mean, median, and mode for (a) symmetric data, (b) positively skewed data, and (c) negatively skewed data.]
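The three measures of central tendency can be computed with Python's standard `statistics` module. This is a minimal sketch on a hypothetical list of unit prices (the values are an assumption for illustration, not data from the slides). The sample is positively skewed, so the mode lies below the median, which lies below the mean.

```python
import statistics

# Hypothetical unit prices, chosen only for illustration.
prices = [40, 43, 47, 47, 52, 56, 60, 63, 70, 110]

mean = statistics.mean(prices)      # arithmetic average
median = statistics.median(prices)  # middle value (average of the two middle values here)
mode = statistics.mode(prices)      # most frequent value

print(f"mean={mean}, median={median}, mode={mode}")

# For positively skewed data the large values pull the mean to the right.
is_positively_skewed = mode < median < mean
print("mode < median < mean:", is_positively_skewed)
```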

Measuring the dispersion of data: These measures describe how spread out the data are. The most common are the range, quartiles, interquartile range (IQR), five-number summary (displayed as a box plot), variance, and standard deviation.

[Figure: quartiles of a distribution: Q1 (25th percentile), Q2 (median), and Q3 (75th percentile).]
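The dispersion measures listed above can be sketched with the standard `statistics` module. The data are hypothetical, and two choices here are assumptions: the quartiles use the module's default "exclusive" method (other quartile conventions give slightly different values), and the variance is the population variance (dividing by N).

```python
import statistics

# Hypothetical unit prices, chosen only for illustration.
prices = [40, 43, 47, 47, 52, 56, 60, 63, 70, 110]

# Range: difference between the largest and smallest values.
value_range = max(prices) - min(prices)

# Quartiles: with n=4, quantiles() returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(prices, n=4)

# Interquartile range: spread of the middle half of the data.
iqr = q3 - q1

# Population variance and standard deviation (divide by N, not N - 1).
variance = statistics.pvariance(prices)
std_dev = statistics.pstdev(prices)

print(f"range={value_range}, Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}")
print(f"variance={variance:.2f}, standard deviation={std_dev:.2f}")
```

Use `statistics.variance` and `statistics.stdev` instead if the data are a sample and the N - 1 divisor is wanted.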

slide-10
SLIDE 10

Statistical description of data (cont.)

Box plots are a popular way of visualizing a distribution. A box plot incorporates the five-number summary (minimum, Q1, median, Q3, maximum).

[Figure: box plots of unit price ($), on an axis from 20 to 220, for Branch 1 through Branch 4.]
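The numbers behind a box plot can be computed directly. The sketch below is illustrative only: the data and the function names `five_number_summary` and `suspected_outliers` are hypothetical. Flagging values more than 1.5 × IQR beyond the quartiles is a common convention that the slides do not state.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of at least two numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)


def suspected_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low_fence or v > high_fence]


# Hypothetical unit prices for one branch, chosen only for illustration.
branch_prices = [40, 43, 47, 47, 52, 56, 60, 63, 70, 110]

summary = five_number_summary(branch_prices)
outliers = suspected_outliers(branch_prices)

print("five-number summary:", summary)
print("suspected outliers:", outliers)
```

In a box plot the box spans Q1 to Q3, a line marks the median, and suspected outliers are usually drawn as individual points.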

slide-11
SLIDE 11

Statistical description of data (cont.)

Graphic displays of basic statistical descriptions are helpful for visual inspection of data. These include:

  • Quantile plots
  • Quantile-quantile plots
  • Histograms
  • Scatter plots

[Figure: histogram of count of items sold (axis up to 6000) by unit price ($) in buckets 40–59, 60–79, 80–99, 100–119, and 120–139.]


slide-12
SLIDE 12

Statistical description of data (Histogram)

Plotting histograms is a graphical method for summarizing the distribution of a given attribute X. If X is nominal, a bar is drawn for each value of X, and the height of the bar indicates the frequency of that value; the resulting graph is more commonly known as a bar chart. If X is numeric, the range of values for X is partitioned into disjoint consecutive subranges (buckets or bins), and the height of each bar indicates the number of values that fall in that bucket.

[Figure: histogram of count of items sold (axis up to 6000) by unit price ($) in buckets 40–59, 60–79, 80–99, 100–119, and 120–139.]
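The counting behind a bar chart and a histogram can be sketched as follows. The data, the function name `histogram_counts`, and its parameters are hypothetical. The bucket width of 20 starting at 40 mirrors the price ranges in the figure.

```python
from collections import Counter


def histogram_counts(values, low, width, n_bins):
    """Count values in n_bins equal-width buckets [low + i*width, low + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - low) // width)
        if 0 <= index < n_bins:  # values outside the covered range are ignored
            counts[index] += 1
    return counts


# Nominal attribute: one bar per distinct value (bar chart).
colors = ["red", "blue", "red", "green", "blue", "red"]
bar_counts = Counter(colors)

# Numeric attribute: one bar per bucket (histogram).
prices = [42, 45, 58, 61, 64, 79, 80, 95, 101, 118, 125, 139]
low, width, n_bins = 40, 20, 5
counts = histogram_counts(prices, low, width, n_bins)

# Text rendering: one row per bucket, one '#' per counted value.
for i, count in enumerate(counts):
    start = low + i * width
    print(f"{start}-{start + width - 1}: {'#' * count}")
```

Changing `width` changes the shape of the histogram, so it is worth trying a few bucket sizes before drawing conclusions about the distribution.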


slide-13
SLIDE 13

Statistical description of data (Scatter plots)

A scatter plot is a graphical method for determining whether there appears to be a relationship between two numeric attributes.

[Figure: scatter plot panels (a) and (b); axes: unit price ($) and items sold.]
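A scatter plot shows a relationship visually. The Pearson correlation coefficient, which the slides do not mention, is one common way to put a number on a linear relationship between the same two attributes. The sketch below assumes hypothetical data, and `pearson_r` is an illustrative helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined when an attribute is constant")
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical (unit price, items sold) pairs, chosen only for illustration.
unit_price = [20, 40, 60, 80, 100, 120]
items_sold = [650, 540, 420, 330, 210, 100]

r = pearson_r(unit_price, items_sold)
print(f"r = {r:.3f}")  # a value near -1 suggests a strong negative linear relationship
```

A value of r near 0 does not rule out a relationship: nonlinear patterns can be obvious in the scatter plot while r stays small.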

slide-15
SLIDE 15

Data visualization

Data visualization aims to communicate data clearly and effectively through graphical representation. We can take advantage of visualization techniques to discover data relationships that exist but are not easily observable by looking at the raw data. Consider, for example, the visualization of a data set using scatter plots.

Some visualization techniques:

  • Pixel-oriented techniques
  • Geometric projection techniques
  • Icon-based techniques
  • Hierarchical techniques
  • Graph-based techniques

slide-17
SLIDE 17

Reading

Read chapter 2 of the following book:

  • J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2012.
