Data Mining: Getting to know your data
Hamid Beigy
Sharif University of Technology
Fall 1396

Table of contents
1. Introduction
2. Getting to know your data
3. Statistical description of data
4. Data visualization
5. Reading

Data mining process
A typical knowledge discovery process runs through the following stages (figure, listed from source data to final result):
- Databases and flat files
- Cleaning and integration, producing a data warehouse
- Selection and transformation
- Data mining, producing patterns
- Evaluation and presentation, producing knowledge

Getting to know your data
Real-world data are typically noisy, enormous in volume, and may originate from heterogeneous sources. The first step of data mining is to get to know the data. We need to ask:
- What are the types of attributes or fields that make up the data?
- What kind of values does each attribute have?
- Which attributes are discrete and which are continuous-valued?
- How are the values distributed?
- Are there ways we can visualize the data to get a better sense of it?
- Can we spot any outliers?
- Can we measure the similarity of some data objects with respect to others?

Attribute types
Nominal attributes
The values of a nominal attribute are symbols or names of things. Each value represents some kind of category, code, or state, so nominal attributes are also referred to as categorical. The values do not have any meaningful order.
Binary attributes
A binary attribute is a nominal attribute with only two categories or states. A binary attribute is symmetric if both of its states are equally valuable and carry the same weight. It is asymmetric if the outcomes of the states are not equally important.
Ordinal attributes
An ordinal attribute has possible values with a meaningful order or ranking among them, but the magnitude of the difference between successive values is not known.
Numeric attributes
A numeric attribute is quantitative; that is, it is a measurable quantity, represented by integer or real values. Numeric attributes can be interval-scaled or ratio-scaled.
Interval-scaled attributes are measured on a scale of equal-size units. Their values have order, and we can compare and quantify the difference between values. However, they have no true zero-point, so ratios of values are not meaningful. Temperatures in Celsius or Fahrenheit are an example.
A ratio-scaled attribute is a numeric attribute with an inherent zero-point. Its values have order, and we can quantify both the difference between values and their ratio. Temperature in Kelvin has a true zero-point.
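The difference between interval-scaled and ratio-scaled attributes can be illustrated with a small sketch. The two temperature readings and the helper name celsius_to_kelvin are assumptions made for this example and do not come from the slides.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert an interval-scaled Celsius reading to ratio-scaled Kelvin."""
    return celsius + 273.15


# Hypothetical readings, chosen only for illustration.
cool_c, warm_c = 10.0, 20.0
cool_k, warm_k = celsius_to_kelvin(cool_c), celsius_to_kelvin(warm_c)

# Differences are meaningful on both scales and agree (up to rounding).
diff_c = warm_c - cool_c
diff_k = warm_k - cool_k

# Ratios are meaningful only where zero is a true zero-point (Kelvin).
ratio_c = warm_c / cool_c  # 2.0, yet 20 degrees C is not "twice as hot" as 10
ratio_k = warm_k / cool_k  # roughly 1.035

print(f"difference: {diff_c:.2f} (Celsius) vs {diff_k:.2f} (Kelvin)")
print(f"ratio: {ratio_c:.3f} (Celsius) vs {ratio_k:.3f} (Kelvin)")
```

The difference is the same on both scales, while the ratio changes once the zero-point is moved, which is why ratios are only interpreted for ratio-scaled attributes.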

Statistical description of data
For data preprocessing to be successful, it is essential to have an overall picture of your data. Basic statistical descriptions can be used to identify properties of the data and highlight which data values should be treated as noise or outliers. Three basic kinds of statistical description are measures of central tendency, measures of data dispersion, and graphic displays.
Measures of central tendency
These measure the location of the middle or center of a data distribution. Examples are the mean, median, and mode.
[Figure: relative positions of mean, median, and mode in (a) symmetric data, (b) positively skewed data, and (c) negatively skewed data]
Measures of data dispersion
These measure how the data are spread out. The most common are the range, quartiles, interquartile range (IQR), five-number summary, box plots, variance, and standard deviation.
[Figure: a distribution divided at the quartiles Q1 (25th percentile), Q2 (median), and Q3 (75th percentile)]
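As a minimal sketch of these measures, the following uses Python's standard statistics module on a small, made-up list of unit prices. The data are hypothetical, and the quartile values depend on the interpolation convention (here the module's default), so other tools may report slightly different Q1 and Q3.

```python
import statistics

# Hypothetical unit prices; the long right tail makes the data positively skewed.
prices = [40, 45, 45, 45, 50, 55, 60, 70, 90, 150]

# Measures of central tendency.
mean = statistics.mean(prices)
median = statistics.median(prices)
mode = statistics.mode(prices)

# Measures of dispersion.
value_range = max(prices) - min(prices)
q1, q2, q3 = statistics.quantiles(prices, n=4)  # quartile cut points
iqr = q3 - q1
variance = statistics.pvariance(prices)  # population variance
std_dev = statistics.pstdev(prices)      # population standard deviation

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={value_range}, Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"variance={variance:.2f}, standard deviation={std_dev:.2f}")
```

For this sample the mode lies below the median, which lies below the mean, matching the ordering expected for positively skewed data.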

Statistical description of data (cont.)
Box plots are a popular way of visualizing a distribution. A box plot incorporates the five-number summary (minimum, Q1, median, Q3, maximum).
[Figure: box plots of unit price ($) for Branch 1 through Branch 4]
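The five-number summary behind a box plot can be sketched as follows. The branch data and the function names five_number_summary and iqr_outliers are hypothetical. The 1.5 x IQR fence used to flag outliers is a common convention assumed here, not something stated on the slide.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a sample of 2+ numbers."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR outside the quartiles (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical unit prices per branch, for illustration only.
unit_prices = {
    "Branch 1": [40, 60, 75, 80, 100, 120, 202],
    "Branch 2": [50, 55, 60, 65, 70],
}

for branch, prices in unit_prices.items():
    print(branch, five_number_summary(prices), "outliers:", iqr_outliers(prices))
```

In a box plot, the box spans Q1 to Q3, the line inside the box marks the median, and points flagged by a rule like the one above are usually drawn individually.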

Statistical description of data (cont.)
Graphic displays of basic statistical descriptions are helpful for visual inspection of data. These include:
- Quantile plots
- Quantile-quantile plots
- Histograms
- Scatter plots
[Figure: example histogram of count of items sold versus unit price ($)]

Statistical description of data (Histogram)
Plotting histograms is a graphical method for summarizing the distribution of a given attribute X. If X is nominal, then a vertical bar is drawn for each value of X, the height of the bar indicates the frequency of that X value, and the resulting graph is more commonly known as a bar chart. If X is numeric, the range of values for X is partitioned into disjoint consecutive subranges called buckets or bins, and the height of each bar indicates the number of values that fall in that bucket.
[Figure: histogram of count of items sold versus unit price ($), with buckets 40–59, 60–79, 80–99, 100–119, and 120–139]
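The bucketing step for a numeric attribute can be sketched with equal-width bins. The price list and the function name bucket_counts are hypothetical. The sketch assumes integer values no smaller than the chosen start, and it uses the same 20-unit buckets as the figure.

```python
from collections import Counter


def bucket_counts(values, start, width):
    """Count integer values per equal-width bucket, keyed by (low, high) bounds.

    Assumes every value is >= start; empty buckets are omitted.
    """
    counts = Counter((v - start) // width for v in values)
    return {
        (start + i * width, start + (i + 1) * width - 1): counts[i]
        for i in sorted(counts)
    }


# Hypothetical unit prices ($), for illustration only.
prices = [42, 45, 58, 61, 75, 79, 80, 99, 104, 118, 121, 139]
histogram = bucket_counts(prices, start=40, width=20)

# A plain-text rendering: one row per bucket, one mark per counted value.
for (low, high), count in histogram.items():
    print(f"{low:>3}-{high:<3} {'#' * count} ({count})")
```

For a nominal attribute no bucketing is needed, since counting each distinct value directly (for example with Counter) gives the bar heights of the bar chart.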

Statistical description of data (Scatter plots)
A scatter plot is a graphical method for determining whether there appears to be a relationship between two numeric attributes.
[Figure: scatter plot of items sold versus unit price ($)]
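A scatter plot is inspected visually, but a numeric summary can accompany it. The sketch below computes the Pearson correlation coefficient, which is a swapped-in companion measure and is not named on the slide. The paired data and the function name pearson_r are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes at least two points and non-constant xs and ys.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)


# Hypothetical (unit price, items sold) pairs, for illustration only.
unit_price = [20, 40, 60, 80, 100, 120]
items_sold = [650, 540, 410, 300, 210, 90]

r = pearson_r(unit_price, items_sold)
print(f"Pearson r = {r:.3f}")
```

A value near -1 or +1 suggests a strong linear trend in the point cloud, while a value near 0 suggests no linear relationship. Nonlinear patterns can still be present, so the plot itself remains worth inspecting.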

Data visualization
Data visualization aims to communicate data clearly and effectively through graphical representation. We can take advantage of visualization techniques to discover data relationships that exist but are not easily observable by looking at the raw data. Consider, for example, the visualization of a data set using scatter plots.
Some visualization techniques:
- Pixel-oriented techniques
- Geometric projection techniques
- Icon-based techniques
- Hierarchical techniques
- Graph-based techniques

Reading
Read Chapter 2 of the following book:
J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2012.
