Data Mining and Analysis: Fundamental Concepts and Algorithms - PowerPoint PPT Presentation

Data Mining and Analysis: Fundamental Concepts and Algorithms
dataminingbook.info

Mohammed J. Zaki (1), Wagner Meira Jr. (2)
(1) Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
(2) Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 1: Data Mining and Analysis

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Analysis Chapter 1: Data Mining and Analysis 1 / 23

Data Matrix

Data can often be represented or abstracted as an n × d data matrix, with n rows and d columns, given as

\[
D = \left( \begin{array}{c|cccc}
        & X_1    & X_2    & \cdots & X_d    \\ \hline
x_1     & x_{11} & x_{12} & \cdots & x_{1d} \\
x_2     & x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots  & \vdots & \vdots & \ddots & \vdots \\
x_n     & x_{n1} & x_{n2} & \cdots & x_{nd}
\end{array} \right)
\]

Rows: Also called instances, examples, records, transactions, objects, points, feature-vectors, etc. Given as a d-tuple $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})$.

Columns: Also called attributes, properties, features, dimensions, variables, fields, etc. Given as an n-tuple $X_j = (x_{1j}, x_{2j}, \ldots, x_{nj})$.

Iris Dataset Extract

        Sepal length  Sepal width  Petal length  Petal width  Class
        X1            X2           X3            X4           X5
x1      5.9           3.0          4.2           1.5          Iris-versicolor
x2      6.9           3.1          4.9           1.5          Iris-versicolor
x3      6.6           2.9          4.6           1.3          Iris-versicolor
x4      4.6           3.2          1.4           0.2          Iris-setosa
x5      6.0           2.2          4.0           1.0          Iris-versicolor
x6      4.7           3.2          1.3           0.2          Iris-setosa
x7      6.5           3.0          5.8           2.2          Iris-virginica
x8      5.8           2.7          5.1           1.9          Iris-virginica
...     ...           ...          ...           ...          ...
x149    7.7           3.8          6.7           2.2          Iris-virginica
x150    5.1           3.4          1.5           0.2          Iris-setosa
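The row and column views of a data matrix can be sketched in a few lines. This is only an illustration: it uses the numeric attributes of the first three rows of the extract above, and the helper names `row` and `column` are hypothetical, not part of the slides.

```python
# Numeric attributes of the first three rows of the Iris extract.
D = [
    [5.9, 3.0, 4.2, 1.5],  # x_1
    [6.9, 3.1, 4.9, 1.5],  # x_2
    [6.6, 2.9, 4.6, 1.3],  # x_3
]

n = len(D)     # number of rows (points)
d = len(D[0])  # number of columns (attributes)


def row(D, i):
    """Return point x_i as a d-tuple (1-based index, as on the slides)."""
    return tuple(D[i - 1])


def column(D, j):
    """Return attribute X_j as an n-tuple (1-based index, as on the slides)."""
    return tuple(r[j - 1] for r in D)


x2 = row(D, 2)     # second point
X1 = column(D, 1)  # sepal length attribute

print(n, d)
print(x2)
print(X1)
```

Each row is a d-tuple and each column an n-tuple, matching the definitions of $x_i$ and $X_j$ on the Data Matrix slide.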

Attributes

Attributes may be classified into two main types:

Numeric Attributes: real-valued or integer-valued domain
  Interval-scaled: only differences are meaningful, e.g., temperature
  Ratio-scaled: differences and ratios are meaningful, e.g., Age

Categorical Attributes: set-valued domain composed of a set of symbols
  Nominal: only equality is meaningful, e.g., domain(Sex) = {M, F}
  Ordinal: both equality (are two values the same?) and inequality (is one value less than another?) are meaningful, e.g., domain(Education) = {High School, BS, MS, PhD}
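The following sketch illustrates which operations are meaningful for each attribute type. It rests on assumptions not stated on the slide: temperature is taken to be measured in Celsius or Fahrenheit, and the helper `ordinal_less` is a hypothetical name.

```python
# Ordinal domain from the slide, listed from lowest to highest.
EDUCATION = ["High School", "BS", "MS", "PhD"]


def ordinal_less(a, b, domain=EDUCATION):
    """True if value a precedes value b in the ordered domain."""
    return domain.index(a) < domain.index(b)


# Nominal: only equality is meaningful.
same_sex = ("M" == "F")

# Ordinal: equality and order are both meaningful.
bs_below_phd = ordinal_less("BS", "PhD")

# Interval-scaled (assumption: Celsius vs. Fahrenheit): the ratio of two
# readings depends on the unit, so ratios carry no meaning.
c1, c2 = 10.0, 20.0
f1, f2 = c1 * 9 / 5 + 32, c2 * 9 / 5 + 32
ratio_celsius = c2 / c1
ratio_fahrenheit = f2 / f1

# Ratio-scaled: the ratio of two ages is the same in years and in months.
age_years = (20.0, 40.0)
age_months = (age_years[0] * 12, age_years[1] * 12)

print(same_sex, bs_below_phd)
print(ratio_celsius, ratio_fahrenheit)
print(age_years[1] / age_years[0], age_months[1] / age_months[0])
```

The two temperature ratios differ, while the two age ratios agree, which is the practical difference between interval-scaled and ratio-scaled attributes.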

Data: Algebraic and Geometric View

For numeric data matrix D, each row or point is a d-dimensional column vector:

\[
x_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \end{pmatrix}
    = \begin{pmatrix} x_{i1} & x_{i2} & \cdots & x_{id} \end{pmatrix}^T \in \mathbb{R}^d
\]

whereas each column or attribute is an n-dimensional column vector:

\[
X_j = \begin{pmatrix} x_{1j} & x_{2j} & \cdots & x_{nj} \end{pmatrix}^T \in \mathbb{R}^n
\]

Figure: Projections of $x_1 = (5.9, 3.0, 4.2, 1.5)^T$ in 2D and 3D. (a) $\mathbb{R}^2$: $x_1 = (5.9, 3.0)^T$ on axes X1, X2. (b) $\mathbb{R}^3$: $x_1 = (5.9, 3.0, 4.2)^T$ on axes X1, X2, X3.

Scatterplot: 2D Iris Dataset

Sepal length versus sepal width. Visualizing the Iris dataset as points/vectors in 2D. The solid circle shows the mean point.

Figure: scatterplot with X1: sepal length on the horizontal axis and X2: sepal width on the vertical axis.

Numeric Data Matrix

If all attributes are numeric, then the data matrix D is an n × d matrix, or equivalently a set of n row vectors $x_i^T \in \mathbb{R}^d$ or a set of d column vectors $X_j \in \mathbb{R}^n$:

\[
D = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1d} \\
x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nd}
\end{pmatrix}
= \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix}
= \begin{pmatrix} X_1 & X_2 & \cdots & X_d \end{pmatrix}
\]

The mean of the data matrix D is the average of all the points:

\[
\mathrm{mean}(D) = \mu = \frac{1}{n} \sum_{i=1}^{n} x_i
\]

The centered data matrix is obtained by subtracting the mean from all the points:

\[
Z = D - 1 \cdot \mu^T
  = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix}
  - \begin{pmatrix} \mu^T \\ \mu^T \\ \vdots \\ \mu^T \end{pmatrix}
  = \begin{pmatrix} x_1^T - \mu^T \\ x_2^T - \mu^T \\ \vdots \\ x_n^T - \mu^T \end{pmatrix}
  = \begin{pmatrix} z_1^T \\ z_2^T \\ \vdots \\ z_n^T \end{pmatrix}
\tag{1}
\]

where $z_i = x_i - \mu$ is a centered point, and $1 \in \mathbb{R}^n$ is the vector of ones.
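A minimal sketch of the mean and of centering, assuming a small data matrix built from the sepal length and sepal width of the first four Iris rows. The function names `mean` and `center` are illustrative choices, not taken from the slides.

```python
# Sepal length and sepal width of the first four rows of the Iris extract.
D = [
    [5.9, 3.0],
    [6.9, 3.1],
    [6.6, 2.9],
    [4.6, 3.2],
]


def mean(D):
    """Return mu, the average of all points: one entry per attribute."""
    n = len(D)
    return [sum(col) / n for col in zip(*D)]


def center(D):
    """Return Z, where row i is z_i = x_i - mu."""
    mu = mean(D)
    return [[x - m for x, m in zip(point, mu)] for point in D]


mu = mean(D)
Z = center(D)

print(mu)
print(Z)
print(mean(Z))  # approximately the zero vector
```

Subtracting $\mu$ from every row is what the product $1 \cdot \mu^T$ expresses in equation (1): it stacks n copies of $\mu^T$. The centered matrix has mean zero up to floating-point rounding.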

Norm, Distance and Angle

Given two points $a, b \in \mathbb{R}^m$, their dot product is defined as the scalar

\[
a^T b = a_1 b_1 + a_2 b_2 + \cdots + a_m b_m = \sum_{i=1}^{m} a_i b_i
\]

The Euclidean norm or length of a vector a is defined as

\[
\|a\| = \sqrt{a^T a} = \sqrt{\sum_{i=1}^{m} a_i^2}
\]

The unit vector in the direction of a is $u = \frac{a}{\|a\|}$, with $\|u\| = 1$.

Distance between a and b is given as

\[
\|a - b\| = \sqrt{\sum_{i=1}^{m} (a_i - b_i)^2}
\]

Angle between a and b is given as

\[
\cos\theta = \frac{a^T b}{\|a\| \, \|b\|} = \left( \frac{a}{\|a\|} \right)^T \left( \frac{b}{\|b\|} \right)
\]

Figure: the points (5, 3) and (1, 4) in the X1-X2 plane, showing the vectors a and b, the difference a − b, and the angle θ between them.
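The definitions above translate directly into code. The sketch below assumes a = (5, 3) and b = (1, 4), the two points labelled in the figure; the function names are illustrative.

```python
import math


def dot(a, b):
    """Dot product a^T b."""
    return sum(ai * bi for ai, bi in zip(a, b))


def norm(a):
    """Euclidean norm ||a|| = sqrt(a^T a)."""
    return math.sqrt(dot(a, a))


def distance(a, b):
    """Euclidean distance ||a - b||."""
    return norm([ai - bi for ai, bi in zip(a, b)])


def cos_angle(a, b):
    """Cosine of the angle between a and b."""
    return dot(a, b) / (norm(a) * norm(b))


def unit(a):
    """Unit vector in the direction of a."""
    length = norm(a)
    return [ai / length for ai in a]


a = [5.0, 3.0]
b = [1.0, 4.0]

theta_deg = math.degrees(math.acos(cos_angle(a, b)))

print(dot(a, b))       # 5*1 + 3*4
print(distance(a, b))  # sqrt(4^2 + (-1)^2)
print(theta_deg)
print(norm(unit(a)))
```

For these two vectors the dot product is 17, the distance is $\sqrt{17}$, and $\cos\theta = 17 / (\sqrt{34}\,\sqrt{17}) = 1/\sqrt{2}$, so the angle is 45 degrees up to rounding.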

Orthogonal Projection

Two vectors a and b are orthogonal iff $a^T b = 0$, i.e., the angle between them is 90°.

Orthogonal projection of b on a comprises the vector $p = b_{\parallel}$ parallel to a, and $r = b_{\perp}$ perpendicular or orthogonal to a, given as

\[
b = b_{\parallel} + b_{\perp} = p + r
\]

where

\[
p = b_{\parallel} = \left( \frac{a^T b}{a^T a} \right) a
\]

Figure: vectors a and b in the X1-X2 plane, with $p = b_{\parallel}$ along a and $r = b_{\perp}$ perpendicular to a.
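A short sketch of the decomposition b = p + r. The vectors a = (5, 3) and b = (1, 4) are illustrative values chosen here, since the slide text does not state coordinates, and `project` is a hypothetical helper name. The residual is computed as r = b − p, which follows from b = p + r.

```python
def dot(a, b):
    """Dot product a^T b."""
    return sum(ai * bi for ai, bi in zip(a, b))


def project(b, a):
    """Return (p, r): p is parallel to a, r is orthogonal to a, and b = p + r."""
    scale = dot(a, b) / dot(a, a)  # scalar (a^T b) / (a^T a)
    p = [scale * ai for ai in a]
    r = [bi - pi for bi, pi in zip(b, p)]
    return p, r


a = [5.0, 3.0]  # illustrative values, not taken from the slide
b = [1.0, 4.0]

p, r = project(b, a)

print(p)
print(r)
print(dot(a, r))  # zero: r is orthogonal to a
```

Here $a^T b = 17$ and $a^T a = 34$, so p = 0.5 a = (2.5, 1.5) and r = (−1.5, 2.5), and the check $a^T r = 0$ confirms orthogonality.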

Projection of Centered Iris Data Onto a Line ℓ

Figure: centered Iris points in the X1-X2 plane together with their orthogonal projections onto the line ℓ.
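Projecting centered points onto a line through the origin combines centering with orthogonal projection. In this sketch the four data points come from the Iris extract, while the direction of ℓ is a hypothetical choice, because the slide does not state which line is plotted.

```python
import math

# Sepal length and sepal width of the first four rows of the Iris extract.
D = [[5.9, 3.0], [6.9, 3.1], [6.6, 2.9], [4.6, 3.2]]

# Center the data: z_i = x_i - mu.
n = len(D)
mu = [sum(col) / n for col in zip(*D)]
Z = [[x - m for x, m in zip(point, mu)] for point in D]

# Hypothetical direction of the line; the slide does not give it.
direction = [2.0, 1.0]
length = math.sqrt(sum(c * c for c in direction))
u = [c / length for c in direction]  # unit vector along the line


def project_onto_line(z, u):
    """Orthogonal projection of z onto the line spanned by unit vector u."""
    coord = sum(zi * ui for zi, ui in zip(z, u))  # scalar coordinate u^T z
    return [coord * ui for ui in u]


P = [project_onto_line(z, u) for z in Z]

for z, p in zip(Z, P):
    print(z, "->", p)
```

Because u has unit length, the projection formula reduces to $(u^T z)\,u$, so each point is summarized by the single coordinate $u^T z$ along the line.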

Data: Probabilistic View

A random variable X is a function $X : O \to \mathbb{R}$, where O is the set of all possible outcomes of the experiment, also called the sample space.

A discrete random variable takes on only a finite or countably infinite number of values, whereas a continuous random variable can take on any value in $\mathbb{R}$.

By default, a numeric attribute $X_j$ is considered as the identity random variable, given as $X(v) = v$ for all $v \in O$. Here $O = \mathbb{R}$.

Discrete Variable: Long Sepal Length

Define the random variable A, denoting long sepal length (7 cm or more), as follows:

\[
A(v) = \begin{cases} 0 & \text{if } v < 7 \\ 1 & \text{if } v \ge 7 \end{cases}
\]

The sample space of A is $O = [4.3, 7.9]$, and its range is $\{0, 1\}$. Thus, A is discrete.
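The random variable A is simply a function applied to each observed sepal length. As a sketch, the code below evaluates it on the ten sepal lengths listed in the Iris extract slide; the resulting fraction describes only those ten rows, not the full dataset.

```python
def A(v):
    """Long sepal length indicator: 1 if v >= 7 (cm), otherwise 0."""
    return 1 if v >= 7 else 0


# Sepal lengths of the ten rows shown in the Iris extract
# (x1..x8, x149, x150).
sepal_lengths = [5.9, 6.9, 6.6, 4.6, 6.0, 4.7, 6.5, 5.8, 7.7, 5.1]

values = [A(v) for v in sepal_lengths]
fraction_long = sum(values) / len(values)

print(values)
print(sorted(set(values)))  # observed part of the range {0, 1}
print(fraction_long)
```

Only the length 7.7 reaches the 7 cm threshold, so A maps a continuous measurement to one of two values, which is what makes it discrete.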
