SLIDE 1

Data Mining and Machine Learning: Fundamental Concepts and Algorithms

dataminingbook.info

Mohammed J. Zaki¹  Wagner Meira Jr.²

¹ Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA

² Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 3: Categorical Attributes

SLIDE 2

Univariate Analysis: Bernoulli Variable

Consider a single categorical attribute $X$ with domain $dom(X) = \{a_1, a_2, \ldots, a_m\}$ comprising $m$ symbolic values. The data $\mathbf{D}$ is an $n \times 1$ symbolic data matrix, $\mathbf{D} = (x_1, x_2, \ldots, x_n)^T$, where each point $x_i \in dom(X)$.

Bernoulli variable: the special case when $m = 2$,

$$X(v) = \begin{cases} 1 & \text{if } v = a_1 \\ 0 & \text{if } v = a_2 \end{cases}$$

i.e., $dom(X) = \{0, 1\}$.

SLIDE 3

Bernoulli Variable: Mean and Variance

The probability mass function (PMF) of $X$ is given as

$$P(X = x) = f(x) = p^x (1 - p)^{1-x}$$

The expected value of $X$ is

$$\mu = E[X] = 1 \cdot p + 0 \cdot (1 - p) = p$$

and the variance of $X$ is

$$\sigma^2 = \text{var}(X) = p(1 - p)$$

Assume that each symbolic point has been mapped to its binary value, so that $\{x_1, x_2, \ldots, x_n\}$ is a random sample drawn from $X$. The sample mean is

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{n_1}{n} = \hat{p}$$

where $n_1$ is the number of points with value 1 in the random sample (equal to the number of occurrences of the symbol $a_1$). The sample variance is

$$\hat{\sigma}^2 = \hat{p}(1 - \hat{p})$$
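As a quick check of these estimators, here is a minimal Python sketch (assuming numpy is available); the binary sample is hypothetical:

```python
import numpy as np

# Hypothetical binary sample: 1 when the symbol is a1, 0 when it is a2
x = np.array([1, 0, 1, 1, 0, 1, 0, 1])

p_hat = x.mean()                # sample mean mu-hat = n1/n = p-hat
var_hat = p_hat * (1 - p_hat)   # sample variance p-hat(1 - p-hat)

print(p_hat, var_hat)           # 0.625 0.234375
```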

SLIDE 4

Binomial Distribution: Number of Occurrences

Given the Bernoulli variable $X$, let $\{x_1, x_2, \ldots, x_n\}$ be a random sample of size $n$. Let $N$ be the random variable denoting the number of occurrences of the symbol $a_1$ (value $X = 1$). $N$ has a binomial distribution, given as

$$f(N = n_1 \mid n, p) = \binom{n}{n_1} p^{n_1} (1 - p)^{n - n_1}$$

$N$ is the sum of the $n$ independent Bernoulli random variables $x_i$ IID with $X$, that is, $N = \sum_{i=1}^{n} x_i$. The mean or expected number of occurrences of $a_1$ is

$$\mu_N = E[N] = E\left[\sum_{i=1}^{n} x_i\right] = \sum_{i=1}^{n} E[x_i] = \sum_{i=1}^{n} p = np$$

The variance of $N$ is

$$\sigma_N^2 = \text{var}(N) = \sum_{i=1}^{n} \text{var}(x_i) = \sum_{i=1}^{n} p(1 - p) = np(1 - p)$$
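A minimal sketch of the binomial model, assuming scipy is available; the values of n and p are illustrative only:

```python
from scipy.stats import binom

n, p = 10, 0.3   # illustrative sample size and P(X = 1)

# Probability of observing the symbol a1 exactly n1 = 4 times in n trials
print(binom.pmf(4, n, p))                 # f(N = 4 | n, p) ~= 0.200

# Mean np and variance np(1 - p) of N
print(binom.mean(n, p), binom.var(n, p))  # 3.0 2.1
```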

SLIDE 5

Multivariate Bernoulli Variable

For the general case when $dom(X) = \{a_1, a_2, \ldots, a_m\}$, we model $X$ as an $m$-dimensional or multivariate Bernoulli random variable $\mathbf{X} = (A_1, A_2, \ldots, A_m)^T$, where each $A_i$ is a Bernoulli variable with parameter $p_i$ denoting the probability of observing symbol $a_i$.

However, $\mathbf{X}$ can assume only one of the symbolic values at any one time. Thus, $\mathbf{X}(v) = \mathbf{e}_i$ if $v = a_i$, where $\mathbf{e}_i$ is the $i$-th standard basis vector in $m$ dimensions. The range of $\mathbf{X}$ consists of $m$ distinct vector values $\{\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_m\}$.

The PMF of $\mathbf{X}$ is

$$P(\mathbf{X} = \mathbf{e}_i) = f(\mathbf{e}_i) = p_i = \prod_{j=1}^{m} p_j^{e_{ij}}$$

with $\sum_{i=1}^{m} p_i = 1$.

SLIDE 6

Multivariate Bernoulli: Mean

The mean or expected value of $\mathbf{X}$ can be obtained as

$$\boldsymbol{\mu} = E[\mathbf{X}] = \sum_{i=1}^{m} \mathbf{e}_i f(\mathbf{e}_i) = \sum_{i=1}^{m} \mathbf{e}_i p_i = \begin{pmatrix} 1 \\ \vdots \\ 0 \end{pmatrix} p_1 + \cdots + \begin{pmatrix} 0 \\ \vdots \\ 1 \end{pmatrix} p_m = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_m \end{pmatrix} = \mathbf{p}$$

The sample mean is

$$\hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i = \sum_{i=1}^{m} \frac{n_i}{n} \mathbf{e}_i = \begin{pmatrix} n_1/n \\ n_2/n \\ \vdots \\ n_m/n \end{pmatrix} = \begin{pmatrix} \hat{p}_1 \\ \hat{p}_2 \\ \vdots \\ \hat{p}_m \end{pmatrix} = \hat{\mathbf{p}}$$

where $n_i$ is the number of occurrences of the vector value $\mathbf{e}_i$ in the sample, i.e., the number of occurrences of the symbol $a_i$. Furthermore, $\sum_{i=1}^{m} n_i = n$.
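A minimal sketch of the one-hot encoding and the sample mean, assuming numpy; the domain and sample are hypothetical:

```python
import numpy as np

symbols = ["a1", "a2", "a3"]             # dom(X), so m = 3
sample = ["a2", "a1", "a2", "a3", "a2"]  # hypothetical random sample, n = 5

idx = {a: i for i, a in enumerate(symbols)}

# Map each symbolic point to its standard basis vector e_i (one-hot encoding)
X = np.eye(len(symbols))[[idx[v] for v in sample]]  # n x m binary data matrix

mu_hat = X.mean(axis=0)   # sample mean (n1/n, ..., nm/n) = p-hat
print(mu_hat)             # [0.2 0.6 0.2]
```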

SLIDE 7

Multivariate Bernoulli Variable: sepal length

Bins        Domain           Counts
[4.3, 5.2]  Very Short (a1)  n1 = 45
(5.2, 6.1]  Short (a2)       n2 = 50
(6.1, 7.0]  Long (a3)        n3 = 43
(7.0, 7.9]  Very Long (a4)   n4 = 12

We model sepal length as a multivariate Bernoulli variable $\mathbf{X}$:

$$\mathbf{X}(v) = \begin{cases} \mathbf{e}_1 = (1,0,0,0)^T & \text{if } v = a_1 \\ \mathbf{e}_2 = (0,1,0,0)^T & \text{if } v = a_2 \\ \mathbf{e}_3 = (0,0,1,0)^T & \text{if } v = a_3 \\ \mathbf{e}_4 = (0,0,0,1)^T & \text{if } v = a_4 \end{cases}$$

For example, the symbolic point $x_1 = \text{Short} = a_2$ is represented as the vector $(0,1,0,0)^T = \mathbf{e}_2$.

Probability mass function: the total sample size is $n = 150$, and the estimates $\hat{p}_i$ are

$$\hat{p}_1 = 45/150 = 0.3 \qquad \hat{p}_2 = 50/150 = 0.333 \qquad \hat{p}_3 = 43/150 = 0.287 \qquad \hat{p}_4 = 12/150 = 0.08$$

[Figure: empirical PMF $f(\mathbf{e}_i) = \hat{p}_i$ plotted over the four values Very Short, Short, Long, Very Long]

SLIDE 8

Multivariate Bernoulli Variable: Covariance Matrix

We have $\mathbf{X} = (A_1, A_2, \ldots, A_m)^T$, where $A_i$ is the Bernoulli variable corresponding to symbol $a_i$. The variance of each Bernoulli variable $A_i$ is

$$\sigma_i^2 = \text{var}(A_i) = p_i(1 - p_i)$$

The covariance between $A_i$ and $A_j$ is

$$\sigma_{ij} = E[A_i A_j] - E[A_i] \cdot E[A_j] = 0 - p_i p_j = -p_i p_j$$

a negative relationship, since $A_i$ and $A_j$ cannot both be 1 at the same time. The covariance matrix for $\mathbf{X}$ is given as

$$\boldsymbol{\Sigma} = \begin{pmatrix} p_1(1-p_1) & -p_1 p_2 & \cdots & -p_1 p_m \\ -p_1 p_2 & p_2(1-p_2) & \cdots & -p_2 p_m \\ \vdots & \vdots & \ddots & \vdots \\ -p_1 p_m & -p_2 p_m & \cdots & p_m(1-p_m) \end{pmatrix}$$

More compactly, $\boldsymbol{\Sigma} = \text{diag}(\mathbf{p}) - \mathbf{p}\,\mathbf{p}^T$, where $\boldsymbol{\mu} = \mathbf{p} = (p_1, \ldots, p_m)^T$.

SLIDE 9

Categorical, Mapped Binary and Centered Dataset

Modeling $X$ as a multivariate Bernoulli variable is equivalent to replacing each point $x_i$ with $\mathbf{X}(x_i)$, giving a new $n \times m$ binary data matrix:

X          A1  A2     Z1    Z2
x1 Short    0   1    -0.4   0.4
x2 Short    0   1    -0.4   0.4
x3 Long     1   0     0.6  -0.6
x4 Short    0   1    -0.4   0.4
x5 Long     1   0     0.6  -0.6

Here $\mathbf{X}$ is the multivariate Bernoulli variable

$$\mathbf{X}(v) = \begin{cases} \mathbf{e}_1 = (1,0)^T & \text{if } v = \text{Long } (a_1) \\ \mathbf{e}_2 = (0,1)^T & \text{if } v = \text{Short } (a_2) \end{cases}$$

The sample mean and covariance matrix are

$$\hat{\boldsymbol{\mu}} = \hat{\mathbf{p}} = (2/5, 3/5)^T = (0.4, 0.6)^T$$

$$\hat{\boldsymbol{\Sigma}} = \text{diag}(\hat{\mathbf{p}}) - \hat{\mathbf{p}}\hat{\mathbf{p}}^T = \begin{pmatrix} 0.24 & -0.24 \\ -0.24 & 0.24 \end{pmatrix}$$

From the centered data matrix $\mathbf{Z} = (Z_1, Z_2)$, we obtain the same result:

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{5} \mathbf{Z}^T \mathbf{Z} = \begin{pmatrix} 0.24 & -0.24 \\ -0.24 & 0.24 \end{pmatrix}$$
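A sketch, assuming numpy, that reproduces this small example and confirms that both expressions yield the same covariance matrix:

```python
import numpy as np

# Binary data matrix for the five points (columns: A1 = Long, A2 = Short)
X = np.array([[0, 1],   # x1 = Short
              [0, 1],   # x2 = Short
              [1, 0],   # x3 = Long
              [0, 1],   # x4 = Short
              [1, 0]])  # x5 = Long

p_hat = X.mean(axis=0)   # (0.4, 0.6)
Z = X - p_hat            # centered data matrix

print(np.diag(p_hat) - np.outer(p_hat, p_hat))  # diag(p-hat) - p-hat p-hat^T
print(Z.T @ Z / len(X))                         # (1/n) Z^T Z -- the same matrix
```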

SLIDE 10

Multinomial Distribution: Number of Occurrences

Let $\{x_1, x_2, \ldots, x_n\}$ be a random sample from $\mathbf{X}$. Let $N_i$ be the random variable denoting the number of occurrences of symbol $a_i$ in the sample, and let $\mathbf{N} = (N_1, N_2, \ldots, N_m)^T$. $\mathbf{N}$ has a multinomial distribution, given as

$$f\big(\mathbf{N} = (n_1, n_2, \ldots, n_m) \mid \mathbf{p}\big) = \binom{n}{n_1\, n_2\, \ldots\, n_m} \prod_{i=1}^{m} p_i^{n_i}$$

The mean and covariance matrix of $\mathbf{N}$ are

$$\boldsymbol{\mu}_N = E[\mathbf{N}] = n\,E[\mathbf{X}] = n \cdot \boldsymbol{\mu} = n \cdot \mathbf{p} = (np_1, \ldots, np_m)^T$$

$$\boldsymbol{\Sigma}_N = n \cdot \big(\text{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T\big) = \begin{pmatrix} np_1(1-p_1) & -np_1 p_2 & \cdots & -np_1 p_m \\ -np_1 p_2 & np_2(1-p_2) & \cdots & -np_2 p_m \\ \vdots & \vdots & \ddots & \vdots \\ -np_1 p_m & -np_2 p_m & \cdots & np_m(1-p_m) \end{pmatrix}$$

The sample mean and covariance matrix for $\mathbf{N}$ are

$$\hat{\boldsymbol{\mu}}_N = n\hat{\mathbf{p}} \qquad \hat{\boldsymbol{\Sigma}}_N = n\big(\text{diag}(\hat{\mathbf{p}}) - \hat{\mathbf{p}}\hat{\mathbf{p}}^T\big)$$
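A sketch using scipy.stats.multinomial (an assumed dependency) with the sepal length estimates:

```python
import numpy as np
from scipy.stats import multinomial

n = 150
p = np.array([0.3, 0.333, 0.287, 0.08])  # sepal length PMF estimates

# Probability of observing exactly the counts (45, 50, 43, 12) in 150 trials
print(multinomial.pmf([45, 50, 43, 12], n, p))

print(n * p)                  # mean vector n p
print(multinomial.cov(n, p))  # covariance matrix n (diag(p) - p p^T)
```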

SLIDE 11

Bivariate Analysis

Assume the data comprises two categorical attributes, $X_1$ and $X_2$, with

$$dom(X_1) = \{a_{11}, a_{12}, \ldots, a_{1m_1}\} \qquad dom(X_2) = \{a_{21}, a_{22}, \ldots, a_{2m_2}\}$$

We model $X_1$ and $X_2$ as multivariate Bernoulli variables $\mathbf{X}_1$ and $\mathbf{X}_2$ with dimensions $m_1$ and $m_2$, respectively. The joint distribution of $\mathbf{X}_1$ and $\mathbf{X}_2$ is modeled as the $(m_1 + m_2)$-dimensional vector variable

$$\mathbf{X} = \begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix} \qquad \mathbf{X}\big((v_1, v_2)^T\big) = \begin{pmatrix} \mathbf{X}_1(v_1) \\ \mathbf{X}_2(v_2) \end{pmatrix} = \begin{pmatrix} \mathbf{e}_{1i} \\ \mathbf{e}_{2j} \end{pmatrix}$$

provided that $v_1 = a_{1i}$ and $v_2 = a_{2j}$.

The joint PMF for $\mathbf{X}$ is given as the $m_1 \times m_2$ matrix

$$\mathbf{P}_{12} = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1m_2} \\ p_{21} & p_{22} & \cdots & p_{2m_2} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m_1 1} & p_{m_1 2} & \cdots & p_{m_1 m_2} \end{pmatrix}$$

SLIDE 12

Bivariate Empirical PMF: sepal length and sepal width

X1: sepal length

Bins        Domain           Counts
[4.3, 5.2]  Very Short (a1)  n1 = 45
(5.2, 6.1]  Short (a2)       n2 = 50
(6.1, 7.0]  Long (a3)        n3 = 43
(7.0, 7.9]  Very Long (a4)   n4 = 12

X2: sepal width

Bins        Domain       Counts
[2.0, 2.8]  Short (a1)   47
(2.8, 3.6]  Medium (a2)  88
(3.6, 4.4]  Long (a3)    15

Observed counts (nij):

                    Short (e21)  Medium (e22)  Long (e23)
Very Short (e11)         7            33            5
Short (e12)             24            18            8
Long (e13)              13            30            0
Very Long (e14)          3             7            2

SLIDE 13

Bivariate Empirical PMF: sepal length and sepal width

Joint probabilities: $\hat{p}_{ij} = n_{ij}/n$

[Figure: empirical joint PMF $f(\mathbf{x})$ shown as a 3D bar plot over $\mathbf{e}_{1i} \times \mathbf{e}_{2j}$]

        e21     e22     e23
e11    0.047   0.22    0.033
e12    0.16    0.12    0.053
e13    0.087   0.2     0
e14    0.02    0.047   0.013

SLIDE 14

Attribute Dependence: Contingency Analysis

The contingency table for $\mathbf{X}_1$ and $\mathbf{X}_2$ is the $m_1 \times m_2$ matrix of observed counts $n_{ij}$:

$$\mathbf{N}_{12} = n \cdot \hat{\mathbf{P}}_{12} = \begin{pmatrix} n_{11} & n_{12} & \cdots & n_{1m_2} \\ n_{21} & n_{22} & \cdots & n_{2m_2} \\ \vdots & \vdots & \ddots & \vdots \\ n_{m_1 1} & n_{m_1 2} & \cdots & n_{m_1 m_2} \end{pmatrix}$$

where $\hat{\mathbf{P}}_{12}$ is the empirical joint PMF for $\mathbf{X}_1$ and $\mathbf{X}_2$. The contingency table is augmented with the row and column marginal counts:

$$\mathbf{N}_1 = n \cdot \hat{\mathbf{p}}_1 = \begin{pmatrix} n^1_1 \\ \vdots \\ n^1_{m_1} \end{pmatrix} \qquad \mathbf{N}_2 = n \cdot \hat{\mathbf{p}}_2 = \begin{pmatrix} n^2_1 \\ \vdots \\ n^2_{m_2} \end{pmatrix}$$

$\mathbf{N}_1$ and $\mathbf{N}_2$ have multinomial distributions with parameters $\mathbf{p}_1 = (p^1_1, \ldots, p^1_{m_1})$ and $\mathbf{p}_2 = (p^2_1, \ldots, p^2_{m_2})$, respectively. $\mathbf{N}_{12}$ also has a multinomial distribution with parameters $\mathbf{P}_{12} = \{p_{ij}\}$, for $1 \le i \le m_1$ and $1 \le j \le m_2$.

SLIDE 15

Contingency Table: sepal length vs. sepal width

Contingency table of sepal length (X1) vs. sepal width (X2); row counts are $n^1_i$, column counts are $n^2_j$:

                   Short (a21)  Medium (a22)  Long (a23)   Row counts
Very Short (a11)        7            33            5        n¹₁ = 45
Short (a12)            24            18            8        n¹₂ = 50
Long (a13)             13            30            0        n¹₃ = 43
Very Long (a14)         3             7            2        n¹₄ = 12
Column counts       n²₁ = 47      n²₂ = 88      n²₃ = 15    n = 150

SLIDE 16

Chi-Squared Test for Independence

Assume $\mathbf{X}_1$ and $\mathbf{X}_2$ are independent. Then their joint PMF is

$$\hat{p}_{ij} = \hat{p}^1_i \cdot \hat{p}^2_j$$

The expected frequency for each pair of values is

$$e_{ij} = n \cdot \hat{p}_{ij} = n \cdot \hat{p}^1_i \cdot \hat{p}^2_j = n \cdot \frac{n^1_i}{n} \cdot \frac{n^2_j}{n} = \frac{n^1_i\, n^2_j}{n}$$

The $\chi^2$ statistic quantifies the difference between observed and expected counts:

$$\chi^2 = \sum_{i=1}^{m_1} \sum_{j=1}^{m_2} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}$$

The sampling distribution of the $\chi^2$ statistic follows the chi-squared density function

$$f(x \mid q) = \frac{1}{2^{q/2}\,\Gamma(q/2)}\, x^{\frac{q}{2} - 1}\, e^{-\frac{x}{2}}$$

where $q$ is the number of degrees of freedom:

$$q = |dom(X_1)| \cdot |dom(X_2)| - \big(|dom(X_1)| + |dom(X_2)|\big) + 1 = m_1 m_2 - m_1 - m_2 + 1 = (m_1 - 1)(m_2 - 1)$$

SLIDE 17

Chi-Squared Test: sepal length and sepal width

Expected counts (eij):

                   Short (a21)  Medium (a22)  Long (a23)
Very Short (a11)      14.1         26.4          4.5
Short (a12)           15.67        29.33         5.0
Long (a13)            13.47        25.23         4.3
Very Long (a14)        3.76         7.04         1.2

Observed counts (nij):

                   Short (a21)  Medium (a22)  Long (a23)
Very Short (a11)       7            33            5
Short (a12)           24            18            8
Long (a13)            13            30            0
Very Long (a14)        3             7            2

The chi-squared statistic value is $\chi^2 = 21.8$. The number of degrees of freedom is $q = (m_1 - 1)(m_2 - 1) = 3 \cdot 2 = 6$.
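This worked example can be reproduced with scipy.stats.chi2_contingency (an assumed dependency), which returns the statistic, the p-value, the degrees of freedom, and the expected counts in a single call; a sketch:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: sepal length (rows) vs. sepal width (columns)
observed = np.array([[ 7, 33, 5],
                     [24, 18, 8],
                     [13, 30, 0],
                     [ 3,  7, 2]])

# correction=False disables Yates' continuity correction (only used for 2x2 tables)
stat, pvalue, dof, expected = chi2_contingency(observed, correction=False)

print(round(stat, 1), dof)    # 21.8 6
print(np.round(expected, 2))  # matches the expected counts table above
```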

SLIDE 18

Chi-Squared Distribution (q = 6).

The p-value of a statistic θ is defined as the probability of obtaining a value at least as extreme as the observed value. The null hypothesis, that X1 and X2 are independent, is rejected if p-value(z) ≤ α, say α = 0.01. We have p-value(21.8) = 0.0013. Thus, we reject the null hypothesis, and conclude that X1 and X2 are dependent.

[Figure: chi-squared density $f(x \mid 6)$; the $H_0$ rejection region for $\alpha = 0.01$ begins at the critical value 16.8, and the observed statistic 21.8 falls inside it]
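A sketch, assuming scipy, of the p-value and critical value behind this decision:

```python
from scipy.stats import chi2

q = 6                    # degrees of freedom
print(chi2.sf(21.8, q))  # p-value ~= 0.0013, below alpha = 0.01: reject H0
print(chi2.ppf(0.99, q)) # critical value ~= 16.81 bounding the rejection region
```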

SLIDE 19

Multiway Contingency Analysis

Given $\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_d)^T$, the chi-squared statistic is given as

$$\chi^2 = \sum_{\mathbf{i}} \frac{(n_{\mathbf{i}} - e_{\mathbf{i}})^2}{e_{\mathbf{i}}} = \sum_{i_1=1}^{m_1} \sum_{i_2=1}^{m_2} \cdots \sum_{i_d=1}^{m_d} \frac{(n_{i_1,i_2,\ldots,i_d} - e_{i_1,i_2,\ldots,i_d})^2}{e_{i_1,i_2,\ldots,i_d}}$$

Under the null hypothesis that the attributes are independent, the expected number of occurrences of the symbol tuple $(a_{1i_1}, a_{2i_2}, \ldots, a_{di_d})$ is given as

$$e_{\mathbf{i}} = n \cdot \hat{p}_{\mathbf{i}} = n \cdot \prod_{j=1}^{d} \hat{p}^j_{i_j} = \frac{n^1_{i_1}\, n^2_{i_2} \cdots n^d_{i_d}}{n^{d-1}}$$

The total number of degrees of freedom for the chi-squared distribution is given as

$$q = \prod_{i=1}^{d} |dom(X_i)| - \sum_{i=1}^{d} |dom(X_i)| + (d - 1) = \Big(\prod_{i=1}^{d} m_i\Big) - \Big(\sum_{i=1}^{d} m_i\Big) + d - 1$$

SLIDE 20

3-Way Contingency Table

X1: sepal length, X2: sepal width, X3: Iris type.

[Figure: 3-way contingency table shown as a 4 × 3 × 3 cube of cell counts, with marginal counts n¹ = (45, 50, 43, 12) for X1, n² = (47, 88, 15) for X2, and n³ = (50, 50, 50) for X3]

SLIDE 21

3-Way Contingency Analysis

Expected counts $e_{ijk}$, identical for each value $a_{3k}$ of X3 since all $n^3_k = 50$:

        a21    a22    a23
a11    4.70   8.80   1.50
a12    5.22   9.78   1.67
a13    4.49   8.41   1.43
a14    1.25   2.35   0.40

The value of the $\chi^2$ statistic is $\chi^2 = 231.06$, and the number of degrees of freedom is $q = 4 \cdot 3 \cdot 3 - (4 + 3 + 3) + 2 = 36 - 10 + 2 = 28$. For a significance level of $\alpha = 0.01$, the critical value of the chi-squared distribution is $z = 48.28$. The observed value $\chi^2 = 231.06$ is much greater than $z$, and is thus extremely unlikely to happen under the null hypothesis. We conclude that the three attributes are not 3-way independent; rather, there is some dependence between them.
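A sketch, assuming numpy, that recovers the expected counts and degrees of freedom for this example from the marginal counts alone:

```python
import numpy as np

n = 150
n1 = np.array([45, 50, 43, 12])  # marginal counts for X1 (sepal length)
n2 = np.array([47, 88, 15])      # marginal counts for X2 (sepal width)
n3 = np.array([50, 50, 50])      # marginal counts for X3 (Iris type)

# Expected counts under 3-way independence: e_ijk = n1_i * n2_j * n3_k / n^(d-1)
expected = np.einsum('i,j,k->ijk', n1, n2, n3) / n**2

print(np.round(expected[:, :, 0], 2))  # one X3 slice; all slices equal since n3_k = 50

# Degrees of freedom: prod(m_i) - sum(m_i) + (d - 1)
m = np.array([4, 3, 3])
print(m.prod() - m.sum() + (len(m) - 1))  # 28
```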

SLIDE 22

Distance and Angle

With the modeling of categorical attributes as multivariate Bernoulli variables, it is possible to compute the distance or the angle between any two points $\mathbf{x}_i$ and $\mathbf{x}_j$:

$$\mathbf{x}_i = \begin{pmatrix} \mathbf{e}_{1i_1} \\ \vdots \\ \mathbf{e}_{di_d} \end{pmatrix} \qquad \mathbf{x}_j = \begin{pmatrix} \mathbf{e}_{1j_1} \\ \vdots \\ \mathbf{e}_{dj_d} \end{pmatrix}$$

The different measures of distance and similarity rely on the number of matching and mismatching values (or symbols) across the $d$ attributes $\mathbf{X}_k$. The number of matching values $s$ is given as

$$s = \mathbf{x}_i^T \mathbf{x}_j = \sum_{k=1}^{d} (\mathbf{e}_{ki_k})^T \mathbf{e}_{kj_k}$$

The number of mismatches is simply $d - s$. Also useful is the norm of each point: $\|\mathbf{x}_i\|^2 = \mathbf{x}_i^T \mathbf{x}_i = d$.

SLIDE 23

Distance and Angle

The Euclidean distance between $\mathbf{x}_i$ and $\mathbf{x}_j$ is given as

$$\delta(\mathbf{x}_i, \mathbf{x}_j) = \|\mathbf{x}_i - \mathbf{x}_j\| = \sqrt{\mathbf{x}_i^T\mathbf{x}_i - 2\mathbf{x}_i^T\mathbf{x}_j + \mathbf{x}_j^T\mathbf{x}_j} = \sqrt{2(d - s)}$$

The Hamming distance is given as

$$\delta_H(\mathbf{x}_i, \mathbf{x}_j) = d - s$$

Cosine similarity: the cosine of the angle is given as

$$\cos\theta = \frac{\mathbf{x}_i^T \mathbf{x}_j}{\|\mathbf{x}_i\| \cdot \|\mathbf{x}_j\|} = \frac{s}{d}$$

The Jaccard coefficient is given as

$$J(\mathbf{x}_i, \mathbf{x}_j) = \frac{s}{2(d - s) + s} = \frac{s}{2d - s}$$
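A sketch, assuming numpy, of all four measures; the two points and their symbols are hypothetical:

```python
import numpy as np

# Two hypothetical points over d = 3 categorical attributes
xi = ("Short", "Medium", "Red")
xj = ("Short", "Long",   "Red")

d = len(xi)
s = sum(a == b for a, b in zip(xi, xj))  # matching symbols: s = x_i^T x_j

print(np.sqrt(2 * (d - s)))  # Euclidean distance sqrt(2(d - s)) ~= 1.414
print(d - s)                 # Hamming distance: 1
print(s / d)                 # cosine similarity: 0.667
print(s / (2 * d - s))       # Jaccard coefficient: 0.5
```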

SLIDE 24

Discretization

Discretization, also called binning, converts numeric attributes into categorical ones.

Equal-width intervals: partition the range of $X$ into $k$ equal-width intervals. The interval width is simply the range of $X$ divided by $k$:

$$w = \frac{x_{max} - x_{min}}{k}$$

Thus, the $i$th interval boundary is given as $v_i = x_{min} + iw$, for $i = 1, \ldots, k - 1$.

Equal-frequency intervals: we divide the range of $X$ into intervals that contain (approximately) an equal number of points. The intervals are computed from the empirical quantile or inverse cumulative distribution function

$$\hat{F}^{-1}(q) = \min\{x \mid P(X \le x) \ge q\}$$

We require that each interval contain $1/k$ of the probability mass; the interval boundaries are therefore given as $v_i = \hat{F}^{-1}(i/k)$ for $i = 1, \ldots, k - 1$.
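A sketch of both schemes, assuming numpy; the data vector is synthetic rather than the Iris sepal length values:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(6.0, 0.8, 150)  # synthetic numeric attribute
k = 4

# Equal-width boundaries: v_i = xmin + i * w
w = (x.max() - x.min()) / k
width_bounds = x.min() + w * np.arange(1, k)

# Equal-frequency boundaries: v_i = F^{-1}(i/k), the empirical quantiles
freq_bounds = np.quantile(x, np.arange(1, k) / k)

# Bin membership and occupancy under the equal-frequency scheme
labels = np.searchsorted(freq_bounds, x)
print(width_bounds)
print(freq_bounds)
print(np.bincount(labels))  # roughly n/k points per bin
```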

SLIDE 25

Equal-Frequency Discretization: sepal length (4 bins)

[Figure: empirical inverse CDF $\hat{F}^{-1}(q)$ for sepal length, over the range [4.3, 7.9]]

Quartile values: $\hat{F}^{-1}(0.25) = 5.1$, $\hat{F}^{-1}(0.5) = 5.8$, $\hat{F}^{-1}(0.75) = 6.4$.

Bin         Width  Count
[4.3, 5.1]  0.8    n1 = 41
(5.1, 5.8]  0.7    n2 = 39
(5.8, 6.4]  0.6    n3 = 35
(6.4, 7.9]  1.5    n4 = 35

SLIDE 26

Data Mining and Machine Learning: Fundamental Concepts and Algorithms

dataminingbook.info

Mohammed J. Zaki¹  Wagner Meira Jr.²

¹ Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA

² Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 3: Categorical Attributes
