
Data Mining and Machine Learning: Fundamental Concepts and Algorithms (Chapter 3: Categorical Attributes) - Presentation Slides



1. Data Mining and Machine Learning: Fundamental Concepts and Algorithms
dataminingbook.info
Mohammed J. Zaki (1), Wagner Meira Jr. (2)
(1) Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
(2) Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Chapter 3: Categorical Attributes

2. Univariate Analysis: Bernoulli Variable

Consider a single categorical attribute $X$ with domain $dom(X) = \{a_1, a_2, \ldots, a_m\}$ comprising $m$ symbolic values. The data $\mathbf{D}$ is an $n \times 1$ symbolic data matrix

$$\mathbf{D} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$

where each point $x_i \in dom(X)$.

Bernoulli variable: the special case when $m = 2$,

$$X(v) = \begin{cases} 1 & \text{if } v = a_1 \\ 0 & \text{if } v = a_2 \end{cases}$$

i.e., $dom(X) = \{0, 1\}$.

3. Bernoulli Variable: Mean and Variance

Assume that each symbolic point has been mapped to its binary value. The set $\{x_1, x_2, \ldots, x_n\}$ is a random sample drawn from $X$.

The probability mass function (PMF) of $X$ is given as

$$P(X = x) = f(x) = p^x (1 - p)^{1 - x}$$

The expected value of $X$ is given as

$$\mu = E[X] = 1 \cdot p + 0 \cdot (1 - p) = p$$

and the variance of $X$ is given as

$$\sigma^2 = \mathrm{var}(X) = p(1 - p)$$

The sample mean is given as

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{n_1}{n} = \hat{p}$$

where $n_1$ is the number of points with $x_j = 1$ in the random sample (equal to the number of occurrences of the symbol $a_1$). The sample variance is given as

$$\hat{\sigma}^2 = \hat{p}(1 - \hat{p})$$
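A minimal sketch of these estimates in Python; the sample values and symbol names below are made up for illustration and are not part of the slides:

```python
import numpy as np

# Hypothetical two-symbol sample; a1 maps to 1 and a2 maps to 0, as on the slide.
sample = ["a1", "a2", "a1", "a1", "a2"]
x = np.array([1 if v == "a1" else 0 for v in sample])

n = len(x)
n1 = x.sum()                   # number of occurrences of a1
p_hat = n1 / n                 # sample mean = n1 / n
var_hat = p_hat * (1 - p_hat)  # sample variance of the Bernoulli variable

print(p_hat, var_hat)          # 0.6 0.24 (up to float rounding)
```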

4. Binomial Distribution: Number of Occurrences

Given the Bernoulli variable $X$, let $\{x_1, x_2, \ldots, x_n\}$ be a random sample of size $n$. Let $N$ be the random variable denoting the number of occurrences of the symbol $a_1$ (value $X = 1$). $N$ has a binomial distribution, given as

$$f(N = n_1 \mid n, p) = \binom{n}{n_1} p^{n_1} (1 - p)^{n - n_1}$$

$N$ is the sum of the $n$ independent Bernoulli random variables $x_i$ IID with $X$, that is, $N = \sum_{i=1}^{n} x_i$. The mean or expected number of occurrences of $a_1$ is

$$\mu_N = E[N] = E\left[\sum_{i=1}^{n} x_i\right] = \sum_{i=1}^{n} E[x_i] = \sum_{i=1}^{n} p = np$$

The variance of $N$ is

$$\sigma^2_N = \mathrm{var}(N) = \sum_{i=1}^{n} \mathrm{var}(x_i) = \sum_{i=1}^{n} p(1 - p) = np(1 - p)$$
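As an illustration (not from the slides), the binomial PMF and its moments can be computed directly; the helper name binomial_pmf and the values of n and p are hypothetical:

```python
from math import comb

def binomial_pmf(n1, n, p):
    """P(N = n1): probability of n1 occurrences of a1 in n trials with P(X = 1) = p."""
    return comb(n, n1) * p**n1 * (1 - p)**(n - n1)

n, p = 10, 0.3
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# The mean and variance computed from this PMF match np and np(1 - p).
mean_N = sum(k * pmf[k] for k in range(n + 1))                 # ~ 3.0 = np
var_N = sum((k - mean_N) ** 2 * pmf[k] for k in range(n + 1))  # ~ 2.1 = np(1 - p)
print(mean_N, var_N)
```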

5. Multivariate Bernoulli Variable

For the general case when $dom(X) = \{a_1, a_2, \ldots, a_m\}$, we model $X$ as an $m$-dimensional or multivariate Bernoulli random variable $\mathbf{X} = (A_1, A_2, \ldots, A_m)^T$, where each $A_i$ is a Bernoulli variable with parameter $p_i$ denoting the probability of observing symbol $a_i$. However, $\mathbf{X}$ can assume only one of the symbolic values at any one time. Thus,

$$\mathbf{X}(v) = \mathbf{e}_i \quad \text{if } v = a_i$$

where $\mathbf{e}_i$ is the $i$-th standard basis vector in $m$ dimensions. The range of $\mathbf{X}$ consists of $m$ distinct vector values $\{\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_m\}$. The PMF of $\mathbf{X}$ is

$$P(\mathbf{X} = \mathbf{e}_i) = f(\mathbf{e}_i) = p_i = \prod_{j=1}^{m} p_j^{e_{ij}}$$

with $\sum_{i=1}^{m} p_i = 1$.
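A small sketch of this mapping (one-hot encoding), assuming an illustrative three-symbol domain; the function name X_of and the symbol names are hypothetical:

```python
import numpy as np

symbols = ["a1", "a2", "a3"]   # dom(X); the symbol names here are illustrative

def X_of(v):
    """Map a symbolic value v to the standard basis vector e_i."""
    e = np.zeros(len(symbols))
    e[symbols.index(v)] = 1.0
    return e

print(X_of("a2"))   # [0. 1. 0.]
```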

6. Multivariate Bernoulli: Mean

The mean or expected value of $\mathbf{X}$ can be obtained as

$$\boldsymbol{\mu} = E[\mathbf{X}] = \sum_{i=1}^{m} \mathbf{e}_i f(\mathbf{e}_i) = \sum_{i=1}^{m} \mathbf{e}_i p_i = \begin{pmatrix}1\\0\\\vdots\\0\end{pmatrix} p_1 + \cdots + \begin{pmatrix}0\\0\\\vdots\\1\end{pmatrix} p_m = \begin{pmatrix}p_1\\p_2\\\vdots\\p_m\end{pmatrix} = \mathbf{p}$$

The sample mean is

$$\hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i = \sum_{i=1}^{m} \frac{n_i}{n} \mathbf{e}_i = \begin{pmatrix}n_1/n\\n_2/n\\\vdots\\n_m/n\end{pmatrix} = \begin{pmatrix}\hat{p}_1\\\hat{p}_2\\\vdots\\\hat{p}_m\end{pmatrix} = \hat{\mathbf{p}}$$

where $n_i$ is the number of occurrences of the vector value $\mathbf{e}_i$ in the sample, i.e., the number of occurrences of the symbol $a_i$. Furthermore, $\sum_{i=1}^{m} n_i = n$.
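A short sketch of the sample mean estimate, assuming a hypothetical one-hot encoded sample (the domain and data below are illustrative):

```python
import numpy as np

symbols = ["a1", "a2", "a3"]                   # illustrative domain
sample = ["a2", "a1", "a2", "a3", "a2", "a1"]  # hypothetical symbolic sample

# One-hot encode the sample: each row is the vector e_i for the observed symbol.
X = np.zeros((len(sample), len(symbols)))
X[np.arange(len(sample)), [symbols.index(v) for v in sample]] = 1.0

p_hat = X.mean(axis=0)   # sample mean = (n1/n, n2/n, n3/n)
print(p_hat)             # [0.333... 0.5 0.166...]
```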

7. Multivariate Bernoulli Variable: Probability Mass Function (sepal length)

Bins          Domain             Counts
[4.3, 5.2]    Very Short (a_1)   n_1 = 45
(5.2, 6.1]    Short (a_2)        n_2 = 50
(6.1, 7.0]    Long (a_3)         n_3 = 43
(7.0, 7.9]    Very Long (a_4)    n_4 = 12

We model sepal length as a multivariate Bernoulli variable $\mathbf{X}$:

$$\mathbf{X}(v) = \begin{cases} \mathbf{e}_1 = (1,0,0,0)^T & \text{if } v = a_1 \\ \mathbf{e}_2 = (0,1,0,0)^T & \text{if } v = a_2 \\ \mathbf{e}_3 = (0,0,1,0)^T & \text{if } v = a_3 \\ \mathbf{e}_4 = (0,0,0,1)^T & \text{if } v = a_4 \end{cases}$$

For example, the symbolic point $x_1 = \text{Short} = a_2$ is represented as the vector $(0,1,0,0)^T = \mathbf{e}_2$.

The total sample size is $n = 150$; the estimates $\hat{p}_i$ are: $\hat{p}_1 = 45/150 = 0.3$, $\hat{p}_2 = 50/150 = 0.333$, $\hat{p}_3 = 43/150 = 0.287$, $\hat{p}_4 = 12/150 = 0.08$.

[Bar plot of the PMF $f(\mathbf{x})$ over $\mathbf{e}_1$ (Very Short), $\mathbf{e}_2$ (Short), $\mathbf{e}_3$ (Long), $\mathbf{e}_4$ (Very Long), with heights 0.3, 0.333, 0.287, 0.08.]
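A hedged sketch of how these estimates might be reproduced, assuming the Iris data is available through sklearn.datasets (the slides do not name a data source, so this is an assumption); the exact counts depend on the data copy used:

```python
import numpy as np
from sklearn.datasets import load_iris  # assumption: using sklearn's copy of the Iris data

sepal_length = load_iris().data[:, 0]   # first column of the Iris data is sepal length
edges = [5.2, 6.1, 7.0]                 # interior bin boundaries from the table above
labels = ["Very Short", "Short", "Long", "Very Long"]

bins = np.digitize(sepal_length, edges, right=True)  # bin index 0..3, matching a1..a4
counts = np.bincount(bins, minlength=4)
p_hat = counts / counts.sum()

for label, n_i, p_i in zip(labels, counts, p_hat):
    print(label, n_i, round(p_i, 3))    # should reproduce the counts 45, 50, 43, 12
```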

8. Multivariate Bernoulli Variable: Covariance Matrix

We have $\mathbf{X} = (A_1, A_2, \ldots, A_m)^T$, where $A_i$ is the Bernoulli variable corresponding to symbol $a_i$. The variance of each Bernoulli variable $A_i$ is

$$\sigma_i^2 = \mathrm{var}(A_i) = p_i(1 - p_i)$$

The covariance between $A_i$ and $A_j$ is

$$\sigma_{ij} = E[A_i A_j] - E[A_i] \cdot E[A_j] = 0 - p_i p_j = -p_i p_j$$

a negative relationship, since $A_i$ and $A_j$ cannot both be 1 at the same time. The covariance matrix for $\mathbf{X}$ is given as

$$\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2 \end{pmatrix} = \begin{pmatrix} p_1(1-p_1) & -p_1 p_2 & \cdots & -p_1 p_m \\ -p_1 p_2 & p_2(1-p_2) & \cdots & -p_2 p_m \\ \vdots & \vdots & \ddots & \vdots \\ -p_1 p_m & -p_2 p_m & \cdots & p_m(1-p_m) \end{pmatrix}$$

More compactly, $\boldsymbol{\Sigma} = \mathrm{diag}(\mathbf{p}) - \mathbf{p} \cdot \mathbf{p}^T$, where $\boldsymbol{\mu} = \mathbf{p} = (p_1, \ldots, p_m)^T$.
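A minimal numpy sketch of the compact form diag(p) - p p^T, using the estimates from the sepal length example:

```python
import numpy as np

p = np.array([0.3, 0.333, 0.287, 0.08])  # the estimates p_hat from the sepal length slide
Sigma = np.diag(p) - np.outer(p, p)      # Sigma = diag(p) - p p^T

# Diagonal entries equal p_i (1 - p_i); off-diagonal entries equal -p_i p_j.
print(np.allclose(np.diag(Sigma), p * (1 - p)))   # True
print(Sigma[0, 1], -p[0] * p[1])                  # both -0.0999
```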

9. Categorical, Mapped Binary and Centered Dataset

Modeling $X$ as a multivariate Bernoulli variable is equivalent to treating $\mathbf{X}(x_i)$ as a new $n \times m$ binary data matrix:

       X       A_1  A_2      Z_1   Z_2
x_1    Short   0    1       -0.4   0.4
x_2    Short   0    1       -0.4   0.4
x_3    Long    1    0        0.6  -0.6
x_4    Short   0    1       -0.4   0.4
x_5    Long    1    0        0.6  -0.6

$\mathbf{X}$ is the multivariate Bernoulli variable

$$\mathbf{X}(v) = \begin{cases} \mathbf{e}_1 = (1,0)^T & \text{if } v = \text{Long } (a_1) \\ \mathbf{e}_2 = (0,1)^T & \text{if } v = \text{Short } (a_2) \end{cases}$$

The sample mean and covariance matrix are

$$\hat{\boldsymbol{\mu}} = \hat{\mathbf{p}} = (2/5, 3/5)^T = (0.4, 0.6)^T \qquad \hat{\boldsymbol{\Sigma}} = \mathrm{diag}(\hat{\mathbf{p}}) - \hat{\mathbf{p}}\hat{\mathbf{p}}^T = \begin{pmatrix} 0.24 & -0.24 \\ -0.24 & 0.24 \end{pmatrix}$$

From the centered data $\mathbf{Z} = (Z_1, Z_2)^T$ we have

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{5} \mathbf{Z}^T \mathbf{Z} = \begin{pmatrix} 0.24 & -0.24 \\ -0.24 & 0.24 \end{pmatrix}$$
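A small sketch verifying, for this five-point example, that the centered-data form (1/n) Z^T Z matches diag(p_hat) - p_hat p_hat^T:

```python
import numpy as np

# Binary data matrix for the five-point example; columns are A1 (Long) and A2 (Short).
X = np.array([[0, 1],
              [0, 1],
              [1, 0],
              [0, 1],
              [1, 0]], dtype=float)

p_hat = X.mean(axis=0)                           # (0.4, 0.6)
Z = X - p_hat                                    # centered data matrix
Sigma_from_Z = Z.T @ Z / len(X)                  # (1/n) Z^T Z
Sigma_formula = np.diag(p_hat) - np.outer(p_hat, p_hat)

print(np.allclose(Sigma_from_Z, Sigma_formula))  # True: both give [[0.24, -0.24], [-0.24, 0.24]]
```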

10. Multinomial Distribution: Number of Occurrences

Let $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$ be a random sample from $\mathbf{X}$. Let $N_i$ be the random variable denoting the number of occurrences of symbol $a_i$ in the sample, and let $\mathbf{N} = (N_1, N_2, \ldots, N_m)^T$. $\mathbf{N}$ has a multinomial distribution, given as

$$f\big(\mathbf{N} = (n_1, n_2, \ldots, n_m) \mid \mathbf{p}\big) = \binom{n}{n_1 \, n_2 \, \ldots \, n_m} \prod_{i=1}^{m} p_i^{n_i}$$

The mean and covariance matrix of $\mathbf{N}$ are:

$$\boldsymbol{\mu}_{\mathbf{N}} = E[\mathbf{N}] = n E[\mathbf{X}] = n \boldsymbol{\mu} = n \mathbf{p} = (np_1, \ldots, np_m)^T$$

$$\boldsymbol{\Sigma}_{\mathbf{N}} = n \cdot \big(\mathrm{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T\big) = \begin{pmatrix} np_1(1-p_1) & -np_1 p_2 & \cdots & -np_1 p_m \\ -np_1 p_2 & np_2(1-p_2) & \cdots & -np_2 p_m \\ \vdots & \vdots & \ddots & \vdots \\ -np_1 p_m & -np_2 p_m & \cdots & np_m(1-p_m) \end{pmatrix}$$

The sample mean and covariance matrix for $\mathbf{N}$ are

$$\hat{\boldsymbol{\mu}}_{\mathbf{N}} = n \hat{\mathbf{p}} \qquad \hat{\boldsymbol{\Sigma}}_{\mathbf{N}} = n\big(\mathrm{diag}(\hat{\mathbf{p}}) - \hat{\mathbf{p}}\hat{\mathbf{p}}^T\big)$$
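An illustrative check (not from the slides) that the multinomial moments match n p and n (diag(p) - p p^T), using numpy's multinomial sampler and the sepal length estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 150
p = np.array([0.3, 0.333, 0.287, 0.08])
p = p / p.sum()                           # guard against floating-point round-off

# Theoretical mean and covariance of N.
mu_N = n * p
Sigma_N = n * (np.diag(p) - np.outer(p, p))

# Empirical check by simulation: draw many multinomial samples and compare moments.
draws = rng.multinomial(n, p, size=200_000)
print(np.allclose(draws.mean(axis=0), mu_N, atol=0.5))             # typically True
print(np.allclose(np.cov(draws.T, bias=True), Sigma_N, atol=0.5))  # typically True
```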

11. Bivariate Analysis

Assume the data comprises two categorical attributes, $X_1$ and $X_2$, with

$$dom(X_1) = \{a_{11}, a_{12}, \ldots, a_{1m_1}\} \qquad dom(X_2) = \{a_{21}, a_{22}, \ldots, a_{2m_2}\}$$

We model $X_1$ and $X_2$ as multivariate Bernoulli variables $\mathbf{X}_1$ and $\mathbf{X}_2$ with dimensions $m_1$ and $m_2$, respectively. The joint distribution of $\mathbf{X}_1$ and $\mathbf{X}_2$ is modeled as the $(m_1 + m_2)$-dimensional vector variable $\mathbf{X} = \begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix}$, with

$$\mathbf{X}\big((v_1, v_2)^T\big) = \begin{pmatrix} \mathbf{X}_1(v_1) \\ \mathbf{X}_2(v_2) \end{pmatrix} = \begin{pmatrix} \mathbf{e}_{1i} \\ \mathbf{e}_{2j} \end{pmatrix}$$

provided that $v_1 = a_{1i}$ and $v_2 = a_{2j}$. The joint PMF for $\mathbf{X}$ is given as the $m_1 \times m_2$ matrix

$$\mathbf{P}_{12} = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1m_2} \\ p_{21} & p_{22} & \cdots & p_{2m_2} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m_1 1} & p_{m_1 2} & \cdots & p_{m_1 m_2} \end{pmatrix}$$
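A small sketch of estimating the joint PMF matrix P_12 as relative frequencies from paired observations; the symbols and data below are hypothetical:

```python
import numpy as np

dom1 = ["a11", "a12"]             # dom(X1); symbol names are illustrative
dom2 = ["a21", "a22", "a23"]      # dom(X2); symbol names are illustrative

# Hypothetical paired observations (x1, x2).
pairs = [("a11", "a21"), ("a11", "a23"), ("a12", "a22"),
         ("a12", "a21"), ("a11", "a21")]

# Empirical joint PMF: an m1 x m2 matrix of relative frequencies.
P12 = np.zeros((len(dom1), len(dom2)))
for v1, v2 in pairs:
    P12[dom1.index(v1), dom2.index(v2)] += 1
P12 /= P12.sum()
print(P12)   # rows index dom(X1), columns index dom(X2)
```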
