

SLIDE 1

Grouping categorical variables
Grouping categories of nominal variables

Ricco RAKOTOMALALA
Université Lumière Lyon 2

Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/

SLIDE 2

Outline

1. Clustering of categorical variables. Why?
   a. HAC from a dissimilarity matrix
   b. Deficiency of the clustering of categorical variables
2. Clustering categories of nominal variables
   a. Distance between categories – Dice's coefficient
   b. HAC on the categories
   c. Interpretation of the obtained clusters
3. Other approaches for the clustering of categories
4. Conclusion
5. References

SLIDE 3

Why? For what purpose?

SLIDE 4

Clustering of variables

Goal: grouping related variables.
• Variables in the same group are highly associated with each other.
• Variables in different groups are weakly related (in the sense of an association measure).

With what objective?
1. Identify the underlying structure of the dataset and summarize the relevant information (the approach is complementary to the clustering of individuals).
2. Detect redundancies, for instance in order to select the variables intended for a subsequent analysis (e.g. a supervised learning task):
   a. in a pretreatment phase, to organize the search space;
   b. in a post-treatment phase, to understand the role of the variables removed during the selection process.

SLIDE 5

An example: Vote dataset (1984)

Variable     Categories             Role
affiliation  democrat, republican   illustrative
budget       yes, no, neither       active
physician    yes, no, neither       active
salvador     yes, no, neither       active
nicaraguan   yes, no, neither       active
missile      yes, no, neither       active
education    yes, no, neither       active

n = 435 individuals (US Congressmen); p = 6 active variables.

Political affiliation: the illustrative variable, i.e. used for understanding the nature of the groups. Votes on each subject: the active variables, with 3 categories each: yes (yea), no (nay), neither (neither "yea" nor "nay").

Goals: identify the votes which are highly related, and establish their association with the political affiliation. Note that a "yea" vote on one subject may be highly related to a "nay" vote on another subject.
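The R code in the following slides assumes a data frame vote.data holding the affiliation column plus the six vote factors, and vote.active holding the active variables only. A minimal loading sketch; the file name and format are assumptions, not from the original deck:

# hypothetical loading step: the file name and format are assumptions
vote.data <- read.table("vote.txt", header = TRUE, stringsAsFactors = TRUE)
# active variables = all columns except the illustrative 'affiliation'
vote.active <- vote.data[, setdiff(colnames(vote.data), "affiliation")]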

SLIDE 6

Using Cramér's v to measure the association between nominal variables

SLIDE 7

Measure of association between 2 nominal variables

Notation: cross-tabulation of two nominal variables A (K categories) and B (L categories); n_{kl} is the observed count in cell (k, l), with margins n_{k.} and n_{.l} and grand total n.

Pearson's chi-squared statistic:

\chi^2 = \sum_{k=1}^{K} \sum_{l=1}^{L} \frac{(n_{kl} - e_{kl})^2}{e_{kl}}

n_{kl} is the observed count (proportional to P(A ∩ B)); under the independence assumption (proportional to P(A) × P(B)), the expected count is:

e_{kl} = \frac{n_{k.} \times n_{.l}}{n}

Cramér's v:

v = \sqrt{\frac{\chi^2}{n \times \min(K - 1, L - 1)}}

• Symmetrical
• 0 ≤ v ≤ 1

Count of budget (rows) by physician (columns):

budget \ physician     n   neither     y   Total
n                     25         0   146     171
neither                3         6     2      11
y                    219         5    29     253
Total                247        11   177     435

Example: \chi^2 = 355.48, p-value < 0.0001, v = 0.639.

High association, significant at the 5% level.
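As a sanity check on the arithmetic: here K = L = 3, so min(K - 1, L - 1) = 2, and v = sqrt(355.48 / (435 × 2)) = sqrt(0.4086) ≈ 0.639.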

SLIDE 8

Similarity matrix – Dissimilarity matrix

Similarity matrix (Cramér's v):

            budget  physician  salvador  nicaraguan  missile  education
budget           1      0.639     0.507       0.517    0.439      0.475
physician    0.639          1     0.576       0.518    0.471      0.509
salvador     0.507      0.576         1       0.611    0.558      0.470
nicaraguan   0.517      0.518     0.611           1    0.545      0.469
missile      0.439      0.471     0.558       0.545        1      0.427
education    0.475      0.509     0.470       0.469    0.427          1

Dissimilarity matrix (1 - v):

            budget  physician  salvador  nicaraguan  missile  education
budget           0      0.361     0.493       0.483    0.561      0.525
physician    0.361          0     0.424       0.482    0.529      0.491
salvador     0.493      0.424         0       0.389    0.442      0.530
nicaraguan   0.483      0.482     0.389           0    0.455      0.531
missile      0.561      0.529     0.442       0.455        0      0.573
education    0.525      0.491     0.530       0.531    0.573          0

We can use this matrix as input for the HAC algorithm

# function for calculating Cramér's v
cramer <- function(y, x){
  K <- nlevels(y)
  L <- nlevels(x)
  n <- length(y)
  chi2 <- chisq.test(y, x, correct = FALSE)
  print(chi2$statistic)
  v <- sqrt(chi2$statistic / (n * min(K - 1, L - 1)))
  return(v)
}

SLIDE 9

hclust() under R – Distance = (1 – v), Ward’s method

# similarity matrix
sim <- matrix(1, nrow = ncol(vote.active), ncol = ncol(vote.active))
rownames(sim) <- colnames(vote.active)
colnames(sim) <- colnames(vote.active)
for (i in 1:(nrow(sim) - 1)){
  for (j in (i + 1):ncol(sim)){
    y <- vote.active[, i]
    x <- vote.active[, j]
    sim[i, j] <- cramer(y, x)
    sim[j, i] <- sim[i, j]
  }
}
# distance matrix
dissim <- as.dist(1 - sim)
# clustering
tree <- hclust(dissim, method = "ward.D")
plot(tree)

Cluster dendrogram (hclust(*, "ward.D") on dissim; height axis from 0.35 to 0.65). Leaf order: missile, salvador, nicaraguan | education, budget, physician.

We get a view of the association structure between the variables: e.g. "budget" and "physician" are related, i.e. there is a strong coherence of the votes (v = 0.639); "budget" and "salvador" are less related (v = 0.507); etc. But we do not know on which associations of votes (yes or no) these relationships are based...

Two groups: G1 = {missile, salvador, nicaraguan}, G2 = {education, budget, physician}.

SLIDE 10

Other approaches for clustering categorical variables

ClustOfVar (Chavent et al., 2012)

"Centroid" (representative variable) of a group of variables = a latent variable, i.e. the group is scored as a single variable.

Homogeneity of a group of p variables:

H = \sum_{j=1}^{p} \eta^2(X_j, F)

where F is the first factor from the MCA (multiple correspondence analysis) of the group, and \eta^2(\cdot) is the correlation ratio; it measures the variation within the group.

Various strategies for grouping are possible.

• HAC approach: minimizing the loss of variation at each merging step.
• K-Means approach: assigning the variables to the closest "centroid" (in the sense of the correlation ratio) during the learning process.

1. "ClustOfVar" can handle datasets with mixed numeric and categorical variables; the centroid is then defined by the first component of the factor analysis for mixed data.
2. This is a generalization of the CLV approach (Vigneau and Qannari, 2003), which handles numeric variables only and is based on PCA (principal component analysis).

SLIDE 11

ClustOfVar on the « vote » dataset

library(ClustOfVar)
# hierarchical clustering of the variables
arbre <- hclustvar(X.quali = vote.active)
plot(arbre)
# k-means-like clustering into 2 groups
mgroups <- kmeansvar(X.quali = vote.active, init = 2, nstart = 10)
print(summary(mgroups))

Cluster dendrogram (hclustvar; height axis from 0.15 to 0.45). Leaf order: missile, salvador, nicaraguan | education, budget, physician.

G1, G2: we obtain the same partition as with the HAC on the (1 - v) dissimilarity matrix.

SLIDE 12

The clustering of categorical variables gives a partial vision of the structure of the relationships among variables...

SLIDE 13

Interpreting a cluster – Ex. G2

Main associations between the categories:
• Budget = y ↔ Physician = n ↔ Education = n
• Budget = n ↔ Physician = y ↔ Education = y

Count of budget (rows) by physician (columns), v = 0.639:

budget \ physician     n   neither     y   Total
n                     25         0   146     171
neither                3         6     2      11
y                    219         5    29     253
Total                247        11   177     435

Count of budget (rows) by education (columns), v = 0.475:

budget \ education     n   neither     y   Total
n                     28        10   133     171
neither                4         4     3      11
y                    201        17    35     253
Total                233        31   171     435

Count of physician (rows) by education (columns), v = 0.509:

physician \ education  n   neither     y   Total
n                    202        16    29     247
neither                6         4     1      11
y                     25        11   141     177
Total                233        31   171     435

This kind of analysis cannot be done manually.
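In R, though, the cross-tabulations for the variables of a cluster can be generated in a few lines; a minimal sketch, assuming vote.active as defined earlier:

# cross-tabulations for the pairs of variables in cluster G2
print(table(vote.active$budget, vote.active$physician))
print(table(vote.active$budget, vote.active$education))
print(table(vote.active$physician, vote.active$education))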

SLIDE 14

Analyzing the illustrative variables

The illustrative variables are used to strengthen the interpretation of the results.

# 2 subgroups
groups <- cutree(tree, k = 2)
print(groups)
# Cramér's v: affiliation vs. the active attributes
cv <- sapply(vote.active, cramer, x = vote.data$affiliation)
print(cv)
# mean of v for each group
m <- tapply(X = cv, INDEX = groups, FUN = mean)
print(m)

Variable      Affiliation (Cramér's v)   Mean (v) per group
nicaraguan    0.660
missile       0.629                      G1: 0.667
salvador      0.712
budget        0.740
physician     0.914                      G2: 0.781
education     0.688

• The political affiliation has a little more influence on the votes in G2 than in G1 (why? are the subjects more sensitive in G2?).
• But we still do not know what the votes of the democrats (or of the republicans) are.
SLIDE 15

Identifying the nature of the association between the categorical variables

SLIDE 16

Distance between categories – Dice’s coefficient

Dice's coefficient: half the sum of the squared differences between the 0/1 dummy codings of two categories, i.e. (up to the factor 1/2) the squared Euclidean distance between their indicator variables:

d^2(j, j') = \frac{1}{2} \sum_{i=1}^{n} (m_{ij} - m_{ij'})^2

where i denotes the individual n°i, j the jth category, and m_{ij} ∈ {0, 1} the indicator of the jth category for individual i.

Transforming the initial data table into a table of indicator variables:

# dummy coding
library(ade4)
disj <- acm.disjonctif(vote.active)
print(head(vote.active))
print(head(disj))

Simple coding scheme

SLIDE 17

Distance matrix between the 18 categories (values are the square roots of Dice's indices; the diagonal is omitted, shown as "."). Column order matches row order; abbreviations: bud = budget, phy = physician, sal = salvador, nic = nicaraguan, mis = missile, edu = education; .n = no, .nt = neither, .y = yes.

        bud.n bud.nt bud.y phy.n phy.nt phy.y sal.n sal.nt sal.y nic.n nic.nt nic.y mis.n mis.nt mis.y edu.n edu.nt edu.y
bud.n       .   9.54 14.56 13.56   9.54  5.29 13.17   9.54  6.20  5.87   9.22 13.55  6.52   9.62 12.96 13.19   9.54  6.16
bud.nt   9.54      . 11.49 11.22   2.24  9.59 10.27   3.00 10.42  9.51   3.16 11.07 10.27   3.81 10.15 10.86   4.12  9.38
bud.y   14.56  11.49     .  5.57  11.27 13.64  6.52  11.18 13.29 13.47  11.40  5.70 13.13  11.02  7.07  6.48  11.18 13.30
phy.n   13.56  11.22  5.57     .  11.36 14.56  5.70  11.00 13.69 13.40  11.31  5.79 13.36  10.75  6.86  6.16  11.09 13.42
phy.nt   9.54   2.24 11.27 11.36      .  9.70 10.22   3.00 10.46  9.62   3.16 10.98 10.22   3.81 10.20 10.77   4.12  9.49
phy.y    5.29   9.59 13.64 14.56   9.70     . 13.58   9.75  5.15  5.87   9.33 13.58  6.12   9.92 13.04 13.42   9.64  5.74
sal.n   13.17  10.27  6.52  5.70  10.22 13.58     .  10.56 14.49 13.82  10.46  4.58 13.82  10.10  5.34  6.60  10.37 13.06
sal.nt   9.54   3.00 11.18 11.00   3.00  9.75 10.56      . 10.65  9.62   3.32 11.02 10.22   4.06 10.20 10.82   4.24  9.49
sal.y    6.20  10.42 13.29 13.69  10.46  5.15 14.49  10.65     .  4.80  10.22 14.00  5.00  10.49 13.73 13.17  10.37  6.52
nic.n    5.87   9.51 13.47 13.40   9.62  5.87 13.82   9.62  4.80     .   9.82 14.49  5.48   9.80 13.44 13.02   9.72  6.52
nic.nt   9.22   3.16 11.40 11.31   3.16  9.33 10.46   3.32 10.22  9.82      . 11.34 10.12   3.81 10.39 11.00   4.12  9.33
nic.y   13.55  11.07  5.70  5.79  10.98 13.58  4.58  11.02 14.00 14.49  11.34     . 13.71  10.86  5.70  6.60  11.02 13.17
mis.n    6.52  10.27 13.13 13.36  10.22  6.12 13.82  10.22  5.00  5.48  10.12 13.71     .  10.68 14.37 12.98  10.22  6.89
mis.nt   9.62   3.81 11.02 10.75   3.81  9.92 10.10   4.06 10.49  9.80   3.81 10.86 10.68      . 10.70 10.79   4.64  9.51
mis.y   12.96  10.15  7.07  6.86  10.20 13.04  5.34  10.20 13.73 13.44  10.39  5.70 14.37  10.70     .  7.00  10.34 12.85
edu.n   13.19  10.86  6.48  6.16  10.77 13.42  6.60  10.82 13.17 13.02  11.00  6.60 12.98  10.79  7.00     .  11.49 14.21
edu.nt   9.54   4.12 11.18 11.09   4.12  9.64 10.37   4.24 10.37  9.72   4.12 11.02 10.22   4.64 10.34 11.49      . 10.05
edu.y    6.16   9.38 13.30 13.42   9.49  5.74 13.06   9.49  6.52  6.52   9.33 13.17  6.89   9.51 12.85 14.21  10.05     .

# Dice's index
dice <- function(m1, m2){
  return(0.5 * sum((m1 - m2)^2))
}
# Dice's index matrix
d2 <- matrix(0, ncol(disj), ncol(disj))
for (j in 1:ncol(disj)){
  for (jprim in 1:ncol(disj)){
    d2[j, jprim] <- dice(disj[, j], disj[, jprim])
  }
}
colnames(d2) <- colnames(disj)
rownames(d2) <- colnames(disj)
# transform the matrix into an R 'dist' object
d <- as.dist(sqrt(d2))

A low value indicates a high association between the categories (e.g. budget = n and physician = y, …). The distance is high for indicator variables coming from the same categorical variable.
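Individual entries can be inspected directly to check this; the expected values below are read from the table above (recall that the table shows the square roots of the d2 entries):

# low distance = high association between the categories
print(sqrt(d2["budget.n", "physician.y"]))   # 5.29: strongly associated votes
print(sqrt(d2["budget.n", "budget.y"]))      # 14.56: categories of the same variable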

SLIDE 18

HAC of the categories, based on Dice's index

# cluster analysis on the indicator variables
arbre.moda <- hclust(d, method = "ward.D2")
plot(arbre.moda)

Cluster dendrogram (hclust(*, "ward.D2") on d; height axis from 5 to 30). Leaf order: missile.y, salvador.n, nicaraguan.y, education.n, budget.y, physician.n | education.neither, missile.neither, nicaraguan.neither, salvador.neither, budget.neither, physician.neither | missile.n, salvador.y, nicaraguan.n, education.y, budget.n, physician.y.

Three groups are now highlighted. We distinguish clearly the relationships between the categories, i.e. which votes are related.

SLIDE 19

HAC of categories under Tanagra

http://tutoriels-data-mining.blogspot.fr/2013/12/classification-de-variables-qualitatives_21.html

Linkage criterion: « average linkage »

Dendrogram (height: aggregation distance), showing the association of the categories to the groups.

Evolution of the aggregation distance as a function of the number of clusters (an "elbow" gives an indication about the right number of groups).
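The same elbow diagnostic can be reproduced from the hclust result of the previous slides; a minimal sketch (the plotting choices are assumptions, not Tanagra's output):

# elbow diagnostic: heights of the merges, read from the top of the tree;
# the i-th point is the height of the merge that goes from i+1 to i clusters
heights <- rev(arbre.moda$height)
plot(heights, type = "b", xlab = "number of clusters", ylab = "aggregation distance")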

SLIDE 20

HAC of categories, handling the illustrative variables

# create 3 groups
dgroups <- cutree(arbre.moda, k = 3)
# illustrative variable: dummy coding scheme
illus <- acm.disjonctif(as.data.frame(vote.data$affiliation))
colnames(illus) <- c("democrat", "republican")
# distance to the illustrative categories
dice.democrat <- sapply(disj, dice, m2 = illus$democrat)
tapply(dice.democrat, dgroups, mean)
dice.republican <- sapply(disj, dice, m2 = illus$republican)
tapply(dice.republican, dgroups, mean)

Distance to the clusters – supplementary variables (for a category of the illustrative variable: mean of the squared distances to the indicator variables of each group):

Variable = category        Cluster 1   Cluster 2   Cluster 3
affiliation = republican        30.9        86.6       184.0
affiliation = democrat         186.6       130.9        33.5

Republican: Budget = n, Physician = y, Salvador = y, Nicaraguan = n, Missile = n, Education = y
Democrat: Budget = y, Physician = n, Salvador = n, Nicaraguan = y, Missile = y, Education = n

We understand the influence of the political affiliation on the votes.
SLIDE 21

Using other measures of similarity and dissimilarity

SLIDE 22

varclus() from the « Hmisc » package for R

Similarity measure:

s_{jj'} = \frac{1}{n} \sum_{i=1}^{n} m_{ij} \, m_{ij'}

i.e. the conjoint frequency: the proportion of individuals which belong to both categories (0: no instance belongs simultaneously to the two studied categories; 1: all the instances carry the two categories).

Dissimilarity measure:

d_{jj'} = 1 - s_{jj'}

• Caution: this is not a distance (d_{jj} ≠ 0), but this does not interfere with the hclust() procedure.
• d_{jj'} = 1 necessarily for two categories belonging to the same variable: their merging is only possible at the end of the aggregation process (HAC).

# loading the package
library(Hmisc)
# calling the "varclus" function
# (see the help file for the parameters)
v <- varclus(as.matrix(disj), type = "data.matrix",
             similarity = "bothpos", method = "ward.D")
plot(v)

The partition into 3 groups is also obvious here.

Cluster dendrogram from varclus (y-axis: proportion, from 1.0 down to -1.5). Leaf order: missile.y, salvador.n, education.n, nicaraguan.y, budget.y, physician.n | education.y, budget.n, physician.y, nicaraguan.n, salvador.y, missile.n | budget.neither, physician.neither, missile.neither, salvador.neither, nicaraguan.neither, education.neither.

SLIDE 23

Tandem clustering

SLIDE 24

Tandem clustering

Two steps:
1. Calculate the coordinates of the categories in a new representation space.
2. Perform the clustering with the Euclidean distance.

Factor scores from the MCA (multiple correspondence analysis)

"Individuals" = the categories. Perform the HAC (or another clustering approach) in the new representation space. We can use only a small number of factors; this can be viewed as a regularization strategy (see the sketch below for choosing that number).
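To decide how many factors to keep, the MCA eigenvalues can be inspected; a minimal sketch using ade4 (the package used below), where the cutoff choice is ours, not the deck's:

# percentage of inertia carried by each factor
acm.full <- dudi.coa(disj, scannf = FALSE, nf = 5)
pct <- 100 * acm.full$eig / sum(acm.full$eig)
print(round(pct, 2))   # the deck retains the first two factors (37.14% and 17.61%)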

MCA: the first two factors seem sufficient here. Map of the 18 categories on Factor 1 (37.14%) and Factor 2 (17.61%).

Two compact groups, and one more disparate group (the votes are more scattered).

SLIDE 25

HAC from the factor scores – Euclidean distance

# MCA with the ade4 package
acm <- dudi.coa(disj, scannf = FALSE, nf = 2)
# factor coordinates of the categories
acm.coord <- data.frame(acm$co)
rownames(acm.coord) <- colnames(disj)
# distance matrix (note: the method name is "euclidean")
m.acm <- dist(acm.coord, method = "euclidean")
# cluster analysis from the distance matrix m.acm
arbre.acm <- hclust(m.acm, method = "ward.D")
plot(arbre.acm)

Cluster dendrogram (hclust(*, "ward.D") on m.acm; height axis from 5 to 15). Leaf order: missile.neither, education.neither, budget.neither, physician.neither, salvador.neither, nicaraguan.neither | budget.y, physician.n, education.n, salvador.n, nicaraguan.y, missile.y | salvador.y, missile.n, education.y, budget.n, physician.y, nicaraguan.n.

Coordinates of the categories of the illustrative variable in the factorial representation space.

The same MCA map (Factor 1: 37.14%, Factor 2: 17.61%), with the categories of the illustrative variable (affiliation = republican, affiliation = democrat) projected as supplementary points.

The association with the groups appears naturally.

The individuals (here, the categories) do not all have the same frequency (weight). If the frequencies are very different, we should take this into account in the clustering process (see the "members" parameter of hclust, sketched below).
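A minimal sketch of such a weighting, assuming the frequency of each category is used as its weight (the object names are ours):

# weight each category by the number of individuals carrying it
weights <- colSums(disj)
arbre.w <- hclust(m.acm, method = "ward.D", members = weights)
plot(arbre.w)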

SLIDE 26

SLIDE 27

Conclusion

The clustering of qualitative variables seeks to gather variables into clusters: variables in the same group are strongly related to each other; variables in different groups are weakly related. The method is useful for detecting redundancies, e.g. to assist the variable selection process in a supervised learning task. But it gives no indication about the nature of the associations between the variables.

In this context, it is more relevant to perform a clustering of the categories of the categorical variables. The approach is mainly based on the definition of a similarity measure between categories. Other approaches are possible, e.g. tandem clustering: in a first step, we compute the scores of the categories in a new representation space; in a second step, we perform the clustering using these new coordinates.

SLIDE 28

SLIDE 29

References

• H. Abdallah, G. Saporta, « Classification d'un ensemble de variables qualitatives » (Clustering of a set of categorical variables), Revue de Statistique Appliquée, Tome 46, N°4, pp. 5-26, 1998.
• M. Chavent, V. Kuentz-Simonet, B. Liquet, J. Saracco, « ClustOfVar: An R Package for the Clustering of Variables », Journal of Statistical Software, 50(13), September 2012.
• F. Harrell Jr, « Hmisc: Harrell Miscellaneous », R package version 3.14-5.