  1. Scalable Clustering of Categorical Data and Applications
     University of Trento
     Periklis Andritsos, periklis@dit.unitn.it

  2. Problem Definition
     o Clustering is a procedure that groups members of a population into similar categories, or clusters
     o Why is clustering important?
       - Gain insight into the way the data is distributed
       - Preprocess an initial data set

  3. Exercise from the real world

  4. o In September 2004, The New York Times reported the launch of Clusty (http://www.clusty.com/)
       - Given a query, the meta-search engine places relevant web documents into groups
       - Clusty uses clustering technology on ten different types of web content, including material from the web, image, news and shopping databases
     o Example: [figure omitted]

  5. … of Digital Cameras
     o IDC (http://www.idc.com/) employs clustering for customer segmentation purposes
     o Example: digital camera companies investigated ways to increase sales during the Christmas holidays
     o IDC surveyed 1,000 U.S. consumers at 50 malls
       - "Likely buyers will be more motivated to buy a digital camera knowing that digital images can be displayed on a TV, printed using a PC-less photo quality printer, or printed at traditional film developer outlets"
       - "Likely buyers capture more images per month on their film cameras than unlikely buyers"
     o Companies were able to target the proper market segment
     Source: http://www.imaging-resource.com/NEWS/1037573998.html

  6. Understanding software systems
     [Figure: a module dependency graph of a software system, with components such as main, Event, boxParserClass, boxScannerClass, GenerateProlog, Lexer, Globals, Fonts, MathLib, EdgeClass, ColorTable, stackOfAny, hashedBoxes, hashGlobals and NodeOfAny] [MMBRCG '98]

  7. What if we have …
     o Some information cannot be depicted
     [Andritsos, Miller: IEEE Int'l Workshop on Program Comprehension, 2001]

  8. Integrated Information
     [Figure: an integrated information repository built from heterogeneous sources: an O-O database (cust, emp, dept), a relational database (Customer, Order, Product, Scheduled Delivery, Salesperson) and XML documents (<title>, <author>, <year>)]
     o We deal with data that:
       - are stored in heterogeneous sources
       - exist under different formats
       - are available online (with schemas)
     o Schema: a type specification of a collection of data
     o We often need to integrate data, which introduces errors

  9. Cluster Analysis Stages
     [Figure: the cluster analysis pipeline (Data Collection, Screening, Initial Representation, Clustering Strategy, Validation, Interpretation), with a "Focus of my work" annotation]
     o The intention was not to build yet another clustering algorithm, but one that adheres to real-world constraints

  10. Requirements
     o Perform good quality clustering on different data types
       - The majority of existing commercial algorithms perform clustering of objects expressed over numerical values
     o Scalability
       - The optimal solution to clustering is hard to find, and existing heuristic techniques do not necessarily perform well with large inputs
     o Parameter setting
       - Many algorithms expect the user to give a set of (sometimes) unintuitive parameters
     o Inclusion of descriptive information in software clustering
       - Software clustering techniques use structural information exclusively

  11. Calculating Distance
     o Numerical data: Lp metrics are defined (e.g., Euclidean, Manhattan)
     o Categorical data: no single ordering of the values

       employee   salary    age
       John       $5,000    25
       Mary       $6,000    26
       Peter      $2,500    30
       Jenny      $60,000   32

       movie           director    actor       genre
       Godfather II    Scorcese    De Niro     Crime
       Good Fellas     Coppola     De Niro     Crime
       Vertigo         Hitchcock   J. Stewart  Thriller
       N by NW         Hitchcock   C. Grant    Thriller
       Bishop's Wife   Koster      C. Grant    Comedy
       Harvey          Koster      J. Stewart  Comedy
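     A minimal sketch (not from the slides) of why Lp metrics fit the numerical table but not the categorical one; the values come from the tables above, and the helper names are mine:

```python
import math

# Numerical rows from the employee table: (salary, age)
john, mary = (5000, 25), (6000, 26)

def euclidean(a, b):
    """L2 metric: square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """L1 metric: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean(john, mary))   # ~1000.0005, dominated by the salary scale
print(manhattan(john, mary))   # 1001

# Categorical rows: no ordering, so subtraction is meaningless.
# The naive fallback is to count mismatching attributes, which ignores
# how the values co-occur across the table.
godfather = ("Scorcese", "De Niro", "Crime")
vertigo   = ("Hitchcock", "J. Stewart", "Thriller")
print(sum(x != y for x, y in zip(godfather, vertigo)))   # 3
```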

  12. Agglomerative Clustering
     [Figure: agglomerative (hierarchical) clustering in Euclidean space on 6 points A–F, shown as nested groupings and the corresponding dendrogram]
     o Need to compute the distance between objects as well as between objects and sub-clusters
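     A rough sketch of the agglomerative procedure the slide depicts, assuming Euclidean points and single-linkage distances between sub-clusters; the point coordinates and function names are invented for illustration:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2):
    """Distance between two clusters = distance of their closest members."""
    return min(euclidean(a, b) for a in c1 for b in c2)

# Six made-up points standing in for A..F on the slide.
points = {"A": (0, 0), "B": (1, 0), "C": (4, 0), "D": (5, 0), "E": (9, 3), "F": (10, 3)}
clusters = [[p] for p in points.values()]

# Repeatedly merge the closest pair of clusters until one cluster remains;
# the sequence of merges is the dendrogram, read bottom-up.
while len(clusters) > 1:
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
    )
    merged = clusters[i] + clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    print(merged)
```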

  13. Contributions
     o Developed LIMBO, an algorithm that
       - is hierarchical
       - clusters categorical data using a small number of parameters
       - is scalable as the size of the input increases
       International Conference on Extending Data Base Technology (EDBT'04)
     o Studied software systems using both structural and non-structural information
       - The algorithm incorporates information such as the Developer, Lines Of Code or Directory structure
       International Working Conference on Reverse Engineering (WCRE'03)
       IEEE Transactions on Software Engineering (TSE'05)
     o Proposed a set of Information-Theoretic tools to discover structure in large data sets
       ACM International Conference on the Management of Data (SIGMOD'04)
       International Workshop on Information Integration on the Web (WEB'02)
       IEEE Data Engineering Bulletin 2002, 2003

  14. Roadmap
     - Introduction
     - Contributions
     - Motivating example
     - LIMBO Algorithm
     - Studying Software Systems
     - Identifying Structure
     - Conclusions & Future Work

  15. Clustering Categorical Data
     o Cluster rows (objects) in order to preserve as much information as possible about the attribute values

       movie           director    actor       genre
       Godfather II    Scorcese    De Niro     Crime
       Good Fellas     Coppola     De Niro     Crime
       Vertigo         Hitchcock   J. Stewart  Thriller
       N by NW         Hitchcock   C. Grant    Thriller
       Bishop's Wife   Koster      C. Grant    Comedy
       Harvey          Koster      J. Stewart  Comedy

     o Clustering the rows into {Godfather II, Good Fellas} and {Vertigo, N by NW, Bishop's Wife, Harvey}:
       - the first cluster preserves the information for actor and genre, with two choices for director
       - the second cluster leaves two choices for director, actor, and genre

  16. Clustering Categorical Data
     o Cluster rows (objects) in order to preserve as much information as possible about the attribute values

       movie           director    actor       genre
       Godfather II    Scorcese    De Niro     Crime
       Good Fellas     Coppola     De Niro     Crime
       Vertigo         Hitchcock   J. Stewart  Thriller
       N by NW         Hitchcock   C. Grant    Thriller
       Bishop's Wife   Koster      C. Grant    Comedy
       Harvey          Koster      J. Stewart  Comedy

     o Clustering the rows into {Godfather II, Good Fellas, Vertigo, N by NW} and {Bishop's Wife, Harvey}:
       - the first cluster leaves three choices for director and actor, and two for genre
       - the second cluster preserves the information for director and genre, with two choices for actor

  17. Roadmap
     - Introduction
     - Contributions
     - Motivating example
     - LIMBO Algorithm
     - Studying Software Systems
     - Identifying Structure
     - Conclusions & Future Work

  18. Information Theory Basics
     o Entropy: H(X) = - Σ_x p(x) log p(x)
       - measures the uncertainty in a random variable
     o Conditional Entropy: H(X|Y)
       - measures the uncertainty of one variable knowing the values of another
     o Mutual Information: I(X;Y) = H(X) - H(X|Y)
       - measures the dependence of two random variables
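     A small sketch (not part of the slides) of these three quantities computed from a joint distribution p(x, y) with numpy; the example distribution is made up:

```python
import numpy as np

# A made-up joint distribution p(x, y) over a 2x3 alphabet; rows are X, columns are Y.
p_xy = np.array([[0.2, 0.1, 0.2],
                 [0.1, 0.3, 0.1]])

def entropy(p):
    """H = -sum p log p, skipping zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p_x = p_xy.sum(axis=1)              # marginal p(x)
p_y = p_xy.sum(axis=0)              # marginal p(y)

H_X = entropy(p_x)
H_XY = entropy(p_xy)                # joint entropy H(X,Y)
H_X_given_Y = H_XY - entropy(p_y)   # chain rule: H(X|Y) = H(X,Y) - H(Y)
I_XY = H_X - H_X_given_Y            # mutual information I(X;Y) = H(X) - H(X|Y)

print(H_X, H_X_given_Y, I_XY)
```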

  19. Information Theoretic Clustering
     o T: a random variable that ranges over the rows
     o V: a random variable that ranges over the attribute values
     o I(T;V): the mutual information of T and V
     o Information Bottleneck Method [TPB '99]
       - compress T into a clustering C_k so that the information preserved about V is maximum (k = number of clusters)
     o Optimization criterion:
       - minimize { I(T;V) - I(C_k;V) }
       - i.e., minimization of the Information Loss
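     A worked sketch of the criterion on the movie table, under the assumption (mine, following the usual IB-style encoding) that rows are equiprobable and each row spreads its mass evenly over its attribute values; the clustering tested is the one from slide 15:

```python
import numpy as np

# Rows of the movie table (director, actor, genre), as on the earlier slides.
rows = [
    ("Scorcese",  "De Niro",    "Crime"),
    ("Coppola",   "De Niro",    "Crime"),
    ("Hitchcock", "J. Stewart", "Thriller"),
    ("Hitchcock", "C. Grant",   "Thriller"),
    ("Koster",    "C. Grant",   "Comedy"),
    ("Koster",    "J. Stewart", "Comedy"),
]
values = sorted({v for r in rows for v in r})
n, m = len(rows), len(rows[0])

# Joint distribution p(t, v): p(t) = 1/n, and each row spreads its mass evenly
# over its m attribute values (an assumption, not stated on this slide).
p_tv = np.zeros((n, len(values)))
for i, r in enumerate(rows):
    for v in r:
        p_tv[i, values.index(v)] = 1.0 / (n * m)

def mutual_information(p_xy):
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz]))

# A clustering C of the rows, e.g. {0,1} and {2,3,4,5} as on slide 15.
clusters = [[0, 1], [2, 3, 4, 5]]
p_cv = np.array([p_tv[c].sum(axis=0) for c in clusters])

loss = mutual_information(p_tv) - mutual_information(p_cv)
print("information loss I(T;V) - I(C;V):", loss)
```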

  20. Computing Information Loss
     o Representation: every cluster c_i is represented by
       - its probability p(c_i) = n(c_i) / n
       - the conditional probability of the values in V given the cluster, p(V|c_i)
     o This information is sufficient to compute the Information Loss
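     In the agglomerative IB literature, the loss of merging two clusters can indeed be written using only these two quantities, as a weighted Jensen-Shannon divergence of their conditional distributions. A sketch of that computation (the function names and toy distributions are mine):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p || q), skipping zero entries of p."""
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

def merge_information_loss(p_c1, pV_c1, p_c2, pV_c2):
    """Loss of merging c1 and c2, given only p(c_i) and p(V|c_i).

    delta_I = (p(c1) + p(c2)) * JS divergence of p(V|c1), p(V|c2),
    with mixture weights proportional to p(c1) and p(c2).
    """
    w = p_c1 + p_c2
    pi1, pi2 = p_c1 / w, p_c2 / w
    pV_merged = pi1 * pV_c1 + pi2 * pV_c2      # p(V | c1 union c2)
    js = pi1 * kl(pV_c1, pV_merged) + pi2 * kl(pV_c2, pV_merged)
    return w * js

# Two toy clusters over a 4-value alphabet.
print(merge_information_loss(0.5, np.array([0.5, 0.5, 0.0, 0.0]),
                             0.5, np.array([0.0, 0.0, 0.5, 0.5])))
```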

  21. Agglomerative IB [ST99]
     o Computes an (n x n) distance matrix using Information Loss as the distance
     o Merges the sub-clusters with the minimum Information Loss

       movie           director    actor       genre
       Godfather II    Scorcese    De Niro     Crime
       Good Fellas     Coppola     De Niro     Crime
       Vertigo         Hitchcock   J. Stewart  Thriller
       N by NW         Hitchcock   C. Grant    Thriller
       Bishop's Wife   Koster      C. Grant    Comedy
       Harvey          Koster      J. Stewart  Comedy
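     A compact sketch (mine, not the slides' code) of the greedy AIB loop: start from the six movie rows as singleton clusters, then repeatedly merge the pair whose merge costs the least mutual information; the helper names and the k=3 stopping point are illustrative:

```python
import numpy as np

def kl(p, q):
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

def merge_loss(c1, c2):
    """delta_I for merging two clusters represented as (p(c), p(V|c)) pairs."""
    (w1, pv1), (w2, pv2) = c1, c2
    pv = (w1 * pv1 + w2 * pv2) / (w1 + w2)
    return w1 * kl(pv1, pv) + w2 * kl(pv2, pv)

def agglomerative_ib(clusters, k):
    """Greedy AIB: repeatedly merge the pair with the smallest information loss."""
    clusters = list(clusters)
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: merge_loss(clusters[ij[0]], clusters[ij[1]]))
        (w1, pv1), (w2, pv2) = clusters[i], clusters[j]
        merged = (w1 + w2, (w1 * pv1 + w2 * pv2) / (w1 + w2))
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
    return clusters

# Singleton clusters from the 6 movie rows: p(c) = 1/6, p(V|c) uniform over the row's values.
rows = [("Scorcese", "De Niro", "Crime"), ("Coppola", "De Niro", "Crime"),
        ("Hitchcock", "J. Stewart", "Thriller"), ("Hitchcock", "C. Grant", "Thriller"),
        ("Koster", "C. Grant", "Comedy"), ("Koster", "J. Stewart", "Comedy")]
values = sorted({v for r in rows for v in r})

def row_cluster(r):
    pv = np.zeros(len(values))
    for v in r:
        pv[values.index(v)] = 1.0 / len(r)
    return (1.0 / len(rows), pv)

result = agglomerative_ib([row_cluster(r) for r in rows], k=3)
print([round(w, 3) for w, _ in result])   # sizes (as probabilities) of the 3 clusters
```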

  22. scaLable InforMation BOttleneck (LIMBO)
     o The agglomerative approach (AIB) has quadratic complexity, since we need to compute an (n x n) distance matrix
     o The LIMBO algorithm:
       - produces a summary of the data
       - applies agglomerative clustering on the summary
     o Summary = Distributional Cluster Features: DCF(c) = ( n(c), p(V|c) )
     o DCFs can be computed incrementally; merging c1 and c2 into c* gives
       DCF(c*) = ( n(c1) + n(c2),  [n(c1) / (n(c1) + n(c2))] p(V|c1) + [n(c2) / (n(c1) + n(c2))] p(V|c2) )
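     A small sketch (mine) of the incremental DCF merge given by the formula above, with a DCF stored as a (count, conditional-distribution) pair; the toy values are illustrative:

```python
import numpy as np

def merge_dcf(dcf1, dcf2):
    """Merge two Distributional Cluster Features per the formula on the slide."""
    n1, pv1 = dcf1
    n2, pv2 = dcf2
    n = n1 + n2
    # p(V|c*) is the count-weighted mixture of the two conditional distributions.
    return (n, (n1 / n) * pv1 + (n2 / n) * pv2)

# Two toy DCFs over a 4-value alphabet: a cluster of 3 tuples and a cluster of 1 tuple.
dcf_a = (3, np.array([0.5, 0.5, 0.0, 0.0]))
dcf_b = (1, np.array([0.0, 0.0, 0.5, 0.5]))
print(merge_dcf(dcf_a, dcf_b))
# -> (4, array([0.375, 0.375, 0.125, 0.125]))
```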
