 
Introduction Methods & questions Model-based clustering Illustrations Challenges
Clustering: evolution of methods to meet new challenges
C. Biernacki
Journée “Clustering”, Orange Labs, October 20th 2015
Take home message
Cluster, clustering: define both!
Outline
1. Introduction
2. Methods & questions
3. Model-based clustering
4. Illustrations
5. Challenges
A first systematic attempt
Carl von Linné (1707–1778), Swedish botanist, physician, and zoologist
- Father of modern taxonomy, based on the most visible similarities between species
- Linnaeus’s Systema Naturae (1st ed. in 1735) lists about 10,000 species of organisms (6,000 plants, 4,236 animals)
Interdisciplinary endeavor
Medicine [1]: diseases may be classified by etiology (cause), pathogenesis (mechanism by which the disease is caused), or by symptom(s). Alternatively, diseases may be classified according to the organ system involved, though this is often complicated since many diseases affect more than one organ.
And so on...
[1] Nosologie méthodique, dans laquelle les maladies sont rangées par classes, suivant le système de Sydenham, & l’ordre des botanistes, par François Boissier de Sauvages de Lacroix. Paris, Hérissant le fils, 1771
Three main clustering structures
Data set of $n$ individuals $x = (x_1, \ldots, x_n)$, each $x_i$ described by $d$ variables
- Partition in $K$ clusters, denoted by $z = (z_1, \ldots, z_n)$, with $z_i \in \{1, \ldots, K\}$
- Hierarchy: nested partitions
- Block partition: crossing simultaneously partitions in individuals and columns
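As a small illustration of the partition notation, the sketch below stores a partition as a label vector z and converts it to the binary coding z_ik (1 if individual i is in cluster k, 0 otherwise) that the within-cluster inertia criterion relies on later in the talk. The toy labels and the helper name `to_indicator` are hypothetical, chosen only for this example.

```python
# Hypothetical toy partition: n = 6 individuals, K = 3 clusters, labels in 1..K.
z = [1, 3, 2, 1, 3, 3]
K = 3

def to_indicator(z, K):
    """Return the n x K binary matrix whose entry (i, k) is 1 iff z_i == k."""
    return [[1 if zi == k else 0 for k in range(1, K + 1)] for zi in z]

z_ind = to_indicator(z, K)

# Cluster sizes n_k are the column sums of the indicator matrix.
sizes = [sum(row[k] for row in z_ind) for k in range(K)]

print(z_ind[1])  # [0, 0, 1]
print(sizes)     # [2, 1, 3]
```

Both codings carry the same information; the label vector is more compact, while the indicator matrix makes sums over clusters easy to write.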
Clustering is the cluster building process
- According to JSTOR, "data clustering" first appeared in the title of a 1954 article dealing with anthropological data
- It needs to be automatic (algorithms) for complex data: mixed features, large data sets, high-dimensional data...
A 1st aim: explanatory task
A clustering for a marketing study
- Data: d = 13 demographic attributes (nominal and ordinal variables) of n = 6,876 shopping mall customers in the San Francisco Bay area: SEX (1. Male, 2. Female), MARITAL STATUS (1. Married, 2. Living together, not married, 3. Divorced or separated, 4. Widowed, 5. Single, never married), AGE (1. 14 thru 17, 2. 18 thru 24, 3. 25 thru 34, 4. 35 thru 44, 5. 45 thru 54, 6. 55 thru 64, 7. 65 and Over), etc.
- Partition: retrieve the groups earning less than $19,999 (“low income”), between $20,000 and $39,999 (“average income”), and more than $40,000 (“high income”)
[Figure: two panels showing the customers on the first two MCA axes (1st MCA axis vs. 2nd MCA axis); legend: low income, average income, high income]
A 2nd aim: preprocessing step
Logit model:
- Not very flexible since the borderline is linear
- The ML estimate is unbiased, but its asymptotic variance $\sim n (x' w x)^{-1}$ is influenced by correlations
A clustering may improve logistic regression prediction:
- More flexible borderline: piecewise linear
- Decreased correlation, hence decreased variance
Mixed features
Large data sets [2]
[2] S. Alelyani, J. Tang and H. Liu (2013). Feature Selection for Clustering: A Review. Data Clustering: Algorithms and Applications, 29
High-dimensional data [3]
[3] S. Alelyani, J. Tang and H. Liu (2013). Feature Selection for Clustering: A Review. Data Clustering: Algorithms and Applications, 29
Genesis of “Big Data”
The Big Data phenomenon mainly originates in the increase of computer and digital resources at an ever lower cost
- Storage cost per MB: $700 in 1981, $1 in 1994, $0.01 in 2013 → price divided by 70,000 in thirty years
- Storage capacity of HDDs: ≈ 1.02 GB in 1982, ≈ 8 TB today → capacity multiplied by 8,000 over the same period
- Computer processing speed: 1 gigaFLOPS [4] in 1985, 33 petaFLOPS in 2013 → speed multiplied by 33 million
[4] FLOPS = FLoating-point Operations Per Second
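The growth factors quoted on this slide can be checked with a few lines of arithmetic. The sketch below only reuses the slide's own figures; the conversion 1 terabyte = 1024 gigabytes is an assumption (with 1000 gigabytes per terabyte the capacity ratio comes out near 7,800 instead, still of the order of 8,000).

```python
# Figures quoted on the slide, expressed in a common unit within each line.
storage_cost_ratio = 700 / 0.01        # dollars per MB: 1981 vs 2013
hdd_capacity_ratio = 8 * 1024 / 1.02   # gigabytes: ~8 terabytes vs ~1.02 gigabytes (1 TB = 1024 GB assumed)
flops_ratio = 33e15 / 1e9              # FLOPS: 33 petaFLOPS vs 1 gigaFLOPS

print(round(storage_cost_ratio))  # 70000
print(round(hdd_capacity_ratio))  # 8031
print(round(flops_ratio))         # 33000000
```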
Digital flow
- Digital in 1986: 1% of the stored information, 0.02 EB [5]
- Digital in 2007: 94% of the stored information, 280 EB (multiplied by 14,000)
[5] EB = exabyte
Societal phenomenon
All human activities are impacted by data accumulation
- Trade and business: corporate reporting systems, banks, commercial transactions, reservation systems...
- Governments and organizations: laws, regulations, standardizations, infrastructure...
- Entertainment: music, video, games, social networks...
- Sciences: astronomy, physics and energy, genome...
- Health: medical record databases in the social security system...
- Environment: climate, sustainable development, pollution, power...
- Humanities and Social Sciences: digitization of knowledge, literature, history, art, architecture, archaeological data...
New data... but classical answers [6]
[6] Rexer Analytics’s Annual Data Miner Survey is the largest survey of data mining, data science, and analytics professionals in the industry (survey of 2011)
Clustering of clustering algorithms [7]
Jain et al. (2004) hierarchically clustered 35 different clustering algorithms into 5 groups, based on the partitions they produce on 12 different datasets. Unsurprisingly, related algorithms are clustered together. To visualize the similarity between the algorithms, the 35 algorithms are also embedded in a two-dimensional space obtained from the 35×35 similarity matrix.
[7] A.K. Jain (2008). Data Clustering: 50 Years Beyond K-Means.
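To make the idea of a similarity matrix between algorithms concrete, the sketch below compares the partitions returned by three hypothetical algorithms using the Rand index (the fraction of pairs of individuals on which two partitions agree). The slide does not say which similarity measure was used in the cited study, so the Rand index, the algorithm names and the toy partitions are all assumptions made for illustration.

```python
from itertools import combinations

def rand_index(z1, z2):
    """Fraction of pairs of individuals on which two partitions agree."""
    pairs = list(combinations(range(len(z1)), 2))
    # A pair "agrees" if both partitions put it together, or both separate it.
    agree = sum((z1[i] == z1[j]) == (z2[i] == z2[j]) for i, j in pairs)
    return agree / len(pairs)

# Hypothetical partitions of the same 6 individuals by three algorithms.
partitions = {
    "algo_A": [1, 1, 2, 2, 3, 3],
    "algo_B": [2, 2, 1, 1, 3, 3],  # same partition as algo_A, labels permuted
    "algo_C": [1, 1, 1, 2, 2, 2],
}
names = list(partitions)

# Symmetric similarity matrix between algorithms (3 x 3 here, 35 x 35 in the study).
similarity = [[rand_index(partitions[a], partitions[b]) for b in names] for a in names]

for name, row in zip(names, similarity):
    print(name, [round(s, 3) for s in row])
```

Such a matrix can then be fed to a hierarchical clustering or a low-dimensional embedding method, which is the kind of analysis the slide describes.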
Popularity of K-means and hierarchical clustering
Even though K-means was first proposed over 50 years ago, it is still one of the most widely used clustering algorithms, for several reasons: ease of implementation, simplicity, efficiency, empirical success... and model-based interpretation (see later)
Within-cluster inertia criterion
Select the partition $z$ minimizing the criterion
$$W_M(z) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \, \| x_i - \bar{x}_k \|_M^2$$
- $z_{ik} = 1$ if individual $i$ belongs to cluster $k$ (that is, $z_i = k$) and $z_{ik} = 0$ otherwise
- $\| \cdot \|_M$ is the Euclidean distance with metric $M$ in $\mathbb{R}^d$
- $\bar{x}_k$ is the mean of the $k$th cluster,
$$\bar{x}_k = \frac{1}{n_k} \sum_{i=1}^{n} z_{ik} x_i,$$
and $n_k = \sum_{i=1}^{n} z_{ik}$ is the number of individuals in cluster $k$
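The criterion can be evaluated directly from a data set and a label vector. The following is a minimal sketch that assumes the identity metric (M = I, plain squared Euclidean distance) and uses hypothetical toy data; the function name `within_cluster_inertia` is not from the talk.

```python
def within_cluster_inertia(x, z, K):
    """W(z): sum over individuals of the squared distance to their cluster mean.

    Assumes the identity metric M = I (plain squared Euclidean distance).
    """
    d = len(x[0])
    total = 0.0
    for k in range(1, K + 1):
        members = [xi for xi, zi in zip(x, z) if zi == k]
        if not members:
            continue  # an empty cluster contributes nothing
        mean = [sum(xi[j] for xi in members) / len(members) for j in range(d)]
        total += sum((xi[j] - mean[j]) ** 2 for xi in members for j in range(d))
    return total

# Hypothetical toy data: four points at the corners of a 4 x 2 rectangle.
x = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]

print(within_cluster_inertia(x, [1, 1, 2, 2], K=2))  # 4.0  (groups the close points)
print(within_cluster_inertia(x, [1, 2, 1, 2], K=2))  # 16.0 (groups the distant points)
```

The partition that groups nearby points has the smaller inertia, which is what minimizing the criterion is meant to reward.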
Ward hierarchical clustering
[Figure: agglomerative hierarchical clustering with Ward's method on five data points a, b, c, d, e, shown next to its dendrogram (vertical axis: index). Starting from singletons, the successive merges are d({b}, {c}) = 0.5, d({d}, {e}) = 2, d({a}, {b, c}) = 3 and d({a, b, c}, {d, e}) = 10]
- Suboptimal optimisation of $W_M(\cdot)$
- A partition is obtained by cutting the dendrogram
- A dissimilarity matrix between pairs of individuals is enough
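A greedy agglomeration in the spirit of Ward's method can be sketched as follows: start from singletons and repeatedly merge the two clusters whose fusion increases the within-cluster inertia the least. This is a simplified illustration under stated assumptions: it uses the identity metric, works from coordinates rather than from a dissimilarity matrix (a production implementation would typically use the Lance-Williams update instead), stops at K clusters rather than building the full dendrogram, and the data points are hypothetical, not those of the slide's figure.

```python
def ward_cost(a, b):
    """Increase in within-cluster inertia caused by merging clusters a and b."""
    na, nb = len(a), len(b)
    d = len(a[0])
    mean_a = [sum(p[j] for p in a) / na for j in range(d)]
    mean_b = [sum(p[j] for p in b) / nb for j in range(d)]
    sq_dist = sum((mean_a[j] - mean_b[j]) ** 2 for j in range(d))
    return na * nb / (na + nb) * sq_dist

def ward_partition(points, K):
    """Start from singletons and merge the cheapest pair until K clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > K:
        candidates = [(ward_cost(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters))]
        _, i, j = min(candidates)
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters

# Hypothetical toy data: three points near the origin, two points far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.5), (6.0, 5.0), (6.0, 7.0)]
result = ward_partition(points, K=2)
print(sorted(sorted(c) for c in result))
# [[(0.0, 0.0), (0.0, 1.0), (1.0, 0.5)], [(6.0, 5.0), (6.0, 7.0)]]
```

Stopping the merges at K clusters plays the role of cutting the dendrogram at the corresponding level. Each merge is locally optimal only, which is why the resulting partition is a suboptimal minimizer of the inertia criterion.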
K-means algorithm
[Figure: moving-centers algorithm on five data points a, b, c, d, e. Panels: data; 2 individuals drawn at random as initial centers; then alternating "assignment to the centers" and "computation of the centers" steps]
- Alternating optimization between the partition and the cluster centers
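The alternating scheme can be sketched in a few lines: draw K individuals at random as initial centers, assign every individual to its nearest center, recompute each center as the mean of its cluster, and repeat until the partition stops changing. This is a minimal sketch assuming the identity metric; the parameter names (`n_iter`, `seed`) and the toy data are hypothetical.

```python
import random

def kmeans(x, K, n_iter=100, seed=0):
    """Alternate assignment and center updates (squared Euclidean distance assumed)."""
    rng = random.Random(seed)
    d = len(x[0])
    centers = rng.sample(x, K)  # K individuals drawn at random as initial centers
    z = None
    for _ in range(n_iter):
        # Assignment step: each individual joins its nearest center.
        new_z = [
            min(range(K), key=lambda k: sum((xi[j] - centers[k][j]) ** 2 for j in range(d)))
            for xi in x
        ]
        if new_z == z:
            break  # partition unchanged: the algorithm has converged
        z = new_z
        # Update step: each center becomes the mean of its cluster.
        for k in range(K):
            members = [xi for xi, zi in zip(x, z) if zi == k]
            if members:  # keep the previous center if a cluster becomes empty
                centers[k] = tuple(sum(m[j] for m in members) / len(members) for j in range(d))
    return z, centers

# Hypothetical toy data: two well-separated groups of three points.
x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
z, centers = kmeans(x, K=2)
print(z)        # labels are 0-based here; the first three points share one label
print(centers)
```

Each of the two steps can only decrease the within-cluster inertia, so the procedure converges, but only to a local optimum that depends on the random initial centers. Running it from several seeds and keeping the partition with the lowest inertia is a common safeguard.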