
Course Content Lecture 6 Week 10 (May 12) and Week 11 (May 19) - PowerPoint PPT Presentation



  1. Course Content Lecture 6 Week 10 (May 12) and Week 11 (May 19)
  33459-01 Principles of Knowledge Discovery in Data – March-June, 2006
  Lecture by: Dr. Osmar R. Zaïane
  • Introduction to Data Mining
  • Association analysis
  • Sequential Pattern Analysis
  • Classification and prediction
  • Contrast Sets
  • Data Clustering
    – Clustering Analysis: Agglomerative, Hierarchical, Density-based and other approaches
  • Outlier Detection
  • Web Mining

  What is Classification?
  The goal of data classification is to organize and categorize data in distinct classes.
  • A model is first created based on the data distribution.
  • The model is then used to classify new data.
  • Given the model, a class can be predicted for new data.
  With classification, I can predict in which bucket to put the ball, but I can't predict the weight of the ball.

  Classification = Learning a Model
  [Figure: a labeled training set is used to build a classification model; the model then assigns labels (buckets 1, 2, 3, 4, …, n) to new unlabeled data. Labeling = Classification.]
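The "learn a model from labeled data, then predict labels for new data" idea can be sketched with a tiny classifier. The toy data and the 1-nearest-neighbour rule below are illustrative assumptions, not the method used in the lecture:

```python
# Sketch of "classification = learning a model": a labeled training set
# defines the model; new, unlabeled points are then assigned a class.
# Data and the 1-NN rule are invented for illustration.
import math

# Training set (labeled): (feature vector, class label)
training = [
    ((1.0, 1.0), "red"),
    ((1.2, 0.8), "red"),
    ((5.0, 5.0), "blue"),
    ((4.8, 5.2), "blue"),
]

def classify(x):
    """Predict a label for an unlabeled point x via its nearest neighbour."""
    return min(training, key=lambda p: math.dist(x, p[0]))[1]

print(classify((1.1, 0.9)))  # → red  (close to the "red" examples)
print(classify((5.1, 4.9)))  # → blue (close to the "blue" examples)
```

As the slide notes, such a model predicts which "bucket" (class) a new object belongs to; it does not predict a continuous value.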

  2. Supervised and Unsupervised
  • Supervised Classification = Classification: we know the class labels and the number of classes.
  • Unsupervised Classification = Clustering: we do not know the class labels and may not know the number of classes.
    – Objects are not labeled, i.e. there is no training data.
    – We need a notion of similarity or closeness (what features?).
    – Should we know a priori how many clusters exist?
    – How do we characterize members of groups?
    – How do we label groups?
  [Figure: labeled classes (gray, red, blue, green, black) in buckets 1, 2, 3, 4, …, n versus unlabeled objects marked "?".]

  What is Clustering?
  The process of putting similar data together: grouping, clustering, partitioning.
  [Figure: objects a–e grouped into partitions.]

  What is Clustering in Data Mining?
  • Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.
    – Helps users understand the natural grouping or structure in a data set.
  • Cluster: a collection of data objects that are "similar" to one another and thus can be treated collectively as one group.
  • A good clustering method produces high quality clusters in which:
    – The intra-class (that is, intra-cluster) similarity is high.
    – The inter-class similarity is low.
  • The quality of a clustering result depends on both the similarity measure used and its implementation.
  • Clustering = a function that maximizes similarity between objects within a cluster and minimizes similarity between objects in different clusters.

  Lecture Outline
  Part I: What is Clustering in Data Mining (30 minutes)
  • Introduction to clustering
  • Motivating examples for clustering
  • Taxonomy of major clustering algorithms
  • Major issues in clustering
  • What is good clustering?
  Part II: Major Clustering Approaches (1 hour 20 minutes)
  • K-means (partitioning-based clustering algorithm)
  • Nearest Neighbor clustering algorithm
  • Hierarchical clustering
  • Density-based clustering
  Part III: Open Problems (10 minutes)
  • Research issues in clustering
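K-means, the first approach in Part II of the outline, illustrates the "maximize intra-cluster similarity" idea concretely. A minimal sketch on invented 1-D data (the data, k=2, and the naive initialisation are all assumptions for illustration):

```python
# Minimal sketch of k-means (Lloyd's algorithm) on 1-D points:
# repeatedly assign each point to its nearest centre, then move each
# centre to the mean of its cluster.  Toy data, invented for illustration.

def kmeans(points, k, iters=20):
    centres = points[:k]  # naive initialisation: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assignment step
            i = min(range(k), key=lambda j: abs(p - centres[j]))
            clusters[i].append(p)
        centres = [sum(c) / len(c) if c else centres[i]   # update step
                   for i, c in enumerate(clusters)]
    return centres, clusters

centres, clusters = kmeans([1.0, 1.2, 0.8, 8.0, 8.2, 7.8], k=2)
print(sorted(centres))  # → [1.0, 8.0], one centre per natural group
```

Each iteration reduces the within-cluster scatter, so points in the same cluster end up more similar to each other than to points in other clusters.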

  3. Typical Clustering Applications
  • As a stand-alone tool to
    – get insight into the data distribution
    – find the characteristics of each cluster
    – assign the cluster of a new example
  • As a preprocessing step for other algorithms
    – e.g. numerosity reduction: using cluster centers to represent the data in clusters
  • As a building block for many data mining solutions
    – e.g. recommender systems: group users with similar behaviour or preferences to improve recommendation.

  Clustering Example – Fitting Troops
  • Fitting the troops: re-design of uniforms for female soldiers in the US army.
    – Goal: reduce the number of uniform sizes to be kept in inventory while still providing a good fit.
  • Researchers from Cornell University used clustering and designed a new set of sizes.
    – Traditional clothing size system: an ordered set of graduated sizes where all dimensions increase together.
    – The new system: sizes that fit body types.
      • E.g. one size for short-legged, small-waisted women with wide and long torsos, average arms, broad shoulders, and skinny necks.

  Other Examples of Clustering Applications
  • Marketing: help discover distinct groups of customers, and then use this knowledge to develop targeted marketing programs.
  • Biology: derive plant and animal taxonomies; find genes with similar function.
  • Land use: identify areas of similar land use in an earth observation database.
  • Insurance: identify groups of motor insurance policy holders with a high average claim cost.
  • City-planning: identify groups of houses according to their house type, value, and geographical location.

  Clustering Example – Houses
  • Given a dataset, it may be clustered on different attributes. The result and its interpretation would be different.
  [Figure: the same group of houses clustered based on geographic distance versus clustered based on size and value.]
  • The definition of a distance function is highly application dependent. A distance function measures "dissimilarity" between pairs of objects x and y:
    – A small distance dist(x, y): objects x and y are more similar.
    – A large distance dist(x, y): objects x and y are less similar.
  • Properties of a distance function:
    – dist(x, y) ≥ 0
    – dist(x, y) = 0 iff x = y
    – dist(x, y) = dist(y, x) (symmetry)
    – If dist is a metric, which is often the case: dist(x, z) ≤ dist(x, y) + dist(y, z) (triangle inequality)
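The four distance-function properties above can be checked mechanically for a concrete choice of distance. Euclidean distance (one common metric, used here as an example) satisfies all four; the sample points are arbitrary:

```python
# Check the slide's four distance-function properties for Euclidean
# distance on a few arbitrary sample points.
import itertools
import math

def dist(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]

for x, y in itertools.product(points, repeat=2):
    assert dist(x, y) >= 0                       # non-negativity
    assert (dist(x, y) == 0) == (x == y)         # dist(x, y) = 0 iff x = y
    assert dist(x, y) == dist(y, x)              # symmetry

for x, y, z in itertools.product(points, repeat=3):
    # triangle inequality (small tolerance for floating-point rounding)
    assert dist(x, z) <= dist(x, y) + dist(y, z) + 1e-12

print(dist((0.0, 0.0), (3.0, 4.0)))  # → 5.0
```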

  4. Major Clustering Techniques
  • Clustering techniques have been studied extensively in statistics, machine learning, and data mining, with many methods proposed and studied.
  • Clustering methods can be classified into 5 approaches:
    – partitioning algorithms
    – hierarchical algorithms
    – density-based methods
    – grid-based methods
    – model-based methods
  (We will cover only partitioning, hierarchical, and density-based methods.)

  Five Categories of Clustering Methods
  • Partitioning algorithms: construct various partitions and then evaluate them by some criterion. (K-Means is the best known.)
  • Hierarchy algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion. There is an agglomerative approach and a divisive approach.
  • Density-based: based on connectivity and density functions.
  • Grid-based: based on a multiple-level granularity structure.
  • Model-based (generative approach): a model is hypothesized for each of the clusters, and the idea is to find the best fit of that model to the data. Generative models are estimated through a maximum likelihood approach. (EM, Expectation Maximization with a Gaussian Mixture Model, is a typical example.)

  Important Issues in Clustering
  • Different types of attributes
    – Numerical: can generally be represented in a Euclidean space; distance can be used as a measure of dissimilarity. (See the classification slides for measures.)
    – Categorical: a metric space may not be definable; a similarity measure has to be defined, e.g. Jaccard, Dice, Overlap, Cosine, etc.
    – Sequence-aware similarity: e.g. DNA sequences, web access behaviour. Can use dynamic programming.

  Important Issues in Clustering (2)
  • Noise and outlier detection
    – Differentiate remote points from internal ones.
    – Noisy points (errors in the data) can artificially split or merge clusters.
    – Distinguishing remote noisy points or very small unexpected clusters can be very important for the validity and quality of the results.
    – Noise can bias the results, especially in the calculation of cluster characteristics.
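The similarity measures named for categorical attributes have standard set-based definitions (the slide's formulas did not survive extraction; the versions below are the usual textbook forms, applied to invented example records represented as sets of attribute values):

```python
# Standard set-based similarity measures for categorical data.
# For sets A and B:  Jaccard = |A∩B| / |A∪B|,  Dice = 2|A∩B| / (|A|+|B|),
# Overlap = |A∩B| / min(|A|,|B|),  Cosine = |A∩B| / sqrt(|A|·|B|).
import math

def jaccard(a, b):
    return len(a & b) / len(a | b)

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

def overlap(a, b):
    return len(a & b) / min(len(a), len(b))

def cosine(a, b):
    return len(a & b) / math.sqrt(len(a) * len(b))

# Two example records, invented for illustration.
x = {"red", "round", "small"}
y = {"red", "round", "large", "heavy"}

print(jaccard(x, y))  # → 2/5  = 0.4
print(dice(x, y))     # → 4/7  ≈ 0.571
print(overlap(x, y))  # → 2/3  ≈ 0.667
print(cosine(x, y))   # → 2/√12 ≈ 0.577
```

All four return values in [0, 1], with 1 meaning identical attribute sets, so 1 minus the similarity can serve as a dissimilarity where a metric distance is not definable.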
