Course Content Lecture 6 Week 10 (May 12) and Week 11 (May 19) - PowerPoint PPT Presentation

Lecture 6, Week 10 (May 12) and Week 11 (May 19)
33459-01 Principles of Knowledge Discovery in Data
Clustering Analysis: Agglomerative, Hierarchical, Density-based and other approaches
Lecture by: Dr. Osmar R. Zaïane
33459-01: Principles of Knowledge Discovery in Data – March-June, 2006 (Dr. O. Zaiane)

Course Content
• Introduction to Data Mining
• Association analysis
• Sequential Pattern Analysis
• Classification and prediction
• Contrast Sets
• Data Clustering
• Outlier Detection
• Web Mining

What is Classification?
The goal of data classification is to organize and categorize data in distinct classes.
A model is first created based on the data distribution.
The model is then used to classify new data.
Given the model, a class can be predicted for new data.
With classification, I can predict in which bucket to put the ball, but I can’t predict the weight of the ball.
[Figure: a ball marked "?" and buckets 1, 2, 3, 4, …, n]

Classification = Learning a Model
[Figure labels: Training Set (labeled); Classification Model; New unlabeled data; Labeling = Classification]
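The two steps described under "Classification = Learning a Model" (build a model from a labeled training set, then label new data) can be sketched in a few lines. This is only an illustration: the nearest-centroid rule, the function names `train` and `classify`, and the toy data are assumptions, not something the slides prescribe.

```python
from collections import defaultdict
from math import dist


def train(samples):
    """Learn a model: one centroid per class from (features, label) pairs."""
    groups = defaultdict(list)
    for features, label in samples:
        groups[label].append(features)
    # The centroid is the per-dimension mean of the points of a class.
    return {label: tuple(sum(col) / len(pts) for col in zip(*pts))
            for label, pts in groups.items()}


def classify(model, features):
    """Label a new point with the class whose centroid is closest."""
    return min(model, key=lambda label: dist(model[label], features))


# Hypothetical labeled training set (two numeric features per object).
training_set = [((1.0, 1.0), "red"), ((1.0, 2.0), "red"),
                ((8.0, 8.0), "blue"), ((9.0, 8.0), "blue")]

model = train(training_set)
prediction = classify(model, (2.0, 1.0))  # new unlabeled object
print(prediction)
```

The model only answers "which bucket?"; nothing in it predicts a numeric property such as the weight of the ball.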

Supervised and Unsupervised
Supervised Classification = Classification
• We know the class labels and the number of classes
[Figure: buckets 1, 2, 3, 4, …, n labeled gray, red, blue, green, black, …]
Unsupervised Classification = Clustering
• We do not know the class labels and may not know the number of classes
[Figure: buckets 1, 2, 3, 4, …, n each labeled "?"; a second row of buckets 1, 2, 3, 4, … where the number of buckets is also "?"]

What is Clustering?
The process of putting similar data together.
[Figure: objects a, b, c, d, e; labels Grouping, Clustering, Partitioning]
  – Objects are not labeled, i.e. there is no training data.
  – We need a notion of similarity or closeness (what features?)
  – Should we know a priori how many clusters exist?
  – How do we characterize members of groups?
  – How do we label groups?

What is Clustering in Data Mining?
Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.
  – Helps users understand the natural grouping or structure in a data set.
• Cluster: a collection of data objects that are “similar” to one another and thus can be treated collectively as one group.
• A good clustering method produces high quality clusters in which:
  • The intra-class (that is, intra-cluster) similarity is high.
  • The inter-class similarity is low.
• The quality of a clustering result depends on both the similarity measure used and its implementation.
• Clustering = function that maximizes similarity between objects within a cluster and minimizes similarity between objects in different clusters.

Lecture Outline
Part I: What is Clustering in Data Mining (30 minutes)
• Introduction to clustering
• Motivating Examples for clustering
• Taxonomy of Major Clustering Algorithms
• Major Issues in Clustering
• What is Good Clustering?
Part II: Major Clustering Approaches (1 hour 20 minutes)
• K-means (Partitioning-based clustering algorithm)
• Nearest Neighbor clustering algorithm
• Hierarchical Clustering
• Density-based Clustering
Part III: Open Problems (10 minutes)
• Research Issues in Clustering
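The criterion for a good clustering (high intra-cluster similarity, low inter-cluster similarity) can be made concrete with a small sketch. It assumes Euclidean distance as the dissimilarity measure, so "high similarity" becomes "small distance"; the function names and the two toy clusterings are hypothetical.

```python
from itertools import combinations
from math import dist


def mean_intra_distance(clusters):
    """Average distance between pairs of points in the same cluster."""
    pairs = [dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(pairs) / len(pairs)


def mean_inter_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    pairs = [dist(p, q)
             for a, b in combinations(clusters, 2)
             for p in a for q in b]
    return sum(pairs) / len(pairs)


# The same four points, grouped in two different ways.
good = [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]  # neighbours grouped together
bad = [[(0, 0), (5, 5)], [(0, 1), (5, 6)]]   # neighbours split apart

for name, clustering in (("good", good), ("bad", bad)):
    print(name,
          round(mean_intra_distance(clustering), 2),
          round(mean_inter_distance(clustering), 2))
```

For the first grouping the intra-cluster distance is far smaller than the inter-cluster distance; for the second the relation is reversed, which is what a quality criterion of this kind is meant to detect.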

Typical Clustering Applications
• As a stand-alone tool to
  – get insight into data distribution
  – find the characteristics of each cluster
  – assign the cluster of a new example
• As a preprocessing step for other algorithms
  – e.g. numerosity reduction – using cluster centers to represent data in clusters
• It is a building block for many data mining solutions
  – e.g. Recommender systems – group users with similar behaviour or preferences to improve recommendation.

Clustering Example – Fitting Troops
• Fitting the troops – re-design of uniforms for female soldiers in US army
  – Goal: reduce the number of uniform sizes to be kept in inventory while still providing good fit
• Researchers from Cornell University used clustering and designed a new set of sizes
  – Traditional clothing size system: ordered set of graduated sizes where all dimensions increase together
  – The new system: sizes that fit body types
    • E.g. one size for short-legged, small-waisted women with wide and long torsos, average arms, broad shoulders, and skinny necks

Other Examples of Clustering Applications
• Marketing
  – help discover distinct groups of customers, and then use this knowledge to develop targeted marketing programs
• Biology
  – derive plant and animal taxonomies
  – find genes with similar function
• Land use
  – identify areas of similar land use in an earth observation database
• Insurance
  – identify groups of motor insurance policy holders with a high average claim cost
• City-planning
  – identify groups of houses according to their house type, value, and geographical location

Clustering Example - Houses
• Given a dataset it may be clustered on different attributes. The result and its interpretation would be different.
[Figure panels: Group of houses; Clustered based on geographic distance; Clustered based on value; Clustered based on size and value]
Definition of a distance function is highly application dependent.
• Measures “dissimilarity” between pairs of objects x and y
  • Small distance dist(x, y): objects x and y are more similar
  • Large distance dist(x, y): objects x and y are less similar
Properties of a distance function
• dist(x, y) ≥ 0
• dist(x, y) = 0 iff x = y
• dist(x, y) = dist(y, x) (symmetry)
• If dist is a metric, which is often the case: dist(x, z) ≤ dist(x, y) + dist(y, z) (triangle inequality)
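The properties of a distance function listed above can be checked empirically on a sample of points. The sketch below is an assumption-laden illustration: `satisfies_metric_properties` and `squared_euclidean` are hypothetical helper names, and a finite check on a few points can only refute the properties, never prove them in general.

```python
from itertools import product
from math import dist, isclose


def satisfies_metric_properties(d, points, tol=1e-9):
    """Test the four listed properties of d on every combination of points."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:                      # non-negativity
            return False
        if (d(x, y) == 0) != (x == y):       # dist(x, y) = 0 iff x = y
            return False
        if not isclose(d(x, y), d(y, x), abs_tol=tol):  # symmetry
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:  # triangle inequality
            return False
    return True


def squared_euclidean(x, y):
    """A common dissimilarity that is not a metric."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
print(satisfies_metric_properties(dist, points))
print(satisfies_metric_properties(squared_euclidean, points))
```

Euclidean distance passes on this sample, while squared Euclidean distance fails the triangle inequality (4 > 1 + 1 for the three collinear points), which illustrates why the slide says a distance function is a metric "often", not always.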

Major Clustering Techniques
• Clustering techniques have been studied extensively in statistics, machine learning, and data mining, with many methods proposed and studied.
• Clustering methods can be classified into 5 approaches:
  – partitioning algorithms
  – hierarchical algorithms
  – density-based methods
  (We will cover only these.)
  – grid-based methods
  – model-based methods

Five Categories of Clustering Methods
• Partitioning algorithms: Construct various partitions and then evaluate them by some criterion. (K-Means is the best known.)
• Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion. There is an agglomerative approach and a divisive approach.
• Density-based: based on connectivity and density functions.
• Grid-based: based on a multiple-level granularity structure.
• Model-based (generative approach): A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to that model. Generative models are estimated through a maximum likelihood approach. (EM, Expectation Maximization, with a Gaussian Mixture Model is a typical example.)

Important Issues in Clustering
• Different Types of Attributes
  – Numerical: Generally can be represented in a Euclidean space. Distance can be used as a measure of dissimilarity. (See classification slides for measures.)
  – Categorical: A metric space may not be definable. A similarity measure has to be defined: Jaccard, Dice, Overlap, Cosine, etc.
  – Sequence-aware similarity: e.g. DNA sequences, web access behaviour. Can use Dynamic Programming.

Important Issues in Clustering (2)
• Noise and outlier Detection
  – Differentiate remote points from internal ones.
  – Noisy points (errors in data) can artificially split or merge clusters.
  – Distinguishing remote noisy points or very small unexpected clusters can be very important for the validity and quality of the results.
  – Noise can bias the results, especially in the calculation of cluster characteristics.
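For the categorical similarity measures named under "Different Types of Attributes", a minimal sketch follows. It assumes the usual set-based definitions of Jaccard, Dice, Overlap and Cosine over two non-empty sets of categorical values; the slide text itself does not spell the formulas out, and the example objects are hypothetical.

```python
from math import sqrt


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|"""
    return len(a & b) / len(a | b)


def dice(a, b):
    """2 |A ∩ B| / (|A| + |B|)"""
    return 2 * len(a & b) / (len(a) + len(b))


def overlap(a, b):
    """|A ∩ B| / min(|A|, |B|)"""
    return len(a & b) / min(len(a), len(b))


def cosine(a, b):
    """|A ∩ B| / sqrt(|A| * |B|)"""
    return len(a & b) / sqrt(len(a) * len(b))


# Two hypothetical objects described by sets of categorical values.
x = {"red", "round", "small"}
y = {"red", "round", "large", "heavy"}

for measure in (jaccard, dice, overlap, cosine):
    print(measure.__name__, round(measure(x, y), 3))
```

All four return 1 for identical sets and 0 for disjoint sets; they differ in how strongly they penalize values that appear in only one of the two objects.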

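The remark that noise can bias the calculation of cluster characteristics can be illustrated with a single noisy value. The numbers below are made up, and the median is shown only as one familiar robust alternative; the slides do not name it.

```python
from statistics import mean, median

# Hypothetical one-dimensional cluster and the same cluster with one
# erroneous measurement added.
cluster = [10.0, 11.0, 12.0, 13.0, 14.0]
noisy_cluster = cluster + [100.0]

clean_mean = mean(cluster)
noisy_mean = mean(noisy_cluster)      # pulled far away from the bulk
noisy_median = median(noisy_cluster)  # barely moves

print(clean_mean, round(noisy_mean, 2), noisy_median)
```

One bad point moves the cluster mean well outside the range of the genuine members, which is why detecting or down-weighting noise matters before summarizing a cluster.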