
Clustering
CompSci 590.03
Instructor: Ashwin Machanavajjhala
Lecture 17 : 590.02 Spring 13

Clustering Problem
• Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that members of a cluster are in some sense as close to each other as possible.

Example: Clustering News Articles
• Consider some vocabulary V = {v1, v2, …, vk}.
• Each news article is a vector (x1, x2, …, xk), where xi = 1 iff vi appears in the article.
• Documents with similar sets of words correspond to similar topics.
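As a minimal sketch of the binary article vectors described above: the vocabulary, the sample sentences, and the helper name `to_binary_vector` are all made up for illustration, and tokenization is assumed to be simple whitespace splitting.

```python
def to_binary_vector(article_text, vocabulary):
    """Return (x1, ..., xk) with xi = 1 iff vocabulary[i] appears in the article."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer here.
    words = set(article_text.lower().split())
    return [1 if v in words else 0 for v in vocabulary]


# Hypothetical vocabulary and articles, for illustration only.
vocabulary = ["election", "goal", "market", "vote"]
doc_a = "Vote counts from the election are in"
doc_b = "Late goal wins the match"

print(to_binary_vector(doc_a, vocabulary))  # [1, 0, 0, 1]
print(to_binary_vector(doc_b, vocabulary))  # [0, 1, 0, 0]
```

The two articles share no vocabulary words, so their vectors have no 1s in common, which is what a set-based distance would pick up on.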

Example: Clustering Movies (Collaborative Filtering)
• Represent each movie by the set of users who rated it.
• Each movie is a vector (x1, x2, …, xk), where xi is the rating provided by user i.
• Similar movies have similar ratings from the same sets of users.

Example: Protein Sequences
• Objects are sequences of {C, A, T, G}.
• Distance between two sequences is the edit distance, or the minimum number of inserts and deletes needed to change one sequence to another.
• Clusters correspond to proteins with similar sequences.
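The insert/delete edit distance described above can be computed from the longest common subsequence (LCS): every character outside the LCS has to be deleted from one sequence or inserted from the other. The sketch below assumes that formulation; the function names `lcs_length` and `indel_distance` are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(a, b):
    """Minimum number of single-character inserts and deletes turning a into b."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


print(indel_distance("CATG", "CTG"))   # 1 (delete the A)
print(indel_distance("ACGT", "AGCT"))  # 2 (one delete plus one insert)
```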

Outline
• Distance measures
• Clustering algorithms
– K-Means Clustering
– Hierarchical Clustering
• Scaling up Clustering Algorithms
– Canopy Clustering

Distance Measures
• Each clustering problem is based on some notion of distance between objects or points
– Closely related to similarity: a small distance means high similarity
• Euclidean Distance
– Based on a set of m real-valued dimensions
– Euclidean distance is based on the locations of the points in the m-dimensional space
– There is a notion of the average of two points
• Non-Euclidean Distance
– Not based on the location of points
– Notion of average may not be defined

Distance Metric
• A distance function is a metric if it satisfies the following conditions:
• d(x,y) ≥ 0
• d(x,y) = 0 iff x = y
• d(x,y) = d(y,x)
• d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality)
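One way to get a feel for these conditions is to brute-force them on a small sample of points. The helper below is an illustrative sketch (the name `first_violated_axiom` is made up); passing the check on a sample does not prove a function is a metric, but a single failure disproves it.

```python
from itertools import product


def first_violated_axiom(points, d, tol=1e-12):
    """Return the name of the first metric condition d violates on points, or None."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:
            return "non-negativity"
        if (d(x, y) == 0) != (x == y):
            return "identity"
        if abs(d(x, y) - d(y, x)) > tol:
            return "symmetry"
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:
            return "triangle inequality"
    return None


sample = [0, 1, 2]
print(first_violated_axiom(sample, lambda x, y: abs(x - y)))   # None
print(first_violated_axiom(sample, lambda x, y: (x - y) ** 2))  # triangle inequality
```

Squared difference fails because d(0,2) = 4 exceeds d(0,1) + d(1,2) = 2.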

Examples of Distance Metrics
• Lp norm: d(x, y) = \left( \sum_{i=1}^{m} |x_i - y_i|^p \right)^{1/p}
• L2 norm = distance in Euclidean space
• L1 norm = Manhattan distance
• L∞ norm = \max_i |x_i - y_i|
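A small sketch of the Lp family in plain Python, assuming points are equal-length sequences of numbers and that p = infinity is passed as `float("inf")`. The function name `lp_distance` is illustrative.

```python
def lp_distance(x, y, p):
    """Lp distance between equal-length numeric sequences x and y (p >= 1)."""
    if p == float("inf"):
        # L-infinity: largest coordinate-wise absolute difference.
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


x, y = (0, 0), (3, 4)
print(lp_distance(x, y, 1))             # 7.0 (Manhattan)
print(lp_distance(x, y, 2))             # 5.0 (Euclidean)
print(lp_distance(x, y, float("inf")))  # 4
```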

Examples of Distance Metrics
• Jaccard Distance: Let A and B be two sets. Then d(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}
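A minimal sketch of Jaccard distance on Python sets. Treating two empty sets as being at distance 0 is a convention assumed here, since the ratio is otherwise undefined.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumed convention for two empty sets
    return 1.0 - len(a & b) / len(a | b)


print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 0.5
```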

Examples of Distance Metrics
• Cosine Similarity: \cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}
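A short sketch of cosine similarity for equal-length numeric vectors. Raising an error for a zero vector is a design choice assumed here, because the angle is undefined in that case.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


print(cosine_similarity((1, 0), (0, 1)))  # 0.0 (orthogonal vectors)
```

Vectors pointing in the same direction, such as (1, 2) and (2, 4), have similarity 1 up to floating-point rounding, regardless of their lengths.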

Examples of Distance Metrics
• Levenshtein distance, a.k.a. edit distance: the minimum number of single-character inserts, deletes, and substitutions needed to turn one string into another.
– Restricting the edits to inserts and deletes only gives a closely related variant of edit distance.
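A minimal sketch of the usual dynamic program for Levenshtein distance. It assumes the standard definition, in which an insert, a delete, and a substitution each count as one edit; the function name is illustrative.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```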

Outline
• Distance measures
• Clustering algorithms
– K-Means Clustering
– Hierarchical Clustering
• Scaling up Clustering Algorithms
– Canopy Clustering

K-Means
• A very popular point-assignment-based clustering algorithm
• Goal: Partition a set of points into k clusters, such that points within a cluster are closer to each other than to points from different clusters.
• Distance measure is typically Euclidean
– K-medians if distance measure does not permit an average

K-Means
• Input:
– A set of points in m dimensions {x1, x2, …, xn}
– The desired number of clusters K
• Output:
– A mapping from points to clusters C: {1, …, n} → {1, …, K}

K-Means
Input:
• A set of points in m dimensions {x1, x2, …, xn}
• The desired number of clusters K
Output:
• A mapping from points to clusters C: {1, …, n} → {1, …, K}
Algorithm:
• Start with an arbitrary C
• Repeat
– Compute the centroid of each cluster
– Reassign each point to the closest centroid
• Until C converges
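The loop above can be sketched in plain Python as follows. This is an illustrative version, not a reference implementation: it assumes squared Euclidean distance, takes the arbitrary starting mapping C as a list of cluster indices, keeps the previous centroid when a cluster becomes empty, and caps the number of iterations with a made-up `max_iters` parameter.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def kmeans(points, k, assignment, max_iters=100):
    """Run K-means from an initial mapping C (assignment[i] is in range(k))."""
    assignment = list(assignment)
    centroids = [None] * k
    for _ in range(max_iters):
        # Step 1: compute the centroid of each cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = centroid(members)
        # Step 2: reassign each point to the closest centroid.
        live = [c for c in range(k) if centroids[c] is not None]
        new_assignment = [min(live, key=lambda c: sq_dist(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:  # C has converged
            break
        assignment = new_assignment
    return assignment, centroids


points = [(0, 0), (0, 1), (5, 5), (6, 5)]
labels, centers = kmeans(points, k=2, assignment=[0, 1, 0, 1])
print(labels)   # [0, 0, 1, 1]
print(centers)  # [(0.0, 0.5), (5.5, 5.0)]
```

Starting from a deliberately poor mapping that mixes the two groups, one reassignment step is enough here to separate them.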

Example

Initialize Clusters

Compute Centroids

Reassign Clusters

Recompute Centroids

Reassign Clusters

Recompute Centroids – Done!

Questions
• What is a good value for K?
• Does K-means always terminate?
• How should we choose initial cluster centers?

Determining K
[Figure: average diameter plotted against number of clusters, with the correct value of k marked]
• Small k: Many points have large distances to their centroid
• Large k: No significant improvement in average diameter (the diameter of a cluster is the maximum distance between any two of its points)
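To make the quantity on the vertical axis concrete, here is a rough sketch that computes the average diameter of a given clustering. It assumes Euclidean distance and that cluster labels come from some earlier clustering run; the function names and the sample labelings are illustrative.

```python
import math
from itertools import combinations


def diameter(cluster):
    """Maximum distance between any two points of a cluster (0 for a singleton)."""
    return max((math.dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)


def average_diameter(points, assignment):
    """Mean diameter over the clusters defined by assignment[i] for points[i]."""
    clusters = {}
    for p, label in zip(points, assignment):
        clusters.setdefault(label, []).append(p)
    return sum(diameter(c) for c in clusters.values()) / len(clusters)


points = [(0, 0), (0, 1), (5, 5), (6, 5)]
print(average_diameter(points, [0, 0, 0, 0]))  # k = 1: about 7.81
print(average_diameter(points, [0, 0, 1, 1]))  # k = 2: 1.0
print(average_diameter(points, [0, 1, 2, 3]))  # k = 4: 0.0
```

On this toy data the big drop happens between k = 1 and k = 2, and going beyond that yields comparatively little, which is the pattern the bullets describe.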

K-means as an optimization problem
• Let ENCODE be a function mapping points in the dataset to {1…k}
• Let DECODE be a function mapping {1…k} to a point
• Alternately, if we write DECODE[j] = cj, we need to find an ENCODE function and k points c1, …, ck
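Assuming the objective meant here is the usual K-means sum of squared distances between each point and the decoding of its code (often called the distortion), the two formulations can be written as:

```latex
\min_{\mathrm{ENCODE},\,\mathrm{DECODE}} \;
  \sum_{i=1}^{n} \bigl\| x_i - \mathrm{DECODE}[\mathrm{ENCODE}(x_i)] \bigr\|^2
\qquad\Longleftrightarrow\qquad
\min_{\mathrm{ENCODE},\; c_1, \dots, c_k} \;
  \sum_{i=1}^{n} \bigl\| x_i - c_{\mathrm{ENCODE}(x_i)} \bigr\|^2
```

Here x_1, …, x_n are the input points, and the second form simply substitutes DECODE[j] = c_j.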

K-means terminates
• Consider the objective function.
• There are finitely many possible clusterings (at most K^n)
• Each time we reassign a point to a nearer cluster, the objective decreases.
• Every time we recompute the centroids, the objective either stays the same or decreases.
• Since the objective strictly decreases whenever the clustering changes, no clustering is ever visited twice. Therefore the algorithm has to terminate.

Local optima
• Depending on initialization, K-means can converge to different local optima.
[Figure: example]
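A small, made-up example makes this concrete. The sketch below runs the assign/recompute loop from two different sets of initial centers on the four corners of a 4-by-1 rectangle. It assumes squared Euclidean distance and uses an illustrative function name, `lloyd`, that takes initial centers rather than an initial mapping.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd(points, centroids, max_iters=100):
    """K-means from given initial centers; returns (assignment, objective)."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iters):
        new_assignment = [min(range(len(centroids)),
                              key=lambda c: sq_dist(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old center
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    cost = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, cost


corners = [(0, 0), (0, 1), (4, 0), (4, 1)]
print(lloyd(corners, [(2, 0), (2, 1)]))      # ([0, 1, 0, 1], 16.0)
print(lloyd(corners, [(0, 0.5), (4, 0.5)]))  # ([0, 0, 1, 1], 1.0)
```

Both runs are stable under further iterations, but the first groups the bottom edge against the top edge and ends with an objective sixteen times worse than the left/right split found by the second.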
