INFO 4300 / CS4300 Information Retrieval
slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/
IR 22/26: Hierarchical Clustering
Paul Ginsparg
Cornell University, Ithaca, NY
17 Nov 2009
1 / 37
Overview
1 Recap
2 Introduction to Hierarchical clustering
2 / 37
Outline
1 Recap
2 Introduction to Hierarchical clustering
3 / 37
Applications of clustering in IR

Application              | What is clustered?      | Benefit                                                      | Example
Search result clustering | search results          | more effective information presentation to user             |
Scatter-Gather           | (subsets of) collection | alternative user interface: "search without typing"         |
Collection clustering    | collection              | effective information presentation for exploratory browsing | McKeown et al. 2002, news.google.com
Cluster-based retrieval  | collection              | higher efficiency: faster search                             | Salton 1971
4 / 37
K-means algorithm

K-means({x_1, ..., x_N}, K)
 1  (s_1, s_2, ..., s_K) ← SelectRandomSeeds({x_1, ..., x_N}, K)
 2  for k ← 1 to K
 3      do μ_k ← s_k
 4  while stopping criterion has not been met
 5      do for k ← 1 to K
 6             do ω_k ← {}
 7         for n ← 1 to N
 8             do j ← arg min_{j'} |μ_{j'} − x_n|
 9                ω_j ← ω_j ∪ {x_n}    (reassignment of vectors)
10         for k ← 1 to K
11             do μ_k ← (1/|ω_k|) Σ_{x ∈ ω_k} x    (recomputation of centroids)
12  return {μ_1, ..., μ_K}
5 / 37
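Not part of the original slides: a minimal NumPy sketch of the pseudocode above, for concreteness. The function name kmeans, the iteration cap, and the "centroids stop moving" stopping criterion are my own choices.

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        """X: (N, d) array of points; returns (centroids, cluster assignment)."""
        rng = np.random.default_rng(seed)
        # SelectRandomSeeds: K distinct data points as initial centroids
        mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
        assign = np.zeros(len(X), dtype=int)
        for _ in range(n_iter):
            # reassignment: each point joins the cluster of its nearest centroid
            dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
            assign = dists.argmin(axis=1)
            # recomputation: each centroid becomes the mean of its assigned points
            new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                               else mu[k] for k in range(K)])
            if np.allclose(new_mu, mu):   # stopping criterion: centroids unchanged
                break
            mu = new_mu
        return mu, assign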
Initialization of K-means
Random seed selection is just one of many ways K-means can be initialized.
Random seed selection is not very robust: it's easy to get a suboptimal clustering.
Better heuristics:
- Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has "good coverage" of the document space)
- Use hierarchical clustering to find good seeds (next class)
- Select i (e.g., i = 10) different sets of seeds, do a K-means clustering for each, select the clustering with lowest RSS (a short sketch follows below)
6 / 37
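Not on the slide: if scikit-learn happens to be available, its KMeans already implements the last heuristic; n_init controls how many different seedings are tried, and the run with the lowest RSS (exposed as inertia_) is kept.

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(100, 5)        # 100 toy "document" vectors in 5 dimensions
    # 10 different random seedings; the fit with the lowest RSS wins
    km = KMeans(n_clusters=4, init="random", n_init=10, random_state=0).fit(X)
    print(km.inertia_)                # RSS of the best of the 10 runs
    print(km.cluster_centers_.shape)  # (4, 5)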
External criterion: Purity

purity(Ω, C) = (1/N) Σ_k max_j |ω_k ∩ c_j|

Ω = {ω_1, ω_2, ..., ω_K} is the set of clusters and C = {c_1, c_2, ..., c_J} is the set of classes.
For each cluster ω_k: find the class c_j with the most members n_kj in ω_k.
Sum all n_kj and divide by the total number of points N.
7 / 37
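Not in the original slides: a small Python sketch of the purity computation, with my own choice of data representation (a dict of clusters and a dict of gold labels).

    from collections import Counter

    def purity(clusters, classes):
        """clusters: dict cluster_id -> list of item ids
           classes:  dict item_id -> gold class label"""
        N = sum(len(members) for members in clusters.values())
        # for each cluster, count how many items belong to its majority class
        majority_total = sum(Counter(classes[x] for x in members).most_common(1)[0][1]
                             for members in clusters.values())
        return majority_total / N

    clusters = {0: ["d1", "d2", "d3"], 1: ["d4", "d5"]}
    classes = {"d1": "A", "d2": "A", "d3": "B", "d4": "B", "d5": "B"}
    print(purity(clusters, classes))   # (2 + 2) / 5 = 0.8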
Discussion 6
Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. USENIX OSDI '04, 2004.
http://www.usenix.org/events/osdi04/tech/full_papers/dean/dean.pdf
See also (Jan 2009): http://michaelnielsen.org/blog/write-your-first-mapreduce-program-in-20-minutes/
part of lectures on the "Google Technology Stack": http://michaelnielsen.org/blog/lecture-course-the-google-technology-stack/ (including PageRank, etc.)
8 / 37
Some Questions
Who are the authors? When was it written? When was the work started?
What is the problem they were trying to solve?
Is there a compiler that will automatically parallelize the most general program?
How does the example in section 2.1 work?
What are other examples of algorithms amenable to the MapReduce methodology?
What's going on in Figure 1? What happens between the map and reduce steps?
map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v2)
9 / 37
Wordcount example
from http://michaelnielsen.org/blog/write-your-first-mapreduce-program-in-20-minutes/
a.txt: The quick brown fox jumped over the lazy grey dogs.
b.txt: That's one small step for a man, one giant leap for mankind.
c.txt: Mary had a little lamb, Its fleece was white as snow; And everywhere that Mary went, The lamb was sure to go.
10 / 37
Map
mapper("a.txt", i["a.txt"]) returns:
[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1)]

import string

def mapper(input_key, input_value):
    return [(word, 1) for word in
            remove_punctuation(input_value.lower()).split()]

def remove_punctuation(s):
    return s.translate(string.maketrans("", ""), string.punctuation)
11 / 37
Output of the map phase
[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1), ('mary', 1), ('had', 1), ('a', 1), ('little', 1), ('lamb', 1), ('its', 1), ('fleece', 1), ('was', 1), ('white', 1), ('as', 1), ('snow', 1), ('and', 1), ('everywhere', 1), ('that', 1), ('mary', 1), ('went', 1), ('the', 1), ('lamb', 1), ('was', 1), ('sure', 1), ('to', 1), ('go', 1), ('thats', 1), ('one', 1), ('small', 1), ('step', 1), ('for', 1), ('a', 1), ('man', 1), ('one', 1), ('giant', 1), ('leap', 1), ('for', 1), ('mankind', 1)]
12 / 37
Combine gives
{'and': [1], 'fox': [1], 'over': [1], 'one': [1, 1], 'as': [1], 'go': [1], 'its': [1], 'lamb': [1, 1], 'giant': [1], 'for': [1, 1], 'jumped': [1], 'had': [1], 'snow': [1], 'to': [1], 'leap': [1], 'white': [1], 'was': [1, 1], 'mary': [1, 1], 'brown': [1], 'lazy': [1], 'sure': [1], 'that': [1], 'little': [1], 'small': [1], 'step': [1], 'everywhere': [1], 'mankind': [1], 'went': [1], 'man': [1], 'a': [1, 1], 'fleece': [1], 'grey': [1], 'dogs': [1], 'quick': [1], 'the': [1, 1, 1], 'thats': [1]}
13 / 37
Output of the reduce phase

def reducer(intermediate_key, intermediate_value_list):
    return (intermediate_key, sum(intermediate_value_list))

[('and', 1), ('fox', 1), ('over', 1), ('one', 2), ('as', 1), ('go', 1), ('its', 1), ('lamb', 2), ('giant', 1), ('for', 2), ('jumped', 1), ('had', 1), ('snow', 1), ('to', 1), ('leap', 1), ('white', 1), ('was', 2), ('mary', 2), ('brown', 1), ('lazy', 1), ('sure', 1), ('that', 1), ('little', 1), ('small', 1), ('step', 1), ('everywhere', 1), ('mankind', 1), ('went', 1), ('man', 1), ('a', 2), ('fleece', 1), ('grey', 1), ('dogs', 1), ('quick', 1), ('the', 3), ('thats', 1)]
14 / 37
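Not the blog's exact driver code, but a minimal single-machine sketch of how the three phases fit together: run mapper over every input, group intermediate pairs by key (the "combine" step shown two slides back), then run reducer on each group.

    from collections import defaultdict

    def map_reduce(inputs, mapper, reducer):
        intermediate = defaultdict(list)
        for key, value in inputs.items():
            for k2, v2 in mapper(key, value):
                intermediate[k2].append(v2)   # e.g. 'the': [1, 1, 1]
        # a real MapReduce would shuffle these groups across machines here
        return [reducer(k2, values) for k2, values in intermediate.items()]

    # Usage with the mapper/reducer above, where i maps filenames to file contents:
    # i = {name: open(name).read() for name in ("a.txt", "b.txt", "c.txt")}
    # print(map_reduce(i, mapper, reducer))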
PageRank example, P_jk = A_jk / d_j

Input (key, value) to MapReduce:
key = id j of the webpage
value contains data describing the page: current r_j, out-degree d_j, and a list [k_1, k_2, ..., k_{d_j}] of pages to which it links
For each of the latter pages k_a, a = 1, ..., d_j, the mapper outputs an intermediate key-value pair [k_a, r_j / d_j] (where r_j / d_j is the contribution to the PageRank from page j to page k_a, and corresponds to a random websurfer moving from j to k_a: it combines the probability r_j of starting at page j with the probability 1/d_j of moving from j to k_a).
Between the map and reduce phases, MapReduce collects all intermediate values corresponding to any given intermediate key k (the list of all probabilities of moving to page k).
The reducer sums up the probabilities, outputting the result as the second entry in the pair (k, r'_k), giving the entries of r' = r P, as desired.
15 / 37
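Not from the paper or the slide, but a hedged sketch of the step just described, with invented function and field names: the mapper spreads r_j / d_j to each outlink and the reducer sums what arrives at each page (the damping/teleportation term is omitted, matching the simplified r' = r P step above).

    def pr_mapper(page_id, page):
        # page is assumed to look like {"rank": r_j, "links": [k_1, ..., k_dj]}
        d_j = len(page["links"])
        return [(k_a, page["rank"] / d_j) for k_a in page["links"]]

    def pr_reducer(page_id, contributions):
        # new rank r'_k = sum of the probabilities of arriving at page k
        return (page_id, sum(contributions))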
k-means clustering, e.g., Netflix data
Goal: find similar movies from ratings provided by users
Vector model:
- Give each movie a vector
- Make one dimension per user
- Put origin at average rating (so poor is negative)
- Normalize all vectors to unit length (cosine similarity)
Issues:
(-) Users are biased in the movies they rate
(+) Addresses different numbers of raters
16 / 37
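Not spelled out on the slide: one plausible reading of the vector construction, with an assumed average rating of 3.0 as the origin (the slide does not say which average is meant).

    import numpy as np

    def movie_vector(movie_ratings, n_users, avg_rating=3.0):
        """movie_ratings: list of (user_index, rating) pairs for one movie."""
        v = np.zeros(n_users)               # one dimension per user
        for u, r in movie_ratings:
            v[u] = r - avg_rating           # origin at the average rating: poor < 0
        norm = np.linalg.norm(v)
        return v / norm if norm > 0 else v  # unit length, so dot product = cosine

    # similarity between two movies is then a plain dot product:
    # sim = movie_vector(r1, n_users) @ movie_vector(r2, n_users)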
k-means clustering
Goal: cluster similar data points
Approach: given data points and a distance function,
- select k centroids μ_a
- assign x_i to closest centroid μ_a
- minimize Σ_{a,i} d(x_i, μ_a)
Algorithm:
1. randomly pick centroids, possibly from data points
2. assign points to closest centroid
3. average assigned points to obtain new centroids
4. repeat 2, 3 until nothing changes
Issues:
(-) takes superpolynomial time on some inputs
(-) not guaranteed to find optimal solution
(+) converges quickly in practice
17 / 37
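A tiny usage example (mine, not from the slides), assuming the kmeans sketch given after the K-means pseudocode slide: two obvious blobs in 2-D.

    import numpy as np

    X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                  [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
    centroids, assignment = kmeans(X, K=2)   # kmeans as sketched earlier
    print(centroids)    # two centroids near (0.1, 0.1) and (5.0, 5.0)
    print(assignment)   # e.g. [0 0 0 1 1 1] (cluster labels may be swapped)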
Iterative MapReduce (from http://kheafield.com/professional/google/more.pdf ) 18 / 37
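The linked slides are not reproduced here, but as a hedged illustration of why k-means is the canonical iterative-MapReduce example: each iteration is one MapReduce round, with the current centroids handed to the mappers and a driver looping over rounds (all names below are mine).

    import numpy as np

    def km_mapper(point_id, point, centroids):
        k = int(np.argmin(np.linalg.norm(centroids - point, axis=1)))
        return [(k, point)]                    # key = index of the nearest centroid

    def km_reducer(k, points):
        return (k, np.mean(points, axis=0))    # new centroid = mean of its points

    def kmeans_mapreduce(X, centroids, n_rounds=10):
        for _ in range(n_rounds):              # driver: one MapReduce job per round
            groups = {}
            for i, x in enumerate(X):
                for k, p in km_mapper(i, x, centroids):
                    groups.setdefault(k, []).append(p)
            for k, pts in groups.items():
                centroids[k] = km_reducer(k, pts)[1]
        return centroids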
Outline
1 Recap
2 Introduction to Hierarchical clustering
19 / 37
Hierarchical clustering
Our goal in hierarchical clustering is to create a hierarchy like the one we saw earlier for Reuters:
[Example hierarchy: TOP splits into regions (Kenya, China, UK, France) and industries (poultry, oil & gas, coffee).]
We want to create this hierarchy automatically. We can do this either top-down or bottom-up. The best known bottom-up method is hierarchical agglomerative clustering.
20 / 37
Hierarchical agglomerative clustering (HAC)
HAC creates a hierarchy in the form of a binary tree.
It assumes a similarity measure for determining the similarity of two clusters.
Up to now, our similarity measures were for documents.
We will look at four different cluster similarity measures.
21 / 37
Hierarchical agglomerative clustering (HAC)
- Start with each document in a separate cluster
- Then repeatedly merge the two clusters that are most similar
- Until there is only one cluster
The history of merging is a hierarchy in the form of a binary tree.
The standard way of depicting this history is a dendrogram.
22 / 37
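Not part of the slide: a naive O(N^3) sketch of this bottom-up loop, using single-link similarity (maximum over cross-cluster pairs) as one example of the cluster similarity measures to be discussed; the representation of the merge history is my own choice.

    import numpy as np

    def hac(X, sim):
        """X: list of document vectors; sim(a, b): similarity of two vectors.
           Returns the merge history as (members_i, members_j, similarity) triples."""
        clusters = [[i] for i in range(len(X))]      # each document in its own cluster
        history = []
        while len(clusters) > 1:
            # find the pair of clusters that are most similar (single link)
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    s = max(sim(X[i], X[j]) for i in clusters[a] for j in clusters[b])
                    if best is None or s > best[2]:
                        best = (a, b, s)
            a, b, s = best
            history.append((list(clusters[a]), list(clusters[b]), s))
            clusters[a] = clusters[a] + clusters[b]  # merge the two most similar clusters
            del clusters[b]
        return history

    cosine = lambda u, v: float(np.dot(u, v))        # assumes unit-length vectors

Read bottom-up, the merge history is exactly the binary tree a dendrogram depicts.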