SLIDE 1

INFO 4300 / CS4300 Information Retrieval, slides adapted from Hinrich Schütze’s, linked from http://informationretrieval.org/

IR 22/26: Hierarchical Clustering

Paul Ginsparg

Cornell University, Ithaca, NY

17 Nov 2009

SLIDE 2

Overview

1. Recap
2. Introduction to hierarchical clustering

SLIDE 3

Outline

1. Recap
2. Introduction to hierarchical clustering

SLIDE 4

Applications of clustering in IR

| Application              | What is clustered?       | Benefit                                                    | Example                               |
|--------------------------|--------------------------|------------------------------------------------------------|---------------------------------------|
| Search result clustering | search results           | more effective information presentation to user            |                                       |
| Scatter-Gather           | (subsets of) collection  | alternative user interface: “search without typing”        |                                       |
| Collection clustering    | collection               | effective information presentation for exploratory browsing | McKeown et al. 2002, news.google.com |
| Cluster-based retrieval  | collection               | higher efficiency: faster search                           | Salton 1971                           |

SLIDE 5

K-means algorithm

K-means({x1, . . . , xN}, K)
 1  (s1, s2, . . . , sK) ← SelectRandomSeeds({x1, . . . , xN}, K)
 2  for k ← 1 to K
 3  do µk ← sk
 4  while stopping criterion has not been met
 5  do for k ← 1 to K
 6     do ωk ← {}
 7     for n ← 1 to N
 8     do j ← arg min_j′ |µj′ − xn|
 9        ωj ← ωj ∪ {xn}  (reassignment of vectors)
10     for k ← 1 to K
11     do µk ← (1/|ωk|) Σ_{x∈ωk} x  (recomputation of centroids)
12  return {µ1, . . . , µK}
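
The pseudocode maps almost line for line onto Python. A minimal sketch (mine, not from the slides), using numpy and a fixed iteration cap plus a convergence test as the stopping criterion:

import random
import numpy as np

def kmeans(xs, K, iters=100, seed=0):
    # Lloyd's algorithm, following the pseudocode above.
    # xs: array of shape (N, d); returns (centroids, assignments).
    rng = random.Random(seed)
    mus = np.array(rng.sample(list(xs), K))        # SelectRandomSeeds
    for _ in range(iters):                         # stopping criterion
        # reassignment of vectors: j <- arg min_j' |mu_j' - x_n|
        dists = np.linalg.norm(xs[:, None, :] - mus[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # recomputation of centroids: mu_k <- (1/|w_k|) sum over w_k
        new_mus = np.array([xs[assign == k].mean(axis=0) if (assign == k).any()
                            else mus[k] for k in range(K)])
        if np.allclose(new_mus, mus):              # nothing moved: done
            break
        mus = new_mus
    return mus, assign

For example, mus, assign = kmeans(np.random.rand(100, 2), K=3) clusters 100 random points in the plane. Empty clusters simply keep their old centroid here; handling them differently is another design choice.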

SLIDE 6

Initialization of K-means

Random seed selection is just one of many ways K-means can be initialized, and it is not very robust: it’s easy to get a suboptimal clustering. Better heuristics:

• Select seeds not randomly, but using some heuristic (e.g., filter out outliers, or find a set of seeds that has “good coverage” of the document space)
• Use hierarchical clustering to find good seeds (next class)
• Select i (e.g., i = 10) different sets of seeds, do a K-means clustering for each, and select the clustering with lowest RSS (see the sketch after this list)
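
The last heuristic is short in code. A sketch reusing the kmeans function and numpy import from the previous slide, with RSS defined as the sum of squared distances of each point to its assigned centroid:

def rss(xs, mus, assign):
    # residual sum of squares: sum_k sum_{x in w_k} |x - mu_k|^2
    return sum(np.sum((xs[assign == k] - mus[k]) ** 2) for k in range(len(mus)))

def kmeans_restarts(xs, K, i=10):
    # run K-means from i different seed sets, keep the lowest-RSS clustering
    runs = [kmeans(xs, K, seed=s) for s in range(i)]
    return min(runs, key=lambda run: rss(xs, *run))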

SLIDE 7

External criterion: Purity

purity(Ω, C) = (1/N) Σ_k max_j |ωk ∩ cj|

Ω = {ω1, ω2, . . . , ωK} is the set of clusters and C = {c1, c2, . . . , cJ} is the set of classes. For each cluster ωk: find the class cj with the most members nkj in ωk. Sum all the nkj and divide by the total number of points.
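
In code, purity is a few lines. A sketch (the label-list encoding is my choice): cluster_of[n] and class_of[n] give the cluster and class labels of point n.

from collections import Counter

def purity(cluster_of, class_of):
    N = len(cluster_of)
    counts = Counter(zip(cluster_of, class_of))    # n_kj = |w_k intersect c_j|
    best = Counter()
    for (k, j), n_kj in counts.items():
        best[k] = max(best[k], n_kj)               # max_j n_kj for each cluster k
    return sum(best.values()) / N

For instance, purity([1, 1, 2, 2, 2], ['a', 'b', 'b', 'b', 'a']) is (1 + 2) / 5 = 0.6.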

SLIDE 8

Discussion 6

Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. USENIX OSDI ’04, 2004. http://www.usenix.org/events/osdi04/tech/full_papers/dean/dean.pdf See also (Jan 2009):

http://michaelnielsen.org/blog/write-your-first-mapreduce-program-in-20-minutes/

part of lectures on “google technology stack”:

http://michaelnielsen.org/blog/lecture-course-the-google-technology-stack/

(including PageRank, etc.)

SLIDE 9

Some Questions

• Who are the authors? When was it written? When was the work started?
• What is the problem they were trying to solve?
• Is there a compiler that will automatically parallelize the most general program?
• How does the example in section 2.1 work?
• What are other examples of algorithms amenable to the MapReduce methodology?
• What’s going on in Figure 1? What happens between the map and reduce steps?

map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v2)

SLIDE 10

Wordcount example

from

http://michaelnielsen.org/blog/write-your-first-mapreduce-program-in-20-minutes/

a.txt: The quick brown fox jumped over the lazy grey dogs.
b.txt: That’s one small step for a man, one giant leap for mankind.
c.txt: Mary had a little lamb, Its fleece was white as snow; And everywhere that Mary went, The lamb was sure to go.

SLIDE 11

Map

mapper("a.txt", i["a.txt"]) returns:

[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1)]

import string

def mapper(input_key, input_value):
    return [(word, 1) for word in
            remove_punctuation(input_value.lower()).split()]

def remove_punctuation(s):
    # Python 2 idiom; in Python 3: s.translate(str.maketrans('', '', string.punctuation))
    return s.translate(string.maketrans("", ""), string.punctuation)

SLIDE 12

Output of the map phase

[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1), ('mary', 1), ('had', 1), ('a', 1), ('little', 1), ('lamb', 1), ('its', 1), ('fleece', 1), ('was', 1), ('white', 1), ('as', 1), ('snow', 1), ('and', 1), ('everywhere', 1), ('that', 1), ('mary', 1), ('went', 1), ('the', 1), ('lamb', 1), ('was', 1), ('sure', 1), ('to', 1), ('go', 1), ('thats', 1), ('one', 1), ('small', 1), ('step', 1), ('for', 1), ('a', 1), ('man', 1), ('one', 1), ('giant', 1), ('leap', 1), ('for', 1), ('mankind', 1)]

SLIDE 13

Combine gives

{'and': [1], 'fox': [1], 'over': [1], 'one': [1, 1], 'as': [1], 'go': [1], 'its': [1], 'lamb': [1, 1], 'giant': [1], 'for': [1, 1], 'jumped': [1], 'had': [1], 'snow': [1], 'to': [1], 'leap': [1], 'white': [1], 'was': [1, 1], 'mary': [1, 1], 'brown': [1], 'lazy': [1], 'sure': [1], 'that': [1], 'little': [1], 'small': [1], 'step': [1], 'everywhere': [1], 'mankind': [1], 'went': [1], 'man': [1], 'a': [1, 1], 'fleece': [1], 'grey': [1], 'dogs': [1], 'quick': [1], 'the': [1, 1, 1], 'thats': [1]}
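
This grouping is the step between map and reduce: every intermediate value is collected under its key. A sketch of that step (the function name is mine; Nielsen's tutorial does this inside its driver):

def group_values(intermediate):
    # gather all (key, value) pairs emitted by the mappers under their key
    groups = {}
    for key, value in intermediate:
        groups.setdefault(key, []).append(value)
    return groups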

SLIDE 14

Output of the reduce phase

def reducer(intermediate_key, intermediate_value_list):
    return (intermediate_key, sum(intermediate_value_list))

[('and', 1), ('fox', 1), ('over', 1), ('one', 2), ('as', 1), ('go', 1), ('its', 1), ('lamb', 2), ('giant', 1), ('for', 2), ('jumped', 1), ('had', 1), ('snow', 1), ('to', 1), ('leap', 1), ('white', 1), ('was', 2), ('mary', 2), ('brown', 1), ('lazy', 1), ('sure', 1), ('that', 1), ('little', 1), ('small', 1), ('step', 1), ('everywhere', 1), ('mankind', 1), ('went', 1), ('man', 1), ('a', 2), ('fleece', 1), ('grey', 1), ('dogs', 1), ('quick', 1), ('the', 3), ('thats', 1)]
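
Putting the three phases together, a minimal single-machine driver in the spirit of the tutorial (the map_reduce name and the dictionary i mapping filenames to contents are assumptions; note the slide-11 mapper is Python 2):

def map_reduce(i, mapper, reducer):
    # map phase: run the mapper over every (input key, input value) pair
    intermediate = []
    for key, value in i.items():
        intermediate.extend(mapper(key, value))
    # group phase (previous slide), then reduce phase
    groups = group_values(intermediate)
    return [reducer(key, values) for key, values in groups.items()]

i = {"a.txt": "The quick brown fox jumped over the lazy grey dogs.",
     "b.txt": "That's one small step for a man, one giant leap for mankind.",
     "c.txt": "Mary had a little lamb, Its fleece was white as snow; "
              "And everywhere that Mary went, The lamb was sure to go."}
print(map_reduce(i, mapper, reducer))    # the word counts shown above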

SLIDE 15

PageRank example, Pjk = Ajk/dj

Input (key, value) to MapReduce:
• key = id j of the webpage
• value contains data describing the page: current rj, out-degree dj, and a list [k1, k2, . . . , kdj] of pages to which it links

For each of the latter pages ka, a = 1, . . . , dj, the mapper outputs an intermediate key-value pair [ka, rj/dj] (where rj/dj is the contribution to the PageRank from page j to page ka, and corresponds to a random websurfer moving from j to ka: it combines the probability rj of starting at page j with the probability 1/dj of moving from j to ka).

Between the map and reduce phases, MapReduce collects all intermediate values corresponding to any given intermediate key k (the list of all probabilities of moving to page k). The reducer sums up the probabilities, outputting the result as the second entry in the pair (k, r′k), giving the entries of rP = r′, as desired.
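
In the word-count framework above, one such PageRank iteration could look like this sketch (the page record layout, and pages mapping each page id to its record, are my assumptions; as on the slide, there is no damping or dangling-page handling):

def pr_mapper(page_id, page):
    # page = (r_j, d_j, [k_1, ..., k_dj]): emit r_j/d_j for each outlink
    r, d, outlinks = page
    return [(k, r / d) for k in outlinks]

def pr_reducer(page_id, contributions):
    # r'_k = sum of all probabilities of moving to page k
    return (page_id, sum(contributions))

new_ranks = dict(map_reduce(pages, pr_mapper, pr_reducer))   # r' = rP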

SLIDE 16

k-means clustering, e.g., Netflix data

Goal: find similar movies from the ratings provided by users.

Vector model (see the sketch below):
• Give each movie a vector
• Make one dimension per user
• Put the origin at the average rating (so poor ratings are negative)
• Normalize all vectors to unit length (cosine similarity)

Issues:
− Users are biased in the movies they rate
+ Addresses different numbers of raters
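
A sketch of the vector construction (the ratings[movie][user] layout, the global average as origin, and treating unrated users as average are all my assumptions):

import numpy as np

def movie_vectors(ratings, users):
    # one dimension per user; origin at the average rating, so
    # below-average ratings come out negative; unit length for cosine
    avg = np.mean([r for by_user in ratings.values() for r in by_user.values()])
    vecs = {}
    for movie, by_user in ratings.items():
        v = np.array([by_user.get(u, avg) - avg for u in users])
        norm = np.linalg.norm(v)
        vecs[movie] = v / norm if norm > 0 else v
    return vecs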

SLIDE 17

k-means clustering

Goal: cluster similar data points.

Approach, given data points and a distance function:
1. select k centroids µa
2. assign each xi to the closest centroid µa
3. minimize Σ_{a,i} d(xi, µa)

Algorithm:
1. randomly pick centroids, possibly from the data points
2. assign points to the closest centroid
3. average assigned points to obtain new centroids
4. repeat 2, 3 until nothing changes

Issues:
− takes superpolynomial time on some inputs
− not guaranteed to find an optimal solution
+ converges quickly in practice

SLIDE 18

Iterative MapReduce

(from http://kheafield.com/professional/google/more.pdf)
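
The linked slides express K-means as repeated MapReduce passes; the figure is not reproduced here. A rough sketch (mine) of the control loop, reusing map_reduce from the word-count example; for this loop to be well-typed, the reducer must emit records in the same format the mapper consumes:

def iterate_map_reduce(state, mapper, reducer, passes=20):
    # chain MapReduce jobs: the reduce output of one pass becomes
    # the map input of the next (e.g., one pass per K-means step)
    for _ in range(passes):
        state = dict(map_reduce(state, mapper, reducer))
    return state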

SLIDE 19

Outline

1. Recap
2. Introduction to hierarchical clustering

SLIDE 20

Hierarchical clustering

Our goal in hierarchical clustering is to create a hierarchy like the one we saw earlier in Reuters:

[Figure: the Reuters topic hierarchy: TOP splits into regions (France, UK, China, Kenya) and industries (oil & gas, coffee, poultry)]

We want to create this hierarchy automatically. We can do this either top-down or bottom-up. The best known bottom-up method is hierarchical agglomerative clustering.

SLIDE 21

Hierarchical agglomerative clustering (HAC)

HAC creates a hierarchy in the form of a binary tree. It assumes a similarity measure for determining the similarity of two clusters. Up to now, our similarity measures were for documents. We will look at four different cluster similarity measures.

SLIDE 22

Hierarchical agglomerative clustering (HAC)

• Start with each document in a separate cluster.
• Then repeatedly merge the two clusters that are most similar,
• until there is only one cluster.

The history of merging is a hierarchy in the form of a binary tree. The standard way of depicting this history is a dendrogram.

SLIDE 23

A dendrogram

[Figure: dendrogram of a clustering of 30 Reuters-RCV1 news stories, with the similarity axis running from 1.0 to 0.0; leaf labels include “Ag trade reform.”, “War hero Colin Powell”, “Lawsuit against tobacco companies”, and “Fed holds interest rates steady”]

The history of mergers can be read off from left to right. The vertical line of each merger tells us what the similarity of the merger was. We can cut the dendrogram at a particular point (e.g., at 0.1 or 0.4) to get a flat clustering.
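
With standard tools, cutting the dendrogram is one call. A sketch using SciPy, which works with distances rather than similarities, so the cut threshold is a distance (the method and threshold here are illustrative, not from the lecture):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.rand(30, 2)            # 30 "documents" as 2-d points
Z = linkage(X, method='single')      # HAC merge history
dendrogram(Z)                        # plot it (requires matplotlib)
flat = fcluster(Z, t=0.4, criterion='distance')   # flat clustering: cut at 0.4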

SLIDE 24

Divisive clustering

Divisive clustering is top-down, an alternative to HAC (which is bottom-up).

Divisive clustering:
• Start with all docs in one big cluster
• Then recursively split clusters
• Eventually each node forms a cluster on its own

→ Bisecting K-means at the end. For now: HAC (= bottom-up).

SLIDE 25

Naive HAC algorithm

SimpleHAC(d1, . . . , dN)
 1  for n ← 1 to N
 2  do for i ← 1 to N
 3     do C[n][i] ← Sim(dn, di)
 4     I[n] ← 1  (keeps track of active clusters)
 5  A ← []  (collects clustering as a sequence of merges)
 6  for k ← 1 to N − 1
 7  do ⟨i, m⟩ ← arg max_{⟨i,m⟩: i≠m ∧ I[i]=1 ∧ I[m]=1} C[i][m]
 8     A.Append(⟨i, m⟩)  (store merge)
 9     for j ← 1 to N
10     do (use i as representative for ⟨i, m⟩)
11        C[i][j] ← Sim(⟨i, m⟩, j)
12        C[j][i] ← Sim(⟨i, m⟩, j)
13     I[m] ← 0  (deactivate cluster)
14  return A
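
A direct Python transcription of SimpleHAC. The pseudocode leaves Sim abstract; this sketch (mine) tracks cluster members explicitly and plugs in single-link (max) by default, with complete-link available via cluster_sim=min:

import numpy as np

def simple_hac(docs, doc_sim, cluster_sim=max):
    # docs: list of vectors; doc_sim: similarity of two documents
    N = len(docs)
    S = [[doc_sim(docs[n], docs[i]) for i in range(N)] for n in range(N)]
    members = [[n] for n in range(N)]     # documents in each active cluster
    active = set(range(N))
    A = []                                # merge sequence
    for _ in range(N - 1):
        # most similar pair of distinct active clusters
        i, m = max(((i, m) for i in active for m in active if i < m),
                   key=lambda p: cluster_sim(S[a][b] for a in members[p[0]]
                                             for b in members[p[1]]))
        A.append((i, m))
        members[i] += members[m]          # i represents the merged cluster
        active.remove(m)                  # deactivate cluster m
    return A

cos = lambda a, b: float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
merges = simple_hac(list(np.random.rand(10, 5)), cos)    # single-link HAC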

SLIDE 26

Computational complexity of the naive algorithm

First, we compute the similarity of all N × N pairs of documents. Then, in each of N iterations:

• We scan the O(N × N) similarities to find the maximum similarity.
• We merge the two clusters with maximum similarity.
• We compute the similarity of the new cluster with all other (surviving) clusters.

There are O(N) iterations, each performing an O(N × N) “scan” operation, so the overall complexity is O(N³). We’ll look at more efficient algorithms later.

SLIDE 27

Key question: How to define cluster similarity

Single-link: Maximum similarity
  Maximum similarity of any two documents, one from each cluster.

Complete-link: Minimum similarity
  Minimum similarity of any two documents, one from each cluster.

Centroid: Average “intersimilarity”
  Average similarity of all document pairs, excluding pairs of docs in the same cluster. This is equivalent to the similarity of the centroids.

Group-average: Average “intrasimilarity”
  Average similarity of all document pairs, including pairs of docs in the same cluster.
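
The four criteria as functions over two clusters A and B, given as lists of document vectors (a sketch; sim is any document-pair similarity, e.g., the cosine defined earlier):

import numpy as np

def single_link(A, B, sim):    # maximum similarity of any inter-cluster pair
    return max(sim(a, b) for a in A for b in B)

def complete_link(A, B, sim):  # minimum similarity of any inter-cluster pair
    return min(sim(a, b) for a in A for b in B)

def centroid_sim(A, B, sim):   # similarity of the two centroids
    return sim(np.mean(A, axis=0), np.mean(B, axis=0))

def group_average(A, B, sim):  # average over all pairs, intra-cluster included
    docs = list(A) + list(B)
    return np.mean([sim(x, y) for i, x in enumerate(docs)
                    for j, y in enumerate(docs) if i != j])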

SLIDE 28

Cluster similarity: Example

[Figure: four example points on a 7 × 4 grid]

SLIDE 29

Single-link: Maximum similarity

[Figure: single-link similarity illustrated on the four example points]

SLIDE 30

Complete-link: Minimum similarity

[Figure: complete-link similarity illustrated on the four example points]

SLIDE 31

Centroid: Average intersimilarity

intersimilarity = similarity of two documents in different clusters

[Figure: centroid similarity illustrated on the four example points]

SLIDE 32

Group average: Average intrasimilarity

intrasimilarity = similarity of any pair, including cases where the two documents are in the same cluster

[Figure: group-average similarity illustrated on the four example points]

SLIDE 33

Cluster similarity: Larger Example

[Figure: a larger example with 20 points on a 7 × 4 grid]

SLIDE 34

Single-link: Maximum similarity

[Figure: single-link similarity illustrated on the 20-point example]

SLIDE 35

Complete-link: Minimum similarity

[Figure: complete-link similarity illustrated on the 20-point example]

SLIDE 36

Centroid: Average intersimilarity

[Figure: centroid similarity illustrated on the 20-point example]

SLIDE 37

Group average: Average intrasimilarity

[Figure: group-average similarity illustrated on the 20-point example]
