Advanced Topics in Information Retrieval
4. Mining & Organization
Vinay Setty (vsetty@mpi-inf.mpg.de)
Jannik Strötgen (jtroetge@mpi-inf.mpg.de)

Mining & Organization
Retrieving a list of relevant documents (10 blue links)
insufficient
about brazil”)
inspect
users to explore and digest the contained information, e.g.:
Search results organized as clusters
intersection divided by the size of their union: sim(C1, C2) = |C1∩C2|/|C1∪C2|
Example: 3 in intersection, 8 in union; Jaccard similarity = 3/8, Jaccard distance = 5/8
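The numbers in this example can be reproduced with a minimal Python sketch. The two sets below are made up for illustration (they are not the ones drawn on the slide); they were chosen so that 3 elements are shared and 8 appear in the union.

```python
def jaccard_similarity(c1: set, c2: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = c1 | c2
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(c1 & c2) / len(union)


def jaccard_distance(c1: set, c2: set) -> float:
    """Jaccard distance is one minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(c1, c2)


# Hypothetical sets: 3 shared elements, 8 elements in the union.
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}
print(jaccard_similarity(a, b))  # 0.375 (= 3/8)
print(jaccard_distance(a, b))    # 0.625 (= 5/8)
```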
$$\mathrm{sim}(q, d) = \frac{q \cdot d}{\|q\| \, \|d\|} = \frac{\sum_v q_v \, d_v}{\sqrt{\sum_v q_v^2} \, \sqrt{\sum_v d_v^2}}$$
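As a rough illustration of the cosine similarity formula, the sketch below stores each vector sparsely as a dictionary from term to weight. The dictionary representation and the example weights are assumptions made for this sketch.

```python
import math


def cosine_similarity(q: dict, d: dict) -> float:
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * d[v] for v, w in q.items() if v in d)
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # assumption: similarity with an empty vector is 0
    return dot / (norm_q * norm_d)


# Hypothetical term weights.
q = {"brazil": 1.0, "world": 1.0}
d = {"brazil": 2.0, "cup": 2.0}
print(cosine_similarity(q, d))  # approximately 0.5
```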
ci
average similarity between documents and their cluster centroid
centroid
$$\frac{1}{|D|} \sum_{d \in D} \max_{c \in C} \mathrm{sim}(c, d)$$
reading all documents and assigning them to most similar cluster
it to Ci
collection, which has cost in O(nkd), with n as the number of documents, k as the number of clusters, and d as the number of dimensions
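The following is a minimal sketch of plain k-means with cosine similarity, written to make the O(nkd) assignment step visible. It is not the WAND-based variant; dense lists as vectors, the fixed iteration count, and the handling of empty clusters are assumptions of this sketch.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def assign(docs, centroids):
    """Assignment step: n documents x k centroids x d dimensions, i.e. O(nkd)."""
    return [max(range(len(centroids)), key=lambda i: cosine(d, centroids[i]))
            for d in docs]


def recompute(docs, labels, centroids):
    """Update step: each centroid becomes the mean of its assigned documents."""
    new = []
    for i, old in enumerate(centroids):
        members = [d for d, lab in zip(docs, labels) if lab == i]
        if not members:
            new.append(old)  # assumption: an empty cluster keeps its centroid
            continue
        new.append([sum(col) / len(members) for col in zip(*members)])
    return new


def objective(docs, centroids):
    """Average similarity between documents and their most similar centroid."""
    return sum(max(cosine(d, c) for c in centroids) for d in docs) / len(docs)


def kmeans(docs, centroids, iterations=10):
    for _ in range(iterations):
        labels = assign(docs, centroids)
        centroids = recompute(docs, labels, centroids)
    return assign(docs, centroids), centroids


# Hypothetical two-dimensional documents, seeded with two of them as centroids.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centroids = kmeans(docs, centroids=[docs[0], docs[2]])
print(labels)  # [0, 0, 1, 1]
print(objective(docs, centroids))
```

Every iteration compares each document with each centroid, which is the cost that ranked retrieval with the centroid as a query tries to avoid.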
documents in every iteration using WAND
are assigned to the most similar centroid
different numbers of dimensions: ~26M for (1), ~7M for (2)
System         ℓ     Dataset 1 Similarity   Dataset 1 Time   Dataset 2 Similarity   Dataset 2 Time
k-means        -     0.7804                 445.05           0.2856                 705.21
wand-k-means   100   0.7810                 83.54            0.2858                 324.78
wand-k-means   10    0.7811                 75.88            0.2856                 243.9
wand-k-means   1     0.7813                 61.17            0.2709                 100.84

System         p     ℓ   Dataset 1 Similarity   Dataset 1 Time   ℓ    Dataset 2 Similarity   Dataset 2 Time
k-means        -     -   0.7804                 445.05           -    0.2858                 705.21
wand-k-means   -     1   0.7813                 61.17            10   0.2856                 243.91
wand-k-means   500   1   0.7817                 8.83             10   0.2704                 4.00
wand-k-means   200   1   0.7814                 6.18             10   0.2855                 2.97
wand-k-means   100   1   0.7814                 4.72             10   0.2853                 1.94
wand-k-means   50    1   0.7803                 3.90             10   0.2844                 1.39
Document
→ Shingling: the set of strings of length k that appear in the document
→ Min-Hashing: signatures, i.e., short integer vectors that represent the sets and reflect their similarity
→ Locality-Sensitive Hashing: candidate pairs, i.e., those pairs of signatures that we need to test for similarity
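Shingling, the first step of this pipeline, can be sketched as follows. Character-level shingles, the default length k = 5, and the whitespace and case normalization are illustrative assumptions, not choices stated on the slides.

```python
def shingles(text: str, k: int = 5) -> set:
    """Return the set of character k-shingles (substrings of length k)."""
    # Assumption: collapse whitespace runs and lowercase before shingling.
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


print(sorted(shingles("abcab", k=2)))  # ['ab', 'bc', 'ca']
```

Because the result is a set, the repeated shingle "ab" is stored only once, and documents can then be compared with Jaccard similarity over their shingle sets.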
formalized as finding subsets that have significant intersection
set union as bitwise OR
Jaccard similarity (not distance) = 3/6
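When sets are encoded as 0/1 vectors, intersection and union reduce to bitwise AND and OR. The sketch below uses Python integers as bit vectors; the two vectors are hypothetical and were picked only so that they reproduce a similarity of 3/6.

```python
def jaccard_bits(c1: int, c2: int) -> float:
    """Jaccard similarity of two sets encoded as bit vectors (Python ints)."""
    intersection = bin(c1 & c2).count("1")  # set intersection as bitwise AND
    union = bin(c1 | c2).count("1")         # set union as bitwise OR
    return intersection / union if union else 1.0


# Hypothetical vectors: 3 positions set in both, 6 positions set in at least one.
c1 = 0b001111
c2 = 0b110111
print(jaccard_bits(c1, c2))  # 0.5 (= 3/6)
```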
Documents (N) Shingles (D)
h(C2)
[Figure: min-hashing example showing the permutation π, the input matrix (Shingles x Documents), and the signature matrix M]
2nd element of the permutation is the first to map to a 1
4th element of the permutation is the first to map to a 1
Note: Another (equivalent) way is to store row indexes:
1 5 1 5
2 3 1 3
6 4 6 4
     C1  C2
A    1   1
B    1   0
C    0   1
D    0   0
If a type-B or type-C row, then not
Similarities:
           1-3    2-4    1-2    3-4
Col/Col    0.75   0.75   0      0
Sig/Sig    0.67   1.00   0      0
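The gap between column similarity and signature similarity shrinks as more permutations are used. The sketch below builds signatures from explicit random permutations; representing each column as a set of row indices, the number of permutations, and the seed are assumptions of this sketch, and explicit permutations are only practical for small examples.

```python
import random


def minhash_signatures(columns, num_rows, num_perms=100, seed=0):
    """columns: list of non-empty sets of row indices that hold a 1.
    Returns one signature (list of ints) per column."""
    rng = random.Random(seed)
    signatures = [[] for _ in columns]
    for _ in range(num_perms):
        perm = list(range(num_rows))
        rng.shuffle(perm)  # perm[row] = position of that row in permuted order
        for sig, col in zip(signatures, columns):
            # Min-hash: position of the first row (in permuted order) with a 1.
            sig.append(min(perm[row] for row in col))
    return signatures


def signature_similarity(s1, s2):
    """Fraction of positions in which two signatures agree."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)


# Hypothetical columns with true Jaccard similarity 2/6.
cols = [{0, 1, 2, 3}, {2, 3, 4, 5}]
sigs = minhash_signatures(cols, num_rows=6, num_perms=500)
print(signature_similarity(sigs[0], sigs[1]))  # an estimate of 2/6
```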
[Figure: row-by-row trace with the steps Init, Row 0, Row 1, Row 2, Row 3, Row 4]
[Figure: signature matrix M divided into b bands of r rows per band; each column is one signature]
[Figure: matrix M with b bands of r rows; each band is hashed to buckets]
Columns 2 and 6 are probably identical (candidate pair)
Columns 6 and 7 are surely different.
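The banding step can be sketched as below. Instead of a real hash function, this sketch uses each band's tuple of values directly as the bucket key, so two columns share a bucket only if they are identical in that band; that simplification and the toy signatures are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, bands, rows):
    """signatures: list of equal-length int lists, one per column of M.
    Returns the set of column index pairs that share a bucket in some band."""
    assert all(len(sig) == bands * rows for sig in signatures)
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for col, sig in enumerate(signatures):
            # Assumption: the band's tuple is the bucket key (no collisions
            # between columns that differ within the band).
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(col)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates


# Hypothetical signatures with 2 bands of 2 rows each.
sigs = [
    [1, 2, 1, 2],  # column 0
    [1, 2, 3, 4],  # column 1: agrees with column 0 in the first band
    [5, 6, 7, 8],  # column 2: agrees with no other column
]
print(lsh_candidate_pairs(sigs, bands=2, rows=2))  # {(0, 1)}
```

Only the returned pairs need a full similarity test afterwards.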
want them to hash to at least 1 common bucket (at least one band is identical)
are false negatives (we miss them)
common buckets (all bands should be different)
0.3% end up becoming candidate pairs
candidate pairs) but then it will turn out their similarity is below threshold s
[Figure: step function. x-axis: similarity s = sim(C1, C2) of two sets; y-axis: probability of sharing a bucket; the step is at the similarity threshold t]
No chance if s < t
Probability = 1 if s > t
Remember: With a single hash function: Probability of equal hash-values = similarity
[Figure: probability of sharing a bucket against similarity s = sim(C1, C2) of two sets, with the false positive and false negative regions marked]
With b bands of r rows each:
$s^r$: all rows of a band are equal
$1 - s^r$: some row of a band is unequal
$(1 - s^r)^b$: no bands identical
$1 - (1 - s^r)^b$: at least one band identical
Threshold: $t \approx (1/b)^{1/r}$
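To get a feel for the shape of this curve, the expressions can be evaluated directly. The parameter values b = 20 and r = 5 below are assumed for illustration only.

```python
def candidate_probability(s: float, r: int, b: int) -> float:
    """Probability that two sets with similarity s agree in at least one band."""
    return 1.0 - (1.0 - s ** r) ** b


def approximate_threshold(r: int, b: int) -> float:
    """Rough location of the steep rise of the S-curve: (1/b)^(1/r)."""
    return (1.0 / b) ** (1.0 / r)


# Assumed parameters for illustration: b = 20 bands of r = 5 rows each.
for s in (0.2, 0.3, 0.4, 0.6, 0.8):
    print(s, round(candidate_probability(s, r=5, b=20), 4))
print(round(approximate_threshold(r=5, b=20), 3))
```

Raising r pushes the threshold up and reduces false positives, while raising b pulls it down and reduces false negatives.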
in exploring/navigating a collection of documents (e.g., query results)
that can be flat or hierarchical, e.g.:
etc.
for large-scale document collections with sparse meta-data
mined in a query-dependent manner from pseudo-relevant documents
sizes, etc.) are typically represented as lists in web pages
and use the consolidated lists as facets
chicken) via: item{, item}* (and|or) {other} item
ignoring instructions such as “select” or “choose”
ignoring header and footer rows
alphanumeric characters (e.g., brackets), converting them to lower case, and removing items longer than 20 terms
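The clean-up of extracted list items can be sketched as follows. Replacing each run of non-alphanumeric characters with a space, restricting "alphanumeric" to ASCII, and dropping items that end up empty are assumptions of this sketch.

```python
import re

MAX_TERMS = 20  # items longer than this many terms are dropped


def normalize_items(items):
    """Clean raw list items: strip non-alphanumeric characters, lowercase,
    and drop items that are empty or longer than MAX_TERMS terms."""
    cleaned = []
    for item in items:
        # Assumption: each run of non-alphanumeric characters becomes a space.
        text = re.sub(r"[^0-9a-zA-Z]+", " ", item).strip().lower()
        if text and len(text.split()) <= MAX_TERMS:
            cleaned.append(text)
    return cleaned


# Hypothetical raw items; the last one has 21 terms and is dropped.
print(normalize_items(["Tag Heuer (new)", "ROLEX", "word " * 21]))
# ['tag heuer new', 'rolex']
```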
query, i.e., are mentioned in many pseudo-relevant documents
weight SDOC and their average inverse document frequency SIDF
with $s^m_d$ as the fraction of list items mentioned in document $d$ and $s^r_d$ as the importance of document $d$ (estimated as $\mathrm{rank}(d)^{-1/2}$)
$$S_l = S_{DOC} \cdot S_{IDF}$$
$$S_{DOC} = \sum_{d \in R} \left( s^m_d \cdot s^r_d \right)$$
$$S_{IDF} = \frac{1}{|l|} \sum_{i \in l} \mathrm{idf}(i)$$
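Putting the document-matching weight and the average inverse document frequency together gives the list weight. In the sketch below, treating "mentioned" as plain substring containment and passing idf values in as a precomputed dictionary are assumptions; the example documents and idf values are made up.

```python
import math


def list_weight(items, ranked_docs, idf):
    """items: list of item strings; ranked_docs: document texts in rank order
    (rank 1 first); idf: dict mapping item -> inverse document frequency."""
    s_doc = 0.0
    for rank, text in enumerate(ranked_docs, start=1):
        # Fraction of list items mentioned in this document.
        # Assumption: "mentioned" means plain substring containment.
        s_m = sum(item in text for item in items) / len(items)
        s_r = 1.0 / math.sqrt(rank)  # importance of the document, rank^(-1/2)
        s_doc += s_m * s_r
    s_idf = sum(idf[item] for item in items) / len(items)
    return s_doc * s_idf


# Hypothetical pseudo-relevant documents and idf values.
docs = ["rolex and omega watches", "omega only"]
idf = {"rolex": 2.0, "omega": 1.0}
print(list_weight(["rolex", "omega"], docs, idf))
```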
$$d(l_1, l_2) = 1 - \frac{|l_1 \cap l_2|}{\min\{|l_1|, |l_2|\}}$$
$$d(c_1, c_2) = \max_{l_1 \in c_1,\, l_2 \in c_2} d(l_1, l_2)$$
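Both distances translate directly into code. The sketch below treats each list as a set of items and each cluster as a collection of such sets; the example lists are made up.

```python
def list_distance(l1: set, l2: set) -> float:
    """One minus the overlap, normalized by the size of the smaller list."""
    return 1.0 - len(l1 & l2) / min(len(l1), len(l2))


def cluster_distance(c1, c2) -> float:
    """Largest distance between any list in c1 and any list in c2."""
    return max(list_distance(a, b) for a in c1 for b in c2)


# Hypothetical lists.
brands = {"rolex", "omega", "seiko"}
more_brands = {"rolex", "omega", "casio", "fossil"}
styles = {"dive", "automatic"}
print(list_distance(brands, more_brands))  # 1 - 2/3
print(cluster_distance([brands, more_brands], [styles]))  # 1.0
```

Taking the maximum over all pairs means two clusters are only close if every pair of their lists is close.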
favoring dimensions grouping lists with high weight
favoring items which are often ranked high within containing lists
$$S_c = \sum_{s \in \mathrm{Sites}(c)} \max_{l \in c,\, l \in s} S_l$$
$$S_{i|c} = \sum_{s \in \mathrm{Sites}(c)} \frac{1}{\sqrt{\mathrm{AvgRank}(c, i, s)}}$$
query: watches
1. cartier, breitling, omega, citizen, tag heuer, bulova, casio, rolex, audemars piguet, seiko, accutron, movado, fossil, gucci, . . .
cal, manual, automatic, electric, dive, . . .

query: lost
jorge garcia, daniel dae kim, michael emerson, terry o’quinn, . . .
charlie, ben, juliet, sun, jin, ana, lucia . . .
date, the last recruit, everybody loves hugo, the end, . . .

query: lost season 5
is dead, some like it hoth, whatever happened happened, the little prince, this place is death, the variable, . . .
2. jack, kate, hurley, sawyer, sayid, ben, juliet, locke, miles, desmond, charlotte, various, sun, none, richard, daniel
3. matthew fox, naveen andrews, evangeline lilly, jorge garcia, henry ian cusick, josh holloway, michael emerson, . . .

query: flowers
christmas, thank you, new baby, sympathy, fall
gerberas, orchids, iris

query: what is the fastest animals in the world
1. cheetah, pronghorn antelope, lion, thomson’s gazelle, wildebeest, cape hunting dog, elk, coyote, quarter horse
3. science, technology, entertainment, nature, sports, lifestyle, travele, gaming, world business

query: the presidents of the united states
james madison, abraham lincoln, john quincy adams, william henry harrison, martin van buren, james monroe, . . .
the united states ii, love everybody, pure frosting, these are the good times people, freaked out and small, . . .
kick out the jams, stranger, boll weevil, ca plane pour moi, . . .
4. federalist, democratic-republican, whig, democratic, republican, no party, national union, . . .

query: visit beijing
1. tiananmen square, forbidden city, summer palace, temple of heaven, great wall, beihai park, hutong
portation, facts

query: cikm
dustry research track
2. submission, important dates, topics, overview, scope, committee, organization, programme, registration, cfp, publication, programme committee, organisers, . . .
stoc, uist, vldb, wsdm, . . .
visualize their volume in traditional news and blogs
mentions of the same meme need to be identified
that are reasonably long and occur often enough
(i.e., u can be transformed into v by adding at most ε tokens)
and how often v occurs in the document collection
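A literal reading of the inclusion condition (u can be transformed into v by adding at most ε tokens) is that u is a token subsequence of v and v has at most ε more tokens. The sketch below checks exactly that; whitespace tokenization is an assumption, and the original work may use a more tolerant, approximate test.

```python
def can_extend(u: str, v: str, eps: int) -> bool:
    """True if v can be obtained from u by inserting at most eps tokens."""
    u_tokens, v_tokens = u.split(), v.split()
    if not 0 <= len(v_tokens) - len(u_tokens) <= eps:
        return False
    # u must be a subsequence of v: the iterator is consumed left to right,
    # so each token of u has to be found after the previous match.
    remaining = iter(v_tokens)
    return all(token in remaining for token in u_tokens)


u = "palling around with terrorists who would target their own country"
v = "that he s palling around with terrorists who would target their own country"
print(can_extend(u, v, eps=3))  # True: v adds the three tokens "that he s"
print(can_extend(u, v, eps=2))  # False: three insertions exceed eps
```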
construction
having minimum total weight, so that each resulting component is single-rooted
hence addressed by greedy heuristic algorithm
[Figure: phrase graph with nodes numbered 1 to 15. Phrases shown:]
a force for good in the world
palling around with terrorists who would target their own country
that he s palling around with terrorists who would target their own country
pal around with terrorists who targeted their own country
palling around with terrorists who target their own country
we see america as a force of good in this world
we see an america of exceptionalism
someone who sees america as imperfe around with terrorists who targeted th
sees america as imperfect enough to pal around with terrorists who targeted their own country
terrorists who would target their own country
media
Figure 8: Time lag for blogs and news media. Thread volume in
[Plot: proportion of total volume against time relative to peak [hours], t, for mainstream media and blogs]
[1]
[2] Scalable k-Means by Ranked Retrieval, WSDM 2014
[3] Finding Dimensions for Queries, CIKM 2011
[4] CACM 49(4), 2006
[5] Meme-tracking and the Dynamics of the News Cycle, KDD 2009
[6] SIGIR 2000
[7] K.-P. Yee, K. Swearingen, K. Li, M. Hearst: Faceted Metadata for Image Search and Browsing, CHI 2003

For LSH refer to the Mining of Massive Datasets Chapter 3 http://infolab.stanford.edu/~ullman/mmds/book.pdf
LSH related slides were borrowed from http://i.stanford.edu/~ullman/cs246slides/LSH-1.pdf
Some slides were borrowed from Prof. Klaus Berberich as well