Introduction to Information Retrieval
http://informationretrieval.org
IIR 2: The term vocabulary and postings lists
Hinrich Schütze
Center for Information and Language Processing, University of Munich
2014-04-09
Term frequency:
  n (natural)         tf_{t,d}
  l (logarithm)       1 + log(tf_{t,d})
  a (augmented)       0.5 + (0.5 × tf_{t,d}) / max_t(tf_{t,d})
  b (boolean)         1 if tf_{t,d} > 0, 0 otherwise
  L (log ave)         (1 + log(tf_{t,d})) / (1 + log(ave_{t in d}(tf_{t,d})))

Document frequency:
  n (no)              1
  t (idf)             log(N / df_t)
  p (prob idf)        max{0, log((N − df_t) / df_t)}

Normalization:
  n (none)            1
  c (cosine)          1 / √(w_1² + w_2² + ... + w_M²)
  u (pivoted unique)  1 / u
  b (byte size)       1 / CharLength^α, α < 1

Best known combination of weighting options. Default: no weighting.
∑_i w_{q,i} · w_{d,i} = 0 + 0 + 1.04 + 2.04 = 3.08

Questions?
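The score above is simply the dot product of the query and document weight vectors, summed term by term. A minimal sketch using the per-term products from the example (only the products 0, 0, 1.04, 2.04 come from the slides; the variable names are mine):

```python
# Each entry is w_{q,i} * w_{d,i} for one query term (values from the
# worked example above); the document's score is their sum.
per_term_products = [0.0, 0.0, 1.04, 2.04]
score = sum(per_term_products)
print(round(score, 2))  # 3.08
```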