IR: Information Retrieval
FIB, Master in Innovation and Research in Informatics
Slides by Marta Arias, José Luis Balcázar, Ramon Ferrer-i-Cancho, Ricard Gavaldá
Department of Computer Science, UPC
Fall 2018 http://www.cs.upc.edu/~ir-miri
◮ “given term t, get all the documents that contain it”.
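A minimal sketch of that lookup, assuming documents are plain strings keyed by integer docids and tokenized by whitespace (the names `build_inverted_index` and `postings` are illustrative, not from the slides):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of docids that contain it."""
    index = defaultdict(set)
    for docid, text in docs.items():
        # Assumption: lowercase + whitespace split stands in for real tokenization.
        for term in text.lower().split():
            index[term].add(docid)
    # Store each posting list sorted by docid.
    return {term: sorted(ids) for term, ids in index.items()}

def postings(index, term):
    """Given term t, get all the documents that contain it."""
    return index.get(term.lower(), [])

docs = {1: "to be or not to be", 2: "to do is to be", 3: "do be do"}
index = build_inverted_index(docs)
print(postings(index, "do"))
```

The lookup itself is a single dictionary access; all the work happens once, when the index is built.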
◮ almost always sorted by docid
◮ often compressed: minimize info to bring from disk!
◮ Time: again order of the sum of lengths of posting lists.
◮ Time: length of shortest list times log of length of longest.
◮ sequential scan: 2000 comparisons,
◮ binary search: 1000 ∗ 10 = 10,000 comparisons.
◮ sequential scan: 10,100 comparisons,
◮ binary search: 100 ∗ log(10,000) ≈ 1400 comparisons.
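The two intersection strategies compared above can be sketched as follows, assuming both posting lists are sorted by docid. The function names are illustrative, and `bisect_left` from the standard library stands in for a hand-written binary search:

```python
from bisect import bisect_left
import math

def intersect_merge(a, b):
    """Sequential scan with two cursors: about len(a) + len(b) comparisons."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_binary(short, long):
    """Binary-search each docid of the short list in the long one:
    about len(short) * log2(len(long)) comparisons."""
    out = []
    for docid in short:
        k = bisect_left(long, docid)
        if k < len(long) and long[k] == docid:
            out.append(docid)
    return out

# Lists of lengths 1000 and 10,000 (made-up docids).
a = list(range(0, 3000, 3))
b = list(range(0, 20000, 2))
assert intersect_merge(a, b) == intersect_binary(a, b)

# Rough cost estimates for lengths 100 and 10,000.
scan_cost = 100 + 10_000
binary_cost = 100 * math.log2(10_000)
print(scan_cost, round(binary_cost))
```

Binary search pays off only when one list is much shorter than the other; for lists of similar length the sequential scan is cheaper.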
◮ above a threshold simmin, or
◮ the top r according to that similarity, or
◮ all documents,
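A small sketch of these three ways of choosing the answer set, assuming the similarity of every candidate document to the query has already been computed into a dictionary. The `scores` values are made up, `sim_min` stands for the threshold simmin, and returning the results in decreasing order of similarity is an assumption:

```python
import heapq

def select_answers(scores, sim_min=None, r=None):
    """Return (docid, similarity) pairs in decreasing order of similarity.

    sim_min: keep only documents whose similarity is above this threshold.
    r:       keep only the r most similar documents.
    With neither argument set, all documents are returned.
    """
    candidates = list(scores.items())
    if sim_min is not None:
        candidates = [(d, s) for d, s in candidates if s > sim_min]
    if r is not None:
        # nlargest avoids fully sorting when r is small.
        return heapq.nlargest(r, candidates, key=lambda pair: pair[1])
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)

scores = {"d1": 0.12, "d2": 0.80, "d3": 0.45, "d4": 0.05}
print(select_answers(scores, sim_min=0.4))
print(select_answers(scores, r=3))
print(select_answers(scores))
```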
◮ the reason for inverted index!
◮ typical binary encoding: |binary(x)| = ⌊log2(x)⌋ + 1 ≈ log2(x)
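The length of the plain binary representation can be checked directly. This is a quick sketch using Python's built-in `int.bit_length`; the sample values are arbitrary:

```python
import math

def binary_length(x):
    """Number of bits in the standard binary representation of x >= 1."""
    return x.bit_length()

for x in (1, 5, 255, 256, 1_000_000):
    # Columns: x, its binary string, its length, and floor(log2 x) + 1.
    print(x, bin(x)[2:], binary_length(x), math.floor(math.log2(x)) + 1)
```

The exact length is one more than the floor of log2(x), which is why log2(x) works as an estimate.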
◮ Easy to estimate memory used!
Footnote 1: I set it to something greater than 3 as an approximation.
◮ So, could use 8 bits instead of 20 (or 32)
◮ Will need a variable-length, self-delimiting binary encoding scheme
◮ Exercise: think how to decode uniquely
◮ Exercise: why?
◮ if 0, then last byte
◮ if 1, number continues
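A minimal sketch of a byte-oriented code that follows this flag convention. Two details are assumptions rather than something the bullets specify: the flag is taken to be the high bit of each byte, and the 7-bit groups are written most significant first. The names `vb_encode` and `vb_decode_stream` are illustrative:

```python
def vb_encode(n):
    """Encode a non-negative integer as bytes carrying 7 payload bits each.

    Flag bit (assumed to be the high bit): 1 = number continues, 0 = last byte.
    """
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()  # most significant 7-bit group first (assumption)
    # Set the continuation flag on every byte except the last one.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])

def vb_decode_stream(data):
    """Decode a concatenation of encoded numbers back into a list of ints."""
    numbers = []
    n = 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if (byte & 0x80) == 0:  # flag 0: this was the last byte of the number
            numbers.append(n)
            n = 0
    return numbers

values = [5, 130, 100_000]
stream = b"".join(vb_encode(v) for v in values)
print(len(stream), vb_decode_stream(stream))
```

Because the flag marks where each number ends, the encoded numbers can be concatenated without separators and still be decoded unambiguously.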