Text Operations
Berlin Chen 2003
References:
1. Modern Information Retrieval, chapters 7, 5
2. Information Retrieval: Data Structures & Algorithms, chapters 7, 8
3. Managing Gigabytes, chapter 2
Text operations control the size of the vocabulary (reduce the number of distinct index terms).
Side effect? They may improve retrieval performance, but they cost processing time.
Their benefits are controversial.
[Figure: logical view of a document, from full text to index terms. Labels: docs; structure; accents, spacing, etc.; stopwords; noun groups; stemming; manual indexing; text + structure; text; full text; index terms.]
Successor variety for the test word READABLE

Prefix     Successor Variety   Letters
R          3                   E, I, O
RE         2                   A, D
REA        1                   D
READ       3                   A, I, S
READA      1                   B
READAB     1                   L
READABL    1                   E
READABLE   1                   BLANK
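The successor variety table can be reproduced with a short sketch. The corpus below is an assumption (a small word list chosen so that the counts match the table), and `successor_variety` is a hypothetical helper name. The sketch counts only the distinct letters that follow a prefix and reports BLANK when the prefix is a complete corpus word with no longer continuation, which is how the table treats READ and READABLE.

```python
def successor_variety(word, corpus):
    """Return (prefix, variety, letters) for every prefix of `word`.

    `corpus` is a set of upper-case words. The letters are the distinct
    characters that follow the prefix in corpus words longer than it.
    """
    rows = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        letters = sorted({w[i] for w in corpus
                          if w.startswith(prefix) and len(w) > i})
        # No longer word continues this prefix: the only successor is
        # the end of the word, shown as BLANK.
        if not letters and prefix in corpus:
            letters = ["BLANK"]
        rows.append((prefix, len(letters), letters))
    return rows


# Assumed corpus, not taken from the slides.
CORPUS = {"ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"}

for prefix, variety, letters in successor_variety("READABLE", CORPUS):
    print(f"{prefix:<10}{variety:<4}{', '.join(letters)}")
```

The variety peaks at READ (3, between two prefixes with variety 1), which is the kind of boundary a successor variety stemmer looks for when it segments a word.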
w1 w2 ... wn    w1w2...wn
form a hierarchical structure
depend on the specific context
automatically or by specialists
Foskett, 1997
Text positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70

Vocabulary    Occurrences
beautiful     70
flowers       45, 58
garden        18, 29
house         6
....          ....
Each element in a list points to a text position
Each element in a list points to a doc number
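The two kinds of occurrence lists can be sketched as follows. This is a minimal illustration, not the slides' implementation: the sample text, the tokenization by `\w+`, and the function names `positional_index` and `document_index` are all assumptions, and the character positions it produces need not match those in the figure.

```python
import re
from collections import defaultdict


def positional_index(text):
    """Map each term to the 1-based character positions where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"\w+", text):
        index[match.group().lower()].append(match.start() + 1)
    return dict(index)


def document_index(docs):
    """Map each term to the sorted list of document numbers containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        for term in set(re.findall(r"\w+", text.lower())):
            index[term].append(doc_id)
    return dict(index)


# Assumed sample data for illustration only.
SAMPLE_TEXT = ("That house has a garden. The garden has many flowers. "
               "The flowers are beautiful")
SAMPLE_DOCS = ["That house has a garden",
               "The garden has many flowers",
               "The flowers are beautiful"]

positions = positional_index(SAMPLE_TEXT)
postings = document_index(SAMPLE_DOCS)
print(positions["garden"], postings["garden"])
```

Pointing to text positions supports phrase and proximity queries at the cost of longer lists; pointing to document numbers keeps the lists short but can only say which documents contain a term.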
[Figure: text -> encoder -> compressed text -> decoder -> text, with a model attached to the encoder and a model attached to the decoder. A second panel shows the same pipeline with updating of both models (for semi-static modeling).]
Huffman coding tree
Original text: for each rose, a rose is a rose

Symbol  Prob.  Code
each    1/9    100
,       1/9    101
for     1/9    110
is      1/9    111
a       2/9    00
rose    3/9    01

Average = 2.44 bits per sample
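A word-based Huffman code for the rose example can be built with a priority queue. The sketch below is one standard construction under stated assumptions: words and the comma are treated as symbols, spaces are dropped, and `huffman_code_lengths` is a hypothetical helper that returns only the code lengths. The exact bit patterns depend on how ties are broken, so only the total length is compared with the slide.

```python
import heapq
import re
from collections import Counter
from itertools import count


def huffman_code_lengths(freqs):
    """Return {symbol: code length} for a Huffman code over `freqs`."""
    tiebreak = count()  # unique counter so heap entries never compare lists
    heap = [(f, next(tiebreak), [sym]) for sym, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        # Every symbol under the merged node moves one level deeper.
        for sym in syms1 + syms2:
            lengths[sym] += 1
        heapq.heappush(heap, (f1 + f2, next(tiebreak), syms1 + syms2))
    return lengths


text = "for each rose, a rose is a rose"
tokens = re.findall(r"\w+|[^\w\s]", text)   # words and punctuation marks
freqs = Counter(tokens)
lengths = huffman_code_lengths(freqs)
total_bits = sum(freqs[s] * lengths[s] for s in freqs)
print(lengths, total_bits / len(tokens))
```

For this text the sketch uses 22 bits for 9 symbols, that is 22/9, or about 2.44 bits per symbol, in agreement with the average quoted for the tree.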
Huffman coding tree
Original text: for each rose, a rose is a rose

Symbol  Prob.  Code
each    1/9    0100
,       1/9    0101
for     1/9    0110
is      1/9    0111
a       2/9    00
rose    3/9    1

$$E = -\sum_i p_i \log_2 p_i = -\left(4 \times \frac{1}{9}\log_2\frac{1}{9} + \frac{2}{9}\log_2\frac{2}{9} + \frac{3}{9}\log_2\frac{3}{9}\right) \approx 2.42$$

Average = 2.56 bits per sample
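The entropy and the two averages quoted for the rose example can be checked numerically. The sketch below simply re-enters the probabilities and the two code tables from the slides; the dictionary names `codes_a` and `codes_b` and the helper names are assumptions made for illustration.

```python
from fractions import Fraction
from math import log2

# Symbol probabilities for "for each rose, a rose is a rose".
probs = {"each": Fraction(1, 9), ",": Fraction(1, 9), "for": Fraction(1, 9),
         "is": Fraction(1, 9), "a": Fraction(2, 9), "rose": Fraction(3, 9)}

# The two code tables shown for this text.
codes_a = {"each": "100", ",": "101", "for": "110", "is": "111",
           "a": "00", "rose": "01"}
codes_b = {"each": "0100", ",": "0101", "for": "0110", "is": "0111",
           "a": "00", "rose": "1"}


def entropy(p):
    """Entropy in bits: E = -sum(p_i * log2(p_i))."""
    return -sum(float(q) * log2(float(q)) for q in p.values())


def average_length(p, codes):
    """Expected code length in bits per symbol, as an exact fraction."""
    return sum(p[s] * len(codes[s]) for s in p)


print(round(entropy(probs), 2))                  # entropy lower bound
print(float(average_length(probs, codes_a)))     # first code table
print(float(average_length(probs, codes_b)))     # second code table
```

Both averages lie above the entropy, as they must for any prefix code; the first table (22/9 bits) is closer to the bound than the second (23/9 bits).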
The number of trees finally formed will be quite large!
Canonical Huffman coding tree
Original text: for each rose, a rose is a rose

Symbol  Prob.  Old Code  Canonical Code
each    1/9    100       000
,       1/9    101       001
for     1/9    110       010
is      1/9    111       011
a       2/9    00        10
rose    3/9    01        11

Average = 2.44 bits per sample
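A canonical code needs only the code lengths: codes of the same length are consecutive binary numbers, and the longest codes start at zero. The sketch below follows that convention, which matches the tables in these slides; `canonical_codes` is a hypothetical helper, and it assumes the lengths come from a Huffman tree and that symbols of equal length are numbered in the order given.

```python
from collections import Counter


def canonical_codes(lengths):
    """Assign canonical codes from a list of (symbol, code_length) pairs."""
    per_length = Counter(length for _, length in lengths)
    max_len = max(per_length)

    # first[l] is the numeric value of the first code of length l.
    first = {}
    code = 0
    for l in range(max_len, 0, -1):
        first[l] = code
        # Step up one level: skip the codes used at length l, then halve.
        code = (code + per_length[l] + 1) // 2

    next_code = dict(first)
    codes = {}
    for sym, l in lengths:
        codes[sym] = format(next_code[l], "0{}b".format(l))
        next_code[l] += 1
    return codes


# Code lengths taken from the two rose trees.
lengths_a = [("each", 3), (",", 3), ("for", 3), ("is", 3), ("a", 2), ("rose", 2)]
lengths_b = [("each", 4), (",", 4), ("for", 4), ("is", 4), ("a", 2), ("rose", 1)]

print(canonical_codes(lengths_a))
print(canonical_codes(lengths_b))
```

Because the codes follow from the lengths alone, a decoder can rebuild the whole code from the length of each symbol's code instead of storing the tree.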
Canonical Huffman coding tree
Original text: for each rose, a rose is a rose

Symbol  Prob.  Old Code  Canonical Code
each    1/9    0100      0000
,       1/9    0101      0001
for     1/9    0110      0010
is      1/9    0111      0011
a       2/9    00        01
rose    3/9    1         1

$$E = -\sum_i p_i \log_2 p_i = -\left(4 \times \frac{1}{9}\log_2\frac{1}{9} + \frac{2}{9}\log_2\frac{2}{9} + \frac{3}{9}\log_2\frac{3}{9}\right) \approx 2.42$$

Average = 2.56 bits per sample
                              Arithmetic   Character Huffman   Word Huffman   Ziv-Lempel
Compression Ratio             very good    poor                very good      good
Compression Speed             slow         fast                fast           very fast
Decompression Speed           slow         fast                very fast      very fast
Memory space                  low          low                 high           moderate
Compressed pattern matching   no           yes                 yes            yes (theoretically)
Random access                 no           yes                 yes            no