Text Representation
http://www.cse.iitb.ac.in/~soumen/mining-the-web/
Ahmed Rafea
Outline
– Document Preprocessing
– Vector Space Model for Document Storage
– Measure of Similarity
Document preprocessing (1/4)
Queries containing only stopwords are ruled out.
Polysemous words that are stopwords in one sense but not in others
– E.g., “can” as a verb vs. “can” as a noun
Stemming techniques: morphological analysis (e.g., Porter's algorithm) and dictionary lookup (e.g., WordNet).
It is not a good idea to stem abbreviations and names coined in the technical and commercial sectors.
E.g.: stemming “ides” to “IDE”, the hard disk standard, or the “SOCKS” firewall protocol to “sock”, worn on the foot, may be bad!
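
The caution about stemming abbreviations and coined names can be illustrated with a minimal sketch. The suffix stripper below is a toy, not Porter's algorithm, and the `PROTECTED` set and the rule that all-caps tokens are abbreviations are assumptions made only for this example.

```python
# Toy suffix stripper for illustration only; this is NOT Porter's algorithm.
# PROTECTED is a hypothetical list of names that should never be stemmed.
PROTECTED = {"SOCKS", "IDE"}


def toy_stem(token, protected=PROTECTED):
    """Strip a few common English suffixes unless the token is protected."""
    # Assumption: all-caps tokens are abbreviations and are left untouched.
    if token in protected or token.isupper():
        return token
    word = token.lower()
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters of the stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for tok in ["documents", "socks", "SOCKS", "ides"]:
        print(tok, "->", toy_stem(tok))
```

Note that the lowercase query term "ides" is still reduced to "ide", which is exactly the kind of collision with the name "IDE" that the slide warns about; protecting only the indexed form does not remove the ambiguity.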
into the same token
account
census and telephone directory data.
q characters
set of slightly distorted query terms
($2 \le q \le 4$)
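
Assuming the fragments above refer to q-grams (substrings of q characters, with q between 2 and 4), a rough sketch of matching slightly distorted terms might look like the following. The function names and the overlap score are assumptions for illustration and are not taken from the slides.

```python
def qgrams(term, q=3):
    """Return the set of all substrings of length q in term."""
    # The range 2..4 is the one quoted on the slide.
    if not 2 <= q <= 4:
        raise ValueError("q is expected to lie between 2 and 4")
    return {term[i:i + q] for i in range(len(term) - q + 1)}


def qgram_overlap(a, b, q=3):
    """Fraction of q-grams shared by a and b (hypothetical scoring choice)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


if __name__ == "__main__":
    print(qgrams("web", 2))
    print(qgram_overlap("mining", "minning"))
    print(qgram_overlap("mining", "search"))
```

A misspelled term such as "minning" still shares most of its q-grams with "mining", while an unrelated term shares none.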
Term frequency: the number of times term t occurs in document d.
Inverse document frequency: used to scale down the coordinates of terms that occur in many documents.
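
The raw count of how often term t occurs in document d can be sketched in a few lines. Whitespace tokenization and the helper name `term_counts` are assumptions for this example; any scaling applied to the counts is left out because it is not recoverable from the slide.

```python
from collections import Counter


def term_counts(tokens):
    """Map each term to the number of times it occurs in one document."""
    return Counter(tokens)


if __name__ == "__main__":
    # Assumption: the document is already tokenized on whitespace.
    doc = "the web is the web".split()
    counts = term_counts(doc)
    print(counts["web"], counts["is"], counts["missing"])
```

A `Counter` returns 0 for terms that do not occur, which matches the zero coordinate such a term would have in the document vector.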
D is the document collection and $D_t$ is the set of documents containing t.
IDF: mostly dampened functions of $|D| / |D_t|$
SMART: $\mathrm{IDF}(t) = \log\left(\frac{1 + |D|}{|D_t|}\right)$
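
As a hedged illustration of damping the ratio between the collection size and the number of documents containing a term, the sketch below uses log((1 + |D|) / |D_t|). That specific form is one common choice and is an assumption here; the function name `idf` and the list-of-token-lists input are also hypothetical.

```python
import math


def idf(term, docs):
    """Dampened inverse document frequency: log((1 + |D|) / |D_t|).

    docs is a list of token lists; D_t is the subset containing term.
    """
    containing = sum(1 for tokens in docs if term in tokens)
    if containing == 0:
        raise ValueError("term does not occur in the collection")
    return math.log((1 + len(docs)) / containing)


if __name__ == "__main__":
    docs = [["web", "mining"], ["web", "search"], ["text"]]
    # "web" occurs in two documents, "text" in one, so "text" scores higher.
    print(idf("web", docs), idf("text", docs))
```

Terms that appear in many documents get a smaller weight, so their coordinates are scaled down relative to rarer terms.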
Else shorter documents dominate (since queries are short)
Shorter documents are penalized
$r'(d_1, d_2) = \frac{|T(d_1) \cap T(d_2)|}{|T(d_1) \cup T(d_2)|}$
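
The set-overlap measure above can be sketched as follows, assuming T(d) denotes the set of distinct terms in document d. The function name `jaccard` and the choice to return 0.0 for two empty documents are assumptions for this example.

```python
def jaccard(doc1_tokens, doc2_tokens):
    """|T(d1) & T(d2)| / |T(d1) | T(d2)| over the documents' term sets."""
    t1, t2 = set(doc1_tokens), set(doc2_tokens)
    union = t1 | t2
    if not union:
        # Assumption: two empty documents are given similarity 0.0.
        return 0.0
    return len(t1 & t2) / len(union)


if __name__ == "__main__":
    d1 = ["web", "mining", "text"]
    d2 = ["web", "text", "search"]
    print(jaccard(d1, d2))
```

Because the measure works on sets, repeated occurrences of a term within a document do not change the score.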