SLIDE 7 ECDL 2008 Norwegian University of Science and Technology
Data Preprocessing
A direct comparison between the words extracted from a document and temporal language models limits accuracy.
Semantic-based Preprocessing: Description
Part-of-speech tagging: The most interesting classes of words are selected, e.g. nouns, verbs, and adjectives
Collocation extraction: Co-occurrence of different words can alter the meaning, e.g. “United States”
Word sense disambiguation: Identifying the correct sense of a word by analyzing its context in a sentence, e.g. “bank”
Concept extraction: Comparing two language models at the concept level avoids the low-frequency word problem
Word filtering: Only the top-ranked N terms according to TF-IDF scores are selected as index terms
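
The word filtering step (keeping only the top-ranked N terms by TF-IDF) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `top_n_terms`, the toy documents, and the particular TF-IDF variant (relative term frequency times the natural log of N/df) are all assumptions, since the slide does not say which weighting was used.

```python
import math
from collections import Counter


def top_n_terms(docs, n):
    """Return, for each tokenized document, its n highest-scoring TF-IDF terms.

    Hypothetical helper: assumes tf = count / document length and
    idf = ln(number of documents / document frequency).
    """
    num_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    selected = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(num_docs / df[term])
            for term, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically so the
        # result is deterministic.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        selected.append(ranked[:n])
    return selected


# Toy corpus (assumed data, already tokenized and lower-cased).
docs = [
    ["bank", "loan", "interest", "bank", "the"],
    ["river", "bank", "water", "the"],
    ["election", "vote", "the", "president"],
]

index_terms = top_n_terms(docs, 2)
print(index_terms)
```

In this toy corpus "the" occurs in every document, so its IDF is zero and it is never ranked ahead of a term that is specific to some documents.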