

  1. Text Representation http://www.cse.iitb.ac.in/~soumen/mining-the-web/ Ahmed Rafea

  2. Text Representation
     - Document Preprocessing
     - Vector Space Model for Document Storage
     - Measures of Similarity

  3. Document preprocessing (1/4)
     - Tokenization
       - Filtering away tags
       - Tokens regarded as nonempty sequences of characters, excluding spaces and punctuation
       - Each token represented by a suitable integer, tid, typically 32 bits
       - Optional: stemming/conflation of words
       - Result: document (did) transformed into a sequence of integers (tid, pos)
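
The tokenization step above can be sketched as follows. This is an illustrative implementation, not the one from the course; the function name `tokenize` and the tag-stripping regex are assumptions.

```python
import re

def tokenize(text, lexicon):
    """Strip tags, split into lowercase word tokens, and map each token
    to an integer token id (tid), recording its position (pos)."""
    text = re.sub(r"<[^>]+>", " ", text)           # filter away tags
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    result = []
    for pos, tok in enumerate(tokens):
        tid = lexicon.setdefault(tok, len(lexicon))  # assign a fresh tid on first sight
        result.append((tid, pos))
    return result

lexicon = {}
pairs = tokenize("<p>Text mining of the <b>Web</b></p>", lexicon)
# The document is now a sequence of (tid, pos) integer pairs.
```

A real system would persist the lexicon (token-to-tid map) so that tids stay stable across documents.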

  4. Document preprocessing (2/4)
     - Stopwords
       - Function words and connectives
       - Appear in a large number of documents and are of little use in pinpointing documents
       - Issues
         - Queries containing only stopwords are ruled out
         - Polysemous words may be stopwords in one sense but not in others
           - E.g.: "can" as a verb vs. "can" as a noun
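
A minimal sketch of stopword filtering; the stopword list here is a tiny illustrative sample, not a standard list.

```python
# Illustrative stopword list (real lists contain a few hundred entries).
STOPWORDS = {"the", "of", "a", "an", "and", "or", "in", "to", "is"}

def remove_stopwords(tokens):
    """Drop function words and connectives that appear in most
    documents and carry little power to pinpoint a document."""
    return [t for t in tokens if t not in STOPWORDS]

kept = remove_stopwords(["the", "state", "of", "the", "art"])
# Only the content-bearing terms survive.
```

Note the issue from the slide: a query made up entirely of stopwords would be filtered down to nothing.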

  5. Document preprocessing (3/4)
     - Stemming
       - Remove inflections that convey parts of speech, tense, and number
       - E.g.: "university" and "universal" both stem to "universe"
       - Techniques
         - Morphological analysis (e.g., Porter's algorithm)
         - Dictionary lookup (e.g., WordNet)
       - Stemming may increase the number of documents in the response to a query, but at the price of precision
         - It is not a good idea to stem abbreviations and names coined in the technical and commercial sectors
         - E.g.: stemming "ides" to "IDE" (the hard disk standard), or "SOCKS" (the firewall protocol) to "sock" (worn on the foot), may be bad!
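
A toy suffix stripper illustrating the idea (this is not Porter's full algorithm; the suffix list and the protected-term set are invented for illustration). It also shows the guard against stemming abbreviations, as the slide warns.

```python
# Abbreviations / coined names that must never be stemmed (illustrative).
PROTECTED = {"ides", "socks"}

def crude_stem(word):
    """Strip a few common inflectional suffixes; protected terms
    (abbreviations, coined names) pass through unchanged."""
    w = word.lower()
    if w in PROTECTED:
        return w
    for suffix in ("ation", "ity", "ies", "ing", "ed", "s"):
        # Only strip if a reasonably long stem remains.
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w
```

Real stemmers (Porter's algorithm, WordNet lookup) apply many more rules and conditions than this sketch.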

  6. Document preprocessing (4/4)
     - Non-uniformity of word spellings
       - Dialects of English
       - Transliteration from other languages
     - Two ways to reduce this problem:
       1. An aggressive conflation mechanism to collapse variant spellings into the same token
          - E.g.: Soundex takes phonetics and pronunciation details into account
          - Used with great success in indexing and searching last names in census and telephone directory data
       2. Decompose terms into sequences of q-grams, i.e., sequences of q characters (2 <= q <= 4)
          - Check for similarity in the q-grams
          - Looking up the inverted index becomes a two-stage affair:
            - A smaller index of q-grams is consulted to expand each query term into a set of slightly distorted query terms
            - These terms are then submitted to the regular index
          - Used by Google for spelling correction
          - Idea also adopted for eliminating near-duplicate pages
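
The q-gram decomposition (option 2 above) can be sketched as follows; the padding marker and the Jaccard-style overlap measure are illustrative choices.

```python
def qgrams(term, q=3):
    """Decompose a term into its set of overlapping q-grams
    (2 <= q <= 4 is typical); '#' padding marks word boundaries."""
    padded = f"#{term.lower()}#"
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def qgram_similarity(a, b, q=3):
    """Overlap of q-gram sets (Jaccard), used to decide which slightly
    distorted query terms a given term should be expanded into."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)

# Variant spellings share most of their q-grams:
sim = qgram_similarity("colour", "color")
```

Here "colour" and "color" share the grams `#co`, `col`, `olo` out of 8 distinct grams overall, so they score well above two unrelated terms.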

  7. The vector space model (1/4)
     - Documents represented as vectors in a multi-dimensional Euclidean space
       - Each axis = a term (token)
     - Coordinate of document d in the direction of term t determined by:
       - Term frequency TF(d,t)
         - Number of times term t occurs in document d, scaled in a variety of ways to normalize for document length
       - Inverse document frequency IDF(t)
         - Scales down the coordinates of terms that occur in many documents

  8. The vector space model (2/4)
     - Term frequency, from raw counts n(d,t):
         TF(d,t) = n(d,t) / sum_tau n(d,tau)   or   TF(d,t) = n(d,t) / max_tau n(d,tau)
     - The Cornell SMART system uses a smoothed version:
         TF(d,t) = 0                      if n(d,t) = 0
         TF(d,t) = 1 + log(1 + n(d,t))    otherwise
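
A sketch of the smoothed TF computation above (function name `tf_smart` is an assumption):

```python
import math
from collections import Counter

def tf_smart(doc_tokens):
    """Smoothed term frequency in the SMART style shown above:
    TF(d,t) = 0 if n(d,t) = 0, else 1 + log(1 + n(d,t))."""
    n = Counter(doc_tokens)          # raw counts n(d,t)
    return {t: 1 + math.log(1 + c) for t, c in n.items()}

tf = tf_smart(["web", "mining", "web"])
# "web" occurs twice -> 1 + log(3); "mining" once -> 1 + log(2).
```

The logarithm dampens raw counts, so a term occurring 100 times does not get 100 times the weight of a term occurring once.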

  9. The vector space model (3/4)
     - Inverse document frequency
       - Given: D is the document collection and D_t is the set of documents containing t
       - Formulae: mostly dampened functions of |D| / |D_t|
       - SMART uses:
           IDF(t) = log( (1 + |D|) / |D_t| )
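
The IDF formula can be computed directly over a collection of token sets; the representation of documents as Python sets here is an illustrative choice.

```python
import math

def idf(term, docs):
    """IDF(t) = log((1 + |D|) / |D_t|), where D_t is the subset of
    documents containing t; terms in no document get IDF 0."""
    d_t = sum(1 for d in docs if term in d)
    return math.log((1 + len(docs)) / d_t) if d_t else 0.0

docs = [{"web", "mining"}, {"web", "search"}, {"graph", "mining"}]
# "web" appears in 2 of 3 docs; "graph" in only 1, so it scores higher.
```

Rarer terms get larger IDF, which is exactly the "scale down terms that occur in many documents" effect from the previous slide.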

  10. Vector space model (4/4)
     - Coordinate of document d along axis t:
         d_t = TF(d,t) * IDF(t)
       - Document d is thus transformed to a vector in the TFIDF-space
     - Query q
       - Interpreted as a document
       - Transformed to a vector in the same TFIDF-space as d
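
Combining the SMART TF and IDF forms from the previous slides gives the document-to-vector mapping; this sketch stores the vector sparsely as a dict and the helper name `tfidf_vector` is an assumption.

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, docs):
    """Coordinate of document d along each axis t: d_t = TF(d,t) * IDF(t),
    using the smoothed TF and SMART IDF from the slides above."""
    n = Counter(doc_tokens)
    vec = {}
    for t, c in n.items():
        d_t = sum(1 for d in docs if t in d)
        idf = math.log((1 + len(docs)) / d_t) if d_t else 0.0
        vec[t] = (1 + math.log(1 + c)) * idf
    return vec

corpus = [{"web", "mining"}, {"web", "search"}, {"graph", "mining"}]
q = tfidf_vector(["web", "graph"], corpus)  # a query treated as a document
```

Because "graph" is rarer than "web" in this toy corpus, its coordinate in the query vector is larger.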

  11. Measures of Similarity (1/3)
     - Distance measure
       - Magnitude of the vector difference: |d - q|
       - Document vectors must be normalized to unit (L1 or L2) length
         - Else longer documents dominate (since queries are short)
     - Cosine similarity
       - Cosine of the angle between d and q
       - Shorter documents are penalized
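
Cosine similarity over sparse vectors can be sketched as follows; representing vectors as dicts is an illustrative choice matching the TFIDF sketch above.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dicts mapping
    term -> weight); dividing by the norms removes length bias."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

d = {"web": 2.0, "mining": 1.0}
q = {"web": 1.0}
score = cosine(d, q)  # 2 / sqrt(5), since only "web" is shared
```

A vector is always at angle 0 to itself, so `cosine(d, d)` is 1 regardless of document length.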

  12. Measures of Similarity (2/3)
     - Jaccard coefficient of similarity between documents d1 and d2
       - T(d) = set of tokens in document d
       - r'(d1,d2) = |T(d1) ∩ T(d2)| / |T(d1) ∪ T(d2)|
       - Symmetric, reflexive
       - Forgives any number of occurrences and any permutations of the terms
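
The Jaccard coefficient translates directly into set operations:

```python
def jaccard(d1_tokens, d2_tokens):
    """r'(d1,d2) = |T(d1) ∩ T(d2)| / |T(d1) ∪ T(d2)|.
    Converting to sets discards counts and order, which is exactly
    why repeats and permutations are 'forgiven'."""
    t1, t2 = set(d1_tokens), set(d2_tokens)
    return len(t1 & t2) / len(t1 | t2) if t1 | t2 else 1.0

# Permutations and repeated occurrences do not matter:
a = ["web", "mining", "web"]
b = ["mining", "web"]
score = jaccard(a, b)  # the token sets are identical
```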

  13. Measures of Similarity (3/3)
     - Represent each document as a set of q-grams (shingles)
     - A shingle is a contiguous subsequence of tokens taken from a document
     - S(d,w) is the set of distinct shingles of width w taken from document d
     - When w is fixed, S(d,w) is shortened to S(d)
     - When w = 1, S(d) = T(d)
     - Using the shingled document representation, the resemblance r(d1,d2) between d1 and d2 is defined by Jaccard similarity with T(d) replaced by S(d,w)
     - The two documents are deemed similar if the Jaccard similarity is above a threshold
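
Shingling and resemblance can be sketched as follows; the width default w = 2 is an illustrative choice.

```python
def shingles(tokens, w):
    """S(d,w): the set of distinct contiguous token subsequences
    (shingles) of width w; with w = 1 this reduces to T(d)."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(t1, t2, w=2):
    """r(d1,d2): Jaccard similarity with T(d) replaced by S(d,w)."""
    s1, s2 = shingles(t1, w), shingles(t2, w)
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 1.0

d1 = ["mining", "the", "web", "is", "fun"]
d2 = ["mining", "the", "web", "is", "hard"]
score = resemblance(d1, d2)  # 3 shared shingles out of 5 distinct
```

Unlike plain token Jaccard, shingles capture word order, which is what makes this useful for detecting near-duplicate pages.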
