Text Representation - PowerPoint PPT Presentation
SLIDE 1

Text Representation

http://www.cse.iitb.ac.in/~soumen/mining-the-web/

Ahmed Rafea

SLIDE 2

Text Representation

Document Preprocessing
Vector Space Model for Document Storage
Measure of Similarity

SLIDE 3

Document preprocessing(1/4)

Tokenization

  • Filtering away tags
  • Tokens regarded as nonempty sequences of characters excluding spaces and punctuation
  • Each token represented by a suitable integer, tid, typically 32 bits
  • Optional: stemming/conflation of words
  • Result: document (did) transformed into a sequence of integers (tid, pos)
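The steps above can be sketched as follows. The names `tokenize`, `tid_of`, and `postings` are illustrative, and a real indexer would share one token-to-tid dictionary across the whole collection rather than building one per call:

```python
import re

def tokenize(html_doc, did):
    """Turn one document (did) into a sequence of (tid, pos) pairs."""
    text = re.sub(r"<[^>]+>", " ", html_doc)   # filter away tags
    # Approximate "nonempty sequences of characters excluding spaces
    # and punctuation" with runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    tid_of = {}                                # token -> integer id (fits in 32 bits)
    postings = []
    for pos, tok in enumerate(tokens):
        tid = tid_of.setdefault(tok, len(tid_of))
        postings.append((tid, pos))
    return tid_of, postings

tid_of, postings = tokenize("<p>web mining mines the web</p>", did=1)
# postings -> [(0, 0), (1, 1), (2, 2), (3, 3), (0, 4)]
```

Note that the repeated token "web" maps to the same tid (0) at two different positions.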

SLIDE 4

Document preprocessing(2/4)

Stopwords

  • Function words and connectives
  • Appear in a large number of documents and are of little use in pinpointing documents
  • Issues
    Queries containing only stopwords are ruled out
    Polysemous words that are stopwords in one sense but not in others
      – E.g.: “can” as a verb vs. “can” as a noun
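A minimal stopword filter looks like this; the list here is a tiny illustrative sample, not a standard stopword list:

```python
# A tiny illustrative stopword list; real systems use longer curated lists.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is"}

def remove_stopwords(tokens):
    """Drop function words and connectives before indexing."""
    return [t for t in tokens if t not in STOPWORDS]

# A query made up largely of stopwords loses most of its terms:
remove_stopwords(["to", "be", "or", "not", "to", "be"])  # -> ["be", "not", "be"]
```

A plain set lookup like this cannot tell "can" the verb from "can" the noun, which is exactly the polysemy issue noted above; resolving it would require part-of-speech context.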

SLIDE 5

Document preprocessing(3/4)

Stemming

  • Remove inflections that convey parts of speech, tense and number
  • E.g.: university and universal both stem to universe
  • Techniques
    morphological analysis (e.g., Porter's algorithm)
    dictionary lookup (e.g., WordNet)
  • Stemming may increase the number of documents in the response to a query, but at the price of precision
  • It is not a good idea to stem abbreviations and names coined in the technical and commercial sectors
    – E.g.: stemming “ides” to “IDE”, the hard-disk standard, or “SOCKS”, the firewall protocol, to “sock” worn on the foot, may be bad!
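A crude suffix-stripping sketch illustrates the idea. This is NOT Porter's algorithm (which applies several ordered rule phases); the suffix list and the protected-word list are invented for illustration, and the protected list shows one way to shield abbreviations and coined names from conflation:

```python
def toy_stem(word, protected=frozenset({"socks", "ides"})):
    """Crude suffix stripping, NOT Porter's algorithm: a toy
    illustration of conflation plus a protected list that keeps
    abbreviations and coined names unstemmed."""
    if word.lower() in protected:      # don't stem protected names
        return word
    for suffix in ("ing", "ies", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

toy_stem("universities")  # -> "universit"
toy_stem("SOCKS")         # -> "SOCKS" (protected, not conflated with "sock")
```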

SLIDE 6

Document preprocessing(4/4)

Non-uniformity of word spellings

  • dialects of English
  • transliteration from other languages

Two ways to reduce this problem:

1. Aggressive conflation mechanism to collapse variant spellings into the same token
  • E.g.: Soundex takes phonetics and pronunciation details into account
  • Used with great success in indexing and searching last names in census and telephone directory data

2. Decompose terms into a sequence of q-grams, i.e., sequences of q characters (2 ≤ q ≤ 4)
  • Check for similarity in the grams
  • Looking up the inverted index becomes a two-stage affair:
    – a smaller index of q-grams is consulted to expand each query term into a set of slightly distorted query terms
    – these terms are then submitted to the regular index
  • Used by Google for spelling correction
  • Idea also adopted for eliminating near-duplicate pages
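The q-gram decomposition in option 2 can be sketched as follows; the boundary marker `#` and the overlap measure are common conventions rather than a prescribed design:

```python
def qgrams(term, q=3):
    """Decompose a term into its set of q-grams (here q = 3,
    within the suggested range 2 <= q <= 4); '#' marks word boundaries."""
    padded = "#" + term + "#"
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def gram_similarity(a, b, q=3):
    """Fraction of shared grams: a simple overlap measure between terms."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)

gram_similarity("color", "colour")  # -> 0.375 (3 shared grams out of 8 distinct)
```

Variant spellings share many grams, so a small q-gram index can expand "colour" into nearby terms such as "color" before consulting the regular index.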

SLIDE 7

The vector space model (1/4)

Documents represented as vectors in a multi-dimensional Euclidean space

  • Each axis = a term (token)

Coordinate of document d in direction of term t determined by:

  • Term frequency TF(d,t)
    the number of times term t occurs in document d, scaled in a variety of ways to normalize document length
  • Inverse document frequency IDF(t)
    used to scale down the coordinates of terms that occur in many documents

SLIDE 8

The vector space model (2/4)

Term frequency

  • Raw counts n(d,t) may be scaled to normalize document length, e.g.:

      TF(d,t) = n(d,t) / Σ_τ n(d,τ)   or   TF(d,t) = n(d,t) / max_τ n(d,τ)

  • The Cornell SMART system uses a smoothed version:

      TF(d,t) = 0                         if n(d,t) = 0
      TF(d,t) = 1 + log(1 + log n(d,t))   otherwise
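The SMART smoothed term frequency is a one-liner; note how the double logarithm makes the weight grow very slowly with the raw count:

```python
import math

def smart_tf(n_dt):
    """Cornell SMART smoothed term frequency:
    0 if the term is absent, else 1 + log(1 + log n(d,t))."""
    if n_dt == 0:
        return 0.0
    return 1.0 + math.log(1.0 + math.log(n_dt))

smart_tf(1)    # -> 1.0
smart_tf(100)  # grows very slowly with the raw count
```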

SLIDE 9

The vector space model (3/4)

Inverse document frequency

  • Given
    D is the document collection and D_t is the set of documents containing t
  • Formulae
    mostly dampened functions of |D| / |D_t|
    SMART uses:

      IDF(t) = log(1 + |D| / |D_t|)
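The SMART IDF formula above translates directly; `num_docs` is |D| and `doc_freq` is |D_t|:

```python
import math

def idf(num_docs, doc_freq):
    """IDF(t) = log(1 + |D| / |D_t|), where doc_freq = |D_t| > 0."""
    return math.log(1.0 + num_docs / doc_freq)

# Rare terms get larger weights than common ones:
idf(1000, 1) > idf(1000, 500)  # -> True
```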

SLIDE 10

The vector space model (4/4)

Coordinate of document d in axis t

  • d_t = TF(d,t) · IDF(t)
  • Document d is thus transformed to the vector d⃗ in the TFIDF-space

Query q

  • Interpreted as a document
  • Transformed to the vector q⃗ in the same TFIDF-space as d
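Building the vector d⃗ can be sketched as below. For brevity this uses raw counts as TF; a real system would plug in a scaled TF such as SMART's smoothed version:

```python
from collections import Counter

def tfidf_vector(doc_tokens, idf_of):
    """Sparse vector with coordinates d_t = TF(d,t) * IDF(t);
    raw counts stand in for TF here for brevity."""
    counts = Counter(doc_tokens)
    return {t: n * idf_of.get(t, 0.0) for t, n in counts.items()}

tfidf_vector(["web", "web", "mining"], {"web": 1.0, "mining": 2.0})
# -> {"web": 2.0, "mining": 2.0}
```

A query is vectorized the same way, so document and query live in the same TFIDF-space.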

SLIDE 11

Measures of Similarity (1/3)

Distance measure

  • Magnitude of the vector difference |d⃗ − q⃗|
  • Document vectors must be normalized to unit (L1 or L2) length
    Else shorter documents dominate (since queries are short)

Cosine similarity

  • Cosine of the angle between d⃗ and q⃗
  • Shorter documents are penalized
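Cosine similarity over sparse TFIDF vectors (stored as dicts) can be sketched as:

```python
import math

def cosine(u, v):
    """Cosine of the angle between sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Scaling a vector leaves the angle, hence the score, unchanged:
cosine({"web": 2.0, "mining": 2.0}, {"web": 1.0, "mining": 1.0})  # ≈ 1.0
```

Because the dot product is divided by both norms, document length is factored out of the score, unlike the raw distance |d⃗ − q⃗|.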

SLIDE 12

Measures of Similarity (2/3)

  • Jaccard coefficient of similarity between documents d1 and d2
  • T(d) = set of tokens in document d
  • r'(d1, d2) = |T(d1) ∩ T(d2)| / |T(d1) ∪ T(d2)|
  • Symmetric, reflexive
  • Forgives any number of occurrences and any permutations of the terms
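On Python sets the Jaccard coefficient is a direct transcription of the formula; the empty-vs-empty convention here is an assumption, not part of the definition above:

```python
def jaccard(t1, t2):
    """r'(d1,d2) = |T(d1) ∩ T(d2)| / |T(d1) ∪ T(d2)| on token sets."""
    if not t1 and not t2:
        return 1.0        # convention chosen for two empty documents
    return len(t1 & t2) / len(t1 | t2)

jaccard({"web", "mining"}, {"web", "search"})  # one shared token of three distinct
```

Since the inputs are sets, repeated occurrences and word order are ignored, which is exactly the "forgiving" behavior noted above.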

SLIDE 13

Measures of Similarity (3/3)

Represent each document as a set of q-grams (shingles)

  • A shingle is a contiguous subsequence of tokens taken from a document
  • S(d, w) is the set of distinct shingles of width w taken from document d
  • When w is fixed, S(d, w) is shortened to S(d)
  • When w = 1, S(d) = T(d)
  • Using the shingled representation, one may define the resemblance r(d1, d2) between d1 and d2 as the Jaccard similarity with T(d) replaced by S(d, w)
  • The two documents are considered similar if this resemblance is above a threshold
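The shingled representation and the resemblance r(d1, d2) can be sketched as follows (with shingles stored as tuples, so for w = 1 each shingle is a single token wrapped in a tuple, matching T(d) up to that wrapping):

```python
def shingles(tokens, w):
    """S(d, w): the set of distinct contiguous token subsequences of width w."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(tokens1, tokens2, w):
    """r(d1, d2): Jaccard similarity with T(d) replaced by S(d, w)."""
    s1, s2 = shingles(tokens1, w), shingles(tokens2, w)
    return len(s1 & s2) / len(s1 | s2) if (s1 | s2) else 1.0

# Wider shingles also capture word order, which plain token sets ignore:
resemblance(["a", "b", "c"], ["a", "b", "d"], w=2)  # shares 1 of 3 shingles
```

Flagging two documents as near-duplicates then amounts to testing `resemblance(...) > threshold` for some chosen threshold.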