


N-grams: Language Models

N-grams & Language ID

  • If N-gram models represent “language” models, can we use N-gram models for Language Identification?
  • For example, can we use them to differentiate between text in German, text in English, text in Czech, etc.?
  • If so, how?
  • What’s the lower threshold for the size of text that can ensure successful ID?

Zipf’s law

  • ‘The nth most common word in a human language text occurs with a frequency inversely proportional to n.’
  • The most frequent word will occur approximately twice as often as the second most frequent word, which in turn occurs twice as often as the fourth most frequent word, etc.

Zipf’s Law

  • See pg. 24 of M&S (Tom Sawyer):

        Word    Freq   Rank
        the     3332      1
        and     2972      2
        a       1775      3
        he       877     10
        but      410     20
        be       294     30
        there    222     40
[Figure: Zipf distribution from the Brown Corpus]

Zipf’s Law

  • Li (1992) showed that randomly typed letters (with spaces) give a Zipf-like distribution
  • Applies also to n-grams across a text
  • Cavnar and Trenkle exploit the fact that the distribution of n-grams varies from language to language (in fact, from text type to text type)


N-grams: Cavnar & Trenkle

Cavnar & Trenkle methodology

  • Split the text into separate tokens consisting only of letters and apostrophes. Digits and punctuation are discarded. Pad the token with sufficient blanks before and after.
  • Generate all possible N-grams, for N = 1 to 5. Use positions that span the padding blanks, as well.
  • Hash into a table to find the counter for the N-gram, and increment it. The hash table uses a conventional collision-handling mechanism so that each N-gram gets its own counter.
  • When done, output all N-grams and their counts.
  • Sort those counts into reverse order by the number of occurrences. (A sketch of these steps follows this list.)
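A minimal Python sketch of the profile-building steps above (a Counter stands in for the paper’s hash table; the 300-gram cutoff follows the paper’s reported practice, but the names here are ours):

    import re
    from collections import Counter

    def build_profile(text, max_rank=300):
        """N-gram frequency profile (N = 1..5), most frequent first."""
        counts = Counter()
        # Tokens consist only of letters and apostrophes;
        # digits and punctuation are discarded.
        for token in re.findall(r"[a-zA-Z']+", text.lower()):
            for n in range(1, 6):                 # all N-grams, N = 1 to 5
                # Pad with blanks so N-grams span the token boundaries.
                padded = " " + token + " " * max(1, n - 1)
                for i in range(len(padded) - n + 1):
                    counts[padded[i:i + n]] += 1  # find the counter, increment
        # Sort into reverse order by number of occurrences; keep top ranks.
        return [gram for gram, _ in counts.most_common(max_rank)]

    profile = build_profile("The quick brown fox jumps over the lazy dog.")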

Cavnar & Trenkle methodology

  • N-gram profiles are created from a body of text for each target language (or document type, etc.)
  • A profile is created for the text to be evaluated.
  • This profile is compared against the stored profiles using a simple rank-order statistic (sketched below).
  • The closest profile is the one with the smallest distance measure.
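The rank-order statistic Cavnar & Trenkle use is the “out-of-place” measure: for each N-gram in the document’s profile, add how far its rank is from the same N-gram’s rank in the language profile, with a maximum penalty for N-grams absent from the language profile. A minimal sketch, reusing the profile lists built above:

    def out_of_place(doc_profile, lang_profile):
        """Sum of rank differences; smaller means a closer match."""
        lang_rank = {gram: r for r, gram in enumerate(lang_profile)}
        max_penalty = len(lang_profile)   # for N-grams absent from the profile
        return sum(abs(r - lang_rank[gram]) if gram in lang_rank else max_penalty
                   for r, gram in enumerate(doc_profile))

    # The closest profile is the one with the smallest distance:
    # guess = min(languages, key=lambda lg: out_of_place(doc_profile, profiles[lg]))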

Cavnar & Trenkle

  • Download implementation from:

http://software.wise-guys.nl/libtextcat/

Damashek Methodology

  • Damashek 1995: ‘Gauging Similarity with n-Grams: Language-Independent Categorization of Text’
  • Taps into a similar Zipfian notion, but uses the Vector Space Model instead


Vector Space Models

  • Often used in IR and search tasks
  • (We’ll be covering these more this term)
  • Essentially: represent some source data (text document, Web page, e-mail) by some vector representation (a toy example follows this list)
    – Vectors composed of counts/frequency of particular words (esp. certain content words) or other objects of interest
    – ‘Search’ vector compared against ‘target’ vectors
    – Most closely related vectors cluster together
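A toy Python illustration of the representation (the vocabulary and sentence are invented for the example):

    from collections import Counter

    vocabulary = ["cat", "dog", "fish"]            # the 'objects of interest'

    def to_vector(text):
        counts = Counter(text.lower().split())
        return [counts[w] for w in vocabulary]     # one component per word

    print(to_vector("The cat saw the dog and the cat"))   # [2, 1, 0]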

Vector Space Models

  • Distance is measured by the cosine measure which, if the vectors are normalized, is their dot product:

    $\mathrm{sim}(\vec{q}_k, \vec{d}_j) = \vec{q}_k \cdot \vec{d}_j = \sum_{i=1}^{N} w_{i,k} \times w_{i,j} = \cos\theta_{qd}$

  • cos = 1 means identical vectors; cos = 0 means completely unrelated (orthogonal) vectors
  • Perl has built-in methods for working with vectors.
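A minimal sketch of the cosine measure (the slides point to Perl, but Python is used here for consistency with the other sketches):

    import math

    def cosine(q, d):
        """sim(q, d): dot product divided by the vector norms; when
        q and d are unit-normalized this is just the dot product."""
        dot = sum(wq * wd for wq, wd in zip(q, d))
        return dot / (math.sqrt(sum(w * w for w in q)) *
                      math.sqrt(sum(w * w for w in d)))

    print(cosine([1, 2, 0], [2, 4, 0]))   # 1.0  (identical direction)
    print(cosine([1, 0, 0], [0, 1, 0]))   # 0.0  (orthogonal, unrelated)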

Damashek’s methodology

  • Step an n-gram window through the document, one character at a time
  • Convert each n-gram into an indexing key (for eventual hashing)
  • Concatenate all such keys into a list and note the length
  • Order the list by key value (sort)
  • Count and store the # of occurrences of each distinct key
  • Divide the # of occurrences of each distinct key by the length of the original list (normalization; see the sketch after this list)
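A minimal Python sketch of these steps (a dict of relative frequencies stands in for the sorted key list; n = 5 is assumed here as one plausible window size):

    from collections import Counter

    def damashek_vector(text, n=5):
        """Relative-frequency vector of character n-grams; the window
        steps through the document one character at a time."""
        grams = [text[i:i + n] for i in range(len(text) - n + 1)]
        counts = Counter(grams)               # occurrences per distinct key
        total = len(grams)                    # length of the original list
        return {g: c / total for g, c in counts.items()}   # normalization

Two such sparse vectors can then be compared with the cosine measure above, summing only over the n-grams the two documents share.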

Damashek

  • Vectors built from documents are compared against known vectors
  • The highest cosine measure means the closest language (or document)
  • Damashek’s method works better with longer documents

Damashek

  • See the following URL for a demo:

http://epsilon3.georgetown.edu/~ballc/languageid/index.html

  • For his original Science paper, see JSTOR:

http://www.jstor.org/view/00368075/di002302/00p0139l/0?currentResult=00368075%2bdi002302%2b00p0139l%2b0%2c03&searchUrl=http%3A%2F%2Fwww.jstor.org%2Fsearch%2FBasicResults%3Fhp%3D25%26si%3D1%26Query%3Ddamashek