n grams language id
play

N-grams & Language ID If N-gram models represent language - PDF document

N-grams & Language ID If N-gram models represent language models, can we use N-gram models for N-grams Language Identification? For example, can we use it to differentiate between text in German, text in English, text in


  1. N-grams & Language ID • If N-gram models represent “language” models, can we use N-gram models for N-grams Language Identification? • For example, can we use it to differentiate between text in German, text in English, text in Czech, etc.? Language Models • If so, how? • What’s the lower threshold for the size of text that can ensure successful ID? Zipf’s law Zipf’s Law • ‘The n th most common word in a human • See pg. 24 of M&S (Tom Sawyer): language text occurs with a frequency inversely proportional to n .’ • The most frequent word will occur approximately Word Freq rank twice as often as the second most frequent the 3332 1 word, which occurs twice as often as the fourth and 2972 2 most frequent word, etc and 1775 3 he 877 10 but 410 20 be 294 30 there 222 40 From Brown Corpus Zipf’s Law • Li (1992) showed that random typed letters (with spaces) give Zipf-like distribution • Applies also to n-grams across a text • Cavnar and Trenkle tap the fact that the distribution of n-grams will vary from language to language (in fact, from text type to text type) 1

  2. N-grams, Cavnar & Trenkle Cavnar & Trenkle methodology • Split the text into separate tokens consisting only of letters and apostrophes. Digitsand punctuation are discarded. Pad the token with sufficient blanks before and after. • Generate all possible N-grams, for N=1 to 5. Use positions that span the padding blanks, as well. • Hash into a table to find the counter for the N- gram, and increment it. The hash table • When done, output all N-grams and their counts. • Sort those counts into reverse order by the number of occurrences. Cavnar & Trenkle methodology Cavnar & Trenkle methodology • N-grams profiles created for a body of text for each target language (or document type, etc.) • A profile is created for a text to be evaluated. • This profile is compared against stored profiles using a simple rank-order statistic. • The closest profile is the one with the smallest distance measure. Cavnar & Trenkle Damashek Methodology • Download implementation from: • Damashek 1995: Gauging Similarity with n-Grams: Language-Independent http://software.wise-guys.nl/libtextcat/ Categorization of Text • Taps into the similar Zipfian notion, but uses Vector Space Model instead 2

  3. Vector Space Models Vector Space Models • Often used in IR and search tasks • Distance measured by cosine measure, • (We’ll be covering these more this term) which if vectors are normalized, is their dot • Essentially: represent some source data (text product: document, Web page, e-mail) by some vector representation N → → → → sim(q k ,d j ) = q k • d j = Σ w i,k × w i,j – Vectors composed of counts/frequency of particular i =1 = cos θ qd words (esp. certain content words) or other objects of interest – ‘Search’ vector compared against ‘target’ vectors • Cos = 1 means identical vectors, 0 not – Most closely related vectors cluster together • Perl has a built in methods for working with vectors. Damashek’s methodology Damashek • Step n-gram window through document, one • Vectors built of documents compared character at a time against known vectors • Convert each n-gram into an indexing key (for • Highest cosine measure means closest eventual hashing) language (or document) • Concatenate all such keys into a list and note the length • Damashek’s method works better with • Order list by key value (sort) longer documents • Count and store # of occurrences of each distinct key • Divide # of occurrences of each distinct key by the length of the original list (normalization) Damashek • See the following URL for demo: http://epsilon3.georgetown.edu/~ballc/languagei d/index.html • For his original Science paper, see Jstor: http://www.jstor.org/view/00368075/di002302/00 p0139l/0?currentResult=00368075%2bdi002302 %2b00p0139l%2b0%2c03&searchUrl=http%3A %2F%2Fwww.jstor.org%2Fsearch%2FBasicRes ults%3Fhp%3D25%26si%3D1%26Query%3Dda mashek 3

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend