
Language and Computers: Writers’ Aids



  1. Language and Computers: Writers’ Aids
     Based on Dickinson, Brew, & Meurers (2013)
     Indiana University, Fall 2013
     Outline:
     ◮ Introduction
     ◮ Error causes
       ◮ Keyboard mistypings
       ◮ Phonetic errors
       ◮ Knowledge problems
     ◮ Challenges
       ◮ Tokenization
       ◮ Inflection
       ◮ Productivity
     ◮ Non-word error detection
       ◮ Dictionaries
       ◮ N-gram analysis
     ◮ Isolated-word error correction
       ◮ Rule-based methods
       ◮ Similarity key techniques
       ◮ Probabilistic methods
       ◮ Minimum edit distance
     ◮ Error correction for web queries
     ◮ Grammar correction
       ◮ Syntax and Computing
       ◮ Grammar correction rules
     ◮ Caveat emptor

  2. Why people care about spelling
     ◮ Misspellings can cause misunderstandings
     ◮ Standard spelling makes it easy to organize words & text:
       ◮ e.g., Without standard spelling, how would you look up things in a lexicon or thesaurus?
       ◮ e.g., Optical character recognition software (OCR) can use knowledge about standard spelling to recognize scanned words even for hardly legible input.
     ◮ Standard spelling makes it possible to provide a single text, accessible to a wide range of readers (different backgrounds, speaking different dialects, etc.).
     ◮ Using standard spelling can make a good impression in social interaction.

  3. How are spell checkers used?
     ◮ interactive spelling checkers = spell checker detects errors as you type.
       ◮ It may or may not make suggestions for correction.
       ◮ It needs a “real-time” response (i.e., must be fast)
       ◮ It is up to the human to decide if the spell checker is right or wrong, and so we may not require 100% accuracy (especially with a list of choices)
     ◮ automatic spelling correctors = spell checker runs on a whole document, finds errors, and corrects them
       ◮ A much more difficult task.
       ◮ A human may or may not proofread the results later.

  4. Detection vs. Correction
     ◮ There are two distinct tasks:
       ◮ error detection = simply find the misspelled words
       ◮ error correction = correct the misspelled words
     ◮ e.g., It might be easy to tell that ater is a misspelled word, but what is the correct word? water? later? after?
     ◮ Note that detection is a prerequisite for correction.

  5. Error causes: Keyboard mistypings
     Space bar issues
     ◮ run-on errors = two separate words become one
       ◮ e.g., the fuzz becomes thefuzz
     ◮ split errors = one word becomes two separate items
       ◮ e.g., equalization becomes equali zation
     ◮ Note that the resulting items might still be words!
       ◮ e.g., a tollway becomes atoll way
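
A minimal sketch of how run-on and split errors might be probed, assuming a toy lexicon; `LEXICON`, `split_run_on`, and `join_split` are hypothetical names introduced here, not taken from the slides. A run-on is tested by trying every split point, a split error by gluing two neighbouring tokens back together.

```python
# Toy lexicon for illustration only (assumption).
LEXICON = {"the", "fuzz", "equalization", "a", "tollway", "atoll", "way"}

def split_run_on(token, lexicon=LEXICON):
    """Return every way to cut `token` into two lexicon words."""
    return [(token[:i], token[i:])
            for i in range(1, len(token))
            if token[:i] in lexicon and token[i:] in lexicon]

def join_split(left, right, lexicon=LEXICON):
    """Return the joined word if two adjacent tokens form a lexicon word,
    otherwise None."""
    joined = left + right
    return joined if joined in lexicon else None

print(split_run_on("thefuzz"))         # [('the', 'fuzz')]
print(join_split("equali", "zation"))  # equalization
print(split_run_on("atollway"))        # [('a', 'tollway'), ('atoll', 'way')]
```

The last call shows the problem noted on the slide: both a tollway and atoll way consist entirely of real words, so a lexicon lookup alone cannot tell which spacing the writer intended.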

  6. Error causes: Keyboard mistypings (cont.)
     Keyboard proximity
     ◮ e.g., Jack becomes Hack since h and j are next to each other on a typical American keyboard
     Physical similarity
     ◮ similarity of shape, e.g., mistaking two physically similar letters when typing up something handwritten
       ◮ e.g., tight for fight
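
Keyboard proximity can be modelled with a small adjacency table. The sketch below is a simplification under stated assumptions: it uses the three letter rows of a US QWERTY layout and treats only the keys immediately left and right on the same row as neighbours, ignoring diagonal neighbours. The names `ROWS`, `same_row_neighbors`, and `is_proximity_slip` are hypothetical.

```python
# Letter rows of a US QWERTY keyboard.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def same_row_neighbors(ch):
    """Keys immediately left and right of `ch` on its row
    (simplification: diagonal neighbours are ignored)."""
    for row in ROWS:
        i = row.find(ch)
        if i != -1:
            return set(row[max(i - 1, 0):i] + row[i + 1:i + 2])
    return set()

def is_proximity_slip(typed, intended):
    """True if `typed` differs from `intended` in exactly one position
    and the typed key sits next to the intended key."""
    typed, intended = typed.lower(), intended.lower()
    if len(typed) != len(intended):
        return False
    diffs = [(t, c) for t, c in zip(typed, intended) if t != c]
    return len(diffs) == 1 and diffs[0][0] in same_row_neighbors(diffs[0][1])

print(is_proximity_slip("Hack", "Jack"))  # True
print(is_proximity_slip("Mack", "Jack"))  # False
```

A corrector could use such a table to rank candidate corrections, preferring substitutions between adjacent keys over substitutions between distant ones.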

  7. Error causes: Phonetic errors
     phonetic errors = errors based on the sounds of a language (not necessarily on the letters)
     ◮ homophones = two words which sound the same
       ◮ e.g., red / read (past tense), cite / site / sight, they’re / their / there
     ◮ letter/word substitution: replacing a letter (or sequence of letters) with a similar-sounding one
       ◮ e.g., John kracked his nuckles. instead of John cracked his knuckles.

  8. Error causes: Knowledge problems
     ◮ not knowing a word and guessing its spelling (can be phonetic)
       ◮ e.g., sientist
     ◮ not knowing a rule and guessing it
       ◮ e.g., Do we double a consonant for ing words?
         jog → joging
         joke → jokking
     ◮ knowing something is odd about the spelling, but guessing the wrong thing
       ◮ e.g., typing siscors for the non-regular scissors
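
The consonant-doubling question can be made concrete with a rough heuristic. This is a sketch for short, regular verbs only, not a complete account of English spelling, and the function name `add_ing` is hypothetical: drop a final silent e after a consonant, double a final consonant after a single vowel, and otherwise just append ing. It over-applies doubling to longer verbs whose last syllable is unstressed (it would turn visit into visitting), which is exactly the kind of rule knowledge a writer may be missing.

```python
VOWELS = set("aeiou")

def add_ing(verb):
    """Attach -ing to a short regular verb (heuristic, assumption:
    one-syllable verbs; irregular cases are not handled)."""
    # Final silent e after a consonant is dropped: joke -> joking
    if len(verb) > 2 and verb.endswith("e") and verb[-2] not in VOWELS:
        return verb[:-1] + "ing"
    # Consonant-vowel-consonant ending doubles the last letter: jog -> jogging
    if (len(verb) >= 3
            and verb[-1] not in VOWELS and verb[-1] not in "wxy"
            and verb[-2] in VOWELS
            and verb[-3] not in VOWELS):
        return verb + verb[-1] + "ing"
    return verb + "ing"

print(add_ing("jog"))   # jogging
print(add_ing("joke"))  # joking
```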

  9. Challenges & Techniques for spelling correction
     Before we turn to how we detect spelling errors, we’ll look briefly at three issues:
     ◮ Tokenization: What is a word?
     ◮ Inflection: How are some words related?
     ◮ Productivity of language: How many words are there?
     How we handle these issues determines how we build a dictionary.
     And then we’ll turn to the techniques used:
     ◮ Non-word error detection
     ◮ Isolated-word error correction
     ◮ Context-dependent word error detection and correction → grammar correction

  10. Tokenization
      Intuitively a “word” is simply whatever is between two spaces, but this is not always so clear.
      ◮ contractions = two words combined into one
        ◮ e.g., can’t, he’s, John’s [car] (vs. his car)
      ◮ multi-token words = (arguably) a single word with a space in it
        ◮ e.g., New York, in spite of, deja vu
      ◮ hyphens (note: can be ambiguous if a hyphen ends a line)
        ◮ Some are always a single word: e-mail, co-operate
        ◮ Others are two words combined into one: Columbus-based, sound-change
      ◮ Abbreviations: may stand for multiple words
        ◮ e.g., etc. = et cetera, ATM = Automated Teller Machine
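
The tokenization choices above can be sketched as follows. This is one possible set of decisions, not the one the course prescribes: the regular expression keeps contractions and hyphenated forms as single tokens, and a hand-made `MULTI_TOKEN` list (hypothetical, for illustration) merges multi-token words afterwards. Trailing periods on abbreviations such as etc. are dropped in this sketch.

```python
import re

# Hand-made list of multi-token words (assumption, illustration only).
MULTI_TOKEN = {("new", "york"), ("in", "spite", "of"), ("deja", "vu")}

def tokenize(text):
    """Letter runs, optionally joined by an apostrophe or hyphen,
    so can't and e-mail stay single tokens."""
    return re.findall(r"[A-Za-z]+(?:['’\-][A-Za-z]+)*", text)

def merge_multi_token(tokens, multi=MULTI_TOKEN):
    """Greedily merge the longest known multi-token word at each position."""
    out, i = [], 0
    max_len = max(len(m) for m in multi)
    while i < len(tokens):
        for n in range(max_len, 1, -1):
            chunk = tuple(t.lower() for t in tokens[i:i + n])
            if len(chunk) == n and chunk in multi:
                out.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multi-token word starts here; keep the single token.
            out.append(tokens[i])
            i += 1
    return out

tokens = tokenize("He can't e-mail New York in spite of it")
print(merge_multi_token(tokens))
# ['He', "can't", 'e-mail', 'New York', 'in spite of', 'it']
```

Each decision changes what must be stored in the dictionary: keeping can't whole means listing contractions as entries, whereas splitting them would require entries such as n't or 's.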
