SLIDE 1

Language and Computers

Writers’ Aids

Based on Dickinson, Brew, & Meurers (2013), Indiana University, Fall 2013

SLIDE 2

Why people care about spelling

◮ Misspellings can cause misunderstandings.
◮ Standard spelling makes it easy to organize words & text:
  ◮ e.g., Without standard spelling, how would you look up things in a lexicon or thesaurus?
  ◮ e.g., Optical character recognition software (OCR) can use knowledge about standard spelling to recognize scanned words even for hardly legible input.
◮ Standard spelling makes it possible to provide a single text, accessible to a wide range of readers (different backgrounds, speaking different dialects, etc.).
◮ Using standard spelling can make a good impression in social interaction.

SLIDE 3

How are spell checkers used?

◮ interactive spelling checkers = the spell checker detects errors as you type.
  ◮ It may or may not make suggestions for correction.
  ◮ It needs a “real-time” response (i.e., it must be fast).
  ◮ It is up to the human to decide if the spell checker is right or wrong, and so we may not require 100% accuracy (especially with a list of choices).
◮ automatic spelling correctors = the spell checker runs on a whole document, finds errors, and corrects them.
  ◮ A much more difficult task.
  ◮ A human may or may not proofread the results later.

SLIDE 4

Detection vs. Correction

◮ There are two distinct tasks:
  ◮ error detection = simply find the misspelled words
  ◮ error correction = correct the misspelled words
◮ e.g., It might be easy to tell that ater is a misspelled word, but what is the correct word? water? later? after?
◮ Note that detection is a prerequisite for correction.

SLIDE 5

Error causes

Keyboard mistypings

Space bar issues

◮ run-on errors = two separate words become one
  ◮ e.g., the fuzz becomes thefuzz
◮ split errors = one word becomes two separate items
  ◮ e.g., equalization becomes equali zation
◮ Note that the resulting items might still be words!
  ◮ e.g., a tollway becomes atoll way

SLIDE 6

Error causes

Keyboard mistypings (cont.)

Keyboard proximity

◮ e.g., Jack becomes Hack since h and j are next to each other on a typical American keyboard

Physical similarity

◮ similarity of shape, e.g., mistaking two physically similar letters when typing up something handwritten
  ◮ e.g., tight for fight

SLIDE 7

Error causes

Phonetic errors

phonetic errors = errors based on the sounds of a language (not necessarily on the letters)

◮ homophones = two words which sound the same
  ◮ e.g., red/read (past tense), cite/site/sight, they’re/their/there
◮ letter/word substitution: replacing a letter (or sequence of letters) with a similar-sounding one
  ◮ e.g., John kracked his nuckles. instead of John cracked his knuckles.

SLIDE 8

Error causes

Knowledge problems

◮ not knowing a word and guessing its spelling (can be phonetic)
  ◮ e.g., sientist
◮ not knowing a rule and guessing it
  ◮ e.g., Do we double a consonant for ing words? jog → joging, joke → jokking
◮ knowing something is odd about the spelling, but guessing the wrong thing
  ◮ e.g., typing siscors for the non-regular scissors

SLIDE 9

Challenges & Techniques for spelling correction

Before we turn to how we detect spelling errors, we’ll look briefly at three issues:

◮ Tokenization: What is a word?
◮ Inflection: How are some words related?
◮ Productivity of language: How many words are there?

How we handle these issues determines how we build a dictionary. And then we’ll turn to the techniques used:

◮ Non-word error detection
◮ Isolated-word error correction
◮ Context-dependent word error detection and correction → grammar correction

SLIDE 10

Tokenization

Intuitively a “word” is simply whatever is between two spaces, but this is not always so clear.

◮ contractions = two words combined into one
  ◮ e.g., can’t, he’s, John’s [car] (vs. his car)
◮ multi-token words = (arguably) a single word with a space in it
  ◮ e.g., New York, in spite of, deja vu
◮ hyphens (note: can be ambiguous if a hyphen ends a line)
  ◮ Some are always a single word: e-mail, co-operate
  ◮ Others are two words combined into one: Columbus-based, sound-change
◮ Abbreviations: may stand for multiple words
  ◮ e.g., etc. = et cetera, ATM = Automated Teller Machine
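
The cases above can be made concrete with a small, made-up example (not from the text): naive whitespace tokenization both over- and under-segments.

```python
# A minimal sketch: splitting on spaces is only a rough first guess at tokenization.
text = "John can't find the ATM in New York, in spite of the map."
tokens = text.split()
print(tokens)
# ['John', "can't", 'find', 'the', 'ATM', 'in', 'New', 'York,', 'in', 'spite', 'of', 'the', 'map.']
# Problems, matching the cases above:
#  - the contraction "can't" stays one token (arguably two words),
#  - "York," and "map." keep their punctuation,
#  - multi-token words like "New York" and "in spite of" are split apart.
```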

SLIDE 11

Inflection

◮ A word in English may appear in various guises due to word inflections = word endings which are fairly systematic for a given part of speech
  ◮ plural noun ending: the boy + s → the boys
  ◮ past tense verb ending: walk + ed → walked
◮ This can make spell-checking hard:
  ◮ There are exceptions to the rules: *mans, *runned
  ◮ There are words which look like they have a given ending, but they don’t: Hans, deed

SLIDE 12

Productivity

◮ part of speech change: nouns can be verbified
  ◮ emailed is a common new verb coined after the noun email
◮ morphological productivity: prefixes and suffixes can be added
  ◮ e.g., I can speak of un-email-able for someone who you can’t reach by email.
◮ words entering and exiting the lexicon, e.g.:
  ◮ thou, or spleet ’split’ (Hamlet III.2.10) are on their way out
  ◮ New words all the time: omnishambles, phablet, supersize, ...

SLIDE 13

Non-word error detection

And now the techniques ...

◮ non-word error detection is essentially the same thing as word recognition = splitting up “words” into true words and non-words.
◮ How is non-word error detection done?
  ◮ using a dictionary (construction and lookup)
  ◮ n-gram analysis

SLIDE 14

Dictionaries

Intuition:

◮ Have a complete list of words and check the input words against this list.
◮ If it’s not in the dictionary, it’s not a word.

Two aspects:

◮ Dictionary construction = build the dictionary (what do you put in it?)
◮ Dictionary lookup = look up a potential word in the dictionary (how do you do this quickly? see the sketch below)
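
A minimal sketch of dictionary-based detection, assuming a hypothetical word-list file (words.txt, one word per line); it is an illustration, not the text's own implementation.

```python
# Load a word list into a set (fast membership tests), then flag unknown tokens.
def load_dictionary(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def non_word_errors(tokens, dictionary):
    return [t for t in tokens if t.lower() not in dictionary]

# dictionary = load_dictionary("words.txt")                  # hypothetical file
# print(non_word_errors("the fuzz ater midnight".split(), dictionary))
# -> ['ater'] if "ater" is not in the word list
```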

SLIDE 15

Dictionary construction

◮ Do we include inflected words, i.e., words with prefixes and suffixes already attached?
  ◮ Lookup can be faster.
  ◮ But it takes more space & doesn’t account for new formations (e.g., google → googled)
◮ We want the dictionary to have only the words relevant for the user → domain-specificity
  ◮ e.g., For most people memoize is a misspelled word, but in computer science this is a technical term.
◮ Foreign words, hyphenations, derived words, proper nouns, and new words will always be problems:
  ◮ we cannot predict these words until humans have made them words.
◮ The dictionary should be dialectally consistent.
  ◮ e.g., include only color or colour but not both

SLIDE 16

N-gram analysis

◮ An n-gram here is a string of n letters:

  a     1-gram (unigram)
  at    2-gram (bigram)
  ate   3-gram (trigram)
  late  4-gram
  ...

◮ We can use this n-gram information to define what the possible strings in a language are.
  ◮ e.g., po is a possible English string, whereas kvt is not.

This is more useful to correct optical character recognition (OCR) output, but we’ll still take a look.

SLIDE 17

Bigram array

◮ We can define a bigram array = information stored in a tabular fashion.
◮ An example, for the letters k, l, m, with examples in parentheses:

           k          l           m
   k       0          1 (tackle)  1 (Hackman)
   l       1 (elk)    1 (hello)   1 (alms)
   m       0          0           1 (hammer)

◮ The first letter of the bigram is given by the vertical letters (i.e., down the side), the second by the horizontal ones (i.e., across the top).
◮ This is a non-positional bigram array = the array’s 1’s and 0’s apply for a string found anywhere within a word (beginning, 4th character, ending, etc.).
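
A small sketch of how such a non-positional bigram array could be built from a word list; the word list below is just the examples from the table, and the code is an illustration rather than the text's own.

```python
from collections import defaultdict

def build_bigram_array(words):
    """Map (first letter, second letter) -> 1 if the bigram occurs in any word."""
    array = defaultdict(int)                  # missing entries behave like 0
    for word in words:
        for a, b in zip(word, word[1:]):      # all adjacent letter pairs
            array[(a, b)] = 1
    return array

dictionary = ["tackle", "hackman", "elk", "hello", "alms", "hammer"]
bigrams = build_bigram_array(dictionary)
print(bigrams[("k", "l")])   # 1 (seen in "tackle")
print(bigrams[("k", "k")])   # 0 (never seen, so "kk" suggests a non-word)
```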

SLIDE 18

Positional bigram array

◮ To store information specific to the beginning, the end, or some other position in a word, we can use a positional bigram array = the array only applies for a given position in a word.
◮ Here’s the same array as before, but now only applied to word endings:

           k         l          m
   k       0         0          0
   l       1 (elk)   1 (hall)   1 (elm)
   m       0         0          0

SLIDE 19

Isolated-word error correction

◮ Having discussed how errors can be detected, we want to know how to correct these misspelled words:
  ◮ The most common method is isolated-word error correction = correcting words without taking context into account.
  ◮ Note: This technique can only handle errors that result in non-words.
  ◮ Knowledge about what is a typical error helps in finding the correct word.

SLIDE 20

Knowledge about typical errors

◮ word length effects: most misspellings are within two characters in length of the original
  → When searching for the correct spelling, we do not usually need to look at words with greater length differences.
◮ first-position error effects: the first letter of a word is rarely erroneous
  → When searching for the correct spelling, the process is sped up by being able to look only at words with the same first letter.

SLIDE 21

Isolated-word error correction methods

◮ Many different methods are used; we will briefly look at four methods:
  ◮ rule-based methods
  ◮ similarity key techniques
  ◮ probabilistic methods
  ◮ minimum edit distance
◮ The methods play a role in one of the three basic steps:
  1. Detection of an error (discussed above)
  2. Generation of candidate corrections
     ◮ rule-based methods
     ◮ similarity key techniques
  3. Ranking of candidate corrections
     ◮ probabilistic methods
     ◮ minimum edit distance

SLIDE 22

Rule-based methods

One can generate correct spellings by writing rules:

◮ Common misspelling rewritten as correct word:
  ◮ e.g., hte → the
◮ Rules based on inflections:
  ◮ e.g., VCing → VCCing, where
    V = letter representing a vowel, basically the regular expression [aeiou]
    C = letter representing a consonant, basically [bcdfghjklmnpqrstvwxyz]
◮ Rules based on other common spelling errors (such as keyboard effects or common transpositions):
  ◮ e.g., CsC → CaC
  ◮ e.g., cie → cei
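
A minimal sketch of such rewrite rules as regular-expression substitutions (the two rules below are simplified examples, not a complete rule set from the text).

```python
import re

# Each rule: (pattern, replacement). Applying a rule proposes a candidate correction.
RULES = [
    (r"^hte$", "the"),      # common misspelling rewritten as a correct word
    (r"cie", "cei"),        # common transposition, e.g., recieve -> receive
]

def candidates_from_rules(word):
    corrections = set()
    for pattern, replacement in RULES:
        candidate = re.sub(pattern, replacement, word)
        if candidate != word:
            corrections.add(candidate)
    return corrections

print(candidates_from_rules("recieve"))   # {'receive'}
print(candidates_from_rules("hte"))       # {'the'}
```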

SLIDE 23

Similarity key techniques (SOUNDEX)

◮ Problem: How can we find a list of possible corrections?
◮ Solution: Store words in different boxes in a way that puts similar words together.
◮ Example:
  1. Start by storing words by their first letter (first-letter effect),
     ◮ e.g., punc starts with the code P.
  2. Then assign numbers to each letter,
     ◮ e.g., 0 for vowels, 1 for b, p, f, v (all bilabials), and so forth, e.g., punc → P052
  3. Then throw out all zeros and repeated letters,
     ◮ e.g., P052 → P52.
  4. Look for real words within the same box,
     ◮ e.g., punk is also in the P52 box (sketched below).
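
A minimal sketch of these four steps, assuming a simplified letter-to-digit coding: the groups beyond "0 for vowels, 1 for b/p/f/v" are assumptions in the general SOUNDEX spirit, not taken from the text.

```python
# Assumed digit groups (only the vowel and bilabial groups come from the slide).
CODES = {}
for letters, digit in [("aeiouhwy", "0"), ("bpfv", "1"), ("cgjkqsxz", "2"),
                       ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def similarity_key(word):
    word = word.lower()
    key = word[0].upper()                       # step 1: keep the first letter
    previous = ""
    for ch in word[1:]:
        digit = CODES.get(ch, "0")              # step 2: assign digits
        if digit != "0" and digit != previous:  # step 3: drop zeros and repeats
            key += digit
        previous = digit
    return key

print(similarity_key("punc"))   # P52
print(similarity_key("punk"))   # P52 -> same box, so punk is offered as a correction
```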

SLIDE 24

How is a mistyped word related to the intended?

For ranking errors, it helps to know:

Types of operations

◮ insertion = a letter is added to a word
◮ deletion = a letter is deleted from a word
◮ substitution = a letter is put in place of another one
◮ transposition = two adjacent letters are switched

Note that the first two alter the length of the word, whereas the second two maintain the same length.
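
A minimal sketch (a hypothetical helper, not from the text) that generates every string one such operation away from a given word; this is the usual way the "generation of candidate corrections" step is implemented.

```python
import string

def edits1(word):
    """All strings reachable from `word` by one insertion, deletion,
    substitution, or transposition of a lowercase letter."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions      = {L + R[1:]               for L, R in splits if R}
    transpositions = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    substitutions  = {L + c + R[1:]           for L, R in splits if R for c in letters}
    insertions     = {L + c + R               for L, R in splits for c in letters}
    return deletions | transpositions | substitutions | insertions

print("ater" in edits1("water"))     # True: one deletion away
print("donadl" in edits1("donald"))  # True: one transposition away
```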

SLIDE 25

Probabilistic methods

Two main probabilities are taken into account:

◮ transition probabilities = the probability (chance) of going from one letter to the next.
  ◮ e.g., What is the chance that a will follow p in English? That u will follow q?
◮ confusion probabilities = the probability of one letter being mistaken (substituted) for another (can be derived from a confusion matrix)
  ◮ e.g., What is the chance that q is confused with p?
◮ We also calculate probabilities for insertions & deletions after particular letters.

SLIDE 26

Confusion probabilities

◮ It is impossible to fully investigate all possible error causes and how they interact, but we can learn from watching how often people make errors and where.
◮ One way is to build a confusion matrix = a table indicating how often one letter is mistyped for another:

                correct
             r     s     t
  typed  r   n/a   12    22
         s   14    n/a   15
         t   11    37    n/a

(cf. Kernighan et al. 1999)

SLIDE 27

The Noisy Channel Model

Probabilities can be modeled with the noisy channel model:

  Hypothesized Language: X
  Noisy Channel: X → Y
  Actual Language: Y
  Goal: Recover X from Y

◮ The noisy channel model has been very popular in speech recognition, among other fields.

(Thanks to Mike White for the slides on the Noisy Channel Model)

SLIDE 28

Noisy Channel Spelling Correction

  Correct Spelling: X
  Typos, Mistakes: X → Y
  Misspelling: Y
  Goal: Recover the correct spelling X from the misspelling Y

◮ Noisy word: Y = the observation (incorrect spelling)
◮ We want to find the word X which maximizes P(X|Y), i.e., the probability of X, given that Y has been seen.

SLIDE 29

Example

  Correct Spelling: donald
  Transposition: ld → dl
  Misspelling: donadl
  Goal: Recover the correct spelling donald from the misspelling donadl (i.e., P(donald|donadl))

SLIDE 30

Conditional probability

p(x|y) is the probability of x given y

◮ Let’s say that yogurt appears 20 times in a text of 10,000 words
  → p(yogurt) = 20/10,000 = 0.002
◮ Now, let’s say frozen appears 50 times in the text, and yogurt appears 10 times after it
  → p(yogurt|frozen) = 10/50 = 0.20

SLIDE 31

Bayes Rule

With X as the correct word and Y as the misspelling ... P(X|Y) is impossible to calculate directly, so we use:

◮ P(Y|X) = the probability of the observed misspelling given the correct word
◮ P(X) = the probability of the (correct) word occurring anywhere in the text

Bayes Rule allows us to calculate P(X|Y) in terms of P(Y|X):

(1) Bayes Rule: P(X|Y) = P(Y|X) P(X) / P(Y)

SLIDE 32

The Noisy Channel and Bayes Rule

We can directly relate Bayes Rule to the Noisy Channel:

  Pr(X|Y) = Pr(Y|X) Pr(X) / Pr(Y)
  (Posterior = Noisy Channel × Prior / Normalization)

Goal: for a given y, find

  x = arg max_x Pr(y|x) Pr(x)
  (Noisy Channel × Prior)

The denominator is ignored because it’s the same for all possible corrections, i.e., the observed word (y) doesn’t change.

SLIDE 33

Finding the Correct Spelling

Goal: for a given misspelling y, find the correct spelling

  x = arg max_x Pr(y|x) Pr(x)
  (Error Model × Language Model)

1. List “all” possible candidate corrections, i.e., all words with one insertion, deletion, substitution, or transposition
2. Rank them by their probabilities

Example: for donald, calculate Pr(donadl|donald) Pr(donald) and see if this value is higher than for any other possible correction.
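
A toy sketch of step 2 with made-up numbers: P(x) comes from corpus counts (the language model) and P(y|x) from an error model; here a single constant stands in for the confusion-matrix estimate for any one-operation misspelling.

```python
corpus_counts = {"donald": 120, "donut": 40, "dona": 3}     # hypothetical counts
total = sum(corpus_counts.values())

def p_word(x):                 # language model P(x)
    return corpus_counts.get(x, 0) / total

def p_misspelling(y, x):       # error model P(y|x): crude placeholder value
    return 0.001               # same for every candidate one operation away

misspelling = "donadl"
candidates = ["donald", "donut"]          # assume these came out of step 1
best = max(candidates, key=lambda x: p_misspelling(misspelling, x) * p_word(x))
print(best)                               # -> donald
```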

SLIDE 34

Obtaining probabilities

How do we get these probabilities? We can count up the number of occurrences of X to get P(X), but where do we get P(Y|X)?

◮ We can use confusion matrices, as we saw before: one matrix each for insertion, deletion, substitution, and transposition.
◮ These matrices are calculated by counting how often, e.g., ab was typed instead of a in the case of insertion.

To get P(Y|X), then, we find the probability of this kind of typo in this context. For insertion, for example (X_p is the p-th character of X):

(2) P(Y|X) = ins[X_{p-1}, Y_p] / count[X_{p-1}]

SLIDE 35

Minimum edit distance

◮ In order to rank possible spelling corrections, it can be useful to calculate the minimum edit distance = the minimum number of operations it would take to convert one word into another.
◮ For example, we can take the following five steps to convert junk to haiku:
  1. junk → juk (deletion)
  2. juk → huk (substitution)
  3. huk → hku (transposition)
  4. hku → hiku (insertion)
  5. hiku → haiku (insertion)
◮ But is this the minimal number of steps needed?

SLIDE 36

Computing edit distances

Figuring out the upper bound

◮ To be able to compute the edit distance of two words at all, we need to ensure there is a finite number of steps.
◮ This can be accomplished by
  ◮ requiring that letters cannot be changed back and forth a potentially infinite number of times, i.e., we
  ◮ limit the number of changes to the size of the material we are presented with, the two words.
◮ Idea: Never deal with a character in either word more than once.
◮ Result:
  ◮ We could delete each character in the first word and then insert each character of the second word.
  ◮ Thus, we will never have a distance greater than length(word1) + length(word2).

SLIDE 37

Computing edit distances

Using a graph to map out the options

◮ To calculate minimum edit distance, we set up a directed, acyclic graph: a set of nodes (circles) and arcs (arrows).
◮ Horizontal arcs correspond to deletions, vertical arcs correspond to insertions, and diagonal arcs correspond to substitutions (a letter can be “substituted” for itself).

[Figure: three arcs labeled “Delete x”, “Substitute y for x”, and “Insert y”]

Discussion here based on Roger Mitton’s book English Spelling and the Computer.

SLIDE 38

Computing edit distances

An example graph

◮ Say the user types in fyre.
◮ We want to calculate how far away fry is (one of the possible corrections). In other words, we want to calculate the minimum edit distance (or minimum edit cost) from fyre to fry.
◮ As the first step, we draw the following directed graph:

[Figure: a grid-shaped graph with the letters of fyre along one dimension and the letters of fry along the other]

SLIDE 39

Computing edit distances

Adding numbers to the example graph

◮ The graph is acyclic = for any given node, it is impossible to return to that node by following the arcs.
◮ We can add identifiers to the states, which allows us to define a topological order:

[Figure: the same grid graph with its 20 nodes labeled A through T, f y r e along one dimension and f r y along the other]

SLIDE 40

Computing edit distances

Adding costs to the arcs of the example graph

◮ We need to add the costs involved to the arcs.
◮ In the simplest case, the cost of deletion, insertion, and substitution is 1 each (and substitution with the same character is free).

[Figure: the grid graph from before, with every deletion, insertion, and substitution arc labeled with cost 1]

◮ Instead of assuming the same cost for all operations, in reality one will use different costs, e.g., for the first character or based on the confusion probability.

SLIDE 41

Computing edit distances

How to compute the path with the least cost

We want to find the path from the start (A) to the end (T) with the least cost.

◮ The simple but dumb way of doing it:
  ◮ Follow every path from start (A) to finish (T) and see how many changes we have to make.
  ◮ But this is very inefficient! There are many different paths to check.

SLIDE 42

Computing edit distances

The smart way to compute the least cost

◮ The smart way to compute the least cost uses dynamic programming = a program designed to make use of results computed earlier.
◮ We follow the topological ordering.
◮ As we go in order, we calculate the least cost for that node:
  ◮ We add the cost of an arc to the cost of reaching the node this arc originates from.
  ◮ We take the minimum of the costs calculated for all arcs pointing to a node and store it for that node.
◮ The key point is that we are storing partial results along the way, instead of recalculating everything every time we compute a new path (see the sketch below).
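
A minimal sketch of the same dynamic-programming idea in table form (equivalent to filling in the least cost node by node): deletion, insertion, and substitution each cost 1, matching characters cost 0. Transpositions, as in the junk/haiku example, would need the Damerau extension and are not handled here.

```python
def min_edit_distance(source, target):
    n, m = len(source), len(target)
    # cost[i][j] = least cost of turning source[:i] into target[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i                                  # delete everything
    for j in range(1, m + 1):
        cost[0][j] = j                                  # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            cost[i][j] = min(cost[i - 1][j] + 1,                       # deletion
                             cost[i][j - 1] + 1,                       # insertion
                             cost[i - 1][j - 1] + (0 if same else 1))  # substitution/match
    return cost[n][m]

print(min_edit_distance("fyre", "fry"))   # 2
```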

SLIDE 43

Spelling correction for web queries

A nice little side topic ...

Spelling correction for web queries is hard because it must handle:

◮ Proper names, new terms, etc. (blog, shrek, nsync)
◮ Frequent and severe spelling errors
◮ Very short contexts

SLIDE 44

Algorithm

Main Idea (Cucerzan and Brill (EMNLP-04))

◮ Iteratively transform the query into more likely queries
◮ Use query logs to determine likelihood
  ◮ Despite the fact that many of these are misspelled!
  ◮ Assumptions: the less wrong a misspelling is, the more frequent it is; and correct > incorrect

Example:
  anol scwartegger
  → arnold schwartnegger
  → arnold schwarznegger
  → arnold schwarzenegger

SLIDE 45

Algorithm (2)

◮ Compute the set of all close alternatives for each word in the query
  ◮ Look at word unigrams and bigrams from the logs; this handles concatenation and splitting of words
  ◮ Use weighted edit distance to determine closeness
◮ Search the sequence of alternatives for the best alternative string, using a noisy channel model

Constraint:

◮ No two adjacent in-vocabulary words can change simultaneously

SLIDE 46

The formal algorithm

(just for fun)

Given a string s_0, find a sequence s_1, s_2, ..., s_n such that:

◮ s_n = s_{n-1} (stopping criterion)
◮ ∀i ∈ 0 ... n−1:
  ◮ dist(s_i, s_{i+1}) ≤ δ (only a minimal change)
  ◮ P(s_{i+1}|s_i) = max_t P(t|s_i) (the best change)
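
A toy sketch of this loop with entirely made-up query-log data; NEIGHBORS stands in for "all strings within a small weighted edit distance", which a real system would compute rather than list.

```python
QUERY_LOG_FREQ = {                                   # hypothetical frequencies
    "anol scwartegger": 5,
    "arnold schwartnegger": 120,
    "arnold schwarznegger": 900,
    "arnold schwarzenegger": 15000,
}

NEIGHBORS = {                                        # hypothetical "close" queries
    "anol scwartegger": ["arnold schwartnegger"],
    "arnold schwartnegger": ["arnold schwarznegger"],
    "arnold schwarznegger": ["arnold schwarzenegger"],
    "arnold schwarzenegger": [],
}

def correct(query):
    while True:
        candidates = NEIGHBORS.get(query, []) + [query]
        best = max(candidates, key=lambda q: QUERY_LOG_FREQ.get(q, 0))
        if best == query:          # stopping criterion: s_n = s_{n-1}
            return query
        query = best               # otherwise take the best minimal change

print(correct("anol scwartegger"))   # -> arnold schwarzenegger (in three steps)
```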

SLIDE 47

Examples

Context Sensitivity

◮ power crd → power cord
◮ video crd → video card
◮ platnuin rings → platinum rings

Known Words

◮ golf war → gulf war
◮ sap opera → soap opera

SLIDE 48

Examples (2)

Tokenization

◮ chat inspanich → chat in spanish
◮ ditroitigers → detroit tigers
◮ britenetspear inconcert → britney spears in concert

Constraints

◮ log wood → log wood (not dog food)

SLIDE 49

Context-dependent word correction

Context-dependent word correction = correcting words based on the surrounding context.

◮ This will handle errors which are real words, just not the right one or not in the right form.
◮ This is very similar to a grammar checker = a mechanism which tells a user if their grammar is wrong.

SLIDE 50

Grammar correction—what does it correct?

◮ Syntactic errors = errors in how words are put together in a sentence: the order or form of words is incorrect, i.e., ungrammatical.
◮ Local syntactic errors: 1-2 words away
  ◮ e.g., The study was conducted mainly be John Black.
  ◮ A verb is where a preposition should be.
◮ Long-distance syntactic errors: (roughly) 3 or more words away
  ◮ e.g., The kids who are most upset by the little totem is going home early.
  ◮ Agreement error between the subject kids and the verb is

SLIDE 51

More on grammar correction

◮ Semantic errors = errors where the sentence structure sounds okay, but it doesn’t really mean anything.
  ◮ e.g., They are leaving in about fifteen minuets to go to her house.
  ⇒ minuets and minutes are both plural nouns, but only one makes sense here

There are many different ways in which grammar correctors work, two of which we’ll focus on:

◮ N-gram model
◮ Rule-based model

SLIDE 52

N-gram grammar correctors

We can look at bigrams of words, i.e., two words appearing next to each other.

◮ Question: Given the previous word, what is the probability of the current word?
  ◮ e.g., given these, we have a lower chance of seeing report than of seeing reports
◮ Since a confusable word (reports) can be put in the same context, resulting in a higher probability, we flag report as a potential error
◮ But there’s a major problem: we may hardly ever see these reports, so we won’t know its probability.
◮ Possible Solutions:
  ◮ use bigrams of parts of speech
  ◮ use massive amounts of data and only flag errors when you have enough data to back it up
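
A toy sketch of the bigram idea with made-up counts: if a confusable alternative is much more likely than the written word after the same previous word, the written word is flagged.

```python
BIGRAM_COUNTS = {("these", "reports"): 45, ("these", "report"): 1}   # hypothetical
CONFUSABLES = {"report": "reports", "reports": "report"}

def flag(previous, word, factor=10):
    alternative = CONFUSABLES.get(word)
    if alternative is None:
        return None
    seen = BIGRAM_COUNTS.get((previous, word), 0)
    seen_alt = BIGRAM_COUNTS.get((previous, alternative), 0)
    if seen_alt > factor * max(seen, 1):          # only flag with enough evidence
        return f"'{previous} {word}': did you mean '{previous} {alternative}'?"
    return None

print(flag("these", "report"))    # flags it: 'these reports' is far more frequent
```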

SLIDE 53

Rule-based grammar correctors

We can write regular expressions to target specific error patterns. For example:

◮ To a certain extend, we have achieved our goal.
  ◮ Match the pattern some or certain followed by extend, which can be done using the regular expression some|certain extend
  ◮ Change the occurrence of extend in the pattern to extent (see the sketch below).

See, e.g., http://www.languagetool.org/
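
A minimal sketch of that one rule as a regular-expression substitution (in the spirit of LanguageTool's pattern rules, not its actual rule syntax):

```python
import re

# Match "some extend" or "certain extend", keep the first word, fix the second.
pattern = re.compile(r"\b(some|certain)\s+extend\b", re.IGNORECASE)

def correct_extend(sentence):
    return pattern.sub(lambda m: m.group(1) + " extent", sentence)

print(correct_extend("To a certain extend, we have achieved our goal."))
# -> To a certain extent, we have achieved our goal.
```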

SLIDE 54

Beyond regular expressions

◮ But what about correcting the following:
  ◮ A baseball teams were successful.
◮ We should see that A is incorrect, but a simple regular expression doesn’t work because we don’t know where the word teams might show up.
  ◮ A wildly overpaid, horrendous baseball teams were successful. (Five words later; change needed.)
  ◮ A player on both my teams was successful. (Five words later; no change needed.)
◮ We need to look at how the sentence is constructed in order to build a better rule.

SLIDE 55

Syntax

◮ Syntax = the study of the way that sentences are constructed from smaller units.
◮ There cannot be a “dictionary” for sentences, since there is an infinite number of possible sentences:
  (3) The house is large.
  (4) John believes that the house is large.
  (5) Mary says that John believes that the house is large.

There are two basic principles of sentence organization:

◮ Linear order
◮ Hierarchical structure (Constituency)

SLIDE 56

Linear order

◮ Linear order = the order of words in a sentence.
◮ A sentence can have different meanings, based on its linear order:
  (6) John loves Mary.
  (7) Mary loves John.
◮ Languages vary as to what extent this is true, but linear order in general is used as a guiding principle for organizing words into meaningful sentences.
◮ Simple linear order as such is not sufficient to determine sentence organization, though. For example, we can’t simply say “The verb is the second word in the sentence.”
  (8) I eat at really fancy restaurants.
  (9) Many executives eat at really fancy restaurants.

SLIDE 57

Constituency

◮ What are the “meaningful units” of a sentence like Most of the ducks play extremely fun games?
  ◮ Most of the ducks
  ◮ of the ducks
  ◮ extremely fun
  ◮ extremely fun games
  ◮ play extremely fun games
◮ We refer to these meaningful groupings as constituents of a sentence.

SLIDE 58

Hierarchical structure

◮ Constituents can appear within other constituents
◮ Constituents shown through brackets:
  [[Most [of [the ducks]]] [play [[extremely fun] games]]]
◮ Constituents displayed as a syntactic tree:

[Figure: the same bracketing drawn as a tree, with internal nodes a–g dominating Most, of, the ducks, play, extremely fun, and games]

SLIDE 59

Categories

◮ We would also like some way to say that
  ◮ the ducks, and
  ◮ extremely fun games
  are the same type of grouping, or constituent, whereas
  ◮ of the ducks
  seems to be something else.
◮ For this, we will talk about different categories:
  ◮ Lexical
  ◮ Phrasal

SLIDE 60

Lexical categories

Lexical categories are simply word classes, or what you may have heard called parts of speech. The main ones are:

◮ verbs: eat, drink, sleep, ...
◮ nouns: gas, food, lodging, ...
◮ adjectives: quick, happy, brown, ...
◮ adverbs: quickly, happily, well, westward, ...
◮ prepositions: on, in, at, to, into, of, ...
◮ determiners/articles: a, an, the, this, these, some, much, ...

SLIDE 61

Determining lexical categories

How do we determine which category a word belongs to?

◮ Distribution: Where can these kinds of words appear in a sentence?
  ◮ e.g., Nouns like mouse can appear after articles (“determiners”) like some, while a verb like eat cannot.
◮ Morphology: What kinds of prefixes/suffixes can a word take?
  ◮ e.g., Verbs like walk can take an -ed ending to mark them as past tense. A noun like mouse cannot.

SLIDE 62

Phrasal categories

What about phrasal categories?

◮ What other phrases can we put in place of The joggers in a sentence such as the following?
  ◮ The joggers ran through the park.
◮ Some options:
  ◮ Susan
  ◮ students
  ◮ you
  ◮ most dogs
  ◮ some children
  ◮ a huge, lovable bear
  ◮ my friends from Brazil
  ◮ the people that we interviewed
◮ Since all of these contain nouns, we consider these to be noun phrases, abbreviated NP.

slide-63
SLIDE 63


Building a tree

Other phrases work similarly (S = sentence, VP = verb phrase, PP = prepositional phrase, AdjP = adjective phrase):

Tree for Most of the ducks play extremely fun games, shown here as a labelled bracketing:

[S [NP [Pro Most] [PP [P of] [NP [D the] [N ducks]]]] [VP [V play] [NP [AdjP [Adv extremely] [Adj fun]] [N games]]]]
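For a computer, such a tree is just nested structure. Below is one possible encoding of the tree above as nested Python tuples; the encoding and the helper function are illustrative choices, not something fixed by the slides.

    # One possible encoding (illustrative) of the tree above as nested tuples:
    # each node is (category, child, child, ...); leaves are plain word strings.
    tree = ("S",
            ("NP",
             ("Pro", "Most"),
             ("PP",
              ("P", "of"),
              ("NP", ("D", "the"), ("N", "ducks")))),
            ("VP",
             ("V", "play"),
             ("NP",
              ("AdjP", ("Adv", "extremely"), ("Adj", "fun")),
              ("N", "games"))))

    def leaves(node):
        """Read the words back off the tree, left to right."""
        if isinstance(node, str):
            return [node]
        words = []
        for child in node[1:]:
            words.extend(leaves(child))
        return words

    print(" ".join(leaves(tree)))
    # Most of the ducks play extremely fun games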

slide-64
SLIDE 64


Phrase Structure Rules

◮ We can give rules for building these phrases. That is, we want a way to say that a determiner and a noun make up a noun phrase, but a verb and an adverb do not.
◮ Phrase structure rules are a way to build larger constituents from smaller ones.
◮ e.g., S → NP VP
  This says:
  ◮ A sentence (S) constituent is composed of a noun phrase (NP) constituent and a verb phrase (VP) constituent. [hierarchy]
  ◮ The NP must precede the VP. [linear order]

slide-65
SLIDE 65


Some other possible English rules

◮ NP → Det N (the cat, a house, this computer)
◮ NP → Det AdjP N (the happy cat, a really happy house)
◮ For phrase structure rules, parentheses are used as a shorthand to express that a category is optional.
◮ We can thus compactly express the two rules above as one rule:
  ◮ NP → Det (AdjP) N
  ◮ Note that this use of parentheses is different from, and has nothing to do with, their use in regular expressions.
◮ AdjP → (Adv) Adj (really happy)
◮ VP → V (laugh, run, eat)
◮ VP → V NP (love John, hit the wall, eat cake)
◮ VP → V NP NP (give John the ball)
◮ PP → P NP (to the store, at John, in a New York minute)
◮ NP → NP PP (the cat on the stairs)
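To see exactly what the parenthesis shorthand abbreviates, we can expand it mechanically: a rule with one optional category stands for two plain rules, a rule with two optional categories for four, and so on. The small sketch below spells this out; the rule representation is invented here purely for illustration.

    # Toy illustration: expanding the "(X)" optional-category shorthand into
    # plain phrase structure rules. The rule representation is invented here.
    from itertools import product

    def expand(lhs, rhs):
        """rhs is a list of category names; "(AdjP)" marks an optional category."""
        choices = []
        for cat in rhs:
            if cat.startswith("(") and cat.endswith(")"):
                choices.append(["", cat[1:-1]])   # either leave it out or keep it
            else:
                choices.append([cat])             # obligatory category
        return [(lhs, [c for c in combo if c]) for combo in product(*choices)]

    for lhs, plain_rhs in expand("NP", ["Det", "(AdjP)", "N"]):
        print(lhs, "->", " ".join(plain_rhs))
    # NP -> Det N
    # NP -> Det AdjP N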

slide-66
SLIDE 66


Phrase Structure Rules and Trees

With every phrase structure rule, you can draw a tree for it.

Lexicon:
  Vt → saw
  Det → the
  Det → a
  N → dragon
  N → boy
  Adj → young

Syntactic rules:
  S → NP VP
  VP → Vt NP
  NP → Det N
  N → Adj N

Tree for the young boy saw a dragon:
  [S [NP [Det the] [N [Adj young] [N boy]]] [VP [Vt saw] [NP [Det a] [N dragon]]]]

slide-67
SLIDE 67


Properties of Phrase Structure Rules

◮ generative = a schematic strategy that describes a set of sentences completely.
◮ potentially (structurally) ambiguous = have more than one analysis

  (10) We need more intelligent leaders.
  (11) Paraphrases:
       a. We need leaders who are more intelligent.
       b. Intelligent leaders? We need more of them!

◮ recursive = property allowing for a rule to be reapplied (within its hierarchical structure).
  e.g., NP → NP PP
        PP → P NP
◮ The property of recursion means that the set of potential sentences in a language is infinite.
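A quick way to appreciate why recursion makes the set of sentences infinite: NP → NP PP and PP → P NP feed each other, so an NP can keep growing without bound. The snippet below just iterates that loop; the lexical choices are invented for illustration.

    # Toy illustration of recursion: NP -> NP PP and PP -> P NP feed each other,
    # so noun phrases of any length can be generated. Lexical choices invented.
    def grow_np(depth):
        np = "the cat"                      # a base NP
        for _ in range(depth):
            np += " on the stairs"          # apply NP -> NP PP, PP -> P NP once
        return np

    for d in range(4):
        print(grow_np(d))
    # the cat
    # the cat on the stairs
    # the cat on the stairs on the stairs
    # the cat on the stairs on the stairs on the stairs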

slide-68
SLIDE 68


Context-free grammars

A context-free grammar (CFG) is essentially a collection of phrase structure rules.

◮ It specifies that each rule must have:
  ◮ a left-hand side (LHS): a single non-terminal element = a (phrasal or lexical) category
  ◮ a right-hand side (RHS): a mixture of non-terminal elements and terminal elements (= actual words)
◮ A CFG tries to capture a natural language completely.

Why “context-free”? Because these rules make no reference to any context surrounding them; i.e., you cannot write a rule like “PP → P NP, but only when there is a verb phrase (VP) to the left.”
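Here is one possible in-memory representation of such a CFG, using the toy grammar from the tree-drawing slide: every rule is a pair of a single non-terminal LHS and a sequence of RHS symbols, where each RHS symbol is either a category or an actual word. The variable names are mine, not part of any standard.

    # One possible encoding of the toy CFG: every rule is a pair of a single
    # non-terminal LHS and a tuple of RHS symbols (categories or actual words).
    RULES = [
        ("S",   ("NP", "VP")),
        ("VP",  ("Vt", "NP")),
        ("NP",  ("Det", "N")),
        ("N",   ("Adj", "N")),
        ("Vt",  ("saw",)),      # lexical rules: the RHS is a single word
        ("Det", ("the",)),
        ("Det", ("a",)),
        ("N",   ("dragon",)),
        ("N",   ("boy",)),
        ("Adj", ("young",)),
    ]

    non_terminals = {lhs for lhs, _ in RULES}
    terminals = {sym for _, rhs in RULES for sym in rhs if sym not in non_terminals}
    print("non-terminals:", sorted(non_terminals))
    print("terminals (words):", sorted(terminals))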

slide-69
SLIDE 69


Pushdown automata

Pushdown automaton = the computational implementation of a context-free grammar.

It uses a stack (its memory device) and has two operations:

◮ push = put an element onto the top of the stack.
◮ pop = take the topmost element off the stack.

This has the property of being Last In, First Out (LIFO). Consider a rule like PP → P NP:

◮ Push NP onto the stack
◮ Push P on top of it
◮ If you find a preposition (e.g., on), pop P off of the stack
◮ Now, the next thing you need is an NP ... when you find that, pop NP and push PP onto the stack
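The sketch below simulates this stack discipline for whole sentences: the top of the stack is either a category, which is popped and replaced by a rule's right-hand side, or a word, which must match the next input word. A real pushdown automaton handles the choice between competing rules nondeterministically; this toy version just tries the rules in order with backtracking, so treat it as an illustration of the stack idea rather than a faithful implementation.

    # Sketch of the stack discipline: top of the stack is either a category
    # (replace it with a rule's right-hand side) or a word (match the input).
    # Rule choice is handled by naive backtracking, purely for illustration.
    RULES = {
        "S":   [["NP", "VP"]],
        "NP":  [["Det", "N"]],
        "N":   [["Adj", "N"], ["boy"], ["dragon"]],
        "VP":  [["Vt", "NP"]],
        "Vt":  [["saw"]],
        "Det": [["the"], ["a"]],
        "Adj": [["young"]],
    }

    def accepts(stack, words):
        """stack is a list with its top at the end; words is the remaining input."""
        if not stack:
            return not words                        # success iff no input is left
        top = stack[-1]
        if top in RULES:                            # category: pop it, push an RHS
            return any(accepts(stack[:-1] + list(reversed(rhs)), words)
                       for rhs in RULES[top])
        # word on top of the stack: it must match the next input word
        return bool(words) and words[0] == top and accepts(stack[:-1], words[1:])

    print(accepts(["S"], "the young boy saw a dragon".split()))   # True
    print(accepts(["S"], "boy the saw dragon a".split()))         # False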

slide-70
SLIDE 70


Parsing

Using these context-free rules and something like a pushdown automaton, we can get a computer to parse a sentence = assign a structure to it. There are many, many parsing techniques out there.

◮ top-down: build a tree by starting at the top (i.e., with S and rules like S → NP VP) and working down toward the words.
◮ bottom-up: build a tree by starting with the words at the bottom and working up to the top.
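As one concrete top-down technique, a recursive-descent parser starts from S, expands categories using the grammar rules, and checks the words only when it reaches them, building the tree on the way down. The sketch below does this for the toy grammar and returns the tree as nested tuples; it is a simplified illustration (naive rule choice, no handling of left-recursive rules), not a general-purpose parser.

    # Sketch of a top-down (recursive-descent) parser for the toy grammar.
    # parse() returns (tree, next_position) on success, or None; trees are
    # nested tuples as on the earlier slides. Illustration only.
    RULES = {
        "S":   [["NP", "VP"]],
        "NP":  [["Det", "N"]],
        "N":   [["Adj", "N"], ["boy"], ["dragon"]],
        "VP":  [["Vt", "NP"]],
        "Vt":  [["saw"]],
        "Det": [["the"], ["a"]],
        "Adj": [["young"]],
    }

    def parse(cat, words, i=0):
        for rhs in RULES.get(cat, []):
            children, pos, ok = [], i, True
            for sym in rhs:
                if sym in RULES:                               # category: recurse
                    result = parse(sym, words, pos)
                    if result is None:
                        ok = False
                        break
                    subtree, pos = result
                    children.append(subtree)
                elif pos < len(words) and words[pos] == sym:   # word: consume it
                    children.append(sym)
                    pos += 1
                else:
                    ok = False
                    break
            if ok:
                return (cat,) + tuple(children), pos
        return None

    result = parse("S", "the young boy saw a dragon".split())
    print(result[0] if result and result[1] == 6 else "no full parse")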

slide-71
SLIDE 71


Trace of a top-down parse

(Subscripts give the order in which the parser builds the nodes, starting from S at the top.)

[S1 [NP2 [Det3 the4] [N5 [Adj6 young7] [N8 boy9]]] [VP10 [Vt11 saw12] [NP13 [Det14 a15] [N16 dragon17]]]]

slide-72
SLIDE 72


Trace of a bottom-up parse

(Subscripts give the order in which the parser builds the nodes, starting from the words at the bottom.)

[S17 [NP8 [Det2 the1] [N7 [Adj4 young3] [N6 boy5]]] [VP16 [Vt10 saw9] [NP15 [Det12 a11] [N14 dragon13]]]]

slide-73
SLIDE 73


Writing grammar correction rules

So, with context-free grammars, we can now write some correction rules, which we will just sketch here.

◮ A baseball teams were successful.
  ◮ A followed by a PLURAL NP: change A → The
◮ John at the pizza.
  ◮ The structure of this sentence is NP PP, but that doesn’t make up a whole sentence.
  ◮ We need a verb somewhere.
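A rule like the first one can be sketched in a few lines of code: look for the article A and check whether the noun phrase it introduces is headed by a plural noun. The mini-lexicon of plural nouns and the very naive search for the head noun below are invented purely for illustration; a real grammar checker would rely on a part-of-speech tagger and a parser instead.

    # Toy sketch of the rule "A followed by a plural NP: change A -> The".
    # The plural-noun mini-lexicon and naive head-noun search are invented;
    # a real checker would use a part-of-speech tagger and a parser instead.
    PLURAL_NOUNS = {"teams", "ducks", "games", "dogs", "children"}

    def correct_article(tokens):
        """Suggest 'The' wherever 'A' introduces an NP headed by a plural noun."""
        out = list(tokens)
        for i, tok in enumerate(tokens):
            if tok != "A":
                continue
            # naively take the first plural noun after 'A' as the NP's head noun
            head = next((w for w in tokens[i + 1:] if w.lower() in PLURAL_NOUNS), None)
            if head is not None:
                out[i] = "The"
        return out

    print(" ".join(correct_article("A baseball teams were successful .".split())))
    # The baseball teams were successful .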

slide-74
SLIDE 74


Dangers of spelling and grammar correction

◮ The more we depend on spelling correctors, the less we try to correct things on our own. But spell checkers are not 100% accurate.
◮ A study at the University of Pittsburgh found that students made more errors (in proofreading) when using a spell checker!

                  high SAT scores   low SAT scores
    use checker   16 errors         17 errors
    no checker    5 errors          12.3 errors

(cf. http://www.wired.com/news/business/0,1367,58058,00.html)

slide-75
SLIDE 75


A Poem on the Dangers of Spell Checkers

Michael Livingston

Eye halve a spelling chequer
It came with my pea sea.
It plainly marques four my revue
Miss steaks eye kin knot sea.

Eye strike a key and type a word
And weight four it two say
Weather eye am wrong oar write
It shows me strait a weigh.

As soon as a mist ache is maid
It nose bee fore two long
And eye can put the error rite
Its rare lea ever wrong.

Eye have run this poem threw it
I am shore your pleased two no
Its letter perfect awl the weigh
My chequer tolled me sew.

slide-76
SLIDE 76


References

◮ The discussion is based on Markus Dickinson (2006). Writer’s Aids. In Keith Brown (ed.): Encyclopedia of Language and Linguistics. Second Edition. Elsevier.
◮ A major inspiration for that article and our discussion is Karen Kukich (1992): Techniques for Automatically Correcting Words in Text. ACM Computing Surveys, pages 377–439; as well as Roger Mitton (1996), English Spelling and the Computer.
◮ For a discussion of the confusion matrix, cf. Mark D. Kernighan, Kenneth W. Church and William A. Gale (1990). A Spelling Correction Program Based on a Noisy Channel Model. In Proceedings of COLING-90, pp. 205–210.
◮ An open-source style/grammar checker is described in Daniel Naber (2003). A Rule-Based Style and Grammar Checker. Diploma Thesis, Universität Bielefeld. http://www.danielnaber.de/languagetool/