Probabilistic Spelling Correction - CE-324: Modern Information Retrieval - PowerPoint PPT Presentation



SLIDE 1

Probabilistic Spelling Correction

CE-324: Modern Information Retrieval

Sharif University of Technology

  • M. Soleymani

Fall 2016

Most slides have been adapted from: Profs. Manning, Nayak & Raghavan lectures (CS-276, Stanford)

SLIDE 2

Applications of spelling correction

SLIDE 3

Spelling Tasks

  • Spelling error detection
  • Spelling error correction:
    • Autocorrect
      • hte → the
    • Suggest a correction
    • Suggestion lists

SLIDE 4

Types of spelling errors

  • Non-word errors
    • graffe → giraffe
  • Real-word errors
    • Typographical errors
      • three → there
    • Cognitive errors (homophones)
      • piece → peace, too → two, your → you're
  • Real-word correction almost always needs to be context-sensitive

SLIDE 5

Spelling correction steps

  • For each word w, generate a candidate set:
    • Find candidate words with similar pronunciations
    • Find candidate words with similar spellings
  • Choose the best candidate
    • By "weighted edit distance" or the "noisy channel" approach
    • Context-sensitive: we have to consider whether the surrounding words "make sense"
      • "Flying form Heathrow to LAX" → "Flying from Heathrow to LAX"
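The spelling-based candidate-generation step above can be sketched in the style of Peter Norvig's well-known spell corrector (the function names here are illustrative, not part of the slides):

```python
# Candidate generation by "all edits at distance 1" - a minimal sketch
# following Peter Norvig's spell corrector. Names are illustrative.

import string

def edits1(word):
    """All strings one edit away from `word`: deletions, adjacent
    transpositions, substitutions, and insertions."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def candidates(word, vocabulary):
    """Known words within edit distance 1 (plus the word itself)."""
    return {w for w in edits1(word) | {word} if w in vocabulary}
```

In a full corrector these candidates would then be ranked by the noisy-channel score introduced on the following slides.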

SLIDE 6

Candidate Testing: Damerau-Levenshtein edit distance

  • Minimal edit distance between two strings, where edits are:
    • Insertion
    • Deletion
    • Substitution
    • Transposition of two adjacent letters

SLIDE 7

SLIDE 8

Noisy channel intuition

SLIDE 9

Noisy channel

  • We see an observation x of a misspelled word
  • Find the correct word w

SLIDE 10

Language Model

  • Take a big supply of words with T tokens:

    P(w) = C(w) / T

    C(w) = # occurrences of w

  • Supply of words:
    • your document collection
    • In other applications, you can take the supply to be typed queries (suitably filtered), when a static dictionary is inadequate
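The maximum-likelihood unigram estimate above, P(w) = C(w)/T, is a one-liner over token counts (a minimal sketch; the function name is illustrative):

```python
from collections import Counter

def unigram_model(tokens):
    """Maximum-likelihood unigram model: P(w) = C(w) / T, where C(w)
    is the count of w and T the total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}
```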

SLIDE 11

Unigram prior probability

  • Counts from 404,253,213 words in the Corpus of Contemporary American English (COCA)

SLIDE 12

Channel model probability

  • Error model probability, edit probability

    Misspelled word: x = x1, x2, x3, ..., xm
    Correct word: w = w1, w2, w3, ..., wn

  • P(x|w) = probability of the edit (deletion/insertion/substitution/transposition)

SLIDE 13

Calculating p(x|w)

  • Still a research question
  • Can be estimated
  • Some simple approaches, e.g., a confusion matrix:
    • A 26×26 table that records how many times one letter was incorrectly used in place of another
    • Usually there are four confusion matrices: deletion, insertion, substitution, and transposition

SLIDE 14

Computing error probability: Confusion matrix

  • del[x,y]: count(xy typed as x)
  • ins[x,y]: count(x typed as xy)
  • sub[x,y]: count(y typed as x)
  • trans[x,y]: count(xy typed as yx)

  • Insertion and deletion are conditioned on the previous character
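Turning these counts into channel probabilities follows the standard estimates (as in Kernighan, Church & Gale 1990): divide each confusion count by how often the intended character (or character pair) occurred. The tiny count tables below are toy numbers for illustration only:

```python
# Estimating P(x|w) for a single edit from confusion-matrix counts.
# All counts here are toy values, not real corpus statistics.

sub = {("a", "e"): 93}        # sub[x, y]: y was typed as x
count_char = {"e": 9000}      # occurrences of each character
count_bigram = {"ct": 500}    # occurrences of each character bigram
del_ = {("c", "t"): 25}       # del[x, y]: "xy" was typed as "x"

def p_substitution(x, w):
    """P(x typed | w intended) for a single substitution: sub[x,w] / count(w)."""
    return sub[(x, w)] / count_char[w]

def p_deletion(prev, dropped):
    """P of dropping `dropped` after `prev`: del[prev,dropped] / count(prev+dropped)."""
    return del_[(prev, dropped)] / count_bigram[prev + dropped]
```

Insertion and transposition probabilities would be estimated analogously from the ins and trans tables.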

SLIDE 15

Confusion matrix for substitution

The cell [o,e] in a substitution confusion matrix would give the count of times that e was substituted for o.

SLIDE 16

Channel model

SLIDE 17

Smoothing probabilities: Add-1 smoothing

|A| = size of the character alphabet
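Add-1 smoothing applied to the substitution matrix, for instance, adds 1 to every count and |A| to every denominator, so unseen substitutions get a small non-zero probability (a minimal sketch; the function name is illustrative):

```python
def p_sub_add1(x, w, sub_counts, char_counts, alphabet_size=26):
    """Add-1 smoothed substitution probability:
    P(x|w) = (sub[x,w] + 1) / (count(w) + |A|)."""
    return (sub_counts.get((x, w), 0) + 1) / (char_counts[w] + alphabet_size)
```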

SLIDE 18

Channel model for acress

SLIDE 19

SLIDE 20

SLIDE 21

Noisy channel for real-word spell correction

  • Given a sentence w1, w2, w3, ..., wn
  • Generate a set of candidates for each word wi:
    • Candidate(w1) = {w1, w'1, w''1, w'''1, ...}
    • Candidate(w2) = {w2, w'2, w''2, w'''2, ...}
    • Candidate(wn) = {wn, w'n, w''n, w'''n, ...}
  • Choose the sequence W that maximizes P(W)

SLIDE 22

Incorporating context words: Context-sensitive spelling correction
  • Determining whether actress or across is appropriate will require looking at the context of use
  • A bigram language model conditions the probability of a word on (just) the previous word:

    P(w1 ... wn) = P(w1) P(w2|w1) ... P(wn|wn−1)
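In practice this chain-rule product is computed in log space to avoid underflow on long sentences; a minimal sketch (the language-model functions are passed in as assumptions):

```python
import math

def sentence_logprob(words, p_unigram, p_bigram):
    """log P(w1 ... wn) = log P(w1) + sum_i log P(wi | w(i-1)),
    the bigram chain rule in log space."""
    logp = math.log(p_unigram(words[0]))
    for prev, cur in zip(words, words[1:]):
        logp += math.log(p_bigram(cur, prev))
    return logp
```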

SLIDE 23

Incorporating context words

  • For unigram counts, P(wi) is always non-zero
    • if our dictionary is derived from the document collection
  • This won't be true of P(wi|wi−1). We need to smooth
    • add-1 smoothing on this conditional distribution
    • Interpolate a unigram and a bigram:
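One common form of the interpolation just mentioned mixes the bigram and unigram estimates with a tunable weight λ (the slide does not fix the weight; 0.75 below is an arbitrary illustrative value):

```python
def p_interpolated(cur, prev, p_unigram, p_bigram, lam=0.75):
    """Interpolated bigram estimate:
    P(cur|prev) ≈ lam * P_bigram(cur|prev) + (1 - lam) * P_unigram(cur).
    Stays non-zero for unseen bigrams as long as the unigram is non-zero."""
    return lam * p_bigram(cur, prev) + (1 - lam) * p_unigram(cur)
```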

SLIDE 24

Using a bigram language model

SLIDE 25

Using a bigram language model

SLIDE 26

Noisy channel for real-word spell correction

SLIDE 27

Noisy channel for real-word spell correction

SLIDE 28

Simplification: One error per sentence

SLIDE 29

Where to get the probabilities

  • Language model
    • Unigram
    • Bigram
  • Channel model
    • Same as for non-word spelling correction
    • Plus we need a probability for no error, P(w|w)

SLIDE 30

Probability of no error

  • What is the channel probability for a correctly typed word?
    • P("the"|"the")
  • If you have a big corpus, you can estimate this percent correct
  • But this value depends strongly on the application
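One simple way to fold P(w|w) into the channel model is to reserve a constant probability α for "typed correctly" and scale the edit probabilities by the remainder. As the slide notes, α is application dependent; the 0.95 below is an arbitrary illustrative value:

```python
def p_channel(x, w, p_edit, alpha=0.95):
    """Channel probability with an explicit no-error mass:
    P(x|w) = alpha if x == w, else (1 - alpha) * P(edit turning w into x).
    `p_edit` would come from the confusion-matrix estimates."""
    if x == w:
        return alpha
    return (1 - alpha) * p_edit(x, w)
```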

SLIDE 31

Peter Norvig's "thew" example

SLIDE 32

Improvements to channel model

  • Allow richer edits (Brill and Moore 2000)
    • ent → ant
    • ph → f
    • le → al
  • Incorporate pronunciation into the channel (Toutanova and Moore 2002)
  • Incorporate the device into the channel
    • Not all Android phones need have the same error model
    • But spell correction may be done at the system level