Probabilistic Spelling Correction - CE-324: Modern Information Retrieval - PowerPoint PPT Presentation



SLIDE 1

Probabilistic Spelling Correction

CE-324: Modern Information Retrieval

Sharif University of Technology

  • M. Soleymani

Fall 2016

Most slides have been adapted from: Profs. Manning, Nayak & Raghavan lectures (CS-276, Stanford)

SLIDE 2

Applications of spelling correction

SLIDE 3

Spelling Tasks

  • Spelling error detection
  • Spelling error correction:
    • Autocorrect
      • hte → the
    • Suggest a correction
    • Suggestion lists

SLIDE 4

Types of spelling errors

  • Non-word errors
    • graffe → giraffe
  • Real-word errors
    • Typographical errors
      • three → there
    • Cognitive errors (homophones)
      • piece → peace, too → two, your → you're
  • Real-word correction almost always needs to be context-sensitive

SLIDE 5

Spelling correction steps

  • For each word w, generate a candidate set:
    • Find candidate words with similar pronunciations
    • Find candidate words with similar spellings
  • Choose the best candidate
    • By "weighted edit distance" or the "noisy channel" approach
    • Context-sensitive: we have to consider whether the surrounding words "make sense"
      • "Flying form Heathrow to LAX" → "Flying from Heathrow to LAX"
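The spelling-based candidate-generation step above can be sketched in the style of Peter Norvig's well-known spell corrector (the function names here are illustrative, not part of the slides):

```python
# Candidate generation by "all edits at distance 1" - a minimal sketch
# following Peter Norvig's spell corrector. Names are illustrative.

import string

def edits1(word):
    """All strings one edit away from `word`: deletions, adjacent
    transpositions, substitutions, and insertions."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def candidates(word, vocabulary):
    """Known words within edit distance 1 (plus the word itself)."""
    return {w for w in edits1(word) | {word} if w in vocabulary}
```

In a full corrector these candidates would then be ranked by the noisy-channel score introduced on the following slides.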

SLIDE 6

Candidate Testing: Damerau-Levenshtein edit distance

  • Minimal edit distance between two strings, where edits are:
    • Insertion
    • Deletion
    • Substitution
    • Transposition of two adjacent letters

SLIDE 7

SLIDE 8

Noisy channel intuition

SLIDE 9

Noisy channel

  • We see an observation x of a misspelled word
  • Find the correct word w

SLIDE 10

Language Model

  • Take a big supply of words with T tokens:

    P(w) = C(w) / T

    C(w) = # occurrences of w

  • Supply of words:
    • your document collection
    • In other applications, you can take the supply to be typed queries (suitably filtered), when a static dictionary is inadequate
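The maximum-likelihood unigram estimate above, P(w) = C(w)/T, is a one-liner over token counts (a minimal sketch; the function name is illustrative):

```python
from collections import Counter

def unigram_model(tokens):
    """Maximum-likelihood unigram model: P(w) = C(w) / T, where C(w)
    is the count of w and T the total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}
```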

SLIDE 11

Unigram prior probability

  • Counts from 404,253,213 words in the Corpus of Contemporary American English (COCA)

SLIDE 12

Channel model probability

  • Error model probability, edit probability

    Misspelled word: x = x1, x2, x3, ..., xm
    Correct word: w = w1, w2, w3, ..., wn

  • P(x|w) = probability of the edit (deletion/insertion/substitution/transposition)

SLIDE 13

Calculating p(x|w)

  • Still a research question
  • Can be estimated
  • Some simple approaches, e.g., a confusion matrix:
    • A 26×26 table that records how many times one letter was incorrectly used in place of another
    • Usually there are four confusion matrices: deletion, insertion, substitution, and transposition

SLIDE 14

Computing error probability: Confusion matrix

  • del[x,y]: count(xy typed as x)
  • ins[x,y]: count(x typed as xy)
  • sub[x,y]: count(y typed as x)
  • trans[x,y]: count(xy typed as yx)

  • Insertion and deletion are conditioned on the previous character
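Turning these counts into channel probabilities follows the standard estimates (as in Kernighan, Church & Gale 1990): divide each confusion count by how often the intended character (or character pair) occurred. The tiny count tables below are toy numbers for illustration only:

```python
# Estimating P(x|w) for a single edit from confusion-matrix counts.
# All counts here are toy values, not real corpus statistics.

sub = {("a", "e"): 93}        # sub[x, y]: y was typed as x
count_char = {"e": 9000}      # occurrences of each character
count_bigram = {"ct": 500}    # occurrences of each character bigram
del_ = {("c", "t"): 25}       # del[x, y]: "xy" was typed as "x"

def p_substitution(x, w):
    """P(x typed | w intended) for a single substitution: sub[x,w] / count(w)."""
    return sub[(x, w)] / count_char[w]

def p_deletion(prev, dropped):
    """P of dropping `dropped` after `prev`: del[prev,dropped] / count(prev+dropped)."""
    return del_[(prev, dropped)] / count_bigram[prev + dropped]
```

Insertion and transposition probabilities would be estimated analogously from the ins and trans tables.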

SLIDE 15

Confusion matrix for substitution

The cell [o,e] in a substitution confusion matrix would give the count of times that e was substituted for o.

SLIDE 16

Channel model

SLIDE 17

Smoothing probabilities: Add-1 smoothing

|A| = size of the character alphabet
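Add-1 smoothing applied to the substitution matrix, for instance, adds 1 to every count and |A| to every denominator, so unseen substitutions get a small non-zero probability (a minimal sketch; the function name is illustrative):

```python
def p_sub_add1(x, w, sub_counts, char_counts, alphabet_size=26):
    """Add-1 smoothed substitution probability:
    P(x|w) = (sub[x,w] + 1) / (count(w) + |A|)."""
    return (sub_counts.get((x, w), 0) + 1) / (char_counts[w] + alphabet_size)
```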

SLIDE 18

Channel model for acress

SLIDE 19

SLIDE 20

SLIDE 21

Noisy channel for real-word spell correction

  • Given a sentence w1, w2, w3, ..., wn
  • Generate a set of candidates for each word wi:
    • Candidate(w1) = {w1, w'1, w''1, w'''1, ...}
    • Candidate(w2) = {w2, w'2, w''2, w'''2, ...}
    • Candidate(wn) = {wn, w'n, w''n, w'''n, ...}
  • Choose the sequence W that maximizes P(W)

SLIDE 22

Incorporating context words: Context-sensitive spelling correction
  • Determining whether actress or across is appropriate will require looking at the context of use
  • A bigram language model conditions the probability of a word on (just) the previous word:

    P(w1 ... wn) = P(w1) P(w2|w1) ... P(wn|wn−1)
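In practice this chain-rule product is computed in log space to avoid underflow on long sentences; a minimal sketch (the language-model functions are passed in as assumptions):

```python
import math

def sentence_logprob(words, p_unigram, p_bigram):
    """log P(w1 ... wn) = log P(w1) + sum_i log P(wi | w(i-1)),
    the bigram chain rule in log space."""
    logp = math.log(p_unigram(words[0]))
    for prev, cur in zip(words, words[1:]):
        logp += math.log(p_bigram(cur, prev))
    return logp
```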

SLIDE 23

Incorporating context words

  • For unigram counts, P(wi) is always non-zero
    • if our dictionary is derived from the document collection
  • This won't be true of P(wi|wi−1). We need to smooth
    • add-1 smoothing on this conditional distribution
    • Interpolate a unigram and a bigram:
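One common form of the interpolation just mentioned mixes the bigram and unigram estimates with a tunable weight λ (the slide does not fix the weight; 0.75 below is an arbitrary illustrative value):

```python
def p_interpolated(cur, prev, p_unigram, p_bigram, lam=0.75):
    """Interpolated bigram estimate:
    P(cur|prev) ≈ lam * P_bigram(cur|prev) + (1 - lam) * P_unigram(cur).
    Stays non-zero for unseen bigrams as long as the unigram is non-zero."""
    return lam * p_bigram(cur, prev) + (1 - lam) * p_unigram(cur)
```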

SLIDE 24

Using a bigram language model

SLIDE 25

Using a bigram language model

SLIDE 26

Noisy channel for real-word spell correction

SLIDE 27

Noisy channel for real-word spell correction

SLIDE 28

Simplification: One error per sentence

SLIDE 29

Where to get the probabilities

  • Language model
    • Unigram
    • Bigram
  • Channel model
    • Same as for non-word spelling correction
    • Plus we need a probability for no error, P(w|w)

SLIDE 30

Probability of no error

  • What is the channel probability for a correctly typed word?
    • P("the"|"the")
  • If you have a big corpus, you can estimate this percent correct
  • But this value depends strongly on the application
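One simple way to fold P(w|w) into the channel model is to reserve a constant probability α for "typed correctly" and scale the edit probabilities by the remainder. As the slide notes, α is application dependent; the 0.95 below is an arbitrary illustrative value:

```python
def p_channel(x, w, p_edit, alpha=0.95):
    """Channel probability with an explicit no-error mass:
    P(x|w) = alpha if x == w, else (1 - alpha) * P(edit turning w into x).
    `p_edit` would come from the confusion-matrix estimates."""
    if x == w:
        return alpha
    return (1 - alpha) * p_edit(x, w)
```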

SLIDE 31

Peter Norvig's "thew" example

SLIDE 32

Improvements to channel model

  • Allow richer edits (Brill and Moore 2000)
    • ent → ant
    • ph → f
    • le → al
  • Incorporate pronunciation into the channel (Toutanova and Moore 2002)
  • Incorporate the device into the channel
    • Not all Android phones need have the same error model
    • But spell correction may be done at the system level