Probabilistic Spelling Correction


  1. Probabilistic Spelling Correction. CE-324: Modern Information Retrieval, Sharif University of Technology. M. Soleymani, Fall 2016. Most slides have been adapted from Profs. Manning, Nayak & Raghavan's lectures (CS-276, Stanford).

  2. Applications of spelling correction

  3. Spelling Tasks
     - Spelling error detection
     - Spelling error correction:
       - Autocorrect: hte → the
       - Suggest a correction
       - Suggestion lists

  4. Types of spelling errors
     - Non-word errors: graffe → giraffe
     - Real-word errors:
       - Typographical errors: three → there
       - Cognitive errors (homophones): piece → peace, too → two, your → you're
     - Real-word correction almost always needs to be context-sensitive

  5. Spelling correction steps
     - For each word w, generate a candidate set:
       - find candidate words with similar pronunciations
       - find candidate words with similar spellings
     - Choose the best candidate
       - by "weighted edit distance" or the "noisy channel" approach
     - Context-sensitive, so we have to consider whether the surrounding words "make sense":
       "Flying form Heathrow to LAX" → "Flying from Heathrow to LAX"
     (A candidate-generation sketch follows below.)
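Generating candidates with similar spellings is commonly done by enumerating every string within edit distance 1 of the typed word and keeping the ones that are dictionary words, in the style popularized by Peter Norvig's spell corrector. A minimal sketch, where `VOCAB` is a hypothetical stand-in for the real dictionary:

```python
import string

# Hypothetical toy dictionary; a real system would load its lexicon here.
VOCAB = {"the", "from", "form", "giraffe", "across", "actress"}

def edits1(word):
    """All strings one insertion, deletion, substitution, or transposition away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    substitutes = [l + c + r[1:] for l, r in splits if r for c in string.ascii_lowercase]
    inserts = [l + c + r for l, r in splits for c in string.ascii_lowercase]
    return set(deletes + transposes + substitutes + inserts)

def candidates(word):
    """Distance-1 neighbors that are dictionary words (the word itself may qualify)."""
    return {w for w in edits1(word) if w in VOCAB} or {word}

print(candidates("form"))  # {'form', 'from'}
```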

  6. Candidate Testing: Damerau-Levenshtein edit distance
     - Minimal edit distance between two strings, where edits are:
       - insertion
       - deletion
       - substitution
       - transposition of two adjacent letters
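A minimal sketch of the dynamic program for this distance (the "optimal string alignment" variant, in which each pair of adjacent letters may be transposed at most once):

```python
def dl_distance(s, t):
    """Damerau-Levenshtein distance, optimal string alignment variant."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(dl_distance("hte", "the"))         # 1 (one transposition)
print(dl_distance("acress", "actress"))  # 1 (one insertion)
```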

  8. Noisy channel intuition

  9. Noisy channel
     - We see an observation x of a misspelled word
     - Find the correct word w
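The slide's formula can be reconstructed from the definitions that follow: applying Bayes' rule and dropping the constant denominator $P(x)$ leaves a channel (error) model times a language-model prior:

$$\hat{w} = \arg\max_{w \in V} P(w \mid x) = \arg\max_{w \in V} \frac{P(x \mid w)\, P(w)}{P(x)} = \arg\max_{w \in V} P(x \mid w)\, P(w)$$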

  10. Language Model
      - Take a big supply of words with T tokens and estimate the unigram prior:
        $P(x) = \frac{C(x)}{T}$, where C(x) = # occurrences of x
      - Supply of words: your document collection
      - In other applications you can take the supply to be typed queries (suitably filtered), e.g. when a static dictionary is inadequate
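A minimal sketch of this maximum-likelihood estimate, with a toy corpus standing in for the document collection:

```python
from collections import Counter

# Hypothetical toy corpus; in practice this is the document collection.
corpus = "the cat sat on the mat near the door".split()

counts = Counter(corpus)  # C(x): occurrences of each word
T = sum(counts.values())  # T: total number of tokens

def p_unigram(x):
    """Unigram prior P(x) = C(x) / T."""
    return counts[x] / T

print(p_unigram("the"))  # 3/9 ≈ 0.333
```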

  11. Unigram prior probability
      - Counts from 404,253,213 words in the Corpus of Contemporary American English (COCA)

  12. Channel model probability
      - Error model probability, edit probability
      - Misspelled word $x = x_1, x_2, \dots, x_m$
      - Correct word $w = w_1, w_2, \dots, w_n$
      - $P(x \mid w)$ = probability of the edit (deletion / insertion / substitution / transposition)

  13. Calculating $P(x \mid w)$
      - Still a research question, but it can be estimated in some simple ways, e.g. with confusion matrices
      - A confusion matrix is a square 26 × 26 table recording how many times one letter was incorrectly used instead of another
      - Usually there are four confusion matrices: deletion, insertion, substitution, and transposition

  14. Computing error probability: confusion matrix
      - del[x,y]: count(xy typed as x)
      - ins[x,y]: count(x typed as xy)
      - sub[x,y]: count(y typed as x)
      - trans[x,y]: count(xy typed as yx)
      - Insertion and deletion are conditioned on the previous character
      (A sketch of the resulting estimates follows below.)
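A sketch of how these counts yield the channel probability $P(x \mid w)$, in the style of the Kernighan, Church & Gale estimates that this kind of confusion matrix comes from; every count below is a hypothetical toy value, not a real corpus statistic:

```python
# Hypothetical toy confusion-matrix counts (real ones come from an error corpus).
del_counts   = {("a", "c"): 10}  # del[x,y]: count of "xy" typed as "x"
ins_counts   = {("c", "a"): 4}   # ins[x,y]: count of "x" typed as "xy"
sub_counts   = {("e", "a"): 20}  # sub[x,y]: count of "y" typed as "x"
trans_counts = {("c", "r"): 3}   # trans[x,y]: count of "xy" typed as "yx"
char_counts   = {"a": 5000, "c": 3000}              # count(x) in the corpus
bigram_counts = {("a", "c"): 800, ("c", "r"): 600}  # count(xy) in the corpus

# Deletion of "c" after "a":       P(x|w) = del[a,c] / count("ac")
p_del = del_counts[("a", "c")] / bigram_counts[("a", "c")]
# Insertion of "a" after "c":      P(x|w) = ins[c,a] / count("c")
p_ins = ins_counts[("c", "a")] / char_counts["c"]
# Substitution of "e" for "a":     P(x|w) = sub[e,a] / count("a")
p_sub = sub_counts[("e", "a")] / char_counts["a"]
# Transposition of "cr" into "rc": P(x|w) = trans[c,r] / count("cr")
p_trans = trans_counts[("c", "r")] / bigram_counts[("c", "r")]

print(p_del, p_ins, p_sub, p_trans)
```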

  15. Confusion matrix for substitution
      - The cell [o,e] in a substitution confusion matrix gives the count of times that e was substituted for o

  16. Channel model

  17. Smoothing probabilities: add-1 smoothing
      - |A| = size of the character alphabet
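For the substitution case, for instance, the add-1-smoothed estimate over the |A|-letter alphabet would take the standard form below (a reconstruction, not taken verbatim from the slide):

$$P(x \mid w) = \frac{\mathrm{sub}[x_i, w_i] + 1}{\mathrm{count}(w_i) + |A|}$$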

  18. Channel model for acress


  21. Noisy channel for real-word spell correction
      - Given a sentence $w_1, w_2, w_3, \dots, w_n$
      - Generate a set of candidates for each word $w_i$:
        - Candidate($w_1$) = {$w_1, w_1', w_1'', w_1''', \dots$}
        - Candidate($w_2$) = {$w_2, w_2', w_2'', w_2''', \dots$}
        - Candidate($w_n$) = {$w_n, w_n', w_n'', w_n''', \dots$}
      - Choose the sequence W that maximizes P(W) (see the sketch below)
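A minimal sketch of choosing the highest-scoring candidate sequence under a bigram language model; the candidate sets and the probability table are hypothetical toy values:

```python
from itertools import product
import math

# Hypothetical candidate sets for the three words of a sentence.
candidates = [["two"], ["of", "off"], ["the", "thew"]]

def bigram_p(prev, word):
    """Toy bigram probabilities; a real system uses smoothed corpus counts."""
    table = {("two", "of"): 0.1, ("of", "the"): 0.2}
    return table.get((prev, word), 1e-6)

def log_p(seq):
    """log P(W) under the bigram model (ignoring the sentence-start term)."""
    return sum(math.log(bigram_p(a, b)) for a, b in zip(seq, seq[1:]))

best = max(product(*candidates), key=log_p)
print(best)  # ('two', 'of', 'the')
```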

  22. Incorporating context words: context-sensitive spelling correction
      - Determining whether actress or across is appropriate requires looking at the context of use
      - A bigram language model conditions the probability of a word on (just) the previous word:
        $P(w_1 \dots w_n) = P(w_1)\, P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1})$

  23. Incorporating context words
      - For unigram counts, $P(w_k)$ is always non-zero
        - if our dictionary is derived from the document collection
      - This won't be true of $P(w_k \mid w_{k-1})$; we need to smooth
        - add-1 smoothing on this conditional distribution
        - or interpolate a unigram and a bigram (see below)
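The interpolated estimate mixes the two models with a weight λ (a standard form; how λ is tuned, e.g. on held-out data, is not specified on the slide):

$$P(w_k \mid w_{k-1}) = \lambda\, P_{\mathrm{uni}}(w_k) + (1 - \lambda)\, P_{\mathrm{bi}}(w_k \mid w_{k-1})$$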

  24. Using a bigram language model

  25. Using a bigram language model

  26. Noisy channel for real-word spell correction

  27. Noisy channel for real-word spell correction

  28. Simplification: one error per sentence
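Under this simplification, the corrector only needs to vary one word at a time, keeping the rest of the sentence fixed, and keep the best-scoring variant. A minimal sketch; the candidate generator and the scorer below are hypothetical placeholders for the real candidate sets and for log P(W) + log P(x|w):

```python
def candidates(word):
    """Hypothetical candidate sets; a real system generates these by edit distance."""
    table = {"form": {"form", "from"}, "two": {"two", "too"}}
    return table.get(word, {word})

def score(sentence):
    """Placeholder for log P(W) + log P(x|w); here it just prefers 'from'."""
    return 1.0 if "from" in sentence else 0.0

def correct(sentence):
    best, best_score = sentence, score(sentence)
    for i, w in enumerate(sentence):          # vary exactly one position...
        for c in candidates(w):
            variant = sentence[:i] + [c] + sentence[i + 1:]
            if score(variant) > best_score:   # ...and keep the best variant
                best, best_score = variant, score(variant)
    return best

print(correct("flying form heathrow".split()))  # ['flying', 'from', 'heathrow']
```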

  29. Where to get the probabilities
      - Language model
        - unigram
        - bigram
      - Channel model
        - same as for non-word spelling correction
        - plus we need the probability of no error, P(w|w)

  30. Probability of no error
      - What is the channel probability for a correctly typed word? P("the" | "the")
      - If you have a big corpus, you can estimate this percent correct
      - But this value depends strongly on the application

  31. Peter Norvig's "thew" example

  32. Improvements to the channel model
      - Allow richer edits (Brill and Moore 2000): ent → ant, ph → f, le → al
      - Incorporate pronunciation into the channel (Toutanova and Moore 2002)
      - Incorporate the device into the channel
        - not all Android phones need the same error model
        - but spell correction may be done at the system level
