
Similarity and Correction of Strings and Trees: Towards a Correction of XML Documents



  1. Similarity and Correction of Strings and Trees: Towards a Correction of XML Documents
     Agata SAVARY, Université François-Rabelais de Tours, Campus de Blois, Laboratoire d’Informatique
     IPI PAN Seminar, 24 April 2006

  2. String-to-string correction

  3. Traditional string-to-string correction (Wagner & Fischer 1974, Lowrance & Wagner 1975, …)
     • CONTEXT:
       – Finite set of symbols (alphabet)
       – Elementary operations on symbols (editing operations, e.g. deletion, insertion, or replacement of a letter, inversion of two adjacent letters) with their costs (usually 1 per operation)
       – Sequences of editing operations (edit sequences; each operation applies to the word resulting from the previous operations) with their costs (the sum of the costs of the editing operations involved)
       – Measure of similarity between words A and B (edit distance or error distance): the minimum cost over all edit sequences transforming A into B
     • INPUT:
       – Two words A and B
     • OUTPUT:
       – Distance between A and B

  4. Examples of elementary edit operations
     • Insertion of a letter: monter → montaer, monter → montrer
     • Deletion of a letter: monter → montr, monter → monte
     • Replacement of a letter by another: monter → ponter, monter → conter
     • Transposition of two adjacent letters: monter → mnoter, monter → montre
     Each elementary operation has a non-negative cost. From now on we assume cost 1 for each elementary operation.
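
The four elementary operations can be illustrated with a minimal Python sketch. The function name `single_edits` and the choice of the lowercase Latin alphabet are assumptions made for this example only; they do not come from the slides.

```python
def single_edits(word, alphabet):
    """Return the set of words reachable from `word` by exactly one elementary operation."""
    results = set()
    for k in range(len(word) + 1):
        # Insertion of a letter at position k.
        for c in alphabet:
            results.add(word[:k] + c + word[k:])
    for k in range(len(word)):
        # Deletion of the letter at position k.
        results.add(word[:k] + word[k + 1:])
        # Replacement of the letter at position k by a different letter.
        for c in alphabet:
            if c != word[k]:
                results.add(word[:k] + c + word[k + 1:])
    for k in range(len(word) - 1):
        # Transposition of two adjacent letters (skipped when both letters are equal).
        if word[k] != word[k + 1]:
            results.add(word[:k] + word[k + 1] + word[k] + word[k + 2:])
    return results


neighbours = single_edits("monter", "abcdefghijklmnopqrstuvwxyz")
print("montrer" in neighbours, "montre" in neighbours)
```

With cost 1 per operation, every word in `neighbours` is at cost 1 from monter.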

  5. Edit sequence
     • Edit sequence = sequence of elementary edit operations
     • For each pair of words X and Y, many edit sequences exist that transform X into Y.
     • Example 1: transforming sorting into string:
       Linear sequence
       – sorting → srting → sting → string (3 operations)
       – sorting → sotring → string (2 operations)
       Linear sequence
       – sorting → srting → string (2 operations)
       – sorting → strting → string (2 operations)
       Linear sequence
       – sorting → srting → sting → sing → sring → string (5 operations)
       – …
     • Example 2: transforming abc into ca:
       – abc → ac → ca (2 operations)
       Linear sequence
       – abc → cabc → cac → ca (3 operations)
     • From now on, we’ll be interested in linear edit sequences (Du & Chang 1992), i.e. such that the operations are performed from left to right, and no further operation may alter the result of a previous operation.

  6. Edit (error) distance
     • Cost of an edit sequence = sum of the costs of all elementary operations included in the sequence
       – sorting → srting → sting → string (3 operations) ⇒ cost = 3
       – sorting → sotring → string (2 operations) ⇒ cost = 2
       – sorting → srting → sting → sing → sring → string (5 operations) ⇒ cost = 5
     • Edit distance (error distance) between two words X and Y, written ed(X,Y) = minimal cost over all edit sequences transforming X into Y:
       – ed(sorting, string) = 2
       – ed(abc, ca) = 2, if all edit sequences are taken into account
       – ed(abc, ca) = 3, if only the linear edit sequences are taken into account
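
As a cross-check of the definition "minimal cost over all edit sequences", the following sketch searches edit sequences breadth-first, with no linearity restriction. It is a brute-force illustration, not the method of the slides; the names `single_edits` and `ed_all_sequences`, the `max_cost` bound and the restricted alphabets are assumptions of this example.

```python
from collections import deque


def single_edits(word, alphabet):
    """All words obtained from `word` by one insertion, deletion, replacement or transposition."""
    out = set()
    for k in range(len(word) + 1):
        out.update(word[:k] + c + word[k:] for c in alphabet)
    for k in range(len(word)):
        out.add(word[:k] + word[k + 1:])
        out.update(word[:k] + c + word[k + 1:] for c in alphabet if c != word[k])
    for k in range(len(word) - 1):
        if word[k] != word[k + 1]:
            out.add(word[:k] + word[k + 1] + word[k] + word[k + 2:])
    return out


def ed_all_sequences(source, target, alphabet, max_cost=5):
    """Minimal number of unit-cost operations over ALL edit sequences (breadth-first search)."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        word, cost = queue.popleft()
        if cost == max_cost:
            continue  # give up beyond the (assumed) cost bound
        for nxt in single_edits(word, alphabet):
            if nxt == target:
                return cost + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, cost + 1))
    return None  # target not reachable within max_cost operations


print(ed_all_sequences("abc", "ca", "abc"))
print(ed_all_sequences("sorting", "string", "ginorst"))
```

Because the search is exponential in the cost bound, it is only practical for short words and small alphabets.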

  7. Calculating the edit distance (1/4)
     Notation: word $X = x_1 x_2 \dots x_i \dots x_n$; the prefix of length $i$ of $X$ is $X[i] = x_1 x_2 \dots x_i$.
     The distance between two prefixes $X[i+1]$ and $Y[j+1]$ can be calculated from the distances between shorter prefixes. There are 3 cases.
     Case 1: if $x_{i+1} = y_{j+1}$ then $ed(X[i+1], Y[j+1]) = ed(X[i], Y[j])$.

  8. Calculating the edit distance (2/4)
     Case 2: if $x_i = y_{j+1}$ and $x_{i+1} = y_j$ (the last 2 characters may be inverted) then 4 sub-cases are possible:
     • The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains a transposition of $x_i$ and $x_{i+1}$: $ed(X[i+1], Y[j+1]) = ed(X[i-1], Y[j-1]) + 1$ (the $+1$ is the transposition's cost)
     • The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the replacement of $x_{i+1}$ by $y_{j+1}$: $ed(X[i+1], Y[j+1]) = ed(X[i], Y[j]) + 1$ (replacement's cost)
     • The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the insertion of $y_{j+1}$: $ed(X[i+1], Y[j+1]) = ed(X[i+1], Y[j]) + 1$ (insertion's cost)
     • The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the deletion of $x_{i+1}$: $ed(X[i+1], Y[j+1]) = ed(X[i], Y[j+1]) + 1$ (deletion's cost)

  9. Calculating the edit distance (3/4)
     Case 3: OTHERWISE (if $x_{i+1} \neq y_{j+1}$, and ($x_i \neq y_{j+1}$ or $x_{i+1} \neq y_j$)) then 3 sub-cases are possible:
     • The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the replacement of $x_{i+1}$ by $y_{j+1}$: $ed(X[i+1], Y[j+1]) = ed(X[i], Y[j]) + 1$ (replacement's cost)
     • The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the insertion of $y_{j+1}$: $ed(X[i+1], Y[j+1]) = ed(X[i+1], Y[j]) + 1$ (insertion's cost)
     • The cheapest sequence transforming $X[i+1]$ into $Y[j+1]$ contains the deletion of $x_{i+1}$: $ed(X[i+1], Y[j+1]) = ed(X[i], Y[j+1]) + 1$ (deletion's cost)

  10. Calculating the edit distance (4/4)
      Edit distance between $X[i]$ and $Y[j]$, recursive definition. For $i = 0, \dots, m$ and $j = 0, \dots, n$:
      1° $ed(X[-1], Y[j]) = ed(X[i], Y[-1]) = \max(m, n)$
      2° $ed(X[0], Y[j]) = j$ and $ed(X[i], Y[0]) = i$
      3° $$ed(X[i+1], Y[j+1]) = \begin{cases} ed(X[i], Y[j]) & \text{if } x_{i+1} = y_{j+1} \\ 1 + \min\{\, ed(X[i], Y[j]),\ ed(X[i+1], Y[j]),\ ed(X[i], Y[j+1]),\ ed(X[i-1], Y[j-1]) \,\} & \text{if } x_i = y_{j+1} \text{ and } x_{i+1} = y_j \\ 1 + \min\{\, ed(X[i], Y[j]),\ ed(X[i+1], Y[j]),\ ed(X[i], Y[j+1]) \,\} & \text{otherwise} \end{cases}$$
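
The recursive definition can be transcribed almost literally into a memoized Python function. This is a sketch under two assumptions: prefixes are identified by their lengths (so `ed(i, j)` stands for ed(X[i], Y[j])), and the sentinel rule 1° is replaced by an explicit bounds check before the transposition term is used. The name `edit_distance` is chosen for this example.

```python
from functools import lru_cache


def edit_distance(x, y):
    """Edit distance with unit costs, restricted to linear edit sequences."""

    @lru_cache(maxsize=None)
    def ed(i, j):
        # ed(i, j) is the distance between the prefix of length i of x
        # and the prefix of length j of y (rule 2 gives the base cases).
        if i == 0:
            return j
        if j == 0:
            return i
        # Rule 3, first branch: the last letters are equal.
        if x[i - 1] == y[j - 1]:
            return ed(i - 1, j - 1)
        # Replacement, insertion, deletion.
        candidates = [ed(i - 1, j - 1), ed(i, j - 1), ed(i - 1, j)]
        # Second branch: the last two letters are inverted, so a
        # transposition is also possible (bounds check replaces rule 1).
        if i >= 2 and j >= 2 and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]:
            candidates.append(ed(i - 2, j - 2))
        return 1 + min(candidates)

    return ed(len(x), len(y))


print(edit_distance("sorting", "string"))
print(edit_distance("abc", "ca"))
```

Memoization keeps the number of evaluated prefix pairs bounded by the product of the two word lengths; for long words an iterative table avoids Python's recursion limit.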

  11. Calculating the edit distance: dynamic programming
      Cell [i,j] contains the edit distance between the prefix [1,…,i] of one word (rows, i = 0,…,n) and the prefix [1,…,j] of the other word (columns, j = 0,…,m). Cell [n,m] contains the edit distance between the 2 words.

          ε  s  o  r  t  i  n  g
      ε   0  1  2  3  4  5  6  7
      s   1  0  1  2  3  4  5  6
      t   2  1  1  2  2  3  4  5
      r   3  2  2  1  2  3  4  5
      i   4  3  3  2  2  2  3  4
      n   5  4  4  3  3  3  2  3
      g   6  5  5  4  4  4  3  2
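
A bottom-up version of the same computation fills the matrix row by row. The sketch below assumes unit costs and the three-case recurrence of the previous slides; `edit_matrix` and `show` are names introduced for this example.

```python
def edit_matrix(x, y):
    """Matrix d where d[i][j] is the edit distance between x[:i] and y[:j]."""
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # i deletions turn x[:i] into the empty word
    for j in range(m + 1):
        d[0][j] = j  # j insertions turn the empty word into y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1]  # case 1: equal last letters
            else:
                # Replacement, insertion, deletion (case 3).
                best = min(d[i - 1][j - 1], d[i][j - 1], d[i - 1][j])
                # Case 2: the last two letters are inverted.
                if i >= 2 and j >= 2 and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]:
                    best = min(best, d[i - 2][j - 2])
                d[i][j] = 1 + best
    return d


def show(x, y):
    """Print the matrix with the letters of x as row labels and of y as column labels."""
    d = edit_matrix(x, y)
    print("    " + "  ".join("ε" + y))
    for label, row in zip("ε" + x, d):
        print(label + "   " + "  ".join(str(v) for v in row))


show("string", "sorting")
```

The bottom-right cell of the printed matrix is the distance between the two complete words, here 2.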

  12. Dynamic programming: case 1, $x_{i+1} = y_{j+1}$

                     j+1
                      ↓
          ε  s  o  r  t  i  n  g
      ε   0  1  2  3  4  ?  ?  ?
      s   1  0  1  2  3  ?  ?  ?
      t   2  1  1  2  2  ?  ?  ?   ← i+1
      r   ?  ?  ?  ?  ?  ?  ?  ?
      i   ?  ?  ?  ?  ?  ?  ?  ?
      n   ?  ?  ?  ?  ?  ?  ?  ?
      g   ?  ?  ?  ?  ?  ?  ?  ?

  13. Dynamic programming: case 2, $x_i = y_{j+1}$ and $x_{i+1} = y_j$

                     j+1
                      ↓
          ε  s  o  r  t  i  n  g
      ε   0  1  2  3  4  ?  ?  ?
      s   1  0  1  2  3  ?  ?  ?
      t   2  1  1  2  2  ?  ?  ?
      r   3  2  2  1  2  ?  ?  ?   ← i+1
      i   ?  ?  ?  ?  ?  ?  ?  ?
      n   ?  ?  ?  ?  ?  ?  ?  ?
      g   ?  ?  ?  ?  ?  ?  ?  ?

  14. Dynamic programming: case 3, $x_{i+1} \neq y_{j+1}$ and ($x_i \neq y_{j+1}$ or $x_{i+1} \neq y_j$)

                     j+1
                      ↓
          ε  s  o  r  t  i  n  g
      ε   0  1  2  3  4  ?  ?  ?
      s   1  0  1  2  3  ?  ?  ?
      t   2  1  1  2  2  ?  ?  ?
      r   3  2  2  1  2  ?  ?  ?
      i   4  3  3  2  2  ?  ?  ?   ← i+1
      n   ?  ?  ?  ?  ?  ?  ?  ?
      g   ?  ?  ?  ?  ?  ?  ?  ?

  15. String-to-language correction

  16. String-to-language correction: problem definition
      • CONTEXT:
        – Finite set of symbols (alphabet)
        – Elementary edit operations on symbols (as before) with their costs (1 per operation)
        – Edit sequences (as before)
        – Edit distance (error distance) between words: as before
      • INPUT:
        – Regular grammar describing words (a finite set of words in particular)
        – Incorrect word A (not recognized by the grammar)
        – Threshold t
      • OUTPUT:
        – A set of correct words $B_1, B_2, \dots, B_n$ whose distance from A stays within t (the nearest neighbors of A)

  17. String-to-language correction: simplistic approach
      • METHOD:
        – For each word B recognizable by the grammar, calculate the edit distance matrix between A and B.
        – Propose the candidates whose distance from A does not exceed the threshold t (ed(A,B) ≤ t).
      • FEASIBILITY:
        – Impossible in the case of infinite languages
      • COMPLEXITY: O(n * m * |D|)
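
For a finite word list, the simplistic approach amounts to one full matrix computation per word. The sketch below assumes that the grammar is given as a plain Python list; `edit_distance`, `simplistic_correction` and the sample word list are hypothetical names and data introduced for this example.

```python
def edit_distance(a, b):
    """Unit-cost edit distance over linear edit sequences (with transposition)."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                best = min(d[i - 1][j - 1], d[i][j - 1], d[i - 1][j])
                if i >= 2 and j >= 2 and a[i - 2] == b[j - 1] and a[i - 1] == b[j - 2]:
                    best = min(best, d[i - 2][j - 2])
                d[i][j] = 1 + best
    return d[n][m]


def simplistic_correction(word, dictionary, t):
    """Compare `word` with every dictionary entry; keep those within threshold t."""
    scored = [(b, edit_distance(word, b)) for b in dictionary]
    kept = [(b, dist) for b, dist in scored if dist <= t]
    return sorted(kept, key=lambda pair: (pair[1], pair[0]))


# Hypothetical word list, for illustration only.
candidates = simplistic_correction("aply", ["apple", "apples", "apply", "banana"], 2)
print(candidates)
```

Every dictionary entry costs a complete matrix, even when it shares a long prefix with an entry already examined, which is the redundancy that an exploration of an automaton can avoid.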

  18. String-to-language correction: threshold-controlled depth-first exploration of an FSA (Oflazer 1996, …)

  19. String correction with respect to a deterministic FSA (1/4)
      Word to be corrected: *aply, threshold 2
      [Figure: deterministic FSA with states 1 to 9 and letter-labelled transitions]
      • Each time a transition is followed, a new column is calculated in the edit distance matrix.
      • If we get to a final state and the edit distance remains within the threshold → a new candidate has been found.
      The part of the matrix below the prefix appl is calculated only once for all valid words sharing that prefix.

          ε  a  p  p  l  …  e
      ε   0  1  2  3  4  …  5
      a   1  0  1  2  3  …  4
      p   2  1  0  1  2  …  3
      l   3  2  1  1  1  …  2
      y   4  3  2  2  2  …  2

      Candidate: apple

  20. String correction with respect to a deterministic FSA (2/4)
      Word to be corrected: *aply, threshold 2
      [Figure: deterministic FSA with states 1 to 9 and letter-labelled transitions]
      • Each time a transition is followed, a new column is calculated in the edit distance matrix.
      • If we get to a final state and the edit distance remains within the threshold → a new candidate has been found.
      The part of the matrix below the prefix appl is calculated only once for all valid words sharing that prefix.

          ε  a  p  p  l  …  e  s
      ε   0  1  2  3  4  …  5  6
      a   1  0  1  2  3  …  4  5
      p   2  1  0  1  2  …  3  4
      l   3  2  1  1  1  …  2  3
      y   4  3  2  2  2  …  2  3

      Candidate: apple
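
The column-per-transition idea can be sketched on a trie, which is a simple acyclic deterministic automaton. This is an illustration under stated assumptions, not the algorithm of the cited work: the trie stored as nested dictionaries, the `"$"` key marking final states (so words must not contain `$`), the function names, the sample word list, and the pruning rule (abandon a path once every cell of the current column exceeds the threshold) are all choices made for this example.

```python
def build_trie(words):
    """Acyclic deterministic automaton as nested dicts; the key '$' marks a final state."""
    root = {}
    for w in words:
        node = root
        for c in w:
            node = node.setdefault(c, {})
        node["$"] = True
    return root


def correct(word, trie, t):
    """Depth-first exploration; one new matrix column per transition followed."""
    n = len(word)
    found = []

    def explore(node, prefix, prev_col, col):
        # col[i] is the distance between word[:i] and prefix;
        # prev_col is the column computed for prefix[:-1].
        if "$" in node and col[n] <= t:
            found.append((prefix, col[n]))
        if min(col) > t:
            return  # assumed pruning rule: no longer prefix can get back within t
        for c, child in node.items():
            if c == "$":
                continue
            new = [col[0] + 1]
            for i in range(1, n + 1):
                if word[i - 1] == c:
                    new.append(col[i - 1])
                else:
                    best = min(col[i - 1], new[i - 1], col[i])
                    # Transposition: last two letters inverted.
                    if i >= 2 and prefix and word[i - 2] == c and word[i - 1] == prefix[-1]:
                        best = min(best, prev_col[i - 2])
                    new.append(1 + best)
            explore(child, prefix + c, col, new)

    explore(trie, "", None, list(range(n + 1)))
    return sorted(found, key=lambda pair: (pair[1], pair[0]))


# Hypothetical word list, for illustration only.
trie = build_trie(["apple", "apples", "apply", "ape", "banana"])
result = correct("aply", trie, 2)
print(result)
```

The columns computed for a shared prefix such as appl are reused by apple, apples and apply, since the recursion passes them down to every outgoing transition.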
