Similarity and Correction of Strings and Trees: Towards a Correction of XML Documents
Agata SAVARY, Université François-Rabelais de Tours, Campus de Blois, Laboratoire d’Informatique
IPIPAN Seminar, 24 April 2006
String-to-string correction
Traditional string-to-string correction (Wagner & Fischer 1974, Lowrance & Wagner 1975, …)
• CONTEXT:
– Finite set of symbols (alphabet)
– Elementary operations on symbols (editing operations, e.g. deletion, insertion, or replacement of a letter, transposition of two adjacent letters) with their costs (usually 1 per operation)
– Sequences of editing operations (edit sequences; each operation applies to the word resulting from the previous operations) with their costs (the sum of the costs of the editing operations involved)
– Measure of similarity between words A and B (edit distance or error distance): the minimum cost over all edit sequences transforming A into B
• INPUT:
– Two words A and B
• OUTPUT:
– Distance between A and B
Examples of elementary edit operations
• Insertion of a letter: monter → montaer, monter → montrer
• Deletion of a letter: monter → montr, monter → monte
• Replacement of a letter by another: monter → ponter, monter → conter
• Transposition of two adjacent letters: monter → mnoter, monter → montre
Each elementary operation has a non-negative cost. From now on we assume cost 1 for each elementary operation.
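The four elementary operations listed above can be sketched in a few lines of Python. This is only an illustration: the function name `one_edit_neighbors` and the choice of the lowercase ASCII alphabet are assumptions, not part of the talk.

```python
import string


def one_edit_neighbors(word, alphabet=string.ascii_lowercase):
    """Return all words reachable from `word` by exactly one elementary operation."""
    # Every way of cutting the word into a left and a right part.
    splits = [(word[:k], word[k:]) for k in range(len(word) + 1)]

    # Insertion of a letter at the cut position.
    insertions = {left + c + right for left, right in splits for c in alphabet}
    # Deletion of the first letter of the right part.
    deletions = {left + right[1:] for left, right in splits if right}
    # Replacement of that letter by a different one.
    replacements = {
        left + c + right[1:]
        for left, right in splits if right
        for c in alphabet if c != right[0]
    }
    # Transposition of two adjacent (and different) letters.
    transpositions = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1 and right[0] != right[1]
    }
    return insertions | deletions | replacements | transpositions


neighbors = one_edit_neighbors("monter")
print("montre" in neighbors, "ponter" in neighbors)
```

With unit costs, every word in the returned set is at cost 1 from the input word.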
Edit sequence
• Edit sequence = sequence of elementary edit operations
• For each pair of words X and Y, many edit sequences exist that transform X into Y.
• Example 1: transforming sorting into string:
– sorting → srting → sting → string (3 operations; linear sequence)
– sorting → sotring → string (2 operations)
– sorting → srting → string (2 operations; linear sequence)
– sorting → strting → string (2 operations)
– sorting → srting → sting → sing → sring → string (5 operations; linear sequence)
– .................
• Example 2: transforming abc into ca:
– abc → ac → ca (2 operations)
– abc → cabc → cac → ca (3 operations; linear sequence)
• From now on, we are interested only in linear edit sequences (Du & Chang 1992), i.e. sequences in which the operations are performed from left to right and no later operation may alter the result of an earlier one.
Edit (error) distance
• Cost of an edit sequence = sum of the costs of all elementary operations in the sequence
– sorting → srting → sting → string (3 operations), cost = 3
– sorting → sotring → string (2 operations), cost = 2
– sorting → srting → sting → sing → sring → string (5 operations), cost = 5
• Edit distance (error distance) between two words X and Y, written ed(X,Y) = minimal cost over all edit sequences transforming X into Y:
– ed(sorting, string) = 2
– ed(abc, ca) = 2, if all edit sequences are taken into account
– ed(abc, ca) = 3, if only the linear edit sequences are taken into account
Calculating the edit distance (1/4)
Notation: word $X = x_1 x_2 \dots x_i \dots x_n$; the prefix of length $i$ of $X$ is $X[i] = x_1 x_2 \dots x_i$.
The distance between two prefixes X[i+1] and Y[j+1] can be calculated from the distances between shorter prefixes. There are 3 cases.
Case 1: if $x_{i+1} = y_{j+1}$ then ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]).
Calculating the edit distance (2/4)
Case 2: if $x_i = y_{j+1}$ and $x_{i+1} = y_j$ (the last 2 characters may be transposed), then 4 sub-cases are possible:
• The cheapest sequence transforming X[i+1] into Y[j+1] contains a transposition of $x_i$ and $x_{i+1}$: ed(X[i+1],Y[j+1]) = ed(X[i-1],Y[j-1]) + 1 (the +1 is the transposition's cost)
• The cheapest sequence transforming X[i+1] into Y[j+1] contains the replacement of $x_{i+1}$ by $y_{j+1}$: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1 (replacement's cost)
• The cheapest sequence transforming X[i+1] into Y[j+1] contains the insertion of $y_{j+1}$: ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1 (insertion's cost)
• The cheapest sequence transforming X[i+1] into Y[j+1] contains the deletion of $x_{i+1}$: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1 (deletion's cost)
Calculating the edit distance (3/4)
Case 3: OTHERWISE (if $x_{i+1} \ne y_{j+1}$, and ($x_i \ne y_{j+1}$ or $x_{i+1} \ne y_j$)), 3 sub-cases are possible:
• The cheapest sequence transforming X[i+1] into Y[j+1] contains the replacement of $x_{i+1}$ by $y_{j+1}$: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1 (replacement's cost)
• The cheapest sequence transforming X[i+1] into Y[j+1] contains the insertion of $y_{j+1}$: ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1 (insertion's cost)
• The cheapest sequence transforming X[i+1] into Y[j+1] contains the deletion of $x_{i+1}$: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1 (deletion's cost)
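Calculating the edit distance (4/4)
Edit distance between X[i] and Y[j], recursive definition. For i = 0,...,m and j = 0,...,n:
1° ed(X[-1],Y[j]) = ed(X[i],Y[-1]) = max(m,n) (boundary values used by the transposition term when i = 0 or j = 0)
2° ed(X[0],Y[j]) = j and ed(X[i],Y[0]) = i
3°
$$
\mathrm{ed}(X[i+1],Y[j+1]) =
\begin{cases}
\mathrm{ed}(X[i],Y[j]) & \text{if } x_{i+1} = y_{j+1} \\
1 + \min\{\mathrm{ed}(X[i],Y[j]),\ \mathrm{ed}(X[i+1],Y[j]),\ \mathrm{ed}(X[i],Y[j+1]),\ \mathrm{ed}(X[i-1],Y[j-1])\} & \text{if } x_i = y_{j+1} \text{ and } x_{i+1} = y_j \\
1 + \min\{\mathrm{ed}(X[i],Y[j]),\ \mathrm{ed}(X[i+1],Y[j]),\ \mathrm{ed}(X[i],Y[j+1])\} & \text{otherwise}
\end{cases}
$$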
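The recursive definition above translates almost directly into a memoized function. The following is a minimal sketch, not the speaker's code: the name `edit_distance` is an assumption, arguments `i` and `j` are prefix lengths, and an explicit guard (`i > 1 and j > 1`) replaces the boundary values of rule 1°.

```python
from functools import lru_cache


def edit_distance(x, y):
    """Edit distance with unit-cost insertion, deletion, replacement, transposition."""

    @lru_cache(maxsize=None)
    def ed(i, j):
        # ed(i, j) is the distance between the prefixes x[:i] and y[:j].
        if i == 0:
            return j          # rule 2: only insertions are needed
        if j == 0:
            return i          # rule 2: only deletions are needed
        if x[i - 1] == y[j - 1]:
            return ed(i - 1, j - 1)          # case 1: last letters match
        options = [
            ed(i - 1, j - 1),  # replacement
            ed(i, j - 1),      # insertion
            ed(i - 1, j),      # deletion
        ]
        # Case 2: the last two letters are swapped in the other word.
        if i > 1 and j > 1 and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]:
            options.append(ed(i - 2, j - 2))  # transposition
        return 1 + min(options)

    return ed(len(x), len(y))


print(edit_distance("sorting", "string"))
print(edit_distance("abc", "ca"))
```

Because a transposed pair is never edited again, this recurrence gives 3 for abc and ca, the value the slides report for linear edit sequences. Recursion depth grows with word length, so for long inputs a table-based version is preferable.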
Calculating the edit distance: dynamic programming
Cell [i,j] contains the edit distance between the prefix [1,...,i] of one word (rows, i = 0,...,n) and the prefix [1,...,j] of the other word (columns, j = 0,...,m).
        s  o  r  t  i  n  g
     0  1  2  3  4  5  6  7
  s  1  0  1  2  3  4  5  6
  t  2  1  1  2  2  3  4  5
  r  3  2  2  1  2  3  4  5
  i  4  3  3  2  2  2  3  4
  n  5  4  4  3  3  3  2  3
  g  6  5  5  4  4  4  3  2
Cell [n,m] contains the edit distance between the 2 words.
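The table can be filled row by row with two nested loops. The sketch below assumes a hypothetical helper `edit_matrix(row_word, col_word)`; it applies the three cases of the recursive definition to each cell.

```python
def edit_matrix(row_word, col_word):
    """Return the full dynamic-programming table as a list of rows."""
    n, m = len(row_word), len(col_word)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i               # distance to the empty prefix
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if row_word[i - 1] == col_word[j - 1]:
                d[i][j] = d[i - 1][j - 1]            # case 1: letters match
                continue
            best = min(d[i - 1][j - 1],              # replacement
                       d[i][j - 1],                  # insertion
                       d[i - 1][j])                  # deletion
            if (i > 1 and j > 1
                    and row_word[i - 2] == col_word[j - 1]
                    and row_word[i - 1] == col_word[j - 2]):
                best = min(best, d[i - 2][j - 2])    # case 2: transposition
            d[i][j] = 1 + best
    return d


table = edit_matrix("string", "sorting")
for row in table:
    print(" ".join(str(v) for v in row))
print("distance:", table[-1][-1])
```

The bottom-right cell holds the distance between the two complete words. The cost is O(n * m) time and space; keeping only the last three rows would reduce the space if the full table is not needed.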
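Dynamic programming: case 1
$x_{i+1} = y_{j+1}$
        s  o  r  t  i  n  g
     0  1  2  3  4  ?  ?  ?
  s  1  0  1  2  3  ?  ?  ?
  t  2  1  1  2  2  ?  ?  ?
  r  ?  ?  ?  ?  ?  ?  ?  ?
  i  ?  ?  ?  ?  ?  ?  ?  ?
  n  ?  ?  ?  ?  ?  ?  ?  ?
  g  ?  ?  ?  ?  ?  ?  ?  ?
Cell being computed: row t (i+1), column t (j+1). The letters match, so the value 2 is copied from the diagonal neighbour.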
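Dynamic programming: case 2
$x_i = y_{j+1}$ and $x_{i+1} = y_j$
        s  o  r  t  i  n  g
     0  1  2  3  4  ?  ?  ?
  s  1  0  1  2  3  ?  ?  ?
  t  2  1  1  2  2  ?  ?  ?
  r  3  2  2  1  2  ?  ?  ?
  i  ?  ?  ?  ?  ?  ?  ?  ?
  n  ?  ?  ?  ?  ?  ?  ?  ?
  g  ?  ?  ?  ?  ?  ?  ?  ?
Cell being computed: row r (i+1), column t (j+1). The pairs "tr" and "rt" are transposed, so the value is 1 + min{2, 1, 2, 1} = 2.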
Dynamic programming: case 3
$x_{i+1} \ne y_{j+1}$ and ($x_i \ne y_{j+1}$ or $x_{i+1} \ne y_j$)
        s  o  r  t  i  n  g
     0  1  2  3  4  ?  ?  ?
  s  1  0  1  2  3  ?  ?  ?
  t  2  1  1  2  2  ?  ?  ?
  r  3  2  2  1  2  ?  ?  ?
  i  4  3  3  2  2  ?  ?  ?
  n  ?  ?  ?  ?  ?  ?  ?  ?
  g  ?  ?  ?  ?  ?  ?  ?  ?
Cell being computed: row i (i+1), column t (j+1). Its value is 1 + min{1, 2, 2} = 2.
String-to-language correction
String-to-language correction: problem definition
• CONTEXT:
– Finite set of symbols (alphabet)
– Elementary edit operations on symbols (as before) with their costs (1 per operation)
– Edit sequences (as before)
– Edit distance (error distance) between words: as before
• INPUT:
– Regular grammar describing words (a finite set of words in particular)
– Incorrect word A (not recognized by the grammar)
– Threshold t
• OUTPUT:
– A set of correct words B_1, B_2, …, B_n whose distance from A stays within t (the nearest neighbors of A)
String-to-language correction: simplistic approach
• METHOD:
– For each word B recognized by the grammar, calculate the edit distance matrix between A and B.
– Propose the candidates whose distance from A does not exceed the threshold t (ed(A,B) ≤ t).
• FEASIBILITY:
– Impossible in the case of infinite languages
• COMPLEXITY: O(n * m * |D|)
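For a finite language given as a word list, the simplistic method amounts to one full distance computation per word. The sketch below is illustrative only: `correct_naive` and the small lexicon are hypothetical, and the distance function is the unit-cost recurrence with transposition used earlier in the talk.

```python
def edit_distance(a, b):
    """Table-based edit distance (insertion, deletion, replacement, transposition)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                d[i][j] = d[i - 1][j - 1]
                continue
            best = min(d[i - 1][j - 1], d[i][j - 1], d[i - 1][j])
            if i > 1 and j > 1 and a[i - 2] == b[j - 1] and a[i - 1] == b[j - 2]:
                best = min(best, d[i - 2][j - 2])
            d[i][j] = 1 + best
    return d[-1][-1]


def correct_naive(word, lexicon, t):
    """Compare `word` with every lexicon entry; keep those within threshold t."""
    candidates = []
    for entry in lexicon:
        dist = edit_distance(word, entry)   # one full matrix per entry
        if dist <= t:
            candidates.append((dist, entry))
    return sorted(candidates)


# Hypothetical lexicon, for illustration only.
lexicon = ["apple", "apples", "apply", "banana"]
print(correct_naive("aply", lexicon, 2))
```

Every entry costs a complete matrix, even when entries share long prefixes, and the loop cannot terminate at all if the language is infinite.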
String-to-language correction: threshold-controlled depth-first exploration of an FSA (Oflazer 1996, …)
String correction with respect to a deterministic FSA (1/4)
Word to be corrected: *aply, threshold 2
[Figure: deterministic FSA with states 1 to 9 and transitions labeled with letters (a, p, l, e, s, y)]
• Each time a transition is followed, a new column is calculated in the edit distance matrix.
• If we get to a final state and the edit distance remains within the threshold, a new candidate has been found.
• The part of the matrix for the prefix appl is calculated only once for all valid words sharing that prefix.
        a  p  p  l  ...  e
     0  1  2  3  4  ...  5
  a  1  0  1  2  3  ...  4
  p  2  1  0  1  2  ...  3
  l  3  2  1  1  1  ...  2
  y  4  3  2  2  2  ...  2
Candidate found: apple
String correction with respect to a deterministic FSA (2/4)
Word to be corrected: *aply, threshold 2
[Figure: the same deterministic FSA; the transition labeled s is followed after apple]
        a  p  p  l  ...  e  s
     0  1  2  3  4  ...  5  6
  a  1  0  1  2  3  ...  4  5
  p  2  1  0  1  2  ...  3  4
  l  3  2  1  1  1  ...  2  3
  y  4  3  2  2  2  ...  2  3
The last cell of the new column is 3, which exceeds the threshold 2.
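The exploration shown on these slides can be sketched with a trie standing in for the deterministic FSA. This is a simplified illustration under stated assumptions, not the algorithm as published: `build_trie`, `correct` and the `FINAL` marker are hypothetical names, the lexicon is invented, and a branch is abandoned as soon as every cell of its newest column exceeds the threshold.

```python
FINAL = ""  # hypothetical marker key: a node containing it is a final state


def build_trie(words):
    """Build a trie (an acyclic deterministic FSA) as nested dictionaries."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[FINAL] = True
    return root


def correct(word, trie, t):
    """Depth-first search of the trie, one matrix column per transition."""
    results = []

    def explore(node, prefix, prev, prev2):
        # prev is the column for `prefix`, prev2 the column before it.
        if node.get(FINAL) and prev[-1] <= t:
            results.append((prefix, prev[-1]))
        for ch, child in node.items():
            if ch == FINAL:
                continue
            col = [prev[0] + 1]
            for k in range(1, len(word) + 1):
                if word[k - 1] == ch:
                    cost = prev[k - 1]
                else:
                    cost = 1 + min(prev[k - 1], prev[k], col[k - 1])
                    if (prev2 is not None and k > 1
                            and word[k - 2] == ch and word[k - 1] == prefix[-1]):
                        cost = min(cost, 1 + prev2[k - 2])   # transposition
                col.append(cost)
            if min(col) <= t:          # otherwise this branch cannot recover
                explore(child, prefix + ch, col, prev)

    explore(trie, "", list(range(len(word) + 1)), None)
    return sorted(results, key=lambda r: (r[1], r[0]))


trie = build_trie(["apple", "apples", "apply", "banana"])
print(correct("aply", trie, 2))
```

The columns computed for the shared prefix appl are reused by both apple and apply, which is the saving the slide points out over the one-matrix-per-word approach.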