A linear-time algorithm for comparing similar ordered trees - Hélène Touzet



slide-1
SLIDE 1

A linear-time algorithm for comparing similar ordered trees

Hélène Touzet

LIFL – University of Lille 1 – France

slide-2
SLIDE 2

Comparison with k errors

◮ Problem :
  Input : two ordered trees (assumed to be similar) and a natural number k
  Output : the best mapping M containing fewer than k errors, if it exists
◮ Error : insertion of a node, deletion of a node
◮ Edit operations : substitution, deletion, insertion
◮ Comparison model : edit distance vs alignment

slide-3
SLIDE 3

How to compare trees: edit operations

[Figure: three panels illustrating substitution, deletion and insertion]

slide-4
SLIDE 4

How to compare trees: comparison model

◮ Edit Distance [Tai 1979, Zhang-Shasha 1989, Klein 1998, Dulucq & Touzet 2003]
  ◮ all mappings are valid
  ◮ largest common subtree

[Figure: example trees with nodes labeled a to f]

◮ Alignment [Jiang et al. 1995]
  ◮ insertions should precede deletions
  ◮ smallest common supertree

[Figure: example trees with nodes labeled a to f]

slide-5
SLIDE 5

Previous results

              Strings   Tree distance            Tree alignment
full mapping  O(n^2)    O(n^4) Zhang-Shasha      O(n^2 d^2) Jiang et al.
                        O(n^3 log(n)) Klein
k-errors      O(kn)                              O(n log(n) d^3 k^2) Jansson-Lingas

n : size of the tree
d : maximal degree of the tree
k : bound on the number of errors - known in advance

slide-6
SLIDE 6

Previous results

              Strings   Tree distance            Tree alignment
full mapping  O(n^2)    O(n^4) Zhang-Shasha      O(n^2 d^2) Jiang et al.
                        O(n^3 log(n)) Klein
k-errors      O(kn)     O(k^3 n)                 O(n log(n) d^3 k^2) Jansson-Lingas

n : size of the tree
d : maximal degree of the tree
k : bound on the number of errors - known in advance

slide-7
SLIDE 7

Edit graph for the string alignment problem

◮ Two-dimensional grid
◮ Three kinds of arcs: deletion, insertion and substitution

[Figure: edit graph and resulting alignment for two example DNA strings]

Time complexity: O(n^2)

slide-8
SLIDE 8

Edit graph for the string alignment problem

◮ Two-dimensional grid
◮ Three kinds of arcs: deletion, insertion and substitution

[Figure: edit graph and resulting alignment for two example DNA strings]

Time complexity: O(n^2)
With k-errors : O(kn)
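The O(kn) bound for strings comes from visiting only the grid cells close to the main diagonal. The block below is a minimal sketch of that idea, not code from the presentation: the function name `banded_edit_distance` and the unit cost for all three operations are assumptions made for illustration.

```python
def banded_edit_distance(s, t, k):
    """Edit distance between s and t if it is at most k, else None.

    Only cells (i, j) with |i - j| <= k are visited, so the work is O(k * n).
    Unit costs for deletion, insertion and substitution are an assumption.
    """
    n, m = len(s), len(t)
    if abs(n - m) > k:
        return None
    inf = float("inf")
    # Row 0 of the grid: j insertions, restricted to the band.
    prev = {j: j for j in range(0, min(m, k) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = i  # i deletions
                continue
            substitution = prev.get(j - 1, inf) + (s[i - 1] != t[j - 1])
            deletion = prev.get(j, inf) + 1
            insertion = cur.get(j - 1, inf) + 1
            cur[j] = min(substitution, deletion, insertion)
        prev = cur
    d = prev.get(m, inf)
    return d if d <= k else None
```

Cells outside the band are treated as unreachable, which is safe because a path that leaves the band needs more than k insertions or deletions.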

slide-9
SLIDE 9

Tree edit graph

◮ Trees as strings : enumerate the nodes in postorder traversal
◮ Supplementary constraints imposed by the tree structure

[Figure: tree edit graph for two example trees with six nodes each, numbered 1 to 6 in postorder]
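Treating a tree as a string only needs the postorder sequence of labels together with the size of each subtree. A minimal sketch follows; the nested `(label, children)` representation and the function name `postorder` are assumptions chosen for illustration.

```python
def postorder(tree):
    """Return (labels, sizes) for a tree given as (label, [children]).

    Node i (1-based, in postorder) has label labels[i - 1] and a subtree
    of sizes[i - 1] nodes, so its subtree occupies positions
    i - sizes[i - 1] + 1 .. i of the sequence.
    """
    labels, sizes = [], []

    def visit(node):
        label, children = node
        total = 1
        for child in children:
            total += visit(child)
        labels.append(label)
        sizes.append(total)
        return total

    visit(tree)
    return labels, sizes
```

The subtree rooted at a node is always a contiguous block ending at that node, which is what lets the edit graph jump over a whole subtree with a single arc.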

slide-10
SLIDE 10

Tree edit graph

◮ Trees as strings : enumerate the nodes in postorder traversal
◮ Supplementary constraints imposed by the tree structure

[Figure: tree edit graph for the two example trees, showing a legal path]

slide-11
SLIDE 11

Tree edit graph

◮ Trees as strings : enumerate the nodes in postorder traversal
◮ Supplementary constraints imposed by the tree structure

[Figure: tree edit graph for the two example trees, showing an illegal path]

slide-14
SLIDE 14

Edit graph for trees

◮ Deletion arcs (horizontal arcs):
  (x, y) → (x − 1, y), labeled by del
◮ Insertion arcs (vertical arcs):
  (x, y) → (x, y − 1), labeled by ins
◮ Substitution arcs :
  (x, y) → (x − size(x), y − size(y)), labeled by the distance between A(x) and B(y)
◮ Size of the graph : O(mn)
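The three kinds of arcs can be enumerated directly from the subtree sizes. This is a minimal sketch under stated assumptions: nodes are numbered from 1 in postorder, `size_a[x - 1]` is the size of the subtree rooted at node x, and the function name `tree_edit_graph_arcs` is hypothetical. Arc weights are left out; only the endpoints and the kind of arc are listed.

```python
def tree_edit_graph_arcs(size_a, size_b):
    """List the arcs of the tree edit graph as (source, target, kind)."""
    n, m = len(size_a), len(size_b)
    arcs = []
    for x in range(n + 1):
        for y in range(m + 1):
            if x >= 1:
                arcs.append(((x, y), (x - 1, y), "del"))
            if y >= 1:
                arcs.append(((x, y), (x, y - 1), "ins"))
            if x >= 1 and y >= 1:
                # A substitution arc skips both subtrees at once.
                target = (x - size_a[x - 1], y - size_b[y - 1])
                arcs.append(((x, y), target, "sub"))
    return arcs
```

Each vertex has at most three outgoing arcs, so the number of arcs stays within O(mn).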

slide-15
SLIDE 15

[Slides 15 to 21, figures only: successive views of the tree edit graph for the two example trees, and so on . . .]

slide-22
SLIDE 22

Usage of the tree edit graph

How to compute the weights of the arcs ?

◮ The label of the substitution arc starting from (x, y) is the weight of an optimal path in the subgraph delimited by A(x) × B(y)

Time complexity : O(n^4)
Space complexity : O(n^2)

How to recover the mapping from the tree edit graph ?

Multi-level tracing back :

◮ Construction of an optimal path for A × B
◮ Iteration for subgraphs induced by matching pairs of nodes

Time complexity : O(n^3)
Space complexity : O(n^2)

slide-23
SLIDE 23

◮ Optimal paths for td(x, y)

h = x − size(x), l = y − size(y)

fd(h, l, h, l) = 0
fd(i, l, h, l) = fd(i − 1, l, h, l) + del
fd(h, j, h, l) = fd(h, j − 1, h, l) + ins
fd(i, j, h, l) = min { fd(i − 1, j, h, l) + del,
                       fd(i, j − 1, h, l) + ins,
                       fd(i − size(i), j − size(j), h, l) + td(i, j) }

◮ For the subtrees

if fd(x − 1, y − 1, h, l) + sub(x, y) < min{fd(x − 1, y, h, l) + del, fd(x, y − 1, h, l) + ins}
then td(x, y) ← fd(x − 1, y − 1, h, l) + sub(x, y)
else td(x, y) ← +∞

◮ This is the Zhang & Shasha algorithm
◮ Klein and Dulucq & Touzet algorithms build the same edit graph, but they use alternative strategies to compute the weights of the arcs.
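The fd / td recurrences can be evaluated directly, one subgraph A(x) × B(y) at a time. The block below is a naive sketch of that evaluation, written for readability rather than speed. The function name, the unit costs for del, ins and label mismatch, and the list-based tree encoding (labels and subtree sizes in postorder) are assumptions, not part of the slides.

```python
import math


def tree_edit_distance(labels_a, size_a, labels_b, size_b, del_cost=1, ins_cost=1):
    """Evaluate the fd/td recurrences for two postorder-numbered trees.

    labels_*[i - 1] and size_*[i - 1] are the label and subtree size of node i.
    """
    n, m = len(labels_a), len(labels_b)
    td = {}  # td[x, y]: weight of the substitution arc leaving (x, y)

    def sub(x, y):
        return 0 if labels_a[x - 1] == labels_b[y - 1] else 1

    def subgraph(x, y):
        # Optimal path weights fd(i, j, h, l) inside A(x) x B(y).
        h, l = x - size_a[x - 1], y - size_b[y - 1]
        fd = {(h, l): 0}
        for i in range(h + 1, x + 1):
            fd[i, l] = fd[i - 1, l] + del_cost
        for j in range(l + 1, y + 1):
            fd[h, j] = fd[h, j - 1] + ins_cost
        for i in range(h + 1, x + 1):
            for j in range(l + 1, y + 1):
                indel = min(fd[i - 1, j] + del_cost, fd[i, j - 1] + ins_cost)
                if (i, j) == (x, y):
                    # Label the arc for the two roots, as in "For the subtrees".
                    match = fd[x - 1, y - 1] + sub(x, y)
                    td[x, y] = match if match < indel else math.inf
                jump = fd[i - size_a[i - 1], j - size_b[j - 1]] + td[i, j]
                fd[i, j] = min(indel, jump)
        return fd[x, y]

    dist = 0
    for x in range(1, n + 1):      # increasing postorder, so every td(i, j)
        for y in range(1, m + 1):  # needed by a subgraph is already known
            dist = subgraph(x, y)
    return dist  # the last pair (n, m) is the pair of roots
```

For example, with A = a(b, c) encoded as labels `["b", "c", "a"]` and sizes `[1, 1, 3]`, and B = a(b) encoded as `["b", "a"]` and `[1, 2]`, the only error is the deletion of c.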

slide-24
SLIDE 24

Edit distance with k errors

◮ Error : insertion of a node, deletion of a node
◮ Problem :
  Input : two ordered trees, a natural number k
  Output : the best mapping containing fewer than k errors (if it exists)
◮ Method : pruning the tree edit graph

slide-25
SLIDE 25

Edit distance with k errors

Idea 1 : the best mappings have their path near the main diagonal

[Figure: tree edit graph for the example trees]

slide-26
SLIDE 26

Edit distance with k errors

Idea 1 : the best mappings have their path near the main diagonal

[Figure: tree edit graph for the example trees, restricted to the k-strip]

k-strip = {(x, y) ; |x − y| ≤ k}

slide-27
SLIDE 27

Edit distance with k errors

Idea 1 : the best mappings have their path near the main diagonal

[Figure: tree edit graph for the example trees, restricted to the k-strip]

k-strip = {(x, y) ; |x − y| ≤ k}
Size of the graph : O(nk)
Computation time for each node : O(size(A, x) k)
O(k^2 size(A, x))

slide-28
SLIDE 28

Edit distance with k errors

Idea 2 : when inspecting the subtree rooted at x, there is no need to visit the nodes of depth > k + 1

[Figure: tree edit graph for the example trees and the subtree rooted at x]

slide-30
SLIDE 30

Edit distance with k errors

Idea 2 : when inspecting the subtree rooted at x, there is no need to visit the nodes of depth > k + 1

[Figure: tree edit graph for the example trees]

A(x, k) = {i ∈ A(x) ; depth(i) − depth(x) ≤ k + 1}
O(nk) pairs of subtrees
O(size(A, x, k) k) for each pair
k^2 size(A, x, k)

slide-31
SLIDE 31

Edit distance with k errors

Idea 2 : when inspecting the subtree rooted at x, there is no need to visit the nodes of depth > k + 1

[Figure: tree edit graph for the example trees]

A(x, k) = {i ∈ A(x) ; depth(i) − depth(x) ≤ k + 1}
Size of the graph : O(nk)
Computation time for each node : O(size(A, x, k) k)
O(k^2 size(A, x, k)) = O(k^3 n)
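Both pruning ideas reduce to simple set definitions over postorder numbers. The sketch below builds the k-strip and the depth-bounded subtree A(x, k). The helper names (`depths`, `k_strip`, `pruned_subtree`) and the list of subtree sizes used as input are assumptions for illustration; the slides do not specify a data structure.

```python
def depths(size):
    """Depth of every node (1-based postorder numbering; index 0 unused)."""
    n = len(size)
    depth = [0] * (n + 1)
    for x in range(n, 0, -1):  # a parent always has a larger number
        c = x - 1              # rightmost child of x, if any
        while c > x - size[x - 1]:
            depth[c] = depth[x] + 1
            c -= size[c - 1]   # jump to the previous sibling
    return depth


def k_strip(n, m, k):
    """Pairs (x, y) of node numbers with |x - y| <= k."""
    return [(x, y)
            for x in range(1, n + 1)
            for y in range(max(1, x - k), min(m, x + k) + 1)]


def pruned_subtree(x, k, size, depth):
    """A(x, k): nodes of A(x) at relative depth at most k + 1."""
    first = x - size[x - 1] + 1
    return [i for i in range(first, x + 1) if depth[i] - depth[x] <= k + 1]
```

A node i can belong to A(x, k) only for its at most k + 2 nearest ancestors-or-self x, which is one way to see where the extra factor k in O(k^3 n) comes from.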

slide-32
SLIDE 32

◮ Tree edit graph for k errors : O(k^3 n)

Input: two trees A and B, positive integer k
Output: tree edit graph

for (x, y) ∈ k-strip(A, B) do                                O(k^2 size(A, x, k)) = O(k^3 n)
    if not k-relevant(x, y) then
        td(x, y) ← +∞
    else
        for i ∈ A(x, k) do                                   O(k size(A, x, k))
            for j ∈ B such that (i, j) ∈ k-strip(A, B) do    O(k)
                compute fd(i, j)                             O(1)
            end do
        end do
        compute td(x, y)                                     O(1)
    end if
end do

◮ Recovering the optimal mapping : O(k^3 n)