Efficient List-based Computation of the String Subsequence Kernel

SLIDE 1

Introduction String Subsequence Kernels List and Layered Range Tree based Approach Conclusion

Efficient List-based Computation of the String Subsequence Kernel

Slimane Bellaouar (1), Hadda Cherroun (1), Djelloul Ziadi (2)

(1) Laboratoire LIM, Université Amar Telidji, Laghouat, Algérie
(2) Laboratoire LITIS - EA 4108, Université de Rouen, Rouen, France

SLIDE 2

Outline

1. Introduction
2. String Subsequence Kernels
   - Naive Implementation
   - Efficient Implementations
3. List and Layered Range Tree based Approach
   - Suffix Table Representation
   - Location of Points in a Range
   - Fractional Cascading
   - List of lists Building and GWSK Computation
4. Conclusion

SLIDE 3

Introduction

Many machine learning algorithms are designed for linearly separable problems.

SLIDE 4

Introduction

Kernel methods project the data into a high dimensional feature space where linear learning machines can be applied.

SLIDE 5

Introduction

Strings are among the most important data types, and considerable research effort has been devoted to string kernels. All string kernels share the same philosophy: they differ only in how they count the substrings or subsequences common to the two strings.

SLIDE 6

Introduction

Motivation

Computational efficiency is a key property of kernel methods.

SLIDE 7

String Subsequence Kernels

The SSK measures the similarity between two strings based on non-contiguous elements (subsequences). A gap penalty $\lambda \in \,]0, 1]$ is introduced. The feature map is

$$\phi^p_u(s) = \sum_{I : u = s(I)} \lambda^{l(I)}, \quad u \in \Sigma^p.$$

The associated kernel can be written as:

$$K_p(s, t) = \langle \phi^p(s), \phi^p(t) \rangle = \sum_{u \in \Sigma^p} \; \sum_{I : u = s(I)} \; \sum_{J : u = t(J)} \lambda^{l(I) + l(J)}.$$
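To make the definition concrete, here is a minimal brute-force sketch in Python. It is not the authors' code: the name `ssk_bruteforce` is hypothetical, and it assumes the usual convention that l(I) is the span i_p - i_1 + 1 of the index tuple I. It enumerates every index tuple, so it is exponential in p and only suitable for tiny examples.

```python
from itertools import combinations

def ssk_bruteforce(s, t, p, lam):
    """Evaluate K_p(s, t) directly from the feature-map definition."""
    def features(x):
        phi = {}
        # Each increasing index tuple I of length p selects a subsequence u = x(I).
        for idx in combinations(range(len(x)), p):
            u = "".join(x[i] for i in idx)
            span = idx[-1] - idx[0] + 1          # assumed meaning of l(I)
            phi[u] = phi.get(u, 0.0) + lam ** span
        return phi

    phi_s, phi_t = features(s), features(t)
    # Inner product over the subsequences that occur in both strings.
    return sum(v * phi_t[u] for u, v in phi_s.items() if u in phi_t)
```

For example, `ssk_bruteforce("gatta", "cata", 1, 0.5)` evaluates the p = 1 kernel of the two strings used later in the slides.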

SLIDE 9

Naive Implementation

A suffix kernel is defined to assist in the computation of the SSK:

$$\phi^{p,S}_u(s) = \sum_{I \in I^{|s|}_p : u = s(I)} \lambda^{l(I)}, \quad u \in \Sigma^p.$$

The SSK can be expressed in terms of its suffix version as follows:

$$K_p(s, t) = \sum_{i=1}^{|s|} \sum_{j=1}^{|t|} K^S_p(s(1:i), t(1:j)),$$

with $K^S_1(s, t) = [s_{|s|} = t_{|t|}]\, \lambda^2$.

SLIDE 10

Naive Implementation

A recursion has to be devised. The similarity between two strings (sa and tb) is conditioned by their last symbols:

$$K^S_p(sa, tb) = [a = b] \sum_{i=1}^{|s|} \sum_{j=1}^{|t|} \lambda^{2 + |s| - i + |t| - j}\, K^S_{p-1}(s(1:i), t(1:j)).$$

The recursion leads to an $O(p\,|s|^2 |t|^2)$ time complexity.
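The recursion can be sketched directly in Python. This is an illustrative implementation under the definitions above, not the authors' code, and the name `ssk_naive` is hypothetical. For every pair of matching end positions it re-sums the whole previous suffix table, which is where the quadratic factors in the complexity come from.

```python
def ssk_naive(s, t, p, lam):
    """K_p(s, t) via the suffix-kernel recursion, without any DP speed-up."""
    n, m = len(s), len(t)
    # prev[k][l] holds K^S_{q-1}(s(1:k), t(1:l)); row and column 0 are padding.
    prev = [[0.0] * (m + 1) for _ in range(n + 1)]
    for k in range(1, n + 1):
        for l in range(1, m + 1):
            if s[k - 1] == t[l - 1]:
                prev[k][l] = lam ** 2            # base case K^S_1
    for _ in range(2, p + 1):
        cur = [[0.0] * (m + 1) for _ in range(n + 1)]
        for k in range(1, n + 1):
            for l in range(1, m + 1):
                if s[k - 1] != t[l - 1]:
                    continue                      # indicator [a = b] is zero
                # Here sa = s(1:k) and tb = t(1:l), so |s| = k - 1 and |t| = l - 1.
                cur[k][l] = sum(
                    lam ** (2 + (k - 1 - i) + (l - 1 - j)) * prev[i][j]
                    for i in range(1, k)
                    for j in range(1, l)
                )
        prev = cur
    # K_p is the sum of the suffix kernel over all prefix pairs.
    return sum(map(sum, prev))
```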

SLIDE 11

Naive Implementation

Example (computation of the SSK): s = gatta, t = cata and p = 1.

K^S_1 |  g    a    t    t    a
c     |  0    0    0    0    0
a     |  0    λ²   0    0    λ²
t     |  0    0    λ²   λ²   0
a     |  0    λ²   0    0    λ²

K_1(gatta, cata) = 6λ²

SLIDE 13

Efficient Implementations

There exist three efficient approaches to compute the SSK:
- Dynamic Programming Approach
- Trie-based Approach
- Sparse Dynamic Programming Approach

SLIDE 14

Efficient Implementations

Dynamic Programming Approach

The similarity between two strings (sa and tb) is conditioned by their final symbols:

$$K^S_p(sa, tb) = [a = b] \sum_{i=1}^{|s|} \sum_{j=1}^{|t|} \lambda^{2 + |s| - i + |t| - j}\, K^S_{p-1}(s(1:i), t(1:j)).$$

We can consider a separate dynamic programming table:

$$DP_p(k, l) = \sum_{i=1}^{k} \sum_{j=1}^{l} \lambda^{k - i + l - j}\, K^S_{p-1}(s(1:i), t(1:j)).$$
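As a worked step connecting the two formulas: setting k = |s| and l = |t| in the table definition and factoring the constant λ² out of the suffix-kernel recursion gives

```latex
K^S_p(sa, tb)
  = [a = b]\,\lambda^{2} \sum_{i=1}^{|s|} \sum_{j=1}^{|t|}
      \lambda^{|s| - i + |t| - j}\, K^S_{p-1}(s(1:i), t(1:j))
  = [a = b]\,\lambda^{2}\, DP_p(|s|, |t|).
```

So once the table DP_p is available, each suffix-kernel entry costs a single lookup.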

SLIDE 15

Efficient Implementations

Dynamic Programming Approach

$$DP_p(k, l) = \sum_{i=1}^{k} \sum_{j=1}^{l} \lambda^{k - i + l - j}\, K^S_{p-1}(s(1:i), t(1:j)).$$

Computing $DP_p$ from this definition for each (k, l) would be inefficient. We can devise a recursion:

$$DP_p(k, l) = K^S_{p-1}(s(1:k), t(1:l)) + \lambda\, DP_p(k-1, l) + \lambda\, DP_p(k, l-1) - \lambda^2\, DP_p(k-1, l-1).$$

The computation of the SSK then has an $O(p\,|s|\,|t|)$ time complexity.
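A minimal Python sketch of this dynamic programming scheme follows. The name `ssk_dp` is hypothetical and this is not the authors' implementation. It relies on the relation K^S_q(sa, tb) = [a = b] λ² DP_q(|s|, |t|), obtained by comparing the definition of the table with the suffix-kernel recursion, and returns the kernel values for all lengths 1..p.

```python
def ssk_dp(s, t, p, lam):
    """Return [K_1(s, t), ..., K_p(s, t)] in O(p |s| |t|) time."""
    n, m = len(s), len(t)
    # ks[k][l] = K^S_q(s(1:k), t(1:l)); index 0 is padding for empty prefixes.
    ks = [[0.0] * (m + 1) for _ in range(n + 1)]
    for k in range(1, n + 1):
        for l in range(1, m + 1):
            if s[k - 1] == t[l - 1]:
                ks[k][l] = lam ** 2
    kernels = [sum(map(sum, ks))]
    for _ in range(2, p + 1):
        dp = [[0.0] * (m + 1) for _ in range(n + 1)]
        nxt = [[0.0] * (m + 1) for _ in range(n + 1)]
        for k in range(1, n + 1):
            for l in range(1, m + 1):
                # Inclusion-exclusion recursion for the weighted prefix sums.
                dp[k][l] = (ks[k][l] + lam * dp[k - 1][l] + lam * dp[k][l - 1]
                            - lam ** 2 * dp[k - 1][l - 1])
                if s[k - 1] == t[l - 1]:
                    # Next suffix kernel: lambda^2 times the table one step back.
                    nxt[k][l] = lam ** 2 * dp[k - 1][l - 1]
        ks = nxt
        kernels.append(sum(map(sum, ks)))
    return kernels
```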
SLIDE 16

Efficient Implementations

Trie-based Approach

This approach is based on search trees known as tries, introduced by E. Fredkin in 1960. The key idea: the leaves play the role of indices of the feature space indexed by the set $\Sigma^p$. The kernel is evaluated as follows:

$$K_p(s, t) = \sum_{u \in \Sigma^p} \phi^p_u(s)\, \phi^p_u(t) = \sum_{u \in \Sigma^p} \sum_{g_s, g_t} \lambda^{g_s + p}\, |L_s(u, g_s)| \cdot \lambda^{g_t + p}\, |L_t(u, g_t)|.$$

The worst-case time complexity of the algorithm is $O\!\left(\binom{p+m}{m} (|s| + |t|)\right)$.
SLIDE 17

Efficient Implementations

Sparse Dynamic Programming Approach

Observation: Most of the entries of the DP matrix are zero.

Propositions: Two data structures:
- A set of match lists instead of the $K^S_p$ matrix.
- A range sum tree (B-tree) instead of the $DP_p$ matrix.

The time complexity is $O(p\,|L| \log \min(|s|, |t|))$, where $L = \{(i, j) \mid s_i = t_j\}$.

SLIDE 18

List and Layered Range Tree based Approach

Objective: Improve the complexity of the SSK.

Observation 1: The computation of $K^S_p(s, t)$ is required only when $s_{|s|} = t_{|t|}$.

Proposition: Keep only a list of index pairs rather than the whole suffix table, $L(s, t) = \{(i, j) : s_i = t_j\}$.

SLIDE 19

List and Layered Range Tree based Approach

Example

K^S_1 |  g    a    t    t    a
c     |  0    0    0    0    0
a     |  0    λ²   0    0    λ²
t     |  0    0    λ²   λ²   0
a     |  0    λ²   0    0    λ²

L(gatta, cata) = {(2, 2), (5, 2), (3, 3), (4, 3), (2, 4), (5, 4)}.
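The match list in this example can be reproduced with a few lines of Python. The helper name `match_list` is hypothetical. Positions are 1-based as on the slides, and the pairs are emitted row by row (by j, then i) to match the order shown above.

```python
def match_list(s, t):
    """Return L(s, t) = {(i, j) : s_i = t_j} with 1-based positions."""
    return [
        (i, j)
        for j in range(1, len(t) + 1)      # rows of the suffix table (string t)
        for i in range(1, len(s) + 1)      # columns of the suffix table (string s)
        if s[i - 1] == t[j - 1]
    ]
```

Calling `match_list("gatta", "cata")` yields the six pairs listed in the example, one per non-zero cell of the table.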

SLIDE 20

List and Layered Range Tree based Approach

Observation 2: It is not obvious how to compute $K^S_p(s, t)$ efficiently on a list data structure ($O(p\,|L(s, t)|^2)$).

Proposition: The suffix table of $K^S_p(s, t)$ can be represented in a 2-dimensional space.

SLIDE 21

List and Layered Range Tree based Approach

K^S_1 |  g    a    t    t    a
c     |  0    0    0    0    0
a     |  0    λ²   0    0    λ²
t     |  0    0    λ²   λ²   0
a     |  0    λ²   0    0    λ²

SLIDE 22

List and Layered Range Tree based Approach

⇒ The computation of $K^S_p(s, t)$ can be interpreted as orthogonal range queries.

Several data structures from computational geometry can be used:
- Kd-tree: the time cost is $O(p(|L| \sqrt{|L|} + K))$, where K is the total number of reported points.
- Range tree: better query time for rectangular range queries.

SLIDE 24

Suffix Table Representation

L(gatta, cata) = {(2, 2), (5, 2), (3, 3), (4, 3), (2, 4), (5, 4)}.

SLIDE 26

Suffix Table Representation

Construction of the range tree

Problem: Points have the same x- or y-coordinates.

Solution: Lexicographic order. Replace the real numbers by a composite-number space:

(x|y) < (x′|y′) ⇔ x < x′ ∨ (x = x′ ∧ y < y′).
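In Python, tuples already compare lexicographically, so the composite numbers can be sketched as plain tuples. The helper names `x_key`, `y_key` and `in_x_range` are hypothetical, and the query bounds (x1|-∞) and (x2|+∞) are modelled with floating-point infinities.

```python
def x_key(point):
    """Composite number (x|y): order by x, break ties with y."""
    x, y = point
    return (x, y)

def y_key(point):
    """Composite number (y|x): order by y, break ties with x."""
    x, y = point
    return (y, x)

def in_x_range(point, x1, x2):
    """True if (x1|-inf) <= (x|y) <= (x2|+inf), i.e. x1 <= x <= x2."""
    return (x1, float("-inf")) <= x_key(point) <= (x2, float("inf"))

points = [(2, 2), (5, 2), (3, 3), (4, 3), (2, 4), (5, 4)]
by_x = sorted(points, key=x_key)   # total order even though x values repeat
by_y = sorted(points, key=y_key)   # total order even though y values repeat
```

With these keys no two distinct points compare equal, which is the property the range tree construction needs.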

SLIDE 27

Suffix Table Representation

Complexity of range tree construction

Let s and t be two strings and $L(s, t) = \{(i, j) : s_i = t_j\}$ the match list associated with the suffix version of the SSK.
- Space: $O(|L| \log |L|)$
- Construction time: $O(|L| \log |L|)$

SLIDE 29

Location of Points in a Range

Recall that the computation of

$$K^S_p(sa, tb) = [a = b] \sum_{i=1}^{|s|} \sum_{j=1}^{|t|} \lambda^{2 + |s| - i + |t| - j}\, K^S_{p-1}(s(1:i), t(1:j))$$

can be interpreted as the evaluation of 2-dimensional range queries applied to a 2-dimensional range tree.

SLIDE 30

Location of Points in a Range

[x1 : x2] × [y1 : y2] is a 2-dimensional range query.
- First, ask for the points whose x-coordinates lie in [x1 : x2]: select a collection of subtrees that contains exactly the points lying in the x-range.
- Then, among those points, consider the ones that fall in the y-range [y1 : y2].
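The two-step query can be sketched with a small 2-dimensional range tree in Python. This is an illustrative sketch, not the authors' structure: the names `Node`, `build` and `range_query` are hypothetical, there is no fractional cascading, and each node re-sorts its points by y for simplicity, so construction is slower than the optimal bound.

```python
from bisect import bisect_left, bisect_right

class Node:
    """Range-tree node over points already sorted by the composite key (x|y)."""
    def __init__(self, pts):
        self.xmin, self.xmax = pts[0][0], pts[-1][0]
        # Associated structure: this subtree's points sorted by (y|x).
        self.by_y = sorted(pts, key=lambda q: (q[1], q[0]))
        self.ys = [q[1] for q in self.by_y]
        self.left = self.right = None
        if len(pts) > 1:
            mid = len(pts) // 2
            self.left, self.right = Node(pts[:mid]), Node(pts[mid:])

def build(points):
    pts = sorted(points)
    return Node(pts) if pts else None

def range_query(node, x1, x2, y1, y2):
    """Report the points with x1 <= x <= x2 and y1 <= y <= y2."""
    if node is None or node.xmax < x1 or node.xmin > x2:
        return []                                  # subtree outside the x-range
    if x1 <= node.xmin and node.xmax <= x2:
        # Subtree entirely inside the x-range: binary search on y only.
        lo = bisect_left(node.ys, y1)
        hi = bisect_right(node.ys, y2)
        return node.by_y[lo:hi]
    return (range_query(node.left, x1, x2, y1, y2)
            + range_query(node.right, x1, x2, y1, y2))
```

For the match list of gatta and cata, `range_query(build(points), 0, 4, 0, 3)` reports the matches strictly before the entry (5, 4) in both coordinates.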

SLIDE 31

Location of Points in a Range

Complexity: A range query can be performed in $O(\log^2 |L| + k)$ time, where k is the number of points that are in the range.

Improvement: Enhance the 2-dimensional range tree with the fractional cascading technique.

SLIDE 33

Fractional Cascading

Observations:
- A rectangular range query searches the same range [y1 : y2] in the associated structures y-RT.
- There exists an inclusion relation between these associated structures.

Goal of fractional cascading: Execute the binary search only once and use the result to speed up the other searches, without expanding the storage by more than a constant factor.

SLIDE 35

Fractional Cascading

Algorithm 1 Fractional Cascading Technique
  {A(v)[i] stores a point with y-coordinate y_i}
  if there exists a key (y-coordinate) in A(left(v)) larger than or equal to y_i then
    store a pointer to the entry of A(left(v)) with the smallest key (y-coordinate) larger than or equal to y_i
  else
    the pointer is nil
  end if
  {The pointer into A(right(v)) is defined in the same way.}
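The pointer construction of Algorithm 1 can be sketched in Python as a single merge pass over two sorted arrays. The function name `cascade_pointers` is hypothetical. It assumes both arrays hold y-coordinates in non-decreasing order and uses `None` in place of nil.

```python
def cascade_pointers(parent_ys, child_ys):
    """For each parent entry, index of the smallest child key >= it, or None."""
    pointers = []
    j = 0
    for y in parent_ys:
        # Both arrays are sorted, so j never needs to move backwards.
        while j < len(child_ys) and child_ys[j] < y:
            j += 1
        pointers.append(j if j < len(child_ys) else None)
    return pointers
```

After one binary search for y1 in the root array, following these pointers gives the starting position in a child array in constant time, which is how the repeated binary searches are avoided.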

SLIDE 36

Fractional Cascading

Complexity of the layered range tree construction

Let s and t be two strings and $L(s, t) = \{(i, j) : s_i = t_j\}$ the match list associated with the suffix version of the SSK.
- Space: $O(|L| \log |L|)$
- Construction time: $O(|L| \log |L|)$

Complexity of SSK computation

The SSK of length p can be computed in $O(p(|L| \log |L| + K))$, where K is the total number of reported points over all the entries of L.

SLIDE 38

List of lists Building

L(gatta, cata) = {(2, 2), (5, 2), (3, 3), (4, 3), (2, 4), (5, 4)}.

SLIDE 39

List of lists Building

Observations:
- In our case, the point coordinates in the plane remain unchanged during the entire processing.
- The computation of $K^S_p(sa, tb)$ is recursive:

$$K^S_p(sa, tb) = [a = b] \sum_{i=1}^{|s|} \sum_{j=1}^{|t|} \lambda^{2 + |s| - i + |t| - j}\, K^S_{p-1}(s(1:i), t(1:j)).$$

SLIDE 40

List of lists Building

Multiple invocations of the 2-D range query can be replaced by only one computation. The match list is extended to be a list of lists.

Figure: List of lists inherent to $K^S_1$(gatta, cata).

SLIDE 41

List of lists Building

Algorithm 2 List of Lists Creation
Require: match list L(s, t) and Layered Range Tree RT
  for each entry (k, l) ∈ L(s, t) do
    {Preparing the range query}
    x1 ← 0; y1 ← 0; x2 ← k − 1; y2 ← l − 1
    relatedpoints ← 2D-RANGE-QUERY(RT, [(x1|−∞) : (x2|+∞)] × [(y1|−∞) : (y2|+∞)])
    while there exists (i, j) ∈ relatedpoints do
      add (i, j) to (k, l)-list
    end while
  end for
Ensure: List of Lists LL(s, t): the match list augmented with lists containing reported points

SLIDE 42

Algorithm 3 SSK computation
Require: List of Lists LL(s, t), length p and coefficient λ
  for q = 1 : p do
    K(q) ← 0; KPS(1 : |max|) ← 0
    for each entry (k, l) ∈ LL(s, t) do
      for each entry r ∈ (k, l)-list do
        (i, j) ← r.Key; KPS(i, j) ← r.Value
        KPS(k, l) ← KPS(k, l) + λ^(k−i+l−j) KPS(i, j)
      end for
      K(q) ← K(q) + KPS(k, l)
    end for
    {Preparing LL(s, t) for the next computation}
    for each entry (k, l) ∈ KPS do
      Update LL(k, l) with KPS(k, l)
    end for
  end for
Ensure: Kernel values K_q(s, t) = K(q) : q = 1, . . . , p
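The list-of-lists scheme can be sketched end to end in Python. This is a hedged illustration rather than the authors' implementation: the names `match_list`, `list_of_lists` and `ssk_list` are hypothetical, and a plain quadratic scan stands in for the layered range tree query, so only the kernel computation loop (cost proportional to pK) reflects the proposed method.

```python
def match_list(s, t):
    """L(s, t) with 1-based positions, ordered by j then i."""
    return [(i, j)
            for j in range(1, len(t) + 1)
            for i in range(1, len(s) + 1)
            if s[i - 1] == t[j - 1]]

def list_of_lists(L):
    """For each (k, l), the matches (i, j) with i <= k - 1 and j <= l - 1.

    Stand-in for the 2-D range query; a range tree would avoid the full scan.
    """
    return {(k, l): [(i, j) for (i, j) in L if i < k and j < l] for (k, l) in L}

def ssk_list(s, t, p, lam):
    """Return [K_1(s, t), ..., K_p(s, t)] using only the matching positions."""
    L = match_list(s, t)
    LL = list_of_lists(L)                     # computed once, reused for every q
    ks = {pt: lam ** 2 for pt in L}           # K^S_1 on the match list
    kernels = [sum(ks.values())]
    for _ in range(2, p + 1):
        # The right-hand side reads the previous ks before it is rebound.
        ks = {(k, l): sum(lam ** (k - i + l - j) * ks[(i, j)]
                          for (i, j) in LL[(k, l)])
              for (k, l) in L}
        kernels.append(sum(ks.values()))
    return kernels
```

Because the point coordinates never change, the lists are built once and each further subsequence length only re-weights the stored values.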

SLIDE 43

Conclusion

L(gatta, cata) = {(2, 2), (5, 2), (3, 3), (4, 3), (2, 4), (5, 4)}.

SLIDE 44

Conclusion

Figure: List of lists inherent to $K^S_1$(gatta, cata).

SLIDE 45

Conclusion

Complexity
- Time: $O(|L| \log |L| + pK)$.
- Space: $O(|L| \log |L| + K)$.

The cost of the naive implementation of the list version is $O(p\,|L|^2)$, so the proposed approach clearly improves the time complexity.

SLIDE 46

Discussion

The proposed algorithm is output sensitive: an empirical analysis is needed to compare our contribution with other approaches.

Comparison
- DP: faster when the DP table is nearly full (short strings; long strings and small alphabet).
- Trie-based: medium-sized alphabets.
- Sparse DP: large-sized alphabets. Computation time: $O(p\,|L| \log \min(|s|, |t|))$.
- List & LRT: large-sized alphabets. Computation time: $O(pK)$.

SLIDE 47

Discussion

Separating the location of the required data from the computation itself gives $O(|L| \log |L| + pK)$.
- This limits the impact of the length of the SSK on the computation.
- It can be favorable if we assume that the problem is multi-dimensional.

SLIDE 48

Discussion

Implementation: much of the programming effort is supported by well-studied, ready-to-use computational geometry algorithms.

The emphasis shifts to a variant of string kernel computations that can be easily adapted.

SLIDE 49

Thank You

Questions please