Composite Pattern Discovery for PCR Application Stanislav Angelov - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Composite Pattern Discovery for PCR Application

Stanislav Angelov

University of Pennsylvania, USA

Shunsuke Inenaga

Kyushu University, Japan

Japan Society for the Promotion of Science

slide-2
SLIDE 2

Pattern Discovery

[Figure: sample DNA text data, e.g. ACGTTGACGT]

Input: text data. Output: pattern (knowledge, rule).

slide-3
SLIDE 3

Finding Missing Patterns

Input: text T and threshold α.

Output: pattern pair (A, B) satisfying:

  1. the distance between any occurrences of A and B in T is at least α,
  2. |A| = |B|, and
  3. |A| (= |B|) is the shortest possible.

slide-4
SLIDE 4

Finding Missing Patterns [cont.]

[Figure: three placements of A and B in T. Case 1: A and B are α-close. Case 2: non-α-close. Case 3: non-α-close.]

If A and B are non-α-close, (A, B) is said to be a missing pair.
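The definition can be checked directly by brute force. The sketch below assumes, which the slides do not state explicitly, that the distance between two occurrences is the absolute difference of their start positions; `occurrences` and `is_missing_pair` are illustrative names, not part of the original work.

```python
def occurrences(text, pattern):
    """Return all start positions of pattern in text (overlaps allowed)."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def is_missing_pair(text, a, b, alpha):
    """True if every occurrence of a is at distance >= alpha from every occurrence of b.

    Assumption: distance = |start of a - start of b|.
    A pattern that never occurs forms a missing pair with anything.
    """
    occ_a = occurrences(text, a)
    occ_b = occurrences(text, b)
    return all(abs(i - j) >= alpha for i in occ_a for j in occ_b)
```

This check is quadratic in the number of occurrences, so it is only useful for validating the output of a faster algorithm on small inputs.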

slide-5
SLIDE 5

Application – PCR

PCR (Polymerase Chain Reaction)

  • Standard technique to produce many copies of a region of DNA (can be a tiny sample).
  • In medicine, to detect infections.
  • In forensic science, to identify individuals.

slide-6
SLIDE 6

Application – PCR [cont.]

Nested PCR

  • Repeated PCR with nested primers.
  • Achieves ultra-sensitive detection.
  • Good adapter primers for nested PCR bind only to the adapters, and amplify nothing directly from the samples!

slide-7
SLIDE 7

Application – PCR [cont.]

[Figure: sample strand S (5’ to 3’) and its complement S’ (3’ to 5’), with the left and right specific primers and the two adapters]

  • We want a pair of good adapter primers which amplify nothing directly from S or S’.

(Adapter primers are complements to adapters.)

slide-8
SLIDE 8

Application – PCR [cont.]

  • If (A, B) is a missing pair in S and S’, then (A’, B) is not a pair of binding sites for any region of length less than α.

[Figure: strands S and S’ with the left and right specific primers and the two adapters]

slide-9
SLIDE 9

Application – PCR [cont.]

  • So (A’, B) satisfies a necessary condition of being a good adapter primer pair!

[Figure: strands S and S’ with the left and right specific primers and the two adapters]

slide-10
SLIDE 10

Previous Work

  • Inenaga, Kivioja and Mäkinen [WABI’04] proposed a bit-table-based algorithm to find a missing pattern pair of the same length.
  • That work also gave a suffix-tree-based algorithm to solve a generalized problem where the patterns in the pair can be of different lengths.

slide-11
SLIDE 11

Complexity Comparisons

Finding missing pattern pair of same length

algorithm                                            time                  space
our algorithm                                        O(n log log n)        O(n)
bit-table algorithm of Inenaga et al. [WABI’04]      O(n( + log log n))    O(n)

  • σ is the alphabet size.
  • α is typically 5000 (due to the PCR application)!

slide-12
SLIDE 12

Complexity Comparisons [cont.]

Finding missing pattern pair of different length

algorithm                                              time          space
our algorithm                                          O(n log n)    O(n)
suffix tree algorithm A of Inenaga et al. [WABI’04]    O(n^2)        O(n)
suffix tree algorithm B of Inenaga et al. [WABI’04]    O(n log n)    O(n log n)

  • Our algorithm does not need a suffix tree: it is not only faster but also simpler.

slide-13
SLIDE 13

Single Missing Pattern

  • We start with finding a single missing pattern.
  • KEY: There are at most σ^k patterns of length k.

[Figure: text T of length n; a window of length k can start at n-k+1 < n positions]

slide-14
SLIDE 14

Single Missing Pattern [cont.]

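[Figure: text T of length n, which has n-k+1 < n substrings of length k]

  • If all σ^k patterns of length k occur in T, then σ^k ≤ n-k+1 < n, so we have k < log_σ n.
  • If k is the largest integer for which all σ^k patterns of length k exist in T, then there is a missing pattern of length at most ⌈log_σ n⌉.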

slide-15
SLIDE 15

Single Missing Pattern [cont.]

  • Compute a bit table of all patterns of length ⌊log_σ n⌋ using a bijective mapping f from patterns to integers (O(n) time, using e.g. the Karp & Rabin algorithm).

  1) If there exists a missing pattern of length ⌊log_σ n⌋, output it.
  2) Otherwise (all patterns of length ⌊log_σ n⌋ are present in T), there is a missing pattern of length ⌊log_σ n⌋ + 1; compute and output it.
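The two cases above can be sketched as follows. This is a minimal illustration under several assumptions: the alphabet is given explicitly and has at least two symbols, f is taken to be the base-σ integer encoding (updated in rolling fashion, in the spirit of Karp and Rabin), and a Python `set` stands in for the bit table. The names `kmer_codes`, `decode` and `find_missing_pattern` are hypothetical.

```python
def kmer_codes(text, k, alphabet):
    """Yield f(P) for every length-k substring P of text, via a rolling update."""
    sigma = len(alphabet)
    rank = {c: r for r, c in enumerate(alphabet)}
    if k <= 0 or k > len(text):
        return
    top = sigma ** (k - 1)  # weight of the leftmost character
    code = 0
    for c in text[:k]:
        code = code * sigma + rank[c]
    yield code
    for i in range(k, len(text)):
        # Drop text[i - k], shift by one digit, append text[i].
        code = (code - rank[text[i - k]] * top) * sigma + rank[text[i]]
        yield code


def decode(code, k, alphabet):
    """Inverse of f: turn an integer code back into its length-k pattern."""
    sigma = len(alphabet)
    chars = []
    for _ in range(k):
        code, r = divmod(code, sigma)
        chars.append(alphabet[r])
    return "".join(reversed(chars))


def find_missing_pattern(text, alphabet):
    """Return a pattern absent from text, of length k or k + 1,
    where k is the largest integer with sigma**k <= n (k = 1 if n < sigma).
    Assumes len(alphabet) >= 2."""
    sigma, n = len(alphabet), len(text)
    k = 1
    while sigma ** (k + 1) <= n:
        k += 1
    for length in (k, k + 1):
        seen = set(kmer_codes(text, length, alphabet))
        # At most n substrings exist, so an unseen code appears among the first n + 1.
        for code in range(min(sigma ** length, n + 1)):
            if code not in seen:
                return decode(code, length, alphabet)
    return None  # not reached: sigma**(k + 1) > n guarantees a missing pattern
```

Note that the returned pattern is not necessarily the shortest missing one; a shorter pattern may also be absent from the text.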

slide-16
SLIDE 16

Missing Pair of Fixed Length

  • Input: text T, threshold α, pattern lengths a and b.
  • Output: missing pattern pair (A, B) such that |A| = a and |B| = b.
  • Assume w.l.o.g. a > b.
  • We consider the case a < m, where m is the length of the shortest single missing pattern P in T. Otherwise, P can be paired with any pattern of length b.
  • Let Na = σ^a and Nb = σ^b (note n > Na > Nb).

slide-17
SLIDE 17

Missing Pair of Fixed Length [cont.]

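[Figure: text T with occurrences of a length-a pattern A at positions i1, i2, i3; array L indexed from 0 to Na-1, whose entry h holds the list i1, i2, i3]

  • Let f(A) = h.
  • L: array of size Na, where L[h] is the list of occurrences of A in T.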

slide-18
SLIDE 18

Missing Pair of Fixed Length [cont.]

[Figure: text T with a length-b pattern B at position j; array H indexed from 0 to n-b, whose entry j holds h’]

  • Let f(B) = h’.
  • H: array of size n-b+1, where H[j] = h’.
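The two arrays can be built as in the sketch below. It assumes f is the base-σ integer encoding over an explicitly given alphabet, and it encodes each substring from scratch for brevity (a rolling update would avoid the extra factor of the pattern length). `kmer_codes` and `build_tables` are illustrative names.

```python
def kmer_codes(text, k, alphabet):
    """Return [f(T[i:i+k]) for each start i], with f the base-sigma encoding."""
    sigma = len(alphabet)
    rank = {c: r for r, c in enumerate(alphabet)}
    codes = []
    for i in range(len(text) - k + 1):
        code = 0
        for c in text[i:i + k]:
            code = code * sigma + rank[c]
        codes.append(code)
    return codes


def build_tables(text, a, b, alphabet):
    """Build L (occurrence lists per length-a pattern) and H (code of the
    length-b pattern starting at each position)."""
    sigma = len(alphabet)
    L = [[] for _ in range(sigma ** a)]       # size Na = sigma**a
    for i, h in enumerate(kmer_codes(text, a, alphabet)):
        L[h].append(i)                        # positions are appended in increasing order
    H = kmer_codes(text, b, alphabet)         # size n - b + 1
    return L, H
```

For example, with T = "ACGTACGT", a = 2 and b = 1 over the alphabet "ACGT", the entry of L for "AC" is [0, 4] and H simply lists the code of each character.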

slide-19
SLIDE 19

Missing Pair of Fixed Length [cont.]

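[Figure: for the occurrence i1 of A taken from L[h], the positions of T within distance α of i1 are looked up in H; the first pattern found is B1 with f(B1) = h1. M is an array indexed from 0 to Nb-1. CM = 0.]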

slide-20
SLIDE 20

Missing Pair of Fixed Length [cont.]

[Figure: same setting; entry h1 of M is set to 1, and the counter becomes CM = 1.]

slide-21
SLIDE 21

Missing Pair of Fixed Length [cont.]

[Figure: same setting; the next pattern in the window is B2 with f(B2) = h2. Entry h1 of M is 1, and CM = 1.]

slide-22
SLIDE 22

Missing Pair of Fixed Length [cont.]

[Figure: same setting; entry h2 of M is set to 1, and the counter becomes CM = 2.]

slide-23
SLIDE 23

Missing Pair of Fixed Length [cont.]

  • The iteration ends
      • when CM = Nb. In this case, all patterns of length b are α-close to A.
      • or when all positions in L[h] are processed. In this case, scan M and find a missing pattern of length b. The algorithm outputs the missing pair.
  • The algorithm runs in a total of O(n) time and O(n) space.
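The iteration over L, H, M and CM can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the distance between occurrences is taken to be the difference of their start positions, f is the base-σ encoding, and M stores the code of the current A as a stamp instead of a 0/1 flag so that it does not have to be cleared between patterns. Each occurrence of A inspects a window of at most 2α-1 entries of H. All function names are hypothetical.

```python
def kmer_codes(text, k, alphabet):
    """Return [f(T[i:i+k]) for each start i], with f the base-sigma encoding."""
    sigma = len(alphabet)
    rank = {c: r for r, c in enumerate(alphabet)}
    codes = []
    for i in range(len(text) - k + 1):
        code = 0
        for c in text[i:i + k]:
            code = code * sigma + rank[c]
        codes.append(code)
    return codes


def decode(code, k, alphabet):
    """Inverse of f: integer code back to its length-k pattern."""
    sigma = len(alphabet)
    chars = []
    for _ in range(k):
        code, r = divmod(code, sigma)
        chars.append(alphabet[r])
    return "".join(reversed(chars))


def find_missing_pair(text, alpha, a, b, alphabet):
    """Return a missing pair (A, B) with |A| = a and |B| = b, or None."""
    sigma = len(alphabet)
    n_a, n_b = sigma ** a, sigma ** b
    L = [[] for _ in range(n_a)]
    for i, h in enumerate(kmer_codes(text, a, alphabet)):
        L[h].append(i)
    H = kmer_codes(text, b, alphabet)
    M = [-1] * n_b                  # M[h2] == h: pattern h2 was seen alpha-close to A
    for h in range(n_a):
        c_m = 0                     # number of distinct B marked for this A
        for i in L[h]:
            lo = max(0, i - alpha + 1)
            hi = min(len(H) - 1, i + alpha - 1)
            for j in range(lo, hi + 1):   # positions j with |i - j| < alpha
                if M[H[j]] != h:
                    M[H[j]] = h
                    c_m += 1
            if c_m == n_b:          # every pattern of length b is alpha-close to A
                break
        if c_m < n_b:               # scan M for an unmarked pattern B
            h2 = next(x for x in range(n_b) if M[x] != h)
            return decode(h, a, alphabet), decode(h2, b, alphabet)
    return None
```

For T = "ACGTACGT" with a = b = 1, the call with α = 2 returns ("A", "G"), since every A is exactly two positions away from the nearest G; with α = 3 every pair of single characters is α-close and the function returns None.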

slide-24
SLIDE 24

Missing Pair of Same Length [cont.]

  • Monotonicity property: If (A, B) is a missing pair, then for any superstrings C, D of A, B respectively, (C, D) is also a missing pair.
  • By the monotonicity property, we can do a binary search on the length 1, …, ⌈log_σ n⌉ of the patterns using the aforementioned algorithm, and find the shortest missing pair of the same length. It takes O(n log log n) time and O(n) space.
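The binary search can be sketched as follows. To keep the example self-contained, `brute_force_pair` is a deliberately naive stand-in for the fixed-length algorithm (it enumerates all σ^k patterns and is only usable on tiny inputs); any routine with the same signature could be passed as `find_pair`. The upper end of the search range is taken to be the smallest length whose pattern count exceeds n, which guarantees that a missing pair exists there. Distances are again assumed to be measured between start positions, and all names are hypothetical.

```python
from itertools import product


def brute_force_pair(text, alpha, k, alphabet):
    """Naive stand-in: try every pair of length-k patterns, return a missing one."""
    occ = {}
    for i in range(len(text) - k + 1):
        occ.setdefault(text[i:i + k], []).append(i)
    patterns = ["".join(p) for p in product(alphabet, repeat=k)]
    for A in patterns:
        for B in patterns:
            if all(abs(i - j) >= alpha
                   for i in occ.get(A, []) for j in occ.get(B, [])):
                return A, B
    return None


def shortest_missing_pair(text, alpha, alphabet, find_pair=brute_force_pair):
    """Binary search for the smallest k that admits a missing pair of length k."""
    sigma, n = len(alphabet), len(text)
    hi = 1
    while sigma ** hi <= n:
        hi += 1                 # sigma**hi > n: some pattern of length hi is absent
    lo, best = 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        pair = find_pair(text, alpha, mid, alphabet)
        if pair is not None:    # by monotonicity, every longer length also works
            best, hi = pair, mid - 1
        else:
            lo = mid + 1
    return best
```

The search calls `find_pair` for O(log log n) different lengths when σ is constant, because the range of candidate lengths has size about log_σ n.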

slide-25
SLIDE 25

Missing Pair of Different Length

  • It is not hard to extend the algorithm to the case where A and B do not necessarily have the same length.
  • We can find such a missing pair in O(n log n) time and O(n) space.

slide-26
SLIDE 26

Experiments

  • Linux on a 1GHz CPU with 2GB RAM.
  • Implemented in Java. http://www.cis.upenn.edu/~angelov
  • Human genome (2.5GB) from ftp://ftp.ensembl.org/pub/current_human/
  • α = 5000.

slide-27
SLIDE 27

Experiments [cont.]

  • We found 238 pairs of missing patterns of length 8 for the human genome.
  • For the Baker’s yeast genome, the patterns in the shortest missing pairs are also of length 8! [Inenaga et al. WABI’04]
  • There are common missing pairs of patterns of length 8 for the human and yeast genomes.

slide-28
SLIDE 28

Experiments [cont.]

missing pair yeast AB human AB

(AATCGACG,CGATCGGT)

5008 6458

(CCGATCGG,CCGTACGG)

5658 6839

(CGACCGTA,TACGGTCG)

13933 7585

(CGACCGTA,TCGCGTAC)

5494 5345

(CGAGTACG,GTCGATCG)

5903 8090

(CGATCGGA,GCGCGATA)

6432 6619

Missing pattern pairs of length 8 for both the human and the yeast gemones. The reverse complements are also missing
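For reference, the reverse complement of a DNA pattern can be computed as in the short sketch below (`reverse_complement` is an illustrative helper, assuming the alphabet A, C, G, T). Applied to the table, it shows that the third pair consists of a pattern and its own reverse complement, and that both patterns of the second pair are their own reverse complements.

```python
# Watson-Crick base pairing: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("CGACCGTA"))  # TACGGTCG
print(reverse_complement("CCGATCGG"))  # CCGATCGG
```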

slide-29
SLIDE 29

Conclusions

  • We solved the missing pattern pair problem in O(n log log n) time for the same length case, and O(n log n) time for the different length case, both in O(n) space.
  • We also developed an alternative algorithm to solve this problem, and moreover solved extended problems (see the proceedings).