Universal Sequence Maps of Arbitrary Discrete Sequences By Almeida and Vigna



SLIDE 1

Universal Sequence Maps of Arbitrary Discrete Sequences

By Almeida and Vigna
Presented by Chris Standish (chriss@cs.tamu.edu)

SLIDE 2

1 Motivation

  • Sequence alignment techniques assume that contiguity is conserved between homologous segments.
  • This assumption is violated by genetic processes such as:
      - recombination: the exchange of regions of the genome between paired homologous chromosomes during meiosis (crossover).
      - genome shuffling: an experimental technique that allows recombination between multiple parents, with the goal of improving individual genes.
  • Alignment-free techniques attempt to overcome this limitation.

The alignment-free technique we will discuss is called Universal Sequence Maps (USM) [1]. It is based on a sequence representation technique called “Chaos Game Representation” [2].

SLIDE 3

2 Chaos Game Representation

  • Chaos Game Representation (CGR) maps a nucleotide sequence into a continuous two-dimensional space on the unit square, CGR-space.
  • CGR has the property that each unique genomic sequence $S = s^{(1)} s^{(2)} \ldots s^{(i)} \ldots$, of any length, has a unique position in CGR-space.
  • The iterative function used in CGR is defined by [3]:

$$\mathrm{CGR}_j(s^{(0)}) = \tfrac{1}{2}$$

$$\mathrm{CGR}_j(s^{(i)}) = \mathrm{CGR}_j(s^{(i-1)}) + \tfrac{1}{2}\left(u_j^{(i)} - \mathrm{CGR}_j(s^{(i-1)})\right)$$

where $u_j^{(i)}$ is the $j$th bit of the binary encoding for sequence symbol $s^{(i)}$, and $1 \le j \le 2$.

  • This mapping is a bijection and so it is invertible.

SLIDE 4

3 CGR Binary Encoding

  • Each unique symbol in the genomic sequence is encoded in binary.
  • For CGR the binary encoding is defined as:

Unit  Code
A     00
T     01
C     10
G     11
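The CGR iteration and the encoding table can be combined into a minimal Python sketch. The names `CGR_CODE` and `cgr_points` are illustrative and do not come from the paper, and the sketch assumes the first code bit drives the first coordinate.

```python
# Binary encoding from the slide: each nucleotide is a corner of the unit square.
CGR_CODE = {"A": (0, 0), "T": (0, 1), "C": (1, 0), "G": (1, 1)}

def cgr_points(seq, start=(0.5, 0.5)):
    """Return the CGR coordinates reached after each symbol of seq."""
    x, y = start
    points = []
    for symbol in seq:
        ux, uy = CGR_CODE[symbol]
        # Move halfway from the current point toward the symbol's corner.
        x = x + 0.5 * (ux - x)
        y = y + 0.5 * (uy - y)
        points.append((x, y))
    return points

points = cgr_points("ATGCGAGATGT")
print(points[0])  # (0.25, 0.25): halfway from the center toward corner A
```

Every step halves the distance to the target corner, so the most recent symbol always decides which quadrant the point lies in.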

SLIDE 5

Figure 1: CGR Representation of ATGCGAGATGT.

SLIDE 6

Figure 2: Using quadrants we can recover the original sequence.
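Quadrant-based recovery can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `cgr_encode` and `cgr_decode` are hypothetical helper names, and the start point is fixed at the center of the square.

```python
CGR_CODE = {"A": (0, 0), "T": (0, 1), "C": (1, 0), "G": (1, 1)}
CORNER_TO_SYMBOL = {corner: symbol for symbol, corner in CGR_CODE.items()}

def cgr_encode(seq, start=(0.5, 0.5)):
    """Return the final CGR point for seq."""
    x, y = start
    for symbol in seq:
        ux, uy = CGR_CODE[symbol]
        x, y = 0.5 * (x + ux), 0.5 * (y + uy)
    return x, y

def cgr_decode(point, length):
    """Recover the last `length` symbols from a CGR point, quadrant by quadrant."""
    x, y = point
    symbols = []
    for _ in range(length):
        # The quadrant containing the point identifies the most recent symbol.
        ux, uy = int(x >= 0.5), int(y >= 0.5)
        symbols.append(CORNER_TO_SYMBOL[(ux, uy)])
        # Undo one halving step to expose the previous point.
        x, y = 2 * x - ux, 2 * y - uy
    return "".join(reversed(symbols))

sequence = "ATGCGAGATGT"
recovered = cgr_decode(cgr_encode(sequence), len(sequence))
print(recovered)  # ATGCGAGATGT
```

The decoder reads symbols from last to first, which is why the collected list is reversed before it is returned.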

SLIDE 7

4 Universal Sequence Maps

  • USM generalizes the 2-D CGR-space to an n-dimensional space, where n depends on the number of unique symbols in the sequence.
  • In addition to the “forward” map of CGR, there is also a “backward” map.
  • Together, the “forward” and “backward” maps allow us to estimate the length of a similar segment.

SLIDE 8

4.1 Binary Encoding of Unique Symbols

  • As a concrete example, we let these two stanzas of a poem be our two sequences.
      I am a poet. I am very fond of bananas.
      I am of very fond bananas. Am I a poet?
  • There are nineteen unique symbols, so we encode each symbol with a 5-bit binary code, where $5 = \lceil \log_2(19) \rceil$.
  • Each unique symbol is placed in a unique corner of the 5-dimensional unit hypercube.

SLIDE 9

4.2 USM Binary Code

Unit     Code
(space)  00000
.        00001
?        00010
A        00011
I        00100
a        00101
b        00110
d        00111
e        01000
f        01001
m        01010
n        01011
o        01100
p        01101
r        01110
s        01111
t        10000
v        10001
y        10010
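The code table can be rebuilt from the two stanzas with a short Python sketch. It assumes codes are assigned in sorted character order, which happens to reproduce the table on this slide; the variable names are illustrative.

```python
import math

stanza_1 = "I am a poet. I am very fond of bananas."
stanza_2 = "I am of very fond bananas. Am I a poet?"

# Step 1: the unique symbols across both sequences (space and punctuation count).
symbols = sorted(set(stanza_1 + stanza_2))

# Step 2: the number of bits, which is the dimension of the unit hypercube.
n = math.ceil(math.log2(len(symbols)))

# Step 3: one hypercube corner (an n-bit code) per symbol.
code = {symbol: format(index, f"0{n}b") for index, symbol in enumerate(symbols)}

print(len(symbols), n)  # 19 5
print(code["a"])        # 00101
```

Any one-to-one assignment of symbols to corners would work; the sorted order is used here only because it matches the slide.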

SLIDE 10

4.3 Forward Map

  • The forward map defines a position in an n-dimensional unit hypercube for each prefix $S_i = s^{(1)} s^{(2)} \ldots s^{(i)}$ of sequence $S = s^{(1)} s^{(2)} \ldots s^{(k)}$.
  • For each coordinate $j = 1, \ldots, n$ of n-space, the $\mathrm{USM}_j$ coordinates for each prefix $S_i$ are determined as follows:

$$\mathrm{USM}_j(s^{(0)}) = \mathrm{Unif}([0, 1])$$

$$\mathrm{USM}_j(s^{(i)}) = \mathrm{USM}_j(s^{(i-1)}) + \tfrac{1}{2}\left(u_j^{(i)} - \mathrm{USM}_j(s^{(i-1)})\right) = \tfrac{1}{2}\,\mathrm{USM}_j(s^{(i-1)}) + \tfrac{1}{2}\,u_j^{(i)}$$

where $u_j^{(i)} \in \{0, 1\}$ and $1 \le i \le k$, $1 \le j \le n$.

  • Notice the above formulas use the binary encoding $u^{(i)}$ of $s^{(i)}$.
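A minimal Python sketch of the forward map follows. The function name `usm_forward`, the seeded random generator, and the 2-bit example code are assumptions made for illustration.

```python
import random

def usm_forward(seq, code, rng=None):
    """Forward USM coordinates for every prefix of seq.

    code maps each symbol to a bit string such as "0101".
    """
    rng = rng or random.Random(0)  # seeded so the sketch is reproducible
    n = len(next(iter(code.values())))
    # USM_j(s(0)) is drawn uniformly from [0, 1) in every dimension.
    point = [rng.random() for _ in range(n)]
    coords = []
    for symbol in seq:
        bits = [int(b) for b in code[symbol]]
        # USM_j(s(i)) = 1/2 * USM_j(s(i-1)) + 1/2 * u_j(i)
        point = [0.5 * x + 0.5 * u for x, u in zip(point, bits)]
        coords.append(point)
    return coords

example_code = {"A": "00", "T": "01", "C": "10", "G": "11"}
coords = usm_forward("ATG", example_code)
# Thresholding the last point at 1/2 gives back the code of the last symbol.
last_bits = [int(x >= 0.5) for x in coords[-1]]
print(last_bits)  # [1, 1]
```

Whatever the random start, the leading bit of each coordinate is fixed by the most recent symbol, the next bit by the symbol before it, and so on.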

SLIDE 11

4.4 Backward Map

  • The backward map defines another n-dimensional unit hypercube.
  • For each coordinate $j = 1, \ldots, n$ of n-space, the $\mathrm{USM}_{n+j}$ coordinates for each suffix $S_i$ are determined as follows:

$$\mathrm{USM}_{n+j}(s^{(k+1)}) = \mathrm{Unif}([0, 1])$$

$$\mathrm{USM}_{n+j}(s^{(i)}) = \tfrac{1}{2}\,\mathrm{USM}_{n+j}(s^{(i+1)}) + \tfrac{1}{2}\,u_j^{(i)}$$

where $i = k, k-1, \ldots, 1$.

  • Together the forward and backward maps map each position of a sequence to a point in 2n-dimensional space.
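One way to sketch the backward map in Python is to run the same halving iteration over the reversed sequence and then reverse the result. The helper names `usm_pass` and `usm_coordinates` are hypothetical, and the seeded generator is an assumption for reproducibility.

```python
import random

def usm_pass(seq, code, rng):
    """One halving pass over seq, starting from a uniform random point."""
    n = len(next(iter(code.values())))
    point = [rng.random() for _ in range(n)]
    coords = []
    for symbol in seq:
        point = [0.5 * x + 0.5 * int(b) for x, b in zip(point, code[symbol])]
        coords.append(point)
    return coords

def usm_coordinates(seq, code, seed=0):
    """Return one 2n-dimensional point (forward + backward) per position."""
    rng = random.Random(seed)
    forward = usm_pass(seq, code, rng)
    # The backward map runs from s(k) down to s(1): iterate over the
    # reversed sequence, then restore the original position order.
    backward = usm_pass(seq[::-1], code, rng)[::-1]
    return [f + b for f, b in zip(forward, backward)]

example_code = {"A": "00", "T": "01", "C": "10", "G": "11"}
points = usm_coordinates("ATGC", example_code)
print(len(points), len(points[0]))  # 4 4
```

At every position both halves of the point threshold to the code of the symbol at that position; they differ in which neighbors (preceding or succeeding) fill the lower-order bits.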

SLIDE 12

5 Sequence Similarity

  • The distance between two sequences in USM-space estimates sequence similarity.
  • The “bi-directional distance” measure D between two sequences $A = a^{(1)} a^{(2)} \ldots a^{(r)} \ldots$ and $B = b^{(1)} b^{(2)} \ldots b^{(s)} \ldots$ is defined as:

$$D(a^{(r)}, b^{(s)}) = d_f(a^{(r)}, b^{(s)}) + d_b(a^{(r)}, b^{(s)})$$

  • The “forward distance” is defined as:

$$d_f(a^{(r)}, b^{(s)}) = -\log_2\left(\max_{1 \le j \le n} \left|\mathrm{USM}_j(a^{(r)}) - \mathrm{USM}_j(b^{(s)})\right|\right)$$

  • The forward distance measures the number of similar contiguous symbols preceding $a^{(r)}$.

SLIDE 13

  • The “backward distance” is defined as:

$$d_b(a^{(r)}, b^{(s)}) = -\log_2\left(\max_{1 \le j \le n} \left|\mathrm{USM}_{n+j}(a^{(r)}) - \mathrm{USM}_{n+j}(b^{(s)})\right|\right)$$

  • The backward distance measures the number of similar contiguous symbols succeeding $b^{(s)}$.
  • Both distance measures $d_f$ and $d_b$ overestimate the true number of similar contiguous symbols, so D overestimates the total number of similar contiguous symbols.
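The overestimate can be seen in a small worked example. This is a sketch under assumptions: it uses the 2-bit CGR code and a fixed start of 1/2 in place of the random start so that the arithmetic is exact, and the helper names are illustrative.

```python
import math

CODE = {"A": (0, 0), "T": (0, 1), "C": (1, 0), "G": (1, 1)}

def forward_point(seq, start=0.5):
    """Forward USM coordinates of the last symbol of seq (fixed start, an assumption)."""
    point = [start] * 2
    for symbol in seq:
        point = [0.5 * x + 0.5 * u for x, u in zip(point, CODE[symbol])]
    return point

def forward_distance(p, q):
    """d_f = -log2 of the largest coordinate-wise gap between two points."""
    gap = max(abs(x - y) for x, y in zip(p, q))
    return math.inf if gap == 0 else -math.log2(gap)

d = forward_distance(forward_point("CATG"), forward_point("TATG"))
print(d)  # 4.0, although only the last 3 symbols (ATG) agree
```

Each shared symbol halves the gap, adding exactly 1 to the distance; the leftover gap from the first mismatch is below 1, which is where the extra amount comes from.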

SLIDE 14

Algorithm 5.1 USM Compare
Input: Two sequences
Output: A matrix D of bi-directional distance values
1. Identify the unique symbols in the input sequences
2. Find the dimension n of the unit hypercube
3. Map each unique symbol to a unique corner of the unit hypercube
4. Iteratively generate forward USM coordinates for each input sequence
5. Iteratively generate backward USM coordinates for each input sequence
6. Find the bi-directional distance matrix D
7. Return D
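The seven steps can be sketched end to end in Python. This is an illustrative implementation, not the authors' code: `usm_compare` is a hypothetical name, corners are assigned in sorted symbol order, and the random starts are seeded for reproducibility.

```python
import math
import random

def usm_compare(seq_a, seq_b, seed=0):
    """Return the matrix D of bi-directional USM distances (floats)."""
    # Steps 1-3: unique symbols, hypercube dimension, one corner per symbol.
    symbols = sorted(set(seq_a + seq_b))
    n = max(1, math.ceil(math.log2(len(symbols))))
    code = {s: [int(b) for b in format(i, f"0{n}b")] for i, s in enumerate(symbols)}
    rng = random.Random(seed)

    def one_pass(seq):
        point = [rng.random() for _ in range(n)]
        out = []
        for s in seq:
            point = [0.5 * x + 0.5 * u for x, u in zip(point, code[s])]
            out.append(point)
        return out

    def dist(p, q):
        gap = max(abs(x - y) for x, y in zip(p, q))
        return -math.log2(gap) if gap > 0 else math.inf

    # Step 4: forward coordinates for each sequence.
    fwd_a, fwd_b = one_pass(seq_a), one_pass(seq_b)
    # Step 5: backward coordinates, computed on the reversed sequences.
    bwd_a = one_pass(seq_a[::-1])[::-1]
    bwd_b = one_pass(seq_b[::-1])[::-1]
    # Steps 6-7: D[r][s] = d_f + d_b for every pair of positions.
    return [[dist(fwd_a[r], fwd_b[s]) + dist(bwd_a[r], bwd_b[s])
             for s in range(len(seq_b))]
            for r in range(len(seq_a))]

D = usm_compare("I am a poet.", "Am I a poet?")
print(len(D), len(D[0]))  # 12 12
```

Both inputs contain the segment " a poet" at the same offsets, so entries of D along that diagonal are large, and they are at least as large as the true segment length because each one-directional distance overestimates.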

SLIDE 15

6 Recovering the Sequence

  • In theory we can recover the original sequence from a given point in USM-space, since the forward and backward mappings are bijections, i.e., each has an inverse.
  • In practice we are limited by the precision of the machine.
  • As an example, the USM coordinates of the 16th character ’a’ are:

      forward:  [0.0156, 0.0138, 0.6314, 0.0001, 0.5338]
      backward: [0.0703, 0.3004, 0.5169, 0.2742, 0.5652]

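SLIDE 16

  • We recover the original sequence by inverting the corresponding map. For the inverse forward map:

$$\mathrm{USM}_j(a^{(15)}) = 2\,\mathrm{USM}_j(a^{(16)}) - u_j^{(16)}$$

$$2 \begin{pmatrix} .0156 \\ .0138 \\ .6314 \\ .0001 \\ .5338 \end{pmatrix} - \begin{pmatrix} 0 \\ 0 \\ 1 \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} .0312 \\ .0276 \\ .2628 \\ .0002 \\ .0676 \end{pmatrix} \rightarrow \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}$$

Here $u^{(16)} = (0, 0, 1, 0, 1)$ is the code for ’a’, and the recovered code 00000 for the 15th symbol is a space.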
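The inversion step can be checked numerically with the forward coordinates quoted on the previous slide. A minimal sketch, where `invert_forward_step` is an illustrative name:

```python
def invert_forward_step(point):
    """Read off the current symbol's bits and step back to the previous point."""
    # Each coordinate is >= 1/2 exactly when the corresponding code bit is 1.
    bits = [int(x >= 0.5) for x in point]
    # Invert USM_j(s(i)) = 1/2 * USM_j(s(i-1)) + 1/2 * u_j(i).
    previous = [2 * x - u for x, u in zip(point, bits)]
    return bits, previous

point_16 = [0.0156, 0.0138, 0.6314, 0.0001, 0.5338]
bits_16, point_15 = invert_forward_step(point_16)
bits_15, _ = invert_forward_step(point_15)

print(bits_16)  # [0, 0, 1, 0, 1] -> 'a'
print(bits_15)  # [0, 0, 0, 0, 0] -> space
```

Each inversion doubles the coordinates, so it also doubles any rounding error; this is why the recoverable length is bounded by the machine's precision.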

SLIDE 17

Inverse forward map        Inverse backward map
Code   Unit                Code   Unit
00100  I                   00101  a
00000  (space)             01010  m
00101  a                   00000  (space)
01010  m                   10001  v
00000  (space)             01000  e
00101  a                   01110  r
00000  (space)             10010  y
01101  p                   00000  (space)
01100  o                   01001  f
01000  e                   01100  o
10000  t                   01011  n
00001  .                   00111  d
00000  (space)             00000  (space)
00100  I                   01100  o
00000  (space)             01001  f
00101  a                   00000  (space)

SLIDE 18

Figure 3: Distance matrices D for USM (left) versus bUSM on the two stanzas. Brighter regions indicate larger distance values.

SLIDE 19

Figure 4: Distance matrices D for a 100 nucleotide mRNA using both USM (left) and bUSM.

SLIDE 20

7 Boolean USM

There are some problems with USMs:

  • The length of the sequence that can be recovered is limited by the machine's precision.
  • The distance measure D overestimates the true number of similar symbols in a similar segment.

The boolean USM (bUSM) procedure [4] fixes both these problems by:

  • Replacing arithmetic operations with equivalent boolean operations.
  • Changing the symbol encoding scheme.

SLIDE 21

8 Boolean USM Coordinates

  • We will look at the forward map only.
  • Conceptually, we represent a coordinate (dimension) c in boolean USM-space as an infinite bit sequence.
  • Each coordinate in bUSM-space represents an infinite bit history of the symbols that have been seen so far.

$$c = \bigvee_{i=1}^{\infty} R^i a_i$$

where $a_i \in \{0, 1\}$, $\bigvee_{i=1}^{\infty}$ represents bit-wise logical OR, and $R^i$ is the right shift operator repeated $i$ times.

SLIDE 22

9 Tail Symbols

  • We also need to add two new symbols to the set of sequence symbols, called “tail symbols,” one for each sequence.
  • Suppose we have the following encoding, where Vt and Wt are the tail symbols for the sequences V = ATGA and W = CTGA respectively.

Unit  Code
A     000
T     001
G     010
C     011
Vt    100
Wt    101

  • Conceptually we add these tail symbols to the beginning and the end of the original sequences.

SLIDE 23

10 bUSM Recursion Formula

  • Consider a computer word of 8 bits, i.e., our coordinate bit history is 8 bits long.
  • The bUSM recursion is initialized with the tail symbol for each sequence.

$$\mathrm{USM}(v^{(0)}) = \begin{pmatrix} 00000000 \\ 00000000 \\ 11111111 \end{pmatrix}$$

  • The bUSM recursion is written as:

$$\mathrm{USM}_j(v^{(k)}) = R^1\left(\mathrm{USM}_j(v^{(k-1)})\right) \odot u_j^{(k)}$$

where $u_j^{(k)} \in \{0, 1\}$, $1 \le j \le n$. $\odot$ is defined below.

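SLIDE 24

11 Example

  • For instance, for the first symbol of V, A = (000), we get:

$$\mathrm{USM}(v^{(1)}) = R^1 \begin{pmatrix} 00000000 \\ 00000000 \\ 11111111 \end{pmatrix} \odot \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 00000000 \\ 00000000 \\ 01111111 \end{pmatrix}$$

  • Here $\odot$ means: replace the first column's entry in each right-shifted coordinate history with a 1 if the corresponding entry in the column vector is 1. The column vector lists the code bits from last to first, so the second symbol T = (001) sets the first row:

$$\mathrm{USM}(v^{(2)}) = R^1 \begin{pmatrix} 00000000 \\ 00000000 \\ 01111111 \end{pmatrix} \odot \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 10000000 \\ 00000000 \\ 00111111 \end{pmatrix}$$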
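The recursion can be sketched in Python with 8-bit words held in ordinary integers. The helper names are illustrative, and the row order (first row = last code bit) is inferred from the worked example on this slide.

```python
WORD_BITS = 8
MASK = (1 << WORD_BITS) - 1      # 0b11111111
TOP_BIT = 1 << (WORD_BITS - 1)   # 0b10000000, the "first column"

CODE = {"A": "000", "T": "001", "G": "010", "C": "011", "Vt": "100", "Wt": "101"}

def rows_for(symbol):
    """Code bits in slide row order: the first row holds the last code bit."""
    return [int(b) for b in reversed(CODE[symbol])]

def busm_forward(seq, tail):
    """Forward bUSM coordinates after each symbol of seq."""
    # Initialization: a row is all ones where the tail symbol's bit is 1.
    state = [MASK if bit else 0 for bit in rows_for(tail)]
    history = []
    for symbol in seq:
        # Right-shift every row, then set the first column where the bit is 1.
        state = [(row >> 1) | (TOP_BIT if bit else 0)
                 for row, bit in zip(state, rows_for(symbol))]
        history.append(state)
    return history

def show(state):
    return [format(row, f"0{WORD_BITS}b") for row in state]

v = busm_forward("ATGA", "Vt")
print(show(v[0]))  # ['00000000', '00000000', '01111111']
print(show(v[1]))  # ['10000000', '00000000', '00111111']
```

Because only shifts and ORs are involved, the coordinates are exact bit histories rather than rounded floating-point values.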

SLIDE 25

12 bUSM Forward Distance

  • We define the “forward distance” between the two symbols $v^{(m)}$ and $w^{(m)}$ as:

$$d_f\mathrm{USM}_j(v^{(m)}, w^{(m)}) = \mathrm{USM}_j(v^{(m)}) \oplus \mathrm{USM}_j(w^{(m)})$$

where $\oplus$ is the exclusive OR of each row, and $1 \le j \le n$.

  • For instance:

$$d_f\mathrm{USM}(v^{(4)}, w^{(4)}) = \begin{pmatrix} 00100000 \\ 01000000 \\ 00001111 \end{pmatrix} \oplus \begin{pmatrix} 00111111 \\ 01010000 \\ 00001111 \end{pmatrix} = \begin{pmatrix} 00011111 \\ 00010000 \\ 00000000 \end{pmatrix}$$

SLIDE 26

13 Preceding Similar Symbols

  • To find the number of preceding similar symbols we take the bit-wise OR of the rows to get: (00011111)
  • We then find the leftmost bit that is set to 1. The position of this bit will be one more than the number of similar sequence symbols, including $(v^{(4)}, w^{(4)})$, that precede the position $(v^{(4)}, w^{(4)})$.
  • So in this case there are 3 preceding similar symbols. Note, V = ATGA and W = CTGA.
  • The number of succeeding similar symbols is found similarly using the backward distance $d_b(v^{(i)}, w^{(j)})$.
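The XOR, OR, and leftmost-bit steps can be sketched in Python using the coordinates of v(4) and w(4) from the previous slide. The function name `preceding_similar` is illustrative, and returning the word length when no bit differs is an assumption for the case where the whole word agrees.

```python
from functools import reduce
from operator import or_

WORD_BITS = 8

def preceding_similar(a_rows, b_rows, word_bits=WORD_BITS):
    """Count similar symbols up to and including the current position."""
    # Row-wise XOR marks every history bit where the two sequences differ.
    diff = [a ^ b for a, b in zip(a_rows, b_rows)]
    # OR across the rows: a column is 1 if any code bit differs there.
    merged = reduce(or_, diff, 0)
    if merged == 0:
        return word_bits  # no difference is visible within one word
    # Position of the leftmost 1, counted from the left starting at 1.
    leftmost_position = word_bits - merged.bit_length() + 1
    return leftmost_position - 1

v4 = [0b00100000, 0b01000000, 0b00001111]
w4 = [0b00111111, 0b01010000, 0b00001111]
print(preceding_similar(v4, w4))  # 3 (T, G, A are shared by ATGA and CTGA)
```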

SLIDE 27

14 Finite Word Length

To overcome the limitations imposed by finite word length we define new distance measures. Let $a = \mathrm{USM}(v^{(i)})$, $b = \mathrm{USM}(w^{(j)})$, and W be the number of bits in a word. We define:

$$d'_f(a, b) = \begin{cases} d_f(a, b) & \text{if } d_f(a, b) < W \\ W & \text{otherwise} \end{cases}$$

$$d'_b(a, b) = \begin{cases} d_b(a, b) & \text{if } d_b(a, b) < W \\ W & \text{otherwise} \end{cases}$$

SLIDE 28

15 Recursive Forward Distance

  • Let $c = \mathrm{USM}(v^{(i-W)})$ and $d = \mathrm{USM}(w^{(j-W)})$. The “recursive forward distance” is defined as:

$$D_f(a, b) = \begin{cases} d'_f(a, b) & \text{if } d'_f(a, b) < W \\ D_f(c, d) + W & \text{otherwise} \end{cases}$$

  • Let $e = \mathrm{USM}(v^{(i+W)})$ and $f = \mathrm{USM}(w^{(j+W)})$. The “recursive backward distance” is defined as:

$$D_b(a, b) = \begin{cases} d'_b(a, b) & \text{if } d'_b(a, b) < W \\ D_b(e, f) + W & \text{otherwise} \end{cases}$$
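The recursive forward distance can be sketched in Python as a loop that steps back one word at a time. This is an illustration under assumptions: the word length is set to 4 bits so that the recursion is exercised by short sequences, the helper names are hypothetical, and the tail-symbol codes are taken from the Tail Symbols slide.

```python
WORD_BITS = 4  # deliberately small so that full-word matches occur
MASK = (1 << WORD_BITS) - 1
TOP_BIT = 1 << (WORD_BITS - 1)

CODE = {"A": (0, 0, 0), "T": (0, 0, 1), "G": (0, 1, 0),
        "C": (0, 1, 1), "Vt": (1, 0, 0), "Wt": (1, 0, 1)}

def forward_history(seq, tail):
    """Forward bUSM coordinates; index 0 is the tail-symbol initialization."""
    state = [MASK if bit else 0 for bit in CODE[tail]]
    history = [state]
    for symbol in seq:
        state = [(row >> 1) | (TOP_BIT if bit else 0)
                 for row, bit in zip(state, CODE[symbol])]
        history.append(state)
    return history

def capped_forward(a, b):
    """Matching symbols visible in one word, capped at WORD_BITS."""
    merged = 0
    for x, y in zip(a, b):
        merged |= x ^ y
    return WORD_BITS if merged == 0 else WORD_BITS - merged.bit_length()

def recursive_forward(hist_v, hist_w, i, j):
    """Follow full-word matches back WORD_BITS symbols at a time."""
    total = 0
    while True:
        d = capped_forward(hist_v[i], hist_w[j])
        if d < WORD_BITS:
            return total + d
        # The whole word matched: add W and look W symbols further back.
        total += WORD_BITS
        i, j = i - WORD_BITS, j - WORD_BITS

V, W_SEQ = "ATGATGATGA", "CTGATGATGA"
hist_v, hist_w = forward_history(V, "Vt"), forward_history(W_SEQ, "Wt")
print(recursive_forward(hist_v, hist_w, 10, 10))  # 9
```

The distinct tail symbols guarantee a mismatch at or before the start of each sequence, so the loop always stops before the indices run past position 0.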

SLIDE 29

16 Exact Distance

The exact length of a similar segment is determined by:

$$D(a, b) = \begin{cases} D_f(a, b) + D_b(a, b) - 1 & \text{if } D_f(a, b) > 0 \\ 0 & \text{otherwise} \end{cases}$$
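The subtraction of 1 is needed because the symbol at the current position is counted by both the forward and the backward distance. A minimal sketch, with `exact_distance` as an illustrative name; the backward value of 1 used in the example is derived here by hand (only the final A is shared before the tail symbols) and is not quoted from the slides.

```python
def exact_distance(d_forward, d_backward):
    """Combine recursive forward and backward distances into a segment length."""
    if d_forward > 0:
        # The current symbol is included in both counts, so remove one copy.
        return d_forward + d_backward - 1
    return 0

# V = ATGA and W = CTGA at position (4, 4): three symbols match going
# backward from the final A, and only the A itself matches going forward.
print(exact_distance(3, 1))  # 3, the length of the shared segment TGA
```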

SLIDE 30

17 Conclusion

The bUSM procedure:

  • Defines a distance metric that determines the exact number of similar symbols in a similar segment,
  • Allows for the identification of similar segments of arbitrary length, and
  • Is computationally more efficient than USM.

Can bUSM be used to:

  • Provide fast lookup of sequences in a database if sequences are stored as 2n-dimensional points in bUSM-space?
  • Compare multiple sequences?

What else is it good for?

SLIDE 31

References

[1] Jonas Almeida and Susana Vigna. Universal Sequence Map (USM) of Arbitrary Discrete Sequences. BMC Bioinformatics, 3(6):1–11, February 2002. www.biomedcentral.com/1471-2105/3/6.
[2] H. J. Jeffrey. Chaos Game Representation of Gene Structure. Nucleic Acids Research, 18(8):2163–2170, April 1990. www.pubmedcentral.nih.gov.
[3] Almeida, Carroco, Maretzek, Noble, and Fletcher. Analysis of Genomic Sequences by Chaos Game Representation. Bioinformatics, 17(5):429–437, May 2001. bioinformatics.oupjournals.org/content/vol17/issue5/index.dtl.
[4] John Schwacke and Jonas S Almeida. Efficient Boolean Implementation of Universal Sequence Maps (bUSM). BMC Bioinformatics, 3(28):1–11, October 2002. www.biomedcentral.com/1471-2105/3/28.