
Universal Sequence Maps of Arbitrary Discrete Sequences



  1. Universal Sequence Maps of Arbitrary Discrete Sequences, by Almeida and Vinga. Presented by Chris Standish (chriss@cs.tamu.edu)

  2. 1 Motivation
     • Sequence alignment techniques assume that contiguity is conserved between homologous segments.
     • This assumption of contiguity is violated by genetic processes such as:
       recombination - the exchange of regions of the genome between paired homologous chromosomes during meiosis (crossover).
       genome shuffling - an experimental technique that allows recombination between multiple parents, with the goal of improving individual genes.
     • Alignment-free techniques attempt to overcome this limitation.
     The alignment-free technique we will discuss is called Universal Sequence Maps (USM) [1]. It is based on a sequence representation technique called “Chaos Game Representation” [2].

  3. 2 Chaos Game Representation
     • Chaos Game Representation (CGR) maps a nucleotide sequence into a continuous two-dimensional space on the unit square, CGR-space.
     • CGR has the property that each unique genomic sequence $S = s^{(1)} s^{(2)} \ldots s^{(i)} \ldots$, of any length, has a unique position in CGR-space.
     • The iterative function used in CGR is defined by [3]:
       $$CGR_j(s^{(0)}) = \tfrac{1}{2}$$
       $$CGR_j(s^{(i)}) = CGR_j(s^{(i-1)}) + \tfrac{1}{2}\left(u_j^{(i)} - CGR_j(s^{(i-1)})\right)$$
       where $u_j^{(i)}$ is the $j$th bit of the binary encoding for sequence symbol $s^{(i)}$, and $1 \le j \le 2$.
     • This mapping is a bijection and so it is invertible.

  4. 3 CGR Binary Encoding
     • Each unique symbol in the genomic sequence is encoded in binary.
     • For CGR the binary encoding is defined as:
       Unit  Code
       A     00
       T     01
       C     10
       G     11
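
As a rough illustration of the CGR iteration with the encoding above, here is a minimal Python sketch. The names `CGR_CODE` and `cgr_points` are invented for this example and do not come from the slides. It assumes that the first code bit drives coordinate 1 and the second bit drives coordinate 2.

```python
# Binary encoding from the slide (assumed: first bit -> coordinate 1, second bit -> coordinate 2).
CGR_CODE = {"A": (0, 0), "T": (0, 1), "C": (1, 0), "G": (1, 1)}


def cgr_points(sequence):
    """Return the CGR position reached after each symbol, starting from (1/2, 1/2)."""
    x, y = 0.5, 0.5
    points = []
    for symbol in sequence:
        u1, u2 = CGR_CODE[symbol]
        # Move halfway from the current position toward the symbol's corner.
        x = x + 0.5 * (u1 - x)
        y = y + 0.5 * (u2 - y)
        points.append((x, y))
    return points


points = cgr_points("ATGCGAGATGT")
print(points[0])  # position after the first symbol "A": (0.25, 0.25)
```

Each step halves the distance to the corner of the current symbol, so the quadrant in which a point lies identifies the most recent symbol.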

  5. Figure 1: CGR Representation of ATGCGAGATGT.

  6. Figure 2: Using quadrants we can recover the original sequence.
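
The quadrant-based recovery can be sketched as follows. This is only an illustration under the same assumed bit-to-coordinate convention as the encoding table. The helper names `cgr_encode` and `cgr_decode` are hypothetical, and the sequence length is passed in explicitly because the final point alone does not say where to stop.

```python
CGR_CODE = {"A": (0, 0), "T": (0, 1), "C": (1, 0), "G": (1, 1)}
CORNER_TO_SYMBOL = {corner: symbol for symbol, corner in CGR_CODE.items()}


def cgr_encode(sequence):
    """Return the final CGR position of the whole sequence."""
    x, y = 0.5, 0.5
    for symbol in sequence:
        u1, u2 = CGR_CODE[symbol]
        x, y = 0.5 * (x + u1), 0.5 * (y + u2)
    return x, y


def cgr_decode(x, y, length):
    """Read symbols back from the quadrant of the point, newest symbol first."""
    symbols = []
    for _ in range(length):
        u1, u2 = int(x >= 0.5), int(y >= 0.5)  # which quadrant are we in?
        symbols.append(CORNER_TO_SYMBOL[(u1, u2)])
        x, y = 2 * x - u1, 2 * y - u2          # undo one CGR step
    return "".join(reversed(symbols))


x, y = cgr_encode("ATGCGAGATGT")
print(cgr_decode(x, y, 11))
```

For a sequence this short the coordinates are exact binary fractions, so the round trip is lossless. Longer sequences eventually exhaust floating-point precision.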

  7. 4 Universal Sequence Maps
     • USMs generalize the 2-D CGR-space to an $n$-dimensional space, where $n$ depends on the number of unique symbols in the sequence.
     • In addition to the “forward” map of CGR, there is also a “backward” map.
     • Together, the “forward” and “backward” maps allow us to estimate the length of a similar segment.

  8. 4.1 Binary Encoding of Unique Symbols
     • As a concrete example, we let these two stanzas of a poem be our two sequences:
       I am a poet. I am very fond of bananas.
       I am of very fond bananas. Am I a poet?
     • There are nineteen unique symbols, so we encode each symbol with a 5-bit binary code, where $5 = \lceil \log_2(19) \rceil$.
     • Each unique symbol is placed in a unique corner of the 5-dimensional unit hypercube.

  9. 4.2 USM Binary Code
     Unit     Code      Unit  Code
     (space)  00000     m     01010
     .        00001     n     01011
     ?        00010     o     01100
     A        00011     p     01101
     I        00100     r     01110
     a        00101     s     01111
     b        00110     t     10000
     d        00111     v     10001
     e        01000     y     10010
     f        01001
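
A code table of this kind can be generated with a few lines of Python. This is a sketch, not the authors' procedure: it assumes that symbols are numbered in sorted character order, which happens to reproduce the table above, and `build_codes` is a name chosen only for this example.

```python
def build_codes(*sequences):
    """Assign each unique symbol an n-bit code, with n = ceil(log2(number of symbols))."""
    symbols = sorted(set("".join(sequences)))
    # (m - 1).bit_length() equals ceil(log2(m)) for m >= 2 and avoids floating point.
    n = max(1, (len(symbols) - 1).bit_length())
    codes = {symbol: format(index, f"0{n}b") for index, symbol in enumerate(symbols)}
    return codes, n


stanza_1 = "I am a poet. I am very fond of bananas."
stanza_2 = "I am of very fond bananas. Am I a poet?"
codes, n = build_codes(stanza_1, stanza_2)
print(n, codes["I"], codes["y"])
```

Each code string is the corner of the unit hypercube assigned to that symbol, one bit per dimension.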

  10. 4.3 Forward Map
      • The forward map defines a position in an $n$-dimensional unit hypercube for each prefix $S_i = s^{(1)} s^{(2)} \ldots s^{(i)}$ of sequence $S = s^{(1)} s^{(2)} \ldots s^{(k)}$.
      • For each coordinate $j = 1, \ldots, n$ of $n$-space, the $USM_j$ coordinates for each prefix $S_i$ are determined as follows:
        $$USM_j(s^{(0)}) = \mathrm{Unif}([0, 1])$$
        $$USM_j(s^{(i)}) = USM_j(s^{(i-1)}) + \tfrac{1}{2}\left(u_j^{(i)} - USM_j(s^{(i-1)})\right) = \tfrac{1}{2}\,USM_j(s^{(i-1)}) + \tfrac{1}{2}\,u_j^{(i)}$$
        where $u_j^{(i)} \in \{0, 1\}$ and $1 \le i \le k$, $1 \le j \le n$.
      • Notice that the above formulas use the binary encoding $u^{(i)}$ of $s^{(i)}$.
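
The forward recursion can be sketched in Python as below. The function name `usm_forward` and its `start` and `seed` parameters are assumptions made for this illustration. `codes` is expected to map each symbol to a bit string of equal length, and the starting point is drawn uniformly from the unit hypercube unless one is supplied.

```python
import random


def usm_forward(sequence, codes, start=None, seed=None):
    """Return the forward USM coordinates after each symbol of the sequence."""
    n = len(next(iter(codes.values())))  # dimension = code length
    rng = random.Random(seed)
    # USM_j(s^(0)) ~ Unif([0, 1]) unless an explicit start is given.
    point = list(start) if start is not None else [rng.random() for _ in range(n)]
    trajectory = []
    for symbol in sequence:
        bits = [int(b) for b in codes[symbol]]
        # USM_j(s^(i)) = 1/2 * USM_j(s^(i-1)) + 1/2 * u_j^(i)
        point = [0.5 * p + 0.5 * u for p, u in zip(point, bits)]
        trajectory.append(point)
    return trajectory


codes = {"A": "00", "T": "01", "C": "10", "G": "11"}
print(usm_forward("AT", codes, start=[0.5, 0.5]))
```

With the two-bit nucleotide code and a start of (1/2, 1/2), this reduces to the CGR iteration.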

  11. 4.4 Backward Map
      • The backward map defines another $n$-dimensional unit hypercube.
      • For each coordinate $j = 1, \ldots, n$ of $n$-space, the $USM_{n+j}$ coordinates for each suffix $S_i$ are determined as follows:
        $$USM_{n+j}(s^{(k+1)}) = \mathrm{Unif}([0, 1])$$
        $$USM_{n+j}(s^{(i)}) = \tfrac{1}{2}\,USM_{n+j}(s^{(i+1)}) + \tfrac{1}{2}\,u_j^{(i)}$$
        where $i = k, k-1, \ldots, 1$.
      • Together the forward and backward maps transform a sequence to a point in $2n$-dimensional space.
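
The backward recursion is the same update run from the end of the sequence toward the start. A minimal sketch, with the hypothetical name `usm_backward` and the same assumed `codes`, `start`, and `seed` conventions as a forward-map helper would use:

```python
import random


def usm_backward(sequence, codes, start=None, seed=None):
    """Return the backward USM coordinates for each position of the sequence."""
    n = len(next(iter(codes.values())))
    rng = random.Random(seed)
    # USM_{n+j}(s^(k+1)) ~ Unif([0, 1]) unless an explicit start is given.
    point = list(start) if start is not None else [rng.random() for _ in range(n)]
    trajectory = [None] * len(sequence)
    for i in range(len(sequence) - 1, -1, -1):  # i = k, k-1, ..., 1 (0-based here)
        bits = [int(b) for b in codes[sequence[i]]]
        # USM_{n+j}(s^(i)) = 1/2 * USM_{n+j}(s^(i+1)) + 1/2 * u_j^(i)
        point = [0.5 * p + 0.5 * u for p, u in zip(point, bits)]
        trajectory[i] = point
    return trajectory


codes = {"A": "00", "T": "01", "C": "10", "G": "11"}
print(usm_backward("AT", codes, start=[0.5, 0.5]))
```

Concatenating the forward and backward coordinates of a position gives its point in the $2n$-dimensional space.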

  12. 5 Sequence Similarity
      • The distance between two sequences in USM-space estimates sequence similarity.
      • The “bi-directional distance” measure $D$ between two sequences $A = a^{(1)} a^{(2)} \ldots a^{(r)} \ldots$ and $B = b^{(1)} b^{(2)} \ldots b^{(s)} \ldots$ is defined as:
        $$D(a^{(r)}, b^{(s)}) = d_f(a^{(r)}, b^{(s)}) + d_b(a^{(r)}, b^{(s)})$$
      • The “forward distance” is defined as:
        $$d_f(a^{(r)}, b^{(s)}) = -\log_2\left(\max_{1 \le j \le n} \left|USM_j(a^{(r)}) - USM_j(b^{(s)})\right|\right)$$
      • The forward distance measures the number of similar contiguous symbols preceding $a^{(r)}$.

  13. • The “backward distance” is defined as:
        $$d_b(a^{(r)}, b^{(s)}) = -\log_2\left(\max_{1 \le j \le n} \left|USM_{j+n}(a^{(r)}) - USM_{j+n}(b^{(s)})\right|\right)$$
      • The backward distance measures the number of similar contiguous symbols succeeding $b^{(s)}$.
      • Both distance measures $d_f$ and $d_b$ overestimate the true number of similar contiguous symbols, so $D$ overestimates the total number of similar contiguous symbols.
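
The distance definitions translate almost directly into code. The sketch below is illustrative only: the function names are made up here, and it assumes that the forward and backward coordinates of the two positions are already available as plain lists. An exact coordinate match is reported as infinity because the logarithm is undefined at zero.

```python
import math


def log_max_gap_distance(p, q):
    """-log2 of the largest coordinate-wise gap between two points (d_f or d_b)."""
    gap = max(abs(a - b) for a, b in zip(p, q))
    return math.inf if gap == 0.0 else -math.log2(gap)


def bidirectional_distance(fwd_a, bwd_a, fwd_b, bwd_b):
    """D = d_f + d_b for one position of sequence A and one position of sequence B."""
    return log_max_gap_distance(fwd_a, fwd_b) + log_max_gap_distance(bwd_a, bwd_b)


# A largest gap of 1/4 in each direction gives d_f = d_b = 2.
print(bidirectional_distance([0.25, 0.5], [0.75, 0.5], [0.5, 0.5], [0.5, 0.5]))
```

Every shared symbol halves the gap between the two points, which is why the negative base-2 logarithm counts matching symbols.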

  14. Algorithm 5.1 USM Compare
      Input: Two sequences
      Output: A matrix D of bi-directional distance values
      1. Identify the unique symbols in the input sequences
      2. Find the dimension n of the unit hypercube
      3. Map each unique symbol to a unique corner of the unit hypercube
      4. Iteratively generate forward USM coordinates for each input sequence
      5. Iteratively generate backward USM coordinates for each input sequence
      6. Find the bi-directional distance matrix D
      7. Return D
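
The steps of Algorithm 5.1 can be strung together as in the following sketch. It is one possible reading of the algorithm, not the presenter's implementation. The name `usm_compare`, the `seed` parameter, the sorted-order corner assignment, and the choice of an independent random starting point for every map are all assumptions of this example.

```python
import math
import random


def usm_compare(seq_a, seq_b, seed=0):
    """Return the matrix D of bi-directional distances between all position pairs."""
    rng = random.Random(seed)
    # Steps 1-3: unique symbols, dimension n, one hypercube corner per symbol.
    symbols = sorted(set(seq_a + seq_b))
    n = max(1, (len(symbols) - 1).bit_length())
    corner = {s: [(i >> (n - 1 - j)) & 1 for j in range(n)]
              for i, s in enumerate(symbols)}

    def walk(seq):
        """Move halfway toward each symbol's corner, starting from a random point."""
        point = [rng.random() for _ in range(n)]
        out = []
        for s in seq:
            point = [0.5 * p + 0.5 * u for p, u in zip(point, corner[s])]
            out.append(point)
        return out

    def maps(seq):
        forward = walk(seq)                # step 4
        backward = walk(seq[::-1])[::-1]   # step 5: same recursion, run from the end
        return forward, backward

    def d(p, q):
        gap = max(abs(x - y) for x, y in zip(p, q))
        return math.inf if gap == 0.0 else -math.log2(gap)

    fwd_a, bwd_a = maps(seq_a)
    fwd_b, bwd_b = maps(seq_b)
    # Step 6: D[r][s] = d_f + d_b for every pair of positions.
    return [[d(fwd_a[r], fwd_b[s]) + d(bwd_a[r], bwd_b[s])
             for s in range(len(seq_b))]
            for r in range(len(seq_a))]


D = usm_compare("I am a poet. I am very fond of bananas.",
                "I am of very fond bananas. Am I a poet?")
print(len(D), len(D[0]))
```

The matrix has one row per position of the first sequence and one column per position of the second, so its cost grows with the product of the two lengths.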

  15. 6 Recovering the Sequence
      • In theory we can recover the original sequence from a given point in USM-space, since the forward and backward mappings are bijections, i.e., their inverses exist.
      • In practice we are limited by the precision of the machine.
      • As an example, the USM coordinates of the 16th character ’a’ are:
        forward:  [0.0156, 0.0138, 0.6314, 0.0001, 0.5338]
        backward: [0.0703, 0.3004, 0.5169, 0.2742, 0.5652]

  16. • We recover the original sequence by inverting the corresponding map. For the inverse forward map:
        $$USM_j(a^{(15)}) = 2\,USM_j(a^{(16)}) - u_j^{(16)}$$
        $$2\begin{pmatrix} 0.0156 \\ 0.0138 \\ 0.6314 \\ 0.0001 \\ 0.5338 \end{pmatrix} - \begin{pmatrix} 0 \\ 0 \\ 1 \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 0.0312 \\ 0.0276 \\ 0.2628 \\ 0.0002 \\ 0.0676 \end{pmatrix} \rightarrow \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix} \quad \text{(a space)}$$

  17. Inverse forward map     Inverse backward map
      00100  I                00101  a
      00000  (space)          01010  m
      00101  a                00000  (space)
      01010  m                10001  v
      00000  (space)          01000  e
      00101  a                01110  r
      00000  (space)          10010  y
      01101  p                00000  (space)
      01100  o                01001  f
      01000  e                01100  o
      10000  t                01011  n
      00001  .                00111  d
      00000  (space)          00000  (space)
      00100  I                01100  o
      00000  (space)          01001  f
      00101  a                00000  (space)
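
The inversion can be tried on the rounded forward coordinates quoted for the 16th character. The sketch below is an illustration, not the presenter's code. It rebuilds the 5-bit code table by assuming sorted character order, and `invert_forward` is a hypothetical helper name. Each step thresholds the coordinates at 1/2 to read off the current symbol's code and then applies the inverse forward map.

```python
STANZAS = ("I am a poet. I am very fond of bananas.",
           "I am of very fond bananas. Am I a poet?")
SYMBOLS = sorted(set("".join(STANZAS)))
N_BITS = (len(SYMBOLS) - 1).bit_length()  # 5 bits for 19 symbols
CODE_TO_SYMBOL = {format(i, f"0{N_BITS}b"): s for i, s in enumerate(SYMBOLS)}


def invert_forward(coords, steps):
    """Walk the forward map backwards, returning the recovered symbols in reading order."""
    recovered = []
    for _ in range(steps):
        bits = [1 if c >= 0.5 else 0 for c in coords]  # code of the current symbol
        recovered.append(CODE_TO_SYMBOL["".join(map(str, bits))])
        coords = [2 * c - u for c, u in zip(coords, bits)]  # inverse forward map
    return "".join(reversed(recovered))


forward_16 = [0.0156, 0.0138, 0.6314, 0.0001, 0.5338]
print(repr(invert_forward(forward_16, 5)))  # '. I a'
```

With only four decimal places, the recovery is reliable for just a few steps. On these rounded values the sixth step back lands the first coordinate slightly below 1/2 and no longer reproduces the expected symbol, which illustrates the precision limit.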

  18. Figure 3: Distance matrices D for USM (left) versus bUSM on the two stanzas. Brighter regions indicate larger distance values.

  19. Figure 4: Distance matrices D for a 100 nucleotide mRNA using both USM (left) and bUSM.

  20. 7 Boolean USM
      There are some problems with USMs:
      • The length of the sequence that can be recovered is limited by the machine's precision.
      • The distance measure D overestimates the true number of similar symbols in a similar segment.
      The boolean USM (bUSM) procedure [4] fixes both of these problems by:
      • Replacing arithmetic operations with equivalent boolean operations.
      • Changing the symbol encoding scheme.

  21. 8 Boolean USM Coordinates
      • We will look at the forward map only.
      • Conceptually, we represent a coordinate (dimension) $c$ in boolean USM-space as an infinite bit sequence.
      • Each coordinate in bUSM-space represents an infinite bit history of the symbols that have been seen so far:
        $$c = \bigvee_{i=1}^{\infty} R^i a_i$$
        where $a_i \in \{0, 1\}$, $\bigvee_{i=1}^{\infty}$ represents bit-wise logical OR, and $R^i$ is the right shift operator repeated $i$ times.

  22. 9 Tail Symbols
      • We also need to add two new symbols to the set of sequence symbols called “tail symbols,” one for each sequence.
      • Suppose we have the following encoding, where $V_t$ and $W_t$ are the tail symbols for the sequences V = ATGA and W = CTGA respectively.
        Unit  Code
        A     000
        T     001
        G     010
        C     011
        V_t   100
        W_t   101
      • Conceptually we add these tail symbols to the beginning and the end of the original sequences.

  23. 10 bUSM Recursion Formula
      • Consider a computer word of 8 bits, i.e., our coordinate bit history is 8 bits long.
      • The bUSM recursion is initialized with the tail symbol for each sequence.
        $$USM(v^{(0)}) = \begin{pmatrix} \texttt{00000000} \\ \texttt{00000000} \\ \texttt{11111111} \end{pmatrix}$$
      • The bUSM recursion is written as:
        $$USM_j(v^{(k)}) = R^1\left(USM_j(v^{(k-1)})\right) \odot u_j^{(k)}$$
        where $u_j^{(k)} \in \{0, 1\}$, $1 \le j \le n$. $\odot$ is defined below.

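  24. 11 Example
      • For instance, for the first symbol of V, A = (000), we get:
        $$USM(v^{(1)}) = R^1\begin{pmatrix} \texttt{00000000} \\ \texttt{00000000} \\ \texttt{11111111} \end{pmatrix} \odot \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} \texttt{00000000} \\ \texttt{00000000} \\ \texttt{01111111} \end{pmatrix}$$
      • Here $\odot$ means: in the right-shifted coordinate histories, replace the entry in the first column with a 1 if the corresponding entry in the column vector is 1.
      • For the second symbol of V, T = (001), the column vector lists the code bits starting from the rightmost bit:
        $$USM(v^{(2)}) = R^1\begin{pmatrix} \texttt{00000000} \\ \texttt{00000000} \\ \texttt{01111111} \end{pmatrix} \odot \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} \texttt{10000000} \\ \texttt{00000000} \\ \texttt{00111111} \end{pmatrix}$$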
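
The 8-bit example can be reproduced with integer shifts and ORs. This is a sketch under stated assumptions: each coordinate history is stored as an 8-bit integer whose most significant bit is the "first column", and the code bits are applied to the coordinates starting from the rightmost bit, as the worked example suggests. The names `busm_init`, `busm_step`, and `busm_forward` are hypothetical.

```python
WORD_BITS = 8
MASK = (1 << WORD_BITS) - 1       # 0b11111111
TOP_BIT = 1 << (WORD_BITS - 1)    # the "first column" of a coordinate history

CODES = {"A": "000", "T": "001", "G": "010", "C": "011", "Vt": "100", "Wt": "101"}


def bit_column(code):
    """Code bits in the order of the slide's column vectors (rightmost bit first)."""
    return [int(b) for b in reversed(code)]


def busm_init(tail_symbol):
    """Start every coordinate as all ones or all zeros, as if the tail repeated forever."""
    return [MASK if bit else 0 for bit in bit_column(CODES[tail_symbol])]


def busm_step(state, symbol):
    """Right-shift each history by one and write the new bit into the first column."""
    return [(word >> 1) | (TOP_BIT if bit else 0)
            for word, bit in zip(state, bit_column(CODES[symbol]))]


def busm_forward(sequence, tail_symbol):
    state = busm_init(tail_symbol)
    history = []
    for symbol in sequence:
        state = busm_step(state, symbol)
        history.append(state)
    return history


for step in busm_forward("ATGA", "Vt"):
    print([format(word, "08b") for word in step])
```

The first two printed rows match the matrices for $USM(v^{(1)})$ and $USM(v^{(2)})$ in the example. Because only shifts and ORs are used, no rounding occurs within the word length.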
