Universal Sequence Maps of Arbitrary Discrete Sequences
By Almeida and Vigna
Presented by Chris Standish
chriss@cs.tamu.edu

1 Motivation

Sequence alignment techniques assume there is conservation of contiguity between homologous segments.
as:
recombination - the exchange of regions of the genome between paired homologous chromosomes during meiosis (crossover).
genome shuffling - an experimental technique which allows for recombination between multiple parents with the goal of improving individual genes.
The alignment-free technique we will discuss is called Universal Sequence Maps (USM) [1]. It is based on a sequence representation technique called “Chaos Game Representation” [2].
into a continuous two-dimensional space on the unit square, CGR-space.
s(1)s(2) . . . s(i) . . ., of any length, has a unique position in CGR-space.
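$$\mathrm{CGR}_j(s^{(0)}) = \tfrac{1}{2}$$
$$\mathrm{CGR}_j(s^{(i)}) = \mathrm{CGR}_j(s^{(i-1)}) + \tfrac{1}{2}\left(u_j^{(i)} - \mathrm{CGR}_j(s^{(i-1)})\right)$$
where $u_j^{(i)}$ is the jth bit of the binary encoding for sequence symbol $s^{(i)}$, and $1 \le j \le 2$.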
Unit  Code
A     00
T     01
C     10
G     11
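The CGR iteration can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the midpoint form of the update (each point moves halfway toward the corner of the current symbol), the corner codes from the table above, and the hypothetical names `CORNERS` and `cgr_points`.

```python
# Corner assignment taken from the unit/code table: A=00, T=01, C=10, G=11.
CORNERS = {"A": (0, 0), "T": (0, 1), "C": (1, 0), "G": (1, 1)}


def cgr_points(sequence):
    """Return the CGR point reached after each symbol, starting at (1/2, 1/2)."""
    x, y = 0.5, 0.5
    points = []
    for symbol in sequence:
        u1, u2 = CORNERS[symbol]
        # Move halfway from the current point toward the symbol's corner.
        x = x + 0.5 * (u1 - x)
        y = y + 0.5 * (u2 - y)
        points.append((x, y))
    return points


points = cgr_points("ATGCGAGATGT")
for symbol, point in zip("ATGCGAGATGT", points):
    print(symbol, point)
```

Because every step halves the distance to a corner, all points stay strictly inside the unit square.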
Figure 1: CGR Representation of ATGCGAGATGT.
Figure 2: Using quadrants we can recover the original sequence.
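The quadrant idea in Figure 2 can be illustrated with a short sketch. It assumes the same corner codes as the CGR table and a known sequence length; `cgr_encode` and `cgr_decode` are hypothetical helper names, not part of the original slides.

```python
CORNERS = {"A": (0, 0), "T": (0, 1), "C": (1, 0), "G": (1, 1)}
SYMBOLS = {bits: symbol for symbol, bits in CORNERS.items()}


def cgr_encode(sequence):
    """Return the final CGR point of the sequence, starting at (1/2, 1/2)."""
    x, y = 0.5, 0.5
    for symbol in sequence:
        u1, u2 = CORNERS[symbol]
        x, y = 0.5 * (x + u1), 0.5 * (y + u2)
    return x, y


def cgr_decode(point, length):
    """Read symbols back from the quadrant of the point, last symbol first."""
    x, y = point
    recovered = []
    for _ in range(length):
        # The quadrant containing the point identifies the most recent symbol.
        u1, u2 = int(x >= 0.5), int(y >= 0.5)
        recovered.append(SYMBOLS[(u1, u2)])
        # Undo the halving step to reach the previous point.
        x, y = 2 * x - u1, 2 * y - u2
    return "".join(reversed(recovered))


final_point = cgr_encode("ATGCGAGATGT")
decoded = cgr_decode(final_point, 11)
print(decoded)
```

For short sequences like this one the coordinates are exact binary fractions, so the round trip is lossless; for long sequences floating-point precision limits how far back the decoding can go.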
where n depends on the number of unique symbols in the sequence.
ward” map.
mate the length of a similar segment.
two sequences.
I am a poet. I am very fond of bananas.
I am of very fond bananas. Am I a poet?
with a 5-bit binary code, where 5 = ⌈log2(19)⌉.
unit hypercube.
Unit     Code
(space)  00000
.        00001
?        00010
A        00011
I        00100
a        00101
b        00110
d        00111
e        01000
f        01001
m        01010
n        01011
o        01100
p        01101
r        01110
s        01111
t        10000
v        10001
y        10010
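A code table like the one above can be generated mechanically. The sketch below is one possible construction, assuming symbols are numbered in sorted (code point) order, which happens to agree with the table; `build_codes` is a hypothetical helper name.

```python
def build_codes(*sequences):
    """Assign each unique symbol an n-bit binary code, n = ceil(log2(#symbols))."""
    symbols = sorted(set("".join(sequences)))
    # (m - 1).bit_length() equals ceil(log2(m)) for m >= 2, without float rounding.
    n = max(1, (len(symbols) - 1).bit_length())
    codes = {symbol: format(index, f"0{n}b") for index, symbol in enumerate(symbols)}
    return n, codes


n, codes = build_codes(
    "I am a poet. I am very fond of bananas.",
    "I am of very fond bananas. Am I a poet?",
)
print(n, len(codes))
for symbol, code in codes.items():
    print(repr(symbol), code)
```

Any one-to-one assignment of symbols to hypercube corners would work equally well; sorted order is used here only to make the result reproducible.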
hypercube for each prefix Si = s(1)s(2) . . . s(i) of sequence S = s(1)s(2) . . . s(k).
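coordinates for each prefix $S_i$ are determined as follows:
$$\mathrm{USM}_j(s^{(0)}) = \mathrm{Unif}([0, 1])$$
$$\mathrm{USM}_j(s^{(i)}) = \mathrm{USM}_j(s^{(i-1)}) + \tfrac{1}{2}\left(u_j^{(i)} - \mathrm{USM}_j(s^{(i-1)})\right) = \tfrac{1}{2}\,\mathrm{USM}_j(s^{(i-1)}) + \tfrac{1}{2}\,u_j^{(i)}$$
where $u_j^{(i)} \in \{0, 1\}$ and $1 \le i \le k$, $1 \le j \le n$.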
coordinates for each suffix $S_i$ are determined as follows:
$$\mathrm{USM}_{n+j}(s^{(k+1)}) = \mathrm{Unif}([0, 1])$$
$$\mathrm{USM}_{n+j}(s^{(i)}) = \tfrac{1}{2}\,\mathrm{USM}_{n+j}(s^{(i+1)}) + \tfrac{1}{2}\,u_j^{(i)}$$
where $i = k, k-1, \ldots, 1$.
to a point in 2n-dimensional space.
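The forward and backward iterations can be combined into one routine that returns a 2n-dimensional point per symbol. This is a sketch under stated assumptions: the initial points are drawn from a seeded `random.Random` for reproducibility, the 2-bit DNA codes are used as an example, and `usm_coordinates` is a hypothetical name.

```python
import random

CODES = {"A": "00", "T": "01", "C": "10", "G": "11"}


def usm_coordinates(sequence, codes, n, seed=0):
    """Return one 2n-dimensional USM point (forward + backward) per symbol."""
    rng = random.Random(seed)
    bits = [[int(b) for b in codes[symbol]] for symbol in sequence]

    # Forward pass: start from a uniform random point, walk left to right.
    forward = []
    point = [rng.random() for _ in range(n)]
    for u in bits:
        point = [0.5 * p + 0.5 * uj for p, uj in zip(point, u)]
        forward.append(point)

    # Backward pass: start from another random point, walk right to left.
    backward = [None] * len(sequence)
    point = [rng.random() for _ in range(n)]
    for i in range(len(sequence) - 1, -1, -1):
        point = [0.5 * p + 0.5 * uj for p, uj in zip(point, bits[i])]
        backward[i] = point

    return [f + b for f, b in zip(forward, backward)]


coords = usm_coordinates("ATGCGAGATGT", CODES, 2)
print(coords[0])
```

In both halves of each point, a coordinate at or above 1/2 indicates that the corresponding code bit of the current symbol is 1.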
quence similarity.
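$A = a^{(1)}a^{(2)} \ldots a^{(r)} \ldots$ and $B = b^{(1)}b^{(2)} \ldots b^{(s)} \ldots$ is defined as:
$$D(a^{(r)}, b^{(s)}) = d_f(a^{(r)}, b^{(s)}) + d_b(a^{(r)}, b^{(s)})$$
$$d_f(a^{(r)}, b^{(s)}) = -\log_2\left(\max_{1 \le j \le n} \left|\mathrm{USM}_j(a^{(r)}) - \mathrm{USM}_j(b^{(s)})\right|\right)$$
symbols preceding $a^{(r)}$.
$$d_b(a^{(r)}, b^{(s)}) = -\log_2\left(\max_{1 \le j \le n} \left|\mathrm{USM}_{j+n}(a^{(r)}) - \mathrm{USM}_{j+n}(b^{(s)})\right|\right)$$
symbols succeeding $b^{(s)}$.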
similar contiguous symbols. So D overestimates the total number
Algorithm 5.1 USM Compare
Input: Two sequences
Output: A matrix D of bi-directional distance values
1. Identify the unique symbols in the input sequences
2. Find the dimension n of the unit hypercube
3. Map each unique symbol to a unique corner of the unit hypercube
4. Iteratively generate forward USM coordinates for each input sequence
5. Iteratively generate backward USM coordinates for each input sequence
6. Find the bi-directional distance matrix D
7. Return D
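The seven steps of Algorithm 5.1 might be put together as follows. This is a minimal sketch, not the authors' code: it assumes symbols are assigned to corners in sorted order, uses a seeded random start for every iteration, and returns infinity when two points coincide exactly. The name `usm_compare` is hypothetical.

```python
import math
import random


def usm_compare(seq_a, seq_b, seed=0):
    """Return the matrix D of bi-directional USM distance values."""
    # Steps 1-3: unique symbols, hypercube dimension n, symbol -> corner.
    symbols = sorted(set(seq_a + seq_b))
    n = max(1, (len(symbols) - 1).bit_length())
    corner = {s: [(i >> (n - 1 - j)) & 1 for j in range(n)]
              for i, s in enumerate(symbols)}
    rng = random.Random(seed)

    def iterate(seq):
        # Halve the distance to the corner of each successive symbol.
        point = [rng.random() for _ in range(n)]
        out = []
        for s in seq:
            point = [0.5 * p + 0.5 * u for p, u in zip(point, corner[s])]
            out.append(point)
        return out

    def similarity(p, q):
        gap = max(abs(x - y) for x, y in zip(p, q))
        return -math.log2(gap) if gap > 0 else math.inf

    # Step 4: forward coordinates. Step 5: backward coordinates, obtained by
    # iterating over the reversed sequence and restoring the original order.
    fwd_a, fwd_b = iterate(seq_a), iterate(seq_b)
    bwd_a = iterate(seq_a[::-1])[::-1]
    bwd_b = iterate(seq_b[::-1])[::-1]

    # Steps 6-7: D = d_f + d_b for every pair of positions.
    return [[similarity(fwd_a[r], fwd_b[s]) + similarity(bwd_a[r], bwd_b[s])
             for s in range(len(seq_b))]
            for r in range(len(seq_a))]


D = usm_compare("ATGA", "ATGA")
print([round(D[r][r], 2) for r in range(4)])
```

For two identical sequences of length 4, every diagonal entry is at least 5, since the forward and backward terms both count the current symbol and the random starting points add a further non-negative offset.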
in USM-space since the forward and backward mappings are bijections, i.e., each inverse exists.
are: [0.0156, 0.0138, 0.6314, 0.0001, 0.5338, 0.0703, 0.3004, 0.5169, 0.2742, 0.5652]
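$$\mathrm{USM}_j(a^{(15)}) = 2\,\mathrm{USM}_j(a^{(16)}) - u_j^{(16)}$$
$$\mathrm{USM}(a^{(15)}) = 2 \begin{bmatrix} .0156 \\ .0138 \\ .6314 \\ .0001 \\ .5338 \end{bmatrix} - \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} .0312 \\ .0276 \\ .2628 \\ .0002 \\ .0676 \end{bmatrix} \rightarrow \text{a space}$$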
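The inverse step above is easy to check numerically. The sketch below assumes the current symbol's bits are read off by thresholding each coordinate at 1/2; `step_back` is a hypothetical helper name.

```python
def step_back(point):
    """Undo one forward USM step: return (bits of current symbol, previous point)."""
    bits = [int(x >= 0.5) for x in point]
    previous = [2 * x - u for x, u in zip(point, bits)]
    return bits, previous


# Forward coordinates of a(16) from the worked example.
bits16, point15 = step_back([0.0156, 0.0138, 0.6314, 0.0001, 0.5338])
bits15, _ = step_back(point15)

print(bits16)                           # bits of a(16)
print([round(x, 4) for x in point15])   # forward coordinates of a(15)
print(bits15)                           # bits of a(15)
```

Here `bits16` is 00101, the code for "a", and `bits15` is 00000, the code for a space, which agrees with the worked example.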
Inverse forward map
00100  I
00000  (space)
00101  a
01010  m
00000  (space)
00101  a
00000  (space)
01101  p
01100  o
01000  e
10000  t
00001  .
00000  (space)
00100  I
00000  (space)
00101  a

Inverse backward map
00101  a
01010  m
00000  (space)
10001  v
01000  e
01110  r
10010  y
00000  (space)
01001  f
01100  o
01011  n
00111  d
00000  (space)
01100  o
01001  f
00000  (space)
Figure 3: Distance matrices D for USM (left) versus bUSM on the two
Figure 4: Distance matrices D for a 100 nucleotide mRNA using both USM (left) and bUSM.
There are some problems with USMs:
machine's precision.
symbols in a similar segment.
The boolean USM (bUSM) procedure [4] fixes both of these problems by:
tions.
USM-space as an infinite bit sequence.
$$c = \bigvee_{i=1}^{\infty} R^i a_i$$
where $a_i \in \{0, 1\}$, $\bigvee_{i=1}^{\infty}$ represents bit-wise logical OR, and $R^i$ is the right shift operator repeated i times.
symbols called “tail symbols,” one for each sequence.
the tail symbols for the sequences V = ATGA and W = CTGA respectively.
Unit  Code
A     000
T     001
G     010
C     011
Vt    100
Wt    101
end of the original sequences.
is 8 bits long.
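sequence.
$$\mathrm{USM}(v^{(0)}) = \begin{bmatrix} 00000000 \\ 00000000 \\ 11111111 \end{bmatrix}$$
$$\mathrm{USM}_j(v^{(k)}) = R^1\,\mathrm{USM}_j(v^{(k-1)}) \odot u_j^{(k)}$$
where $u_j^{(k)} \in \{0, 1\}$, $1 \le j \le n$. $\odot$ is defined below.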
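$$\mathrm{USM}(v^{(1)}) = R^1 \begin{bmatrix} 00000000 \\ 00000000 \\ 11111111 \end{bmatrix} \odot \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 00000000 \\ 00000000 \\ 01111111 \end{bmatrix}$$
coordinate histories, with a 1, if the corresponding entry in the column vector is 1.
$$\mathrm{USM}(v^{(2)}) = R^1 \begin{bmatrix} 00000000 \\ 00000000 \\ 01111111 \end{bmatrix} \odot \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 10000000 \\ 00000000 \\ 00111111 \end{bmatrix}$$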
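The shift-and-set update can be reproduced with ordinary integer operations. The sketch below assumes 8-bit words, the 3-bit codes from the tail-symbol table, and rows ordered from the least significant code bit upward (an ordering inferred from the worked values, not stated in the slides). `busm_forward` and `code_bits` are hypothetical names.

```python
WORD_BITS = 8
MSB = 1 << (WORD_BITS - 1)
N = 3  # bits per symbol code
CODES = {"A": 0b000, "T": 0b001, "G": 0b010, "C": 0b011, "Vt": 0b100, "Wt": 0b101}


def code_bits(symbol):
    """Code bits as a column vector, least significant bit first (assumed order)."""
    return [(CODES[symbol] >> j) & 1 for j in range(N)]


def busm_forward(sequence, tail):
    """Return the forward bUSM coordinate rows after each symbol of the sequence."""
    # The initial coordinates are the tail symbol's bits, repeated across the word.
    rows = [MSB * 2 - 1 if bit else 0 for bit in code_bits(tail)]
    history = []
    for symbol in sequence:
        # Shift each row right by one, then set its top bit if the code bit is 1.
        rows = [(row >> 1) | (MSB if bit else 0)
                for row, bit in zip(rows, code_bits(symbol))]
        history.append(rows)
    return history


v_history = busm_forward("ATGA", "Vt")
w_history = busm_forward("CTGA", "Wt")
for rows in v_history:
    print([format(row, "08b") for row in rows])
```

With these assumptions the rows for v(1) and v(2) match the two worked steps above.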
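and $w^{(m)}$ as:
$$d_f\mathrm{USM}_j(v^{(m)}, w^{(m)}) = \mathrm{USM}_j(v^{(m)}) \oplus \mathrm{USM}_j(w^{(m)})$$
where $\oplus$ is the exclusive OR of each row, and $1 \le j \le n$.
$$d_f\mathrm{USM}(v^{(4)}, w^{(4)}) = \begin{bmatrix} 00100000 \\ 01000000 \\ 00001111 \end{bmatrix} \oplus \begin{bmatrix} 00111111 \\ 01010000 \\ 00001111 \end{bmatrix} = \begin{bmatrix} 00011111 \\ 00010000 \\ 00000000 \end{bmatrix}$$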
wise OR of each row to get: (00011111)
The position
quence symbols, including (v(4), w(4)), that precede the position (v(4), w(4)).
Note, V = ATGA and W = CTGA
the backward distance db(v(i), w(j)).
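The XOR, OR, and leading-bit steps for the forward distance can be sketched as below. The reading that the number of leading zeros in the OR-ed word gives the count of matching symbols is an interpretation of the fragmentary text above; it is consistent with V = ATGA and W = CTGA sharing the three trailing symbols TGA. `forward_distance` is a hypothetical name and 8-bit words are assumed.

```python
WORD_BITS = 8


def forward_distance(rows_v, rows_w):
    """Count leading zeros of the OR over rows of the row-wise XOR."""
    diff = 0
    for row_v, row_w in zip(rows_v, rows_w):
        diff |= row_v ^ row_w
    # int.bit_length() gives the position of the leftmost 1 (0 when diff == 0).
    return WORD_BITS - diff.bit_length()


v4 = [0b00100000, 0b01000000, 0b00001111]
w4 = [0b00111111, 0b01010000, 0b00001111]

print(forward_distance(v4, w4))
print(forward_distance(v4, v4))
```

When the two coordinate sets are identical the result saturates at the word length, which is the finite-word-length limitation addressed next.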
To overcome the limitations imposed by finite word length, we define new distance measures. Let a = USM(v(i)), b = USM(w(j)), and W be the number of bits in a word. We define:
$$d'_f(a, b) = \begin{cases} d_f(a, b) & \text{if } d_f(a, b) < W \\ W & \text{otherwise} \end{cases}$$
$$d'_b(a, b) = \begin{cases} d_b(a, b) & \text{if } d_b(a, b) < W \\ W & \text{otherwise} \end{cases}$$
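forward distance” is defined as:
$$D_f(a, b) = \begin{cases} d'_f(a, b) & \text{if } d'_f(a, b) < W \\ D_f(c, d) + W & \text{otherwise} \end{cases}$$
backward distance” is defined as:
$$D_b(a, b) = \begin{cases} d'_b(a, b) & \text{if } d'_b(a, b) < W \\ D_b(e, f) + W & \text{otherwise} \end{cases}$$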
The exact length of a similar segment is determined by: D(a, b) =
if Df(a, b) > 0
The bUSM procedure:
similar symbols in a similar segment,
and
Can bUSM be used to:
stored as 2n-dimensional points in bUSM-space?
[1] Jonas Almeida and Susana Vigna. Universal Sequence Map (USM)
February 2002. www.biomedcentral.com/1471-2105/3/6.
[2] H. J. Jeffrey. Chaos Game Representation of Gene Structure. Nucleic Acids Research, 18(8):2163–2170, April 1990. www.pubmedcentral.nih.gov.
[3] Almeida, Carroco, Maretzek, Noble, and Fletcher. Analysis of Genomic Sequences by Chaos Game Representation. Bioinformatics, 17(5):429–437, May 2001. bioinformatics.oupjournals.org/content/vol17/issue5/index.dtl.
[4] John Schwacke and Jonas S Almeida. Efficient Boolean Implementation of Universal Sequence Maps (bUSM). BMC Bioinformatics, 3(28):1–11, October 2002. www.biomedcentral.com/1471-2105/3/28.