Universal Sequence Maps of Arbitrary Discrete Sequences By Almeida - PowerPoint PPT Presentation

Universal Sequence Maps of Arbitrary Discrete Sequences By Almeida and Vinga Presented By Chris Standish chriss@cs.tamu.edu

1 Motivation
• Sequence alignment techniques assume that contiguity is conserved between homologous segments.
• This assumption of contiguity is violated by genetic processes such as:
  - recombination: the exchange of regions of the genome between paired homologous chromosomes during meiosis (crossover).
  - genome shuffling: an experimental technique that allows recombination between multiple parents with the goal of improving individual genes.
• Alignment-free techniques attempt to overcome this limitation. The alignment-free technique we will discuss is called Universal Sequence Maps (USM) [1]. It is based on a sequence representation technique called "Chaos Game Representation" [2].

2 Chaos Game Representation
• Chaos Game Representation (CGR) maps a nucleotide sequence into a continuous two-dimensional space on the unit square, called CGR-space.
• CGR has the property that each unique genomic sequence $S = s^{(1)} s^{(2)} \ldots s^{(i)} \ldots$, of any length, has a unique position in CGR-space.
• The iterative function used in CGR is defined by [3]:
$$\mathrm{CGR}_j(s^{(0)}) = \tfrac{1}{2}$$
$$\mathrm{CGR}_j(s^{(i)}) = \mathrm{CGR}_j(s^{(i-1)}) + \tfrac{1}{2}\left(u_j^{(i)} - \mathrm{CGR}_j(s^{(i-1)})\right)$$
where $u_j^{(i)}$ is the $j$-th bit of the binary encoding of sequence symbol $s^{(i)}$, and $1 \le j \le 2$.
• This mapping is a bijection and so it is invertible.

3 CGR Binary Encoding
• Each unique symbol in the genomic sequence is encoded in binary.
• For CGR the binary encoding is defined as:

Unit  Code
A     00
T     01
C     10
G     11
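The CGR iteration with this encoding can be sketched as follows. This is a minimal illustration, assuming the update is written in the equivalent form "half the previous coordinate plus half the code bit" and that the walk starts at (1/2, 1/2); the names `CGR_CODE` and `cgr_points` are illustrative and do not come from the slides.

```python
# Two-bit CGR encoding from the table above: one bit per coordinate.
CGR_CODE = {"A": (0, 0), "T": (0, 1), "C": (1, 0), "G": (1, 1)}


def cgr_points(seq):
    """Return the CGR point reached after each symbol of seq."""
    point = (0.5, 0.5)  # assumed starting point, the centre of the unit square
    points = []
    for symbol in seq:
        bits = CGR_CODE[symbol]
        # Move halfway from the current point towards the symbol's corner.
        point = tuple(0.5 * p + 0.5 * u for p, u in zip(point, bits))
        points.append(point)
    return points


points = cgr_points("ATGCGAGATGT")
x, y = points[-1]
# The quadrant of the last point identifies the last symbol (here T = 01).
last_bits = (int(x >= 0.5), int(y >= 0.5))
```

Because every step halves the distance to a corner, the quadrant containing a point always matches the code of the most recent symbol, which is what makes the sequence recoverable.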

Figure 1: CGR representation of ATGCGAGATGT.

Figure 2: Using quadrants we can recover the original sequence.

4 Universal Sequence Maps
• USM generalizes the 2-D CGR-space to an n-dimensional space, where n depends on the number of unique symbols in the sequence.
• In addition to the "forward" map of CGR, there is also a "backward" map.
• Together, the "forward" and "backward" maps allow us to estimate the length of a similar segment.

4.1 Binary Encoding of Unique Symbols
• As a concrete example, we let these two stanzas of a poem be our two sequences:
  I am a poet. I am very fond of bananas.
  I am of very fond bananas. Am I a poet?
• There are nineteen unique symbols, so we encode each symbol with a 5-bit binary code, where $5 = \lceil \log_2(19) \rceil$.
• Each unique symbol is placed in a unique corner of the 5-dimensional unit hypercube.

4.2 USM Binary Code

Unit     Code      Unit  Code
(space)  00000     I     00100
.        00001     m     01010
?        00010     n     01011
A        00011     o     01100
a        00101     p     01101
b        00110     r     01110
d        00111     s     01111
e        01000     t     10000
f        01001     v     10001
y        10010
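The encoding step can be sketched in a few lines. This is a minimal illustration, assuming codes are assigned by counting upward over the symbols in sorted character order, which appears to be consistent with the table above; the function name `binary_codes` is illustrative.

```python
def binary_codes(*sequences):
    """Assign each unique symbol a fixed-width binary code (sorted order assumed)."""
    symbols = sorted(set("".join(sequences)))
    # (m - 1).bit_length() equals ceil(log2(m)) for m >= 2; use at least 1 bit.
    n = max(1, (len(symbols) - 1).bit_length())
    codes = {s: format(i, f"0{n}b") for i, s in enumerate(symbols)}
    return codes, n


stanza_1 = "I am a poet. I am very fond of bananas."
stanza_2 = "I am of very fond bananas. Am I a poet?"
codes, n = binary_codes(stanza_1, stanza_2)
```

Each code string names one corner of the n-dimensional unit hypercube, with one bit per coordinate.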

4.3 Forward Map
• The forward map defines a position in an $n$-dimensional unit hypercube for each prefix $S_i = s^{(1)} s^{(2)} \ldots s^{(i)}$ of the sequence $S = s^{(1)} s^{(2)} \ldots s^{(k)}$.
• For each coordinate $j = 1, \ldots, n$ of $n$-space, the $\mathrm{USM}_j$ coordinates for each prefix $S_i$ are determined as follows:
$$\mathrm{USM}_j(s^{(0)}) = \mathrm{Unif}([0, 1])$$
$$\mathrm{USM}_j(s^{(i)}) = \mathrm{USM}_j(s^{(i-1)}) + \tfrac{1}{2}\left(u_j^{(i)} - \mathrm{USM}_j(s^{(i-1)})\right) = \tfrac{1}{2}\,\mathrm{USM}_j(s^{(i-1)}) + \tfrac{1}{2}\,u_j^{(i)}$$
where $u_j^{(i)} \in \{0, 1\}$, $1 \le i \le k$, and $1 \le j \le n$.
• Notice that the formulas above use the binary encoding $u^{(i)}$ of $s^{(i)}$.

4.4 Backward Map
• The backward map defines another $n$-dimensional unit hypercube.
• For each coordinate $j = 1, \ldots, n$ of $n$-space, the $\mathrm{USM}_{n+j}$ coordinates for each suffix starting at $s^{(i)}$ are determined as follows:
$$\mathrm{USM}_{n+j}(s^{(k+1)}) = \mathrm{Unif}([0, 1])$$
$$\mathrm{USM}_{n+j}(s^{(i)}) = \tfrac{1}{2}\,\mathrm{USM}_{n+j}(s^{(i+1)}) + \tfrac{1}{2}\,u_j^{(i)}$$
where $i = k, k-1, \ldots, 1$.
• Together the forward and backward maps transform each position of a sequence to a point in $2n$-dimensional space.
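The two recursions can be sketched as a pair of small functions. This is an illustrative sketch, not the authors' implementation: the names `forward_map` and `backward_map` are assumptions, the input is taken to be a list of already-encoded bit tuples, and the starting points are passed in explicitly (the slides draw them from Unif([0, 1])).

```python
import random


def forward_map(codes, start):
    """Forward USM point after each symbol, walking left to right."""
    point, points = list(start), []
    for bits in codes:
        point = [0.5 * p + 0.5 * u for p, u in zip(point, bits)]
        points.append(point)
    return points


def backward_map(codes, start):
    """Backward USM point at each symbol, walking right to left."""
    point, points = list(start), []
    for bits in reversed(codes):
        point = [0.5 * p + 0.5 * u for p, u in zip(point, bits)]
        points.append(point)
    points.reverse()  # re-index so points[i] belongs to symbol i
    return points


codes = [(0, 0), (0, 1)]  # e.g. "AT" under the two-bit CGR encoding
start_f = [random.random() for _ in range(2)]
start_b = [random.random() for _ in range(2)]
# Concatenate forward and backward coordinates into one 2n-dimensional point per symbol.
usm = [f + b for f, b in zip(forward_map(codes, start_f), backward_map(codes, start_b))]
```

The backward map is the same halving update applied to the reversed sequence, so both directions share one inner loop.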

5 Sequence Similarity
• The distance between two sequences in USM-space estimates sequence similarity.
• The "bi-directional distance" measure $D$ between two sequences $A = a^{(1)} a^{(2)} \ldots a^{(r)} \ldots$ and $B = b^{(1)} b^{(2)} \ldots b^{(s)} \ldots$ is defined as:
$$D(a^{(r)}, b^{(s)}) = d_f(a^{(r)}, b^{(s)}) + d_b(a^{(r)}, b^{(s)})$$
• The "forward distance" is defined as:
$$d_f(a^{(r)}, b^{(s)}) = -\log_2\left(\max_{1 \le j \le n} \left|\mathrm{USM}_j(a^{(r)}) - \mathrm{USM}_j(b^{(s)})\right|\right)$$
• The forward distance measures the number of similar contiguous symbols preceding $a^{(r)}$ and $b^{(s)}$.

• The "backward distance" is defined as:
$$d_b(a^{(r)}, b^{(s)}) = -\log_2\left(\max_{1 \le j \le n} \left|\mathrm{USM}_{n+j}(a^{(r)}) - \mathrm{USM}_{n+j}(b^{(s)})\right|\right)$$
• The backward distance measures the number of similar contiguous symbols succeeding $a^{(r)}$ and $b^{(s)}$.
• Both distance measures $d_f$ and $d_b$ overestimate the true number of similar contiguous symbols, so $D$ overestimates the total number of similar contiguous symbols.
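The distance formulas can be sketched directly. This is a minimal illustration: the function names are assumptions, each point is assumed to be a flat list of 2n coordinates (forward first, then backward), and identical coordinates are reported as infinite distance because the logarithm of zero is undefined.

```python
import math


def usm_distance(p, q):
    """-log2 of the largest coordinate gap between two points."""
    gap = max(abs(x - y) for x, y in zip(p, q))
    return math.inf if gap == 0 else -math.log2(gap)


def bidirectional_distance(point_a, point_b, n):
    """D = d_f + d_b for two 2n-dimensional USM points."""
    d_f = usm_distance(point_a[:n], point_b[:n])  # forward coordinates
    d_b = usm_distance(point_a[n:], point_b[n:])  # backward coordinates
    return d_f + d_b


# Hypothetical points with n = 2: forward gap 1/8, backward gap 1/4.
a = [0.125, 0.25, 0.5, 0.75]
b = [0.25, 0.25, 0.5, 0.5]
D = bidirectional_distance(a, b, n=2)  # 3 + 2 = 5
```

Each matching symbol halves the coordinate gap, so the negative base-2 logarithm counts halvings. Two different symbols can already start within 1/2 of each other in every coordinate, which is one way to see why the measures overestimate.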

Algorithm 5.1 USM Compare
Input: Two sequences
Output: A matrix D of bi-directional distance values
1. Identify the unique symbols in the input sequences
2. Find the dimension n of the unit hypercube
3. Map each unique symbol to a unique corner of the unit hypercube
4. Iteratively generate forward USM coordinates for each input sequence
5. Iteratively generate backward USM coordinates for each input sequence
6. Find the bi-directional distance matrix D
7. Return D
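The steps of Algorithm 5.1 can be sketched end to end as follows. This is a hedged sketch under stated assumptions: symbols are coded in sorted order, both maps of both sequences start from the same fixed value `start` rather than a random draw, and the name `usm_compare` is illustrative.

```python
import math


def usm_compare(seq_a, seq_b, start=0.5):
    """Return the matrix D of bi-directional distances (rows: seq_a, columns: seq_b)."""
    # Steps 1-3: unique symbols, hypercube dimension, one corner per symbol.
    symbols = sorted(set(seq_a) | set(seq_b))
    n = max(1, (len(symbols) - 1).bit_length())
    code = {s: [int(b) for b in format(i, f"0{n}b")] for i, s in enumerate(symbols)}

    def walk(seq):
        # Shared halving update used by both the forward and backward maps.
        point, out = [start] * n, []
        for s in seq:
            point = [0.5 * p + 0.5 * u for p, u in zip(point, code[s])]
            out.append(point)
        return out

    def dist(p, q):
        gap = max(abs(x - y) for x, y in zip(p, q))
        return math.inf if gap == 0 else -math.log2(gap)

    # Steps 4-5: forward coordinates, and backward coordinates via the reversed sequence.
    fwd_a, fwd_b = walk(seq_a), walk(seq_b)
    bwd_a, bwd_b = walk(seq_a[::-1])[::-1], walk(seq_b[::-1])[::-1]

    # Steps 6-7: bi-directional distance for every pair of positions.
    return [[dist(fwd_a[r], fwd_b[s]) + dist(bwd_a[r], bwd_b[s])
             for s in range(len(seq_b))]
            for r in range(len(seq_a))]


D = usm_compare("ATGA", "CTGA")
```

With a shared fixed starting point, positions whose prefixes or suffixes are identical all the way to the sequence boundary get an infinite distance; random starting points, as on the slides, would avoid that.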

6 Recovering the Sequence
• In theory we can recover the original sequence from a given point in USM-space, since the forward and backward mappings are bijections, i.e., each has an inverse.
• In practice we are limited by the precision of the machine.
• As an example, the USM coordinates of the 16th character 'a' are:
  forward:  [0.0156, 0.0138, 0.6314, 0.0001, 0.5338]
  backward: [0.0703, 0.3004, 0.5169, 0.2742, 0.5652]

• We recover the original sequence by inverting the corresponding map. For the inverse forward map:
$$\mathrm{USM}_j(a^{(15)}) = 2\,\mathrm{USM}_j(a^{(16)}) - u_j^{(16)}$$
$$2\begin{pmatrix} 0.0156 \\ 0.0138 \\ 0.6314 \\ 0.0001 \\ 0.5338 \end{pmatrix} - \begin{pmatrix} 0 \\ 0 \\ 1 \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 0.0312 \\ 0.0276 \\ 0.2628 \\ 0.0002 \\ 0.0676 \end{pmatrix} \rightarrow \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix} \quad \text{(a space)}$$
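The inverse forward map can be sketched as a loop that reads off one code per step. This is a minimal illustration: the name `invert_forward` is an assumption, and each bit is taken to be 1 when its coordinate is at least 1/2.

```python
def invert_forward(point, steps):
    """Read back `steps` symbol codes from a forward USM point, most recent first."""
    point = list(point)
    codes = []
    for _ in range(steps):
        # The half of the unit interval each coordinate falls in gives the code bit.
        bits = [1 if p >= 0.5 else 0 for p in point]
        codes.append("".join(str(b) for b in bits))
        # Undo one forward step: previous = 2 * current - bit.
        point = [2 * p - u for p, u in zip(point, bits)]
    return codes


forward_16 = [0.0156, 0.0138, 0.6314, 0.0001, 0.5338]
recovered = invert_forward(forward_16, 5)
```

With the four-digit values printed above, the first five codes read back as 'a', space, 'I', space and '.', matching the end of "I am a poet. I a". Going one step further with these rounded values no longer appears to give the code for 't', which illustrates the precision limit.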

Inverse forward map    Inverse backward map
00100  I               00101  a
00000  (space)         01010  m
00101  a               00000  (space)
01010  m               10001  v
00000  (space)         01000  e
00101  a               01110  r
00000  (space)         10010  y
01101  p               00000  (space)
01100  o               01001  f
01000  e               01100  o
10000  t               01011  n
00001  .               00111  d
00000  (space)         00000  (space)
00100  I               01100  o
00000  (space)         01001  f
00101  a               00000  (space)

Figure 3: Distance matrices D for USM (left) versus bUSM (right) on the two stanzas. Brighter regions indicate larger distance values.

Figure 4: Distance matrices D for a 100-nucleotide mRNA using both USM (left) and bUSM (right).

7 Boolean USM
There are some problems with USMs:
• The length of the sequence that can be recovered is limited by the machine's precision.
• The distance measure D overestimates the true number of similar symbols in a similar segment.
The boolean USM (bUSM) procedure [4] fixes both of these problems by:
• Replacing arithmetic operations with equivalent boolean operations.
• Changing the symbol encoding scheme.

8 Boolean USM Coordinates
• We will look at the forward map only.
• Conceptually, we represent a coordinate (dimension) $c$ in boolean USM-space as an infinite bit sequence.
• Each coordinate in bUSM-space represents an infinite bit history of the symbols that have been seen so far:
$$c = \bigvee_{i=1}^{\infty} R^i a_i$$
where $a_i \in \{0, 1\}$, $\bigvee_{i=1}^{\infty}$ represents bit-wise logical OR, and $R^i$ is the right shift operator repeated $i$ times.

9 Tail Symbols
• We also need to add two new symbols, called "tail symbols," to the set of sequence symbols, one for each sequence.
• Suppose we have the following encoding, where $V_t$ and $W_t$ are the tail symbols for the sequences V = ATGA and W = CTGA respectively.

Unit  Code
A     000
T     001
G     010
C     011
V_t   100
W_t   101

• Conceptually we add these tail symbols to the beginning and the end of the original sequences.

10 bUSM Recursion Formula
• Consider a computer word of 8 bits, i.e., our coordinate bit history is 8 bits long.
• The bUSM recursion is initialized with the tail symbol for each sequence:
$$\mathrm{USM}(v^{(0)}) = \begin{pmatrix} 00000000 \\ 00000000 \\ 11111111 \end{pmatrix}$$
• The bUSM recursion is written as:
$$\mathrm{USM}_j(v^{(k)}) = R^1\left[\mathrm{USM}_j(v^{(k-1)})\right] \odot u_j^{(k)}$$
where $u_j^{(k)} \in \{0, 1\}$ and $1 \le j \le n$. The operator $\odot$ is defined below.

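11 Example
• For instance, for the first symbol of V, A = (000), we get:
$$\mathrm{USM}(v^{(1)}) = R^1 \begin{pmatrix} 00000000 \\ 00000000 \\ 11111111 \end{pmatrix} \odot \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 00000000 \\ 00000000 \\ 01111111 \end{pmatrix}$$
• Here $\odot$ means: in each right-shifted coordinate history, replace the entry in the first column with a 1 if the corresponding entry in the column vector is 1.
• For the second symbol of V, T = (001), the column vector appears to list the code bits from last to first, consistent with the tail symbol $V_t$ = (100) filling the third row with ones:
$$\mathrm{USM}(v^{(2)}) = R^1 \begin{pmatrix} 00000000 \\ 00000000 \\ 01111111 \end{pmatrix} \odot \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 10000000 \\ 00000000 \\ 00111111 \end{pmatrix}$$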
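The recursion in this example can be sketched with integer bit operations. This is a minimal sketch, assuming each coordinate history is stored as an 8-bit integer and that the operator written ⊙ amounts to OR-ing the new bit into the most significant position after the shift; the names `WORD`, `MASK`, and `busm_step` are illustrative.

```python
WORD = 8                  # bits kept per coordinate history
MASK = (1 << WORD) - 1    # keeps every history within WORD bits


def busm_step(history, bits):
    """Shift every coordinate history right by one and write the new bit in front."""
    return [((h >> 1) | (b << (WORD - 1))) & MASK for h, b in zip(history, bits)]


v0 = [0b00000000, 0b00000000, 0b11111111]  # initial histories from the tail symbol
v1 = busm_step(v0, [0, 0, 0])              # first symbol, column vector (0, 0, 0)
v2 = busm_step(v1, [1, 0, 0])              # second symbol, column vector (1, 0, 0)
rows = [format(h, "08b") for h in v2]      # ['10000000', '00000000', '00111111']
```

Because only shifts and ORs are used, no rounding occurs; the history length is bounded by the word size rather than by floating-point precision.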
