Optical Mapping Data: Data Generation and Algorithms
Sample Preparation (Fragments) → Sequencing (Reads) → Assembly (Contigs) → Analysis
What is an Optical Map?
GGCTTCCGACCACCACAACCGAATTATGAAGGATACCGAA 6,19,35
Optical maps are ordered, genome-wide, high-resolution restriction maps.
- Much longer than reads. For example, the average map for goat covers 360,000 bp
- Now commercially available
- Isolated DNA
- Microfluidic device: DNA is elongated and cleaved on the optical mapping surface
- Epifluorescence microscope with CCD camera
6 3 3 4 9
6 3 3 4 9 6 3 9 4
Genome wide optical map
“There is [..] a critical need for the continued development and public release of software tools for processing optical mapping data ..”
- GigaScience 2014
Goal: a tool to align each contig to a segment of an optical map
[Figure: the Sample Preparation, Sequencing, Assembly, Analysis pipeline, with contigs and optical map data (genome-wide optical map) as inputs to Analysis]
- Previous approaches use dynamic programming
- Burrows-Wheeler Transform (BWT) would improve time efficiency
- Challenges in applying BWT: (1) sizing error and (2) alphabet size
Challenges
SIZING ERROR
Actual optical map values:            6 3 9   4
Optical map obtained from experiment: 5 4 9.5 6
Sizing error per fragment:            1 1 0.5 2
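The sizing error in the example above is the per-fragment absolute difference between the true and the measured sizes. A minimal sketch of that arithmetic follows; the function names and the tolerance check are illustrative assumptions, not part of the original slides.

```python
def sizing_errors(actual, measured):
    """Absolute size difference for each pair of corresponding fragments."""
    return [abs(a - m) for a, m in zip(actual, measured)]


def within_tolerance(actual, measured, tolerance):
    """True if every measured fragment is within `tolerance` of its true size."""
    return all(err <= tolerance for err in sizing_errors(actual, measured))


actual = [6, 3, 9, 4]        # actual optical map values from the slide
measured = [5, 4, 9.5, 6]    # values obtained from the experiment
print(sizing_errors(actual, measured))
print(within_tolerance(actual, measured, tolerance=2))
```

Because measured sizes rarely equal the true sizes, an exact-match index cannot be applied directly; every comparison has to allow a tolerance.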
ALPHABET SIZE > 16,000
Twin
[Figure: alignment of contigs to the genome-wide optical map; contigs shown in the order Contig 1, Contig 2, Contig 3, Contig 5, Contig 4]
Twin Algorithm
- 1. In silico digest contigs into optical maps.
TTTCCGACCACTTTTCCGAATTATGACCGAA 4,13,24
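An in silico digest can be sketched as: find every occurrence of the enzyme's recognition sequence, cut there, and report the fragment lengths between consecutive cuts. The sketch below is only an illustration. It assumes the NheI site G^CTAGC that appears later in this deck, scans the forward strand only (the site is its own reverse complement), and counts the two end fragments. The function name and the toy sequence are hypothetical.

```python
def in_silico_digest(seq, site="GCTAGC", cut_offset=1):
    """Return fragment lengths after cutting `seq` at every occurrence of `site`.

    cut_offset is the number of bases of the site left of the cut
    (1 for G^CTAGC). The first and last fragments are included.
    """
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


contig = "AAGCTAGCTTTTGCTAGCAA"      # toy contig, not taken from the slides
print(in_silico_digest(contig))      # fragment sizes in base pairs
```

The resulting list of fragment sizes has the same form as an optical map, which is what makes a contig comparable to the genome-wide map.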
- 2. Build FM-index* and auxiliary data structures on the genome-wide optical map.
* a data structure that allows compression of the input text while still permitting fast substring queries
BWT and FM-index
A suffix array (SA) of a string S lists the starting positions of the suffixes of S, sorted into alphabetical order.
S = acaaacgn
Suffixes in text order    Suffixes in sorted order
1 acaaacgn                3 aaacgn
2 caaacgn                 4 aacgn
3 aaacgn                  1 acaaacgn
4 aacgn                   5 acgn
5 acgn                    2 caaacgn
6 cgn                     6 cgn
7 gn                      7 gn
8 n                       8 n
SA = 3 4 1 5 2 6 7 8
The suffix array clusters all the occurrences of every pattern together into a contiguous range!
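The clustering property can be checked on the slide's string with a naive construction. This is a sketch for illustration only: it sorts the suffixes directly, which is far too slow for genome-scale input, and the function names are assumptions.

```python
from bisect import bisect_left, bisect_right


def suffix_array(s):
    """1-based starting positions of the suffixes of s, in sorted order."""
    return [i + 1 for i in sorted(range(len(s)), key=lambda i: s[i:])]


def pattern_range(s, sa, pattern):
    """Half-open range [lo, hi) of SA rows whose suffix starts with pattern."""
    # Prefixes of sorted suffixes are themselves in sorted order,
    # so a binary search finds the contiguous block.
    prefixes = [s[p - 1:p - 1 + len(pattern)] for p in sa]
    return bisect_left(prefixes, pattern), bisect_right(prefixes, pattern)


s = "acaaacgn"
sa = suffix_array(s)
lo, hi = pattern_range(s, sa, "ac")
print(sa)            # suffix array of the slide's string
print(sa[lo:hi])     # all occurrences of "ac" sit in one contiguous block
```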
The Burrows-Wheeler Transform (BWT) is a permutation of the string such that BWT[i] = S[SA[i] - 1], wrapping around to the last character of S when SA[i] = 1.
Sorted rotations of S = acaaacgn:
3 aaacgnac
4 aacgnaca
1 acaaacgn
5 acgnacaa
2 caaacgna
6 cgnacaaa
7 gnacaaac
8 nacaaacg
Extract the last column of the sorted rotations to obtain the BWT.
rank_K(i): return the number of K's that occur before position i of the BWT.
i     1 2 3 4 5 6 7 8
BWT   c a n a a a c g
rank  0 0 0 1 2 3 1 0
(rank row: occurrences of the same symbol earlier in the BWT)
rank_a[5] = 2
FM-index is the compressed version of the BWT and rank.
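The two ingredients can be reproduced for the slide's string with a few lines. This is an uncompressed sketch, not an FM-index implementation. It assumes the final character of S acts as a unique end marker, and that rank counts occurrences strictly before the queried position, which is the convention that reproduces rank_a[5] = 2.

```python
def bwt_from_text(s):
    """BWT of s, assuming the last character of s is a unique end marker."""
    sa = sorted(range(len(s)), key=lambda i: s[i:])   # 0-based suffix array
    # s[i - 1] is the character preceding each suffix; for i == 0 the
    # index -1 wraps around to the last character of s.
    return "".join(s[i - 1] for i in sa)


def rank(bwt, symbol, i):
    """Number of occurrences of symbol strictly before 1-based position i."""
    return bwt[:i - 1].count(symbol)


bwt = bwt_from_text("acaaacgn")
print(bwt)
print(rank(bwt, "a", 5))
```

A real FM-index stores the BWT and sampled rank values in compressed form so that rank queries take constant or logarithmic time instead of a linear scan.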
Twin Algorithm
- 3. Using the FM-index we find all alignments between the optical map and the in silico digested contigs.
- Modified FM-index Backward Search Algorithm
FM-Index Backward Search
A recursive algorithm for finding substrings using rank and BWT
[Figure: successive backward search steps using rank[c], rank[a], rank[a]]
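Standard (exact) backward search can be sketched as follows. The pattern is processed from its last symbol to its first, and each step narrows a suffix-array interval using only rank on the BWT plus the count of smaller symbols. The sketch is written as a loop rather than a recursion, uses linear-time counting in place of a real rank structure, and assumes the pattern does not contain the text's end marker. The function name is illustrative.

```python
from bisect import bisect_left


def backward_search(text, pattern):
    """1-based start positions of pattern in text (last char = unique end marker)."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    first_col = sorted(text)              # first column of the sorted rotations
    lo, hi = 0, len(text)                 # current SA interval [lo, hi)
    for c in reversed(pattern):
        start = bisect_left(first_col, c) # number of symbols smaller than c
        lo = start + bwt[:lo].count(c)    # rank of c before lo
        hi = start + bwt[:hi].count(c)    # rank of c before hi
        if lo >= hi:
            return []
    return sorted(p + 1 for p in sa[lo:hi])


print(backward_search("acaaacgn", "aa"))
print(backward_search("acaaacgn", "ac"))
```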
Modified FM-Index Backward Search
- Sizing error and alphabet size are challenges to overcome
- We cannot afford a brute force enumeration of the alphabet at each step in the backward search
- Novelty for optical maps: Wavelet Tree
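One way to picture the modification: at every backward step, instead of extending with a single symbol, the search branches over each fragment size that actually occurs in the current interval and lies within the tolerance. The sketch below is an assumption-laden illustration, not Twin's implementation. It uses -1 as an end marker, a plain scan of the BWT interval where a wavelet tree query would be used, and linear-time rank. The names `build_index` and `tolerant_search` are hypothetical.

```python
from bisect import bisect_left


def build_index(frags):
    """Suffix array, BWT and sorted symbols for a list of fragment sizes."""
    text = list(frags) + [-1]                 # -1: end marker below any size
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = [text[i - 1] for i in sa]           # index -1 wraps to the marker
    return sa, bwt, sorted(text)


def tolerant_search(index, pattern, tol):
    """0-based positions where each fragment matches pattern within +/- tol."""
    sa, bwt, sorted_syms = index
    hits = []

    def extend(k, lo, hi):
        if k < 0:                             # whole pattern consumed
            hits.extend(sa[lo:hi])
            return
        # Stand-in for the wavelet tree: distinct symbols in bwt[lo:hi].
        for c in sorted(set(bwt[lo:hi])):
            if c >= 0 and abs(c - pattern[k]) <= tol:
                start = bisect_left(sorted_syms, c)
                extend(k - 1,
                       start + bwt[:lo].count(c),
                       start + bwt[:hi].count(c))

    extend(len(pattern) - 1, 0, len(bwt))
    return sorted(hits)


optical_map = [2, 11, 10, 23, 53, 3, 5, 10, 14, 9, 11]
index = build_index(optical_map)
print(tolerant_search(index, [10, 12], tol=3))
```

Only symbols present in the interval are tried, so the cost of a step depends on how many candidate sizes exist there rather than on the full alphabet size.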
Wavelet Tree
A Wavelet Tree converts a string into a balanced binary tree of bit vectors, where a 0 replaces half of the symbols and a 1 replaces the other half. This definition is applied recursively.
{A,C,G,T} is encoded as {0,0,1,1}
ACGTATATAGGAAGA 001101010110010
No ambiguity!
Left child: {A,C} is encoded as {0,1}
ACAAAAAA 01000000
Right child: {G,T} is encoded as {0,1}
GTTTGGG 0111000
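The recursive definition can be traced on the slide's DNA string. The sketch below lists the bit vector of every internal node. It assumes the sorted alphabet is split in half at each level (lower half 0, upper half 1) and stores the symbols alongside the bits for readability, which a real wavelet tree would not do. The function name is illustrative.

```python
def wavelet_nodes(text, alphabet=None):
    """Return (symbols, bits) for every internal node, in pre-order."""
    if alphabet is None:
        alphabet = sorted(set(text))
    if len(alphabet) <= 1:
        return []                               # leaf: nothing left to encode
    mid = (len(alphabet) + 1) // 2
    low = set(alphabet[:mid])                   # symbols encoded as 0
    bits = "".join("0" if ch in low else "1" for ch in text)
    left = "".join(ch for ch in text if ch in low)
    right = "".join(ch for ch in text if ch not in low)
    return ([(text, bits)]
            + wavelet_nodes(left, alphabet[:mid])
            + wavelet_nodes(right, alphabet[mid:]))


for symbols, bits in wavelet_nodes("ACGTATATAGGAAGA"):
    print(symbols, bits)
```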
Which symbols in {A, G} exist in input string?
To match x we need to find all the substrings within the range x +/- y, for tolerance y.
Modified FM-Index Backward Search
To match 9 we need to find all the substrings within the range [6, 12], for tolerance 3.
Wavelet Tree Query on the genome-wide optical map (values: bits at each node):
2,11,10,23,53,3,5,10,14,9,11: 0,1,0,1,1,0,0,0,1,0,1
  2,10,3,5,10,9: 0,1,0,0,1,1
    2,3,5: 0,0,1
      2,3: 0,1
      5: 1
    10,9,10: 0,1,0
  11,23,53,14,11: 0,1,1,0,0
    11,14,11: 0,1,0
    23,53: 0,1
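A wavelet tree answers "which values in [6, 12] occur in this stretch of the sequence?" by descending only into subtrees whose value range overlaps the query and whose mapped interval is non-empty. The following is a pointer-based sketch under the assumption that each node splits its sorted alphabet in half; the bit assignment at individual nodes may differ in detail from the tree drawn on the slide. The class and method names are hypothetical.

```python
class WaveletTree:
    """Illustrative (non-succinct) wavelet tree over a sequence of integers."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.left = self.right = None
        if len(self.alphabet) <= 1:
            return                                   # leaf node
        mid = (len(self.alphabet) + 1) // 2
        low = set(self.alphabet[:mid])               # values encoded as 0
        self.ones = [0]                              # ones[i] = 1-bits in seq[:i]
        for v in seq:
            self.ones.append(self.ones[-1] + (0 if v in low else 1))
        self.left = WaveletTree([v for v in seq if v in low],
                                self.alphabet[:mid])
        self.right = WaveletTree([v for v in seq if v not in low],
                                 self.alphabet[mid:])

    def range_symbols(self, i, j, lo, hi):
        """Distinct values v with lo <= v <= hi that occur in seq[i:j]."""
        if i >= j or not self.alphabet:
            return []
        if self.alphabet[-1] < lo or self.alphabet[0] > hi:
            return []                                # value range misses query
        if self.left is None:
            return [self.alphabet[0]]
        i1, j1 = self.ones[i], self.ones[j]          # map interval to children
        return (self.left.range_symbols(i - i1, j - j1, lo, hi)
                + self.right.range_symbols(i1, j1, lo, hi))


optical_map = [2, 11, 10, 23, 53, 3, 5, 10, 14, 9, 11]
wt = WaveletTree(optical_map)
print(wt.range_symbols(0, len(optical_map), 6, 12))  # candidates for 9 +/- 3
```

In a tolerant backward search the interval [i, j) would be the current suffix-array interval over the BWT, so only sizes that can actually extend the match are enumerated.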
Twin Algorithm
- 4. Output the alignments in PSL format.
TWIN Test Datasets
TWIN Results
Twin is the first alignment method that is capable of handling large genome sizes
- It is the only index-based tool and is orders of magnitude faster than existing approaches (patent pending)
- Pine tree (20 Gb) would take ~84 machine years with SOMA but a couple of hours with Twin
TWIN: Optical Map Aligner
CORRECTING ERRORS IN GENOMES
Mis-assembly in Genomes
Mis-assembly: Significantly large insertion, deletion, inversion, or rearrangement that is the result of decisions made by the assembly program
[Figure: block diagrams (blocks A, R, B) of a correct assembly, a rearrangement, a deletion, and an insertion]
Extensive vs. Local Mis-assemblies
Extensive Mis-assembly: 1 kbp in size and regions align to different strands or different chromosomes.
Local Mis-assembly: smaller in size and on the same strand and same chromosome.
De Bruijn Graph of a Genome
Example Genome: ABCDEFGHICDEFGKL
Nodes (3-mers): ABC BCD CDE DEF EFG FGK GKL FGH GHI HIC ICD
Path: ABC BCD CDE DEF EFG FGK GKL
Resulting Erroneous Genome: ABCDEFGKL
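The example can be reproduced with a small graph builder. In this sketch the nodes are the 3-mers listed on the slide and an edge joins 3-mers that are consecutive in the genome. The repeat CDEFG makes the node EFG branch, and a path that skips the loop spells the shortened genome. Function names are illustrative.

```python
from collections import defaultdict


def de_bruijn(genome, k=3):
    """Map each k-mer to the set of k-mers that follow it in the genome."""
    graph = defaultdict(set)
    for i in range(len(genome) - k):
        graph[genome[i:i + k]].add(genome[i + 1:i + k + 1])
    return graph


def spell(path):
    """Sequence spelled by a path of overlapping k-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


graph = de_bruijn("ABCDEFGHICDEFGKL")
print(sorted(graph["EFG"]))          # the branch created by the repeat

short_path = ["ABC", "BCD", "CDE", "DEF", "EFG", "FGK", "GKL"]
print(spell(short_path))             # sequence obtained when the loop is skipped
```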
misSEQuel*: Reads + Contigs → Refined Contigs
*(Muggli, Puglisi, Ronen, Boucher, ISMB 2015)
Optical Map Data
misSEQuel Algorithm
- 1. Align sequence reads to contigs using a standard alignment tool.
GGCTTCCGACCACCACAAATGGATTATGAAGGATATATGGA
GGCTTCCGACCACCACAAATGGATATGAAGGATATATGGATTATGAAGGATATA
- 2. Build the red-black positional de Bruijn graph based on the alignment.
Next Generation Sequencing (NGS)
Fragment: ACGTAGAATCGACCATGGGGACGTAGAATACGA
Reads: ACGTAGAATCGACCATG GGGACGTAGAATACGAC ACGTAGAATACGTAGAA
Paired-End Reads / Mate-Pair Reads
Read Mate Pair Concordance
[Figure: mate-pair alignments over a correct assembly, a rearrangement, and an inversion (blocks A, R, B)]
Read Depth
[Figure: read depth over a correct assembly, an insertion, and a deletion (blocks A, R, B)]
Red-Black Positional De Bruijn Graph
I. Choose a value of k and Δ.
II. Each positional k-mer (s_k) is an edge between two positional (k−1)-mers: the prefix and suffix of s_k.
III. Positional (k−1)-mers s_{k-1} and s_{k-1}′ are glued if they have the same label and their positions differ by at most Δ.
IV. A positional (k−1)-mer s_{k-1} is red if its read depth is two standard deviations from the mean or there is a significant number of discordant read alignments; otherwise, it is black.
A positional k-mer is a k-mer with an approximate position.
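Rules III and IV can be sketched in a few lines. Several details below are assumptions rather than the published method: gluing is done greedily over sorted positions (a new node starts whenever the gap to the previous position exceeds Δ), "two standard deviations" is read as "more than two", and only the read-depth criterion for red nodes is modelled, not discordant alignments. All names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def glue(positional_kmers, delta):
    """Group (label, position) pairs sharing a label and lying within delta."""
    by_label = defaultdict(list)
    for label, pos in positional_kmers:
        by_label[label].append(pos)
    nodes = []
    for label, positions in by_label.items():
        positions.sort()
        cluster = [positions[0]]
        for pos in positions[1:]:
            if pos - cluster[-1] <= delta:
                cluster.append(pos)            # close enough: same node
            else:
                nodes.append((label, cluster)) # too far: start a new node
                cluster = [pos]
        nodes.append((label, cluster))
    return nodes


def color(depths):
    """'red' where depth is more than two std devs from the mean, else 'black'."""
    mu, sigma = mean(depths), pstdev(depths)
    return ["red" if abs(d - mu) > 2 * sigma else "black" for d in depths]


kmers = [("ACG", 100), ("ACG", 102), ("ACG", 500), ("CGT", 101)]
print(glue(kmers, delta=5))
print(color([30] * 19 + [90]))
```

Keeping a position with each k-mer is what stops the two copies of a repeat from collapsing into one node, as they would in an ordinary de Bruijn graph.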
Positional Red Black de Bruijn Graph
[Figure: reads aligned to contigs; positional k-mers with read depth; the resulting positional red-black de Bruijn graph]
misSEQuel Algorithm
- 3. Remove all bulges and whirls from the red-black positional de Bruijn graph.
[Figure: correctly assembled contigs and mis-assembled contigs, drawn as blocks A, R, B]
- 4. Contig refinement using optical map alignment.
Optical Map Alignment
NheI = G^CTAGC (recognition sequence “GCTAGC”)
E. coli optical map segment
Correctly Assembled Contigs Align (contig blocks A R R B)
Mis-assembled Contigs Don't Align (contig blocks A R B R R)
Results on Tularensis
Results on Pine
Improve Prediction
[Figure: contig blocks (A, R, B) aligned to the optical map; deletion between two aligned regions]