Optical Mapping Data: Data Generation and Algorithms



slide-1
SLIDE 1

Optical Mapping Data: Data Generation and Algorithms

slide-2
SLIDE 2

Sample Preparation Sequencing Assembly Analysis

Fragments Reads Contigs

slide-3
SLIDE 3

What is an Optical Map?

GGCTTCCGACCACCACAACCGAATTATGAAGGATACCGAA 6,19,35

Optical maps are ordered, genome-wide, high- resolution restriction maps.

  • Much longer than reads. For example, the average map size for goat covers 360,000 bp

  • Now commercially available
slide-4
SLIDE 4

Isolated DNA

Microfluidic device

DNA is elongated and cleaved on the optical mapping surface

Epifluorescence microscope with CCD camera

slide-5
SLIDE 5
slide-6
SLIDE 6
slide-7
SLIDE 7
slide-8
SLIDE 8

6 3 3 4 9

slide-9
SLIDE 9

6 3 3 4 9 6 3 9 4

Genome wide optical map

slide-10
SLIDE 10

“There is [..] a critical need for the continued development and public release of software tools for processing optical mapping data ..”

  • GigaScience 2014
slide-11
SLIDE 11

Goal: a tool to align a contig to a segment of an optical map

Sample Preparation Sequencing Assembly Analysis

Genome-wide optical map

contigs

Optical Map Data

slide-12
SLIDE 12
  • Previous approaches use dynamic programming
  • Burrows-Wheeler Transform (BWT) would improve time efficiency
  • Challenges in applying BWT: (1) sizing error and (2) alphabet size

Challenges

SIZING ERROR

Actual optical map values: 6 3 9 4

Optical map obtained from experiment: 5 4 9.5 6

Sizing error: 1 1 0.5 2
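The sizing-error figure on this slide can be reproduced in a few lines of Python. This is only a minimal sketch: the function names `sizing_errors` and `within_tolerance` and the tolerance values are hypothetical, chosen for illustration.

```python
def sizing_errors(actual, observed):
    """Absolute difference between each true and measured fragment size."""
    return [abs(a - o) for a, o in zip(actual, observed)]


def within_tolerance(actual, observed, tolerance):
    """True if both maps have the same length and every fragment differs by at most `tolerance`."""
    return len(actual) == len(observed) and all(
        err <= tolerance for err in sizing_errors(actual, observed)
    )


actual = [6, 3, 9, 4]        # actual optical map values (from the slide)
observed = [5, 4, 9.5, 6]    # optical map obtained from experiment (from the slide)

print(sizing_errors(actual, observed))
print(within_tolerance(actual, observed, tolerance=2))  # hypothetical tolerance
print(within_tolerance(actual, observed, tolerance=1))  # hypothetical tolerance
```

Because the two lists are never exactly equal, an index built for exact matching would report no hit here, which is why sizing error is listed as a challenge for BWT-based search.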

slide-13
SLIDE 13

slide-14
SLIDE 14
  • Previous approaches use dynamic programming
  • Burrows-Wheeler Transform (BWT) would improve time efficiency
  • Challenges in applying BWT: (1) sizing error and (2) alphabet size

Challenges

Number of unique fragment sizes > 16,000
slide-15
SLIDE 15

Twin

Sample Preparation Sequencing Assembly Analysis

Contigs

Optical Map Data

Alignment of contigs to optical map

Genome-wide optical map
slide-16
SLIDE 16

Contig 1 Contig 2 Contig 3 Contig 5 Contig 4

slide-17
SLIDE 17

Twin Algorithm

  • 1. In silico digest contigs into optical maps.

TTTCCGACCACTTTTCCGAATTATGACCGAA 4,13,24
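Step 1 can be sketched as a scan for a restriction enzyme's recognition site. Everything specific here is an assumption made for illustration: the site `CCGA`, the function name `in_silico_digest`, and the 0-based position convention, so the positions it reports need not match the numbers printed on the slide.

```python
def in_silico_digest(sequence, site):
    """Return (site_positions, fragment_sizes) for one recognition site.

    Positions are 0-based offsets of each site occurrence (an assumed
    convention); fragment sizes are the distances between consecutive
    positions, including both ends of the sequence.
    """
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    bounds = [0] + positions + [len(sequence)]
    fragments = [b - a for a, b in zip(bounds, bounds[1:])]
    return positions, fragments


contig = "TTTCCGACCACTTTTCCGAATTATGACCGAA"   # sequence from the slide
positions, fragments = in_silico_digest(contig, "CCGA")  # assumed site
print(positions, fragments)
```

The list of fragment sizes is the form in which a contig is later compared with the genome-wide optical map.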

slide-18
SLIDE 18

Twin Algorithm

  • 1. In silico digest contigs into optical maps.
  • 2. Build FM-index* and auxiliary data structures on the genome-wide optical map.

* a data structure that allows compression of the input text while still permitting fast substring queries

slide-19
SLIDE 19

BWT and FM-index

A suffix array (SA) of a string S is an array of the starting positions of the suffixes of S, listed in the alphabetical order of the suffixes.

S = acaaacgn

Suffixes of S     Sorted suffixes
1 acaaacgn        3 aaacgn
2 caaacgn         4 aacgn
3 aaacgn          1 acaaacgn
4 aacgn           5 acgn
5 acgn            2 caaacgn
6 cgn             6 cgn
7 gn              7 gn
8 n               8 n

slide-20
SLIDE 20

BWT and FM-index

The suffix array clusters all the occurrences of every pattern together into a contiguous range! For example, in S = acaaacgn the two suffixes that start with "ac" (positions 1 and 5) sit next to each other in the sorted order.
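The suffix array of the example string, and the contiguous range occupied by a pattern, can be checked with a naive construction. This is a sketch for small inputs only (it sorts whole suffixes); the names `suffix_array` and `rows_matching` are hypothetical.

```python
def suffix_array(s):
    """Starting positions (1-based, as on the slide) of the suffixes of s in sorted order."""
    return sorted(range(1, len(s) + 1), key=lambda pos: s[pos - 1:])


def rows_matching(s, sa, pattern):
    """Rows (0-based) of the suffix array whose suffix starts with `pattern`."""
    return [row for row, pos in enumerate(sa) if s[pos - 1:].startswith(pattern)]


s = "acaaacgn"
sa = suffix_array(s)
rows = rows_matching(s, sa, "ac")
print(sa)
print(rows, [sa[r] for r in rows])
```

The matching rows are consecutive, so a pattern's occurrences can be described by just two numbers: the first and last row of its range.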

slide-21
SLIDE 21


slide-22
SLIDE 22


slide-23
SLIDE 23


slide-24
SLIDE 24

BWT and FM-index

The Burrows-Wheeler Transform (BWT) is a permutation of the string such that BWT[i] = S[SA[i] - 1], where S[0] is read as the last character of S.

S = acaaacgn

Sorted rotations of S
3 aaacgnac
4 aacgnaca
1 acaaacgn
5 acgnacaa
2 caaacgna
6 cgnacaaa
7 gnacaaac
8 nacaaacg

Extract the last column of the sorted rotations: BWT = c a n a a a c g
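The formula BWT[i] = S[SA[i] - 1] can be applied directly to the example string. This sketch assumes, as in the example, that the last character of S is unique and acts as the terminator; the function names are hypothetical.

```python
def suffix_array(s):
    """Naive suffix array with 1-based positions (fine for short examples)."""
    return sorted(range(1, len(s) + 1), key=lambda pos: s[pos - 1:])


def bwt(s):
    """BWT[i] = S[SA[i] - 1] with 1-based S; position 0 wraps to the last character."""
    # In 0-based Python indexing S[SA[i] - 1] is s[pos - 2]; for pos == 1
    # this is s[-1], which is exactly the wrap-around to the last character.
    return "".join(s[pos - 2] for pos in suffix_array(s))


print(bwt("acaaacgn"))
```

The result agrees with the last column of the sorted rotations shown on the slide.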

slide-25
SLIDE 25

BWT and FM-index

rank_K(i): return the number of K's in S[1,i]

c a n a a a c g 1 2 3 1

BWT rank

slide-26
SLIDE 26

BWT and FM-index

BWT: c a n a a a c g

rank_a(5) = 2, because the first five symbols of the BWT (c a n a a) contain two a's.

slide-27
SLIDE 27

BWT and FM-index

The FM-index is a compressed representation of the BWT together with rank.

slide-28
SLIDE 28

Twin Algorithm

  • 1. In silico digest contigs into optical maps.
  • 2. Build FM-index and auxiliary data structures on the genome-wide optical map.
  • 3. Using the FM-index we find all alignments between the optical map and the in silico digested contigs.
  • Modified FM-index Backward Search Algorithm
slide-29
SLIDE 29

FM-Index Backward Search

A recursive algorithm for finding substrings using rank and BWT

rank[c] rank[a] rank[a]
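A minimal sketch of exact backward search over the BWT follows. It is written iteratively rather than recursively, computes rank by plain counting instead of a compressed structure, and uses hypothetical names; it shows the standard exact-match procedure, not Twin's modified version.

```python
from collections import Counter


def backward_search(bwt, pattern):
    """Half-open range [lo, hi) of suffix array rows whose suffix starts with `pattern`."""
    counts = Counter(bwt)
    smaller, total = {}, 0          # smaller[c] = number of symbols less than c
    for c in sorted(counts):
        smaller[c] = total
        total += counts[c]

    lo, hi = 0, len(bwt)
    for c in reversed(pattern):     # consume the pattern from its last symbol
        if c not in smaller:
            return (0, 0)
        lo = smaller[c] + bwt[:lo].count(c)   # rank_c(lo)
        hi = smaller[c] + bwt[:hi].count(c)   # rank_c(hi)
        if lo >= hi:
            return (0, 0)
    return (lo, hi)


bwt = "canaaacg"                    # BWT of acaaacgn
sa = [3, 4, 1, 5, 2, 6, 7, 8]       # suffix array of acaaacgn (1-based positions)
lo, hi = backward_search(bwt, "ac")
print((lo, hi), sa[lo:hi])
```

Each pattern symbol costs two rank queries, so the search time depends on the pattern length rather than on the length of the indexed text.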

slide-30
SLIDE 30

Modified FM-Index Backward Search

  • Sizing error and alphabet size are challenges to overcome
  • We cannot afford a brute force enumeration of the alphabet at each step in the backward search
  • Novelty for optical maps: Wavelet Tree
slide-31
SLIDE 31

Wavelet Tree

A Wavelet Tree converts a string into a balanced binary tree of bit vectors, where a 0 replaces half of the symbols and a 1 replaces the other half. This definition is applied recursively.
slide-32
SLIDE 32

Wavelet Tree

{A,C,G,T} is encoded as {0,0,1,1}

ACGTATATAGGAAGA
001101010110010

slide-33
SLIDE 33


slide-34
SLIDE 34

Wavelet Tree

ACGTATATAGGAAGA
001101010110010

{A,C} is encoded as {0,1}

ACAAAAAA
01000000

No ambiguity!

slide-35
SLIDE 35

Wavelet Tree

ACGTATATAGGAAGA
001101010110010

ACAAAAAA
01000000

{G,T} is encoded as {0,1}

GTTTGGG
0111000

Which symbols in {A, G} exist in the input string?
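The tree on these slides can be rebuilt with a short recursive class. This is a teaching sketch with uncompressed Python lists as bit vectors; the class and method names are hypothetical, and `rank` assumes the queried symbol belongs to the alphabet.

```python
class WaveletTree:
    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.bits, self.left, self.right = None, None, None
        if len(self.alphabet) > 1:
            mid = (len(self.alphabet) + 1) // 2
            self.low = set(self.alphabet[:mid])        # symbols encoded as 0
            self.bits = [0 if c in self.low else 1 for c in seq]
            self.left = WaveletTree([c for c in seq if c in self.low],
                                    self.alphabet[:mid])
            self.right = WaveletTree([c for c in seq if c not in self.low],
                                     self.alphabet[mid:])

    def bit_string(self):
        return "".join(map(str, self.bits))

    def rank(self, symbol, i):
        """Occurrences of `symbol` among the first i symbols of the sequence."""
        if self.bits is None:          # leaf: only one symbol remains
            return i
        ones = sum(self.bits[:i])
        if symbol in self.low:
            return self.left.rank(symbol, i - ones)
        return self.right.rank(symbol, ones)


tree = WaveletTree("ACGTATATAGGAAGA")
print(tree.bit_string(), tree.left.bit_string(), tree.right.bit_string())
print(tree.rank("A", 15), tree.rank("G", 10))
```

A rank query descends one level per step, so its cost grows with the logarithm of the alphabet size rather than with the alphabet size itself.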

slide-36
SLIDE 36

To match x we need to find all the substrings within the range x +/- y, for tolerance y.

Modified FM-Index Backward Search

slide-37
SLIDE 37

To match 9 we need to find all the substrings within the range [6, 12], for tolerance 3.

Modified FM-Index Backward Search

Genome-wide optical map:
2,11,10,23,53,3,5,10,14,9,11
0,1,0,1,1,0,0,0,1,0,1
slide-38
SLIDE 38

Modified FM-Index Backward Search

To match 9 we need to find all the substrings within the range [6, 12], for tolerance 3.

2,11,10,23,53,3,5,10,14,9,11    0,1,0,1,1,0,0,0,1,0,1
  2,10,3,5,10,9    0,1,0,0,1,1
    2,3,5    0,0,1
      2,3    0,1
      5    1
    10,9,10    0,1,0
  11,23,53,14,11    0,1,1,0,0
    11,14,11    0,1,0
    23,53    0,1
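The query "which fragment sizes in [6, 12] occur in this part of the map" can be sketched on a wavelet tree over integers. The code below is an illustrative assumption about how such a query may be organised, with hypothetical names and uncompressed bit lists; it is not taken from the Twin implementation.

```python
class WaveletTree:
    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.bits, self.left, self.right = None, None, None
        if len(self.alphabet) > 1:
            mid = (len(self.alphabet) + 1) // 2
            low = set(self.alphabet[:mid])             # smaller half is encoded as 0
            self.bits = [0 if v in low else 1 for v in seq]
            self.left = WaveletTree([v for v in seq if v in low], self.alphabet[:mid])
            self.right = WaveletTree([v for v in seq if v not in low], self.alphabet[mid:])

    def symbols_in_range(self, start, end, lo, hi):
        """Distinct values v with lo <= v <= hi that occur in seq[start:end]."""
        if start >= end or self.alphabet[-1] < lo or self.alphabet[0] > hi:
            return []                                  # empty interval or subtree out of range
        if self.bits is None:
            return [self.alphabet[0]]                  # leaf inside [lo, hi]
        ones_start, ones_end = sum(self.bits[:start]), sum(self.bits[:end])
        return (self.left.symbols_in_range(start - ones_start, end - ones_end, lo, hi)
                + self.right.symbols_in_range(ones_start, ones_end, lo, hi))


optical_map = [2, 11, 10, 23, 53, 3, 5, 10, 14, 9, 11]   # values from the slide
tree = WaveletTree(optical_map)
x, tolerance = 9, 3
print(tree.bits)
print(tree.symbols_in_range(0, len(optical_map), x - tolerance, x + tolerance))
```

Subtrees whose values all fall outside [6, 12] are skipped without being visited, which is what avoids enumerating the whole alphabet at each step of the backward search.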

slide-39
SLIDE 39


slide-40
SLIDE 40

A recursive algorithm for finding substrings using rank and BWT

rank[c] rank[a] rank[a]

Modified FM-Index Backward Search

Wavelet Tree Query

slide-41
SLIDE 41

Twin Algorithm

  • 1. In silico digest contigs into optical maps.
  • 2. Build FM-index and auxiliary data structures on the genome-wide optical map.
  • 3. Using the FM-index we find all alignments between the optical map and the in silico digested contigs.
  • 4. Output the alignments in PSL format.
slide-42
SLIDE 42

TWIN Test Datasets

slide-43
SLIDE 43

TWIN Results

slide-44
SLIDE 44

Twin is the first alignment method that is capable of handling large genome sizes

  • The only index-based tool, and orders of magnitude faster than existing approaches (patent pending)
  • Pine tree (20 Gb) would take ~84 machine years with SOMA but a couple of hours with Twin

TWIN: Optical Map Aligner

slide-45
SLIDE 45

CORRECTING ERRORS IN GENOMES

slide-46
SLIDE 46

Mis-assembly in Genomes

Mis-assembly: Significantly large insertion, deletion, inversion, or rearrangement that is the result of decisions made by the assembly program

Correct assembly Rearrangement Deletion Insertion A R R B A R R B A R B A R R B R

slide-47
SLIDE 47
slide-48
SLIDE 48

Extensive vs. Local Mis-assemblies

Extensive Mis-assembly: 1 kbp in size and regions align to different strands or different chromosomes.

Local Mis-assembly: smaller in size and on the same strand and same chromosome.

slide-49
SLIDE 49

De Bruijn Graph of a Genome

Example Genome: ABCDEFGHICDEFGKL

1 3 2 ABC BCD CDE DEF EFG FGK GKL FGH GHI HIC ICD

slide-50
SLIDE 50

De Bruijn Graph of a Genome

ABC BCD CDE DEF EFG FGK GKL

Example Genome: ABCDEFGHICDEFGKL

slide-51
SLIDE 51

De Bruijn Graph of a Genome

ABC BCD CDE DEF EFG FGK GKL

Example Genome: ABCDEFGHICDEFGKL

Resulting Erroneous Genome: ABCDEFGKL

1
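The example genome can be turned into its de Bruijn graph to see where the erroneous assembly comes from. This sketch uses 3-letter nodes as on the slide; the function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn(genome, k):
    """Map each k-mer node to the set of k-mers that follow it in the genome."""
    graph = defaultdict(set)
    for i in range(len(genome) - k):
        graph[genome[i:i + k]].add(genome[i + 1:i + k + 1])
    return graph


def spell(path):
    """Sequence spelled by a path of overlapping k-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


graph = de_bruijn("ABCDEFGHICDEFGKL", 3)
branching = sorted(node for node, successors in graph.items() if len(successors) > 1)
shortcut = ["ABC", "BCD", "CDE", "DEF", "EFG", "FGK", "GKL"]
print(branching, sorted(graph["EFG"]))
print(spell(shortcut))
```

The repeat CDEFG makes EFG a branching node. An assembler that leaves the repeat through FGK on its first visit skips the FGH, GHI, HIC, ICD loop and spells the shortened genome.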

slide-52
SLIDE 52

Sample Preparation Sequencing Assembly Analysis

Fragments Reads Contigs

slide-53
SLIDE 53

misSEQuel* Refined Contigs Reads Contigs

*(Muggli, Puglisi, Ronen, Boucher, ISMB 2015)

Sample Preparation Sequencing Assembly Analysis

Fragments Reads Contigs

Optical Map Data

slide-54
SLIDE 54

misSEQuel Algorithm

  • 1. Align sequence reads to contigs using a standard alignment tool.

GGCTTCCGACCACCACAAATGGATTATGAAGGATATATGGA

slide-55
SLIDE 55

misSEQuel Algorithm

  • 1. Align sequence reads to contigs using a standard alignment tool.

GGCTTCCGACCACCACAAATGGATATGAAGGATATATGGATTATGAAGGATATA

GGCTTCCGACCACCACAAATGGATTATGAAGGATATATGGA

slide-56
SLIDE 56

misSEQuel Algorithm

  • 1. Align sequence reads to contigs using a standard alignment tool.

GGCTTCCGACCACCACAAATGGATTATGAAGGATATATGGA

slide-57
SLIDE 57

misSEQuel Algorithm

  • 1. Align sequence reads to contigs using a standard alignment tool.

GGCTTCCGACCACCACAAATGGATATGAAGGATATATGGATTATGAAGGATATA

GGCTTCCGACCACCACAAATGGATTATGAAGGATATATGGA 1 9

slide-58
SLIDE 58

misSEQuel Algorithm

  • 1. Align sequence reads to contigs using a standard alignment tool.
  • 2. Build the red-black positional de Bruijn graph based on the alignment.

slide-59
SLIDE 59

Sample Preparation Sequencing

ACGTAGAATCGACCATG GGGACGTAGAATACGAC ACGTAGAATACGTAGAA

Reads Fragments Next Generation Sequencing (NGS)

slide-60
SLIDE 60

ACGTAGAATCGACCATGGGGACGTAGAATACGA

Paired-End Reads / Mate-Pair Reads

Sample Preparation

Sequencing

Fragments

slide-61
SLIDE 61

Read Mate Pair Concordance

A R R B A R R B A R R B Correct assembly Rearrangement Inversion

slide-62
SLIDE 62

Read Depth

A R R B A R B R R R A B Correct assembly Insertion Deletion

slide-63
SLIDE 63

Red-Black Positional De Bruijn Graph

I. Choose a value of k and Δ.
II. Each positional k-mer (s_k) is an edge between two positional (k-1)-mers: the prefix and suffix of s_k.
III. Positional (k-1)-mers, s_{k-1} and s_{k-1}', are glued if s_{k-1} and s_{k-1}' have the same label and their positions differ by at most Δ.
IV. An s_{k-1} is red if the read depth is two standard deviations from the mean or there is a significant number of discordant read alignments; otherwise, it is black.

A positional k-mer is a k-mer with an approximate position.
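Rules III and IV can be sketched as follows. Several details are assumptions made for illustration and are not taken from misSEQuel: gluing is done greedily along sorted positions, "two standard deviations" is read as "more than two population standard deviations", the discordant-read criterion is omitted, and all names and depth values are hypothetical.

```python
from statistics import mean, pstdev


def glue_positional(nodes, delta):
    """Group (label, position) pairs with equal labels and nearby positions."""
    clusters = []
    for label, pos in sorted(nodes):
        if clusters and clusters[-1][0] == label and pos - clusters[-1][1][-1] <= delta:
            clusters[-1][1].append(pos)     # same label, within delta: glue
        else:
            clusters.append((label, [pos]))
    return clusters


def color_nodes(depths, num_sd=2.0):
    """Mark a node red when its read depth deviates strongly from the mean."""
    values = list(depths.values())
    mu, sd = mean(values), pstdev(values)
    return {node: "red" if abs(d - mu) > num_sd * sd else "black"
            for node, d in depths.items()}


nodes = [("ACG", 5), ("ACG", 7), ("ACG", 40), ("CGT", 6)]    # made-up positions
print(glue_positional(nodes, delta=3))

depths = {("ACG", 5): 10, ("CGT", 6): 11, ("GTA", 7): 9, ("TAG", 8): 10,
          ("AGA", 9): 10, ("GAA", 10): 11, ("AAT", 11): 9, ("ATC", 12): 40}
colors = color_nodes(depths)                                  # made-up depths
print([node for node, c in colors.items() if c == "red"])
```

Keeping the position with each label is what stops two copies of a repeat, far apart in the contig, from collapsing into a single node.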

slide-64
SLIDE 64

Positional Red Black de Bruijn Graph

Reads aligned to contigs: Positional k-mers with read depth: Positional Red Black de Bruijn Graph:

slide-65
SLIDE 65

misSEQuel Algorithm

  • 1. Align sequence reads to contigs using a standard alignment tool.
  • 2. Build the red-black positional de Bruijn graph based on the alignment.
  • 3. Remove all bulges and whirls for the red-black positional de Bruijn graph.

slide-66
SLIDE 66

misSEQuel Algorithm

  • 1. Align sequence reads to contigs using a standard alignment tool.
  • 2. Build the red-black positional de Bruijn graph based on the alignment.
  • 3. Remove all bulges and whirls for the red-black positional de Bruijn graph.

Correct assembled contigs Mis-assembled contigs A R R B A R R B A R B A R R B R A R R B

…

slide-67
SLIDE 67

misSEQuel Algorithm

  • 1. Align sequence reads to contigs using a standard alignment tool.
  • 2. Build the red-black positional de Bruijn graph based on the alignment.
  • 3. Remove all bulges and whirls for the red-black positional de Bruijn graph.
  • 4. Contig refinement using optical map alignment.

slide-68
SLIDE 68

Optical Map Alignment

NheI = G^CTAGC

  • E. coli optical map segment

A R R B

slide-69
SLIDE 69

NheI = G^CTAGC

“GCTAGC”

Optical Map Alignment

B A R R

slide-70
SLIDE 70

NheI = G^CTAGC

Correctly Assembled Contigs Align

B A R R

slide-71
SLIDE 71

NheI = G^CTAGC

A R B R R

Mis-assembled Contigs Don't Align
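The idea that a correctly assembled contig aligns to the optical map while a mis-assembled one does not can be illustrated with a brute-force sliding window. This is deliberately not the index-based search used by Twin; the function name, the fragment lists, and the tolerance are hypothetical.

```python
def find_alignments(contig_fragments, optical_map, tolerance):
    """Start offsets where every contig fragment matches the map within `tolerance`."""
    n = len(contig_fragments)
    return [
        start
        for start in range(len(optical_map) - n + 1)
        if all(abs(c - m) <= tolerance
               for c, m in zip(contig_fragments, optical_map[start:start + n]))
    ]


optical_map = [2, 11, 10, 23, 53, 3, 5, 10, 14, 9, 11]   # made-up map segment
correct_contig = [22, 54, 3]        # in silico fragments, small sizing error
misassembled_contig = [22, 3, 5]    # same region with one fragment deleted

print(find_alignments(correct_contig, optical_map, tolerance=3))
print(find_alignments(misassembled_contig, optical_map, tolerance=3))
```

The deleted fragment shifts every later fragment out of register, so no window of the map stays within tolerance.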

slide-72
SLIDE 72


slide-73
SLIDE 73

Results on Tularensis

slide-74
SLIDE 74

Results on Tularensis

slide-75
SLIDE 75

Results on Tularensis

slide-76
SLIDE 76

Results on Tularensis

slide-77
SLIDE 77

Results on Tularensis

slide-78
SLIDE 78

Results on Tularensis

slide-79
SLIDE 79

Results on Tularensis

slide-80
SLIDE 80

Results on Tularensis

slide-81
SLIDE 81

Results on Tularensis

slide-82
SLIDE 82
slide-83
SLIDE 83

Results on Pine

slide-84
SLIDE 84

B B A R R

Improve Prediction

A R R R

slide-85
SLIDE 85

B

Improve Prediction

A R R R Deletion between two aligned regions