slide-1
SLIDE 1

Master’s Thesis Genome Assembly: Scaffolding Guided by Related Genomes

Runar Furenes

Department of Informatics University of Oslo

2013-06-05

Scaffolding Guided by Related Genomes 1 / 42

slide-2
SLIDE 2

Presentation overview

Introduction Problem specification Methods Materials Results Discussion Questions

Scaffolding Guided by Related Genomes 2 / 42

slide-9
SLIDE 9

Introduction

Introduction

Scaffolding Guided by Related Genomes 3 / 42

slide-10
SLIDE 10

Introduction Genome assembly

From biological DNA to complete sequenced genome

ACTCGCA GGCATGCA GGCTAAGCT CGGATTACC

Scaffolding Guided by Related Genomes 4 / 42

slide-12
SLIDE 12

Introduction Genome assembly

Scaffolding

  • A scaffold consists of at least two contigs
  • Each contig within a scaffold is ordered and oriented
  • A gap estimate is provided for each pair of contigs
  • Mate pairs are commonly used in scaffolding

Scaffolding Guided by Related Genomes 5 / 42

slide-16
SLIDE 16

Introduction Motivation

Motivation for the thesis

  • Scaffolding is an important step in the process of genome assembly
  • Scaffolding often requires time-consuming and expensive lab work
  • Using genomes related to the target genome may make this easier
  • The continuous growth in the number of fully sequenced genomes available makes this increasingly relevant

Scaffolding Guided by Related Genomes 6 / 42

slide-20
SLIDE 20

Introduction Hypotheses

Hypotheses

  • Related genomes can be helpful in scaffolding
  • Many such related genomes can be preferable to a few
  • It can be beneficial to use only the ends of contigs

Scaffolding Guided by Related Genomes 7 / 42

slide-23
SLIDE 23

Problem specification

Problem specification

Scaffolding Guided by Related Genomes 8 / 42

slide-24
SLIDE 24

Problem specification Scaffolding problem specific to this thesis

Using related genomes in a scaffolding process

  • Related genomes may have nucleotide sequence similarities
  • Can contigs be scaffolded with high accuracy using one or more such related genomes?
  • More distantly related genomes have more sequence similarity at the protein level than at the nucleotide level
  • Can the same process run at the protein level instead?

Scaffolding Guided by Related Genomes 9 / 42

slide-28
SLIDE 28

Problem specification Earlier research on this subject

Other works

Existing tools:

  • ABACAS (Assefa et al. 2009)
  • GRASS (Gritsenko et al. 2012)

Both can use additional information, such as one or more reference genomes, in their scaffolding algorithms.

Scaffolding Guided by Related Genomes 10 / 42

slide-29
SLIDE 29

Methods

Methods

Scaffolding Guided by Related Genomes 11 / 42

slide-30
SLIDE 30

Methods Overview

Proposed method: GuideScaff

GuideScaff is a pipeline producing scaffolds from contigs and guiding genomes. Main steps:

  • Use contigs or contig ends from an assembly
  • Match contigs with guiding genomes
  • Use agreeing matches to create scaffolds
  • Evaluate scaffolds with target genome if available

Scaffolding Guided by Related Genomes 12 / 42

slide-34
SLIDE 34

Methods Overview

Contig end extraction

  • Contigs are assumed to be more or less correct
  • Scaffolds consist of entire contigs
  • Contigs can map to multiple locations in a genome
  • Using only contig ends could make the mapping easier

Scaffolding Guided by Related Genomes 13 / 42

slide-38
SLIDE 38

Methods Description

Choosing guiding genomes

  • Manually, based on domain knowledge
  • Automatically, based on BLAST search or similar

Scaffolding Guided by Related Genomes 14 / 42

slide-40
SLIDE 40

Methods Description

Contig end extraction

  • Contigs with length ≥ 2N are replaced by N nucleotides from each end
  • Smaller contigs are kept intact
  • N is set experimentally

Scaffolding Guided by Related Genomes 15 / 42
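
The extraction rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not GuideScaff's own code: the function name `extract_contig_ends` and the `_left` / `_right` name suffixes are assumptions made for the example.

```python
def extract_contig_ends(contigs, n):
    """Replace each contig of length >= 2*n by its two ends of length n.

    contigs: dict mapping contig name -> nucleotide string.
    The "_left" / "_right" suffixes are an illustrative naming
    convention, not taken from GuideScaff.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    ends = {}
    for name, seq in contigs.items():
        if len(seq) >= 2 * n:
            ends[name + "_left"] = seq[:n]    # first n nucleotides
            ends[name + "_right"] = seq[-n:]  # last n nucleotides
        else:
            # Contigs shorter than 2*n are kept intact.
            ends[name] = seq
    return ends
```

With N = 4, a 10-nucleotide contig `ACGTACGTAC` would yield the ends `ACGT` and `GTAC`, while a 3-nucleotide contig would pass through unchanged.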

slide-43
SLIDE 43

Methods Description

Aligning contig ends to guiding genomes

  • Contig ends are aligned to each guiding genome using tools from MUMmer (Kurtz et al. 2004)
  • Initially at the nucleotide level
  • Re-aligned at the protein level if initial results are unsatisfactory

Scaffolding Guided by Related Genomes 16 / 42
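
The nucleotide-first, protein-second strategy might look roughly like the sketch below. It assumes the MUMmer programs `nucmer` (nucleotide level) and `promer` (protein level) are installed, each writing `<prefix>.delta`. The slides do not say how "unsatisfactory" is judged, so that decision is left to a caller-supplied function `is_satisfactory`, which is purely hypothetical, as are the function name `align` and the file-name suffixes.

```python
import subprocess


def align(reference, query, prefix, is_satisfactory, run=subprocess.run):
    """Align contig ends (query) to one guiding genome (reference).

    Runs nucmer first. If is_satisfactory(delta_path) returns False,
    re-aligns with promer. Returns the path of the delta file to use.
    The `run` parameter exists only so the command runner can be replaced.
    """
    nuc_prefix = prefix + ".nuc"
    run(["nucmer", "-p", nuc_prefix, reference, query], check=True)
    nuc_delta = nuc_prefix + ".delta"
    if is_satisfactory(nuc_delta):
        return nuc_delta

    # Nucleotide-level result judged unsatisfactory: retry at protein level.
    prot_prefix = prefix + ".prot"
    run(["promer", "-p", prot_prefix, reference, query], check=True)
    return prot_prefix + ".delta"
```

A real criterion could, for instance, look at how many contig ends received an alignment, but any such rule is an assumption rather than something stated in the presentation.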

slide-46
SLIDE 46

Methods Description

Creating contig links

  • A tiling of contigs is produced for each guiding genome
  • Tilings are processed to create a distance matrix
  • Links between contigs are created based on this matrix
  • Links are created in a greedy manner
  • At least t guiding genomes must support each link created

Scaffolding Guided by Related Genomes 17 / 42
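
A minimal sketch of greedy link creation under the agreement threshold t. It is a simplification and not the GuideScaff implementation: it counts how many guiding genomes place two contigs next to each other in their tiling instead of building the distance matrix, and it ignores orientation and gap sizes. The function name `greedy_links` and the input format (one ordered list of contig names per guiding genome) are assumptions.

```python
from collections import Counter


def greedy_links(tilings, t):
    """Return links (a, b, support) backed by at least t guiding genomes.

    tilings: one ordered list of contig names per guiding genome.
    """
    support = Counter()
    for tiling in tilings:
        # Count each adjacent pair at most once per guiding genome.
        pairs = {tuple(sorted(p)) for p in zip(tiling, tiling[1:]) if p[0] != p[1]}
        support.update(pairs)

    parent = {}

    def find(x):
        # Union-find lookup, used to detect links that would close a cycle.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    degree = Counter()
    links = []
    # Greedy: best-supported pairs first, ties broken by contig names.
    for (a, b), count in sorted(support.items(), key=lambda kv: (-kv[1], kv[0])):
        if count < t:
            break  # all remaining pairs have too little support
        if degree[a] >= 2 or degree[b] >= 2:
            continue  # a contig can have at most two neighbours
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # would turn a scaffold into a cycle
        parent[root_a] = root_b
        degree[a] += 1
        degree[b] += 1
        links.append((a, b, count))
    return links
```

Raising t can only shrink the set of accepted links, which matches the intuition that a stricter threshold gives fewer but better-supported scaffolds.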

slide-51
SLIDE 51

Methods Description

Creating scaffolds from contig links

  • Linked contigs are converted to scaffolds
  • A gap estimate of n is written as n N symbols
  • A negative gap estimate means the contigs overlap
  • A merge is attempted for contigs marked as overlapping
  • Contigs in the reverse orientation are converted to their reverse complement

ACCGGTTANNNNNNACCAGGTTAACNNNNACGGTTT

Scaffolding Guided by Related Genomes 18 / 42
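
The conversion from linked contigs to a scaffold sequence can be illustrated as follows. This is a sketch under stated assumptions, not GuideScaff's code: the input format, the function names, the exact-match test for overlaps, and the fallback of a single `N` when a merge fails are all choices made for this example.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_scaffold(parts):
    """parts: list of (sequence, is_reversed, gap_to_next_contig).

    The gap value of the last part is ignored.
    """
    scaffold = ""
    gap_before = 0
    for index, (seq, is_reversed, gap_after) in enumerate(parts):
        if is_reversed:
            seq = reverse_complement(seq)
        if index == 0:
            scaffold = seq
        elif gap_before >= 0:
            # A gap estimate of n becomes n N symbols.
            scaffold += "N" * gap_before + seq
        else:
            # Negative gap: try to merge on an exact overlap.
            overlap = -gap_before
            fits = overlap <= min(len(scaffold), len(seq))
            if fits and scaffold[-overlap:] == seq[:overlap]:
                scaffold += seq[overlap:]
            else:
                scaffold += "N" + seq  # assumed fallback, not from the slides
        gap_before = gap_after
    return scaffold


parts = [("ACCGGTTA", False, 6), ("GTTAACCTGGT", True, 4), ("ACGGTTT", False, 0)]
print(build_scaffold(parts))  # ACCGGTTANNNNNNACCAGGTTAACNNNNACGGTTT
```

The usage example reproduces the sequence shown on the slide; the middle contig is supplied in reverse orientation to exercise the reverse-complement step.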

slide-56
SLIDE 56

Methods Evaluation

Commonly used evaluation metrics

  • N50
  • Breakpoints

Scaffolding Guided by Related Genomes 19 / 42

slide-58
SLIDE 58

Methods Evaluation

N50

  • Defined as “the size of the smallest contig (or scaffold) such that 50% of the genome is contained in contigs [or scaffolds] of size N50 or larger” (Gritsenko et al. 2012)
  • Gives information about scaffold sizes only

Scaffolding Guided by Related Genomes 20 / 42
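
The N50 definition translates directly into code. In this sketch the genome size defaults to the total length of the given sequences, which is an assumption for the case where the true genome size is unknown; the function name `n50` is likewise illustrative.

```python
def n50(lengths, genome_size=None):
    """Smallest length L such that sequences of length >= L cover
    at least half of the genome.

    Returns None if the sequences never cover half of genome_size.
    """
    total = genome_size if genome_size is not None else sum(lengths)
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    return None


print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # 8
```

In the usage example the lengths sum to 32, and the two sequences of length 8 already cover 16, so N50 is 8. The value says nothing about whether those scaffolds are correct, which is why breakpoints are reported alongside it.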

slide-60
SLIDE 60

Methods Evaluation

Breakpoints

Breakpoints are specific errors made in the resulting scaffolds. These errors are:

  • Contigs mapping to different chromosomes in the target genome
  • Incorrect relative orientations of contigs inside a scaffold
  • Incorrect relative ordering of contigs inside a scaffold
  • Gap estimate more than a certain number of nucleotides off from the true distance

Scaffolding Guided by Related Genomes 21 / 42
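
One way the four breakpoint types could be checked for a pair of neighbouring contigs is sketched below. The evaluation rules used in the thesis are not given on the slides, so everything here is an assumption: the `Placed` record, 0-based half-open coordinates, the order in which the checks are applied, and the rule that the scaffold direction is inferred from the first contig of the pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placed:
    chrom: str   # true chromosome in the target genome
    start: int   # true start, 0-based
    end: int     # true end, exclusive
    strand: str  # true strand, "+" or "-"
    used: str    # orientation used in the scaffold, "+" or "-"


def classify_pair(a, b, gap_estimate, delta):
    """Classify the link a -> b (a precedes b in the scaffold).

    Returns the first breakpoint type found, or None if the link is fine.
    """
    if a.chrom != b.chrom:
        return "different_chromosomes"
    # Relative orientation in the scaffold must match the truth.
    if (a.strand == b.strand) != (a.used == b.used):
        return "orientation"
    # If a is placed on its true strand, the scaffold runs along the
    # genome; otherwise it runs in the opposite direction.
    forward = a.used == a.strand
    first, second = (a, b) if forward else (b, a)
    if first.start > second.start:
        return "order"
    true_gap = second.start - first.end
    if abs(gap_estimate - true_gap) > delta:
        return "gap"
    return None
```

The `delta` parameter corresponds to the tolerance on gap estimates; the results slides report gap errors for tolerances of 500 and 10,000 nucleotides.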

slide-64
SLIDE 64

Materials

Materials

Scaffolding Guided by Related Genomes 22 / 42

slide-65
SLIDE 65

Materials Chosen target genomes and contigs

Target genomes

  • Escherichia coli str. K-12 substr. MG1655: 1 chromosome
  • Pseudoxanthomonas suwonensis 11-1: 1 chromosome
  • Rhodobacter sphaeroides 2.4.1: 2 chromosomes, 5 plasmids
  • Staphylococcus aureus subsp. aureus USA300 TCH1516: 1 chromosome, 2 plasmids

Scaffolding Guided by Related Genomes 23 / 42

slide-66
SLIDE 66

Materials Chosen target genomes and contigs

Contigs

  • Escherichia coli str. K-12 substr. MG1655: 481 contigs
  • Pseudoxanthomonas suwonensis 11-1: 303 contigs
  • Rhodobacter sphaeroides 2.4.1: 809 contigs
  • Staphylococcus aureus subsp. aureus USA300 TCH1516: 301 contigs

All contigs were produced with Velvet by Gritsenko et al. 2012 and Salzberg et al. 2012, using short paired-end reads from Illumina sequencing technologies.

Scaffolding Guided by Related Genomes 24 / 42

slide-67
SLIDE 67

Materials Chosen guiding genomes

Guiding genomes

  • Escherichia coli str. K-12 substr. MG1655: 10 genomes from the same species, but different strains
  • Pseudoxanthomonas suwonensis 11-1: 10 genomes, none from the same species as the target genome
  • Rhodobacter sphaeroides 2.4.1: 3 genomes from the same species, 7 genomes from other species
  • Staphylococcus aureus subsp. aureus USA300 TCH1516: 10 genomes from the same species, but different strains

Scaffolding Guided by Related Genomes 25 / 42

slide-68
SLIDE 68

Results

Results

Scaffolding Guided by Related Genomes 26 / 42

slide-69
SLIDE 69

Results One guiding genome

Entire contigs vs. contig ends with N = 1,000

Entire contigs

Metric                 E. coli     P. suwonensis   R. sphaeroides   S. aureus
# Contigs              481         303             583              162
# Contigs used         421         97              387              94
# Scaffolds            4           3               13               3
N50 scaffolds          2,465,078   3,169,365       2,730,310        2,016,698
Gap errors > 500       15          77              36               12
Gap errors > 10,000    2           61              18               8

Rows with empty cells on the slide (values in extraction order; their columns are not recoverable):

Different chromosomes: 1
Different orientations: 37, 3
Different order: 38, 2

Contig ends with N = 1,000

Metric                 E. coli     P. suwonensis   R. sphaeroides   S. aureus
Contig end length      1,000       1,000           1,000            1,000
# Contigs              481         303             583              162
# Contigs used         433         65              389              98
# Scaffolds            19          27              64               17
N50 scaffolds          597,757     46,936          84,362           252,200
Gap errors > 500       13          27              43               24
Gap errors > 10,000    6           21              17               16

Rows with empty cells on the slide (values in extraction order; their columns are not recoverable):

Different chromosomes: 2
Different orientations: 6, 3
Different order: 5, 1, 2

Scaffolding Guided by Related Genomes 27 / 42

slide-70
SLIDE 70

Results Multiple guiding genomes

Entire contigs vs. contig ends with N = 1,000

The following plots show the effect on different metrics of increasing the threshold value t when running GuideScaff on all datasets.

Scaffolding Guided by Related Genomes 28 / 42

slide-71
SLIDE 71

Results Multiple guiding genomes

Entire contigs vs. contig ends with N = 1,000

[Figure: two plots of the N50 of scaffolds produced (y-axis, scale × 10^6) against the number of guiding genomes t required to agree (x-axis, 1 to 10), each with one curve for Escherichia coli, Pseudoxanthomonas suwonensis, Rhodobacter sphaeroides, and Staphylococcus aureus.]

Scaffolding Guided by Related Genomes 29 / 42

slide-72
SLIDE 72

Results Multiple guiding genomes

Entire contigs vs. contig ends with N = 1,000

[Figure: two plots of the number of contigs used in scaffolds (y-axis) against the number of guiding genomes t required to agree (x-axis, 1 to 10), each with one curve for Escherichia coli, Pseudoxanthomonas suwonensis, Rhodobacter sphaeroides, and Staphylococcus aureus.]

Scaffolding Guided by Related Genomes 30 / 42

slide-73
SLIDE 73

Results Multiple guiding genomes

Entire contigs vs. contig ends with N = 1,000

[Figure: two plots of the number of contig pairs from different chromosomes (y-axis) against the number of guiding genomes t required to agree (x-axis, 1 to 10), each with one curve for Escherichia coli, Pseudoxanthomonas suwonensis, Rhodobacter sphaeroides, and Staphylococcus aureus.]

Scaffolding Guided by Related Genomes 31 / 42

slide-74
SLIDE 74

Results Multiple guiding genomes

Entire contigs vs. contig ends with N = 1,000

[Figure: two plots of the number of contigs placed in wrong order (y-axis) against the number of guiding genomes t required to agree (x-axis, 1 to 10), each with one curve for Escherichia coli, Pseudoxanthomonas suwonensis, Rhodobacter sphaeroides, and Staphylococcus aureus.]

Scaffolding Guided by Related Genomes 32 / 42

slide-75
SLIDE 75

Results Multiple guiding genomes

Entire contigs vs. contig ends with N = 1,000

[Figure: two plots of the number of contigs placed in wrong orientations (y-axis) against the number of guiding genomes t required to agree (x-axis, 1 to 10), each with one curve for Escherichia coli, Pseudoxanthomonas suwonensis, Rhodobacter sphaeroides, and Staphylococcus aureus.]

Scaffolding Guided by Related Genomes 33 / 42

slide-76
SLIDE 76

Results Multiple guiding genomes

Entire contigs vs. contig ends with N = 1,000

[Figure: two plots of the number of gap estimates exceeding ∆ = 500 (y-axis) against the number of guiding genomes t required to agree (x-axis, 1 to 10), each with one curve for Escherichia coli, Pseudoxanthomonas suwonensis, Rhodobacter sphaeroides, and Staphylococcus aureus.]

Scaffolding Guided by Related Genomes 34 / 42

slide-77
SLIDE 77

Results Multiple guiding genomes

Entire contigs vs. contig ends with N = 1,000

[Figure: two plots of the number of gap estimates exceeding ∆ = 10,000 (y-axis) against the number of guiding genomes t required to agree (x-axis, 1 to 10), each with one curve for Escherichia coli, Pseudoxanthomonas suwonensis, Rhodobacter sphaeroides, and Staphylococcus aureus.]

Scaffolding Guided by Related Genomes 35 / 42

slide-78
SLIDE 78

Discussion

Discussion

Scaffolding Guided by Related Genomes 36 / 42

slide-79
SLIDE 79

Discussion Analysis of proposed method

Performance

  • Can handle an arbitrary number of guiding genomes
  • Alignments can be done independently, and therefore in parallel
  • Scaffold precision increases with an increasing agreement threshold value

Scaffolding Guided by Related Genomes 37 / 42
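
Because each guiding genome is aligned independently, the alignments can be dispatched concurrently. The sketch below uses a thread pool, which is adequate when each alignment is an external process; `align_one` is a hypothetical callable standing in for whatever performs a single alignment, and the function name `align_all` is also an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def align_all(contigs_path, guiding_genomes, align_one, max_workers=4):
    """Run align_one(contigs_path, genome) for every guiding genome.

    Returns a dict mapping each guiding genome to its alignment result.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one independent job per guiding genome.
        futures = {
            genome: pool.submit(align_one, contigs_path, genome)
            for genome in guiding_genomes
        }
        # Collect results; an exception raised in a job is re-raised here.
        return {genome: future.result() for genome, future in futures.items()}
```

Adding another guiding genome adds one more independent job, so the wall-clock time is bounded by the slowest alignment when enough workers are available.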

slide-82
SLIDE 82

Discussion Analysis of proposed method

Potential usage

  • GuideScaff can be used as it is to provide scaffolds in a fast and inexpensive way
  • It could be used as a supplement to other scaffolding algorithms

Scaffolding Guided by Related Genomes 38 / 42

slide-84
SLIDE 84

Discussion Analysis of proposed method

Possible improvements and further work

  • A global optimization could be attempted instead of the greedy algorithm
  • Mate-pair information could be utilized if available
  • Contig end length could be set dynamically
  • Guiding genomes could be weighted differently

Scaffolding Guided by Related Genomes 39 / 42

slide-88
SLIDE 88

Discussion Conclusion

Conclusion

  • GuideScaff works as a proof of concept
  • Related genomes can indeed be useful in scaffolding
  • One guiding genome may suffice
  • Requiring at least two guiding genomes to agree decreases all types of errors
  • Using contig ends increases the scaffold correctness when genomes are very dissimilar

Scaffolding Guided by Related Genomes 40 / 42

slide-93
SLIDE 93

Questions

Questions

Scaffolding Guided by Related Genomes 41 / 42

slide-94
SLIDE 94

Questions

Source code

https://github.com/runarfu/GuideScaff

Scaffolding Guided by Related Genomes 42 / 42