DNA Assembly and Finishing



slide-1
SLIDE 1

DNA Assembly and Finishing

Arthur Gruber

Latin American Course on Bioinformatics for Tropical Disease Research

São Paulo, February 17th to March 2nd, 2002

Faculty of Veterinary Medicine and Zootechny, University of São Paulo, BRAZIL

AG-FMVZ-USP

slide-2
SLIDE 2

Why assemble?

[Figure: shotgun workflow. Whole genome or BAC/cosmid clone; DNA fragmentation (sonic disruption, nebulization); small fragments, 1.0-2.0 kb; clone library (pUC18); DNA sequencing of random clones; assembly (partial, contigs); finishing (gap filling, coverage of both strands, quality); final consensus sequence]

  • Current DNA sequencing methods generate reads of 500-700 bp, the resolution limit of electrophoresis
  • Whole genomes or large clones need to be fragmented into a clone library
  • Short fragments are randomly sequenced (shotgun approach); the reads are assembled to form the final consensus sequence
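As a rough worked example of the shotgun arithmetic, the sketch below estimates how many reads a target needs. The 150 kb clone size, the 600 bp read length, and the 8x coverage target are illustrative assumptions (chosen within the ranges quoted on the slides), and the covered-fraction estimate assumes reads land uniformly at random.

```python
import math


def reads_needed(target_length_bp: int, read_length_bp: int, coverage: float) -> int:
    """Reads required so that total sequenced bases = coverage x target length."""
    return math.ceil(coverage * target_length_bp / read_length_bp)


def expected_fraction_covered(coverage: float) -> float:
    """Poisson estimate of the fraction of bases hit by at least one read."""
    return 1.0 - math.exp(-coverage)


# Assumed values: a 150 kb BAC, 600 bp reads, 8x coverage.
bac_reads = reads_needed(150_000, 600, 8.0)
covered = expected_fraction_covered(8.0)
print(bac_reads, round(covered, 5))
```

Even at high coverage a small fraction of bases is expected to remain unsampled, which is one reason a finishing phase follows the random phase.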

slide-3
SLIDE 3

Shotgun Sequencing I - random phase

[Figure: BAC clone (100-200 kb); sheared DNA (1.0-2.0 kb); sequencing templates; random reads]

Modified from BCM-HGSC

slide-4
SLIDE 4

Shotgun Sequencing II - assembly

[Figure: assembled reads and consensus sequence, with problem regions labelled: gap, low base quality, single-stranded region, mis-assembly (inverted)]

Modified from BCM-HGSC

slide-5
SLIDE 5

Shotgun Sequencing III - finishing

[Figure: consensus sequence with problem regions labelled: gap, low base quality, single-stranded region, mis-assembly (inverted)]

Modified from BCM-HGSC

slide-6
SLIDE 6

Shotgun Sequencing III - finishing

[Figure: consensus sequence with remaining problem regions labelled: gap, single-stranded region, mis-assembly (inverted)]

Modified from BCM-HGSC

slide-7
SLIDE 7

Shotgun Sequencing III - finishing

[Figure: consensus sequence with remaining problem regions labelled: gap, mis-assembly (inverted)]

Modified from BCM-HGSC

slide-8
SLIDE 8

Shotgun Sequencing III - finishing

[Figure: consensus sequence with one remaining problem region labelled: gap]

Modified from BCM-HGSC

slide-9
SLIDE 9

Shotgun Sequencing III - finishing

[Figure: finished consensus]

High Accuracy Sequence: < 1 error / 10,000 bases

Modified from BCM-HGSC

slide-10
SLIDE 10

How to deal with the enormous amount of reads generated by the high throughput DNA sequencers?

[Photo: Sanger Centre]

slide-11
SLIDE 11

Phred/Phrap/Consed Package

Phred/Phrap/Consed is a worldwide distributed package for:

  • a. Trace file (chromatogram) reading;
  • b. Quality (confidence) assignment to each individual base;
  • c. Vector and repeat sequence identification and masking;
  • d. Sequence assembly and error probability assignment to the consensus sequence;
  • e. Assembly viewing and editing;
  • f. Automatic finishing.

slide-12
SLIDE 12

Phred/Phrap/Consed Pipeline

Directories: Chromat_dir, Phd_dir, Edit_dir

Steps, in order:

  • Input: chromatogram files
  • Quality (confidence) values assignment: Phred. Output: phd files (*.phd)
  • Conversion, phd to FASTA: phd2fasta.pl. Output: nucleotide sequences (seq.fasta) and quality values (seq.fasta.qual)
  • Vector screening and masking: Cross_Match (local alignment program) against vector.seq. Output: screened/masked file (seq.fasta.screen) and quality values (seq.fasta.screen.qual)
  • Assembly: Phrap. Output: assembled contigs (seq.fasta.screen.contigs) and assembly file (seq.fasta.screen.ace#)
  • Assembly viewing/editing: Consed
  • Finishing: Autofinish + manual finishing

slide-13
SLIDE 13

Phred

Genome Research 8: 175-185, 1998

slide-14
SLIDE 14

Phred

Genome Research 8: 186-194, 1998

slide-15
SLIDE 15

Phred

Phred is a program that performs several tasks:

  • a. Reads trace files: compatible with most file formats, including SCF (standard chromatogram format), ABI (373/377/3700), ESD (MegaBACE) and LI-COR.
  • b. Calls bases: attributes a base to each identified peak, with a lower error rate than the standard base calling programs.

slide-16
SLIDE 16

Phred

  • c. Assigns quality values to the bases: a “Phred value” based on an error rate estimation calculated for each individual base.
  • d. Creates output files: base calls and quality values are written to output files.

slide-17
SLIDE 17

Trace File

High quality read:

  • no ambiguities (Ns)
  • no noise
  • peaks very well spaced

slide-18
SLIDE 18

Trace File

Good quality read:

  • no ambiguities (Ns)
  • some noise (notice baseline)
  • peaks very well spaced

slide-19
SLIDE 19

Trace File

Poor quality read:

  • some ambiguities (Ns)
  • bad noise (notice baseline)
  • overlapping peaks
  • can be caused by a bad quality template, a bad matrix, or a low signal-to-noise ratio

slide-20
SLIDE 20

Trace File

Poor quality read:

  • many ambiguities (Ns)
  • noise
  • caused by homopolymeric region / polymerase slippage

slide-21
SLIDE 21

Trace File

Sudden drop artifact:

  • good quality region is followed by a sudden drop of signal
  • caused by secondary structure

slide-22
SLIDE 22

Trace File

High quality region:

  • no ambiguities (Ns)
  • no noise
  • peaks very well spaced

slide-23
SLIDE 23

Trace File

Medium quality region:

  • some ambiguities (Ns)
  • no noise
  • peaks very well spaced
  • some homopolymeric stretches are not well resolved

slide-24
SLIDE 24

Trace File

Poor quality region, caused by diffusion effects and a decrease in the relative mass difference between the sequencing products:

  • overlapping peaks, peaks not evenly spaced
  • low resolution
  • low confidence in base assignment

slide-25
SLIDE 25

Phred

Analysis steps

a) Predicts idealized (expected) peaks (amplitudes), based effectively on the best region of the trace
b) Identifies observed peaks
c) Compares observed and expected peaks (divides the peaks into matched and unmatched)
d) Unmatched peaks are analyzed for any peak that could be called, but was not called in step c

Modified from Evan Eichler, Ph.D.

slide-26
SLIDE 26

Phred value formula

$q = -10 \times \log_{10}(p)$

where
  q - quality value
  p - estimated error probability for a base call

Examples:
  q = 20 means $p = 10^{-2}$ (1 error in 100 bases)
  q = 40 means $p = 10^{-4}$ (1 error in 10,000 bases)
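The formula can be checked numerically with a minimal sketch such as the one below. The function names are illustrative choices, not part of Phred; the last helper sums per-base error probabilities to estimate the expected number of errors in a read.

```python
import math


def phred_quality(p_error: float) -> float:
    """q = -10 * log10(p), where p is the estimated error probability."""
    return -10.0 * math.log10(p_error)


def error_probability(q: float) -> float:
    """Inverse of the formula above: p = 10^(-q/10)."""
    return 10.0 ** (-q / 10.0)


def expected_errors(qualities) -> float:
    """Expected number of wrong base calls in a read, given its quality values."""
    return sum(error_probability(q) for q in qualities)


print(phred_quality(0.01))        # quality for 1 error in 100 bases
print(error_probability(40))      # probability for q = 40
print(expected_errors([20, 20, 40]))
```

Because the scale is logarithmic, every increase of 10 in q corresponds to a tenfold drop in the estimated error probability.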

slide-27
SLIDE 27

The structure of a phd file

Excerpts from one file. Each line between BEGIN_DNA and END_DNA holds a base call, its quality value, and its peak position in the trace; [...] marks omitted lines.

BEGIN_SEQUENCE 01EBV10201A02.g

BEGIN_COMMENT

CHROMAT_FILE: EBV10201A02.g
ABI_THUMBPRINT:
PHRED_VERSION: 0.990722.g
CALL_METHOD: phred
QUALITY_LEVELS: 99
TIME: Thu May 24 00:18:58 2001
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX: 12153
TRIM:
CHEM: term
DYE: big

END_COMMENT

BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
[...]
t 24 2221
a 24 2232
a 22 2245
a 27 2261
g 25 2272
c 19 2286
c 12 2302
t 19 2314
g 12 2324
g 15 2331
g 19 2346
g 23 2363
t 33 2378
g 36 2390
c 44 2404
c 44 2419
t 39 2433
a 39 2446
a 34 2460
t 35 2470
g 34 2482
[...]
t 16 8191
g 19 8200
t 13 8211
c 13 8229
g 4 8241
n 4 8253
c 4 8263
t 10 8276
t 9 8286
c 12 8301
t 16 8313
c 12 8329
c 12 8336
c 15 8343
t 19 8356
c 9 8371
g 13 8386
g 14 8397
a 7 8417
g 9 8427
g 4 8445
[...]
t 6 11908
a 6 11921
g 6 11927
t 6 11947
c 6 11953
a 6 11964
g 6 11981
c 4 11994
n 4 12015
c 4 12037
n 4 12044
n 4 12058
n 4 12071
n 4 12085
n 4 12098
n 4 12111
n 4 12124
c 4 12144
n 4 12151
END_DNA

END_SEQUENCE
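A phd file with this layout is simple to read programmatically. The following is a minimal sketch, assuming only the BEGIN_DNA/END_DNA layout shown on this slide; `parse_phd` and `BaseCall` are hypothetical names, and the comment block is ignored.

```python
from typing import List, NamedTuple


class BaseCall(NamedTuple):
    base: str       # called base (a, c, g, t or n)
    quality: int    # Phred quality value
    position: int   # peak position in the trace


def parse_phd(text: str) -> List[BaseCall]:
    """Collect the base-call lines found between BEGIN_DNA and END_DNA."""
    calls = []
    in_dna = False
    for line in text.splitlines():
        line = line.strip()
        if line == "BEGIN_DNA":
            in_dna = True
        elif line == "END_DNA":
            in_dna = False
        elif in_dna and line:
            base, quality, position = line.split()[:3]
            calls.append(BaseCall(base, int(quality), int(position)))
    return calls


# First four base calls of the file shown on the slide.
example = """BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
END_DNA
END_SEQUENCE
"""
calls = parse_phd(example)
print(calls[0], len(calls))
```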

slide-28
SLIDE 28

Phred/Phrap/Consed Pipeline

[Figure: pipeline overview, repeated from slide 12]

slide-29
SLIDE 29

Conversion of phd files into FASTA files: phd2fasta script

Features:

  • Phred creates single-sequence files containing the sequence itself plus the quality assignments (phd files)
  • The input file for the cross_match and phrap programs is a multiple-sequence file in FASTA format
  • A Perl script named phd2fasta converts the phd files into two multiple-sequence FASTA-format files, containing the sequence information and the base-call quality information, respectively
  • The phredPhrap script automatically executes phd2fasta before running cross_match and phrap!
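The idea behind the conversion can be sketched as follows. This is not the phd2fasta script itself: the function name, the bare `>name` header, and the 50-column wrapping are simplifying assumptions made for illustration.

```python
def phd_to_fasta_records(name, calls, width=50):
    """Build one FASTA record and one matching quality record.

    `calls` is a list of (base, quality) pairs for a single read.
    """
    seq = "".join(base for base, _ in calls)
    quals = [str(quality) for _, quality in calls]
    seq_lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    qual_lines = [" ".join(quals[i:i + width]) for i in range(0, len(quals), width)]
    fasta = ">" + name + "\n" + "\n".join(seq_lines) + "\n"
    qual = ">" + name + "\n" + "\n".join(qual_lines) + "\n"
    return fasta, qual


fasta, qual = phd_to_fasta_records("read1", [("t", 8), ("c", 13), ("a", 19), ("c", 19)])
print(fasta, qual, sep="")
```

Appending the records of every read to the two output files yields the multiple-sequence sequence file and its companion quality file.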

slide-30
SLIDE 30

Phred/Phrap/Consed Pipeline

[Figure: pipeline overview, repeated from slide 12]

slide-31
SLIDE 31

Vector screening

Features:

This step removes or screens out vector sequence before running phrap

Program:

Cross_match - a program for rapid sequence comparison and database search, based on an efficient implementation of the Smith-Waterman-Gotoh algorithm.

Command:

cross_match seq_file1 [seq_file2...] [-option value] [-option value]

  • seq_file is a file containing sequences in FASTA format
  • all sequences in seq_file1 (query) are compared to sequences in seq_file2 (subject)
  • matches meeting relevant criteria are written to the standard output

slide-32
SLIDE 32

Vector screening

Example:

cross_match seqfile.fasta vector.seq -minmatch 10 -minscore 20 -screen > screen.out

where:

‘seqfile.fasta’ is a file containing multiple reads in FASTA format

‘vector.seq’ is a file containing the vector sequences

‘-minmatch’ and ‘-minscore’ are parameters for pairwise alignment

‘-screen’ creates a file named seqfile.fasta.screen containing vector-masked versions of the original sequences. Any region matching any part of a vector sequence is replaced by Xs.

‘screen.out’ contains a list of the matches found

  • the ‘.screen’ file is the input for phrap
  • if a ‘.qual’ file was created (i.e. seqfile.fasta.qual), it has to be renamed to seqfile.fasta.screen.qual; the phredPhrap script automatically performs this step!
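The masking effect of screening can be illustrated with a small sketch. It does not run or reimplement cross_match: the match coordinates are supplied by hand, and `mask_vector` is a hypothetical helper that only shows what replacing matched regions with Xs does to a read.

```python
def mask_vector(sequence, matches, mask_char="X"):
    """Replace every matched interval with the mask character.

    `matches` is a list of (start, end) pairs, 1-based and inclusive.
    """
    masked = list(sequence)
    for start, end in matches:
        for i in range(max(start, 1) - 1, min(end, len(masked))):
            masked[i] = mask_char
    return "".join(masked)


# Assumed example: vector matches at both ends of a 10-base read.
print(mask_vector("ACGTACGTAC", [(1, 3), (9, 10)]))
```

Masking keeps the read length unchanged, so the quality values in the companion file still line up base by base.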

slide-33
SLIDE 33

Phred/Phrap/Consed Pipeline

[Figure: pipeline overview, repeated from slide 12]

slide-34
SLIDE 34

Phrap - Phragment Assembly Program or… Phil’s Revised Assembly Program

Phrap is a program for assembling shotgun DNA sequence data

Command:

phrap seq_file1 [seq_file2...] [-option value] [-option value]

  • seq_file is a file containing multiple sequences in FASTA format
  • the current version only handles a single sequence file
  • all the sequences in the seq_file are compared to each other

slide-35
SLIDE 35

Phrap

Key Features:

  • a. Uses the entire read content: no need for trimming.
  • b. User-supplied (e.g. Repbase) + internally computed data: better accuracy of assembly in the presence of repeats.
  • c. Contig sequence is constituted by a mosaic of the highest quality parts of the reads: it’s not a consensus!
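The difference between a majority consensus and a quality-driven choice can be shown with a toy example. This is a deliberately simplified, per-column illustration and not Phrap's actual algorithm, which selects stretches of reads rather than single bases; the alignment columns and function names below are made up.

```python
from collections import Counter


def majority_consensus(columns):
    """Pick the most frequent base in each aligned column."""
    return "".join(
        Counter(base for base, _ in column).most_common(1)[0][0]
        for column in columns
    )


def highest_quality_pick(columns):
    """Pick, in each aligned column, the base call with the highest quality."""
    return "".join(max(column, key=lambda call: call[1])[0] for column in columns)


# Each column lists (base, quality) pairs from the reads covering that position.
columns = [
    [("a", 40), ("a", 35), ("a", 12)],
    [("c", 8), ("c", 6), ("g", 45)],   # two poor calls against one very good call
    [("t", 30), ("t", 28)],
]
print(majority_consensus(columns), highest_quality_pick(columns))
```

In the second column the two approaches disagree: the majority vote follows two low-quality calls, while the quality-driven choice follows the single high-quality call.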

slide-36
SLIDE 36

Phrap

Key Features:

  • e. Handles very large datasets: hundreds of thousands of reads are easily manipulated.
  • f. Generates output files: these contain some important data and enable visualization by other programs

slide-37
SLIDE 37

Phrap output files

  • *.contigs - FASTA file containing the contigs
      • Contigs with more than one read
      • Singletons (single reads with a match to some other contig but that couldn’t be merged consistently to it)
  • *.singlets - FASTA file of the singlet reads
      • Reads with no match to any other read
  • *.ace - allows for viewing the assembly using Consed
  • *.view - required for viewing the assembly using Phrapview
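Since the contigs and singlets files are plain FASTA, a quick summary of an assembly can be obtained with a few lines of code. The sketch below is a generic FASTA reader; the record names and sequences in the example are invented, not real Phrap output.

```python
def read_fasta(text):
    """Return a dict mapping each record name to its full sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # first word of the header line
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Invented example in the layout of a *.contigs file.
example = """>Contig1
ACGTACGTAC
GGTTA
>Contig2
TTTTGGGG
"""
lengths = {name: len(seq) for name, seq in read_fasta(example).items()}
print(lengths)
```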

slide-38
SLIDE 38

Phred/Phrap/Consed Pipeline

[Figure: pipeline overview, repeated from slide 12]

slide-39
SLIDE 39

Consed

Genome Research 8: 195-202, 1998

slide-40
SLIDE 40

Consed

Consed is a program for viewing and editing assemblies produced by Phrap

Key Features:

  • a. Assembly viewer: allows for visualization of contigs, assembly (aligned reads), quality values of reads and final sequence.
  • b. Trace file viewer: single and multiple trace files can be visualized, allowing for comparison of a given sequence in several reads.

slide-41
SLIDE 41

Consed

Consed is a program for viewing and editing assemblies produced by Phrap

Key Features:

  • c. Navigation: identifies and lists regions which are below a given quality threshold, contain high quality discrepancies, have single-strand coverage, etc.
  • d. Autofinish: automatic set of functions for gap closure, improvement of sequence quality, determination of relative orientation of contigs, and identification of regions covered by a single read or by reads of a single strand. The program automatically performs primer picking and chooses the templates.
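The kind of list produced by the navigation feature can be imitated with a short sketch. This is not Consed code: the function name and the threshold of 20 are assumptions, and the input is simply a list of consensus quality values.

```python
def low_quality_regions(qualities, threshold=20):
    """Return (start, end) intervals, 1-based inclusive, where quality < threshold."""
    regions = []
    start = None
    for i, q in enumerate(qualities, start=1):
        if q < threshold:
            if start is None:
                start = i            # a low-quality run begins here
        elif start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:            # run extends to the end of the sequence
        regions.append((start, len(qualities)))
    return regions


print(low_quality_regions([40, 12, 8, 35, 50, 19], threshold=20))
```

Each interval in the result is a candidate region for inspection in the trace viewer or for an additional finishing read.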

slide-42
SLIDE 42

slide-43
SLIDE 43

slide-44
SLIDE 44

slide-45
SLIDE 45

slide-46
SLIDE 46

slide-47
SLIDE 47

slide-48
SLIDE 48

slide-49
SLIDE 49

slide-50
SLIDE 50

slide-51
SLIDE 51

Phred/Phrap/Consed Pipeline

[Figure: pipeline overview, repeated from slide 12]

slide-52
SLIDE 52

Autofinish

Genome Research 11: 614-625, 2001

slide-53
SLIDE 53

Autofinish

Features:

  • Autofinish is part of the Consed package.
  • It automatically chooses finishing reads in order to finish a project.
  • The “finished” status is defined by the user according to pre-defined parameters

slide-54
SLIDE 54

Autofinish

Autofinish allows the user to:

  • Figure out how contigs are ordered and oriented
  • Close gaps
  • Improve the error rate
  • Cover every base by reads from at least 2 different subclones
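The last criterion, coverage by at least two different subclones, can be checked with a simple sketch. This is an illustration of the criterion only, not of how Autofinish works internally; the read coordinates and subclone names are invented.

```python
def weakly_covered_positions(length, reads, min_subclones=2):
    """List positions (1-based) covered by fewer than `min_subclones` subclones.

    `reads` is a list of (subclone, start, end) tuples, 1-based and inclusive.
    """
    subclones_at = [set() for _ in range(length)]
    for subclone, start, end in reads:
        for pos in range(max(start, 1), min(end, length) + 1):
            subclones_at[pos - 1].add(subclone)
    return [
        pos
        for pos, subclones in enumerate(subclones_at, start=1)
        if len(subclones) < min_subclones
    ]


# Two reads from subclone sc1 and one from sc2 on a 10-base contig.
reads = [("sc1", 1, 6), ("sc1", 4, 8), ("sc2", 5, 10)]
print(weakly_covered_positions(10, reads))
```

Note that two reads from the same subclone count only once, which is why position 4 is still reported in the example.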

slide-55
SLIDE 55

Autofinish

Autofinish will suggest any of the following types of reads:

  • Forward universal primer terminator reads
  • Reverse universal primer terminator reads
  • Custom primer reads with subclone template
  • Custom primer reads with whole clone template
  • Minilibraries
  • PCR

slide-56
SLIDE 56

Autofinish

Finishing procedure:

[Figure: finishing cycle with the stages: shotgun reads; assemble new reads with existing reads; Autofinish suggests reads; make reads in lab]

slide-57
SLIDE 57

Finishing Problems

Finishing can be a boring and difficult task due to:

DNA sequencing problems

  • a. High GC content: genomes presenting a high GC content are more prone to generate artifacts such as compressions, sudden drops, and bad quality regions.
    Try to use Dye Primer instead of Dye Terminator, change chemistry, add DMSO, increase annealing temperature, use deaza-dGTP instead of dGTP, etc.
  • b. Palindromic regions: lead to strong secondary structures causing sudden drops.
    Try to use deaza-dGTP instead of dGTP, or amplify the problematic region by PCR and sequence the product.
  • c. Homopolymeric regions: can reduce DNA synthesis efficiency for some chemistries.
    Try to use Dye Primer instead of Dye Terminator, or change chemistry (dRhodamine instead of BigDye).

slide-58
SLIDE 58

Finishing Problems

Finishing can be a boring and difficult task due to:

DNA assembly problems

  • a. High content of repeats – highly repeated elements reduce the accuracy of DNA assembly. Identify the repeat unit, screen it with Cross_match or RepeatMasker and mask it. Try to assemble again and add the repetitive region only at the end. Map the repetitive region using restriction enzymes to estimate its size and number of repeat units.

  • b. High AT content – some highly biased genomes (e.g. Plasmodium falciparum; plastid genomes) can pose a problem for assembly programs. Very difficult to solve. Try to determine a restriction map and associate the mapping with the DNA sequencing data.
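The masking step in item a can be illustrated with a minimal Python sketch. It is not how Cross_match or RepeatMasker work internally (those tools use sequence alignment and tolerate mismatches); this version only masks exact copies of a known repeat unit on either strand. The function names `mask_repeat` and `estimate_copy_number` are hypothetical, chosen for this illustration.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def mask_repeat(read, unit, mask_char="X"):
    """Mask every exact copy of `unit` (either strand) in `read`.

    Assumption: exact matching only; real screening tools align
    sequences and therefore also catch diverged repeat copies.
    """
    masked = list(read)
    upper = read.upper()
    for pattern in {unit.upper(), reverse_complement(unit).upper()}:
        start = upper.find(pattern)
        while start != -1:
            # Overwrite the matched bases with the mask character.
            masked[start:start + len(pattern)] = mask_char * len(pattern)
            # Continue one base further so overlapping copies are found.
            start = upper.find(pattern, start + 1)
    return "".join(masked)


def estimate_copy_number(fragment_bp, unit_bp):
    """Rough number of repeat units in a mapped restriction fragment,
    assuming the fragment consists only of tandem copies of the unit."""
    return fragment_bp / unit_bp
```

For example, a restriction fragment of about 3000 bp made of a 150 bp unit would suggest roughly 20 tandem copies, which is the kind of estimate the restriction mapping in item a is meant to provide.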


slide-59
SLIDE 59

How to get the programs

Supported platforms:

  • Sun Solaris (2.5.1 or later)
  • DEC-Alpha Digital Unix (OSF1 V4.0 or later)
  • HP-UX (11.0 or later)
  • SGI Irix (6.2 or later)
  • Linux (Red Hat 5.2 or later)

Note: there are commercial versions of Phred/Phrap for the DOS/Windows and MacOS platforms (no Consed version so far).

Internet sites:

  • http://www.phrap.org - academic version
  • http://www.phrap.com and http://www.codoncode.com - commercial version


slide-60
SLIDE 60

Contacts

To obtain the programs, or to send questions, bug reports and suggestions:

  • Phrap/Cross_match/Swat – Phil Green – phg@u.washington.edu
  • Phred – Brent Ewing – bge@u.washington.edu
  • Consed – David Gordon – gordon@genome.washington.edu


slide-61
SLIDE 61

Preparing sequence trace data for analysis or assembly – pregap4

  • Graphical user interface
  • Prepare trace data
  • Automation
  • Trace format conversion
  • Quality analysis
  • Vector clipping
  • Contaminant screening
  • Repeat searching
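To give an idea of what a quality analysis step can involve, here is a minimal Python sketch of sliding-window quality clipping. It is an illustration only and not pregap4's actual algorithm; the function name `quality_clip`, the window size and the quality threshold are all assumptions made for this example.

```python
def quality_clip(quals, window=10, min_mean=20):
    """Return (start, end) of the region of a read worth keeping.

    The region runs from the first window whose mean quality reaches
    `min_mean` to the end of the last such window. Returns (0, 0) if
    the read is shorter than one window or no window passes.
    Assumption: `quals` is a list of per-base quality values.
    """
    n = len(quals)
    if n < window:
        return (0, 0)
    # Start positions of all windows with an acceptable mean quality.
    good = [i for i in range(n - window + 1)
            if sum(quals[i:i + window]) / window >= min_mean]
    if not good:
        return (0, 0)
    return (good[0], good[-1] + window)
```

A read with poor quality at both ends and good quality in the middle would be trimmed to roughly its central portion, and the coordinates could then be used to hide the low-quality ends from the assembler.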

The Staden Package

Medical Research Council – Laboratory of Molecular Biology (MRC-LMB) - UK


slide-62
SLIDE 62

Assembly program – gap4

  • Assembly
  • Contig joining
  • Assembly checking
  • Repeat searching
  • Experiment suggestion
  • Read pair analysis
  • Contig editing
  • Graphical views of contigs
  • Database

Note: ace files produced by a special version of Phrap can be viewed by Gap4.

The Staden Package


slide-63
SLIDE 63


slide-64
SLIDE 64


slide-65
SLIDE 65


slide-66
SLIDE 66

Supported platforms:

  • Sun Solaris
  • Compaq Tru64 UNIX (Alpha)
  • SGI Irix
  • Linux
  • MS Windows (Win9x, NT, 2000)

Internet site: http://www.mrc-lmb.cam.ac.uk/pubseq/

E-mail: Rodger Staden - rs@mrc-lmb.cam.ac.uk

The Staden Package


slide-67
SLIDE 67

CAP3 - Sequence Assembly Program

Genome Research 9: 868-877, 1999


slide-68
SLIDE 68

Characteristics:

  • Makes use of quality values – qual files produced by Phred can be used by CAP3
  • Produces an ace file compatible with Consed
  • Can also be used in Gap4 (Staden Package)
  • The program is available upon request – send an e-mail to Xiaoqiu Huang – huang@mtu.edu

CAP3 - Sequence Assembly Program
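The quality values mentioned above can be inspected with a minimal Python sketch. It assumes the usual FASTA-style qual layout (a `>` header line followed by space-separated integer qualities) and the standard Phred definition Q = -10 log10(P). The function names `parse_qual` and `error_probability` are hypothetical, and nothing here reflects how CAP3 itself reads these files.

```python
def parse_qual(text):
    """Parse FASTA-style .qual text into {read_name: [quality, ...]}.

    Assumption: each record starts with a '>' header whose first word
    is the read name, followed by lines of space-separated integers.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the read name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].extend(int(tok) for tok in line.split())
    return records


def error_probability(q):
    """Convert a Phred quality value to an estimated error probability."""
    return 10 ** (-q / 10)
```

A base with quality 20 corresponds to an estimated error probability of 1 in 100, and quality 30 to 1 in 1000, which is why an assembler can weight high-quality bases more heavily when building the consensus.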


slide-70
SLIDE 70

E-mail: argruber@usp.br
