
SLIDE 1

EMBnet Course: Unix & PERL for biomedical researchers

Lausanne, 2 March 2005

Lorenza Bordoli

Swiss Institute of Bioinformatics

Outline

What is EMBOSS?
Major programs
Running EMBOSS Programs from the Unix command line

Combining EMBOSS with Perl

SLIDE 2

Why EMBOSS ?

History:

Wisconsin (sequence analysis) package, GCG (Genetics Computer Group): founded in 1982 as a service of the Department of Genetics at the University of Wisconsin;

Widely used, sources available for inspection (programs could be algorithmically verified and adapted to needs);

Since 1988: EGCG (extended GCG), an academic add-on, started as a small collection of programs to support EMBL's research activities, in particular the development of automated DNA sequencing;

GCG became a private company in 1990, now belongs to Accelrys;
The sources are no longer freely available, and it is no longer possible to distribute academic software source code which uses the GCG libraries!

1999 - EGCG split from GCG to become EMBOSS;

February 2005: version 2.9.0

What is EMBOSS ?

http://emboss.sourceforge.net/
EMBOSS, the European Molecular Biology Open Software Suite, is a package of high-quality FREE Open Source software for sequence analysis;

EMBOSS includes more than 150 applications. They share a similar interface, but each comes with its own documentation:

Many sequence analysis & display programs.
Protein 3D structure prediction being developed.
Other assorted programs, e.g. enzyme kinetics.
Extensible (with some C programming knowledge)!

Complete list of the programs in the current release: http://emboss.sourceforge.net/apps/#Overview

Grouped applications: http://emboss.sourceforge.net/apps/groups.html

SLIDE 3

EMBOSS !

Free Open Source (for most Unix platforms, including Mac OS X)
GCG successor (compatible with GCG file format)
Freely distributable (GNU General Public License)
Written by HGMP/Sanger/EBI/Denmark … etc
Easy to install locally, but no graphical interface (Unix command line only), and it requires local databases

Interfaces:

With account: Jemboss, www2gcg, w2h, wEMBOSS…
No account: Pise, EMBOSS-GUI, SRS
Local: Staden, Kaptain, CoLiMate, Jemboss

History (cont.):

The UK Medical Research Council is to close the Research and Bioinformatics Divisions of the RFCGR (the current home of EMBOSS) in the middle of 2005. The MRC Press Office has stated: "All MRC can say at this stage is that Council have made a decision to close the Research and Bioinformatics Divisions. However, the Director has been asked to draw up a closing down plan for consideration by Council in July."

“This action will more than halve the current core development team and will therefore adversely affect the development and support of EMBOSS. We hope that alternative sources of funding can be found.”

EMBOSS has moved to SourceForge.net (http://sourceforge.net/);

SLIDE 4

EMBOSS: Introduction

The EMBOSS package consists of a large number of separate programs, each with a specific function.

They usually take one or more input files and some parameters that are important to the function, and produce output in the form of files, plots, web pages or simple text output.

Running EMBOSS Programs

EMBOSS programs are run by:

typing them at the UNIX prompt,
or by using a graphical interface.

Local computer: your PC in the lab, in the course room, …
Remote server (sib-dea.unil.ch): your personal account

SLIDE 5

Running EMBOSS Programs (cont.)

EMBOSS programs are run by:

typing them at the UNIX prompt: X-terminal (X Window System) on the local computer, connected to the remote server sib-dea.unil.ch;
or by using an interface: web based (browser) or Java based, on the local computer, connected to the remote server sib-dea.unil.ch. Sequence entered in the example on the slide: embl:m93650

SLIDE 6

Graphical interfaces to EMBOSS

Jemboss: java based interface to EMBOSS
wEMBOSS: web interface to EMBOSS
Pise: web interface to EMBOSS
others : http://emboss.sourceforge.net/

Access to EMBOSS: graphical interfaces

http://emboss.ch.embnet.org

Jemboss: account needed on EMBnet machine (-> ask Laurent)
Pise: no account needed: anonymous access

SLIDE 7

http://www.bc2.unibas.ch/

Jemboss/wEMBOSS: unibasel account needed

Some major programs:

General:

wossname               lists all EMBOSS programs
showdb                 shows the available databases

Sequence retrieval:

seqret                 retrieves and/or changes the format of a sequence
seqretset, seqretall   retrieve and/or change the format of a number of sequences at once
transeq                translates a DNA sequence to protein
backtranseq            translates a protein sequence to DNA
extractseq             extracts regions from a sequence
cutseq                 removes a region from a sequence
pasteseq               inserts a sequence into another sequence
infoseq                displays information about a sequence
splitter               splits a sequence into smaller sequences

SLIDE 8

Some major programs (cont.):

Sequence comparison

needle                 Needleman-Wunsch sequence alignment
water                  Smith-Waterman sequence alignment
stretcher              Myers and Miller global alignment
matcher                Huang and Miller local alignment
dottup, dotmatcher     dotplot comparisons of two sequences
prettyplot             plots multiple sequence alignments
polydot, supermatcher  dotplot comparisons of multiple sequences
emma                   ClustalW program

Sequence parameters

cusp                   generates a codon usage table
syco                   synonymous codon usage plot
dan                    calculates DNA/RNA melting temperature
compseq                sequence composition tables

DNA Sequence features

remap                  restriction map of the sequence
cpgplot, cpgreport     CpG island detection
etandem, einverted     find tandem and inverted repeats
plotorf                plots potential ORFs
showorf                pretty display of potential ORFs
fuzznuc                DNA pattern search
tfscan                 scans sequence for TF binding sites

Some major programs (cont.):

SLIDE 9

Protein Sequence features

iep                    isoelectric point calculation
antigenic              finds potential antigenic sites
digest                 protein digestion map
findkm                 Vmax and Km calculations
fuzzpro                protein pattern search
garnier                protein 2D structure prediction
helixturnhelix         finds nucleic acid binding motifs
octanol, pepwindow     display protein hydropathy
patmatdb, patmatmotifs searching with motifs vs protein sequences
pepcoil                predicts coiled coil regions
pepinfo, pepstats      protein information
HMMER package          ehmmpfam, ehmmsearch, ehmmbuild, …
Phylip package         efitch, edolpenny, edollop, …

Working with sequences :

EMBOSS reads sequences from files or databases.
It automatically recognizes the input sequence format.
You can easily specify many output formats.

SLIDE 10

Uniform Sequence Address (USA)

= a standard way of specifying a sequence to be read into a program in EMBOSS
Sequences can be in databases or in files
It has the following syntax:

format::database:entry
format::file:entry

In general, a USA specifies:

what sequence format to expect
what file or database to open
what entry to look for

Uniform Sequence Address (USA)

format::database:entry

Of these, only the “file” or “database” part is necessary;
If the format is omitted, EMBOSS checks and recognizes the format itself (occasionally it needs a hint) *;
If the “entry” part is omitted, all of the entries in the file or database are read in;

* EMBOSS recognizes: GCG, FASTA, ClustalW, MSF, EMBL, GenBank, DNAStrider, Phylip, PIR, PAUP, ASN.1, NBRF, Fitch, IntelliGenetics

SLIDE 11

Uniform Sequence Address (USA)

The most common ways of specifying a sequence are:

to type the name of the file that the sequence is in: myfile.seq
or to type “db:entry”, where “db” is the name of the database and “entry” is either the ID or the accession number (AC) of the sequence in the database

Examples:

database:accession    embl:X65923
database:ID           swissprot:opsd_xenla
file name             myfile.seq
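To make the three parts of a USA concrete, here is a minimal shell sketch that splits a simple USA of the form [format::]source[:entry] into its components. The function name split_usa is hypothetical and is not part of EMBOSS; it does not handle list files, "asis" sequences, ranges or search fields, and it only illustrates the syntax described above.

```sh
# split_usa: split a simple USA into format, source (file or database)
# and entry. Hypothetical helper for illustration only, not an EMBOSS tool.
split_usa() {
    usa=$1 format="" src="" entry=""
    # An optional "format::" prefix comes first.
    case $usa in
        *::*) format=${usa%%::*}; usa=${usa#*::} ;;
    esac
    # What remains is either "source" or "source:entry".
    case $usa in
        *:*) src=${usa%%:*}; entry=${usa#*:} ;;
        *)   src=$usa ;;
    esac
    printf 'format=%s src=%s entry=%s\n' "$format" "$src" "$entry"
}

split_usa 'embl:X65923'
split_usa 'swissprot:opsd_xenla'
split_usa 'myfile.seq'
split_usa 'fasta::myfile.seq:opsd_xenla'
```

An empty format field corresponds to the case where EMBOSS has to recognize the format by itself, and an empty entry field to the case where all entries of the file or database are read.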

ACs and IDs …

An entry in a database is uniquely identified in that database. Most sequence databases have two such identifiers for each sequence: an ID name and an Accession number.

Why are there two such identifiers?

The ID name: a human-readable name that gives some indication of the function of its sequence, e.g. OPSD_HUMAN in Swiss-Prot. !! ID names are not guaranteed to remain the same between different versions of a database.

Accession numbers: unique alphanumeric identifiers that are guaranteed to remain with that sequence through the rest of the life of the database, e.g. P08100. If two sequences are merged into one, then the new sequence will get a new Accession number and the Accession numbers of the merged sequences will be retained as 'secondary' Accession numbers.

SLIDE 12

Databases

You can easily find out the names of the databases in your EMBOSS installation by running the showdb program:

% showdb
Displays information on the currently available databases
#Name        Type ID  Qry All Comment
#====        ==== ==  === === =======
sw           P    OK  OK  OK  Swiss-Prot section of UniProt
swiss        P    OK  OK  OK  Swiss-Prot section of UniProt
swiss-prot   P    OK  OK  OK  Swiss-Prot section of UniProt
trembl       P    OK  OK  OK  TrEMBL section of UniProt
uniprot      P    OK  OK  OK  UniProt (Swiss-Prot & TrEMBL)
…

Databases

#Name        Type ID  Qry All Comment
#====        ==== ==  === === =======
sw           P    OK  OK  OK  Swiss-Prot section of UniProt

ID: programs can extract a single explicitly named entry from the database: embl:x13776
Qry (query): programs can extract a set of entries whose names match a wildcard: sw:opsd_*
All: programs can analyze all entries in the database sequentially: embl:*

SLIDE 13

Uniform Sequence Address (USA)

you may also use:

filename          all sequences in a file
filename:entry    an entry in a file
dbname            all sequences in a database (not recommended)
dbname:entry      a sequence in a database
@listfile         a list file
list::listfile    a list file
asis::sequence    a specific short sequence

Specifying a List File

Instead of containing the sequences themselves, a list file contains “references” to sequences, using any valid USA.

Example of a list file:

opsd_abyko.fasta : the name of a sequence file;
sw:opsd_xenla : a specific sequence in the Swiss-Prot database;
@anotherlist : the name of a second list file;

Blank lines and lines starting with a '#' character are ignored in list files. This is a way to add your own comments, which won't be read by the programs.

SLIDE 14

Specifying a List File (cont.)

The same example, written out as the content of a list file:

# This is an example
# of a list file

opsd_abyko.fasta
sw:opsd_xenla
@anotherlist
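As a small illustration of the rule about comments and blank lines, the following shell sketch writes the example list file to a temporary file and prints only the lines that would be read as USAs. The helper list_usas is hypothetical; EMBOSS does this filtering internally, and the sketch only mimics the rule stated above.

```sh
# Write the example list file to a temporary file.
listfile=$(mktemp)
cat > "$listfile" <<'EOF'
# This is an example
# of a list file

opsd_abyko.fasta
sw:opsd_xenla
@anotherlist
EOF

# list_usas: print only the lines that hold a USA, i.e. drop blank lines
# and comment lines starting with '#'. Hypothetical helper, not an EMBOSS tool.
list_usas() {
    grep -v -e '^[[:space:]]*$' -e '^#' "$1"
}

list_usas "$listfile"
```

With EMBOSS installed, such a file would be passed to a program as @listfile or list::listfile, as listed in the USA table.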

Specifying a sequence “As Is”

The simplest USA format is the “asis” format. It is used to specify a sequence directly, without it having to be in a file or a database.

The syntax is “asis::sequence”:

asis::atgctagcttagc : for the sequence “atgctagcttagc”

SLIDE 15

Sequence Formats

Sequences can be read and written in a variety of formats;
Sequences are stored in databases or in files as simple text (ASCII text);
Microsoft WORD format is not a sequence format (save the file as text *.txt file!!!)
The default sequence file format is fasta:

>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcatt

(xyz: ID name)

Sequence Database Format: EMBL, GenBank, SwissProt, PIR;
Sequence File: Files can hold sequences in standard recognized formats;
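Because a fasta file is plain text, its structure can be checked with standard Unix tools. The sketch below uses awk to print the ID name and the length of each record; fasta_lengths is a hypothetical helper written for this illustration (within EMBOSS, infoseq is the program that displays information about a sequence), and the short sequence in the demo is made up.

```sh
# fasta_lengths: print "ID length" for every record of a fasta file
# (or of standard input). Hypothetical helper, not an EMBOSS program.
fasta_lengths() {
    awk '
        # A header line starts a new record; the ID is the first word after ">".
        /^>/ {
            if (id != "") print id, len
            id = substr($1, 2); len = 0
            next
        }
        # Any other line is sequence data; ignore white space when counting.
        { gsub(/[[:space:]]/, ""); len += length($0) }
        END { if (id != "") print id, len }
    ' "$@"
}

# Demo on a made-up two-line sequence.
printf '>xyz some other comment\nttcctctttc\ntcgac\n' | fasta_lengths
```

The demo prints the ID name "xyz" followed by the total number of residues on the sequence lines.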

Sequence Formats

Currently supported input/output formats (more than 42):

http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

Input Sequence Formats: fasta, EMBL (embl/em), Swiss-Prot (swissprot/swiss/sw), GCG (gcg), MSF (msf), GenBank (genbank), raw, …

Output Sequence Formats: embl, gcg, swiss, ClustalW (clustal, aln), genbank, …

SLIDE 16

Running EMBOSS from the Unix command line

The command line

EMBOSS programs are called by giving their name on the UNIX command line, either:

without parameters: the interactive way, a query-answer session in which the user is prompted to enter one piece of information at a time;

or with parameters: the program is given all the information it needs at once;

SLIDE 17

The command Line

EMBOSS programs are called by giving their name on the UNIX command line:
without parameters: the interactive way, a query-answer session in which the user is prompted to enter one piece of information at a time:

% seqret
Reads and writes (returns) a sequence
Input sequence: embl:xlrhodop
Output sequence [xlrhodop.fasta]:

% more xlrhodop.fasta
>XLRHODOP L07770 Xenopus laevis rhodopsin
ggtagaacagcttcagttgggatcacaggcttctagggatcctttgggcaaaaaagaaac
acagaaggcattctttctatacaagaaaggactttatagagctgctaccatgaacggaac

The command Line (cont.)

Another interactive session:

% wossname
Finds programs by keywords in their one-line documentation
Keyword to search for: restrict
SEARCH FOR 'RESTRICT'
recode    Remove restriction sites but maintain the same translation
remap     Display a sequence with restriction cut sites, translation
Etc…

SLIDE 18

The file “stdout”

Note that the default output file for wossname is:

stdout (Standard output)

Use this whenever prompted for an output file.
This is a ‘magic’ file name.
It displays the output on the screen, not a file.

The command Line

EMBOSS programs are called by giving their name on the UNIX command line:
with parameters: the program is given all the information it needs at once:

% seqret -sequence filename.seq -outseq xlrhodop.fasta

seqret : program name
filename.seq, xlrhodop.fasta : parameters/objects (here: sequence files)
-sequence, -outseq : qualifiers; they specify the properties of the object/parameter that follows (here: input and output sequence)

SLIDE 19

The command Line and parameters


There are 3 classes of parameters:

standard (mandatory): the minimum input for the program; the program will prompt for them. Ex.: sequence file name;

additional (optional): you will be prompted for them only if you use the -options (-opt) qualifier. Ex.: begin and end of the sequence;

advanced: you have to specify them on the command line (you won't be prompted);

How do I find them out? With the -help qualifier! (try also -verbose)

% programname -help

SLIDE 20

Qualifiers

% wossname -help (-verbose)

Mandatory qualifiers:
  [-search]     string    Enter a word or words here.

Optional qualifiers:
   -outfile     outfile   Output file name

Advanced qualifiers:
   -[no]emboss  bool      EMBOSS program documentation will be searched.

"-sequence" related qualifiers

   -sbegin      integer   first base used
   -send        integer   last base used, def=seq length
   -sreverse    bool      reverse (if DNA)
   -sask        bool      ask for begin/end/reverse
   -slower      bool      make lower case
   -supper      bool      make upper case
   -sformat     string    input sequence format

SLIDE 21

"-outseq" related qualifiers

   -osformat    string    output sequence format
   -ossingle    bool      separate file for each entry
   …

Boolean options (Yes/No, True/False)

-thing, -nothing
-thing=T, -thing=F
-thing=1, -thing=0
-thing=Y, -thing=N

SLIDE 22

General qualifiers

These can be used with any program:
-help       Prints a summary of the options the program can take. With -verbose it gives a more detailed list.
-options    Prompts the user for the optional parameters
-auto       Accepts all the default settings and runs without prompting the user (used when running programs from scripts)
-sask       Asks for the start, end and reverse of the sequence input
-stdout     Prints output to stdout (the screen) instead of to a file
-filter     Takes input from stdin (keyboard) and writes output to stdout
-warning    Reports warnings
-error      Reports errors

Example: Seqret

Give seqret all of its data on the command line:

% seqret embl:xlrhodop -outseq xlrhodop.fasta

Even shorter, leave out the qualifier:

% seqret embl:xlrhodop xlrhodop.fasta

To list all the qualifiers of seqret:

% seqret -help -verbose

SLIDE 23

Example: Seqret

seqret can reformat sequences by specifying the output format:

% seqret embl:xlrhodop -osformat gcg -outseq xlrhodop.gcg
% seqret embl:xlrhodop xlrhodop.gcg -osformat gcg

equivalent to:

% seqret embl:xlrhodop gcg::xlrhodop.gcg

% more xlrhodop.gcg
!!NA_SEQUENCE 1.0

Xenopus laevis rhodopsin mRNA, complete cds.

XLRHODOP  Length: 1684  Type: N  Check: 9453  ..

   1  ggtagaacag cttcagttgg gatcacaggc ttctagggat cctttgggca
  51  aaaaagaaac acagaaggca ttctttctat acaagaaagg actttataga

Practical:

Try running wossname:

Can you find a program to:

Display multiple alignments.
Find ORFs (Open Reading Frames).
Translate a sequence.
Find restriction enzyme sites.
Find the isoelectric point of a protein.
Do global alignments.

SLIDE 24


SLIDE 25

Asterisk on the command line

You can't use an unquoted '*' on the UNIX command line: the UNIX shell tries to match it to filenames.
Use it quoted, either with quotes or with a backslash:

"embl:*"
embl:\*

For example:

% seqret "sw:*_HUMAN" Human.seq
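The effect of quoting can be observed without EMBOSS. The sketch below, which assumes a POSIX shell and uses made-up file names, creates two files whose names happen to match embl:* and counts how many arguments a command receives in each case.

```sh
# Work in an empty scratch directory so that the demo is reproducible.
cd "$(mktemp -d)" || exit 1
touch 'embl:fake1' 'embl:fake2'   # made-up files matching the pattern embl:*

# show_args: print the number of arguments it received.
show_args() { printf '%s\n' "$#"; }

unquoted=$(show_args embl:*)      # the shell replaces the pattern by file names
quoted=$(show_args "embl:*")      # the pattern is passed through untouched
escaped=$(show_args embl:\*)      # same effect with a backslash

echo "unquoted=$unquoted quoted=$quoted escaped=$escaped"
```

Only the quoted and the escaped forms deliver the single argument embl:* that an EMBOSS program expects. In some shells (bash, for instance) an unquoted pattern that matches no file is passed through unchanged, which is why forgetting the quotes sometimes appears to work.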

The full USA syntax

filename : a file containing one or more sequences

filename:entry : a given sequence in the file. The 'entry' is the ID or AC of the sequence in that file
    mysequences:opsd_xenla

filename:entry[start:end] : a part of the sequence can be specified by the range
    mysequences:opsd_xenla[1:20]*
    mysequences:opsd_xenla[-1:-20]* : the last 20 residues/nucleotides
    mysequences:[1:20:r]* : reverse-complemented (nucleotide sequences)

* In some Unix shells you might have to use backslashes: \[ and \]

SLIDE 26

Specifying Search Fields

Besides ID names or AC numbers there are other ways to specify sequences.
A typical sequence entry in EMBL format is:

ID   HSFAU      standard; DNA; UNC; 518 BP.
AC   X65923;
SV   X65923.1
DE   H.sapiens fau mRNA
KW   fau gene.
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
SQ   Sequence 518 BP; 125 A; 139 C; 148 G; 106 T; 0 other;

It is also useful to find sequences that contain given words in their description field (“DE” line), their keyword field (“KW” line), …

Specifying Search Fields

You can specify which field you are searching by using one of the following Search Field Names:

Name    Searches for
acc     Accession number
des     Description
id      ID name
key     Keyword
org     Organism Name
sv      Sequence Version/GI Number

Examples:

embl-des:fau : database
embl-des:h*emoglobin : database
myclones.seq:des:fau : file

SLIDE 27

The full USA syntax: summary

The full syntax of the possible USAs is shown below. On the original slide the mandatory parts were given in bold text; as stated earlier, only the file or database part is necessary.

asis :: Sequence [start : end : reverse]
format :: '@' ListFile [start : end : reverse]
format :: 'list' : ListFile [start : end : reverse]
format :: Database : Entry [start : end : reverse]
format :: Database - SearchField : Word [start : end : reverse]
format :: File : Entry [start : end : reverse]
format :: File : SearchField : Word [start : end : reverse]


SLIDE 28

Multiple sequences, single file

EMBOSS writes many sequences to a single file. Most sequence formats can deal with this: Fasta, EMBL, PIR, MSF, Clustal, Phylip, etc. BUT NOT: Plain, Staden and GCG

EMBOSS reads many sequences from a single file.

Use filename:entryname if you wish to specify a single sequence. If there is only one sequence, or you wish to read all entries, use just the filename.

Multiple sequences, many files

The command-line qualifier “-ossingle” may be useful: it allows you to write out several sequences, each to a separate file;

The name of the file is constructed from the ID name of the sequence and the extension of the file is the format:

Ex.: The sequence with the ID name “IXI_567” in fasta format would be written to the file “IXI_567.fasta”

% seqret "embl:hsf*" -ossingle

The program seqretsplit will split an existing multiple sequence file into many files.

SLIDE 29

Alignment output Formats

Several formats have been written or adopted for EMBOSS output:

http://emboss.sourceforge.net/docs/themes/AlignFormats.html

Multiple sequence alignment: fasta, msf, …
Pair-wise alignment: pair, score, …

Each program that writes an alignment has a default alignment format defined for that program. However, you can change the output format with the “-aformat” qualifier:

% water -aformat msf

Feature Formats

A feature is a region of interest in a specified nucleic or protein sequence. It has a specified start and end position. It has a name describing what type of thing it is. Ex.: Swiss-Prot feature table

FT   DISULFID      3     40
FT   DISULFID      4     32
FT   DISULFID     16     26
FT   VARIANT      22     22       P -> S (IN ISOFORM SI).
FT   VARIANT      25     25       L -> I (IN ISOFORM SI).

When reading or writing features associated with a sequence, a standard set of formats is used: UFO (Uniform Feature Object), e.g. Swiss-Prot (swissprot), EMBL (embl), PIR (pir), …
http://emboss.sourceforge.net/docs/themes/FeatureFormats.html

showfeat: useful for displaying features.
extractfeat: useful for extracting the sequences of features.

SLIDE 30

Feature Formats

Many programs will read in and use the feature table of an input sequence. Amongst these are diffseq, extractfeat, maskfeat, seqret, showfeat.

The feature table can already be part of the sequence (which is generally the case when you are reading the sequence from a database) or be located in a separate file.

Try the difference between:

% seqret sw:P15423 -osformat swiss -outseq stdout
% seqret sw:P15423 -osformat swiss -outseq stdout -feature

Report Formats

There are many ways in which the results of an analysis can be reported:

http://emboss.sourceforge.net/docs/themes/ReportFormats.html

########################################
# Program: garnier
# Rundate: Mon Feb 11 15:14:40 2002
# Report_file: report.dbmotif
########################################
#=======================================
#
# Sequence: 100K_RAT     from: 1   to: 889
# HitCount: 206
#
# DCH = 0, DCS = 0
#
# Please cite:
# Garnier, Osguthorpe and Robson (1978) J. Mol. Biol. 120:97-120
#---------------------------------------
#
# Residue totals: H:364   E:149   T:191   C:185
#        percent: H: 41.7 E: 17.1 T: 21.9 C: 21.2
#
#
#---------------------------------------

SLIDE 31

Report Formats

Many EMBOSS programs are now able to output their results in a standard report format. You can change the report format used by putting '-rformat name' on the command line, where 'name' is the name of one of the standard report formats.

Examples:

embl       Writes a report in EMBL feature table format
pir        Writes a report in PIR feature table format
swiss      Writes a report in SwissProt feature table format
excel      A TAB-delimited table format suitable for reading into spreadsheet programs such as Excel
seqtable   A simple table format that includes the feature sequence:
           Start End [tagnames] Sequence
           [start] [end] [tagvalues] [sequence]

Graphic Formats

-graph
output as X11 (default), PNG, ps, Tektronix, amongst others
CAVEAT: the default X11 output doesn't work with the X-terminal from a PC or Mac; use “ps” instead. ps files can be opened with the Unix program gs. Or save the output as PNG. PNG files can be opened from a PC or Mac.
Ex. % plotorf embl:xlrhodop
Plot potential open reading frames

SLIDE 32

Example: plotorf

Ex. % plotorf embl:xlrhodop -graph PNG

Plot potential open reading frames

Graphic Formats

[X]  x11          Output to an X-window
     postscript   Output to a postscript file (good for printing on a laser printer)
     cps          Output to a color postscript file
     text         Output to a text file
     data         Output XY data points to a file (good for importing into a graphing package)
[P]  png          Output to a PNG image file (good for web pages)
[X]  Tek          Output to a Tektronix terminal
[X]  xterm        Output to an Xterm window

[X] requires X-windows
[P] requires PNG support

The default filename is prog.format, e.g. octanol.ps

SLIDE 33

The file “stdout”

There is a “magic” filename that you can give whenever an output filename is requested: “stdout”;

If you enter this name, the resulting sequence will not go to a file called “stdout”, BUT it will be printed on the screen;

Useful when you wish to see the results immediately, or are testing various ways of running a program and wish to quickly see the results;

You can specify the format of the output to the screen with “format::stdout”: gcg::stdout

Reminder - help

If in doubt, use:

wossname
program -help -verbose
program -opt
tfm program : The friendly manual ;)

SLIDE 34

What EMBOSS does NOT

The major deficiencies in the EMBOSS package are:

BLAST, FASTA, ASSEMBLY (GelMerge, GelEnter,…)*, PAUP, sequence editor

Don't worry, those programs (and many more!) are also installed on the machines running EMBOSS in Lausanne & Basel (http://www.bc2.unibas.ch/BC2/programs/bc2soft):

BLAST package (type the blastall command)
FASTA package (fasta34,…)
PAUP (paup)
* Assembly: PHRED, PHRAP, Consed
Sequence editor: pico, emacs, vi

What EMBOSS does NOT (cont.)

Graphical interfaces for the missing programs:

BLAST:
SIB: http://www.expasy.org/tools/blast/
Swiss EMBnet: http://www.ch.embnet.org/software/BottomBLAST.html?
NCBI: http://www.ncbi.nlm.nih.gov/BLAST/

FASTA:
EBI: http://www.ebi.ac.uk/fasta33/

ClustalW:
Swiss EMBnet: http://www.ch.embnet.org/software/ClustalW.html

PAUP: no graphical interface, use Phylip instead (part of EMBOSS)

SLIDE 35

Send the output of a program as input for a second program

% seqret embl:xlrhodop\[110:1174\] -stdout -auto | transeq -filter

-stdout   Print output to stdout (the screen) instead of to a file.
-filter   Take input from stdin (keyboard) and output to stdout
-auto     Accept all the default settings and run without prompting the user

Combining EMBOSS with Perl

[Diagram: input -> program1 -> output -> parsed output/input -> program2 -> output, all embedded in a single Perl script]

SLIDE 36

SLIDE 37

Combining EMBOSS with Perl

How to run an external program from a Perl script?

1) The system function: launches a child process which runs a program:

system("seqret embl:xlrhodop -outseq xlrhodop.fasta -auto");

⇒ Perl waits for it to finish, then possibly grabs its output (here, by reading the file xlrhodop.fasta).

2) Processes as filehandles: launches a child process that stays alive, communicating via pipes with Perl until the task is complete:

open(SEQRET, "seqret embl:xlrhodop -auto -stdout |");
my @sequence = <SEQRET>;
…

⇒ The filehandle “contains” the output of the launched command. The -stdout qualifier is needed here: with -auto alone, seqret writes to its default output file and nothing reaches the pipe.

Combining EMBOSS with Perl: an example: cirdna

Draws circular maps of DNA constructs

Example of a cirdna input file:

Start 1001
End 4270

group
label
Block 1011 1362 3
ex1
endlabel
label
Tick 1610 8
EcoR1
endlabel
label
Block 1647 1815 1
endlabel
label
Tick 2459 8
BamH1
endlabel
label
Block 4139 4258 3
ex2
endlabel
endgroup

group
label
Range 2541 2812 [ ] 5
Alu
endlabel
label
Range 3322 3497 > < 5
MER13
endlabel
endgroup

SLIDE 38

########################################
# Program: restrict
# Rundate: Sat Feb 26 21:53:07 2005
# Report_format: table
# Report_file: ab014641.restrict
########################################
#=======================================
#
# Sequence: AB014641     from: 1   to: 3417
# HitCount: 9
#
# Minimum cuts per enzyme: 1
# Maximum cuts per enzyme: 1
# Minimum length of recognition site: 4
# Blunt ends allowed
# Sticky ends allowed
# DNA is circular
# Ambiguities allowed
#
#=======================================
  Start     End Enzyme_name Restriction_site 5prime 3prime 5primerev 3primerev
      1       8 NotI        GCGGCCGC              2      6         .         .
    556     561 PvuI        CGATCG              559    557         .         .
    856     861 Eco31I      GGTCTC              862    866         .         .
   1335    1331 BcefI       ACGGC              1317   1318         .         .
   1998    2003 XhoI        CTCGAG             1998   2002         .         .
   2170    2175 HindII      GTYRAC             2172   2172         .         .
   2680    2685 BclI        TGATCA             2680   2684         .         .
   2918    2923 BamHI       GGATCC             2918   2922         .         .
   3412    3417 HindIII     AAGCTT             3412   3416         .         .
#---------------------------------------
#---------------------------------------

restrict

Finds restriction enzyme cleavage sites

Cloning vector pGEX-PUC-3T

restrict parsed output

Start 1
End 3417

group
label
Tick 1 8
NotI
endlabel
label
Tick 556 8
PvuI
endlabel
label
Tick 856 8
Eco31I
endlabel
(...)
label
Tick 2918 8
BamHI
endlabel
label
Tick 3412 8
HindIII
endlabel
endgroup

SLIDE 39

cirdna.pl

#!/usr/local/bin/perl
# Perl script to:
# 1. parse the output of the EMBOSS program "restrict", which uses the
#    REBASE database of restriction enzymes to predict cut sites in a DNA sequence;
# 2. write an input file for the EMBOSS program "cirdna";
# 3. run the "cirdna" program, which draws circular maps of DNA constructs.
use strict;
use warnings;

# If no filename is specified on the command line, print USAGE and exit.
# The file should contain a DNA sequence in one of the EMBOSS recognized formats.
my $USAGE = "SYNOPSIS $0 filename\n";
unless (@ARGV) {
    print $USAGE;
    exit;
}

# Read in the filename from the argument on the command line
my $filename = $ARGV[0];

cirdna.pl

# Write the parsed output of the "restrict" program to a file
my $inputfile = "$filename.restrict";
unless (open(OUT, ">$inputfile")) {
    print "Could not open file $inputfile to write to!\n\n";
    exit;
}

SLIDE 40

cirdna.pl

# Launch the EMBOSS program "restrict" to do a restriction analysis of the DNA.
# The maximum number of cuts for any restriction enzyme that will be
# considered is set to 1.
# A list of enzymes is provided and we specify that the DNA is circular.
# The command is:
# "restrict filename -enzymes NotI,PvuI -max=1 -plasmid=Y -auto -stdout"
# (the whole command must stay on one line inside the string)
open(RESTRICT, "restrict $filename -enzymes NotI,PvuI -max=1 -plasmid=Y -auto -stdout |")
    or die "Could not run restrict: $!\n";

cirdna.pl

# Read the output in a "while" loop and extract the start and the end
# of the DNA sequence, and the start, the end and the name of each
# restriction enzyme site
my $line;
while ($line = <RESTRICT>) {
    chomp($line);
    # Regular expression to match the start and the end of the sequence
    if ($line =~ /^#\sSequence.*from:\s+(\d+).*to:\s+(\d+)/) {
        my ($seqstart, $seqend) = ($1, $2);
        # Print the DNA start and end in the input file
        print OUT "Start\t$seqstart\nEnd\t$seqend\n\ngroup\n";
    }
    # Regular expression to match start, end and name of the enzyme
    if ($line =~ /^\s+([0-9]+)\s+([0-9]+)\s+(\w+)\s+/) {
        my ($start, $end, $enzyme) = ($1, $2, $3);
        print OUT "label\nTick\t$start\t8\n$enzyme\nendlabel\n";
    }
}
print OUT "endgroup";
close(RESTRICT);
close(OUT);

SLIDE 41

cirdna.pl

# Launch the EMBOSS program "cirdna" to draw a circular map of the DNA.
# The command is: "cirdna inputfile -goutfile outputfile -graph ps -auto"
my $cmd = "cirdna $inputfile -goutfile $filename -graph ps -auto";
system($cmd);
exit;
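The parsing step of cirdna.pl can also be tried directly at the Unix prompt. The following is a minimal awk sketch of the same transformation, assuming the table report layout of restrict shown on the earlier slide; the function name restrict_to_cirdna is hypothetical, and the sketch is an alternative illustration, not a replacement for the Perl script.

```sh
# restrict_to_cirdna: read a "restrict" table report on standard input and
# write cirdna input on standard output. Hypothetical helper that mirrors
# the two regular expressions of cirdna.pl.
restrict_to_cirdna() {
    awk '
        # Report header, e.g. "# Sequence: AB014641 from: 1 to: 3417"
        /^# Sequence:/ {
            for (i = 1; i < NF; i++) {
                if ($i == "from:") first = $(i + 1)
                if ($i == "to:")   last  = $(i + 1)
            }
            printf "Start\t%s\nEnd\t%s\n\ngroup\n", first, last
            next
        }
        # Hit line: two integers (start, end) followed by the enzyme name
        NF >= 3 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
            printf "label\nTick\t%s\t8\n%s\nendlabel\n", $1, $3
        }
        END { print "endgroup" }
    '
}

# Demo on a few lines copied from the restrict report shown earlier.
restrict_to_cirdna <<'EOF'
# Sequence: AB014641     from: 1   to: 3417
# HitCount: 9
  Start     End Enzyme_name Restriction_site
      1       8 NotI        GCGGCCGC
    556     561 PvuI        CGATCG
EOF
```

With EMBOSS installed, the function could sit at the end of a pipe that starts with the restrict command used in the script (with -auto -stdout), and its output could be saved to a file for cirdna.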

The parsed restrict output (same content as on slide 38) is written to the file pGEX-PUC-3T.restrict

SLIDE 42

cirdna.pl output

Output file: pGEX-PUC-3T.ps (figure not reproduced)

Related program: lindna

SLIDE 43

Reminder if you are lost

If in doubt, use:

wossname
program -help -verbose
program -opt
tfm program : The friendly manual ;)

References

http://emboss.sourceforge.net/
UK HGMP Resource Centre, Userguide, 2002
www.perl.org