EMBnet Course: Unix & PERL for biomedical researchers
Lausanne, 2 March 2005
Lorenza Bordoli
Swiss Institute of Bioinformatics
Outline
- What is EMBOSS?
- Major programs
- Running EMBOSS Programs from the
Unix command line
- Combining EMBOSS with Perl
Outline What is EMBOSS? Major programs Running EMBOSS Programs - - PDF document
EMBnet Course: Unix & PERL for biomedical researchers Lausanne, 2 March 2005 Lorenza Bordoli Swiss Institute of Bioinformatics Outline What is EMBOSS? Major programs Running EMBOSS Programs from the Unix command line
Lausanne, 2 March 2005
Lorenza Bordoli
Swiss Institute of Bioinformatics
History:
founded in 1982 as a service of the Department of Genetics at the University of Wisconsin;
algorithmically verified and adapted to needs);
started as a small collection of programs to support EMBL's research activities, in particular the development of automated DNA sequencing;
distribute academic software source code which uses the GCG libraries!
February 2005: version 2.9.0
a package of high-quality FREE Open Source software for sequence analysis;
interface, but each comes with its own documentation:
http://emboss.sourceforge.net/apps/#Overview
but no interface, requires local databases Unix command-line only
Jemboss, www2gcg, w2h, wEMBOSS… (with account) Pise, EMBOSS-GUI, SRS (no account) Staden, Kaptain, CoLiMate, Jemboss (local)
Bioinformatics Divisions of the RFCGR (the current home of EMBOSS) in the middle of 2005. The MRC Press Office has stated: "All MRC can say at this stage is that Council have made a decision to close the Research and Bioinformatics Divisions. However, the Director has been asked to draw up a closing down plan for consideration by Council in July."
therefore adversely affect the development and support of EMBOSS. We hope that alternative sources of funding can be found.”
that have a specific function.
important to the function and produce output in the form of files, plots, web pages or simple text output.
EMBOSS programs are run by:
Remote server: sib-dea.unil.ch Local computer: your PC in the lab, in the course room,… Remote server: you personal account
EMBOSS programs are run by:
Local computer Remote server: sib-dea.unil.ch
EMBOSS programs are run by:
embl:m93650
Local computer Remote server: sib-dea.unil.ch
Access to EMBOSS: graphical interfaces http://emboss.ch.embnet.org
http://www.bc2.unibas.ch/
Jemboss/wEMBOSS: unibasel account needed
wossname lists all EMBOSS programs showdb shows the available databases
seqret retrieves and/or changes format of a sequence seqretset retrieve and or change formats of a number seqretall
transeq translate a DNA sequence to protein backtranseq translate a protein sequence to DNA extractseq extract regions from a sequence cutseq remove a region from a sequence pasteseq inserts a sequence into another sequence infoseq display information about a sequence splitter split a sequence into smaller sequences
needle Needleman-Wusch sequence alignment water Smith-Waterman sequence alignment stretcher Myers and Miller global alignment matcher Huang and Miller local alignment dottup dotmatcher dotplot comparisons of two sequences. prettyplot plots multiple sequence alignments polydot supermatcher dotplot comparisons of multiple sequences emma ClustalW program
cusp generates a codon usage table syco synonymous codon usage plot dan calculates DNA/RNA melting temperature compseq sequence composition tables
remap restriction map of the sequence cpgplot cpgreport CpG island detection etandem einverted finds tandem and inverted repeats plotorf plots potential ORFs showorf pretty display of potential ORFs fuzznuc DNA pattern search tfscan scans sequence for TF binding sites
ief Isoelectric point calculation antigenic Finds potential antigenic sites digest protein digestion map findkm Vmax and Km calculations fuzzpro protein pattern search garnier protein 2D structure prediction helixturnhelix finds nucleic acid binding motifs
pepwindow displays protein hydropathy patmatdb patmatmotifs searching with motifs vs protein sequences pepcoil predicts coiled coil regions pepinfo pepstats Protein information Hammer package ehmmpfam, ehmmsearch, ehmmbuild, … Phylip package efitch, edolpenny, edollop, …
format::database:entry format::file:entry In general, a USA specifies
format::database:entry
needs a hint) * ;
* EMBOSS recognizes: GCG, FASTA, ClustalW, MSF, EMBL, GenBank, DNAStrider, Phylip, PIR, PAUP,ASN.1, NBRF, Fitch, IntelliGenetics
The most common ways of specifying a sequence are:
either the ID or the accession number (AC) of the sequence in the database Ex.: database:accession embl:X65923 database:ID swissprot:opsd_xenla file name myfile.seq
databases have two such identifiers for each sequence - an ID name and an Accession number.
function of its sequence: OPSD_HUMAN in Swiss-Prot !! ID names are not guaranteed to remain the same between different versions
remain with that sequence through the rest of the life of the database:
get a new Accession number and the Accession numbers of the merged sequences will be retained as 'secondary' Accession numbers.
You can easily find out what are the database name in your EMBOSS installation by running the showdb program: Unix % showdb Displays information on the currently available databases #Name Type ID Qry All Comment #==== ==== == === === ======= sw P OK OK OK Swiss-Prot section of UniProt swiss P OK OK OK Swiss-Prot section of UniProt swiss-prot P OK OK OK Swiss-Prot section of UniProt trembl P OK OK OK TrEMBL section of UniProt uniprot P OK OK OK UniProt (Swiss-Prot & TrEMBL), …
#Name Type ID Qry All Comment #==== ==== == === === ======= sw P OK OK OK Swiss-Prot section of UniProt
embl:x13776 ;
names: sw:opsd_* ;
;
filename all sequences in a file filename:entry an entry in a file dbname all sequences in a database (not recommended) dbname:entry a sequence in a database @listfile a list file list::listfile a list file asis::sequence a specific short sequence
“references” to sequences using any valid USA.
: the name of a sequence file; sw:opsd_xenla : a specific sequence in the Swiss-Prot database; @anotherlist : the name of a second list file;
add your comments: this won’t be read by the programs.
“references” to sequences using any valid USA.
: the name of a sequence file; sw:opsd_xenla : a specific sequence in the Swiss-Prot database; @anotherlist : the name of a second list file;
add your comments: this won’t be read by the programs.
immediately without it having to be in a file or a database.
asis::atgctagcttagc : for the sequence “atgctagcttagc” ;
>xyz some other comment ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg cccagatcaaggctcatgtagcctcactggagggcatt xyz: ID name
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
(swissprot/swiss/sw), GCG (gcg), MSF (msf), Genbank (genbank), raw,…
…
either:
which the user is asked (prompted) to enter a piece of information one at a time;
in which the user is asked (prompted) to enter a piece of information one at a time: % seqret Reads and writes (returns) a sequence Input sequence: embl:xlrhodop Output sequence [xlrhodop.fasta]: % more xlrhodop.fasta >XLRHODOP L07770 Xenopus laevis rhodopsin ggtagaacagcttcagttgggatcacaggcttctagggatccttt gggcaaaaaagaaac acagaaggcattctttctatacaagaaaggactttatagagctgc taccatgaacggaac
in which the user is asked (prompted) to enter a piece of information one at a time: % wossname Finds programs by keywords in their one-line documentation Keyword to search for: restrict SEARCH FOR 'RESTRICT’ recode Remove restriction sites but maintain the same translation remap Display a sequence with restriction cut sites, translation Etc…
sequence
% seqret -sequence filename.seq -outseq xlrhodop.fasta programname parameter/object: here: a sequence file qualifier: specify the properties of that object/parameter. here: input/output sequence
prompt for them; ex: sequence file name;
(-opt) qualifier; ex: begin and end of the sequence;
prompted);
prompt for them; ex: sequence file name;
qualifier; ex: begin and end of the sequence;
prompted);
% Programname -help
% wossname –help (-verbose) Mandatory qualifiers: [-search] string Enter a word or words here. Optional qualifiers:
Output file name Advanced qualifiers:
EMBOSS program documentation will be searched.
Prints a summary of the options the program can take. With
Prompt the user for the optional parameters
Accept all the default settings and run without prompting the user (used when running program scripts)
Ask for the start, end and reverse of the sequence input
Print output to stdout (the screen) instead of to a file.
Take input from stdin (keyboard) and output to stdout
Report warnings
Report errors
% seqret embl:xlrhodop -outseq xlrhodop.fasta % seqret embl:xlrhodop xlrhodop.fasta
% seqret -help -verbose
% seqret embl:xlrhodop –osformat gcg -outseq xlrhodop.gcg % seqret embl:xlrhodop xlrhodop.gcg –osformat gcg
% seqret embl:xlrhodop gcg::xlrhodop.gcg Unix % more xlrhodop.gcg !!NA_SEQUENCE 1.0 Xenopus laevis rhodopsin mRNA, complete cds. XLRHODOP Length: 1684 Type: N Check: 9453 .. 1 ggtagaacag cttcagttgg gatcacaggc ttctagggat cctttgggca 51 aaaaagaaac acagaaggca ttctttctat acaagaaagg actttataga
Can you find a program to: Display multiple alignments. Find ORFs (Open Reading Frames). Translate a sequence. Find restriction enzyme sites Find the isoelectric point of a protein. Do global alignments.
format::database:entry In general, a USA specifies
#Name Type ID Qry All Comment #==== ==== == === === ======= sw P OK OK OK Swiss-Prot section of UniProt
embl:x13776 ;
names: sw:opsd_* ;
;
"embl:*" embl:\*
% seqret “sw:*_HUMAN” Human.seq
filename : a file containing one or more sequences filename:entry : a given sequence in the file. The ‘entry’ mysequences:opsd_xenla is the ID or AC of the sequence in that file filename:entry[start:end] : a part of the sequence can be specified mysequences:opsd_xenla[1:20]* by the range mysequences:opsd_xenla[-1:-20]* : the last 20 residues/nucleotides mysequences:[1:20:r]* : reverse-complemented (nucleotide sequences) * In some Unix shell you might have to use backslash: \[ and \[
ID HSFAU standard; DNA; UNC; 518 BP. AC X65923; SV X65923.1 DE H.sapiens fau mRNA KW fau gene. OS Homo sapiens (human) OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. SQ Sequence 518 BP; 125 A; 139 C; 148 G; 106 T; 0 other;
filed (“DE” line), their Keyword field (“KE” line), …
Field Names:
Name Searches for acc Accession number des Description id ID name key Keyword
Organism Name sv Sequence Version/GI Number
embl-des:fau : database embl-des:h*emoglobin : database myclones.seq:des:fau : file
Mandatory parts of the USAs are given in bold text. asis :: Sequence [start : end : reverse] format :: '@' ListFile [start : end : reverse] format :: 'list' : ListFile [start : end : reverse] format :: Database : Entry [start : end : reverse] format :: Database - SearchField : Word [start : end : reverse] format :: File : Entry [start : end : reverse] format :: File : SearchField : Word [start : end : reverse]
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
(swissprot/swiss/sw), GCG (gcg), MSF (msf), raw,…
deal with this: Fasta, EMBL, PIR, MSF, Clustal, Phylip, etc. BUT NOT: Plain, Staden and GCG
Use filename:entryname if you wish to specify a single sequence. If there is only one sequence, or you wish to read all entries, use just the filename.
several sequences, but it writes out each sequence to a separate file;
extension of the file is the format:
the file “IXI_567.fasta” % seqret “embl:hsf*” –ossingle
files.
http://emboss.sourceforge.net/docs/themes/AlignFormats.html
that program. However you can change the output formats with the “ –aformat” qualifier: % water –aformat msf
specified start and end position. It has a name describing what type of thing it is: Ex: Swiss-Prot Feature table
FT DISULFID 3 40 FT DISULFID 4 32 FT DISULFID 16 26 FT VARIANT 22 22 P -> S (IN ISOFORM SI). FT VARIANT 25 25 L -> I (IN ISOFORM SI).
standard set of formats that are used: UFO (Universal Feature Object) e.g. Swiss- Prot (swissprot), EMBL (embl), PIR (pir),… http://emboss.sourceforge.net/docs/themes/FeatureFormats.html
Amongst these are diffseq, extractfeat, maskfeat, seqret, showfeat
case when you are reading the sequence from a database) or located in a separate file
% seqret sw:P15423 -osformat swiss –outseq stdout % seqret sw:P15423 -osformat swiss –outseq stdout -feature
http://emboss.sourceforge.net/docs/themes/ReportFormats.html
######################################## # Program: garnier # Rundate: Mon Feb 11 15:14:40 2002 # Report_file: report.dbmotif ######################################## #======================================= # # Sequence: 100K_RAT from: 1 to: 889 # HitCount: 206 # # DCH = 0, DCS = 0 # # Please cite: # Garnier, Osguthorpe and Robson (1978) J. Mol. Biol. 120:97-120 #--------------------------------------- # # Residue totals: H:364 E:149 T:191 C:185 # percent: H: 41.7 E: 17.1 T: 21.9 C: 21.2 # # #---------------------------------------
format - you can change the report format used by putting '-rformat name' on the command-line, where 'name' is the name of one of the standard report formats.
embl Writes a report in EMBL feature table format pir Writes a report in PIR feature table format swiss Writes a report in SwissProt feature table format excel This is a TAB-delimited table format suitable for reading into spread-sheet programs such as Excel. seqtable A simple table format that includes the feature sequence Start End [tagnames] Sequence [start] [end] [tagvalues] [sequence]
use “ps” instead. ps files can be opened with the Unix program gs. Or save the
Plot potential open reading frames
[X] x11 Output to an X-window postscript Output to a postscript file (good for printing on a laser printer) cps Output to a color postscript file text Output to a text file data Output XY data points to a file. (good for importing into a graphing package) [P] png Output to a PNG image file (good for web pages) [X] Tek Output to tektronics terminal [X] xterm Output to an Xterm window [X]- requires X-windows [P] – requires PNG support The default filename is prog.format eg. octanol.ps
request: “stdout”;
BUT it will be printed on the screen;
running a program and wish to quickly see the results
wossname program –help -verbose program -opt tfm program The friendly manual ;)
BLAST, FASTA, ASSEMBLY (GelMerge, GelEnter,…)* , PAUP, sequence editor
machines running EMBOSS in Lausanne & Basel (http://www.bc2.unibas.ch/BC2/programs/bc2soft)
BLAST, FASTA, ASSEMBLY (GelMerge, GelEnter,…) , PAUP, sequence editor
EBI: http://www.ebi.ac.uk/fasta33/
% seqret embl:xlrhodop\[110:1174\] -stdout -auto | transeq –filter
Print output to stdout (the screen) instead of to a file.
Take input from stdin (keyboard) and output to stdout
Accept all the default settings and run without prompting the user
input
input
Parsed
input
Parsed
system (“seqret embl:xlrhodop –outseq xlrhodop.fasta –auto”);
my @sequence = <SEQRET>; …
Start 1001 End 4270 group label Block 1011 1362 3 ex1 endlabel label Tick 1610 8 EcoR1 endlabel label Block 1647 1815 1 endlabel label Tick 2459 8 BamH1 endlabel label Block 4139 4258 3 ex2 endlabel endgroup group label Range 2541 2812 [ ] 5 Alu endlabel label Range 3322 3497 > < 5 MER13 endlabel endgroup
######################################## # Program: restrict # Rundate: Sat Feb 26 21:53:07 2005 # Report_format: table # Report_file: ab014641.restrict ######################################## #======================================= # # Sequence: AB014641 from: 1 to: 3417 # HitCount: 9 # # Minimum cuts per enzyme: 1 # Maximum cuts per enzyme: 1 # Minimum length of recognition site: 4 # Blunt ends allowed # Sticky ends allowed # DNA is circular # Ambiguities allowed # #======================================= Start End Enzyme_name Restriction_site 5prime 3prime 5primerev 3primerev 1 8 NotI GCGGCCGC 2 6 . . 556 561 PvuI CGATCG 559 557 . . 856 861 Eco31I GGTCTC 862 866 . . 1335 1331 BcefI ACGGC 1317 1318 . . 1998 2003 XhoI CTCGAG 1998 2002 . . 2170 2175 HindII GTYRAC 2172 2172 . . 2680 2685 BclI TGATCA 2680 2684 . . 2918 2923 BamHI GGATCC 2918 2922 . . 3412 3417 HindIII AAGCTT 3412 3416 . . #--------------------------------------- #---------------------------------------
Cloning vector pGEX-PUC-3T
Start 1 End 3417 group label Tick 1 8 NotI endlabel label Tick 556 8 PvuI endlabel label Tick 856 8 Eco31I Endlabel (...) label Tick 2918 8 BamHI endlabel label Tick 3412 8 HindIII endlabel endgroup
#!/usr/local/bin/perl #perl script to 1. parse the output of the EMBOSS program "restrict" which uses #the REBASE database of restriction enzymes to predict cut sites in a DNA sequence. #2. write a input file for the EMBOSS program "cirdna" and 3. run the "cirdna" #program which draws circular maps od DNA construct use strict; use warnings; #if no filename is specified on the command-line print USAGE and exit #the file should contain a DNA sequence in one of the EMBOSS recognized formats my $USAGE=" SYNOPSIS $0 filename\n"; unless (@ARGV){ print $USAGE ; exit; } #Read in the filename from the argument on the command line my $filename=$ARGV[0];
#write the parsed output of the "restrict program" in a file my $inputfile = "$filename.restrict"; unless (open(OUT, ">$inputfile")){ print " Could not open file $inputfile! to write to !! \n\n"; exit; }
#launch the emboss program "restrict" to do a restriction analysis of #the DNA #the maximum number of cuts for any restriction enzyme that will be #considered is set to 1 #a list of enzymes is provided and we specify that the DNA is circular #the command is: #"restrict filename -enzymes HindIII,XhoI -max=1 -plasmid=Y -auto #-stdout"
#Read the output in a "while" loop and extract the start and the #beginning of the DNA sequence #and the start, the end and the name of the restriction enzyme my $line; while ($line = <RESTRICT> ) { chomp($line); #regular expression to match start and the end of the sequence if ($line =~ /^#\sSequence.*from:\s+(\d+).*to:\s+(\d+)/){ my ($seqstart, $seqend) = ($1, $2); #print the DNA start and the beginning in the input file print OUT "Start\t$seqstart\nEnd\t$seqend\n\ngroup\n"; } #regular expression to match start, end and name of the enzyme if ($line =~ /^\s+([0-9]+)\s+([0-9]+)\s+(\w+)\s+/){ my ($start,$end,$enzyme) = ($1,$2,$3); print OUT "label\nTick\t$start\t8\n$enzyme\nendlabel\n"; } } print OUT "endgroup"; close (OUT);
#launch the emboss program "cirdna" to draw a circular maps od DNA #the command is: "cirdna inputfile –goutfile outputfile -graph ps -auto" my $cmd="cirdna $inputfile -goutfile $filename -graph ps -auto"; system ("$cmd"); exit;
Start 1 End 3417 group label Tick 1 8 NotI endlabel label Tick 556 8 PvuI endlabel label Tick 856 8 Eco31I Endlabel (...) label Tick 2918 8 BamHI endlabel label Tick 3412 8 HindIII endlabel endgroup
wossname program –help -verbose program -opt tfm program The friendly manual ;)