Outline Combining different programs with Perl: writing a short - - PDF document

outline
SMART_READER_LITE
LIVE PREVIEW

Outline Combining different programs with Perl: writing a short - - PDF document

EMBnet Course: PERL for biomedical researchers Command line tools scripting Basel, 11 September 2008 Lorenza Bordoli Swiss Institute of Bioinformatics Outline Combining different programs with Perl: writing a short Pipeline in Perl


slide-1
SLIDE 1

EMBnet Course: PERL for biomedical researchers Command line tools scripting

Basel, 11 September 2008 Lorenza Bordoli

Swiss Institute of Bioinformatics

Lorenza Bordoli 11 September 2008

Outline

  • Combining different programs with Perl:

writing a short Pipeline in Perl

  • UniProt and its controlled vocabulary
  • Swissknife library
  • Overview of the programs of the pipeline:

Blast, seqret (EMBOSS), Clustalw, Tree- Puzzle

  • Details of the Perl script
slide-2
SLIDE 2

Lorenza Bordoli 11 September 2008

embedded in a single Perl script

Combining different programs with Perl

program1

input

  • utput

program2 program3

  • utput
  • utput

Lorenza Bordoli 11 September 2008

www.bc2.unibas.ch www.bc2.unibas.ch

slide-3
SLIDE 3

Protein sequence DB Swiss-Prot Mycobacterium tuberculosis Mycobacterium tuberculosis (MT) Kinase domain

MT protein sequences in multi FASTA file

Swiss-Prot (H. Sapiens)

Blast + seqret

MT protein + human homologous sequences in multi FASTA file

swissknife ClustalW

Multiple sequence Alignment of MT and Human Kinase homologues Phylogenetic tree

Tree-Puzzle swissknife

UniProt

www.uniprot.org www.uniprot.org

slide-4
SLIDE 4

Lorenza Bordoli 11 September 2008

UniProt

  • The UniProt Knowledgebase (UniProtKB)

provides the central database of protein sequences with accurate, consistent, rich sequence and functional annotation.

  • The UniProt Knowledgebase consists of two

sections:

– Swiss-Prot - a section containing manually-annotated records with information extracted from literature and curator-evaluated computational analysis, and – TrEMBL - a section with computationally analyzed records that await full manual annotation.

Lorenza Bordoli 11 September 2008

TrEMBL

  • TrEMBL is the computer-annotated section of the UniProt
  • Knowledgebase. It contains translations of all coding regions in the

DDBJ/EMBL/GenBank nucleotide databases, and protein sequences extracted from the literature or submitted to UniProtKB, which are not yet integrated into Swiss-Prot.

  • TrEMBL allows these sequences to be made publicly available

quickly without diluting the high quality annotation found in Swiss- Prot.

  • The information in a TrEMBL entry is initially derived directly from

the underlying DDBJ/EMBL/GenBank nucleotide entry and the quality of data is directly dependent on the information provided by the submitter of the nucleotide entry. This information may be enhanced later by automatic annotation procedures but if not, it remains as provided by the submitter until the entry is manually annotated and added to Swiss-Prot.

slide-5
SLIDE 5

Lorenza Bordoli 11 September 2008

Swiss-Prot

  • Swiss-Prot is an annotated protein sequence database. It was

established in 1986 and maintained collaboratively, since 1987, by the group of Amos Bairoch first at the Department of Medical Biochemistry of the University of Geneva and now at the Swiss Institute of Bioinformatics (SIB) and the EMBL Data Library (now the EMBL Outstation - The European Bioinformatics Institute (EBI)). The Swiss-Prot Protein Knowledgebase consists of sequence

  • entries. Sequence entries are composed of different line types, each

with their own format.

  • Swiss-Prot distinguishes itself by four distinct criteria:
  • 1. Annotations
  • 2. Minimal redundancy
  • 3. Integration with other databases
  • 4. Documentation

Lorenza Bordoli 11 September 2008

Swiss-Prot – 1. Annotations

In Swiss-Prot, as in many sequence databases, two classes of data can be distinguished: the core data and the annotation: 1.For each sequence entry the core data consists of:

  • The sequence data;
  • The citation information (bibliographical references);
  • The taxonomic data (description of the biological source of the protein).

2.The annotation consists of the description of the following items:

  • Function(s) of the protein;
  • Posttranslational modification(s) such as carbohydrates,

phosphorylation, acetylation and GPI-anchor;

  • Domains and sites, for example, calcium-binding regions, ATP-binding

sites, zinc fingers, homeoboxes, SH2 and SH3 domains and kringle;

  • Secondary structure, e.g. alpha helix, beta sheet;
  • Quaternary structure, i.g. homodimer, heterotrimer, etc.;
  • Similarities to other proteins;
  • Disease(s) associated with any number of deficiencies in the protein;
  • Sequence conflicts, variants, etc.
slide-6
SLIDE 6

UniProt – Structure of a sequence entry

Each sequence entry is composed of lines. Different types of lines, each with their own format, are used to record the various data that make up the entry:

ID GRAA_HUMAN Reviewed; 262 AA. AC P12544; Q6IB36; DT 01-OCT-1989, integrated into UniProtKB/Swiss-Prot. DT 01-OCT-1989, sequence version 1. DT 10-JUN-2008, entry version 103. DE RecName: Full=Granzyme A; DE EC=3.4.21.78; DE AltName: Full=Granzyme-1; DE AltName: Full=Cytotoxic T-lymphocyte proteinase 1; DE AltName: Full=Hanukkah factor; DE Short=H factor; DE Short=HF; DE AltName: Full=CTL tryptase; DE AltName: Full=Fragmentin-1; DE Flags: Precursor; GN Name=GZMA; Synonyms=CTLA3, HFSP; OS Homo sapiens (Human). OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; OC Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; OC Catarrhini; Hominidae; Homo. OX NCBI_TaxID=9606; RN [1] RP NUCLEOTIDE SEQUENCE [MRNA]. […]

UniProt – Sequence entry lines

http://www.expasy.org/sprot/userman.html http://www.expasy.org/sprot/userman.html

slide-7
SLIDE 7

Lorenza Bordoli 11 September 2008

UniProt – Sequence entry

  • The entries in the UniProt Knowledgebase are

structured so as to be usable by human readers as well as by computer programs.

  • The explanations, descriptions, classifications

and other comments are in ordinary English.

  • Wherever possible, symbols familiar to

biochemists, protein chemists and molecular biologists are used.

  • Example: http://www.uniprot.org/uniprot/P65726

Lorenza Bordoli 11 September 2008

Swiss-Prot – Swissknife

  • You can write your parser to extract

information from the UniProt database or use:

  • The Swissknife, an object-oriented Perl

library to handle Swiss-Prot entries

  • http://swissknife.sourceforge.net/docs/
slide-8
SLIDE 8

Swiss-Prot – Swissknife

SWISS::Entry

  • Main module to handle SWISS-PROT entries. One Entry
  • bject represents one SWISS-PROT entry and provides

an API for its modification. use SWISS::Entry; # Read an entire record at a time $/ = "\n//\n"; while (<>){ $entry = SWISS::Entry->fromText($_); print $entry->AC, "\n"; }

Swiss-Prot – Swissknife

use SWISS::Entry; use SWISS::OCs; # Read an entire record at a time local $/ = "\n//\n"; while (<>){ # Read the entry my $entry = SWISS::Entry->fromText($_); # Print the primary accession number of each entry. print $entry->AC, ":\n"; #Print the multiple organism classification lines of each #entries my @OC = $entry->OCs->elements(); foreach my $oc (@OC){ print "$oc\t"; } print "\n\n"; }

slide-9
SLIDE 9

Swiss-Prot – Swissknife

use SWISS::Entry; use SWISS::OCs; use SWISS::FTs;

[…] #Print the FT lines of type "domain" of the entry foreach my $ft ( $entry->FTs->get('DOMAIN') ) { my $FTkey = $$ft[0]; my $FTfrom = scalar $$ft[1]; my $FTto = scalar $$ft[2]; my $FTdes = $$ft[3]; print "FT: $FTdes $FTkey from: $FTfrom to:$FTto \n"; }

Protein sequence DB Swiss-Prot Mycobacterium tuberculosis Mycobacterium tuberculosis (MT) Kinase domain

MT protein sequences in multi FASTA file

Swiss-Prot (H. Sapiens)

Blast + seqret

MT protein + human homologous sequences in multi FASTA file

swissknife ClustalW

Multiple sequence Alignment of MT and Human Kinase homologues Phylogenetic tree

Tree-Puzzle swissknife

slide-10
SLIDE 10

Blast

$blastall -p blastp -d sprot -i sequence.txt -m 9

blastall 2.2.16 arguments:

  • p Program Name [String]
  • d Database [String] default = nr
  • i Query File [File In] default = stdin
  • e Expectation value (E) [Real] default = 10.0
  • m alignment view options:

0 = pairwise, 1 = query-anchored showing identities, 2 = query-anchored no identities, 3 = flat query-anchored, show identities, 4 = flat query-anchored, no identities, 5 = query-anchored no identities and blunt ends, 6 = flat query-anchored, no identities and blunt ends, 7 = XML Blast output, 8 = tabular, 9 tabular with comment lines 10 ASN, text 11 ASN, binary [Integer] […] and more options .

Lorenza Bordoli 11 September 2008

$blastall -p blastp -d sprot -i sequence.txt -m 9

Program Query Database

blastp protein protein blastn nucleotide nucleotide blastx protein nucleotide protein tblastn protein protein nucleotide tblastx protein nucleotide protein nucleotide

VS VS VS VS VS

Blast

slide-11
SLIDE 11

Lorenza Bordoli 11 September 2008

Blast

$blastall -p blastp -d sprot -i sequence.txt -m 9

Available BC2 databases: CCDS_nucleotide CCDS_protein adna aprot est_human est_mouse est_others nr nr70 nr90 nt pdbaa pdbnt sprot trembl uniref100 uniref50 uniref90 ydna yprot Or you can specify your own database (see this afternoon)

Lorenza Bordoli 11 September 2008

Blast

$blastall -p blastp -d sprot -i sequence.txt -m 9 Sequence in the input file in FASTA format: >PKNA_MYCTU P65726 RecName: Full=Probable serine/threonine- protein kinase pknA; EC=2.7.11.1; MSPRVGVTLSGRYRLQRLIATGGMGQVWEAVDNRLGRRVAVKVLKSEFSSDPEFIERFRA EARTTAMLNHPGIASVHDYGESQMNGEGRTAYLVMELVNGEPLNSVLKRTGRLSLRHALD MLEQTGRALQIAHAAGLVHRDVKPGNILITPTGQVKITDFGIAKAVDAAPVTQTGMVMGT AQYIAPEQALGHDASPASDVYSLGVVGYEAVSGKRPFAGDGALTVAMKHIKEPPPPLPPD LPPNVRELIEITLVKNPAMRYRSGGPFADAVAAVRAGRRPPRPSQTPPPGRAAPAAIPSG TTARVAANSAGRTAASRRSRPATGGHRPPRRTFSSGQRALLWAAGVLGALAIIIAVLLVI KAPGDNSPQQAPTPTVTTTGNPPASNTGGTDASPRLNWTERGETRHSGLQSWVVPPTPHS RASLARYEIAQ

slide-12
SLIDE 12

Lorenza Bordoli 11 September 2008

Blast

$blastall -p blastp -d sprot -i sequence.txt -m 9 9: Output of Blast in tabular format with comments:

# BLASTP 2.2.16 [Mar-25-2007] # Query: PKNA_MYCTU P65726 RecName: Full=Probable serine/threonine-protein kinase pknA; EC=2.7.11.1; # Database: uniprot_sprot_human # Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end,

  • s. start, s. end, e-value, bit score

PKNA_MYCTU Q9P0L2 31.87 251 161 5 11 260 58 299 4e-23 105 PKNA_MYCTU P22612 30.84 227 148 2 2 228 33 250 4e-22 102 PKNA_MYCTU Q96L34 30.28 251 165 5 11 260 57 298 1e-21 100 PKNA_MYCTU P17612 30.40 227 149 2 2 228 33 250 1e-21 100 PKNA_MYCTU Q8TD08 32.73 220 133 5 12 223 12 224 2e-21 100

Protein sequence DB Swiss-Prot Mycobacterium tuberculosis Mycobacterium tuberculosis (MT) Kinase domain

MT protein sequences in multi FASTA file

Swiss-Prot (H. Sapiens)

Blast + seqret

MT protein + human homologous sequences in multi FASTA file

swissknife ClustalW

Multiple sequence Alignment of MT and Human Kinase homologues Phylogenetic tree

Tree-Puzzle swissknife

slide-13
SLIDE 13

Lorenza Bordoli 11 September 2008

What is EMBOSS ?

  • http://emboss.sourceforge.net/
  • EMBOSS, The European Molecular Biology Open Source Software Suite, is

a package of high-quality FREE Open Source software for sequence analysis;

  • EMBOSS includes hundreds of applications (+150). They share a similar

interface, but each comes with its own documentation:

  • Many sequence analysis & display programs.
  • Protein 3D structure prediction being developed.
  • Other assorted programs, eg: enzyme kinetics.
  • Extensible (with some C programming knowledge)!
  • Complete list of the programs in the currently release:

http://emboss.sourceforge.net/apps/#Overview

  • Grouped applications: http://emboss.sourceforge.net/apps/groups.html

Lorenza Bordoli 11 September 2008

EMBOSS: Introduction

  • The EMBOSS package consists of a large number of separate programs

that have a specific function.

  • They usually take a (number of) input file(s) and some parameters that are

important to the function and produce output in the form of files, plots, web pages or simple text output.

slide-14
SLIDE 14

Lorenza Bordoli 11 September 2008

Some major programs:

  • General:

wossname lists all EMBOSS programs showdb shows the available databases

  • Sequence retrieval:

seqret retrieves and/or changes format of a sequence seqretset retrieve and or change formats of a number seqretall

  • f sequences at once

transeq translate a DNA sequence to protein backtranseq translate a protein sequence to DNA extractseq extract regions from a sequence cutseq remove a region from a sequence pasteseq inserts a sequence into another sequence infoseq display information about a sequence splitter split a sequence into smaller sequences

Lorenza Bordoli 11 September 2008

Some major programs (cont.):

  • Sequence comparison

needle Needleman-Wusch sequence alignment water Smith-Waterman sequence alignment stretcher Myers and Miller global alignment matcher Huang and Miller local alignment dottup dotmatcher dotplot comparisons of two sequences. prettyplot plots multiple sequence alignments polydot supermatcher dotplot comparisons of multiple sequences emma ClustalW program

  • Sequence parameters

cusp generates a codon usage table syco synonymous codon usage plot dan calculates DNA/RNA melting temperature compseq sequence composition tables

slide-15
SLIDE 15

Lorenza Bordoli 11 September 2008

  • DNA Sequence features

remap restriction map of the sequence remap cpgplot cpgreport CpG island detection etandem einverted finds tandem and inverted repeats plotorf plots potential ORFs showorf pretty display of potential ORFs fuzznuc DNA pattern search tfscan scans sequence for TF binding sites

Some major programs (cont.):

Lorenza Bordoli 11 September 2008

  • Protein Sequence features

ief Isoelectric point calculation antigenic Finds potential antigenic sites digest protein digestion map findkm Vmax and Km calculations fuzzpro protein pattern search garnier protein 2D structure prediction helixturnhelix finds nucleic acid binding motifs

  • ctanol

pepwindow displays protein hydropathy patmatdb patmatmotifs searching with motifs vs protein sequences pepcoil predicts coiled coil regions pepinfo pepstats Protein information Hammer package ehmmpfam, ehmmsearch, ehmmbuild, … Phylip package efitch, edolpenny, edollop, …

Some major programs (cont.):

slide-16
SLIDE 16

Lorenza Bordoli 11 September 2008

Working with sequences :

  • EMBOSS reads sequences from files or databases.
  • It automatically recognizes the input sequence format.
  • You can easily specify many output formats.

Lorenza Bordoli 11 September 2008

Uniform Sequence Address (USA)

  • a standard way of specifying a sequence to be read into a program in EMBOSS
  • Sequences can be in databases or in files
  • It has the following syntax:

format::database:entry format::file:entry In general, a USA specifies

  • what sequence format to expect
  • what file or database to open
  • what entry to look for
slide-17
SLIDE 17

Lorenza Bordoli 11 September 2008

Uniform Sequence Address (USA)

format::database:entry

  • Of these only the “file” or “database” are necessary;
  • If format is omitted: EMBOSS will check and recognizes the format (occasionally

needs a hint) * ;

  • if the “entry “ part is omitted, all of the entries in the file or database are read in;

* EMBOSS recognizes: GCG, FASTA, ClustalW, MSF, EMBL, GenBank, DNAStrider, Phylip, PIR, PAUP,ASN.1, NBRF, Fitch, IntelliGenetics

Lorenza Bordoli 11 September 2008

Uniform Sequence Address (USA)

The most common ways of specifying a sequence are:

  • to type the name of the file that the sequence is in: myfile.seq
  • or type “db:entry”, where “db” is the name of the database and “entry” is

either the ID or the accession number (AC) of the sequence in the database Ex.: database:accession sw:P65726 database:ID swissprot:opsd_xenla file name myfile.seq

slide-18
SLIDE 18

Lorenza Bordoli 11 September 2008

ACs and IDs …

  • An entry in a database: uniquely identified in that database. Most sequence

databases have two such identifiers for each sequence - an ID name and an Accession number.

  • Why are there two such identifiers?
  • The ID name: a human-readable name that had some indication of the

function of its sequence: OPSD_HUMAN in Swiss-Prot !! ID names are not guaranteed to remain the same between different versions

  • f a database.
  • Accession numbers: unique alphanumeric identifiers that are guaranteed to

remain with that sequence through the rest of the life of the database:

  • P08100. If two sequences are merged into one, then the new sequence will

get a new Accession number and the Accession numbers of the merged sequences will be retained as 'secondary' Accession numbers.

Lorenza Bordoli 11 September 2008

Databases

You can easily find out what are the database name in your EMBOSS installation by running the showdb program: Unix % showdb Displays information on the currently available databases #Name Type ID Qry All Comment #==== ==== == === === ======= sw P OK OK OK Swiss-Prot section of UniProt swiss P OK OK OK Swiss-Prot section of UniProt swiss-prot P OK OK OK Swiss-Prot section of UniProt trembl P OK OK OK TrEMBL section of UniProt uniprot P OK OK OK UniProt (Swiss-Prot & TrEMBL), …

slide-19
SLIDE 19

Lorenza Bordoli 11 September 2008

Running EMBOSS from the Unix command line

Lorenza Bordoli 11 September 2008

The command Line

  • EMBOSS programs are called by giving their name on the UNIX command line

either:

  • without parameters: interactive way, query-answer session with the user, in

which the user is asked (prompted) to enter a piece of information one at a time;

  • or with parameters: the program will have all the information it needs all at
  • nce;
slide-20
SLIDE 20

The command Line

  • EMBOSS programs are called by giving their name on the UNIX command line:
  • without parameters: interactive way, query-answer session with the user,

in which the user is asked (prompted) to enter a piece of information one at a time: % seqret Reads and writes (returns) a sequence Input sequence: sw:P65726 Output sequence [pkna_myctu.fasta]: % more pkna_myctu.fasta >PKNA_MYCTU P65726 RecName: Full=Probable serine/threonine- protein kinase pknA; EC=2.7.11.1; MSPRVGVTLSGRYRLQRLIATGGMGQVWEAVDNRLGRRVAVKVLKSEFSSDPEFIERFRA EARTTAMLNHPGIASVHDYGESQMNGEGRTAYLVMELVNGEPLNSVLKRTGRLSLRHALD MLEQTGRALQIAHAAGLVHRDVKPGNILITPTGQVKITDFGIAKAVDAAPVTQTGMVMGT AQYIAPEQALGHDASPASDVYSLGVVGYEAVSGKRPFAGDGALTVAMKHIKEPPPPLPPD

Lorenza Bordoli 11 September 2008

The command Line

  • EMBOSS programs are called by giving their name on the UNIX command line:
  • with parameters: the program will have all the information it needs all at
  • nce:
  • % seqret -sequence sw:P65726 -outseq pkna_myctu.fasta
  • programname
  • parameter/object: here: a dtabase entry
  • qualifier: specify the properties of that object/parameter. here: input/output

sequence

slide-21
SLIDE 21

Lorenza Bordoli 11 September 2008

The command Line and parameters

% seqret -sequence sw:P65726 -outseq pkna_myctu.fasta programname parameter/object: here: a sequence file qualifier: specify the properties of that object/parameter. here: input/output sequence

  • There are 3 classes of parameters:
  • standard (mandatory) : minimum input for the program, the program will

prompt for them; ex: sequence file name;

  • additional (optional) : you will be prompted only if you use the –options

(-opt) qualifier; ex: begin and end of the sequence;

  • advanced : you have to specify them on the command line (you won’t be

prompted);

Lorenza Bordoli 11 September 2008

The command Line and parameters

  • There are 3 classes of parameters:
  • standard (mandatory) : minimum input for the program, the program will

prompt for them; ex: sequence file name;

  • additional (optional) : you will be prompted only if you use the –options

qualifier; ex: begin and end of the sequence;

  • advanced : you have to specify them on the command line (you won’ be

prompted);

  • How do I find them out? With the –help qualifier! (try also –verbose)

% Programname -help

slide-22
SLIDE 22

Lorenza Bordoli 11 September 2008

"-sequence" related qualifiers

  • sbegin

integer first base used

  • send

integer last base used, def=seq length

  • sreverse

bool reverse (if DNA)

  • sask

bool ask for begin/end/reverse

  • slower

bool make lower case

  • supper

bool make upper case

  • sformat

string input sequence format

Lorenza Bordoli 11 September 2008

Qualifiers: Example

% seqret –sequence sw:P80590 -outseq P80590.fasta

>LHB1_RHOTE P80590 Light-harvesting polypeptide B-885 beta- 1 chain (LH-1) (Antenna pigment polypeptide beta-1 chain). AEDRKSLSGLTEQEAQEFGTLYTQGVAFVAVIAVVAHALVWAWRPWLQ % seqret –sequence sw:P80590 -outseq P80590_sh.fasta -sbegin1 1 -send1 10 >LHB1_RHOTE P80590 Light-harvesting polypeptide B-885 beta-1 chain (LH-1) (Antenna pigment polypeptide beta-1 chain). AEDRKSLSGL

slide-23
SLIDE 23

Lorenza Bordoli 11 September 2008

Example: Seqret

  • Give seqret all of its data on the command-line :

% seqret –sequence sw:P80590 -outseq P80590.fasta % seqret sw:P80590 P80590.fasta

  • Even shorter, leave out the qualifier:

% seqret -help -verbose

Lorenza Bordoli 11 September 2008

The full USA syntax

filename : a file containing one or more sequences filename:entry : a given sequence in the file. The ‘entry’ mysequences:opsd_xenla is the ID or AC of the sequence in that file filename:entry[start:end] : a part of the sequence can be specified mysequences:opsd_xenla[1:20]* by the range mysequences:opsd_xenla[-1:-20]* : the last 20 residues/nucleotides mysequences:[1:20:r]* : reverse-complemented (nucleotide sequences) * In some Unix shell you might have to use backslash: \[ and \[

slide-24
SLIDE 24

Lorenza Bordoli 11 September 2008

Calling EMBOSS programs with no prompt

% seqret sw:P80590\[25:35\] -outseq pkna_myctu.fasta –auto

  • auto

Accept all the default settings and run without prompting the user

Lorenza Bordoli 11 September 2008

Send the output of a program as input for a second program

% seqret sw:P80590\[25:35\] -stdout -auto | infoseq –filter

  • stdout

Print output to stdout (the screen) instead of to a file.

  • filter

Take input from stdin and output to stdout

  • auto

Accept all the default settings and run without prompting the user

slide-25
SLIDE 25

Protein sequence DB Swiss-Prot Mycobacterium tuberculosis Mycobacterium tuberculosis (MT) Kinase domain

MT protein sequences in multi FASTA file

Swiss-Prot (H. Sapiens)

Blast + seqret

MT protein + human homologous sequences in multi FASTA file

swissknife ClustalW

Multiple sequence Alignment of MT and Human Kinase homologues Phylogenetic tree

Tree-Puzzle swissknife

Lorenza Bordoli 11 September 2008

ClustalW: Multiple sequence alignment tool

  • A multiple sequence alignment tool.
  • Many others currently available: T-coffee,

Muscle, ProbCons, …

  • To run just type: clustalw
  • Takes as input file a multi-fasta file.
  • Prompt the user for each

information/parameter.

  • Short demo …
slide-26
SLIDE 26

Lorenza Bordoli 11 September 2008

ClustalW: Multiple sequence alignment tool

Input: multifasta file Output: *.aln (and *.dnd) files

Protein sequence DB Swiss-Prot Mycobacterium tuberculosis Mycobacterium tuberculosis (MT) Kinase domain

MT protein sequences in multi FASTA file

Swiss-Prot (H. Sapiens)

Blast + seqret

MT protein + human homologous sequences in multi FASTA file

swissknife ClustalW

Multiple sequence Alignment of MT and Human Kinase homologues Phylogenetic tree

Tree-Puzzle swissknife

slide-27
SLIDE 27

Lorenza Bordoli 11 September 2008

Tree-Puzzle: Phylogenetic analysis

  • Other equivalent tools: PAUP, Phylip
  • http://www.tree-puzzle.de/
  • To run just type: puzzle
  • Prompt the user for each

information/parameter.

  • Takes as input a multiple sequence

alignment file in phylip format

  • Demo …

Tree-Puzzle: Phylogenetic analysis

Input: alignment in phylip format

  • utput: *.puzzle and (*.dist, *.tree) files
slide-28
SLIDE 28

Lorenza Bordoli 11 September 2008

STDIN, STDOUT, STDERR

Standard input (STDIN):

  • Standard input is data (often text) going into a
  • program. The program requests data transfers

by use of the read operation. Not all programs require input. For example, the ls program (which displays file names contained in a directory) performs its operation without any stream data input.

  • Unless redirected, input is expected from the

text terminal which started the program.

Lorenza Bordoli 11 September 2008

STDIN, STDOUT, STDERR

Standard output (STDOUT)

  • Standard output is the stream where a

program writes its output data. The program requests data transfer with the write operation. Not all programs generate

  • utput. For example the file rename

command mv is silent on success.

  • Unless redirected, standard output is the

text terminal which initiated the program.

slide-29
SLIDE 29

Lorenza Bordoli 11 September 2008

STDIN, STDOUT, STDERR

Standard error (STDERR)

  • Standard error is another output stream typically

used by programs to output error messages or

  • diagnostics. It is a stream independent of

standard output and can be redirected

  • separately. The usual destination is the text

terminal which started the program to provide the best chance of being seen even if standard

  • utput is redirected (so not readily observed).
  • For example, output of a program in a pipeline is

redirected to input of the next program, but errors from each program still go directly to the text terminal.

Lorenza Bordoli 11 September 2008

Run external programs from Perl

How to run an external programs from a Perl script? 1) System function: launches a child process which run a program:

system (“seqret sw:P80590 -outseq pkna_myctu.fasta –auto”);

⇒ Perl waits for it to finish, then possibly grabs its return value. (like typing unix shell command). 2) Processes as Filehandles: launches a child process that stays alive, communicating via pipes to Perl until the task is complete:

  • pen (SEQRET, “ seqret sw:P80590 –auto -stdout |” );

my @sequence = <SEQRET>; …

⇒ These filehandles “contains” the output of the launched command.

slide-30
SLIDE 30

Lorenza Bordoli 11 September 2008

Run external programs from Perl

How to run an external programs from a Perl script? 3) Exec: executes a system command and never returns:

exec (“hostname”); print “you will never see this”;

4) Backtick: run external programs (similar to system): $variable = `date`; ⇒The variable now contains (STDOUT): Wed Sep 10 21:10:29 CEST 2008

Lorenza Bordoli 11 September 2008

Run external programs from Perl

4) Backtick: using backticks with a program that requires prompt input: my $temp = `clustalw <<EOF 1 $inputfile 2 9 4 1 test.aln test.phy test.dnd X X X EOF`;

slide-31
SLIDE 31

Protein sequence DB Swiss-Prot Mycobacterium tuberculosis Mycobacterium tuberculosis (MT) Kinase domain

MT protein sequences in multi FASTA file “tempfile.txt”

Swiss-Prot (H. Sapiens)

Blast + seqret

MT protein + human homologous sequences in multi FASTA file

“tempfile.txt”

swissknife ClustalW

Multiple sequence Alignment of MT and Human Kinase homologues

“alignment.aln” “alignment.phy” “alignment.dnd”

…11 seq. Phylogenetic tree

(“alignment.phy.dist”,” alignemnt.phy.tree”, ” alignment.phy.puzzle”)

Tree-Puzzle swissknife

“1773.sprot”

“uniprot_sprot_human”

#!/import/bc2/soft/bin/perl5/perl use strict; use warnings; use English qw(-no_match_vars); # Use english variable names for errors use Carp; # The Carp routines are useful # in your own modules because # they act like die() or warn(), # but with a message which is # more likely to be useful to a # user of your module. use Getopt::Long; #To get command line options use SWISS::Entry; #To use the Swissknife library use SWISS::FTs; use SWISS::DEs; use SWISS::SQs; use SWISS::ACs; use SWISS::OXs;

slide-32
SLIDE 32

# ------------------------------------------------- # Variables declaration # # NCBI Tax ID of an organism in OX line e.g. 1773 my $taxid = 0; # Sequence annotation specified in the the UniProt # FT lines,e.g."Protein kinase." my $feature = ""; # UniProt data file, e.g. uniprot_sprot_bacteria.dat my $data = ""; # Remember our results my %SEQS; # To store the sequences needed as input for the clustalw # program my $inputfile = "tempfile.txt";

# -------------------------------------------------------- # Get User Input and check for errors # GetOptions ( "taxid=i" => \$taxid, # integer value "feature=s" => \$feature, # string "data=s" => \$data, # string ); croak "Can not read UNIPROT file. Usage example: $0 --taxid 1773 --feature "Protein kinase" --data uniprot.dat" unless (-r $data); croak "Invalid Taxid. Usage example: $0 --taxid 1773 --feature "Protein kinase." unless ($taxid > 0); croak "Please do not use human as example. Usage example: $0 --taxid 1773 -

  • feature "Protein kinase" --data uniprot.dat"

if ($taxid == 9606); croak "Invalid UniProt FT feature. Usage example: $0 --taxid 1773 --feature "Protein kinase" --data uniprot.dat" unless ( $feature );

./RunMe.pl --taxid 1773 --feature "Protein kinase" --data 1773.sprot

slide-33
SLIDE 33

# ---------------------------------------------------------- # Now parse UNIPROT for all entries of the requested species # Containing the specified sequence annotation (FT lines) # Use the "//" separator between SwissProt entries as # Separator when reading all entries

  • pen (my $fhi,"<$data") or croak "ERROR: Can not read UNIPROT.";

# change end of line separator to read an entire UniProt record at a time local $/ = "\n//\n"; LOOP: while ( <$fhi> ) { # Parse the SwissProt entries # For documentation of SwissKnife, # see http://swissknife.sourceforge.net my $entry = SWISS::Entry->fromText($_); my $ac = $entry->AC; my $seq = $entry->SQ; my ($DE) = $entry->DEs->elements; my $de = $DE->text; my ($OS) = $entry->OSs->elements; my $os = $OS->text; my ($OX) = $entry->OXs->NCBI_TaxID()->elements(); my $ox = $OX->text; chomp($seq); chomp($ox); chomp ($os); chomp ($de); # Let's see if we have the right species: if ($ox == $taxid) { # get the information from the FT lines foreach my $ft ( $entry->FTs->get('DOMAIN') ) { my $FTkey = $$ft[0]; my $FTfrom = scalar $$ft[1]; my $FTto = scalar $$ft[2]; my $FTdes = $$ft[3]; # match to the required sequence annotation (FT lines) if ($FTdes =~ /$feature/i) { $counter ++; print "$counter:$ac - $ox - $os - $de - "; print "$FTkey $FTfrom $FTto $FTdes\n"; my $domain = substr ($seq,$FTfrom,($FTto-$FTfrom+1)); # store the amino acid sequence corresponding to the sequence annotation $SEQS{$ac}=$domain; } } } last LOOP if $counter >= MAX; }

slide-34
SLIDE 34

# ------------------------------------------------------------------------- # Write the amino acid sequences (corresponding to the species and sequence annotation) as FASTA file

  • pen (my $fho,">$inputfile") or croak "ERROR: Can not write $inputfile";

foreach my $key (keys %SEQS) { print $fho ">$key\n$SEQS{$key}\n";} close ($fho); # ------------------------------------------------------------- # For each sequence from organism taxid, find the closest human # homolog with a simple blast search my $blastcmd = "blastall -p blastp -d uniprot_sprot_human -i tempfile.txt -m 9”; ### Launches an external program with Filehandles

  • pen (IN," $blastcmd |") ;

my @BLAST = <IN>; close (IN); my $blastcounter = 0; my %HOMOLOG; my %HIT; #Parse the blast output in tabular format ### Append the results in the input file for clustalw (Concatenate !!!)

  • pen ($fho,">>$inputfile") or croak "ERROR: Can not write $inputfile";

foreach my $line (@BLAST) { if ($line !~ /^\#/) #the comments in the blast output are discarded { #store the blast output separated by tabs(or spaces) in different variables my ($query, $hit, $identity, $alignment_length, $mismatches, $gap_openings, $qstart, $qend, $hit_start, $hit_end, $evalue, $bitscore) = split '\s', $line; if ((! defined $HOMOLOG{$query}) && (! defined $HIT{$hit})) # Use ONLY the first homolog, and ONLY ONCE { $HOMOLOG{$query} = $hit; $HIT{$hit} = $query;

slide-35
SLIDE 35

#retrieve the amino acid sequence corresponding to the hit (in FASTA #format) with the program seqret ### Call an external program with Back Ticks! my $seqret = "seqret -stdout -auto sw:$hit\[$hit_start:$hit_end\]"; my $hitseq = `$seqret`; # Clean the FASTA header line; keep only the first ID $hitseq =~ s/^>([A-Z0-9_]+)\s([A-Z0-9_]+)[^\n]*\n/>$1\n/; print $fho $hitseq; print "$query - $hit\n"; $blastcounter++; } } } close ($fho); # If we have more than one sequence, let's align # while we test some error scenarios ... # Variant 1: my $cmd = "clustalv $inputfile"; my $result = system ($cmd); if ($result) # Note that system returns 0 if everything went well. This is strange, but typical PERL { print "There was an error executing the following command:\n>>$cmd\n"; print "Think the program is called clustalw !\n"; } # Variant 2: print "Should we try again, this time with the correct command?\n"; my $wait = <STDIN>; $cmd = "clustalw $inputfile"; $result = system ($cmd); croak "There was an error executing the following command:\n>>$cmd\n" if ($result);

slide-36
SLIDE 36

# Variant 3: print "Seems like it worked well this time!"; print "But what if we want to change the output format of clustalw?"; print "We need Phylip output format for our phylogenetic tree ...\n"; $wait = <STDIN>; print "Running clustalw again, to write the results in phylip format\n"; ### Call an external program with Back Ticks! my $trash = `clustalw <<EOF 1 $inputfile 2 9 4 1 alignemnt.aln alignment.phy alignment.dnd X X X EOF`; # Run Tree-Puzzle to calculate a phylogenetic tree ### Call an external program with Back Ticks! print "Runing Tree-Puzzle to calculate a phylogenetic tree ..."; $result = `puzzle alignment.phy <<EOF y EOF`;