
EMBnet Course: PERL for biomedical researchers
Command line tools scripting
Basel, 11 September 2008
Lorenza Bordoli, Swiss Institute of Bioinformatics

Outline
• Combining different programs with Perl: writing a short pipeline in Perl
• UniProt and its controlled vocabulary
• Swissknife library
• Overview of the programs of the pipeline: Blast, seqret (EMBOSS), ClustalW, Tree-Puzzle
• Details of the Perl script

Combining different programs with Perl
[Figure: input -> program1 -> output -> program2 -> output -> program3 -> output, all embedded in a single Perl script]
www.bc2.unibas.ch
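The idea in the figure can be sketched in a few lines of Perl. This is only an illustration under stated assumptions: the three "programs" are stand-in Perl one-liners (hypothetical, not part of the course material), and the helper names run_step and stand_in are invented for this sketch. In a real pipeline each step would call an external tool such as those listed in the outline.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Run one external program and stop the whole pipeline if it fails.
sub run_step {
    my ($label, @command) = @_;
    system(@command) == 0
        or die "$label failed with exit status " . ($? >> 8) . "\n";
    return;
}

# Build a stand-in "program" (an assumption for this sketch): a Perl
# one-liner that reads $in, applies $code to each line and writes $out.
sub stand_in {
    my ($code, $in, $out) = @_;
    my $script = 'my ($in, $out) = @ARGV;'
               . 'open my $i, "<", $in or die $!;'
               . 'open my $o, ">", $out or die $!;'
               . 'while (<$i>) { chomp; ' . $code . '; print {$o} "$_\n"; }'
               . 'close $o or die $!;';
    return ($^X, '-e', $script, $in, $out);
}

# One input file and one output file per program.
my $dir   = tempdir(CLEANUP => 1);
my @files = map { "$dir/$_" } qw(input.txt output1.txt output2.txt output3.txt);

open my $in_fh, '>', $files[0] or die "Cannot write $files[0]: $!\n";
print {$in_fh} "acgt\n";
close $in_fh or die "Cannot close $files[0]: $!\n";

# Each step reads the output of the previous one.
my @steps = (
    [ 'program1', '$_ = uc $_' ],
    [ 'program2', '$_ = scalar reverse $_' ],
    [ 'program3', '$_ = "result: $_"' ],
);

for my $n (0 .. $#steps) {
    my ($label, $code) = @{ $steps[$n] };
    run_step($label, stand_in($code, $files[$n], $files[$n + 1]));
}

# Read the final output back into the script.
open my $result_fh, '<', $files[-1] or die "Cannot read $files[-1]: $!\n";
my $final = do { local $/; <$result_fh> };
close $result_fh;
print $final;
```

Run as is, the script prints "result: TGCA". Checking the return value of system after every step means a failing program stops the pipeline instead of passing an empty file to the next step.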

[Figure: overview of the pipeline. Mycobacterium tuberculosis (MT) protein in the Swiss-Prot protein sequence DB -> swissknife -> kinase domain -> Blast against Swiss-Prot (H. sapiens) -> MT protein + human homologous sequences -> seqret -> sequences in multi FASTA file -> ClustalW -> multiple sequence alignment of MT and human kinase homologues -> Tree-Puzzle -> phylogenetic tree]

UniProt
www.uniprot.org

UniProt
• The UniProt Knowledgebase (UniProtKB) provides the central database of protein sequences with accurate, consistent, rich sequence and functional annotation.
• The UniProt Knowledgebase consists of two sections:
  – Swiss-Prot - a section containing manually-annotated records with information extracted from literature and curator-evaluated computational analysis, and
  – TrEMBL - a section with computationally analyzed records that await full manual annotation.

TrEMBL
• TrEMBL is the computer-annotated section of the UniProt Knowledgebase. It contains translations of all coding regions in the DDBJ/EMBL/GenBank nucleotide databases, and protein sequences extracted from the literature or submitted to UniProtKB, which are not yet integrated into Swiss-Prot.
• TrEMBL allows these sequences to be made publicly available quickly without diluting the high quality annotation found in Swiss-Prot.
• The information in a TrEMBL entry is initially derived directly from the underlying DDBJ/EMBL/GenBank nucleotide entry, and the quality of the data depends directly on the information provided by the submitter of the nucleotide entry. This information may be enhanced later by automatic annotation procedures; if not, it remains as provided by the submitter until the entry is manually annotated and added to Swiss-Prot.

Swiss-Prot
• Swiss-Prot is an annotated protein sequence database. It was established in 1986 and has been maintained collaboratively since 1987 by the group of Amos Bairoch, first at the Department of Medical Biochemistry of the University of Geneva and now at the Swiss Institute of Bioinformatics (SIB), and the EMBL Data Library (now the EMBL Outstation - The European Bioinformatics Institute (EBI)). The Swiss-Prot Protein Knowledgebase consists of sequence entries. Sequence entries are composed of different line types, each with their own format.
• Swiss-Prot distinguishes itself by four distinct criteria:
  1. Annotations
  2. Minimal redundancy
  3. Integration with other databases
  4. Documentation

Swiss-Prot – 1. Annotations
In Swiss-Prot, as in many sequence databases, two classes of data can be distinguished: the core data and the annotation.
1. For each sequence entry the core data consists of:
  • The sequence data;
  • The citation information (bibliographical references);
  • The taxonomic data (description of the biological source of the protein).
2. The annotation consists of the description of the following items:
  • Function(s) of the protein;
  • Posttranslational modification(s) such as carbohydrates, phosphorylation, acetylation and GPI-anchor;
  • Domains and sites, for example, calcium-binding regions, ATP-binding sites, zinc fingers, homeoboxes, SH2 and SH3 domains and kringle;
  • Secondary structure, e.g. alpha helix, beta sheet;
  • Quaternary structure, e.g. homodimer, heterotrimer, etc.;
  • Similarities to other proteins;
  • Disease(s) associated with any number of deficiencies in the protein;
  • Sequence conflicts, variants, etc.

UniProt – Structure of a sequence entry
Each sequence entry is composed of lines. Different types of lines, each with their own format, are used to record the various data that make up the entry:

    ID   GRAA_HUMAN   Reviewed;   262 AA.
    AC   P12544; Q6IB36;
    DT   01-OCT-1989, integrated into UniProtKB/Swiss-Prot.
    DT   01-OCT-1989, sequence version 1.
    DT   10-JUN-2008, entry version 103.
    DE   RecName: Full=Granzyme A;
    DE   EC=3.4.21.78;
    DE   AltName: Full=Granzyme-1;
    DE   AltName: Full=Cytotoxic T-lymphocyte proteinase 1;
    DE   AltName: Full=Hanukkah factor;
    DE   Short=H factor;
    DE   Short=HF;
    DE   AltName: Full=CTL tryptase;
    DE   AltName: Full=Fragmentin-1;
    DE   Flags: Precursor;
    GN   Name=GZMA; Synonyms=CTLA3, HFSP;
    OS   Homo sapiens (Human).
    OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
    OC   Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
    OC   Catarrhini; Hominidae; Homo.
    OX   NCBI_TaxID=9606;
    RN   [1]
    RP   NUCLEOTIDE SEQUENCE [MRNA].
    […]

UniProt – Sequence entry lines
http://www.expasy.org/sprot/userman.html
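Because every line starts with a two-letter line code, an entry can be split into its line types with plain Perl. The following is a minimal sketch, not the Swissknife implementation: it uses a shortened excerpt of the GRAA_HUMAN entry shown above, and the variable names (%lines, $primary_ac, @oc) are chosen for this illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened excerpt of the entry above; "//" marks the end of an entry.
my $entry_text = <<'END_ENTRY';
ID   GRAA_HUMAN   Reviewed;   262 AA.
AC   P12544; Q6IB36;
DE   RecName: Full=Granzyme A;
GN   Name=GZMA; Synonyms=CTLA3, HFSP;
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
OC   Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
//
END_ENTRY

# Group the content of each line under its two-letter line code.
my %lines;
for my $line (split /\n/, $entry_text) {
    last if $line eq '//';
    next unless $line =~ /^([A-Z]{2})\s+(.*)$/;
    push @{ $lines{$1} }, $2;
}

# Entry name: first word of the ID line.
my ($entry_name) = $lines{ID}[0] =~ /^(\S+)/;

# Primary accession number: first accession on the first AC line.
my ($primary_ac) = $lines{AC}[0] =~ /^(\w+);/;

# Organism classification: join the OC lines, drop the final dot,
# then split on the semicolons.
my $oc_text = join ' ', @{ $lines{OC} };
$oc_text =~ s/\.\s*$//;
my @oc = split /;\s*/, $oc_text;

print "$entry_name ($primary_ac)\n";
print join(' > ', @oc), "\n";
```

A hand-written parser like this is fine for a few fields, but it ignores many details of the format (continuation rules, evidence tags, the sequence block), which is one reason to prefer a tested library for real work.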

UniProt – Sequence entry
• The entries in the UniProt Knowledgebase are structured so as to be usable by human readers as well as by computer programs.
• The explanations, descriptions, classifications and other comments are in ordinary English.
• Wherever possible, symbols familiar to biochemists, protein chemists and molecular biologists are used.
• Example: http://www.uniprot.org/uniprot/P65726

Swiss-Prot – Swissknife
• You can write your own parser to extract information from the UniProt database, or use:
• Swissknife, an object-oriented Perl library to handle Swiss-Prot entries
• http://swissknife.sourceforge.net/docs/

Swiss-Prot – Swissknife
SWISS::Entry
• Main module to handle SWISS-PROT entries. One Entry object represents one SWISS-PROT entry and provides an API for its modification.

    use SWISS::Entry;

    # Read an entire record at a time ("//" ends each entry)
    $/ = "\n//\n";

    while (<>) {
        my $entry = SWISS::Entry->fromText($_);
        # Print the primary accession number of the entry
        print $entry->AC, "\n";
    }

Swiss-Prot – Swissknife

    use SWISS::Entry;
    use SWISS::OCs;

    # Read an entire record at a time
    local $/ = "\n//\n";

    while (<>) {
        # Read the entry
        my $entry = SWISS::Entry->fromText($_);

        # Print the primary accession number of each entry
        print $entry->AC, ":\n";

        # Print the organism classification (OC) terms of each entry
        my @OC = $entry->OCs->elements();
        foreach my $oc (@OC) {
            print "$oc\t";
        }
        print "\n\n";
    }

Swiss-Prot – Swissknife

    use SWISS::Entry;
    use SWISS::OCs;
    use SWISS::FTs;
    […]
    # Print the FT lines of type "DOMAIN" of the entry
    foreach my $ft ( $entry->FTs->get('DOMAIN') ) {
        # Each feature is an array reference: key, from, to, description
        my $FTkey  = $ft->[0];
        my $FTfrom = $ft->[1];
        my $FTto   = $ft->[2];
        my $FTdes  = $ft->[3];
        print "FT: $FTdes $FTkey from: $FTfrom to: $FTto\n";
    }

Blast
$ blastall -p blastp -d sprot -i sequence.txt -m 9

blastall 2.2.16 arguments:
  -p  Program Name [String]
  -d  Database [String] default = nr
  -i  Query File [File In] default = stdin
  -e  Expectation value (E) [Real] default = 10.0
  -m  alignment view options:
        0 = pairwise,
        1 = query-anchored showing identities,
        2 = query-anchored no identities,
        3 = flat query-anchored, show identities,
        4 = flat query-anchored, no identities,
        5 = query-anchored no identities and blunt ends,
        6 = flat query-anchored, no identities and blunt ends,
        7 = XML Blast output,
        8 = tabular,
        9 = tabular with comment lines,
        10 = ASN, text,
        11 = ASN, binary [Integer]
  […] and more options.

Blast

  Program   Query                                Database
  blastp    protein                              protein
  blastn    nucleotide                           nucleotide
  blastx    nucleotide (translated to protein)   protein
  tblastn   protein                              nucleotide (translated to protein)
  tblastx   nucleotide (translated to protein)   nucleotide (translated to protein)
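The tabular output selected with -m 9 is convenient to parse in Perl: comment lines start with "#", and each hit line has twelve tab-separated fields (query id, subject id, % identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value, bit score). The sketch below is an illustration only: the report lines, the hit names (hitA, hitB, hitC) and the e-value cutoff are invented for the example and are not results from the course.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented example report in the layout of "blastall -m 9" output.
# In a real pipeline this text would come from the blastall program.
my $report = join "\n",
    '# Query: query1',
    '# Database: sprot',
    join("\t", qw(query1 hitA 45.20 250 130 4 10 255 20 268 1e-50 198)),
    join("\t", qw(query1 hitB 30.10 240 160 6 12 248 5 240 2e-20 98.6)),
    join("\t", qw(query1 hitC 25.00 80 58 2 100 178 300 379 0.5 30.1)),
    '';

# Hypothetical cutoff: keep only hits at least this significant.
my $max_evalue = 1e-5;

my @selected;
for my $line (split /\n/, $report) {
    # Skip the comment lines written by -m 9 and any empty lines.
    next if $line =~ /^#/ or $line !~ /\S/;

    my ($query, $subject, $identity, $length, $mismatches, $gaps,
        $q_start, $q_end, $s_start, $s_end, $evalue, $bit_score)
        = split /\t/, $line;

    push @selected, $subject if $evalue <= $max_evalue;
}

print "$_\n" for @selected;
```

With the invented data above, the script keeps hitA and hitB and drops hitC. The list of subject ids is the kind of intermediate result that the next program in a pipeline can take as input.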
