ELIXIR SCOP (Murzin) ~3000 domain structure families CATH - - PDF document


23/07/2014 Groups involved in Genome3D Tom Blundell Cambridge University Julian Gough Bristol University David Jones UCL Alexey Murzin LMB (Cambridge) Annotating Genomes with Structures and Functions Christine Orengo UCL BBSRC


slide-1
SLIDE 1

23/07/2014 1

http://www.genome3d.eu

Annotating Genomes with Structures and Functions BBSRC funding from 2011, Established July 2012 SAB: Prof Geoff Barton – Dundee University, Prof Chas Bountra - Structural Genomics Consortium, Prof Torsten Schwede - Swiss Institute of Bioinformatics

Groups involved in Genome3D

Tom Blundell – Cambridge University Julian Gough – Bristol University David Jones – UCL Alexey Murzin – LMB (Cambridge) Christine Orengo – UCL Michael Sternberg – Imperial, London

ELIXIR

ELIXIR unites Europe’s leading life science organisations in safeguarding the biological data generated every day in publicly funded research. Learn more at www.elixir-europe.org

Resources

SCOP (Murzin) CATH (Orengo)

~3000 domain structure families

SUPERFAMILY (Gough) Gene3D (Orengo)

Predicted domain annotations for >30 million sequences in UniProt, ~70% of domains in completed genomes

FUGUE (Blundell) PdomTHREADER (Jones) PHYRE (Sternberg)

Predicted domain annotations and 3D models for selected organisms

Domain Structure Classification Domain Structure Annotation SCOP CATH Consensus

SUPERFAMILY Gene3D PHYRE pDomTHREADER FUGUE

KEGG terms GO terms

...

Other ...

Associated functional information from > 10 public sources

[Figure: Predicted Domain Annotations (per UniProt sequence). Axis: number of UniProt sequences (ticks 5,000 to 35,000); series: sequences annotated by 6, 5, 4, 3, 2, 1 or no groups.]

slide-2
SLIDE 2

23/07/2014 2

[Figure: Predicted 3D Models (per UniProt sequence). Axis: number of UniProt sequences (ticks 5,000 to 35,000); series: sequences modelled by 4, 3, 2, 1 or no groups.]

Genome3D Applications

SCOP CATH Consensus

ELIXIR-UK

  • UK node’s initial focus will be exclusively on training
  • The UK node will develop training infrastructure and focus on:
    – training needs analysis and trainer workshops
    – e-support service platform (TeSS)

slide-3
SLIDE 3

23/07/2014 1

Genome3D Resources at Biochemistry, Cambridge

Tom L Blundell Bernardo Ochoa M James Smith Department of Biochemistry University of Cambridge

Genome3D

Alicia Higueruelo Richard Bickerton Semin Lee Bernardo Ochoa Montano Adrian Schreyer

TIMBAL

inhibitors of protein-protein interactions

PDB

UniProt mapping residue annotation

PICCOLO

protein-protein interactions

CREDO

protein- small molecule interactions

BIPA

protein- nucleic acid interactions

TOCCATA

family-based structural alignments

REQUIEM

nsSNP mapping comparative modelling nsSNP impact

Organisation of Information

Databases in Biochemistry, Cambridge

GLORIA

Open Source

Sung Sam Gong Harry Jubb Alicia Higueruelo Richard Bickerton Semin Lee Sung Sam Gong Adrian Schreyer


Organisation of Information

Databases in Biochemistry, Cambridge Open Source

Data from papers

CREDO

database of protein-ligand interactions,

  • represents contacts as structural interaction fingerprints,
  • sequence-to-structure mapping,
  • molecular shape descriptors with Ultrafast Shape Recognition (USR),
  • fragmentation of ligands in PDB,
  • identification of approved drugs,
  • completely scriptable through an application programming interface.

Adrian Schreyer

Assembly of fragments: Composer (1987) 389 Citations

Sutcliffe et al., Protein Engineering, 1, 377-384

Satisfaction of spatial restraints: Modeller (1993) 6413 Citations

Sali and Blundell. J. Mol. Biol. 234: 779-815

Discrete sampling for ensembles consistent with spatial restraints of empirical data.

RAPPER (2006)

De Bakker, DePristo, Blundell Nature SMB 13, 184-185

Extending Knowledge of the Proteome

Knowledge-based prediction of protein structure

Blundell, Sibanda, Sternberg, Thornton Nature 326, 26 675 1987

675 Citations

Sequence structure homology recognition Fugue (2001) 1054 Citations

Shi, Blundell, Mizuguchi JMB 310 (1), 243-257

FUGUE

  • Sequence-structure homology recognition program
  • Defining characteristics:
    – Use of Environment-Specific Substitution Tables (ESSTs) in structural profiles
    – Automatic alignment algorithm selection with structure-dependent gap penalties

Shi J, Blundell TL and Mizuguchi K. Journal of Molecular Biology 310, no. 1 (June 29, 2001): 243–57. PMID: 11419950

slide-4
SLIDE 4

23/07/2014 2

Structural Environments

  • Residues exist in a variety of environments in protein structures; this affects their conservation in evolution.
  • Examples of environments:
    – secondary structure,
    – solvent exposure,
    – hydrogen bonding of main or side chain,
    – atypical dihedral angles.
  • BLOSUM-like substitution tables can be derived for each combination of environments (currently 64), improving the detection of remote homology and alignment quality.
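As a rough illustration of how 64 environment combinations can arise, the sketch below enumerates a cross product of environment classes. The slide gives only the total (64) and the kinds of feature involved; the exact breakdown used here (4 secondary-structure classes, 2 solvent-accessibility classes and 3 yes/no side-chain hydrogen-bond flags) and all class names are assumptions made for the example.

```python
from itertools import product

# Assumed class definitions; the slide only states that 64 combinations exist.
SECONDARY_STRUCTURE = ["helix", "strand", "coil", "positive_phi"]
SOLVENT_ACCESS = ["accessible", "buried"]
HBOND_FLAG = [False, True]  # one yes/no flag per hydrogen-bond type


def enumerate_environments():
    """Return every (secondary structure, accessibility, 3 H-bond flags) tuple."""
    return list(
        product(SECONDARY_STRUCTURE, SOLVENT_ACCESS, HBOND_FLAG, HBOND_FLAG, HBOND_FLAG)
    )


environments = enumerate_environments()
print(len(environments))  # 4 * 2 * 2 * 2 * 2 combinations
```

Under this assumed breakdown, one substitution table would be derived per tuple, so a residue's environment tuple selects which table scores its substitutions.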

TOCCATA

  • Replaces the original HOMSTRAD database as the source of profiles for FUGUE.
  • Constructed from a consensus of SCOP 1.75(b) families and CATH 3.5 superfamilies, including multi-domain patterns (not used on G3D).
  • Goal was to group domains/structures in a minimal number of categories, not analysis.
  • Each structure annotated according to conformation (ligand binding, oligomeric state).

http://structure.bioc.cam.ac.uk/toccata/

TOCCATA in numbers

  • 57,880 PDB IDs
  • 135,894 PDB chains
  • 228,014 domains
    – 114,647 with non-trivial ligands
    – 148,605 as part of complexes
  • 8,151 profiles
    – 6,238 single-domain families (2,263 consensus)
    – 1,519 multi-domain profiles
    – 394 repeated-domain profiles

VIVACE Pipeline

Genomic sequences Domain assignment (FUGUE) Template selection Template alignment (BATON, FUGUE) Modelling (MODELLER)

Alignment + model annotation (JOY/XSuLT)

TOCCATA

Sequence enrichment (PSI-BLAST) Sequence pre-segmentation (HMMER+PFAM) Local Web Interface

Genome3D

Joy / XSuLT

  • Encodes structural environment information (e.g. that used by FUGUE)
  • XSuLT expands the original JOY to include other features:
    – inter-residue contacts, residue depth, interface & ligand binding residues,
    – predictions & custom per-residue annotations, among others.

JOY: Mizuguchi K, Deane CM, Blundell TL, Johnson MS and Overington JS. Bioinformatics 14, no. 7 (January 1, 1998): 617–623. PMID: 9730927

XSuLT: In preparation. Soon at http://structure.bioc.cam.ac.uk/xsult

Genome3D Stats

Genome                          FUGUE (SCOP)   FUGUE (CATH)   VIVACE (models)
E. coli                         3,709          3,642          N/A
S. cerevisiae (baker’s yeast)   5,499          5,430          N/A
H. sapiens (human)              15,620         14,967         15,133

slide-5
SLIDE 5

23/07/2014 3

Human Genomes & Mutations

Genome

Sequences Polygenic Disorders & Complex Diseases Cancer Somatic Mutations Mendelian Inherited Diseases BHD Syndrome

Asp187 Ser1528 Thr1526 Gly1529

Single Gene Mutations & Disease Early Onset Breast Cancer BRCA2

Can chemistry, structure and genomics information help identify mutations that cause disease? Which are the “drivers”, and which are the “passengers”?

SDM: Stability score calculation

Unfolded state represented by substitutions occurring outside of regular secondary structure, solvent exposed and non-hydrogen bonded

$$\Delta s^{F}_{jk} = \ln\left[\frac{P(r_k \mid R_{j,\mathrm{wt}})}{P(r_j \mid R_{j,\mathrm{wt}})}\cdot\frac{P(r_k \mid R_{k,\mathrm{mut}})}{P(r_j \mid R_{k,\mathrm{mut}})}\right]$$

$$\Delta\Delta s = \Delta s^{F}_{jk} - \Delta s^{U}_{jk}$$

Current work on SDM: Catherine Worth

Worth CL, Preissner R & Blundell TL (2011) SDM - a server for predicting effects of mutations on protein stability and malfunction. Nucleic Acids Research 39(Web Server issue):W215-W222

Topham, C.M., Srinivasan, N. and Blundell, T.L. (1997) Protein Engineering 10: 7-21.

mCSM

Predicting the effect of mutations in proteins using graph-based signatures

Douglas E. V. Pires

http://bleoberis.bioc.cam.ac.uk/mcsm/

Pires DEV, Ascher DB, Blundell TL (2013) mCSM: predicting the effects of mutations in proteins using graph-based signatures. Bioinformatics 30(3):335-342


slide-6
SLIDE 6

23/07/2014 1

http://supfam.org

?

A Hidden Markov model profile

PDB files SCOP domains, classification SUPERFAMILY profile HMMs SUPERFAMILY genome annotation

Example of an assignment

Structural assignments to genomes

  • ~5,000 genomes
  • ~125 million sequences
  • GO annotation of domains and sequences
  • Phylogenetic reconstruction
  • Comparative genomics/enrichment tools
slide-7
SLIDE 7

23/07/2014 2

Website Walkthrough

http://www.genome3d.eu/uniprot/id/Q01860/annotations

  • UniProt name
  • Annotations from resources
  • Structural overlay interface
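The walkthrough address follows a one-page-per-UniProt-sequence pattern, so the page for another accession can plausibly be built by swapping the accession into the same path. The sketch below assumes the pattern shown for Q01860 holds for other accessions; the helper name `annotation_url` is made up for this example and is not part of any Genome3D API.

```python
from urllib.parse import quote

GENOME3D_BASE = "http://www.genome3d.eu"  # taken from the slide


def annotation_url(uniprot_acc: str) -> str:
    """Build the annotations page URL for a UniProt accession.

    Assumes the path pattern shown in the walkthrough (/uniprot/id/<ACC>/annotations).
    """
    acc = uniprot_acc.strip().upper()
    if not acc.isalnum():
        raise ValueError(f"unexpected characters in accession: {uniprot_acc!r}")
    return f"{GENOME3D_BASE}/uniprot/id/{quote(acc)}/annotations"


print(annotation_url("Q01860"))
```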

slide-8
SLIDE 8

23/07/2014 1

Lawrence Kelley Michael Sternberg Chris Yates Stefans Mezulis Imperial College London

The Phyre2 Web server for protein structure prediction incorporating the SuSPect amino-acid variant phenotype predictor

  • “Normal” Mode
  • “Intensive” Mode
  • Advanced functions

How does Phyre2 work?

ARDLVIPMIYCGHGY Advanced homology modelling Using hidden Markov Model matching ARDLVIPMIYCGHGY HMM PSI-Blast Hidden Markov Model DB of KNOWN STRUCTURES HMM-HMM Matching (HHsearch, Soeding)

Phyre2 (normal mode)

ARDL--VIPMIYCGHGY AFDLCDLIPV--CGMAY

Sequence of known structure 3D-Model
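To make the alignment step concrete, the sketch below walks the two gapped strings from the slide and records which query residue is placed on which template residue; those pairs are what a modelling step would use to copy coordinates. The function names are illustrative only and this is not Phyre2 code.

```python
def aligned_pairs(query_aln: str, template_aln: str):
    """Return (query_index, template_index) for every column without a gap."""
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned strings must have equal length")
    pairs = []
    qi = ti = 0  # positions in the ungapped sequences
    for q, t in zip(query_aln, template_aln):
        if q != "-" and t != "-":
            pairs.append((qi, ti))
        if q != "-":
            qi += 1
        if t != "-":
            ti += 1
    return pairs


def fraction_identical(query_aln: str, template_aln: str) -> float:
    """Fraction of gap-free columns in which both residues are the same."""
    columns = [(q, t) for q, t in zip(query_aln, template_aln) if q != "-" and t != "-"]
    return sum(q == t for q, t in columns) / len(columns)


query = "ARDL--VIPMIYCGHGY"     # user sequence, as aligned on the slide
template = "AFDLCDLIPV--CGMAY"  # sequence of known structure
print(aligned_pairs(query, template))
print(round(fraction_identical(query, template), 3))
```

Query residues that fall opposite a template gap have no coordinates to inherit, which is one reason parts of a sequence can remain unmodelled.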

Example results

Top model info Secondary structure/disorder Domain analysis Detailed template information

  • “Normal” Mode
  • “Intensive” Mode
  • Advanced functions

How does Phyre2 work?

Shortcomings of ‘normal’ Mode

  • Because of local alignment or novel domain combinations, domains are often modelled separately
  • Regions with no detectable homology to known structure are left unmodelled
  • Does not use multiple templates which, when combined, could result in better coverage

Thus we need a system to fold a protein without templates and to combine templates when we have them

slide-9
SLIDE 9

23/07/2014 2

structure simplification

Protein backbone Small hydrophilic sidechain Large hydrophobic sidechain Backbone C-alpha

Poing – simplified folding model

Based on Levitt and Warshel

  • Query sequence (ARNDLSLDLVCS…….) → PSI-Blast → HMM
  • HMM-HMM matching against the Hidden Markov Model DB of KNOWN STRUCTURES
  • Extract pairwise distance constraints
  • POING: Synthesise from virtual ribosome. Springs for constraints. Ab initio modelling of missing regions.
  • FINAL MODEL

Phyre + Poing

  • “Normal” Mode
  • “Intensive” Mode
  • Advanced functions

How does Phyre2 work?

Advanced functions

  • PhyreAlarm – automatically re-run tricky sequences every week
  • BackPhyre – compare a structure to up to 30 genomes
  • One-To-One Threading – use a specific PDB entry for model building

SVYDAAAQLTADVKKD…….

PhyreAlarm

  • User sequence → HMM
  • Newly solved PDB structures are added WEEKLY → newly added structure HMMs
  • HMM-HMM matching of the user HMM against the newly added structure HMMs
  • Confident hit?
    – Yes: perform full Phyre modelling → new 3D model → email results
    – No: try again next week
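The PhyreAlarm flow is essentially a weekly decision: match the user's sequence against only the newly added structures and proceed to full modelling if any hit is confident. A minimal sketch of that decision follows. The `match` callable stands in for HMM-HMM matching, and the confidence threshold of 90 is a placeholder; neither the interface nor the cutoff comes from the slides.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Hit:
    template_id: str
    confidence: float  # assumed to be on a 0-100 scale


def weekly_check(
    sequence: str,
    new_template_ids: Iterable[str],
    match: Callable[[str, str], float],  # hypothetical stand-in for HMM-HMM matching
    threshold: float = 90.0,             # hypothetical cutoff for a "confident hit"
) -> Optional[Hit]:
    """Return the best confident hit among this week's new templates, else None."""
    best: Optional[Hit] = None
    for template_id in new_template_ids:
        confidence = match(sequence, template_id)
        if best is None or confidence > best.confidence:
            best = Hit(template_id, confidence)
    if best is not None and best.confidence >= threshold:
        return best  # caller would run full modelling and email the results
    return None      # caller would try again next week
```

A scheduler would call `weekly_check` once per stored sequence after each weekly template update and keep the sequence queued whenever it returns `None`.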

SVYDAAAQLTADVKKDLRDSW KVIGSDKKGNGVALMTTLFAD NQETIGYFKRLGNVSQGMAND KLRGHSITLMYALQNFIDQLD NPDSLDLVCS…….

BackPhyre

HMM Hidden Markov Model DB of Genomes HMM-HMM matching User structure

Rank   Hit    Confidence
1      Gi…
2      Gi..
3      Gi..
. . . .

Ranked list of genome hits

slide-10
SLIDE 10

23/07/2014 3

SVYDAAAQLTADVKK DLRDSWDLVCS…….

One to one threading

HMM of User structure HMM-HMM matching User structure

KLRGHSITLMYALQN NPDSLDLVCS…….

User sequence HMM of user sequence Final model

  • Model quality assessment
  • Location of functional sites
  • Effect of mutations on structure and function
  • Protein-protein Interface(s)

New: Phyre Investigator

  • Clashes
  • Rotamer outliers
  • Ramachandran outliers
  • ProQ2 model quality assessment
  • Alignment confidence (HHsearch)
  • Conservation/evolutionary trace (Jensen-Shannon divergence – far faster and just as accurate as ET)

  • Catalytic Site Atlas
  • Disorder
  • Pocket detection (Fpocket)
  • Protein interface residues (PI-Site, ProtinDB)
  • Conserved Domain Database ‘conserved features’ for NCBI-curated

domains

Phyre Investigator

  • Will a SNP affect my protein’s function?
  • New method: SuSPect
  • Recently developed by Chris Yates in our lab

  • Integrated into Phyre Investigator
  • Also standalone server

Effect of Mutations?


slide-11
SLIDE 11

23/07/2014 4

Sequence conservation

  • PSSM
  • Pfam domain
  • Jensen-Shannon entropy

Structural features

  • Predicted solvent accessibility

Network features

  • Protein-protein interaction (PPI) as domain centrality

[Figure labels: Domain, Conservation, Secondary structure, Solvent accessibility, Intrinsic disorder]

Interactome

SuSPect – Phenotypic effect of amino acid variants

[Figure: ROC curves (Sensitivity against 1 − Specificity, both axes 0.0 to 1.0) comparing SuSPect, Condel, PolyPhen-2, SIFT, MutationAssessor and FATHMM]

Specificity = TN / (TN + FP)    Sensitivity = TP / (TP + FN)

Benchmark consists of 20k SNPs (15k neutral, 5k pathogenic)
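Each point on a ROC curve like the one described here comes from choosing a score cutoff, counting the confusion-matrix cells, and plotting sensitivity against 1 − specificity. The sketch below uses the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)); the scores and labels in the usage line are invented toy values, not benchmark data.

```python
def roc_point(scores, is_pathogenic, threshold):
    """Return (1 - specificity, sensitivity) when scores >= threshold are called pathogenic."""
    tp = fp = tn = fn = 0
    for score, truth in zip(scores, is_pathogenic):
        predicted = score >= threshold
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif not predicted and truth:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return 1.0 - specificity, sensitivity


# Toy example with made-up scores on a 0-100 scale.
print(roc_point([90, 80, 30, 10], [True, False, True, False], threshold=50))
```

Sweeping the threshold across the score range and joining the resulting points traces out the full curve for one predictor.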

SuSPect – Results on non-training data (VariBench)

  • Arg 201 His in ATP-sensitive inward rectifier potassium channel 11 (Kir6.2)
  • SuSPect gives a score of 87/100 – high probability of being disease associated

Neonatal diabetes: Phyre2 yields a model which suggests a structural basis for disease

Arg 201 forms an H-bond with a main-chain O; His in the variant could not form a similar interaction

Most variants predicted to be disease associated

slide-12
SLIDE 12

23/07/2014 1

2735 domain structure superfamilies Orengo and Thornton (1993) ~280,000 domain structures in CATH HUP superfamily

Ian Sillitoe

Ian Sillitoe

CATHEDRAL: CATH's Existing Domain Recognition Algorithm

  • Rapid graph theory secondary structure filter
  • Double dynamic programming for accurate residue alignment

Redfern et al. PLOS Comp. Biol. (2009)

Fold Recognition Performance of CATHEDRAL

[Figure: % Correct Fold (axis ticks 0.8 to 1) against Rank (5 to 25) for CATHEDRAL, CE, DALI, LSQMAN, STRUCTAL and SSAP. Redfern et al. PLoS Comp. Biol. 2009]

CATHEDRAL server

Domain superfamily

  • shared topological core 3D-motif
  • sequence or functional similarity

: Domain structure annotations

slide-13
SLIDE 13

23/07/2014 2

: Domain structure annotations

  • Protein sequences from genomes and UniProt are scanned against HMMs for CATH and Pfam
  • > 26 million domain sequences assigned to CATH superfamilies
  • <100 superfamilies (<5%) account for 70% of domain sequences in CATH-Gene3D

Pantoate - β-alanine ligase

HUP domain superfamily

Arg-tRNA synthetase Asn synthetase B

>40,000 sequences, >250 GO molecular function terms

Phosphopantetheine adenylyltransferase Glycerol-3-phosphate cytidylyltransferase

CATH-Gene3D functional families - FunFams

Large, diverse superfamily

Lee et al. NAR (2009), Rentzsch et al. BMC Bioinformatics (2013), Radivojac et al. Nature Methods (2013)

Arg-tRNA synthetase Phosphopantetheine adenylyltransferase Pantoate - β-alanine ligase Asn synthetase B

FunFHMMer: Uses Specificity-Determining Positions (SDP) conserved specificity determining

Performance in Function Prediction

CAFA international assessment, July 2014

[Figures: Precision and Recall plots (both axes 0.1 to 1), one for MFO mode 1 and one for BPO mode 1]

slide-14
SLIDE 14

23/07/2014 3

FunFams are structurally and functionally coherent

Superfamily FunFams 3D models built for human using MODELLER algorithm (Sali Group)

FunFams are structurally and functionally coherent

residues in contact with protein partner active site

  • significant enrichment of Catalytic Site Atlas (CSA) residues in conserved residues – 2.2e-16. Dessailly et al. BBA (2013)
  • functional sites predicted in all FunFams using the scorecons algorithm (Valdar and Thornton)

CATH Superfamily: FunFams

slide-15
SLIDE 15

23/07/2014 4

CATH FunFam Pages

FunTree

Nick Furnham

School of Hygiene and Tropical Medicine

Janet Thornton, EBI MACiE

Data Collection

Structure and sequence alignments for enzyme families → Phylogenetic trees

Annotate with functional information, e.g. small molecule data (substrates, mechanisms etc.)

Data Processing

Ian Sillitoe

Ian Sillitoe

FunTree Page for the HUP Superfamily

6.3.2.1 2.7.7.1 Multidomain architecture (MDA)

substrates substrates mechanism

FunTree Page for the HUP Superfamily

slide-16
SLIDE 16

23/07/2014 1

A Brief Overview of the PSIPRED Workbench

David Jones UCL Depts. of Computer Science and Structural and Molecular Biology

The PSIPRED Workbench

  • First available to the public in 1998
  • Originally offered secondary structure prediction (PSIPRED), fold recognition (GenTHREADER) and transmembrane topology prediction (MEMSAT)
  • Now covers a range of applications, including protein function prediction and disorder prediction
  • Most tools are available for download or can be run on the cloud via HADOOP

Scalable web services for the PSIPRED Protein Analysis Workbench DWA Buchan, F Minneci, TCO Nugent, K Bryson, DT Jones (2013) Nucleic acids research 41 (W1), W349-W357

pDomTHREADER

  • A. Lobley, M.I. Sadowski, & D.T. Jones, Bioinformatics (2009) 25:1761-1767
  • Highly sensitive domain-based fold recognition method
  • Outperforms HHPred and PRC on domain superfamily recognition
  • Adaptation of pGenTHREADER profile-profile comparison algorithm for local matches
  • Part of PSIPRED Protein Sequence Analysis Workbench
  • Profiles:
  • Residue propensities based on structural alignments, secondary structure, potentials of mean force
  • Updated weekly
  • Tightly integrated with CATH domain releases

pDomTHREADER Domain Superfamily Recognition Performance

Structure-based alignments

FFPred 2.0

FFPred 2.0: Improved Homology-Independent Prediction of Gene Ontology Terms for Eukaryotic Protein Sequences F Minneci, D Piovesan, D Cozzetto, DT Jones (2013) PLOS ONE 8 (5), e63754

Functional implications of protein disorder

  • At the cellular level, IDPs and IDRs have been linked to the organization and re-wiring of protein-protein interaction networks, and to increased proteome diversity through alternative splicing across tissues and organisms.
  • At the protein level, IDR length and position correlate with their biological roles.
  • Based on this observation, FFPred assigns GO terms to gene products with limited or no sequence similarity to experimentally characterized proteins.

GO:0016563 transcription activator activity

N S1 S2 S3 S4 S5 S6 S7 S8 C

Individual IDRs act as flexible linkers or contain binding sites (for proteins, nucleotides, lipids or metal ions) that usually fold upon binding

slide-17
SLIDE 17

23/07/2014 2

DISOPRED3

DISOPRED3 extends its predecessor with the aim of improving prediction accuracy, especially for long IDRs. Additional modules include:

  • a Neural Network trained on a much larger PDB + Disprot dataset;
  • a profile-based nearest neighbour method against PDB + Disprot;
  • a Neural Network to integrate the results of the component methods.

Input Sequence

DISOPRED (ANN)

Nearest Neighb.

DISOPRED2 (SVM)

PSSM Feature Vectors Consensus ANN

Residue level predictions

DISOPRED3 vs DISOPRED2

  • Based on CASP data, we found DISOPRED3 to be more specific than DISOPRED2, and more sensitive to long IDRs and to IDRs far from the N- and C-termini
  • DISOPRED3 was ranked at or near the top across a range of test conditions and evaluation measures by the CASP9 and CASP10 assessment teams

CASP10 assessment results

  • The best methods attain 70% accuracy
  • Internal IDRs are harder to detect than terminal ones
  • IDRs with 40 or more residues remain challenging, but larger test sets are needed

Prediction of disordered binding regions

  • Short peptides bound to globular domains and likely to be unfolded in isolation, based on the analysis of interface and accessible surface areas
  • Unbound protein domain linkers as annotated in the CATH database

… D F D K D D D G D G D A D F

D P

D E D H D V D A O Q O Y O P …

Length of the segment PSSM data Amino acid composition Position along the sequence PB SVM Protein binding Non protein binding

15 aas

Benchmarking

Method                 TP     FP     FN    Recall   Precision   F1
DISOPRED3              78     324    191   0.290    0.194       0.232
Naïve                  134    1704   135   0.498    0.073       0.127
DISOPRED3 no PB SVM    81     1681   188   0.301    0.046       0.080
MFSPSSMpred            18     306    251   0.067    0.056       0.061
MoRFpred               16     291    253   0.059    0.052       0.056
ANCHOR                 21     1481   248   0.078    0.014       0.024

  • Test set: 9 protein chains from DisProt with less than 30% identity to DISOPRED3 training data, containing 14 IDRs of length between 5 and 50 residues and annotated as protein binding
  • Evaluation measures: precision and recall, as most negatives are expected to be structured residues
  • Naïve predictions: random labelling of input amino acids as either disordered protein binding or not with equal probability
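The recall, precision and F1 columns of the benchmark follow directly from the TP, FP and FN counts, so they can be recomputed as a quick consistency check. The sketch below uses the standard definitions and the counts quoted in the benchmark table; the function name is illustrative.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Return (recall, precision, F1) from raw counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1


# (TP, FP, FN) per method, copied from the benchmark table.
COUNTS = {
    "DISOPRED3": (78, 324, 191),
    "Naive": (134, 1704, 135),
    "DISOPRED3 no PB SVM": (81, 1681, 188),
    "MFSPSSMpred": (18, 306, 251),
    "MoRFpred": (16, 291, 253),
    "ANCHOR": (21, 1481, 248),
}

for method, (tp, fp, fn) in COUNTS.items():
    recall, precision, f1 = precision_recall_f1(tp, fp, fn)
    print(f"{method:<20} {recall:.3f} {precision:.3f} {f1:.3f}")
```

True negatives do not enter any of the three measures, which matches the stated reason for choosing precision and recall when most negatives are structured residues.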

MEMSAT-SVM / MEMPACK (2009 + 2012)

  • Replaces neural network with support vector machine classifiers
  • Trained using 131 sequences which all have crystal structures available
  • Can additionally identify re-entrant regions
  • Dynamic programming algorithm
  • Pores and pore stoichiometry

Nugent T and Jones DT. (2009) Transmembrane protein topology prediction using support vector machines. BMC Bioinformatics, 10:159

Nugent T and Jones DT. (2012) Detecting pore-lining regions in transmembrane protein sequences. BMC Bioinformatics, 13:169

Nugent T and Jones DT. (2013) Membrane protein orientation and refinement using a knowledge-based statistical potential. BMC Bioinformatics, 14:276

slide-18
SLIDE 18

23/07/2014 3

Window of 15 residues

Inside Transmembrane Outside Signal

Predicting the location of this residue

MLSPQAMSDFHKELKWLLCNIPGQKLASLANREYT… MLTGNAMTDFHRDLKYLLCQVPGQRLASLSNRDFT… MVTPQSISDFHREVKWLVCNIPGQKLANAANREYS… MLSPQAMSDFHRELKYLVCNIP-QKLASLANRNYT…

Multiple sequence alignment (sequence profile)
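The schema classifies each residue from a 15-residue window centred on it. As a minimal sketch of that windowing step, the code below cuts one window per residue from a plain sequence and pads the ends. The real input is a sequence profile rather than raw letters, and the padding character is an assumption made for this example.

```python
def residue_windows(sequence: str, width: int = 15, pad: str = "-"):
    """Return one window of `width` residues centred on each residue of `sequence`."""
    if width % 2 == 0:
        raise ValueError("width must be odd so the window has a central residue")
    half = width // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + width] for i in range(len(sequence))]


seq = "MLSPQAMSDFHKELKWLLCNIPGQKLASLANREYT"  # first sequence shown on the slide
windows = residue_windows(seq)
print(windows[0])   # window for the first residue, padded on the left
print(windows[7])   # window whose central residue is seq[7]
```

Each window would then be turned into a feature vector (for example, profile columns) and classified as inside, transmembrane, outside or signal for its central residue.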

Outside Inside

Signal Peptide

MEMSAT3/MEMSAT-SVM Schema

Re-entrant Pore

Overall prediction accuracy

Crystal structure data sets are predominantly composed of prokaryotic structures

Tested using the Möller data set: 79% accuracy

Tested using the TOPDB data set: 67% accuracy . . .

Topology prediction performance

PSICOV Workbench (coming soon)

  • PSICOV2, NNPSICOV – Residue contact prediction tools
  • FILM3 – De novo Transmembrane Protein Modelling
  • ContactTHREADER – Fold recognition and modelling
  • PPI-PSICOV

FFPred 2.0 – Input via PSIPRED server

Acknowledgements

  • Domenico Cozzetto
  • Federico Minneci
  • Tim Nugent
  • Dan Buchan

FFPred 2.0 – Input via PSIPRED server

slide-19
SLIDE 19

23/07/2014 4

5-HT1B receptor (Homo sapiens)

5-HT1B receptor in complex with agonist ergotamine (PDB entry with ID 4iar); from Wikipedia, retrieved 25/06/2014

Serotonin receptor 1B:

  • serotonin receptor subtype
  • well researched GPCR, with 7 transmembrane helices
  • part of serotonin pathways, widely distributed across the central nervous system
  • therefore, known signal transduction activity, GPCR activity, ...

5-HT1B receptor – FFPred predictions (I)

  • Select “FFPred” tab in results page
  • GO ontology
  • Symbol and name for predicted GO term annotations
  • ...for each category of SVM reliability, annotations are then ranked according to the posterior probability of the prediction being correct
  • Predictions associated with less reliable SVMs are listed at the bottom, on a red background...
  • High or low reliability of the SVM that was trained to predict the corresponding GO term

5-HT1B receptor – FFPred predictions (II)

  • Complete GO term predictions for the 5-HT1B receptor
  • Expected GO terms for this well characterised protein are predicted with high probability
  • More predictions may give indications for further experimental investigation
  • Low-reliability predictions are meant to be further, relatively less safe suggestions

5-HT1B receptor – FFPred predictions (III)

  • Used protein features can be found below the GO term predictions
  • Features like secondary structure (PSIPRED), disorder (DISOPRED), PEST regions and post-translational modifications can be read on the summary diagram or on the detailed amino acid map
  • Transmembrane topology (MEMSAT-SVM)
  • Amino acid composition and physico-chemical properties

Human SNW domain-containing protein 1

SNW1 (UniProt AC Q13573) is a eukaryotic multifunctional protein involved in transcription initiation and repression, mRNA splicing, and the cell cycle by interacting with many partners. The sequence contains a high proportion of polar and charged amino acids, typical of disordered proteins. NMR studies on its role in spliceosome maturation show that the N-terminal 172 positions are disordered and that residues 59 to 79 fold upon binding the PPIL1 protein.


slide-20
SLIDE 20

23/07/2014 5

DISOPRED predictions for SNW1

  • Predictions of both disordered regions and of protein binding residues within them are colour-coded within the input sequence
  • 65% of the 172 N-terminal disordered residues are correctly classified
  • Other predictions agree with common assumptions and consensus data in MobiDB and D2P2
  • The disordered protein binding region from position 59 to 79 is predicted with 38% precision

slide-21
SLIDE 21

23/07/2014 1

Genome3D Workshop

Ian Sillitoe, Tony Lewis UCL

Genome3D

Why Structure?

Enzyme Active Sites Protein-Protein Interfaces

Crystal structure of the Anopheles gambiae 3-hydroxykynurenine transaminase. Rossi et al. PNAS

What is Genome3D?

  • Genome3D provides…
    – consensus structural annotations
    – consensus 3D models
  • …by identifying similarities to known protein domains
    – libraries of structural domains from CATH/SCOP
    – domains are classified into “superfamilies”

Domain Structure Classification Domain Structure Annotation SCOP CATH Consensus

SUPERFAMILY Gene3D PHYRE pDomTHREADER FUGUE

KEGG terms GO terms

...

Other ...

Associated functional information from > 10 public sources

Genome3D Workflow

Identify Sequences

  • 10 model genomes (Uniprot)
  • Pfam representatives

Groups submit Annotations

  • Identify similarity to “template” domains from CATH / SCOP
  • Predict domain boundaries and/or 3D structure

Collate Predictions

  • Map CATH / SCOP superfamilies
  • Present data on website (one page per UniProt sequence)

Genome3D Consortium

Resource        Principal Investigator   Prediction Type   Classification Source   University
DomSerf         Jones                    Models            CATH                    UCL
FUGUE           Blundell                 Annotations       CATH / SCOP             Cambridge
Gene3D          Orengo                   Annotations       CATH                    UCL
pDomTHREADER    Jones                    Annotations       CATH                    UCL
Phyre2          Sternberg / Kelley       Both              SCOP / PDB              Imperial
SUPERFAMILY     Gough                    Both              SCOP                    Bristol
VIVACE          Blundell                 Models            CATH / SCOP             Cambridge

Domain Classification   Principal Investigator   University
CATH                    Orengo                   UCL
SCOP                    Murzin                   Cambridge

slide-22
SLIDE 22

23/07/2014 2

CATH / SCOP Mapping

Domains, Superfamilies, Medals

Structural Domains

[Diagram: domains drawn along chains running from N to C terminus]

CATH / SCOP Mapping

  • Find pairs of overlapping SCOP/CATH domains:
    – # residues in CATH domain
    – # residues in SCOP domain
    – # residues in common
  • Initially scan and store every overlapping pair (even trivial overlaps)
  • Later, reject on the basis of overlap (of smallest domain)
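The per-pair quantities listed for the mapping (residues in the CATH domain, residues in the SCOP domain, residues in common) and the later filter on overlap of the smallest domain can be sketched with residue sets. This is an illustration under assumptions: residues are represented as plain integer positions on the same chain, and the names are invented; no cutoff values are given on the slides, so none are hard-coded here.

```python
def overlap_stats(cath_residues: set, scop_residues: set) -> dict:
    """Summarise the overlap between one CATH domain and one SCOP domain."""
    common = cath_residues & scop_residues
    smallest = min(len(cath_residues), len(scop_residues))
    return {
        "n_cath": len(cath_residues),
        "n_scop": len(scop_residues),
        "n_common": len(common),
        # Fraction of the smaller domain that is covered by the other one.
        "overlap_of_smallest": len(common) / smallest if smallest else 0.0,
    }


# Toy example: a CATH domain spanning residues 1-100 and a SCOP domain spanning 51-120.
stats = overlap_stats(set(range(1, 101)), set(range(51, 121)))
print(stats)
```

Storing these counts for every overlapping pair first, and filtering on `overlap_of_smallest` afterwards, lets the rejection cutoff be changed without rescanning the domains.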

CATH / SCOP Mapping

  • Find related pairs of SCOP/CATH superfamilies
  • Map between
    – SCOP v1.75 (1962 superfamilies)
    – CATH v3.5.0 (2626 superfamilies)
  • Assess how well pairs of superfamilies overlap
    – Gold: 763 pairs
    – Silver: 134 pairs
    – Bronze: 532 pairs

slide-23
SLIDE 23

23/07/2014 3

SCOP/CATH Mapping Genome3D: Dataset

Genomes, Pfam, Coverage

Genome3D: Dataset

  • 10 model genomes

    – Human (Homo sapiens)
    – E. coli (Escherichia coli)
    – Baker’s yeast (Saccharomyces cerevisiae)
    – Mouse (Mus musculus)
    – Mouse-ear cress (Arabidopsis thaliana)
    – Fruit fly (Drosophila melanogaster)
    – Nematode (Caenorhabditis elegans)
    – Malaria parasite (Plasmodium falciparum)
    – Bacterium (Staphylococcus aureus)
    – Fission yeast (Schizosaccharomyces pombe)

Genome3D: Dataset

  • Pfam has > 80% coverage of UniProt

    – Exploit Pfam’s coverage
    – Reuse existing pipeline
    – Worked with Pfam to select representative sequences

  • Pfam - the 11th Genome

Genome3D: Coverage

[Figure: Predicted Domain Annotations (per UniProt sequence). Axis: number of UniProt sequences (ticks 5,000 to 35,000); series: sequences annotated by 6, 5, 4, 3, 2, 1 or no groups.]

[Figure: Predicted 3D Models (per UniProt sequence). Axis: number of UniProt sequences (ticks 5,000 to 35,000); series: sequences modelled by 4, 3, 2, 1 or no groups.]

slide-24
SLIDE 24

23/07/2014 4

Genome3D: Web

www.genome3d.eu

Search UniProt: Overview UniProt: Annotations Predicted Domains

slide-25
SLIDE 25

23/07/2014 5

Linking CATH / SCOP

Predicted 3D Structures

P56537 (around 75% sequence id)

slide-26
SLIDE 26

23/07/2014 6

P56537 (around 75% sequence id)

A1L167 (around 52% sequence id)

slide-27
SLIDE 27

23/07/2014 7

A1L167 (around 52% sequence id)

P0C617 (around 20% sequence id)

slide-28
SLIDE 28

23/07/2014 8

P0C617 (around 20% sequence id)

Tutorial after Coffee Break In Foster Court B29

slide-29
SLIDE 29

23/07/2014 9