Bioinformatics Tools for Analyzing Enzyme Families, Greg Butler

Dec 09, 2022

Bioinformatics Tools for Analyzing Enzyme Families Greg Butler Department of Computer Science and Software Engineering Centre for Structural and Functional Genomics Concordia University, Montreal www.cs.concordia.ca/~gregb

SLIDE 1

Bioinformatics Tools for Analyzing Enzyme Families

Greg Butler Department of Computer Science and Software Engineering Centre for Structural and Functional Genomics Concordia University, Montreal www.cs.concordia.ca/~gregb gregb@cs.concordia.ca

SLIDE 2

Outline Overview of Tools PipeAlign Panther FlowerPower Case Study on Cellulases

SLIDE 3

Overview of Tools: Problems Addressed
Multiple Sequence Alignment (MSA)
Problem: Given a set of protein sequences and an objective function, determine the optimal alignment of the sequences.
Subproblems: Selecting Homologues to Align; Choice of Objective Function; High-Quality MSA; Removing Outlier Sequences
Phylogenetic Tree Construction
Problem: Given a set of protein sequences and their pairwise distances (using some distance metric), construct a phylogenetic tree.
Classifier for Family (usually a Hidden Markov Model, HMM)
Problem: Given a family of protein sequences and a multiple sequence alignment (MSA) for the family, construct a classifier which, given a protein sequence, can determine whether or not the protein is a member of the family.
Split Family into Subfamilies
Problem: Given a family of protein sequences, and a multiple sequence alignment (MSA) for the family, and ..., determine a clustering of the sequences in the family into subfamilies.
Consistency of MSA, Tree, Classifiers for Family and Subfamilies
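The tree-construction problem above can be illustrated with a minimal sketch. The slides do not name a tree-building method, so UPGMA (average-linkage clustering) is used here purely as an assumption; the function name `upgma` and the distance-dictionary layout are hypothetical.

```python
from itertools import combinations

def upgma(names, dist):
    """Build a tree (nested tuples) by repeatedly merging the two closest clusters.

    names: list of sequence labels
    dist:  dict mapping frozenset({a, b}) -> distance between labels a and b
    """
    sizes = {name: 1 for name in names}   # cluster -> number of sequences in it
    d = dict(dist)                        # working copy, extended as clusters merge
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        na, nb = sizes.pop(a), sizes.pop(b)
        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances.
        for c in sizes:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))

toy_dist = {
    frozenset(("A", "B")): 1.0,
    frozenset(("A", "C")): 4.0,
    frozenset(("B", "C")): 4.0,
}
toy_tree = upgma(["A", "B", "C"], toy_dist)
print(toy_tree)  # A and B are joined first, then C is attached
```

UPGMA assumes roughly equal evolutionary rates across lineages, which is one reason other distance methods may be preferred in practice.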

SLIDE 4

Overview of Tools
PipeAlign
Given a seed sequence, constructs MSA for family and subfamilies.
- (optional) include candidate family members
http://igbmc.u-strasbg.fr/PipeAlign/
Panther
DB of sequences, MSAs, trees, and HMM classifiers for protein families and subfamilies, semi-automatically, for human, mouse, ...
Given query protein sequence, classifiers determine family and subfamily
- can download all HMM classifiers
https://panther.appliedbiosystems.com/
FlowerPower
Given seed protein sequence, determines the family and subfamily, their MSAs, trees, and HMM classifiers.
- like improved PSI-BLAST against UniProt for MSAs
- postprocess MSAs using BÊTE for trees, GTREE for display
http://phylogenomics.berkeley.edu/cgi-bin/flowerpower/input flowerpower.py

SLIDE 5

What is an Enzyme? An enzyme is a protein that catalyses a reaction.

SLIDE 6

What is an Enzyme? Enzymes are very specific. Enzymes are very efficient catalysts.

SLIDE 7

Enzyme Families and (Some) Classification Schemes
Aim: To classify and organize enzymes.
EC (Enzyme Commission) numbers
“To consider the classification and nomenclature of enzymes and coenzymes, their units of activity and standard methods of assay, together with the symbols used in the description of enzyme kinetics.”
GO (Gene Ontology)
three classifications of gene products
- molecular function
- biological process
- cellular component
CATH: Class, Architecture, Topology, Homology
“There is no objective definition. A family is clearly related by sequence similarity, a superfamily is composed of families whose sequence relationship isn’t clear, but which are believed on structural and functional grounds to be homologous, and a fold is a group of superfamilies that share a common structural topology but are not necessarily homologous.”
InterPro
combination of many classification schemes

SLIDE 8

Gene Ontology — Entry

SLIDE 9

InterPro

SLIDE 10

Multiple Sequence Alignment (MSA) Problem: Given a set of protein sequences, and an objective function, determine the optimal alignment of the sequences. Why? Amino acid sequence determines protein structure determines enzyme function
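To make "objective function" concrete, here is a minimal sketch of one common choice, the sum-of-pairs score. The slides do not say which objective function any of the tools use, so both the function and the match/mismatch/gap values below are illustrative assumptions.

```python
from itertools import combinations

def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Score an alignment by summing pairwise scores over every column.

    alignment: list of equal-length strings, with '-' marking a gap.
    The scoring values are placeholders, not a real substitution matrix.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue              # gap against gap contributes nothing
            elif x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total

toy_score = sum_of_pairs(["AC-T", "ACGT", "A-GT"])
print(toy_score)
```

An MSA program searches for the arrangement of gaps that maximizes a score of this kind; real tools replace the flat match/mismatch values with a substitution matrix and position-dependent gap penalties.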

SLIDE 11

MSA Issues
Multiple sequence alignment is a complicated task
- choice of the sequences
- choice of an objective function
- the optimization of the objective function
Issues
- math vs biology (optimal MSA not necessarily “good” MSA for biologist)
- outliers affect results
- divergence can affect choice of parameters/algorithms
- multi-domain sequences are problems
- many sequences, long sequences costly
Ideal
- align closely related sequences
- trim so only one domain present
- feed in lots of constraints, e.g. structural information

SLIDE 12

Approaches to MSA
Progressive
- sequences are added one by one to the multiple alignment according to a precomputed order
Iterative
- iteratively modify a sub-optimal solution
Stochastic iterative
- randomly modify
- result is either kept or discarded dependent on an acceptance function
- convergence via more stringent acceptance function
Consistency-based
“given a set of independent observations, the most consistent are often closer to the truth”
- optimal MSA is one that agrees the most with all the possible optimal pair-wise alignments
Constraint-based
- use prior information as constraints on the alignment
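The acceptance function in the stochastic iterative approach can be sketched as follows. The slides do not specify one; the Metropolis criterion from simulated annealing is assumed here, and the names `accept`, `delta` and `temperature` are hypothetical.

```python
import math
import random

def accept(delta, temperature, rng=random):
    """Decide whether to keep a randomly modified alignment.

    delta:       score(new alignment) - score(current alignment)
    temperature: positive number; lower values make acceptance more stringent
    """
    if delta >= 0:
        return True                   # improvements are always kept
    # A worse alignment is kept with probability exp(delta / temperature).
    return rng.random() < math.exp(delta / temperature)

print(accept(0.5, 1.0))     # an improvement is kept
print(accept(-1e6, 1.0))    # a drastically worse alignment is discarded
```

Lowering `temperature` over successive rounds is one way to obtain the "more stringent acceptance function" that drives convergence.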

SLIDE 13

SLIDE 14

Recent MSA Algorithms and Systems Partial Order Alignment (POA) Progressive POA MUSCLE and SATCHMO Indonesia system (Uppsala) using structural constraints DIALIGN with User Constraints

SLIDE 15

Splitting Families into Subfamilies Problem: Given the sequences for a family of enzymes, determine how to delineate cohesive subfamilies. Why?: more homologous means easier to study — easier to build better alignments — easier to build better classifiers Subproblem: remove outliers from the set of sequences
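A minimal sketch of the subfamily-splitting idea, assuming pairwise distances are already available. Single-linkage clustering with a fixed threshold is used only for illustration; it is not the method of Secator, DPC, or any other tool named in these slides, and the function name and threshold are hypothetical.

```python
def split_into_subfamilies(names, dist, threshold):
    """Group sequences that are linked by chains of distances <= threshold.

    names: list of sequence labels
    dist:  dict mapping (a, b) tuples -> distance between labels a and b
    Returns a list of groups, largest first; singletons are candidate outliers.
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in dist.items():
        if d <= threshold:
            parent[find(a)] = find(b)     # merge the two groups

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values(), key=len, reverse=True)

toy_dist = {
    ("s1", "s2"): 0.10, ("s2", "s3"): 0.20, ("s1", "s3"): 0.30,
    ("s1", "s4"): 0.90, ("s2", "s4"): 0.80, ("s3", "s4"): 0.95,
}
toy_groups = split_into_subfamilies(["s1", "s2", "s3", "s4"], toy_dist, 0.25)
print(toy_groups)
```

Here `s4` ends up alone, which is how an outlier would show up under this simple scheme.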

SLIDE 16

Building Classifiers for Enzyme Families
Problem: Given the sequences for a family of enzymes, determine how to decide membership in the family.
“In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment, but it can be identified by the occurrence in its sequence of a particular cluster of residue types which is variously known as a pattern, motif, signature, or fingerprint.”
A profile or weight matrix is a table of position-specific amino acid weights and gap costs.
A domain is a conserved protein region.
- “independently folding structural unit”
A fingerprint is a group of conserved motifs used to characterise a protein family.
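The profile (weight matrix) definition can be sketched as below. This is a simplified illustration, assuming an ungapped alignment, a uniform background distribution and add-one pseudocounts; it omits the gap costs mentioned in the definition, and the function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(aligned, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Return one {residue: log-odds weight} table per alignment column."""
    background = 1.0 / len(alphabet)      # assumed uniform background frequency
    profile = []
    for column in zip(*aligned):
        residues = [r for r in column if r in alphabet]   # gaps are skipped
        total = len(residues) + pseudocount * len(alphabet)
        profile.append({
            aa: math.log(((residues.count(aa) + pseudocount) / total) / background)
            for aa in alphabet
        })
    return profile

def score(profile, segment):
    """Sum the position-specific weights of a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))

toy_profile = build_profile(["ACD", "ACE", "ACD"])
print(score(toy_profile, "ACD") > score(toy_profile, "WWW"))
```

A segment that matches the conserved columns scores higher than an unrelated one; a membership classifier would compare such a score against a threshold. Profile HMMs extend this idea with explicit insert and delete states.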

SLIDE 17

Panther System from Celera

“The PANTHER database was designed for high-throughput analysis of protein sequences. One of the key features is a simplified ontology of protein function, which allows browsing of the database by biological functions. Biologist curators have associated the ontology terms with groups of protein sequences rather than individual sequences. Statistical models (Hidden Markov Models, or HMMs) are built from each of these groups. The advantage of this approach is that new sequences can be automatically classified as they become available. To ensure accurate functional classification, HMMs are constructed not only for families, but also for functionally distinct subfamilies. Multiple sequence alignments and phylogenetic trees, including curator-assigned information, are available for each family. The current version of the PANTHER database includes training sequences from all organisms in the GenBank non-redundant protein database, and the HMMs have been used to classify gene products across the entire genomes of human, and Drosophila melanogaster.”

SLIDE 18

Panther System from Celera

SLIDE 19

Panther System from Celera

SLIDE 20

Panther System from Celera

SLIDE 21

PipeAlign System from Strasbourg

SLIDE 22

PipeAlign Tools
Ballast
- getting a better set of homologues
- conservation profile
DbClustal
- combined local and global alignment using anchors
NorMD
- a reliable objective function
- normalized Mean Distance scores
RASCAL and LEON
- detection and correction of alignment errors
- removal of outliers
- realignment of blocks and inter-block regions
Secator and DPC
- split into subfamilies

SLIDE 23

Inferring Paralogs and Orthologs: Basic Idea
Compare phylogenetic tree of sequences with species tree
Systems: RIO, Orthostrapper, BÊTE

SLIDE 24

RIO: Resampled Inference of Orthologs Problem with Basic Idea: Inaccuracy of Phylogenetic Tree Solution: Bootstrap resampling of tree gives probability of phylogenetic tree indicating ortholog New concepts from RIO: super-ortholog, ultra-paralog, subtree neighbour Still problem of inferring function (even) from orthologs!
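The bootstrap idea can be sketched as follows: infer orthology separately on each resampled tree and report the fraction of trees that support it. As a simplifying assumption, the sketch decides speciation versus duplication with a species-overlap test at the last common ancestor, not with the comparison against a species tree described in these slides. Trees are nested tuples of gene names, and every name below is hypothetical.

```python
def leaves(node):
    """Return the set of gene names under a node (a string or a tuple of nodes)."""
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(child) for child in node))

def is_ortholog_pair(tree, a, b, species_of):
    """True if the last common ancestor of genes a and b looks like a speciation.

    Assumes a != b and that both genes occur in the tree.
    """
    node = tree
    while True:
        # Descend while a single child still contains both genes.
        for child in node:
            members = leaves(child)
            if a in members and b in members:
                node = child
                break
        else:
            break                     # node is now the last common ancestor
    side_a = next(child for child in node if a in leaves(child))
    side_b = next(child for child in node if b in leaves(child))
    species_a = {species_of[gene] for gene in leaves(side_a)}
    species_b = {species_of[gene] for gene in leaves(side_b)}
    # Disjoint species sets suggest speciation; shared species suggest duplication.
    return species_a.isdisjoint(species_b)

def ortholog_support(bootstrap_trees, a, b, species_of):
    """Fraction of resampled trees in which a and b are inferred orthologs."""
    hits = sum(is_ortholog_pair(t, a, b, species_of) for t in bootstrap_trees)
    return hits / len(bootstrap_trees)

species_of = {"h1": "human", "m1": "mouse", "h2": "human", "m2": "mouse"}
t1 = (("h1", "m1"), ("h2", "m2"))
t2 = (("h1", "h2"), ("m1", "m2"))
t3 = (("h1", "m2"), ("m1", "h2"))
support = ortholog_support([t1, t1, t2, t3], "h1", "m1", species_of)
print(support)
```

A support value near 1 means the orthology call is stable across resampled trees; a low value flags a call that depends on an uncertain part of the tree.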

SLIDE 25

Cellulase Case Study
Dataset of biochemically characterized cellulases for Kwang-Bo Joung:
27 endoglucosidases (egl) EC 3.2.1.4
23 cellobiohydrolases (cbh) EC 3.2.1.91
28 beta-glucosidases (bgl) EC 3.2.1.21
Characterized into 92 families of Glycosyl Hydrolases (GH)
Kwang-Bo Joung aligned domains using ClustalW and combined them into a tree. Used Prosite patterns to clarify subfamily membership.
Noted misclassifications:
- P46236, GH Family 6, in SP as egl; literature says “cellulase” (i.e. egl or cbh); should be cbh
- P37698, GH Family 48, should be cbh not egl
Kwang-Bo Joung’s classification (see tree):
egl-A of size 4, 2 in GH Family 45, 2 in GH Family 7 (has cbh)
egl-B of size 1 in GH Family 6 (has cbh)
egl-C of size 2 in GH Family 9 (has cbh)
egl-D of size 5 in GH Family 12 and 8
egl-E of size 15 in GH Family 5
cbh-A of size 14 in GH Family 7 (has egl)
cbh-B of size 5 in GH Family 6 (has egl)
cbh-C of size 2 in GH Family 9 (has egl)
cbh-D of size 3 in GH Family 48
bgl-A of size 13 in GH Family 3
bgl-B of size 15 in GH Family 1
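As a quick consistency check, the group sizes in the classification can be tallied against the dataset counts. The numbers are copied from this slide; the variable names are hypothetical.

```python
# Group sizes from Kwang-Bo Joung's classification, as listed on the slide.
groups = {
    "egl": {"egl-A": 4, "egl-B": 1, "egl-C": 2, "egl-D": 5, "egl-E": 15},
    "cbh": {"cbh-A": 14, "cbh-B": 5, "cbh-C": 2, "cbh-D": 3},
    "bgl": {"bgl-A": 13, "bgl-B": 15},
}
# Dataset counts stated at the top of the slide.
dataset = {"egl": 27, "cbh": 23, "bgl": 28}

totals = {enzyme: sum(sizes.values()) for enzyme, sizes in groups.items()}
for enzyme, total in totals.items():
    print(enzyme, "groups:", total, "dataset:", dataset[enzyme])
```

The egl and bgl groups add up to the dataset counts, while the cbh groups add up to 24 against 23 listed. The slide does not explain the difference.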

SLIDE 26

Cellulase Case Study

SLIDE 27

Cellulase Case Study
Dataset of biochemically characterized cellulases for Kwang-Bo Joung:
27 endoglucosidases (egl)
23 cellobiohydrolases (cbh)
28 beta-glucosidases (bgl)
Automatic analysis by PipeAlign
Input: One of the datasets, first sequence as the seed
Parameters allowed PipeAlign to search UniProt and add up to 200 homologues
Results:
egl analysis confirmed several subfamilies in egl-E
- but not P17974 nor Q12624
- and not egl-A to egl-D
cbh analysis confirmed cbh-A
- but not cbh-B to cbh-E
bgl analysis confirmed half of bgl-A
bgl analysis confirmed bgl-B (without P40740)

SLIDE 28

Case Study — Server

SLIDE 29

Case Study — Server

SLIDE 30

Case Study — Server

SLIDE 31

Case Study — Report for endoglycosidases

SLIDE 32

Case Study — Report for endoglycosidases

SLIDE 33

Case Study — Report for endoglycosidases

SLIDE 34

Case Study — GTREE Display for endoglycosidases

SLIDE 35

Case Study: FlowerPower vs PipeAlign
PipeAlign as above on cellobiohydrolases
First sequence as seed, all cbh sequences provided, up to 200 from UniProt
Three families
- cellobiohydrolases (cbh-A) family of size 92
- two families of egl’s (???) of size 45 and 43
FlowerPower with first cellobiohydrolase sequence as seed
Two families
- cellobiohydrolases of size 52
- egl-A of size 4

SLIDE 36

Case Study: FlowerPower — GTREE Display for cellobiohydrolases

SLIDE 37

Thank You.

Questions?