Five hierarchical levels of sequence-structure correlations in proteins Chris Bystroff Rensselaer Polytechnic Institute Troy, New York, USA

What does structure prediction tell us about the physics of folding? Check one: A. If we can predict protein structures, then we know how proteins fold. B. If we know how proteins fold, then we can predict protein structures.

Two ways to predict protein structure... Database search (statistics): query sequence --> best alignment. Folding simulation (physics): query sequence --> lowest energy.

...two very different underlying principles. Darwin: Proteins with a common ancestor have the same fold (query sequence --> best alignment; millions of years). Boltzmann: Proteins adopt the minimum free energy conformation (query sequence --> lowest energy; microseconds to seconds).

Darwin versus Boltzmann. Do hybrid models make sense? Methods, from Darwin to Boltzmann: BLAST (sequence similarity), threading (global structure similarity), Rosetta (knowledge-based physics), AMBER (physics).

We know proteins fold via pathways: local structure first, eliminating alternate pathways, then global structure.

Proteins can fold because they don't have to search all of conformational space.

We know that proteins have a hierarchy of structural similarity... Class conserves 2° content. Architecture conserves packing of 2°. Topology* conserves chain connectivity. *Fold recognition algorithms work at this level. Image borrowed from CATH database.

Can we use the database to make models for folding pathways?
Steps along the folding pathway (early to late) | Steps in data mining
(1) initiation | local motifs
(2) propagation | extended local motifs
(3) condensation | pairs of motifs
(4) molten globule | multiple motifs
(5) native state | aligned multiple motifs

Hierarchical level 1: Folding initiation site motifs. [Figure: non-homologous sequences aligned on the recurrent sequence MQTIFF.] Is it a recurrent structure?

Sampling bias creates problems for motif mining. First we must "factor out" inheritance.

Removing database redundancy. (1) Cluster sequences into phylogenetic trees. One family, one count. (2) Apply a tree weight $w_k$ to each sequence. (3) Convert each position to a probability distribution, the "sequence profile": $P_{ij} = \frac{\sum_{k \in \mathrm{seqs}} w_k \, \delta(s_{kj} = aa_i)}{\sum_{k \in \mathrm{seqs}} w_k}$
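The weighted profile on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function name `sequence_profile`, the toy alignment, and the weights are all made up for the example, and the tree weights are simply taken as given rather than derived from a phylogenetic tree.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sequence_profile(aligned_seqs, weights):
    """Return one dict per alignment column mapping amino acid -> P_ij.

    P_ij = sum_k w_k * delta(s_kj == aa_i) / sum_k w_k
    Characters outside the 20 amino acids (e.g. gaps) are skipped,
    so a column containing gaps sums to less than 1.
    """
    if len(aligned_seqs) != len(weights):
        raise ValueError("need exactly one weight per sequence")
    total_weight = sum(weights)
    profile = []
    for j in range(len(aligned_seqs[0])):
        column = {aa: 0.0 for aa in AMINO_ACIDS}
        for seq, w in zip(aligned_seqs, weights):
            if seq[j] in column:
                column[seq[j]] += w
        profile.append({aa: c / total_weight for aa, c in column.items()})
    return profile

# Hypothetical toy family: two close relatives share the weight of one branch.
seqs = ["MQT", "MQS", "AQT"]
tree_weights = [0.25, 0.25, 0.5]
profile = sequence_profile(seqs, tree_weights)
```

With these weights the two similar sequences together count as much as the single divergent one, which is the "one family, one count" idea applied inside a family.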

Clustering sequence profiles to find recurrent patterns. [Figure: sequence profile over positions 26 to 32, with one row per amino acid; each dot represents a short profile.] Similarity metric (product of log-likelihood ratios): $D(p, q)$, summed over amino acids $i = 1, \dots, 20$ and positions $l = 1, \dots, L$.

The I-sites Library. [Figure: motif profiles with amino acids arranged from non-polar to polar. Backbone angles: ψ = green, φ = red.] Motifs shown: Type-I hairpin, diverging type-2 turn, Serine hairpin, Frayed helix, alpha-alpha corner, glycine helix N-cap, Proline helix C-cap.

Are I-sites really folding initiation sites? Prediction experiments (Bystroff & Baker, Proteins, 1997) NMR data on peptides (Yi et al, J.Mol.Biol., 1998) Molecular dynamics simulations (Bystroff & Garde, Proteins, 2002)

Level 2. Motif grammar. Arrangement of I-sites motifs in proteins is highly non-random. [Figure: helix, helix cap, beta strand, beta turn.] Adjacencies can be modeled as a Markov chain.
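As a rough illustration of modeling adjacencies as a Markov chain, the sketch below estimates first-order transition probabilities from sequences of motif labels. The function name and the three toy label sequences are hypothetical; real input would be I-sites motif assignments along known protein chains.

```python
from collections import Counter, defaultdict

def transition_probabilities(label_sequences):
    """Estimate P(next motif | current motif) from adjacent motif pairs."""
    counts = defaultdict(Counter)
    for labels in label_sequences:
        for current, following in zip(labels, labels[1:]):
            counts[current][following] += 1
    return {
        current: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for current, c in counts.items()
    }

# Hypothetical motif assignments for three chains (illustrative only).
chains = [
    ["helix", "helix cap", "beta strand"],
    ["helix", "helix cap", "beta turn"],
    ["beta strand", "beta turn", "beta strand"],
]
probs = transition_probabilities(chains)
```

Transitions that never occur in the data get no entry at all, which is one way the non-random arrangement of motifs shows up as a sparse transition table.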

Aligned motifs become a Markov chain. [Figure: aligned profiles and aligned structures (φ, ψ) for α helix, G, Type-1 α C-cap and Type-2 α C-cap, merged into a single state topology.]

A Markov state from HMMSTR. Transitions: a_hi from the previous state; a_ij and a_ik to the next states. Amino acid symbols: b_i = {ACDEF...}. Structure symbols: r_i = {HGEBdblLex}, d_i = {HST}, c_i = {mnhd...}.

Discretized structure states: backbone angle regions ( r i )

How an HMM works. We have S (the sequence). We want Q (the state sequence). P(Q|S) is the probability of Q given S: $P(Q \mid S) = \pi_{q_1}(s_1) \prod_{t=2}^{N} a_{q_{t-1} q_t} B_{q_t}(s_t)$, where $\pi$ are the starting states and $a$ are the arrows. The emission term combines the amino acid profiles $b$ with the structure symbols: $B_i(s_t) = b_i(O_t) \begin{pmatrix} d_i(D_t) \\ r_i(R_t) \\ c_i(C_t) \end{pmatrix}$
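The product on this slide can be evaluated for one candidate state path as follows. This is a sketch under stated assumptions: the two-state model, its probabilities, and the function name are invented for illustration and are not HMMSTR parameters. The first term is taken to be the start probability times the emission of the first symbol, and only a single emission table per state is used rather than the combined amino acid and structure symbols.

```python
import math

def path_log_probability(states, symbols, start, trans, emit):
    """Log of start[q1]*emit[q1][s1] * prod_t trans[q(t-1)][qt]*emit[qt][st].

    Assumes every probability on the path is nonzero.
    """
    if not states or len(states) != len(symbols):
        raise ValueError("states and symbols must be non-empty and equal length")
    logp = math.log(start[states[0]]) + math.log(emit[states[0]][symbols[0]])
    for t in range(1, len(states)):
        logp += math.log(trans[states[t - 1]][states[t]])
        logp += math.log(emit[states[t]][symbols[t]])
    return logp

# Hypothetical two-state model: "h" (helix-like) and "e" (strand-like).
start = {"h": 0.6, "e": 0.4}
trans = {"h": {"h": 0.7, "e": 0.3}, "e": {"h": 0.4, "e": 0.6}}
emit = {"h": {"A": 0.8, "V": 0.2}, "e": {"A": 0.1, "V": 0.9}}

logp = path_log_probability(["h", "h", "e"], ["A", "A", "V"], start, trans, emit)
```

Working in log space avoids numerical underflow, which matters once the path runs over hundreds of residues.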

HMMSTR: Hidden Markov Model for local protein STRucture. 282 nodes, 317 transitions. Unified model for 31 distinct sequence-structure motifs (Bystroff & Baker, J. Mol. Biol., 2000)

Level 1: I-sites (initiation). Level 2: HMMSTR (propagation).

Level 3: Pairwise Motif-Motif Contact Potentials • G(p, q, s) represents the free energy of a motif-motif contact: $G(p, q, s) = -\log \frac{\sum_{\mathrm{PDBselect}} \sum_{i \ni D_{i,i+s} < 8\,\text{Å}} \Gamma(i, p)\, \Gamma(i+s, q)}{\sum_{\mathrm{PDBselect}} \sum_{i} \Gamma(i, p)\, \Gamma(i+s, q)}$
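A minimal sketch of this ratio, assuming Γ(i, p) is the probability of motif state p at residue i and D is a residue-residue distance matrix in Å. Neither definition is spelled out on the slide, so both are assumptions here, as are the function name, the data layout, and the four-residue toy protein.

```python
import math

def contact_potential(proteins, p, q, s, cutoff=8.0):
    """G(p, q, s) = -log(weighted pairs in contact / all weighted pairs).

    proteins: list of (gamma, dist) pairs, where gamma[i] maps a state
    name to its probability at residue i and dist[i][j] is a distance in Å.
    Assumes at least one (p, q) pair at separation s is in contact.
    """
    in_contact = 0.0
    total = 0.0
    for gamma, dist in proteins:
        for i in range(len(gamma) - s):
            weight = gamma[i].get(p, 0.0) * gamma[i + s].get(q, 0.0)
            total += weight
            if dist[i][i + s] < cutoff:
                in_contact += weight
    return -math.log(in_contact / total)

# Hypothetical four-residue protein with states "a" and "b".
gamma = [{"a": 1.0}, {"a": 0.5}, {"b": 1.0}, {"b": 1.0}]
dist = [
    [0.0, 3.8, 6.0, 9.0],
    [3.8, 0.0, 3.8, 10.0],
    [6.0, 3.8, 0.0, 3.8],
    [9.0, 10.0, 3.8, 0.0],
]
g = contact_potential([(gamma, dist)], "a", "b", s=2)
```

Here the pair (0, 2) is in contact with weight 1.0 and the pair (1, 3) is not, with weight 0.5, so the ratio is 1.0 / 1.5. With this sign convention the value is never negative, and lower values mean the two motifs are in contact more often at that separation.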

What is a contact map? Definition: $S(i, j) = \begin{cases} 1 & \text{if } d(i, j) \le D \\ 0 & \text{if } d(i, j) > D \end{cases}$
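The definition translates directly into code. The sketch below builds a binary contact map from 3D coordinates; the coordinates are made up, and the 8 Å threshold is borrowed from the contact criterion used for the motif-motif potential rather than stated on this slide.

```python
import math

def contact_map(coords, cutoff):
    """Return S with S[i][j] = 1 if d(i, j) <= cutoff, else 0."""
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) <= cutoff else 0 for j in range(n)]
        for i in range(n)
    ]

# Hypothetical residue positions along a line, in Å.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
cmap = contact_map(coords, cutoff=8.0)
```

The map is symmetric and its diagonal is all ones, since every residue is at distance zero from itself.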

[Figure: contact energy map E(i,j). Both axes: sequence. Red: favorable contact. Blue: unfavorable.]

Features in a contact map can be interpreted as a TOPS diagram (helices, strands). Which one is right?

A rule-based simulation procedure. [Figure: CASP5 target T0130. Ab initio prediction from contact energies compared with the true contact map. Labels: amphipathic, non-polar.]

Level 4: Multibody arrangements of local motifs. It is difficult to see similarities between these two proteins, but...

Different folds can have the same arrangement of secondary structure elements. [Figure: numbered secondary structure elements of 1alk and 1vpt.]

SCALI : Structural Core ALIgnment

How SCALI works. (1) Gapless alignment of HMMSTR states. (2) Initialize the tree search with one gapless fragment. (3) Add a new fragment if and only if it is compatible and has a high score. (4) A branch ends in a leaf when no more fragments can be added. Score of leaves = aligned contacts + permutation penalty.

HMMs may be built based on non-sequential alignments. Markov states represent amino acid sequences and positions in space. Connections between them represent loops.

Hidden Markov models for α/β/α proteins

Non-sequential clusters may be useful for classifying proteins: core packing classes. Multiple non-sequential alignments are more specific than “architecture” but not as specific as “topology”.

Level 1: I-sites (initiation). Level 2: HMMSTR (propagation). Level 3: HMMSTR-CM (condensation). Level 4: SCALI (molten globule).

Level 5: Global topology. Separation of the SCOP 1.53 database into training and test sequences, shown for the G proteins test family.

Support Vector Machine. 4052 proteins --> 54-dimensional vector. Each dimension is the order of appearance of HMMSTR states for one family. [Figure: optimal hyperplane and support vectors in the (x1, x2) plane.]

HMMSTR as the basis for a Support Vector Machine. SCOP benchmark of 54 sequence families, 4052 proteins, each represented as a 282-dimensional vector = probability of each HMMSTR state. (Hou, Y. et al., Bioinformatics, 2003; Proteins, 2004)

No sparse data problem as we mine longer and longer patterns! Why?
Steps along the folding pathway (early to late) | Model | Complexity
(1) initiation | I-sites | ~40 motifs
(2) propagation | HMMSTR | 1.1 transitions/node
(3) condensation | HMMSTR-CM | ~1% of pairs occur
(4) molten globule | SCALI | only self-avoiding paths
(5) native state | SVM-HMMSTR | ~1000-2000 folds

Are there any conclusions? We assumed that proteins fold in a certain, hierarchical manner, mined the data accordingly, and found recurrence at every level, from short motifs to global structure.

HMMSTR: Chris Bystroff, Vesteinn Thorsson, David Baker. SVM-HMMSTR (Nat. Univ. Singapore): Yuna Hou, Mong-Li Lee, Wynne Hsu. Bystroff Lab: Yaoming Huang, Yu Shao, Donna Crone, Xin Yuan, Rachel van Duyne, Kwang Kim, Ben Cole. Funding from: NSF-CISE. www.bioinfo.rpi.edu/~bystrc/ HMMSTR says: Think Globally, Act Locally.

Are I-sites folding initiation sites? Patterns of conservation suggest energetic motive: 1. backbone angle constraints; 2. sidechain contacts; 3. negative design.

NMR structures confirm independent folding. [Figure, panels (a)-(d): sequence profiles over positions 1 to 7 and 26 to 32, one row per amino acid, color scale from -1 to 1; diverging turn motif.] NMR structure of a 7-residue I-sites motif in isolation (Yi et al., J. Mol. Biol., 1998)
