
Improving protein secondary structure prediction based on short subsequences with local structure similarity



  1. Improving protein secondary structure prediction based on short subsequences with local structure similarity
     Hsin-Nan Lin, Ting-Yi Sung, Shinn-Ying Ho, Wen-Lian Hsu
     Bioinformatics Program, TIGP (Taiwan International Graduate Program), Academia Sinica, Taiwan
     The author Hsin-Nan Lin wishes to acknowledge, with thanks, the Taiwan International Graduate Program (TIGP) of Academia Sinica for financial support towards attending this conference.

  2. Outline
     • Introduction
       ◦ Protein secondary structure predictions
       ◦ Existing PSS methods
     • Methods
       ◦ Synonymous words
       ◦ Compilation of a synonymous dictionary
       ◦ Prediction model
     • Results
       ◦ Experiment results
       ◦ Two factors that affect prediction performance
     • Conclusions

  3. Protein Secondary Structure Prediction
     • Protein secondary structure (PSS) elements
       ◦ The local conformation of amino acids
       ◦ 3 secondary structure states: helix (H), strand (E), coil (C).
       Figure ref: http://bioweb.wku.edu/courses/biol22000/3AAprotein/images/F03-08C.GIF
     • Protein secondary structure prediction
       ◦ Assign one of the states to each amino acid.
       ◦ Useful for protein 3D structure prediction, function prediction, subcellular localization prediction, etc.

  4. Existing PSS methods
     • Template-based methods (Ref: Rajkumar Bondugula, Dong Xu, 2006)
     • Sequence-profile-based methods (Ref: http://bioinfo.se/kurser/swell/secstrpred.html)

  5. Outline
     • Introduction
       ◦ Protein Secondary Structure Predictions
       ◦ Existing PSS methods
     • Methods
       ◦ Synonymous words
       ◦ Construction of a Synonymous Dictionary
       ◦ Prediction Algorithm
     • Results
       ◦ Experiment results
       ◦ Two factors that affect prediction performance
     • Conclusions

  6. A dictionary-based approach: SymPred
     • Treating proteomic data as a language
       ◦ A protein structure is encoded by its amino acid sequence.
       ◦ protein sequence ↔ text
       ◦ protein structure ↔ meaning
     • Treating PSS prediction as a translation problem
       ◦ protein sequence → secondary structure state sequence
     • A general approach for analyzing protein sequences
       ◦ It can be applied to protein subcellular localization (PSL) prediction, function prediction, remote homology detection, sequence alignment, etc.

  7. Synonymous words in protein sequences
     • Protein language remains a mystery
     • Structure robustness
       ◦ Structures are more conserved than sequences
       ◦ Proteins with 40% or higher sequence identity are highly similar in structure
     • A significant local pairwise alignment of two proteins implies two similar paragraphs.
       ◦ Define synonymous words in protein sequences
     • Definition
       ◦ A synonymous word is an n-gram of a protein sequence that is aligned with an n-gram in the other protein of a significant alignment.
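The definition above can be illustrated with a minimal sketch that slides an n-wide window over a pairwise alignment and collects the aligned n-gram pairs. The function name `synonymous_word_pairs`, the `-` gap character, and the choice to skip windows that contain a gap are assumptions for illustration; the slides do not state how gaps are handled.

```python
def synonymous_word_pairs(aligned_query, aligned_hit, n=4):
    """Return (query_word, hit_word) n-gram pairs from two aligned strings.

    Both inputs are rows of one pairwise alignment, so they must have the
    same length. Windows with a gap ('-') in either row are skipped; this
    is an assumption, not something the slides specify.
    """
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have equal length")
    pairs = []
    for i in range(len(aligned_query) - n + 1):
        query_word = aligned_query[i:i + n]
        hit_word = aligned_hit[i:i + n]
        if "-" in query_word or "-" in hit_word:
            continue
        pairs.append((query_word, hit_word))
    return pairs


# Toy alignment, not real data.
print(synonymous_word_pairs("EWQLA", "DFDMS", n=4))
```

With the toy input, the first pair returned is `("EWQL", "DFDM")`, the same pairing used as the example on the next slide.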

  8. Synonymous words (cont.)
     Example: EWQL ↔ HHHH ↔ DFDM (two aligned words sharing the structure HHHH)

  9. Compilation of Synonymous Dictionary
     A protein sequence → PSI-BLAST → sequence alignments → synonymous words
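A rough sketch of the last step of this pipeline, under stated assumptions: PSI-BLAST is not invoked here, and the input is a hypothetical list of (aligned sequence, aligned structure) string pairs standing in for parsed hits whose secondary structure is known. The name `build_dictionary` and the entry layout (word mapped to structure counts) are illustrative, not the authors' format.

```python
from collections import Counter, defaultdict


def build_dictionary(aligned_hits, n=4):
    """Map each n-gram word to a Counter of the structure strings seen for it.

    `aligned_hits` is an iterable of (sequence, structure) pairs of equal
    length, with '-' marking alignment gaps. Windows containing a gap in
    either string are skipped (an assumption made for this sketch).
    """
    dictionary = defaultdict(Counter)
    for sequence, structure in aligned_hits:
        if len(sequence) != len(structure):
            raise ValueError("sequence and structure must have equal length")
        for i in range(len(sequence) - n + 1):
            word = sequence[i:i + n]
            states = structure[i:i + n]
            if "-" in word or "-" in states:
                continue
            dictionary[word][states] += 1
    return dictionary


# Toy hits, not real PSI-BLAST output.
entries = build_dictionary([("EWQLA", "HHHHC"), ("EWQL", "HHHH")])
print(dict(entries["EWQL"]))
```

Keeping counts per structure, rather than a single structure per word, leaves room for the frequency-weighted voting mentioned later in the talk.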

  10. An example of a synonymous word entry
      Flexibility: an entry can also carry PSL, function, 3D structure, etc.

  11. Properties of synonymous words
      • Protein dependency
        ◦ Synonymous words are generated from significant sequence alignments (context-sensitive).
          ▪ Two similar protein words are not necessarily synonymous.
        ◦ The material used to generate synonymous words depends on the query protein sequence.
      • Sequence identity independence
        ◦ Protein A ↔ Protein B (SI = 50%)
        ◦ Protein B ↔ Protein C (SI = 40%)
        ◦ Protein A ↔ Protein C (SI = 20%)
      [Figure: similar proteins of A, similar proteins of B, similar proteins of C]

  12. Translation model
      Obtain the final structure through voting
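The voting step can be sketched as follows, assuming each dictionary match contributes a structure string at some start position with a numeric weight. The function name `vote`, the weights, and the fallback to coil (`C`) for positions without any vote are all assumptions; the slide does not describe the weighting scheme.

```python
from collections import Counter


def vote(length, matches, default="C"):
    """Combine overlapping word-level structures into one state per residue.

    `matches` is an iterable of (start, structure, weight) tuples. Each
    state in `structure` adds `weight` votes at its residue position.
    Positions that receive no vote get `default`.
    """
    tallies = [Counter() for _ in range(length)]
    for start, structure, weight in matches:
        for offset, state in enumerate(structure):
            position = start + offset
            if 0 <= position < length:
                tallies[position][state] += weight
    return "".join(
        tally.most_common(1)[0][0] if tally else default for tally in tallies
    )


# Toy matches, not real dictionary hits.
print(vote(6, [(0, "HHHH", 2), (2, "EEEE", 1), (1, "HHHH", 2)]))
```

Because every vote can be traced to a dictionary entry, a scheme like this keeps the prediction traceable, which the conclusions list as a property of the approach.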

  13. Outline
      • Introduction
        ◦ Protein Secondary Structure Predictions
        ◦ Existing PSS methods
      • Methods
        ◦ Synonymous words
        ◦ Construction of a Synonymous Dictionary
        ◦ Prediction Algorithm
      • Results
        ◦ Experiment results
        ◦ Two factors that affect prediction performance
      • Conclusions

  14. Datasets
      • DSSP Database
        ◦ A database of PSS assignments
        ◦ DsspNr-25
          ▪ A non-redundant subset of DSSP
          ▪ 8,297 protein chains
      • EVA benchmark datasets
        ◦ A platform for analyzing PSS predictors
        ◦ EVA_Set1: 80 protein chains
        ◦ EVA_Set2: 212 protein chains

  15. Translation Performance on DsspNr-25 (8,297 proteins)

      Method    Q3     Q3H    Q3E    Q3C    SOV
      SymPred   81.0   84.3   71.6   77.7   76.0
      PROSP     75.1   79.7   67.6   71.3   68.7

      SymPred exceeds PROSP by +5.9% in Q3 and +7.3% in SOV.
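For readers unfamiliar with the column headers, here is a minimal sketch of how Q3 and the per-state scores are commonly computed. It assumes Q3 is the percentage of residues whose predicted state equals the observed state, and that Q3H, Q3E, and Q3C are the same percentage restricted to residues observed as H, E, or C. The slides do not spell out these definitions, and SOV (segment overlap) is not covered here.

```python
def q3_scores(observed, predicted):
    """Return Q3 and per-state accuracies (in percent) for two state strings.

    A per-state score is None when that state never occurs in `observed`.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have equal length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(o == p for o, p in zip(observed, predicted))
    scores = {"Q3": 100.0 * correct / len(observed)}
    for state in "HEC":
        total = sum(o == state for o in observed)
        hits = sum(o == p == state for o, p in zip(observed, predicted))
        scores["Q3" + state] = 100.0 * hits / total if total else None
    return scores


# Toy strings, not data from the talk.
print(q3_scores("HHHHEECC", "HHHCEECC"))
```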

  16. Two factors affect translation performance
      • Word length
        ◦ A trade-off between specificity and sensitivity
          ▪ Long words: increase specificity, lose sensitivity
          ▪ Short words: lose specificity, increase sensitivity
        ◦ Exact matching vs. inexact matching
          ▪ Exact matching: WGPV ↔ WGPV (exactly the same)
          ▪ Inexact matching: WGPV ↔ WGPV, *GPV, W*PV, WG*V, WGP* (at most one mismatched character)
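The inexact-matching rule can be sketched in two equivalent ways: enumerate the wildcard patterns listed on the slide, or count mismatches directly. Both helper names (`wildcard_variants`, `matches_inexact`) are hypothetical and only illustrate the "at most one mismatch" rule.

```python
def wildcard_variants(word, wildcard="*"):
    """Return the word itself plus every variant with one position masked."""
    variants = [word]
    for i in range(len(word)):
        variants.append(word[:i] + wildcard + word[i + 1:])
    return variants


def matches_inexact(a, b, max_mismatches=1):
    """True if a and b have equal length and differ in at most max_mismatches positions."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) <= max_mismatches


print(wildcard_variants("WGPV"))        # ['WGPV', '*GPV', 'W*PV', 'WG*V', 'WGP*']
print(matches_inexact("WGPV", "WGAV"))  # True
```

Indexing a dictionary by the masked patterns allows hash lookups, while the mismatch count is simpler when comparing two specific words.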

  17. Two factors affect translation performance (cont.)
      • Template pool size
        ◦ SymPred has the potential to improve further as the number of proteins with known structures continues to increase.

  18. Performance Comparison on EVA_Set1 (80 proteins)

      Method       Q3     ERRsig Q3   SOV    ERRsig SOV
      SymPred      78.8   ±1.4        76.4   ±1.9
      SAM-T99sec   77.2   ±1.2        74.6   ±1.5
      PSIPRED      76.8   ±1.4        75.4   ±2.0
      PROFsec      75.5   ±1.4        74.9   ±1.9
      PHDpsi       73.4   ±1.4        69.5   ±1.9

  19. Performance Comparison on EVA_Set2 (212 proteins)

      Method       Q3     ERRsig Q3   SOV    ERRsig SOV
      SymPred      79.2   ±0.9        76.0   ±1.2
      PSIPRED      77.8   ±0.8        75.4   ±1.1
      PROFsec      76.7   ±0.8        74.8   ±1.1
      PHDpsi       75.0   ±0.8        70.9   ±1.2

  20. Confidence Level vs. Q3
      PCC = 0.992
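Assuming PCC here denotes the Pearson correlation coefficient between confidence level and Q3, a minimal sketch of the computation is shown below. The numbers in the usage line are made up to exercise the function; they are not the values behind the slide's plot.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical confidence levels and accuracies, for illustration only.
print(pearson([1, 2, 3, 4], [55.0, 63.0, 72.0, 80.0]))
```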

  21. Outline
      • Introduction
        ◦ Protein Secondary Structure Predictions
        ◦ Existing PSS methods
      • Methods
        ◦ Synonymous words
        ◦ Construction of a Synonymous Dictionary
        ◦ Prediction Algorithm
      • Results
        ◦ Experiment results
        ◦ Two factors that affect prediction performance
      • Conclusions

  22. Conclusions
      • Local similarities in protein sequences exhibit conserved structures.
      • As the number of protein sequences with known structures increases, SymPred can further improve prediction accuracy.
      • The prediction result is traceable.
      • Our dictionary-based approach is general and applies to various protein-related problems.
      • Synonymous words provide an alternative sequence analysis method.

  23. Thank you!
      Please visit our web server: http://bio-cluster.iis.sinica.edu.tw/~bioapp/SymPred/
