SLIDE 1

Improving protein secondary structure prediction based on short subsequences with local structure similarity

Hsin-Nan Lin, Ting-Yi Sung, Shinn-Ying Ho, Wen-Lian Hsu Bioinformatics Program, TIGP (Taiwan International Graduate Program), Academia Sinica, Taiwan

The author Hsin-Nan Lin wishes to acknowledge, with thanks, the Taiwan International Graduate Program (TIGP) of Academia Sinica for financial support towards attending this conference.

SLIDE 2

Outline

Introduction

  • Protein secondary structure predictions
  • Existing PSS methods

Methods

  • Synonymous words
  • Compilation of a synonymous dictionary
  • Prediction model

Results

  • Experiment results
  • Two factors that affect prediction performance

Conclusions

SLIDE 3

Protein Secondary Structure Prediction

Protein secondary structure (PSS) elements

  • The local conformation of amino acids
  • 3 secondary structure states: helix (H), strand (E), coil (C).

Protein secondary structure prediction

  • Assign one of the states to each amino acid.
  • Useful for protein 3D structure prediction, function prediction, subcellular localization prediction, etc.

Ref: http://bioweb.wku.edu/courses/biol22000/3AAprotein/images/F03-08C.GIF

SLIDE 4

Existing PSS methods

Template-based methods

Sequence-profile-based methods

Ref: Rajkumar Bondugula, Dong Xu, 2006

Ref: http://bioinfo.se/kurser/swell/secstrpred.html

SLIDE 5

Outline

Introduction

  • Protein Secondary Structure Predictions
  • Existing PSS methods

Methods

  • Synonymous words
  • Construction of a Synonymous Dictionary
  • Prediction Algorithm

Results

  • Experiment results
  • Two factors that affect prediction performance

Conclusions

SLIDE 6

A Dictionary based approach -- SymPred

Treating proteomic data as a language

  • A protein structure is encoded by its amino acid sequence.
  • protein sequence ↔ text
  • protein structure ↔ meaning

Treating PSS prediction as a translation problem

  • protein sequence → secondary structure state sequence

A general approach for analyzing protein sequences

  • It can be applied to protein subcellular localization (PSL) prediction, function prediction, remote homology detection, sequence alignment, etc.

SLIDE 7

Synonymous words in protein sequences

Protein language remains a mystery

Structure robustness

  • Structures are more conserved than sequences.
  • Proteins with 40% or higher sequence identity are highly similar in structure.

A significant local pairwise alignment of two proteins implies two similar paragraphs.

  • Define synonymous words in protein sequences

Definition

  • A synonymous word is an n-gram of a protein sequence aligned with another n-gram in the other protein.
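
The definition above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `aligned_ngram_pairs`, the default word length of 4 (chosen because the example words on these slides have four letters), and the decision to skip windows that contain a gap character are all assumptions.

```python
def aligned_ngram_pairs(query_aln, subject_aln, n=4):
    """Return (query_word, subject_word) pairs taken from the same
    columns of a gapped pairwise alignment.

    Assumption: windows containing a gap ('-') in either row are skipped.
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned rows must have the same length")
    pairs = []
    for i in range(len(query_aln) - n + 1):
        q_word = query_aln[i:i + n]
        s_word = subject_aln[i:i + n]
        if "-" in q_word or "-" in s_word:
            continue  # not an ungapped n-gram pair
        pairs.append((q_word, s_word))
    return pairs


# Hypothetical alignment rows, used only to show the call.
print(aligned_ngram_pairs("EWQLK", "DFDMR"))
```

Under these assumptions, each returned pair is a candidate pair of synonymous words; whether the underlying alignment is significant has to be decided beforehand by the alignment tool.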

SLIDE 8

Synonymous words (cont.)

[Figure: example alignment showing the words EWQL and DFDM together with the secondary structure HHHH]

SLIDE 9

Compilation of Synonymous Dictionary

A protein sequence → PSI-BLAST → sequence alignments → synonymous words
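
A minimal sketch of the last step of this pipeline follows. It does not run PSI-BLAST; it assumes the alignments have already been parsed into hypothetical `(template_aln, template_ss, hit_aln)` triples, where `template_ss` is the known secondary structure written column by column under the template row. The entry layout (word mapped to counts of structure strings) and the word length of 4 are assumptions as well, since the slides do not specify them.

```python
from collections import Counter, defaultdict


def build_dictionary(alignments, n=4):
    """Map each word to a Counter of the secondary structure strings
    observed for it.

    alignments: iterable of (template_aln, template_ss, hit_aln) triples
    with equal-length strings (hypothetical, pre-parsed input).
    """
    dictionary = defaultdict(Counter)
    for template_aln, template_ss, hit_aln in alignments:
        for i in range(len(template_aln) - n + 1):
            t_word = template_aln[i:i + n]
            h_word = hit_aln[i:i + n]
            if "-" in t_word or "-" in h_word:
                continue  # assumption: gapped windows are ignored
            ss = template_ss[i:i + n]
            # The template word and its aligned word share the structure.
            for word in {t_word, h_word}:
                dictionary[word][ss] += 1
    return dictionary


# Hypothetical input, used only to show the data flow.
d = build_dictionary([("EWQLA", "HHHHC", "DFDMA")])
print(d["DFDM"])
```

Storing counts rather than a single structure per word keeps the information needed for a later voting step.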

SLIDE 10

An example of synonymous word entry

Flexibility: PSL, functions, 3D structure, etc.

SLIDE 11

Properties of synonymous words

Protein dependency

  • Synonymous words are generated from significant sequence alignments (context-sensitive).
  • Two similar protein words are not necessarily synonymous.
  • The material for generating synonymous words depends on the query protein sequence.

Sequence identity independence

  • Protein A ↔ Protein B (sequence identity, SI = 50%)
  • Protein B ↔ Protein C (SI = 40%)
  • Protein A ↔ Protein C (SI = 20%)

[Figure: similar proteins of A, similar proteins of B, similar proteins of C]

SLIDE 12

Translation model

Obtain the final structure through voting
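
The voting idea can be illustrated with a simplified sketch. It assumes a dictionary that maps each word to counts of structure strings (a hypothetical layout), looks up the words of the query sequence directly, and lets every matching entry vote for the state of each residue it covers. The fallback to coil for residues without votes and the function name `predict_by_voting` are assumptions, not details taken from the slides.

```python
from collections import Counter


def predict_by_voting(sequence, dictionary, n=4, default="C"):
    """Predict one state per residue by summing votes from all
    overlapping words found in the dictionary."""
    votes = [Counter() for _ in sequence]
    for i in range(len(sequence) - n + 1):
        word = sequence[i:i + n]
        for ss, count in dictionary.get(word, {}).items():
            for offset, state in enumerate(ss):
                votes[i + offset][state] += count
    # Assumption: residues that received no vote are labelled `default`.
    return "".join(
        v.most_common(1)[0][0] if v else default for v in votes
    )


# Hypothetical dictionary entries, used only to show the mechanics.
toy = {"EWQL": {"HHHH": 3}, "WQLA": {"HHHC": 1, "HHEE": 2}}
print(predict_by_voting("EWQLA", toy))
```

Because each residue is covered by up to n overlapping words, a single unusual entry can be outvoted by its neighbours, and the entries behind each vote can be listed to trace a prediction.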

SLIDE 13

Outline

Introduction

  • Protein Secondary Structure Predictions
  • Existing PSS methods

Methods

  • Synonymous words
  • Construction of a Synonymous Dictionary
  • Prediction Algorithm

Results

  • Experiment results
  • Two factors that affect prediction performance

Conclusions

SLIDE 14

Datasets

DSSP Database

  • A database of PSS assignments
  • DsspNr-25: a non-redundant subset of DSSP with 8,297 protein chains

EVA benchmark datasets

  • A platform analyzing PSS predictors
  • EVA_Set1: 80 protein chains
  • EVA_Set2: 212 protein chains

SLIDE 15

Translation Performance on DsspNr-25

DsspNr-25 (8,297 proteins)

Method    Q3    Q3H   Q3E   Q3C   SOV
SymPred   81.0  84.3  71.6  77.7  76.0
PROSP     75.1  79.7  67.6  71.3  68.7

SymPred vs. PROSP: +5.9% in Q3, +7.3% in SOV
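
For readers unfamiliar with the measures in this table, the sketch below computes Q3 and the per-state scores for one protein, assuming the usual definitions: Q3 is the percentage of residues whose predicted state equals the observed state, and Q3H, Q3E and Q3C restrict that percentage to residues observed as helix, strand and coil. SOV (segment overlap) is not covered here. The function name and the example strings are hypothetical.

```python
def q3_scores(observed, predicted):
    """Return Q3 and per-state accuracies (in percent) for one protein."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("need two non-empty strings of equal length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    scores = {"Q3": 100 * correct / len(observed)}
    for state in "HEC":
        n_state = observed.count(state)
        n_correct = sum(
            o == p == state for o, p in zip(observed, predicted)
        )
        # A state that never occurs in `observed` has no defined score.
        scores["Q3" + state] = 100 * n_correct / n_state if n_state else None
    return scores


# Hypothetical strings, used only to show the call.
print(q3_scores("HHHHEECC", "HHHCEECC"))
```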

SLIDE 16

Two factors affect translation performance

Word length

  • a trade-off between specificity and sensitivity

Long words: increase specificity, lose sensitivity

Short words: lose specificity, increase sensitivity

  • exact matching vs. inexact matching

Exact matching: WGPV → WGPV (exactly the same)

Inexact matching: WGPV → WGPV, *GPV, W*PV, WG*V, WGP* (at most one mismatched character)
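
The inexact matching rule above can be written down directly. The sketch below generates the wildcard patterns listed for a word and checks whether two words match with at most one mismatched character; the function names are illustrative and not taken from SymPred.

```python
def one_mismatch_patterns(word, wildcard="*"):
    """Return the word itself plus every pattern obtained by replacing
    exactly one position with a wildcard."""
    patterns = [word]
    for i in range(len(word)):
        patterns.append(word[:i] + wildcard + word[i + 1:])
    return patterns


def matches_inexactly(word, candidate):
    """True if both words have the same length and differ in at most
    one position."""
    if len(word) != len(candidate):
        return False
    mismatches = sum(a != b for a, b in zip(word, candidate))
    return mismatches <= 1


print(one_mismatch_patterns("WGPV"))
```

Allowing one mismatch raises the chance that a query word finds a dictionary entry (sensitivity), at the cost of admitting less specific matches.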

SLIDE 17

Two factors affect translation performance (cont.)

Template pool size

SymPred has the potential to improve further as the number of proteins with known structures continues to increase.

SLIDE 18

Performance Comparison on EVA_Set1

EVA_Set1 (80 proteins)

Method       Q3     ERRsig Q3   SOV    ERRsig SOV
SymPred      78.8   ± 1.4       76.4   ± 1.9
SAM-T99sec   77.2   ± 1.2       74.6   ± 1.5
PSIPRED      76.8   ± 1.4       75.4   ± 2.0
PROFsec      75.5   ± 1.4       74.9   ± 1.9
PHDpsi       73.4   ± 1.4       69.5   ± 1.9

SLIDE 19

Performance Comparison on EVA_Set2

EVA_Set2 (212 proteins)

Method       Q3     ERRsig Q3   SOV    ERRsig SOV
SymPred      79.2   ± 0.9       76.0   ± 1.2
PSIPRED      77.8   ± 0.8       75.4   ± 1.1
PROFsec      76.7   ± 0.8       74.8   ± 1.1
PHDpsi       75.0   ± 0.8       70.9   ± 1.2

SLIDE 20

Confidence Level vs. Q3

PCC = 0.992

SLIDE 21

Outline

Introduction

  • Protein Secondary Structure Predictions
  • Existing PSS methods

Methods

  • Synonymous words
  • Construction of a Synonymous Dictionary
  • Prediction Algorithm

Results

  • Experiment results
  • Two factors that affect prediction performance

Conclusions

SLIDE 22

Conclusions

Local similarities in protein sequences exhibit conserved structures.

With the increasing number of protein sequences of known structures, SymPred can further improve prediction accuracy.

The prediction result is traceable.

Our dictionary-based approach is general for various protein-related problems.

Synonymous words provide an alternative sequence analysis method.

SLIDE 23

Thank you!

Please visit our web server

http://bio-cluster.iis.sinica.edu.tw/~bioapp/SymPred/