

SLIDE 1

Pretata: predicting TATA binding proteins with novel features and dimensionality reduction strategy

Shixiang Wan
Tianjin University, China
Email: shixiangwan@tju.edu.cn

SLIDE 2

Contents

  1. Introduction
  2. Methods and Materials
  3. Results and Discussion
  4. Conclusion

SLIDE 3

Introduction

SLIDE 4

TATA-binding Protein (TBP)

  • What is it?
      - A general transcription factor that binds specifically to a DNA sequence called the TATA box
      - Plays an important role in the initiation of transcription
      - Especially important in DNA melting (double-strand separation)

  • Importance of TBP
      - Studies have shown that TBP is involved in the molecular mechanism of neurodegenerative diseases.
      - There are 37 known pathogenic TBP mutations. Diseases associated with these mutations include epilepsy, Parkinson's syndrome, personality disorder, developmental malformations and so on.
      - Therefore, TBP has a key role in precision medicine and genetic testing.

SLIDE 5

Increased interest in TBP

Reference: https://www.ncbi.nlm.nih.gov/pubmed/

SLIDE 6

UniProt - high quality protein database

  • Website: http://www.uniprot.org/
  • Screenshot (image not reproduced here)

SLIDE 7

Methods for TBP prediction

  • Traditional experimental methods
      - Intrinsic limitations: expensive, time-consuming
  • Computational prediction methods
      - Input (large-scale sequence pool) → Computers → Output (predictions)
  • Urgent demand for fast and accurate prediction

SLIDE 8

Machine learning based methods

  • Framework of machine learning based prediction methods
  • Factors influencing performance: feature representation methods, classification algorithms, and datasets for model building

SLIDE 9

Major problems and challenges

  • Imbalanced dataset
  • Redundancy (or high similarity)
  • Difficulty of finding the best feature dimension quickly
  • Low similarity between positive-class and negative-class samples

SLIDE 10

Methods and Materials

SLIDE 11

Framework of Pretata

Workflow (with three improvements):

  • Input: TBP sequences
  • Data preparation: CD-HIT (reduce redundancy); negative samples collection (for imbalanced data)
  • Step 1. Feature representation: 188D, 473D, 611D
  • Step 2. Classifier prediction: LibD3C, LibSVM, IBK, Random Forest, Bagging
  • Step 3. Optimal solution
  • Prediction results: TBP or not?

SLIDE 12

Dataset construction

  • Positive dataset construction (true TBPs)
      - Increase the number of positive samples (about 559 were downloaded)
      - Reduce the redundancy of the positive samples (sequence similarity < 90%) using CD-HIT
  • Negative dataset construction (non-TBPs)
      - Raw negative dataset (about 8,465 sequences)
      - Purify the negative dataset

[Diagram: negative-class purification. Labels: 559 positive sequences and 559 negative sequences; 8,465 negative sequences; LibSVM misclassification; extract; take out; replenish negative class]
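The purification diagram gives only labels, so the exact procedure is not recoverable from this slide. The sketch below shows one plausible reading, under stated assumptions: start from a balanced set, train a classifier, extract the pool negatives it misclassifies as positive, and replenish the negative class with them. The names `purify_negatives` and `centroid_trainer` are hypothetical, and the nearest-centroid classifier is only a stand-in for LibSVM.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Predictor = Callable[[Vector], bool]          # True means "predicted TBP"
Trainer = Callable[[List[Vector], List[Vector]], Predictor]


def centroid_trainer(pos: List[Vector], neg: List[Vector]) -> Predictor:
    """Stand-in classifier (NOT LibSVM): predict the class of the nearer centroid."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    c_pos, c_neg = centroid(pos), centroid(neg)
    return lambda v: dist2(v, c_pos) < dist2(v, c_neg)


def purify_negatives(
    pos: List[Vector],
    neg_pool: List[Vector],
    train: Trainer,
    max_rounds: int = 10,
    seed: int = 0,
) -> Tuple[List[Vector], List[Vector]]:
    """Assumed procedure: grow the negative set from misclassified pool samples."""
    rng = random.Random(seed)
    pool = list(neg_pool)
    rng.shuffle(pool)
    # Start balanced: as many negatives as positives (559 vs 559 on the slide).
    negatives, pool = pool[:len(pos)], pool[len(pos):]
    for _ in range(max_rounds):
        predict = train(list(pos), negatives)
        flags = [predict(v) for v in pool]
        # Pool negatives predicted as positive are the misclassified ones.
        wrong = [v for v, flagged in zip(pool, flags) if flagged]
        if not wrong:
            break
        # Take them out of the pool and replenish the negative class with them.
        pool = [v for v, flagged in zip(pool, flags) if not flagged]
        negatives = negatives + wrong
    return negatives, pool
```

Passing the trainer as a callable keeps the loop independent of the classifier, so a real SVM wrapper could replace the stand-in without changing the purification logic.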

SLIDE 13

Novel features (611D)

  • 188D
      - Features based on composition and physicochemical properties of amino acids
      - 20D: the proportions of the 20 kinds of amino acids in the sequence
      - 21×8D: 21 kinds of statistical properties times 8 physicochemical properties
  • 473D
      - Features from secondary structure
  • 188D + 473D = 611D
      - The composition, physicochemical and secondary structure features are combined into 611D high-dimensional feature vectors.
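As an illustration of the 20D part of the 188D vector, the sketch below computes the proportion of each of the 20 standard amino acids in a sequence. The function name `composition_20d` and the choice to skip non-standard residue codes (such as X) are assumptions made for this example, not details taken from the slides.

```python
from typing import List

# The 20 standard amino acids as one-letter codes, in a fixed order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_20d(sequence: str) -> List[float]:
    """Return the fraction of each standard amino acid in the sequence."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in sequence.upper():
        # Assumption: non-standard codes are ignored rather than counted.
        if residue in counts:
            counts[residue] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# The remaining 168 dimensions are the 21 x 8 physicochemical statistics,
# so the block has 20 + 21 * 8 = 188 dimensions in total.
features = composition_20d("AACD")
```

For the toy sequence `AACD`, the entries for A, C and D are 0.5, 0.25 and 0.25, and all other entries are zero.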

SLIDE 14

Dimensionality reduction strategy

  • The optimal dimension searching method reduces the feature dimension to find the best result. It combines a global multiple search with a local linear search.

[Figure: dimension search. Primary steps run from the initial dimension through dimension (1), dimension (2), ..., dimension (k) to the tolerable dimension; secondary steps refine each interval as dimension (1,1), dimension (1,2), ..., dimension (k,k). The accuracy at each dimension is compared to find the best accuracy.]
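The two-level search can be sketched as a coarse scan followed by a fine scan. This is one plausible reading of the figure, not the authors' code: `search_best_dimension` and the `evaluate` callback (which would train and score a classifier on the top `d` features) are hypothetical names, and the step sizes are free parameters.

```python
from typing import Callable, Dict, Tuple


def search_best_dimension(
    evaluate: Callable[[int], float],
    initial_dim: int,
    tolerable_dim: int,
    primary_step: int,
    secondary_step: int = 1,
) -> Tuple[int, float]:
    """Coarse global scan with primary steps, then a local linear scan."""
    scores: Dict[int, float] = {}
    # Global search: walk from the initial dimension to the tolerable one.
    for d in range(initial_dim, tolerable_dim - 1, -primary_step):
        scores[d] = evaluate(d)
    best = max(scores, key=scores.get)
    # Local search: scan the neighbourhood of the best coarse candidate.
    low = max(tolerable_dim, best - primary_step)
    high = min(initial_dim, best + primary_step)
    for d in range(low, high + 1, secondary_step):
        if d not in scores:
            scores[d] = evaluate(d)
    best = max(scores, key=scores.get)
    return best, scores[best]


def toy_accuracy(d: int) -> float:
    """Toy scoring function for demonstration only, peaking at 324."""
    return 1.0 - abs(d - 324) / 1000


best_dim, best_acc = search_best_dimension(toy_accuracy, 611, 11, primary_step=40)
```

The coarse scan needs only a handful of evaluations to locate a promising region, and the fine scan then costs at most two primary steps' worth of evaluations, which is far cheaper than scoring every dimension.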

SLIDE 15

Performance evaluation

  • Four commonly used metrics
      - Sensitivity (SE), Specificity (SP), Accuracy (ACC), and Matthews Correlation Coefficient (MCC)
  • Formulation for the four metrics
      - Two metrics for comprehensive evaluation of a binary predictor
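The formulas on this slide did not survive extraction. The sketch below uses the standard textbook definitions of the four metrics in terms of true/false positives and negatives; the function name `binary_metrics` and the convention of returning 0.0 when a denominator is zero are choices made for this example.

```python
import math
from typing import Dict


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute SE, SP, ACC and MCC from confusion-matrix counts."""
    total = tp + tn + fp + fn
    se = tp / (tp + fn) if (tp + fn) else 0.0       # sensitivity (recall)
    sp = tn / (tn + fp) if (tn + fp) else 0.0       # specificity
    acc = (tp + tn) / total if total else 0.0       # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SE": se, "SP": sp, "ACC": acc, "MCC": mcc}


metrics = binary_metrics(tp=90, tn=80, fp=20, fn=10)
```

With these example counts, SE is 0.9, SP is 0.8 and ACC is 0.85. MCC uses all four cells of the confusion matrix, which makes it more informative than ACC on imbalanced data.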

SLIDE 16

Results and Discussion

SLIDE 17

Classifier based on autocorrelation comparison results

  • 611D performs far better than 188D and 473D;
  • LibSVM outperforms the other classifiers on every feature extraction method.

Accuracy by classifier and feature set (chart axis: 70% to 92%)

Classifier      188D      473D      611D
LibD3C          79.25%    79.57%    86.29%
LibSVM          81.84%    82.80%    90.46%
IBK             78.44%    76.97%    82.71%
RandomForest    76.92%    77.96%    79.84%
Bagging         79.74%    80.29%    87.66%

SLIDE 18

Prediction methods comparison results

Pretata vs. a group of traditional prediction methods

[Bar chart: ACC, SN and SP (0% to 100%) for BLASTP, PSI-BLASTP, 611D and Pretata]

Best: ACC 92.92%, SN 95.50%, SP 87.30%

SLIDE 19

Pretata searching process

Experiment: best dimension = 324D

[Line charts: accuracy versus feature dimension, one over the range 10 to 650 and one over the range 230 to 330; series: SN, SP, ACC]

Best: ACC 92.92%, SN 95.50%, SP 87.30%

SLIDE 20

Conclusion

SLIDE 21

Conclusions

  • Propose a method for selecting highly representative samples from an imbalanced dataset (Negative-class Purification)
  • Propose 611D high-dimensional feature vectors, including composition, physicochemical and secondary structure features (611 Dimension Feature Model)
  • Propose a novel and promising TBP prediction method, Pretata (Pretata Learning model)
  • Available web server: Pretata Server (Pretata web server)
    Website: http://server.malab.cn/preTata/
