
Pretata: predicting TATA binding proteins with novel features and dimensionality reduction strategy
Shixiang Wan
Tianjin University, China
Email: shixiangwan@tju.edu.cn

Contents
1. Introduction
2. Methods and Materials
3. Results and Discussion
4. Conclusion

Introduction

TATA-binding Protein (TBP)
• What is it?
  - A general transcription factor that binds specifically to a DNA sequence called the TATA box
  - Plays an important role in the initiation of transcription
  - Especially important in DNA melting (double-strand separation)
• Importance of TBP
  - Studies have shown that TBP is involved in the molecular mechanism of neurodegenerative diseases. There are 37 known pathogenic TBP mutations; the associated diseases include epilepsy, Parkinson's syndrome, personality disorder and developmental malformations. TBP therefore has a key role in precision medicine and genetic testing.

Increased interest in TBP
Reference: https://www.ncbi.nlm.nih.gov/pubmed/

UniProt - high-quality protein database
• Website: http://www.uniprot.org/

Methods for TBP prediction
• Traditional experimental methods
  - Intrinsic limitations: time-consuming and expensive
• Urgent demand for fast and accurate prediction
• Computational prediction methods
  - Input: large-scale sequence pool → Computers → Output: predictions

Machine learning based methods
• Framework of machine learning based prediction methods
• Factors influencing performance: feature representation methods, classification algorithms, and the datasets used for model building

Major problems and challenges
• Imbalanced dataset
• Redundancy (high sequence similarity)
• Difficulty of finding the best feature dimension quickly
• Low similarity between positive-class and negative-class samples

Methods and Materials

Framework of Pretata
Three improvements:
• Input: TBP sequences
• Step 1. Data preparation: CD-HIT (reduce redundancy); negative samples collection (for imbalanced data)
• Step 2. Feature representation: 188D, 473D, 611D; optimal solution
• Step 3. Classifier: LibD3C, LibSVM, IBK, Random Forest, Bagging
• Prediction results: TBP or not?

Dataset construction
• Positive dataset construction (true TBP)
  - Increase the number of positive samples (about 559 were downloaded)
  - Reduce the redundancy of the positive samples (sequence similarity < 90%) using CD-HIT
• Negative dataset construction (non-TBP)
  - Raw negative dataset (about 8,465 sequences)
  - Purify the negative dataset
  - Diagram labels: 8,465 negative sequences; extract 559 negative sequences; 559 positive sequences; take out LibSVM misclassifications; replenish negative class

Novel features (611D)
• 188D
  - Features based on the composition and physicochemical properties of amino acids
  - 20D: the proportions of the 20 kinds of amino acids in the sequence
  - 21 × 8D: 21 kinds of statistical properties times 8 physicochemical properties (168D, so 20 + 168 = 188)
• 473D
  - Features from secondary structure
• 188D + 473D combined (labelled 611D on the slides; note that 188 + 473 = 661)
  - The composition, physicochemical and secondary structure features are combined into high-dimensional feature vectors.
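As a minimal sketch of the 20D composition part only, the proportions of the 20 amino acids can be computed as below. The function name `composition_20d`, the residue ordering and the choice to ignore non-standard letters (such as X or B) are assumptions made for illustration; the slides do not say how Pretata handles these details, and the 21 × 8D and 473D features are not covered here.

```python
from collections import Counter
from typing import List

# The 20 standard residues as one-letter codes; the ordering is an arbitrary choice.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_20d(sequence: str) -> List[float]:
    """Return the fraction of each standard amino acid in `sequence`.

    Letters outside the 20 standard residues are ignored (an assumption).
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # Empty or fully non-standard input: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy sequence, not a real TBP.
vec = composition_20d("MADEEA")
```

Each sequence maps to a fixed-length vector regardless of its length, which is what lets sequences of different sizes be fed to the same classifier.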

Dimensionality reduction strategy
[Diagram labels: initial dimension; primary step: dimension (1), dimension (2), dimension (3), ..., dimension (k-1), dimension (k); secondary step: dimension (1,1), dimension (1,2), ..., dimension (1,k) and dimension (k,1), dimension (k,2), ..., dimension (k,k); accuracy; tolerable dimension; the best dimension]
• Optimal dimension searching method: reduce the dimension of the features to find the best result, combining a global multiple search with a local linear search.
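One way to read "global multiple search plus local linear search" is a coarse-to-fine scan: score a sparse grid of dimensions first, then scan every dimension around the best grid point. The sketch below is a two-level simplification under that assumption, not the authors' implementation. `evaluate` is a hypothetical callback (for example, cross-validated accuracy with the top k features kept), and the grid start, step and toy accuracy curve are illustrative values only.

```python
from typing import Callable, Dict, Tuple


def coarse_to_fine_search(
    evaluate: Callable[[int], float],
    max_dim: int,
    start: int = 10,
    coarse_step: int = 40,
) -> Tuple[int, float]:
    """Two-stage search for the feature count with the highest score."""
    scores: Dict[int, float] = {}

    def score(k: int) -> float:
        # Cache results so each dimension is evaluated at most once.
        if k not in scores:
            scores[k] = evaluate(k)
        return scores[k]

    # Global pass: sample the whole range on a coarse grid.
    coarse_grid = range(start, max_dim + 1, coarse_step)
    best_coarse = max(coarse_grid, key=score)

    # Local pass: linear scan of every dimension between the
    # neighbouring grid points of the best coarse candidate.
    lo = max(1, best_coarse - coarse_step + 1)
    hi = min(max_dim, best_coarse + coarse_step - 1)
    best_dim = max(range(lo, hi + 1), key=score)
    return best_dim, score(best_dim)


def toy_accuracy(k: int) -> float:
    # Hypothetical single-peaked curve, used only to exercise the search.
    return 0.95 - abs(k - 324) / 1000.0


best_dim, best_acc = coarse_to_fine_search(toy_accuracy, max_dim=650)
```

The coarse pass keeps the number of expensive model evaluations small, while the local pass recovers the exact optimum, provided the accuracy curve has no sharp peaks hidden between grid points.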

Performance evaluation
• Four commonly used metrics
  - Sensitivity (SE), Specificity (SP), Accuracy (ACC), and Matthews correlation coefficient (MCC)
• Formulation for the four metrics
  - Two metrics for comprehensive evaluation of a binary predictor
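The slide's formula image did not survive in this transcript. The standard confusion-matrix definitions of the four metrics are sketched below; the function name `binary_metrics` and the choice to return 0.0 when a denominator is zero are assumptions. SE here is presumably the quantity that the result charts label SN.

```python
import math
from typing import Dict


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute SE, SP, ACC and MCC from confusion-matrix counts.

    SE  = TP / (TP + FN)
    SP  = TN / (TN + FP)
    ACC = (TP + TN) / (TP + TN + FP + FN)
    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    """
    total = tp + tn + fp + fn
    se = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    acc = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SE": se, "SP": sp, "ACC": acc, "MCC": mcc}


# Made-up counts for illustration, not results from the slides.
m = binary_metrics(tp=90, tn=80, fp=20, fn=10)
```

ACC summarises overall correctness, while MCC stays informative when the two classes differ in size, which is why both are commonly reported for binary predictors.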

Results and Discussion

Classifier based on autocorrelation comparison results
[Figure: bar chart of accuracy (70% to 92%) for the 188D, 473D and 611D feature sets with the LibD3C, LibSVM, IBK, RandomForest and Bagging classifiers. Data labels as extracted (bar assignment not recoverable): 90.46%, 87.66%, 86.29%, 81.84%, 82.80%, 79.74%, 82.71%, 80.29%, 78.44%, 77.96%, 79.84%, 79.57%, 76.92%, 79.25%, 76.97%.]
• 611D performs far better than 188D and 473D.
• LibSVM outperforms the other classifiers on every feature extraction method.

Prediction methods comparison results
[Figure: bar chart of ACC, SN and SP (0% to 100%) for BLASTP, PSI-BLASTP, 611D and Pretata. Best result marked on the chart: ACC 92.92%, SN 95.50%, SP 87.30%.]
• Pretata vs. a group of traditional prediction methods

Pretata searching process
[Figure: accuracy (60% to 100%) against dimension (10 to 650). Marked values: ACC 92.92%, SN 95.50%, SP 87.30%.]
[Figure: SN, SP and ACC (75% to 99%) against dimension (230 to 330), with the best point marked.]
• Experiment: best dimension = 324D

Conclusion

Conclusions
• Proposed a highly representative sampling method for imbalanced datasets (negative-class purification)
• Proposed 611D high-dimensional feature vectors combining composition, physicochemical and secondary structure features (611-dimension feature model)
• Proposed a novel and promising TBP prediction method, Pretata (Pretata learning model)
• Web server available: Pretata Server, http://server.malab.cn/preTata/

