Analysis of the Signal Peptide dataset
November 28, 2019

Signal Peptide
- A short peptide (typically 15-30 residues long) that directs a protein towards the secretory pathway
- Cleaved during translocation across the membrane
- Exists in all 3 kingdoms of life

Our dataset
● FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which base pairs or amino acids are represented using single-letter codes.
● A sequence in FASTA format begins with a single-line description, followed by lines of sequence data.
● The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column.
● It is recommended that all lines of text be shorter than 80 characters in length.

Our dataset
● The FASTA file contains, for each protein (in order):
  ● Header (e.g. ">Q8TF40|EUKARYA|NO_SP|0")
  ● Protein sequence (first 70 residues only)
  ● Residue annotation

Our dataset
The header contains information about:
● The protein ID (e.g. "Q8TF40")
● The kingdom of life that the organism containing the protein belongs to (e.g. "EUKARYA")
● The type of signal peptide the protein contains (e.g. "NO_SP")
● The data set split the protein belongs to (e.g. "0")
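The record layout described above (header, sequence, residue annotation) can be read with a few lines of Python. This is only a sketch under stated assumptions: it assumes every sequence and every annotation fits on a single line, and the names `ProteinRecord` and `parse_records` are hypothetical, not part of any existing tool. The header in the toy input comes from the slide; the sequence and labels are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class ProteinRecord:
    protein_id: str   # e.g. "Q8TF40"
    kingdom: str      # e.g. "EUKARYA"
    sp_type: str      # e.g. "NO_SP"
    split: int        # cross-validation split, e.g. 0
    sequence: str     # amino acids, one letter each
    annotation: str   # one label per residue


def parse_records(lines: Iterable[str]) -> Iterator[ProteinRecord]:
    """Yield one record per header / sequence / annotation triple."""
    # Assumption: each protein occupies exactly 3 non-empty lines.
    cleaned = [line.strip() for line in lines if line.strip()]
    if len(cleaned) % 3 != 0:
        raise ValueError("expected header, sequence and annotation per protein")
    for start in range(0, len(cleaned), 3):
        header, sequence, annotation = cleaned[start:start + 3]
        if not header.startswith(">"):
            raise ValueError(f"not a FASTA header: {header!r}")
        fields = header[1:].split("|")
        if len(fields) != 4:
            raise ValueError(f"expected 4 header fields: {header!r}")
        protein_id, kingdom, sp_type, split = fields
        if len(sequence) != len(annotation):
            raise ValueError(f"sequence/annotation length mismatch: {protein_id}")
        yield ProteinRecord(protein_id, kingdom, sp_type, int(split),
                            sequence, annotation)


# Toy input: real header from the slide, invented sequence and labels.
TOY_FASTA = """>Q8TF40|EUKARYA|NO_SP|0
MAPTLFQ
IIIIIII
"""

records = list(parse_records(TOY_FASTA.splitlines()))
```

Checking that the sequence and the annotation have the same length catches misaligned records early, before they reach a model.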

Our dataset
● 20,758 proteins
● 4 signal peptide classes (SP, LIPO, TAT, NO_SP)
● 6 residue annotation labels (S, T, L, I, M, O)
● 20% sequence similarity

Our dataset
● 5 splits for cross-validation with similar residue distribution
● Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation.
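Because each header already carries a split number, k-fold cross-validation here does not need random shuffling: each split can be held out in turn while the others are used for training. The sketch below assumes the split numbers have already been extracted into a list; `predefined_folds` is a hypothetical helper name and the split values are toy data.

```python
from typing import Iterator, List, Sequence, Tuple


def predefined_folds(
    split_ids: Sequence[int],
) -> Iterator[Tuple[int, List[int], List[int]]]:
    """Yield (held_out_split, train_indices, test_indices) for every split."""
    for held_out in sorted(set(split_ids)):
        train = [i for i, s in enumerate(split_ids) if s != held_out]
        test = [i for i, s in enumerate(split_ids) if s == held_out]
        yield held_out, train, test


# Toy split assignment for 7 proteins; the real data set uses splits 0 to 4.
toy_splits = [0, 1, 2, 3, 4, 0, 1]
folds = list(predefined_folds(toy_splits))
```

With 5 distinct split numbers this yields 5 folds, so k = 5 and every protein is used for testing exactly once.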

Class distributions
● Strong dataset imbalance: most proteins don't contain Signal Peptides
Classes:
SP: Signal Peptide
LIPO: Lipoprotein Signal Peptide
TAT: Tat Signal Peptide
NO_SP: No Signal Peptide

Residue annotations
S: Sec/SPI signal peptide
T: Tat/SPI signal peptide
L: Sec/SPII signal peptide
I: Cytoplasm
M: Transmembrane
O: Extracellular
[Chart not reproduced; its two segments are labeled 91.25% and 8.75%]

Prediction Baseline

Dealing with class imbalance
● Undersampling (majority classes)
● Oversampling (minority classes)
● Class weights
● SMOTE (synthetic samples)
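Two of the options listed above, class weights and random oversampling, can be sketched with the standard library alone. The weighting rule used here, n_samples / (n_classes * class_count), is one common "balanced" heuristic and is an assumption, not necessarily what the presenters used; the function names and the toy label counts are illustrative only.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def balanced_class_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


def random_oversample(labels: Sequence[str], seed: int = 0) -> List[int]:
    """Return sample indices in which every class reaches the majority count."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    target = max(len(indices) for indices in by_class.values())
    resampled: List[int] = []
    for indices in by_class.values():
        resampled.extend(indices)
        # Draw the missing samples with replacement from the same class.
        resampled.extend(rng.choices(indices, k=target - len(indices)))
    return resampled


# Toy labels with an invented imbalance, not the real class counts.
toy_labels = ["NO_SP"] * 6 + ["SP"] * 3 + ["TAT"] * 1
weights = balanced_class_weights(toy_labels)
oversampled = random_oversample(toy_labels)
```

Oversampling should be applied to the training folds only; duplicating samples before splitting would leak copies of the same protein into the test fold.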

ELMo Embeddings
● ELMo: Embeddings from Language Models
● Used in Natural Language Processing
● In our case, embeddings represent the context of each residue
● Either 64 dim or 1024 dim per residue

Learning from high-dimensional data
● Reduce the dimensions
● t-SNE
● A dimensionality-reduction technique that tries to keep similar objects close together in the low-dimensional map -> visualization of high-dimensional datasets

PCA vs t-SNE

Results of t-SNE for the 64 dim embeddings

Results of t-SNE for the 64 dim embeddings for L signal peptides

Results of t-SNE for the 64 dim embeddings for S signal peptides

Results of t-SNE for the 64 dim embeddings for T signal peptides

Notes
● Results are based on perplexity = 30
● The 64 dim projections do not reveal a lot of information
● The 1024 dimensional embeddings may be more helpful
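A projection like the ones shown on the preceding slides could be produced roughly as follows. This is a sketch that assumes NumPy and scikit-learn are installed; it uses random numbers in place of the real 64 dim embeddings, and `project_2d` is a hypothetical helper name. Only the perplexity value of 30 comes from the slides.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def project_2d(embeddings: np.ndarray, perplexity: float = 30.0, seed: int = 0):
    """Return 2-D PCA and t-SNE coordinates for one embedding per row."""
    pca_xy = PCA(n_components=2, random_state=seed).fit_transform(embeddings)
    tsne_xy = TSNE(
        n_components=2,
        perplexity=perplexity,  # must be smaller than the number of rows
        init="pca",
        random_state=seed,
    ).fit_transform(embeddings)
    return pca_xy, tsne_xy


# Stand-in data: 200 random vectors with 64 dimensions, not real embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 64)).astype(np.float32)
pca_xy, tsne_xy = project_2d(fake_embeddings)
```

The two coordinate arrays can then be scatter-plotted and colored by residue label (S, T, L, I, M, O) to compare PCA with t-SNE. Since t-SNE output depends on perplexity, trying a few values besides 30 is a cheap way to check whether the lack of structure is a property of the data or of the setting.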

References
● https://zhanglab.ccmb.med.umich.edu/FASTA/
● https://machinelearningmastery.com/k-fold-cross-validation/
● https://towardsdatascience.com/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b

Thank you very much!
