Analysis of the Signal Peptide dataset



  1. Analysis of the Signal Peptide dataset, November 28, 2019

  2. Signal Peptide
     - A short peptide (typically 15-30 residues long) that directs a protein towards the secretory pathway
     - Cleaved off during translocation across the membrane
     - Found in all 3 kingdoms of life

  3. Our dataset
     ● FASTA is a text-based format for representing nucleotide or peptide sequences, in which nucleotides or amino acids are written as single-letter codes. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than symbol (">") in the first column. It is recommended that all lines of text be shorter than 80 characters.

  4. Our dataset
     For each protein, the FASTA file contains, in order:
     ● Header (e.g. ">Q8TF40|EUKARYA|NO_SP|0")
     ● Protein sequence (first 70 residues only)
     ● Residue annotation

  5. Our dataset
     The header contains information about:
     ● The protein ID (e.g. "Q8TF40")
     ● The kingdom of life of the organism that contains the protein (e.g. "EUKARYA")
     ● The type of signal peptide the protein contains (e.g. "NO_SP")
     ● The data set split the protein belongs to (e.g. "0")
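
The record layout described on slides 4 and 5 can be read with a few lines of Python. This is a minimal sketch, assuming every protein occupies exactly three non-blank lines (header, sequence, annotation) and that the header always has four "|"-separated fields; the names `Record` and `parse_records` are made up for illustration, and the sequence in the demo is a toy string, not real data.

```python
from typing import Iterable, Iterator, NamedTuple


class Record(NamedTuple):
    protein_id: str   # e.g. "Q8TF40"
    kingdom: str      # e.g. "EUKARYA"
    sp_type: str      # e.g. "NO_SP"
    split: int        # cross-validation split, e.g. 0
    sequence: str     # amino acids, single-letter codes
    annotation: str   # one label per residue


def parse_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one Record per (header, sequence, annotation) triple."""
    # Drop blank lines so the input can be consumed in groups of three.
    rows = [line.strip() for line in lines if line.strip()]
    if len(rows) % 3 != 0:
        raise ValueError("expected three lines per protein")
    for i in range(0, len(rows), 3):
        header, sequence, annotation = rows[i:i + 3]
        if not header.startswith(">"):
            raise ValueError(f"not a header line: {header!r}")
        fields = header[1:].split("|")
        if len(fields) != 4:
            raise ValueError(f"expected 4 header fields: {header!r}")
        if len(sequence) != len(annotation):
            raise ValueError(f"sequence/annotation length mismatch: {header!r}")
        protein_id, kingdom, sp_type, split = fields
        yield Record(protein_id, kingdom, sp_type, int(split), sequence, annotation)


if __name__ == "__main__":
    # Toy record: the header is the example from the slides, the rest is invented.
    demo = [">Q8TF40|EUKARYA|NO_SP|0", "MKTA", "IIII"]
    for record in parse_records(demo):
        print(record.protein_id, record.kingdom, record.sp_type, record.split)
```

To read a file, pass the open file handle directly, for example `parse_records(open(path))`, where `path` is wherever the dataset is stored.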

  6. Our dataset
     ● 20,758 proteins
     ● 4 classes (three signal peptide types plus NO_SP)
     ● 6 residue annotation labels
     ● 20% sequence similarity

  7. Our dataset
     ● 5 splits for cross-validation with similar residue distribution
     ● Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. Its single parameter, k, is the number of groups the data sample is split into, which is why the procedure is often called k-fold cross-validation (here k = 5).
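
Because each header already carries a split number, k-fold cross-validation amounts to holding out one split at a time. The following is a minimal sketch under that assumption; `cross_validation_folds` is a hypothetical helper name, and the input is simply the list of split numbers in dataset order.

```python
from typing import Iterator, List, Sequence, Tuple


def cross_validation_folds(
    splits: Sequence[int],
) -> Iterator[Tuple[int, List[int], List[int]]]:
    """Yield (held_out_split, train_indices, test_indices) for every split."""
    for held_out in sorted(set(splits)):
        # Proteins in the held-out split are used for evaluation only.
        test = [i for i, s in enumerate(splits) if s == held_out]
        # All remaining proteins form the training set for this fold.
        train = [i for i, s in enumerate(splits) if s != held_out]
        yield held_out, train, test


if __name__ == "__main__":
    # Toy split assignment for seven proteins (invented for illustration).
    toy_splits = [0, 1, 2, 3, 4, 0, 1]
    for held_out, train, test in cross_validation_folds(toy_splits):
        print(f"split {held_out}: {len(train)} train, {len(test)} test")
```

Reusing the predefined splits, rather than shuffling the proteins again, keeps the similar residue distribution that the splits were built to have.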

  8. Class distributions
     ● Strong dataset imbalance: most proteins don’t contain signal peptides
     SP: Signal Peptide
     LIPO: Lipoprotein Signal Peptide
     TAT: Tat Signal Peptide
     NO_SP: No Signal Peptide

  9. Residue annotations
     S: Sec/SPI signal peptide
     T: Tat/SPI signal peptide
     L: Sec/SPII signal peptide
     I: Cytoplasm
     M: Transmembrane
     O: Extracellular
     [Chart: residue label shares, 91.25% vs. 8.75%]
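
The share of each residue label can be recomputed from the annotation strings themselves. This sketch assumes the annotations use only the six single-letter labels listed on the slide; `residue_label_shares` is an illustrative name, and the demo strings are invented.

```python
from collections import Counter
from typing import Dict, Iterable

# Label meanings as listed on the slide.
LABELS = {
    "S": "Sec/SPI signal peptide",
    "T": "Tat/SPI signal peptide",
    "L": "Sec/SPII signal peptide",
    "I": "Cytoplasm",
    "M": "Transmembrane",
    "O": "Extracellular",
}


def residue_label_shares(annotations: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of residues carrying each label."""
    counts: Counter = Counter()
    for annotation in annotations:
        counts.update(annotation)  # counts every character of the string
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


if __name__ == "__main__":
    # Two toy annotation strings, not taken from the dataset.
    for label, share in residue_label_shares(["SSII", "OOOO"]).items():
        print(f"{label} ({LABELS[label]}): {share:.2%}")
```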

  10. Prediction Baseline

  11. Dealing with class imbalance
      ● Undersampling (majority classes)
      ● Oversampling (minority classes)
      ● Class weights
      ● SMOTE (synthetic samples)
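
Two of the options above, class weights and oversampling, are simple enough to sketch with the standard library. The weighting below is the common "balanced" heuristic, n_samples / (n_classes * class_count), which is an assumption and not necessarily the scheme used in this project; the oversampler duplicates minority-class examples at random and does not create synthetic samples the way SMOTE does. Both function names are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence


def class_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Weight each class inversely to its frequency ("balanced" heuristic)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


def random_oversample(labels: Sequence[str], seed: int = 0) -> List[int]:
    """Return sample indices in which every class is as large as the largest."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    target = max(len(members) for members in by_class.values())
    resampled: List[int] = []
    for members in by_class.values():
        resampled.extend(members)
        # Draw the missing examples with replacement from the same class.
        resampled.extend(rng.choices(members, k=target - len(members)))
    return resampled


if __name__ == "__main__":
    # Toy label list with a 3:1 imbalance (invented for illustration).
    toy = ["NO_SP"] * 6 + ["SP"] * 2
    print(class_weights(toy))
    print(Counter(toy[i] for i in random_oversample(toy)))
```

Resampling is normally applied to the training portion of each cross-validation fold only, so that the held-out split keeps the original class distribution.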

  12. ELMo Embeddings
      ● ELMo: Embeddings from Language Models
      ● Used in Natural Language Processing
      ● In our case, embeddings represent the context of each residue
      ● Either 64 or 1024 dimensions per residue

  13. Learning from high-dimensional data
      ● Reduce the dimensions
      ● t-SNE
      ● Dimensionality-reduction techniques aim to preserve the relative proximity of data points, which makes them useful for visualizing (and spotting clusters in) high-dimensional datasets

  14. PCA vs t-SNE

  15. Results of t-SNE for the 64 dim embeddings

  16. Results of t-SNE for the 64 dim embeddings for L signal peptides

  17. Results of t-SNE for the 64 dim embeddings for S signal peptides

  18. Results of t-SNE for the 64 dim embeddings for T signal peptides

  19. Notes
      ● Results were obtained with perplexity = 30
      ● The 64-dimensional plots convey little information
      ● The 1024-dimensional embeddings may be more helpful
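
For context on the perplexity setting: in t-SNE, the perplexity of a point's neighbor distribution P is defined as 2^H(P), where H(P) is the Shannon entropy in bits, and the algorithm tunes each point's Gaussian bandwidth until that value matches the chosen perplexity. It can be read as a smooth measure of the effective number of neighbors. The sketch below only evaluates the definition for a given distribution; it is not a t-SNE implementation, and `perplexity` is an illustrative name.

```python
import math
from typing import Sequence


def perplexity(probabilities: Sequence[float]) -> float:
    """Return 2 ** H(P), with H the Shannon entropy of P in bits."""
    total = sum(probabilities)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    entropy = 0.0
    for p in probabilities:
        q = p / total  # normalize so the values form a distribution
        if q > 0:      # terms with q == 0 contribute nothing
            entropy -= q * math.log2(q)
    return 2.0 ** entropy


if __name__ == "__main__":
    # A uniform distribution over 30 neighbors has perplexity of about 30.
    print(perplexity([1 / 30] * 30))
    # Concentrating the mass on a few neighbors lowers the value.
    print(perplexity([0.7, 0.1, 0.1, 0.1]))
```

A setting of 30 therefore asks each residue embedding to treat roughly 30 other points as its neighborhood.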

  20. References
      ● https://zhanglab.ccmb.med.umich.edu/FASTA/
      ● https://machinelearningmastery.com/k-fold-cross-validation/
      ● https://towardsdatascience.com/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b

  21. Thank you very much!
