Analysis of the Signal Peptide dataset (November 28, 2019)

SLIDE 1

Analysis of the Signal Peptide dataset

November 28, 2019

SLIDE 2

Signal Peptide

  • A short peptide (typically 15-30 residues long), destined for the secretory pathway
  • Cleaved during translocation across the membrane
  • Exists in all 3 kingdoms of life

SLIDE 3

Our dataset

  • FASTA is a text-based format for representing either nucleotide sequences or peptide sequences, in which base pairs or amino acids are represented using single-letter codes.
  • A sequence in FASTA format begins with a single-line description, followed by lines of sequence data.
  • The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column.
  • It is recommended that all lines of text be shorter than 80 characters.

SLIDE 4

Our dataset

  • The FASTA file contains for each protein (in order):
  • Header (e.g. ">Q8TF40|EUKARYA|NO_SP|0")
  • Protein sequence (first 70 residues only)
  • Residue annotation
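
A minimal sketch of how such a file could be read, assuming each protein occupies exactly three consecutive lines (header, sequence, annotation) and that the annotation has one label per residue. The function name `read_records` and the example sequence are hypothetical, not taken from the dataset:

```python
from typing import Iterator, List, Tuple


def read_records(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, annotation) triples from a 3-line-per-protein file."""
    # Drop blank lines and surrounding whitespace before grouping.
    cleaned = [line.strip() for line in lines if line.strip()]
    if len(cleaned) % 3 != 0:
        raise ValueError("expected header, sequence and annotation for every protein")
    for i in range(0, len(cleaned), 3):
        header, sequence, annotation = cleaned[i:i + 3]
        if not header.startswith(">"):
            raise ValueError(f"record {i // 3} does not start with '>'")
        # Assumption: one annotation label per residue.
        if len(sequence) != len(annotation):
            raise ValueError(f"record {i // 3}: sequence and annotation lengths differ")
        yield header[1:], sequence, annotation


# Made-up example record (sequence and annotation are illustrative only).
example = [
    ">Q8TF40|EUKARYA|NO_SP|0",
    "MAPTLFQKLF",
    "IIIIIIIIII",
]
records = list(read_records(example))
```

For a real file, the same function could be applied to the lines returned by `open(path).readlines()`.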

SLIDE 5

Our dataset

The header contains information about:

  • The protein ID (e.g. "Q8TF40")
  • The kingdom of life that the organism containing the protein belongs to (e.g. "EUKARYA")

  • The type of signal peptide the protein contains (e.g. "NO_SP")
  • The data set split the protein belongs to (e.g. "0")
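
The four header fields can be split on the "|" separator. The sketch below assumes every header has exactly these four fields in this order; the names `Header` and `parse_header` are hypothetical:

```python
from typing import NamedTuple


class Header(NamedTuple):
    protein_id: str  # e.g. "Q8TF40"
    kingdom: str     # e.g. "EUKARYA"
    sp_type: str     # e.g. "NO_SP"
    split: int       # data set split, e.g. 0


def parse_header(line: str) -> Header:
    """Parse a header line such as '>Q8TF40|EUKARYA|NO_SP|0'."""
    fields = line.strip().lstrip(">").split("|")
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    protein_id, kingdom, sp_type, split = fields
    return Header(protein_id, kingdom, sp_type, int(split))


header = parse_header(">Q8TF40|EUKARYA|NO_SP|0")
```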

SLIDE 6

Our dataset

  • 20,758 proteins
  • 4 types of signal peptides
  • 6 residue types
  • 20% sequence similarity

SLIDE 7

Our dataset

  • 5 splits for cross-validation with similar residue distribution
  • Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.
  • The procedure has a single parameter, k, which is the number of groups that the data sample is split into. As such, the procedure is often called k-fold cross-validation.
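
With predefined splits, k-fold cross-validation amounts to holding out one split at a time and training on the rest. A minimal sketch, assuming the split id of each protein is available as an integer; the function name and the toy split ids are hypothetical:

```python
from typing import Iterator, List, Tuple


def cross_validation_folds(
    splits: List[int],
) -> Iterator[Tuple[int, List[int], List[int]]]:
    """Yield (held_out_split, train_indices, test_indices) for each split id."""
    for held_out in sorted(set(splits)):
        train = [i for i, s in enumerate(splits) if s != held_out]
        test = [i for i, s in enumerate(splits) if s == held_out]
        yield held_out, train, test


# Hypothetical split ids for ten proteins (two proteins per split).
splits = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
folds = list(cross_validation_folds(splits))
```

Each protein appears in the test set of exactly one fold, so every protein is used for evaluation once.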

SLIDE 8

Class distributions

  • Strong dataset imbalance: most proteins don’t contain signal peptides

SP: Signal Peptide
LIPO: Lipoprotein Signal Peptide
TAT: Tat Signal Peptide
NO_SP: No Signal Peptide
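
The class distribution can be tallied directly from the header lines. This is a sketch on made-up headers (the protein IDs below are placeholders, not real entries), assuming the signal peptide type is the third "|"-separated field:

```python
from collections import Counter

# Placeholder headers for illustration only.
headers = [
    ">ID_A|EUKARYA|NO_SP|0",
    ">ID_B|EUKARYA|NO_SP|1",
    ">ID_C|NEGATIVE|SP|2",
    ">ID_D|POSITIVE|LIPO|3",
    ">ID_E|ARCHAEA|NO_SP|4",
]

# The third field of the header holds the signal peptide type.
counts = Counter(h.split("|")[2] for h in headers)
total = sum(counts.values())
fractions = {label: n / total for label, n in counts.items()}
```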

SLIDE 9

Residue annotations

S: Sec/SPI signal peptide
T: Tat/SPI signal peptide
L: Sec/SPII signal peptide
I: Cytoplasm
M: Transmembrane
O: Extracellular

[Figure: distribution of residue annotations, with labeled shares of 91.25% and 8.75%]

SLIDE 10

Prediction Baseline

SLIDE 11

Dealing with class imbalance

  • Undersampling (majority classes)
  • Oversampling (minority classes)
  • Class weights
  • SMOTE (synthetic samples)
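
Two of these options, class weights and random oversampling, can be sketched with the standard library alone. The inverse-frequency weighting formula and both function names are assumptions chosen for illustration, not the method used in the slides; SMOTE is not shown because it interpolates new feature vectors rather than resampling existing ones:

```python
import random
from collections import Counter
from typing import Dict, List


def class_weights(labels: List[str]) -> Dict[str, float]:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


def oversample_indices(labels: List[str], seed: int = 0) -> List[int]:
    """Duplicate random minority-class indices until every class matches the largest."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    target = max(len(idx) for idx in by_class.values())
    result: List[int] = []
    for idx in by_class.values():
        result.extend(idx)
        # Draw the missing samples with replacement from the same class.
        result.extend(rng.choices(idx, k=target - len(idx)))
    return result


# Toy labels with the same kind of imbalance (made-up proportions).
labels = ["NO_SP"] * 6 + ["SP"] * 3 + ["TAT"]
weights = class_weights(labels)
balanced = oversample_indices(labels)
```

Rare classes receive larger weights, so errors on them cost more during training. Oversampling instead equalizes the class counts in the training set itself.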

SLIDE 12

ELMo Embeddings

  • ELMo: Embeddings from Language Models

  • Used in Natural Language Processing
  • In our case, embeddings represent the context of each residue
  • Either 64 dim or 1024 dim per residue

SLIDE 13

Learning from high-dimensional data

  • Reduce the dimensions
  • t-SNE
  • A technique for dimensionality reduction that preserves the local neighborhood structure of the objects (points that are close in the original space stay close in the projection)
  • => Visualization of high-dimensional datasets
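
A minimal sketch of projecting 64-dimensional embeddings to two dimensions with PCA and t-SNE, assuming NumPy and scikit-learn are installed. The random matrix is a synthetic stand-in for the real per-residue embeddings, and the sample count of 150 is arbitrary:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic stand-in for per-residue embeddings: 150 residues x 64 dimensions.
embeddings = rng.normal(size=(150, 64))

# Linear projection onto the two directions of largest variance.
pca_coords = PCA(n_components=2).fit_transform(embeddings)

# Non-linear projection; perplexity must be smaller than the number of samples.
tsne_coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
```

Both results have one row per residue and two columns, so they can be drawn as a scatter plot colored by residue annotation.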

SLIDE 14

PCA vs t-SNE

SLIDE 15

Results of t-SNE for the 64 dim embeddings

SLIDE 16

Results of t-SNE for the 64 dim embeddings for L signal peptides

SLIDE 17

Results of t-SNE for the 64 dim embeddings for S signal peptides

SLIDE 18

Results of t-SNE for the 64 dim embeddings for T signal peptides

SLIDE 19

Notes

  • Results are based on perplexity = 30
  • The plots do not reveal much information
  • The 1024-dimensional embeddings may be more helpful
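
For context, perplexity in t-SNE is standardly defined from the conditional neighbor distribution of each point. The symbols below follow the usual t-SNE formulation and do not appear in the slides: $P_i$ is the distribution over neighbors $j$ of point $i$, and $p_{j|i}$ is the probability that $i$ picks $j$ as its neighbor.

```latex
\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad
H(P_i) = -\sum_{j} p_{j|i} \log_2 p_{j|i}
```

A perplexity of 30 can be read loosely as each point having about 30 effective neighbors.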

SLIDE 20

References

  • https://zhanglab.ccmb.med.umich.edu/FASTA/
  • https://machinelearningmastery.com/k-fold-cross-validation/
  • https://towardsdatascience.com/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b

SLIDE 21

Thank you very much!