Transfer String Kernel for Cross-Context Sequence Specific - PowerPoint PPT Presentation

Transfer String Kernel for Cross-Context Sequence Specific DNA-Protein Binding Prediction, by Ritambhara Singh. IIIT-Delhi, June 10, 2016.


  1. Transfer String Kernel for Cross-Context Sequence Specific DNA-Protein Binding Prediction, by Ritambhara Singh. IIIT-Delhi, June 10, 2016.

  2. Biology in a Slide. [Figure labels: CELL, PROTEIN, RNA, DNA, ORGANISM.]

  3. DNA and Diseases
     • Down Syndrome
     • Parkinson’s Disease
     • Autism
     • Muscular Atrophy
     • Sickle Cell Disease
     • …
     [Figure labels: CELL, PROTEIN, RNA, DNA, ORGANISM.]

  4. Transcription Factors. [Figure: a transcription factor bound to DNA next to a gene. Labels: Transcription Factor, Transcription Factor Binding Site, Gene. Example sequence: ATCGCGTAGCTAGGGATGACAGACACACATAATTCTAGATA…]

  5. ChIP-seq Maps TF Binding. [Figure: ChIP-seq produces a peak map for the TF along the genome. Labels: Transcription Factor, DNA, Gene, Transcription Factor Binding Site, ChIP-seq Peak Map for TF, Genome. Example sequences: ATATCGTATCTAAACCGCCTCGG…, ATATCGTATCTTTTAAACCGGGTTGGCCACTAGA…]

  6. TF Binding Differs Across Contexts. [Figure: example sequences from different contexts.]
     ATATCGTATCTTTTAAACCGGGTATGTAATGCAT…
     ATATCGTATCTAAACCGCCCGTGT…
     ATATCGTATCTTTTAAACCGGGTTGGCCAGTATA…
     ATATCGTATCTAAACCGCCCTGCA…

  7. Current Challenge: ENCODE Data Gap. [Figure: ENCODE data matrix with missing entries marked "?". Cell-type labels: Blood Cell, Stem Cell, Leukemia, Lung Cancer, Immunity related, Nerve Cell, Cervical Cancer.] Source: http://genome.ucsc.edu/ENCODE/dataMatrix/encodeChipMatrixHuman.html

  8. Case for Computational Tools

  9. Existing Computational Tools
     • Generative approaches: MEME, CISFINDER
     • Discriminative approaches: String Kernel + SVM

  10. Generative: PWM-Based Approach. [Figure: sequences under the ChIP-seq peak map for a TF along the genome (e.g. ATATCGTATAACAATAACCGGGAACTAATAGC…, ATATCGTATCTAACAAATCCTACT…) are summarized as a position weight matrix and a sequence logo.]

      Position weight matrix:

      Position   1   2   3   4   5   6   7   8   9  10  11  12
      A         14   0   0  14  28  40   9  45  42  13  15   9
      T         12   3   4  12  11  10   9   6   5  38  12   3
      C          3   0   1   8   2   2  36   2   2   0   1   0
      G          0   1   0  16  10   1   2   3   2   0   7  11
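
A matrix of this kind is commonly used by turning the counts into log-odds scores and sliding the matrix along a sequence. The sketch below does that with the counts from the slide. The pseudocount of 1, the uniform background of 0.25, and the helper names `log_odds` and `best_window` are assumptions made for illustration; the talk does not say how MEME or CisFinder score sites.

```python
import math

# Counts copied from the slide's position weight matrix (12 positions).
COUNTS = {
    "A": [14, 0, 0, 14, 28, 40, 9, 45, 42, 13, 15, 9],
    "T": [12, 3, 4, 12, 11, 10, 9, 6, 5, 38, 12, 3],
    "C": [3, 0, 1, 8, 2, 2, 36, 2, 2, 0, 1, 0],
    "G": [0, 1, 0, 16, 10, 1, 2, 3, 2, 0, 7, 11],
}
WIDTH = 12
BASES = "ATCG"


def log_odds(counts, pseudocount=1.0, background=0.25):
    """Convert per-position counts to log2(p / background) scores.

    The pseudocount and uniform background are illustrative assumptions.
    """
    table = {base: [] for base in BASES}
    for pos in range(WIDTH):
        total = sum(counts[base][pos] for base in BASES) + 4 * pseudocount
        for base in BASES:
            p = (counts[base][pos] + pseudocount) / total
            table[base].append(math.log2(p / background))
    return table


def best_window(seq, table):
    """Return (score, start, window) for the highest-scoring 12-mer in seq."""
    best = None
    for start in range(len(seq) - WIDTH + 1):
        window = seq[start:start + WIDTH]
        score = sum(table[base][i] for i, base in enumerate(window))
        if best is None or score > best[0]:
            best = (score, start, window)
    return best


PWM = log_odds(COUNTS)
# Example sequence taken from the slide.
score, start, window = best_window("ATATCGTATAACAATAACCGGGAACTAATAGC", PWM)
print(start, window, round(score, 2))
```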

  11. Generative Approach: Output. [Figure: genome sequences with unknown binding status, marked "?". Example sequences: ATATCGTATCTTTTAAACCGGGTTGGCCAATAGC…, ATATCGTATCTAAACCGCCCTACT…] Source: http://www.cbil.upenn.edu/EpoDB/release/version_2.2/meme/meme-output.html#sample

  12. Generative Approach: Limitations
      – Output: a long list of potential TFs
      – Works well only for well-preserved motifs or large training datasets
      – PWMs are not available for all ~2000 TFs
      – Lower prediction performance than discriminative approaches

  13. Discriminative Approach: Output. [Figure: genome sequences with unknown binding status, marked "?", each receive a label, +1 or -1. Labels: Genome, Peak. Example sequences: ATATCGTATCTTTTAAACCGGGTTGGCCAATAGC…, ATATCGTATCTAAACCGCCCTACT…]

  14. Discriminative: String Kernel Approach. [Figure label: Support Vector Machine.]

  15. Discriminative Approach: Limitation. Assumption: training and test data follow the same distribution regardless of context.

  16. Aim
      • Improve prediction of Transcription Factor Binding Sites across contexts using knowledge transfer.

  17. Proposed Solution: Cross-Context Knowledge Transfer

  18. Transfer String Kernel: Overview. [Flow diagram: Source Context (Xs) and Target Context (Xt) → Feature Conversion (for each context) → Knowledge Transfer (KMM) → Training → Classification. Example sequences: ATCGAT, ATACAT, GTATAC, GCTTAC.]

  19. Outline
      • Method
        – String Kernel
        – Support Vector Machine
        – Transfer Learning (KMM)
        – Importance re-weighting
        – Transfer String Kernel
      • Evaluation
        – Experimental Setup
        – Cross-context TFBS prediction
        – Cross-context Protein Binding prediction

  21. String Kernel: Spectrum Kernel. Feature map indexed by all k-length subsequences (“k-mers”) from alphabet Σ of amino acids, |Σ| = 20.
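
As a minimal sketch of this definition, the feature map can be stored as a dictionary of k-mer counts, and the kernel value is the dot product of two such count vectors. The function names are hypothetical. The example uses DNA strings rather than amino acids, since the experimental setup later in the talk works with the four-letter alphabet {A, T, C, G}.

```python
from collections import Counter


def spectrum_features(seq, k):
    """Count every contiguous k-mer in seq (the spectrum feature map)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k):
    """Dot product of the k-mer count vectors of a and b."""
    fa = spectrum_features(a, k)
    fb = spectrum_features(b, k)
    # A Counter returns 0 for k-mers that never occur in b.
    return sum(count * fb[kmer] for kmer, count in fa.items())


# "AT" occurs twice in each string and is the only shared 2-mer: 2 * 2 = 4.
print(spectrum_kernel("ATCGAT", "ATACAT", k=2))  # 4
```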

  22. String Kernel: Mismatch Kernel. For a k-mer s, the mismatch neighborhood N_(k,m)(s) is the set of all k-mers t within m mismatches from s.
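
The neighborhood definition can be made concrete by brute-force enumeration. The sketch below assumes the DNA alphabet and uses hypothetical function names. It lets every k-mer in a sequence contribute one count to each member of its neighborhood, which is the usual mismatch feature map. It is meant for small k and m only, not as an efficient implementation.

```python
from collections import Counter
from itertools import combinations, product


def mismatch_neighborhood(s, m, alphabet="ATCG"):
    """All strings of len(s) that differ from s in at most m positions."""
    neighbors = {s}
    for n_mis in range(1, m + 1):
        for positions in combinations(range(len(s)), n_mis):
            # At each chosen position, try every letter except the original.
            choices = [[c for c in alphabet if c != s[p]] for p in positions]
            for letters in product(*choices):
                t = list(s)
                for p, c in zip(positions, letters):
                    t[p] = c
                neighbors.add("".join(t))
    return neighbors


def mismatch_features(seq, k, m, alphabet="ATCG"):
    """Each k-mer in seq adds one count to every k-mer in its neighborhood."""
    feats = Counter()
    for i in range(len(seq) - k + 1):
        feats.update(mismatch_neighborhood(seq[i:i + k], m, alphabet))
    return feats


def mismatch_kernel(a, b, k, m, alphabet="ATCG"):
    """Dot product of the mismatch feature vectors of a and b."""
    fa = mismatch_features(a, k, m, alphabet)
    fb = mismatch_features(b, k, m, alphabet)
    return sum(count * fb[t] for t, count in fa.items())


# For k = 3, m = 1 over 4 letters: 1 exact match + 3 positions * 3 letters.
print(len(mismatch_neighborhood("ATC", 1)))  # 10
```

With m = 0 the neighborhood contains only s itself, so the kernel reduces to the spectrum kernel.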

  24. Support Vector Machine. [Figure: separating hyperplane w · x + b = 0 with its two margins. Positive instances (y = +1) satisfy w · x + b ≥ +1; negative instances (y = -1) satisfy w · x + b ≤ -1.]
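
The two inequalities on this slide can be folded into a single check, y (w · x + b) ≥ 1. The sketch below evaluates that check for a hand-picked hyperplane. The values of `w` and `b` and the sample points are made up for illustration; they are not learned from data.

```python
def decision_value(w, b, x):
    """Signed score w . x + b of point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Class label given by the side of the hyperplane w . x + b = 0."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def outside_margin(w, b, x, y):
    """True when (x, y) satisfies the margin constraint y (w . x + b) >= 1."""
    return y * decision_value(w, b, x) >= 1


# Illustrative hyperplane x1 + x2 - 3 = 0 (an assumption, not a trained model).
w, b = (1.0, 1.0), -3.0

print(outside_margin(w, b, (3.0, 1.0), +1))  # True: score is +1
print(outside_margin(w, b, (1.0, 1.0), -1))  # True: score is -1
print(outside_margin(w, b, (2.0, 1.5), +1))  # False: score 0.5 lies inside the margin
```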

  26. Transfer Learning (KMM). [Figure panels: True densities, Ratios. Curve labels: p_tr(x), p_te(x), r(x).]
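
Kernel mean matching (KMM) estimates one weight β_i per training point so that the weighted training sample matches the test sample in a kernel feature space; the weights approximate the density ratio r(x) = p_te(x) / p_tr(x). The sketch below is a simplified version written under several assumptions: one-dimensional inputs, a Gaussian kernel, made-up data, and projected gradient descent with only the box constraint 0 ≤ β_i ≤ B. The standard formulation also constrains the sum of the weights, which is omitted here, and is normally solved with a quadratic programming solver.

```python
import math


def rbf(a, b, sigma=1.0):
    """Gaussian kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2 * sigma ** 2))


def kmm_weights(train, test, upper=10.0, sigma=1.0, steps=2000):
    """Minimize 0.5 * beta' K beta - kappa' beta subject to 0 <= beta_i <= upper.

    Simplified KMM: the usual constraint on sum(beta) is left out.
    """
    n_tr, n_te = len(train), len(test)
    K = [[rbf(a, c, sigma) for c in train] for a in train]
    kappa = [n_tr / n_te * sum(rbf(a, t, sigma) for t in test) for a in train]
    # trace(K) bounds the largest eigenvalue of K, so 1 / trace(K) is a safe step.
    step = 1.0 / sum(K[i][i] for i in range(n_tr))
    beta = [1.0] * n_tr
    for _ in range(steps):
        grad = [
            sum(K[i][j] * beta[j] for j in range(n_tr)) - kappa[i]
            for i in range(n_tr)
        ]
        # Gradient step followed by projection onto the box [0, upper].
        beta = [min(upper, max(0.0, bi - step * g)) for bi, g in zip(beta, grad)]
    return beta


# Made-up data: training points spread over [-2, 2], test points near 1.5.
train = [-2.0, -1.0, 0.0, 1.0, 2.0]
test = [1.0, 1.5, 2.0]
beta = kmm_weights(train, test)
print([round(bi, 3) for bi in beta])
```

Training points that lie where the test data concentrate receive the larger weights, which is the behavior the density-ratio picture describes.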

  28. Importance Re-weighting. [Figure panels: Original Weights, KMM Weights.]
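
One common way to use such weights is to multiply each training example's hinge loss by its weight in the SVM objective. The sketch below only evaluates that weighted objective for a fixed linear classifier; it does not train anything. The regularization constant `lam`, the data, and the weights are assumptions chosen for illustration, and the talk does not state the exact objective it used.

```python
def decision_value(w, b, x):
    """Signed score w . x + b of point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def weighted_objective(w, b, X, y, beta, lam=0.1):
    """lam * ||w||^2 plus the beta-weighted sum of hinge losses."""
    reg = lam * sum(wi * wi for wi in w)
    loss = sum(
        bi * max(0.0, 1.0 - yi * decision_value(w, b, xi))
        for xi, yi, bi in zip(X, y, beta)
    )
    return reg + loss


# Two well-separated points and one point that lies on the decision boundary.
X = [(2.0, 0.0), (0.0, 2.0), (0.5, 0.5)]
y = [+1, -1, +1]
w, b = (1.0, -1.0), 0.0

uniform = weighted_objective(w, b, X, y, beta=[1.0, 1.0, 1.0])
reweighted = weighted_objective(w, b, X, y, beta=[0.2, 0.2, 3.0])
print(uniform, reweighted)
```

Raising the weight of the third point makes its margin violation three times as costly, so a learner minimizing the weighted objective is pushed to fit that point better.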

  30. Transfer String Kernel (TSK). [Flow diagram: Source Context and Target Context sequences (examples: ATCGATCG, CCCGATCG, ATCGATCG, CTCGCTCC) → Feature Conversion (Mismatch String Kernel, for each context) → Knowledge Transfer (Kernel Mean Matching, KMM) → Training (Importance Re-weighting) → Classification (SVM).]

  32. Experimental Setup
      • 14 Transcription Factors (ENCODE ChIP-seq)
      • Top 1000 positive sequences (500 training and 500 testing)
      • 1000 random negative sequences
      • Hyper-parameter tuning for k = (8, 10, 12) and m = (1, 2, 3)
      • Dictionary size = 4 {A, T, C, G}
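
To give a sense of scale for this setup, the short sketch below enumerates the (k, m) grid and the number of possible k-mers for each k over the four-letter dictionary. The variable names are hypothetical, and how the talk selected the best setting from the grid is not stated.

```python
from itertools import product

K_VALUES = (8, 10, 12)
M_VALUES = (1, 2, 3)
ALPHABET = "ATCG"

# Every (k, m) combination that the tuning step would have to evaluate.
grid = list(product(K_VALUES, M_VALUES))

# Number of distinct k-mers, i.e. the dimension of the feature space.
feature_dims = {k: len(ALPHABET) ** k for k in K_VALUES}

print(len(grid))       # 9
print(feature_dims)    # {8: 65536, 10: 1048576, 12: 16777216}
```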

  34. Results

  35. Results – Cross Context. [Bar chart: AUC score (y-axis, 0.80 to 0.90) for the transcription factors Sin3a, Max, Mxi1, Chd2 and Ctcf (x-axis), comparing TSK and SK.]
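
The AUC score on this chart can be read as the probability that a randomly chosen positive sequence receives a higher classifier score than a randomly chosen negative one. The sketch below computes it by comparing all positive and negative pairs. The scores and labels are made up for illustration and are not results from the talk.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == -1]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up scores: three of the four positive/negative pairs are ordered correctly.
print(auc([0.9, 0.8, 0.3, 0.1], [1, -1, 1, -1]))  # 0.75
```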

  37. Results – Cross Context

  38. Summary
      • TSK overall improves cross-context TFBS predictions.
      • String kernel based approaches perform better than state-of-the-art Position Weight/Frequency Matrix based TFBS tools.
      • The TSK approach is generalizable to improving performance on any cross-context sequence prediction task.
      Presented at BIOKDD ’15.

  39. Acknowledgements: Dr. Mazhar Adli (Adli Lab, Department of Biochemistry and Molecular Genetics @ UVa); Nipun Batra (IIIT-Delhi).

  40. Machine Learning Lab @ UVa: Beilun Wang, Dr. Yanjun Qi (Advisor), Weilin Xu, Jack Lanchantin, Ji Gao.

  41. Future Directions
      • Deep Learning:
        – Gene expression prediction using histone modification data (ECCB 2016)
        – Improving TFBS prediction using DNA sequences (ICLR Workshop 2016, ICML Workshop 2016)
      • String Kernels: improving efficiency (ongoing work)

  42. Thank You
