Transfer String Kernel for Cross-Context Sequence Specific DNA-Protein Binding Prediction


SLIDE 1

Transfer String Kernel for Cross-Context Sequence Specific DNA-Protein Binding Prediction

by Ritambhara Singh, IIIT-Delhi, June 10, 2016

SLIDE 2

Biology in a Slide

(Diagram labels: DNA, RNA, Protein, Cell, Organism)

SLIDE 3

DNA and Diseases

  • Down Syndrome
  • Parkinson’s Disease
  • Autism
  • Muscular Atrophy
  • Sickle Cell Disease

…

SLIDE 4

Transcription Factors

(Diagram labels: Gene, Transcription Factor, Transcription Factor Binding Site)

Example DNA sequence: ATCGCGTAGCTAGGGATGACAGACACACATAATTCTAGATA

SLIDE 5

ChIP-seq Maps TF binding

(Diagram labels: Transcription Factor, Gene, Genome, DNA, ChIP-seq, ChIP-seq Map for TF, Peak, Transcription Factor Binding Site)

Example genome sequences:
ATATCGTATCTTTTAAACCGGGTTGGCCACTAGA
ATATCGTATCTAAACCGCCTCGG

SLIDE 6

TF Binding Differs Across Contexts

Example sequences from the figure:
ATATCGTATCTTTTAAACCGGGTATGTAATGCAT
ATATCGTATCTAAACCGCCCGTGT
ATATCGTATCTTTTAAACCGGGTTGGCCAGTATA
ATATCGTATCTAAACCGCCCTGCA

SLIDE 7

Current Challenge: ENCODE Data Gap

(Figure: ENCODE ChIP-seq data matrix with “?” marks. Context labels: Blood Cell, Stem Cell, Leukemia, Lung Cancer, Cervical Cancer, Nerve Cell, Immunity related)

Source : http://genome.ucsc.edu/ENCODE/dataMatrix/encodeChipMatrixHuman.html

SLIDE 8

Case for Computational Tools

SLIDE 9

Existing Computational Tools

  • Generative Approaches: MEME, CISFINDER
  • Discriminative Approaches: STRING KERNEL+SVM

SLIDE 10

Generative : PWM Based approach

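(Figure labels: Genome, ChIP-seq Map for TF, Peak, Sequence Logo)

Example genome sequences:
ATATCGTATAACAATAACCGGGAACTAATAGC
ATATCGTATCTAACAAATCCTACT

Position Weight Matrix (values as extracted; rows A, C and G have fewer than 12 entries because empty cells were lost, so their column alignment cannot be recovered):

Position: 1 2 3 4 5 6 7 8 9 10 11 12
A: 14 14 28 40 9 45 42 13 15 9
T: 12 3 4 12 11 10 9 6 5 38 12 3
C: 3 1 8 2 2 36 2 2 1
G: 1 16 10 1 2 3 2 7 11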
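
As a rough illustration of the PWM idea, the sketch below counts bases per position over a few aligned binding sites, converts the counts to log-odds weights, and scores a candidate sequence. The site list, the pseudocount of 1, the uniform background of 0.25, and all function names are assumptions made for this example; none of them come from the talk.

```python
import math

ALPHABET = "ACGT"

def count_matrix(sites):
    """Count each base at each position of equal-length aligned sites."""
    length = len(sites[0])
    counts = [{base: 0 for base in ALPHABET} for _ in range(length)]
    for site in sites:
        for pos, base in enumerate(site):
            counts[pos][base] += 1
    return counts

def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2 odds against a uniform background (assumed)."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + pseudocount * len(ALPHABET)
        pwm.append({
            base: math.log2((column[base] + pseudocount) / total / background)
            for base in ALPHABET
        })
    return pwm

def score(pwm, sequence):
    """Sum the per-position weights for a sequence as long as the PWM."""
    return sum(column[base] for column, base in zip(pwm, sequence))

# Hypothetical aligned binding sites, for illustration only.
sites = ["AACAAT", "AACAAA", "TACAAT", "AACTAT"]
counts = count_matrix(sites)
pwm = log_odds_matrix(counts)
```

A sequence close to the consensus of the sites, such as `score(pwm, "AACAAT")`, scores higher than an unrelated one, such as `score(pwm, "GGGTGG")`.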

SLIDE 11

Generative Approach : Output

(Figure: genome sequences ATATCGTATCTTTTAAACCGGGTTGGCCAATAGC and ATATCGTATCTAAACCGCCCTACT, each marked “?”)

Source : http://www.cbil.upenn.edu/EpoDB/release/version_2.2/meme/meme-output.html#sample

SLIDE 12

Generative Approach: Limitations

– Output: long list of potential TFs
– Works well only for well-preserved motifs or large training datasets
– PWMs are not available for all ~2000 TFs
– Lower prediction performance than discriminative approaches

SLIDE 13

Discriminative Approach : Output

(Figure: genome sequences ATATCGTATCTTTTAAACCGGGTTGGCCAATAGC and ATATCGTATCTAAACCGCCCTACT, each marked “?”; region labels: Peak, Genome; output labels: +1, −1)

SLIDE 14

Discriminative : String Kernel Approach

Support Vector Machine

SLIDE 15

Discriminative Approach : Limitation

Assumption: training and test data follow the same distribution, regardless of context.

SLIDE 16

Aim

  • Improve prediction of transcription factor binding sites across contexts using knowledge transfer.

SLIDE 17

Proposed Solution : Cross-Context Knowledge Transfer

SLIDE 18

Transfer String Kernel : Overview

Pipeline stages shown in the diagram:

  • Source Context and Target Context sequences (e.g. ATCGAT, GTATAC, ATACAT, GCTTAC)
  • Feature Conversion (one per context, giving Xs and Xt)
  • Knowledge Transfer (KMM)
  • Training
  • Classification

SLIDE 19

Outline

  • Method
    – String Kernel
    – Support Vector Machine
    – Transfer Learning (KMM)
    – Importance re-weighting
    – Transfer String Kernel
  • Evaluation
    – Experimental Setup
    – Cross-context TFBS prediction
    – Cross-context Protein Binding prediction

SLIDE 21

String Kernel : Spectrum Kernel

Feature map indexed by all k-length subsequences (“k-mers”) from an alphabet Σ. For amino acids, |Σ| = 20; for DNA sequences, |Σ| = 4 ({A,T,C,G}).
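
A minimal sketch of the spectrum kernel, assuming a DNA alphabet and plain Python strings: each sequence is mapped to its k-mer counts, and the kernel value is the inner product of the two count vectors. The function names are illustrative and not taken from the talk.

```python
from collections import Counter

def spectrum_features(sequence, k):
    """Count every contiguous k-mer in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def spectrum_kernel(seq_a, seq_b, k):
    """Inner product of the two k-mer count vectors."""
    feats_a = spectrum_features(seq_a, k)
    feats_b = spectrum_features(seq_b, k)
    # Counter returns 0 for k-mers that never occur in seq_b.
    return sum(count * feats_b[kmer] for kmer, count in feats_a.items())

features = spectrum_features("ATCGAT", 2)
similarity = spectrum_kernel("ATCGAT", "GTATAC", 2)
```

Here the two sequences share only the 2-mer `AT` (twice in the first, once in the second), so only that k-mer contributes to the kernel value.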

SLIDE 22

String Kernel : Mismatch Kernel

For a k-mer s, the mismatch neighborhood N_(k,m)(s) is the set of all k-mers t within m mismatches of s.
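
The definition can be sketched in a few lines, assuming a DNA alphabet: every k-mer in a sequence contributes a count to each member of its mismatch neighborhood, and the kernel is again an inner product of count vectors. This brute-force enumeration is for illustration only and is not the efficient algorithm a real implementation would use; all names are hypothetical.

```python
from collections import Counter
from itertools import combinations, product

ALPHABET = "ACGT"

def mismatch_neighborhood(kmer, m):
    """All k-mers over ALPHABET within Hamming distance m of kmer."""
    neighbors = {kmer}
    for n_subs in range(1, m + 1):
        for positions in combinations(range(len(kmer)), n_subs):
            for letters in product(ALPHABET, repeat=n_subs):
                candidate = list(kmer)
                for pos, letter in zip(positions, letters):
                    candidate[pos] = letter
                neighbors.add("".join(candidate))
    return neighbors

def mismatch_features(sequence, k, m):
    """Each observed k-mer adds one count to every k-mer in its neighborhood."""
    features = Counter()
    for i in range(len(sequence) - k + 1):
        features.update(mismatch_neighborhood(sequence[i:i + k], m))
    return features

def mismatch_kernel(seq_a, seq_b, k, m):
    """Inner product of the two mismatch feature vectors."""
    feats_a = mismatch_features(seq_a, k, m)
    feats_b = mismatch_features(seq_b, k, m)
    return sum(count * feats_b[kmer] for kmer, count in feats_a.items())

neighborhood = mismatch_neighborhood("ATC", 1)
```

With m = 0 this reduces to the spectrum kernel. With m = 1, `ATC` and `ATG` have kernel value 0 under exact matching but a positive value under the mismatch kernel, because their neighborhoods overlap.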

SLIDE 24

Support Vector Machine

  • Positive instances (y = +1): w · x + b ≥ +1
  • Negative instances (y = −1): w · x + b ≤ −1
  • Decision boundary: w · x + b = 0
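
The slide shows only the constraints. As a reminder, the standard hard-margin SVM problem consistent with them can be written as below; the index i and the sample count n are notation assumed here, not taken from the slide. The two constraints combine into one because y_i is +1 or −1, and the resulting margin between the two classes has width 2/‖w‖.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n
```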

SLIDE 26

Transfer Learning (KMM)

(Figure panels: true densities p_tr(x) and p_te(x); ratios r(x))
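
The figure plots the training and test densities and their ratio. Assuming the usual covariate-shift reading, the ratio and the Kernel Mean Matching objective that estimates it without explicit density estimation can be written as follows. The symbols β_i (weight of training example i), φ (kernel feature map), B and ε (bounds) follow the standard KMM formulation and are assumptions here, not notation from the slides.

```latex
r(x) = \frac{p_{te}(x)}{p_{tr}(x)}, \qquad \beta_i \approx r(x_i^{tr})

\min_{\beta}\ \Bigl\lVert \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \beta_i\, \phi(x_i^{tr})
  - \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \phi(x_j^{te}) \Bigr\rVert^{2}
\quad \text{s.t.} \quad 0 \le \beta_i \le B, \quad
\Bigl\lvert \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \beta_i - 1 \Bigr\rvert \le \epsilon
```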

SLIDE 28

Importance Re-weighting

(Figure panels: Original Weights, KMM Weights)
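
One common way to use such weights, assumed here since the slide does not give a formula, is to scale each training example's slack penalty in a soft-margin SVM by its importance weight β_i. Setting every β_i = 1 recovers the unweighted SVM. The symbols C (regularization constant) and ξ_i (slack variables) are standard notation, not taken from the talk.

```latex
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \beta_i\, \xi_i
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0
```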

SLIDE 30

Transfer String Kernel (TSK)

Pipeline stages and the method used at each:

  • Feature Conversion (source context and target context): Mismatch String Kernel
  • Knowledge Transfer: Kernel Mean Matching (KMM)
  • Training: Importance Re-weighting
  • Classification: SVM

Example sequences from the diagram: ATCGATCG, ATCGATCG, CCCGATCG, CTCGCTCC

SLIDE 32

Experimental Setup

  • 14 Transcription Factors (ENCODE ChIP-seq)
  • Top 1000 positive sequences (500 training and 500 testing)
  • 1000 random negative sequences
  • Hyper-parameter tuning for k=(8,10,12) and m=(1,2,3)
  • Dictionary size = 4 {A,T,C,G}
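
To make the size of this search concrete, the sketch below enumerates the (k, m) grid from the setup and the number of possible k-mers for each k over the 4-letter dictionary. The variable names are illustrative; how the talk's experiments actually selected the best pair is not stated in the slides.

```python
from itertools import product

ALPHABET = ("A", "T", "C", "G")
K_VALUES = (8, 10, 12)
M_VALUES = (1, 2, 3)

# Every (k, m) combination that the tuning step would need to evaluate.
grid = list(product(K_VALUES, M_VALUES))

# Number of distinct k-mers, i.e. the dimension of the k-mer feature space.
feature_space_sizes = {k: len(ALPHABET) ** k for k in K_VALUES}
```

The feature space grows as 4^k, which is why string kernels are computed without materializing the full feature vector.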

SLIDE 34

Results

SLIDE 35

Results – Cross Context

(Bar chart: AUC score, axis range 0.8 to 0.9, for transcription factors Sin3a, Max, Mxi1, Chd2 and Ctcf; series: TSK and SK)
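
For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen positive sequence receives a higher classifier score than a randomly chosen negative one. The sketch below computes it by direct pair counting, using the +1/−1 label convention; the labels and scores are made-up toy values, not results from the talk.

```python
def auc_score(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == -1]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example: three positives and three negatives with hypothetical scores.
labels = [1, 1, 1, -1, -1, -1]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = auc_score(labels, scores)
```

In this toy example 8 of the 9 positive/negative pairs are ordered correctly, so the AUC is 8/9.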

SLIDE 37

Results – Cross context

SLIDE 38

Summary

  • TSK improves cross-context TFBS predictions overall;
  • String kernel based approaches perform better than the state-of-the-art Position Weight/Frequency Matrix based TFBS tools;
  • The TSK approach can be generalized to improve performance on any cross-context sequence prediction task.

Presented in BIOKDD ’15

SLIDE 39

Acknowledgements

  • Dr. Mazhar Adli, Adli Lab, Department of Biochemistry and Molecular Genetics @ UVa
  • Nipun Batra, IIIT-Delhi

SLIDE 40

Machine Learning Lab @ UVa

  • Dr. Yanjun Qi (Advisor)
  • Jack Lanchantin
  • Beilun Wang
  • Weilin Xu
  • Ji Gao

SLIDE 41

Future Directions

  • Deep Learning:
    – Gene expression prediction using histone modification data (ECCB 2016)
    – Improving TFBS prediction using DNA sequences (ICLR Workshop 2016, ICML Workshop 2016)
  • String Kernels: improving efficiency (ongoing work)

SLIDE 42

Thank You
