finding motifs using random projections

Finding Motifs Using Random Projections by J. Buhler and M. Tompa



  1. Finding Motifs Using Random Projections, by J. Buhler and M. Tompa
     A presentation by Guénola Drillon, Anisah Ghoorah, Lin Han and Frank Dondelinger

  2. Overview
     - DNA motifs
     - The problem
     - Current approaches
     - New algorithm: PROJECTION
     - PROJECTION's results
     - Conclusions

  3. DNA Motifs
     - DNA
       - 4 nucleotides: A, T, C and G
     - DNA motifs
       - Short, recurring patterns in DNA
       - Strongly conserved
       - Have a biological function, e.g. gene regulation, gene interaction
       - Indicate sequence-specific binding sites for proteins
     D'haeseleer (2006)

  4. Motif Representations
     - Consensus pattern
       - Using the IUPAC code
     - Frequency matrix
     - Logo
     D'haeseleer (2006)

  5. Motif Finding Problem
     - Planted (l, d)-motif
       - Example of a planted (11,2)-motif: CCGATTACCGA
     - l-mers
       - All possible subsequences of length l in each sequence

  6. Problem Definition
     - Given t sequences, each of length n, find a motif M of length l, where each planted instance differs from M in d positions
       - This is the planted (l, d)-motif problem
     - No prior knowledge of the motif M

  7. Why Motif Finding?
     - Comparative genomics
       - Study similar genes in different species using microarrays
       - Identification of transcription factor binding sites
       - Genetic regulatory networks
     - Genomes are large and complex
     - Simple search won't work!
     - A more efficient search algorithm is needed

  8. Current approaches (1): Local Search
     - Gibbs Sampling, Lawrence et al (1993)
       - Obtain an initial motif model
       - Use an iterative, probability-based approach to find the correct motif
     - MEME, Bailey & Elkan (1995)
       - Obtain an initial motif
       - Use an EM approach to find the correct motif
     - CONSENSUS, Hertz & Stormo (1999)
       - Obtain an initial motif
       - Use an iterative approach to build up motifs by adding more and more pattern instances

  9. Problem with Local Search
     - Depends on initial conditions
     - Local optima issues
       - Returns the best solution in the neighbourhood
       - Not necessarily the best planted motif

  10. Current approaches (2)
     - Enumeration
       - Exhaustive enumeration of all possible motifs M
       - Covers the entire search space
       - No risk of getting stuck in a local optimum
     - Problem
       - Too rigid for most real-world binding sites
       - Runs in time exponential in the motif length

  11. Current approaches (3)
     - WINNOWER, Pevzner & Sze (2000)
       - Graph-theoretic approach which represents a motif as a large clique
     - SP-STAR, Pevzner & Sze (2000)
       - Heuristic local improvement technique using a scoring function
     - Both solve the planted (15,4)-motif problem
     - Problem
       - Both fail on the planted (14,4)-, (16,5)- and (18,6)-motif problems

  12. New Approach: the PROJECTION Algorithm
     - Random projections (global search)
     - Motif refinement (local search)

  13. Random Projection Hash h(x)
     - Choose k of the l positions at random
     - If x is an l-mer, then h(x) is the k-mer obtained by selecting those k residues of x
     - This is a projection from l-dimensional space onto a k-dimensional subspace
     - Example: l = 15, k = 7, projection = (2, 4, 5, 7, 11, 12, 13)
       - x = ATGGCATTCAGATTC
       - h(x) = TGCTGAT
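
     The hash on this slide can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `random_projection` and `project` are made up here, and the slide's 1-based positions are converted to 0-based indices.

```python
import random

def random_projection(l, k, rng=random):
    """Pick k distinct 0-based positions out of l, in increasing order."""
    return sorted(rng.sample(range(l), k))

def project(x, positions):
    """h(x): concatenate the residues of the l-mer x at the chosen positions."""
    return "".join(x[p] for p in positions)

# Slide example: the projection (2, 4, 5, 7, 11, 12, 13) is 1-based on the slide.
slide_positions = [p - 1 for p in (2, 4, 5, 7, 11, 12, 13)]
example = project("ATGGCATTCAGATTC", slide_positions)
print(example)
```

     Running it on the slide's 15-mer reproduces the 7-mer shown above.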

  14. Random Projection
     - $4^k$ buckets in total
     - M: the motif; h(M): the planted bucket
     - If k < l - d, several planted instances are expected to hash to the planted bucket
     - If k is not too small, a random bucket is expected to hold fewer than one l-mer
     - The planted bucket is therefore highly enriched in l-mers, which makes it possible to recover the motif

  15. Random Projection
     - s is the threshold for a potential planted bucket
     - Choose the buckets that contain at least s l-mers

  16. Example of Hashing and Buckets
     - l = 7, k = 4, with projection positions (1, 2, 5, 7)
     - Sequence: ...TAGACATCCGACTTGCCTTACTAC...
     - The l-mer ATCCGAC hashes to bucket ATGC
     - Buckets shown: ATGC, GCTC

  17. Example of Hashing and Buckets
     - s = 3
     - Choose the buckets which contain at least 3 l-mers
     - Buckets shown: ATGC, GCTC, CATC, ATTC
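
     The bucketing step from the last two slides might look like the following minimal sketch. The helper names `hash_lmers` and `enriched_buckets` are hypothetical, and positions are 0-based, so the slide's (1, 2, 5, 7) becomes [0, 1, 4, 6].

```python
from collections import defaultdict

def hash_lmers(sequences, l, positions):
    """Hash every l-mer of every sequence into the bucket named by its projection."""
    buckets = defaultdict(list)
    for seq_index, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            lmer = seq[start:start + l]
            key = "".join(lmer[p] for p in positions)
            buckets[key].append((seq_index, start, lmer))
    return buckets

def enriched_buckets(buckets, s):
    """Keep only the buckets that contain at least s l-mers."""
    return {key: hits for key, hits in buckets.items() if len(hits) >= s}

# Slide example: l = 7, projection positions (1, 2, 5, 7) in 1-based numbering.
sequence = "TAGACATCCGACTTGCCTTACTAC"
buckets = hash_lmers([sequence], l=7, positions=[0, 1, 4, 6])
print(buckets["ATGC"])
```

     Storing the sequence index and start offset alongside each l-mer is a design choice made here so that a bucket can later be traced back to the input.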

  18. Three Important Parameters: projection size k
     - k < l - d, and k not too large, so that the planted bucket stays highly enriched
     - k large enough that each bucket is expected to hold fewer than one random l-mer
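
     One way to turn these two constraints into a number is sketched below. It assumes that "fewer than one l-mer per bucket" means $4^k > t(n - l + 1)$ under uniform hashing, and it assumes t = 20 sequences of length n = 600 for the example; neither value is stated on this slide, and the function name is made up.

```python
def smallest_projection_size(t, n, l, d):
    """Smallest k with k < l - d and 4**k > t * (n - l + 1), or None if none exists.

    Assumption: l-mers spread uniformly over the 4**k buckets, so the expected
    number of background l-mers per bucket is t * (n - l + 1) / 4**k.
    """
    total_lmers = t * (n - l + 1)
    for k in range(1, l - d):
        if 4 ** k > total_lmers:
            return k
    return None

# Assumed example values: t = 20, n = 600, planted (15, 4)-motif.
k = smallest_projection_size(t=20, n=600, l=15, d=4)
print(k)
```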

  19. Three Important Parameters: bucket threshold s
     - Varies according to the data we use
     - Case of (20, 2) and (16, 5)
     - Larger number of sequences

  20. Three Important Parameters: the number m of independent trials to run
     - $m = \dfrac{\log(1 - Q)}{\log(B)}$
     - Q: probability that s or more motif instances land in the planted bucket in at least one of the m trials
     - B: probability that fewer than s planted instances land in the planted bucket, over a number of independent Bernoulli trials
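
     A rough sketch of this calculation follows. Several pieces are assumptions not spelled out on the slide: each planted instance is taken to hash to the planted bucket with probability C(l-d, k) / C(l, k) (all k sampled positions avoid its d substitutions), B is taken to be a binomial tail over the t sequences, the result is rounded up, and the example values t = 20, Q = 0.95 are chosen only for illustration.

```python
from math import ceil, comb, log

def planted_hash_probability(l, d, k):
    """Assumed: probability that k random positions avoid all d substituted positions."""
    return comb(l - d, k) / comb(l, k)

def prob_fewer_than_s(t, p, s):
    """B: probability of fewer than s successes in t Bernoulli(p) trials."""
    return sum(comb(t, i) * p**i * (1 - p) ** (t - i) for i in range(s))

def trials_needed(l, d, k, t, s, Q):
    """m = log(1 - Q) / log(B), rounded up. Requires k <= l - d so that B < 1."""
    p = planted_hash_probability(l, d, k)
    B = prob_fewer_than_s(t, p, s)
    return ceil(log(1 - Q) / log(B))

# Assumed example: planted (15, 4)-motif, k = 7, s = 4, t = 20, Q = 0.95.
m = trials_needed(l=15, d=4, k=7, t=20, s=4, Q=0.95)
print(m)
```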

  21. Motif Refinement
     We have our buckets: now what?
     For each large enough bucket h:
     - Use h as a starting point W_h
     - Apply EM to refine W_h to W_h*
     - Get the consensus motif C using the model W_h*
     At the end, return the best C found

  22. Starting Point W_h
     - W_h is a model for the motif
     - 4 x l matrix: W_h(i, j) = probability of base i in position j
     - An approximation that works in practice

  23. Starting Point W_h: Example
     In bucket h: AGT, AAA, AGC

     W_h (bases by positions):

     Base | Pos 1 | Pos 2 | Pos 3
     A    | 1     | 1/3   | 1/3
     C    | 0     | 0     | 1/3
     G    | 0     | 2/3   | 0
     T    | 0     | 0     | 1/3

     To avoid too many zeroes, add a background probability b_i using Laplace smoothing
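
     Building W_h from a bucket could be sketched as below. The exact smoothing scheme is an assumption: here each base count gets `pseudocount * background[base]` added before normalising, with a uniform background by default. The function name and parameters are hypothetical.

```python
BASES = "ACGT"

def initial_model(lmers, background=None, pseudocount=1.0):
    """Return W_h as a list of per-position dicts: model[j][base] = W_h(base, j).

    Assumed smoothing: add pseudocount * background[base] to every count.
    With pseudocount=0.0 the raw frequencies of the bucket are returned.
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    model = []
    for j in range(len(lmers[0])):
        counts = {b: pseudocount * background[b] for b in BASES}
        for x in lmers:
            counts[x[j]] += 1
        total = sum(counts.values())
        model.append({b: counts[b] / total for b in BASES})
    return model

raw = initial_model(["AGT", "AAA", "AGC"], pseudocount=0.0)
smoothed = initial_model(["AGT", "AAA", "AGC"])
print(raw[1])
```

     With `pseudocount=0.0` the result matches the unsmoothed frequencies of the slide's bucket; with smoothing, no entry is zero and each column still sums to 1.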

  24. Refinement: Finding W_h*
     - Use EM to refine the initial model W_h
     - Let S be the dataset and P the background distribution
     - Find the W_h* that (locally) maximises $\dfrac{\Pr(S \mid W_h^*, P)}{\Pr(S \mid P)}$
     - Could take a long time!
     - Better: run only a few iterations of EM

  25. Refinement: Find the Motif
     From W_h*, we want to determine the motif. For each input sequence:
     - Determine the likeliest l-mer with respect to W_h*
     - The likelihood of an l-mer x is determined by $\dfrac{\Pr(x \mid W_h^*)}{\Pr(x \mid P)}$
     This gives a set T of the t most likely l-mers

  26. Refinement: Find the Motif
     Example: most likely 2-mer in AGT (assume P is the same for all bases in all positions)

     W (bases by positions):

     Base | Pos 1 | Pos 2
     A    | 0.88  | 0.20
     C    | 0.01  | 0.30
     G    | 0.10  | 0.01
     T    | 0.01  | 0.49

     Two 2-mers: AG and GT
     - Likelihood of AG: 0.88 x 0.01 = 0.0088
     - Likelihood of GT: 0.10 x 0.49 = 0.0490
     Add GT to the set T
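
     The scoring step in this example can be sketched as follows, using the matrix W from the slide. The function names are made up, and the background P is assumed uniform, so dividing by it does not change which l-mer wins.

```python
# W from the slide: one dict per position, W[j][base] = probability of base at position j.
W = [
    {"A": 0.88, "C": 0.01, "G": 0.10, "T": 0.01},
    {"A": 0.20, "C": 0.30, "G": 0.01, "T": 0.49},
]

def likelihood_ratio(x, model, background=0.25):
    """Pr(x | model) / Pr(x | P) for a uniform background P (an assumption)."""
    ratio = 1.0
    for j, base in enumerate(x):
        ratio *= model[j][base] / background
    return ratio

def likeliest_lmer(seq, model):
    """Return the l-mer of seq with the highest likelihood ratio under the model."""
    l = len(model)
    candidates = [seq[i:i + l] for i in range(len(seq) - l + 1)]
    return max(candidates, key=lambda x: likelihood_ratio(x, model))

best = likeliest_lmer("AGT", W)
print(best)
```

     Passing `background=1.0` to `likelihood_ratio` gives the plain products shown on the slide.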

  27. Refinement: Find the Motif
     Once the set T is complete:
     - Find the consensus C_h
     - Then calculate s(T): the number of l-mers in T that are further than d away from C_h
     - Return the consensus with the smallest value of s(T) over all buckets and all runs
     - Ideally, find C_h such that s(T) = 0

  28. Refinement: Find the Motif
     Example: l = 3, d = 1, T = {AGT, AAA, AGC}
     - Consensus: AG? (many schemes are possible for the last position); let's say the consensus is AGT
     - dist(AGT, AGT) = 0
     - dist(AAA, AGT) = 2, and 2 > d
     - dist(AGC, AGT) = 1
     - s(T) = 1
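
     The score s(T) from this example can be checked with a small sketch. The distance is taken to be the Hamming distance, which is what the slide's numbers imply; the function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def s_score(T, consensus, d):
    """s(T): how many l-mers in T are further than d substitutions from the consensus."""
    return sum(1 for x in T if hamming(x, consensus) > d)

T = ["AGT", "AAA", "AGC"]
print(s_score(T, "AGT", d=1))
```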

  29. Refinement: A Heuristic
     For the simulated data, we can do better than minimising s(T) over all buckets and all runs.
     - sc(T) = number of l-mers in T that are at most d away from C_h
     - Let T' contain the l-mers that are closest to C_h
     - If sc(T') > sc(T), replace T with T' and repeat
     - Usually converges quickly
     - If the final score sc(T) = t, return the motif; otherwise maximise the score over all buckets and all runs

  30. Refinement: A Heuristic
     Example: l = 3, d = 1, T = {AGT, AAA, AGC}, C_h = AGT
     - sc(T) = 2
     - If S = {AGTC, AAAT, AGCT}, then T' = {AGT, AAT, AGC}
     - sc(T') = 3, so return the consensus of T' (which happens to be C_h)
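
     A minimal sketch of this heuristic, under some assumptions: distance is Hamming distance, ties for the closest l-mer and for the consensus base are broken by first occurrence, the consensus is recomputed after each replacement, and a round limit is added as a safety guard. None of these details are fixed by the slides, and all names are made up.

```python
from collections import Counter

def hamming(a, b):
    return sum(1 for x, y in zip(a, b) if x != y)

def consensus(T):
    """Most common base per column; ties go to the base seen first (an assumption)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*T))

def sc(T, C, d):
    """sc(T): number of l-mers in T at most d substitutions away from C."""
    return sum(1 for x in T if hamming(x, C) <= d)

def closest_lmers(S, C):
    """T': for each sequence in S, the l-mer closest to C."""
    l = len(C)
    return [
        min((seq[i:i + l] for i in range(len(seq) - l + 1)), key=lambda x: hamming(x, C))
        for seq in S
    ]

def refine(S, T, C, d, max_rounds=100):
    """Replace T by T' while that strictly improves sc, then return (T, C)."""
    for _ in range(max_rounds):
        T_new = closest_lmers(S, C)
        if sc(T_new, C, d) <= sc(T, C, d):
            break
        T, C = T_new, consensus(T_new)
    return T, C

S = ["AGTC", "AAAT", "AGCT"]
T_final, C_final = refine(S, ["AGT", "AAA", "AGC"], "AGT", d=1)
print(T_final, C_final)
```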

  31. PROJECTION Algorithm Recap
     - Do random projections
       - Hash l-mers to buckets using k random positions
       - Use sufficiently full buckets (at least s l-mers) as starting points
     - Do motif refinement
       - Get the model W_h from bucket h
       - Refine it to a locally optimal model W_h* (using e.g. EM)
       - Return the best consensus motif

  32. Experimental Results
     - Experiments on Simulated Data
     - Limitations on Solvable (l,d)-Motif Problems
     - Transcription Factor Binding Sites
     - Ribosome Binding Sites

  33. Experiments on Simulated Data
     Simulated data:
     1. A motif M is chosen randomly
     2. t independent planted instances are produced by randomly selecting d positions in M and substituting the bases there
     3. The position of each instance in its input sequence is selected randomly
     4. The remaining n - l residues of each sequence are chosen randomly
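
     The four generation steps could be sketched like this. Assumptions made here: bases are drawn uniformly, each selected position is changed to a different base (so every instance has exactly d substitutions), and the example sizes t = 20, n = 600 are illustrative. The function names are hypothetical.

```python
import random

BASES = "ACGT"

def plant_instance(motif, d, rng):
    """Copy the motif and substitute a different base at d random positions."""
    instance = list(motif)
    for pos in rng.sample(range(len(motif)), d):
        instance[pos] = rng.choice([b for b in BASES if b != motif[pos]])
    return "".join(instance)

def simulate(t, n, l, d, seed=0):
    """Return (motif, sequences, starts) for a planted (l, d)-motif data set."""
    rng = random.Random(seed)
    motif = "".join(rng.choice(BASES) for _ in range(l))
    sequences, starts = [], []
    for _ in range(t):
        background = "".join(rng.choice(BASES) for _ in range(n - l))
        start = rng.randrange(n - l + 1)  # where the instance is planted
        instance = plant_instance(motif, d, rng)
        sequences.append(background[:start] + instance + background[start:])
        starts.append(start)
    return motif, sequences, starts

motif, sequences, starts = simulate(t=20, n=600, l=15, d=4)
```

     Returning the planted start positions alongside the sequences makes it possible to score a motif finder's predictions afterwards.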

  34. Experiments on Simulated Data
     Performance coefficient: $PC = \dfrac{|K \cap P|}{|K \cup P|}$
     - K = the t*l residue positions in the t planted motif instances
     - P = the corresponding set of residue positions in the instances predicted by the algorithm
     Example sequences shown on the slide:
       GGACCTCAATGCAGGATACACCGATCGGTA
       GGAGTACGGCAAGTCCCCATGTGAGGACCT
       AGGCTGGACCAGGACCTGACTCTACACCTA
       TGGACCTGCAGGATACAGCGGGACCTATCG
       ……
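
     A small sketch of the performance coefficient, taking it to be the overlap |K ∩ P| / |K ∪ P| between the planted and predicted residue positions. Representing each instance by its start offset in its sequence is an assumption made for this sketch, as are the function names.

```python
def residue_positions(starts, l):
    """Set of (sequence index, residue index) pairs covered by one l-mer per sequence."""
    return {(i, start + j) for i, start in enumerate(starts) for j in range(l)}

def performance_coefficient(true_starts, predicted_starts, l):
    """|K & P| / |K | P| for planted positions K and predicted positions P."""
    K = residue_positions(true_starts, l)
    P = residue_positions(predicted_starts, l)
    return len(K & P) / len(K | P)

# One sequence, l = 3: planted at offset 0, predicted at offset 1.
pc = performance_coefficient([0], [1], l=3)
print(pc)
```

     A prediction that is shifted by one residue still earns partial credit, which is why the coefficient is computed over residues rather than whole instances.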

  35. Experiments on Simulated Data
     [Figure: average PC on planted (l,d)-motif problems. Vertical axis: PC. Horizontal axis: (l,d), from (10,2), (11,2), (12,3), (13,3), (14,4), (15,4), (16,5), (17,5), (18,6) to (19,6). Series: PROJECTION, WINNOWER, SP-STAR, Gibbs.]
     Results:
     - Average over 20 random instances
     - All PROJECTION runs used projection size k = 7 and bucket threshold s = 4, but a different number of iterations m
     - WINNOWER (k = 2)

  36. Limitations on Solvable (l,d)-Motif Problems
     - Why is it difficult to find an (l,d)-motif?
     - What is the difference between (9,2), (11,3), (13,4), (15,5), (17,6) and (10,2), (12,3), (14,4), (16,5), (18,6)?
     - What is the difference between (l,d) and (l+1,d)?

  37. Limitations on Solvable (l,d)-Motif Problems
     - The probability that a random l-mer matches a motif M with up to d substitutions:
       $p_d = \sum_{i=0}^{d} \binom{l}{i} \left(\frac{3}{4}\right)^{i} \left(\frac{1}{4}\right)^{l-i}$
     - The expected number of random (spurious) motifs:
       $E(l,d) = 4^{l} \left(1 - (1 - p_d)^{n-l+1}\right)^{t}$
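
     These two formulas are easy to evaluate numerically. The sketch below assumes n = 600 and t = 20 for the example calls (values not given on this slide) and uses made-up function names.

```python
from math import comb

def p_d(l, d):
    """Probability that a random l-mer is within d substitutions of a fixed motif."""
    return sum(comb(l, i) * 0.75**i * 0.25 ** (l - i) for i in range(d + 1))

def expected_spurious_motifs(l, d, n, t):
    """E(l, d) = 4^l * (1 - (1 - p_d)^(n - l + 1))^t."""
    return 4**l * (1 - (1 - p_d(l, d)) ** (n - l + 1)) ** t

# Assumed data set size: t = 20 sequences of length n = 600.
for l, d in [(9, 2), (10, 2)]:
    print((l, d), expected_spurious_motifs(l, d, n=600, t=20))
```

     Under these assumed sizes, E(9,2) comes out above 1 while E(10,2) is many orders of magnitude below 1, which illustrates how adding a single position to l changes whether chance motifs as good as the planted one are to be expected.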

  38. Limitations on Solvable (l,d)-Motif Problems
     Statistics of spurious (l,d)-motifs in simulated data:
