dna mo f discovery
play

DNA Mo'f Discovery COMPSCI 260 Spring 2016 DNA motif - PowerPoint PPT Presentation

DNA Mo'f Discovery COMPSCI 260 Spring 2016 DNA motif discovery Input: X 1 A set of DNA sequences X 2 bound by the same TF X 3 Upstream regions of co- regulated genes X 4 Sequences bound in a


  1. DNA ¡Mo'f ¡Discovery ¡ COMPSCI ¡260 ¡– ¡Spring ¡2016 ¡

  2. DNA motif discovery • Input: X 1 – A set of DNA sequences X 2 bound by the same TF X 3 • Upstream regions of co- regulated genes X 4 • Sequences bound in a ChIP-chip/seq experiment – Each sequence is believed to contain a binding site X n • Output: – Locations of the binding sites in the sequences – Parameters of the model describing the binding sites

  3. ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCA TATCCATATCTAATCTTAC TTATA TGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC TATACATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTG TAATACGCTTAACTGCTCATTGCTATATTGAAGTA CGG ATTAGAAGCCG CCG AG CGG GCGACAGCCCT CCG A CGG AAGACTCTCCT C GAACTTTCAGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGG TGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCT CGC GCCGCACTGCT CCG AACAATAAAGATTCTACAATACT AAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATA CTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGG CCCCA CAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG AAGATTCTACAATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATC Genes ¡ AATGCGATTAGTTTTTTAGCCTTATTTC TGGGG TAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATA TATAA ATGGAA AAATTAACAACCATAGGATGATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGAT GCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT CTATTAACAGATATATAAATGGAAAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTC TTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATA ATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA AAATGTCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCT CTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG CATTCAGAAGAAGTGATTGTACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAA TTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT TTAAGAAATTTATAAGCGCTTATGATGCTAAACCGGATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATAT TGATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAA TGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCA TCGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTG TCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAAGTTCGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATC CACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCA CTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTGCTCACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTT TF ¡binding ¡ TGGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATAT TGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCAACTGGCAGTGGATTGTCTTCTTCGGCCGCATTC sites ¡ CCAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTT ATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATATGTCCAAGCAAAATTTAATGCGTATTACGG GGTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAA TCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTTGCGGTGAGGAAGATCATGCTCTATA TAGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCAC CGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAAATTAGCTTTGTTATTGCGAAC CAGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAG ACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCACTACAGCTGCAAATGTTT TTCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAG TAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAGATTTCATGAACGT GCTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGA TTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAGATGCTAGTA AATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT CTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGAAGAAT TTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG TCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAATC CGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC TTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTG ATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA ATGAACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAA GAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA ATGGATCATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACAT ATTGGGCAGCTGTCTATATGAATTATAA GTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACT AGAAAAGGTAAAAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATC AGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATA ATCGTCTCTAAACCAGCATTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCA GTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGG TTTTTTTCTACTCATAACTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACT ACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAG ACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTT CTTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTAC GATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGAC GAA CCCTTTGT CCTACTGATTAA TTTTGTAC TGAATTT GGACAAT TCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACA GTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGA AAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGA AACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTACGAGAACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCA AATTTTCAAGTTAGA CAAGGAC AAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCA GATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACAGAAAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCA TTCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTG TATAA AACTCCTTTCTTAATTTCACTCTAAAGCAT CCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGAAAAATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGT CCCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATA ATGCCAGACAATCTATCATTACATT AAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCATTTTCCATTAAATCTCTGTTCTCTCTTACTTATATGAT AGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAA GATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCATACCCCATAGAGAAGATCTTTCGGTTCGAAGACAT GTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATA TCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATTTAAGCGGCTCTTCAAAAAGATTGAACTCTC GCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACA GCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGAGTATAAGTCATCTCAGAGTAATAT CAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATC AACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACAGCTCATTCTGGAAGAAAATC CACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGT TATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAACCAGGACTTGAAGCCCG GTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCT TCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAACACTTGGCAAAT

  4. Regulatory ¡elements ¡– ¡TF ¡binding ¡sites ¡ General Transcription factors (TFs) transcriptional machinery TSS gene TF binding sites

  5. DNA motif discovery ¡ • Models? ¡ IUPAC R T G A S T C A Y Consensus codes 1 2 3 4 5 6 7 8 9 A 0.54 0.01 0.01 0.97 0.00 0.01 0.06 0.97 0.01 Position Weight Matrix C 0.04 0.01 0.01 0.01 0.44 0.01 0.93 0.01 0.41 PWM (PSSM) G 0.41 0.01 0.93 0.01 0.56 0.01 0.01 0.01 0.04 T 0.01 0.97 0.05 0.01 0.00 0.97 0.00 0.01 0.54 Motif logo

  6. Motif logos and the information content • The IC of a motif tells us how different the motif is from the background distribution • In general, when building a motif (and especially a motif logo) the background distribution is assumed to be uniform (b A =b C =b G =b T =0.25) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 A 0.25 0.42 0.62 0.30 0.00 1.00 0.00 0.00 0.00 0.00 0.23 0.05 0.15 0.20 C 0.25 0.10 0.03 0.32 1.00 0.00 1.00 0.00 0.00 0.00 0.15 0.30 0.33 0.26 G 0.25 0.33 0.30 0.15 0.00 0.00 0.00 1.00 0.00 1.00 0.32 0.03 0.10 0.16 T 0.25 0.15 0.05 0.23 0.00 0.00 0.00 0.00 1.00 0.00 0.30 0.62 0.42 0.38 IC is 0 + (1 x log 2 (1/0.25)) + 0 + 0=2 IC is 4 x 0.25 x log 2 (0.25/0.25)=0

  7. DNA ¡mo'f ¡discovery ¡ • Similar ¡to ¡a ¡local ¡alignment ¡problem ¡

  8. DNA ¡mo'f ¡discovery ¡ • Similar ¡to ¡a ¡local ¡alignment ¡problem, ¡but… ¡ • What ¡makes ¡mo'f ¡discovery ¡hard? ¡ ¡ – Mo'fs ¡act ¡at ¡variable ¡distances ¡upstream ¡(or ¡ downstream) ¡of ¡target ¡genes ¡ ¡ – Mo'fs ¡are ¡short ¡(5-­‑15bp) ¡ ¡ – Mo'fs ¡are ¡degenerate ¡ ¡

  9. Where ¡do ¡ambiguous ¡bases ¡come ¡from? … Forkhead (Foxo4, 3L2C) bZIP (Gcn4, 1YSA) Homeodomain (Pdx1, 2H1K) bHLH (Pho4, 1A0A)

  10. Where ¡do ¡ambiguous ¡bases ¡come ¡from? PDB structure: 1MDY

  11. Where ¡does ¡specificity ¡come ¡from? ¡

  12. Where ¡does ¡specificity ¡come ¡from? ¡

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend