
Bioinformatics for the Identification of Sequences Regulating Gene Transcription - PowerPoint PPT Presentation



  1. Bioinformatics for the Identification of Sequences Regulating Gene Transcription Wyeth W. Wasserman University of British Columbia www.cisreg.ca

  2. Acknowledgements Collaborators Wasserman Group Dave Arenillas Jenny Bryan (UBC) Jochen Brumm Brenda Gallie (OCI) Alice Chou Jens Lagergren (KTH) Debra Fulton Chip Lawrence (Brown) Shannan Ho Sui Carol Huang Boris Lenhard (K.I.) Danielle Kemmer (KI) James Mortimer (MF) Byron Kuo Jacob Odeberg (KTH) Jonathan Lim Raf Podowski (KI) Dora Pak Group Alumni Chris Walsh Wynand Alkema Dimas Yusuf Elena Herzog Annette Höglund Collaborating Trainees William Krivan Malin Andersson (KTH) Öjvind Johansson (KTH) Luis Mendoza Stuart Lithwick (U.Toronto) Albin Sandelin Support: CIHR, CGDN, MSFHR, CFI, Merck-Frosst, BC Children’s Hospital Foundation

  3. Overview CMMT • DISCRIMINATION: TFBS Prediction with Motif Models • Phylogenetic Footprinting • Combinatorial Interactions • Current Activities • DISCOVERY: Inferring Regulatory Mechanisms for Co-Expressed (Co-Regulated) Genes • Motif Over-representation • Pattern Discovery • Current Activities

  4. Transcription Factor Binding Sites (over-simplified for pedagogical purposes) [Diagram labels: Pol-II, TATA, URF, URE]

  5. Teaching a computer to find TFBS…

  6. Representing Binding Sites for a TF
     • A single site
       AAGTTAATGA
     • A set of sites represented as a consensus
       VDRTWRWWSHD (IUPAC degenerate DNA)
     • A matrix describing a set of sites
       A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
       C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
       G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
       T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4

     Set of binding sites
       AAGTTAATGA  CAGTTAATAA  GAGTTAAACA  CAGTTAATTA
       GAGTTAATAA  CAGTTATTCA  GAGTTAATAA  CAGTTAATCA
       AGATTAAAGA  AAGTTAACGA  AGGTTAACGA  ATGTTGATGA
       AAGTTAATGA  AAGTTAACGA  AAATTAATGA  GAGTTAATGA
       AAGTTAATCA  AAGTTGATGA  AAATTAATGA  ATGTTAATGA
       AAGTAAATGA  AAGTTAATGA  AAGTTAATGA  AAATTAATGA
       AAGTTAATGA  AAGTTAATGA  AAGTTAATGA  AAGTTAATGA
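As a rough illustration of how a count matrix of this kind is derived from aligned sites, the sketch below tallies each base at each column. The function name `build_pfm` and the choice of input (the first five sites in the list) are assumptions made for the example. It is not expected to reproduce the 15-column matrix shown on the slide.

```python
BASES = "ACGT"

def build_pfm(sites):
    """Count each base at each column of equal-length, aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * length for base in BASES}
    for site in sites:
        for i, base in enumerate(site.upper()):
            pfm[base][i] += 1
    return pfm

# Hypothetical input: five of the 10-bp sites listed on the slide.
sites = ["AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "CAGTTAATTA", "GAGTTAATAA"]
pfm = build_pfm(sites)
for base in BASES:
    print(base, pfm[base])
```

Every column of the result sums to the number of input sites, which is a quick sanity check on any position frequency matrix.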

  7. PFMs to PWMs (PSSMs)
     f matrix → w matrix, via Log( (f(b,i) + s(N)) / p(b) )

     f matrix
       A   5    0    1    0    0
       C   0    2    2    4    0
       G   0    3    1    0    4
       T   0    0    1    1    1

     w matrix
       A   1.6  -1.7  -0.2  -1.7  -1.7
       C  -1.7   0.5   0.5   1.3  -1.7
       G  -1.7   1.0  -0.2  -1.7   1.3
       T  -1.7  -1.7  -0.2  -0.2  -0.2
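The conversion can be sketched as follows. The slide does not spell out the pseudocount or the normalisation, so the details here are assumptions: a pseudocount of sqrt(N)/4 per base, division by N + sqrt(N), a uniform background p(b) = 0.25, and log base 2. With these choices the sketch reproduces the w matrix on the slide to one decimal place; `pfm_to_pwm` is a hypothetical name.

```python
import math

def pfm_to_pwm(pfm, background=None):
    """Convert base counts to log-odds weights, column by column."""
    bases = sorted(pfm)
    n_cols = len(next(iter(pfm.values())))
    background = background or {b: 1.0 / len(bases) for b in bases}
    pwm = {b: [] for b in bases}
    for i in range(n_cols):
        n = sum(pfm[b][i] for b in bases)       # number of sites in this column
        pseudo = math.sqrt(n) / len(bases)      # assumed pseudocount per base
        for b in bases:
            p = (pfm[b][i] + pseudo) / (n + math.sqrt(n))
            pwm[b].append(math.log2(p / background[b]))
    return pwm

# f matrix from the slide.
f = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}
w = pfm_to_pwm(f)
for base in "ACGT":
    print(base, [round(x, 1) for x in w[base]])
```

The pseudocount keeps zero counts from producing a weight of minus infinity, which is why unseen bases score -1.7 rather than being forbidden outright.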

  8. Detecting binding sites in a single sequence
     Scanning a sequence against a PWM

     Sp1
       ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC

       A [-0.2284  0.4368 -1.5    -1.5    -1.5     0.4368 -1.5    -1.5    -0.2284  0.4368 ]
       C [-0.2284 -0.2284 -1.5    -1.5     1.5128 -1.5    -0.2284 -1.5    -0.2284 -1.5    ]
       G [ 1.2348  1.2348  2.1222  2.1222  0.4368  1.2348  1.5128  1.7457  1.7457 -1.5    ]
       T [ 0.4368 -0.2284 -1.5    -1.5    -0.2284  0.4368  0.4368  0.4368 -1.5     1.7457 ]

     Abs_score = 13.4 (sum of column scores)

     Calculating the relative score
       Max_score = 15.2 (sum of highest column scores)
       Min_score = -10.3 (sum of lowest column scores)

       Rel_score = (Abs_score - Min_score) / (Max_score - Min_score) ⋅ 100%
                 = (13.4 - (-10.3)) / (15.2 - (-10.3)) ⋅ 100% = 93%

     Scanning 1300 bp of human insulin receptor gene with Sp1 at rel_score threshold of 75%
     Ouch.
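A minimal sketch of the scan and the relative score follows. The function names and the 3-column `toy_pwm` are hypothetical and chosen only to keep the example small. Only the forward strand is scanned here, whereas a practical scanner would also score the reverse complement. The relative score is also evaluated with the three values quoted on the slide.

```python
def score_window(pwm, window):
    """Absolute score: sum of the weights selected by each base in the window."""
    return sum(pwm[base][i] for i, base in enumerate(window))

def score_range(pwm):
    """Return (Min_score, Max_score): sums of the lowest/highest column weights."""
    width = len(next(iter(pwm.values())))
    columns = [[pwm[b][i] for b in pwm] for i in range(width)]
    return sum(min(c) for c in columns), sum(max(c) for c in columns)

def relative_score(abs_score, min_score, max_score):
    """Rescale an absolute score to 0-100% of the matrix's attainable range."""
    return 100.0 * (abs_score - min_score) / (max_score - min_score)

def scan(pwm, sequence, threshold=75.0):
    """Slide the matrix along the forward strand and keep windows above threshold."""
    width = len(next(iter(pwm.values())))
    lo, hi = score_range(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        rel = relative_score(score_window(pwm, window), lo, hi)
        if rel >= threshold:
            hits.append((start, window, rel))
    return hits

# Relative score from the values quoted on the slide.
slide_rel = relative_score(13.4, -10.3, 15.2)
print(round(slide_rel))

# Hypothetical toy matrix that prefers the word "AGT".
toy_pwm = {
    "A": [1.0, -1.0, -1.0],
    "C": [-1.0, -1.0, -1.0],
    "G": [-1.0, 1.0, -1.0],
    "T": [-1.0, -1.0, 1.0],
}
hits = scan(toy_pwm, "CCAGTCC", threshold=100.0)
print(hits)
```

Lowering `threshold` increases the number of reported windows, and short matrices at a permissive threshold report hits very frequently, which is the problem the "Ouch." on the slide refers to.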

  9. Performance of Profiles • 95% of predicted sites bound in vitro (Tronche 1997) • MyoD binding sites predicted about once every 600 bp (Fickett 1995) • The Futility Theorem – Nearly 100% of predicted transcription factor binding sites have no function in vivo

  10. JASPAR: An Open-Access Database of TF Binding Profiles

  11. Overcoming the Specificity Problems: DISCRIMINATION

  12. Phylogenetic Footprinting Dramatically Reduces Spurious Hits Human Mouse Actin, alpha cardiac

  13. Performance: Human vs. Mouse [Figure labels: SELECTIVITY, SENSITIVITY] • Testing set: 40 experimentally defined sites in 15 well studied genes (Replicated with 100+ site set) • 75-90% of defined sites detected with conservation filter, while only 11-16% of total predictions retained
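The idea of a conservation filter can be sketched as follows: predicted sites are kept only if they lie inside a window of a pairwise alignment whose percent identity exceeds a cutoff. The window size, the 70% cutoff, the function names and the toy alignment are all assumptions for illustration. They are not the settings of ConSite or of the analysis on this slide.

```python
def identity_profile(aln_a, aln_b, window=10):
    """Percent identity for every window start, in alignment coordinates."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    matches = [a == b and a != "-" for a, b in zip(aln_a, aln_b)]
    return [
        100.0 * sum(matches[i:i + window]) / window
        for i in range(len(matches) - window + 1)
    ]

def conserved_hits(hits, aln_a, aln_b, window=10, min_identity=70.0):
    """Keep half-open (start, end) hits that fit inside some conserved window."""
    profile = identity_profile(aln_a, aln_b, window)
    kept = []
    for start, end in hits:
        for w_start, identity in enumerate(profile):
            inside = w_start <= start and end <= w_start + window
            if inside and identity >= min_identity:
                kept.append((start, end))
                break
    return kept

# Hypothetical alignment: first 10 columns identical, last 10 diverged.
human = "AAGTTAATGA" + "CCCCCCCCCC"
mouse = "AAGTTAATGA" + "GGGGTTTTAA"
predicted = [(0, 10), (10, 20)]
kept = conserved_hits(predicted, human, mouse)
print(kept)
```

Only the prediction in the conserved half survives, which mirrors the trade-off reported on the slide: most true sites are retained while the bulk of predictions is discarded.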

  14. ConSite (www.cisreg.ca) • Now Featuring: Ortholog Sequence Retrieval Service

  15. Current Activity: Analysis of Genetic Variation in TFBS
      ACGCATAAGTTAATGAATAACAGAT
      ACGCATAAGTTAATGAATAACAGAT
      ACGCATAAGTTAATGAATAACAGAT
      ACGCATAAGTTAATGAATAACAGAT
      ACGCATAAGTTAATGAATAACAGAT
      ACGCATAAGTTAACGAATAACAGAT
      ACGCATAAGTTAACGAATAACAGAT
      ACGCATAAGTTAACGAATAACAGAT
      ACGCATAAGTTAACGAATAACAGAT

  16. Sequence Variation in TFBS [Diagram labels: URF, AaGT, TSS]

      GENE        DISEASE/CONDITION (associated)    REFERENCE
      UGT1A1      Gilbert’s Syndrome – jaundice     PJ Bosma et al., 1995
      UCP3        Elevated Body Mass                S Otabe et al., 2000
      TNFalpha    Malaria Susceptibility            JC Knight et al., 1999
      Resistin    Elevated Body Mass                JC Engert et al., 2002
      IL4Ralpha   Reduced soluble IL4R              H Hackstein et al., 2001
      ABCA1       Coronary artery disease           KY Zwarts et al., 2002
      Ob          Leptin levels                     J Hager et al., 1998
      PEPCK       Obesity                           Y Olswang et al., 2002
      PR          Endometrial cancer                I DeVivo et al., 2002
      LDLR        Familial hypercholesterolemia     Koivisto et al., 1994

  17. Identifying allele-specific binding site predictions [Plot: S wt - S mt for each window position]
      1234567890123456789012345
      ACGCATAAGTTAAtGAATAACAGAT
      .............c...........
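One way to compute a profile of this kind is to score every window that overlaps the variant position for both alleles and report S wt - S mt. The sketch below uses the 25-bp sequence and the t to c change at position 14 from the slide. The function name and the 3-column `toy_pwm`, which favours the word "AAT", are hypothetical and are not a real binding profile.

```python
def allele_score_differences(pwm, wt_seq, pos, alt_base):
    """Return (window_start, S_wt - S_mt) for every window covering 0-based pos."""
    width = len(next(iter(pwm.values())))
    wt_seq = wt_seq.upper()
    mt_seq = wt_seq[:pos] + alt_base.upper() + wt_seq[pos + 1:]
    first = max(0, pos - width + 1)
    last = min(pos, len(wt_seq) - width)
    diffs = []
    for start in range(first, last + 1):
        s_wt = sum(pwm[b][i] for i, b in enumerate(wt_seq[start:start + width]))
        s_mt = sum(pwm[b][i] for i, b in enumerate(mt_seq[start:start + width]))
        diffs.append((start, s_wt - s_mt))
    return diffs

# Hypothetical toy matrix that prefers "AAT".
toy_pwm = {
    "A": [1.0, 1.0, -1.0],
    "C": [-1.0, -1.0, -1.0],
    "G": [-1.0, -1.0, -1.0],
    "T": [-1.0, -1.0, 1.0],
}

wild_type = "ACGCATAAGTTAAtGAATAACAGAT"   # sequence from the slide
diffs = allele_score_differences(toy_pwm, wild_type, pos=13, alt_base="c")
print(diffs)
```

A large positive difference at some window suggests the wild-type allele matches the profile better there, while a negative value would suggest the variant creates a better match.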

  18. RAVEN screenshots

  19. Recent and Active Projects • JUMBO-JASPAR – Building a second generation open-access database • NHR-scan – Identification of binding sites for nuclear hormone receptors

  20. Discrimination of Regulatory Modules • TFs do NOT act in isolation

  21. Layers of Complexity in Metazoan Transcription

  22. Detecting Clusters of TF Binding Sites • Trained Methods – Sufficient examples of real clusters to establish weights on the relative importance of each TF • Statistical Over-Representation of Combinations – Binding profiles available for a set of biologically motivated TFs

  23. Training for the detection of liver cis-regulatory modules (CRMs)

  24. Building a predictive model (Brief, as this is well described in the literature) [Figure labels: HNF1, C/EBP, HNF3, HNF4] • At 60% sensitivity, predictions made ~1/30,000 bp

  25. UGT1A1 [Plot: Liver Module Model Score against “Window” Position in Sequence; series: Wildtype, Other]

  26. MSCAN: An untrained method for CRM detection (w/ J. Lagergren, Royal Technical University of Sweden) • MSCAN takes as input a user-defined set of TF profiles • Calculates significance for each observed “site” based on local sequence characteristics • Calculates cluster significance using a dynamic programming approach • Approximately 1 significant liver cluster / 18 000 bp in human genome sequence • Filters out statistically significant clusters of sites that contain local repeats • Identification of non-random characteristics in DNA • http://mscan.cgb.ki.se

  27. Current Activities on Combinatorial Binding Prediction • Social network analysis to identify a reliable set of genes regulated by a given set of TFs

  28. Making better predictions • Profiles make far too many false predictions to have predictive value in isolation • Phylogenetic footprinting eliminates ~90% of false predictions • Algorithms for detection of clusters of binding sites perform better, especially when it is possible to train on known examples for the target context

  29. Linking co-expressed genes from microarrays to candidate transcription factors

  30. DISCOVERY: Inferring regulatory mechanisms for subsets of co-expressed genes

  31. Deciphering Regulation of Co-Expressed Genes

  32. oPOSSUM Procedure • Set of co-expressed genes → Automated sequence retrieval from EnsEMBL → Phylogenetic Footprinting (ORCA) → Detection of transcription factor binding sites → Statistical significance of binding sites → Putative mediating transcription factors

  33. Statistical Methods for Identifying Over-represented TFBS • Z scores – Based on the number of occurrences of the TFBS relative to background – Normalized for sequence length – Simple binomial distribution model • Fisher exact probability scores – Based on the number of genes containing the TFBS relative to background – Hypergeometric probability distribution
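The two statistics can be sketched with the standard library alone. The sketch assumes a plain normal approximation to the binomial for the Z score (no continuity correction) and a one-tailed Fisher test on a 2x2 table of target versus background genes. The function names and all counts in the example are hypothetical, and the exact formulation used by oPOSSUM may differ.

```python
from math import comb, sqrt

def z_score(target_hits, target_length, background_hits, background_length):
    """Occurrence-based Z score under a simple binomial model."""
    p = background_hits / background_length      # per-position hit rate in background
    mu = target_length * p                       # expected hits, normalized for length
    sigma = sqrt(target_length * p * (1.0 - p))
    return (target_hits - mu) / sigma

def fisher_one_tailed(t_with, t_total, b_with, b_total):
    """P(at least t_with target genes contain the site), hypergeometric model."""
    total = t_total + b_total
    with_site = t_with + b_with
    upper = min(t_total, with_site)
    tail = sum(
        comb(with_site, k) * comb(total - with_site, t_total - k)
        for k in range(t_with, upper + 1)
    )
    return tail / comb(total, t_total)

# Hypothetical counts: 20 hits in 10 kb of target sequence versus
# 100 hits in 100 kb of background sequence.
z = z_score(20, 10_000, 100, 100_000)
print(round(z, 2))

# Hypothetical counts: all 3 target genes contain the site, none of 3 background genes do.
p_value = fisher_one_tailed(3, 3, 0, 3)
print(p_value)
```

The Z score counts occurrences, so a single gene with many sites can drive it, whereas the Fisher score counts genes and is insensitive to how many sites each gene carries. Reporting both gives complementary views of over-representation.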
