Discovery and Analysis of Regulatory Regions in the Human Genome - PowerPoint PPT Presentation

Discovery and Analysis of Regulatory Regions in the Human Genome
Wyeth Wasserman
Centre for Molecular Medicine and Therapeutics
Children’s and Women’s Hospital
University of British Columbia

Overview CMMT
• Basics of promoter analysis
  – Bioinformatics for detection of transcription factor binding sites
• Discrimination of Regulatory Regions
  – Given binding models for relevant TFs, predict regulatory sequences
  – Genetic variation within regulatory regions
• Pattern discovery (as time permits)
  – Given a set of co-regulated genes, predict binding sites for contributing TFs
  – Given a newly discovered binding profile, predict genes in a regulon

Transcription Simplified
[Figure labels: URF, Pol-II, URE, TATA]

Teaching a computer to find TFBS…

Representing Binding Sites for a TF
• A single site
• AAGTTAATGA
• A set of sites represented as a consensus
• VDRTWRWWSHD (IUPAC degenerate DNA)
• A matrix describing a set of sites
A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4
Set of binding sites:
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA
CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA
AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA
AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
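The step from a set of aligned sites to a count matrix can be sketched in a few lines of Python. This is a minimal illustration, not the tooling behind the slide: the function name `build_pfm` is hypothetical, and only five of the 10 bp sites listed on the slide are used, so the counts will not match the 15-column matrix shown above.

```python
# A few of the aligned 10 bp sites listed on the slide (a subset, for brevity).
sites = [
    "AAGTTAATGA",
    "CAGTTAATAA",
    "GAGTTAAACA",
    "CAGTTAATTA",
    "GAGTTAATAA",
]


def build_pfm(aligned_sites):
    """Count each base at each column of equal-length, pre-aligned sites."""
    width = len(aligned_sites[0])
    if any(len(s) != width for s in aligned_sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * width for base in "ACGT"}
    for site in aligned_sites:
        for i, base in enumerate(site):
            pfm[base][i] += 1
    return pfm


pfm = build_pfm(sites)
for base in "ACGT":
    # One row per base, one column per site position.
    print(base, " ".join(f"{n:2d}" for n in pfm[base]))
```

Every column sums to the number of sites, which is a quick sanity check on any count matrix.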

PFMs to PWMs
One would like to add the following features to the model:
1. Correcting for the base frequencies in DNA
2. Weighting for the confidence (depth) in the pattern
3. Converting to log-scale probability for easy arithmetic
$w(b,i) = \log\left(\frac{f(b,i) + s(N)}{p(b)}\right)$
f matrix:
A  5 0 1 0 0
C  0 2 2 4 0
G  0 3 1 0 4
T  0 0 1 1 1
w matrix:
A   1.6 -1.7 -0.2 -1.7 -1.7
C  -1.7  0.5  0.5  1.3 -1.7
G  -1.7  1.0 -0.2 -1.7  1.3
T  -1.7 -1.7 -0.2 -0.2 -0.2
Score for TGCTG = 0.9
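The conversion can be sketched as follows. The slide does not spell out s(N), p(b), or the base of the logarithm, so the choices here are assumptions: a total pseudocount of sqrt(N) shared among the bases in proportion to the background, a uniform background p(b) = 0.25, counts normalized by N + sqrt(N), and log base 2. Under those assumptions the sketch gives the slide's w-matrix values to one decimal place. The function names are hypothetical.

```python
import math

# Count matrix (f matrix) from the slide: 5 sites, 5 columns.
pfm = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}


def pfm_to_pwm(pfm, background=0.25):
    """Convert counts to log2-odds weights.

    ASSUMPTIONS (not stated on the slide): total pseudocount sqrt(N),
    uniform background, log base 2.
    """
    n_sites = sum(pfm[b][0] for b in "ACGT")
    pseudo = math.sqrt(n_sites)
    pwm = {}
    for b in "ACGT":
        pwm[b] = [
            math.log2(
                ((count + pseudo * background) / (n_sites + pseudo)) / background
            )
            for count in pfm[b]
        ]
    return pwm


def score(pwm, site):
    """Sum the weights of the observed base at each position."""
    return sum(pwm[base][i] for i, base in enumerate(site))


pwm = pfm_to_pwm(pfm)
for b in "ACGT":
    print(b, " ".join(f"{w:5.1f}" for w in pwm[b]))
print("TGCTG", round(score(pwm, "TGCTG"), 1))
```

Because the weights are logarithms, a site's score is a plain sum over positions, which is the "easy arithmetic" the slide refers to.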

Performance of Profiles
• 95% of predicted sites bound in vitro (Tronche 1997)
• MyoD binding sites predicted about once every 600 bp (Fickett 1995)
• The Futility Theorem
  – Nearly 100% of predicted transcription factor binding sites have no function in vivo

A 1 kbp promoter screened with a collection of TF profiles

Phylogenetic Footprinting for better specificity: 70,000,000 years of evolution reveals most regulatory regions.

SIDENOTE: Global Progressive Alignments (ORCA Algorithm)
• Memory required for a global alignment scales with the product of the sequence lengths
• Progressive alignment: band the comparison with local alignments (e.g. BLAST), then run the global method on the banded sub-segments
• Recursion with decreasingly stringent parameters

Phylogenetic Footprinting Identifies Functional Segments
[Figure: % identity in 200 bp windows against window start position (human sequence). Actin gene compared between human and mouse by ORCA.]

Phylogenetic Footprinting (2)
[Figure: FoxC2. % identity (0% to 100%) against start position of 200 bp window (0 to 7000).]
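The conservation plots on these slides report percent identity in sliding windows. A minimal sketch of that calculation, assuming the two sequences are already globally aligned (gaps written as "-") and, as a simplification, taking windows over alignment columns instead of human sequence positions. The function name and the toy sequences are hypothetical; a real run would use a window of 200 as on the slides.

```python
def windowed_identity(aln_a, aln_b, window=200, step=1):
    """Percent identity per window over two gapped, equal-length aligned strings.

    A column counts as a match only if both sequences carry the same base;
    gap columns never match.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aln_a) - window + 1, step):
        pairs = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        results.append((start, 100.0 * matches / window))
    return results


# Toy alignment with a tiny window so the arithmetic is easy to follow.
human = "ACGTACGTAC"
mouse = "ACGTATG-AC"
profile = windowed_identity(human, mouse, window=5, step=5)
for start, pct in profile:
    print(start, pct)
```

Segments whose identity stays above a chosen threshold are the candidates retained by a conservation filter; the threshold itself is a tuning choice not specified here.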

Recall...

1 kbp promoter with phylogenetic footprinting

Choosing the “right” species...
[Figure panels: CHICKEN vs. HUMAN, MOUSE vs. HUMAN, COW vs. HUMAN]

Performance: Human vs. Mouse
[Figure panels: SELECTIVITY, SENSITIVITY]
• Testing set: 40 experimentally defined sites in 15 well studied genes (replicated with a 100+ site set)
• 75-90% of defined sites detected with conservation filter, while only 11-16% of total predictions retained

ConSite (www.phylofoot.org): now driven by the ORCA Aligner

Emerging Issues
• Multiple sequence comparisons
  – Incorporate phylogenetic trees
  – Visualization
• Analysis of closely related species
  – Phylogenetic shadowing
• Genome rearrangements
  – Inversion compatible alignment algorithm
• Higher order models of TFBS

Improving Pattern Discrimination: TFs do NOT act in isolation

Layers of Complexity in Metazoan Transcription

Biochemical complexity enables greater complexity in regulation
[Figure: regulatory elements around a yeast ORF (scale: 500 bp) compared with a human gene with exons 1 to 3 (scale: 20 000 bp); elements labelled GO]

Detecting Clusters of TF Binding Sites
• Trained Methods
  – Sufficient examples of real clusters to establish weights on the relative importance of each TF
• Statistical Over-representation
  – Binding profiles available for a set of biologically motivated

Training for the detection of liver cis-regulatory modules (CRMs)

Models for Liver TFs… (10 second slide for 3 months of work)
[Figure panels: HNF3, HNF1, HNF4, C/EBP]

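Logistic Regression Analysis
[Diagram: each input is multiplied by its weight ($\alpha_1$ to $\alpha_4$) and the products are summed ($\Sigma$) to give the “logit”.]
Optimize the $\alpha$ vector to maximize the distance between output values for positive and negative training data.
Output value is: $p(x) = \frac{e^{\text{logit}}}{1 + e^{\text{logit}}}$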
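The output function on this slide is short enough to write out directly. In the sketch below the function name, the weights in `alphas`, and the input scores are all hypothetical placeholders; the slide does not give fitted values, and any intercept term is omitted because none appears in the diagram.

```python
import math


def crm_probability(tf_scores, alphas):
    """Logistic output: weighted sum of inputs (the logit) mapped into (0, 1)."""
    if len(tf_scores) != len(alphas):
        raise ValueError("need one weight per input score")
    logit = sum(a * x for a, x in zip(alphas, tf_scores))
    return math.exp(logit) / (1.0 + math.exp(logit))


# Hypothetical weights for four inputs (for example, one per liver TF model).
alphas = [1.2, 0.8, 0.5, 0.9]

print(crm_probability([0.0, 0.0, 0.0, 0.0], alphas))   # logit of 0 gives 0.5
print(crm_probability([1.0, 0.5, 0.0, 2.0], alphas))   # positive logit, above 0.5
```

Training amounts to choosing the weights so that known modules score near 1 and background sequence scores near 0; the fitting procedure itself is outside this sketch.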

Performance of the Liver Model
• Performance
  – Sensitivity: 60% of known CRMs detected
  – Specificity: 1 prediction/35,000 bp
• Limitations
  – Applies to genes expressed late in hepatocyte differentiation
  – Requires 10-15 genes in positive training set
  – This model doesn’t account for multiple sites for the same TF
    • New methods from several groups address this limit

UGT1A1
[Figure: liver module model score (-0.2 to 1) against “window” position in sequence (100 to 5840); series: Wildtype, Other]

MSCAN: An untrained method for CRM detection (w/ J. Lagergren, Royal Technical University of Sweden)
• MSCAN takes as input a user-defined set of TF profiles
• Calculates significance for each observed “site” based on local sequence characteristics
• Calculates cluster significance using a dynamic programming approach
• Approximately 1 significant liver cluster / 18 000 bp in human genome sequence
• Filters out statistically significant clusters of sites that contain local repeats
• Identification of non-random characteristics in DNA
http://mscan.cgb.ki.se

JASPAR (jaspar.cgb.ki.se): open-access database of TF binding profiles

Making better predictions
• Profiles make far too many false predictions to have predictive value in isolation
• Phylogenetic footprinting eliminates ~90% of false predictions
• Algorithms for detection of clusters of binding sites perform better, especially when it is possible to create trained discriminant functions

RAVEN Project: Regulatory Analysis of Variation in ENhancers
Genetic variation in TFBS can result in biomedically important phenotypes

Sequence Variation in TFBS
[Figure labels: URF, AaGT, TSS]
GENE        DISEASE/CONDITION (associated)    REFERENCE
UGT1A1      Gilbert’s Syndrome (jaundice)     PJ Bosma et al., 1995
UCP3        Elevated Body Mass                S Otabe et al., 2000
TNFalpha    Malaria Susceptibility            JC Knight et al., 1999
Resistin    Elevated Body Mass                JC Engert et al., 2002
IL4Ralpha   Reduced soluble IL4R              H Hackstein et al., 2001
ABCA1       Coronary artery disease           KY Zwarts et al., 2002
Ob          Leptin levels                     J Hager et al., 1998
PEPCK       Obesity                           Y Olswang et al., 2002
PR          Endometrial cancer                I DeVivo et al., 2002
LDLR        Familial hypercholesterolemia     Koivisto et al., 1994

Stage 1: Prediction of Regulatory Regions

Stage 1: Identify Putative Regulatory Regions
• Retrieves orthologous human and mouse gene sequences from GeneLynx
• Aligns sequences with ORCA Aligner
• Finds most significant non-coding regions
• Designs primers
[Figure: FoxC2. % identity (0% to 100%) against position (0 to 7000).]

Data/Orthology obtained from GeneLynx (www.genelynx.org)

Stage 2: Analysis of Polymorphisms
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT

Identify variations that generate allele-specific binding site predictions
[Figure: differences in scores (y-axis, -4 to 4); x-axis ticks 1 to 11]
1234567890123456789012345
ACGCATAAGTTAAtGAATAACAGAT
.............c...........
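One way to picture the allele comparison is to score every window of both alleles with the same profile and subtract. The sketch below is an illustration under stated assumptions, not the RAVEN implementation: the profile is built from six 10 bp sites taken from the earlier binding-site slide, the pseudocount (sqrt(N)), uniform background, and log base 2 are assumed, and all function names are hypothetical. The two alleles are the sequences shown on the Stage 2 slide.

```python
import math

# Six example sites from the binding-site slide, used to build a toy profile.
SITES = [
    "AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA",
    "CAGTTAATTA", "GAGTTAATAA", "AAGTTAACGA",
]


def build_pwm(sites, background=0.25):
    """Log2-odds weights per column (ASSUMED pseudocount: sqrt(N) in total)."""
    n = len(sites)
    pseudo = math.sqrt(n)
    pwm = []
    for i in range(len(sites[0])):
        column = {}
        for base in "ACGT":
            count = sum(1 for s in sites if s[i] == base)
            freq = (count + pseudo * background) / (n + pseudo)
            column[base] = math.log2(freq / background)
        pwm.append(column)
    return pwm


def window_scores(pwm, seq):
    """Score every window of len(pwm) bases along seq."""
    width = len(pwm)
    return [
        sum(pwm[j][seq[i + j]] for j in range(width))
        for i in range(len(seq) - width + 1)
    ]


ref = "ACGCATAAGTTAATGAATAACAGAT"       # allele with T at position 14
alt = ref[:13] + "C" + ref[14:]         # allele with C at position 14

pwm = build_pwm(SITES)
diffs = [r - a for r, a in zip(window_scores(pwm, ref), window_scores(pwm, alt))]
for start, d in enumerate(diffs, start=1):
    print(start, round(d, 2))
```

Only windows that overlap the variant position can differ between alleles, so every other window has a difference of exactly zero.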

RAVEN Implementation Status: a first look at the alpha-version of the RAVEN service…

RAVEN screenshots

Stage 3: Prediction of Regulatory “HotSpots”

UGT1A1 (Gilbert’s Syndrome)
[Figure: liver module model score (-0.2 to 1) against “window” position in sequence (100 to 5840); series: Wildtype, Mutant]

“HotSpots” in Muscle Regulatory Module (200bp)
[Figure: maximum differential for any potential SNP (y-axis, -0.2 to 0.2)]
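The "maximum differential for any potential SNP" can be sketched as an exhaustive scan: substitute each of the three alternative bases at every position and record the largest change in the model score. The scoring function below is a deliberately simple stand-in (best count of bases matching one motif), not the muscle module model from the slide, and both function names are hypothetical. Any function that maps a sequence to a number could be passed in instead.

```python
def best_match(seq, motif="AAGTTAATGA"):
    """Toy stand-in score: most bases matching `motif` in any window of seq."""
    width = len(motif)
    return max(
        sum(a == b for a, b in zip(seq[i:i + width], motif))
        for i in range(len(seq) - width + 1)
    )


def max_differential(seq, score_fn):
    """For each position, the largest-magnitude score change over all 3 SNPs."""
    baseline = score_fn(seq)
    profile = []
    for i, ref_base in enumerate(seq):
        deltas = [
            score_fn(seq[:i] + alt + seq[i + 1:]) - baseline
            for alt in "ACGT"
            if alt != ref_base
        ]
        profile.append(max(deltas, key=abs))
    return profile


seq = "ACGCATAAGTTAATGAATAACAGAT"
hotspots = max_differential(seq, best_match)
for pos, delta in enumerate(hotspots, start=1):
    print(pos, delta)
```

Positions where the differential is far from zero are the candidate hotspots; with this toy score they are exactly the ten positions covered by the motif match.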
