NGS Sequence Analysis for Regulation and Epigenomics
Timothy Bailey
Winter School in Mathematical and Computational Biology, July 3, 2012
NGS Analysis and Transcriptional Regulation
- RNA-seq
– Measuring transcription levels (gene expression)
– Detecting RNA regulators (e.g., miRNA)
- ChIP-seq (and ChIP-exo)
– Chromatin modifications
– Binding of transcription factor proteins
Talk Overview
- I. Basic Transcriptional Regulation
- II. ChIP-seq and ChIP-exo
- III. Analyzing ChIP-seq & ChIP-exo data
a) Mapping
b) Peak calling
c) Motif discovery & Enrichment Analysis
d) Location analysis
Part I: Basic Transcriptional Regulation
Source: Steven Chu
Transcription Factors
- Mammalian transcription is controlled (in part) by about 1400 transcription factor (TF) proteins.
- These proteins control transcription in two main ways:
– Directly, by promoting (or preventing) the assembly of the pre-initiation complex.
– Indirectly, by modifying the chromatin.
BASAL TRANSCRIPTION:
- The pre-initiation complex assembles at the core promoter.
- This results in only low levels of transcription because the interaction is unstable.
[Figure: DNA with the core promoter (TATA, INR)]
PROXIMAL PROMOTER:
- The proximal promoter extends upstream of the promoter.
- It contains binding sites for repressor and activator transcription factors.
[Figure: DNA with the proximal promoter (TATA, INR)]
ACTIVATORS:
- Some transcription factors ("activators") bind to sites in the proximal promoter.
- This stabilizes the transcriptional machinery.
- This increases transcription.
[Figure: DNA with the proximal promoter (TATA, INR)]
REPRESSORS:
- Some factors do not stabilize the transcriptional machinery.
- Their binding can block binding by co-factors and activators.
- This reduces transcription.
[Figure: DNA with the proximal promoter (TATA, INR)]
ENHANCER REGIONS:
- Groups of binding sites located upstream or downstream of a promoter.
- Often very distant: 1000s of base pairs.
[Figure: DNA with the proximal promoter (TATA, INR) and an enhancer region 1-100 kb away]
ENHANCER REGIONS:
- Both activator and repressor transcription factors can occupy enhancer regions.
- DNA looping brings factors into contact with transcriptional machinery.
- Bound activators increase transcription.
[Figure: DNA with the proximal promoter (TATA, INR) and an enhancer region]
Chromatin modification by TFs:
- Histone Acetyltransferases (HATs) acetylate histones.
- Tissue-specific transcription factors can bind to HATs, causing chromatin to open.
- This can increase transcription.
[Figure: DNA with the proximal promoter (TATA, INR) and an enhancer region; labels: Specific, General, HAT]
Part II: ChIP-seq & ChIP-exo
Source: Steven Chu
ChIP-seq
ChIP-exo
Rhee and Pugh, Cell, 2011
ChIP-seq & ChIP-exo
Part III: Analyzing ChIP-seq Data
Source: Steven Chu
Analyzing TF ChIP-seq Data
- Key messages of this talk:
– Use controls!
– Validate your data at each step.
– But this is Science! What could possibly go wrong…?
Things that can go wrong in ChIP-seq…
- 1. Low-affinity antibody
- 2. Non-specific antibody
- 3. Contamination
- 4. Poor choice of peak calling algorithm (or parameters)
… etc.
Steps in ChIP-seq Data Analysis
- 1. Mapping: where do the sequence “tags” map to the genome?
- 2. Peak Calling: where are the regions of significant tag concentration?
- 3. Motif Discovery: what is the binding motif?
- 4. Location Analysis: where are the peaks with respect to genes, promoters, introns, etc.?
1) Mapping ChIP-seq Tags
- Tags: ChIP-seq produces a pool of “tags” (~100 bp)
- Tag Count: measure of enrichment of region
- Negative Control: “input DNA” tag count
Tallack et al., Genome Res., 2010
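As a rough illustration of tag counting, the sketch below bins mapped tag positions into fixed-width windows for a ChIP sample and an input-DNA control. The window size, the function name `count_tags`, and the toy positions are assumptions made for illustration; they do not come from any particular mapping tool.

```python
from collections import Counter

def count_tags(positions, window=200):
    """Count mapped tags per fixed-width genomic window.

    positions: iterable of (chrom, start) pairs, one per mapped tag.
    Returns a Counter keyed by (chrom, window_index).
    """
    counts = Counter()
    for chrom, start in positions:
        counts[(chrom, start // window)] += 1
    return counts

# Toy data (hypothetical): ChIP tags pile up in one window of chr1.
chip_tags = [("chr1", 1010), ("chr1", 1050), ("chr1", 1120), ("chr1", 5000)]
input_tags = [("chr1", 1100), ("chr1", 5020)]

chip_counts = count_tags(chip_tags)
input_counts = count_tags(input_tags)

# Print the ChIP and input tag counts side by side for each window.
for key in sorted(chip_counts):
    print(key, chip_counts[key], input_counts.get(key, 0))
```

Comparing the two counts per window is what makes the input DNA useful as a negative control: a window with many ChIP tags but also many input tags is less convincing than one with few input tags.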
2) ChIP-seq Peak Calling
- ChIP-seq produces a pool of “tags”.
- Tags are currently about 100 bp long.
- Tag is the 5’ end of a DNA fragment.
- But DNA is double-stranded so…
Wilbanks and Facciotti, PLoS One, 2010
ChIP-seq Peak Calling
- Peak callers combine overlapping tags to get the “peak height”.
- Sometimes strand information is used to combine tags on opposite strands.
- Fold-enrichment (tag count / control tag count) is usually used as the criterion for declaring a peak.
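The fold-enrichment criterion can be sketched as follows. This is a minimal illustration, not the method of any specific peak caller: the library-size scaling, the pseudocount, the `min_fold` threshold, and all names are assumptions, and real peak callers differ in detail.

```python
def fold_enrichment(chip_count, control_count, chip_total, control_total,
                    pseudocount=1.0):
    """Fold-enrichment of ChIP tags over control tags in one region.

    The control count is rescaled to the ChIP library size, and a
    pseudocount avoids division by zero (both are assumptions).
    """
    scale = chip_total / control_total
    return (chip_count + pseudocount) / (control_count * scale + pseudocount)

def call_peaks(chip_counts, control_counts, chip_total, control_total,
               min_fold=5.0):
    """Return (region, fold) pairs with fold >= min_fold, best first."""
    peaks = []
    for region, chip_count in chip_counts.items():
        fold = fold_enrichment(chip_count, control_counts.get(region, 0),
                               chip_total, control_total)
        if fold >= min_fold:
            peaks.append((region, fold))
    return sorted(peaks, key=lambda item: item[1], reverse=True)

# Toy counts per region (hypothetical).
chip = {"region1": 50, "region2": 4}
control = {"region1": 4, "region2": 3}
peaks = call_peaks(chip, control, chip_total=1000, control_total=1000)
print(peaks)
```

Changing `min_fold` changes how many peaks are declared, which is one way a poor choice of parameters can affect everything downstream.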
…ChIP-seq Peak Callers
Wilbanks and Facciotti, PLoS One, 2010
Sanity check: are your peaks reasonable?
- Width: TF ChIP-seq peaks should be relatively short (< 300 bp) compared to histone modification peaks.
– Are your peaks too wide?
- Number: Is the number of TF ChIP-seq peaks reasonable?
– Some key TFs bind ~30,000 sites, but your TF probably binds far fewer (~1000?).
- Location: Do your peaks co-occur with histone marks and genes your TF regulates?
- The next analysis steps will help you answer these questions!
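The width and number checks above can be sketched as a small summary over peak intervals. The 300 bp cutoff comes from the width bullet; the function name, the output fields, and the toy peaks are assumptions for illustration.

```python
from statistics import median

def summarize_peaks(peaks, max_tf_width=300):
    """Summarize peak number and widths.

    peaks: list of (chrom, start, end) tuples with end > start.
    max_tf_width: widths above this are flagged as wide for a TF.
    """
    if not peaks:
        raise ValueError("no peaks to summarize")
    widths = [end - start for _, start, end in peaks]
    n_wide = sum(width > max_tf_width for width in widths)
    return {
        "n_peaks": len(peaks),
        "median_width": median(widths),
        "fraction_wide": n_wide / len(peaks),
    }

# Toy peaks (hypothetical); the third one is suspiciously wide.
toy_peaks = [
    ("chr1", 100, 250),
    ("chr1", 1000, 1200),
    ("chr2", 500, 1500),
    ("chr2", 3000, 3100),
]
summary = summarize_peaks(toy_peaks)
print(summary)
```

A large fraction of wide peaks, or a peak count far from what is plausible for the factor, is a hint to revisit the peak calling step before moving on.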
3) Motif Discovery & Enrichment Analysis
- If your TF binds DNA directly (and sequence-specifically), Motif Discovery should find its binding motif.
- The DNA-binding motif of your TF should be centrally enriched in the peaks, and Central Motif Enrichment Analysis (CMEA) should find it.
Caveats in ChIP-seq Motif Analysis
- Peak regions may contain other TF motifs due to looping.
- The binding of the ChIP-ed factor “X” may be indirect.
- ChIP-ed motif might be weak due to assisted binding.
Farnham, Nature Reviews Genetics, 2009
TF Binding Motif Discovery
- ChIP-seq provides extremely rich data for inferring the DNA-binding affinity of the ChIP-ed transcription factor.
- In principle, discovering the motif is simple.
- ChIP-seq peaks tend to be within +/- 50 bp of the bound factor.
- So we just examine the peak regions for enriched patterns.
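To make "examine the peak regions for enriched patterns" concrete, here is a toy k-mer enrichment sketch. It is not the algorithm used by MEME, DREME, or any other tool named in this talk. It assumes the sequences are already extracted around the peak summits, it ignores reverse complements, and the pseudocount and all names are assumptions.

```python
from collections import Counter

def kmer_counts(seqs, k):
    """Count how many sequences contain each k-mer at least once."""
    counts = Counter()
    for seq in seqs:
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    return counts

def enriched_kmers(peak_seqs, background_seqs, k=4, pseudocount=1.0):
    """Rank k-mers by the ratio of peak to background sequence fractions."""
    fg = kmer_counts(peak_seqs, k)
    bg = kmer_counts(background_seqs, k)
    scores = {}
    for kmer, n in fg.items():
        fg_frac = (n + pseudocount) / (len(peak_seqs) + pseudocount)
        bg_frac = (bg.get(kmer, 0) + pseudocount) / (len(background_seqs) + pseudocount)
        scores[kmer] = fg_frac / bg_frac
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy sequences (hypothetical): every peak sequence contains TGGC.
peak_seqs = ["AATGGCAA", "CCTGGCTT", "GTTGGCAC"]
background_seqs = ["AACCGGTT", "ACGTACGT", "TTTTAAAA"]

ranked = enriched_kmers(peak_seqs, background_seqs)
print(ranked[0])
```

Real motif discovery tools model degenerate positions and statistical significance rather than exact words, so this only conveys the basic idea of comparing peak regions against a background.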
MEME Suite tools for ChIP-seq motif discovery and enrichment
- The MEME Suite (http://meme.nbcr.net) contains several motif discovery and enrichment algorithms appropriate for ChIP-seq data analysis.
– Discovery & Enrichment: MEME-ChIP
– Discovery: MEME, DREME, GLAM2
– Enrichment: CentriMo, AME
Example: Motif discovery in NFIC ChIP-seq data
- Pjanic et al. predicted 39,807 ChIP-seq peaks in NFIC ChIP-seq data.
- They do not report using motif discovery on these peaks.
- We used MEME-ChIP, which runs both MEME and DREME, to perform motif discovery on the 100-bp NFIC ChIP-seq peak regions.
Machanick & Bailey, Bioinformatics, 2011
Motif discovery fails in the (original) NFIC dataset
- An NFIC motif is known from in vitro data, based on only 16 sites.
- MEME and DREME fail to find this motif in the NFIC data.
- But so do the other algorithms we tried: Amadeus, peak-motifs, Trawler and Weeder.
The problem: poor peak calling!
- We applied a different ChIP-seq peak calling algorithm (ChIP-peak) which predicts only 700 peaks (rather than 40,000).
- MEME discovers the NFI-family binding motif in this new set of peaks.
[Figure: “site-probability” curve (probability vs. position of best site) with the MA0119.1 motif logo]
Central Motif Enrichment Analysis: CentriMo
- CentriMo searches for known motifs whose sites are most centrally enriched in the ChIP-seq regions.
- Use 500 bp regions centered on each ChIP-seq peak.
[Figure: 500-bp ChIP-seq regions. W = 120, L = 500, S = number of “successes” = 4, T = number of “trials” = 5]
Bailey et al., NAR, 2012
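One way to read the W, L, S and T values in the figure is as a binomial test, sketched below. This is a simplified illustration under stated assumptions, not CentriMo's actual implementation: it reads W as the width of a central window and L as the region length, takes W/L as the chance that a best site lands in the window when positions are uniform, and ignores motif width. The published method may differ in detail, for example in how motif width and multiple window widths are handled.

```python
from math import comb

def central_enrichment_pvalue(successes, trials, window, length):
    """Binomial tail probability P(X >= successes).

    successes: regions whose best site falls in the central window (S).
    trials: regions that contain a site at all (T).
    window, length: central window width (W) and region length (L).
    Assumes best-site positions are uniform under the null, so each
    trial succeeds with probability window / length.
    """
    p = window / length
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Values taken from the figure: W=120, L=500, S=4, T=5.
pvalue = central_enrichment_pvalue(successes=4, trials=5, window=120, length=500)
print(pvalue)
```

With only five regions the tail probability is about 0.013; with thousands of regions, the same degree of central concentration gives far smaller p-values.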
Central Motif Enrichment confirms the known NFIC motif, even in the original peaks
- NFIC motif is most centrally enriched of 862 JASPAR+UniPROBE motifs (p = 10^-31).
- However, standard motif enrichment algorithms (including AME) do not show the NFIC motif as the most enriched motif.
[Figure: site-probability curves (probability vs. position of best site in sequence, -250 to 250) with the NFIC (MA0119.1) logo. Legend:
MA0119.1 p=2.4e-031, w=295, n=5409
MA0244.1 p=4.6e-015, w=381, n=39398
MA0161.1 p=7.3e-015, w=329, n=39356
MA0099.1 p=5.5e-014, w=343, n=34267
MA0406.1 p=8.1e-012, w=323, n=31383]
Central Motif Enrichment Analysis shows when things go right (or wrong).
- 1. Published successful KLF1 ChIP-seq.
- 2. Published failed KLF1 ChIP-seq.
[Figure, panel 1: Successful KLF1 ChIP-seq. Site-probability curves (probability vs. position of best site in sequence, -250 to 250) with the KLF4 logo. Legend:
MA0039.2 p=4.4e-066, w=111, n=712
Klf7_primary p=6.9e-056, w=103, n=676
MA0140.1 p=1.5e-048, w=177, n=693
MA0035.2 p=2.4e-040, w=194, n=756]
[Figure, panel 2: Failed KLF1 ChIP-seq. Site-probability curve (probability vs. position of best site in sequence, -250 to 250) with the KLF4 (MA0039.2) logo. Legend:
MA0039.2 p=7.2e-001, w=365, n=11404 (p = 0.7)]
Motif Spacing Analysis finds co-factor motifs and TF complexes
4) Location Analysis
- Counts how often TF binding sites are in, say, promoters, intergenic or intragenic regions.
Farnham, Nature Reviews Genetics, 2009
Predicting Target Genes
- Location analysis allows identification of target genes.
- TF binding sites in promoters probably are regulatory.
- “Nearest TSS” rule is often used to assign binding sites to target genes, but distal sites may regulate some other gene via chromatin looping.
Farnham, Nature Reviews Genetics, 2009
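The “nearest TSS” rule can be sketched with a binary search over sorted TSS coordinates. This is a minimal illustration for a single chromosome that ignores gene strand. The 1 kb promoter window, the function names, and the toy coordinates are assumptions, and, as noted above, the nearest gene is only a heuristic guess at the true target.

```python
from bisect import bisect_left
from collections import Counter

def nearest_tss(site, tss_positions):
    """Return (tss, site - tss) for the TSS closest to a binding site.

    tss_positions: non-empty sorted list of TSS coordinates on the
    same chromosome as the site.
    """
    i = bisect_left(tss_positions, site)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = tss_positions[max(i - 1, 0):i + 1]
    best = min(candidates, key=lambda tss: abs(site - tss))
    return best, site - best

def classify(distance, promoter_window=1000):
    """Label a site as promoter or distal (window size is an assumption)."""
    return "promoter" if abs(distance) <= promoter_window else "distal"

# Toy coordinates (hypothetical).
tss_list = [1000, 5000, 20000]
sites = [1200, 12000, 500, 30000]

assignments = [nearest_tss(site, tss_list) for site in sites]
labels = Counter(classify(distance) for _, distance in assignments)
print(assignments)
print(labels)
```

Collecting the signed distances for all peaks gives the data for a distance-to-nearest-TSS histogram, and the label counts give a simple location summary.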
Example: Binding near TSSs
- Histogram of distances from Klf1 ChIP-seq peak to the nearest TSS.
- KLF1 has a population of binding sites in promoters (small hump on left), but most are distal.
Tallack et al., Genome Res., 2010
Acknowledgements
The MEME Suite
- Tom Whitington
- Philip Machanick
- James Johnson
- Martin Frith
- William Noble
- Charles Grant
- Shobhit Gupta
KLF Project
- Michael Tallack
- Tom Whitington
- Andrew Perkins
- Sean Grimmond
- Brooke Gardiner
- Ehsan Nourbakhsh
- Nicole Cloonan
- Elanor Wainwright
- Janelle Keys
- Wai Shan Yuen