NGS Sequence Analysis for Regulation and Epigenomics
Timothy Bailey
Winter School in Mathematical and Computational Biology, July 2, 2013
NGS Analysis and Transcriptional Regulation
- RNA-seq
– Measuring transcription levels (gene expression)
– Detecting RNA regulators (e.g., miRNA)
- ChIP-seq
– Chromatin modifications
– Binding of transcription factor proteins
Talk Overview
- I. Transcriptional Regulation 101
- II. ChIP-seq 101
- III. Analyzing ChIP-seq data
- IV. Combining ChIP-seq and RNA-seq
Part I: Basic Transcriptional Regulation
Source: Steven Chu
Transcription Factors
- Mammalian transcription is controlled
(in part) by about 1400 DNA-binding transcription factor (TF) proteins.
- These proteins control transcription in
two main ways:
– Directly, by promoting (or preventing) the assembly of the pre-initiation complex.
– Indirectly, by modifying chromatin.
BASAL TRANSCRIPTION:
- The pre-initiation complex assembles at
the core promoter.
- This results in only low levels of
transcription because the interaction is unstable.
[Figure: DNA with the core promoter elements TATA and INR.]
[Figure: DNA with the proximal promoter next to the core promoter elements TATA and INR.]
PROXIMAL PROMOTER:
- The proximal promoter extends upstream of the promoter.
- It contains binding sites for repressor and
activator transcription factors.
ACTIVATORS:
- Some transcription factors (“activators”) stabilize the transcriptional machinery when they bind to sites in the proximal promoter.
- This increases transcription.
REPRESSORS:
- Some factors do not stabilize the transcriptional machinery.
- Their binding can block binding by co-factors and activators.
- This reduces transcription.
ENHANCER REGIONS:
- Groups of binding sites located upstream or downstream of a promoter.
- Often very distant: 1000s of base pairs.
- Activator and repressor transcription factors compete to occupy enhancer regions.
- DNA looping brings factors into contact with transcriptional machinery.
- Bound activators increase transcription.
[Figure: DNA with an enhancer region 1-100 kb from the proximal promoter and the core promoter elements TATA and INR.]
Chromatin modification by TFs:
- Tissue-specific transcription factors can bind to HATs, causing chromatin to open.
- This can increase transcription.
- Example: Histone Acetyltransferases (HATs) acetylate histones.
[Figure: DNA with enhancer region, proximal promoter and core promoter elements TATA and INR; labels: Specific, General, HAT.]
Part II: ChIP-seq Overview
Source: Steven Chu
ChIP-seq
- Chromatin ImmunoPrecipitation followed by high-throughput sequencing.
- TF binding sites (“punctate peaks”)
- Chromatin mods (“broad peaks”)
Steps in ChIP-seq
- Cross-link proteins to DNA
- Fragment chromatin
- Immunoprecipitate with antibody to protein
- Size-select and ligate
- Amplify
- Sequence
What can I learn from ChIP-seq?
- What chromatin regions
are marked as active promoters or enhancers?
- Where is my TF bound?
- What is its DNA-binding
motif?
- What genes might it
regulate?
Part III: Analyzing ChIP-seq Data
Source: Steven Chu
Analyzing TF ChIP-seq Data
- Key messages of this talk:
– Use controls!
– Validate your data at each step.
– But this is Science! What could possibly go wrong…?
Things that can go wrong in ChIP-seq…
- 1. Low affinity antibody
- 2. Non-specific antibody
- 3. Contamination
- 4. Poor choice of peak calling algorithm (or
parameters) … etc.
Steps in ChIP-seq Data Analysis
- 1. Mapping: where do the sequence “tags”
map to the genome?
- 2. Peak Calling: where are the regions of
significant tag concentration?
- 3. Motif Discovery: what is the binding
motif?
- 4. Location Analysis: where are the peaks w/
respect to genes, promoters, introns etc?
1) Mapping ChIP-seq Tags
- Tags: ChIP-seq produces a pool of
“tags” (~100bp)
- Tag Count: measure of enrichment of region
- Negative Control: “input DNA” tag count
Tallack et al., Genome Res., 2010
Do the mapped tags make sense?
- Each ~100 bp tag is the
5’ end of a DNA fragment.
- But DNA is double-
stranded so there are tags from both strands.
- We expect pairs of clusters of tags on opposite strands, separated by the fragment length.
Wilbanks and Facciotti, PLoS One, 2010
Strand Cross Correlation Analysis (SCCA)
- If we shift the anti-sense tags left by the (average) fragment length, we should see maximum correlation between the reads on the two strands.
Kharchenko et al., Nature Biotechnology, 2009
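The shifting idea above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the method of any particular tool: it uses a single toy chromosome, per-base tag-start counts, and a plain Pearson correlation. The function name, the toy binding sites and the 150 bp fragment length are all hypothetical.

```python
def strand_cross_correlation(plus_starts, minus_starts, genome_len, max_shift):
    """Pearson correlation between plus-strand tag counts and minus-strand
    tag counts after shifting the minus-strand tags left by 0..max_shift bp."""
    plus = [0] * genome_len
    minus = [0] * genome_len
    for p in plus_starts:
        plus[p] += 1
    for m in minus_starts:
        minus[m] += 1

    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        vx = sum((a - mx) ** 2 for a in x)
        vy = sum((b - my) ** 2 for b in y)
        return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0

    profile = {}
    for shift in range(max_shift + 1):
        # Shifting minus-strand tags left by `shift` pairs minus[i + shift] with plus[i].
        profile[shift] = pearson(plus[: genome_len - shift], minus[shift:])
    return profile


# Toy data (hypothetical): 8 binding sites, 3 tags per strand per site,
# minus-strand tags exactly one 150 bp fragment length downstream.
sites = [300, 900, 1400, 2100, 2750, 3300, 4100, 4600]
plus_starts = [s for s in sites for _ in range(3)]
minus_starts = [s + 150 for s in sites for _ in range(3)]

profile = strand_cross_correlation(plus_starts, minus_starts, genome_len=5000, max_shift=300)
best_shift = max(profile, key=profile.get)
print(best_shift)
```

On this idealized input the correlation peaks at the simulated fragment length; real data are noisier, and the peak estimates the average fragment length.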
SCCA often shows two maxima
- Fragment-length peak at average fragment length (as we expected)
- Read-length peak at average read length (due to variable and dispersed mappability of genomic positions)
[Figure: strand cross-correlation plot with the read-length peak and the fragment-length peak labelled.]
Landt S G et al. Genome Res. 2012;22:1813-1831
Quality control 1: SCCA identifies failed ChIP-seq
Landt S G et al. Genome Res. 2012;22:1813-1831
ENCODE Guidelines:
- Normalized Strand Correlation, NSC > 1.05
- Relative Strand Correlation, RSC > 0.8
- https://code.google.com/p/phantompeakqualtools
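As a rough sketch of how these two numbers relate to a cross-correlation profile: the slide does not define them, so the formulas below are assumptions. They follow the usual reading of the ENCODE guidelines, where NSC is the fragment-length correlation divided by the minimum correlation, and RSC is the background-subtracted ratio of the fragment-length peak to the read-length peak. The profile values are made up for illustration.

```python
def nsc_rsc(profile, fragment_len, read_len):
    """Compute NSC and RSC from a {shift: correlation} profile.

    Assumes the minimum correlation is positive and that the read-length
    correlation is above the minimum (otherwise the ratios are undefined).
    """
    cc_min = min(profile.values())
    cc_frag = profile[fragment_len]
    cc_read = profile[read_len]
    nsc = cc_frag / cc_min
    rsc = (cc_frag - cc_min) / (cc_read - cc_min)
    return nsc, rsc


# Hypothetical cross-correlation values at a few shifts (bp).
profile = {0: 0.20, 36: 0.26, 100: 0.24, 150: 0.30, 300: 0.20}
nsc, rsc = nsc_rsc(profile, fragment_len=150, read_len=36)
passes = nsc > 1.05 and rsc > 0.8
print(round(nsc, 3), round(rsc, 3), passes)
```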
2) ChIP-seq Peak Calling
- Peak callers combine overlapping tags to get the “peak height”.
- Often, strand information and shifting is used to combine tags on opposite strands.
- Fold-enrichment (tag count / control tag count) is usually used as the criterion for declaring a peak.
Wilbanks and Facciotti, PLoS One, 2010
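A minimal sketch of the fold-enrichment criterion, assuming fixed windows with precomputed tag counts. The library-size scaling, the pseudocount and the threshold are illustrative assumptions rather than the behaviour of any specific peak caller, and the counts are toy values.

```python
def fold_enrichment(chip_count, control_count, chip_total, control_total, pseudocount=1.0):
    """ChIP tag count divided by the control tag count for one window.

    The control is scaled to the ChIP library size, and a pseudocount avoids
    division by zero; both choices are assumptions of this sketch.
    """
    scale = chip_total / control_total
    return (chip_count + pseudocount) / (control_count * scale + pseudocount)


def call_enriched_windows(chip_counts, control_counts, min_fold=3.0):
    """Return the indices of windows whose fold-enrichment reaches min_fold."""
    chip_total = sum(chip_counts)
    control_total = sum(control_counts)
    return [
        i
        for i, (c, k) in enumerate(zip(chip_counts, control_counts))
        if fold_enrichment(c, k, chip_total, control_total) >= min_fold
    ]


# Toy per-window tag counts (hypothetical).
chip = [4, 6, 80, 5, 5]
control = [20, 20, 20, 20, 20]
enriched = call_enriched_windows(chip, control)
print(enriched)
```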
Some ChIP-seq peak callers use SCCA
Bailey et al., PLoS Comp Bio, in press.
[Figure: peak callers, with those that use SCCA marked “Uses SCCA”.]
Sanity checks: Are your peaks reasonable?
- Width: TF ChIP-seq peaks should be relatively
short (< 300bp) compared to histone modification peaks.
– Are your peaks too wide?
- Number: Is the number of TF ChIP-seq peaks
reasonable?
– Some key TFs bind ~30,000 sites, but your TF probably binds far fewer (~1000?)
- Location: Do your peaks co-occur with histone
marks and genes that your TF regulates?
– Examine some peaks using the UCSC genome browser and ENCODE histone tracks
Quality control 2: Fraction of Reads in Peaks (FRiP)
- Only a fraction of
reads typically fall within ChIP-seq peaks.
- ENCODE guideline:
FRiP > 1%
- Caveat: A lower FRiP
threshold may be appropriate if there are very few peaks.
Landt S G et al. Genome Res. 2012;22:1813-1831
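FRiP is simple to compute once reads and peaks are available. The sketch below assumes a single chromosome, one position per read, and sorted, non-overlapping, half-open peak intervals; the function name and the toy data are hypothetical.

```python
import bisect


def frip(read_positions, peaks):
    """Fraction of reads whose position falls inside any peak.

    `peaks` is a sorted list of non-overlapping (start, end) intervals,
    half-open: start <= position < end.
    """
    starts = [start for start, _ in peaks]
    in_peaks = 0
    for pos in read_positions:
        # Last peak whose start is <= pos (if any).
        i = bisect.bisect_right(starts, pos) - 1
        if i >= 0 and pos < peaks[i][1]:
            in_peaks += 1
    return in_peaks / len(read_positions)


# Toy example (hypothetical coordinates).
peaks = [(100, 200), (500, 600)]
reads = [50, 120, 150, 199, 200, 550, 700, 900, 950, 1000]
score = frip(reads, peaks)
print(score, score > 0.01)
```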
How many of my peaks are “real”?
- Irreproducible Discovery Rate (IDR) compares the ranks of peaks from two biological replicates.
– Rank peaks by significance (p-value or q-value)
– Reproducible discoveries (peaks) should have similar ranks between replicates.
- ENCODE: reports peaks at 1% IDR
- https://sites.google.com/site/anshulkundaje/projects/idr
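The intuition that reproducible peaks have similar ranks in both replicates can be illustrated with a Spearman rank correlation. This is not the IDR statistic itself, which is a statistical model and is not reproduced here. The sketch only shows rank consistency on hypothetical significance scores for peaks already matched between replicates, and it assumes there are no tied scores.

```python
def ranks(scores):
    """Rank 1 is the most significant (highest) score; ties are not handled."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(scores_a, scores_b):
    """Spearman rank correlation for paired scores without ties."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((a - b) ** 2 for a, b in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical -log10(p-value) scores for six matched peaks.
rep1 = [50, 40, 30, 20, 10, 5]
rep2_consistent = [48, 41, 28, 22, 9, 6]
rep2_inconsistent = [5, 22, 9, 48, 41, 28]

rho_good = spearman(rep1, rep2_consistent)
rho_bad = spearman(rep1, rep2_inconsistent)
print(rho_good, round(rho_bad, 3))
```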
Quality control 3: IDR identifies failed ChIP-seq
Landt S G et al. Genome Res. 2012;22:1813-1831
[Figure: IDR plots for a replicate pair with high reproducibility and one with low reproducibility.]
3) Motif Discovery & Enrichment Analysis
- If your TF binds DNA directly (and
sequence-specifically), Motif Discovery should find its binding motif.
- The DNA-binding motif of your TF
should be centrally enriched in the peaks, and Central Motif Enrichment Analysis (CMEA) should find it.
Caveats in ChIP-seq Motif Analysis
- Peak regions may
contain other TF motifs due to looping.
- The binding of the
ChIP-ed factor “X” may be indirect.
- ChIP-ed motif might
be weak due to assisted binding.
Farnham, Nature Reviews Genetics, 2009
TF Binding Motif Discovery
- ChIP-seq provides extremely rich data for inferring the DNA-binding affinity of the ChIP-ed transcription factor.
- In principle, discovering the motif is simple:
- ChIP-seq peaks tend to be within +/- 50bp of the bound factor.
- So we just examine the peak regions for enriched patterns.
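As a toy illustration of “examine the peak regions for enriched patterns”, the sketch below counts fixed-length words (k-mers) in peak sequences and compares them with background sequences. It is a deliberately naive stand-in, not the algorithm used by MEME, DREME or any other tool; the sequences, the pseudocount and the scoring ratio are all assumptions.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Number of sequences that contain each k-mer at least once."""
    counts = Counter()
    for seq in seqs:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def enriched_kmers(peak_seqs, background_seqs, k, pseudocount=1.0):
    """Rank k-mers by the ratio of peak frequency to background frequency."""
    fg = kmer_counts(peak_seqs, k)
    bg = kmer_counts(background_seqs, k)
    n_fg, n_bg = len(peak_seqs), len(background_seqs)
    scores = {
        kmer: ((fg[kmer] + pseudocount) / (n_fg + pseudocount))
        / ((bg[kmer] + pseudocount) / (n_bg + pseudocount))
        for kmer in fg
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sequences: every "peak" contains the word TTGGCA.
peak_seqs = ["ACGTTGGCATAC", "GGATTGGCACCT", "CATTGGCAGTAA", "TTGGCATCGATC"]
background_seqs = ["ACGTACGTACGT", "GGATCCGATCCT", "CATAGCTAGTAA", "TTAGCATCGATC"]

ranked = enriched_kmers(peak_seqs, background_seqs, k=6)
print(ranked[0])
```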
MEME Suite tools for ChIP-seq motif discovery and enrichment
- The MEME Suite (http://meme.nbcr.net) contains
several motif discovery and enrichment algorithms appropriate for ChIP-seq data analysis.
– Discovery & Enrichment: MEME-ChIP
– Discovery: MEME, DREME, GLAM2
– Enrichment: CentriMo, AME
Example: Motif discovery in NFIC ChIP-seq data
- Pjanic et al. predicted 39,807 ChIP-seq peaks in NFIC ChIP-seq data.
- They do not report using motif discovery on these peaks.
- We used MEME-ChIP, which runs both MEME and DREME, to perform motif discovery on the 100-bp NFIC ChIP-seq peak regions.
Machanick & Bailey, Bioinformatics, 2011
Motif discovery fails in the (original) NFIC dataset
- An NFIC motif is known from in vitro data,
based on only 16 sites.
- MEME and DREME fail to find this motif in
the NFIC data.
- But so do the other algorithms we tried:
Amadeus, peak-motifs, Trawler and Weeder.
The problem: poor peak calling!
- We applied a different ChIP-seq peak calling algorithm (ChIP-peak) which predicts only 700 peaks (rather than 40,000).
- MEME discovers the NFI-family binding motif in this new set of peaks.
[Figure: sequence logo of motif MA0119.1 and its “site-probability” curve (x-axis: position of best site; y-axis: probability).]
Central Motif Enrichment Analysis: CentriMo
- CentriMo searches for known motifs whose sites are most centrally enriched in the ChIP-seq regions.
- Use 500bp regions centered on each ChIP-seq peak.
[Figure: 500-bp ChIP-seq regions with a central window; W = 120, L = 500; S = number of “successes” = 4; T = number of “trials” = 5.]
Bailey et al, NAR, 2012
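The numbers in the figure (W = 120, L = 500, S = 4, T = 5) suggest a binomial view of central enrichment. The sketch below is a simplification under stated assumptions, not CentriMo's actual computation. It assumes that under the null the best site is equally likely to fall anywhere in the region, so it lands in the central window with probability W/L. It ignores motif width and any correction for testing several window sizes.

```python
from math import comb


def central_enrichment_pvalue(successes, trials, window, region_len):
    """Binomial tail probability P(X >= successes) with success probability
    window / region_len (an assumed uniform null for the best-site position)."""
    p = window / region_len
    return sum(
        comb(trials, k) * p ** k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Values taken from the figure: 4 of 5 best sites fall in the central 120 bp
# of 500-bp regions.
pvalue = central_enrichment_pvalue(successes=4, trials=5, window=120, region_len=500)
print(pvalue)
```

With only five regions the evidence is modest; with thousands of peaks the same test can produce the very small p-values shown on the CentriMo plots.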
[Figure: central enrichment plot of probability versus position of best site in sequence (-250 to 250). Legend: MA0119.1 p=2.4e-031, w=295, n=5409; MA0244.1 p=4.6e-015, w=381, n=39398; MA0161.1 p=7.3e-015, w=329, n=39356; MA0099.1 p=5.5e-014, w=343, n=34267; MA0406.1 p=8.1e-012, w=323, n=31383.]
Central Motif Enrichment confirms the known NFIC motif—even in the original peaks
- The NFIC motif is the most centrally enriched of 862 JASPAR and UniPROBE motifs (p = 10^-31).
[Figure: sequence logo of motif MA0119.1, labelled NFIC.]
- However, standard motif enrichment algorithms do not show the NFIC motif as the most enriched motif.
Quality control 4: CMEA identifies failed ChIP-seq
- 1. Successful KLF1 ChIP-seq
[Figure: central enrichment plot of probability versus position of best site in sequence (-250 to 250). Legend: MA0039.2 p=4.4e-066, w=111, n=712; Klf7_primary p=6.9e-056, w=103, n=676; MA0140.1 p=1.5e-048, w=177, n=693; MA0035.2 p=2.4e-040, w=194, n=756. Sequence logo labelled KLF4.]
Tallack et al., Genome Res, 2010
- 2. Failed KLF1 ChIP-seq
[Figure: central enrichment plot of probability versus position of best site in sequence (-250 to 250). Legend: MA0039.2 p=7.2e-001, w=365, n=11404 (p = 0.7). Sequence logo of MA0039.2, labelled KLF4.]
Pilon et al., Blood, 2011
New motif databases
- In vitro motifs are especially useful for
verifying that your ChIP-seq worked.
- They are independent of the motifs found by motif discovery in your ChIP-seq data.
– UniPROBE: 386 mouse TF motifs from protein-binding microarrays.
– Jolma et al., Cell, 2013: 738 human and mouse TF motifs from SELEX
4) Location Analysis
- Counts how often TF binding sites are in, say,
promoters, intergenic or intragenic regions.
Farnham, Nature Reviews Genetics, 2009
Example: Predicting Target Genes
- TF binding sites in promoters probably are regulatory.
- “Nearest TSS” rule is often used to assign binding sites to target genes.
- But distal sites may regulate some other gene via chromatin looping.
Farnham, Nature Reviews Genetics, 2009
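The “nearest TSS” rule can be sketched as a sorted lookup. The example assumes a single chromosome and ignores gene strand; the gene names and coordinates are hypothetical. As the slide warns, the nearest gene is only a putative target.

```python
import bisect


def nearest_tss(peak_pos, tss_sorted):
    """Return (gene, signed distance) for the TSS closest to a peak.

    `tss_sorted` is a list of (position, gene) tuples sorted by position.
    The distance is peak position minus TSS position.
    """
    positions = [pos for pos, _ in tss_sorted]
    i = bisect.bisect_left(positions, peak_pos)
    # Only the TSSs immediately to the left and right can be the nearest.
    candidates = tss_sorted[max(i - 1, 0): i + 1]
    pos, gene = min(candidates, key=lambda t: abs(t[0] - peak_pos))
    return gene, peak_pos - pos


# Hypothetical TSS annotation and peak positions.
tss = [(1000, "geneA"), (5000, "geneB"), (20000, "geneC")]
assignments = [nearest_tss(p, tss) for p in (900, 3500, 19000)]
print(assignments)
```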
Klf1 binding near TSSs
- Histogram of distances from Klf1 ChIP-seq peak to the nearest TSS.
- KLF1 has a population of binding sites in promoters (small hump on left), but most are distal.
Tallack et al, Genome Res, 2010
Motif Spacing Analysis finds co-factor motifs and TF complexes
Part IV: Combining ChIP-seq and RNA-seq
Source: Steven Chu
Identification of KLF1 target genes using RNA-seq
- 3 x Klf1-/- libraries and 3 x Klf1+/+ libraries
- CuffDiff with RefSeq.gtf (gene definition set)
- 690 KLF1 “Activated” genes
- 118 KLF1 “Repressed” genes
- At Bonferroni corrected p-val < 0.05 and > 1.5 fold change (KO vs WT)
[Figure: mRNA-seq FPKM (0 to 1000) for E2f2 and E2f4, with qRT-PCR validation.]
Tallack et al, Genome Res, 2012
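The thresholds on this slide can be applied to a table of per-gene results as sketched below. This is not CuffDiff; it only shows the filtering step. It assumes that “Activated” means expression is lower in the knockout than in the wild type (and “Repressed” the reverse), which is an interpretation of the slide. The gene names, FPKM values and p-values are hypothetical.

```python
def classify_genes(results, alpha=0.05, min_fold=1.5):
    """Split genes into 'activated' and 'repressed' lists.

    `results` holds (gene, wt_fpkm, ko_fpkm, p_value) tuples. P-values are
    Bonferroni corrected by multiplying by the number of genes tested.
    """
    n_tests = len(results)
    activated, repressed = [], []
    for gene, wt, ko, p_value in results:
        if min(1.0, p_value * n_tests) >= alpha:
            continue  # not significant after correction
        if wt > ko * min_fold:
            activated.append(gene)  # expression drops without the factor
        elif ko > wt * min_fold:
            repressed.append(gene)  # expression rises without the factor
    return activated, repressed


# Hypothetical results for four genes.
results = [
    ("g1", 100.0, 20.0, 1e-6),
    ("g2", 10.0, 40.0, 1e-5),
    ("g3", 50.0, 45.0, 1e-8),
    ("g4", 80.0, 10.0, 0.02),
]
activated, repressed = classify_genes(results)
print(activated, repressed)
```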
The KLF1 Transcriptome
Tallack et al, Genome Res, 2012
KLF1 is a (direct) Activator
The distance from KLF1 ChIP-seq peaks to the nearest TSS (putative target gene) is less for “Activated” genes than for “Repressed” genes.
Tallack et al, Genome Res, 2012
Final reminders
- Check your data at each step!
– Read mapping
- Strand Cross Correlation Analysis (SCCA)
– Peak calling
- Fraction of Reads in Peaks (FRiP)
- Irreproducible Discovery Rate (IDR) analysis
– Motif discovery / enrichment analysis
- De novo motif found?
- In vitro motif centrally enriched?
Acknowledgements
The MEME Suite
- Tom Whitington
- Philip Machanick
- James Johnson
- Martin Frith
- William Noble
- Charles Grant
- Shobhit Gupta
KLF Project
- Michael Tallack
- Tom Whitington
- Andrew Perkins
- Sean Grimmond
- Brooke Gardiner
- Ehsan Nourbakhsh
- Nicole Cloonan
- Elanor Wainwright
- Janelle Keys
- Wai Shan Yuen