NGS Sequence Analysis for Regulation and Epigenomics, Timothy Bailey



slide-1
SLIDE 1

NGS Sequence Analysis for Regulation and Epigenomics

Timothy Bailey Winter School in Mathematical and Computational Biology July 2, 2013

slide-2
SLIDE 2

NGS Analysis and Transcriptional Regulation

  • RNA-seq
    – Measuring transcription levels (gene expression)
    – Detecting RNA regulators (e.g., miRNA)
  • ChIP-seq
    – Chromatin modifications
    – Binding of transcription factor proteins

slide-3
SLIDE 3

Talk Overview

  • I. Transcriptional Regulation 101
  • II. ChIP-seq 101
  • III. Analyzing ChIP-seq data
  • IV. Combining ChIP-seq and RNA-seq
slide-4
SLIDE 4

Part I: Basic Transcriptional Regulation

Source: Steven Chu

slide-5
SLIDE 5

Transcription Factors

  • Mammalian transcription is controlled (in part) by about 1400 DNA-binding transcription factor (TF) proteins.
  • These proteins control transcription in two main ways:
    – Directly, by promoting (or preventing) the assembly of the pre-initiation complex.
    – Indirectly, by modifying chromatin.

slide-6
SLIDE 6

BASAL TRANSCRIPTION:

  • The pre-initiation complex assembles at the core promoter.
  • This results in only low levels of transcription because the interaction is unstable.

[Diagram: DNA, core promoter (TATA, INR)]

slide-7
SLIDE 7

PROXIMAL PROMOTER:

  • The proximal promoter extends upstream of the promoter.
  • It contains binding sites for repressor and activator transcription factors.

[Diagram: DNA, proximal promoter, TATA, INR]

slide-8
SLIDE 8

ACTIVATORS:

  • Some transcription factors (“activators”) stabilize the transcriptional machinery when they bind to sites in the proximal promoter.
  • This increases transcription.

[Diagram: DNA, proximal promoter, TATA, INR]

slide-9
SLIDE 9

REPRESSORS:

  • Some factors do not stabilize the transcriptional machinery.
  • Their binding can block binding by co-factors and activators.
  • This reduces transcription.

[Diagram: DNA, proximal promoter, TATA, INR]

slide-10
SLIDE 10

ENHANCER REGIONS:

  • Groups of binding sites located upstream or downstream of a promoter.
  • Often very distant: 1000s of base pairs.

[Diagram: DNA, enhancer region located 1-100 Kb from the proximal promoter (TATA, INR)]
slide-11
SLIDE 11

ENHANCER REGIONS:

  • Activator and repressor transcription factors compete to occupy enhancer regions.
  • DNA looping brings factors into contact with transcriptional machinery.
  • Bound activators increase transcription.

[Diagram: DNA, enhancer region, proximal promoter, TATA, INR]

slide-12
SLIDE 12

Chromatin modification by TFs:

  • Example: Histone Acetyltransferases (HATs) acetylate histones.
  • Tissue-specific transcription factors can bind to HATs, causing chromatin to open.
  • This can increase transcription.

[Diagram: DNA, enhancer region, proximal promoter (TATA, INR); factors labeled “Specific” and “General”; HAT]

slide-13
SLIDE 13

Part II: ChIP-seq Overview

Source: Steven Chu

slide-14
SLIDE 14

ChIP-seq

  • Chromatin ImmunoPrecipitation followed by high-throughput sequencing.
  • TF binding sites (“punctate peaks”)
  • Chromatin mods (“broad peaks”)

slide-15
SLIDE 15

Steps in ChIP-seq

  • Cross-link proteins to DNA
  • Fragment chromatin
  • Immunoprecipitate with antibody to protein
  • Size-select and ligate
  • Amplify
  • Sequence

slide-16
SLIDE 16

What can I learn from ChIP-seq?

  • What chromatin regions are marked as active promoters or enhancers?
  • Where is my TF bound?
  • What is its DNA-binding motif?
  • What genes might it regulate?

slide-17
SLIDE 17

Part III: Analyzing ChIP-seq Data

Source: Steven Chu

slide-18
SLIDE 18

Analyzing TF ChIP-seq Data

  • Key messages of this talk:
    – Use controls!
    – Validate your data at each step.
    – But this is Science! What could possibly go wrong…?

slide-19
SLIDE 19

Things that can go wrong in ChIP-seq…

  • 1. Low affinity antibody
  • 2. Non-specific antibody
  • 3. Contamination
  • 4. Poor choice of peak calling algorithm (or parameters)

… etc.

slide-20
SLIDE 20

Steps in ChIP-seq Data Analysis

  • 1. Mapping: where do the sequence “tags” map to the genome?
  • 2. Peak Calling: where are the regions of significant tag concentration?
  • 3. Motif Discovery: what is the binding motif?
  • 4. Location Analysis: where are the peaks with respect to genes, promoters, introns, etc.?

slide-21
SLIDE 21

1) Mapping ChIP-seq Tags

  • Tags: ChIP-seq produces a pool of “tags” (~100bp)
  • Tag Count: measure of enrichment of region
  • Negative Control: “input DNA” tag count

Tallack et al., Genome Res., 2010

slide-22
SLIDE 22

Do the mapped tags make sense?

  • Each ~100 bp tag is the 5’ end of a DNA fragment.
  • But DNA is double-stranded so there are tags from both strands.
  • We expect pairs of clusters of tags on opposite strands, separated by the fragment length.

Wilbanks and Facciotti, PLoS One, 2010

slide-23
SLIDE 23

Strand Cross Correlation Analysis (SCCA)

  • If we shift the anti-sense tags left by the (average) fragment length, we should see maximum correlation between the reads on the two strands.

Kharchenko et al., Nature Biotechnology, 2009
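
The shifting idea can be sketched in a few lines of Python. This is a toy illustration, not the code of any published tool: the function names, the single-chromosome position lists, and the use of a plain Pearson correlation on per-base tag-start counts are all assumptions made for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def strand_cross_correlation(plus_starts, minus_starts, genome_len, max_shift):
    """Return {shift: correlation} between sense tag-start counts and
    anti-sense tag-start counts shifted left by `shift` bp."""
    plus = [0] * genome_len
    minus = [0] * genome_len
    for p in plus_starts:
        plus[p] += 1
    for p in minus_starts:
        minus[p] += 1
    profile = {}
    for shift in range(max_shift + 1):
        # Shifting anti-sense tags left by `shift` pairs minus[i + shift] with plus[i].
        profile[shift] = pearson(plus[:genome_len - shift], minus[shift:])
    return profile

# Toy data (made up): every anti-sense tag sits 150 bp downstream of a sense tag.
toy_plus = [100, 300, 500, 700]
toy_minus = [250, 450, 650, 850]
toy_profile = strand_cross_correlation(toy_plus, toy_minus, genome_len=1000, max_shift=300)
estimated_fragment_length = max(toy_profile, key=toy_profile.get)
```

On real data the profile would be computed per chromosome from mapped reads, and the shift with the highest correlation is taken as the estimate of the average fragment length.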

slide-24
SLIDE 24

SCCA often shows two maxima

  • Fragment-length peak at average fragment length (as we expected)
  • Read-length peak at average read length (due to variable and dispersed mappability of genomic positions)

[Figure: cross-correlation plot with the read-length peak and the fragment-length peak labeled]

Landt S G et al. Genome Res. 2012;22:1813-1831

slide-25
SLIDE 25

Quality control 1: SCCA identifies failed ChIP-seq

ENCODE Guidelines:

  • Normalized Strand Correlation, NSC > 1.05
  • Relative Strand Correlation, RSC > 0.8
  • https://code.google.com/p/phantompeakqualtools

Landt S G et al. Genome Res. 2012;22:1813-1831
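
As a rough sketch of how these two numbers are derived from a cross-correlation profile: the slide gives only the thresholds, so the formulas below follow the definitions in Landt et al. (NSC is the fragment-length correlation divided by the minimum correlation; RSC is the fragment-length correlation minus the minimum, divided by the read-length correlation minus the minimum). The function names and the toy profile are assumptions for illustration only.

```python
def nsc_rsc(profile, fragment_shift, read_shift):
    """Compute (NSC, RSC) from a {shift: cross-correlation} profile.

    fragment_shift: shift at the fragment-length peak
    read_shift:     shift at the read-length ("phantom") peak
    """
    cc_min = min(profile.values())
    cc_frag = profile[fragment_shift]
    cc_read = profile[read_shift]
    if cc_min <= 0 or cc_read == cc_min:
        raise ValueError("profile is degenerate; NSC/RSC are undefined")
    nsc = cc_frag / cc_min
    rsc = (cc_frag - cc_min) / (cc_read - cc_min)
    return nsc, rsc

def passes_encode_guidelines(nsc, rsc, min_nsc=1.05, min_rsc=0.8):
    """Apply the thresholds quoted on the slide."""
    return nsc > min_nsc and rsc > min_rsc

# Toy profile (made-up values): read length 36 bp, fragment length 150 bp.
toy_profile = {0: 0.10, 36: 0.12, 150: 0.15, 300: 0.10}
toy_nsc, toy_rsc = nsc_rsc(toy_profile, fragment_shift=150, read_shift=36)
```

A failed experiment typically shows a fragment-length peak that is barely above the background minimum (low NSC) and lower than the read-length peak (low RSC).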

slide-26
SLIDE 26

2) ChIP-seq Peak Calling

  • Peak callers combine overlapping tags to get the “peak height”.
  • Often, strand information and shifting is used to combine tags on opposite strands.
  • Fold-enrichment (tag count / control tag count) is usually used as the criterion for declaring a peak.

Wilbanks and Facciotti, PLoS One, 2010
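
A minimal sketch of the fold-enrichment criterion follows. The slide defines it only as tag count divided by control tag count; the library-size scaling, the pseudocount, the window representation, and the threshold are assumptions added here so that the toy example behaves sensibly, and do not describe any particular peak caller.

```python
def fold_enrichment(chip_count, control_count, chip_total, control_total, pseudocount=1.0):
    """ChIP tag count over control tag count for one window.

    The control count is rescaled to the ChIP library size, and a pseudocount
    avoids division by zero where the control has no tags (both are assumptions).
    """
    scale = chip_total / control_total
    return (chip_count + pseudocount) / (control_count * scale + pseudocount)

def call_enriched_windows(windows, chip_total, control_total, min_fold=5.0):
    """windows: list of (name, chip_count, control_count).
    Returns the names of windows whose fold enrichment is at least min_fold."""
    return [
        name
        for name, chip, ctrl in windows
        if fold_enrichment(chip, ctrl, chip_total, control_total) >= min_fold
    ]

# Toy windows (made-up counts), equal library sizes.
toy_windows = [("w1", 99, 9), ("w2", 12, 10), ("w3", 40, 0)]
toy_peaks = call_enriched_windows(toy_windows, chip_total=1000, control_total=1000)
```

Real peak callers generally turn such counts into a significance score rather than thresholding the raw ratio.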

slide-27
SLIDE 27

Some ChIP-seq peak callers use SCCA

[Figure: ChIP-seq peak callers, two of them labeled “Uses SCCA”]

Bailey et al., PLoS Comp Bio, in press.

slide-28
SLIDE 28

Sanity checks: Are your peaks reasonable?

  • Width: TF ChIP-seq peaks should be relatively short (< 300bp) compared to histone modification peaks.
    – Are your peaks too wide?
  • Number: Is the number of TF ChIP-seq peaks reasonable?
    – Some key TFs bind ~30,000 sites, but your TF probably binds far fewer (~1000?)
  • Location: Do your peaks co-occur with histone marks and genes that your TF regulates?
    – Examine some peaks using the UCSC genome browser and ENCODE histone tracks

slide-29
SLIDE 29

Quality control 2: Fraction of Reads in Peaks (FRiP)

  • Only a fraction of reads typically fall within ChIP-seq peaks.
  • ENCODE guideline: FRiP > 1%
  • Caveat: A lower FRiP threshold may be appropriate if there are very few peaks.

Landt S G et al. Genome Res. 2012;22:1813-1831
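
FRiP is simple enough to compute directly. The sketch below assumes a single chromosome, one representative position per read, and sorted, non-overlapping, half-open peak intervals; these simplifications and the function name are assumptions for the example.

```python
from bisect import bisect_right

def frip(read_positions, peaks):
    """Fraction of reads whose position falls inside any peak.

    peaks: sorted, non-overlapping (start, end) intervals, half-open [start, end).
    """
    starts = [start for start, _ in peaks]
    in_peaks = 0
    for pos in read_positions:
        i = bisect_right(starts, pos) - 1  # last peak starting at or before pos
        if i >= 0 and pos < peaks[i][1]:
            in_peaks += 1
    return in_peaks / len(read_positions)

# Toy data (made up): 3 of 10 reads fall inside the two peaks.
toy_peaks = [(100, 200), (500, 600)]
toy_reads = [10, 20, 30, 50, 150, 199, 200, 550, 900, 950]
toy_frip = frip(toy_reads, toy_peaks)
meets_guideline = toy_frip > 0.01  # ENCODE guideline quoted above
```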

slide-30
SLIDE 30

How many of my peaks are “real”?

  • Irreproducible Discovery Rate (IDR) compares the ranks of peaks from two biological replicates.
    – Rank peaks by significance (p-value or q-value)
    – Reproducible discoveries (peaks) should have similar ranks between replicates.
  • ENCODE: reports peaks at 1% IDR
  • https://sites.google.com/site/anshulkundaje/projects/idr
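
The rank-comparison idea can be illustrated with a Spearman rank correlation between replicate scores. To be clear, this is not the IDR statistic itself (IDR fits a statistical model to the paired ranks and reports a rate per peak); it is only a sketch of "reproducible peaks have similar ranks", assuming the peaks have already been matched between replicates and have no tied scores.

```python
def ranks(scores):
    """Rank 1 = most significant (highest score); assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(scores_a, scores_b):
    """Spearman rank correlation for matched peaks in two replicates (no ties)."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((a - b) ** 2 for a, b in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy scores (made up), e.g. -log10 p-values of four matched peaks.
consistent = spearman([10, 8, 6, 4], [9, 7, 5, 3])    # same ordering in both replicates
inconsistent = spearman([10, 8, 6, 4], [3, 5, 7, 9])  # reversed ordering
```

For actual analyses, the IDR software linked above should be used rather than a plain rank correlation.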

slide-31
SLIDE 31

Quality control 3: IDR identifies failed ChIP-seq

[Figure panels: High Reproducibility, Low Reproducibility]

Landt S G et al. Genome Res. 2012;22:1813-1831

slide-32
SLIDE 32

3) Motif Discovery & Enrichment Analysis

  • If your TF binds DNA directly (and sequence-specifically), Motif Discovery should find its binding motif.
  • The DNA-binding motif of your TF should be centrally enriched in the peaks, and Central Motif Enrichment Analysis (CMEA) should find it.

slide-33
SLIDE 33

Caveats in ChIP-seq Motif Analysis

  • Peak regions may contain other TF motifs due to looping.
  • The binding of the ChIP-ed factor “X” may be indirect.
  • ChIP-ed motif might be weak due to assisted binding.

Farnham, Nature Reviews Genetics, 2009

slide-34
SLIDE 34

TF Binding Motif Discovery

  • ChIP-seq provides extremely rich data for inferring the DNA-binding affinity of the ChIP-ed transcription factor.
  • In principle, discovering the motif is simple.
  • ChIP-seq peaks tend to be within +/- 50bp of the bound factor.
  • So we just examine the peak regions for enriched patterns.
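
To make "examine the peak regions for enriched patterns" concrete, here is a deliberately naive sketch: cut a window of +/- 50 bp around each peak summit and count how many windows contain each k-mer. Real motif discovery tools use statistical models and background comparisons; the function names, the string-based genome, and the k-mer counting are assumptions for illustration only.

```python
from collections import Counter

def peak_windows(genome, summits, flank=50):
    """Return the sequence within +/- flank bp of each summit position."""
    return [genome[max(0, s - flank): s + flank] for s in summits]

def kmer_counts(seqs, k):
    """Count, for each k-mer, the number of sequences that contain it at least once."""
    counts = Counter()
    for seq in seqs:
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    return counts

# Toy sequences (made up): two of three contain TGGC.
toy_seqs = ["AATGGCAA", "CCTGGCCC", "GGGGGGGG"]
toy_counts = kmer_counts(toy_seqs, k=4)
most_common_kmer, n_seqs = toy_counts.most_common(1)[0]
```

A k-mer that appears in many peak windows is only a candidate; it still has to be compared against what is expected in background sequence.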

slide-35
SLIDE 35

MEME Suite tools for ChIP-seq motif discovery and enrichment

  • The MEME Suite (http://meme.nbcr.net) contains several motif discovery and enrichment algorithms appropriate for ChIP-seq data analysis.
    – Discovery & Enrichment: MEME-ChIP
    – Discovery: MEME, DREME, GLAM2
    – Enrichment: CentriMo, AME

slide-36
SLIDE 36

Example: Motif discovery in NFIC ChIP-seq data

  • Pjanic et al. predicted 39,807 ChIP-seq peaks in NFIC ChIP-seq data.
  • They do not report using motif discovery on these peaks.
  • We used MEME-ChIP, which runs both MEME and DREME, to perform motif discovery on the 100-bp NFIC ChIP-seq peak regions.

Machanick & Bailey, Bioinformatics, 2011

slide-37
SLIDE 37

Motif discovery fails in the (original) NFIC dataset

  • An NFIC motif is known from in vitro data, based on only 16 sites.
  • MEME and DREME fail to find this motif in the NFIC data.
  • But so do the other algorithms we tried: Amadeus, peak-motifs, Trawler and Weeder.

slide-38
SLIDE 38

The problem: poor peak calling!

  • We applied a different ChIP-seq peak calling algorithm (ChIP-peak) which predicts only 700 peaks (rather than 40,000).
  • MEME discovers the NFI-family binding motif in this new set of peaks.
slide-39
SLIDE 39

Central Motif Enrichment Analysis: CentriMo

  • CentriMo searches for known motifs whose sites are most centrally enriched in the ChIP-seq regions.
  • Use 500bp regions centered on each ChIP-seq peak.

[Figure: “site-probability” curve (probability vs. position of best site) for motif MA0119.1, with its sequence logo; schematic of 500-bp ChIP-seq regions with L = 500, W = 120, S = number of “successes” = 4, T = number of “trials” = 5]

Bailey et al, NAR, 2012
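
The numbers on this slide (L = 500, W = 120, S = 4, T = 5) suggest a binomial test, which can be sketched as follows. The null model used here, that the best site of each region is equally likely anywhere so it lands in the central window with probability W/L, is a simplifying assumption for the example: it ignores the motif width and the search over window sizes that a real tool would have to account for.

```python
from math import comb

def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p ** k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

def central_enrichment_pvalue(successes, trials, window, region_len):
    """Probability of seeing at least `successes` best sites inside the central
    window by chance, assuming each best site is uniformly placed."""
    p_null = window / region_len
    return binomial_tail(successes, trials, p_null)

# Values from the slide: 4 of 5 regions have their best site in the central 120 bp of 500 bp.
p_value = central_enrichment_pvalue(successes=4, trials=5, window=120, region_len=500)
```

With only five regions the p-value is modest (about 0.013); with thousands of ChIP-seq regions the same central bias gives very small p-values.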

slide-40
SLIDE 40

Central Motif Enrichment confirms the known NFIC motif, even in the original peaks

  • NFIC motif is most centrally enriched of 862 JASPAR and UniPROBE motifs (p = 10^-31).
  • However, standard motif enrichment algorithms do not show the NFIC motif as the most enriched motif.

[Figure: site-probability curves (probability vs. position of best site in sequence, -250 to 250) with the NFIC (MA0119.1) sequence logo. Legend:
  MA0119.1 p=2.4e-031, w=295, n=5409
  MA0244.1 p=4.6e-015, w=381, n=39398
  MA0161.1 p=7.3e-015, w=329, n=39356
  MA0099.1 p=5.5e-014, w=343, n=34267
  MA0406.1 p=8.1e-012, w=323, n=31383]

slide-41
SLIDE 41

Quality control 4: CMEA identifies failed ChIP-seq

  • 1. Successful KLF1 ChIP-seq (Tallack et al., Genome Res, 2010)

[Figure: site-probability curves (probability vs. position of best site in sequence, -250 to 250) with the KLF4 sequence logo. Legend:
  MA0039.2 p=4.4e-066, w=111, n=712
  Klf7_primary p=6.9e-056, w=103, n=676
  MA0140.1 p=1.5e-048, w=177, n=693
  MA0035.2 p=2.4e-040, w=194, n=756]

  • 2. Failed KLF1 ChIP-seq (Pilon et al., Blood, 2011)

[Figure: site-probability curve (probability vs. position of best site in sequence, -250 to 250) with the KLF4 (MA0039.2) sequence logo. Legend:
  MA0039.2 p=7.2e-001, w=365, n=11404 (p = 0.7)]

slide-42
SLIDE 42

New motif databases

  • In vitro motifs are especially useful for verifying that your ChIP-seq worked.
  • They are independent of the motifs found by motif discovery in your ChIP-seq data.
    – UniPROBE: 386 mouse TF motifs from protein-binding microarrays.
    – Jolma et al., Cell, 2013: 738 human and mouse TF motifs from SELEX

slide-43
SLIDE 43

4) Location Analysis

  • Counts how often TF binding sites are in, say, promoters, intergenic or intragenic regions.

Farnham, Nature Reviews Genetics, 2009
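
The counting described here might look like the sketch below. The gene tuples, the 1000-bp promoter definition, and the rule that a promoter hit takes priority over an intragenic hit are all assumptions chosen for the example; real annotations and category definitions vary.

```python
from collections import Counter

def classify_peak(pos, genes, promoter_len=1000):
    """Label a peak position as 'promoter', 'intragenic' or 'intergenic'.

    genes: list of (start, end, strand), half-open [start, end), one chromosome.
    The promoter is taken as the promoter_len bp upstream of the TSS (assumption).
    """
    label = "intergenic"
    for start, end, strand in genes:
        if strand == "+":
            in_promoter = start - promoter_len <= pos < start
        else:
            in_promoter = end <= pos < end + promoter_len
        if in_promoter:
            return "promoter"  # promoter takes priority over any other label
        if start <= pos < end:
            label = "intragenic"
    return label

def location_summary(peak_positions, genes, promoter_len=1000):
    """Count peaks per location category."""
    return Counter(classify_peak(p, genes, promoter_len) for p in peak_positions)

# Toy annotation and peaks (made up).
toy_genes = [(5000, 8000, "+"), (20000, 23000, "-")]
toy_summary = location_summary([4500, 6000, 23500, 12000], toy_genes)
```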

slide-44
SLIDE 44

Example: Predicting Target Genes

  • TF binding sites in promoters probably are regulatory.
  • “Nearest TSS” rule is often used to assign binding sites to target genes.
  • But distal sites may regulate some other gene via chromatin looping.

Farnham, Nature Reviews Genetics, 2009
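
A minimal sketch of the “Nearest TSS” rule, assuming a single chromosome and a TSS list sorted by position; the function name, the tuple layout, and the gene names are hypothetical, and strand is ignored for simplicity.

```python
from bisect import bisect_left

def nearest_tss(peak_pos, tss_list):
    """Return (gene, signed_distance) for the TSS closest to peak_pos.

    tss_list: list of (position, gene) sorted by position, one chromosome.
    signed_distance = peak_pos - tss_position (strand is ignored here).
    """
    positions = [pos for pos, _ in tss_list]
    i = bisect_left(positions, peak_pos)
    # Only the TSS just before and just after the peak can be the nearest one.
    candidates = tss_list[max(0, i - 1): i + 1]
    pos, gene = min(candidates, key=lambda t: abs(t[0] - peak_pos))
    return gene, peak_pos - pos

# Toy annotation (made up).
toy_tss = [(1000, "geneA"), (5000, "geneB"), (9000, "geneC")]
assignments = {peak: nearest_tss(peak, toy_tss) for peak in (1200, 8000)}
```

As the last bullet above warns, the assigned gene is only a putative target: a distal site may act on a different gene through looping.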

slide-45
SLIDE 45

Klf1 binding near TSSs

  • Histogram of distances from Klf1 ChIP-seq peak to the nearest TSS.
  • KLF1 has a population of binding sites in promoters (small hump on left), but most are distal.

Tallack et al, Genome Res, 2010

slide-46
SLIDE 46

Motif Spacing Analysis finds co-factor motifs and TF complexes

slide-47
SLIDE 47

Part IV: Combining ChIP-seq and RNA-seq

Source: Steven Chu

slide-48
SLIDE 48

Identification of KLF1 target genes using RNA-seq

[Workflow: 3 x Klf1-/- libraries and 3 x Klf1+/+ libraries, analyzed with CuffDiff using RefSeq.gtf (gene definition set), giving 690 KLF1 “Activated” genes and 118 KLF1 “Repressed” genes]

At Bonferroni corrected p-val < 0.05 and > 1.5 fold change (KO vs WT)

[Figure: mRNA-seq FPKM for E2f2 and E2f4, with qRT-PCR validation]

Tallack et al, Genome Res, 2012
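
The two thresholds above can be applied as in the sketch below. This is not CuffDiff and says nothing about how its p-values are computed; it only illustrates the filtering step, assuming per-gene expression values and raw p-values are already available. The gene names and numbers are made up, and "activated" is taken to mean expression falls in the knockout.

```python
def call_targets(genes, alpha=0.05, min_fold=1.5):
    """Split genes into (activated, repressed) lists.

    genes: list of (name, wt_fpkm, ko_fpkm, raw_p_value).
    A gene passes if its Bonferroni-corrected p-value is below alpha and its
    expression changes by more than min_fold between KO and WT.
    """
    n_tests = len(genes)
    activated, repressed = [], []
    for name, wt, ko, p in genes:
        p_corrected = min(1.0, p * n_tests)  # Bonferroni correction
        if p_corrected >= alpha:
            continue
        if wt > min_fold * ko:
            activated.append(name)   # lower in knockout: the TF activates it
        elif ko > min_fold * wt:
            repressed.append(name)   # higher in knockout: the TF represses it
    return activated, repressed

# Toy table (made-up values).
toy_genes = [
    ("geneA", 800.0, 100.0, 1e-6),
    ("geneB", 50.0, 200.0, 1e-5),
    ("geneC", 100.0, 110.0, 1e-8),  # significant but small change
    ("geneD", 500.0, 100.0, 0.02),  # large change but fails after correction
]
toy_activated, toy_repressed = call_targets(toy_genes)
```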

slide-49
SLIDE 49

The KLF1 Transcriptome

Tallack et al, Genome Res, 2012

slide-50
SLIDE 50

KLF1 is a (direct) Activator

The distance from KLF1 ChIP-seq peaks to the nearest TSS (putative target gene) is less for “Activated” genes than for “Repressed” genes.

Tallack et al, Genome Res, 2012

slide-51
SLIDE 51

Final reminders

  • Check your data at each step!
    – Read mapping
        • Strand Cross Correlation Analysis (SCCA)
    – Peak calling
        • Fraction of Reads in Peaks (FRiP)
        • Irreproducible Discovery Rate (IDR) analysis
    – Motif discovery / enrichment analysis
        • De novo motif found?
        • In vitro motif centrally enriched?
slide-52
SLIDE 52

Acknowledgements

The MEME Suite

  • Tom Whitington
  • Philip Machanick
  • James Johnson
  • Martin Frith
  • William Noble
  • Charles Grant
  • Shobhit Gupta

KLF Project

  • Michael Tallack
  • Tom Whitington
  • Andrew Perkins
  • Sean Grimmond
  • Brooke Gardiner
  • Ehsan Nourbakhsh
  • Nicole Cloonan
  • Elanor Wainwright
  • Janelle Keys
  • Wai Shan Yuen