
SLIDE 1

NGS Sequence Analysis for Regulation and Epigenomics

Timothy Bailey
Winter School in Mathematical and Computational Biology
July 3, 2012

SLIDE 2

NGS Analysis and Transcriptional Regulation

RNA-seq

– Measuring transcription levels (gene expression)
– Detecting RNA regulators (e.g., miRNA)

ChIP-seq (and ChIP-exo)

– Chromatin modifications
– Binding of transcription factor proteins

SLIDE 3

Talk Overview

I. Basic Transcriptional Regulation
II. ChIP-seq and ChIP-exo
III. Analyzing ChIP-seq & ChIP-exo data

a) Mapping
b) Peak calling
c) Motif discovery & Enrichment Analysis
d) Location analysis

SLIDE 4

Part I: Basic Transcriptional Regulation

Source: Steven Chu

SLIDE 5

Transcription Factors

Mammalian transcription is controlled (in part) by about 1400 transcription factor (TF) proteins.

These proteins control transcription in two main ways:

– Directly, by promoting (or preventing) the assembly of the pre-initiation complex.
– Indirectly, by modifying the chromatin.

SLIDE 6

BASAL TRANSCRIPTION:

The pre-initiation complex assembles at the core promoter.

This results in only low levels of transcription because the interaction is unstable.

[Figure: DNA, Core Promoter, TATA, INR]

SLIDE 7

PROXIMAL PROMOTER:

The proximal promoter extends upstream of the promoter.

It contains binding sites for repressor and activator transcription factors.

[Figure: DNA, Proximal Promoter, TATA, INR]

SLIDE 8

ACTIVATORS:

Some transcription factors (“activators”) bind to sites in the proximal promoter.

This stabilizes the transcriptional machinery.

This increases transcription.

[Figure: DNA, Proximal Promoter, TATA, INR]

SLIDE 9

REPRESSORS:

Some factors do not stabilize the transcriptional machinery.

Their binding can block binding by co-factors and activators.

This reduces transcription.

[Figure: DNA, Proximal Promoter, TATA, INR]

SLIDE 10

ENHANCER REGIONS:

Groups of binding sites located upstream or downstream of a promoter.

Often very distant: 1000s of base pairs.

[Figure: DNA, Enhancer Region, Proximal Promoter, TATA, INR; distance between enhancer and promoter: 1-100 Kb]

SLIDE 11

ENHANCER REGIONS:

Both activator and repressor transcription factors can occupy enhancer regions.

DNA looping brings factors into contact with transcriptional machinery.

Bound activators increase transcription.

[Figure: DNA, Enhancer Region, Proximal Promoter, TATA, INR]

SLIDE 12

Chromatin modification by TFs:

Histone Acetyltransferases (HATs) acetylate histones.

Tissue-specific transcription factors can bind to HATs, causing chromatin to open.

This can increase transcription.

[Figure: DNA, Enhancer Region, Proximal Promoter, TATA, INR; specific and general transcription factors; HAT]

SLIDE 13

Part II: ChIP-seq & ChIP-exo

Source: Steven Chu

SLIDE 14

ChIP-seq

SLIDE 15

ChIP-Exo

Rhee and Pugh, Cell, 2011.

SLIDE 16

ChIP-seq & ChIP-exo

Rhee and Pugh, Cell, 2011.

SLIDE 17

Part III: Analyzing ChIP-seq Data

Source: Steven Chu

SLIDE 18

Analyzing TF ChIP-seq Data

Key messages of this talk:

– Use controls!
– Validate your data at each step.
– But this is Science! What could possibly go wrong…?

SLIDE 19

Things that can go wrong in ChIP-seq…

1. Low affinity antibody
2. Non-specific antibody
3. Contamination
4. Poor choice of peak calling algorithm (or parameters)

… etc.

SLIDE 20

Steps in ChIP-seq Data Analysis

1. Mapping: where do the sequence “tags” map to the genome?
2. Peak Calling: where are the regions of significant tag concentration?
3. Motif Discovery: what is the binding motif?
4. Location Analysis: where are the peaks with respect to genes, promoters, introns, etc.?

SLIDE 21

1) Mapping ChIP-seq Tags

Tags: ChIP-seq produces a pool of “tags” (~100bp)

Tag Count: measure of enrichment of region
Negative Control: “input DNA” tag count

Tallack et al., Genome Res., 2010

SLIDE 22

2) ChIP-seq Peak Calling

ChIP-seq produces a pool of “tags”.

Tags are currently about 100 bp long.

A tag is the 5’ end of a DNA fragment.

But DNA is double-stranded so…

Wilbanks and Facciotti, PLoS One, 2010

SLIDE 23

ChIP-seq Peak Calling

Peak callers combine overlapping tags to get the “peak height”.

Sometimes strand information is used to combine tags on opposite strands.

Fold-enrichment (tag count / control tag count) is usually used as the criterion for declaring a peak.

SLIDE 24

…ChIP-seq Peak Callers

Wilbanks and Facciotti, PLoS One, 2010

SLIDE 25

Sanity check: are your peaks reasonable?

Width: TF ChIP-seq peaks should be relatively short (< 300bp) compared to histone modification peaks.

– Are your peaks too wide?

Number: Is the number of TF ChIP-seq peaks reasonable?

– Some key TFs bind ~30,000 sites, but your TF probably binds far fewer (~1000?)

Location: Do your peaks co-occur with histone marks and genes your TF regulates?

The next analysis steps will help you answer these questions!
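
The width and number checks are easy to automate. The sketch below assumes peaks are available as `(chrom, start, end)` tuples with half-open coordinates; the function name `summarize_peaks` and the exact report fields are hypothetical, and the 300 bp cutoff simply mirrors the rule of thumb above.

```python
from statistics import median


def summarize_peaks(peaks, max_tf_width=300):
    """Summarize peak count and widths.

    peaks: list of (chrom, start, end) with end > start (assumed format).
    max_tf_width: widths at or above this are flagged as too wide for a TF.
    """
    widths = [end - start for _, start, end in peaks]
    too_wide = sum(1 for w in widths if w >= max_tf_width)
    return {
        "n_peaks": len(peaks),
        "median_width": median(widths) if widths else None,
        "fraction_too_wide": too_wide / len(widths) if widths else None,
    }


# Toy peak list; the third peak is 1000 bp wide.
peaks = [("chr1", 100, 250), ("chr1", 900, 1100), ("chr2", 50, 1050)]
summary = summarize_peaks(peaks)
print(summary)
```

A large fraction of wide peaks, or a peak count far above what is plausible for the factor, is a hint to revisit the peak caller or its parameters before moving on to motif analysis.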

SLIDE 26

3) Motif Discovery & Enrichment Analysis

If your TF binds DNA directly (and sequence-specifically), Motif Discovery should find its binding motif.

The DNA-binding motif of your TF should be centrally enriched in the peaks, and Central Motif Enrichment Analysis (CMEA) should find it.

SLIDE 27

Caveats in ChIP-seq Motif Analysis

Peak regions may contain other TF motifs due to looping.

The binding of the ChIP-ed factor “X” may be indirect.

ChIP-ed motif might be weak due to assisted binding.

Farnham, Nature Reviews Genetics, 2009

SLIDE 28

TF Binding Motif Discovery

ChIP-seq provides extremely rich data for inferring the DNA-binding affinity of the ChIP-ed transcription factor.

In principle, discovering the motif is simple.

ChIP-seq peaks tend to be within +/- 50bp of the bound factor.

So we just examine the peak regions for enriched patterns.
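
To make "examine the peak regions for enriched patterns" concrete, here is a deliberately naive sketch that ranks fixed-length words (k-mers) by how much more frequent they are in peak sequences than in control sequences. It is not the algorithm used by MEME or DREME; the function names, the pseudocount, and the toy sequences are all assumptions made for illustration, and reverse complements are ignored.

```python
from collections import Counter


def count_kmers(seqs, k):
    """Count every overlapping k-mer in a list of DNA sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def enriched_kmers(peak_seqs, control_seqs, k=4, pseudocount=1.0, top=3):
    """Rank k-mers by (peak frequency / control frequency), highest first."""
    fg = count_kmers(peak_seqs, k)
    bg = count_kmers(control_seqs, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    scores = {}
    for kmer, n in fg.items():
        # The pseudocount keeps k-mers absent from the control finite.
        fg_freq = (n + pseudocount) / (fg_total + pseudocount)
        bg_freq = (bg[kmer] + pseudocount) / (bg_total + pseudocount)
        scores[kmer] = fg_freq / bg_freq
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Made-up sequences: each "peak" contains TGGC, the controls do not.
peak_seqs = ["AATGGCAA", "CCTGGCTT", "GTTGGCAC"]
control_seqs = ["AATTCCGG", "ACGTACGT", "GGCCTTAA"]
ranked = enriched_kmers(peak_seqs, control_seqs)
print(ranked[0][0])
```

Dedicated motif discovery tools go further by modelling degenerate positions, variable widths, and statistical significance, which a raw frequency ratio does not capture.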

SLIDE 29

MEME Suite tools for ChIP-seq motif discovery and enrichment

The MEME Suite (http://meme.nbcr.net) contains several motif discovery and enrichment algorithms appropriate for ChIP-seq data analysis.

– Discovery & Enrichment: MEME-ChIP
– Discovery: MEME, DREME, GLAM2
– Enrichment: CentriMo, AME

SLIDE 30

Example: Motif discovery in NFIC ChIP-seq data

Pjanic et al. predicted 39,807 ChIP-seq peaks in NFIC ChIP-seq data.

They do not report using motif discovery on these peaks.

We used MEME-ChIP, which runs both MEME and DREME, to perform motif discovery on the 100-bp NFIC ChIP-seq peak regions.

Machanick & Bailey, Bioinformatics, 2011

SLIDE 31

Motif discovery fails in the (original) NFIC dataset

An NFIC motif is known from in vitro data, based on only 16 sites.

MEME and DREME fail to find this motif in the NFIC data.

But so do the other algorithms we tried: Amadeus, peak-motifs, Trawler and Weeder.

SLIDE 32

The problem: poor peak calling!

We applied a different ChIP-seq peak calling algorithm (ChIP-peak) which predicts only 700 peaks (rather than 40,000).

MEME discovers the NFI-family binding motif in this new set of peaks.

SLIDE 33

Central Motif Enrichment Analysis: CentriMo

CentriMo searches for known motifs whose sites are most centrally enriched in the ChIP-seq regions.

Use 500bp regions centered on each ChIP-seq peak.

[Figure: 500-bp ChIP-seq regions; W=120, L=500; S = number of “successes” = 4; T = number of “trials” = 5]

[Figure: “site-probability” curve with sequence logo for MA0119.1; x-axis: Position of Best Site; y-axis: Probability]

Bailey et al., NAR, 2012
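
The quantities in the figure (W, L, S, T) suggest a simple binomial test, sketched below. This is a simplified reading, not CentriMo itself: it assumes a single fixed central window, ignores motif width, and takes the null probability of a best site landing in the window to be W/L. The function names and the toy offsets are hypothetical.

```python
from math import comb


def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))


def central_enrichment_pvalue(best_site_offsets, window=120, region_len=500):
    """Test whether best sites fall in the central window more than expected.

    best_site_offsets: position of each sequence's best motif site,
    relative to the center of its region (assumed input format).
    """
    trials = len(best_site_offsets)
    successes = sum(1 for x in best_site_offsets if abs(x) <= window / 2)
    # Null model (assumption): best site is uniform over the region.
    p_null = window / region_len
    return successes, trials, binomial_tail(successes, trials, p_null)


# Five regions, four of which have their best site within +/- 60 bp.
offsets = [-10, 35, 4, -52, 180]
s, t, p = central_enrichment_pvalue(offsets)
print(s, t, p)
```

With S = 4 successes in T = 5 trials the p-value is only modest; the very small p-values seen in practice come from applying the same idea to hundreds or thousands of regions.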

SLIDE 34

Central Motif Enrichment confirms the known NFIC motif, even in the original peaks

NFIC motif is most centrally enriched of 862 JASPAR+UniPROBE motifs (p = 10^-31).

[Figure: site-probability curves; x-axis: position of best site in sequence (-250 to 250); y-axis: probability; sequence logo for MA0119.1 (NFIC). Legend:
MA0119.1 p=2.4e-031, w=295, n=5409
MA0244.1 p=4.6e-015, w=381, n=39398
MA0161.1 p=7.3e-015, w=329, n=39356
MA0099.1 p=5.5e-014, w=343, n=34267
MA0406.1 p=8.1e-012, w=323, n=31383]

However, standard motif enrichment algorithms (including AME) do not show the NFIC motif as the most enriched motif.

SLIDE 35

Central Motif Enrichment Analysis shows when things go right (or wrong).

1. Published successful KLF1 ChIP-seq.

2. Published failed KLF1 ChIP-seq.

[Figure 1: Successful KLF1 ChIP-seq. Site-probability curves; x-axis: position of best site in sequence (-250 to 250); y-axis: probability; sequence logo: KLF4. Legend:
MA0039.2 p=4.4e-066, w=111, n=712
Klf7_primary p=6.9e-056, w=103, n=676
MA0140.1 p=1.5e-048, w=177, n=693
MA0035.2 p=2.4e-040, w=194, n=756]

[Figure 2: Failed KLF1 ChIP-seq. Site-probability curve; x-axis: position of best site in sequence (-250 to 250); y-axis: probability; sequence logo: KLF4 (MA0039.2). Legend:
MA0039.2 p=7.2e-001, w=365, n=11404 (p = 0.7)]

SLIDE 36

Motif Spacing Analysis finds co-factor motifs and TF complexes

SLIDE 37

4) Location Analysis

Counts how often TF binding sites are in, say, promoters, intergenic or intragenic regions.

Farnham, Nature Reviews Genetics, 2009
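
A minimal sketch of such counting is shown below for a single chromosome. The gene representation `(tss, tes, strand)`, the 1000 bp promoter length, and the rule that "promoter" takes priority over "intragenic" are assumptions chosen for illustration; real analyses usually rely on a genome annotation file and an interval library.

```python
from collections import Counter


def classify_site(pos, genes, promoter_len=1000):
    """Label one binding site as promoter, intragenic, or intergenic.

    genes: list of (tss, tes, strand) on the same chromosome as pos
    (hypothetical format; tes = transcript end).
    """
    label = "intergenic"
    for tss, tes, strand in genes:
        # The promoter lies upstream of the TSS, so its side depends on strand.
        if strand == "+":
            prom_start, prom_end = tss - promoter_len, tss
        else:
            prom_start, prom_end = tss, tss + promoter_len
        if prom_start <= pos <= prom_end:
            return "promoter"
        if min(tss, tes) <= pos <= max(tss, tes):
            label = "intragenic"
    return label


# Toy annotation: one gene on each strand.
genes = [(5000, 9000, "+"), (20000, 15000, "-")]
sites = [4500, 7000, 12000, 20500, 16000]
counts = Counter(classify_site(s, genes) for s in sites)
print(counts)
```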

SLIDE 38

Predicting Target Genes

Location analysis allows identification of target genes.

TF binding sites in promoters are probably regulatory.

“Nearest TSS” rule is often used to assign binding sites to target genes, but distal sites may regulate some other gene via chromatin looping.

Farnham, Nature Reviews Genetics, 2009
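
The “nearest TSS” rule can be sketched with a binary search over sorted TSS positions. The input format (a non-empty, sorted list of `(position, gene_name)` pairs for one chromosome), the gene names, and the function name `nearest_tss` are assumptions; the returned distance is signed by genomic coordinate only and does not account for gene strand.

```python
from bisect import bisect_left


def nearest_tss(pos, tss_list):
    """Return (gene_name, pos - tss) for the TSS closest to pos.

    tss_list: non-empty list of (position, gene_name), sorted by position,
    all on the same chromosome as pos (assumed format).
    """
    positions = [p for p, _ in tss_list]
    i = bisect_left(positions, pos)
    # Only the TSSs immediately left and right of pos can be nearest.
    candidates = tss_list[max(i - 1, 0):i + 1]
    tss, gene = min(candidates, key=lambda t: abs(t[0] - pos))
    return gene, pos - tss


# Toy annotation with hypothetical gene names.
tss_list = [(1000, "geneA"), (5000, "geneB"), (20000, "geneC")]
assignments = [nearest_tss(p, tss_list) for p in (1200, 14000, 500, 30000)]
print(assignments)
```

Because distal sites may act on another gene through looping, assignments with large distances are best treated as tentative.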

SLIDE 39

Example: Binding near TSSs

Histogram of distances from Klf1 ChIP-seq peak to the nearest TSS.

KLF1 has a population of binding sites in promoters (small hump on left), but most are distal.

Tallack et al., Genome Res., 2010

SLIDE 40


Acknowledgements

The MEME Suite

Tom Whitington
Philip Machanick
James Johnson
Martin Frith
William Noble
Charles Grant
Shobhit Gupta

KLF Project

Michael Tallack
Tom Whitington
Andrew Perkins
Sean Grimmond
Brooke Gardiner
Ehsan Nourbakhsh
Nicole Cloonan
Elanor Wainwright
Janelle Keys
Wai Shan Yuen