NGS Sequence Analysis for Regulation and Epigenomics, Timothy Bailey



slide-1
SLIDE 1

NGS Sequence Analysis for Regulation and Epigenomics

Timothy Bailey Winter School in Mathematical and Computational Biology July 3, 2012

slide-2
SLIDE 2

NGS Analysis and Transcriptional Regulation

  • RNA-seq
    – Measuring transcription levels (gene expression)
    – Detecting RNA regulators (e.g., miRNA)
  • ChIP-seq (and ChIP-exo)
    – Chromatin modifications
    – Binding of transcription factor proteins

slide-3
SLIDE 3

Talk Overview

  • I. Basic Transcriptional Regulation
  • II. ChIP-seq and ChIP-exo
  • III. Analyzing ChIP-seq & ChIP-exo data

    a) Mapping
    b) Peak calling
    c) Motif discovery & Enrichment Analysis
    d) Location analysis

slide-4
SLIDE 4

Part I: Basic Transcriptional Regulation

Source: Steven Chu

slide-5
SLIDE 5

Transcription Factors

  • Mammalian transcription is controlled (in part) by about 1400 transcription factor (TF) proteins.
  • These proteins control transcription in two main ways:
    – Directly, by promoting (or preventing) the assembly of the pre-initiation complex.
    – Indirectly, by modifying the chromatin.

slide-6
SLIDE 6

BASAL TRANSCRIPTION:

  • The pre-initiation complex assembles at the core promoter.
  • This results in only low levels of transcription because the interaction is unstable.

[Diagram: DNA with the core promoter, TATA and INR elements]

slide-7
SLIDE 7

PROXIMAL PROMOTER:

  • The proximal promoter extends upstream of the promoter.
  • It contains binding sites for repressor and activator transcription factors.

[Diagram: DNA with the proximal promoter, TATA and INR elements]

slide-8
SLIDE 8

ACTIVATORS:

  • Some transcription factors (“activators”) bind to sites in the proximal promoter.
  • This stabilizes the transcriptional machinery.
  • This increases transcription.

[Diagram: DNA with the proximal promoter, TATA and INR elements]

slide-9
SLIDE 9

REPRESSORS:

  • Some factors do not stabilize the transcriptional machinery.
  • Their binding can block binding by co-factors and activators.
  • This reduces transcription.

[Diagram: DNA with the proximal promoter, TATA and INR elements]

slide-10
SLIDE 10

ENHANCER REGIONS:

  • Groups of binding sites located upstream or downstream of a promoter.
  • Often very distant: 1000s of base pairs.

[Diagram: DNA with an enhancer region 1-100Kb from the proximal promoter, TATA and INR elements]
slide-11
SLIDE 11

ENHANCER REGIONS:

  • Both activator and repressor transcription factors can occupy enhancer regions.
  • DNA looping brings factors into contact with transcriptional machinery.
  • Bound activators increase transcription.

[Diagram: DNA with the enhancer region, proximal promoter, TATA and INR elements]

slide-12
SLIDE 12

Chromatin modification by TFs:

  • Histone Acetyltransferases (HATs) acetylate histones.
  • Tissue-specific transcription factors can bind to HATs, causing chromatin to open.
  • This can increase transcription.

[Diagram: DNA with the enhancer region, proximal promoter, TATA and INR elements; labels: Specific, General, HAT]

slide-13
SLIDE 13

Part II: ChIP-seq & ChIP-exo

Source: Steven Chu

slide-14
SLIDE 14

ChIP-seq

slide-15
SLIDE 15

ChIP-Exo

Rhee and Pugh, Cell 2011.

slide-16
SLIDE 16

Rhee and Pugh, Cell 2011.

ChIP-seq & ChIP-exo

slide-17
SLIDE 17

Part III: Analyzing ChIP-seq Data

Source: Steven Chu

slide-18
SLIDE 18

Analyzing TF ChIP-seq Data

  • Key messages of this talk:

    – Use controls!
    – Validate your data at each step.
    – But this is Science! What could possibly go wrong…?

slide-19
SLIDE 19

Things that can go wrong in ChIP-seq…

  • 1. Low affinity antibody
  • 2. Non-specific antibody
  • 3. Contamination
  • 4. Poor choice of peak calling algorithm (or parameters) … etc.

slide-20
SLIDE 20

Steps in ChIP-seq Data Analysis

  • 1. Mapping: where do the sequence “tags” map to the genome?
  • 2. Peak Calling: where are the regions of significant tag concentration?
  • 3. Motif Discovery: what is the binding motif?
  • 4. Location Analysis: where are the peaks with respect to genes, promoters, introns etc.?

slide-21
SLIDE 21

1) Mapping ChIP-seq Tags

  • Tags: ChIP-seq produces a pool of “tags” (~100bp)
  • Tag Count: measure of enrichment of region
  • Negative Control: “input DNA” tag count

Tallack et al, Genome Res., 2010
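
A minimal sketch of tag counting, assuming mapped tags are reduced to their genomic start positions on one chromosome. The 200-bp bin size, the toy positions and the helper name `tag_counts` are illustrative assumptions, not part of the talk.

```python
from collections import Counter

def tag_counts(positions, bin_size=200):
    """Count mapped tags per fixed-width bin (bin index = position // bin_size)."""
    return Counter(pos // bin_size for pos in positions)

# Toy tag start positions (hypothetical data).
chip_tags = [1010, 1050, 1120, 1180, 1195, 5300]   # ChIP sample
input_tags = [1100, 3020, 5310, 7400]              # "input DNA" negative control

chip = tag_counts(chip_tags)
ctrl = tag_counts(input_tags)

# Show the ChIP and control tag counts side by side for each occupied bin.
for b in sorted(chip):
    print(f"bin starting at {b * 200}: ChIP={chip[b]} input={ctrl[b]}")
```

Bins where the ChIP count is well above the input count are candidate enriched regions; bins where both are similar are likely background.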

slide-22
SLIDE 22

2) ChIP-seq Peak Calling

  • ChIP-seq produces a pool of “tags”.
  • Tags are currently about 100 bp long.
  • Tag is the 5’ end of a DNA fragment.
  • But DNA is double-stranded so…

Wilbanks and Facciotti, PLoS One, 2010

slide-23
SLIDE 23

ChIP-seq Peak Calling

  • Peak callers combine overlapping tags to get the “peak height”.
  • Sometimes strand information is used to combine tags on opposite strands.
  • Fold-enrichment (tag count / control tag count) is usually used as the criterion for declaring a peak.
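
One way to picture these ideas is the sketch below. It assumes each tag is a (5' position, strand) pair and extends it toward the fragment interior by a guessed fragment length of 200 bp; the pseudocount in the fold-enrichment is also an assumption, added only to avoid division by zero. Real peak callers differ in how they do each step.

```python
def fragment_coverage(tags, fragment_len=200):
    """Extend each tag from its 5' end in the direction of its strand
    and return a dict mapping position -> number of overlapping fragments."""
    coverage = {}
    for pos, strand in tags:
        if strand == "+":
            start, end = pos, pos + fragment_len
        else:  # '-' strand tags extend leftwards
            start, end = pos - fragment_len + 1, pos + 1
        for x in range(max(start, 0), end):
            coverage[x] = coverage.get(x, 0) + 1
    return coverage

def fold_enrichment(chip_height, control_height, pseudocount=1.0):
    """Tag count / control tag count, with an assumed pseudocount."""
    return (chip_height + pseudocount) / (control_height + pseudocount)

# Toy tags flanking a binding site near position 1050 (hypothetical data).
chip = [(900, "+"), (950, "+"), (1150, "-"), (1199, "-")]
cov = fragment_coverage(chip)
peak_height = max(cov.values())
# Leftmost position with maximal coverage.
summit = max(cov, key=lambda x: (cov[x], -x))
print(peak_height, summit, fold_enrichment(peak_height, 1))
```

Tags on the two strands pile up on opposite sides of the bound site, so extending them by strand makes the fragments overlap at the site itself.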

slide-24
SLIDE 24

…ChIP-seq Peak Callers

Wilbanks and Facciotti, PLoS One, 2010

slide-25
SLIDE 25

Sanity check: are your peaks reasonable?

  • Width: TF ChIP-seq peaks should be relatively short (< 300bp) compared to histone modification peaks.
    – Are your peaks too wide?
  • Number: Is the number of TF ChIP-seq peaks reasonable?
    – Some key TFs bind ~30,000 sites, but your TF probably binds far fewer (~1000?)
  • Location: Do your peaks co-occur with histone marks and genes your TF regulates?
  • The next analysis steps will help you answer these questions!
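
The width and number checks can be sketched as follows, assuming peaks are available as (chromosome, start, end) tuples, for example parsed from a BED file. The function name `peak_sanity` and the toy peaks are hypothetical; the 300-bp threshold comes from the slide.

```python
from statistics import median

def peak_sanity(peaks, max_tf_width=300):
    """Summarise peak count and widths for a quick plausibility check."""
    widths = [end - start for _, start, end in peaks]
    too_wide = sum(w > max_tf_width for w in widths)
    return {
        "n_peaks": len(peaks),
        "median_width": median(widths),
        "frac_too_wide": too_wide / len(peaks),
    }

# Toy peak list (hypothetical data).
peaks = [
    ("chr1", 1000, 1200),
    ("chr1", 5000, 5250),
    ("chr2", 300, 1100),    # suspiciously wide for a TF peak
    ("chr2", 9000, 9180),
]
report = peak_sanity(peaks)
print(report)
```

A large fraction of wide peaks, or a peak count far from what is plausible for the factor, is a hint to revisit the peak caller or its parameters.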

slide-26
SLIDE 26

3) Motif Discovery & Enrichment Analysis

  • If your TF binds DNA directly (and sequence-specifically), Motif Discovery should find its binding motif.
  • The DNA-binding motif of your TF should be centrally enriched in the peaks, and Central Motif Enrichment Analysis (CMEA) should find it.

slide-27
SLIDE 27

Caveats in ChIP-seq Motif Analysis

  • Peak regions may contain other TF motifs due to looping.
  • The binding of the ChIP-ed factor “X” may be indirect.
  • ChIP-ed motif might be weak due to assisted binding.

Farnham, Nature Reviews Genetics, 2009

slide-28
SLIDE 28

TF Binding Motif Discovery

  • ChIP-seq provides extremely rich data for inferring the DNA-binding affinity of the ChIP-ed transcription factor.
  • In principle, discovering the motif is simple.
  • ChIP-seq peaks tend to be within +/- 50bp of the bound factor.
  • So we just examine the peak regions for enriched patterns.
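
To make "examine the peak regions for enriched patterns" concrete, here is a deliberately naive k-mer counting sketch. It is not the algorithm used by MEME or DREME; it only compares how many peak sequences versus background sequences contain each k-mer. The sequences, k=4, the pseudocount and the function names are all assumptions, and reverse complements are ignored.

```python
from collections import Counter

def kmer_counts(seqs, k):
    """Number of sequences containing each k-mer (counted once per sequence)."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts

def enriched_kmers(peak_seqs, bg_seqs, k=4, pseudocount=1.0):
    """Rank k-mers by (peak count + pc) / (background count + pc)."""
    fg = kmer_counts(peak_seqs, k)
    bg = kmer_counts(bg_seqs, k)
    scored = [((fg[m] + pseudocount) / (bg[m] + pseudocount), m) for m in fg]
    return [m for _, m in sorted(scored, key=lambda t: (-t[0], t[1]))]

# Toy sequences (hypothetical data): each peak region contains TGGC.
peak_seqs = ["AATGGCAT", "CCTGGCGA", "TTTGGCTT"]
bg_seqs = ["AAAACCCC", "GGTTAACC", "ACGTACGT"]
ranking = enriched_kmers(peak_seqs, bg_seqs)
print(ranking[0])
```

Real motif discovery tools model degenerate positions and statistical significance, which exact k-mer counting cannot do.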

slide-29
SLIDE 29

MEME Suite tools for ChIP-seq motif discovery and enrichment

  • The MEME Suite (http://meme.nbcr.net) contains several motif discovery and enrichment algorithms appropriate for ChIP-seq data analysis.
    – Discovery & Enrichment: MEME-ChIP
    – Discovery: MEME, DREME, GLAM2
    – Enrichment: CentriMo, AME

slide-30
SLIDE 30

Example: Motif discovery in NFIC ChIP-seq data

  • Pjanic et al. predicted 39,807 ChIP-seq peaks in NFIC ChIP-seq data.
  • They do not report using motif discovery on these peaks.
  • We used MEME-ChIP, which runs both MEME and DREME, to perform motif discovery on the 100-bp NFIC ChIP-seq peak regions.

Machanick & Bailey, Bioinformatics, 2011

slide-31
SLIDE 31

Motif discovery fails in the (original) NFIC dataset

  • An NFIC motif is known from in vitro data, based on only 16 sites.
  • MEME and DREME fail to find this motif in the NFIC data.
  • But so do the other algorithms we tried: Amadeus, peak-motifs, Trawler and Weeder.

slide-32
SLIDE 32

The problem: poor peak calling!

  • We applied a different ChIP-seq peak calling algorithm (ChIP-peak) which predicts only 700 peaks (rather than 40,000).
  • MEME discovers the NFI-family binding motif in this new set of peaks.
slide-33
SLIDE 33

Central Motif Enrichment Analysis: CentriMo

  • CentriMo searches for known motifs whose sites are most centrally enriched in the ChIP-seq regions.
  • Use 500bp regions centered on each ChIP-seq peak.

[Figure: “site-probability” curve (probability vs. position of best site) with the MA0119.1 sequence logo. Schematic of 500-bp ChIP-seq regions: L=500, W=120, S = number of “successes” = 4, T = number of “trials” = 5]

Bailey et al, NAR 2012
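
The schematic's numbers can be turned into a one-tailed binomial test, sketched below. This is a simplification: it assumes the best site in each region is equally likely to fall anywhere, so the chance of landing in the central window is taken as W/L, and it ignores details (such as motif width and choosing the window width) that CentriMo itself handles. The function name is hypothetical.

```python
from math import comb

def central_enrichment_pvalue(successes, trials, window, length):
    """P(X >= successes) for X ~ Binomial(trials, window / length).

    successes: regions whose best site falls in the central window (S)
    trials:    regions containing a site at all (T)
    window:    width of the central window (W)
    length:    length of each region (L)
    """
    p = window / length  # assumed null probability of a central hit
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Values from the schematic: S=4, T=5, W=120, L=500.
pval = central_enrichment_pvalue(4, 5, 120, 500)
print(f"{pval:.4f}")
```

With only five regions the p-value is modest (about 0.013); with thousands of peaks, even a slight central preference yields very small p-values.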

slide-34
SLIDE 34

Central Motif Enrichment confirms the known NFIC motif, even in the original peaks

  • NFIC motif is most centrally enriched of 862 JASPAR+UniPROBE motifs (p = 10^-31).
  • However, standard motif enrichment algorithms (including AME) do not show the NFIC motif as the most enriched motif.

[Figure: probability vs. position of best site in sequence (-250 to 250), with the NFIC (MA0119.1) sequence logo. Legend:
MA0119.1 p=2.4e-031, w=295, n=5409
MA0244.1 p=4.6e-015, w=381, n=39398
MA0161.1 p=7.3e-015, w=329, n=39356
MA0099.1 p=5.5e-014, w=343, n=34267
MA0406.1 p=8.1e-012, w=323, n=31383]

slide-35
SLIDE 35

Central Motif Enrichment Analysis shows when things go right (or wrong).

  • 1. Published successful KLF1 ChIP-seq.
  • 2. Published failed KLF1 ChIP-seq.

[Figure 1, successful KLF1 ChIP-seq: probability vs. position of best site in sequence (-250 to 250), with the KLF4 sequence logo. Legend:
MA0039.2 p=4.4e-066, w=111, n=712
Klf7_primary p=6.9e-056, w=103, n=676
MA0140.1 p=1.5e-048, w=177, n=693
MA0035.2 p=2.4e-040, w=194, n=756]

[Figure 2, failed KLF1 ChIP-seq: probability vs. position of best site in sequence (-250 to 250), with the KLF4 (MA0039.2) sequence logo; p = 0.7. Legend:
MA0039.2 p=7.2e-001, w=365, n=11404]

slide-36
SLIDE 36

Motif Spacing Analysis finds co-factor motifs and TF complexes

slide-37
SLIDE 37

4) Location Analysis

  • Counts how often TF binding sites are in, say, promoters, intergenic or intragenic regions.

Farnham, Nature Reviews Genetics, 2009
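
A small sketch of such counting, assuming genes on one chromosome are given as (start, end, strand) and that "promoter" means within 1000 bp upstream of the TSS. That window, the gene coordinates and the function name are assumptions for illustration only.

```python
from collections import Counter

def classify_site(pos, genes, promoter_up=1000):
    """Label a binding site as 'promoter', 'intragenic' or 'intergenic'."""
    label = "intergenic"
    for start, end, strand in genes:
        tss = start if strand == "+" else end
        # Distance upstream of the TSS, measured against the gene's direction.
        upstream = (tss - pos) if strand == "+" else (pos - tss)
        if 0 < upstream <= promoter_up:
            return "promoter"
        if start <= pos <= end:
            label = "intragenic"
    return label

# Toy annotation and binding sites (hypothetical data).
genes = [(5000, 9000, "+"), (20000, 24000, "-")]
sites = [4600, 7000, 24500, 15000]
tally = Counter(classify_site(s, genes) for s in sites)
print(tally)
```

Promoter takes priority over intragenic here, so a site inside one gene but just upstream of another is counted as a promoter site.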

slide-38
SLIDE 38

Predicting Target Genes

  • Location analysis allows identification of target genes.
  • TF binding sites in promoters probably are regulatory.
  • “Nearest TSS” rule is often used to assign binding sites to target genes, but distal sites may regulate some other gene via chromatin looping.

Farnham, Nature Reviews Genetics, 2009
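
The “nearest TSS” rule can be sketched with a binary search over sorted TSS positions. The gene names, coordinates and the function name `nearest_tss` are hypothetical, and strand is ignored for brevity.

```python
import bisect

def nearest_tss(site, tss_list):
    """Return (gene, signed distance) for the TSS closest to a binding site.

    tss_list must be sorted by position: [(position, gene_name), ...].
    """
    positions = [p for p, _ in tss_list]
    i = bisect.bisect_left(positions, site)
    # Only the TSSs immediately left and right of the site can be nearest.
    candidates = tss_list[max(i - 1, 0):i + 1]
    pos, gene = min(candidates, key=lambda t: abs(t[0] - site))
    return gene, site - pos

# Toy annotation (hypothetical data).
tss_list = [(5000, "geneA"), (24000, "geneB"), (60000, "geneC")]
for site in (4600, 30000, 70000):
    print(site, nearest_tss(site, tss_list))
```

The returned distances are also the raw material for a histogram of peak-to-TSS distances. As the slide warns, the assignment is only a heuristic: a distal site may act on a different gene through looping.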

slide-39
SLIDE 39

Example: Binding near TSSs

  • Histogram of distances from Klf1 ChIP-seq peak to the nearest TSS.
  • KLF1 has a population of binding sites in promoters (small hump on left), but most are distal.

Tallack et al, Genome Res., 2010

slide-40
SLIDE 40


Acknowledgements

The MEME Suite

  • Tom Whitington
  • Philip Machanick
  • James Johnson
  • Martin Frith
  • William Noble
  • Charles Grant
  • Shobhit Gupta

KLF Project

  • Michael Tallack
  • Tom Whitington
  • Andrew Perkins
  • Sean Grimmond
  • Brooke Gardiner
  • Ehsan Nourbakhsh
  • Nicole Cloonan
  • Elanor Wainwright
  • Janelle Keys
  • Wai Shan Yuen