Re-analysis of a CD4 ChIP-Seq data set with csaw Ryan C. Thompson - - PowerPoint PPT Presentation

re analysis of a cd4 chip seq data set with csaw
SMART_READER_LITE
LIVE PREVIEW

Re-analysis of a CD4 ChIP-Seq data set with csaw Ryan C. Thompson - - PowerPoint PPT Presentation

Re-analysis of a CD4 ChIP-Seq data set with csaw Ryan C. Thompson Salomon Lab The Scripps Research Institute May 6, 2016 Ryan C. Thompson Advanced RNA-seq analysis Outline Intro to T-cells and experimental design ChIP-Seq overview


slide-1
SLIDE 1

Re-analysis of a CD4 ChIP-Seq data set with csaw

Ryan C. Thompson Salomon Lab The Scripps Research Institute May 6, 2016

Ryan C. Thompson Advanced RNA-seq analysis

slide-2
SLIDE 2

Outline

Intro to T-cells and experimental design ChIP-Seq overview Consensus peak-calling wih IDR Previous promoter-oriented analysis (published soon) Initial QC and analysis of whole genome analysis Genomic region blacklists

Ryan C. Thompson Advanced RNA-seq analysis

slide-3
SLIDE 3

CD4 T-cell activation and memory formation

Time # Cells

TCR CD4 CD28

Naïve T Cell

B7 MHC

Activation Antigen Presenting Cell

Effector Cells P r

  • l

i f e r a t i

  • n

Memory Cells

Cell Death

Naïve T Cell

Figure 1: CD4 T-cell response to successive infections

Ryan C. Thompson Advanced RNA-seq analysis

slide-4
SLIDE 4

Experimental design

Isolate and culture naïve & memory cells from 4 donors Activate cells and take samples at pre-activation (day 0) and at days 1, 5, and 14 post-activation RNA-seq and ChIP-seq on all samples ChIP using antibodies against H3K4Me2, H3K4Me3, HeK27me3 (and input) Data analysis? Profjt

Ryan C. Thompson Advanced RNA-seq analysis

slide-5
SLIDE 5

Experimental design

Isolate and culture naïve & memory cells from 4 donors Activate cells and take samples at pre-activation (day 0) and at days 1, 5, and 14 post-activation RNA-seq and ChIP-seq on all samples ChIP using antibodies against H3K4Me2, H3K4Me3, HeK27me3 (and input) Data analysis? Profjt

Ryan C. Thompson Advanced RNA-seq analysis

slide-6
SLIDE 6

How ChIP-Seq works, more or less

Figure 2: Overview of ChIP-Seq workfmow

Ryan C. Thompson Advanced RNA-seq analysis

slide-7
SLIDE 7

Promoter-oriented analysis

Ryan C. Thompson Advanced RNA-seq analysis

slide-8
SLIDE 8

First steps

Map with bowtie2 Call peaks with MACS Determine consensus biologically reproduced peaks using Irreproducible Discovery Rate (like FDR but comparing consistency between two lists) Process RNA-seq as usual: tophat htseq-count edgeR

Ryan C. Thompson Advanced RNA-seq analysis

slide-9
SLIDE 9

First steps

Map with bowtie2 Call peaks with MACS Determine consensus biologically reproduced peaks using Irreproducible Discovery Rate (like FDR but comparing consistency between two lists) Process RNA-seq as usual: tophat → htseq-count → edgeR

Ryan C. Thompson Advanced RNA-seq analysis

slide-10
SLIDE 10

IDR workfmow

IDR helps choose a signifjcance threshold at which peaks are reproducible between biological replicates. Call peaks in each individual sample Run IDR on each pairof samples to determine p-value → IDR mapping Since multiple samples give more information than any pair, take the smallest relationship as an upper bound for the

  • verall IDR

Combine all samples and call peaks again Filter combined-sample peak calls at the p-value corresponding to the chosen IDR threshold to obtain consensus peaks (Should do saturation analysis, but I didn’t)

Ryan C. Thompson Advanced RNA-seq analysis

slide-11
SLIDE 11

IDR workfmow

IDR helps choose a signifjcance threshold at which peaks are reproducible between biological replicates. Call peaks in each individual sample Run IDR on each pairof samples to determine p-value → IDR mapping Since multiple samples give more information than any pair, take the smallest relationship as an upper bound for the

  • verall IDR

Combine all samples and call peaks again Filter combined-sample peak calls at the p-value corresponding to the chosen IDR threshold to obtain consensus peaks (Should do saturation analysis, but I didn’t)

Ryan C. Thompson Advanced RNA-seq analysis

slide-12
SLIDE 12

Promoter analysis

The original analysis focused on gene promoters, and comparing promoter histone behavior with gene expression. It also focused mainly on H3K4me2/3. Determine efgective promoter radius by looking at distribution

  • f nearest TSS-to-peak distances

Defjne promoter regions as TSS ± radius Merge overlapping promoters for the same gene (for genes with multiple TSS) Count reads in promoter regions Perform “difgerential binding” analysis on promoter counts using edgeR, similarly to RNA-seq Look at RNA-seq DE vs promoter ChIP-Seq DB Look at RNA-seq vs peak presence/absence in promoter

Ryan C. Thompson Advanced RNA-seq analysis

slide-13
SLIDE 13

Determining promoter radius

0.00000 0.00025 0.00050 0.00075 2500 5000 7500 10000

Distance to Peak density

Sampletype H3K4me2 H3K4me3 H3K27me3 Celltype Naive Memory

Promoter Peak Summit Distance Profiles

Figure 3: Distribution of distances from TSS to nearest peak summit

Ryan C. Thompson Advanced RNA-seq analysis

slide-14
SLIDE 14

Static expression distribusion vs peak status

  • ● ●
  • ● ●
  • ● ●
  • ● ●
  • ● ●
  • ● ●
  • ●●
  • ● ●
  • ● ●
  • ●●
  • ●●
  • ●●
  • ● ● ●
  • ● ●
  • ● ●
  • ● ●
  • ●●
  • ● ●
  • ●● ●
  • ●●
  • ● ●
  • ●●
  • ●●
  • ● ●
  • ● ●
  • ●●
  • ● ●
  • ● ●
  • ● ●
  • ● ● ●
  • ●●
  • ●●
  • ● ●
  • ● ●
  • ● ●
  • ●●
  • ● ●
  • ● ●
  • ●●
  • ●●
  • ● ●
  • ● ●
  • ● ●
  • ●●
  • ●●
  • ● ●
  • ●●
  • ● ●
  • Naive.D0

Naive.D1 Naive.D5 Naive.D14 Memory.D0 Memory.D1 Memory.D5 Memory.D14 None H3K4me2 H3K4me2.H3K4me3 H3K4me3 None H3K4me2 H3K4me2.H3K4me3 H3K4me3 −10 10 −10 10 −10 10 −10 10

log2(FPKM) PeakStatus

FPKM Distribution Conditioned on Peak Status

Figure 4: Gene expression distribution by promoter peak status

Ryan C. Thompson Advanced RNA-seq analysis

slide-15
SLIDE 15

Static expression distribusion vs peak status

Naive.D0 Naive.D1 Naive.D5 Naive.D14 Memory.D0 Memory.D1 Memory.D5 Memory.D14 −10 10 −10 10 −10 10 H3K4me2 H3K4me3 H3K27me3 −10 10 −10 10 −10 10 −10 10 −10 10 −10 10 −10 10 −10 10

NoPeak Peak

log2(FPKM) QQ Plot, Peak vs No Peak

Figure 5: Gene expression QQ plots, peak vs no peak

Ryan C. Thompson Advanced RNA-seq analysis

slide-16
SLIDE 16

Histone and RNA interaction dynamics

The above plots only compared promoter peak status and expression at the same time points. What if we look at, e.g. initial peak status vs expression change over time, or vice versa?

Ryan C. Thompson Advanced RNA-seq analysis

slide-17
SLIDE 17

Histone and RNA interaction dynamics

The above plots only compared promoter peak status and expression at the same time points. What if we look at, e.g. initial peak status vs expression change over time, or vice versa?

Ryan C. Thompson Advanced RNA-seq analysis

slide-18
SLIDE 18

Initial peak status predicts 24h RNA change in naïve cells

0.2 0.4 0.6 0.8 1 1.2 1.4 H3K4me2 H3K4me3 H3K4me2 and H3K4me3

Peak Status at 0 h H3K4 peak status at 0 h correlates with RNA change at 24 h after activation of naive cells

***

A

RN A Log2 Fold Change at 24 h

Figure 3

** **

Figure 6: RNA 24h change vs peak status

Ryan C. Thompson Advanced RNA-seq analysis

slide-19
SLIDE 19

Initial promoter HeK4me3/2 ratio predicts 24h RNA change in naïve cells

  • 1.5
  • 1
  • 0.5

0.5 1 1.5 H3K4me2 dominant (H3K4me2:me3 ratio > 1) H3K4me3 intermediate (H3K4me2:me3 ratio = 0.5-1) H3K4me3 dominant (H3K4me2:me3 ratio < 0.5)

H3K4 methylation enrichment at 0 h

RN A Log2 Fold Change at 24 h H3K4me2/3 ratio

B

H3K4me2/H3K4me3 ratio at 0 h determines RNA change 24 h after activation of naive cells

**** **** ****

Figure 7: RNA 24h change vs peak status

Ryan C. Thompson Advanced RNA-seq analysis

slide-20
SLIDE 20

Difgerent relationship between peaks and expression in memory

  • 1.5
  • 1
  • 0.5

0.5 1 No H3K4me H3K4me2 H3K4me3 H3K4me2 and H3K4me3

RNA Log2 Fold Change (24 h) H3K4me peak status (0 h)

* **

N.S.

A B

D

E

H3K4me3 methylation status predicts RNA expression change after activation of memory CD4 T cells

RNA Log Fold

Figure 8: RNA 24h change vs peak status

Ryan C. Thompson Advanced RNA-seq analysis

slide-21
SLIDE 21

Difgerent relationship between peaks and expression in memory

0.1 0.2 0.3 0.4 0.5 0.6 No H3K4me3 enrichment at 0 h (H3K4me3/input < 1) H3K4me3 enrichment at 0 h (H3K4me3/input > 1)

H3K4me3 enrichment at 0 h increases RNA expression after activation of memory

B

**** H3K4me3 methylation status predicts activation of memory CD4 T cells

RNA Log2 Fold Change (24 h)

Figure 9: RNA 24h change vs peak status

Ryan C. Thompson Advanced RNA-seq analysis

slide-22
SLIDE 22

Genome-wide analysis

Ryan C. Thompson Advanced RNA-seq analysis

slide-23
SLIDE 23

Starting point

Using existing bam fjles Using existing peak calls Analyze using the csaw Bioconductor package (csaw = “ChIP-Seq Analysis with Windows”) QC Steps

Look at strand cross-correlation plots to determine fragment length Look at coverage plots centred on local maxima to determine protein footprint size Look at sample-vs-sample MA plots to determine proper normalization

Analyze window counts with edgeR Combiond signifjcance of adjacent windows to increase power

Ryan C. Thompson Advanced RNA-seq analysis

slide-24
SLIDE 24

Starting point

Using existing bam fjles Using existing peak calls Analyze using the csaw Bioconductor package (csaw = “ChIP-Seq Analysis with Windows”) QC Steps

Look at strand cross-correlation plots to determine fragment length Look at coverage plots centred on local maxima to determine protein footprint size Look at sample-vs-sample MA plots to determine proper normalization

Analyze window counts with edgeR Combiond signifjcance of adjacent windows to increase power

Ryan C. Thompson Advanced RNA-seq analysis

slide-25
SLIDE 25

Starting point

Using existing bam fjles Using existing peak calls Analyze using the csaw Bioconductor package (csaw = “ChIP-Seq Analysis with Windows”) QC Steps

Look at strand cross-correlation plots to determine fragment length Look at coverage plots centred on local maxima to determine protein footprint size Look at sample-vs-sample MA plots to determine proper normalization

Analyze window counts with edgeR Combiond signifjcance of adjacent windows to increase power

Ryan C. Thompson Advanced RNA-seq analysis

slide-26
SLIDE 26

Determining fragment length from strand cross-correlation

H3K27me3 H3K4me2 H3K4me3 input 0.000 0.002 0.004 0.006 0.00 0.01 0.02 0.03 0.00 0.02 0.04 0.06 0.000 0.005 0.010 0.015 0.020 250 500 750 1000 250 500 750 1000 250 500 750 1000 250 500 750 1000

Delay CCF Group

NaiveD0 MemoryD0 NaiveD1 MemoryD1 NaiveD5 MemoryD5 NaiveD14 MemoryD14

Reference

Nucleosome Footprint (147bp) Read Length (100bp)

Cross−Correlation Function, Loess−Smoothed (span = 0.1)

Figure 10: Strand Cross-correlation plots

Ryan C. Thompson Advanced RNA-seq analysis

slide-27
SLIDE 27

Determining peak width by profjling coverage around local maxima

H3K27me3 H3K4me2 H3K4me3 input 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 −200 200 −200 200 −200 200 −200 200 Distance RelativeCoverage TreatmentGroup NaiveD0 MemoryD0 NaiveD1 MemoryD1 NaiveD5 MemoryD5 NaiveD14 MemoryD14 Relative Coverage Around Maxima, Raw, 300bp Radius

Figure 11: Local maxima coverage plots

Ryan C. Thompson Advanced RNA-seq analysis

slide-28
SLIDE 28

Windows?

Figure 12: No, not that kind of windows

Ryan C. Thompson Advanced RNA-seq analysis

slide-29
SLIDE 29

Window counting

Tile windows of footprint size across the genome at desired resolution Extend each read’s 3′ end to the fragment length Count fragments overlapping each window (non-unoque is OK) Also count the total reads assigned Separately repeat with large bins of 10kb, requiring each read to map to a single bin (10kb bins will be used for background normalization)

Ryan C. Thompson Advanced RNA-seq analysis

slide-30
SLIDE 30

Normalizing window counts

Multiple ways to normalize Composition (i.e. background) normalization: run TMM (standard edgeR norm. method) on 10kb bins Effjciency bias normalization: Select only high-abundance windows and run TMM Peak-overlap normalization: Select only windows that overlap called peaks and run TMM Do these give difgerent results? How to choose which is correct?

Ryan C. Thompson Advanced RNA-seq analysis

slide-31
SLIDE 31

Normalizing window counts

Multiple ways to normalize Composition (i.e. background) normalization: run TMM (standard edgeR norm. method) on 10kb bins Effjciency bias normalization: Select only high-abundance windows and run TMM Peak-overlap normalization: Select only windows that overlap called peaks and run TMM Do these give difgerent results? How to choose which is correct?

Ryan C. Thompson Advanced RNA-seq analysis

slide-32
SLIDE 32

High-abundance windows correlate with peaks

4 8 12 FALSE TRUE

Peak Overlap Abundance

Window Mean logCPM by Peak Status

Figure 13: Window abundance vs peak overlap

Ryan C. Thompson Advanced RNA-seq analysis

slide-33
SLIDE 33

ChIP-Seq MA plots are bimodal

Day 0 Day 1 Day 5 Day 14 −1 1 −1 1 −1 1 −1 1 Donor 4659 Donor 5053 Donor 5131 Donor 5291 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

log2(CPM) log2(FC)

0.0 0.1 0.2 0.3 0.4 density

MA plots for H3K4me2 samples in Naive CD4 T−cells

Figure 14: Promoter MA Plots

Ryan C. Thompson Advanced RNA-seq analysis

slide-34
SLIDE 34

Normalization factors line up with one of the modes

−8 −4 4 −2.5 0.0 2.5 5.0 7.5

log2(CPM) log2(FC)

0.00 0.05 0.10 0.15 0.20

Density Norm Type

Comp HiAb Peaks

10KB Bin MA Plot for H3K4me3−Naive−D5−Donor5053 vs H3K4me3−Naive−D1−Donor5053

Figure 15: 10kb bin MA Plot

Ryan C. Thompson Advanced RNA-seq analysis

slide-35
SLIDE 35

Choosing a normalization

Genuine global changes in histone modifjcation are possible Global changes can also be explained by variations in ChIP enrichment effjciency (effjcinecy bias) Normalizing to the signal eliminates global changes; normalizing to the BG preserves them We need to rule out effjciency bias before we can use the BG normalization This can be done by making sure technical replicates have consistent effjciency Or we could just give up and use that loess curve

Ryan C. Thompson Advanced RNA-seq analysis

slide-36
SLIDE 36

Choosing a normalization

Genuine global changes in histone modifjcation are possible Global changes can also be explained by variations in ChIP enrichment effjciency (effjcinecy bias) Normalizing to the signal eliminates global changes; normalizing to the BG preserves them We need to rule out effjciency bias before we can use the BG normalization This can be done by making sure technical replicates have consistent effjciency Or we could just give up and use that loess curve

Ryan C. Thompson Advanced RNA-seq analysis

slide-37
SLIDE 37

Blacklist regions

Some areas of the genome have spurious extreme high coverage (over 100x the rest of the genome’s coverage) even in input samples These regions tend to have higher proportions of multi-mapping reads Some of them are known repeats, but others have no obvious sequence features to explain the coverage as a mapping artifact Peak-calling and other analyses don’t work in these regions Reads mapping to these regions must be excluded from downstream analyses What happens if you don’t do this?

Ryan C. Thompson Advanced RNA-seq analysis

slide-38
SLIDE 38

Blacklist regions

Some areas of the genome have spurious extreme high coverage (over 100x the rest of the genome’s coverage) even in input samples These regions tend to have higher proportions of multi-mapping reads Some of them are known repeats, but others have no obvious sequence features to explain the coverage as a mapping artifact Peak-calling and other analyses don’t work in these regions Reads mapping to these regions must be excluded from downstream analyses What happens if you don’t do this?

Ryan C. Thompson Advanced RNA-seq analysis

slide-39
SLIDE 39

CCF Plot with blacklist exclusion

H3K27me3 H3K4me2 H3K4me3 input 0.000 0.002 0.004 0.006 0.00 0.01 0.02 0.03 0.00 0.02 0.04 0.06 0.000 0.005 0.010 0.015 0.020 250 500 750 1000 250 500 750 1000 250 500 750 1000 250 500 750 1000

Delay CCF Group

NaiveD0 MemoryD0 NaiveD1 MemoryD1 NaiveD5 MemoryD5 NaiveD14 MemoryD14

Reference

Nucleosome Footprint (147bp) Read Length (100bp)

Cross−Correlation Function, Loess−Smoothed (span = 0.1)

Figure 16: Strand Cross-correlation plots with blacklisting

Ryan C. Thompson Advanced RNA-seq analysis

slide-40
SLIDE 40

CCF Plot WITHOUT blacklist exclusion

H3K27me3 H3K4me2 H3K4me3 input 0.00 0.02 0.04 0.06 0.00 0.02 0.04 0.06 0.00 0.02 0.04 0.06 0.08 0.000 0.025 0.050 0.075 0.100 250 500 750 1000 250 500 750 1000 250 500 750 1000 250 500 750 1000 Delay CCF TreatmentGroup NaiveD0 MemoryD0 NaiveD1 MemoryD1 NaiveD5 MemoryD5 NaiveD14 MemoryD14 Cross−Correlation Function, Loess−Smoothed (span = 0.1)

Figure 17: Strand Cross-correlation plots without blacklisting

Ryan C. Thompson Advanced RNA-seq analysis

slide-41
SLIDE 41

MA plot with blacklist exclusion

−8 −4 4 −2.5 0.0 2.5 5.0 7.5

log2(CPM) log2(FC)

0.00 0.05 0.10 0.15 0.20

Density Norm Type

Comp HiAb Peaks

10KB Bin MA Plot for H3K4me3−Naive−D5−Donor5053 vs H3K4me3−Naive−D1−Donor5053

Figure 18: 10kb bin MA Plot with blacklisting

Ryan C. Thompson Advanced RNA-seq analysis

slide-42
SLIDE 42

MA plot WITHOUT blacklist exclusion

−8 −4 4 0.0 2.5 5.0 7.5 10.0

log2(CPM) log2(FC)

0.00 0.05 0.10 0.15 0.20

Density Norm Type

Comp HiAb Peaks

10KB Bin MA Plot for H3K4me3−Naive−D5−Donor5053 vs H3K4me3−Naive−D1−Donor5053

Figure 19: 10kb bin MA Plot without blacklisting

Ryan C. Thompson Advanced RNA-seq analysis

slide-43
SLIDE 43

Any Questions?

Ryan C. Thompson Advanced RNA-seq analysis