ChIP-seq analysis M. Defrance, C. Herrmann, S. Le Gras, D. Puthier, - PowerPoint PPT Presentation

ChIP-seq analysis M. Defrance, C. Herrmann, S. Le Gras, D. Puthier, M. Thomas.Chollier ● Data visualization, quality control, normalization & peak calling  Presentation (Carl Herrmann)  Practical session ● Peak annotation  Presentation (Matthieu Defrance)  Practical session ● From peaks to motifs  Presentation (Jacques van Helden)  Practical session Reads Peaks Annotations Motifs Carl Herrmann – Ecole Aviesan Roscoff 2015

Datasets used ● estrogen-receptor (ESR1) is a key factor in breast cancer developement ● goal of the study: understand the dependency of ESR1 binding on presence of co-factors, in particular GATA3, which is mutated in breast cancers ● approaches: GATA3 silencing (siRNA), ChIP-seq on ESR1 in wt vs. siGATA3 conditions, chromatin profjling Carl Herrmann – Ecole Aviesan Roscoff 2015

Datasets used ● ESR1 ChIP-seq in WT & siGATA3 conditions ( 3 replicates = 6 datasets) ● H3K4me1 in WT & siGATA3 conditions (1 replicate = 2 datasets) ● Input dataset in MCF-7 (1 replicate = 1 dataset) ● p300 before estrogen stimulation ● GATA3/FOXA1 ChIP-seq before/after estrogen stimulation ● microarray expression data, etc ... Carl Herrmann – Ecole Aviesan Roscoff 2015

Hands on !! Let's have a look at the data Carl Herrmann – Ecole Aviesan Roscoff 2015

What we want to do do we have more signal here ... … than here ? Carl Herrmann – Ecole Aviesan Roscoff 2015

Keys aspects of ChIP-seq analysis (1) Quality Control : do I have signal ? (2) Determine signal coverage (3) Modelling noise levels (4) Scaling/ normalizing datasets (5) Detecting enriched peak regions (6) Performing difgerential analysis Carl Herrmann – Ecole Aviesan Roscoff 2015

0. Principle of ChIP-seq The binding site itself is generally not sequenced ! We expect to see a typical strand asymetry in read densities ChIP peak recognition pattern → [Wilbanks & Faccioti PLoS One (2010)] Carl Herrmann – Ecole Aviesan Roscoff 2015

0. Principle of ChIP-seq Strand asymetry is blurred when multiple proteins bind or in case of histone modifjcations ChIP [Wilbanks & Faccioti PLoS One (2010)] Carl Herrmann – Ecole Aviesan Roscoff 2015

Principle of ChIP-seq treatment read density (=WIG/bigWig) input read density (=WIG/bigWig) peak (=BED) aligned reads + strand (=BAM) aligned reads - strand (=BAM) Carl Herrmann – Ecole Aviesan Roscoff 2015

1. Quality control ● Qualitative  Look at your favorite gene/locus in IGV !  Heatmap of signal e.g. H3K4me3 at promoters → Carl Herrmann – Ecole Aviesan Roscoff 2015

1. Quality control ● Quantitative  Fraction of reads in peaks (FRiP) FRiP = reads ∈ peaks total reads depends on type of ChIP (TF/histone) →  PCR Bottleneck coeffjcient (PBC) : measure of library complexity Genomic positions PBC < 0.5 PBC = N 1 with 1 read aligned 0.5 < PBC < 0.8 0.8 < PBC N d Genomic positions with ≥ 1 read aligned https://www.encodeproject.org/data-standards/2012-quality-metrics/ Carl Herrmann – Ecole Aviesan Roscoff 2015

2. from reads to coverage ● to visualize the data, we use coverage plots (=density of fragments per genomic region) ● need to reduce BAM fjle to more compact format → bigWig/bedGraph Carl Herrmann – Ecole Aviesan Roscoff 2015

2. from reads to coverage ● Reads are extended to 3' to fragment length ● Read counts are computed for each bin ● Counts are normalized  reads per genomic content normalize to 1x coverage → SD = n mapped reads × L G eff  reads per kilobase per million reads per bin RPKM = n reads / bin × W bin n mapped reads → deepTools : bamCoverage Carl Herrmann – Ecole Aviesan Roscoff 2015

2. from reads to coverage ESR1 input H3K4me1 ESR1 reads H3K4me1 reads Carl Herrmann – Ecole Aviesan Roscoff 2015

3. signal and noise MCF7 genome hg19 reference genome Carl Herrmann – Ecole Aviesan Roscoff 2015

3. signal to noise treatment input k=35 k=50 k=100 ● Mappability issue : alignability track shows, how many times a read from a given position of the genome would align  a=1 read from this position ONLY aligns to this position →  a=1/n read from this position could align to n locations → → we usually only keep uniquely aligned reads : positions with a < 1 have no reads left Carl Herrmann – Ecole Aviesan Roscoff 2015

3. signal to noise The availability of a control sample in mandatory ! → mock IP with unspecifjc antibody → sequencing of input (=naked) DNA Carl Herrmann – Ecole Aviesan Roscoff 2015

4. modelling background level How to get a noise free track ? ● naïve subtraction treatment – input is not possible, because both libraries have difgerent sequencing depth ! ● Solution 1 : before subtraction, scale both libraries by total number of reads (library size)  RPGC SD = n mapped reads × L  RPKM G eff RPKM = n reads / bin × W bin n mapped reads Carl Herrmann – Ecole Aviesan Roscoff 2015

4. modelling background level scale smaler dataset Treatment N= 10 M reads Input N'= 12 M reads Problem : signal infmuences scaling factor More signal (but equal noise) artifjcial noise over-estimation → Carl Herrmann – Ecole Aviesan Roscoff 2015

4. modelling background level input 1 area ~ number of reads = 10 10 5 treatment 1 area ~ number of reads = 10 + 4 + 4 = 18 10 Scaling by library size : upscale input by 18/10 = 1.8 5 estimated noise level treatment 1 10 Noise level is over-estimated ! Carl Herrmann – Ecole Aviesan Roscoff 2015

4. modelling background level input 1 area ~ number of reads = 10 10 5 treatment 1 area ~ number of reads = 10 + 9 + 4 = 23 10 Scaling by library size : upscale input by 23/10 = 2.3 5 estimated noise level treatment 1 10 Carl Herrmann – Ecole Aviesan Roscoff 2015

4. modelling background level ● more advanced : linear regression by exclusing peak regions (PeakSeq) ● read counts in 1Mb regions in input and treatment excluding enriched (=signal) regions all regions Carl Herrmann – Ecole Aviesan Roscoff 2015

4. modelling background level ● Alternative strategy (deepTools Diaz et al.) → 1. bin genome into n 10 kb windows 2. count reads in each window for input ( X i ) and treatment ( Y i ) 3. total number of reads is N X and N Y 4. order Y i from less to most enriched → Y (i) 5. defjne and plot j j p j = ∑ i = 1 Y ( i ) / M Y ; q j = ∑ i = 1 X ( i ) / M X  p j = proportion of reads in the j less enriched windows 25% of the genome 90% of the genome contains no reads ! contain ~ 25% of reads Carl Herrmann – Ecole Aviesan Roscoff 2015

4. modelling background level this is were pj and qj difger most both datasets contain treatment dataset only noise in this range contains signal → scale according to number of reads in this range Carl Herrmann – Ecole Aviesan Roscoff 2015

5. from reads to peaks ● Tag shifting vs. extension  positive/negative strand read peaks do not represent the true location of the binding site  fragment length is d and can be estimated from strand asymmetry  reads can be elongated to a size of d  reads can be shifted by d/2 increased resolution → example of MACS model building using top enriched regions Carl Herrmann – Ecole Aviesan Roscoff 2015

5. from reads to peaks “Read shifting” d Carl Herrmann – Ecole Aviesan Roscoff 2015

5. from reads to peaks “Read extension” d Carl Herrmann – Ecole Aviesan Roscoff 2015

Carl Herrmann – Ecole Aviesan Roscoff 2015

Some methods separate the tag densities into difgerent strands and take advantage of tag asymmetry Most consider merged densities and look for enrichment Carl Herrmann – Ecole Aviesan Roscoff 2015

Tag shift Tag extension Tags unchanged Carl Herrmann – Ecole Aviesan Roscoff 2015

5. from reads to peaks ● Determining “enriched” regions  sliding window across the genome  at each location, evaluate the enrichement of the signal wrt. expected background based on the distribution  retain regions with P-values below threshold  evaluate FDR Pval ~ 0.6 Pval < 1e-20 Carl Herrmann – Ecole Aviesan Roscoff 2015

6. MACS [Zhang et al. Genome Biol. 2008] ● Step 1 : estimating fragment length d  slide a window of size BANDWIDTH  retain top regions with MFOLD enrichment of treatment vs. input  plot average +/- strand read densities estimate d → enrichment > MFOLD treatment control Carl Herrmann – Ecole Aviesan Roscoff 2015

5. MACS [Zhang et al. Genome Biol. 2008] ● Step 2 : identifjcation of local noise parameter  slide a window of size 2*d across treatment and input  estimate parameter λ local of Poisson distribution 1 kb estimate λ over difg. ranges 5 kb → take the max 10 kb full genome Carl Herrmann – Ecole Aviesan Roscoff 2015

ChIP-seq analysis M. Defrance, C. Herrmann, S. Le Gras, D. Puthier, - PowerPoint PPT Presentation

ChIP-seq analysis M. Defrance, C. Herrmann, S. Le Gras, D. Puthier, M. Thomas.Chollier Data visualization, quality control, normalization & peak calling Presentation (Carl Herrmann) Practical session Peak annotation

Jen Grenier Director, TREx Facility Announcements New and Improved Project Submission Form

Importing data Peter Humburg Statistician, Macquarie University DataCamp ChIP-seq Workflows in

Methods for Analyzing ChIP-Seq data Introduction to the ChIP-Seq server at SIB Lausanne Public

Introduction to RNA-Seq Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

ChIP-seq data analysis 04-05-12 Outlook Friday 04-05-12: Next-generation sequencing

Introduction to Chromatin IP sequencing (ChIP-seq) data analysis Workshop on ChIP-seq data

The Epigenome Tools 2: ChIP-Seq and Data Analysis Chongzhi Zang zang@virginia.edu

Scaling normalisation for ChIP-seq with exogenous chromatin Workshop on ChIP-seq data analysis

Introduction to differential binding Peter Humburg Statistician, Macquarie University DataCamp

Re-analysis of a CD4 ChIP-Seq data set with csaw Ryan C. Thompson Salomon Lab The Scripps

RNA-seq Data Analysis Introduction to RNA-seq data analysis June, 2018 1 Luigi Grassi < lg

Calibration des Microroc (II) Alex, Cyril, Giom, Jean, Max 09 Mai 2011, Annecy 1 Reminder 2

Genome-wide supervised ChIP-seq peak detection Toby Dylan Hocking toby.hocking@mail.mcgill.ca

Introduction to ChIP-seq Joanna Krupka CRUK Summer School in Bioinformatics

Overview of the DE analysis Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

RNA-seq: filtering, quality control and visualisation COMBINE RNA-seq Workshop QC and

Signal Transduction Pathway Smorgasbord Ron Bose, MD PhD Biochemistry and Molecular Cell Biology

MOUNT SINAI Commission 01 Inquiry Medicine Hormone Receptor Testing HOSPITAL 600 University

Breast 2019 2019-2020 NAACCR W EBINAR SERIES 2 Q&A Please submit all questions concerning

7/11/2019 I have nothing to disclose Selective Estrogen Receptor Modulators (Raloxifene)

Differential expression analysis John Blischak Instructor DataCamp Differential Expression

Gene Regulation Bioinformatics Wyeth W. Wasserman University of British Columbia www.cisreg.ca

Improving gene signatures by the identification of differentially expressed modules in molecular

transcriptional programs in breast cancer HATICE U. OSMANBEYOGLU RAPHAEL PELOSSOF JACQUELINE F.

ChIP-seq analysis M. Defrance, C. Herrmann, S. Le Gras, D. Puthier, - PowerPoint PPT Presentation

ChIP-seq analysis M. Defrance, C. Herrmann, S. Le Gras, D. Puthier, M. Thomas.Chollier Data visualization, quality control, normalization & peak calling Presentation (Carl Herrmann) Practical session Peak annotation

Jen Grenier Director, TREx Facility Announcements New and Improved Project Submission Form

Importing data Peter Humburg Statistician, Macquarie University DataCamp ChIP-seq Workflows in

Methods for Analyzing ChIP-Seq data Introduction to the ChIP-Seq server at SIB Lausanne Public

Introduction to RNA-Seq Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

ChIP-seq data analysis 04-05-12 Outlook Friday 04-05-12: Next-generation sequencing

Introduction to Chromatin IP sequencing (ChIP-seq) data analysis Workshop on ChIP-seq data

The Epigenome Tools 2: ChIP-Seq and Data Analysis Chongzhi Zang zang@virginia.edu

Scaling normalisation for ChIP-seq with exogenous chromatin Workshop on ChIP-seq data analysis

Introduction to differential binding Peter Humburg Statistician, Macquarie University DataCamp

Re-analysis of a CD4 ChIP-Seq data set with csaw Ryan C. Thompson Salomon Lab The Scripps

RNA-seq Data Analysis Introduction to RNA-seq data analysis June, 2018 1 Luigi Grassi &lt; lg

Calibration des Microroc (II) Alex, Cyril, Giom, Jean, Max 09 Mai 2011, Annecy 1 Reminder 2

Genome-wide supervised ChIP-seq peak detection Toby Dylan Hocking toby.hocking@mail.mcgill.ca

Introduction to ChIP-seq Joanna Krupka CRUK Summer School in Bioinformatics

Overview of the DE analysis Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

RNA-seq: filtering, quality control and visualisation COMBINE RNA-seq Workshop QC and

Signal Transduction Pathway Smorgasbord Ron Bose, MD PhD Biochemistry and Molecular Cell Biology

MOUNT SINAI Commission 01 Inquiry Medicine Hormone Receptor Testing HOSPITAL 600 University

Breast 2019 2019-2020 NAACCR W EBINAR SERIES 2 Q&amp;A Please submit all questions concerning

7/11/2019 I have nothing to disclose Selective Estrogen Receptor Modulators (Raloxifene)

Differential expression analysis John Blischak Instructor DataCamp Differential Expression

Gene Regulation Bioinformatics Wyeth W. Wasserman University of British Columbia www.cisreg.ca

Improving gene signatures by the identification of differentially expressed modules in molecular

transcriptional programs in breast cancer HATICE U. OSMANBEYOGLU RAPHAEL PELOSSOF JACQUELINE F.

RNA-seq Data Analysis Introduction to RNA-seq data analysis June, 2018 1 Luigi Grassi < lg

Breast 2019 2019-2020 NAACCR W EBINAR SERIES 2 Q&A Please submit all questions concerning