Introduction to Chromatin IP sequencing (ChIP-seq) data analysis
Workshop on ChIP-seq data analysis, Stockholm, 7 November 2018
Agata Smialowska, NBIS, SciLifeLab, Stockholm University

Chromatin state and gene expression
Position effect variegation (PEV) in the Drosophila eye (image: nature.com). First observed by H. Muller in 1930. Juxtaposition of eye colour genes with heterochromatin results in "mottled" (red and white) eye colouration. Proteins that bind heterochromatin act to "spread" the silencing signal by providing a forward feedback loop: Heterochromatin Protein 1; histone methyltransferase Su(var)3-9; H3K9 methylation.


Chromatin immunoprecipitation RnDsystems

Applications General transcription machinery

Applications Promoter-associated transcription factors

Applications Distal enhancers

Applications Histone modifications and variants Activation states Co-factors

ChIP-seq workflow Liu, Pott and Huss, BMC Biology 2010

Workflow of a ChIP-seq study
• design study
• Wet lab: obtain input chromatin; perform precipitation; construct library; sequence library
• bioinformatic analysis

Critical factors
• Antibody selection
• Proper control sample (input chromatin or mock IP)
• Library cloning and sequencing
• Algorithm for peak detection
• Enough material and biological replicates
• Reproducibility in chromatin fragmentation
• Cross-linker choice

Experiment design
• Sound experimental design: replication, randomisation and blocking (R.A. Fisher, 1935)
• In the absence of a proper design, it is essentially impossible to partition biological variation from technical variation
• Sequencing depth: depends on the structure of the signal; cannot be linearly scaled to genome size
• Single- vs. paired-end reads: PE improves read mapping confidence and gives a direct measure of fragment size, which otherwise has to be modelled or estimated

Experiment design
Ideal design: each sample has a matched input, and the input is sequenced to a depth comparable to that of the IP sample.
[Figure: three ChIP/input designs with library/sequencing replicates. Two are marked ✗, one of them labelled "under-sequenced input"; the design labelled "well-sequenced input" is marked ✓.]

Biological replicates and randomisation
• Technical replicates (one sample, several libraries/sequencing runs) are generally a waste of time and money (✗)
• ≥2 biological replicates for site identification
• ≥3 biological replicates for differential binding
• Many studies do not account for batch effects (✗): i. time, ii. origin
[Figure: experiment 1, experiment 2, experiment 3…, each with its own libraries, sequencing, etc., spread over time (✓).]

Importance of sequencing depth
If you need to pool your data, then it is under-sequenced.
[Figure: signal from actual replicates compared with pooled data for under-sequenced data; panels marked ✗ and ✓.]

Sequencing depth depends on data type
Signal types: point-source, mixed signal, broad signal.
Targets shown: transcription factors, chromatin remodellers, histone marks, RNA polymerase II.
Depths given for human (M = million reads):
• TF: 20 M
• H3K4me3: 25 M
• H3K27me3: 40 M
• H3K36me3: 35 M
• H3K9me3: >55 M
Entries marked "?" on the slide have no recommendation: there are no clear guidelines for mixed and broad types of peaks.
Source: The ENCODE consortium; Jung et al, NAR 2014

• ChIP – sequencing: introduction from a bioinformatics point of view
• Principles of analysis of ChIP-seq data
• ChIP-seq: downstream analyses
• Resources

Chromatin = DNA + proteins Park, Nature Rev Genetics, 2009

Data analysis

Workflow of a ChIP-seq study
• design study
• Wet lab: obtain input chromatin; perform precipitation; construct library; sequence library
• Data analysis (iterative process): library quality control; filter sequences; align sequences; filter alignments; identify peaks / regions of enrichment; assess data quality; understand the data / results; downstream analyses


Two questions to address
1. Did the ChIP part of the ChIP-seq experiment work? Was the enrichment successful?
2. Where are the binding sites (of the protein of interest)?

Word of caution!
ChIP-seq experiments are more unpredictable than RNA-seq!
Error sources: chromatin structure, PCR over-amplification, non-specific antibody, other things?

ChIP-seq QC: did the ChIP work?
1. Inspect the signal (mapped reads, coverage profiles) in a genome browser
2. Compute peak-independent quality metrics (cross-correlation, cumulative enrichment)
3. Assess replicate consistency (correlations between replicates of the same condition)
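Replicate consistency (step 3) can be illustrated with a minimal sketch: bin read start positions into fixed windows and correlate the per-bin counts of two replicates. The toy positions, the genome length and the bin size below are assumptions made for illustration only; real analyses work on genome-wide coverage from alignment files.

```python
from math import sqrt


def bin_counts(read_starts, genome_length, bin_size):
    """Count read start positions in consecutive fixed-size bins."""
    n_bins = (genome_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in read_starts:
        counts[pos // bin_size] += 1
    return counts


def pearson(x, y):
    """Pearson correlation of two equally long sequences of numbers."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant signal
    return cov / sqrt(var_x * var_y)


# Hypothetical read start positions for two replicates on a 100 bp toy genome.
replicate_1 = [5, 12, 14, 18, 55, 90]
replicate_2 = [7, 11, 13, 58, 95, 96]

coverage_1 = bin_counts(replicate_1, genome_length=100, bin_size=10)
coverage_2 = bin_counts(replicate_2, genome_length=100, bin_size=10)
replicate_correlation = pearson(coverage_1, coverage_2)
print(f"replicate correlation: {replicate_correlation:.3f}")
```

Replicates of the same condition are expected to correlate more strongly with each other than with samples from other conditions or with inputs.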

Tag density distribution; reproducibility; similarity of coverage; signal at known sites; …
Spotting inconsistencies: confounding factors; under-sequenced libraries; …

How do I know my data is of good quality? Library complexity Marinov et al, G3 2013

Quality control: tag uniqueness – library complexity metric
• Sequence duplication level > 80% indicates a low-complexity library (FastQC, Babraham Institute)
• NRF, the non-redundant fraction (of reads): number of unique tags / total number of tags
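As a minimal sketch of the NRF definition above, the function below treats a tag as the triple (chromosome, 5ʹ position, strand) and divides the number of distinct tags by the total. The tuple representation and the example reads are assumptions for illustration; they are not taken from the slides.

```python
def non_redundant_fraction(tags):
    """Return NRF = number of distinct tags / total number of tags.

    `tags` is an iterable of hashable items; here each tag is assumed to be
    a (chromosome, five_prime_position, strand) tuple.
    """
    tags = list(tags)
    if not tags:
        raise ValueError("no tags supplied")
    return len(set(tags)) / len(tags)


# Hypothetical mapped reads.
example_tags = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),  # exact duplicate of the previous tag
    ("chr1", 100, "-"),  # same position, opposite strand: counted as distinct
    ("chr2", 500, "+"),
]
nrf = non_redundant_fraction(example_tags)
print(f"NRF = {nrf:.2f}")
```

An NRF close to 1 means most tags are unique; a low NRF points to a low-complexity library.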

How do I know my data is of good quality?
Objective (i.e. peak-independent) metrics to quantify enrichment in ChIP-seq; for TFs in mammalian systems:
• Normalised Strand Correlation (NSC)
• Relative Strand Correlation (RSC)
Large-scale quality analysis of published ChIP-seq data sets:
• 20% low quality
• 25% intermediate quality
• 30% of inputs have metrics similar to IPs
Marinov et al, G3 2013

Strand cross-correlation
The correlation between the signal of the 5ʹ ends of reads on the (+) and (-) strands is assessed after successive shifts of the reads on the (+) strand, and the point of maximum correlation between the two strands is used as an estimate of fragment length.
[Figure: cross-correlation plotted against strand shift.]
Carroll et al, Front Genet 2014
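The procedure described above can be sketched on a toy example: per-position counts of 5ʹ read ends are kept separately for each strand, the (+) strand signal is shifted step by step, and the shift with the highest correlation is taken as the fragment length estimate. The vector representation, the function names and the toy data are assumptions for illustration, not the implementation used by any particular tool.

```python
from math import sqrt


def correlation(x, y):
    """Pearson correlation of two equally long count vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def strand_cross_correlation(plus, minus, max_shift):
    """Correlate the (+) signal shifted downstream by 0..max_shift with (-).

    Shifting the (+) strand by `shift` pairs plus[i] with minus[i + shift].
    Requires max_shift < len(plus) == len(minus).
    """
    n = len(plus)
    return {
        shift: correlation(plus[: n - shift], minus[shift:])
        for shift in range(max_shift + 1)
    }


# Toy genome of 60 positions with two binding events; the (-) strand
# read ends lie 8 positions downstream of the (+) strand read ends.
plus_strand = [0] * 60
minus_strand = [0] * 60
plus_strand[10], plus_strand[30] = 5, 3
minus_strand[18], minus_strand[38] = 5, 3

profile = strand_cross_correlation(plus_strand, minus_strand, max_shift=20)
estimated_fragment_length = max(profile, key=profile.get)
print(estimated_fragment_length)
```

In this toy case the two strand signals coincide exactly once the (+) strand is shifted by 8 positions, so the profile peaks there.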

Strand cross-correlation
$$\mathrm{NSC} = \frac{\text{Max CC value (fLen)}}{\text{Min CC}} \qquad \mathrm{RSC} = \frac{\text{Max CC} - \text{Min CC}}{\text{Phantom CC} - \text{Min CC}}$$
Carroll et al, Front Genet 2014
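A minimal sketch of the two ratios, computed from a cross-correlation profile stored as a mapping from strand shift to correlation value. The profile values, the assumed fragment-length shift (150) and the assumed shift of the phantom peak at read length (36) are made up for illustration.

```python
def nsc_and_rsc(profile, fragment_shift, phantom_shift):
    """Compute NSC and RSC from a {strand_shift: cross_correlation} profile.

    NSC = CC(fragment length) / min CC
    RSC = (CC(fragment length) - min CC) / (CC(phantom peak) - min CC)
    Assumes min CC > 0 and CC(phantom peak) > min CC.
    """
    cc_min = min(profile.values())
    cc_fragment = profile[fragment_shift]
    cc_phantom = profile[phantom_shift]
    nsc = cc_fragment / cc_min
    rsc = (cc_fragment - cc_min) / (cc_phantom - cc_min)
    return nsc, rsc


# Hypothetical profile: phantom peak at shift 36, fragment peak at shift 150.
toy_profile = {0: 0.210, 36: 0.220, 150: 0.250, 400: 0.200}
nsc, rsc = nsc_and_rsc(toy_profile, fragment_shift=150, phantom_shift=36)
print(f"NSC = {nsc:.2f}, RSC = {rsc:.2f}")
```

Both ratios grow as the fragment-length peak rises above the background; RSC additionally compares that peak with the read-length ("phantom") peak.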

Cross-correlation plots
[Figure: cross-correlation plotted against strand shift (−500 to 1500) for three ChIP samples and two input samples.]
ChIP panels:
• ENCFF000OWMed.sorted.1.bam.picard.bam: strand shift (105,455); NSC=1.14102, RSC=1.06452, Qtag=1
• ENCFF000PMJ.sorted.1.bam: strand shift (125); NSC=1.21367, RSC=1.39752, Qtag=1
• ENCFF000PMG.sorted.1.bam: strand shift (130); NSC=1.28071, RSC=0.987276, Qtag=0
ChIP panel annotations: acceptable enrichment; very good enrichment; poor enrichment, possibly undersequenced.
Input panels:
• ENCFF000PET.sorted.1.bam.picard.bam: strand shift (100,265,245); NSC=1.01443, RSC=0.289702, Qtag=−1
• ENCFF000PON.sorted.1.bam.picard.bam: strand shift (90,200,210); NSC=1.0166, RSC=0.92739, Qtag=0
Input panel annotations: read clustering; no clustering; good input; bad input.

Cumulative enrichment, aka "Fingerprint", is another metric for successful enrichment.
http://deeptools.readthedocs.org
Diaz et al, Genome Biol 2012
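The idea behind a fingerprint curve can be sketched as follows: count reads in equally sized genomic bins, rank the bins from lowest to highest count, and accumulate the fraction of all reads as bins are added. The bin counts below are invented for illustration, and the function is a simplified stand-in rather than the deepTools implementation.

```python
from itertools import accumulate


def fingerprint(bin_counts):
    """Return (fraction_of_bins, cumulative_fraction_of_reads) points.

    Bins are ranked from lowest to highest read count before accumulating.
    """
    ordered = sorted(bin_counts)
    total = sum(ordered)
    if total == 0:
        raise ValueError("no reads in any bin")
    n_bins = len(ordered)
    return [
        ((rank + 1) / n_bins, running / total)
        for rank, running in enumerate(accumulate(ordered))
    ]


# Hypothetical data: an input-like sample with uniform coverage and a
# ChIP-like sample where one bin out of ten holds most of the reads.
input_like = fingerprint([10] * 10)
chip_like = fingerprint([1] * 9 + [91])

print(input_like[4])  # half of the bins hold half of the reads
print(chip_like[8])   # 90% of the bins hold only a small share of the reads
```

A uniformly covered sample follows the diagonal, while a well-enriched ChIP stays low for most bins and rises steeply at the end, because a small fraction of the genome carries a large fraction of the reads.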

Park, Nature Rev Genetics, 2009

Peak calling: appropriate methodologies depend on data type
Targets shown: transcription factors, chromatin remodellers, histone marks, RNA polymerase II.
• punctate signal: SPP, MACS2
• mixed signal: -
• broad signal: MACS2 in broad mode, window-based approaches
This is an active area of algorithm development.

Principle of peak detection Symmetry in reads mapped to opposite DNA strands Computation of enrichment model

Pepke, 2009

Point-source vs. broad peak detection Sequence-specific binding (TFs) Distributed binding (histones, RNApol2) Wilbanks 2010

Comparison of peak calling algorithms Peak overlap (Ho et al, 2012) > 50 % 20 % Wilbanks 2010

“Hyper-chippable” regions
Reads mapped to these regions should be filtered out prior to peak calling. Tracks are available from UCSC for human, mouse, fly and worm:
• DER – Duke Excluded Regions (11 repeat classes)
• UHS – Ultra High Signal (open chromatin)
• DAC – consensus excluded regions
Carroll et al, Front Genet 2014
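Filtering against excluded regions amounts to dropping every read that overlaps a listed interval. The sketch below uses half-open (start, end) intervals and made-up coordinates; the data layout and the function names are assumptions for illustration, and a linear scan like this is only suitable for small examples.

```python
def overlaps_any(start, end, regions):
    """True if the half-open interval [start, end) overlaps any region."""
    return any(start < r_end and r_start < end for r_start, r_end in regions)


def remove_reads_in_excluded_regions(reads, excluded):
    """Keep reads that do not overlap an excluded region.

    `reads` is a list of (chromosome, start, end) tuples;
    `excluded` maps chromosome name to a list of (start, end) intervals.
    """
    return [
        read
        for read in reads
        if not overlaps_any(read[1], read[2], excluded.get(read[0], []))
    ]


# Hypothetical reads and a hypothetical excluded region on chr1.
reads = [("chr1", 100, 150), ("chr1", 990, 1040), ("chr2", 10, 60)]
excluded_regions = {"chr1": [(1000, 2000)]}

kept_reads = remove_reads_in_excluded_regions(reads, excluded_regions)
print(kept_reads)
```

The second read is dropped because it extends into the excluded interval; reads on chromosomes without excluded regions are always kept.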

Quality considerations
• ChIP-seq quality guidelines from the ENCODE project (relative strand cross-correlation, irreproducible discovery rate)
• Antibody validation
• Appropriate sequencing depth (depending on genome size and peak type). For the human genome and broad-source peaks, a minimum of 40-50 M reads is required.
• Experimental replication
• Fraction of reads in peaks (FRiP) > 1%
• Cross-correlation (correlation of the density of sequences aligned to opposite DNA strands after shifting by the fragment size)
• Experimental verification of known binding sites (and of sites not bound, as negative controls)
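The FRiP criterion in the list above can be made concrete with a minimal sketch: count the reads that overlap any called peak and divide by the total number of reads. Counting a read as "in a peak" when it overlaps by at least one base is an assumption made here, as are the toy coordinates.

```python
def fraction_of_reads_in_peaks(reads, peaks):
    """Return FRiP = reads overlapping a peak / total reads.

    `reads` is a list of (chromosome, start, end) tuples with half-open
    coordinates; `peaks` maps chromosome name to a list of (start, end).
    """
    if not reads:
        raise ValueError("no reads supplied")
    in_peaks = sum(
        1
        for chrom, start, end in reads
        if any(start < p_end and p_start < end
               for p_start, p_end in peaks.get(chrom, []))
    )
    return in_peaks / len(reads)


# Hypothetical reads and a single hypothetical peak on chr1.
toy_reads = [
    ("chr1", 120, 170),  # inside the peak
    ("chr1", 190, 240),  # partially overlaps the peak
    ("chr1", 300, 350),  # outside the peak
    ("chr2", 120, 170),  # chromosome without peaks
]
toy_peaks = {"chr1": [(100, 200)]}

frip = fraction_of_reads_in_peaks(toy_reads, toy_peaks)
print(f"FRiP = {frip:.0%}")
```

Note that, unlike the strand cross-correlation metrics, FRiP depends on the peak set and therefore on the peak caller and its settings.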

ChIP-exo: improvement in binding site identification Rhee and Pugh, Cell 2011

Other functional genomics techniques Clifford et al, Nature Rev Genet, 2014
