
Introduction to Chromatin IP sequencing (ChIP-seq) data analysis - PowerPoint PPT Presentation

Introduction to Chromatin IP sequencing (ChIP-seq) data analysis Workshop on ChIP-seq data analysis Stockholm, 7 November 2018 Agata Smialowska NBIS, SciLifeLab, Stockholm University Chromatin state and gene expression PEV Position effect


  1. Introduction to Chromatin IP – sequencing (ChIP-seq) data analysis Workshop on ChIP-seq data analysis Stockholm, 7 November 2018 Agata Smialowska NBIS, SciLifeLab, Stockholm University

  2. Chromatin state and gene expression. PEV: position effect variegation in the Drosophila eye (nature.com). First observed by H. Muller, 1930. Juxtaposition of eye colour genes with heterochromatin results in the “mottled” eye colouration (red and white). Proteins that bind heterochromatin act to “spread” the silencing signal by providing a forward feedback loop: Heterochromatin Protein 1; histone methyltransferase Su(var)3-9; H3K9 methylation.

  3. www.pollev.com/AGATASMIALOW506

  4. Chromatin immunoprecipitation (source: R&D Systems)

  5. Applications General transcription machinery

  6. Applications Promoter-associated transcription factors

  7. Applications Distal enhancers

  8. Applications Histone modifications and variants Activation states Co-factors

  9. ChIP-seq workflow Liu, Pott and Huss, BMC Biology 2010

  10. Workflow of a ChIP-seq study: design study → obtain input chromatin → perform precipitation → construct library → sequence library (wet lab) → bioinformatic analysis

  11. Critical factors • Antibody selection • Proper control sample (input chromatin or mock IP) • Library cloning and sequencing • Algorithm for peak detection • Enough material and biological replicates • Reproducibility in chromatin fragmentation • Cross-linker choice

  12. Experiment design • Sound experimental design: replication, randomisation and blocking (R.A. Fisher, 1935) • In the absence of a proper design, it is essentially impossible to partition biological variation from technical variation • Sequencing depth: depends on the structure of the signal; cannot be linearly scaled to genome size • Single- vs. paired-end reads: PE improves read mapping confidence and gives a direct measure of fragment size, which otherwise has to be modelled or estimated

  13. Experiment design. Ideal design: each sample has a matched input, and the input is sequenced to a depth comparable to the IP sample. [Figure: ChIP and input designs across replicates and library/sequencing batches; a ChIP with an under-sequenced input is marked X, a ChIP with a well-sequenced input is marked ✓.]

  14. Biological replicates and randomisation • Technical replicates are generally a waste of time and money (X) • ≥2 biological replicates for site identification • ≥3 biological replicates for differential binding • Many studies do not account for batch effects: i. time, ii. origin (X) • [Figure: sample → replicates → libraries → sequencing; experiment 1, experiment 2, experiment 3, … each with its own libraries, sequencing, etc., spread over time (✓).]

  15. Importance of sequencing depth: if you need to pool your data, then it is under-sequenced. [Figure: signal from actual replicates compared with pooled data, marked X and ✓; under-sequenced data compared with pooled data.]

  16. Sequencing depth depends on data type. Signal types: point-source, mixed signal, broad signal (classes shown: transcription factors, chromatin remodellers, histone marks, RNA polymerase II). Depths given for human: TF: 20 M; H3K4me3: 25 M; H3K27me3: 40 M; H3K36me3: 35 M; H3K9me3: >55 M; the remaining entries are marked “?”. No clear guidelines for mixed and broad types of peaks. Source: The ENCODE consortium; Jung et al, NAR 2014

  17. • ChIP – sequencing: introduction from a bioinformatics point of view • Principles of analysis of ChIP-seq data • ChIP-seq: downstream analyses • Resources

  18. • ChIP – sequencing: introduction from a bioinformatics point of view • Principles of analysis of ChIP-seq data • ChIP-seq: downstream analyses • Resources

  19. Chromatin = DNA + proteins Park, Nature Rev Genetics, 2009

  20. Data analysis

  21. Workflow of a ChIP-seq study: design study → obtain input chromatin → perform precipitation → construct library → sequence library (wet lab) → library quality control → filter sequences → align sequences → filter alignments → identify peaks / regions of enrichment → assess data quality → understand the data / results → downstream analyses. The analysis is an iterative process.

  22. • ChIP – sequencing: introduction from a bioinformatics point of view • Principles of analysis of ChIP-seq data • ChIP-seq: downstream analyses • Resources

  23. Two questions to address • 1. Did the ChIP part of the ChIP-seq experiment work? Was the enrichment successful? • 2. Where are the binding sites (of the protein of interest)?

  24. Word of caution! ChIP-seq experiments are more unpredictable than RNA-seq! Error sources: chromatin structure; PCR over-amplification; non-specific antibody; other things?

  25. ChIP-seq QC: did the ChIP work? • 1. Inspect the signal (mapped reads, coverage profiles) in genome browser • 2. Compute peak-independent quality metrics (cross correlation, cumulative enrichment) • 3. Assess replicate consistency (correlations between replicates of the same condition)

  26. Tag density distribution; reproducibility; similarity of coverage; signal at known sites; … Spotting inconsistencies: confounding factors; under-sequenced libraries; …

  27. How do I know my data is of good quality? Library complexity Marinov et al, G3 2013

  28. Quality control: tag uniqueness as a library complexity metric. Sequence duplication level > 80% (low-complexity library; FastQC, Babraham Institute). NRF, the non-redundant fraction of reads: number of unique tags / total number of tags.
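The NRF definition on this slide can be illustrated with a minimal sketch. It assumes that a "tag" is identified by its chromosome, 5ʹ position and strand, which is one common convention; the function name and the toy reads are hypothetical and not part of the workshop material.

```python
def non_redundant_fraction(read_positions):
    """Return distinct mapped positions divided by total mapped reads.

    read_positions: iterable of (chrom, five_prime_position, strand) tuples,
    one tuple per mapped read (an assumed, simplified representation).
    """
    reads = list(read_positions)
    if not reads:
        raise ValueError("no mapped reads supplied")
    # A set keeps one copy of each (chrom, position, strand) combination.
    return len(set(reads)) / len(reads)


# Toy data: five reads, two of which are exact positional duplicates.
example_reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),  # duplicate of the read above
    ("chr1", 100, "-"),  # same position, opposite strand: counted separately
    ("chr1", 250, "+"),
    ("chr2", 100, "+"),
]

nrf = non_redundant_fraction(example_reads)
print(f"NRF = {nrf:.2f}")
```

In this toy case four of the five reads are distinct, so the value is close to 1; a low-complexity library would give a much smaller fraction.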

  29. How do I know my data is of good quality? Objective (i.e. peak-independent) metrics to quantify enrichment in ChIP-seq, for TFs in mammalian systems: Normalised Strand Correlation (NSC) and Relative Strand Correlation (RSC). Large-scale quality analysis of published ChIP-seq data sets: 20% low quality; 25% intermediate quality; 30% of inputs have metrics similar to IPs. Marinov et al, G3 2013

  30. Strand cross-correlation: the correlation between the signal of the 5ʹ ends of reads on the (+) and (−) strands is assessed after successive shifts of the reads on the (+) strand, and the point of maximum correlation between the two strands is used as an estimate of fragment length. [Figure: cross-correlation plotted against strand shift.] Carroll et al, Front Genet 2014
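The shifting procedure described on this slide can be sketched on toy data. This is a simplified illustration, not the implementation used by any particular tool: it assumes per-base counts of read 5ʹ ends on one short contig, uses a plain Pearson correlation, and all names and numbers are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def strand_cross_correlation(plus, minus, max_shift):
    """Correlate (+) strand 5' end counts with (-) strand counts at each shift.

    Shifting the (+) strand downstream by `shift` bases means comparing
    plus[i] with minus[i + shift].
    """
    n = len(plus)
    return {
        shift: pearson(plus[: n - shift], minus[shift:])
        for shift in range(max_shift + 1)
    }


# Toy contig of 400 bp with three binding events and 100 bp fragments:
# (+) strand 5' ends pile up at the fragment start, (-) strand 5' ends
# 100 bp downstream at the fragment end.
plus = [0] * 400
minus = [0] * 400
for start, depth in [(50, 5), (120, 3), (200, 4)]:
    plus[start] = depth
    minus[start + 100] = depth

cc = strand_cross_correlation(plus, minus, max_shift=200)
estimated_fragment_length = max(cc, key=cc.get)
print("estimated fragment length:", estimated_fragment_length)
```

On real data the profile is noisier and usually shows a second, smaller peak at the read length, so the maximum should be interpreted together with the full curve.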

  31. Strand cross-correlation: $\mathrm{RSC} = \dfrac{\text{Max CC} - \text{Min CC}}{\text{Phantom CC} - \text{Min CC}}$, $\mathrm{NSC} = \dfrac{\text{Max CC value (fLen)}}{\text{Min CC}}$. Carroll et al, Front Genet 2014
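As a worked example of the NSC and RSC ratios, the sketch below computes both from a small, made-up cross-correlation profile. It assumes that the fragment-length peak is the global maximum of the profile and that the phantom peak sits at the read length; the function name and the numbers are hypothetical.

```python
def nsc_rsc(cc_by_shift, read_length):
    """Compute (fragment length, NSC, RSC) from a {strand shift: CC} mapping.

    Assumptions of this sketch: the fragment-length peak is the global
    maximum, and the phantom peak is the CC value at shift == read_length.
    """
    fragment_length = max(cc_by_shift, key=cc_by_shift.get)
    max_cc = cc_by_shift[fragment_length]
    min_cc = min(cc_by_shift.values())
    phantom_cc = cc_by_shift[read_length]
    if min_cc == 0 or phantom_cc == min_cc:
        raise ValueError("profile is too flat to compute NSC/RSC")
    nsc = max_cc / min_cc                            # Max CC (fLen) / Min CC
    rsc = (max_cc - min_cc) / (phantom_cc - min_cc)  # relative to phantom peak
    return fragment_length, nsc, rsc


# Made-up profile: phantom peak at the 36 bp read length, main peak at 130 bp.
toy_profile = {0: 0.200, 36: 0.210, 100: 0.215, 130: 0.225, 200: 0.205, 500: 0.200}

fragment_length, nsc, rsc = nsc_rsc(toy_profile, read_length=36)
print(f"fLen={fragment_length}, NSC={nsc:.3f}, RSC={rsc:.3f}")
```

Here the main peak rises 0.025 above the minimum and the phantom peak 0.010, giving an RSC of about 2.5, while NSC is 0.225 / 0.200, about 1.125.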

  32. Cross-correlation plots (cross-correlation against strand shift, −500 to 1500).
      ChIP: ENCFF000OWMed.sorted.1.bam.picard.bam: acceptable enrichment; strand shift (105,455); NSC=1.14102, RSC=1.06452, Qtag=1
      ChIP: ENCFF000PMJ.sorted.1.bam: very good enrichment; strand shift (125); NSC=1.21367, RSC=1.39752, Qtag=1
      ChIP: ENCFF000PMG.sorted.1.bam: poor enrichment, possibly undersequenced; strand shift (130); NSC=1.28071, RSC=0.987276, Qtag=0
      Input: ENCFF000PET.sorted.1.bam.picard.bam: no clustering, good input; strand shift (100,265,245); NSC=1.01443, RSC=0.289702, Qtag=-1
      Input: ENCFF000PON.sorted.1.bam.picard.bam: read clustering, bad input; strand shift (90,200,210); NSC=1.0166, RSC=0.92739, Qtag=0

  33. Cumulative enrichment, aka “fingerprint”, is another metric for successful enrichment (http://deeptools.readthedocs.org). Diaz et al, Genome Biol 2012
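The idea behind a cumulative enrichment curve can be sketched as follows. This is a simplified illustration under stated assumptions, not the deepTools implementation: the genome is assumed to be already divided into bins with a read count per bin, and the function name and toy counts are hypothetical.

```python
def fingerprint(bin_counts):
    """Return the cumulative enrichment curve for a list of per-bin read counts.

    Bins are ranked from lowest to highest coverage; each point is
    (fraction of bins considered, fraction of all reads they contain).
    """
    counts = sorted(bin_counts)
    total = sum(counts)
    if total == 0:
        raise ValueError("no reads in any bin")
    curve = []
    running = 0
    for rank, count in enumerate(counts, start=1):
        running += count
        curve.append((rank / len(counts), running / total))
    return curve


# Toy data: an input-like sample with uniform coverage and a ChIP-like
# sample in which one bin out of ten holds most of the reads.
input_like = fingerprint([10] * 10)
chip_like = fingerprint([1] * 9 + [91])

print("input-like, 90% of bins:", input_like[8])
print("ChIP-like,  90% of bins:", chip_like[8])
```

A uniform sample follows the diagonal, whereas in the enriched toy sample 90% of the bins hold only a small share of the reads, so the curve stays low and rises steeply at the end.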

  34. Park, Nature Rev Genetics, 2009

  35. Peak calling: appropriate methodologies depend on data type. Signal types: punctate, mixed signal, broad signal (classes shown: transcription factors, chromatin remodellers, histone marks, RNA polymerase II). Tools listed: SPP; MACS2; MACS2 in broad mode; window-based approaches. This is an active area of algorithm development.

  36. Principle of peak detection: symmetry in reads mapped to opposite DNA strands; computation of enrichment model

  37. Pepke, 2009

  38. Point-source vs. broad peak detection: sequence-specific binding (TFs) vs. distributed binding (histones, RNA Pol II). Wilbanks 2010

  39. Comparison of peak calling algorithms Peak overlap (Ho et al, 2012) > 50 % 20 % Wilbanks 2010

  40. “Hyper-chippable” regions: reads mapped to these regions should be filtered out prior to peak calling. Tracks are available from UCSC for human, mouse, fly and worm: DER, Duke Excluded Regions (11 repeat classes); UHS, Ultra High Signal (open chromatin); DAC, consensus excluded regions. Carroll et al, Front Genet 2014
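Filtering reads against an excluded-region track can be sketched as below. This is a minimal illustration that assumes reads and excluded regions are already available as (chrom, start, end) tuples with half-open coordinates, as in BED files; the function name and the coordinates are hypothetical, and real pipelines typically do this step on BAM and BED files with dedicated tools.

```python
def filter_excluded(reads, excluded_regions):
    """Keep only reads that do not overlap any excluded region.

    reads, excluded_regions: lists of (chrom, start, end) tuples using
    half-open coordinates (start inclusive, end exclusive).
    """
    regions_by_chrom = {}
    for chrom, start, end in excluded_regions:
        regions_by_chrom.setdefault(chrom, []).append((start, end))

    kept = []
    for chrom, start, end in reads:
        overlaps_excluded = any(
            start < region_end and region_start < end
            for region_start, region_end in regions_by_chrom.get(chrom, [])
        )
        if not overlaps_excluded:
            kept.append((chrom, start, end))
    return kept


# Toy data: one excluded region on chr1 and three reads.
excluded = [("chr1", 1000, 2000)]
reads = [
    ("chr1", 100, 150),   # outside the excluded region: kept
    ("chr1", 990, 1040),  # overlaps the excluded region: removed
    ("chr2", 100, 150),   # different chromosome: kept
]

print(filter_excluded(reads, excluded))
```

The linear scan over regions is adequate for a toy example; with genome-scale data an interval index or a sorted sweep would be the usual choice.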

  41. Quality considerations • ChIP-seq quality guidelines from the ENCODE project (relative strand cross-correlation, irreproducible discovery rate) • Antibody validation • Appropriate sequencing depth (depending on genome size and peak type); for the human genome and broad-source peaks, a minimum of 40-50 M reads is required • Experimental replication • Fraction of reads in peaks (FRiP) > 1% • Cross-correlation (correlation of the density of sequences aligned to opposite DNA strands after shifting by the fragment size) • Experimental verification of known binding sites (and of sites not bound, as negative controls)
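The FRiP criterion in the list above can be illustrated with a short sketch. It assumes reads and called peaks are given as (chrom, start, end) tuples with half-open coordinates and counts a read as "in a peak" if it overlaps one by at least one base, which is one possible convention; the function name and the toy intervals are hypothetical.

```python
def fraction_of_reads_in_peaks(reads, peaks):
    """Return the fraction of reads overlapping at least one peak.

    reads, peaks: lists of (chrom, start, end) tuples, half-open coordinates.
    """
    if not reads:
        raise ValueError("no reads supplied")
    peaks_by_chrom = {}
    for chrom, start, end in peaks:
        peaks_by_chrom.setdefault(chrom, []).append((start, end))

    reads_in_peaks = 0
    for chrom, start, end in reads:
        if any(
            start < peak_end and peak_start < end
            for peak_start, peak_end in peaks_by_chrom.get(chrom, [])
        ):
            reads_in_peaks += 1
    return reads_in_peaks / len(reads)


# Toy data: two peaks and four reads, two of which overlap a peak.
toy_peaks = [("chr1", 90, 200), ("chr2", 1000, 1100)]
toy_reads = [
    ("chr1", 100, 150),  # inside the chr1 peak
    ("chr1", 195, 245),  # overlaps the edge of the chr1 peak
    ("chr1", 500, 550),  # outside any peak
    ("chr2", 100, 150),  # outside any peak
]

frip = fraction_of_reads_in_peaks(toy_reads, toy_peaks)
print(f"FRiP = {frip:.2f}")
```

A value computed this way can then be compared against the threshold quoted in the list, keeping in mind that FRiP depends on the peak caller and its settings.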

  42. ChIP-exo: improvement in binding site identification Rhee and Pugh, Cell 2011

  43. Other functional genomics techniques Clifford et al, Nature Rev Genet, 2014
