The Epigenome Tools 2: ChIP-Seq and Data Analysis, Chongzhi Zang



slide-1
SLIDE 1

The Epigenome Tools 2: ChIP-Seq and Data Analysis

Chongzhi Zang

zang@virginia.edu http://zanglab.com PHS5705: Public Health Genomics March 20, 2017


slide-2
SLIDE 2

Outline

  • Epigenome: basics review
  • ChIP-seq overview
  • ChIP-seq data analysis


slide-3
SLIDE 3

Epigenome

Original figure from ENCODE, Darryl Leja (NHGRI), Ian Dunham (EBI)

[Figure labels: nucleosome, histone]

The epigenome is a multitude of chemical compounds that can tell the genome what to do. The epigenome is made up of chemical compounds and proteins that can attach to DNA and direct such actions as turning genes on or off, controlling the production of proteins in particular cells. -- from genome.gov

slide-4
SLIDE 4

Epigenomic marks

  • DNA methylation
  • Histone marks
    – Covalent modifications
    – Histone variants
  • Chromatin regulators
    – Histone modifying enzymes
    – Chromatin remodeling complexes
  • * Transcription factors

slide-5
SLIDE 5

Histone modifications

  • Nucleosome Core Particles
  • Core Histones: H2A, H2B, H3, H4
  • Covalent modifications on histone tails include: methylation (me), acetylation (ac), phosphorylation …
  • Histone variants
  • Histone modifications are implicated in influencing gene expression.

Notation: H3K4me3

Allis C. et al. Epigenetics. 2006

slide-6
SLIDE 6

Histone modifications associate with regulation of gene expression

Wang, Zang et al. Nat Genet 2008

[Figure: axes labeled "Differential expression log2 (fold-change)" and "Fractions of enhancers". Marks shown: H2AK5ac, H2AK9ac, H2BK5ac, H2BK5me1, H2BK12ac, H2BK20ac, H2BK120ac, H3K4ac, H3K4me1, H3K4me2, H3K4me3, H3K9me1, H3K9me2, H3K9me3, H3K14ac, H3K18ac, H3K23ac, H3K27ac, H3K27me1, H3K27me2, H3K36me1, H3K36me3, H3K79me1, H3K79me2, H3K79me3, H3R2me1, H3R2me2, H4K5ac, H4K8ac, H4K12ac, H4K16ac, H4K91ac, H4K20me1, H4K20me3, H3K27me3, H3K36ac, H3K9ac, H2A.Z]

slide-7
SLIDE 7

“Functions” of histone marks

Table 3. Distinctive Chromatin Features of Genomic Elements

Functional Annotation                 Histone Marks
Promoters                             H3K4me3
Bivalent/Poised Promoter              H3K4me3/H3K27me3
Transcribed Gene Body                 H3K36me3
Enhancer (both active and poised)     H3K4me1
Poised Developmental Enhancer         H3K4me1/H3K27me3
Active Enhancer                       H3K4me1/H3K27ac
Polycomb Repressed Regions            H3K27me3
Heterochromatin                       H3K9me3

Rivera & Ren Cell 2013

slide-8
SLIDE 8

H3K4me3/H3K27me3 Bivalent Domain

[Figure labels: H3K4me3, H3K27me3; Repressed, Remained, Induced, Poised]

From: https://pubs.niaaa.nih.gov/publications/arcr351/77-85.htm

slide-9
SLIDE 9

ChIP-seq: Profiling epigenomes with sequencing

Original figure from ENCODE, Darryl Leja (NHGRI), Ian Dunham (EBI)

[Figure labels: nucleosome, histone, ATAC-seq]

slide-10
SLIDE 10

Published ChIP-seq datasets are skyrocketing
We are entering the Big Data era

[Figure: Number of ChIP-seq datasets on GEO]

Mei et al. Nucleic Acids Research 2016

slide-11
SLIDE 11

Chromatin ImmunoPrecipitation (ChIP)


slide-12
SLIDE 12

Protein-DNA crosslinking in vivo (for TF)


slide-13
SLIDE 13

Chop the chromatin using sonication (TF) or micrococcal nuclease (MNase) digestion (histone)

slide-14
SLIDE 14

Specific factor-targeting antibody


slide-15
SLIDE 15

Immunoprecipitation


slide-16
SLIDE 16

DNA purification


slide-17
SLIDE 17

PCR amplification and sequencing


slide-18
SLIDE 18

ChIP-seq data analysis overview

[Figure: genome browser view of a user-supplied track, chr19:15,308,000-15,309,200 (hg19), 500-base scale bar]

Example raw reads (fastq):

@ILLUMINA-8879DC:231:KK:3:1:1070:945 1:Y:0:
NNNAATACAGTCAGAAACATATCATATTGGAGAATA
+
####################################
@ILLUMINA-8879DC:231:KK:3:1:1153:945 1:Y:0:
NNNAAGCACACAGAAGATAACTAAACAATCAAGTAG
+
####################################
@ILLUMINA-8879DC:231:KK:3:1:1222:945 1:Y:0:
NNNAAGGGTCTTGAGAAGAAATCATTCTGGATGGCA
+
####################################
@ILLUMINA-8879DC:231:KK:3:1:1304:939 1:Y:0:
NNNCCAGGCTCCCGCGATTCTCCTGCCTCAGCTTCT
+
####################################
@ILLUMINA-8879DC:231:KK:3:1:1354:945 1:Y:0:
NNNCTCTTCCTTAGCTAAACTTTCAACTAAGCCAAA
+
####################################
@ILLUMINA-8879DC:231:KK:3:1:1411:932 1:Y:0:
NNNGTAGGACCATTGGCGTTGCGACACAAAAAATTT
+
####################################
@ILLUMINA-8879DC:231:KK:3:1:1496:937 1:Y:0:
NNNTTCATCGGGTTGAGAGTCCCCTTGTTGCATGCA
+
####################################
@ILLUMINA-8879DC:231:KK:3:1:1533:939 1:Y:0:
NNNATTTTCCCGTTCCAGGTCGCAATTTCCGCCGTT
+
####################################
@ILLUMINA-8879DC:231:KK:3:1:1573:940 1:Y:0:
NNNGGGGTGCGCCTTTAGTCCCAGCTACTCAGGAAC
+
####################################

slide-19
SLIDE 19

ChIP-seq data analysis overview

  • Where in the genome do these sequence reads come from? - Sequence alignment and quality control
  • What does the enrichment of sequences mean? - Peak calling
  • What can we learn from these data? - Downstream analysis and integration

slide-20
SLIDE 20

ChIP-seq data analysis: basic processing

  • alignment of each sequence read: bowtie or BWA
  • redundancy control:

[Figure labels: cannot map to the reference genome; can map to multiple loci in the genome; can map to a unique location in the genome; ✗ ✗ ✔ ✔]

Langmead et al. 2009, Zang et al. 2009
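
As a rough illustration of the filtering idea on this slide, the sketch below keeps only confidently mapped reads and caps the number of reads sharing the same 5' position and strand. The tuple layout, the `min_mapq` threshold and the `max_copies` cap are assumptions made for this example; they are not taken from bowtie, BWA, MACS or SICER.

```python
from collections import defaultdict

def filter_reads(reads, min_mapq=10, max_copies=1):
    """Keep confidently mapped reads, allowing at most `max_copies`
    reads per (chromosome, 5' position, strand).

    `reads` holds hypothetical tuples: (chrom, start, end, mapq, strand).
    """
    seen = defaultdict(int)
    kept = []
    for chrom, start, end, mapq, strand in reads:
        # Assumption: unmapped or multi-mapped reads carry a low MAPQ.
        if mapq < min_mapq:
            continue
        five_prime = start if strand == "+" else end
        key = (chrom, five_prime, strand)
        if seen[key] >= max_copies:
            continue  # redundant copy of an already kept read
        seen[key] += 1
        kept.append((chrom, start, end, mapq, strand))
    return kept

reads = [
    ("chr1", 100, 150, 42, "+"),
    ("chr1", 100, 150, 42, "+"),  # exact duplicate
    ("chr1", 300, 350, 0, "-"),   # low MAPQ, treated as not uniquely mapped
    ("chr2", 500, 550, 30, "-"),
]
kept = filter_reads(reads)
```

In practice the thresholds depend on the aligner and the sequencing depth, so both parameters are left adjustable.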

slide-21
SLIDE 21

ChIP-seq data analysis: Peak calling

  • DNA fragment size estimation
  • pile-up profiling
  • Peak/signal detection

[Figure, panel "peak model": percentage of forward tags and reverse tags vs. distance to the middle; the offset between the two strands is labeled d]

[Figure, panel "cross-correlation": correlation vs. shift s]
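
The cross-correlation idea can be sketched in a few lines: shift the forward-strand tag positions by s and count how many land on reverse-strand tag positions; the s with the best overlap is taken as the fragment size estimate. This is a simplified, unnormalized version written for illustration. The function name and the `max_shift` default are assumptions, and real tools work on binned or normalized profiles.

```python
def estimate_shift(forward, reverse, max_shift=400):
    """Return the shift s (in bp) that maximizes the overlap between
    forward-strand 5' positions moved by s and reverse-strand 5' positions."""
    reverse_counts = {}
    for pos in reverse:
        reverse_counts[pos] = reverse_counts.get(pos, 0) + 1

    best_shift, best_score = 0, -1
    for s in range(max_shift + 1):
        # Raw overlap count at this shift (unnormalized cross-correlation).
        score = sum(reverse_counts.get(pos + s, 0) for pos in forward)
        if score > best_score:
            best_shift, best_score = s, score
    return best_shift

# Toy data: every reverse tag sits 150 bp downstream of a forward tag.
forward = [1000, 1010, 1020, 5000, 5005]
reverse = [1150, 1160, 1170, 5150, 5155]
shift = estimate_shift(forward, reverse)
```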

slide-22
SLIDE 22

ChIP-seq data analysis: Peak calling

  • Sharp peaks
    – Transcription factor binding, DNase, ATAC-seq
    – MACS (Zhang, 2008): dynamic background Poisson model
  • Broad peaks
    – Histone modifications, “super-enhancers”; diffuse
    – SICER (Zang, 2009): spatial clustering of localized weak signal and integrative Poisson model

Wang, Zang et al. 2014

[Figure labels: NOTCH1, H3K27ac]

slide-23
SLIDE 23

MACS

  • Model-based Analysis for ChIP-Seq
  • Tag distribution along the genome ~ Poisson distribution (λ_BG = total tag count / genome size)
  • ChIP-seq shows local biases in the genome
    – Chromatin and sequencing bias
    – 200-300bp control windows have too few tags
    – But can look further

Dynamic λ_local = max(λ_BG, [λ_ctrl, λ_1k,] λ_5k, λ_10k)

[Figure: ChIP and Control tracks with 300bp, 1kb, 5kb and 10kb windows]

http://liulab.dfci.harvard.edu/MACS/
Zhang et al, Genome Bio, 2008
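
A minimal sketch of the dynamic background idea: take the largest of the genome-wide rate and the control-derived rates, all expressed as expected tags in the candidate window, then ask how surprising the observed ChIP count is under a Poisson model. The function names, the example numbers and the omission of ChIP/control depth scaling are assumptions of this sketch, not the MACS implementation.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 minus the lower tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def local_lambda(lambda_bg, ctrl_counts, window_bp):
    """Largest expected tag count for a window of `window_bp` bases.

    `ctrl_counts` maps a control window size in bp (e.g. 1000, 5000, 10000)
    to the number of control tags found in it; each count is rescaled to
    the candidate window size before taking the maximum.
    """
    scaled = [count * window_bp / size for size, count in ctrl_counts.items()]
    return max([lambda_bg] + scaled)

# Hypothetical numbers, for illustration only.
total_tags, genome_size, window_bp = 10_000_000, 2_700_000_000, 200
lambda_bg = total_tags / genome_size * window_bp
lam = local_lambda(lambda_bg, {1000: 10, 5000: 30, 10000: 40}, window_bp)
p_value = poisson_sf(12, lam)  # 12 ChIP tags observed in the window
```

Taking the maximum makes the test conservative: a window is only called when it is enriched relative to every background estimate.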

slide-24
SLIDE 24

SICER

  • Spatial-clustering Identification of ChIP-Enriched Regions

Zang et al. Bioinformatics 2009

★★★★★ omictools.com

[Figure scale labels: 10kb, 5kb]
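
The spatial-clustering step can be illustrated as follows: windows with enough tags are "eligible", and neighbouring eligible windows are merged into one island as long as the run of ineligible windows between them does not exceed the gap. This is a simplified sketch; the `min_count` threshold and the `gap` default are assumptions, and the island scoring and significance test of the published method are left out.

```python
def find_islands(window_counts, window_size=200, min_count=2, gap=1):
    """Cluster eligible windows into islands.

    window_counts: tag counts of consecutive windows on one chromosome.
    Returns (start_bp, end_bp, total_tags) per island, allowing up to
    `gap` consecutive ineligible windows inside an island.
    """
    islands = []
    start = last = None  # indices of the first and latest eligible window
    total = 0
    for i, count in enumerate(window_counts):
        if count < min_count:
            continue  # ineligible window
        if start is not None and i - last - 1 <= gap:
            total += count  # extend the current island
        else:
            if start is not None:
                islands.append((start * window_size, (last + 1) * window_size, total))
            start, total = i, count  # open a new island
        last = i
    if start is not None:
        islands.append((start * window_size, (last + 1) * window_size, total))
    return islands

counts = [0, 3, 4, 0, 5, 0, 0, 0, 2, 0]
islands = find_islands(counts)
```

With these toy counts, the single empty window at index 3 is bridged, while the three empty windows after index 4 split the signal into two islands.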

slide-25
SLIDE 25

ChIP-seq peak calling: Parameters

Parameter                            Remarks
Genome                               Species and reference genome version, e.g. hg38, hg18, mm10, mm9
Effective genome rate                Fraction of the mappable genome; varies with species, read length, etc.
DNA fragment size                    Estimated by default; can specify otherwise
Window size                          Data resolution, usually nucleosome periodicity length, i.e. 200bp
Gap size (for SICER only)            Allowable gaps between eligible windows, usually 2 or 3 windows
P-value cut-off                      Threshold for peak calling, from model
False discovery rate (FDR) cut-off   Threshold for peak calling, BH correction from p-value
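
The FDR cut-off relies on the Benjamini-Hochberg (BH) correction. A small sketch of that procedure, with made-up p-values, shows how adjusted values are derived and thresholded; the function name and the 0.05 cut-off are choices made for this example.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values) in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by ascending p
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        qvalues[idx] = running_min
    return qvalues

pvals = [0.001, 0.04, 0.03, 0.5]           # hypothetical peak p-values
qvals = benjamini_hochberg(pvals)
passing = [q <= 0.05 for q in qvals]       # peaks kept at FDR 0.05
```

Note that two peaks with p-values below 0.05 no longer pass once the correction accounts for the number of tests.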

slide-26
SLIDE 26

ChIP-seq data analysis: Review

  • 1. Read mapping (sequence alignment)
  • 2. Peak calling: MACS or SICER
    – 1. QC
    – 2. DNA fragment size estimation (for single-end)
    – 3. Pile-up profile generation
    – 4. Peak/signal detection
  • 3. Downstream analysis/integration

slide-27
SLIDE 27

Data formats

  • fastq: raw sequences
  • BED:

    chr11   10344210   10344260   255
    chr4    76649430   76649480   255   +
    chr3    77858754   77858804   255   +
    chr16   62688333   62688383   255   +
    chr22   33031123   33031173   255

  • SAM/BAM: aligned sequencing reads
  • bedGraph, Wig, bigWig: pile-up profiles for browser visualization
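
To connect the formats, here is a hedged sketch of how aligned reads in BED-style coordinates could be turned into a pile-up profile in bedGraph-style rows: each read is extended from its 5' end to an assumed fragment size, and runs of constant depth are reported. The tuple layout and the `fragment_size` default are assumptions of this example.

```python
from collections import defaultdict

def pileup(reads, fragment_size=150):
    """Build bedGraph-style rows (chrom, start, end, depth) from reads.

    reads: (chrom, start, end, strand) tuples, 0-based half-open as in BED.
    Each read is extended from its 5' end to `fragment_size` bases.
    """
    deltas = defaultdict(lambda: defaultdict(int))
    for chrom, start, end, strand in reads:
        if strand == "+":
            frag_start, frag_end = start, start + fragment_size
        else:
            frag_start, frag_end = max(0, end - fragment_size), end
        deltas[chrom][frag_start] += 1  # depth rises where a fragment starts
        deltas[chrom][frag_end] -= 1    # and falls where it ends

    rows = []
    for chrom in sorted(deltas):
        depth, prev = 0, None
        for pos in sorted(deltas[chrom]):
            if deltas[chrom][pos] == 0:
                continue  # depth does not change here
            if depth > 0:
                rows.append((chrom, prev, pos, depth))
            depth += deltas[chrom][pos]
            prev = pos
    return rows

reads = [("chr4", 100, 150, "+"), ("chr4", 300, 350, "-")]
rows = pileup(reads)
```

Regions with zero depth are simply omitted, which keeps the output sparse.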

slide-28
SLIDE 28

Data flow

Raw sequence reads

  • fastq

-> Bowtie/BWA, with reference genome

Aligned reads

  • BAM/BED

-> MACS/SICER

Profile; Peaks

  • bedGraph/Wig/bigWig
  • BED

slide-29
SLIDE 29

Galaxy: web-interface analysis platform

  • https://usegalaxy.org/


slide-30
SLIDE 30

Run MACS on Cistrome, a Galaxy-based platform

  • http://cistrome.org/ap/


slide-31
SLIDE 31

Run SICER on Galaxy-based platforms

  • http://services.cbib.u-bordeaux.fr/galaxy/


slide-32
SLIDE 32

ChIP-seq: Downstream analysis

  • Data visualization
    – UCSC genome browser: http://genome.ucsc.edu/
    – WashU epigenome browser: http://epigenomegateway.wustl.edu/
    – IGV: http://software.broadinstitute.org/software/igv/
  • Meta analysis
    – CEAS: http://liulab.dfci.harvard.edu/CEAS/
  • Integration with gene expression
    – BETA: http://cistrome.org/BETA/
    – MARGE: http://cistrome.org/MARGE/
  • Integration with other epigenomic data
    – GREAT: http://great.stanford.edu
    – ENCODE SCREEN: http://screen.umassmed.edu/
    – MANCIE: https://cran.r-project.org/package=MANCIE
    – Cistrome DB: http://cistrome.org/db/

slide-33
SLIDE 33

BETA: Binding Expression Target Analysis

  • Regulatory Potential

\[ P(g_i) = \sum_{j \in S(i)} \exp\left( -\frac{\Delta_{ij}}{\lambda} \right) \]

[Figure labels: TSS, i, j]
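
Reading the slide's formula as a sum, over the peaks j near gene i, of an exponentially decaying function of the peak-to-TSS distance, a regulatory potential could be computed as sketched below. The decay length, the distance cut-off and the function name are hypothetical values chosen for illustration; the published BETA scoring may use different constants or normalization.

```python
import math

def regulatory_potential(tss, peak_positions, decay=10_000, max_distance=100_000):
    """Sum exp(-distance / decay) over peaks within `max_distance` of the TSS.

    `decay` plays the role of lambda in the formula and `max_distance`
    defines the peak set S(i); both defaults are assumptions.
    """
    total = 0.0
    for pos in peak_positions:
        distance = abs(pos - tss)
        if distance <= max_distance:
            total += math.exp(-distance / decay)
    return total

# One peak at the TSS, one 10 kb away, one too far to count.
rp = regulatory_potential(50_000, [50_000, 60_000, 500_000])
```

Peaks close to the TSS dominate the score, while distant peaks contribute little or nothing.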

slide-34
SLIDE 34

MARGE: A big data driven, integrative regression and semi-supervised approach for predicting functional enhancers

Wang, Zang et al. Genome Res 2016

[Figure labels: samples; sample selection; enhancer prediction]

slide-35
SLIDE 35

ENCODE


https://www.encodeproject.org/

slide-36
SLIDE 36

Cistrome Data Browser

http://cistrome.org/db/


slide-37
SLIDE 37

ChIP-seq data analysis: Review

  • 1. Read mapping (sequence alignment)
  • 2. Peak calling: MACS or SICER
    – 1. QC
    – 2. DNA fragment size estimation (for single-end)
    – 3. Pile-up profile generation
    – 4. Peak/signal detection
  • 3. Downstream analysis/integration

slide-38
SLIDE 38

Summary

  • ChIP-seq is used to profile epigenomes
  • ChIP-seq data analysis
  • MACS for narrow peaks
  • SICER for broad peaks
  • Online tools and resources


slide-39
SLIDE 39

Further Reading

The cancer epigenome: Concepts, challenges, and therapeutic opportunities Science 17 Mar 2017: Vol. 355, Issue 6330, pp.1147-1152 http://science.sciencemag.org/content/355/6330/1147

[Figure labels: Nucleus; Heterochromatin, Euchromatin; Nucleosome, DNA, Histones; Writer, Eraser, Reader. Therapies targeting: DNA modifications; DNA and histone modifications; Histone modifications. Proteins shown: BRD2/3/4, SMARCA2, SMARCA4, BAZ2A, PB1, BRPF1, ATAD2, L3MBTL3, WDR5, BRD9, BRD7, CBX7, EZH2, MMSET, DOT1L, SETD7, MLL1, PRMT1, PRMT3, PRMT5, SMYD2, G9a/GLP, DNMT, HDAC, JMJD3/UTX, LSD1, CREBBP, EP300, MOZ, IDH1*, IDH2*]

slide-40
SLIDE 40

Thank you very much!

zang@virginia.edu http://zanglab.com
