Computational Systems Biology Deep Learning in the Life Sciences




slide-1
SLIDE 1

Computational Systems Biology Deep Learning in the Life Sciences

6.802 6.874 20.390 20.490 HST.506

David Gifford Lecture 10 March 12, 2019

Histone Marks Chromatin 3D Structure

http://mit6874.github.io

1

slide-2
SLIDE 2

Goals for today

  • Chromatin marks and their models
  • Hidden Markov Model (HMM)
  • Deep learning model (DeepSEA)
  • Three-dimensional chromatin structure
  • Inferring it
  • Predicting it
slide-3
SLIDE 3
  • 1. Chromatin marks and biological state

slide-4
SLIDE 4
slide-5
SLIDE 5

Chromatin and Nucleosome Organization

Nucleosome
  • DNA - 146 base pairs, wrapped 1.7 times in a left-handed superhelix
  • Proteins - two copies each of histones H2A, H2B, H3 and H4. Higher organisms have the linker histone H1

Green - H3, yellow - H4, red - H2A, pink - H2B. Dark and light blue - DNA

Histone variants
  • H3 variants: H3.3 - transcribed; CENP-A - centromeres
  • H2A variants: H2A.X - DNA damage; macroH2A - X chromosome; H2A.Z - transcribed regions

Khorasanizadeh (2004)

slide-6
SLIDE 6

Chromatin organization has multiple structural layers and organizes chromatin into “domains”

Both DNA methylation and chromatin marks contain important functional information

slide-7
SLIDE 7

Histone Tail Modifications

Sims III et al., 2003

slide-8
SLIDE 8

H3K4me3 RNA Pol II

We can observe chromatin marks and other genome-associated proteins using ChIP-seq
slide-9
SLIDE 9

Detection of Class I (active) and Class II (poised) enhancers.

a), b) hESC ChIP-seq read density profiles were generated for the indicated histone modifications centered on p300-bound regions in the top 1000 Class I and Class II enhancers, respectively. c) hESC Nanog ChIP-seq shows that Nanog binds at the three predicted Class II enhancer positions near the CDX2 gene

slide-10
SLIDE 10
slide-11
SLIDE 11
slide-12
SLIDE 12
  • 2. Learning chromatin states
slide-13
SLIDE 13

Roadmap Epigenomics Consortium et al. Nature 518, 317-330 (2015) doi:10.1038/nature14248

Can we find latent state to explain observed marks?

slide-14
SLIDE 14

Hidden Markov Models

  • Hidden state x in [1 .. m]. For example, m can be 15
  • Emitted symbol y can be multi-dimensional. For example, histone and accessibility data at genomic locus t
  • One node every 200 bp down the genome
  • Parameters are P(x_{t+1} | x_t), P(y_t | x_t)
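The two parameter sets above are enough to score an observed sequence. The following is a minimal sketch of the scaled forward algorithm for a discrete HMM. The initial distribution `init`, the single-symbol encoding of y, and all names are assumptions made for illustration and do not appear on the slide.

```python
import math

def forward_log_likelihood(obs, init, trans, emit):
    """Return log P(y_1..y_T) for a discrete HMM (scaled forward algorithm).

    obs   : list of observed symbol indices y_t
    init  : init[i]     = P(x_1 = i)               (assumed, not on the slide)
    trans : trans[i][j] = P(x_{t+1} = j | x_t = i)
    emit  : emit[i][y]  = P(y_t = y | x_t = i)
    """
    m = len(init)
    # alpha[i] is proportional to P(y_1..y_t, x_t = i)
    alpha = [init[i] * emit[i][obs[0]] for i in range(m)]
    log_lik = 0.0
    for y in obs[1:]:
        # Rescale to avoid underflow and accumulate the log of the scale factor
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
        # Propagate one step along the genome, then weight by the emission
        alpha = [sum(alpha[i] * trans[i][j] for i in range(m)) * emit[j][y]
                 for j in range(m)]
    return log_lik + math.log(sum(alpha))
```

With m = 15 states and one node per 200 bp bin, the same recursion runs in time proportional to T·m² per chromosome.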

slide-15
SLIDE 15

Hidden Markov Models can be used to create latent states that generate chromatin marks

Hidden Markov Model (ChromHMM)
  • Divide genome into 200 bp windows
  • Hidden state for a 200 bp window models what histone marks are present in the window
  • Unsupervised – resulting states must be interpreted with independent data
  • The number of states is fixed and is a modeling decision
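A segmentation of this kind can be sketched as Viterbi decoding over binarized windows. The sketch below assumes each state emits every mark independently as a Bernoulli variable (present or absent in the window), with all probabilities strictly between 0 and 1. The parameter values and the two-mark example are hypothetical, and learning the parameters is not shown.

```python
import math

def bernoulli_log_emission(marks, p_on):
    """log P(y_t | x_t) for one window: independent Bernoulli per mark."""
    return sum(math.log(p if present else 1.0 - p)
               for present, p in zip(marks, p_on))

def viterbi(windows, init, trans, emit_p):
    """Most probable hidden-state path for a list of binarized windows.

    windows : list of tuples, one 0/1 entry per mark
    emit_p  : emit_p[s][k] = P(mark k present | state s)
    """
    m = len(init)
    score = [math.log(init[s]) + bernoulli_log_emission(windows[0], emit_p[s])
             for s in range(m)]
    back = []
    for marks in windows[1:]:
        prev, score, pointers = score, [], []
        for s in range(m):
            # Best predecessor state for landing in state s at this window
            best = max(range(m), key=lambda r: prev[r] + math.log(trans[r][s]))
            pointers.append(best)
            score.append(prev[best] + math.log(trans[best][s])
                         + bernoulli_log_emission(marks, emit_p[s]))
        back.append(pointers)
    # Trace back from the best final state
    state = max(range(m), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical 2-state, 2-mark example (mark order: active mark, repressive mark)
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit_p = [[0.9, 0.1], [0.1, 0.9]]
windows = [(1, 0), (1, 0), (0, 1), (0, 1)]
print(viterbi(windows, init, trans, emit_p))  # [0, 0, 1, 1]
```

The returned state indices carry no biological meaning on their own, which is why the resulting states must be interpreted with independent data.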

slide-16
SLIDE 16

ChromHMM Model Parameter Visualization.

Hoffman M M et al. Nucl. Acids Res. 2013;41:827-841

P(x_{t+1} | x_t), P(y_t | x_t)

slide-17
SLIDE 17

ChromHMM segment based chromatin states

slide-18
SLIDE 18

Roadmap Epigenomics Consortium et al. Nature 518, 317-330 (2015) doi:10.1038/nature14248

Tissues and cell types profiled in the Roadmap Epigenomics Consortium.

slide-19
SLIDE 19

Roadmap Epigenomics Consortium et al. Nature 518, 317-330 (2015) doi:10.1038/nature14248

slide-20
SLIDE 20
  • 3. Predicting chromatin state from sequence

slide-21
SLIDE 21

DeepSEA learns TF binding, accessibility, and chromatin marks

  • 125 DNase features, 690 TF features, 104 histone features
  • 1000 bp window
  • Three convolution layers with 320, 480 and 960 kernels
  • 17% of genome
  • 690 TF binding profiles for 160 different TFs, 125 DHS profiles and 104 histone-mark profiles
  • Chr 8 and 9 excluded
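The numbers on this slide fix the input width, the kernel counts, and the number of outputs. The shape arithmetic can be sketched as below. The kernel width (8), the pooling width (4), and the placement of pooling after the first two convolution layers are assumptions that are not stated on the slide.

```python
def conv_out(length, kernel):
    """Output length of a stride-1, unpadded 1-D convolution."""
    return length - kernel + 1

def pool_out(length, window):
    """Output length of non-overlapping max pooling."""
    return length // window

WINDOW_BP = 1000                 # from the slide
KERNELS = [320, 480, 960]        # kernels per convolution layer, from the slide
N_TARGETS = 125 + 690 + 104      # DNase + TF + histone features, from the slide
KERNEL_WIDTH = 8                 # ASSUMPTION: not stated on the slide
POOL_WIDTH = 4                   # ASSUMPTION: not stated on the slide

def flattened_size():
    """Number of activations entering the fully connected part."""
    length = WINDOW_BP
    for layer in range(len(KERNELS)):
        length = conv_out(length, KERNEL_WIDTH)
        # ASSUMPTION: pooling follows the first two convolution layers only
        if layer < len(KERNELS) - 1:
            length = pool_out(length, POOL_WIDTH)
    return length * KERNELS[-1]

print(N_TARGETS)         # 919 output features
print(flattened_size())  # 50880 under the assumed widths
```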

slide-22
SLIDE 22

DeepSEA can predict differentially accessible regions based on the SNP allele
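One way to score a SNP with a sequence model is to predict on the reference window and on the same window carrying the alternate allele, then compare the two outputs. The sketch below uses a log-odds difference per feature. The `predict` callable and `toy_predict` are hypothetical stand-ins for a trained model, not the DeepSEA API.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def variant_effect(predict, ref_seq, pos, alt_base):
    """Per-feature log-odds change caused by substituting alt_base at pos.

    predict : callable mapping a DNA string to a list of probabilities
              (hypothetical stand-in for a trained sequence model)
    """
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    p_ref = predict(ref_seq)
    p_alt = predict(alt_seq)
    return [logit(a) - logit(r) for r, a in zip(p_ref, p_alt)]

def toy_predict(seq):
    # Hypothetical model: one "accessibility" output, high when GATA is present
    return [0.8 if "GATA" in seq else 0.2]

# The A>C substitution at position 3 destroys the GATA motif
effect = variant_effect(toy_predict, "CCGATACC", 3, "C")
print(effect[0] < 0)  # True: the alternate allele lowers predicted accessibility
```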

slide-23
SLIDE 23

An ensemble logistic regression classifier based on DeepSEA output can identify regulatory variants

slide-24
SLIDE 24
  • 4. Three-dimensional interactions

slide-25
SLIDE 25

HiC, HiChIP, and ChIA-PET data reveal distal genome interactions

slide-26
SLIDE 26

Enhancers regulate distal target genes by genome looping

Gene, Pol II, Master Regulators, Mediator, Enhancer, Cohesin

slide-27
SLIDE 27

in situ HiC identifies proximal genomic contacts

Cell. 2014 Dec 18; 159(7): 1665–1680.

slide-28
SLIDE 28

in situ HiC reveals interactions at 1 – 5 KB resolution

slide-29
SLIDE 29

Observed interchromosomal interaction distances fall off exponentially

slide-30
SLIDE 30

ChIA-PET identifies protein mediated interactions and improves resolution for those events

slide-31
SLIDE 31

ChIA-PET data are consistent with HiC data

slide-32
SLIDE 32

ChIA-PET discovered enhancer linkages

slide-33
SLIDE 33

Issues with ChIA-PET

  • 1. High false negative rate. Libraries produced are not complex enough to permit further discovery by additional sequencing.
  • 2. Specific to a protein (RNA Polymerase II in our example)
  • 3. Hi-C and derivatives may solve these problems eventually

slide-34
SLIDE 34

HiChIP identifies protein mediated interactions

slide-35
SLIDE 35

HiChIP is more sensitive than ChIA-PET

slide-36
SLIDE 36

HiChIP and ChIA-PET interactions compared: Smc1a antibody (part of the cohesin complex)

slide-37
SLIDE 37

XIST promoter interactions show more support from HiChIP than Hi-C

slide-38
SLIDE 38

HiChIP (Smc1a) is more sensitive than HiC

slide-39
SLIDE 39
  • 5a. Discovering interactions: Anchor-based

slide-40
SLIDE 40

Method 1: Discover anchors using ChIP-seq methods

Given anchors, what is the probability of observing an interaction by chance?

  • c_a ends
  • c_b ends
  • I_{a,b} interactions observed
  • N total ends

slide-41
SLIDE 41

  • c_a ends
  • c_b ends
  • I_{a,b} interactions observed
  • N total ends

What is the probability of observing an interaction by chance?
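The formula on this slide did not survive extraction. One common way to pose the question, assumed here rather than taken from the slide, is a hypergeometric null: if the c_b ends at anchor b were paired at random with the N total ends, of which c_a lie at anchor a, how likely are I_{a,b} or more a-b pairs? A minimal sketch under that assumption:

```python
from math import comb

def hypergeom_pmf(i, n_total, c_a, c_b):
    """P(exactly i of the c_b ends pair with one of the c_a ends) under random pairing."""
    return comb(c_a, i) * comb(n_total - c_a, c_b - i) / comb(n_total, c_b)

def interaction_p_value(i_ab, n_total, c_a, c_b):
    """Upper-tail probability of seeing at least i_ab a-b interactions by chance.

    ASSUMPTION: hypergeometric null model; the slide's own formula is not recoverable.
    """
    upper = min(c_a, c_b)
    return sum(hypergeom_pmf(i, n_total, c_a, c_b)
               for i in range(i_ab, upper + 1))

# Hypothetical counts: two anchors with 40 and 30 ends among 10,000 total ends
print(interaction_p_value(5, 10_000, 40, 30) < 1e-6)  # True: 5 shared PETs is unlikely by chance
```

Small p-values flag anchor pairs whose observed interaction count is hard to explain by random ligation alone.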

slide-42
SLIDE 42

Estimating total events from overlap

Imagine we perform two biological replicates of an experiment and obtain 1000 events in each, of which 900 are identical. We can use a hypergeometric model to infer how many possible events exist (N) given two sample sizes (m and n) and an overlap (k).

Using this model, we predict ~1100 total events
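The replicate example can be checked numerically. The sketch below assumes the standard hypergeometric likelihood P(k | N, m, n) and finds the maximizing N by brute-force search. The search range is an arbitrary choice made for illustration.

```python
from math import lgamma

def log_comb(n, k):
    """log of the binomial coefficient C(n, k), via log-gamma."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def log_likelihood(n_total, m, n, k):
    """log P(overlap = k | N = n_total, sample sizes m and n)."""
    return log_comb(m, k) + log_comb(n_total - m, n - k) - log_comb(n_total, n)

def ml_total_events(m, n, k):
    """Maximum-likelihood N found by brute-force search."""
    n_min = m + n - k                 # at least this many distinct events were seen
    candidates = range(n_min, 20 * n_min)   # ASSUMPTION: arbitrary upper search bound
    return max(candidates, key=lambda n_total: log_likelihood(n_total, m, n, k))

print(ml_total_events(1000, 1000, 900))  # 1111, in line with the ~1100 on the slide
```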

slide-43
SLIDE 43

Approximate closed form solution for total number of events

The ML estimate of N is approximately:

One way to see this is by using the normal approximation of the binomial approximation to the hypergeometric distribution:
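The equations on this slide were lost in extraction. As a hedged reconstruction, the closed form can also be reached by a likelihood-ratio argument, which is a different route from the normal approximation the slide mentions. It assumes the hypergeometric likelihood for overlap k given sample sizes m and n:

```latex
\begin{align*}
L(N) &= P(k \mid N, m, n)
      = \frac{\binom{m}{k}\binom{N-m}{n-k}}{\binom{N}{n}} \\[4pt]
\frac{L(N)}{L(N-1)} &= \frac{(N-m)(N-n)}{N\,(N-m-n+k)} \\[4pt]
\frac{L(N)}{L(N-1)} \ge 1
  &\iff mn \ge kN
  \iff N \le \frac{mn}{k} \\[4pt]
\hat{N} &\approx \frac{mn}{k}
\end{align*}
```

For m = n = 1000 and k = 900 this gives 1000 · 1000 / 900 ≈ 1111, which agrees with the roughly 1100 total events quoted for the replicate example.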

slide-44
SLIDE 44
  • 5b. Discovering interactions: Density-based

slide-45
SLIDE 45

Nucleic Acids Research, 14 February 2019, gkz051, https://doi.org/10.1093/nar/gkz051

Figure 1. CID uses density-based clustering to discover chromatin interactions. (A) ChIA-PET interactions can be discovered as groups of dense arcs connecting two genomic regions. Each arc is a PET. (B) The PETs plotted on a two-dimensional map using the genomic coordinates of the two reads. Each point is a PET. The colors represent the density values, defined as the number of PETs in the neighborhood. The red dashed square represents the size of the neighborhood. (C) The clustering decision graph. Each point is a PET. The points with high density and high delta values are selected as cluster centers. For simplicity, only large clusters are labelled. (D) The read pairs are assigned to the nearest cluster centers. The clusters are labeled as in (C). (E) The clusters are visualized as arcs. The clusters are labeled as in (C) and (D).

Method 2: CID uses density-based clustering to discover chromatin interactions
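The caption's density and delta values match the general density-peak clustering idea, which can be sketched as follows. This is a simplified illustration, not the CID implementation. It uses a circular neighborhood instead of the square in the figure, and the number of centers, the radius, and the toy PET coordinates are all assumptions.

```python
import math

def density_peak_clusters(points, radius, n_centers):
    """Cluster 2-D points (left-read coordinate, right-read coordinate) by density peaks."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Density: number of other PETs within the neighborhood radius
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] <= radius)
           for i in range(n)]
    # Visit points from highest to lowest density (ties broken by index)
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    nearest_higher = [None] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])          # densest point: farthest distance
        else:
            j = min(order[:rank], key=lambda h: dist[i][h])
            delta[i] = dist[i][j]            # distance to nearest denser point
            nearest_higher[i] = j
    # Centers: points with both high density and high delta
    centers = sorted(order, key=lambda i: rho[i] * delta[i], reverse=True)[:n_centers]
    label = [None] * n
    for cluster_id, c in enumerate(centers):
        label[c] = cluster_id
    # Every other point inherits the label of its nearest denser neighbor
    for i in order:
        if label[i] is None:
            label[i] = label[nearest_higher[i]]
    return label

# Hypothetical PETs: two groups of read pairs linking different region pairs
pets = [(100, 500), (102, 503), (98, 498), (101, 499),
        (1000, 2000), (1003, 2001), (998, 1997)]
print(density_peak_clusters(pets, radius=10, n_centers=2))  # [0, 0, 0, 0, 1, 1, 1]
```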

slide-46
SLIDE 46

Method 2: Density cluster interaction origins

We use a three-component mixture model to describe the conditional distribution of PET counts from all the PET clusters. One component represents true interaction PET clusters (TiPC), and the other two represent random collision PET clusters (RcPC) and random ligation PET clusters (RlPC), respectively. The TiPC and RcPC models include the a,b distance between clusters

https://academic.oup.com/bioinformatics/article/31/23/3
https://academic.oup.com/nar/advance-article/doi/10.1093/nar/g

slide-47
SLIDE 47

Cluster interaction origins

slide-48
SLIDE 48

Jaccard coefficient – measure of set similarity
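The Jaccard coefficient of two sets is the size of their intersection divided by the size of their union. A minimal sketch follows. Representing interaction calls as hashable identifiers, and returning 1.0 for two empty sets, are conventions assumed for illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections of hashable items."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0   # convention chosen here: two empty sets count as identical
    return len(a & b) / len(union)

# Hypothetical interaction calls from two replicates, identified by anchor pairs
rep1 = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
rep2 = {("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
print(jaccard(rep1, rep2))  # 0.5
```

Applied to two replicates of 1000 events that share 900, the coefficient is 900 / 1100, roughly 0.82.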

slide-49
SLIDE 49

CID is more reproducible and sensitive

slide-50
SLIDE 50
  • 6. Predicting enhancer-promoter interactions

slide-51
SLIDE 51

https://www.nature.com/articles/ng.3539

TargetFinder uses multiple data types to predict HiC interactions

slide-52
SLIDE 52

TargetFinder Training Data

slide-53
SLIDE 53

TargetFinder – Ratio of the CTCF and RAD21 ChIP-seq signals occurring within interacting enhancers and non-interacting enhancers

slide-54
SLIDE 54

TargetFinder – Enrichment of signals at transcription start sites (TSS)

Dark – interacting; Light – non-interacting

slide-55
SLIDE 55

TargetFinder – Performance

Features for enhancers and promoters only (E/P), extended enhancers and promoters (EE/P), and enhancers and promoters plus the windows bet

slide-56
SLIDE 56

Deep learning network for predicting enhancer-promoter interactions

slide-57
SLIDE 57

Sequence – 2 kb sequence windows

Chromatin – 10 kb / 200 bp bins

DNase-seq, H3K4me1, H3K4me2, H3K27ac, H3K27me3, H3K36me3, and H3K9me3

Sequence and chromatin anchor networks outputs are concatenated

slide-58
SLIDE 58

Enhancer promoter prediction performance with varying feature sets