Lecture 6: Regulatory genomics (Gene regulation, chromatin accessibility, DNA regulatory code)


slide-1
SLIDE 1

6.874, 6.802, 20.390, 20.490, HST.506 Computational Systems Biology Deep Learning in the Life Sciences

Lecture 6: Regulatory genomics

Gene regulation, chromatin accessibility, DNA regulatory code

  • Prof. Manolis Kellis

http://mit6874.github.io

Slides credit: 6.047, Anshul Kundaje, David Gifford

slide-2
SLIDE 2

Deep Learning for Regulatory Genomics

  • 1. Biological foundations: Building blocks of Gene Regulation

– Gene regulation: Cell diversity, Epigenomics, Regulators (TFs), Motifs, Disease role – Probing gene regulation: TFs/histones: ChIP-seq, Accessibility: DNase/ATAC-seq

  • 2. Classical methods for Regulatory Genomics and Motif Discovery

– Enrichment-based motif discovery: Expectation Maximization, Gibbs Sampling – Experimental: PBMs, SELEX. Comparative genomics: Evolutionary conservation.

  • 3. Regulatory Genomics CNNs (Convolutional Neural Networks): Foundations

– Key idea: pixels  DNA letters. Patches/filters  Motifs. Higher  combinations – Learning convolutional filters  Motif discovery. Applying them  Motif matches

  • 4. Regulatory Genomics CNNs/RNNs in Practice: Diverse Architectures

– DeepBind: Learn motifs, use in (shallow) fully-connected layer, mutation impact – DeepSea: Train model directly on mutational impact prediction – Basset: Multi-task DNase prediction in 164 cell types, reuse/learn motifs – ChromPuter: Multi-task prediction of different TFs, reuse partner motifs – DeepLIFT: Model interpretation based on neuron activation properties – DanQ: Recurrent Neural Network for sequential data analysis

slide-3
SLIDE 3
  • 1a. Basics of gene regulation
slide-4
SLIDE 4

One Genome – Many Cell Types


ACCAGTTACGACGGTCA GGGTACTGATACCCCAA ACCGTTGACCGCATTTA CAGACGGGGTTTGGGTT TTGCCCCACACAGGTAC GTTAGCTACTGGTTTAG CAATTTACCGTTACAAC GTTTACAGGGTTACGGT TGGGATTTGAAAAAAAG TTTGAGTTGGTTTTTTC ACGGTAGAACGTACCGT TACCAGTA

Image Source wikipedia

slide-5
SLIDE 5

DNA packaging

  • Why packaging

– DNA is very long – Cell is very small

  • Compression

– Chromosome is 50,000 times shorter than extended DNA

  • Using the DNA

– Before a piece of DNA is used for anything, this compact structure must open locally

  • Now emerging:

– Role of accessibility – State in chromatin itself – Role of 3D interactions

slide-6
SLIDE 6

Combinations of marks encode epigenomic state

  • 100s of known modifications, many new still emerging
  • Systematic mapping using ChIP-, Bisulfite-, DNase-Seq
(Figure: genome browser tracks of marks grouped by element type)

Promoters: H3K4me3, H3K9ac, DNase | Enhancers: H3K4me1, H3K27ac, DNase | Transcribed: H3K36me3, H3K79me2, H4K20me1 | Repressed: H3K9me3, H3K27me3, DNA methylation

Enhancers Promoters Transcribed Repressed

slide-7
SLIDE 7

Summarize multiple marks into chromatin states

ChromHMM: multi-variate hidden Markov model

WashU Epigenome Browser

30+ epigenomic marks → chromatin state track summary

slide-8
SLIDE 8

Promoter region Enhancer region Protein-coding sequence

Transcription factors control activation of cell-type-specific promoters and enhancers

slide-9
SLIDE 9

TFs use DNA-binding domains to recognize specific DNA sequences in the genome

DNA-binding domain of Engrailed. “Logo” or “motif”: TAATTA, CACGTG, AGATAAGA, TCATTA

slide-10
SLIDE 10

Regulator structure → recognized motifs

  • Proteins ‘feel’ DNA
  • Read chemical properties of bases
  • Do NOT open DNA (no base complementarity)
  • 3D topology dictates specificity
  • Fully constrained positions: every atom matters
  • “Ambiguous / degenerate” positions: loosely contacted
  • Other types of recognition:
  • MicroRNAs: complementarity
  • Nucleosomes: GC content
  • RNAs: structure / sequence combination
slide-11
SLIDE 11

Motifs summarize TF sequence specificity

  • Summarize information
  • Integrate many positions
  • Measure of information
  • Distinguish motif vs. motif instance
  • Assumptions: Independence, Fixed spacing

slide-12
SLIDE 12

Regulatory motifs at all levels of pre-/post-transcriptional regulation

  • The parts list: ~20-30k genes
  • Protein-coding genes, RNA genes (tRNA, microRNA, snRNA)
  • The circuitry: constructs controlling gene usage
  • Enhancers, promoters, splicing, post-transcriptional motifs
  • The regulatory code, complications:
  • Combinatorial coding of ‘unique tags’
  • Data-centric encoding of addresses
  • Overlaid with ‘memory’ marks
  • Large-scale on/off states
  • Modulation of the large-scale coding
  • Post-transcriptional and post-translational information
  • Today: discovering motifs in co-regulated promoters and de novo motif discovery & target identification

Enhancer regions (Where in the body? When in time? Which variants?) · Promoter motifs · Splicing signals (Which subsets?) · Motifs at RNA level

slide-13
SLIDE 13

Disrupted motif at the heart of FTO obesity locus

Obese vs. Lean. Strongest association with obesity: C-to-T disruption of AT-rich regulatory motif. Restoring motif restores thermogenesis.

slide-14
SLIDE 14
  • 1b. Technologies for probing gene regulation
slide-15
SLIDE 15

Bar-coded multiplexed sequencing

Mapping regulator binding: ChIP-seq

(Chromatin immunoprecipitation followed by sequencing) TF=transcription factor

antibody

slide-16
SLIDE 16

ChIP-chip and ChIP-Seq technology overview

Image adapted from Wikipedia

Modification-specific antibodies → Chromatin Immuno-Precipitation followed by: ChIP-chip (array hybridization) or ChIP-Seq (massively parallel next-gen sequencing)

slide-17
SLIDE 17

ChIP-Seq Histone Modifications: What the raw data looks like

  • Each sequence tag is 30 base pairs long
  • Tags are mapped to unique positions in the ~3 billion

base reference genome

  • Number of reads depends on sequencing depth.

Typically on the order of 10 million mapped reads.


slide-18
SLIDE 18

Chromatin accessibility can reveal TF binding

Sherwood RI, et al. “Discovery of directional and nondirectional pioneer transcription factors by modeling DNase profile magnitude and shape.” Nat. Biotech 2014.

slide-19
SLIDE 19

DNase-seq reveals genome protection profiles
slide-20
SLIDE 20

ATAC-seq

slide-21
SLIDE 21

ATAC-seq and DNase-seq are not identical

GM12878, Chr. 14; each point is accessibility in a 2 kb window

Hashimoto TB, et al. “A Synergistic DNA Logic Predicts Genome-wide Chromatin Accessibility.” Genome Research 2016.

slide-22
SLIDE 22

DNase-seq is less defined evidence than ChIP-seq

ChIP-seq reports TF-binding regions (specifically); DNase-seq reports proximal, not necessarily bound, locations (noisily)

slide-23
SLIDE 23

Bound factors leave distinct DNase-seq profiles

CTCF, Brg, Oct4, Zfx, Esrrb motif profiles; aggregate CTCF vs. individual CTCF

Individual binding site prediction is difficult

slide-24
SLIDE 24

Motifs can predict TF binding

~650,000 TF motifs; ~50,000 binding sites for a typical TF

Binding sites change across time
slide-25
SLIDE 25

Chromatin accessibility influences transcription factor binding

  • Modeling accessibility profiles yields binding predictions and pioneer factor discovery
  • Asymmetric accessibility is induced by directional pioneers
  • The binding of settler factors can be enabled by proximal pioneer factor binding

Sherwood RI, et al. “Discovery of directional and nondirectional pioneer transcription factors by modeling DNase profile magnitude and shape.” Nat. Biotech 2014.

slide-26
SLIDE 26

Deep Learning for Regulatory Genomics

  • 1. Biological foundations: Building blocks of Gene Regulation

– Gene regulation: Cell diversity, Epigenomics, Regulators (TFs), Motifs, Disease role – Probing gene regulation: TFs/histones: ChIP-seq, Accessibility: DNase/ATAC-seq

  • 2. Classical methods for Regulatory Genomics and Motif Discovery

– Enrichment-based motif discovery: Expectation Maximization, Gibbs Sampling – Experimental: PBMs, SELEX. Comparative genomics: Evolutionary conservation.

  • 3. Regulatory Genomics CNNs (Convolutional Neural Networks): Foundations

– Key idea: pixels  DNA letters. Patches/filters  Motifs. Higher  combinations – Learning convolutional filters  Motif discovery. Applying them  Motif matches

  • 4. Regulatory Genomics CNNs/RNNs in Practice: Diverse Architectures

– DeepBind: Learn motifs, use in (shallow) fully-connected layer, mutation impact – DeepSea: Train model directly on mutational impact prediction – Basset: Multi-task DNase prediction in 164 cell types, reuse/learn motifs – ChromPuter: Multi-task prediction of different TFs, reuse partner motifs – DeepLIFT: Model interpretation based on neuron activation properties – DanQ: Recurrent Neural Network for sequential data analysis

slide-27
SLIDE 27
  • 2. Classical regulatory genomics

(before Deep Learning)

slide-28
SLIDE 28

Enrichment-based discovery methods

Given a set of co-regulated/functionally related genes, find common motifs in their promoter regions

  • Align the promoters to each other using local alignment
  • Use expert knowledge for what motifs should look like
  • Find ‘median’ string by enumeration (motif/sample driven)
  • Start with conserved blocks in the upstream regions
slide-29
SLIDE 29

Starting positions → Motif matrix

Sequence positions 1-8 (example profile matrix; each column sums to 1):

Pos:  1    2    3    4    5    6    7    8
A    0.1  0.1  0.3  0.2  0.1  0.3  0.1  0.3
C    0.1  0.5  0.2  0.1  0.1  0.2  0.1  0.2
G    0.6  0.2  0.2  0.5  0.6  0.1  0.7  0.2
T    0.2  0.2  0.3  0.2  0.2  0.4  0.1  0.3

  • given aligned sequences → easy to compute profile matrix (shared motif)
  • given profile matrix → easy to find starting position probabilities

Key idea: iterative procedure for estimating both, given uncertainty (a learning problem with hidden variables, the starting positions): expectation maximization
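The iterative estimation the slide sketches can be written directly in numpy. This is a toy "one occurrence per sequence" EM, not MEME itself; the pseudocounts and random initialization are illustrative choices:

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    X = np.zeros((4, len(seq)))
    for j, b in enumerate(seq):
        X[ALPHABET.index(b), j] = 1.0
    return X

def em_motif(seqs, k, n_iter=50, seed=0):
    """Alternate between (E) start-position posteriors given the profile
    and (M) profile re-estimation from the soft alignments."""
    rng = np.random.default_rng(seed)
    profile = rng.dirichlet(np.ones(4), size=k).T   # 4 x k, columns sum to 1
    Xs = [one_hot(s) for s in seqs]
    for _ in range(n_iter):
        counts = np.full((4, k), 0.1)               # pseudocounts
        for X in Xs:
            n_starts = X.shape[1] - k + 1
            # E-step: likelihood of the motif at each start position
            lik = np.array([np.prod(np.sum(profile * X[:, i:i+k], axis=0))
                            for i in range(n_starts)])
            post = lik / lik.sum()
            # M-step contribution: soft counts weighted by posteriors
            for i, w in enumerate(post):
                counts += w * X[:, i:i+k]
        profile = counts / counts.sum(axis=0)
    return profile

profile = em_motif(["ACGTGCAGTT", "TTGCAGACGT", "GCAGTTTTAC"], k=4, n_iter=20)
```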

slide-30
SLIDE 30

Experimental factor-centric discovery of motifs

SELEX (Systematic Evolution of Ligands by EXponential enrichment; Klug & Famulok, 1994); DIP-chip (DNA immunoprecipitation with microarray detection; Liu et al., 2005); PBMs (protein-binding microarrays; Mukherjee et al., 2004): double-stranded DNA arrays

slide-31
SLIDE 31

Approaches to regulatory motif discovery

  • Expectation Maximization (e.g. MEME)

– Iteratively refine positions / motif profile

  • Gibbs Sampling (e.g. AlignACE)

– Iteratively sample positions / motif profile

  • Enumeration with wildcards (e.g. Weeder)

– Allows global enrichment/background score

  • Peak-height correlation (e.g. MatrixREDUCE)

– Alternative to cutoff-based approach

  • Conservation-based discovery (e.g. MCS)

– Genome-wide score, up-/down-stream bias

  • Protein Domains (e.g. PBMs, SELEX)

– In vitro motif identification, seq-/array-based

Region-based motif discovery Genome-wide In vitro / trans
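For contrast with EM's soft assignments, a minimal Gibbs sampler (in the spirit of AlignACE-style methods, not its exact algorithm) resamples one sequence's motif position at a time. The sketch assumes ACGT-only sequences and illustrative pseudocounts:

```python
import numpy as np

ALPHABET = "ACGT"

def gibbs_motif(seqs, k, n_iter=100, seed=0):
    """Hold out one sequence, build a profile from the current motif
    positions in the others, then resample the held-out sequence's
    start position in proportion to the profile likelihoods."""
    rng = np.random.default_rng(seed)
    pos = [rng.integers(0, len(s) - k + 1) for s in seqs]
    for _ in range(n_iter):
        for h in range(len(seqs)):
            # profile from all sequences except the held-out one
            counts = np.full((4, k), 0.5)           # pseudocounts
            for i, s in enumerate(seqs):
                if i == h:
                    continue
                for j, b in enumerate(s[pos[i]:pos[i] + k]):
                    counts[ALPHABET.index(b), j] += 1
            profile = counts / counts.sum(axis=0)
            # sample a new start for the held-out sequence
            s = seqs[h]
            lik = np.array([np.prod([profile[ALPHABET.index(b), j]
                                     for j, b in enumerate(s[i:i + k])])
                            for i in range(len(s) - k + 1)])
            pos[h] = rng.choice(len(lik), p=lik / lik.sum())
    return pos

pos = gibbs_motif(["ACGTGCAGTT", "TTGCAGACGT", "GCAGTTAATC"], k=4, n_iter=20)
```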

slide-32
SLIDE 32

Deep Learning for Regulatory Genomics

  • 1. Biological foundations: Building blocks of Gene Regulation

– Gene regulation: Cell diversity, Epigenomics, Regulators (TFs), Motifs, Disease role – Probing gene regulation: TFs/histones: ChIP-seq, Accessibility: DNase/ATAC-seq

  • 2. Classical methods for Regulatory Genomics and Motif Discovery

– Enrichment-based motif discovery: Expectation Maximization, Gibbs Sampling – Experimental: PBMs, SELEX. Comparative genomics: Evolutionary conservation.

  • 3. Regulatory Genomics CNNs (Convolutional Neural Networks): Foundations

– Key idea: pixels  DNA letters. Patches/filters  Motifs. Higher  combinations – Learning convolutional filters  Motif discovery. Applying them  Motif matches

  • 4. Regulatory Genomics CNNs/RNNs in Practice: Diverse Architectures

– DeepBind: Learn motifs, use in (shallow) fully-connected layer, mutation impact – DeepSea: Train model directly on mutational impact prediction – Basset: Multi-task DNase prediction in 164 cell types, reuse/learn motifs – ChromPuter: Multi-task prediction of different TFs, reuse partner motifs – DeepLIFT: Model interpretation based on neuron activation properties – DanQ: Recurrent Neural Network for sequential data analysis

slide-33
SLIDE 33

Deep convolutional neural network

Input sequence: G C A T T A C C G A T A A

Convolutional layer (same color = shared weights). Conv Layer 1: kernel width = 4, stride = 2*, num filters / num channels = 3, total neurons = 15. Later conv layers operate on outputs of previous conv layers.

Maxpooling layers take the max over sets of conv layer outputs. Maxpooling layer: pool width = 2, stride = 1. Conv Layer 2: kernel width = 3, stride = 1, num filters / num channels = 2, total neurons = 6.

Typically followed by one or more fully connected layers. Sigmoid activations give P(TF = bound | X).

*for genomics, a stride of 1 for conv layers is recommended

slide-34
SLIDE 34
  • 3a. CNNs for Regulatory Genomics Foundations

(Low-level features)

slide-35
SLIDE 35

An example of using CNN to model DNA sequence

NNNATGCAGCANNN

Rows A, T, G, C: matrix representation of DNA sequence (darker = stronger). Representing DNA sequence as a 2D matrix:
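The matrix representation above is just one-hot encoding; a minimal sketch (rows ordered A, C, G, T here, with ambiguous 'N' bases left as all-zero columns):

```python
import numpy as np

def one_hot(seq):
    """Encode a DNA string as a 4 x L binary matrix (rows A, C, G, T)."""
    alphabet = "ACGT"
    mat = np.zeros((4, len(seq)))
    for j, base in enumerate(seq):
        if base in alphabet:          # 'N' columns stay all-zero
            mat[alphabet.index(base), j] = 1.0
    return mat

X = one_hot("ATGCAGCA")
```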

slide-36
SLIDE 36

Convolution – extracting invariant feature

Applying a 4 bp sequence filter along the DNA matrix:

Filter activity on ATGCAGCA, shown at the 1st position, then the 3rd position (yellow = high activity; blue = low activity)

slide-37
SLIDE 37

Convolution – extracting invariant feature

Convolution module

NNNATGCAGCANNN

Convolution filters applied to the matrix representation of the DNA sequence (darker = stronger) → filtered signal for ATGCAGCA → rectification (denoising) → max pooling

Rectification = ignore signals below some threshold. Pooling = summary of each channel by max or average.

slide-38
SLIDE 38

Prediction using extracted features map

Convolution module → Prediction module. Trained against ChIP-seq, PBM, SELEX experiments on DNA sequence.

For each individual motif (e.g. GCRC, TGRT, ATRc): match filter → max. Higher-level combinations (e.g. GCRC|ATRc) → affinity.

[Park and Kellis, 2015]

slide-39
SLIDE 39

Key properties of regulatory sequence

TRANSCRIPTION FACTOR BINDING Regulatory proteins called transcription factors (TFs) bind to high affinity sequence patterns (motifs) in regulatory DNA Transcription factor Regulatory DNA sequences Motif

slide-40
SLIDE 40

Sequence motifs: PWM

Set of aligned sequences bound by the TF: GGATAA, CGATAA, CGATAT, GGATAT

Position weight matrix (PWM; blank cells in the slide = 0):

Pos:  1    2    3    4    5    6
A     0    0    1    0    1   0.5
C    0.5   0    0    0    0    0
G    0.5   1    0    0    0    0
T     0    0    0    1    0   0.5

PWM logo (y-axis in bits)

https://en.wikipedia.org/wiki/Sequence_logo
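The PWM is simply the column-wise base frequencies of the aligned bound sequences; a minimal sketch reproducing the table above:

```python
import numpy as np

def pwm_from_alignment(seqs):
    """Column-wise base frequencies (rows A, C, G, T) from aligned bound sequences."""
    alphabet = "ACGT"
    L = len(seqs[0])
    counts = np.zeros((4, L))
    for s in seqs:
        for j, base in enumerate(s):
            counts[alphabet.index(base), j] += 1
    return counts / len(seqs)

pwm = pwm_from_alignment(["GGATAA", "CGATAA", "CGATAT", "GGATAT"])
```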

slide-41
SLIDE 41

Sequence motifs: PSSM

Position-specific scoring matrix (PSSM) and PSSM logo:

Pos:   1     2     3     4     5     6
A    -5.7  -3.2   3.7  -3.2   3.7   0.6
C     0.5  -3.2  -3.2  -3.2  -3.2  -5.7
G     0.5   3.7  -3.2  -3.2  -3.2  -5.7
T    -5.7  -3.2  -3.2   3.7  -3.2   0.5

Accounting for genomic background nucleotide distribution
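A PSSM converts PWM probabilities into log-odds scores against the background distribution. A sketch assuming a uniform 0.25 background; the slide's exact values depend on its own pseudocount and background choices:

```python
import numpy as np

def pssm_from_pwm(pwm, background=0.25, pseudocount=0.01):
    """Log-odds score of each base vs. the genomic background.
    The pseudocount keeps log(0) out of the matrix."""
    p = (pwm + pseudocount) / (1.0 + 4 * pseudocount)
    return np.log2(p / background)

# toy 3-position PWM (rows A, C, G, T)
pwm = np.array([[0.0, 0.0, 1.0],
                [0.5, 0.0, 0.0],
                [0.5, 1.0, 0.0],
                [0.0, 0.0, 0.0]])
pssm = pssm_from_pwm(pwm)
```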

slide-42
SLIDE 42

Scoring a sequence with a motif PSSM

Input sequence: G C A T T A C C G A T A A → one-hot encoding (X)

Scoring weights W (the PSSM parameters):

Pos:   1     2     3     4     5     6
A    -5.7  -3.2   3.7  -3.2   3.7   0.6
C     0.5  -3.2  -3.2  -3.2  -3.2  -5.7
G     0.5   3.7  -3.2  -3.2  -3.2  -5.7
T    -5.7  -3.2  -3.2   3.7  -3.2   0.5

slide-43
SLIDE 43

Convolution: Scoring a sequence with a PSSM

Input sequence: G C A T T A C C G A T A A → one-hot encoding (X), scoring weights W (PSSM above)

Motif match scores sum(W * x), computed window by window: -5.4, …

slide-44
SLIDE 44

Convolution (continued)

Input sequence: G C A T T A C C G A T A A → one-hot encoding (X), scoring weights W (PSSM above)

Motif match scores sum(W * x) so far: -5.4, 2.0, …

slide-45
SLIDE 45

Convolution (continued)

Input sequence: G C A T T A C C G A T A A → one-hot encoding (X), scoring weights W (PSSM above)

Motif match scores sum(W * x): -2.2, -5.4, 2.0, -4.3, -24, -17, -18, -11, -12, 16, -5.5, -8.5, -5.2
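The window-by-window scores above are exactly a 1-D convolution of the one-hot sequence with the PSSM. Scanning GCATTACCGATAA with the slide's W recovers the strong hit of about 16 at the CGATAA window (the unpadded scan yields 8 windows rather than the slide's padded 13):

```python
import numpy as np

W = np.array([  # PSSM rows A, C, G, T; columns = motif positions 1-6
    [-5.7, -3.2,  3.7, -3.2,  3.7,  0.6],
    [ 0.5, -3.2, -3.2, -3.2, -3.2, -5.7],
    [ 0.5,  3.7, -3.2, -3.2, -3.2, -5.7],
    [-5.7, -3.2, -3.2,  3.7, -3.2,  0.5],
])

def one_hot(seq):
    """4 x L one-hot matrix, rows A, C, G, T."""
    alphabet = "ACGT"
    mat = np.zeros((4, len(seq)))
    for j, base in enumerate(seq):
        mat[alphabet.index(base), j] = 1.0
    return mat

def pssm_scan(X, W):
    """Slide the motif over every window: score_i = sum(W * X[:, i:i+k])."""
    k = W.shape[1]
    L = X.shape[1]
    return np.array([np.sum(W * X[:, i:i+k]) for i in range(L - k + 1)])

scores = pssm_scan(one_hot("GCATTACCGATAA"), W)
```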

slide-46
SLIDE 46

Thresholding scores

Input sequence: G C A T T A C C G A T A A → one-hot encoding (X), scoring weights W (PSSM above)

Motif match scores W*x: -2.2, -5.4, 2.0, -4.3, -24, -17, -18, -11, -12, 16, -5.5, -8.5, -5.2

Thresholded motif scores max(0, W*x): all zero except 2.0 and 16
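Rectification and pooling on these scores are one-liners; using the slide's score vector:

```python
import numpy as np

def relu(x):
    """Rectification: zero out scores below the threshold (here 0)."""
    return np.maximum(0.0, x)

def max_pool(x, width, stride):
    """Summarize each window of conv outputs by its maximum."""
    return np.array([x[i:i+width].max()
                     for i in range(0, len(x) - width + 1, stride)])

scores = np.array([-2.2, -5.4, 2.0, -4.3, -24, -17, -18, -11, -12, 16, -5.5, -8.5, -5.2])
thresholded = relu(scores)   # only 2.0 and 16 survive
```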

slide-47
SLIDE 47
  • 3b. CNNs for Regulatory Genomics Foundations

(Higher-level learning)

slide-48
SLIDE 48

Learning patterns in regulatory DNA sequence

  • Positive class of genomic sequences: bound by a transcription factor of interest
  • Negative class of genomic sequences: not bound by the transcription factor of interest

Can we learn patterns in the DNA sequence that distinguish these 2 classes of genomic sequences?

slide-49
SLIDE 49

HOMOTYPIC MOTIF DENSITY Regulatory sequences often contain more than one binding instance of a TF resulting in homotypic clusters of motifs of the same TF

Key properties of regulatory sequence

slide-50
SLIDE 50

Key properties of regulatory sequence

HETEROTYPIC MOTIF COMBINATIONS Regulatory sequences often bound by combinations of TFs resulting in heterotypic clusters of motifs of different TFs

slide-51
SLIDE 51

Key properties of regulatory sequence

SPATIAL GRAMMARS OF HETEROTYPIC MOTIF COMBINATIONS Regulatory sequences are often bound by combinations of TFs with specific spatial and positional constraints resulting in distinct motif grammars

slide-52
SLIDE 52

A simple classifier (An artificial neuron)

Linear function: z = w · x + b (parameters: the w’s and b)

Training the neuron means learning the optimal w’s and b

slide-53
SLIDE 53

A simple classifier (An artificial neuron) Y

Non-linear function: Logistic / Sigmoid. Useful for predicting probabilities. Training the neuron means learning the optimal w’s and b (the parameters).

slide-54
SLIDE 54

A simple classifier (An artificial neuron) Y

Training the neuron means learning the optimal w’s and b

ReLU (Rectified Linear Unit)

Non-linear function, useful for thresholding (the parameters are again the w’s and b)
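The three slides above describe one computation, y = f(w·x + b), with different choices of the nonlinearity f; a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    """Squashes z into (0, 1): useful for predicting probabilities."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified linear unit: useful for thresholding."""
    return np.maximum(0.0, z)

def neuron(x, w, b, activation):
    """A single artificial neuron: y = activation(w . x + b)."""
    return activation(np.dot(w, x) + b)
```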

slide-55
SLIDE 55

Artificial neuron can represent a motif

Y

parameters

slide-56
SLIDE 56

Convolutional filters learn motifs (PSSM)

Biological motivation of Deep CNN

Max pool thresholded scores over windows Threshold scores using ReLU Scan sequence using filters Predict probabilities using logistic neuron

slide-57
SLIDE 57

Deep convolutional neural network

Input sequence: G C A T T A C C G A T A A

Convolutional layer (same color = shared weights). Conv Layer 1: kernel width = 4, stride = 2*, num filters / num channels = 3, total neurons = 15. Later conv layers operate on outputs of previous conv layers.

Maxpooling layers take the max over sets of conv layer outputs. Maxpooling layer: pool width = 2, stride = 1. Conv Layer 2: kernel width = 3, stride = 1, num filters / num channels = 2, total neurons = 6.

Typically followed by one or more fully connected layers. Sigmoid activations give P(TF = bound | X).

*for genomics, a stride of 1 for conv layers is recommended

slide-58
SLIDE 58

Multi-task CNN

Input sequence: G C A T T A C C G A T A A

Convolutional layer (same color = shared weights). Conv Layer 1: kernel width = 4, stride = 2, num filters / num channels = 3, total neurons = 15. Later conv layers operate on outputs of previous conv layers.

Maxpooling layers take the max over sets of conv layer outputs. Maxpooling layer: pool width = 2, stride = 1. Conv Layer 2: kernel width = 3, stride = 1, num filters / num channels = 2, total neurons = 6.

Typically followed by one or more fully connected layers.

Multi-task output (sigmoid activations here): P(TF1 = bound | X), P(TF2 = bound | X)

slide-59
SLIDE 59

Multi-task CNN (architecture recap)

Input sequence: G C A T T A C C G A T A A

Convolutional layer (same color = shared weights). Conv Layer 1: kernel width = 4, stride = 2*, num filters / num channels = 3, total neurons = 15. Later conv layers operate on outputs of previous conv layers.

Maxpooling layers take the max over sets of conv layer outputs. Maxpooling layer: pool width = 2, stride = 1. Conv Layer 2: kernel width = 3, stride = 1, num filters / num channels = 2, total neurons = 6.

Typically followed by one or more fully connected layers.

slide-60
SLIDE 60

Deep Learning for Regulatory Genomics

  • 1. Biological foundations: Building blocks of Gene Regulation

– Gene regulation: Cell diversity, Epigenomics, Regulators (TFs), Motifs, Disease role – Probing gene regulation: TFs/histones: ChIP-seq, Accessibility: DNase/ATAC-seq

  • 2. Classical methods for Regulatory Genomics and Motif Discovery

– Enrichment-based motif discovery: Expectation Maximization, Gibbs Sampling – Experimental: PBMs, SELEX. Comparative genomics: Evolutionary conservation.

  • 3. Regulatory Genomics CNNs (Convolutional Neural Networks): Foundations

– Key idea: pixels  DNA letters. Patches/filters  Motifs. Higher  combinations – Learning convolutional filters  Motif discovery. Applying them  Motif matches

  • 4. Regulatory Genomics CNNs/RNNs in Practice: Diverse Architectures

– DeepBind: Learn motifs, use in (shallow) fully-connected layer, mutation impact – DeepSea: Train model directly on mutational impact prediction – Basset: Multi-task DNase prediction in 164 cell types, reuse/learn motifs – ChromPuter: Multi-task prediction of different TFs, reuse partner motifs – DeepLIFT: Model interpretation based on neuron activation properties – DanQ: Recurrent Neural Network for sequential data analysis

slide-61
SLIDE 61
  • 4. Regulatory Genomics CNNs in Practice:

(a) DeepBind

slide-62
SLIDE 62

DeepBind

[Alipanahi et al., 2015]

slide-63
SLIDE 63

http://www.nature.com/nbt/journal/v33/n8/full/nbt.3300.html

slide-64
SLIDE 64
slide-65
SLIDE 65

Constructing mutation map

Ref: NNNATGCAGCANNN → Alt: NNNATGTAGCANNN

The DeepBind model computes p(s_ref|w) and p(s_alt|w) for each single-base substitution j:

∆s_j = (p(s_alt|w) - p(s_ref|w)) · max(0, p(s_alt|w), p(s_ref|w))
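In-silico mutagenesis as in the mutation map can be sketched generically; here `score_fn` stands in for the trained model's p(s|w), and the toy substring-counting scorer in the usage example is purely illustrative:

```python
import numpy as np

ALPHABET = "ACGT"

def mutation_map(seq, score_fn):
    """DeepBind-style in-silico mutagenesis: score every single-base
    substitution and record its effect relative to the reference."""
    s_ref = score_fn(seq)
    delta = np.zeros((4, len(seq)))
    for j in range(len(seq)):
        for i, b in enumerate(ALPHABET):
            if b == seq[j]:
                continue  # reference base: entry stays 0
            alt = seq[:j] + b + seq[j+1:]
            s_alt = score_fn(alt)
            # the slide's scaling: difference weighted by the stronger score
            delta[i, j] = (s_alt - s_ref) * max(0.0, s_alt, s_ref)
    return delta

# toy scorer: count occurrences of "GC" (a stand-in for the real model)
dm = mutation_map("AGCA", lambda s: float(s.count("GC")))
```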

slide-66
SLIDE 66

Constructing sequence logo

For a test sequence NNNATGCAGCANNN, the convolution module produces a filtered signal which is rectified (denoised). Each filter then collects the subsequence windows that activate it (Motif 1: GCAG, CAGC, ACGA, ...; Motif 2: CAGC, GGTC, AGTC, AGGC, GGTG, ...). Aligning each filter’s collected windows and counting bases gives a position frequency matrix (PFM), which is drawn as that filter’s sequence logo.

slide-67
SLIDE 67

Predicting disease mutations

[Alipanahi et al., 2015]

slide-68
SLIDE 68

DeepBind summary

The key deep learning techniques:

  • Convolutional learning
  • Representational learning
  • Back-propagation and stochastic gradient
  • Regularization and dropout
  • Parallel GPU computing, especially useful for hyperparameter search

Limitations in DeepBind:

  • Requires defining negative training examples, which is often arbitrary

  • Using observed mutation data only as post-hoc evaluation
  • Modeling each regulatory dataset separately
slide-69
SLIDE 69

Regulatory Genomics CNNs in Practice: (b) DeepSEA

slide-70
SLIDE 70

DeepSea

DeepSea:

  • Similar to DeepBind, but trained on the ENCODE/Roadmap Epigenomics chromatin profiles: 919 chromatin features (125 DNase features, 690 TF features, 104 histone features)
  • It uses the ∆s mutation score as input to train a linear logistic regression to predict GWAS and eQTL SNPs defined from the GRASP database with a P-value cutoff of 1E-10, and GWAS SNPs from the NHGRI GWAS Catalog

[Zhou and Troyanskaya, 2015]

slide-71
SLIDE 71

Regulatory Genomics CNNs in Practice: (c) Basset

slide-72
SLIDE 72

Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks.

David R. Kelley, Jasper Snoek, John L. Rinn. Genome Research, March 2016

slide-73
SLIDE 73

Basset

Simultaneously predicting DNase sites in 164 cell types with 300 convolution filters. CNN-based Basset outperforms gkm-SVM. Convolutional filters connected to the input sequence recapitulate some known TF motifs.

[Kelley et al., 2016]

slide-74
SLIDE 74

Basset architecture for accessibility prediction

300 filters; 3 conv layers; 3 fully connected layers; 164 outputs (1 per cell type). Input: 600 bp; output: 164 bits; 1.9 million training examples

slide-75
SLIDE 75

Basset AUC performance vs. gkm-SVM

slide-76
SLIDE 76

45% of filter-derived motifs are found in the CIS-BP database

Motifs created by clustering matching input sequences and computing a PWM

slide-77
SLIDE 77

Motifs derived from filters with more information tend to be annotated

slide-78
SLIDE 78

Computational saturation mutagenesis of an AP-1 site reveals loss of accessibility

slide-79
SLIDE 79

Regulatory Genomics CNNs in Practice: (d) Chromputer

slide-80
SLIDE 80

ChromPuter

E2F6 Other TFs

Class Probabilities

CTCF MYC GATA1 SOX2 OCT4 NANOG

2nd FC Layer 1st FC Layer

Multi- task learning

2nd set of Convolutional Maps

1D DNase-seq/ATAC-seq profile DNA sequence

(Anshul Kundaje’s group from Stanford)

slide-81
SLIDE 81

How does a deep conv. neural network transform the raw V-plot input at each layer?

Promoter Enhancer

Initial Smoothing 1st set of Convolutional Maps 2nd Smoothing 2nd set of Convolutional Maps 3rd Smoothing 1st Fully Connected Layer 2nd Fully Connected Layer Class Probabilities V-Plot Input (300 x 2001) Chromatin State

(V-plot axes span -1 kb to +1 kb; example classes: Pure CTCF, Promoter, Enhancer)

slide-82
SLIDE 82

After initial pooling (smoothing)

Pure CTCF Promoter Enhancer

Initial Smoothing 1st set of Convolutional Maps 2nd Smoothing 2nd set of Convolutional Maps 3rd Smoothing 1st Fully Connected Layer 2nd Fully Connected Layer Class Probabilities V-Plot Input (300 x 2001) Chromatin State

slide-83
SLIDE 83

Second set of convolutional maps

Pure CTCF Promoter Enhancer

Initial Smoothing 1st set of Convolutional Maps 2nd Smoothing 2nd set of Convolutional Maps 3rd Smoothing 1st Fully Connected Layer 2nd Fully Connected Layer Class Probabilities V-Plot Input (300 x 2001) Chromatin State

slide-84
SLIDE 84

Learning from multiple 1D functional data (e.g. DNase, MNase)

1st Convolution Layer 2nd Convolution Layer

2nd FC Layer

1D MNase signal (1 x 2001)

Class Probabilities

3rd Convolution Layer

1st FC Layer

1st Convolution Layer 2nd Convolution Layer 3rd Convolution Layer

1D DNase signal (1 x 2001) Chromatin State

Scan DNase profile using filter

slide-85
SLIDE 85

Learning from raw DNA sequence

Higher layers learn motif combinations

Class Probabilities

Score sequence using filters Convolutional layers learn motif (PWM) like filters

slide-86
SLIDE 86

The ChrompuTer

Integrating multiple inputs (1D, 2D signals, sequence) to simultaneously predict multiple outputs

Class probabilities: Chromatin State, TF Binding, and histone marks H3K4me3, H3K9me3, H3K27me3, H3K4me1, H2A.Z, H3K36me3

2nd FC Layer 1st FC Layer

Per-input pipeline (repeated for each of the three input streams): V-Plot Input (300 x 2001) → Initial Smoothing → 1st set of Convolutional Maps → 2nd Smoothing → 2nd set of Convolutional Maps → 3rd Smoothing → 1st Combined FC Layer → 2nd Combined FC Layer

Multi- task learning

slide-87
SLIDE 87

Chromatin architecture can predict

chromatin state in held out chromosome

(same cell type)

Model + input data types                         | 8-class chromatin state accuracy (%)
Majority class (baseline)                        | 42
Gene proximity                                   | 59
Random Forest: ATAC-seq (150M reads)             | 61
Chromputer: DNase (60M reads)                    | 68.1
Chromputer: MNase (1.5B reads)                   | 69.3
Chromputer: ATAC-seq (150M reads)                | 75.9
Chromputer: DNase + MNase                        | 81.6
Chromputer: ATAC-seq + sequence                  | 83.5
Chromputer: DNase + MNase + sequence             | 86.2
Label accuracy across replicates (upper bound)   | 88

slide-88
SLIDE 88

High cross cell-type chromatin state prediction

  • Learn model on DNase and MNase only
  • Learn on GM12878, predict on K562 (and vice versa)
  • Requires local normalization to make signal comparable

8-class chromatin state accuracy:

Train ↓ / Test →   GM12878   K562
GM12878             0.816    0.818
K562                0.769    0.844

slide-89
SLIDE 89

Predicting individual histone marks from ATAC/DNase/MNase/Sequence

(Bar chart: area under precision-recall curve, roughly 0.25-0.75, for CTCF, H3K27ac, H3K4me3, H3K4me1, H3K9ac, H2A.Z, H3K36me3, H3K27me3, H3K9me3)

slide-90
SLIDE 90

Chromputer trained on TF ChIP-seq predicts cross cell-type in-vivo TF binding with high accuracy


Chromputer

Area under Precision Recall (PR) curve

c-MYC YY1 CTCF

Inputs: Seq + DNA shape + DNase profile Positives: Reproducible ChIP-seq peaks Negatives: All other DNase peaks + flanks + matched random sites Test sets: Held out chromosomes in held out cell types

slide-91
SLIDE 91

DeepLIFT reveals feature importance at the input layer

G C A T T A C C G A T A A

Nanog Gata1

Which neurons/filters are predictive? Which nucleotides in the input sequence are contributing to binding?

Key idea:

  • ReLU is piece-wise linear
  • Backpropagate differences of outputs using observed and reference inputs (e.g., inputs of all zeros) to obtain the gradient w.r.t. the input
  • Importance of any input to any output is the gradient weighted by the input itself

(Anshul Kundaje’s group from Stanford)
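The "gradients weighted by the input" idea can be shown on a toy one-hidden-layer ReLU net. DeepLIFT proper propagates reference-based contribution scores; this is the simpler gradient times (input minus reference) variant the bullets describe, with all weights illustrative:

```python
import numpy as np

def input_importance(x, x_ref, w1, b1, w2, b2):
    """Importance scores for a toy one-hidden-layer ReLU net
    f(x) = w2 . relu(w1 @ x + b1) + b2."""
    z = w1 @ x + b1
    active = (z > 0).astype(float)   # ReLU is piece-wise linear: slope 1 where z > 0
    grad_x = w1.T @ (w2 * active)    # df/dx via the chain rule
    return grad_x * (x - x_ref)      # gradient weighted by the input change

# inputs feeding inactive ReLUs receive zero importance
imp = input_importance(np.array([1.0, -1.0]), np.zeros(2),
                       np.eye(2), np.zeros(2), np.array([1.0, 2.0]), 0.0)
```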

slide-92
SLIDE 92

Deep Learning for Regulatory Genomics

  • 1. Biological foundations: Building blocks of Gene Regulation

– Gene regulation: Cell diversity, Epigenomics, Regulators (TFs), Motifs, Disease role – Probing gene regulation: TFs/histones: ChIP-seq, Accessibility: DNase/ATAC-seq

  • 2. Classical methods for Regulatory Genomics and Motif Discovery

– Enrichment-based motif discovery: Expectation Maximization, Gibbs Sampling – Experimental: PBMs, SELEX. Comparative genomics: Evolutionary conservation.

  • 3. Regulatory Genomics CNNs (Convolutional Neural Networks): Foundations

– Key idea: pixels  DNA letters. Patches/filters  Motifs. Higher  combinations – Learning convolutional filters  Motif discovery. Applying them  Motif matches

  • 4. Regulatory Genomics CNNs/RNNs in Practice: Diverse Architectures

– DeepBind: Learn motifs, use in (shallow) fully-connected layer, mutation impact – DeepSea: Train model directly on mutational impact prediction – Basset: Multi-task DNase prediction in 164 cell types, reuse/learn motifs – ChromPuter: Multi-task prediction of different TFs, reuse partner motifs – DeepLIFT: Model interpretation based on neuron activation properties – DanQ: Recurrent Neural Network for sequential data analysis