Not Just a Black Box: Interpretable Deep Learning for Genomics
Avanti Shrikumar, Peyton Greenside, Anshul Kundaje
Peyton Anshul
1
With great power comes really poor interpretability
[Plot: interpretability vs. power]
Classical statistics, traditional machine learning, deep learning, interpretable deep learning
2
Fertilized egg → liver cells, cardiac cells, blood cells
How is cell-type-specific gene expression controlled?
3
Cell-types are different because different genes are turned on
Most of the genome exists in a closed state…
“Histone” proteins act like spools that the DNA winds around
Most “controller” proteins can’t bind closed DNA
Figures from Shlyueva et al., Nature Reviews Genetics, 2014
…except for cell-type specific open control elements
“Controller” proteins bind to DNA patterns present in these “control elements”
…which then activate nearby genes
4
Figures from Shlyueva et al., Nature Reviews Genetics, 2014
Many mutations have no effect!
Experimentally measure cell-type-specific openness
Predict openness from sequence using deep learning
Interpret the model to learn important positions!
5
*Stranger et al., Genet., 2011
Input: DNA sequence represented as ones and zeros
Learned pattern detectors
Later layers build on patterns of previous layer
Output: Open (+1) vs not open (0)
Accessible in HSCs
[Figure: input matrix with rows A, C, G, T and entries of 0 or 1]
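The input representation described above (DNA as ones and zeros) can be sketched as a one-hot matrix with one row per base. This is a minimal illustration, not the authors' code: the function name `one_hot_encode` and the choice to map unknown bases such as N to an all-zero column are assumptions.

```python
def one_hot_encode(sequence):
    """Return a 4 x L matrix (rows A, C, G, T) of ones and zeros."""
    alphabet = "ACGT"
    seq = sequence.upper()
    # Assumption: bases outside ACGT (e.g. N) become an all-zero column.
    return [[1 if base == letter else 0 for base in seq] for letter in alphabet]


encoded = one_hot_encode("GATAAG")
for letter, row in zip("ACGT", encoded):
    print(letter, row)
```

Each column holds at most a single 1, so a convolutional pattern detector sliding over the columns is effectively scoring a short DNA word.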
Computer vision
Open in cell-type X | Open in cell-type Y
6
7
8
Open in cell-type X | Open in cell-type Y
Alipanahi et al., 2015; Zhou & Troyanskaya, 2015
9
Inputs: i1, i2
h = max(0, 1 - i1 - i2)
y = 1 - h
y = (i1 + i2) when (i1 + i2) < 1
y = 1 when (i1 + i2) >= 1
[Plot: y against (i1 + i2); y levels off at 1]
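The two-input network above can be evaluated directly to see where the output stops responding to its inputs. The sketch below is a plain transcription of the slide's formulas; the function name `toy_network` and the chosen test points are illustrative assumptions.

```python
def toy_network(i1, i2):
    """h = max(0, 1 - i1 - i2); y = 1 - h, as written on the slide."""
    h = max(0.0, 1.0 - i1 - i2)
    y = 1.0 - h
    return h, y


# Illustrative non-negative inputs, chosen here for the example.
for i1, i2 in [(0.0, 0.0), (0.25, 0.25), (0.5, 0.5), (1.0, 1.0)]:
    h, y = toy_network(i1, i2)
    print(f"i1={i1}, i2={i2}: h={h}, y={y}")
```

Once (i1 + i2) reaches 1, h is clamped at 0 and y stays at 1 no matter how much larger the inputs get.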
10
Input: DNA sequence represented as ones and zeros
Open in cell-type X | Open in cell-type Y
[Figure: input matrix with rows A, C, G, T and entries of 0 or 1]
Open in cell-type X
github.com/kundajelab/deeplift
11
When (i1 + i2) >= 1, gradient is 0
Inputs: i1, i2
h = max(0, 1 - i1 - i2)
y = 1 - h
y = (i1 + i2) when (i1 + i2) < 1
y = 1 when (i1 + i2) >= 1
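The zero-gradient claim for this network can be checked numerically. The sketch below uses a central finite difference rather than any particular framework's autodiff; the helper names `toy_output` and `numerical_gradient` and the step size are assumptions made for illustration.

```python
def toy_output(i1, i2):
    """y = 1 - max(0, 1 - i1 - i2), the saturating toy network."""
    return 1.0 - max(0.0, 1.0 - i1 - i2)


def numerical_gradient(i1, i2, eps=1e-6):
    """Central-difference estimate of (dy/di1, dy/di2)."""
    d1 = (toy_output(i1 + eps, i2) - toy_output(i1 - eps, i2)) / (2 * eps)
    d2 = (toy_output(i1, i2 + eps) - toy_output(i1, i2 - eps)) / (2 * eps)
    return d1, d2


# One point below saturation, one point well inside the saturated region.
for point in [(0.25, 0.25), (1.0, 1.0)]:
    g1, g2 = numerical_gradient(*point)
    grad_times_input = (g1 * point[0], g2 * point[1])
    print(point, "gradient:", (g1, g2), "gradient*input:", grad_times_input)
```

At i1 = i2 = 1 both the gradient and gradient*input are zero, even though the output differs from its value at zero input, which is the failure case the slide points at.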
12
Reference: i1 = 0 & i2 = 0
h = 1 when (i1 + i2) = 0 (reference)
At (i1 + i2) = 2, h = 0, so the “difference from reference” for h is -1, NOT 0
Inputs: i1, i2
h = max(0, 1 - i1 - i2)
y = 1 - h
y = (i1 + i2) when (i1 + i2) < 1
y = 1 when (i1 + i2) >= 1
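The “difference from reference” idea can be worked through on this toy network. The sketch below assumes a rescale-style multiplier for the ReLU (ratio of output difference to input difference), which is one of the rules described for DeepLIFT; it is a hand-rolled illustration, not the deeplift library's API, and the fallback of 0 when the pre-activation does not change is a simplification made here.

```python
def deeplift_toy(i1, i2, ref_i1=0.0, ref_i2=0.0):
    """Difference-from-reference contributions for y = 1 - max(0, 1 - i1 - i2)."""
    z, z_ref = 1.0 - i1 - i2, 1.0 - ref_i1 - ref_i2   # ReLU pre-activations
    h, h_ref = max(0.0, z), max(0.0, z_ref)
    y, y_ref = 1.0 - h, 1.0 - h_ref
    delta_z, delta_h = z - z_ref, h - h_ref
    # Assumed rescale-style multiplier: a ratio of differences, not a local gradient.
    m_relu = delta_h / delta_z if delta_z != 0 else 0.0
    # z has weight -1 on each input, and y has weight -1 on h.
    c1 = (-1.0) * m_relu * (-1.0) * (i1 - ref_i1)
    c2 = (-1.0) * m_relu * (-1.0) * (i2 - ref_i2)
    return delta_h, y - y_ref, c1, c2


delta_h, delta_y, c1, c2 = deeplift_toy(1.0, 1.0)
print("delta h:", delta_h, "delta y:", delta_y, "contributions:", (c1, c2))
```

With the all-zero reference, h changes by -1 and the two contributions sum to the change in y, so the saturated inputs still receive non-zero importance.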
13
Sigmoid is 0.5 when input is 0
“Difference from reference” is +0.5 when input is >> 0 (assuming reference input
14
Original Reference DeepLIFT scores
CIFAR10 model, class = “ship”
Suggestions on how to pick a reference:
zeros)
ACGT in background set
generated by shuffling the original sequence
15
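One of the suggestions above, a reference generated by shuffling the original sequence, can be sketched as follows. This is a plain shuffle that preserves only single-base composition; whether the authors preserve higher-order statistics (for example dinucleotide content) is not stated here. The function name `shuffled_references`, the number of references, and the seed are assumptions.

```python
import random


def shuffled_references(sequence, num_refs=5, seed=0):
    """Return num_refs shuffled copies of sequence (same base counts)."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    refs = []
    for _ in range(num_refs):
        bases = list(sequence)
        rng.shuffle(bases)
        refs.append("".join(bases))
    return refs


original = "AGATAAGGCTTATCAG"
references = shuffled_references(original)
for ref in references:
    print(ref)
```

Scores computed against several such references could then be averaged, so that no single shuffle dominates the result.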
Inputs: i1, i2
h1 = i1 - i2
h2 = max(0, h1)
y = i1 - h2
16
y = max(0, i1 - i2)
i1 = 10, i2 = 6: y = max(0, 10 - 6) = 4
Standard breakdown (gradient*input): 4 = (10 from i1) + (-6 from i2)
Would get this breakdown even with y = i1 - i2
It doesn’t leverage the nonlinearity; gradients are *local*
Equally valid alternative breakdown: 4 = (4 from i1) + (0 from i2)
Average: 4 = (7 from i1) + (-3 from i2)
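The three breakdowns of y = max(0, i1 - i2) at i1 = 10, i2 = 6 can be reproduced by switching the inputs from a reference to their actual values one at a time and recording how much the output moves at each step. This is a sketch under an assumed reference of i1 = i2 = 0; the helper name `ordered_breakdown` is hypothetical. In this example, adding i1 first happens to coincide with the gradient*input numbers.

```python
from itertools import permutations


def f(i1, i2):
    return max(0.0, i1 - i2)


def ordered_breakdown(order, inputs, reference):
    """Credit each input with the output change when it is switched on, in order."""
    current = dict(reference)
    contrib = {}
    for name in order:
        before = f(current["i1"], current["i2"])
        current[name] = inputs[name]
        contrib[name] = f(current["i1"], current["i2"]) - before
    return contrib


inputs = {"i1": 10.0, "i2": 6.0}
reference = {"i1": 0.0, "i2": 0.0}  # assumed reference
breakdowns = [ordered_breakdown(order, inputs, reference)
              for order in permutations(["i1", "i2"])]
average = {name: sum(b[name] for b in breakdowns) / len(breakdowns)
           for name in inputs}
print(breakdowns)
print(average)
```

Averaging over both orderings gives the middle ground between the two equally valid breakdowns, which is what the last line of the slide reports.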
17
Inputs: i1, i2
h1 = i1 - i2
h2 = max(0, h1)
y = i1 - h2
18
[Figure: 8->3 and 8->6; panels: Guided Backprop, Integrated gradients, DeepLIFT]
19
Peyton Greenside
Publicly available “openness” data (Corces & Buenrostro et al., 2016)
Hematopoietic stem cell, white blood cell, red blood cell
20
Importance in HSCs | Importance in B-cells | Importance in Erythroid
Gata, Gata, Gata, SPI1
SPI1 protein binding signal | GATA1 protein binding signal
Openness signal: HSCs, Erythroid, B-cells
Peyton Greenside
21
22
Individual GATA pattern detectors
Motifs found by DeepBind (Alipanahi et al.)
Problem: High levels of redundancy, because multiple neurons cooperate with each other
Computer vision
23
Insight: input-level importance scores reveal combined contributions
[Figure: importance scores for Sequence 1, Sequence 2, Sequence 3]
24
Nanog protein
Nanog DNA-binding signal
Foreground: 1000s of sequences bound by Nanog in embryonic stem cells
vs.
Background: Open regions in embryonic stem cells
94% auROC on held-out test set
25
Publicly available patterns (Kheradpour et al.) | Corresponding result of MoDISco
26
Single MoDISco feature better predicts Nanog binding than all 4 other features combined
Simulation: random background sequence with 0-2 [motif], 0-2 [motif]
Positive set: at least one [motif] and at least one [motif]
Identify sequences with one [motif] and one [motif]
Mutate the
Peyton Greenside
(“Gata” pattern) (“Tal” pattern)
27
Nathan Boley, Maryna Taranova, Oana Ursu, Daniel Kim, Chris Probert, Jin-Wook Lee, Michael Wainberg, Rahul Mohan, Chuan Sheng Foo, Johnny Israeli, Irene Kaplow, Peyton Greenside, Nasa Sinnott-Armstrong, Anna Shcherbina, Anshul Kundaje
Funding: HHMI International Student Research Fellowship, Bio-X fellowship, Microsoft Women’s Fellowship, NIH R01ES02500902
Foreground: both [motif] and [motif]
Motifs from Kheradpour et al.
Missing “GATA” pattern
Peyton Greenside
[Figure: auROC of logistic regression on hits to each pattern]
4 MoDISco motifs
5 known patterns (ENCODE db)
All 32 de-novo from traditional method (HOMER)
Top 4 de-novo from traditional method (HOMER)
4 known patterns (HOMER db)
#patterns=32, #patterns=4
Model trained to predict binding of CTCF protein from sequence + accessibility signal (DNase)
DNA sequence pattern
“Footprint” pattern in accessibility signal from experiment (DNase)
Chuan Sheng Foo, Nasa Sinnott-Armstrong
[Figure: panels plotted against input: gradient; grad*Δinput (Taylor); DeepLIFT contribution (“difference from reference”), with the “reference output” marked for “reference” = 0]
Corces & Buenrostro et al. 2016
Approach
(1) Train a deep learning model to predict cell-type-specific accessibility from DNA sequence
(2) Interpret the model to learn regulatory patterns
– Motif discovery: Primary motifs and cofactor motifs?
– Heterogeneity: Are there different classes of Nanog sites?
– Sequence grammars: homotypic/heterotypic co-binding, density and spacing of motifs in individual sequences
– Positive set: 5,473 reproducible Nanog peaks in H1-ESC
– Negative set: 149,231 H1-ESC DNase-seq peaks that don’t overlap the ChIP-seq
– Training and test set on different chromosomes
– Model performance: 94% auROC, 50.7% auPRC
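To make the auROC figure in the list above concrete: auROC is the probability that a randomly chosen positive is scored above a randomly chosen negative. The sketch below computes it by direct pairwise comparison on made-up labels and scores (not the Nanog data), and also prints the positive fraction implied by the set sizes quoted above, which is roughly the auPRC a random ranking would be expected to reach. The function name `auroc` is an assumption.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up toy example, for illustration only.
toy_labels = [1, 1, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.3, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)
print("toy auROC:", toy_auroc)

# Class balance implied by the set sizes on the slide.
positive_fraction = 5473 / (5473 + 149231)
print("positive fraction:", positive_fraction)
```

Because positives are only a few percent of the examples, auPRC is a much stricter summary than auROC here, which is presumably why both are reported.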
Peyton Greenside
9 of ~70 motifs discovered in monocytes
SNAI2 CEBPB E-box ELK1 RUNX EBF1 GATA AP1 SPI1
0 = inaccessible 1 = accessible
Learned motifs
Paths to Erythroid
0,0,0,0,1 0,0,0,1,1 0,0,1,1,1 0,1,1,1,1 1,1,1,1,1 (Snai2) (Ebf1) (SPI1) (Bhlhe40) (GATA)
Peyton Greenside
DeepLIFT, de-novo | De-novo HOMER, default parameters | ENCODE motifs
Reproducible Nanog ChIP-seq peaks
Zic3 motif | Sox2 motif | 1 2 3 | Nanog motif | Oct4-Sox2-Nanog fusion motif
[Figure: Accessibility track and DeepLIFT track, each shown for HSC, CMP, GMP, MEP, CD8, NK]