Not Just a Black Box: Interpretable Deep Learning for Genomics



slide-1
SLIDE 1

Not Just a Black Box: Interpretable Deep Learning for Genomics

Avanti Shrikumar, Peyton Greenside, Anshul Kundaje

slide-2
SLIDE 2

With great power comes really poor interpretability…

[Figure: interpretability plotted against power, placing classical statistics, traditional machine learning, deep learning, and interpretable deep learning]

slide-3
SLIDE 3

Example biological problem: understanding stem cell differentiation

[Figure: a fertilized egg giving rise to liver cells, cardiac cells, and blood cells]

Cell-types are different because different genes are turned on

How is cell-type-specific gene expression controlled?

Ans: “control elements” that show cell-type-specific openness

slide-4
SLIDE 4

“control elements” show tissue-specific openness

Most of the genome exists in a closed state…

“histone” proteins act like spools that the DNA winds around

Most “controller” proteins can’t bind closed DNA

…except for cell-type specific open control elements

“Controller” proteins bind to DNA patterns present in these “control elements”

…which then activate nearby genes

Figures from Shlyueva et al., Nature Reviews Genetics, 2014

slide-5
SLIDE 5

89%* of disease-associated mutations occur outside of genes

Which positions in controller sites are important?

Many mutations have no effect!

Experimentally measure cell-type specific openness

Predict openness from sequence using deep learning

Interpret the model to learn important positions!

Figures from Shlyueva et al., Nature Reviews Genetics, 2014

*Stranger et al., Genet., 2011

slide-6
SLIDE 6

Overview of deep learning model

Input: DNA sequence (e.g. CGATAACCGATAT) represented as ones and zeros

Learned pattern detectors

Later layers build on patterns of previous layer

Output: Open (+1) vs not open (0), e.g. accessible in HSCs, open in cell-type X, open in cell-type Y

[Figure: one-hot matrix with rows A, C, G, T for the input sequence, with an analogy to computer vision]
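The "ones and zeros" input representation can be sketched in a few lines of Python. This is a minimal illustration only: the row order A, C, G, T follows the figure labels, and the function name `one_hot_encode` is hypothetical rather than taken from the authors' code.

```python
# Row order A, C, G, T is an assumption chosen to match the figure labels.
BASES = "ACGT"

def one_hot_encode(sequence):
    """Return a 4 x len(sequence) matrix (one row per base) of ones and zeros."""
    return [[1 if base == row_base else 0 for base in sequence.upper()]
            for row_base in BASES]

encoded = one_hot_encode("CGATAACCGATAT")
for row_base, row in zip(BASES, encoded):
    print(row_base, row)
```

Each column contains exactly one 1, marking which base occurs at that position. A character outside A, C, G, T would produce an all-zero column in this sketch.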

slide-7
SLIDE 7

Questions for the model

  • Which positions in the DNA sequence are the important ones?

  • What are the recurring patterns in the DNA?

slide-8
SLIDE 8

Questions for the model

  • Which positions in the DNA sequence are the important ones?

  • What are the recurring patterns in the DNA?

slide-9
SLIDE 9

How can we identify important nucleotides?

In-silico mutagenesis (Alipanahi et al., 2015; Zhou & Troyanskaya, 2015)

[Figure: a position in the input sequence CGATAACCGATAT is replaced by another base and the predictions "Open in cell-type X" and "Open in cell-type Y" are recomputed]
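In-silico mutagenesis can be sketched as a loop over every single-base substitution, scoring each mutant against the original prediction. The `toy_model` below is a hypothetical stand-in for a trained network (it simply counts GATA occurrences); it is an assumption made so the example runs, not the model from the talk.

```python
BASES = "ACGT"

def toy_model(sequence):
    """Hypothetical stand-in for a trained network: counts GATA occurrences."""
    return float(sum(sequence[i:i + 4] == "GATA"
                     for i in range(len(sequence) - 3)))

def in_silico_mutagenesis(sequence, model):
    """Return {(position, new_base): model(mutant) - model(original)}."""
    baseline = model(sequence)
    effects = {}
    for pos, original_base in enumerate(sequence):
        for new_base in BASES:
            if new_base == original_base:
                continue
            mutant = sequence[:pos] + new_base + sequence[pos + 1:]
            effects[(pos, new_base)] = model(mutant) - baseline
    return effects

sequence = "CGATAACCGATAT"
effects = in_silico_mutagenesis(sequence, toy_model)

# Summarise each position by the largest absolute effect of any substitution.
position_scores = [
    max(abs(effects[(pos, base)]) for base in BASES if (pos, base) in effects)
    for pos in range(len(sequence))
]
print(position_scores)
```

The cost is three extra model evaluations per position, which is one reason backpropagation-based scores are attractive for long sequences.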

slide-10
SLIDE 10

Saturation problem illustrated

Network with inputs i1 and i2:

h = max(0, 1 - i1 - i2)

y = 1 - h

y = (i1 + i2) when (i1 + i2) < 1

y = 1 when (i1 + i2) >= 1

[Figure: network diagram and plots of h and y against i1 + i2, with axis ticks at 1 and 2]
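A short Python sketch of this two-input network shows why perturbation-based scoring struggles once the output saturates. Evaluating at i1 = i2 = 1 is an illustrative choice made here, not a value confirmed by the slide text.

```python
def h(i1, i2):
    return max(0.0, 1.0 - i1 - i2)

def y(i1, i2):
    return 1.0 - h(i1, i2)

# Both inputs "on": the output is saturated at 1.
full = y(1.0, 1.0)

# Perturb one input at a time, as in-silico mutagenesis would.
effect_i1 = full - y(0.0, 1.0)
effect_i2 = full - y(1.0, 0.0)

# Removing both inputs together does change the output.
effect_both = full - y(0.0, 0.0)

print(effect_i1, effect_i2, effect_both)
```

Each single perturbation leaves the output unchanged, so both inputs look unimportant even though removing them together drops the output from 1 to 0.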

slide-11
SLIDE 11

“Backpropagation” based approaches

Input: DNA sequence represented as ones and zeros

Examples

  • Gradients (Simonyan et al.)

  • DeepLIFT: github.com/kundajelab/deeplift

[Figure: importance is propagated from the output "Open in cell-type X" back onto the one-hot encoded input sequence]

slide-12
SLIDE 12

Saturation revisited

Network with inputs i1 and i2:

h = max(0, 1 - i1 - i2)

y = 1 - h

y = (i1 + i2) when (i1 + i2) < 1

y = 1 when (i1 + i2) >= 1

When (i1 + i2) >= 1, gradient is 0

[Figure: plots of h and y against i1 + i2, with axis ticks at 1 and 2]
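The vanishing gradient can be checked numerically. The sketch below uses a central finite difference instead of a deep learning framework; the evaluation points (0.25, 0.25) and (1, 1) are arbitrary choices for illustration.

```python
def y(i1, i2):
    h = max(0.0, 1.0 - i1 - i2)
    return 1.0 - h

def numerical_gradient(f, i1, i2, eps=1e-6):
    """Central finite-difference estimate of (dy/di1, dy/di2)."""
    d_i1 = (f(i1 + eps, i2) - f(i1 - eps, i2)) / (2 * eps)
    d_i2 = (f(i1, i2 + eps) - f(i1, i2 - eps)) / (2 * eps)
    return d_i1, d_i2

grad_unsaturated = numerical_gradient(y, 0.25, 0.25)  # i1 + i2 < 1
grad_saturated = numerical_gradient(y, 1.0, 1.0)      # i1 + i2 >= 1

print(grad_unsaturated, grad_saturated)
```

In the unsaturated regime each input has a gradient close to 1; in the saturated regime both gradients are 0, so gradient-based scores assign no importance to either input.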

slide-13
SLIDE 13

The DeepLIFT solution: difference from reference

Network with inputs i1 and i2:

h = max(0, 1 - i1 - i2)

y = 1 - h

y = (i1 + i2) when (i1 + i2) < 1

y = 1 when (i1 + i2) >= 1

Reference: i1=0 & i2=0

h=1 when (i1 + i2) = 0 (reference)

At (i1 + i2) = 2, the “difference from reference” is -1, NOT 0

[Figure: plot of h against i1 + i2, with axis ticks at 1 and 2]
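The difference-from-reference computation for this network can be sketched as follows. Spreading the difference across the inputs with a multiplier of the form delta_h / delta_z (in the spirit of the "Rescale" rule described in the DeepLIFT paper) is an assumption about how to apportion the difference; it is not the deeplift library's API.

```python
def forward(i1, i2):
    z = 1.0 - i1 - i2   # pre-activation of the hidden unit
    h = max(0.0, z)
    y = 1.0 - h
    return z, h, y

ref_z, ref_h, ref_y = forward(0.0, 0.0)   # reference: i1=0 & i2=0
z, h, y = forward(1.0, 1.0)               # saturated input, i1 + i2 = 2

delta_h = h - ref_h   # difference from reference at the hidden unit
delta_y = y - ref_y   # difference from reference at the output

# Assumed Rescale-style multiplier: a ratio of differences, not a gradient.
multiplier = delta_h / (z - ref_z)

# Chain through the weights: i -> z has weight -1, h -> y has weight -1.
contrib_i1 = (1.0 - 0.0) * (-1.0) * multiplier * (-1.0)
contrib_i2 = (1.0 - 0.0) * (-1.0) * multiplier * (-1.0)

print(delta_h, delta_y, contrib_i1, contrib_i2)
```

Under these assumptions each input receives a contribution of 0.5, and the two contributions sum to the output's difference from reference, whereas the gradient at this point is 0.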

slide-14
SLIDE 14

DeepLIFT generalizes to other function types…

Sigmoid is 0.5 when input is 0

“difference from reference” is +0.5 when input is >> 0 (assuming reference input of 0)
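The sigmoid case is easy to verify numerically. The sketch below compares the local slope with the difference from a reference input of 0; the probe inputs 0, 2, and 10 are arbitrary choices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

reference_input = 0.0
rows = []
for x in (0.0, 2.0, 10.0):
    difference = sigmoid(x) - sigmoid(reference_input)
    slope = sigmoid(x) * (1.0 - sigmoid(x))   # derivative of the sigmoid at x
    rows.append((x, difference, slope))
    print(f"input={x:5.1f}  difference from reference={difference:.4f}  gradient={slope:.6f}")
```

For large inputs the gradient shrinks towards 0 while the difference from reference approaches +0.5.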

slide-15
SLIDE 15

Reference matters!

[Figure: original image, reference, and DeepLIFT scores for a CIFAR10 model, class = “ship”]

Suggestions on how to pick a reference:

  • MNIST: background (all zeros)

  • Genomics:

      – Average frequency of ACGT in background set

      – Multiple references generated by shuffling the original sequence
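Both genomics suggestions can be sketched with the Python standard library. The slide does not say which kind of shuffle is meant, so the plain per-base shuffle below is an assumption, and the function names are hypothetical.

```python
import random

BASES = "ACGT"

def base_frequencies(background_sequences):
    """Average frequency of A, C, G, T over a background set."""
    counts = {base: 0 for base in BASES}
    total = 0
    for seq in background_sequences:
        for base in seq:
            if base in counts:
                counts[base] += 1
                total += 1
    return {base: counts[base] / total for base in BASES}

def shuffled_references(sequence, n_references, seed=0):
    """Generate references by shuffling the bases of the original sequence."""
    rng = random.Random(seed)   # fixed seed so the sketch is reproducible
    references = []
    for _ in range(n_references):
        chars = list(sequence)
        rng.shuffle(chars)
        references.append("".join(chars))
    return references

freqs = base_frequencies(["AACC", "GGTT"])
refs = shuffled_references("CGATAACCGATAT", n_references=5)
print(freqs)
print(refs)
```

With multiple references, one natural way to use them is to compute scores against each reference and average the results.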

slide-16
SLIDE 16

Example failure-mode 2: “min” (AND) relation

Network with inputs i1 and i2:

h1 = i1 - i2

h2 = max(0, h1)

y = i1 - h2

y = i1 - max(0, i1 - i2) = min(i1, i2) → gradient 0 for either i1 or i2

[Figure: network diagram with edge weights 1 and -1]
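A small Python sketch confirms that this network computes min(i1, i2) and that, away from i1 = i2, one of the two inputs always has gradient 0. The gradient is written out analytically here, and treating the tie i1 = i2 as the inactive ReLU branch is an assumed convention.

```python
def y(i1, i2):
    h1 = i1 - i2
    h2 = max(0.0, h1)
    return i1 - h2

def gradient(i1, i2):
    """Analytic (dy/di1, dy/di2) for y = i1 - max(0, i1 - i2)."""
    if i1 > i2:
        return (0.0, 1.0)   # ReLU active: y = i2
    return (1.0, 0.0)       # ReLU inactive (ties included): y = i1

grid = [0.0, 1.0, 2.5, 4.0]
matches_min = all(y(a, b) == min(a, b) for a in grid for b in grid)

print(matches_min, gradient(5.0, 3.0), gradient(3.0, 5.0))
```

Whichever input is larger receives zero gradient, even though a "min" relation depends on both inputs.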

slide-17
SLIDE 17

DeepLIFT idea 2: consider different orders for positive and negative terms

y = max(0, i1 - i2)

i1 = 10, i2 = 6: y = max(0, 10 - 6) = 4

Standard breakdown (gradient*input): 4 = (10 from i1) + (-6 from i2)

Would get this breakdown even with y = i1 - i2

It doesn’t leverage the nonlinearity; gradients are *local*

Equally-valid alternative breakdown: 4 = (4 from i1) + (0 from i2)

Average: 4 = (7 from i1) + (-3 from i2)

[Figure: plots of max(0, i1 - i2) against i1 - i2 for i1=10, i2=6, showing the +10 and -6 terms applied in each order]
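The three breakdowns can be reproduced by adding the positive term (i1) and the negative term (-i2) to the ReLU input in each of the two possible orders and then averaging. This is a sketch assuming a reference of 0 for both inputs; the function name `breakdowns` is hypothetical.

```python
def relu(x):
    return max(0.0, x)

def breakdowns(i1, i2):
    """Contributions (from i1, from i2) to y = max(0, i1 - i2), reference 0."""
    pos, neg = i1, -i2   # the two terms of i1 - i2

    # Order A: add the positive term first, then the negative term.
    order_a = (relu(pos) - relu(0.0), relu(pos + neg) - relu(pos))

    # Order B: add the negative term first, then the positive term.
    order_b = (relu(neg + pos) - relu(neg), relu(neg) - relu(0.0))

    average = tuple(0.5 * (a + b) for a, b in zip(order_a, order_b))
    return order_a, order_b, average

order_a, order_b, average = breakdowns(10.0, 6.0)
print(order_a)   # matches the gradient*input breakdown
print(order_b)   # the alternative breakdown
print(average)   # the averaged breakdown
```

Every breakdown sums to the same output of 4; only the split between i1 and i2 changes with the order.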

slide-18
SLIDE 18

Example failure-mode 2: “min” (AND) relation

Network with inputs i1 and i2:

h1 = i1 - i2

h2 = max(0, h1)

y = i1 - h2

y = i1 - max(0, i1 - i2) = min(i1, i2)

→ gradient 0 for either i1 or i2

→ DeepLIFT gives 50% importance to each of i1 and i2

[Figure: network diagram with edge weights 1 and -1]
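The 50% split can be reproduced by averaging over the two orders in which the positive and negative terms reach the ReLU, then propagating through y = i1 - h2. This is a sketch assuming a reference of 0 and non-negative inputs; it illustrates the idea and is not the deeplift library's implementation.

```python
def relu(x):
    return max(0.0, x)

def relu_term_contributions(pos, neg):
    """Order-averaged contributions of a positive and a negative term to relu(pos + neg)."""
    from_pos = 0.5 * ((relu(pos) - relu(0.0)) + (relu(neg + pos) - relu(neg)))
    from_neg = 0.5 * ((relu(neg) - relu(0.0)) + (relu(pos + neg) - relu(pos)))
    return from_pos, from_neg

def min_contributions(i1, i2):
    """Contributions of i1 and i2 to y = i1 - max(0, i1 - i2)."""
    h2_from_i1, h2_from_i2 = relu_term_contributions(i1, -i2)
    # y = i1 - h2: i1 contributes directly and through h2; i2 only through h2.
    return i1 - h2_from_i1, -h2_from_i2

print(min_contributions(1.0, 1.0))
print(min_contributions(10.0, 6.0))
```

In both calls the two inputs receive equal contributions that sum to min(i1, i2).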

slide-19
SLIDE 19

E.g.: morphing an 8 into a 3 or a 6

[Figure: panels "Original", "8->3", and "8->6" for Guided Backprop, Integrated gradients, and DeepLIFT]

slide-20
SLIDE 20

Case study: understanding “control elements” of blood cell types

Publicly available “openness” data (Corces & Buenrostro et al., 2016)

[Figure: hematopoietic stem cell, white blood cell, red blood cell]

Peyton Greenside

slide-21
SLIDE 21

Cell-type-specific use of “controller” sequence in HSC, B-cells and Erythroid

[Figure: tracks for HSCs, B-cells, and Erythroid cells showing the openness signal, SPI1 protein binding signal, GATA1 protein binding signal, and importance scores, with Gata and SPI1 patterns highlighted; some binding tracks are labelled "No data available", "No peak", or "Protein not present in cell"]

Peyton Greenside

slide-22
SLIDE 22

Questions for the model

  • Which positions in the DNA sequence are the important ones?

  • What are the recurring patterns in the DNA?

slide-23
SLIDE 23

Naïve idea: look at individual pattern detectors

Problem: High levels of redundancy, because multiple neurons cooperate with each other

[Figure: individual GATA pattern detectors; motifs found by DeepBind (Alipanahi et al.); an example from computer vision]

slide-24
SLIDE 24

How do we combine the contributions of multiple pattern detectors to find consolidated patterns?

Insight: input-level importance scores reveal combined contributions

[Figure: importance-score tracks for Sequence 1, Sequence 2, and Sequence 3]

MoDISco: Motif Discovery from Importance Scores

slide-25
SLIDE 25

Case-study: Predicting Nanog binding in embryonic stem cells

Foreground: 1000s of sequences bound by Nanog in embryonic stem cells

vs.

Background: Open regions in embryonic stem cells

94% auROC on held-out test-set

[Figure: Nanog protein and Nanog DNA-binding signal]

slide-26
SLIDE 26

Learning reoccurring patterns

[Figure: publicly available patterns (Kheradpour et al.) next to the corresponding result of MoDISco]

Single MoDISco feature better predicts Nanog binding than all 4 other features combined

slide-27
SLIDE 27

In development:

Discover dependencies with “Delta DeepLIFT”

Simulation: random background sequence with 0-2 [“Gata” pattern] and 0-2 [“Tal” pattern]

Positive set: at least one [“Gata” pattern] and at least one [“Tal” pattern]

Identify sequences with one [“Gata” pattern] and one [“Tal” pattern]

Mutate the [pattern image missing]

Peyton Greenside

slide-28
SLIDE 28

Summary

  • DeepLIFT: can reveal cell-type-specific importance of positions at “control elements”

      – With advantages over gradients/in-silico mutagenesis
      – https://github.com/kundajelab/deeplift

  • MoDISco: Motif Discovery from Importance Scores

      – Broader and more consolidated motifs compared to other approaches

  • Delta DeepLIFT to identify dependencies

slide-29
SLIDE 29

Nathan Boley, Maryna Taranova, Oana Ursu, Daniel Kim, Chris Probert, Jin-Wook Lee, Michael Wainberg, Rahul Mohan, Chuan Sheng Foo, Johnny Israeli, Irene Kaplow, Peyton Greenside, Nasa Sinnott-Armstrong, Anna Shcherbina, Anshul Kundaje

Funding

  • HHMI International Student Research Fellowship
  • Bio-X fellowship
  • Microsoft Women’s Fellowship
  • NIH R01ES02500902

slide-30
SLIDE 30

[Figure: gradient*input versus DeepLIFT; foreground: sequences containing both patterns (pattern images missing); motifs from Kheradpour et al.; annotation: Missing “GATA” pattern]

Peyton Greenside

slide-31
SLIDE 31

Consolidated MoDISco patterns don’t lose info. relative to fragmented patterns

Logistic regression on top hits to each pattern, auROC

[Figure: auROC for each feature set: 4 MoDISco motifs; 5 known patterns (ENCODE db); all 32 de-novo from traditional method (HOMER); top 4 de-novo from traditional method (HOMER); 4 known patterns (HOMER db). Annotations: #patterns=32, #patterns=4]

slide-32
SLIDE 32

Motif discovery works on continuous signals

Model trained to predict binding of CTCF protein from sequence + accessibility signal (DNase)

[Figure: DNA sequence pattern and “footprint” pattern in accessibility signal from experiment (DNase)]

Chuan Sheng Foo, Nasa Sinnott-Armstrong

slide-33
SLIDE 33

Example failure-mode 2: thresholding

output = max(0, input - 10)

[Figure: four panels plotted against input, each with a tick at input = 10: output; gradient (tick at 1); grad*Δinput (Taylor) (tick at 10); DeepLIFT contribution, i.e. output minus “reference output” if “reference” = 0 (“difference from reference”)]
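The three quantities compared on this slide can be tabulated for a few inputs. The probe inputs are arbitrary choices, and treating the gradient at exactly input = 10 as 0 is an assumed convention.

```python
def output(x):
    return max(0.0, x - 10.0)

def gradient(x):
    return 1.0 if x > 10.0 else 0.0

reference = 0.0
table = []
for x in (5.0, 10.0, 15.0, 20.0):
    grad = gradient(x)
    grad_times_delta = grad * (x - reference)            # first-order (Taylor) term
    diff_from_reference = output(x) - output(reference)  # difference from reference
    table.append((x, grad, grad_times_delta, diff_from_reference))
    print(x, grad, grad_times_delta, diff_from_reference)
```

Just past the threshold, grad*Δinput jumps from 0 to a little over 10, while the difference from reference grows continuously from 0.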

slide-34
SLIDE 34

Example: understanding regulatory sequences governing accessibility in different cell types

Corces & Buenrostro et al. 2016

Approach

(1) Train a deep learning model to predict cell-type specific accessibility from DNA sequence

(2) Interpret the model to learn regulatory patterns

slide-35
SLIDE 35

Case-study 1: Predicting Nanog binding in H1-ESC

  • Q’s:

      – Motif discovery: Primary motifs and cofactor motifs?
      – Heterogeneity: Are there different classes of Nanog sites?
      – Sequence Grammars: homotypic/heterotypic co-binding, density and spacing of motifs in individual sequences

  • Our model:

      – Positive set: 5,473 reproducible Nanog peaks in H1-ESC
      – Negative set: 149,231 H1-ESC DNase-seq peaks that don’t overlap the ChIP-seq peaks
      – Training and test set on different chromosomes
      – Model Performance: 94% auROC, 50.7% auPRC

slide-36
SLIDE 36

Case Study 1, continued: Discovering regulatory motifs governing hematopoiesis

9 of ~70 motifs discovered in monocytes

SNAI2, CEBPB, E-box, ELK1, RUNX, EBF1, GATA, AP1, SPI1

Peyton Greenside

slide-37
SLIDE 37

TFs controlling cell fate trajectories to Erythroid

0 = inaccessible, 1 = accessible

Paths to Erythroid: 0,0,0,0,1; 0,0,0,1,1; 0,0,1,1,1; 0,1,1,1,1; 1,1,1,1,1

Learned Motifs: (Snai2) (Ebf1) (SPI1) (Bhlhe40) (GATA)

Peyton Greenside

slide-38
SLIDE 38

Contrast to traditional motifs

[Figure: motif comparison with panels "DeepLIFT, de-novo", "De-novo HOMER, default parameters", and "ENCODE motifs"]

slide-39
SLIDE 39

At least 3 distinct classes of Nanog sites

[Figure: reproducible Nanog ChIP-seq peaks grouped into classes 1, 2, and 3, annotated with the Zic3 motif, Sox2 motif, Nanog motif, and Oct4-Sox2-Nanog fusion motif]

slide-40
SLIDE 40

TET2 Locus

GATA importance appears concurrently with peak accessibility

[Figure: accessibility track and DeepLIFT track for HSC, CMP, GMP, MEP, CD8, and NK cells]

slide-41
SLIDE 41

TET2 Locus

NKX-2 importance appears concurrently with peak accessibility

[Figure: tracks for HSC, CMP, GMP, MEP, CD8, and NK cells]