decode the human genome - - PowerPoint PPT Presentation



slide-1
SLIDE 1

AGCCAAGCAGCAAAGTTTTGCTGCTGTTTATTTTTGTAGCTCTTACTATATTCTACTTTTACCA TTGAAAATATTGAGGAAGTTATTTATATTTCTATTTTTTATATATTATATATTTTATGTATTTTAAT ATTACTATTACACATAATTATTTTTTATATATATGAAGTACCAATGACTTCCTTTTCCAGAGCAA TAATGAAATTTCACAGTATGAAAATGGAAGAAATCAATAAAATTATACGTGACCTGTGGCGA AGTACCTATCGTGGACAAGGTGAGTACCATGGTGTATCACAAATGCTCTTTCCAAAGCCCTC TCCGCAGCTCTTCCCCTTATGACCTCTCATCATGCCAGCATTACCTCCCTGGACCCCTTTCTAA GCATGTCTTTGAGATTTTCTAAGAATTCTTATCTTGGCAACATCTTGTAGCAAGAAAATGTAA AGTTTTCTGTTCCAGAGCCTAACAGGACTTACATATTTGACTGCAGTAGGCATTATATTTAGC TGATGACATAATAGGTTCTGTCATAGTGTAGATAGGGATAAGCCAAAATGCAATAAGAAAAA CCATCCAGAGGAAACTCTTTTTTTTTTCTTTTTCTTTTTTTTTTTTCCAGATGGAGTCTCGCA CTTCTCTGTCACCCGGGCTGGAGCGCAGTGGTGCAATCTTGGCTCACTGCAACCTCCACCT CCTGGGTTCAGGTGATTCTCCCACCTCAGCCTCCCGAGTAGTAGCTGGAATTACAGGTGCG CGCTCCCACACCTGGCTAATTTTTTGTATTCTTAGTAGAGATGGGGTTTCACCATGTTGGCCA GGCTGGTCTCAAACTCCTGCCCTCAGGTGATCTGCCCACCTTGGCCTCCCAGTGTTGGGTTT ACAGGCGTGAGCCACCGCGCCTGGCCTGGAGGAAACTCTTAACAGGGAAACTAAGAAAG AGTTGAGGCTGAGGAACTGGGGCATCTGGGTTGCTTCTGGCCAGACCACCAGGCTCTTGA ATCCTCCCAGCCAGAGAAAGAGTTTCCACACCAGCCATTGTTTTCCTCTGGTAATGTCAGCC TCATCTGTTGTTCCTAGGCTTACTTGATATGTTTGTAAATGACAAAAGGCTACAGAGCATAGA

Deep learning approaches to decode the human genome

Anshul Kundaje

Genetics, Computer Science, Stanford University
http://anshul.kundaje.net

slide-2
SLIDE 2

TGCCAAGCAGCAAAGTTTTGCTGCTGTTTATTTTTGTAGCTCTTACTATATTCT ACTTTTACCATTGAAAATATTGAGGAAGTTATTTATATTTCTATTTTTTATATAT TATATATTTTATGTATTTTAATATTACTATTACACATAATTATTTTTTATATATATGA AGTACCAATGACTTCCTTTTCCAGAGCAATAATGAAATTTCACAGTATGAAA ATGGAAGAAATCAATAAAATTATACGTGACCTGTGGCGAAGTACCTATCGTG GACAAGGTGAGTACCATGGTGTATCACAAATGCTCTTTCCAAAGCCCTCTCC GCAGCTCTTCCCCTTATGACCTCTCATCATGCCAGCATTACCTCCCTGGACCC CTTTCTAAGCATGTCTTTGAGATTTTCTAAGAATTCTTATCTTGGCAACATCTT GTAGCAAGAAAATGTAAAGTTTTCTGTTCCAGAGCCTAACAGGACTTACATA TTTGACTGCAGTAGGCATTATATTTAGCTGATGACATAATAGGTTCTGTCATA GTGTAGATAGGGATAAGCCAAAATGCAATAAGAAAAACCATCCAGAGGAA ACTCTTTTTTTTTTCTTTTTCTTTTTTTTTTTTCCAGATGGAGTCTCGCACTTC TCTGTCACCCGGGCTGGAGCGCAGTGGTGCAATCTTGGCTCACTGCAACCT CCACCTCCTGGGTTCAGGTGATTCTCCCACCTCAGCCTCCCGAGTAGTAGCT GGAATTACAGGTGCGCGCTCCCACACCTGGCTAATTTTTTGTATTCTTAGTA GAGATGGGGTTTCACCATGTTGGCCAGGCTGGTCTCAAACTCCTGCCCTCA GGTGATCTGCCCACCTTGGCCTCCCAGTGTTGGGTTTACAGGCGTGAGCCA CCGCGCCTGGCCTGGAGGAAACTCTTAACAGGGAAACTAAGAAAGAGTTG AGGCTGAGGAACTGGGGCATCTGGGTTGCTTCTGGCCAGACCACCAGGCT CTTGAATCCTCCCAGCCAGAGAAAGAGTTTCCACACCAGCCATTGTTTTCCT CTGGTAATGTCAGCCTCATCTGTTGTTCCTAGGCTTACTTGATATGTTTGTAA ATGACAAAAGGCTACAGAGCATAGGTTCCTCTAAAATATTCTTCTTCCTGTGT CAGATATTGAATACATAGAAATACGGTCTGATGCCGATGAAAATGTATCAGCT TCTGATAAAAGGCGGAATTATAACTACCGAGTGGTGATGCTGAAGGGAGAC ACAGCCTTGGATATGCGAGGACGATGCAGTGCTGGACAAAAGGCAGGTAT CTCAAAAGCCTGGGGAGCCAACTCACCCAAGTAACTGAAAGAGAGAAACA AACATCAGTGCAGTGGAAGCACCCAAGGCTACACCTGAATGGTGGGAAGC TCTTTGCTGCTATATAAAATGAATCAGGCTCAGCTACTATTATT …………

2003

~3 billion nucleotides

The Human Genome Project

slide-3
SLIDE 3

Population sequencing to identify disease-associated genetic variants

Statistically significant association?

Oxford Nanopore technology

slide-4
SLIDE 4

~ 3 billion nucleotides

TGCCAAGCAGCAAAGTTTTGCTGCTGTTTATTTTTGTAGCTCTTACTATATTCT ACTTTTACCATTGAAAATATTGAGGAAGTTATTTATATTTCTATTTTTTATATAT TATATATTTTATGTATTTTAATATTACTATTACACATAATTATTTTTTATATATATGA AGTACCAATGACTTCCTTTTCCAGAGCAATAATGAAATTTCACAGTATGAAA ATGGAAGAAATCAATAAAATTATACGTGACCTGTGGCGAAGTACCTATCGTG GACAAGGTGAGTACCATGGTGTATCACAAATGCTCTTTCCAAAGCCCTCTCC GCAGCTCTTCCCCTTATGACCTCTCATCATGCCAGCATTACCTCCCTGGACCC CTTTCTAAGCATGTCTTTGAGATTTTCTAAGAATTCTTATCTTGGCAACATCTT GTAGCAAGAAAATGTAAAGTTTTCTGTTCCAGAGCCTAACAGGACTTACATA TTTGACTGCAGTAGGCATTATATTTAGCTGATGACATAATAGGTTCTGTCATA GTGTAGATAGGGATAAGCCAAAATGCAATAAGAAAAACCATCCAGAGGAA ACTCTTTTTTTTTTCTTTTTCTTTTTTTTTTTTCCAGATGGAGTCTCGCACTTC TCTGTCACCCGGGCTGGAGCGCAGTGGTGCAATCTTGGCTCACTGCAACCT CCACCTCCTGGGTTCAGGTGATTCTCCCACCTCAGCCTCCCGAGTAGTAGCT GGAATTACAGGTGCGCGCTCCCACACCTGGCTAATTTTTTGTATTCTTAGTA GAGATGGGGTTTCACCATGTTGGCCAGGCTGGTCTCAAACTCCTGCCCTCA GGTGATCTGCCCACCTTGGCCTCCCAGTGTTGGGTTTACAGGCGTGAGCCA CCGCGCCTGGCCTGGAGGAAACTCTTAACAGGGAAACTAAGAAAGAGTTG AGGCTGAGGAACTGGGGCATCTGGGTTGCTTCTGGCCAGACCACCAGGCT CTTGAATCCTCCCAGCCAGAGAAAGAGTTTCCACACCAGCCATTGTTTTCCT CTGGTAATGTCAGCCTCATCTGTTGTTCCTAGGCTTACTTGATATGTTTGTAA ATGACAAAAGGCTACAGAGCATAGGTTCCTCTAAAATATTCTTCTTCCTGTGT CAGATATTGAATACATAGAAATACGGTCTGATGCCGATGAAAATGTATCAGCT TCTGATAAAAGGCGGAATTATAACTACCGAGTGGTGATGCTGAAGGGAGAC ACAGCCTTGGATATGCGAGGACGATGCAGTGCTGGACAAAAGGCAGGTAT CTCAAAAGCCTGGGGAGCCAACTCACCCAAGTAACTGAAAGAGAGAAACA AACATCAGTGCAGTGGAAGCACCCAAGGCTACACCTGAATGGTGGGAAGC TCTTTGCTGCTATATAAAATGAATCAGGCTCAGCTACTATTATT …………

Function?

Decoding genome function

slide-5
SLIDE 5

ACCAGTTACGACGG TCAGGGTACTGATA CCCCAAACCGTTGA CCGCATTTACAGAC GGGGTTTGGGTTTT GCCCCACACAGGTA CGTTAGCTACTGGT TTAGCAATTTACCG TTACAACGTTTACA GGGTTACGGTTGGG ATTTGAAAAAAAGT TTGAGTTGGTTTTT TCACGGTAGAACGT ACCTTACAAA…………

One genome, many cell types

http://www.roadmapepigenomics.org/

slide-6
SLIDE 6

Biochemical markers of cell-type-specific functional elements

Active gene Repressed gene Protein

https://www.broadinstitute.org/news/1504

Control elements

99% non-coding, 1.5% protein-coding

slide-7
SLIDE 7

100s of cell types and tissues

NIH-funded collaborative consortia

Machine learning, probabilistic models, deep learning

  • Identifying tissue-specific control elements
  • Interpreting disease-associated genetic variation
  • Learning sequence code of control elements

slide-8
SLIDE 8

A comprehensive functional annotation of the human genome

Figure labels: active control elements, active genes, repressed elements

  • ~20,000 genes
  • ~2 million novel putative control elements!
  • Cell-type-specific activity

slide-9
SLIDE 9

2M control elements show highly modular tissue-specific activity

  • ~20,000 genes
  • ~2 million novel putative control elements!
  • Modular tissue-specific activity!

Figure: activity (active / inactive) of 2M control elements across 100s of tissues

slide-10
SLIDE 10

Decoding DNA words and grammars that specify tissue-specific control elements

Regulatory proteins bind DNA words (landing pads) in control elements! ‘Motif Discovery’

slide-11
SLIDE 11

Learning discriminative DNA words from tissue-specific control element sequences

Training input sequences (X), classification function F(X), training output labels (Y)

  • Class = +1: sequences of control elements active in Tissue 1
  • Class = -1: sequences of control elements NOT active in Tissue 1 but active in other tissues

‘Training’ means learning the function F(X) from multiple input, output pairs (X,Y)
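
The (X, Y) setup on this slide can be illustrated with a deliberately simple stand-in for a learned classifier: count k-mers ("DNA words") in the class +1 and class -1 sequences and rank them by log-odds enrichment. This is a minimal sketch, not the method used in the talk; the toy sequences, the function names `kmer_counts` and `enriched_kmers`, and the pseudocount are illustrative assumptions.

```python
import math
from collections import Counter


def kmer_counts(seqs, k):
    """Count how many sequences contain each k-mer (presence, not multiplicity)."""
    counts = Counter()
    for seq in seqs:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def enriched_kmers(pos_seqs, neg_seqs, k=4, pseudocount=1.0):
    """Rank k-mers by log-odds of occurring in class +1 vs. class -1 sequences."""
    pos, neg = kmer_counts(pos_seqs, k), kmer_counts(neg_seqs, k)
    scores = {}
    for kmer in set(pos) | set(neg):
        # Smoothed fraction of sequences in each class that contain the k-mer.
        p = (pos[kmer] + pseudocount) / (len(pos_seqs) + 2 * pseudocount)
        q = (neg[kmer] + pseudocount) / (len(neg_seqs) + 2 * pseudocount)
        scores[kmer] = math.log(p / q)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: every class +1 sequence contains the word GATA.
X_pos = ["CCGATAAC", "TTGATAGG", "AGATACCT"]   # active in Tissue 1 (Y = +1)
X_neg = ["CCGTTAAC", "TTGCCAGG", "AGCTACCT"]   # active elsewhere   (Y = -1)

ranking = enriched_kmers(X_pos, X_neg, k=4)
top_kmer, top_score = ranking[0]
print(top_kmer, round(top_score, 3))
```

A CNN replaces the fixed k-mer vocabulary with filters learned from the data, but the labeling of inputs and outputs is the same.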
slide-12
SLIDE 12

Deep convolutional neural network (CNN) on DNA sequence inputs

One-hot encoded input: DNA sequence represented as ones and zeros

    C G A T A A C C G A T A T
A   0 0 1 0 1 1 0 0 0 1 0 1 0
C   1 0 0 0 0 0 1 1 0 0 0 0 0
G   0 1 0 0 0 0 0 0 1 0 0 0 0
T   0 0 0 1 0 0 0 0 0 0 1 0 1

  • Learned pattern detectors
  • Later layers build on patterns of previous layer
  • Binary output: active (1) vs. inactive (0)

Is seq. active in cell type 1?
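
One-hot encoding, as described on this slide, can be sketched in a few lines of NumPy. The row order A, C, G, T follows the slide's labels; the function name `one_hot` and the choice to leave unknown bases as all-zero columns are assumptions for illustration.

```python
import numpy as np

BASES = "ACGT"  # row order assumed to match the slide's A, C, G, T labels


def one_hot(seq):
    """Return a (4, len(seq)) array with a single 1 per column."""
    x = np.zeros((len(BASES), len(seq)), dtype=np.float32)
    for j, base in enumerate(seq.upper()):
        if base in BASES:  # unknown bases such as N stay all-zero
            x[BASES.index(base), j] = 1.0
    return x


x = one_hot("CGATAACCGATAT")  # the 13-base example sequence from this slide
print(x.astype(int))
```

Each column sums to one, so the array carries exactly the same information as the sequence string while being directly usable as CNN input.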

slide-13
SLIDE 13

Deeper conv. layers learn DNA word combinations (grammars)

Score sequence using filters

Convolutional layers: neurons learn DNA word pattern detectors

Is seq. active in cell type 1?

Prediction accuracy: mean auROC = 0.82, mean auPRC = 0.65

Is seq. active in cell type 100? Is seq. active in cell type 2?

Multi-task learning

Similar to Kelley et al. 2016 (Basset) and Zhou et al. 2015 (DeepSEA)

Multi-task deep CNNs learn discriminative DNA word pattern detectors

Millions of input sequences of control elements
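
To make "score sequence using filters" concrete, the sketch below slides one filter along a one-hot sequence, applies a ReLU and global max pooling, and feeds the pooled value to several sigmoid outputs (one per cell type). It is a toy forward pass under stated assumptions: the GATA filter is set by hand rather than learned, and the task weights and biases are made-up numbers, not values from any trained model.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq):
    """Return a (4, len(seq)) one-hot array (rows A, C, G, T)."""
    x = np.zeros((4, len(seq)), dtype=np.float32)
    for j, base in enumerate(seq):
        x[BASES.index(base), j] = 1.0
    return x


def scan(x, filt):
    """Cross-correlate a (4, w) filter along a (4, L) one-hot sequence."""
    w = filt.shape[1]
    return np.array([np.sum(x[:, i:i + w] * filt)
                     for i in range(x.shape[1] - w + 1)])


# Hypothetical hand-set filter for the word GATA:
# +1 where the base matches, -1 where it does not.
gata_filter = 2 * one_hot("GATA") - 1

x = one_hot("CGATAACCGATAT")
scores = scan(x, gata_filter)            # one score per window position
activations = np.maximum(scores, 0.0)    # ReLU
pooled = float(activations.max())        # global max pooling

# Multi-task head: one sigmoid output per cell type, all sharing the
# same pooled feature. Weights and biases are invented for illustration.
task_weights = np.array([1.5, -0.5, 0.0])
task_bias = np.array([-3.0, 0.0, 0.0])
probs = 1.0 / (1.0 + np.exp(-(task_weights * pooled + task_bias)))
print(scores, probs)
```

In a trained network the filters and task weights come from gradient descent, and deeper convolutional layers combine the outputs of many such filters.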

slide-14
SLIDE 14

C G A T A A C C G A T A T

Is seq. active in cell type 1? Is seq. active in cell type 2?

How can we identify important parts of the input sequences?

In-silico mutagenesis

  • inefficient
  • misleading results due to saturation/buffering

[Figure: input sequence with one position mutated (A → ?), followed by G T A C T C G T …]

Alipanahi et al. 2015; Zhou & Troyanskaya 2015; Kelley et al. 2016
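
A minimal sketch of in-silico mutagenesis, and of the buffering problem named on this slide, is given below. The `toy_model` is a hypothetical stand-in for a trained CNN (a sigmoid of the best match to a hand-set GATA filter); only the mutate-and-rescore loop reflects the technique itself.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq):
    """Return a (4, len(seq)) one-hot array (rows A, C, G, T)."""
    x = np.zeros((4, len(seq)), dtype=np.float32)
    for j, base in enumerate(seq):
        x[BASES.index(base), j] = 1.0
    return x


def toy_model(x):
    """Hypothetical stand-in for a trained CNN: sigmoid of the best GATA match."""
    filt = 2 * one_hot("GATA") - 1
    w = filt.shape[1]
    best = float(max(np.sum(x[:, i:i + w] * filt)
                     for i in range(x.shape[1] - w + 1)))
    return 1.0 / (1.0 + np.exp(-(2.0 * best - 6.0)))


def in_silico_mutagenesis(seq, model):
    """Return a (4, L) array of output changes for every single-base substitution."""
    ref = model(one_hot(seq))
    delta = np.zeros((4, len(seq)))
    for j in range(len(seq)):
        for i, base in enumerate(BASES):
            if base != seq[j]:
                mutant = seq[:j] + base + seq[j + 1:]
                delta[i, j] = model(one_hot(mutant)) - ref
    return delta


# One copy of GATA: mutating the motif lowers the output.
delta = in_silico_mutagenesis("CCGATAACC", toy_model)
worst_per_position = delta.min(axis=0)

# Two copies of GATA: any single mutation leaves one copy intact, so every
# change is zero and both motifs look unimportant (buffering).
delta_buffered = in_silico_mutagenesis("CGATAACCGATAT", toy_model)
print(worst_per_position.round(3), np.abs(delta_buffered).max())
```

The loop needs 3 × L extra forward passes for a sequence of length L, which is the inefficiency the slide refers to.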

slide-15
SLIDE 15

C G A T A A C C G A T A T

Is seq. active in cell type 1? Is seq. active in cell type 2?

Efficient backpropagation-based approaches

[Figure: one-hot encoded input, rows A, C, G, T, with a 1 marking the base at each position]

Is seq. active in cell type 1?

G A T A C C G A A

Gradient-based methods

  • Saliency maps (Simonyan 2013)
  • Deconv networks (Zeiler, Fergus 2013)
  • Guided backprop (Springenberg 2014)
  • Layerwise relevance propagation (Bach 2015)
  • Integrated gradients (Sundararajan 2016)

Avanti Shrikumar Peyton Greenside

DeepLIFT

Shrikumar et al., “Learning Important Features Through Propagating Activation Differences”, https://arxiv.org/abs/1704.02685
CODE: https://github.com/kundajelab/deeplift
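
The contrast between a gradient-based score and a difference-from-reference score can be sketched on a one-layer toy model. The second function follows the spirit of DeepLIFT's Rescale rule for this single sigmoid unit only; it does not use the `deeplift` package, and the weights, bias, and uniform 0.25 reference are assumptions chosen so that the output saturates.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq):
    """Return a (4, len(seq)) one-hot array (rows A, C, G, T)."""
    x = np.zeros((4, len(seq)))
    for j, base in enumerate(seq):
        x[BASES.index(base), j] = 1.0
    return x


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def grad_times_input(x, W, b):
    """Gradient x input for p = sigmoid(sum(W * x) + b), one score per position."""
    p = sigmoid(np.sum(W * x) + b)
    grad = p * (1.0 - p) * W              # analytic dp/dx for this model
    return np.sum(grad * x, axis=0)


def diff_from_reference(x, ref, W, b):
    """Split p(x) - p(ref) across positions (requires a nonzero logit difference)."""
    z, z_ref = np.sum(W * x) + b, np.sum(W * ref) + b
    multiplier = (sigmoid(z) - sigmoid(z_ref)) / (z - z_ref)
    return np.sum(W * (x - ref), axis=0) * multiplier


# Hypothetical model: strong weights for GATA at positions 2-5, zero elsewhere.
x = one_hot("CCGATAAC")
W = np.zeros_like(x)
W[:, 2:6] = 3.0 * (2 * one_hot("GATA") - 1)
b = -3.0
ref = np.full_like(x, 0.25)               # assumed uniform background reference

gxi = grad_times_input(x, W, b)
contrib = diff_from_reference(x, ref, W, b)
print(gxi.round(5), contrib.round(3))
```

Because the sigmoid is saturated for this input, gradient × input is close to zero even on the motif, while the difference-from-reference scores sum exactly to the change in output and concentrate on the four motif positions.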

slide-16
SLIDE 16

DeepLIFT identifies combinatorial grammars of DNA words defining tissue-specific control elements!

Shrikumar et al. https://arxiv.org/abs/1704.02685 CODE: https://github.com/kundajelab/deeplift

slide-17
SLIDE 17

Distinct combinations of DNA words can activate the same control element in different tissues

Peyton Greenside

Figure: DeepLIFT scores by position along the control element sequence in two tissues (blood stem cells, red blood cells). Annotated DNA words: SPI1, Gata, Gata (Rc).

Validation experiment results: SPI1 protein binding data, GATA1 protein binding data

slide-18
SLIDE 18

Decoding tissue-specific combinatorial grammars in millions of genomic control elements!

Peyton Greenside

slide-19
SLIDE 19

MoDISCO: Identifying recurring DNA words across control elements

Insight: filter contributions are resolved at the nucleotide level

Figure: Δprob profiles for Sequence 1, Sequence 2, and Sequence 3

Avanti Shrikumar Peyton Greenside
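
The idea of finding recurring words from nucleotide-level scores can be sketched very roughly: take the highest-scoring window in each sequence and tally the subsequences found. This is a simplified illustration, not the MoDISco algorithm, which is more involved; the importance scores, the window width of 4, and the function name `top_window` are hypothetical.

```python
from collections import Counter

import numpy as np


def top_window(seq, scores, width):
    """Return (start, subsequence) of the window with the largest summed importance."""
    sums = np.convolve(scores, np.ones(width), mode="valid")  # sliding-window sums
    start = int(np.argmax(sums))
    return start, seq[start:start + width]


# Hypothetical per-base importance scores for three control element sequences.
examples = [
    ("CCGATAAC", [0.0, 0.1, 0.9, 0.8, 0.9, 0.7, 0.1, 0.0]),
    ("TTTGATAG", [0.0, 0.0, 0.1, 0.8, 0.9, 0.8, 0.9, 0.1]),
    ("AGATACCT", [0.1, 0.9, 0.7, 0.9, 0.8, 0.0, 0.1, 0.0]),
]

words = Counter(top_window(seq, np.array(s), 4)[1] for seq, s in examples)
print(words.most_common())
```

Exact string matching is the crudest possible grouping; real motif instances vary between sequences, so a practical method has to group similar, not just identical, windows.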

slide-20
SLIDE 20

We learn 1000s of known and novel DNA words defining tissue-specific control elements!

slide-21
SLIDE 21

Can deep CNNs trained on control elements be useful for understanding disease-associated genetic variants?

> 1000 population sequencing studies of diverse diseases

> 90% of complex disease-associated variants are not in genes. Highly enriched in control elements!

slide-22
SLIDE 22

Deep CNNs can predict and interpret effects of disease-associated genetic variants in relevant tissue context

Original prediction:  0.558  0.528  0.554  0.969  0.960  0.889
Mutated prediction:   0.543  0.583  0.557  0.926  0.900  0.756
Difference (percent): -1.5%  +5.4%  +0.3%  -4.3%  -5.9%  -13.2%

Breaking the ‘C’ results in a significant drop in the probability of an active control element!

A genetic variant C -> T strongly associated with coronary heart disease

  • Breaks the ‘C’ in the TGACTCA DNA word, which is the binding site for an important protein (AP1).
  • Variant specifically manifests in stimulated cells

Samples: unstimulated coronary smooth muscle cells; stimulated coronary smooth muscle cells from patients
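
The difference row on this slide appears to be the mutated prediction minus the original prediction, expressed in percentage points. The short check below recomputes it from the six pairs of values shown; treating the six columns as six samples in slide order is an assumption, and no sample labels are attached to them here.

```python
import numpy as np

# Predicted probability of an active control element, as shown on the slide.
original = np.array([0.558, 0.528, 0.554, 0.969, 0.960, 0.889])
mutated = np.array([0.543, 0.583, 0.557, 0.926, 0.900, 0.756])

# Change in predicted probability, in percentage points.
diff_points = 100 * (mutated - original)
largest_drop = int(np.argmin(diff_points))  # index of the column with the biggest decrease

for o, m, d in zip(original, mutated, diff_points):
    print(f"{o:.3f} -> {m:.3f}  ({d:+.1f} points)")
```

Recomputed values can differ from the slide's by about 0.1, presumably because the slide's differences were computed from unrounded predictions.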

slide-23
SLIDE 23

Future of personalized medicine

  • Personal genome sequences
  • Personal functional genomic data
  • Electronic medical records / clinical data / biometrics / literature mining
  • Longitudinal data

Domain-specific machine learning + AI

  • Rapid interpretation of personal genomes
  • Data-driven personal diagnosis (cause rather than symptoms)
  • Drug target identification and design
  • Optimal treatment regimens

slide-24
SLIDE 24

How to train your DRAGONN

Deep RegulAtory GenOmic Neural Nets
http://kundajelab.github.io/dragonn/

Interactive cloud-based tutorials on deep learning for genomic sequence

Johnny Israeli

slide-25
SLIDE 25

Acknowledgements


Will Greenleaf, Chuan Sheng Foo, Johnny Israeli, Avanti Shrikumar, Peyton Greenside, Chris Probert, Irene Kaplow

Kundaje Lab members

Funding: R01ES02500902, U41-HG007000-04S1, U01HG007919-02 (GGR)

Conflict of Interest: Deep Genomics (SAB), Epinomics (SAB)