slide-1
SLIDE 1

Kipoi: model zoo for genomics

Žiga Avsec

PhD candidate, Technical University of Munich www.gagneurlab.in.tum.de @gagneurlab, @KipoiZoo, @Avsecz

slide-2
SLIDE 2

Genomics

ACGTGTCAGTAGTTAAGCTAGTAGCTGATCGGTAACGTAGTGCACGTGTCAGTAGTTAAGCTAGTAGCTGATC

3 billion letters (x2) = 1 genome

37 trillion cells

slide-3
SLIDE 3

3

Proteins = main building blocks

Genome Protein1 Protein1 Protein2 Protein3 Protein3 Protein3 ~100k - 1M

slide-4
SLIDE 4

4

Proteins = main building blocks

Genome Protein1 Protein1 Protein2 Protein3 Protein2 ~100k - 1M

slide-5
SLIDE 5

5

Proteins = main building blocks

Genome Protein1 Protein1 Protein2 Protein3 Protein2 Protein2 Protein1 Protein2 Protein3 ~100k - 1M Protein complex Function1 Function2

slide-6
SLIDE 6

How to make proteins from the genome?

slide-7
SLIDE 7

7

Gene expression: How information in DNA is read out

DNA (2): atcgtatatatcatgatatggatacgcatagatcatgactcaggatacg

  • Gene (~20,000, 1% of DNA), made up of introns and exons

Transcription

precursor RNA: aucaugauauggauacgcauagaucaugacuca

Splicing

mature RNA: aucaugauacauagaucaugacuca

Translation

Protein (~100,000)
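
The transcription and splicing steps on this slide can be reproduced on the toy sequences with a few lines of Python. This is only a sketch: the coordinates (transcribed region 9 to 42, intron 9 to 17 within the precursor RNA) are not stated on the slide; they are inferred by aligning the three sequences shown, and the function names are made up for illustration.

```python
# Toy DNA sequence copied from the slide.
DNA = "atcgtatatatcatgatatggatacgcatagatcatgactcaggatacg"


def transcribe(dna, start, end):
    """Copy dna[start:end] into RNA by replacing t with u."""
    return dna[start:end].replace("t", "u")


def splice(pre_rna, introns):
    """Remove the (start, end) intron intervals and join the remaining exons."""
    exons = []
    pos = 0
    for start, end in sorted(introns):
        exons.append(pre_rna[pos:start])
        pos = end
    exons.append(pre_rna[pos:])
    return "".join(exons)


# Assumed coordinates, inferred from the slide's sequences (0-based, end-exclusive).
precursor_rna = transcribe(DNA, 9, 42)
mature_rna = splice(precursor_rna, [(9, 17)])

print(precursor_rna)
print(mature_rna)
```

Real genes are of course far longer and have many introns; the point is only that splicing removes internal segments of the transcript before translation.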

slide-8
SLIDE 8

8

atcgtatatatcatgatatggatacgcatagatcatgactcaggatacg

Transcription factor Transcription Factor binding site

Gene expression: How information in DNA is read out

slide-9
SLIDE 9

9

atcgtatatatcatgatatggatacgcatagatcatgactcaggatacg

RNA polymerase

Gene expression: How information in DNA is read out

slide-10
SLIDE 10

10

atcgtatatatcatgatatggatacgcatagatcatgactcaggatacg

Gene expression: How information in DNA is read out

slide-11
SLIDE 11

11

atcgtatatatcatgatatggatacgcatagatcatgactcaggatacg

Gene expression: How information in DNA is read out

slide-12
SLIDE 12

12

The regulatory elements control:

  • The position of transcription initiation (what)
  • The frequency of transcription (how much)

atcgtatatatcatgatatggatacgcatagatcatgactcaggatacg

Gene expression: How information in DNA is read out

slide-13
SLIDE 13

13

atcttatatatcatgatatggatacgcatagatcatgactcaggatacg

Genetic variants can disrupt regulatory elements

Reference Patient

atcgtatatatcatgatatggatacgcatagatcatgactcaggatacg
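
As a small illustration, the single-letter difference between the two sequences on this slide can be located programmatically. This sketch assumes the sequence containing "g" at the fourth letter is the reference (it matches the earlier slides) and the other one is the patient; `find_substitutions` is a hypothetical helper, not part of any library.

```python
reference = "atcgtatatatcatgatatggatacgcatagatcatgactcaggatacg"
patient = "atcttatatatcatgatatggatacgcatagatcatgactcaggatacg"


def find_substitutions(ref, alt):
    """Return (0-based position, ref base, alt base) for every mismatch."""
    if len(ref) != len(alt):
        raise ValueError("sequences must have equal length (substitutions only)")
    return [(i, r, a) for i, (r, a) in enumerate(zip(ref, alt)) if r != a]


variants = find_substitutions(reference, patient)
print(variants)
```

Insertions and deletions would need a proper alignment; this only handles substitutions.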

slide-14
SLIDE 14

14

Thousands of regulatory elements across all steps of gene expression

DNA: cttatcacagtgtatatcatgatatggatacgcatagatcatgactcaggatacg

Transcription

precursor RNA: aucaugauauggauacgcauagaucaugacuca

Splicing

mature RNA: aucaugauacauagaucaugacuca

Translation

RNA degradation: RNA -> Ø

Protein degradation: protein -> Ø

slide-15
SLIDE 15

15

Experimental data

Measuring the regulatory steps via sequencing

slide-16
SLIDE 16

16

Experimental data Predictive models

Learning the regulatory steps

slide-17
SLIDE 17

17

Experimental data Predictive models

Learning the regulatory steps

slide-18
SLIDE 18

18

GATA TAL

cttatcacagtgtatatcatgatatggatacgcatagatcatgactcaggatacg

Detecting regulatory elements with convolutional neural networks
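
A convolutional filter sliding over one-hot encoded DNA behaves much like a motif scanner. The following NumPy sketch illustrates that idea on the sequence from this slide; the filter is set by hand to match "gata" exactly, whereas a trained network would learn its filter weights from data. The function names are illustrative only.

```python
import numpy as np

ALPHABET = "acgt"


def one_hot(seq):
    """Encode a DNA string as a (length, 4) array with columns a, c, g, t."""
    idx = np.array([ALPHABET.index(base) for base in seq])
    return np.eye(4)[idx]


def conv1d_scan(x, kernel):
    """Valid 1D cross-correlation of a (L, 4) input with a (k, 4) filter."""
    k = kernel.shape[0]
    return np.array([np.sum(x[i:i + k] * kernel) for i in range(x.shape[0] - k + 1)])


seq = "cttatcacagtgtatatcatgatatggatacgcatagatcatgactcaggatacg"
gata_filter = one_hot("gata")  # hand-set filter; a CNN would learn this
activations = conv1d_scan(one_hot(seq), gata_filter)

# The activation equals 4 exactly where all four bases match the motif.
hits = np.flatnonzero(activations == 4).tolist()
print(hits)
```

In a real network the filter holds real-valued weights, so partial matches also produce graded activations.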

slide-19
SLIDE 19

19

Eraslan*, Avsec* et al., Nature Reviews Genetics 2019 (In press)

slide-20
SLIDE 20

20

Eraslan*, Avsec* et al., Nature Reviews Genetics 2019 (In press)

slide-21
SLIDE 21

21

Eraslan*, Avsec* et al., Nature Reviews Genetics 2019 (In press)

slide-22
SLIDE 22

22

Eraslan*, Avsec* et al., Nature Reviews Genetics 2019 (In press)

slide-23
SLIDE 23

23

Eraslan*, Avsec* et al., Nature Reviews Genetics 2019 (In press)

slide-24
SLIDE 24

24

Eraslan*, Avsec* et al., Nature Reviews Genetics 2019 (In press)

slide-25
SLIDE 25

25

That’s why we need GPUs in regulatory genomics.

Eraslan*, Avsec* et al., Nature Reviews Genetics 2019 (In press)

slide-26
SLIDE 26

26

Experimental data Predictive models

Learning the regulatory steps

slide-27
SLIDE 27

List of published predictive models

27

  • Transcriptional regulation
      • TF Binding
          • PWM Scanning (Jaspar, Cis-BP, MEME)
          • DeepBind
          • Improved DeepBind
          • FactorNet
          • GERV
          • DanQ
          • CKN-seq
      • Chromatin
          • DeepSEA
          • DeepChrome
          • Basenji
      • DNA methylation
          • CpGenie
          • DeepCpG
      • DNA Accessibility
          • Basset
      • TSS
          • FIDDLE
      • Gene-Expression
          • Basenji
          • Expecto
  • Post-transcriptional regulation
      • RBP binding
          • iDeep
          • rbp_eclip (Avsec et al)
      • miRNA binding
          • TargetScan
          • deepMiRGene
      • Splicing
          • MaxEntScan 5’, 3’
          • Labranchor
          • HAL
          • MMSplice
          • SpliceAI
      • mRNA half-life
          • Cheng et al 2017
      • Polyadenylation
          • APARENT
      • Translation
          • Optimus_5Prime
          • Cuperus et al 2017

See also: https://github.com/greenelab/deep-review

slide-28
SLIDE 28

List of published predictive models

28

Can we easily apply these models to new data? Can we easily re-use these models?

See also: https://github.com/greenelab/deep-review

slide-29
SLIDE 29

29

Lack of standardization leads to poor sharing and re-use of trained predictive models in genomics

Trained predictive models Code repository Paper supplements Author-maintained web page

slide-30
SLIDE 30

30

Lack of standardization leads to poor sharing and re-use of trained predictive models in genomics

Trained predictive models Code repository Paper supplements Author-maintained web page Data .... Bioinformatics software

slide-31
SLIDE 31

31

The “Findable, Accessible, Interoperable, Reusable” (FAIR) principles: not only for data but also for trained predictive models

slide-32
SLIDE 32

32

Challenges

  • Making predictions end-to-end
  • predict <model> -i input.data -o output.data
slide-33
SLIDE 33

33

Challenges

  • Making predictions end-to-end
  • predict <model> -i input.data -o output.data
  • Data heterogeneity (think one paper = one dataset)
slide-34
SLIDE 34

34

Challenges

  • Making predictions end-to-end
  • predict <model> -i input.data -o output.data
  • Data heterogeneity (think one paper = one dataset)
  • Model heterogeneity (from deep learning frameworks to custom code)

  • Dependency issues
slide-35
SLIDE 35

35

with A. Kundaje, Stanford, and O. Stegle, DKFZ

Kipoi.org [Kípi]

Avsec et al, Nature Biotechnology (In press)

slide-36
SLIDE 36

36

Trained model (model.yaml)

slide-37
SLIDE 37

37

Model

TGATCGAGG GTAGCTAGC CGTGAGTTT

“Parameterized function”: Input -> Model (Parameters) -> Output

Can be implemented using: data-loader -> model

slide-38
SLIDE 38

38

Model

data-loader model

slide-39
SLIDE 39

Data-loader

data-loader -> model

intervals.bed:
chr1 1000 2000
chr2 5000 7000

genome.fa:
>chr1
NNNNNNNNNNNN...

Steps: resize, extract, transform

Output:
array([[[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [ … ]],
       [[0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1], [ … ]]])

github.com/kipoi/kipoiseq
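
The resize, extract and transform steps of such a data-loader can be sketched in plain NumPy. This is not the kipoiseq implementation: the genome is a small in-memory dictionary instead of a FASTA file, and the helper names, toy sequences and interval coordinates are all hypothetical. The toy data is chosen so that the output reproduces the two arrays shown on the slide (A, C, G, A and C, G, A, T).

```python
import numpy as np


def resize(start, end, width):
    """Return an interval of the given width centred on the original one."""
    center = (start + end) // 2
    new_start = center - width // 2
    return new_start, new_start + width


def extract(genome, chrom, start, end):
    """Fetch the sequence of a 0-based, end-exclusive interval."""
    return genome[chrom][start:end]


def one_hot(seq):
    """A, C, G, T become unit vectors; any other letter (e.g. N) stays all-zero."""
    out = np.zeros((len(seq), 4), dtype=int)
    for i, base in enumerate(seq.upper()):
        j = "ACGT".find(base)
        if j >= 0:
            out[i, j] = 1
    return out


# Hypothetical stand-ins for genome.fa and intervals.bed.
genome = {"chr1": "TACGAT", "chr2": "GGCGATCC"}
intervals = [("chr1", 0, 6), ("chr2", 1, 7)]

batch = np.stack(
    [one_hot(extract(genome, chrom, *resize(start, end, 4)))
     for chrom, start, end in intervals]
)
print(batch.shape)
```

Resizing to a fixed width is what allows intervals of different lengths to be stacked into one batch array.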

slide-40
SLIDE 40

40

Dependencies

data-loader model

slide-41
SLIDE 41

41

Test predictions

...

TGATCGAGG GTAGCTAGC CGTGAGTTT TGATCGAGG GTAGCTAGC CGTGAGTTT TGATCGAGG GTAGCTAGC CGTGAGTTT

x

slide-42
SLIDE 42

42

Schema

TGATC GAGGA

... Supports multiple inputs/outputs

slide-43
SLIDE 43

43

General information

slide-44
SLIDE 44

44

Model repository

slide-45
SLIDE 45

45

slide-46
SLIDE 46

46

slide-47
SLIDE 47

47

# of model groups: 20

# of models: 2073

slide-48
SLIDE 48

48

  • Install new conda environment
  • Test the pipeline: data -> dataloader -> model
  • Test the predictions match

Testing models

slide-49
SLIDE 49

49

  • Install new conda environment
  • Test the pipeline: data -> dataloader -> model
  • Test the predictions match

Testing models

When?

  • Pull-request
  • Nightly (all model groups)
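
The last testing step, checking that predictions match, boils down to comparing freshly computed outputs against stored ones within a numerical tolerance. The sketch below shows one way this could look; the tolerances, the toy model and the function names are assumptions for illustration and are not taken from the Kipoi test suite.

```python
import numpy as np


def predictions_match(expected, actual, rtol=1e-5, atol=1e-8):
    """True if both arrays have the same shape and agree within tolerance."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return expected.shape == actual.shape and bool(
        np.allclose(expected, actual, rtol=rtol, atol=atol)
    )


def toy_model(x):
    """Stand-in for the dataloader -> model pipeline: mean of each input row."""
    return np.asarray(x, dtype=float).mean(axis=1)


example_inputs = [[0.0, 1.0], [2.0, 4.0]]
stored_predictions = [0.5, 3.0]  # what the model author would ship with the model

print(predictions_match(stored_predictions, toy_model(example_inputs)))
```

A tolerance is needed because floating-point results can differ slightly across hardware and library versions.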

slide-50
SLIDE 50

50

Using models

slide-51
SLIDE 51

For the impatient: 30 seconds introduction to Kipoi

slide-52
SLIDE 52

52

slide-53
SLIDE 53

53

Case study 1: Benchmarking models

slide-54
SLIDE 54

54

Benchmarking alternative models

slide-55
SLIDE 55

55

Benchmarking alternative models

slide-56
SLIDE 56

56

Benchmarking alternative models

# Create and activate a new conda environment with
# all model dependencies installed
kipoi env create <Model>
source activate kipoi-<Model>

# Run model prediction
kipoi predict <Model> \
  --dataloader_args='{"intervals_file": "intervals.bed", "fasta_file": "hg38.fa"}' \
  -o '<Model>.preds.h5'
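
When benchmarking several models on the same intervals, the prediction command can be generated once per model. The sketch below only builds and prints the command lines, using the flags shown on this slide; the model names are hypothetical placeholders, and actually running the commands would additionally require Kipoi and the per-model environments to be installed.

```python
import json
import shlex


def kipoi_predict_command(model, intervals_file, fasta_file):
    """Build the argument list for one `kipoi predict` call."""
    dataloader_args = json.dumps(
        {"intervals_file": intervals_file, "fasta_file": fasta_file}
    )
    return [
        "kipoi", "predict", model,
        "--dataloader_args=" + dataloader_args,
        "-o", model + ".preds.h5",
    ]


# Placeholder names, not real entries of the model zoo.
models = ["ModelA", "ModelB"]

commands = [kipoi_predict_command(m, "intervals.bed", "hg38.fa") for m in models]
for cmd in commands:
    # shlex.join quotes the JSON argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Each argument list could be handed to `subprocess.run` inside the matching conda environment, or translated into one rule of a workflow-management tool.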
slide-57
SLIDE 57

57

Benchmarking alternative models

# Create and activate a new conda environment with
# all model dependencies installed
kipoi env create <Model>
source activate kipoi-<Model>

# Run model prediction
kipoi predict <Model> \
  --dataloader_args='{"intervals_file": "intervals.bed", "fasta_file": "hg38.fa"}' \
  -o '<Model>.preds.h5'

<- Works very nicely with workflow-management tools like Snakemake

slide-58
SLIDE 58

58

# Create and activate a new conda environment with
# all model dependencies installed
kipoi env create <Model>
source activate kipoi-<Model>

# Run model prediction
kipoi predict <Model> \
  --dataloader_args='{"intervals_file": "intervals.bed", "fasta_file": "hg38.fa"}' \
  -o '<Model>.preds.h5'

Why not a single command?

predict <model> -i input.data -o output.data

slide-59
SLIDE 59

59

# Run model prediction
kipoi predict <Model> \
  --dataloader_args='{"intervals_file": "intervals.bed", "fasta_file": "hg38.fa"}' \
  -o '<Model>.preds.h5' \
  --singularity

Why not a single command?

predict <model> -i input.data -o output.data

slide-60
SLIDE 60

60

# Run model prediction
kipoi predict <Model> \
  --dataloader_args='{"intervals_file": "intervals.bed", "fasta_file": "hg38.fa"}' \
  -o '<Model>.preds.h5' \
  --singularity

Why not a single command?

predict <model> -i input.data -o output.data

input.data -> Container (Model) -> output.data
slide-61
SLIDE 61

61

# Run model prediction
kipoi predict <Model> \
  --dataloader_args='{"intervals_file": "intervals.bed", "fasta_file": "hg38.fa"}' \
  -o '<Model>.preds.h5' \
  --singularity

Why not a single command?

predict <model> -i input.data -o output.data

input.data -> Container (Model) -> output.data

In-progress:

  • --docker
  • --docker-gpu

<- Container will be available on NGC

slide-62
SLIDE 62

62

Transfer learning

slide-63
SLIDE 63

Transfer learning: Adapting existing models to new tasks

Pre-trained model: Conv. layers -> Dense -> Dense -> Dense

  • DNA accessibility in 421 cell-types
  • Takes a few days to train (Divergent421 model in Kipoi)

Model with transferred parameters: Conv. layers -> Dense -> Dense -> Dense

  • DNA accessibility in new cell type

Plot: Area under the Precision-recall curve vs. Training epoch

  • Randomly initialized (>1 day)
  • Transferred (<4 h)

See also Kelley et al., Genome Research 2016

slide-64
SLIDE 64

Transfer learning: Adapting existing models to new tasks

Training epoch Area under the Precision-recall curve

See also Kelley et al., Genome Research 2016

slide-65
SLIDE 65

65

Interpreting models

slide-66
SLIDE 66

66

Eraslan*, Avsec* et al NRG 2019 (In press)

slide-67
SLIDE 67

67

Eraslan*, Avsec* et al NRG 2019 (In press)

slide-68
SLIDE 68

68

Eraslan*, Avsec* et al NRG 2019 (In press)

slide-69
SLIDE 69

69

# Python
import kipoi
from kipoi_interpret.importance_scores.gradient import GradientXInput

model = kipoi.get_model("model")
imp_score = GradientXInput(model)
# seqs: a batch of model inputs, defined elsewhere
scores = imp_score.score(seqs)

# CLI
kipoi interpret create_mutation_map \
  <Model> \
  --dataloader_args='{"intervals_file": "intervals.bed", "fasta_file": "hg38.fa"}' \
  -o 'mmap.h5'

kipoi-interpret

slide-70
SLIDE 70

70

Feature importance score plugin

  • Variant rs35703285, near the beta globin gene HBB, is pathogenic (ClinVar) and linked to beta thalassemia

slide-71
SLIDE 71

71

Feature importance score plugin

  • Variant rs35703285, near the beta globin gene HBB, is pathogenic (ClinVar) and linked to beta thalassemia

Methods

  • ISM
  • grad
  • input*grad
  • DeepLift
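
Of the methods listed, in-silico mutagenesis (ISM) is the simplest to write down: mutate every position to every alternative base and record how much the prediction changes. The sketch below uses a deliberately trivial stand-in for the model (it counts "gata" occurrences); with a real model, `score_fn` would wrap its prediction call. All names here are illustrative and not the kipoi-interpret API.

```python
import numpy as np

BASES = "acgt"


def toy_score(seq):
    """Hypothetical stand-in for a model: number of 'gata' occurrences."""
    return sum(seq[i:i + 4] == "gata" for i in range(len(seq) - 3))


def in_silico_mutagenesis(seq, score_fn):
    """Return a (len(seq), 4) array of score(mutant) - score(reference)."""
    ref_score = score_fn(seq)
    effects = np.zeros((len(seq), 4))
    for i in range(len(seq)):
        for j, base in enumerate(BASES):
            if base != seq[i]:
                mutant = seq[:i] + base + seq[i + 1:]
                effects[i, j] = score_fn(mutant) - ref_score
    return effects


seq = "ccgatacc"
effects = in_silico_mutagenesis(seq, toy_score)

# One importance value per position: the largest absolute effect of any mutation.
importance = np.abs(effects).max(axis=1)
print(importance)
```

ISM needs three model evaluations per position, which is why gradient-based scores (grad, input*grad, DeepLift) are attractive for long sequences.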
slide-72
SLIDE 72

72

Scoring genetic variants

slide-73
SLIDE 73

73

atcttatatatcatgatatggatacgcatagatcatgactcaggatacg

Scoring genetic variants

Reference Patient

atcgtatatatcatgatatggatacgcatagatcatgactcaggatacg

#CHROM  POS       ID  REF  ALT  …
chr22   41320486  .   G    T    …

  • Each one of us carries ca. 1,000,000 variants

slide-74
SLIDE 74

74

Variant effect prediction plugin

  • In-silico mutagenesis
slide-75
SLIDE 75

75

Variant effect prediction plugin

# Annotate VCF file with variant scores
kipoi veff score_variants <Model> \
  --dataloader_args='{"fasta_file": "hg38.fa"}' \
  --vcf_path 'input.vcf' \
  -o 'annotated.vcf'

  • Supported by 12/20 model groups, runnable on VCF files
  • In-silico mutagenesis
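
Conceptually, scoring a variant by in-silico mutagenesis means predicting once on the reference sequence and once on the sequence carrying the alternative allele, then taking the difference. The sketch below illustrates this with a toy VCF record, a toy reference window and a toy scoring function; none of these come from the plugin, and the helper names are hypothetical.

```python
def parse_vcf_line(line):
    """Return (chrom, 1-based position, ref allele, alt allele) of a VCF record."""
    chrom, pos, _id, ref, alt = line.split()[:5]
    return chrom, int(pos), ref, alt


def toy_score(seq):
    """Hypothetical stand-in for a model: number of 'GATA' occurrences."""
    return sum(seq[i:i + 4] == "GATA" for i in range(len(seq) - 3))


def variant_effect(reference_seq, seq_start, pos, ref, alt, score_fn):
    """Score difference alt - ref; seq_start is the 1-based coordinate of reference_seq[0]."""
    offset = pos - seq_start
    if reference_seq[offset:offset + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference sequence")
    alt_seq = reference_seq[:offset] + alt + reference_seq[offset + len(ref):]
    return score_fn(alt_seq) - score_fn(reference_seq)


# Toy record and toy reference window (starting at position 1), for illustration only.
chrom, pos, ref, alt = parse_vcf_line("chr1 5 . G T")
window = "CCAAGATACC"
effect = variant_effect(window, 1, pos, ref, alt, toy_score)
print(chrom, pos, ref, alt, effect)
```

Checking the REF allele against the extracted sequence is a cheap guard against using the wrong genome build.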
slide-76
SLIDE 76

76

Kipoi variant scoring as a DNAnexus applet

slide-77
SLIDE 77

77

atcgtatatatcatgatatggatacgcatagatcatgactcaggatacg aucaugauauggauacgcauagaucaugacuca

Splicing: an essential step of protein production

aucaugauacauagaucaugacuca

Splicing

slide-78
SLIDE 78

78

atcgtatatatcatgatatggatactcatagatcatgactcaggatacg aucaugauauggauactcauagaucaugacuca

Splicing: an essential step of protein production

aucaugauacauataucaugacuca

Splicing

slide-79
SLIDE 79

79

Scotti & Swanson, 2016 NRG

Splicing is a complex, multi-step process

slide-80
SLIDE 80

80

Different models model different regions

Donor Acceptor Branchpoint MaxEntScan/3prime MaxEntScan/5prime HAL labranchor

slide-81
SLIDE 81

81

Different models model different regions

Donor Acceptor Branchpoint MaxEntScan/3prime MaxEntScan/5prime HAL labranchor

Binary classification:

  • Pathogenic (ClinVar)
  • Benign (ClinVar)
slide-82
SLIDE 82

Ensemble model predicting pathogenic variants near splice sites

See also MMSplice from Cheng et al., Genome Biology, CAGI Splicing challenge 2018 winner

Kipoi models: KipoiSplice/4 KipoiSplice/4cons MMSplice

slide-83
SLIDE 83

Summary

83

slide-84
SLIDE 84

84

Experimental data Predictive models

Learning the regulatory steps

slide-85
SLIDE 85

85

with A. Kundaje, Stanford, and O. Stegle, DKFZ

Kipoi.org [Kípi]

Avsec et al, Nature Biotechnology (In press)

slide-86
SLIDE 86

86

Acknowledgements

Kundaje lab

Stanford

  • Anshul Kundaje
  • Avanti Shrikumar
  • Nancy Xu
  • Abhimanyu Banerjee
  • Chuan Sheng Foo

Gagneur lab

TU Munich

  • Julien Gagneur
  • Jun Cheng

Stegle lab

Cambridge EMBL-EBI

  • Oliver Stegle
  • Roman Kreuzhuber
  • Thorsten Beier
  • Lara Urban

Nvidia

  • Johnny Israeli
  • Fernanda Foertter
  • Gary Dunn
  • Adam Simpson

DNAnexus

  • Jason Chin
  • Maria Simbirsky
  • Andrew Carroll

Kipoi.org

@KipoiZoo
slide-87
SLIDE 87

Kipoi: model zoo for genomics

Žiga Avsec

PhD candidate, Technical University of Munich www.gagneurlab.in.tum.de @gagneurlab, @KipoiZoo, @Avsecz

slide-88
SLIDE 88

88