slide-1
SLIDE 1

A deep learning based approach for genetic risk prediction

Raquel Dias, PhD, Senior Staff Scientist, Scripps Research Translational Institute
raqueld@scripps.edu, @RaquelDiasSRTI

Ali Torkamani, PhD
atorkama@scripps.edu, @ATorkamani

slide-2
SLIDE 2

Whole Genome Sequencing vs. Genotype array

[Figure: two genotype matrices of 0/1 alleles. Full data (whole genome sequencing) has every position observed; sparse data (genotype array) has most positions missing, shown as "?".]

slide-3
SLIDE 3

Whole Genome Sequencing vs. Genotype array

Full data: ~80M genetic variants
Sparse data: ~4 million genetic variants

[Figure: full and sparse genotype matrices; missing positions in the sparse data are shown as "?".]

slide-4
SLIDE 4

Genetic imputation problem

[Figure: reference haplotypes from HapMap or 1,000 Genomes (whole genome) are used to predict the missing positions ("?") in the study genotypes, which are cases and controls typed on a genotype array.]

slide-5
SLIDE 5

A typical imputation approach

[Figure: study genotypes with missing positions ("?") are mapped to a reference panel, the multiethnic Haplotype Reference Consortium (HRC). Missing positions are predicted using the linkage disequilibrium (LD r2) structure.]

slide-6
SLIDE 6

A typical imputation approach

[Figure: the same mapping and prediction workflow, now showing the study genotypes with no missing positions. Reference panel: multiethnic Haplotype Reference Consortium (HRC). Prediction relies on the linkage disequilibrium (LD r2) structure.]
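The LD r2 statistic named on these slides can be illustrated with a small sketch. It uses the standard textbook definition for two biallelic variants coded as 0/1 haplotype alleles; the function name `ld_r2` and the toy inputs are made up for illustration and are not taken from the talk.

```python
import numpy as np

def ld_r2(hap_a, hap_b):
    """LD r2 between two biallelic variants given 0/1 alleles per haplotype."""
    a = np.asarray(hap_a, dtype=float)
    b = np.asarray(hap_b, dtype=float)
    p_a, p_b = a.mean(), b.mean()      # allele frequencies at each variant
    p_ab = (a * b).mean()              # frequency of haplotypes carrying both alleles
    d = p_ab - p_a * p_b               # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")            # undefined when a variant is monomorphic
    return d * d / denom

# Toy haplotypes (hypothetical): identical variants vs. independent variants
r2_linked = ld_r2([0, 0, 1, 1], [0, 0, 1, 1])
r2_unlinked = ld_r2([0, 1, 0, 1], [0, 0, 1, 1])
```

Two variants that always travel together give r2 = 1, which is why a typed variant can stand in for an untyped one during imputation.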

slide-7
SLIDE 7

Polygenic Risk Score (PRS)

slide-8
SLIDE 8

Polygenic Risk Calculation

  • Design: 100,000+ subjects
  • Groups: with trait* vs. without trait
  • Results: millions of known variants
  • Polygenic risk score: cumulative sum (Σ)

*Trait can often be heterogeneous, e.g. coronary artery disease = heart attack, stroke, bypass surgery, etc.
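As a rough illustration of the "cumulative sum" on this slide, a polygenic risk score is commonly computed as a weighted sum of allele dosages. The sketch below assumes per-variant weights (for example, effect sizes from an association study); the function name, weights, and dosages are all hypothetical.

```python
import numpy as np

def polygenic_risk_score(dosages, weights):
    """Weighted sum over variants for each subject.

    dosages: (n_subjects, n_variants) allele counts, 0, 1, or 2
    weights: (n_variants,) per-variant effect weights (assumed, not from the talk)
    """
    dosages = np.asarray(dosages, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return dosages @ weights

# Toy data: 3 subjects, 4 variants (made-up values for illustration only)
toy_dosages = np.array([[0, 1, 2, 0],
                        [1, 1, 0, 0],
                        [2, 0, 1, 1]])
toy_weights = np.array([0.10, -0.05, 0.20, 0.30])
toy_scores = polygenic_risk_score(toy_dosages, toy_weights)
```

Because the score sums over every variant, any variant missing from a genotype array drops out of the sum, which is one reason imputation quality matters for risk prediction.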

slide-9
SLIDE 9

Objectives

  • 1. More accurate and faster imputation
  • 2. Find important genetic variants
  • 3. Better polygenic risk score calculation
slide-10
SLIDE 10

Our proposed approach

[Figure: autoencoder. The input x enters the input layer, is encoded into the hidden layer, and is decoded by the output layer into the reconstruction x′.]
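A minimal NumPy sketch of the encode/decode structure in the diagram. The single hidden layer, sigmoid activations, layer sizes, and all names here are assumptions for illustration; the talk treats activation functions as a hyperparameter and does not give the architecture details.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_autoencoder(n_inputs, n_hidden, seed=0):
    """Small random weights for one encoding and one decoding layer."""
    rng = np.random.default_rng(seed)
    return {
        "W_enc": rng.normal(0.0, 0.1, size=(n_inputs, n_hidden)),
        "b_enc": np.zeros(n_hidden),
        "W_dec": rng.normal(0.0, 0.1, size=(n_hidden, n_inputs)),
        "b_dec": np.zeros(n_inputs),
    }

def forward(params, x):
    """Return (hidden activations, reconstruction x') for a batch x."""
    hidden = sigmoid(x @ params["W_enc"] + params["b_enc"])        # encoding
    x_prime = sigmoid(hidden @ params["W_dec"] + params["b_dec"])  # decoding
    return hidden, x_prime

# Hypothetical example: one haplotype with 8 variants, 3 hidden units
params = init_autoencoder(n_inputs=8, n_hidden=3)
x = np.array([[0, 0, 1, 1, 0, 1, 0, 1]], dtype=float)
hidden, x_prime = forward(params, x)
```

The reconstruction has the same width as the input, so every variant position, typed or not, receives a predicted value.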

slide-11
SLIDE 11

Denoising autoencoder for image restoration

[Figure panels: Noise; Mask.]

Bigdeli, Siavash Arjomand, and Matthias Zwicker. "Image restoration using autoencoding priors." arXiv preprint arXiv:1703.09964 (2017).
Wang, Ruxin, and Dacheng Tao. "Non-local auto-encoder with collaborative stabilization for image restoration." IEEE Transactions on Image Processing 25.5 (2016): 2117-2129.

slide-12
SLIDE 12

Genotype imputation case study example

[Figure: the ground truth (whole genome sequencing) matrix is passed through a mask to give the masked input (genotype array), in which hidden positions are shown as "?".]

slide-13
SLIDE 13

Case study: 9p21.3 region of the genome

  • Length: 59846 bp
  • 846 genetic variants in reference panel (whole genome data)

  • Approx. 200 common variants
  • Approx. 600 rare variants
  • Only 17-47 variants in genotype array!!!
  • Strong association to coronary artery disease (CAD)
  • Genotyped and sequenced in many studies
slide-14
SLIDE 14

Training on the reference panel: Data augmentation strategy

[Figure: training workflow. Reference whole genome haplotypes pass through a mask to give the masked input; the autoencoder maps the masked input to the reconstructed output.]
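One way the masking step could be sketched: hide a random subset of variant columns so the autoencoder has to reconstruct them, and draw a new subset each time to augment the training data. Encoding hidden positions as 0, the mask rate, and the function name `mask_variants` are assumptions; the talk does not state how masked positions are encoded.

```python
import numpy as np

def mask_variants(reference, mask_rate, rng, missing_value=0.0):
    """Hide a random fraction of variant columns.

    Returns (masked_input, mask), where mask is True for hidden columns.
    """
    reference = np.asarray(reference, dtype=float)
    n_variants = reference.shape[1]
    n_masked = int(round(mask_rate * n_variants))
    masked_cols = rng.choice(n_variants, size=n_masked, replace=False)
    mask = np.zeros(n_variants, dtype=bool)
    mask[masked_cols] = True
    masked = reference.copy()
    masked[:, mask] = missing_value   # assumed encoding for "not typed"
    return masked, mask

# Toy reference panel: 5 haplotypes x 20 variants (random, for illustration)
rng = np.random.default_rng(0)
reference = rng.integers(0, 2, size=(5, 20))
masked_input, mask = mask_variants(reference, mask_rate=0.9, rng=rng)
```

Training pairs are then (masked_input, reference): the unmasked reference serves as the target for the reconstructed output.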

slide-15
SLIDE 15

Customized Sparsity Loss Function

  • Sparsity loss with Kullback-Leibler (KL) / cross entropy element:

$$D_{KL}(\rho \,\|\, \hat{\rho}) = \rho \log\frac{\rho}{\hat{\rho}} + (1 - \rho) \log\frac{1 - \rho}{1 - \hat{\rho}}$$

  • Customized loss adjusted for hidden activation sparsity:

$$loss = MSE + \beta \sum_{i=1}^{n} D_{KL}(i)$$

  • Mean Squared Error:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
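A minimal NumPy sketch of the loss on this slide: mean squared error plus a weighted KL sparsity penalty. It assumes the KL term is evaluated once per hidden unit, comparing a target sparsity `rho` with that unit's mean activation over the batch, as in a standard sparse autoencoder; the names and the clipping constant `eps` are illustrative choices.

```python
import numpy as np

def kl_divergence(rho, rho_hat, eps=1e-8):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    rho_hat = np.clip(rho_hat, eps, 1.0 - eps)   # avoid log(0)
    return (rho * np.log(rho / rho_hat)
            + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))

def sparse_loss(y, y_hat, hidden, rho, beta):
    """MSE + beta * sum of per-hidden-unit KL terms (rho must lie in (0, 1))."""
    mse = np.mean((y - y_hat) ** 2)
    rho_hat = hidden.mean(axis=0)   # mean activation of each hidden unit (assumed)
    return mse + beta * np.sum(kl_divergence(rho, rho_hat))

# Toy batch (hypothetical values): 2 samples, 2 outputs, 3 hidden units
y = np.array([[0.0, 1.0], [1.0, 0.0]])
hidden = np.full((2, 3), 0.05)
loss_perfect = sparse_loss(y, y.copy(), hidden, rho=0.05, beta=1.0)
loss_zeros = sparse_loss(y, np.zeros_like(y), hidden, rho=0.05, beta=1.0)
```

When the mean hidden activation equals the target sparsity, the penalty vanishes and the loss reduces to the MSE.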

slide-16
SLIDE 16

Hyperparameters to be optimized

  • β (weight of the sparsity term in the loss)
  • ρ (target sparsity of the hidden activations)
  • Activation functions
  • L1/L2 regularizers
  • Learning rate
  • Batch size
slide-17
SLIDE 17

Parallel Grid Search Hyperparameter optimization approach

  • Hyperparameter combinations: 100 × grid search samples, 10,000 × training grid samples
  • 9 GPUs available: 7 GTX 1080, 1 Titan V, 1 Titan Xp
  • 860 hours (sequential run, 100 epochs)
  • Trained model performance: accuracy, loss, sparsity, MSE

[Figure: grid search samples 1 … 100 and training grid samples 1 … 10000 are distributed across the GPUs, and each trained model is scored.]
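A sketch of how a hyperparameter grid could be built and split across several GPU workers. The search values below are hypothetical (the talk does not list the ranges), and round-robin assignment is only one simple scheduling choice, not necessarily the one used here.

```python
from itertools import product

def build_grid(search_space):
    """All hyperparameter combinations as a list of dicts."""
    names = list(search_space)
    value_lists = [search_space[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

def assign_to_gpus(grid, n_gpus):
    """Round-robin assignment of combinations to per-GPU work queues."""
    queues = [[] for _ in range(n_gpus)]
    for index, combo in enumerate(grid):
        queues[index % n_gpus].append(combo)
    return queues

# Hypothetical search values, for illustration only
search_space = {
    "beta": [0.0, 0.01, 0.1, 1.0],
    "rho": [0.01, 0.05, 0.1],
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [64, 128, 256],
}
grid = build_grid(search_space)
queues = assign_to_gpus(grid, n_gpus=9)
```

Each queue would then be processed by one worker pinned to one GPU, recording accuracy, loss, sparsity, and MSE per trained model.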

slide-18
SLIDE 18

Grid Search Results: training accuracy

slide-19
SLIDE 19

Grid Search results: assessing best hyperparameter values

slide-20
SLIDE 20

Effect of hyperparameter values on training accuracy

[Figure: training accuracy, measured as Pearson correlation (r2), versus β, ρ, and learning rate.]
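For reference, the squared Pearson correlation between ground-truth and predicted genotypes could be computed as below. This is a generic sketch; whether the talk computes r2 per variant, per sample, or over the whole matrix is not stated, so pooling all values here is an assumption.

```python
import numpy as np

def pearson_r2(truth, predicted):
    """Squared Pearson correlation over all values; NaN if either input is constant."""
    truth = np.ravel(truth).astype(float)
    predicted = np.ravel(predicted).astype(float)
    if truth.std() == 0 or predicted.std() == 0:
        return float("nan")
    r = np.corrcoef(truth, predicted)[0, 1]
    return r * r

# Toy examples (hypothetical): perfect prediction vs. unrelated prediction
r2_perfect = pearson_r2([0, 1, 0, 1], [0, 1, 0, 1])
r2_unrelated = pearson_r2([0, 0, 1, 1], [0, 1, 0, 1])
```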

slide-24
SLIDE 24

Optimizing batch size: training accuracy

[Figure: accuracy and loss versus learning steps for 10, 50, 100, and 1000 batches.]

slide-25
SLIDE 25

Optimizing batch size: training run time

[Figure: run time (hours) and accuracy for 10, 50, 100, and 1000 batches.]

slide-26
SLIDE 26

Testing on multiple case studies

  • Atherosclerosis Risk in Communities (ARIC)
      • More than 3000 samples
      • Whole genome sequencing (846 variants, 0% mask, ground truth)
      • Affymetrix 6.0 genotype array (17 variants, 98% mask, input data)
  • Framingham Heart Study (FHS)
      • More than 500 samples
      • Whole genome sequencing (846 variants, 0% mask, ground truth)
      • Illumina 500K genotype array (47 variants, 95% mask, input data)
      • Illumina 5M (93 variants, 89% mask, input data)
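The mask percentages above follow from the variant counts: the mask is the share of the 846 reference-panel variants that the array does not type. A short worked sketch (the function and variable names are illustrative) gives roughly 98%, 94 to 95%, and 89% for the three arrays.

```python
def mask_percent(n_typed, n_reference):
    """Percentage of reference-panel variants missing from the array."""
    return 100.0 * (1.0 - n_typed / n_reference)

N_REFERENCE = 846  # variants in the 9p21.3 reference panel
array_variants = {"Affymetrix 6.0": 17, "Illumina 500K": 47, "Illumina 5M": 93}
mask_by_array = {
    name: mask_percent(n_typed, N_REFERENCE)
    for name, n_typed in array_variants.items()
}
```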
slide-27
SLIDE 27

Accuracy in additional case studies: proposed approach versus common statistical methodology

[Figure panels: performance on rare variants; performance on all variants.]

slide-28
SLIDE 28

Accuracy in additional case studies: proposed approach versus common statistical methodology

[Figure panels: performance on common variants; performance on all variants.]

slide-29
SLIDE 29

Run time: proposed approach versus common statistical methodology

slide-30
SLIDE 30

Linkage disequilibrium structure: ARIC

[Figure: linkage disequilibrium (LD) r2 structure, prediction versus ground truth, for all variants, rare variants, and common variants.]

slide-31
SLIDE 31

Linkage disequilibrium structure: FHS

[Figure: linkage disequilibrium (LD) r2 structure, prediction versus ground truth, for all variants, rare variants, and common variants.]

slide-32
SLIDE 32

Interpretability: identifying representative genetic variants

Maximal information criteria

slide-33
SLIDE 33

Conclusions

  • Grid search was able to find high accuracy models (>0.90)
  • Hyperparameters played an important role in training performance
  • Reconstruction of genetic variants from very sparse data with high accuracy (>0.80)

  • Superior computational performance, faster predictions
  • Fine parameter tuning may be necessary
slide-34
SLIDE 34

Future steps

  • Expand to other genomic regions, fine parameter tuning
  • Use imputation autoencoder results as input for polygenic risk score calculation

[Figure: two networks in sequence. The imputation autoencoder (input layer, hidden layer) outputs imputed genotypes; a feed forward neural network (input layer, hidden layer) takes them as input and outputs the predicted genetic risk.]

slide-35
SLIDE 35

Future steps

  • Focal loss to compensate for rare variants

Lin, Tsung-Yi, et al. "Focal loss for dense object detection." Proceedings of the IEEE International Conference on Computer Vision. 2017.
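A minimal sketch of the binary focal loss from Lin et al., FL = -α_t (1 - p_t)^γ log(p_t), which down-weights easy examples so that hard, rare cases contribute more. How it would be wired into the imputation autoencoder is not described in the talk; the defaults γ = 2 and α = 0.25 are the values reported in that paper, and the function name is illustrative.

```python
import numpy as np

def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over all elements."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    p_t = np.where(y_true == 1, p_pred, 1.0 - p_pred)     # probability of the true class
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)   # class balancing weight
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

# Toy check (hypothetical values): a confident correct call vs. a confident wrong call
loss_easy = binary_focal_loss([1], [0.9])
loss_hard = binary_focal_loss([1], [0.1])
```

With gamma = 0 the modulating factor disappears and the expression reduces to α-weighted cross entropy.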
slide-36
SLIDE 36

Limitations: expanding the methodology to other genomic regions

slide-37
SLIDE 37

Acknowledgements

Johnny Israeli, Carla Leibowitz, Fernanda Foertter, Brian Welker, David Nola, Ali Torkamani PhD, Shang-Fu Chen, Elias Salfati PhD, Doug Evans, Shuchen Liu, Alex Lippman, Nathan W. PhD, Emily Spencer PhD, Eric Topol MD

Contact: raqueld@scripps.edu, @RaquelDiasSRTI atorkama@scripps.edu, @Atorkamani

NVIDIA collaborators

Funding: NIH/NCRR flagship CTSA Grant

slide-38
SLIDE 38

Thanks for your attention!!