SLIDE 1

A deep learning based approach for genetic risk prediction

Raquel Dias, PhD, Senior Staff Scientist, Scripps Research Translational Institute
raqueld@scripps.edu, @RaquelDiasSRTI

Ali Torkamani, PhD
atorkama@scripps.edu, @ATorkamani

SLIDE 2

Whole Genome Sequencing vs. Genotype array

Full Data (whole genome sequencing)

0 0 1 0 0 0 0 1 1 1 1 0 1 1 0 1
0 0 1 0 0 0 0 1 1 1 1 0 1 1 0 1
0 1 0 1 0 1 0 1 1 1 1 0 1 1 0 1
0 0 1 0 0 0 0 1 1 0 0 1 0 1 1 0
0 1 0 1 0 1 0 1 1 1 1 0 1 1 0 1
0 1 0 1 1 1 0 1 1 1 1 0 1 1 0 1

Sparse Data (genotype array)

0 0 ? ? ? 0 0 1 1 ? ? 0 ? ? ? ?
0 0 ? ? ? 0 0 1 1 ? ? 0 ? ? ? ?
0 1 ? ? ? 1 0 1 1 ? ? 0 ? ? ? ?
0 0 ? ? ? 0 0 1 1 ? ? 1 ? ? ? ?
0 1 ? ? ? 1 0 1 1 ? ? 0 ? ? ? ?
0 1 ? ? ? 1 0 1 1 ? ? 0 ? ? ? ?

SLIDE 3

Whole Genome Sequencing vs. Genotype array

Full Data: ~80 million genetic variants
Sparse Data: ~4 million genetic variants

[Figure: the same full and sparse genotype matrices as on slide 2]

SLIDE 4

Genetic imputation problem

Reference haplotypes: HapMap or 1,000 Genomes (whole genome)

... 0 0 0 1 1 1 0 0 1 1 0 0 0 1 1 1 1 ...
... 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 ...
... 0 1 0 1 0 1 0 1 0 1 0 1 1 0 0 1 1 ...
... 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 ...

Study genotypes: cases and controls typed on a genotype array

0 0 ? 0 0 1 1 ? 0 ? ? ? 0 ? 1 ? 1 (same pattern in every row shown)

Prediction

SLIDE 5

A typical imputation approach

Reference panel: Multiethnic Haplotype Reference Consortium (HRC)
Study genotypes
Linkage disequilibrium (LD r2) structure
Mapping
Prediction

[Figure: study genotype matrix with "?" at untyped positions and reference haplotype matrix, as on slide 4]

SLIDE 6

A typical imputation approach

Reference panel: Multiethnic Haplotype Reference Consortium (HRC)
Study genotypes
Linkage disequilibrium (LD r2) structure
Mapping
Prediction

[Figure: study genotype matrix with the "?" positions filled in, next to the reference haplotype matrix]
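As a rough illustration of the LD r2 measure named on this slide, the sketch below computes the squared Pearson correlation between two variant columns of a 0/1 haplotype matrix. The toy matrix and the function name `ld_r2` are made up for illustration and are not taken from the HRC panel or from the authors' pipeline.

```python
import numpy as np

def ld_r2(a, b):
    """Squared Pearson correlation between two variant columns (0/1 alleles)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.std() == 0 or b.std() == 0:
        return 0.0  # monomorphic variant: correlation is undefined, report 0
    r = np.corrcoef(a, b)[0, 1]
    return float(r ** 2)

# Toy haplotypes: rows are haplotypes, columns are variants (illustrative values).
haps = np.array([
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
])
r2_perfect = ld_r2(haps[:, 0], haps[:, 1])  # identical columns
r2_none = ld_r2(haps[:, 0], haps[:, 2])     # uncorrelated columns
```

Variants in strong LD with a typed variant are the ones a reference panel can help recover; variants with r2 near zero carry little information about each other.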

SLIDE 7

Polygenic Risk Score (PRS)

SLIDE 8

Polygenic Risk Calculation

Design

100,000+ subjects

Results

Millions of known variants

Polygenic Risk Score

Cumulative sum w/ Trait* w/o Trait

Σ

*Trait can often be heterogeneous, e.g. coronary artery disease = heart attack, stroke, bypass surgery, etc.
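The cumulative sum on this slide is commonly computed as a weighted sum of risk-allele counts. A minimal sketch under that assumption follows; the per-variant weights and the genotype dosages are hypothetical numbers, not results from any study.

```python
import numpy as np

def polygenic_risk_score(dosages, weights):
    """Weighted sum of allele counts: one score per subject (row)."""
    dosages = np.asarray(dosages, dtype=float)  # shape (subjects, variants), values 0/1/2
    weights = np.asarray(weights, dtype=float)  # shape (variants,), per-variant effect sizes
    return dosages @ weights

# Hypothetical effect sizes for three variants and dosages for two subjects.
weights = np.array([0.10, -0.05, 0.20])
dosages = np.array([
    [0, 1, 2],
    [2, 0, 0],
])
scores = polygenic_risk_score(dosages, weights)
```

Because every variant contributes to the sum, a missing genotype removes its term from the score, which is one reason imputation quality matters for PRS.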

SLIDE 9

Objectives

1. More accurate and faster imputation
2. Find important genetic variants
3. Better polygenic risk score calculation

SLIDE 10

Our proposed approach

[Diagram: autoencoder with input layer (x), encoding, hidden layer, decoding, and output layer (x′)]
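To make the encode/decode structure concrete, here is a minimal sketch of a single-hidden-layer autoencoder forward pass in NumPy. The layer sizes, the sigmoid activations, and the random (untrained) weights are all assumptions for illustration; the slides do not specify the architecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode x into a hidden representation, then decode it back to x'."""
    hidden = sigmoid(x @ w_enc + b_enc)        # encoding step
    x_prime = sigmoid(hidden @ w_dec + b_dec)  # decoding step
    return hidden, x_prime

rng = np.random.default_rng(0)
n_variants, n_hidden = 16, 8  # assumed sizes
w_enc = rng.normal(scale=0.1, size=(n_variants, n_hidden))
b_enc = np.zeros(n_hidden)
w_dec = rng.normal(scale=0.1, size=(n_hidden, n_variants))
b_dec = np.zeros(n_variants)

x = rng.integers(0, 2, size=(4, n_variants)).astype(float)  # 4 toy haplotypes
hidden, x_prime = autoencoder_forward(x, w_enc, b_enc, w_dec, b_dec)
```

With sigmoid outputs, each entry of `x_prime` can be read as a score between 0 and 1 for the alternate allele at that position.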

SLIDE 11

Denoising autoencoder for image restoration

Noise Mask

Bigdeli, Siavash Arjomand, and Matthias Zwicker. "Image restoration using autoencoding priors." arXiv preprint arXiv:1703.09964 (2017).
Wang, Ruxin, and Dacheng Tao. "Non-local auto-encoder with collaborative stabilization for image restoration." IEEE Transactions on Image Processing 25.5 (2016): 2117-2129.

SLIDE 12

Genotype imputation case study example

Ground truth (whole genome sequencing)
Masked input (genotype array)
Mask

[Figure: the same full and sparse genotype matrices as on slide 2, with the full matrix as ground truth and the sparse matrix as masked input]
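The masking step can be sketched as keeping only the columns a genotype array types and hiding the rest. In the sketch below, the two rows and the typed-column positions follow the example matrices on slide 2; the `-1` placeholder for "?" and the function name `mask_to_array` are assumptions, since the slides do not say how missing positions are encoded.

```python
import numpy as np

MISSING = -1  # assumed placeholder for a "?" (untyped) position

def mask_to_array(full, typed_columns):
    """Keep only the columns present on the genotype array; hide all others."""
    full = np.asarray(full)
    masked = np.full_like(full, MISSING)
    masked[:, typed_columns] = full[:, typed_columns]
    return masked

# Two ground-truth rows from the slide 2 example.
full = np.array([
    [0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1],
    [0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1],
])
typed_columns = [0, 1, 5, 6, 7, 8, 11]  # positions without "?" in the sparse example
masked = mask_to_array(full, typed_columns)
mask_fraction = float((masked == MISSING).mean())
```

Here 9 of 16 positions are hidden; on the real arrays discussed later in the deck the hidden fraction is much larger.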

SLIDE 13

Case study: 9p21.3 region of the genome

Length: 59846 bp
846 genetic variants in reference panel (whole genome data)

Approx. 200 common variants
Approx. 600 rare variants
Only 17-47 variants in genotype array!!!
Strong association to coronary artery disease (CAD)
Genotyped and sequenced in many studies

SLIDE 14

Training on the reference panel: Data augmentation strategy

Reference Whole Genome
Mask
Masked input
Autoencoder
Reconstructed Output

[Figure: 0/1 haplotype matrices for the reference whole genome, the masked input, and the reconstructed output]
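One plausible reading of the augmentation strategy is that each training batch gets a freshly drawn mask, so the model sees many sparse versions of the same reference haplotypes. The sketch below assumes exactly that; the mask rate, the `-1.0` missing-value encoding, and the function name are hypothetical and may differ from what the authors did.

```python
import numpy as np

MISSING = -1.0  # assumed encoding for a masked ("?") position

def random_masked_batch(reference, mask_rate, rng):
    """Hide a random subset of variant columns; return (masked input, target, kept columns)."""
    reference = np.asarray(reference, dtype=float)
    keep = rng.random(reference.shape[1]) >= mask_rate  # True = variant stays visible
    masked = np.where(keep, reference, MISSING)
    return masked, reference, keep

rng = np.random.default_rng(42)
reference = rng.integers(0, 2, size=(3, 16))  # toy reference haplotypes

# Draw a new mask for each of five batches; the target is always the unmasked reference.
pairs = [random_masked_batch(reference, mask_rate=0.9, rng=rng) for _ in range(5)]
masked, target, keep = pairs[0]
```

Training against the unmasked target with varying masks is what makes this a denoising setup: the network is asked to reconstruct positions it never sees in the input.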

SLIDE 15

Customized Sparsity Loss Function

Sparsity loss with Kullback-Leibler (KL) / cross entropy element:

$$D_{KL}(\rho \,\|\, \hat{\rho}) = \rho \log\frac{\rho}{\hat{\rho}} + (1 - \rho)\log\frac{1 - \rho}{1 - \hat{\rho}}$$

Customized loss adjusted for hidden activation sparsity:

$$loss = MSE + \beta \sum_{i=1}^{n} D_{KL}(i)$$

Mean Squared Error:

$$MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
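A minimal NumPy sketch of a loss of this shape follows: mean squared error plus a weighted KL sparsity penalty on the hidden activations. It assumes `rho` is the target mean activation, `rho_hat` is the observed mean activation of each hidden unit over the batch, and `beta` weights the penalty; these names and the batch-mean convention are assumptions rather than details confirmed by the slides.

```python
import numpy as np

def kl_sparsity(rho, rho_hat, eps=1e-8):
    """KL divergence between target sparsity rho and observed activation rho_hat."""
    rho_hat = np.clip(rho_hat, eps, 1 - eps)  # avoid log(0) and division by zero
    return rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))

def sparse_autoencoder_loss(y, y_hat, hidden, rho, beta):
    """MSE reconstruction error plus beta times the summed KL penalty over hidden units."""
    mse = np.mean((y - y_hat) ** 2)
    rho_hat = hidden.mean(axis=0)  # mean activation of each hidden unit over the batch
    return float(mse + beta * kl_sparsity(rho, rho_hat).sum())

y = np.array([[0.0, 1.0], [1.0, 0.0]])

# Perfect reconstruction and activations exactly at the target: loss is zero.
loss_zero = sparse_autoencoder_loss(y, y, np.full((2, 3), 0.05), rho=0.05, beta=3.0)

# Poor reconstruction and dense activations: both terms contribute.
loss_pos = sparse_autoencoder_loss(y, np.zeros_like(y), np.full((2, 3), 0.5), rho=0.05, beta=3.0)
```

The penalty grows as hidden units fire more often than the target `rho`, which pushes the network toward sparse hidden representations.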

SLIDE 16

Hyperparameters to be optimized

β (weight of the sparsity term in the loss)
ρ (target sparsity of the hidden activations)
Activation functions
L1/L2 regularizers
Learning rate
Batch size

SLIDE 17

Parallel Grid Search Hyperparameter optimization approach

10000 × training grid samples
100 × grid search samples (hyperparameter combinations)
9 GPUs available: 7 GTX 1080, 1 Titan V, 1 Titan Xp
860 hours (sequential run, 100 epochs)
Trained model performance: accuracy, loss, sparsity, MSE

[Diagram: grid samples 1 to 10000 distributed across the GPUs for training, with the performance of each trained model collected afterwards]
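The idea of spreading a grid search over several GPUs can be sketched as enumerating the hyperparameter combinations and dealing them out round-robin to worker queues. The grid values below are hypothetical, and the estimate of wall-clock time assumes nine equally fast GPUs, which is an idealization since the listed cards differ.

```python
from itertools import product

def assign_round_robin(combos, n_workers):
    """Deal hyperparameter combinations out to workers one at a time."""
    queues = [[] for _ in range(n_workers)]
    for i, combo in enumerate(combos):
        queues[i % n_workers].append(combo)
    return queues

# Hypothetical grid; the real value ranges are not given on the slide.
grid = {
    "beta": [0.1, 1.0, 10.0],
    "rho": [0.01, 0.05, 0.1],
    "learning_rate": [1e-4, 1e-3],
}
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
queues = assign_round_robin(combos, n_workers=9)

# Idealized wall-clock time if the 860 sequential hours split evenly over 9 GPUs.
ideal_hours = 860 / 9
```

Each queue can then be run by one process pinned to one GPU, with accuracy, loss, sparsity, and MSE recorded per combination.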

SLIDE 18

Grid Search Results: training accuracy

SLIDE 19

Grid Search results: assessing best hyperparameter values

SLIDE 20

Effect of hyperparameter values on training accuracy

[Figure: panels labeled β, ρ, and learning rate; accuracy reported as Pearson correlation (r2)]

SLIDE 21

Effect of hyperparameter values on training accuracy

[Figure: panels labeled β, ρ, and learning rate; accuracy reported as Pearson correlation (r2)]

SLIDE 22

Effect of hyperparameter values on training accuracy

[Figure: panels labeled β, ρ, and learning rate; accuracy reported as Pearson correlation (r2)]

SLIDE 23

Effect of hyperparameter values on training accuracy

[Figure: panels labeled β, ρ, and learning rate; accuracy reported as Pearson correlation (r2)]

SLIDE 24

Optimizing batch size: training accuracy

[Figure: accuracy and loss versus learning steps for 10, 50, 100, and 1000 batches]

SLIDE 25

Optimizing batch size: training run time

[Figure: run time (hours) and accuracy for 10, 50, 100, and 1000 batches]

SLIDE 26

Testing on multiple case studies

Atherosclerosis Risk in Communities (ARIC)
  More than 3000 samples
  Whole genome sequencing (846 variants, 0% mask, ground truth)
  Affymetrix 6.0 genotype array (17 variants, 98% mask, input data)
Framingham Heart Study (FHS)
  More than 500 samples
  Whole genome sequencing (846 variants, 0% mask, ground truth)
  Illumina 500K genotype array (47 variants, 95% mask, input data)
  Illumina 5M (93 variants, 89% mask, input data)
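The mask percentages follow from the number of typed variants out of the 846 in the region. A small sketch of that arithmetic is below; the function name is made up. Note that the 47-variant case works out to about 94.4%, which the slide reports as 95%.

```python
def mask_percent(n_typed, n_total=846):
    """Percentage of the region's variants that are absent from the array."""
    return 100.0 * (1 - n_typed / n_total)

typed_counts = {"Affymetrix 6.0": 17, "Illumina 500K": 47, "Illumina 5M": 93}
percents = {name: round(mask_percent(n)) for name, n in typed_counts.items()}
```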

SLIDE 27

Accuracy in additional case studies: Proposed approach versus a common statistical methodology

Performance: rare variants Performance: all variants

SLIDE 28

Accuracy in additional case studies: Proposed approach versus a common statistical methodology

Performance: common variants Performance: all variants

SLIDE 29

Run time: Proposed approach versus a common statistical methodology

SLIDE 30

Linkage disequilibrium structure: ARIC

[Figure: linkage disequilibrium (LD) r2 for prediction versus ground truth; panels for all variants, rare variants, and common variants]

SLIDE 31

Linkage disequilibrium structure: FHS

[Figure: linkage disequilibrium (LD) r2 for prediction versus ground truth; panels for all variants, rare variants, and common variants]

SLIDE 32

Interpretability: identifying representative genetic variants

Maximal information criteria

SLIDE 33

Conclusions

Grid search was able to find high accuracy models (>0.90)
Hyperparameters played an important role in training performance
Reconstruction of genetic variants from very sparse data with high accuracy (>0.80)

Superior computational performance, faster predictions
Fine parameter tuning may be necessary

SLIDE 34

Future steps

Expand to other genomic regions, fine parameter tuning
Use imputation autoencoder results as input for polygenic risk score calculation

[Diagram: imputation autoencoder (input layer x, hidden layer, output x′ = imputed genotypes) feeding a feed-forward neural network (input layer, hidden layer, output = predicted genetic risk)]

SLIDE 35

Future steps

Focal loss to compensate for rare variants

Lin, Tsung-Yi, et al. "Focal loss for dense object detection." Proceedings of the IEEE International Conference on Computer Vision. 2017.
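For reference, the binary focal loss from Lin et al. down-weights examples the model already classifies well, so the rare class contributes relatively more to the total. The NumPy sketch below uses the paper's default `gamma=2.0` and `alpha=0.25`; how it would be wired into the imputation autoencoder is not described on the slide, so treating the alternate allele as the positive class is an assumption.

```python
import numpy as np

def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-8):
    """Mean focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)              # probability given to the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)  # class weighting factor
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))

easy = binary_focal_loss([1], [0.9])  # confident and correct: heavily down-weighted
hard = binary_focal_loss([1], [0.1])  # confident and wrong: dominates the loss

# With gamma=0 and alpha=0.5 the focal loss reduces to half the usual cross entropy.
balanced = binary_focal_loss([1], [0.5], gamma=0.0, alpha=0.5)
```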

SLIDE 36

Limitations: expanding the methodology to other genomic regions

SLIDE 37

Acknowledgements

Johnny Israeli, Carla Leibowitz, Fernanda Foertter, Brian Welker, David Nola, Ali Torkamani PhD, Shang-Fu Chen, Elias Salfati PhD, Doug Evans, Shuchen Liu, Alex Lippman, Nathan W. PhD, Emily Spencer PhD, Eric Topol MD

Contact: raqueld@scripps.edu, @RaquelDiasSRTI; atorkama@scripps.edu, @Atorkamani

NVIDIA collaborators

Funding: NIH/NCRR flagship CTSA Grant

SLIDE 38

Thanks for your attention!!