A deep learning based approach for genetic risk prediction
Raquel Dias, PhD. Senior Staff Scientist Scripps Research Translational Institute raqueld@scripps.edu, @RaquelDiasSRTI Ali Torkamani, PhD. atorkama@scripps.edu, @ATorkamani
Full Data (whole genome sequencing) Sparse Data (genotype array)
[Figure: genotype matrices. Sparse data (genotype array) has untyped variants shown as ?; full data (whole genome sequencing) has every variant called as 0 or 1]
Full Data ~80M genetic variants Sparse Data ~4 million genetic
Reference haplotypes: HapMap or 1,000 Genomes (whole genome)
Study genotypes: cases and controls typed on a genotype array
Prediction: untyped variants (?) in the study genotypes are predicted from the reference haplotypes
[Figure: reference haplotype matrix (fully typed, 0/1) and study genotype matrix (untyped variants shown as ?)]
Reference panel: multiethnic Haplotype Reference Consortium (HRC)
Study genotypes: mapping to the reference panel, then prediction
[Figure: study genotype matrix before prediction (untyped variants shown as ?) and after prediction (all variants filled in), with the reference panel matrix]
Linkage disequilibrium (LD r²) structure
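As a small illustration of the LD r² measure named on this slide, the sketch below computes r² between two variants from 0/1 haplotype vectors. The function name `ld_r2` and the toy haplotypes are made up for illustration; the slides do not say how LD was computed in practice.

```python
import numpy as np

def ld_r2(a, b):
    """LD r^2 between two variants, each given as a 0/1 allele vector
    with one entry per haplotype."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    p_a, p_b = a.mean(), b.mean()          # allele frequencies
    p_ab = (a * b).mean()                  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                   # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:                         # a monomorphic variant has no defined r^2
        return float("nan")
    return d * d / denom

# Toy haplotypes (hypothetical values, not taken from any reference panel)
variant_1 = [0, 0, 1, 1, 0, 1]
variant_2 = [0, 0, 1, 1, 0, 1]   # identical to variant_1: perfect LD
variant_3 = [1, 0, 1, 0, 1, 0]   # only weakly correlated with variant_1

print(ld_r2(variant_1, variant_2))
print(ld_r2(variant_1, variant_3))
```

Variants in high r² with a typed variant are the ones an imputation method can usually recover most reliably.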
100,000+ subjects
Millions of known variants
Cumulative sum w/ Trait* w/o Trait
*Trait can often be heterogeneous, e.g. coronary artery disease = heart attack, stroke, bypass surgery, etc.
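The "cumulative sum" on this slide can be sketched as a weighted sum of risk-allele counts per subject. This is the common polygenic-risk-score convention and is an assumption here; the function name, the dosage matrix, and the effect sizes below are hypothetical.

```python
import numpy as np

def polygenic_risk_score(dosages, effect_sizes):
    """Cumulative risk per subject: sum over variants of
    (risk-allele count) * (per-variant effect size)."""
    dosages = np.asarray(dosages, dtype=float)            # shape (n_subjects, n_variants), values 0/1/2
    effect_sizes = np.asarray(effect_sizes, dtype=float)  # shape (n_variants,)
    return dosages @ effect_sizes

# Hypothetical example: 2 subjects, 3 variants
dosages = [[0, 1, 2],
           [2, 0, 0]]
effect_sizes = [0.10, 0.20, -0.05]   # made-up weights
scores = polygenic_risk_score(dosages, effect_sizes)
print(scores)
```

Comparing the score distributions of subjects with and without the trait is what the "w/ Trait" and "w/o Trait" labels suggest.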
[Figure: denoising autoencoder. Input 𝑥 → noise mask → input layer → encoding → hidden layer → decoding → output layer → reconstruction 𝑥′]
Bigdeli, Siavash Arjomand, and Matthias Zwicker. "Image restoration using autoencoding priors." arXiv preprint arXiv:1703.09964 (2017).
Wang, Ruxin, and Dacheng Tao. "Non-local auto-encoder with collaborative stabilization for image restoration." IEEE Transactions on Image Processing 25.5 (2016): 2117-2129.
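A minimal sketch of the denoising-autoencoder forward pass in the diagram: a noise mask corrupts the input 𝑥, an encoder maps it to a hidden layer, and a decoder produces the reconstruction 𝑥′. The layer sizes, sigmoid activations, weight initialization, and the choice to set masked entries to 0 are all assumptions for illustration, not the architecture used in the talk.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_inputs, n_hidden, rng, scale=0.1):
    """Randomly initialized single-hidden-layer autoencoder (hypothetical sizes)."""
    return {
        "W_enc": rng.normal(0.0, scale, (n_inputs, n_hidden)),
        "b_enc": np.zeros(n_hidden),
        "W_dec": rng.normal(0.0, scale, (n_hidden, n_inputs)),
        "b_dec": np.zeros(n_inputs),
    }

def apply_noise_mask(x, mask_rate, rng):
    """Zero out a random fraction of entries; returns corrupted input and the keep-mask."""
    keep = rng.random(x.shape) >= mask_rate
    return x * keep, keep

def forward(params, x_masked):
    """Encode to the hidden layer, then decode to a reconstruction in (0, 1)."""
    hidden = sigmoid(x_masked @ params["W_enc"] + params["b_enc"])
    x_prime = sigmoid(hidden @ params["W_dec"] + params["b_dec"])
    return hidden, x_prime

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(4, 16)).astype(float)   # 4 toy samples, 16 variants
params = init_params(n_inputs=16, n_hidden=8, rng=rng)
x_masked, keep = apply_noise_mask(x, mask_rate=0.5, rng=rng)
hidden, x_prime = forward(params, x_masked)
print(x_prime.shape)
```

Training would compare 𝑥′ with the uncorrupted 𝑥, so the network learns to fill in the masked entries.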
Ground truth (whole genome sequencing); masked input (genotype array)
Training: reference whole genome → mask → masked input → autoencoder → reconstructed output
[Figure: ground-truth genotype matrix (0/1) and masked input matrix (untyped variants shown as ?)]
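The masking step can be sketched as follows: starting from fully typed reference genotypes, keep only the positions present on a genotype array and mark everything else as missing, which yields (masked input, ground truth) training pairs. The function name, the `-1` missing-value code, and the list of typed positions are hypothetical choices for illustration.

```python
import numpy as np

def mask_to_array_positions(reference, typed_positions, missing_value=-1):
    """Simulate genotype-array data: keep typed positions, mark the rest as missing."""
    reference = np.asarray(reference)
    masked = np.full_like(reference, missing_value)
    masked[:, typed_positions] = reference[:, typed_positions]
    return masked

# Toy example: one fully typed 16-variant row and an assumed set of array positions
reference = np.array([[0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1]])
typed_positions = [0, 1, 5, 6, 7, 8, 11]
masked_input = mask_to_array_positions(reference, typed_positions)
print(masked_input)
```

The untouched `reference` row serves as the ground truth that the reconstructed output is scored against.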
Sparsity loss with Kullback-Leibler (KL) / cross entropy element:
\[ D_{KL}(\rho \,\|\, \hat{\rho}) = \rho \log\frac{\rho}{\hat{\rho}} + (1 - \rho) \log\frac{1 - \rho}{1 - \hat{\rho}} \]
Customized loss adjusted for hidden activation sparsity:
\[ loss = MSE + \beta \sum_{i=1}^{n} D_{KL}(i) \]
Mean Squared Error:
\[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
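The loss on this slide can be sketched in NumPy, reading the symbols as target sparsity ρ, observed sparsity ρ̂, and sparsity weight β. Taking ρ̂ as the batch-mean activation of each hidden unit and summing the KL term over hidden units is the usual sparse-autoencoder convention and is an assumption here; the function names are hypothetical.

```python
import numpy as np

def kl_divergence(rho, rho_hat, eps=1e-8):
    """Elementwise KL divergence between Bernoulli(rho) and Bernoulli(rho_hat).
    Requires 0 < rho < 1; rho_hat is clipped away from 0 and 1 for stability."""
    rho_hat = np.clip(rho_hat, eps, 1 - eps)
    return rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))

def mse(y, y_hat):
    """Mean squared error over all entries."""
    return np.mean((np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)) ** 2)

def sparse_loss(y, y_hat, hidden, rho, beta):
    """loss = MSE + beta * sum of KL terms (assumed: one term per hidden unit)."""
    rho_hat = np.asarray(hidden, dtype=float).mean(axis=0)  # mean activation per hidden unit
    return mse(y, y_hat) + beta * np.sum(kl_divergence(rho, rho_hat))

# Toy check with made-up values: perfect reconstruction and activations at the target sparsity
y = np.array([[0.0, 1.0], [1.0, 0.0]])
hidden = np.full((2, 3), 0.05)
print(sparse_loss(y, y, hidden, rho=0.05, beta=1.0))
```

The KL term is zero when the observed sparsity matches ρ and grows as hidden units become more active than the target, which is what pushes the hidden layer toward sparse activations.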
10000 X training grid samples
100 X grid search samples
Hyperparameter combinations
9 GPUs available
860 hours (sequential run, 100 epochs)
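One simple way to spread a grid search over several GPUs is round-robin assignment of hyperparameter combinations, sketched below for 100 combinations and 9 workers. The hyperparameter names (β, ρ, learning rate) and all grid values are assumptions for illustration; the slides do not describe the actual scheduling scheme.

```python
from itertools import product

def assign_round_robin(jobs, n_workers):
    """Distribute jobs across workers in turn: job i goes to worker i mod n_workers."""
    buckets = [[] for _ in range(n_workers)]
    for index, job in enumerate(jobs):
        buckets[index % n_workers].append(job)
    return buckets

# Hypothetical grid: 5 x 5 x 4 = 100 combinations
betas = [0.001, 0.01, 0.1, 1.0, 10.0]
rhos = [0.001, 0.01, 0.05, 0.1, 0.5]
learning_rates = [1e-4, 1e-3, 1e-2, 1e-1]

grid = [
    {"beta": b, "rho": r, "learning_rate": lr}
    for b, r, lr in product(betas, rhos, learning_rates)
]
per_gpu = assign_round_robin(grid, n_workers=9)
print([len(jobs) for jobs in per_gpu])
```

With 100 jobs on 9 workers, one worker receives 12 jobs and the rest 11, so the wall-clock time is bounded by the busiest worker rather than by the sequential total.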
10000 X training grid samples → trained model performance: accuracy, loss, sparsity, MSE
[Figure: grid search results, four panels. Labels: β, ρ, learning rate; metric: Pearson correlation (r²)]
[Figure: accuracy and loss vs. learning steps for 10, 50, 100, and 1000 batches]
[Figure: accuracy vs. run time (hours) for 10, 50, 100, and 1000 batches]
[Figure: performance panels for rare variants, common variants, and all variants]
[Figure: prediction vs. ground truth for all variants, rare variants, and common variants; linkage disequilibrium (LD) r²]
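A sketch of how prediction-versus-ground-truth accuracy might be summarized separately for rare, common, and all variants, using per-variant squared Pearson correlation. The rare/common cutoff (`rare_threshold`), the function names, and the toy data are hypothetical; the slides do not state the threshold that was used.

```python
import numpy as np

def per_variant_r2(truth, predicted):
    """Squared Pearson correlation between truth and prediction, one value per variant (column)."""
    truth = np.asarray(truth, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    t = truth - truth.mean(axis=0)
    p = predicted - predicted.mean(axis=0)
    cov = (t * p).sum(axis=0)
    denom = np.sqrt((t ** 2).sum(axis=0) * (p ** 2).sum(axis=0))
    with np.errstate(divide="ignore", invalid="ignore"):
        r = np.where(denom > 0, cov / denom, np.nan)   # undefined for constant columns
    return r ** 2

def _mean_or_nan(values):
    values = values[~np.isnan(values)]
    return float(values.mean()) if values.size else float("nan")

def summarize_by_frequency(truth, predicted, rare_threshold=0.01):
    """Mean r^2 for all, rare, and common variants, split by minor allele frequency."""
    truth = np.asarray(truth, dtype=float)
    freq = truth.mean(axis=0)                 # allele frequency for 0/1 haplotype data
    maf = np.minimum(freq, 1 - freq)
    r2 = per_variant_r2(truth, predicted)
    rare = maf < rare_threshold
    return {
        "all": _mean_or_nan(r2),
        "rare": _mean_or_nan(r2[rare]),
        "common": _mean_or_nan(r2[~rare]),
    }

# Toy data: 4 haplotypes, 2 variants; the threshold is raised so the first variant counts as rare
truth = [[0, 0], [0, 1], [0, 0], [1, 1]]
predicted = [[0.1, 0.0], [0.0, 1.0], [0.2, 0.0], [0.9, 1.0]]
summary = summarize_by_frequency(truth, predicted, rare_threshold=0.3)
print(summary)
```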
risk score calculation
[Figure: two-stage pipeline. Imputation autoencoder (input layer, hidden layer) → imputed genotypes → feed-forward neural network (input layer, hidden layer) → predicted genetic risk]
Lin, Tsung-Yi, et al. "Focal loss for dense object detection." Proceedings
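For reference, a minimal NumPy sketch of the binary focal loss from the cited paper, FL(p_t) = -α_t (1 - p_t)^γ log(p_t), which down-weights well-classified examples. How the loss is applied in this work is not stated on the slide, so the function name, the binary formulation, and the default γ and α below are assumptions for illustration.

```python
import numpy as np

def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)                  # probability assigned to the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)      # class-balancing weight
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))

# Made-up predictions: a confident correct call contributes far less than an uncertain one
confident = binary_focal_loss([1], [0.9])
uncertain = binary_focal_loss([1], [0.6])
print(confident, uncertain)
```

With γ = 0 and α = 0.5 the expression reduces to half the ordinary binary cross entropy, which is a convenient sanity check.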
Johnny Israeli, Carla Leibowitz, Fernanda Foertter, Brian Welker, David Nola, Ali Torkamani PhD, Shang-Fu Chen, Elias Salfati PhD, Doug Evans, Shuchen Liu, Alex Lippman, Nathan W. PhD, Emily Spencer PhD, Eric Topol MD
Contact: raqueld@scripps.edu, @RaquelDiasSRTI atorkama@scripps.edu, @Atorkamani
NVIDIA collaborators
Funding: NIH/NCRR flagship CTSA Grant