Diet Networks: Thin Parameters for Fat Genomics

  1. Diet Networks: Thin Parameters for Fat Genomics
     Adriana Romero, Pierre Luc Carrier, Akram Erraqabi, Tristan Sylvain, Alex Auvolat, Etienne Dejoie, Marc-André Legault, Marie-Pierre Dubé, Julie G. Hussin, Yoshua Bengio
     Institut des algorithmes d’apprentissage de Montréal

  2. Outline
     • Motivation & Challenges
     • Deep Learning Architectures
     • Diet Networks
     • Results
     • Wrap up and future research directions

  3. Motivation & Challenges

  4. Motivation: Deep Learning
     [Figures: Zhou et al., 2015; from http://quincypublicschools.com/library/contact-school/book-stack/; from https://en.wikipedia.org/wiki/GeForce_10_series]

  5. Motivation: Deep Learning and Genomics
     [Figures: from http://quincypublicschools.com/library/contact-school/book-stack/; from https://www.genome.gov/sequencingcostsdata/; from https://en.wikipedia.org/wiki/GeForce_10_series]

  6. Genomic Data as Fat Data
     • Target millions of simple variants across the genome (SNPs).
     • Number of participants limited, even for large datasets.
     • Fat data: dire imbalance between # samples (participants) and # input features (SNPs).

  7. Challenges: Parameter explosion
     • Linear classifier with weights W (parameters): # inputs = # parameters.
     • Naive setup for SNP data: # inputs = hundreds of thousands, # samples = thousands, so # samples << # parameters.
     • In deep networks: the # parameters in the 1st layer grows linearly with the # inputs.
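
The imbalance on this slide can be made concrete with a small count. This is only a sketch: the input size (300,000 SNPs), the sample count (3,000) and the hidden width (100) are assumed round numbers chosen for illustration, and `first_layer_params` is a hypothetical helper.

```python
def first_layer_params(n_inputs: int, n_hidden: int, bias: bool = False) -> int:
    """Free parameters of a fully connected layer with n_inputs -> n_hidden units."""
    return n_inputs * n_hidden + (n_hidden if bias else 0)


n_snps = 300_000    # assumed: "hundreds of thousands" of inputs
n_samples = 3_000   # assumed: "thousands" of samples

# Linear classifier with a single output: one weight per input.
linear = first_layer_params(n_snps, 1)

# First layer of a deep network with an assumed width of 100 hidden units.
deep = first_layer_params(n_snps, 100)

print(f"linear classifier: {linear:,} parameters")
print(f"deep first layer:  {deep:,} parameters")
print(f"parameters per training sample: {deep / n_samples:,.0f}")
```

Doubling the number of inputs doubles both counts, which is the linear growth the slide refers to.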

  8. Challenges: Overfitting
     [Figure from https://shapeofdata.wordpress.com/2013/03/26/general-regression-and-over-fitting/]

  9. Challenges: The curse of dimensionality
     • Considering 300K SNPs with 3 possible values each (0, 1, 2): 3^300K combinations!
     [Figure from http://nikhilbuduma.com/2015/03/10/the-curse-of-dimensionality/]
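
To get a feel for the size of 3^300K, the sketch below counts its decimal digits with logarithms instead of building the number. It assumes exactly 300,000 SNPs; the variable names are illustrative.

```python
import math

n_snps = 300_000       # assumed exact value for "300K"
values_per_snp = 3     # genotypes 0, 1, 2

# log10(3^n) = n * log10(3); the digit count is floor(log10) + 1.
log10_combinations = n_snps * math.log10(values_per_snp)
n_digits = math.floor(log10_combinations) + 1

print(f"3^{n_snps} has {n_digits:,} decimal digits")
```

Any realistic cohort covers a vanishing fraction of that space, which is why the raw genotype space cannot be sampled densely.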

  10. Why deep learning?
      • Capturing information directly from the raw input data is not trivial and often involves complex and non-linear functions.
      • Many problems become easier if the input data is transformed into a representation that emphasizes its most relevant characteristics.

  11. Multi-Layer Perceptron (MLP)
      • Describe data as a hierarchy of concepts.
      • Supervised learning: desired output values are provided.
      • Unsupervised learning: aims to discover hidden structure in the input data.

  12. CNN: reducing the number of parameters
      • Parameter sharing.
      • Exploit spatially local correlations.
      • Suitable for data with a grid-like topology.
      • Problem: when the full DNA sequence is unavailable, other types of methods seem more appropriate.

  13. Diet Networks

  14. The idea
      • Use a novel neural network reparametrization, which considerably reduces the number of free parameters when the input is very high-dimensional and orders of magnitude larger than the number of training samples.

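  15. The model
      • Input data: N x F, with N << F.
      • Main network (MLP): input = 1 sample (1 x F).
      • Auxiliary network (Emb. followed by MLP): input = 1 feature (SNP) (1 x N); its output forms an F x 100 matrix.
      [Figure: architecture diagram. Size labels in the figure: 300K, 30M, 50K, 500, 100.]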
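
A minimal sketch of the reparametrization idea, not the authors' implementation: a small auxiliary function maps each feature's embedding to one row of the first-layer weight matrix, so the free parameters live in the auxiliary function rather than in an F x H matrix. Everything here is assumed for illustration: the toy sizes, the single tanh layer used as the auxiliary network, the raw column of X used as the feature embedding, and the names `aux_net` and `diet_first_layer`.

```python
import math
import random

random.seed(0)


def aux_net(embedding, A):
    """Hypothetical one-layer auxiliary net: embedding (length E) -> one row of W (length H)."""
    n_hidden = len(A[0])
    return [math.tanh(sum(e * A[i][h] for i, e in enumerate(embedding)))
            for h in range(n_hidden)]


def diet_first_layer(x, feature_embeddings, A):
    """Hidden activations for one sample x, with W predicted instead of stored."""
    W = [aux_net(emb, A) for emb in feature_embeddings]  # F x H, derived from A
    n_hidden = len(A[0])
    return [math.tanh(sum(x[j] * W[j][h] for j in range(len(x))))
            for h in range(n_hidden)]


# Toy data: N samples x F features with genotypes in {0, 1, 2}, N << F.
N, F, H = 4, 50, 3
X = [[random.choice([0, 1, 2]) for _ in range(F)] for _ in range(N)]

# Assumed "raw" embedding of feature j: column j of X (length N).
feature_embeddings = [[X[i][j] for i in range(N)] for j in range(F)]

# The only free parameters of this first layer: A, of shape N x H.
A = [[random.gauss(0.0, 0.1) for _ in range(H)] for _ in range(N)]

hidden = diet_first_layer(X[0], feature_embeddings, A)
free_params_diet = N * H    # parameters in the auxiliary net
free_params_dense = F * H   # parameters a plain dense layer would need
print(len(hidden), free_params_diet, free_params_dense)
```

In this toy setting the free-parameter count depends on the embedding size and hidden width, not on the number of features F.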

  16. Embeddings
      • Raw (learnt embedding, end to end training)
      • Per class histograms
      [Figure: the embedding (Emb.) feeding the MLPs]

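  17. Per class histogram
      • One SNP across 8 individuals, with genotypes 0 0 2 1 0 2 0 1.
      • Class 1: 1 x 0, 2 x 1, 1 x 2, giving frequencies 0.25, 0.50, 0.25.
      • Class 2: 3 x 0, 0 x 1, 1 x 2, giving frequencies 0.75, 0, 0.25.
      • Embedding of the SNP: 0.25 0.50 0.25 0.75 0 0.25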
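
The histogram on this slide can be reproduced with a few lines. This is a sketch: the slide does not say which individual belongs to which class, so the `labels` list below is a hypothetical assignment chosen to match the stated counts, and `per_class_histogram` is an illustrative helper name.

```python
def per_class_histogram(genotypes, labels, n_classes, n_values=3):
    """Concatenate, class by class, the frequency of each genotype value for one SNP."""
    embedding = []
    for c in range(n_classes):
        in_class = [g for g, y in zip(genotypes, labels) if y == c]
        for v in range(n_values):
            # Assumes every class has at least one individual.
            embedding.append(in_class.count(v) / len(in_class))
    return embedding


genotypes = [0, 0, 2, 1, 0, 2, 0, 1]   # one SNP, 8 individuals (from the slide)
labels = [0, 1, 0, 0, 1, 1, 1, 0]      # hypothetical class assignment (0 = Class 1, 1 = Class 2)

embedding = per_class_histogram(genotypes, labels, n_classes=2)
print(embedding)  # [0.25, 0.5, 0.25, 0.75, 0.0, 0.25]
```

The embedding length is (number of classes) x 3, independent of the number of individuals.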

  18. The 1000 Genomes Project (1)
      • Large-scale comparison of DNA sequences from populations, thanks to the presence of genetic variations.
      • Represents 26 populations from 5 geographical regions, in total 3,450 individuals.
      • SNP inclusion/exclusion criteria:
          • Genetic variants with frequencies of at least 5%
          • Excluded SNPs positioned on sex chromosomes
          • Only included SNPs in approximate linkage equilibrium with each other
      • As a result, we obtained 315,345 SNPs, encoded as having 0, 1 or 2 copies of a genetic mutation (non-reference nucleotide).

  19. Experimental setup
      • Ethnicity prediction from SNPs on 1000 Genomes data.
      • Metric: misclassification error and number of free parameters.
      • 5-fold cross-validation.
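
The evaluation protocol can be sketched as follows. This is an assumed, simplified version: the slides do not say how folds were drawn (for example, whether they were stratified by population), so a plain shuffled split is used, and `k_fold_indices` and `misclassification_error` are hypothetical helper names.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def misclassification_error(y_true, y_pred):
    """Percentage of samples whose predicted label differs from the true label."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return 100.0 * wrong / len(y_true)


folds = k_fold_indices(n_samples=3450, k=5)
for i, test_idx in enumerate(folds):
    # Train on the other four folds, evaluate on the held-out fold.
    train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
    print(f"fold {i}: {len(train_idx)} train, {len(test_idx)} test")

# Toy check of the metric: one wrong prediction out of four.
print(misclassification_error([0, 1, 2, 1], [0, 2, 2, 1]))
```

Reporting the mean and standard deviation of the per-fold errors gives figures in the "error ± spread" form.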

  20. Quantitative results (1)

      Embedding                       Misclassification error (%)   # free parameters
      Without reconstruction
        Basic MLP                     8.31 ± 1.83                   31.5M
        Diet Networks (raw end2end)   7.62 ± 02                     227.3k
        Diet Networks (histograms)    6.90 ± 1.60                   18.0k
      With reconstruction
        Basic MLP                     7.76 ± 1.38                   63M
        Diet Networks (raw end2end)   6.85 ± 1.72                   534.8k
        Diet Networks (histograms)    7.01 ± 1.20                   28.1k

  21. Quantitative results (2)

      Embedding                       Misclassification error (%)
      Diet Networks (histograms)      7.01 ± 1.20
      PCA (10 PCs)                    20.56 ± 3.20
      PCA (50 PCs)                    12.29 ± 0.89
      PCA (100 PCs)                   10.52 ± 0.25
      PCA (200 PCs)                   9.33 ± 1.24

  22. Quantitative results (3)

  23. What is the network learning?
      [Figure: Input, Layer 1 (MLP), Layer 2 (MLP)]

  24. What is the network learning?
      [Figure panels: Raw input, Layer 1, Layer 2. Labels: Ethnicities]

  25. What is the network learning?
      [Figure panels: Raw input, Layer 1, Layer 2. Labels: Continents]

  26. Wrap up and future research directions

  27. Wrap up
      • We demonstrated the potential of deep learning models to tackle genomic-specific tasks.
      • The parameter explosion introduced by high dimensional genomic data can be mitigated by smart model parameterization, such as Diet Networks.

  28. What comes next…
      • Conducting genetic association studies, with emphasis on population-aware analyses of SNP data in disease cohorts.
      • Identifying the genetic basis of common diseases to achieve better patient risk prediction and improve our overall understanding of disease etiology.

  29. Diet Networks: Thin Parameters for Fat Genomics
      Thank you!
      Twitter: @adri_romsor
      Code: https://github.com/adri-romsor/DietNetworks