SLIDE 1

Institut des algorithmes d’apprentissage de Montréal

Diet Networks: Thin Parameters for Fat Genomics

Adriana Romero, Pierre Luc Carrier, Akram Erraqabi, Tristan Sylvain, Alex Auvolat, Etienne Dejoie, Marc-André Legault, Marie-Pierre Dubé, Julie G. Hussin, Yoshua Bengio

SLIDE 2

Outline

  • Motivation & Challenges
  • Deep Learning Architectures
  • Diet Networks
  • Results
  • Wrap up and future research directions
SLIDE 3

Motivation & Challenges

SLIDE 4

Motivation

Image credits: http://quincypublicschools.com/library/contact-school/book-stack/ ; https://en.wikipedia.org/wiki/GeForce_10_series

Deep Learning

Zhou et al., 2015

SLIDE 5

Motivation

Image credits: http://quincypublicschools.com/library/contact-school/book-stack/ ; https://en.wikipedia.org/wiki/GeForce_10_series ; https://www.genome.gov/sequencingcostsdata/

Deep Learning Genomics

SLIDE 6

Genomic Data as Fat Data

  • Targets millions of simple variants across the genome (SNPs).
  • The number of participants is limited, even for large datasets.

Dire imbalance between # samples and # input features: "fat data", with far more SNPs (features) than participants (samples).

SLIDE 7

Challenges: Parameter explosion

Linear classifier: # inputs = # parameters (the weight matrix W).

Naive setup for SNP data:

  • # inputs = hundreds of thousands
  • # samples = thousands
  • # samples << # parameters

In deep networks, the # parameters in the 1st layer grows linearly with the # inputs.
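As a rough illustration of the explosion, here is a back-of-the-envelope parameter count using the sizes quoted in these slides (a sketch, not the authors' code):

```python
# Back-of-the-envelope count of first-layer parameters for a dense network
# on SNP data, using the illustrative sizes from the slides.
n_inputs = 300_000    # SNPs (input features)
n_hidden = 100        # units in the first hidden layer
n_samples = 3_450     # individuals (1000 Genomes, per the later slides)

first_layer_params = n_inputs * n_hidden   # weight matrix W alone, biases ignored
print(first_layer_params)                  # 30000000 free parameters
print(first_layer_params // n_samples)     # ~8695 parameters per training sample
```

With nearly 9,000 parameters per labeled sample in the first layer alone, overfitting is all but guaranteed without some form of parameter reduction.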

SLIDE 8

Challenges: Overfitting

Image credit: https://shapeofdata.wordpress.com/2013/03/26/general-regression-and-over-fitting/

SLIDE 9

Challenges: The curse of dimensionality

Image credit: http://nikhilbuduma.com/2015/03/10/the-curse-of-dimensionality/

Considering:

  • 300K SNPs
  • 3 possible values (0, 1, 2)

3^300K combinations!
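That number is too large to print, but Python's arbitrary-precision integers let us count its decimal digits instead (a quick sanity check on the slide's figures):

```python
import math

n_snps = 300_000   # SNPs, each taking a value in {0, 1, 2}
n_values = 3

# 3**300000 is astronomically large; count its decimal digits instead.
n_digits = math.floor(n_snps * math.log10(n_values)) + 1
print(n_digits)    # 143137 digits -- vastly more combinations than any dataset
```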

SLIDE 10

Why deep learning?

  • Capturing information directly from the raw input data is not trivial and often involves complex, non-linear functions.
  • Many problems become easier if the input data is transformed into a representation that emphasizes its most relevant characteristics.

SLIDE 11

Multi-Layer Perceptron (MLP)

  • Describes data as a hierarchy of concepts.
  • Supervised learning: desired output values are provided.
  • Unsupervised learning: aims to discover hidden structure in the input data.

SLIDE 12

CNN - reducing the number of parameters

  • Parameter sharing.
  • Exploits spatially local correlations.
  • Suitable for data with a grid-like topology.

Problem: when the full DNA sequence is unavailable, other types of methods seem more appropriate.
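To make the parameter-sharing point concrete, compare a dense first layer with a 1-D convolution over the same input length (a toy count with illustrative sizes, not taken from the paper):

```python
# A convolutional layer's parameter count depends only on the kernel size,
# not on the input length -- that is what parameter sharing buys.
seq_len   = 300_000   # input positions (e.g. SNPs along the genome)
n_filters = 100
kernel    = 5         # illustrative kernel width
in_ch     = 1

dense_params = seq_len * n_filters          # fully connected first layer
conv_params  = kernel * in_ch * n_filters   # shared across all positions
print(dense_params, conv_params)            # 30000000 vs 500
```

The catch, as noted above: this sharing only helps when the input has a meaningful grid topology, which filtered SNP data generally lacks.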
SLIDE 13

Diet Networks

SLIDE 14

The idea

  • Use a novel neural network reparametrization that considerably reduces the number of free parameters when the input is very high-dimensional and orders of magnitude larger than the number of training samples.

SLIDE 15

The model

[Model diagram] The main MLP takes one sample (1×F) as input. Auxiliary embedding MLPs take one feature (SNP) across samples (1×N) as input and predict the F×100 parameter matrices of the main network's first layer (and of its reconstruction layer). Input data: N×F, with N << F. With F = 300K SNPs and 100 hidden units, a directly parameterized first layer would need 30M parameters.
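A minimal NumPy sketch of the reparametrization, with toy sizes. The auxiliary network is reduced here to a single linear map for brevity (the paper uses an MLP), and all variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, D, H = 50, 1_000, 26, 100   # samples, SNPs, embedding dim, hidden units (toy)

X = rng.integers(0, 3, size=(N, F)).astype(float)   # genotypes in {0, 1, 2}
feat_emb = rng.normal(size=(F, D))   # one D-dim embedding per SNP (e.g. histograms)

# Auxiliary predictor: maps each feature embedding to one row of the main
# network's first-layer weight matrix W, so W itself holds no free parameters.
V = rng.normal(size=(D, H))          # the only learned "fat layer" parameters
W = feat_emb @ V                     # F x H, predicted rather than learned

hidden = np.maximum(X @ W, 0.0)      # main MLP's first layer (ReLU)

print(F * H, D * H)                  # free params: 100000 direct vs 2600 via Diet
print(hidden.shape)                  # (50, 100)
```

The key point: the free-parameter count of the fat layer now scales with the embedding dimension D, not with the number of features F.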

SLIDE 16

Embeddings

  • Raw (learnt embedding, end-to-end training)
  • Per-class histograms

SLIDE 17

Per class histogram

Example for one SNP across individuals of two classes (genotype values in {0, 1, 2}):

Class 1: 1 × 0, 2 × 1, 1 × 2  →  histogram (0.25, 0.50, 0.25)
Class 2: 3 × 0, 0 × 1, 1 × 2  →  histogram (0.75, 0.00, 0.25)
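The per-class histogram embedding can be sketched in a few lines (the helper name is ours; the genotypes are chosen to match the slide's counts):

```python
import numpy as np

def per_class_histogram(snp_values, labels, n_values=3):
    """Embed one SNP as its genotype-value frequencies within each class."""
    emb = []
    for c in np.unique(labels):
        counts = np.bincount(snp_values[labels == c], minlength=n_values)
        emb.extend(counts / counts.sum())
    return np.array(emb)

# One SNP across 8 individuals, matching the slide's counts.
genotypes = np.array([0, 1, 1, 2, 0, 0, 0, 2])
labels    = np.array([1, 1, 1, 1, 2, 2, 2, 2])

emb = per_class_histogram(genotypes, labels)
print(emb)   # class 1 -> 0.25, 0.50, 0.25; class 2 -> 0.75, 0.00, 0.25
```

Applied to every SNP, this yields an F × (n_values × n_classes) embedding matrix that requires no training, unlike the raw end-to-end variant.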

SLIDE 18

The 1000 Genomes Project (1)

  • Large-scale comparison of DNA sequences from populations, thanks to the presence of genetic variations.
  • Represents 26 populations from 5 geographical regions, 3,450 individuals in total.
  • SNP inclusion/exclusion criteria:
      - Genetic variants with frequencies of at least 5%
      - Excluded SNPs positioned on sex chromosomes
      - Only included SNPs in approximate linkage equilibrium with each other
  • As a result, we obtained 315,345 SNPs, encoded as having 0, 1 or 2 copies of a genetic mutation (non-reference nucleotide).

SLIDE 19

Experimental setup

  • Ethnicity prediction from SNPs on 1000 Genomes data.
  • Metrics: misclassification error and number of free parameters.
  • 5-fold cross-validation.
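The evaluation protocol above can be sketched without any ML library (helper names are ours, not the authors' code):

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k (nearly) equal folds."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    return np.array_split(idx, k)

def misclassification_error(y_true, y_pred):
    """Percentage of wrongly predicted labels."""
    return 100.0 * np.mean(np.asarray(y_true) != np.asarray(y_pred))

folds = kfold_indices(3_450, k=5)   # 3,450 individuals, as in 1000 Genomes
print([len(f) for f in folds])      # [690, 690, 690, 690, 690]
print(misclassification_error([0, 1, 1], [0, 1, 0]))   # 33.33...
```

Each fold in turn serves as the held-out test set while the remaining four are used for training, and the reported error is averaged across the five runs.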
SLIDE 20

Quantitative results (1)

Embedding                      Misclassification error (%)   # free parameters

Without reconstruction
  Basic MLP                    8.31 ± 1.83                   31.5M
  Diet Networks (raw end2end)  7.62 ± 02                     227.3k
  Diet Networks (histograms)   6.90 ± 1.60                   18.0k

With reconstruction
  Basic MLP                    7.76 ± 1.38                   63M
  Diet Networks (raw end2end)  6.85 ± 1.72                   534.8k
  Diet Networks (histograms)   7.01 ± 1.20                   28.1k

SLIDE 21

Quantitative results (2)

Embedding                    Misclassification error (%)
Diet Networks (histograms)   7.01 ± 1.20
PCA (10 PCs)                 20.56 ± 3.20
PCA (50 PCs)                 12.29 ± 0.89
PCA (100 PCs)                10.52 ± 0.25
PCA (200 PCs)                9.33 ± 1.24
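For reference, the PCA baseline amounts to projecting the genotype matrix onto its top principal components before classifying (a generic sketch via SVD, not the authors' pipeline; sizes are toy):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project samples onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)                      # center each SNP column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # N x n_components scores

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(40, 200)).astype(float)   # toy genotype matrix
Z = pca_reduce(X, 10)
print(Z.shape)   # (40, 10)
```

Even with 200 components, this linear reduction lags the histogram-embedding Diet Network in the table above.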

SLIDE 22

Quantitative results (3)

SLIDE 23

What is the network learning?

[Figure: the MLP's representations at the raw input, layer 1, and layer 2]

SLIDE 24

What is the network learning?

[Figure: raw input, layer 1, and layer 2 representations, colored by ethnicity]

SLIDE 25

What is the network learning?

[Figure: raw input, layer 1, and layer 2 representations, colored by continent]

SLIDE 26

Wrap up and future research directions

SLIDE 27

Wrap up

  • We demonstrated the potential of deep learning models to tackle genomics-specific tasks.
  • The parameter explosion introduced by high-dimensional genomic data can be mitigated by smart model parameterization, such as Diet Networks.

SLIDE 28

What comes next…

  • Conducting genetic association studies, with emphasis on population-aware analyses of SNP data in disease cohorts.
  • Identifying the genetic basis of common diseases, to achieve better patient risk prediction and improve our overall understanding of disease etiology.

SLIDE 29

Institut des algorithmes d’apprentissage de Montréal

Diet Networks: Thin Parameters for Fat Genomics

Adriana Romero, Pierre Luc Carrier, Akram Erraqabi, Tristan Sylvain, Alex Auvolat, Etienne Dejoie, Marc-André Legault, Marie-Pierre Dubé, Julie G. Hussin, Yoshua Bengio

Thank you!

Code: https://github.com/adri-romsor/DietNetworks @adri_romsor