SLIDE 1

Institut des algorithmes d’apprentissage de Montréal

Diet Networks: Thin Parameters for Fat Genomics

Adriana Romero, Pierre Luc Carrier, Akram Erraqabi, Tristan Sylvain, Alex Auvolat, Etienne Dejoie, Marc-André Legault, Marie-Pierre Dubé, Julie G. Hussin, Yoshua Bengio

SLIDE 2

Outline

  • Motivation & Challenges
  • Deep Learning Architectures
  • Diet Networks
  • Results
  • Wrap up and future research directions
SLIDE 3

Motivation & Challenges

SLIDE 4

Motivation

Image credits: http://quincypublicschools.com/library/contact-school/book-stack/ ; https://en.wikipedia.org/wiki/GeForce_10_series

Deep Learning

Zhou et al., 2015

SLIDE 5

Motivation

Image credits: http://quincypublicschools.com/library/contact-school/book-stack/ ; https://en.wikipedia.org/wiki/GeForce_10_series ; https://www.genome.gov/sequencingcostsdata/

Deep Learning Genomics

SLIDE 6

Genomic Data as Fat Data

  • Targets millions of simple variants across the genome (SNPs).
  • The number of participants is limited, even for large datasets.

Dire imbalance between # samples and # input features: "fat data", with far more SNPs (features) than participants (samples).

SLIDE 7

Challenges: Parameter explosion

Linear classifier: # inputs = # parameters (the weight matrix W).

Naive setup for SNP data:

  • # inputs = hundreds of thousands
  • # samples = thousands
  • # samples << # parameters

In deep networks, the # parameters in the 1st layer grows linearly with the # inputs.
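As a rough illustration of the explosion, here is a back-of-the-envelope parameter count using the sizes quoted in these slides (a sketch, not the authors' code):

```python
# Back-of-the-envelope count of first-layer parameters for a dense network
# on SNP data, using the illustrative sizes from the slides.
n_inputs = 300_000    # SNPs (input features)
n_hidden = 100        # units in the first hidden layer
n_samples = 3_450     # individuals (1000 Genomes, per the later slides)

first_layer_params = n_inputs * n_hidden   # weight matrix W alone, biases ignored
print(first_layer_params)                  # 30000000 free parameters
print(first_layer_params // n_samples)     # ~8695 parameters per training sample
```

With nearly 9,000 parameters per labeled sample in the first layer alone, overfitting is all but guaranteed without some form of parameter reduction.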

SLIDE 8

Challenges: Overfitting

Image credit: https://shapeofdata.wordpress.com/2013/03/26/general-regression-and-over-fitting/

SLIDE 9

Challenges: The curse of dimensionality

Image credit: http://nikhilbuduma.com/2015/03/10/the-curse-of-dimensionality/

Considering:

  • 300K SNPs
  • 3 possible values (0, 1, 2)

3^300K combinations!
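That number is too large to print, but Python's arbitrary-precision integers let us count its decimal digits instead (a quick sanity check on the slide's figures):

```python
import math

n_snps = 300_000   # SNPs, each taking a value in {0, 1, 2}
n_values = 3

# 3**300000 is astronomically large; count its decimal digits instead.
n_digits = math.floor(n_snps * math.log10(n_values)) + 1
print(n_digits)    # 143137 digits -- vastly more combinations than any dataset
```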

SLIDE 10

Why deep learning?

  • Capturing information directly from the raw input data is not trivial and often involves complex, non-linear functions.
  • Many problems become easier if the input data is transformed into a representation that emphasizes its most relevant characteristics.

SLIDE 11

Multi-Layer Perceptron (MLP)

  • Describes data as a hierarchy of concepts.
  • Supervised learning: desired output values are provided.
  • Unsupervised learning: aims to discover hidden structure in the input data.

SLIDE 12

CNN - reducing the number of parameters

  • Parameter sharing.
  • Exploits spatially local correlations.
  • Suitable for data with a grid-like topology.

Problem: when the full DNA sequence is unavailable, other types of methods seem more appropriate.
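To make the parameter-sharing point concrete, compare a dense first layer with a 1-D convolution over the same input length (a toy count with illustrative sizes, not taken from the paper):

```python
# A convolutional layer's parameter count depends only on the kernel size,
# not on the input length -- that is what parameter sharing buys.
seq_len   = 300_000   # input positions (e.g. SNPs along the genome)
n_filters = 100
kernel    = 5         # illustrative kernel width
in_ch     = 1

dense_params = seq_len * n_filters          # fully connected first layer
conv_params  = kernel * in_ch * n_filters   # shared across all positions
print(dense_params, conv_params)            # 30000000 vs 500
```

The catch, as noted above: this sharing only helps when the input has a meaningful grid topology, which filtered SNP data generally lacks.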
SLIDE 13

Diet Networks

SLIDE 14

The idea

  • Use a novel neural network reparametrization that considerably reduces the number of free parameters when the input is very high-dimensional and orders of magnitude larger than the number of training samples.

SLIDE 15

The model

[Model diagram] The main MLP takes one sample (1×F) as input. Auxiliary embedding MLPs take one feature (SNP) across samples (1×N) as input and predict the F×100 parameter matrices of the main network's first layer (and of its reconstruction layer). Input data: N×F, with N << F. With F = 300K SNPs and 100 hidden units, a directly parameterized first layer would need 30M parameters.
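A minimal NumPy sketch of the reparametrization, with toy sizes. The auxiliary network is reduced here to a single linear map for brevity (the paper uses an MLP), and all variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, D, H = 50, 1_000, 26, 100   # samples, SNPs, embedding dim, hidden units (toy)

X = rng.integers(0, 3, size=(N, F)).astype(float)   # genotypes in {0, 1, 2}
feat_emb = rng.normal(size=(F, D))   # one D-dim embedding per SNP (e.g. histograms)

# Auxiliary predictor: maps each feature embedding to one row of the main
# network's first-layer weight matrix W, so W itself holds no free parameters.
V = rng.normal(size=(D, H))          # the only learned "fat layer" parameters
W = feat_emb @ V                     # F x H, predicted rather than learned

hidden = np.maximum(X @ W, 0.0)      # main MLP's first layer (ReLU)

print(F * H, D * H)                  # free params: 100000 direct vs 2600 via Diet
print(hidden.shape)                  # (50, 100)
```

The key point: the free-parameter count of the fat layer now scales with the embedding dimension D, not with the number of features F.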

SLIDE 16

Embeddings

  • Raw (learnt embedding, end-to-end training)
  • Per-class histograms

SLIDE 17

Per class histogram

Example for one SNP across individuals of two classes (genotype values in {0, 1, 2}):

Class 1: 1 × 0, 2 × 1, 1 × 2  →  histogram (0.25, 0.50, 0.25)
Class 2: 3 × 0, 0 × 1, 1 × 2  →  histogram (0.75, 0.00, 0.25)
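The per-class histogram embedding can be sketched in a few lines (the helper name is ours; the genotypes are chosen to match the slide's counts):

```python
import numpy as np

def per_class_histogram(snp_values, labels, n_values=3):
    """Embed one SNP as its genotype-value frequencies within each class."""
    emb = []
    for c in np.unique(labels):
        counts = np.bincount(snp_values[labels == c], minlength=n_values)
        emb.extend(counts / counts.sum())
    return np.array(emb)

# One SNP across 8 individuals, matching the slide's counts.
genotypes = np.array([0, 1, 1, 2, 0, 0, 0, 2])
labels    = np.array([1, 1, 1, 1, 2, 2, 2, 2])

emb = per_class_histogram(genotypes, labels)
print(emb)   # class 1 -> 0.25, 0.50, 0.25; class 2 -> 0.75, 0.00, 0.25
```

Applied to every SNP, this yields an F × (n_values × n_classes) embedding matrix that requires no training, unlike the raw end-to-end variant.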

SLIDE 18

The 1000 Genomes Project (1)

  • Large-scale comparison of DNA sequences from populations, thanks to the presence of genetic variations.
  • Represents 26 populations from 5 geographical regions, 3,450 individuals in total.
  • SNP inclusion/exclusion criteria:
      - Genetic variants with frequencies of at least 5%
      - Excluded SNPs positioned on sex chromosomes
      - Only included SNPs in approximate linkage equilibrium with each other
  • As a result, we obtained 315,345 SNPs, encoded as having 0, 1 or 2 copies of a genetic mutation (non-reference nucleotide).

SLIDE 19

Experimental setup

  • Ethnicity prediction from SNPs on 1000 Genomes data.
  • Metrics: misclassification error and number of free parameters.
  • 5-fold cross-validation.
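The evaluation protocol above can be sketched without any ML library (helper names are ours, not the authors' code):

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k (nearly) equal folds."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    return np.array_split(idx, k)

def misclassification_error(y_true, y_pred):
    """Percentage of wrongly predicted labels."""
    return 100.0 * np.mean(np.asarray(y_true) != np.asarray(y_pred))

folds = kfold_indices(3_450, k=5)   # 3,450 individuals, as in 1000 Genomes
print([len(f) for f in folds])      # [690, 690, 690, 690, 690]
print(misclassification_error([0, 1, 1], [0, 1, 0]))   # 33.33...
```

Each fold in turn serves as the held-out test set while the remaining four are used for training, and the reported error is averaged across the five runs.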
SLIDE 20

Quantitative results (1)

Embedding                      Misclassification error (%)   # free parameters

Without reconstruction
  Basic MLP                    8.31 ± 1.83                   31.5M
  Diet Networks (raw end2end)  7.62 ± 02                     227.3k
  Diet Networks (histograms)   6.90 ± 1.60                   18.0k

With reconstruction
  Basic MLP                    7.76 ± 1.38                   63M
  Diet Networks (raw end2end)  6.85 ± 1.72                   534.8k
  Diet Networks (histograms)   7.01 ± 1.20                   28.1k

SLIDE 21

Quantitative results (2)

Embedding                    Misclassification error (%)
Diet Networks (histograms)   7.01 ± 1.20
PCA (10 PCs)                 20.56 ± 3.20
PCA (50 PCs)                 12.29 ± 0.89
PCA (100 PCs)                10.52 ± 0.25
PCA (200 PCs)                9.33 ± 1.24
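For reference, the PCA baseline amounts to projecting the genotype matrix onto its top principal components before classifying (a generic sketch via SVD, not the authors' pipeline; sizes are toy):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project samples onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)                      # center each SNP column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # N x n_components scores

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(40, 200)).astype(float)   # toy genotype matrix
Z = pca_reduce(X, 10)
print(Z.shape)   # (40, 10)
```

Even with 200 components, this linear reduction lags the histogram-embedding Diet Network in the table above.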

SLIDE 22

Quantitative results (3)

SLIDE 23

What is the network learning?

[Figure: the MLP's representations at the raw input, layer 1, and layer 2]

SLIDE 24

What is the network learning?

[Figure: raw input, layer 1, and layer 2 representations, colored by ethnicity]

SLIDE 25

What is the network learning?

[Figure: raw input, layer 1, and layer 2 representations, colored by continent]

SLIDE 26

Wrap up and future research directions

SLIDE 27

Wrap up

  • We demonstrated the potential of deep learning models to tackle genomics-specific tasks.
  • The parameter explosion introduced by high-dimensional genomic data can be mitigated by smart model parameterization, such as Diet Networks.

SLIDE 28

What comes next…

  • Conducting genetic association studies, with emphasis on population-aware analyses of SNP data in disease cohorts.
  • Identifying the genetic basis of common diseases, to achieve better patient risk prediction and improve our overall understanding of disease etiology.

SLIDE 29

Institut des algorithmes d’apprentissage de Montréal

Diet Networks: Thin Parameters for Fat Genomics

Adriana Romero, Pierre Luc Carrier, Akram Erraqabi, Tristan Sylvain, Alex Auvolat, Etienne Dejoie, Marc-André Legault, Marie-Pierre Dubé, Julie G. Hussin, Yoshua Bengio

Thank you!

Code: https://github.com/adri-romsor/DietNetworks @adri_romsor