Statistical online learning of large-scale imaging-genetics data



slide-1
SLIDE 1
Statistical online learning of large-scale imaging-genetics data

Data Science Meetup Nice - Sophia-Antipolis

Marco Lorenzi

Université Côte d’Azur Inria Sophia Antipolis, Asclepios Research Project

slide-2
SLIDE 2
William Utermohlen (1933-2007)

Self-portraits: 1967, 1996, 1997, 1998, 1999, 2000

1995: Alzheimer’s disease diagnosis

slide-3
SLIDE 3
  • Functionality loss
  • Mood alterations
  • Cognitive impairment
  • Apraxia
  • Language problems
  • Memory loss

Enormous human and societal cost

The disease with the largest economic impact (Europe and US)

Impact on families: ~$20,000 every year in 1998
[Moore et al., J Gerontol B Psychol Sci Soc Sci 1998]

Health care: $160 billion every year worldwide
[Wimo et al., Dement Geriatr Cogn Disord 1998]

Alzheimer’s disease: the most common form of dementia

slide-4
SLIDE 4
People affected worldwide: 26.6 million in 2006
[Brookmeyer et al., Alzheimers and Dementia 2007]

[Figure: world map of regional case counts, in millions]

slide-5
SLIDE 5

People affected worldwide: 106 million in 2050
[Brookmeyer et al., Alzheimers and Dementia 2007]

[Figure: world map of projected regional case counts, in millions]

“Looming epidemic”

2017: no cures or preventive measures
slide-6
SLIDE 6
Dr. Aloysius “Alois” Alzheimer (1864-1915)

Auguste Deter (1850-1906)

Amyloid plaques & neurofibrillary tangles; brain atrophy

[Figure: normal vs. “Alzheimer’s” comparisons]

Source: http://www.alz.org [Kahn et al, PNAS 2007]

Urgent need: understanding the disease
slide-7
SLIDE 7

Jack et al, Lancet Neurol 2010; Frisoni et al, Nature Rev Neurol 2010

A story with several actors

  • Vascularity
  • Sociodemographic
  • Genetics
  • Microbiome
  • ?

slide-8
SLIDE 8

Introduction

Disentangling the pathological mechanisms

  • Drug discovery
  • Patient stratification (diagnostic)
  • Effective clinical trials

Multifactorial processes

Approaches

Forward: Models → Data
  • Targeted
  • Testing “mechanistic” hypotheses
  • Difficult to account for several factors

Backward: Data → Models
  • Exploring unknown interactions
  • Based on inferential methods
  • Generalization and validation

Lorenzi Marco IPMC 2017
slide-9
SLIDE 9

Introduction

A research challenge

Data science, statistical learning + Biomedical research, neuroimaging

Combine heterogeneous data and observations for:

  • Improved understanding of the disease
  • Better treatment
  • Better diagnosis
slide-10
SLIDE 10
Joint modeling of brain and genetic data in Alzheimer’s disease

Ingredients:
  • Data (disease markers)
  • Algorithms
  • Databases
slide-11
SLIDE 11
Joint modeling of brain and genetic data in Alzheimer’s disease

Ingredients:
  • Data (disease markers)
  • Algorithms
  • Databases
slide-12
SLIDE 12
Brain imaging

Quantify the brain structure:
  • Grey matter
  • Connectivity
  • Brain cortical thickness

slide-13
SLIDE 13
Genetics

Identifying meaningful genetic variants (single nucleotide polymorphisms, SNPs) in a population

Discovering the encoded information:
  • Heritability
  • Association with a disease

Novembre et al, Nature, 2008

slide-14
SLIDE 14
Joint modeling of brain and genetic data in Alzheimer’s disease

Ingredients:
  • Data (disease markers)
  • Algorithms
  • Databases
slide-15
SLIDE 15

Association between SNP and brain features

[Figure: statistical complexity of the association problem]
  • Genetic data: a candidate SNP, or many SNPs (~$10^6$) across chromosomes 1 … N
  • Brain data: several scalars, or many voxel/mesh measures (~$10^5$)
  • Statistical complexity: low, high, very high

GWAS = genome-wide association studies

slide-16
SLIDE 16

Introduction

Multivariate association studies

Maximizing the joint relationship between genetic variants and brain features

X = genetic data (N individuals × ~$10^6$ SNPs)
Y = imaging data (N individuals × ~$10^5$ brain features)

Partial least squares (PLS): $\max_{p,q} \mathrm{Cov}(Xp, Yq)$

Liu et al, Front in Neuroinformatics, 2014; Silver et al, NeuroImage 2012; Szymczak et al, Genetic Epidemiology 2009; …
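The PLS objective on this slide can be illustrated with a minimal sketch. It assumes that the weights p and q are constrained to unit norm, in which case the first pair of weights is given by the leading singular vectors of the cross-covariance matrix between X and Y. The function name `first_pls_pair`, the toy dimensions, and the simulated signal are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def first_pls_pair(X, Y):
    """Unit-norm weights (p, q) maximizing Cov(X p, Y q), plus that covariance.

    X: (n_individuals, n_snps), Y: (n_individuals, n_brain_features).
    """
    Xc = X - X.mean(axis=0)                  # center each SNP column
    Yc = Y - Y.mean(axis=0)                  # center each brain feature column
    C = Xc.T @ Yc / (X.shape[0] - 1)         # cross-covariance, n_snps x n_features
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, 0], Vt[0], s[0]

# Toy data (hypothetical): SNP 3 and brain feature 7 share a latent signal.
rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=n)
X = rng.normal(size=(n, 30))
Y = rng.normal(size=(n, 20))
X[:, 3] += 2 * latent
Y[:, 7] += 2 * latent

p, q, cov = first_pls_pair(X, Y)
print("largest SNP weight at index", np.argmax(np.abs(p)))
print("largest brain weight at index", np.argmax(np.abs(q)))
```

At the scale quoted on the slide (~$10^6$ SNPs by ~$10^5$ features) the cross-covariance matrix would be too large to form explicitly, so this direct SVD is only meant to make the objective concrete.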

slide-17
SLIDE 17

Introduction

Multivariate association studies

Maximizing the joint relationship between genetic variants and brain features

X = genetic data (N individuals × ~$10^6$ SNPs)
Y = imaging data (N individuals × ~$10^5$ brain features)

Partial least squares (PLS): $\max_{p,q} \mathrm{Cov}(Xp, Yq)$

PLS weights = relative importance (shown along chromosome N)

Liu et al, Front in Neuroinformatics, 2014; Silver et al, NeuroImage 2012; Szymczak et al, Genetic Epidemiology 2009; …
slide-18
SLIDE 18
Introduction

Multivariate association studies

Maximizing the joint relationship between genetic variants and brain features

X = genetic data (N individuals × ~$10^6$ SNPs)
Y = imaging data (N individuals × ~$10^5$ brain features)

Partial least squares (PLS): $\max_{p,q} \mathrm{Cov}(Xp, Yq)$

PLS weights = relative importance (shown along chromosome N)

Pros. Overcomes issues of mass univariate analysis
  • Avoiding independent multiple testing
  • Exploring SNP-SNP interaction (epistatic effects)

Cons.
  • Overfitting and reproducibility
  • Computational complexity

Liu et al, Front in Neuroinformatics, 2014; Silver et al, NeuroImage 2012; Szymczak et al, Genetic Epidemiology 2009; …
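To make the multiple-testing issue of mass univariate analysis concrete, here is a small worked calculation. The SNP and brain-feature counts are the orders of magnitude quoted on the slides; the significance level of 0.05 and the use of a Bonferroni correction are assumptions chosen only for illustration, since the talk does not name a specific correction.

```python
# Sizes taken from the slides; alpha and the Bonferroni rule are illustrative assumptions.
n_snps = 10**6
n_brain_features = 10**5
alpha = 0.05

# One univariate test per (SNP, brain feature) pair.
n_tests = n_snps * n_brain_features

# Bonferroni: divide the family-wise level by the number of tests.
per_test_threshold = alpha / n_tests

print(f"number of univariate tests: {n_tests:.0e}")
print(f"Bonferroni per-test threshold: {per_test_threshold:.1e}")
```

With on the order of $10^{11}$ tests, the per-test threshold becomes extremely small, which is one way to read the slide's point about avoiding independent multiple testing.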

slide-19
SLIDE 19

Stability assessment

Imaging genetics

Random partitioning of the population into non-overlapping groups (split-half)
slide-20
SLIDE 20

Stability assessment

  • Random partitioning of the population into non-overlapping groups (split-half)
  • Partitioning of chromosomes (bin size: 10k)
  • Extraction of PLS components: PLS weights associated to individual SNPs
slide-21
SLIDE 21

Stability assessment

Imaging genetics

  • Random partitioning of the population into non-overlapping groups (split-half)
  • Partitioning of chromosomes (bin size: 10k)
  • Extraction of PLS components: PLS weights associated to individual SNPs
  • Top 5% of the PLS weights
slide-22
SLIDE 22

Stability assessment

Identification of relevant loci (binarization)

[Figure: PLS outputs binarized, relevant loci marked 1]
slide-23
SLIDE 23

Stability assessment

Stable estimator of relevant loci (AND)

[Figure: binarized PLS outputs combined by logical AND]
slide-24
SLIDE 24

Stability assessment

Stable estimator of relevant loci (AND)

106 iterations

[Figure: binarized PLS outputs combined by logical AND]
slide-25
SLIDE 25

Stability assessment

Stable estimator of relevant loci (AND)

106 iterations

Same procedure for the assessment of brain thickness component at each mesh point

[Figure: binarized PLS outputs combined by logical AND]
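The stability assessment steps shown across the previous slides (split-half partitioning, chromosome bins, PLS per bin, top 5% binarization, AND across halves) can be sketched as follows. This is only one possible reading of the slides, not the author's implementation: the function names, the use of the first PLS component only, the quantile-based top 5% threshold, and the toy data are all assumptions, and a single random split is shown rather than repeated iterations.

```python
import numpy as np

def pls_snp_weights(X, Y):
    """Absolute first-component PLS weights for the SNP columns of X."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    U, _, _ = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    return np.abs(U[:, 0])

def stable_loci(X, Y, bin_size=10_000, top=0.05, seed=0):
    """Boolean mask of SNPs that fall in the top fraction in BOTH halves."""
    rng = np.random.default_rng(seed)
    n, n_snps = X.shape
    perm = rng.permutation(n)
    halves = (perm[: n // 2], perm[n // 2:])      # non-overlapping groups
    stable = np.ones(n_snps, dtype=bool)
    for idx in halves:
        relevant = np.zeros(n_snps, dtype=bool)
        for start in range(0, n_snps, bin_size):  # one PLS per SNP bin
            cols = slice(start, start + bin_size)
            w = pls_snp_weights(X[idx, cols], Y[idx])
            # Binarization: keep the top fraction of weights in this bin.
            relevant[cols] = w >= np.quantile(w, 1.0 - top)
        stable &= relevant                        # logical AND across halves
    return stable

# Toy data (hypothetical): SNPs 10 and 150 are linked to brain feature 4.
rng = np.random.default_rng(1)
n = 400
latent = rng.normal(size=n)
X = rng.normal(size=(n, 200))
Y = rng.normal(size=(n, 15))
X[:, 10] += 2 * latent
X[:, 150] += 2 * latent
Y[:, 4] += 2 * latent

stable = stable_loci(X, Y, bin_size=100, top=0.05)
print(np.flatnonzero(stable))
```

A small `bin_size` is used here only so the toy example has more than one bin; the slides quote a bin size of 10k.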
slide-26
SLIDE 26

A multivariate answer

Lorenzi et al. AAIC 2016

  • A. Altmann
slide-27
SLIDE 27

Investigating biological mechanisms through meta-analysis

Analysis of genomic areas

PLS statistical result:
  • Chromosome N
  • Relevant locus (p)
  • Proximal areas (+/- 5 kbp)

slide-28
SLIDE 28

Investigating biological mechanisms through meta-analysis

Analysis of genomic areas

PLS statistical result:
  • Chromosome N
  • Relevant locus (p)
  • Proximal areas (+/- 5 kbp)

Querying gene annotation databases

McLaren et al. The Ensembl Variant Effect Predictor. Genome Biology, 20

slide-29
SLIDE 29

Investigating biological mechanisms through meta-analysis

148 SNP-gene combinations

6 tested tissues: hippocampus, whole blood, adipose subcutaneous, artery tibial, nerve tibial, treated fibroblast

14 significantly expressed genes, including TM2D1 (amyloid-beta binding protein), IL10RA (increase in hippocampus in mouse model), TRIB3 (neuronal cell death, modulates PSEN1 stability, interacts with APP)

Significance (p-value)

Gene            Training   Testing
TM2D1           0.005      0.053
IL10RA          0.107      0.620
TRIB3           0.003      0.003
ZBTB7A          0.036      0.913
LYSMD4          0.000      0.206
CRYL1           0.621      0.118
FAM135B         0.000      0.559
IP6K3           0.000      0.465
ITGA1           0.099      0.731
KIN             0.001      0.206
LAMC1           0.002      0.062
LINC00941       0.000      0.690
RBPMS2          0.000      0.215
RP11-181K3.4    0.002      0.053

  • S. Wray
slide-30
SLIDE 30
Joint modeling of brain and genetic data in Alzheimer’s disease

Ingredients:
  • Data (disease markers)
  • Algorithms
  • Databases
slide-31
SLIDE 31
Large multicentric clinical studies

Data for ~100,000 individuals

Challenge: meta-study

slide-32
SLIDE 32

Meta-analysis in genetic studies

[Figure: centers C1, C2, …, CM]

State of the art: analysis of univariate outcomes (p-value, effect size, standard error, …)

Cons.
  • Multiple testing → low statistical power
  • No SNP-SNP interaction
  • Limited interpretability

Problem. How to develop multivariate imaging-genetics modeling approaches within a meta-analysis context?
slide-33
SLIDE 33

Extending meta-analysis for multivariate models

Data distributed across centers C1, C2, …, CM: N individuals, p SNPs (~$10^6$, chromosome 1 to chromosome 22), q brain features (~$10^5$)

$X Y' = U \Lambda V'$

$X Y' = U_1 \Lambda_1 V_1' + U_2 \Lambda_2 V_2' + \dots + U_M \Lambda_M V_M'$
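The decomposition on this slide can be illustrated with a short sketch. It assumes that X and Y are stored as features × individuals (so that X Y' is a SNPs × brain-features matrix), that each center shares only the SVD factors of its own cross-product, and that the data are already centered. The function names `center_summary` and `meta_pls` and the simulated centers are hypothetical; this is one plausible reading of the slide, not the author's Meta PLS implementation.

```python
import numpy as np

def center_summary(X_site, Y_site, rank):
    """SVD factors (U_c, Lambda_c, V_c') of one center's cross-product X_c Y_c'."""
    U, s, Vt = np.linalg.svd(X_site @ Y_site.T, full_matrices=False)
    return U[:, :rank], s[:rank], Vt[:rank]

def meta_pls(summaries):
    """Sum the per-center factorizations, then extract the first PLS weight pair."""
    total = sum((U * s) @ Vt for U, s, Vt in summaries)
    U, _, Vt = np.linalg.svd(total, full_matrices=False)
    return U[:, 0], Vt[0]

# Toy data (hypothetical): three centers, SNP 3 linked to brain feature 7.
rng = np.random.default_rng(0)
n_snps, n_feats = 40, 25
Xs, Ys = [], []
for n_c in (80, 120, 100):
    latent = rng.normal(size=n_c)
    X_site = rng.normal(size=(n_snps, n_c))
    Y_site = rng.normal(size=(n_feats, n_c))
    X_site[3] += 2 * latent
    Y_site[7] += 2 * latent
    Xs.append(X_site)
    Ys.append(Y_site)

# Each center computes its summary locally; only the factors are pooled.
summaries = [center_summary(Xc, Yc, rank=n_feats) for Xc, Yc in zip(Xs, Ys)]
p_meta, q_meta = meta_pls(summaries)

# Reference: the same weights computed from the pooled individual-level data.
X, Y = np.hstack(Xs), np.hstack(Ys)
U, _, Vt = np.linalg.svd(X @ Y.T, full_matrices=False)
p_pooled = U[:, 0]
agreement = abs(p_meta @ p_pooled)   # sign of singular vectors is arbitrary
print("agreement between meta and pooled SNP weights:", agreement)
```

With full-rank summaries the sum reproduces the pooled cross-product up to floating-point error. Passing a smaller `rank` would reduce what each center has to share, at the cost of an approximation.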
slide-34
SLIDE 34

Extending meta-analysis for multivariate models

  • Sequential PLS
  • Meta PLS
slide-35
SLIDE 35

Testing

[Figure panels: mean and sd of dot product; absolute feature-wise error]

Lorenzi et al. MASAMB 2016
slide-36
SLIDE 36

Conclusions I

Linking brain atrophy to biological functions through multivariate analysis of the genotype-phenotype relationship, $\max_{p,q} \mathrm{Cov}(Xp, Yq)$, + thorough cross-validation for stability assessment

Warnings

  • Often required to process large datasets with standard hardware
  • Need to process large datasets across different sites
slide-37
SLIDE 37

Conclusions II

Research scenario in Côte d’Azur?

  • Local and national data collection initiatives
  • International biobanks
  • Thousands of individuals (currently > 10,000)
  • Large databases (currently > 30 TB of raw data)

Challenges

  • Algorithms
  • Infrastructure / Infostructure
  • Clinical translation

Positions available
slide-38
SLIDE 38

Thank you!
