

SLIDE 1

A19 Research Internship Results

Charlie Cloutier-Langevin & Julien Corriveau-Trudel

Université de Sherbrooke

Tuesday, December 10th 2019

Supervisors: Félix Camirand Lemyre, Alan A. Cohen, Nancy Presse. Collaborators: Véronique Legault, Valérie Turcot, Alistair Senior.

SLIDE 2

Introduction

Context: a new approach to studying aging through Physiological Dysregulation (Phys. Dys.), measured with the Mahalanobis distance [4][5][9], and the advent of the NuAge dataset.
Task: study the potential relationship between an individual's nutrient intake and the deviation of their biological profile from a reference population.

SLIDE 3

NuAge Dataset

The NuAge dataset in numbers:
- 1 754 elderly women and men, aged 68 to 81;
- 6 586 visits, between 1 and 4 visits per person;
- 23 186 24h recalls, 1 to 3 recalls per timepoint, for 5 timepoints;
- 188 medical variables and 43 nutritional variables;
- 364 421 missing values out of 1 238 168 entries (29.4%).
Each year, a set of biological, nutritional, functional, medical, and social traits is measured for each participant.

SLIDE 4

Physiological systems considered

Different physiological systems considered:

1. Oxygen transport
2. Liver/kidney functions
3. Hematopoiesis
4. Micronutrients
5. Lipids

The system groupings come from a previous study on how to regroup these biomarkers and on the effects of using different subsets of biomarkers [4].

A global Phys. Dys. score has also been computed, defined as the sum of the Phys. Dys. values of all systems.
SLIDE 5

Table of contents

1. Transformation to normality
2. Longitudinal imputation: interpolation, extrapolation, results
3. Clustering
4. Measurement error and regression: additive error model, CoCoLasso, deconvolution nonparametric regression
5. Conclusion

SLIDE 6

Transformation to normality

SLIDE 7

Transformation to normality

Statistical methods + normality = better performance.

Classic transformations ("best transform", as provided by Véronique Legault). Examples: sqrt(), log(), exp().

Parametric transformation methods provide an accurate and simplified transformation process: Box-Cox [1], Yeo-Johnson [12], Manly [8].

SLIDE 8

Box-Cox transformation

Best transformation? ⇒ the Box-Cox transformation!

Box-Cox transformation (Box & Cox, 1964): a parametric power transformation for strictly positive observation values, with λ ∈ [-5, 5], defined as:

y^(λ) := (y^λ - 1)/λ,  if λ ≠ 0,
         ln(y),        if λ = 0.
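As a hedged illustration of how λ can be chosen in practice (a sketch on simulated data, not the internship code; `biomarker` is a stand-in variable), MASS::boxcox profiles the likelihood over a λ grid:

```r
# Sketch: choosing the Box-Cox lambda by profile likelihood.
# 'biomarker' is simulated stand-in data, not a NuAge variable.
library(MASS)

set.seed(1)
biomarker <- rexp(500, rate = 0.2) + 0.1   # strictly positive, right skewed

bc <- boxcox(biomarker ~ 1, lambda = seq(-5, 5, 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]        # lambda maximizing the log-likelihood

transformed <- if (abs(lambda_hat) < 1e-8) log(biomarker) else
  (biomarker^lambda_hat - 1) / lambda_hat
```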

SLIDE 9

Box-Cox transformation

Since the Box-Cox transformation requires strictly positive data, Box & Cox proposed a shifted modification.

Shifted Box-Cox transformation: a parametric power transformation with λ1 ∈ [-5, 5] and a new shifting parameter λ2, defined as:

y_i^(λ) := ((y_i + λ2)^λ1 - 1)/λ1,  if λ1 ≠ 0,
           ln(y_i + λ2),            if λ1 = 0.

Remark: shifting the data by λ2 = 1 does not impact the result, because it does not change the shape of the variable's distribution.

SLIDE 10

Transformation examples and comparison

Let's compare best transform and Box-Cox on two examples.

EXAMPLE 1
Biomarker name: Creatinine
Name in data set: CREAT
Best transform applied: log(x)
Box-Cox λ applied: λ = -0.5 (equivalent to 1/√x)

SLIDE 11

Transformation examples and comparison

Creatinine (CREAT) histogram, no transformation: right skewed (figure).

SLIDE 12

Transformation examples and comparison

Creatinine (CREAT) histogram, best transform: approaching normality (figure).

SLIDE 13

Transformation examples and comparison

Creatinine (CREAT) histogram, Box-Cox transformation: almost normal (figure).

SLIDE 14

Transformation examples and comparison

Creatinine (CREAT) Q-Q plot, no transformation: no line shape (figure).

SLIDE 15

Transformation examples and comparison

Creatinine (CREAT) Q-Q plot, best transform: the tails do not fit the line (figure).

SLIDE 16

Transformation examples and comparison

Creatinine (CREAT) Q-Q plot, Box-Cox transformation: approaching a line shape (figure).

SLIDE 17

Transformation examples and comparison

Creatinine (CREAT), Shapiro-Wilk [11] normality test comparison (H0: data come from a normally distributed population):
No transformation: p-value < 2.2e-16 ⇒ reject H0
Best transform: p-value = 3.844e-15 ⇒ reject H0
Box-Cox: p-value = 0.004968 ⇒ reject H0
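For reference, a minimal sketch of how such p-values are obtained in R (on simulated stand-in data; the real comparison was run on the NuAge variables):

```r
# Sketch: Shapiro-Wilk test before and after a transformation.
set.seed(1)
x <- rgamma(300, shape = 2)      # right-skewed stand-in for a raw biomarker
shapiro.test(x)$p.value          # very small => reject H0
shapiro.test(log(x))$p.value     # larger, though normality may still be rejected
```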

SLIDE 18

Transformation examples and comparison

EXAMPLE 2
Biomarker name: Weight
Name in data set: weight
Best transform applied: x (no transformation)
Box-Cox λ applied: λ = 0.1

SLIDE 19

Transformation examples and comparison

Weight histogram, no transformation (& best transform): right skewed (figure).

SLIDE 20

Transformation examples and comparison

Weight histogram, Box-Cox transformation: almost normal (figure).

SLIDE 21

Transformation examples and comparison

Weight Q-Q plot, no transformation (& best transform): no line shape (figure).

SLIDE 22

Transformation examples and comparison

Weight Q-Q plot, Box-Cox transformation: almost a line shape (figure).

SLIDE 23

Transformation examples and comparison

Weight, Shapiro-Wilk [11] normality test comparison (H0: data come from a normally distributed population):
No transformation: p-value < 2.2e-16 ⇒ reject H0
Box-Cox: p-value = 0.005793 ⇒ reject H0

SLIDE 24

Transformation examples and comparison

General results:
119 continuous biomarker variables transformed.
Average difference in λ1 between best transform and Box-Cox: 0.7831933.
The only biomarker left untransformed by Box-Cox is lipids tot, compared to 75 left untransformed by best transform.

Limitations:
In most cases, we still have to reject normality.
Box-Cox searches for the best power transformation on the NuAge data set; results could vary on other data sets.
Not necessarily the best result for every variable, but the best overall.

SLIDE 25

Imputation

SLIDE 26

What is imputation and why impute data?

What is imputation? Imputation is the act of substituting missing values, i.e., replacing NAs with plausible values.

Why impute? Having more data means a stronger statistical model, since data_imputed > data_initial, and a consistent statistical model is one that gets "stronger" as n → ∞.

SLIDE 27

Beware naive imputation

Risk: introducing bias into subsequent statistical estimations.
Mean imputation: imputing variable X with the mean of its non-missing values attenuates correlations in the data.
Linear regression imputation: conversely, imputing with a linear regression fit on the non-missing values of X artificially strengthens correlations.
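A small sketch of the attenuation effect on simulated data (illustrative only):

```r
# Sketch: mean imputation attenuates correlation.
set.seed(42)
x <- rnorm(1000)
y <- 0.8 * x + rnorm(1000, sd = 0.6)
cor(x, y)                                    # ~0.8 on complete data

y_miss <- y
y_miss[sample(1000, 300)] <- NA              # 30% missing completely at random
y_imp <- ifelse(is.na(y_miss), mean(y_miss, na.rm = TRUE), y_miss)
cor(x, y_imp)                                # noticeably pulled toward 0
```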

SLIDE 28

Imputation of NuAge biomarker dataset

In the NuAge biomarkers: 1 238 168 entries, 364 421 of them missing. In light of our objectives, a conservative approach ⇒ impute only what is necessary, without negatively impacting the computation of the Mahalanobis distance (MHBD).

SLIDE 29

Impacting MHBD (part 1)

Definition: let X be a set of n observations in a p-dimensional space, with mean µ and covariance matrix Σ, and let x_i be an observation from this set. The Mahalanobis distance is defined as:

D_M(x_i) = sqrt( (x_i - µ)^T Σ^(-1) (x_i - µ) ).

It generalizes measuring "how many standard deviations the observation is from the mean".
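A minimal sketch in R on simulated stand-in data (stats::mahalanobis is base R and returns the squared distance, hence the square root):

```r
# Sketch: a Phys. Dys.-style score as a Mahalanobis distance
# from a simulated reference population.
set.seed(1)
ref <- matrix(rnorm(200 * 5), ncol = 5)   # stand-in reference biomarker matrix
mu    <- colMeans(ref)
Sigma <- cov(ref)

x_i <- rnorm(5, mean = 1)                 # one (dysregulated) observation
D_M <- sqrt(mahalanobis(x_i, center = mu, cov = Sigma))  # sqrt of squared distance
```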
SLIDE 30

Impacting independence between individuals

Imputing from the population ⇒ introducing correlation between individuals (something we don't want). To avoid this effect, we relied on longitudinal data instead. Time reference: the first visit.

SLIDE 31

Longitudinal imputation

Use all variables to impute the data? We rejected this approach, since the correlations between variables were too close to 0 and the small number of timepoints (up to 4) introduced too much variability.
Using only the few data points of a single variable ⇒ think interpolation vs. extrapolation.

SLIDE 32

Interpolation

Interpolation is predicting data points that lie within the bounds of the data.

In our data, interpolating Phys. Dys. values will never generate negative values.

SLIDES 33-34

Interpolation (illustrative figures)

SLIDE 35

Extrapolation

Extrapolation is predicting data points that lie OUTSIDE the bounds of the data. Extrapolating with too few points becomes a slippery slope; the next slides include some illustrations.

SLIDES 36-37

Extrapolation with 2 points (illustrative figures)

SLIDE 38

Extrapolation with 2 points

Extrapolating with only 2 points, as we have seen, can produce outputs far from what we would expect.

Our solution was to extrapolate with the mean, justified by the fact that the slopes computed across all subjects, for each variable, were not significantly non-zero.
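The rule described above can be sketched as follows (illustrative, not the internship code; `impute_subject` is a hypothetical helper operating on one subject's timepoints):

```r
# Sketch: interpolate linearly inside a subject's observed time range,
# extrapolate with the subject's mean outside of it.
impute_subject <- function(t, y) {
  obs <- !is.na(y)
  if (sum(obs) == 0) return(y)                 # nothing to rely on: leave NAs
  if (sum(obs) == 1) { y[!obs] <- y[obs]; return(y) }
  inside  <- is.na(y) & t >= min(t[obs]) & t <= max(t[obs])
  outside <- is.na(y) & !inside
  if (any(inside))
    y[inside] <- approx(t[obs], y[obs], xout = t[inside])$y  # interpolation
  if (any(outside))
    y[outside] <- mean(y[obs])                               # mean extrapolation
  y
}

impute_subject(t = 1:4, y = c(2, NA, 3, NA))   # -> 2.0 2.5 3.0 2.5
```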

SLIDE 39

Extrapolation with 3 points (figure)

SLIDE 40

Results

Out of 1 238 168 entries, 364 421 were missing (29.4%). Our imputation method substituted 128 367 NAs, i.e., 35.2% of the missing values. The remaining missing values are either of non-continuous, non-categorical type, or belong to subjects with no observed values to rely on for imputation.

SLIDE 41

Clustering

SLIDE 42

Clustering goals

Clustering - general goals:
Get a better understanding of the data.
Discover and describe known or hidden patterns in the data.

Clustering - our goals:
Search for nutritional tendencies and describe them.
See whether these patterns can provide knowledge about our models and about physiological dysregulation.

SLIDE 43

What is clustering?

Unsupervised learning technique ⇒ observation data only. Partition observations into groups (clusters), grouping according to the similarity or dissimilarity among the data.

Remarks:
(a) More than one new method is published on arXiv.org per day.
(b) There is no single best technique or algorithm.
(c) The choice of method ⇒ considerable impact on the results.

SLIDE 44

Clustering method selection

For our research, we decided to perform clustering on the nutritional data. The typical clustering questions then arise:
Are there "real" patterns in our data?
Is it possible to subdivide the observations into groups?
Into how many groups can we subdivide our observations?
Which clustering algorithm (and which parameters) should we use?

An approach that can bring possible answers to these questions is CONSENSUS CLUSTERING.

SLIDE 45

Consensus clustering

Consensus clustering (Monti et al., 2003) [10]: a clustering method that seeks a consensus between several runs of a given clustering algorithm.
Observations + consensus clustering ⇒ best number of clusters & quantification of cluster stability.

SLIDE 46

Consensus clustering

Consensus clustering procedure (Monti et al., 2003) [10]
Given a set of observations D, a clustering algorithm C, a range K of candidate numbers of clusters, and a resampling scheme with H resamplings:

1. For each number of clusters k in K:
   for each resampling iteration h in 1..H, resample D, cluster the resample using C, and compute a connectivity matrix M(h) (defined below);
   then compute a consensus matrix M (defined below).
2. Find the best number of clusters k* in K, based on the consensus distribution of M (one per k).
3. Partition D into k* clusters, based on the consensus matrix for k*.

SLIDE 47

A further look at consensus clustering

Connectivity matrix
Given a subsample h and two observations i & j, the connectivity matrix is defined as
M(h)(i, j) = 1 if items i and j belong to the same cluster, 0 otherwise.

Consensus matrix
Given H subsamples and a connectivity matrix for each, the consensus matrix is defined as
M(i, j) = Σ_h M(h)(i, j) / Σ_h I(h)(i, j),
where I(h)(i, j) = 1 if both i and j appear in subsample h, 0 otherwise.
M(i, j) = 0.5 ⇒ no consensus.
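Written out in base R, the two matrices look as follows (a sketch for one k, with k-means and an 80% subsampling scheme; these are illustrative choices, not the study's settings):

```r
# Sketch: connectivity and consensus matrices (Monti et al.) for k = 2.
set.seed(7)
D <- rbind(matrix(rnorm(40, 0), ncol = 2), matrix(rnorm(40, 3), ncol = 2))
n <- nrow(D); H <- 50; k <- 2
M_sum <- matrix(0, n, n)   # running sum of connectivity matrices
I_sum <- matrix(0, n, n)   # running sum of "both items subsampled" indicators

for (h in seq_len(H)) {
  idx <- sample(n, size = floor(0.8 * n))       # resampling scheme
  cl  <- kmeans(D[idx, ], centers = k)$cluster
  Mh  <- outer(cl, cl, "==") * 1                # 1 if same cluster, else 0
  M_sum[idx, idx] <- M_sum[idx, idx] + Mh
  I_sum[idx, idx] <- I_sum[idx, idx] + 1
}
M <- M_sum / pmax(I_sum, 1)    # consensus matrix; entries near 0.5 = no consensus
```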

SLIDE 48

A further look at consensus clustering

Example: let's do an example on the board, with three observations, k = 2 clusters, and two subsamples (each containing all three observations). Suppose clustering yields these connectivity matrices:

M(1) =
| 1 1 0 |
| 1 1 0 |
| 0 0 1 |

and M(2) =
| 1 0 0 |
| 0 1 1 |
| 0 1 1 |

Applying the consensus matrix equation, we obtain

M =
| 1    1/2  0   |
| 1/2  1    1/2 |
| 0    1/2  1   |

SLIDE 49

diceR: an R package for consensus clustering

diceR [3]:
Computes consensus clustering for several algorithms.
Compares the results and returns the most relevant and stable algorithm and k.
To choose the best algorithm, diceR assesses the validity of the algorithms on three criteria:
(i) compactness,
(ii) connectivity,
(iii) separation.
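A hedged sketch of a diceR call (the argument names follow our reading of the package documentation and may differ between versions; `nutr` is a stand-in matrix and the settings are illustrative):

```r
# Sketch: consensus clustering across several algorithms with diceR.
library(diceR)

set.seed(9)
nutr <- scale(matrix(rexp(200 * 5), ncol = 5))  # stand-in for T1 mean intakes
res  <- dice(nutr, nk = 2:6, reps = 30,
             algorithms = c("hc", "diana", "km", "pam"),
             cons.funs = "majority")
```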

SLIDE 50

Clustering experimentation

Some details on the clustering experiments:

Taking each individual's average intake at T1.

Selection of variables to cluster on:
1. Selection 1: CHOL, TRANS, SATS, MONO, POLY ⇒ DONE
2. Selection 2: Selection 1 / ENERGY ⇒ UNDER ANALYSIS
3. Selection 3: all nutrients ⇒ ONGOING

Using diceR with these algorithms:
(i) agglomerative hierarchical clustering (H CLUST)
(ii) divisive hierarchical clustering (DIANA)
(iii) k-means (KM)
(iv) partitioning around medoids (PAM)

Remark: since these algorithms are distance-based, we also have to decide which distance to run the algorithms on, and where the distance between clusters is measured (linkage).

SLIDE 51

Clustering results - Selecting K - selection 1

Using the "proportion of ambiguous clustering" (PAC) measure [10], the algorithms give k = 2 as the number of clusters.

Table 1: Selection 1 - PAC measure (the lower, the better)

K    H clust     PAM        DIANA        KM
2    0.1609881   0.2841789  0.001832891  0.03700218
3    0.1896322   0.2702275  0.002043847  0.29046358
4    0.2580985   0.2692805  0.007719303  0.30591599
5    0.1172091   0.3833785  0.008524355  0.29816908
6    0.1856891   0.2404978  0.014866197  0.22395310
...  ...         ...        ...          ...
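A sketch of the PAC computation on a consensus matrix M (illustrative; the thresholds 0.1 and 0.9 are common defaults, not necessarily the ones used here):

```r
# Sketch: PAC = proportion of pairwise consensus values in the
# "ambiguous" interval (lower, upper); the lower the PAC, the better.
pac <- function(M, lower = 0.1, upper = 0.9) {
  v <- M[lower.tri(M)]            # each pair of observations counted once
  mean(v > lower & v < upper)
}
```

Applied to the consensus matrix M from the earlier sketch, a pac(M) close to 0 would indicate a stable 2-cluster structure.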

SLIDE 52

Clustering results - selecting algorithms - selection 1

According to the clustering validation criteria of the consensus clustering, the best algorithms are:
1. agglomerative hierarchical (H clust)
2. divisive hierarchical (DIANA)
3. k-means (KM)
4. partitioning around medoids (PAM)

HOWEVER:

Table 2: Selection 1 - Observations per cluster

Clusters    H clust   PAM    DIANA   KM
Cluster 1   1744      1128   1461    1135
Cluster 2   1         617    284     610

SLIDE 53

PAM & DIANA clustering algorithms

Partitioning around medoids (PAM):
Must know the number of clusters a priori.
Choose k medoids (randomly or semi-randomly) ⇒ results depend on initialization.
Compute every medoid-observation distance.
Place each observation in the cluster of its closest medoid.
Recompute the medoids and repeat until convergence (stable).

Divisive hierarchical clustering (DIANA):
Start with all observations in a single cluster.
Split it in two.
Repeat with the largest cluster (highest within-cluster variance).
Can continue until every observation is its own cluster.
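Both algorithms are available in the cluster package; a minimal sketch on stand-in data:

```r
# Sketch: PAM and DIANA via the 'cluster' package.
library(cluster)

set.seed(2)
X <- scale(matrix(rnorm(100 * 5), ncol = 5))
fit_pam   <- pam(X, k = 2)            # k-medoids; labels in fit_pam$clustering
fit_diana <- diana(X)                 # divisive hierarchy
groups    <- cutree(as.hclust(fit_diana), k = 2)  # cut the hierarchy at 2 groups
```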

SLIDE 54

K-medoids (PAM) iterations (figures; the original slides 54-61 step through the iterations)


SLIDE 62

DIANA iterations (dendrogram)

SLIDE 63

Clustering results - DIANA & PAM - selection 1

Describing the clusters:
Nutrient medians & means ⇒ higher in cluster 2.
T-tests and C.I.s on the mean differences ⇒ significant differences.
DIANA & PAM have each found 2 significantly different clusters!

Example: box plot of the SATS mean for each cluster (figures on slides 64-65).

SLIDES 64-65

Clustering results - DIANA & PAM - selection 1 (figures)

SLIDE 66

Clustering results - DIANA & PAM - selection 1

Impact on Phys. Dys.:
Phys. Dys. medians & means ⇒ low variation between clusters.
T-tests and C.I.s on the mean differences ⇒ no significant differences.

Example: box plot of the PAM lipid physiological dysregulation for each cluster (figure on slide 67).

SLIDE 67

Clustering results - DIANA - selection 1 (figure)

SLIDE 68

Measurement Error and Regression

SLIDE 69

Measurement error model in 24H recalls

Félix made a good case for a measurement error model for 24h recall (24HR) data in a presentation at the CRCHUS [2]. Thus, we dug into this avenue.

SLIDE 70

Additive Measurement Error Model

Additive measurement error model with repeated values: let there be n observations with m repeated values each, written W_ij, following the model

W_ij = T_i + U_ij,

where T_i is the variable of interest and U_ij is an error variable with mean E[U_ij] = 0 and variance Var[U_ij] = σ²_U.
⇒ W_ij is a repeated, contaminated version of T_i, and T_i is empirically unavailable.
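A simulation sketch of this model, including a natural estimate of σ²_U from the within-subject variability of the repeats (all values illustrative):

```r
# Sketch: W_ij = T_i + U_ij with m repeated measurements per subject.
set.seed(3)
n <- 500; m <- 3
T_i <- rgamma(n, shape = 10)                     # long-term mean intakes
W   <- T_i + matrix(rnorm(n * m, sd = 2), n, m)  # m noisy "recalls" per subject

sigma2_U <- mean(apply(W, 1, var))   # within-subject variance estimates sigma^2_U
T_hat    <- rowMeans(W)              # natural (still noisy) estimate of T_i
sigma2_U / var(as.vector(W))         # error-to-total variance ratio (cf. slide 75)
```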

SLIDE 71

The additive error model on food intake

Nutrient measurement error ↔ deviation from the long-term mean intake (LTMI):
W_ij = T_i + U_ij,
W_ij: observed values in the 24HR;
T_i: long-term mean intake (LTMI);
U_ij: deviation from the LTMI (day-to-day variability).

SLIDE 72

The additive error model on food intake

(Image courtesy of Félix CL, from his talk at the CRCHUS [2].)

SLIDE 73

Transformation to approach additive error model

Generalization: the data are one transformation away from an additive model,
h(W_ij) = T_i + U_ij,
where h(·) is a real-valued, bijective, monotone function from a family of functions H.

Example: if the data follow a multiplicative error model (W_ij = m_ij · T_i), then log(W_ij) = log(T_i) + log(m_ij). (Additive!)

SLIDE 74

Transformation to approach additive error model

One transformation we already know: the Box-Cox power transformation. Given a criterion to validate whether the data are subject to additive error, we search for the best λ parameter of

h(y | λ) = y^(λ) = (y^λ - 1)/λ,  if λ ≠ 0,
                   ln(y),        if λ = 0,

where y > 0.

SLIDE 75

Estimation of error variance

Ratio of measurement error variance to total variance for some nutritional values (figure).

SLIDE 76

Models designed for additive error model

Convex conditioned lasso regression (CoCoLasso): a penalized linear regression conditioned for measurement error.
Deconvolution nonparametric regression: a generalization of the Nadaraya-Watson kernel nonparametric regression.

SLIDE 77

Convex Conditioned Lasso Regression

CoCoLasso in a sentence: fit a multidimensional linear function with weight decay on the parameters, while correcting the model for the additive error model [6]. Weight decay is a way to encode our preference for simpler models; in the case of the lasso, this means selecting the few variables that affect the linear function.
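The "convex conditioning" part can be illustrated on the corrected covariance: subtracting σ²_U·I from the naive Gram matrix gives an unbiased but possibly non-positive-semidefinite estimate, which CoCoLasso projects back to the PSD cone before running the lasso. The eigenvalue truncation below is a simplification of the paper's nearest-PSD projection, shown on simulated data:

```r
# Sketch: error-corrected covariance and a simple PSD repair.
set.seed(4)
n <- 40; p <- 50; sigma2_U <- 0.5                 # p > n: correction breaks PSD
X <- matrix(rnorm(n * p), n, p)                   # unobserved true covariates
W <- X + matrix(rnorm(n * p, sd = sqrt(sigma2_U)), n, p)  # contaminated version

S_corr <- crossprod(W) / n - sigma2_U * diag(p)   # unbiased for cov(X), not PSD
eig    <- eigen(S_corr, symmetric = TRUE)
min(eig$values)                                   # negative: not PSD
S_psd  <- eig$vectors %*% diag(pmax(eig$values, 0)) %*% t(eig$vectors)
```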

SLIDE 78

CoCoLasso results

For every physiological system, the resulting optimal linear function was the mean. Since the effect seems nonlinear ⇒ a nonparametric model might capture the effects.

SLIDE 79

Deconvolution Nonparametric Regression

Deconvolution nonparametric (NP) regression: a generalization of the Nadaraya-Watson nonparametric regression. The initial model relies on the distribution of the predictor variable; the deconvolution NP regression corrects that distribution information for additive error using Deconvolution Kernel Density Estimation (DKDE) [7].

Downsides:
We did not find a multidimensional equivalent.
Intuitively, it will suffer from the curse of dimensionality.
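For intuition, here is the plain Nadaraya-Watson estimator in base R; the deconvolution version [7] keeps the same form but replaces the Gaussian kernel with a deconvoluting kernel built from the error's characteristic function (that replacement is omitted in this sketch):

```r
# Sketch: Nadaraya-Watson kernel regression at a point x0.
nw <- function(x0, x, y, h) {
  w <- dnorm((x0 - x) / h)    # Gaussian kernel weights around x0
  sum(w * y) / sum(w)
}

set.seed(5)
x <- runif(300, 0, 10)
y <- sin(x) + rnorm(300, sd = 0.3)
grid <- seq(0, 10, length.out = 100)
fit  <- vapply(grid, nw, numeric(1), x = x, y = y, h = 0.5)
```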

SLIDE 80

Univariate NP Regression

Figure: nonparametric regression between mean daily PROTEIN intake and the global dysregulation score (data points removed for confidentiality).

SLIDE 81

Univariate NP Regression

Figure: nonparametric regression between mean daily CARBO intake and the global dysregulation score (data points removed for confidentiality).

SLIDE 82

Univariate NP Regression

Figure: nonparametric regression between mean daily LIPID intake and the global dysregulation score (data points removed for confidentiality).

SLIDE 83

Conclusion

SLIDE 84

Summary

Our work includes:
- transforming biomarker data to approach normality,
- imputing biomarker data longitudinally,
- exploratory clustering of nutritional data,
- transformation to approach the additive measurement error model,
- exploration of a multivariate penalized linear regression model designed for additive error,
- exploration of a univariate nonparametric regression model, also designed for additive error.

SLIDE 85

Next steps...

To explore:
- more clustering! (on other data subsets),
- a discretized version of the difference in Phys. Dys. relative to time T1,
- correction of Phys. Dys. or nutritional values for confounders such as BMI or energy (kJ) intake,
- other nonparametric regression models designed for the additive error model.

SLIDE 86

Any questions?

SLIDE 87

References

[1] Box, G. E. P., and Cox, D. R. An analysis of transformations. Journal of the Royal Statistical Society, Series B (Methodological) 26, 2 (1964), 211-252.
[2] Camirand Lemyre, F. Analyse de données liées à des apports alimentaires : défis, problèmes et pistes de solutions. Presentation at the CRCHUS, September 2019.
[3] Chiu, D., and Talhouk, A. diceR: Diverse Cluster Ensemble in R, 2019. R package version 0.6.0.
[4] Cohen, A. A., Milot, E., Li, Q., Legault, V., Fried, L. P., and Ferrucci, L. Cross-population validation of statistical distance as a measure of physiological dysregulation during aging. Experimental Gerontology 57 (Sep 2014), 203-210.

SLIDE 88

[5] Cohen, A. A., Milot, E., Yong, J., Seplaki, C. L., Fülöp, T., Bandeen-Roche, K., and Fried, L. P. A novel statistical approach shows evidence for multi-system physiological dysregulation during aging. Mechanisms of Ageing and Development 134, 3-4 (Mar 2013), 110-117.
[6] Datta, A., and Zou, H. CoCoLasso for high-dimensional error-in-variables regression. The Annals of Statistics 45, 6 (Dec 2017), 2400-2426.
[7] Fan, J., and Truong, Y. K. Nonparametric regression with errors in variables. The Annals of Statistics 21, 4 (Dec 1993), 1900-1925.
[8] Manly, B. F. J. Exponential data transformations. Journal of the Royal Statistical Society, Series D (The Statistician) 25, 1 (1976), 37-42.

SLIDE 89

[9] Milot, E., Morissette-Thomas, V., Li, Q., Fried, L. P., Ferrucci, L., and Cohen, A. A. Trajectories of physiological dysregulation predicts mortality and health outcomes in a consistent manner across three populations. Mechanisms of Ageing and Development (2014), 56-63.
[10] Monti, S., Tamayo, P., Mesirov, J., and Golub, T. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning 52, 1 (Jul 2003), 91-118.
[11] Shapiro, S. S., and Wilk, M. B. An analysis of variance test for normality (complete samples). Biometrika 52, 3/4 (1965), 591-611.

SLIDE 90

[12] Yeo, I.-K., and Johnson, R. A. A new family of power transformations to improve normality or symmetry. Biometrika 87, 4 (2000), 954-959.