Multiblock Method for Categorical Variables Application to air - - PowerPoint PPT Presentation



SLIDE 1
  • 1. Position of the problem
  • 2. Methods
  • 3. Case study
  • 4. Conclusions & perspectives

Multiblock Method for Categorical Variables

Application to air quality in pig farms

  • S. Bougeard (1), E.M. Qannari (2) & C. Fablet (1)

(1) French Agency for Food, Environmental and Occupational Health & Safety (Anses), Department of Epidemiology, Ploufragan, France

(2) Nantes-Atlantic National College of Veterinary Medicine, Food Science and Engineering (Oniris), Department of Chemometrics and Sensometrics, Nantes, France

1 / 15

SLIDE 2

Table of contents

  • 1. Position of the problem
  • 2. Methods
      • Categorical multiblock Redundancy Analysis (Cat-mbRA)
      • Alternative methods
  • 3. Case study
      • Study of air quality in pig farms
      • Relationships between variables
      • Risk factors for inappropriate air quality
      • Method comparison
  • 4. Conclusions & perspectives

2 / 15

SLIDE 3

Statistical issues for epidemiological surveys

  • 1. Advantages & limits of usual procedures

Generalized linear models: well adapted to categorical variables, but limited in the number of explanatory variables they can handle, and constrained when y has more than two categories.

Decision trees, random forests, boosting, bagging, SVM: small misclassification errors, but no regression coefficients.

  • 2. Expectations

A global optimization criterion with an eigensolution, an assessment of the risk factors, and a factorial representation of the data.

→ Multiblock modelling extended to categorical data.

3 / 15



SLIDE 7

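Categorical multiblock Redundancy Analysis

The latent variables represent the categorical variable coding: $t_k^{(1)} = X_k w_k^{(1)}$ and $u^{(1)} = \tilde{Y} v^{(1)}$.

$P_{X_k}$ is the projector onto the subspace spanned by the dummy variables associated with $x_k$.

Criterion to maximize:

$$\sum_k \mathrm{cov}^2\big(u^{(1)}, t_k^{(1)}\big), \quad \text{with } \|t_k^{(1)}\| = \|v^{(1)}\| = 1,$$

which amounts to maximizing

$$\sum_k \|P_{X_k} u^{(1)}\|^2 = v^{(1)\prime}\, \tilde{Y}' \Big(\sum_k P_{X_k}\Big) \tilde{Y}\, v^{(1)}, \quad \text{with } \|v^{(1)}\| = 1.$$

First order solution: $v^{(1)}$ is the eigenvector of $\sum_k \tilde{Y}' P_{X_k} \tilde{Y}$ associated with the largest eigenvalue

$$\lambda^{(1)} = \sum_k \|P_{X_k} u^{(1)}\|^2.$$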
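The first-order solution can be illustrated with a minimal sketch in plain Python. It is not the authors' Matlab code. It assumes that Ỹ is the column-centred indicator matrix of y (the slides do not state the exact scaling), it uses the fact that projecting onto the dummy variables of one categorical variable replaces each entry by its category mean, and it swaps a full eigendecomposition for power iteration. All function names and the toy data are hypothetical.

```python
import math

def center_columns(rows):
    """Subtract each column mean (rows is a list of equal-length lists)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def dummy_matrix(labels):
    """Indicator matrix with one column per category (sorted order)."""
    cats = sorted(set(labels))
    return [[1.0 if lab == c else 0.0 for c in cats] for lab in labels]

def project(labels, vec):
    """P_Xk vec: orthogonal projection onto the dummy variables of one
    categorical variable, i.e. each entry becomes its category mean."""
    sums, counts = {}, {}
    for lab, x in zip(labels, vec):
        sums[lab] = sums.get(lab, 0.0) + x
        counts[lab] = counts.get(lab, 0) + 1
    return [sums[lab] / counts[lab] for lab in labels]

def criterion_matrix(blocks, Y):
    """Sum over k of Y' P_Xk Y, a q x q symmetric matrix."""
    cols = [list(c) for c in zip(*Y)]
    q = len(cols)
    A = [[0.0] * q for _ in range(q)]
    for labels in blocks:
        proj = [project(labels, c) for c in cols]
        for i in range(q):
            for j in range(q):
                A[i][j] += sum(a * b for a, b in zip(cols[i], proj[j]))
    return A

def leading_eigenpair(A, iters=500):
    """Power iteration (a stand-in for a full eigendecomposition)."""
    q = len(A)
    v = [1.0 / (i + 1) for i in range(q)]  # arbitrary starting vector
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(q)) for i in range(q)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(v[i] * A[i][j] * v[j] for i in range(q) for j in range(q))
    return lam, v

# Toy data (hypothetical): x1 reproduces y, x2 merges categories "a" and "b".
y  = ["a", "a", "b", "b", "c", "c"]
x1 = ["a", "a", "b", "b", "c", "c"]
x2 = ["p", "p", "p", "p", "q", "q"]

Y = center_columns(dummy_matrix(y))
lam, v = leading_eigenpair(criterion_matrix([x1, x2], Y))
u = [sum(r * c for r, c in zip(row, v)) for row in Y]   # u = Y v
# The eigenvalue should equal the criterion: sum over k of ||P_Xk u||^2.
check = sum(sum(p * p for p in project(x, u)) for x in (x1, x2))
print(round(lam, 6), round(check, 6))
```

On this toy data the two printed values coincide, which is the identity between the largest eigenvalue and the summed squared norms of the projections of u. Power iteration needs a gap between the two largest eigenvalues; a library eigensolver would be the more usual choice in practice.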

5 / 15


SLIDE 12

Categorical multiblock Redundancy Analysis (Cat-mbRA)

$P_{X_k}$ is the projector onto the subspace spanned by the dummy variables associated with $x_k$.

Partial components $(t_1, \dots, t_K)$: projection of $u^{(1)}$ onto each subspace spanned by $X_k$,

$$t_k^{(1)} = \frac{P_{X_k} u^{(1)}}{\|P_{X_k} u^{(1)}\|}.$$

Synthesis with a global component $t$: $t^{(1)}$ sums up all the partial codings,

$$t^{(1)} = \sum_k a_k^{(1)} t_k^{(1)}, \quad \text{with } \sum_k \big(a_k^{(1)}\big)^2 = 1,$$

$$t^{(1)} = \sum_k \frac{\|P_{X_k} u^{(1)}\|}{\sqrt{\sum_l \|P_{X_l} u^{(1)}\|^2}}\; t_k^{(1)} = \frac{\sum_k P_{X_k} u^{(1)}}{\sqrt{\sum_l \|P_{X_l} u^{(1)}\|^2}}.$$

Higher-order solutions are obtained by considering the residuals of the orthogonal projections of $(X_1, \dots, X_K)$ onto the subspaces spanned by $t^{(1)}$, $(t^{(1)}, t^{(2)})$, ...
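The construction of the partial and global components can be sketched as follows, again in plain Python rather than the authors' Matlab code. The sketch assumes that the block weights are normalized by the square root of the summed squared norms, which is what the constraint that the squared weights sum to 1 requires, and that no projection is the zero vector. The component u, the variable names and the toy data are hypothetical.

```python
import math

def project(labels, vec):
    """P_Xk vec: each entry is replaced by the mean of vec in its category."""
    sums, counts = {}, {}
    for lab, x in zip(labels, vec):
        sums[lab] = sums.get(lab, 0.0) + x
        counts[lab] = counts.get(lab, 0) + 1
    return [sums[lab] / counts[lab] for lab in labels]

def partial_and_global(blocks, u):
    """Return the partial components t_k, the weights a_k and the global t."""
    projections = [project(labels, u) for labels in blocks]
    norms = [math.sqrt(sum(p * p for p in proj)) for proj in projections]
    total = math.sqrt(sum(nk * nk for nk in norms))
    # t_k = P_Xk u / ||P_Xk u||  (assumes every norm is non-zero)
    partial = [[p / nk for p in proj] for proj, nk in zip(projections, norms)]
    # a_k = ||P_Xk u|| / sqrt(sum_l ||P_Xl u||^2), so that sum_k a_k^2 = 1
    weights = [nk / total for nk in norms]
    # t = sum_k a_k t_k
    t = [sum(a * tk[i] for a, tk in zip(weights, partial))
         for i in range(len(u))]
    return partial, weights, t

# Toy data (hypothetical): a centred component u and two categorical blocks.
u  = [2.0, 0.0, 1.0, -1.0, -1.0, -1.0]
x1 = ["a", "a", "b", "b", "c", "c"]
x2 = ["p", "p", "p", "q", "q", "q"]

partial, weights, t = partial_and_global([x1, x2], u)
print([round(a * a, 3) for a in weights])   # squared block weights
```

The squared weights play the role of block importances: here the second block carries the larger share because the projection of u onto its dummy variables has the larger norm.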

6 / 15


SLIDE 15

Alternative methods for qualitative discrimination

Robust Generalized Linear Model framework
  • In the two-class case: ridge logistic regression [Barker & Brown, 2001], principal component logistic regression [Aguilera et al., 2006], PLS generalized regression (e.g. PLS logistic regression) [Marx, 1996; Bastien et al., 2005].

Factorial analysis framework
  • Disqual procedure [Saporta & Niang, 2006],
  • Multiple Non-Symmetrical Correspondence Analysis [Lauro & Balbi, 1999].

Multiblock and Structural Equation Modelling framework
  • Categorical extension of GCA-RT, i.e. MCA-RT [Kissita, 2003], and of multiblock PLS, i.e. MCOI-catPLS [D’Ambra et al., 2002],
  • Categorical extension of SEM [Skrondal & Rabe-Hesketh, 2005] and of PLS-PM [Jakobowicz & Derquenne, 2007; Russolillo, 2009].

7 / 15



SLIDE 19

Categorical epidemiological data

Epidemiological survey
  • Part of the French study of pig respiratory diseases (2006-2008),
  • Risk factors for inappropriate air quality in post-weaning rooms (cold or gases).

Data description
  • Dependent variable: air quality in pig post-weaning rooms (cold, temperate, temperate with gases),
  • 13 explanatory variables: management practices (7 var.), outside measurements (2 var.), farm structure (4 var.),
  • N = 85 farrow-to-finish pig farms (out of 128), split into the 3 categories of the dependent variable (22/38/25),
  • Correlated variables (significant χ2 tests).
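The slide only states that the variables are correlated according to significant χ2 tests. As a hedged illustration, the sketch below computes Pearson's χ2 statistic of independence for one pair of categorical variables, assuming that this is the test meant. The function name and the toy data are hypothetical and are not taken from the survey.

```python
from collections import Counter

def chi_square(x, z):
    """Pearson chi-square statistic and degrees of freedom for two
    categorical variables given as equal-length lists of labels."""
    n = len(x)
    joint = Counter(zip(x, z))   # observed cell counts
    row = Counter(x)             # row margins
    col = Counter(z)             # column margins
    stat = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            stat += (joint[(r, c)] - expected) ** 2 / expected
    dof = (len(row) - 1) * (len(col) - 1)
    return stat, dof

# Toy data (hypothetical), 6 observations.
x = ["a", "a", "a", "b", "b", "b"]
z = ["p", "p", "q", "q", "q", "q"]
stat, dof = chi_square(x, z)
print(stat, dof)
```

The statistic is then compared with a chi-square distribution with `dof` degrees of freedom. With cell counts as small as in this toy table the approximation is poor, so the example only shows the arithmetic.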

9 / 15


SLIDE 22

Plot of the variable loadings on the first two latent variables of cat-mbRA

[Figure: loading plot. Legend: dependent variable; farm structure (explanatory variables); outside measurements (explanatory variables); management practices (explanatory variables).]

Available additional information: plot of individual scores.

Interpretation of the factors which influence temperate air quality (the target)
  • Farm structure: plastic floor material, large number of pens in the room (> 7),
  • Management practices: duration of room heating before piglet entry (> 20 h),
  • Outside measurements: medium outdoor humidity (80-90%).

10 / 15

SLIDE 23

Risk factors for inappropriate air quality in pig farms

Results obtained from cat-mbRA with $h_{opt} = 2$ latent variables: significant regression coefficients.

Available additional information: the block importances ($a_k^2$) can be transformed into cumulated contributions of each categorical variable as a whole.

11 / 15

SLIDE 24

Comparison with alternative methods

Interpretation and additional information from confusion matrices
  • Cat-mbRA, Cat-mbPLS, M-NSCA with deflation on T: similar performances, except for cat-mbPLS prediction (it fails to predict AirQual-2 with the first dimension),
  • Disqual: the MCA components are included in order of how much of X they explain; correct prediction is reached with a large number of latent variables (it performs well in explaining and predicting AirQual-1).
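The comparison above relies on confusion matrices. The following minimal sketch shows how such a matrix and the per-class rate of correct prediction can be computed from observed and predicted categories. The labels and predictions are hypothetical toy values, not results from the study.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, classes):
    """Rows are observed classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]

def per_class_recall(matrix):
    """Share of each observed class that is predicted correctly."""
    return [row[i] / sum(row) if sum(row) else float("nan")
            for i, row in enumerate(matrix)]

# Toy labels (hypothetical).
classes = ["c1", "c2", "c3"]
y_true = ["c1", "c1", "c2", "c2", "c2", "c3"]
y_pred = ["c1", "c2", "c2", "c2", "c1", "c3"]

cm = confusion_matrix(y_true, y_pred, classes)
recall = per_class_recall(cm)
accuracy = sum(cm[i][i] for i in range(len(classes))) / len(y_true)
print(cm, recall, accuracy)
```

Reading the matrix row by row shows which observed category a method fails to recover, which is the kind of diagnosis reported above for AirQual-2.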

12 / 15


SLIDE 26

Concluding remarks

Conclusion
  • Proposal of a new and successful method for qualitative discrimination (categorical multiblock Redundancy Analysis, cat-mbRA),
  • Extension within the multiblock modelling framework (interpretation tools),
  • Application to (several) real epidemiological surveys,
  • Code and interpretation tools developed in Matlab®.

Perspectives
  • Comparison with other methods in the case of a two-class prediction (e.g. logistic regression, PLS logistic regression, ...) [working paper],
  • Simulation study to better understand and compare the performance of the methods,
  • Extension to the prediction of several categorical variables.

Thanks to the funders [Acemo, Anavelec, Celtys, I-Tek, Rose-Eludis, Sodalec, Tuffigo] and to the farmers.

14 / 15

