
SLIDE 1

Detection of influential points as a byproduct of resampling-based variable selection procedures

Riccardo De Bin

Department of Mathematics - University of Oslo

based on joint work with Anne-Laure Boulesteix (University of Munich) and Willi Sauerbrei (University Medical Center Freiburg)

Seminars in Statistics, February 21st, 2019

SLIDE 2

Resampling-based detection of influential points

Outline of the talk

  • Introduction
  • Methods
  • Detection of possible influential points
  • Conclusions

SLIDE 3

Introduction: overview

  • Importance of model stability:
    ◮ small data perturbations may lead to different selected models;
    ◮ the best model is not clear (if there is one);
  • among the approaches that handle this issue:
    ◮ resampling-based variable selection (Chen & George, 1985);
    ◮ frequentist model averaging (Buckland et al., 1997);
  • these approaches may rely on variable selection performed on several pseudo-samples (inclusion frequencies, data-driven weights, . . . );
  • we use the same information to detect possible influential points:
    ◮ outliers;
    ◮ single observations which have a high impact on the results.

SLIDE 4

Introduction: body fat data

  • Reference: Johnson (1996);
  • Sample size: 252;
  • Outcome: percentage of body fat (continuous);
  • Covariates: age, weight, height and 10 other continuous body circumference measurements;
  • Data: http://portal.uni-freiburg.de/imbi/Royston-Sauerbrei-book/Multivariable_Model-building/downloads/datasets/edu_bodyfat_both.zip
  • This dataset contains at least one influential point (Royston & Sauerbrei, 2007): observation 39.

SLIDE 5

Introduction: body fat data

[Table: variables selected by backward elimination under BIC, α = 0.05, and AIC, each with ("in") and without ("out") observation 39. Among others, ab and wrist are selected in all six fits, while knee, ankle and biceps are never selected.]

NB: the models are obtained by using backward elimination.

SLIDE 6

Methods: resampling-based variable selection

To perform a resampling-based variable selection:

  • generate several pseudo-samples through a resampling technique (e.g., bootstrap, subsampling, . . . );
  • apply a variable selection procedure to each pseudo-sample (e.g., backward elimination);
  • for each variable, consider the proportion of pseudo-samples in which the variable has been selected → inclusion frequency;
  • consider only variables with larger inclusion frequencies.
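The procedure above can be sketched in a few lines of Python. This is a minimal illustration on toy data: the pseudo-samples are bootstrap draws, and the per-sample selection is a simple OLS t-statistic screen standing in for backward elimination (the function names, threshold, and data are assumptions of this sketch, not from the talk):

```python
import numpy as np

def select_variables(X, y):
    # Stand-in for backward elimination (an assumption of this sketch):
    # keep covariates whose OLS t-statistic exceeds 2 in absolute value.
    n, q = X.shape
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (n - q - 1)
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(Xd.T @ Xd)))[1:]
    return np.abs(beta[1:] / se) > 2.0

def inclusion_frequencies(X, y, B=100, seed=None):
    # Repeat the selection on B bootstrap pseudo-samples and tally,
    # for each variable, the proportion of samples selecting it.
    rng = np.random.default_rng(seed)
    n, q = X.shape
    inclusion = np.zeros((B, q), dtype=int)  # the inclusion matrix
    for b in range(B):
        idx = rng.integers(0, n, size=n)     # bootstrap pseudo-sample
        inclusion[b] = select_variables(X[idx], y[idx])
    return inclusion, inclusion.mean(axis=0)

# Toy data: the outcome depends on the first two of five covariates.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=200)
inclusion, freq = inclusion_frequencies(X, y, B=100, seed=1)
```

With this setup the two signal variables obtain inclusion frequencies near 1, while the noise variables stay low.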

SLIDE 7

Methods: resampling-based variable selection

                         variable
pseudo-sample    V1    V2    V3   . . .  Vq−1   Vq
      1           1           1   . . .          1
      2           1           1   . . .
      3           1           1   . . .          1
    . . .                         . . .
      B           1           1   . . .
inclusion
frequency       0.96  0.24  1.00  . . .  0.05  0.69

SLIDE 8

Methods: model averaging with resampling-based weights

  • Fit k models M1, . . . , Mk on the data;
  • for each model, compute the estimate θ̂_Mj;
  • compute the estimate θ̂ as the weighted average

        θ̂ = Σ_{j=1}^{k} wj θ̂_Mj;

  • a highly relevant point is the choice of the weights wj:
    ◮ based on information criteria or Mallows' criterion (e.g., Buckland et al., 1997; Hjort & Claeskens, 2003; Hansen, 2007);
    ◮ resampling-based (Burnham & Anderson, 2002; Augustin et al., 2005);
  • we focus on the latter:
    ◮ find the best model for several pseudo-samples;
    ◮ wj is the proportion of times in which the model Mj is selected.
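A minimal sketch of how the resampling-based weights arise from the inclusion matrix: each distinct row pattern is treated as one model, and its weight is the fraction of pseudo-samples selecting it (function and model names are ours, for illustration):

```python
import numpy as np
from collections import Counter

def resampling_weights(inclusion):
    # Each row of the inclusion matrix identifies the model selected on
    # that pseudo-sample; the weight of model Mj is the proportion of
    # pseudo-samples on which exactly that set of variables was chosen.
    B = inclusion.shape[0]
    counts = Counter(tuple(row) for row in inclusion)
    return {model: c / B for model, c in counts.items()}

inclusion = np.array([
    [1, 0, 1, 0, 1],   # -> M1
    [1, 0, 1, 0, 0],   # -> M2
    [1, 0, 1, 0, 1],   # -> M1
    [1, 0, 1, 0, 0],   # -> M2
])
w = resampling_weights(inclusion)
# each of the two selected models receives weight 0.5
```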

SLIDE 9

Methods: model averaging with resampling-based weights

                         variable
pseudo-sample    V1    V2    V3   . . .  Vq−1   Vq     model
      1           1           1   . . .          1    → M1
      2           1           1   . . .               → M2
      3           1           1   . . .          1    → M1
    . . .                         . . .               → . . .
      B           1           1   . . .               → Mk

SLIDE 10

Methods: inclusion matrix

  • Both approaches rely on the inclusion matrix;
  • each row shows which variables are included in the best model on a particular pseudo-sample;

                         variable
pseudo-sample    V1    V2    V3   . . .  Vq−1   Vq     model
      1           1           1   . . .          1    → M1
      2           1           1   . . .               → M2
      3           1           1   . . .          1    → M1
    . . .                         . . .               → . . .
      B           1           1   . . .               → Mk
inclusion
frequency       0.96  0.24  1.00  . . .  0.05  0.69

SLIDE 11

Detection of possible influential points: towards the frequency matrix

For each row we know which observations belong to the specific pseudo-sample and which do not; we can contrast results for pseudo-samples with and without a specific observation:

  • for each variable, two separate inclusion frequencies:
    ◮ inclusion frequencies computed on pseudo-samples including the i-th observation → I-frequencies;
    ◮ inclusion frequencies computed on pseudo-samples without the i-th observation → O-frequencies.
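Splitting the inclusion-matrix rows by whether the pseudo-sample contains observation i gives the I- and O-frequencies. A small sketch, assuming the bootstrap index sets were stored and that each observation is present in and absent from at least one pseudo-sample:

```python
import numpy as np

def i_o_frequencies(inclusion, index_sets, n):
    # inclusion: (B, q) 0/1 matrix, one row per pseudo-sample.
    # index_sets: for each pseudo-sample, the indices of the observations
    # it contains (with repetition under the bootstrap).
    B, q = inclusion.shape
    contains = np.zeros((B, n), dtype=bool)
    for b, idx in enumerate(index_sets):
        contains[b, idx] = True
    I = np.empty((n, q))
    O = np.empty((n, q))
    for i in range(n):
        I[i] = inclusion[contains[:, i]].mean(axis=0)   # with obs. i
        O[i] = inclusion[~contains[:, i]].mean(axis=0)  # without obs. i
    return I, O

inclusion = np.array([[1, 0], [0, 1], [1, 1]])
index_sets = [[0, 1], [1, 2], [0, 2]]
I, O = i_o_frequencies(inclusion, index_sets, n=3)
# I[0] = [1.0, 0.5] (samples 0 and 2); O[0] = [0.0, 1.0] (sample 1)
```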

SLIDE 12

Detection of possible influential points: I-frequency matrix

  • We can organize the I-frequencies in an I-frequency matrix:
    ◮ each row corresponds to I-frequencies computed only on pseudo-samples containing a specific observation.

                         variable
observation     V1     V2     V3    . . .  Vq−1    Vq
     1         0.969  0.215  1.000  . . .  0.023  0.692
     2         0.902  0.260  1.000  . . .  0.056  0.776
     3         0.994  0.241  1.000  . . .  0.087  0.614
   . . .
    n−1        0.978  0.301  1.000  . . .  0.061  0.661
     n         0.984  0.292  1.000  . . .  0.047  0.676

SLIDE 13

Detection of possible influential points: I-frequency matrix

  • No influential points ↔ similar I-frequencies (i.e., similar values in the same column);
  • a strongly separated I-frequency may reveal the presence of an influential point;
  • how do we identify possibly separated values?
    ◮ graphical approach;
    ◮ analytical approach.

SLIDE 14

Detection of possible influential points: graphical approach

  • The idea is to take advantage of “the human gift for pattern recognition” (Friedman & Tukey, 1974);
  • the box-plot is a simple and effective tool:
    ◮ extreme observations are not included in the whiskers;
    ◮ usually those farther than 1.5 IQR from the first/third quartile;
    ◮ the extreme points are those of interest → anomalous inclusion frequencies.
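The box-plot rule can also be applied programmatically to each column of the I-frequency matrix; a sketch using the standard 1.5·IQR whiskers (function name and toy data are ours):

```python
import numpy as np

def boxplot_outliers(I):
    # Flag entries of the I-frequency matrix lying beyond the box-plot
    # whiskers: more than 1.5 * IQR below Q1 or above Q3, per variable.
    q1 = np.percentile(I, 25, axis=0)
    q3 = np.percentile(I, 75, axis=0)
    iqr = q3 - q1
    return (I < q1 - 1.5 * iqr) | (I > q3 + 1.5 * iqr)

I = np.array([[0.70], [0.71], [0.69], [0.72], [0.68], [0.20]])
mask = boxplot_outliers(I)
# only the last observation (I-frequency 0.20) is flagged
```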

SLIDE 15

Detection of possible influential points: body fat example

[Figure: box-plots of the I-frequencies (y-axis: variable inclusion frequency, 0.0–1.0), one per variable: age, weight, height, neck, chest, ab, hip, thigh, knee, ankle, biceps, forearm, wrist.]

SLIDE 16

Detection of possible influential points: body fat example

[Figure: the same box-plots; observation 39 is annotated as the extreme point for four variables.]

SLIDE 17

Detection of possible influential points: body fat example

[Figure: the same box-plots; extreme points annotated: observation 39 (four variables) and observations 221, 54, 31 and 86.]

SLIDE 18

Detection of possible influential points: I-frequency graphics

From the previous graphic we can:

  • identify possible influential points;
  • evaluate the importance of a variable (median I-frequency);
  • get an idea of the variability of the I-frequencies.
  • The I-frequency variability is an indicator of the reliability of a variable’s inclusion frequency:
    ◮ smaller variability ↔ more stable inclusion frequency;
    ◮ between two correlated variables with similar inclusion frequencies, we may prefer the one with smaller variability.

SLIDE 19

Detection of possible influential points: I-frequency graphics

If the focus is mainly on the detection of influential points:

  • it may be convenient to show standardized I-frequencies;
  • easier to evaluate the “distance” of a separated point from the others;
  • it is possible to perform an objective evaluation:
    ◮ kσ rule;
    ◮ statistical test.
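Standardizing the I-frequencies amounts to a column-wise z-score; a minimal sketch (names and toy data ours):

```python
import numpy as np

def standardized_i_frequencies(I):
    # Column-wise z-scores of the I-frequency matrix: for each variable,
    # center by the mean over observations and scale by the standard
    # deviation, so separated points are comparable across variables.
    return (I - I.mean(axis=0)) / I.std(axis=0, ddof=1)

I = np.array([[0.70], [0.71], [0.69], [0.72], [0.68], [0.20]])
z = standardized_i_frequencies(I)
# the last observation sits more than 2 standard deviations below the rest
```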

SLIDE 20

Detection of possible influential points: body fat example

[Figure: standardized I-frequencies per variable (y-axis from −10 to 10): age, weight, height, neck, chest, ab, hip, thigh, knee, ankle, biceps, forearm, wrist.]

SLIDE 21

Detection of possible influential points: body fat example

[Figure: the same plot; extreme points annotated: observation 39 (four variables) and observations 221, 54, 31 and 86.]

SLIDE 22

Detection of possible influential points: statistical tests

  • We are mainly interested in the effect of an observation on the inclusion/exclusion of a variable in the final model;
  • we can simply apply a univariate test to each column of the I-frequency matrix:
    ◮ Dixon’s Q test (Dixon, 1950);
    ◮ Grubbs’ G test (Grubbs, 1950).
  • Here we focus on the latter (Grubbs’ G test):
    ◮ idea: test whether there is a value (the minimum or the maximum) too far from the others;
    ◮ results do not depend on the sample size (in contrast to Dixon’s Q test).

SLIDE 23

Detection of possible influential points: Grubbs’ tests

  • Given x1, . . . , xn from a Gaussian distribution, Grubbs’ test rejects the hypothesis of no outlier in the sample if

        max_{i=1,...,n} |xi − x̄| / s  >  C(α, n) = ((n − 1)/√n) · √( t²_{1−α/(2n),n−2} / (n − 2 + t²_{1−α/(2n),n−2}) ),

    where:
    ◮ x̄ denotes the sample mean;
    ◮ s is the estimated standard deviation;
    ◮ t_{1−α/(2n),n−2} is the quantile 1 − α/(2n) of a t distribution with n − 2 degrees of freedom;
    ◮ α is the significance level at which the test is conducted;
    ◮ it may be necessary to implement a multiplicity correction.
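Grubbs’ test is straightforward to compute on one column of the I-frequency matrix; a sketch (scipy is assumed for the t quantile; function names and toy data are ours):

```python
import numpy as np
from scipy import stats

def grubbs_critical(alpha, n):
    # Critical value C(alpha, n) of Grubbs' test.
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))

def grubbs_statistic(x):
    # Largest absolute deviation from the mean, in standard deviations.
    x = np.asarray(x, dtype=float)
    return np.max(np.abs(x - x.mean())) / x.std(ddof=1)

# One column of an I-frequency matrix with one separated value.
x = [0.70, 0.71, 0.69, 0.72, 0.68, 0.73, 0.70, 0.71, 0.69, 0.20]
g, crit = grubbs_statistic(x), grubbs_critical(0.05, len(x))
# g > crit, so the value 0.20 is flagged as a possible outlier
```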

SLIDE 24

Detection of possible influential points: Grubbs’ tests

  • The test statistic of Grubbs’ test is related to the standardized I-frequencies;
  • we can plot the rejection region of the test in the previous graphic;
  • a value outside the bands ±C(α, n) → possible influential point.

SLIDE 25

Detection of possible influential points: body fat example

[Figure: standardized I-frequencies per variable (y-axis from −10 to 10), with the Grubbs’ test rejection bands ±C(α, n) superimposed.]

SLIDE 26

Detection of possible influential points: body fat example

[Figure: the same plot; extreme points annotated: observation 39 (four variables) and observations 221, 54, 31 and 86.]

SLIDE 27

Detection of possible influential points: remarks

Note:

  • Grubbs’ test is built to identify a single outlier;
  • for multiple outliers, a different critical value should be considered,

        C(α, n, i) = ((n − i)/√(n − i + 1)) · √( t²_{1−α/(2(n−i+1)),n−i−1} / (n − i − 1 + t²_{1−α/(2(n−i+1)),n−i−1}) ),

    where i indicates the number of “candidate” outliers;
  • for reasonably large sample sizes (n > 50), the critical value does not change much with i.
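The critical value for the i-th candidate outlier can be computed analogously; a sketch in the generalized-ESD form of the formula (scipy assumed for the t quantile; the function name is ours):

```python
import numpy as np
from scipy import stats

def multi_outlier_critical(alpha, n, i):
    # Critical value C(alpha, n, i) for the i-th candidate outlier,
    # in the generalized-ESD form; reduces to the Grubbs value at i = 1.
    df = n - i - 1
    t = stats.t.ppf(1 - alpha / (2 * (n - i + 1)), df)
    return (n - i) * t / np.sqrt((df + t**2) * (n - i + 1))

c1 = multi_outlier_critical(0.05, 50, 1)
c5 = multi_outlier_critical(0.05, 50, 5)
# for n = 50 the critical value barely changes between i = 1 and i = 5
```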

SLIDE 28

Detection of possible influential points: remarks

Possible issues:

  • assumption of Grubbs’ test: a Gaussian distribution;
    ◮ it does not necessarily hold for the I-frequencies;
    ◮ a beta distribution would be more appropriate;
  • the beta distribution can be approximated by a Gaussian when its parameters are sufficiently large,
  • i.e., when the data points are far from the boundaries → precisely the interesting situations! The approximation is poor only for
    ◮ I-frequencies ≈ 0: irrelevant variables;
    ◮ I-frequencies ≈ 1: variables always included in the final model.
  • Grubbs’ test is resistant to deviations from independence (Srivastava, 1980).
SLIDE 29

Detection of possible influential points: different resampling technique

  • So far we have used the bootstrap as the resampling technique to generate the pseudo-samples;
  • alternatives such as subsampling may also be considered;
    ◮ see De Bin et al. (2016) for a comparison in the context of resampling-based variable selection;
  • dependence on the subsample size m:
    ◮ usual values (m = 0.5n, 0.632n) → no strong difference from the bootstrap approach (at least in this example);
    ◮ when m = n − 2, I-frequencies → O-frequencies, and the method approximates the delete-2 jackknife procedure (see, e.g., Martin et al., 2010);
  • only with the bootstrap: separately consider pseudo-samples by the number of times they include an observation i.
SLIDE 30

Detection of possible influential points: different number of replications

  • I-frequencies-1: based on pseudo-samples in which observation i is included exactly 1 time;
  • I-frequencies-M: i-th observation included 2 or more times;
    ◮ an observation is repeated > 2 times only in a few cases;
  • together with the O-frequencies (i.e., 0 times), these provide additional information on the effect of the observation on the inclusion or exclusion of a variable from the statistical model.
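A sketch of how the pseudo-samples can be grouped by the multiplicity of observation i, assuming the bootstrap index sets were stored (function name and toy data ours):

```python
import numpy as np

def frequencies_by_multiplicity(inclusion, index_sets, i):
    # Group bootstrap pseudo-samples by how many times they contain
    # observation i, and compute inclusion frequencies in each group:
    # 0 times -> O-frequencies, 1 time -> I-frequencies-1,
    # 2 or more times -> I-frequencies-M.
    counts = np.array([np.sum(np.asarray(idx) == i) for idx in index_sets])
    groups = {}
    for label, mask in (("O", counts == 0),
                        ("I-1", counts == 1),
                        ("I-M", counts >= 2)):
        if mask.any():
            groups[label] = inclusion[mask].mean(axis=0)
    return groups

inclusion = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
index_sets = [[0, 0, 1], [0, 1, 2], [1, 2, 2], [1, 1, 2]]
groups = frequencies_by_multiplicity(inclusion, index_sets, i=0)
# groups["I-M"] = [1.0, 0.0]; groups["I-1"] = [0.0, 1.0]; groups["O"] = [0.5, 0.5]
```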

SLIDE 31

Detection of possible influential points: different number of replications

[Figure: two panels of inclusion frequencies (0.0–1.0) per variable (age, weight, height, neck, chest, ab, hip, thigh, knee, ankle, biceps, forearm, wrist), for observation 39 (left) and observation 221 (right); each panel compares pseudo-samples in which the observation is excluded, included 1 time, and included M > 1 times.]

SLIDE 32

Detection of possible influential points: comparison with FSDA

  • Alternative: FSDA by Atkinson & Riani (2002):
  • a model is first fitted on a carefully selected subsample;
  • then stepwise re-fitted on the same subsample, progressively enlarged:
    ◮ at each step, one of the observations initially excluded is added;
    ◮ namely, the one closest to the fitted model;
    ◮ the last observations entered have the highest probability of being influential points;
  • the method is robust against the masking effect:
    ◮ weaker influential points are considered before stronger ones;
  • it focuses on the effect of specific observations from a model point of view, rather than a variable point of view.

SLIDE 33

Detection of possible influential points: comparison with FSDA

SLIDE 34

Detection of possible influential points: myeloma data example

  • Reference: Krall et al. (1975);
  • Sample size: 65 (48 events);
  • Outcome: overall survival time (time-to-event);
  • Covariates: 16, either binary or continuous;
  • Data: http://portal.uni-freiburg.de/imbi/Royston-Sauerbrei-book/Multivariable_Model-building/downloads/datasets/myeloma.zip
  • Proportional hazards assumption acceptable (see, e.g., Kuk, 1984; Chen & Wang, 1991).

SLIDE 35

Detection of possible influential points: myeloma data example

[Figure: inclusion frequencies (0.0–1.0, top) and standardized inclusion frequencies (−4 to 4, bottom) for the myeloma covariates: logBUN, hemogl, platelets, infections, age, sex, logWBC, fractures, log%BM, lymphoc, myelCells, proteinuria, BenceJone, protein, globin, calcium; observation 44 is annotated as extreme for two variables.]

SLIDE 36

Detection of possible influential points: myeloma data example

  • Relatively small sample size (there are only 48 events);
  • no strong influential points in this dataset;
  • the only point inside the rejection region: observation 44 (for the variables hemogl and protein);
  • it is not far from the rejection region boundaries;
  • it may simply be related to the type I error of the Grubbs test.

SLIDE 37

Conclusions: summary

Our proposed method (De Bin et al., 2017):

  • can be used to detect possible influential points;
  • uses only information already available from a resampling-based procedure:
    ◮ resampling-based variable selection;
    ◮ resampling-based weights for model averaging;
    ◮ . . .
  • the decision whether an observation is or is not an influential point must be made by the user; an algorithm should only be used to nominate possible candidates (Hadi et al., 2009).

SLIDE 38

Conclusions: future (actually current) work

Future plans (with Kristoffer H. Hellton and Tonje G. Lien):

  • extension to high-dimensional data;
  • further issues:
    ◮ small sample size with respect to the number of variables;
    ◮ multiplicity of tests;
    ◮ . . .
  • a similar approach with a stable variable selection method;
  • focus on the effect of a single observation on the choice of the tuning parameter in a ridge regression.

SLIDE 39

References I

Atkinson, A. C. & Riani, M. (2002). Forward search added-variable t-tests and the effect of masked outliers on model selection. Biometrika 89, 939–946.
Augustin, N., Sauerbrei, W. & Schumacher, M. (2005). The practical utility of incorporating model selection uncertainty into prognostic models for survival data. Statistical Modelling 5, 95–118.
Buckland, S. T., Burnham, K. P. & Augustin, N. H. (1997). Model selection: an integral part of inference. Biometrics 53, 603–618.
Burnham, K. P. & Anderson, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer.
Chen, C.-H. & George, S. L. (1985). The bootstrap and identification of prognostic factors via Cox’s proportional hazards regression model. Statistics in Medicine 4, 39–46.
Chen, C.-H. & Wang, P. C. (1991). Diagnostic plots in Cox’s regression model. Biometrics 47, 841–850.

SLIDE 40

References II

De Bin, R., Janitza, S., Sauerbrei, W. & Boulesteix, A.-L. (2016). Subsampling versus bootstrapping in resampling-based model selection for multivariable regression. Biometrics 72, 272–280.
De Bin, R., Sauerbrei, W. & Boulesteix, A.-L. (2017). Detection of influential points as a byproduct of resampling-based model-building procedures. Computational Statistics & Data Analysis 116, 19–31.
Dixon, W. J. (1950). Analysis of extreme values. The Annals of Mathematical Statistics 21, 488–506.
Friedman, J. H. & Tukey, J. W. (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers 23, 881–890.
Grubbs, F. E. (1950). Sample criteria for testing outlying observations. The Annals of Mathematical Statistics 21, 27–58.
Hadi, A. S., Imon, A. & Werner, M. (2009). Detection of outliers. Wiley Interdisciplinary Reviews: Computational Statistics 1, 57–70.
Hansen, B. E. (2007). Least squares model averaging. Econometrica 75, 1175–1189.

SLIDE 41

References III

Hjort, N. L. & Claeskens, G. (2003). Frequentist model average estimators. Journal of the American Statistical Association 98, 879–899.
Johnson, R. W. (1996). Fitting percentage of body fat to simple body measurements. Journal of Statistics Education 4, 265–266.
Krall, J. M., Uthoff, V. A. & Harley, J. B. (1975). A step-up procedure for selecting variables associated with survival. Biometrics 31, 49–57.
Kuk, A. Y. (1984). All subsets regression in a proportional hazards model. Biometrika 71, 587–592.
Martin, M. A., Roberts, S. & Zheng, L. (2010). Delete-2 and delete-3 jackknife procedures for unmasking in regression. Australian & New Zealand Journal of Statistics 52, 45–60.
Royston, P. & Sauerbrei, W. (2007). Improving the robustness of fractional polynomial models by preliminary covariate transformation: A pragmatic approach. Computational Statistics & Data Analysis 51, 4240–4253.
Srivastava, M. (1980). Effect of equicorrelation in detecting a spurious observation. Canadian Journal of Statistics 8, 249–251.

SLIDE 43

Effect of observation 221 in the model building process

[Table: variables selected by backward elimination under BIC, α = 0.05, and AIC, each with ("in") and without ("out") observation 221. Among others, weight, ab and wrist are selected in all six fits, while height, chest, knee, ankle and biceps are never selected.]

NB: the models are obtained by using backward elimination.
