SLIDE 1

Regression Diagnostics and the Forward Search 3. A Single Multivariate Sample

Anthony Atkinson, LSE

Regression Diagnostics and the Forward Search 3. A Single Multivariate Sample – p. 1/29

SLIDE 2

Multivariate Normality

Much multivariate data is modelled with the normal distribution, often after a transformation to approximate normality (Box and Cox 1964). But do we have:

  • A sample from a single normal population?
  • The same, but with some outliers?
  • A sample from several normal populations?
  • The same with outliers as well?

The numbers of populations and of outliers are both unknown

SLIDE 3

Obscured Structure

The main diagnostic tools that we use are:

  • 1. Plots of the data, especially scatterplot matrices
  • 2. Various plots of Mahalanobis distances. The squared distances for the sample are defined as

$$d_i^2 = \{y_i - \hat{\mu}\}^T \hat{\Sigma}^{-1} \{y_i - \hat{\mu}\}, \quad (i = 1, \ldots, n),$$

where $\hat{\mu}$ is the vector of means of the n observations and $\hat{\Sigma}$ is the unbiased estimator of the population covariance matrix.

  • 3. These are the multivariate form of scaled residuals.

But: 1. Hard to interpret for many variables; 2. Subject to masking.
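The squared distances defined above can be computed in a few lines. The following is a minimal NumPy sketch; the function name `squared_mahalanobis` and the simulated data are illustrative assumptions and do not come from the slides.

```python
import numpy as np

def squared_mahalanobis(Y):
    """Squared Mahalanobis distances of each row of Y from the sample mean."""
    Y = np.asarray(Y, dtype=float)
    mu_hat = Y.mean(axis=0)               # vector of means of the n observations
    sigma_hat = np.cov(Y, rowvar=False)   # unbiased estimator (divisor n - 1)
    resid = Y - mu_hat
    # Solve Sigma^{-1} resid^T instead of forming the inverse explicitly.
    return np.einsum("ij,ij->i", resid, np.linalg.solve(sigma_hat, resid.T).T)

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 6))   # simulated stand-in, NOT the Swiss heads data
d2 = squared_mahalanobis(Y)
```

With the unbiased covariance estimator the squared distances always sum to v(n − 1), which gives a quick sanity check on any implementation.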

SLIDE 4

The Forward Search 1

We use the Forward Search to find structure:

  • Explore relationship between data and fitted models that may be obscured by fitting (masking)
  • Output mostly graphical (versions of tests).
  • FS orders the observations by closeness to the assumed model
  • Start with a small subset of the data
  • Move Forward: increase the number of observations m used for fitting the model.
  • Continue until m = n

SLIDE 5

The Forward Search 2

For a subset of m observations the parameter estimates are $\hat{\mu}(m)$ and $\hat{\Sigma}(m)$. From this subset we obtain n squared Mahalanobis distances

$$d_i^2(m) = \{y_i - \hat{\mu}(m)\}^T \hat{\Sigma}^{-1}(m) \{y_i - \hat{\mu}(m)\}, \quad (i = 1, \ldots, n).$$

  • When m observations are used in fitting, the optimum subset $S^*(m)$ yields n squared distances $d_i^2(m^*)$
  • Order these squared distances and take the observations corresponding to the m + 1 smallest as the new subset $S^*(m + 1)$
  • Usually this process augments the subset by only one observation. Sometimes two or more observations enter as one or more leave
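One step of the search, as described above, might be sketched as follows. The helper names, the simulated data and the arbitrary starting subset are assumptions made for illustration; the slides choose the starting subset differently.

```python
import numpy as np

def mahalanobis_from_subset(Y, subset):
    """Squared distances of all n rows, using estimates fitted to Y[subset]."""
    mu = Y[subset].mean(axis=0)
    sigma = np.cov(Y[subset], rowvar=False)
    resid = Y - mu
    return np.einsum("ij,ij->i", resid, np.linalg.solve(sigma, resid.T).T)

def forward_step(Y, subset):
    """New subset of size m + 1: the units with the m + 1 smallest distances."""
    d2 = mahalanobis_from_subset(Y, subset)
    m = len(subset)
    return np.sort(np.argsort(d2, kind="stable")[: m + 1])

rng = np.random.default_rng(1)
Y = rng.normal(size=(60, 3))   # simulated data for illustration
subset = np.arange(10)         # arbitrary start, for illustration only
entered_counts = []            # how many units enter at each step
while len(subset) < len(Y):
    new_subset = forward_step(Y, subset)
    entered_counts.append(len(np.setdiff1d(new_subset, subset)))
    subset = new_subset
```

Because the new subset is rebuilt from the ordered distances rather than by appending a single unit, a step can admit several units while others leave; `entered_counts` records how often that happens.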

SLIDE 6

The Forward Search 3: One Population

  • For each m0 ≤ m ≤ n, plot the n distances $d_i(m^*)$: a forward plot.
  • The starting subset of m0 (< n/10) comes from bivariate boxplots that exclude outlying observations in any one- or two-dimensional plot
  • Content of contours adjusted to give required m0
  • With one population the search is not sensitive to the exact choice of starting subset.

SLIDE 7

The Forward Search 4

The distances tend to decrease as m increases. If interest is in the latter part of the search we look at

  • Scaled distances

$$d_i(m^*) \times \left\{ |\hat{\Sigma}(m^*)| \, / \, |\hat{\Sigma}(n)| \right\}^{1/(2v)}$$

  • v is the dimension of the observations y (v variables) and $\hat{\Sigma}(n)$ is the estimate of Σ at the end of the search.
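The scaling above can be sketched in NumPy as follows. The function name `scaled_distances` and the toy matrices are assumptions for illustration; log-determinants are used only for numerical stability.

```python
import numpy as np

def scaled_distances(d, sigma_m, sigma_n):
    """Multiply distances d (not squared) by (|Sigma(m)| / |Sigma(n)|)^(1/(2v))."""
    v = sigma_m.shape[0]
    _, logdet_m = np.linalg.slogdet(sigma_m)
    _, logdet_n = np.linalg.slogdet(sigma_n)
    return d * np.exp((logdet_m - logdet_n) / (2 * v))

# Toy check with v = 2: |0.25 I| / |I| = 0.0625, and 0.0625 ** (1/4) = 0.5.
d = np.array([1.0, 2.0])
scaled = scaled_distances(d, 0.25 * np.eye(2), np.eye(2))
```

Early in the search the subset covariance matrix has a smaller determinant than the final estimate, so the factor is below one and pulls the early, inflated distances down.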

SLIDE 8

Swiss Heads 1

As a first example of the use of forward plots we start with data given by Flury and Riedwyl (1988, p. 218): six readings on the dimensions of the heads of 200 twenty-year-old Swiss soldiers.

SLIDE 9

[Figure: scatterplot matrix of the six variables y1 to y6]

Swiss heads: scatterplot matrix with observations 104 and 111 marked

SLIDE 10

[Figure: boxplots of y1 to y6; unit labels appearing in the plot: 159, 10, 147, 104, 111, 194, 195, 80, 57, 160]

Swiss heads: boxplots of the six variables with univariate outliers labelled

SLIDE 11

Swiss Heads

Starting the Search. We find observations within elliptical contours fitted to all the data. The scaling parameter for the ellipses is called θ, the value being chosen to give the desired value for m0. The distribution of the $d_i^2(n)$ is scaled Beta, approximated by a scaled F distribution, which is exact if Σ is estimated but µ is known. The value of θ can be interpreted as a quantile of the F distribution.
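As background, the scaled Beta distribution mentioned above is usually stated as follows. This is the standard result for distances computed from the sample mean and the unbiased covariance estimator under normality; it is not spelled out on the slide itself.

```latex
\frac{n}{(n-1)^2}\, d_i^2(n) \sim \mathrm{Beta}\!\left(\frac{v}{2},\, \frac{n-v-1}{2}\right),
\qquad i = 1, \ldots, n .
```

A quick consistency check: the Beta mean is v/(n − 1), so the expected squared distance is v(n − 1)/n, which agrees with the identity that the n squared distances sum to v(n − 1).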

SLIDE 12

Swiss heads: scatterplot matrix. The outer ellipse (θ = 4.71) indicates some potential outliers. The inner ellipse (θ = 0.92) gives m0 = 25 ???

SLIDE 13

[Figure: forward plot; x-axis "Subset size m", y-axis "Mahalanobis distances"; trajectories for units 104 and 111 labelled]

Swiss heads: forward plot of scaled Mahalanobis distances showing little structure. The rising diagonal white band separates those units which are in the subset from those that are not. At the end of the search there are perhaps two outliers, observations 104 and 111.

SLIDE 14

Swiss Heads 2

Of course, we do not have to look at a plot of all the distances

SLIDE 15

[Figure: forward plot; x-axis "Subset size m", y-axis "Mahalanobis distances"; trajectories for units 104 and 111 labelled]

Swiss heads: forward plot of scaled Mahalanobis distances. The trajectories for units 104 and 111 are highlighted; they are initially not particularly extreme

SLIDE 16

Swiss Heads 3

The plot of unscaled distances looks similar

SLIDE 17

[Figure: forward plot; x-axis "Subset size m", y-axis "Mahalanobis distances"; trajectories for units 104 and 111 labelled]

Swiss heads: forward plot of unscaled Mahalanobis distances. The trajectories for units 104 and 111 are again highlighted; the behaviour at the end of the search is obscured

SLIDE 18

The Forward Search 4: Outliers

To detect outliers we examine the minimum Mahalanobis distance amongst observations not in the subset

$$d_{[m+1]}(m) = \min_{i \notin S^*(m)} d_i(m), \qquad (1)$$

or its scaled version $d^{sc}_{[m+1]}(m)$. If observation [m + 1] is an outlier relative to the other m observations, this distance will be large compared to the maximum Mahalanobis distance of observations in the subset.

  • If observation [m + 1] is an outlier, so will be all the remaining n − m − 1 observations with larger values of $d_i$.
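The statistic in equation (1) can be sketched as below. The function name `min_distance_outside` and the toy numbers are illustrative assumptions, not values from the Swiss heads data.

```python
import numpy as np

def min_distance_outside(d, subset):
    """Index and value of the smallest distance among units not in the subset."""
    n = len(d)
    outside = np.setdiff1d(np.arange(n), subset)   # units with i not in S(m)
    j = outside[np.argmin(d[outside])]
    return j, d[j]

# Toy distances for n = 5 units, with units 0 and 4 currently in the subset.
d = np.array([0.5, 3.0, 1.2, 2.0, 0.9])
unit, value = min_distance_outside(d, subset=np.array([0, 4]))
```

In this toy case unit 2 is the next candidate to enter, with distance 1.2; units 3 and 1 have larger distances and would follow it.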

SLIDE 19

[Figure: forward plot; x-axis "Subset size m", y-axis "Minimum MD"]

  • Swiss heads: forward plot of minimum distances of units not in the subset. There may be a few outliers entering at the end of the search
  • Use simulation to provide distribution

SLIDE 20

Simulation Envelopes

[Figure: forward plot; x-axis "Subset size m", y-axis "Minimum Mahalanobis distance"]

  • Swiss heads: forward plot of minimum distances of units not in the subset.
  • 1, 5, 50, 95 and 99% points of 10,000 simulation envelopes (and an approximation)
  • No outliers indicated
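Envelopes of this kind could be simulated roughly as follows. This is a sketch under stated assumptions rather than the procedure behind the slides: the starting subset is simply the m0 units closest to the full-sample fit (the slides use bivariate boxplots), far fewer than 10,000 simulations are run, and all function names are hypothetical. Standard normal samples are used because Mahalanobis distances are unchanged by affine transformations of the data.

```python
import numpy as np

def sq_dist(Y, subset):
    """Squared Mahalanobis distances of all rows, fitted to Y[subset]."""
    mu = Y[subset].mean(axis=0)
    sigma = np.cov(Y[subset], rowvar=False)
    resid = Y - mu
    return np.einsum("ij,ij->i", resid, np.linalg.solve(sigma, resid.T).T)

def min_md_trajectory(Y, m0):
    """Minimum distance among units outside the subset, for m = m0, ..., n - 1."""
    n = len(Y)
    # Assumed starting rule: the m0 units closest to the full-sample fit.
    subset = np.argsort(sq_dist(Y, np.arange(n)))[:m0]
    traj = []
    for m in range(m0, n):
        d2 = sq_dist(Y, subset)
        inside = np.zeros(n, dtype=bool)
        inside[subset] = True
        traj.append(np.sqrt(d2[~inside].min()))
        subset = np.argsort(d2)[: m + 1]   # move forward to size m + 1
    return np.array(traj)

def simulation_envelopes(n, v, m0, n_sim=200, probs=(1, 5, 50, 95, 99), seed=0):
    """Pointwise percentage points of the trajectory under a single normal sample."""
    rng = np.random.default_rng(seed)
    sims = np.array([min_md_trajectory(rng.normal(size=(n, v)), m0)
                     for _ in range(n_sim)])
    return np.percentile(sims, probs, axis=0)

env = simulation_envelopes(n=50, v=3, m0=10, n_sim=100)   # rows: 1, 5, 50, 95, 99%
```

An observed trajectory from `min_md_trajectory` can then be overlaid on the rows of `env`; values above the top row late in the search would point to outliers.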

SLIDE 21

The Forward Search 5

  • Here we seem to have one normal population with two slightly extreme observations
  • Do these observations matter?
  • Do they affect inferences?
  • Are they important for themselves?
  • The Forward Search reduces multivariate (v-dimensional) problems to 2 dimensions
  • But it may be informative to look at plots of the data in the light of the search results.

SLIDE 22

[Figure: scatterplot matrix of the six variables y1 to y6]

Units 104 and 111 are plotted as dots

SLIDE 23

Swiss Banknote Data

  • There are (again) 200 observations on six variables
  • All notes have been withdrawn from circulation and classified by an expert

  • 100 notes are “genuine”, 100 “forgeries”
  • But the notes may be misclassified
  • There may be more than one forger
  • For the moment we just look at the forgeries

SLIDE 24

Swiss Banknote Data

To determine whether there are any outliers, we look at the forward plot of minimum distances of units not in the subset

SLIDE 25

[Figure: two forward plots against "Subset size m"; left panel y-axis "Minimum Mahalanobis distance", right panel y-axis "Minimum scaled Mahalanobis distance"]

Swiss banknotes, forgeries (n = 100): forward plot of minimum Mahalanobis distance with superimposed 1, 5, 95, and 99% bootstrap envelopes using 10,000 simulations. Left panel unscaled distances, right panel scaled distances. There is a clear indication of the presence of outliers which starts around m = 84. Note the masking.

SLIDE 26

Resuperimposition

[Figure: forward plot; x-axis "Subset size m", y-axis "Minimum scaled Mahalanobis distance"]

  • The envelopes rise sharply at the end
  • If there is masking, the final part of the search may lie inside the envelopes
  • The envelopes depend on n
  • We find the largest value of m for which the observed values lie within the envelopes for n = m
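The resuperimposition idea in the bullets above might be organised as in the sketch below. Everything here is an assumption for illustration: `upper_envelope` is a hypothetical callable returning the upper envelope for a trial sample size, the search over trial sizes stops at the first failure, and the toy example uses a flat envelope, whereas real envelopes rise towards the end of the search and would have to be simulated for each trial n.

```python
import numpy as np

def resuperimpose(min_md, m0, upper_envelope, n_start):
    """Largest trial n, scanning upward from n_start, whose envelope contains the curve.

    min_md[k] is the observed minimum distance at subset size m0 + k.
    upper_envelope(n_trial) returns upper limits for m = m0, ..., n_trial - 1.
    """
    n = m0 + len(min_md)
    n_ok = None
    for n_trial in range(n_start, n + 1):
        observed = min_md[: n_trial - m0]
        upper = np.asarray(upper_envelope(n_trial))[: n_trial - m0]
        if np.any(observed > upper):
            break            # first trial size showing evidence of an outlier
        n_ok = n_trial
    return n_ok

# Toy example: n = 20, m0 = 5; the curve jumps from 3.0 to 5.0 at m = 15.
m0 = 5
min_md = np.array([3.0] * 10 + [5.0] * 5)
flat_envelope = lambda n_trial: np.full(n_trial - m0, 4.0)   # hypothetical envelope
n_clean = resuperimpose(min_md, m0, flat_envelope, n_start=10)
```

In the toy example the curve first leaves the envelope when the trial size reaches 16, so the sketch reports 15 as the largest outlier-free sample size.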

SLIDE 27

[Figure: four forward plots of "Mahalanobis distances" against "Subset size m"; panels titled "Try n=84", "Try n=85", "Try n=86" and "Try n=87"]

Swiss Banknotes: forward plot of minimum Mahalanobis distance. When n = 84 and 85, the observed curve lies within the 99% envelope, but there is clear evidence of an outlier when n = 86. The evidence becomes even stronger when another observation is included.

SLIDE 28

Several Populations

  • In the “forgeries” there were 85 “good” observations and 15 outliers.
  • Do these outliers form a group?
  • What happens if there are several groups?
  • For the banknote data we will at least have “genuine notes”, “forgeries” and “outliers (from the forgeries)”
  • Can we detect these with the FS?
  • An example of a clustering problem with the number of groups and their properties both unknown.

SLIDE 29

References

Box, G. E. P. and D. R. Cox (1964). An analysis of transformations (with discussion). Journal of the Royal Statistical Society, Series B 26, 211–246.

Flury, B. and H. Riedwyl (1988). Multivariate Statistics: A Practical Approach. London: Chapman and Hall.
