Quantile plots: New planks in an old campaign. Nicholas J. Cox

SLIDE 1

Quantile plots: New planks in an old campaign

Nicholas J. Cox Department of Geography

SLIDE 2

Quantile plots

Quantile plots show ordered values (raw data, estimates, residuals, whatever) against rank or cumulative probability or a one-to-one function of the same.

Tied values are assigned distinct ranks or probabilities.

SLIDE 3

Example with auto dataset

[Figure: quantile plot. y axis: Quantiles of Mileage (mpg), 10 to 40. x axis: Fraction of the data.]

SLIDE 4

quantile default

In this default from the official command quantile, ordered values are plotted on the y axis and the fraction of the data (cumulative probability) on the x axis. Quantiles (order statistics) are plotted against plotting position (i − 0.5)/n for rank i and sample size n.

Syntax was

    sysuse auto, clear
    quantile mpg, aspect(1)
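
A minimal sketch, not from the original slides, of what the default plot computes, using only official commands. The variable names rank and pp are illustrative. The unique option of egen's rank() function is what gives tied values distinct ranks.

```stata
* Sketch: plotting positions (i - 0.5)/n by hand
sysuse auto, clear

* unique ranks, so tied values of mpg get distinct ranks
egen rank = rank(mpg), unique

* n is the number of nonmissing values
count if !missing(mpg)
generate pp = (rank - 0.5) / r(N)

* ordered values on the y axis, fraction of the data on the x axis
scatter mpg pp, aspect(1) ///
    ytitle("Quantiles of Mileage (mpg)") xtitle("Fraction of the data")
```

The points should match those from quantile mpg, aspect(1), apart from the reference line that quantile adds.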

SLIDE 5

Quantile plots have a long history

Adolphe Quetelet (1796–1874), Sir Francis Galton (1822–1911), G. Udny Yule (1871–1951) and Sir Ronald Fisher (1890–1962) all used quantile plots avant la lettre.

In geomorphology, hypsometric curves for showing altitude distributions are a long-established device with the same flavour.

SLIDE 6

Quantile plots named as such

Martin B. Wilk (1922–2013) and Ramanathan Gnanadesikan (1932–2015)

Wilk, M. B. and Gnanadesikan, R. 1968. Probability plotting methods for the analysis of data. Biometrika 55: 1–17.

SLIDE 7

A relatively long history in Stata

Stata/Graphics User's Guide (August 1985) included do-files quantile.do and qqplot.do. Graph.Kit (February 1986) included commands quantile, qqplot and qnorm.

Thanks to Pat Branton of StataCorp for this history.

SLIDE 8

Related plots use the same information

Cumulative distribution plots show cumulative probability on the y axis.

Survival function plots show the complementary probability. Clearly, axes can be exchanged or reflected. distplot (Stata Journal) supports both. Many people will already know about sts graph.
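
A rough sketch with official commands only, as an alternative to distplot, whose syntax is not shown on these slides. The variable names cdf and survival are illustrative.

```stata
* Sketch: cumulative distribution and survival function of mpg
sysuse auto, clear

* empirical cumulative probability; equal gives tied values the same probability
cumul mpg, generate(cdf) equal

* the complementary probability
generate survival = 1 - cdf

* both as step functions, probability on the y axis
twoway line cdf survival mpg, sort connect(J J) ytitle("probability")
```

Exchanging the axes of the cumulative plot gives a quantile plot again, which is the sense in which these plots use the same information.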

SLIDE 9

So, why any fuss?

The presentation is built on a long-considered view that quantile plots are the best single plot for univariate distributions. No other kind of plot shows so many features so well across a range of sample sizes with so few arbitrary decisions.

Example: Histograms require binning choices.
Example: Density plots require kernel choices.
Example: Box plots often leave out too much.

SLIDE 10

What’s in a name? QQ-plots

Talk of quantile-quantile (Q-Q or QQ-) plots is also common. As discussed here, all quantile plots are also QQ-plots. The default quantile plot is just a plot of values against the quantiles of a standard uniform or rectangular distribution.

SLIDE 11

NJC commands

The main commands I have introduced in this territory are
◊ quantil2 (Stata Technical Bulletin)
◊ qplot (Stata Journal)
◊ stripplot (SSC)
Others will be mentioned later.

SLIDE 12

quantil2

This command published in Stata Technical Bulletin 51: 16–18 (1999) generalized quantile:
◊ One or more variables may be plotted.
◊ Sort order may be reversed.
◊ by() option is supported.
◊ Plotting position is generalised to (i − a)/(n − 2a + 1): compare a = 0.5 or (i − 0.5)/n wired into quantile.
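
A small sketch of the generalised plotting position, computed by hand with official commands rather than with quantil2 itself. The local macro names and the variables pp_half and pp_zero are illustrative, and a = 0 is chosen here only as a contrasting example.

```stata
* Sketch: plotting positions (i - a)/(n - 2a + 1) for two choices of a
sysuse auto, clear
egen rank = rank(mpg), unique
count if !missing(mpg)
local n = r(N)

* a = 0.5 reduces to (i - 0.5)/n, as wired into quantile
local a = 0.5
generate pp_half = (rank - `a') / (`n' - 2 * `a' + 1)

* a = 0 reduces to i/(n + 1)
local a = 0
generate pp_zero = (rank - `a') / (`n' - 2 * `a' + 1)

* check the a = 0.5 case against the simpler formula
assert abs(pp_half - (rank - 0.5) / `n') < 1e-6
summarize pp_half pp_zero
```

The two sets of positions differ most in the tails, which is where the choice of a can matter in small samples.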

SLIDE 13

qplot

The command quantil2 was renamed qplot and further revised in Stata Journal 5: 442–460 and 471 (2005), with later updates:
◊ over() option is also supported.
◊ Ranks may be plotted as well as plotting positions.
◊ The x axis scale may be transformed on the fly.
◊ recast() to other twoway types is supported.

SLIDE 14

stripplot

The command stripplot on SSC started under Stata 6 as onewayplot in 1999 as an alternative to graph, oneway and has morphed into (roughly) a superset of the official command dotplot.

It is mentioned here because of its general support for quantile plots as one style and its specific support for quantile-box plots, on which more shortly.

SLIDE 15

Comparing two groups is basic

[Figure, two panels. Superimposed: quantiles of Mileage (mpg) against fraction of the data, Domestic and Foreign on the same axes. Juxtaposed: Mileage (mpg) by Car type (Domestic, Foreign), side by side.]

SLIDE 16

Syntax was

    qplot mpg, over(foreign) aspect(1)
    stripplot mpg, over(foreign) cumulative centre vertical aspect(1)

[Figure: the juxtaposed plot (Mileage (mpg) by Car type) and the superimposed plot (quantiles of Mileage (mpg) against fraction of the data) for Domestic and Foreign cars.]

SLIDE 17

Quantiles and transformations commute

In essence, transformed quantiles and quantiles of transformed data are one and the same, with easy exceptions such as reciprocals reversing order. So, quantile plots mesh easily with transformations, such as thinking on logarithmic scale. For the latter, we just add simple syntax such as ysc(log). Note that this is not true of (e.g.) histograms, box plots or density plots, which need re-drawing.
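
A quick check of the commuting claim, sketched with official commands. The variable names lnmpg, rank_raw and rank_log are illustrative. The point is only that a monotone increasing transformation leaves ranks unchanged, so the same points can be shown on a transformed axis.

```stata
* Sketch: logarithms do not change the ordering of the data
sysuse auto, clear
generate lnmpg = ln(mpg)
egen rank_raw = rank(mpg)
egen rank_log = rank(lnmpg)

* identical ranks before and after transformation
assert rank_raw == rank_log

* so the quantile plot only needs a different axis scale
quantile mpg, yscale(log) aspect(1)
```

Note that this holds exactly for order statistics. Quantiles obtained by averaging or interpolating between order statistics commute only approximately.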

SLIDE 18

The shift is multiplicative, not additive?

[Figure, two panels: quantiles of Mileage (mpg) against fraction of the data, Domestic and Foreign superimposed; Mileage (mpg) by Car type, Domestic and Foreign juxtaposed.]

SLIDE 19

A more unusual example

Glacier terminus position change may be positive or negative, with possible outliers of either sign. Cube root transformation pulls in both tails and (fortuitously but fortunately) can separate advancing and retreating glaciers.

Here we use the stripplot command and data from Miles, B.W.J., Stokes, C.R., Vieli, A. and Cox, N.J. 2013. Rapid, climate-driven changes in outlet glaciers on the Pacific coast of East Antarctica. Nature 500: 563–566.

SLIDE 20

[Figure: Pacific Coast of East Antarctica glaciers. Terminus position change (m yr⁻¹), axis labelled −6000, −4000, −2000, 2000, for 1974-1990, 1990-2000 and 2000-2010. Boxplots show 5, 25, 50, 75, 95% points.]

SLIDE 21

[Figure: Pacific Coast of East Antarctica glaciers, cube root scale. Terminus position change (m yr⁻¹), axis labelled 3000, 1000, 300, −300, −1000, −3000, for 1974-1990, 1990-2000 and 2000-2010. Boxplots show 5, 25, 50, 75, 95% points.]

SLIDE 22

multqplot (Stata Journal)

multqplot is a convenience command to plot several quantile plots at once. It has uses in data screening and reporting. It might prove more illuminating than the tables of descriptive statistics ritual in various professions.

We use here the Chapman data from Dixon, W. J. and Massey, F.J. 1983. Introduction to Statistical Analysis. 4th ed. New York: McGraw–Hill.

SLIDE 23

[Figure: six quantile plots, each against fraction of the data (.25, .5, .75, 1), with these values labelled on the y axes:
age (years): 23, 33, 42, 52, 70
systolic blood pressure (mm Hg): 90, 110, 120, 130, 190
diastolic blood pressure (mm Hg): 55, 75, 80, 90, 112
cholesterol (mg/dl): 135, 245.5, 276, 331, 520
height (in): 62, 67, 68, 70, 74
weight (lb): 108, 147, 163, 180, 262]

SLIDE 24

multqplot details

By default the minimum, lower quartile, median, upper quartile and maximum are labelled on the y axis – so we are half-way to showing a box plot too. By default also variable labels (or names) appear at the top. More at Stata Journal 12: 549–561 (2012) and 13: 640–666 (2013).
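
A sketch of the same labelling idea for a single variable, using official commands only. This is not how multqplot is implemented, just an imitation of its default labels. The local macro name five is illustrative.

```stata
* Sketch: label a quantile plot at minimum, quartiles and maximum
sysuse auto, clear

* summarize, detail leaves the five numbers behind as r() results
summarize mpg, detail
local five `r(min)' `r(p25)' `r(p50)' `r(p75)' `r(max)'

* y axis labelled at those values, x axis at 0(0.25)1, both with grid lines
quantile mpg, ylabel(`five', grid) xlabel(0(0.25)1, grid) aspect(1)
```

The grid lines at the quartiles and at 0.25 and 0.75 then trace out the box of a box plot.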

SLIDE 25

Raw or smoothed?

Quantile plots show the data as they come: we get to see outliers, grouping, gaps and other quirks of the data, as well as location, scale and general shape. But sometimes the details are just noise or fine structure we do not care about. Once you register that values of mpg in the auto data are all reported as integers, you want to set that aside.

You can smooth quantiles, notably using the Harrell and Davis method, which turns out to be bootstrapping in disguise. hdquantile (SSC) offers the calculation.
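
A sketch of the Harrell and Davis calculation for a single quantile, done by hand with official functions. It does not use or imitate the syntax of hdquantile. It assumes the usual form of the estimator: a weighted mean of the order statistics, with weights given by increments of a beta distribution function with parameters p(n + 1) and (1 − p)(n + 1). The names w and wx and the choice p = 0.5 are illustrative.

```stata
* Sketch: Harrell-Davis estimate of the median of mpg
sysuse auto, clear
sort mpg
local n = _N
local p = 0.5
local a = `p' * (`n' + 1)
local b = (1 - `p') * (`n' + 1)

* weight on the i-th order statistic:
* beta(a, b) probability of the interval ((i - 1)/n, i/n]
generate double w = ibeta(`a', `b', _n / `n') - ibeta(`a', `b', (_n - 1) / `n')

* the weights sum to 1
summarize w, meanonly
assert abs(r(sum) - 1) < 1e-8

* the estimate is the weighted sum of the ordered values
generate double wx = w * mpg
summarize wx, meanonly
display "Harrell-Davis median of mpg: " r(sum)
```

Because every order statistic gets some weight, the result need not be one of the reported integer values, which is the smoothing that is wanted here.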

SLIDE 26

Harrell, F.E. and Davis, C.E. 1982. A new distribution-free quantile estimator. Biometrika 69: 635–640.

[Figure: H-D quantiles of mpg against fraction of the data, for Domestic and Foreign cars.]

SLIDE 27

Letter values

Often we do not really need all the quantiles, especially if the sample size is large. We could just use the letter values, which are the median, quartiles (fourths), octiles (eighths), and so forth out to the extremes, halving the tail probabilities at each step.

lv supports letter value displays. lvalues (SSC) is now available to generate variables.

Thanks to David Hoaglin for suggesting letter values at the Chicago meeting and to Kit Baum for posting lvalues on SSC.

SLIDE 28

Parsimony of letter values

For n data values, there are 1 + 2 ceil(log₂ n) letter values. For n = 1000, 10⁶, 10⁹, there are 21, 41, 61 letter values. We will see examples shortly.
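
The count can be checked directly. This is a sketch with official commands only; base 2 logarithms are written as ln(n)/ln(2).

```stata
* Sketch: number of letter values, 1 + 2 ceil(log2 n)
foreach n in 1000 1e6 1e9 {
    display %12.0f `n' "  " 1 + 2 * ceil(ln(`n') / ln(2))
}

* letter value display for mpg in the auto data
sysuse auto, clear
lv mpg
```

The loop displays 21, 41 and 61 for the three sample sizes, matching the counts above.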

SLIDE 29

Fitting or testing named distributions

Using quantile plots to compare data with named distributions is common. The leading example is using the normal (Gaussian) as reference distribution. Indeed, many statistical people first meet quantile plots as such normal probability plots.

Yudi Pawitan in his 2001 book In All Likelihood (Oxford University Press) advocates normal QQ-plots as making sense generally, even when comparison with normal distributions is not the goal.

SLIDE 30

qnorm available but limited

qnorm is already available as an official command, but it is limited to the plotting of just one set of values.

SLIDE 31

Named distributions with qplot

qplot has a general trscale() option to transform the x axis scale that otherwise would show plotting positions or ranks.

For normal distributions, the syntax is just to add

    trscale(invnormal(@))

@ is a placeholder for what would otherwise be plotted. invnormal() is Stata's name for the normal quantile function (as an inverse cumulative distribution function).
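
For readers without qplot installed, a sketch of roughly what that option does, using official commands only. It assumes plotting positions (i − 0.5)/n calculated separately within each group. The variable names pp and z are illustrative.

```stata
* Sketch: normal quantile plot for two groups by hand
sysuse auto, clear

* plotting positions within each group; sorting on mpg gives unique ranks
bysort foreign (mpg): generate pp = (_n - 0.5) / _N

* transform the x axis scale with the normal quantile function
generate z = invnormal(pp)

twoway (scatter mpg z if !foreign) (scatter mpg z if foreign),     ///
    legend(order(1 "Domestic" 2 "Foreign")) xtitle("invnormal(P)") ///
    ytitle("quantiles of Mileage (mpg)") aspect(1)
```

With qplot the whole calculation collapses to one command with over(foreign) and trscale(invnormal(@)).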

SLIDE 32

[Figure: quantiles of Mileage (mpg), 10 to 40, against invnormal(P), −2 to 2, for Domestic and Foreign cars.]

SLIDE 33

A standard plot in support of t tests?

This plot is suggested as a standard for two-group comparisons:
◊ We see all the data, including outliers or other problems.
◊ Use of a normal probability scale shows how far that assumption (read: ideal condition) is satisfied.
◊ The vertical position of each group tells us about location, specifically means.
◊ The slope or tilt of each group tells us about scale, specifically standard deviations.
◊ It is helpful even if we eventually use Wilcoxon-Mann-Whitney or something else.

SLIDE 34

What if you had paired values?

Plot the differences, naturally. Nothing stops you plotting the original values too, but at some point the graphics should respect the pairing.
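
A sketch with a dataset that ships with Stata. The bpwide data are not used in the original slides; they are chosen here only because they hold paired before and after values. The variable name diff is illustrative.

```stata
* Sketch: normal quantile plot of paired differences
sysuse bpwide, clear
generate diff = bp_after - bp_before

* one set of values, so official qnorm is enough
qnorm diff, aspect(1)
```

A horizontal reference line at zero, added with yline(0), makes it easy to see how many differences are positive and how many negative.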

SLIDE 35

Different axis labelling?

The last plot used a scale of standard normal deviates or z scores. Some might prefer different labelling, e.g. % points. mylabels (SSC) is a helper command, which puts the mapping in a local macro for your main command:

    mylabels 1 2 5 10(20)90 95 98 99, myscale(invnormal(@/100)) local(plabels)

SLIDE 36

[Figure: quantiles of Mileage (mpg), 10 to 40, against exceedance probability (%), labelled 1 2 5 10 30 50 70 90 95 98 99, for Foreign and Domestic cars.]

SLIDE 37

Syntax for that example

    sysuse auto, clear
    mylabels 1 2 5 10(20)90 95 98 99, myscale(invnormal(@/100)) local(plabels)
    qplot mpg, over(foreign) trscale(invnormal(@)) aspect(1) xla(`plabels') xtitle(exceedance probability (%)) xsc(titlegap(*5)) legend(pos(11) ring(0) order(2 1) col(1))

SLIDE 38

How would letter values do?

For the auto data there are
52 domestic cars: 13 letter values
22 foreign cars: 11 letter values.

The use of letter values is parsimonious, but respectful of major detail: extremes are always echoed.

SLIDE 39

[Figure, two panels: quantiles of Mileage (mpg), 10 to 40, against invnormal(P), −2 to 2, for Foreign and Domestic cars.]

SLIDE 40

Other named distributions?

There are many, many named distributions for which customised QQ-plot commands could be written. I am guilty of programs for beta, Dagum, Dirichlet, exponential, gamma, generalized beta (second kind), Gumbel, inverse gamma, inverse Gaussian, lognormal, Singh-Maddala and Weibull distributions.

But a better approach when feasible is to allow a distribution to be specified on the fly.

SLIDE 41

Harold Jeffreys suggested that error distributions are more like t distributions with 7 df than like Gaussians.

1939/1948/1961. Theory of probability. Oxford University Press. Ch.5.7

1938. The law of error and the combination of observations. Philosophical Transactions of the Royal Society, Series A 237: 231–271

Sir Harold Jeffreys 1891–1989
County Durham man
established that the Earth’s core is liquid
pioneer Bayesian

SLIDE 42

[Figure, two panels, plotted for probability in [0.001, 0.999]: t 7 df, axis from −6 to 6, kurtosis 5; normal, axis from −3 to 3, kurtosis 3.]

SLIDE 43

How to explore?

Simulate with rt(7) and samples of desired size. trscale(invt(7, @)) sets up x axis scale on the fly.
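
A sketch of one such simulation. The seed, the sample size of 500 and the variable names y and t7 are arbitrary choices for illustration. The reference quantiles are calculated by hand so that only official commands are needed.

```stata
* Sketch: one sample from t with 7 df, against normal and t quantiles
clear
set seed 2016
set obs 500
generate y = rt(7)

* against normal quantiles: look for tails bending away from the line
qnorm y, aspect(1) name(vs_normal, replace)

* against quantiles of t with 7 df, using plotting positions (i - 0.5)/n
sort y
generate t7 = invt(7, (_n - 0.5) / _N)
scatter y t7, aspect(1) name(vs_t7, replace)

* with qplot installed, the second plot is just
* qplot y, trscale(invt(7, @))
```

Repeating this for several seeds gives a feel for how much sampling variation to expect at a given sample size.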

SLIDE 44

[Figure: nine panels, numbered 1 to 9. y axis: quantiles of t7. x axis: normal deviates.]

SLIDE 45

[Figure: nine panels, numbered 1 to 9. y axis: quantiles of t7. x axis: t with 7 df.]

SLIDE 46

Box plot hybrids

SLIDE 47

Adding a box plot flavour

Earlier we saw how extremes and quartiles could be made explicit on the y axis of a quantile plot. They are the minimal ingredients for a box plot. Clearly we can also flag cumulative probabilities 0(0.25)1 on the corresponding x axis scale.

[Figure: six quantile plots (age, systolic blood pressure, diastolic blood pressure, cholesterol, height, weight), with minimum, quartiles and maximum labelled on each y axis and .25, .5, .75, 1 on each x axis.]

SLIDE 48

Tracing the box

In multqplot by default the box is shown as part of a double set of grid lines. This helps underline that half of the points on a box plot are inside the box and half outside, a basic fact often missed in interpreting these plots, even by experienced researchers.

[Figure: quantile plot of age (years), y axis labelled 23, 33, 42, 52, 70 and x axis labelled .25, .5, .75, 1.]

SLIDE 49

Quantile-box plots

Emanuel Parzen (1929–2016) introduced quantile-box plots in 1979. Nonparametric statistical data modeling. Journal of the American Statistical Association 74: 105–131.

His original examples were not especially impressive, perhaps one reason they have not been more widely emulated.

SLIDE 50

Boston housing data

Here for quantile-box plots we use data from Harrison, D. and Rubinfeld, D.L. 1978. Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management 5: 81–102. https://archive.ics.uci.edu/ml/datasets/Housing

Number of Figures in original paper: 1
Number of Figures showing raw data: 0

SLIDE 51

Broad contrast and fine structure

    stripplot MEDV, over(CHAS) vertical cumulative centre box cumprob aspect(1)

[Figure: Median value of owner-occupied homes in $1000, 10 to 50, by whether tract bounds Charles River (0, 1).]

SLIDE 52

Some quirks in that dataset

[Figure: four quantile plots, each against fraction of the data (.25, .5, .75, 1), with these values labelled on the y axes:
% residential land zoned lots >25000 sq. ft: 12.5, 100
% non-retail business acres per town: .46, 5.19, 9.69, 18.1, 27.74
accessibility to radial highways: 1, 4, 5, 24, 24
full-value property-tax rate per $10,000: 187, 279, 330, 666, 711]

SLIDE 53

Ordinal (graded) data

Ordinal (graded) data can be shown with quantile plots too. Such data might alternatively be plotted against the midpoints of the corresponding probability intervals. Statistical discussion was given in Stata Journal 4: 190–215 (2004), Section 5.
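
A sketch of the midpoint calculation by hand, with official commands only. It assumes that the midpoint for each grade is the cumulative probability up to and including that grade, minus half the probability of that grade. The variable names freq, cumfreq and midpoint are illustrative.

```stata
* Sketch: midpoints of probability intervals for an ordinal variable
sysuse auto, clear

* one observation per grade with its frequency, missing values excluded
contract rep78 if !missing(rep78), freq(freq)

* running sum of frequencies; the last value is the total
generate cumfreq = sum(freq)
generate midpoint = (cumfreq - freq / 2) / cumfreq[_N]

list rep78 freq cumfreq midpoint, noobs
scatter rep78 midpoint, connect(l) aspect(1)
```

Each grade is then plotted once, at the middle of the range of cumulative probability that it occupies.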

SLIDE 54

[Figure, two panels: quantiles of Repair Record 1978, 1 to 5, against fraction of the data; quantiles of Repair Record 1978, 1 to 5, against logit(P), −4 to 4. Foreign and Domestic cars.]

SLIDE 55

    qplot rep78, aspect(1) over(foreign) midpoint recast(connect) trscale(logit(@)) xsc(titlegap(*5)) legend(pos(11) ring(0) col(1) order(2 1))

The midpoint option is included in a Software Update in press, Stata Journal 16(3) 2016.

SLIDE 56

Differences of quantiles

Plotting differences of quantiles versus their mean or versus plotting position is often a good idea. cquantile (SSC) is a helper program. Much more was said on this at Stata Journal 7: 275–279 (2007).

SLIDE 57

Words from the wise

SLIDE 58

Graphs force us to note the unexpected; nothing could be more important.
John Wilder Tukey 1915–2000

Using the data to guide the data analysis is almost as dangerous as not doing so.
Frank E. Harrell Jr

SLIDE 59

Questions?

SLIDE 60

All graphs use Stata scheme s1color, which I strongly recommend as a lazy but good default. This font is Georgia. This font is Lucida Console.
