There are three types of lies: lies, damned lies and statistics

SLIDE 1
SLIDE 2

There are three types of lies — lies, damned lies and statistics

SLIDE 3

Benjamin Disraeli

◮ British prime minister (Tory).

SLIDE 4

William Gladstone

◮ Defeated Disraeli in the general election of 1868.
◮ President of the Royal Statistical Society 1867-1869.

SLIDE 5

Another Disraeli quote

. . . That question is this: Is man an ape or an angel? I, my lord, I am on the side of the angels. I repudiate with indignation and abhorrence those new fangled theories. (Oxford Diocesan Conference 25/11/1864)

SLIDE 6
SLIDE 7
SLIDE 8
SLIDE 9

A rational approach to uncertainty?

[Figure: two time series, 1850 to 2000. Left: global temperature anomaly (C) by year. Right: atmospheric CO2 (PPM) by year.]

SLIDE 10

Absorption spectra

SLIDE 11

Is abstraction the problem?

SLIDE 12
SLIDE 13

Baker & Bellis, 1993, Animal Behaviour

[Figure: scatterplot matrix of count, prop.partner and time.ipc.]

SLIDE 14

The Baker and Bellis Analysis

[Figure: sequence of panels plotting count and residuals (rsd) against prop.partner and time.ipc.]

SLIDE 15

Baker and Bellis Conclusions

◮ At the end of the process they asked whether the apparent straight line relationships were stronger than could plausibly have arisen by chance.
◮ On this basis they concluded that there is evidence for count declining with proportion of time spent together.
◮ Time since last copulation seemed not to play a detectable role.
◮ But they also collected another dataset . . .
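
One way to ask whether a straight line relationship is "stronger than could plausibly have arisen by chance" is a permutation test: shuffle the response, refit the line, and see how often the shuffled slope is as steep as the observed one. The slides do not say which test Baker and Bellis used, so this is only a sketch of the idea, and the numbers below are made up for illustration (they are not the study data).

```python
import random


def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def permutation_p_value(x, y, n_perm=2000, seed=1):
    """Share of shuffled datasets whose |slope| is at least the observed one."""
    rng = random.Random(seed)
    observed = abs(slope(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any real link between x and y
        if abs(slope(x, shuffled)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (hits + 1) / (n_perm + 1)


# Made-up illustrative numbers, NOT the Baker and Bellis data.
prop_partner = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
count = [520, 480, 450, 400, 410, 330, 300, 310, 220, 180, 150]

p = permutation_p_value(prop_partner, count)
print(f"slope = {slope(prop_partner, count):.1f}, permutation p = {p:.4f}")
```

A small p value says only that a line this steep is unlikely under pure shuffling. It says nothing about whether a straight line is the right shape, or whether another variable explains the pattern.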

SLIDE 16

[Figure: scatterplot matrix of count, f.age, f.height, f.weight, m.age, m.height, m.weight and m.vol.]

SLIDE 17

More conclusions. . .

◮ Going through the same process as with the first data set leads to the conclusion that only female weight is linearly related to count.
◮ But a careful look at the residuals shows that this conclusion is completely dependent on a single data point with very low sperm count.
◮ Re-do the analysis without this datum, and only volume matters.
◮ Actually it’s the same subjects in both datasets, and we can match up the volumes with the first dataset.
◮ Repeating the first analysis with volume added leads to the dull conclusion that there is evidence only for a linear relationship between count and volume.
◮ This result has limited marketing potential.
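
The dependence of a fitted line on one observation can be checked by refitting with each point left out in turn. The slides describe looking at residuals, so this leave-one-out check is a swapped-in illustration of the same concern, not the method used in the talk. The variable names echo the plot labels, but the numbers are invented.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def leave_one_out_slopes(x, y):
    """Slope of the refitted line with each observation dropped in turn."""
    slopes = []
    for i in range(len(x)):
        xs = x[:i] + x[i + 1:]
        ys = y[:i] + y[i + 1:]
        slopes.append(fit_line(xs, ys)[1])
    return slopes


# Made-up numbers, NOT the study data: a flat cloud plus one very low count.
f_weight = [52, 54, 55, 56, 57, 58, 59, 60, 61, 64]
count = [310, 290, 305, 300, 295, 308, 292, 303, 297, 40]

full_slope = fit_line(f_weight, count)[1]
loo = leave_one_out_slopes(f_weight, count)
# Index of the point whose removal changes the slope the most.
most_influential = max(range(len(loo)), key=lambda i: abs(loo[i] - full_slope))

print(f"slope with all points: {full_slope:.2f}")
print(f"slope without point {most_influential}: {loo[most_influential]:.2f}")
```

In this toy example the apparent decline comes almost entirely from the last point, which is the kind of situation the second bullet describes.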

SLIDE 18

But why straight lines anyway?

[Figure: scatterplot matrix of count, prop.partner and time.ipc.]

SLIDE 19

Smoothing

1. What if the relationship between the residuals and a variable does not look like a straight line?
2. Why not let it be a smooth curve, instead?

[Figure: estimated smooths s(prop.partner,1.07) against prop.partner and s(time.ipc,1.77) against time.ipc.]

SLIDE 20

How to choose the best fit curve?

◮ Take a bendy strip of wood.
◮ Hook it up to the data points with springs.
◮ The result is a spline

[Figure: wear against size.]
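
The strip-and-springs picture corresponds to the standard smoothing spline criterion, written here as a sketch. The symbols are assumptions introduced for illustration: data points (x_i, y_i), curve f, and a stiffness parameter λ.

```latex
\hat{f} \;=\; \arg\min_{f} \; \sum_{i=1}^{n} \bigl( y_i - f(x_i) \bigr)^2 \;+\; \lambda \int f''(x)^2 \, dx
```

The first term plays the role of the energy stored in the stretched springs, and the second the energy needed to bend the strip. A larger λ means a stiffer strip.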

SLIDE 21

Splines are controllable

◮ Changing the flexibility of the spline changes the curve.

[Figure: four panels of wear against size.]

◮ Splines can be described mathematically, in a way that is easy to work with.
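
A minimal sketch of "easy to work with", assuming equally spaced x values and a discrete stand-in for a spline: the fitted values minimise squared error plus a penalty `lam` times the squared second differences, which reduces to one linear system. The function names and the wear numbers are hypothetical, and this is not the method used to draw the slides' figures.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def smooth(y, lam):
    """Minimise sum (y_i - f_i)^2 + lam * sum (f_{i-1} - 2 f_i + f_{i+1})^2."""
    n = len(y)
    # Normal equations: (I + lam * D'D) f = y, with D the second-difference matrix.
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n - 2):
        d = {k: 1.0, k + 1: -2.0, k + 2: 1.0}  # row k of D
        for i, di in d.items():
            for j, dj in d.items():
                a[i][j] += lam * di * dj
    return solve(a, y)


def roughness(f):
    """Sum of squared second differences: the discrete bending penalty."""
    return sum((f[i - 1] - 2 * f[i] + f[i + 1]) ** 2 for i in range(1, len(f) - 1))


# Made-up wear measurements at equally spaced sizes, for illustration only.
wear = [2.0, 2.6, 2.3, 3.1, 2.9, 3.6, 3.3, 4.1, 3.9, 4.5]
bendy = smooth(wear, 0.1)    # flexible strip: follows the data closely
stiff = smooth(wear, 100.0)  # stiff strip: close to a straight line

print(f"roughness: data {roughness(wear):.3f}, "
      f"bendy {roughness(bendy):.3f}, stiff {roughness(stiff):.5f}")
```

With `lam = 0` the fit reproduces the data exactly, and as `lam` grows the fit approaches a straight line, so the single number `lam` controls the flexibility.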

SLIDE 22

Smooth surfaces: thin plate splines

◮ For smooth surfaces there are several options.
◮ We can replace the bendy strip with a bendy sheet. . .

[Figure: four panels showing the linear predictor as a surface over x and z.]

SLIDE 23

More smooth surfaces: tensor product splines

◮ Or we can make a surface from a lattice of bendy strips.
◮ The strips should usually have different degrees of flexibility in the two directions.

[Figure: surface f(x,z) over x and z.]

SLIDE 24

Yet more smooth surfaces: soap films

◮ For smoothing within oddly shaped areas, it can help to replace bendy sheets/strips with a soap film.
◮ This avoids smoothing across the area boundary.

[Figure: six panels with axes longitude (58.0 to 60.5) and latitude (44.0 to 46.5).]

SLIDE 25

How flexible should the spline be?

◮ Mathematically, all these ways of describing a surface have the degree of smoothness controlled by just one or two numbers . . .
◮ . . . which must be chosen. How?

[Figure: three panels of y against x, titled "λ too high", "λ about right" and "λ too low".]
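
The slides do not say which criterion was used to pick λ. One standard answer, given here only as a sketch, is ordinary (leave-one-out) cross-validation: choose the λ whose fitted curve best predicts each data point when that point is left out of the fit. The notation is an assumption introduced for illustration.

```latex
V(\lambda) \;=\; \frac{1}{n} \sum_{i=1}^{n} \Bigl( y_i - \hat{f}_{\lambda}^{[-i]}(x_i) \Bigr)^2 ,
\qquad
\hat{\lambda} \;=\; \arg\min_{\lambda} V(\lambda)
```

Here \hat{f}_{\lambda}^{[-i]} is the curve fitted with smoothing parameter λ to all the data except (x_i, y_i). A λ that is too low chases the noise and predicts the omitted points badly. A λ that is too high misses the real shape and also predicts badly.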

SLIDE 26

Cleaning up a brain scan

[Figure: brain image of medFPQ over X and Y.]

◮ Model log FPQ as a smooth surface, represented using a thin plate spline.
◮ Springs attaching the plate to the data have strength dependent on the height of the plate.

SLIDE 27

Smoothed version

[Figure: linear predictor over X and Y.]

SLIDE 28

Is Cairo getting hotter?

[Figure: temperature (F) against time (days).]

◮ A model . . .
◮ The temperature varies smoothly with day of year.
◮ There might be an additional smooth long term trend in temperature.
◮ The small scale day to day fluctuations are probably correlated between one day and the next.
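
Written out, the three bullets above might look like the following sketch. The intercept α and the first-order autoregressive form for the day to day fluctuations are assumptions made here for concreteness. The slides say only that neighbouring days are correlated.

```latex
\begin{aligned}
\text{temp}_t &= \alpha + f_1(\text{day.of.year}_t) + f_2(t) + e_t, \\
e_t &= \phi \, e_{t-1} + \epsilon_t, \qquad \epsilon_t \sim N(0, \sigma^2)
\end{aligned}
```

Here f_1 is the smooth seasonal cycle, f_2 the smooth long term trend, and φ measures how strongly one day's fluctuation carries over to the next. The question "is Cairo getting hotter?" then becomes a question about the shape of f_2.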

SLIDE 29

Yes it is.

[Figure: estimated smooths s(day.of.year,8.52) against day.of.year and s(time,1.35) against time.]

SLIDE 30

Predicting octane rating

[Figure: spectrum, log(1/R) against wavelength (nm) from 1000 to 1600, labelled "octane = 85.3".]

◮ How can we predict the octane rating from the spectrum?

SLIDE 31

Octane prediction model

[Figure: spectrum, log(1/R) against wavelength (nm) from 1000 to 1600, labelled "octane = 85.3".]

◮ Model: octane rating is a constant plus the average value of the red curve multiplied by the spectrum (blue).

◮ Need to estimate the red curve.
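
Once the red curve is known, the prediction itself is simple arithmetic. The sketch below assumes both curves are sampled on the same wavelength grid. The function name and all numbers are hypothetical, chosen only to show the calculation.

```python
def predict_octane(alpha, coef_curve, spectrum):
    """Constant plus the average over wavelengths of coef_curve * spectrum."""
    if len(coef_curve) != len(spectrum):
        raise ValueError("curve and spectrum must share the same wavelength grid")
    products = [f * k for f, k in zip(coef_curve, spectrum)]
    return alpha + sum(products) / len(products)


# Hypothetical values on a four-point wavelength grid, for illustration only.
alpha = 85.0                         # the constant
coef_curve = [2.0, -4.0, 0.0, 6.0]   # the "red curve", one value per wavelength
spectrum = [0.1, 0.2, 0.4, 0.5]      # the "blue curve", log(1/R)

print(predict_octane(alpha, coef_curve, spectrum))
```

Wavelengths where the red curve is near zero contribute nothing to the prediction, so the estimated curve also shows which parts of the spectrum carry information about octane.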

SLIDE 32

Octane prediction fit

[Figure: left, estimated function s(nm,7.9):NIR against nm; right, measured octane against fitted octane.]

SLIDE 33

Diabetic Retinopathy Study

[Figure: scatterplot matrix of ret, bmi, gly and dur.]

◮ Model is that probability of retinopathy is related to a sum of smooth curves depending on bmi, gly and dur plus smooth surfaces depending on bmi & gly, gly & dur . . .
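
As a formula, the model described above might be sketched as follows. The logit link and the intercept α are assumptions. The slides say only that the probability "is related to" the sum of smooth terms, and the third surface, for dur and bmi, is taken from the terms plotted on the results slide.

```latex
\operatorname{logit} P(\text{ret}_i = 1)
  = \alpha + f_1(\text{dur}_i) + f_2(\text{gly}_i) + f_3(\text{bmi}_i)
  + f_4(\text{dur}_i, \text{gly}_i) + f_5(\text{dur}_i, \text{bmi}_i) + f_6(\text{gly}_i, \text{bmi}_i)
```

Each f_j is a smooth curve or surface with its own degree of flexibility, chosen in the same way as for a single spline.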

SLIDE 34

Diabetic Retinopathy Results

[Figure: estimated curves s(dur,3.26), s(gly,1) and s(bmi,2.67), and estimated surfaces te(dur,gly,0), te(dur,bmi,0) and te(gly,bmi,2.5).]

SLIDE 35

Diabetic Retinopathy Results II

[Figure: several views of the linear predictor as a surface over bmi and gly; some panels are annotated "red/green are +/− TRUE s.e."]

SLIDE 36

cran.r-project.org

SLIDE 37

Picture Credits

◮ Gladstone and Disraeli are from the House of Commons web site.
◮ The 1921 Eugenics conference logo is from en.wikipedia.org/wiki/File:Eugenics congress logo.png
◮ The Gates of Auschwitz are from oncampus.richmond.edu/academics/education/projects/webquests/holocaust/images/arbeit macht frei.jpg
◮ Hogarth’s South Sea Bubble can be found at www.library.hbs.edu/hc/ssb/images/using-top.jpg, but I’ve lost where I found the one shown.
◮ The absorption spectrum figure is from www.te-software.co.nz/blog/augie auer.htm
◮ Reproductions of Picasso’s Les Demoiselles d’Avignon are available from many sites. The one shown is possibly from www.enjoyart.com/library/featured artists/pablopicasso/large/Bmcgaw-P591.jpg
◮ The cover of Sperm Wars was taken from www.amazon.co.uk.

SLIDE 38

Data Credits

◮ The Global CO2 and temperature data are from www.cru.uea.ac.uk/cru/data/temperature/ and the Scripps Institute CO2 research group.
◮ The Aral Sea CO2 data are from the SeaWifs satellite.
◮ For full credits for the Cairo and Brain Scan data, see R package gamair.
◮ The octane data are available in R package pls.
◮ The Retinopathy data are available in R package gss.