Data preparation & presentation, Gary Collins, EQUATOR Network

Data preparation & presentation. Gary Collins, EQUATOR Network, Centre for Statistics in Medicine, NDORMS, University of Oxford. EQUATOR Network OUCAGS training course, 24 October 2015.


SLIDE 1

Data preparation & presentation

Gary Collins

EQUATOR Network, Centre for Statistics in Medicine, NDORMS, University of Oxford

EQUATOR Network – OUCAGS training course, 24 October 2015

SLIDE 2

Data preparation

  • Prior to ANY data analysis it is important to check your data

– often more time is spent ‘cleaning’ and examining the data than on the actual analysis
– reliable results rely on accurate and reliable data

  • garbage in, garbage out
  • Data accuracy can be improved by thinking prospectively about the data before collecting it

– produce a data collection sheet (also for systematic reviews)
– set up a database
– if possible, include basic checks in the database

  • often done for clinical trials

SLIDE 3

Prospective data collection

SLIDE 4

Data Definitions (CRASH dataset)

Variable   | Label                           | Comments                  | Max length | Type   | Codes
Patient ID | Six digit unique identifier     | treatment box-pack number | 7          | string |
Sex        | Gender of the patient           |                           | 1          | Number | 0=Male, 1=Female
DOB        | Date of birth of the patient    | DD/MM/YYYY                | 10         | Date   | 7303=DOB not known
TRAND      | Time of randomisation           | HH:MM:SS                  | 8          | Time   |
GCS_EYE    | Glasgow Coma Scale: Eye opening |                           | 1          | Number | 4=spontaneous, 3=to sound, 2=to pain, 1=none
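
Data definitions like these can also be written down in code, so that every record is checked against them automatically. The following is a minimal sketch, not the CRASH trial's actual tooling: the field names and codes come from the table above, while `DATA_DICTIONARY`, `validate_record` and the rule layout are hypothetical names chosen for illustration.

```python
from datetime import datetime

# Hypothetical encoding of part of the data definitions table.
# The rule layout (max_length / codes / date_format) is an assumption.
DATA_DICTIONARY = {
    "Sex": {"max_length": 1, "codes": {"0": "Male", "1": "Female"}},
    "DOB": {"max_length": 10, "date_format": "%d/%m/%Y"},
    "GCS_EYE": {
        "max_length": 1,
        "codes": {"4": "spontaneous", "3": "to sound", "2": "to pain", "1": "none"},
    },
}


def validate_record(record, dictionary=DATA_DICTIONARY):
    """Return a list of problems found in one record (a dict of strings)."""
    problems = []
    for field, rule in dictionary.items():
        value = record.get(field, "")
        if value == "":
            problems.append(f"{field}: missing")
            continue
        if len(value) > rule["max_length"]:
            problems.append(f"{field}: longer than {rule['max_length']} characters")
        if "codes" in rule and value not in rule["codes"]:
            problems.append(f"{field}: code {value!r} is not one of {sorted(rule['codes'])}")
        if "date_format" in rule:
            try:
                datetime.strptime(value, rule["date_format"])
            except ValueError:
                problems.append(f"{field}: not a valid DD/MM/YYYY date")
    return problems


# Made-up records: one clean, one with three problems.
clean = {"Sex": "1", "DOB": "24/10/1975", "GCS_EYE": "4"}
dirty = {"Sex": "2", "DOB": "31/02/1980", "GCS_EYE": ""}
clean_problems = validate_record(clean)
dirty_problems = validate_record(dirty)
```

Returning a list of problems, rather than stopping at the first one, means a whole data collection sheet can be queried in one pass.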

SLIDE 5

Data preparation (after collection)

  • For data types

– discrete (e.g. number of children): check for non-integer values
– continuous (e.g. height): check that values lie in a plausible range
– binary (e.g. male/female)
– nominal (e.g. blood type)
– ordinal (e.g. cancer stage)
– dates: check for valid dates and impossible sequences (e.g. date of event before date of birth)
– text: most statistical software treats upper and lower case as different values by default (e.g. yes, Yes, yES, yeS, YeS, YES), so use one spelling throughout

  • Missing data: consistency in how this is recorded

– how much is missing

Check for unlikely values
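
The checks listed on this slide can be sketched as a handful of small functions. This is one possible arrangement, not a prescribed method: the function names, the set of missing-value codes and the example height limits are all assumptions made for illustration.

```python
from datetime import date


def is_whole_number(value):
    """Discrete counts (e.g. number of children) should be non-negative integers."""
    return float(value).is_integer() and float(value) >= 0


def in_plausible_range(value, low, high):
    """Continuous values (e.g. height) should lie between agreed limits."""
    return low <= value <= high


def normalise_text(value):
    """Collapse case and stray spaces so 'Yes', 'yES ' and 'YES' all match."""
    return value.strip().lower()


def dates_in_order(date_of_birth, date_of_event):
    """An event cannot happen before the date of birth."""
    return date_of_birth <= date_of_event


def proportion_missing(values, missing_codes=(None, "", "NA")):
    """Share of values recorded with one of the agreed missing-value codes."""
    missing = sum(1 for v in values if v in missing_codes)
    return missing / len(values)


# Made-up examples; the height limits of 100 to 220 cm are an assumed range.
answers = {normalise_text(v) for v in ["yes", "Yes", "yES ", "YES"]}
height_ok = in_plausible_range(172.0, 100, 220)
height_bad = in_plausible_range(1720.0, 100, 220)
order_bad = dates_in_order(date(1980, 5, 1), date(1979, 5, 1))
share_missing = proportion_missing(["yes", "", None, "no"])
```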

SLIDE 6

Van den Broeck et al, PLoS Med 2005

SLIDE 7

Implausible values

Cancer     | Male | Female | Total
Breast     |      | 45     | 45
Colorectal | 23   | 17     | 40
Lung       | 8    | 7      | 15
Prostate   | 12   | 1      | 13
Total      | 43   | 70     | 113
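
A cross-tabulation like this one makes an implausible cell easy to spot: here, one prostate cancer case recorded as female. A minimal sketch of automating that check follows. The counts are taken from the table; the `IMPLAUSIBLE` set and the function name are assumptions for illustration, and a flagged cell should be queried against the source record rather than silently changed.

```python
# Counts from the table above, keyed by (cancer, sex).
counts = {
    ("Breast", "Female"): 45,
    ("Colorectal", "Male"): 23,
    ("Colorectal", "Female"): 17,
    ("Lung", "Male"): 8,
    ("Lung", "Female"): 7,
    ("Prostate", "Male"): 12,
    ("Prostate", "Female"): 1,
}

# Assumed list of combinations that should not occur and must be queried.
IMPLAUSIBLE = {("Prostate", "Female")}


def flag_implausible(table, implausible=IMPLAUSIBLE):
    """Return the non-empty cells that belong to an implausible combination."""
    return {cell: n for cell, n in table.items() if cell in implausible and n > 0}


flagged = flag_implausible(counts)
grand_total = sum(counts.values())
```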

SLIDE 8

Outliers

  • Outliers will seem incompatible with the rest of the data
  • May deviate from the main body of the data
  • Usually extreme values (high or low)
  • Can be genuine observations
  • Can have considerable influence on the results
  • Any suspicious values should be checked
  • The decision to include or exclude outliers should be made with caution
  • Provided all the measurements are valid, it is often useful to analyse with and without the outlier
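
Analysing with and without the outlier can be sketched as below. The slides do not name a rule for spotting outliers; this example uses the common 1.5 × IQR fences as an assumption, and the length-of-stay values are made up for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR outside the quartiles (an assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical lengths of hospital stay in days; 61 is extreme but could be genuine.
stay_days = [2, 3, 3, 4, 4, 5, 5, 6, 7, 61]

suspicious = iqr_outliers(stay_days)
without = [v for v in stay_days if v not in suspicious]

# Report the same summaries both ways rather than quietly dropping the value.
mean_with, mean_without = statistics.mean(stay_days), statistics.mean(without)
median_with, median_without = statistics.median(stay_days), statistics.median(without)
```

In this made-up example the mean moves a lot while the median barely changes, which is the kind of influence the slide warns about.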

SLIDE 9

Data presentation (exploration/presentation)

  • Plot the data, plot the data, plot the data !!!

– prior to any statistical analysis, look at the data!
– get familiar with the data: what does it look like?
– can highlight/indicate any errors (spurious values) in the data (part of data cleaning)
– what is the distribution of the data (univariate)?

  • do they have an approximate Normal distribution?
  • if not, can they be transformed (e.g. log transformation)?
  • Plot the data, plot the data, plot the data !!!
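
A plot is the first tool for judging the shape of a distribution, but a quick numerical companion is to compare skewness before and after a log transformation. The sketch below uses the moment-based skewness coefficient and made-up, positively skewed values; both choices are assumptions for illustration and not part of the slides.

```python
import math
import statistics


def skewness(values):
    """Moment-based skewness: mean cubed deviation divided by the cubed population SD."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


# Hypothetical positively skewed measurements (all strictly positive, so logs exist).
raw = [1, 2, 2, 3, 3, 4, 5, 8, 16, 32]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)
```

A value near zero suggests a roughly symmetric distribution. Here the log-transformed values are much less skewed than the raw ones, though still not perfectly symmetric.
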

SLIDE 10

Unfortunately, not all data look like this

SLIDE 11

Positively skewed data

Bland & Altman. BMJ 1996

SLIDE 12

CRASH-2 trial

(distribution of baseline Glasgow Coma Score)

SLIDE 13

SLIDE 14

SLIDE 15

Results: Graphs

  • Statisticians like to (ideally) see the raw data!
  • Use graphs to describe results, for example

– Dotplots (or boxplots for large data sets)

  • good for comparing groups

– Scatter plots

  • good to accompany correlation analyses

– Survival curves

  • time-to-event analyses

– Forest plots

  • meta-analyses
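
To show what a dotplot of raw values conveys when comparing groups, here is a minimal text-only sketch. It is an illustration rather than a replacement for a proper graphics package; the function name and the two groups of values are hypothetical.

```python
from collections import Counter


def text_dotplot(groups, bin_width=1):
    """Draw one 'o' per observation, stacked at the nearest multiple of bin_width."""
    lines = []
    for name, values in groups.items():
        lines.append(name)
        bins = Counter(bin_width * round(v / bin_width) for v in values)
        for level in sorted(bins):
            lines.append(f"{level:>8g} | " + "o" * bins[level])
    return "\n".join(lines)


# Made-up measurements for two groups.
groups = {
    "Control": [4, 5, 5, 6, 6, 6, 7],
    "Treatment": [5, 7, 7, 8, 8, 9, 12],
}
plot = text_dotplot(groups)
```

Printing `plot` shows every observation in both groups, so the spread and the isolated value of 12 stay visible instead of being hidden behind a single summary bar.
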

SLIDE 16

What can you tell me about these data?

For each dataset would you believe that the:

– Mean of x is 9
– Variance of x is 11
– Mean of y is 7.5
– Variance of y is 4.122
– Correlation between x and y is 0.816
– Regression line: y = 3 + 0.5x

Called ‘Anscombe’s Quartet’

Illustrates the importance of showing your data
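
The claim can be checked directly. The sketch below computes the same summaries for the four datasets. The data values are the standard published ones for Anscombe's quartet and are not reproduced on the slide; the function and variable names are illustrative.

```python
import statistics

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(x, y):
    """Means, sample variances, Pearson correlation and least-squares line of y on x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    var_x, var_y = statistics.variance(x), statistics.variance(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)
    slope = cov / var_x
    return {
        "mean_x": mean_x, "var_x": var_x,
        "mean_y": mean_y, "var_y": var_y,
        "r": cov / (var_x * var_y) ** 0.5,
        "slope": slope, "intercept": mean_y - slope * mean_x,
    }


summaries = {name: summarise(x, y) for name, (x, y) in ANSCOMBE.items()}
```

All four summaries agree to about two decimal places, yet plots of the four datasets look completely different, which is why the summaries alone are not enough.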

SLIDE 17

Data presentation (exploration/presentation)

  • Where possible, plot the raw values

– there is NO reason to hide them
– let the data tell the story

  • Beware of ‘chartjunk’ (Edward Tufte)

– anything that distracts the viewer (including you) from the information that the graph is intended to present
– let the data speak for themselves, don’t clutter the plot

SLIDE 18

Dynamite plots

SLIDE 19

Dynamite plots

SLIDE 20

Dynamite plots

Mean=81.905 mm, Mean=77.445 mm

Schriger & Cooper. Ann Emerg Med 2001

SLIDE 21

The same data presented differently

Schriger & Cooper. Ann Emerg Med 2001

SLIDE 22

SLIDE 23

Summary

  • Think upfront about data collection

– How are the data coded?
– Be consistent throughout
– Have data checks to flag implausible values

  • Plot the data, plot the data, plot the data!

– Get to know your data
– Check for outliers

  • Avoid ‘chartjunk’

– Maximise data:ink ratio