SLIDE 1

Data preparation & presentation

Gary Collins

EQUATOR Network, Centre for Statistics in Medicine, NDORMS, University of Oxford
EQUATOR Network – OUCAGS training course, 24 October 2015

SLIDE 2

Data preparation

Prior to ANY data analysis it is important to check your data

– often more time is spent ‘cleaning’ and examining the data than on the actual analysis
– reliable results rely on accurate and reliable data

garbage in → garbage out

Data accuracy can be improved by prospectively thinking about the data before collecting it

– produce a data collection sheet (also for systematic reviews)
– set up a database
– if possible, include basic checks in the database

often done for clinical trials

SLIDE 3

Prospective data collection

SLIDE 4

Data Definitions (CRASH dataset)

Variable   | Label                           | Comments                  | Max length | Type   | Codes
Patient ID | Six digit unique identifier     | treatment box-pack number | 7          | string |
Sex        | Gender of the patient           |                           | 1          | Number | 0=Male 1=Female
DOB        | Date of birth of the patient    | DD/MM/YYYY                | 10         | Date   | 7303=DOB not known
TRAND      | Time of randomisation           | HH:MM:SSS                 | 8          | Time   |
GCS_EYE    | Glasgow Coma Scale: Eye opening |                           | 1          | Number | 4=spontaneous 3=to sound 2=to pain 1=none
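A data definition like this can be turned into automatic checks at data entry. The following is only a minimal sketch of that idea: the lower-case field names, the `DATA_DICTIONARY` structure and the `validate_record` helper are hypothetical, and TRAND and the special DOB code 7303 are left out for brevity.

```python
from datetime import datetime

# Part of the data definitions above, written as a lookup table.
# Field names and structure are assumptions made for this sketch.
DATA_DICTIONARY = {
    "patient_id": {"max_length": 7, "type": "string"},
    "sex": {"max_length": 1, "type": "number", "codes": {0: "Male", 1: "Female"}},
    "dob": {"max_length": 10, "type": "date", "format": "%d/%m/%Y"},
    "gcs_eye": {"max_length": 1, "type": "number",
                "codes": {4: "spontaneous", 3: "to sound", 2: "to pain", 1: "none"}},
}


def validate_record(record):
    """Return a list of problems found in one raw record (all values are strings)."""
    problems = []
    for name, rule in DATA_DICTIONARY.items():
        raw = record.get(name)
        if raw is None or raw == "":
            problems.append(f"{name}: missing")
            continue
        if len(raw) > rule["max_length"]:
            problems.append(f"{name}: longer than {rule['max_length']} characters")
        if rule["type"] == "number":
            # Coded variables must be one of the listed codes.
            if not raw.isdigit() or int(raw) not in rule["codes"]:
                problems.append(f"{name}: {raw!r} is not an allowed code")
        elif rule["type"] == "date":
            try:
                datetime.strptime(raw, rule["format"])
            except ValueError:
                problems.append(f"{name}: {raw!r} is not a valid DD/MM/YYYY date")
    return problems


# Hypothetical records: one clean, one with three problems.
good = {"patient_id": "012345", "sex": "1", "dob": "24/10/1975", "gcs_eye": "4"}
bad = {"patient_id": "012345", "sex": "2", "dob": "31/02/1975", "gcs_eye": ""}
print(validate_record(good))
print(validate_record(bad))
```

Returning a list of problems, rather than stopping at the first one, lets every issue in a record be queried in one pass.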

SLIDE 5

Data preparation (after collection)

For data types

– discrete (e.g. number of children): check for non-integer values
– continuous (e.g. height): check values lie in a plausible range
– binary (e.g. male/female)
– nominal (e.g. blood type)
– ordinal (e.g. cancer stage)
– dates: check for valid dates and impossible sequences (e.g. date of event before date of birth)
– text values: most statistical software by default treats upper and lower case as different (e.g. yes, Yes, yES, yeS, YeS and YES become separate categories)

Missing data: consistency in how this is recorded

– how much is missing

Check for unlikely values
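The checks listed on this slide can be sketched in a few lines of Python. Everything here is illustrative: the records, the field names, the plausible height range and the `check` helper are assumptions, not part of the course material.

```python
from collections import Counter
from datetime import date

# Hypothetical raw records; None marks a missing value.
records = [
    {"children": 2, "height_cm": 172.0, "consent": "yes",
     "born": date(1960, 5, 1), "event": date(2015, 3, 2)},
    {"children": 1.5, "height_cm": 17.2, "consent": "Yes",
     "born": date(1982, 1, 9), "event": date(1981, 7, 4)},
    {"children": 0, "height_cm": None, "consent": "YES",
     "born": date(1975, 8, 30), "event": date(2014, 11, 12)},
]

HEIGHT_RANGE_CM = (100.0, 220.0)  # assumed plausible range for adults


def check(record):
    """Return the issues found in one record."""
    issues = []
    n = record["children"]
    if n is not None and n != int(n):
        issues.append("children: non-integer")
    h = record["height_cm"]
    if h is not None and not (HEIGHT_RANGE_CM[0] <= h <= HEIGHT_RANGE_CM[1]):
        issues.append("height_cm: outside plausible range")
    if record["born"] and record["event"] and record["event"] < record["born"]:
        issues.append("event: before date of birth")
    return issues


issues_by_row = [check(r) for r in records]

# Upper/lower case: three spellings collapse to one category after cleaning.
consent_raw = Counter(r["consent"] for r in records)
consent_clean = Counter(r["consent"].strip().lower() for r in records)

# How much is missing, per variable.
missing = {name: sum(r[name] is None for r in records) for name in records[0]}

print(issues_by_row)
print(consent_raw, consent_clean)
print(missing)
```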

SLIDE 6

Van den Broeck et al, PLoS Med 2005

SLIDE 7

Implausible values

Cancer     | Male | Female | Total
Breast     |      | 45     | 45
Colorectal | 23   | 17     | 40
Lung       | 8    | 7      | 15
Prostate   | 12   | 1      | 13
Total      | 43   | 70     | 113
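A cross-tabulation like the one above can be screened for impossible combinations automatically. This sketch uses the cell counts from the table; expanding them into one row per patient, and the `IMPOSSIBLE` rule set, are assumptions made for illustration.

```python
from collections import Counter

# Cell counts from the table above: (cancer, sex) -> number of patients.
counts = {
    ("Breast", "Female"): 45,
    ("Colorectal", "Male"): 23, ("Colorectal", "Female"): 17,
    ("Lung", "Male"): 8, ("Lung", "Female"): 7,
    ("Prostate", "Male"): 12, ("Prostate", "Female"): 1,
}

# One (cancer, sex) pair per patient, standing in for patient-level data.
patients = [cell for cell, n in counts.items() for _ in range(n)]

# Combinations treated as impossible. Male breast cancer is rare but real,
# so it is deliberately not listed here.
IMPOSSIBLE = {("Prostate", "Female")}

crosstab = Counter(patients)
flagged = [(row, p) for row, p in enumerate(patients) if p in IMPOSSIBLE]

print(len(patients))
print(flagged)
```

Flagged rows are candidates for checking against the source records, not for automatic deletion: the error may be in either variable.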

SLIDE 8

Outliers

Outliers will seem incompatible with the rest of the data
May deviate from the main body of the data
Usually extreme values (high or low)
Can be genuine observations
Can have considerable influence on the results
Any suspicious values should be checked
The decision to include or exclude outliers should be made with caution

Provided all the measurements are valid, often useful to analyse with and without the outlier
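Analysing with and without a suspicious value can be as simple as the sketch below. The measurements are made up, and the 1.5 × IQR rule used to flag values is one common convention chosen here for illustration; the slides do not prescribe a rule.

```python
from statistics import mean, median, quantiles

# Hypothetical measurements with one extreme value.
values = [4.1, 4.4, 4.6, 4.8, 5.0, 5.1, 5.3, 5.5, 5.8, 19.7]

# Flag values more than 1.5 interquartile ranges outside the quartiles.
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

suspicious = [v for v in values if v < low or v > high]
kept = [v for v in values if low <= v <= high]

# Report the analysis both ways rather than silently dropping anything.
print("suspicious:", suspicious)
print("mean   with / without:", mean(values), mean(kept))
print("median with / without:", median(values), median(kept))
```

In this example the mean shifts noticeably when the extreme value is removed while the median barely moves, which is the kind of influence the slide warns about.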

SLIDE 9

Data presentation (exploration/presentation)

Plot the data, plot the data, plot the data !!!

– prior to any statistical analysis, look at the data!
– get familiar with the data: what does it look like?
– can highlight/indicate any errors (spurious values) in the data (part of data cleaning)
– what is the distribution of the data (univariate)?

do they have an approximate Normal distribution?
if not, can they be transformed (e.g. log transformation)?
Plot the data, plot the data, plot the data !!!
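A plot is the best check of distribution, but a quick numerical companion is to compare skewness before and after a log transformation. The data and the `skewness` helper below are hypothetical; the helper computes the usual moment-based coefficient (mean cubed standardised deviation).

```python
import math
from statistics import mean, pstdev


def skewness(xs):
    """Moment-based skewness: about 0 for symmetric data, > 0 for a long right tail."""
    m, s = mean(xs), pstdev(xs)
    return mean(((x - m) / s) ** 3 for x in xs)


# Hypothetical positively skewed measurements (all > 0, as the log requires).
raw = [1.2, 1.5, 1.9, 2.3, 2.8, 3.6, 4.9, 7.4, 12.0, 25.0]
logged = [math.log(x) for x in raw]

print("skewness raw:   ", round(skewness(raw), 2))
print("skewness logged:", round(skewness(logged), 2))
```

The log transformation is only defined for positive values, so zeros or negatives need to be handled before applying it.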

SLIDE 10

Unfortunately, not all data look like this

SLIDE 11

Positively skewed data

Bland & Altman. BMJ 1996

SLIDE 12

CRASH-2 trial

(distribution of baseline Glasgow Coma Score)

SLIDE 13

SLIDE 14

SLIDE 15

Results: Graphs

Statisticians like to (ideally) see the raw data!
Use graphs to describe results, for example

– Dotplots (or boxplots for large data sets): good for comparing groups
– Scatter plots: good to accompany correlation analyses
– Survival curves: time-to-event analyses
– Forest plots: meta-analyses

SLIDE 16

What can you tell me about these data?

For each dataset would you believe that the:

– Mean of x is 9
– Variance of x is 11
– Mean of y is 7.5
– Variance of y is 4.122
– Correlation between x and y is 0.816
– Regression line: y = 3 + 0.5x

Called ‘Anscombe’s Quartet’
Illustrates the importance of showing your data
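The summary statistics quoted above can be reproduced directly. The slides do not list the data, so the values below are the standard published values of Anscombe's quartet; the `summarise` helper is a hypothetical name for this sketch.

```python
from statistics import mean, variance

# Standard published values of Anscombe's quartet (not listed on the slides).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(x, y):
    """Means, sample variances, Pearson correlation and least-squares line."""
    mx, my = mean(x), mean(y)
    vx, vy = variance(x), variance(y)
    # Sample covariance between x and y.
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    slope = sxy / vx
    return {
        "mean_x": mx, "var_x": vx, "mean_y": my, "var_y": vy,
        "corr": sxy / (vx * vy) ** 0.5,
        "intercept": my - slope * mx, "slope": slope,
    }


summaries = {name: summarise(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    print(name, {k: round(v, 3) for k, v in s.items()})
```

The four summaries agree to about two decimal places, yet scatter plots of the four datasets look completely different, which is why the numbers alone are not enough.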

SLIDE 17

Data presentation (exploration/presentation)

Where possible, plot the raw values

– there is NO reason to hide it
– let the data tell the story

Beware of ‘chartjunk’ (Edward Tufte)

– anything that distracts the viewer (including you) from the information that the graph is intended to present
– let the data speak for themselves, don’t clutter the plot

SLIDE 18

Dynamite plots

SLIDE 19

Dynamite plots

SLIDE 20

Dynamite plots

Schriger & Cooper. Ann Emerg Med 2001
Mean=81.905 mm
Mean=77.445 mm

SLIDE 21

The same data presented differently

Schriger & Cooper. Ann Emerg Med 2001

SLIDE 22

SLIDE 23

Summary

Think upfront about data collection

– How are the data coded?
– Be consistent throughout
– Have data checks to flag implausible values

Plot the data, plot the data, plot the data!

– Get to know your data
– Check for outliers

Avoid ‘chartjunk’

– Maximise the data-ink ratio