
11/4/2016 1

S L I D E 0

Statistics in medicine

Lecture 1- part 1: Describing variation, and graphical presentation

Fatma Shebl, MD, MS, MPH, PhD Assistant Professor Chronic Disease Epidemiology Department Yale School of Public Health Fatma.shebl@yale.edu

S L I D E 1

Outline

  • Sources of variation
  • Types of variables

S L I D E 2

Readings and resources

  • Chapter 9, pp. 105-118: Jekel's epidemiology, biostatistics, preventive medicine, and public health by David L. Katz et al. (4th edition).

S L I D E 3

  • Almost every characteristic that is measured on a patient varies
  • That is why it is called a variable
  • Examples

– Blood glucose level
– Blood pressure
– Diet
– Electrolytes
– etc.

S L I D E 4

There are different sources of variation

Let us consider blood pressure as an example

  • Biologic differences

– Age, race, and diet affect blood pressure

  • Older patients, patients of African descent, and those who consume a high-salt diet tend to have higher blood pressure

  • Measurement conditions

– Time of day, anxiety, fatigue, etc.

  • Higher blood pressure is observed following exercise and with anxiety

S L I D E 5

There are different sources of variation

Let us consider blood pressure as an example

  • Measurement error

– Systematic error

  • Distorts the data in one direction, leading to bias
  • Obscures the truth
  • Ex. A defective BP cuff that tends to give high readings

– Random error

  • Slight, inevitable inaccuracies
  • Not systematic, because it makes some readings too high and some too low

Statistics can adjust for random error, but cannot fix systematic error

S L I D E 6

To understand variation, you have to describe it

  • Descriptive statistics definition:

– Statistics, such as the mean, the standard deviation, the proportion, and the rate, used to describe attributes of a set of data

S L I D E 7

A variable can be quantitative or qualitative

  • Qualitative

– Skin color
– Jaundice
– Heart murmurs

  • Quantitative

– Blood pressure
– Electrolyte levels

http://clinicalgate.com/wp-content/uploads/2015/06/B9781437729306000483_f48-02- 97 81437729306.jpg


S L I D E 8

There are different types of variables

– Nominal
– Dichotomous (binary)
– Ordinal (ranked)
– Continuous (interval)
– Continuous (ratio)
– Risks and proportions
– Counts and units of observation
– Combining data

S L I D E 9

Nominal variables (qualitative)

  • Nominal variables are “naming” variables
  • Definition:

– The simplest scale of measurement. Used for characteristics that have no numerical values, no measurement scales, and no rank order. It is also called a categorical or qualitative scale.

  • Ex. Skin color

– A different number can be assigned to each color

  • E.g. 1: purple, 2: black, 3: white, 4: blue, 5: tan

– It makes no difference to the statistical analysis which number is assigned to which color, because the number is merely a numerical name for a color

  • Percentages and proportions are commonly used to summarize the data

S L I D E 10

Dichotomous variables (qualitative)

  • Dichotomous: from the Greek for “cut into two”
  • Ex.: Normal/abnormal skin color, living/dead
  • Sometimes it is not enough to describe the data as two categories (living/dead); it is also important to know how long the patient survived → survival analysis

S L I D E 11

Ordinal “ranked” variables

  • Definition:

– Used for characteristics that have an underlying order to their values, i.e., a clearly implied direction from better to worse.

  • Are categorical (qualitative) scales
  • Three or more levels
  • Although the categories are ordered, the difference between two adjacent categories is not the same throughout the scale


S L I D E 12

Ordinal “ranked” variables

  • Percentages and proportions are commonly used to summarize the data
  • Medians are sometimes used to describe the whole data
  • Ex. Pain scale: “0- no pain” - “10- worst imaginable pain”
  • Ex. Pitting edema grading scale: “0- no edema” - “4+- severe edema”

Image sources: https://openclipart.org/detail/218053/pain-scale, http://biology-forums.com/gallery/2137_18_05_12_2_25_00.jpeg

S L I D E 13

Numerical scales (quantitative)

  • Definition:

– The highest level of measurement. It is used for characteristics that can be given numerical values; the differences between numbers have meaning. Ex. BMI, height.

  • Types

– Interval
– Ratio
– Discrete

  • Measures of central tendency are usually used to summarize the data: means, medians

S L I D E 14

Numerical scales (continuous)

  • Has a value on a continuum
  • Interval: arbitrary zero point
  • Ex. Centigrade temperature scale
  • Ratio: absolute zero point
  • Ex. Kelvin temperature scale

Image source: http://www.livescience.com/39994-kelvin.html

S L I D E 15

Numerical scales (Discrete)

  • Has values equal to integers
  • Units of observation: person, animal, thing, etc.
  • Presented in frequency tables
  • One characteristic on the x-axis, one characteristic on the y-axis, and counts in the cells

Frequency table of gender by whether serum total cholesterol was checked or not

          Cholesterol level
Gender    Checked     Not checked   Total
Female    17 (63%)    10 (37%)      27 (100%)
Male      25 (57%)    19 (43%)      44 (100%)
Total     42 (59%)    29 (41%)      71 (100%)

Source: Jekel's epidemiology, biostatistics, preventive medicine, and public health by David L. Katz et al (4th edition).
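
The row percentages in a frequency table like this one can be recomputed from the counts. The following is a minimal sketch in Python; the dictionary layout and the function name `row_percentages` are illustrative choices, not part of the lecture.

```python
# Counts taken from the frequency table of gender by cholesterol check.
counts = {
    "Female": {"Checked": 17, "Not checked": 10},
    "Male": {"Checked": 25, "Not checked": 19},
}


def row_percentages(table):
    """For each row, express every cell as a whole-number percentage of the row total."""
    result = {}
    for row, cells in table.items():
        row_total = sum(cells.values())
        result[row] = {col: round(100 * n / row_total) for col, n in cells.items()}
    return result


percentages = row_percentages(counts)
for row, cells in percentages.items():
    print(row, cells)
```

Each percentage uses the row total as the denominator, which is why every row of the table sums to 100%.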


S L I D E 16

Risks and proportions

  • Risk is the conditional probability of an event (e.g., death) in a defined population in a defined period
  • Shares some characteristics of discrete variables and some characteristics of continuous variables
  • Ex. A discrete event (e.g., death) occurred in a fraction of the population
  • Calculated as the ratio of counts in the numerator to counts in the denominator

S L I D E 17

Combining data

  • A continuous variable can be converted to an ordinal variable
  • When data are converted to categories, individual information is lost
  • The fewer the categories, the greater the amount of information lost

[Figure: Histogram of neonatal mortality rate per 1000 live births, by birth weight group (g), United States, 1980. Source: Buehler W et al. Public Health Rep 102:151-161, 1987]

S L I D E 18

Statistics in medicine

Lecture 1- part 2: Describing variation, and graphical presentation

Fatma Shebl, MD, MS, MPH, PhD Assistant Professor Chronic Disease Epidemiology Department Yale School of Public Health Fatma.shebl@yale.edu

S L I D E 19

Outline

  • Frequency distributions

– Frequency distribution of continuous data
– Frequency distribution of binary data


S L I D E 22

Frequency distribution is

  • A TABLE of data displaying the VALUE of each data point (or range of data points) in one column and the FREQUENCY with which that value occurs in the other column
  • A PLOT of data displaying the VALUE of each data point (or range of data points) on one axis and the FREQUENCY with which that value occurs on the other axis

S L I D E 23

Frequency tables

  • Definition

– A table showing the number and/or the percentage of observations occurring at different values (or ranges of values) of a variable

  • Steps of creating a frequency table

– Decide on the number of non-overlapping intervals

  • It is better to have equal-width intervals
  • Usually 6 to 14 intervals are adequate to demonstrate the shape of the distribution
  • Creating intervals means that a continuous variable is converted to an ordinal variable, so information at the individual level is lost

– Count the number of observations in each interval

  • Percentages can be calculated as well: percentage = (number of observations in the interval / total number of observations) × 100

  • Presented graphically by a histogram

S L I D E 24

Frequency tables

Categories of glucose level of 180 participants

Category   Count   %
<=70       14       7.78
71-100     104     57.78
101-125    26      14.44
>=126      36      20.00

Glucose level of 180 participants

Glucose level  Count    Glucose level  Count    Glucose level  Count
52             1        88             2        140            4
66             1        89             5        143            1
67             1        90             8        145            5
68             2        92             3        149            2
69             2        95             11       150            4
70             7        96             1        155            2
71             1        98             1        158            1
72             2        100            12       160            1
73             1        103            4        165            4
75             12       108            1        168            1
76             2        110            11       170            1
77             4        115            1        172            1
78             6        120            6        220            1
79             4        121            1
80             11       122            1
82             2        124            1
83             2        130            3
85             9        133            1
86             4        135            3
87             1        139            1
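
The category table can be rebuilt from the individual glucose values by counting the observations that fall in each non-overlapping interval. Below is a small sketch in Python; the variable and function names are illustrative, and the counts are transcribed from the table of 180 participants.

```python
# Glucose level -> number of participants, transcribed from the table of 180 participants.
glucose_counts = {
    52: 1, 66: 1, 67: 1, 68: 2, 69: 2, 70: 7, 71: 1, 72: 2, 73: 1, 75: 12,
    76: 2, 77: 4, 78: 6, 79: 4, 80: 11, 82: 2, 83: 2, 85: 9, 86: 4, 87: 1,
    88: 2, 89: 5, 90: 8, 92: 3, 95: 11, 96: 1, 98: 1, 100: 12, 103: 4, 108: 1,
    110: 11, 115: 1, 120: 6, 121: 1, 122: 1, 124: 1, 130: 3, 133: 1, 135: 3,
    139: 1, 140: 4, 143: 1, 145: 5, 149: 2, 150: 4, 155: 2, 158: 1, 160: 1,
    165: 4, 168: 1, 170: 1, 172: 1, 220: 1,
}

# Non-overlapping intervals as (label, lower bound, upper bound), both bounds inclusive.
intervals = [
    ("<=70", 0, 70),
    ("71-100", 71, 100),
    ("101-125", 101, 125),
    (">=126", 126, float("inf")),
]


def frequency_table(value_counts, bins):
    """Return (label, count, percentage) for each interval."""
    total = sum(value_counts.values())
    rows = []
    for label, low, high in bins:
        n = sum(c for value, c in value_counts.items() if low <= value <= high)
        rows.append((label, n, round(100 * n / total, 2)))
    return rows


table = frequency_table(glucose_counts, intervals)
for label, n, pct in table:
    print(f"{label:<8} {n:>4} {pct:>6.2f}")
```

Note that these four intervals are not of equal width; they follow the categories shown on the slide rather than the equal-width recommendation.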

S L I D E 25

There are REAL and THEORETICAL frequency distributions

  • Real

– Obtained from the actual data

  • Theoretical

– Calculated using certain assumptions
– The most commonly used is the “NORMAL (GAUSSIAN) DISTRIBUTION”

  • Most statistical methods assume that the data are normally distributed
  • Real data are seldom perfectly normally distributed
  • Based on the central limit theorem, if the sample size is large, the sampling distribution of the mean is approximately normal, so the normality assumption usually holds even if the data are skewed

S L I D E 26

Normal (Gaussian) distribution

  • Continuous distribution
  • Used if the population standard deviation (σ) is known
  • A symmetric, bell-shaped probability distribution with a shape that is determined by the mean (µ) and standard deviation (σ)

[Figures: normal curves with different µ and the same σ; normal curves with the same µ and different σ]

S L I D E 27

Normal (Gaussian) distribution

  • Properties:

– Bell shape
– Depends on mean (µ) and standard deviation (σ)
– Symmetric about the mean (µ)
– Mean = median = mode


S L I D E 28

Normal (Gaussian) distribution

  • The area under the curve is the probability (relative frequency) of the values comprising the normal distribution

– The area under the whole curve = 1

  • 68% within µ ± 1σ
  • 95% within µ ± 2σ (more precisely, ± 1.96σ)
  • 99.7% within µ ± 3σ (99% within µ ± 2.58σ)

S L I D E 29

Normal (Gaussian) distribution, example

  • If the math test scores are normally distributed with a mean of 10 and a standard deviation of 3, what is the range of scores in which 68% of the students’ scores will lie?

– 68% of the students will have a score within µ ± 1σ
– 10 ± 3 = between 7 and 13
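
The 68% figure in this example can be checked numerically. This is a minimal sketch using `NormalDist` from Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import NormalDist

# Math test scores: normal with mean 10 and standard deviation 3 (from the example).
scores = NormalDist(mu=10, sigma=3)

low = scores.mean - scores.stdev    # mu - 1 sigma
high = scores.mean + scores.stdev   # mu + 1 sigma

# Area under the curve between the two limits.
share = scores.cdf(high) - scores.cdf(low)
print(low, high, round(share, 4))
```

The computed area is about 0.68, in line with the rule that roughly 68% of values lie within one standard deviation of the mean.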

S L I D E 30

Standard normal distribution (z)

  • The normal distribution with mean 0 and standard deviation 1
  • If the mean ≠ 0 or the SD ≠ 1, do a z transformation → allows using the standard normal table

– z = (x − µ) / σ, where x is the value of the variable, µ is the mean, and σ is the SD

  • A positive z means the value is above the mean
  • A negative z means the value is below the mean
  • If z is known, you can get x

– x = µ + zσ

Graph generated by R

S L I D E 31

Standard normal distribution (z)

  • Properties:

– Bell shape
– Symmetric about the mean
– Mean = median = mode
– Mean = 0
– Standard deviation = 1
– The area under the curve = 1

  • 68% within µ ± 1σ
  • 95% within µ ± 2σ
  • 99.7% within µ ± 3σ

Graph generated by R


S L I D E 32

Standard normal distribution (z) tables

Areas under the standard normal curve (z scores)

  • Can be used to find the proportion above, below, or between any z scores
  • The first column includes the stem of the z value
  • The top row includes the second and third digits of the z value

[Figure: positive and negative z tables; each cell gives the area under the curve to the left of (i.e., below) the z score. Source: http://image.slidesharecdn.com/copyofz-table-130515110049-phpapp02/95/copy-of-z-table-1-638.jpg?cb=1368615687]

S L I D E 33

Standard normal distribution (z), example

  • If the mean of students’ test scores is 80, and the standard deviation is 10, what is the test score that separates the highest 5% of scores (i.e., find the students at or above the 95th percentile)?
  • Solution:

– Find the z score that marks the upper 5% → 1.645
– The test score = µ + 1.645σ = 80 + 1.645 × 10 = 96.45
– Conclusion: the upper 5% have a test score > 96.45

https://i.ytimg.com/vi/SSHCPCS5cys/maxresdefault.jpg
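
Instead of reading the z table, the cutoff in this example can be computed directly. The sketch below uses `NormalDist.inv_cdf` from Python's standard library; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 80, 10                      # test score mean and SD from the example

# z score with 95% of the area below it (i.e., marking the upper 5%).
z_95 = NormalDist().inv_cdf(0.95)

# Back-transform to the original scale: x = mu + z * sigma.
cutoff = mu + z_95 * sigma
print(round(z_95, 3), round(cutoff, 2))
```

The small difference from a hand calculation comes only from rounding z to 1.645 when using the table.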

S L I D E 34

Standard normal distribution (z) tables

  • If the mean of HDL cholesterol is 45 mg/dL, and the standard deviation is 5 mg/dL, what proportion of the population has HDL values > 40 mg/dL?
  • Solution:

– Find the z score equivalent to 40 mg/dL → z = (x − µ) / σ = (40 − 45) / 5 = −1
– P(HDL > 40) = P(z > −1) = 1 − P(z ≤ −1)
– Find the area (probability) below z = −1 (HDL = 40) = 0.1587
– P(HDL > 40) = 1 − 0.1587 = 0.8413
– Conclusion: 84.13% of people in the population are expected to have an HDL value > 40 mg/dL

[Figure: negative z table; each cell gives the area under the curve to the left of (i.e., below) the z score. Source: http://www.gridgit.com/postpic/2014/10/negative-z-score-table-pdf_287337.png]
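
The HDL example can be reproduced step by step without a printed table. This is a minimal sketch with Python's standard `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma, x = 45, 5, 40            # HDL mean, SD, and the value of interest (mg/dL)

z = (x - mu) / sigma                # z transformation
p_below = NormalDist().cdf(z)       # area to the left of z, as given by the z table
p_above = 1 - p_below               # P(HDL > 40)
print(z, round(p_below, 4), round(p_above, 4))
```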

S L I D E 35

T-distribution

  • A symmetric distribution with mean 0 and, for small sample sizes, a standard deviation larger than that of the standard normal distribution
  • Used if the population standard deviation is unknown
  • Needed when the sample size is small
  • t and z distributions are very similar if n > 30
  • Properties:

– Symmetric
– Bell shape
– Shape changes based on the degrees of freedom k
– Mean = median = mode = 0
– Standard deviation > 1

[Figure: z and t curves are almost identical when the sample size is about 30. Graph generated by R]


S L I D E 36

T-distribution

  • Degrees of freedom (df)

– The number of observations that are free to vary
– When calculating the mean, the sum of the observations is fixed. Therefore, when adding up the N observations, each observation is free to vary except the last one, because the total has to be fixed
– Therefore, only N − 1 observations can vary if one mean is to be estimated (one-sample), and (N1 + N2) − 2 observations can vary if two means are to be estimated (two-sample)
– df = total sample size − number of means that are calculated

S L I D E 37

T-distribution

  • Table of critical values of the t distribution

Source: http://elvers.us/psy216/tables/tvalues.htm

Levels of significance for a one-tailed test:
      0.2500  0.2000  0.1500  0.1000  0.0500  0.0250  0.0100  0.0050  0.0005
Levels of significance for a two-tailed test:
      0.5000  0.4000  0.3000  0.2000  0.1000  0.0500  0.0200  0.0100  0.0010

df
 1    1.000   1.376   1.963   3.078   6.314   12.706  31.821  63.657  636.619
 2    0.816   1.061   1.386   1.886   2.920   4.303   6.964   9.925   31.599
 3    0.765   0.978   1.250   1.638   2.353   3.182   4.541   5.841   12.924
 4    0.741   0.941   1.189   1.533   2.132   2.776   3.747   4.604   8.610
 5    0.727   0.920   1.156   1.476   2.015   2.570   3.365   4.032   6.869
 6    0.718   0.906   1.134   1.440   1.943   2.447   3.143   3.707   5.959
 7    0.711   0.896   1.119   1.415   1.895   2.365   2.998   3.499   5.408
 8    0.706   0.889   1.108   1.397   1.859   2.306   2.896   3.355   5.041
 9    0.703   0.883   1.100   1.383   1.833   2.262   2.821   3.250   4.781
10    0.700   0.879   1.093   1.372   1.812   2.228   2.764   3.169   4.587

S L I D E 38

Binomial distribution is used to describe the frequency distribution of dichotomous data

  • The probability distribution that describes the number of successes X observed in n independent trials, each with the same probability of occurrence
  • For binary variables
  • Defined by n and π
  • If the sample is large, or the proportion is close to 0.5, the z distribution can be used as an approximation
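
To make the definition concrete, the sketch below computes a binomial probability from n and π and compares it with the normal (z) approximation. The values n = 10 and π = 0.5 are hypothetical, chosen only for illustration; the continuity correction (using 4.5 and 5.5 as limits) is a common convention that the slides do not discuss.

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_pmf(x, n, pi):
    """Probability of exactly x successes in n independent trials with success probability pi."""
    return comb(n, x) * pi**x * (1 - pi) ** (n - x)


n, pi = 10, 0.5                                   # hypothetical example values
p_exactly_5 = binomial_pmf(5, n, pi)
total = sum(binomial_pmf(x, n, pi) for x in range(n + 1))

# Normal approximation: mean n*pi, SD sqrt(n*pi*(1-pi)), with a continuity correction.
approx = NormalDist(n * pi, sqrt(n * pi * (1 - pi)))
p_approx = approx.cdf(5.5) - approx.cdf(4.5)

print(round(p_exactly_5, 4), round(p_approx, 4), round(total, 4))
```

Even with only 10 trials, the approximation is close here because the proportion is 0.5; it degrades as π moves toward 0 or 1.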

Graphs generated by R

S L I D E 39

Chi-square distribution (χ²) is used for analysis of counts

  • The distribution used to analyze counts in frequency tables
  • A nonsymmetrical distribution with mean (µ) and variance (σ²)
  • Used for categorical (nominal) data
  • Properties:

– Degrees of freedom = ν
– µ = ν
– σ² = 2ν
– Approaches the normal distribution as df increases

Graph generated by R


S L I D E 40

Statistics in medicine

Lecture 1- part 3: Describing variation, and graphical presentation

Fatma Shebl, MD, MS, MPH, PhD Assistant Professor Chronic Disease Epidemiology Department Yale School of Public Health Fatma.shebl@yale.edu

S L I D E 42

Summarizing numerical data

  • Continuous variable

– Measures of central tendency
– Measures of dispersion

  • Nominal data

– Proportions
– Percentages
– Ratios
– Rates

S L I D E 43

Measures of central tendency

  • Definition:

– Index or summary numbers that describe the middle of a distribution

  • Types:

– Mean
– Median
– Mode


S L I D E 44

The mean

  • Types

– Arithmetic
– Geometric

S L I D E 45

The arithmetic mean

  • The most commonly used statistic
  • Definition:

– The arithmetic average of the observations, denoted by µ in the population and by x̄ in the sample. In a sample, the mean is the sum of the X values divided by the number n in the sample

  • Arithmetic mean’s calculation: x̄ = ΣX / n
  • Sensitive to extreme values
  • Can be used with numerical scales
  • Should NOT be used with ordinal scales

S L I D E 46

Example of arithmetic mean’s calculation

Arithmetic mean = (88 + 86 + 93 + ⋯ + 106) / 20 = 1781 / 20 = 89.05

Subject   Glucose
1         88
2         86
3         93
4         79
5         83
6         98
7         74
8         96
9         95
10        78
11        75
12        98
13        90
14        108
15        81
16        108
17        76
18        97
19        72
20        106

Basic Statistical Measures (SAS 9.4 output)

Location               Variability
Mean     89.05000      Std Deviation          11.58254
Median   89.00000      Variance              134.15526
Mode     98.00000      Range                  36.00000
                       Interquartile Range    19.00000
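
The calculation can be reproduced in a few lines. The sketch below also illustrates the sensitivity of the mean to extreme values by replacing the last reading (106) with a hypothetical value of 406, which is not part of the lecture data.

```python
from statistics import mean, median

# Glucose values of the 20 subjects, in subject order.
glucose = [88, 86, 93, 79, 83, 98, 74, 96, 95, 78,
           75, 98, 90, 108, 81, 108, 76, 97, 72, 106]

total = sum(glucose)
arithmetic_mean = total / len(glucose)

# Hypothetical extreme value: the last reading recorded as 406 instead of 106.
with_outlier = glucose[:-1] + [406]
mean_with_outlier = mean(with_outlier)
median_with_outlier = median(with_outlier)

print(total, arithmetic_mean, mean_with_outlier, median_with_outlier)
```

A single extreme reading shifts the mean by 15 units while the median stays where it was.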

S L I D E 47

The geometric mean

  • Less commonly used than the arithmetic mean
  • Definition:

– The nth root of the product of n observations

  • Geometric mean’s calculation

– Log GM = the mean of the log values
– GM = the exponential (antilog) of Log GM

  • Used with skewed distributions or with data measured on a logarithmic scale

S L I D E 48

Example of geometric mean’s calculation

  • Geometric mean

Subject           Glucose   Log glucose
1                 88        4.477337
2                 86        4.454347
3                 93        4.532599
4                 79        4.369448
5                 83        4.418841
6                 98        4.584967
7                 74        4.304065
8                 96        4.564348
9                 95        4.553877
10                78        4.356709
11                75        4.317488
12                98        4.584967
13                90        4.49981
14                108       4.682131
15                81        4.394449
16                108       4.682131
17                76        4.330733
18                97        4.574711
19                72        4.276666
20                106       4.663439
Sum               1781      89.62306
Arithmetic mean   89.05     4.481153
Geometric mean    88.33649
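
The two-step calculation (mean of the logs, then exponentiation) can be sketched as follows. The natural logarithm is assumed because it matches the log values in the table; `statistics.geometric_mean` from the standard library is used only as a cross-check.

```python
from math import exp, log
from statistics import geometric_mean

# Glucose values of the 20 subjects, in subject order.
glucose = [88, 86, 93, 79, 83, 98, 74, 96, 95, 78,
           75, 98, 90, 108, 81, 108, 76, 97, 72, 106]

# Step 1: mean of the natural-log values (Log GM).
log_mean = sum(log(x) for x in glucose) / len(glucose)

# Step 2: exponentiate to return to the original scale (GM).
gm = exp(log_mean)

print(round(log_mean, 6), round(gm, 5), round(geometric_mean(glucose), 5))
```

The geometric mean is slightly smaller than the arithmetic mean of 89.05, as is always the case when the values are not all equal.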


S L I D E 49

The median

  • Definition:

– A measure of central tendency. It is the middle observation, i.e., the one that divides the distribution of values into halves. It is also equal to the 50th percentile

  • Median’s calculation

– Arrange the observations in ascending or descending order
– Count in to find the middle

  • Odd number of observations: the middle value
  • Even number of observations: the mean of the two middle values

  • Less sensitive to extreme values than the mean
  • Can be used with numerical scales
  • Can be used with ordinal scales
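
The odd/even rule for the median can be written out explicitly. This is a minimal sketch; the function name `median_of` is illustrative (Python's `statistics.median` does the same job).

```python
def median_of(values):
    """Middle value of the sorted data; mean of the two middle values when n is even."""
    ordered = sorted(values)          # arrange the observations in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd number of observations
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even number of observations


# Glucose values of the 20 subjects from the arithmetic mean example.
glucose = [88, 86, 93, 79, 83, 98, 74, 96, 95, 78,
           75, 98, 90, 108, 81, 108, 76, 97, 72, 106]

print(median_of(glucose))      # even n: mean of the 10th and 11th sorted values
print(median_of([3, 1, 2]))    # odd n: the middle value
```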

S L I D E 50

Example of median’s calculation

  • Median = (88 + 90) / 2 = 89

Glucose values sorted in ascending order:

Subject   Glucose
19        72
7         74
11        75
17        76
10        78
4         79
15        81
5         83
2         86
1         88
13        90
3         93
9         95
8         96
18        97
6         98
12        98
20        106
14        108
16        108


S L I D E 51

The mode

  • Definition:

– The value of a numerical variable that occurs the most frequently

  • Mode’s calculation

– Count the number of times each value occurs
– The mode is the value that is most frequent

  • Some data might not have a mode
  • Some data might have two modes → bimodal
  • Some data might have more than two modes → multimodal
  • A modal class can be estimated, which is the interval that has the largest number of observations
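
Counting how often each value occurs, as described above, can be sketched with `collections.Counter`. The function name `modes` and the choice to return an empty list when no value repeats are illustrative conventions, not part of the lecture.

```python
from collections import Counter


def modes(values):
    """Return all most frequent values; an empty list if every value occurs only once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                      # no value repeats: no mode
    return sorted(v for v, c in counts.items() if c == highest)


# Glucose values of the 20 subjects from the arithmetic mean example.
glucose = [88, 86, 93, 79, 83, 98, 74, 96, 95, 78,
           75, 98, 90, 108, 81, 108, 76, 97, 72, 106]

print(modes(glucose))
```

For these data two values tie for the highest frequency, so the sample is bimodal.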


S L I D E 52

Example of mode’s calculation

  • Modes: 98 and 108


S L I D E 53

Use of measures of central tendency

  • What is the best measure for a particular dataset?

– The choice depends on:

  • Type of scale

– Numerical → arithmetic mean or median
– Ordinal → median
– Logarithmic scale → geometric mean

  • Distribution

– Symmetrical (the same shape on both sides of the mean) → arithmetic mean or median
– Skewed (outliers in one direction) → median
– Bimodal → mode

S L I D E 54

Measures of spread (dispersion)

  • Definition:

– Index or summary numbers that describe the spread of observations about the middle value.

  • Types

– Range
– Standard deviation
– Coefficient of variation
– Percentiles
– Interquartile range

S L I D E 55

The range

  • Definition:

– The difference between the largest and the smallest observation

  • Range’s calculation

– Rank the data
– Range = largest value − smallest value

  • Sometimes, the minimum and maximum values are displayed instead of the range


S L I D E 56

Example of range’s calculation

  • Range

– 108 − 72 = 36
– Or present the lower and upper values (72, 108)


S L I D E 57

The standard deviation

  • Definition:

– The most common measure of spread, denoted by σ in the population and by SD or s in the sample. It can be used with the mean to describe the distribution of observations. It is the square root of the average of the squared deviations of the observations from their mean (in a sample, the sum of squared deviations is divided by n − 1)

  • SD’s calculation: SD = √[Σ(X − x̄)² / (n − 1)]
  • Other computational formulas exist

S L I D E 58

The standard deviation

  • SD is used in many statistical tests
  • Can be used with the mean to describe the distribution of observations

– If mean − 2SD is below zero (i.e., the mean is smaller than 2SD) for a measurement that cannot be negative → the observations are probably skewed

  • Characteristics of SD:

– If the distribution is bell-shaped

  • About 68% of observations lie within mean ± 1SD
  • About 95% of observations lie within mean ± 2SD
  • About 99.7% of observations lie within mean ± 3SD

– Regardless of the shape

  • At least 75% of observations lie within mean ± 2SD

S L I D E 59

Example of SD’s calculation

SD’s calculation

Subject   Glucose   Deviation from mean   Squared deviation
1         88        -1.05                 1.1025
2         86        -3.05                 9.3025
3         93        3.95                  15.6025
4         79        -10.05                101.0025
5         83        -6.05                 36.6025
6         98        8.95                  80.1025
7         74        -15.05                226.5025
8         96        6.95                  48.3025
9         95        5.95                  35.4025
10        78        -11.05                122.1025
11        75        -14.05                197.4025
12        98        8.95                  80.1025
13        90        0.95                  0.9025
14        108       18.95                 359.1025
15        81        -8.05                 64.8025
16        108       18.95                 359.1025
17        76        -13.05                170.3025
18        97        7.95                  63.2025
19        72        -17.05                290.7025
20        106       16.95                 287.3025
Sum       1781                            2548.95
Mean      89.05
SD        11.58254
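
The columns of this worked example can be recomputed as follows. This is a minimal sketch; it divides by n − 1, which is the choice that reproduces the variance and SD reported in the SAS output.

```python
from math import sqrt

# Glucose values of the 20 subjects, in subject order.
glucose = [88, 86, 93, 79, 83, 98, 74, 96, 95, 78,
           75, 98, 90, 108, 81, 108, 76, 97, 72, 106]

n = len(glucose)
mean = sum(glucose) / n

# Deviation of each observation from the mean, then squared.
squared_deviations = [(x - mean) ** 2 for x in glucose]
sum_sq = sum(squared_deviations)

variance = sum_sq / (n - 1)     # sample variance
sd = sqrt(variance)             # sample standard deviation

print(round(sum_sq, 2), round(variance, 5), round(sd, 5))
```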



S L I D E 60

The coefficient of variation

  • Definition:

– The standard deviation divided by the mean. It is used to obtain a measure of relative variation, i.e., variation relative to the size of the mean

  • CV’s calculation: CV = (SD / mean) × 100%
  • Commonly used in quality control

S L I D E 61

Percentiles

  • Definition:

– A number that indicates the percentage of a distribution that is less than or equal to that number

  • Commonly used to compare individual values to a norm

– Growth charts

  • Used to determine normal laboratory ranges

– Between the 2½ and 97½ percentiles → contains the central 95% of the distribution

Quantiles (SAS 9.4 output)

Level         Quantile
100% Max      108.0
99%           108.0
95%           108.0
90%           107.0
75% Q3        97.5
50% Median    89.0
25% Q1        78.5
10%           74.5
5%            73.0
1%            72.0
0% Min        72.0

S L I D E 62

Interquartile range

  • Definition:

– The difference between the 25th percentile (first quartile) and the 75th percentile (third quartile)

  • It contains the central 50% of the distribution
  • Some authors present the first and third quartile values instead of the difference

S L I D E 63

Interquartile range

  • Interquartile range

– 97.5 − 78.5 = 19
– Or present the first and third quartiles (78.5, 97.5)
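
Software packages use slightly different rules for percentiles, so quartiles can differ between programs. The sketch below uses one such rule, which is an assumption on my part: if n × p / 100 is a whole number j, average the j-th and (j + 1)-th sorted values; otherwise take the next observation up. This rule reproduces the quartiles quoted above for the 20 glucose values; the function name `percentile` is illustrative.

```python
def percentile(values, p):
    """Percentile for an integer percentage p (0 to 100), using an averaging rule."""
    ordered = sorted(values)
    n = len(ordered)
    j, remainder = divmod(n * p, 100)
    if remainder:                 # n*p/100 is not whole: take the next observation up
        return ordered[j]
    if j == 0:
        return ordered[0]         # 0th percentile: the minimum
    if j == n:
        return ordered[-1]        # 100th percentile: the maximum
    return (ordered[j - 1] + ordered[j]) / 2


# Glucose values of the 20 subjects from the arithmetic mean example.
glucose = [88, 86, 93, 79, 83, 98, 74, 96, 95, 78,
           75, 98, 90, 108, 81, 108, 76, 97, 72, 106]

q1 = percentile(glucose, 25)
q3 = percentile(glucose, 75)
iqr = q3 - q1
print(q1, q3, iqr)
```

Python's own `statistics.quantiles` interpolates differently and would give slightly different quartiles for the same data, which is why the rule is spelled out here.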



S L I D E 64

Use of measures of spread

  • What is the best measure for a particular dataset?

– The choice depends on:

  • Type of measure of central tendency

– Mean → standard deviation
– Median → interquartile range

  • Distribution

– Symmetrical (the same shape on both sides of the mean) → standard deviation or interquartile range
– Skewed (outliers in one direction) → interquartile range

  • Purpose

– Compare to norms → percentiles
– Compare distributions measured on different scales → coefficient of variation
– Describe the central 50% of the distribution → interquartile range
– Emphasize the extreme values → range

S L I D E 65

Error bar plots

  • Definition:

– A graph that displays the mean and a measure of spread for one or more groups

  • Deciphering the error bar plot

– The circle: the mean
– The bars: the standard deviation (some authors present the standard error instead)

S L I D E 66

The proportions and percentages

  • Proportion definition:

– The number of observations with the characteristic of interest divided by the total number of observations.

  • Proportion’s calculation

– If the data contain two groups, a and b, then the proportion of a is a / (a + b)

  • Can be used with

– Nominal scales
– Ordinal scales
– Numerical scales

  • Percentage: the proportion multiplied by 100%

S L I D E 67

The ratios

  • Definition:

– A part divided by another part. It is the number of observations WITH the characteristic of interest divided by the number of observations WITHOUT the characteristic of interest.

  • Ratio’s calculation

– If the data contain two groups, a and b, then the ratio of a to b is a / b


S L I D E 68

The rates

  • Definition:

– A proportion associated with a multiplier, called the base (e.g., 1000 or 100,000), and computed over a specified period

  • Rate’s calculation

– If the data contain two groups, a and b, then the rate of a is [a / (a + b)] × base

  • Use of rates in epidemiology and medicine:

– Mortality rates
– Cause-specific mortality rates
– Morbidity rates
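
The difference between a proportion, a ratio, and a rate is easiest to see side by side. In the sketch below the counts are hypothetical (30 people with the characteristic, 970 without, observed over one specified period), and the function names are illustrative.

```python
def proportion(a, b):
    """Observations with the characteristic divided by all observations."""
    return a / (a + b)


def ratio(a, b):
    """Observations with the characteristic divided by those without it."""
    return a / b


def rate(a, b, base=1000):
    """Proportion expressed per `base` observations (e.g., per 1000)."""
    return base * a / (a + b)


# Hypothetical counts: a = with the characteristic, b = without.
a, b = 30, 970

print(proportion(a, b))            # part of the whole
print(round(ratio(a, b), 4))       # part to part
print(rate(a, b, base=1000))       # per 1000 over the specified period
```

The proportion and the rate share the same denominator (the whole group), while the ratio uses only the group without the characteristic.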

S L I D E 69

The rates

  • Adjusting rates:

– Why might a crude rate not be suitable?

  • It can mislead when comparing populations with dissimilar characteristics such as age, gender, or race

– Types:

  • Direct adjustment
  • Indirect adjustment

– Details of the calculations will be covered in the epidemiology and public health thread class

S L I D E 70

One of the problems in the analysis of frequency distribution is SKEWNESS

  • Horizontal stretching of the distribution: the right and left sides of the distribution are not mirror images, i.e., one tail is longer than the other
  • The tail indicates the direction and type of skewed distribution

– Tail pointing to the right → skewed to the right (positively skewed)
– Tail pointing to the left → skewed to the left (negatively skewed)

  • The mean follows the tail regardless of the type of skewed distribution

– The sequence from the tail to the apex is mean, median, mode (note that this is alphabetical order)

  • Mean > median > mode → skewed to the right (positively skewed)
  • Mean < median < mode → skewed to the left (negatively skewed)

Graph source: http://www.statisticshowto.com/wp-content/uploads/2014/02/pearson-mode-skewness.jpg

S L I D E 71

Statistics in medicine

Lecture 1- part 4: Describing variation, and graphical presentation

Fatma Shebl, MD, MS, MPH, PhD Assistant Professor Chronic Disease Epidemiology Department Yale School of Public Health Fatma.shebl@yale.edu


S L I D E 73

There are several ways to depict the frequency distribution of a continuous variable

  • Histogram
  • Frequency polygons
  • Line graphs
  • Stem and leaf diagrams
  • Quantiles
  • Boxplots

S L I D E 74

Frequency distribution is usually presented with histogram

  • Definition:

– A bar graph of a frequency distribution of numerical observations

  • Steps of creating a histogram

– Decide on the number of non-overlapping intervals (statistical software might determine this automatically)
– Put the intervals on the x-axis
– Put the numbers or percentages on the y-axis

  • Percentages are used to compare two histograms based on different sample sizes

– The frequencies/percentages are presented with bars

  • The area of each bar is in proportion to the percentage of individuals in that interval

  • Combining observations in intervals → smoother curve compared to histograms of individual values

S L I D E 75

[Figure: Histogram of glucose level (x-axis: glucose level; y-axis: frequency). Minitab 17 output]

Interpreting the graph

Most participants had a fasting blood glucose level of 65 to 125 mg/dl. Only two participants had a blood glucose level less than 60 mg/dl. Additionally, the distribution is skewed to the right (positively skewed); several participants had a fasting blood glucose level much higher than the target of ≤ 125 mg/dl.


S L I D E 76

Frequency polygons are another presentation of the frequency distribution

[Figure: Frequency polygon of glucose level (x-axis: glucose level; y-axis: frequency). Minitab 17 output]

  • Frequency polygon definition:

– A line graph connecting the midpoints of the tops of the columns of a histogram
– It is useful in comparing two frequency distributions

  • Steps of creating frequency polygons

– Create a histogram
– Connect the midpoints of the tops of the columns of the histogram

S L I D E 77

Percentage polygons

[Figure: Percentage polygon of glucose level (x-axis: glucose level; y-axis: percent)]

  • Percentage polygon definition:

– A line graph connecting the midpoints of the tops of the columns of a histogram based on percentages instead of counts
– It is useful in comparing two or more frequency distributions when the frequencies are not equal

  • Steps of creating percentage polygons

– Create a histogram based on percentages
– Connect the midpoints of the tops of the columns of the histogram
– Extend the line from the midpoints of the first and last columns to the x-axis

S L I D E 78

Stem-and-leaf plots

  • Definition:

– A graphical display for numerical data. It is similar to both a frequency table and a histogram

  • Used for tallying observations
  • Steps of creating a stem-and-leaf plot

– Decide on the number of non-overlapping intervals
– Draw a vertical line
– Put the first digits of each interval on the left side of the vertical line (the “stem”)
– For each individual, put the last digit on the right side of the vertical line (the “leaves”)

  • If the observation has one digit, that digit is the leaf

– Reorder the leaves from lowest to highest within each interval
– Count from either end to locate the median
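
The steps above can be sketched in code. This minimal example assumes non-negative whole numbers and intervals of width 10, so the stem is the value without its last digit and the leaf is the last digit; the function name `stem_and_leaf` is illustrative. It uses the 20 glucose values from the arithmetic mean example.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (value // 10) to a string of its sorted leaves (value % 10)."""
    plot = defaultdict(str)
    for value in sorted(values):               # sorting orders the leaves within each stem
        stem, leaf = divmod(value, 10)
        plot[stem] += str(leaf)
    # Include stems with no observations so that gaps in the data stay visible.
    return {stem: plot.get(stem, "") for stem in range(min(plot), max(plot) + 1)}


glucose = [88, 86, 93, 79, 83, 98, 74, 96, 95, 78,
           75, 98, 90, 108, 81, 108, 76, 97, 72, 106]

display = stem_and_leaf(glucose)
for stem, leaves in display.items():
    print(f"{stem:>3} | {leaves}")
```

Counting ten leaves in from either end lands on 88 and 90, whose mean is the median of 89.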

S L I D E 79

Stem-and-Leaf Display: Glucose level

Stem-and-leaf of Glucose level N = 180
Leaf Unit = 1.0

   n   Stem | Leaf
   1     5  | 2
   7     6  | 678899
  46     7  | 000000012235555555555556677778888889999
  82     8  | 000000000002233555555555666678899999
 (24)    9  | 000000002225555555555568
  74    10  | 00000000000033338
  57    11  | 000000000005
  45    12  | 000000124
  36    13  | 00035559
  28    14  | 000035555599
  16    15  | 0000558
   9    16  | 055558
   3    17  | 02
   1    18  |
   1    19  |
   1    20  |
   1    21  |
   1    22  | 0

The median (91) is in the line marked (24)

The vertical line was added manually

Minitab 17 output


S L I D E 80

Box plots (box-and-whisker plot)

  • Definition:

– A graph that summarizes the data by displaying the minimum, first quartile, median, third quartile, and maximum statistics

  • It can be created from the information displayed in a stem-and-leaf plot or a frequency table

S L I D E 81

Box plots (box-and-whisker plot)

  • Deciphering the box-and-whisker plot

– The box

  • The top of the box is the third quartile
  • The bottom of the box is the first quartile
  • The length of the box is the interquartile range
  • The median is presented with a horizontal line in the box
  • The mean is presented with a plus sign in the box (some programs)

– The whiskers

  • Depict the minimum and the maximum values

Source: http://www.physics.csbsju.edu/stats/simple.box.defs.gif

S L I D E 82

[Figure: Boxplot of glucose level (y-axis: glucose level, 50 to 225), with the median (91) and the mean (101.033) marked]

Interpreting the results

The boxplot shows:

  • The range (whiskers) is 52 to 172
  • The longer upper whisker and larger box area above the median indicate that the data are right (positively) skewed
  • The median is 91
  • The mean is 101.033
  • The interquartile range runs from 79 to 118.75
  • One outlier is present
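
The numbers read off the boxplot can be recomputed from the glucose values of the 180 participants. The sketch below makes two assumptions that the slides do not state: quartiles are taken with the default ("exclusive") method of Python's `statistics.quantiles`, and outliers are values more than 1.5 × IQR beyond the quartiles, a common boxplot convention. Under those assumptions the results agree with the figures listed above.

```python
from statistics import quantiles

# Glucose level -> number of participants, transcribed from the frequency table of 180 participants.
glucose_counts = {
    52: 1, 66: 1, 67: 1, 68: 2, 69: 2, 70: 7, 71: 1, 72: 2, 73: 1, 75: 12,
    76: 2, 77: 4, 78: 6, 79: 4, 80: 11, 82: 2, 83: 2, 85: 9, 86: 4, 87: 1,
    88: 2, 89: 5, 90: 8, 92: 3, 95: 11, 96: 1, 98: 1, 100: 12, 103: 4, 108: 1,
    110: 11, 115: 1, 120: 6, 121: 1, 122: 1, 124: 1, 130: 3, 133: 1, 135: 3,
    139: 1, 140: 4, 143: 1, 145: 5, 149: 2, 150: 4, 155: 2, 158: 1, 160: 1,
    165: 4, 168: 1, 170: 1, 172: 1, 220: 1,
}

# Expand the counts into one sorted list of 180 individual values.
values = sorted(v for v, c in glucose_counts.items() for _ in range(c))

q1, med, q3 = quantiles(values, n=4)           # quartiles, default "exclusive" method
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr                   # assumed 1.5 x IQR outlier rule
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]
whiskers = (min(v for v in values if v >= lower_fence),
            max(v for v in values if v <= upper_fence))
mean = sum(values) / len(values)

print(q1, med, q3, whiskers, outliers, round(mean, 3))
```

The mean lying well above the median is the same signal of right skew that the longer upper whisker gives in the plot.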

S L I D E 83

Tabular and graphical presentation of nominal and ordinal data

  • Contingency frequency tables:

– A table used to display counts and/or frequencies for two or more nominal or quantitative variables

Gender    Post graduate   College   High school
Male      1               3         3
Female    5               6         2


S L I D E 84

Tabular and graphical presentation of nominal and ordinal data

  • Dot plots

– A graphical presentation using dots

  • Bar charts

– A graph used with nominal characteristics to display the numbers or percentages of observations with the characteristic of interest

  • The categories are placed on the x-axis
  • The numbers or percentages are placed on the y-axis

S L I D E 85

Graphs for two characteristics

  • Two characteristics are nominal

– Bar charts

  • Dot plots

S L I D E 86

Graphs for two characteristics

  • One characteristic is nominal and the other is numerical:

– Box plots
– Error bar plots

[Figures: box plots and error bar plots, SAS 9.4 output]

S L I D E 87

Graphs for two characteristics

  • Two characteristics are numerical:
  • Scatterplots (bivariate plots)

– A two-dimensional graph displaying the relationship between two numerical characteristics or variables

  • Creating a scatterplot

– If the data do not have an outcome and a predictor

  • The choice of the x-axis and y-axis does not matter

– If the data have an outcome and a predictor

  • Put the explanatory variable (risk factor, predictor) on the x-axis
  • Put the outcome on the y-axis

– Put a circle for each observation at the point of intersection of its x and y values

[Figure: scatter plots, SAS 9.4 output]

S L I D E 88

Quiz

A pharmaceutical company tested the effect of sofosbuvir (a new HCV drug) on sustained viral response (SVR) in four HCV genotypes. In genotypes 1, 2, 3, and 4, the drug was shown to cause SVR in 90%, 93%, 84%, and 96% of the patients, respectively. What type of graphical depiction is best suited to show the data?

A. Pie chart
B. Venn diagram
C. Bar diagram
D. Histogram