THE REVISION OF SOME CONCEPTS
Summary Statistics
Quantitative data: summary statistics describe a numeric set of data by its center, variability, and shape.
But it is important to consider whether the data are:
- Non-normal: median, range
- Normal: mean, variance, standard deviation
Data Summarization
To summarize quantitative data, we need to use one or two parameters that can describe the data:
- 1. Measures of central tendency, which describe the center of the data
- 2. Measures of dispersion, which show how the data are scattered around its center
Measures of central tendency
A variable usually has a point (center) around which the observed values lie.
These averages are also called measures of central tendency. The three most commonly used averages are:
- 1. The arithmetic mean:
- 2. The Median
- 3. The Mode
1- The arithmetic mean:
The sum of the observations divided by the number of observations:
$\bar{x} = \frac{\sum x}{n}$
Where:
$\bar{x}$ = the mean
$\sum$ denotes "the sum of"
$x$ = the values of the observations
$n$ = the number of observations
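As a minimal sketch of the definition above, the mean can be computed directly as the sum divided by the count. The function name `arithmetic_mean` and the sample values are illustrative assumptions, not part of the original slides.

```python
def arithmetic_mean(values):
    """Return the sum of the observations divided by their number."""
    if not values:
        raise ValueError("at least one observation is required")
    return sum(values) / len(values)


# Illustrative observations (assumed for this example).
observations = [5, 6, 7, 5, 10]
mean = arithmetic_mean(observations)
print(mean)  # 6.6
```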
2- Median
It is the middle observation in a series of observations after arranging them in ascending or descending order.
- The rank of the median is (n + 1)/2 if the number of observations is odd.
- If the number of observations is even, the median is the average of the two middle observations, ranked n/2 and n/2 + 1.
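The ranking rule can be sketched in a few lines. This sketch assumes the common convention that, for an even number of observations, the median is the average of the observations ranked n/2 and n/2 + 1. The function name `median` and the sample values are illustrative.

```python
def median(values):
    """Return the middle observation of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("at least one observation is required")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: rank (n + 1) / 2, which is index `mid` when counting from 0.
        return ordered[mid]
    # Even n: average the observations ranked n/2 and n/2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([5, 6, 7, 5, 10]))  # 6
print(median([5, 6, 7, 10]))     # 6.5
```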
3- Mode
The mode is the most frequently occurring value in the data. Example: 5, 6, 7, 5, 10. The mode of these data is 5, since the number 5 occurs twice.
Sometimes there is more than one mode, and sometimes there is no mode, especially in a small set of observations: Unimodal - Bimodal - Nomodal
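The three cases (one mode, two modes, no mode) can be illustrated with a short sketch. It assumes that data in which every value occurs only once are treated as having no mode. The function name `modes` is hypothetical.

```python
from collections import Counter


def modes(values):
    """Return the most frequent values, or [] when no value repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # Assumption: if every value occurs once, the data have no mode.
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([5, 6, 7, 5, 10]))  # [5]      unimodal
print(modes([1, 1, 2, 2, 3]))   # [1, 2]   bimodal
print(modes([1, 2, 3, 4]))      # []       no mode
```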
Advantages and disadvantages of
Central Tendency Measures (CTM):
- Mean: the preferred CTM, since it takes every individual observation into account; its main disadvantage is that it is affected by extreme values.
- Median: a useful descriptive measure when there are one or two extremely high or low values. The median is less sensitive to outliers (extreme scores) than the mean and is thus a better measure than the mean for highly skewed distributions.
- Mode: rarely used.
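The sensitivity of the mean to extreme values can be seen by adding a single outlier to a small data set. The values below are assumed for illustration; `statistics.mean` and `statistics.median` come from the Python standard library.

```python
import statistics

typical = [5, 6, 7, 5, 10]
with_outlier = typical + [100]  # one extremely high value (assumed)

# How far each measure moves when the outlier is added.
mean_shift = statistics.mean(with_outlier) - statistics.mean(typical)
median_shift = statistics.median(with_outlier) - statistics.median(typical)

print(statistics.mean(typical), statistics.median(typical))
print(statistics.mean(with_outlier), statistics.median(with_outlier))
print(mean_shift > median_shift)  # True
```

Here the median moves from 6 to 6.5, while the mean moves from 6.6 to roughly 22.2.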
Measures of Dispersion
- The measures of dispersion describe the degree of variation (scatter, or spread) of the data around its central value:
1. Range - R
2. Variance - V
3. Standard Deviation - SD
dispersion = variation = spread = scatter
1- Range
- is the difference between the largest and smallest values.
- is the simplest measure of variation.
- Disadvantages: it is based on only two of the observations and gives no idea of how the other observations are arranged between these two. Also, it tends to become larger as the size of the sample increases.
2- Variance
If we want the average of the differences between the mean and each observation in the data, we subtract each value from the mean, sum these differences, and divide the sum by the number of observations:
$\frac{\sum (\bar{x} - x)}{n}$
- The value of this expression is always zero, because the differences between each value and the mean have negative and positive signs that cancel out on algebraic summation.
- To overcome this, we square the difference between the mean and each value, so that no term is negative.
- Thus we get:
$V = \frac{\sum (\bar{x} - x)^2}{n - 1}$
where $\bar{x}$ is the mean, and the divisor n - 1 (rather than n) is used when the data are a sample.
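The two steps (deviations that cancel out, then squared deviations divided by n - 1) can be sketched as follows. The function names and the data are illustrative assumptions. In general, the sum of deviations is zero only up to floating-point rounding; the values here were chosen so that it is exact.

```python
def deviations_from_mean(values):
    """Return (mean - x) for every observation x."""
    mean = sum(values) / len(values)
    return [mean - x for x in values]


def sample_variance(values):
    """Sum of squared deviations divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    return sum(d ** 2 for d in deviations_from_mean(values)) / (n - 1)


data = [2, 4, 4, 6, 9]  # illustrative values, mean = 5
print(sum(deviations_from_mean(data)))  # 0.0 (positive and negative cancel)
print(sample_variance(data))            # 7.0
```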
3- Standard Deviation (SD)
The main disadvantage of the variance is that it is expressed in the square of the units used. So, it is more convenient to express the variation in the original units by taking the square root of the variance.
This is called the standard deviation (SD). Therefore $SD = \sqrt{V}$
- i.e. $SD = \sqrt{\frac{\sum (\bar{x} - x)^2}{n - 1}}$, where $\bar{x}$ is the mean.
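As a small check of the formula, the SD can be computed by hand and compared with the standard library's `statistics.stdev`, which also uses the n - 1 divisor. The data values are assumed for illustration.

```python
import math
import statistics

data = [2, 4, 4, 6, 9]  # illustrative values

mean = sum(data) / len(data)
variance = sum((mean - x) ** 2 for x in data) / (len(data) - 1)
sd = math.sqrt(variance)  # back in the original units of the data

print(round(sd, 2))                               # 2.65
print(math.isclose(sd, statistics.stdev(data)))   # True
```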
Summary Statistics and Normal data
Summary statistics are useful for identifying whether data are normal or not.
Normal data: approximately 95% of observations lie within the mean plus or minus 2 standard deviations.
Normal Distribution curve (NDC)
The NDC is a graphical presentation (frequency polygon) of a quantitative variable. The Normal Distribution Curve is the frequency polygon of a quantitative variable measured in a large number of observations. It plays a major role in the techniques of statistical analysis.
Areas under the NDC
- Mean ± 1 SD = about 68% of the area (34% on each side of the mean).
- Mean ± 2 SD = about 95% of the area (47.5% on each side of the mean).
- Mean ± 3 SD = about 99.7% of the area (49.85% on each side of the mean).
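These areas can be checked numerically with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The helper name `area_within` is an assumption made for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def area_within(k):
    """Fraction of the area under the curve within mean ± k SD."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(area_within(k) * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```

Because the curve is symmetrical, half of each area lies on either side of the mean.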
Characteristics of NDC
1- It is a bell-shaped, continuous curve.
2- It is symmetrical (i.e., can be divided into two equal halves vertically).
3- The tails never touch the base line but extend to infinity in either direction.
4- The mean, median and mode values coincide.
5- It is described by two parameters: the arithmetic mean determines the location of the center of the curve, and the standard deviation represents the scatter around the mean.
NDC and Skewed data
- If we represent collected data by a frequency polygon and the resulting curve does not resemble the normal distribution curve (with all its characteristics), then these data are not normally distributed.
Skewness and Kurtosis
Skewness: measures asymmetry of data
– Positive or right skewed: Longer right tail
– Negative or left skewed: Longer left tail
Kurtosis: measures peakedness of the distribution of data. The kurtosis of a normal distribution is 0 when expressed as excess kurtosis (kurtosis minus 3).
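One common way to compute these two measures is from the central moments of the data. The sketch below uses that moment-based definition as an assumption; the slides do not give a formula, and statistical packages often apply additional small-sample corrections, so their values may differ slightly. All function names and data are illustrative.

```python
def central_moment(values, k):
    """Average of (x - mean) ** k over all observations."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)


def skewness(values):
    """Third central moment scaled by the variance to the power 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth central moment over squared variance, minus 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


right_skewed = [1, 2, 2, 3, 3, 3, 4, 15]  # longer right tail
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed) > 0)  # True
print(skewness(symmetric))         # 0.0
```

Both functions assume the data are not all identical, since a variance of zero would cause a division by zero.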
NDC and normal measurement
The NDC can be used to distinguish normal from abnormal measurements. Example: suppose we have the NDC for hemoglobin levels in a population of normal adult males with mean ± SD = 11 ± 1.5.
Suppose we obtain a hemoglobin reading of 8.1 for an individual and want to know whether he/she is normal or anemic. If this reading is within the area under the curve covering 95% of normal values (i.e. mean ± 2 SD), he/she will be considered normal. If the reading is lower, he/she is anemic.
The normal range for hemoglobin in this example will be:
- the higher level of hemoglobin: 11 + 2 (1.5) = 14.
- the lower hemoglobin level: 11 – 2 (1.5) = 8.
The normal range of hemoglobin of adult males is from 8 to 14. The reading of 8.1 lies within this range, i.e. within the central 95% of this population, so this individual is considered normal.
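The hemoglobin example can be reproduced with a short sketch. The function names are hypothetical; the mean, SD and the reading of 8.1 are the values used in the example, and the reading of 7.5 is an added illustration.

```python
def normal_range(mean, sd, k=2):
    """Return (lower, upper) limits as mean ± k SD."""
    return mean - k * sd, mean + k * sd


def is_within_normal_range(reading, mean, sd, k=2):
    """True if the reading lies inside mean ± k SD (limits included)."""
    low, high = normal_range(mean, sd, k)
    return low <= reading <= high


low, high = normal_range(11, 1.5)
print(low, high)                              # 8.0 14.0
print(is_within_normal_range(8.1, 11, 1.5))   # True
print(is_within_normal_range(7.5, 11, 1.5))   # False (assumed extra reading)
```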
How to test for Normality
- Mean ≈ Median
- (mean - 2 SD, mean + 2 SD) gives a reasonable range for the data
- -1 < skewness < 1
- -1 < kurtosis < 1
- Histogram shows a symmetric bell shape
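Some of these rules of thumb can be combined into a rough screening function. This is only a sketch under stated assumptions: skewness and kurtosis use the moment-based definitions (kurtosis as excess kurtosis), and "mean = median" is interpreted as "within 0.25 SD of each other", a tolerance chosen here purely for illustration. The range and histogram checks are left to visual inspection.

```python
import statistics


def _moment(values, k):
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)


def normality_checks(values, tolerance_in_sd=0.25):
    """Return a dict of rule-of-thumb checks (True means 'looks normal')."""
    sd = statistics.stdev(values)
    variance = _moment(values, 2)
    skew = _moment(values, 3) / variance ** 1.5
    kurt = _moment(values, 4) / variance ** 2 - 3  # excess kurtosis
    gap = abs(statistics.mean(values) - statistics.median(values))
    return {
        "mean_close_to_median": gap <= tolerance_in_sd * sd,  # assumed tolerance
        "skewness_ok": -1 < skew < 1,
        "kurtosis_ok": -1 < kurt < 1,
    }


roughly_symmetric = [2, 3, 3, 4, 4, 4, 5, 5, 6]
very_skewed = [1, 1, 1, 2, 2, 3, 50]
print(all(normality_checks(roughly_symmetric).values()))  # True
print(all(normality_checks(very_skewed).values()))        # False
```

Passing these checks does not prove normality; they only flag data that clearly depart from it.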
If data are not normal:
- A natural log transformation can transform very skewed data to approximately 'Normal' data; use the transformed data in the analysis.
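The effect of the transformation can be seen on a small right-skewed data set. The data below are assumed for illustration and were chosen so that their logarithms are evenly spaced; real data will rarely become this symmetric, and a log transformation does not guarantee normality. The natural log is only defined for positive values.

```python
import math
import statistics


def log_transform(values):
    """Return the natural log of each value (values must be positive)."""
    if any(v <= 0 for v in values):
        raise ValueError("natural log requires strictly positive values")
    return [math.log(v) for v in values]


skewed = [1, 2, 4, 8, 16, 32, 64]  # long right tail (assumed data)
transformed = log_transform(skewed)

# Before: the mean is pulled far above the median by the right tail.
print(statistics.mean(skewed) > statistics.median(skewed))  # True
# After: mean and median of the transformed data agree.
print(math.isclose(statistics.mean(transformed),
                   statistics.median(transformed)))          # True
```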
Use the tool at http://onlinestatbook.com/stat_sim/sampling_dist/index.html to check the characteristics of the sampling distribution of the mean.