Descriptive and Summary Statistics
BIO5312 FALL2017 STEPHANIE J. SPIELMAN, PHD
Logistics
All course materials will be hosted here: http://sjspielman.org/bio5312_fall2017
Submit assignments via Canvas: https://templeu.instructure.com
Please bring your laptop to class!
Office: SERC 643
The primary goal is to analyze, interpret, and visualize data in the biological sciences.
This is achieved via statistical analysis and data science techniques in R.
This is not a course in statistical theory.
Descriptive and summary statistics
Data visualization
Fundamentals in probability, distributions
Statistical inference: hypothesis testing and confidence intervals
Linear modeling
Multiple testing
Binary classification
Clustering methods
Special topics in current biological data analysis
Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.
We use statistics to make inferences about phenomena using samples and to quantify the uncertainty of data.
Biostatistics is (surprisingly!) a branch of applied statistics geared towards medical and biological problems.
Populations are the entire collection of individuals/units/etc. a researcher is interested in
Samples are subsets of individuals/units from populations
Parameters describe populations and estimates describe samples; the two use different notations, as we will see
In an ideal world, a sample is unbiased and features low sampling error
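Samples should be randomly chosen: every individual in the population has an equal chance of being chosen for a given sample
[Figure: bias versus sampling error. Panels contrast precise and imprecise, accurate and inaccurate estimates; the ideal panel shows low bias and low sampling error.]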
Example sampling schemes:
A researcher selects the first 58 student volunteers that sign up for a study.
A computer program numbers all residents in a community, and then uses a random-number generator to select 26 residents.
A researcher vigorously shakes a box containing equally sized balls and takes the first 3 that fall out of the box.
A researcher selects all study participants whose first name starts with an A, B, K, M, or O.
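The computer-based scheme (number every resident, then let a random-number generator pick 26) can be sketched as follows. This is a minimal illustration in Python rather than the course's R; the community size of 500 and the function name `draw_random_sample` are assumptions made for the example.

```python
import random

def draw_random_sample(population, k, seed=None):
    """Return k distinct units chosen uniformly at random, without replacement."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, k)

# Hypothetical community of 500 numbered residents (the size is an assumption).
residents = list(range(1, 501))
sample = draw_random_sample(residents, k=26, seed=1)
print(sorted(sample))
```

Every resident has the same chance of ending up in the sample, which is what separates this scheme from taking the first volunteers who sign up.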
Tools to concisely describe data, numerically and visually.
Generally the first step in data exploration and statistical analysis.
How you analyze and visualize data depends on the type of data you have
Quantitative data
  Continuous
  Discrete
Categorical data
  Nominal
  Ordinal: categories with a natural ordering
  Binary
Bonus: names of sex genotypes?
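Continuous
Mean: $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$
Median: the middle observation of the sorted data. For odd n, the $\frac{n+1}{2}$th observation; for even n, the average of the $\frac{n}{2}$th and $\left(\frac{n}{2} + 1\right)$th observations
Discrete
Mode: the most frequent value in the distribution (commonly used for discrete data)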
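The three measures of central tendency can be computed directly from their definitions. The sketch below is in Python rather than the course's R, and the two small data sets are made up for illustration.

```python
import statistics

def mean(ys):
    """Arithmetic mean: sum of the observations divided by n."""
    return sum(ys) / len(ys)

def median(ys):
    """Middle observation of the sorted data."""
    ordered = sorted(ys)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th observation (1-indexed).
        return ordered[mid]
    # Even n: average of the (n / 2)th and (n / 2 + 1)th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_sample = [2, 3, 3, 5, 7]   # illustrative values, n = 5
even_sample = [2, 3, 5, 8]     # illustrative values, n = 4

print(mean(odd_sample), median(odd_sample), statistics.mode(odd_sample))
print(median(even_sample))
```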
Range
Standard deviation and variance
Interquartile range
Difference between largest and smallest value in a distribution
The range is very sensitive to extreme observations: a single outlier can inflate it drastically.
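A short sketch of how one extreme observation affects the range, written in Python rather than the course's R. The measurements and the added value of 30.0 are hypothetical.

```python
def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

measurements = [2.1, 2.4, 2.5, 2.7, 3.0]   # hypothetical measurements
with_outlier = measurements + [30.0]       # same data plus one extreme value

print(value_range(measurements))
print(value_range(with_outlier))
```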
Generally discussed in the context of the mean.
Deviations describe how far each data point lies from the mean $\bar{Y}$: $Y_1 - \bar{Y}, Y_2 - \bar{Y}, Y_3 - \bar{Y}, \ldots, Y_n - \bar{Y}$
Standard deviation of a sample
$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2}$
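The sample standard deviation can be built up from the deviations step by step. The sketch below is in Python rather than the course's R, uses made-up data, and compares the hand-written version against the standard library's `statistics.stdev`.

```python
import math
import statistics

def deviations(ys):
    """Deviation of each observation from the sample mean."""
    y_bar = sum(ys) / len(ys)
    return [y - y_bar for y in ys]

def sample_sd(ys):
    """Square root of the summed squared deviations divided by n - 1."""
    n = len(ys)
    return math.sqrt(sum(d ** 2 for d in deviations(ys)) / (n - 1))

ys = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # illustrative values

print(deviations(ys))
print(sample_sd(ys), statistics.stdev(ys))
```

The deviations always sum to zero, which is why they are squared before being summed.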
Generally discussed in the context of the median.
Quartiles divide the data into four equal parts (“quar”!).
The interquartile range (IQR) is the difference between the third and first quartiles.
Five number summary: min, Q1, median, Q3, max
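Example data (sorted, n = 16): 1.25, 1.64, 1.91, 2.31, 2.37, 2.38, 2.84, 2.87, 2.93, 2.94, 2.98, 3.00, 3.09, 3.22, 3.41, 3.55
[Figure: the example data annotated with the first quartile, median, third quartile, and interquartile range.]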
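A five-number summary and IQR for the 16 example values can be sketched as below, in Python rather than the course's R. Quartile conventions differ between textbooks and software, so the quartiles here (from `statistics.quantiles` with its default "exclusive" method) are one reasonable choice and may not match the values marked on the original slide.

```python
import statistics

data = [1.25, 1.64, 1.91, 2.31, 2.37, 2.38, 2.84, 2.87,
        2.93, 2.94, 2.98, 3.00, 3.09, 3.22, 3.41, 3.55]

def five_number_summary(values):
    """Return min, Q1, median, Q3, and max of the values."""
    ordered = sorted(values)
    # n=4 asks for the three cut points that split the data into quarters.
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return {"min": ordered[0], "q1": q1, "median": median,
            "q3": q3, "max": ordered[-1]}

summary = five_number_summary(data)
iqr = summary["q3"] - summary["q1"]
print(summary)
print("IQR:", iqr)
```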
The median is much more robust to outliers compared to the mean.
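A small demonstration of that robustness, sketched in Python rather than the course's R with made-up numbers: replacing one observation with an extreme value moves the mean a long way but leaves the median unchanged.

```python
import statistics

clean = [1, 2, 3, 4, 5]           # illustrative data
contaminated = [1, 2, 3, 4, 500]  # same data with one extreme observation

print(statistics.mean(clean), statistics.median(clean))
print(statistics.mean(contaminated), statistics.median(contaminated))
```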
Which would you choose for a symmetric distribution and why?
Coefficient of variation is the standard deviation of a sample expressed as a percentage of the sample mean (aka normalized)
$CV = \frac{s}{\bar{Y}} \times 100\%$
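The coefficient of variation follows directly from the sample standard deviation and mean. This sketch is in Python rather than the course's R, and the data and the helper name `coefficient_of_variation` are illustrative assumptions.

```python
import statistics

def coefficient_of_variation(ys):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(ys) / statistics.mean(ys) * 100

ys = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # illustrative values
rescaled = [10 * y for y in ys]                 # same data in different units

print(coefficient_of_variation(ys))
print(coefficient_of_variation(rescaled))
```

Because the standard deviation is divided by the mean, changing the units of measurement leaves the result unchanged, which is the sense in which it is normalized.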
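Measurement | Sample estimate | Population parameter
Mean | $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$ | $\mu = \frac{1}{N} \sum_{i=1}^{N} Y_i$
Standard deviation | $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2}$ | $\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (Y_i - \mu)^2}$
Variance | $s^2$ | $\sigma^2$
Here n is the sample size and N is the population size.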
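The only computational difference between the sample and population versions is the divisor: n - 1 for a sample, the full count for a population. The sketch below, in Python rather than the course's R and with made-up data, shows both using the standard library.

```python
import statistics

ys = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # illustrative values

# Treating ys as a sample: divide the summed squared deviations by n - 1.
sample_variance = statistics.variance(ys)
sample_sd = statistics.stdev(ys)

# Treating ys as the entire population: divide by the number of values.
population_variance = statistics.pvariance(ys)
population_sd = statistics.pstdev(ys)

print(sample_variance, sample_sd)
print(population_variance, population_sd)
```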
Different types of plots are used to represent different types of data
Continuous data: histogram, density plot, boxplot, violin plot
Discrete data: bar plot
Comparing two continuous variables: scatterplot
Trend over time: line plot
[Figure: histograms (Count against Value) showing four distribution shapes: uniform, bell-shaped, asymmetric (skewed), and bimodal.]
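A histogram is built by dividing the value axis into bins and counting the observations in each. The sketch below, in Python rather than the course's R, bins the 16 example values from the quartile slide; the bin width of 0.5, the starting point of 1.0, and the function name are assumptions chosen for illustration.

```python
data = [1.25, 1.64, 1.91, 2.31, 2.37, 2.38, 2.84, 2.87,
        2.93, 2.94, 2.98, 3.00, 3.09, 3.22, 3.41, 3.55]

def histogram_counts(values, start, width, n_bins):
    """Count values per bin; bin k covers [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) / width)
        if 0 <= k < n_bins:   # values outside the binned range are ignored
            counts[k] += 1
    return counts

counts = histogram_counts(data, start=1.0, width=0.5, n_bins=6)
for k, c in enumerate(counts):
    lower = 1.0 + 0.5 * k
    print(f"[{lower:.1f}, {lower + 0.5:.1f}): {'#' * c}")
```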
Graphical representation of a five-number summary.
“Whiskers” extend to the most extreme data points within 1.5 × IQR of the box (below Q1 and above Q3).
[Figure: boxplot annotated with Q1, median, Q3, IQR, and whiskers; axis: Value.]
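The whisker rule can be sketched as below, in Python rather than the course's R. This assumes the common convention in which whiskers reach the most extreme observations lying within 1.5 × IQR of the quartiles and anything beyond is flagged as an outlier; the data and the function name `boxplot_whiskers` are hypothetical, and other software may compute quartiles slightly differently.

```python
import statistics

def boxplot_whiskers(values):
    """Return (lower whisker end, upper whisker end, outliers)."""
    ordered = sorted(values)
    q1, _, q3 = statistics.quantiles(ordered, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in ordered if lower_fence <= v <= upper_fence]
    outliers = [v for v in ordered if v < lower_fence or v > upper_fence]
    # Whiskers end at actual data points, not at the fences themselves.
    return inside[0], inside[-1], outliers

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]   # hypothetical data with one extreme value
low, high, outliers = boxplot_whiskers(values)
print(low, high, outliers)
```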
[Figure: distributions shown as a histogram (Count against Value) and a density plot (Value), contrasting bimodal and unimodal shapes.]
Symmetry? Asymmetric
Skewness? Right-skewed
Modality? Unclear
[Figure: the distributions N(5, 4), N(2, 1), and N(4, 0.09), each shown as a violin plot, a density plot, and a boxplot.]
[Figure: bar plot "Flowers in garden" showing Count by Flower color (pink, red, white).]
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002128
[Figure: two scatterplots of Variable 2 against Variable 1.]
x-axis: explanatory/independent variable
y-axis: response/dependent variable
[Figure: line plots of Value by Year, covering 1990 to 2003.]