[PPT] - Descriptive and Summary Statistics BIO5312 FALL2017 STEPHANIE J. PowerPoint Presentation

SLIDE 1

Descriptive and Summary Statistics

BIO5312 FALL2017 STEPHANIE J. SPIELMAN, PHD

SLIDE 2

Logistics

All course materials will be hosted here: http://sjspielman.org/bio5312_fall2017
Submit assignments via Canvas: https://templeu.instructure.com
Please bring your laptop to class!!!
Office: SERC 643

Weekly office hours: Friday 1-3, ground floor of SERC ← vote?

SLIDE 3

Course goals

The primary goal is to analyze, interpret, and visualize data in the biological sciences
Achieved via statistical analysis and data science techniques in R
This is not a course in statistical theory.

SLIDE 4

Course topics

Descriptive and Summary Statistics
Data visualization
Fundamentals in probability, distributions
Statistical inference: hypothesis testing and confidence intervals
Linear modeling
Multiple testing
Binary classification
Clustering methods
Special topics in current biological data analysis

SLIDE 6

But first, what are we doing here?

Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.
We use statistics to make inferences about phenomena using samples and to quantify the uncertainty in data
Biostatistics is (surprisingly!) a branch of applied statistics geared towards medical and biological problems

SLIDE 7

Populations and samples

Populations are the entire collection of individuals/units/etc. a researcher is interested in

Generally we can never know the true composition of a population
Populations are described with parameters

Samples are subsets of individuals/units from populations

We use hypothesis testing to (try to) draw population-level conclusions from samples
Samples are described with estimates

Parameters and estimates use different notations, as we will see
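As a minimal sketch of the difference between a parameter and an estimate, the Python snippet below simulates a hypothetical population and compares its mean with the mean of one random sample. The simulated values, the seed, and the names `population` and `sample` are illustrative assumptions, not course data. In real studies the full population is rarely available, which is why we rely on the estimate.

```python
import random
import statistics

random.seed(5312)  # arbitrary seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated measurements (made-up values).
population = [random.gauss(50.0, 10.0) for _ in range(10_000)]

# A sample is a subset of the population; here, 100 units drawn at random.
sample = random.sample(population, k=100)

parameter = statistics.mean(population)  # population mean (usually unknown)
estimate = statistics.mean(sample)       # sample mean (what we can compute)

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (estimate):      {estimate:.2f}")
```

The two numbers are close but not identical; the gap between them is what statistical inference tries to quantify.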

SLIDE 8

What makes a good sample?

In an ideal world, a sample is unbiased and features low sampling error

Bias is a systematic discrepancy between estimate and parameter

Samples should be randomly chosen

Each population unit should have an equal and independent chance of being chosen for a given sample

[Figure: diagrams contrasting bias and sampling error, with panels labeled precise vs. imprecise and accurate vs. inaccurate; the ideal sample has low bias and low sampling error]
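The contrast between bias and sampling error can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is simulated, the "biased" scheme (only the largest half of the population can be chosen) is a made-up example, and the standard deviation of the repeated sample means is used as a rough stand-in for sampling error. The helper `repeated_estimates` is hypothetical.

```python
import random
import statistics

random.seed(8)  # arbitrary seed for reproducibility

# Hypothetical population of 5,000 measurements (made-up values).
population = [random.gauss(100.0, 15.0) for _ in range(5_000)]
true_mean = statistics.mean(population)


def repeated_estimates(units, sample_size, repeats=500):
    """Draw many samples from `units` and return the mean of each sample."""
    return [statistics.mean(random.sample(units, sample_size))
            for _ in range(repeats)]


# Random scheme: every population unit has the same chance of being chosen.
random_means = repeated_estimates(population, sample_size=30)

# Biased scheme (assumed for illustration): only the largest half is eligible.
upper_half = sorted(population)[len(population) // 2:]
biased_means = repeated_estimates(upper_half, sample_size=30)

for label, means in [("random", random_means), ("biased", biased_means)]:
    bias = statistics.mean(means) - true_mean  # systematic discrepancy
    spread = statistics.stdev(means)           # variation between samples
    print(f"{label}: bias = {bias:+.2f}, spread of estimates = {spread:.2f}")
```

Both schemes show sample-to-sample variation, but only the biased scheme is off target on average, and taking more samples does not fix that.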

SLIDE 9

Pop quiz: Is it random?

A researcher selects the first 58 student volunteers that sign up for a study
A computer program numbers all residents in a community, and then uses a random-number generator to select 26 residents
A researcher vigorously shakes a box containing equally sized balls and takes the first 3 that fall out of the box.
A researcher selects all study participants whose first name starts with an A, B, K, M, or O.

SLIDE 11

Descriptive and Summary Statistics

Tools to concisely describe data, numerically and visually
Generally the first step in data exploration and statistical analysis

Identify missing values, outliers, etc.
Check assumptions required to fit models or perform statistical tests
Identify trends that merit further study

SLIDE 12

Types of data

Quantitative data

Continuous
Discrete (includes count data)

Categorical data

Nominal
Ordinal
Binary*

How you analyze and visualize data depends on the type of data you have

SLIDE 13

Quantitative data

Continuous

Any real-number value within some range

Discrete

Values are in indivisible units, i.e. whole or counting numbers
Includes count data (number of cups of coffee per day, number of amino acids in a protein…)

SLIDE 14

Categorical data

Nominal

Hair color, eye color, sex genotypes (XX, XY, XXY, XYY, XO).

Ordinal – categories with a natural ordering

Bad, fair, good, excellent
A, B, C, D

Binary

Yes/No
True/False

Bonus: names of sex genotypes?

SLIDE 15

Measures of Location

Continuous

Mean: $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$

Median (with the observations sorted)

For odd n, the $\frac{n+1}{2}$th observation
For even n, the average of the $\frac{n}{2}$th and $\left(\frac{n}{2} + 1\right)$th observations

Discrete

Mode

The most frequently appearing observation in the distribution (commonly used for discrete data)

1, 2, 2, 2, 3, 4, 4, 5, 6 → 2
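The three measures of location can be checked on the mode example with Python's standard `statistics` module. This is a minimal sketch; the helper `median_by_rule` is a hypothetical name that simply spells out the odd/even rule for the median.

```python
import statistics

data = [1, 2, 2, 2, 3, 4, 4, 5, 6]  # values from the mode example

mean = statistics.mean(data)      # sum of the values divided by n
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequently appearing value


def median_by_rule(values):
    """Median computed directly from the odd/even rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the (n+1)/2-th observation (converted to a 0-based index).
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the n/2-th and (n/2 + 1)-th observations.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


print(f"mean = {mean:.3f}, median = {median}, mode = {mode}")
print("median by rule:", median_by_rule(data))
```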

SLIDE 16

Measures of location in distributions

http://i.imgur.com/YSEYhha.jpg

SLIDE 17

Measures of spread

Range
Standard deviation and variance
Interquartile range

SLIDE 18

Range

Difference between largest and smallest value in a distribution

1, 2, 3, 7, 9 → 8
1, 2, 3, 7, 9, 500 → 499

Range is very sensitive to extreme observations and becomes very unwieldy very quickly.
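A minimal Python sketch of the range calculation, using the two example datasets from this slide (`value_range` is a hypothetical helper name):

```python
def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)


small = [1, 2, 3, 7, 9]
with_extreme = [1, 2, 3, 7, 9, 500]

print(value_range(small))         # 8
print(value_range(with_extreme))  # 499
```

A single extreme observation changes the range from 8 to 499, even though the other five values are unchanged.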

SLIDE 19

Standard deviation and variance

Generally discussed in the context of mean
Deviation describes how each data point $Y_i$ deviates from the mean $\bar{Y}$:

$Y_1 - \bar{Y},\; Y_2 - \bar{Y},\; Y_3 - \bar{Y},\; \ldots,\; Y_n - \bar{Y}$

Standard deviation of a sample

$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2}$

Variance
$s^2$
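To make the calculation concrete, the sketch below computes the deviations, the sample variance, and the sample standard deviation step by step, then compares the results with Python's `statistics.variance` and `statistics.stdev`. The data values are borrowed from the range example purely for illustration.

```python
import math
import statistics

data = [1, 2, 3, 7, 9]  # illustrative values borrowed from the range example

n = len(data)
mean = sum(data) / n

# Deviation of each data point from the mean; these always sum to zero.
deviations = [y - mean for y in data]

# Sample variance: sum of squared deviations divided by n - 1.
variance = sum(d ** 2 for d in deviations) / (n - 1)

# Sample standard deviation: square root of the variance.
sd = math.sqrt(variance)

print(f"variance = {variance:.2f}, standard deviation = {sd:.3f}")
print("matches statistics module:",
      math.isclose(variance, statistics.variance(data)),
      math.isclose(sd, statistics.stdev(data)))
```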

SLIDE 20

Interquartile range

Generally discussed in the context of median
Quartiles divide the data into four equal parts (“quar”!)
Interquartile range (IQR) is the difference between the third and first quartile

How much of the data does the IQR encompass?

Five number summary: min, Q1, median, Q3, max

Example data (sorted): 1.25, 1.64, 1.91, 2.31, 2.37, 2.38, 2.84, 2.87, 2.93, 2.94, 2.98, 3.00, 3.09, 3.22, 3.41, 3.55

[Figure: the example data annotated with the first quartile, median, third quartile, and interquartile range]
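The five-number summary and IQR for the 16 example values can be sketched with `statistics.quantiles`. One assumption to flag: quartiles have several competing definitions, and this sketch uses the `"inclusive"` method (linear interpolation between data points), so other software or textbook conventions may report slightly different Q1 and Q3.

```python
import statistics

# The 16 example values listed on this slide, already sorted.
data = [1.25, 1.64, 1.91, 2.31, 2.37, 2.38, 2.84, 2.87,
        2.93, 2.94, 2.98, 3.00, 3.09, 3.22, 3.41, 3.55]

# Assumption: "inclusive" quartile method; other conventions differ slightly.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

five_number_summary = (min(data), q1, median, q3, max(data))

# Count the observations lying between the first and third quartiles.
inside = sum(q1 <= y <= q3 for y in data)

print("five-number summary:", five_number_summary)
print(f"IQR = {iqr:.4f}; {inside} of {len(data)} values fall within [Q1, Q3]")
```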

SLIDE 21

Mean or median?

The median is much more robust to outliers compared to the mean.

[Figure: two example distributions, each with its mean marked]

Which would you choose for a symmetric distribution and why?
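The robustness claim can be checked numerically. The sketch below reuses the values from the range example (an illustrative choice) and adds the single extreme observation, 500, to see how far each measure moves.

```python
import statistics

data = [1, 2, 3, 7, 9]
with_outlier = data + [500]  # same data plus one extreme observation

for label, values in [("without outlier", data), ("with outlier", with_outlier)]:
    print(f"{label}: mean = {statistics.mean(values):.1f}, "
          f"median = {statistics.median(values):.1f}")

# How far each measure shifts when the outlier is added.
mean_shift = statistics.mean(with_outlier) - statistics.mean(data)
median_shift = statistics.median(with_outlier) - statistics.median(data)
print(f"mean shifts by {mean_shift:.1f}, median shifts by {median_shift:.1f}")
```

The mean moves from 4.4 to 87.0, while the median moves only from 3 to 5.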

SLIDE 22

Measures of variability

Coefficient of variation is the standard deviation of a sample expressed as a percentage of the sample mean (aka normalized)

$\mathrm{CV} = \frac{s}{\bar{Y}} \times 100\%$

Useful measure for comparing variability between two differently-scaled datasets
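A small sketch of why the coefficient of variation helps with differently-scaled data: the same hypothetical heights are expressed in centimeters and in meters. The standard deviations differ by a factor of 100, but the coefficient of variation is the same. The data and the helper name `coefficient_of_variation` are made up for illustration.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


# Hypothetical heights, recorded in two different units.
heights_cm = [150, 160, 170, 180, 190]
heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]

sd_cm = statistics.stdev(heights_cm)
sd_m = statistics.stdev(heights_m)
cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)

print(f"standard deviation: {sd_cm:.3f} cm vs {sd_m:.5f} m")
print(f"coefficient of variation: {cv_cm:.2f}% vs {cv_m:.2f}%")
```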

SLIDE 23

Sample vs population notation

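Mean
Sample estimate: $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$
Population parameter: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$

Standard deviation
Sample estimate: $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2}$
Population parameter: $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$

Variance
Sample estimate: $s^2$
Population parameter: $\sigma^2$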
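The practical difference between the sample and population formulas is the divisor: n − 1 for the sample, n for the population. Python's `statistics` module provides both versions, as the sketch below shows on a small made-up dataset.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values; the mean is exactly 5

# Sample versions divide the sum of squared deviations by n - 1.
sample_variance = statistics.variance(data)
sample_sd = statistics.stdev(data)

# Population versions divide by n (treating the data as the whole population).
population_variance = statistics.pvariance(data)
population_sd = statistics.pstdev(data)

print(f"sample:     variance = {sample_variance:.3f}, sd = {sample_sd:.3f}")
print(f"population: variance = {population_variance:.3f}, sd = {population_sd:.3f}")
```

For the same data the sample estimates are always slightly larger, and the gap shrinks as n grows.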

SLIDE 24

Visualizing data

Different types of plots are used to represent different types of data

Continuous data

Histogram
Density plot
Boxplot
Violin plot

Discrete data

Bar plot

Comparing two continuous variables

Scatterplot

Trend over time

Line plot

SLIDE 25

Histogram

[Figure: histogram; x-axis: Value, y-axis: Count]
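Behind every histogram is a simple counting step: each value is assigned to a bin, and the bar height is the number of values in that bin. The sketch below shows that step in plain Python. The data, the bin width, and the helper `histogram_counts` are illustrative assumptions, and real plotting libraries choose bins in their own ways.

```python
import math
from collections import Counter


def histogram_counts(values, bin_width, start):
    """Count the values falling in each bin [start + k*w, start + (k+1)*w)."""
    bin_index = Counter(math.floor((v - start) / bin_width) for v in values)
    return {start + k * bin_width: bin_index[k] for k in sorted(bin_index)}


# Hypothetical continuous measurements.
values = [12.1, 12.8, 13.5, 14.2, 14.4, 14.9,
          15.1, 15.3, 15.8, 16.4, 17.0, 17.9]

counts = histogram_counts(values, bin_width=1, start=12)

# Text rendering: one row per bin, one '#' per observation.
for left_edge, count in counts.items():
    print(f"[{left_edge}, {left_edge + 1}): {'#' * count}")
```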

SLIDE 26

Using histograms to describe distributions

[Figure: example histograms of four distribution shapes: uniform, bell-shaped, asymmetric (skewed), and bimodal]

SLIDE 27

Density plots smoothen histograms

[Figure: a histogram (x-axis: x, y-axis: count) shown alongside the corresponding density plot (x-axis: x, y-axis: density)]

SLIDE 28

Boxplot

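Graphical representation of a five-number summary
“Whiskers” extend to the most extreme data points within 1.5 × IQR of the box (below Q1 and above Q3); points beyond the whiskers are drawn as outliers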

[Figure: annotated boxplot (axis: Value) labeling Q1, median, Q3, IQR, “whiskers”, and outliers]
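The 1.5 × IQR rule can be sketched in a few lines of Python. The data are hypothetical, and the quartiles use the `"inclusive"` method of `statistics.quantiles`, which is an assumption; plotting software may compute quartiles slightly differently, which can move the whiskers.

```python
import statistics

# Hypothetical measurements with one unusually low value.
data = [-4.0, -1.2, -0.8, -0.5, -0.1, 0.0, 0.3, 0.6, 0.9, 1.4, 2.0]

q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences: the farthest a whisker is allowed to reach from the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences.
inside = [y for y in data if lower_fence <= y <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Anything outside the fences is plotted individually as an outlier.
outliers = [y for y in data if y < lower_fence or y > upper_fence]

print(f"box: Q1 = {q1}, median = {median}, Q3 = {q3}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print("outliers:", outliers)
```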

SLIDE 29

Boxplots: The plot thickens*

[Figure: boxplots (axes: Distributions, Value) and histograms (axes: Value, Count) comparing a bimodal and a unimodal distribution]

*Pun intended.

SLIDE 30

What can we say about this distribution based on its boxplot?

[Figure: boxplot of the distribution; axis: Value]

Symmetry? Asymmetric
Skewness? Right-skewed
Modality? Unclear

SLIDE 31

Violin plot: Density meets boxplot

[Figure: violin plot, density plot, and boxplot panels for samples from N(5, 4), N(2, 1), and N(4, 0.09)]

SLIDE 32

Barplot

[Figure: bar plot titled “Flowers in garden”; x-axis: Flower color (orange, pink, red, white), y-axis: Count]

SLIDE 33

Cautionary tale in barplots

http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002128

SLIDE 34

Scatterplot

[Figure: two scatterplots; axes: Variable 1, Variable 2]

Annotations: explanatory/independent variable, response/dependent variable

SLIDE 35

Time series data

[Figure: two line plots of Value against Year, covering roughly 1990 to 2003]
