Descriptive and Summary Statistics BIO5312 FALL2017 STEPHANIE J. - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Descriptive and Summary Statistics

BIO5312 FALL2017 STEPHANIE J. SPIELMAN, PHD

slide-2
SLIDE 2

Logistics

All course materials will be hosted here: http://sjspielman.org/bio5312_fall2017

Submit assignments via Canvas: https://templeu.instructure.com

Please bring your laptop to class!

Office: SERC 643

  • Weekly office hours: Friday 1-3, ground floor of SERC ← vote?
slide-3
SLIDE 3

Course goals

The primary goal is to analyze, interpret, and visualize data in the biological sciences.

This is achieved via statistical analysis and data science techniques in R.

This is not a course in statistical theory.

slide-4
SLIDE 4

Course topics

  • Descriptive and summary statistics
  • Data visualization
  • Fundamentals in probability, distributions
  • Statistical inference: hypothesis testing and confidence intervals
  • Linear modeling
  • Multiple testing
  • Binary classification
  • Clustering methods
  • Special topics in current biological data analysis

slide-6
SLIDE 6

But first, what are we doing here?

Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.

We use statistics to make inferences about phenomena using samples, and to quantify uncertainty in data.

Biostatistics is (surprisingly!) a branch of applied statistics geared toward medical and biological problems.

slide-7
SLIDE 7

Populations and samples

A population is the entire collection of individuals/units/etc. that a researcher is interested in

  • Generally we can never know the true composition of a population
  • Populations are described with parameters

Samples are subsets of individuals/units from populations

  • We use hypothesis testing to (try to) draw population-level conclusions from samples
  • Samples are described with estimates

Parameters and estimates use different notations, as we will see
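
As a rough illustration of the parameter/estimate distinction, the sketch below simulates a population, computes its mean (the parameter), and compares it with the mean of a random sample (the estimate). It is written in Python rather than the course's R purely for illustration; the population values, the seed, and the names `population`, `mu`, and `y_bar` are all assumptions, not part of the slides.

```python
import random
import statistics

random.seed(5312)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated measurements.
population = [random.gauss(170, 10) for _ in range(10_000)]

# Population parameter: computed from every unit (rarely possible in practice).
mu = statistics.mean(population)

# Sample estimate: computed from a random subset of 30 units.
sample = random.sample(population, 30)
y_bar = statistics.mean(sample)

print(f"parameter mu = {mu:.2f}, estimate y_bar = {y_bar:.2f}")
```

The two numbers will generally be close but not identical; that gap is what the later slides on sampling error and inference are about.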

slide-8
SLIDE 8

What makes a good sample?

In an ideal world, a sample is unbiased and features low sampling error

  • Bias is a systematic discrepancy between estimate and parameter

Samples should be randomly chosen

  • Each population unit should have an equal and independent chance of being chosen for a given sample

[Figure: bias versus sampling error, illustrated as accurate/inaccurate versus precise/imprecise; the ideal sample has low bias and low sampling error]
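
The difference between sampling error and bias can be sketched with a small simulation. This is only an illustration in Python under assumed values: the population (the integers 1 to 1000), the sample size, and the deliberately bad "take the smallest units" scheme are all hypothetical.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

population = list(range(1, 1001))        # hypothetical population
true_mean = statistics.mean(population)  # the parameter, 500.5

def random_sample_mean(n):
    """Mean of a simple random sample: every unit has an equal chance."""
    return statistics.mean(random.sample(population, n))

def biased_sample_mean(n):
    """Mean of a biased sample: always takes the n smallest units."""
    return statistics.mean(sorted(population)[:n])

random_means = [random_sample_mean(50) for _ in range(200)]
biased_means = [biased_sample_mean(50) for _ in range(200)]

# Random sampling: estimates scatter around the parameter (sampling error).
print(statistics.mean(random_means), statistics.stdev(random_means))
# Biased sampling: estimates sit consistently far from the parameter (bias).
print(statistics.mean(biased_means), statistics.stdev(biased_means))
```

The biased scheme here is perfectly precise (every repeat gives the same estimate) yet inaccurate, while the random scheme is accurate on average but varies from sample to sample.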

slide-9
SLIDE 9

Pop quiz: Is it random?

  • A researcher selects the first 58 student volunteers that sign up for a study.
  • A computer program numbers all residents in a community, and then uses a random-number generator to select 26 residents.
  • A researcher vigorously shakes a box containing equally sized balls and takes the first 3 that fall out of the box.
  • A researcher selects all study participants whose first name starts with an A, B, K, M, or O.

slide-11
SLIDE 11

Descriptive and Summary Statistics

Tools to concisely describe data, numerically and visually

Generally the first step in data exploration and statistical analysis

  • Identify missing values, outliers, etc.
  • Check assumptions required to fit models or perform statistical tests
  • Identify trends that merit further study
slide-12
SLIDE 12

Types of data

Quantitative data

  • Continuous
  • Discrete (includes count data)

Categorical data

  • Nominal
  • Ordinal
  • Binary*

How you analyze and visualize data depends on the type of data you have

slide-13
SLIDE 13

Quantitative data

Continuous

  • Any real-number value within some range

Discrete

  • Values are in indivisible units, i.e. whole or counting numbers
  • Includes count data (number of cups of coffee per day, number of amino acids in a protein…)
slide-14
SLIDE 14

Categorical data

Nominal

  • Hair color, eye color, sex genotypes (XX, XY, XXY, XYY, XO).

Ordinal – categories with a natural ordering

  • Bad, fair, good, excellent
  • A, B, C, D

Binary

  • Yes/No
  • True/False

Bonus: names of sex genotypes?

slide-15
SLIDE 15

Measures of Location

Continuous

Mean: $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$

Median

  • For odd n, the $\frac{n+1}{2}$th observation
  • For even n, the average of the $\frac{n}{2}$th and $\left(\frac{n}{2} + 1\right)$th observations

Discrete

Mode

  • The most frequently appearing observation in the distribution (commonly used for discrete data)
  • 1, 2, 2, 2, 3, 4, 4, 5, 6 → 2
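
The three measures of location can be computed on the slide's example values as follows. This is a minimal Python sketch (the course itself uses R); the variable names are assumptions for illustration.

```python
import statistics

data = [1, 2, 2, 2, 3, 4, 4, 5, 6]  # the example values from the slide

mean = sum(data) / len(data)      # (1/n) * sum of the Y_i
median = statistics.median(data)  # n = 9 is odd, so the 5th sorted value
mode = statistics.mode(data)      # the most frequent value

print(f"mean={mean:.3f}, median={median}, mode={mode}")
```
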
slide-16
SLIDE 16

Measures of location in distributions

http://i.imgur.com/YSEYhha.jpg
slide-17
SLIDE 17

Measures of spread

  • Range
  • Standard deviation and variance
  • Interquartile range

slide-18
SLIDE 18

Range

Difference between largest and smallest value in a distribution

  • 1, 2, 3, 7, 9 → 8
  • 1, 2, 3, 7, 9, 500 → 499

The range is very sensitive to extreme observations: a single extreme value can make it unwieldy.
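
The two examples above can be reproduced with a short Python sketch; `value_range` is a hypothetical helper name chosen for illustration.

```python
def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

print(value_range([1, 2, 3, 7, 9]))       # 8
print(value_range([1, 2, 3, 7, 9, 500]))  # 499
```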

slide-19
SLIDE 19

Standard deviation and variance

Generally discussed in the context of the mean

Deviation describes how far each data point lies from the mean $\bar{Y}$:

  • $Y_1 - \bar{Y},\ Y_2 - \bar{Y},\ Y_3 - \bar{Y},\ \ldots,\ Y_n - \bar{Y}$

Standard deviation of a sample

  • $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2}$

Variance

  • $s^2$
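
To make the formulas concrete, the sketch below computes the deviations, the sample variance, and the sample standard deviation step by step, then checks the result against Python's `statistics.stdev`. The data are the values from the Range slide, reused here as an assumed example; the course itself works in R.

```python
import math
import statistics

data = [1, 2, 3, 7, 9]  # assumed example values
n = len(data)
y_bar = sum(data) / n

# Deviation of each observation from the mean.
deviations = [y - y_bar for y in data]

# Sample variance: sum of squared deviations divided by n - 1.
variance = sum(d ** 2 for d in deviations) / (n - 1)

# Sample standard deviation: square root of the variance.
s = math.sqrt(variance)

print(f"mean={y_bar}, variance={variance:.2f}, s={s:.3f}")
print(f"statistics.stdev gives {statistics.stdev(data):.3f}")
```
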
slide-20
SLIDE 20

Interquartile range

Generally discussed in the context of the median

Quartiles divide the data into four equal parts (“quar”!)

Interquartile range (IQR) is the difference between the third and first quartile

  • How much of the data does the IQR encompass?

Five number summary: min, Q1, median, Q3, max

Example data: 1.25 1.64 1.91 2.31 2.37 2.38 2.84 2.87 2.93 2.94 2.98 3.00 3.09 3.22 3.41 3.55

[Figure: the sorted example values annotated with the first quartile, median, third quartile, and interquartile range]
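
A five-number summary for the example values can be sketched as below. Quartile conventions differ between textbooks and software; this Python sketch assumes the simple "median of each half" rule, which may not match the rule used for the slide's figure or by R's `quantile`. The helper name `five_number_summary` is hypothetical.

```python
import statistics

data = [1.25, 1.64, 1.91, 2.31, 2.37, 2.38, 2.84, 2.87,
        2.93, 2.94, 2.98, 3.00, 3.09, 3.22, 3.41, 3.55]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); assumes at least two values."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])   # median of the lower half
    q3 = statistics.median(ordered[-half:])  # median of the upper half
    return ordered[0], q1, statistics.median(ordered), q3, ordered[-1]

minimum, q1, median, q3, maximum = five_number_summary(data)
iqr = q3 - q1
inside = [v for v in data if q1 <= v <= q3]

print(f"min={minimum}, Q1={q1:.3f}, median={median:.3f}, Q3={q3:.3f}, max={maximum}")
print(f"IQR={iqr:.3f}; {len(inside)} of {len(data)} values lie between Q1 and Q3")
```

Under this convention, half of the observations fall between Q1 and Q3, which is one way to approach the question posed on the slide.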

slide-21
SLIDE 21

Mean or median?

The median is much more robust to outliers compared to the mean.

[Figure: example distributions with the mean marked]

Which would you choose for a symmetric distribution and why?
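
The robustness claim can be checked numerically. The sketch below, in Python for illustration, reuses the values from the Range slide and adds the single extreme observation 500; the variable names are assumptions.

```python
import statistics

without_outlier = [1, 2, 3, 7, 9]
with_outlier = [1, 2, 3, 7, 9, 500]

for label, values in [("without", without_outlier), ("with", with_outlier)]:
    mean = statistics.mean(values)
    median = statistics.median(values)
    print(f"{label} outlier: mean={mean}, median={median}")
```

Adding one extreme value moves the mean from 4.4 to 87, while the median only moves from 3 to 5.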

slide-22
SLIDE 22

Measures of variability

Coefficient of variation is the standard deviation of a sample expressed as a percentage of the sample mean (aka normalized)

  • $\mathrm{CoV} = \frac{s}{\bar{Y}} \times 100\%$

  • Useful measure for comparing variability between two differently-scaled datasets
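
A small sketch of why the coefficient of variation helps when scales differ: the same hypothetical heights are expressed in centimeters and in meters, so their standard deviations differ by a factor of 100 while the coefficient of variation is unchanged. The data and the helper name `coefficient_of_variation` are assumptions for illustration, written in Python rather than R.

```python
import statistics

def coefficient_of_variation(sample):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(sample) / statistics.mean(sample) * 100

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical measurements
heights_m = [h / 100 for h in heights_cm]         # same data, different scale

sd_cm, sd_m = statistics.stdev(heights_cm), statistics.stdev(heights_m)
cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)

print(f"sd: {sd_cm:.3f} cm vs {sd_m:.5f} m")
print(f"CoV: {cv_cm:.2f}% vs {cv_m:.2f}%")
```
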
slide-23
SLIDE 23

Sample vs population notation

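Measurement: sample estimate vs. population parameter

Mean

  • Sample estimate: $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$
  • Population parameter: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$

Standard deviation

  • Sample estimate: $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2}$
  • Population parameter: $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$

Variance

  • Sample estimate: $s^2$
  • Population parameter: $\sigma^2$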
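
The only computational difference between the two standard deviation formulas is the divisor: n − 1 for a sample, n for a population. As a hedged illustration in Python, `statistics.stdev` uses the sample form and `statistics.pstdev` the population form; the data values are hypothetical.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

s = statistics.stdev(data)       # sample form: divides by n - 1
sigma = statistics.pstdev(data)  # population form: divides by n

print(f"s = {s:.3f} (n - 1), sigma = {sigma:.3f} (n)")
```

For the same values, the sample form is always at least as large as the population form, and the gap shrinks as n grows.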

slide-24
SLIDE 24

Visualizing data

Different types of plots are used to represent different types of data

Continuous data

  • Histogram
  • Density plot
  • Boxplot
  • Violin plot

Discrete data

  • Bar plot

Comparing two continuous variables

  • Scatterplot

Trend over time

  • Line plot

slide-25
SLIDE 25

Histogram

[Figure: histogram; x-axis: Value, y-axis: Count]
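
A histogram is built by cutting the value axis into bins and counting the observations in each. The sketch below shows only that counting step, as a text histogram in Python; in the course this would normally be drawn with R plotting tools. The data, the bin width, and the helper name `histogram_counts` are assumptions.

```python
import math
from collections import Counter

def histogram_counts(values, bin_width, start):
    """Count values per bin [start + k*width, start + (k+1)*width)."""
    bins = Counter(math.floor((v - start) / bin_width) for v in values)
    return {start + k * bin_width: bins[k] for k in sorted(bins)}

# Hypothetical continuous measurements.
values = [12.1, 12.8, 13.4, 13.9, 14.2, 14.6, 14.9, 15.3, 15.5, 16.7, 17.2]
counts = histogram_counts(values, bin_width=1, start=12)

for left_edge, count in counts.items():
    print(f"{left_edge:>3}-{left_edge + 1:<3} {'#' * count}")
```

Note that this sketch omits empty bins, and that the choice of bin width can change how the distribution looks.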

slide-26
SLIDE 26

Using histograms to describe distributions

[Figure: four example histograms: uniform, bell-shaped, asymmetric (skewed), and bimodal]

slide-27
SLIDE 27

Density plots smoothen histograms

[Figure: a histogram (y-axis: count) alongside density plots of the same values (y-axis: density)]
slide-28
SLIDE 28

Boxplot

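Graphical representation of a five-number summary

“Whiskers” extend to the most extreme data points within 1.5 × IQR of the box (below Q1 and above Q3)

[Figure: annotated boxplot (axis: Value) labeling Q1, the median, Q3, the IQR, the “whiskers”, and outliers]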
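
The whisker rule can be sketched numerically: compute Q1, Q3, and the IQR, place fences 1.5 × IQR beyond the box, and treat anything outside the fences as an outlier. This Python sketch assumes the "median of each half" quartile convention, which plotting software may not share; the data are the values from the Range slide and `boxplot_stats` is a hypothetical helper.

```python
import statistics

def boxplot_stats(values):
    """Quartiles, whisker ends, and outliers; assumes at least two values."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[-half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": statistics.median(ordered),
        "q3": q3,
        "iqr": iqr,
        "whiskers": (inside[0], inside[-1]),  # most extreme non-outlier points
        "outliers": outliers,
    }

stats = boxplot_stats([1, 2, 3, 7, 9, 500])
print(stats)
```

Here the value 500 falls above the upper fence (9 + 1.5 × 7 = 19.5), so it would be drawn as an individual outlier point rather than stretching the whisker.
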
slide-29
SLIDE 29

Boxplots: The plot thickens*

[Figure: boxplots of a bimodal and a unimodal distribution (axis: Value) alongside their histograms (x-axis: Value, y-axis: Count)]

*Pun intended.

slide-30
SLIDE 30

What can we say about this distribution based on its boxplot?

[Figure: boxplot; axis: Value]

  • Symmetry? Asymmetric
  • Skewness? Right-skewed
  • Modality? Unclear

slide-31
SLIDE 31

Violin plot: Density meets boxplot

[Figure: violin plots, density plots, and boxplots of samples from N(5, 4), N(2, 1), and N(4, 0.09)]

slide-32
SLIDE 32

Barplot

[Figure: bar plot titled “Flowers in garden”; x-axis: Flower color (orange, pink, red, white), y-axis: Count]

slide-33
SLIDE 33

Cautionary tale in barplots

http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002128

slide-34
SLIDE 34

Scatterplot

[Figure: two scatterplots with axes labeled Variable 1 and Variable 2]

  • Explanatory/independent variable (conventionally on the x-axis)
  • Response/dependent variable (conventionally on the y-axis)

slide-35
SLIDE 35

Time series data

[Figure: line plots of Value (y-axis) against Year (x-axis)]

slide-36
SLIDE 36

BREAK