SLIDE 1

SUMMARY STATISTICS

INTRODUCTION TO DATA ANALYSIS

SLIDE 2

FINAL EXAM

▸ Friday February 7 2020 ::: 4-8pm
▸ 66/E33 & 66/E34
▸ no class at noon on that day

SLIDE 3

HOW (NOT) TO PERFORM OPTIMALLY IN THIS COURSE

▸ use the script, not the slides
▸ individual practice at home is essential

SLIDE 4

LEARNING GOALS

▸ understand what a “summary statistic” is
▸ understand and be able to compute the following:
  ▸ counts and frequencies for categorical data
  ▸ measures of central tendency: mean, mode & median
  ▸ measures of dispersion: variance, standard deviation & quantiles
  ▸ bootstrapped confidence intervals for an estimate
  ▸ covariance & correlation

SLIDE 5

SUMMARY STATISTICS

▸ usually: what we analyze ≠ what we actually measured
  ▸ data observations are always already interpreted abstractions over a much richer reality
  ▸ e.g., we record whether a coin landed heads or tails, not where it landed
▸ summary statistic: a single number that represents one aspect of the data
  ▸ useful for communication about / understanding of the data at hand
  ▸ e.g., counting observations of a particular type / calculating the mean of some numeric observations

SLIDE 7

BIO-LOGIC JAZZ-METAL

▸ 102 participants from this course [THANKS FOR DOING THIS!]
▸ everybody got three 2-alternative forced-choice questions (in random order):

“If you have to choose between the following two options, which one do you prefer?”

1. Biology vs Logic
2. Jazz vs Metal
3. Mountains vs Beach

▸ no sane person would defend serious scientific hypotheses about this study, but the lecturer conjectures irresponsibly that a certain musical taste may be correlated with a particular preference for academic subjects

SLIDE 8

INSPECTING THE DATA

participant with ID 379 prefers:

beaches over mountains
logic over biology
metal over jazz

SLIDE 9

COUNTING OBSERVATIONS

▸ functions `n`, `count`, and `tally` from `dplyr` package
  ▸ caveats:
    ▸ different versions of `dplyr` package implement `count` differently
    ▸ several packages define a `count` function; use `dplyr::count` explicitly to be sure
▸ functions `table` and `prop.table` from base R
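A minimal sketch of the base R route, assuming a small made-up vector of answers rather than the course data:

```r
# Hypothetical answers to the "Jazz vs Metal" question (illustrative only).
music <- c("Jazz", "Metal", "Metal", "Jazz", "Metal")

# table() counts how often each distinct value occurs.
counts <- table(music)

# prop.table() converts those counts into proportions that sum to 1.
proportions <- prop.table(counts)

print(counts)
print(proportions)
```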

SLIDE 10

COUNTING OBSERVATIONS

▸ `n` works only inside `dplyr` verbs such as `mutate` and `summarize`
▸ `n` essentially counts rows (useful after grouping!)
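As a rough illustration, the following sketch calls `n()` with and without grouping. The tibble `choices` and its values are made up for this example and are not the course data.

```r
library(dplyr)

# Hypothetical choice data (illustrative only).
choices <- tibble(
  participant = c(1, 2, 3, 4, 5),
  music       = c("Jazz", "Metal", "Metal", "Jazz", "Metal")
)

# Without grouping, n() returns the total number of rows.
total <- choices %>%
  summarize(n = n())

# After grouping, n() counts the rows within each group.
per_group <- choices %>%
  group_by(music) %>%
  summarize(n = n())

print(total)
print(per_group)
```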

SLIDE 11

COUNTING OBSERVATIONS

▸ `count` and `tally` are wrappers around `n`
▸ `count` implicitly groups/ungroups
▸ `tally` does not tinker with existing grouping
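The difference might be sketched as follows, again on a made-up tibble (`choices` is an assumption for illustration):

```r
library(dplyr)

# Hypothetical choice data (illustrative only).
choices <- tibble(
  music   = c("Jazz", "Metal", "Metal", "Jazz", "Metal"),
  subject = c("Logic", "Logic", "Biology", "Biology", "Logic")
)

# count() groups by the named column, counts rows, and ungroups again.
by_count <- choices %>%
  dplyr::count(music)

# tally() counts rows within whatever grouping is already in place.
by_tally <- choices %>%
  group_by(music) %>%
  tally()

print(by_count)
print(by_tally)
```

Both calls should give the same counts here; they differ only in who is responsible for the grouping.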

SLIDE 12

COUNTS OF CHOICE PAIRS

SLIDE 13

PROPORTIONS OF CHOICE PAIRS
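One possible way to get from counts of choice pairs to proportions, sketched on made-up data (the tibble `choices` is hypothetical and does not reproduce the survey results):

```r
library(dplyr)

# Hypothetical choice data (illustrative only).
choices <- tibble(
  music   = c("Jazz", "Metal", "Metal", "Jazz", "Metal", "Metal"),
  subject = c("Logic", "Logic", "Biology", "Biology", "Logic", "Logic")
)

pair_props <- choices %>%
  dplyr::count(music, subject) %>%     # one row per observed pair, count in column n
  mutate(proportion = n / sum(n))      # share of all participants per pair

print(pair_props)
```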

SLIDE 14

SLIDE 15

MEASURES OF CENTRAL TENDENCY & DISPERSION

▸ central tendency: where is “the center” of the data observations?
▸ dispersion: how far are values distributed around “the center”?

SLIDE 16

AVOCADO DATA

▸ data released by Hass Avocado Board (plucked from kaggle)

SLIDE 17

MEAN

SLIDE 18

MEAN :: EXAMPLE

SLIDE 19

CALCULATING THE MEAN IN R
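A minimal sketch, assuming a short vector of made-up prices rather than the avocado data: the mean is the sum of the observations divided by their number, and base R's `mean` computes the same quantity.

```r
# Hypothetical prices (illustrative values only).
prices <- c(1.10, 1.35, 0.98, 1.62, 1.45)

# The mean by hand: sum of all observations divided by their number.
mean_by_hand <- sum(prices) / length(prices)

# The same computation with the built-in function.
mean_builtin <- mean(prices)

print(mean_by_hand)
print(mean_builtin)
```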

SLIDE 20

EXCURSION :: MEAN AS EXPECTED VALUE

▸ the mean can also be conceptualized as the value you would expect to gain when you sample once from the observed data
▸ useful later to link this to the expected value of a random variable (but not important right now)
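To make this concrete, the sketch below weights each distinct value by its relative frequency, i.e., by the probability of drawing it in a single uniform draw from the data. The vector `x` is made up for illustration.

```r
# Hypothetical observations (illustrative only).
x <- c(2, 2, 3, 5, 5, 5)

# Relative frequency of each distinct value = probability of drawing it
# when sampling once, uniformly at random, from the observed data.
rel_freq <- prop.table(table(x))
values   <- as.numeric(names(rel_freq))

# Expected value: each value weighted by its probability.
expected_value <- sum(values * as.vector(rel_freq))

print(expected_value)
print(mean(x))
```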

SLIDE 21

MEDIAN

SLIDE 22

MEDIAN :: EXAMPLE

SLIDE 23

CALCULATING THE MEDIAN IN R
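A minimal sketch with made-up vectors: base R's `median` returns the middle value of the sorted data and averages the two middle values when the number of observations is even.

```r
# Hypothetical observations (illustrative only).
odd_sample  <- c(7, 1, 3)       # sorted: 1 3 7    -> middle value is 3
even_sample <- c(7, 1, 3, 10)   # sorted: 1 3 7 10 -> average of 3 and 7

median_odd  <- median(odd_sample)
median_even <- median(even_sample)

print(median_odd)
print(median_even)
```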

SLIDE 24

MEAN VS MEDIAN

▸ mean is more susceptible to outliers
▸ choice of mean vs. median is great for manipulation:
  ▸ “How to mislead with statistics”
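The sensitivity to outliers can be seen in a small made-up example (the income figures are hypothetical): one extreme value moves the mean a lot and the median only slightly.

```r
# Hypothetical incomes in thousands (illustrative only).
incomes      <- c(30, 32, 35, 38, 40)
with_outlier <- c(incomes, 1000)   # add a single extreme observation

mean_plain     <- mean(incomes)          # 35
median_plain   <- median(incomes)        # 35
mean_outlier   <- mean(with_outlier)     # 1175 / 6, roughly 195.8
median_outlier <- median(with_outlier)   # 36.5

print(c(mean_plain, median_plain, mean_outlier, median_outlier))
```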

SLIDE 25

MODE

▸ the mode is the value that occurred most frequently in the data
▸ often not applicable to metric data (where each measurement, if fine-grained enough, occurs only once)
▸ good for nominal and ordinal measures
▸ there is no built-in function in R to calculate the mode
  ▸ caveat: function `mode` exists but is unrelated (it reports an object's storage mode)
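Since base R has no function for this, one might write a small helper. The name `most_frequent` is hypothetical, and this is only one of several reasonable ways to do it; with ties it returns all tied values, as character strings.

```r
# Hypothetical helper (not part of base R): the most frequent value(s) of x.
most_frequent <- function(x) {
  counts <- table(x)                                 # frequency of each value
  names(counts)[as.vector(counts) == max(counts)]    # value(s) with the top count
}

# Hypothetical answers (illustrative only).
music <- c("Jazz", "Metal", "Metal", "Jazz", "Metal")

print(most_frequent(music))

# The built-in mode() answers a different question: the storage mode of an object.
print(mode(music))
```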

SLIDE 26

VARIANCE

SLIDE 27

VARIANCE :: EXAMPLE

SLIDE 28

VARIANCE :: EXAMPLE

SLIDE 29

VARIANCE :: BIASED AND UNBIASED ESTIMATORS

▸ biased estimator (unless mean is known)
▸ unbiased estimator (if mean is estimated from data as well)
▸ R’s built-in function `var` calculates the unbiased estimator!
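A minimal sketch of the two estimators on a made-up vector, assuming the usual definitions: both average the squared deviations from the sample mean, dividing by n in the biased case and by n - 1 in the unbiased case.

```r
# Hypothetical observations (illustrative only).
x <- c(2, 4, 4, 6, 9)
n <- length(x)

# Sum of squared deviations from the sample mean.
sum_sq <- sum((x - mean(x))^2)

var_biased   <- sum_sq / n         # divides by n
var_unbiased <- sum_sq / (n - 1)   # divides by n - 1, which is what var() does

print(var_biased)
print(var_unbiased)
print(var(x))
```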

SLIDE 30

STANDARD DEVIATION

SLIDE 31

VARIANCE & STANDARD DEVIATION :: EXAMPLE
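As a small illustration on made-up numbers: the standard deviation is the square root of the variance, and base R's `sd` is based on the same unbiased variance as `var`.

```r
# Hypothetical observations (illustrative only).
x <- c(2, 4, 4, 6, 9)

variance   <- var(x)          # unbiased variance
sd_by_hand <- sqrt(variance)  # standard deviation as its square root
sd_builtin <- sd(x)

print(variance)
print(sd_by_hand)
print(sd_builtin)
```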

SLIDE 32

QUANTILE

▸ the k% quantile is a value such that k% of the data are smaller
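A minimal sketch using base R's `quantile` on made-up data. Note that `quantile` interpolates between observations by default, so the returned value need not be an observed data point.

```r
# Hypothetical observations: the integers 1 to 100 (illustrative only).
x <- 1:100

# 25%, 50% and 75% quantiles.
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
print(q)

# Share of the data lying below the 25% quantile.
share_below <- mean(x < q[["25%"]])
print(share_below)
```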

SLIDE 33

CONFIDENCE ESTIMATES VIA BOOTSTRAPPING

▸ variance & standard deviation tell us how far the data spread around the mean
▸ they do not tell us how good our estimate of the mean is
▸ we can use bootstrapping, a special instance of resampling methods, for this purpose

SLIDE 34

BOOTSTRAPPING 95 % CONFIDENCE INTERVALS FOR THE MEAN

SLIDE 35

BOOTSTRAPPING 95 % CONFIDENCE INTERVALS

[Figure: original data; resample 1, resample 2; collected measures of interest for each resample]
Fish: Water vector created by brgfx - www.freepik.com

SLIDE 36

BOOTSTRAPPING 95 % CONFIDENCE INTERVALS

[Figure: original data; 95% bootstrapped CI]

SLIDE 37

BOOTSTRAPPING IN R

[Panels: full data example; partial data example]
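The procedure might be sketched as follows, using the percentile method (one common way to read a confidence interval off the resampled means). The data vector, the seed, and the number of resamples are assumptions made for this illustration.

```r
set.seed(2020)  # arbitrary seed so that the sketch is reproducible

# Hypothetical sample (illustrative only).
x <- c(1.10, 1.35, 0.98, 1.62, 1.45, 1.20, 1.75, 1.05, 1.30, 1.50)

n_resamples <- 1000  # assumed number of resamples

# Resample with replacement (same size as the data) and record the mean each time.
boot_means <- replicate(
  n_resamples,
  mean(sample(x, size = length(x), replace = TRUE))
)

# The 2.5% and 97.5% quantiles of the resampled means bound the 95% interval.
ci <- quantile(boot_means, probs = c(0.025, 0.975))

print(mean(x))
print(ci)
```

A smaller data set will usually give a wider interval, since each resampled mean then varies more.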

SLIDE 38

NESTED TIBBLES FOR GROUP SUMMARIES

SLIDE 39

NESTING TABLES

SLIDE 40

UNNESTING NESTED TABLES
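A rough sketch of nesting, summarizing per group, and unnesting again, assuming the `tidyr` and `purrr` packages alongside `dplyr`. The tibble `prices` and its values are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical prices by type (illustrative only).
prices <- tibble(
  type  = c("conventional", "conventional", "organic", "organic"),
  price = c(1.0, 1.2, 1.6, 1.8)
)

# nest(): one row per group; the remaining columns move into the list-column `data`.
nested <- prices %>%
  group_by(type) %>%
  nest() %>%
  ungroup()

# Compute a summary for each nested table and store it next to the data.
summarized <- nested %>%
  mutate(mean_price = map_dbl(data, ~ mean(.x$price)))

# unnest(): expand the list-column back into ordinary rows.
unnested <- summarized %>%
  unnest(cols = data)

print(summarized)
print(unnested)
```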

SLIDE 41

COVARIANCE

▸ covariance measures the degree to which two associated measurements show similar deviation from their respective means

$$\mathrm{Cov}(\vec{x}, \vec{y}) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_{\vec{x}}) (y_i - \mu_{\vec{y}})$$

SLIDE 42

COVARIANCE :: EXAMPLE

$$\mathrm{Cov}(\vec{x}, \vec{y}) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_{\vec{x}}) (y_i - \mu_{\vec{y}})$$
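A minimal sketch that evaluates this formula by hand and compares it with base R's `cov`, which also divides by n - 1. The vectors `size` and `weight` are made-up values for illustration.

```r
# Hypothetical paired measurements (illustrative only).
size   <- c(10, 12, 15, 18, 20)
weight <- c(100, 130, 160, 170, 210)

n <- length(size)

# Sum of products of paired deviations from the respective means, over n - 1.
cov_by_hand <- sum((size - mean(size)) * (weight - mean(weight))) / (n - 1)

cov_builtin <- cov(size, weight)

print(cov_by_hand)
print(cov_builtin)
```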

SLIDE 43 Maria Pershina • Jona Carmon

[Figure: size vs. weight, marked with “?”]

SLIDE 44 Maria Pershina • Jona Carmon 24.11.2019

[Figure: size vs. weight]

SLIDE 45

[Figure: size vs. weight, with $\bar{x}$ marked and squared deviations drawn]

$$\text{covariance} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N}$$

$$\text{variance} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}$$

SLIDE 46

[Figure: size vs. weight, with $\bar{x}$ and $\bar{y}$ marked]

$$\text{covariance} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N}$$

SLIDE 47

[Figure: size vs. weight, with $\bar{x}$ and $\bar{y}$ marked; quadrants labeled −, +, +, −]

$$\text{covariance} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N}$$

SLIDE 48

COVARIANCE :: INTERPRETATION

▸ summands are positive when $x_i$ and $y_i$ deviate “in the same direction” from their respective means
▸ positive (negative) covariance therefore reflects an overall tendency that the higher $x_i$, the higher (lower) $y_i$
▸ this is a descriptive property of the data, not an evidential indicator of a causal relation

$$\mathrm{Cov}(\vec{x}, \vec{y}) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_{\vec{x}}) (y_i - \mu_{\vec{y}})$$

SLIDE 49

COVARIANCE :: SCALE VARIANCE

▸ covariance is not invariant under positive linear transformation
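A small made-up example of this scale dependence: expressing the same prices in cents instead of euros multiplies the covariance by 100, although the relation between the two variables has not changed. The vectors are hypothetical.

```r
# Hypothetical prices and sales volumes (illustrative only).
price_eur <- c(1.0, 1.2, 1.6, 1.8)
volume    <- c(400, 350, 300, 200)

# The same prices in a different unit (a positive linear transformation).
price_cent <- price_eur * 100

cov_eur  <- cov(price_eur, volume)
cov_cent <- cov(price_cent, volume)

print(cov_eur)
print(cov_cent)   # 100 times cov_eur
```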

SLIDE 50

PRODUCT-MOMENT CORRELATION

▸ Bravais-Pearson product-moment correlation coefficient is defined as covariance standardized by std. deviations

$$r_{\vec{x}\vec{y}} = \frac{\mathrm{Cov}(\vec{x}, \vec{y})}{\mathrm{SD}(\vec{x}) \, \mathrm{SD}(\vec{y})}$$
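A minimal sketch that applies this definition directly and compares the result with base R's `cor` (Pearson correlation by default). The data are made up for illustration.

```r
# Hypothetical prices and sales volumes (illustrative only).
price  <- c(1.0, 1.2, 1.6, 1.8)
volume <- c(400, 350, 300, 200)

# Covariance standardized by the two standard deviations.
r_by_hand <- cov(price, volume) / (sd(price) * sd(volume))

r_builtin <- cor(price, volume)

print(r_by_hand)
print(r_builtin)
```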

SLIDE 51

CORRELATION :: EXAMPLE

▸ correlation is invariant under positive linear transformation
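This can be checked on made-up data: rescaling and shifting one variable with a positive factor leaves the correlation unchanged. The vectors and the transformation are assumptions for illustration.

```r
# Hypothetical prices and sales volumes (illustrative only).
price_eur <- c(1.0, 1.2, 1.6, 1.8)
volume    <- c(400, 350, 300, 200)

# Positive linear transformation: change of unit plus a constant offset.
price_transformed <- price_eur * 100 + 5

r_original    <- cor(price_eur, volume)
r_transformed <- cor(price_transformed, volume)

print(r_original)
print(r_transformed)
```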

SLIDE 52

CORRELATION :: EXAMPLE

▸ negative correlation indicates an overall negative association: the higher total-volume-sold, the lower the average price

SLIDE 53

CORRELATION :: PROPERTIES & INTERPRETATION

▸ r lies in [-1; 1]
▸ r = 0 indicates no linear correlation
▸ r = 1 indicates perfect positive correlation
▸ r = -1 indicates perfect negative correlation
▸ r >= 0.5 suggests noteworthy (pos.) correlation
▸ r <= -0.5 suggests noteworthy (neg.) correlation
▸ r² also interpretable as “variance explained” in a regression model (later)