SLIDE 1

SUMMARY STATISTICS

INTRODUCTION TO DATA ANALYSIS

SLIDE 2

FINAL EXAM

▸ Friday, February 7, 2020 ::: 4-8pm
▸ 66/E33 & 66/E34
▸ no class at noon on that day

SLIDE 3

HOW (NOT) TO PERFORM OPTIMALLY IN THIS COURSE

▸ use the script, not the slides
▸ individual practice at home is essential

SLIDE 4

LEARNING GOALS

▸ understand what a “summary statistic” is
▸ understand and be able to compute the following:
  ▸ counts and frequencies for categorical data
  ▸ measures of central tendency: mean, mode & median
  ▸ measures of dispersion: variance, standard deviation & quantiles
  ▸ bootstrapped confidence intervals for an estimate
  ▸ covariance & correlation

SLIDE 5

SUMMARY STATISTICS

▸ usually: what we analyze ≠ what we actually measured
▸ data observations are always already interpreted abstractions over a much richer reality
  ▸ e.g., we record whether a coin landed heads or tails, not where it landed
▸ summary statistic: a single number that represents one aspect of the data
  ▸ useful for communication about / understanding of the data at hand
  ▸ e.g., counting observations of a particular type / calculating the mean of some numeric observations

SLIDE 7

BIO-LOGIC JAZZ-METAL

▸ 102 participants from this course [THANKS FOR DOING THIS!]
▸ everybody got three 2-alternative forced-choice questions (in random order):

“If you have to choose between the following two options, which one do you prefer?”

1. Biology vs Logic
2. Jazz vs Metal
3. Mountains vs Beach

▸ no sane person would defend serious scientific hypotheses about this study, but the lecturer conjectures irresponsibly that a certain musical taste may be correlated with a particular preference for academic subjects

SLIDE 8

INSPECTING THE DATA

participant with ID 379 prefers:

  • beaches over mountains
  • logic over biology
  • metal over jazz

SLIDE 9

COUNTING OBSERVATIONS

▸ functions `n`, `count`, and `tally` from the `dplyr` package
  ▸ caveats:
    ▸ different versions of the `dplyr` package implement `count` differently
    ▸ several packages define a `count` function; use `dplyr::count` explicitly to be sure
▸ functions `table` and `prop.table` from base R
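
As a minimal sketch of the base R route, the block below counts a categorical variable with `table` and turns the counts into proportions with `prop.table`. The vector `music` is made-up illustration data, not the course survey.

```r
# Hypothetical answers of six participants to the "Jazz vs Metal" question
music <- c("Jazz", "Metal", "Metal", "Jazz", "Metal", "Metal")

# absolute counts per category
counts <- table(music)

# relative frequencies (proportions that sum to 1)
proportions <- prop.table(counts)

print(counts)
print(proportions)
```

Passing two vectors to `table` gives a cross-tabulation, which is one way to get at counts of choice pairs.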

SLIDE 10

COUNTING OBSERVATIONS

▸ `n` works only in `mutate` and `summarize`
▸ `n` essentially counts rows (useful after grouping!)

SLIDE 11

COUNTING OBSERVATIONS

▸ `count` and `tally` are wrappers around `n`
▸ `count` implicitly groups/ungroups
▸ `tally` does not tinker with existing grouping
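
The three counting functions can be compared side by side. This is a sketch that assumes a reasonably recent `dplyr` (1.0.0 or later, for the `.groups` argument); the tibble `choices` is hypothetical illustration data.

```r
library(dplyr)

# Hypothetical miniature version of the survey data
choices <- tibble(
  subject = c("Logic", "Logic", "Biology", "Logic", "Biology"),
  music   = c("Metal", "Jazz",  "Jazz",    "Metal", "Jazz")
)

# n() counts rows per group inside summarize()
by_n <- choices %>%
  group_by(subject, music) %>%
  summarize(n = n(), .groups = "drop")

# count() groups, counts and ungroups in one step
by_count <- choices %>%
  dplyr::count(subject, music)

# tally() counts rows within whatever grouping already exists
by_tally <- choices %>%
  group_by(subject, music) %>%
  tally()

print(by_count)
```

All three produce the same counts here; they differ in whether the caller has to set up the grouping and in what grouping remains on the result.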

SLIDE 12

COUNTS OF CHOICE PAIRS

SLIDE 13

PROPORTIONS OF CHOICE PAIRS

SLIDE 14

SLIDE 15

MEASURES OF CENTRAL TENDENCY & DISPERSION

▸ central tendency: where is “the center” of the data observations
▸ dispersion: how far are values distributed around “the center”

SLIDE 16

AVOCADO DATA

▸ data released by the Hass Avocado Board (plucked from Kaggle)

SLIDE 17

MEAN

SLIDE 18

MEAN :: EXAMPLE

SLIDE 19

CALCULATING THE MEAN IN R

SLIDE 20

EXCURSION :: MEAN AS EXPECTED VALUE

▸ the mean can also be conceptualized as the value you would expect to gain when you sample once from the observed data
▸ useful later to link this to the expected value of a random variable (but not important right now)
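
The expected-value reading of the mean can be checked numerically. The sketch below weights each distinct value by its relative frequency and compares the result with `mean`; the vector `x` is made up for illustration.

```r
# Hypothetical sample of observed values
x <- c(2, 2, 3, 5, 5, 5)

# arithmetic mean
mean_x <- mean(x)

# the same number as an "expected value": each distinct value is
# weighted by its relative frequency in the data
freqs <- prop.table(table(x))
expected_x <- sum(as.numeric(names(freqs)) * as.numeric(freqs))

print(c(mean = mean_x, expected = expected_x))
```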

SLIDE 21

MEDIAN

SLIDE 22

MEDIAN :: EXAMPLE

SLIDE 23

CALCULATING THE MEDIAN IN R

SLIDE 24

MEAN VS MEDIAN

▸ mean is more susceptible to outliers
▸ choice of mean vs. median is great for manipulation:
  ▸ “How to mislead with statistics”
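
A small illustration of the outlier point, using made-up numbers: adding one extreme value moves the mean a lot and the median only slightly.

```r
# Hypothetical measurements without and with one extreme value
incomes      <- c(30, 32, 35, 38, 40)
with_outlier <- c(incomes, 1000)

# without the outlier, mean and median coincide
mean_plain   <- mean(incomes)
median_plain <- median(incomes)

# with the outlier, the mean jumps while the median barely changes
mean_out   <- mean(with_outlier)
median_out <- median(with_outlier)

print(c(mean_plain, median_plain, mean_out, median_out))
```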

SLIDE 25

MODE

▸ the mode is the value that occurred most frequently in the data
▸ often not applicable to metric data (where each measurement, if fine-grained enough, occurs only once)
▸ good for nominal and ordinal measures
▸ there is no built-in function in R to calculate the mode
  ▸ caveat: function `mode` exists but is unrelated (it reports the storage mode of an object)
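
Since base R has no function for the statistical mode, one possible helper is sketched below. The name `stat_mode` is hypothetical, and returning all tied values is a design choice made here, not something the slides prescribe.

```r
# Hypothetical helper (not part of base R): returns the most frequent
# value(s) as character; ties yield more than one value
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[as.vector(counts) == max(counts)]
}

music <- c("Jazz", "Metal", "Metal", "Jazz", "Metal")  # made-up data

most_frequent <- stat_mode(music)

# base R's mode() answers a different question: the storage mode
storage <- mode(music)

print(most_frequent)
print(storage)
```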

SLIDE 26

VARIANCE

SLIDE 27

VARIANCE :: EXAMPLE

SLIDE 28

VARIANCE :: EXAMPLE

SLIDE 29

VARIANCE :: BIASED AND UNBIASED ESTIMATORS

▸ biased estimator (unless the true mean is known): divides the sum of squared deviations by n
▸ unbiased estimator (if the mean is estimated from the data as well): divides by n − 1
▸ R’s built-in function `var` calculates the unbiased estimator!
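
The difference between the two estimators can be made concrete with a short sketch. The data are made up; the variable names `var_biased` and `var_unbiased` are illustrative only.

```r
x <- c(1, 2, 3, 4, 5)   # hypothetical data
n <- length(x)

# sum of squared deviations from the sample mean
ss <- sum((x - mean(x))^2)

var_biased   <- ss / n         # divides by n
var_unbiased <- ss / (n - 1)   # divides by n - 1

# var() matches the n - 1 version; sd() is its square root
print(c(var_biased, var_unbiased, var(x), sd(x)))
```

For large samples the two versions are nearly identical; the difference matters mostly for small n.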

SLIDE 30

STANDARD DEVIATION

SLIDE 31

VARIANCE & STANDARD DEVIATION :: EXAMPLE

SLIDE 32

QUANTILE

▸ the k% quantile is a value such that k% of the data are smaller than it
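
In R, quantiles can be obtained with `quantile`. The sketch below uses made-up data and the function's default settings; note that `quantile` offers several slightly different definitions via its `type` argument, so results on small samples can vary with that choice.

```r
x <- 1:100   # hypothetical data

# 25%, 50% and 75% quantiles (default quantile type)
q <- quantile(x, probs = c(0.25, 0.5, 0.75))

# share of observations below the 25% quantile
share_below <- mean(x < q[["25%"]])

print(q)
print(share_below)
```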

SLIDE 33

CONFIDENCE ESTIMATES VIA BOOTSTRAPPING

▸ variance & standard deviation tell us how far around the mean the data dwells
▸ they do not tell us how good our estimate of the mean is
▸ we can use bootstrapping, a special instance of resampling methods, for this purpose

SLIDE 34

BOOTSTRAPPING 95 % CONFIDENCE INTERVALS FOR THE MEAN

SLIDE 35

BOOTSTRAPPING 95 % CONFIDENCE INTERVALS

[Figure: the original data is resampled repeatedly (resample 1, resample 2, ...); the measure of interest is collected for each resample. Fish: Water vector created by brgfx - www.freepik.com]

SLIDE 36

BOOTSTRAPPING 95 % CONFIDENCE INTERVALS

[Figure: original data with the 95% bootstrapped CI]

SLIDE 37

BOOTSTRAPPING IN R

[Panels: full data example; partial data example]
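
One way to write the bootstrap procedure in base R is sketched below. The function name `bootstrapped_CI`, its arguments, and the simulated data are assumptions made for illustration; the original slide's code did not survive in this transcript.

```r
set.seed(1234)   # make the resampling reproducible

# Hypothetical data: 200 simulated measurements
x <- rnorm(200, mean = 10, sd = 2)

# Hypothetical helper: percentile bootstrap for the mean
bootstrapped_CI <- function(data_vector, n_resample = 1000) {
  # resample with replacement and record the mean of each resample
  resampled_means <- replicate(
    n_resample,
    mean(sample(data_vector, size = length(data_vector), replace = TRUE))
  )
  # the middle 95% of the resampled means form the interval
  c(
    lower = unname(quantile(resampled_means, 0.025)),
    mean  = mean(data_vector),
    upper = unname(quantile(resampled_means, 0.975))
  )
}

ci <- bootstrapped_CI(x)
print(ci)
```

A smaller data set would typically give a wider interval, which is presumably what a comparison of full and partial data illustrates.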

SLIDE 38

NESTED TIBBLES FOR GROUP SUMMARIES

SLIDE 39

NESTING TABLES

SLIDE 40

UNNESTING NESTED TABLES

SLIDE 41

COVARIANCE

▸ covariance measures the degree to which two associated measurements show similar deviation from their respective means

$$\mathrm{Cov}(\vec{x}, \vec{y}) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_{\vec{x}}) (y_i - \mu_{\vec{y}})$$

SLIDE 42

COVARIANCE :: EXAMPLE

$$\mathrm{Cov}(\vec{x}, \vec{y}) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_{\vec{x}}) (y_i - \mu_{\vec{y}})$$

SLIDE 43 Maria Pershina • Jona Carmon

[Figure: size vs. weight, annotated with “?”]

SLIDE 44 Maria Pershina • Jona Carmon 24.11.2019

[Figure: size vs. weight]

SLIDE 45

[Figure: size vs. weight, with the mean $\bar{x}$ marked]

$$\text{covariance} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N}$$

$$\text{variance} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}$$

SLIDE 46

[Figure: size vs. weight, with the means $\bar{x}$ and $\bar{y}$ marked]

$$\text{covariance} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N}$$

SLIDE 47

[Figure: size vs. weight, with the means $\bar{x}$ and $\bar{y}$ marked; the four regions around the means are labeled −, +, +, −]

$$\text{covariance} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N}$$

SLIDE 48

COVARIANCE :: INTERPRETATION

▸ summands are positive when $x_i$ and $y_i$ deviate “in the same direction” from their respective means
▸ positive (negative) covariance therefore reflects an overall tendency that the higher $x_i$, the higher (lower) $y_i$
▸ this is a descriptive property of the data, not an evidential indicator of a causal relation

$$\mathrm{Cov}(\vec{x}, \vec{y}) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_{\vec{x}}) (y_i - \mu_{\vec{y}})$$
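
The covariance formula with the n − 1 denominator can be spelled out term by term and compared with R's `cov`. The `size` and `weight` values are made up for illustration.

```r
# Hypothetical paired measurements
size   <- c(1, 2, 3, 4, 5)
weight <- c(2, 4, 5, 4, 5)
n <- length(size)

# one summand per observation: product of the two deviations
summands <- (size - mean(size)) * (weight - mean(weight))

# positive summands: both values deviate in the same direction
cov_manual <- sum(summands) / (n - 1)

print(summands)
print(c(manual = cov_manual, builtin = cov(size, weight)))
```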

SLIDE 49

COVARIANCE :: SCALE VARIANCE

▸ covariance is not invariant under positive linear transformation
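
A quick check of this scale dependence, with made-up numbers: expressing one variable in different units rescales the covariance by the same factor.

```r
# Hypothetical data: body size in metres, weight in kilograms
size_m <- c(1.60, 1.70, 1.80, 1.90)
weight <- c(55, 65, 72, 85)

# the same sizes expressed in centimetres
size_cm <- 100 * size_m

cov_m  <- cov(size_m, weight)
cov_cm <- cov(size_cm, weight)

# the covariance grows by the factor 100 although the data are the same
print(c(metres = cov_m, centimetres = cov_cm))
```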

SLIDE 50

PRODUCT-MOMENT CORRELATION

▸ the Bravais-Pearson product-moment correlation coefficient is defined as covariance standardized by standard deviations

$$r_{\vec{x}\vec{y}} = \frac{\mathrm{Cov}(\vec{x}, \vec{y})}{\mathrm{SD}(\vec{x}) \, \mathrm{SD}(\vec{y})}$$
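
The definition translates directly into R. The sketch below, on made-up data, computes the coefficient from `cov` and `sd` and compares it with the built-in `cor`.

```r
# Hypothetical paired measurements
size   <- c(1, 2, 3, 4, 5)
weight <- c(2, 4, 5, 4, 5)

# covariance standardized by the two standard deviations
r_manual <- cov(size, weight) / (sd(size) * sd(weight))

# cor() computes the Pearson coefficient by default
r_builtin <- cor(size, weight)

print(c(manual = r_manual, builtin = r_builtin))
```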

SLIDE 51

CORRELATION :: EXAMPLE

▸ correlation is invariant under positive linear transformation
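
To illustrate the invariance, the sketch below applies a positive linear transformation to one variable and recomputes the correlation. The volume and price values are made up and are not taken from the avocado data.

```r
# Hypothetical data: volume sold and average price in dollars
volume <- c(10, 20, 30, 40, 50)
price  <- c(1.8, 1.6, 1.5, 1.1, 1.0)

# positive linear transformation: price in cents plus a fixed fee
price_cents <- 100 * price + 5

r_dollars <- cor(volume, price)
r_cents   <- cor(volume, price_cents)

# both coefficients agree, and both are negative for these numbers
print(c(dollars = r_dollars, cents = r_cents))
```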

SLIDE 52

CORRELATION :: EXAMPLE

▸ negative correlation indicates an overall negative association: the higher total-volume-sold, the lower the average price

SLIDE 53

CORRELATION :: PROPERTIES & INTERPRETATION

▸ r lies in [-1;1]
▸ r = 0 indicates no (linear) correlation at all
▸ r = 1 indicates perfect positive correlation
▸ r = -1 indicates perfect negative correlation
▸ r >= 0.5 suggests noteworthy (pos.) correlation
▸ r <= -0.5 suggests noteworthy (neg.) correlation
▸ r² is also interpretable as “variance explained” in a regression model (later)
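
As a preview of the regression connection, the sketch below compares the squared correlation with the R² reported by a simple linear model fitted with `lm`. The data are made up for illustration.

```r
# Hypothetical, almost linear data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

r <- cor(x, y)

# simple linear regression of y on x
fit <- lm(y ~ x)
r_squared <- summary(fit)$r.squared

# for a single predictor, r^2 and the model's R^2 coincide
print(c(r = r, r_sq = r^2, model_r_sq = r_squared))
```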