Data Science in the Wild, Spring 2019
Eran Toch
Data Science in the Wild, Lecture 7: Analyzing Experiments
Agenda
1. Statistical Tests and the t-Test
2. Running the t-Test
3. t-Test assumptions
4. Analyzing Inferential Statistics
5.
[Figures: Form 1 vs. Form 2 for the Control and Treatment groups, shown on axes from 16.3 to 18.3 and from 16.5 to 18.5]
There are two types of errors one can make in statistical hypothesis testing:
Type I error (false positive): too confident
Type II error (false negative): cowards
William S. Gosset
[Figure: two fields, A and B]
How can we infer a difference in the yield of two fields from the samples alone?
[Figure: distributions of Value for samples A and B, with sample means X̄A and X̄B]
$$\frac{\text{Signal}}{\text{Noise}} = \frac{\text{Difference between means}}{\text{Variability}} = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\frac{S_A^2}{n_A} + \frac{S_B^2}{n_B}}}$$
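The signal-to-noise ratio can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the yields for `field_a` and `field_b` are hypothetical numbers, and the helper name `signal_to_noise` is made up for this example. The result is compared with SciPy's unequal-variance (Welch) t statistic, which uses the same denominator.

```python
import numpy as np
from scipy import stats

# Hypothetical yields for two fields (illustrative values, not lecture data)
field_a = np.array([20.1, 22.3, 19.8, 21.5, 23.0, 20.7])
field_b = np.array([18.2, 19.5, 17.9, 20.1, 18.8, 19.0])


def signal_to_noise(a, b):
    """Difference between means divided by sqrt(S_A^2/n_A + S_B^2/n_B)."""
    signal = a.mean() - b.mean()
    noise = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return signal / noise


t_manual = signal_to_noise(field_a, field_b)

# equal_var=False selects Welch's t-test, whose statistic has the same form
t_scipy = stats.ttest_ind(field_a, field_b, equal_var=False).statistic

print(t_manual, t_scipy)
```

The two printed values should agree up to floating-point error.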
different from each other as they are within each other
rejecting the null hypothesis
[Figure: frequency distribution of our variable. µ: expected value of the population mean; X̄: sample mean; SD: sample standard deviation]
t = (sample mean − population mean) / standard error

$$t = \frac{\bar{X} - \mu}{SD/\sqrt{n}} = \frac{20.09 - 23}{6.023/\sqrt{32}} = -2.73$$
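The arithmetic above can be checked directly from the summary statistics. The sketch below uses only the four numbers on the slide; the two-sided p-value from `scipy.stats.t.sf` is an addition of this example and assumes n − 1 degrees of freedom.

```python
import math
from scipy import stats

# Summary statistics taken from the slide
sample_mean = 20.09
population_mean = 23.0
sample_sd = 6.023
n = 32

standard_error = sample_sd / math.sqrt(n)
t_stat = (sample_mean - population_mean) / standard_error

# Two-sided p-value, assuming n - 1 degrees of freedom (not shown on the slide)
dof = n - 1
p_two_sided = 2 * stats.t.sf(abs(t_stat), dof)

print(round(t_stat, 2), p_two_sided)
```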
Effect of alcohol on RT
[Figure: frequency distributions of reaction time (ms) for the 'No alcohol' and 'Alcohol' conditions]
Hypothesis test: 'Alcohol' vs 'No alcohol' condition
Hypothesis true: reaction time slower in 'alcohol' condition
Hypothesis false: reaction time faster in 'alcohol' condition
import pandas as pd
from scipy import stats

# Load the Iris data set and select the two species to compare
df = pd.read_csv("https://raw.githubusercontent.com/Opensourcefordatascience/Data-sets/master//Iris_Data.csv")
setosa = df[df['species'] == 'Iris-setosa'].reset_index()
versicolor = df[df['species'] == 'Iris-versicolor'].reset_index()

# Independent two-sample t-test on sepal width
stats.ttest_ind(setosa['sepal_width'], versicolor['sepal_width'])

Ttest_indResult(statistic=9.2827725555581111, pvalue=4.3622390160102143e-15)
import researchpy as rp

# Descriptive statistics of sepal width per species
rp.summary_cont(df.groupby("species")['sepal_width'])

                  N   Mean   SD        SE        95% Conf. Interval
species
Iris-setosa       50  3.418  0.381024  0.053885  3.311313  3.524687
Iris-versicolor   50  2.770  0.313798  0.044378  2.682136  2.857864
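The summary table alone is enough to recompute the t-test. As a hedged sketch, SciPy's `ttest_ind_from_stats` takes the means, standard deviations, and group sizes from the table; `equal_var=True` is an assumption here, chosen to match the default of `stats.ttest_ind`.

```python
from scipy import stats

# Means, standard deviations and group sizes copied from the summary table
result = stats.ttest_ind_from_stats(
    mean1=3.418, std1=0.381024, nobs1=50,   # Iris-setosa
    mean2=2.770, std2=0.313798, nobs2=50,   # Iris-versicolor
    equal_var=True,                         # pooled-variance (Student) t-test
)

print(result.statistic, result.pvalue)
```

The statistic should land close to the value obtained from the raw data, up to rounding in the table.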
descriptives, results = rp.ttest(setosa['sepal_width'], versicolor['sepal_width'])
results

   Independent t-test                                  results
0  Difference (sepal_width - sepal_width) =            0.6480
1  Degrees of freedom =                                98.0000
2  t =                                                 9.2828
3  Two side test p value =                             0.0000
4  Mean of sepal_width > mean of sepal_width p va...   1.0000
5  Mean of sepal_width < mean of sepal_width p va...   0.0000
6  Cohen's d =                                         1.8566
7  Hedge's g =                                         1.8423
8  Glass's delta =                                     1.7007
9  r =                                                 0.6840
Subject  Before diet  After diet  Weight change
A        100          70          -30
B        90           89          -1
C        89           70          -19
D        100          101         +1
E        100          98          -2
F        90           87          -3

[Figure labels: Diet 1, Diet 2; Paired vs. Unpaired]
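The before/after weights can be run through both a paired and an unpaired test to see how the choice changes the result. This is a sketch using the six subjects from the table; treating the same numbers as two independent groups in the unpaired call is done only for comparison.

```python
import numpy as np
from scipy import stats

# Weights of subjects A-F from the table
before = np.array([100, 90, 89, 100, 100, 90])
after = np.array([70, 89, 70, 101, 98, 87])

# Per-subject weight change, which is what the paired test analyzes
change = after - before

# Paired test: each subject is compared with itself
paired = stats.ttest_rel(after, before)

# Unpaired test: ignores which measurements belong to the same subject
unpaired = stats.ttest_ind(after, before)

print(change, paired.statistic, unpaired.statistic)
```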
measured are equal in the population
Levene's Test of Equality of Variances
used statistic to test the assumption of homogeneity of variance
(p-value)
be treated as equal
the assumption of homogeneity of variances
stats.levene(setosa['sepal_width'], versicolor['sepal_width'])

LeveneResult(statistic=0.66354593329432332, pvalue=0.41728596812962038)
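One possible way to act on Levene's result is sketched below. The helper `t_test_with_variance_check` and the sample values are hypothetical. Falling back to Welch's t-test (`equal_var=False`) when Levene's p-value is below alpha is a common convention assumed here, not something the slides prescribe.

```python
from scipy import stats


def t_test_with_variance_check(a, b, alpha=0.05):
    """Run Levene's test, then a t-test that matches its outcome."""
    levene_stat, levene_p = stats.levene(a, b)
    # A non-significant Levene's test means the variances can be treated as equal
    equal_var = bool(levene_p > alpha)
    result = stats.ttest_ind(a, b, equal_var=equal_var)
    return equal_var, result.statistic, result.pvalue


# Hypothetical samples with very different spreads (illustrative values)
narrow = [10.0, 10.1, 9.9, 10.05, 9.95, 10.0, 10.02, 9.98]
wide = [5, 15, 8, 12, 2, 18, 10, 11]

equal_var, t_stat, p_value = t_test_with_variance_check(narrow, wide)
print(equal_var, t_stat, p_value)
```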
statistical test
diff = setosa['sepal_width'] - versicolor['sepal_width']
plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other
should fall on the red line. If the dots are not
there is deviation from normality
long as it’s not severe
import pylab
stats.probplot(diff, dist="norm", plot=pylab)
pylab.show()
import matplotlib.pyplot as plt

diff.plot(kind="hist", title="Sepal Width Residuals")
plt.xlabel("Length (cm)")
plt.savefig("Residuals Plot of Sepal Width.png")
a sample x1, ..., xn came from a normally distributed population
indicated to be normally distributed
stats.shapiro(diff)
(0.9859335422515869, 0.8108891248703003)
the effect observed in the statistics
the effect size, dependent on the assumptions about the data
mean difference between your two groups, and then dividing the result by the pooled standard deviation
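The effect-size calculation described here (mean difference divided by the pooled standard deviation) can be sketched from summary statistics alone. The function name `cohens_d` is chosen for this example. The inputs are the sepal-width means and standard deviations reported earlier, so the result should come out near the Cohen's d of 1.8566 in the researchpy output.

```python
import math


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Mean difference divided by the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)


# Sepal width summary statistics for Iris-setosa and Iris-versicolor
d = cohens_d(3.418, 0.381024, 50, 2.770, 0.313798, 50)
print(d)
```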
table (if we look for 95% confidence, we should pick the 0.05 critical value)
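Instead of a printed table, the critical value can be looked up with `scipy.stats.t.ppf`. The sketch below assumes a two-sided test and reuses the 98 degrees of freedom and t = 9.2828 from the sepal-width comparison; both choices are illustrative.

```python
from scipy import stats

alpha = 0.05      # 95% confidence
dof = 98          # degrees of freedom from the sepal-width t-test

# Two-sided test: put alpha / 2 in each tail
t_critical = stats.t.ppf(1 - alpha / 2, dof)

t_observed = 9.2828
reject_null = abs(t_observed) > t_critical

print(t_critical, reject_null)
```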
tests:
value is totally arbitrary
problematic in large data sets
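One reading of the large-data concern is that with enough observations even a negligible difference produces a tiny p-value. The simulation below is a sketch under that interpretation: the group means, the sample size, and the random seed are all arbitrary choices made for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1_000_000

# Two simulated groups whose true means differ by only 0.01 standard deviations
group_a = rng.normal(loc=0.00, scale=1.0, size=n)
group_b = rng.normal(loc=0.01, scale=1.0, size=n)

result = stats.ttest_ind(group_a, group_b)

# Effect size (Cohen's d) stays tiny even though the test is significant
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
effect_size = abs(group_a.mean() - group_b.mean()) / pooled_sd

print(result.pvalue, effect_size)
```

Reporting the effect size next to the p-value makes it visible when a significant result is too small to matter.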
Leif D. Nelson, False-Positives, p-Hacking, Statistical Power, and Evidential Value
“When a measure becomes a target, it ceases to be a good measure” (Goodhart’s law)
Parametric
Correlation
Differential
Parametric tests for data
E.g., time to complete task, number of errors
Non-parametric
E.g., whether users found the system useful or not (Yes / No)
[Figure: chart with x-axis values 1 to 7 and y-axis values 18 to 70]
Correlation
variables
regressions
Difference
variables
between means, variance, distribution
t-test assumptions
equal
signed-rank test
value wins over any observations in the other set
hares
post (their rank order, from first to last crossing the finish line) is as follows, writing T for a tortoise and H for a hare:
so UH = 25
test…
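The direct counting method for U can be sketched as follows. The finish order itself did not survive in this text, so the string `finish_order` is an assumption: it is the commonly cited version of this tortoise-and-hare example, chosen because it reproduces UH = 25. The helper name `u_statistic` is also made up for this example.

```python
# Assumed finish order (first to last); T = tortoise, H = hare
finish_order = "THHHHHTTTTTH"


def u_statistic(order, winner, loser):
    """For each `winner`, count the `loser` animals that finish after it."""
    u = 0
    for position, animal in enumerate(order):
        if animal == winner:
            u += order[position + 1:].count(loser)
    return u


u_tortoise = u_statistic(finish_order, "T", "H")
u_hare = u_statistic(finish_order, "H", "T")

print(u_tortoise, u_hare)
```

The two U values always sum to the product of the group sizes, here 6 × 6 = 36.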
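import scipy.stats

# x, y: the two samples to compare (assumed here to be the sepal widths used earlier)
x, y = setosa['sepal_width'], versicolor['sepal_width']

# u: Mann-Whitney test statistic, p: p-value
u, p = scipy.stats.mannwhitneyu(x, y)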
These tests are for summaries of categorical (nominal) data:
well the observed distribution of data fits with the distribution that is expected if the variables are independent
expect 10 observations in each group
no statistical significance between the observed and the expected
Observed counts, with expected counts in parentheses:

         Vegan    Vegetarian  Total
Male     20 (25)  30 (25)     50
Female   30 (25)  20 (25)     50
Total    50       50          100
$$\chi^2 = \frac{(20-25)^2}{25} + \frac{(30-25)^2}{25} + \frac{(30-25)^2}{25} + \frac{(20-25)^2}{25} = \frac{25}{25} + \frac{25}{25} + \frac{25}{25} + \frac{25}{25} = 4$$
$$DF = (r-1)(c-1)$$

where DF = degrees of freedom, r = number of rows, c = number of columns.
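The same chi-square computation can be sketched with `scipy.stats.chi2_contingency`, using the observed counts from the Male/Female table. Passing `correction=False` is a choice made here: it turns off the Yates continuity correction that SciPy applies to 2×2 tables by default, so that the statistic matches the hand calculation.

```python
import numpy as np
from scipy import stats

# Observed counts: rows = Male, Female; columns = Vegan, Vegetarian
observed = np.array([[20, 30],
                     [30, 20]])

# correction=False disables the Yates continuity correction for 2x2 tables
chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)

print(chi2, p, dof)
print(expected)
```

Here `dof` corresponds to (r-1)(c-1) = 1 and `expected` holds the expected count for each cell.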