PS 405 Week 4 Section: Difference of means, ANOVA, and Matrix Algebra
D.J. Flynn
February 4, 2014
t-tests
◮ for equality of two sample means
◮ hypotheses:
    H0: no difference in means
    HA: the means differ
◮ calculating the t-stat:
$$ t = \frac{\text{statistic} - \text{hypothesized difference}}{\text{SE of estimate}} $$
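As a rough illustration of the t-stat formula, the sketch below computes it by hand for a difference of two means and compares the result with R's built-in t.test(). The two score vectors are made-up numbers, and the unequal-variance (Welch) standard error is an assumption; the slides do not say which SE is intended.

```r
# Hypothetical scores for two groups (made-up numbers, for illustration only)
group_a <- c(12, 15, 11, 14, 13, 16, 12)
group_b <- c(10, 9, 13, 8, 11, 10)

# Statistic: the observed difference in sample means
diff_means <- mean(group_a) - mean(group_b)

# Hypothesized difference under H0
hyp_diff <- 0

# SE of the estimate (assumption: unequal-variance / Welch form)
se_diff <- sqrt(var(group_a) / length(group_a) + var(group_b) / length(group_b))

# t = (statistic - hypothesized difference) / SE of estimate
t_stat <- (diff_means - hyp_diff) / se_diff

# t.test() uses the same unequal-variance statistic by default
t_builtin <- unname(t.test(group_a, group_b)$statistic)
print(c(by_hand = t_stat, t.test = t_builtin))
```

The two printed values should agree, which is a quick check that the hand calculation matches what R does behind the scenes.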
Gender/partisanship example
Question: are men and women equally likely to be Democrats?
t-stat for difference in proportions:
$$ t = \frac{P_m - P_f}{\sqrt{\frac{P_m(1-P_m)}{n_m} + \frac{P_f(1-P_f)}{n_f}}} $$
The p-value that R estimates is for the null of no difference; the confidence interval is for the difference between the two means.
Interpretation of p-value: “if the null hypothesis is true, how often would we observe a difference at least this large under repeated sampling?” It is NOT “there is a p% chance that the true difference is equal to X.”
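A minimal sketch of the difference-in-proportions calculation follows. The survey counts are hypothetical (the slides give no data for this example), and the p-value uses a normal approximation, which is an assumption that is reasonable only for large samples.

```r
# Hypothetical survey counts (made up for illustration, not from the slides)
n_m <- 400; dem_m <- 160   # men: respondents, number of Democrats
n_f <- 500; dem_f <- 250   # women: respondents, number of Democrats

# Sample proportions of Democrats
p_m <- dem_m / n_m
p_f <- dem_f / n_f

# Standard error of the difference in proportions
se <- sqrt(p_m * (1 - p_m) / n_m + p_f * (1 - p_f) / n_f)

# t-stat for the null of no difference
t_stat <- (p_m - p_f) / se

# Two-sided p-value (assumption: normal approximation)
p_value <- 2 * pnorm(-abs(t_stat))

# 95% confidence interval for the difference P_m - P_f
ci <- (p_m - p_f) + c(-1, 1) * qnorm(0.975) * se

print(c(t = t_stat, p = p_value))
print(ci)
```

R's built-in prop.test() tests the same null but reports a chi-squared statistic, so its numbers will not match this t exactly.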
Logic of ANOVA and the F test
◮ running theme: experiments with > 2 groups
◮ does assignment to a particular group (X) affect some continuous outcome (Y)?
◮ this question can be answered with one-way ANOVA (AKA F-test)
◮ two sources of variation in DV:
    ◮ intended: independent variable/factor
    ◮ unintended: error/residual
◮ goal of ANOVA: determine share of variance explained by X
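The two sources of variation can be computed directly. The sketch below splits the total sum of squares in the built-in chickwts data into a between-group (explained) part and a residual (unexplained) part; the variable names are illustrative, not from the slides.

```r
library(datasets)  # chickwts ships with base R

y <- chickwts$weight
grand_mean <- mean(y)

# ave() replaces each observation with the mean of its feed group
group_mean <- ave(y, chickwts$feed)

# Total variation in the DV
ss_total <- sum((y - grand_mean)^2)

# Intended variation: explained by the factor (feed)
ss_between <- sum((group_mean - grand_mean)^2)

# Unintended variation: error/residual
ss_residual <- sum((y - group_mean)^2)

# Share of variance explained by X
share_explained <- ss_between / ss_total
print(round(c(between = ss_between, residual = ss_residual,
              share = share_explained), 4))
```

The between and residual sums of squares add up to the total, and they are the same quantities that appear in the Sum Sq column of an ANOVA table for weight~feed.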
ANOVA table
◮ Go through table quickly
◮ F statistic (sometimes called F-act):
$$ F = \frac{\text{explained variance}}{\text{unexplained variance}} = \frac{MSA}{MSE} $$
◮ look up critical F-stat based on numerator df, denominator df, and confidence level
◮ if F-act > F-critical, then we reject the null of independence
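Instead of a printed F table, the critical value can be looked up with qf(). The sketch below uses the mean squares and degrees of freedom reported in the chickwts ANOVA table in these slides; the 95% confidence level is an assumption.

```r
# Mean squares and degrees of freedom as reported in the chickwts ANOVA table
msa <- 46226; df_num <- 5    # explained (feed)
mse <- 3009;  df_den <- 65   # unexplained (residuals)

# F-act = MSA / MSE
f_act <- msa / mse

# F-critical (assumption: 95% confidence level)
f_crit <- qf(0.95, df1 = df_num, df2 = df_den)

# Equivalent check via the p-value: upper-tail area beyond F-act
p_value <- pf(f_act, df_num, df_den, lower.tail = FALSE)

# Reject the null of independence if F-act exceeds F-critical
reject_null <- f_act > f_crit
print(c(f_act = f_act, f_crit = f_crit, p_value = p_value))
print(reject_null)
```

Because the mean squares in the table are rounded, f_act here differs slightly in the second decimal from the F value that summary() prints.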
ANOVA in R
1. identify independent and dependent variables
2. determine variable structures (and change if necessary)
3. estimate ANOVA and call up results
Determining variable structure
◮ str(variable) displays the structure of a variable: integer, factor, character, numeric, logical
◮ important because ANOVAs are used for categorical IVs
◮ practice:
library(datasets)   # datasets ships with base R, so no install.packages() call is needed
names(chickwts)
str(chickwts$weight)
str(chickwts$feed)
levels(chickwts$feed)
Estimating ANOVAs in R
anova<-aov(weight~feed,data=chickwts)
summary(anova)
            Df Sum Sq Mean Sq F value   Pr(>F)
feed         5 231129   46226   15.37 5.94e-10 ***
Residuals   65 195556    3009
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
What happens if we instead estimate aov(feed~weight)?
wrong.model<-aov(feed~weight,data=chickwts)
Warning messages:
1: In model.response(mf, "numeric") :
  using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : ‘-’ not meaningful for factors
The warnings arise because the formula is reversed: the DV (left of ~) must be numeric, and feed is a factor.
Another example
We have data on which undergraduate institution people attended and mid-life satisfaction (0-100):
names(my.data)
[1] "school" "satisfaction"
table(my.data$school)
school
fsu  uf  um
  5   5   5
my.anova<-aov(satisfaction~school,data=my.data)
summary(my.anova)
            Df Sum Sq Mean Sq F value  Pr(>F)
school       2   7216    3608   11.85 0.00144 **
Residuals   12   3655     305
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
fsu<-subset(my.data,school=="fsu")
uf<-subset(my.data,school=="uf")
um<-subset(my.data,school=="um")
mean(fsu$satisfaction)
[1] 92.6
mean(uf$satisfaction)
[1] 39.2
mean(um$satisfaction)
[1] 60.8
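The group means can also be obtained in one call with tapply() rather than one subset per group. Since my.data is not available here, this sketch uses the built-in chickwts data as a stand-in; with the school data the equivalent call would be tapply(my.data$satisfaction, my.data$school, mean).

```r
library(datasets)  # chickwts ships with base R

# Mean of the DV within each level of the factor, in one call
group_means <- tapply(chickwts$weight, chickwts$feed, mean)
print(round(group_means, 1))

# Same number as the subset-then-mean approach, shown for one group
casein <- subset(chickwts, feed == "casein")
casein_mean <- mean(casein$weight)
print(casein_mean)
```

Comparing group means this way shows which groups drive a significant F test, which the ANOVA table alone does not reveal.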
Changing variable structure
◮ Current structure:
is.factor(), is.numeric(), is.character(), is.vector(), ... return TRUE or FALSE
◮ New structure:
as.factor(), as.numeric(), as.character(), as.vector(), ... return the object converted to the desired structure; assign the result (x<-as.factor(x)) to keep the change
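A short sketch of checking and changing structure, using made-up values. It also shows a common pitfall: calling as.numeric() directly on a factor returns the internal level codes, not the original numbers.

```r
# Hypothetical group labels stored as text (made-up values)
groups <- c("treat", "control", "treat", "placebo")

was_character <- is.character(groups)   # check current structure
groups <- as.factor(groups)             # assign the result to keep the change
now_factor <- is.factor(groups)

# Pitfall: a factor whose labels look like numbers
f <- factor(c("10", "20", "10", "30"))
codes  <- as.numeric(f)                 # internal level codes: 1 2 1 3
values <- as.numeric(as.character(f))   # original numbers: 10 20 10 30

print(levels(groups))
print(codes)
print(values)
```

Going through as.character() first is the usual way to recover the numeric values from a factor.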
Generalizations of the one-way ANOVA
1. two-way ANOVA: if we have more than 1 explanatory factor (e.g., soil type + type of potato = potato yield)
2. ANCOVA: ANOVA with a continuous covariate (e.g., soil type + type of potato + weather = potato yield)
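The difference between the two generalizations shows up in the degrees of freedom. The sketch below uses the built-in ToothGrowth data as a stand-in for the potato example (a substitution, not a dataset from the slides): dose is treated as a factor in the two-way ANOVA and as a continuous covariate in the ANCOVA.

```r
library(datasets)  # ToothGrowth ships with base R

# Two-way ANOVA: two categorical explanatory factors
two.way <- aov(len ~ supp + factor(dose), data = ToothGrowth)

# ANCOVA: one factor plus a continuous covariate (dose left numeric)
ancova <- aov(len ~ supp + dose, data = ToothGrowth)

# A factor with k levels uses k - 1 Df; a continuous covariate uses 1 Df
df_two_way <- summary(two.way)[[1]]$Df
df_ancova  <- summary(ancova)[[1]]$Df
print(df_two_way)
print(df_ancova)
```

Checking the Df column is a quick way to confirm that R is treating each variable as intended.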
Example of two-way ANOVA in R
Does income depend on type of profession and education?
library(car)
names(Prestige)
[1] "education" "income" "women" "prestige" "census" "type"
str(Prestige$education)
 num [1:102] 13.1 12.3 12.8 11.4 14.6 ...
str(Prestige$type)
 Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 ...
summary(Prestige$education)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  6.380   8.445  10.540  10.740  12.650  15.970
Prestige$education.recoded<-recode(Prestige$education,
  "6.38:8.445=1;8.446:10.54=2;10.55:10.74=3;10.75:12.65=4;12.66:15.97=5;else=NA")
table(Prestige$education.recoded)
 1  2  3  4  5
26 25  2 23 26
as.factor(Prestige$education.recoded)
[1] 5 4 5 4 5 5 5 5 5 5 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 4
[37] 4 3 4 2 4 4 2 2 2 4 4 4 4 2 4 2 2 2 4 4 4 2 4 1 2 3 2
[73] 1 1 1 2 2 1 1 1 2 2 2 1 1 2 1 2 2 1 1 1 1 1 1 4 2 1 1
Levels: 1 2 3 4 5
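my.two.way<-aov(income~type+education.recoded, data=Prestige)
summary(my.two.way)
                  Df    Sum Sq   Mean Sq F value   Pr(>F)
type               2 5.960e+08 297978078  25.266 1.65e-09 ***
education.recoded  1 2.952e+07  29520188   2.503    0.117
Residuals         94 1.109e+09  11793647
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
4 observations deleted due to missingness
Note: education.recoded has only 1 Df because the as.factor() call above printed the converted values without assigning them, so the variable entered the model as numeric. To treat it as a five-level factor (4 Df), assign the result first: Prestige$education.recoded<-as.factor(Prestige$education.recoded)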
Matrix algebra terms
◮ scalar
◮ vector
◮ matrix
Matrix algebra operations
◮ addition
◮ subtraction
◮ multiplication
◮ inverse
◮ transpose
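Each of these operations has a direct counterpart in base R. The sketch below uses small made-up matrices; note that %*% is matrix multiplication, whereas * multiplies element by element.

```r
# Small made-up matrices (matrix() fills column by column)
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # rows: (2, 1) and (1, 3)
B <- matrix(c(1, 0, 2, 1), nrow = 2)   # rows: (1, 2) and (0, 1)

mat_sum  <- A + B        # addition (element by element)
mat_diff <- A - B        # subtraction (element by element)
mat_prod <- A %*% B      # matrix multiplication
A_inv    <- solve(A)     # inverse
B_t      <- t(B)         # transpose

print(mat_prod)
print(A_inv)

# An inverse times the original matrix gives the identity matrix
print(round(A %*% A_inv, 10))
```

solve() fails with an error if the matrix is singular, which is the matrix analogue of dividing by zero.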
Why we care: the linear model
◮ Scalar form:
$$ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_K X_{Ki} + \epsilon_i $$
◮ Matrix form:
$$ Y_i = X_i \beta + \epsilon_i $$
Benefits of matrix form:
1. more parsimonious expression of models with lots of covariates
2. understand what’s going on behind the scenes. For example, the parameter β is estimated by calculating $(X^T X)^{-1} X^T y$
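To make the "behind the scenes" point concrete, the sketch below computes $(X^T X)^{-1} X^T y$ by hand and compares it with lm(). The built-in mtcars data and the choice of mpg, wt, and hp are illustrative assumptions, not from the slides.

```r
library(datasets)  # mtcars ships with base R

# Outcome vector y and design matrix X (a column of 1s for the intercept)
y <- mtcars$mpg
X <- cbind(intercept = 1, wt = mtcars$wt, hp = mtcars$hp)

# beta-hat = (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

# The same model estimated with lm()
fit <- lm(mpg ~ wt + hp, data = mtcars)

print(dim(X))                               # n rows, K + 1 columns
print(cbind(by_hand = as.numeric(beta_hat),
            lm = as.numeric(coef(fit))))
```

The two columns should match. In practice lm() uses a more numerically stable decomposition rather than an explicit inverse, but the estimate is the same quantity.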
This is the linear model in matrix form:
$$ Y_i = X_i \beta + \epsilon_i $$
For each term in this equation...
◮ scalar, vector, or matrix?
◮ size?