Linear regression and t-tests, Steve Bagley, somgen223.stanford.edu

SLIDE 1

Linear regression and t-tests

Steve Bagley

somgen223.stanford.edu 1

SLIDE 2

Linear regression

SLIDE 3

Create data

d <- tibble(height = 0:5, weight = 0.5 + 0:5 + runif(6, -0.5, 0.5))

  • In this dataset, weight = 0.5 + height + some random errors.
  • runif generates random numbers from a uniform distribution.
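
Because runif draws fresh random numbers on every call, the values in d change from run to run. The following is a minimal sketch of a reproducible variant; the seed value 1 and the name d_demo are assumptions for illustration, and base data.frame stands in for tibble so that the snippet needs no packages.

```r
# Fix the random number generator so the same "random" errors are drawn each run.
set.seed(1)  # arbitrary seed, an assumption for this sketch

height <- 0:5
noise  <- runif(6, min = -0.5, max = 0.5)  # six draws, uniform on [-0.5, 0.5]
d_demo <- data.frame(height = height, weight = 0.5 + height + noise)

# Smallest and largest random error actually drawn.
range(noise)
```

Rerunning the block with the same seed reproduces d_demo exactly; changing or removing the seed gives a different dataset.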

SLIDE 4

Plot the data

plot <- ggplot(d, aes(height, weight)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  expand_limits(y = 0)
plot

[Figure: scatter plot of weight (y-axis) against height (x-axis) with the fitted regression line]

SLIDE 5

How to do a linear regression

reg <- lm(weight ~ height, data = d)
reg

Call:
lm(formula = weight ~ height, data = d)

Coefficients:
(Intercept)       height
     0.4463       1.1150

  • Note use of ~ here: weight ~ height
  • This is called the formula notation.
  • The variable on the left is the dependent variable.
  • The variable on the right is the independent variable.
  • They should be column names in the data argument.
  • The result shows the y-intercept and the coefficient of the height variable.
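
To make the formula notation concrete, here is a small sketch on made-up, noise-free data (the data frame toy and the slope of 2 are assumptions for illustration, not values from the slides). It also shows predict, which expects new data with the same column name used on the right-hand side of the formula.

```r
# Hypothetical noise-free data: weight is exactly 0.5 + 2 * height.
toy <- data.frame(height = 0:5)
toy$weight <- 0.5 + 2 * toy$height

# dependent variable ~ independent variable, both columns of `toy`
fit <- lm(weight ~ height, data = toy)
coef(fit)  # the y-intercept and the coefficient of height

# Predict weight for new heights; the column name must match the formula.
new_weights <- predict(fit, newdata = data.frame(height = c(6, 7)))
new_weights
```

With exact data the fit recovers the intercept and slope used to build it, which is a quick way to check that the formula is written the intended way round.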

SLIDE 6

How to get more information about the regression

summary(reg)

Call:
lm(formula = weight ~ height, data = d)

Residuals:
       1        2        3        4        5        6
-0.01798  0.14520 -0.27676  0.14624  0.04688 -0.04359

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.4463     0.1272    3.51   0.0247 *
height        1.1151     0.0420   26.55  1.2e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1757 on 4 degrees of freedom
Multiple R-squared:  0.9944,  Adjusted R-squared:  0.9929
F-statistic: 704.8 on 1 and 4 DF,  p-value: 1.197e-05
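
The printed summary is meant for reading; the same numbers can also be pulled out programmatically. A minimal sketch, using a freshly simulated dataset (d_demo and the seed are assumptions, so the values will differ from those on the slide):

```r
set.seed(1)  # arbitrary seed for this sketch
d_demo <- data.frame(height = 0:5)
d_demo$weight <- 0.5 + d_demo$height + runif(6, -0.5, 0.5)

fit <- lm(weight ~ height, data = d_demo)
s <- summary(fit)

coef(s)                         # matrix: Estimate, Std. Error, t value, Pr(>|t|)
coef(s)["height", "Pr(>|t|)"]   # p-value for the slope
s$r.squared                     # Multiple R-squared
s$sigma                         # residual standard error
residuals(fit)                  # one residual per observation
```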

SLIDE 7

How to extract the coefficients

coefficients(reg)
(Intercept)      height
  0.4463429   1.1150495

coefficients(reg)[["(Intercept)"]]
[1] 0.4463429

coefficients(reg)[["height"]]
[1] 1.115049

  • coefficients returns a named vector.
  • Use [[ ]] to extract the values without the names.
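
The difference between [ ] and [[ ]] on a named vector can be seen on a small stand-in (the vector v and its rounded values are made up for illustration):

```r
# A named numeric vector, shaped like the result of coefficients().
v <- c("(Intercept)" = 0.45, height = 1.1)

v["height"]            # single brackets keep the name
v[["height"]]          # double brackets return the bare value

names(v["height"])     # "height"
names(v[["height"]])   # NULL
```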

SLIDE 8

Add regression line information

plot +
  annotate("text", x = 1, y = 5,
           label = sprintf("y = %.4f + %.4f x",
                           coefficients(reg)[["(Intercept)"]],
                           coefficients(reg)[["height"]]))

[Figure: scatter plot of weight against height with the regression line, annotated with the text "y = 0.4463 + 1.1150 x"]
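
The label itself is built by sprintf before ggplot2 ever sees it. A small sketch of that step in isolation, with the two coefficient values copied from the coefficients(reg) output above (the names b0 and b1 are assumptions):

```r
b0 <- 0.4463429  # intercept, as printed by coefficients(reg)
b1 <- 1.1150495  # slope, as printed by coefficients(reg)

# %.4f formats a number with exactly four digits after the decimal point.
label <- sprintf("y = %.4f + %.4f x", b0, b1)
label
```

Changing %.4f to, say, %.2f is the simplest way to control how many decimals appear on the plot.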

SLIDE 9

Add regression line information (fancy)

plot +
  annotate("text", x = 1, y = 5,
           label = sprintf("italic(y) == %.4f + %.4f * italic(x)",
                           coefficients(reg)[["(Intercept)"]],
                           coefficients(reg)[["height"]]),
           parse = TRUE)

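[Figure: scatter plot of weight against height with the regression line, annotated with the typeset equation y = 0.4463 + 1.115x (y and x in italics)]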

  • See ?plotmath for details
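
With parse = TRUE the label is no longer shown as literal text; it is treated as a plotmath expression. The sketch below approximates that step with base R's parse(text = ...), which is an assumption about how the label is processed rather than a description of ggplot2 internals. It suggests why the slope is drawn as 1.115 and not 1.1150: once parsed, the number is a numeric constant and its trailing zero is not kept.

```r
label <- sprintf("italic(y) == %.4f + %.4f * italic(x)", 0.4463429, 1.1150495)
label                          # still a character string, slope written as 1.1150

expr <- parse(text = label)    # now an R expression
class(expr)

# Turning the parsed expression back into text: the trailing zero is gone.
deparsed <- deparse(expr[[1]])
deparsed
```

If the trailing zero matters, one option is to quote the number inside the expression so it stays a string, for example by writing '%.4f' within the label.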

SLIDE 10

Add other information (gratuitously ornate)

plot +
  annotate("text", x = 1, y = 5,
           label = "e^{pi * i} + 1 == 0",
           parse = TRUE)

[Figure: scatter plot of weight against height with the regression line, annotated with the typeset identity e^(πi) + 1 = 0]

  • See ?plotmath for details

SLIDE 11

Adding the regression info using package ggpubr

library(ggpubr)
ggscatter(d, x = "height", y = "weight",
          add = "reg.line", add.params = list(color = "blue")) +
  stat_regline_equation(label.x = 1, label.y = 5) +
  stat_cor(label.x = 1, label.y = 4.7)

[Figure: scatter plot of weight against height with a blue regression line, labeled "y = 0.45 + 1.1 x" and "R = 1, p = 1.2e-05"]
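
The R and p values on the plot can be cross-checked in base R. This sketch assumes stat_cor is reporting a Pearson correlation (its default method, to the best of my knowledge) and uses a freshly simulated dataset, so the numbers will not match the slide. For a regression with a single predictor, the p-value of the correlation test coincides with the p-value of the slope.

```r
set.seed(1)  # arbitrary seed for this sketch
d_demo <- data.frame(height = 0:5)
d_demo$weight <- 0.5 + d_demo$height + runif(6, -0.5, 0.5)

ct <- cor.test(d_demo$height, d_demo$weight)  # Pearson correlation by default
ct$estimate   # the correlation coefficient R
ct$p.value

# p-value of the slope from the linear regression on the same data
fit <- lm(weight ~ height, data = d_demo)
slope_p <- coef(summary(fit))["height", "Pr(>|t|)"]

all.equal(ct$p.value, slope_p, tolerance = 1e-6)
```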

SLIDE 12

Simple statistical tests

SLIDE 13

Create data

set.seed(13)
n <- 50
d2 <- tibble(value = c(rnorm(n, mean = 10, sd = 2), rnorm(n, mean = 11, sd = 2)),
             group = c(rep("control", times = n), rep("treatment", times = n)))
head(d2)
# A tibble: 6 x 2
  value group
  <dbl> <chr>
1 11.1  control
2  9.44 control
3 13.6  control
4 10.4  control
5 12.3  control
6 10.8  control

  • rnorm generates random numbers from a Gaussian distribution.
  • rep builds a vector by repeating values.
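
A short sketch of how rep behaves, including the each argument, which is not used on the slide but gives an equivalent way to build the group column:

```r
rep("control", times = 3)

# times repeats the whole vector; each repeats every element in turn.
by_times <- rep(c("control", "treatment"), times = 2)
by_each  <- rep(c("control", "treatment"), each = 2)
by_times
by_each

# Same labels as c(rep("control", times = 2), rep("treatment", times = 2))
identical(by_each, c(rep("control", times = 2), rep("treatment", times = 2)))
```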

SLIDE 14

Plot the data

ggplot(d2, aes(value, color = group)) +
  geom_histogram(aes(fill = group), position = "dodge", binwidth = 0.5)

[Figure: dodged histogram of value (x-axis) against count (y-axis), colored by group (control, treatment)]

SLIDE 15

Two sample t-test

d2_x <- d2 %>% filter(group == "control") %>% pull(value)
d2_y <- d2 %>% filter(group == "treatment") %>% pull(value)
t.test(d2_x, d2_y)

	Welch Two Sample t-test

data:  d2_x and d2_y
t = -2.247, df = 97.824, p-value = 0.02689
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.6164644 -0.1002824
sample estimates:
mean of x mean of y
 9.947163 10.805536

  • Called this way, t.test takes two numeric vectors, not a data frame; pull extracts a column as a vector.
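
t.test also has a formula interface that works directly on a data frame, in the same spirit as lm. The sketch below compares the two calling styles; d2_demo is a base data.frame stand-in built with the same seed and parameters as on the earlier slide, and the object names are assumptions.

```r
set.seed(13)
n <- 50
d2_demo <- data.frame(
  value = c(rnorm(n, mean = 10, sd = 2), rnorm(n, mean = 11, sd = 2)),
  group = rep(c("control", "treatment"), each = n)
)

# Formula interface: value split by the two levels of group.
by_formula <- t.test(value ~ group, data = d2_demo)

# Vector interface, as on the slide, using base subsetting instead of dplyr.
by_vectors <- t.test(d2_demo$value[d2_demo$group == "control"],
                     d2_demo$value[d2_demo$group == "treatment"])

all.equal(by_formula$p.value, by_vectors$p.value)
```

Both calls run the same Welch two-sample test; the formula version avoids creating the intermediate vectors.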
