Introduction to probability distributions and testing
Lecture 2
■ This week we introduce the normal distribution and how to estimate population means from samples using confidence intervals.
■ In RStudio we will use many built-in functions for calculating summary statistics, probabilities and critical values (quantiles).
By actively following the lecture and practical and carrying out the independent study the successful student will be able to:
■ Explain the properties of ‘normal distributions’ and their use in statistics
■ Define, select and calculate with R probabilities, quantiles and confidence intervals
Variable: some information about an individual (the property that we measure)
Example: You flip a coin twice. What’s the probability of getting heads both times?
There are 4 equally likely outcomes (HH, HT, TH, TT), so the probability of getting a head both times is 1/4 = 0.25.
[Figure: bar chart of the probability distribution – number of heads on the x axis, probability on the y axis]
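The probabilities for this example can be checked in R with the built-in `dbinom()` function (a minimal sketch):

```r
# X = number of heads in two fair coin flips ~ Binomial(size = 2, prob = 0.5)
probs <- dbinom(0:2, size = 2, prob = 0.5)
probs   # [1] 0.25 0.50 0.25
```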
A probability distribution may be either discrete or continuous:
■ Discrete distribution means that X can assume one of a finite number of values (e.g. binomial – what the previous example was)
■ Continuous distribution means that X can assume one of an infinite number of different values (e.g. normal, uniform, lots of others)
Ø We will focus on normal distributions in this module
Ø Why? Because normal distributions are central to most statistical tests we do
Values the variable (x) can take (a.k.a. quantiles)
Density f(x) – the height for a given value on the x axis
Pierce (2017) ‘Normal Distribution’, Math Is Fun. Available at: http://www.mathsisfun.com/data/standard-normal-distribution.html
95% of observations are within 1.96 SD of the mean
68% of observations are within 1 SD of the mean
Standard deviation: a measure of how spread out the data points are
x = 1, 2, 3, 4, 5, 6   Mean (x̄) = 3.5   n = 6
sd = √[((1−3.5)² + (2−3.5)² + (3−3.5)² + (4−3.5)² + (5−3.5)² + (6−3.5)²) / 6] = √2.92 ≈ 1.7
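The calculation above can be reproduced in R. Note that the slide divides by n (the population SD), whereas R’s built-in `sd()` divides by n − 1 (the sample SD), so the two differ slightly:

```r
x <- c(1, 2, 3, 4, 5, 6)
n <- length(x)
# Population SD, dividing by n as in the worked example
sd_pop <- sqrt(sum((x - mean(x))^2) / n)
round(sd_pop, 1)   # [1] 1.7
# R's sd() divides by n - 1, so it is slightly larger
round(sd(x), 2)    # [1] 1.87
```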
95% of observations are within 1.96 SD of the mean (e.g. 95% within 3.5 ± 3.3)
68% of observations are within 1 SD of the mean (e.g. 68% within 3.5 ± 1.7)
Normal distributions have two parameters:
Parameter 1: μ (mean)
Parameter 2: σ (standard deviation)
(you only need to understand that the parameters alter the shape of the distribution)
Equation (function) for the normal distribution:
f(x) = (1 / (σ√(2π))) × e^(−(x − μ)² / (2σ²))
x = given value of the variable
μ = mean
σ = standard deviation
e = base of the natural logarithm
π = constant (pi)
‘Density’ f(x) is the height for a given value on the x axis. x, μ and σ are the only terms that alter the density – all other terms are already defined/constants.
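As a quick sketch, R’s `dnorm()` computes exactly this density; coding the formula by hand confirms it (μ = 100, σ = 15 are the I.Q. example values used later):

```r
# Coding the normal density formula by hand and comparing with dnorm()
mu <- 100; sigma <- 15; x <- 115
by_hand <- (1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))
all.equal(by_hand, dnorm(x, mean = mu, sd = sigma))   # [1] TRUE
```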
pnorm() maps a value to a probability; qnorm() maps a probability to a value.
Use pnorm() to calculate an area (probability) and qnorm() to calculate a quantile (value of x).
I.Q. in the U.K. is normally distributed μ = 100, σ = 15
mu <- 100
sd <- 15
IQ <- 115
pnorm(IQ, mu, sd, lower.tail = FALSE)
Arguments: IQ = the value you are interested in; mu = the mean; sd = the standard deviation; lower.tail = whether you are interested in the lower or upper part (here – the upper part)
mu <- 100
sd <- 15
IQ <- 115
pnorm(IQ, mu, sd, lower.tail = FALSE)
[1] 0.1586553
I.Q. in the U.K. is normally distributed μ = 100, σ = 15
I.Q. in the U.K. is normally distributed μ = 100, σ = 15
mu <- 100
sd <- 15
P <- 0.159
qnorm(P, mu, sd, lower.tail = FALSE)
[1] 115
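A sketch showing that pnorm() and qnorm() are inverses of each other for this example:

```r
mu <- 100; sd <- 15
p <- pnorm(115, mu, sd, lower.tail = FALSE)   # value -> probability
q <- qnorm(p, mu, sd, lower.tail = FALSE)     # probability -> value
p   # [1] 0.1586553
q   # [1] 115
```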
Ø Normal distributions have the same mode, median & mean
Ø 95% of observations are within 1.96 standard deviations of the mean
○ Cumulative probabilities using pnorm()
○ Quantiles using qnorm()
We often want to know information about populations: e.g. the mean value μ (mu) in the whole population (this is a true value)
But we can only measure a sample of that population: e.g. the sample mean (x̄, said ‘x bar’) in the sample
We can use the properties of the normal distribution to help us estimate population parameters from samples
Example: I.Q. in the U.K. is normally distributed, μ = 100, σ = 15 (mean = 100, sd = 15)
Take a sample of size n: the mean of the sample will differ from the population mean by chance (unlikely to be exactly 100).
Example: I.Q. in the U.K. is normally distributed, μ = 100, σ = 15 (mean = 100, sd = 15)
Take many samples of size n and plot the mean of all the samples = the sampling distribution.
The sampling distribution of the mean has the same mean as the parent population
The sampling distribution has a different (and lower) standard deviation than the parent population. The standard deviation of the sampling distribution is called the standard error of the mean (often shortened to ‘standard error’ or just ‘se’).
se = sd /√n
Standard deviation = 15; Standard error = 15 / √n (e.g. for n = 10, 15 / √10 ≈ 4.7)
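A simulation sketch of the sampling distribution (the sample size n = 10 is an assumption, chosen so that 15/√n ≈ 4.7 as on the slide):

```r
# Simulation sketch of the sampling distribution of the mean
set.seed(1)
n <- 10
sample_means <- replicate(10000, mean(rnorm(n, mean = 100, sd = 15)))
sd(sample_means)   # empirical standard error, close to the theoretical value
15 / sqrt(n)       # theoretical standard error, approximately 4.74
```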
…but we only have samples! We use the samples to infer information about the population, so we need an idea of how confident we can be in our inferences… (is our sample mean any good?) Confidence intervals do this!
■ How confident can we be that our sample mean is close to the true population mean?
■ Confidence intervals give the highest and lowest values between which the population mean is likely to lie
■ ‘Likely’ means 95% (most common), 99% or 99.9%
e.g. the mean I.Q. of the population of the U.K. is 100 ± 1.96 × 4.7 ≈ 100 ± 9.2 = 95% certain that the mean I.Q. of the U.K.’s population is between 90.8 and 109.2
WARNING: This method of calculating CIs requires large sample sizes (~30+) – it assumes a normal distribution
i.e., 95% certain the population mean is between x̄ − 1.96 × s.e. and x̄ + 1.96 × s.e.
Where does this number come from? It is the value of x (quantile) when P = 0.975:
> qnorm(0.975)
[1] 1.959964
P = 0.975: quantile = 1.96
P = 0.025: quantile = −1.96 (since 1 − 0.025 = 0.975)
A 2-tailed test allots half of your alpha to testing the statistical significance in one direction and half to testing the statistical significance in the other direction (0.05 total across both tails, 0.025 in each tail). Regardless of the direction of the relationship you hypothesize, you are testing for the possibility of the relationship in both directions.
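A quick check in R that 0.025 in each tail adds up to α = 0.05:

```r
# alpha = 0.05 split across both tails of the standard normal: 0.025 in each
lower <- pnorm(-1.96)                      # area below -1.96
upper <- pnorm(1.96, lower.tail = FALSE)   # area above +1.96
lower + upper                              # close to 0.05
```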
100 measures of Km in μM. Use CIs to estimate the true Km value…
Km is the concentration of substrate which permits the enzyme to achieve half Vmax
km <- read.table("../data/km.txt", header = FALSE)
hist(km$V1)
Km for Arginine-tRNA synthetase
m <- mean(km$V1); m
[1] 255
(x̄ − 1.96 × s.e.) and (x̄ + 1.96 × s.e.)
se <- sd(km$V1)/sqrt(length(km$V1)); se
[1] 3.919647
q <- qnorm(0.975); q
[1] 1.959964
amount <- round(q*se, 1); amount
[1] 7.7
m + amount
[1] 262.7
m - amount
[1] 247.3
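A self-contained sketch of the same calculation; since km.txt is not included here, simulated values stand in for the 100 Km measurements (the mean of 255 and sd of ~39 are back-calculated from the slide’s output, so treat them as illustrative):

```r
# Simulated stand-in for km.txt: 100 hypothetical Km measures (in uM)
set.seed(42)
km_values <- rnorm(100, mean = 255, sd = 39)
m  <- mean(km_values)
se <- sd(km_values) / sqrt(length(km_values))
q  <- qnorm(0.975)
c(lower = m - q * se, upper = m + q * se)   # 95% CI for the true Km
```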
19 students make a lactate dehydrogenase solution to a recipe that should yield a concentration of 1.5 μmols l-1
Recipe should yield a concentration of 1.5 μmols l-1
ldh <- read.table("ldh.txt", header = TRUE)
mean(ldh$ldh)
[1] 1.373684
x̄ ± t[d.f.] × s.e.
m <- mean(ldh$ldh); m
[1] 1.373684
se <- sd(ldh$ldh)/sqrt(length(ldh$ldh)); se
[1] 0.03230167
df <- length(ldh$ldh)-1; df
[1] 18
t <- qt(0.975, df = df); t
[1] 2.100922
round(m + t*se, 2)
[1] 1.44
round(m - t*se, 2)
[1] 1.31
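A sketch of why the t distribution is used for this small sample: with only 18 degrees of freedom the t quantile exceeds the normal’s 1.96, giving a slightly wider interval.

```r
# t quantile (18 df, as in the ldh example) vs the normal quantile
qt(0.975, df = 18)   # [1] 2.100922
qnorm(0.975)         # [1] 1.959964
# With more degrees of freedom the t quantile approaches 1.96:
qt(0.975, df = 1000)
```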
Outline: Distributions in general and in R, and hypothesis testing using the binomial distribution as an example. By actively following the lecture and practical and carrying out the independent study the successful student will be able to:
install.packages("ggplot2")   # run once
library(ggplot2)              # run once each session
Based on the ‘Grammar of Graphics’:
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
or equivalently:
ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
ggplot(data = binodf, aes(x = y, y = probability)) +
  geom_bar(stat = "identity")

ggplot(data = binodf, aes(x = y, y = probability)) +
  geom_bar(stat = "identity") +
  ylim(0, 1)

ggplot(data = binodf, aes(x = y, y = probability)) +
  geom_point()