Introduction to probability distributions and testing: Lecture 2



SLIDE 1

Introduction to probability distributions and testing

Lecture 2

SLIDE 2

Summary of this week

■ This week we introduce the normal distribution and how to estimate population means from samples using confidence intervals.

■ In RStudio we will use many built-in functions for calculating summary statistics, probabilities and critical values (quantiles).

SLIDE 3

Learning objectives for the week

By actively following the lecture and practical and carrying out the independent study the successful student will be able to:

■ Explain the properties of ‘normal distributions’ and their use in statistics

■ Define, select and calculate with R probabilities, quantiles and confidence intervals

SLIDE 4

What is a probability distribution?

A formula or a table used to assign probabilities to each possible value of a random variable X.

Variable: some information about an individual (the property that we measure).

Example: You flip a coin twice. What’s the probability of getting heads both times?

Probability of getting a head on a single flip: 0.5 (assuming a fair coin)

4 different, equally likely outcomes: HH, HT, TH, TT

Number of heads    Probability
0                  0.25
1                  0.50
2                  0.25

← This is a probability distribution!

SLIDE 5

What is a probability distribution?

A formula or a table used to assign probabilities to each possible value of a random variable X

Example: You flip a coin twice. What’s the probability of getting heads both times?

[Figure: bar chart of Probability (y axis, 0 to 0.6) against Number of heads (x axis)]

← This is a probability distribution!

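The coin-flip distribution can be reproduced in R as a minimal sketch. It assumes a fair coin (probability of a head = 0.5); the names `heads`, `probability` and `coin_dist` are illustrative and do not come from the slides.

```r
# Number of heads possible in two flips of a coin
heads <- 0:2

# Binomial probabilities for 2 flips, assuming a fair coin (prob = 0.5)
probability <- dbinom(heads, size = 2, prob = 0.5)

# The table of values and probabilities is the probability distribution
coin_dist <- data.frame(heads, probability)
print(coin_dist)

# The probabilities of all possible values sum to 1
sum(coin_dist$probability)
```

The probability of getting heads both times is the row for 2 heads, i.e. 0.25.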
SLIDE 6

What is a probability distribution?

A probability distribution may be either discrete or continuous:

■ Discrete distribution: X can take only separate, countable values (e.g. the binomial, which is what the previous example was)

■ Continuous distribution: X can take any of an infinite number of values within a range (e.g. normal, uniform and many others)

○ We will focus on normal distributions in this module
○ Why? Because normal distributions are central to most of the statistical tests we do

SLIDE 7

What is the normal distribution?

Normal distribution (a.k.a. Gaussian distribution, bell-shaped curve): a symmetrical distribution used to compute probabilities for continuous data.

[Figure: normal curve. x axis: values the variable (x) can take (a.k.a. quantiles). y axis: density f(x), the height of the curve for a given value on the x axis.]

SLIDE 8

All normal distributions have these properties…

Pierce (2017) 'Normal Distribution', Math Is Fun, Available at: <http://www.mathsisfun.com/data/standard-normal-distribution.html>

SLIDE 9

All normal distributions have these properties…

  • 68% of observations are within 1 SD of the mean
  • 95% of observations are within 1.96 SD of the mean

Standard deviation: a measure of how spread out the data points are

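These two percentages can be checked with `pnorm()`, which returns the area under the normal curve to the left of a value. The sketch below uses the standard normal (mean 0, SD 1) and then the I.Q. parameters used later in the lecture; the variable names are illustrative.

```r
# Area within 1 SD of the mean of a standard normal distribution
p_1sd <- pnorm(1) - pnorm(-1)

# Area within 1.96 SD of the mean
p_196sd <- pnorm(1.96) - pnorm(-1.96)

round(c(p_1sd, p_196sd), 3)

# The same proportion holds for any mean and SD, e.g. mean = 100, SD = 15
p_iq_1sd <- pnorm(115, mean = 100, sd = 15) - pnorm(85, mean = 100, sd = 15)
all.equal(p_1sd, p_iq_1sd)
```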
SLIDE 10

All normal distributions have these properties…

Standard deviation: a measure of how spread out the data points are

x = 1, 2, 3, 4, 5, 6
n = 6
Mean ($\bar{x}$) = 3.5

$$\text{SD} = \sqrt{\frac{(1-3.5)^2 + (2-3.5)^2 + (3-3.5)^2 + (4-3.5)^2 + (5-3.5)^2 + (6-3.5)^2}{6}} = \sqrt{2.92} = 1.7$$

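The calculation on this slide divides by n, so it can be reproduced in R as below. Note that R's built-in `sd()` divides by n - 1 instead (the sample standard deviation), so it gives a slightly larger value for the same data. The names `sd_n` and `sd_n_minus_1` are illustrative.

```r
x <- 1:6

# Standard deviation as on the slide: divide the sum of squares by n
sd_n <- sqrt(sum((x - mean(x))^2) / length(x))
round(sd_n, 1)          # 1.7

# R's sd() divides by n - 1 instead
sd_n_minus_1 <- sd(x)
round(sd_n_minus_1, 2)  # 1.87
```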
SLIDE 11

All normal distributions have these properties…

  • 68% of observations are within 1 SD of the mean (e.g. 68% within 3.5 ± 1.7)
  • 95% of observations are within 1.96 SD of the mean (e.g. 95% within 3.5 ± 3.3)

…but they vary based on their PARAMETERS

Normal distributions have two parameters:

  • Parameter 1: σ (standard deviation)
  • Parameter 2: μ (mean)

Understand: only that the parameters alter the shape of the curve

SLIDE 13

Why? (without going into too much detail)

Equation (function) for the normal distribution:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

f(x) = ‘density’ (the height for a given value on the x axis)
x = given value of the variable
μ = mean
σ = standard deviation
e = base of the natural logarithm
π = constant (pi)

μ and σ are the only terms that alter the density; all other terms are already defined or are constants.

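As a sketch, the standard normal density formula can be written as an R function and compared with the built-in `dnorm()`. The function name `normal_density` and the example values of x are illustrative assumptions; the mean and standard deviation are the I.Q. values used later in the lecture.

```r
# Density of a normal distribution, written out from the standard formula
normal_density <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- c(70, 85, 100, 115, 130)

manual  <- normal_density(x, mu = 100, sigma = 15)
builtin <- dnorm(x, mean = 100, sd = 15)

# The hand-written formula and dnorm() should agree
all.equal(manual, builtin)
```

Changing `mu` moves the curve along the x axis, and changing `sigma` makes it wider or narrower.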
SLIDE 14

Extracting probabilities using R

  • pnorm() maps a value to a probability: use it to calculate an area under the curve (a probability)
  • qnorm() maps a probability to a value: use it to calculate a quantile (a value of x)

SLIDE 15

Extracting probabilities using R

I.Q. in the U.K. is normally distributed: μ = 100, σ = 15

What is the probability of an individual having an I.Q. > 115?

mu <- 100
sd <- 15
IQ <- 115
pnorm(IQ, mu, sd, lower.tail = FALSE)

The arguments to pnorm(), in order:

  • IQ: the value you are interested in
  • mu: the mean
  • sd: the standard deviation
  • lower.tail: whether you are interested in the lower or upper part of the distribution (here FALSE, i.e. the upper part)

SLIDE 16

Extracting probabilities using R

I.Q. in the U.K. is normally distributed: μ = 100, σ = 15

What is the probability of an individual having an I.Q. > 115?

mu <- 100
sd <- 15
IQ <- 115
pnorm(IQ, mu, sd, lower.tail = FALSE)
[1] 0.1586553

Answer = 15.9%

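The effect of `lower.tail` can be seen by computing both tails for the same value. This is a small sketch using the I.Q. example; the variable names `sigma`, `upper`, `lower` and `between` are illustrative (`sigma` is used instead of `sd` only to avoid reusing the name of R's `sd()` function).

```r
mu    <- 100
sigma <- 15
iq    <- 115

# Probability of an I.Q. above 115 (upper tail)
upper <- pnorm(iq, mean = mu, sd = sigma, lower.tail = FALSE)

# Probability of an I.Q. of 115 or below (lower tail, the default)
lower <- pnorm(iq, mean = mu, sd = sigma)

# The two tails together cover the whole distribution
upper + lower

# Probability of an I.Q. between 85 and 115: difference of two lower tails
between <- pnorm(115, mu, sigma) - pnorm(85, mu, sigma)
round(c(upper = upper, lower = lower, between = between), 3)
```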
SLIDE 17

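Extracting probabilities using R

I.Q. in the U.K. is normally distributed: μ = 100, σ = 15

What I.Q. value are 0.159 (15.9%) of people above?

mu <- 100
sd <- 15
P <- 0.159
qnorm(P, mu, sd, lower.tail = FALSE)
[1] 115

(115 to the nearest whole number; the unrounded result is slightly below 115 because 0.159 is itself a rounded probability.)
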
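`pnorm()` and `qnorm()` are inverses of each other, which the following sketch illustrates with the I.Q. example. The variable names are illustrative. Using the unrounded probability recovers 115; using the rounded 0.159 gives a value just below 115.

```r
mu    <- 100
sigma <- 15

# Value -> probability
p_upper <- pnorm(115, mu, sigma, lower.tail = FALSE)

# Probability -> value: feeding the unrounded probability back recovers 115
iq_back <- qnorm(p_upper, mu, sigma, lower.tail = FALSE)
all.equal(iq_back, 115)

# With the rounded probability the answer is close to, but not exactly, 115
iq_rounded <- qnorm(0.159, mu, sigma, lower.tail = FALSE)
round(iq_rounded)
```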
SLIDE 18

MINI SUMMARY

  • Normal distributions are continuous, symmetrical distributions
  • They have certain set properties
      ○ Same mode, median and mean
      ○ 95% of observations are within 1.96 SD of the mean
  • They differ according to their parameters, which are the mean and the standard deviation
  • In R, we can extract:
      ○ Cumulative probabilities using pnorm()
      ○ Quantiles using qnorm()

SLIDE 19

Using the normal distribution

We often want to know information about populations, e.g. the mean value μ (mu) in the whole population (this is a true value).

But we can only measure a sample of that population, e.g. the sample mean (x̄, said ‘x bar’).

We can use the properties of the normal distribution to help us estimate population parameters from samples.

SLIDE 20

Sampling distribution of the mean

Example: I.Q. in the U.K. is normally distributed, μ = 100, σ = 15

Population: mean = 100, sd = 15

Take a sample of size n from the population.

The mean of the sample will differ from the population mean by chance (it is unlikely to be 100 spot on).

SLIDE 21

Sampling distribution of the mean

Example: I.Q. in the U.K. is normally distributed, μ = 100, σ = 15

Population: mean = 100, sd = 15

Take many samples of size n.

Plot the means of all the samples = the sampling distribution of the mean

SLIDE 22

Sampling distribution of the mean

The sampling distribution of the mean has the same mean as the parent population.

[Figure: population distribution and sampling distribution]

SLIDE 23

Sampling distribution of the mean

The sampling distribution has a different (and lower) standard deviation than the parent population.

The standard deviation of the sampling distribution is called the standard error of the mean (often shortened to ‘standard error’ or just ‘se’).

[Figure: population distribution and sampling distribution]

SLIDE 24

Sampling distribution of the mean

se = sd / √n

Standard deviation = 15
Standard error = 15 / √n = 4.7 (the value for n = 10, since 15 / √10 ≈ 4.74)

[Figure: population distribution and sampling distribution]

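The idea of a sampling distribution can be illustrated by simulation. The sketch below is not from the slides: it assumes a sample size of n = 10 and 10,000 repeated samples, draws them with `rnorm()`, and compares the standard deviation of the sample means with sd / √n. The seed is fixed only so the result is reproducible.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

mu    <- 100   # population mean
sigma <- 15    # population standard deviation
n     <- 10    # assumed sample size

# Take many samples of size n and record the mean of each one
sample_means <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

# The sampling distribution has (approximately) the population mean...
mean(sample_means)

# ...and a standard deviation close to the standard error, sd / sqrt(n)
sd(sample_means)
sigma / sqrt(n)

hist(sample_means, main = "Sampling distribution of the mean")
```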
SLIDE 25

We want to know about populations...

…but only have samples!

We use the samples to infer information about the population.

So we need an idea of how confident we can be in our inferences (is our sample mean any good?).

Confidence intervals do this!

SLIDE 26

Confidence intervals

■ How confident can we be that our sample mean is a good estimate of the true value?

■ Confidence intervals give the highest and lowest likely values

■ ‘Likely’ means 95% (most common), 99% or 99.9%

e.g. the mean I.Q. of the population of the U.K. is 100 (± 4.7) = we are 95% certain that the mean I.Q. of the U.K.’s population is between 95.3 and 104.7

SLIDE 27

Confidence intervals on the mean

These rely on the fact that 95% of observations fall within 1.96 s.d. of the mean in a normal distribution.

SLIDE 28

Confidence intervals on the mean: for LARGE samples

$$\bar{x} \pm 1.96 \times \text{s.e.}$$

i.e. we are 95% certain that the population mean is between $\bar{x} - 1.96 \times \text{s.e.}$ and $\bar{x} + 1.96 \times \text{s.e.}$

WARNING: this method of calculating CIs requires large sample sizes (~30+) because it assumes a normal distribution.

SLIDE 29

Confidence intervals on the mean: for LARGE samples

$$\bar{x} \pm 1.96 \times \text{s.e.}$$

Where does this number come from? It is the value of x (the quantile) when P = 0.975:

> qnorm(0.975)
[1] 1.959964

[Figure: standard normal curve marking P = 0.025 at quantile = -1.96 and P = 0.975 at quantile = 1.96]

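Both critical values can be obtained in one call, and `pnorm()` confirms that 95% of the standard normal distribution lies between them. The variable names in this short sketch are illustrative.

```r
# Quantiles that cut off 2.5% in each tail of the standard normal
crit <- qnorm(c(0.025, 0.975))
crit

# Area between the two quantiles: the central 95%
central_area <- pnorm(crit[2]) - pnorm(crit[1])
central_area
```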
SLIDE 30

Why qnorm(0.975) and not qnorm(0.95)?

1 - 0.025 = 0.975

2-tailed test: allots half of your alpha to testing the statistical significance in one direction and half of your alpha to testing statistical significance in the other direction (0.05 in total, 0.025 in each tail).

Regardless of the direction of the relationship you hypothesize, you are testing for the possibility of the relationship in both directions.

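The same reasoning gives the quantile for other confidence levels: split alpha equally between the two tails and look up the quantile at 1 - alpha/2. The sketch below does this for the 95%, 99% and 99.9% levels; the names `level`, `alpha` and `z` are illustrative.

```r
level <- c(0.95, 0.99, 0.999)   # confidence levels
alpha <- 1 - level              # total probability in both tails

# Half of alpha goes in each tail, so use the quantile at 1 - alpha / 2
z <- qnorm(1 - alpha / 2)

data.frame(level, alpha, tail = alpha / 2, z = round(z, 3))
```

Higher confidence levels need larger quantiles, so they give wider intervals.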
SLIDE 31

What is the Km for Arginine-tRNA synthetase when ATP is the substrate?

Km is the concentration of substrate which permits the enzyme to achieve half Vmax.

We have 100 measures of Km in μM. Use CIs to estimate the true Km value…

SLIDE 32

Confidence intervals for LARGE samples

Km for Arginine-tRNA synthetase

km <- read.table("../data/km.txt", header = FALSE)
hist(km$V1)

SLIDE 33

Confidence intervals for large samples

  • Mean
m <- mean(km$V1);m
[1] 255

  • Standard error (= sd/√n)
se <- sd(km$V1)/sqrt(length(km$V1));se
[1] 3.919647

  • Quantile
q <- qnorm(0.975);q
[1] 1.959964

  • Amount to add/subtract
amount <- round(q*se,1);amount
[1] 7.7

  • Upper confidence limit ($\bar{x} + 1.96 \times \text{s.e.}$)
m + amount
[1] 262.7

  • Lower confidence limit ($\bar{x} - 1.96 \times \text{s.e.}$)
m - amount
[1] 247.3

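The steps above can be wrapped in a small function. This is a sketch, not the lecture's own code: `ci_large()` is a hypothetical helper, and `km_sim` is simulated data standing in for the real `km.txt` file, so its numbers will not match the slide.

```r
# Simulated stand-in for 100 Km measurements (NOT the real km.txt data)
set.seed(42)
km_sim <- rnorm(100, mean = 255, sd = 39)

# Large-sample confidence interval on the mean, using the normal quantile
ci_large <- function(x, level = 0.95) {
  m  <- mean(x)
  se <- sd(x) / sqrt(length(x))          # standard error
  q  <- qnorm(1 - (1 - level) / 2)       # 1.96 for a 95% interval
  c(lower = m - q * se, mean = m, upper = m + q * se)
}

ci <- ci_large(km_sim)
round(ci, 1)
```

Passing `level = 0.99` to the same function would give the wider 99% interval.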
SLIDE 34

Confidence intervals for large samples

  • Sample mean = 255 μM
  • 95% certain population mean is between 247.3 μM and 262.7 μM
  • We would normally summarise in a report as:

The 95% confidence interval on the mean was 255 ± 7.7 μM.

SLIDE 35

Confidence intervals: small samples

  • Exactly the same, except we use a t-distribution rather than a normal distribution
  • This means we use t given by qt() rather than 1.96 given by qnorm()
  • The value of t depends on the sample size as well (degrees of freedom = sample size - 1)
  • $\bar{x} \pm t_{[d.f.]} \times \text{s.e.}$

SLIDE 36

Confidence intervals: small samples

19 students make a lactate dehydrogenase solution to a recipe that should yield a concentration of 1.5 μmol l⁻¹.

How good is the recipe, or the students’ ability to follow the recipe?

SLIDE 37

Confidence intervals: small samples

The recipe should yield a concentration of 1.5 μmol l⁻¹.

ldh <- read.table("ldh.txt", header = TRUE)
mean(ldh$ldh)
[1] 1.373684

The mean of the sample is 1.37 μmol l⁻¹. Does this sample differ significantly from the population mean (i.e. 1.5 μmol l⁻¹)? And what would that mean if it does?

SLIDE 38

Confidence intervals: small samples

$$\bar{x} \pm t_{[d.f.]} \times \text{s.e.}$$

  • Mean
m <- mean(ldh$ldh);m
[1] 1.373684

  • Standard error
se <- sd(ldh$ldh)/sqrt(length(ldh$ldh));se
[1] 0.03230167

  • Quantile from qt(), which needs the degrees of freedom (df)
df <- length(ldh$ldh)-1;df
[1] 18
t <- qt(0.975,df=df);t
[1] 2.100922

  • Upper confidence limit
round(m+t*se,2)
[1] 1.44

  • Lower confidence limit
round(m-t*se,2)
[1] 1.31

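As a cross-check on the manual steps, the same interval is reported by R's built-in `t.test()`. The sketch below uses simulated concentrations as a stand-in for `ldh.txt` (so the numbers differ from the slide), and `ci_small()` is a hypothetical helper name.

```r
# Simulated stand-in for 19 concentrations (NOT the real ldh.txt data)
set.seed(7)
conc <- rnorm(19, mean = 1.4, sd = 0.14)

# Small-sample confidence interval on the mean, using the t quantile
ci_small <- function(x, level = 0.95) {
  m  <- mean(x)
  se <- sd(x) / sqrt(length(x))                      # standard error
  tq <- qt(1 - (1 - level) / 2, df = length(x) - 1)  # t for n - 1 d.f.
  c(lower = m - tq * se, upper = m + tq * se)
}

manual  <- ci_small(conc)
builtin <- t.test(conc, conf.level = 0.95)$conf.int

round(manual, 2)
round(as.numeric(builtin), 2)

# For 18 degrees of freedom the t quantile is larger than the normal 1.96
qt(0.975, df = 18)
```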
SLIDE 39

Confidence intervals: small samples

19 lactate dehydrogenase solutions made to a recipe that should yield a concentration of 1.5 μmol l⁻¹

  • 95% certain population mean is between 1.31 and 1.44 μmol l⁻¹
  • What does this tell us about the recipe, or the ability to follow the recipe?

The 95% confidence interval on the mean was 1.37 ± 0.07 μmol l⁻¹.

SLIDE 40

Objectives

Outline: Distributions in general and in R, and hypothesis testing using the binomial distribution as an example.

By actively following the lecture and practical and carrying out the independent study the successful student will be able to:

  • Explain the properties of ‘normal distributions’ and their use in statistics
  • Define, select and calculate with R probabilities, quantiles and confidence intervals
  • Use ggplot() to create simple graphs (we’ll talk about this later)

SLIDE 41

ggplot()

install.packages("ggplot2")   # once
library(ggplot2)              # once each session

Based on the ‘Grammar of Graphics’:

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
  <GEOM_FUNCTION>()

or

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

SLIDE 42

ggplot()

ggplot(data = binodf, aes(x = y, y = probability)) +
  geom_bar(stat = "identity")

ggplot(data = binodf, aes(x = y, y = probability)) +
  geom_bar(stat = "identity") +
  ylim(0, 1)

ggplot(data = binodf, aes(x = y, y = probability)) +
  geom_point()

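The slides do not show how `binodf` was created. The sketch below is an assumption: it builds a data frame with the column names used in the `aes()` calls (`y` and `probability`) from the two-coin-flip binomial example, then draws the bar chart. The plot is only drawn if the ggplot2 package is installed.

```r
# Assumed construction of binodf: number of heads in 2 flips of a fair coin
y <- 0:2
binodf <- data.frame(y = y,
                     probability = dbinom(y, size = 2, prob = 0.5))
print(binodf)

# Draw the plot only when ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data = binodf, aes(x = y, y = probability)) +
    geom_bar(stat = "identity") +   # bar heights taken directly from the data
    ylim(0, 1)                      # show the full probability scale
  print(p)
}
```

`stat = "identity"` is needed because the bar heights are already in the data; by default `geom_bar()` would count rows instead.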