Lecture 6: samples and populations

Today’s lecture
Look at fundamental concepts of samples and populations.
Intended to reinforce similar material in MAS2901.
Adopt a different perspective from MAS2901: use simulation rather than analytic calculation.

Example
Type of problem looked at in MAS2901:
Mercury waste is dumped in a river.
It affects prawns which live in the river.
The maximum permitted level is one part per million on average.
A sample of prawns is collected and the mercury content of each prawn is measured.
We attempt to infer the population mean mercury content from the sample.
We use a hypothesis test to decide whether the population mean is greater than the maximum allowed level (see MAS2901 for details).
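As a rough illustration of the kind of test mentioned in this example, the sketch below simulates a hypothetical sample of mercury concentrations and applies a one-sided, one-sample t-test against the limit of 1 ppm. The sample size, mean and standard deviation used to generate the data are invented for illustration only, and the test covered in MAS2901 may differ.

```r
set.seed(1)                      # reproducible illustration
limit = 1                        # maximum permitted mean level (ppm), from the example
# Hypothetical data: 20 prawns, values invented for illustration only
mercury = rnorm(20, mean = 1.2, sd = 0.3)
# H0: population mean <= 1 ppm; H1: population mean > 1 ppm
test = t.test(mercury, mu = limit, alternative = "greater")
test$p.value                     # a small p-value is evidence that the mean exceeds the limit
```

The one-sided alternative matches the question being asked: we only care whether the population mean is above the permitted level.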

Populations
Suppose we measure some random quantity X.
X can take a range of possible values, and some values are more likely than others.
This is the distribution of X.
Usually we do not know this distribution exactly.
The unknown distribution is called the population distribution.
In the example: the population consists of the prawns in the estuary; the random quantity X is the mercury concentration in a randomly selected prawn; and the population distribution is the distribution of X.

Learning about populations
We are usually interested in key properties of the population distribution, such as:
the expectation of X, usually called the population mean;
the variance of X, usually called the population variance;
or the 95th percentile of X (for example).
Often we make some simplifying assumptions about the population distribution. For example, we might assume:
(a) X is normally distributed with unknown mean and variance;
(b) X is exponentially distributed with rate parameter $\lambda$, where $\lambda$ is unknown but lies in the interval $(0, 1)$;
(c) X is normally distributed with unknown mean and known variance $\sigma^2 = 5$.
A set of assumptions like this is referred to as a model.

Fully-specified population distributions
In some situations, usually rather artificial ones, we know the population distribution exactly. For example:
let X be the score obtained from rolling a fair die;
or let X be the number on a card drawn at random from a full deck. (Assume Jack, Queen and King are numbered 11, 12 and 13 respectively.)
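When the population distribution is fully specified, the population mean and variance can be computed directly rather than estimated. The sketch below does this for the two examples above. It assumes the Ace counts as 1, which the slide does not state, and the variable names are chosen here for illustration.

```r
die = 1:6                              # fair die: each score has probability 1/6
mu.die = mean(die)                     # population mean
var.die = mean((die - mu.die)^2)       # population variance (divide by 6, not 5)

cards = rep(1:13, times = 4)           # 52 cards; Ace assumed to count as 1
mu.cards = mean(cards)                 # population mean
var.cards = mean((cards - mu.cards)^2) # population variance

c(mu.die, var.die, mu.cards, var.cards)
```

Note that R's built-in var() divides by n - 1 and so would not give the population variance here.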

Samples
We do not know everything about the population distribution.
We learn about the population distribution by drawing a sample.
A sample of size n corresponds to taking n independent measurements from the distribution.
Each measurement is a random variable with the same distribution as X: the sample measurements are denoted $X_1, X_2, \ldots, X_n$.
The actual measurements obtained are denoted $x_1, x_2, \ldots, x_n$.
The distinction between the population distribution and how we learn about the population from limited samples is probably the most important concept in statistics.

Estimators
Suppose we wish to learn about some aspect of the population distribution, e.g. the population mean or population variance.
We construct an estimator for the quantity of interest.
For example, for the population mean, a good estimator is the sample mean
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i .$$
Formally, an estimator is defined to be some function of the sample: $S = g(X_1, X_2, \ldots, X_n)$ for some function $g$.
When we observe some measurements $X_1 = x_1, \ldots, X_n = x_n$ we can compute an estimate $s = g(x_1, x_2, \ldots, x_n)$.
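The distinction between an estimator (a function) and an estimate (a number) can be made concrete in R. In the sketch below the function name g follows the slide, while the population N(10, 2^2) and the sample size 5 are arbitrary choices made only for illustration.

```r
g = function(x) mean(x)          # the estimator: a function of the sample

set.seed(2)
x1 = rnorm(5, 10, 2)             # one realised sample (population chosen arbitrarily)
x2 = rnorm(5, 10, 2)             # a second sample from the same population

s1 = g(x1)                       # estimate from the first sample
s2 = g(x2)                       # estimate from the second sample
c(s1, s2)                        # same estimator, different estimates
```

The same function applied to different samples gives different numbers, which is why the estimator itself is treated as a random variable.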

Simulation study of estimators
Since any estimator S is a random variable, it makes sense to talk about its distribution. We can use simulation to study this distribution.
Example 6.2: Suppose the population distribution is normal, and we wish to estimate the population mean. Suppose the sample size is n = 4 and our estimator is $\bar{X} = (X_1 + X_2 + X_3 + X_4)/4$. What is the distribution of $\bar{X}$ when the population distribution is $N(170, 20^2)$?

Example 6.2 – R code

    simulate.sample.mean = function(n) {
      # n is the number of simulated samples (each sample has size 4)
      xbar = vector(mode="numeric",length=n)
      for (i in 1:n) {
        x = rnorm(4,170,20)   # Generate a sample of size 4
        xbar[i] = 0.25*sum(x) # Sample mean of the 4 values
      }
      xbar
    }
    xbar=simulate.sample.mean(500)
    hist(xbar,xlab="sample mean",ylab="frequency")

Example 6.2 – plot
[Figure: histogram of xbar. x-axis: sample mean (140 to 200); y-axis: frequency (0 to 80).]
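For a normal population the answer is also available analytically: the sample mean is normal with mean $\mu$ and variance $\sigma^2/n$, so here $\bar{X} \sim N(170, 20^2/4)$, which has standard deviation 10. The sketch below checks the simulation against these values. It uses replicate() as a compact alternative to the loop in the lecture code, and the 5000 replications are an arbitrary choice.

```r
set.seed(3)
# 5000 sample means, each from a sample of size 4 drawn from N(170, 20^2)
xbar = replicate(5000, mean(rnorm(4, 170, 20)))

mean(xbar)      # theory: 170
sd(xbar)        # theory: 20 / sqrt(4) = 10
```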

Example 6.3
Suppose the population distribution is normal, and we wish to estimate the 90th percentile using a sample of size 10. A sensible estimator is to define S to be the second largest value in the sample (i.e. the 9th value when the sample is ordered from smallest to largest). What is the distribution of S when the population distribution is $N(0, 1)$?

Example 6.3 – R code

    simulate.percentile = function(n) {
      # n is the number of simulated samples (each sample has size 10)
      s = vector(mode="numeric",length=n)
      for (i in 1:n) {
        x = rnorm(10,0,1) # Generate a sample of size 10
        x = sort(x)
        s[i] = x[9]       # Get 9th value on sorted list
      }
      s
    }
    s=simulate.percentile(500)
    hist(s,xlab="s",ylab="frequency",main="")

Example 6.3 – plot
[Figure: histogram of s. x-axis: s (−0.5 to 3.0); y-axis: frequency (0 to 150).]
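One way to judge this estimator is to compare the centre of its simulated distribution with the quantity it is meant to estimate. The sketch below does so using qnorm() for the true 90th percentile of $N(0, 1)$. The 5000 replications and the variable name target are choices made here, not part of the lecture code.

```r
set.seed(4)
# 5000 realisations of the estimator: 9th smallest of 10 draws from N(0, 1)
s = replicate(5000, sort(rnorm(10))[9])

target = qnorm(0.9)     # true 90th percentile of N(0, 1)
c(mean(s), target)      # centre of the simulated distribution vs the target
sd(s)                   # spread of the estimator for samples of size 10
```

Any gap between the two values in the first output indicates bias in the estimator at this sample size.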

What does the distribution of $\bar{X}$ look like?
Consider the following two examples for the density of the population distribution. For each example, decide which histogram on the slides (A, B, C or D) is most likely to represent the distribution of the sample mean $\bar{X}$ when the sample size is 10.

Example 6.4
[Figure: population density f(x) plotted against x, for x from 0.0 to 3.0.]

Options A–D
[Figure: four candidate histograms of the sample mean, labelled Option A to Option D. x-axis: sample mean (0.0 to 3.0); y-axis: frequency.]

Example 6.5
[Figure: population density f(x) plotted against x, for x from 0 to 10.]

Options A–D
[Figure: four candidate histograms of the sample mean, labelled Option A to Option D. x-axis: sample mean (0 to 10); y-axis: frequency.]

Answers
Example 6.4: option B.
Example 6.5: option D.

Conclusions
The sample mean is distributed around the population mean.
The distribution of sample mean values ‘forgets’ the underlying shape of the population distribution.
As n increases we expect the distribution of $\bar{X}$ to become more clustered around the true value.

The central limit theorem
Suppose $X_1, X_2, \ldots, X_n$ are independent and identically distributed random variables with common mean $\mu$ and variance $\sigma^2$, both of which are finite. Define
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} .$$
Then as $n \to \infty$ the distribution of $Z$ tends to $N(0, 1)$.
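The standardisation in the theorem can be tried out on a clearly non-normal population. The sketch below uses an exponential population with rate 1 (mean 1, standard deviation 1) and sample size 30. Both are choices made here for illustration rather than examples from the lecture.

```r
set.seed(5)
n = 30; mu = 1; sigma = 1      # Exp(1) has mean 1 and standard deviation 1

# 5000 standardised sample means Z = (xbar - mu) / (sigma / sqrt(n))
z = replicate(5000, (mean(rexp(n, rate = 1)) - mu) / (sigma / sqrt(n)))

mean(z)                        # the CLT suggests a value near 0
var(z)                         # the CLT suggests a value near 1
quantile(z, 0.975)             # compare with qnorm(0.975)
```

Any remaining skewness in z reflects the fact that n = 30 is finite; the theorem is a statement about the limit.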

CLT via simulation
Population distribution: normal mixture with two components.
[Figure: population density f(x) plotted against x, for x from 0 to 10.]
The population mean is $\mu = 5$ and the variance is $\sigma^2 = 4.3$.

R code for sampling $\bar{X}$

    simulate.bimod = function(k,n) {
      # Generate k samples of size n
      s = vector(mode="numeric",length=k)
      for (i in 1:k) {
        u = rnorm(n,3,0.6)  # Draws from the first component
        v = rnorm(n,7,0.6)  # Draws from the second component
        r = runif(n)        # Picks a component for each draw
        x = c(u[r>0.5],v[r<=0.5])
        s[i] = mean(x)
      }
      s
    }

Histograms from simulations of $\bar{X}$
[Figure: three histograms of the simulated sample mean, for sample size 2, sample size 5 and sample size 10. x-axis: sample mean; y-axis: frequency.]

Mean and variance for simulated $\bar{X}$

    Sample size n   $\mu$   $\sigma^2/n$   Simulated mean of $\bar{X}$   Variance of $\bar{X}$
    2               5.0     2.15           4.94                          2.27
    5               5.0     0.86           4.98                          0.862
    10              5.0     0.43           4.96                          0.443
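A table of this kind can be regenerated along the following lines. The sketch repeats the lecture's simulate.bimod function so that it runs on its own. The number of replications k = 1000 is an assumption, since the slides do not state the value used, so the simulated figures will not match the table exactly. The component parameters in the code imply $\sigma^2 = 0.6^2 + 2^2 = 4.36$ (within-component variance plus between-component variance), close to the 4.3 quoted on the slide.

```r
simulate.bimod = function(k, n) {
  # k sample means, each from a sample of size n from the two-component mixture
  s = vector(mode = "numeric", length = k)
  for (i in 1:k) {
    u = rnorm(n, 3, 0.6)
    v = rnorm(n, 7, 0.6)
    r = runif(n)
    x = c(u[r > 0.5], v[r <= 0.5])
    s[i] = mean(x)
  }
  s
}

set.seed(6)
sigma2 = 0.6^2 + 2^2              # variance implied by the mixture components
for (n in c(2, 5, 10)) {
  s = simulate.bimod(1000, n)     # k = 1000 is an assumed number of replications
  # columns: n, sigma^2 / n, simulated mean, simulated variance
  cat(n, sigma2 / n, mean(s), var(s), "\n")
}
```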
