SLIDE 1

13 Collecting Statistical Data

13.1 The Population
13.2 Sampling
13.3 Random Sampling

SLIDE 2

Polls, studies, surveys, and other data-collecting tools gather data from a small part of a larger group so that we can learn something about the larger group.

This is a common and important goal of statistics: learn about a large group by examining data from some of its members.

SLIDE 3

Data

Collections of observations (such as measurements, genders, or survey responses).

SLIDE 4

Statistics

The science of planning studies and experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions based on the data.

SLIDE 5

Population

The complete collection of all individuals (scores, people, measurements, and so on) to be studied. The collection is complete in the sense that it includes all of the individuals to be studied.

SLIDE 6

Census

Collection of data from every member of a population.

Sample

Subcollection of members selected from a population.

SLIDE 7

The practical alternative to a census is to collect data only from some members of the population and use that data to draw conclusions and make inferences about the entire population.

Statisticians call this approach a survey (or a poll when the data collection is done by asking questions).

The subgroup chosen to provide the data is called the sample, and the act of selecting a sample is called sampling.

A Survey

SLIDE 8

The first important step in a survey is to distinguish the population to which the survey applies (the target population) from the actual subset of the population from which the sample will be drawn, called the sampling frame.

The ideal scenario is when the sampling frame is the same as the target population, which would mean that every member of the target population is a candidate for the sample. When this is impossible (or impractical), an appropriate sampling frame must be chosen.

A Survey

SLIDE 9

The U.S. presidential election of 1936 pitted Alfred Landon, the Republican governor of Kansas, against the incumbent Democratic President, Franklin D. Roosevelt. At the time of the election, the nation had not yet emerged from the Great Depression, and economic issues such as unemployment and government spending were the dominant themes of the campaign.

CASE STUDY 2 THE 1936 LITERARY DIGEST POLL

SLIDE 10

The Literary Digest had used polls to accurately predict the results of every presidential election since 1916, and its 1936 poll was the largest and most ambitious poll ever. The sampling frame for the Literary Digest poll consisted of an enormous list of names that included:

CASE STUDY 2 THE 1936 LITERARY DIGEST POLL

SLIDE 11

(1) every person listed in a telephone directory anywhere in the United States,
(2) every person on a magazine subscription list, and
(3) every person listed on the roster of a club or professional association.

CASE STUDY 2 THE 1936 LITERARY DIGEST POLL

SLIDE 12

From this sampling frame a list of about 10 million names was created, and every person on this list was mailed a mock ballot and asked to mark it and return it to the magazine.

CASE STUDY 2 THE 1936 LITERARY DIGEST POLL

SLIDE 13

Based on the poll results, the Literary Digest predicted a landslide victory for Landon with 57% of the vote, against Roosevelt’s 43%.

Instead, the election turned out to be a landslide victory for Roosevelt with 62% of the vote, against 38% for Landon.

CASE STUDY 2 THE 1936 LITERARY DIGEST POLL

SLIDE 14

The difference between the poll’s prediction and the actual election results was 19 percentage points (Roosevelt was predicted to get 43% of the vote and received 62%), the largest error ever in a major public opinion poll.

CASE STUDY 2 THE 1936 LITERARY DIGEST POLL

SLIDE 15

For the same election, a young pollster named George Gallup was able to accurately predict a victory for Roosevelt using a sample of “only” 50,000 people.

What went wrong with the Literary Digest poll, and why was Gallup able to do so much better?

CASE STUDY 2 THE 1936 LITERARY DIGEST POLL

SLIDE 16

The first thing seriously wrong with the Literary Digest poll was the sampling frame, consisting of names taken from telephone directories, lists of magazine subscribers, rosters of club members, and so on. Telephones in 1936 were something of a luxury, and magazine subscriptions and club memberships even more so, at a time when 9 million people were unemployed.

CASE STUDY 2 THE 1936 LITERARY DIGEST POLL

SLIDE 17

When it came to economic status, the Literary Digest sample was far from being a representative cross section of the voters. This was a critical problem, because voters often vote on economic issues, and given the economic conditions of the time, this was especially true in 1936.

CASE STUDY 2 THE 1936 LITERARY DIGEST POLL

SLIDE 18

When the choice of the sample has a built-in tendency (whether intentional or not) to exclude a particular group or characteristic within the population, we say that a survey suffers from selection bias.

Selection bias must be avoided, but it is not always easy to detect ahead of time. Even the most scrupulous attempts to eliminate selection bias can fall short.

CASE STUDY 2 THE 1936 LITERARY DIGEST POLL

SLIDE 19

The second serious problem with the Literary Digest poll was the issue of nonresponse bias.

In a typical survey it is understood that not every individual is willing to respond to the survey request (and in a democracy we cannot force them to do so).

CASE STUDY 2 THE 1936 LITERARY DIGEST POLL

SLIDE 20

Those individuals who do not respond to the survey request are called nonrespondents, and those who do are called respondents.

The percentage of respondents out of the total sample is called the response rate.

CASE STUDY 2 THE 1936 LITERARY DIGEST POLL

SLIDE 21

For the Literary Digest poll, out of a sample of 10 million people who were mailed a mock ballot, only about 2.4 million mailed a ballot back, resulting in a 24% response rate.

When the response rate to a survey is low, the survey is said to suffer from nonresponse bias.

CASE STUDY 2 THE 1936 LITERARY DIGEST POLL
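
As a quick check of the arithmetic, the response rate can be computed directly from the two approximate counts quoted on this slide. This is only a minimal sketch; the function name response_rate is our own choice and not part of the original slides.

```python
def response_rate(respondents: int, sample_size: int) -> float:
    """Return the fraction of the sample that responded to the survey."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= respondents <= sample_size:
        raise ValueError("respondents must be between 0 and sample_size")
    return respondents / sample_size


# Approximate Literary Digest figures from the slide:
# about 2.4 million ballots returned out of about 10 million mailed.
digest_rate = response_rate(2_400_000, 10_000_000)
print(f"Response rate: {digest_rate:.0%}")  # Response rate: 24%
```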

SLIDE 22

One of the significant problems with the Literary Digest poll was that the poll was conducted by mail. This approach is the most likely to magnify nonresponse bias, because people often consider a mailed questionnaire just another form of junk mail.

CASE STUDY 2 THE 1936 LITERARY DIGEST POLL

SLIDE 23

The Literary Digest story has two morals: (1) You’ll do better with a well-chosen small sample than with a badly chosen large one, and (2) watch out for selection bias and nonresponse bias.

CASE STUDY 2 THE 1936 LITERARY DIGEST POLL

SLIDE 24

Page 516, problems 17, 18, 19 (solutions on the following slides)

NOTE: Students should omit problem 28 from the homework exercises.

Examples

SLIDE 25

Solutions
17(a) The citizens of Cleansburg.
17(b) The sampling frame is limited to that part of the target population that passes by a city street corner between 4:00 pm and 6:00 pm.

Examples

SLIDE 26

Solutions
18(a) 475
18(b) Yes, this survey is subject to nonresponse bias. The response rate is 475/(1313 + 475) ≈ 0.266.

Examples

SLIDE 27

Solutions
19(a) The choice of street corner could make a difference in the responses collected.
19(b) Interviewer D. We are assuming that people who live or work downtown are more likely to answer yes than people in other parts of town.

Examples

SLIDE 28

Solutions
19(c) Yes. People on the street between 4 pm and 6 pm are not representative of the population at large. Also, the five street corners were chosen by the interviewers, and the passers-by are unlikely to represent a cross-section of the target population.
19(d) Omit.

Examples

SLIDE 29

One commonly used shortcut in sampling is known as convenience sampling. In convenience sampling the selection of which individuals are in the sample is dictated by what is easiest or cheapest for the data collector, with little regard for whether the sample is representative.

A classic example of convenience sampling is when interviewers set up at a fixed location such as a mall or outside a supermarket and ask passersby to be part of a public opinion poll.

Convenience Sampling

SLIDE 30

A different type of convenience sampling occurs when the sample is based on self-selection: the sample consists of those individuals who volunteer to be in it.

Self-selection is the reason why many Area Code 800 polls are not to be trusted. Convenience sampling is not always bad; at times there is no other choice, or the alternatives are so expensive that they have to be ruled out.

Convenience Sampling

SLIDE 31

Quota sampling is a systematic effort to force the sample to be representative of a given population through the use of quotas: the sample should have so many women, so many men, so many blacks, so many whites, so many people living in urban areas, so many people living in rural areas, and so on. The proportions in each category in the sample should be the same as those in the population.

If we can assume that every important characteristic of the population is taken into account when the quotas are set up, it is reasonable to expect that the sample will be representative of the population and produce reliable data.

Quota Sampling
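
The rule that sample proportions should match population proportions can be turned into a small calculation. The sketch below is illustrative only: the category names and population counts are hypothetical (they do not come from the slides), and the helper quota_counts is our own. It covers only the quota arithmetic, not how individuals are chosen within each category.

```python
def quota_counts(population_counts: dict[str, int], sample_size: int) -> dict[str, int]:
    """Split sample_size across categories in proportion to population counts.

    Uses integer arithmetic and hands out any leftover places to the
    categories with the largest remainders, so the quotas always add up
    to sample_size.
    """
    total = sum(population_counts.values())
    if total <= 0:
        raise ValueError("population must not be empty")
    # Whole-number part of each category's proportional share.
    quotas = {g: c * sample_size // total for g, c in population_counts.items()}
    leftover = sample_size - sum(quotas.values())
    # Categories ordered by the size of the fractional part that was dropped.
    by_remainder = sorted(
        population_counts,
        key=lambda g: (population_counts[g] * sample_size) % total,
        reverse=True,
    )
    for g in by_remainder[:leftover]:
        quotas[g] += 1
    return quotas


# Hypothetical population of 10,000 people in four categories.
population = {
    "urban women": 4100,
    "urban men": 3900,
    "rural women": 1100,
    "rural men": 900,
}
quotas = quota_counts(population, 500)
print(quotas)
# {'urban women': 205, 'urban men': 195, 'rural women': 55, 'rural men': 45}
```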

SLIDE 32

The best alternative to human selection is to let the laws of chance determine the selection of a sample.

Sampling methods that use randomness as part of their design are known as random sampling methods, and any sample obtained through random sampling is called a random sample (or a probability sample).

Random Sampling

SLIDE 33

The most basic form of random sampling is called simple random sampling. It is based on the same principle as a lottery: any set of numbers of a given size has the same chance of being chosen as any other set of numbers of that size.

In theory, simple random sampling is easy to implement. We put the name of each individual in the population in “a hat,” mix the names well, and then draw as many names as we need for our sample. Of course, “a hat” is just a metaphor.

Simple Random Sampling
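
Drawing names from “a hat” can be sketched in a few lines of Python using the standard library's random.sample, which selects items without replacement. The list of names and the seed below are made up for illustration; a real survey would load the sampling frame from its own records.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct members from population, every subset equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(list(population), n)


# Hypothetical population of 100 named individuals.
names = [f"person_{i}" for i in range(1, 101)]
sample = simple_random_sample(names, 10, seed=13)
print(sample)
```

Calling the function again with the same seed returns the same sample; leaving the seed out gives a fresh draw each time.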

SLIDE 34

These days, the “hat” is a computer database containing a list of members of the population. A computer program then randomly selects the names.

This is a fine idea for small, compact populations, but a hopeless one when it comes to national surveys and public opinion polls. For most public opinion polls, especially those done on a regular basis, the time and money needed to do this are simply not available.

Simple Random Sampling

SLIDE 35

13 Collecting Statistical Data

13.1 The Population
13.2 Sampling
13.3 Random Sampling
13.4 Sampling: Terminology and Key Concepts
13.5 The Capture-Recapture Method
13.6 Clinical Studies

SLIDE 36

In a survey, we use a subset of the population, called a sample, as the source of our information, and from this sample, we try to generalize and draw conclusions about the entire population.

Survey

SLIDE 37

Parameter

A numerical measurement describing some characteristic of a population.

population parameter

SLIDE 38

Statistic

A numerical measurement describing some characteristic of a sample.

sample statistic

SLIDE 39

We will use the term sampling error to describe the difference between a parameter and a statistic used to estimate that parameter.

In other words, the sampling error measures how much the data from a survey differ from the data that would have been obtained if a census had been used.

Sampling Error

SLIDE 40

Sampling error can be attributed to two factors: chance error and sampling bias.

Sampling Error

SLIDE 41

Chance error is the result of the basic fact that a sample can only give us approximate information about the population.

Different samples are likely to produce different statistics for the same population, even when the samples are chosen in exactly the same way, a phenomenon known as sampling variability.

Chance Error
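
Sampling variability is easy to see in a small simulation. The sketch below assumes a hypothetical population of 10,000 voters in which exactly 55% favor some candidate (so the parameter is 0.55); it then draws five simple random samples of 500 in exactly the same way and reports each statistic and its sampling error. The population, sample size, and seed are all made up for illustration.

```python
import random

rng = random.Random(2024)  # fixed seed so the run is reproducible

# Hypothetical population: 1 = favors the candidate, 0 = does not.
population = [1] * 5500 + [0] * 4500
parameter = sum(population) / len(population)  # true proportion, 0.55


def sample_statistic(n):
    """Proportion in favor within one simple random sample of size n."""
    sample = rng.sample(population, n)
    return sum(sample) / n


# Five samples chosen in exactly the same way.
statistics = [sample_statistic(500) for _ in range(5)]
for s in statistics:
    print(f"statistic = {s:.3f}, sampling error = {s - parameter:+.3f}")
```

Because the samples are random, any sampling error here is chance error; the statistics differ from one another but tend to stay close to the parameter.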

SLIDE 42

Sampling variability, and thus chance error, is unavoidable, but with careful selection of the sample and the right choice of sample size it can be kept to a minimum.

Chance Error

SLIDE 43

Sampling bias is the result of choosing a bad sample and is a much more serious problem than chance error.

Getting a sample that is representative of the entire population can be very difficult and can be affected by many subtle factors.

Sampling Bias

SLIDE 44

As opposed to chance error, sampling bias can be eliminated by using proper methods of sample selection.

Sampling Bias

SLIDE 45

The size of the sample is denoted by the letter n.
The size of the population is denoted by the letter N.
The ratio n/N is called the sampling proportion.

Sampling Proportion
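
A short worked example of the ratio n/N, using the figures that appear in solution 5(a) in the examples (680 and 8325). We are assuming here that 680 is the sample size n and 8325 the population size N; the function name is our own.

```python
def sampling_proportion(n: int, N: int) -> float:
    """Return the sampling proportion n/N for sample size n and population size N."""
    if N <= 0 or not 0 <= n <= N:
        raise ValueError("need 0 <= n <= N and N > 0")
    return n / N


p = sampling_proportion(680, 8325)
print(f"n/N = {p:.4f} (about {p:.1%})")  # n/N = 0.0817 (about 8.2%)
```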

SLIDE 46

Page 515, problems 5, 6, 7, 8

Examples

SLIDE 47

5(a) 680/8325
5(b) 45%

Examples

SLIDE 48

6(a) The registered voters in Cleansburg.
6(b) The 680 registered voters polled by telephone.
6(c) Simple random sampling.

Examples

SLIDE 49

7. Smith 3%, Jones 3%, and Brown 0%

Examples

SLIDE 50

8. Chance. The sample was chosen randomly, which eliminates selection bias, and there was a 100% response rate, which eliminates nonresponse bias.

Examples

SLIDE 51

13 Collecting Statistical Data

13.1 The Population
13.2 Sampling
13.3 Random Sampling
13.4 Sampling: Terminology and Key Concepts
13.5 The Capture-Recapture Method
13.6 Clinical Studies

SLIDE 52

Finding the exact N-value of a large and elusive population can be extremely difficult and sometimes impossible.

In many cases, a good estimate is all we really need, and such estimates are possible through sampling methods.

The simplest sampling method for estimating the N-value of a population is called the capture-recapture method.

Finding the N-value

SLIDE 53

■ Step 1. Capture (sample): Capture (choose) a sample of size n1, tag (mark, identify) the animals (objects, people), and release them back into the general population.

THE CAPTURE-RECAPTURE METHOD

SLIDE 54

■ Step 2. Recapture (resample): After a certain period of time, capture a new sample of size n2 and take an exact head count of the tagged individuals (i.e., those that were also in the first sample). Call this number k.

THE CAPTURE-RECAPTURE METHOD

SLIDE 55

■ Step 3. Estimate: The N-value of the population can be estimated to be approximately (n1 · n2)/k.

THE CAPTURE-RECAPTURE METHOD
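
The three steps reduce to a one-line formula, which can be sketched as a small function. The function name and the example numbers (100 tagged, 80 recaptured, 20 of them tagged) are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def capture_recapture_estimate(n1: int, n2: int, k: int) -> float:
    """Estimate the population size N as (n1 * n2) / k.

    n1: size of the first (tagged) sample
    n2: size of the second sample
    k:  number of tagged individuals found in the second sample
    """
    if k <= 0:
        raise ValueError("k must be positive; with no tagged recaptures there is no estimate")
    if k > min(n1, n2):
        raise ValueError("k cannot exceed either sample size")
    return n1 * n2 / k


# Hypothetical numbers: tag 100, recapture 80, find 20 tagged.
estimate = capture_recapture_estimate(100, 80, 20)
print(estimate)  # 400.0
```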

SLIDE 56

The capture-recapture method is based on the assumption that both the captured and recaptured samples are representative of the entire population.

Capture-Recapture Method

SLIDE 57

Under these assumptions, the proportion of tagged individuals in the recaptured sample is approximately equal to the proportion of tagged individuals in the population.

In other words, the ratio k/n2 is approximately equal to the ratio n1/N. From this we can solve for N and get N ≈ (n1 · n2)/k.

Capture-Recapture Method

SLIDE 58

A large pond is stocked with catfish. As part of a research project we need to estimate the number of catfish in the pond. An actual head count is out of the question (short of draining the pond), so our best bet is the capture-recapture method.

Example 13.6 Small Fish in a Big Pond

SLIDE 59

Step 1. For our first sample we capture a predetermined number n1 of catfish, say n1 = 200. The fish are tagged and released unharmed back in the pond.

Example 13.6 Small Fish in a Big Pond

SLIDE 60

Step 2. After giving enough time for the released fish to mingle and disperse throughout the pond, we capture a second sample of n2 catfish. While n2 does not have to equal n1, it is a good idea for the two samples to be of approximately the same order of magnitude. Let’s say that n2 = 150. Of the 150 catfish in the second sample, 21 have tags (were part of the original sample).

Example 13.6 Small Fish in a Big Pond

SLIDE 61

Assuming the second sample is representative of the catfish population in the pond, the ratio of tagged fish in the second sample (21/150) is approximately the same as the ratio of tagged fish in the pond (200/N). This gives the approximate proportion 21/150 ≈ 200/N, which in turn gives N ≈ (200 · 150)/21 ≈ 1428.57.

Example 13.6 Small Fish in a Big Pond

SLIDE 62

Obviously, the value N = 1428.57 cannot be taken literally, since N must be a whole number. Besides, even in the best of cases, the computation is only an estimate. A sensible conclusion is that there are approximately N = 1400 catfish in the pond.
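
To see why the result should be read only as an estimate, the procedure can be simulated. The sketch below assumes, purely for illustration, a pond that really contains 1,400 catfish, tags 200 at random, recaptures 150 at random, and repeats the experiment five times. The true pond size, the seed, and the function name are our own assumptions, not part of the example.

```python
import random


def simulate_capture_recapture(true_n, n1, n2, rng):
    """Run one capture-recapture experiment on a pond with true_n fish.

    Returns (k, estimate), where k is the number of tagged fish in the
    second sample and estimate is (n1 * n2) / k, or None if k is 0.
    """
    pond = range(true_n)                 # each fish is just a number
    tagged = set(rng.sample(pond, n1))   # Step 1: capture, tag, release
    second = rng.sample(pond, n2)        # Step 2: recapture
    k = sum(1 for fish in second if fish in tagged)
    estimate = n1 * n2 / k if k else None  # Step 3: estimate N
    return k, estimate


rng = random.Random(13)  # fixed seed so the run is reproducible
results = [simulate_capture_recapture(1400, 200, 150, rng) for _ in range(5)]
for k, estimate in results:
    shown = f"{estimate:.0f}" if estimate is not None else "undefined"
    print(f"tagged in second sample: {k}, estimated N: {shown}")
```

Each run typically finds a different number of tagged fish, so the estimates scatter around the true pond size rather than hitting it exactly.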