COMM 291 Midterm Review Session By Simon Roberts Types of - - PowerPoint PPT Presentation



slide-1
SLIDE 1

COMM 291

Midterm Review Session

By Simon Roberts

slide-2
SLIDE 2

Types of Variables

  • Categorical Variable: Values fall into “bins” or categories
  • Binary Variable: There are exactly 2 options (true/false)
  • Nominal: Categories are simply named (colours, shapes)
  • Ordinal: Categories have a specific order (Infant, Youth, Teen, Adult)
  • Quantitative Variable: Values are measured numeric quantities
  • Identifier Variable: Unique identifiers, such as a Social Insurance Number, Student ID or Amazon Tracking Number

slide-3
SLIDE 3

Types of Variables – Activity

Student ID   Age   Tuition   Major              Height   Satisfaction Rating
12345        19    $6800     Finance            180 cm   Neutral
12346        20    $5900     Computer Science   168 cm   Extremely Likes

slide-4
SLIDE 4

Surveys and Sampling

  • Population: All individuals with a common characteristic you want to generalize about
  • Sample: A “slice” of the population
  • Parameter: A fact or characteristic about the population
  • Statistic: A fact or characteristic about the sample

slide-5
SLIDE 5

Biased Samples

  • Nonresponse/Undercoverage: Members of the population are systematically excluded from the sample
      • Conducting a telephone survey during the day (excludes commuters)
  • Voluntary Bias: Subjects who feel strongly self-select to participate
      • Common with hot-button issues (gun control, affirmative action, etc.)
  • Convenience Bias: Choosing subjects based on whether they’re easy to survey
      • Standing at a mall and asking the first 50 people who agree to take part

slide-6
SLIDE 6

Sampling Designs

  • Simple Random Sample: Each individual has an equal chance of being selected
  • Stratified Random Sample: Divide population into homogeneous groups (strata) and select from each stratum
      • Divide by age group or political affiliation, and then sample within each group
  • Cluster Random Sample: Divide population into heterogeneous groups (clusters) and select a few clusters
      • Randomly select a few high schools to represent a district
  • Systematic Sampling: Select every nth individual
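The four designs above can be sketched in a few lines of Python. This is only an illustration: the student IDs, the four equal-sized faculties, the seed and the helper function names are all made up for the example.

```python
import random

def simple_random_sample(population, k, rng):
    """Every individual has the same chance of being selected."""
    return rng.sample(population, k)

def stratified_sample(strata, k_per_stratum, rng):
    """Take a simple random sample of k individuals from every stratum."""
    return {name: rng.sample(members, k_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters and keep everyone inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [person for name in chosen for person in clusters[name]]

def systematic_sample(population, step, rng):
    """Pick a random starting point, then take every step-th individual."""
    start = rng.randrange(step)
    return population[start::step]

rng = random.Random(291)            # arbitrary fixed seed so runs are repeatable
student_ids = list(range(1000))     # hypothetical student IDs
faculties = {                       # hypothetical, equal-sized faculties
    "commerce": student_ids[0:250],
    "engineering": student_ids[250:500],
    "science": student_ids[500:750],
    "arts": student_ids[750:1000],
}

srs = simple_random_sample(student_ids, 50, rng)
stratified = stratified_sample(faculties, 20, rng)
cluster = cluster_sample(faculties, 1, rng)
systematic = systematic_sample(student_ids, 10, rng)
print(len(srs), {k: len(v) for k, v in stratified.items()}, len(cluster), len(systematic))
```

Note how the stratified sample draws a few people from every group, while the cluster sample takes everyone from a few groups.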
slide-7
SLIDE 7

Sampling Designs Activity

UBC is interested in improving its food services on campus, so it wishes to sample its students. Identify the method of sampling.

  • 1. There are 4 faculties (commerce, engineering, science, arts). Randomly select 20 students from each.
  • 2. Randomly select one of the faculties and survey all the students in that faculty.
  • 3. Stop every 10th person who enters the Nest on a Thursday.
  • 4. Each student has a student ID. Randomly select 200 participants.
slide-8
SLIDE 8

Simpson’s Paradox

The direction of association of the population may be the opposite of the direction of association of its relevant subgroups
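A small numeric sketch may make the reversal concrete. The counts below are entirely hypothetical (they are not from the slides): option A has the higher success rate in each subgroup, yet option B has the higher rate once the subgroups are pooled, because the two options were applied to very different mixes of cases.

```python
# Hypothetical (successes, trials) counts for two options across two subgroups.
results = {
    "A": {"easy cases": (90, 100), "hard cases": (300, 1000)},
    "B": {"easy cases": (800, 1000), "hard cases": (20, 100)},
}

def success_rate(successes, trials):
    return successes / trials

def overall_rate(option):
    """Pool the subgroups for one option and compute its overall success rate."""
    successes = sum(s for s, _ in results[option].values())
    trials = sum(t for _, t in results[option].values())
    return successes / trials

for group in ("easy cases", "hard cases"):
    a = success_rate(*results["A"][group])
    b = success_rate(*results["B"][group])
    print(f"{group}: A = {a:.2f}, B = {b:.2f}")  # A is higher within each subgroup

# After pooling, the direction flips: B is higher overall.
print(f"overall: A = {overall_rate('A'):.3f}, B = {overall_rate('B'):.3f}")
```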

slide-10
SLIDE 10

Describing Categorical Data - Activity

             Male   Female   Total
Finance       200      130     330
Accounting    260      280     540
Marketing     240      300     540
Total         700      710    1410

  • 1. What percentage of students chose marketing?
  • 2. What percentage of finance students are female?
  • 3. What percentage of male students chose accounting?
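One way to check the three answers is to be explicit about which total goes in the denominator. The sketch below uses the counts from the table; the dictionary layout and variable names are just one possible arrangement.

```python
# Counts from the contingency table (major by gender).
table = {
    "Finance": {"Male": 200, "Female": 130},
    "Accounting": {"Male": 260, "Female": 280},
    "Marketing": {"Male": 240, "Female": 300},
}

grand_total = sum(sum(row.values()) for row in table.values())
male_total = sum(row["Male"] for row in table.values())

# 1. Marginal percentage: marketing students out of ALL students.
pct_marketing = 100 * sum(table["Marketing"].values()) / grand_total

# 2. Conditional on major: females out of FINANCE students only.
pct_finance_female = 100 * table["Finance"]["Female"] / sum(table["Finance"].values())

# 3. Conditional on gender: accounting students out of MALE students only.
pct_male_accounting = 100 * table["Accounting"]["Male"] / male_total

print(f"{pct_marketing:.1f}%  {pct_finance_female:.1f}%  {pct_male_accounting:.1f}%")
```

The wording "of finance students" or "of male students" signals which row or column total to divide by.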
slide-11
SLIDE 11

Displaying Quantitative Data

  • Typically presented in a histogram, stem/leaf plot or boxplot
  • Mean = “Center of Mass”
  • Median = Middle
  • Mode = Most Frequent Observation
  • Range = Max – Min
  • Standard Deviation: $\sigma = \sqrt{\frac{\sum_i (x_i - \mu)^2}{N}}$
  • Variance = $\sigma^2$
  • IQR = Q3 – Q1
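These summaries can be computed with Python's standard `statistics` module. The data set is made up for illustration, and the quartiles use the module's "inclusive" method, which is an assumption: textbooks and Excel use slightly different quartile rules, so Q1 and Q3 may differ a little from a hand calculation.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical observations

mean = statistics.mean(data)            # "center of mass"
median = statistics.median(data)        # middle value
mode = statistics.mode(data)            # most frequent observation
value_range = max(data) - min(data)     # max - min
variance = statistics.pvariance(data)   # population variance (divides by N)
std_dev = statistics.pstdev(data)       # population SD, the square root of the variance

# Quartiles via the "inclusive" method (one of several common conventions).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(mean, median, mode, value_range, round(variance, 3), round(std_dev, 3), iqr)
```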
slide-12
SLIDE 12

Histograms vs Stem-and-Leaf

  • Histogram: Only shows the distribution; excellent for displaying large datasets
  • Stem-and-leaf: Shows individual values; impractical for large datasets
slide-13
SLIDE 13

Drawing Boxplots

  • 1. Get a five-number summary (Min, Q1, Median, Q3, Max)
  • 2. Calculate IQR = Q3 – Q1
  • 3. Find the inner fences, but do not plot them
      • Upper fence: Q3 + 1.5 × IQR
      • Lower fence: Q1 – 1.5 × IQR
  • 4. Grow whiskers to the most extreme values within the fences
  • 5. Show outliers (convention uses ○ for outliers between the inner and outer fences and * for far outliers beyond the outer fences, which conventionally sit 3 × IQR from the quartiles)
  • 6. Use Excel
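The steps above can be sketched as follows. The data set is hypothetical, and the quartiles come from `statistics.quantiles` with the "inclusive" method, which may not match the exact quartile rule used in class.

```python
import statistics

def boxplot_summary(values):
    """Return quartiles, inner fences, whisker ends and outliers for a boxplot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside[0], inside[-1]),  # most extreme values inside the fences
        "outliers": outliers,
    }

summary = boxplot_summary([10, 20, 22, 35, 40, 45, 85])  # hypothetical data
print(summary)
```

The whiskers stop at the last data point inside the fences, not at the fences themselves.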
slide-14
SLIDE 14

Effect of Changing Values Activity

Suppose you’ve drawn a boxplot with the following data: Min = 10; Q1 = 20; Med = 35; Q3 = 45; Max = 85. There was an error and the max was actually only 75. How does this affect:

  • The mean?
  • The median?
  • The range?
  • The IQR?
slide-15
SLIDE 15

Scatterplots, Correlation and Linear Regression

  • Correlation (r): How strong is the linear clustering of points around a line?
  • Only for quantitative data with a linear pattern
  • Use “association” for categorical variables, as it’s a less specific term
  • –1 ≤ r ≤ 1
  • Correlation is unitless
  • The correlation of X and Y = the correlation of Y and X
  • Correlation does not necessarily imply causation
  • Lurking variable: a third variable that influences both X and Y
  • Extrapolation: extending results beyond the range of data provided
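A short sketch can illustrate three of these properties: r stays between –1 and 1, swapping X and Y does not change it, and changing units does not change it. The heights and weights are hypothetical, and `pearson_r` is a helper written for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from sums of squared deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

heights_cm = [150, 160, 165, 172, 180, 191]  # hypothetical data
weights_kg = [52, 58, 63, 70, 77, 88]        # hypothetical data
heights_m = [h / 100 for h in heights_cm]    # same heights in different units

r = pearson_r(heights_cm, weights_kg)
r_swapped = pearson_r(weights_kg, heights_cm)   # X and Y exchanged
r_rescaled = pearson_r(heights_m, weights_kg)   # centimetres converted to metres
print(round(r, 4), round(r_swapped, 4), round(r_rescaled, 4))
```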
slide-16
SLIDE 16

Lurking Variables and Extrapolation

slide-17
SLIDE 17

How to find a regression line

  • Slope of the estimated regression equation: $b_1 = r \left( \frac{s_y}{s_x} \right)$
  • Predicted value from regression equation: $\hat{y} = b_0 + b_1 x$
  • R-squared ($r^2$) is the % of variation in the y value that the model can explain
  • Always takes on values from 0 to 1 inclusive
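As a rough sketch, the slope formula can be applied directly to a small data set. The data are made up, and the intercept line uses the standard least-squares result $b_0 = \bar{y} - b_1 \bar{x}$, which is not stated on the slide and is included here as an assumption about what the course uses.

```python
import statistics

def fit_line(xs, ys):
    """Return (b0, b1, r_squared) using b1 = r * s_y / s_x."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample SDs
    r = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x
    b0 = mean_y - b1 * mean_x  # standard least-squares intercept (not on the slide)
    return b0, b1, r ** 2

xs = [1, 2, 3, 4, 5]               # hypothetical explanatory values
ys = [2.1, 3.9, 6.2, 7.8, 10.1]    # hypothetical responses
b0, b1, r_squared = fit_line(xs, ys)
print(f"y-hat = {b0:.2f} + {b1:.2f}x, r^2 = {r_squared:.3f}")
```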
slide-18
SLIDE 18

Residual Plots

  • Residual = Observed – Predicted at each point
  • Good fit if the residuals form a symmetric horizontal band around residual = 0
  • If the residual plot is curved, then the data do not follow a linear trend
  • If the residuals create a linear trend, there’s either a problem with your algebra or you need to take a root of the observations

slide-19
SLIDE 19

Residual Plots - Example

Homoscedastic = Constant variance around the model. This is good.

Heteroscedastic = Non-constant variance around the model. This violates several assumptions about linear inference later on.

Bias = Residuals form a line/curve pattern around residual = 0. This indicates that the linear model is not a good fit OR that the data should be transformed.

slide-21
SLIDE 21

Correlation and Linear Regression Activity

  • Is there a relationship between an NFL team’s total spending (in millions) on player salary and its league performance? A linear model predicting Wins (out of 16 regular season games) is shown below, where $w$ is wins and $s$ is salary spending in millions:

$\hat{w} = -16.32 + 0.219\,s$

  • A. What is the explanatory variable? What is the response variable?
  • B. What does the slope mean? What does the y-intercept mean?
  • C. Does a team that spends 130 million and wins 13 games over- or underperform the model’s prediction?
  • D. The residual SD is 3 games. How practical is this model?
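Part C comes down to computing one prediction and one residual. The sketch below plugs the slide's coefficients into the model; the function and variable names are just for illustration.

```python
INTERCEPT = -16.32  # y-intercept of the fitted model from the slide
SLOPE = 0.219       # predicted additional wins per extra $1 million of salary

def predicted_wins(salary_millions):
    """Predicted regular-season wins for a given salary spend (in millions)."""
    return INTERCEPT + SLOPE * salary_millions

salary, actual_wins = 130, 13
prediction = predicted_wins(salary)
residual = actual_wins - prediction  # observed minus predicted

print(f"predicted wins: {prediction:.2f}, residual: {residual:+.2f}")
```

A positive residual means the team won more games than the model predicted, so it overperformed. The residual is also small compared with the residual SD of 3 games given in part D.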
slide-22
SLIDE 22

Combining Random Variables

E(X) = Expected value of a random variable (think mean)
σ = Standard deviation of a random variable
E(X ± Y) = E(X) ± E(Y). Does not require independence
E(aX) = aE(X)
Var(aX) = a²Var(X)
SD(aX) = |a|SD(X)
Var(X ± Y) = Var(X) + Var(Y). Requires independence!
SD(X ± Y) = √(Var(X) + Var(Y)). Requires independence!

slide-23
SLIDE 23

Combining Random Variables Activity

Variable B has an expected value of 9.6 and an SD of 0.8. Variable C has an expected value of 30 and SD of 2.2. Find E(24B + C) and SD(24B + C)
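One possible way to work this activity is sketched below. It makes two assumptions that the slide does not state: B and C are independent, and 24B means a single value of B multiplied by 24 (not the sum of 24 separate draws of B, which would give a different SD).

```python
import math

def expected_value(a, mean_x, mean_y):
    """E(aX + Y) = a*E(X) + E(Y); no independence needed."""
    return a * mean_x + mean_y

def standard_deviation(a, sd_x, sd_y):
    """SD(aX + Y) = sqrt(a^2 * Var(X) + Var(Y)); assumes X and Y are independent."""
    return math.sqrt(a ** 2 * sd_x ** 2 + sd_y ** 2)

mean_b, sd_b = 9.6, 0.8
mean_c, sd_c = 30, 2.2

e_total = expected_value(24, mean_b, mean_c)
sd_total = standard_deviation(24, sd_b, sd_c)
print(f"E(24B + C) = {e_total:.1f}, SD(24B + C) = {sd_total:.2f}")
```

Variances add, not standard deviations, so the SDs are squared before combining and the square root is taken at the end.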

slide-24
SLIDE 24

Normal Distribution + Empirical Rule

  • What is the total area under the curve?
  • Does this curve extend beyond 3 SD?
  • Informally known as a “bell” curve
  • Standard deviations are measures of spread
  • We use z-scores to standardize different observations

slide-25
SLIDE 25

Finding Probabilities Activity

Calculate $z = \frac{x - \mu}{\sigma}$, or use NORM.DIST(x, mu, sigma, TRUE)

Given $\mu = 85$ and $\sigma = 15$, calculate the following:

  • 1. P(X < 90)
  • 2. P(X > 105)
  • 3. P(80 < X < 100)
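For checking answers outside Excel, Python's `statistics.NormalDist` offers a `cdf` method that plays the same role as NORM.DIST with the cumulative flag set to TRUE. This is a sketch using the mean and SD given above.

```python
from statistics import NormalDist

scores = NormalDist(mu=85, sigma=15)

# Area to the left of 90.
p_below_90 = scores.cdf(90)

# Area to the right of 105 is 1 minus the area to its left.
p_above_105 = 1 - scores.cdf(105)

# Area between two values is the difference of the two left-hand areas.
p_between_80_100 = scores.cdf(100) - scores.cdf(80)

# The same first answer via a z-score on the standard normal curve.
z_90 = (90 - 85) / 15
p_below_90_via_z = NormalDist().cdf(z_90)

print(round(p_below_90, 4), round(p_above_105, 4), round(p_between_80_100, 4))
```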
slide-26
SLIDE 26

Finding X Activity

Use NORM.INV(p, mu, sigma) to return the value of x such that the area to its left is p.

Given a test where $\mu = 75$ and $\sigma = 9$, calculate the following:

  • 1. What score will put you in the top 5% of the class?
  • 2. What score will put you in the bottom 30%?
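The same lookups can be sketched in Python, where `NormalDist.inv_cdf` takes the area to the left just as NORM.INV does. Note that "top 5%" has to be converted to a left-hand area of 0.95 first.

```python
from statistics import NormalDist

test_scores = NormalDist(mu=75, sigma=9)

# Top 5% of the class: 95% of the area lies to the left of the cutoff.
top_5_cutoff = test_scores.inv_cdf(0.95)

# Bottom 30%: 30% of the area lies to the left of the cutoff.
bottom_30_cutoff = test_scores.inv_cdf(0.30)

print(f"top 5% cutoff: {top_5_cutoff:.2f}, bottom 30% cutoff: {bottom_30_cutoff:.2f}")
```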
slide-27
SLIDE 27

Central Limit Theorem

  • The mean of a random sample has a sampling distribution that is approximated by a normal distribution
  • Larger sample size = better approximation!
  • Has implications for probabilities for samples of proportions and means
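A quick simulation can make the theorem tangible. The sketch below draws repeated samples from a strongly right-skewed population (an exponential distribution with mean 1 and SD 1, chosen only for illustration) and checks that the sample means centre on the population mean with a spread close to σ/√n. The seed, sample size and number of replications are arbitrary.

```python
import random
import statistics

rng = random.Random(291)   # arbitrary fixed seed so runs are repeatable
SAMPLE_SIZE = 40
REPLICATIONS = 2000

def sample_mean(n):
    """Mean of one random sample of size n from a skewed (exponential) population."""
    return statistics.mean([rng.expovariate(1.0) for _ in range(n)])

sample_means = [sample_mean(SAMPLE_SIZE) for _ in range(REPLICATIONS)]

center = statistics.mean(sample_means)    # should be near the population mean, 1.0
spread = statistics.stdev(sample_means)   # should be near sigma / sqrt(n)
theoretical_spread = 1.0 / SAMPLE_SIZE ** 0.5

print(f"mean of sample means: {center:.3f}")
print(f"SD of sample means: {spread:.3f} (theory: {theoretical_spread:.3f})")
```

A histogram of `sample_means` would look roughly bell-shaped even though the individual observations are far from normal.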
slide-28
SLIDE 28

Sampling Distributions for Proportions

  • Only for binary categorical data
  • Sample proportion is $\hat{p}$, population proportion is $p$
  • $SD(\hat{p}) = \sqrt{\frac{pq}{n}}$, where $q = 1 - p$
  • $z = \frac{\hat{p} - p}{SD(\hat{p})}$
  • Conditions: 10% condition, success/failure, independence, sample size
slide-29
SLIDE 29

Sampling Distributions for Means

  • Only for quantitative data
  • Sample mean is $\bar{x}$, population mean is $\mu$
  • Standard error: $SD(\bar{x}) = \frac{\sigma}{\sqrt{n}}$
  • $z = \frac{\bar{x} - \mu}{SD(\bar{x})}$
  • If the population is normal, the sampling distribution of the sample mean is normal
  • If the population is not normal, but conditions are met, then the sampling distribution will be approximately normal by the central limit theorem (same conditions as for proportions)

slide-31
SLIDE 31

Confidence Intervals

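Estimate ± (Critical Value × SD(Estimate)), where the term in parentheses is the margin of error (ME)

Proportions: $\hat{p} \pm z^* \sqrt{\frac{\hat{p}\hat{q}}{n}}$. To choose n, the shortcut is $n = \frac{(z^*)^2\,\hat{p}\hat{q}}{ME^2}$

Means: $\bar{x} \pm t^*_{n-1} \frac{\sigma}{\sqrt{n}}$ (use the sample SD $s$ in place of $\sigma$ when $\sigma$ is unknown). Otherwise use CONFIDENCE.T(alpha, sd, n), which returns the margin of error

To reduce the size of a CI, choose a lower critical value (a lower confidence level) or a higher n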

slide-32
SLIDE 32

Confidence Intervals Question

A 95% confidence interval for the mean number of televisions per American household is (1.15, 4.20). For each of the following statements about the above confidence interval, choose true or false.

a) The probability that μ is between 1.15 and 4.20 is 0.95.
b) We are 95% confident that the true mean number of televisions per American household is between 1.15 and 4.20.
c) 95% of all samples should have x-bars between 1.15 and 4.20.
d) 95% of all American households have between 1.15 and 4.20 televisions.
e) Of 100 intervals calculated the same way (95%), we expect 95 of them to capture the population mean.
f) Of 100 intervals calculated the same way (95%), we expect 100 of them to capture the sample mean.

slide-33
SLIDE 33

Confidence Interval Proportion Question

Suppose I take an SRS of 30 students at UBC and find that the sample proportion of male students is 0.4. What is a 95% confidence interval for this estimate? What sample size should I choose if I want my margin of error to be no more than 5%?
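A possible way to check this question is sketched below, using the one-proportion interval and the sample-size shortcut. It assumes the sample proportion 0.4 is also used as the planning value when choosing n (a more conservative choice would be 0.5), and the function names are made up for the example.

```python
import math
from statistics import NormalDist

def proportion_ci(p_hat, n, confidence=0.95):
    """One-proportion z-interval: p_hat +/- z* * sqrt(p_hat * q_hat / n)."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def sample_size_for_margin(p_hat, margin, confidence=0.95):
    """Smallest n with margin of error <= margin, using p_hat as the planning value."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z_star ** 2 * p_hat * (1 - p_hat) / margin ** 2)

low, high = proportion_ci(0.4, 30)
needed_n = sample_size_for_margin(0.4, 0.05)
print(f"95% CI: ({low:.3f}, {high:.3f}); n needed for a 5% margin: {needed_n}")
```

The required sample size is always rounded up, since rounding down would leave the margin of error slightly above the target.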

slide-34
SLIDE 34

Confidence Interval Means Questions

Given the summary statistics below, compute a 95% confidence interval for the temperature of an average person in Fahrenheit:

Variable   N     Mean     Median   StDev   SE Mean
TEMP       130   98.249   98.300   0.733   0.064
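The sketch below reproduces the SE Mean column and builds an approximate interval. It deliberately substitutes the normal critical value z* for the t critical value with 129 degrees of freedom, because Python's standard library has no t distribution; with n = 130 the two are close, but the exact t-based interval is slightly wider.

```python
import math
from statistics import NormalDist

n, sample_mean, sample_sd = 130, 98.249, 0.733  # from the summary table

standard_error = sample_sd / math.sqrt(n)       # matches the "SE Mean" column

# Approximation: z* stands in for the t critical value with n - 1 = 129 df.
z_star = NormalDist().inv_cdf(0.975)
margin = z_star * standard_error

low, high = sample_mean - margin, sample_mean + margin
print(f"SE = {standard_error:.3f}, approximate 95% CI: ({low:.3f}, {high:.3f})")
```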

slide-35
SLIDE 35

Hypothesis Testing – 5 Easy Steps

1. Specify the Null Hypothesis – This is about the population
   A. Null implies no difference, relationship or effect. Think of “Not Guilty”
   B. For example, H0: μ = 90
2. Specify the Alternative Hypothesis – Also about the population
   A. The alternative is the statement that there is an effect or difference. Think of “Guilty!”
   B. For example, HA: μ ≠ 90
   C. Conclusions drawn from a two-sided test are stronger and can be applied to the one-sided test, but not the reverse
3. Set the Significance Level (α)
4. Calculate the test statistic and corresponding p-value from your sample
5. If the p-value is less than α, reject the null hypothesis. Else, fail to reject

slide-36
SLIDE 36

Types of Errors

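  • Type I error (false positive): rejecting a null hypothesis that is true
  • Type II error (false negative): failing to reject a null hypothesis that is false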

slide-37
SLIDE 37

Hypothesis Testing T/F Question Set

1. Rejecting the null hypothesis proves the alternative hypothesis.
2. The level of significance, or alpha, is a measure of the amount of risk the statistician will accept when making a decision.
3. Choosing alpha = 0.05 is equivalent to saying we are 95% confident the results occurred by chance.
4. Assuming a healthy diagnosis is normal, failing to prescribe medication to a sick patient is an example of a Type II error.
5. The test HA: x > 0 is a left-tailed test.

slide-38
SLIDE 38

Hypothesis Testing Proportions Activity

The standard treatment for a disease works in 0.675 of all patients. A new treatment is proposed. Is it better? (The scientists who created it claim it is. You – as an advocate for a patient or sales personnel for the company that eventually would market it – must be more skeptical. Where’s the data?) An initial clinical trial of n = 100 patients (of similar general health) is conducted: 77 people are cured. Assume the selection of patients is random. Test appropriate hypotheses at the 1% significance level.

slide-39
SLIDE 39

Hypothesis Testing Proportions Activity – Solutions

The standard treatment for a disease works in 0.675 of all patients. A new treatment is proposed. Is it better? (The scientists who created it claim it is. You – as an advocate for a patient or sales personnel for the company that eventually would market it – must be more skeptical. Where’s the data?) An initial clinical trial of n = 100 patients (of similar general health) is conducted: 77 people are cured. Assume the selection of patients is random. Test appropriate hypotheses at the 1% significance level.

The test statistic is 2.03. The critical value is 2.326. We do not reject H0. There is not sufficient sample evidence to support the claim that the new treatment cures over 0.675 of all patients. This difference (0.77 vs. 0.675) is not statistically significant at the 1% level.
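The numbers in the solution can be reproduced with a one-proportion z-test, taking H0: p = 0.675 against the one-sided alternative HA: p > 0.675. This is a sketch of the calculation; the variable names are illustrative.

```python
import math
from statistics import NormalDist

p0 = 0.675          # cure rate under the null hypothesis
n = 100
p_hat = 77 / n      # observed cure rate in the trial
alpha = 0.01

# Standard deviation of p_hat, computed under the null hypothesis.
sd_p_hat = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / sd_p_hat

# One-sided (upper-tail) test: critical value and p-value.
critical_value = NormalDist().inv_cdf(1 - alpha)
p_value = 1 - NormalDist().cdf(z)

reject_null = p_value < alpha
print(f"z = {z:.2f}, critical value = {critical_value:.3f}, "
      f"p-value = {p_value:.4f}, reject H0: {reject_null}")
```

Comparing z with the critical value and comparing the p-value with alpha are two views of the same decision, so they always agree.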

slide-40
SLIDE 40

Hypothesis Testing Means Activity

The population standard deviation for waiting times to be seated at a restaurant is known to be 10 minutes. An expensive restaurant claims that the average waiting time for dinner is approximately 1 hour, but we suspect that this claim is inflated to make the restaurant appear more exclusive and successful. A random sample of 30 customers yielded a sample average waiting time of 50 minutes. Is there evidence to say that the restaurant’s claim is too high? The original data (individual waiting times) is not normally distributed. What theorem allows us to do the previous calculations?
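One way to set up this test is sketched below, with H0: μ = 60 minutes against the one-sided alternative HA: μ < 60. The problem does not state a significance level, so α = 0.05 is an assumption made for the example. Because σ is known, a z-statistic is used.

```python
import math
from statistics import NormalDist

mu_0 = 60         # claimed mean waiting time in minutes (1 hour)
sigma = 10        # known population standard deviation
n = 30
x_bar = 50        # observed sample mean
alpha = 0.05      # assumed significance level; not given in the problem

sd_x_bar = sigma / math.sqrt(n)       # standard deviation of the sample mean
z = (x_bar - mu_0) / sd_x_bar

# Lower-tail test, because we suspect the true mean is BELOW the claim.
p_value = NormalDist().cdf(z)
reject_null = p_value < alpha

print(f"z = {z:.2f}, p-value = {p_value:.2e}, reject H0: {reject_null}")
```

Using the normal model for the sample mean here relies on the central limit theorem, since the individual waiting times are not normally distributed.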
