COMM 291 Midterm Review Session, by Simon Roberts
Types of Variables
- Categorical Variable: Values fall into “bins” or categories
- Binary Variable: There are exactly 2 options (true/false)
- Nominal: Categories are simply named, with no order (colours, shapes)
- Ordinal: Categories have a specific order (Infant, Youth, Teen, Adult)
- Quantitative Variable: Values are measured numbers
- Identifier Variable: Unique identifiers, such as a Social Insurance Number, Student ID or Amazon Tracking Number
Types of Variables – Activity
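| Student ID | Age | Tuition | Major            | Height | Satisfaction Rating |
|------------|-----|---------|------------------|--------|---------------------|
| 12345      | 19  | $6800   | Finance          | 180 cm | Neutral             |
| 12346      | 20  | $5900   | Computer Science | 168 cm | Extremely Likes     |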
Surveys and Sampling
- Population: All individuals with a common characteristic you want to generalize about
- Sample: A “slice” of the population
- Parameter: A fact or characteristic about the population
- Statistic: A fact or characteristic about the sample
Biased Samples
- Nonresponse/Undercoverage: Members of the population are systematically excluded from the sample
- Example: Conducting a telephone survey during the day (excludes commuters)
- Voluntary Response Bias: Subjects who feel strongly self-select to participate
- Example: Common with hot-button issues (gun control, affirmative action, etc.)
- Convenience Bias: Choosing subjects based on whether they’re easy to survey
- Example: Standing at a mall and asking the first 50 people who agree to take part
Sampling Designs
- Simple Random Sample: Each individual has an equal chance of being selected
- Stratified Random Sample: Divide the population into homogeneous groups (strata) and randomly select from each stratum
- Example: Divide by age group or political affiliation, then sample within each group
- Cluster Random Sample: Divide the population into heterogeneous groups (clusters) and randomly select a few whole clusters
- Example: Randomly select a few high schools to represent a district
- Systematic Sampling: Select every nth individual
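The four designs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the roster, the faculty names as strata/clusters, and all function names are hypothetical and not part of the review session.

```python
import random


def simple_random_sample(population, k, rng):
    """Every individual has the same chance of being selected."""
    return rng.sample(population, k)


def stratified_sample(strata, k_per_stratum, rng):
    """Draw a random sample from every stratum."""
    return {name: rng.sample(members, k_per_stratum)
            for name, members in strata.items()}


def cluster_sample(clusters, n_clusters, rng):
    """Pick a few whole clusters at random and keep everyone in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}


def systematic_sample(population, step, start=0):
    """Take every step-th individual, beginning at index `start`."""
    return population[start::step]


# Hypothetical roster: 4 faculties with 100 made-up students each.
rng = random.Random(291)
faculties = ["commerce", "engineering", "science", "arts"]
roster = {f: [f"{f}-{i}" for i in range(100)] for f in faculties}
everyone = [student for members in roster.values() for student in members]

srs = simple_random_sample(everyone, 20, rng)
strat = stratified_sample(roster, 5, rng)
clus = cluster_sample(roster, 1, rng)
syst = systematic_sample(everyone, 10)
```

Note the contrast between the stratified and cluster versions: the first samples inside every group, the second keeps every member of a few randomly chosen groups.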
Sampling Designs Activity
UBC is interested in improving their food services on campus, so they wish to sample their students. Identify the method of sampling.
- 1. There are 4 faculties (commerce, engineering, science, arts). Randomly select 20 students from each.
- 2. Randomly select one of the faculties and survey all the students in that faculty.
- 3. Stop every 10th person who enters the Nest on a Thursday.
- 4. Each student has a student ID. Randomly select 200 participants.
Simpson’s Paradox
The direction of association of the population may be the opposite of the direction of association of its relevant subgroups
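A small numeric sketch can make the reversal concrete. The counts below are illustrative only and do not come from the slides: treatment A has the higher success rate within each subgroup, yet treatment B has the higher rate once the subgroups are pooled, because the two treatments were applied to very different mixes of cases.

```python
# Illustrative (successes, trials) counts per treatment and subgroup.
# These numbers are an assumption for demonstration, not from the slides.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def success_rate(successes, trials):
    return successes / trials


# Success rate inside each subgroup.
subgroup = {
    treatment: {group: success_rate(*cell) for group, cell in groups.items()}
    for treatment, groups in counts.items()
}

# Success rate after pooling the subgroups together.
overall = {
    treatment: success_rate(
        sum(s for s, _ in groups.values()),
        sum(n for _, n in groups.values()),
    )
    for treatment, groups in counts.items()
}

for treatment in counts:
    print(treatment, subgroup[treatment], round(overall[treatment], 3))
```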
Describing Categorical Data - Activity
|            | Male | Female | Total |
|------------|------|--------|-------|
| Finance    | 200  | 130    | 330   |
| Accounting | 260  | 280    | 540   |
| Marketing  | 240  | 300    | 540   |
| Total      | 700  | 710    | 1410  |
- 1. What percentage of students chose marketing?
- 2. What percentage of finance students are female?
- 3. What percentage of male students chose accounting?
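One way to check answers is to work directly from the counts: marketing total 540 of 1410 students, 130 females of 330 finance students, and 260 accounting students of 700 males. The sketch below shows which total goes in each denominator; the variable names are made up for illustration.

```python
# Counts from the activity table (Male, Female per major).
table = {
    "Finance": {"Male": 200, "Female": 130},
    "Accounting": {"Male": 260, "Female": 280},
    "Marketing": {"Male": 240, "Female": 300},
}

grand_total = sum(sum(row.values()) for row in table.values())
male_total = sum(row["Male"] for row in table.values())

# 1. Marginal percentage: marketing students out of all students.
pct_marketing = 100 * sum(table["Marketing"].values()) / grand_total

# 2. Conditional on major: females out of finance students.
pct_finance_female = 100 * table["Finance"]["Female"] / sum(table["Finance"].values())

# 3. Conditional on gender: accounting students out of all males.
pct_male_accounting = 100 * table["Accounting"]["Male"] / male_total

print(round(pct_marketing, 1), round(pct_finance_female, 1), round(pct_male_accounting, 1))
```

The key step in each question is identifying the group after "of": that group's total is the denominator.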
Displaying Quantitative Data
- Typically presented in a histogram, stem/leaf plot or boxplot
- Mean = “Center of Mass”
- Median = Middle
- Mode = Most Frequent Observation
- Range = Max – Min
- Standard Deviation = $\sigma = \sqrt{\dfrac{\sum (x_i - \mu)^2}{N}}$
- Variance = $\sigma^2$
- IQR = Q3 – Q1
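The measures above can be computed with Python's standard `statistics` module. This is a minimal sketch on a made-up dataset (an assumption, not data from the session). The standard deviation uses the population form with divisor N, and the quartiles use the "inclusive" method; other software may use a different quartile convention and report slightly different Q1 and Q3.

```python
import statistics

# Hypothetical observations, for illustration only.
data = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
data_range = max(data) - min(data)

# Population standard deviation and variance (divide by N).
sd = statistics.pstdev(data)
variance = statistics.pvariance(data)

# Quartiles; the "inclusive" method is one of several conventions.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(mean, median, mode, data_range, round(sd, 3), round(variance, 3), iqr)
```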
Histograms vs Stem-and-Leaf
- Histogram: Only shows the distribution; excellent for displaying large datasets
- Stem-and-Leaf: Shows individual values; impractical for large datasets
Drawing Boxplots
- 1. Get a five-number summary (Max, Q3, Median, Q1, Min)
- 2. Calculate IQR
- 3. Find the inner fences, but do not plot them
- Upper fence: Q3 + 1.5 × IQR
- Lower fence: Q1 – 1.5 × IQR
- 4. Grow whiskers to the most extreme values within the fences
- 5. Show outliers
- Convention uses ○ for outliers between the inner and outer fences and * for far outliers beyond the outer fences (the outer fences sit 3 × IQR from the quartiles)
- 6. Use Excel
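The boxplot steps can be sketched in code as follows. The dataset and the function name `boxplot_summary` are hypothetical, and the quartiles use the "inclusive" convention of Python's `statistics.quantiles`, which may differ slightly from Excel or a textbook rule.

```python
import statistics


def boxplot_summary(data):
    """Return quartiles, fences, whisker ends and outliers for a boxplot."""
    values = sorted(data)
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


# Hypothetical observations, for illustration only.
summary = boxplot_summary([10, 20, 25, 35, 40, 45, 85])
print(summary)
```

In this made-up dataset the value 85 lies above the upper fence, so the upper whisker stops at 45 and 85 is drawn as an outlier.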
Effect of Changing Values Activity
Suppose you’ve drawn a boxplot with the following data: Min = 10; Q1 = 20; Med = 35; Q3 = 45; Max = 85. There was an error and the max was actually only 75. How does this affect:
- The mean?
- The median?
- The range?
- The IQR?
Scatterplots, Correlation and Linear Regression
- Correlation (r): How strong is the linear clustering around a line?
- Only for quantitative data with a linear pattern
- Use “association” for categorical variables, as it is the more general term
- -1 ≤ r ≤ 1
- Correlation is unitless
- The correlation of variables X and Y = the correlation of variables Y and X
- Correlation does not necessarily imply causation.
- Lurking variables: a third variable that causes both X and Y
- Extrapolation: extending results beyond the range of data provided
Lurking Variables and Extrapolation
How to find a regression line
- Slope of the estimated regression equation: $b_1 = r \left( \dfrac{s_y}{s_x} \right)$
- Predicted value from regression equation: $\hat{y} = b_0 + b_1 x$
- R-Squared ($r^2$) is the % of variation in the y value that the model can explain
- Always takes on values from 0 to 1 inclusive
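As a rough sketch, the regression line can be built from summary statistics alone: the slope is r times (s_y / s_x). The intercept formula used here, b0 = ȳ − b1·x̄, is the standard least-squares result and is an addition not shown on the slide. All numbers and function names are hypothetical.

```python
def regression_from_summary(r, mean_x, sd_x, mean_y, sd_y):
    """Least-squares line from the correlation, means and standard deviations."""
    slope = r * sd_y / sd_x
    # The least-squares line passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    return intercept + slope * x


# Hypothetical summary statistics, for illustration only.
b0, b1 = regression_from_summary(r=0.8, mean_x=10, sd_x=2, mean_y=50, sd_y=5)
r_squared = 0.8 ** 2  # share of the variation in y explained by the model

print(b0, b1, predict(b0, b1, 12), r_squared)
```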
Residual Plots
- Residual = Observed – Predicted at each point
- Good fit if there is a symmetric horizontal band around residual = 0
- If the pattern is curved, then the data do not follow a linear trend
- If the residuals form a linear trend, there’s either a problem with your algebra or you need to transform the observations (for example, take a root)
Residual Plots - Example
Homoscedastic = Constant variance around the model. This is good.
Heteroscedastic = Non-constant variance around the model. This violates several assumptions about linear inference later on.
Bias = Residuals form a line/curve pattern around residual = 0. Indicator that the linear model is not a good fit OR the data should be transformed.
Correlation and Linear Regression Activity
- Is there a relationship between an NFL team’s total spending (in millions) on player salary and its league performance? A linear model predicting Wins $w$ (out of 16 regular season games) from salary spending $s$ is shown below: $\hat{w} = -16.32 + 0.219s$
- A. What is the explanatory variable? What is the response variable?
- B. What does the slope mean? What does the y-intercept mean?
- C. Does a team that spends 130 million and wins 13 games over- or underperform the model’s prediction?
- D. The residual SD is 3 games. How practical is this model?
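For part C, a short calculation with the stated model (predicted wins = −16.32 + 0.219 × salary in millions) compares the prediction with the observed 13 wins. The function name is illustrative.

```python
INTERCEPT = -16.32   # from the stated model
SLOPE = 0.219        # extra predicted wins per additional million spent


def predicted_wins(salary_millions):
    return INTERCEPT + SLOPE * salary_millions


predicted = predicted_wins(130)
residual = 13 - predicted  # observed minus predicted

print(round(predicted, 2), round(residual, 2))
```

A positive residual means the team won more games than the model predicts. With a residual SD of 3 games, a gap of under one game is well within the model's typical error.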
Combining Random Variables
- E(X) = Expected value of a random variable (think mean)
- σ = Standard deviation of a random variable
- E(X ± Y) = E(X) ± E(Y). Does not require independence
- E(aX) = aE(X)
- Var(aX) = a²Var(X)
- SD(aX) = |a|SD(X)
- Var(X ± Y) = Var(X) + Var(Y). Requires independence!
- SD(X ± Y) = √(Var(X) + Var(Y)). Requires independence!
Combining Random Variables Activity
Variable B has an expected value of 9.6 and an SD of 0.8. Variable C has an expected value of 30 and SD of 2.2. Find E(24B + C) and SD(24B + C)
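A sketch of the calculation, under two assumptions that the question does not state explicitly: B and C are independent, and 24B means a single value of B multiplied by 24 (not the sum of 24 separate draws of B, which would have a smaller variance).

```python
import math

E_B, SD_B = 9.6, 0.8
E_C, SD_C = 30.0, 2.2
a = 24

# Expected values combine linearly, with or without independence.
expected = a * E_B + E_C

# Variances add only under independence; Var(aB) = a^2 * Var(B).
variance = (a ** 2) * SD_B ** 2 + SD_C ** 2
sd = math.sqrt(variance)

print(round(expected, 1), round(variance, 2), round(sd, 2))
```

Standard deviations are never added directly; convert to variances, add, then take the square root.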
Normal Distribution + Empirical Rule
- What is the total area under the curve?
- Does this curve extend beyond 3 SD?
- Informally known as a “bell” curve
- Standard deviations are measures of spread
- We use z-scores to standardize different observations
Finding Probabilities Activity
Calculate $z = \dfrac{x - \mu}{\sigma}$, or use NORM.DIST(x, mu, sigma, TRUE)
Given $\mu = 85$ and $\sigma = 15$, calculate the following:
- 1. P(X < 90)
- 2. P(X > 105)
- 3. P(80 < X < 100)
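The same probabilities can be checked outside Excel. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8+), whose `cdf` plays the role of NORM.DIST with the cumulative flag set to TRUE, for a mean of 85 and a standard deviation of 15.

```python
from statistics import NormalDist

scores = NormalDist(mu=85, sigma=15)

# 1. Area to the left of 90.
p_below_90 = scores.cdf(90)

# 2. Area to the right of 105 is the complement of the left area.
p_above_105 = 1 - scores.cdf(105)

# 3. Area between two values is the difference of two left areas.
p_between = scores.cdf(100) - scores.cdf(80)

print(round(p_below_90, 4), round(p_above_105, 4), round(p_between, 4))
```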
Finding X Activity
Use NORM.INV(p, mu, sigma) to return the value of x such that the area to the left will have the value p.
Given a test where $\mu = 75$ and $\sigma = 9$, calculate the following:
- 1. What score will put you in the top 5% of the class?
- 2. What score will put you in the bottom 30%?
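As a cross-check, `NormalDist.inv_cdf` in the Python standard library behaves like NORM.INV: it takes the area to the left and returns the cutoff. The sketch assumes the stated mean of 75 and standard deviation of 9. The top 5% corresponds to a left area of 0.95.

```python
from statistics import NormalDist

test_scores = NormalDist(mu=75, sigma=9)

# Top 5% of the class: 95% of the area lies to the left of the cutoff.
top_5_cutoff = test_scores.inv_cdf(0.95)

# Bottom 30%: 30% of the area lies to the left of the cutoff.
bottom_30_cutoff = test_scores.inv_cdf(0.30)

print(round(top_5_cutoff, 2), round(bottom_30_cutoff, 2))
```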
Central Limit Theorem
- The mean of a random sample has a sampling distribution that is approximated by a normal distribution
- Larger samples = better approximation!
- Has implications for probabilities for samples of proportions and means
Sampling Distributions for Proportions
- Only for binary categorical data
- Sample proportion is $\hat{p}$, population proportion is $p$
- $SD(\hat{p}) = \sqrt{\dfrac{pq}{n}}$, where $q = 1 - p$
- $z = \dfrac{\hat{p} - p}{SD(\hat{p})}$
- Assumptions: 10% condition, success/failure condition, independence, sample size
Sampling Distributions for Means
- Only for quantitative data
- Sample mean is $\bar{x}$, population mean is $\mu$
- Standard Error: $SD(\bar{x}) = \dfrac{\sigma}{\sqrt{n}}$
- $z = \dfrac{\bar{x} - \mu}{SD(\bar{x})}$
- If the population is normal, the sampling distribution of the sample mean is normal
- If the population is not normal, but conditions are met, then the sampling distribution will be approximately normal by the central limit theorem (same conditions as for proportions)
Confidence Intervals
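Estimate ± (Critical Value × SD(Estimate)); the term after ± is the margin of error (ME)
Proportions: $\hat{p} \pm z^* \sqrt{\dfrac{\hat{p}\hat{q}}{n}}$. To choose n, the shortcut is $n = \dfrac{(z^*)^2 \hat{p}\hat{q}}{ME^2}$
Means: $\bar{x} \pm t^*_{n-1} \dfrac{s}{\sqrt{n}}$, where the sample standard deviation $s$ estimates $\sigma$. Otherwise use CONFIDENCE.T(alpha, sd, n), which returns the margin of error
To reduce the size of a CI, choose a lower critical value (a lower confidence level) or a higher n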
Confidence Intervals Question
A 95% confidence interval for the mean number of televisions per American household is (1.15, 4.20). For each of the following statements about the above confidence interval, choose true or false.
a) The probability that μ is between 1.15 and 4.20 is .95.
b) We are 95% confident that the true mean number of televisions per American household is between 1.15 and 4.20.
c) 95% of all samples should have x-bars between 1.15 and 4.20.
d) 95% of all American households have between 1.15 and 4.20 televisions.
e) Of 100 intervals calculated the same way (95%), we expect 95 of them to capture the population mean.
f) Of 100 intervals calculated the same way (95%), we expect 100 of them to capture the sample mean.
Confidence Interval Proportion Question
Suppose I take a SRS of 30 students at UBC, and calculate the sample proportion of male students in the sample is 0.4. What is a 95% confidence interval for this estimate? What sample size should I choose if I want my margin of error to be no more than 5%?
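A hedged sketch of both parts, using the one-proportion z-interval and the sample-size shortcut n = (z*)² p̂q̂ / ME². It assumes the usual conditions hold (here 30 × 0.4 = 12 successes and 18 failures) and uses the sample proportion 0.4 as the planning value for the second part. The function names are illustrative.

```python
import math
from statistics import NormalDist


def proportion_ci(p_hat, n, confidence=0.95):
    """One-proportion z-interval: p_hat +/- z* sqrt(p_hat * q_hat / n)."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def sample_size_for_margin(p_hat, margin, confidence=0.95):
    """Smallest n whose margin of error is at most `margin`."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z_star ** 2 * p_hat * (1 - p_hat) / margin ** 2)


low, high = proportion_ci(0.4, 30)
needed_n = sample_size_for_margin(0.4, 0.05)

print(round(low, 3), round(high, 3), needed_n)
```

The sample size is rounded up because rounding down would leave the margin of error slightly above the 5% target.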
Confidence Interval Means Questions
Given the summary statistics below, compute a 95% confidence interval for the temperature of an average person in Fahrenheit:

| Variable | N   | Mean   | Median | StDev | SE Mean |
|----------|-----|--------|--------|-------|---------|
| TEMP     | 130 | 98.249 | 98.300 | 0.733 | 0.064   |
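A minimal sketch of the t-interval for these summary statistics (n = 130, mean 98.249, standard deviation 0.733). Python's standard library has no t quantile function, so the critical value for 129 degrees of freedom is hard-coded as approximately 1.979; treat that constant as an assumption taken from a t-table or from Excel's T.INV.2T(0.05, 129).

```python
import math

N = 130
MEAN = 98.249
SD = 0.733

# Approximate t critical value for 95% confidence with 129 degrees of
# freedom (assumed from a t-table; not computed here).
T_STAR = 1.979

standard_error = SD / math.sqrt(N)
margin = T_STAR * standard_error
interval = (MEAN - margin, MEAN + margin)

print(round(standard_error, 3), round(interval[0], 3), round(interval[1], 3))
```

The computed standard error matches the "SE Mean" column of the summary output after rounding.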
Hypothesis Testing – 5 Easy Steps
1. Specify the Null Hypothesis – This is about the population
A. Null implies no difference, relationship or effect. Think of “Not Guilty”
B. For example, H0: μ = 90
2. Specify the Alternative Hypothesis – Also about the population
A. Alternative is the statement that there is an effect or difference. Think of “Guilty!”
B. For example, HA: μ ≠ 90
C. Conclusions drawn from a two-sided test are stronger and can be applied to the one-sided test, but not the reverse
3. Set the Significance Level (α)
4. Calculate the test statistic and corresponding p-value from your sample
5. If the p-value is less than α, reject the null hypothesis. Otherwise, fail to reject
Types of Errors
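Type I Error = False Positive (rejecting a true null hypothesis)
Type II Error = False Negative (failing to reject a false null hypothesis)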
Hypothesis Testing T/F Question Set
1. Rejecting the null hypothesis proves the alternative hypothesis.
2. The level of significance, or alpha, is a measure of the amount of risk the statistician will accept when making a decision.
3. Choosing alpha = 0.05 is equivalent to saying we are 95% confident the results occurred by chance.
4. Assuming a healthy diagnosis is normal, failing to prescribe medication to a sick patient is an example of a Type II error.
5. The test HA: x > 0 is a left-tailed test.
Hypothesis Testing Proportions Activity
The standard treatment for a disease works in 0.675 of all patients. A new treatment is proposed. Is it better? (The scientists who created it claim it is. You – as an advocate for a patient or sales personnel for the company that eventually would market it – must be more skeptical. Where’s the data?) An initial clinical trial of n = 100 patients (of similar general health) is conducted: 77 people are cured. Assume the selection of patients is random. Test appropriate hypotheses at the 1% significance level.
Hypothesis Testing Proportions Activity – Solutions
Hypotheses: H0: p = 0.675 versus HA: p > 0.675. The test statistic is z = 2.03. The critical value is 2.326. We do not reject H0. There is not sufficient sample evidence to support the claim that the new treatment cures over 0.675 of all patients. This difference (0.77 vs. 0.675) is not statistically significant at the 1% level.
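The solution's numbers can be reproduced with a one-proportion z-test, sketched below for H0: p = 0.675 against the one-sided alternative p > 0.675, with 77 cures in 100 patients and a 1% significance level. Variable names are illustrative.

```python
import math
from statistics import NormalDist

P0 = 0.675        # cure rate of the standard treatment (null value)
N = 100
CURED = 77
ALPHA = 0.01

p_hat = CURED / N
# The standard deviation uses the null value p0, not p_hat.
sd_p_hat = math.sqrt(P0 * (1 - P0) / N)
z = (p_hat - P0) / sd_p_hat

# Right-tailed test: the p-value is the area above z.
p_value = 1 - NormalDist().cdf(z)
critical_value = NormalDist().inv_cdf(1 - ALPHA)
reject_null = p_value < ALPHA

print(round(z, 2), round(critical_value, 3), round(p_value, 4), reject_null)
```

The p-value is a little above 0.02, so the result would be significant at the 5% level but not at the 1% level required here.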
Hypothesis Testing Means Activity
The population standard deviation for waiting times to be seated at a restaurant is known to be 10 minutes. An expensive restaurant claims that the average waiting time for dinner is approximately 1 hour, but we suspect that this claim is inflated to make the restaurant appear more exclusive and successful. A random sample of 30 customers yielded a sample average waiting time of 50 minutes. Is there evidence to say that the restaurant’s claim is too high?
The original data (individual waiting times) is not normally distributed. What theorem allows us to do the previous calculations?
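One possible setup is a left-tailed z-test of H0: μ = 60 minutes against HA: μ < 60, using the known population standard deviation of 10 minutes and the sample mean of 50 from n = 30. The significance level of 0.05 is an assumption, since the question does not state one. The normal model for the sample mean is justified by the central limit theorem.

```python
import math
from statistics import NormalDist

MU_0 = 60      # claimed mean wait in minutes (null value)
SIGMA = 10     # known population standard deviation
N = 30
X_BAR = 50
ALPHA = 0.05   # assumed; not given in the question

standard_error = SIGMA / math.sqrt(N)
z = (X_BAR - MU_0) / standard_error

# Left-tailed test: the p-value is the area below z.
p_value = NormalDist().cdf(z)
reject_null = p_value < ALPHA

print(round(z, 3), p_value, reject_null)
```

A z-score more than five standard errors below the claimed mean gives a p-value far below any common significance level, so the sample gives strong evidence that the claimed one-hour wait is too high.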