1
Quantitative analysis with statistics (and ponies)
(Some slides, pony-based examples from Blase Ur)
2
- Interviews, diary studies
- Start stats
- Thursday: Ethics/IRB
- Tuesday: More stats
- New homework is available
3
INTERVIEWS
4
Why an interview
- Rich data (from fewer people)
- Good for exploration
– When you aren’t sure what you’ll find
– Helps identify themes, gain new perspectives
- Usually cannot generalize quantitatively
- Potential for bias (conducting, analyzing)
- Structured vs. semi-structured
5
Interview best practices
- Make participants comfortable
- Avoid leading questions
- Support whatever participants say
– Don’t make them feel incorrect or stupid
- Know when to ask a follow-up
- Get a broad range of participants (hard)
6
Try it!
- In pairs, write two interview questions about
password security/usability
- Change partners with another pair and ask each
other; report back
7
DIARY STUDIES
8
Why do a diary study?
- Rich longitudinal data (from a few participants)
– In the field … ish
- Natural reactions and occurrences
– Existence and quantity of phenomena
– User reactions in the moment rather than via recall
- Lots of work for you and your participants
- On paper vs. technology-mediated
9
Experience sampling
- Kind of a prompted diary
- Send participants a stimulus when they are in
their natural life, not in the lab
10
Diary / ESM best practices
- When will an entry be recorded?
– How often? Over what time period?
- How long will it take to record an entry?
– How structured is the response?
- Pay well
– Pay per response? But don’t create bias
11
Facebook regrets (Wang et al.)
- Online survey, interviews, diary study, 2nd survey
- What do people regret posting? Why?
- How do users mitigate?
12
FB regrets – Interviews
- Semi-structured, in-person, in-lab
- Recruiting via Craigslist
– Why a pre-screening questionnaire?
– 19 of 301 respondents were interviewed
- Coded by a single author for high-level themes
13
FB regrets – Diary study
- “The diary study did not turn out to be very
useful”
- Daily online form (30 days)
– Facebook activities, incidents
– “Have you changed anything in your privacy settings? What and why?”
– “Have you posted something on Facebook and then regretted doing it? Why and what happened?”
– 22+ days of entries: $15
– 12/19 interviewees entered 1+ logs (217 total logs)
14
Location-sharing (Consolvo et al.)
- Whether and what about location to disclose
– To people you know
- Preliminary interview
– Buddy list, expected preferences
- Two-week ESM (simulated location requests)
- Final interview to reflect on experience
15
ESM study
- Whether to disclose or not, and why
– Customized askers, customized context questions
– If so, how granular?
– Where are you and what are you doing?
– One-time or standing request
- $60-$250 to maximize participation
- Average response rate: above 90%
16
Statistics for experimental comparisons
- The main idea: Hypothesis testing
- Choosing the right test: Comparisons
- Regressions
- Other stuff
– Non-independence, directional tests, effect size
- Tools
17
OVERVIEW
What’s the big idea, anyway?
18
Statistics
- In general: analyzing and interpreting data
- We often mean: Statistical hypothesis testing
– Question: Are two things different?
– Is it unlikely the data would look like this unless there is actually a difference in real life?
19
Important note
- This lecture is not going to be precise or
complete. It is intended to give you some
intuition and help you understand what questions to ask.
20
The prototypical case
- Q: Do ponies who drink more caffeine make better passwords?
- Experiment: Recruit 30 ponies. Give 15 caffeine
pills and 15 placebos. They all create passwords.
http://www.fanpop.com/clubs/my-little-pony-friendship-is-magic/images/33207334/title/little-pony-friendship-magic-photo
21
Hypotheses
- Null hypothesis: There is no difference.
Caffeine does not affect pony password strength.
- Alternative hypothesis: There is a difference.
Caffeine affects pony password strength.
- Note what is not here (more on this later):
– Which direction is the effect?
– How strong is the effect?
22
Hypotheses, continued
- Statistical test gives you one of two answers:
- 1. Reject the null: We have (strong) evidence the
alternative is true.
- 2. Don’t reject the null: We don’t have (strong)
evidence the alternative is true.
- Again, note what isn’t here:
– We have strong evidence the null is true. (NOPE)
23
P values
- What is the probability that the data would look
like this if there’s no actual difference?
– i.e., Probability we tell everyone about ponies and caffeine but it isn’t really true
- Most often, α = 0.05; some people choose 0.01
– If p < 0.05, reject null hypothesis; there is a “significant” difference between caffeine and placebo
– A p-value is not magic, just probability, and the threshold is arbitrary
– But, reported TRUE or FALSE: You don’t say something is “more significant” because the p-value is lower
24
Type II Error (False negative)
- There is a difference, but you didn’t find evidence
– No one will know the power of caffeinated ponies
- Hypothesis tests DO NOT BOUND this error
- Instead, statistical power is the probability of
rejecting the null hypothesis if you should
– Requires that you estimate the effect size (hard)
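Statistical power can be estimated by simulation: repeatedly generate data with a true difference of an assumed effect size, run the test, and count how often the null is rejected. A minimal sketch (the effect size of 0.8 and group size of 15 are assumptions for illustration, not from the slides):

```python
import numpy as np
from scipy import stats

def estimated_power(effect_size, n_per_group, alpha=0.05, n_sims=2000, seed=0):
    """Estimate the power of a two-sample t-test by simulation."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_group)
        # A true difference exists in the simulated world:
        treatment = rng.normal(effect_size, 1.0, n_per_group)
        _, p = stats.ttest_ind(control, treatment)
        if p < alpha:
            rejections += 1
    return rejections / n_sims

power = estimated_power(effect_size=0.8, n_per_group=15)
```

One minus this fraction approximates the Type II error rate under the assumed effect size, which is exactly why the effect-size estimate is the hard part.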
25
- After an experiment, one of four things has
happened (total P=1).
- Which box are you in? You don’t know.
Hypotheses, power, probability
PROBABILITY              You rejected the null           You didn’t
Reality: Difference      Estimated via power analysis    ?
Reality: No difference   Bounded by α                    ?
26
Correlation and causation
- Correlation: We observe that two things are
related
Do rural or urban ponies make stronger passwords?
- Causation: We randomly assigned participants
to groups and gave them different treatments
– If designed properly
Do password meters help ponies?
27
CHOOSING THE RIGHT TEST
28
What kind of data do you have?
- Explanatory variables: inputs, x-values
– e.g., conditions, demographics
- Outcome variables: outputs, y-values
– e.g., time taken, Likert responses, password strength
29
http://i196.photobucket.com/albums/aa92/karina408_album/Wallpaper-53.jpg
What kind of data do you have?
- Quantitative
– Discrete (Number of caffeine pills taken by each pony)
– Continuous (Weight of each pony)
- Categorical
– Binary (Is it or isn’t it a pony?)
– Nominal: No order (Color of the pony)
– Ordinal: Ordered (Is the pony super cool, cool, a little cool, or uncool?)
30
What kind of data do you have?
- Does your dependent data follow a normal
distribution? (You can calculate this!)
– If so, use parametric tests.
– If not, use non-parametric tests.
- Are your data independent?
– If not, repeated-measures, mixed models, etc.
http://www.wikipedia.org
31
If both are categorical ….
- Participants each used one of two systems
– Did they like the system they got? (Yes/no)
- HA: System affects user sentiment
- Use (Pearson’s) χ2 (Chi-squared) test of
independence.
– If any cell has fewer than 5 data points, use Fisher’s Exact Test (which also works with lots of data)
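A quick sketch of both tests in SciPy; the contingency counts below are invented for illustration (rows are the two systems, columns are liked/disliked):

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows = which system a participant used,
# columns = liked it (yes / no).
table = np.array([[30, 10],
                  [18, 22]])

# Pearson's chi-squared test of independence.
chi2, p, dof, expected = stats.chi2_contingency(table)

# If any expected cell count is below 5, prefer Fisher's exact test
# (SciPy's implementation handles 2x2 tables).
odds_ratio, p_fisher = stats.fisher_exact(table)
```

`expected` holds the cell counts you would expect under independence, which is what you inspect when deciding whether the fewer-than-5 rule applies.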
32
Contingency tables
- Rows one variable,
columns the other
- Example:
– Row = condition – Column = true/false
- χ2 = 97.013, df = 14, p = 1.767e-14
33
Explanatory: categorical Outcome: continuous ….
- Participants each used one system
– Measure a continuous value (time taken, pwd guess #)
- HA: System affects password strength
- Normal, continuous outcome (compare mean):
– 2 conditions: T-test
– 3+ conditions: ANOVA
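Both parametric comparisons in SciPy, on invented normally distributed strength scores (the group means and spreads are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical password-strength scores per condition.
caffeine = rng.normal(12.0, 2.0, 15)
placebo  = rng.normal(10.0, 2.0, 15)
nyquil   = rng.normal(10.0, 2.0, 15)

# Two conditions: independent-samples t-test comparing the means.
t_stat, p_two = stats.ttest_ind(caffeine, placebo)

# Three or more conditions: one-way ANOVA (the omnibus test).
f_stat, p_omni = stats.f_oneway(caffeine, placebo, nyquil)
```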
34
Explanatory: categorical Outcome: continuous ….
- Non-normal outcome, ordinal outcome
– Does one group tend to have larger values?
– 2 conditions: Mann-Whitney U (AKA Wilcoxon rank-sum)
– 3+ conditions: Kruskal-Wallis
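The non-parametric counterparts in SciPy, on invented ordinal ratings (e.g., 1-5 perceived-strength scores):

```python
from scipy import stats

# Hypothetical ordinal outcomes per condition.
group_a = [1, 2, 2, 3, 3, 3, 4]
group_b = [2, 3, 4, 4, 4, 5, 5]
group_c = [1, 1, 2, 2, 3, 3, 4]

# Two conditions: Mann-Whitney U (Wilcoxon rank-sum).
u_stat, p_mwu = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Three or more conditions: Kruskal-Wallis omnibus test.
h_stat, p_kw = stats.kruskal(group_a, group_b, group_c)
```

These tests only use the ranks of the values, which is why they suit ordinal and non-normal data.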
35
Outcome: Length of password
36
What about Likert-scale data?
- Respond to the statement: Ponies are magical.
– 7: Strongly agree
– 6: Agree
– 5: Mildly agree
– 4: Neutral
– 3: Mildly disagree
– 2: Disagree
– 1: Strongly disagree
37
What about Likert-scale data?
- Some people treat it as continuous (not good)
- Other people treat it as ordinal (better!)
– Difference 1-2 ≠ 2-3
– Use Mann-Whitney U / Kruskal-Wallis
- Another good option: binning (simpler)
– Transform into binary “agree” and “not agree”
– Use χ2 or FET
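A sketch of the binning option: collapse invented 7-point responses into agree (5-7) vs. not agree (1-4), then test the resulting counts (Fisher's exact test here, since the cells are small):

```python
import numpy as np
from scipy import stats

# Hypothetical 7-point Likert responses for two conditions (7 = strongly agree).
likert_a = [7, 6, 6, 5, 4, 3, 6, 5, 7, 2]
likert_b = [4, 3, 2, 5, 3, 2, 1, 4, 3, 2]

def bin_agree(responses):
    """Collapse 7-point responses into [agree (5-7), not agree (1-4)] counts."""
    agree = sum(1 for r in responses if r >= 5)
    return [agree, len(responses) - agree]

table = np.array([bin_agree(likert_a), bin_agree(likert_b)])
odds, p = stats.fisher_exact(table)  # small cells, so FET rather than chi-squared
```

Where you draw the line (e.g., whether 5 "mildly agree" counts as agreement) is an analysis decision to make before collecting data.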
38
[Figure: “Password meter annoying” ratings across the 15 conditions (no meter, baseline meter, three-segment, green, tiny, huge, no suggestions, text-only, bunny, half-score, one-third-score, nudge-16, nudge-comp8, text-only half-score, bold text-only half-score), grouped as Visual, Scoring, Visual & Scoring, and Control]
39
Notes for study design
- Plan your analysis before you collect data!
– What explanatory, outcome variables?
– Which tests will be appropriate?
- Ensure that you collect what you need and know
what to do with it
– Otherwise your experiment may be wasted
40
CONTRASTS
41
Contrasts
- If you have more than two conditions,
– H0 = “the conditions are all the same”
– HA = “the conditions are not all the same”
– “Omnibus test”
- If you fail to reject the null, you are done
- ONLY if you reject this null, you may compare
individual conditions to each other
– AKA “Pairwise”
42
Example:
- Password meters: 15 conditions
– Does assigned meter affect password strength?
– Omnibus test: yes
– Individual meter: Better than no meter?
– One meter better than another meter?
43
P values and multiple testing
- P-values bound Type I error (false positive)
– You expect this to happen 5% of the time if α = 0.05
- What happens if you conduct a lot of statistical
tests in one experiment?
- Your cumulative probability of a Type I error can
increase dramatically!
44
Correcting p-values
- Goal: Adjust the math so your overall Type I
error remains bounded by α = 0.05
- Many methods for “correcting” p values
– Bonferroni correction: Easy but conservative (Multiply p values by the number of tests)
– Holm-Bonferroni is also frequently used
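Both corrections are easy to compute by hand; a pure-Python sketch (the raw p-values are invented):

```python
def bonferroni(p_values):
    """Bonferroni: multiply each p-value by the number of tests (cap at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def holm_bonferroni(p_values):
    """Holm-Bonferroni step-down: less conservative, still bounds Type I error."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        adj = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, adj)  # enforce monotonic adjusted p-values
        adjusted[i] = running_max
    return adjusted

raw = [0.01, 0.04, 0.03, 0.20]
corrected_bonf = bonferroni(raw)        # multiply everything by 4
corrected_holm = holm_bonferroni(raw)   # smaller penalties for larger p-values
```

After correction you compare the adjusted p-values to α = 0.05 as usual; note how Holm leaves more of them under the threshold than plain Bonferroni.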
45
Planned vs. Unplanned Contrasts
- N-1 free planned contrasts
– Actually, really planned. No peeking at the data.
- Additional contrasts (planned or unplanned)
require p-correction for multiple testing
46
Contrasts in the meters paper
“We ran pairwise contrasts comparing each condition to our two control conditions, no meter and baseline meter. In addition, to investigate hypotheses about the ways in which conditions varied, we ran planned contrasts comparing tiny to huge, nudge-16 to nudge-comp8, half-score to one-third-score, text-only to text-only half-score, half-score to text-only half-score, and text-only half-score to bold text-only half-score.”
47
Continuous/ordinal data
48
Notes for study design
- Lots of conditions means lots of correction
– Which means you need big effect sizes or large N
- Consider limiting conditions
– What do you really want to test?
– Full-factorial or not?
49
CORRELATION, REGRESSION
Finding a relationship among variables
50
Correlation
- Measure two numeric values
– Are they related?
- Pearson correlation
– Requires both variables to be normal
– Only looks for a linear relationship
- Often preferred: Spearman’s rank correlation
coefficient (Spearman’s ρ)
– Evaluates a relationship’s monotonicity
– Both variables get larger together
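The difference shows up clearly on data that is monotone but not linear; a SciPy sketch with invented values (strength doubles at each caffeine step):

```python
from scipy import stats

# Hypothetical monotonic but non-linear relationship.
caffeine_mg = [0, 50, 100, 150, 200, 250]
strength    = [1, 2, 4, 8, 16, 32]  # doubles each step: monotone, not linear

r_pearson, p_pearson = stats.pearsonr(caffeine_mg, strength)
rho_spearman, p_spearman = stats.spearmanr(caffeine_mg, strength)
```

Spearman's ρ is exactly 1 here (the ranks move in lockstep), while Pearson's r falls short of 1 because the relationship is not a straight line.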
51
Regressions
- What is the relationship among variables?
– Generally one outcome (dependent variable)
– Often multiple factors (independent variables)
- The type of regression you perform depends on
the outcome
– Binary outcome: logistic regression
– Ordinal outcome: ordinal / ordered regression
– Continuous outcome: linear regression
52
Example regression
- Outcome:
– Pass pony quiz (or not): Logistic
– Total score on pony quiz: Linear
- Independent variables:
– Age of pony
– Number of prior races
– Diet: hay or pop-tarts (code as eatsHay=true/false)
– (Indicator variables for color categories)
– Etc.
53
What you get
- Linear: Outcome = ax1 + bx2 + c
– Score = 5*eatsHay - 3*age + 7
- Logistic: Coefficients are in log odds
– Intuition: probability of passing decreases with age, increases if ate hay, etc.
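A linear fit of exactly this form can be sketched with NumPy least squares; the pony data is simulated from the slide's illustrative model (Score = 5*eatsHay - 3*age + 7) plus noise, so the fitted coefficients should land near those values:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
eats_hay = rng.integers(0, 2, n).astype(float)  # 1 = hay, 0 = pop-tarts
age = rng.uniform(2, 20, n)

# Simulate scores from the slide's model, plus Gaussian noise.
score = 5 * eats_hay - 3 * age + 7 + rng.normal(0, 1, n)

# Design matrix: [eatsHay, age, intercept column].
X = np.column_stack([eats_hay, age, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, score, rcond=None)  # coef ~ [5, -3, 7]
```

Real analyses would use a statistics package that also reports standard errors and p-values per coefficient; this only recovers the point estimates.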
54
Interactions in a regression
- Normally, outcome = ax1 + bx2 + c + …
- Interactions account for situations when two
variables are not simply additive. Instead, their interaction impacts the outcome
– e.g., Maybe blue ponies, and only blue ponies, get a larger benefit from eating pop-tarts before the quiz
- Outcome = ax1 + bx2 + c + d(x1x2) + …
55
Example regression output
56
Notes for study design
- The more input variables in your regression, the
more data you will need to collect to get useful results
57
Try it! In groups of 2-3
- Does caffeine impact pony password strength?
– When strength = cracked or not cracked
– When strength = 0-100 scoring
– When strength = self-reported perception 1-5
– Compare caffeine, NyQuil, placebo
- Do gender, state of residence, and education
level impact pony password strength?
58
OTHER THINGS TO CONSIDER
Non-independence, directional testing, effect size
59
What if you have lots of questions?
- If we ask 40 privacy questions on a Likert scale,
how do we analyze this survey?
- One option: Add responses to get “privacy score”
– Make sure the scales are the same
– Reverse if needed (e.g., “personal privacy is important to me” vs. “I don’t care if companies sell my data”)
– Important: Verify that responses are correlated!
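Reverse-coding and summing is mechanical; a pure-Python sketch with one invented participant's 7-point responses (the question wordings and flags are illustrative):

```python
# Hypothetical 7-point Likert responses from one participant.
# reverse=True marks negatively phrased items (e.g., "I don't care if
# companies sell my data"), which must be flipped before summing.
responses = [
    (6, False),  # "Personal privacy is important to me"
    (2, True),   # "I don't care if companies sell my data"
    (7, False),
    (3, True),
]

SCALE_MAX = 7

def privacy_score(items):
    """Sum responses into one score, reverse-coding flagged items (1<->7, 2<->6, ...)."""
    return sum((SCALE_MAX + 1 - value) if reverse else value
               for value, reverse in items)

score = privacy_score(responses)  # 6 + 6 + 7 + 5 = 24
```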
60
Verifying correlation
- Usually preferred: Spearman’s rank correlation
coefficient (Spearman’s ρ)
– Evaluates a relationship’s monotonicity
– e.g., all variables get larger with privacy sensitivity
61
Another option: Factor analysis
- Evaluate underlying factors you are detecting
- You specify N, a number of factors
- Algorithm groups related questions (N groups)
– Each group is a factor
- Factor loadings measure goodness of correlation
– Questions loading primarily onto one factor are useful
62
In groups: Plan your analysis
- Does caffeine impact pony password strength?
– When strength = cracked or not cracked
– When strength = 0-100 scoring
– Compare caffeine, NyQuil, placebo
- Do gender, age, state of residence, and
education level impact pony privacy concern?
– Concerned vs. unconcerned
– Privacy “score” by adding 30 questions
63
Independence
- Why might your data not be independent?
– Non-independent sample (bad!)
– The inherent design of the experiment (ok!)
- Example: Same ponies make passwords, before
and after taking the caffeine pills
– Each pony cannot be independent of itself
64
Repeated measures
- AKA within subjects
– Measure the same participant multiple times
- Paired T-test
– Two samples per participant, two groups
- Repeated measures ANOVA
– More general
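A sketch of the paired t-test in SciPy, with invented before/after scores for the same eight ponies; the independent-samples test is shown alongside to illustrate what ignoring the pairing costs:

```python
from scipy import stats

# Same 8 ponies measured before and after caffeine (hypothetical scores).
before = [10.1, 9.8, 11.2, 10.5, 9.9, 10.8, 10.0, 11.0]
after  = [11.0, 10.2, 12.1, 11.4, 10.1, 11.9, 10.7, 12.2]

# Paired t-test: tests whether the per-pony differences center on zero.
t_stat, p_paired = stats.ttest_rel(before, after)

# Treating the samples as independent wastes the pairing (and power).
t_ind, p_ind = stats.ttest_ind(before, after)
```

Because each pony is its own control, the per-pony differences vary much less than the raw scores, so the paired test yields a smaller p-value on the same data.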
65
Hierarchy and mixed model
- For regressions, use a “mixed model”
- Intuition: Each pony’s result driven by combo of
individual skills, group characteristics, treatment effects
- Case 1: Many measurements of each pony
- Case 2: The ponies have some other relationship.
e.g., all ponies attended 1 of 5 security camps. (You want to control for this, but not evaluate it.)
66
Directional testing
- If your hypothesis goes one way:
Caffeinated ponies make stronger passwords.
- More power than more general tests
– BUT, must select direction BEFORE looking at data
– Won’t reject null if there’s a difference the other way
- Example: One-tailed T-test
- Use with caution!
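SciPy exposes the one-tailed t-test through the `alternative` parameter (available in SciPy 1.6 and later); the scores below are invented:

```python
from scipy import stats

# Hypothetical strength scores; the directional hypothesis
# (caffeinated > decaf) was chosen BEFORE seeing the data.
caffeinated = [12.1, 11.8, 13.0, 12.4, 11.9, 12.7, 13.2, 12.0]
decaf       = [11.0, 11.5, 10.8, 11.9, 11.2, 10.9, 11.6, 11.1]

# One-tailed test (requires SciPy >= 1.6 for the alternative parameter).
t_stat, p_one = stats.ttest_ind(caffeinated, decaf, alternative="greater")

# The matching two-sided test, for comparison.
t2, p_two = stats.ttest_ind(caffeinated, decaf)
```

When the observed difference is in the hypothesized direction, the one-tailed p-value is half the two-sided one, which is exactly the extra power (and the temptation) the slide warns about.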
67
Effect size
- Hypothesis test: Is there a difference?
- Also (more?) important: How big a difference?
- Findings can be “significant” but unimportant
Factor               Coef.    Exp(coef)   SE       p-value
login count          <0.001   1.000       <0.001   <0.001
password fail rate   -0.543   0.581       0.116    <0.001
gender (male)        -0.078   0.925       0.027    0.005
engineering          -0.273   0.761       0.048    <0.001
humanities           -0.107   0.898       0.054    0.048
public policy         0.079   1.082       0.058    0.176†
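For simple two-group comparisons, Cohen's d is a common standardized effect size: the difference in means divided by the pooled standard deviation. A pure-Python sketch with invented scores:

```python
import math

def cohens_d(group1, group2):
    """Cohen's d: standardized difference between two group means."""
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Hypothetical strength scores for two conditions.
d = cohens_d([12, 13, 11, 14, 12, 13], [10, 11, 9, 12, 10, 11])
```

By rough convention d around 0.2 is a small effect, 0.5 medium, and 0.8 large; a tiny d can still be "significant" with a huge sample, which is why effect size belongs next to the p-value.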
68
TOOLS
69
So how do I DO these tests?
- Excel: Very easy, but not very powerful
– Doesn’t have many useful tests
- R: Most powerful, steepest learning curve
– Like Matlab but for stats
– Somewhat bizarre language/API/data representation
– Free and open-source (awesome add-on packages)
- SPSS: Graphical, also quite powerful
– Expensive ($25 student license from Terpware)
– Somewhat scriptable, not as flexible as R
70
R tutorials
- http://www.statmethods.net
- http://cyclismo.org/tutorial/R/
71
Choosing a test
- http://webspace.ship.edu/pgmarr/Geo441/Statistical%20Test%20Flow%20Chart.pdf
- http://abacus.bates.edu/~ganderso/biology/resources/statistics.html
- http://bama.ua.edu/~jleeper/627/choosestat.html
- http://med.cmb.ac.lk/SMJ/VOLUME%203%20DOWNLOADS/Page%2033-37%20-%20Choosing%20the%20correct%20statistical%20test%20made%20easy.pdf
- http://fwncwww14.wks.gorlaeus.net/images/home/news/