slide-1
SLIDE 1

Course Business

  • 2 new datasets on CourseWeb
  • Midterm grades posted
  • Well done! Mean: 95%
  • Final project: Analyze your own data
  • I can supply simulated data if needed
  • In-class presentation: Last 2 weeks of class
  • Sign up on CourseWeb for a time slot – first come, first served!

  • Final paper: Due on CourseWeb last day of class
  • CourseWeb has more info. on requirements
slide-2
SLIDE 2

Distributed Practice

[Figure: three candidate plots (A, B, C) of EmployeeBurnout vs. YearsOnJob (2–10), each showing lines for Employees A, B, and C]

  • An I/O psychologist models EmployeeBurnout as a function of YearsOnJob, tracked longitudinally for each of 500 employees.
  • Which figure corresponds to the assumptions made by each of these model formulae?
  • EmployeeBurnout ~ 1 + YearsOnJob + (1|Employee)
  • EmployeeBurnout ~ 1 + poly(YearsOnJob, degree=2) + (1|Employee)
  • EmployeeBurnout ~ 1 + YearsOnJob + (1 + YearsOnJob|Employee)
slide-3
SLIDE 3

[Figure: the same three candidate plots (A, B, C) of EmployeeBurnout vs. YearsOnJob for Employees A, B, and C]

  • An I/O psychologist models EmployeeBurnout as a function of YearsOnJob, tracked longitudinally for each of 500 employees.
  • Which figure corresponds to the assumptions made by each of these model formulae?
  • EmployeeBurnout ~ 1 + YearsOnJob + (1|Employee)
  • EmployeeBurnout ~ 1 + poly(YearsOnJob, degree=2) + (1|Employee)
  • EmployeeBurnout ~ 1 + YearsOnJob + (1 + YearsOnJob|Employee)

Distributed Practice

Answers: B, C, A (matching the three formulae in order)

slide-4
SLIDE 4

Week 11: Missing Data

  • Unbalanced Factors
  • Rank Deficiency: Linear Combinations, Incomplete Designs
  • Empirical Logit
  • Missing Data (NA values)
  • Types of Missingness: Non-Ignorable, Ignorable, Summary
  • Possible Solutions: Casewise Deletion, Listwise Deletion, Unconditional Imputation, Conditional Imputation, Multiple Imputation

slide-5
SLIDE 5

Unbalanced Factors

  • Sometimes, we may have differing numbers of observations per level
  • Possible reasons:
  • Some categories are naturally more common
  • e.g., college majors
  • Categories may be equally common in the population, but we have sampling error
  • e.g., we ended up with 60% female participants, 40% male
  • The study was designed so that some conditions are more common
  • e.g., more “control” subjects than “intervention” subjects
  • We wanted equal numbers of observations, but lost some because of errors or exclusion criteria
  • e.g., data loss due to computer problems
  • Dropping subjects below a minimum level of performance
slide-6
SLIDE 6

Weighted Coding

  • “For the average student, does course size predict probability of graduation?”
  • Random sample of 200 Pitt undergrads
  • 5 are student athletes and 195 are not
  • How can we make the intercept reflect the “average student”?
  • We could try to apply effects coding to the StudentAthlete variable by centering around the mean and getting (0.5, -0.5), but…

slide-7
SLIDE 7

Weighted Coding

  • An intercept at 0 would no longer correspond to the overall mean
  • As a scale, this would be totally unbalanced
  • To fix the balance, we need to assign a heavier weight to Athlete

[Figure: a balance scale with Athlete (5) coded +.5 and Not Athlete (195) coded -.5; the mean of the codes falls at -.475, not 0 — “not athlete” is actually far more common]

slide-8
SLIDE 8

Weighted Coding

  • Change codes so the mean is 0
  • c(.975, -.025)
  • The contr.helmert.weighted() function in my psycholing package will calculate this

[Figure: Athlete (5) coded +.975, Not Athlete (195) coded -.025]
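The arithmetic behind these codes can be sketched in a few lines. This is an illustrative Python sketch of the computation (not the psycholing package itself): pick codes one unit apart whose group-size-weighted mean is zero, so the intercept lands on the mean of the individuals.

```python
# Sketch of weighted contrast coding for a two-level factor:
# choose codes one unit apart whose *weighted* mean is zero.
def weighted_codes(n_a, n_b):
    """Return (code_a, code_b) with n_a*code_a + n_b*code_b == 0
    and code_a - code_b == 1."""
    total = n_a + n_b
    return n_b / total, -n_a / total

codes = weighted_codes(5, 195)   # 5 athletes, 195 non-athletes
print(codes)                     # (0.975, -0.025)

# The weighted mean of the codes is (numerically) zero:
print(abs(5 * codes[0] + 195 * codes[1]) < 1e-9)  # True
```

With balanced groups (n_a == n_b) this reduces to the familiar (0.5, -0.5) coding.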
slide-9
SLIDE 9

Weighted Coding

  • Weighted coding: Change the codes so that the mean is 0 again
  • Used when the imbalance reflects something real
  • Like Type II sums of squares
  • “For the average student, does course size predict graduation rates?”
  • The average student is not a student athlete, and our answer to a question about the “average student” should reflect this!

slide-10
SLIDE 10

Unweighted Coding

  • But in many experimental or quasi-experimental designs, imbalance might not reflect anything “real”
  • We intended to present 50 multiplication problems and 50 addition problems, but our experiment was miscoded and presented 60 multiplications and 40 additions
  • It may have been sabotaged
  • We collected daily diary data from 30 single people and 30 married people, but data from one of the married subjects was lost

slide-11
SLIDE 11

Unweighted Coding

  • But in many experimental or quasi-experimental designs, imbalance might not reflect anything “real”

  • Accidental, not related to the real-world construct
  • Not meaningful
  • In fact, we’d like to get rid of it
slide-12
SLIDE 12

Unweighted Coding

  • But in many experimental or quasi-experimental designs, imbalance might not reflect anything “real”
  • Accidental, not related to the real-world construct
  • Not meaningful
  • In fact, we’d like to get rid of it
  • Retain the (-0.5, 0.5) codes
  • Weights the two conditions equally—because the imbalance isn’t meaningful
  • Like Type III sums of squares
  • Probably what you want for factorial experiments or quasi-experiments

slide-13
SLIDE 13

Unbalanced Factors: Another View

  • Gestures produced…
  • Average of the individuals (weighted): 5.5
  • Average of the group means (unweighted): 4.0
  • With perfectly balanced factors, these are identical!

[Figure: gesture counts per child — Typically Developing: 8, 7, 5, 8, 4, 10 (group mean: 7); ASD: 2, 0 (group mean: 1)]
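The two averages can be checked directly. A small Python sketch using the counts above (the two ASD values, 2 and 0, are reconstructed to be consistent with the group mean of 1 and the weighted mean of 5.5):

```python
# Two ways to average across unbalanced groups:
td  = [8, 7, 5, 8, 4, 10]   # typically developing children
asd = [2, 0]                # ASD (values reconstructed from the slide's means)
everyone = td + asd

# Weighted: the average individual, regardless of group
weighted = sum(everyone) / len(everyone)

# Unweighted: the average of the two group means
group_means = [sum(td) / len(td), sum(asd) / len(asd)]
unweighted = sum(group_means) / len(group_means)

print(weighted)    # 5.5
print(unweighted)  # 4.0
```

The two quantities diverge exactly because the groups are different sizes; with equal ns, both formulas give the same number.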

slide-14
SLIDE 14

Unbalanced Factors: Summary

  • Weighted coding: Change the codes so that the mean is 0 (e.g., contr.helmert.weighted())
  • Imbalance reflects something real
  • Differences in group sizes affect the conclusions
  • If the group sizes were different, we would want a different conclusion
  • Assumes we have measured the relative frequency in the population correctly
  • Characterizes the average individual
  • Unweighted coding: Keep as -0.5 and 0.5
  • Imbalance is an accident that we want to eliminate
  • Assumes differences in group sizes are irrelevant to the conclusions; same results if we re-ran with different sizes
  • Characterizes the average group
  • Most experimental or quasi-experimental designs
slide-15
SLIDE 15

Week 11: Missing Data


slide-16
SLIDE 16
  • Longitudinal (5-week) study of stress
  • Dependent measure: Concentration of cortisol, a stress hormone
  • Nanomoles per liter (nmol/L)
  • Personality, cognitive, environmental, clinical variables

stress.csv

slide-17
SLIDE 17

Distributed Practice!

  • Emil would like to examine cortisol as a function of both external (temperature in C) and internal factors (excitement seeking, on a scale of 1 to 7). The initial model:

model1 <- lmer(CortisolNMol ~ 1 + TempC + ExcitementSeeking + (1|Subject), data=stress)

yields the following results:

  • Decide what we can do to make the intercept more meaningful here. Then, implement it in R
  • Tip: Remember that the intercept would be the cortisol level when all the predictor variables = 0

slide-18
SLIDE 18

Distributed Practice!

  • Emil would like to examine cortisol as a function of both external (temperature in C) and internal factors (excitement seeking, on a scale of 1 to 7). The initial model:

model1 <- lmer(CortisolNMol ~ 1 + TempC + ExcitementSeeking + (1|Subject), data=stress)

yields the following results:

  • ExcitementSeeking=0 makes no sense when it’s on a 1 to 7 scale, so let’s center:
  • stress$Excitement.c <- scale(stress$ExcitementSeeking, center=TRUE, scale=FALSE)[,1]
  • Then, rerun the model with the centered variable
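What centering does to the variable can be sketched outside R as well. A minimal Python illustration with hypothetical excitement-seeking scores (not from stress.csv): subtract the mean so that 0 now refers to an average scorer instead of an impossible score of 0.

```python
# Mean-centering a 1-7 scale predictor, so the model's intercept
# refers to an average scorer rather than a (nonexistent) score of 0.
scores = [3, 5, 4, 6, 2, 4]            # hypothetical 1-7 ratings
mean = sum(scores) / len(scores)       # 4.0
centered = [x - mean for x in scores]

print(centered)       # [-1.0, 1.0, 0.0, 2.0, -2.0, 0.0]
print(sum(centered))  # 0.0  -- centered scores always sum to zero
```

The slope for the predictor is unchanged by centering; only the meaning of the intercept moves.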
slide-19
SLIDE 19

Distributed Practice!

  • Emil would like to examine cortisol as a function of both external (temperature in C) and internal factors (excitement seeking, on a scale of 1 to 7). The initial model:

model1 <- lmer(CortisolNMol ~ 1 + TempC + ExcitementSeeking + (1|Subject), data=stress)

yields the following results:

  • ExcitementSeeking=0 makes no sense when it’s on a 1 to 7 scale, so let’s center:
  • Or, another way to center:
  • stress$Excitement.c <- stress$ExcitementSeeking - mean(stress$ExcitementSeeking)

slide-20
SLIDE 20

Distributed Practice!

  • Emil would like to examine cortisol as a function of both external (temperature in C) and internal factors (excitement seeking, on a scale of 1 to 7). The initial model:

model1 <- lmer(CortisolNMol ~ 1 + TempC + ExcitementSeeking + (1|Subject), data=stress)

yields the following results:

  • An alternate strategy would be to subtract 1 from ExcitementSeeking so that 0 represents the lowest score rather than the average
  • Tells us something different, but is also statistically valid
slide-21
SLIDE 21

Distributed Practice!

  • Emil would like to examine cortisol as a function of both external (temperature in C) and internal factors (excitement seeking, on a scale of 1 to 7). The initial model:

model1 <- lmer(CortisolNMol ~ 1 + TempC + ExcitementSeeking + (1|Subject), data=stress)

yields the following results:

  • It is possible to center TempC here as well, but this variable at least has a meaningful 0 already (0 C is possible)

slide-22
SLIDE 22

Rank Deficiency

  • We also have some other measures
  • Let’s try adding temperature in Fahrenheit to the model (TempF)
  • model2 <- lmer(CortisolNMol ~ 1 + TempC + TempF + ExcitementSeeking + (1|Subject), data=stress)
  • This looks scary!
slide-23
SLIDE 23

Rank Deficiency

  • Our model:

E(Y_i(j)) = γ000 + γ100·x1_i(j) + γ200·x2_i(j)
(stress baseline + temperature in Celsius + temperature in Fahrenheit)

  • What does γ100 represent here?
  • The effect of a 1-unit change in degrees Celsius … while holding degrees Fahrenheit constant
  • This makes no sense—if C changes, so does F
  • In fact, one change perfectly predicts the other
  • F = (9/5)C + 32
  • Problem if one column is a linear combination of other(s)—it can be perfectly formed from other columns by adding, subtracting, multiplying, or dividing
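This dependency can be demonstrated numerically. The sketch below builds the design matrix [1, TempC, TempF] with made-up temperatures and computes its rank by Gaussian elimination (a small teaching implementation, not a library routine): because TempF = (9/5)·TempC + 32 exactly, the rank comes out as 2 rather than 3, which is what "rank deficient" means.

```python
# TempF is an exact linear function of TempC, so the columns of the
# design matrix [1, TempC, TempF] are linearly dependent and X'X has
# no inverse -- the OLS estimate (X'X)^{-1} X'y cannot be computed.
def matrix_rank(rows, tol=1e-9):
    """Rank via Gaussian elimination (teaching sketch)."""
    m = [list(r) for r in rows]
    rank = 0
    for col in range(len(m[0])):
        # find a pivot row with a nonzero entry in this column
        pivot = next((r for r in range(rank, len(m)) if abs(m[r][col]) > tol), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # eliminate this column from all rows below the pivot
        for r in range(rank + 1, len(m)):
            f = m[r][col] / m[rank][col]
            m[r] = [a - f * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

temp_c = [10.0, 15.0, 20.0, 25.0, 30.0]      # hypothetical temperatures
temp_f = [c * 9 / 5 + 32 for c in temp_c]    # F = (9/5)C + 32
X = [[1.0, c, f] for c, f in zip(temp_c, temp_f)]

print(matrix_rank(X))  # 2, not 3: the design matrix is rank deficient
```

Any full-rank check (e.g., on a matrix whose third column is random noise rather than a transform of the second) would return 3, and the regression would be computable again.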

slide-24
SLIDE 24

Rank Deficiency

  • Linear combinations result in a perfect correlation
  • Here: the correlation between TempC and TempF is exactly 1

[Figure: scatterplot of Temperature in Fahrenheit (30–90) against Temperature in Celsius (10–30); all points fall on a single straight line]
slide-25
SLIDE 25

Rank Deficiency

  • You: “R won’t do what I want! I need to fit this model, but I can’t! This program is broken!”
  • Scott’s view: This really isn’t a coherent research question
  • “Effect of changing C while holding F constant” is a nonsensical question—it doesn’t make sense to ask
  • We would have this same problem in any software package, or even if we computed the regression by hand

slide-26
SLIDE 26

Rank Deficiency

  • In fact, it’s mathematically impossible to perform this regression
  • A matrix in which some columns are linear combinations of others is “rank deficient”
  • If a matrix is rank-deficient, you can’t compute its inverse (because the inverse doesn’t exist)
  • But you need to calculate an inverse to perform linear regression: (X′X)⁻¹
  • Therefore, the regression can’t be performed
slide-27
SLIDE 27

Rank Deficiency: Solutions

  • Is it unexpected that columns would be related in this way?
  • Check your experiment script & data-processing pipeline (e.g., is something saved in the wrong column?)
  • Or is it expected? (TempF should be predictable from TempC):
  • Rethink the analysis—would it ever be sensible to try to distinguish these effects?
  • Maybe this is just not a coherent research question
  • Or, a different design might make these independent
slide-28
SLIDE 28

Rank Deficiency

  • We have some other personality variables:
  • Gregariousness
  • Assertiveness
  • Extraversion, the sum of the gregariousness, assertiveness, and excitement-seeking facets
  • Emil’s next model:
  • modelExtra <- lmer(CortisolNMol ~ 1 + TempC + Gregariousness + Assertiveness + ExcitementSeeking + Extraversion + (1|Subject), data=stress)
  • Where is the rank deficiency in this case?
slide-29
SLIDE 29

Rank Deficiency

  • “Problem if one column is a linear combination of other(s)—it can be perfectly formed from other columns by adding, subtracting, multiplying, or dividing”
  • Similar problem: One predictor variable is the mean or sum of other predictors in the model
  • We already know exactly what Extraversion is when we know the three facet scores
  • Doesn’t make sense to ask about the effect of Extraversion while holding constant Gregariousness, Assertiveness, and ExcitementSeeking

slide-30
SLIDE 30

Rank Deficiency

  • “Problem if one column is a linear combination of other(s)—it can be perfectly formed from other columns by adding, subtracting, multiplying, or dividing”
  • Similar problem: One predictor variable is the mean or sum of other predictors in the model
  • Can also get this with the average of other variables
  • Exam 1 score, Exam 2 score, Average score
  • Average = (Exam1 + Exam2) / 2
  • Again, can be perfectly predicted from the other columns

slide-31
SLIDE 31

Week 11: Missing Data


slide-32
SLIDE 32

Incomplete Designs

  • The last variable that Emil is interested in is anxiety
  • Two relevant columns:
  • Anxiety: Does this person have a diagnosis of GAD (Generalized Anxiety Disorder) or not?
  • Severity: Is the anxiety severe or not?
  • Let’s look at these two factors and their interaction:
  • model4 <- lmer(CortisolNMol ~ 1 + Anxiety * Severity + (1|Subject), data=stress)

slide-33
SLIDE 33

Incomplete Designs

  • Why is this model rank-deficient?
  • Some cells of the interaction are not represented in our current design:
  • xtabs( ~ Anxiety + Severity, data=stress)
  • No observations for people with No anxiety and Severe symptoms
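The same cross-tabulation idea can be sketched outside R. This Python snippet uses hypothetical counts (the real ones come from stress.csv) with the same empty cell the slide describes:

```python
# Cross-tabulating Anxiety by Severity, like xtabs() in R.
# Hypothetical cell counts with the slide's empty cell:
from collections import Counter

people = ([("No", "No")] * 40 +      # no anxiety, not severe
          [("Yes", "No")] * 35 +     # anxious, not severe
          [("Yes", "Yes")] * 25)     # anxious, severe
counts = Counter(people)

for anxiety in ("No", "Yes"):
    for severity in ("No", "Yes"):
        print(anxiety, severity, counts[(anxiety, severity)])
# ("No", "Yes") has a count of 0: with that cell empty, the
# Anxiety x Severity interaction cannot be estimated.
```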

slide-34
SLIDE 34

Incomplete Designs

  • Again, not a bug. Doesn’t make sense to ask about the Anxiety × Severity interaction here

[Figure: the three observed cells — Anxiety: No / Severity: No; Anxiety: Yes / Severity: No; Anxiety: Yes / Severity: Yes — with arrows marking the Anxiety effect and the Severity effect]
slide-35
SLIDE 35

Incomplete Designs

  • Again, not a bug. Doesn’t make sense to ask about the Anxiety × Severity interaction here
  • Interaction of Anxiety & Severity: “Effect of Anxiety and Severe anxiety over and above the effects of Anxiety alone and Severe anxiety alone.”
  • But there is no effect of severe anxiety “alone.” You have to have anxiety to have severe anxiety.

[Figure: the three observed cells — Anxiety: No / Severity: No; Anxiety: Yes / Severity: No; Anxiety: Yes / Severity: Yes]

slide-36
SLIDE 36

Incomplete Designs

  • In fact, it’s mathematically impossible to calculate an interaction here!
  • The interaction column is identical to the Severity column
  • We can’t distinguish them!
  • So, this is really just another example of the same linear-combination problem

Anxiety | Severity | Anxiety * Severity
0       | 0        | 0
1       | 0        | 0
1       | 1        | 1

(across the observed cells, the Severity and interaction columns always match)
slide-37
SLIDE 37

Incomplete Designs: Solutions

  • If the missing cell is intended:
  • Often makes more sense to think of this as a single factor with > 2 categories
  • “Not anxious,” “Moderate anxiety,” “Severe anxiety”
  • Can use Contrast 1 to compare Moderate & Severe to Not Anxious, and Contrast 2 to compare Moderate to Severe
  • If the missing cell is accidental:
  • Might need to collect more data!
  • Check your experimental lists
slide-38
SLIDE 38

Week 11: Missing Data


slide-39
SLIDE 39

Odds are not the same thing as probabilities!

Source Confusion

  • So far, we’ve handled cases where combinations of the predictor variables are missing
  • Another scenario: With a categorical DV, particular categories of the DV might be rare/non-existent
  • Overall or in certain conditions
  • A relevant categorical DV might be memory:
  • Lots of theoretical interest in source memory
  • Remembering the context or source where you learned something

WHO SAID IT?

slide-40
SLIDE 40

Source Confusion

  • sourceconfusion.csv: Source confusions in the cued recall task
  • Two independent variables:
  • AssocStrength: within-subjects but between-items
  • Strategy (maintenance or elaborative rehearsal): within-items but between-subjects
  • Here, items = WordPairs

Study: VIKING—COLLEGE, SCOTCH—VODKA
Test: VIKING—vodka (a source confusion)
slide-41
SLIDE 41

Source Confusion

  • Two independent variables:
  • AssocStrength: within-subjects but between-items
  • Strategy (maintenance or elaborative rehearsal): within-items but between-subjects
  • Here, items = WordPairs
  • Factorial design … apply the coding scheme you think would be most appropriate
  • Effects coding:
  • contrasts(sourceconfusion$AssocStrength) <- c(0.5, -0.5)
  • contrasts(sourceconfusion$Strategy) <- c(0.5, -0.5)

slide-42
SLIDE 42

Source Confusion

  • Two independent variables:
  • AssocStrength: within-subjects but between-items
  • Strategy (maintenance or elaborative rehearsal): within-items but between-subjects
  • Here, items = WordPairs
  • Now try to model SourceConfusion (0=no, 1=yes) as a function of these 2 variables & their interaction
  • Use the maximum random effects structure
  • Hint 1: Is this glmer() or lmer()?
  • Hint 2: Remember that the maximum random effects structure includes:
  • By-subjects slopes for only the within-subject variables
  • By-items slopes for only the within-item variables
slide-43
SLIDE 43

Source Confusion Model

  • Let’s model what causes people to make source confusions:
  • model.Source <- glmer(SourceConfusion ~ AssocStrength*Strategy + (1+AssocStrength|Subject) + (1+Strategy|WordPair), data=sourceconfusion, family=binomial)
  • This looks bad! ☹

slide-44
SLIDE 44

Low & High Probabilities

  • Problem: These are low-frequency events
  • In fact, lots of theoretically interesting things have low frequency
  • Clinical diagnoses that are not common
  • Various kinds of cognitive errors
  • Language production, memory, language comprehension…
  • Learners’ errors in math or other educational domains

slide-45
SLIDE 45

Low & High Probabilities

  • A problem for our model:
  • The model was trying to find the odds of making a source confusion within each study condition
  • But: Source confusions were never observed with elaborative rehearsal, ever!
  • How small are the odds? They are infinitely small in this dataset!
  • Note that not all failures to converge reflect low-frequency events. But when very low frequencies exist, they are likely to cause convergence problems.

slide-46
SLIDE 46

Low & High Probabilities

  • The logit is undefined if probability = 1
  • The logit is also undefined if probability = 0
  • log 0 is undefined: there is no exponent to which you can raise e to get 0

logit = log [ p(confusion) / (1 − p(confusion)) ]

  • If p(confusion) = 1: logit = log [ 1 / 0 ] — division by zero!
  • If p(confusion) = 0: logit = log [ 0 / 1 ] = log(0) — undefined

slide-47
SLIDE 47

Low & High Probabilities

  • When close to 0 or 1, the logit is defined but unstable

[Figure: logistic curve relating PROBABILITY of recall (0.0–1.0) to LOG ODDS of recall (−4 to 4)]

  • p(0.6) → 0.41, p(0.8) → 1.39: relatively gradual change at moderate probabilities
  • p(0.95) → 2.94, p(0.98) → 3.89: fast change at extreme probabilities

slide-48
SLIDE 48

Low & High Probabilities

  • A problem for our model:
  • The question was how much less common source confusions become with elaborative rehearsal
  • But: Source confusions were never observed with elaborative rehearsal, ever!
  • Why we think this happened:
  • In theory, elaborative subjects would probably make at least one of these errors eventually (given infinite trials)
  • Not impossible
  • But, empirically, the probability was low enough that we didn’t see the error in our sample (limited sample size)
  • e.g., with Probability = 1%, the error would surface given N = ∞ trials but may never appear with N = 8

slide-49
SLIDE 49

Empirical Logit

  • Empirical logit: An adjustment to the regular logit to deal with probabilities near (or at) 0 or 1

logit = log [ p(confusion) / (1 − p(confusion)) ]

slide-50
SLIDE 50

Empirical Logit

  • Empirical logit: An adjustment to the regular logit to deal with probabilities near (or at) 0 or 1
  • Makes extreme values (close to 0 or 1) less extreme
  • A = Source confusion occurred
  • B = Source confusion did not occur

logit = log [ Num of “A”s / Num of “B”s ]

emp. logit = log [ (Num of “A”s + 0.5) / (Num of “B”s + 0.5) ]

slide-51
SLIDE 51

Empirical Logit

  • Empirical logit: An adjustment to the regular logit to deal with probabilities near (or at) 0 or 1
  • Makes extreme values (close to 0 or 1) less extreme

logit = log [ Num of “A”s / Num of “B”s ]
emp. logit = log [ (Num of “A”s + 0.5) / (Num of “B”s + 0.5) ]

Num of As: 10, Num of Bs: 0 → logit: undefined; emp. logit: 3.04
Num of As: 9, Num of Bs: 1 → logit: 2.20; emp. logit: 1.85
Num of As: 6, Num of Bs: 4 → logit: 0.41; emp. logit: 0.37
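These numbers are easy to reproduce. A short Python sketch of the two formulas, run on the slide's three count patterns:

```python
# Regular vs. empirical logit on raw event counts.
import math

def logit(a, b):
    """log odds from counts; undefined (error) when b == 0."""
    return math.log(a / b)

def empirical_logit(a, b):
    """Adding 0.5 to both counts keeps the log odds finite."""
    return math.log((a + 0.5) / (b + 0.5))

for a, b in [(10, 0), (9, 1), (6, 4)]:
    reg = round(logit(a, b), 2) if b > 0 else "undefined"
    emp = round(empirical_logit(a, b), 2)
    print(a, b, reg, emp)
# 10 0 undefined 3.04
# 9 1 2.2 1.85
# 6 4 0.41 0.37
```

Note how the adjustment matters most in the all-or-nothing cell (10 vs. 0) and barely changes the moderate cell (6 vs. 4).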

slide-52
SLIDE 52

The empirical logit doesn’t go as high or as low as the “true” logit. At moderate values, they’re essentially the same.

slide-53
SLIDE 53
slide-54
SLIDE 54

With larger samples, difference gets much smaller (as long as probability isn’t 0 or 1)

slide-55
SLIDE 55

Empirical Logit: Implementation

  • The empirical logit requires summing up events and then adding 0.5 to the numerator & denominator:

empirical logit = log [ (Num A + 0.5) / (Num B + 0.5) ]

  • Thus, we have to (1) sum across individual trials, and then (2) calculate the empirical logit
  • A single empirical-logit value for each subject in each condition—not a sequence of one YES or NO for every item
  • e.g., Num of As for subject S10 in the Low associative strength, Maintenance rehearsal condition
slide-56
SLIDE 56

Empirical Logit: Implementation

  • The empirical logit requires summing up events and then adding 0.5 to the numerator & denominator:

empirical logit = log [ (Num A + 0.5) / (Num B + 0.5) ]

  • Thus, we have to (1) sum across individual trials, and then (2) calculate the empirical logit
  • Can’t have multiple random effects with the empirical logit
  • Would have to do separate by-subjects and by-items analyses
  • Collecting more data can be another solution

slide-57
SLIDE 57

Empirical Logit: Implementation

  • Scott’s psycholing package can help calculate the empirical logit & run the model
  • Example script on CourseWeb
  • Two notes:
  • No longer using glmer() with family=binomial. We’re now running the model on the empirical-logit value, which isn’t just a 0 or 1.
  • Here, the value of the DV is -1.49

slide-58
SLIDE 58

Empirical Logit: Implementation

  • Scott’s psycholing package can help calculate the empirical logit & run the model
  • Example script on CourseWeb
  • Two notes:
  • No longer using glmer() with family=binomial. We’re now running the model on the empirical-logit value, which isn’t just a 0 or 1.
  • Because we calculate the empirical logit beforehand, the model doesn’t know how many observations went into that value
  • Want to appropriately weight the model
  • 2.46 could be the average across 10 trials or across 100 trials

slide-59
SLIDE 59

Week 11: Missing Data


slide-60
SLIDE 60
  • So far, we’ve looked at cases where:
  • Two predictor variables are always confounded
  • A particular category of the dependent variable never (or almost never) appears in a particular cell
  • But there are lots of cases where some observations are missing more haphazardly

Missing Data

slide-61
SLIDE 61

Missing Data

  • Lots of cases where a model makes sense in principle, but part of the dataset is missing
  • Computer crashes
  • Some people didn’t entirely fill out a questionnaire
  • Participants dropped out
  • Implausible values that we excluded
  • Non-codable data (e.g., we’re looking at whether L2 learners produce the correct plural, but someone says fish)
  • Remember that missing data is indicated in R with NA

slide-62
SLIDE 62

Missing Data

  • Big issue: Is our sample still representative?
  • The basic goal in inferential statistics is to generalize from a limited sample to a population
  • If the sample is truly random, this is justified
slide-63
SLIDE 63

Missing Data

  • Problem: If data from certain types of people always go missing, our sample will no longer be representative of the broader population
  • It’s not a random sample if we systematically lose certain kinds of data

slide-64
SLIDE 64

Missing Data

  • In fact, it’s a problem even if certain types of people are somewhat more likely to have missing data
  • Still not a fully random sample
slide-65
SLIDE 65

Big Issue: WHY data is missing

  • We will see several techniques for dealing with missing data
  • The degree to which these techniques are appropriate depends on how the missing data relate to the other variables
  • Missingness of data may not be arbitrary (and often isn’t)
  • Let’s first look at some hypothetical patterns of missing data
  • This is a conceptual distinction
  • In any given actual data set, we might not know the pattern for certain

slide-66
SLIDE 66

Week 11: Missing Data


slide-67
SLIDE 67

Hypothetical Scenario #1

  • Sometimes, the fact that a data point is NA may be related to what the value would have been … if we’d been able to measure it

slide-68
SLIDE 68

Hypothetical Scenario #1

  • Sometimes, the fact that a data point is NA may be related to what the value would have been … if we’d been able to measure it
  • A health psychologist is surveying high school students about their marijuana use
  • Students who’ve tried marijuana may be more likely to leave this question blank than those who haven’t
  • The remaining data are a biased sample

ACTUAL STATE OF WORLD: Yes No No Yes No Yes (= 50% yes)
WHAT WE SEE: Yes No NA NA No NA (= 33% yes)
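The bias in this toy example can be computed directly. A Python sketch of the six hypothetical students above, with None standing in for R's NA:

```python
# Nonignorable missingness: marijuana users are more likely to skip
# the question, so the observed rate underestimates the true rate.
actual   = ["Yes", "No", "No", "Yes", "No", "Yes"]
observed = ["Yes", "No", None, None, "No", None]   # None = left blank

true_rate = actual.count("Yes") / len(actual)
answered  = [x for x in observed if x is not None]
obs_rate  = answered.count("Yes") / len(answered)

print(true_rate)           # 0.5
print(round(obs_rate, 2))  # 0.33
```

Deleting the NAs and averaging what remains quietly swaps the estimand: we estimate the rate among students willing to answer, not the rate among students.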

slide-69
SLIDE 69

Hypothetical Scenario #1

  • Sometimes, the fact that a data point is NA may be related to what the value would have been … if we’d been able to measure it
  • In other words, some values are more likely than others to end up as NAs
  • All that’s relevant here is that there is a statistical contingency
  • The actual causal chain might be more complex
  • e.g., marijuana use → fear of legal repercussions → omitted response
slide-70
SLIDE 70

Hypothetical Scenario #1

  • Further examples:
  • A clinical study where we’re measuring a health outcome. People who are very ill might drop out of the study.
  • An experiment where you have to press a key within 3 seconds or the trial ends without a response time being recorded
  • People with low high school GPAs decline to report them
  • These are all examples of nonignorable missingness

slide-71
SLIDE 71

Hypothetical Scenario #1

  • Nonignorable missingness is bad ☹
  • The remaining observations (those without NAs) are not representative of the full population
  • We can’t fully account for what the missing data were, or why they’re missing
  • We simply don’t know what the missing RTs would have been if people had been allowed more time to respond

slide-72
SLIDE 72

Week 11: Missing Data


slide-73
SLIDE 73

Hypothetical Scenario #2

  • In other cases, data might go missing at “random” or for reasons completely unrelated to the study
  • Computer crash
  • Inclement weather
  • Experimenter error
  • Random subsampling of people for a follow-up

slide-74
SLIDE 74

Hypothetical Scenario #2

  • In other cases, data might go missing at “random” or for reasons completely unrelated to the study
  • Computer crash
  • Inclement weather
  • Experimenter error
  • Random subsampling of people for a follow-up
  • In these cases, there is no reason to think that the missing data would look any different from the remaining data
  • Ignorable missingness
slide-75
SLIDE 75

Hypothetical Scenario #2

  • Ignorable missingness isn’t nearly as bad! ☺
  • It’s a bummer that we lost some data, but what’s left is still representative of the population
  • Our data are still a random sample of the population—just a smaller random sample
  • It is still valid to make inferences about the population

slide-76
SLIDE 76

Hypothetical Scenario #3

  • Another case of ignorable missingness is when the fact that the data is NA can be fully explained by other, known variables
  • Examples:
  • People who score high on a pretest are excluded from further participation in an intervention study
  • We’re looking at child SES as a predictor of physical growth, but lower-SES families are less likely to return for the post-test
  • The DV is whether people say a plural vs. singular noun; we discard ambiguous words (e.g., “fish”). The rate of ambiguous words differs across conditions

slide-77
SLIDE 77

Hypothetical Scenario #3

  • Another case of ignorable missingness is when the fact that the data is NA can be fully explained by other, known variables
  • This is also ignorable because there’s no mystery about why the data is missing, nor about the values of the variables associated with NA-ness
  • We know why the high-pretest people were excluded from the intervention. It has nothing to do with unobserved variables
  • Again, we’re referring to statistical contingencies, not direct causal links
  • Low SES → Transportation less affordable → NA

slide-78
SLIDE 78

Week 11: Missing Data


slide-79
SLIDE 79

Wrap-Up

l Ignorableness is more like a continuum l As long as we’re relatively ignorable, we’re OK

l

“We should expect departures from [ignorable missingness] … in many realistic cases … may often have only a minor impact on estimates and standard errors.” (Schafer & Graham, 2002, p. 152)

l

“In many psychological situations the departures … are probably not serious.” (Schafer & Graham, 2002, p. 154)

Continuum: All of the missingness can be accounted for by known variables (IGNORABLE) ↔ All of the missingness depends on unknown values (NON-IGNORABLE)

slide-80
SLIDE 80

Ignorable or Non-ignorable?

l Where is my dataset on this continuum?

l

"In general, there is no way to test whether [ignorable missingness] holds in a data set.” (Schafer & Graham, 2002, p. 152)

l Definitely ignorable if you used known variable(s) to

decide to discard data or to stop data collection

l

Kids who get a low score on Task 1 aren’t given Task 2

l

People who are sufficiently healthy are excluded from a clinical study

l

We realized we made a typo in one of our stimulus items and discarded all trials using that item

l Other cases: Use your knowledge of the domain

l Are certain values less likely to be measured & recorded; e.g.,

poor health in a clinical study? (non-ignorable)

l Or, is the missingness basically happening at random? Can it

be accounted for by things we measured? (ignorable)

l Big picture: Most departures from ignorable missingness

aren’t terrible, but be aware of the possibility that certain values are systematically more likely to be missing

slide-81
SLIDE 81

Ignorable or Non-ignorable?

l The post-experiment manipulation-check

questionnaires for five participants were accidentally thrown away.

l In a 2-day memory experiment, people who know they

would do poorly on the memory test are discouraged and don’t want to return for the second session.

l There was a problem with one of the auditory stimulus

files in the “Passive Sentence” condition (but not the corresponding version in the “Active” condition); we discarded data from those trials.

slide-82
SLIDE 82

Ignorable or Non-ignorable?

l The post-experiment manipulation-check

questionnaires for five participants were accidentally thrown away.

l Ignorable—not related to any variable

l In a 2-day memory experiment, people who know they

would do poorly on the memory test are discouraged and don’t want to return for the second session.

l Non-ignorable. Missingness depends on what your memory

score would have been if we had observed it.

l There was a problem with one of the auditory stimulus

files in the “Passive Sentence” condition (but not the corresponding version in the “Active” condition); we discarded data from those trials.

l Ignorable; this depends on a known variable (condition)

slide-83
SLIDE 83

Ignorable or Non-ignorable?

l We are comparing life satisfaction among a sample of

students known to live on-campus vs. a sample of students known to live off-campus. But students off- campus are less likely to return their surveys because it’s more inconvenient for them to do so.

slide-84
SLIDE 84

Ignorable or Non-ignorable?

l We are comparing life satisfaction among a sample of

students known to live on-campus vs. a sample of students known to live off-campus. But students off- campus are less likely to return their surveys because it’s more inconvenient for them to do so.

l Ignorable if missingness depends only on this known

variable. Fewer off-campus students might return their

surveys, but the off-campus students from whom we have data don’t differ from the off-campus students for whom we don’t have data.

l If we think that there is also a relation to the unmeasured life-

satisfaction variable (e.g., people unhappy with their lives don’t return the survey), although not mentioned above, then this would be non-ignorable.

l Assumptions we are willing to make about the missing data

(and why it’s missing) affect how we can then use it.

slide-85
SLIDE 85

Week 11: Missing Data

l Rank Deficiency

l Linear Combinations l Incomplete Designs

l Empirical Logit l Missing Data (NA values)

l Intro l Types of Missingness

l Non-Ignorable l Ignorable l Summary

l Possible Solutions

l Casewise Deletion l Listwise Deletion l Unconditional Imputation l Conditional Imputation l Multiple Imputation

slide-86
SLIDE 86

Casewise Deletion

l In each comparison, delete an observation only if

its missing data is relevant to that comparison

l Correlating Extraversion & Conscientiousness

→ delete/ignore the red rows

slide-87
SLIDE 87

l In each comparison, delete an observation only if

its missing data is relevant to that comparison

l Correlating Extraversion & ReadingSpan →

delete/ignore the blue row

Casewise Deletion

slide-88
SLIDE 88

Casewise Deletion

l Avoids data loss l But, results not completely consistent /

comparable because they’re based

on different observations

l e.g., possible to have A > B > C > A

l

cor.test(stress$ReadingSpan, stress$Extraversion)

l

cor.test(stress$Conscientiousness, stress$Extraversion)

df=453

d.f.s don’t match because they’re based on different subsets of the data
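The slide’s point can be demonstrated in base R without the course’s stress dataset. The toy data frame below uses made-up values; `use = "pairwise.complete.obs"` is casewise deletion, so each correlation is computed on a different subset of rows:

```r
# Toy data frame with scattered NAs (hypothetical values, not the
# course's stress dataset)
d <- data.frame(
  Extraversion      = c(3, 5, NA, 4, 2, 5),
  Conscientiousness = c(4, NA, 2, 5, 3, 4),
  ReadingSpan       = c(2, 4, 3, NA, 5, 1)
)

# Casewise deletion: each pair of variables drops only the rows with an
# NA on that pair -- both counts are 4, but they are DIFFERENT rows
sum(complete.cases(d$Extraversion, d$Conscientiousness))  # 4 (rows 1,4,5,6)
sum(complete.cases(d$Extraversion, d$ReadingSpan))        # 4 (rows 1,2,5,6)

cor(d, use = "pairwise.complete.obs")  # casewise deletion
cor(d, use = "complete.obs")           # listwise: only the 3 fully observed rows
```

Because each correlation in the pairwise matrix rests on a different subset, the results need not be mutually consistent, which is how orderings like A > B > C > A can arise.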

slide-89
SLIDE 89

Week 11: Missing Data

l Rank Deficiency

l Linear Combinations l Incomplete Designs

l Empirical Logit l Missing Data (NA values)

l Intro l Types of Missingness

l Non-Ignorable l Ignorable l Summary

l Possible Solutions

l Casewise Deletion l Listwise Deletion l Unconditional Imputation l Conditional Imputation l Multiple Imputation

slide-90
SLIDE 90

Listwise Deletion

l Delete any observation where data is missing

anywhere

l e.g., stress2 <- na.omit(stress)

l Default in lmer and many other programs
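A minimal illustration with a hypothetical data frame (any row containing an NA is dropped entirely):

```r
# Toy data: NAs scattered across both columns (hypothetical values)
d <- data.frame(x = c(1, 2, NA, 4),
                y = c(NA, 2, 3, 4))

d2 <- na.omit(d)   # listwise deletion: drops rows 1 and 3
nrow(d2)           # 2 -- only the fully observed rows survive

# complete.cases() selects the same rows explicitly
d[complete.cases(d), ]
```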

slide-91
SLIDE 91

Listwise Deletion

l Avoids inconsistency l In some cases, could result in a lot of data loss

l However, mixed effects models do well even with

moderate data loss (25%; Quené & van den Bergh, 2004)

l Unlike ANOVA, MEMs properly account for some

subjects or conditions having fewer observations

slide-92
SLIDE 92

Listwise Deletion

l Avoids inconsistency l In some cases, could result in a lot of data loss

l However, mixed effects models do well even with

moderate data loss (25%; Quené & van den Bergh, 2004)

l Unlike ANOVA, MEMs properly account for some

subjects or conditions having fewer observations

l Produces the correct parameter estimates if

missingness is ignorable

l Although some other things (R2) may be incorrect

l Estimates will be wrong if missingness is non-

ignorable

slide-93
SLIDE 93

Week 11: Missing Data

l Rank Deficiency

l Linear Combinations l Incomplete Designs

l Empirical Logit l Missing Data (NA values)

l Intro l Types of Missingness

l Non-Ignorable l Ignorable l Summary

l Possible Solutions

l Casewise Deletion l Listwise Deletion l Unconditional Imputation l Conditional Imputation l Multiple Imputation

slide-94
SLIDE 94

Unconditional Imputation

l Replace missing values with the mean of the

observed values

l Imputing the mean reduces the variance

l This increases chance of detecting spurious effects

l Also distorts the correlations with other variables l Bad. Don’t do this!

Before imputation: 5, 8, 3, ?, ?

  • M = 5.33
  • s² = 6.33

After mean imputation: 5, 8, 3, 5.33, 5.33

  • M = 5.33
  • s² = 3.17
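The slide’s example can be checked directly in R; the imputed values sit exactly at the mean, so the mean is preserved while the variance shrinks:

```r
x <- c(5, 8, 3, NA, NA)

mean(x, na.rm = TRUE)   # 5.33: mean of the observed values
var(x, na.rm = TRUE)    # 6.33: variance of the observed values

# Unconditional (mean) imputation
x_imp <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

mean(x_imp)             # still 5.33: the mean is unchanged
var(x_imp)              # 3.17: variance shrinks, because the imputed
                        # values add no spread around the mean
```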
slide-95
SLIDE 95

Week 11: Missing Data

l Rank Deficiency

l Linear Combinations l Incomplete Designs

l Empirical Logit l Missing Data (NA values)

l Intro l Types of Missingness

l Non-Ignorable l Ignorable l Summary

l Possible Solutions

l Casewise Deletion l Listwise Deletion l Unconditional Imputation l Conditional Imputation l Multiple Imputation

slide-96
SLIDE 96

Conditional Imputation

l Replace missing values with the values

predicted by a model using known variable(s)

l

interimModel <- lmer(ReadingSpan ~ 1 + OperationSpan + (1|Subject), data=stress)

l

Create a model that predicts RSpan from OSpan

l

predictedValues <- predict(interimModel, stress[is.na(stress$ReadingSpan),])

l

Use the model to predict the missing ReadingSpan values

l

stress[is.na(stress$ReadingSpan),'ReadingSpan'] <- predictedValues

l Replace the missing ReadingSpan values with the predicted

values

[Figure: ReadingSpan (containing NAs) modeled as ~ 1 + OperationSpan]

slide-97
SLIDE 97

Conditional Imputation

l Replace missing values with the values

predicted by a model using known variable(s)

l If ignorable missingness, get the correct

parameter estimates

l And, standard errors not as distorted

l Especially if we add some noise to the fitted values

l predictedValues <- predictedValues +

rnorm(length(predictedValues), mean=0, sd=sigma(interimModel))
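The same workflow can be sketched end-to-end in base R. This uses simulated stand-in data and a plain lm() instead of the slides’ stress dataset and lmer model; sigma() extracts the model’s residual SD:

```r
set.seed(1)

# Simulated stand-in data (not the course's stress dataset)
df <- data.frame(OperationSpan = rnorm(100, mean = 50, sd = 10))
df$ReadingSpan <- 0.8 * df$OperationSpan + rnorm(100, sd = 5)
df$ReadingSpan[sample(100, 20)] <- NA      # knock out 20 values

# Interim model, fit on the complete rows (lm drops NAs by default)
interim <- lm(ReadingSpan ~ OperationSpan, data = df)

miss <- is.na(df$ReadingSpan)
pred <- predict(interim, df[miss, ])

# Add residual noise so imputed values have realistic spread
pred <- pred + rnorm(sum(miss), mean = 0, sd = sigma(interim))

df$ReadingSpan[miss] <- pred
anyNA(df$ReadingSpan)   # FALSE -- every missing value has been imputed
```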


slide-98
SLIDE 98

Conditional Imputation

l Replace missing values with the values

predicted by a model using known variable(s)

l Where is this useful?

l Many observations have a small

amount of missing data, but which column it is varies

l Listwise deletion would wipe out

every row with a NA anywhere


slide-99
SLIDE 99

Week 11: Missing Data

l Rank Deficiency

l Linear Combinations l Incomplete Designs

l Empirical Logit l Missing Data (NA values)

l Intro l Types of Missingness

l Non-Ignorable l Ignorable l Summary

l Possible Solutions

l Casewise Deletion l Listwise Deletion l Unconditional Imputation l Conditional Imputation l Multiple Imputation

slide-100
SLIDE 100

Multiple Imputation

l Like doing conditional imputation several times

l Replace missing data with one possible set of values l Run the model l Repeat it

l Final result averages these

Dataset with missing data → { Dataset with imputation 1, Dataset with imputation 2, Dataset with imputation 3 } → { Model results 1, Model results 2, Model results 3 } → Final Results
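The impute–analyze–pool cycle can be hand-rolled in base R as a sketch of what mice automates. This uses simulated data and pools with Rubin’s rules (pooled estimate = mean of the per-imputation estimates; total variance = within- plus between-imputation variance):

```r
set.seed(2)

# Simulated data for illustration: y depends on x; 50 y values go missing
df <- data.frame(x = rnorm(200))
df$y <- 2 * df$x + rnorm(200)
df$y[sample(200, 50)] <- NA

m    <- 5                            # number of imputed datasets
fit0 <- lm(y ~ x, data = df)         # imputation model (complete rows only)
miss <- is.na(df$y)

ests <- vars <- numeric(m)
for (i in seq_len(m)) {
  d_i <- df
  # each imputation: fitted value plus fresh residual noise
  d_i$y[miss] <- predict(fit0, df[miss, ]) +
    rnorm(sum(miss), sd = sigma(fit0))
  fit_i   <- lm(y ~ x, data = d_i)   # analyze the completed dataset
  ests[i] <- coef(fit_i)["x"]
  vars[i] <- vcov(fit_i)["x", "x"]
}

# Pool with Rubin's rules
q_bar <- mean(ests)                      # pooled slope estimate
b     <- var(ests)                       # between-imputation variance
t_var <- mean(vars) + (1 + 1/m) * b      # total variance of pooled estimate
```

The between-imputation term inflates the pooled variance, which is exactly what single conditional imputation fails to do.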

slide-101
SLIDE 101

Multiple Imputation

l R package mice (have to install)

l Schematic example:

l imp <- mice(stress) l Creates several sets of imputed data l Need to set some parameters to indicate level 1 vs level 2

variables, categorical vs continuous variables

l miModel <- with(imp, lmer(…) )

l Fit the model to each set of imputed data

l result <- pool(miModel)

l Combine the model results

l summary(result) l Limitations:

l Limited to two nested levels (with current software) l Only gives you fixed effect estimates (not estimates of

random effect variance)

l Can be time-consuming

slide-102
SLIDE 102

Pattern Mixture Models

l Alternative: Pattern-mixture models

l Classify participants by the patterns of missing data l Then look at the effects / pattern of results within

each group

l e.g., Effects of reading span for people with personality data

reported & effects of reading span for people without personality data reported

slide-103
SLIDE 103

Week 11: Missing Data

l Rank Deficiency

l Linear Combinations l Incomplete Designs

l Empirical Logit l Missing Data (NA values)

l Intro l Types of Missingness

l Non-Ignorable l Ignorable l Summary

l Possible Solutions

l Casewise Deletion l Listwise Deletion l Unconditional Imputation l Conditional Imputation l Multiple Imputation

slide-104
SLIDE 104

l Encyclopedia Brown confronted local

troublemaker “Bugs” Hauser about the missing data. Bugs says he distinctly remembers storing the missing sheet of data between pages 151 and 152 of his lab notebook. Bugs says that the sheet must

have just fallen out when Bugs’s gang, the Cotton-Top Tamarins, were cleaning their clubhouse.

l How did Encyclopedia know Bugs was

lying?

Pages 151 and 152 are the front and back of the same sheet.