A Model-Data-Fit-Informed Approach to Score Resolution in Performance Assessments
Stefanie A. Wind, University of Alabama
A. Adrienne Walker, Georgia Department of Education
Outline
- Background
- Purpose
- Methods
- Results
- Implications
Background
Score Resolution in Performance Assessment
- When raters disagree, additional ratings are collected
- Resolved ratings are some combination of the original and additional ratings
- Resolution is usually based on rater agreement
Potential Issues with Agreement-Based Score Resolution
- Two raters could exhibit different levels of severity, and both ratings could be plausible
  - Both ratings could accurately reflect student achievement over domains
  - Statistical adjustments for rater severity (e.g., MFRM) could mitigate severity differences
- Two raters could agree on inaccurate representations of student achievement
  - Unlikely in high-stakes assessments where raters are highly trained, but still possible
Score Resolution & Person Fit
Agreement-based score resolution has a goal similar to that of individual person fit analysis in modern measurement models: to identify individual students for whom achievement estimates may not be a reasonable representation of their response pattern.
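The person fit indices used later in the deck (Infit and Outfit MSE) can be sketched as follows. This is the standard Rasch formulation, not code from the study, and the toy numbers are purely illustrative:

```python
import numpy as np

def person_fit_mse(observed, expected, variance):
    """Standard Rasch person fit statistics (assumed formulation).

    observed, expected, variance: arrays over one student's responses,
    where expected values and variances come from the fitted model.
    """
    resid = observed - expected
    z2 = resid ** 2 / variance                  # squared standardized residuals
    outfit = z2.mean()                          # unweighted mean square
    infit = (resid ** 2).sum() / variance.sum() # information-weighted mean square
    return infit, outfit

# Toy example: a student whose ratings track model expectations closely
obs = np.array([2.0, 3.0, 1.0, 2.0])
exp = np.array([2.1, 2.8, 1.2, 2.0])
var = np.array([0.8, 0.9, 0.7, 0.8])
infit, outfit = person_fit_mse(obs, exp, var)
# values near 1.0 indicate fit; values well above 1.0 indicate noisy misfit
```

Flagging a student as "misfitting" then amounts to comparing these mean squares against critical values, which the study derives via a bootstrap (described later).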
Previous Research on Agreement-Based Score Resolution & Person Fit
- Both methods identify similar students whose performances warrant additional investigation (Myford & Wolfe, 2002)
- Applying agreement-based score resolution improves psychometric defensibility from both a rater agreement & person fit perspective for most students, but not all students! (Wind & Walker, 2019)
Brief Illustration: Agreement-Based Resolution Does Not Always Improve Person Fit
[Figure: person response functions before and after resolution. Before resolution, the student's ratings are aligned with the model-expected PRF; after resolution, they deviate from the model-expected PRF.]
Purpose
To explore a model-data-fit-informed approach to score resolution in the context of mixed-format educational assessments
Research Questions
1. What is the impact of using person fit statistics to identify performances for resolution and rater fit statistics to identify raters to provide resolved scores on student achievement and person fit statistics?
2. To what extent do the model-data-fit-informed approach and a rater-agreement approach to score resolution result in similar student achievement estimates and person fit statistics?
Methods
Simulation Study
Yes, it's kind of weird to use a simulation to look at rater judgments, and especially resolved rater judgments!
Simulated data are useful because:
- We cannot collect new resolved ratings for performances identified for resolution in a secondary analysis of real data
- We designed the simulation based on results from analyses of large-scale performance assessments in which score resolution procedures are applied (Wind & Walker, 2019)
Design of Simulation Study
- 5,000 students; student achievement: θ ~ N(0, 1)
- 30 MC items (all students respond to all items); item difficulty: β ~ N(0, 0.5)
- 1 writing task, scored on 4 domains; domain difficulty: δ1 = 0.00, δ2 = 2.00, δ3 = −2.00, δ4 = 1.00; 5-category rating scale (0, 1, 2, 3, 4)
- 50 total raters; rater severity: λ ~ N(0, 0.5)
- In all conditions, 2 randomly selected raters scored each student's writing task
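A minimal sketch of this generating design. It assumes a dichotomous Rasch model for the MC items and an adjacent-category partial-credit form for the ratings; the threshold values `tau` are illustrative, since the slides do not specify them:

```python
import numpy as np

rng = np.random.default_rng(0)

N_STUDENTS, N_MC, N_RATERS = 5000, 30, 50
theta = rng.normal(0, 1, N_STUDENTS)      # student achievement
beta = rng.normal(0, 0.5, N_MC)           # MC item difficulty
lam = rng.normal(0, 0.5, N_RATERS)        # rater severity
delta = np.array([0.0, 2.0, -2.0, 1.0])   # domain difficulties
tau = np.array([-2.0, -1.0, 1.0, 2.0])    # assumed rating-scale thresholds

# MC responses: dichotomous Rasch model
p_correct = 1 / (1 + np.exp(-(theta[:, None] - beta[None, :])))
mc = (rng.random((N_STUDENTS, N_MC)) < p_correct).astype(int)

def rate(th, severity, dom, thresholds, rng):
    """Sample one 0-4 rating from the adjacent-category (partial-credit) form."""
    steps = th - severity - dom - thresholds        # adjacent-category logits
    logits = np.concatenate([[0.0], np.cumsum(steps)])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Two randomly selected raters score each student's writing task (4 domains)
raters = np.array([rng.choice(N_RATERS, 2, replace=False)
                   for _ in range(N_STUDENTS)])
ratings = np.array([[rate(theta[n], lam[r], delta[j], tau, rng)
                     for j in range(4)]
                    for n in range(N_STUDENTS)
                    for r in raters[n]]).reshape(N_STUDENTS, 2, 4)
```

The manipulated conditions below (rater severity, student misfit) would then perturb `lam` or the per-student ordering of `delta` before sampling.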
Simulation Study, continued
Manipulated factors:
- % of raters exhibiting severity: 0%, 20%, 40% (severity-effect raters: λ ~ N(1.0, 0.5))
- % of students exhibiting misfit: 0%, 5%, 10% (½ of students exhibit each type of misfit)
Types of student misfit:
- Differential achievement over domains: disordered domain difficulty relative to the order in the complete sample
- Student*rater interaction: disordered domain difficulty for one rater; original order kept for the second rater
Null Condition
Conditions with 0% simulated rater severity and 0% simulated person misfit informed our evaluation of rater severity & person fit using Infit & Outfit MSE statistics, with a bootstrap approach to identify critical values.
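One plausible reading of that bootstrap step, sketched below; the deck does not give the exact procedure, so the function name, resampling scheme, and toy null distribution are assumptions:

```python
import numpy as np

def bootstrap_critical_value(null_fit_stats, alpha=0.05, n_boot=1000, seed=0):
    """Empirical critical value for a fit MSE statistic (assumed procedure).

    null_fit_stats: fit statistics computed under a null condition
    (0% simulated severity, 0% simulated misfit). Resamples the null
    statistics and averages the bootstrap upper (1 - alpha) quantiles.
    """
    rng = np.random.default_rng(seed)
    stats = np.asarray(null_fit_stats)
    boot_upper = np.empty(n_boot)
    for b in range(n_boot):
        sample = rng.choice(stats, size=stats.size, replace=True)
        boot_upper[b] = np.quantile(sample, 1 - alpha)
    return boot_upper.mean()

# Toy null distribution of Outfit MSE clustered around 1.0
null_outfit = np.random.default_rng(1).normal(1.0, 0.15, 500)
crit = bootstrap_critical_value(null_outfit)
# students whose observed outfit exceeds `crit` would be flagged as misfitting
```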
Analysis Procedure
(1) Simulate MC responses and CR ratings using the specified conditions
(2) Analyze the simulated data using the PC-MFR model:
ln[P_nijlk(x = k) / P_nijlk(x = k−1)] = θ_n − λ_i − δ_j − η_l − τ_lk
(3) Evaluate person fit and rater fit
  (3A) Identify misfitting students
  (3B) Identify raters with moderate severity and good fit
  (3C) Identify students with acceptable fit: no resolution needed; their ratings from Step 1 are the final ratings
(4) Simulate new ratings for each student in (3A) from one randomly selected rater from (3B), using the student theta parameter from (1)
  - Students who misfit due to "true" differential achievement: maintain disordered domain difficulty → expected person misfit after resolution
  - Students who misfit due to the rater*domain interaction: use expected domain difficulty → expect improved person fit
(5) From the original ratings (1), identify the original rater whose ratings are closest to the model-expected ratings
(6) Use the ratings identified in (5) and the new ratings from (4) as the final resolved ratings
(7) Analyze the final ratings for all students using the PC-MFR model, anchoring MC item difficulties, domain difficulties, and rater severity locations to values from Step 2, and evaluate person fit
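The adjacent-category log-odds form of the PC-MFR model above implies category probabilities that can be computed directly. A sketch, where the function name and example parameter values are illustrative rather than from the study:

```python
import numpy as np

def pcmfr_category_probs(theta, lam, delta, eta, tau):
    """Category probabilities implied by the adjacent-category form:
    ln[P(x = k) / P(x = k-1)] = theta - lam - delta - eta - tau_k,
    with tau holding thresholds tau_1..tau_m for an (m+1)-category scale.
    """
    steps = theta - lam - delta - eta - np.asarray(tau)
    logits = np.concatenate([[0.0], np.cumsum(steps)])  # log P_k up to a constant
    probs = np.exp(logits - logits.max())               # stable softmax
    return probs / probs.sum()

# Example: an average student, a mildly severe rater, neutral domain/task
p = pcmfr_category_probs(theta=0.0, lam=0.3, delta=0.0, eta=0.0,
                         tau=[-2.0, -1.0, 1.0, 2.0])
# p gives one probability per rating category 0-4 and sums to 1
```

These model-expected probabilities are what Steps (3) and (5) rely on: they yield the expected ratings and residual variances behind the fit statistics and the "closest to model-expected" comparison.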
Rater Agreement Analysis
We also examined rater agreement in our simulated ratings:
- Identified performances with discrepancies ≥ 2 raw score points
- Used the same approach to identify a third rater and generate additional ratings
- Compared the 3rd rater's ratings to the original ratings and kept the ratings from the closest 2 raters
- Analyzed the resolved ratings using the PC-MFR model
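The agreement-based rule above can be sketched as follows. The `get_third_rating` callback and scalar scores are simplifications for illustration; in the study, ratings span four domains and the third rater is drawn the same way as in the fit-informed procedure:

```python
def resolve_by_agreement(rating_a, rating_b, get_third_rating, threshold=2):
    """Agreement-based resolution as described in the slides: if the two
    original raw scores differ by >= threshold points, obtain a third
    rating and keep it along with the original rating closest to it.
    """
    if abs(rating_a - rating_b) < threshold:
        return rating_a, rating_b   # raters agree; no resolution needed
    c = get_third_rating()
    # keep the third rating plus whichever original rating is closest to it
    keep = rating_a if abs(rating_a - c) <= abs(rating_b - c) else rating_b
    return keep, c

# Example: originals disagree by 3 points; a third rater scores 9
final = resolve_by_agreement(10, 7, lambda: 9)
# keeps (10, 9): the third rating plus the closer original
```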
Results
Unresolved Ratings
Person fit statistics reflected the simulation design:
- 80-96% of students simulated to exhibit misfit were classified as misfitting
- ≤ 1% of students simulated to fit were classified as misfitting
Resolved Ratings
Lower average MSE fit statistics for all students in all conditions
Some differences for student fit subgroups:
- Differential Achievement students: higher average fit statistics (noisier; more misfit) than the overall sample; remained "misfitting" after resolution, as expected by design
- Rater*Performance Interaction students: improved average person fit statistics; fewer classified as "misfitting" following resolution
- Fitting students:
Comparison to Rater Agreement
Similar overall average person fit statistics following resolution
Some differences for student fit subgroups:
- Differential Achievement students: higher average fit statistics (noisier; more misfit) than the overall sample; remained "misfitting"
- Rater*Performance Interaction students: person fit statistics did not improve much following resolution; remained "misfitting" following resolution
- Fitting students:
The lack of improvement in person fit for the rater*performance interaction subgroup is the key difference between the model-fit-informed approach and the rater-agreement-based approach.
Implications
Contribution
Previous research on score resolution has focused almost exclusively on rater agreement.
We considered the implications of using indicators of model-data fit to identify performances for resolution and to identify raters to provide the new scores.
RQ1: What is the impact of using person fit statistics to identify performances for resolution and rater fit statistics to identify raters to provide resolved scores on student achievement and person fit statistics?
- The fit-informed approach resulted in overall improved model-data fit for students who exhibited misfit due to the rater*performance interaction
- Improved model-data fit supports the quality of achievement estimates in performance assessments
What does this mean?
- The fit-informed approach can help researchers and practitioners identify students with unexpected ratings both before and following score resolution
- If fit does not improve, additional steps may be needed to meaningfully evaluate the student's achievement related to the construct of interest
  - E.g., additional qualifiers for interpreting and using the score should be considered along with the score itself
RQ2: To what extent do the model-data fit-informed approach and a rater-agreement approach to score resolution result in comparable student achievement estimates and person fit statistics?
- Similar overall person fit following resolution
- Under the agreement-based approach, person fit did not improve for the rater*performance interaction subgroup
- Ratings can be flagged as discrepant yet still be defensible from a measurement perspective
What does this mean?
- Fit-informed score resolution is a promising alternative to the agreement-based approach
  - Helps sort out "true" person misfit from rater effects
  - Provides a method for evaluating resolved scores for evidence that they can be reasonably interpreted and used
Other applications:
- Could be used in testing programs that employ only one rater per student response; random read-behinds can be supplemented with person-fit read-behinds
- Important to evaluate both person fit & rater effects at all stages of the assessment process
Thank you!
Stefanie A. Wind
swind@ua.edu | @ProfessorWind