SLIDE 1

A Model-Data-Fit-Informed Approach to Score Resolution in Performance Assessments

Stefanie A. Wind, University of Alabama
A. Adrienne Walker, Georgia Department of Education

SLIDE 2

Outline

  • Background
  • Purpose
  • Methods
  • Results
  • Implications

SLIDE 3

Background

SLIDE 4

Score Resolution in Performance Assessment

  • Raters disagree → additional ratings are collected
  • Resolved ratings are some combination of original & new ratings
  • Usually based on rater agreement

SLIDE 5

Potential Issues with Agreement-Based Score Resolution

  • 1. Discrepancies in rater judgment may not always indicate inaccurate ratings
    – Two raters could exhibit different levels of severity and both ratings could be plausible
    – They could both accurately reflect student achievement over domains
    – Statistical adjustments for rater severity (e.g., MFRM) could mitigate severity differences
  • 2. Rater agreement may not always indicate accurate ratings
    – Two raters could agree on inaccurate representations of student achievement
    – Unlikely in high-stakes assessments where raters are highly trained, but still possible

SLIDE 6

Score Resolution & Person Fit

  • Agreement-based score resolution has a similar goal to individual person fit analysis in modern measurement models:
    – To identify individual students for whom achievement estimates may not be a reasonable representation of their response pattern

SLIDE 7

Previous Research on Agreement-Based Score Resolution & Person Fit

  • Both methods identify similar students whose performances warrant additional investigation (Myford & Wolfe, 2002)
  • Applying agreement-based score resolution improves psychometric defensibility from both a rater agreement & person fit perspective…
    – For most students
    – But not all students! (Wind & Walker, 2019)

SLIDE 8

Brief Illustration: Agreement-Based Resolution Does Not Always Improve Person Fit

SLIDE 9

Before Resolution

  • Raters disagreed on all domains except D1
  • Flagged for resolution
  • Overall shape of observed PRF generally aligned with model-expected PRF

After Resolution

  • Raters agreed on all domains
  • Improved agreement
  • Overall shape of observed PRF deviates from model-expected PRF

SLIDE 10

Purpose

SLIDE 11

Purpose

To explore a model-data-fit-informed approach to score resolution in the context of mixed-format educational assessments

SLIDE 12

Research Questions

  • 1. What is the impact of using person fit statistics to identify performances for resolution and rater fit statistics to identify raters to provide resolved scores on student achievement and person fit statistics?
  • 2. To what extent do the model-data-fit-informed approach and a rater-agreement approach to score resolution result in similar student achievement estimates and person fit statistics?

SLIDE 13

Methods

SLIDE 14

Simulation Study

  • Yes, it’s kind of weird to use a simulation to look at rater judgments
  • And especially resolved rater judgments!
  • Simulated data are useful because:
    – We cannot collect new resolved ratings for performances identified for resolution in a secondary analysis of real data
    – We designed the simulation based on results from analyses of large-scale performance assessments in which score resolution procedures are applied (Wind & Walker, 2019)

SLIDE 15

Design of Simulation Study

In all conditions:

  • 5,000 students
    – Student achievement: θ ~ N(0, 1)
  • 30 MC items (all students respond to all items)
    – Item difficulty: β ~ N(0, 0.5)
  • 1 writing task
    – Scored on 4 domains
    – Domain difficulty: δ1 = 0.00, δ2 = 2.00, δ3 = -2.00, δ4 = 1.00
    – 5-category rating scale (0, 1, 2, 3, 4)
  • 50 total raters
    – Rater severity: λ ~ N(0, 0.5)
    – 2 randomly selected raters scored each student’s writing task
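
A minimal sketch (Python/NumPy) of how data matching this design could be generated. The slides specify only the distributions and design counts; the rating-scale thresholds and the rate() helper below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STUDENTS, N_MC, N_DOMAINS, N_RATERS = 5000, 30, 4, 50
theta = rng.normal(0, 1, N_STUDENTS)          # student achievement ~ N(0, 1)
beta = rng.normal(0, 0.5, N_MC)               # MC item difficulty ~ N(0, 0.5)
delta = np.array([0.00, 2.00, -2.00, 1.00])   # writing-domain difficulties
lam = rng.normal(0, 0.5, N_RATERS)            # rater severity ~ N(0, 0.5)

# Dichotomous MC responses under a Rasch model
p_mc = 1 / (1 + np.exp(-(theta[:, None] - beta[None, :])))
mc = rng.binomial(1, p_mc)                    # shape: (students, items)

def rate(theta_n, lam_i, delta_j, thresholds=(-2, -1, 1, 2)):
    """Draw one 0-4 rating from an adjacent-categories (partial-credit-style)
    model. The threshold values are assumed for illustration."""
    logits = np.cumsum([theta_n - lam_i - delta_j - t for t in thresholds])
    num = np.concatenate(([1.0], np.exp(logits)))
    probs = num / num.sum()
    return rng.choice(5, p=probs)

# Two randomly selected raters score each student's writing task on 4 domains
rater_ids = np.array([rng.choice(N_RATERS, 2, replace=False) for _ in range(N_STUDENTS)])
ratings = np.zeros((N_STUDENTS, 2, N_DOMAINS), dtype=int)
for n in range(N_STUDENTS):
    for r, i in enumerate(rater_ids[n]):
        for j in range(N_DOMAINS):
            ratings[n, r, j] = rate(theta[n], lam[i], delta[j])
```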

SLIDE 16

Simulation Study, continued

Manipulated factors:

  • % of raters exhibiting severity: 0%, 20%, 40%
    – Severity-effect raters: λ ~ N(1.0, 0.5)
  • % of students exhibiting misfit: 0%, 5%, 10%
    – ½ of misfitting students exhibit each type of misfit

Type of student misfit:

  • Differential achievement over domains (disordered domain difficulty from the order in the complete sample)
  • Student*rater interaction (disordered domain difficulty for one rater & kept the original order for the second rater)
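
Continuing the data-generation sketch from the previous slide, one way these two misfit types could be induced is by permuting the domain difficulties used to generate a flagged student's ratings. The specific permutation, and the reuse of rate(), theta, lam, delta, and rater_ids from the earlier sketch, are illustrative assumptions.

```python
# Hypothetical disordering of the domain difficulties (the slides do not give
# the exact permutation used in the study).
delta_disordered = delta[[1, 3, 0, 2]]

def rate_domains(theta_n, lam_i, deltas):
    """Ratings on all four domains from one rater, reusing rate() from above."""
    return [rate(theta_n, lam_i, d) for d in deltas]

n = 0                       # an example student simulated to exhibit misfit
r1, r2 = rater_ids[n]

# Differential achievement over domains: both raters use disordered difficulties
diff_achievement = [rate_domains(theta[n], lam[r1], delta_disordered),
                    rate_domains(theta[n], lam[r2], delta_disordered)]

# Student*rater interaction: one rater uses disordered difficulties,
# the other keeps the original order
interaction = [rate_domains(theta[n], lam[r1], delta_disordered),
               rate_domains(theta[n], lam[r2], delta)]
```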

SLIDE 17

Null condition

  • Conditions with 0% simulated rater severity and 0% simulated person misfit informed our evaluation of rater severity & person fit Infit & Outfit MSE statistics
  • Bootstrap approach to identify critical values
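
A sketch of one way a bootstrap could yield empirical critical values for Infit/Outfit MSE from the null condition. The resampling scheme, percentiles, and function names are assumptions; the slides state only that a bootstrap approach was used.

```python
import numpy as np

def bootstrap_fit_critical_values(null_fit_stats, n_boot=1000, alpha=0.05, seed=0):
    """Empirical critical values for Infit/Outfit MSE from null-condition data.

    null_fit_stats : 1-D array of fit statistics from a condition with 0%
                     simulated rater severity and 0% simulated person misfit.
    """
    rng = np.random.default_rng(seed)
    lower, upper = [], []
    for _ in range(n_boot):
        sample = rng.choice(null_fit_stats, size=null_fit_stats.size, replace=True)
        lower.append(np.quantile(sample, alpha / 2))
        upper.append(np.quantile(sample, 1 - alpha / 2))
    # Average the replicate percentiles to get stable lower/upper cutoffs
    return float(np.mean(lower)), float(np.mean(upper))

# Example: flag students whose Outfit MSE falls outside the null-based interval
# lower, upper = bootstrap_fit_critical_values(null_outfit_mse)
# misfitting = (outfit_mse < lower) | (outfit_mse > upper)
```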

SLIDE 18

Analysis Procedure

SLIDE 19

(1) Simulate MC responses and CR ratings using the specified conditions
(2) Analyze the simulated data using the PC-MFR model:
    ln[P_nijlk(x = k) / P_nijlk(x = k−1)] = θn − λi − δj − ηl − τlk
(3) Evaluate person fit and rater fit
    (3A) Identify misfitting students
    (3B) Identify raters with moderate severity and good fit
    (3C) Identify students with acceptable fit → no resolution needed; ratings from Step 1 are the final ratings
(4) Simulate new ratings for each student in (3A) from one randomly selected rater from (3B), using the student theta parameter from (1)
    – Students who misfit due to “true” differential achievement: maintain disordered domain difficulty → expect person misfit after resolution
    – Students who misfit due to the rater*domain interaction: use expected domain difficulty → expect improved person fit
(5) From the original ratings (1), identify the original rater whose ratings are closest to the model-expected ratings
(6) Use the ratings identified in (5) and the new ratings from (4) as the final resolved ratings
(7) Analyze the final ratings for all students using the PC-MFR model and evaluate person fit
    – Anchor MC item difficulties, domain difficulties, and rater severity locations to values from Step 2
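
The adjacent-categories form above implies category probabilities that can also give the model-expected rating used in Steps (4)-(5). A small illustrative sketch; the function names and the example threshold values are assumptions, and eta_l stands in for the remaining facet location from the slide's equation.

```python
import numpy as np

def pcmfr_category_probs(theta_n, lam_i, delta_j, eta_l, tau_lk):
    """Category probabilities for one student/rater/domain combination under the
    adjacent-categories model:
    ln[P(x=k) / P(x=k-1)] = theta_n - lam_i - delta_j - eta_l - tau_lk[k]."""
    steps = theta_n - lam_i - delta_j - eta_l - np.asarray(tau_lk)
    log_num = np.concatenate(([0.0], np.cumsum(steps)))   # log numerator per category
    probs = np.exp(log_num - log_num.max())
    return probs / probs.sum()

def expected_rating(theta_n, lam_i, delta_j, eta_l, tau_lk):
    """Model-expected rating, e.g., for finding the original rater closest to expectation."""
    probs = pcmfr_category_probs(theta_n, lam_i, delta_j, eta_l, tau_lk)
    return float(np.dot(np.arange(probs.size), probs))

# Example with hypothetical values: thresholds for a 5-category (0-4) scale
tau = [-2.0, -1.0, 1.0, 2.0]
print(expected_rating(theta_n=0.5, lam_i=0.2, delta_j=0.0, eta_l=0.0, tau_lk=tau))
```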

SLIDE 20

Rater Agreement Analysis

  • We also examined rater agreement in our simulated ratings
  • Identified performances with discrepancies ≥ 2 raw score points
  • Used the same approach to identify a third rater and generate additional ratings
  • Compared the 3rd rater’s ratings to the original ratings & kept the ratings from the closest 2 raters
  • Analyzed resolved ratings using the PC-MFR model
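
A sketch of the agreement-based rule described on this slide. The per-domain check and the distance measure used to pick the "closest" raters are assumptions where the slides do not specify them.

```python
import numpy as np

def needs_resolution(r1, r2, threshold=2):
    """Flag a performance when raters' raw scores differ by >= threshold points
    (assumed here to be checked per domain)."""
    return bool(np.any(np.abs(np.asarray(r1) - np.asarray(r2)) >= threshold))

def resolve_with_third_rater(r1, r2, r3):
    """Compare the third rater's ratings to each original rater and keep the third
    rating plus the closer original (assumed distance: summed absolute difference)."""
    d1 = np.sum(np.abs(np.asarray(r3) - np.asarray(r1)))
    d2 = np.sum(np.abs(np.asarray(r3) - np.asarray(r2)))
    return (r1, r3) if d1 <= d2 else (r2, r3)

# Example with hypothetical domain ratings (4 domains, 0-4 scale)
r1, r2 = [3, 2, 4, 3], [1, 2, 2, 3]
if needs_resolution(r1, r2):
    r3 = [2, 2, 3, 3]                      # additional rating from a third rater
    final_pair = resolve_with_third_rater(r1, r2, r3)
```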

SLIDE 21

Results

SLIDE 22

Unresolved Ratings

Person fit statistics reflected the simulation design:

  • 80-96% of students simulated to exhibit misfit were classified as misfitting
  • ≤ 1% of students simulated to fit were classified as misfitting

SLIDE 23

Resolved Ratings

  • Lower average MSE fit statistics for all students in all conditions
  • Some differences for student fit subgroups:

Differential Achievement students:
  • Fit statistics still higher (noisier; more misfit) than the overall sample
  • 79-94% classified as “misfitting”

Rater*Performance Interaction students:
  • More acceptable average person fit statistics
  • 1%-2% classified as “misfitting” following resolution

Fitting students:
  • Fit remained acceptable

SLIDE 24

Comparison to Rater Agreement

  • Similar overall average person fit statistics following resolution
  • Some differences for student fit subgroups:

Differential Achievement students:
  • Fit statistics still higher (noisier; more misfit) than the overall sample
  • 75-93% classified as “misfitting”

Rater*Performance Interaction students:
  • Average person fit statistics did not improve much following resolution
  • 70%-82% classified as “misfitting” following resolution

Fitting students:
  • Fit remained acceptable

Lack of improvement in person fit = key difference between the model-fit-informed approach & the rater agreement-based approach

SLIDE 25

Implications

SLIDE 26

Contribution

  • Previous research on score resolution has focused almost exclusively on rater agreement methods
  • We considered the implications of using indicators of model-data fit to identify performances for resolution & to identify raters to provide the new scores

SLIDE 27

RQ1: What is the impact of using person fit statistics to identify performances for resolution and rater fit statistics to identify raters to provide resolved scores on student achievement and person fit statistics?

  • Fit-informed approach resulted in overall improved model-data fit for students who exhibited misfit due to the rater*performance interaction
  • Effective for improving the quality of achievement estimates in performance assessments

SLIDE 28

What does this mean?

  • Fit-informed approach can help researchers and practitioners identify students with unexpected ratings both before and following score resolution
  • If fit does not improve, additional steps may be needed to meaningfully evaluate their achievement related to the construct of interest
    – E.g., additional qualifiers for interpreting and using the score should be considered along with the score itself

SLIDE 29

RQ2: To what extent do the model-data-fit-informed approach and a rater-agreement approach to score resolution result in comparable student achievement estimates and person fit statistics?

  • Overall improvement in person fit following resolution
  • Person fit did not improve for the rater*performance interaction subgroup
    – Profiles were less discrepant
    – Still misfitting from a measurement perspective

SLIDE 30

What does this mean?

  • Fit-informed score resolution is a promising alternative to the agreement-based approach
    – Helps sort out “true” person misfit from rater effects
    – A method for evaluating the resolved scores for evidence that they can be reasonably interpreted and used
  • Other applications:
    – Could be used in testing programs that employ only one rater per student response
    – Random read-behinds can be supplemented with person-fit read-behinds
  • Important to evaluate both person fit & rater effects at all stages of the assessment process

SLIDE 31

Thank you!

STEFANIE A. WIND
SWIND@UA.EDU
@PROFESSORWIND