SLIDE 1

Automated Scoring and Rater Drift

National Conference on Student Assessment, Detroit, 2010
Wayne Camara, The College Board

SLIDE 2

Rater drift

  • When ratings are made over a period of time, there is a concern that ratings may become more lenient or harsh.

  • Occurs in all contexts: performance appraisals, scoring performance assessments, judging athletic events…

  • Increased risk when:
  • Rubrics (criteria) are more subjective.
  • Scoring occurs over time (within year, between years).
  • There is pressure to score many tasks quickly.
SLIDE 3

Detecting and Correcting Rater Drift

  • Tools may differ between assessments completed on paper and on computer.

  • Multiple readers, with mixed assignments
  • Read behind
  • Seed papers from a previous administration, benchmark papers (established mark); see the sketch after this list.

  • Calibration of readers, retraining
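
A minimal sketch of how seed and benchmark papers might drive a drift check: each rater's scores on papers with an established mark are compared against that mark, and raters whose average deviation exceeds a cutoff are flagged for recalibration. The data layout, function name, and 0.5-point threshold are illustrative assumptions, not part of the presentation.

```python
# Hypothetical sketch: flag raters whose seed-paper scores drift from the
# established benchmark marks. Layout and threshold are illustrative only.

def flag_drifting_raters(seed_scores, benchmarks, max_mean_bias=0.5):
    """seed_scores: {rater: {paper_id: score}}; benchmarks: {paper_id: mark}.
    Returns raters whose mean deviation from the benchmark exceeds the cutoff."""
    flagged = {}
    for rater, papers in seed_scores.items():
        deviations = [score - benchmarks[p] for p, score in papers.items()]
        bias = sum(deviations) / len(deviations)  # >0 lenient, <0 harsh
        if abs(bias) > max_mean_bias:
            flagged[rater] = bias
    return flagged

benchmarks = {"p1": 3, "p2": 4, "p3": 2}           # established marks, 6-pt rubric
seed_scores = {
    "R1": {"p1": 3, "p2": 4, "p3": 2},             # on target
    "R2": {"p1": 4, "p2": 5, "p3": 3},             # scoring one point lenient
}
print(flag_drifting_raters(seed_scores, benchmarks))  # {'R2': 1.0}
```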
SLIDE 4

Automated Scoring

  • Automated scoring systems – essays, spoken responses, short content items, numerical/graphical responses to math questions (with a verifiable and limited set of correct responses).

  • Typically evaluated through comparison with human readers.

  • Correlations and weighted kappa (preferred over % agreement, which is misleading and sensitive to the rating scale – 4-pt vs. 9-pt); see the sketch after this list.

  • Exact and adjacent agreement are affected by the score scale (4-pt vs. 9-pt).
  • Score distributions similar to humans' (variation in ratings, use of the extremes of the scale).

  • Also validated against external criteria (other test sections, previous scores on the same test, scores on similar tests, grades).
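
Since the slide names correlations, weighted kappa, and exact/adjacent agreement as the standard engine-versus-human comparisons, here is a small sketch using scikit-learn's cohen_kappa_score with quadratic weights; the scores and the 1-6 scale are invented for illustration.

```python
# Sketch: comparing an automated engine to a human reader. Invented data.
import numpy as np
from sklearn.metrics import cohen_kappa_score

human  = np.array([3, 4, 2, 5, 3, 4, 1, 4, 3, 2])  # 1-6 holistic scale
engine = np.array([3, 4, 3, 5, 2, 4, 1, 4, 4, 2])

kappa    = cohen_kappa_score(human, engine, weights="quadratic")
exact    = np.mean(human == engine)                # identical scores
adjacent = np.mean(np.abs(human - engine) <= 1)    # within one point

# Exact/adjacent rates rise mechanically on a 4-pt scale and fall on a
# 9-pt scale, which is why weighted kappa is preferred over % agreement.
print(f"kappa = {kappa:.2f}, exact = {exact:.0%}, adjacent = {adjacent:.0%}")
```

Quadratic weighting penalizes disagreements by their squared distance, so a two-point discrepancy counts far more than an adjacent one, making the statistic less dependent on scale length than raw percent agreement.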

SLIDE 5

Automated Scoring – Issues to consider in using scores for detecting drift

  • Rubric – whether it is general vs. task-specific; holistic vs. mechanistic, unidimensional.

  • Using other sections of the test (such as MC items) as a criterion is useful. However, there are also weaknesses in using MC items.

  • The relationship between performance tasks and MC items should differ (assuming they measure different parts of the construct). Consistency across tasks needs to be ensured before employing MC-section correlations as criteria.
  • Best when computed separately for each dimension (not the combined score) and each rater (not the total score); see the sketch after this list.
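
Following the last bullet, a sketch of computing the comparison statistic separately for each rubric dimension and each rater rather than once on totals; the record layout, dimension names, and scores are invented.

```python
# Sketch: per-dimension, per-rater agreement instead of a single total-score
# statistic. Records are invented: (rater, dimension, human score, engine score).
from collections import defaultdict
from sklearn.metrics import cohen_kappa_score

records = [
    ("R1", "organization", 3, 3), ("R1", "organization", 4, 4),
    ("R1", "organization", 2, 3), ("R1", "organization", 5, 5),
    ("R1", "mechanics",    2, 3), ("R1", "mechanics",    4, 4),
    ("R1", "mechanics",    3, 3), ("R1", "mechanics",    5, 4),
    ("R2", "organization", 5, 4), ("R2", "organization", 2, 2),
    ("R2", "organization", 4, 4), ("R2", "organization", 3, 3),
    ("R2", "mechanics",    3, 3), ("R2", "mechanics",    4, 5),
    ("R2", "mechanics",    2, 2), ("R2", "mechanics",    5, 5),
]

# Group human/engine score pairs by (rater, dimension).
pairs = defaultdict(lambda: ([], []))
for rater, dim, h, e in records:
    pairs[(rater, dim)][0].append(h)
    pairs[(rater, dim)][1].append(e)

# Operationally each cell would hold many papers; four are shown for brevity.
for (rater, dim), (h, e) in sorted(pairs.items()):
    k = cohen_kappa_score(h, e, weights="quadratic")
    print(f"{rater} / {dim}: weighted kappa = {k:.2f}")
```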

SLIDE 6

Papers by Lottridge and Schulz: Best Practices

  • The scoring engine must be trained – if drift exists, then training on papers from a brief time period can introduce similar error into the system.

  • Note that raters tend to avoid extreme scores – but some AS systems also avoid extreme scores.

  • Selection of the training sample – tasks already calibrated, representativeness of tasks.

  • Compare reader agreement AND the distribution of scores across all results (readers); see the sketch after this list.
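
A sketch of the second half of that recommendation: alongside agreement, compare each scorer's score distribution, including how often the extremes of the scale are awarded. The scores below are invented to show a reader and an engine both compressing toward the middle of the scale.

```python
# Sketch: distribution check across all scorers, including the engine.
# Agreement alone can hide a shared tendency to avoid extreme scores.
import numpy as np

scores = {  # invented scores on a 1-6 scale
    "reader_A": np.array([1, 3, 4, 2, 5, 6, 3, 4, 2, 5]),
    "reader_B": np.array([2, 3, 4, 3, 4, 5, 3, 4, 3, 4]),  # never awards 1 or 6
    "engine":   np.array([2, 3, 4, 3, 5, 5, 3, 4, 3, 4]),  # similarly compressed
}

for name, s in scores.items():
    extremes = np.mean((s == 1) | (s == 6))  # share of min/max scores awarded
    print(f"{name}: mean = {s.mean():.2f}, sd = {s.std(ddof=1):.2f}, "
          f"extremes = {extremes:.0%}")
```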

SLIDE 7

Papers by Lottridge and Schulz: Best Practices

  • Year-to-year drift should be checked (e.g., rescore papers, N = 500 to 1,000); see the sketch after this list.

  • Intrareader correlations and agreement increased over time.

  • AS is treated as a single scorer in comparison to each reader.

  • Propose using AS as a second reader or solely to monitor reader quality.

  • Utility as a second reader is established with the knowledge that AS will focus on certain dimensions (grammar, mechanics, vocabulary, semantic content or relevance, organization).

  • AS does not evaluate rhetorical skills, voice, the accuracy of the concepts described, or whether arguments are well founded.
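
A sketch of the year-to-year check from the first bullet: last year's papers are rescored by this year's readers and the paired shift is tested for a systematic change. The data are simulated, and the 20% leniency rate is an arbitrary illustration.

```python
# Sketch: year-to-year drift check by rescoring last year's papers. Simulated.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
n = 500                                    # slide suggests N = 500 to 1,000
last_year = rng.integers(1, 7, size=n)     # original scores on a 1-6 scale
# Simulate mild leniency drift: ~20% of papers rescored one point higher.
this_year = np.clip(last_year + (rng.random(n) < 0.2), 1, 6)

shift = (this_year - last_year).mean()     # average score change per paper
t, p = ttest_rel(this_year, last_year)     # paired test for a systematic shift
print(f"mean shift = {shift:+.2f} points, paired t = {t:.1f}, p = {p:.3g}")
```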

SLIDE 8

Automated Scoring

  • Some cautions, but also promise.
  • Common Core assessments – many of the leading proponents of different assessment models have overestimated the efficacy of AS and underestimated the cost and time required.

  • As noted earlier – limited to certain types of tasks and subjects.

  • As noted in the papers – AS doesn't make judgments but scores on selected features. Readers also score based on context and differential features, but they have the ability to make judgments and consider all aspects of a paper (time permitting).

  • AS is moving beyond the big three (ETS, Vantage, PEM) to many new players, including Pacific Metrics, AIR…