Automated Scoring and Rater Drift

  1. Automated Scoring and Rater Drift
  National Conference on Student Assessment, Detroit, 2010
  Wayne Camara, The College Board

  2. Rater Drift
  • When ratings are made over a period of time, there is a concern that the ratings may become more lenient or harsher.
  • This occurs in all rating contexts: performance appraisals, scoring performance assessments, judging athletic events…
  • Risk is increased when:
    • rubrics (criteria) are more subjective;
    • scoring occurs over time (within a year, between years);
    • there is pressure to score many tasks quickly.

  3. Detecting and Correcting Rater Drift
  • Tools may differ between assessments completed on paper and on computer.
  • Multiple readers, with mixed assignments.
  • Read-behinds.
  • Seed papers from previous administrations; benchmark papers (with established marks).
  • Calibration and retraining of readers (a monitoring sketch follows below).
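The seed-paper and read-behind checks above lend themselves to a simple monitoring loop. The sketch below is not from the presentation: the scores, the 0.70 agreement threshold, and the 0.25 bias threshold are invented for illustration. It flags a reader for recalibration when agreement with established benchmark marks drops, or when the reader's scores trend lenient or harsh relative to those marks.

```python
# Hypothetical seed-paper monitoring: each reader occasionally scores papers
# with established benchmark marks; readers whose agreement drops or whose
# scores trend lenient/harsh are flagged for recalibration.
import numpy as np

# Established marks for the seed papers (invented data).
benchmark = np.array([3, 2, 4, 3, 1, 2, 4, 3])

reader_scores = {
    "reader_A": np.array([3, 2, 4, 3, 1, 2, 4, 3]),
    "reader_B": np.array([4, 3, 4, 4, 2, 3, 4, 4]),  # drifting lenient
}

for reader, scores in reader_scores.items():
    exact = np.mean(scores == benchmark)          # exact agreement with benchmarks
    bias = np.mean(scores - benchmark)            # > 0 lenient, < 0 harsh
    flag = "recalibrate" if exact < 0.70 or abs(bias) > 0.25 else "ok"
    print(f"{reader}: exact agreement {exact:.2f}, mean bias {bias:+.2f} -> {flag}")
```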

  4. Automated Scoring
  • Automated scoring systems: essays, spoken responses, short content items, numerical/graphical responses to math questions (with a verifiable and limited set of correct responses).
  • Typically evaluated through comparison with human readers.
  • Correlations and weighted kappa (preferred over % agreement, which is misleading and sensitive to the rating scale, e.g., 4-point vs. 9-point); see the sketch below.
  • Exact and adjacent agreement are affected by the score scale (4- vs. 9-point).
  • Similar score distributions as humans (variation in ratings, use of the extremes of the scale).
  • Also validated against external criteria (other test sections, previous scores on the same test, scores on similar tests, grades).
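As a concrete illustration of the agreement statistics named on this slide, the sketch below computes quadratic weighted kappa and exact/adjacent agreement between hypothetical engine and human scores on a 4-point scale. The data and function names are invented; weighted kappa penalizes large disagreements more than adjacent ones, which is why it is preferred over raw percent agreement.

```python
# Minimal sketch (not from the presentation): quadratic weighted kappa plus
# exact and adjacent agreement between an engine and a human reader.
import numpy as np

def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Quadratic weighted kappa between two integer score vectors."""
    a = np.asarray(a) - min_score
    b = np.asarray(b) - min_score
    k = max_score - min_score + 1
    obs = np.zeros((k, k))
    for i, j in zip(a, b):
        obs[i, j] += 1
    # Expected counts under independence, scaled to the same total.
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
    # Quadratic disagreement weights: larger gaps cost more.
    idx = np.arange(k)
    w = (idx[:, None] - idx[None, :]) ** 2 / (k - 1) ** 2
    return 1.0 - (w * obs).sum() / (w * exp).sum()

def agreement_rates(a, b):
    a, b = np.asarray(a), np.asarray(b)
    exact = np.mean(a == b)
    adjacent = np.mean(np.abs(a - b) <= 1)  # exact or within one score point
    return exact, adjacent

# Hypothetical 4-point holistic scores from a human reader and the engine.
human  = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
engine = [3, 2, 3, 3, 2, 2, 3, 4, 2, 4]
print("QWK:", round(quadratic_weighted_kappa(human, engine, 1, 4), 3))
print("exact / adjacent agreement:", agreement_rates(human, engine))
```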

  5. Automated Scoring: Issues to Consider in Using Scores for Detecting Drift
  • Rubric: whether it is general vs. task-specific; holistic vs. mechanistic; unidimensional.
  • Using other sections of the test (such as MC items) as a criterion is useful. However, there are also weaknesses in using MC items.
  • The relationship between performance tasks and MC items should differ (assuming they measure different parts of the construct). Consistency across tasks must be ensured before employing MC-section correlations as criteria.
  • Best when computed separately for each dimension (not the combined score) and for each rater (not the total score), as in the sketch below.
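A minimal sketch of the "separately for each dimension and each rater" advice, assuming a table of ratings with an MC-section score attached to each examinee. The column names, raters, dimensions, and values are all hypothetical.

```python
# Hypothetical per-rater, per-dimension correlation of essay scores with the
# examinee's multiple-choice section score. A correlation that falls over time
# for one rater (task held constant) is one signal of drift, with the caveats
# the slide notes about MC-based criteria.
import pandas as pd

ratings = pd.DataFrame({
    "rater":       ["R1", "R1", "R1", "R2", "R2", "R2"] * 2,
    "dimension":   ["organization"] * 6 + ["conventions"] * 6,
    "essay_score": [3, 4, 2, 3, 4, 2, 3, 4, 2, 2, 4, 3],
    "mc_score":    [52, 61, 40, 50, 63, 41, 48, 60, 39, 45, 62, 50],
})

for (rater, dim), group in ratings.groupby(["rater", "dimension"]):
    r = group["essay_score"].corr(group["mc_score"])  # Pearson correlation
    print(f"{rater} / {dim}: r = {r:.2f}")
```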

  6. Papers by Lottridge and Schulz: Best Practices
  • The scoring engine must be trained; if drift exists, then using training papers drawn from a brief time period can introduce similar error into the system.
  • Note that raters tend to avoid extreme scores, and some AS systems also avoid extreme scores.
  • Selection of the training sample: tasks already calibrated; representativeness of tasks.
  • Compare reader agreement AND the distribution of scores across all readers (see the sketch below).
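The "compare agreement AND distributions" point can be made concrete with a small invented example: an engine can show respectable agreement while underusing the extreme score points, and that only shows up when the score distributions themselves are compared. The data below are hypothetical.

```python
# Hypothetical comparison of score distributions on a 4-point scale:
# the engine's scores cluster in the middle and avoid 1s and 4s.
import numpy as np

score_points = np.arange(1, 5)
human  = np.array([1, 2, 2, 3, 3, 3, 4, 4, 2, 3, 1, 4])
engine = np.array([2, 2, 2, 3, 3, 3, 3, 3, 2, 3, 2, 3])

def distribution(scores):
    """Proportion of papers at each score point."""
    return np.array([(scores == s).mean() for s in score_points])

print("score point:      ", score_points)
print("human proportion: ", np.round(distribution(human), 2))
print("engine proportion:", np.round(distribution(engine), 2))
print("extreme-score use (human vs engine):",
      np.isin(human, [1, 4]).mean(), "vs", np.isin(engine, [1, 4]).mean())
```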

  7. Papers by Lottridge and Schulz: Best Practices
  • Year-to-year drift should be checked (e.g., rescore papers, N = 500 to 1,000); see the sketch below.
  • Intra-reader correlations and agreement increased over time.
  • AS is treated as a single scorer in comparison to each reader.
  • They propose using AS as a second reader or solely to monitor reader quality.
  • Utility as a second reader is established with the knowledge that AS focuses on certain dimensions (grammar, mechanics, vocabulary, semantic content or relevance, organization).
  • AS does not evaluate rhetorical skills, voice, the accuracy of the concepts described, or whether arguments are well founded.
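A sketch of the year-to-year rescoring check, under assumptions not in the presentation: simulated scores stand in for last year's scores of record and this year's rescoring of the same papers, and the standardized-mean-difference flag of 0.20 is an illustrative threshold, not a recommendation from the papers.

```python
# Hypothetical between-year drift check: rescore a sample of last year's
# papers (the slide suggests N = 500 to 1,000) and compare the two score sets.
import numpy as np

rng = np.random.default_rng(0)
last_year = rng.integers(1, 5, size=600)                  # original scores of record (1-4)
# Simulated rescoring that trends slightly lenient.
this_year = np.clip(last_year + rng.choice([0, 0, 0, 1], size=600), 1, 4)

mean_shift = this_year.mean() - last_year.mean()
pooled_sd = np.sqrt((this_year.var(ddof=1) + last_year.var(ddof=1)) / 2)
smd = mean_shift / pooled_sd                              # standardized mean difference
exact = np.mean(this_year == last_year)

print(f"mean shift {mean_shift:+.2f}, SMD {smd:+.2f}, exact agreement {exact:.2f}")
if abs(smd) > 0.20:  # illustrative flag only
    print("possible between-year drift; review calibration and seed papers")
```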

  8. Automated Scoring
  • Some cautions, but promise.
  • Common Core assessments: many of the leading proponents of different assessment models have overestimated the efficacy of AS and underestimated the cost and time required.
  • As noted earlier, AS is limited to certain types of tasks and subjects.
  • As noted in the papers, AS does not make judgments but scores on selected features. Readers also score based on context and differential features, but they have the ability to make judgments and consider all aspects of a paper (if time permits).
  • AS is moving beyond the big three (ETS, Vantage, PEM) to many new players, including Pacific Metrics, AIR…
