SLIDE 1

Continuous Flow Scoring of Prose Constructed Response: A Hybrid of Automated and Human Scoring

Denny Way, Pearson
June 22, 2016

Presented at the National Conference on Student Assessment, Philadelphia, PA

SLIDE 2

Background

  • This presentation and the one that follows are based on systems Pearson has developed to support high-volume, large-scale applications of automated scoring (AS) of written constructed response items
  • Much of Pearson’s recent work in this area has been in supporting the PARCC assessment consortium
  • The PARCC English Language Arts / Literacy (ELA/L) assessments include a variety of prose constructed response (PCR) tasks, which require students to write extended responses

SLIDE 3

Scoring Written Responses on PARCC

  • The extensive use of writing is a strength of the ELA/L assessment and a primary reason for the strong rating it obtained in evaluations comparing it to other Common Core assessments1
  • Historically, writing has been scored by humans, which takes time and adds cost. Research has indicated that automated scoring can effectively supplement human scoring to reduce cost and increase scoring efficiency

1 See Doorey, N. & Polikoff, M. (2016, February). Evaluating the quality of next generation assessments. Washington, DC: Thomas B. Fordham Institute.

SLIDE 4

Use of Automated Scoring for PARCC ELA/L

  • Automated scoring of writing was assumed for the operational PARCC assessments beginning in 2015-16
  • At their preference, individual PARCC states may optionally contract to have 100% human scoring
  • A single score is reported for each PARCC PCR, with 10% second scoring for the purposes of reliability
  • To support the use of automated scoring, extensive research has been conducted; this research and proposed operational procedures were vetted with PARCC’s Technical Advisory Committee (TAC) and approved by the PARCC State Leads

SLIDE 5

Topics for This Presentation

  • What is “continuous flow” scoring?
  • Training IEA on operational data
  • Criteria for operationally deploying the AI scoring model
  • Evaluating results

SLIDE 6

Continuous Flow

  • In continuous flow scoring of constructed response items, a hybrid of human scoring and Pearson’s Intelligent Essay Assessor (IEA) is used to optimize both the quality and the cost of scoring
  • Continuous flow utilizes human scoring along with automated scoring such that responses can be branched to flow to either scoring approach
  • Part of continuous flow involves “smart routing,” a process that automatically routes certain responses to obtain an additional human score when the automated score is predicted to be less likely to agree with a human score
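The routing decision described above can be sketched as a simple gate. This is a minimal illustration, assuming a hypothetical per-response `predicted_agreement` estimate and an arbitrary 0.6 cutoff; neither is Pearson's actual mechanism:

```python
# Sketch of smart routing: responses whose automated score is predicted
# to be less likely to agree with a human score get an extra human read.
# The predicted_agreement field and the 0.6 cutoff are illustrative assumptions.

def route(response, agreement_cutoff=0.6):
    """Return 'human' when predicted human agreement is low,
    otherwise let the automated score stand."""
    if response["predicted_agreement"] < agreement_cutoff:
        return "human"      # request an additional human score
    return "automated"      # automated score stands

batch = [
    {"id": "r1", "predicted_agreement": 0.85},
    {"id": "r2", "predicted_agreement": 0.40},  # likely disagreement
]
routed = {r["id"]: route(r) for r in batch}
```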

SLIDE 7

Smart Routing Concept

SLIDE 8

Continuous Flow Process Diagram

SLIDE 9

Training IEA on Operational Data

  • Continuous flow makes it relatively easy to train the automated scoring engine on operational data early in the administration window
  • During this process, multiple human scores can be requested, and any backreading scores assigned by supervisors can also be used to obtain the best possible data to train IEA
  • Human scoring is monitored closely, and when criteria are met, IEA modeling takes place
  • Once IEA is trained on a particular prompt, results are evaluated by comparing IEA-human scoring agreement with human-human scoring agreement
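The evaluation step can be illustrated with exact agreement rates; the helper and the trait scores below are made up for illustration (the actual evaluation also uses the other indices discussed on later slides):

```python
def exact_agreement(scores_a, scores_b):
    """Proportion of responses on which two raters assign the same score."""
    return sum(a == b for a, b in zip(scores_a, scores_b)) / len(scores_a)

# Made-up trait scores on the same set of training responses
human_1 = [3, 2, 4, 1, 3, 2, 4, 3]
human_2 = [3, 2, 3, 1, 3, 2, 4, 2]
iea     = [3, 2, 4, 1, 3, 3, 4, 3]

hh = exact_agreement(human_1, human_2)  # human-human baseline
ih = exact_agreement(iea, human_1)      # IEA-human agreement
model_ready = ih >= hh                  # deploy only if IEA matches or beats humans
```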

SLIDE 10

Reporting with Multiple Scores: Best (i.e., Highest Quality) Score Wins

  • Although multiple scores may be assigned for a given response, only one can be reported
  • When multiple scores exist, there is a hierarchy for deciding which score is actually reported:
    • When the automated score is the only score, it is reported
    • When there is an automated score and a human score, the human score is reported
    • When there are two human scores, the first score is reported
    • When there is a supervisor back read score, the back read score is reported
    • When there are two non-adjacent scores, a resolution score is provided and the resolution score is reported
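Expressed as code, the hierarchy is a short precedence check. The dict keys below are hypothetical labels for the score types, not Pearson's actual data model:

```python
def reported_score(scores):
    """Select the single reported score from whatever scores exist
    for a response, following the best-score-wins hierarchy."""
    if "resolution" in scores:   # non-adjacent human scores were resolved
        return scores["resolution"]
    if "backread" in scores:     # supervisor back read overrides routine scores
        return scores["backread"]
    if "human_1" in scores:      # human beats automated; first of two humans wins
        return scores["human_1"]
    return scores["automated"]   # automated score reported only when it stands alone
```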

SLIDE 11

Daily Scoring Status Calls

  • While training IEA operationally, daily status meetings were held with automated scoring experts, performance scoring operational staff, content experts, the program team, and psychometricians
  • Scoring statistics were shared daily to review human scoring performance and the operational readiness of automated scoring models
  • Interventions were made where scoring challenges were encountered:
    • Resetting human scores where agreement was low
    • Additional training clarifications
    • Efforts to sample high-performance responses

SLIDE 12

Criteria for Operationally Deploying the AI Scoring Model - Considerations

  • Need for automated criteria that can be applied in real time
  • Focus on validity as the most important criterion
    • Governs evaluation of human scoring
    • Expressed in terms of agreement rates rather than other statistics
  • Need to document performance of AI scoring for subgroups
  • Metrics based on the research literature2


Measure                        Threshold           Human-Machine Difference
Pearson Correlation            Less than 0.7       Greater than 0.1
Kappa                          Less than 0.4       Greater than 0.1
Quadratic Weighted Kappa       Less than 0.7       Greater than 0.1
Exact Agreement                Less than 65%       Greater than 5.25%
Standardized Mean Difference   Greater than 0.15

2 See Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31, 2–13.
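The flagging logic implied by the table can be sketched as follows. The thresholds come from the table above; the metric names and dictionary layout are illustrative assumptions, not Pearson's actual implementation:

```python
# Flag an AI scoring model against the quality-criteria table
# (thresholds per Williamson, Xi, & Breyer, 2012).
THRESHOLDS = {
    # metric: (minimum acceptable level, max human-machine shortfall)
    "pearson_r": (0.70, 0.10),
    "kappa":     (0.40, 0.10),
    "qwk":       (0.70, 0.10),
    "exact_pct": (65.0, 5.25),
}

def flag_model(machine_human, human_human):
    """Return the metrics that violate either the absolute threshold
    or the allowed difference from human-human agreement."""
    flags = []
    for metric, (floor, max_diff) in THRESHOLDS.items():
        if machine_human[metric] < floor:
            flags.append(f"{metric}: below {floor}")
        if human_human[metric] - machine_human[metric] > max_diff:
            flags.append(f"{metric}: trails human-human by more than {max_diff}")
    if abs(machine_human["smd"]) > 0.15:
        flags.append("smd: |standardized mean difference| exceeds 0.15")
    return flags

clean = flag_model(
    {"pearson_r": 0.75, "kappa": 0.45, "qwk": 0.72, "exact_pct": 66.0, "smd": 0.05},
    {"pearson_r": 0.78, "kappa": 0.48, "qwk": 0.74, "exact_pct": 68.0},
)
```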
SLIDE 13

Criteria for Operationally Deploying the AI Scoring Model

  • 1. Primary Criteria - Based on validity responses
    • With smart routing applied as needed, IEA agreement is as good as or better than human agreement for both trait scores
  • 2. Contingent Primary Criteria (if validity responses are not available)
    • With smart routing applied as needed, IEA-Human exact agreement is within 5.25% of Human-Human exact agreement for both trait scores
  • 3. Secondary Criteria - Based on the training responses
    • With smart routing applied as needed, IEA-human differences on statistical measures for both traits are evaluated against quality-criteria tolerances for subgroups with at least 50 responses

SLIDE 14

Subgroup Analyses

  • For each prompt, we evaluated the performance of IEA for various subgroups
  • We calculated various agreement indices (r, Kappa, Quadratic Weighted Kappa, Exact Agreement), comparing human-human results with IEA-human results
  • We also looked at standardized mean differences (SMDs) between IEA and human scores
  • We flagged differences for any group based on the quality criteria:


Measure                        Threshold           Human-Machine Difference
Pearson Correlation            Less than 0.7       Greater than 0.1
Kappa                          Less than 0.4       Greater than 0.1
Quadratic Weighted Kappa       Less than 0.7       Greater than 0.1
Exact Agreement                Less than 65%       Greater than 5.25%
Standardized Mean Difference   Greater than 0.15
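Under illustrative assumptions (integer trait scores, standard-library Python rather than Pearson's actual tooling), the indices above can be computed as:

```python
import statistics

def exact_agreement(a, b):
    """Proportion of responses receiving identical scores from both raters."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def quadratic_weighted_kappa(a, b):
    """QWK between two integer score vectors: 1 minus the ratio of
    observed to chance-expected squared disagreement."""
    n = len(a)
    observed = sum((x - y) ** 2 for x, y in zip(a, b)) / n
    expected = sum((x - y) ** 2 for x in a for y in b) / (n * n)
    return 1 - observed / expected

def standardized_mean_difference(machine, human):
    """Machine-minus-human mean difference in pooled-SD units."""
    pooled_sd = statistics.pstdev(machine + human)
    return (statistics.mean(machine) - statistics.mean(human)) / pooled_sd
```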

SLIDE 15

Summary

  • Continuous flow scoring involves integrating human and automated scoring processes to support high-quality and efficient scoring
  • This presentation described various processes involved in continuous flow scoring as applied to the PARCC assessment program
  • The presentation that follows will share some of the research and initial operational results for the PARCC program based on continuous flow scoring
