
SLIDE 1

PARCC Research Results

Karen E. Lochbaum Pearson June 22, 2016

Presented at the National Conference on Student Assessment, Philadelphia, PA

SLIDE 2

Research Questions

  • Do scores assigned by the Intelligent Essay Assessor (IEA) agree with human scores as well as human scores agree with each other?
    ‒ Across all prompts and traits for all responses?
    ‒ Across prompts and traits for responses across subgroups?
  • Do scores assigned by IEA agree with scores assigned by experts to validity papers as well as human scores do?

SLIDE 3

Series of Studies and Results

  • 2014: Field Test Study
    ‒ Promising initial results
  • 2015: Year 1 Operational Studies
    ‒ Performance
    ‒ Validity responses
    ‒ Subgroups
  • 2016: Year 2 Operational Performance

SLIDE 4

2015 Research Summary

SLIDE 5

Year 1 Operational Study

  • IEA served as the 10% second score
  • A subset of prompts received an additional human score
    ‒ One of each prompt type
    ‒ In each grade level
  • Study compared IEA-human to human-human performance on 26 prompts

SLIDE 6

Summary of Human vs. IEA Exact Agreement Rates

Exact agreement between IEA and human readers was higher than that between two human readers, and higher still between IEA and more experienced human backread scorers.
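To make the metric concrete: exact agreement is simply the proportion of responses on which two scorers assign identical trait scores. A minimal sketch (the function name is mine, not Pearson's):

```python
def exact_agreement_rate(scores_a, scores_b):
    """Fraction of responses where both scorers assign the same score."""
    assert len(scores_a) == len(scores_b)
    return sum(a == b for a, b in zip(scores_a, scores_b)) / len(scores_a)

# Two raters agreeing on 3 of 4 responses:
print(exact_agreement_rate([2, 3, 3, 4], [2, 3, 4, 4]))  # 0.75
```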

SLIDE 7

Summary of Human vs. IEA Exact Agreement Rates on Validity Responses

IEA’s exact agreement on validity responses was higher than that of the human scorers.

SLIDE 8

Human vs. IEA Exact Agreement Rates by Subgroup

Min N count: 1,379/14,370 (2+ Races); Max N count: 43,693/448,339 (Whites)

The exact agreement between IEA and human readers was higher than it was between two human readers for various demographic subgroups.

Comparison            Af Am    Asian    Hispanic   2+ Races   Native Am
Human 2 vs. Human 1   68.6%    62.8%    67.1%      69.8%      65.4%
IEA Op vs. Human 1    74.0%    68.1%    72.5%      72.6%      72.6%

Comparison            White    ELL      SWD        Female     Male
Human 2 vs. Human 1   65.0%    71.2%    75.5%      63.9%      68.2%
IEA Op vs. Human 1    69.9%    76.3%    78.6%      69.0%      73.0%

SLIDE 9

2016 Operational Performance

SLIDE 10

A Reminder: Criteria for Operationally Deploying the AI Scoring Model

  • 1. Primary Criteria – Based on validity responses
    ‒ With smart routing applied as needed, IEA agreement is as good as or better than human agreement for both trait scores
  • 2. Contingent Primary Criteria (if validity responses are not available)
    ‒ With smart routing applied as needed, IEA-Human exact agreement is within 5.25% of Human-Human exact agreement for both trait scores
  • 3. Secondary Criteria – Based on the training responses
    ‒ With smart routing applied as needed, IEA-human differences on statistical measures for both traits are evaluated against quality criteria tolerances for subgroups with at least 50 responses
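The contingent primary criterion (exact agreement within 5.25 percentage points of human-human agreement for every trait) lends itself to a simple per-prompt check. A sketch under assumed names, not Pearson's actual deployment code:

```python
def meets_contingent_criterion(iea_human_exact, human_human_exact,
                               tolerance=5.25):
    """Check the contingent primary criterion for one prompt: IEA-human exact
    agreement (in percentage points, one entry per trait) must not trail
    human-human exact agreement by more than `tolerance` points."""
    return all(hh - ih <= tolerance
               for ih, hh in zip(iea_human_exact, human_human_exact))

# Hypothetical Conventions and Expressions trait agreements for one prompt:
print(meets_contingent_criterion([68.0, 71.5], [70.0, 69.0]))  # True
print(meets_contingent_criterion([60.0, 71.5], [70.0, 69.0]))  # False
```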

SLIDE 11

Summary of Results: Comparison of IEA and Human Scores

  • Means and standard deviations of IEA and human scores across all prompts were very close
  • Some variability compared to the first human scorer might be expected item-by-item because IEA was trained on the “best” score available (backread, resolution, first read)

SLIDE 12

IEA Mean vs. Human Mean Conventions Trait

SLIDE 13

IEA SD vs. Human SD Conventions Trait

SLIDE 14

IEA Mean vs. Human Mean Expressions Trait

SLIDE 15

IEA SD vs. Human SD Expressions Trait

SLIDE 16

IEA vs. Human Validity Agreement Conventions Trait

Legend: Blue means IEA performance exceeds human by > 5.25; Blue-Green means IEA at or above human; Green means IEA performance within 5.25 of human; Red means IEA performance lower than human by > 5.25.

[Table: IEA vs. human exact validity agreement by grade (3–11) and prompt SP0–SP3; color-coded cell values not preserved]

SLIDE 17

IEA vs. Human Validity Agreement Expressions Trait

17

Blue exceeds by > 5.25 Blue-Green exceeds Green within 5.25 Red lower by > 5.25

Grade Exact SP0 SP1 SP2 SP3 SP4

3 4 4 5 5 6 6 6 7 7 8 9 9 9 10 10 10 11 11 11 11 11

SLIDE 18

IEA vs. Human Agreement Conventions Trait

[Table: IEA vs. human exact agreement by grade (3–11) and prompt SP0–SP3; color-coded cell values not preserved]

Legend: Blue exceeds by > 5.25; Blue-Green at or above; Green within 5.25; Red lower by > 5.25.

SLIDE 19

IEA vs. Human Agreement Expressions Trait

[Table: IEA vs. human exact agreement by grade (3–11) and prompt SP0–SP4; color-coded cell values not preserved]

Legend: Blue exceeds by > 5.25; Blue-Green at or above; Green within 5.25; Red lower by > 5.25.

SLIDE 20

A Reminder: Subgroup Analyses

  • For each prompt, we evaluated the performance of IEA for various subgroups
  • We calculated various agreement indices (r, Kappa, Quadratic Weighted Kappa, Exact Agreement) and compared human-human results with IEA-human results
  • We also looked at standardized mean differences (SMDs) between IEA and human scores
  • We flagged differences for any groups based on the quality criteria:


Measure                       Threshold        Human-Machine Difference
Pearson Correlation           Less than 0.7    Greater than 0.1
Kappa                         Less than 0.4    Greater than 0.1
Quadratic Weighted Kappa      Less than 0.7    Greater than 0.1
Exact Agreement               Less than 65%    Greater than 5.25%
Standardized Mean Difference  —                Greater than 0.15
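For reference, the agreement indices in the table can be computed as in the sketch below. This is illustrative only (function names and formula conventions, such as the pooled-SD denominator for SMD, are my assumptions, not the PARCC scoring code). A prompt-subgroup cell would be flagged when a measure falls below its threshold, or when the IEA-human value trails the human-human value by more than the allowed difference.

```python
import math

def exact_agreement(a, b):
    """Proportion of responses where the two raters assign the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def pearson_r(a, b):
    """Pearson correlation between two score vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def weighted_kappa(a, b, quadratic=False):
    """Cohen's kappa; with quadratic=True, quadratic weighted kappa."""
    cats = sorted(set(a) | set(b))
    idx = {c: i for i, c in enumerate(cats)}
    k, n = len(cats), len(a)
    if k == 1:
        return 1.0  # only one score category in play: trivial agreement
    # observed score-pair proportions and marginals
    obs = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        obs[idx[x]][idx[y]] += 1 / n
    pa = [sum(row) for row in obs]
    pb = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    # disagreement weights: 0/1 for plain kappa, squared distance for QWK
    def w(i, j):
        return (i - j) ** 2 / (k - 1) ** 2 if quadratic else float(i != j)
    num = sum(w(i, j) * obs[i][j] for i in range(k) for j in range(k))
    den = sum(w(i, j) * pa[i] * pb[j] for i in range(k) for j in range(k))
    return 1.0 - num / den

def smd(machine, human):
    """Standardized mean difference, here using a pooled-SD denominator."""
    ma, mb = sum(machine) / len(machine), sum(human) / len(human)
    va = sum((x - ma) ** 2 for x in machine) / (len(machine) - 1)
    vb = sum((y - mb) ** 2 for y in human) / (len(human) - 1)
    return (ma - mb) / math.sqrt((va + vb) / 2)
```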

SLIDE 21

Subgroup Analyses

  • 29/55 prompts had no flags on either trait
  • When flags did occur:
    ‒ Only for one or two groups
    ‒ Only one or two of the quality measures
    ‒ None sufficiently concerning to consider retraining
  • Sometimes different measures indicated different results
    ‒ Lower than humans on exact agreement
    ‒ Higher on quadratic weighted kappa
  • SMD flags were rare
    ‒ Always indicated higher IEA scores than human scores

SLIDE 22

Summary of Subgroup Analyses

SLIDE 23

Spring 2016 Continuous Flow Performance


With 6.5M responses scored YTD

SLIDE 24

Summary

  • Extensive research was conducted over three years to validate the use of the Continuous Flow system on the PARCC assessment
  • Initial results indicate its successful operational use in 2016
  • Continuous Flow combines the strengths and benefits of both human and automated scoring
  • Continuous Flow performance exceeds that of a human-only scoring system while routing potentially challenging responses for further review
