

SLIDE 1

Developing Automated Scoring for Large-scale Assessments of Three-dimensional Learning

Jay Thomas1, Ellen Holste2, Karen Draney3, Shruti Bathia3, and Charles W. Anderson2

1. ACT, Inc.

2. Michigan State University

3. UC Berkeley, BEAR Center

SLIDE 2

Based on the NRC report Developing Assessments for the Next Generation Science Standards (Pellegrino et al., 2014):

  • Assessment tasks need multiple components to get at all three dimensions (C 2-1)
  • Tasks must accurately locate students along a sequence of progressively more complex understanding (C 2-2)
  • Traditional selected-response items cannot assess the full breadth and depth of the NGSS
  • Technology can address some of the problems, particularly scalability and cost

SLIDE 3

Example of a Carbon TIME Item

SLIDE 4

Comparing FC (forced choice) vs. CR (constructed response) vs. Both

  • Compare spread of data
  • Adding CR (or CR only) increases the confidence that we have classified students correctly
  • Because explanation is a practice that we focus on in the learning progression (LP), CR is required to assess the construct fully


SLIDE 5

Recursive Feedback Loops for Item Development

Processes moving toward final interpretation (the ML-scoring and QWK steps are sketched in code below):

  1. Item development
  2. Students respond to items
  3. WEW (rubric) development
  4. Using the WEW (human scoring) to create a training set
  5. Creating machine learning (ML) models
  6. Using the ML model (computer scoring)
  7. Back-check coding (human)
  8. QWK (quadratic weighted kappa) check for reliability
  9. Psychometric analysis (IRT, WLE)
  10. Interpretation by the larger research group

Feedback loops indicate that a question, rubric, or coding potentially has a problem that needs to be addressed.
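
A minimal sketch of steps 4-8, assuming scikit-learn; the toy responses, scores, and the TF-IDF plus logistic regression model are illustrative assumptions, not the project's actual pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-ins for human-scored constructed responses (hypothetical).
responses = [
    "plants get their mass from the soil",
    "the mass comes from carbon dioxide in the air",
    "water from the roots makes the plant heavier",
    "carbon atoms from CO2 are built into glucose",
] * 10
human_scores = [1, 3, 1, 3] * 10

X_train, X_test, y_train, y_test = train_test_split(
    responses, human_scores, test_size=0.25, random_state=0
)

# "Creating ML models": fit a text classifier on the human-scored training set.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# "QWK check for reliability": quadratic weighted kappa between machine
# scores and held-out human scores; a low value flags an item, rubric,
# or coding problem. (With repeated toy data the kappa is not meaningful.)
machine_scores = model.predict(X_test)
print(cohen_kappa_score(y_test, machine_scores, weights="quadratic"))
```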

SLIDE 6

Consequences of using machine scoring

  • Item revision and improvement
  • Increase in the size of the usable data set, which increases the power of the statistics
  • Increased confidence in the reliability of scoring through back-checking samples and revising models
  • Reduced costs by needing fewer human coders
  • A model showing that the kinds of assessments envisioned by Pellegrino et al. (2014) for the NGSS can be reached at scale and at low cost

As of March 6, 2019:

School Year   Responses Scored
2015-16       175,265
2016-17       532,825
2017-18       693,086
2018-19       227,041
TOTAL         1,628,217

Cost savings and scalability (the arithmetic is reproduced below):

  • Labor hours needed to human score all responses @ 100 per hour: 16,282.17 hours
  • Labor cost per hour (undergraduate students, including misc. costs): $18 per hour
  • Cost to human score all responses: $293,079
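
These figures follow directly from the response counts; a few lines of Python reproduce the arithmetic.

```python
# Slide-6 cost arithmetic: total responses, human-scoring hours, and cost.
responses_scored = 175_265 + 532_825 + 693_086 + 227_041  # 1,628,217
rate_per_hour = 100   # responses one human coder scores per hour
wage = 18             # dollars per hour, incl. misc. costs

hours = responses_scored / rate_per_hour   # 16,282.17 hours
cost = hours * wage                        # $293,079.06
print(f"{hours:,.2f} hours, ${cost:,.2f}")
```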

SLIDE 7

Types of validity evidence

As taken from the Standards for Educational and Psychological Testing (2014 ed.):

  • Evidence based on test content
  • Evidence based on response processes
  • Evidence based on internal structure
  • Evidence based on relations to other variables
      • Convergent and discriminant evidence
      • Test-criterion evidence
  • Evidence for validity and consequences of testing
SLIDE 8

Comparison of interview and IRT analysis results

  • Overall Spearman rank correlation = 0.81, p < 0.01, n = 49 (a sketch of this check appears below)
  • Comparison of scoring for one written versus one interview item
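
A minimal sketch of that rank-correlation check, assuming SciPy; the arrays are synthetic placeholders, not the study's interview or IRT data.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical stand-ins: IRT proficiency estimates (e.g., WLEs) and
# interview-based LP levels for the same n = 49 students.
rng = np.random.default_rng(0)
irt_proficiency = rng.normal(size=49)
interview_level = np.round(irt_proficiency + rng.normal(scale=0.5, size=49))

rho, p = spearmanr(interview_level, irt_proficiency)
print(f"Spearman rho = {rho:.2f}, p = {p:.3g}")  # slide reports rho = 0.81, p < 0.01
```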

SLIDE 9

Evidence based on internal structure

  • Analysis method: item response models (specifically, unidimensional and multidimensional partial credit models; the standard form is shown below)
  • These models place item and step difficulties and person proficiencies on one scale
  • They provide comparisons of step difficulties within items
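
For reference, the unidimensional partial credit model (Masters, 1982) gives the probability that a student with proficiency $\theta$ responds in category $k$ of item $i$ with step difficulties $\delta_{ij}$:

$$
P(X_i = k \mid \theta) = \frac{\exp \sum_{j=0}^{k} (\theta - \delta_{ij})}{\sum_{m=0}^{K_i} \exp \sum_{j=0}^{m} (\theta - \delta_{ij})},
\qquad k = 0, 1, \ldots, K_i,
$$

with the convention $\sum_{j=0}^{0} (\theta - \delta_{ij}) \equiv 0$. The multidimensional version replaces $\theta$ with the proficiency on the dimension the item measures.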
SLIDE 10

Step difficulties for each item (2015-16 data)

SLIDE 11

Classifying Students into LP levels

Comparing FC to EX + FC

SLIDE 12

Classifying Students into LP levels

Comparing EX to EX + FC

SLIDE 13

Classifying Classroom Data

95% confidence intervals: average learning gains for teachers with at least 15 students who had both overall pretests and overall posttests (macroscopic explanations). A sketch of the interval computation follows.
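
A hedged sketch of how such intervals can be computed, assuming pandas and SciPy; the DataFrame, column names, and toy data are illustrative, not the Carbon TIME dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy placeholder data: matched pretest/posttest scores per teacher.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "teacher": ["A"] * 20 + ["B"] * 20,
    "pretest": rng.normal(0.0, 1.0, 40),
    "posttest": rng.normal(0.5, 1.0, 40),
})
df["gain"] = df["posttest"] - df["pretest"]

for teacher, gains in df.groupby("teacher")["gain"]:
    if len(gains) < 15:      # the slide's inclusion rule
        continue
    # 95% CI half-width: t critical value times the standard error.
    half_width = stats.t.ppf(0.975, len(gains) - 1) * gains.sem()
    print(f"Teacher {teacher}: mean gain {gains.mean():.2f} +/- {half_width:.2f}")
```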

SLIDE 14

Questions?

Contact info:

  • Jay Thomas: jay.thomas@act.org
  • Karen Draney: kdraney@berkeley.edu
  • Andy Anderson: andya@msu.edu
  • Ellen Holste: holste@msu.edu
  • Shruti Bathia: shruti_bathia@berkeley.edu