SLIDE 1

Humans and Machines: Modeling the Stochastic Behavior of Raters in Educational Assessment

Richard J. Patz
BEAR Seminar, UC Berkeley Graduate School of Education
February 13, 2018

SLIDE 2

Outline of Topics

  • Natural human responses in educational assessment
  • Technology in education, assessment and scoring
  • Computational methods for automated scoring (NLP, LSA, ML)

  • Rating information in statistical and psychometric analysis: Challenges
  • Unreliability and bias
  • Combining information from multiple ratings
  • Hierarchical rater model (HRM)
  • Applications
  • Comparing machine to humans
  • Simulating human rating errors to further related research
SLIDE 3

Why natural, constructed response formats in assessment?

  • Learning involves constructing knowledge and expressing it through language (written and/or oral)
  • Assessments should consist of ‘authentic’ tasks, i.e., of a type that students encounter during instruction
  • Artificially contrived item formats (e.g., multiple-choice) advantage skills unrelated to the intended construct
  • Some constructs (e.g., essay writing) simply can’t be measured through selected-response formats

SLIDE 4

Disadvantages of Constructed-Response Formats

  • Time consuming for examinees (fewer items per unit time)
  • Require expensive human ratings (typically)
  • Create delay in providing scores, reports
  • Human rating is error-prone
  • Consistency across rating events difficult to maintain
  • Inconsistency impairs comparability
  • Combining multiple ratings creates modeling, scoring problems
SLIDE 5

Practical Balancing of Priorities

  • Mix constructed-response formats with selected-response formats, to realize benefits of each
  • Leverage technology in the scoring of CR items
  • Rule-based scoring (exhaustively enumerated/constrained)
  • Natural language processing and subsequent automated rating for written (sometimes spoken) responses
  • Made more practical with computer-based test delivery
SLIDE 6

Technology For Automated Scoring

  • Ten years ago there were relatively few providers
  • Expensive, proprietary algorithms
  • Specialized expertise (NLP, LSA, AI)

  • Laborious, ‘hand-crafted’ engine training
  • Today, solutions are far more widely available
  • Students fit automated scoring models in CS and statistics classes
  • Open-source libraries abound
  • Machine learning and neural networks: accessible, powerful, and up to the job
  • Validity and reliability challenges remain
  • Impact of algorithms on instruction, e.g., in writing? Also threat of gaming strategies
  • Managing algorithm improvements, examinee adaptations, over time
  • Quality human scores needed to train the machines (supervised learning)
  • Biases or other problems in human ratings ‘learned’ by algorithms
  • Combining scores from machines and humans
SLIDE 7

Machine Learning for Automated Essay Scoring

Example characteristics:

  • Words processed in relation to a corpus for frequency, etc.
  • N-grams (word pairs, triplets, etc.)
  • Transformations (non-linear, sinusoidal) and dimensionality reduction
  • Iterations improving along a gradient, with memory of previous states
  • Data split into training and validation sets; prediction accuracy maximized on validation; little else “interpretable” about parameters
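A minimal sketch of this pipeline in Python, using only NumPy: bag-of-words counts as features, a ridge-penalized linear model, and a train/validation split. The essays, scores, and penalty value are invented for illustration; a real engine would use far richer features and far more data.

```python
import numpy as np

# Toy essays and holistic scores (invented for illustration).
essays = [
    "the cat sat",
    "the cat sat on the mat",
    "the quick brown fox jumps over the lazy dog",
    "dogs and cats are animals",
    "the fox and the dog ran over the mat",
    "a cat and a dog sat on a mat together",
]
scores = np.array([1.0, 2.0, 4.0, 3.0, 3.5, 3.0])

# Bag-of-words counts over the pooled vocabulary (the word-frequency
# features mentioned above).
vocab = sorted({w for e in essays for w in e.split()})
X = np.array([[e.split().count(w) for w in vocab] for e in essays], float)

# Train/validation split: fit a ridge-penalized linear model on the
# training part, then predict on the held-out part.
train, valid = slice(0, 4), slice(4, 6)
lam = 0.1  # ridge penalty keeps the normal equations well-conditioned
Xt = X[train]
w = np.linalg.solve(Xt.T @ Xt + lam * np.eye(len(vocab)),
                    Xt.T @ scores[train])
pred = X[valid] @ w
print("validation predictions:", np.round(pred, 2))
```

In production systems the "little else interpretable" point applies: the fitted weights are tuned for predictive accuracy, not substantive meaning.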

Example architecture: Taghipour & Ng (2016) [diagram]

SLIDE 8

Focus of Research

  • Situating the rating process within the overall measurement and statistical context

  • The hierarchical rater model (HRM)
  • Accounting for multiple ratings ‘correctly’
  • Contrast with alternative approaches, e.g., Facets
  • Simultaneous analysis of human and machine ratings
  • Example from large-scale writing assessment
  • Leveraging models of human rating behavior for better simulation and examination of impacts on inferences

SLIDE 9

Hierarchical Structure of Rated Item Response Data

  • If all levels follow normal distributions, Generalizability Theory applies
  • Estimates at any level weight the data mean and prior mean using the ‘generalizability coefficient’
  • If ‘ideal ratings’ follow an IRT model and observed ratings follow a signal detection model, the HRM applies

Patz, Junker & Johnson, 2002

θ_i ~ i.i.d. N(µ, σ²), i = 1, …, N
ξ_ij ~ an IRT model (e.g., PCM), j = 1, …, J, for each i
X_ijr ~ a signal detection model, r = 1, …, R, for each i, j

HRM levels
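The three levels can be simulated directly. The sketch below draws proficiencies, then ideal ratings from a PCM, then observed ratings from the discrete normal signal-detection kernel used in Patz, Junker & Johnson (2002); all parameter values here are illustrative, not from the deck.

```python
import numpy as np

rng = np.random.default_rng(0)
N, J, R, K = 200, 3, 2, 5        # examinees, items, raters, categories 0..K-1

def pcm_probs(theta, beta, gamma):
    # Partial credit model category probabilities for one item;
    # gamma holds the K-1 step parameters.
    logits = np.concatenate(([0.0], np.cumsum(theta - beta - gamma)))
    num = np.exp(logits - logits.max())
    return num / num.sum()

beta = np.array([0.0, 0.5, -0.5])
gamma = np.zeros((J, K - 1))

theta = rng.normal(0, 1, N)                             # level 1: proficiency
xi = np.array([[rng.choice(K, p=pcm_probs(theta[i], beta[j], gamma[j]))
                for j in range(J)] for i in range(N)])  # level 2: ideal ratings

phi, psi = np.array([0.0, -0.2]), np.array([0.4, 0.6])  # rater bias/variability

def rating_probs(ideal, r):
    # Level 3: discrete normal kernel centered at ideal + bias.
    cats = np.arange(K)
    w = np.exp(-0.5 * ((cats - ideal - phi[r]) / psi[r]) ** 2)
    return w / w.sum()

X = np.array([[[rng.choice(K, p=rating_probs(xi[i, j], r))
                for r in range(R)] for j in range(J)] for i in range(N)])
print("observed ratings shape:", X.shape)
```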

SLIDE 10

Hierarchical Rater Model

  • Raters detect the true item score (i.e., ‘ideal rating’) with a degree of bias and imprecision

Example (from the slide): a rater with bias φ_r = −0.2 and variability ψ_r = 0.5, rating a response whose ideal rating is ξ = 3, assigns scores 2, 3, 4 with probabilities p_32r = .08, p_33r = .64, p_34r = .27.

SLIDE 11

Hierarchical Rater Model (cont.)

  • Examinees respond to items according to a polytomous item response theory model (here PCM; could be GPCM, GRM, or others):

P[ξ_ij = ξ | θ_i, β_j, γ_jξ] = exp{ Σ_{k=1..ξ} (θ_i − β_j − γ_jk) } / Σ_{h=0..K−1} exp{ Σ_{k=1..h} (θ_i − β_j − γ_jk) }

θ_i ~ i.i.d. N(µ, σ²), i = 1, …, N
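A small sketch of the category probabilities under this formula. The empty sum for category 0 makes that numerator 1; the θ, β, γ values below are arbitrary.

```python
import numpy as np

def pcm_probs(theta, beta, gamma):
    # Numerator for category xi is exp(sum_{k=1}^{xi} (theta - beta - gamma_k)),
    # with an empty sum (= 1) for xi = 0; the denominator sums over h = 0..K-1,
    # exactly as in the displayed formula.
    logits = np.concatenate(([0.0], np.cumsum(theta - beta - gamma)))
    num = np.exp(logits - logits.max())  # subtract max for numerical stability
    return num / num.sum()

p = pcm_probs(theta=0.5, beta=0.0, gamma=np.zeros(4))  # K = 5 categories
print(np.round(p, 3))
```

Higher θ pushes probability mass toward higher categories, as a quick check of monotonicity confirms.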

SLIDE 12

HRM Estimation

  • Most straightforward to estimate using Markov chain Monte Carlo
  • Uninformative priors specified in Patz et al., 2002; Casabianca et al., 2016
  • WinBUGS/JAGS (may be called from within R)
  • HRM has been estimated using maximum likelihood and posterior modal estimation (Donoghue and Hombo, 2001; DeCarlo et al., 2011)
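A toy illustration of the MCMC idea, assuming known item parameters: random-walk Metropolis draws for a single examinee's θ given ideal ratings under the PCM, with a N(0, 1) prior. This sketches only the sampling principle, not the full HRM sampler of Patz et al.

```python
import numpy as np

rng = np.random.default_rng(1)

def pcm_loglik(theta, xi, beta, gamma):
    # Log-likelihood of the ideal ratings xi (one per item) under the PCM.
    ll = 0.0
    for j, k in enumerate(xi):
        logits = np.concatenate(([0.0], np.cumsum(theta - beta[j] - gamma[j])))
        ll += logits[k] - np.logaddexp.reduce(logits)
    return ll

beta, gamma = np.zeros(5), np.zeros((5, 4))   # 5 items, categories 0..4
xi = np.array([3, 4, 3, 2, 4])                # this examinee's ideal ratings

# Random-walk Metropolis for theta under a N(0, 1) prior.
theta, draws = 0.0, []
for _ in range(5000):
    prop = theta + rng.normal(0, 0.5)
    log_acc = (pcm_loglik(prop, xi, beta, gamma) - 0.5 * prop ** 2) \
            - (pcm_loglik(theta, xi, beta, gamma) - 0.5 * theta ** 2)
    if np.log(rng.uniform()) < log_acc:
        theta = prop
    draws.append(theta)
posterior = np.array(draws[1000:])
print("posterior mean of theta:", round(posterior.mean(), 2))
```

In practice one would sample all parameters jointly (as WinBUGS/JAGS do) rather than conditioning on known item parameters.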

SLIDE 13

Facets Alternative

  • Facets (Linacre) models can capture rater effects:

λ_rjk is the effect rater r has on category k of item j. Note: rater effects λ may be constant across all levels of an item, across all items at a given level, or across all levels of all items. Every rater–item combination has a unique ICC. Facets models have proven highly useful in the detection and mitigation of rater effects in operational scoring (e.g., Wang & Wilson, 2005; Myford & Wolfe, 2004).

P[ξ_ij = ξ | θ_i, β_j, γ_jξ, λ_rjk] = exp{ Σ_{k=1..ξ} (θ_i − β_j − γ_jk − λ_rjk) } / Σ_{h=0..K−1} exp{ Σ_{k=1..h} (θ_i − β_j − γ_jk − λ_rjk) }

SLIDE 14

Dependence structure of Facets models

  • Ratings are directly related to proficiency
  • Arbitrarily precise θ estimation is achievable by increasing the number of ratings R
  • Alternatives (other than HRM) include:
  • Rater Bundle Model (Wilson & Hoskens, 2001)
  • Design-effect-like correction (Bock, Brennan, Muraki, 1999)
SLIDE 15

Applications & Extensions

  • Applications and extensions of the HRM:
  • Detecting rater effects and “modality” effects in a Florida assessment program (Patz, Junker, Johnson, 2002)
  • 360-degree feedback data (Barr & Raju, 2003)
  • Rater covariates, applied to the Golden State Exam (image vs. paper study) (Mariano & Junker, 2007)
  • Latent classes for raters, applied to a large-scale language assessment (DeCarlo et al., 2011)
  • Machine (i.e., automated) and human scoring (Casabianca et al., 2016)

SLIDE 16

HRM with rater covariates

  • Introduce a design matrix 𝛷 associating individual raters with their covariates
  • Bias and variability of ratings vary according to rater characteristics

Bias and variability: each modeled as a function of the rater covariates via 𝛷 [equations omitted]
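The precise specification is in Mariano & Junker (2007); the sketch below conveys only the general idea, assuming a linear structure for bias and a log-linear one for variability. The design matrix, covariates, and coefficients are all invented for illustration.

```python
import numpy as np

# Hypothetical design matrix Phi: one row per rater; columns are an
# intercept, years of experience, and a machine indicator (invented
# covariates, not from the slide).
Phi = np.array([[1.0, 2.0, 0.0],
                [1.0, 10.0, 0.0],
                [1.0, 0.0, 1.0]])
a = np.array([0.10, -0.02, -0.15])   # coefficients driving rater bias
b = np.array([-0.50, -0.05, -0.30])  # coefficients driving log-variability

phi = Phi @ a           # bias: assumed linear in the covariates
psi = np.exp(Phi @ b)   # variability: log link keeps it positive
print("bias:", np.round(phi, 3))
print("variability:", np.round(psi, 3))
```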

SLIDE 17

Application with Human and Machine Ratings

  • Statewide writing assessment program (provided by CTB)
  • 5 dimensions of writing (“items”), each scored on a 1–6 rubric
  • 487 examinees
  • 36 raters: 18 male, 17 female, 1 machine
  • Each paper scored by four raters (1 machine, 3 humans)
  • 9740 ratings in total
SLIDE 18

Results by “gender”

  • Male and female raters show very similar (and on average negligible) bias
  • The machine is less variable (especially than male raters) and somewhat more severe (not significant)
  • Individual rater bias and severity estimates are informative (next slide)
SLIDE 19

Panel captions: most lenient rater (r = 11); most harsh and least variable (r = 20; problematic pattern confirmed); most variable (r = 29).

Individual rater estimates may be diagnostic.

SLIDE 20

Continued Research

  • HRM presents a systematic way to simulate rater behavior
  • What ranges of variability and bias are typical? Good? Problematic?
  • Realistic simulations yielding predictable agreement rates, quadratic weighted kappa statistics, etc.?
  • What are the downstream impacts of rater problems on measurement accuracy? Equating? Engine training?
  • To what degree, and how, might modeling of raters (often unidentified or ignored) improve machine learning results in the training of automated scoring engines?
  • Under what conditions should different (esp. more granular) signal detection models be used within the HRM framework?

SLIDE 21

Quadratic Weighted Kappa

  • Penalizes non-adjacent disagreement more than unweighted kappa or linearly (|i − j|) weighted kappa
  • Widely used as a prediction accuracy metric in machine learning
  • Kappa statistics are an important supplement to rates of agreement (exact/adjacent) in operational rating

κ = 1 − ( Σ_{i,j} w_ij O_ij ) / ( Σ_{i,j} w_ij E_ij ), where w_ij = (i − j)² / (N − 1)²

O_ij = observed count in cell i,j; E_ij = expected count in cell i,j
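The statistic can be computed directly from two rating vectors; a sketch (here K denotes the number of score categories, written N in the slide's weight formula):

```python
import numpy as np

def quadratic_weighted_kappa(a, b, K):
    # O: observed K x K contingency table; E: expected table under
    # independence of the marginals; w: quadratic weights (i - j)^2 / (K-1)^2.
    a, b = np.asarray(a), np.asarray(b)
    O = np.zeros((K, K))
    np.add.at(O, (a, b), 1)
    E = np.outer(np.bincount(a, minlength=K),
                 np.bincount(b, minlength=K)) / len(a)
    idx = np.arange(K)
    w = (idx[:, None] - idx[None, :]) ** 2 / (K - 1) ** 2
    return 1 - (w * O).sum() / (w * E).sum()

print(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], K=4))  # 1.0
```

Perfect agreement yields κ = 1; systematically reversed ratings drive κ negative.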

SLIDE 22

HRM Rater Noise

  • How does HRM signal-detection accuracy impact reliability and agreement statistics for rated items?
  • Use the HRM to simulate realistic patterns of rater behavior
  • Example
  • For 10,000 examinees with normally distributed proficiencies
  • True item scores (ideal ratings) from a PCM/RSM: 10 items, 5 levels per item
  • Vary the rater variability parameter ψ_r, with rater bias φ_r = 0
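A simplified version of this experiment: vary ψ_r with φ_r = 0 and track exact agreement between two simulated raters. For brevity, ideal ratings are drawn uniformly here rather than from a PCM, and the sample is smaller than the slide's 10,000 examinees.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 5  # score categories 0..4

def rate(ideal, psi):
    # Discrete normal signal-detection kernel centered on the ideal rating,
    # with rater bias phi = 0 as on the slide.
    cats = np.arange(K)
    w = np.exp(-0.5 * ((cats - ideal) / psi) ** 2)
    return rng.choice(K, p=w / w.sum())

ideal = rng.integers(0, K, 2000)  # uniform stand-in for PCM ideal ratings
agreement = {}
for psi in (0.3, 0.6, 1.0):
    a = np.array([rate(x, psi) for x in ideal])
    b = np.array([rate(x, psi) for x in ideal])
    agreement[psi] = np.mean(a == b)
    print(f"psi = {psi}: exact agreement = {agreement[psi]:.2f}")
```

Exact agreement falls steadily as rater variability grows, which is the qualitative pattern the slide's experiment examines.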
SLIDE 23

Ideal ratings follow PCM

Ideal ratings (ψ_r = 0): [plot omitted]

SLIDE 24

Results

SLIDE 25

Continued Research

  • HRM presents a systematic way to simulate rater behavior
  • What ranges of variability and bias are typical? Good? Problematic?
  • Realistic simulations yielding predictable agreement rates, quadratic weighted kappa statistics, etc.?
  • What are the downstream impacts of rater problems on measurement accuracy? Equating? Engine training?
  • To what degree, and how, might modeling of raters (often unidentified or ignored) improve machine learning results in the training of automated scoring engines?
  • Under what conditions should different (esp. more granular) signal detection models be used within the HRM framework?

SLIDE 26

Summary

  • Constructed-response item formats remain important
  • Technology is making these formats more feasible
  • Modeling rater behavior is important
  • The HRM provides a useful framework for characterizing rater error patterns, from humans and/or machines
  • The HRM signal detection model layer is useful in simulation
  • Modeling raters may improve machine scoring solutions (TBD)