Humans and Machines: Modeling the Stochastic Behavior of Raters in Educational Assessment
Richard J. Patz BEAR Seminar UC Berkeley Graduate School of Education February 13, 2018
Outline of Topics
Natural human responses in educational
, LSA, ML)
through language (written and/or oral)
type that students encounter during instruction
advantage skills unrelated to the intended construct
measured through selected-response formats
formats, to realize benefits of each
constrained)
rating for written (sometimes spoken) responses
, LSA, AI)
Example characteristics:
Taghipour & Ng (2016) Example architecture:
context
examination of impacts on inferences
Patz, Junker & Johnson, 2002
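\[
\left.
\begin{aligned}
\theta_i &\sim \text{i.i.d. } N(\mu, \sigma^2), && i = 1,\dots,N \\
\xi_{ij} &\sim \text{an IRT model (e.g., PCM)}, && j = 1,\dots,J, \text{ for each } i \\
X_{ijr} &\sim \text{a signal detection model}, && r = 1,\dots,R, \text{ for each } i, j
\end{aligned}
\right\} \text{HRM levels}
\]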
and imprecision
Example: $\phi_r = -.2$, $\psi_r = .5$, $\xi = 3$
$p_{32r} = .08$, $p_{33r} = .64$, $p_{34r} = .27$
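The three rating probabilities quoted above can be checked numerically. The sketch below is only an illustration under stated assumptions: it treats the probability of observing rating k as the area under a normal density over the interval [k − 0.5, k + 0.5], with spread 0.5 and a center 0.2 above the ideal rating of 3. That form and the sign convention relating the center to the rater bias parameter are assumptions, since the slide text does not spell them out. The helper names `normal_cdf` and `rating_area` are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def rating_area(k, center, spread):
    """Area under a N(center, spread^2) density over [k - 0.5, k + 0.5].

    Assumption: this interval-area form is used here only because it
    reproduces the probabilities quoted on the slide.
    """
    lower = (k - 0.5 - center) / spread
    upper = (k + 0.5 - center) / spread
    return normal_cdf(upper) - normal_cdf(lower)


# Assumed values: ideal rating 3, shift of 0.2, spread 0.5.
center = 3.2
spread = 0.5
probs = {k: rating_area(k, center, spread) for k in (2, 3, 4)}

for k, p in probs.items():
    print(f"P(rating = {k} | ideal = 3) is about {p:.2f}")
```

Under these assumptions the rater most often assigns the ideal category, with the remaining mass mostly on the neighboring categories; a larger spread would flatten the distribution.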
theory model (here PCM; could be GPCM, GRM, others):
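\[
P\left[\xi_{ij} = \xi \mid \theta_i, \beta_j, \gamma_{j\xi}\right]
= \frac{\exp\left\{\sum_{k=1}^{\xi} \left(\theta_i - \beta_j - \gamma_{jk}\right)\right\}}
{\sum_{h=0}^{K-1} \exp\left\{\sum_{k=1}^{h} \left(\theta_i - \beta_j - \gamma_{jk}\right)\right\}}
\]
\[
\theta_i \sim \text{i.i.d. } N(\mu, \sigma^2), \quad i = 1,\dots,N,
\]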
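As a minimal sketch, the PCM category probabilities for the ideal rating can be computed directly from the formula: each category h in 0, …, K−1 gets a weight equal to the exponential of the cumulative sum of (θ − β − γ_k) over its first h steps, and the weights are normalized. The function name `pcm_probs` and the example parameter values are hypothetical, chosen only for illustration.

```python
import math


def pcm_probs(theta, beta, gammas):
    """Partial credit model probabilities for categories 0..K-1.

    theta  : examinee proficiency
    beta   : item difficulty
    gammas : step parameters gamma_1..gamma_{K-1} for the item
    """
    # Cumulative sums of (theta - beta - gamma_k); the empty sum for
    # category 0 is 0.
    cumulative = [0.0]
    for gamma in gammas:
        cumulative.append(cumulative[-1] + (theta - beta - gamma))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(cumulative)
    weights = [math.exp(c - top) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative values only (not taken from the presentation).
low = pcm_probs(theta=-1.0, beta=0.0, gammas=[-0.5, 0.0, 0.5])
high = pcm_probs(theta=1.0, beta=0.0, gammas=[-0.5, 0.0, 0.5])
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

Raising θ shifts probability mass toward the higher categories, which is the behavior the model is meant to capture.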
Monte Carlo
Casablanca et al, 2016
posterior modal estimation (Donoghue and Hombo, 2001; DeCarlo et al, 2011)
\[
P\left[\xi_{ij} = \xi \mid \theta_i, \beta_j, \gamma_{j\xi}, \lambda_{rjk}\right]
= \frac{\exp\left\{\sum_{k=1}^{\xi} \left(\theta_i - \beta_j - \gamma_{jk} - \lambda_{rjk}\right)\right\}}
{\sum_{h=0}^{K-1} \exp\left\{\sum_{k=1}^{h} \left(\theta_i - \beta_j - \gamma_{jk} - \lambda_{rjk}\right)\right\}}
\]
where $\lambda_{rjk}$ is the effect rater r has on category k of item j.
Note: rater effects $\lambda$ may be constant for all levels of an item, for all items at a given level, or for all levels of all items.
Every rater-item combination has a unique ICC.
Facets models have proven highly useful in the detection and mitigation of rater effects in operational scoring (e.g., Wang & Wilson, 2005; Myford & Wolfe, 2004).
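To illustrate how a rater effect enters the facets formula, the sketch below computes category probabilities with a rater-specific term subtracted at every step. The function name `facets_probs`, the helper `expected_score`, and all numeric values are hypothetical and serve only as an example; a constant positive rater effect is used here to represent a severe rater.

```python
import math


def facets_probs(theta, beta, gammas, rater_effects):
    """Category probabilities 0..K-1 with a rater effect on each step.

    gammas        : item step parameters gamma_1..gamma_{K-1}
    rater_effects : rater-by-item step effects, same length as gammas
    """
    cumulative = [0.0]
    for gamma, effect in zip(gammas, rater_effects):
        cumulative.append(cumulative[-1] + (theta - beta - gamma - effect))
    top = max(cumulative)  # stabilize the exponentials
    weights = [math.exp(c - top) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(probs):
    """Mean category under the given probabilities."""
    return sum(k * p for k, p in enumerate(probs))


# Illustrative values only: one neutral rater and one severe rater.
neutral = facets_probs(0.5, 0.0, [-0.5, 0.5], [0.0, 0.0])
severe = facets_probs(0.5, 0.0, [-0.5, 0.5], [1.0, 1.0])
print(round(expected_score(neutral), 3), round(expected_score(severe), 3))
```

With all rater effects set to zero the probabilities reduce to those of the PCM, and a positive effect lowers the expected rating for the same examinee and item.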
assessment program (Patz, Junker, Johnson, 2002)
paper study) (Mariano & Junker, 2007)
assessment (DeCarlo et al, 2011)
al, 2016)
their covariates
characteristics
[Figure: estimated rater bias and rater variability. Annotations: most lenient, r = 11; most harsh and least variable, r = 20 (problematic pattern confirmed); most variable, r = 29.]
Individual rater estimate may be diagnostic
quadratic weighted kappa statistics, etc.?
accuracy? Equating? Engine training?
automated scoring engines
detection models be used within the HRM framework?
kappa or linearly (i-j) weighted kappa
learning
agreement (exact/adjacent) in operational rating
\[
\kappa = 1 - \frac{\sum_{i,j} w_{i,j}\, O_{i,j}}{\sum_{i,j} w_{i,j}\, E_{i,j}},
\qquad \text{where} \quad w_{i,j} = \frac{(i - j)^2}{(N - 1)^2},
\]
$O_{i,j}$ = observed count in cell i,j, and $E_{i,j}$ = expected count in cell i,j.
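The quadratic weighted kappa defined above can be sketched in a few lines. This is an illustrative implementation, not the scoring code of any particular program: the function name `quadratic_weighted_kappa` is hypothetical, ratings are assumed to be integers from 0 to N − 1 (with N the number of rating categories), and expected counts are taken as the product of the two raters' marginal counts divided by the number of responses.

```python
def quadratic_weighted_kappa(ratings_a, ratings_b, n_categories):
    """Quadratic weighted kappa for two equal-length lists of ratings.

    Assumes integer ratings in 0..n_categories-1 and that the ratings
    are not all in a single category (otherwise the denominator is 0).
    """
    n = len(ratings_a)
    # Observed counts O[i][j]: rater A gave i, rater B gave j.
    observed = [[0] * n_categories for _ in range(n_categories)]
    for a, b in zip(ratings_a, ratings_b):
        observed[a][b] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(n_categories))
                  for j in range(n_categories)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n_categories):
        for j in range(n_categories):
            weight = (i - j) ** 2 / (n_categories - 1) ** 2
            expected = row_totals[i] * col_totals[j] / n
            numerator += weight * observed[i][j]
            denominator += weight * expected
    return 1.0 - numerator / denominator


# Small illustrative data set: four responses, three categories.
human = [0, 1, 2, 2]
machine = [0, 1, 2, 1]
kappa = quadratic_weighted_kappa(human, machine, n_categories=3)
print(round(kappa, 3))
```

Because the weights grow with the squared distance between ratings, a disagreement of two categories is penalized four times as heavily as an adjacent disagreement.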
statistics for rated items?
$\psi_r = 0$
Ideal Ratings:
patterns, from humans and/or machines
(TBD)