SLIDE 1

Humans and Machines: Modeling the Stochastic Behavior of Raters in Educational Assessment

Richard J. Patz
BEAR Seminar, UC Berkeley Graduate School of Education
February 13, 2018

SLIDE 2

Outline of Topics

  • Natural human responses in educational assessment
  • Technology in education, assessment and scoring
  • Computational methods for automated scoring (NLP, LSA, ML)

  • Rating information in statistical and psychometric analysis: Challenges
  • Unreliability and bias
  • Combining information from multiple ratings
  • Hierarchical rater model (HRM)
  • Applications
  • Comparing machine to humans
  • Simulating human rating errors to further related research
SLIDE 3

Why natural, constructed response formats in assessment?

  • Learning involves constructing knowledge and expressing it through language (written and/or oral)
  • Assessments should consist of ‘authentic’ tasks, i.e., of a type that students encounter during instruction
  • Artificially contrived item formats (e.g., multiple-choice) advantage skills unrelated to the intended construct
  • Some constructs (e.g., essay writing) simply can’t be measured through selected-response formats

SLIDE 4

Disadvantages of Constructed-Response Formats

  • Time consuming for examinees (fewer items per unit time)
  • Require expensive human ratings (typically)
  • Create delay in providing scores, reports
  • Human rating is error-prone
  • Consistency across rating events difficult to maintain
  • Inconsistency impairs comparability
  • Combining multiple ratings creates modeling, scoring problems
SLIDE 5

Practical Balancing of Priorities

  • Mix constructed-response formats with selected-response formats, to realize benefits of each
  • Leverage technology in the scoring of CR items
  • Rule-based scoring (exhaustively enumerated/constrained)
  • Natural language processing and subsequent automated rating for written (sometimes spoken) responses
  • Made more practical with computer-based test delivery
SLIDE 6

Technology For Automated Scoring

  • Ten years ago there were relatively few providers
  • Expensive, proprietary algorithms
  • Specialized expertise (NLP, LSA, AI)

  • Laborious, ‘hand-crafted’ engine training
  • Today, solutions are far more widely available
  • Students fit automated scoring models in CS and statistics classes
  • Open-source libraries abound
  • Machine learning and neural networks: accessible, powerful, and up to the job
  • Validity and reliability challenges remain
  • Impact of algorithms on instruction, e.g., in writing? Also threat of gaming strategies
  • Managing algorithm improvements, examinee adaptations, over time
  • Quality human scores needed to train the machines (supervised learning)
  • Biases or other problems in human ratings ‘learned’ by algorithms
  • Combining scores from machines and humans
SLIDE 7

Machine Learning for Automated Essay Scoring

Example characteristics:

  • Words processed in relation to a corpus for frequency, etc.
  • N-grams (word pairs, triplets, etc.)
  • Transformations (non-linear, sinusoidal) and dimensionality reduction
  • Iterations improving along a gradient, with memory of previous states
  • Data split into training and validation sets; prediction accuracy maximized on validation; little else “interpretable” about parameters
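A minimal sketch of this pipeline in Python, using only NumPy: bag-of-words counts as features, a ridge-penalized linear model, and a train/validation split. The essays, scores, and penalty value are invented for illustration; a real engine would use far richer features and far more data.

```python
import numpy as np

# Toy essays and holistic scores (invented for illustration).
essays = [
    "the cat sat",
    "the cat sat on the mat",
    "the quick brown fox jumps over the lazy dog",
    "dogs and cats are animals",
    "the fox and the dog ran over the mat",
    "a cat and a dog sat on a mat together",
]
scores = np.array([1.0, 2.0, 4.0, 3.0, 3.5, 3.0])

# Bag-of-words counts over the pooled vocabulary (the word-frequency
# features mentioned above).
vocab = sorted({w for e in essays for w in e.split()})
X = np.array([[e.split().count(w) for w in vocab] for e in essays], float)

# Train/validation split: fit a ridge-penalized linear model on the
# training part, then predict on the held-out part.
train, valid = slice(0, 4), slice(4, 6)
lam = 0.1  # ridge penalty keeps the normal equations well-conditioned
Xt = X[train]
w = np.linalg.solve(Xt.T @ Xt + lam * np.eye(len(vocab)),
                    Xt.T @ scores[train])
pred = X[valid] @ w
print("validation predictions:", np.round(pred, 2))
```

In production systems the "little else interpretable" point applies: the fitted weights are tuned for predictive accuracy, not substantive meaning.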

Example architecture: Taghipour & Ng (2016) [diagram]

SLIDE 8

Focus of Research

  • Situating the rating process within the overall measurement and statistical context

  • The hierarchical rater model (HRM)
  • Accounting for multiple ratings ‘correctly’
  • Contrast with alternative approaches, e.g., Facets
  • Simultaneous analysis of human and machine ratings
  • Example from large-scale writing assessment
  • Leveraging models of human rating behavior for better simulation and examination of impacts on inferences

SLIDE 9

Hierarchical Structure of Rated Item Response Data

  • If all levels follow normal distributions, Generalizability Theory applies
  • Estimates at any level weight the data mean and prior mean using the ‘generalizability coefficient’
  • If ‘ideal ratings’ follow an IRT model and observed ratings follow a signal detection model, the HRM applies

Patz, Junker & Johnson, 2002

θ_i ~ i.i.d. N(µ, σ²), i = 1, …, N
ξ_ij ~ an IRT model (e.g., PCM), j = 1, …, J, for each i
X_ijr ~ a signal detection model, r = 1, …, R, for each i, j

HRM levels
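The three levels can be simulated directly. The sketch below draws proficiencies, then ideal ratings from a PCM, then observed ratings from the discrete normal signal-detection kernel used in Patz, Junker & Johnson (2002); all parameter values here are illustrative, not from the deck.

```python
import numpy as np

rng = np.random.default_rng(0)
N, J, R, K = 200, 3, 2, 5        # examinees, items, raters, categories 0..K-1

def pcm_probs(theta, beta, gamma):
    # Partial credit model category probabilities for one item;
    # gamma holds the K-1 step parameters.
    logits = np.concatenate(([0.0], np.cumsum(theta - beta - gamma)))
    num = np.exp(logits - logits.max())
    return num / num.sum()

beta = np.array([0.0, 0.5, -0.5])
gamma = np.zeros((J, K - 1))

theta = rng.normal(0, 1, N)                             # level 1: proficiency
xi = np.array([[rng.choice(K, p=pcm_probs(theta[i], beta[j], gamma[j]))
                for j in range(J)] for i in range(N)])  # level 2: ideal ratings

phi, psi = np.array([0.0, -0.2]), np.array([0.4, 0.6])  # rater bias/variability

def rating_probs(ideal, r):
    # Level 3: discrete normal kernel centered at ideal + bias.
    cats = np.arange(K)
    w = np.exp(-0.5 * ((cats - ideal - phi[r]) / psi[r]) ** 2)
    return w / w.sum()

X = np.array([[[rng.choice(K, p=rating_probs(xi[i, j], r))
                for r in range(R)] for j in range(J)] for i in range(N)])
print("observed ratings shape:", X.shape)
```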

SLIDE 10

Hierarchical Rater Model

  • Raters detect the true item score (i.e., ‘ideal rating’) with a degree of bias and imprecision

Example (from the slide): a rater with bias φ_r = −0.2 and variability ψ_r = 0.5, rating a response whose ideal rating is ξ = 3, assigns scores 2, 3, 4 with probabilities p_32r = .08, p_33r = .64, p_34r = .27.

SLIDE 11

Hierarchical Rater Model (cont.)

  • Examinees respond to items according to a polytomous item response theory model (here PCM; could be GPCM, GRM, or others):

P[ξ_ij = ξ | θ_i, β_j, γ_jξ] = exp{ Σ_{k=1..ξ} (θ_i − β_j − γ_jk) } / Σ_{h=0..K−1} exp{ Σ_{k=1..h} (θ_i − β_j − γ_jk) }

θ_i ~ i.i.d. N(µ, σ²), i = 1, …, N
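A small sketch of the category probabilities under this formula. The empty sum for category 0 makes that numerator 1; the θ, β, γ values below are arbitrary.

```python
import numpy as np

def pcm_probs(theta, beta, gamma):
    # Numerator for category xi is exp(sum_{k=1}^{xi} (theta - beta - gamma_k)),
    # with an empty sum (= 1) for xi = 0; the denominator sums over h = 0..K-1,
    # exactly as in the displayed formula.
    logits = np.concatenate(([0.0], np.cumsum(theta - beta - gamma)))
    num = np.exp(logits - logits.max())  # subtract max for numerical stability
    return num / num.sum()

p = pcm_probs(theta=0.5, beta=0.0, gamma=np.zeros(4))  # K = 5 categories
print(np.round(p, 3))
```

Higher θ pushes probability mass toward higher categories, as a quick check of monotonicity confirms.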

SLIDE 12

HRM Estimation

  • Most straightforward to estimate using Markov chain Monte Carlo
  • Uninformative priors specified in Patz et al., 2002; Casabianca et al., 2016
  • WinBUGS/JAGS (may be called from within R)
  • HRM has been estimated using maximum likelihood and posterior modal estimation (Donoghue and Hombo, 2001; DeCarlo et al., 2011)
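A toy illustration of the MCMC idea, assuming known item parameters: random-walk Metropolis draws for a single examinee's θ given ideal ratings under the PCM, with a N(0, 1) prior. This sketches only the sampling principle, not the full HRM sampler of Patz et al.

```python
import numpy as np

rng = np.random.default_rng(1)

def pcm_loglik(theta, xi, beta, gamma):
    # Log-likelihood of the ideal ratings xi (one per item) under the PCM.
    ll = 0.0
    for j, k in enumerate(xi):
        logits = np.concatenate(([0.0], np.cumsum(theta - beta[j] - gamma[j])))
        ll += logits[k] - np.logaddexp.reduce(logits)
    return ll

beta, gamma = np.zeros(5), np.zeros((5, 4))   # 5 items, categories 0..4
xi = np.array([3, 4, 3, 2, 4])                # this examinee's ideal ratings

# Random-walk Metropolis for theta under a N(0, 1) prior.
theta, draws = 0.0, []
for _ in range(5000):
    prop = theta + rng.normal(0, 0.5)
    log_acc = (pcm_loglik(prop, xi, beta, gamma) - 0.5 * prop ** 2) \
            - (pcm_loglik(theta, xi, beta, gamma) - 0.5 * theta ** 2)
    if np.log(rng.uniform()) < log_acc:
        theta = prop
    draws.append(theta)
posterior = np.array(draws[1000:])
print("posterior mean of theta:", round(posterior.mean(), 2))
```

In practice one would sample all parameters jointly (as WinBUGS/JAGS do) rather than conditioning on known item parameters.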

SLIDE 13

Facets Alternative

  • Facets (Linacre) models can capture rater effects:

λ_rjk is the effect rater r has on category k of item j. Note: rater effects λ may be constant across all levels of an item, across all items at a given level, or across all levels of all items. Every rater–item combination has a unique ICC. Facets models have proven highly useful in the detection and mitigation of rater effects in operational scoring (e.g., Wang & Wilson, 2005; Myford & Wolfe, 2004).

P[ξ_ij = ξ | θ_i, β_j, γ_jξ, λ_rjk] = exp{ Σ_{k=1..ξ} (θ_i − β_j − γ_jk − λ_rjk) } / Σ_{h=0..K−1} exp{ Σ_{k=1..h} (θ_i − β_j − γ_jk − λ_rjk) }

SLIDE 14

Dependence structure of Facets models

  • Ratings are directly related to proficiency
  • Arbitrarily precise θ estimation is achievable by increasing the number of ratings R
  • Alternatives (other than HRM) include:
  • Rater Bundle Model (Wilson & Hoskens, 2001)
  • Design-effect-like correction (Bock, Brennan, Muraki, 1999)
SLIDE 15

Applications & Extensions

  • Applications and extensions of the HRM:
  • Detecting rater effects and “modality” effects in a Florida assessment program (Patz, Junker, Johnson, 2002)
  • 360-degree feedback data (Barr & Raju, 2003)
  • Rater covariates, applied to the Golden State Exam (image vs. paper study) (Mariano & Junker, 2007)
  • Latent classes for raters, applied to a large-scale language assessment (DeCarlo et al., 2011)
  • Machine (i.e., automated) and human scoring (Casabianca et al., 2016)

SLIDE 16

HRM with rater covariates

  • Introduce a design matrix 𝛷 associating individual raters with their covariates
  • Bias and variability of ratings vary according to rater characteristics

Bias and variability: each modeled as a function of the rater covariates via 𝛷 [equations omitted]
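The precise specification is in Mariano & Junker (2007); the sketch below conveys only the general idea, assuming a linear structure for bias and a log-linear one for variability. The design matrix, covariates, and coefficients are all invented for illustration.

```python
import numpy as np

# Hypothetical design matrix Phi: one row per rater; columns are an
# intercept, years of experience, and a machine indicator (invented
# covariates, not from the slide).
Phi = np.array([[1.0, 2.0, 0.0],
                [1.0, 10.0, 0.0],
                [1.0, 0.0, 1.0]])
a = np.array([0.10, -0.02, -0.15])   # coefficients driving rater bias
b = np.array([-0.50, -0.05, -0.30])  # coefficients driving log-variability

phi = Phi @ a           # bias: assumed linear in the covariates
psi = np.exp(Phi @ b)   # variability: log link keeps it positive
print("bias:", np.round(phi, 3))
print("variability:", np.round(psi, 3))
```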

SLIDE 17

Application with Human and Machine Ratings

  • Statewide writing assessment program (provided by CTB)
  • 5 dimensions of writing (“items”), each scored on a 1–6 rubric
  • 487 examinees
  • 36 raters: 18 male, 17 female, 1 machine
  • Each paper scored by four raters (1 machine, 3 humans)
  • 9740 ratings in total
SLIDE 18

Results by “gender”

  • Male and female raters show very similar (and on average negligible) bias
  • The machine is less variable (especially than male raters) and somewhat more severe (not significant)
  • Individual rater bias and severity estimates are informative (next slide)
SLIDE 19

Panel captions: most lenient rater (r = 11); most harsh and least variable (r = 20; problematic pattern confirmed); most variable (r = 29).

Individual rater estimates may be diagnostic.

SLIDE 20

Continued Research

  • HRM presents a systematic way to simulate rater behavior
  • What ranges of variability and bias are typical? Good? Problematic?
  • Realistic simulations yielding predictable agreement rates, quadratic weighted kappa statistics, etc.?
  • What are the downstream impacts of rater problems on measurement accuracy? Equating? Engine training?
  • To what degree, and how, might modeling of raters (often unidentified or ignored) improve machine learning results in the training of automated scoring engines?
  • Under what conditions should different (esp. more granular) signal detection models be used within the HRM framework?

SLIDE 21

Quadratic Weighted Kappa

  • Penalizes non-adjacent disagreement more than unweighted kappa or linearly (|i − j|) weighted kappa
  • Widely used as a prediction accuracy metric in machine learning
  • Kappa statistics are an important supplement to rates of agreement (exact/adjacent) in operational rating

κ = 1 − ( Σ_{i,j} w_ij O_ij ) / ( Σ_{i,j} w_ij E_ij ), where w_ij = (i − j)² / (N − 1)²

O_ij = observed count in cell i,j; E_ij = expected count in cell i,j
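The statistic can be computed directly from two rating vectors; a sketch (here K denotes the number of score categories, written N in the slide's weight formula):

```python
import numpy as np

def quadratic_weighted_kappa(a, b, K):
    # O: observed K x K contingency table; E: expected table under
    # independence of the marginals; w: quadratic weights (i - j)^2 / (K-1)^2.
    a, b = np.asarray(a), np.asarray(b)
    O = np.zeros((K, K))
    np.add.at(O, (a, b), 1)
    E = np.outer(np.bincount(a, minlength=K),
                 np.bincount(b, minlength=K)) / len(a)
    idx = np.arange(K)
    w = (idx[:, None] - idx[None, :]) ** 2 / (K - 1) ** 2
    return 1 - (w * O).sum() / (w * E).sum()

print(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], K=4))  # 1.0
```

Perfect agreement yields κ = 1; systematically reversed ratings drive κ negative.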

SLIDE 22

HRM Rater Noise

  • How does HRM signal-detection accuracy impact reliability and agreement statistics for rated items?
  • Use the HRM to simulate realistic patterns of rater behavior
  • Example
  • For 10,000 examinees with normally distributed proficiencies
  • True item scores (ideal ratings) from a PCM/RSM: 10 items, 5 levels per item
  • Vary the rater variability parameter ψ_r, with rater bias φ_r = 0
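A simplified version of this experiment: vary ψ_r with φ_r = 0 and track exact agreement between two simulated raters. For brevity, ideal ratings are drawn uniformly here rather than from a PCM, and the sample is smaller than the slide's 10,000 examinees.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 5  # score categories 0..4

def rate(ideal, psi):
    # Discrete normal signal-detection kernel centered on the ideal rating,
    # with rater bias phi = 0 as on the slide.
    cats = np.arange(K)
    w = np.exp(-0.5 * ((cats - ideal) / psi) ** 2)
    return rng.choice(K, p=w / w.sum())

ideal = rng.integers(0, K, 2000)  # uniform stand-in for PCM ideal ratings
agreement = {}
for psi in (0.3, 0.6, 1.0):
    a = np.array([rate(x, psi) for x in ideal])
    b = np.array([rate(x, psi) for x in ideal])
    agreement[psi] = np.mean(a == b)
    print(f"psi = {psi}: exact agreement = {agreement[psi]:.2f}")
```

Exact agreement falls steadily as rater variability grows, which is the qualitative pattern the slide's experiment examines.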
SLIDE 23

Ideal ratings follow PCM

Ideal ratings (ψ_r = 0): [plot omitted]

SLIDE 24

Results

SLIDE 25

Continued Research

  • HRM presents a systematic way to simulate rater behavior
  • What ranges of variability and bias are typical? Good? Problematic?
  • Realistic simulations yielding predictable agreement rates, quadratic weighted kappa statistics, etc.?
  • What are the downstream impacts of rater problems on measurement accuracy? Equating? Engine training?
  • To what degree, and how, might modeling of raters (often unidentified or ignored) improve machine learning results in the training of automated scoring engines?
  • Under what conditions should different (esp. more granular) signal detection models be used within the HRM framework?

SLIDE 26

Summary

  • Constructed-response item formats remain important
  • Technology is making these formats more feasible
  • Modeling rater behavior is important
  • The HRM provides a useful framework for characterizing rater error patterns, from humans and/or machines
  • The HRM signal detection model layer is useful in simulation
  • Modeling raters may improve machine scoring solutions (TBD)