

SLIDE 1

Developing Automated Scoring for Large-scale Assessments of Three-dimensional Learning

Jay Thomas1, Ellen Holste2, Karen Draney3, Shruti Bathia3, and Charles W. Anderson2

1. ACT, Inc.

2. Michigan State University

3. UC Berkeley, BEAR Center

SLIDE 2

Based on the NRC report Developing Assessments for the Next Generation Science Standards (Pellegrino et al., 2014):

  • Assessment tasks need multiple components to get at all three dimensions (C 2-1)
  • Tasks must accurately locate students along a sequence of progressively more complex understanding (C 2-2)
  • Traditional selected-response items cannot assess the full breadth and depth of the NGSS
  • Technology can address some of the problems, particularly scalability and cost

SLIDE 3

Example of a Carbon TIME Item

SLIDE 4

Comparing FC (forced choice) vs. CR (constructed response) vs. Both

  • Compare spread of data
  • Adding CR (or CR only) increases the confidence that we have classified students correctly
  • Because explanation is a practice that we focus on in the learning progression (LP), CR is required to assess the construct fully


SLIDE 5

Recursive Feedback Loops for Item Development

Processes moving toward final interpretation (the ML-scoring and QWK steps are sketched in code below):

  1. Item development
  2. Students respond to items
  3. WEW (rubric) development
  4. Using the WEW (human scoring) to create a training set
  5. Creating machine learning (ML) models
  6. Using the ML model (computer scoring)
  7. Back-check coding (human)
  8. QWK (quadratic weighted kappa) check for reliability
  9. Psychometric analysis (IRT, WLE)
  10. Interpretation by the larger research group

Feedback loops indicate that a question, rubric, or coding potentially has a problem that needs to be addressed.
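
A minimal sketch of steps 4-8, assuming scikit-learn; the toy responses, scores, and the TF-IDF plus logistic regression model are illustrative assumptions, not the project's actual pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-ins for human-scored constructed responses (hypothetical).
responses = [
    "plants get their mass from the soil",
    "the mass comes from carbon dioxide in the air",
    "water from the roots makes the plant heavier",
    "carbon atoms from CO2 are built into glucose",
] * 10
human_scores = [1, 3, 1, 3] * 10

X_train, X_test, y_train, y_test = train_test_split(
    responses, human_scores, test_size=0.25, random_state=0
)

# "Creating ML models": fit a text classifier on the human-scored training set.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# "QWK check for reliability": quadratic weighted kappa between machine
# scores and held-out human scores; a low value flags an item, rubric,
# or coding problem. (With repeated toy data the kappa is not meaningful.)
machine_scores = model.predict(X_test)
print(cohen_kappa_score(y_test, machine_scores, weights="quadratic"))
```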

SLIDE 6

Consequences of using machine scoring

  • Item revision and improvement
  • Increase in the size of the usable data set, which increases the power of the statistics
  • Increased confidence in the reliability of scoring through back-checking samples and revising models
  • Reduced costs by needing fewer human coders
  • A model showing that the kinds of assessments envisioned by Pellegrino et al. (2014) for the NGSS can be reached at scale and at low cost

As of March 6, 2019:

School Year   Responses Scored
2015-16       175,265
2016-17       532,825
2017-18       693,086
2018-19       227,041
TOTAL         1,628,217

Cost savings and scalability (the arithmetic is reproduced below):

  • Labor hours needed to human score all responses @ 100 per hour: 16,282.17 hours
  • Labor cost per hour (undergraduate students, including misc. costs): $18 per hour
  • Cost to human score all responses: $293,079
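
These figures follow directly from the response counts; a few lines of Python reproduce the arithmetic.

```python
# Slide-6 cost arithmetic: total responses, human-scoring hours, and cost.
responses_scored = 175_265 + 532_825 + 693_086 + 227_041  # 1,628,217
rate_per_hour = 100   # responses one human coder scores per hour
wage = 18             # dollars per hour, incl. misc. costs

hours = responses_scored / rate_per_hour   # 16,282.17 hours
cost = hours * wage                        # $293,079.06
print(f"{hours:,.2f} hours, ${cost:,.2f}")
```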

SLIDE 7

Types of validity evidence

As taken from the Standards for Educational and Psychological Testing (2014 ed.):

  • Evidence based on test content
  • Evidence based on response processes
  • Evidence based on internal structure
  • Evidence based on relations to other variables
      • Convergent and discriminant evidence
      • Test-criterion evidence
  • Evidence for validity and consequences of testing
SLIDE 8

Comparison of interview and IRT analysis results

  • Overall Spearman rank correlation = 0.81, p < 0.01, n = 49 (a sketch of this check appears below)
  • Comparison of scoring for one written versus one interview item
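
A minimal sketch of that rank-correlation check, assuming SciPy; the arrays are synthetic placeholders, not the study's interview or IRT data.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical stand-ins: IRT proficiency estimates (e.g., WLEs) and
# interview-based LP levels for the same n = 49 students.
rng = np.random.default_rng(0)
irt_proficiency = rng.normal(size=49)
interview_level = np.round(irt_proficiency + rng.normal(scale=0.5, size=49))

rho, p = spearmanr(interview_level, irt_proficiency)
print(f"Spearman rho = {rho:.2f}, p = {p:.3g}")  # slide reports rho = 0.81, p < 0.01
```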

SLIDE 9

Evidence based on internal structure

  • Analysis method: item response models (specifically, unidimensional and multidimensional partial credit models; the standard form is shown below)
  • These models place item and step difficulties and person proficiencies on one scale
  • They provide comparisons of step difficulties within items
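
For reference, the unidimensional partial credit model (Masters, 1982) gives the probability that a student with proficiency $\theta$ responds in category $k$ of item $i$ with step difficulties $\delta_{ij}$:

$$
P(X_i = k \mid \theta) = \frac{\exp \sum_{j=0}^{k} (\theta - \delta_{ij})}{\sum_{m=0}^{K_i} \exp \sum_{j=0}^{m} (\theta - \delta_{ij})},
\qquad k = 0, 1, \ldots, K_i,
$$

with the convention $\sum_{j=0}^{0} (\theta - \delta_{ij}) \equiv 0$. The multidimensional version replaces $\theta$ with the proficiency on the dimension the item measures.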
SLIDE 10

Step difficulties for each item (2015-16 data)

SLIDE 11

Classifying Students into LP levels

Comparing FC to EX + FC

SLIDE 12

Classifying Students into LP levels

Comparing EX to EX + FC

SLIDE 13

Classifying Classroom Data

95% confidence intervals: average learning gains for teachers with at least 15 students who had both overall pretests and overall posttests (macroscopic explanations). A sketch of the interval computation follows.
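
A hedged sketch of how such intervals can be computed, assuming pandas and SciPy; the DataFrame, column names, and toy data are illustrative, not the Carbon TIME dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy placeholder data: matched pretest/posttest scores per teacher.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "teacher": ["A"] * 20 + ["B"] * 20,
    "pretest": rng.normal(0.0, 1.0, 40),
    "posttest": rng.normal(0.5, 1.0, 40),
})
df["gain"] = df["posttest"] - df["pretest"]

for teacher, gains in df.groupby("teacher")["gain"]:
    if len(gains) < 15:      # the slide's inclusion rule
        continue
    # 95% CI half-width: t critical value times the standard error.
    half_width = stats.t.ppf(0.975, len(gains) - 1) * gains.sem()
    print(f"Teacher {teacher}: mean gain {gains.mean():.2f} +/- {half_width:.2f}")
```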

SLIDE 14

Questions?

Contact info:

  • Jay Thomas: jay.thomas@act.org
  • Karen Draney: kdraney@berkeley.edu
  • Andy Anderson: andya@msu.edu
  • Ellen Holste: holste@msu.edu
  • Shruti Bathia: shruti_bathia@berkeley.edu