Efficiency of Scoring Innovative Items in Educational Assessment
Shudong Wang, NWEA
Paper presented at the NCSA National Conference on Student Assessment, June 24-26, 2019, Orlando, Florida

I. Introduction
▪ Choosing Item Format/Type in Assessment
✓ Selected-response/Objective scoring
✓ Constructed-response/Objective scoring
✓ Constructed-response/Subjective scoring
▪ Computer Use in Education and Technology-enhanced Item (TEI) Types (Zenisky & Sireci, 2002; Bennett, 1993)
✓ Selection/identification (drag-and-drop, hot-spot)
✓ Reordering/rearrangement (concept-mapping, create-a-tree)
✓ Completion (graphical modeling, mathematical expressions)
✓ Construction (generating examples, formulating hypotheses, essay/short answer, passage-editing)
✓ Presentation (problem-solving vignettes, role play)
[Example TEI screenshots: Multiple Choice, Graphic Gap Match]
▪ Advantages and Disadvantages of Multiple-choice (MC) Items and TEIs

MC:
✓ Advantages: efficient administration, automated scoring, broad content coverage, and high reliability
✓ Disadvantages: difficult to write MC items that evoke complex cognitive processes

TEI:
✓ Advantages: improved construct representation - facilitates more authentic and direct measurement of knowledge, skills, and abilities (KSA) than the MC format allows - higher fidelity
✓ Disadvantages: a source of construct-irrelevant variance, such as computer literacy
▪ Five Dimensions of TEI
✓ Item format
✓ Response action
✓ Media inclusion
✓ Level of interactivity
✓ Scoring method
▪ Relationship between Score (D vs. P) and Item Type (MC vs. TEI)

[Diagram: item types (MC, TEI) crossed with score types Dichotomous (D) and Polytomous (P), yielding the combinations D_D, D_P, and P_P]

There are three commonly used scoring methods for TEI (N is the number of components):
1. N Method
2. N/2 Method
3. All or Nothing Method (AONM)
AONM and related dichotomization rules (4 components):
D1: 0 = (0); 1 = (1, 2, 3, 4)
D2: 0 = (0, 1); 1 = (2, 3, 4)
D3: 0 = (0, 1, 2); 1 = (3, 4)
D4: 0 = (0, 1, 2, 3); 1 = (4)

Table 1. Examples of Different Scoring Methods (Total Components N = 4, Total Categories = 5; D scores: 1; P scores: 1-4)

Response | D1 | D2 | D3 | D4 | N | N/2
1        | 1  | 0  | 0  | 0  | 1 | 0
2        | 1  | 1  | 0  | 0  | 2 | 1
3        | 1  | 1  | 1  | 0  | 3 | 1
4        | 1  | 1  | 1  | 1  | 4 | 2
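The scoring methods in Table 1 can be sketched in a few lines of Python. This is an illustrative sketch, not the author's code; it assumes the N/2 score rounds down, which matches the Response = 3 row of the table, and the function name is hypothetical.

```python
def score_item(n_correct, n_components=4):
    """Table 1-style scores for an item whose components are each
    marked right/wrong (n_correct of n_components are right).

    Returns the polytomous N and N/2 scores and the dichotomous
    scores D1..DN, where Dc = 1 iff at least c components are
    correct (DN is the All or Nothing Method, AONM)."""
    n_score = n_correct                  # N method: one point per component
    n_half_score = n_correct // 2        # N/2 method (assumed to round down)
    dichotomous = {f"D{c}": int(n_correct >= c)
                   for c in range(1, n_components + 1)}
    return n_score, n_half_score, dichotomous

# Reproduces the Response = 3 row of Table 1 (N = 4 components):
n, n_half, d = score_item(3)   # n = 3, n_half = 1, D1-D3 = 1, D4 = 0
```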
▪ Review of Research on Scoring Methods for TEI Types

Table 2: Types of Research

Type | Research | Response Time Involved | Item level | Test level | Results (Efficiency*)
1 | Relationship between Dichotomous (D) and Polytomous (P) | No | Yes | Yes | P is better than D
2 | Relationship between Dichotomous (D) and Polytomous (P) | Yes | Yes | Yes | D is better than P
3 | Partial Credit Scoring Method | Yes/No | Yes | Yes | Optimal is better than both N and N/2 methods
4 | Relationship between Dichotomous-D (D-D) and Dichotomous-P (D-P) | No | Yes | Yes | ?

*: Efficiency is defined as the mean weighted item information divided by the average time spent on an item within an item type (Wan & Henly, 2012).
1 & 2: Ripkey & Case, 1996; Jiao et al., 2012; Bauer et al., 2011; Ben-Simon et al., 1997; Wan & Henly, 2012.
3: Muckle, Becker, & Wu, 2011; Becker & Soni, 2013; Lorié, 2014; Clyne, 2015; Tao, 2018; Tao & Mix, 2017.
4: Current research
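The efficiency index in the footnote is a simple ratio of information to time. The sketch below illustrates it; the function name and the information/timing values are made up for illustration, not taken from any of the cited studies.

```python
def item_type_efficiency(mean_item_information, mean_seconds_per_item):
    """Wan & Henly (2012)-style efficiency: mean weighted item
    information per unit of response time within an item type."""
    return mean_item_information / mean_seconds_per_item

# Hypothetical comparison: an MC item type averaging 0.18 information
# units in 45 s vs. a TEI type averaging 0.35 units in 120 s.
mc_eff = item_type_efficiency(0.18, 45.0)     # 0.004 information/s
tei_eff = item_type_efficiency(0.35, 120.0)   # lower, despite more information
```

The point the ratio captures: a TEI can yield more information per item yet still be less efficient once its longer response time is counted.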
Purpose of This Study:
To investigate the efficiency of scoring methods for TEIs in educational assessments
II. Method

1. The Monte Carlo technique is an appropriate choice; both descriptive methods and inferential procedures are used in this study.
2. Independent Variable:
Scoring method (MC, CR3, 1CR4, 2CR4, 1CR5, 2CR5, 3CR5) in Table 3
3. Dependent Variables:
p-Value, point-biserial, KR20 reliability, test information, and test efficiency (ratio of test information between two tests)
Table 3: Scoring Methods

Scoring Method | MC | CR3 | 1CR4 | 2CR4 | 1CR5 | 2CR5 | 3CR5
Type of Item | MC | CR3 | CR4 | CR4 | CR5 | CR5 | CR5
N of Categories | 2 | 3 | 4 | 4 | 5 | 5 | 5
Original Response String (ORS) | 0,1 | 0,1,2 | 0,1,2,3 | 0,1,2,3 | 0,1,2,3,4 | 0,1,2,3,4 | 0,1,2,3,4
New Response String (NRS) | 0,1 | 0,1 | 0,1 | 0,1 | 0,1 | 0,1 | 0,1
Collapse Rule to generate NRS | None | 0=(0), 1=(1,2) | 0=(0), 1=(1,2,3) | 0=(0,1), 1=(2,3) | 0=(0), 1=(1,2,3,4) | 0=(0,1), 1=(2,3,4) | 0=(0,1,2), 1=(3,4)
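The collapse rules in Table 3 all follow one pattern: the digit prefixed to the method name is the lowest ORS category that maps to 1 (CR3 has no prefix; its cut is 1). A minimal sketch, with names assumed rather than taken from the paper:

```python
# Cut points implied by Table 3 (method name -> lowest category scored 1)
COLLAPSE_CUTS = {"CR3": 1, "1CR4": 1, "2CR4": 2,
                 "1CR5": 1, "2CR5": 2, "3CR5": 3}

def collapse(ors, method):
    """Apply a Table 3 collapse rule to an original response string (ORS),
    returning the dichotomous new response string (NRS)."""
    cut = COLLAPSE_CUTS[method]
    return [1 if score >= cut else 0 for score in ors]

# 2CR5 maps ORS categories 0,1 -> 0 and 2,3,4 -> 1:
nrs = collapse([0, 1, 2, 3, 4], "2CR5")   # [0, 0, 1, 1, 1]
```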
4. Major Steps of the Simulation

Step 1: Generate person parameters (2,000) and item parameters (20 for each scoring method) for each of the tests: MC(20) + CR3(20), MC(20) + CR4(20), MC(20) + CR5(20)
Step 2: Generate item responses based on the Rasch and PCM models for each 40-item test
Step 3: Collapse the original CR response strings into MC (D-P) response strings, using the collapsing rules of the different scoring methods
Step 4: Calibrate item parameters while fixing person parameters
Step 5: Repeat Steps 2 to 4 100 times (100 simulated tests); for each of the 100 replications (tests), person parameters differ while item parameters are fixed across replications
Step 6: Calculate item and test statistics with the CTT and IRT methods (the five types of dependent variables) based on the results of Steps 4 and 5
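Steps 1 and 2 can be sketched with the Partial Credit Model (the dichotomous Rasch model is the one-step special case). This is an illustrative simulation, not the author's code, and it uses a smaller sample than the study's 2,000 persons and 20 items per method:

```python
import numpy as np

rng = np.random.default_rng(7)

def pcm_probs(theta, deltas):
    """Category probabilities under the Partial Credit Model (PCM).
    theta: person ability; deltas: step difficulties (m steps -> m+1
    categories 0..m). One step reduces to the Rasch model."""
    # cumulative logits: category 0 gets 0, category k gets sum(theta - delta_j)
    logits = np.concatenate(([0.0], np.cumsum(theta - np.asarray(deltas))))
    e = np.exp(logits - logits.max())        # subtract max for stability
    return e / e.sum()

def simulate_responses(thetas, item_steps):
    """Draw one PCM response per person per item (Step 2 of the simulation)."""
    resp = np.empty((len(thetas), len(item_steps)), dtype=int)
    for j, steps in enumerate(item_steps):
        for i, theta in enumerate(thetas):
            p = pcm_probs(theta, steps)
            resp[i, j] = rng.choice(len(p), p=p)
    return resp

# Small-scale demo (the study used 2,000 persons and 20 items per method):
thetas = rng.normal(size=200)              # person parameters
cr4_steps = rng.normal(size=(10, 3))       # 10 four-category (CR4) items
responses = simulate_responses(thetas, cr4_steps)
```

The collapsed D-P response strings of Step 3 would then be produced by thresholding these polytomous responses per the Table 3 rules.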
III. Results

1. Item/Test Analysis Results from CTT
Table 4. Overall Means (20 Items) of p-Value, Point-biserial, and KR20 for Different Scoring Methods

Scoring Method | p-Value | Point-biserial | KR20
D_MC | 0.52 | 0.44 | 0.78
D_CR3 | 0.67 | 0.46 | 0.81
D_1CR4 | 0.71 | 0.50 | 0.84
D_2CR4 | 0.50 | 0.55 | 0.88
D_1CR5 | 0.72 | 0.48 | 0.80
D_2CR5 | 0.58 | 0.51 | 0.84
D_3CR5 | 0.47 | 0.50 | 0.84
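The CTT statistics in Table 4 can be computed from a persons-by-items 0/1 score matrix as follows. A sketch only: the point-biserial here is the item-total correlation variant, and the study's exact formulas may differ in detail.

```python
import numpy as np

def ctt_stats(X):
    """Classical test theory statistics for a 0/1 score matrix X
    (persons x items): item p-values, point-biserials, and KR-20."""
    X = np.asarray(X, dtype=float)
    n_items = X.shape[1]
    total = X.sum(axis=1)                       # each person's total score
    p = X.mean(axis=0)                          # item p-values (difficulty)
    # point-biserial: correlation of each item with the total score
    pbis = np.array([np.corrcoef(X[:, j], total)[0, 1] for j in range(n_items)])
    var_total = total.var(ddof=1)
    kr20 = (n_items / (n_items - 1)) * (1 - (p * (1 - p)).sum() / var_total)
    return p, pbis, kr20

# Tiny worked example: 4 persons, 3 items
p, pbis, kr20 = ctt_stats([[1, 1, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1]])
```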
2. Item/Test Analysis Results from IRT

[Figure: test information curves Inf_MC, Inf_CR3, Inf_CR4, Inf_CR5 plotted against theta from -4 to 4]
Figure 1. Person Test Information from Both Dichotomous and Polytomous Responses Based on True Item Parameters for a Given Test 1 (Replication 1)
[Figure: test information curves Inf_D_MC, Inf_D_CR3, Inf_D_1CR4, Inf_D_2CR4, Inf_D_1CR5, Inf_D_2CR5, Inf_D_3CR5 plotted against theta from -4 to 4]
Figure 2. Person Test Information from Dichotomous Responses Based on Estimated Item Parameters by Different Scoring Methods for a Given Test 80 (Replication 80)
[Figure: relative efficiency curves EF_D_CR3, EF_D_1CR4, EF_D_2CR4, EF_D_1CR5, EF_D_2CR5, EF_D_3CR5 plotted against theta from -4 to 4]
Figure 3. Relative Efficiency of Person Tests with Non-MC Dichotomous Response Items Over MC Responses Based on Estimated Item Parameters by Different Scoring Methods for a Given Test 80 (Replication 80)
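Under the Rasch model, the test information plotted in Figure 2 is a sum of p(1-p) terms, and the relative efficiency in Figure 3 is the ratio of two such curves. The sketch below shows the computation; the item difficulties are hypothetical, not the study's estimates.

```python
import numpy as np

def rasch_test_info(theta_grid, difficulties):
    """Rasch test information: sum over items of p*(1-p) at each theta."""
    p = 1.0 / (1.0 + np.exp(-(theta_grid[:, None] - difficulties[None, :])))
    return (p * (1.0 - p)).sum(axis=1)

theta = np.linspace(-4.0, 4.0, 81)
b_mc = np.zeros(20)              # hypothetical MC item difficulties
b_d_cr3 = np.full(20, -0.7)      # hypothetical collapsed D_CR3 difficulties

# Relative efficiency of the D_CR3 test over the MC test at each theta
relative_efficiency = rasch_test_info(theta, b_d_cr3) / rasch_test_info(theta, b_mc)
```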
Table 5. Overall Averages of Test Information and Efficiency for Different Scoring Methods

Dependent Variable | Type | Scoring Method | N | MIN | MAX | MEAN | STD | SEM
Information | I | inf_MC | 100 | 3.54 | 3.63 | 3.58 | 0.01 | 0.53
Information | I | inf_CR3 | 100 | 6.79 | 6.91 | 6.84 | 0.02 | 0.38
Information | I | inf_CR4 | 100 | 11.60 | 11.91 | 11.73 | 0.05 | 0.29
Information | I | inf_CR5 | 100 | 12.43 | 12.56 | 12.49 | 0.02 | 0.28
Information | II | inf_D_MC | 100 | 3.58 | 3.68 | 3.62 | 0.02 | 0.53
Information | II | inf_D_CR3 | 100 | 2.92 | 3.02 | 2.96 | 0.02 | 0.58
Information | II | inf_D_1CR4 | 100 | 2.69 | 2.83 | 2.76 | 0.03 | 0.60
Information | II | inf_D_2CR4 | 100 | 3.20 | 3.25 | 3.22 | 0.01 | 0.56
Information | II | inf_D_1CR5 | 100 | 2.23 | 2.70 | 2.47 | 0.07 | 0.64
Information | II | inf_D_2CR5 | 100 | 2.24 | 2.51 | 2.39 | 0.11 | 0.65
Information | II | inf_D_3CR5 | 100 | 2.18 | 2.25 | 2.21 | 0.01 | 0.67
Efficiency | I | EF_CR3 | 100 | 1.92 | 1.94 | 1.93 | 0.00 |
Efficiency | I | EF_CR4 | 100 | 3.25 | 3.27 | 3.26 | 0.00 |
Efficiency | I | EF_CR5 | 100 | 3.53 | 3.60 | 3.57 | 0.01 |
Efficiency | II | EF_D_CR3 | 100 | 0.81 | 0.84 | 0.82 | 0.01 |
Efficiency | II | EF_D_1CR4 | 100 | 0.75 | 0.79 | 0.77 | 0.01 |
Efficiency | II | EF_D_2CR4 | 100 | 0.89 | 0.91 | 0.90 | 0.00 |
Efficiency | II | EF_D_1CR5 | 100 | 0.63 | 0.76 | 0.69 | 0.02 |
Efficiency | II | EF_D_2CR5 | 100 | 0.63 | 0.72 | 0.68 | 0.03 |
Efficiency | II | EF_D_3CR5 | 100 | 0.61 | 0.65 | 0.64 | 0.01 |
Efficiency | III | EF_D_CR3M | 100 | 0.43 | 0.45 | 0.44 | 0.00 |
Efficiency | III | EF_D_1CR4M | 100 | 0.24 | 0.25 | 0.24 | 0.00 |
Efficiency | III | EF_D_2CR4M | 100 | 0.28 | 0.28 | 0.28 | 0.00 |
Efficiency | III | EF_D_1CR5M | 100 | 0.18 | 0.22 | 0.20 | 0.01 |
Efficiency | III | EF_D_2CR5M | 100 | 0.18 | 0.20 | 0.19 | 0.01 |
Efficiency | III | EF_D_3CR5M | 100 | 0.17 | 0.18 | 0.18 | 0.00
3. Inferential Statistics Results

Statistical Hypotheses: There are no effects of the scoring method on any of the dependent variables under the different simulation conditions.

All hypotheses were rejected, meaning the scoring method makes a difference for every dependent variable.

Summary of Results
▪ Efficiency of person scores increases as the number of categories of item responses increases
▪ On average, the information of D_P responses is less than that of D_D responses
▪ Under the simulation conditions, for D_P scoring methods, the optimal number of categories is 4, not 5
IV. Conclusions

1. Different scoring methods have an impact on the efficiency of scores
2. Scoring TEIs as MC does not increase efficiency
3. A large number of categories (or components) is not necessarily the best choice for the D_P scoring method
Thank you! For any questions:
Shudong.wang@NWEA.org