Changes in Test Scores with Multiple Sittings of CanTEST

Sep 21, 2022

SLIDE 1

Changes in Test Scores with Multiple Sittings of CanTEST

Philip Nagy

SLIDE 2

Official Languages and Bilingualism Institute

Rationale

Research Questions

Do test scores change on repeating the test?
Is change related to length of time between sittings?

Test Development Questions

Can data from repeaters be used in test calibration for new form development?

Context: Receptive Skills

SLIDE 3

The Data

Listening Tests: Six forms with 15 short and 25 long passage items

Reading Tests: Seven forms with 15 skim-and-scan, 20 reading passage, and 25 cloze items

The Sample: Mean first score of 3.6, compared to 4.3 for those who write only once

Assumptions

Difficulty of forms is balanced across sittings (true)
Samples writing each form are equivalent (untested)

SLIDE 4

Listening Results: Sitting 2 minus Sitting 1 (N=179)

| Change in Raw Score | Total Test (40) | Short Passages (15) | Long Passages (25) |
|---|---|---|---|
| Down >11 | 3 | | 1 |
| Down 6 to 10 | 18 | 2 | 11 |
| Down 3 to 5 | 18 | 24 | 22 |
| Same ± 2 | 43 | 91 | 72 |
| Up 3 to 5 | 42 | 42 | 46 |
| Up 6 to 10 | 36 | 20 | 24 |
| Up >11 | 19 | | 3 |

SLIDE 5

Listening Results, another look

| Change in Raw Score | Total Test (40) | Short Passages (15) | Long Passages (25) |
|---|---|---|---|
| Down some | 22% | 15% | 19% |
| About the same | 24% | 51% | 40% |
| Up some | 54% | 34% | 41% |
| Mean raw gain | 2.6 | 1.3 | 1.3 |
| Mean % gain | 6.5% of 40 items | 8.8% of 15 items | 5.2% of 25 items |
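The Total Test percentages above can be reproduced from the Total Test counts in the Sitting 2 minus Sitting 1 listening table (N=179). The following is a minimal Python sketch; the helper name `collapse_changes` and the dictionary layout are illustrative assumptions, not part of the original analysis.

```python
# Counts of candidates by change in raw score (Total Test column, N=179),
# copied from the Sitting 2 minus Sitting 1 listening table.
total_test_counts = {
    "Down >11": 3,
    "Down 6 to 10": 18,
    "Down 3 to 5": 18,
    "Same ± 2": 43,
    "Up 3 to 5": 42,
    "Up 6 to 10": 36,
    "Up >11": 19,
}


def collapse_changes(counts):
    """Collapse the score-change bands into down / same / up percentages.

    Hypothetical helper: bands are grouped by the first word of their label.
    """
    n = sum(counts.values())
    down = sum(v for k, v in counts.items() if k.startswith("Down"))
    up = sum(v for k, v in counts.items() if k.startswith("Up"))
    same = n - down - up
    groups = (("down", down), ("same", same), ("up", up))
    # Round to whole percentages, as the slide does.
    return {name: round(100 * value / n) for name, value in groups}


summary = collapse_changes(total_test_counts)
print(summary)
```

For the Total Test column this gives 22% down, 24% about the same, and 54% up, matching the slide.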

SLIDE 6

Listening Results Interpretation

How important is the improvement?

On average, 3.6 points needed out of 40 to improve one band

So, 2.6 points is about 75% of a band improvement
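The band arithmetic is a single division. A minimal Python sketch follows, using the listening figures from this slide; the function name `band_fraction` is an illustrative assumption.

```python
def band_fraction(mean_gain, points_per_band):
    """Express a mean raw-score gain as a fraction of one band (hypothetical helper)."""
    return mean_gain / points_per_band


# Listening figures from the slide: 2.6-point mean gain, 3.6 points per band.
listening = band_fraction(2.6, 3.6)
print(f"{listening:.0%} of a band")
```

The exact ratio is a little over 72%, which the slide rounds to "about 75%".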

SLIDE 7

Listening Results Interpretation

Can the data be used for test calibration?

The changes in average item difficulty are different for the subtests

.088 for short passages
.052 for long passages
The difference of .036 (.088 - .052) is about the same as the standard error of the difficulty indices

Listening data from repeaters should not be used for item calibration
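The comparison behind this recommendation can be sketched as the spread of subtest difficulty changes expressed in standard-error units. This is a minimal Python sketch, not the presenter's procedure: the slides do not give a numeric standard error, so `ASSUMED_SE` is a purely hypothetical placeholder, and `spread_in_se_units` is an illustrative name.

```python
# Change in average item difficulty per listening subtest, from the slide.
listening_changes = {"short passages": 0.088, "long passages": 0.052}

# ASSUMPTION: the slides do not state the standard error of the difficulty
# indices; 0.03 is a placeholder chosen only to make the sketch runnable.
ASSUMED_SE = 0.03


def spread_in_se_units(changes, standard_error):
    """Return the largest gap between subtest changes, raw and in SE units."""
    spread = max(changes.values()) - min(changes.values())
    return spread, spread / standard_error


spread, ratio = spread_in_se_units(listening_changes, ASSUMED_SE)
print(f"spread = {spread:.3f}, about {ratio:.1f} standard errors (assumed SE)")
```

A gap on the order of one standard error or more means repeaters shift item difficulties unevenly across subtests, which is the stated reason for excluding them from calibration.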

SLIDE 8

Changes in Listening by Length of Time between Sittings

| Time Between Tests | Total Test | Short Passages | Long Passages |
|---|---|---|---|
| > 6 months (N=63) | +2.13 | +0.63¹ | +1.49 |
| < 6 months (N=116) | +2.87 | +1.69¹ | +1.18 |

¹ Difference significant, p=0.05

Those who repeat sooner do better than those who repeat later
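As a consistency check, the two group gains can be pooled, weighting by group size, and compared with the overall mean raw gains reported earlier (2.6, 1.3, and 1.3). The sketch below is a minimal Python illustration; `pooled_mean` is a hypothetical helper, and the short-passage gains are taken as +0.63 and +1.69, reading the trailing 1 as the footnote marker.

```python
def pooled_mean(groups):
    """Weighted mean of group means, weighting each group by its size."""
    total_n = sum(n for n, _ in groups)
    return sum(n * mean for n, mean in groups) / total_n


# (N, mean gain) for > 6 months and < 6 months, from the slide.
total_test = [(63, 2.13), (116, 2.87)]
short_passages = [(63, 0.63), (116, 1.69)]
long_passages = [(63, 1.49), (116, 1.18)]

for label, groups in [
    ("Total Test", total_test),
    ("Short Passages", short_passages),
    ("Long Passages", long_passages),
]:
    print(f"{label}: {pooled_mean(groups):.1f}")
```

The pooled values round to 2.6, 1.3, and 1.3, agreeing with the mean raw gains in the earlier listening summary.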

SLIDE 9

Reading Results: Sitting 2 minus Sitting 1 (N=284)

Note: Reading Score is doubled to give a total out of 80 rather than 60.

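| Change in Raw Score | Total (80) | Skim-&-Scan (15) | Passage (20) | Cloze (25) |
|---|---|---|---|---|
| Down 21 or more | 17 | | | |
| Down 11 to 20 | 19 | 2 | | 12 |
| Down 6 to 10 | 21 | 12 | 18 | 32 |
| Down 3 to 5 | 28 | 32 | 30 | 34 |
| Same score ± 2 | 46 | 139 | 142 | 106 |
| Up 3 to 5 | 33 | 65 | 63 | 52 |
| Up 6 to 10 | 47 | 31 | 23 | 36 |
| Up 11 to 20 | 48 | 3 | 8 | 12 |
| Up 21 or more | 25 | | | |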

SLIDE 10

Reading Results, another look

| Change in Raw Score | Total (80) | Skim-&-Scan (15) | Reading Passage (20) | Cloze Passage (25) |
|---|---|---|---|---|
| Down some | 30% | 16% | 17% | 27% |
| About the same | 16% | 49% | 50% | 37% |
| Up some | 54% | 35% | 33% | 35% |

SLIDE 11

Reading Results Interpretation

How important is the improvement?

On average, 6.5 points needed (out of 80) to improve one band

So, 3.45 points is about 55% of a band improvement

SLIDE 12

Reading Results Interpretation

Can the data be used for test calibration?

The changes in average item difficulty are different for the subtests

+0.072 for skim-and-scan
+0.050 for reading passages
+0.002 for cloze
The largest difference of .070 (.072 - .002) is two to three times larger than the standard error of the difficulty indices

Reading data from repeaters should not be used for item calibration

SLIDE 13

Changes in Reading by Length of Time between Sittings

| Time Between Tests | Total (80) | Skim-&-Scan | Reading Passage | Cloze Passage |
|---|---|---|---|---|
| > 6 months (N=105) | 0.119 | 0.292¹ | 0.017 | 0.079 |
| < 6 months (N=179) | +0.070 | +0.171¹ | +0.010 | +0.046 |

¹ Difference significant, p=0.05

Those who repeat later actually do worse than those who repeat sooner

SLIDE 14

Conclusion

Listening:
30% of sample do more poorly on 2nd sitting
Average gain is 75% of a band score
Gains across item types differ by about one item standard error

Reading:
40% of sample do more poorly on 2nd sitting
Average gain is 55% of a band score
Gains across item types differ by 2-3 times an item standard error

Both:
Those who rewrite within six months do better
Data from repeaters should not be used for item calibration