Screening Common Items for IRT Equating · 6/17/2011 · Yi Du, Ph.D.

Screening Common Items for IRT Equating

Yi Du, Ph.D.
Data Recognition Corporation
Presentation at the 41st National Conference on Student Assessment, June 20, 2011


Overview

• Most states use equating to maintain a common test scale across years
• Common item equating designs help bring the new items onto the reference scale
• The quality of equating results (proficiency estimates) is affected by the across-year stability of the common items

Selecting Common Items

Historically, common items were selected to be representative of the total test forms in:

• Content
  • e.g., proportionally match the test specifications
• Statistical characteristics
  • e.g., item difficulty and the location and spread of item difficulty

Sinharay and Holland (2007)
• Recently suggested that the best common items may come from the middle of the test difficulty range


Using Common Items

• Common items should be interspersed throughout the test
• Avoid very early and late item positions
• Common items should be placed in the same relative item position across usages
• The number of common items should be a predetermined ratio of the total items
  • e.g., 20-30% of total test items

Unstable Common Items

Construct-irrelevant factors:
• Test security
• Over-exposure
• Item revisions
• Test booklet (change in format/layout)
• Environmental and societal change
• Test population changes

Construct-relevant factors:
• Curriculum/instruction changes

Unstable Anchors and Test Score Validity

• If common items are differentially difficult across administrations, item drift exists.
• Using common items with item drift can result in questionable equating results.
• Mechanisms are required to detect item drift during the equating process.
• Consider removing unstable items from the anchor set.


Today's Presentation

• Many assessment programs implement quantitative and qualitative procedures for screening common items
• This presentation will:
  • Summarize popular item screening procedures
  • Show results from select screening procedures
  • Discuss strategies for selecting and implementing different screening procedures

Screening Procedures

• Classical screening procedures
• IRT-related procedures
• Expert judgment

Classical Screening Techniques

• P-value cut-off criterion
• Delta plot method (Angoff, 1972, 1982; Dorans & Holland, 1993)
• DIF methods
  • Mantel-Haenszel chi-square statistics (Holland, 1985; Holland and Thayer, 1988)


P-Value Cut-Off Procedures

• Computes the differences in common item p-values (item difficulty) across two test administrations.
• Unstable items are removed from the anchor set based on fixed rules (usually established a priori).
• A popular procedure for screening common items.
• The method may not accurately control for Type I error (Harris, 1993; Miller, Rotou, and Twing, 2004)
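As a minimal sketch of this rule (the p-values and the .10 cut-off below are hypothetical illustrations, not values from the presentation):

```python
def flag_by_p_value(p_old, p_new, cutoff=0.10):
    """Return indices of common items whose p-value change exceeds the cut-off."""
    return [i for i, (po, pn) in enumerate(zip(p_old, p_new))
            if abs(po - pn) > cutoff]

# Hypothetical p-values for five common items on two administrations
p_old = [0.72, 0.55, 0.64, 0.81, 0.40]
p_new = [0.70, 0.41, 0.66, 0.79, 0.55]
print(flag_by_p_value(p_old, p_new))  # → [1, 4]
```

Items 1 and 4 shifted by more than .10 and would be candidates for removal; the fixed cut-off is exactly why the method's Type I error control is questioned.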

Delta Plot Procedures

• Introduced by Angoff (1972)
• Measures differences based on item-by-group interaction
• Estimates are accurate when all items are equal in discrimination
• Modified by Dorans and Holland (1993)

Dorans and Holland (1993) Delta Plots

• Item p-values are converted to an interval scale using inverse normal deviates (z's)
• The following linear transformation is then applied:

  Δ = 13 + 4·Φ⁻¹(1 − p)

• The resulting scale has a mean = 13 and an SD = 4
• The perpendicular distance of the paired Delta value to the principal axis line is determined.
• An a priori cut-off rule is used to remove common items from the anchor set

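As an illustration of the transformation (a sketch using Python's standard library, not part of the original slides):

```python
from statistics import NormalDist

def delta(p):
    """Delta transformation of an item p-value: mean 13, SD 4.
    Harder items (lower p-values) map to larger delta values."""
    return 13 + 4 * NormalDist().inv_cdf(1 - p)

print(round(delta(0.50), 2))  # → 13.0: an item of middle difficulty sits at the scale mean
```

Because the transformation is a monotone rescaling, paired deltas from two administrations can be plotted against each other and screened via their distance from the principal axis line.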

Modified Delta Plot Technique, Cont.

• An alternative approach
• Item p-values are converted using the inverse standard normal distribution:

  z = Φ⁻¹(p)

• The perpendicular distance of the paired Delta value is computed by:

  D = (A·Z_old − Z_new + B) / √(A² + 1)

  where the principal-axis slope is

  A = [SD²(Z_new) − SD²(Z_old) + √( (SD²(Z_old) − SD²(Z_new))² + 4·(r·SD(Z_old)·SD(Z_new))² )] / (2·r·SD(Z_old)·SD(Z_new))

Modified Delta Plot Technique, Cont.

• An alternative approach, continued
• And:

  B = Mean(Z_new) − A·Mean(Z_old)

• The standard deviation (SD) of the perpendicular distance is given by:

  SD(D) = √( SD(Z_old) · SD(Z_new) · (1 − r²(Z_old, Z_new)) )

• A fixed rule is used to flag items, e.g., D > 3 SD from the fitted line

Mantel-Haenszel (MH) Chi-Square Techniques

• Item drift is a special case of DIF that affects a test's objectivity (Bock, Muraki, & Pfeiffenberger, 1988)
• Many DIF methods could, theoretically, be used for screening common items
• MH procedures have been studied and used in the common item screening process (Michaelides, 2006)
• DIF and item drift differ practically in:
  • Sample size
  • Subgroup differences vs. local proficiency


IRT-Related Procedures

• Expected p-value comparison (including the Delta plot method)
• Common item TCC comparison and preliminary screening
• Item parameter comparison with inferential statistics
  • Weighted and unweighted t-statistic (Wright & Stone, 1979; Wright and Bell, 1984; Smith and Kramer, 1992)
  • Alternative information-weighted linking constants (Cohen, Jiang and Yu, 2008)
  • Displacement measure (Linacre, 2006)
  • Robust-z statistic (Huynh, 2002; Tenenaum, 2001)
  • Lord's (1980) chi-square statistic

Expected P-Value Comparisons

• The expected p-values of common items are compared to make sure the items are similar in difficulty in both forms
• A regression line is fit for the p-values between the estimated new form and old form.
• Delta plot methods can be used to evaluate the expected p-value differences in the IRT framework

Expected P-Value Comparisons

[Figure not captured in the extraction]

Common Items TCC Comparisons

• The old and equated common item set TCCs are compared to make sure that they have reasonable overlap.
• The correlation coefficients between the old and new item parameters are compared.
• Fixed rules are used to flag unstable common items
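A sketch of the comparison under the 3-PL model (the item parameters and the theta grid are hypothetical illustrations):

```python
import math

def p3pl(theta, a, b, c):
    """3-PL probability of a correct response (D = 1.7 scaling constant)."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: expected raw score over the common items."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)

# Hypothetical (a, b, c) estimates: old values vs. re-estimated, equated values
old = [(1.00, -0.50, 0.20), (0.80, 0.30, 0.15), (1.20, 1.00, 0.25)]
new = [(1.05, -0.45, 0.20), (0.78, 0.35, 0.15), (1.15, 1.05, 0.25)]

# Maximum vertical gap between the two TCCs over a theta grid
grid = [t / 10 for t in range(-40, 41)]
max_gap = max(abs(tcc(t, old) - tcc(t, new)) for t in grid)
print(round(max_gap, 3))  # a small gap suggests reasonable overlap
```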

Common Items TCC Comparisons

[Figure not captured in the extraction]

Screening for 3-PL IRT Estimates

• IRT item parameters (a, b, and c) of common items are plotted
• Fixed rules are used to flag unstable common items, e.g.,
  • a < .3 or a > 1.5, or the item drift > 1.0 theta unit
  • b < -2.0 or b > 2.0, or the item drift > 1.0 theta unit
  • c > .35
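The fixed rules above translate directly into a screening function (a sketch; the item parameter values below are hypothetical):

```python
def flag_3pl(items, drift=None):
    """Flag common items by the fixed rules: a outside [.3, 1.5],
    b outside [-2.0, 2.0], c > .35, or |drift| > 1.0 theta unit."""
    flags = []
    for i, (a, b, c) in enumerate(items):
        if not (0.3 <= a <= 1.5) or not (-2.0 <= b <= 2.0) or c > 0.35:
            flags.append(i)
        elif drift is not None and abs(drift[i]) > 1.0:
            flags.append(i)
    return flags

# Items 1, 2, 3 violate the a, b, and c rules respectively
items = [(1.1, 0.2, 0.18), (1.7, -0.3, 0.2), (0.9, 2.4, 0.1), (1.0, 0.0, 0.4)]
print(flag_3pl(items))  # → [1, 2, 3]
```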


Wright and Stone's t-statistic

• Mainly used in Rasch equating
• Offers the unweighted and weighted link constants:

  ul_jk = (1/M) · Σᵢ (d_ij − d_ik)

  wl_jk = Σᵢ w_ijk·(d_ij − d_ik) / Σᵢ w_ijk,  where  w_ijk = 1 / [se²(d_ij) + se²(d_ik)]

Wright and Stone's t-statistic (Cont.)

• The t-statistic identifies two sources of error:
  • The random fluctuation arising from the finite samples
  • The differences in item difficulty
• A common practice is to use the critical value at α = 0.05.

  t_ijk = (d_ij − d_ik − ul_jk) / [se²(d_ij) + se²(d_ik)]^(1/2)
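A sketch of both link constants and the item-level t-statistic (the difficulties and standard errors below are hypothetical):

```python
def link_constants(d_j, d_k, se_j, se_k):
    """Unweighted (ul) and weighted (wl) link constants for an anchor set:
    d_* are Rasch difficulties, se_* their standard errors, on forms j and k."""
    diffs = [dj - dk for dj, dk in zip(d_j, d_k)]
    ul = sum(diffs) / len(diffs)
    w = [1 / (sj**2 + sk**2) for sj, sk in zip(se_j, se_k)]
    wl = sum(wi * di for wi, di in zip(w, diffs)) / sum(w)
    return ul, wl

def t_stat(d_ij, d_ik, se_ij, se_ik, ul):
    """t-statistic for one common item against the unweighted link constant."""
    return (d_ij - d_ik - ul) / (se_ij**2 + se_ik**2) ** 0.5

# Hypothetical anchor set: the third item drifts relative to the others
d_j, d_k = [0.5, -0.2, 1.1], [0.3, -0.4, 0.4]
se = [0.1, 0.1, 0.1]
ul, wl = link_constants(d_j, d_k, se, se)
t3 = t_stat(d_j[2], d_k[2], se[2], se[2], ul)
print(round(t3, 2))  # exceeds the 1.96 critical value at alpha = .05
```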

Information-Weighted Linking Constants

• The original linking constant is determined by taking the arithmetic mean difference in item difficulty parameters as the linking constant:

  d_t = (1/J_t) · Σⱼ (b′_jt − b_jt)

• The weighted linking constant is weighted by the information δ_j:

  d′_t = Σⱼ w_j·(b′_jt − b_jt) / Σⱼ w_j,  where the weights w_j are based on the information (inverse error variance) of each item's difficulty estimates

• By weighting with the sampling error, the item parameter estimates are believed to be more efficient.
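A sketch of the two constants; the data are hypothetical, and inverse error variance is used as the illustrative weighting scheme:

```python
def linking_constants(b_old, b_new, err_var):
    """Plain (arithmetic-mean) and information-weighted linking constants.
    err_var holds the error variance of each item's difficulty difference;
    weighting by inverse error variance is the illustrative scheme here."""
    diffs = [bn - bo for bo, bn in zip(b_old, b_new)]
    d_plain = sum(diffs) / len(diffs)
    w = [1 / v for v in err_var]
    d_weighted = sum(wi * di for wi, di in zip(w, diffs)) / sum(w)
    return d_plain, d_weighted

# Hypothetical anchors: the imprecisely estimated third item (variance 4.0)
# pulls the plain mean but is down-weighted in the weighted constant
d_plain, d_weighted = linking_constants(
    [0.0, 0.5, 1.0], [0.2, 0.7, 1.8], [0.25, 0.25, 4.0])
print(round(d_plain, 3), round(d_weighted, 3))
```

The weighted constant sits closer to the precisely estimated items, which is the point of information weighting.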


Huynh's Robust-z Statistic

• The robust-z standardizes item parameter differences with the normalized interquartile range of the estimated differences in item parameters of the linking items:

  Robust z = (d − Median(d)) / ((Q₃ − Q₁) · 0.74),  where  d = d_ij − d_ik

Huynh's Robust-z Statistic (Cont.)

• Assuming samples j and k are from a common population, the robust-z statistic captures mostly one error source: the use of finite student samples.
• A common practice is to use the critical α at 0.10 (z = 1.645).
• Other criteria (e.g., the ratio of item parameter SDs and the correlation of the item parameters) are used with robust-z when purging items from the anchor set.
• In this method, statistical power is not affected by sample size.
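A sketch of the robust-z screen (hypothetical difficulty estimates; quartiles here use the `inclusive` method of Python's `statistics.quantiles`):

```python
from statistics import median, quantiles

def robust_z(d_j, d_k):
    """Robust-z for each common item: the difference in difficulty estimates,
    centered at the median and scaled by the normalized IQR (0.74 * IQR)."""
    diffs = [dj - dk for dj, dk in zip(d_j, d_k)]
    q1, _, q3 = quantiles(diffs, n=4, method='inclusive')  # quartiles
    scale = 0.74 * (q3 - q1)
    med = median(diffs)
    return [(d - med) / scale for d in diffs]

# Hypothetical difficulty estimates; the last item has drifted badly
d_old = [0.10, 0.42, -0.55, 0.03, 1.20, -0.80, 0.66, 0.25]
d_new = [0.13, 0.43, -0.55, 0.02, 1.18, -0.83, 0.62, 1.15]
z = robust_z(d_old, d_new)
flagged = [i for i, zi in enumerate(z) if abs(zi) > 1.645]
print(flagged)  # → [7]
```

Because the median and IQR ignore extreme values, the drifting item cannot mask itself by inflating the scale, which is the appeal of the robust standardization.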

Linacre's Displacement Measure

• Three sources of variability influence the size of the displacement measure:
  • Finite samples of students
  • Differences in student ability
  • Departure of the item difficulty from the best fit of the data set to the Rasch model
• In equating practice, a cut-off point of 0.30 is commonly used.


Linacre's Displacement Measure (Cont.)

• The displacement measure indexes the size of the change in an estimated item parameter in samples j and k that would be observed if the item parameter were 'free' and the others were 'fixed':

  D_jk = b̂_i(P | b̂_ij) − b̂_i(P | b̂_ik)

Chi-Square for IRT Parameters

• The procedure is based on the chi-square statistic:

  χ²_i = v′_i · Σ⁻¹_i · v_i

  where v′_i = (b_i1 − b_i2, a_i1 − a_i2) is the vector of parameter differences and Σ⁻¹_i is the inverse of the asymptotic variance-covariance matrix for v_i
• The null hypothesis is: b_i1 = b_i2 and a_i1 = a_i2
• Chi-squares are computed for items independently.
• Chi-square values are compared against tabled chi-square values
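The quadratic form is small enough to compute by hand for one item (a sketch; the parameter shifts and the diagonal covariance matrix below are hypothetical):

```python
def lord_chi_square(v, cov):
    """Lord-style chi-square for one item: v' * inv(cov) * v, where
    v = [b1 - b2, a1 - a2] and cov is the 2x2 asymptotic
    variance-covariance matrix of the differences."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    inv = [[s22 / det, -s12 / det], [-s21 / det, s11 / det]]
    w = [inv[0][0] * v[0] + inv[0][1] * v[1],
         inv[1][0] * v[0] + inv[1][1] * v[1]]
    return v[0] * w[0] + v[1] * w[1]

# Hypothetical drifting item: b shifts .5 and a shifts .2, with small SEs
chi2 = lord_chi_square([0.5, 0.2], [[0.04, 0.0], [0.0, 0.04]])
print(round(chi2, 2))  # → 7.25, above the 5.99 critical value (2 df, alpha = .05)
```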

Proportion of Items Flagged Using Different Methods

                    t-statistic         Robust-z           Displacement
                    p ≤ .10   p ≤ .05   p ≤ .10   p ≤ .05   ±.30    ±.50
Grade 4 ELA          0.03      0.00      0.15      0.03      0.00    0.00
Grade 8 Math         0.11      0.08      0.14      0.06      0.06    0.03
Grade 11 Science     0.00      0.00      0.08      0.00      0.06    0.00
Grade 11 SS          0.17      0.07      0.33      0.33      0.43    0.13


Results

• Different screening methods yield different results
• Methods have unique advantages and disadvantages
  • Robust-z flagged more linking items
  • Robust-z may not function well when the range of item difficulty is small.
  • The t-statistic depends upon the sample size.
  • Displacement confounds item drift with model fit

Proportion of Proficient Students across Methods

[Figure: bar chart of the percentage of students at the proficient level in G4 ELA, G4 Math, G4 Science, and G4 SS (y-axis 0.30 to 0.70) under the t-statistic, Robust-z, Displacement, and Delta plot methods]

Why This Matters to You

• Different screening methods may yield unique equating results
• These equating results, in turn, affect students' total scores and the classifications of the performance standards


Summary

• The definition of what constitutes an unstable anchor item depends, in part, upon the techniques used to detect differential performance
• Some methods may have a higher probability of making a Type I error, while others may be more likely to result in a Type II error
• Screening power differs across methods, due to various statistical assumptions, equating methods, different sample sizes, and other circumstances

Strategies for Screening Common Items

• Proposed strategies:
  • First, flag all potentially "flawed" items to prevent Type II error and ensure the quality of linking items
  • Next, evaluate all flagged items to prevent Type I error and minimize bias in equating constants.
• Use multiple procedures to screen common items to help support and confirm any decision making.

Strategies for Screening Common Items, Cont.

• All statistical output from the common item screenings needs to be examined carefully
• Professional judgment, along with statistical methods and fixed rules, should drive any flagging rules and decision-making regarding the removal of unstable common items
• The total anchor set and standard error of equating should be examined before the final equating decision is made.


Thanks very much for your time and consideration! 
