STAAR-Like Quality Starts with Reliability


SLIDE 1

STAAR-Like Quality Starts with Reliability

SLIDE 2

Quality Educational Research

Our mission is to provide a comprehensive, independent, research-based resource of easily accessible and interpretable data for policy makers, school administrators, teachers, and parents to use in making decisions.

SLIDE 3

Objectives

  • Introduction to test theories
  • Components of a quality local assessment
  • Apply basic measurement concepts of reliability, validity, and test construction
  • Demonstrate why high test scores may not always indicate reliable, valid test scores
  • Apply basic concepts of constructing a STAAR-like assessment.

SLIDE 4

Test Theories

  • Classical Testing Theory (CTT)
  • Generalizability Theory (G Theory)
  • Item Response Theory (IRT)
SLIDE 5

Basic Concepts of Measurement

SLIDE 6

Measurement Error

Measurement error is a fundamental component of psychological testing.

Observed Score = True Score + Measurement Error

Measurement Error – due to the test administration, guessing, and other temporary fluctuations in behavior. Can you name more?

True Score – if we could measure a student’s ability in some area (e.g., math) an infinite number of times, the average of these scores would be the student’s true score.
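A minimal sketch of this idea in Python, using made-up numbers (the true score of 75 and the error spread are assumptions, not data from the presentation): each administration adds random error to the true score, and the average over many administrations converges to the true score.

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 75.0   # hypothetical true math ability (unknowable in practice)
error_sd = 5.0      # hypothetical spread of random measurement error

# Classical test theory: each observed score = true score + random error
n_administrations = 100_000
observed = true_score + rng.normal(0.0, error_sd, size=n_administrations)

print(observed[:5].round(1))       # individual scores bounce around 75
print(round(observed.mean(), 2))   # the long-run average is very close to 75
```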

SLIDE 7

Measurement Error

On February 5, 2012, 111.3 million people watched an event that is completely standardized and has as one of its core components measurement error. And they probably didn’t even know it. What was it? Super Bowl 46: New York Giants vs. New England Patriots

SLIDE 8

Measurement Error

After each play, how do the teams know where to start the next play? The referee “spots” the ball. How accurate is the spot? Everything in football begins with this fundamental component. What if the referee spots the football 1 yard too short or 1 yard too long? Are there rules on spotting the football?

Everything in football has been calibrated - the size of the football, the box and chains, the football field, the clock, etc. - but every play has measurement error in that it all depends on where the referee “spots” the ball.

SLIDE 9

Measurement Error

[Figure: football field graphic showing yard-line markers]

The advantage in football is instant replay - we don’t have this advantage in testing.

SLIDE 10

Reliability and Measurement Error

  • Measurement error is inversely related to reliability. In other words, as one goes up the other goes down.
  • We usually measure reliability from 0 to 1. So if reliability is 1, then we have no measurement error – this is an extreme case.
  • If reliability is 0, then we are not really measuring anything.
  • In order to increase “reliability”, we must decrease “error”.
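A minimal sketch of this inverse relationship, assuming normally distributed true scores and errors with made-up standard deviations (none of these numbers come from the presentation): as the error spread grows, the reliability coefficient, defined in classical test theory as the ratio of true-score variance to observed-score variance, shrinks toward 0.

```python
import numpy as np

rng = np.random.default_rng(1)
n_students = 10_000
true_scores = rng.normal(70, 10, n_students)   # hypothetical true-score distribution

for error_sd in (0.0, 5.0, 10.0, 20.0):
    observed = true_scores + rng.normal(0.0, error_sd, n_students)
    # CTT: reliability = true-score variance / observed-score variance
    reliability = true_scores.var() / observed.var()
    print(f"error SD = {error_sd:>4}: reliability ~ {reliability:.2f}")
```
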
SLIDE 11

Components of a Quality Local Assessment

SLIDE 12

Purpose

The purpose of the assessment will drive the test design.

  • Purpose 1: To classify students into distinct categories

Summative - pass/fail

  • Purpose 2: To provide information

Formative – learning experience for both student and teacher

SLIDE 13

Purpose

  • Purpose 1: To classify students into distinct categories.

Under this purpose, is there any reason to administer the test to the student who passes the test (missing only a few questions) or fails the test (only getting a few questions correct) every year?

  • Purpose 2: To provide information.

Under this purpose, is there any reason to administer the test to the student who passes the test (missing only a few questions) or fails the test (only getting a few questions correct) every year?

SLIDE 14

Reliability

  • Internal Consistency
  • Calculate Alpha (a sketch of the calculation follows this list)
  • Desired Items
  • Desired Reliability
  • Test-Retest Forms
  • Alternate Forms
  • Test-Retest Alternate Forms
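Coefficient alpha (Cronbach’s alpha) can be computed directly from a scored item matrix. A minimal sketch with a small, made-up set of 0/1 responses (the data are hypothetical, not from the presentation):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: a students-by-items matrix of scored responses (e.g., 1 = correct, 0 = incorrect)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical scored responses: 6 students x 4 items
scores = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])
print(round(cronbach_alpha(scores), 2))   # internal-consistency estimate for this toy test
```
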
SLIDE 15

Reliability

  • Inter-Rater
  • Percent Agreement
  • Cohen’s Kappa
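Both inter-rater statistics can be computed from paired ratings. A minimal sketch with hypothetical pass/fail scores from two raters (the data are made up): percent agreement is the share of identical ratings, and Cohen’s kappa corrects that share for the agreement expected by chance.

```python
from collections import Counter

# Hypothetical ratings from two scorers on the same ten student responses
rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "fail"]
n = len(rater_a)

# Percent agreement: proportion of responses on which the raters agree
p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Chance agreement from each rater's marginal proportions
counts_a, counts_b = Counter(rater_a), Counter(rater_b)
p_expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in counts_a)

# Cohen's kappa: agreement beyond chance
kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"percent agreement = {p_observed:.2f}, kappa = {kappa:.2f}")
```
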
SLIDE 16

Validity

Scores must be reliable before they can be valid.

  • Content
  • Criterion-related
      • Predictive
      • Concurrent
  • Construct
SLIDE 17

Predictive Validity (Criterion Related)

See Handout
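The handout is not reproduced here. As a stand-in, a minimal sketch of a predictive validity coefficient using entirely hypothetical numbers: the correlation between students’ scores on a local benchmark (the predictor) and the STAAR scale scores they earn later (the criterion).

```python
import numpy as np

# Hypothetical data: fall benchmark raw scores and the spring STAAR scale scores
# earned by the same ten students (illustrative values only)
benchmark = np.array([62, 70, 75, 80, 85, 88, 90, 93, 95, 98], dtype=float)
staar = np.array([1480, 1520, 1535, 1570, 1600, 1615, 1640, 1660, 1675, 1700], dtype=float)

# Predictive validity coefficient: correlation between predictor and later criterion
r = np.corrcoef(benchmark, staar)[0, 1]
print(f"predictive validity coefficient r = {r:.2f}")
```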

SLIDE 18

Item Analysis

  • p-value
  • desired value – reliability is maximized when p is halfway between the floor and ceiling
  • 4 possible choices: floor is .25; ceiling is 1; desired value is .625
  • calculate a few of your own
  • Desired total score mean
  • desired value – reliability is maximized when the total mean score is the sum of all the desired p-values
  • .625 + .625 + .75 = 2
  • calculate a few of your own (one way to calculate these values is sketched below)
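A minimal sketch of these two calculations in Python (the mix of item formats in the example is made up):

```python
def desired_p_value(n_choices: int) -> float:
    """Reliability is maximized when p is halfway between the chance floor and the ceiling of 1."""
    floor = 1.0 / n_choices   # chance level, e.g., .25 for a 4-choice item
    ceiling = 1.0
    return (floor + ceiling) / 2

print(desired_p_value(4))   # 0.625 for a 4-choice multiple-choice item
print(desired_p_value(2))   # 0.75 for a true/false item

# Desired total-score mean: the sum of the items' desired p-values
items = [4, 4, 2]   # hypothetical test: two 4-choice items and one true/false item
print(sum(desired_p_value(k) for k in items))   # 0.625 + 0.625 + 0.75 = 2.0
```
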
SLIDE 19

Item Analysis

  • Discrimination Index (D-Index)
  • Determine high (top 27%) and low group (bottom 27%)
  • D = p(upper group) – p(lower group)
  • D should be greater than .30
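A minimal sketch of the D-Index for a single item, using made-up item scores and total test scores (with only ten students, the 27% groups round to three students each):

```python
import numpy as np

def discrimination_index(item_correct: np.ndarray, total_scores: np.ndarray) -> float:
    """D = p(upper) - p(lower), using the top and bottom 27% of examinees by total score."""
    n = len(total_scores)
    k = max(1, round(0.27 * n))         # size of each extreme group
    order = np.argsort(total_scores)    # indices from lowest to highest total score
    lower, upper = order[:k], order[-k:]
    return item_correct[upper].mean() - item_correct[lower].mean()

# Hypothetical data: one item's 0/1 scores and the students' total test scores
item = np.array([1, 1, 1, 0, 1, 0, 1, 0, 0, 0])
totals = np.array([38, 35, 33, 31, 30, 28, 25, 22, 20, 15])
print(f"D = {discrimination_index(item, totals):.2f}  (we want D greater than .30)")
```
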
SLIDE 20

Item Analysis

  • Point Biserial Correlation
  • Correlation between a dichotomous item and the total test score
  • A high value indicates a strong correlation between that item and the total test score
  • A high value does not indicate that a lot of respondents answered the item correctly
  • Cronbach Alpha “if deleted”
  • Distractors
  • (1 – desired p-value) / # of distractors
  • Do you have any that are not doing their job? (see the sketch below)
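A minimal sketch of the point-biserial and the distractor check, again with made-up data: the point-biserial is simply the Pearson correlation between the 0/1 item scores and the total scores, and with a desired p of .625 and three distractors each distractor should attract roughly (1 – .625) / 3 of the responses.

```python
import numpy as np

# Hypothetical data: one dichotomously scored item and each student's total test score
item = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 1])
totals = np.array([34, 30, 18, 28, 15, 32, 20, 17, 29, 31])

# Point-biserial correlation: Pearson correlation of a 0/1 item with the total score
r_pb = np.corrcoef(item, totals)[0, 1]
print(f"point-biserial = {r_pb:.2f}")

# Distractor check: expected share of responses per distractor for a 4-choice item
expected_share = (1 - 0.625) / 3
print(f"expected share per distractor = {expected_share:.3f}")
```
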
SLIDE 21

Apply Basic Measurement Concepts of Reliability, Validity, and Test Construction

SLIDE 22

Reliability

  • Internal Consistency
  • Test-Retest
  • Alternate Form
  • Test-Retest Alternate Form
  • Inter-Rater
SLIDE 23

Validity

  • Content
  • Criterion-related
  • Construct
SLIDE 24

Item Analysis

  • Discrimination Index (D-Index)
  • Distractor Analysis
SLIDE 25

Demonstrate Why High Test Scores May Not Always Indicate Reliable, Valid Test Scores

SLIDE 26

Apply Basic Concepts of Constructing a STAAR-Like Assessment.

SLIDE 27

Test Construction

Process Steps

  • 1. Identify the primary purpose(s) for which the test scores will be used.
  • 2. Identify behaviors that represent the construct or define the domain.
  • 3. Prepare a set of test specifications, delineating the proportion of items that should focus on each type of behavior identified in step 2.
  • 4. Construct an initial pool of items.
  • 5. Have items reviewed (and revise as necessary).

(Crocker and Algina, 2008)

SLIDE 28

Test Construction

Process Steps

  • 6. Hold preliminary item tryouts (and revise as necessary).
  • 7. Field-test the items on a large sample representative of the examinee population for whom the test is intended.
  • 8. Determine statistical properties of item scores and, when appropriate, eliminate items that do not meet pre-established criteria.
  • 9. Design and conduct reliability and validity studies for the final form of the test.
  • 10. Develop guidelines for administration, scoring, and interpretation of the test scores (e.g., prepare norm tables, suggest recommended cutting scores or standards for performance, etc.).

(Crocker and Algina, 2008)

SLIDE 29

STAAR

Reliability

  • Mean
  • SD
  • Alpha
  • SEM = SD * sqrt(1 - r) (see the sketch below)
  • Mean P-Value

Validity

Scale Score – use SD and Mean

Equating

Vertical Score
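A minimal sketch of the SEM formula from the Reliability list above, with made-up summary statistics (the SD and alpha values are illustrative, not STAAR figures):

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical score SD of 8 points at two reliability (alpha) levels
print(round(sem(sd=8.0, reliability=0.90), 2))   # 2.53 - higher alpha, smaller SEM
print(round(sem(sd=8.0, reliability=0.75), 2))   # 4.0  - lower alpha, larger SEM
```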

SLIDE 30

Scaling, Norming, Equating

Scaling – assigning intervally scaled numerical values to raw scores.

Norming – constructing conversion tables so that a particular raw score value can be interpreted in terms of its relative location and frequency within the total score distribution.

Equating – a statistical process for expressing scores of one test on the scale of another with maximum precision (Osterlind, 2006).
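For equating, one common approach is linear equating, which places Form X scores on the Form Y scale by matching the two forms’ means and standard deviations. A minimal sketch with hypothetical raw scores (the method and the numbers are illustrative; the presentation does not specify which equating procedure is used):

```python
import numpy as np

# Hypothetical raw scores from two test forms given to comparable groups
form_x = np.array([18, 22, 25, 27, 30, 33, 35], dtype=float)
form_y = np.array([20, 24, 28, 31, 33, 36, 40], dtype=float)

def linear_equate(x: float, x_scores: np.ndarray, y_scores: np.ndarray) -> float:
    """Express a Form X raw score on the Form Y scale by matching means and SDs."""
    mx, sx = x_scores.mean(), x_scores.std(ddof=1)
    my, sy = y_scores.mean(), y_scores.std(ddof=1)
    return my + (sy / sx) * (x - mx)

print(round(linear_equate(25, form_x, form_y), 1))   # a Form X score of 25 on the Form Y scale
```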

SLIDE 31

So How Does this Help the Test Designer?

SLIDE 32

So How does this Help the Test Designer?

  • Provides information about how test scores remain stable over time (when to modify a test and when not to).
  • Provides information about how to produce an alternate (but equivalent) form of a test to prevent cheating.
  • Provides information about how reliably a test is performing with respect to student ability. Are students with similar abilities getting similar questions correct?
  • Provides information that can be used to explain to students, teachers, and parents the reliability of test scores.

SLIDE 33

So How does this Help the Test Designer?

  • Provides information about how test items relate to the specified content.
  • Provides information about how test scores relate to other criteria (e.g., course grade, GPA, SAT, etc.).
  • Provides information about how groups of items on a test cluster together to measure a similar construct (e.g., math ability).
  • Provides information that can be used to explain to students, teachers, and parents the validity of test scores.

SLIDE 34

Consolidating Efforts

SLIDE 35

Benefits to Districts

  • Greater precision in measurement.
  • Make better decisions about item usage (increase information gathered from each item).
  • Make better decisions about the level of difficulty of tests.
  • Strategically substitute items from one test to another.
  • Provides an item clearinghouse so districts can trade items if they choose to do so.

SLIDE 36

Review Today’s Objectives

  • Introduction to Test Theory
  • Components of a quality local assessment
  • Apply basic measurement concepts of reliability, validity, and test construction
  • Demonstrate why high test scores may not always indicate reliable, valid test scores
  • Apply basic concepts of constructing a STAAR-like assessment.