

  1. STAAR-Like Quality Starts with Reliability

  2. Quality Educational Research Our mission is to provide a comprehensive, independent, research-based resource of easily accessible and interpretable data for policy makers, school administrators, teachers, and parents to use in making decisions.

  3. Objectives • Introduction to test theories • Components of a quality local assessment • Apply basic measurement concepts of reliability, validity, and test construction • Demonstrate why high test scores may not always indicate reliable, valid test scores • Apply basic concepts of constructing a STAAR-like assessment.

  4. Test Theories • Classical Test Theory (CTT) • Generalizability Theory (G Theory) • Item Response Theory (IRT)

  5. Basic Concepts of Measurement

  6. Measurement Error This is a fundamental component of psychological testing. Observed Score = True Score + Measurement Error. Measurement Error – due to the test administration, guessing, and other temporary fluctuations in behavior. Can you name more? True Score – if we could measure a student’s ability in some area (e.g., math) an infinite number of times, the average of these scores would be the student’s true score.
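A minimal sketch (not part of the original slides) of the CTT model in code, simulating the “infinite administrations” thought experiment with hypothetical numbers and normally distributed error:

```python
# Sketch of the classical test theory model X = T + E, assuming
# normally distributed measurement error; the scores are illustrative.
import numpy as np

rng = np.random.default_rng(seed=1)

true_score = 75.0                 # the student's (unobservable) true math score
n_administrations = 100_000       # "infinite" replications, approximated
error = rng.normal(loc=0.0, scale=5.0, size=n_administrations)  # guessing, fatigue, etc.
observed = true_score + error     # Observed Score = True Score + Measurement Error

# The mean of the observed scores converges on the true score.
print(f"mean observed score: {observed.mean():.2f}")   # ~75.00
```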

  7. Measurement Error On February 5, 2012, 111.3 million people watched an event that is completely standardized and has measurement error as one of its core components – and they probably didn’t even know it. What was it? Super Bowl XLVI: New York Giants vs. New England Patriots.

  8. Measurement Error After each play, how do the teams know where to start the next play? The referee “spots” the ball. How accurate is the spot? Everything in football begins with this fundamental component. What if the referee spots the football 1 yard too short or 1 yard too long? Are there rules on spotting the football? Everything in football has been calibrated – the size of the football, the box and chains, the football field, the clock, etc. – but every play has measurement error in that it all depends on where the referee “spots” the ball.

  9. Measurement Error The advantage in football is the instant replay – we don’t have this advantage in testing.

  10. Reliability and Measurement Error • Measurement error is inversely related to reliability. In other words, as one goes up the other goes down. • We usually measure reliability from 0 to 1. So if reliability is 1, then we have no measurement error – this is an extreme case. • If reliability is 0, then we are not really measuring anything. • In order to increase “reliability”, we must decrease “error”.
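The inverse relationship follows from the standard CTT identity, reliability = var(True) / (var(True) + var(Error)). A short sketch (an editorial illustration, with made-up variance figures):

```python
# As error variance grows, reliability falls; with zero error variance,
# reliability is 1. The variance figures below are hypothetical.
true_var = 100.0

for error_var in (0.0, 25.0, 100.0, 400.0):
    reliability = true_var / (true_var + error_var)
    print(f"error variance {error_var:6.1f} -> reliability {reliability:.2f}")

# error variance    0.0 -> reliability 1.00   (no measurement error)
# error variance  400.0 -> reliability 0.20   (error swamps the true score)
```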

  11. Components of a Quality Local Assessment

  12. Purpose The purpose of the assessment will drive the test design. • Purpose 1: To classify students into distinct categories Summative – pass/fail • Purpose 2: To provide information Formative – a learning experience for both student and teacher

  13. Purpose • Purpose 1: To classify students into distinct categories. Under this purpose, is there any reason to administer the test to the student who passes the test (missing only a few questions) or fails the test (only getting a few questions correct) every year? • Purpose 2: To provide information. Under this purpose, is there any reason to administer the test to the student who passes the test (missing only a few questions) or fails the test (only getting a few questions correct) every year?

  14. Reliability • Internal Consistency • Calculate Alpha • Desired Items • Desired Reliability • Test-Retest • Alternate Forms • Test-Retest Alternate Forms
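A sketch of the “Calculate Alpha” step, using the standard Cronbach’s alpha formula on a hypothetical student-by-item score matrix (the data and function name are illustrative, not from the presentation):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (students x items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 0/1-scored responses: 6 students x 4 items.
scores = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])
print(f"alpha = {cronbach_alpha(scores):.2f}")   # 0.67 for this toy data
```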

  15. Reliability • Inter-Rater • Percent Agreement • Cohen’s Kappa
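Both inter-rater statistics can be computed directly; a sketch with hypothetical pass/fail ratings from two scorers (the helper names are illustrative):

```python
import numpy as np

def percent_agreement(r1, r2):
    """Proportion of cases on which the two raters agree."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    return np.mean(r1 == r2)

def cohens_kappa(r1, r2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    categories = np.union1d(r1, r2)
    p_observed = np.mean(r1 == r2)
    # Chance agreement: product of each rater's marginal proportions.
    p_chance = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical pass/fail ratings of 10 essays by two scorers (1 = pass).
rater1 = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
rater2 = [1, 1, 0, 0, 1, 0, 1, 1, 1, 1]
print(f"percent agreement = {percent_agreement(rater1, rater2):.2f}")  # 0.80
print(f"kappa             = {cohens_kappa(rater1, rater2):.2f}")       # 0.52
```

Kappa is lower than raw percent agreement because some agreement is expected by chance alone.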

  16. Validity Scores must be reliable before they can be valid. • Content • Criterion-related • Predictive • Concurrent • Construct

  17. Predictive Validity (Criterion Related) See Handout

  18. Item Analysis • p-value • desired value – reliability is maximized when p is halfway between the floor and ceiling • 4 possible choices: floor is .25; ceiling is 1; desired value is .625 • calculate a few of your own • Desired total score mean • desired value – reliability is maximized when the total score mean is the sum of all the desired p-values • .625 + .625 + .75 = 2 • calculate a few of your own
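A sketch of the desired p-value rule above, generalized to any number of answer choices (the function name is illustrative):

```python
# Desired p sits halfway between the guessing floor, 1/(number of
# choices), and the ceiling of 1.0.
def desired_p(n_choices: int) -> float:
    floor = 1.0 / n_choices   # chance of guessing correctly
    ceiling = 1.0
    return (floor + ceiling) / 2

for n in (2, 3, 4, 5):
    print(f"{n} choices: floor={1/n:.3f}, desired p={desired_p(n):.3f}")

# 4 choices -> floor 0.250, desired p 0.625 (as on the slide).
# 2 choices (true/false) -> desired p 0.750, so a three-item test with
# two 4-choice items and one 2-choice item has a desired total score
# mean of 0.625 + 0.625 + 0.75 = 2.0.
```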

  19. Item Analysis • Discrimination Index (D-Index) • Determine a high group (top 27%) and a low group (bottom 27%) by total score • D = p_upper – p_lower • D should be greater than .30
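A sketch of the D-index computation; the 27% grouping and the .30 threshold follow the slide, while the item responses and total scores are hypothetical:

```python
import numpy as np

def discrimination_index(item_correct, total_scores, pct=0.27):
    """D = p_upper - p_lower, using the top and bottom 27% by total score."""
    item_correct = np.asarray(item_correct, dtype=float)
    order = np.argsort(total_scores)            # low scorers first
    n = max(1, int(round(pct * len(order))))
    p_lower = item_correct[order[:n]].mean()    # proportion correct, bottom group
    p_upper = item_correct[order[-n:]].mean()   # proportion correct, top group
    return p_upper - p_lower

# Hypothetical data: 0/1 result on one item and total test scores for 11 students.
item = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
totals = [48, 45, 44, 40, 38, 35, 30, 28, 25, 20, 15]
d = discrimination_index(item, totals)
print(f"D = {d:.2f}  ({'keep' if d > 0.30 else 'review'} the item)")  # D = 0.67
```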

  20. Item Analysis • Point-Biserial Correlation • Correlation between a dichotomous item and the total test score • A high value indicates a strong correlation between that item and the total test score • A high value does not indicate that a lot of respondents answered the item correctly • Cronbach’s Alpha “if deleted” • Distractors • (1 – desired p-value) / # of distractors • Do you have any that are not doing their job?
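A sketch of the point-biserial computation, reusing the hypothetical data from the D-index example; the distractor rule from the slide appears as a comment:

```python
import numpy as np

def point_biserial(item_correct, total_scores):
    """Point-biserial: Pearson correlation between a 0/1 item and total score."""
    return np.corrcoef(item_correct, total_scores)[0, 1]

# Hypothetical data (same as the D-index sketch above).
item = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
totals = [48, 45, 44, 40, 38, 35, 30, 28, 25, 20, 15]
print(f"r_pb = {point_biserial(item, totals):.2f}")

# Distractor check from the slide: each wrong option should attract
# roughly (1 - desired p) / (number of distractors) of the responses,
# e.g. (1 - 0.625) / 3 = 0.125 for a four-choice item. A distractor
# chosen by almost no one is "not doing its job."
```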

  21. Apply Basic Measurement Concepts of Reliability, Validity, and Test Construction

  22. Reliability • Internal Consistency • Test-Retest • Alternate Form • Test-Retest Alternate Form • Inter-Rater

  23. Validity • Content • Criterion-related • Construct

  24. Item Analysis • Discrimination Index (D-Index) • Distractor Analysis

  25. Demonstrate Why High Test Scores May Not Always Indicate Reliable, Valid Test Scores

  26. Apply Basic Concepts of Constructing a STAAR-Like Assessment.

  27. Test Construction Process Steps 1. Identify the primary purpose(s) for which the test scores will be used. 2. Identify behaviors that represent the construct or define the domain. 3. Prepare a set of test specifications, delineating the proportion of items that should focus on each type of behavior identified in step 2. 4. Construct an initial pool of items. 5. Have items reviewed (and revise as necessary). (Crocker and Algina, 2008)

  28. Test Construction Process Steps 6. Hold preliminary item tryouts (and revise as necessary) 7. Field-test the items on a large sample representative of the examinee population for whom the test is intended 8. Determine statistical properties of items scores and, when appropriate, eliminate items that do not meet pre-established criteria 9. Design and conduct reliability and validity studies for the final form of the test 10. Develop guidelines for administration, scoring, and interpretation of the test scores (e.g., prepare norm tables, suggest recommended cutting scores or standards for performance, etc.) (Crocker and Algina, 2008)

  29. STAAR Reliability • Mean • SD • Alpha • SEM = SD × √(1 – r) • Mean p-value • Validity • Scale Score – use SD and Mean • Equating • Vertical Score
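A worked example of the SEM formula with illustrative numbers (not actual STAAR statistics):

```python
import math

# SEM = SD * sqrt(1 - r): the standard error of measurement shrinks
# as reliability rises. The SD and alpha below are hypothetical.
sd = 10.0            # standard deviation of the test scores
reliability = 0.91   # e.g., a Cronbach's alpha of .91

sem = sd * math.sqrt(1 - reliability)
print(f"SEM = {sem:.1f}")   # 3.0: a score of 70 is roughly 70 +/- 3

# A 68% confidence band around an observed score is score +/- 1 SEM;
# lower reliability widens the band.
```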

  30. Scaling, Norming, Equating Scaling – assigning intervally scaled numerical values to raw scores. Norming – constructing conversion tables so that a particular raw score value can be interpreted in terms of its relative location and frequency within the total score distribution. Equating – a statistical process for expressing scores of one test on the scale of another with maximum precision (Osterlind, 2006).
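A minimal sketch of linear (mean-and-SD) equating consistent with Osterlind’s definition above, using hypothetical form statistics:

```python
# Map raw scores from Form X onto the scale of Form Y by matching the
# two forms' means and standard deviations. All numbers are made up.
mean_x, sd_x = 30.0, 6.0   # Form X raw-score mean and SD
mean_y, sd_y = 33.0, 5.0   # Form Y raw-score mean and SD

def equate(x: float) -> float:
    """Express a Form X raw score on the Form Y scale."""
    return (sd_y / sd_x) * (x - mean_x) + mean_y

for raw in (24, 30, 36):
    print(f"Form X {raw} -> Form Y {equate(raw):.1f}")
# 24 -> 28.0, 30 -> 33.0, 36 -> 38.0
```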

  31. So How Does this Help the Test Designer?

  32. So How Does this Help the Test Designer? • Provides information about how test scores remain stable over time (when to modify a test and when not to). • Provides information about how to produce an alternate (but equivalent) form of a test to prevent cheating. • Provides information about how reliably a test performs with respect to student ability. Are students with similar abilities getting similar questions correct? • Provides information that can be used to explain to students, teachers, and parents the reliability of test scores.

  33. So How Does this Help the Test Designer? • Provides information about how test items relate to the specified content. • Provides information about how test scores relate to other criteria (e.g., course grade, GPA, SAT, etc.). • Provides information about how groups of items on a test cluster together to measure a similar construct (e.g., math ability). • Provides information that can be used to explain to students, teachers, and parents the validity of test scores.

  34. Consolidating Efforts

  35. Benefits to Districts • Greater precision in measurement. • Make better decisions about item usage (increase information gathered from each item). • Make better decisions about the level of difficulty of tests. • Strategically substitute items from one test to another. • Provides an item clearinghouse so districts can trade items if they choose to do so.

  36. Review Today’s Objectives • Introduction to Test Theory • Components of a quality local assessment • Apply basic measurement concepts of reliability, validity, and test construction • Demonstrate why high test scores may not always indicate reliable, valid test scores • Apply basic concepts of constructing a STAAR-like assessment.
