STAAR-Like Quality Starts with Reliability


SLIDE 1

STAAR-Like Quality Starts with Reliability

SLIDE 2

Quality Educational Research

Our mission is to provide a comprehensive, independent, research-based resource of easily accessible and interpretable data for policy makers, school administrators, teachers, and parents to use in making decisions.

SLIDE 3

Objectives

  • Introduction to test theories
  • Components of a quality local assessment
  • Apply basic measurement concepts of reliability, validity, and test construction
  • Demonstrate why high test scores may not always indicate reliable, valid test scores
  • Apply basic concepts of constructing a STAAR-like assessment.

SLIDE 4

Test Theories

  • Classical Testing Theory (CTT)
  • Generalizability Theory (G Theory)
  • Item Response Theory (IRT)
SLIDE 5

Basic Concepts of Measurement

SLIDE 6

Measurement Error

Measurement error is a fundamental component of psychological testing.

Observed Score = True Score + Measurement Error

Measurement Error – due to the test administration, guessing, and other temporary fluctuations in behavior. Can you name more?

True Score – if we could measure a student’s ability in some area (e.g., math) an infinite number of times, the average of these scores would be the student’s true score.
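A minimal sketch of this idea in Python, using made-up numbers (the true score of 75 and the error spread are assumptions, not data from the presentation): each administration adds random error to the true score, and the average over many administrations converges to the true score.

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 75.0   # hypothetical true math ability (unknowable in practice)
error_sd = 5.0      # hypothetical spread of random measurement error

# Classical test theory: each observed score = true score + random error
n_administrations = 100_000
observed = true_score + rng.normal(0.0, error_sd, size=n_administrations)

print(observed[:5].round(1))       # individual scores bounce around 75
print(round(observed.mean(), 2))   # the long-run average is very close to 75
```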

SLIDE 7

Measurement Error

On February 5, 2012, 111.3 million people watched an event that is completely standardized and has as one of its core components measurement error. And they probably didn’t even know it. What was it? Super Bowl 46: New York Giants vs. New England Patriots

SLIDE 8

Measurement Error

After each play, how do the teams know where to start the next play? The referee “spots” the ball. How accurate is the spot? Everything in football begins with this fundamental component. What if the referee spots the football 1 yard too short or 1 yard too long? Are there rules on spotting the football?

Everything in football has been calibrated - the size of the football, the box and chains, the football field, the clock, etc. - but every play has measurement error in that it all depends on where the referee “spots” the ball.

SLIDE 9

Measurement Error

[Figure: football field graphic showing yard-line markers]

The advantage in football is instant replay - we don’t have this advantage in testing.

SLIDE 10

Reliability and Measurement Error

  • Measurement error is inversely related to reliability. In other words, as one goes up the other goes down.
  • We usually measure reliability from 0 to 1. So if reliability is 1, then we have no measurement error – this is an extreme case.
  • If reliability is 0, then we are not really measuring anything.
  • In order to increase “reliability”, we must decrease “error”.
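A minimal sketch of this inverse relationship, assuming normally distributed true scores and errors with made-up standard deviations (none of these numbers come from the presentation): as the error spread grows, the reliability coefficient, defined in classical test theory as the ratio of true-score variance to observed-score variance, shrinks toward 0.

```python
import numpy as np

rng = np.random.default_rng(1)
n_students = 10_000
true_scores = rng.normal(70, 10, n_students)   # hypothetical true-score distribution

for error_sd in (0.0, 5.0, 10.0, 20.0):
    observed = true_scores + rng.normal(0.0, error_sd, n_students)
    # CTT: reliability = true-score variance / observed-score variance
    reliability = true_scores.var() / observed.var()
    print(f"error SD = {error_sd:>4}: reliability ~ {reliability:.2f}")
```
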
SLIDE 11

Components of a Quality Local Assessment

SLIDE 12

Purpose

The purpose of the assessment will drive the test design.

  • Purpose 1: To classify students into distinct categories

Summative - pass/fail

  • Purpose 2: To provide information

Formative – learning experience for both student and teacher

SLIDE 13

Purpose

  • Purpose 1: To classify students into distinct categories.

Under this purpose, is there any reason to administer the test to the student who passes the test (missing only a few questions) or fails the test (only getting a few questions correct) every year?

  • Purpose 2: To provide information.

Under this purpose, is there any reason to administer the test to the student who passes the test (missing only a few questions) or fails the test (only getting a few questions correct) every year?

SLIDE 14

Reliability

  • Internal Consistency
  • Calculate Alpha (a sketch of the calculation follows this list)
  • Desired Items
  • Desired Reliability
  • Test-Retest Forms
  • Alternate Forms
  • Test-Retest Alternate Forms
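Coefficient alpha (Cronbach’s alpha) can be computed directly from a scored item matrix. A minimal sketch with a small, made-up set of 0/1 responses (the data are hypothetical, not from the presentation):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: a students-by-items matrix of scored responses (e.g., 1 = correct, 0 = incorrect)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical scored responses: 6 students x 4 items
scores = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])
print(round(cronbach_alpha(scores), 2))   # internal-consistency estimate for this toy test
```
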
SLIDE 15

Reliability

  • Inter-Rater
  • Percent Agreement
  • Cohen’s Kappa
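Both inter-rater statistics can be computed from paired ratings. A minimal sketch with hypothetical pass/fail scores from two raters (the data are made up): percent agreement is the share of identical ratings, and Cohen’s kappa corrects that share for the agreement expected by chance.

```python
from collections import Counter

# Hypothetical ratings from two scorers on the same ten student responses
rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "fail"]
n = len(rater_a)

# Percent agreement: proportion of responses on which the raters agree
p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Chance agreement from each rater's marginal proportions
counts_a, counts_b = Counter(rater_a), Counter(rater_b)
p_expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in counts_a)

# Cohen's kappa: agreement beyond chance
kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"percent agreement = {p_observed:.2f}, kappa = {kappa:.2f}")
```
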
SLIDE 16

Validity

Scores must be reliable before they can be valid.

  • Content
  • Criterion-related
      • Predictive
      • Concurrent
  • Construct
SLIDE 17

Predictive Validity (Criterion Related)

See Handout
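The handout is not reproduced here. As a stand-in, a minimal sketch of a predictive validity coefficient using entirely hypothetical numbers: the correlation between students’ scores on a local benchmark (the predictor) and the STAAR scale scores they earn later (the criterion).

```python
import numpy as np

# Hypothetical data: fall benchmark raw scores and the spring STAAR scale scores
# earned by the same ten students (illustrative values only)
benchmark = np.array([62, 70, 75, 80, 85, 88, 90, 93, 95, 98], dtype=float)
staar = np.array([1480, 1520, 1535, 1570, 1600, 1615, 1640, 1660, 1675, 1700], dtype=float)

# Predictive validity coefficient: correlation between predictor and later criterion
r = np.corrcoef(benchmark, staar)[0, 1]
print(f"predictive validity coefficient r = {r:.2f}")
```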

SLIDE 18

Item Analysis

  • p-value
  • desired value – reliability is maximized when p is halfway between the floor and ceiling
  • 4 possible choices: floor is .25; ceiling is 1; desired value is .625
  • calculate a few of your own
  • Desired total score mean
  • desired value – reliability is maximized when the total mean score is the sum of all the desired p-values
  • .625 + .625 + .75 = 2
  • calculate a few of your own (one way to calculate these values is sketched below)
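A minimal sketch of these two calculations in Python (the mix of item formats in the example is made up):

```python
def desired_p_value(n_choices: int) -> float:
    """Reliability is maximized when p is halfway between the chance floor and the ceiling of 1."""
    floor = 1.0 / n_choices   # chance level, e.g., .25 for a 4-choice item
    ceiling = 1.0
    return (floor + ceiling) / 2

print(desired_p_value(4))   # 0.625 for a 4-choice multiple-choice item
print(desired_p_value(2))   # 0.75 for a true/false item

# Desired total-score mean: the sum of the items' desired p-values
items = [4, 4, 2]   # hypothetical test: two 4-choice items and one true/false item
print(sum(desired_p_value(k) for k in items))   # 0.625 + 0.625 + 0.75 = 2.0
```
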
SLIDE 19

Item Analysis

  • Discrimination Index (D-Index)
  • Determine high (top 27%) and low group (bottom 27%)
  • D = p(upper group) – p(lower group)
  • D should be greater than .30
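A minimal sketch of the D-Index for a single item, using made-up item scores and total test scores (with only ten students, the 27% groups round to three students each):

```python
import numpy as np

def discrimination_index(item_correct: np.ndarray, total_scores: np.ndarray) -> float:
    """D = p(upper) - p(lower), using the top and bottom 27% of examinees by total score."""
    n = len(total_scores)
    k = max(1, round(0.27 * n))         # size of each extreme group
    order = np.argsort(total_scores)    # indices from lowest to highest total score
    lower, upper = order[:k], order[-k:]
    return item_correct[upper].mean() - item_correct[lower].mean()

# Hypothetical data: one item's 0/1 scores and the students' total test scores
item = np.array([1, 1, 1, 0, 1, 0, 1, 0, 0, 0])
totals = np.array([38, 35, 33, 31, 30, 28, 25, 22, 20, 15])
print(f"D = {discrimination_index(item, totals):.2f}  (we want D greater than .30)")
```
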
SLIDE 20

Item Analysis

  • Point Biserial Correlation
  • Correlation between a dichotomous item and the total test score
  • A high value indicates a strong correlation between that item and the total test score
  • A high value does not indicate that a lot of respondents answered the item correctly
  • Cronbach Alpha “if deleted”
  • Distractors
  • (1 – desired p-value) / # of distractors
  • Do you have any that are not doing their job? (see the sketch below)
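A minimal sketch of the point-biserial and the distractor check, again with made-up data: the point-biserial is simply the Pearson correlation between the 0/1 item scores and the total scores, and with a desired p of .625 and three distractors each distractor should attract roughly (1 – .625) / 3 of the responses.

```python
import numpy as np

# Hypothetical data: one dichotomously scored item and each student's total test score
item = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 1])
totals = np.array([34, 30, 18, 28, 15, 32, 20, 17, 29, 31])

# Point-biserial correlation: Pearson correlation of a 0/1 item with the total score
r_pb = np.corrcoef(item, totals)[0, 1]
print(f"point-biserial = {r_pb:.2f}")

# Distractor check: expected share of responses per distractor for a 4-choice item
expected_share = (1 - 0.625) / 3
print(f"expected share per distractor = {expected_share:.3f}")
```
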
SLIDE 21

Apply Basic Measurement Concepts of Reliability, Validity, and Test Construction

SLIDE 22

Reliability

  • Internal Consistency
  • Test-Retest
  • Alternate Form
  • Test-Retest Alternate Form
  • Inter-Rater
SLIDE 23

Validity

  • Content
  • Criterion-related
  • Construct
SLIDE 24

Item Analysis

  • Discrimination Index (D-Index)
  • Distractor Analysis
SLIDE 25

Demonstrate Why High Test Scores May Not Always Indicate Reliable, Valid Test Scores

SLIDE 26

Apply Basic Concepts of Constructing a STAAR-Like Assessment.

SLIDE 27

Test Construction

Process Steps

  • 1. Identify the primary purpose(s) for which the test scores will be used.
  • 2. Identify behaviors that represent the construct or define the domain.
  • 3. Prepare a set of test specifications, delineating the proportion of items that should focus on each type of behavior identified in step 2.
  • 4. Construct an initial pool of items.
  • 5. Have items reviewed (and revise as necessary).

(Crocker and Algina, 2008)

SLIDE 28

Test Construction

Process Steps

  • 6. Hold preliminary item tryouts (and revise as necessary).
  • 7. Field-test the items on a large sample representative of the examinee population for whom the test is intended.
  • 8. Determine statistical properties of item scores and, when appropriate, eliminate items that do not meet pre-established criteria.
  • 9. Design and conduct reliability and validity studies for the final form of the test.
  • 10. Develop guidelines for administration, scoring, and interpretation of the test scores (e.g., prepare norm tables, suggest recommended cutting scores or standards for performance, etc.).

(Crocker and Algina, 2008)

SLIDE 29

STAAR

Reliability

  • Mean
  • SD
  • Alpha
  • SEM = SD * sqrt(1 - r) (see the sketch below)
  • Mean P-Value

Validity

Scale Score – use SD and Mean

Equating

Vertical Score
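A minimal sketch of the SEM formula from the Reliability list above, with made-up summary statistics (the SD and alpha values are illustrative, not STAAR figures):

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical score SD of 8 points at two reliability (alpha) levels
print(round(sem(sd=8.0, reliability=0.90), 2))   # 2.53 - higher alpha, smaller SEM
print(round(sem(sd=8.0, reliability=0.75), 2))   # 4.0  - lower alpha, larger SEM
```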

SLIDE 30

Scaling, Norming, Equating

Scaling – assigning intervally scaled numerical values to raw scores.

Norming – constructing conversion tables so that a particular raw score value can be interpreted in terms of its relative location and frequency within the total score distribution.

Equating – a statistical process for expressing scores of one test on the scale of another with maximum precision (Osterlind, 2006).
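For equating, one common approach is linear equating, which places Form X scores on the Form Y scale by matching the two forms’ means and standard deviations. A minimal sketch with hypothetical raw scores (the method and the numbers are illustrative; the presentation does not specify which equating procedure is used):

```python
import numpy as np

# Hypothetical raw scores from two test forms given to comparable groups
form_x = np.array([18, 22, 25, 27, 30, 33, 35], dtype=float)
form_y = np.array([20, 24, 28, 31, 33, 36, 40], dtype=float)

def linear_equate(x: float, x_scores: np.ndarray, y_scores: np.ndarray) -> float:
    """Express a Form X raw score on the Form Y scale by matching means and SDs."""
    mx, sx = x_scores.mean(), x_scores.std(ddof=1)
    my, sy = y_scores.mean(), y_scores.std(ddof=1)
    return my + (sy / sx) * (x - mx)

print(round(linear_equate(25, form_x, form_y), 1))   # a Form X score of 25 on the Form Y scale
```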

SLIDE 31

So How Does this Help the Test Designer?

SLIDE 32

So How does this Help the Test Designer?

  • Provides information about how test scores remain stable over time (when to modify a test and when not to).
  • Provides information about how to produce an alternate (but equivalent) form of a test to prevent cheating.
  • Provides information about how reliably a test is performing with respect to student ability. Are students with similar abilities getting similar questions correct?
  • Provides information that can be used to explain to students, teachers, and parents the reliability of test scores.

SLIDE 33

So How does this Help the Test Designer?

  • Provides information about how test items relate to the specified content.
  • Provides information about how test scores relate to other criteria (e.g., course grade, GPA, SAT, etc.).
  • Provides information about how groups of items on a test cluster together to measure a similar construct (e.g., math ability).
  • Provides information that can be used to explain to students, teachers, and parents the validity of test scores.

SLIDE 34

Consolidating Efforts

SLIDE 35

Benefits to Districts

  • Greater precision in measurement.
  • Make better decisions about item usage (increase information gathered from each item).
  • Make better decisions about the level of difficulty of tests.
  • Strategically substitute items from one test to another.
  • Provides an item clearinghouse so districts can trade items if they choose to do so.

SLIDE 36

Review Today’s Objectives

  • Introduction to Test Theory
  • Components of a quality local assessment
  • Apply basic measurement concepts of reliability, validity, and test construction
  • Demonstrate why high test scores may not always indicate reliable, valid test scores
  • Apply basic concepts of constructing a STAAR-like assessment.