Operational Research in Assessment Programs as a Window into Task and Item Design Principles: Examples from NAEP
Panel: Madeleine Keehner, Hilary Persky, and Luis Saldivia, Educational Testing Service
Discussant: Robin Hill, Kentucky
Overview of key aspects of human cognition that are relevant to item and task design
Research findings and theory from cognitive science
Madeleine Keehner
Design Decisions in Assessment Development
What we measure: Constructs – target KSAs
How we measure: Task structure, item types, response modes, interactive capabilities, design devices, graphics, text, media, layouts…
These Design Decisions Impact Key Processes
Task structure, item types, response modes, interactive capabilities, design devices, graphics, text, media, layouts…
Cognitive: perception and attention; WM load, executive functions; intrinsic/extraneous load; LTM schema activation; metacognition
Behavioral: affordances for action; embodiment
Social: collaborative; communicative
Affective: engagement, motivation, enjoyment, frustration, boredom
Zooming in on Cognition and Behavior
[Diagram: perception and attention · working memory · long-term schema · action planning and control]
How do external item and task design features influence these internal cognitive processes?
How External Design Features Affect Internal Processes
Attention can be captured by salient features; it can be directed through signaling. Perception can be overloaded by too much information.
How External Design Features Affect Internal Processes
Total processing load may exceed WM capacity. With good design, extraneous load can be minimized and intrinsic load can be optimized.
How External Design Features Affect Internal Processes
Familiar response modes, technology, or task types can activate learned schema and reduce WM load. Schema may be inappropriately triggered by familiar-feeling formats.
How External Design Features Affect Internal Processes
The affordances of a display can make some behaviors more likely. We may not know what behaviors we are ‘inviting’ with our design.
Conclusion: External Representations Affect Internal Processes
External item and task design features interact with internal cognitive processes
NAEP Reading Example: Insights from Pretesting an Innovative Interface Design
NAEP eReader design problem:
- How to present reading passages and items on tablet
- Allow students to interact fluently with them
- Gather evidence of reading processes
- Full-screen presentation would allow for the widest variety of passages
- Items presented in a separate window or panel would allow for a wide variety of item types
- Navigational aids provided to facilitate navigation between items and passage
Comparison of Different Layouts
[Screenshots: “Fish Fossils” and “Dinosaur Skeleton” passages shown in the compared layouts]
1- vs. 2-column passage; items swiped in from the right side
- WM load if items not always visible?
- How do interactive behaviors differ with visual occlusion?
Look-back buttons in items
- Schema for use?
- Sufficiently salient?
Interaction Behaviors: Swiping Items On and Off
- Swiping (L and R) happened more in layouts where items overlap text (two-column passages); see the tallying sketch after this list
– Where there was no overlap (one-column, blue), students still swiped L (on) but hardly ever swiped R (off)
- Item is visible all the time
- Is this too different from P&P?
- Does it change the way students read/search?
- Two-column layouts: 4th and 8th graders differed
– 4th graders: swiped on and then off
– 8th graders: swiped on, did other actions, then swiped off
- Some performance differences: G4 did a little better with 1-column, G8 had longer CRs with 1-column
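To make the swipe analysis concrete, here is a minimal sketch of how such process-data events might be tallied by layout and grade. The record structure and event labels ("swipe_on", "swipe_off") are assumptions for illustration; the actual NAEP log schema is not shown in this presentation.

```python
from collections import Counter

# Hypothetical event records, one dict per logged student action.
# Field names and event labels are assumed for illustration only.
events = [
    {"student": "s1", "grade": 4, "layout": "two-column", "event": "swipe_on"},
    {"student": "s1", "grade": 4, "layout": "two-column", "event": "swipe_off"},
    {"student": "s2", "grade": 8, "layout": "one-column", "event": "swipe_on"},
]

# Tally swipe events by layout, grade, and direction to compare
# behavioral affordances across the two passage layouts.
counts = Counter((e["layout"], e["grade"], e["event"]) for e in events)
for (layout, grade, event), n in sorted(counts.items()):
    print(f"{layout} grade {grade}: {event} x{n}")
```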
Overall Insights and Eventual Design Decisions
- Different behavioral affordances from 1- and 2-column layouts
– Students do not remove items if they are not occluding text
- Suggests less cognitive effort to leave on – only removed when in the way
– Performance similar but not identical (note: no P&P baseline)
– More process information when swiping on and off
– Always-visible items might change reading strategy/approach (different from P&P)
– Expert committee decision: two-column layout is an appropriate operational trade-off
– (Note: interface design still evolving)
- Use of look-back buttons in items hardly ever observed
– Interview questions indicated students had not noticed them
– Suggests no schema to look for them, and not salient enough to capture attention
– Design tweak: visual salience was enhanced; instruction added to tutorial
Take-Home 1: Design Decisions Impact Basic Processes, and the Reverse Should Also Be True
Task structure, item types, response modes, interactive capabilities, design devices, graphics, text, media, layouts…
Cognitive: perception and attention; WM load, executive functions; intrinsic/extraneous load; LTM schema activation; metacognition
Behavioral: affordances for action; embodiment
Social: collaborative; communicative
Affective: engagement, motivation, enjoyment, frustration, boredom
Knowledge of these basic processes should also impact our design decisions.
Take-Home 2: Interdisciplinary Collaboration Is Needed to Do Justice to Both What and How
- Assessment developers
– Subject-matter content expertise, item and task design experience
- Learning scientists
– Subject-relevant cognitive and learning expertise
- Cognitive scientists
– Expertise in general cognitive, metacognitive, behavioral, social, and affective processes; usability and cognitive research methods; human-computer interaction, etc.
– (And many others, of course…)
Take-Home 3: More and Better Research Needed
- Traditional items are supported by decades of psychometric research
– Empirical data: item response characteristics, validity studies, etc.
- Digital assessments allow many more options for:
– Varied stimuli and representations
– Different response modes and response behaviors
– Other kinds of behaviors and interactions
- Psychometric approach alone may not be enough
– Basic properties of cognition need to be examined and considered a priori
– Requires experimental cognitive research methods and analyses
– Meanwhile, let’s look at some insights from operational pretesting studies…
A Pretesting Study: Effects of Avatars (and Leveling) in SBTs on Students
Hilary Persky
Background
- The affordances of DBA allow assessments to better reflect authentic reading experiences, which are purpose-driven, at times collaborative, and involve various types and levels of support.
- Many believe the construct of reading comprehension has broadened with the advent of digital literacies.
- Purpose-driven tasks have been taken up by the next-generation state assessments as well as national and international assessments (PIRLS and PISA).
Why the study?
- Avatars used in new NAEP reading tasks to:
- introduce and reaffirm overall task and specific activity purposes
- simulate conversation/collaboration
- assist in task transitions
- reset student understanding (leveling)
- Some stakeholder concerns:
- Do avatars add cognitive load?
- Are avatars actually engaging?
- Does “leveling” negatively affect students?
Study Questions
- Main focus: Does having student avatars affect
- Test performance?
- Test-taking behaviors?
- Affective responses?
- Do we see any effects of leveling?
Study Design
- Two assessment tasks: literary and informational
- Two versions of each task
– Avatar vs. Non-avatar
- Leveling in both versions
- Student survey on
– Preferences and affective responses
– Background information (digital access; reading motivation)
Study Approach
- Tryout (like normal admin):
- 100 students recruited from the DC area
- Randomly assigned to the Avatar or Non-avatar conditions (each student took only one task)
- Cog labs (one-on-one; think-aloud, eye tracking, post-task interview):
- 12 students, recruited from Trenton, Ewing, Princeton
- Randomly assigned to the Avatar and Non-avatar conditions
Tryout Performance Results
- No significant effects on total task scores or item scores
- The number of high- and low-performing students was similarly distributed in the avatar and non-avatar conditions.
- No significant interactions with gender, race/ethnicity, SES, or digital access (based on survey items included in the tryout).
Tryout Process Data Results
- No significant effect of avatars on reading behaviors such as reading speed or the number of page turns.
- No significant effect of avatars on question-answering behaviors such as the number of times answers are changed, back navigation, or specific item behaviors, such as select-in-passage behavior.
- No significant effects of avatars on time use (that is, time on reading or items); a sketch of this kind of condition comparison follows below.
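As a hedged illustration of the kind of comparison behind these null results, the sketch below tests for a condition effect on a continuous process measure. The reading-speed values are invented, not the tryout data.

```python
from scipy import stats

# Illustrative reading speeds (words per minute) for the two conditions;
# these numbers are invented for demonstration only.
avatar = [182, 175, 190, 168, 201, 177, 185]
non_avatar = [179, 181, 173, 188, 170, 176, 192]

# Welch's independent-samples t-test, one plausible way to test for a
# condition effect on a process measure like reading speed.
t, p = stats.ttest_ind(avatar, non_avatar, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")  # a non-significant p would match the null results above
```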
BUT: Tryout survey affective results show differences…
[Bar charts: responses to “How easy or hard was this task?” for the literary and informational tasks, Avatar vs. Non-Avatar, on a five-point scale from Very easy (1) to Very hard (5).]
Results suggest students in the avatar conditions perceived the tasks as easier.
[Bar charts: agreement with “The pictures and conversations made the task more interesting” and “The conversations made the task less interesting,” for the literary task and the informational task (Avatar version), on a six-point scale from Strongly Disagree (1) to Strongly Agree (6).]
Leveling survey responses
- “I felt annoyed when the task gave me answers to questions I had just answered”: for both tasks, significantly more students disagreed.
- “Getting an answer to a question I had just completed made me more confident about answering the next question”: for both tasks, significantly more students agreed.
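Because survey ratings like these are ordinal, a rank-based test is one reasonable way to compare conditions. A minimal sketch, using invented ratings rather than the survey data above:

```python
from scipy import stats

# Hypothetical 5-point difficulty ratings (1 = Very easy ... 5 = Very hard);
# invented values, not the actual survey responses.
avatar = [2, 2, 3, 1, 2, 3, 2, 4, 2, 3]
non_avatar = [3, 4, 3, 2, 4, 3, 5, 3, 4, 3]

# Mann-Whitney U test: appropriate for ordinal Likert-type responses.
u, p = stats.mannwhitneyu(avatar, non_avatar, alternative="two-sided")
print(f"U = {u:.1f}, p = {p:.3f}")
```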
Tryout Survey Summary
- Students perceived the Avatar version as easier and equally or more interesting.
- On average, students in the Avatar condition were positive to neutral about the images and conversations.
- On average, students were positive to neutral about leveling.
Cog lab student comments
- They made me feel like I already knew the book and read it several times, like they would come to me for help if I was the teacher. It gave me specific parts of the book that I would read, and then the avatars would ask questions about it so I felt like I was explaining it to them.
- They made you look at things you wouldn’t think about if reading it by yourself.
- Seeing their interpretation helped me connect back to the story, made it a little easier. I think the avatars helped me to understand the story because they had similar questions that I had.
- Classmates (avatars) make it easier because they do the reading of what you would normally have to read and give your brain a rest. With avatars is more interactive because you don’t get bored and zone out as you would in normal reading tasks in school.
- They provided guidance and direction. It was more personal, not all directions given by the computer. It felt almost real and like I was working with them a little bit…
- When they communicate with each other it was like working with students in class, like when two other students are talking to each other and I am listening to them.
- It was different. Usually we just get the questions and multiple choice answers. It was cool but didn’t help or hurt me.
- I guess, kind of collaborating, but they couldn’t actually talk to me or respond to what I was saying. It didn’t feel like a real interaction, they can’t comment on my statements.
- Leveling by avatar: Some of it was funny because the answers were kind of obvious, but it was cool to feel like you were having a conversation with someone and see what they are thinking and where they are coming from and explain why. Good to hear opinions other than my own.
- Leveling by avatar: I guess it could have been that answer, but it doesn’t matter that much what she (avatar) said. Not annoying, just whatever. It had no effect and wouldn’t change my approach/answers.
Take-aways
- Avatars do not seem to add cognitive load, and students do seem to find them (mostly) engaging.
- On average, students perceived leveling as not annoying, and it gave them confidence to answer the next question in the task.
- Avatars and leveling are not just surface design features, but construct-relevant features afforded by DBA to measure reading comprehension.
New study to dig further
- Purpose: Study the effect of SBT features on students’ reading performance, reading behaviors, and engagement.
- In the context of a full NAEP pilot, examine students’ performance and processes on SBTs in comparison with discrete item (DI) blocks using the same texts and items as the SBTs, but without any of the SBT features (e.g., avatars, leveling, sequencing).
- Developed special study student questionnaire items from recent literature on student engagement, motivation, persistence, and self-efficacy (Guthrie & Klauda, 2014), as well as established NAEP survey design principles.
- Analysis about to begin!
NAEP Mathematics Pretesting Findings
Luis Saldivia
NCSA, June 27, 2018
NAEP Mathematics Item Types
2015 Operational: Multiple Choice, Constructed Response
2017 Operational: Multiple Choice, Constructed Response, Multiple Select, Matching Zones, Grids, In-line Choice (Drop-down), Bar Graph, Box Plot
Purpose of the Pretest Study
- The study consisted of small-scale tryouts of a selection of NAEP mathematics 2017 discrete items. In tryouts, students answer items in timed, assessment-like conditions. Goals:
- Gather data about item response times (RTs)
- Investigate item performance
- Systematically test the effects of presentation format and response mode by varying item type while holding constant the item content
Design
- Inline vs. SSMC: Compare response times and scores. Examine whether inline choice formats appear to produce greater usability or construct-irrelevant cognitive challenges, compared with traditional SSMC.
- MSMC: Compare two variants of MSMC items, with and without the number of selections specified. Compare number and range of selections made and resulting scores.
- Zone: Compare two variants of MS zone-selection items, with and without the number of selections specified. Compare number and range of selections made and resulting scores.
- Grid vs. MSMC: Compare selection behaviors, specifically number of choices selected and number of options left blank.
Results – Timing
Inline Choice vs. SSMC
- Six pairs of items were compared at each grade (a sketch of one such pairwise comparison follows below)
- The findings suggest that inline choice is equivalent to SSMC in terms of effects of presentation format and response selection mode on performance and speed.
- Conclusion: The item content should drive the selection of the best item format to meet the requirements of the item. For instance, inline choice can be used for content that requires linking ideas, such as claims with evidence.
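Because each pair holds item content constant and varies only the format, a paired comparison across item pairs is a natural analysis. A minimal sketch with invented response times:

```python
from scipy import stats

# Invented median response times (seconds) for six item pairs; each pair
# holds content constant and varies only the response format.
inline_rt = [41.2, 55.0, 38.7, 62.3, 47.5, 51.8]
ssmc_rt = [43.0, 53.4, 40.1, 60.9, 46.8, 52.5]

# Paired t-test across item pairs; a null result is consistent with the
# reported equivalence of the two formats in speed.
t, p = stats.ttest_rel(inline_rt, ssmc_rt)
print(f"t = {t:.2f}, p = {p:.3f}")
```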
Zone and MSMC With vs. Without Number of Selections Specified
- Two Zone and two MSMC versions were given at each grade
- It is clear from the data that students do understand the requirement to select more than one option
- There is some indication that specifying the number of selections reduces the variance in the number of selections made (see the sketch after this list)
- Students do not adhere to the instruction specifying the number
– It is not possible to know from the present data
- whether students do not notice (or forget) the instruction
- whether they do attend to it but deliberately choose a different number
- In two cases out of eight contrasts, scores were significantly higher in the number-specified variant
- Given that students did not adhere to the number specified, it is not clear whether giving this instruction is systematically beneficial
- Item contents were at least as important for raw score differences as the instruction to select a particular number of options
- Side notes:
– There does not appear to be a trend for MSMC items to be easier or harder than Zone items
– There is no difference in the response times for these item types, and overall there is no evidence from either scores or RTs that students have difficulty with the zone-selection response mode
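A hedged sketch of how the variance claim might be tested: a test for equality of variances across the two instruction variants, using invented selection counts rather than the study data.

```python
from scipy import stats

# Invented counts of options selected per student on a multiple-select item,
# with vs. without the number of selections specified in the stem.
specified = [3, 3, 3, 2, 3, 4, 3, 3]
not_specified = [2, 5, 3, 1, 4, 3, 6, 2]

# Levene's test for equality of variances; a significant result would support
# the observation that specifying the number reduces the spread of selections.
w, p = stats.levene(specified, not_specified)
print(f"W = {w:.2f}, p = {p:.3f}")
```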
MSMC vs. Grid Items
- Four pairs of variants per grade
- In Grid items, students were almost universally likely to fill all rows. By contrast, in MSMC items, the number of response selections varied considerably and tended to cluster around the middle of the available range, with very few students making the maximum number of selections.
- Original scoring rubrics tended to benefit the MSMC (a counting sketch follows below)
– Partial scores allowed answers with some blank responses, which rarely occurs in Grid items
– Penalizing incorrect selections is more likely in Grid items, since students rarely leave options blank even if they are unsure of the correct selection
– For these reasons, Grid items tended not to receive partial scoring
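To illustrate the blank-rate contrast, here is a minimal sketch that counts blank options per student under a hypothetical response encoding (the encoding is an assumption, not the study's data format).

```python
# Hypothetical response vectors over five statements. None = left blank;
# True/False = an explicit Grid (True/False) selection. In this invented
# encoding, an MSMC "selection" is True and an unselected option is None.
grid_responses = [
    [True, False, True, True, False],
    [True, True, False, True, True],
]
msmc_responses = [
    [True, None, True, None, None],
    [None, True, None, True, None],
]

def mean_blanks(responses):
    """Mean number of blank options per student response vector."""
    return sum(r.count(None) for r in responses) / len(responses)

print(f"Grid blanks per student: {mean_blanks(grid_responses):.1f}")  # near zero
print(f"MSMC blanks per student: {mean_blanks(msmc_responses):.1f}")  # often several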
MSMC vs. Grid Items
- Dichotomous rubrics also advantaged MSMC items
– Grid items tend to encourage attempts on all rows. Students may be more likely to guess when they are not sure, since they believe they must provide a response for all instances.
– We cannot assume that the unselected options in MSMC items are equivalent to False selections on Grid items – some may be equivalent to False, while others may be equivalent to Don’t Know, and in those cases students may not choose to guess
– In a Grid format, we assume that those same Don’t Know instances tend to get instantiated in a selection, which in those cases would be a guess or a less-than-certain selection
MSMC vs. Grid Items
- It is important to develop scoring rubrics for Grid items that take account of the affordances of this layout and produce equivalent scores for cognitively equivalent items (a hedged scoring sketch follows this list)
- Grid item rubrics should not assume that any options will be left blank, and the treatment of incorrect selections should take into account the greater likelihood of guessing
- Grid items have distinct measurement properties and are by design not analogous to MSMC items
- Grid items appear to have different cognitive and even metacognitive affordances
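As a hedged sketch of the rubric point (not the actual NAEP rubrics), the functions below score the two formats differently: MSMC blanks go unpenalized, since a blank may mean "False" or "Don't Know", while Grid responses are compared row by row because every row is answered.

```python
def score_msmc(selected, key):
    """Illustrative MSMC scoring: credit correct selections, subtract
    incorrect ones, and leave blanks unpenalized (a blank may mean
    'False' or 'Don't Know')."""
    hits = len(selected & key)
    false_alarms = len(selected - key)
    return max(hits - false_alarms, 0)

def score_grid(responses, key):
    """Illustrative Grid scoring: every row carries a True/False answer,
    so count matching rows; no separate blank category exists."""
    return sum(1 for row, ans in responses.items() if key[row] == ans)

# Hypothetical answer keys and responses.
key_msmc = {"A", "C"}
print(score_msmc({"A", "C", "D"}, key_msmc))  # 1: two hits, one false alarm

key_grid = {"A": True, "B": False, "C": True}
print(score_grid({"A": True, "B": True, "C": True}, key_grid))  # 2 matching rows
```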
MSMC vs. Grid Items
- One potential advantage of Grid items might come from their tendency to make students attempt all selections.
- If a rubric is designed carefully, this property might be helpful for distinguishing students who are leaving MSMC options blank to indicate ‘not True’ versus those whose blank responses indicate ‘Don’t Know’.
- Given the layout of Grid items, it might even be possible to incorporate a third column (e.g., “Cannot be determined”) that could make such a distinction explicit, which is something that would be difficult to do with MSMC formats.
Next Steps – Other Item Types
[Matching item layout: source options A B C D E; target slots 1 2 3]
Number of Actions
Sequence of Student Actions – Representing Numbers Symbolically
- Analyze patterns in the sequence of actions (exactly 3 choices)
- Do students work from graphical to symbolic representation (“target-focused”) or symbolic to graphical (“source-focused”)? A classification sketch follows below.
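One plausible way to operationalize this classification from action logs is a majority-region heuristic over the student's first actions. This is a sketch; the "region" tag per logged action and the heuristic itself are assumptions, not the study's actual coding scheme.

```python
def classify(actions):
    """Classify a matching-item action sequence as target- or source-focused
    by majority region among the student's first three actions. The 'region'
    field per logged action is hypothetical, assumed for illustration."""
    regions = [a["region"] for a in actions[:3]]
    if regions.count("target") >= 2:
        return "target-focused"
    if regions.count("source") >= 2:
        return "source-focused"
    return "no clear focus"

# Example: a student who starts from the symbolic targets.
sequence = [{"region": "target"}, {"region": "source"}, {"region": "target"}]
print(classify(sequence))  # target-focused
```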
Number of Actions
All students: 69% made exactly 3 actions; 31% made more than 3 actions
Target- vs. Source-focused: 44% target-focused, 13% source-focused, 12% no clear focus
Item Scores
Turn and Talk
Discussant/Reactant
- High-quality assessments: Can states have them without considering item and assessment constructs and structures, including the student experience (performance, reaction, engagement)?
- How should states begin to examine all of the underlying metadata regarding how students interact with the items and the construct of the assessment?
- What are the potential barriers for large-scale assessments?
– Lack of collaboration among cognitive scientists, content scientists, and assessment developers – from Keehner’s slide (17)
– Lack of funding to support this research
– Time to complete research and react to findings
Takeaways
- This is not just a “NAEP” issue.
- Design decisions matter beyond just having more robust items and assessment constructs.
- DBAs should not simply replicate P&P assessments. There is so much to gain from the underlying metadata that can be provided by a DBA.
- Refreshing to know this research is happening related to the NAEP items