Improving the quality of assessments in medical education Professor Reg Dennick
Professor of Medical Education School of Medicine
University of Nottingham United Kingdom
Overview
– Assessment of learning outcomes & competencies
– Curriculum alignment
– Quality assurance of assessment
– Standard setting
– Validity & reliability
– Assessment as measurement
– Psychometrics as a method of quality control
– Classical test theory: reducing ‘noise’
– Item analysis
– Generalisability theory
– Factor analysis
– Item Response Theory: Rasch modelling
– Summary
‘The Domains of Learning’
The cognitive (knowledge) domain (2001)
General Medical Council (1993, 2002, 2009)
Outcome-based curricula
“Doing the right thing”
“Doing the thing right”
Using outcomes and competencies leads to ‘Constructive Alignment’
Curriculum Alignment
Diagram: Planned → Taught → Learned → Assessed, with each stage aligned to the others.
‘How can you plan or assess a curriculum if you don’t know what its outcomes should be?’
The Professional challenge
Can defining and listing outcomes and competencies equate to a definition of professional performance that can be assessed?
Essays, MCQs, OSCEs, DOPS, Mini-CEX, MSF, CBD, Simulation
Work-based assessments (WBAs)
Miller’s Triangle
The Dreyfus Model
Novice
– Rigid adherence to taught rules or plans: ‘context-free elements’
– Little situational perception

Advanced beginner
– Guidelines for action based on attributes or aspects (aspects are global characteristics of situations recognisable only after some prior experience)
– Situational perception still limited

Competent
– Coping with crowdedness (pressure)
– Now sees actions at least partially in terms of longer-term goals
– Conscious, deliberate planning and problem solving

Proficient
– Sees situations holistically rather than in terms of aspects
– Sees what is most important in a situation
– Uses intuition and ‘know-how’

Expert
– No longer predominantly relies on rules, guidelines or maxims
– Intuitive grasp of situations based on deep tacit understanding
– Analytic approaches used only in novel situations or when problems occur
– Objectivity
– Validity
– Reliability
– Assessor training/skills
– Psychometric evaluation
Modified from: Tavakol & Dennick (2011) “Post-examination analysis of objective tests”. Medical Teacher. 33(6):447-58
The Examination Cycle
Diagram showing the stages of the cycle: learning outcomes and teaching experiences; question writing; exam drafting; piloting; standard setting; assessment; post-examination analysis (item analysis, factor analysis, cluster analysis, Rasch modelling, G study); feedback reports; new and banked questions.

Standard setting
A standard is a statement about whether an examination performance is good enough for a particular purpose:
– A defined score that serves as the boundary between passing and failing
Standards judge examinees’ performances against social or educational constructs:
– Student ready for next phase: progression
– Student ready for graduation
– Competent practitioner
The Standard Setting Problem
Diagram: Pass/Fail decisions plotted against Competent/Incompetent candidates.
The method has to be:
Relative methods
– Based on a comparison among the performances of examinees: students are compared with each other, to the norm or group average
– Marks are normally distributed and grade boundaries are inserted afterwards according to defined standards
– A set proportion of candidates fails regardless of how well they perform; students pass or fail and are graded depending on the norm

Absolute methods
– Based on how much the examinees know: students are assessed against specific criteria of competence
– Candidates pass or fail depending on whether they meet the specified criteria and demonstrate the required competencies

Absolute methods in which judgments are made about individual examinees:
– Borderline group method
– Contrasting group method

Absolute methods in which judgments are made about individual test items:
– Angoff’s method
– Ebel’s method
Angoff’s method
– Judges require subject knowledge and knowledge of assessments
– Variability in judgments is reduced by using multiple assessors
– The judges make judgments about the proportion of ‘borderline’ test-takers who would have answered each item correctly
– The passing score is calculated from these judgments (a minimal sketch follows)
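As a rough illustration, the pass mark can be obtained by averaging the judges’ estimates across items; the judgment values below are purely hypothetical and not from the lecture.

```python
# Hypothetical Angoff data: each row is one judge's estimates of the proportion
# of 'borderline' candidates who would answer each of five items correctly.
judgments = [
    [0.60, 0.75, 0.50, 0.80, 0.65],  # judge 1
    [0.55, 0.70, 0.45, 0.85, 0.60],  # judge 2
    [0.65, 0.80, 0.55, 0.75, 0.70],  # judge 3
]

# Average across judges for each item, then across items to get the pass mark.
item_means = [sum(col) / len(col) for col in zip(*judgments)]
pass_mark = sum(item_means) / len(item_means)
print(f"Angoff pass mark = {pass_mark:.1%}")   # 66.0% for these hypothetical values
```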
Ebel’s method
– Judges read each item and assign it to one of the categories in a classification table, which combines difficulty (Easy, Medium, Hard) with relevance (Essential, Important, Nice to know)
– For each category, an estimate of the proportion of questions the ‘borderline’ candidate gets right is made
– The passing score is calculated from the table
              Easy    Medium   Hard
Essential     0.95    0.85     0.80
Important     0.90    0.80     0.75
Nice to know  0.80    0.60     0.50
The proportion for each category is multiplied by the number of questions in that category, and the products are summed to give the passing score.
Category       Difficulty   Proportion   # Questions   Score
Essential      Easy         0.95     ×   3        =    2.85
Essential      Medium       0.85     ×   2        =    1.70
Essential      Hard         0.80     ×   2        =    1.60
Important      Easy         0.90     ×   3        =    2.70
Important      Medium       0.80     ×   4        =    3.20
Important      Hard         0.75     ×   4        =    3.00
Nice to know   Easy         0.80     ×   2        =    1.60
Nice to know   Medium       0.60     ×   2        =    1.20
Nice to know   Hard         0.50     ×   3        =    1.50
Total                                    25            19.35
Pass mark = 19.35/25 = 77.4%
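A minimal sketch of the same calculation; the proportions and question counts are those in the worked example above.

```python
# (proportion a borderline candidate gets right, number of questions) per category
ebel_grid = {
    ("Essential", "Easy"): (0.95, 3),     ("Essential", "Medium"): (0.85, 2),
    ("Essential", "Hard"): (0.80, 2),     ("Important", "Easy"): (0.90, 3),
    ("Important", "Medium"): (0.80, 4),   ("Important", "Hard"): (0.75, 4),
    ("Nice to know", "Easy"): (0.80, 2),  ("Nice to know", "Medium"): (0.60, 2),
    ("Nice to know", "Hard"): (0.50, 3),
}

expected_correct = sum(p * n for p, n in ebel_grid.values())   # 19.35
total_questions = sum(n for _, n in ebel_grid.values())        # 25
print(f"Pass mark = {expected_correct:.2f}/{total_questions} "
      f"= {expected_correct / total_questions:.1%}")           # 77.4%
```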
Borderline group method
– In an OSCE exam, students are scored using a checklist but examiners also give each student a global rating of pass, borderline or fail
– The checklist scores of the students in the ‘borderline’ category are averaged to give the pass mark

Contrasting group method
– Students are scored using a checklist but also given a GLOBAL score of PASS or FAIL
– After the exam the two score distributions (fail group and pass group) are plotted and the pass mark is determined from the overlap between the two groups (a sketch of both calculations follows)
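A minimal sketch of both calculations on hypothetical station data; the midpoint rule used here for the contrasting group cut is only one simple convention for locating the overlap, not the method prescribed in the lecture.

```python
import numpy as np

# Hypothetical checklist scores and examiners' global ratings for one station.
checklist = np.array([12, 15, 18, 20, 22, 25, 27, 30], dtype=float)
rating = np.array(["fail", "borderline", "borderline", "pass",
                   "borderline", "pass", "pass", "pass"])

# Borderline group method: mean checklist score of the 'borderline' students.
borderline_cut = checklist[rating == "borderline"].mean()

# Contrasting group method: place the cut in the overlap between the fail and
# pass score distributions (here, simply midway between the two group means).
contrast_cut = (checklist[rating == "fail"].mean() +
                checklist[rating == "pass"].mean()) / 2

print(f"borderline group cut = {borderline_cut:.1f}, "
      f"contrasting group cut = {contrast_cut:.1f}")
```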
Choice of standard setting methods depends on:
– Credibility
– Resources available
– How high-stakes the exam is
Psychometrics is concerned with measuring the characteristics of assessments as well as attitudes and psychological traits. It involves the construction and validation of measurement tools such as exams, survey questionnaires and personality assessments. It examines how precisely and accurately a test measures what it purports to measure, and it provides methods to monitor and improve the quality of exams.
How can psychometrics improve student assessment?
– Items that perform poorly can be identified and then restructured or discarded
– It can inform the setting of a defensible pass mark
– It can be applied to written tests, OSCE checklists and global rating scales
– It can quantify the measurement error associated with students’ scores
Measures of Variability
Two examinations with the same mean can show a very different pattern of scores: in Examination A the marks are widely spread (some above 90), whereas Examination B shows few students at either extreme.
Standard Deviation (SD)
The SD is used when interval and ratio data are described; it measures the spread of scores around the mean of the distribution.
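For example (the marks below are hypothetical), the mean and SD of a set of exam scores can be computed as:

```python
import numpy as np

scores = np.array([42, 55, 61, 48, 70, 66, 58, 53], dtype=float)  # hypothetical marks
print(f"mean = {scores.mean():.1f}, SD = {scores.std(ddof=1):.1f}")
```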
Does the test measure what it was designed to test? This is the question of validity.
Classical Test Theory (CTT)
In CTT a student’s observed test score is the sum of a ‘true’ score and measurement error (Observed = True + Error); the true score reflects the student’s underlying ability.
Reliability estimates inform us about the amount of measurement error (“noise”) associated with a test score
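A toy simulation (all values hypothetical) of the CTT decomposition Observed = True + Error illustrates reliability as the fraction of observed-score variance that is true-score variance:

```python
import numpy as np

rng = np.random.default_rng(1)
true_scores = rng.normal(60, 10, size=5000)   # students' 'true' ability
error = rng.normal(0, 5, size=5000)           # random measurement noise
observed = true_scores + error                # observed = true + error

reliability = true_scores.var() / observed.var()
print(f"simulated reliability = {reliability:.2f}")  # close to 100/(100+25) = 0.80
```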
Sources of measurement error: the testee, the test and the tester.
A reliable test measures the constructs in the test consistently. There are several approaches to estimating the reliability of a test; internal consistency measures are popular for estimating the reliability of tests, and most statistical packages will calculate the coefficient alpha.
Coefficient alpha (Cronbach’s alpha)
A high alpha does not necessarily indicate good internal consistency, because alpha is also affected by the length of the test. A very high alpha may indicate redundant items and suggest that the test length should be shortened. Examiners should not rely on published alpha estimates and should measure alpha each time the test is administered.
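A minimal sketch of the alpha calculation from a students-by-items score matrix (the function name and data layout are my own choices, not from the lecture):

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for a 2-D score matrix: rows = students, columns = items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                              # number of items
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of students' totals
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
```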
Item analysis
Item analysis examines how students performed on individual items on the test. Whereas reliability describes the performance of the test as a whole, item analysis examines individual items, not the overall test. The results can be used to improve the overall quality of your test: they help you decide which items to keep on a test and which items to modify or eliminate.
Item Difficulty “p”
Item difficulty is the proportion of students who correctly answer the item; this proportion is known as the p-value.
p = (number of students correctly answering the item) / (total number of students)
Item discrimination index “D”
The discrimination index measures how well an item distinguishes between students who are ‘strong’ and ‘weak’. The range of D is -1.00 to +1.00. D is the difference between the proportion of students with high total scores answering a question correctly and the proportion of students with low total scores answering it correctly (D = PH - PL). A negative D means the bottom students answered the item correctly more often than the top students.
Scatter plot: item difficulty (p) against discrimination (PH - PL, i.e. D), highlighting items with good discrimination, items that are too easy, and problem questions.
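A sketch of both indices for a 0/1 response matrix; the top/bottom 27% grouping is a common convention rather than something specified in the lecture.

```python
import numpy as np

def item_analysis(responses, group_fraction=0.27):
    """Item difficulty p and discrimination D = PH - PL.
    responses: 2-D 0/1 array, rows = students, columns = items."""
    responses = np.asarray(responses, dtype=float)
    totals = responses.sum(axis=1)                   # each student's total score
    k = max(int(round(len(totals) * group_fraction)), 1)
    order = np.argsort(totals)
    low, high = order[:k], order[-k:]                # bottom and top scorers

    p = responses.mean(axis=0)                       # proportion answering correctly
    d = responses[high].mean(axis=0) - responses[low].mean(axis=0)
    return p, d
```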
Items that are either too easy or too hard reduce the reliability of the test: they fail to discriminate between weak and strong students and so add little to the test as a measuring instrument.
Cronbach’s alpha and other reliability measures cannot identify the potential sources of measurement error associated with a student’s obtained mark. The answer is Generalisability Theory (GT).

GT partitions measurement error among the different ‘facets’ of an assessment that affect the quality of the test. If an OSCE uses a range of Simulated Patients (SPs), a range of examiners and different checklists to assess students’ performance on 20 stations, GT can calculate the amount of error caused by each facet. The error attributable to each facet is expressed as a Variance Component, calculated via an analysis of variance procedure (ANOVA) in SPSS, and the overall reliability is summarised as a G-Coefficient.
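The multi-facet OSCE design above would normally be analysed in dedicated software, but a one-facet students × items G study is small enough to sketch; the formulas below are the standard variance-component estimates for a fully crossed design, and the data layout is my own assumption.

```python
import numpy as np

def one_facet_g_study(scores):
    """One-facet crossed G study (students x items), a minimal sketch.

    scores: 2-D array, rows = students (p), columns = items (i).
    Returns estimated variance components and the relative G-coefficient.
    """
    scores = np.asarray(scores, dtype=float)
    n_p, n_i = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    item_means = scores.mean(axis=0)

    ss_p = n_i * ((person_means - grand) ** 2).sum()
    ss_i = n_p * ((item_means - grand) ** 2).sum()
    ss_res = ((scores - grand) ** 2).sum() - ss_p - ss_i   # person x item + error

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    var_res = ms_res                                  # sigma^2(pi,e)
    var_p = max((ms_p - ms_res) / n_i, 0.0)           # sigma^2(p): person variance
    var_i = max((ms_i - ms_res) / n_p, 0.0)           # sigma^2(i): item variance

    g_coefficient = var_p / (var_p + var_res / n_i)   # relative (norm-referenced) G
    return {"var_person": var_p, "var_item": var_i,
            "var_residual": var_res, "G": g_coefficient}
```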
Factor Analysis (FA)
Factor analysis examines how individual items contribute towards the cognitive structure or performance of a test. It identifies clusters of items that measure underlying constructs called Factors; a factor is a set of questions that are highly correlated. FA can be used to reveal the constructs contributing to the test and to identify and remove irrelevant and ‘noisy’ items. Each factor represents a different construct which should, in principle, be scored separately.
Exploratory Factor Analysis (EFA)
EFA is used to explore the number of factors and the pattern of factor loadings in the test, and it can be applied to examination papers.
Hypothetical example (EFA)
Question   F1     F2
Q9         0.92
Q2         0.92
Q6         0.81   0.21
Q10        0.79
Q4         0.79
Q1         0.73   0.36
Q3         0.69   0.16
Q5         0.03          (this question does not load)
Q7         0.01   0.96
Q8         0.01   0.96
% variance explained by each factor: 47.23 (F1), 23.60 (F2)
Factor 1 is identified in Q9, Q2, Q6, Q10, Q4, Q1 and Q3; Factor 2 is identified in Q7 & Q8.
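A sketch of an exploratory factor analysis of a score matrix using scikit-learn; the data here are random placeholders, so the loadings will not reproduce the hypothetical table above.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Placeholder data: rows = students, columns = questions (0/1 marks).
rng = np.random.default_rng(0)
scores = rng.integers(0, 2, size=(200, 10)).astype(float)

fa = FactorAnalysis(n_components=2, rotation="varimax")  # rotation needs scikit-learn >= 0.24
fa.fit(scores)

# Print the loading of each question on each factor.
for i, question_loadings in enumerate(fa.components_.T, start=1):
    print(f"Q{i}: " + "  ".join(f"{loading:+.2f}" for loading in question_loadings))
```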
Item Response Theory (IRT)
Classical test theory relies on item difficulty and item discrimination plus the traditional reliability coefficient (e.g. Cronbach’s alpha) to examine the quality of a test, analysing performance under particular test conditions with a view to identifying sources of error and improving reliability. IRT goes further: it provides information about how student ability interacts with the test and its items, modelling the relationship between a student’s ability and the item’s difficulty level to improve many of the factors which can influence the quality of assessments.
IRT (Cont.)
Imagine a student taking an exam in physiology:
The probability that the student answers an item correctly is affected by the student’s physiology ability and the item’s difficulty level. If the student’s ability is higher than the item’s difficulty, the probability that he/she will answer the question correctly is high; if the student’s ability is lower than the item’s difficulty, the probability that the student will answer the question correctly is low.
IRT models the probability of a correct answer based on student test scores and item parameters (item difficulty, item discrimination and guessing), and it provides both item-level and test-level information. Information about performance on a question is given by a range of item parameters. Student ability and item difficulty are placed on a common scale, transformed mathematically into units termed “logits”, which typically range from -4 to +4. Plotting the relationship between student ability and the probability of success on a given question generates an Item Characteristic Curve (ICC).
Item Characteristic Curve
Figure: an ICC plotting the ability of all students (from low to high, with the average student in the middle) against the probability that they will answer a specific question correctly; for this item the average student has ~100% probability of answering correctly.
ICC (cont.)
Figure: the ICC of a perfectly discriminating item.
Multiple ICC curves can also be displayed on one graph and IRT can also be used to analyse OSCE stations.
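A minimal sketch of the Rasch form of an ICC, with ability and difficulty both expressed in logits (the specific difficulty value is hypothetical):

```python
import numpy as np

def rasch_probability(ability, difficulty):
    """Rasch model: probability of a correct answer given ability and item
    difficulty, both in logits."""
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

# Trace an ICC for a hypothetical item of difficulty +1 logit.
for theta in np.linspace(-4, 4, 9):          # typical logit range
    p = rasch_probability(theta, difficulty=1.0)
    print(f"ability {theta:+.1f} logits -> P(correct) = {p:.2f}")
```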
Item-Student Map
The results of an IRT analysis can also be presented on an item-student map. Student ability and item difficulty are calculated and displayed together on the same logit scale, so examiners can get a greater understanding of how well student ability maps onto item difficulty.
Item-Student Map (ISM)
An ISM is a visual representation of the distribution of students’ ability and the difficulty of each item:
– Left side = the ability of students
– Right side = the difficulty of each item
– # = 5 students
– M = mean; S = one standard deviation; T = two standard deviations
– 0 logits = the average student
– Top of map = most able students and most difficult items
– Bottom of map = least able students and easiest items
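A crude text rendering of an item-student map from already-estimated logits; the ability and difficulty values below are invented, and in this sketch each # stands for one student rather than five.

```python
import numpy as np

def print_item_student_map(abilities, difficulties, bin_width=1.0):
    """Students (left, as #) and items (right, as Q labels) on a shared logit scale."""
    for low in np.arange(4, -4 - bin_width, -bin_width):   # top = most able / most difficult
        high = low + bin_width
        n_students = int(np.sum((abilities >= low) & (abilities < high)))
        items = [f"Q{i + 1}" for i, d in enumerate(difficulties) if low <= d < high]
        print(f"{low:+.1f} | {'#' * n_students:<15} | {' '.join(items)}")

abilities = np.array([-1.2, -0.4, 0.1, 0.3, 0.8, 1.5, 2.2])   # hypothetical students
difficulties = np.array([-2.0, -0.5, 0.2, 1.1, 2.5])          # hypothetical items
print_item_student_map(abilities, difficulties)
```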
Questions?
Summary
– Standard setting methods allow a defensible, professional pass mark to be set for a test
– Cronbach’s alpha estimates the overall measurement error and noise in an assessment, but it does not tell us where all the errors can be found
– Item analysis and Generalisability Theory can locate the sources of error in a test (and thus help to eliminate them)
– Factor analysis can reveal the constructs underlying a test
– Item Response Theory and Rasch modelling describe the relationship between student ability and the demands of the test items
– Together these methods for evaluating tests can identify problem items and sources of measurement error, improve their reliability and fairness, and lead to better quality assessments