Improving the quality of assessments in medical education - Professor Reg Dennick (PowerPoint presentation)



SLIDE 1

Improving the quality of assessments in medical education

Professor Reg Dennick
Professor of Medical Education, School of Medicine
University of Nottingham, United Kingdom

SLIDE 2

Overview

  • Assessment of learning outcomes & competencies
  • Curriculum alignment
  • Quality assurance of assessment
  • Standard setting
  • Validity & reliability
  • Assessment as measurement
  • Psychometrics as a method of quality control
  • Classical test theory: reducing ‘noise’
  • Item analysis
  • Generalisability theory
  • Factor analysis
  • Item response theory: Rasch modelling
  • Summary

SLIDE 3

Bloom’s Taxonomy

‘The Domains of Learning’

  • Cognitive (knowledge)
  • Psychomotor (practical skills)
  • Affective (attitudes)
SLIDE 4

The cognitive (knowledge) domain (2001)

SLIDE 5

General Medical Council (1993, 2002, 2009)

Outcome-based curricula

SLIDE 6

General Medical Council 2009

  • The doctor as scholar and scientist

“Doing the right thing”

  • The doctor as practitioner

“Doing the thing right”

  • The doctor as professional

“The right person doing it”
SLIDE 7

CanMEDS Curriculum (2005)

  • Medical Expert Role
  • Communicator Role
  • Collaborator Role
  • Manager Role
  • Health Advocate Role
  • Scholar Role
  • Professional Role
SLIDE 8

Using outcomes and competencies leads to ‘Constructive Alignment’

SLIDE 9

Curriculum Alignment

[Diagram: the planned, taught, learned and assessed curriculum shown in alignment with one another.]

SLIDE 10

Quality Assessment

  • 1980s management culture
  • Quality of public services scrutinised
  • Teaching Quality Audit in Higher Education
  • ‘How can you evaluate the effectiveness of a course if you don’t know what its outcomes should be?’

  • Accountability
  • Patient safety
  • Insurance/Indemnity
  • Professional bodies (GMC)
SLIDE 11

The Professional challenge

Can defining and listing outcomes and competencies equate to a definition of professional performance that can be objectively measured?
SLIDE 12

Assessing competence

Essays, MCQs, ****, OSCEs, DOPS, Mini-CEX, MSF, CBD, simulation

Work-based assessments (WBAs)

SLIDE 13

Miller’s Triangle
The Dreyfus Model: From Novice to Expert

  • Level 1 Novice
– Rigid adherence to taught rules or plans: ‘context-free elements’
– Little situational perception

  • Level 2 Advanced Beginner
– Guidelines for action based on attributes or aspects (aspects are global characteristics of situations recognisable only after some prior experience)
– Situational perception still limited

  • Level 3 Competent
– Coping with crowdedness (pressure)
– Now sees actions at least partially in terms of longer-term goals
– Conscious, deliberate planning and problem solving

  • Level 4 Proficient
– Sees situations holistically rather than in terms of aspects
– Sees what is most important in a situation
– Uses intuition and ‘know-how’

  • Level 5 Expert
– No longer predominantly relies on rules, guidelines or maxims
– Intuitive grasp of situations based on deep tacit understanding
– Analytic approaches used only in novel situations or when problems occur

SLIDE 14

Assessment challenges

  • Objectivity
  • Validity
  • Reliability
  • Assessor training/skills
  • Psychometric evaluation

SLIDE 15

Improve the assessors

SLIDE 16

The Examination Cycle

[Diagram: a cycle linking learning outcomes, teaching experiences and standard setting with question writing, exam drafting, piloting, the assessment itself, post-examination analysis (item analysis, factor analysis, cluster analysis, Rasch modelling, G study), feedback reports, the question bank and new items.]

Modified from: Tavakol & Dennick (2011) “Post-examination analysis of objective tests”. Medical Teacher, 33(6): 447-458.

SLIDE 17

How do you define the pass mark for an exam?

STANDARD SETTING

SLIDE 18

Definition of Standards

A standard is a statement about whether an examination performance is good enough for a particular purpose:

– A defined score that serves as the boundary between passing and failing

SLIDE 19

Standards

  • Standards are based on judgments about examinees’ performances against social or educational constructs:
– Student ready for the next phase: progression
– Student ready for graduation
– Competent practitioner

SLIDE 20

The Standard Setting Problem

[Diagram: the exam decision (Pass / Fail) set against true status (Competent / Incompetent).]

SLIDE 21

Setting the pass mark

The method has to be:

  • Defensible
  • Credible
  • Supported by a body of evidence
  • Feasible
  • Acceptable to all stakeholders
SLIDE 22

Standard Setting Methods

Relative methods

  • based on judgments about groups of test takers

Absolute methods

  • based on judgments about test questions
  • based on judgments about the performance of individual examinees:
– Borderline group method
– Contrasting group method

SLIDE 23

Types of Standards

  • Relative standards / norm-referenced methods:
– Based on a comparison among the performances of examinees
– A set proportion of candidates fails regardless of how well they perform.

  • Absolute standards / criterion-referenced methods:
– Based on how much the examinees know
– Candidates pass or fail depending on whether they meet specified criteria.

SLIDE 24

Relative Method

Norm referencing - assessing students in relation to each other, to the norm or group average. Marks are normally distributed and grade boundaries inserted afterwards according to defined standards. Students pass or fail and are graded depending on the norm.

SLIDE 25

Absolute Method

Criterion referencing - each student is assessed against specific criteria of competence. Students pass or fail depending on the achievement of a minimum number of specified competencies.

SLIDE 26

Absolute Methods

Judgments are made about individual test items

– Angoff’s method
– Ebel’s method

SLIDE 27

Angoff’s method

  • Select the judges
  • Discuss:
– Purpose of the test
– Nature of the examinees
– What constitutes adequate/inadequate knowledge
– The borderline candidate
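
The calculation that follows these steps is not spelled out on the slide; as a minimal sketch (the judges, items and ratings below are hypothetical), the usual Angoff procedure averages each judge's estimate of the probability that a borderline candidate answers each item correctly, and the sum of those averages becomes the pass mark:

```python
import numpy as np

# Hypothetical Angoff ratings: rows = judges, columns = items.
# Each value is a judge's estimate of the probability that a borderline
# candidate would answer that item correctly.
ratings = np.array([
    [0.60, 0.75, 0.40, 0.85, 0.55],   # judge 1
    [0.65, 0.70, 0.45, 0.80, 0.50],   # judge 2
    [0.55, 0.80, 0.35, 0.90, 0.60],   # judge 3
])

item_estimates = ratings.mean(axis=0)     # consensus estimate per item
pass_mark = item_estimates.sum()          # expected score of a borderline candidate
n_items = ratings.shape[1]
print(f"Pass mark: {pass_mark:.2f} / {n_items} ({100 * pass_mark / n_items:.1f}%)")
```
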
SLIDE 28

The ‘borderline’ candidate

  • How do you define this concept?
  • It is based on past experience and accumulated knowledge of assessments.
  • It is a subjective judgement made more reliable by using multiple assessors.
  • It is based on the consensus of experts.
SLIDE 29

Ebel’s Method

  • Difficulty-relevance decisions:
– The judges make judgments about the proportion of items in each category that borderline test-takers would have answered correctly
– Judges read each item and assign it to one of the categories in the classification table
– Calculate the passing score

SLIDE 30

Ebel’s method

Items are classified by difficulty (Easy / Medium / Hard) and relevance (Essential / Important / Nice to know).

SLIDE 31

Ebel’s method

For each category an estimate of the proportion of questions the ‘borderline’ candidate gets right is made.

                Easy    Medium    Hard
Essential       0.95    0.85      0.80
Important       0.90    0.80      0.75
Nice to know    0.80    0.60      0.50

SLIDE 32

Ebel’s Method

The scores for each question are multiplied by the appropriate proportion and summed (a code sketch of the same calculation follows below).

Category                Proportion right   # Questions   Score
Essential / Easy             0.95       x       3      =  2.85
Essential / Medium           0.85       x       2      =  1.70
Essential / Hard             0.80       x       2      =  1.60
Important / Easy             0.90       x       3      =  2.70
Important / Medium           0.80       x       4      =  3.20
Important / Hard             0.75       x       4      =  3.00
Nice to know / Easy          0.80       x       2      =  1.60
Nice to know / Medium        0.60       x       2      =  1.20
Nice to know / Hard          0.50       x       3      =  1.50
Total                                          25        19.35

Pass mark = 19.35 / 25 = 77.4%
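
The same arithmetic can be scripted; this Python sketch simply reproduces the worked example above, with the proportions and question counts taken from the table:

```python
# Ebel's method: each cell holds (proportion the borderline candidate gets right,
# number of questions in that cell).
grid = {
    ("Essential", "Easy"):      (0.95, 3),
    ("Essential", "Medium"):    (0.85, 2),
    ("Essential", "Hard"):      (0.80, 2),
    ("Important", "Easy"):      (0.90, 3),
    ("Important", "Medium"):    (0.80, 4),
    ("Important", "Hard"):      (0.75, 4),
    ("Nice to know", "Easy"):   (0.80, 2),
    ("Nice to know", "Medium"): (0.60, 2),
    ("Nice to know", "Hard"):   (0.50, 3),
}

total_questions = sum(n for _, n in grid.values())
expected_score = sum(p * n for p, n in grid.values())
print(f"{expected_score:.2f} / {total_questions} "
      f"-> pass mark {100 * expected_score / total_questions:.1f}%")
```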

SLIDE 33

Borderline Group method

  • Useful for OSCEs.
  • The OSCE checklist has a ‘global’ score box: e.g. pass, borderline, fail.
  • Examiner(s) complete the checklist but also judge a global score.
  • At the end of the exam, scores in the ‘borderline’ category are averaged to give the pass mark.
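
A minimal sketch of the calculation for one station; the checklist scores and global grades below are hypothetical:

```python
import numpy as np

# Hypothetical OSCE station data: checklist score and the examiner's global judgement.
checklist = np.array([14, 18, 11, 15, 19, 12, 16, 13, 17, 10])
global_grade = np.array(["pass", "pass", "fail", "borderline", "pass",
                         "borderline", "pass", "borderline", "pass", "fail"])

# Borderline group method: the station pass mark is the mean checklist score
# of the candidates judged 'borderline' on the global rating.
pass_mark = checklist[global_grade == "borderline"].mean()
print(f"Station pass mark: {pass_mark:.1f}")
```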

SLIDE 34

Contrasting Group Method

[Figure: overlapping score distributions for the fail group and the pass group, with the pass mark placed where they overlap.]

In an OSCE exam, students are scored using a checklist but are also given a GLOBAL score of PASS or FAIL. After the exam the two score distributions are plotted and the pass mark is determined from the overlap between the two groups (one way to compute it is sketched below).
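
One common way to turn ‘the overlap between the two groups’ into a number is to choose the cut score that misclassifies the fewest candidates; this is a sketch under that assumption (the intersection of the two plotted distributions is another frequently used convention), with hypothetical scores:

```python
import numpy as np

# Hypothetical checklist scores, split by the examiners' global PASS/FAIL judgement.
fail_scores = np.array([8, 10, 11, 12, 13, 14, 15])
pass_scores = np.array([12, 14, 15, 16, 17, 18, 19, 20])

# Count misclassifications for every candidate cut score: members of the fail
# group at or above the cut, plus members of the pass group below it.
cuts = np.arange(min(fail_scores.min(), pass_scores.min()),
                 max(fail_scores.max(), pass_scores.max()) + 1)
errors = [(fail_scores >= cut).sum() + (pass_scores < cut).sum() for cut in cuts]
pass_mark = cuts[int(np.argmin(errors))]
print(f"Pass mark (cut score): {pass_mark}")
```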

SLIDE 35

Standard setting: practical implications

The choice of standard setting method depends on:

– Credibility
– Resources available
– How high-stakes the exam is

SLIDE 36

How can we improve the quality of assessments?

PSYCHOMETRICS

Post-examination analysis

SLIDE 37

Psychometrics

  • Psychometrics is concerned with the quantitative characteristics of assessments as well as attitudes and psychological traits.
  • Psychometrics is concerned with the construction and validation of measurement tools such as exams, survey questionnaires and personality assessments.
  • The psychometric soundness of a test refers to how reliably and accurately a test measures what it purports to measure.
  • Psychometricians are increasingly being required to monitor and improve the quality of exams.

SLIDE 38

How can psychometrics improve student assessment?

  • Aberrant questions and stations can be detected and then restructured or discarded.
  • By improving the practical organisation of examinations.
  • By improving the credibility of the competence-based pass mark.
  • By improving the validity and reliability of checklists, rating scales and global rating scales.
  • By recognising, isolating and estimating measurement errors associated with students’ scores.

  • By identifying the constructs within tests.
  • By relating student ability and item difficulty within tests.
SLIDE 39

Measures of Variability

  • Both mark distributions have a mean of 50, but show a different pattern.
  • Examination A has a wide range of marks (some below 20 and some above 90); Examination B shows few students at either extreme.
  • Examination B is more homogeneous than Examination A.

SLIDE 40

Standard Deviation (SD)

  • The SD is the most widely used measure of variability when interval and ratio data are described.
  • This statistic describes how values vary about the mean of the distribution.
  • The SD can also be used for interpreting the relative position of individual students in a normal distribution.
SLIDE 41

Validity & Reliability

  • Validity: is the test measuring the construct that it is designed to test?
  • Reliability: is the test accurate, reproducible and stable?
  • A test can be reliable without being valid.
  • But a test cannot be valid if it is unreliable.
  • Reliability is a necessary but not sufficient condition for validity.

SLIDE 42

Classical Test Theory (CTT)

  • A student’s mark (X) is composed of two components:
  • True score (T): the score that would be obtained if there was no error.
  • Error score (E): the score that has nothing to do with the student’s ability.

X = T + E

Reliability estimates inform us about the amount of measurement error (“noise”) associated with a test score.
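
A small simulation (all numbers hypothetical) illustrates the idea: observed marks are true scores plus random error, and reliability is the proportion of observed-score variance that is true-score variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n_students = 500

true_scores = rng.normal(60, 10, n_students)   # T: what students 'really' know
error = rng.normal(0, 5, n_students)           # E: noise from testee, test and tester
observed = true_scores + error                 # X = T + E

reliability = true_scores.var() / observed.var()   # var(T) / var(X)
print(f"Reliability = {reliability:.2f}")          # expected about 10**2 / (10**2 + 5**2) = 0.80
```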

SLIDE 43

Sources of Error (“Noise”)

[Diagram: error (“noise”) arising from the testee, the test and the tester.]

SLIDE 44

Reliability

  • Reliability is the extent to which a test measures its constructs consistently.
  • There are different procedures for estimating the reliability of a test.
  • Coefficient alpha (Cronbach’s alpha) is popular for estimating the reliability of tests.
  • Computer programs, such as SPSS, can calculate coefficient alpha.

SLIDE 45

Coefficient alpha (Cronbach’s alpha)

  • Coefficient alpha is expressed as a number between 0 and 1.
  • Acceptable values of alpha range from 0.70 to 0.95.
  • A high coefficient alpha does not always mean a high degree of internal consistency, because alpha is also affected by the length of the test.
  • A high value of alpha (> 0.90) may suggest redundancies and show that the test should be shortened.
  • Alpha is based on a specific sample of students, so we should not rely on published alpha estimates and should measure alpha each time the test is administered.
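
A sketch of how alpha can be computed directly from an item-score matrix each time a test is administered; the tiny response matrix below is hypothetical:

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """item_scores: rows = students, columns = items (e.g. 0/1 for MCQ items)."""
    k = item_scores.shape[1]
    sum_item_variances = item_scores.var(axis=0, ddof=1).sum()
    total_score_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - sum_item_variances / total_score_variance)

# Hypothetical responses of six students to four dichotomous items.
scores = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])
print(f"alpha = {cronbach_alpha(scores):.2f}")
```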

SLIDE 46

Item analysis

  • The reliability of test scores is dependent on the quality of the items on the test.
  • Unlike reliability analyses, which evaluate the properties of the test as a whole, item analysis examines individual items.
  • If you can improve the quality of the individual items, you will improve the overall quality of your test.
  • Item analysis statistics enable test developers to decide which items to keep on a test and which to modify or eliminate.

SLIDE 47

Item Difficulty “p”

  • Item difficulty is the proportion of students who correctly answer the item.
  • In the language of psychometrics, it is also called the p-value.
  • The item difficulty level or index is calculated as:

p = number of students correctly answering the item / number of students
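
In code, the p-value of each item is simply the column mean of a 0/1 response matrix; the data below are hypothetical:

```python
import numpy as np

# Hypothetical responses: rows = students, columns = items; 1 = correct, 0 = incorrect.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
])
p = responses.mean(axis=0)   # proportion of students answering each item correctly
print(p)                     # item 3 (p = 0.2) is hard, item 4 (p = 1.0) is very easy
```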

SLIDE 48

Item discrimination index “D”

  • A value of how well a question is able to differentiate between students who are ‘strong’ and ‘weak’.
  • The item-discrimination index is symbolized by D. The range of D is -1.00 to +1.00.
  • D is the difference between the proportion of high total scorers answering a question correctly and the proportion of low scorers answering the question correctly.
  • A negative D-value indicates a poor item that weaker students answered correctly more often than the top students.
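
A sketch of D for a single item, comparing the groups with the highest and lowest total scores; the one-third split used here is an assumption (27% groups are another common convention):

```python
import numpy as np

def discrimination_index(item: np.ndarray, totals: np.ndarray, fraction: float = 1/3) -> float:
    """D = proportion correct in the high-scoring group minus the low-scoring group."""
    n_group = max(1, int(len(totals) * fraction))
    order = np.argsort(totals)                  # students ranked by total exam score
    low, high = order[:n_group], order[-n_group:]
    return item[high].mean() - item[low].mean()

# Hypothetical data: 0/1 scores on one item and each student's total exam score.
item_scores = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
total_scores = np.array([58, 72, 65, 40, 80, 35, 50, 68, 45, 30])
print(f"D = {discrimination_index(item_scores, total_scores):+.2f}")
```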

SLIDE 49
[Figure: example scatter plot of item difficulty (p, from 0.00 to 1.00) against discrimination (D = PH - PL), with regions marked ‘Good discrimination’, ‘Too easy’ and ‘Problem questions’.]

SLIDE 50

Reliability

  • Poorly discriminating questions, and questions that are either too easy or too hard, reduce the reliability of the test.
  • An unreliable test is too ‘noisy’ and is a poor measuring instrument.
  • An unreliable test is unfair because it does not discriminate between weak and strong students.

SLIDE 51

Problems with CTT

Cronbach’s alpha and other reliability measures cannot identify the potential sources of measurement error associated with the student’s obtained mark.

The answer is:

Generalisability Theory

SLIDE 52

Generalizability Theory (GT)

  • GT identifies the potential sources of measurement error associated with the student’s obtained mark.
  • The different sources of error are known as facets.
  • Identifying facets allows test constructors to improve the quality of the test.
  • For instance, if an OSCE has used standardised patients (SPs), a range of examiners and different checklists to assess students’ performance on 20 stations, GT can calculate the amount of error caused by each facet.

SLIDE 53

GT (cont.)

  • Each facet has a value associated with it called its variance component, calculated via an analysis of variance (ANOVA) procedure in SPSS.
  • The variance components are used to calculate a G coefficient.
  • The G coefficient is equivalent to the reliability of the test and has a value between 0 and 1.
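
The slide refers to ANOVA in SPSS; as a hedged illustration, the simplest fully crossed students x items design can be worked by hand: estimate the variance components from the ANOVA mean squares and combine them into a relative G coefficient. A real multi-facet OSCE design (SPs, examiners, checklists) needs the fuller G-study formulas, which this sketch omits.

```python
import numpy as np

# Hypothetical fully crossed design: rows = students (persons), columns = items/stations.
X = np.array([
    [7, 6, 8, 5],
    [9, 8, 9, 7],
    [5, 4, 6, 3],
    [8, 7, 7, 6],
    [6, 5, 7, 4],
], dtype=float)
n_p, n_i = X.shape

grand = X.mean()
person_means = X.mean(axis=1)
item_means = X.mean(axis=0)

# Mean squares from a two-way ANOVA with one observation per cell.
ms_p = n_i * ((person_means - grand) ** 2).sum() / (n_p - 1)
ms_i = n_p * ((item_means - grand) ** 2).sum() / (n_i - 1)
residual = X - person_means[:, None] - item_means[None, :] + grand
ms_res = (residual ** 2).sum() / ((n_p - 1) * (n_i - 1))

# Variance components for each facet.
var_person = (ms_p - ms_res) / n_i    # true differences between students
var_item = (ms_i - ms_res) / n_p      # differences in item difficulty
var_residual = ms_res                 # person-by-item interaction plus everything else

g = var_person / (var_person + var_residual / n_i)   # relative G coefficient
print(f"Variance components: persons {var_person:.2f}, items {var_item:.2f}, residual {var_residual:.2f}")
print(f"G coefficient = {g:.2f}")
```
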
SLIDE 54

Factor Analysis (FA)

  • Factor Analysis is a technique for identifying the factors that contribute towards the cognitive structure or performance of a test.
  • Responses to many different questions are driven by just a few underlying constructs called factors.
  • A factor is a construct which represents the relationship between a set of questions that are highly correlated.
  • These relationships are referred to as ‘factor loadings’.
  • We can use Factor Analysis to identify the common factors contributing to the test, and to identify and remove irrelevant and ‘noisy’ items.

SLIDE 55

FA (Cont.)

  • Assessors should understand the number of factors or constructs measured in the test.
  • The number of factors is important because each factor represents a different construct which should, in principle, be scored separately.

SLIDE 56

Exploratory Factor Analysis ( EFA)

  • EFA is widely used to discover factors and factor loadings in a test.
  • EFA is also used to revise and simplify exam papers.
  • EFA can be performed in SPSS.
  • The output for 10 questions is shown next.
SLIDE 57

Hypothetical example (EFA)

Question       F1       F2
Q9             0.92    -0.02
Q2             0.92    -0.02
Q6             0.81     0.21
Q10            0.79    -0.38
Q4             0.79    -0.38
Q1             0.73     0.36
Q3             0.69     0.16
Q5            -0.28     0.03
Q7             0.01     0.96
Q8             0.01     0.96

% variance explained by each factor: F1 = 47.23, F2 = 23.60

Factor 1 is identified in Q9, Q2, Q6, Q10, Q4, Q1 and Q3. Factor 2 is identified in Q7 and Q8. Q5 does not load onto F1 or F2.
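
Loadings like these are typically produced in SPSS; as an illustrative alternative, here is a hedged Python sketch using scikit-learn on invented data built to contain two underlying constructs:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)

# Invented data: 200 students, 10 questions; questions 1-7 are driven by one
# latent ability (factor 1) and questions 8-10 by another (factor 2).
ability = rng.normal(size=(200, 2))
true_loadings = np.zeros((2, 10))
true_loadings[0, :7] = 0.8
true_loadings[1, 7:] = 0.8
item_scores = ability @ true_loadings + rng.normal(scale=0.5, size=(200, 10))

efa = FactorAnalysis(n_components=2).fit(item_scores)
loadings = efa.components_.T          # rows = questions, columns = factors

for q, (f1, f2) in enumerate(loadings, start=1):
    print(f"Q{q}: F1 = {f1:+.2f}  F2 = {f2:+.2f}")
```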

SLIDE 58

Item Response Theory (IRT)

  • Test developers have traditionally used item difficulty and item discrimination, plus the traditional reliability coefficient (e.g. Cronbach’s alpha), to examine the quality of a test.
  • G-theory makes a more elaborate analysis of examination conditions with a view to identifying sources of error and improving reliability.
  • These procedures focus on the test and its errors, but say nothing about how student ability interacts with the test and its items.
  • The aim of IRT is to measure the relationship between the student’s ability and the item’s difficulty level, to improve many of the factors which can influence the quality of assessments.

SLIDE 59

IRT (Cont.)

Imagine a student taking an exam in physiology:

  • The probability that the student answers question 1 correctly is affected by the student’s physiology ability and the item’s difficulty level.
  • If the student has a high level of physiology knowledge, the probability that he/she will answer question 1 correctly is high.
  • If the question has a low difficulty index (few students answer it correctly), the probability that the student will answer the question correctly is low.
  • IRT analyses these relationships using a mathematical model based on student test scores and item parameters (item difficulty, item discrimination and guessing).

SLIDE 60

IRT (Cont.)

  • IRT focuses on item-level information rather than test-level information.
  • The relationship between student ability and performance on a question is given by a range of item parameters.

SLIDE 61

IRT (Cont.)

  • In IRT, item difficulty and student ability are transformed mathematically into units termed “logits”, which typically range from -4 to +4.
  • Plotting the relationship between all students’ ability and the probability of success on a given question generates an Item Characteristic Curve (ICC).
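
A minimal sketch of how an ICC is generated from a logistic IRT model; the three-parameter form below uses the difficulty, discrimination and guessing parameters named earlier, and the Rasch model is the special case with discrimination = 1 and guessing = 0:

```python
import numpy as np

def icc(theta, difficulty, discrimination=1.0, guessing=0.0):
    """Probability of answering an item correctly, as a function of ability (logits)."""
    return guessing + (1 - guessing) / (1 + np.exp(-discrimination * (theta - difficulty)))

theta = np.linspace(-4, 4, 9)                        # student ability in logits
print("ability (logits):", theta)
print("easy item (b = -1):", np.round(icc(theta, difficulty=-1.0), 2))
print("hard item (b = +2):", np.round(icc(theta, difficulty=+2.0), 2))
# When ability equals difficulty (and guessing = 0) the probability of success is 0.5;
# plotting icc() over the theta range traces the S-shaped Item Characteristic Curve.
```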

SLIDE 62

IRT has many data outputs

  • Item Characteristic Curves
  • Item-student maps
  • Dimensionality analysis
  • Reliability
  • Item fit statistics
  • Test information function
  • Item difficulty invariance
SLIDE 63

Item Characteristic Curve

[Figure: an Item Characteristic Curve with annotations marking low-ability students, the average student and high-ability students; for this item the average student has roughly a 100% probability of answering correctly.]

This curve plots the ability of all students against the probability that they will answer a specific question correctly.

SLIDE 64

ICC (Cont.)

A difficult item

SLIDE 65

ICC (cont.)

A perfectly discriminating item

SLIDE 66

ICC & OSCE stations

Multiple ICCs can be displayed on one graph, and IRT can also be used to analyse OSCE stations.

SLIDE 67

Item-Student Map

  • The distribution of students’ ability and the difficulty of each item can also be presented on an item-student map.
  • Item-student maps compare students and questions in order to get a greater understanding of how well student ability maps onto item difficulty.
  • Using IRT software programmes, the item-student map can be calculated and displayed.

SLIDE 68

Item-Student Map (ISM)

ISM = a visual representation of the distribution of students’ ability and the difficulty of each item.
  • Left side = the ability of students
  • Right side = the difficulty of each item
  • # = 5 students
  • M = mean; S = one standard deviation; T = two standard deviations
  • 0 logits = the average student
  • Top of map = most able students and most difficult items
  • Bottom of map = least able students and easiest items

SLIDE 69

Item-Student Map (Cont.)

SLIDE 70

Item-Student Map (Cont.)

SLIDE 71

Questions?

SLIDE 72

Summary

  • Standard setting enables us to set a rational, fair and professional pass mark for a test.
  • Classical Test Theory gives us an overview of the error and noise in an assessment, but it does not tell us where all the errors can be found.
  • Generalisability Theory enables us to identify sources of error in a test (and thus helps to eliminate them).
  • Factor Analysis identifies the underlying constructs in a test.
  • Item Response Theory enables us to understand the relationship between student ability and the demands of the test.
SLIDE 73

Requirements for quality assessment

  • To aim for constructive alignment in the curriculum
  • To adopt the exam cycle framework for producing and evaluating tests
  • To standard set high-stakes exams
  • To use post-examination analysis methods to identify problem items and sources of measurement error
  • To iteratively analyse and improve high-stakes tests to improve their reliability and fairness
  • To use a psychometrician to evaluate and improve all assessments

SLIDE 74

References