HFES Webinar Series: How Do You Know That Your Metrics Work? (PowerPoint PPT Presentation)
SLIDE 1

HFES Webinar Series

How Do You Know That Your Metrics Work? Fundamental Psychometric Approaches to Evaluating Metrics

Presented by Fred Oswald, Rice University Moderated by Rebecca A. Grier, Ford Motor Company Hosted by the Perception and Performance Technical Group

SLIDE 2

HFES Webinar Series

  • Began in 2011
  • Organized by the Education & Training Committee
  • This webinar is organized and hosted by the HFES Perception and Performance Technical Group, http://hfes-pptg.org/.
  • See upcoming and past webinars at http://bit.ly/HFES_Webinars

SLIDE 3

HFES Webinar FAQs

1. There are no CEUs for this webinar.
2. This webinar is being recorded. HFES will post links to the recording and presentation slides on the HFES Web site within 3-5 business days. Watch your e-mail for a message containing the links.
3. Listen over your speakers or via the telephone. If you are listening over your speakers, make sure your speaker volume is turned on in your operating system and your speakers are turned on.
4. All attendees are muted. Only the presenters can be heard.
5. At any time during the webinar, you can submit questions using the Q&A panel. The moderator will read the questions following the last presentation.
6. Trouble navigating in Zoom? Type a question into Chat. HFES staff will attempt to help.
7. HFES cannot resolve technical issues related to the webinar service. If you have trouble connecting or hearing the audio, click the “Support” link at www.zoom.us.

SLIDE 4

About the Presenters

Presenter: Fred Oswald, PhD, is a professor in the Department of Psychology at Rice University. An organizational psychologist, he addresses issues pertaining to personnel selection, college admission, military selection and classification, and school-to-work transition. Oswald publishes statistical and methodological research in the areas of big data, meta-analysis, measure development, and psychometrics. He is an Associate Editor of Journal of Management, Psychological Methods, Advances in Methods and Practices in Psychological Science, and Journal of Research in Personality. Fred received his MA and PhD in industrial-organizational psychology from the University of Minnesota in 1999.

Moderator: Rebecca A. Grier, PhD, is a human factors scientist at Ford Motor Company researching human interaction with highly automated vehicles. In addition, she is secretary/treasurer of the HFES Perception and Performance Technical Group and chair of the Society of Automotive Engineers Taskforce on Identifying Automated Driving Systems - Dedicated Vehicles (ADS-DV) User Issues for Persons with Disabilities. Rebecca received her MA and PhD in human factors/experimental psychology from the University of Cincinnati and a BS With Honors in psychology from Loyola University, Chicago.

SLIDE 5

How do you know that your metrics work? Fundamental questions about metrics

Fred Oswald Rice University

HFES Webinar Series April 12, 2018

[Image credit: animalmascots.com/01-00887/German-Shepard-Mascot-Costume]

SLIDE 6

Outline: Questions Addressed

Q0: Context of measurement?
Q1: Develop a new measure?
Q2: How to develop good items?
Q3: Format of the measure?
Q4: Evidence for reliability?
Q5: Practical analysis tips?

SLIDE 7

Q0: Context of measurement?

Purposes

  • Evaluative (e.g., system comparison, individual differences)
  • Developmental (e.g., training evaluation)
  • Managerial decision-making (e.g., compensate, promote, transfer, terminate)

SLIDE 8

Q0: Context of measurement?

Content

  • General ↔ Specific/Contextualized
  • Multiple ↔ Single measures
  • Strong ↔ Weak or “subtle” indicators

Form

  • Many items ↔ Few items
  • Self-report ↔ “Other” report
  • Traditional ↔ Innovative

SLIDE 9

Q0: Context of measurement?

Broad Contexts

  • Academic vs. organizational
  • IRB vs. organizational climate for surveying
  • Perceptions of fairness, relevance (from all stakeholders, including those of the test-taker)
  • Legal concerns
SLIDE 10

Q1: Develop a new measure?

No → Use an existing measure when

  • there is a strong theoretical basis
  • past empirical research demonstrates reliability and validity
  • you are not interested in a measure-development study (do not toss an ad hoc measure into a study)…

SLIDE 11

Q1: Develop a new measure?

Yes → Develop a new measure when

  • access to existing measures is limited (expensive, proprietary)
  • there is room for improvement (improved theory, aligning the measure with the intended purpose, increased sensitivity to the test-taker perspective, updating language)
  • test security is of concern (“freshening” item pools, previous test compromises)
  • there is limited testing time…
SLIDE 12

A Common Context: Limited Testing Time

Problem: Too many constructs and not enough time, resources…or test-taker patience
Reasons: Theories get complex; organizations place high demand on measures/data to answer many practical organizational questions
Solutions: Reduce constructs to “essential” ones? Abandon use of multiple scales for a construct? Shorten measures?

SLIDE 13

Q2: How to develop good items?

Good measure development – and therefore good results – requires sound investment.

  • Expertise (substantive researchers, SMEs, psychometricians, sensitivity review)
  • Development process (item generation, refinement, translation/backtranslation)
  • Research/evidence (reliability, validity, low adverse impact, generalizability)

SLIDE 14

Q2: How to develop good items?

Item content can be evaluated for relevancy, deficiency and contamination; however, these three characteristics can also be psychological phenomena (e.g., did the test-taker forget or get confused by the item content?).

SLIDE 15

Q2: How to develop good items?

Appropriate content sampling from a construct domain is a necessary condition for obtaining interpretable reliability evidence for a set of items.

A high reliability coefficient does not ensure adequate content sampling: collections of items can covary due to shared contaminants or shared deficiencies.

SLIDE 16

Q2: How to develop good items?

[Diagram: construct “job satisfaction” indicated by item 1, item 2, item 3, …, item k]

  • Items sample different aspects of the theoretical construct, e.g., satisfaction with: autonomy, salary, job variety, management, coworkers….
  • Controlled heterogeneity entails varying these aspects across items to triangulate on the psychological construct.
  • Varying items allows for distinguishing item content that is item-specific vs. construct-relevant.

SLIDE 17

Q2: How to develop good items?

[Diagram: construct “workload (task load)” indicated by item 1, item 2, item 3, …, item k]

Another construct: Workload

  • Identified facets (aspects) under its construct umbrella (NASA-TLX): Mental Demand, Physical Demand, Temporal Demand, Performance, Frustration, Effort
  • Controlled heterogeneity: create items reflecting each facet
  • Given enough items, a facet can become a reliable scale on its own (e.g., high alpha, strong single factor).

SLIDE 18

Q2: How to develop good items?

Who generates items?

  • SMEs: domain-specific experts/theorists; job analysts; job incumbents; researchers themselves relying on past theories and measures
  • Item categorization process

How many people are needed to generate items?

  • Generate items (from SMEs, research literature) until themes (and possibly content) are redundant. Often need fewer SMEs than one might think.

SLIDE 19

Q2: How to develop good items?

What items are appropriate given the measurement goals? e.g.,

  • Knowledge: easier items (minimal competence); difficult items (certify professionals); full range (accurately measure people’s knowledge across the range)
  • Personality: screen out extremes (e.g., antisocial) vs. assess normal personality (e.g., agreeableness)
  • Adaptive items (regardless of domain): the initial item is in the “middle” of the construct continuum; subsequent items are tailored to test-takers’ past responses (reliable items consistent with one’s true score improve estimation, reducing test time); see the sketch below
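To make the adaptive idea concrete, here is a minimal sketch, not the presenter’s method: it starts an examinee in the middle of the continuum, administers the unused item closest to the current estimate, and narrows the estimate with each response. The item pool, the step-halving update (a stand-in for a real IRT ability estimate), and the Rasch-style simulated test-taker are all illustrative assumptions.

```python
# Minimal adaptive-item sketch (illustrative; real computer-adaptive
# testing would use an IRT ability estimate rather than step-halving).
import math
import random

def adaptive_test(difficulties, get_response, n_items=5):
    """Administer n_items adaptively from a pool of item difficulties."""
    remaining = dict(enumerate(difficulties))
    theta, step = 0.0, 1.0          # start in the "middle" of the continuum
    for _ in range(n_items):
        # next item: the unused one whose difficulty is closest to theta
        item = min(remaining, key=lambda i: abs(remaining[i] - theta))
        correct = get_response(item)          # True/False from the test-taker
        theta += step if correct else -step   # move toward the true score
        step /= 2                             # tighten the estimate each time
        del remaining[item]
    return theta

# Toy usage: simulate a Rasch-like test-taker with true ability 0.8
pool = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
respond = lambda i: random.random() < 1 / (1 + math.exp(pool[i] - 0.8))
print(adaptive_test(pool, respond))
```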

SLIDE 20

Q2: How to develop good items?

Items developed will eventually be refined based on psychometric analysis (item-remainder correlations; CFA factor loadings); one common check is sketched below.
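As a concrete example of the first check, here is a minimal sketch (my illustration; the function name and toy data are assumptions) of item-remainder correlations: each item is correlated with the sum of the remaining items, so the item does not inflate its own correlation.

```python
import numpy as np

def item_remainder_correlations(responses: np.ndarray) -> np.ndarray:
    """responses: (n_people, n_items) matrix of item scores."""
    n_items = responses.shape[1]
    corrs = np.empty(n_items)
    for j in range(n_items):
        # sum of all items except item j
        remainder = responses[:, np.arange(n_items) != j].sum(axis=1)
        corrs[j] = np.corrcoef(responses[:, j], remainder)[0, 1]
    return corrs  # low values flag items to revise or drop

# Toy usage: 5 people x 4 Likert items; item 4 runs against the others
data = np.array([[4, 5, 4, 1],
                 [3, 4, 3, 2],
                 [5, 5, 5, 1],
                 [2, 2, 1, 5],
                 [4, 3, 4, 2]])
print(item_remainder_correlations(data).round(2))
```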

SLIDE 21

Q3: Format of the measure?

Instructions

  • Often, instructions are written at too high a grade level (5 grade levels too high for a patient discharge interview; Spandorfer et al., 1993).
  • Very few people read them (Novick & Ward, 2006), though novices who read them will improve (Catrambone, 1990).
  • Detailed instructions about providing “objective” information (BARS, time spans, frequencies) often do not change the subjective response process.
  • General suggestion: Assume that test-takers will ignore (or skim) instructions and proceed accordingly! Novel formats requiring instructions demand pilot testing to ensure comprehension and quality responding.

SLIDE 22

Q3: Format of the measure?

Look and feel?

  • Grammar and syntax matter, not just for understandability and better data but for credibility.
  • Clear, readable font (this is HFES! ☺).
  • Intuitive method of responding.
  • For web-based measures, check browser types and screen resolutions.
  • Minimize drudgery; maximize simplicity and readability. Keep all cognitive burdens to a necessary minimum (see Dillman’s Tailored Design Method; Dillman, 2014).

SLIDE 23

Q3: Format of the measure?

Randomize items?

  • Cluster items within a construct; cycle across constructs; randomize items?
  • Randomizing does not seem to make a difference. It might increase honesty (by not knowing the construct being assessed), though it may also heighten confusion (by shifting gears across constructs).
  • General suggestion: Group items within construct, especially for longer batteries of measures.

SLIDE 24

Q3: Format of the measure?

Negatively worded items?

  • Responses based on misreading even a small number of negatively worded items can distort the reliability, and therefore the validity, of a measure (Schmitt & Stults, 1985).
  • Double negatives are especially bad.
  • General advice: Keeping items in the ‘direction’ of the construct and avoiding negation outweighs ‘balancing’ positively and negatively worded items.

SLIDE 25

Q3: Format of the measure?

Pilot test a new measure before collecting data. Have people typical of your future sample(s) take the measure and provide feedback about the items (e.g., relevant, non-invasive, culturally sensitive, no typos, easy to take?).

SLIDE 26

Q3: How many items?

Are single-item measures useful? Perhaps in some cases:

  • Research using single-item measures of each of the five JDI job satisfaction facets found correlations between .60 and .72 with the full-length versions of the JDI scales (Nagy, 2002)
  • Review of single-item graphical representation scales, so-called “faces” scales (Patrician, 2004)
  • Single-item measures work best for “homogeneous” constructs (Loo, 2002)

Pop Quiz: …Why not always administer a single-item test (hey, it would be quick to administer, easy to grade)?

SLIDE 27

Q4: Evidence for reliability?

          Item 1   Item 2   Item 3   Item 4   Item 5   Item 6   Item 7
Item 1    var1     cov12    cov13    cov14    cov15    cov16    cov17
Item 2    cov21    var2     cov23    cov24    cov25    cov26    cov27
Item 3    cov31    cov32    var3     cov34    cov35    cov36    cov37
Item 4    cov41    cov42    cov43    var4     cov45    cov46    cov47
Item 5    cov51    cov52    cov53    cov54    var5     cov56    cov57
Item 6    cov61    cov62    cov63    cov64    cov65    var6     cov67
Item 7    cov71    cov72    cov73    cov74    cov75    cov76    var7

  • When an item composite (scale score) is a simple average of items, then its variance is the average of all entries in the var/cov matrix (e.g., above); see the derivation sketched below.
  • With the addition of items to a composite, item variances increase linearly, but covariances increase geometrically; even when items are moderately correlated, the latter quickly swamps the former.
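A short derivation of the “average of all entries” point (my sketch, consistent with the slide’s claim, not taken from it):

```latex
% Variance of a simple k-item average composite \bar{X} = \frac{1}{k}\sum_i x_i:
\mathrm{Var}(\bar{X})
  = \frac{1}{k^{2}} \sum_{i=1}^{k} \sum_{j=1}^{k} \mathrm{Cov}(x_i, x_j)
% i.e., the mean of all k^2 entries of the var/cov matrix above.
% The diagonal contributes k variance terms (linear in k); the
% off-diagonal contributes k(k-1) covariance terms (quadratic in k),
% which is why covariances quickly dominate as items are added.
```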

SLIDE 28

Q4: Evidence for reliability?

Assuming that items reflect a single dimension:

  • α = .70 is usually not high enough (Lance, Butts, & Michels, 2006, ORM)
  • higher item intercorrelations = higher alpha (add similar items)
  • additional items = higher alpha (add more items); see the sketch below
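To tie this back to the var/cov matrix two slides earlier, here is a minimal sketch (my illustration; the function name and toy data are assumptions) of coefficient alpha computed directly from the item var/cov matrix:

```python
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    """responses: (n_people, n_items) matrix of item scores."""
    k = responses.shape[1]
    cov = np.cov(responses, rowvar=False)   # the k x k item var/cov matrix
    # alpha = k/(k-1) * (1 - sum of item variances / variance of the sum)
    return (k / (k - 1)) * (1 - np.trace(cov) / cov.sum())

# Toy usage: 6 people x 4 items. Raising item intercorrelations or adding
# items grows the covariance share of cov.sum(), so alpha rises.
data = np.array([[4, 5, 4, 4],
                 [3, 4, 3, 3],
                 [5, 5, 5, 4],
                 [2, 2, 1, 2],
                 [4, 3, 4, 4],
                 [1, 2, 2, 1]])
print(round(cronbach_alpha(data), 2))
```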

SLIDE 29

Q5: Practical analysis tips? #1 - Creating an Appropriate Scale Score

1. Summing across items with missing data will lead to some people with artificially deflated scale scores. This practice seems to happen enough that I offer this tidbit:
2. In the presence of missing data, average the scores on items within a scale.
3. Guard against too much missingness: compute an average only when a person has answered at least, say, 80% of the items in the scale (e.g., 4 out of 5 items).
4. The average score can be a more interpretable metric (e.g., the score is back on the Likert scale for each item). However, to bring scale scores (and their descriptives) back to the metric of the raw sum scale, multiply the mean by the number of items in the scale. (A short sketch follows.)
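A minimal sketch of tips 1-4 (my illustration; the function name, threshold parameter, and toy data are assumptions): average within a scale, require an 80% response rate, and optionally rescale the mean back to the sum metric.

```python
import numpy as np

def scale_score(items: np.ndarray, min_answered: float = 0.8,
                as_sum_metric: bool = False) -> np.ndarray:
    """items: (n_people, n_items) with np.nan marking missing responses."""
    n_items = items.shape[1]
    answered = np.sum(~np.isnan(items), axis=1)
    means = np.nanmean(items, axis=1)                 # average, never sum
    means[answered < min_answered * n_items] = np.nan  # too much missing
    return means * n_items if as_sum_metric else means

# Toy usage: person 2 skipped 2 of 5 items (60% answered -> no score)
data = np.array([[4, 5, 4, 3, 4],
                 [3, np.nan, np.nan, 2, 3],
                 [5, 5, np.nan, 4, 5]])
print(scale_score(data))   # [4.0, nan, 4.75]
```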

SLIDE 30

Q5: Practical analysis tips? #2 - Labeling and Reverse-Scoring Items

1. Label your items and variables as if you were giving the data file to someone who could not contact you for further help in using the file.
2. …This “someone” may well be you: you are busy, and notes that were once clear are now cryptic (including mental notes – you aren’t getting younger)!
3. Suggestion: Indicate the item # within the variable label (e.g., CONSC31); create a code book and/or a very clear variable list; create/save all program and syntax files; save reverse-coded items as separate variables (so that you never unknowingly reverse the reverse-coding back into the original variable). A short sketch follows.
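A minimal sketch of tip 3 (my illustration; the function name and the "_R" suffix convention are assumptions): reverse-code into a new, separately named variable so the original item can never be accidentally reverse-coded twice.

```python
import pandas as pd

def add_reverse_coded(df: pd.DataFrame, item: str, scale_max: int,
                      scale_min: int = 1) -> pd.DataFrame:
    """Create e.g. CONSC31_R from CONSC31 on a scale_min..scale_max Likert scale."""
    # keep the original column untouched; write the recode to a new variable
    df[item + "_R"] = (scale_max + scale_min) - df[item]
    return df

# Toy usage on a 1-5 Likert item
df = pd.DataFrame({"CONSC31": [1, 3, 5]})
print(add_reverse_coded(df, "CONSC31", scale_max=5))  # CONSC31_R: 5, 3, 1
```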

SLIDE 31

Some Useful References

General Guidelines

DeVellis, R. F. (2017). Scale development: Theory and applications. Thousand Oaks, CA: Sage. [useful book]

Hinkin, T. R. (1998). A brief tutorial on the development of measures for use in survey questionnaires. Organizational Research Methods, 1, 104-121. [useful tips]

Shortening Measures

Stanton, J. M., Sinar, E. F., Balzer, W. K., & Smith, P. C. (2002). Issues and strategies for reducing the length of self-report scales. Personnel Psychology, 55, 167-193. [tips]

Donnellan, M. B., Oswald, F. L., Baird, B. M., & Lucas, R. E. (2006). The Mini-IPIP scales: Tiny-yet-effective measures of the Big Five factors of personality. Psychological Assessment, 18, 192-203. [example process of scale development]

SLIDE 32

Thank you!

Fred Oswald Rice University 6100 Main St – MS25 Houston, TX 77005 foswald@rice.edu workforce.rice.edu

[Image credit: animalmascots.com/01-00887/German-Shepard-Mascot-Costume]