Haste still pays haste, and leisure answers leisure;
Like doth quit like, and Measure still for Measure (Shakespeare, 1623).
There are those who would have you believe that Shakespeare was writing about justice and morality and how everyone gets what they deserve in the end. But what if Shakespeare was really trying to teach us an important lesson about the statistical validity of measurement scales?
Measurement is about defining a unit of comparison. It makes assumptions about the nature of reality (the nature of justice, even), but we’ll stick to the numbers here. Statisticians like to assign numbers to things so we can compare them, and validity has to do with whether a unit on a scale actually measures what it’s supposed to measure. If so, we can be confident in comparisons based on the scale. Reliability, a story for another day, is whether the measure works consistently.
Clever people define a metre
As a unit of measurement, metres are a brilliant invention. Depending on your required degree of precision, metres can be used to measure everything from the circumference of the earth to the width of a hair. The important thing about metres, though, is that they are an invention. After using various body parts to express the height, depth, and length of things in hilariously unreliable ways, all the clever people got together and decided to pick one standard way of expressing highness, deepness, and longness. The metre assumes that the constructs it seeks to measure are stable in time and space, at least on the surface of the earth. That gives us Step One: construct validity means being valid in theory.
Metres are also nice and tangible. You can take a couple of metre sticks and hold them up against a wall, plunge them into a pond, or lay them out on the ground and you’ll probably be fairly satisfied that they are indeed measuring up-thereness, down-thereness, and far-awayness. Step Two, face validity, is nothing to do with a certain fictional ex-American army character but everything to do with whether an assessment actually looks like it measures what it aims to measure.
Scales can be either criterion measures or norm measures; criterion measures deal in absolute terms while norm measures just aim to compare a thing to lots of other things. Metres are a good example of a criterion measure, owing to their clearer empirical basis (philosophical debates aside). Normative scales are based on measuring some property of lots of individual things and then dividing up the difference between the highest and the lowest to make a unit. Sounds like there are a few more assumptions going on there than with our lovely solid metre stick. IQ is one of the best and most contentious examples of a normative scale. The adage is that IQ tests measure intelligence while intelligence is the thing that IQ tests measure. What’s important is knowing that, unlike all the clever people who agreed how big a metre would always be, no one is quite sure exactly what an IQ of 100 actually means, apart from being about average in a select group of people at a particular point in time. Normative scales, then, need to provide a bit more evidence to convince us of their validity.
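To make the "about average in a select group" point concrete, here is a minimal sketch of norm-referenced scoring. It assumes the common convention of centring scores at 100 with a spread of 15, which the text above does not spell out; the function name, the norm groups, and the raw score are all made up for illustration.

```python
from statistics import mean, pstdev

def norm_referenced_score(raw, norm_sample, centre=100, spread=15):
    """Express a raw score relative to a norm group (hypothetical helper).

    The centre and spread defaults are an assumed convention, not something
    taken from the article.
    """
    group_mean = mean(norm_sample)
    group_sd = pstdev(norm_sample)
    return centre + spread * (raw - group_mean) / group_sd

# Two invented norm groups that took the same test.
norm_group_a = [20, 25, 30, 35, 40]
norm_group_b = [30, 35, 40, 45, 50]

raw = 30  # one person's raw test score
score_a = norm_referenced_score(raw, norm_group_a)  # average for group A
score_b = norm_referenced_score(raw, norm_group_b)  # below average for group B

print(score_a, score_b)
```

The same raw score lands in a different place depending on who it is compared with, which is exactly the extra assumption a metre stick does not carry.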
Steps 3 to 5, especially for normative scales
Assuming for a moment that the egg came before the chicken, Step Three, convergent validity, involves comparing an assessment tool to another one that aims to measure the same thing. It basically means that if X is valid and my new assessment tool, Y, is just like X, then Y is also valid. That’s quite a big assumption but we’re not exactly starting again at the beginning of time, so if you pick an X that has well established psychometric properties, there’s a good chance that your assumption is sound. You might also hear Step Three described as concurrent validity, but anything that reports a correlation with a different scale is describing convergent validity. If regressions are just fancy correlations then predictive validity, which you might also come across, is just fancy convergent validity.
Divergent validity is, believe it or not, the opposite of convergent validity and provides us with Step Four. Just as convergent validity is demonstrated by positive correlations with similar assessments, divergent validity relies on establishing that there is no correlation with a scale that measures a theoretically unrelated construct, in order to eliminate that construct from further enquiry. Theoretical blurring in depression and anxiety, for example, requires divergent validity for scales of either.
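Steps Three and Four both come down to correlations, so they can be sketched together. The example below assumes Pearson's correlation as the measure of association and uses invented scores for eight people; the scale names are placeholders, not real instruments.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented scores for the same eight people on three scales.
new_scale = [12, 15, 9, 20, 17, 11, 14, 18]          # the tool being validated (Y)
established_scale = [24, 31, 19, 41, 33, 22, 27, 37]  # measures the same thing (X)
unrelated_scale = [5, 3, 6, 6, 4, 4, 7, 5]            # theoretically unrelated construct

convergent_r = pearson_r(new_scale, established_scale)  # hoped to be strongly positive
divergent_r = pearson_r(new_scale, unrelated_scale)     # hoped to be near zero

print(round(convergent_r, 2), round(divergent_r, 2))
```

With real data, how strong "strong" needs to be and how close to zero counts as "no correlation" are judgement calls that depend on the field and the sample size.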
Discriminant validity is Step Five and gets at the capacity of an assessment to accurately distinguish between people or groups of people. An anger assessment tool should be able to tell who’s angry and who’s not; Mr T might have quite a high score and Ned Flanders quite a low one. Metres aren’t actually all that good for discriminating the size of babies, because most babies are less than one metre tall, but centimetres do a perfectly valid job.
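Both halves of Step Five can be sketched in a few lines. The first part uses a standardised mean difference (Cohen's d, one common choice that the text itself does not name) to check whether a hypothetical anger scale separates two groups; the second shows why whole metres cannot tell babies apart while centimetres can. All numbers are invented.

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """Standardised mean difference using a pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Invented anger-scale scores for two groups we expect to differ.
angry_group = [38, 42, 35, 40, 44]
calm_group = [12, 15, 10, 18, 14]
d = cohens_d(angry_group, calm_group)  # a large value means the groups are well separated

# Invented baby lengths, recorded in centimetres.
baby_lengths_cm = [48, 51, 53, 50, 55]
in_whole_metres = [length // 100 for length in baby_lengths_cm]

distinct_metres = len(set(in_whole_metres))  # every baby gets the same value
distinct_cm = len(set(baby_lengths_cm))      # each baby gets its own value

print(round(d, 1), distinct_metres, distinct_cm)
```

A unit that gives everyone the same number cannot discriminate between them, however valid it is in theory.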
So, think carefully the next time you try to put a number on something. If it’s valid in theory it has construct validity, if it looks valid then it has face validity, if it looks like another valid scale then it has convergent validity, if it’s nothing like an unrelated concept then it has divergent validity, and if it can tell you who’s in and who’s out it has discriminant validity. And if you don’t establish validity you’ll get what you deserve in the end. Thanks, Shakespeare!