Statistics is far, far simpler than normal life. In most spheres of daily existence, there are hundreds of things that could go wrong, whereas in statistics there are just two, very simply named Type I and Type II. If you can avoid both, you’ll do just fine. Type I error is essentially seeing something that isn’t there, and Type II error is failing to spot something that is. This post will take you through Type I; next week’s will cover Type II, all with the help of Captain Statto and the crew of the pirate ship Regressor.
Imagine, if you will, the day Captain Statto sent Avery up to the crow’s nest with his telescope to have a look around. Statto thought there were no other ships around but wanted Avery to check before settling down for the evening with a bottle of rum. After a few minutes, Avery proudly shouts down that there’s a ship flying a Jolly Roger two points abaft the beam, starboard side. Battle stations! The crew drag themselves away from their games of Liar’s Dice and load the cannon. They patiently scan the horizon, but there’s no sign of another ship. It turned out that Avery’s Jolly Roger was in fact a seagull. Avery mistakenly rejected the null hypothesis (that there were no other ships in the area), committing a Type I error, based on a problem with the data collection instrument, namely Avery’s rum-soaked eyes.
Unreliable eyes
Avery’s eyes proved to be neither a valid nor a reliable method of data collection, and Type I errors often have to do with the instruments used. Validity was previously covered here, and basically refers to whether a scale measures what it claims to measure. Reliability, the other essential property of good measurement, has to do with whether a scale works consistently for different people and for the same people over time.
In quantitative analysis, there are two kinds of reliability: internal and test-retest. Internal reliability is used for multi-item scales and tests whether people give consistent patterns of answers. For example, internal reliability is high when everyone who ticks A on question 1 also ticks B on question 2. Internal reliability is measured using Cronbach’s α statistic, which is based on the average correlation between scale items, and values above .7 indicate good reliability. One thing to consider is the population on which the reliability analysis is based. For example, there is a tendency to standardise assessment tools using undergraduate students as participants, often for course credit. Undergraduate students are systematically different from the general population, in their age profile and average weekly consumption of alcohol among other things, and this leaves the instrument open to the criticism that it is not reliable for the general population. There’ll be more on the perils of sampling next week.
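If you’re curious how α comes together, here is a minimal sketch in Python of the standard formula; the item responses are made up purely for illustration.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, k_items) array of scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of respondents' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses: 6 respondents answering 4 Likert-type items
scores = [[4, 5, 4, 4],
          [2, 2, 3, 2],
          [5, 5, 4, 5],
          [3, 3, 3, 4],
          [4, 4, 5, 4],
          [1, 2, 2, 1]]
print(round(cronbach_alpha(scores), 2))  # values above .7 suggest good internal reliability
```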
Test-retest reliability is concerned with whether the same person gives the same answer when they respond to items on a scale more than once. For anything concerning humans, the convention is to measure at two time-points two weeks apart. If the gap is too short, scores may be inflated owing to memory effects, while too long between responses opens up the possibility of different scores due to genuine fluctuations in the intensity of whatever is being measured; this is especially true of psychological problems like depression and anxiety. Test-retest reliability can be assessed using Pearson correlations between items or between scale total scores for each participant, and correlations of the order of .85 or .9 can be expected.
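In practice this boils down to a simple correlation. Here is a quick Python sketch with hypothetical scale totals for eight participants measured two weeks apart.

```python
import numpy as np

# Hypothetical scale totals for the same 8 participants, two weeks apart
time1 = np.array([22, 31, 18, 27, 25, 35, 20, 29])
time2 = np.array([24, 30, 17, 28, 26, 33, 21, 27])

# Pearson correlation between the two sets of totals
r = np.corrcoef(time1, time2)[0, 1]
print(round(r, 2))  # correlations of around .85 to .9 indicate good test-retest reliability
```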
Risk of bias
In qualitative research, inter-coder reliability is used as a safeguard against bias in analysis. Analysis of interviews, for example, usually involves the development of a coding frame: a list of themes relevant to the study that might arise in the interviews. The first step is for one researcher to note occurrences of all the codes in all the interviews. Using the same coding frame, a second rater then independently analyses a sample of interviews. The percentage agreement between the original and the second ratings is calculated and adjusted for chance using some version of the κ statistic. If the minimum accepted κ coefficient of about .7 is reached for a code, it is considered reliable.
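Cohen’s κ is the most common version of that chance correction. Here is a minimal Python sketch for a single code rated present or absent by two coders; the ratings are invented for illustration.

```python
import numpy as np

def cohens_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters' codes."""
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    p_observed = np.mean(r1 == r2)  # raw percentage agreement
    # Agreement expected by chance, from each rater's marginal code frequencies
    categories = np.unique(np.concatenate([r1, r2]))
    p_chance = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical presence (1) / absence (0) of one code across 10 interview segments
original = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
second   = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
print(round(cohens_kappa(original, second), 2))  # κ of about .7 or above is usually accepted
```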
The level of significance of a statistical test result means the level of risk of Type I error that you’re prepared to live with. Most people are happy with 5%, about a one-in-twenty chance of finding something that isn’t actually there. Even once you’ve minimised the risk of measurement error, there are countless extraneous variables in even the best-designed studies, but being 95% sure of something is usually enough. So, having relieved Avery of look-out duties, Captain Statto can sail happily onwards for a further 19 days before expecting a similar seagull fail. Unless someone makes a Type II error next week…
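Before then, if you’d like to convince yourself that the 5% really does behave like a one-in-twenty risk, here is a rough simulation sketch in Python; the t-test, sample sizes, and seed are purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_lookouts = 10_000          # hypothetical number of evenings scanning an empty horizon
false_positives = 0

for _ in range(n_lookouts):
    # Both samples come from the same distribution, so the null hypothesis is true
    a = rng.normal(loc=0, scale=1, size=30)
    b = rng.normal(loc=0, scale=1, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:             # "ship spotted" even though there is nothing out there
        false_positives += 1

print(false_positives / n_lookouts)  # close to 0.05: roughly one seagull in every twenty looks
```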