In statistics, the term reliability refers to the consistency of a measure.
If we measure something like intelligence, knowledge, productivity, or efficiency in the same individuals multiple times, are the measurements consistent?
Ideally, researchers want a test to have high reliability because it provides consistent measurements over time, which means the results of the test can be trusted.
It turns out that there are four common ways to measure reliability:
1. Split-Half Reliability Method – Determines how much error in the test results is due to poor test construction – e.g. poorly worded questions or confusing instructions.
This method uses the following process:
- Split a test into two halves. For example, one half may be composed of even-numbered questions while the other half is composed of odd-numbered questions.
- Administer each half to the same individual.
- Repeat for a large group of individuals.
- Calculate the correlation between the scores for both halves.
The higher the correlation between the two halves, the higher the internal consistency of the test or survey. Ideally you would like the correlation between the halves to be high because this indicates that all parts of the test are contributing equally to what is being measured.
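To make the calculation concrete, here is a minimal sketch in Python using hypothetical half-test scores. It computes the correlation between the two halves with SciPy's pearsonr, and also applies the standard Spearman-Brown adjustment, which steps the half-test correlation up to an estimate of full-test reliability:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical data: total scores on the odd- and even-numbered halves
# of a 20-question test for ten individuals.
odd_half = np.array([8, 6, 9, 5, 7, 10, 4, 8, 6, 9])
even_half = np.array([7, 6, 8, 5, 8, 9, 5, 7, 6, 8])

# Correlation between the two halves.
r_half, _ = pearsonr(odd_half, even_half)

# A half-length test understates full-test reliability, so the
# Spearman-Brown formula is commonly used to step the estimate up.
r_full = (2 * r_half) / (1 + r_half)

print(f"Split-half correlation: {r_half:.3f}")
print(f"Spearman-Brown adjusted reliability: {r_full:.3f}")
```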
2. Test-Retest Reliability Method – Determines how much error in the test results is due to administration problems – e.g. loud environment, poor lighting, insufficient time to complete test.
This method uses the following process:
- Administer a test to a group of individuals.
- Wait some amount of time (days, weeks, or months) and administer the same test to the same group of individuals.
- Calculate the correlation between the scores of the two tests.
Generally, a test-retest reliability correlation of 0.80 or higher indicates good reliability.
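Here is a sketch of the calculation in Python, using hypothetical scores from two administrations of the same test several weeks apart:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical scores for the same ten individuals, tested twice
# several weeks apart.
test = np.array([88, 92, 75, 83, 95, 68, 79, 90, 84, 72])
retest = np.array([86, 94, 78, 80, 96, 70, 77, 91, 85, 75])

r, _ = pearsonr(test, retest)

# Correlations of 0.80 or higher generally indicate good reliability.
print(f"Test-retest reliability: {r:.3f}")
```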
3. Parallel Forms Reliability Method – Determines how much error in the test results is due to outside effects – e.g. students getting access to questions ahead of time or students getting better scores by simply practicing more.
This method uses the following process:
- Administer one version of a test to a group of individuals.
- Administer an alternate but equally difficult version of the test to the same group of individuals.
- Calculate the correlation between the scores of the two tests.
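The calculation mirrors the test-retest sketch above; the one addition worth making is a check that the two forms really are equally difficult, i.e. that their means and standard deviations are similar. Again, the scores below are hypothetical:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical scores on two alternate forms of the same test.
form_a = np.array([84, 91, 70, 77, 93, 65, 81, 88, 74, 79])
form_b = np.array([82, 93, 72, 75, 90, 68, 80, 86, 76, 77])

# The forms should be equally difficult, so their means and standard
# deviations should be similar before the correlation is interpreted.
print(f"Form A: mean = {form_a.mean():.1f}, sd = {form_a.std(ddof=1):.1f}")
print(f"Form B: mean = {form_b.mean():.1f}, sd = {form_b.std(ddof=1):.1f}")

r, _ = pearsonr(form_a, form_b)
print(f"Parallel forms reliability: {r:.3f}")
```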
4. Inter-rater Reliability Method – Determines how consistently each item on a test measures the construct of interest – e.g. do qualified raters agree that every question is clearly worded and relevant to the construct being measured?
This method involves having multiple qualified raters or judges rate each item on a test and then calculating the overall percent agreement between raters or judges.
The higher the percent agreement between judges, the higher the reliability of the test.
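As a minimal sketch, suppose three judges each rate ten test items as acceptable (1) or not (0); the percent agreement is the share of items on which all three judges give the same rating. The ratings below are hypothetical:

```python
import numpy as np

# Hypothetical ratings: rows are test items, columns are judges.
ratings = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [1, 1, 1],
    [0, 1, 0],
])

# An item counts as agreement only if every judge gave the same rating.
all_agree = np.all(ratings == ratings[:, [0]], axis=1)

print(f"Percent agreement: {all_agree.mean() * 100:.0f}%")  # 70% here
```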
Reliability vs. Validity
Reliability refers to the consistency of a measure and validity refers to the extent to which a test or scale measures the construct it sets out to measure.
A good test or scale is one that has both high reliability and high validity. However, it’s possible for a test or scale to have reliability without having validity.
For example, suppose a scale consistently weighs boxes as 10 pounds over their true weight. This scale is reliable because it's consistent in its measurements, but it's not valid because it doesn't measure the true weight.
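The scale example can be simulated in a few lines of Python; the readings below are synthetic, with a tiny random spread around a fixed 10-pound bias:

```python
import numpy as np

rng = np.random.default_rng(0)

true_weight = 50  # pounds
# Every reading sits about 10 pounds above the true weight, with
# almost no scatter: consistent (reliable) but systematically wrong.
readings = true_weight + 10 + rng.normal(0, 0.1, size=20)

print(f"Mean reading: {readings.mean():.2f} lb (true weight: {true_weight} lb)")
print(f"Spread of readings (sd): {readings.std(ddof=1):.2f} lb")
```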
Reliability & Standard Error of Measurement
A reliability coefficient can also be used to calculate a standard error of measurement, which estimates the variation around a “true” score for an individual when repeated measures are taken.
It is calculated as:
SEm = s × √(1 − R)
where:
- s: The standard deviation of measurements
- R: The reliability coefficient of a test
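For example, with hypothetical values s = 10 and R = 0.90, the standard error of measurement works out to 10 × √(1 − 0.90) ≈ 3.16. In Python:

```python
import math

s = 10    # standard deviation of the measurements (hypothetical)
R = 0.90  # reliability coefficient of the test (hypothetical)

sem = s * math.sqrt(1 - R)
print(f"Standard error of measurement: {sem:.2f}")  # 3.16
```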
Refer to this article for an in-depth explanation of the standard error of measurement.