Reliability for single item measures

Authors
Affiliations

IJsbrand Leertouwer

Youth & Family department, Erasmus University Rotterdam

Noémi Schuurman

Methodology & Statistics Department, Utrecht University

Published

2025-05-23

This article has not been peer-reviewed yet and may be subject to change.

This article is about the reliability of single-item measures. Reliability refers to the consistency of measurements: when a construct does not vary, reliable measurements of this construct should also not vary. Single-item measures capture an entire construct using just one question. These measures are commonly used in intensive longitudinal data (ILD) designs to reduce participant burden. However, their reliability is rarely assessed or reported.

This lack of documentation is problematic: unreliable measurements can distort parameter estimates, potentially leading to inaccurate or biased conclusions. To improve the interpretability of research findings, it is essential to assess the reliability of single-item measures and, when feasible, to correct for their imperfect reliability.

So far, the available methods for determining the reliability of single-item measures are based on the test-retest framework. In this article you will find: 1) a brief explanation of the test-retest framework for both classical test theory and ILD; 2) an explanation of a non-model-based method that stays close to classical test theory; and 3) an explanation of a model-based method that is based on specifying the dynamics of the construct under study.

1 The test-retest framework

In the test-retest framework, the same “test” is administered multiple times, and reliability is estimated based on the difference between the test scores. This makes it a suitable candidate for estimating the reliability of single-item measures. Other frameworks, such as the internal consistency framework and the parallel test framework, require multiple items, or even multiple tests.

1.1 The test-retest framework for classical test theory

In classical test theory, the constructs of interest are assumed to be stable over time. For such stable properties, referred to as traits, an observed score at a time point t consists of a stable part of the score that is referred to as the true score and a measurement error that is specific to the time point:

\[ observed\ score_t = true\ score + measurement\ error_t. \]

Ronnie studies working memory by asking people to repeat a sequence of letters and numbers that grows by one element with each repetition. When a person fails to reproduce a sequence, he stops and counts the number of elements in that sequence; this count is the score for working memory.

In order to study the reliability of his measurements, he administers the same test twice over a two-week period. A person’s true ability is not expected to change within this period, but their actual scores may be affected by external influences, such as a bad night of sleep or a good cup of coffee. These influences are random and distort the measurement of the true ability; they are referred to as measurement error.

Although reliability is formulated at the level of the individual, classical test theory defines it for a population of individuals. Reliability can be expressed as the proportion of true score variance to total variance, or one minus the proportion of measurement error variance to total variance:

\[ reliability(observed\ scores) = \frac{variance(true\ scores)} {variance(observed\ scores)} = 1 - \frac{variance(measurement\ errors)}{variance(observed\ scores)}. \]

In classical test theory, variance in the true scores refers to variance between individuals, which is the variance in their trait scores (i.e., their averages over time). Variance in scores within the same individual is interpreted as measurement error.
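To make this definition concrete, below is a minimal simulation sketch in Python. The variance values and variable names are our own illustration, not part of classical test theory itself: each person gets one stable true score, each administration adds independent measurement error, and reliability is recovered both from the defining variance ratio and from the test-retest correlation.

```python
import numpy as np

rng = np.random.default_rng(42)

n_persons = 1000   # a population of individuals
true_var = 4.0     # assumed between-person variance of the stable true scores
error_var = 1.0    # assumed variance of the occasion-specific measurement errors

# Each person has one stable true score...
true_scores = rng.normal(loc=10.0, scale=np.sqrt(true_var), size=n_persons)
# ...and each administration adds independent measurement error.
test = true_scores + rng.normal(scale=np.sqrt(error_var), size=n_persons)
retest = true_scores + rng.normal(scale=np.sqrt(error_var), size=n_persons)

# Reliability as the proportion of true score variance in total variance:
print(true_var / (true_var + error_var))  # population value: 0.8
# Under these assumptions, the test-retest correlation estimates the same quantity:
print(np.corrcoef(test, retest)[0, 1])
```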

1.2 The test-retest framework for intensive longitudinal data

In contrast to classical test theory, constructs in ILD are often expected to vary regularly over time within individuals:

\[ observed\ score_t = true\ score_t + measurement\ error_t. \]

Such varying experiences are referred to as states and are often of key interest in ILD.

Ayoko wants to measure people’s enthusiasm. She designs an app that prompts individuals to rate their enthusiasm on a scale from 0 to 100 in the morning, afternoon, and evening over the course of a week. People’s enthusiasm is expected to vary across these measurements. For example, it will be higher when reading a good book than when standing in line to pay for lunch.

In addition to such actual fluctuations, each measurement of enthusiasm may be distorted by random influences. For example, someone may not fill in the value they intended because they are distracted. Or they may settle for a value that is close enough to their intended value but not exactly it, because they have big fingers and it is hard to select an exact value.

As a result, there is true variance in the enthusiasm scores, as well as measurement error variance.

In order to get a reliability estimate for states within a person using a test-retest approach, we need a way to distinguish the variance in the true score from variance due to measurement error.

So far, two tricks for doing so have been proposed, which are applied at different phases of research. The first is to shrink down the time interval for some measurements to almost zero, such that the true score is unlikely to change. This method requires a specific measurement phase. The second is to assume a dynamic process for the true score and add a variance term that is unique to a measurement occasion. This method can be applied during the analysis phase and does not require a specific measurement phase. Both methods can be used for single-item measures and single (or multiple) individuals. They are, however, based on different assumptions, as outlined below.

2 Shrinking down the time interval to almost zero

One solution for separating variance in the true score from variance due to measurement error is to zoom in on a time interval so small that the true score is unlikely to change. In the immediate test-retest method (Dejonckheere et al., 2022), a measurement is repeated with only a few items in between. For example, within a questionnaire of eight items, one of the items would be repeated after these eight items, so that a minimum number of items separates the initial response from its repetition.

Measurement error variance is then expressed as the expected value (i.e., average) of the squared difference between initial and quickly repeated measurements, divided by two:

\[ variance(measurement\ errors) = \frac{average((initial\ items\ - repetitions)^2)}{2}. \]

In order to get a reliability estimate, this measurement error variance is divided by the variance of the combined initial and quickly repeated measurements, and the resulting ratio is subtracted from one:

\[ reliability(repeated\ scores) = 1 - \frac{variance(measurement\ errors)}{variance(initial\ items\ and\ repetitions)}. \]

Note that the measurement error variance can be larger than the variance in the combination of initial and repeated measurements. As a result, the reliability estimate can be negative; this is most likely in the case of outliers, and/or with a low number of measurements.
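As a sketch of this calculation, the following Python snippet applies the two formulas above to a handful of made-up initial/repetition pairs; the scores are invented for illustration, and this is not the exact estimator code from Dejonckheere et al. (2022).

```python
import numpy as np

# Hypothetical initial responses and their quick repetitions (0-100 scale)
initial = np.array([62.0, 55.0, 71.0, 48.0, 66.0, 59.0])
repeats = np.array([60.0, 57.0, 70.0, 50.0, 64.0, 61.0])

# Measurement error variance: average squared difference, divided by two
error_var = np.mean((initial - repeats) ** 2) / 2

# Variance of the initial items and repetitions combined
total_var = np.var(np.concatenate([initial, repeats]))

reliability = 1 - error_var / total_var
print(reliability)  # turns negative when error_var exceeds total_var
```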

The clear benefit of this method is that it features a simple calculation. However, it does require some assumptions:

  1. The true score does not change between the initial and replicate measurement.

  2. People do not recall their initial response during the replicate measurement.

  3. People do not strive for consistency of their responses during the second measurement.

When the true score does in fact differ between the initial and repeated measurement (i.e., assumption 1 is violated), variance in the true score is treated as measurement error variance, and hence the reliability estimate is deflated. However, as there is typically no more than a minute between initial items and quick repetitions, it is relatively unlikely that the true score changes.

Ayoko wants to measure and model Tobias’ enthusiasm. With her app she gathers personal measurements in the morning, afternoon and evening, and wants to know how reliable Tobias’ responses are. She uses the immediate test-retest method in order to get an estimate.

In the afternoon Tobias eats a cookie just in between the original measurement of enthusiasm and the repetition. As a result, the second measurement of enthusiasm is higher than the first. The immediate test-retest method ascribes this difference to measurement error, while Tobias’ enthusiasm was in fact temporarily boosted by the tasty cookie.

When people do remember their initial score and feel inclined to provide a consistent response (i.e., assumptions 2 and 3 are violated), a difference in scores that a person would otherwise consider negligible may be artificially shrunk to zero, inflating the reliability estimate. As a result, the estimated measurement error may be artificially smaller for people who strive for consistency than for people who do not. In practice, we do not know whether people remember their initial item score and strive for consistency.

During a particular measurement, Tobias is not very sure how enthusiastic he feels. His finger falls on the number 12 and he decides that this number is close enough. Quickly after giving this rating, Tobias is presented with another question about his enthusiasm. “Didn’t I already fill this in?” Tobias thinks to himself. “What was it again? 12?” He fills in 12 again.

Even though Tobias felt the same at both measurements, any score between 0 and 20 may have sufficed for him; 12 was not his true score. However, because he wants to provide a consistent score, he fills in the exact same value twice.

If Tobias had not remembered his initial score, he may not have repeated the exact same value during the second measurement.

If Tobias did remember his first score but did not care about being consistent, he may also not have repeated the exact same value twice.

In practice, we do not know which of the scenarios above is true.

In any case, the immediate test-retest method serves as a screening tool for reliability. Based on the results, you can decide whether you find the reliability sufficient to continue with further analyses. In contrast, the next method can not only detect, but also correct for, measurement error.

3 Adding time point specific variance to a dynamic model

Another way of separating variance in the true score from measurement error variance is to assume a dynamic model for the true score and add a non-dynamic variance component. For clarity, we will consider the simplest dynamic model for the true score for the remainder of this article: the autoregressive model of order one.

You may recall that in ILD both the true score and the measurement error vary over time:

\[ observed\ score_t = true\ score_t + measurement\ error_t. \]

In the measurement error autoregressive model (Schuurman et al., 2015), the true score at measurement occasion t is decomposed into a trait score, which is stable over time (i.e., the average), and a state score, which varies over time:

\[ true\ score_t = trait\ score + state\ score_t. \]

The state score in this model is predicted from the previous state score through an autoregressive parameter \(\phi\), plus a dynamic residual:

\[ state\ score_t = \phi\;state\ score_{t-1} + dynamic\ residual_t. \]

The key property of dynamic residuals is that they carry over to the next measurement through the autoregressive parameter.

Dynamic residuals are sometimes referred to as innovations or dynamic errors. However, even though they are sometimes referred to as error, they are in fact part of the variance in the true score.

Ayoko wants to study the dynamics in Tobias’ enthusiasm. She gathers measurements in the morning, in the afternoon and in the evening. On Sunday afternoon Tobias goes to the cinema and greatly enjoys a movie. This results in a spike in his enthusiasm. In the evening, Tobias is still thinking about this movie and how good it was, still feeling some enthusiasm. In other words, the “autoregressive parameter” has “carried over” this “dynamic residual” of seeing the movie.

In contrast to dynamic residuals, measurement errors are not involved in the dynamic process of the true score. In other words, the effect of a measurement error does not carry over to the next measurement occasion.

On Tuesday morning, Tobias’ finger slips when rating his enthusiasm on his phone, and he accidentally records the highest possible value. The effect of this error is limited to this single measurement and will not affect the next measurement.

Both measurement errors and dynamic residuals are specified to have an average of zero with a certain variance. True score variance is equal to the variance of the autoregressive process (Staudenmayer & Buonaccorsi, 2005):

\[ variance(true\ score) = \frac{variance(dynamic\ residuals)}{1-\phi^2}. \]
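As a quick numerical check of this relation, the sketch below simulates a long AR(1) state process, with assumed values for \(\phi\) and the dynamic residual variance, and compares the sample variance of the states with the analytic value.

```python
import numpy as np

rng = np.random.default_rng(1)

phi = 0.5        # assumed autoregressive parameter
innov_var = 2.0  # assumed variance of the dynamic residuals

# Simulate a long AR(1) state process so the sample variance is stable
n = 200_000
state = np.zeros(n)
residuals = rng.normal(scale=np.sqrt(innov_var), size=n)
for t in range(1, n):
    state[t] = phi * state[t - 1] + residuals[t]

print(innov_var / (1 - phi**2))  # analytic true score variance: ~2.67
print(np.var(state[1000:]))      # sample variance, discarding some burn-in
```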

In order to calculate reliability, we can divide the true score variance by the total variance, which encompasses both true score variance and measurement error variance:

\[ reliability(observed\ scores) = \frac{variance(true\ score)}{variance(true\ score) + variance(measurement\ errors)}. \]
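Putting the two formulas together, a small helper function (a sketch with assumed parameter values, not estimation code) computes the implied reliability from \(\phi\), the dynamic residual variance, and the measurement error variance:

```python
def ar1_reliability(phi: float, innov_var: float, error_var: float) -> float:
    """Reliability implied by the measurement error AR(1) model."""
    true_var = innov_var / (1 - phi**2)  # stationary true score variance
    return true_var / (true_var + error_var)

# Example with the same assumed values as above, plus a measurement error
# variance of 1.0:
print(ar1_reliability(phi=0.5, innov_var=2.0, error_var=1.0))  # ~0.73
```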

This method of calculating reliability by adding a measurement-occasion-specific term to a model is also based on several assumptions:

  1. The selected dynamic model accurately represents the dynamic process of the true score.

  2. Any variance that is not captured in the dynamic model is measurement error variance.

The first assumption implies that all underlying assumptions of the dynamic model itself must hold. In the case of the measurement error autoregressive model, this means that: a) the true score is predicted only by the immediately preceding true score, and b) the process is stationary. These assumptions are relatively restrictive. When these conditions are not met, the parameters in the model, and hence the reliability estimate, may be inaccurate and/or biased.

Ayoko wants to investigate emotional inertia (the strength of the autoregressive process) and measures people three times per day for one week. She also measures Stephen, who suffers from bipolar disorder. During the measurement week, Stephen transitions from a depressive to a hypomanic episode. As a result, his emotional state shifts abruptly and systematically, rather than evolving gradually from previous states. The data are also unlikely to be stationary. In this case, fitting an AR(1) model to this data could yield misleading parameter estimates and an unreliable estimate of reliability.

Another, less restrictive model can be used to model the true score, but this adds complexity. Note that (multivariate versions of) autoregressive models are not uncommon. When these models are used to analyze the data, it makes sense to extend them with (a) random measurement error term(s), to at least check for the potential influence of measurement error.

The second assumption is unlikely to be met in practice for the measurement error autoregressive model, and likely any dynamic model. In practice, the part of a score that is specific to a measurement occasion will contain both actual measurement error as well as fluctuations in the true score that are not carried over to the next measurement occasion. As true fluctuations are interpreted as measurement error, the reliability estimate based on this model provides a lower bound to the true reliability.

On Thursday afternoon, Tobias has a tasty cookie right before a measurement of his enthusiasm. The effect of this cookie on his enthusiasm likely does not carry over to the measurement in the evening. However, the cookie did temporarily boost his enthusiasm.

If measurements were taken every 3 minutes, this change in the true score may have been captured and modeled. In this case, it would have been part of the variance in the true score.

Relatedly, it is important to note that dynamic residuals and measurement errors in the measurement error autoregressive model can only be distinguished from each other by the fact that the effect of the dynamic residuals carries over to the next measurement. When dynamic residuals do not carry over (i.e., the autoregressive parameter is zero), the model is not identified.

As noted earlier, the model described above allows parameter estimates to be adjusted for measurement error, because it includes a measurement model. In this specific case, when a regular autoregressive model is fitted to data that contain measurement error, the estimated autoregressive parameter is biased towards zero. The measurement error autoregressive model can correct for this bias.
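The sketch below illustrates this attenuation with simulated data and assumed parameter values; the lag-1 autocorrelation serves as a simple stand-in for a naive AR(1) estimate that ignores measurement error.

```python
import numpy as np

rng = np.random.default_rng(7)

phi, innov_var, error_var = 0.6, 1.0, 1.0  # assumed values
n = 100_000

# The true state follows an AR(1) process...
state = np.zeros(n)
for t in range(1, n):
    state[t] = phi * state[t - 1] + rng.normal(scale=np.sqrt(innov_var))
# ...but it is observed with added measurement error.
observed = state + rng.normal(scale=np.sqrt(error_var), size=n)

def lag1_autocorr(x):
    """Lag-1 autocorrelation, a naive stand-in for the AR(1) estimate."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

print(lag1_autocorr(state))     # close to the true phi of 0.6
print(lag1_autocorr(observed))  # attenuated towards zero (~0.37)
```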

The measurement error autoregressive model can be extended to multivariate data (Schuurman & Hamaker, 2019). In this case, the effect of measurement error on parameter estimates is much more unpredictable: measurement error in one variable may lead to under- or overestimation of other parameter values.

4 Takeaway

Both current methods for estimating the reliability of single-item measures are based on the test-retest framework, but they make different assumptions about the true score. None of these assumptions is undisputed, and as a result there is no gold standard for estimating the reliability of single items.

Quickly repeating a measurement may introduce memory effects. Whether people remember their score and try to reproduce it may differ between individuals and even between measurement occasions, making the reliability estimate diffuse and confounded.

Adding a measurement-occasion-specific term to a dynamic model is also not a perfect solution. The estimated measurement error will likely contain some true score variance that is specific to the measurement occasion, so the resulting estimate underestimates the true reliability.

Nevertheless, documenting the reliability of single-item measures in your sample is highly relevant, not only as a step toward validating your results, but also as a precedent for future studies.

5 Further reading

We have collected various topics for you to read more about below.

Read more: Design choices up to this point
Read more: Reliability for multiple items

Acknowledgments

This work was supported by the Dutch National Research Agenda (NWA/eHealth Junior consortium; project number: 1292.19.226).

References

Dejonckheere, E., Demeyer, F., Geusens, B., Piot, M., Tuerlinckx, F., Verdonck, S., & Mestdagh, M. (2022). Assessing the reliability of single-item momentary affective measurements in experience sampling. Psychological Assessment, 34(12), 1138.
Schuurman, N. K., & Hamaker, E. L. (2019). Measurement error and person-specific reliability in multilevel autoregressive modeling. Psychological Methods, 24(1), 70.
Schuurman, N. K., Houtveen, J. H., & Hamaker, E. L. (2015). Incorporating measurement error in n = 1 psychological autoregressive modeling. Frontiers in Psychology, 6, 1038.
Staudenmayer, J., & Buonaccorsi, J. P. (2005). Measurement error in linear autoregressive models. Journal of the American Statistical Association, 100(471), 841–852.

Citation

BibTeX citation:
@article{leertouwer2025,
  author = {Leertouwer, IJsbrand and Schuurman, Noémi},
  title = {Reliability for Single Item Measures},
  journal = {MATILDA},
  number = {2025-05-23},
  date = {2025-05-23},
  url = {https://matilda.fss.uu.nl/articles/reliability-single-item-measures.html},
  langid = {en}
}
For attribution, please cite this work as:
Leertouwer, Ij., & Schuurman, N. (2025). Reliability for single item measures. MATILDA, 2025-05-23. https://matilda.fss.uu.nl/articles/reliability-single-item-measures.html