Single item or multiple item measures
This page considers several theoretical implications of deciding whether to use a single item or multiple items to measure a construct.
If a single item is used, the response to that single item is used directly in the further analyses and results of the study. A common example of a construct for which a single item is often used is ‘age’.
If multiple items are used to measure a construct, some aggregation is performed on the multiple items to get a single measure for use in analyses and results: For example, by creating a sum score or mean score, or by using a statistical measurement model to do this (e.g., a latent variable model such as a factor model or item response theory model, see Bollen (2002)). Common examples of constructs for which multiple items are often used are ‘math ability’, ‘depression’ and ‘agreeableness’.
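As a minimal illustration of the two simplest aggregation options, the sketch below computes sum scores and mean scores from made-up item responses (the data, scale, and number of occasions are hypothetical, and Python is used purely for illustration; latent variable models require dedicated modeling software and are not shown here).

```python
import numpy as np

# Hypothetical responses of one participant to three items (1-7 scale)
# at five measurement occasions; rows = occasions, columns = items.
items = np.array([
    [4, 5, 3],
    [2, 2, 1],
    [6, 5, 6],
    [3, 4, 3],
    [5, 6, 5],
])

sum_scores = items.sum(axis=1)    # one sum score per occasion
mean_scores = items.mean(axis=1)  # one mean score per occasion

print(sum_scores)   # [12  5 17 10 16]
print(mean_scores)  # [4.   1.67 5.67 3.33 5.33] (rounded)
```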
It is also possible that both single item and multiple item measures are used in a study.
Considering whether you’ll use single or multiple items is important, because your choice implies certain assumptions about the mechanism that links the construct you want to measure to the data you observe. These assumptions in turn may affect your eventual results and conclusions, and they may have implications for how to validate your measurements.
In the following we first discuss single item measures, followed by multiple item measures.
1 Measuring your construct(s) with single items
In psychology we have a tradition of relying on multiple item measures for constructs that we cannot directly observe or derive (e.g., positive/negative affect). Single items are mainly used for constructs that are more straightforward to directly observe or derive (e.g., age or occupation).
For ILD, researchers more often reach for single item measures even in the former case. Often, this is for important practical reasons, such as keeping the burden for participants low. However, considerations related to theory can and should also play a role in this decision: For example, a single item may be deemed sufficient, or the most direct or clearest way of measuring the construct of interest. Or, typical latent variable theory (Borsboom et al., 2003; Borsboom, 2008), which usually forms the basis of multiple item measures, may be deemed unsuitable for a construct.
Kim wants to get a good impression of someone’s daily experienced received social support. They want to a) stay close to the participants’ own impression and interpretation of the social support they experienced, and b) keep the participants’ burden low.
Kim decides to use a single item asking “To what extent did you feel supported by others today?”, rather than asking multiple questions. Kim will use the responses to this item for further analyses on the construct ‘experienced social support’.
Using single items does not have to be problematic, if a single item gives a good enough measurement of the construct of interest for the research goals. As is the case for multiple item measures, it is essential to (find ways to) evaluate whether this is the case.
- [When are my measurements ‘good enough’?]
- [Reliability for single items]
- [Qualitative assessments of measurement quality]
- [Predictive & convergent validity]
2 Measuring your construct(s) with multiple items
There can be many reasons why a researcher may choose to study a construct with multiple items. Common reasons researchers opt for this, in our experience, are:
content validity (measuring all relevant aspects of a construct);
reliability (to account for random measurement error);
mirroring inter-individual difference questionnaires: using an adapted version of a gold standard questionnaire from inter-individual difference studies (e.g., cross-sectional studies).
It is important to realize that each of these reasons for opting for multiple items has theoretical implications, and these implications may differ and may not be reconcilable with each other. Read more about these three reasons below.
2.1 Multiple items for reasons of content validity
Multiple questions, or items, can be asked to cover all relevant aspects of the construct of interest. The different items are used to measure different important aspects of the relevant construct. The responses to these multiple items are aggregated in some way to get an overall measure of the construct of interest. This aggregate may then be used for further analyses.
Max wants to get a good impression of someone’s daily experienced received social support. To get an impression of all relevant aspects of social support for their study, Max breaks down the construct into different aspects, measured with different items.
Rather than using a single item asking “How much social support did you receive today?”, they ask multiple questions like “Did you receive help with your job today?”, “Did you feel you could reach out for emotional support from your friends/family/colleagues today?”, and “Did you rely on financial support today?”. In this way, various distinct aspects of receiving social support are covered by the multiple items.
To get an overall measure of experienced social support for further use in their analyses, Max will aggregate the responses to these items, either by using a mean score or a more complex latent variable model.
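To make the mean score option concrete, here is a minimal sketch using made-up long-format data; the item names and response values are hypothetical, and a latent variable model would instead require a dedicated measurement model.

```python
import pandas as pd

# Hypothetical long-format data: one row per item response per day.
responses = pd.DataFrame({
    "day":   [1, 1, 1, 2, 2, 2],
    "item":  ["help_job", "emotional_support", "financial_support"] * 2,
    "score": [3, 5, 1, 4, 4, 2],
})

# Aggregate the three items into one daily mean score of received social support.
daily_support = responses.groupby("day")["score"].mean()
print(daily_support)  # day 1: 3.0, day 2: ~3.33
```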
If you consider using multiple items to improve content validity, carefully consider the theoretical implications of how you should aggregate the multiple items into a single measure to be used in your analyses.
- [Multiple item measures: Sum-Scores or Mean-Scores vs Latent Variable Models]
- [Formative versus Reflective Latent Variable Models]
2.2 Multiple items for reasons of reliability
Having multiple items to measure the same construct can be helpful, because it allows using internal consistency reliability approaches. For these techniques, the general idea is that the items all measure the same underlying construct. The shared variance among those items is the result of variation in the true construct, while the unique (non-shared) variance among the items is considered measurement error.
Multiple items that measure the same construct, combined with internal consistency reliability approaches, can be used to…
a) estimate the reliability of (a set of) items for measuring a construct of interest. That is, to validate a questionnaire.
b) account for measurement error for a construct of interest during an analysis. That is, to prevent measurement error from biasing the results of analyses.
Ume has developed a short experience-sampling questionnaire where three items are used to measure daily experienced received social support. The items are “I felt supported by others today…”, “I felt grateful to others today…” and “Other people had my back today…”. The shared variance in the items over time is considered variance in the ‘true’ received social support, and the unique variance in the items over time is considered measurement error.
Well-known internal consistency approaches for interindividual (e.g., cross-sectional) studies are reliability measures such as Cronbach’s Alpha (Sijtsma, 2009), and latent variable models such as factor models and Item Response Theory models (Bollen, 2002). These techniques cannot be used directly on intensive longitudinal data, but extensions and similar alternatives are available (Nezlek, 2017).
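For reference, below is a minimal sketch of Cronbach’s Alpha in its classic cross-sectional form, applied to simulated persons-by-items data that are made up for illustration. As noted above, for ILD you would need a multilevel or otherwise time-aware extension rather than applying this formula directly to pooled repeated measures.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Classic Cronbach's alpha for a persons-by-items score matrix."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of the sum score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Simulated cross-sectional data: 200 persons, 3 items tapping one construct.
rng = np.random.default_rng(1)
true_score = rng.normal(size=200)
scores = np.column_stack(
    [true_score + rng.normal(scale=0.8, size=200) for _ in range(3)]
)

print(round(cronbach_alpha(scores), 2))  # roughly .8 for these settings
```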
Ume uses a reflective linear factor model with the three items as indicators, and one latent variable that should reflect ‘received social support’. They use the model to estimate the reliability of their short questionnaire, as part of their validation study. Reliability will be high when the items consistently increase and decrease together over the repeated measures (i.e., when they covary strongly over time).
If the questionnaire is deemed valid based on Ume’s study, in future studies researchers can use the same factor model to correct for measurement error in their analyses.
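The sketch below illustrates the general idea of a factor-model-based reliability estimate (coefficient omega) on simulated data. The data, loadings, and noise level are made up, and for simplicity the sketch ignores the nesting of occasions within persons, which Ume’s actual ILD analysis should not do; see the links below for ILD-appropriate approaches.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Simulate scores on three items that share one latent variable
# (a stand-in for 'received social support'); loadings and noise are made up.
rng = np.random.default_rng(7)
latent = rng.normal(size=500)
true_loadings = np.array([0.9, 0.8, 0.7])
items = latent[:, None] * true_loadings + rng.normal(scale=0.5, size=(500, 3))

# Fit a one-factor (reflective) model.
fa = FactorAnalysis(n_components=1).fit(items)
loadings = fa.components_.ravel()   # estimated factor loadings
uniques = fa.noise_variance_        # unique (error) variances per item

# Coefficient omega: shared variance relative to total variance of the sum score.
omega = loadings.sum() ** 2 / (loadings.sum() ** 2 + uniques.sum())
print(round(omega, 2))  # high when the items covary strongly
```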
When using multiple items for reliability purposes, it is important to consider
- that each internal consistency reliability technique is based on theoretical assumptions, and hence has theoretical implications. It is important to carefully think about what measurement model is appropriate for your research question and context.
- that it is possible to evaluate reliability and correct for measurement error with single item measures as well, for both single and multiple persons (see: Reliability for single item measures).
- [Multiple item measures: Sum-scores or mean-scores vs latent variable models]
- [Formative versus reflective latent variable models]
- [When are my measurements ‘good enough’?]
- [Internal consistency reliability for ILD]
- [Qualitative assessments of measurement quality]
- [Predictive & convergent/divergent validity for ILD]
2.3 Multiple items to mirror interindividual difference questionnaires
Researchers may want to use constructs established in inter-individual difference research for intra-individual difference studies using ILD: For example, constructs such as ‘positive affect’ or ‘introversion’. In that case, the first intuition may be to simply take the questionnaires from inter-individual difference research and adapt them to the intra-individual context.
A researcher wants to measure intra-individual differences in ‘Positive Affect’ and ‘Negative Affect’, constructs established in the context of individual differences research. These are commonly measured with the PANAS (Positive and Negative Affect Schedule; Tran, 2020; Watson et al., 1988), which measures how much positive and negative affect people tend to experience in general. It is used to study trait differences between people in how much positive and negative affect they tend to experience, and how this relates to other differences between people (e.g., their sex or personality traits).
The researcher takes the original items of the PANAS scale, but changes the phrasing to suit intra-individual difference research.
For example: The original PANAS instructions were “Indicate the extent you have felt this way over the past week.” for various adjectives (e.g., “irritable”, “guilty”, “afraid”), rated on a 5-point Likert scale (“very slightly/not at all” - “extremely”). The researcher changes the instruction to “Indicate the extent you have felt this way in the last hour.”, keeping everything else the same.
It is important to consider the following:
- constructs established in inter-individual difference research may be less relevant, or even irrelevant, for studying intra-individual differences.
Based on interindividual difference research, the items “irritable”, “guilty”, “afraid” (and others) are considered suitable items for establishing consistent differences in ‘overall negative affect’ among different people (Watson et al., 1988). Based on factor modeling studies (Wedderhoff et al., 2021), it has been established that people who tend to feel more irritable than others overall, also tend to feel more guilty and afraid than others. This property is captured in a latent variable, which is considered to capture differences in people’s overall ‘negative affect’.
This, however, does not mean that such a latent variable will also capture intra-individual differences well, that is, how a person’s affect fluctuates over time.
If we apply the ‘negative affect’ latent variable to intra-individual differences over time, using the same adjectives as the original scale, this would imply the following pattern over time: If a person feels relatively irritable at a certain occasion, that person tends to also feel relatively guilty and afraid at that same occasion.
For example, consider that whether a person feels irritable at a given occasion, and whether the person also feels guilty and afraid, depends mainly on the context the participant is in at that occasion, rather than being the result of a latent variable ‘negative affect’. For some experiences they may feel all three emotions to a large extent due to the nature of the event (the participant is publicly called out on a big mistake), while for other experiences they may be inclined to feel mostly one of them (the participant encounters an angry swarm of bees during a walk).
In that case, over time, people’s emotions would be considered separate constructs, and whether they covary or not depends mostly on the particular features of changeable circumstances. If so, the concept of general negative affect may not be directly relevant for studying how people’s emotions fluctuate over time (intra-individual differences), even though it may be very suitable for studying general tendencies of people compared to other people (interindividual differences).
- questionnaires for intra-individual difference studies often need to have different qualities than questionnaires on inter-individual differences.
Consider the PANAS adjective ‘guilty’, for which participants would now “Indicate the extent you have felt this way in the last hour.” If one collects relatively few repeated measures over a short time span, participants may respond ‘not at all/very slightly’ for most or even all measurements. This would result in little to no within-person variance in the repeated measures of guilt, making intra-individual difference analyses difficult or impossible (see also floor and ceiling effects; a simple check for this is sketched below).
In comparison, for studying interindividual differences in general traits, the timing of measurements is often less important for obtaining a representative varied sample of observations.
While the original PANAS for interindividual differences consists of 20 items, we may prefer a shorter questionnaire for an intra-individual difference study where we collect repeated measures every hour.
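As referenced above, one simple way to screen for the floor-effect problem is to inspect the within-person variability of each item in (pilot) data. A minimal sketch with made-up ESM responses:

```python
import pandas as pd

# Hypothetical ESM data in long format: repeated 'guilty' ratings (1-5 scale)
# for three participants; values are made up for illustration.
esm = pd.DataFrame({
    "person": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "guilty": [1, 1, 1, 1,   # person 1: always 'not at all' -> floor effect
               1, 2, 1, 1,
               2, 3, 1, 4],
})

# Within-person standard deviation per participant; values (near) zero signal
# that the item shows hardly any intra-individual variation to analyse.
within_person_sd = esm.groupby("person")["guilty"].std()
print(within_person_sd)  # person 1: 0.0, person 2: 0.5, person 3: ~1.29
```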
- valid and reliable questionnaires from inter-individual difference research may not be valid and reliable for intra-individual difference research.
While the original PANAS for interindividual differences may have a reliability between .8 and .9 (Thompson, 2007), an adapted PANAS for intra-individual differences may have a much lower or higher reliability (Schuurman & Hamaker, 2019).
Hence, questionnaires or items adapted from inter-individual difference research need to be completely re-validated.
We generally recommend against copying constructs from inter-individual research directly to intra-individual difference studies unless there are very good theoretical reasons to do so. That is, the constructs need to make sense for studying intra-individual differences, for the given research question and context, preferably backed by empirical evidence. In many cases it may be necessary, and more fruitful, to establish (new) constructs tailored to intra-individual differences.
- [Intra-individual versus interindividual difference studies]
2.4 Combining different reasons for using multiple items
A researcher may want to use multiple items for more than one of the reasons discussed above: For example, they may want to both cover different aspects of a construct (content validity) and use multiple items to estimate the reliability of their multiple item measure. For a given construct, it is important to think about whether and how these different goals can be reconciled with each other. This will depend on the goals, the nature of the construct, the design of the items, and how the items will be aggregated.
Max wants to get a good impression of someone’s daily experienced received social support. They care about content validity, and they also want to evaluate the measure’s reliability, to make sure it captures as little measurement error as possible.
To ensure their measure is content valid, Max breaks down the construct into different aspects, measured with different items.
Rather than using a single item asking “How much social support did you receive today?”, they ask multiple questions like “Did you receive help with your job today?”, “Did you feel you could reach out for emotional support from your friends/family/colleagues today?”, and “Did you rely on financial support today?”. In this way, various distinct aspects of receiving social support are covered by the multiple items.
Max uses a reflective factor model to estimate the internal consistency reliability of the items. However, because this reliability estimate relies on the idea that only the variance shared among the items is reliable, all the unique variability in the items is considered measurement error. Hence, the unique contributions of the items, which were the very reason Max used multiple items for content validity, are not counted as reliable in this estimate. As a result, the reliability estimate is likely to underestimate the true reliability of the measure.
Max’s internal consistency reliability approach does not mesh well with their approach to content validity.
References
Citation
@online{schuurman2023,
author = {Schuurman, Noémi K.},
title = {Single Item or Multiple Item Measures},
date = {2023-05-26},
langid = {en}
}