How do responses to Likert type response scales vary across countries and cultures?

I gave an answer here about how to analyse ordinal items, such as those on Likert type response scales (e.g., Strongly disagree to Strongly agree). Someone asked whether there were differences between countries and cultures in how respondents use such scales. I remember reading research about this many years ago, including something about respondents in certain countries preferring extreme responses. Such differences naturally have implications for cross-cultural comparisons of levels of constructs like life satisfaction and personality, and presumably of any self-report measure that uses such response scales.


  • How does use of Likert type response scales vary across countries and cultures?
  • What explains any such differences?
  • How are differences in response style differentiated from differences in levels of the underlying construct?

Culture reference effects

Heine et al. (2002) discuss how people in different cultures often answer questions relative to a reference group in their own culture. Thus, for example, if a culture is more collectivistic in general, measured cultural differences may be smaller than the true differences, because people within each culture answer test items relative to their own cultural reference group.

To quote the abstract:

Social comparison theory maintains that people think about themselves compared with similar others. Those in one culture, then, compare themselves with different others and standards than do those in another culture, thus potentially confounding cross-cultural comparisons. A pilot study and Study 1 demonstrated the problematic nature of this reference-group effect: Whereas cultural experts agreed that East Asians are more collectivistic than North Americans, cross-cultural comparisons of trait and attitude measures failed to reveal such a pattern. Study 2 found that manipulating reference groups enhanced the expected cultural differences, and Study 3 revealed that people from different cultural backgrounds within the same country exhibited larger differences than did people from different countries. Cross-cultural comparisons using subjective Likert scales are compromised because of different reference groups. Possible solutions are discussed.

Cultural differences in response tendencies

Lee et al. (2002) summarised some of the empirical literature on cross-cultural differences in use of response scales:

Only a few researchers have addressed the issue of cultural differences in rating scales empirically. Wong, Tam, Fung, and Wan (1993) found no difference in the way Chinese participants in Hong Kong responded to an odd versus an even number of response choices. Johnson (1981) found no difference in how readers of Horizons USA who resided in Great Britain, Italy, the Philippines, and Venezuela responded to bipolar scales. Stening and Everett (1984) found that Japanese managers responding in Japanese were more likely to choose the midpoint than were American or British managers responding in English. Chen, Lee, and Stevenson (1995) found the same effect for Japanese and to a lesser extent for Taiwanese in a sample comparing 11th-grade students in Japan, Taiwan, Canada, and, in the United States, Minneapolis. Iwata, Saito, and Roberts (1994) and Iwata, Roberts, and Kawakami (1995) reported that junior high school students in Japan and the United States responded similarly on negative items but that Japanese were less likely to endorse positive items.

When reviewing the literature on cross-cultural differences in response style, Hamamura et al. (2008) stated that:

Compared to North Americans of European-heritage, higher levels of extreme responding have been observed in African-Americans (Bachman & O'Malley, 1984) and Latino Americans (Hui & Triandis, 1989). In contrast, East Asians seem to show more moderacy than samples of European-heritage (Chen, Lee, & Stevenson, 1995).
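To make "extreme responding" and "midpoint responding" concrete, here is a minimal sketch of two simple descriptive indices: the proportion of a respondent's answers that fall on a scale endpoint and the proportion that fall on the midpoint. The function names, the 1 to 5 scale, and the group labels are assumptions for illustration, not taken from any of the cited studies. Raw proportions like these do not by themselves separate response style from true differences in the underlying construct.

```python
from collections import defaultdict


def response_style_indices(responses, scale_min=1, scale_max=5):
    """Return (extreme, midpoint) proportions for one respondent.

    extreme  = share of answers on either endpoint of the scale
    midpoint = share of answers exactly on the scale midpoint
               (always 0 for scales with an even number of options)
    """
    if not responses:
        raise ValueError("responses must not be empty")
    midpoint = (scale_min + scale_max) / 2
    n = len(responses)
    extreme = sum(r in (scale_min, scale_max) for r in responses) / n
    middle = sum(r == midpoint for r in responses) / n
    return extreme, middle


def mean_indices_by_group(data):
    """Average both indices within each group.

    `data` is an iterable of (group_label, responses) pairs.
    """
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for group, responses in data:
        extreme, middle = response_style_indices(responses)
        totals[group][0] += extreme
        totals[group][1] += middle
        totals[group][2] += 1
    return {g: (e / n, m / n) for g, (e, m, n) in totals.items()}


if __name__ == "__main__":
    # Hypothetical responses, invented purely to exercise the functions.
    sample = [
        ("group_a", [1, 5, 5, 1]),
        ("group_a", [1, 3, 3, 5]),
        ("group_b", [3, 3, 2, 4]),
    ]
    for group, (extreme, middle) in mean_indices_by_group(sample).items():
        print(group, "extreme:", extreme, "midpoint:", middle)
```

Indices of this kind are usually computed over items that are heterogeneous in content, so that a high value reflects how the scale is used rather than a strong opinion on one topic.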


  • Bachman, J. G., & O'Malley, P. M. (1984). Yea-saying, nay-saying, and going to extremes: Black-white differences in response styles. Public Opinion Quarterly, 48, 491-509.
  • Chen, C., Lee, S. Y., & Stevenson, H. W. (1995). Response style and cross-cultural comparisons of rating scales among East Asian and North American students. Psychological Science, 6, 170-175.
  • Hamamura, T., Heine, S.J. & Paulhus, D.L. (2008). Cultural differences in response styles: The role of dialectical thinking. Personality and Individual Differences, 44, 932-942.
  • Heine, S.J., Lehman, D.R., Peng, K., & Greenholtz, J. (2002). What's wrong with cross-cultural comparisons of subjective Likert scales?: The reference-group effect. Journal of Personality and Social Psychology, 82, 903.
  • Hui, C. H., & Triandis, H. C. (1989). Effects of culture and response format on extreme response style. Journal of Cross-Cultural Psychology, 20, 296-309.
  • Iwata, N., Roberts, C.R., & Kawakami, N. (1995). Japan-U.S. comparison of responses to depression scale items among adult workers. Psychiatry Research, 58, 237-245.
  • Iwata, N., Saito, K., & Roberts, R.E. (1994). Responses to a self-administered depression scale among younger adolescents in Japan. Psychiatry Research, 53, 275-287.
  • Johnson, J.D. (1981). Effects of the order of presentation of evaluative dimensions for bipolar scales in four societies. Journal of Social Psychology, 113, 21-27.
  • Lee, J.W., Jones, P.S., Mineyama, Y., & Zhang, X.E. (2002). Cultural differences in responses to a Likert scale. Research in Nursing & Health, 25, 295-306.
  • Stening, B.W., & Everett, J.E. (1984). Response styles in a cross-cultural managerial study. Journal of Social Psychology, 122, 151-156.
  • Wong, C.S., Tam, K.C., Fung, M.Y., & Wan, K. (1993). Differences between odd and even number of response scale: Some empirical evidence. Chinese Journal of Psychology, 35, 75-86.

Phase 3: scale evaluation

Step 7: tests of dimensionality

The test of dimensionality is a test in which the hypothesized factors or factor structure extracted from a previous model is tested at a different time point in a longitudinal study or, ideally, on a new sample (91). Tests of dimensionality determine whether the measurement of items, their factors, and function are the same across two independent samples or within the same sample at different time points. Such tests can be conducted using independent cluster model (ICM)-confirmatory factor analysis, bifactor modeling, or measurement invariance.

Confirmatory factor analysis

Confirmatory factor analysis is a form of psychometric assessment that allows for the systematic comparison of an alternative a priori factor structure based on systematic fit assessment procedures and estimates the relationship between latent constructs, which have been corrected for measurement errors (92). Morin et al. (92) note that it relies on a highly restrictive ICM, in which cross-loadings between items and non-target factors are assumed to be exactly zero. The systematic fit assessment procedures are determined by meaningful satisfactory thresholds. Table 2 contains the most common techniques for testing dimensionality. These techniques include the chi-square test of exact fit, Root Mean Square Error of Approximation (RMSEA ≤ 0.06), Tucker Lewis Index (TLI ≥ 0.95), Comparative Fit Index (CFI ≥ 0.95), Standardized Root Mean Square Residual (SRMR ≤ 0.08), and Weighted Root Mean Square Residual (WRMR ≤ 1.0) (90, 92–[…]).

Table 2

Description of model fit indices and thresholds for evaluating scales developed for health, social, and behavioral research.

Model fit index | Description | Recommended threshold to use | References
--- | --- | --- | ---
Chi-square test | The chi-square value is a test statistic of the goodness of fit of a factor model. It compares the observed covariance matrix with a theoretically proposed covariance matrix. | Chi-square test of model fit has been assessed to be overly sensitive to sample size and to vary when dealing with non-normal variables. Hence, the use of non-normal data, a small sample size (n = 180–[…]), and highly correlated items make the chi-square approximation inaccurate. An alternative to this is to use the Satorra-Bentler scaled (mean-adjusted) difference chi-squared statistic. The DIFFTEST has been recommended for models with binary and ordinal variables. | (2, 93)
Root Mean Squared Error of Approximation (RMSEA) | RMSEA is a measure of the estimated discrepancy between the population and model-implied population covariance matrices per degree of freedom (139). | Browne and Cudeck recommend RMSEA ≤ 0.05 as indicative of close fit, 0.05 ≤ RMSEA ≤ 0.08 as indicative of fair fit, and values > 0.10 as indicative of poor fit between the hypothesized model and the observed data. However, Hu and Bentler have suggested RMSEA ≤ 0.06 may indicate a good fit. | (26, 96–[…])
Tucker Lewis Index (TLI) | TLI is based on the idea of comparing the proposed factor model to a model in which no interrelationships at all are assumed among any of the items. | Bentler and Bonnett suggest that models with overall fit indices of < 0.90 are generally inadequate and can be improved substantially. Hu and Bentler recommend TLI ≥ 0.95. | (95–[…])
Comparative Fit Index (CFI) | CFI is an incremental relative fit index that measures the relative improvement in the fit of a researcher's model over that of a baseline model. | CFI ≥ 0.95 is often considered an acceptable fit. | (95–[…])
Standardized Root Mean Square Residual (SRMR) | SRMR is a measure of the mean absolute correlation residual, the overall difference between the observed and predicted correlations. | Threshold for acceptable model fit is SRMR ≤ 0.08. | (95–[…])
Weighted Root Mean Square Residual (WRMR) | WRMR uses a “variance-weighted approach especially suited for models whose variables measured on different scales or have widely unequal variances” (139); it has been assessed to be most suitable in assessing models fitted to binary and ordinal data. | Yu recommends a threshold of WRMR < 1.0 for assessing model fit. This index is used for confirmatory factor analysis and structural equation models with binary and ordinal variables. | (101)
Standard of reliability for scales | A reliability of 0.90 is the minimum recommended threshold that should be tolerated while a reliability of 0.95 should be the desirable standard. While the ideal has rarely been attained by most researchers, a reliability coefficient of 0.70 has often been accepted as satisfactory for most scales. | Nunnally recommends a threshold of ≥ 0.90 for assessing internal consistency for scales. | (117, 123)
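Three of the indices in Table 2 can be computed directly from chi-square statistics, which may help in reading the thresholds. The sketch below uses the commonly cited textbook formulas for RMSEA, CFI, and TLI; it is an illustration under stated assumptions, not the computation of any particular package. Some programs divide by N rather than N − 1 in the RMSEA, and the function names and input numbers here are hypothetical.

```python
import math


def rmsea(chi2, df, n):
    """RMSEA from the model chi-square, its degrees of freedom, and sample size n.

    Uses the N - 1 convention; the value is floored at 0 when chi2 < df.
    """
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative Fit Index: improvement of the model over the baseline model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    if d_base == 0.0:
        return 1.0  # neither model shows any misfit beyond its df
    return 1.0 - d_model / d_base


def tli(chi2_model, df_model, chi2_base, df_base):
    """Tucker Lewis Index, based on chi-square/df ratios (can fall outside 0..1)."""
    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    return (ratio_base - ratio_model) / (ratio_base - 1.0)


if __name__ == "__main__":
    # Hypothetical values: target model chi2 = 50 on 25 df, n = 201,
    # baseline (independence) model chi2 = 525 on 36 df.
    print("RMSEA:", round(rmsea(50, 25, 201), 3))
    print("CFI:  ", round(cfi(50, 25, 525, 36), 3))
    print("TLI:  ", round(tli(50, 25, 525, 36), 3))
```

The baseline model here is the one described in the TLI row of Table 2, in which no interrelationships are assumed among the items.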

Bifactor modeling

Bifactor modeling, also referred to as nested factor modeling, is a form of item response theory used in testing dimensionality of a scale (102, 103). This method can be used when the hypothesized factor structure from the previous model produces partially overlapping dimensions so that one could be seeing most of the items loading onto one factor and a few items loading onto a second and/or a third factor. The bifactor model allows researchers to estimate a unidimensional construct while recognizing the multidimensionality of the construct (104, 105). The bifactor model assumes each item loads onto two dimensions, i.e., items forming the construct may be associated with more than one source of true score variance (92). The first is a general latent factor that underlies all the scale items and the second, a group factor (subscale). A “bifactor model is based on the assumption that a f-factor solution exists for a set of n items with one [general]/Global (G) factor and f – 1 Specific (S) factors also called group factors” (92). This approach allows researchers to examine any distortion that may occur when unidimensional IRT models are fit to multidimensional data (104, 105). To determine whether to retain a construct as unidimensional or multidimensional, the factor loadings from the general factor are then compared to those from the group factors (103, 106). Where the factor loadings on the general factor are significantly larger than the group factors, a unidimensional scale is implied (103, 104). This method is assessed based on meaningful satisfactory thresholds. Alternatively, one can test for the coexistence of a general factor that underlies the construct and multiple group factors that explain the remaining variance not explained by the general factor (92). Each of these methods can be done using statistical software such as Mplus, R, SAS, SPSS, or Stata.

Measurement invariance

Another method to test dimensionality is measurement invariance, also referred to as factorial invariance or measurement equivalence (107). Measurement invariance concerns the extent to which the psychometric properties of the observed indicators are transportable (generalizable) across groups or over time (108). These properties include the hypothesized factor structure, regression slopes, intercept, and residual variances. Measurement invariance is tested sequentially at five levels: configural, metric, scalar, strict (residual), and structural (107, 109). Of key significance to the test of dimensionality is configural invariance, which is concerned with whether the hypothesized factor structure is the same across groups. This assumption has to be met in order for subsequent tests to be meaningful (107, 109). For example, a hypothesized unidimensional structure, when tested across multiple countries, should be the same. This can be tested in CTT, using multigroup confirmatory factor analysis (110–[…]).

An alternative approach to measurement invariance in the testing of unidimensionality under item response theory is the Rasch measurement model for binary items and polytomous IRT models for categorical items. Here, emphasis is on testing the differential item functioning (DIF), an indicator of whether “a group of respondents is scoring better than another group of respondents on an item or a test after adjusting for the overall ability scores of the respondents” (108, 113). This is analogous to the conditions underpinning measurement invariance in a multi-group CFA (108, 113).

Whether the hypothesized structure is bidimensional or multidimensional, each dimension in the structure needs to be tested again to confirm its unidimensionality. This can also be done using confirmatory factor analysis. Appropriate model fit indices and the strength of factor loadings (cf. Table 2) are the basis on which the latent structure of the items can be judged.

One commonly encountered pitfall is a lack of satisfactory global model fit in confirmatory factor analysis conducted on a new sample following a satisfactory initial factor analysis performed on a previous sample. Lack of satisfactory fit offers the opportunity to identify additional underperforming items for removal. Items with very poor loadings (≤ 0.3) can be considered for removal. Also, modification indices, produced by Mplus and other structural equation modeling (SEM) programs, can help identify items that need to be modified. Sometimes a higher-order factor structure, where correlations among the original factors can be explained by one or more higher-order factors, is needed. This can also be assessed using statistical software such as Mplus, R, SAS, SPSS, or Stata.

A good example of best practice is seen in the work of Pushpanathan et al. on the appropriateness of using a traditional confirmatory factor analysis or a bifactor model (114) in assessing whether the Parkinson's Disease Sleep Scale-Revised was better used as a unidimensional scale, a tri-dimensional scale, or a scale that has an underlying general factor and three group factors (sub-scales). They tested this using three different models: a unidimensional model (1-factor CFA); a 3-factor model (3-factor CFA) consisting of sub-scales measuring insomnia, motor symptoms and obstructive sleep apnea, and REM sleep behavior disorder; and a confirmatory bifactor model having a general factor and the same three sub-scales combined. The results of this study suggested that only the bifactor model with a general factor and the three sub-scales combined achieved satisfactory model fit. Based on these results, the authors cautioned against the use of unidimensional total scale scores as a cardinal indicator of sleep in Parkinson's disease, but encouraged the examination of the scale's multidimensional subscales (114).

Scoring scale items

Finalized items from the tests of dimensionality can be used to create scale scores for substantive analysis including tests of reliability and validity. Scale scores can be calculated by using unweighted or weighted procedures. The unweighted approach involves summing standardized item scores or raw item scores, or computing the mean for raw item scores (115). Weighted scale scores can be produced via statistical software programs such as Mplus, R, SAS, SPSS, or Stata. For instance, in using confirmatory factor analysis, structural equation models, or exploratory factor analysis, each factor produced reveals a statistically independent source of variation among a set of items (115). The contribution of each individual item to this factor is considered a weight, with the factor loading value representing the weight. The scores associated with each factor in a model then represent a composite scale score based on a weighted sum of the individual items using factor loadings (115). In general, it does not make much difference in the performance of the scale if scales are computed as unweighted items (e.g., mean or sum scores) or weighted items (e.g., factor scores).
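The difference between the two scoring approaches can be sketched in a few lines. The example below is a simplified illustration: the unweighted score is a sum or mean of item scores, and the weighted score is a loading-weighted sum as described above. The factor scores produced by SEM software are typically estimated with more elaborate methods, so this is not a reproduction of any package's output. The function names and loading values are hypothetical.

```python
def unweighted_score(item_scores, method="mean"):
    """Unweighted scale score: the sum or the mean of the item scores."""
    if not item_scores:
        raise ValueError("item_scores must not be empty")
    total = sum(item_scores)
    if method == "sum":
        return total
    if method == "mean":
        return total / len(item_scores)
    raise ValueError("method must be 'sum' or 'mean'")


def weighted_score(item_scores, loadings):
    """Weighted scale score: each item multiplied by its factor loading, then summed."""
    if len(item_scores) != len(loadings):
        raise ValueError("need exactly one loading per item")
    return sum(w * x for w, x in zip(loadings, item_scores))


if __name__ == "__main__":
    # Hypothetical respondent with three items and made-up loadings.
    items = [4, 5, 3]
    loadings = [0.5, 0.5, 1.0]
    print("sum:     ", unweighted_score(items, "sum"))
    print("mean:    ", unweighted_score(items, "mean"))
    print("weighted:", weighted_score(items, loadings))
```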

Step 8: tests of reliability

Reliability is the degree of consistency exhibited when a measurement is repeated under identical conditions (116). A number of standard statistics have been developed to assess reliability of a scale, including Cronbach's alpha (117), ordinal alpha (118, 119) specific to binary and ordinal scale items, test–retest reliability (coefficient of stability) (1, 2), McDonald's Omega (120), Raykov's rho (2) or Revelle's beta (121, 122), split-half estimates, Spearman-Brown formula, alternate form method (coefficient of equivalence), and inter-observer reliability (1, 2). Of these statistics, Cronbach's alpha and test–retest reliability are predominantly used to assess reliability of scales (2, 117).

Cronbach's alpha

Cronbach's alpha assesses the internal consistency of the scale items, i.e., the degree to which the set of items in the scale co-vary, relative to their sum score (1, 2, 117). An alpha coefficient of 0.70 has often been regarded as an acceptable threshold for reliability; however, values between 0.80 and 0.95 are preferred for the psychometric quality of scales (60, 117, 123). Cronbach's alpha has been the most commonly used statistic and seems to have received general approval. However, reliability statistics such as Raykov's rho, ordinal alpha, and Revelle's beta, which are argued to improve on Cronbach's alpha, are beginning to gain acceptance.
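As a worked illustration of "co-vary, relative to their sum score", the standard formula for Cronbach's alpha is k/(k − 1) × (1 − Σ item variances / variance of the sum score), where k is the number of items. A minimal sketch follows; the function name and the data are made up for the example.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    if len(rows) < 2:
        raise ValueError("need at least two respondents")
    k = len(rows[0])
    if k < 2 or any(len(r) != k for r in rows):
        raise ValueError("every respondent needs the same k >= 2 items")
    # Sample variance of each item (columns) and of the sum score (rows).
    item_variances = [variance(column) for column in zip(*rows)]
    total_variance = variance([sum(r) for r in rows])
    if total_variance == 0:
        raise ValueError("sum scores have zero variance")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


if __name__ == "__main__":
    # Hypothetical data: 4 respondents answering 3 items on a 1-5 scale.
    data = [
        [2, 3, 3],
        [4, 4, 5],
        [1, 2, 2],
        [5, 4, 4],
    ]
    print("alpha:", round(cronbach_alpha(data), 3))
```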

Test–retest reliability

An additional approach to testing reliability is test–retest reliability. Test–retest reliability, also known as the coefficient of stability, is used to assess the degree to which the participants' performance is repeatable, i.e., how consistent their sum scores are across time (2). Researchers vary in how they assess test–retest reliability. While some prefer to use the intraclass correlation coefficient (124), others use the Pearson product-moment correlation (125). In both cases, the higher the correlation, the higher the test–retest reliability, with values close to zero indicating low reliability. In addition, study conditions could change values on the construct being measured over time (as in an intervention study, for example), which could lower the test–retest reliability.
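For the Pearson variant, test–retest reliability is simply the correlation between the same respondents' sum scores at two time points. The sketch below implements the product-moment formula directly; the function name and the scores are hypothetical. One design point worth keeping in mind is that the Pearson correlation is unaffected if every score shifts by a constant between administrations, whereas some forms of the intraclass correlation are sensitive to such shifts.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical sum scores for five respondents at time 1 and time 2.
    time1 = [12, 18, 25, 30, 22]
    time2 = [14, 17, 27, 29, 20]
    print("test-retest r:", round(pearson_r(time1, time2), 3))
```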

The work of Johnson et al. (16) on the validation of the HIV Treatment Adherence Self-Efficacy Scale (ASES) is a good example of the test of reliability. As part of testing for reliability, the authors tested for the internal consistency reliability values for the ASES and its subscales using Raykov's rho (which produces a coefficient similar to alpha but with fewer assumptions and with confidence intervals). They then tested for the temporal consistency of the ASES' factor structure. This was followed by test–retest reliability assessment among the latent factors. The different approaches provided support for the reliability of the ASES scale.

Other approaches found to be useful and support scale reliability include split-half estimates, Spearman-Brown formula, alternate form method (coefficient of equivalence), and inter-observer reliability (1, 2).

Step 9: tests of validity

Scale validity is the extent to which “an instrument indeed measures the latent dimension or construct it was developed to evaluate” (2). Although it is discussed at length here in Step 9, validation is an ongoing process that starts with the identification and definition of the domain of study (Step 1) and continues to its generalizability with other constructs (Step 9) (36). The validity of an instrument can be examined in numerous ways. The most common tests of validity are content validity (described in Step 2), which can be done prior to the instrument being administered to the target population, and criterion (predictive and concurrent) and construct validity (convergent, discriminant, differentiation by known groups, correlations), which occur after survey administration.

Criterion validity

Criterion validity is the “degree to which there is a relationship between a given test score and performance on another measure of particular relevance, typically referred to as criterion” (1, 2). There are two forms of criterion validity: predictive (criterion) validity and concurrent (criterion) validity. Predictive validity is “the extent to which a measure predicts the answers to some other question or a result to which it ought to be related with” (31). Thus, the scale should be able to predict a behavior in the future. An example is the ability for an exclusive breastfeeding social support scale to predict exclusive breastfeeding (10). Here, the mother's willingness to exclusively breastfeed occurs after social support has been given, i.e., it should predict the behavior. Predictive validity can be estimated by examining the association between the scale scores and the criterion in question.

Concurrent criterion validity is the extent to which test scores have a stronger relationship with criterion (gold standard) measurement made at the time of test administration or shortly afterward (2). This can be estimated using Pearson product-moment correlation or latent variable modeling. The work of Greca and Stone on the psychometric evaluation of the revised version of a social anxiety scale for children (SASC-R) provides a good example for the evaluation of concurrent validity (140). In this study, the authors collected data on an earlier validated version of the SASC scale consisting of 10 items, as well as the revised version, SASC-R, which had an additional 16 items, making a 26-item scale. The SASC consisted of two subscales [fear of negative evaluation (FNE), social avoidance and distress (SAD)] and the SASC-R produced three new subscales (FNE, SAD-New, and SAD-General). Using a Pearson product-moment correlation, the authors examined the inter-correlations between the common subscales for FNE, and between SAD and SAD-New. With validity coefficients of 0.94 and 0.88, respectively, the authors found evidence of concurrent validity.

A limitation of concurrent validity is that this strategy for validity does not work with small sample sizes because of their large sampling errors. Secondly, appropriate criterion variables or “gold standards” may not be available (2). This reason may account for its omission in most validation studies.

Construct validity

Construct validity is the “extent to which an instrument assesses a construct of concern and is associated with evidence that measures other constructs in that domain and measures specific real-world criteria” (2). Four indicators of construct validity are relevant to scale development: convergent validity, discriminant validity, differentiation by known groups, and correlation analysis.

Convergent validity is the extent to which a construct measured in different ways yields similar results. Specifically, it is the “degree to which scores on a studied instrument are related to measures of other constructs that can be expected on theoretical grounds to be close to the one tapped into by this instrument” (2, 37, 126). This is best estimated through the multi-trait multi-method matrix (2), although in some cases researchers have used either latent variable modeling or Pearson product-moment correlation based on Fisher's Z transformation. Evidence of convergent validity of a construct can be provided by the extent to which the newly developed scale correlates highly with other variables designed to measure the same construct (2, 126). It can be invalidated by too low or weak correlations with other tests which are intended to measure the same construct.
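One common use of Fisher's Z transformation in this setting is to put a confidence interval around a validity correlation: transform r with the inverse hyperbolic tangent, add and subtract a normal critical value times 1/√(n − 3), and transform back. The sketch below assumes that use; the function name, the 95% critical value, and the example numbers are illustrative assumptions rather than anything prescribed by the sources cited above.

```python
import math


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate confidence interval for a Pearson correlation via Fisher's Z.

    r      : observed correlation, strictly between -1 and 1
    n      : number of paired observations (must exceed 3)
    z_crit : standard normal critical value (default is roughly 95% coverage)
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)                     # Fisher's Z transformation
    se = 1.0 / math.sqrt(n - 3)           # standard error on the Z scale
    lower = math.tanh(z - z_crit * se)    # back-transform to the r scale
    upper = math.tanh(z + z_crit * se)
    return lower, upper


if __name__ == "__main__":
    # Hypothetical convergent-validity correlation from 103 respondents.
    low, high = fisher_ci(0.5, 103)
    print("95% CI:", round(low, 2), "to", round(high, 2))
```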

Discriminant validity is the extent to which a measure is novel and not simply a reflection of some other construct (126). Specifically, it is the “degree to which scores on a studied instrument are differentiated from behavioral manifestations of other constructs, which on theoretical grounds can be expected not to be related to the construct underlying the instrument under investigation” (2). This is best estimated through the multi-trait multi-method matrix (2). Discriminant validity is indicated by predictably low or weak correlations between the measure of interest and other measures that are supposedly not measuring the same variable or concept (126). The newly developed construct can be invalidated by too high correlations with other tests which are intended to differ in their measurements (37). This approach is critical in differentiating the newly developed construct from other rival alternatives (36).

Differentiation or comparison between known groups examines the distribution of a newly developed scale score over known binary items (126). This is premised on previous theoretical and empirical knowledge of the performance of the binary groups. An example of best practice is seen in the work of Boateng et al. on the validation of a household water insecurity scale in Kenya. In this study, we compared the mean household water insecurity scores over households with or without E. coli present in their drinking water. Consistent with what we knew from the extant literature, we found households with E. coli present in their drinking water had higher mean water insecurity scores than households that had no E. coli in drinking water. This suggested our scale could discriminate between particular known groups.
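A known-groups comparison of this kind boils down to comparing mean scale scores between two groups. The sketch below uses Welch's unequal-variance t statistic as one reasonable way to do that; this choice is an assumption for illustration, and the text above does not say which test was used in the cited study. The function name and the scores are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic and approximate degrees of freedom for two groups."""
    if len(group_a) < 2 or len(group_b) < 2:
        raise ValueError("each group needs at least two observations")
    # Squared standard error of each group mean.
    se2_a = variance(group_a) / len(group_a)
    se2_b = variance(group_b) / len(group_b)
    if se2_a + se2_b == 0:
        raise ValueError("both groups have zero variance")
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(group_a) - 1) + se2_b ** 2 / (len(group_b) - 1)
    )
    return t, df


if __name__ == "__main__":
    # Hypothetical scale scores for two known groups.
    contaminated = [10, 12, 14]
    uncontaminated = [4, 6, 8]
    t, df = welch_t(contaminated, uncontaminated)
    print("mean difference:", mean(contaminated) - mean(uncontaminated))
    print("t:", round(t, 3), "df:", round(df, 1))
```

A positive t here means the first group has the higher mean score, which is the direction the known-groups hypothesis would predict in the example.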

Although correlational analysis is frequently used by several scholars, bivariate regression analysis is preferred to correlational analysis for quantifying validity (127, 128). Regression analysis between scale scores and an indicator of the domain examined has a number of important advantages over correlational analysis. First, regression analysis quantifies the association in meaningful units, facilitating judgment of validity. Second, regression analysis avoids confounding validity with the underlying variation in the sample and therefore the results from one sample are more applicable to other samples in which the underlying variation may differ. Third, regression analysis is preferred because the regression model can be used to examine discriminant validity by adding potential alternative measures. In addition to regression analysis, alternative techniques such as analysis of standard deviations of the differences between scores and the examination of intraclass correlation coefficients (ICC) have been recommended as viable options (128).
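The first advantage, quantifying the association in meaningful units, is easy to see in a bivariate least-squares fit: the slope is expressed in units of the outcome per one-point change in the predictor. The sketch below assumes the criterion indicator is regressed on the scale score; that direction, the function name, and the data are illustrative assumptions.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("predictor has zero variance")
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Hypothetical data: scale scores (x) and a criterion indicator (y).
    scale_scores = [1, 2, 3, 4, 5]
    criterion = [2.1, 3.9, 6.2, 7.8, 10.0]
    slope, intercept = fit_line(scale_scores, criterion)
    print("slope:", round(slope, 2), "intercept:", round(intercept, 2))
```

Unlike a correlation coefficient, the slope does not shrink merely because the sample happens to have a narrow range of scores, which is the second advantage noted above.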

Taken together, these methods make it possible to assess the validity of an adapted or a newly developed scale. In addition to predictive validity, existing studies in fields such as health, social, and behavioral sciences have shown that scale validity is supported if at least two of the different forms of construct validity discussed in this section have been examined. Further information about establishing validity and constructing indicators from scales can be found in Frongillo et al. (141).

Andersen S.M., Reznik I., Chen S. (1997) The self in relation to others: Cognitive and motivational underpinnings. In: Snodgrass J.G., Thompson R.L. (eds). The self across psychology. Academy of Sciences, New York, pp. 233–275

Anderson N.H. (1981) Foundations of information integration theory. Academic Press, New York

Arce-Ferrer A.J. (2006) An Investigation Into the Factors Influencing Extreme-Response Style. Educational and Psychological Measurement 66(3): 374–392

Arnold H.J., Feldman D.C. (1981) Social desirability response bias in self-report choice situations. Academy of Management Journal 24(2): 377–385

Bachman J.G., O’Malley P.M. (1984a) Black-white differences in self-esteem: Are they affected by response styles?. American Journal of Sociology 90: 624–639

Bachman J.G., O’Malley P.M. (1984b) Yea-saying, nay-saying, and going to extremes: Black-white differences in response styles. Public Opinion Quarterly 48(2): 491–509

Barnette J. (2000) Effects of stem and Likert response option reversals on survey internal consistency: If you feel the need, there is a better alternative to using those negatively worded stems. Educational and Psychological Measurement 60(3): 361–370

Beatty P., Herrmann D. (2002) To answer or not to answer: Decision process related to survey item nonresponse. In: Groves R.N., Dillman D.A., Eltinge J.L., Little R.J. (eds). Survey nonresponse. John Wiley & Sons, New York, pp. 71–85

Bellah R., Madsen R., Sullivan W., Swidler A., Tipton S. (1985). Habits of the heart: Individualism and commitment in American life. University of California Press, Berkeley

Billiet J.B., McClendon M.J. (2000) Modeling acquiescence in measurement models for two balanced sets of items. Structural Equation Modeling 7(4): 608–628

Bishop G.F., Smith A. (2001) Response order effects and the early Gallop split ballots. Public Opinion Quarterly 65: 479–505

Bishop G.F., Tuchfarber A.J., Oldendick R.W. (1986) Opinion fictitious issues: The pressure to answer survey questions. Public Opinion Quarterly 50: 240–250

Bless H., Igou E.R., Schwartz N., Waenke M. (2000) Reducing context effects by adding context information: The direction and size of context effects in political judgment. Personality and Social Psychology Bulletin 26(9): 1036–1045

Bradburn N.M., Sudman S., Blair E., Stocking C. (1978) Question threat and response bias. Public Opinion Quarterly 42(2): 221–234

Brew F.P., Hesketh B., Taylor A. (2001) Individualist-collectivist differences in adolescent decision making and decision styles with Chinese and Anglos. International Journal of Intercultural Relations 25(1): 1–19

Buchanan T., Ali T., Heffernan T.M., Ling J., Parrott A.C., Rodgers J. et al. (2005) Nonequivalence of on-line and paper-and-pencil psychological tests: The case of the prospective memory questionnaire. Behavior Research Methods 37(1): 148–154

Cantril H. (1946) The intensity of an attitude. Journal of Abnormal and Social Psychology 41: 129–135

Chen C., Lee S.-Y., Stevenson H.W. (1995) Response style and cross-cultural comparisons of rating scales among East Asian and North American students. Psychological Science 6(3): 170–175

Chiou J.-S. (2001) Horizontal and vertical individualism and collectivism among college students in the United States, Taiwan, and Argentina. Journal of Social Psychology 141(5): 667–678

Clarke I., III (2000) Extreme response style in cross-cultural research: An empirical investigation. Journal of Social Behavior and Personality 15(1): 137–152

Couch A., Keniston K. (1960) Yeasayers and naysayers: Agreeing response set as a personality variable. Journal of Abnormal and Social Psychology 60: 151–174

Crandall J.E. (1965) Some relationships among sex, anxiety, and conservatism of judgment. Journal of Personality 33(1): 99–107

Cronbach L.J. (1946) Response sets and test validity. Educational and Psychological Measurement 6: 475–494

Crowne D.P., Marlowe D. (1960) A new scale of social desirability independent of psychopathology. Journal of Consulting Psychology 24: 349–354

Edwards A.L. (1957) The social desirability variable in personality assessment and research. Holt, Rinehart & Winston, New York

Edwards A.L. (1963) A factor analysis of experimental social desirability and response set scales. Journal of Applied Psychology 47(5): 308–316

Edwards A.L. (1966) Relationship between probability of endorsement and social desirability scale value for a set of 2,824 personality statements. Journal of Applied Psychology 50(3): 238–239

Edwards A.L., Diers C. (1963) Neutral items as a measure of acquiescence. Educational and Psychological Measurement 23(4): 687–698

Edwards A.L., Walker J.N. (1961) Social desirability and agreement response set. Journal of Abnormal and Social Psychology 62: 180–183

Fishbein M., Ajzen I. (1981) On construct validity: A critique of Miniard and Cohen’s paper. Journal of Experimental Social Psychology 17(3): 340–350

Fiske A.P. (1992) The four elementary forms of sociality: Framework for a unified theory of social relations. Psychological Review 99(4): 689–723

Gilljam M., Granberg D. (1993) Should we take don’t know for an answer? Public Opinion Quarterly 57(3): 348–357

Greenleaf E.A. (1992a) Improving rating scale measures by detecting and correcting bias components in some response styles. Journal of Marketing Research 29(2): 176–188

Greenleaf E.A. (1992b) Measuring extreme response style. Public Opinion Quarterly 56(3): 328–351

Grimm S.D., Church A. (1999) A cross-cultural study of response biases in personality measures. Journal of Research in Personality 33(4): 415–441

Groves R.M. (1989) Survey errors and survey costs. John Wiley & Sons, New York

Gudykunst W.B. (1997) Cultural variability in communication: An introduction. Communication Research 24(4): 327–348

Gudykunst W.B., Matsumoto Y. (1996) Cross-cultural variability of communication in personal relationships. In: Gudykunst W.B., Ting-Toomey S., Nishida T. (eds). Communication in personal relationships across cultures. Sage, Thousand Oaks

Gudykunst W.B., Matsumoto Y., Ting-Toomey S., Nishida T. (1996) The influence of cultural individualism-collectivism, self construals, and individual values on communication styles across cultures. Human Communication Research 22(4): 510–543

Haberstroh S., Oyserman D., Schwarz N., Kuehnen U., Ji L.-J. (2002) Is the interdependent self more sensitive to question context than the independent self? Self-construal and the observation of conversational norms. Journal of Experimental Social Psychology 38(3): 323–329

Harzing A.-W. (2006) Response Styles in Cross-national Survey Research: A 26-country Study. International Journal of Cross Cultural Management 6(2): 243–266

Heine S.J., Lehman D.R. (1995) Social desirability among Canadian and Japanese students. Journal of Social Psychology 135(6): 777–779

Heine S.J., Lehman D.R. (1997) The cultural construction of self-enhancement: An examination of group-serving biases. Journal of Personality and Social Psychology 72(6): 1268–1283

Heine S.J., Lehman D.R., Peng K., Greenholtz J. (2002) What’s wrong with cross-cultural comparisons of subjective Likert scales? The reference-group effect. Journal of Personality and Social Psychology 82(6): 903–918

Hofstede G. (1980) Culture’s consequences: International differences in work-related values. Sage, Beverly Hills

Hofstede G. (1991) Cultures and organizations: Software of the mind. McGraw-Hill, London

Holtgraves T. (1997) Styles of language use: Individual and cultural variability in conversational indirectness. Journal of Personality and Social Psychology 73(3): 624–637

Holtgraves T. (2004) Social desirability and self-reports: Testing models of socially desirable responding. Personality & Social Psychology Bulletin 30(2): 161–172

Hsu F.L. (1983) Rugged individualism reconsidered. University of Tennessee Press, Knoxville

Hui C. (1988) Measurement of individualism-collectivism. Journal of Research in Personality 22(1): 17–36

Hui C., Triandis H. (1989) Effects of culture and response format on extreme response style. Journal of Cross-Cultural Psychology 20(3): 296–309

Javeline D. (1999) Response effects in polite cultures: A test of acquiescence in Kazakhstan. Public Opinion Quarterly 63(1): 1–28

Johnson J. (1981) Effects of the order of presentation of evaluative dimensions for bipolar scales in four societies. Journal of Social Psychology 113(1): 21–27

Johnson T., Kulesa P., Cho Y.I., Shavitt S. (2005) The relation between culture and response styles: Evidence from 19 Countries. Journal of Cross-Cultural Psychology 36(2): 264–277

Johnson T.P., O’Rourke D., Chavez N., Sudman S., Warnecke R.B., Lacey L. et al. (1997) Social cognition and responses to survey questions among culturally diverse populations. In: Lyberg L., Biemer P., Collin M., de Leeuw E., Dippo C., Schwarz N., Trewin D. (eds). Survey measurement and process quality. Wiley-Interscience, New York

Kagitcibasi C. (1994) A critical appraisal of individualism and collectivism: Toward a new formulation. In: Kim U., Triandis H.C., Kagitcibasi C., Choi S.-C., Yoon G. (eds). Individualism and collectivism: Theory, method, and applications (Vol. 18). Sage, Newbury Park, pp. 52–65

Kim U., Triandis H., Kagitcibasi C., Choi S.-C., Yoon G., eds. (1994) Individualism and collectivism: Theory, method, and applications. Sage, Thousand Oaks, CA

Knowles E.S., Condon C.A. (1999) Why people say “yes”: A dual-process theory of acquiescence. Journal of Personality and Social Psychology 77(2): 379–386

Knowles E.S., Nathan K.T. (1997) Acquiescent responding in self-reports: Cognitive style or social concern? Journal of Research in Personality 31(2): 293–301

Krosnick J.A. (1991) Response strategies for coping with the cognitive demands of attitude measures in surveys. Applied Cognitive Psychology 5: 213–236

Krosnick J.A. (2002) The causes of no-opinion responses to attitude measures in surveys: They are rarely what they appear to be. In: Groves R.M., Dillman D.A., Eltinge J.L., Little R.J. (eds). Survey nonresponse. John Wiley & Sons, New York

Krosnick J.A., Holbrook A.L., Berent M.K., Carson R.T., Hanemann W., Kopp R.J., et al. (2002) The impact of “no opinion” response options on data quality: Non-attitude reduction or an invitation to satisfice? Public Opinion Quarterly 66(3): 371–403

Krosnick J.A., Schuman H. (1988) Attitude intensity, importance, and certainty and susceptibility to response effects. Journal of Personality and Social Psychology 54(6): 940–952

Kuehnen U., Oyserman D. (2002) Thinking about the self influences thinking in general: Cognitive consequences of salient self-concept. Journal of Experimental Social Psychology 38(5): 492–499

Lee C., Green R.T. (1991) Cross cultural examination of Fishbein behavioral intentions model. Journal of International Business Studies 22: 289–305

Lehnert W. (1977) Human and computational question answering. Cognitive Science 1(1): 47–73

Marin G., Gamba R.J., Marin B.V. (1992) Extreme response style and acquiescence among Hispanics: The role of acculturation and education. Journal of Cross-Cultural Psychology 23(4): 498–509

Markus H.R., Kitayama S. (1991) Culture and the self: Implications for cognition, emotion, and motivation. Psychological Review 98(2): 224–253

Mathiowetz N.A., Duncan G.J. (1988) Out of work, out of mind: Response errors in retrospective reports of unemployment. Journal of Business and Economic Statistics 6: 221–229

Matsuda Y., Harsel S., Furusawa S., Kim H.-S., Quarles J. (2001) Democratic values and mutual perceptions of human rights in four Pacific Rim nations. International Journal of Intercultural Relations 25(4): 405–421

McClendon M.J. (1991) Acquiescence and recency response-order effects in interview surveys. Sociological Methods and Research 20: 60–103

Middleton K.L., Jones J.L. (2000) Socially desirable response sets: The impact of country culture. Psychology and Marketing 17(2): 149–163

Mondak J.J., Davis M.B. (2001) Asked and Answered: Knowledge Levels When We Will Not Take Don’t Know for an Answer. Political Behavior 23(3): 199–224

Moore D.W. (2002) Measuring new types of question-order effects: Additive and subtractive. Public Opinion Quarterly 66(1): 80–91

Moorman R.H., Podsakoff P.M. (1992) A meta-analytic review and empirical test of the potential confounding effects of social desirability response sets in organizational behaviour research. Journal of Occupational and Organizational Psychology 65(2): 131–149

Morling B., Fiske S.T. (1999) Defining and measuring harmony control. Journal of Research in Personality 33(4): 379–414

Norenzayan A., Schwarz N. (1999) Telling what they want to know: Participants tailor causal attributions to researchers’ interests. European Journal of Social Psychology 29(8): 1011–1020

Ohbuchi K.-I., Fukushima O., Tedeschi J.T. (1999) Cultural values in conflict management: Goal orientation, goal attainment, and tactical decision. Journal of Cross-Cultural Psychology 30(1): 51–71

Oppenheim A.N. (1966) Questionnaire design and attitude measurement. Heinemann, London

Oyserman D. (1993) The lens of personhood: Viewing the self and others in a multicultural society. Journal of Personality and Social Psychology 65(5): 993–1009

Oyserman D., Coon H.M., Kemmelmeier M. (2002) Rethinking individualism and collectivism: Evaluation of theoretical assumptions and meta-analyses. Psychological Bulletin 128(1): 3–72

Oyserman D., Markus H.R. (1993) The sociocultural self. In: Suls J.M. (ed). The self in social perspective (Vol. 4). Lawrence Erlbaum Associates, Hillsdale, NJ, pp. 187–220

Paulhus D.L. (1991) Measurement and control of response bias. In: Robinson J.P., Shaver P.R., Wrightsman L.S., Andrews F.M. (eds). Measures of personality and social psychological attitudes (Vol. 1). Academic Press, San Diego, pp. 17–59

Paulhus D.L. (2002) Socially desirable responding: The evolution of a construct. In: Braun H.I., Jackson D.N., Wiley D.E., Messick S. (eds). The role of constructs in psychological and educational measurement. Lawrence Erlbaum Associates, Mahwah, NJ, pp. 49–69

Paulhus D.L., Harms P.D., Bruce M.N., Lysy D.C. (2003) The over-claiming technique: Measuring self-enhancement independent of ability. Journal of Personality and Social Psychology 84(4): 890–904

Paulhus D.L., John O.P. (1998) Egoistic and moralistic biases in self-perception: The interplay of self-deceptive styles with basic traits and motives. Journal of Personality 66(6): 1025–1060

Paulhus D.L., Reid D.B. (1991) Enhancement and denial in socially desirable responding. Journal of Personality and Social Psychology 60(2): 307–317

Peterson J.B., DeYoung C.G., Driver-Linn E., Seguin J.R., Higgins D.M., Arseneault L., et al. (2003) Self-deception and failure to modulate responses despite accruing evidence of error. Journal of Research in Personality 37(3): 205–223

Ray J. (1983) Reviving the problem of acquiescent response bias. Journal of Social Psychology 121(1): 81–96

Reykowski J. (1994) Collectivism and individualism as dimensions of social change. In: Kim U., Triandis H.C., Kagitcibasi C., Choi C., Yoon G. (eds). Individualism and collectivism: Theory, method, and applications. Thousand Oaks, California

Richman W.L., Kiesler S., Weisband S., Drasgow F. (1999) A meta-analytic study of social desirability distortion in computer-administered questionnaires, traditional questionnaires, and interviews. Journal of Applied Psychology 84(5): 754–775

Sampson E.E. (1977) Psychology and the American ideal. Journal of Personality and Social Psychology 35(11): 767–782

Schuman H., Presser S. (1981) Questions and answers in attitude surveys: Experiments on question form, wording, and context. Academic Press, New York

Schwartz S.H. (1990) Individualism-collectivism: Critique and proposed refinements. Journal of Cross-Cultural Psychology 21(2): 139–157

Schwarz N. (1999) Self-reports: How the questions shape the answers. American Psychologist 54(2): 93–105

Schwarz N. (2003) Self-reports in consumer research: The challenge of comparing cohorts and cultures. Journal of Consumer Research 29(4): 588–594

Schwarz N., Hippler H.-J. (1995) Subsequent questions may influence answers to preceding questions in mail surveys. Public Opinion Quarterly 59(1): 93–97

Schwarz N., Hippler H.-J., Deutsch B., Strack F. (1985) Response scales: Effects of category range on reported behavior and comparative judgments. Public Opinion Quarterly 49(3): 388–395

Schwarz N., Hippler H.J., Noelle-Neumann E. (1991) Cognitive model of response-order effects. In: Schwarz N., Sudman S. (eds). Context effects in social and psychological research. Springer Verlag, New York

Schwarz N., Oyserman D. (2001) Asking questions about behavior: Cognition, communication, and questionnaire construction. American Journal of Evaluation 22(2): 127–160

Sekaran U. (1984) Methodological and theoretical issues and advancements in cross-cultural research. Journal of International Business Studies 14(2): 61–73

Shulruf B., Hattie J., Dixon R. (2006) The influence of individualist and collectivist attributes on responses to Likert-type scales. Paper presented at the 26th International Association of Applied Psychology, 17–21 July, Athens

Shulruf B., Hattie J., Dixon R. (2007) Development of a new measurement tool for individualism and collectivism. Journal of Psychoeducational Assessment (in press)

Shulruf B., Watkins D., Hattie J., Faria L., Pepi A., Alesi M., et al. (in progress) Measuring collectivism and individualism in the third millennium

Singelis T.M., Triandis H., Bhawuk D., Gelfand M.J. (1995) Horizontal and vertical dimensions of individualism and collectivism: A theoretical and measurement refinement. Cross-Cultural Research: The Journal of Comparative Social Science 29(3): 240–275

Smith P.B. (2004) Acquiescent Response Bias as an Aspect of Cultural Communication Style. Journal of Cross-Cultural Psychology 35(1): 50–61

Stening B., Everett J. (1984) Response styles in a cross-cultural managerial study. Journal of Social Psychology 122(2): 151–156

Strack F. (1992) Order effects in survey research: Activation and information functions of preceding questions. In: Schwarz N., Sudman S. (eds). Context effects in social and psychological research. Springer-Verlag, New York, pp. 23–34

Sudman S., Bradburn N.M. (1974) Response effects in surveys. Aldine Publishing Company, Chicago

Sudman S., Bradburn N.M., Schwarz N. (1996) Thinking about answers: The application of cognitive processes to survey methodology. Jossey-Bass, San Francisco

Swearingen D.L. (1998) Response sets, item format, and thinking style: Implications for questionnaire design. U Denver, US, 1

Tourangeau R. (1991) Context effects on responses to attitude questions: Attitudes as memory structure. In: Schwarz N., Sudman S. (eds). Context effects in social and psychological research. Springer-Verlag, New York, pp. 35–47

Tourangeau R. (2003) Cognitive aspects of survey measurement and mismeasurement. International Journal of Public Opinion Research 15(1): 3–7

Tourangeau R., Rasinski K.A. (1988) Cognitive processes underlying context effects in attitude measurement. Psychological Bulletin 103(3): 299–314

Tourangeau R., Smith T.W. (1996) Asking sensitive questions: The impact of data collection mode, question format, and question context. Public Opinion Quarterly 60(2): 275–304

Triandis H. (1989) The self and social behavior in differing cultural contexts. Psychological Review 96(3): 506–520

Triandis H. (1995) Individualism and collectivism. Westview Press, Boulder

Triandis H. (1996) The psychological measurement of cultural syndromes. American Psychologist 51(4): 407–415

Triandis H., Bontempo R., Villareal M.J., Asai M., Lucca N. (1988) Individualism and collectivism: Cross-cultural perspectives on self in group relationships. Journal of Personality and Social Psychology 54(2): 323–338

Triandis H., Gelfand M. (1998) Converging measurement of horizontal and vertical individualism and collectivism. Journal of Personality and Social Psychology 74(1): 118–128

Triandis H., McCusker C., Hui C. (1990) Multimethod probes of individualism and collectivism. Journal of Personality and Social Psychology 59(5): 1006–1020

Triandis H., Suh E.M. (2002) Cultural influences on personality. Annual Review of Psychology 53(1): 133–160

Uskul A.K., Oyserman D. (2005) Question Comprehension and Response: Implications of Individualism and Collectivism. In: Mannix B., Neale M., Chen Y. (eds). Research on managing groups and teams: National culture & groups. Elsevier Science, Oxford

van Herk H., Poortinga Y.H., Verhallen T.M. (2004) Response styles in rating scales: Evidence of method bias in data from six EU countries. Journal of Cross-Cultural Psychology 35(3): 346–360

Walsh W.A., Banaji M.R., eds. (1997) The collective self (Vol. 818). Annals of the New York Academy of Sciences, New York

Warnecke R.B., Johnson T.P., Chavez N., Sudman S., O’Rourke D., Lacey L., et al. (1997) Improving question wording in surveys of culturally diverse populations. Annals of Epidemiology 7: 334–342

Waterman A.S. (1984) The psychology of individualism. Praeger, New York

Watson D. (1992) Correcting for acquiescent response bias in the absence of a balanced scale: An application to class consciousness. Sociological Methods and Research 21(1): 52–88

Welkenhuysen-Gybels J., Billiet J., Cambre B. (2003) Adjustment for acquiescence in the assessment of the construct equivalence of Likert-type score items. Journal of Cross-Cultural Psychology 34(6): 702–722

Weng L.-J., Cheng C.-P. (2000) Effects of response order on Likert-type scales. Educational and Psychological Measurement 60(6): 908–924

Wilson T.D., LaFleur S.J., Anderson D. (1996) The validity and consequences of verbal reports about attitudes. In: Schwarz N., Sudman S. (eds). Answering questions: Methodology for determining cognitive and communicative processes in survey research. Jossey-Bass, San Francisco, pp. 91–114

Wong N., Rindfleisch A., Burroughs J. (2003) Do Reverse-Worded Items Confound Measures in Cross-Cultural Consumer Research? The Case of the Material Values Scale. Journal of Consumer Research 30: 72–91

Directionality of Likert scales

A feature of Likert scales is their directionality: the categories of response may be increasingly positive or increasingly negative. While interpretation of a category may vary among respondents (e.g., one person’s “agree” is another’s “strongly agree”), all respondents should nevertheless understand that “strongly agree” is a more positive opinion than “agree.” One important consideration in the design of questionnaires is the use of reverse scoring on some items. Imagine a questionnaire with positive statements about the benefits of public health education programs (e.g., “TV campaigns are a good way to persuade people to stop smoking in the presence of children”). A subject who strongly agreed with all such statements would be presumed to have a very positive view about the benefits of this method of health education. However, perhaps the subject was not participating wholeheartedly and simply checked the same response category for each item. To ensure that respondents are reading and evaluating statements carefully, a few negative statements may be included (e.g., “Money spent on public health education programs would be better spent on research into new therapies”). If a respondent answers positively to positive statements and negatively to negative statements, the researcher may have increased confidence in the data.
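The reverse-scoring idea described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function names, the 1 to 5 coding, and the example answers are all assumptions made for the sketch.

```python
def reverse_score(response, scale_min=1, scale_max=5):
    """Flip a response on a negatively worded item so that a higher
    value always means a more positive opinion."""
    if not scale_min <= response <= scale_max:
        raise ValueError("response outside the scale range")
    return scale_min + scale_max - response


def score_items(responses, reversed_items, scale_min=1, scale_max=5):
    """Return item scores with the reverse-worded items flipped.

    `reversed_items` holds the zero-based positions of the negative statements.
    """
    return [
        reverse_score(r, scale_min, scale_max) if i in reversed_items else r
        for i, r in enumerate(responses)
    ]


def is_straightlined(responses):
    """True when the respondent ticked the same category for every item."""
    return len(set(responses)) == 1


# Hypothetical respondent: three positive statements, then one negative one.
raw = [5, 5, 5, 5]
print(is_straightlined(raw))      # same category ticked throughout
print(score_items(raw, {3}))      # the negative item is flipped before summing
```

A respondent who ticks "strongly agree" everywhere is flagged by `is_straightlined`, and after flipping the negative item their answers are no longer uniformly positive, which is the inconsistency the negative statements are meant to expose.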

Self-Reported Metrics

6.2.2 Semantic Differential Scales

The semantic differential technique involves presenting pairs of bipolar, or opposite, adjectives at either end of a series of scales.

As with the Likert scale, a five- or seven-point scale is commonly used. The difficult part about the semantic differential technique is coming up with words that are truly opposites. Sometimes a thesaurus can be helpful since it includes antonyms. But you need to be aware of the connotations of different pairings of words. For example, a pairing of “Friendly/Unfriendly” may have a somewhat different connotation and yield different results from “Friendly/Not Friendly” or “Friendly/Hostile.”

Osgood’s Semantic Differential

The semantic differential technique was developed by Charles E. Osgood (Osgood et al., 1957), who designed it to measure the connotations of words or concepts. Using factor analysis of large sets of semantic differential data, he found three recurring attitudes that people used in assessing words and phrases: evaluation (such as “good/bad”), potency (such as “strong/weak”), and activity (such as “passive/active”).


Two relevant methodological issues at this stage are the selection of the unit of analysis and the relevance of the research topic. Most research on international markets involves comparisons. Therefore, defining the unit of analysis, that is, selecting the relevant contexts to be compared, is a priority in cross-cultural research [1]. Craig and Douglas [19] propose three aspects that need to be considered in defining the unit: the geographic scope of the unit (for example, country or region), the criteria for membership in the unit (for example, demographic or socio-economic characteristics), and the situational context (for example, specific socio-cultural settings or climate context). This section will focus on geographic scope, which needs to be chosen based on the purpose of the research.

Within the different geographical levels, the country level provides a practical and convenient unit for data collection, so researchers mostly use this unit of analysis in their studies. However, the use of countries is criticized for several reasons [1]. First, countries are not always the most relevant unit: cities, regions or even the world may be more appropriate. Second, countries are not isolated or independent units; they develop and adopt similar practices and behaviors in numerous ways. Finally, the differences between countries in terms of economic, social or cultural factors, and the heterogeneity within countries, can have unintended consequences.

Establishing the relevance of the topic in the selected units of analysis is both more difficult and more important than in domestic research, because researchers are often unfamiliar with the countries or cultures where the research is being conducted. The research topic should be equally important and appropriate in each context, and conceptually equivalent, an issue that will be addressed in the next section [17, 20]. Similarly, the relevance of constructs should be carefully evaluated [1]. Doing so helps to avoid pseudo-etic bias (that is, assuming that a measure developed in one context is appropriate in all contexts).

Suggestions and recommendations

Given the limitations of the country as a unit, the literature suggests considering different geographical units. As a result of advances in information and communication technology, improvements in physical communication and transportation, and the convergence of consumer needs, ‘national culture’ is less meaningful [21, 22]. Therefore, several authors call for the study of other units of analysis, such as regions, communities or specific population segments (for example, teenagers), as well as the combination of multiple levels of units [1, 23]. However, these alternative units of analysis should not totally replace the use of national borders. Engelen and Brettel [24] justify their use based on existing theoretical and empirical evidence plus their managerial relevance, since organizations typically carry out their international activities along national borders.

If countries are used as the unit of analysis, they should be ‘purposively selected to be comparable’ [1], taking into account those factors that may be relevant to or affect the phenomenon being studied. Furthermore, researchers should be aware of the degree of cultural interpenetration, that is, the extent to which the members of one country are exposed to another through different channels, such as direct experience, the media or the experiences of others. It is also important to take intra-national diversity into account to truly understand the phenomenon under investigation. Finally, the selection of the unit should be based on the objectives of the study rather than on convenience [17, 25].

Regarding the topic being investigated, Douglas and Craig [1] suggest removing the influence of the dominant culture. Researchers should guard against the tendency to let their own beliefs and values influence the question being analyzed. This helps them to distinguish the relevant topics, constructs or relationships to be studied in each context. It is also important to identify the role of mediating and moderating factors embedded in each socio-cultural context and to assess how these relate to the focal topic. For instance, a study exploring the purchase intention of foreign products should consider to what extent the image of the country of origin affects this intention.

The Effect of Rating Scale on Response Style: Experimental Evidence for Job Satisfaction

This paper explores the relationship between rating scales and response style using experimental data from a sample of 1500 households in the Innovation Panel (2008), which is part of the Understanding Society database. Two random groups of individuals were asked about their level of job satisfaction in a self-assessment questionnaire, each group using one of two rating options (7 or 11 points). By comparing the two groups, we explore the effects of the different rating scales on Extreme Response Style (ERS). The experimental design of the data enables us to show that both high and low ERS are correlated with personal and demographic characteristics. In addition, when comparing the shorter scale to the longer one, we show that the survey design may generate a tendency to choose responses at the extreme values of the distribution.
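One simple way to make "extreme responding" concrete is to compute the share of answers that fall on the lowest and on the highest category of each scale. The sketch below is only an illustration under stated assumptions: the function name, the 1 to 7 and 0 to 10 codings, and the answers are hypothetical, and the abstract does not say that the paper measures ERS this way.

```python
def extreme_shares(responses, scale_min, scale_max):
    """Return (low_share, high_share): the proportion of responses on the
    lowest and on the highest category of the scale."""
    if not responses:
        raise ValueError("no responses given")
    n = len(responses)
    low = sum(1 for r in responses if r == scale_min) / n
    high = sum(1 for r in responses if r == scale_max) / n
    return low, high


# Hypothetical job-satisfaction answers from the two experimental groups.
group_7pt = [1, 7, 4, 7, 5, 6, 7, 3]        # assumed coding 1..7
group_11pt = [0, 10, 5, 8, 7, 9, 6, 4]      # assumed coding 0..10

print(extreme_shares(group_7pt, 1, 7))
print(extreme_shares(group_11pt, 0, 10))
```

Keeping the low and high shares separate mirrors the abstract's distinction between low and high ERS; comparing the two tuples across groups is then a descriptive first look before any modelling of personal and demographic characteristics.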



2.1 Participants and recruitment procedure

The present study included datasets from ten countries in which the FCV-19S has been validated. A short description of the sampling is given here; details can be found in the original papers (Abad et al., 2020; Broche-Pérez et al., 2020; Chang, Hou, et al., 2020; Harper et al., 2020; Mailliez et al., 2021; Masuyama et al., 2020; Pakpour, Griffiths, Chang, et al., 2020; Sakib et al., 2020; Soraci et al., 2020; Winter et al., 2020). More specifically, all the participants used in the present study were recruited through convenience sampling. Some were recruited using online surveys and some using paper-based (offline) surveys, because most of the validations were carried out independently by different research teams with different resources in the different countries. However, there was no serious bias in using the two types of survey data collection, and there is prior evidence showing that online and offline surveys are measurement invariant (Martins, 2010). All the study designs were cross-sectional. Moreover, general populations were the target sample in most of the countries (Table 1). Table 1 also reports the data collection period for each country and a related figure concerning COVID-19 infection at the time of the study.

Table 1 (cell contents as extracted; country labels not preserved)

Online convenience sampling
In March: education facilities closed, border screening, social distancing
In April: office holiday, suspension of public transport, public gathering restrictions
Online convenience sampling
Online convenience sampling
April 2020: social isolation and distancing; hygiene practices; school closures; strict regulations for events and public places; quarantine of infected people; closing of non-essential businesses
April 30th: 6,006 deaths due to COVID-19; 87,187 people diagnosed with the disease
August 18th: 1,352 deaths in the last 24 hr; 47,784 confirmed cases in the last 24 hr
Infection control policies implemented in late January 2020
Strict regulations for events and public places
Blocked Chinese passengers on 30 January 2020
State of Emergency declared on 31 January 2020
Prohibition of access and removal in the municipalities with COVID-19 outbreak implemented on 23 February 2020
The Cuban government presents the action plan against COVID-19 (January 2020)
Mandatory use of facial mask (March 2020)
Strict lockdown in areas with more than 10 confirmed cases (March 2020)
Isolation of all suspected cases in specialized centres (March 2020)
Closure of international borders (March 2020)
Strict regulations for events and public places (March 2020)
Strict lockdown implemented on 22 March 2020
Smart lockdown implemented on 13 June 2020
Online convenience sampling
Partial lockdown for sick people on 23 February 2020
Demonstrations of over 5,000 people banned on 29 February 2020
Lockdown for general population between March 17 and May 11

2.2 Measures

2.2.1 Fear of COVID-19 Scale (FCV-19S)

The seven-item FCV-19S was developed to quickly assess individuals' fear of COVID-19 (Ahorsu, Lin, Imani, et al., 2020; Ahorsu, Lin, & Pakpour, 2020). Items are answered on a five-point Likert scale (1 = strongly disagree, 5 = strongly agree). The FCV-19S has been found to be psychometrically sound in assessing fear of COVID-19 in different populations, including different ethnic groups (Alyami et al., 2020; Pakpour, Griffiths, Chang, et al., 2020; Pang et al., 2020; Sakib et al., 2020; Satici et al., 2020; Soraci et al., 2020; Tsipropoulou et al., 2020) and various vulnerable groups (Pakpour, Griffiths, Chang, et al., 2020). An example item in the FCV-19S is “I cannot sleep because I'm worrying about getting coronavirus-19”. A higher FCV-19S score indicates a higher level of fear of COVID-19. Moreover, the different language versions of the FCV-19S used in the present study have been validated (Alyami et al., 2020; Chang, Hou, et al., 2020; Pakpour, Griffiths, Chang, et al., 2020; Sakib et al., 2020; Satici et al., 2020; Soraci et al., 2020; Tsipropoulou et al., 2020).

2.3 Data analysis

The participants' age, gender distribution (male, female, and other), and FCV-19S scores were first analysed using descriptive statistics for each country. Item properties of the seven FCV-19S items were then examined using skewness and kurtosis (to check the normality of responses for each item), item difficulty (using Rasch analysis), item fit (information-weighted fit mean square [MnSq] and outlier-sensitive fit MnSq, where a value between 0.5 and 1.5 indicates good fit) (Lin et al., 2019), factor loadings (derived from confirmatory factor analysis [CFA]) and item-total correlations. The properties of the FCV-19S as a whole were assessed using internal consistency, CFA and Rasch analysis. For internal consistency, a Cronbach's α > 0.7 is considered satisfactory (Lee et al., 2016). For CFA, a comparative fit index (CFI) and Tucker-Lewis index (TLI) > 0.9, together with a root mean square error of approximation (RMSEA) and standardized root mean square residual (SRMR) < 0.08, indicate satisfactory fit (Lin et al., 2017). For Rasch analysis, item and person separation reliability > 0.7 with item and person separation index > 2 is considered satisfactory (Lin et al., 2019).
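To make the internal-consistency criterion concrete, here is a minimal sketch of Cronbach's α using its standard formula, α = k/(k − 1) × (1 − Σ item variances / variance of total scores). The function name and the response matrix are hypothetical; the study itself would have used dedicated statistical software, which is not specified here.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a response matrix.

    `rows` is a list of respondents, each a list of item scores
    (all respondents answer the same items).
    """
    n_items = len(rows[0])
    if n_items < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_variances = [variance(column) for column in zip(*rows)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical answers of five respondents to three 5-point items.
answers = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [1, 2, 1],
    [3, 3, 4],
]
alpha = cronbach_alpha(answers)
print(round(alpha, 3), "satisfactory" if alpha > 0.7 else "below the 0.7 threshold")
```

The final line applies the > 0.7 rule of thumb quoted in the text; it is a convention, not a property of the formula.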

Differential item functioning (DIF) based on Rasch analysis was conducted to examine whether the FCV-19S item content was interpreted differently across countries, gender (male and female) or age groups (children aged below 18 years, young to middle-aged adults aged between 18 and 60 years and older people aged above 60 years). A substantial DIF is defined as a DIF contrast >0.5 (Lin et al., 2019). Measurement invariance was further tested using multigroup CFA to examine whether participants of different countries, genders (male and female), and age groups (children aged below 18 years, young to middle-aged adults aged between 18 and 60 years, and older people aged above 60 years) interpret the entire FCV-19S similarly. In the multigroup CFA, several nested models were compared. More specifically, configural models across countries, gender and age groups were first fitted to examine whether the different subgroups of participants confirm the single-factor structure of the FCV-19S. Then, CFA models with factor loadings constrained to be equal across subgroups were constructed and compared with the configural models to examine whether different subgroups shared the same factor loadings. Finally, CFA models with factor loadings and item intercepts constrained to be equal across subgroups were constructed and compared with the models with only factor loadings constrained, to examine whether different subgroups shared the same item intercepts. ΔCFI > −0.01, ΔRMSEA < 0.01 and ΔSRMR < 0.01 support full measurement invariance in each comparison of two nested models (Lin et al., 2019). However, if full measurement invariance was not achieved, partial invariance was tested by relaxing factor loadings or item intercepts in the constrained models.
Moreover, the data relating to “other” gender was not used for DIF or multigroup CFA because there were only 27 participants reporting their gender as other. Given the huge difference in sample sizes (27 “other” gender, 7,723 male gender, and 8,363 female gender), carrying out invariance testing on such a small sample size would be problematic.
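
The decision rule for each pair of nested models can be written down in a few lines. This is a hedged sketch only: the function name `invariance_supported` and all fit values are invented for illustration, and the direction of each difference (more constrained model minus less constrained model) is an assumption, since the text does not state it.

```python
def invariance_supported(less_constrained, more_constrained):
    """Apply the delta-CFI / delta-RMSEA / delta-SRMR cutoffs to two nested models.

    Each argument is a dict with keys "cfi", "rmsea" and "srmr".
    Differences are taken as more constrained minus less constrained
    (an assumption; the source does not spell out the direction).
    """
    d_cfi = more_constrained["cfi"] - less_constrained["cfi"]
    d_rmsea = more_constrained["rmsea"] - less_constrained["rmsea"]
    d_srmr = more_constrained["srmr"] - less_constrained["srmr"]
    return d_cfi > -0.01 and d_rmsea < 0.01 and d_srmr < 0.01


# Hypothetical fit indices, invented for illustration only.
configural = {"cfi": 0.962, "rmsea": 0.061, "srmr": 0.034}
loadings = {"cfi": 0.958, "rmsea": 0.063, "srmr": 0.039}
intercepts = {"cfi": 0.941, "rmsea": 0.070, "srmr": 0.045}

# Step 1: equal loadings versus configural.
loadings_ok = invariance_supported(configural, loadings)
# Step 2: equal loadings and intercepts versus equal loadings.
intercepts_ok = invariance_supported(loadings, intercepts)
print(loadings_ok, intercepts_ok)
```

With these invented numbers the loadings step passes and the intercepts step fails, which is the situation in which partial invariance would be explored next.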

A structural equation modelling (SEM) model was then constructed to examine the associations between age, gender, and fear of COVID-19. In the SEM model, young to middle-aged adults aged between 18 and 60 years and being male were the reference groups. All the statistical analyses were performed using SPSS 24.0 (IBM Corp.), WINSTEPS 4.1.0, and the lavaan package in the R software.


Searches were conducted using the Nursing & Allied Health Database and the Science Direct database.

Within the Nursing & Allied Health Database the words ‘spirituality’ and ‘tools or measures or assessment or instruments or scales’ and ‘nursing’ were used as keywords searched within the abstract of articles. Limiters were placed by age such that only results involving adults were returned. It was specified that scholarly journal articles should be returned, written in English. This resulted in 15 hits.

Within the Science Direct search, the same words as above were searched for within the abstracts of articles. Topic filters were applied so that only results concerning ‘patients’ or ‘nurse’ were returned. Content was again limited to academic journals. This resulted in 362 hits.

Duplicates were removed, and then the titles and abstracts of the articles were viewed and inappropriate articles discarded. Articles were discarded at this stage if they included assessment of spirituality in child patients or if they did not consider the role of nurses or student nurses in a patient’s spirituality. The remaining articles were then viewed in full. Articles met the inclusion criteria if their methodology included measures relating to nursing professionals’ spiritual care and assessment of patients.

The research in Tanzania reported in this paper was funded by the European Union (funded under FP7-HEALTH, project reference 261349), as was the time of J.B., A.K., J.G., F.M., K.O. Examples from Burkina Faso reported in the paper were drawn from the impact evaluation of the Health Sector Results-Based Financing Program, funded by the World Bank through the Health Results Innovation Trust Fund. Contributions made by E.D. are based on her PhD dissertation submitted to Johns Hopkins University under the supervision of David H. Peters and with inputs and guidance from Qian-Li Xue, Sara Bennett, Kitty Chan, and Saifuddin Ahmed. The UK Department for International Development (DFID), as part of the Consortium for Research on Resilient and Responsive Health Systems (RESYST), supported the time of J.L., E.D. and J.B. writing the paper. J.B.’s time was also supported by the Research Council of Norway. The views expressed and information contained in the paper are not necessarily those of or endorsed by the funders, which can accept no responsibility for such views or information or for any reliance placed on them.

Conflict of interest statement. None declared.


For instance, for principal axis factoring, it’s all factors with eigenvalues greater than zero. For PCA, it’s all factors with eigenvalues greater than 1 (the Kaiser criterion).
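
As a rough illustration of the eigenvalue-greater-than-1 rule for PCA, the sketch below counts the eigenvalues of an item correlation matrix that exceed 1. It assumes NumPy is available; the function name `kaiser_count` and the simulated data (two uncorrelated latent factors with three items each) are invented for illustration.

```python
import numpy as np


def kaiser_count(data):
    """Return (number of eigenvalues > 1, eigenvalues in descending order).

    data: 2-D array with respondents in rows and items in columns.
    """
    corr = np.corrcoef(data, rowvar=False)       # item-by-item correlations
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # eigvalsh returns ascending order
    return int(np.sum(eigenvalues > 1.0)), eigenvalues


# Simulated data (an assumption): two latent factors, three noisy items each.
rng = np.random.default_rng(0)
factors = rng.normal(size=(500, 2))
noise = rng.normal(scale=0.5, size=(500, 6))
items = np.column_stack([factors[:, 0]] * 3 + [factors[:, 1]] * 3) + noise

n_keep, eigenvalues = kaiser_count(items)
print(n_keep, np.round(eigenvalues, 2))
```

Because the eigenvalues of a correlation matrix sum to the number of items, an eigenvalue above 1 means a component accounts for more variance than a single standardized item, which is the usual rationale given for the cutoff.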

Comparing Error Rates and Power When Analyzing Likert Scale Data

After analyzing all pairs of distributions, the results indicate that both types of analyses produce type I error rates that are nearly equal to the target value. A type I error rate is essentially a false positive. The test results are statistically significant but, unbeknownst to the investigator, the null hypothesis is actually true. This error rate should equal the significance level.

The 2-sample t-test and Mann-Whitney test produce nearly equal false positive rates for Likert scale data. Further, the error rates for both analyses are close to the significance level target. Excessive false positives are not a concern for either hypothesis test.

Regarding statistical power, the simulation study shows that there is a minute difference between these two tests. Apprehensions about the Mann-Whitney test being underpowered were unsubstantiated. In most cases, if there is an actual difference between populations, the two tests have an equal probability of detecting it.

There is one qualification. A power difference between the two tests exists for several specific combinations of distribution pairs. The difference in power affects only a small portion of the possible combinations of distributions. My suggestion is to perform both tests on your Likert data. If the test results disagree, look at the article to determine whether a difference in power might be the cause.

In most cases, it doesn't matter which of the two statistical analyses you use to analyze your Likert data. If you have two groups and you're analyzing five-point Likert data, both the 2-sample t-test and Mann-Whitney test have nearly equivalent type I error rates and power. These results are consistent across group sizes of 10, 30, and 200.
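
If you want to check this kind of result yourself, a small Monte Carlo run is enough to get a feel for it. The sketch below is not the simulation study described above: it assumes NumPy and SciPy are installed, uses SciPy's default pooled-variance `ttest_ind`, and the response distributions, group size, number of runs, and the function name `rejection_rates` are all made up for illustration.

```python
import numpy as np
from scipy import stats


def rejection_rates(probs_a, probs_b, n=30, n_sims=1000, alpha=0.05, seed=0):
    """Share of simulated studies in which each test rejects at level alpha."""
    rng = np.random.default_rng(seed)
    levels = np.arange(1, 6)  # five-point Likert responses 1..5
    t_hits = mw_hits = 0
    for _ in range(n_sims):
        a = rng.choice(levels, size=n, p=probs_a)
        b = rng.choice(levels, size=n, p=probs_b)
        t_p = stats.ttest_ind(a, b).pvalue
        mw_p = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue
        t_hits += int(t_p < alpha)
        mw_hits += int(mw_p < alpha)
    return t_hits / n_sims, mw_hits / n_sims


uniform = [0.2, 0.2, 0.2, 0.2, 0.2]
shifted = [0.05, 0.10, 0.20, 0.30, 0.35]  # made-up distribution skewed toward agreement

# Same distribution in both groups: rejections are false positives (type I errors).
type1_t, type1_mw = rejection_rates(uniform, uniform)
# Different distributions: rejections are correct detections (power).
power_t, power_mw = rejection_rates(uniform, shifted)
print("type I error:", type1_t, type1_mw)
print("power:", power_t, power_mw)
```

Changing `n` to 10 or 200, or swapping in other pairs of distributions, lets you see where the two tests start to diverge.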

Sometimes it's just nice to know when you don't have to stress over something!

