psych (version 2.1.9)

testRetest: Find various test-retest statistics, including test, person and item reliability


Given two presentations of a test, it is straightforward to find the test-retest reliability, as well as the item reliability and person stability across items. Using the multi-level structure of the data, it is also possible to do a variance decomposition to find variance components for people, items, time, people x time, people x items, and items x time, as well as the residual variance. This leads to various generalizability coefficients.


testRetest(t1,t2=NULL,keys=NULL,id="id", time=  "time", select=NULL, 
check.keys=TRUE, warnings=TRUE,lmer=TRUE,sort=TRUE)



t1: A data.frame or matrix for the first time of measurement.

t2: A data.frame or matrix for the second time of measurement. May be NULL if time is specified in t1.

keys: Item names (or locations) to analyze; preface with "-" to reverse score.

id: Subject identification codes to match across time.

time: The name of the time variable identifying time 1 or 2 if just one data set is supplied.

select: A subset of items to analyze.

check.keys: If TRUE, will automatically reverse items based upon their correlation with the first principal component. Will throw a warning when doing so, but some people seem to miss this kind of message.

warnings: If TRUE, then warn when items are reverse scored.

lmer: If TRUE, include the lmer variance decomposition. By default, this is TRUE, but this can lead to long run times for large data sets.

sort: If TRUE, the data are sorted by id and time. This allows for random ordering of data, but will fail if ids are duplicated in different studies. In that case, we need to add a constant to the ids for each study. See the last example.



The time 1 time 2 correlation of scaled scores across time


Guttman's lambda 3 (aka alpha) and lambda 6* (item reliabilities based upon smcs) are found for the scales at times 1 and 2.


The within subject test retest reliability of response patterns over items


Item reliabilities, item loadings at time 1 and 2, item means at time 1 and time 2


A data frame of principal component scores at time 1 and time 2, raw scores from time 1 and time 2, the within person standard deviation for time 1 and time 2, and the rqq and dqq scores for each subject.


If given separate t1 and t2 data.frames, this is a combination suitable for use with multilevel.reliability


A key vector showing which items have been reversed


The multilevel output


There are many ways of measuring reliability. Test-retest is one way. If the time interval is very short (or immediate), the result is known as a dependability correlation; if the time interval is longer, it is a stability coefficient. In all cases, this is a correlation between two measures at different time points. Given the multi-level nature of these data, it is possible to find variance components associated with individuals, time, item, time by item, etc. This leads to several different estimates of reliability (see multilevel.reliability for a discussion and references).

It is also possible to find the subject reliability across time (this is the correlation across the items at time 1 with time 2 for each subject). This is a sign of subject reliability (Wood et al., 2017). Items can show differing amounts of test-retest reliability over time. Unfortunately, the within person correlation has problems if people do not differ very much across items. If all items are keyed in the same direction and measure the same construct, then the response profile for an individual is essentially flat. This implies that, even with almost perfect reproducibility, the correlation can actually be negative. The within person distance (d2) across items is just the mean of the squared differences for each item. Although highly negatively correlated with the rqq score, this does distinguish random responders (high dqq and low rqq) from consistent responders with lower variance (low dqq and low rqq).
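
The flat-profile problem can be illustrated with a small base R sketch. The helper `profile_stats` and the three made-up respondents are assumptions for illustration only; they are not part of psych and are not the code that testRetest runs. The sketch simply applies the two definitions given above: a per-person correlation across items, and the mean squared item difference.

```r
# Hypothetical helper (not part of psych): per-person profile statistics.
# x1, x2: persons x items matrices for time 1 and time 2.
profile_stats <- function(x1, x2) {
  # within-person correlation of the item profile at time 1 with time 2
  rqq <- sapply(seq_len(nrow(x1)), function(i) cor(x1[i, ], x2[i, ]))
  # within-person distance: mean squared difference across items
  dqq <- rowMeans((x1 - x2)^2)
  data.frame(rqq = rqq, dqq = dqq)
}

# Three made-up respondents on six items scored 1-6
t1 <- rbind(consistent_varied = c(1, 5, 2, 6, 3, 4),
            consistent_flat   = c(4, 4, 3, 4, 4, 3),
            random            = c(1, 6, 2, 5, 1, 6))
t2 <- rbind(consistent_varied = c(1, 5, 2, 6, 3, 5),
            consistent_flat   = c(4, 3, 4, 4, 3, 4),
            random            = c(6, 1, 5, 2, 6, 1))

res <- profile_stats(t1, t2)
rownames(res) <- rownames(t1)
res
```

In this toy data the flat but consistent respondent never moves more than one scale point on any item, yet has a negative within-person correlation together with a small distance, while the inconsistent respondent has a negative correlation and a large distance. Looking at both statistics is what separates the two cases.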

Several individual statistics are reported in the scores object. These can be displayed by using pairs.panels for a graphic display of the relationship and ranges of the various measures.

Although meant to decompose the variance for tests with items nested within tests, if just given two tests, the variance components for people and for time will also be shown. The resulting variance ratio of people to total variance is the intraclass correlation between the two tests. See also ICC for the more general case.
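
As a rough illustration of the people-to-total variance ratio, the sketch below estimates variance components for people, time, and residual from the mean squares of a two-way layout. This is an ANOVA-based approximation written for this example, assuming complete data for two tests; the helper `two_test_icc` and the scores are hypothetical, and testRetest itself obtains its components from lmer rather than from this code.

```r
# Hypothetical helper (not part of psych): variance components for a
# persons x tests matrix of total scores, from two-way ANOVA mean squares.
two_test_icc <- function(scores) {
  n <- nrow(scores)
  k <- ncol(scores)
  grand <- mean(scores)
  ms_person <- k * sum((rowMeans(scores) - grand)^2) / (n - 1)
  ms_time   <- n * sum((colMeans(scores) - grand)^2) / (k - 1)
  # residual after removing person and time main effects
  resid <- scores - outer(rowMeans(scores), colMeans(scores), "+") + grand
  ms_resid <- sum(resid^2) / ((n - 1) * (k - 1))
  # method-of-moments estimates, truncated at zero
  v_person <- max((ms_person - ms_resid) / k, 0)
  v_time   <- max((ms_time - ms_resid) / n, 0)
  c(person = v_person, time = v_time, residual = ms_resid,
    icc = v_person / (v_person + v_time + ms_resid))
}

# Made-up total scores for five people on two occasions
scores <- cbind(t1 = c(10, 12, 15, 18, 20),
                t2 = c(11, 13, 14, 19, 22))
out <- two_test_icc(scores)
round(out, 3)
```

Because the time component stays in the denominator, a shift in mean level between the two occasions lowers this ratio even when the rank order of people is perfectly preserved.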


Cattell, R. B. (1964). Validity and reliability: A proposed more basic set of concepts. Journal of Educational Psychology, 55(1), 1-22. doi: 10.1037/h0046462

Cranford, J. A., Shrout, P. E., Iida, M., Rafaeli, E., Yip, T., & Bolger, N. (2006). A procedure for evaluating sensitivity to within-person change: Can mood measures in diary studies detect change reliably? Personality and Social Psychology Bulletin, 32(7), 917-929.

DeSimone, J. A. (2015). New techniques for evaluating temporal consistency. Organizational Research Methods, 18(1), 133-152. doi: 10.1177/1094428114553061

Revelle, W., & Condon, D. (2019). Reliability from alpha to omega: A tutorial. Psychological Assessment, 31(12), 1395-1411.

Revelle, W. (in preparation) An introduction to psychometric theory with applications in R. Springer. (Available online at

Shrout, P. E., & Lane, S. P. (2012). Psychometrics. In Handbook of research methods for studying daily life. Guilford Press.

Wood, D., Harms, P. D., Lowman, G. H., & DeSimone, J. A. (2017). Response speed and response consistency as mutually validating indicators of data quality in online samples. Social Psychological and Personality Science, 8(4), 454-464. doi: 10.1177/1948550617703168

See Also

alpha, omega, scoreItems, cor2


Run this code
library(psych)  #testRetest and char2numeric come from psych; the data sets come from psychTools
#for faster compiling, don't test
#lmer set to FALSE for speed.
#set lmer to TRUE to get variance components
sai.xray <- subset(psychTools::sai,psychTools::sai$study=="XRAY")
#The case where the two measures are identified by time
#automatically reverses items but throws a warning
stability <- testRetest(sai.xray[-c(1,3)],lmer=FALSE) 
stability  #show the results
#get a second data set
sai.xray1 <- subset(sai.xray,sai.xray$time==1)
msq.xray <- subset(psychTools::msqR,
 (psychTools::msqR$study=="XRAY") & (psychTools::msqR$time==1))
select <- colnames(sai.xray1)[is.element(colnames(sai.xray1 ),colnames(psychTools::msqR))] 

select <-select[-c(1:3)]  #get rid of the id information
#The case where the two times are in the form x, y

dependability <-  testRetest(sai.xray1,msq.xray,keys=select,lmer=FALSE)
dependability  #show the results

#now examine the Impulsivity subscale of the EPI
#use the epiR data set which includes epi.keys
#Imp <- selectFromKeys(epi.keys$Imp)   #fixed temporarily with 
Imp <- c("V1", "V3", "V8", "V10","V13" ,"V22", "V39" , "V5" , "V41")
imp.analysis <- testRetest(psychTools::epiR,select=Imp) #test-retest = .7, alpha=.51,.51 

#demonstrate random ordering  -- the results should be the same
n.obs <- NROW(psychTools::epiR)
ss <- sample(n.obs,n.obs)
temp.epi <- psychTools::epiR
temp.epi <-char2numeric(temp.epi)  #make the study numeric
temp.epi$id <- temp.epi$id + 300*temp.epi$study
random.epi <- temp.epi[ss,]
random.imp.analysis <- testRetest(random.epi,select=Imp)