repMean: Replication methods (JK1, JK2 and BRR) for descriptive statistics.

Description

Compute totals, means, adjusted means, mean differences, variances and standard deviations with standard errors in random or clustered or complex samples. Variance estimation in complex cluster designs based on Jackknife (JK1, JK2) or Balanced Repeated Replicates (BRR) procedure. Moreover, analyses can be customized for multiple or nested imputed variables, applying the combination rules of Rubin (1987) for imputed data and Rubin (2003) for nested imputed data. Conceptually, the function combines replication methods and methods for multiple imputed data. Trend estimation as usual in large-scale assessments is supported as well. Technically, this is a wrapper for the svymean and svyvar functions of the survey package.

Usage

repMean (datL, ID, wgt = NULL, type = c("none", "JK2", "JK1", "BRR", "Fay"), PSU = NULL,
         repInd = NULL, jkfac=NULL, repWgt = NULL, nest=NULL, imp=NULL, groups = NULL,
         group.splits = length(groups), group.differences.by = NULL,
         cross.differences = FALSE, crossDiffSE = c("wec", "rep","old"),
         adjust = NULL, useEffectLiteR = FALSE, nBoot = 100, group.delimiter = "_",
         trend = NULL, linkErr = NULL, dependent, na.rm = FALSE, doCheck = TRUE,
         engine = c("survey", "BIFIEsurvey"), scale = 1, rscales = 1, mse=TRUE,
         rho=NULL, hetero=TRUE, se_type = c("HC3", "HC0", "HC1", "HC2", "CR0", "CR2"),
         clusters = NULL, crossDiffSE.engine= c("lavaan", "lm"),
         stochasticGroupSizes = FALSE, verbose = TRUE, progress = TRUE, nCores=NULL)

Value

A list of data frames in the long format. The output can be summarized using the report function. The first element of the list is a list with either one (no trend analyses) or two (trend analyses) data frames with at least six columns each. For each subpopulation denoted by the groups

statement, each parameter (i.e., mean, variance, or group differences) and each coefficient (i.e., the estimate and the corresponding standard error) the corresponding value is given.

group: Denotes the group an analysis belongs to. If no groups were specified and/or analysis for the whole sample were requested, the value of ‘group’ is ‘wholeGroup’.
depVar: Denotes the name of the dependent variable in the analysis.
modus: Denotes the mode of the analysis. For example, if a JK2 analysis without sampling weights was conducted, ‘modus’ takes the value ‘jk2.unweighted’. If a analysis without any replicates but with sampling weights was conducted, ‘modus’ takes the value ‘weighted’.
parameter: Denotes the parameter of the regression model for which the corresponding value is given further. Amongst others, the ‘parameter’ column takes the values ‘mean’, ‘sd’, ‘var’ and ‘meanGroupDiff’ if group differences were requested.
coefficient: Denotes the coefficient for which the corresponding value is given further. Takes the values ‘est’ (estimate) and ‘se’ (standard error of the estimate).
value: The value of the parameter estimate in the corresponding group.

If groups were specified, further columns which are denoted by the group names are added to the data frame.

Arguments

datL: Data frame in the long format (i.e. each line represents one ID unit in one imputation of one nest) containing all variables for analysis.
ID: Variable name or column number of student identifier (ID) variable. ID variable must not contain any missing values.
wgt: Optional: Variable name or column number of weighting variable. If no weighting variable is specified, all cases will be equally weighted.
type: Defines the replication method for cluster replicates which is to be applied. Depending on type, additional arguments must be specified (e.g., PSU and/or repInd or repWgt).
PSU: Variable name or column number of variable indicating the primary sampling unit (PSU). When a jackknife procedure is applied, the PSU is the jackknife zone variable. If NULL, no cluster structure is assumed and standard errors are computed according to a random sample.
repInd: Variable name or column number of variable indicating replicate ID. In a jackknife procedure, this is the jackknife replicate variable. If NULL, no cluster structure is assumed and standard errors are computed according to a random sample.
jkfac: Only applies if engine = "BIFIEsurvey". Argument is passed to BIFIE.data.jack and specifies the factor for multiplying jackknife replicate weights.
repWgt: Normally, replicate weights are created by repMean directly from PSU and repInd variables. Alternatively, if replicate weights are included in the data.frame, specify the variable names or column number in the repWgt argument.
nest: Optional: name or column number of the nesting variable. Only applies in nested multiple imputed data sets.
imp: Optional: name or column number of the imputation variable. Only applies in multiple imputed data sets.
groups: Optional: vector of names or column numbers of one or more grouping variables.
group.splits: Optional: If groups are defined, group.splits optionally specifies whether analysis should be done also in the whole group or overlying groups. See examples for more details.
group.differences.by: Optional: Specifies variable group differences should be computed for. The corresponding variable must be included in the groups statement. Exception: choose 'wholePop' if you want to estimate each's group difference from the overall sample mean. See examples for further details.
cross.differences: Either a list of vectors, specifying the pairs of levels for which cross-level differences should be computed. Alternatively, if TRUE, cross-level differences for all pairs of levels are computed. If FALSE, no cross-level differences are computed. (see example 2a, 3, and 4)
crossDiffSE: Method for standard error estimation for cross level differences, where groups are dependent. wec uses weighted effect coding, rep uses replication methods (bootstrap or jackknife) to estimate the standard error between the total mean and group-specific means. old does not account for dependent groups and treat the groups as if they were independent from each other.
adjust: Variable name or column number of variable(s) for which adjusted means should be computed. Non-numeric variables (factors) will be converted to 0/1 dichotomous variables.
useEffectLiteR: Logical: use the lavaan-wrapper EffectLiteR to compute adjusted means? Alternatively, adjusted means are computed by applying a simple linear regression model in each group, using the variables in adjust as independent variables. Afterwards, the coefficients are weighted with the (weighted) means of the independent variables. Standard errors for this procedure are received using the delta method by applying an augmented variance-covariance matrix which assumes zero covariances between independent variable means and regression coefficients. We recommend to set useEffectLiteR = TRUE if no replication methods are applied. When replication methods are used (jackknife-1, jackknife-2, BRR), we recommend to set useEffectLiteR = FALSE, because otherwise the estimation is very slow.
nBoot: Without replicates (i.e., for completely random samples), the rep method for standard error estimation for cross level differences needs a bootstrap. nBoot therefore specifies the number of bootstrap samples. This argument is only necessary, if crossDiffSE = "rep" and none of the replicate methods (JK1, JK2, or BRR) is applied. Otherwise, nBoot will be ignored.
group.delimiter: Character string which separates the group names in the output frame.
trend: Optional: name or column number of the trend variable which contains the measurement time of the survey. Note: Levels of all grouping variables must be equal in all 'sub populations' partitioned by the discrete trend variable. repMean computes differences for all pairwise contrasts defined by trend variable levels. or three measurement occasions, i.e. 2010, 2015, and 2020, contrasts (i.e. trends) are computed for 2010 vs. 2015, 2010 vs. 2020, and 2015 vs. 2020.
linkErr: Optional: Either the name or column number of the linking error variable. If NULL, a linking error of 0 will be assumed in trend estimation. Alternatively, linking errors may be given as data.frame with following specifications: Two columns, named trendLevel1 and trendLevel2 which contain the levels of the trend variable. The contrasts between both values indicates which trend is meant. For only two measurement occasions, i.e. 2010 and 2015, trendLevel1 should be 2010, and trendLevel2 should be 2015. For three measurement occasions, i.e. 2010, 2015, and 2020, additional lines are necessary where trendLevel1 should be 2010, and trendLevel2 should be 2020, to mark the contrast between 2010 and 2020, and further additional lines are necessary where trendLevel1 should be 2015, and trendLevel2 should be 2020. The column depVar must include the name of the dependent variable. This string must correspond to the name of the dependent variable in the data. The column parameter indicates the parameter the linking error belongs to. Column linkingError includes the linking error value. Providing linking error in a data.frame is necessary for more than two measurement occasions. See the example 3a for further details.
dependent: Variable name or column number of the dependent variable.
na.rm: Logical: Should cases with missing values be dropped?
doCheck: Logical: Check the data for consistency before analysis? If TRUE groups with insufficient data are excluded from analysis to prevent subsequent functions from crashing.
engine: Which package should be used for estimation?
scale: scaling constant for variance, for details, see help page of svrepdesign from the survey package
rscales: scaling constant for variance, for details, see help page of svrepdesign from the survey package
mse: Logical: If TRUE, compute variances based on sum of squares around the point estimate, rather than the mean of the replicates. See help page of svrepdesign from the survey package for further details.
rho: Shrinkage factor for weights in Fay's method. If engine = "survey", argument is passed to the rho argument of the svrepdesign function from the survey package. See the corresponding help page for further details. If engine = "BIFIEsurvey", argument is passed to the fayfac argument of the BIFIE.data.jack function from the BIFIEsurvey package. See the corresponding help page for further details. For convenience, if rho = NULL (the default) and engine = "BIFIEsurvey" and type = "JK1", BIFIE.data.jack is called with jktype="JK_GROUP" and fayfac = rho, where \(\rho = (N_{cluster} - 1) \times N_{cluster}^{-1}\)
hetero: Logical: Assume heteroscedastic variance for weighted effect coding?
se_type: The sort of standard error sought for cross level differences. Only applies if crossDiffSE == "wec" and hetero == TRUE and crossDiffSE.engine == "lm". See the help page of lm_robust from the estimatr package for further details.
clusters: Optional: Variable name or column number of cluster variable. Only necessary if weighted effecting coding should be performed using heteroscedastic variances. See the help page of lm_robust from the estimatr package for further details.
crossDiffSE.engine: Software implementation used for estimating cross-level differences. Choices are either "lavaan" (required if stochasticGroupSites == "TRUE") or R function lm. "lavaan" is the default.
stochasticGroupSizes: Logical: Assume stochastic group sizes for using weighted effect coding in cross-level differences? Note: To date, only crossDiffSE.engine = "lavaan" allows for stochastic group sizes. Stochastic group sizes are not yet implemented for any replication method (jackknife, BRR).
verbose: Logical: Show analysis information on console?
progress: Logical: Show progress bar on console?
nCores: integer (default: NULL), number of cores to use for parallel processing, if engine = "survey". If NULL, single core processing is used.

Details

Function first creates replicate weights based on PSU and repInd variables (if defined) according to JK2 or BRR procedure as implemented in WesVar. According to multiple imputed data sets, a workbook with several analyses is created. The function afterwards serves as a wrapper for svymean called by svyby implemented in the ‘survey’ package. The results of the several analyses are then pooled according to Rubin's rule.

References

te Grotenhuis, M., Pelzer, B., Eisinga, R., Nieuwenhuis, R., Schmidt-Catran, A., & Konig, R. (2017). When size matters: advantages of weighted effect coding in observational studies. International Journal of Public Health. 62, 163--167.

Sachse, K. A. & Haag, N. (2017). Standard errors for national trends in international large-scale assessments in the case of cross-national differential item functioning. Applied Measurement in Education, 30, (2), 102-116. http://dx.doi.org/10.1080/08957347.2017.1283315

Weirich, S., Hecht, M., Becker, B. et al. Comparing group means with the total mean in random samples, surveys, and large-scale assessments: A tutorial and software illustration. Behav Res (2021). https://doi.org/10.3758/s13428-021-01553-1

Examples

Run this code

data(lsa)

### Example 1: only means, SD and variances for each country
### We only consider domain 'reading'
rd     <- lsa[which(lsa[,"domain"] == "reading"),]

### We only consider the first "nest".
rdN1   <- rd[which(rd[,"nest"] == 1),]

### First, we only consider year 2010
rdN1y10<- rdN1[which(rdN1[,"year"] == 2010),]

# \donttest{
### mean estimation
means1 <- repMean(datL = rdN1y10, ID="idstud", wgt="wgt", type = "JK2", PSU = "jkzone",
          repInd = "jkrep", imp="imp", groups = "country", dependent = "score",
          na.rm=FALSE, doCheck=TRUE, engine = "BIFIEsurvey")
### reporting function: the function does not know which content domain is being considered,
### so the user may add new columns in the output using the 'add' argument
res1   <- report(means1, add = list(domain = "reading"))

### Example 1a: Additionally to example 1, we decide to estimate whether
### each country's mean differ significantly from the overall mean as well
### as from the individual means of the other contries
means1a<- repMean(datL = rdN1y10, ID="idstud", wgt="wgt", type = "JK2", PSU = "jkzone",
          repInd = "jkrep", imp="imp", groups = "country", group.splits = 0:1,
          group.differences.by = "country", cross.differences = TRUE,
          dependent = "score", na.rm=FALSE, doCheck=TRUE, hetero=FALSE)
res1a  <- report(means1a, add = list(domain = "reading"))

### See that the means of all countries significantly differ from the overall mean.
print(res1a[intersect(which(res1a[,"comparison"] == "crossDiff"),
      which(res1a[,"parameter"] == "mean")),], digits = 3)

### Example 2: Sex differences by country. Assume equally weighted cases by omitting
### 'wgt' argument.
means2 <- repMean(datL = rdN1y10, ID="idstud", type = "JK2", PSU = "jkzone",
          repInd = "jkrep", imp="imp", groups = c("country", "sex"), group.splits = 0:2,
          group.differences.by="sex", dependent = "score", na.rm=FALSE, doCheck=TRUE,
          cross.differences =TRUE, crossDiffSE.engine= "lm")
res2   <- report(means2,add = list(domain = "reading"))

### Example 2a: Additionally to example 2, we decide to estimate whether
### each country's mean differ significantly from the overall mean. (Note: by default,
### such cross level differences are estimated using 'weighted effect coding'. Use the
### 'crossDiffSE' argument to choose alternative methods.) Moreover, we estimate whether
### each country's sex difference differ significantly from the sex difference in the
### whole population.
means2a<- repMean(datL = rdN1y10, ID="idstud", wgt="wgt", type = "JK2", PSU = "jkzone",
          repInd = "jkrep", imp="imp", groups = c("country", "sex"), group.splits = 0:2,
          group.differences.by="sex", cross.differences = list(c(0,1), c(0,2)),
          dependent = "score", na.rm=FALSE, doCheck=TRUE,
          crossDiffSE.engine= "lm", clusters = "idclass")
res2a  <- report(means2a,add = list(domain = "reading"))

### Third example: like example 2a, but using nested imputations of dependent variable,
### and additionally estimating trend: use 'rd' instead of 'rdN1y10'
### assume equally weighted cases by omitting 'wgt' argument
### ignoring jackknife by omitting 'type', 'PSU' and 'repInd' argument
means3T<- repMean(datL = rd, ID="idstud", imp="imp", nest="nest",
          groups = c("country", "sex"), group.splits = 0:2, group.differences.by="sex",
          cross.differences = list(c(0,1), c(0,2)), dependent = "score", na.rm=FALSE,
          doCheck=TRUE, trend = "year", linkErr = "leScore",
          crossDiffSE = "wec", crossDiffSE.engine= "lavaan")
res3T  <- report(means3T, add = list(domain = "reading"))

### Example 3a: like example 3, but providing linking errors in an additional data.frame
### This is optional for two measurement occasions but mandatory if the analysis contains
### more than two measurement occasions
linkErr<- data.frame ( trendLevel1 = 2010, trendLevel2 = 2015,  depVar = "score",
          parameter = "mean", unique(lsa[,c("domain", "leScore")]),
          stringsAsFactors = FALSE)
colnames(linkErr) <- car::recode(colnames(linkErr), "'leScore'='linkingError'")
### note that the linking errors for the specified domain have to be chosen via
### subsetting
means3a<- repMean(datL = rd, ID="idstud", imp="imp", nest="nest",
          groups = c("country", "sex"),
          group.splits = 0:2, group.differences.by="sex",
          cross.differences = list(c(0,1), c(0,2)),
          dependent = "score", na.rm=FALSE, doCheck=TRUE, trend = "year",
          linkErr = linkErr[which(linkErr[,"domain"] == "reading"),],
          crossDiffSE = "wec", crossDiffSE.engine= "lavaan")
res3a  <- report(means3a, add = list(domain = "reading"))

### Fourth example: using a loop do analyse 'reading' and 'listening' comprehension
### in one function call. Again with group and cross differences and trends, and
### trend differences
### we use weights but omit jackknife analysis by omitting 'type', 'PSU' and 'repInd'
### argument
means4T<- by ( data = lsa, INDICES = lsa[,"domain"], FUN = function (sub.dat) {
          repMean(datL = sub.dat, ID="idstud", wgt="wgt", imp="imp", nest="nest",
                 groups = c("country", "sex"), group.splits = 0:2,
                 group.differences.by="sex",
                 cross.differences = list(c(0,1), c(0,2)), dependent = "score",
                 na.rm=FALSE, doCheck=TRUE,
                 trend = "year", linkErr = "leScore", crossDiffSE.engine= "lm") })
ret4T  <- do.call("rbind", lapply(names(means4T), FUN = function ( domain ) {
          report(means4T[[domain]], add = list(domain = domain))}))
          
### Fifth example: compute adjusted means, also with trend estimation
### Note: all covariates must be numeric or 0/1 dichotomous
rdN1[,"mignum"] <- as.numeric(rdN1[,"mig"])
rdN1[,"sexnum"] <- car::recode(rdN1[,"sex"], "'male'=0; 'female'=1", as.numeric=TRUE,
                   as.factor=FALSE)
means5 <- repMean(datL = rdN1, ID="idstud", wgt="wgt", type = "JK2", PSU = "jkzone",
          repInd = "jkrep", imp="imp", groups = "country",
          adjust = c("sexnum", "ses", "mignum"), useEffectLiteR = FALSE,
          dependent = "score", na.rm=FALSE, doCheck=TRUE, trend = "year",
          linkErr = "leScore")
res5   <- report(means5, add = list(domain = "reading"))

if (FALSE) {
############################################################################################
#    Example 6: R code for running the PISA 2015 science example to compare group means    #
#                    with the total mean using weighted effect coding                      #
############################################################################################

# Warning: large PISA data set requires at least 16 GB free working memory (RAM):

### define necessary directories (note: writing permissions required)
folder <- tempdir()

### download PISA 2015 zipped student questionnaire data (420 MB) to a folder with
### writing permissions
download.file(url = "https://webfs.oecd.org/pisa/PUF_SPSS_COMBINED_CMB_STU_QQQ.zip",
         destfile = file.path(folder, "pisa2015.zip"))

### unzip PISA 2015 student questionnaire data (1.5 GB) to temporary folder
zip::unzip(zipfile = file.path(folder, "pisa2015.zip"), files= "CY6_MS_CMB_STU_QQQ.sav",
     exdir=folder)

### read data
pisa <- foreign::read.spss(file.path (folder, "CY6_MS_CMB_STU_QQQ.sav"),
        to.data.frame=TRUE, use.value.labels = FALSE, use.missings = TRUE)

# dependent variables
measure.vars <- paste0("PV", 1:10, "SCIE")

### choose desired variables and reshape into the long format
#              'CNTSTUID' = individual student identifier
#                   'CNT' = country identifier
#                 'SENWT' = senate weight (assume a population of 5000 in each country)
#              'W_FSTUWT' = final student weight
#                  'OECD' = dummy variable indicating which country is part of the OECD
#   'W_FSTURWT' (1 to 80) = balanced repeated replicate weights
# 'PV1SCIE' to 'PV10SCIE' = 10 plausible values of (latent) science performance
pisaLong <- reshape2::melt(pisa, id.vars = c("CNTSTUID", "CNT", "SENWT", "W_FSTUWT",
            "OECD", paste0("W_FSTURWT", 1:80)),
            measure.vars = measure.vars, value.name = "value", variable.name="imp",
            na.rm=TRUE)

### choose OECD countries
oecd <- pisaLong[which(pisaLong[,"OECD"] == 1),]

### analyze data
### analysis takes approximately 30 minutes on an Intel i5-6500 machine with 32 GB RAM
means   <- repMean( datL = oecd,         # data.frame in the long format
    ID                   = "CNTSTUID",   # student identifier
    dependent            = "value",      # the dependent variable in the data
    groups               = "CNT",        # the grouping variable
    wgt                  = "SENWT",      # (optional) weighting variable. We use senate
                                         # weights (assume a population of 5000 in each
                                         # country)
    type                 = "Fay",        # type of replication method. Corresponding to
                                         # the PISA sampling method, we use "Fay"
    rho                  = 0.5,          # shrinkage factor for weights in Fay's method
    scale                = NULL,         # scaling constant for variance, set to NULL
                                         # according to PISA's sampling method
    rscales              = NULL,         # scaling constant for variance, set to NULL
                                         # according to PISA's sampling method
    repWgt               = paste0("W_FSTURWT", 1:80), # the replicate weights,
                                                      # provided by the OECD
    imp                  = "imp",        # the imputation variable
    mse                  = FALSE,        # if TRUE, compute variances based on sum of
                                         # squares around the point estimate, rather
                                         # than the mean of the replicates.
    group.splits         = 0:1,          # defining the 'levels' for which means should
                                         # be computed. 0:1 implies that means for the
                                         # whole sample (level 0) as well as for groups
                                         # (level 1) are computed
    cross.differences    = TRUE,         # defines whether (and which) cross level mean
                                         # differences should be computed. TRUE means
                                         # that all cross level mean differences are
                                         # computed
    crossDiffSE          = "wec",        # method for standard errors of mean
                                         # differences
    crossDiffSE.engine   = "lm",         # software implementation for standard
                                         # errors of mean differences
    hetero               = TRUE,         # assume heteroscedastic group variances
    stochasticGroupSizes = FALSE         # assume fixed group sizes
                   )

### call a reporting function to generate user-friendly output
results <- report(means, exclude = c("Ncases", "NcasesValid", "var", "sd"))
}# }

Run the code above in your browser using DataLab