ebp: Empirical Best Prediction for Disaggregated Indicators

Description

Function ebp estimates indicators using the Empirical Best Prediction approach by Molina and Rao (2010). Point predictions of indicators are obtained by Monte-Carlo approximations. Additionally, mean squared error (MSE) estimation can be conducted by using a parametric bootstrap approach (see also Gonzalez-Manteiga et al. (2008)). The unit-level model of Battese, Harter and Fuller (1988) is fitted by the restricted maximum likelihood (REML) method and one of five different transformation types for the dependent variable can be chosen. This approach can be extended to data under informative sampling using weights and is based on Guadarrama et al. (2018). Model estimation combines the uni-level model of Battese, Harter and Fuller (1988) and the approach of You and Rao (2002) using survey weights. At the moment, only the log-transformation is supported for this method.

Usage

ebp(
  fixed,
  pop_data,
  pop_domains,
  smp_data,
  smp_domains,
  L = 50,
  threshold = NULL,
  transformation = "box.cox",
  interval = "default",
  MSE = FALSE,
  B = 50,
  seed = 123,
  boot_type = "parametric",
  parallel_mode = ifelse(grepl("windows", .Platform$OS.type), "socket", "multicore"),
  cpus = 1,
  custom_indicator = NULL,
  na.rm = FALSE,
  weights = NULL,
  pop_weights = NULL,
  aggregate_to = NULL,
  weights_type = "Guadarrama",
  benchmark = NULL,
  benchmark_type = "ratio",
  benchmark_level = NULL,
  benchmark_weights = NULL,
  nlme_maxiter = 1000,
  nlme_tolerance = 1e-06,
  nlme_opt = "nlminb",
  nlme_optimmethod = "BFGS",
  nlme_method = "REML",
  nlme_mstol = 1e-07,
  nlme_msmaxiter = 1000,
  nlme_returnobject = FALSE,
  rescale_weights = FALSE,
  Ydump = NULL
)

Value

An object of class "ebp", "emdi" that provides estimators for regional disaggregated indicators and optionally corresponding MSE estimates. Several generic functions have methods for the returned object. For a full list and descriptions of the components of objects of class "emdi", see emdiObject.

Arguments

fixed: a two-sided linear formula object describing the fixed-effects part of the nested error linear regression model with the dependent variable on the left of a ~ operator and the explanatory variables on the right, separated by + operators. The argument corresponds to the argument fixed in function lme.
pop_data: a data frame that needs to comprise the variables named on the right of the ~ operator in fixed, i.e. the explanatory variables, and pop_domains.
pop_domains: a character string containing the name of a variable that indicates domains in the population data. The variable can be numeric or a factor but needs to be of the same class as the variable named in smp_domains.
smp_data: a data frame that needs to comprise all variables named in fixed and smp_domains.
smp_domains: a character string containing the name of a variable that indicates domains in the sample data. The variable can be numeric or a factor but needs to be of the same class as the variable named in pop_domains.
L: a number determining the number of Monte-Carlo simulations that must be at least 1. Defaults to 50. For practical applications, values larger than 200 are recommended (see also Molina, I. and Rao, J.N.K. (2010)).
threshold: a number defining a threshold. Alternatively, a threshold may be defined as a function of y returning a numeric value. Such a function will be evaluated once for the point estimation and in each iteration of the parametric bootstrap. A threshold is needed for calculation e.g. of head count ratios and poverty gaps. The argument defaults to NULL. In this case, the threshold is set to 60% of the median of the variable that is selected as dependent variable similary to the at-risk-of-poverty rate used in the EU (see also Social Protection Committee 2001). However, any desired threshold can be chosen.
transformation: a character string. Five different transformation types for the dependent variable can be chosen (i) no transformation ("no"); (ii) log transformation ("log"); (iii) Box-Cox transformation ("box.cox"); (iv) Dual transformation ("dual"); (v) Log-Shift transformation ("log.shift"); (Vi) rank-order transformation ("ordernorm"). Defaults to "box.cox".
interval: a string equal to 'default' or a numeric vector containing a lower and upper limit determining an interval for the estimation of the optimal parameter. The interval is passed to function optimize for the optimization. Defaults to 'default' which equals c(-1,2) for Box-Cox, c(0,2) for Dual and an interval based on the range of y for Log-Shift transformation. If the convergence fails, it is often advisable to choose a smaller more suitable interval. For right skewed distributions, the negative values may be excluded, also values larger than 1 are seldom observed.
MSE: if TRUE, MSE estimates using a parametric bootstrap approach are calculated (see also Gonzalez-Manteiga et al. (2008)). Defaults to FALSE.
B: a number determining the number of bootstrap populations in the parametric bootstrap approach (see also Gonzalez-Manteiga et al. (2008)) used in the MSE estimation. The number must be greater than 1. Defaults to 50. For practical applications, values larger than 200 are recommended (see also Molina, I. and Rao, J.N.K. (2010)).
seed: an integer to set the seed for the random number generator. For the usage of random number generation, see Details. If seed is set to NULL, seed is chosen randomly. Defaults to 123.
boot_type: character string to choose between different MSE estimation procedures,currently a "parametric" and a semi-parametric "wild" bootstrap are possible. Defaults to "parametric".
parallel_mode: modus of parallelization, defaults to an automatic selection of a suitable mode, depending on the operating system, if the number of cpus is chosen higher than 1. For details, see parallelStart.
cpus: number determining the kernels that are used for the parallelization. Defaults to 1. For details, see parallelStart.
custom_indicator: a list of functions containing the indicators to be calculated additionally. Such functions must depend on the target variable y, and optional can depend on pop_weights and the threshold. Defaults to NULL.
na.rm: if TRUE, observations with NA values are deleted from the population and sample data. For the EBP procedure complete observations are required. Defaults to FALSE.
weights: a character string containing the name of a variable that indicates weights in the sample data. If a character string is provided a weighted version of the ebp will be used. The variable has to be numeric. Defaults to NULL.
pop_weights: a character string containing the name of a variable that indicates population weights in the populatation data. If a character string is provided weighted indicators are estimated using population weights. The variable has to be numeric. Defaults to NULL.
aggregate_to: a character string containing the name of a variable from population data that indicates the target domain level for which the results are to be displayed. The variable can be numeric or a factor. Defaults to NULL.
weights_type: a character string. Two different methods for survey weights are available (i) EBP under informative sampling from Guadarrama et al. (2018) ("Guadarrama"); (ii) considering survey weights by using the weighting options of nlme from Pinheiro and Bates (2023) ("nlme"); (iii) considering survey weights by using the weighting options of nlme and use these weights also to determine the optimal transformation parameter lambda ("nlme_lambda"). Defaults to "Guadarrama".
benchmark: The input depends on the type of benchmarking to be performed. (i) Benchmarking with a fixed value: (a) with one value for each indicator: a named vector containing the numeric benchmark value(s). The names of the vector matchs to the chosen indicators. Benchmarking is available for "Mean" and "Head_Count". (b) with values for the sub-level specified in the argument benchmark_level: a data.frame composed of a variable of class character containing the domain names at which the benchmarkaing is performed and variable(s) with benchmark value(s) of class numeric. Benchmarking is supplied for the Mean and the Head_Count ratio. Therefore, the names of the data.frame must match for the first variable the benchmark_level and for the other(s) to Mean and Head_Count. (ii) Benchmarking with the survey data: a vector containing the names of the chosen indicators. In this case, survey weights (weights) are needed. Benchmarking is available for "Mean" and "Head_Count".
benchmark_type: a character indicating the type of benchmarking. Types that can be chosen (i) Raking ("raking"), (ii) Ratio adjustment ("ratio"), and for head count, ratio adjustment of the complement ("ratio_complement". Defaults to "ratio"
benchmark_level: a character indicating the level at which the benchmarking is performed. This name must be represented in the sample and population data as variable name.
benchmark_weights: the name of variable containing benchmark weights. This is only possible for internal benchmarking and enable users to benchmark with weights differing from the survey weights (Default for weighting for internal benchmarking).
nlme_maxiter: an integer indicating the maximum number of iterations the lme function from package nlme will run for parameter convergence. Defaults to 1000.
nlme_tolerance: a real number indicating the tolerance criterion for the the lme function from package nlme. Defaults to 1e^-6.
nlme_opt: a string indicating the optimizer to be used by the lme function from package nlme, either "nlminb" (the default) or "optim".
nlme_optimmethod: a string indicating the optimization method to be used with the optim optimizer the lme function from packages nlme and optim Defaults to "BFGS".
nlme_method: a string indicating the method to be used by the lme function from package nlme, either "REML" (the default) or "ML".
nlme_mstol: a real number indicating the tolerance criterion for the the optimization step of the lme function from package nlme. Defaults to 1e^-7.
nlme_msmaxiter: an integer indicating the maximum number of iterations for the optimization step of the lme function from package nlme will run for parameter convergence. Defaults to 1000.
nlme_returnobject: a logical indicating whether the fitted object should be returned with a warning (instead of an error via stop()) when the maximum number of iterations is reached without convergence of the algorithm. Defaults to FALSE
rescale_weights: a logical indicating if the sample weights are scaled. If FALSE (default), the sample weights do not change. When TRUE , the sample weights are rescaled such that the average weight is 1 within each domain.
Ydump: a string specifying the name of a .csv file to save all simulated values of the dependent value, model predictions, and error terms used for point estimation.

Details

For Monte-Carlo approximations and in the parametric bootstrap approach random number generation is used. Thus, a seed is set by the argument seed.

The set of predefined indicators includes the mean, median, four further quantiles (10%, 25%, 75% and 90%), head count ratio, poverty gap, Gini coefficient and the quintile share ratio.

Since the sample observations often cannot be identified in practical applications, a modified approach by Guadarrama et al. (2016) called census EBP is implemented for the point estimation. For the MSE estimation, the bootstrap sample is not extracted from the superpopulation, but generated by the estimated model parameters. The lower the ratio between the sample and the population size, the closer are the results to the proposed approach by Molina and Rao (2010).

References

Battese, G.E., Harter, R.M. and Fuller, W.A. (1988). An Error-Components Model for Predictions of County Crop Areas Using Survey and Satellite Data. Journal of the American Statistical Association, Vol.83, No. 401, 28-36.

Gonzalez-Manteiga, W. et al. (2008). Bootstrap mean squared error of a small-area EBLUP. Journal of Statistical Computation and Simulation, 78:5, 443-462.

Guadarrama, M., Molina, I. and Rao, J.N.K. (2016). A comparison of small area estimation methods for poverty mapping. Joint Issue: Statistics in Transition New Series Survey Methodology, Vol.17, No. 1, 41–66.

Guadarrama, M., Molina, I. and Rao, J.N.K. (2018). Small area estimation of general parameters under complex sampling designs. Computational Statistics & Data Analysis, Vol. 121, 20-40.

Kreutzmann, A., Pannier, S., Rojas-Perilla, N., Schmid, T., Templ, M. and Tzavidis, N. (2019). The R Package emdi for Estimating and Mapping Regionally Disaggregated Indicators, Journal of Statistical Software, Vol. 91, No. 7, 1--33, <doi:10.18637/jss.v091.i07>

Molina, I. and Rao, J.N.K. (2010). Small area estimation of poverty indicators. The Canadian Journal of Statistics, Vol. 38, No.3, 369-385.

Social Protection Committee (2001). Report on indicators in the field of poverty and social exclusions, Technical Report, European Union. You, Y., Rao, J.N.K. (2002). A pseudo-empirical best linear unbiased prediction approach to small area estimation using survey weights. The Canadian Journal of Statistics. Vol. 30, No. 3, 431–439.

Examples

Run this code

# \donttest{
# Loading data - population and sample data
data("eusilcA_pop")
data("eusilcA_smp")

# Example 1: With default setting but na.rm=TRUE
emdi_model <- ebp(
  fixed = eqIncome ~ gender + eqsize + cash + self_empl +
    unempl_ben + age_ben + surv_ben + sick_ben + dis_ben + rent + fam_allow +
    house_allow + cap_inv + tax_adj, pop_data = eusilcA_pop,
  pop_domains = "district", smp_data = eusilcA_smp, smp_domains = "district",
  na.rm = TRUE
)


# Example 2: With MSE, two additional indicators and function as threshold -
# Please note that the example runs for several minutes. For a short check
# change L and B to lower values.
emdi_model <- ebp(
  fixed = eqIncome ~ gender + eqsize + cash +
    self_empl + unempl_ben + age_ben + surv_ben + sick_ben + dis_ben + rent +
    fam_allow + house_allow + cap_inv + tax_adj, pop_data = eusilcA_pop,
  pop_domains = "district", smp_data = eusilcA_smp, smp_domains = "district",
  threshold = function(y) {
    0.6 * median(y)
  }, transformation = "log",
  L = 50, MSE = TRUE, boot_type = "wild", B = 50, custom_indicator =
    list(
      my_max = function(y) {
        max(y)
      },
      my_min = function(y) {
        min(y)
      }
    ), na.rm = TRUE, cpus = 1, nlme_opt="optim" 
)

# Example 3: With default setting but na.rm=TRUE under informative sampling.
emdi_model <- ebp(
  fixed = eqIncome ~ gender + eqsize + cash + self_empl +
    unempl_ben + age_ben + surv_ben + sick_ben + dis_ben + rent + fam_allow +
    house_allow + cap_inv + tax_adj, pop_data = eusilcA_pop,
  pop_domains = "district", smp_data = eusilcA_smp, smp_domains = "district",
  weights = "weight", transformation = "log", na.rm = TRUE
)

# Example 4: With default setting and random effect on the district level
# while the output is at state level
emdi_model <- ebp(
  fixed = eqIncome ~ gender + eqsize + cash + self_empl +
    unempl_ben + age_ben + surv_ben + sick_ben + dis_ben + rent + fam_allow +
    house_allow + cap_inv + tax_adj, pop_data = eusilcA_pop,
  pop_domains = "district", smp_data = eusilcA_smp, smp_domains = "district",
  na.rm = TRUE, aggregate_to = "state"
)

# Example 5: With default setting using pop_weights to get weighted
# indicators according to equivalized household size and an using an
# custom_indicator using pop_weights
emdi_model <- ebp(
  fixed = eqIncome ~ gender + eqsize + cash + self_empl +
    unempl_ben + age_ben + surv_ben + sick_ben + dis_ben + rent + fam_allow +
    house_allow + cap_inv + tax_adj, pop_data = eusilcA_pop,
  pop_domains = "district", smp_data = eusilcA_smp, smp_domains = "district",
  custom_indicator =
    list(HCR_singleHH = function(y, pop_weights, threshold) {
                              mean(y[pop_weights == 1] < threshold)
                        }
    ), na.rm = TRUE, pop_weights = "hhsize"
)
# }

Run the code above in your browser using DataLab