ebp: Empirical Best Prediction for disaggregated indicators

Description

Function ebp estimates indicators using the Empirical Best Prediction approach by Molina and Rao (2010). Point predictions of indicators are obtained by Monte-Carlo approximations. Additionally, mean squared error (MSE) estimation can be conducted by using a parametric bootstrap approach (see also Gonzalez-Manteiga et al. (2008)). The unit-level model of Battese, Harter and Fuller (1988) is fitted by the restricted maximum likelihood (REML) method and one of three different transformation types for the dependent variable can be chosen.

Usage

ebp(
  fixed,
  pop_data,
  pop_domains,
  smp_data,
  smp_domains,
  L = 50,
  threshold = NULL,
  transformation = "box.cox",
  interval = c(-1, 2),
  MSE = FALSE,
  B = 50,
  seed = 123,
  boot_type = "parametric",
  parallel_mode = ifelse(grepl("windows", .Platform$OS.type), "socket", "multicore"),
  cpus = 1,
  custom_indicator = NULL,
  na.rm = FALSE
)

Arguments

fixed

a two-sided linear formula object describing the fixed-effects part of the nested error linear regression model with the dependent variable on the left of a ~ operator and the explanatory variables on the right, separated by + operators. The argument corresponds to the argument fixed in function lme.

pop_data

a data frame that needs to comprise the variables named on the right of the ~ operator in fixed, i.e. the explanatory variables, and pop_domains.

pop_domains

a character string containing the name of a variable that indicates domains in the population data. The variable can be numeric or a factor but needs to be of the same class as the variable named in smp_domains.

smp_data

a data frame that needs to comprise all variables named in fixed and smp_domains.

smp_domains

a character string containing the name of a variable that indicates domains in the sample data. The variable can be numeric or a factor but needs to be of the same class as the variable named in pop_domains.

L

a number determining the number of Monte-Carlo simulations. It must be at least 1. Defaults to 50. For practical applications, values larger than 200 are recommended (see also Molina, I. and Rao, J.N.K. (2010)).

threshold

a number defining a threshold. Alternatively, a threshold may be defined as a function of y returning a numeric value. Such a function will be evaluated once for the point estimation and in each iteration of the parametric bootstrap. A threshold is needed to calculate, for example, head count ratios and poverty gaps. The argument defaults to NULL. In this case the threshold is set to 60% of the median of the variable selected as the dependent variable, similarly to the at-risk-of-poverty rate used in the EU (see also Social Protection Committee 2001). However, any desired threshold can be chosen.
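The two ways of supplying a threshold can be illustrated with base R alone. This is a minimal sketch on a made-up income vector, not package code. The head count ratio is taken here as the share of observations strictly below the threshold, which is an assumption about the exact definition.

```r
# Hypothetical income vector; not taken from the package data.
y <- c(12000, 18500, 9500, 30000, 22000, 15000, 41000)

# Option 1: threshold supplied as a fixed number.
threshold_fixed <- 11000

# Option 2: threshold supplied as a function of y (60% of the median),
# written in the same shape as the function used in Example 2.
threshold_fun <- function(y) {
  0.6 * median(y)
}
threshold_value <- threshold_fun(y)

# Head count ratio: share of observations below the threshold.
hcr_fixed <- mean(y < threshold_fixed)
hcr_fun <- mean(y < threshold_value)
```

Because a function threshold is recomputed from y, it adapts to each bootstrap population, whereas a numeric threshold stays constant.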

transformation

a character string. Three different transformation types for the dependent variable can be chosen (i) no transformation ("no"); (ii) log transformation ("log"); (iii) Box-Cox transformation ("box.cox"). Defaults to "box.cox".
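For orientation, the textbook Box-Cox transformation and its inverse can be sketched as follows. The helper names `box_cox` and `box_cox_inv` are hypothetical, and the package may use a shifted or otherwise modified variant, so this should be read as an illustration of the idea rather than the implementation behind `transformation = "box.cox"`.

```r
# Textbook Box-Cox transformation for strictly positive y.
# lambda = 0 reduces to the log transformation.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (abs(lambda) < 1e-12) {
    log(y)
  } else {
    (y^lambda - 1) / lambda
  }
}

# Inverse transformation back to the original scale.
box_cox_inv <- function(z, lambda) {
  if (abs(lambda) < 1e-12) {
    exp(z)
  } else {
    (lambda * z + 1)^(1 / lambda)
  }
}

y <- c(1, 2, 5, 10, 50)
z <- box_cox(y, lambda = 0.5)
y_back <- box_cox_inv(z, lambda = 0.5)
```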

interval

a numeric vector containing a lower and upper limit determining an interval for the estimation of the optimal transformation parameter. The interval is passed to function optimize for the optimization. Defaults to c(-1,2). If convergence fails, it is often advisable to choose a smaller, more suitable interval. For right-skewed distributions, negative values may be excluded, and values larger than 1 are seldom observed.
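How an interval restricts a one-dimensional search with `optimize` can be sketched on simulated data. The example below minimizes the negative profile log-likelihood of the Box-Cox parameter for an ordinary linear model fitted with `lm`. That is a deliberate simplification: `ebp` fits a mixed model by REML, and the objective function it actually optimizes is not shown here. All names in the block are hypothetical.

```r
set.seed(1)
n <- 200
x <- runif(n, 0, 2)
# Simulated right-skewed response: log(y) is linear in x.
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.3))

box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Negative profile log-likelihood of lambda for an ordinary linear model
# (a simplification; not the REML criterion of the mixed model).
neg_profile_loglik <- function(lambda, y, x) {
  z <- box_cox(y, lambda)
  rss <- sum(resid(lm(z ~ x))^2)
  n <- length(y)
  (n / 2) * log(rss / n) - (lambda - 1) * sum(log(y))
}

# The search is restricted to the interval, here the documented default.
opt <- optimize(neg_profile_loglik, interval = c(-1, 2), y = y, x = x)
lambda_hat <- opt$minimum
```

Narrowing the interval, for example to c(0, 1) for right-skewed data, shrinks the region that `optimize` has to search.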

MSE

if TRUE, MSE estimates using a parametric bootstrap approach are calculated (see also Gonzalez-Manteiga et al. (2008)). Defaults to FALSE.

B

a number determining the number of bootstrap populations in the parametric bootstrap approach (see also Gonzalez-Manteiga et al. (2008)) used in the MSE estimation. The number must be greater than 1. Defaults to 50. For practical applications, values larger than 200 are recommended (see also Molina, I. and Rao, J.N.K. (2010)).

seed

an integer to set the seed for the random number generator. For the usage of random number generation see details. If seed is set to NULL, seed is chosen randomly. Defaults to 123.

boot_type

a character string to choose between different MSE estimation procedures. Currently a "parametric" and a semi-parametric "wild" bootstrap are available. Defaults to "parametric".

parallel_mode

mode of parallelization. If cpus is set higher than 1, the default selects a suitable mode automatically, depending on the operating system. For details see parallelStart.

cpus

a number determining how many cores are used for the parallelization. Defaults to 1. For details see parallelStart.

custom_indicator

a list of functions containing the indicators to be calculated additionally. Each function must depend on the target variable y and the threshold, and on nothing else. Defaults to NULL.
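A list in the expected shape can be built and checked in base R before it is passed to `ebp`. This is a minimal sketch: `my_max` and `my_min` mirror Example 2, while `my_deep_poverty`, the data and the threshold value are hypothetical.

```r
# Each function takes the target variable y and the threshold,
# even if the threshold is not used inside the function body.
my_indicators <- list(
  my_max = function(y, threshold) {
    max(y)
  },
  my_min = function(y, threshold) {
    min(y)
  },
  # Hypothetical extra indicator: share below half the threshold.
  my_deep_poverty = function(y, threshold) {
    mean(y < 0.5 * threshold)
  }
)

# Quick check on made-up data that every function returns one number.
y <- c(4000, 9000, 12000, 20000, 35000)
threshold <- 10000
results <- vapply(my_indicators, function(f) f(y, threshold), numeric(1))
```

The list names are useful because they label the entries of `results`. Whether `ebp` uses them to label its output is not stated in this page.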

na.rm

if TRUE, observations with NA values are deleted from the population and sample data. For the EBP procedure complete observations are required. Defaults to FALSE.

Value

An object of class "emdi", "model", "ebp" that provides estimators for regional disaggregated indicators and optionally corresponding MSE estimates. Generic functions such as compare_plot, estimators, print, plot and summary have methods that can be used to obtain further information. See emdiObject for descriptions of components of objects of class "emdi".

Details

For Monte-Carlo approximations and in the parametric bootstrap approach random number generation is used. Thus, a seed is set by the argument seed. The set of predefined indicators includes the mean, median, four further quantiles (10%, 25%, 75% and 90%), head count ratio, poverty gap, Gini coefficient and the quintile share ratio.
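The predefined indicators can be made concrete with unweighted textbook definitions in base R. The sketch below uses made-up data and hypothetical helper names. The package may define some indicators differently, for example in how quantiles are computed or how the quintile share ratio treats ties, so the formulas here are assumptions rather than the package's code.

```r
# Hypothetical target variable and the default-style threshold.
y <- c(5, 8, 10, 12, 15, 20, 25, 30, 40, 60)
threshold <- 0.6 * median(y)

# Head count ratio: share of observations below the threshold.
head_count <- function(y, threshold) mean(y < threshold)

# Poverty gap: mean relative shortfall, zero for those at or above the threshold.
poverty_gap <- function(y, threshold) {
  mean((y < threshold) * (threshold - y) / threshold)
}

# Gini coefficient from the ordered values (unweighted).
gini <- function(y) {
  y <- sort(y)
  n <- length(y)
  2 * sum(seq_len(n) * y) / (n * sum(y)) - (n + 1) / n
}

# Quintile share ratio: total of the top 20% over total of the bottom 20%.
qsr <- function(y) {
  sum(y[y > quantile(y, 0.8)]) / sum(y[y <= quantile(y, 0.2)])
}

indicators <- c(
  mean = mean(y),
  median = median(y),
  quantile(y, c(0.10, 0.25, 0.75, 0.90)),
  head_count = head_count(y, threshold),
  poverty_gap = poverty_gap(y, threshold),
  gini = gini(y),
  qsr = qsr(y)
)
```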

References

Kreutzmann, A., Pannier, S., Rojas-Perilla, N., Schmid, T., Templ, M. and Tzavidis, N. (2019). The R Package emdi for Estimating and Mapping Regionally Disaggregated Indicators. Journal of Statistical Software, Vol. 91, No. 7, 1-33, <doi:10.18637/jss.v091.i07>.

Battese, G.E., Harter, R.M. and Fuller, W.A. (1988). An Error-Components Model for Predictions of County Crop Areas Using Survey and Satellite Data. Journal of the American Statistical Association, Vol. 83, No. 401, 28-36.

Gonzalez-Manteiga, W. et al. (2008). Bootstrap mean squared error of a small-area EBLUP. Journal of Statistical Computation and Simulation, 78:5, 443-462.

Molina, I. and Rao, J.N.K. (2010). Small area estimation of poverty indicators. The Canadian Journal of Statistics, Vol. 38, No. 3, 369-385.

Social Protection Committee (2001). Report on indicators in the field of poverty and social exclusions, Technical Report, European Union.

Examples

# NOT RUN {
# Load the package that provides ebp and the example data
library(emdi)

# Loading data - population and sample data
data("eusilcA_pop")
data("eusilcA_smp")

# Example 1: With default setting but na.rm = TRUE
emdi_model <- ebp(
  fixed = eqIncome ~ gender + eqsize + cash + self_empl + unempl_ben +
    age_ben + surv_ben + sick_ben + dis_ben + rent + fam_allow +
    house_allow + cap_inv + tax_adj,
  pop_data = eusilcA_pop, pop_domains = "district",
  smp_data = eusilcA_smp, smp_domains = "district",
  na.rm = TRUE
)


# Example 2: With MSE, two additional indicators and function as threshold -
# Please note that the example runs for several minutes. For a short check
# change L and B to lower values.
emdi_model <- ebp(
  fixed = eqIncome ~ gender + eqsize + cash + self_empl + unempl_ben +
    age_ben + surv_ben + sick_ben + dis_ben + rent + fam_allow +
    house_allow + cap_inv + tax_adj,
  pop_data = eusilcA_pop, pop_domains = "district",
  smp_data = eusilcA_smp, smp_domains = "district",
  threshold = function(y) {0.6 * median(y)},
  transformation = "log", L = 50, MSE = TRUE, boot_type = "wild", B = 50,
  custom_indicator = list(
    my_max = function(y, threshold) {max(y)},
    my_min = function(y, threshold) {min(y)}
  ),
  na.rm = TRUE, cpus = 1
)
# }