This function generates synthetic data (possibly contaminated by outliers) for the basic unit-level SAE model.
makedata(seed = 1024, intercept = 1, beta = 1, n = 4, g = 20, areaID = NULL,
ve = 1, ve.contam = 41, ve.epsilon = 0, vu = 1, vu.contam = 41,
vu.epsilon = 0)
An instance of the class saemodel.
seed: [integer] seed value used in set.seed (default: seed = 1024).
intercept: [numeric] or [NULL] value of the intercept of the fixed-effects model, or NULL for a model without intercept (default: intercept = 1).
beta: [numeric vector] values of the fixed-effect coefficients (without intercept; default: beta = 1). For each given coefficient, a regressor vector of realizations is drawn from the standard normal distribution.
n: [integer] number of units per area in balanced-data setups (default: n = 4).
g: [integer] number of areas (default: g = 20).
areaID: [integer vector] or [NULL]. To generate synthetic unbalanced data, call makedata with a vector whose elements are area identifiers. This vector should contain a series of (integer-valued) area IDs. The number of areas is set equal to the number of unique IDs.
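As a small illustration of the unbalanced case, the following base-R sketch builds an area-identifier vector of the kind described above. The area sizes (2, 5, and 3 units) are hypothetical and chosen only for the example.

```r
# Hypothetical unbalanced design: 3 areas with 2, 5, and 3 units.
area_sizes <- c(2L, 5L, 3L)
areaID <- rep(seq_along(area_sizes), times = area_sizes)

# The number of areas equals the number of unique IDs; the per-area
# sample sizes follow from tabulating the vector.
g <- length(unique(areaID))
n_i <- as.vector(table(areaID))

print(areaID)
print(g)
print(n_i)
```

A vector like this would then be supplied through the areaID argument, e.g. makedata(areaID = areaID).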
ve: [numeric] nonnegative value of the model (residual) variance.
ve.contam: [numeric] nonnegative value of the model variance of the outlier part in a mixture distribution (Tukey-Huber-type contamination model) \(e = (1-h)N(0, ve) + hN(0, ve.contam)\).
ve.epsilon: [numeric] value in \([0,1]\) that defines the relative number of outliers (i.e., epsilon or h in the contamination mixture distribution). Typically, it takes values between 0 and 0.5, but it is not restricted to this subinterval.
vu: [numeric] value of the (area-level) random-effect variance.
vu.contam: [numeric] nonnegative value of the (area-level) random-effect variance of the outlier part in the contamination mixture distribution.
vu.epsilon: [numeric] value in \([0,1]\) that defines the relative number of outliers in the contamination mixture distribution of the (area-level) random effects.
Let \(y_i\) denote the area-specific \(n_i\)-vector of the response variable for area \(i = 1, \ldots, g\). Define an \((n_i \times p)\)-matrix \(X_i\) of realizations from the standard normal distribution, \(N(0,1)\), and let \(\beta\) denote a \(p\)-vector of regression coefficients. The \(y_i\) are then drawn from the law \(y_i \sim N(X_i\beta, v_e I_i + v_u J_i)\), where \(v_e\) and \(v_u\) are the variances of the model error and the random effect, respectively, and \(I_i\) and \(J_i\) denote the \((n_i \times n_i)\) identity matrix and matrix of ones, respectively. Equivalently, the \(j\)th unit in area \(i\) satisfies \(y_{i,j} = x_{i,j}^T\beta + u_i + e_{i,j}\) with independent \(u_i \sim N(0, v_u)\) and \(e_{i,j} \sim N(0, v_e)\).
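The uncontaminated law can be sketched in base R as follows. This is an illustration of the model under stated assumptions, not the internals of makedata: the variable names are hypothetical, a single regressor (p = 1) plus an intercept is assumed, and the random-number stream will not reproduce the output of makedata even with the same seed.

```r
set.seed(1024)

g <- 20         # number of areas
n <- 4          # units per area (balanced design)
intercept <- 1  # fixed-effects intercept
beta <- 1       # single slope coefficient (p = 1)
ve <- 1         # model-error variance v_e
vu <- 1         # random-effect variance v_u

area <- rep(seq_len(g), each = n)  # area index of every unit
x <- rnorm(g * n)                  # regressor drawn from N(0, 1)
u <- rnorm(g, sd = sqrt(vu))       # one random effect per area
e <- rnorm(g * n, sd = sqrt(ve))   # one model error per unit

# Unit-level representation: y_ij = intercept + beta * x_ij + u_i + e_ij
y <- intercept + beta * x + u[area] + e

# Implied within-area covariance matrix: v_e * I_i + v_u * J_i
V_i <- ve * diag(n) + vu * matrix(1, n, n)
print(V_i)
```

Because every unit in an area shares the same \(u_i\), the off-diagonal entries of the covariance matrix equal \(v_u\), while the diagonal entries equal \(v_e + v_u\).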
In addition, the distributions of the model error (residual) \(e_{i,j}\) and the area-level random effect \(u_i\) may be contaminated (cf. Stahel and Welsh, 1997). In that case, the laws of \(e_{i,j}\) and \(u_i\) are replaced by the Tukey-Huber contamination mixtures:
\(e_{i,j} \sim (1-\epsilon^{ve})N(0,v_e) + \epsilon^{ve}N(0, v_e^{\epsilon})\)
\(u_{i} \sim (1-\epsilon^{vu})N(0,v_u) + \epsilon^{vu}N(0, v_u^{\epsilon})\)
where \(\epsilon^{ve}\) and \(\epsilon^{vu}\) regulate the degree of contamination; \(v_e^{\epsilon}\) and \(v_u^{\epsilon}\) define the variance of the contamination part of the mixture distribution.
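One way to draw from such a mixture is to flag each observation as an outlier with probability \(\epsilon\) and then pick the variance accordingly. The sketch below does this in base R; rcontam is a hypothetical helper written for this illustration, and whether makedata selects outliers in exactly this way is an assumption.

```r
# Draw n values from (1 - eps) * N(0, v) + eps * N(0, v_contam).
# rcontam is a hypothetical helper, not part of any package.
rcontam <- function(n, v, v_contam, eps) {
  stopifnot(eps >= 0, eps <= 1, v >= 0, v_contam >= 0)
  # Bernoulli indicator: TRUE if the draw comes from the outlier part
  is_outlier <- rbinom(n, size = 1, prob = eps) == 1
  # Standard deviation of each draw depends on its mixture component
  sd_each <- sqrt(ifelse(is_outlier, v_contam, v))
  list(value = rnorm(n, mean = 0, sd = sd_each), outlier = is_outlier)
}

set.seed(1024)
e <- rcontam(1e4, v = 1, v_contam = 41, eps = 0.1)

# Share of draws that came from the contamination part
print(mean(e$outlier))
# Sample variance; the mixture variance is (1 - 0.1) * 1 + 0.1 * 41 = 5
print(var(e$value))
```

Since both components have mean zero, the variance of the mixture is the weighted average of the two component variances, which is why even 10% contamination with variance 41 inflates the overall variance from 1 to 5.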
Four different contamination setups are possible:
no contamination (i.e., ve.epsilon = vu.epsilon = 0),
contaminated model error (i.e., ve.epsilon != 0 and vu.epsilon = 0),
contaminated random effect (i.e., ve.epsilon = 0 and vu.epsilon != 0),
both are contaminated (i.e., ve.epsilon != 0 and vu.epsilon != 0).
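The four setups above can be written down as argument lists for makedata. In the sketch below, the contamination level 0.1 is an arbitrary illustrative value, and it is assumed that makedata is provided by the rsae package; the calls are skipped when that package is not installed.

```r
# Illustrative contamination levels; 0.1 is an arbitrary choice.
setups <- list(
  none          = list(ve.epsilon = 0,   vu.epsilon = 0),
  model_error   = list(ve.epsilon = 0.1, vu.epsilon = 0),
  random_effect = list(ve.epsilon = 0,   vu.epsilon = 0.1),
  both          = list(ve.epsilon = 0.1, vu.epsilon = 0.1)
)

# Assumption: makedata() lives in the rsae package. All other arguments
# keep their defaults (e.g., ve.contam = 41 and vu.contam = 41).
if (requireNamespace("rsae", quietly = TRUE)) {
  models <- lapply(setups, function(args) do.call(rsae::makedata, args))
}

print(names(setups))
```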
Schoch, T. (2012). Robust Unit-Level Small Area Estimation: A Fast Algorithm for Large Datasets. Austrian Journal of Statistics 41, 243-265. https://doi.org/10.17713/ajs.v41i4.1548
Stahel, W. A. and A. Welsh (1997). Approaches to robust estimation in the simplest variance components model. Journal of Statistical Planning and Inference 57, 295-319. https://doi.org/10.1016/S0378-3758(96)00050-X
saemodel(), fitsaemodel()
# load the package that provides makedata() (assumed here to be rsae)
library(rsae)
# generate a model with synthetic data
model <- makedata()
model
# summary of the model
summary(model)