simContinuous(simPopObj, additional = "netIncome",
  method = c("multinom", "lm", "poisson"), zeros = TRUE,
  breaks = NULL, lower = NULL, upper = NULL, equidist = TRUE,
  probs = NULL, gpd = TRUE, threshold = NULL, est = "moments",
  limit = NULL, censor = NULL, log = TRUE, const = NULL,
  alpha = 0.01, residuals = TRUE, keep = TRUE, maxit = 500,
  MaxNWts = 1500, tol = .Machine$double.eps^0.5, nr_cpus = NULL,
  eps = NULL, regModel = "basic", byHousehold = NULL,
  imputeMissings = FALSE, seed, verbose = FALSE, by = "strata")

simPopObj: a simPopObj holding household survey
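data, population data and optionally some margins.
additional: a character string specifying the additional continuous variable of dataS that should be simulated for the population data.
Currently, only one additional variable can be simulated at a time.
method: a character string specifying the simulation method. Accepted values are "multinom",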
for using multinomial log-linear models combined with random draws from the
resulting categories, "lm", for using (two-step) regression
models combined with random error terms and "poisson" for using Poisson regression for count variables.
zeros: a logical indicating whether the variable specified by additional is semi-continuous, i.e., contains a considerable amount
of zeros. If TRUE and method is "multinom", a separate
factor level for zeros in the response is used. If TRUE and
method is "lm", a two-step model is applied. The first step
thereby uses a log-linear or multinomial log-linear model (see
Details).
breaks: optional break points for categorizing the variable specified by additional. If NULL,
break points are computed using weighted quantiles.
lower, upper: if breaks is NULL, these can be used to specify
lower and upper bounds other than minimum and maximum, respectively. Note
that if method is "multinom" and gpd is TRUE
(see below), upper defaults to Inf.
equidist: logical; if method is "multinom" and
breaks is NULL, this indicates whether the (positive) default
break points should be equidistant or whether there should be refinements in
the lower and upper tail (see getBreaks).
probs: if method is
"multinom" and breaks is NULL, this gives probabilities
for quantiles to be used as (positive) break points. If supplied, this is
preferred over equidist.
gpd: logical; if method is "multinom", this indicates
whether the upper tail of the variable specified by additional should
be simulated by random draws from a (truncated) generalized Pareto
distribution rather than a uniform distribution.
threshold: if method is "multinom",
values for categories above threshold are drawn from a (truncated)
generalized Pareto distribution.
est: if method is "multinom", the
estimator to be used to fit the generalized Pareto distribution.
censor: an optional named list of lists or data.frames; if
multinomial models are computed, this can be used to account for structural
zeros. The names of the list components specify the categories that should
be censored. For each of these categories, a list or data.frame
containing levels of the predictor variables can be supplied. The
probability of the specified categories is set to 0 for the respective
predictor levels. Currently, this is only implemented for more than two
categories in the response.
log: logical; if method is "lm", this indicates whether
the linear model should be fitted to the logarithms of the variable
specified by additional. The predicted values are then
back-transformed with the exponential function. See Details for
more information.
const: if method is "lm" and log is
TRUE, this gives a constant to be added before log transformation.
alpha: if method is "lm", this gives trimming
parameters for the sample data. Trimming is thereby done with respect to the
variable specified by additional. If a numeric vector of length two
is supplied, the first element gives the trimming proportion for the lower
part and the second element the trimming proportion for the upper part. If a
single numeric is supplied, it is used for both. With NULL, trimming
is suppressed.
residuals: logical; if method is "lm", this indicates
whether the random error terms should be obtained by draws from the
residuals. If FALSE, they are drawn from a normal distribution
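(median and MAD of the residuals are used as parameters).
keep: a logical indicating whether the simulated categories should be stored
in the resulting population data. If TRUE, the corresponding column name is
given by additional with postfix "Cat".
tol: if method is "lm" and zeros is TRUE,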
a small positive numeric value or NULL. When fitting a log-linear
model within a stratum, factor levels may not exist in the sample but are
likely to exist in the population. However, the coefficient for such factor
levels will be 0. Therefore, coefficients smaller than tol in
absolute value are replaced by coefficients from an auxiliary model that is
fit to the whole sample. If NULL, no auxiliary log-linear model is
computed and no coefficients are replaced.
eps: a small positive numeric value, or NULL (the default). In
the former case and if (multinomial) log-linear models are computed,
estimated probabilities smaller than this are assumed to result from
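structural zeros and are set to exactly 0.
regModel: the variables or model used for the simulation. With "basic", only the basic household variables (generated with simStructure) are used.
byHousehold: if 'sum',
'mean' or 'random' is specified, the values are aggregated and each member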
of the household gets the same value (mean, sum or a random value) assigned.
verbose: if TRUE, additional output is written to the prompt.

Returns an object of class simPopObj containing survey
data as well as the simulated population data including the continuous
variable specified by additional and possibly simulated categories
for the desired continuous variable.
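
The alpha and residuals arguments described above can be illustrated with a small base R sketch. The helpers trim_keep and draw_errors are hypothetical and not part of the package; the sketch also assumes unweighted quantiles for the trimming bounds, which may differ from what the package does internally.

```r
# Hypothetical helper: flag the observations kept after trimming.
# Assumption: plain (unweighted) quantiles define the trimming bounds.
trim_keep <- function(x, alpha = 0.01) {
  if (is.null(alpha)) return(rep(TRUE, length(x)))  # NULL: no trimming
  alpha <- rep(alpha, length.out = 2)               # single value: both tails
  bounds <- quantile(x, probs = c(alpha[1], 1 - alpha[2]), names = FALSE)
  x >= bounds[1] & x <= bounds[2]
}

# Hypothetical helper: draw n random error terms for the linear model.
draw_errors <- function(res, n, residuals = TRUE) {
  if (residuals) {
    # draws from the observed residuals
    sample(res, n, replace = TRUE)
  } else {
    # normal distribution with median and MAD of the residuals as parameters
    rnorm(n, mean = median(res), sd = mad(res))
  }
}

x <- 1:100
sum(trim_keep(x, alpha = 0.1))        # 10% trimmed in both tails
sum(trim_keep(x, alpha = c(0, 0.1)))  # only the upper tail trimmed
```

Using median and MAD rather than mean and standard deviation keeps the normal draws robust against outlying residuals.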
If method is "lm", the behavior for two-step models is
described in the following.
If zeros is TRUE and log is not TRUE or the
variable specified by additional does not contain negative values, a
log-linear model is used to predict whether an observation is zero or not.
Then a linear model is used to predict the non-zero values.
If zeros is TRUE, log is TRUE and const
is specified, again a log-linear model is used to predict whether an
observation is zero or not. In the linear model to predict the non-zero
values, const is added to the variable specified by additional
before the logarithms are taken.
If zeros is TRUE, log is TRUE, const is
NULL and there are negative values, a multinomial log-linear model is
used to predict negative, zero and positive observations. Categories for the
negative values are thereby defined by breaks. In the second step, a
linear model is used to predict the positive values and negative values are
drawn from uniform distributions in the respective classes.
If zeros is FALSE, log is TRUE and const
is NULL, a two-step model is used if there are non-positive values in
the variable specified by additional. Whether a log-linear or a
multinomial log-linear model is used depends on the number of categories to
be used for the non-positive values, as defined by breaks. Again,
positive values are then predicted with a linear model and non-positive
values are drawn from uniform distributions.
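
The simplest two-step case (a semi-continuous variable with zeros and positive values, fitted on the log scale) can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are synthetic, the names sample_df and pop_df are hypothetical, and a binomial GLM stands in for the log-linear model of the first step. It is not the package's implementation.

```r
set.seed(1)

# Hypothetical sample: a semi-continuous income and one predictor.
n <- 500
sample_df <- data.frame(age = sample(20:70, n, replace = TRUE))
is_pos <- rbinom(n, 1, plogis(-1 + 0.05 * sample_df$age)) == 1
sample_df$income <- ifelse(is_pos,
                           exp(8 + 0.02 * sample_df$age + rnorm(n, sd = 0.5)),
                           0)
sample_df$nonzero <- as.integer(sample_df$income > 0)

# Hypothetical population for which income is to be simulated.
pop_df <- data.frame(age = sample(20:70, 2000, replace = TRUE))

# Step 1: predict whether an observation is zero or not
# (assumption: a binomial GLM as stand-in for the log-linear model).
fit_zero <- glm(nonzero ~ age, data = sample_df, family = binomial())
p_pos <- predict(fit_zero, newdata = pop_df, type = "response")
pop_pos <- runif(nrow(pop_df)) < p_pos

# Step 2: linear model for the logarithms of the non-zero values.
fit_lm <- lm(log(income) ~ age, data = sample_df[sample_df$nonzero == 1, ])

# Predict, add error terms drawn from the residuals, then back-transform.
pred <- predict(fit_lm, newdata = pop_df[pop_pos, , drop = FALSE])
err <- sample(residuals(fit_lm), length(pred), replace = TRUE)
pop_df$income <- 0
pop_df$income[pop_pos] <- exp(pred + err)

summary(pop_df$income)
```

The cases with negative values add a further category model in the first step and uniform draws within the classes defined by breaks; they are omitted here to keep the sketch short.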
The number of CPUs is selected automatically as follows: by default it equals the number of strata. However, if fewer CPUs are available than there are strata, the number of available CPUs minus one is used. This should be the best strategy, but the user can override it.
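
The selection rule above can be written down as a small function. The name choose_cpus is hypothetical, and the lower bound of one CPU and the fallback for an undetectable core count are assumptions added here so the sketch always returns a usable value.

```r
# Hypothetical helper mirroring the rule described above.
choose_cpus <- function(n_strata, n_available = parallel::detectCores()) {
  # detectCores() can return NA; assumption: fall back to a single CPU
  if (is.na(n_available)) return(1L)
  if (n_available < n_strata) {
    # fewer CPUs than strata: leave one CPU free, but use at least one
    as.integer(max(1L, n_available - 1L))
  } else {
    # otherwise one CPU per stratum
    as.integer(n_strata)
  }
}

choose_cpus(n_strata = 9, n_available = 4)
choose_cpus(n_strata = 3, n_available = 8)
```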
See also: simStructure, simCategorical,
simComponents, simEUSILC
library(simPop)
data(eusilcS)
## Not run:
## approx. 20 seconds computation time
inp <- specifyInput(data = eusilcS, hhid = "db030", hhsize = "hsize",
                    strata = "db040", weight = "db090")
simPop <- simStructure(data = inp, method = "direct",
                       basicHHvars = c("age", "rb090", "hsize", "pl030", "pb220a"))

# model formula used for the multinomial fit below
regModel <- ~ rb090 + hsize + pl030 + pb220a

# multinomial model with random draws
eusilcM <- simContinuous(simPop, additional = "netIncome",
                         regModel = regModel,
                         upper = 200000, equidist = FALSE, nr_cpus = 1)
class(eusilcM)
## End(Not run)
## Not run:
# two-step regression
eusilcT <- simContinuous(simPop, additional = "netIncome",
                         regModel = "basic",
                         method = "lm")
class(eusilcT)
## End(Not run)