A tidy reimplementation of the example simulations in mgcv::gamSim(),
producing data that can be used to fit GAMs. A new feature is that the
sampling distribution can be applied to all the example types.
data_sim(
  model = "eg1",
  n = 400,
  scale = NULL,
  theta = 3,
  power = 1.5,
  dist = c("normal", "poisson", "binary", "negbin", "tweedie", "gamma", "ocat",
    "ordered categorical"),
  n_cat = 4,
  cuts = c(-1, 0, 5),
  seed = NULL,
  gfam_families = c("binary", "tweedie", "normal")
)
model: character; either "egX" where X is an integer 1:7, or
the name of a model. See Details for possible options.
n: numeric; the number of observations to simulate.
scale: numeric; the level of noise to use.
theta: numeric; the dispersion parameter \(\theta\) to use. The default is entirely arbitrary, chosen only to provide simulated data that exhibits extra dispersion beyond that assumed under a Poisson.
power: numeric; the Tweedie power parameter.
dist: character; a sampling distribution for the response
variable. "ordered categorical" is a synonym of "ocat".
n_cat: integer; the number of categories for categorical response.
Currently only used for dist %in% c("ocat", "ordered categorical").
cuts: numeric; vector of cut points on the latent variable, excluding
the end points -Inf and Inf. Must be one fewer than the number of
categories: length(cuts) == n_cat - 1.
seed: numeric; the seed for the random number generator. Passed to
base::set.seed().
gfam_families: character; a vector of distributions to use in
generating data with grouped families for use with family = gfam(). The
allowed distributions are as per dist.
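The relationship between cuts and n_cat can be illustrated with a small base R sketch. The helper latent_to_ocat() below is hypothetical and is not part of the package; it only shows how n_cat - 1 interior cut points divide a latent variable into n_cat ordered categories, and the handling of values that fall exactly on a cut point is an assumption.

```r
# Hypothetical helper (not the package's implementation): map a latent
# variable to ordered categories using interior cut points.
latent_to_ocat <- function(latent, cuts, n_cat = length(cuts) + 1L) {
  # one fewer cut point than categories, and cut points must be increasing
  stopifnot(length(cuts) == n_cat - 1L, !is.unsorted(cuts))
  # findInterval() returns 0 below the first cut point, so add 1 to
  # obtain category labels 1, ..., n_cat
  findInterval(latent, cuts) + 1L
}

latent <- c(-2, -0.5, 0.3, 4.9, 7)
y_cat <- latent_to_ocat(latent, cuts = c(-1, 0, 5), n_cat = 4)
y_cat
```

With the default cuts = c(-1, 0, 5), latent values below -1 fall in category 1 and values above 5 fall in category 4.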
data_sim() can simulate data from several underlying models of
known true functions. The available options currently are:
"eg1": a four term additive true model. This is the classic Gu & Wahba
four univariate term test model. See gw_functions for more details of
the underlying four functions.
"eg2": a bivariate smooth true model.
"eg3": an example containing a continuous by smooth (varying
coefficient) true model. The model is \(\hat{y}_i = f_2(x_{1i})x_{2i}\) where the function \(f_2()\) is
\(f_2(x) = 0.2 x^{11} (10 (1 - x))^6 + 10 (10 x)^3 (1 - x)^{10}\).
"eg4": a factor by smooth true model. The true model contains a factor
with 3 levels, where the response for the nth level follows the nth
Gu & Wahba function (for \(n \in \{1, 2, 3\}\)).
"eg5": an additive plus factor true model. The response is a linear
combination of the Gu & Wahba functions 2, 3, 4 (the latter is a null
function) plus a factor term with four levels.
"eg6": an additive plus random effect term true model.
"eg7": a version of the model in "eg1", but where the covariates are
correlated.
"gwf2": a model where the response is Gu & Wahba's
\(f_2(x_i)\) plus noise.
"lwf6": a model where the response is Luo & Wahba's "example 6"
function \(\sin(2(4x-2)) + 2 \exp(-256(x-0.5)^2)\) plus noise.
"gfam": simulates data for use with GAMs with
family = gfam(families). See example in mgcv::gfam(). If this model
is specified then dist is ignored and gfam_families is used to
specify which distributions are included in the simulated data; it can be a
vector of any of the families allowed by dist. For
"ocat" %in% gfam_families (or "ordered categorical"), 4 classes are
assumed, which can't be changed. Link functions used are "identity"
for "normal", "logit" for "binary", "ocat", and
"ordered categorical", and "log" (inverse link exp) elsewhere.
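The two true functions written out explicitly in the list above can be transcribed into base R as a rough sketch. The function names gw_f2_sketch() and lw_f6_sketch() are hypothetical (see gw_functions for the package's own versions), and the uniform covariates and Gaussian noise level used for the "eg3"-style response are assumptions made only for illustration.

```r
# Gu & Wahba's f_2, transcribed from the formula in the "eg3" entry
gw_f2_sketch <- function(x) {
  0.2 * x^11 * (10 * (1 - x))^6 + 10 * (10 * x)^3 * (1 - x)^10
}

# Luo & Wahba's "example 6" function, transcribed from the "lwf6" entry
lw_f6_sketch <- function(x) {
  sin(2 * (4 * x - 2)) + 2 * exp(-256 * (x - 0.5)^2)
}

# An "eg3"-style varying coefficient response: f_2(x1) * x2 plus noise.
# Uniform covariates and sd = 1 are assumptions, not the package defaults.
set.seed(1)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
mu <- gw_f2_sketch(x1) * x2
y <- mu + rnorm(n, sd = 1)

head(data.frame(y = y, x1 = x1, x2 = x2, mu = mu))
```

Plotting gw_f2_sketch() and lw_f6_sketch() over seq(0, 1, length.out = 200) is a quick way to see the shapes a fitted smooth is expected to recover.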
The random component providing noise or sampling variation can follow one
of the following distributions, specified via argument dist:
"normal": Gaussian,
"poisson": Poisson,
"binary": Bernoulli,
"negbin": negative binomial,
"tweedie": Tweedie,
"gamma": gamma, and
"ocat" or "ordered categorical": ordered categorical.
Other arguments provide the parameters for the distribution.
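As a rough illustration of how one true function can be paired with different sampling distributions, the base R sketch below draws responses around a single hypothetical predictor eta. It is not the package's code: the predictor, the noise levels, and the mapping of theta to the size argument of rnbinom() are assumptions. The Tweedie and ordered categorical cases are omitted because base R has no Tweedie sampler.

```r
set.seed(123)
n <- 200
x <- runif(n)
eta <- sin(2 * pi * x)  # hypothetical predictor on the link scale

# Gaussian noise with an identity link; sd = 0.5 is an arbitrary choice
y_normal <- rnorm(n, mean = eta, sd = 0.5)

# Poisson counts with mean exp(eta)
y_poisson <- rpois(n, lambda = exp(eta))

# Bernoulli outcomes with success probability plogis(eta)
y_binary <- rbinom(n, size = 1, prob = plogis(eta))

# Negative binomial counts; size is assumed to play the role of theta,
# giving variance mu + mu^2 / theta
y_negbin <- rnbinom(n, mu = exp(eta), size = 3)

# Gamma responses with mean exp(eta): shape = 1 / phi, scale = mu * phi
phi <- 0.5
y_gamma <- rgamma(n, shape = 1 / phi, scale = exp(eta) * phi)

summary(data.frame(y_normal, y_poisson, y_binary, y_negbin, y_gamma))
```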
Gu, C., Wahba, G., (1993). Smoothing Spline ANOVA with Component-Wise Bayesian "Confidence Intervals." J. Comput. Graph. Stat. 2, 97–117.
Luo, Z., Wahba, G., (1997). Hybrid adaptive splines. J. Am. Stat. Assoc. 92, 107–116.
# load the package that provides data_sim()
library("gratia")

# simulate 100 observations from the four term Gu & Wahba model ("eg1")
data_sim("eg1", n = 100, seed = 1)

# an ordered categorical response
data_sim("eg1", n = 100, dist = "ocat", n_cat = 4, cuts = c(-1, 0, 5))
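A further sketch of how the simulated data might be used to fit a GAM. It assumes the gratia and mgcv packages are installed, and that the "eg1" data contain a response y and covariates x0 to x3; the code is guarded so that it does nothing if either package is unavailable.

```r
if (requireNamespace("gratia", quietly = TRUE) &&
    requireNamespace("mgcv", quietly = TRUE)) {
  library("mgcv")

  # the same seed should reproduce the same simulated data
  sim <- gratia::data_sim("eg1", n = 200, dist = "normal", seed = 42)
  sim_again <- gratia::data_sim("eg1", n = 200, dist = "normal", seed = 42)
  print(isTRUE(all.equal(sim, sim_again)))

  # fit an additive model with one smooth per covariate
  # (column names y, x0, ..., x3 are assumed here)
  m <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = sim, method = "REML")
  print(summary(m))
}
```

Because the true functions are known, the fitted smooths can be compared against them to check how well a given model specification recovers the truth.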