This function is the workhorse of the MRFcov package, running separate penalized regressions for each node to estimate parameters of Markov Random Fields (MRF) graphs. Covariates can be included (a class of models known as Conditional Random Fields; CRF), to estimate how interactions between nodes vary across covariate magnitudes.
MRFcov(
data,
symmetrise,
prep_covariates,
n_nodes,
n_cores,
n_covariates,
family,
bootstrap = FALSE,
progress_bar = FALSE
)
A list containing:
graph: Estimated parameter matrix of pairwise interaction effects
intercepts: Estimated parameter vector of node intercepts
indirect_coefs: list containing matrices representing indirect effects of each covariate on pairwise node interactions
direct_coefs: matrix of direct effects of each parameter on each outcome node. For family = 'binomial' models, all coefficients are estimated on the logit scale.
param_names: Character string of covariate parameter names
mod_type: A character stating the type of model that was fit (used in other functions)
mod_family: A character stating the family of model that was fit (used in other functions)
poiss_sc_factors: A matrix of the estimated negative binomial or poisson parameters for each raw node variable (only returned if family = "poisson"). These are needed for converting coefficients back to their original distribution, and are used for prediction purposes only
data: A dataframe. The input data where the n_nodes left-most variables are variables that are to be represented by nodes in the graph
symmetrise: The method to use for symmetrising corresponding parameter estimates (which are taken from separate regressions). Options are min (take the coefficient with the smallest absolute value), max (take the coefficient with the largest absolute value) or mean (take the mean of the two coefficients). Default is mean
prep_covariates: Logical. If TRUE, covariate columns will be cross-multiplied with nodes to prep the dataset for MRF models. Note this is only useful when additional covariates are provided. Therefore, if n_nodes < NCOL(data), default is TRUE. Otherwise, default is FALSE. See prep_MRF_covariates for more information
n_nodes: Positive integer. The index of the last column in data which is represented by a node in the final graph. Columns with index greater than n_nodes are taken as covariates. Default is the number of columns in data, corresponding to no additional covariates
n_cores: Positive integer. The number of cores to spread the job across using makePSOCKcluster. Default is 1 (no parallelisation)
n_covariates: Positive integer. The number of covariates in data, before cross-multiplication. Default is NCOL(data) - n_nodes
family: The response type. Responses can be quantitative continuous (family = "gaussian"), non-negative counts (family = "poisson") or binomial 1s and 0s (family = "binomial"). If using family = "binomial", please note that if nodes occur in less than 5 percent of observations this can make it generally difficult to estimate occurrence probabilities (on the extreme end, this can result in intercept-only models being fitted for the nodes in question). The function will issue a warning in this case. If nodes occur in more than 95 percent of observations, this will return an error as the cross-validation step will generally be unable to proceed. For family = 'poisson' models, all returned coefficients are estimated on the identity scale AFTER using a nonparanormal transformation. See vignette("Gaussian_Poisson_CRFs") for details of interpretation
bootstrap: Logical. Used by bootstrap_MRF to reduce memory usage
progress_bar: Logical. Progress bar in pbapply is used if TRUE, but this slows estimation.
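Two of the argument behaviours described above, the prevalence thresholds for family = "binomial" and the cross-multiplication of covariates with nodes, can be sketched in base R. This is a minimal illustration on a hypothetical toy data set (toy, with columns sp1 to sp3 and temp). The cross_multiply helper and its column naming are assumptions made for this example and are not the internals of prep_MRF_covariates.

```r
# Hypothetical toy data: 3 binary nodes followed by 1 continuous covariate
set.seed(42)
toy <- data.frame(
  sp1  = rbinom(200, 1, 0.40),
  sp2  = rbinom(200, 1, 0.30),
  sp3  = rep(c(1, 0), times = c(4, 196)),  # a rare node: 2 percent prevalence
  temp = rnorm(200)
)
n_nodes <- 3
n_covariates <- NCOL(toy) - n_nodes  # same rule as the documented default

# Prevalence of each node; flag those outside the 5 to 95 percent range
prevalence <- colMeans(toy[, seq_len(n_nodes)])
too_rare   <- names(prevalence)[prevalence < 0.05]
too_common <- names(prevalence)[prevalence > 0.95]

# Cross-multiply every covariate with every node (illustrative helper only)
cross_multiply <- function(data, n_nodes) {
  nodes <- data[, seq_len(n_nodes), drop = FALSE]
  covs  <- data[, -seq_len(n_nodes), drop = FALSE]
  out <- data
  for (cov_name in names(covs)) {
    for (node_name in names(nodes)) {
      new_name <- paste(cov_name, node_name, sep = "_")
      out[[new_name]] <- covs[[cov_name]] * nodes[[node_name]]
    }
  }
  out
}

prepped <- cross_multiply(toy, n_nodes)
names(prepped)
```

Checking prevalence before fitting makes it easier to anticipate the warning or error described for binomial models, and inspecting the cross-multiplied columns shows why the number of parameters grows quickly with each added covariate.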
Separate penalized regressions are used to approximate MRF parameters, where the regression for node j includes an intercept and coefficients for the abundance (families gaussian or poisson) or presence-absence (family binomial) of all other nodes (/j) in data. If covariates are included, coefficients are also estimated for the effect of the covariate on j, and for the effects of the covariate on interactions between j and all other nodes (/j). Note that interaction coefficients must be estimated between variables that are on roughly the same scale, as the resulting parameter estimates are unified into a Markov Random Field using the specified symmetrise function.
Counts for poisson variables, which are often not on the same scale, will therefore be normalised with a nonparanormal transformation x = qnorm(rank(log2(x + 0.01)) / (length(x) + 1)). These transformed counts will be used in a (family = "gaussian") model and their respective raw distribution parameters returned so that coefficients can be back-transformed for interpretation (this back-transformation is performed automatically by other functions including predict_MRF and cv_MRF_diag). Gaussian variables are not automatically transformed, so if they cover quite different ranges and scales, then it is recommended to scale them prior to fitting models. For more information on this process, see vignette("Gaussian_Poisson_CRFs").
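The nonparanormal transformation quoted above can be tried directly in base R. The sketch below wraps the formula in a helper (the name nonparanormal is chosen here for illustration) and applies it to two simulated count variables, counts_a and counts_b, which are hypothetical nodes with very different abundances.

```r
# Nonparanormal transformation, using the formula given in the text
nonparanormal <- function(x) {
  qnorm(rank(log2(x + 0.01)) / (length(x) + 1))
}

set.seed(1)
counts_a <- rpois(100, lambda = 2)    # hypothetical low-abundance node
counts_b <- rpois(100, lambda = 200)  # hypothetical high-abundance node

z_a <- nonparanormal(counts_a)
z_b <- nonparanormal(counts_b)

# Raw counts differ by about two orders of magnitude ...
range(counts_a)
range(counts_b)

# ... while the transformed values share a comparable scale
range(z_a)
range(z_b)
```

Because the transformation depends only on ranks, the ordering of observations within each variable is preserved while differences in raw scale are removed.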
Note that since the number of parameters to estimate in each node-wise regression quickly increases with increasing numbers of nodes and covariates, LASSO penalization is used to regularize regressions. This is done by minimising the cross-validated mean error for each node separately using cv.glmnet. In this way, we maximise the log-likelihood of each node separately before unifying the nodes into a graph.
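The two-stage idea described above (node-wise LASSO regressions, then unifying the estimates into a symmetric graph) can be sketched as follows. The symmetrise_coefs helper, the raw matrix and the simulated nodes sp_a to sp_d are all hypothetical and written for this illustration; they are not the package's internal code. The regression step assumes the glmnet package is installed and is skipped otherwise.

```r
# Symmetrise a square matrix of node-wise coefficients (illustrative helper).
# coefs[j, k] holds the coefficient of node k in the regression for node j.
symmetrise_coefs <- function(coefs, method = c("mean", "min", "max")) {
  method <- match.arg(method)
  a <- coefs
  b <- t(coefs)
  out <- switch(method,
    mean = (a + b) / 2,
    min  = ifelse(abs(a) <= abs(b), a, b),
    max  = ifelse(abs(a) >= abs(b), a, b)
  )
  # Mirror the upper triangle so exact ties of opposite sign stay symmetric
  out[lower.tri(out)] <- t(out)[lower.tri(out)]
  out
}

# Hypothetical asymmetric estimates from three separate regressions
raw <- matrix(c( 0.0, 0.4, -0.2,
                 0.8, 0.0,  0.0,
                -0.6, 0.3,  0.0), nrow = 3, byrow = TRUE)
sym_mean <- symmetrise_coefs(raw, "mean")
sym_min  <- symmetrise_coefs(raw, "min")
sym_max  <- symmetrise_coefs(raw, "max")

# Node-wise LASSO regressions on simulated binary nodes (requires glmnet)
if (requireNamespace("glmnet", quietly = TRUE)) {
  suppressPackageStartupMessages(library(glmnet))
  set.seed(10)
  n <- 300
  sp_a <- rbinom(n, 1, 0.5)
  sp_b <- rbinom(n, 1, plogis(-1 + 2 * sp_a))  # sp_b depends on sp_a
  sp_c <- rbinom(n, 1, 0.4)
  sp_d <- rbinom(n, 1, 0.6)
  nodes <- cbind(sp_a = sp_a, sp_b = sp_b, sp_c = sp_c, sp_d = sp_d)

  node_coefs <- matrix(0, ncol(nodes), ncol(nodes),
                       dimnames = list(colnames(nodes), colnames(nodes)))
  for (j in seq_len(ncol(nodes))) {
    # Regress node j on all other nodes, choosing lambda by cross-validation
    fit <- cv.glmnet(x = nodes[, -j, drop = FALSE], y = nodes[, j],
                     family = "binomial")
    est <- as.matrix(coef(fit, s = "lambda.min"))  # intercept plus slopes
    node_coefs[j, rownames(est)[-1]] <- est[-1, 1]
  }
  toy_graph <- symmetrise_coefs(node_coefs, method = "mean")
}
```

Each row of node_coefs comes from its own regression, so the matrix is generally asymmetric until the chosen symmetrise rule is applied.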
Ising, E. (1925). Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik A Hadrons and Nuclei, 31, 253-258.
Cheng, J., Levina, E., Wang, P. & Zhu, J. (2014). A sparse Ising model with covariates. Biometrics, 70, 943-953.
Clark, N.J., Wells, K. & Lindberg, O. (2018). Unravelling changing interspecific interactions across environmental gradients using Markov random fields. Ecology. doi: 10.1002/ecy.2221
Sutton, C. & McCallum, A. (2012). An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4, 267-373.
See Cheng et al. (2014), Sutton & McCallum (2012) and Clark et al. (2018) for overviews of Conditional Random Fields. See cv.glmnet for details of cross-validated optimization using LASSO penalty. Worked examples to showcase this function can be found using vignette("Bird_Parasite_CRF") and vignette("Gaussian_Poisson_CRFs").
data("Bird.parasites")
CRFmod <- MRFcov(data = Bird.parasites, n_nodes = 4, family = 'binomial')