mfp2: Multivariable Fractional Polynomial Models with Extensions

Description

Selects the multivariable fractional polynomial (MFP) model that best predicts the outcome variable. It also has the ability to model a sigmoid relationship between x and an outcome variable y using the approximate cumulative distribution (ACD) transformation proposed by Royston (2014). This function provides two interfaces for input data: one for inputting data matrix x and outcome vector y directly and the other for using a formula object together with a dataframe data. Both interfaces are equivalent in terms of functionality.

Usage

mfp2(x, ...)
# S3 method for default
mfp2(
  x,
  y,
  weights = NULL,
  offset = NULL,
  cycles = 5,
  scale = NULL,
  shift = NULL,
  df = 4,
  center = TRUE,
  subset = NULL,
  family = c("gaussian", "poisson", "binomial", "cox"),
  criterion = c("pvalue", "aic", "bic"),
  select = 0.05,
  alpha = 0.05,
  keep = NULL,
  xorder = c("ascending", "descending", "original"),
  powers = NULL,
  ties = c("breslow", "efron", "exact"),
  strata = NULL,
  nocenter = NULL,
  acdx = NULL,
  ftest = FALSE,
  control = NULL,
  verbose = TRUE,
  ...
)
# S3 method for formula
mfp2(
  formula,
  data,
  weights = NULL,
  offset = NULL,
  cycles = 5,
  scale = NULL,
  shift = NULL,
  df = 4,
  center = TRUE,
  subset = NULL,
  family = c("gaussian", "poisson", "binomial", "cox"),
  criterion = c("pvalue", "aic", "bic"),
  select = 0.05,
  alpha = 0.05,
  keep = NULL,
  xorder = c("ascending", "descending", "original"),
  powers = NULL,
  ties = c("breslow", "efron", "exact"),
  strata = NULL,
  nocenter = NULL,
  ftest = FALSE,
  control = NULL,
  verbose = TRUE,
  ...
)

Value

mfp2() returns an object of class inheriting from glm or copxh, depending on the family parameter.

The function summary() (i.e. summary.mfp2()) can be used to obtain or print a summary of the results. The generic accessor function coef() can be used to extract the vector of coefficients from the fitted model object. The generic predict() can be used to obtain predictions from the fitted model object.

An object of class mfp2 is a list containing all entries as for glm

or coxph, and in addition the following entries:

convergence_mfp: logical value indicating convergence of mfp algorithm.
fp_terms: a data.frame with information on fractional polynomial terms.
transformations: a data.frame with information on shifting, scaling and centering for all variables.
fp_powers: a list with all powers of fractional polynomial terms. Each entry of the list is named according to the transformation of the variable.
acd: a vector with information for which variables the acd transformation was applied.
x_original: the scaled and shifted input matrix but without transformations.
y: the original outcome variable.
x: the final transformed input matrix used to fit the final model.
call_mfp: the call to the mfp2() function.
family_string: the family stored as character string.

The mfp2 object may contain further information depending on family.

Arguments

x: for mfp2.default: x is an input matrix of dimensions nobs x nvars. Each row is an observation vector.
...: not used.
y: for mfp2.default: y is a vector for the response variable. For family = "binomial" it should be a vector with two levels (see stats::glm()). For family = "cox" it must be a survival::Surv() object containing 2 columns.
weights: a vector of observation weights of length nobs. Default is NULL which assigns a weight of 1 to each observation.
offset: a vector of length nobs that is included in the linear predictor. Useful for the poisson family (e.g. log of exposure time). Default is NULL which assigns an offset of 0 to each observation. If supplied, then values must also be supplied to the predict() function.
cycles: an integer, specifying the maximum number of iteration cycles. Default is 5.
scale: a numeric vector of length nvars or single numeric specifying scaling factors. If a single numeric, then the value will be replicated as necessary. The formula interface mfp2.formula only supports single numeric input to set a default value, individual values can be set using fp terms in the formula input. Default is NULL which lets the program estimate the scaling factors (see Details section). If scaling is not required set scale = 1 to disable it.
shift: a numeric vector of length nvars or a single numeric specifying shift terms. If a single numeric, then the value will be replicated as necessary. The formula interface mfp2.formula only supports single numeric input to set a default value, individual values can be set using fp terms in the formula input. Default is NULL which lets the program estimate the shifts (see Details section). If shifting is not required, set shift = 0 to disable it.
df: a numeric vector of length nvars or a single numeric that sets the (default) degrees of freedom (df) for each predictor. If a single numeric, then the value will be replicated as necessary. The formula interface mfp2.formula only supports single numeric input to set a default value, individual values can be set using fp terms in the formula input. The df (not counting the intercept) are twice the degree of a fractional polynomial (FP). For example, an FP2 has 4 df, while FPm has 2*m df. The program overrides default df based on the number of distinct (unique) values for a variable as follows: 2-3 distinct values are assigned df = 1 (linear), 4-5 distinct values are assigned df = min(2, default) and >= 6 distinct values are assigned df = default.
center: a logical determining whether variables are centered before final model fitting. The default TRUE implies mean centering, except for binary covariates, where the covariate is centered using the lower of the two distinct values of the covariate. See Details section below.
subset: an optional vector specifying a subset of observations to be used in the fitting process. Default is NULL and all observations are used. See Details below.
family: a character string representing a glm() family object as well as Cox models. For more information, see details section below.
criterion: a character string specifying the criterion used to select variables and FP models of different degrees. Default is to use p-values in which case the user can specify the nominal significance level (or use default level of 0.05) for variable and functional form selection (see select and alpha parameters below). If the user specifies the BIC (bic) or AIC (aic) criteria the program ignores the nominal significance levels and selects variables and functional forms using the chosen information criterion.
select: a numeric vector of length nvars or a single numeric that sets the nominal significance levels for variable selection on each predictor by backward elimination. If a single numeric, then the value will be replicated as necessary. The formula interface mfp2.formula only supports single numeric input to set a default value, individual values can be set using fp terms in the formula input. The default nominal significance level is 0.05 for all variables. Setting the nominal significance level to be 1 for certain variables forces them into the model, leaving all other variables to be selected.
alpha: a numeric vector of length nvars or a single numeric that sets the significance levels for testing between FP models of different degrees. If a single numeric, then the value will be replicated as necessary. The formula interface mfp2.formula only supports single numeric input to set a default value, individual values can be set using fp terms in the formula input. The default nominal significance level is 0.05 for all variables.
keep: a character vector with names of variables to be kept in the model. In case that criterion = "pvalue", this is equivalent to setting the selection level for the variables in keep to 1. However, this option also keeps the specified variables in the model when using the BIC or AIC criteria.
xorder: a string determining the order of entry of the covariates into the model-selection algorithm. The default is ascending, which enters them by ascending p-values, or decreasing order of significance in a multiple regression (i.e. most significant first). descending places them in reverse significance order, whereas original respects the original order in x.
powers: a named list of numeric values that sets the permitted FP powers for each covariate. The default is NULL, and each covariate is assigned powers = c(-2, -1, -0.5, 0, 0.5, 1, 2, 3), where 0 means the natural logarithm. Powers are sorted before further processing in the program. If some variables are not assigned powers, the default powers will be assigned. The formula interface offers two options for supplying powers: through the 'powers' argument and the 'fp()' function. So, if the user supplies powers in both options for a certain variable, the powers supplied through 'fp()' will be given preference.For the algorithm to select the powers, each variable must have a minimum of two powers. If the users wants to use one power, they should first transform their variables before using mfp2() function and specify appropriate df
ties: a character string specifying the method for tie handling in Cox regression. If there are no tied death times all the methods are equivalent. Default is the Breslow method. This argument is used for Cox models only and has no effect on other model families. See survival::coxph() for details.
strata: a numeric vector or matrix of variables that define strata to be used for stratification in a Cox model. A new factor, whose levels are all possible combinations of the variables supplied will be created. Default is NULL and a Cox model without stratification would be fitted. See survival::coxph() for details.
nocenter: a numeric vector with a list of values for fitting Cox models. See survival::coxph() for details.
acdx: a numeric vector of names of continuous variables to undergo the approximate cumulative distribution (ACD) transformation. It also invokes the function-selection procedure to determine the best-fitting FP1(p1, p2) model (see Details section). Not present in the formula interface mfp2.formula and to be set using fp terms in the formula input. The variable representing the ACD transformation of x is named A_x.
ftest: a logical; for normal error models with small samples, critical points from the F-distribution can be used instead of Chi-Square distribution. Default FALSE uses the latter. This argument is used for Gaussian models only and has no effect for other model families.
control: a list object with parameters controlling model fit details. Returned by either stats::glm.control() or survival::coxph.control(). Default is NULL to use default parameters for the given model class.
verbose: a logical; run in verbose mode.
formula: for mfp2.formula: an object of class formula: a symbolic description of the model to be fitted. Special fp terms can be used to define fp-transformations. The details of model specification are given under ‘Details’.
data: for mfp2.formula: a data.frame which contains all variables specified in formula.

Methods (by class)

mfp2(default): Default method using input matrix x and outcome vector y.
mfp2(formula): Provides formula interface for mfp2.

Brief summary of FPs

In the following we denote fractional polynomials for a variable $x$ by increasing complexity as either FP1 or FP2. In this example, $FP2(p1, p2)$ for $p1\neq p2$ is the most flexible FP transformation, where $$FP2(p1, p2) = \beta_1 x^{p1} + \beta_2 x^{p2}.$$ When $p1 = p2$ (repeated powers), the FP2 model is given by $$FP2(p1, p2) = \beta_1 x^{p1} + \beta_2 x^{p1}log(x).$$ The powers $p1$ and $p2$ are usually chosen from a predefined set of powers $S = {(-2, -1, -0.5, 0, 0.5, 1, 2, 3)}$ where the power of 0 indicates the natural logarithm. The best FP2 is then estimated by using a closed testing procedure that seeks the best combination from all 36 pairs of powers $(p1, p2)$. Functions that only involve a single power of the variable are denoted as FP1, i.e. $$FP1(p1) = \beta_1 x^{p1}.$$ For details see e.g. Sauerbrei et al (2006).

Details on <code>family</code> option

mfp2() supports the family object as used by stats::glm(). The built in families are specified via a character string. mfp2(..., family = "binomial") fits a logistic regression model, while mfp2(..., family = "gaussian") fits a linear regression (ordinary least squares) model.

For Cox models, the response should preferably be a Surv object, created by the survival::Surv() function, and the family = "cox". Only right-censored data are currently supported. To fit stratified Cox models, the strata option can be used, or alternatively strata terms can be included in the model formula when using the formula interface mfp2.formula.

Details on shifting, scaling, centering

Fractional polynomials are defined only for positive variables due to the use of logarithms and other powers. Thus, mfp2() estimates shifts for each variables to ensure positivity or assumes that the variables are already positive when computing fractional powers of the input variables in case that shifting is disabled manually.

If the values of the variables are too large or too small, it is important to conduct variable scaling to reduce the chances of numerical underflow or overflow which can lead to inaccuracies and difficulties in estimating the model. Scaling can be done automatically or by directly specifying the scaling values so that the magnitude of the x values are not too extreme. By default scaling factors are estimated by the program as follows.

After adjusting the location of $x$ so that its minimum value is positive, creating $x'$, automatic scaling will divide each value of $x'$ by $10^p$ where the exponent $p$ is given by $$p = sign(k) \times floor(|k|) \quad \text{where} \quad k = log_{10} (max(x')- min(x'))$$

The FP transformation of $x'$ is centered on the mean of the observed values of $x'$. For example, for the FP1 model $\beta_0 + \beta_1x^p$, the actual model fitted by the software would be $\beta'_0 + \beta'_1(x'^p-mean(x'^p))$. This approach ensures that the revised constant $\beta'_0$ or baseline hazard function in a Cox model retains a meaningful interpretation.

So in brief: shifting is required to make input values positive, scaling helps to bring the values to a reasonable range. Both operations are conducted before estimating the FP powers for an input variable. Centering, however, is done after estimating the FP functions for each variable. Centering before estimating the FP powers may result in different powers and should be avoided. Also see transform_vector_fp() for some more details.

Details on the <code>subset</code> argument

Note that subsetting occurs after data pre-processing (shifting and scaling), but before model selection and fitting. In detail, when the option subset is used and scale, shift or centering values are to be estimated, then mfp2() first estimates these parameters using the full dataset (no subsetting). It then conduct subsetting before proceeding to perform model selection and fitting on the specified subset of the data.

Therefore, subsetting in mfp2() is not equivalent to subsetting the data before passing it to mfp2(), and thus cannot be used to implement, for example, cross-validation or to remove NA. These tasks should be done by the caller beforehand. However, it does allow to use the same data pre-processing for different subsets of the data. An example use case is when separate models are to be estimated for women and men in the dataset, but a common data pre-processing should be applied. In this case the subset option can be used to restrict model selection to either women or men, but the data processing (e.g. shifting factors) will be shared between the two models.

Details on approximate cumulative distribution transformation

The approximate cumulative distribution (ACD) transformation (Royston 2014) converts each predictor, $x$, smoothly to an approximation, $acd(x)$, of its empirical cumulative distribution function. This is done by smoothing a probit transformation of the scaled ranks of $x$. $acd(x)$ could be used instead of $x$ as a covariate. This has the advantage of providing sigmoid curves, something that regular FP functions cannot achieve. Details of the precise definition and some possible uses of the ACD transformation in a univariate context are given by Royston (2014). Royston and Sauerbrei (2016) describes how one could go further and replace FP2 functions with a pair of FP1 functions, one in $x$ and the other in $acd(x)$.

This alternative class of four-parameter functions provides about the same flexibility as the standard FP2 family, but the ACD component offers the additional possibility of sigmoid functions. Royston (2014) discusses how the extended class of functions known as $FP1(p1, p2)$, namely $$FP1(p1, p2) = \beta_1 x^{p1} + \beta_2 acd(x)^{p2}$$ can be fitted optimally by seeking the best combination of all 64 pairs of powers (p1, p2). The optimisation is invoked by use of the acdx parameter. Royston (2014) also described simplification of the chosen function through model reduction by applying significance testing to six sub-families of functions,M1-M6, giving models M1 (most complex) through M6 (null):

M1: FP1(p1, p2) (no simplification)
M2: FP1(p1, .) (regular FP1 function of $x$)
M3: FP1(., p2) (regular FP1 function of $acd(x)$)
M4: FP1(1, .) (linear function of $x$)
M5: FP1(., 1) (linear function of $acd(x)$)
M6: Null ($x$ omitted entirely)

Selection among these six sub-functions is performed by a closed test procedure known as the function-selection pocedure FSPA. It maintains the family-wise type 1 error probability for selecting $x$ at the value determined by the select parameter. To obtain a 'final' model, a structured sequence of up to five tests is carried out, the first at the significance level specified by the select parameter, and the remainder at the significance level provided by the alpha option. The sequence of tests is as follows:

Test 1: Compare the deviances of models 6 and 1 on 4 d.f. If not significant then stop and omit $x$, otherwise continue to step 2.
Test 2: Compare the deviances of models 4 and 1 on 3 d.f. If not significant then accept model 4 and stop. Otherwise, continue to step 3.
Test 3: Compare the deviance of models 2 and 1 on 2 d.f. If not significant then accept model 2 and stop. Otherwise continue to step 4.
Test 4: Compare the deviance of models 3 and 1 on 2 d.f. If significant then model 1 cannot be simplified; accept model 1 and stop. Otherwise continue to step 5.
Test 5: Compare the deviances of models 5 and 3 on 1 d.f. If significant then model 3 cannot be simplified; accept model 3. Otherwise, accept model 5. End of procedure.

The result is the selection of one of the six models.

Details on model specification using a <code>formula</code>

mfp2 supports model specifications using two different interfaces: one which allows passing of the data matrix x and outcome vector y directly (as done in e.g. stats::glm.fit() or glmnet) and another which conforms to the formula interface used by many commonly used R modelling functions such as stats::glm() or survival::coxph().

Both interfaces are equivalent in terms of possible fitted models, only the details of specification differ. In the standard interface all details regarding FP-transformations are given as vectors. In the formula interface all details are specified using special fp() function. These support the specification of degrees of freedom (df), nominal significance level for variable selection (select), nominal significance level for functional form selection (alpha), shift values (shift), scale values (scale), centering (center) and the ACD-transformation (acd). Values specified through fp() function override the values specified as defaults and passed to the mfp2() function.

The formula may also contain strata terms to fit stratified Cox models, or an offset term to specify a model offset.

Note that for a formula using ., such as y ~ . the mfp2() function may not fit a linear model, but may perform variable and functional form selection using FP-transformations, depending on the default settings of df, select and alpha passed as arguments to mfp2(). For example, using y ~ . with default settings means that mfp2() will apply FP transformation with 4 df to all continuous variables and use alpha equal to 0.05 to select functional forms, along with the selection algorithm with a significance level of 0.05 for all variables.

Compatibility with <code>mfp</code> package

mfp2 is an extension of the mfp package and can be used to reproduce the results from a model fitted by mfp. Since both packages implement the MFP algorithm, they use functions with the same names (e.g fp()). Therefore, if you load both packages using a call to library, there will be namespace conflicts and only the functions from the package loaded last will work properly.

Convergence and Troubleshooting

Typically, mfp2 requires two to four cycles to achieve convergence. Lack of convergence involves oscillation between two or more models and is extremely rare. If the model does not converge, you can try changing the nominal significance levels for variable (select) or function selection (alpha).

References

Royston, P. and Sauerbrei, W., 2008. Multivariable Model - Building: A Pragmatic Approach to Regression Anaylsis based on Fractional Polynomials for Modelling Continuous Variables. John Wiley & Sons.

Sauerbrei, W., Meier-Hirmer, C., Benner, A. and Royston, P., 2006. Multivariable regression model building by using fractional polynomials: Description of SAS, STATA and R programs. Comput Stat Data Anal, 50(12): 3464-85.

Royston, P. 2014. A smooth covariate rank transformation for use in regression models with a sigmoid dose-response function. Stata Journal 14(2): 329-341.

Royston, P. and Sauerbrei, W., 2016. mfpa: Extension of mfp using the ACD covariate transformation for enhanced parametric multivariable modeling. The Stata Journal, 16(1), pp.72-87.

Sauerbrei, W. and Royston, P., 1999. Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. J Roy Stat Soc a Sta, 162:71-94.

Examples

Run this code


# Gaussian model
data("prostate")
x = as.matrix(prostate[,2:8])
y = as.numeric(prostate$lpsa)
# default interface
fit1 = mfp2(x, y, verbose = FALSE)
fit1$fp_terms
fracplot(fit1) # generate plots
summary(fit1)
# formula interface
fit1b = mfp2(lpsa ~ fp(age) + fp(svi, df = 1) + fp(pgg45) + fp(cavol) + fp(weight) +
fp(bph) + fp(cp), data = prostate)

# logistic regression model
data("pima")
xx <- as.matrix(pima[, 2:9])
yy <- as.vector(pima$y)
fit2 <- mfp2(xx, yy, family = "binomial", verbose = FALSE)
fit2$fp_terms

# Cox regression model
data("gbsg")
# create dummy variable for grade using ordinal coding
gbsg <- create_dummy_variables(gbsg, var_ordinal = "grade", drop_variables = TRUE)
xd <- as.matrix(gbsg[, -c(1, 6, 10, 11)])
yd <- survival::Surv(gbsg$rectime, gbsg$censrec)
# fit mfp and keep hormon in the model
fit3 <- mfp2(xd, yd, family = "cox", keep = "hormon", verbose = FALSE)
fit3$fp_terms

Run the code above in your browser using DataLab