datadist: Distribution Summaries for Predictor Variables

Description

For a given set of variables or a data frame, determines summaries of variables for effect and plotting ranges, values to adjust to, and overall ranges for Predict, plot.Predict, ggplot.Predict, summary.rms, survplot, and nomogram.rms.

If datadist is called before a model fit and the resulting object is pointed to with options(datadist="name"), the data characteristics will be stored with the fit by Design(), so that later predictions and summaries of the fit will not need to access the original data used in the fit. Alternatively, you can specify the values for each variable in the model when using these functions, or specify the values of some of them and let the functions look up the remainder (say, adjustment levels) from an object created by datadist. The best method is probably to run datadist once before any models are fitted, storing the distribution summaries for all potential variables.

Adjustment values are 0 for binary variables, the most frequent category (or optionally the first category level) for categorical (factor) variables, the middle level for ordered factor variables, and medians for continuous variables. See descriptions of q.display and q.effect for how display and effect ranges are chosen for continuous variables.
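The adjustment-value rules described above can be sketched in base R. This is only an illustration of the stated rules, not the code used by rms: the helper name adjust_to_sketch is hypothetical, and the choice of the middle level when an ordered factor has an even number of levels is an assumption.

```r
# Illustrative sketch only: NOT the rms implementation.
# adjust_to_sketch() is a hypothetical helper mirroring the documented rules.
adjust_to_sketch <- function(x, adjto.cat = c("mode", "first")) {
  adjto.cat <- match.arg(adjto.cat)
  x <- x[!is.na(x)]                        # drop missing values
  if (is.ordered(x)) {
    lev <- levels(x)
    # middle level; for an even number of levels the lower-middle
    # level is taken here (an assumption, not documented behavior)
    return(lev[ceiling(length(lev) / 2)])
  }
  if (is.factor(x)) {
    if (adjto.cat == "first") return(levels(x)[1])
    tab <- table(x)
    return(names(tab)[which.max(tab)])     # which.max() takes the first level on ties
  }
  if (is.numeric(x) && all(x %in% c(0, 1))) return(0)  # binary 0/1 variable
  median(x)                                # continuous variable
}

adjust_to_sketch(c(0, 1, 1, 0, 1))                    # binary
adjust_to_sketch(factor(c("a", "b", "b", "c")))       # modal category
adjust_to_sketch(factor(c("a", "b", "b", "c")), "first")
adjust_to_sketch(c(2.5, 4, 10))                       # median
```

In practice these values are looked up from the datadist object rather than recomputed; the sketch is only meant to make the rules concrete.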

Usage

datadist(..., data, q.display, q.effect=c(0.25, 0.75),
         adjto.cat=c('mode','first'), n.unique=10)
# S3 method for datadist
print(x, ...)
# options(datadist="dd")
# used by summary, plot, survplot, sometimes predict
# For dd substitute the name of the result of datadist

Value

a list of class "datadist" with the following components

limits: a \(7 \times k\) data frame, where \(k\) is the number of variables. The 7 rows correspond to the low value for estimating the effect of the variable, the value to adjust the variable to when examining other variables, the high value for effect, the low value for displaying the variable, the high value for displaying it, and the overall lowest and highest values.
values: a named list, with one vector of unique values for each numeric variable having no more than n.unique unique values

Arguments

...: a list of variable names, separated by commas, a single data frame, or a fit with Design information. The first element in this list may also be an object created by an earlier call to datadist; then the later variables are added to this datadist object. For a fit object, the variables named in the fit are retrieved from the active data frame or from the location pointed to by data=frame number or data="data frame name". For print, ... is ignored.
data: a data frame or a search position. If data is a search position, it is assumed that a data frame is attached in that position, and all its variables are used. If you specify both individual variables in ... and data, the two sets of variables are combined. Unless the first argument is a fit object, data must be an integer.
q.display: set of two quantiles for computing the range of continuous variables to use in displaying regression relationships. Defaults are \(q\) and \(1-q\), where \(q = 10/\max(n, 200)\) and \(n\) is the number of non-missing observations. Thus for \(n<200\), the .05 and .95 quantiles are used. For \(n\geq 200\), the \(10^{th}\) smallest and \(10^{th}\) largest values are used. If you specify q.display, those quantiles are used whether or not \(n<200\).
q.effect: set of two quantiles for computing the range of continuous variables to use in estimating regression effects. Defaults are c(.25,.75), which yields inter-quartile-range odds ratios, etc.
adjto.cat: default is "mode", indicating that the modal (most frequent) category for categorical (factor) variables is the adjust-to setting. Specify "first" to use the first level of factor variables as the adjustment values. In the case of many levels having the maximum frequency, the first such level is used for "mode".
n.unique: variables having n.unique or fewer unique values are considered to be discrete variables in that their unique values are stored in the values list. This will affect how functions such as nomogram determine whether variables are discrete or not.
x: result of datadist
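
The default rules for q.display and q.effect can be illustrated with a short base R sketch that computes the seven limits for one continuous variable. This is a sketch under stated assumptions, not the rms implementation: the helper continuous_limits_sketch and its element names are hypothetical, and R's default quantile() interpolation is used, so for \(n\geq 200\) the display limits only approximate the 10th smallest and 10th largest values.

```r
# Illustrative sketch only: NOT the rms implementation.
# continuous_limits_sketch() and the element names below are hypothetical.
continuous_limits_sketch <- function(x, q.display, q.effect = c(0.25, 0.75)) {
  x <- x[!is.na(x)]
  n <- length(x)                       # number of non-missing observations
  if (missing(q.display)) {
    q <- 10 / max(n, 200)              # documented default: q and 1 - q
    q.display <- c(q, 1 - q)
  }
  eff  <- quantile(x, q.effect,  names = FALSE)
  disp <- quantile(x, q.display, names = FALSE)
  c(low.effect  = eff[1],  adjust.to    = median(x), high.effect = eff[2],
    low.display = disp[1], high.display = disp[2],
    lowest      = min(x),  highest      = max(x))
}

x <- 1:101                             # n < 200, so q = 0.05
continuous_limits_sketch(x)
continuous_limits_sketch(x, q.display = c(0, 1), q.effect = c(0.1, 0.9))
```

The second call corresponds to the inter-decile effect range and full display range used in one of the examples in the Examples section.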

Author

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

Details

For categorical variables, the 7 limits are set to character strings (factors) which correspond to c(NA,adjto.level,NA,1,k,1,k), where k is the number of levels. For ordered variables with numeric levels, the limits are set to c(L,M,H,L,H,L,H), where L is the lowest level, M is the middle level, and H is the highest level.
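
The two layouts described above can be written out as a small base R sketch. The helper names factor_limits_sketch and ordered_limits_sketch are hypothetical, and this is not the rms code; in particular, the choice of the middle level for an even number of levels is an assumption.

```r
# Illustrative sketch only: NOT the rms implementation.
# Both helpers are hypothetical and follow the layouts described in Details.

# Categorical variable: level labels at positions c(NA, adjto.level, NA, 1, k, 1, k)
factor_limits_sketch <- function(f, adjto.cat = c("mode", "first")) {
  adjto.cat <- match.arg(adjto.cat)
  lev <- levels(f)
  k   <- length(lev)
  adj <- if (adjto.cat == "first") 1L else unname(which.max(table(f)))
  lev[c(NA, adj, NA, 1, k, 1, k)]      # NA index yields NA (no effect range)
}

# Ordered variable with numeric levels: c(L, M, H, L, H, L, H)
ordered_limits_sketch <- function(levs) {
  levs <- sort(unique(levs))
  L <- levs[1]
  H <- levs[length(levs)]
  M <- levs[ceiling(length(levs) / 2)] # middle level (assumed rule for even counts)
  c(L, M, H, L, H, L, H)
}

factor_limits_sketch(factor(c("a", "b", "b", "c")))
ordered_limits_sketch(c(1, 2, 3, 4, 5))
```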

Examples

if (FALSE) {
d <- datadist(data=1)         # use all variables in search pos. 1
d <- datadist(x1, x2, x3)
page(d)                       # if your options(pager) leaves up a pop-up
                              # window, this is a useful guide in analyses
d <- datadist(data=2)         # all variables in search pos. 2
d <- datadist(data=my.data.frame)
d <- datadist(my.data.frame)  # same as previous.  Run for all potential vars.
d <- datadist(x2, x3, data=my.data.frame)   # combine variables
d <- datadist(x2, x3, q.effect=c(.1,.9), q.display=c(0,1))
# uses inter-decile range odds ratios,
# total range of variables for regression function plots
d <- datadist(d, z)           # add a new variable to an existing datadist
options(datadist="d")         #often a good idea, to store info with fit
f <- ols(y ~ x1*x2*x3)


options(datadist=NULL)        #default at start of session
f <- ols(y ~ x1*x2)
d <- datadist(f)              #info not stored in `f'
d$limits["Adjust to","x1"] <- .5   #reset adjustment level to .5
options(datadist="d")


f <- lrm(y ~ x1*x2, data=mydata)
d <- datadist(f, data=mydata)
options(datadist="d")


# datadist not used: specify all values needed for obtaining predictions
f <- lrm(y ~ x1*x2)
summary(f, x1=c(200,500,800), x2=c(1,3,5))
plot(Predict(f, x1=200:800, x2=3))  # or ggplot()


# Change reference value to get a relative odds plot for a logistic model
d$limits$age[2] <- 30    # make 30 the reference value for age
# Could also do: d$limits["Adjust to","age"] <- 30
fit <- update(fit)   # make new reference value take effect
plot(Predict(fit, age, ref.zero=TRUE, fun=exp),
     ylab='Age=x:Age=30 Odds Ratio')   # or ggplot()
}
