dataRep: Representativeness of Observations in a Data Set

Description

These functions describe how well a given set of new observations (e.g., new subjects) was represented in a dataset used to develop a predictive model. The dataRep function forms a data frame containing all the unique combinations of variable values that occurred in the supplied data. Cross-classifications of values are created using exact values of variables, so for continuous numeric variables it is often necessary to round them to the nearest v, and possibly to curtail the values to some lower and upper limit before rounding. Here v is the rounding constant (the tol argument of roundN), so values are matched to within +/- v/2. dataRep also stores marginal distribution summaries for all the variables. For numeric variables, all 101 percentiles (0th through 100th) are stored, and for all variables the frequency distributions are also stored (frequencies are computed after any rounding and curtailment of numeric variables). The roundN function is provided for rounding and curtailing. A print method summarizes the calculations made by dataRep, and if long=TRUE it also prints all unique combinations of values and their frequencies in the original dataset.
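The rounding, curtailing, collapsing, and percentile steps described above can be sketched in base R. This is only an illustration under stated assumptions, not the Hmisc source: round_to is a hypothetical helper standing in for roundN, and the age and sex values are made up.

```r
# Hypothetical stand-in for roundN: curtail x to [clip[1], clip[2]] when
# clip is given, then round to the nearest multiple of tol.
round_to <- function(x, tol = 1, clip = NULL) {
  if (length(clip)) x <- pmin(pmax(x, clip[1]), clip[2])
  tol * round(x / tol)
}

# Made-up data for illustration only
age <- c(12.4, 11, 41, 43.9, 97)
sex <- c("f", "f", "f", "f", "m")

# Round age to the nearest 5 after curtailing it to [0, 90]
age_r <- round_to(age, tol = 5, clip = c(0, 90))

# Collapse to the unique combinations that actually occur, with frequencies
combos <- as.data.frame(table(age = age_r, sex = sex),
                        stringsAsFactors = FALSE)
combos <- combos[combos$Freq > 0, ]

# The 101 percentiles (0th, 1st, ..., 100th) of the unrounded values
pctl <- quantile(age, probs = seq(0, 1, by = 0.01))

print(combos)
```

With tol = 5 the first two ages both round to 10, so they collapse into a single age-sex combination with frequency 2, while 97 is curtailed to 90 before rounding.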

The predict method for dataRep takes a new data frame having variables named the same as the original ones (but whose factor levels are not necessarily in the same order) and examines the collapsed cross-classifications created by dataRep to find how many observations were similar to each of the new observations after any rounding or curtailment of limits is done. predict also does some calculations to describe how the variable values of the new observations "stack up" against the marginal distributions of the original data. For categorical variables, the percent of observations having a given variable with the value of the new observation (after rounding for variables that were passed through roundN in the formula given to dataRep) is computed. For numeric variables, the percentile of the original distribution in which the current value falls is computed. For this purpose the data are not rounded, because the 101 original percentiles were retained; linear interpolation is used to estimate percentiles for values between two tabulated percentiles. The lowest marginal frequency of matching values across all variables is also computed. For example, if an age-sex combination matches 10 subjects in the original dataset but the age value matches 100 ages (after rounding) and the sex value matches the sex code of 300 observations, the lowest marginal frequency is 100, which is a "best case" upper limit for multivariable matching. That is, matching on all variables cannot yield a frequency higher than this amount. A print method for the output of predict.dataRep prints all calculations done by predict by default. Calculations can be selectively suppressed.
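The matching count, the lowest marginal frequency, and the interpolated percentile can be illustrated with a small base R sketch. It does not call predict.dataRep and is not necessarily how that method is implemented; the data and all variable names here are invented for the illustration.

```r
# Made-up "original" data, already rounded: 8 subjects
orig <- data.frame(
  age = c(30, 30, 30, 40, 40, 50, 50, 50),
  sex = c("f", "m", "m", "f", "m", "f", "f", "m")
)
# One new observation to check
new <- data.frame(age = 30, sex = "f")

# Number of original subjects matching on all variables at once
n_joint <- sum(orig$age == new$age & orig$sex == new$sex)

# Number matching on each variable separately
n_age <- sum(orig$age == new$age)
n_sex <- sum(orig$sex == new$sex)

# Lowest marginal frequency: an upper bound on the joint count
lowest_marginal <- min(n_age, n_sex)

# Percentile of a new unrounded value, by linear interpolation between
# the 101 stored percentiles (assumes the stored percentiles are distinct)
x_orig <- seq(20, 70, by = 0.5)                  # made-up unrounded values
pctl   <- quantile(x_orig, probs = seq(0, 1, by = 0.01))
pct_new <- approx(x = pctl, y = 0:100, xout = 45.25, rule = 2)$y

cat("joint:", n_joint, " lowest marginal:", lowest_marginal,
    " percentile:", pct_new, "\n")
```

Here rule = 2 makes approx return the 0th or 100th percentile for new values outside the original range, which is one possible convention rather than documented behavior of predict.dataRep.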

Usage

dataRep(formula, data, subset, na.action)
roundN(x, tol=1, clip=NULL)
# S3 method for dataRep
print(x, long=FALSE, ...)
# S3 method for dataRep
predict(object, newdata, ...)
# S3 method for predict.dataRep
print(x, prdata=TRUE, prpct=TRUE, ...)

Value

dataRep returns a list of class "dataRep" containing the collapsed data frame and frequency counts along with marginal distribution information. predict returns an object of class "predict.dataRep" containing information determined by matching observations in newdata with the original (collapsed) data.

Arguments

formula: a formula with no left-hand-side. Continuous numeric variables in need of rounding should appear in the formula as e.g. roundN(x,5) to have a tolerance of e.g. +/- 2.5 in matching. Factor or character variables as well as numeric ones not passed through roundN are matched on exactly.
x: a numeric vector or an object created by dataRep
object: the object created by dataRep or predict.dataRep
data, subset, na.action: standard modeling arguments. Default na.action is na.delete, i.e., observations in the original dataset having any variables missing are deleted up front.
tol: rounding constant (tolerance is actually tol/2 as values are rounded to the nearest tol)
clip: a 2-vector specifying a lower and upper limit to curtail values of x before rounding
long: set to TRUE to see all unique combinations and their frequency counts
newdata: a data frame containing all the variables given to dataRep but not necessarily in the same order or having factor levels in the same order
prdata: set to FALSE to suppress printing newdata and the count of matching observations (plus the worst-case marginal frequency).
prpct: set to FALSE to not print percentiles and percents
...: unused

Side Effects

print.dataRep prints.

Author

Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com

Examples

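library(Hmisc)   # provides dataRep and roundN

set.seed(13)
num.symptoms <- sample(1:4, 1000, TRUE)
sex <- factor(sample(c('female','male'), 1000, TRUE))
x    <- runif(1000)
x[1] <- NA   # one missing value; removed by the default na.action (na.delete)
# Cross-classify the raw data with x rounded to the nearest 0.25
table(num.symptoms, sex, .25*round(x/.25))


# Collapse to unique combinations, rounding x to the nearest 0.25
d <- dataRep(~ num.symptoms + sex + roundN(x,.25))
print(d, long=TRUE)


# See how well three new observations are represented in the original data
predict(d, data.frame(num.symptoms=1:3, sex=c('male','male','female'),
                      x=c(.03,.5,1.5)))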
