hclust()
. A small function
naclus
is also provided which depicts similarities in which
observations are missing for variables in a data frame. The
similarity measure is the fraction of NAs
in common between any two
variables. The diagonals of this sim
matrix are the fraction of NAs
in each variable by itself. naclus
also computes na.per.obs
, the
number of missing variables in each observation, and mean.na
, a
vector whose ith element is the mean number of missing variables other
than variable i, for observations in which variable i is missing. The
naplot
function makes several plots (see the which
argument).

To avoid generating too many dummy variables for multi-valued
character or categorical predictors, varclus
will automatically
combine infrequent cells of such variables using an auxiliary
function combine.levels
that is defined here. If all values of
x
are NA
, combine.levels
returns a numeric vector that is all NA.
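The pooling idea can be sketched in base R. This is a simplified illustration, not the combine.levels source: the function name pool_rare_levels is hypothetical, and the sketch only covers pooling every level whose relative frequency falls below minlev into a single level labeled "OTHER".

```r
# Hypothetical helper (not part of any package): pool every level whose
# relative frequency is below minlev into one level called "OTHER".
pool_rare_levels <- function(x, minlev = 0.05) {
  f <- factor(x)
  prop <- table(f) / sum(!is.na(f))      # relative frequency of each level
  rare <- names(prop)[prop < minlev]     # levels considered infrequent
  if (length(rare) == 0) return(f)       # nothing to pool
  lev <- levels(f)
  lev[lev %in% rare] <- "OTHER"          # duplicated level names get merged
  levels(f) <- lev
  f
}

x <- c("a", "b", rep("c", 30), rep("d", 30))
pooled <- pool_rare_levels(x)
table(pooled)
```

Assigning duplicated names to levels() is what merges the rare cells, so no observation is dropped or recoded by hand.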
plotMultSim
plots multiple similarity matrices, with the similarity
measure being on the x-axis of each subplot.
na.pattern
prints a frequency table of all combinations of
missingness for multiple variables. If there are 3 variables, a
frequency table entry labeled 110
corresponds to the number of
observations for which the first and second variables were missing but
the third variable was not missing.
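The labeling scheme can be mimicked in base R as follows. This is only a sketch of the encoding described above (1 for missing, 0 for present, one digit per variable), using a made-up data frame df3; it is not the na.pattern source.

```r
# Made-up data with three variables and some missing values
df3 <- data.frame(a = c(NA, NA, 1, 2),
                  b = c(NA, NA, NA, 2),
                  c = c(1, 2, 3, NA))

# One 0/1 digit per variable, pasted together row by row
digits <- unname(lapply(df3, function(v) as.integer(is.na(v))))
pat <- do.call(paste0, digits)

# Frequency of each missingness pattern
freq <- table(pat)
freq
```

Here the entry labeled 110 counts the two rows in which a and b are missing but c is observed.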
varclus(x, similarity=c("spearman","pearson","hoeffding","bothpos","ccbothpos"), type=c("data.matrix","similarity.matrix"), method="complete", data=NULL, subset=NULL, na.action=na.retain, trans=c("square", "abs", "none"), ...)
"print"(x, abbrev=FALSE, ...)
"plot"(x, ylab, abbrev=FALSE, legend.=FALSE, loc, maxlen, labels, ...)
naclus(df, method)
naplot(obj, which=c('all','na per var','na per obs','mean na', 'na per var vs mean na'), ...)
combine.levels(x, minlev=.05)
plotMultSim(s, x=1:dim(s)[3], slim=range(pretty(c(0,max(s,na.rm=TRUE)))), slimds=FALSE, add=FALSE, lty=par('lty'), col=par('col'), lwd=par('lwd'), vname=NULL, h=.5, w=.75, u=.05, labelx=TRUE, xspace=.35)
na.pattern(x)
If x
is
a formula, model.matrix
is used to convert it to a design matrix.
If the formula excludes an intercept (e.g., ~ a + b -1
),
the first categorical (factor
) variable in the formula will have
dummy variables generated for all levels instead of omitting one for
the first level. For combine.levels
, x
is a character, category,
or factor vector (or other vector that is converted to factor). For
plot
and print
, x
is an object created by
varclus
. For na.pattern
, x
is a list, data frame,
or numeric matrix. For plotMultSim
, x is a numeric vector specifying the ordered
unique values on the x-axis, corresponding to the third dimension of
s
.
varclus
, for example. A
use for this might be to show pairwise similarities of variables
across time in a longitudinal study (see the example below). If
vname
is not given, s
must have dimnames
.
similarity="bothpos"
uses as
a similarity measure the proportion of observations for which two
variables are both positive. similarity="ccbothpos"
uses a
chance-corrected measure which is the proportion of observations for
which both variables are positive minus the product of the two
marginal proportions. This difference is expected to be zero under
independence. For diagonals, "ccbothpos"
still uses the proportion
of positives for the single variable. So "ccbothpos"
is not really
a similarity measure, and clustering is not done. This measure is
useful for plotting with plotMultSim
(see the last example).
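The two measures can be written out in a few lines of base R. This is a sketch of the definitions given above, on a small made-up 0/1 matrix with no missing values; it is not the varclus implementation and ignores NA handling.

```r
# Made-up binary data: columns a and c are identical, b is unrelated to a
x <- cbind(a = c(1, 1, 0, 0),
           b = c(1, 0, 1, 0),
           c = c(1, 1, 0, 0))

pos <- (x > 0) * 1                         # 1 where the value is positive

# "bothpos": proportion of rows in which both variables are positive
bothpos <- crossprod(pos) / nrow(pos)

# "ccbothpos": subtract the product of the two marginal proportions,
# but keep the plain proportion of positives on the diagonal
marg <- colMeans(pos)
ccbothpos <- bothpos - outer(marg, marg)
diag(ccbothpos) <- marg

round(ccbothpos, 2)
```

For a and b the chance-corrected value is 0.25 - 0.5 * 0.5 = 0, as expected when the two variables show no association in these four rows, while a and c give 0.5 - 0.25 = 0.25.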
If x
is not a formula, it may be a data matrix or a similarity matrix.
By default, it is assumed to be a data matrix.
hclust
. The default, for both varclus
and naclus
, is
"compact"
(for R it is "complete"
).
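As a rough illustration of how a similarity matrix can be handed to hclust with a chosen method, the base-R sketch below clusters four simulated variables on squared Spearman correlations. Using 1 - similarity as the dissimilarity is an assumption made for this sketch; the text above does not state which transformation varclus applies.

```r
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
x3 <- x1 + x2 + rnorm(200)
x4 <- x2 + rnorm(200)
x  <- cbind(x1, x2, x3, x4)

# Squared Spearman correlations as the similarity measure
sim <- cor(x, method = "spearman")^2

# Assumed conversion to a dissimilarity: 1 - similarity
hc <- hclust(as.dist(1 - sim), method = "complete")
plot(hc, ylab = "1 - squared Spearman rho")
```

Changing the method argument (for example to "average") changes only how clusters are merged, not the similarities themselves.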
x
is a formula. The default
na.action
is na.retain
, defined by varclus
. This
causes all observations to be kept in the model frame, with later
pairwise deletion of NAs.
trans="abs"
to take absolute values or
trans="none"
to use the coefficients as they stand.
varclus
these are optional arguments to pass to
the dataframeReduce
function. Otherwise,
passed to plclust
(or to dotchart
or dotchart2
for
naplot
).
similarity
.
TRUE
to plot a legend defining the abbreviations
x
and y
defining coordinates of the
upper left corner of the legend. Default is locator(1)
.
maxlen
characters are truncated at maxlen
.
naclus
"all"
meaning to have naplot
make 4 separate
plots. To
make only one of the plots, use which="na per var"
(dot chart of
fraction of NAs for each variable), "na per obs"
(dot chart showing
frequency distribution of number of variables having NAs in an
observation), "mean na"
(dot chart showing mean number of other
variables missing when the indicated variable is missing), or
"na per var vs mean na"
, a scatterplot showing on the x-axis the
fraction of NAs in the variable and on the y-axis the mean number of
other variables that are NA when the indicated variable is NA.
"OTHER"
. Otherwise, the lowest frequency cell is combined
with the next lowest frequency cell, and the level name is the
combination of the two old level names.
TRUE
to abbreviate variable names for plotting or
printing. Is set to TRUE
automatically if legend=TRUE
.
s
.
slimds
to TRUE
to scale diagonals and
off-diagonals separately
TRUE
to add similarities to an existing plot (usually
specifying lty
or col
)
plotMultSim
s
FALSE
to suppress drawing of labels in the x direction
n
where n
is the number
of variables, to set aside for y-axis labels
varclus
or naclus
, a list of class varclus
with elements
call
(containing the calling statement), sim
(similarity matrix),
n
(sample size used if x
was not a correlation matrix already -
n
is a matrix), hclust
, the object created by hclust
,
similarity
, and method
. naclus
also returns the
two vectors listed under
description, and naplot
returns an invisible vector that is the
frequency table of the number of missing variables per observation.
plotMultSim
invisibly returns the limits of similarities used in
constructing the y-axes of each subplot. For similarity="ccbothpos"
the hclust
object is NULL
.
na.pattern
creates an integer vector of frequencies.
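The missingness quantities that naclus is described as computing (the NA similarity matrix, the number of missing variables per observation, and the mean number of other missing variables) can be sketched in base R. The object names m, na_per_obs and mean_na are chosen for this sketch, and the code follows the verbal definitions in this page rather than the naclus source.

```r
df <- data.frame(a = c(1, 2, 3),
                 c = c(1, 2, NA),
                 d = c(1, NA, 3),
                 h = c(NA, NA, 3))

m <- is.na(df) * 1                 # 0/1 missingness indicators

# Off-diagonal: fraction of observations with both variables missing.
# Diagonal: fraction of NAs in each variable by itself.
sim <- crossprod(m) / nrow(m)

# Number of missing variables in each observation
na_per_obs <- rowSums(m)

# For variable i: mean number of OTHER missing variables among the
# observations in which variable i is missing (NA if never missing)
mean_na <- sapply(seq_len(ncol(m)), function(i) {
  miss <- m[, i] == 1
  if (!any(miss)) NA_real_ else mean(na_per_obs[miss] - 1)
})
names(mean_na) <- colnames(m)

round(sim, 2)
na_per_obs
mean_na
```

Variable a is never missing, so its mean is left as NA in this sketch; how naclus reports that case is not stated above.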
options(contrasts= c("contr.treatment", "contr.poly"))
is issued
temporarily by varclus
to make sure that ordinary dummy variables
are generated for factor
variables. Pass arguments to the
dataframeReduce
function to remove problematic variables
(especially if analyzing all variables in a data frame).
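The effect of treatment contrasts and of dropping the intercept can be checked directly with model.matrix. The data frame below is made up for illustration; the sketch shows only base R behavior under the default contr.treatment coding, not anything specific to varclus.

```r
d <- data.frame(age = c(30, 40, 50, 60, 70, 80),
                country = factor(c("US", "UK", "FR", "US", "UK", "FR")))

# With an intercept, the first level (FR) is the omitted reference level
with_int <- model.matrix(~ age + country, data = d)
colnames(with_int)

# With -1, the first factor in the formula gets a dummy for every level
no_int <- model.matrix(~ age + country - 1, data = d)
colnames(no_int)
```

With three countries, the intercept-free design matrix has three country columns, which is the "k dummies for k levels" behavior described for formulas ending in -1.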
Hoeffding W. (1948): A non-parametric test of independence. Ann Math Stat 19:546--57.
hclust
, plclust
, hoeffd
, rcorr
, cor
, model.matrix
,
locator
, na.pattern
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
x3 <- x1 + x2 + rnorm(200)
x4 <- x2 + rnorm(200)
x <- cbind(x1,x2,x3,x4)
v <- varclus(x, similarity="spear") # spearman is the default anyway
v # invokes print.varclus
print(round(v$sim,2))
plot(v)
# plot(varclus(~ age + sys.bp + dias.bp + country - 1), abbrev=TRUE)
# the -1 causes k dummies to be generated for k countries
# plot(varclus(~ age + factor(disease.code) - 1))
#
#
# use varclus(~., data= fracmiss= maxlevels= minprev=) to analyze all
# "useful" variables - see dataframeReduce for details about arguments
df <- data.frame(a=c(1,2,3),b=c(1,2,3),c=c(1,2,NA),d=c(1,NA,3),
                 e=c(1,NA,3),f=c(NA,NA,NA),g=c(NA,2,3),h=c(NA,NA,3))
par(mfrow=c(2,2))
for(m in c("ward","complete","median")) {
  plot(naclus(df, method=m))
  title(m)
}
naplot(naclus(df))
n <- naclus(df)
plot(n); naplot(n)
na.pattern(df) # builtin function
x <- c(1, rep(2,11), rep(3,9))
combine.levels(x)
x <- c(1, 2, rep(3,20))
combine.levels(x)
# plotMultSim example: Plot proportion of observations
# for which two variables are both positive (diagonals
# show the proportion of observations for which the
# one variable is positive). Chance-correct the
# off-diagonals by subtracting the product of the
# marginal proportions. On each subplot the x-axis
# shows month (0, 4, 8, 12) and there is a separate
# curve for females and males
d <- data.frame(sex=sample(c('female','male'),1000,TRUE),
                month=sample(c(0,4,8,12),1000,TRUE),
                x1=sample(0:1,1000,TRUE),
                x2=sample(0:1,1000,TRUE),
                x3=sample(0:1,1000,TRUE))
s <- array(NA, c(3,3,4))
opar <- par(mar=c(0,0,4.1,0))  # waste less space
for(sx in c('female','male')) {
  for(i in 1:4) {
    mon <- (i-1)*4
    s[,,i] <- varclus(~x1 + x2 + x3, sim='ccbothpos', data=d,
                      subset=d$month==mon & d$sex==sx)$sim
  }
  plotMultSim(s, c(0,4,8,12), vname=c('x1','x2','x3'),
              add=sx=='male', slimds=TRUE,
              lty=1+(sx=='male'))
  # slimds=TRUE causes separate scaling for diagonals and
  # off-diagonals
}
par(opar)