hclust(). A small function naclus is also provided which depicts similarities in which observations are missing for variables in a data frame. The similarity measure is the fraction of NAs in common between any two variables. The diagonals of this sim matrix are the fraction of NAs in each variable by itself. naclus also computes na.per.obs, the number of missing variables in each observation, and mean.na, a vector whose ith element is the mean number of missing variables other than variable i, for observations in which variable i is missing. The naplot function makes several plots (see the which argument).

So as not to generate too many dummy variables for multi-valued character or categorical predictors, varclus will automatically combine infrequent cells of such variables using an auxiliary function combine.levels that is defined here.
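
As a rough illustration of the pooling idea described above, the following base R sketch merges levels whose relative frequency falls below a threshold into a single "OTHER" level. The helper name pool_rare_levels and the example data are hypothetical, and this is not the combine.levels source. In particular, the sketch leaves the factor unchanged when fewer than two levels are rare.

```r
# Hypothetical helper (not part of the package): pool infrequent levels.
pool_rare_levels <- function(x, minlev = 0.05) {
  f <- factor(x)
  freq <- table(f) / sum(!is.na(f))      # relative frequency of each level
  rare <- names(freq)[freq < minlev]     # levels below the threshold
  if (length(rare) < 2) return(f)        # assumption: only pool two or more rare levels
  lev <- levels(f)
  lev[lev %in% rare] <- "OTHER"          # duplicated level names are merged by levels<-
  levels(f) <- lev
  f
}

x <- c(rep("a", 50), rep("b", 45), rep("c", 3), rep("d", 2))
table(pool_rare_levels(x))
```

Assigning duplicated names through levels<- is what merges the rare cells, so no observations are dropped; they are only relabeled.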

plotMultSim plots multiple similarity matrices, with the similarity measure on the y-axis of each subplot.

na.pattern prints a frequency table of all combinations of missingness for multiple variables. If there are 3 variables, a frequency table entry labeled 110 corresponds to the number of observations for which the first and second variables were missing but the third variable was not missing.
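
The 0/1 coding of missingness patterns described above can be reproduced in a few lines of base R. This is only a sketch of the idea; the helper name na_pattern_sketch and the small data frame are hypothetical, and the packaged na.pattern may differ in details such as input handling.

```r
# Hypothetical helper: code each row as a string of 1 (missing) and 0 (observed),
# one character per variable, then tabulate the distinct strings.
na_pattern_sketch <- function(d) {
  miss <- is.na(d) * 1L                           # n x p indicator matrix
  code <- apply(miss, 1, paste, collapse = "")    # e.g. "110"
  table(code)
}

d <- data.frame(a = c(NA, NA, 1, 2),
                b = c(NA, NA, 3, NA),
                c = c(1, 2, 3, 4))
na_pattern_sketch(d)
```

In this example the entry labeled 110 counts the two rows in which a and b are missing and c is observed.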

varclus(x, similarity=c("spearman","pearson","hoeffding","bothpos"),
        type=c("data.matrix","similarity.matrix"),
        method="compact", data, subset, na.action, minlev)

## S3 method for class 'varclus':
print(x, ...)

## S3 method for class 'varclus':
plot(x, ylab, abbrev=FALSE, legend.=FALSE, loc, maxlen, labels, ...)

naclus(df, method)

naplot(obj, which=c('all','na per var','na per obs','mean na',
                    'na per var vs mean na'), ...)

combine.levels(x, minlev=.05)

plotMultSim(s, x=1:dim(s)[3],
            slim=range(pretty(c(0,max(s,na.rm=TRUE)))),
            slimds=FALSE,
            add=FALSE, lty=par('lty'), col=par('col'),
            lwd=par('lwd'), vname=NULL, h=.5, w=.75, u=.05,
            labelx=TRUE, xspace=.35)

na.pattern(x)
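
x: if x is a formula, model.matrix is used to convert it to a design matrix. If the formula excludes an intercept (e.g., ~ a + b -1), the first categorical (factor) variable in the formula will have dummy variables generated for all of its levels instead of omitting one for the first level.

type: if x is not a formula, it may be a data matrix or a similarity matrix. By default, it is assumed to be a data matrix.

method: see hclust. The default, for both varclus and naclus, is "compact" (for R it is "complete").

data, subset, na.action: these may be specified if x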
is a formula. The default na.action is na.retain, defined by varclus. This causes all observations to be kept in the model frame, with later pairwise deletion of NAs.

ylab: y-axis label. The default is constructed on the basis of similarity.

abbrev: set to TRUE to abbreviate variable names for plotting. Is set to TRUE automatically if legend.=TRUE.

legend.: set to TRUE to plot a legend defining the abbreviations.

loc: a list with elements x and y defining coordinates of the upper left corner of the legend. Default is locator(1).

maxlen: if a legend is plotted, original labels longer than maxlen characters are truncated at maxlen.

...: arguments passed to plclust (or to dotchart or dotchart2 for naplot
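).

obj: an object created by naclus.

which: defaults to "all", meaning to have naplot make 4 separate plots. To make only one of the plots, use which="na per var" (dot chart of fraction of NAs for each variable), "na per obs" (dot chart showing the frequency distribution of the number of variables having NAs in an observation), "mean na", or "na per var vs mean na".

minlev: the minimum proportion of observations in a cell before that cell is combined with one or more other cells. If more than one cell falls below this proportion, all such cells are combined into a new cell labeled "OTHER". Otherwise, the lowest frequency cell is combined with the next lowest frequency cell.

slim: range of similarity values used to scale the y-axes. By default this is based on the observed range over all of s.

slimds: set slimds to TRUE to scale diagonals and off-diagonals separately.

add: set to TRUE to add similarities to an existing plot (usually specifying lty or col).

lty, col, lwd: line type, color, or line thickness for plotMultSim.

vname: variable names, in order, used in s.

labelx: set to FALSE to suppress drawing of labels in the x direction.

xspace: amount of space, on a scale of 1:n where n is the number of variables, to set aside for y-axis labels.

For varclus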
or naclus, a list of class varclus with elements call (containing the calling statement), sim (similarity matrix), n (sample size used if x was not a correlation matrix already - n is a matrix), hclust, the object created by hclust, similarity, and method. For plot, returns the object created by plclust. naclus also returns the two vectors listed under description, and naplot returns an invisible vector that is the frequency table of the number of missing variables per observation. plotMultSim invisibly returns the limits of similarities used in constructing the y-axes of each subplot. For similarity="ccbothpos" the hclust object is NULL.

na.pattern creates an integer vector of frequencies.
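
To make the quantities returned for missing-data clustering more concrete, here is a minimal base R sketch that computes a similarity matrix of shared-NA fractions, the per-observation NA counts, and the mean number of other missing variables. The toy data frame and the use of 1 - sim as the distance passed to hclust are assumptions made for illustration; this is not the naclus source.

```r
df <- data.frame(a = c(1, NA, 3, NA),
                 b = c(NA, NA, 3, 4),
                 c = c(1, 2, 3, NA))

na <- is.na(df) * 1                  # n x p indicator matrix, 1 = missing
n  <- nrow(na)

# sim[i, j] = fraction of observations in which variables i and j are both NA;
# the diagonal is the fraction of NAs in each variable by itself
sim <- crossprod(na) / n

na.per.obs <- rowSums(na)            # number of missing variables per observation

# mean number of OTHER missing variables among observations where variable i is missing
mean.na <- vapply(seq_len(ncol(na)), function(i) {
  miss_i <- na[, i] == 1
  if (!any(miss_i)) return(NA_real_)
  mean(na.per.obs[miss_i] - 1)       # subtract 1 to exclude variable i itself
}, numeric(1))
names(mean.na) <- colnames(na)

# Assumption: cluster on dissimilarity 1 - sim
hc <- hclust(as.dist(1 - sim), method = "complete")
```

Variables that tend to be missing together get a high similarity and therefore join early in the dendrogram.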

options(contrasts= c("contr.treatment", "contr.poly")) is issued temporarily by varclus to make sure that ordinary dummy variables are generated for factor variables. If a categorical or character variable has no level containing at least a fraction minlev of the data, that variable is omitted from consideration and a warning is printed.

Hoeffding W. (1948): A non-parametric test of independence. Ann Math Stat 19:546-57.

hclust, plclust, hoeffd, rcorr, cor, model.matrix, locator, na.pattern

library(Hmisc)   # varclus and the other functions shown here are provided by Hmisc
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
x3 <- x1 + x2 + rnorm(200)
x4 <- x2 + rnorm(200)
x <- cbind(x1,x2,x3,x4)
v <- varclus(x, similarity="spear")  # spearman is the default anyway
v    # invokes print.varclus
print(round(v$sim,2))
plot(v)
# plot(varclus(~ age + sys.bp + dias.bp + country - 1), abbrev=TRUE)
# the -1 causes k dummies to be generated for k countries
# plot(varclus(~ age + factor(disease.code) - 1))
#
df <- data.frame(a=c(1,2,3),b=c(1,2,3),c=c(1,2,NA),d=c(1,NA,3),
                 e=c(1,NA,3),f=c(NA,NA,NA),g=c(NA,2,3),h=c(NA,NA,3))
par(mfrow=c(2,2))
# hclust methods available in R (S-Plus used "compact", "connected", "average");
# "ward.D" is the current R name for the method formerly called "ward"
for(m in c("ward.D","complete","median")) {
  plot(naclus(df, method=m))
  title(m)
}
naplot(naclus(df))
n <- naclus(df)
plot(n); naplot(n)
na.pattern(df) # builtin function
x <- c(1, rep(2,11), rep(3,9))
combine.levels(x)
x <- c(1, 2, rep(3,20))
combine.levels(x)
# plotMultSim example: Plot proportion of observations
# for which two variables are both positive (diagonals
# show the proportion of observations for which the
# one variable is positive). Chance-correct the
# off-diagonals by subtracting the product of the
# marginal proportions. On each subplot the x-axis
# shows month (0, 4, 8, 12) and there is a separate
# curve for females and males
d <- data.frame(sex=sample(c('female','male'),1000,TRUE),
                month=sample(c(0,4,8,12),1000,TRUE),
                x1=sample(0:1,1000,TRUE),
                x2=sample(0:1,1000,TRUE),
                x3=sample(0:1,1000,TRUE))
s <- array(NA, c(3,3,4))
opar <- par(mar=c(0,0,4.1,0))   # waste less space
for(sx in c('female','male')) {
  for(i in 1:4) {
    mon <- (i-1)*4
    s[,,i] <- varclus(~x1 + x2 + x3, sim='ccbothpos', data=d,
                      subset=month==mon & sex==sx)$sim
  }
  plotMultSim(s, c(0,4,8,12), vname=c('x1','x2','x3'),
              add=sx=='male', slimds=TRUE,
              lty=1+(sx=='male'))
  # slimds=TRUE causes separate scaling for diagonals and
  # off-diagonals
}
par(opar)
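
The chance-corrected "both positive" similarity used in the plotMultSim example can be worked through by hand in base R. The sketch below follows the wording of the example comments (off-diagonals are the proportion of observations with both variables positive minus the product of the marginal proportions, diagonals are the uncorrected marginal proportions). It assumes complete data with no NAs, and the small matrix is made up for illustration; it is not the varclus source.

```r
x <- cbind(x1 = c(1, 1, 0, 0),
           x2 = c(1, 0, 1, 0),
           x3 = c(1, 1, 1, 0))

pos  <- (x > 0) * 1                   # indicator of a positive value
both <- crossprod(pos) / nrow(pos)    # proportion of rows where both are positive
p    <- diag(both)                    # marginal proportion positive for each variable

cc <- both - outer(p, p)              # chance correction: subtract product of marginals
diag(cc) <- p                         # diagonals keep the uncorrected proportions
round(cc, 3)
```

Here x1 and x2 are positive together in exactly the proportion expected by chance (0.25), so their corrected similarity is 0, while x1 and x3 overlap slightly more than chance would predict.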