Hmisc (version 3.0-1)

varclus: Variable Clustering

Description

Does a hierarchical cluster analysis on variables, using the Hoeffding D statistic, squared Pearson or Spearman correlations, or the proportion of observations for which two variables are both positive as similarity measures. Variable clustering is used for assessing collinearity and redundancy, and for separating variables into clusters that can be scored as a single variable, thus resulting in data reduction. For computing any of these similarity measures, pairwise deletion of NAs is done. The clustering is done by hclust().

A small function naclus is also provided, which depicts similarities in which observations are missing for variables in a data frame. The similarity measure is the fraction of NAs in common between any two variables. The diagonals of this similarity matrix are the fraction of NAs in each variable by itself. naclus also computes na.per.obs, the number of missing variables in each observation, and mean.na, a vector whose ith element is the mean number of missing variables other than variable i, for observations in which variable i is missing.

The naplot function makes several plots (see the which argument).
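
As a rough illustration of the varclus approach described above, the following base-R sketch computes squared Spearman correlations with pairwise deletion of NAs and passes one minus the similarity to hclust(). It is not the Hmisc source; the use of 1 - sim as the dissimilarity and the simulated data are assumptions made for illustration.

```r
# Base-R sketch (not the Hmisc implementation): cluster variables on
# squared Spearman correlations, using pairwise deletion of NAs.
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
x3 <- x1 + x2 + rnorm(200)
x4 <- x2 + rnorm(200)
x  <- cbind(x1, x2, x3, x4)
x[sample(length(x), 20)] <- NA          # introduce some missing values

# Similarity: squared Spearman rho on pairwise-complete observations
sim <- cor(x, method = "spearman", use = "pairwise.complete.obs")^2

# hclust() works on dissimilarities, so cluster on 1 - similarity (assumed)
hc <- hclust(as.dist(1 - sim), method = "complete")
plot(hc, ylab = "1 - squared Spearman rho")
```

Variables that join low in the resulting tree are the most similar and are the natural candidates for being scored together.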

So as to not generate too many dummy variables for multi-valued character or categorical predictors, varclus will automatically combine infrequent cells of such variables using an auxiliary function combine.levels that is defined here.
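
The idea of pooling infrequent cells can be sketched in base R as follows. The helper pool_rare_levels is hypothetical and is not Hmisc's combine.levels; it covers only the case in which two or more levels fall below minlev, and leaves the factor unchanged otherwise.

```r
# Hypothetical helper, not Hmisc's combine.levels(): pool every level whose
# relative frequency is below minlev into a single level called "OTHER".
pool_rare_levels <- function(x, minlev = 0.05) {
  x    <- as.factor(x)
  prop <- table(x) / length(x)              # proportion of observations per level
  rare <- names(prop)[prop < minlev]
  if (length(rare) < 2) return(x)           # simplification: nothing to pool
  lev <- levels(x)
  lev[lev %in% rare] <- "OTHER"             # duplicated labels merge the levels
  levels(x) <- lev
  x
}

x <- c(rep("a", 50), rep("b", 45), rep("c", 3), rep("d", 2))
table(pool_rare_levels(x))                  # a = 50, b = 45, OTHER = 5
```

Pooling before building the design matrix keeps the number of dummy variables small, which is the motivation given above.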

plotMultSim plots multiple similarity matrices, with the similarity measure being on the x-axis of each subplot.

na.pattern prints a frequency table of all combinations of missingness for multiple variables. If there are 3 variables, a frequency table entry labeled 110 corresponds to the number of observations for which the first and second variables were missing but the third variable was not missing.
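
The missingness summaries described above can be approximated in base R. This is a sketch under the assumption that a pattern is coded 1 for missing and 0 for observed, as in the 110 example; it does not call na.pattern or naclus, and the small data frame is made up for illustration.

```r
df <- data.frame(a = c(1, 2, 3),
                 c = c(1, 2, NA),
                 d = c(1, NA, 3),
                 h = c(NA, NA, 3))

# One 0/1 indicator per cell: 1 = missing, 0 = observed
miss <- is.na(df) + 0L

# Frequency table of missingness patterns, one digit per variable
pattern <- apply(miss, 1, paste, collapse = "")
table(pattern)

# Fraction of observations in which two variables are both missing;
# the diagonal is the fraction of NAs in each variable by itself
sim <- crossprod(miss) / nrow(miss)
round(sim, 2)

# Number of missing variables in each observation
na_per_obs <- rowSums(miss)
```

Here the pattern 0011 marks an observation in which d and h are missing but a and c are not.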

Usage

varclus(x, similarity=c("spearman","pearson","hoeffding","bothpos","ccbothpos"),
        type=c("data.matrix","similarity.matrix"), 
        method=if(.R.)"complete" else "compact",
        data, subset, na.action, minlev=0.05)
## S3 method for class 'varclus':
print(x, abbrev=FALSE, ...)
## S3 method for class 'varclus':
plot(x, ylab, abbrev=FALSE, legend.=FALSE, loc, maxlen, labels, ...)

naclus(df, method)

naplot(obj, which=c('all','na per var','na per obs','mean na', 'na per var vs mean na'), ...)

combine.levels(x, minlev=.05)

plotMultSim(s, x=1:dim(s)[3], slim=range(pretty(c(0,max(s,na.rm=TRUE)))), slimds=FALSE, add=FALSE, lty=par('lty'), col=par('col'), lwd=par('lwd'), vname=NULL, h=.5, w=.75, u=.05, labelx=TRUE, xspace=.35)

na.pattern(x)

Arguments

x
a formula, a numeric matrix of predictors, or a similarity matrix. If x is a formula, model.matrix is used to convert it to a design matrix. If the formula excludes an intercept (e.g., ~ a + b -1), the first categor
df
a data frame
s
an array of similarity matrices. The third dimension of this array corresponds to different computations of similarities. The first two dimensions come from a single similarity matrix. This is useful for displaying similarity matrices computed by
similarity
the default is to use squared Spearman correlation coefficients, which will detect monotonic but nonlinear relationships. You can also specify linear correlation or Hoeffding's (1948) D statistic, which has the advantage of being sensitive to many types
type
if x is not a formula, it may be a data matrix or a similarity matrix. By default, it is assumed to be a data matrix.
method
see hclust. The default, for both varclus and naclus, is "compact" (for R it is "complete").
data
subset
na.action
These may be specified if x is a formula. The default na.action is na.retain, defined by varclus. This causes all observations to be kept in the model frame, with later pairwise deletion of NA
ylab
y-axis label. Default is constructed on the basis of similarity.
legend.
set to TRUE to plot a legend defining the abbreviations
loc
a list with elements x and y defining coordinates of the upper left corner of the legend. Default is locator(1).
maxlen
if a legend is plotted describing abbreviations, original labels longer than maxlen characters are truncated at maxlen.
labels
a vector of character strings containing labels corresponding to columns in the similarity matrix, if the column names of that matrix are not to be used
...
passed to plclust (or to dotchart or dotchart2 for naplot).
obj
an object created by naclus
which
defaults to "all", meaning that naplot makes 4 separate plots. To make only one of the plots, use which="na per var" (dot chart of fraction of NAs for each variable), "na per obs" (dot chart showing
minlev
the minimum proportion of observations in a cell before that cell is combined with one or more cells. If more than one cell has fewer than minlev*n observations, all such cells are combined into a new cell labeled "OTHER". Otherwise, the lo
abbrev
set to TRUE to abbreviate variable names for plotting or printing. It is set to TRUE automatically if legend.=TRUE.
slim
2-vector specifying the range of similarity values for scaling the y-axes. By default this is the observed range over all of s.
slimds
set slimds to TRUE to scale diagonals and off-diagonals separately
add
set to TRUE to add similarities to an existing plot (usually specifying lty or col)
lty
col
lwd
line type, color, or line thickness for plotMultSim
vname
optional vector of variable names, in order, used in s
h
relative height for subplot
w
relative width for subplot
u
relative extra height and width to leave unused inside the subplot. Also used as the space between y-axis tick mark labels and graph border.
labelx
set to FALSE to suppress drawing of labels in the x direction
xspace
amount of space, on a scale of 1:n where n is the number of variables, to set aside for y-axis labels

Value

For varclus or naclus, a list of class varclus with elements call (containing the calling statement), sim (similarity matrix), n (sample size used if x was not a correlation matrix already; n is a matrix), hclust (the object created by hclust), similarity, and method. For similarity="ccbothpos" the hclust object is NULL.

For plot, returns the object created by plclust. naclus also returns the two vectors listed under Description, and naplot returns an invisible vector that is the frequency table of the number of missing variables per observation. plotMultSim invisibly returns the limits of similarities used in constructing the y-axes of each subplot.

na.pattern creates an integer vector of frequencies.

Side Effects

plots

Details

options(contrasts= c("contr.treatment", "contr.poly")) is issued temporarily by varclus to make sure that ordinary dummy variables are generated for factor variables. If a categorical or character variable has no level containing at least a fraction minlev of the data, that variable is omitted from consideration and a warning is printed.
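
The temporary change of contrasts can be illustrated with a small base-R wrapper. The function dummy_columns is hypothetical and only shows the pattern of setting an option and restoring it on exit; it is not how varclus is actually written.

```r
# Hypothetical wrapper: set treatment contrasts temporarily so that
# model.matrix() generates ordinary dummy variables for factors.
dummy_columns <- function(formula, data) {
  old <- options(contrasts = c("contr.treatment", "contr.poly"))
  on.exit(options(old))                  # restore the previous setting on exit
  model.matrix(formula, data = data)
}

d <- data.frame(y = 1:6, g = factor(rep(c("a", "b", "c"), 2)))
colnames(dummy_columns(~ g, d))          # "(Intercept)" "gb" "gc"
colnames(dummy_columns(~ g - 1, d))      # "ga" "gb" "gc"
```

Dropping the intercept with - 1 yields one dummy per level of g, which is the behavior the commented formula examples for varclus rely on.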

References

Sarle, WS: The VARCLUS Procedure. SAS/STAT User's Guide, 4th Edition, 1990. Cary NC: SAS Institute, Inc.

Hoeffding W. (1948): A non-parametric test of independence. Ann Math Stat 19:546--57.

See Also

hclust, plclust, hoeffd, rcorr, cor, model.matrix, locator, na.pattern

Examples

set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
x3 <- x1 + x2 + rnorm(200)
x4 <- x2 + rnorm(200)
x <- cbind(x1,x2,x3,x4)
v <- varclus(x, similarity="spear")  # spearman is the default anyway
v    # invokes print.varclus
print(round(v$sim,2))
plot(v)


# plot(varclus(~ age + sys.bp + dias.bp + country - 1), abbrev=TRUE)
# the -1 causes k dummies to be generated for k countries
# plot(varclus(~ age + factor(disease.code) - 1))
#


df <- data.frame(a=c(1,2,3),b=c(1,2,3),c=c(1,2,NA),d=c(1,NA,3),
                 e=c(1,NA,3),f=c(NA,NA,NA),g=c(NA,2,3),h=c(NA,NA,3))
par(mfrow=c(2,2))
# hclust methods for R; under S-PLUS use c("compact","connected","average")
for(m in c("ward","complete","median")) {
  plot(naclus(df, method=m))
  title(m)
}
naplot(naclus(df))
n <- naclus(df)
plot(n); naplot(n)
na.pattern(df)      # builtin function


x <- c(1, rep(2,11), rep(3,9))
combine.levels(x)
x <- c(1, 2, rep(3,20))
combine.levels(x)


# plotMultSim example: Plot proportion of observations
# for which two variables are both positive (diagonals
# show the proportion of observations for which the
# one variable is positive).  Chance-correct the
# off-diagonals by subtracting the product of the
# marginal proportions.  On each subplot the x-axis
# shows month (0, 4, 8, 12) and there is a separate
# curve for females and males
d <- data.frame(sex=sample(c('female','male'),1000,TRUE),
                month=sample(c(0,4,8,12),1000,TRUE),
                x1=sample(0:1,1000,TRUE),
                x2=sample(0:1,1000,TRUE),
                x3=sample(0:1,1000,TRUE))
s <- array(NA, c(3,3,4))
opar <- par(mar=c(0,0,4.1,0))  # waste less space
for(sx in c('female','male')) {
  for(i in 1:4) {
    mon <- (i-1)*4
    s[,,i] <- varclus(~x1 + x2 + x3, sim='ccbothpos', data=d,
                      subset=month==mon & sex==sx)$sim
    }
  plotMultSim(s, c(0,4,8,12), vname=c('x1','x2','x3'),
              add=sx=='male', slimds=TRUE,
              lty=1+(sx=='male'))
  # slimds=TRUE causes separate  scaling for diagonals and
  # off-diagonals
}
par(opar)
