scat1d: One-Dimensional Scatter Diagram, Spike Histogram, or Density

Description

scat1d adds tick marks (bar codes, rug plot) on any of the four sides of an existing plot, corresponding to non-missing values of a vector x. This is used to show the data density. It can also place the tick marks along a curve when y-coordinates are specified to go along with the x values.

If any two values of x are within eps*w of each other, where eps defaults to .001 and w is the span of the intended axis, values of x are jittered by adding a value uniformly distributed in [-jitfrac*w, jitfrac*w], where jitfrac defaults to .008. Specifying preserve=TRUE invokes jitter2, which uses a different jittering logic. scat1d also allows plotting random sub-segments to handle very large x vectors (see tfrac).
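The default jittering rule described above can be sketched in base R. This is only an illustration of the documented rule, not the package's code: the helper name jitter_if_close and its axis_range argument are hypothetical, and the strict "<" comparison is an assumption about what "within" means.

```r
# Hypothetical helper (not part of the package): mimics the documented rule only.
jitter_if_close <- function(x, axis_range, eps = 0.001, jitfrac = 0.008) {
  x <- x[!is.na(x)]                  # only non-missing values are drawn
  w <- diff(axis_range)              # span of the intended axis
  gaps <- diff(sort(x))              # distances between neighbouring values
  if (length(gaps) > 0 && any(gaps < eps * w)) {
    # add noise uniformly distributed in [-jitfrac*w, jitfrac*w]
    x <- x + runif(length(x), -jitfrac * w, jitfrac * w)
  }
  x
}

set.seed(1)
x_tied  <- c(1, 1, 2, 5, 9)          # contains a tie, so jittering is triggered
x_clear <- c(1, 3, 5, 7, 9)          # well separated, so values are left alone
jittered  <- jitter_if_close(x_tied,  axis_range = c(0, 10))
untouched <- jitter_if_close(x_clear, axis_range = c(0, 10))
```

In this sketch every value is shifted once any pair is too close, which is one reading of "values of x are jittered".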

jitter2 is a generic method for jittering, which does not add random noise. It retains unique values and ranks, and randomly spreads duplicate values at equidistant positions within limits of enclosing values. jitter2 is especially useful for numeric variables with discrete values, like rating scales. Missing values are allowed and are returned. Currently implemented methods are jitter2.default for vectors and jitter2.data.frame which returns a data.frame with each numeric column jittered.
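A simplified sketch of the idea behind this kind of jittering, in base R, may help. The function spread_ties is hypothetical and is not the jitter2 implementation: it ignores the limit, eps, and presorted arguments and only applies the documented bound of +/- fill*min(u-d,d-l)/2 around each duplicated value d.

```r
# Hypothetical, simplified stand-in for the idea behind jitter2.
spread_ties <- function(x, fill = 1/3) {
  ux <- sort(unique(x))                       # distinct non-missing values
  # half of the smaller gap to the neighbouring distinct values
  half <- pmin(c(Inf, diff(ux)), c(diff(ux), Inf)) / 2
  out <- x
  for (i in seq_along(ux)) {
    idx <- which(x == ux[i])                  # positions of this value
    k <- length(idx)
    if (k > 1 && is.finite(half[i])) {
      # equidistant offsets within +/- fill * half, assigned in random order
      offs <- seq(-1, 1, length.out = k) * fill * half[i]
      out[idx] <- ux[i] + sample(offs)
    }
  }
  out                                         # unique values and NAs are unchanged
}

set.seed(3)
x <- c(0:3, rep(4, 3), 5, rep(7, 10), 9)
j <- spread_ties(x)
```

Values that occur once keep their exact positions, and the duplicated 4s and 7s are spread symmetrically, so the ordering between distinct original values is preserved.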

datadensity is a generic method used to show data densities in more complex situations. In the Design library there is a datadensity method for use with plot.Design. Here, another datadensity method is defined for data frames. Depending on the which argument, some or all of the variables in a data frame will be displayed, with scat1d used to display continuous variables and, by default, bars used to display frequencies of categorical, character, or discrete numeric variables. For such variables, when the total length of value labels exceeds 200, only the first few characters from each level are used. By default, datadensity.data.frame will construct one axis (i.e., one strip) per variable in the data frame. Variable names appear to the left of the axes, and the number of missing values (if greater than zero) appears to the right of the axes. An optional group variable can be used for stratification, where the different strata are depicted using different colors. If the q vector is specified, the desired quantiles (over all groups) are displayed with solid triangles below each axis.

When the sample size exceeds 2000 (this value may be modified using the nhistSpike argument), datadensity calls histSpike instead of scat1d to show the data density for numeric variables. This results in a histogram-like display that makes the resulting graphics file much smaller. In this case, datadensity uses the minf argument (see below) so that very infrequent data values will not be lost on the variable's axis, although this will slightly distort the histogram.

histSpike is another method for showing a high-resolution data distribution that is particularly good for very large datasets (say n > 1000). By default, histSpike bins the continuous x variable into 100 equal-width bins and then computes the frequency counts within bins (if n does not exceed 10, no binning is done). If add=FALSE (the default), the function displays either proportions or frequencies as in a vertical histogram. Instead of bars, spikes are used to depict the frequencies. If add=TRUE, the function assumes you are adding small density displays that are intended to take up a small amount of space in the margins of the overall plot. The frac argument is used as with scat1d to determine the relative length of the whole plot that is used to represent the maximum frequency. No jittering is done by histSpike.
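The binning step can be illustrated with a small base R sketch. The helper spike_heights is hypothetical and is not how histSpike is implemented; it omits the small-sample case where no binning is done, and its handling of minf simply follows the description of that argument (low bin frequencies are raised to minf times the maximum bin frequency).

```r
# Hypothetical helper (not the package's code): equal-width binning into spikes.
spike_heights <- function(x, nint = 100, type = c("proportion", "count"),
                          minf = NULL) {
  type <- match.arg(type)
  x <- x[!is.na(x)]
  breaks <- seq(min(x), max(x), length.out = nint + 1)   # equal-width bins
  f <- as.vector(table(cut(x, breaks, include.lowest = TRUE)))
  mids <- (breaks[-1] + breaks[-(nint + 1)]) / 2         # bin midpoints
  keep <- f > 0                                          # empty bins get no spike
  f <- f[keep]
  mids <- mids[keep]
  if (type == "proportion") f <- f / sum(f)
  # raise rare bins so they stay visible (slightly distorts the histogram)
  if (!is.null(minf)) f <- pmax(f, minf * max(f))
  list(x = mids, f = f)
}

set.seed(2)
x <- rnorm(5000)
s_raw  <- spike_heights(x)                 # plain proportions
s_minf <- spike_heights(x, minf = 0.075)   # rare bins lifted to 7.5% of the maximum
```

With a graphics device open, plot(s_minf$x, s_minf$f, type = "h") gives a spike display of roughly this kind.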

histSpike can also graph a kernel density estimate for x, or add a small density curve to any of 4 sides of an existing plot. When y or curve is specified, the density or spikes are drawn with respect to the curve rather than the x-axis.

Usage

scat1d(x, side=3, frac=0.02, jitfrac=0.008, tfrac,
       eps=ifelse(preserve,0,.001),
       lwd=0.1, col=par("col"),
       y=NULL, curve=NULL,
       bottom.align=FALSE,
       preserve=FALSE, fill=1/3, limit=TRUE, nhistSpike=2000, nint=100,
       type=c('proportion','count','density'), grid=FALSE, ...)
jitter2(x, ...)
## S3 method for class 'default':
jitter2(x, fill=1/3, limit=TRUE, eps=0, presorted=FALSE, ...)
## S3 method for class 'data.frame':
jitter2(x, ...)
datadensity(object, ...)
## S3 method for class 'data.frame':
datadensity(object, group,
            which=c("all","continuous","categorical"),
            method.cat=c("bar","freq"),
            col.group=1:10,
            n.unique=10, show.na=TRUE, nint=1, naxes,
            q, bottom.align=nint>1,
            cex.axis=sc(.5,.3), cex.var=sc(.8,.3),
            lmgp=NULL, tck=sc(-.009,-.002),
            ranges=NULL, labels=NULL, ...)
# sc(a,b) means default to a if number of axes <= 3, b if >=50, use
# linear interpolation within 3-50
histSpike(x, side=1, nint=100, frac=.05, minf=NULL, mult.width=1,
          type=c('proportion','count','density'),
          xlim=range(x), ylim=c(0,max(f)), xlab=deparse(substitute(x)), 
          ylab=switch(type,proportion='Proportion',
                           count     ='Frequency',
                           density   ='Density'),
          y=NULL, curve=NULL, add=FALSE, 
          bottom.align=type=='density', col=par('col'), lwd=par('lwd'),
          grid=FALSE, ...)

Arguments

x

a vector of numeric data, or a data frame (for jitter2)

object

a data frame or list (even with unequal number of observations per variable, as long as group is not specified)

side

axis side to use (1=bottom (default for histSpike), 2=left, 3=top (default for scat1d), 4=right)

frac

fraction of smaller of vertical and horizontal axes for tick mark lengths. Can be negative to move tick marks outside of plot. For histSpike, this is the relative length to be used for the largest frequency. When scat1d calls

jitfrac

fraction of axis for jittering. If <=0, no jittering is done. If preserve=TRUE, the amount of jittering is independent of jitfrac.

tfrac

fraction of tick mark to actually draw. If tfrac<1, will draw a random fraction tfrac of the line segment at each point. This is useful for very large samples or ones with some very dense points. The default value is 1 if the nu

eps

fraction of axis for determining overlapping points in x. For preserve=TRUE the default is 0 and original unique values are retained; bigger values of eps tend to bias observations from dense to sparse regions, but ranks are sti

lwd

line width for tick marks, passed to segments

col

color for tick marks, passed to segments

y

specify a vector the same length as x to draw tick marks along a curve instead of by one of the axes. The y values are often predicted values from a model. The side argument is ignored when y is given.

curve

a list containing elements x and y for which linear interpolation is used to derive y values corresponding to values of x. This results in tick marks being drawn along the curve. For histSpike

bottom.align

set to TRUE to have the bottoms of tick marks (for side=1 or side=3) aligned at the y-coordinate. The default behavior is to center the tick marks. For datadensity.data.frame, bottom.align

preserve

set to TRUE to invoke jitter2

fill

maximum fraction of the axis filled by jittered values. If d are duplicated values between a lower value l and upper value u, then d will be spread within +/- fill*min(u-d,d-l)/2.

limit

specifies a limit for maximum shift in jittered values. Duplicate values will be spread within +/- fill*min(limit,min(u-d,d-l)/2). The default TRUE restricts jittering to the smallest min(u-d,d-l)/2 observed and results in equal

nhistSpike

If the number of observations exceeds or equals nhistSpike, scat1d will automatically call histSpike to draw the data density, to prevent the graphics file from being too large.

type

used by or passed to histSpike. Set to "count" to display frequency counts rather than relative frequencies, or "density" to display a kernel density estimate computed using the density function.

grid

set to TRUE if the Rgrid package is in effect for the current plot

nint

number of intervals to divide each continuous variable's axis for datadensity. For histSpike, is the number of equal-width intervals for which to bin x, and if instead nint is a character string (e.g.,

...

optional arguments passed to scat1d from datadensity or to histSpike from scat1d

presorted

set to TRUE to prevent from sorting for determining the order l

group

an optional stratification variable, which is converted to a factor vector if it is not one already

which

set which="continuous" to only plot continuous variables, or which="categorical" to only plot categorical, character, or discrete numeric ones. By default, all types of variables are depicted.

method.cat

set method.cat="freq" to depict frequencies of categorical variables with digits representing the cell frequencies, with size proportional to the square root of the frequency. By default, vertical bars are used.

col.group

colors representing the group strata. The vector of colors is recycled to be the same length as the levels of group.

n.unique

number of unique values a numeric variable must have before it is considered to be a continuous variable

show.na

set to FALSE to suppress drawing the number of NAs to the right of each axis

naxes

number of axes to draw on each page before starting a new plot. You can set naxes larger than the number of variables in the data frame if you want to compress the plot vertically.

q

a vector of quantiles to display. By default, quantiles are not shown.

cex.axis

character size for drawing labels for axis tick marks

cex.var

character size for variable names and frequency of NAs

lmgp

spacing between numeric axis labels and axis (see par for mgp)

tck

see tck under par

ranges

a list containing ranges for some or all of the numeric variables. If ranges is not given or if a certain variable is not found in the list, the empirical range, modified by pretty, is used. Example: ranges=list(age=c(10,

labels

a vector of labels to use in labeling the axes for datadensity.data.frame. Default is to use the names of the variables in the input data frame. Note: margin widths computed for setting aside names of variables use the names, and not these

minf

For histSpike, if minf is specified low bin frequencies are set to a minimum value of minf times the maximum bin frequency, so that rare data points will remain visible. A good choice of minf is 0.075.

mult.width

multiplier for the smoothing window width computed by histSpike when type="density"

xlim

a 2-vector specifying the outer limits of x for binning (and plotting, if add=FALSE and nint is a number)

ylim

y-axis range for plotting (if add=FALSE)

xlab

x-axis label (add=FALSE); default is name of input argument x

ylab

y-axis label (add=FALSE)

add

set to TRUE to add the spike-histogram to an existing plot, to show marginal data densities

Value

histSpike returns the actual range of x used in its binning

Side Effects

scat1d adds line segments to plot. datadensity.data.frame draws a complete plot. histSpike draws a complete plot or adds to an existing plot.

Details

For scat1d the length of line segments used is

frac*min(par()$pin)/par()$uin[opp]

data units, where opp is the index of the opposite axis and frac defaults to .02. Assumes that plot has already been called. Current par("usr") is used to determine the range of data for the axis of the current plot. This range is used in jittering and in constructing line segments.
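As a rough illustration of this formula, the sketch below computes the segment length from values of the kind returned by par("pin") and par("usr"). It assumes that uin means inches per data unit, which base R's par() does not provide, so it is derived here; the helper tick_length is hypothetical and not part of the package.

```r
# Hypothetical helper: segment length in data units for a given frac.
# pin = plot region size in inches (as from par("pin")),
# usr = axis limits c(x1, x2, y1, y2) (as from par("usr")),
# opp = index of the opposite axis (1 = x, 2 = y).
tick_length <- function(frac, pin, usr, opp) {
  span <- c(usr[2] - usr[1], usr[4] - usr[3])  # axis ranges in data units
  uin  <- pin / span                           # assumed: inches per data unit
  frac * min(pin) / uin[opp]
}

# A 5 x 4 inch plot region with x in [0, 10] and y in [0, 100]
len_y <- tick_length(0.02, pin = c(5, 4), usr = c(0, 10, 0, 100), opp = 2)
len_x <- tick_length(0.02, pin = c(5, 4), usr = c(0, 10, 0, 100), opp = 1)
```

After plot has been called, the same helper could be given par("pin") and par("usr") directly.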

Examples

library(Hmisc)            # provides scat1d, histSpike, jitter2, datadensity, cut2
plot(x <- rnorm(50), y <- 3*x + rnorm(50)/2 )
scat1d(x)                 # density bars on top of graph
scat1d(y, 4)              # density bars at right
histSpike(x, add=TRUE)       # histogram instead, 100 bins
histSpike(y, 4, add=TRUE)
histSpike(x, type='density', add=TRUE)  # smooth density at bottom
histSpike(y, 4, type='density', add=TRUE)


smooth <- lowess(x, y)    # add nonparametric regression curve
lines(smooth)             # Note: plsmo() does this
scat1d(x, y=approx(smooth, xout=x)$y) # data density on curve
scat1d(x, curve=smooth)   # same effect as previous command
histSpike(x, curve=smooth, add=TRUE) # same as previous but with histogram
histSpike(x, curve=smooth, type='density', add=TRUE)  
# same but smooth density over curve


plot(x <- rnorm(250), y <- 3*x + rnorm(250)/2)
scat1d(x, tfrac=0)        # dots randomly spaced from axis
scat1d(y, 4, frac=-.03)   # bars outside axis
scat1d(y, 2, tfrac=.2)    # same bars with smaller random fraction


x <- c(0:3,rep(4,3),5,rep(7,10),9)
plot(x, jitter2(x))       # original versus jittered values
abline(0,1)               # unique values unjittered on abline
points(x+0.1, jitter2(x, limit=FALSE), col=2)
                          # allow locally maximum jittering
points(x+0.2, jitter2(x, fill=1), col=3); abline(h=seq(0.5,9,1), lty=2)
                          # fill 3/3 instead of 1/3
x <- rnorm(200,0,2)+1; y <- x^2
x2 <- round((x+rnorm(200))/2)*2
x3 <- round((x+rnorm(200))/4)*4
dfram <- data.frame(y,x,x2,x3)
plot(dfram$x2, dfram$y)   # jitter2 via scat1d
scat1d(dfram$x2, y=dfram$y, preserve=TRUE, col=2)
scat1d(dfram$x2, preserve=TRUE, frac=-0.02, col=2)
scat1d(dfram$y, 4, preserve=TRUE, frac=-0.02, col=2)


pairs(jitter2(dfram))     # pairs for jittered data.frame
# This gets reasonable pairwise scatter plots for all combinations of
# variables where
#
# - continuous variables (with unique values) are not jittered at all, thus
#   all relations between continuous variables are shown as they are,
#   extreme values have exact positions.
#
# - discrete variables get a reasonable amount of jittering, whether they
#   have 2, 3, 5, 10, 20 ... levels
#
# - different from adding noise, jitter2() will use the available space
#   optimally and no value will randomly mask another
#
# If you want a scatterplot with lowess smooths on the *exact* values and
# the point clouds shown jittered, you just need
#
pairs( dfram ,panel=function(x,y) { points(jitter2(x),jitter2(y))
                                    lines(lowess(x,y)) } )




datadensity(dfram)     # graphical snapshot of entire data frame
datadensity(dfram, group=cut2(dfram$x2,g=3))
                          # stratify points and frequencies by
                          # x2 tertiles and use 3 colors


# datadensity.data.frame(split(x, grouping.variable))
# need to explicitly invoke datadensity.data.frame when the
# first argument is a list
