vtree: Draw a variable tree

Description

vtree is a tool for drawing variable trees. Variable trees display information about nested subsets of a data frame, in which the subsetting is defined by the values of categorical variables.

Usage

vtree(z, vars, prune = list(), prunebelow = list(), keep = list(),
  follow = list(), prunelone = NULL, pruneNA = FALSE,
  labelnode = list(), labelvar = NULL, fillcolor = NULL,
  fillnodes = TRUE, NAfillcolor = "white", rootfillcolor = "#EFF3FF",
  palette = NULL, gradient = TRUE, revgradient = FALSE,
  singlecolor = 2, colorvarlabels = TRUE, title = "",
  sameline = FALSE, Venn = FALSE, check.is.na = FALSE, seq = FALSE,
  text = list(), plain = FALSE, squeeze = 1, shownodelabels = TRUE,
  showvarnames = TRUE, showlevels = TRUE, showpct = TRUE,
  showlpct = TRUE, showcount = TRUE, showlegend = FALSE,
  varnamepointsize = 20, HTMLtext = FALSE, digits = 0,
  splitwidth = 20, lsplitwidth = 15, getscript = FALSE,
  nodesep = 0.5, ranksep = 0.5, margin = 0.2, vp = TRUE,
  horiz = TRUE, summary = "", width = NULL, height = NULL,
  graphattr = "", nodeattr = "", edgeattr = "", color = c("blue",
  "forestgreen", "red", "orange", "pink"), colornodes = FALSE,
  showempty = FALSE, rounded = TRUE, nodefunc = NULL,
  nodeargs = NULL, parent = 1, last = 1, root = TRUE)

Arguments

z

Required: Data frame, or a single vector.

vars

Required (unless z is a vector): Either a character string of whitespace-separated variable names or a vector of variable names.

prune

List of vectors that identifies nodes to prune. The name of each element of the list must be one of the variable names in vars. Each element is a vector of character strings that identifies the values of the variable (i.e. the nodes) to prune.

prunebelow

Like prune but the nodes themselves are not pruned, just their descendants.

keep

Like prune but specifies which nodes should be kept (i.e. not pruned).

follow

Like prune but specifies which nodes should be "followed". For the variables named, only the descendants of nodes that are followed will be shown.
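The four pruning arguments share the same list-of-vectors shape, so a side-by-side sketch may help. It assumes the vtree package is installed and uses the FakeData data frame and the "Moderate" value of Severity that appear in the Examples section. The object names (t_prune and so on) are made up for illustration.

```r
library(vtree)

# prune: remove the Moderate node and everything below it
t_prune <- vtree(FakeData, "Severity Sex",
                 prune = list(Severity = "Moderate"))

# prunebelow: keep the Moderate node, but remove its descendants
t_below <- vtree(FakeData, "Severity Sex",
                 prunebelow = list(Severity = "Moderate"))

# keep: name the Severity node to retain instead of the ones to remove
t_keep <- vtree(FakeData, "Severity Sex",
                keep = list(Severity = "Moderate"))

# follow: show descendants only for the Moderate node
t_follow <- vtree(FakeData, "Severity Sex",
                  follow = list(Severity = "Moderate"))
```

Printing any of these objects (for example, typing t_keep at the console) displays the corresponding tree.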

prunelone

A vector of values specifying lone nodes (of any variable) to prune. A lone node is a node that has no siblings.

pruneNA

Prune all missing values? This should be used carefully because "valid" percentages are hard to interpret when NAs are pruned.

labelnode

List of vectors used to change how values of variables are displayed. The name of each element of the list is one of the variable names in vars. Each element of the list is a vector of character strings, representing the values of the variable. The names of the vector represent the labels to be used in place of the values.

labelvar

A named vector of labels for variables.

fillcolor

A named vector of colors for filling the nodes of each variable. If an unnamed, scalar color is specified, all nodes will have this color.

fillnodes

Should the nodes be filled with color?

NAfillcolor

Fill color for missing value nodes. If NULL, fill colors of missing value nodes will be consistent with the fill colors in the rest of the tree.

rootfillcolor

Fill color for the root node.

palette

A vector of palette numbers (which can range between 1 and 9). The names of the vector indicate the corresponding variable. See Palettes below for more information.

gradient

Should gradients of fill color be used across the values of each variable? A single value (with no names) specifies the setting for all variables. A logical vector of TRUE values for named variables is interpreted as TRUE for those variables and FALSE for all others. A logical vector of FALSE values for named variables is interpreted as FALSE for those variables and TRUE for all others.

revgradient

Should the gradient be reversed (i.e. dark to light instead of light to dark)? A single value (with no names) specifies the setting for all variables. A logical vector of TRUE values for named variables is interpreted as TRUE for those variables and FALSE for all others. A logical vector of FALSE values for named variables is interpreted as FALSE for those variables and TRUE for all others.

singlecolor

When a variable has a single value, should its nodes be colored light (1), medium (2), or dark (3)?

colorvarlabels

Should the variable labels be colored?

title

Optional title for the root node of the tree.

sameline

Should node labels be on the same line as node percentages?

Venn

Display multi-way set membership information? This provides an alternative to a Venn diagram. This sets showpct=FALSE, shownodelabels=FALSE. Assumption: all of the specified variables are logicals or 0/1 numeric variables.
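As a minimal sketch of the Venn option, the code below builds a small data frame of logical indicators and passes it to vtree. The data frame flags and its columns Smoker and Diabetic are hypothetical, invented only to satisfy the assumption that every variable is logical or 0/1.

```r
library(vtree)

# Hypothetical indicator data: one row per person, one logical column per set
flags <- data.frame(
  Smoker   = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE),
  Diabetic = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

# Each path through the tree corresponds to one region of a Venn diagram
t_venn <- vtree(flags, "Smoker Diabetic", Venn = TRUE)
```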

check.is.na

Replace each variable named in vars with a logical vector indicating whether or not each of its values is missing?

seq

Display the variable tree using "sequences"? Each unique sequence (i.e. pattern) of values will be shown separately.

text

A list of vectors containing extra text to add to specified nodes. The name of each element of the list must be one of the variable names in vars. Each element is a vector of character strings. The names of the vector identify the nodes to which the text should be added. (See Formatting codes below for information on how to format text.)

plain

Use "plain" color settings?

squeeze

How much should the tree be "squeezed"? A value between 0 and 1. This controls two Graphviz parameters: margin and nodesep.

shownodelabels

Should node labels be shown? A single value (with no names) specifies the setting for all variables. A logical vector of TRUE values for named variables is interpreted as TRUE for those variables and FALSE for all others. A logical vector of FALSE values for named variables is interpreted as FALSE for those variables and TRUE for all others.

showvarnames

Show the name of the variable next to each level of the tree?

showlevels

(Deprecated) Same as showvarnames.

showpct

Show percentage in each node? A single value (with no names) specifies the setting for all variables. A logical vector of TRUE for named variables is interpreted as TRUE for those variables and FALSE for all others. A logical vector of FALSE for named variables is interpreted as FALSE for those variables and TRUE for all others.

showlpct

Show the (marginal) percentages for values of each variable in the legend?

showcount

Show count in each node? A single value (with no names) specifies the setting for all variables. A logical vector of TRUE for named variables is interpreted as TRUE for those variables and FALSE for all others. A logical vector of FALSE for named variables is interpreted as FALSE for those variables and TRUE for all others.
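The named-vector convention used by showpct and showcount (and by gradient, revgradient and shownodelabels) can be sketched as follows. This assumes the vtree package is installed and uses FakeData from the Examples section. The object names are illustrative only.

```r
library(vtree)

# Unnamed single value: applies to every variable
t_all <- vtree(FakeData, "Severity Sex", showpct = FALSE)

# Named FALSE: hide percentages for Sex nodes, show them everywhere else
t_sex <- vtree(FakeData, "Severity Sex", showpct = c(Sex = FALSE))

# Named TRUE: show counts for Severity nodes only
t_sev <- vtree(FakeData, "Severity Sex", showcount = c(Severity = TRUE))
```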

showlegend

Show legend (including marginal frequencies) for each variable?

varnamepointsize

Font size (in points) to use when displaying variable names.

HTMLtext

Is the text formatted in HTML?

digits

Number of decimal digits to show in percentages.

splitwidth

The minimum number of characters before an automatic linebreak is inserted.

lsplitwidth

The minimum number of characters before an automatic linebreak is inserted for legends.

getscript

Instead of displaying the variable tree, return the DOT script as a character string?

nodesep

Graphviz attribute: Node separation amount.

ranksep

Graphviz attribute: Rank separation amount.

margin

Graphviz attribute: node margin.

vp

Use "valid percentages"? Valid percentages are computed by first excluding any missing values, i.e. restricting attention to the set of "valid" observations. The denominator is thus the number of non-missing observations. When vp=TRUE, nodes for missing values show the number of missing values but do not show a percentage; all the other nodes show valid percentages. When vp=FALSE, all nodes (including nodes for missing values) show percentages of the total number of observations.
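The difference between the two denominators can be worked through in base R. This is only an arithmetic illustration on a made-up vector, not vtree's own implementation: 10 observations, 2 of them missing.

```r
# Hypothetical toy data: 4 Mild, 2 Moderate, 2 Severe, 2 missing
severity <- c("Mild", "Mild", "Mild", "Mild", "Moderate", "Moderate",
              "Severe", "Severe", NA, NA)

n_total <- length(severity)       # 10 observations in all
n_valid <- sum(!is.na(severity))  # 8 non-missing ("valid") observations

# Analogue of vp = TRUE: denominator is the number of valid observations
valid_pct <- 100 * table(severity) / n_valid

# Analogue of vp = FALSE: denominator is the total, and NA gets a percentage
total_pct <- 100 * table(severity, useNA = "ifany") / n_total

valid_pct[["Mild"]]  # 4 / 8  -> 50
total_pct[["Mild"]]  # 4 / 10 -> 40
```

The same node therefore shows 50% under valid percentages and 40% under total percentages.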

horiz

Should the tree be drawn horizontally? (i.e. parent node on the left, with the tree growing to the right)

summary

A character string used to specify summary statistics to display in the nodes. The first word in the character string is the name of the variable to be summarized. The rest of the character string is the text that will be displayed, along with special codes specifying the information to display (see Summary codes below). A vector of character strings can also be specified, so that more than one variable may be summarized.

width

Width (in pixels) to be passed to grViz.

height

Height (in pixels) to be passed to grViz.

graphattr

Character string: Additional attributes for the Graphviz graph.

nodeattr

Character string: Additional attributes for Graphviz nodes.

edgeattr

Character string: Additional attributes for Graphviz edges.

color

A vector of color names for the outline of the nodes at each level.

colornodes

Should the node outlines be colored?

showempty

Show nodes that do not contain any observations?

rounded

Should the nodes have rounded boxes?

nodefunc

A node function (see Node functions below).

nodeargs

A list containing named arguments for the node function specified by nodefunc.

parent

Parent node number (Internal use only.)

last

Last node number (Internal use only.)

root

Is this the root node of the tree? (Internal use only.)

Value

If getscript=TRUE, returns a character string of DOT script that describes the variable tree. If getscript=FALSE, returns an object of class htmlwidget that will intelligently print itself into HTML in a variety of contexts including the R console, within R Markdown documents, and within Shiny output bindings.
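A short sketch of the getscript=TRUE path, assuming the vtree package is installed and using FakeData from the Examples section. The variable name dot and the file name in the comment are hypothetical.

```r
library(vtree)

# Return the DOT script instead of rendering the tree
dot <- vtree(FakeData, "Severity Sex", getscript = TRUE)

# Inspect the script, e.g. before editing it by hand or passing it to Graphviz
cat(dot)

# It could also be saved for later use (hypothetical file name):
# writeLines(dot, "severity_sex.dot")
```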

Summary codes

%mean% mean
%SD% standard deviation
%min% minimum
%max% maximum
%pX% Xth percentile, e.g. p50 means the 50th percentile
%median% median, i.e. p50
%IQR% interquartile range, i.e. p25, p75
%list% list of the individual values
%mv% the number of missing values
%v% the name of the variable
%noroot% flag: Do not show summary in the root node.
%leafonly% flag: Only show summary in leaf nodes.
%var=n% flag: Only show summary in nodes of the specified variable.
%trunc=n% flag: Truncate the summary to the first n characters.
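As an illustration of how a summary string is assembled, the sketch below summarizes the id variable of FakeData (the variable used in the Examples section). The first word names the variable, the remaining text is displayed with the codes substituted, and %noroot% suppresses the summary in the root node. The object name t_sum is made up.

```r
library(vtree)

# "id" is the variable to summarize; everything after it is display text.
# The leading \n puts the summary on its own line within each node.
t_sum <- vtree(FakeData, "Severity Sex",
               summary = "id \nid range: %min% to %max% %noroot%")
```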

Node functions

Node functions provide a mechanism for running a function within each subset representing a node of the tree. The summary parameter uses node functions. A node function is a function that takes as arguments a data frame subset, the name of the subsetting variable, the value of the subsetting variable, and a list of named arguments.
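The four-argument shape described above can be sketched in base R. Everything named here is hypothetical: the function count_missing, its parameter names, and the toy data frame. The sketch also assumes that a node function returns a character string to be shown in the node, which this page does not state explicitly.

```r
# Hypothetical node function. Arguments, in order: the data frame subset for
# the node, the name of the subsetting variable, its value, and a list of
# named arguments (here, args$var names the column to inspect).
count_missing <- function(node_data, varname, value, args) {
  n_missing <- sum(is.na(node_data[[args$var]]))
  # Assumption: the returned string is the text displayed in the node
  paste0("\n", args$var, " missing: ", n_missing)
}

# Made-up data to exercise the function by hand
toy <- data.frame(
  Sex   = c("F", "F", "M", "M"),
  Score = c(1.5, NA, NA, 3)
)

# Simulate the call for the node Sex = "F"
label_f <- count_missing(toy[toy$Sex == "F", ], "Sex", "F", list(var = "Score"))
label_f
```

Under these assumptions, such a function would be supplied as nodefunc = count_missing together with nodeargs = list(var = "Score").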

Formatting codes

Formatting codes for the text argument. Also used by labelnode and labelvar.

\n line break
*...* italics
**...** bold
^...^ superscript (using 10 point font)
~...~ subscript (using 10 point font)
%%red ...%% display text in red (or whichever color is specified)
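A brief sketch of the formatting codes used through the text argument, assuming the vtree package is installed and using FakeData and the "Moderate" value of Severity from the Examples section. The added wording is invented for illustration.

```r
library(vtree)

# Add two extra lines to the Moderate node: one in italics, one in bold.
# The vector name ("Moderate") identifies the node that receives the text.
t_text <- vtree(FakeData, "Severity Sex",
                text = list(Severity = c(Moderate = "\n*reviewed*\n**monthly**")))
```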

Palettes

Sequential palettes from Color Brewer:

Reds
Blues
Greens
Oranges
Purples
YlGn
PuBu
PuRd
YlOrBr
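A sketch of selecting palettes per variable, assuming the vtree package is installed and that palette numbers 1 to 9 follow the order of the list above (so 3 would be Greens and 5 would be Purples). That numbering is an assumption, not something this page states.

```r
library(vtree)

# Names identify the variables; values are palette numbers.
# Assumed mapping: 3 = Greens, 5 = Purples (order of the list above).
t_pal <- vtree(FakeData, "Severity Sex",
               palette = c(Severity = 3, Sex = 5))
```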

Examples

# NOT RUN {
# A single-level hierarchy
vtree(FakeData,"Severity")

# A two-level hierarchy
vtree(FakeData,"Severity Sex")

# A two-level hierarchy with pruning of some values of Severity
vtree(FakeData,"Severity Sex",prune=list("Severity"=c("Moderate","NA")))

# Rename some nodes
vtree(FakeData,"Severity Sex",labelnode=list(Sex=c("Male"="M","Female"="F")))

# Rename a variable
vtree(FakeData,"Severity Sex",labelvar=c(Severity="How bad?"))

# Show legend. Put labels on the same line as counts and percentages
vtree(FakeData,"Severity Sex Viral",sameline=TRUE,showlegend=TRUE)

# Using the summary parameter to list ID numbers (truncated to 40 characters) in specified nodes
vtree(FakeData,"Severity Sex",summary="id \nid = %list% %var=Severity% %trunc=40%")

# }