Histogram: Histogram

Description

Abbreviation: hs

Based on the standard R function hist, plots a frequency histogram with default colors, including background color and grid lines, plus an option for a relative frequency and/or cumulative histogram. Also provides summary statistics and a table of the bins, midpoints, counts, proportions, cumulative counts and cumulative proportions. Bins can be selected in several ways besides the default, including specifying just the bin width and/or the bin start. Error diagnostics are also improved: when the bins do not contain all of the specified data, the user receives feedback on how to correct the problem.

If a set of multiple variables is provided, including an entire data frame, then each numeric variable in that set is analyzed, with the option to write the resulting histograms to separate pdf files. The related CountAll function does the same for all variables in the set, with histograms for continuous variables and bar charts for categorical variables. Specifying a by1 or by2 variable implements Trellis graphics.

When the output is assigned to an object, such as h in h <- hs(Y), the pieces of output can be accessed for later analysis. A primary such analysis is dynamic report generation with knitr, from an R markdown file generated according to the Rmd option, in which interpretative R output is embedded in documents. See Value below.

Usage

Histogram(x=NULL, data=d, rows=NULL,
          stat_x=c("count", "proportion"),
          n_cat=getOption("n_cat"), Rmd=NULL,
    by1=NULL, by2=NULL,
    n_row=NULL, n_col=NULL, aspect="fill",
    bin_start=NULL, bin_width=NULL, bin_end=NULL, breaks="Sturges",
    theme=getOption("theme"),
    fill=getOption("bar_fill_cont"),
    color=getOption("bar_color_cont"),
    trans=getOption("trans_bar_fill"),
    values=FALSE,
    reg="snow2", cumulate=c("off", "on", "both"),
    xlab=NULL, ylab=NULL, main=NULL, sub=NULL,
    lab_adj=c(0,0), margin_adj=c(0,0,0,0),
    rotate_x=getOption("rotate_x"), rotate_y=getOption("rotate_y"),
    offset=getOption("offset"),
    scale_x=NULL, scale_y=NULL,
    density=FALSE, dn.hist=TRUE,
    bw=NULL, type=c("general", "normal", "both"),
    color_gen="gray20", color_nrm="gray20",
    fill_hist=getOption("se_fill"), fill_nrm=NULL, fill_gen=NULL,
    x.pt=NULL, y_axis=FALSE,
    rug=FALSE, color_rug="black", size_rug=0.5,
    add=NULL, x1=NULL, y1=NULL, x2=NULL, y2=NULL,
    eval_df=NULL, digits_d=NULL, quiet=getOption("quiet"), do_plot=TRUE,
    width=6, height=6, pdf_file=NULL, 
    fun_call=NULL, …)
hs(…)

Arguments

x

Variable(s) to analyze. Can be a single numerical variable, either within a data frame or as a vector in the user's workspace, or multiple variables in a data frame such as designated with the c function, or an entire data frame. If not specified, then defaults to all numerical variables in the specified data frame, d by default.

data

Optional data frame that contains the variable(s) of interest, default is d.

rows

A logical expression that specifies a subset of rows of the data frame to analyze.

stat_x

Bin and transform values of variable x into counts with "count", the default, or into proportions with "proportion", that is, frequencies or relative frequencies.

n_cat

For the analysis of multiple variables, such as a data frame, specifies the largest number of unique values of a variable of a numeric data type for which the variable is analyzed as categorical. Default is 0.

Rmd

File name for the file of R markdown to be written, if specified. The file type is .Rmd, which automatically opens in RStudio, but it is a simple text file that can be edited with any text editor, including RStudio.

by1

A categorical variable called a conditioning variable that activates Trellis graphics, from the lattice package, to provide a separate histogram (panel) of the numeric primary variable x for each level of the variable.

by2

A second conditioning variable to generate Trellis plots jointly conditioned on both the by1 and by2 variables, with by2 as the row variable, which yields a histogram (panel) of x for each cross-classification of the levels of the by1 and by2 variables.

n_row

Optional specification for the number of rows in the layout of a multi-panel display with Trellis graphics. Need not specify n_col.

n_col

Optional specification for the number of columns in the layout of a multi-panel display with Trellis graphics. Need not specify n_row. If set to 1, then the strip that labels each group is moved to the left of each plot instead of the top.

aspect

Lattice parameter for the aspect ratio of the panels, defined as height divided by width. The default value is "fill" to have the panels expand to occupy as much space as possible. Set to 1 for square panels. Set to "xy" to specify a ratio calculated to "bank" to 45 degrees, that is, with the line slope approximately 45 degrees.

bin_start

Optional specified starting value of the bins.

bin_width

Optional specified bin width, which can be specified with or without a bin_start value.

bin_end

Optional specified value that is within the last bin, so the actual endpoint of the last bin may be larger than the specified value.

breaks

The method for calculating the bins, or an explicit specification of the bins, such as with the standard R seq function or other options provided by the hist function that include the default "Sturges" plus "Scott" and "FD". Not applicable and so not allowed if density is TRUE.

theme

Color theme for this analysis. Make persistent across analyses with style.

fill

Fill color of the bars. Can explicitly choose "grays" or "hcl" colors, or pre-specified R color schemes "rainbow", "terrain", and "heat". Can also provide pre-defined color ranges "blues", "reds" and "greens", as well as custom colors, such as generated by getColors. Default is bar_fill_cont from the lessR style function.

color

Border color of the bars. Can be a vector to customize the color for each bar. Default is bar_color_cont from the lessR style function.

trans

Transparency factor of the fill of each bar. Default is trans_bar_fill from the lessR style function.

values

Replaces the standard R labels option, which has multiple definitions in R. Specifies whether to display the count of each bin.

reg

The color of the superimposed, regular histogram when cumulate="both".

cumulate

Specify a cumulative histogram. The value of "on" displays the cumulative histogram, with default of "off". The value of "both" superimposes the regular histogram.

xlab

Label for x-axis. Defaults to the variable name unless variable labels are present, then defaults to also include the corresponding variable label. Can style with the lessR style function.

ylab

Label for y-axis. Defaults to Frequency or Proportion. Can style with the lessR style function.

main

Label for the title of the graph. Can set size with main_cex and color with main_color from the lessR style function.

sub

Sub-title of graph, below xlab. Not yet implemented.

lab_adj

Two-element vector, x-axis label and y-axis label, that adjusts the position of the axis labels in approximate inches. Positive values move the labels away from the plot edge. Not applicable to Trellis graphics.

margin_adj

Four-element vector, top, right, bottom and left, that adjusts the margins of the plotted figure in approximate inches. Positive values move the corresponding margin away from the plot edge. Not applicable to Trellis graphics.

rotate_x

Degrees that the x-axis values are rotated, usually to accommodate longer values, typically used in conjunction with offset. Can set persistently with the lessR style function.

rotate_y

Degrees that the y-axis values are rotated. Can set persistently with the lessR style function.

offset

The amount of spacing between the axis values and the axis. Default is 0.5. Larger values such as 1.0 are used to create space for the label when longer axis value names are rotated. Can set persistently with the lessR style function.

scale_x

If specified, a vector of three values that define the numerical values of the x-axis: starting, ending and number of intervals, within the bounds of plot region.

scale_y

Applies to the y-axis. See scale_x.

density

If TRUE, plot the smoothed kernel density estimate.

dn.hist

When density is TRUE, plot a histogram behind the density curve.

bw

Bandwidth of kernel density estimation. Initial value is "nrd0", but unless specified, it may be iterated upward to create a smoother curve.

type

Type of density curve plotted. By default, the general density is plotted, though can request the normal density and both densities.

color_gen

Color of the general density curve.

color_nrm

Color of the normal curve.

fill_hist

Fill color for the histogram behind density curve, defaults to a light gray.

fill_nrm

Fill color for the estimated normal curve, with a partially transparent blue as the default, and transparent for the gray theme.

fill_gen

Fill color for the estimated general density curve, with a partially transparent light red as the default, and a light transparent gray for the gray theme.

x.pt

Value of the point on the x-axis around which to draw a unit interval that illustrates the corresponding area under the general density curve. Only applies when requesting type="general".

y_axis

Specifies if the y-axis, the density axis, should be included.

rug

If TRUE, add a rug plot, a direct display of density in the form of a narrow band beneath the density curve.

color_rug

Color of the rug ticks.

size_rug

Line width of the rug ticks.

add

Draw one or more objects, text or geometric figures, on the plot. Possible values are any text to be written, the first argument, which is "text", or, to indicate a figure, "rect" (rectangle), "line", "arrow", "v.line" (vertical line), and "h.line" (horizontal line). The value "means" is short-hand for vertical and horizontal lines at the respective means. Does not apply to Trellis graphics. Customize with parameters such as add_fill and add_color from the style function.

x1

First x coordinate to be considered for each object. All coordinates vary from -1 to 1.

y1

First y coordinate to be considered for each object.

x2

Second x coordinate to be considered for each object. Only used for "rect", "line" and "arrow".

y2

Second y coordinate to be considered for each object. Only used for "rect", "line" and "arrow".

eval_df

Determines whether to check for an existing data frame and the specified variables. The default is TRUE unless the shiny package is loaded, in which case it is set to FALSE so that Shiny will run. Needs to be set to FALSE if using the pipe %>% notation.

digits_d

Number of significant digits for each of the displayed summary statistics.

quiet

If set to TRUE, no text output. Can change system default with style function.

do_plot

If TRUE, the default, then generate the plot.

width

Width of the plot window in inches, defaults to 6.

height

Height of the plot window in inches, defaults to 6.

pdf_file

Name of the pdf file to which to direct the graphics.

fun_call

Function call. Used with knitr to pass the function call when obtained from the abbreviated function call hs.

…

Other parameter values for graphics as defined and processed by hist, and by par for general graphics:
xlim and ylim for setting the range of the x and y-axes
cex.main for the size of the title
col.main for the color of the title
cex for the size of the axis value labels
col.lab for the color of the axis labels

Value

The output can optionally be saved into an R object, otherwise it simply appears in the console. Two different types of components are provided: the pieces of readable output, and a variety of statistics. The readable output consists of character strings such as tables amenable for display. The statistics are numerical values amenable for further analysis. The motivation for these types of output is to facilitate R Markdown documents, as the name of each piece, preceded by the name of the saved object and a $, can be inserted into the R Markdown document (see examples), interspersed with explanation and interpretation.

READABLE OUTPUT
out_suggest: Suggestions for other similar analyses
out_summary: Summary statistics
out_freq: Frequency distribution
out_outliers: Outlier analysis

STATISTICS
bin_width: Bin width
n_bins: Number of bins
breaks: Breaks of the bins
mids: Bin midpoints
counts: Bin counts
prop: Bin proportion
cumulate: Bin cumulative counts
cprop: Bin cumulative proportion
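As a rough illustration of what these statistics represent, the following sketch derives the same kinds of quantities from the base R hist function alone. It does not call lessR, so the internal computations of Histogram may differ; the data vector Y and the object names h0 and freq_table are hypothetical.

```r
# Sketch in base R only: lessR is not used, and its internals may differ.
set.seed(1)
Y <- rnorm(50, mean = 50, sd = 10)     # hypothetical data

h0 <- hist(Y, plot = FALSE)            # default "Sturges" bins, no plot

bin_width <- diff(h0$breaks)[1]        # equal-width bins, so the first width suffices
n_bins    <- length(h0$counts)         # number of bins
prop      <- h0$counts / sum(h0$counts)  # bin proportions
cum_count <- cumsum(h0$counts)         # cumulative counts
cprop     <- cumsum(prop)              # cumulative proportions

# Assemble a frequency table similar in spirit to the out_freq display
freq_table <- data.frame(
  lower = head(h0$breaks, -1),
  upper = tail(h0$breaks, -1),
  mid   = h0$mids,
  count = h0$counts,
  prop, cum_count, cprop
)
print(freq_table)
```

With an object saved from lessR, such as h <- hs(Y), the corresponding values would instead be read directly from the listed components, for example h$counts or h$cprop.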

Details

OVERVIEW Results are based on the standard R hist function to calculate and plot a histogram, or a multi-panel display of histograms with Trellis graphics, plus the additional provided color capabilities, a relative frequency histogram, summary statistics and outlier analysis. The freq option from the standard R hist function has no effect as it is always set to FALSE in each internal call to hist. To plot densities, use the lessR function Density.

VARIABLES and TRELLIS PLOTS At a minimum there is one primary variable, x, which results in a single histogram. Trellis graphics, from Deepayan Sarkar's lattice package, may be implemented in which multiple panels are displayed according to the levels of one or two categorical variables, called conditioning variables. A variable specified with by1 is a conditioning variable that results in a Trellis plot, the histogram of x produced at each level of the by1 variable. Inclusion of a second conditioning variable, by2, results in a separate histogram for each combination of cross-classified values of both by1 and by2.
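To make the idea of conditioning concrete, here is a minimal sketch that calls the lattice histogram function directly rather than Histogram. The data frame emp and its values are hypothetical, and the lessR display adds its own styling, so the appearance will not match exactly.

```r
library(lattice)   # recommended package included with standard R installations

# Hypothetical data: one numeric variable and two categorical variables
set.seed(3)
emp <- data.frame(
  Salary = round(rnorm(60, mean = 60000, sd = 12000)),
  Dept   = factor(rep(c("ADMN", "SALE", "TECH"), times = 20)),
  Gender = factor(rep(c("F", "M"), each = 30))
)

# One conditioning variable: one panel for each level of Dept
p1 <- histogram(~ Salary | Dept, data = emp, type = "count")

# Two conditioning variables: one panel for each Dept-by-Gender combination
p2 <- histogram(~ Salary | Dept * Gender, data = emp, type = "count")

print(p1)
print(p2)
```

The first plot has three panels and the second has six, one for each cross-classification of the two conditioning variables.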

DATA The data may either be a vector from the global environment, the user's workspace, as illustrated in the examples below, or one or more variables in a data frame, or a complete data frame. The default input data frame is d. Can specify the source data frame name with the data option. If multiple variables are specified, only the numerical variables in the list of variables are analyzed. The variables in the data frame are referenced directly by their names, that is, no need to invoke the standard R mechanisms of the d$name notation, the with function or the attach function. If the name of the vector in the global environment and of a variable in the input data frame are the same, the vector is analyzed.

To obtain a histogram of each numerical variable in the d data frame, use Histogram(). Or, for a data frame with a different name, insert the name between the parentheses. To analyze a subset of the variables in a data frame, specify the list with either a : or the c function, such as m01:m03 or c(m01,m02,m03).

The rows parameter subsets rows (cases) of the input data frame according to a logical expression. Use the standard R operators for logical statements as described in Logic, such as & for and, | for or, and ! for not, and use the standard R relational operators as described in Comparison, such as == for logical equality, != for not equals, and > for greater than. See the Examples.
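The logical expression given to rows follows ordinary R semantics. The sketch below shows the equivalent selection written with base R subsetting, on a small hypothetical data frame; it only illustrates which cases such an expression retains and is not how lessR evaluates the expression internally.

```r
# Hypothetical data frame with the variable names used in the Examples
d <- data.frame(
  Gender = c("M", "F", "M", "M", "F"),
  Years  = c(2, 7, 9, 6, 4),
  Salary = c(41000, 58000, 72000, 65000, 49000)
)

# Same condition as rows=(Gender=="M" & Years > 5), written in base R
keep <- d$Gender == "M" & d$Years > 5

# Salaries of the retained cases: males employed more than 5 years
d$Salary[keep]
```

Only the third and fourth cases satisfy both conditions, so only their salaries would enter the histogram.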

COLORS Individual colors in the plot can be manipulated with options such as fill for the color of the histogram bars. A color theme for all the colors can be chosen for a specific plot with the theme option, or set persistently with the lessR function style. The default color theme is lightbronze, but a gray scale is available with "gray", and other themes are available as explained in style, such as "red" and "green". Use the option style(sub_theme="black") for a black background and partial transparency of plotted colors.

For the color options, such as fill, the value of "off" is the same as "transparent".

Set fill to a single color or a color range, of which there are many possibilities. For "hues" colors of the same chroma and luminance set fill to multiple colors all with the same saturation and brightness. Also available are the pre-specified R color schemes "rainbow", "terrain", and "heat". Can also provide pre-defined color ranges "blues", "reds" and "greens", or generate custom colors, such as from the lessR function getColors.

VARIABLE LABELS If variable labels exist, then the corresponding variable label is by default listed as the label for the horizontal axis and on the text output. For more information, see Read.

ONLY VARIABLES ARE REFERENCED The referenced variable in a lessR function can only be a variable name (or list of variable names). This referenced variable must exist in either the referenced data frame, such as the default d, or in the user's workspace, more formally called the global environment. That is, expressions cannot be directly evaluated. For example:

    > Histogram(rnorm(50))   # does NOT work

Instead, do the following:

    > Y <- rnorm(50)   # create vector Y in user workspace
    > Histogram(Y)     # directly reference Y

ERROR DETECTION A relatively common error that beginning users of the base R hist function may encounter is to manually specify a sequence of bins with the seq function that does not fully span the range of the specified data values. The result is a rather cryptic error message and program termination. Here, Histogram detects this problem before attempting to generate the histogram with hist, and then informs the user of the problem with a more detailed and explanatory error message. Moreover, the entire range of bins need not be specified to customize the bins. Instead, just a bin width need be specified, bin_width, and/or a value that begins the first bin, bin_start. If a starting value is specified without a bin width, the default Sturges method provides the bin width.
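The situation described above can be reproduced with base R alone. The following sketch shows the failure of hist with bins that stop short of the data, a simple pre-check of the kind described, and one way to build bins from only a start and a width. The helper functions spans and make_breaks are hypothetical names written for this illustration; they are not part of lessR, whose actual checks and bin construction may differ.

```r
Y <- c(12, 25, 37, 41, 48, 55, 63, 70, 86, 97)   # hypothetical data

# Bins that stop at 80, short of the largest data values
bad_breaks <- seq(0, 80, by = 10)

# Pre-check (hypothetical helper): do the breaks span the range of the data?
spans <- function(x, breaks) {
  min(breaks) <= min(x, na.rm = TRUE) && max(breaks) >= max(x, na.rm = TRUE)
}
spans(Y, bad_breaks)

# Base R hist stops with an error here; capture the message instead of halting
res <- tryCatch(hist(Y, breaks = bad_breaks, plot = FALSE),
                error = function(e) conditionMessage(e))
print(res)

# Build bins from only a start and a width (hypothetical helper)
make_breaks <- function(x, start, width) {
  stopifnot(start <= min(x, na.rm = TRUE), width > 0)
  n <- ceiling((max(x, na.rm = TRUE) - start) / width)  # bins needed to reach the max
  start + width * (0:n)
}
good_breaks <- make_breaks(Y, start = 10, width = 10)

h1 <- hist(Y, breaks = good_breaks, plot = FALSE)
sum(h1$counts) == length(Y)    # every data value falls in some bin
```

Checking the span before calling hist is what allows a clearer message to be issued than the one base R provides.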

PDF OUTPUT To obtain pdf output, use the pdf_file option, perhaps with the optional width and height options. These files are written to the default working directory, which can be explicitly specified with the R setwd function.

References

Gerbing, D. W. (2014). R Data Analysis without Programming, Chapter 5, NY: Routledge.

Gerbing, D. W. (2020). R Visualizations: Derive Meaning from Data, Chapter 4, NY: CRC Press.

Gerbing, D. W. (2021). Enhancement of the Command-Line Environment for use in the Introductory Statistics Course and Beyond, Journal of Statistics and Data Science Education, 29(3), 251-266, https://www.tandfonline.com/doi/abs/10.1080/26939169.2021.1999871.

Sarkar, Deepayan (2008) Lattice: Multivariate Data Visualization with R, Springer. http://lmdvr.r-forge.r-project.org/

Examples


# NOT RUN {
# get the data
d <- rd("Employee")


# make sure default style is active
style()


# --------------------
# different histograms
# --------------------

# histogram with all defaults
Histogram(Salary)
# short form
#hs(Salary)

# output saved for later analysis into object h
h <- hs(Salary)
# view full text output
h
# view just the outlier analysis
h$out_outliers
# list the names of all the components
names(h)

# histogram with no borders for the bars
Histogram(Salary, color="off")

# save the histogram to a pdf file
#Histogram(Salary, pdf_file="MyHistogram.pdf")  # file name is illustrative

# just males employed more than 5 years
Histogram(Salary, rows=(Gender=="M" & Years > 5))

# histogram with red bars, black background, and black border
style(panel_fill="black", fill="red", panel_color="black")
Histogram(Salary)
# or use a lessR pre-defined sequential color palette
#   with some transparency
Histogram(Salary, fill="rusts", color="brown", trans=.1)

# histogram with purple color theme, translucent gold bars
style("purple", sub_theme="black")
Histogram(Salary)
# back to default color theme
style()

# histogram with specified bin width
# can also use bin_start
Histogram(Salary, bin_width=12000)

# histogram with rotated axis values, offset more from axis
# suppress text output
style(rotate_x=45, offset=1)
Histogram(Salary, quiet=TRUE)
style()

# histogram with specified bins and grid lines displayed over the histogram
Histogram(Salary, breaks=seq(0,150000,20000), xlab="My Variable")

# histogram with bins calculated with the Scott method and values displayed
Histogram(Salary, breaks="Scott", values=TRUE, quiet=TRUE)

# histogram with the number of suggested bins, with proportions
Histogram(Salary, breaks=15, stat_x="proportion")

# histogram with non-default values for x- and y-axes
d[2,4] <- 45000
Histogram(Salary, scale_x=c(30000,130000,5), scale_y=c(0,9.5,5))

# ----------------
# Trellis graphics
# ----------------
Histogram(Salary, by1=Dept)


# ---------------------
# cumulative histograms
# ---------------------

# cumulative histogram with superimposed regular histogram, all defaults
Histogram(Salary, cumulate="both")

# cumulative histogram plus regular histogram
Histogram(Salary, cumulate="both", reg="mistyrose")

# density
Histogram(Salary, density=TRUE)


# -------------------------------------------------
# histograms for data frames and multiple variables
# -------------------------------------------------

# create data frame, d, to mimic reading data with Read function
# d contains both numeric and non-numeric data
d <- data.frame(rnorm(50), rnorm(50), rnorm(50), rep(c("A","B"),25))
names(d) <- c("X","Y","Z","C")

# although data not attached, access the variable directly by its name
Histogram(X)

# histograms for all numeric variables in data frame called d
#  except for numeric variables with unique values < n_cat
# d is the default name, so does not need to be specified with data
Histogram()

# histogram with specified options, including red axis labels
style(fill="palegreen1", panel_fill="ivory", axis_color="red") 
Histogram(values=TRUE)
style()  # reset

# histograms for all specified numeric variables
# use the combine or c function to specify a list of variables
Histogram(c(X,Y))


# -----------
# annotations
# -----------

d <- rd("Employee")

# Place a message in the top-right of the graph
# Use \n to indicate a new line
hs(Salary, add="Salaries\nin our Company", x1=100000, y1=7)

# Use style to change some parameter values
style(add_trans=.8, add_fill="gold", add_color="gold4",
      add_lwd=0.5, add_cex=1.1)
# Add a rectangle around the message centered at <100000,7>
hs(Salary, add=c("rect", "Salaries\nin our Company"),
      x1=c(82000, 100000), y1=c(7.7, 7), x2=118000, y2=6.2)
# }
