kutils (version 1.69)

peek: Show variables, one column at a time.

Description

This makes it easy to quickly scan through all of the columns in a data frame to spot unexpected patterns or data entry errors. Numeric variables are depicted as histograms, while factor and character variables are summarized by the R table function and then presented as barplots. The previous edition of this function was histOMatic, which was intended only for numeric variables. That name is now an alias for this function.

Usage

peek(dat, sort = TRUE, file = NULL, textout = FALSE, ask, ...,
  xlabstub = "kutils peek: ", freq = FALSE,
  histargs = list(probability = !freq), barargs = list(horiz = TRUE,
  las = 1))

Arguments

dat

An R data frame or something that can be coerced to a data frame by as.data.frame

sort

Default TRUE. Should the columns be displayed in alphabetical order?

file

Should output go to a file rather than to the screen? Default is NULL, meaning show on screen. If you supply a file name, PDF output is written to it.

textout

If TRUE, counts from histogram bins and tables will appear in the console.

ask

As in the old style R par(ask = TRUE): should keyboard interaction advance to the next plot? Defaults to FALSE if the file argument is non-NULL. If file is NULL, setting ask = FALSE will cause graphs to whir by without pausing.

...

Additional arguments for the pdf, histogram, table, or barplot functions. Please see Details below.

xlabstub

A text stub that will appear in the x axis label. Currently it includes advertising for this package.

freq

As in the freq argument of the hist function. Should graphs show counts (freq = TRUE) or proportions (AKA densities) (freq = FALSE)?

histargs

A list of arguments to be passed to the hist function.

barargs

A list of arguments to be passed to the barplot function.

Value

A vector of column names that were plotted

Try the Defaults

Every effort has been made to make this simple and easy to use. Please run the examples as they are before becoming too concerned about customization. This function is intended for getting a quick look at each variable, one by one; it is not intended to create publication-quality histograms. Most users won't need to customize the arguments, but for the sake of the fastidious few, a lot of settings can be adjusted. This draws histograms for numeric variables, and the R hist function allows a great many arguments. It draws barplots for factor or character variables, and that brings the table and barplot functions into the picture.

Style

The histograms are standard, upright histograms. The barplots are horizontal. I recognize that is a style clash. I chose to make the bars horizontal because long value labels are more easily accommodated on the left axis. The code measures the width (in inches) of the label strings and the left margin is increased accordingly. The examples have a demonstration of that effect.
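
The margin adjustment described above can be sketched roughly as follows. This is an illustration only, not the kutils source: the label vector, the counts, and the 0.4 inch padding are made up for the example, and it assumes the measurement is done with strwidth and the margin is set through par(mai = ...).

```r
## Sketch only, not the kutils source: widen the left margin for long labels.
labs <- c("cindy", "bobby", "arthur philpot smythe")  # hypothetical labels
counts <- c(12, 7, 1)                                 # hypothetical counts

pdf(NULL)                          # null device, so no file or screen is needed
oldpar <- par(no.readonly = TRUE)  # remember settings so they can be restored

## Width of the longest label, in inches, at the current text size
labwidth <- max(strwidth(labs, units = "inches"))

## par("mai") is c(bottom, left, top, right) in inches; enlarge the left entry
newmai <- oldpar$mai
newmai[2] <- max(oldpar$mai[2], labwidth + 0.4)  # 0.4 in. padding is arbitrary
par(mai = newmai)

## Horizontal bars with labels written horizontally on the left axis
barplot(counts, names.arg = labs, horiz = TRUE, las = 1)

par(oldpar)
dev.off()
```

With las = 1 the labels are drawn horizontally, which is why the left margin, rather than the bottom margin, has to grow with the longest label.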

Dealing with Dots

This has a fairly elaborate setup for dealing with the additional arguments, which end up in "...". It is necessary to separate the arguments among the functions table, pdf, hist and barplot. If we send an argument like plot to the table function, for example, there will be a warning that we want to avoid. The plan is to separate arguments as well as possible, so that an argument known to be used by only one function is sent only to that function.

These arguments go to the table function (which is used to make barplots for factor and character variables): c("exclude", "dnn", "useNA", "deparse.level"). These arguments are extracted and sent to the pdf function: c("width", "height", "onefile", "family", "title", "fonts", "version", "paper", "encoding", "bg", "fg", "pointsize", "pagecentre", "colormodel", "useDingbats", "useKerning", "fillOddEven", "compress"). Any other arguments that are unique to hist or barplot are sorted out and sent only to those functions. Any remaining arguments, including graphical parameters (named arguments of par, for example), are sent to both the hist and barplot functions, so this is a convenient way to obtain a uniform appearance.

However, if one wants to target some arguments to hist, then the histargs list argument should be used. Similarly, barargs should be used to send arguments to the barplot function. Warning: the defaults for histargs and barargs include some settings that are needed for the existing design. If new lists for histargs or barargs are supplied, the previously specified defaults are lost. Hence, users should include the existing members of those lists, possibly with revised values.

All of this argument sorting effort is done in order to reduce the prolific warnings that were observed in previous editions of this function.
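
The sorting idea can be illustrated with a small sketch. It is a simplified stand-in, not the kutils implementation: splitDots is a hypothetical helper name, and only the table and pdf argument names listed above are used. The last lines show one way, assuming base R's modifyList, to add a member to barargs without losing the defaults shown in Usage.

```r
## Sketch only, not the kutils source. splitDots is a hypothetical helper.
tablenames <- c("exclude", "dnn", "useNA", "deparse.level")
pdfnames <- c("width", "height", "onefile", "family", "title", "fonts",
              "version", "paper", "encoding", "bg", "fg", "pointsize",
              "pagecentre", "colormodel", "useDingbats", "useKerning",
              "fillOddEven", "compress")

splitDots <- function(...) {
    dots <- list(...)
    ## Route each named argument by membership in the known name sets
    list(table = dots[names(dots) %in% tablenames],
         pdf = dots[names(dots) %in% pdfnames],
         shared = dots[!names(dots) %in% c(tablenames, pdfnames)])
}

res <- splitDots(useNA = "always", width = 8, height = 5,
                 breaks = 30, col = "gray")
str(res)

## Keep the barargs defaults from Usage while adding one new member
bardefaults <- list(horiz = TRUE, las = 1)
mybarargs <- modifyList(bardefaults, list(xlab = "I'm in barargs"))
## mybarargs could then be supplied as peek(mydf, barargs = mybarargs)
```

Starting from the default list and modifying it is less error-prone than retyping horiz = TRUE and las = 1 by hand each time a new barargs list is supplied.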

Examples

# NOT RUN {
set.seed(234234)
N <- 200
mydf <- data.frame(x5 = rnorm(N), x4 = rnorm(N), x3 = rnorm(N),
                   x2 = letters[sample(1:24, 200, replace = TRUE)],
                   x1 = factor(sample(c("cindy", "bobby", "marsha",
                                        "greg", "chris"), 200, replace = TRUE)),
                   stringsAsFactors = FALSE)
## Insert 16 missings
mydf$x1[sample(1:150, 16)] <- NA
## Four-digit years need %Y (%y would read only two digits)
mydf$adate <- as.Date(c("1jan1960", "2jan1960", "31mar1960", "30jul1960"), format = "%d%b%Y")
peek(mydf, width = 8, height = 5)
dev.off()
peek(mydf, sort = FALSE)
dev.off()
## Demonstrate the dot-dot-dot usage to pass in hist params
peek(mydf, breaks = 30, ylab = "These are Counts, not Densities", freq = TRUE)
dev.off()
## Not Run: file output
## peek(mydf, sort = FALSE, file = "three_histograms.pdf")
## Use some objects from the datasets package
library(datasets)
peek(cars, xlabstub = "R cars data: ")
dev.off()
peek(EuStockMarkets, xlabstub = "Euro Market Data: ")
dev.off()
peek(EuStockMarkets, xlabstub = "Euro Market Data: ", breaks = 50,
     freq = TRUE)
dev.off()
## Not run: file output
## peek(EuStockMarkets, breaks = 50, file = "myeuro.pdf",
##      height = 4, width=3, family = "Times")
## peek(EuStockMarkets, breaks = 50, file = "myeuro-%03d.pdf",
##      onefile = FALSE, family = "Times", textout = TRUE)
## ylab goes into "..." and affects both histograms and barplots
peek(mydf, breaks = 30, ylab = "These are Counts, not Densities",
    freq = TRUE)
dev.off()
## xlab is added in the barargs list.
peek(mydf, breaks = 30, ylab = "These are Counts, not Densities",
    freq = TRUE, barargs = list(horiz = TRUE, las = 1, xlab = "I'm in barargs"))
dev.off()
peek(mydf, breaks = 30, ylab = "These are Counts, not Densities", freq = TRUE,
     barargs = list(horiz = TRUE, las = 1, xlim = c(0, 100),
     xlab = "I'm in barargs, not in histargs"))
levels(mydf$x1) <- c(levels(mydf$x1), "arthur philpot smythe")
mydf$x1[4] <- "arthur philpot smythe"
mydf$x2[1] <- "I forgot what letter"
peek(mydf, breaks = 30,
     barargs = list(horiz = TRUE, las = 1))
dev.off()
# }
