sp, ScatterPlot
From the identical syntax, for variables X and Y, of Plot(X) or Plot(X,Y), a family of related 1- or 2-dimensional scatterplots and related statistical analyses results from any combination of continuous or categorical variables: the traditional scatterplot of two continuous variables, a bubble (balloon) scatterplot of two categorical variables, a scatterplot with means at each level of a categorical variable paired with a continuous variable, and a Cleveland dot plot, a scatterplot that pairs a continuous variable with each unique value of an ID-variable for a data table. For multiple plots on the same graph, specify multiple x-variables or y-variables. Summarize univariate distributions with either a 1-dimensional scatterplot of a continuous variable, or with a 1-dimensional bubble plot for a categorical variable as a more compact replacement of the traditional bar chart. From the specification of multiple categorical x-variables that share the same response scale, generalize the latter to a matrix of 1-dimensional bubble plots, here called the bubble plot frequency matrix. Choose as the topic of analysis either the data or statistics computed from the data, such as the mean, and choose among different geometric objects to plot, such as points, lines or bars.
Plot(x, y=NULL, by=NULL, data=mydata, n.cat=getOption("n.cat"), topic=c("data", "count", "prop", "mean", "sd", "min", "median", "max",
"diff"),
object=c("point", "line", "both", "bubble", "sunflower", "bar", "off"),
color.fill=getOption("color.fill.pt"),
color.stroke=getOption("color.stroke.pt"),
color.bg=getOption("color.bg"),
color.grid=getOption("color.grid"),
color.box=getOption("color.box"),
color=NULL, color.trans=NULL, color.area=NULL,
cex.axis=0.76, color.axis="gray30", xy.ticks=TRUE,
xlab=NULL, ylab=NULL, main=NULL, sub=NULL,
value.labels=NULL, rotate.values=0, offset=0.5,
size=NULL, shape="circle", means=TRUE,
sort.yx=FALSE, segments.y=FALSE, segments.x=FALSE,
bubble.size=0.25, bubble.power=0.6, bubble.counts=TRUE,
color.low=NULL, color.hi=NULL,
fit.line=NULL, color.fit.line="gray55",
ellipse=FALSE, color.ellipse="lightslategray",
color.fill.ellipse="off",
method="overplot", pt.reg="circle", pt.out="circle",
color.out30="firebrick2", color.out15="firebrick4", new=TRUE,
breaks="Sturges", bin.start=NULL, bin.width=NULL, bin.end=NULL,
prop=FALSE, cumul=c("off", "on", "both"), hist.counts=FALSE,
color.reg="snow2",
beside=FALSE, horiz=FALSE,
over.grid=FALSE, addtop=0.05, gap=NULL, count.labels=NULL,
legend.title=NULL, legend.loc="right.margin", legend.labels=NULL,
legend.horiz=FALSE,
digits.d=NULL, quiet=getOption("quiet"),
pdf.file=NULL, pdf.width=NULL, pdf.height=NULL,
fun.call=NULL, ...)
ScatterPlot(...)
sp(...)
object="point"
mydata
.x
is specified, then only "counts"
and
proportion
apply.
"point"
scatterplot for numerical variables unless
each variable has fewer than n.cat
integer values, by default 8, when
a bubble plot is plotted with the correspondicolor.stroke
.
If y-values are unique, as in a Cleveland dot plot, then no transparency by
defaultby
variable,
specified as a vector, one value for each level of by
."snow3"
."black"
.color.stroke
and color.fill
, and
takes precedence over their individually specified values.xlab
is not specified, then the label becomes
the name of the corresponding variable label if it exists, or, if not, the
variable name. If xy.ticks
is FALSE
, then no label is displaxlab
is not specified, then the label becomes
the name of the corresponding variable label if it exists, or, if not, the
variable name. If xy.ticks
is FALSE
, then no label displayeNULL
), then the
value.labels are set to the factor leoffset
.bubble.size
with a different metric.color.stroke
and color.fill
.
Possible values are circle
, square
, diamond
,
TRUE
, by default, plot means with the scatter plot.object="bubble"
.TRUE
, then for a bubble plot, the count underlying a
bubble is displayed in the center of the bubble, unless the bubble is too small.
Setting this value sets default to object="bubble"
.FALSE
, with options for
"loess"
and for least squares, indicated by "ls"
. Or, if set to
TRUE
, then a loess line.fit.line
option
is invoked.TRUE
, enclose a scatterplot with the .95 data ellipse from the
ellipse package. Or can specify a single numeric value greater than 0 and less than 1,
or a vector of levels to plot multiple ellipses.ellipse
is set to TRUE
.TRUE
, fill the ellipse with color.ellipse
. Usually
specify low opacity in the color specification, as shown in the examples. If specified, ellipse
is set to TRUE
."overplot"
, but can also
provide "stack"
to stack the points or
"jitter"
to scramble the points.FALSE
, then add the 1-D scatterplot to an existing graph.bin.start
value.FALSE
."on"
displays the
cumulative histogram, with default of "off"
. The value of "both"
superimposes the regular histogram.labels
options, which has multiple
definitions in R. Specifies to display the count of each bin.cumul="both"
.TRUE
for the levels of the
first variable to be plotted as adjacent bars instead of stacked on each other.TRUE
.TRUE
, plot the grid lines over the histogram.horiz=FALSE
, in the same scale as the vertical axis, puts
more space between the bars and the top of the plot area, usually to
accommodate the legend when plotting two variables. now a multiplicative
factor instead ospace
option from
the standard R barplot
function with a default of 0.2 unless two
variables are plotted and beside=TRUE
,x
has values that are counts, already tabulated. The
specified variable here contains the names of the levels of x.
by
variable.TRUE
, no text output. Can change system default
with theme
function.knitr
to pass the function call when
obtained from the abbreviated function call sp
.plot
, with an analysis of the correlation coefficient including hypothesis test and confidence interval. Two categorical variables, such as for Likert-style analysis, produce a bubble plot, in which the size of each plotted point indicates the corresponding joint frequency, and a corresponding cross-tabulation analysis. This analysis is an alternative to the traditional BarChart
. A categorical variable paired with a numeric variable yields a scatter plot with the means of each level of the categorical variable also plotted, and the summary statistics of the numeric variable for each level of the categorical variable. More information is obtained by listing the categorical variable first in the function call. If the values of the first variable are numeric and sorted with equal intervals, then points are connected via line segments. If there is only one variable, a 1-dimensional scatter plot is produced for a numeric variable, based on the standard R function stripchart
, and a 1-dimensional bubble plot is produced for a factor, with corresponding statistics.
The value labels for each axis can be overridden from their values in the data to user-supplied values with the value.labels
option. This option is particularly useful for Likert-style data coded as integers. Then, for example, a 0 in the data can be mapped into a "Strongly Disagree" on the plot. These value labels apply to integer categorical variables, and also to factor variables. To enhance the readability of the labels on the graph, any blanks in a value label translate into a new line in the resulting plot. Blanks are also transformed as such for the labels of factor variables.
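The blank-to-newline rule for value labels can be illustrated in base R. This is only a sketch of the idea, not the lessR source; the names likert_labels, axis_labels, responses and mapped are made up for the illustration.

```r
# Hypothetical labels for integer responses coded 0, 1, 2
likert_labels <- c("Strongly Disagree", "Disagree", "Slightly Disagree")

# Replace each blank with a newline so that long labels stack on the axis
axis_labels <- gsub(" ", "\n", likert_labels, fixed = TRUE)

# Map the integer codes onto the labels (code + 1 indexes the vector)
responses <- c(0, 2, 1, 0)
mapped <- axis_labels[responses + 1]
print(mapped)
```

Because the codes start at 0 here, the lookup adds 1; data coded from 1 would index the label vector directly.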
DATA
The default input data frame is mydata
. Specify another name with the data
option. Regardless of its name, the data frame need not be attached to reference the variables directly by their names, that is, there is no need to invoke the mydata$name
notation. The referenced variables can be in the data frame and/or the user's workspace, the global environment.
CATEGORICAL VARIABLES
Categorical variables have relatively few unique data values. The standard and most general way to define a categorical variable is as an R factor, illustrated in the examples for the Transform
function. lessR
also provides the option of defining an integer variable with equally spaced values as categorical based on the value of n.cat
, which can be set locally or globally with the theme
function. For example, for a variable with data values from a 5-point Likert scale, a value of n.cat
of 5 will define the variable as categorical. The default value is 8. To explicitly analyze the values as numerical, set n.cat
to a value lower than 5, usually 0. The values of an integer categorical variable can also be annotated on the graph with the value.labels
option.
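The n.cat decision can be pictured with a small base R helper. This is a sketch under assumptions, not the lessR implementation: the helper name is_categorical is hypothetical, and the rule it encodes (equally spaced integer values with at most n_cat unique values) is one reading of the description above.

```r
# Hypothetical helper: factors and other non-numeric variables are categorical;
# an integer variable with equally spaced values is treated as categorical
# when it has at most n_cat unique values (assumed rule).
is_categorical <- function(x, n_cat = 8) {
  if (!is.numeric(x)) return(TRUE)
  vals <- sort(unique(x[!is.na(x)]))
  if (length(vals) == 0 || any(vals != round(vals))) return(FALSE)
  equally_spaced <- length(unique(diff(vals))) <= 1
  equally_spaced && length(vals) <= n_cat
}

likert <- c(1, 5, 3, 2, 4, 4, 1)   # hypothetical 5-point responses
is_categorical(likert)             # categorical under the default n_cat = 8
is_categorical(likert, n_cat = 0)  # n_cat = 0 forces a numeric analysis
```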
A scatterplot of Likert-type data is problematic because there are so few possibilities for points in the scatterplot. For example, for a scatterplot of two variables with five-point Likert response scales, there are only 25 possible paired values to plot, so most of the plotted points overlap with others. In this situation, that is, when a single variable or two variables with Likert response scales are specified, a bubble plot is automatically provided, with the size of each point relative to the joint frequency of the paired data values. A sunflower plot can be requested in lieu of the bubble plot with the object
option.
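The idea behind the bubble plot can be sketched with base R graphics: tabulate the joint frequencies, then size each plotted symbol by its count. The data are simulated, the names item1, item2 and counts are hypothetical, and the square-root scaling is an assumption, not the scaling that lessR applies.

```r
set.seed(1)
# Simulated responses to two hypothetical 5-point Likert items
item1 <- sample(1:5, 100, replace = TRUE)
item2 <- sample(1:5, 100, replace = TRUE)

# Joint frequency of each (item1, item2) pair
counts <- as.data.frame(table(item1, item2), stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]
counts$item1 <- as.integer(counts$item1)
counts$item2 <- as.integer(counts$item2)

# Radius proportional to sqrt(frequency), so bubble area tracks the count
symbols(counts$item1, counts$item2, circles = sqrt(counts$Freq),
        inches = 0.25, bg = "lightsteelblue",
        xlab = "item1", ylab = "item2")
# Print the count in the center of each bubble
text(counts$item1, counts$item2, labels = counts$Freq, cex = 0.7)
```

Base R also offers sunflowerplot(item1, item2) as a rough counterpart to the sunflower alternative.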
ADAPTIVE GRAPHICS
Results for two variables are based on the standard plot
and related graphic functions, with additional color capabilities and other options including a center line. The plotting procedure utilizes "adaptive graphics", such that ScatterPlot
chooses different default values for different characteristics of the specified plot and data values. The goal is to produce a desired graph by simply relying upon the default values, both of the ScatterPlot
function itself, as well as the base R functions called by ScatterPlot
, such as plot
. Familiarity with the options permits complete control over the computed defaults, but this familiarity is intended to be optional for most situations.
TWO VARIABLE PLOT
When two variables are specified to plot, by default if the values of the first variable, x
, are unsorted, or if there are unequal intervals between adjacent values, or if there is missing data for either variable, a scatterplot is produced, that is, a call to the standard R plot
function with type="p"
for points. By default, sorted values with equal intervals between adjacent values of the first of the two specified variables yields a function plot if there is no missing data for either variable, that is, a call to the standard R plot
function with type="l"
, which connects each adjacent pair of points with a line segment.
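The default choice between points and a line can be written out as a small base R check. The helper name default_plot_type is hypothetical, and treating only ascending order as "sorted" is an assumption; the sketch mirrors the rule stated above and is not the lessR source.

```r
# Hypothetical helper: a line only when x is sorted, equally spaced,
# and neither variable has missing data; otherwise points.
default_plot_type <- function(x, y) {
  if (anyNA(x) || anyNA(y)) return("p")
  if (is.unsorted(x)) return("p")
  steps <- diff(x)
  equal_steps <- length(steps) > 0 &&
    isTRUE(all.equal(steps, rep(steps[1], length(steps))))
  if (equal_steps) "l" else "p"
}

x <- seq(10, 50, by = 1)
y <- sqrt(x)
plot(x, y, type = default_plot_type(x, y))  # sorted, equal steps: a line
```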
Specifying multiple x-variables against a single y-variable results in multiple plots on the same graph. The color of the points of the second variable is the same as that of the first variable, but with a transparent fill. For more than two x-variables, multiple colors are displayed, one for each x-variable.
BUBBLE PLOT FREQUENCY MATRIX (BPFM)
Multiple categorical variables for x
may be specified, without specifying a y
variable. A bubble plot results that illustrates the frequency of each response for each of the variables in a common figure. Each line of information, the bubbles and counts for a single variable, replaces the standard bar chart in a more compact display. Each variable in the matrix must have the same number of response categories, that is, levels. If not, then use the factor transformation with the levels option to ensure that the levels are the same for each variable. See the examples at the end of the Transform
function documentation. The BPFM is a considerably more condensed presentation of frequencies for a set of variables than are the corresponding bar charts.
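The frequency matrix that underlies the BPFM, and the role of a common set of levels, can be sketched in base R. The items q1, q2 and q3 and their values are invented for the illustration, and the drawing code only approximates the layout with symbols(); it is not how lessR renders the plot.

```r
# Three hypothetical items on a 0-3 response scale; q3 never uses level 3
q1 <- c(0, 1, 1, 2, 3, 3)
q2 <- c(1, 1, 2, 2, 2, 3)
q3 <- c(0, 0, 1, 1, 2, 2)
items <- data.frame(q1, q2, q3)

# Impose common levels so that every item has the same categories
common_levels <- 0:3
freq <- t(sapply(items, function(v) table(factor(v, levels = common_levels))))
print(freq)  # one row per item, one column per response level

# One row of bubbles per item, each bubble sized by its frequency
grid <- expand.grid(item = seq_len(nrow(freq)), level = seq_along(common_levels))
n <- freq[cbind(grid$item, grid$level)]
keep <- n > 0
plot(range(common_levels), c(0.5, nrow(freq) + 0.5), type = "n",
     xlab = "Response", ylab = "", yaxt = "n")
axis(2, at = seq_len(nrow(freq)), labels = rownames(freq), las = 1)
symbols(common_levels[grid$level[keep]], grid$item[keep],
        circles = sqrt(n[keep]), inches = 0.2, bg = "lightsteelblue",
        add = TRUE)
```

Without the explicit levels, q3 would tabulate to three categories instead of four and the rows would no longer line up.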
BY VARIABLE
A variable specified with by=
is a grouping variable that specifies that the plot is produced with the points for each group plotted with a different shape and/or color. By default, the shapes vary by group, and the color of the plot symbol remains the same for the groups. The default shapes, in this order, are "circle"
, "diamond"
, "square"
, "triup"
for a triangle pointed up, and "tridown"
for a triangle pointed down.
To explicitly vary the shapes, use shape
and a list of shape values in the standard R form with the c
function to combine a list of values, one specified shape for each group, as shown in the examples. To explicitly vary the colors, use color.fill
, such as with R standard color names. If color.fill
is specified without shape
, then colors are varied, but not shapes. To vary both shapes and colors, specify values for both options, always with one shape or color specified for each level of the by
variable.
Shapes beyond the standard list of named shapes, such as "circle"
, are also available as single characters. Any single letter, uppercase or lowercase, any single digit, and the characters "+"
, "*"
and "#"
are available, as illustrated in the examples. In the use of shape
, either use standard named shapes, or individual characters, but not both in a single specification.
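In base R terms, varying the shape by group amounts to choosing a plotting character for each level of the grouping variable. The sketch below uses simulated data, and the mapping from the named shapes to pch codes is an assumption about equivalent base R symbols, not a statement of what lessR uses internally.

```r
# Assumed mapping from the named shapes to base R pch codes
shape_pch <- c(circle = 21, diamond = 23, square = 22, triup = 24, tridown = 25)

set.seed(2)
years  <- runif(20, 0, 20)                         # hypothetical data
salary <- 40000 + 2500 * years + rnorm(20, sd = 8000)
gender <- factor(sample(c("F", "M"), 20, replace = TRUE))

# Shape varies by group, color stays the same
pch_by_group <- unname(shape_pch[as.integer(gender)])
plot(years, salary, pch = pch_by_group, bg = "steelblue")

# Single characters as plotting symbols, one per level of the grouping variable
plot(years, salary, pch = as.character(gender), cex = 0.6)
```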
SCATTERPLOT ELLIPSE
For a scatterplot of two numeric variables, the ellipse=TRUE
option draws the .95 data ellipse as computed by the ellipse
function, written by Duncan Murdoch and E. D. Chow, from the ellipse
package. The axes are automatically lengthened to provide space for the entire ellipse that extends beyond the maximum and minimum data values. Multiple numerical values of ellipse
may also be specified, to obtain multiple ellipses.
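For readers who want to see what a data ellipse involves, here is a base R sketch that traces the contour from the sample means and covariance matrix, assuming bivariate normality. The function data_ellipse is hypothetical and written for this illustration; it is not the code of the ellipse package.

```r
# Points on the data ellipse at the given level (hypothetical helper)
data_ellipse <- function(x, y, level = 0.95, n_points = 100) {
  center <- c(mean(x), mean(y))
  S <- cov(cbind(x, y))
  radius <- sqrt(qchisq(level, df = 2))
  angles <- seq(0, 2 * pi, length.out = n_points)
  unit_circle <- rbind(cos(angles), sin(angles))
  # t(chol(S)) maps the unit circle onto a contour of the covariance matrix
  pts <- t(center + radius * (t(chol(S)) %*% unit_circle))
  colnames(pts) <- c("x", "y")
  pts
}

e95 <- data_ellipse(faithful$eruptions, faithful$waiting)

# Lengthen the axes so that the whole ellipse fits in the plot
plot(faithful$eruptions, faithful$waiting,
     xlim = range(c(faithful$eruptions, e95[, "x"])),
     ylim = range(c(faithful$waiting, e95[, "y"])),
     xlab = "eruptions", ylab = "waiting")
lines(e95)
```

Several ellipses follow from calling the helper once per level, for example over seq(.2, .9, .1).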
ONE VARIABLE PLOT
The one variable plot is a 1-dimensional scatterplot, that is, a dot chart. For a numerical variable, results are based on the standard stripchart
function. Colors are provided by default and can also be specified. For gray scale output, potential outliers are plotted with squares and actual outliers are plotted with diamonds; otherwise shades of red are used to highlight outliers. The definition of outliers is from the R boxplot
function. The plot can also be obtained as a bubble plot for a categorical variable.
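The boxplot definition of outliers is available directly from boxplot.stats in base R. The sketch below flags values beyond 1.5 and beyond 3 interquartile ranges from the hinges; the 1.5 and 3 split between potential and actual outliers is an assumption suggested by the option names color.out15 and color.out30, and the salary values are invented.

```r
# Hypothetical salaries: a regular core plus two extreme values
salary <- c(seq(50000, 70000, by = 500), 95000, 160000)

# Values beyond 1.5 IQR (potential outliers) and beyond 3 IQR (actual outliers)
out15 <- boxplot.stats(salary, coef = 1.5)$out
out30 <- boxplot.stats(salary, coef = 3.0)$out

# Color each point by its outlier status, using the documented default colors
fill <- ifelse(salary %in% out30, "firebrick2",
        ifelse(salary %in% out15, "firebrick4", "gray70"))
stripchart(salary, method = "overplot", pch = 21, col = "gray30")
points(salary, rep(1, length(salary)), pch = 21, bg = fill)
```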
RUN CHART
Specifying one or more x-variables with no y-variables, and object="line"
plots the x-variables in a run chart, with Index on the x-axis. Index is the ordinal position of each data value, from 1 to the number of values.
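The Index construction is simple enough to show in base R. The series below is simulated and the names values and index are hypothetical; the sketch shows the idea of a run chart, not the lessR output.

```r
set.seed(4)
values <- cumsum(rnorm(30))   # hypothetical series of 30 data values
index <- seq_along(values)    # ordinal position, from 1 to the number of values

# Run chart: data values against their ordinal position, joined by line segments
plot(index, values, type = "l", xlab = "Index", ylab = "values")
```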
BINS
Specifying object="bar"
generates a histogram for a continuous variable and a bar chart for a non-numeric variable.
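How a starting point and a bin width determine the bins can be seen with hist in base R. The variables bin_start and bin_width play the role of the bin.start and bin.width options, the salary values are invented, and the way the last break is chosen is an assumption made for this sketch.

```r
# Hypothetical salaries
salary <- c(41000, 47500, 52000, 53500, 58000, 61000, 66500, 72000, 88000)

bin_start <- 40000   # plays the role of bin.start
bin_width <- 10000   # plays the role of bin.width

# Extend the breaks far enough to cover the largest value
n_bins <- ceiling((max(salary) - bin_start) / bin_width)
breaks <- seq(bin_start, bin_start + n_bins * bin_width, by = bin_width)

h <- hist(salary, breaks = breaks, main = "", xlab = "salary")
print(h$counts)  # number of values in each bin
```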
VARIABLE LABELS
Although standard R does not provide for variable labels, lessR
can store the labels in the data frame with the data, obtained from the Read
function or VariableLabels
. If variable labels exist, then the corresponding variable label is by default listed as the label for the corresponding axis and on the text output.
COLORS
Individual colors in the plot can be manipulated with options such as color.fill
for the interior color of a plotted point. A color theme for all the colors can be chosen for a specific plot with the colors
option with the lessR
function theme
. The default color theme is dodgerblue
. A gray scale is available with "gray"
, and other themes are available as explained in theme
, such as "sienna"
and "orange.black"
. Use the option ghost=TRUE
for a black background, no grid lines and partial transparency of plotted colors.
Colors can also be changed for individual aspects of a scatterplot as well. To provide a warmer tone by slightly enhancing red, try a background color such as color.bg="snow"
. Obtain a very light gray with color.bg="gray99"
. To darken the background gray, try color.bg="gray97"
or lower numbers. See the lessR
function showColors
, which provides an example of all available named colors.
For the color options, such as color.grid
, the value of "off"
is the same as
"transparent"
.
PDF OUTPUT
Because of the customized graphic windowing system that maintains a unique graphic window for the Help function, the standard graphic output functions such as pdf
do not work with the lessR
graphics functions. Instead, to obtain pdf output, use the pdf.file
option, perhaps with the optional pdf.width
and pdf.height
options. These files are written to the default working directory, which can be explicitly specified with the R setwd
function.
ADDITIONAL OPTIONS
Commonly used graphical parameters that are available to the standard R function plot
are also generally available to ScatterPlot
, such as cex.main and col.main, which appear in the final example.
ONLY VARIABLES ARE REFERENCED
A referenced variable in a lessR
function can only be a variable name. This referenced variable must exist in either the referenced data frame, such as the default mydata
, or in the user's workspace, more formally called the global environment. That is, expressions cannot be directly evaluated. For example:
> ScatterPlot(rnorm(50), rnorm(50))   # does NOT work
Instead, do the following:
> X <- rnorm(50)   # create vector X in user workspace
> Y <- rnorm(50)   # create vector Y in user workspace
> ScatterPlot(X,Y)   # directly reference X and Y
ellipse
function from the ellipse
package.
Gerbing, D. W. (2013). R Data Analysis without Programming, Chapter 8, NY: Routledge.
plot
, stripchart
, title
, par
, Correlation
, theme
.
#----------------------------------------------------
# traditional scatter plot with two numeric variables
#----------------------------------------------------

# read the Employee data into the default data table mydata
mydata <- rd("Employee", format="lessR", quiet=TRUE)

# scatterplot with default object of point and default data table mydata
Plot(Years, Salary)
# or use ScatterPlot or sp in place of Plot

# new shape and point size, no grid or background color
Plot(Years, Salary, size=2, shape="diamond", color.bg="off", color.grid="off")

# abbreviated function name
# scatterplot, with loess line and filled ellipse with low opacity, .1
# save scatterplot to a pdf file
sp(Years, Salary, fit.line=TRUE, ellipse=TRUE,
   color.fill.ellipse=rgb(.6,.3,.3,.1), pdf.file="MyScatterPlot.pdf")

# scatterplot with many ellipses
Plot(Years, Salary, ellipse=seq(.2,.9, .1))

# scatterplot with three x-variables, plotted against Salary
Plot(c(Pre, Post, Years), Salary)

# increase span (smoothing) from default of .75
# span is a loess parameter and generates a caution that can be
# ignored that it is not a graphical parameter -- we know that
#Plot(Years, Salary, fit.line="loess", span=1.25)

# change color theme to gray scale, then back to default
theme(colors="gray")
Plot(Years, Salary)
theme(colors="dodgerblue")

# variables of interest are in a data frame which is not the default mydata
Plot(eruptions, waiting, ellipse=TRUE, data=faithful)
# by variable scatterplot with default point color, vary shapes
Plot(Years, Salary, by=Gender)
# by variable with values of Gender for plotting symbols
# reduce the size of the plotted symbols with size<1
Plot(Years, Salary, by=Gender, shape=c("F","M"), size=.6)
# by variable with a least-squares fit line for each group
Plot(Years, Salary, by=Gender, fit.line="ls")
#--------------------------------------
# analysis of a single numeric variable
#--------------------------------------

# default dot plot (1-variable scatter plot, continuous)
Plot(Salary)
# dot plot with custom colors for outliers
Plot(Salary, pt.reg=23, color.out15="hotpink", color.out30="darkred")
# one variable scatterplot with added jitter of points
Plot(Salary, method="jitter")
# by variable dot plot with custom colors, keeps only 1 shape
Plot(Salary, by=Gender, color.stroke=c("steelblue", "hotpink"))
# line chart, with both line and point activated
Plot(Salary, object="both")

# default histogram
Plot(Salary, object="bar")
# or, specify a bin parameter, which also sets the object to "bar"
Plot(Salary, bin.width=5000)

# 1-D run chart instead of bubble plot by specifying line
Plot(Salary, object="line")
# two 1-D run charts in same plot
Plot(c(Pre, Post), object="line")
#------------------------------------------
# analysis of a single categorical variable
#------------------------------------------

# default 1-D bubble plot
# frequency plot, in place of a bar chart
Plot(Dept)
# plot of frequencies for each category (level), replaces bar chart
Plot(Dept, topic="count")
#----------------------------------------------------
# scatterplot of numeric against categorical variable
#----------------------------------------------------

# generates a means chart
Plot(Dept, Salary)
# rotated axis labels and then offset to fit
Plot(Dept, Salary, rotate.values=45, offset=1)
# just plot means
Plot(Dept, Salary, topic="mean")
# bar plot of means
Plot(Dept, Salary, object="bar", topic="mean")
#----------------------------------------------------
# analysis of two categorical variables (Likert data)
#----------------------------------------------------
mydata <- rd("Mach4", format="lessR", quiet=TRUE)  # Likert data, 0 to 5

# size of each plotted point (bubble) depends on its joint frequency
# triggered by default when each variable has fewer than n.cat unique
# values, with n.cat=8 by default
Plot(m06, m07)
# use value labels for the integer values
LikertCats <- c("Strongly Disagree", "Disagree", "Slightly Disagree",
                "Slightly Agree", "Agree", "Strongly Agree")
Plot(m06, m07, value.labels=LikertCats)
# get correlation analysis instead of cross-tab analysis
Plot(m06, m07, n.cat=2)
# plot Likert data and get sunflower plot with loess line
Plot(m06, m07, object="sunflower", fit.line="loess")

# two variable bar chart, default is topic="count" for object="bar"
Plot(m06, m07, object="bar")
#-----------------------------
# Bubble Plot Frequency Matrix
#-----------------------------

# generate a table of frequency distributions for multiple categorical
# variables with the same response scale
# specify a range of x-variables, no y-variable
# each row is a bubble plot of frequencies for a single variable
Plot(c(m06,m07,m09,m10), rotate.values=25, offset=1)
# for each bubble, lighten fill color, make border black
Plot(m06:m12, color.fill=rgb(.094,.455,.804,alpha=.45), color.stroke="black")
# color range
Plot(c(m06,m07,m09,m10), color.low="lemonchiffon2", color.hi="lightsteelblue2")
# create BPFM for entire Mach IV scale with labels, store as a pdf file
Plot(m01:m20, value.labels=LikertCats, pdf.file="MachFreqs.pdf")
#-------------------
# Cleveland dot plot
#-------------------
mydata <- rd("Employee", format="lessR", quiet=TRUE)

# row.names on the y-axis
Plot(Salary, row.names)
# with options
Plot(Salary, row.names, sort.yx=TRUE, segments.y=TRUE,
     color.bg="off", color.grid="off")
# Cleveland dot plot with two x-variables
Plot(c(Pre, Post), row.names, segments.y=TRUE,
     color.bg="off", color.grid="off")
#---------------
# function curve
#---------------

x <- seq(10,50,by=1)
y1 <- sqrt(x)
y2 <- x^.33
# x is sorted with equal intervals so object set to line by default
Plot(x, y1)
# custom function plot
Plot(x, y1, ylab="My Y", xlab="My X", main="My Curve",
     color.stroke="blue", color.bg="snow", color.area="lightsteelblue",
     color.grid="lightsalmon")
# multiple plots, need data frame
mydata <- data.frame(x, y1, y2)
Plot(x, c(y1, y2))
#-----------
# modern art
#-----------
clr <- colors()
clr <- clr[-(153:353)] # get rid of most of the grays
n <- sample(2:30, size=1)
x <- rnorm(n)
y <- rnorm(n)
color1 <- clr[sample(1:length(clr), size=1)]
color2 <- clr[sample(1:length(clr), size=1)]
Plot(x, y, object="line", color.area=color1, color.stroke=color2,
xy.ticks=FALSE, main="Modern Art", xlab="", ylab="",
cex.main=2, col.main="lightsteelblue", n.cat=0)