reg
, reg.brief
Provides a regression analysis with extensive output, including graphics, from a single, simple function call with many default settings, each of which can be re-specified. The computations are obtained from the R
function lm
and related R
regression functions, the output of which is re-arranged and collated.
By default the data exists as a data frame with the default name of mydata
, or specify explicitly with the data
option. Specify the model in the function call as an R formula
, that is, for a basic model, the response variable followed by a tilde, followed by the list of predictor variables, each pair separated by a plus sign.
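As a minimal base R sketch of this formula notation (it calls lm directly rather than Regression, and the mtcars variables are assumptions chosen only for illustration):

```r
# A basic model: response, tilde, then predictors separated by plus signs.
# mtcars ships with R; mpg, hp and wt stand in for Y, X1 and X2.
f <- mpg ~ hp + wt

# The variable names the formula refers to, response first
all.vars(f)

# The same formula object can be passed to any modeling function
fit <- lm(f, data = mtcars)
coef(fit)
```

The same formula would be supplied as the first argument of Regression or reg.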
Output is generated into distinct pieces by topic. When the output is assigned to an object, such as r
in r <- reg(Y ~ X)
, the full or partial output can be accessed for later analysis and/or viewing. A primary use is with knitr
for dynamic report generation, run from, for example, RStudio
. The input instructions to knitr
are written comments and interpretation with embedded R
code. Doing a knitr
analysis is to "knit" these comments and subsequent output together so that the R
output is embedded in the resulting document, either html, pdf or Word, by default with explanation and interpretation. Generate a complete knitr
set of instructions ready to knit from the knitr.file
option. Simply specify the option and create the file and then open in RStudio
and click the knit
button to create a formatted document that consists of the statistical results and interpretative comments. See the following sections arguments
, value
and examples
for more information.
Regression(my.formula, data=mydata, digits.d=NULL, standardize=FALSE,
knitr.file=NULL, explain=getOption("explain"),
interpret=getOption("interpret"), results=getOption("results"), text.width=120, brief=getOption("brief"), show.R=FALSE,
res.rows=NULL, res.sort=c("cooks","rstudent","dffits","off"),
pred.rows=NULL, pred.sort=c("predint", "off"),
subsets=NULL, cooks.cut=1,
scatter.coef=TRUE, scatter.3D=FALSE, graphics=TRUE,
X1.new=NULL, X2.new=NULL, X3.new=NULL, X4.new=NULL,
X5.new=NULL,
pdf=FALSE, pdf.width=5, pdf.height=5, refs=FALSE,
fun.call=NULL, ...)
reg(...)
reg.brief(..., brief=TRUE)
my.formula
: An R formula for specifying a model. For example, for a response variable named Y and two predictor variables, X1 and X2, specify the corresponding linear model as Y ~ X1 + X2.
data
: The default name of the data frame that contains the data for analysis is mydata, otherwise explicitly specify.
explain
: If set to FALSE, the knitr explanations of the results are not provided. Set globally with options(explain=FALSE).
interpret
: If set to FALSE, the knitr interpretations of the results are not provided. Set globally with options(interpret=FALSE).
results
: If set to FALSE, the knitr results are not provided, relying upon the interpretations. Set globally with options(results=FALSE).
brief
: If set to TRUE, reduced text output. Can change system default with the set function.
show.R
: Display the R instructions that yield the lessR output, albeit without the additional formatting of the results such as combining output of different functions into a table.
res.rows
: Number of rows of data and residuals to list. To list every row, specify "all".
res.sort
: Default is "cooks", for specifying Cook's distance as the sort criterion for the display of the rows of data and associated residuals. Other values are "rstudent" for externally Studentized residuals, "dffits" for dffits, and "off" for no sorting.
pred.sort
: Default is "predint", which sorts the rows of data and associated intervals by the lower bound of each prediction interval. Turn off this sort by specifying a value of "off".
subsets
: All possible subsets analysis of the predictor variables, which requires the leaps package. Set to FALSE to turn off.
scatter.coef
: Display the correlation coefficients in the scatterplot matrix. Default is TRUE.
graphics
: Produce graphics. Default is TRUE. In knitr it can be useful to set to FALSE so that regPlot can be used to place the graphics within the output file.
pdf
: If TRUE, then graphics are written to pdf files.
refs
: If TRUE, then list the references for R and the packages from which functions were used to generate the output.
fun.call
: Function call. Used with knitr to pass the function call when obtained from the abbreviated function call reg.
...
: Other parameter values for the R function lm, which provides the core computations.
The output can optionally be saved into an R object, otherwise it simply appears at the console. The components of this object are redesigned in lessR
version 3.3 into (a) pieces of text that form the readable output and (b) a variety of statistics. The readable output consists of character strings such as tables amenable for viewing and interpretation. The statistics are numerical values amenable for further analysis, such as to be referenced in a subsequent knitr
document. The motivation for these two types of output is to facilitate knitr
documents, as the name of each piece, preceded by the name of the saved object followed by a $, can be inserted into the knitr
document (see examples
).
TEXT OUTPUT
out_background
: Variables in the model, rows of data and retained
out_estimates
: Estimated coefficients, hypothesis tests and confidence intervals
out_fit
: Fit indices
out_anova
: Analysis of variance
out_cor
: Correlations among all variables in the model
out_collinear
: Collinearity analysis
out_subsets
: R squared adjusted for all (or many) possible subsets
out_residuals
: Analysis of residuals and influence
out_predict
: Confidence and prediction intervals
out_cite
: List of packages from other developers used in the analysis
out_ref
: References if selected on the Regression
function call
out_knitr.file
: Lists the name and location of the knitr
instructions
out_plots
: List of plots generated if more than one
Also separated from the rest of the text output are the major headings, which can then be deleted from custom collations of the output.
out_title_bck
: BACKGROUND
out_title_basic
: BASIC ANALYSIS
out_title_rel
: RELATIONS AMONG THE VARIABLES
out_title_res
: ANALYSIS OF RESIDUALS AND INFLUENCE
out_title_pred
: FORECASTING ERROR
STATISTICS
fun.call
: Function call that generated the analysis
formula
: Model formula that specifies the model
n.vars
: Number of variables in the model
n.obs
: Number of rows of data submitted for analysis
n.keep
: Number of rows of data retained in the analysis
coefficients
: Estimated regression coefficients
sterrs
: Standard errors of the estimated coefficients
tvalues
: t-values of the estimated coefficients for null of 0
pvalues
: p-values from the t-tests of the estimated coefficients
cilb
: lower bound of 95% confidence interval of estimate
ciub
: upper bound of 95% confidence interval of estimate
anova_model
: Model df, ss, ms, F-value and p-value
anova_residual
: Residual df, ss and ms
anova_total
: Total df, ss and ms
se
: standard deviation of the residuals
resid_range
: 95% range of normally distributed fitted residuals
Rsq
: R-squared
Rsqadj
: adjusted R-squared
PRESS
: PRESS sum of squares
RsqPRESS
: PRESS R-squared
cor
: correlation matrix of all variables in the model
tolerances
: tolerance of each predictor variable for collinearity analysis
VIF
: Variance inflation factor for each predictor variable
resid.max
: Five largest values of the residuals on which the output is sorted
pred_min_max
: Rows with the smallest and largest prediction intervals
residuals
: Residuals
fitted.values
: Fitted values
cooks.distance
: Cook's distance
model
: Data retained for the analysis
terms
: Terms specified for the analysis
Although not typically needed for analysis, if the regression output is assigned to an object named, for example, r
, then the complete contents of the object can be viewed directly with the unclass
function, here as unclass(r)
. Invoking the class
function on the saved object reveals a class of out_all. The class of each of the text pieces of output is out_piece.
The purpose of Regression
is to combine the following function calls into one, as well as to provide ancillary analyses such as graphics, organizing output into tables and sorting to assist interpretation of the output, and to run through knitr
, such as with RStudio
. The basic analysis successively invokes several standard R functions beginning with the standard R function for estimation of a linear model, lm
. The output of the analysis of lm
is stored in the object lm.out
, available for further analysis in the R environment upon completion of the Regression
function. By default reg
automatically provides the analyses from the standard R functions, summary
, confint
and anova
, with some of the standard output modified and enhanced. The correlation matrix of the model variables is obtained with the cor
function. The residual analysis invokes the fitted
, resid
, rstudent
, and cooks.distance
functions. The option for prediction intervals calls the standard R function predict
, once with the argument interval="confidence"
and once with interval="prediction"
. The lessR
Density
function provides the histogram and density plots for the residuals and the ScatterPlot
function provides the scatter plots of the residuals with the fitted values and of the data for the one-predictor model. The pairs
function provides the scatterplot matrix of all the variables in the model. Thomas Lumley's leaps
package contains the leaps
function that provides the analysis of the fit of all possible model subsets. The car
package provides Henric Nilsson and John Fox's vif
function for the computation of the variance inflation factors for the collinearity analysis. The scatter3d
function from Fox and Weisberg's car
package provides the interactive 3d scatterplot for models with exactly two predictor variables.
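The core sequence of standard R calls described above can be sketched in base R as follows. This is only an illustration of the underlying functions under assumed inputs (the mtcars model is an assumption), not the internal code of Regression, and it omits the lessR formatting, the graphics and the contributed packages.

```r
# Estimate the linear model, the starting point of the analysis
fit <- lm(mpg ~ hp + wt, data = mtcars)

# Estimates, standard errors, t-values, p-values plus confidence intervals
est <- cbind(coef(summary(fit)), confint(fit))

# Analysis of variance table
aov_tab <- anova(fit)

# Correlation matrix of all variables in the model
r_mat <- cor(model.frame(fit))

# predict() is called twice, once per interval type.
# Predicting on the estimation data raises an informational warning
# for prediction intervals, suppressed here.
ci_tab <- predict(fit, interval = "confidence")
pi_tab <- suppressWarnings(predict(fit, interval = "prediction"))

est
r_mat
head(cbind(ci_tab, pi_lwr = pi_tab[, "lwr"], pi_upr = pi_tab[, "upr"]))
```

Each prediction interval is wider than the corresponding confidence interval because it adds the variability of an individual observation to the uncertainty of the fitted value.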
INPUT DATA FRAME
The name mydata
is by default provided by the Read
function included in this package for reading and displaying information about the data in preparation for analysis. If the variables in the model are not all in the same data frame, the analysis will not complete. The data frame does not need to be attached, just specified by name with the data
option if the name is not the default mydata
.
TEXT OUTPUT
The output is produced in pieces by topic (see value
), automatically collated by default in the final output. The pieces remain available for later reference if the output of the function is assigned to an object, such as r
in r <- reg(Y ~ X)
. This is especially useful if the pieces are accessed within knitr
or individual pieces are displayed at the console.
The text output is organized to provide the most relevant information while minimizing the total amount of output, particularly for analyses with large numbers of observations (rows of data). In the analyses of the residuals and predicted values, the display is by default restricted to only the most interesting or representative observations. Additional economy can be obtained by invoking the brief=TRUE
option, or by running reg.brief
, which limits the analysis to just the basic analysis of the estimated coefficients and fit.
knitr
A file ready for input into knitr
can be obtained by specifying a value for knitr.file
. For the specified file name, the directory to which the file is written is displayed on the console text output, and the file type .Rmd
is automatically appended to the specified name if it is not included in the specification. Process with RStudio
, or with the knit
function from the knitr
package and the markdownToHTML
function from the markdown
package.
The output from knitr.file
is conceptually partitioned into three parts: explanations, interpretations and results. By default all available output is generated, but the flags explain
, interpret
and results
can be set to FALSE
to reduce the output. The options can be specified in a specific function call or set globally, such as with options(explain=FALSE)
. Turning off all three flags leaves just the outline of the potential output and a bare minimum of results.
The default analysis provides as text output to the console the model's parameter estimates and corresponding hypothesis tests and confidence intervals, goodness of fit indices, the ANOVA table, the correlation matrix of the model's variables, the analysis of residuals and influence, and the confidence and prediction intervals for each observation in the model. Also provided, for multiple regression models, are a collinearity analysis of the predictor variables and the adjusted R-squared for the models defined by each possible subset of the predictor variables.
DECIMAL DIGITS
The number of decimal digits displayed on the output is, by default, the maximum number of decimal digits for all the data values of the response variable. Or, this value can be explicitly specified with the digits.d
parameter.
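One way such a default could be derived is sketched below. The helper max_decimals is hypothetical, written only to illustrate the rule stated above; it is not a lessR function and may differ from how the package computes the value.

```r
# Hypothetical helper: largest number of decimal digits among the values of x
max_decimals <- function(x) {
  x <- x[!is.na(x)]
  # Format each value on its own, without scientific notation
  s <- vapply(x, function(v) format(v, scientific = FALSE, digits = 15),
              character(1))
  # Count the characters after the decimal point, 0 if there is none
  dec <- ifelse(grepl(".", s, fixed = TRUE),
                nchar(sub("^[^.]*\\.", "", s)), 0L)
  max(dec)
}

# Applied to a response variable, this gives a candidate default for digits.d
max_decimals(mtcars$mpg)
max_decimals(c(1.5, 2.25, 3))
```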
GRAPHICS OUTPUT
Three default graphs are provided. When running R
by itself, by default the graphs are written to separate graphics windows (which may overlap each other completely, in which case move the top graphics windows). Or, the pdf
option may be invoked to save the graphs to a single pdf file called regOut.pdf
. Within RStudio
the graphs are successively written to the Plots
window. Within knitr
from RStudio
the graphics will all appear by default at the beginning of the output. Or set graphics=FALSE
, and generate them individually with the accompanying function regPlot
at the desired location within the file.
1. A histogram of the residuals includes the superimposed normal and general density plots from the Density
function included in this lessR
package. The overlapping density plots, which both overlap the histogram, are filled with semi-transparent colors to enhance readability.
2. A scatterplot of the residuals with the fitted values is also provided from the ScatterPlot
function included in this package. The point corresponding to the largest value of Cook's distance, regardless of its size, is plotted in red and labeled, and the corresponding value of Cook's distance is specified in the subtitle of the plot. Also by default all points with a Cook's distance larger than 1.0 are plotted in red, a threshold that can be changed with the cooks.cut
option. This scatterplot also includes the lowess
curve.
3. For models with a single predictor variable, a scatterplot of the data is produced, which also includes the regression line and corresponding confidence and prediction intervals. As with the density histogram plot of the residuals and the scatterplot of the fitted values and residuals, the scatterplot includes a colored background with grid lines. For multiple regression models, a scatterplot matrix of the variables in the model with the lowess
best-fit line of each constituent scatterplot is produced. If the scatter.coef
option is invoked, each scatterplot in the upper triangle of the scatterplot matrix is replaced with its correlation coefficient.
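The highlighting rule for the residuals plot in item 2 can be sketched with base R graphics. The variable cooks_cut mirrors the cooks.cut option, and the model is an assumption for illustration; the actual plot is drawn by the lessR ScatterPlot function with its own styling.

```r
fit <- lm(mpg ~ hp + wt, data = mtcars)
cooks_cut <- 1                       # same value as the default cooks.cut=1

d <- cooks.distance(fit)
largest <- which.max(d)              # highlighted regardless of its size

# Points to draw in red: the largest Cook's distance plus any above the cut
flagged <- union(names(largest), names(d)[d > cooks_cut])
point_col <- ifelse(names(d) %in% flagged, "red", "black")

plot(fitted(fit), resid(fit), col = point_col, pch = 19,
     xlab = "Fitted values", ylab = "Residuals",
     sub = paste("Largest Cook's distance:", round(d[largest], 2)))
text(fitted(fit)[largest], resid(fit)[largest],
     labels = names(largest), pos = 1)

# Lowess curve through the residuals
lines(lowess(fitted(fit), resid(fit)))
```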
RESIDUAL ANALYSIS
By default the residual analysis lists the data and fitted value for each observation as well as the residual, Studentized residual, Cook's distance and dffits, with the first 20 observations listed and sorted by Cook's distance. The res.sort
option provides for sorting by the Studentized residuals or not sorting at all. The res.rows
option provides for listing these rows of data and computed statistics for any specified number of observations (rows). To turn off the analysis of residuals, specify res.rows=0
.
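A rough base R equivalent of this residuals table is sketched below, assuming that sorting by Cook's distance means largest first. The column names and the mtcars model are assumptions for illustration; the lessR output is formatted differently.

```r
fit <- lm(mpg ~ hp + wt, data = mtcars)

# Data, fitted value and the residual and influence statistics for each row
res_tab <- data.frame(
  model.frame(fit),
  fitted   = fitted(fit),
  residual = resid(fit),
  rstudent = rstudent(fit),          # externally Studentized residuals
  dffits   = dffits(fit),
  cooks    = cooks.distance(fit)
)

# Sort by Cook's distance and list the first 20 rows, as with res.rows=20
res_tab <- res_tab[order(res_tab$cooks, decreasing = TRUE), ]
head(res_tab, 20)
```

Sorting by abs(res_tab$rstudent) instead would correspond to res.sort="rstudent".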
PREDICTION INTERVALS
The output for the confidence and prediction intervals includes a table with the data and fitted value for each observation, the lower and upper bounds for the confidence interval and the prediction interval, and the width of the prediction interval. The observations are sorted by the lower bound of each prediction interval. If there are 25 or more observations, then the information is displayed for only the first four, the middle four and the last four observations. To turn off the analysis of prediction intervals, specify pred.rows=0
, which also removes the corresponding intervals from the scatterplot produced with a model with exactly one predictor variable, yielding just the scatterplot and the regression line.
The data for the default analysis of the prediction intervals is for the values of the predictor variables for each observation, that is, for each row of the data. New values of the predictor variables can be specified for the calculation of the prediction intervals by providing values for the options X1.new
for the values of the first listed predictor variable in the model, X2.new
for the second listed predictor variable, and so forth for up to five predictor variables. To provide these values, use functions such as seq
for specifying a sequence of values and c
for specifying a vector of values. For multiple regression models, all combinations of the specified new values for all of the predictor variables are analyzed.
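The "all combinations" behavior for new predictor values can be illustrated in base R with expand.grid and predict. This is a sketch of the idea under an assumed mtcars model, not the lessR implementation.

```r
fit <- lm(mpg ~ hp + wt, data = mtcars)

# New values for each predictor, analogous to X1.new and X2.new;
# expand.grid forms every combination of them
new_x <- expand.grid(hp = seq(50, 350, 50), wt = c(2, 3))

# Forecasts with prediction intervals for each combination
pred <- cbind(new_x, predict(fit, newdata = new_x, interval = "prediction"))
pred$width <- pred$upr - pred$lwr

# Sorted by the lower bound of the prediction interval
pred[order(pred$lwr), ]
```

Seven values of hp crossed with two values of wt yield 14 rows, one per combination.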
RELATIONS AMONG THE VARIABLES
By default the correlation matrix of all the variables in the model is displayed, and, for multiple regression models, collinearity analysis is provided with the vif
function from the Fox and Weisberg (2011) car
package. Also provided are the first 50 models with the largest adjusted R-squared from an analysis of all possible subsets of the predictor variables. This all subsets analysis requires the leaps
function from the leaps
package. These contributed packages are automatically loaded if available. To turn off the all possible subsets option, set subsets=FALSE
.
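For readers without the car package, the tolerance and variance inflation factor of each predictor can be computed from their standard definitions in base R. This sketch swaps in auxiliary regressions for the vif function, and the predictor set is an assumption for illustration.

```r
vars <- c("hp", "wt", "disp")        # predictors of mpg ~ hp + wt + disp

# Tolerance: 1 - R-squared from regressing each predictor on the others
tolerance <- sapply(vars, function(v) {
  others <- setdiff(vars, v)
  aux <- lm(reformulate(others, response = v), data = mtcars)
  1 - summary(aux)$r.squared
})

# Variance inflation factor is the reciprocal of the tolerance
vif <- 1 / tolerance

round(cbind(tolerance, vif), 3)
```

A tolerance near 0, equivalently a large VIF, indicates that a predictor is nearly a linear combination of the other predictors.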
INVOKED R OPTIONS
The options
function is called to turn off the stars for different significance levels (show.signif.stars=FALSE), to turn off scientific notation for the output (scipen=30), and to set the width of the text output at the console to 120 characters. The latter option can be re-specified with the text.width
option. After Regression
is finished with a normal termination, the options are re-set to their values before the Regression
function began executing.
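The save-and-restore pattern for these options can be sketched as follows. The function quiet_summary is hypothetical, and the use of on.exit is an assumption: it also restores the options after an error, whereas the text above describes restoration only after normal termination.

```r
# Hypothetical wrapper: set the three options, print, then restore them
quiet_summary <- function(fit, text_width = 120) {
  old <- options(show.signif.stars = FALSE, scipen = 30, width = text_width)
  on.exit(options(old), add = TRUE)  # put the previous values back on exit
  print(summary(fit))
  invisible(NULL)
}

opt_names <- c("show.signif.stars", "scipen", "width")
before <- options()[opt_names]
quiet_summary(lm(mpg ~ wt, data = mtcars))
after <- options()[opt_names]

identical(before, after)             # the session options are unchanged
```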
COLOR THEME
A color theme for all the colors can be chosen for a specific plot with the colors
option. Or, the color theme can be changed for all subsequent graphical analysis with the lessR
function set
. The default color theme is dodgerblue
, but a gray scale is available with "gray"
, and other themes are available as explained in set
.
VARIABLE LABELS
If variable labels exist, then the corresponding variable label is by default listed as the label for the horizontal axis and on the text output. For more information, see Read
.
Lumley, T., leaps
function from the leaps
package.
Nilsson, H. and Fox, J., vif
function from the car
package.
Gerbing, D. W. (2013). R Data Analysis without Programming, Chapters 9 and 10, NY: Routledge.
Xie, Y. (2013). Dynamic Documents with R and knitr, Chapman & Hall/CRC The R Series.
formula
, lm
, summary.lm
, anova
, confint
, fitted
, resid
, rstudent
, cooks.distance
, Nest
, regPlot
# Read internal data set
mydata <- rd("Reading", format="lessR", quiet=TRUE)
# One-predictor regression
# Provide all default analyses including scatterplot etc.
Regression(Reading ~ Verbal)
# short name of function call
reg(Reading ~ Verbal)
# Provide only the brief analysis on the standardized variables
reg.brief(Reading ~ Verbal, standardize=TRUE)
# Access the pieces of output, here in an object named r
r <- reg(Reading ~ Verbal + Absent + Income)
# Display all output at the console
r
# list the names of all the saved components
names(r)
# Display just the estimated coefficients and their inferential analysis
r$out_estimates
# This output is obtained from the knitr file with the
# following automatically generated instructions (when results=TRUE)
#```{r, echo=FALSE}
#r$out_estimates
#```
# Generate knitr instructions with the option: knitr.file
# Output file will be read.Rmd, a simple text file that can
# be edited with any text editor including RStudio from which it
# can be knit to generate dynamic output such as to a Word document
reg(Reading ~ Verbal + Absent, knitr.file="read")
# knitr.file with no explanations
reg(Reading ~ Verbal + Absent, knitr.file="read", explain=FALSE)
# Modify the default settings as specified
Regression(Reading ~ Verbal, res.rows=8, res.sort="rstudent", pred.rows=0, digits.d=4)
# Multiple regression model
# Provide all default analyses
Regression(Reading ~ Verbal + Absent + Income)
# Save the three plots as pdf files 4 inches square
Regression(Reading ~ Verbal, pdf=TRUE, pdf.width=4, pdf.height=4)
# Compare nested models
# Reduced model: Reading ~ Verbal
# Full model: Reading ~ Verbal + Income + Absent
Nest(Reading, Verbal, c(Income, Absent))
# Specify new values of the predictor variables to calculate
# forecasted values and the corresponding prediction intervals
# Specify an input data frame other than mydata, see help(mtcars)
Regression(mpg ~ hp + wt + disp, data=mtcars,
X1.new=seq(50,350,50), X2.new=c(2,3), X3.new=c(100,300))