
fMultivar (version 251.70)

RegressionInterface: Univariate Regression Modelling

Description

A collection and description of easy-to-use functions to perform a univariate regression analysis by several methods, to analyse and summarize the fit, and to predict from new data records. The models include:

  • "LM" Linear Modelling,
  • "GLM" Generalized Linear Modelling,
  • "GAM" Generalized Additive Modelling,
  • "PPR" Projection Pursuit Regression,
  • "MARS" Multivariate Adaptive Regression Splines,
  • "POLYMARS" Polychotomous MARS,
  • "NNET" Feedforward Neural Network Modelling.

Available methods are:

  • predict Predict method for objects of class 'fREG',
  • print Print method for objects of class 'fREG',
  • plot Plot method for objects of class 'fREG',
  • summary Summary method for objects of class 'fREG',
  • fitted.values Fitted values method for objects of class 'fREG',
  • residuals Residuals method for objects of class 'fREG'.

The print method prints the object returned from a regression fit, and the summary method performs a diagnostic analysis and summarizes the results of the fit in detailed form. The plot method produces diagnostic plots. The predict method forecasts from new data records. Two further methods return the fitted values and the residuals. In addition, an S-Plus FinMetrics-like ordinary least squares function OLS has been added, together with S3 print, plot and summary methods:

  • OLS Ordinary least squares fit, returning an object of class 'OLS',
  • print Print method for objects of class 'OLS',
  • plot Plot method for objects of class 'OLS',
  • summary Summary method for objects of class 'OLS'.

Usage

regSim(model = c("LM3", "LOGIT3", "GAM3"), n = 100)

regFit(formula, data, use = c("lm", "rlm", "am", "ppr", "mars", "nnet", "polymars"), title = NULL, description = NULL, ...)

gregFit(formula, family, data, use = c("glm", "gam"), title = NULL, description = NULL, ...)

## S3 method for class 'fREG':
predict(object, newdata, se.fit = FALSE, type = "response", ...)

show.fREG(object)
## S3 method for class 'fREG':
plot(x, ...)
## S3 method for class 'fREG':
summary(object, ...)

## S3 method for class 'fREG':
coef(object, ...)
## S3 method for class 'fREG':
fitted(object, ...)
## S3 method for class 'fREG':
residuals(object, ...)
## S3 method for class 'fREG':
vcov(object, ...)

OLS(formula, data, ...)

## S3 method for class 'OLS':
print(x, ...)
## S3 method for class 'OLS':
plot(x, ...)
## S3 method for class 'OLS':
summary(object, ...)

Arguments

data, newdata
data is the data frame containing the variables in the model. By default the variables are taken from environment(formula), typically the environment from which the fitting function is called. newdata is a data frame containing the new data records for which predictions are required.
description
a brief description of the project of type character.
family
a description of the error distribution and the link function to be used in glm and gam models. See glm and family for more details.
formula
a symbolic description of the model to be fitted. A typical model has the form response ~ terms, where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response.
use
a character string denoting the regression method used to fit the model; it must be one of the strings in the default argument. For regFit these are "lm" (linear models), "rlm" (robust linear models), "am" (additive models), "ppr" (projection pursuit regression), "mars" (multivariate adaptive regression splines), "nnet" (feedforward neural networks), and "polymars" (polychotomous MARS); for gregFit they are "glm" (generalized linear models) and "gam" (generalized additive models).
model
[regSim] - a character string selecting one of three benchmark models: "LM3", "LOGIT3", or "GAM3" (see the simulation sketch following the Arguments list).
n
[regSim] - an integer value setting the length of the series to be simulated. The default value is 100.
object, x
[regFit] - an object of class "fREG" as returned by the regression function regFit; it serves as input for the predict, print, summary, and plot methods.
se.fit
[predict] - a logical value. Should standard errors of the prediction be returned? The default is FALSE.
title
a character string which allows for a project title.
type
a character string specifying the type of prediction; the default is "response".
...
additional optional arguments to be passed to the underlying functions. For details we refer to the help pages of lm, glm, gam, ppr, mars, polymars, and nnet.
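A minimal simulation sketch combining regSim and regFit follows; the column names Y, X1, X2, X3 of the simulated data frame are an assumption, so inspect the returned object with head() before fitting:

## regSim -
   # Simulate 100 records from the linear benchmark model:
   x = regSim(model = "LM3", n = 100)
   head(x)   # check the column layout (names assumed below)

## regFit -
   # Fit the simulated data, here with a plain linear model:
   fit = regFit(Y ~ X1 + X2 + X3, data = x, use = "lm")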

Value

  • Function regFit: returns an S4 object of class "fREG", with the following slots:
  • call - the matched function call.
  • data - the input data in form of a data.frame.
  • description - allows for a brief project description.
  • fit - the results, as a list, returned from the underlying regression model function, e.g. fit$parameters - the fitted model parameters, fit$residuals - the model residuals, fit$fitted.values - the fitted values of the model, and many more. For details we refer to the help pages of the selected regression model.
  • method - the selected regression model naming the applied method.
  • formula - the formula expression describing the model.
  • family - the selected family and link name if available, otherwise a character vector with two empty strings.
  • parameters - named parameters or coefficients of the fitted model.
  • title - a title string.
  • Methods: The output from the print method gives information at least about the function call, the fitted model parameters, and the residual variance. The plot method produces three figures: the first plots the series of residuals, the second shows a quantile-quantile plot of the residuals, and the third plots the fitted values against the residuals. Additional plots can be generated from the plot method (if available) of the underlying model, see the example below. The summary method provides additional information, such as errors on the model parameters as far as available, and adds further information about the fit. The predict method forecasts from a fitted model. The returned values are the same as those produced by the prediction function of the selected regression model; in particular, $fit returns the forecast vector. The residuals and fitted.values methods return the residuals and the fitted values as numeric vectors.
  • Function OLS: returns an S3 object of class "OLS" that represents an ordinary least squares fit. The list has the same elements as an object of class "lm", plus the additional elements $call, $formula and $data.
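The following sketch shows how the extractor methods and the S4 slots of a fitted "fREG" object can be inspected; the object fit is assumed to come from a regFit call as in the simulation sketch above:

## Methods -
   print(fit)            # call, parameters, residual variance
   summary(fit)          # adds diagnostics to the print output
   coef(fit)             # the named model coefficients
   head(residuals(fit))  # model residuals as a numeric vector
   head(fitted(fit))     # fitted values as a numeric vector

## Slots -
   fit@formula           # the model formula
   fit@method            # the selected regression method
   lm.fit = fit@fit      # results from the underlying fitting function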

Details

LM -- Linear Modelling: Univariate linear regression analysis is a statistical methodology that assumes a linear relationship between some predictor variables and a response variable. The goal is to estimate the coefficients and to predict new data from the estimated linear relationship. The function plot.lm provides four plots: a plot of residuals against fitted values, a Scale-Location plot of sqrt(|residuals|) against fitted values, a normal QQ plot, and a plot of Cook's distances versus row labels. [stats:lm]

GLM -- Generalized Linear Models: Generalized linear modelling extends the linear model in two directions: (i) with a monotonic differentiable link function describing how the expected values are related to the linear predictor, and (ii) with response variables having a probability distribution from an exponential family. [stats:glm]

GAM -- Generalized Additive Models: An additive model generalizes a linear model by smoothing each predictor term individually. A generalized additive model extends the additive model in the same spirit as the generalized linear model extends the linear model, namely by allowing a link function and non-normal distributions from the exponential family. [mgcv:gam]

PPR -- Projection Pursuit Regression: The basic method is given by Friedman (1984), and is essentially the same code used by S-PLUS's ppreg. This code is extremely sensitive to the compiler used. The algorithm first adds up to max.terms (by default ppr.nterms) ridge terms one at a time; it will use fewer if it is unable to find a term to add that makes sufficient difference. The levels of optimization (argument optlevel), by default 2, differ in how thoroughly the models are refitted during this process. At level 0 the existing ridge terms are not refitted. At level 1 the projection directions are not refitted, but the ridge functions and the regression coefficients are. Levels 2 and 3 refit all the terms; level 3 is more careful to re-balance the contributions from each regressor at each step and so is a little less likely to converge to a saddle point of the sum of squares criterion. The plot method plots the ridge functions for the projection pursuit regression fit. [stats:ppr]

MARS -- Multivariate Adaptive Regression Splines: This function was coded from scratch and did not use any of Friedman's mars code. It gives quite similar results to Friedman's program in our tests, but not exactly the same results. We have not implemented Friedman's anova decomposition, nor are categorical predictors handled properly yet. Our version does handle multiple response variables, however. As it is not well tested, we would like to hear of any bugs. Additional arguments which can be passed to the "mars" estimator are:

  • w - an optional vector of observation weights.
  • wp - an optional vector of response weights.
  • degree - an optional integer specifying the maximum interaction degree, default is 1.
  • nk - an optional integer specifying the maximum number of model terms.
  • penalty - an optional value specifying the cost per degree of freedom, default is 2.
  • thresh - an optional value specifying the forward stepwise stopping threshold, default is 0.001.
  • prune - an optional logical value specifying whether the model should be pruned in a backward stepwise fashion, default is TRUE.
  • trace.mars - an optional logical value specifying whether info should be printed along the way, default is FALSE.
  • forward.step - an optional logical value specifying whether the forward stepwise process should be carried out, default is TRUE.
  • prevfit - optional data structure from a previous fit. To see the effect of changing the penalty parameter, one can use prevfit with forward.step = FALSE.

[mda:mars]

POLYMARS -- Polychotomous MARS: The algorithm employed by polymars is different from the MARS(tm) algorithm of Friedman (1991), though it has many similarities. Also, the name polymars has been used for this algorithm well before MARS was trademarked. Additional arguments which can be passed to the "polymars" estimator are:

  • maxsize - the maximum number of basis functions that the model is allowed to grow to in the stepwise addition procedure. Default is $\min(6 \cdot n^{1/3}, n/4, 100)$, where n is the number of observations.
  • gcv - parameter used to find the overall best model from a sequence of fitted models. The residual sum of squares of a model is penalized by dividing by the square of 1 - (gcv x model size)/cases. A larger gcv value tends to produce a smaller model.
  • additive - should the fitted model be additive in the predictors?
  • startmodel - the first model that is to be fitted by polymars. It is either an object of class polymars or a model specified by the user. In the latter case it takes the form of a 4 x n matrix, where n is the number of basis functions in the starting model, excluding the intercept. Each row corresponds to one basis function (with two possible components). Column 1 is the index of the first predictor involved. Column 2 is a possible knot in this predictor; if column 2 is NA, the first component is linear. Column 3 is the possible second predictor involved (if column 3 is NA, the basis function depends on one predictor only). Column 4 contains the possible knot for the predictor in column 3, and it is NA when this component is linear. Example: if a row reads 3 NA 2 4.7, the corresponding basis function is $[X_3 \cdot (X_2-4.7)_+]$; if a row reads 2 4.3 NA NA, the corresponding basis function is $[(X_2-4.3)_+]$. A fifth column can be added with 1s and 0s; the 1s specify which basis functions of the startmodel must be in each model, so these functions stay in the model during the whole stepwise fitting procedure. If startmodel is not specified, polymars starts with a model that only contains the intercept.
  • weights - optional vector of observation weights; if supplied, the algorithm fits to minimize the sum of the weights multiplied by the squared residuals. The length of weights must be the same as the number of observations. The weights must be nonnegative.
  • no.interact - an optional matrix used if certain predictor interactions are not allowed in the model. It is given as a matrix of size 2 x m, with predictor indices as entries. The two predictors of any row cannot have interaction terms with each other.
  • knots - defines how the function is to find potential knots for the spline basis functions. This can be set to the maximum number of knots to be considered for each predictor. Usually, to avoid the design matrix becoming singular, the actual number of knots produced is constrained to at most every third order statistic in any predictor; this constraint can be adjusted with the knot.space argument. It can also be a vector with the number of potential knots for each predictor; again the actual number of knots produced is constrained to at most every third order statistic in any predictor. A third possibility is to provide a matrix where each column corresponds to the ordered knots to be considered for that predictor; this matrix should be filled out to a rectangular data structure with NAs. The default is min(20, round(n/4)) knots per predictor. When specifying knots as a vector, an entry of -1 indicates that the predictor is a categorical variable and each unique entry in its column is treated as a level. When specifying knots as a single number or a matrix and there are categorical variables, these are specified separately as such using the factors argument.
  • knot.space - an integer describing the minimum number of order statistics apart that two knots can be. Knots should not be too close, to ensure numerical stability.
  • ts.resp - testset responses for model selection. Should have the same number of columns as the training set response. A testset can be used for model selection. Depending on the value of classify, either the model with the smallest testset residual sum of squares or the one with the smallest testset classification error is selected. Overrides gcv.
  • ts.pred - testset predictors. Should have the same number of columns as the training set predictors.
  • ts.weights - testset observation weights. A vector of length equal to the number of cases of the testset. All weights must be non-negative.
  • classify - when the response is discrete (categorical), polymars can be used for classification. In particular, when classify = TRUE, a discrete response with K levels is replaced by K indicator variables as response. Model selection is still carried out using gcv, except when a testset is provided, in which case testset misclassification is used to select the best model.
  • factors - used to indicate that certain variables in the predictor set are categorical variables. Specified as a vector containing the appropriate predictor indices (column numbers of categorical variables in the predictors matrix). Factors can also be set when the knots argument is given as a vector, with -1 as the appropriate entries for factors.
  • tolerance - for each candidate to be added or deleted, the resulting residual sum of squares of the model, with or without this candidate, must be calculated. The inversion of the "X-transpose by X" matrix, X being the design matrix, is done by an updating procedure, cf. C.R. Rao, Linear Statistical Inference and Its Applications, 2nd edition, page 33. In the inversion the size of the bottom right-hand entry of this matrix is critical: if its value is near zero, or the value of its inverse is almost zero, the inversion procedure becomes somewhat inaccurate. The lower the tolerance value, the more careful the procedure is in selecting candidates for addition to the model, but it may exclude candidates too conservatively. On the other hand, if the tolerance is set too high, a spurious result with a singular or otherwise sub-optimal model may occur. By default tolerance is set to 1.0e-5.
  • verbose - when set to TRUE, the function prints out a line for each addition or deletion stage. For example, "+ 8 : 5 3.25 2 NA" means adding an interaction basis function of predictor 5 with knot at 3.25 and predictor 2 (linear), to make a model of size 8, including intercept.

[polspline:polymars]
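As a hedged illustration, the additional "mars" arguments listed above can be passed through the ... argument of regFit; the data frame x and its column names are carried over from the regSim sketch above (and are assumptions):

## regFit with mars-specific arguments -
   require(mda)
   fit = regFit(Y ~ X1 + X2 + X3, data = x, use = "mars",
     degree = 2,       # allow pairwise interactions
     penalty = 3,      # cost per degree of freedom
     prune = TRUE)     # backward stepwise pruning
   summary(fit)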
NNET -- Feedforward Neural Network Regression: If the response in formula is a factor, an appropriate classification network is constructed; this has one output and entropy fit if the number of levels is two, and a number of outputs equal to the number of classes and a softmax output stage for more levels. If the response is not a factor, it is passed on unchanged to nnet.default. A quasi-Newton optimizer is used, written in C. [nnet:nnet]

OLS -- Ordinary Least Squares Fit: This function was introduced to mimic the FinMetrics S-Plus function OLS. The function wraps R's "lm". Currently it does not support the full functionality of FinMetrics' OLS function.
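A minimal sketch of the OLS wrapper, again using the simulated data frame x from the regSim sketch above (column names assumed):

## OLS -
   fit = OLS(Y ~ X1 + X2 + X3, data = x)
   print(fit)
   summary(fit)
   fit$formula   # additional element stored beside the "lm" results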

References

Belsley D.A., Kuh E., Welsch R.E. (1980); Regression Diagnostics; Wiley, New York.

Dobson A.J. (1990); An Introduction to Generalized Linear Models; Chapman and Hall, London.

Draper N.R., Smith H. (1981); Applied Regression Analysis; Wiley, New York.

Friedman J.H. (1991); Multivariate Adaptive Regression Splines (with discussion); The Annals of Statistics 19, 1-141.

Friedman J.H., Stuetzle W. (1981); Projection Pursuit Regression; Journal of the American Statistical Association 76, 817-823.

Friedman J.H. (1984); SMART User's Guide; Laboratory for Computational Statistics, Stanford University Technical Report No. 1.

Green P.J., Silverman B.W. (1994); Nonparametric Regression and Generalized Linear Models; Chapman and Hall, London.

Gu C., Wahba G. (1991); Minimizing GCV/GML Scores with Multiple Smoothing Parameters via the Newton Method; SIAM Journal on Scientific and Statistical Computing 12, 383-398.

Hastie T., Tibshirani R. (1990); Generalized Additive Models; Chapman and Hall, London.

Kooperberg Ch., Bose S., and Stone C.J. (1997); Polychotomous Regression, Journal of the American Statistical Association 92, 117--127.

McCullagh P., Nelder, J.A. (1989); Generalized Linear Models; Chapman and Hall, London.

Myers R.H. (1986); Classical and Modern Regression with Applications; Duxbury, Boston.

Rousseeuw P.J., Leroy, A. (1987); Robust Regression and Outlier Detection; Wiley, New York.

Seber G.A.F. (1977); Linear Regression Analysis; Wiley, New York.

Stone C.J., Hansen M., Kooperberg Ch., and Truong Y.K. (1997); The use of polynomial splines and their tensor products in extended linear modeling (with discussion).

Venables W.N., Ripley B.D. (1999); Modern Applied Statistics with S-PLUS; Springer, New York.

Wahba G. (1990); Spline Models of Observational Data; SIAM, Philadelphia.

Weisberg S. (1985); Applied Linear Regression; Wiley, New York.

Wood S.N. (2000); Modelling and Smoothing Parameter Estimation with Multiple Quadratic Penalties; Journal of the Royal Statistical Society B 62, 413-428.

Wood S.N. (2001); mgcv: GAMs and Generalized Ridge Regression for R; R News 1, 20-25.

Wood S.N. (2001); Thin Plate Regression Splines.

There exists a vast literature on regression; the references listed above are just a small sample of what is available. The book by Myers is an introductory textbook that covers discussions of many of the recent advances in regression technology. Seber's book is at a higher mathematical level and covers much of the classical theory of least squares.

Examples

## regFit -
   data(recession) 
   # Append a day to the yyyymm dates, giving yyyymmdd format:
   recession[,1] = paste(recession[,1], "28", sep = "")
   
## myPlot -
   myPlot = function(recession, in.sample) {
     # Convert the yyyymmdd dates to fractional years:
     Date = as.numeric(recession[, "date"]) %/% 100   # back to yyyymm
     Date = Date %/% 100 + (Date %% 100)/12
     Recession = recession[, "recession"]
     inSample = as.vector(in.sample)
     # Plot the recession indicator and overlay the in-sample fit:
     plot(Date, Recession, type = "n", main = "US Recession")
     grid()
     lines(Date, Recession, type = "h", col = "steelblue")
     lines(Date, inSample)
   }
   
## Generalized Additive Modelling:
   require(mgcv)
   par(mfrow = c(2, 2))
   fit = gregFit(formula = recession ~ s(tbills3m) + s(tbonds10y),
     family = gaussian(), data = recession, use = "gam")
   # In Sample Prediction:
   in.sample = predict(fit, newdata = recession)$fit  
   myPlot(recession, in.sample)
   # Summary:
   summary(fit)
   # Add plots from the original plot method:
   gam.fit = fit@fit
   class(gam.fit) = "gam"
   plot(gam.fit)
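## Prediction with standard errors -
   # A minimal sketch: per the Usage section, se.fit is an argument of
   # predict.fREG; the $se.fit element below is an assumption inherited
   # from the underlying mgcv::predict.gam return value.
   pred = predict(fit, newdata = recession, se.fit = TRUE)
   head(pred$fit)      # forecasts, as documented in the Value section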
