fsrfan: Robust transformations for regression

Description

The transformations for negative and positive responses were determined by Yeo and Johnson (2000) by imposing the smoothness condition that the second derivative of zYJ(\(\lambda\)) with respect to y be smooth at y = 0. However, some authors, for example Weisberg (2005), query the physical interpretability of this constraint, which is often violated in data analysis. Accordingly, Atkinson et al. (2019) and (2020) extend the Yeo-Johnson transformation to allow two values of the transformation parameter: \(\lambda_N\) for negative observations and \(\lambda_P\) for non-negative ones.
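
For reference, the two-parameter family described above can be written out explicitly. The display below is a sketch of the unnormalized form, assuming the usual Yeo-Johnson definition with \(\lambda\) replaced by \(\lambda_P\) on the non-negative branch and by \(\lambda_N\) on the negative branch; the symbol \(z_{EYJ}\) is introduced here only for illustration.

\[
z_{EYJ}(y; \lambda_P, \lambda_N) =
\begin{cases}
\dfrac{(y+1)^{\lambda_P} - 1}{\lambda_P}, & y \ge 0,\ \lambda_P \ne 0, \\
\log(y+1), & y \ge 0,\ \lambda_P = 0, \\
-\dfrac{(1-y)^{2-\lambda_N} - 1}{2-\lambda_N}, & y < 0,\ \lambda_N \ne 2, \\
-\log(1-y), & y < 0,\ \lambda_N = 2.
\end{cases}
\]

Setting \(\lambda_P = \lambda_N = \lambda\) recovers the standard Yeo-Johnson transformation.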

FSRfan monitors:

1) the t test associated with the constructed variable computed assuming the same transformation parameter for positive and negative observations. In short, we call this test the "global score test for positive observations".
2) the t test associated with the constructed variable computed assuming a different transformation for positive observations, keeping the value of the transformation parameter for negative observations fixed. In short, we call this test the "test for positive observations".
3) the t test associated with the constructed variable computed assuming a different transformation for negative observations, keeping the value of the transformation parameter for positive observations fixed. In short, we call this test the "test for negative observations".
4) the F test for the joint presence of the two constructed variables described in points 2) and 3).
5) the F likelihood ratio test based on the MLE of \(\lambda_P\) and \(\lambda_N\). In this case the residual sum of squares of the null model based on a single transformation parameter \(\lambda\) is compared with the residual sum of squares of the model based on data transformed using the MLE of \(\lambda_P\) and \(\lambda_N\).
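
The idea of a t test on a constructed variable can be illustrated on the simpler Box-Cox case. The sketch below is not the code used by fsrfan: it works on the full sample (no forward search), approximates the constructed variable by a numerical derivative of the normalized transformation with respect to \(\lambda\), and adopts the sign convention in which the statistic is minus the t value. The helper names bc_norm and score_bc are hypothetical.

```r
# Normalized Box-Cox transformation; gm is the geometric mean of y (y must be positive).
bc_norm <- function(y, lambda) {
  gm <- exp(mean(log(y)))
  if (abs(lambda) > 1e-8) {
    (y^lambda - 1) / (lambda * gm^(lambda - 1))
  } else {
    gm * log(y)
  }
}

# Approximate score test of H0: lambda = lambda0 (illustrative only).
score_bc <- function(y, X, lambda0, eps = 1e-4) {
  z <- bc_norm(y, lambda0)
  # Constructed variable: central-difference derivative of z with respect to lambda
  w <- (bc_norm(y, lambda0 + eps) - bc_norm(y, lambda0 - eps)) / (2 * eps)
  fit <- lm(z ~ X + w)
  # Assumed sign convention: minus the t statistic of the constructed variable
  -coef(summary(fit))["w", "t value"]
}

# Simulated data for which the log transformation (lambda = 0) is appropriate
set.seed(1)
n <- 100
x <- runif(n, 0, 4)
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.1))
s0 <- score_bc(y, x, 0)   # tested value agrees with the data
s1 <- score_bc(y, x, 1)   # tested value does not agree with the data
```

Under these assumptions the statistic for the untransformed response (s1) should be much larger in absolute value than the one for the log transformation (s0).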

Usage

fsrfan(
  y,
  x,
  intercept = TRUE,
  plot = FALSE,
  family = c("BoxCox", "YJ", "YJpn", "YJall"),
  la = c(-1, -0.5, 0, 0.5, 1),
  lms,
  alpha = 0.75,
  h,
  init,
  msg = TRUE,
  nocheck = FALSE,
  nsamp = 1000,
  conflev = 0.99,
  xlab,
  ylab,
  main,
  xlim,
  ylim,
  cex.lab,
  cex.axis,
  lwd = 2,
  lwd.env = 1,
  trace = FALSE
)

Arguments

y

Response variable. A vector with n elements that contains the response variable.

x

An n x p data matrix (n observations and p variables). Rows of x represent observations, and columns represent variables.

Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations.

intercept

Whether to use a constant term (default is intercept=TRUE).

plot

If plot=FALSE (default) or plot=0 no plot is produced. If plot=TRUE a fan plot is shown.

family

String which identifies the family of transformations which must be used. Possible values are c('BoxCox', 'YJ', 'YJpn', 'YJall'). Default is 'BoxCox'. The Box-Cox family of power transformations equals \((y^{\lambda}-1)/\lambda\) for \(\lambda\) not equal to zero, and \(\log(y)\) if \(\lambda = 0\). The Yeo-Johnson (YJ) transformation is the Box-Cox transformation of \(y+1\) for nonnegative values, and the negative of the Box-Cox transformation of \(|y|+1\) with parameter \(2-\lambda\) for \(y\) negative. Remember that BoxCox can be used only if the input y is positive. The Yeo-Johnson family of transformations does not have this limitation. If family='YJpn' the Yeo-Johnson family is applied, but in this case it is also possible to monitor (in the output elements Scorep and Scoren) the score test for positive and negative observations respectively. If family='YJall', it is also possible to monitor the joint F test for the presence of the two constructed variables for positive and negative observations.
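
The two basic families described above can be sketched in a few lines of base R. This is an illustration of the formulas only, not the code used internally by fsrfan; the function names box_cox and yeo_johnson are hypothetical, and the transformations are left unnormalized.

```r
# Box-Cox power transformation; defined only for positive y.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (abs(lambda) > 1e-8) (y^lambda - 1) / lambda else log(y)
}

# Yeo-Johnson transformation built from box_cox as described in the text:
# Box-Cox of y + 1 for y >= 0, and minus Box-Cox of |y| + 1 with
# parameter 2 - lambda for y < 0.
yeo_johnson <- function(y, lambda) {
  z <- numeric(length(y))
  neg <- y < 0
  z[!neg] <- box_cox(y[!neg] + 1, lambda)
  z[neg] <- -box_cox(abs(y[neg]) + 1, 2 - lambda)
  z
}

y <- c(-2, -0.5, 0, 1, 4)
yeo_johnson(y, 1)     # lambda = 1 leaves the data unchanged
yeo_johnson(y, 0.5)   # works even though some values are negative
```

Calling box_cox(y, 0.5) on the same vector would stop with an error, which mirrors the restriction that BoxCox requires a positive response.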

la

Values of the transformation parameter for which it is necessary to compute the score test. Default value of lambda is la=c(-1, -0.5, 0, 0.5, 1), i.e., the five most common values of lambda.

lms

How to find the initial subset to initialize the search. If lms=1 (default) Least Median of Squares (LMS) is computed, else Least Trimmed Squares (LTS) is computed. If lms is a matrix of size (p - 1 + intercept) x length(la), it contains in column j=1,..., length(la) the list of units forming the initial subset for the search associated with la[j]. In this case the input option nsamp is ignored.

alpha

The percentage (roughly) of squared residuals whose sum will be minimized, by default alpha=0.75 (see Usage). In general, alpha must be between 0.5 and 1.

h

The number of observations that have determined the least trimmed squares estimator, scalar. h is an integer greater than or equal to p but smaller than n. Generally h=[0.5*(n+p+1)] (default value).

init

Search initialization. It specifies the initial subset size to start monitoring the value of the score test. If init is not specified it will be set equal to: p+1, if the sample size is smaller than 40 or min(3 * p + 1, floor(0.5 * (n+p+1))), otherwise.
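
The default rule for init stated above can be written as a small helper. This is a sketch of the rule as documented, with the hypothetical name default_init; it is not taken from the package source.

```r
# Default initial subset size as described for the 'init' argument:
# p + 1 for small samples (n < 40), otherwise min(3p + 1, floor((n + p + 1) / 2)).
default_init <- function(n, p) {
  if (n < 40) {
    p + 1
  } else {
    min(3 * p + 1, floor(0.5 * (n + p + 1)))
  }
}

default_init(n = 27, p = 4)    # small sample: p + 1
default_init(n = 100, p = 4)   # larger sample: 3 * p + 1 is the smaller term here
```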

msg

Controls whether or not to display messages on the screen. If msg=TRUE (default) messages are displayed on the screen. If msg=2, detailed messages are displayed, for example the information at iteration level.

nocheck

Whether to check input arguments. If nocheck=TRUE no check is performed on matrix y and matrix X. Notice that y and X are left unchanged. In other words the additional column of ones for the intercept is not added. The default is nocheck=FALSE.

nsamp

number of subsamples which will be extracted to find the robust estimator. If nsamp=0 all subsets will be extracted. They will be n choose p.

Remark: if the number of all possible subsets is <1000 the default is to extract all subsets, otherwise just 1000. If nsamp is a matrix of size r-by-p, it contains in the rows the subsets which will have to be extracted. For example, if p=3 and nsamp=rbind(c(2, 4, 9), c(23, 45, 49), c(90, 34, 1)) the first subset is made up of units c(2, 4, 9), the second subset of units c(23, 45, 49) and the third subset of units c(90, 34, 1).
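
The counting rule and the matrix form of nsamp can be illustrated in base R. The sketch below only builds the quantities described in the text; the names n_subsets, extract_all and nsamp_mat are introduced here for illustration.

```r
# Number of distinct subsets of size p out of n units ("n choose p")
n <- 100
p <- 3
n_subsets <- choose(n, p)

# Documented rule: extract every subset only when there are fewer than 1000 of them
extract_all <- n_subsets < 1000

# User-supplied subsets: one subset per row, p unit indices per row
nsamp_mat <- rbind(c(2, 4, 9),
                   c(23, 45, 49),
                   c(90, 34, 1))
dim(nsamp_mat)   # r = 3 subsets of size p = 3
```

With n = 100 and p = 3 there are far more than 1000 subsets, so under the documented rule only a sample of them would be drawn unless a matrix such as nsamp_mat is supplied.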

conflev

Confidence level for the bands (default is 0.99, that is we plot two horizontal lines corresponding to values -2.58 and 2.58).
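
The values -2.58 and 2.58 quoted above are consistent with two-sided standard normal quantiles. A minimal sketch, assuming that this is how the bands are derived from conflev (the helper name band_limits is hypothetical):

```r
# Two-sided standard normal limits for a given confidence level
band_limits <- function(conflev = 0.99) {
  q <- qnorm(1 - (1 - conflev) / 2)
  c(lower = -q, upper = q)
}

round(band_limits(0.99), 2)   # the default level
round(band_limits(0.95), 2)   # a narrower band
```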

xlab

A label for the X-axis, default is 'Subset size m'

ylab

A label for the Y-axis, default is 'Score test statistic'

main

A label for the title, default is 'Fan plot'

xlim

Minimum and maximum for the X-axis

ylim

Minimum and maximum for the Y-axis

cex.lab

The magnification to be used for x and y labels relative to the current setting of cex

cex.axis

The magnification to be used for axis annotation relative to the current setting of cex

lwd

The line width of the curves which contain the score test, a positive number, default is lwd=2

lwd.env

The line width of the lines associated with the envelopes, a positive number, default is lwd.env=1

trace

Whether to print intermediate results. Default is trace=FALSE.

Value

An S3 object of class fsrfan.object will be returned which is basically a list containing the following elements:

la vector containing the values of lambda for which fan plot is constructed
bs matrix of size p X length(la) containing the units forming the initial subset for each value of lambda
Score a matrix containing the values of the score test for each value of the transformation parameter:
- 1st col = fwd search index;
- 2nd col = value of the score test in each step of the fwd search for la[1]
- ...
Scorep matrix containing the values of the score test for positive observations for each value of the transformation parameter.

Note: this output is present only if input option family='YJpn' or family='YJall'.
Scoren matrix containing the values of the score test for negative observations for each value of the transformation parameter.

Note: this output is present only if input option 'family' is 'YJpn' or 'YJall'.
Scoreb matrix containing the values of the score test for the joint presence of both constructed variables (associated with positive and negative observations) for each value of the transformation parameter. In this case the reference distribution is the \(F\) with 2 and subset_size - p degrees of freedom.

Note: this output is present only if input option family='YJall'.
Un a three-dimensional array containing length(la) matrices of size retnUn=(n-init) X retpUn=11. Each matrix contains the unit(s) included in the subset at each step in the search associated with the corresponding element of la.

REMARK: at each step the new subset is compared with the old subset. Un contains the unit(s) present in the new subset but not in the old one.
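
For the Scoreb element, the reference distribution is the \(F\) with 2 and subset_size - p degrees of freedom, so a threshold at a given subset size can be obtained with qf. The sketch below assumes that the same confidence level as conflev is used for this comparison, which the text does not state; f_threshold is a hypothetical helper.

```r
# Upper quantile of the F(2, m - p) reference distribution at subset size m
f_threshold <- function(m, p, conflev = 0.99) {
  stopifnot(m > p)
  qf(conflev, df1 = 2, df2 = m - p)
}

# Thresholds along a search with p = 4 explanatory variables (including the intercept)
sapply(c(10, 24, 50, 100), f_threshold, p = 4)
```

The threshold decreases as the subset size grows, because the denominator degrees of freedom increase.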

References

Atkinson, A.C. and Riani, M. (2000), Robust Diagnostic Regression Analysis, Springer Verlag, New York.

Atkinson, A.C. and Riani, M. (2002), Tests in the fan plot for robust, diagnostic transformations in regression, Chemometrics and Intelligent Laboratory Systems, 60, pp. 87--100.

Atkinson, A.C., Riani, M. and Corbellini, A. (2019), The analysis of transformations for profit-and-loss data, Journal of the Royal Statistical Society, Series C, "Applied Statistics", 69, pp. 251--275. 10.1111/rssc.12389.

Atkinson, A.C., Riani, M. and Corbellini, A. (2021), The Box-Cox Transformation: Review and Extensions, Statistical Science, 36(2), pp. 239--255. 10.1214/20-STS778.

Examples

# NOT RUN {
   data(wool)
   XX <- wool
   y <- XX[, ncol(XX)]
   X <- XX[, 1:(ncol(XX)-1), drop=FALSE]

   out <- fsrfan(y, X)                    # call 'fsrfan' with all default parameters

   out <- fsrfan(y, X, plot=TRUE)         # call 'fsrfan' and produce the plot

   ## call 'fsrfan' with Yeo-Johnson (YJ) transformation
   out <- fsrfan(y, X, family="YJ", plot=TRUE)

# }