The transformations for negative and positive responses were determined by Yeo and Johnson (2000) by imposing the smoothness condition that the second derivative of zYJ(\(\lambda\)) with respect to y be continuous at y = 0. However, some authors, for example Weisberg (2005), query the physical interpretability of this constraint, which is often violated in data analysis. Accordingly, Atkinson et al. (2019, 2021) extend the Yeo-Johnson transformation to allow two values of the transformation parameter: \(\lambda_N\) for negative observations and \(\lambda_P\) for non-negative ones.
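Written out, the extended transformation described above can be sketched as follows. This is the unnormalized form, assuming the standard Yeo-Johnson definition with the parameter chosen according to the sign of y; the notation \(z_{YJ}(y;\lambda_P,\lambda_N)\) is introduced here for illustration and is not taken from the cited papers.

\[
z_{YJ}(y;\lambda_P,\lambda_N) =
\begin{cases}
\dfrac{(y+1)^{\lambda_P}-1}{\lambda_P}, & y \ge 0,\ \lambda_P \ne 0,\\[1ex]
\log(y+1), & y \ge 0,\ \lambda_P = 0,\\[1ex]
-\dfrac{(1-y)^{2-\lambda_N}-1}{2-\lambda_N}, & y < 0,\ \lambda_N \ne 2,\\[1ex]
-\log(1-y), & y < 0,\ \lambda_N = 2.
\end{cases}
\]

Setting \(\lambda_P = \lambda_N = \lambda\) recovers the ordinary one-parameter Yeo-Johnson transformation.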
FSRfan monitors:
1) the t test associated with the constructed variable computed assuming the same transformation parameter for positive and negative observations. In short we call this test, "global score test for positive observations".
2) the t test associated with the constructed variable computed assuming a different transformation for positive observations, keeping the value of the transformation parameter for negative observations fixed. In short we call this test, "test for positive observations".
3) the t test associated with the constructed variable computed assuming a different transformation for negative observations, keeping the value of the transformation parameter for positive observations fixed. In short we call this test, "test for negative observations".
4) the F test for the joint presence of the two constructed variables described in points 2) and 3).
5) the F likelihood ratio test based on the MLE of \(\lambda_P\) and \(\lambda_N\). In this case the residual sum of squares of the null model based on a single transformation parameter \(\lambda\) is compared with the residual sum of squares of the model based on data transformed using the MLE of \(\lambda_P\) and \(\lambda_N\).
fsrfan(
y,
x,
intercept = TRUE,
plot = FALSE,
family = c("BoxCox", "YJ", "YJpn", "YJall"),
la = c(-1, -0.5, 0, 0.5, 1),
lms,
alpha = 0.75,
h,
init,
msg = TRUE,
nocheck = FALSE,
nsamp = 1000,
conflev = 0.99,
xlab,
ylab,
main,
xlim,
ylim,
cex.lab,
cex.axis,
lwd = 2,
lwd.env = 1,
trace = FALSE
)
y
Response variable. A vector with n elements that contains the response variable.
x
An n x p data matrix (n observations and p variables). Rows of x represent observations, and columns represent variables.
Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations.
intercept
Whether to include a constant term in the model (default is intercept=TRUE).
plot
If plot=FALSE (default) or plot=0 no plot is produced. If plot=TRUE a fan plot is shown.
family
String which identifies the family of transformations which must be used. Possible values are c('BoxCox', 'YJ', 'YJpn', 'YJall'). Default is 'BoxCox'. The Box-Cox family of power transformations equals \((y^{\lambda}-1)/\lambda\) for \(\lambda\) not equal to zero, and \(\log(y)\) if \(\lambda = 0\). The Yeo-Johnson (YJ) transformation is the Box-Cox transformation of \(y+1\) for nonnegative values, and minus the Box-Cox transformation of \(|y|+1\) with parameter \(2-\lambda\) for \(y\) negative. Remember that BoxCox can be used only if input y is positive. The Yeo-Johnson family of transformations does not have this limitation. If family='YJpn' the Yeo-Johnson family is applied, but in this case it is also possible to monitor (in the output elements Scorep and Scoren) the score test for positive and negative observations respectively. If family='YJall', it is also possible to monitor the joint F test for the presence of the two constructed variables for positive and negative observations.
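The two families can be sketched in base R as below. This is a minimal illustration of the unnormalized transformations as defined in the text, not the code used internally by fsrfan; the function names box_cox and yeo_johnson and the arguments la_pos and la_neg are hypothetical.

```r
# Box-Cox transformation (unnormalized); requires strictly positive y.
box_cox <- function(y, la) {
  stopifnot(all(y > 0))
  if (la == 0) log(y) else (y^la - 1) / la
}

# Yeo-Johnson transformation (unnormalized). la_neg defaults to la_pos,
# which gives the ordinary one-parameter family; passing a different
# la_neg gives the two-parameter extension (hypothetical helper).
yeo_johnson <- function(y, la_pos, la_neg = la_pos) {
  z <- numeric(length(y))
  pos <- y >= 0
  # Non-negative values: Box-Cox of y + 1 with parameter la_pos.
  z[pos] <- box_cox(y[pos] + 1, la_pos)
  # Negative values: minus the Box-Cox of |y| + 1 with parameter 2 - la_neg.
  z[!pos] <- -box_cox(abs(y[!pos]) + 1, 2 - la_neg)
  z
}

y <- c(-3, -1, 0, 1, 3)
yeo_johnson(y, 1)        # la = 1 leaves the data unchanged
yeo_johnson(y, 0.5, 1)   # square-root-type transformation of y >= 0 only
```

With la_pos = la_neg = 1 the transformation is the identity, which is why la = 1 corresponds to "no transformation" in the fan plot.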
la
Values of the transformation parameter for which it is necessary to compute the score test. Default value of lambda is la=c(-1, -0.5, 0, 0.5, 1), i.e., the five most common values of lambda.
lms
How to find the initial subset to initialize the search. If lms=1 (default) Least Median of Squares (LMS) is computed, else Least Trimmed Squares (LTS) is computed. If lms is a matrix of size (p - 1 + intercept) X length(la), it contains in column j = 1, ..., length(la) the list of units forming the initial subset for the search associated with la[j]. In this case the input option nsamp is ignored.
alpha
The percentage (roughly) of squared residuals whose sum will be minimized, by default alpha=0.5. In general, alpha must be between 0.5 and 1.
h
The number of observations that have determined the least trimmed squares estimator, scalar. h is an integer greater than or equal to p but smaller than n. Generally h=[0.5*(n+p+1)] (default value).
init
Search initialization. It specifies the initial subset size to start monitoring the value of the score test. If init is not specified it will be set equal to p+1 if the sample size is smaller than 40, or to min(3 * p + 1, floor(0.5 * (n+p+1))) otherwise.
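The default rules stated for h and init can be written as small helper functions. This is only a sketch of the formulas in the text: the names default_h and default_init are hypothetical, the square brackets in the rule for h are read as the integer part, and p is assumed to be the number of explanatory variables as counted by the function (whether that count includes the intercept is not stated here).

```r
# Default size of the subset used by the robust estimator: [0.5 * (n + p + 1)].
default_h <- function(n, p) floor(0.5 * (n + p + 1))

# Default initial subset size for monitoring the score test.
default_init <- function(n, p) {
  if (n < 40) p + 1 else min(3 * p + 1, floor(0.5 * (n + p + 1)))
}

default_init(n = 27, p = 4)    # n < 40, so p + 1 = 5
default_init(n = 200, p = 4)   # min(13, 102) = 13
default_h(n = 27, p = 4)       # floor(16) = 16
```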
msg
Controls whether to display messages on the screen. If msg=TRUE (default) messages are displayed on the screen. If msg=2, detailed messages are displayed, for example the information at iteration level.
nocheck
Whether to check input arguments. If nocheck=TRUE no check is performed on y and x. Notice that y and x are left unchanged. In other words the additional column of ones for the intercept is not added. The default is nocheck=FALSE.
nsamp
Number of subsamples which will be extracted to find the robust estimator. If nsamp=0 all subsets will be extracted. They will be n choose p.
Remark: if the number of all possible subsets is <1000 the default is to extract all subsets, otherwise just 1000. If nsamp is a matrix of size r-by-p, it contains in the rows the subsets which will have to be extracted. For example, if p=3 and nsamp=rbind(c(2, 4, 9), c(23, 45, 49), c(90, 34, 1)) the first subset is made up of units c(2, 4, 9), the second subset of units c(23, 45, 49) and the third subset of units c(90, 34, 1).
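As a rough illustration of the rule above, the following base R lines count the possible subsets and build a matrix of user-chosen subsets. The values n = 100 and p = 3 and the variable names are made up for the example; the matrix would be passed as the nsamp argument.

```r
n <- 100   # illustrative number of observations
p <- 3     # illustrative subset size

# Number of all possible subsets of size p ("n choose p").
n_all <- choose(n, p)

# Rule described in the text: all subsets if fewer than 1000, else 1000.
n_extracted <- if (n_all < 1000) n_all else 1000

# User-supplied subsets: one subset per row, p units per row.
user_subsets <- rbind(c(2, 4, 9), c(23, 45, 49), c(90, 34, 1))

# Basic sanity checks: p columns and unit indices within 1..n.
stopifnot(ncol(user_subsets) == p,
          all(user_subsets >= 1),
          all(user_subsets <= n))
```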
conflev
Confidence level for the bands (default is 0.99, that is, we plot two horizontal lines corresponding to values -2.58 and 2.58).
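The values -2.58 and 2.58 are consistent with two-sided quantiles of the standard normal distribution. Assuming that is how the bands are obtained, they can be reproduced for any confidence level as follows (the variable names are illustrative).

```r
conflev <- 0.99

# Two-sided standard normal quantile for the chosen confidence level.
z <- qnorm(1 - (1 - conflev) / 2)

round(c(-z, z), 2)   # -2.58  2.58
```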
xlab
A label for the X-axis, default is 'Subset size m'
ylab
A label for the Y-axis, default is 'Score test statistic'
main
A label for the title, default is 'Fan plot'
xlim
Minimum and maximum for the X-axis
ylim
Minimum and maximum for the Y-axis
cex.lab
The magnification to be used for x and y labels relative to the current setting of cex
cex.axis
The magnification to be used for axis annotation relative to the current setting of cex
lwd
The line width of the curves which contain the score test, a positive number, default is lwd=2
lwd.env
The line width of the lines associated with the envelopes, a positive number, default is lwd.env=1
trace
Whether to print intermediate results. Default is trace=FALSE.
An S3 object of class fsrfan.object will be returned which is basically a list containing the following elements:
la
vector containing the values of lambda for which fan plot is constructed
bs
matrix of size p X length(la) containing the units forming the initial subset for each value of lambda
Score
a matrix containing the values of the score test for
each value of the transformation parameter:
1st col = fwd search index;
2nd col = value of the score test in each step of the fwd search for la[1]
...
Scorep
matrix containing the values of the score test for positive observations for each value of the transformation parameter.
Note: this output is present only if input option family='YJpn' or family='YJall'.
Scoren
matrix containing the values of the score test for negative observations for each value of the transformation parameter.
Note: this output is present only if input option family='YJpn' or family='YJall'.
Scoreb
matrix containing the values of the score test for the joint presence of both constructed variables (associated with positive and negative observations) for each value of the transformation parameter. In this case the reference distribution is the \(F\) with 2 and subset_size - p degrees of freedom.
Note: this output is present only if input option family='YJall'.
Un
a three-dimensional array containing length(la) matrices of size retnUn=(n-init) X retpUn=11. Each matrix contains the unit(s) included in the subset at each step in the search associated with the corresponding element of la.
REMARK: at each step the new subset is compared with the old subset. Un contains the unit(s) present in the new subset but not in the old one.
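One way to read the Score element is to check, for each value of lambda, whether the score test stays inside the confidence bands at every step of the search. The sketch below does this on a small synthetic matrix laid out as described above (first column the subset size, then one column per value of la). The numbers are invented for illustration, within_bands is a hypothetical helper, and the bands are assumed to be two-sided standard normal quantiles.

```r
# Synthetic stand-in for out$la and out$Score (made-up values, not real output).
la <- c(0, 0.5)
Score <- cbind(10:14,
               c(0.3, -0.8, 1.1, 0.2, -0.5),   # score test for la[1]
               c(1.9, 2.4, 2.9, 3.3, 3.8))     # score test for la[2]

# TRUE for each lambda whose score test never leaves the bands.
within_bands <- function(Score, conflev = 0.99) {
  z <- qnorm(1 - (1 - conflev) / 2)
  # Drop column 1 (subset size m) and check every remaining column.
  apply(abs(Score[, -1, drop = FALSE]) <= z, 2, all)
}

accepted <- setNames(within_bands(Score), la)
accepted
```

In this synthetic example the first value of lambda stays within the bands throughout, while the second crosses the upper band as the subset grows.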
Atkinson, A.C. and Riani, M. (2000), Robust Diagnostic Regression Analysis, Springer Verlag, New York.
Atkinson, A.C. and Riani, M. (2002), Tests in the fan plot for robust, diagnostic transformations in regression, Chemometrics and Intelligent Laboratory Systems, 60, pp. 87--100.
Atkinson, A.C., Riani, M. and Corbellini, A. (2019), The analysis of transformations for profit-and-loss data, Journal of the Royal Statistical Society, Series C, "Applied Statistics", 69, pp. 251--275. doi:10.1111/rssc.12389.
Atkinson, A.C., Riani, M. and Corbellini, A. (2021), The Box-Cox Transformation: Review and Extensions, Statistical Science, 36(2), pp. 239--255. doi:10.1214/20-STS778.
# NOT RUN {
data(wool)
XX <- wool
y <- XX[, ncol(XX)]
X <- XX[, 1:(ncol(XX)-1), drop=FALSE]
out <- fsrfan(y, X) # call 'fsrfan' with all default parameters
out <- fsrfan(y, X, plot=TRUE) # call 'fsrfan' and produce the plot
## call 'fsrfan' with Yeo-Johnson (YJ) transformation
out <- fsrfan(y, X, family="YJ", plot=TRUE)
# }