tclustreg for different numbers of groups k and restriction factors c
tclustregIC (the last two letters stand for 'Information Criterion') computes the values of BIC (MIXMIX), ICL (MIXCLA) or CLA (CLACLA) for different values of k (number of groups) and different values of c (restriction factor for the variances of the residuals), for a prespecified level of trimming. In order to minimize randomness, given k, the same subsets are used for each value of c.
tclustregIC(
y,
x,
alphaLik,
alphaX,
intercept = TRUE,
plot = FALSE,
nsamp,
refsteps = 10,
reftol = 1e-13,
equalweights = FALSE,
wtrim = 0,
we,
msg = TRUE,
RandNumbForNini,
trace = FALSE,
...
)
An S3 object of class tclustreg.object
Response variable. A vector with n elements that contains the response variable.
An n x p data matrix (n observations and p variables). Rows of x represent observations, and columns represent variables.
Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations.
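The row exclusion described above can be pictured with a minimal base R sketch. The toy data and the names keep, y_clean and x_clean are hypothetical, and filtering on the response as well as on x is an assumption of this sketch, not something the text states.

```r
# Hypothetical toy data: 6 observations, 2 explanatory variables.
y <- c(1.2, NA, 3.1, 4.8, 5.0, 6.3)
x <- cbind(x1 = c(0.5, 1.0, Inf, 2.0, 2.5, 3.0),
           x2 = c(1.0, 2.0, 3.0, NA, 5.0, 6.0))

# A row is kept only if the response and every regressor are finite.
# is.finite() is FALSE for NA, NaN, Inf and -Inf.
# Assumption: the response is screened in the same way as x.
keep <- is.finite(y) & rowSums(!is.finite(x)) == 0

y_clean <- y[keep]
x_clean <- x[keep, , drop = FALSE]
n <- length(y_clean)  # number of observations that enter the computations
```

Here n refers to the observations that remain after the exclusion, which is the n used in the quantities defined for the other arguments.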
Trimming level, a scalar between 0 and 0.5 or an integer specifying the number of observations which have to be trimmed. If alphaLik=0, there is no trimming. In more detail, if 0 < alphaLik < 1, clustering is based on h = floor(n * (1 - alphaLik)) observations. If alphaLik is an integer greater than 1, clustering is based on h = n - floor(alphaLik) observations. Specifically, likelihood contributions are sorted and the units associated with the smallest n - h contributions are trimmed.
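The rule for h can be written out as a small base R sketch. The helper n_retained and the log-likelihood values are hypothetical and not part of the package. Treating any alphaLik of 1 or more as a count is an assumption, since the text only covers integers greater than 1.

```r
# Hypothetical helper: number of observations h used for clustering.
n_retained <- function(n, alphaLik) {
  stopifnot(n >= 1, alphaLik >= 0)
  if (alphaLik == 0) {
    n                              # no trimming
  } else if (alphaLik < 1) {
    floor(n * (1 - alphaLik))      # alphaLik is a proportion
  } else {
    n - floor(alphaLik)            # alphaLik is a number of units (assumed for >= 1)
  }
}

# Made-up likelihood contributions for 8 units.
loglik <- c(-1.2, -9.5, -0.8, -2.1, -7.3, -1.0, -0.5, -3.3)
n <- length(loglik)
h <- n_retained(n, 0.25)           # floor(8 * 0.75) = 6

# The n - h units with the smallest contributions are trimmed.
trimmed <- order(loglik)[seq_len(n - h)]
```

With these values, the two units with the lowest contributions (units 2 and 5) are the ones trimmed.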
Second-level trimming or constrained weighted model for x.
Whether to use a constant term (default is intercept=TRUE).
If plot=FALSE (default) or plot=0, no plot is produced. If plot=TRUE, a plot with the final allocation is shown (using the spmplot function). If x is 2-dimensional, the lines associated with the groups are shown too.
If nsamp is a scalar, it contains the number of subsamples which will be extracted. If nsamp = 0, all subsets will be extracted. Remark: if the number of all possible subsets is less than 300, the default is to extract all subsets; otherwise just 300. If nsamp is a matrix, it contains in the rows the indexes of the subsets which have to be extracted. In this case nsamp can be conveniently generated by function subsets(). nsamp must have k * p columns. The first p columns are used to estimate the regression coefficients of group 1, ..., the last p columns are used to estimate the regression coefficients of group k.
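The layout of a matrix-valued nsamp can be illustrated with base R alone. This sketch does not use subsets(). The values of n, p, k and n_subsets, and the helper cols_for_group, are hypothetical. Drawing distinct indexes within each row is an assumption.

```r
set.seed(123)   # only to make this sketch reproducible
n <- 50         # observations
p <- 2          # explanatory variables (columns used per group)
k <- 3          # groups
n_subsets <- 5  # number of pre-extracted subsets

# One subset per row; each row holds k * p observation indexes.
# Assumption: indexes within a row are drawn without replacement.
nsamp <- t(replicate(n_subsets, sample(n, k * p)))
dim(nsamp)      # n_subsets rows, k * p columns

# Columns ((g - 1) * p + 1):(g * p) are the ones used for group g.
cols_for_group <- function(g, p) ((g - 1) * p + 1):(g * p)

# Indexes of the first subset, split by group.
first_subset_by_group <- lapply(seq_len(k),
                                function(g) nsamp[1, cols_for_group(g, p)])
```

Each element of first_subset_by_group then holds the p units that would initialise the regression coefficients of one group.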
Number of refining iterations in each subsample. Default is refsteps=10. refsteps = 0 means "raw-subsampling" without iterations.
Tolerance of the refining steps. The default value is 1e-13.
A logical specifying whether cluster weights in the concentration and assignment steps shall be considered. If equalweights=TRUE we are (ideally) assuming equally sized groups, else if equalweights=FALSE (default) we allow for different group weights. Please check in the given references which functions are maximized in both cases.
How to apply the weights on the observations: a flag taking values in c(0, 1, 2, 3, 4).
If wtrim==0 (no weights), the algorithm reduces to the standard tclustreg algorithm.
If wtrim==1, trimming is done by weighting the observations using values specified in vector we. In this case, vector we must be supplied by the user.
If wtrim==2, trimming is again done by weighting the observations using values specified in vector we. In this case, vector we is computed from the data as a function of the density estimate pdfe. Specifically, the weight of each observation is the probability of retaining the observation, computed as
$$pretain_{ig} = 1 - pdfe_{ig} / \max_{ig}(pdfe_{ig})$$
If wtrim==3, trimming is again done by weighting the observations using values specified in vector we. In this case, each element we_i of vector we is a Bernoulli random variable with probability of success \(pdfe_{ig}\). In the clustering framework this is done under the constraint that no group is empty.
If wtrim==4, trimming is done with the tandem approach of Cerioli and Perrotta (2014).
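The weights for wtrim==2 and wtrim==3 can be sketched in base R as follows. The density values in pdfe are made up and are stored as a plain vector, one value per observation, which simplifies the (i, g) indexing of the formula. The constraint that no group may be empty is not handled here.

```r
# Hypothetical density estimates, one per observation (made-up values in [0, 1]).
pdfe <- c(0.10, 0.40, 0.80, 0.20, 0.80)

# wtrim == 2: weight = probability of retaining the observation,
# pretain = 1 - pdfe / max(pdfe).
pretain <- 1 - pdfe / max(pdfe)

# wtrim == 3: each weight is a Bernoulli draw with success probability pdfe.
# Assumption: pdfe already lies in [0, 1], as rbinom() requires.
set.seed(42)    # only to make this sketch reproducible
we_bernoulli <- rbinom(length(pdfe), size = 1, prob = pdfe)
```

Under the wtrim==2 formula, the observations with the highest estimated density receive a retention probability of zero, while those in sparse regions are retained with probability close to one.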
Weights. A vector of size n-by-1 containing application-specific weights. Default is a vector of ones.
Controls whether to display messages on the screen. If msg==TRUE (default), messages are displayed on the screen. If msg=2, detailed messages are displayed, for example the information at iteration level.
Pre-extracted random numbers to initialize proportions. Matrix of size k-by-nrow(nsamp) containing the random numbers which are used to initialize the proportions of the groups. This option is effective only if nsamp is a matrix which contains pre-extracted subsamples. The purpose of this option is to enable the user to replicate the results when the function tclustreg() is called using a parfor instruction (as happens, for example, in routine IC, where tclustreg() is called through a parfor for different values of the restriction factor). The default is that RandNumbForNini is empty; uniform random numbers are then used.
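A matrix of the stated size could be prepared in base R as sketched below. The nsamp matrix is a hypothetical set of two pre-extracted subsets with k * p = 6 columns. The last step, which rescales each column to sum to one, is only an assumption about how such numbers might become proportions, not the documented behaviour.

```r
set.seed(2024)  # fixing the seed is what makes the run replicable
k <- 3          # number of groups

# Hypothetical pre-extracted subsets: 2 rows, k * p = 6 columns (p = 2).
nsamp <- matrix(c(1, 2, 3,  4,  5,  6,
                  7, 8, 9, 10, 11, 12), nrow = 2, byrow = TRUE)

# One column of k uniform random numbers per pre-extracted subset.
RandNumbForNini <- matrix(runif(k * nrow(nsamp)),
                          nrow = k, ncol = nrow(nsamp))
dim(RandNumbForNini)  # k rows, nrow(nsamp) columns

# Assumption, for illustration only: rescale each column to sum to 1.
init_props <- sweep(RandNumbForNini, 2, colSums(RandNumbForNini), "/")
```

Passing the same nsamp and RandNumbForNini to repeated calls is what would keep the initial proportions identical across runs.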
Whether to print intermediate results. Default is trace=FALSE.
Potential further arguments passed to lower level functions.
FSDA team, valentin.todorov@chello.at
Torti F., Perrotta D., Riani M. and Cerioli A. (2019). Assessing Robust Methodologies for Clustering Linear Regression Data, Advances in Data Analysis and Classification, Vol. 13, pp. 227-257.