IASSMR.kNN.fit: Impact point selection with IASSMR and kNN estimation

Description

This function implements the Improved Algorithm for Sparse Semiparametric Multi-functional Regression (IASSMR) with kNN estimation. This algorithm is specifically designed for estimating multi-functional partial linear single-index models, which incorporate multiple scalar variables and a functional covariate as predictors. These scalar variables are derived from the discretisation of a curve and have linear effects while the functional covariate exhibits a single-index effect.

IASSMR is a two-stage procedure that selects the impact points of the discretised curve and estimates the model. The algorithm employs a penalised least-squares regularisation procedure, integrated with kNN estimation using Nadaraya-Watson weights. It uses B-spline expansions to represent curves and eligible functional indexes. Additionally, it utilises an objective criterion (criterion) to determine the initial number of covariates in the reduced model (w.opt), the number of neighbours (k.opt), and the penalisation parameter (lambda.opt).

Usage

IASSMR.kNN.fit(x, z, y, train.1 = NULL, train.2 = NULL, 
seed.coeff = c(-1, 0, 1), order.Bspline = 3, nknot.theta = 3, knearest = NULL,
min.knn = 2, max.knn = NULL, step = NULL, range.grid = NULL, 
kind.of.kernel = "quad", nknot = NULL, lambda.min = NULL, lambda.min.h = NULL, 
lambda.min.l = NULL, factor.pn = 1, nlambda = 100, vn = ncol(z), nfolds = 10, 
seed = 123, wn = c(10, 15, 20), criterion = "GCV", penalty = "grSCAD", 
max.iter = 1000, n.core = NULL)

Value

call: The matched call.
fitted.values: Estimated scalar response.
residuals: Differences between y and the fitted.values.
beta.est: $\hat{β}$ (i.e. estimate of $β_{0}$ when the optimal tuning parameters w.opt, lambda.opt, vn.opt and k.opt are used).
theta.est: Coefficients of $\hat{θ}$ in the B-spline basis (i.e. estimate of $θ_{0}$ when the optimal tuning parameters w.opt, lambda.opt, vn.opt and k.opt are used): a vector of length(order.Bspline+nknot.theta).
indexes.beta.nonnull: Indexes of the non-zero $\hat{β_{j}}$ .
k.opt: Selected number of nearest neighbours (when w.opt is considered).
w.opt: Selected initial number of covariates in the reduced model.
lambda.opt: Selected value of the penalisation parameter $λ$ (when w.opt is considered).
IC: Value of the criterion function considered to select w.opt, lambda.opt, vn.opt and k.opt.
vn.opt: Selected value of vn in the second step (when w.opt is considered).
beta2: Estimate of $β_{0}^{2}$ for each value of the sequence wn.
theta2: Estimate of $θ_{0}^{2}$ for each value of the sequence wn (i.e. its coefficients in the B-spline basis).
indexes.beta.nonnull2: Indexes of the non-zero linear coefficients after the step 2 of the method for each value of the sequence wn.
knn2: Selected number of neighbours in the second step of the algorithm for each value of the sequence wn.
IC2: Optimal value of the criterion function in the second step for each value of the sequence wn.
lambda2: Selected value of penalisation parameter in the second step for each value of the sequence wn.
index02: Indexes of the covariates (in the entire set of $p_{n}$ ) used to build $R_{n}^{2}$ for each value of the sequence wn.
beta1: Estimate of $β_{0}^{1}$ for each value of the sequence wn.
theta1: Estimate of $θ_{0}^{1}$ for each value of the sequence wn (i.e. its coefficients in the B-spline basis).
knn1: Selected number of neighbours in the first step of the algorithm for each value of the sequence wn.
IC1: Optimal value of the criterion function in the first step for each value of the sequence wn.
lambda1: Selected value of penalisation parameter in the first step for each value of the sequence wn.
index01: Indexes of the covariates (in the whole set of $p_{n}$ ) used to build $R_{n}^{1}$ for each value of the sequence wn.
index1: Indexes of the non-zero linear coefficients after the step 1 of the method for each value of the sequence wn.
...

Arguments

x: Matrix containing the observations of the functional covariate collected by row (functional single-index component).
z: Matrix containing the observations of the functional covariate that is discretised collected by row (linear component).
y: Vector containing the scalar response.
train.1: Positions of the data that are used as the training sample in the 1st step. The default setting is train.1<-1:ceiling(n/2).
train.2: Positions of the data that are used as the training sample in the 2nd step. The default setting is train.2<-(ceiling(n/2)+1):n.
seed.coeff: Vector of initial values used to build the set $Θ_{n}$ (see section Details). The coefficients for the B-spline representation of each eligible functional index $θ \in Θ_{n}$ are obtained from seed.coeff. The default is c(-1,0,1).
order.Bspline: Positive integer giving the order of the B-spline basis functions. This is the number of coefficients in each piecewise polynomial segment. The default is 3.
nknot.theta: Positive integer indicating the number of regularly spaced interior knots in the B-spline expansion of $θ_{0}$ . The default is 3.
knearest: Vector of positive integers containing the sequence in which the number of nearest neighbours k.opt is selected. If knearest=NULL, then knearest <- seq(from =min.knn, to = max.knn, by = step).
min.knn: A positive integer that represents the minimum value in the sequence for selecting the number of nearest neighbours k.opt. This value should be less than the sample size. The default is 2.
max.knn: A positive integer that represents the maximum value in the sequence for selecting number of nearest neighbours k.opt. This value should be less than the sample size. The default is max.knn <- n%/%5.
step: A positive integer used to construct the sequence of k-nearest neighbours as follows: min.knn, min.knn + step, min.knn + 2*step, min.knn + 3*step,.... The default value for step is step<-ceiling(n/100).
range.grid: Vector of length 2 containing the endpoints of the grid at which the observations of the functional covariate x are evaluated (i.e. the range of the discretisation). If range.grid=NULL, then range.grid=c(1,p) is considered, where p is the discretisation size of x (i.e. ncol(x)).
kind.of.kernel: The type of kernel function used. Currently, only Epanechnikov kernel ("quad") is available.
nknot: Positive integer indicating the number of interior knots for the B-spline expansion of the functional covariate. The default value is (p - order.Bspline - 1)%/%2.
lambda.min: The smallest value for lambda (i. e., the lower endpoint of the sequence in which lambda.opt is selected), as fraction of lambda.max. The defaults is lambda.min.l if the sample size is larger than factor.pn times the number of linear covariates and lambda.min.h otherwise.
lambda.min.h: The lower endpoint of the sequence in which lambda.opt is selected if the sample size is smaller than factor.pn times the number of linear covariates. The default is 0.05.
lambda.min.l: The lower endpoint of the sequence in which lambda.opt is selected if the sample size is larger than factor.pn times the number of linear covariates. The default is 0.0001.
factor.pn: Positive integer used to set lambda.min. The default value is 1.
nlambda: Positive integer indicating the number of values in the sequence from which lambda.opt is selected. The default is 100.
vn: Positive integer or vector of positive integers indicating the number of groups of consecutive variables to be penalised together. The default value is vn=ncol(z), resulting in the individual penalization of each scalar covariate.
nfolds: Number of cross-validation folds (used when criterion="k-fold-CV"). Default is 10.
seed: You may set the seed for the random number generator to ensure reproducible results (applicable when criterion="k-fold-CV" is used). The default seed value is 123.
wn: A vector of positive integers indicating the eligible number of covariates in the reduced model. For more information, refer to the section Details. The default is c(10,15,20).
criterion: The criterion used to select the tuning and regularisation parameters: wn.opt, lambda.opt and k.opt (also vn.opt if needed). Options include "GCV", "BIC", "AIC", or "k-fold-CV". The default setting is "GCV".
penalty: The penalty function applied in the penalised least-squares procedure. Currently, only "grLasso" and "grSCAD" are implemented. The default is "grSCAD".
max.iter: Maximum number of iterations allowed across the entire path. The default value is 1000.
n.core: Number of CPU cores designated for parallel execution. The default is n.core<-availableCores(omit=1).

Author

German Aneiros Perez german.aneiros@udc.es

Silvia Novo Diaz snovo@est-econ.uc3m.es

Details

The multi-functional partial linear single-index model (MFPLSIM) is given by the expression $Y_{i} = \sum_{j = 1}^{p_{n}} β_{0 j} ζ_{i} (t_{j}) + r (⟨ θ_{0}, X_{i} ⟩) + ε_{i}, (i = 1, \dots, n),$ where:

$Y_{i}$ represents a real random response and $X_{i}$ denotes a random element belonging to some separable Hilbert space $H$ with inner product denoted by $⟨ \cdot, \cdot ⟩$ . The second functional predictor $ζ_{i}$ is assumed to be a curve defined on the interval $[a, b]$ , observed at the points $a \leq t_{1} < \dots < t_{p_{n}} \leq b$ .
$β_{0} = (β_{01}, \dots, β_{0 p_{n}})^{⊤}$ is a vector of unknown real coefficients, and $r (\cdot)$ denotes a smooth unknown link function. In addition, $θ_{0}$ is an unknown functional direction in $H$ .
$ε_{i}$ denotes the random error.

In the MFPLSIM, it is assumed that only a few scalar variables from the set ${ζ (t_{1}), \dots, ζ (t_{p_{n}})}$ are part of the model. Therefore, the relevant variables in the linear component (the impact points of the curve $ζ$ on the response) must be selected, and the model estimated.

In this function, the MFPLSIM is fitted using the IASSMR. The IASSMR is a two-step procedure. For this, we divide the sample into two independent subsamples, each asymptotically half the size of the original ( $n_{1} \sim n_{2} \sim n / 2$ ). One subsample is used in the first stage of the method, and the other in the second stage.The subsamples are defined as follows: $E^{1} = {(ζ_{i}, X_{i}, Y_{i}), i = 1, \dots, n_{1}},$ $E^{2} = {(ζ_{i}, X_{i}, Y_{i}), i = n_{1} + 1, \dots, n_{1} + n_{2} = n} .$

Note that these two subsamples are specified in the program through the arguments train.1 and train.2. The superscript $s$ , where $s = 1, 2$ , indicates the stage of the method in which the sample, function, variable, or parameter is involved.

To explain the algorithm, we assume that the number $p_{n}$ of linear covariates can be expressed as follows: $p_{n} = q_{n} w_{n}$ , with $q_{n}$ and $w_{n}$ being integers.

First step. The FASSMR (see FASSMR.kNN.fit) combined with kNN estimation is applied using only the subsample $E^{1}$ . Specifically:
- Consider a subset of the initial $p_{n}$ linear covariates, which contains only $w_{n}$ equally spaced discretized observations of $ζ$ covering the entire interval $[a, b]$ . This subset is the following: $R_{n}^{1} = {ζ (t_{k}^{1}), k = 1, \dots, w_{n}},$ where $t_{k}^{1} = t_{[(2 k - 1) q_{n} / 2]}$ and $[z]$ denotes the smallest integer not less than the real number $z$ .The size (cardinality) of this subset is provided to the program in the argument wn (which contains a sequence of eligible sizes).
- Consider the following reduced model, which involves only the $w_{n}$ linear covariates belonging to $R_{n}^{1}$ : $Y_{i} = \sum_{k = 1}^{w_{n}} β_{0 k}^{1} ζ_{i} (t_{k}^{1}) + r^{1} (⟨ θ_{0}^{1}, X_{i} ⟩) + ε_{i}^{1} .$ The penalised least-squares variable selection procedure, with kNN estimation, is applied to the reduced model. This is done using the function sfplsim.kNN.fit, which requires the remaining arguments (see sfplsim.kNN.fit). The estimates obtained after that are the outputs of the first step of the algorithm.
Second step. The variables selected in the first step, along with those in their neighborhood, are included. The penalised least-squares procedure, combined with kNN estimation, is carried out again considering only the subsample $E^{2}$ . Specifically:
- Consider a new set of variables: $R_{n}^{2} = ⋃_{{k, {\hat{β}}_{0 k}^{1} \neq 0}} {ζ (t_{(k - 1) q_{n} + 1}), \dots, ζ (t_{k q_{n}})} .$ Denoting by $r_{n} = ♯ (R_{n}^{2})$ , the variables in $R_{n}^{2}$ can be renamed as follows: $R_{n}^{2} = {ζ (t_{1}^{2}), \dots, ζ (t_{r_{n}}^{2})},$
- Consider the following model, which involves only the linear covariates belonging to $R_{n}^{2}$ $Y_{i} = \sum_{k = 1}^{r_{n}} β_{0 k}^{2} ζ_{i} (t_{k}^{2}) + r^{2} (⟨ θ_{0}^{2}, X_{i} ⟩) + ε_{i}^{2} .$ The penalised least-squares variable selection procedure, with kNN estimation, is applied to this model using the function sfplsim.kNN.fit.

The outputs of the second step are the estimates of the MFPLSIM. For further details on this algorithm, see Novo et al. (2021).

Remark: If the condition $p_{n} = w_{n} q_{n}$ is not met (then $p_{n} / w_{n}$ is not an integer number), the function considers variable $q_{n} = q_{n, k}$ values $k = 1, \dots, w_{n}$ . Specifically: $q_{n, k} = {\begin{cases} [p_{n} / w_{n}] + 1 & k \in {1, \dots, p_{n} - w_{n} [p_{n} / w_{n}]}, \\ [p_{n} / w_{n}] & k \in {p_{n} - w_{n} [p_{n} / w_{n}] + 1, \dots, w_{n}}, \end{cases}$ where $[z]$ denotes the integer part of the real number $z$ .

The function supports parallel computation. To avoid it, we can set n.core=1.

References

Novo, S., Vieu, P., and Aneiros, G., (2021) Fast and efficient algorithms for sparse semiparametric bi-functional regression. Australian and New Zealand Journal of Statistics, 63, 606--638, tools:::Rd_expr_doi("https://doi.org/10.1111/anzs.12355").

Examples

Run this code

# \donttest{
data(Sugar)

y<-Sugar$ash
x<-Sugar$wave.290
z<-Sugar$wave.240

#Outliers
index.y.25 <- y > 25
index.atip <- index.y.25
(1:268)[index.atip]

#Dataset to model
x.sug <- x[!index.atip,]
z.sug<- z[!index.atip,]
y.sug <- y[!index.atip]

train<-1:216

ptm=proc.time()
fit<- IASSMR.kNN.fit(x=x.sug[train,],z=z.sug[train,], y=y.sug[train],
        train.1=1:108,train.2=109:216,nknot.theta=2,lambda.min.h=0.07, 
        lambda.min.l=0.07, max.knn=20, nknot=20,criterion="BIC", max.iter=5000)
proc.time()-ptm

fit 
names(fit)
# }

Run the code above in your browser using DataLab

Last chance! 50% off unlimited learning