K-fold cross-validation procedure to choose the number of clusters and the tuning parameters for the sparse and smooth functional clustering (SaS-Funclust) method (Centofanti et al., 2021).
sasfclust_cv(
  X = NULL,
  timeindex = NULL,
  curve = NULL,
  grid = NULL,
  q = 30,
  lambda_l_seq = 10^seq(-1, 2),
  lambda_s_seq = 10^seq(-5, -3),
  G_seq = 2,
  tol = 10^-7,
  maxit = 50,
  par_LQA = list(eps_diff = 1e-06, MAX_iter_LQA = 200, eps_LQA = 1e-05),
  plot = FALSE,
  trace = FALSE,
  init = "kmeans",
  varcon = "diagonal",
  lambda_s_ini = NULL,
  K_fold = 5,
  X_test = NULL,
  grid_test = NULL,
  m1 = 1,
  m2 = 0,
  m3 = 1,
  ncores = 1
)
X: For functional data observed over a regular grid: a matrix where the rows correspond to argument values and the columns to replications. For functional data observed over an irregular grid: a vector of length \(\sum_{i=1}^{N}n_i\), where \(N\) is the number of curves and the entries from \(\sum_{i=1}^{k-1}n_i+1\) to \(\sum_{i=1}^{k}n_i\) are the observations for curve \(k\).
timeindex: A vector of length \(\sum_{i=1}^{N}n_i\). The entries from \(\sum_{i=1}^{k-1}n_i+1\) to \(\sum_{i=1}^{k}n_i\) give the locations on grid of the observations of curve \(k\).
For example, if the \(k\)th curve is observed at time points \(t_l,t_m\) of the grid, then the entries from \(\sum_{i=1}^{k-1}n_i+1\) to \(\sum_{i=1}^{k}n_i\) are \(l,m\), with \(n_k=2\).
If X is a matrix, timeindex is ignored.
curve: A vector of length \(\sum_{i=1}^{N}n_i\). The entries from \(\sum_{i=1}^{k-1}n_i+1\) to \(\sum_{i=1}^{k}n_i\) are equal to \(k\). If X is a matrix, curve is ignored.
grid: The vector of time points where the curves are sampled.
For functional data observed over an irregular grid, timeindex and grid together provide the time points for each curve.
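As an illustration of the irregular-grid encoding, the following base R sketch stacks three toy curves into long X, timeindex, and curve vectors. The data and the helper names obs_index and obs_value are made up for the example and are not part of the package.

```r
# Common grid shared by all curves (toy example).
grid <- seq(0, 1, length.out = 5)

# Positions on `grid` at which each of the three curves is observed (hypothetical).
obs_index <- list(c(1, 3, 5), c(2, 4), c(1, 2, 3, 4))

# Hypothetical noisy observations, one value per observed position.
set.seed(1)
obs_value <- lapply(obs_index, function(idx) {
  sin(2 * pi * grid[idx]) + rnorm(length(idx), sd = 0.1)
})

# Stack the curves into the long vectors described above.
n_i       <- lengths(obs_index)                  # number of observations per curve
X         <- unlist(obs_value)                   # observations, curve after curve
timeindex <- unlist(obs_index)                   # positions on `grid`
curve     <- rep(seq_along(obs_index), n_i)      # curve label of each observation

# Entries belonging to curve k = 2.
k     <- 2
first <- sum(n_i[seq_len(k - 1)]) + 1
last  <- sum(n_i[seq_len(k)])
timeindex[first:last]        # grid positions of curve 2
grid[timeindex[first:last]]  # corresponding time points
```

All three vectors have length sum(n_i), and grid[timeindex] recovers the actual time point of every observation.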
q: The dimension of the set of B-spline functions.
lambda_l_seq: Sequence of tuning parameters for the functional adaptive pairwise fusion penalty (FAPFP).
lambda_s_seq: Sequence of tuning parameters for the smoothness penalty.
G_seq: Sequence of candidate numbers of clusters.
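To get a feel for the size of the search, the candidate values can be crossed in base R. This is only a sketch using the default sequences from the usage above; the data frame comb is a hypothetical name, and the layout of the comb_list element returned by the function may differ.

```r
# Default candidate sequences from the function signature.
lambda_l_seq <- 10^seq(-1, 2)    # 0.1, 1, 10, 100
lambda_s_seq <- 10^seq(-5, -3)   # 1e-05, 1e-04, 1e-03
G_seq        <- 2

# Every combination of G, lambda_s and lambda_l (hypothetical layout).
comb <- expand.grid(G = G_seq, lambda_s = lambda_s_seq, lambda_l = lambda_l_seq)
nrow(comb)  # number of candidate combinations
```

Longer sequences multiply the number of combinations, and each one is evaluated by cross-validation, so short sequences are a sensible starting point.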
tol: The tolerance for the stopping condition of the expectation conditional maximization (ECM) algorithm.
The algorithm stops when the log-likelihood difference between two consecutive iterations is less than or equal to tol.
maxit: The maximum number of iterations allowed in the ECM algorithm.
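The interplay between tol and maxit can be sketched with a generic iteration loop. This is not the package's ECM code: run_until_converged and toy_step are hypothetical names, and the log-likelihood sequence is a toy that approaches -100.

```r
# Generic stopping rule: stop when the log-likelihood changes by at most `tol`,
# or when `maxit` iterations have been performed (illustrative only).
run_until_converged <- function(loglik_step, tol = 1e-7, maxit = 50) {
  loglik_old <- -Inf
  for (it in seq_len(maxit)) {
    loglik_new <- loglik_step(it)
    if (abs(loglik_new - loglik_old) <= tol) {
      return(list(iterations = it, loglik = loglik_new, converged = TRUE))
    }
    loglik_old <- loglik_new
  }
  list(iterations = maxit, loglik = loglik_old, converged = FALSE)
}

# Toy log-likelihood that increases towards -100, halving the gap at each step.
toy_step <- function(it) -100 - 0.5^it

res_default <- run_until_converged(toy_step)             # stops on the tolerance
res_short   <- run_until_converged(toy_step, maxit = 5)  # stops on the iteration cap
```

With a small maxit the loop ends before the tolerance is met, which is the situation to watch for when maxit is lowered to speed up cross-validation.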
par_LQA: A list of parameters for the local quadratic approximation (LQA) in the ECM algorithm.
eps_diff is the lower bound for the coefficient mean differences; values below eps_diff are set to zero.
MAX_iter_LQA is the maximum number of iterations allowed in the LQA.
eps_LQA is the tolerance for the stopping condition of the LQA.
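A small sketch of how such a list can be built and what the eps_diff threshold means. The thresholding line is an assumption about the behaviour described above (applied here to absolute values of toy differences), not the package's internal code.

```r
# A par_LQA list with the documented element names and the default values.
par_LQA <- list(eps_diff = 1e-06, MAX_iter_LQA = 200, eps_LQA = 1e-05)

# Toy coefficient mean differences (hypothetical values).
mean_diff <- c(0.8, -3e-07, 2e-04, 5e-08)

# Assumed thresholding: differences smaller than eps_diff in absolute value
# are treated as exactly zero.
mean_diff[abs(mean_diff) < par_LQA$eps_diff] <- 0
mean_diff
```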
plot: If TRUE, the estimated cluster means are plotted at each iteration of the ECM algorithm. Default is FALSE.
trace: If TRUE, information is shown at each iteration of the ECM algorithm. Default is FALSE.
init: The method used to initialize the ECM algorithm. Three methods are available: "kmeans", "model-based", and "hierarchical", which initialize through the k-means algorithm, model-based clustering based on a parameterized finite Gaussian mixture model, and hierarchical clustering, respectively. Default is "kmeans".
varcon: A vector of character strings indicating the type of coefficient covariance matrix. Three values are allowed: "full", "diagonal", and "equal". "full" means unrestricted cluster coefficient covariance matrices that may differ among clusters. "diagonal" means diagonal cluster coefficient covariance matrices that are equal among clusters. "equal" means diagonal cluster coefficient covariance matrices, with equal diagonal entries, that are equal among clusters. Default is "diagonal".
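The three covariance structures can be pictured with small toy matrices. The sketch below uses q = 3 and made-up values; it only illustrates the shapes described above and does not reproduce anything the package estimates.

```r
q <- 3  # toy number of basis coefficients

# "full": an unrestricted symmetric positive semi-definite matrix
# (one such matrix per cluster, possibly different across clusters).
set.seed(1)
A <- matrix(rnorm(q * q), q, q)
Sigma_full <- crossprod(A)

# "diagonal": a diagonal matrix with free diagonal entries, shared by all clusters.
Sigma_diagonal <- diag(c(1.5, 0.7, 0.2))

# "equal": a diagonal matrix with a single common variance, shared by all clusters.
Sigma_equal <- 0.9 * diag(q)
```

Moving from "full" to "equal" reduces the number of free covariance parameters, which trades flexibility for stability.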
lambda_s_ini: The tuning parameter used to obtain the functional data through smoothing B-splines before applying the initialization algorithm. If NULL, a generalized cross-validation procedure is used, as described in Ramsay and Silverman (2005). Default is NULL.
K_fold: Number of folds. Default is 5.
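For intuition, a K-fold partition of the curves can be sketched in base R as follows. The way the package actually assigns curves to folds is not documented here, so this random balanced split (with hypothetical names fold_id and folds) is an assumption for illustration only.

```r
set.seed(123)
N      <- 20   # number of curves (toy value)
K_fold <- 5

# Balanced random assignment of each curve to one of the K folds.
fold_id <- sample(rep(seq_len(K_fold), length.out = N))
folds   <- split(seq_len(N), fold_id)

# In round k, the curves in folds[[k]] are held out and the rest are used for fitting.
held_out_1 <- folds[[1]]
training_1 <- setdiff(seq_len(N), held_out_1)
```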
X_test: Only for functional data observed over a regular grid: a matrix where the rows correspond to argument values and the columns to replications of the test set. Default is NULL.
grid_test: The vector of time points where the test set curves are sampled. Default is NULL.
m1: The m-standard deviation rule parameter used to choose G for each lambda_s and lambda_l.
m2: The m-standard deviation rule parameter used to choose lambda_s, given G, for each lambda_l.
m3: The m-standard deviation rule parameter used to choose lambda_l, given G and lambda_s.
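A generic m-standard deviation rule can be sketched as below. This is an illustration under stated assumptions, not the package's implementation: it assumes larger cross-validation values are better, that candidates are stored in a fixed order, and that the rule accepts any candidate within m standard deviations of the best one. The function m_sd_rule and its prefer argument are hypothetical.

```r
# Pick a candidate whose CV value is within m standard deviations of the best.
# prefer = "first" returns the earliest acceptable candidate, "last" the latest.
m_sd_rule <- function(cv, cv_sd, m, prefer = c("first", "last")) {
  prefer <- match.arg(prefer)
  best <- which.max(cv)
  acceptable <- which(cv >= cv[best] - m * cv_sd[best])
  if (prefer == "first") min(acceptable) else max(acceptable)
}

# Toy CV values and standard deviations for five ordered candidates.
cv    <- c(-120, -105, -101, -100, -101.5)
cv_sd <- c(4, 3, 2, 2, 3)

pick_first <- m_sd_rule(cv, cv_sd, m = 1, prefer = "first")
pick_last  <- m_sd_rule(cv, cv_sd, m = 1, prefer = "last")
pick_max   <- m_sd_rule(cv, cv_sd, m = 0)
```

In this sketch, m = 0 reduces to picking the candidate with the largest CV value, while larger m values admit more candidates and so allow a simpler or more heavily penalized model to be preferred.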
ncores: If ncores > 1, parallel computing is used with ncores cores. Default is 1.
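One possible way to choose a value for ncores is to query the machine with the parallel package that ships with R. Leaving one core free is a common convention, not a requirement of the function.

```r
# Number of cores reported by the system (may be NA on some platforms).
n_detected <- parallel::detectCores()

# Fall back to 1 if detection fails; otherwise leave one core free.
ncores <- if (is.na(n_detected)) 1L else max(1L, n_detected - 1L)
ncores
```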
A list containing the following elements:
G_opt: The optimal number of clusters.
lambda_l_opt: The optimal tuning parameter of the FAPFP.
lambda_s_opt: The optimal tuning parameter of the smoothness penalty.
comb_list: The combinations of G, lambda_s and lambda_l explored.
CV: The cross-validation values obtained for each combination of G, lambda_s and lambda_l.
CV_sd: The standard deviations of the cross-validation values.
zeros: Fraction of domain over which the estimated cluster means are fused.
ms: The m-standard deviation rule parameters.
class: A label for the output type.
Centofanti, F., Lepore, A., & Palumbo, B. (2021). Sparse and Smooth Functional Data Clustering. arXiv preprint arXiv:2103.15224.
Ramsay, J., & Silverman, B. W. (2005). Functional Data Analysis. Springer Science & Business Media.
# NOT RUN {
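library(sasfunclust)
# Simulate training curves from Scenario I
train <- simulate_data("Scenario I", n_i = 20, var_e = 1, var_b = 0.5^2)
# Short candidate sequences keep the example fast
lambda_s_seq <- 10^seq(-4, -3)
lambda_l_seq <- 10^seq(-1, 0)
G_seq <- 2
# 2-fold cross-validation over the candidate combinations
mod_cv <- sasfclust_cv(X = train$X, grid = train$grid, G_seq = G_seq,
                       lambda_l_seq = lambda_l_seq, lambda_s_seq = lambda_s_seq,
                       maxit = 20, K_fold = 2, q = 10)
plot(mod_cv)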
# }