This function invokes a pre-trained neural network (DNN or LSTM) that can reliably determine the number of factors. The maximum number of factors that the networks can consider is 10. Both models are implemented in Python and trained on PyTorch (https://pytorch.org/), the DNN with CUDA 11.8 and the LSTM with CUDA 12.6 for acceleration. After training, the DNN and LSTM were saved as DNN.onnx and LSTM.onnx files. The NN function performs inference by loading the DNN.onnx or LSTM.onnx file in both Python and R environments. Therefore, please note that Python (suggested >= 3.11) and the libraries numpy and onnxruntime are required. @seealso check_python_libraries. See more in Details and Note.
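A minimal sketch of the environment check before calling NN; calling check_python_libraries() with no arguments is an assumption made here for illustration.

# Verify that Python, numpy, and onnxruntime are available (assumed call form)
check_python_libraries()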
NN(
response,
model = "DNN",
cor.type = "pearson",
use = "pairwise.complete.obs",
vis = TRUE,
plot = TRUE
)
An object of class NN is a list containing the following components:
The number of factors to be retained.
A matrix (1×54 or 1×20) containing all the features for determining the number of factors by the DNN or LSTM.
A matrix (1×10) containing the probabilities for factor numbers ranging from 1 to 10, where the number in the \(f\)-th column represents the probability that the number of factors for the response is \(f\).
A required N × I matrix or data.frame consisting of the responses of N individuals
to I items.
A character string indicating the model type. Possible values are "DNN" (default) or "LSTM".
A character string indicating which correlation coefficient (or covariance) is
to be computed. One of "pearson" (default), "kendall", or
"spearman". @seealso cor.
An optional character string giving a method for computing covariances in the presence of missing values. This
must be one of the strings "everything", "all.obs", "complete.obs", "na.or.complete",
or "pairwise.complete.obs" (default). @seealso cor.
A logical indicating whether to print the factor retention results: TRUE prints them and FALSE suppresses the output. (Default = TRUE)
A logical indicating whether to draw the NN plot: TRUE draws it and FALSE suppresses it. @seealso plot.NN. (Default = TRUE)
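An illustrative call, shown as a sketch: the placeholder data below are random, and the component names nfact and probability are assumptions about the returned list based on the Value section above.

set.seed(123)
# Placeholder response matrix: 500 individuals answering 20 items
response <- matrix(rnorm(500 * 20), nrow = 500, ncol = 20)
NN.obj <- NN(response, model = "DNN", cor.type = "pearson",
             use = "pairwise.complete.obs", vis = TRUE, plot = TRUE)
NN.obj$nfact        # suggested number of factors (assumed component name)
NN.obj$probability  # 1 x 10 matrix of class probabilities (assumed component name)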
Haijiang Qin <Haijiang133@outlook.com>
Due to the improved performance of DNNs on larger training sets (Chen et al., 2017),
a total of 10,000,000 datasets (data.datasets.DNN) were simulated
to extract features for training the deep neural network.
Each dataset was generated following the methods described by Auerswald & Moshagen (2019) and Goretzko & Bühner (2020),
with the following specifications (a simulation sketch follows the formulas below):
Factor number: F ~ U[1,10]
Sample size: N ~ U[100,1000]
Number of variables per factor: vpf ~ U[3,20]
Factor correlation: fc ~ U[0.0,0.4]
Primary loadings: pl ~ U[0.35,0.80]
Cross-loadings: cl ~ U[-0.2,0.2]
A population correlation matrix was created for each dataset based on the following decomposition: $$\mathbf{\Sigma} = \mathbf{\Lambda} \mathbf{\Phi} \mathbf{\Lambda}^T + \mathbf{\Delta}$$ where \(\mathbf{\Lambda}\) is the loading matrix, \(\mathbf{\Phi}\) is the factor correlation matrix, and \(\mathbf{\Delta}\) is a diagonal matrix with \(\mathbf{\Delta} = \mathbf{I} - \operatorname{diag}(\mathbf{\Lambda} \mathbf{\Phi} \mathbf{\Lambda}^T)\). The purpose of \(\mathbf{\Delta}\) is to ensure that the diagonal elements of \(\mathbf{\Sigma}\) are 1.
The response data for each subject were simulated using the following formula: $$X_i = L_i + \epsilon_i, \quad 1 \leq i \leq I$$ where \(L_i\) follows a normal distribution \(N(0, \sigma)\), representing the contribution of the latent factors, and \(\epsilon_i\) is a residual term following a standard normal distribution. \(L_i\) and \(\epsilon_i\) are uncorrelated, and \(\epsilon_i\) and \(\epsilon_j\) are also uncorrelated for \(i \neq j\).
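A minimal base-R sketch of this data-generating scheme under the specifications above. It assumes a simple-structure loading pattern with uniform cross-loadings and redraws the loadings whenever a communality reaches 1; the package's own simulation code may handle these details differently.

set.seed(42)
nfac <- sample(1:10, 1)                    # factor number: F ~ U[1,10]
N    <- sample(100:1000, 1)                # sample size:   N ~ U[100,1000]
vpf  <- sample(3:20, 1)                    # variables per factor: vpf ~ U[3,20]
I    <- nfac * vpf                         # total number of items
Phi  <- matrix(runif(1, 0.0, 0.4), nfac, nfac)  # factor correlation: fc ~ U[0.0,0.4]
diag(Phi) <- 1
repeat {                                   # redraw until every communality < 1
  Lambda <- matrix(runif(I * nfac, -0.2, 0.2), I, nfac)  # cl ~ U[-0.2,0.2]
  for (f in seq_len(nfac))                 # primary loadings: pl ~ U[0.35,0.80]
    Lambda[((f - 1) * vpf + 1):(f * vpf), f] <- runif(vpf, 0.35, 0.80)
  H <- Lambda %*% Phi %*% t(Lambda)        # Lambda Phi Lambda^T
  if (all(diag(H) < 1)) break
}
Sigma <- H + diag(1 - diag(H))             # Delta makes diag(Sigma) = 1
X <- matrix(rnorm(N * I), N, I) %*% chol(Sigma)  # N x I responses, X = L + e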
For each simulated dataset, a total of 6 types of features (which can be classified into 2 categories; @seealso extractor.feature.NN) are extracted and compiled into a feature vector of 54 features: 8 + 8 + 8 + 10 + 10 + 10. These features are as follows:
1. Clustering-Based Features
Hierarchical clustering is performed using the correlation coefficient as the dissimilarity measure. The top 9 tree node heights are calculated, and all heights are divided by the maximum height. The heights from the 2nd to the 9th node are used as features. @seealso EFAhclust
Hierarchical clustering is performed using the Euclidean distance as the dissimilarity measure. The top 9 tree node heights are calculated, and all heights are divided by the maximum height. The heights from the 2nd to the 9th node are used as features. @seealso EFAhclust
K-means clustering is applied with the number of clusters ranging from 1 to 9. The within-cluster sums of squares (WSS) for 2 to 9 clusters are divided by the WSS for a single cluster. @seealso EFAkmeans
These three feature sets are based on clustering algorithms, and the division serves to normalize the values. The raw clustering metrics often carry information unrelated to the number of factors, such as the number of items and the number of respondents, which normalization removes. Only the 2nd through 9th values are used because at most the top F-1 values are needed to determine F factors, and the first value is always 1 after division, so it carries no information. This helps simplify the model.
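A rough base-R sketch of the clustering-based features, continuing from the simulated X above and assuming at least 10 items; it is not the package's extractor.feature.NN itself, and the 1 - |r| dissimilarity and the nstart value are assumptions made for illustration.

R <- cor(X, use = "pairwise.complete.obs")     # X: N x I response matrix

# (a) hierarchical clustering, correlation-based dissimilarity (assumed 1 - |r|)
h1 <- sort(hclust(as.dist(1 - abs(R)))$height, decreasing = TRUE)[1:9]
feat.hc.cor <- (h1 / h1[1])[2:9]               # 8 normalized heights

# (b) hierarchical clustering, Euclidean distance between items (columns)
h2 <- sort(hclust(dist(t(X)))$height, decreasing = TRUE)[1:9]
feat.hc.euc <- (h2 / h2[1])[2:9]               # 8 normalized heights

# (c) k-means: WSS for k = 1..9 clusters of items, normalized by the k = 1 WSS
wss <- sapply(1:9, function(k)
  sum(kmeans(t(X), centers = k, nstart = 10)$withinss))
feat.km <- (wss / wss[1])[2:9]                 # 8 normalized WSS values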
2. Traditional Exploratory Factor Analysis Features (Eigenvalues)
The top 10 largest eigenvalues.
The ratio of the top 10 largest eigenvalues to the corresponding reference eigenvalues from Empirical Kaiser Criterion (EKC; Braeken & van Assen, 2017). @seealso EKC
The cumulative variance proportion of the top 10 largest eigenvalues.
Only the top 10 elements are used to simplify the model.
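A sketch of the eigenvalue-based features, again continuing from the simulated X above; the EKC reference values follow the formula commonly given for the Empirical Kaiser Criterion and may differ in detail from the package's EKC function.

lambda <- sort(eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values,
               decreasing = TRUE)
N <- nrow(X); I <- ncol(X)
feat.eigen <- lambda[1:10]                     # top 10 eigenvalues
feat.cum   <- cumsum(lambda)[1:10] / I         # cumulative variance proportions

# EKC reference eigenvalues (Braeken & van Assen, 2017)
ref.ekc <- numeric(10)
for (j in 1:10)
  ref.ekc[j] <- max((1 + sqrt(I / N))^2 *
                    (I - sum(lambda[seq_len(j - 1)])) / (I - j + 1), 1)
feat.ekc <- lambda[1:10] / ref.ekc             # ratios to the EKC references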
The DNN model is implemented in Python and trained on PyTorch (https://download.pytorch.org/whl/cu118) with
CUDA 11.8 for acceleration. After training, the DNN was saved as a DNN.onnx file. The NN function
performs inference by loading the DNN.onnx file in both Python and R environments.
Similarly, a total of 1,000,000 datasets (data.datasets.LSTM) were simulated to extract features for training the LSTM. Each dataset was generated with the following specifications:
Factor number: F ~ U[1,10]
Sample size: N ~ U[100,1000]
Number of variables per factor: vpf ~ U[3,10]
Factor correlation: fc ~ U[0.0,0.5]
Primary loadings: pl ~ U[0.35,0.80]
Cross-loadings: cl ~ U[-0.2,0.2]
For each simulated dataset, a total of 2 types of features are extracted (@seealso extractor.feature.NN) and compiled into a feature vector of 20 features: 10 + 10. These features are as follows:
The top 10 largest eigenvalues.
The differences between the top 10 largest eigenvalues and the corresponding reference eigenvalues from Parallel Analysis (PA). @seealso PA
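A sketch of the PA-derived features, reusing X and lambda from the sketches above; averaging the eigenvalues of 100 random datasets is one common PA variant, and the package's PA function may instead use a quantile or a different number of replications.

lambda <- sort(eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values,
               decreasing = TRUE)               # X: N x I response matrix
N <- nrow(X); I <- ncol(X)
ref.pa <- rowMeans(replicate(100, {             # mean eigenvalues of random data
  sort(eigen(cor(matrix(rnorm(N * I), N, I)), symmetric = TRUE,
             only.values = TRUE)$values, decreasing = TRUE)
}))
feat.lstm <- c(lambda[1:10], lambda[1:10] - ref.pa[1:10])  # 20 LSTM features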
The LSTM model is implemented in Python and trained on PyTorch (https://download.pytorch.org/whl/cu126) with
CUDA 12.6 for acceleration. After training, the LSTM was saved as an LSTM.onnx file. The NN function
performs inference by loading the LSTM.onnx file in both Python and R environments.
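A sketch of how such ONNX inference can look from R via reticulate; the model path, the use of the first graph input, and the random float32 feature matrix are assumptions for illustration, not the package's actual internals.

library(reticulate)
ort  <- import("onnxruntime")
np   <- import("numpy")
sess <- ort$InferenceSession("DNN.onnx")        # path to the saved model (assumed)
input.name <- sess$get_inputs()[[1]]$name       # name of the graph's first input
feats <- np$array(matrix(runif(54), 1, 54), dtype = "float32")  # 1 x 54 features
prob <- sess$run(NULL, setNames(list(feats), input.name))[[1]]
which.max(prob)                                 # predicted number of factors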
Auerswald, M., & Moshagen, M. (2019). How to determine the number of factors to retain in exploratory factor analysis: A comparison of extraction methods under realistic conditions. Psychological Methods, 24(4), 468-491. https://doi.org/10.1037/met0000200
Braeken, J., & van Assen, M. A. L. M. (2017). An empirical Kaiser criterion. Psychological Methods, 22(3), 450-466. https://doi.org/10.1037/met0000074
Goretzko, D., & Bühner, M. (2020). One model to rule them all? Using machine learning algorithms to determine the number of factors in exploratory factor analysis. Psychological Methods, 25(6), 776-786. https://doi.org/10.1037/met0000262