
EFAfactors (version 1.2.2)

CDF: the Comparison Data Forest (CDF) Approach

Description

The Comparison Data Forest (CDF; Goretzko & Ruscio, 2024) approach combines Random Forest with the comparison data (CD) approach.

Usage

CDF(
  response,
  num.trees = 500,
  mtry = "sqrt",
  nfact.max = 10,
  N.pop = 10000,
  N.Samples = 500,
  cor.type = "pearson",
  use = "pairwise.complete.obs",
  vis = TRUE,
  plot = TRUE
)

Value

An object of class CDF is a list containing the following components:

nfact

The number of factors to be retained.

RF

The trained Random Forest model.

probability

A 1 × nfact.max matrix containing the probabilities for factor numbers ranging from 1 to nfact.max, where the value in the f-th column is the probability that the number of factors underlying the response data is f.

features

A matrix (1 × 181) containing all the features used to determine the number of factors. See also extractor.feature.FF.
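As a minimal sketch of how the probability component can be read, the snippet below uses a made-up 1 × nfact.max matrix. The numbers are purely illustrative and are not output from CDF, and the sketch assumes (as is usual for a classification forest) that the suggested number of factors is the column with the highest probability.

```r
# Hypothetical 1 x nfact.max probability matrix (illustrative values only,
# NOT produced by CDF); column f holds the probability of f factors.
nfact.max <- 10
probability <- matrix(
  c(0.01, 0.02, 0.05, 0.10, 0.62, 0.12, 0.04, 0.02, 0.01, 0.01),
  nrow = 1,
  dimnames = list(NULL, paste0("nfact.", seq_len(nfact.max)))
)

# Assumption: the suggested number of factors is the most probable class.
nfact.hat <- unname(which.max(probability[1, ]))
nfact.hat
```

With a real result, the same idea would apply to the probability component of the returned object, which can then be compared with its nfact component.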

Arguments

response

A required N × I matrix or data.frame consisting of the responses of N individuals to I items.

num.trees

The number of trees in the Random Forest. (default = 500) See Details.

mtry

The number of features randomly sampled as split candidates at each node of a tree (the conventional meaning of mtry in Random Forest); can be a number or the character string "sqrt". When mtry = "sqrt", the value is the square root of the number of available features, converted to an integer by round. (default = "sqrt") See Details.

nfact.max

The maximum number of factors considered by the CD approach. (default = 10)

N.pop

Size of each simulated finite population. (default = 10,000)

N.Samples

Number of samples drawn from each population. (default = 500)

cor.type

A character string indicating which correlation coefficient (or covariance) is to be computed. One of "pearson" (default), "kendall", or "spearman". See also cor.

use

An optional character string giving a method for computing covariances in the presence of missing values. This must be one of the strings "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs" (default). See also cor.

vis

A Boolean variable that will print the factor retention results when set to TRUE, and will not print when set to FALSE. (default = TRUE)

plot

A Boolean variable that will draw the CDF plot when set to TRUE, and will not draw it when set to FALSE. (default = TRUE) See also plot.CDF.
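The cor.type and use arguments mirror the method and use arguments of base R's cor. The following sketch, using a small made-up response matrix (not package data), illustrates how the choice of use changes the correlations when some answers are missing; it only calls cor directly and makes no claim about how CDF applies these settings internally.

```r
# Toy response matrix: 6 respondents x 3 items, with two missing answers.
response <- matrix(
  c(1, 2, 3, 4, 5, 6,
    2, 1, 4, 3, 6, 5,
    1, 3, 2, NA, 5, 6),
  ncol = 3,
  dimnames = list(NULL, c("item1", "item2", "item3"))
)
response[2, 1] <- NA

# Pairwise deletion: each correlation uses every row complete for that pair.
R.pairwise <- cor(response, method = "pearson", use = "pairwise.complete.obs")

# Listwise deletion: rows with any NA are dropped before computing anything.
R.complete <- cor(response, method = "pearson", use = "complete.obs")

# Rank-based alternative, corresponding to cor.type = "spearman".
R.spearman <- cor(response, method = "spearman", use = "pairwise.complete.obs")

round(R.pairwise, 3)
round(R.complete, 3)
```

Pairwise deletion keeps more data per correlation, but the resulting matrix is not guaranteed to be positive semi-definite, which is worth keeping in mind when many responses are missing.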

Author

Haijiang Qin <Haijiang133@outlook.com>

Details

The Comparison Data Forest (CDF; Goretzko & Ruscio, 2024) approach combines Random Forest with the comparison data (CD) approach. It uses the method of Ruscio & Roche (2012) to simulate data with different numbers of factors, then extracts features from these data to train a Random Forest model. Once trained, the model is used to predict the number of factors in the empirical data. The algorithm consists of the following steps:

1. **Simulating Data:**

(1)

For each value of \(nfact\) in the range from 1 to \(nfact_{max}\), generate a population dataset using the GenData function.

(2)

Each population is based on \(nfact\) factors and consists of \(N_{pop}\) observations.

(3)

For each generated population, repeat the following \(N_{rep}\) times (the number of samples per population, N.Samples). For the \(j\)-th repetition: a. Draw a sample of size \(N_{sam}\) from the population, where \(N_{sam}\) matches the size of the empirical data; b. Compute a feature set \(\mathbf{fea}_{nfact,j}\) from that sample.

(4)

Combine all the generated feature sets \(\mathbf{fea}_{nfact,j}\) into a data frame as \(\mathbf{data}_{train, nfact}\).

(5)

Combine all \(\mathbf{data}_{train, nfact}\) into a final data frame as the training dataset \(\mathbf{data}_{train}\).

2. **Training RF:**

Train a Random Forest model \(RF\) using the combined \(\mathbf{data}_{train}\).

3. **Predicting for the Empirical Data:**

(1)

Calculate the feature set \(\mathbf{fea}_{emp}\) for the empirical data.

(2)

Use the trained Random Forest model \(RF\) to predict the number of factors \(nfact_{emp}\) for the empirical data: $$nfact_{emp} = RF(\mathbf{fea}_{emp})$$
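The data-simulation part of the steps above (1.(1) to 1.(5)) can be sketched in base R as follows. This is only an illustration under loud assumptions: simulate_population is a hypothetical stand-in for GenData (it draws from a simple normal factor model rather than reproducing the empirical item distributions), extract_features is a hypothetical toy extractor that returns eigenvalues instead of the 181 features used by the method, and the sizes are deliberately small so that the sketch runs quickly.

```r
set.seed(1)

# Hypothetical stand-in for GenData: a simple-structure factor model with
# normally distributed scores (this sketch does not reproduce GenData).
simulate_population <- function(nfact, n.items, N.pop, loading = 0.7) {
  # Assign items to factors in round-robin fashion.
  Lambda <- matrix(0, n.items, nfact)
  Lambda[cbind(seq_len(n.items), (seq_len(n.items) - 1) %% nfact + 1)] <- loading
  scores <- matrix(rnorm(N.pop * nfact), N.pop, nfact)
  noise  <- matrix(rnorm(N.pop * n.items), N.pop, n.items)
  scores %*% t(Lambda) + noise * sqrt(1 - loading^2)
}

# Hypothetical toy feature set: eigenvalues of the sample correlation matrix.
extract_features <- function(x) {
  eigen(cor(x), symmetric = TRUE, only.values = TRUE)$values
}

nfact.max <- 3; N.pop <- 2000; N.Samples <- 20  # small values to keep it fast
n.items <- 9; N.emp <- 200                      # size of the "empirical" data

data.train <- do.call(rbind, lapply(seq_len(nfact.max), function(nfact) {
  population <- simulate_population(nfact, n.items, N.pop)   # steps (1)-(2)
  features <- t(replicate(N.Samples, {                       # step (3)
    rows <- sample.int(N.pop, N.emp)
    extract_features(population[rows, , drop = FALSE])
  }))
  # Step (4): one labelled block of feature rows per candidate nfact.
  data.frame(nfact = factor(nfact, levels = seq_len(nfact.max)), features)
}))                                                          # step (5)

dim(data.train)
```

In steps 2 and 3, a data frame of this shape would be handed to a Random Forest classifier with nfact as the outcome, and the same feature extractor would be applied to the empirical data before prediction.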

According to Goretzko & Ruscio (2024) and Breiman (2001), the recommended number of trees in the Random Forest, num.trees, is 500. The Random Forest in CDF performs a classification task, so the recommended value of mtry, the number of features considered at each split, is \(\sqrt{q}\) (where \(q\) is the number of features), which gives \(m_{try}=\sqrt{181}\approx 13.45\), rounded to 13.
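The arithmetic behind the default can be checked directly. The helper below, resolve_mtry, is hypothetical (it is not part of EFAfactors) and simply mirrors the documented rule that "sqrt" means the square root of the number of features, converted to an integer by round.

```r
# Hypothetical helper (not part of EFAfactors): turn an mtry specification
# into an integer mtry value, following the documented "sqrt" rule.
resolve_mtry <- function(mtry, n.features) {
  if (identical(mtry, "sqrt")) round(sqrt(n.features)) else as.integer(mtry)
}

resolve_mtry("sqrt", 181)  # sqrt(181) is about 13.45, which rounds to 13
resolve_mtry(20, 181)      # a numeric value is used as given
```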

Because the CDF approach requires extensive data simulation and computation, and is therefore much more time-consuming than the CD approach, C++ code is used to speed up the process.

References

Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32. https://doi.org/10.1023/A:1010933404324

Goretzko, D., & Ruscio, J. (2024). The comparison data forest: A new comparison data approach to determine the number of factors in exploratory factor analysis. Behavior Research Methods, 56(3), 1838-1851. https://doi.org/10.3758/s13428-023-02122-4

Ruscio, J., & Roche, B. (2012). Determining the number of factors to retain in an exploratory factor analysis using comparison data of known factorial structure. Psychological Assessment, 24, 282-292. https://doi.org/10.1037/a0025697

See Also

GenData

Examples

library(EFAfactors)
set.seed(123)

##Take the data.bfi dataset as an example.
data(data.bfi)

response <- as.matrix(data.bfi[, 1:25]) ## Load the 25 item responses
response <- na.omit(response) ## Remove respondents with NA/missing values

## Reverse-scored items (scored 1 to 6) are recoded as 7 - x
response[, c(1, 9, 10, 11, 12, 22, 25)] <- 6 - response[, c(1, 9, 10, 11, 12, 22, 25)] + 1

## Run the CDF function with default parameters.
CDF.obj <- CDF(response)

print(CDF.obj)

## CDF plot
plot(CDF.obj)

## Get the nfact results.
nfact <- CDF.obj$nfact
print(nfact)

## Limit the maximum number of factors to 8 and set the population size to 5000.
CDF.obj <- CDF(response, nfact.max = 8, N.pop = 5000)

print(CDF.obj)

## CDF plot
plot(CDF.obj)

## Get the nfact results.
nfact <- CDF.obj$nfact
print(nfact)


