GAabbreviate: Abbreviating questionnaires (or other measures) using Genetic Algorithms (GAs)

Description

GAabbreviate uses Genetic Algorithms (GAs) as an optimization tool to create abbreviated forms of lengthy questionnaires (or other measures) that capture as much as possible of the variance in the original long-form data.

Usage

GAabbreviate(items = NULL, 
             scales = NULL, 
             itemCost = 0.05, 
             maxItems = 5, 
             maxiter = 100, 
             popSize = 50, 
             ..., 
             plot = FALSE, 
             verbose = TRUE, 
             crossVal = TRUE, 
             impute = FALSE, 
             pairwise = FALSE, 
             minR = 0, 
             sWeights = NULL, 
             nSample = NULL)

Arguments

items

A matrix of subjects x item scores.

scales

A matrix of subjects x scale scores.

itemCost

The fitness cost of each item. This will usually need to be determined by trial and error.

maxItems

The maximum number of items used to score each scale.

maxiter

Number of generations of GA to run.

popSize

Size of population in each generation of the GA.

...

Further arguments passed to ga (from the GA package) to tune the genetic algorithm.

plot

Logical; if TRUE, plot results after every generation (this will slow down the process).

verbose

Logical; if TRUE, display progress information during the search.

crossVal

Logical; if TRUE, cross-validation will be performed. Note that if you turn this off, the predictive fit of the resulting measure will be biased by overfitting.

impute

Logical; if TRUE, the mean value will be imputed for any missing item or scale scores. This is NOT recommended. Instead, you should decide how to handle missing values yourself before passing the items and scales variables.

pairwise

Logical; if TRUE, the GA will use pairwise deletion to select items, i.e. some scales/items may have different NAs than others. If FALSE, the GA will crash if any NAs are passed. It's recommended to leave this off, as NAs should r

minR

The minimum bivariate item-scale correlation required in order to retain an item. Note that if this is set above 0, the number of items retained can be lower than the value of maxItems.

sWeights

Weighting of scales. By default, all scales will have unit weighting, but if you want to emphasize some scales more heavily, pass a vector with length equal to the number of scales.

nSample

For extremely large datasets, you may wish to use only a subset of observations to generate a measure. Passing any non-zero number will randomly select nSample observations to use instead of drawing on the full dataset.
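The advice under impute and pairwise, to deal with missing values before calling the function, could look something like the sketch below. It uses listwise deletion with base R's complete.cases, which is only one of several reasonable strategies; the toy data and the names items_cc and scales_cc are made up for illustration.

```r
# Hypothetical toy data: 6 subjects x 4 items, with two missing responses
set.seed(1)
items <- matrix(sample(1:5, 24, replace = TRUE), nrow = 6, ncol = 4)
items[2, 3] <- NA
items[5, 1] <- NA

# Scale scores inherit the NAs of the items they are built from
scales <- cbind(rowSums(items[, 1:2]), rowSums(items[, 3:4]))

# Listwise deletion: keep only subjects with complete item AND scale scores
keep <- complete.cases(items, scales)
items_cc <- items[keep, , drop = FALSE]
scales_cc <- scales[keep, , drop = FALSE]

nrow(items_cc)  # number of subjects left after dropping incomplete rows
```

The cleaned matrices can then be passed as the items and scales arguments with impute and pairwise left at FALSE.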

Value

An object of class 'GAabbreviate' providing the following information:

data

The input data.

settings

The input settings.

results

The results obtained.

best

The cost and fit of the final solution.

GA

An object of class 'GA'.

measure

A list of measure values.

Details

GAabbreviate uses Genetic Algorithms (GAs) as an optimization tool for shortening a large set of variables (e.g., in a lengthy battery of questionnaires) into a shorter subset that maximally captures the variance in the original data. An exhaustive search of all possible shorter forms of the original measure would be time consuming, especially for a measure with a large number of items. For a long form of length $L$, the size of the search space is $2^L$, which forms a hypercube of $L$ dimensions; for $L = 100$ items of a self-report scale, that is approximately 1.27e+30 candidate forms. The GA samples the corners of this $L$-dimensional hypercube. It optimizes the search by mimicking the mechanisms of Darwinian evolution (selection, crossover, and mutation) while searching through a "landscape" of all possible fitness values to find an optimal value. This does not imply that the GA finds the "best" possible solution. Rather, the GA is highly efficient at quickly yielding a "good" and "robust" solution rated against a user-defined fitness criterion.
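To make the size of the search space concrete, here is a small base R sketch. It assumes each candidate short form is encoded as a 0/1 vector of length $L$ (one corner of the hypercube); the variable names and the 20% retention probability are illustrative choices, not taken from the package.

```r
L <- 100        # number of items in the long form (the example used above)
n_forms <- 2^L  # each item is either kept (1) or dropped (0)
format(n_forms, digits = 3)  # about 1.27e+30

# One candidate short form = one corner of the L-dimensional hypercube
set.seed(42)
candidate <- rbinom(L, size = 1, prob = 0.2)  # assumed retention probability
retained <- which(candidate == 1)             # indices of the retained items
length(retained)                              # length of this candidate form
```

Even evaluating a billion candidates per second, enumerating all $2^{100}$ forms would be infeasible, which is why a heuristic search such as a GA is used instead.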

GAabbreviate uses the GA package (Scrucca, 2012) to efficiently implement Yarkoni's (2010) scale abbreviation cost function:

$$Cost = Ik + \sum_{i=1}^s w_i(1-R_i^2)$$

where $I$ represents a user-specified fixed item cost, $k$ represents the number of items retained by the GA (in any given iteration), $s$ is the number of subscales in the measure, $w_i$ are the weights associated with each subscale (by default $w_i = 1$ for every $i$), and $R_i^2$ is the amount of variance in the $i$th subscale that can be explained by a linear combination of individual item scores. Setting $I$ low yields longer measures; setting it high yields shorter ones. When the cost of each additional item outweighs the cost of the corresponding loss in explained variance, the GA yields a relatively brief measure. When the item cost is low, the GA yields a relatively longer measure that maximizes explained variance (Yarkoni, 2010).
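As a rough illustration of how this cost trades off length against explained variance, the sketch below evaluates it for a given subset of items using ordinary least squares. The helper abbrev_cost is hypothetical and is not part of GAabbreviate; it ignores maxItems, minR, and cross-validation, and the package's internal scoring may differ.

```r
# Hypothetical helper (NOT the package's implementation): cost of one item subset
abbrev_cost <- function(items, scales, selected, itemCost = 0.05,
                        sWeights = rep(1, ncol(scales))) {
  k <- length(selected)                  # number of retained items
  X <- items[, selected, drop = FALSE]   # scores on the retained items
  # R^2 of each scale regressed on a linear combination of the retained items
  r2 <- apply(scales, 2, function(y) {
    fit <- lm(y ~ X)
    1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
  })
  itemCost * k + sum(sWeights * (1 - r2))
}

# Same kind of simulated data as in the Examples section
set.seed(123)
items <- matrix(sample(1:5, 100 * 15, replace = TRUE), nrow = 100, ncol = 15)
scales <- cbind(rowSums(items[, 1:10]), rowSums(items[, 11:15]))

# Keeping all 15 items explains both scales perfectly, so only the item cost remains
cost_all <- abbrev_cost(items, scales, selected = 1:15, itemCost = 0.01)
# A 4-item form pays less item cost but loses explained variance
cost_few <- abbrev_cost(items, scales, selected = c(1, 2, 11, 12), itemCost = 0.01)
c(all_items = cost_all, four_items = cost_few)
```

Raising itemCost in this sketch makes the full-length form progressively more expensive relative to short forms, which mirrors the effect of $I$ described above.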

Sahdra, Ciarrochi, Parker & Scrucca (2015) contains an example of how GAabbreviate can be used for item-reduction of a multidimensional scale.

References

Sahdra, B. K., Ciarrochi, J., Parker, P., & Scrucca, L. (2015). The 30-Item Multidimensional Experiential Avoidance Questionnaire: Mapping Experiential Avoidance in America. Manuscript under review.

Scrucca, L. (2012). GA: a package for genetic algorithms in R. Journal of Statistical Software, 53, 1-37.

Yarkoni, T. (2010). The abbreviation of personality, or how to measure 200 personality scales with 200 items. Journal of Research in Personality, 44(2), 180-198.

Examples

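### Example using randomly generated data
library(GAabbreviate)

nsubject <- 100
nitems <- 15
set.seed(123)
# Simulate 5-point item responses for 100 subjects on 15 items
items <- matrix(sample(1:5, nsubject * nitems, replace = TRUE),
                nrow = nsubject, ncol = nitems)
# Two scales: the sum of items 1-10 and the sum of items 11-15
scales <- cbind(rowSums(items[, 1:10]), rowSums(items[, 11:15]))

# 'run' is passed through ... to ga: stop after 100 consecutive
# generations without improvement in the best fitness value
GAA <- GAabbreviate(items, scales, itemCost = 0.01, maxItems = 5,
                    popSize = 50, maxiter = 300, run = 100,
                    verbose = TRUE)
plot(GAA)
summary(GAA)
GAA$best     # cost and fit of the final solution
GAA$measure  # list of measure values for the abbreviated form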