STPGA-package: Selection of Training Populations by Genetic Algorithm

Description

This package can be used to select a training population that is calibrated to a given test data set in high-dimensional prediction problems. Once a "good" training set has been identified, the response variable needs to be observed only for this set in order to build a model that predicts the response in the test set.

Details

Package: STPGA
Type: Package
Version: 2.0
Date: 2016-05-05
License: GPL-3

The package is useful for high-dimensional prediction problems where the per-individual cost of observing or analyzing the response variable is high, so that only a small number of training examples is sought, or where the candidate set from which the training set must be chosen is not representative of the test data set.

The function "GenAlgForSubsetSelection" uses a simple genetic algorithm to identify, from a larger set of candidates, a training set of a specified size that minimizes an optimization criterion evaluated for a known test set. The function "GenAlgForSubsetSelectionNoTest" identifies a training set of a specified size from a larger set of candidates that minimizes an optimization criterion when no test set is specified.

Let $P$ be the $n\times m$ matrix of explanatory variables (or their first few principal components), partitioned as $$P=\left[ \begin{array}{c} P_{Candidate}\\ \hline P_{Test} \end{array} \right]$$ where $P_{Candidate}$ is the matrix of explanatory variables for the individuals in the candidate set and $P_{Test}$ is the matrix of explanatory variables for the individuals in the test set. $P_{Train}$ analogously refers to the individuals in the training set, which are chosen from the candidate set.
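The idea of searching for a fixed-size training subset with a genetic algorithm can be illustrated with a minimal sketch. It is written in plain Python rather than R and does not call or reproduce the STPGA functions. All names and parameters (`select_training_set`, `n_pop`, `n_elite`, `mut_prob`, `n_iter`) are hypothetical, and the criterion used here (mean squared distance from each test row to its nearest training row) is only a stand-in assumption, not one of the package's optimization criteria.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two rows of P."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def criterion(P, train_idx, test_idx):
    """Stand-in criterion (an assumption, not an STPGA criterion):
    mean squared distance from each test row to its nearest training row."""
    total = sum(min(sq_dist(P[t], P[i]) for i in train_idx) for t in test_idx)
    return total / len(test_idx)


def select_training_set(P, candidates, test, n_select,
                        n_pop=30, n_elite=5, mut_prob=0.5, n_iter=50, seed=0):
    """Simple genetic algorithm over subsets of `candidates` of size `n_select`.
    Lower criterion values are better."""
    rng = random.Random(seed)

    def score(subset):
        return criterion(P, subset, test)

    # Initial population: random subsets of the candidate set.
    pop = [tuple(sorted(rng.sample(candidates, n_select))) for _ in range(n_pop)]

    for _ in range(n_iter):
        # Keep the best distinct subsets unchanged (elitism).
        elite = sorted(set(pop), key=score)[:n_elite]
        children = list(elite)
        while len(children) < n_pop:
            # Crossover: sample a child from the union of two elite parents.
            a, b = rng.choice(elite), rng.choice(elite)
            pool = sorted(set(a) | set(b))
            child = set(rng.sample(pool, n_select))
            # Mutation: swap one member for a candidate outside the child.
            if rng.random() < mut_prob:
                outside = [c for c in candidates if c not in child]
                if outside:
                    child.remove(rng.choice(sorted(child)))
                    child.add(rng.choice(outside))
            children.append(tuple(sorted(child)))
        pop = children

    return min(pop, key=score)


# Toy data: rows 0..19 of P are candidates, rows 20..24 form the test set.
data_rng = random.Random(1)
P = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(25)]
candidates = list(range(20))
test = list(range(20, 25))

selected = select_training_set(P, candidates, test, n_select=5)
print(selected, criterion(P, selected, test))
```

Because the elite subsets are carried over unchanged in every generation, the best criterion value in the population never gets worse from one iteration to the next. A genetic algorithm of this kind is a heuristic, so the returned subset is not guaranteed to be the global minimizer.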

References

Akdemir, Deniz. "Training population selection for (breeding value) prediction." arXiv preprint arXiv:1401.7953 (2014).